Franz Baumdicker, Gertjan Bisschop, Daniel Goldstein, Graham Gower, Aaron P Ragsdale, Georgia Tsambos, Sha Zhu, Bjarki Eldon, E Castedo Ellerman, Jared G Galloway, Ariella L Gladstein, Gregor Gorjanc, Bing Guo, Ben Jeffery, Warren W Kretzschmar, Konrad Lohse, Michael Matschiner, Dominic Nelson, Nathaniel S Pope, Consuelo D Quinto-Cortés, Murillo F Rodrigues, Kumar Saunack, Thibaut Sellinger, Kevin Thornton, Hugo van Kemenade, Anthony W Wohns, Yan Wong, Simon Gravel, Andrew D Kern, Jere Koskela, Peter L Ralph, Jerome Kelleher.
Abstract
Stochastic simulation is a key tool in population genetics, since the models involved are often analytically intractable and simulation is usually the only way of obtaining ground-truth data to evaluate inferences. Because of this, a large number of specialized simulation programs have been developed, each filling a particular niche, but with largely overlapping functionality and a substantial duplication of effort. Here, we introduce msprime version 1.0, which efficiently implements ancestry and mutation simulations based on the succinct tree sequence data structure and the tskit library. We summarize msprime's many features, and show that its performance is excellent, often many times faster and more memory efficient than specialized alternatives. These high-performance features have been thoroughly tested and validated, and built using a collaborative, open source development model, which reduces duplication of effort and promotes software quality via community engagement.
Keywords: Ancestral Recombination Graphs; coalescent; mutations; simulation
Year: 2022 PMID: 34897427 PMCID: PMC9176297 DOI: 10.1093/genetics/iyab229
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.402
Major features of msprime 1.0 added since version 0.3.0 (Kelleher )
| Interface | Separation of ancestry and mutation simulations. Ability to store arbitrary metadata along with simulation results, and automatic recording of provenance information for reproducibility. Jupyter Notebook ( |
| Ancestry | SMC, SMC’, Beta- and Dirac-coalescent, discrete time Wright–Fisher, and selective sweep models. Instantaneous bottlenecks. Discrete or continuous genomic coordinates, arbitrary ploidy, gene conversion. Output full ARG with recombination nodes, ARG likelihood calculations. Record full migration history and census events. Improved performance for large numbers of populations. Integration with forward simulators such as |
| Demography | Improved interface with integrated metadata and referencing populations by name. Import from Newick species tree, *BEAST ( |
| Mutations | JC69, HKY, F84, GTR, BLOSUM62, PAM, infinite alleles, SLiM and general matrix mutation models. Varying rates along the genome, recurrent/back mutations, discrete or continuous genomic coordinates, overlaying multiple layers of mutations, exact times associated with mutations. |
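
The "general matrix mutation models" entry in the table can be illustrated with a small standard-library sketch. It assumes a matrix model consists of a list of alleles, a root distribution, and a transition matrix whose rows sum to one; the matrix below is a Jukes-Cantor-style choice in which every change to a different nucleotide is equally likely. The names `ALLELES`, `ROOT_DISTRIBUTION`, `TRANSITION_MATRIX`, `draw_root_allele`, and `mutate` are hypothetical and are not part of msprime's API.

```python
import random

# Hypothetical matrix mutation model: alleles, root distribution, transitions.
ALLELES = ["A", "C", "G", "T"]
ROOT_DISTRIBUTION = [0.25, 0.25, 0.25, 0.25]

# Jukes-Cantor-style matrix: zero on the diagonal, 1/3 everywhere else,
# so a mutation always changes the allele and each target is equally likely.
TRANSITION_MATRIX = [
    [0.0 if i == j else 1 / 3 for j in range(len(ALLELES))]
    for i in range(len(ALLELES))
]


def draw_root_allele(rng):
    """Sample the allele at the root of a tree from the root distribution."""
    return rng.choices(ALLELES, weights=ROOT_DISTRIBUTION)[0]


def mutate(allele, rng):
    """Sample the derived allele using the row for the current allele."""
    row = TRANSITION_MATRIX[ALLELES.index(allele)]
    return rng.choices(ALLELES, weights=row)[0]


rng = random.Random(2022)
root = draw_root_allele(rng)
derived = mutate(root, rng)
```

Replacing `TRANSITION_MATRIX` with any other row-stochastic matrix gives a different model without changing the sampling code, which is the appeal of a matrix-based interface.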
Figure 1. Visualization of the separation between ancestry and mutation simulation. (A) The result of an invocation of sim_ancestry is two trees along a 1 kb chunk of genome relating three diploid samples. Each diploid individual consists of two genomes (or nodes), indicated by color. (B) This ancestry is provided as the input to sim_mutations, which adds mutations. Graphics produced using tskit’s draw_svg method.
Figure 2. An example tree sequence describing genealogies and sequence variation for four samples at 10 sites on a chromosome 20 bases long. Information is stored in a set of tables (the tables shown here include only essential columns, and much more information can be associated with the various entities). The node table stores information about sampled and ancestral genomes. The edge table describes how these genomes are related along a chromosome, and defines the genealogical tree at each position. The site and mutation tables together describe sequence variation among the samples. The genotype matrix and tree topologies shown on the left are derived from these tables.
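
To make the table layout concrete, here is a minimal standard-library sketch of how genotypes can be derived from node, edge, site, and mutation tables. The tiny dataset (three samples, two trees, two sites) is hypothetical and is not the example shown in Figure 2; the tuple names and the `parent_map` and `genotypes` helpers are likewise illustrative and are not tskit's API.

```python
from collections import namedtuple

Node = namedtuple("Node", "time is_sample")
Edge = namedtuple("Edge", "left right parent child")
Site = namedtuple("Site", "position ancestral_state")
Mutation = namedtuple("Mutation", "site node derived_state")

# Hypothetical tables: nodes 0-2 are samples, nodes 3 and 4 are ancestors.
nodes = [Node(0.0, True), Node(0.0, True), Node(0.0, True),
         Node(1.0, False), Node(2.0, False)]
# Each edge says "child inherits from parent over [left, right)".
edges = [
    Edge(0, 10, 3, 0), Edge(0, 20, 3, 1), Edge(10, 20, 3, 2),
    Edge(0, 10, 4, 2), Edge(10, 20, 4, 0), Edge(0, 20, 4, 3),
]
sites = [Site(4, "A"), Site(15, "G")]
mutations = [Mutation(0, 3, "T"), Mutation(1, 2, "C")]


def parent_map(position):
    """Return the child -> parent map of the tree covering this position."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def genotypes(site_id):
    """Allelic state of every sample at one site."""
    site = sites[site_id]
    parent = parent_map(site.position)
    state_on_node = {m.node: m.derived_state
                     for m in mutations if m.site == site_id}
    result = []
    for u, node in enumerate(nodes):
        if not node.is_sample:
            continue
        # Walk towards the root until a node carrying a mutation is found.
        v = u
        while v is not None and v not in state_on_node:
            v = parent.get(v)
        result.append(site.ancestral_state if v is None else state_on_node[v])
    return result


genotype_matrix = [genotypes(j) for j in range(len(sites))]
```

Only the mutation on node 3 and the mutation on node 2 are stored, yet the full genotype matrix can be recovered from the trees, which is why this encoding is compact.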
Figure 3. Time required to run sim_mutations on tree sequences generated by sim_ancestry (with a population size of 10^4 and recombination rate of ) for varying (haploid) sample size and sequence length. We ran 10 replicate mutation simulations each for three different mutation rates, and report the average CPU time required (Intel Core i7-9700). (A) Holding sequence length fixed at 10 megabases and varying the number of samples (tree tips) from 10 to 100,000. (B) Holding number of samples fixed at 1000, and varying the sequence length from 1 to 100 megabases.
Figure 4. Comparison of simulation performance using msprime (sim_ancestry), SimBac, and fastSimBac for varying (haploid) sample sizes, and the current estimates for E. coli parameters (Lapierre ): a 4.6 Mb genome, , gene conversion rate of per base and mean tract length of 542. We report (A) the total CPU time and (B) maximum memory usage averaged over five replicates (Intel Xeon E5-2680 CPU). We did not run SimBac beyond the first two data points because of the very long running times.
Figure 5. (A) A simple ARG in which a recombination occurs at position 0.3; (B) the equivalent topology depicted as a tree sequence, including the recombination node; (C) the same tree sequence topology “simplified” down to its minimal tree sequence representation. Note the original node IDs have been retained for clarity.
Figure 6. Comparison of selective sweep simulation performance in msprime (sim_ancestry) and discoal (Intel Xeon Gold 6148 CPU). We report the average CPU time and maximum memory usage when simulating three replicates for 100 diploid samples in a model with a single selective sweep in its history, where the beneficial allele had a selection coefficient of s = 0.05, a per-base recombination rate of , population size of , and sequence length varying from 100 kb to 3000 kb.
Figure 7. Comparison of DTWF simulation performance in msprime (sim_ancestry) and ARGON (Intel Xeon E5-2680 CPU). We ran simulations with a population size of 10^4 and recombination rate of , with 500 diploid samples and varying sequence length. We report (A) total CPU time and (B) maximum memory usage; each point is the average over five replicate simulations. We show observations for ARGON, msprime’s DTWF implementation (“DTWF”) and a hybrid simulation of 100 generations of the DTWF followed by the standard coalescent with recombination (“Hybrid”). We ran ARGON with a mutation rate of 0 and with minimum output options, with a goal of measuring only ancestry simulation time. Memory usage for msprime’s DTWF and hybrid simulations is very similar.