Nhan Ly-Trong, Suha Naser-Khdour, Robert Lanfear, Bui Quang Minh.
Abstract
Sequence simulators play an important role in phylogenetics. Simulated data has many applications, such as evaluating the performance of different methods, hypothesis testing with parametric bootstraps, and, more recently, generating data for training machine-learning applications. Many sequence simulation programmes exist, but the most feature-rich programmes tend to be rather slow, and the fastest programmes tend to be feature-poor. Here, we introduce AliSim, a new tool that can efficiently simulate biologically realistic alignments under a large range of complex evolutionary models. To achieve high performance across a wide range of simulation conditions, AliSim implements an adaptive approach that combines the commonly used rate matrix and probability matrix approaches. AliSim takes 1.4 h and 1.3 GB RAM to simulate alignments with one million sequences or sites, whereas popular software Seq-Gen, Dawg, and INDELible require 2-5 h and 50-500 GB of RAM. We provide AliSim as an extension of the IQ-TREE software version 2.2, freely available at www.iqtree.org, and a comprehensive user tutorial at http://www.iqtree.org/doc/AliSim.
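The two approaches named in the abstract can be illustrated on a single branch. The following is a minimal sketch, not AliSim's implementation: it assumes the Jukes-Cantor (JC69) model so that the transition probabilities have a closed form, and the function names (`evolve_probability_matrix`, `evolve_rate_matrix`) are hypothetical. The rule AliSim uses to switch between the two approaches is not described in this record and is not reproduced here.

```python
import math
import random

BASES = "ACGT"


def jc69_transition_probs(t):
    """Return (p_same, p_diff) under JC69 for branch length t,
    measured in expected substitutions per site."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


def evolve_probability_matrix(seq, t, rng):
    """Probability-matrix approach: visit every site once and draw its
    new state from the corresponding row of P(t)."""
    p_same, _ = jc69_transition_probs(t)
    out = []
    for base in seq:
        if rng.random() < p_same:
            out.append(base)
        else:
            # Under JC69 the three other bases are equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
    return "".join(out)


def evolve_rate_matrix(seq, t, rng):
    """Rate-matrix approach (Gillespie-style): draw exponential waiting
    times and apply one substitution event at a time along the branch."""
    sites = list(seq)
    if not sites:
        return ""
    # Under normalised JC69 each site leaves its state at rate 1,
    # so the total event rate equals the sequence length.
    total_rate = float(len(sites))
    elapsed = rng.expovariate(total_rate)
    while elapsed < t:
        i = rng.randrange(len(sites))
        sites[i] = rng.choice([b for b in BASES if b != sites[i]])
        elapsed += rng.expovariate(total_rate)
    return "".join(sites)
```

The sketch makes the trade-off visible: the probability-matrix version does work proportional to the number of sites regardless of branch length, whereas the rate-matrix version does work proportional to the number of substitution events, so it is cheaper when branches are short and few sites change.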
Keywords: molecular evolution; phylogenetics; sequence simulation
Year: 2022 PMID: 35511713 PMCID: PMC9113491 DOI: 10.1093/molbev/msac092
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
Fig. 1. Sequence simulation process with two scenarios: (A) simulating an MSA from a phylogenetic tree and a Markov substitution model, and (B) simulating an MSA that mimics the underlying evolutionary process of a user-provided MSA. In scenario B, the phylogenetic tree and the substitution model parameters are inferred internally from the user-provided MSA and then used to simulate a new MSA.
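Scenario A in the caption above can be sketched as a root-to-tip traversal: draw a root sequence, then evolve it along each branch and collect the tip sequences as the alignment. This is an illustrative sketch under stated assumptions, not AliSim's code: the model is fixed to JC69 without indels, the tree is a hypothetical nested-tuple structure `(name, branch_length, children)` rather than a parsed Newick file, and `simulate_msa` is a made-up name. Scenario B would additionally require inferring the tree and model parameters from the input MSA, which is not attempted here.

```python
import math
import random

BASES = "ACGT"


def evolve_branch(seq, t, rng):
    """Evolve a sequence along one branch of length t under JC69."""
    p_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    return "".join(
        b if rng.random() < p_same
        else rng.choice([x for x in BASES if x != b])
        for b in seq
    )


def simulate_msa(tree, n_sites, seed=None):
    """Simulate tip sequences for a tree given as nested tuples
    (name, branch_length, children); leaves have an empty child list."""
    rng = random.Random(seed)
    # JC69 has uniform stationary frequencies, so the root is uniform.
    root_seq = "".join(rng.choice(BASES) for _ in range(n_sites))
    alignment = {}

    def walk(node, parent_seq):
        name, branch_length, children = node
        seq = evolve_branch(parent_seq, branch_length, rng)
        if children:
            for child in children:
                walk(child, seq)
        else:
            alignment[name] = seq  # only tips enter the alignment

    walk(tree, root_seq)
    return alignment


# Hypothetical three-taxon tree, equivalent to (A:0.1,(B:0.2,C:0.2):0.05);
toy_tree = ("root", 0.0, [
    ("A", 0.1, []),
    ("anc", 0.05, [("B", 0.2, []), ("C", 0.2, [])]),
])
msa = simulate_msa(toy_tree, n_sites=60, seed=1)
```

Because no indels are simulated, every tip sequence has the same length and the dictionary can be written out directly as an alignment, for example in FASTA format.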
Feature comparison between AliSim v2.2.0 (March 8, 2022) and existing tools, Seq-Gen v1.3.4 (August 29, 2019), Dawg v2.0.1 (March 8, 2022), INDELible v1.03, and phastSim v0.0.4 (February 8, 2022).
| Features | Seq-Gen | Dawg | INDELible | phastSim | AliSim | |
|---|---|---|---|---|---|---|
| | | | | | | |
| DNA | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Amino acid | ✓ | ✓ | ✓ | ✓ | ||
| Codon | ✓ | ✓ | ✓ | ✓ | ||
| Binary and discrete morphological | ✓ | |||||
| RNA (base-pairing) | ✓ | |||||
| Non-reversible DNA and amino acid | ✓ | ✓ | ✓ | |||
| | | | | | | |
| Invariant sites (+I) | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Discrete Gamma distribution (+G) | ✓ | ✓ | ✓ | |||
| Continuous Gamma distribution (+GC) | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Distribution-free (+R) | ✓ | ✓ | ||||
| Codon-position-specific rates | ✓ | |||||
| Nonsynonymous/synonymous codon rate heterogeneity | ✓ | ✓ | ||||
| | | | | | | |
| Insertion–deletion | ✓ | ✓ | ✓ | ✓ | ||
| Indel-rate variation | ✓ | |||||
| Partition | Same model* | ✓ | ✓ | ✓ | ||
| Site mixture** | Codon only | ✓ | ||||
| Tree mixture for non-tree-like evolution | Same model and taxa | ✓ | ✓ | ✓ | ||
| Branch-specific substitutions*** | ✓ | ✓ | ✓ | |||
| Hypermutability | ✓ | |||||
| Heterotachy | ✓ | |||||
| Functional divergence | ✓ | |||||
| User-defined models | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Ascertainment bias correction | ✓ | |||||
| | | | | | | |
| Mimicking a user-provided MSA | ✓ | |||||
| Model parameters following empirical or user-defined distributions | ✓ | |||||
| Simulating random trees | ✓ | ✓ | ||||
| | | | | | | |
| Multifurcating trees | ✓ | ✓ | ✓ | ✓ | ||
| Branch length scaling | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Graphical user interface | ✓ | |||||
| Outputting ancestral sequences | ✓ | N/A | ✓ | ✓ | ✓ | |
| Output format | PHYLIP, NEXUS, FASTA | PHYLIP, FASTA, NEXUS, CLUSTAL, POO | PHYLIP, FASTA, NEXUS | PHYLIP, FASTA, NEWICK, MAT, Info | PHYLIP, FASTA | |
| Inserting output header | ✓ | ✓ | ||||
| Output compression | Gzip | |||||
| Programming language | C | C++ | C++ | Python | C++ | |
*, all partitions must share the same evolutionary model; **, a mixture model is a set of substitution models in which each site has a probability of belonging to each model; ***, users can assign different evolutionary models to individual branches of a tree.
Fig. 2. Runtimes and peak memory consumption of five software tools (AliSim, Seq-Gen, Dawg, INDELible, and phastSim) for deep-data simulations without indels (varying number of sequences and 30K sites; sub-panels A and B), long-data simulations without indels (varying number of sites and 30K sequences; sub-panels C and D), and simulations with indels (varying number of sequences, with the root sequence length set to 30K sites; sub-panels E and F).