Abstract
Genomic evolution can be highly heterogeneous. Here, we introduce a new framework to simulate genome-wide sequence evolution under a variety of substitution models that may change along the genome and the phylogeny, following complex multispecies coalescent histories that can include recombination, demographics, longitudinal sampling, population subdivision/species history, and migration. A key aspect of our simulation strategy is that the heterogeneity of the whole evolutionary process can be parameterized according to statistical prior distributions specified by the user. We used this framework to carry out a study of the impact of variable codon frequencies across genomic regions on the estimation of the genome-wide nonsynonymous/synonymous ratio. We found that both variable codon frequencies across genes and rate variation among sites and regions can lead to severe underestimation of the global dN/dS values. The program SGWE (Simulation of Genome-Wide Evolution) is freely available from http://code.google.com/p/sgwe-project/, including extensive documentation and detailed examples.
Keywords: heterogeneous substitution models; molecular adaptation; molecular evolution; multispecies coalescent
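The abstract's idea of parameterizing heterogeneity through user-specified prior distributions can be illustrated with a minimal sketch. This is not SGWE's input format or code: the function name `sample_region_parameters`, the parameter names, and the particular priors (gamma for dN/dS, uniform for the transition/transversion ratio, exponential for the gamma shape) are all assumptions chosen only for illustration.

```python
import random

def sample_region_parameters(n_regions, seed=0):
    """Draw one hypothetical set of substitution parameters per genomic region.

    All priors below are illustrative assumptions, not SGWE defaults.
    """
    rng = random.Random(seed)  # seeded generator so draws are reproducible
    regions = []
    for i in range(n_regions):
        regions.append({
            "region": i,
            # dN/dS (omega): gamma prior with shape 2 and scale 0.5 (mean 1)
            "omega": rng.gammavariate(2.0, 0.5),
            # transition/transversion ratio (kappa): uniform prior on [1, 5]
            "kappa": rng.uniform(1.0, 5.0),
            # shape of among-site rate variation: exponential prior, rate 1
            "alpha": rng.expovariate(1.0),
        })
    return regions

if __name__ == "__main__":
    for params in sample_region_parameters(3):
        print(params)
```

Each region receives its own draw, so a simulated genome built from these values would differ in model parameters from region to region, which is the kind of heterogeneity the abstract describes.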
Year: 2014 PMID: 24557445 PMCID: PMC3995339 DOI: 10.1093/molbev/msu078
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Genome-Wide Simulation Software.
| Program | Class | Evolutionary Process | Substitution Process | Variable | Rate Variation | Indels | Homogeneous/ Heterogeneous Substitution Model across Regions | Reference |
|---|---|---|---|---|---|---|---|---|
| SIMCOAL2 and Fastsimcoal | Coalescent | D, M, R | N: JC, K2P | No | No | No | Homogeneous | |
| GenomePop | Forward | D, M, R, S | N: GTR; C: MG94 | No | No | No | Homogeneous | |
| EvolSimulator | Birth–death process | D, M, L, S | N: GTR; C: Nt | No | Gsites | No | Homogeneous | |
| GSIMULATOR package | Phylogenetic | — | N: GTR; C: EM; A: Secondary structure | No | No | Yes | Homogeneous | |
| ALF | Birth–death process and phylogenetic | M, L | N: GTR; C: GY94 (M0,M2,M3,M8) and EM; A: 5 EM | Yes | Gsites, | Yes | Homogeneous/heterogeneous | |
| SGWE | Coalescent and phylogenetic | D, M, R | N: GTR; C: GY94 (M0-M13), MG94, HB and EM; A: 16 EM | Yes | Gsites | Yes | Heterogeneous | This study |
Note.—The column “Class” includes phylogenetic (where a phylogeny is user-specified), forward, birth–death, and coalescent approaches. The column “Evolutionary process” describes the implemented evolutionary scenarios: D (demographics), M (population structure and migration), R (recombination), L (lateral gene transfer), and S (selection). The column “Substitution process” refers to N (nucleotide), C (codon), and A (amino acid) substitution/replacement models. EM means “empirical model,” and it is indicated whether the model is fixed along the genome (homogeneous) or can change among genomic regions (heterogeneous). The column “Rate variation” indicates whether different sites can evolve under different rates (G: gamma distribution; I: proportion of invariable sites) and whether this level of heterogeneity can change across site positions (Gsites) and/or genomic regions (Gregions). The column “Indels” indicates the consideration of insertion and deletion events.
a. Coding sequences are simulated with nucleotide substitution models, simply avoiding stop codons.
b. The rate of variation among sites can be user-specified.
c. Amino acid models implemented in ALF: JTT, GCB, LG, WAG, CustomP.
d. A maximum of three genomic regions based on different substitution models can be simulated.
e. Amino acid models implemented in SGWE: Blosum62, CpRev, Dayhoff, DayhoffDCMUT, HIVb, HIVw, JTT, JonesDCMUT, LG, Mtart, Mtmam, Mtrev24, RtRev, VT, WAG, user-specified. See references in the Supplementary Material online.
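The "Rate variation" column in the note above (G: gamma distribution; I: proportion of invariable sites) can be made concrete with a short sketch of per-site rate sampling under a +G+I scheme. This is a generic illustration, not code from SGWE or any program in the table; the function name `sample_site_rates` and its parameters are hypothetical, and the gamma is parameterized with mean 1 by assumption, as is common practice.

```python
import random

def sample_site_rates(n_sites, alpha, p_inv, seed=0):
    """Sample relative substitution rates for n_sites under +G+I.

    alpha : shape of the gamma distribution (smaller = more heterogeneity)
    p_inv : proportion of invariable sites (rate fixed at 0)
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < p_inv:
            rates.append(0.0)  # invariable site: never substitutes
        else:
            # shape alpha with scale 1/alpha gives a mean rate of 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    return rates

if __name__ == "__main__":
    rates = sample_site_rates(10, alpha=0.5, p_inv=0.2)
    print([round(r, 3) for r in rates])
```

Heterogeneity across regions (Gregions) would correspond to calling such a function with a different `alpha` for each genomic region, whereas Gsites corresponds to the variation among sites within a single call.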
Figure. Depiction of three genome alignments simulated with SGWE. Each genome alignment contains six regions, printed with white and gray background to describe noncoding and coding regions, respectively. “+I” indicates proportion of invariable sites, and “+Gsites” indicates heterogeneity across sites according to a gamma distribution. “ECMSchn2005” indicates the empirical codon model by Schneider et al. (2005). “+F” indicates empirical frequencies (e.g., user-specified) are considered. “CAT” indicates that frequencies change across sites within a region.
Figure. Influence of variable codon frequencies and variable ti/tv across regions on the estimation of the genome-wide dN/dS when the true dN/dS value is 0.5, 1.0, and 2.0. The horizontal dashed black line indicates the simulated dN/dS value. White bars indicate the estimated dN/dS from the entire genome, while the gray bars display the averaged dN/dS across regions. Error bars indicate 95% confidence intervals.
Figure. Influence of variable codon frequencies, variable transition rates, and gamma-distributed rate variation among sites and across regions on the estimation of the genome-wide dN/dS when the true dN/dS value is 1.0. The horizontal dashed black line indicates the true, simulated value. White bars indicate the estimated dN/dS from the entire genome, while the gray bars display the averaged dN/dS across regions. Error bars indicate 95% confidence intervals.