Nicola De Maio, Ian Holmes, Christian Schlötterer, Carolin Kosiol.
Abstract
Empirical codon models (ECMs) estimated from a large number of globular protein families outperformed mechanistic codon models in their description of the general process of protein evolution. Among other factors, ECMs implicitly model the influence of amino acid properties and multiple nucleotide substitutions (MNSs). However, the estimation of ECMs requires large quantities of data, and until recently, only a few suitable data sets were available. Here, we take advantage of several new Drosophila species genomes to estimate codon models from genome-wide data. The availability of large numbers of genomes over varying phylogenetic depths in the Drosophila genus allows us to explore various divergence levels. Consequently, we can use these data to determine the appropriate level of divergence for the estimation of ECMs, avoiding overestimation of MNS rates caused by saturation. To account for variation in evolutionary rates along the genome, we develop new empirical codon hidden Markov models (ecHMMs). These models significantly outperform previous ones with respect to maximum likelihood values, suggesting that they provide a better fit to the evolutionary process. Using ECMs and ecHMMs derived from genome-wide data sets, we devise new likelihood ratio tests (LRTs) of positive selection. We found classical LRTs to be very sensitive to the presence of MNSs, showing high false-positive rates, especially with small phylogenies. The new LRTs are more conservative than the classical ones, having acceptable false-positive rates and reduced power.
Year: 2012 PMID: 23188590 PMCID: PMC3563974 DOI: 10.1093/molbev/mss266
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
Models Used for Tests of Positive Selection.
| Model | Parameters | Number of Free Parameters |
|---|---|---|
| M1a, ecM1a | | 2 |
| M2a, ecM2a | | 4 |
| M7, ecM7 | | 2 |
| M8, ecM8 | | 4 |
Parameters describing the distribution of selective pressure: $\omega_i$ refers to the selective pressure in class $i$ ($\omega = 1$ corresponding to neutrality), and $p_i$ is the proportion of sites belonging to class $i$.
Performances of Models with Different Levels of Complexity on Real Data.
| Model Name | Number of Parameters | BIC Score (a) | MNS Proportion (b) (%) |
|---|---|---|---|
| 2 | | | |
| Nonrev. ECM (c) | 3,720 | 0 (reference) | 2.8 |
| ECM | 1,890 | −4,941 | 2.7 |
| **Simpl. ECM** (d) | 323 | −22,517 | 2.3 |
| Combined | 162 | +67,683 | 2.4 |
| Nucl. GTR (e) | 69 | +282,903 | 4.0 |
| **Nonrev. ECM** | 3,720 | 0 (reference) | 3.8 |
| ECM | 1,890 | +53,505 | 3.7 |
| Simpl. ECM | 323 | +44,179 | 2.9 |
| 2 | | | |
| **Nonrev. ECM** | 3,720 | 0 (reference) | 3.6 |
| ECM | 1,890 | +56,596 | 3.5 |
| Simpl. ECM | 323 | +46,197 | 2.8 |
| **Nonrev. ECM** | 3,720 | 0 (reference) | 4.4 |
| ECM | 1,890 | +126,127 | 4.1 |
| Simpl. ECM | 323 | +120,582 | 2.7 |
| 2 | | | |
| **Nonrev. ECM** | 3,720 | 0 (reference) | 4.3 |
| ECM | 1,890 | +122,044 | 4.0 |
| Simpl. ECM | 323 | +115,454 | 2.7 |

Note: The best model for each data set according to BIC score is shown in bold.

(a) BIC score difference between the current model and the nonreversible ECM trained on the same data set (models with a smaller BIC score are considered preferable).

(b) Estimated proportion of MNSs.

(c) Nonrev. ECM: nonreversible empirical codon model.

(d) Simpl. ECM: simplified empirical codon model.

(e) Nucl. GTR: codon extension of the nucleotide general time reversible model.
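The BIC score differences reported in this table can be illustrated with a short sketch. It assumes the standard definition BIC = k ln(n) − 2 ln L, which the text here does not spell out; the function names, the choice of n (for example, the number of codon sites), and the log-likelihood values are hypothetical. Only the parameter counts (323 and 3,720) come from the table.

```python
import math


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Standard BIC: k * ln(n) - 2 * ln(L). Smaller values are preferable."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


def bic_difference(model_lnl: float, model_params: int,
                   ref_lnl: float, ref_params: int, n_obs: int) -> float:
    """BIC of a model minus BIC of a reference model fitted to the same data."""
    return bic(model_lnl, model_params, n_obs) - bic(ref_lnl, ref_params, n_obs)


# Hypothetical log-likelihoods and sample size (NOT values from the paper).
# The simpler model fits slightly worse but is penalized far less.
delta = bic_difference(
    model_lnl=-1_000_500.0, model_params=323,   # simplified ECM (assumed lnL)
    ref_lnl=-1_000_000.0, ref_params=3720,      # nonreversible ECM (assumed lnL)
    n_obs=500_000,
)
print(f"BIC difference: {delta:+,.0f}")
```

A negative difference means the model is preferred over the reference, which matches the sign convention stated in footnote a.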
Figure. Estimation error of the ECM. Percent error in estimating ECM exchangeability parameters (△) and instantaneous substitution rates with phylogenies consisting of Dmel–Dsim (red), Dmel–Dsim–Dyak (green), and Dmel–Dsim–Dyak–Dana (blue). The ECM used for simulations is the one estimated on the Dmel–Dsim–Dyak–Dana data set. The vertical purple line marks the number of codons in the smallest real data set used. Similar results are observed when simulating according to different ECMs (supplementary figs. S2 and S3, Supplementary Material online).
Figure. Estimation of MNS rate with the ECM. Proportion of MNSs estimated with the ECM using data simulated according to three real phylogenetic trees, each plotted with its own marker: Dmel–Dsim (△), Dmel–Dsim–Dyak, and Dmel–Dsim–Dyak–Dana. Simulations are repeated according to three different ECMs: the one estimated on Dmel–Dsim (red), the one on Dmel–Dsim–Dyak (green), and the one on Dmel–Dsim–Dyak–Dana (blue). Values shown represent the percentage of all substitutions that are MNSs. The horizontal lines show the correct values, that is, the percentage of MNSs present in the respective ECM used for simulations.
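One way to read "the percentage of all substitutions that are MNSs" is as a share of the total substitution flux implied by a codon model. The sketch below assumes that reading: each substitution from codon i to codon j is weighted by frequency(i) × rate(i→j), and counts as an MNS when the two codons differ at more than one position. The function names and the toy three-codon model are hypothetical, and the paper's exact definition may differ.

```python
def n_differences(codon_a: str, codon_b: str) -> int:
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(codon_a, codon_b))


def mns_proportion_percent(freqs: dict, rates: dict) -> float:
    """Percentage of substitution flux that changes more than one nucleotide.

    freqs: codon -> equilibrium frequency
    rates: (from_codon, to_codon) -> instantaneous substitution rate
    """
    total_flux = 0.0
    mns_flux = 0.0
    for (src, dst), rate in rates.items():
        if src == dst:
            continue  # diagonal entries are not substitutions
        flux = freqs[src] * rate
        total_flux += flux
        if n_differences(src, dst) > 1:
            mns_flux += flux
    if total_flux == 0.0:
        raise ValueError("model implies no substitutions")
    return 100.0 * mns_flux / total_flux


# Toy model (hypothetical values): one single-nucleotide change and one
# double-nucleotide change out of codon AAA.
toy_freqs = {"AAA": 0.5, "AAC": 0.25, "ACC": 0.25}
toy_rates = {("AAA", "AAC"): 1.0, ("AAA", "ACC"): 0.5}
print(round(mns_proportion_percent(toy_freqs, toy_rates), 1))
```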
Performance of ECMs Estimated on Data from Different Drosophila Clades.
| Model Name | Number of Parameters | BIC Score (a) | MNS Proportion (b) (%) |
|---|---|---|---|
| Nonrev. ECM (c) | 3,720 | 0 (reference) | 3.0 |
| ECM | 1,890 | +9,279 | 3.2 |
| Simpl. ECM (d) | 323 | −8,409 | 2.6 |
| 2R-ecHMM | 328 | −102,306 | 0.8 |
| 2cu-ecHMM | 387 | −194,133 | 2.6 |
| **2R-2cu-ecHMM** (e) | 398 | −332,099 | 1.8 |
| Nonrev. ECM | 3,720 | 0 (reference) | 2.7 |
| ECM | 1,890 | −18,158 | 2.7 |
| Simpl. ECM | 323 | −37,323 | 2.2 |
| 2R-ecHMM | 328 | −84,089 | 0.6 |
| 2cu-ecHMM | 387 | −139,756 | 2.1 |
| **2R-2cu-ecHMM** | 398 | −218,253 | 1.4 |
| Nonrev. ECM | 3,720 | 0 (reference) | 4.5 |
| ECM | 1,890 | +17,655 | 4.5 |
| Simpl. ECM | 323 | +9,383 | 3.2 |
| 2R-ecHMM | 328 | −108,776 | 1.5 |
| 2cu-ecHMM | 387 | −155,564 | 2.7 |
| **2R-2cu-ecHMM** | 398 | −277,451 | 2.5 |

Note: The best model for each data set according to BIC score is shown in bold.

(a) BIC score difference between the current model and the nonreversible ECM trained on the same data set.

(b) Proportion of MNSs estimated by the model.

(c) Nonrev. ECM: nonreversible empirical codon model.

(d) Simpl. ECM: simplified empirical codon model.

(e) 2R-2cu-ecHMM: the ecHMM having two classes for nonsynonymous/synonymous rate ratio variation and two classes for codon usage variation.
Comparisons between Models Estimated on Different Clades.
| Feature (a) | | | |
|---|---|---|---|
| ECM Q | 17.3 | 20.3 | 23.8 |
| Simpl. ECM Q | 16.8 | 20.3 | 23.1 |
| 2R-2cu-ecHMM Q | 15.2 | 17.8 | 22.1 |
| ECM π | 7.0 | 12.5 | 12.7 |
| ECM nucleotide | 7.3 | 11.5 | 15.4 |

Note: Comparison of parameter vectors estimated on different clades. Values show the Euclidean distance between two vectors, normalized by the average of the norms of the two vectors and expressed as a percentage.

(a) Model feature that is compared between clades: "Q" is the instantaneous substitution rate matrix, "π" is the codon frequency vector, and "nucleotide" stands for the nucleotide instantaneous substitution rate matrix extracted from the ECM by averaging the single-nucleotide synonymous substitution rates for each ordered pair of nucleotides.
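The distance measure described in the note can be written out directly. The following is a minimal sketch of that definition (Euclidean distance divided by the mean of the two vector norms, times 100); the function name and the example vectors are hypothetical, and a rate matrix would first need to be flattened into a vector.

```python
import math


def normalized_distance_percent(u: list, v: list) -> float:
    """Euclidean distance between u and v, divided by the average of their
    norms and expressed as a percentage."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    mean_norm = (norm_u + norm_v) / 2.0
    if mean_norm == 0.0:
        raise ValueError("both vectors are zero")
    return 100.0 * distance / mean_norm


# Hypothetical parameter vectors estimated on two clades.
clade_1 = [0.30, 0.20, 0.50]
clade_2 = [0.28, 0.24, 0.48]
print(f"{normalized_distance_percent(clade_1, clade_2):.1f}%")
```

Because the distance is scaled by the size of the vectors themselves, values for Q, π, and the nucleotide matrix can be compared on a common scale even though their entries differ in magnitude.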
Performances of ecHMMs on Real Data.
| Model Name | Number of Parameters | BIC Score (a) | MNS Proportion (b) (%) |
|---|---|---|---|
| 2 | | | |
| 2cu-ecHMM | 387 | −181,736 | 2.3 |
| 3cu-ecHMM | 453 | −217,866 | 2.3 |
| 4cu-ecHMM | 521 | −231,229 | 2.3 |
| 2R-ecHMM | 328 | −100,305 | 2.0 |
| 3R-ecHMM | 335 | −105,603 | 2.0 |
| 4R-ecHMM | 344 | −112,395 | 2.0 |
| **2R-2cu-ecHMM** (c) | 398 | −297,460 | 1.7 |
| 4cu-ecHMM | 521 | −239,972 | 2.5 |
| 4R-ecHMM | 344 | −289,980 | 2.3 |
| **2R-2cu-ecHMM** | 398 | −428,746 | 2.2 |
| 2 | | | |
| 4cu-ecHMM | 521 | −228,193 | 2.4 |
| 4R-ecHMM | 344 | −283,604 | 2.1 |
| **2R-2cu-ecHMM** | 398 | −413,221 | 2.1 |
| 4cu-ecHMM | 521 | −61,579 | 2.7 |
| 4R-ecHMM | 344 | −111,026 | 2.7 |
| **2R-2cu-ecHMM** | 398 | −131,030 | 2.7 |
| 2 | | | |
| 4cu-ecHMM | 521 | −59,924 | 2.6 |
| 4R-ecHMM | 344 | −107,284 | 2.6 |
| **2R-2cu-ecHMM** | 398 | −126,043 | 2.6 |

Note: The best model for each data set according to BIC score is shown in bold.

(a) BIC score difference between the current model and the simplified ECM trained on the same data set.

(b) Proportion of MNSs estimated by the model.

(c) 2R-2cu-ecHMM: the ecHMM having two classes for nonsynonymous/synonymous rate ratio variation and two classes for codon usage variation.
Figure. Estimation error of codon frequencies with 2cu-ecHMM. Estimation error of the two sets of codon frequencies on a data set simulated according to a 2cu-ecHMM and recovered by a 2cu-ecHMM. Codon frequencies for both classes are considered. The y axis shows the error as a percentage; the x axis shows the number of codons in the data set used.
Figure. Estimation of MNS rate with 2cu-ecHMM. Estimation of MNS rate on a data set simulated according to a 2cu-ecHMM. The y axis shows the proportion of substitutions that are MNSs, as a percentage; the x axis shows the number of codons in the data set used. Blue (△) is the MNS rate estimated by a 2cu-ecHMM (the simulated model), red is the rate estimated by a simplified ECM, and green is the rate estimated by an ECM. The horizontal line shows the simulated proportion of MNSs, that is, the true value to be estimated.
Performance of Positive Selection Tests on Simulated Data.
| No Positive Selection (False Positives) | With Positive Selection (Power) | |||
|---|---|---|---|---|
| Model | ||||
| M1a–M2a | 11.8% (4.4%) | 26.9% (13.1%) | 34.3% (17.8%) | 88.7% (75.0%) |
| ecM1a–ecM2a | 3.1% (0.8%) | 1.1% (0.5%) | 3.3% (1.0%) | 49.1% (29.7%) |
| M7–M8 | 14.0% (5.3%) | 28.0% (13.5%) | 35.7% (18.5%) | 89.3% (75.8%) |
| ecM7–ecM8 | 3.5% (1.1%) | 1.2% (0.5%) | 3.6% (1.0%) | 49.8% (30.6%) |
| M1a–M2a | 6.8% (2.3%) | 8.8% (2.6%) | 21.7% (10.5%) | 98.0% (92.8%) |
| ecM1a–ecM2a | 1.4% (0.1%) | 0.8% (0.1%) | 3.4% (0.9%) | 88.4% (75.2%) |
| M7–M8 | 8.8% (3.2%) | 9.9% (2.7%) | 24.2% (11.0%) | 98.2% (94.3%) |
| ecM7–ecM8 | 2.8% (0.3%) | 1.0% (0.1%) | 3.6% (1.1%) | 89.4% (76.7%) |
| M1a–M2a | 3.8% (0.7%) | 4.3% (1.3%) | 9.5% (3.8%) | 99.9% (99.3%) |
| ecM1a–ecM2a | 0.6% (0.1%) | 0.3% (0.1%) | 2.3% (0.6%) | 99.3% (97.9%) |
| M7–M8 | 7.0% (2.2%) | 10.5% (3.7%) | 14.5% (6.1%) | 99.9% (99.4%) |
| ecM7–ecM8 | 1.8% (0.1%) | 0.6% (0.1%) | 2.4% (0.8%) | 99.4% (97.9%) |
Note: Proportion of tests detecting positive selection over 1,000 simulations. LRTs were performed at two significance levels (results for the stricter level in parentheses) according to a $\chi^2$ distribution. Alignments were simulated according to substitution rates of the 2cu-ecHMM. The exchangeability parameters of the model used for simulations are also used as constants in ecM1a, ecM2a, ecM7, and ecM8. Simulations are performed under the phylogenetic trees estimated on real data and indicated in the table (see Materials and Methods and supplementary table S10, Supplementary Material online).
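The mechanics of such a likelihood ratio test can be sketched briefly. This is an illustration, not the authors' procedure: it assumes the test statistic 2(ln L_alt − ln L_null) is compared with a chi-square distribution whose degrees of freedom equal the difference in free parameters between the nested models (4 − 2 = 2 for M1a vs. M2a and for M7 vs. M8, using the counts in the models table). The function names and log-likelihood values are hypothetical. With 2 degrees of freedom the chi-square survival function has the closed form exp(−x/2), so no external library is needed.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Twice the log-likelihood gain of the alternative over the null model.
    Clamped at zero to absorb small negative values from optimizer noise."""
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_2df(x: float) -> float:
    """Survival function of a chi-square distribution with 2 degrees of
    freedom, which has the closed form exp(-x / 2) for x >= 0."""
    return math.exp(-x / 2.0)


def lrt_p_value_2df(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a nested-model LRT, assuming 2 degrees of freedom."""
    return chi2_sf_2df(lrt_statistic(lnl_null, lnl_alt))


# Hypothetical log-likelihoods for a null (e.g., M1a) and alternative
# (e.g., M2a) fit to the same alignment.
p = lrt_p_value_2df(lnl_null=-2500.0, lnl_alt=-2495.5)
print(f"p = {p:.4f}; positive selection inferred at the 5% level: {p < 0.05}")
```

Repeating such a test over many alignments simulated without positive selection and counting how often the null is rejected gives a false-positive rate of the kind tabulated above; doing the same on alignments simulated with positive selection gives the power.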