| Literature DB >> 30642241 |
Katherine A Dunn1, Toby Kenney2, Hong Gu2, Joseph P Bielawski3,4,5.
Abstract
BACKGROUND: An excess of nonsynonymous substitutions, over neutrality, is considered evidence of positive Darwinian selection. Inference for proteins often relies on estimation of the nonsynonymous to synonymous ratio (ω = dN/dS) within a codon model. However, to ease computational difficulties, ω is typically estimated assuming an idealized substitution process where (i) all nonsynonymous substitutions have the same rate (regardless of impact on organism fitness) and (ii) instantaneous double and triple (DT) nucleotide mutations have zero probability (despite evidence that they can occur). It follows that estimates of ω represent an imperfect summary of the intensity of selection, and that tests based on the ω > 1 threshold could be negatively impacted.Entities:
Keywords: Codon frequencies; Codon model; False positives; G-series models; Likelihood ratio test; M-series models; Model misspecification; Multiple nonsynonymous rates; Multiple nucleotide mutations; Positive selection; Protein evolution
Mesh:
Substances:
Year: 2019 PMID: 30642241 PMCID: PMC6332903 DOI: 10.1186/s12862-018-1326-7
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Descriptions of the model-related acronyms
|
|
|
|---|---|
| DT | Indicates that a model allows simultaneous double (D) and triple (T) nucleotide changes between codons |
| G0 | A GPP codon model employing a single |
| G1aX | A GPP codon model with the same discrete mixture of two |
| G2aX | A GPP codon model with the same discrete mixture of three |
| GPP | General-Purpose Parametric (GPP) modelling framework for codons |
| GTR | General Time Reversible (GTR) model for single nucleotide changes |
| GY | The codon modelling framework of Goldman and Yang [26] where the transition probability is proportional to the target codon frequency |
| M0 | A codon model employing a single |
| M1a | A codon model employing a constrained discrete mixture of two |
| M2a | A codon model employing a constrained discrete mixture of three |
| M3 | A codon model employing an unconstrained discrete mixture of |
| M8 | A codon model employing a discretized |
| MEP | Mixed Empirical and Parametric (MEP) models combine empirical estimates of exchangeabilities with so-called mechanistic parameters of codon evolution |
| MG | The codon modelling framework of Muse and Gaut [47] where the transition probability is proportional to the target nucleotide frequency |
| MNR | A class of models allowing Multiple Nonsynonymous Rates (MNR) of exchangeability between codons |
| PCP | Physiochemical-Constrained Parametric (PCP) models parameterize the influence of physiochemical constraints on nonsynonymous changeability |
| REV | A fully reversible codon model described by a 61×61 matrix |
| SNR | A class of models allowing only a Single Nonsynonymous Rates (SNR) of exchangeability between codons |
Fig. 1Graphical illustration of the design of Simulation Studies 1 and 2. The overall design is comprised of 32 distinct evolutionary scenarios divided into four distinct Simulation studies focused on different objectives. The details of Simulation studies 1 and 2 are shown in this figure. The details of Simulation Studies 3 and 4 are derived from Studies 1 and 2, and are further explained in the text. All simulation studies were comprised of 100 replicates, each having sequences of 300 codons. All datasets were generated using version 1.2 of the COLD program “www.mathstat.dal.ca/~tkenney/Cold/, https://github.com/tjk23/COLD”. a The 5-taxon and 17-taxon tree topologies. The 5-taxon tree and branch lengths are the same as those used for simulating sequences in Wong et al. [45]. The 17-taxon tree and branch lengths are the same as those used for simulating sequences in Yang et al. [6]. The scale for the branch lengths gives the mean number of substitutions per codon. b Sequence generating process for Simulation Study 1. The purpose of this study is to investigate the impact of DT changes (0.06 and 0.03 respectively) on the false positive rate. The selective regime is based on a strictly neutral model having just two classes of sites; conserved (50% of data) having ω = 0 and neutral (50% of data) having ω = 1. The scenarios of this study differ in the complexity of the nucleotide substitution process; case 1a is simple (everything equal) and case 1b/1c is complex (unequal GTR exchangeabilities and nucleotide frequencies). The GTR exchangeabilities and nucleotide frequencies for case 1b/1c were obtained from β-globin gene sequences. c Sequence generating process for Simulation Study 2. This study has 24 scenarios, and covers more complexity than the strictly neutral case of Study 1. Each has a mixture of three selective regimes: a large fraction of strong purifying selection (77%, ω0 = 0.05), a moderate fraction of sites (22%) with ω1 = 0.5 or 1.0, and a small fraction evolving with ω ≥ 1 (3% ω+ = 1.0, 1.5, 2.0 or 5.0). MNRs were induced using hydrophobicity factors of 1.0, 0.4 or 0.05, which were linked to the codon model via the GPP parameter β. When there is no impact on nonsynonymous rates, yielding a SNR codon model. When , codon evolution has MNRs. The nucleotide process had heterogeneous GTR exchangeabilities, and unequal nucleotide frequencies at the three positions of the codon. DT codon changes were not included in Study 2; DT was added to MNRs in Simulation Study 3
False positive rates under a strictly neutral evolutionary process with DT nucleotide substitutions between codons
| Simulation | LRT false positive rate | median | ||
|---|---|---|---|---|
| M1a - M2a | G1aDT - G2aDT | M2a | G2aDT | |
| 1a (simple, 5 taxa) | 0.49 | 0.04 | ||
| 1b (complex, 5 taxa) | 0.22 | 0.04 | ||
| 1c (complex, 10 taxa) | 0.48 | 0.04* | ||
One hundred replicates (sequence length = 300 codons) were simulated for each scenario. Simulation 1a is based on a simple model (equal DNA exchangeabilities and equal codon frequencies) evolved over a 5-taxon tree. Simulation 1b is based on a more complex generating process using DNA exchangeabilities and codon frequencies derived from a real dataset. Simulation 1b was extended to the case of a 10-taxon tree. Codon models fitted to simulation 1a assumed equal codon frequencies (fequal), and those fitted to simulation 1b used GY94-style F3 × 4 codon frequencies. The asterisk symbol (*) indicates that the results for simulation 1c under the 10-taxon tree is based on 97 replicates due to convergence problems with some datasets
False positive rates (null scenarios) and true positive rates (alternative scenarios) for three LRTs when the evolutionary process includes both ω variability among sites and MNRs
| SNR ( | Low MNR ( | High MNR ( | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
| LRT-1 | LRT-2 | LRT-3 | LRT-1 | LRT-2 | LRT-3 | LRT-1 | LRT-2 | LRT-3 | |
|
|
| |||||||||||
| 2a | 0.05 | 0.5 | 1.0 | 0.00 | 0.01 | 0.01* | 0.00 | 0.01 | 0.00* | 0.00 | 0.00 | 0.00* |
| 2b | 1.0 | 1.0 | 0.01 | 0.04 | 0.03* | 0.00 | 0.03 | 0.00* | 0.00 | 0.03 | 0.00* | |
|
|
| |||||||||||
| 2c | 0.05 | 0.5 | 1.5 | 0.03 | 0.36 | 0.44 | 0.01 | 0.24 | 0.20* | 0.00 | 0.09 | 0.00* |
| 2d | 2.0 | 0.52 | 0.82 | 0.85 | 0.05 | 0.65 | 0.61* | 0.00 | 0.45 | 0.14* | ||
| 2e | 5.0 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 1.00 | 0.14 | 0.99 | 1.00* | ||
| 2f | 0.05 | 1.0 | 1.5 | 0.06 | 0.10 | 0.08 | 0.00 | 0.14 | 0.05 | 0.00 | 0.14 | 0.01* |
| 2 g | 2.0 | 0.33 | 0.46 | 0.37 | 0.00 | 0.46 | 0.24 | 0.00 | 0.31 | 0.09* | ||
| 2 h | 5.0 | 1.00 | 0.99 | 1.00 | 0.98 | 1.00 | 1.00 | 0.09 | 1.00 | 0.99* | ||
LRT-1 compares M1a to M2a (under-fit models). LRT-2 compares G1a to G2a (perfect-fit models). LRT-3 compares G1a13 to G2a13 (over-fit models). The asterisk symbol (*) indicates scenarios where either convergence problems or suboptimal peaks were encountered for the models of LRT-3. To overcome these problems, models were re-fit to the same dataset multiple times, each using a different set of initial parameter values. The number of problematic datasets for SNR was 2a = 21 and 2b = 1; for low MNR was 2a = 27, 2b = 16, 2c = 16 and 2f = 10; and for high MNR was 2a = 29, 2b = 20, 2c = 35, 2e = 15, 2f = 15 and 2 g = 1. Because using multiple initials for the problematic datasets was successful, the results above are for all 100 replicates
False positive rates (null scenarios) and true positive rates (alternative scenarios) for three LRTs when the evolutionary process includes DT nucleotide substitutions between codons, ω variability among sites, and MNRs
| SNR ( | High MNR ( | ||||||||
|---|---|---|---|---|---|---|---|---|---|
|
|
|
| LRT-1 | LRT-2 | LRT-3 | LRT-1 | LRT-2 | LRT-3 | |
|
|
| ||||||||
| 3a | 0.05 | 1.0 | 1.0 | 0.55 | 0.02 | 0.03 | 0.0 | 0.06* | 0.10* |
|
|
| ||||||||
| 3b | 0.05 | 0.5 | 2.0 | 0.95 | 0.87 | 0.92 | 0.01 | 0.44 | 0.26 |
| 3c | 0.05 | 1.0 | 2.0 | 0.99 | 0.47 | 0.46 | 0.0 | 0.27 | 0.18 |
LRT-1 compares M1a to M2a (under-fit models). LRT-2 compares G1a to G2a (perfect-fit models). LRT-3 compares G1a13 to G2a13 (over-fit models). The asterisk symbol (*) indicates that the results are based on < 100 replicates due to convergence problems with some datasets when there was high MNRs. For LRT-2 case 3a is based on 99 replicates. LRT-3 case 3a is based on 91 replicates
Sensitivity of false positive rates to the choice of model parameterization under the nine different null scenarios of Simulation Studies 1–3
| SNR + DT cases | SNR (no DT) | High MNR (3a = DT) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| model | 1a | 1b | 1c | 3a | 2a | 2b | 2a | 2b | 3a | |
| M1a - M2a | GY | 0.31 | 0.22 | 0.48 | 0.55 | 0.0 | 0.01 | 0.0 | 0.0 | 0.0 |
| M1a - M2a | MG | 0.25 | 0.41 | 0.91 | 0.94 | 0.0 | 0.15 | 0.0 | 0.0 | 0.0 |
| M8 | GY | 0.47 | 0.24 | 0.58 | 0.56 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 |
| M8 | MG | 0.44 | 0.57 | 0.94 | 0.97 | 0.0 | 0.18 | 0.0 | 0.0 | 0.0 |
Scenarios 1a and 1b are based on a 5-taxon tree, and 2a, 2b and 3a are based on a 17-taxon tree (see Fig. 1). GY denotes the frequency parameterization of Goldman and Yang [26] where the transition probability is proportional to target codon. MG denotes the frequency parameterization Muse and Gaut [47] where the transition probability is proportional to target nucleotide. Both require frequency estimates for the four nucleotides at each position of the codon (denoted F3 × 4), and thus each requires 9 free parameters
Results of applying LRT-1 and LRT-3 to the set of 21 real Streptococcus sequence alignments
| Gene | under-fit models | over-fit models | M2a vs. G2a13 | |||||
|---|---|---|---|---|---|---|---|---|
| NC | NS | TL | LRT-1: | M2a | LRT-3: | G2a13 | 2 | |
| 1 | 892 | 19 | 6.98 | N.S. | N.S. | 881.3 | ||
| 2 | 639 | 16 | 6.37 | N.S. | 504.2 | |||
| 3 | 228 | 11 | 3.74 | N.S. | N.S. | 152.1 | ||
| 4 | 577 | 9 | 8.49 | N.S. | N.S. | 466.1 | ||
| 5 | 390 | 9 | 5.16 | N.S. | 109.6 | |||
| 6 | 348 | 11 | 4.5 | N.S. | N.S. | 113.7 | ||
| 7 | 184 | 10 | 0.37 | 71.7 | ||||
| 8 | 169 | 6 | 30 | N.S. | N.S. | 130.9 | ||
| 9 | 227 | 10 | 5.46 | N.S. | N.S. | 50.3 | ||
| 10 | 450 | 10 | 2.2 | N.S. | N.S. | 14.3 | ||
| 11 | 444 | 7 | 4.6 | N.S. | N.S. | 109.7 | ||
| 12 | 473 | 9 | 0.45 | N.S. | N.S. | 17.3 | ||
| 13 | 427 | 8 | 0.05 | 0.10 > | N.S. | 6.2 | ||
| 14 | 632 | 7 | 0.09 | 0.10 > | N.S. | 25.1 | ||
| 15 | 209 | 7 | 10.3 | N.S. | N.S. | 164.5 | ||
| 16 | 232 | 6 | 0.43 | 0.10 > | N.S. | 49.1 | ||
| 17 | 661 | 5 | 3.3 | N.S. | 220.6 | |||
| 18 | 564 | 5 | 7.7 | N.S. | N.S. | 171.4 | ||
| 19 | 261 | 4 | 9.5 | N.S. | N.S. | 113.6 | ||
| 20 | 201 | 4 | 2.2 | N.S. | N.S. | 40.4 | ||
| 21 | 166 | 4 | 2.7 | N.S. | N.S. | 34.69 | ||
NC is the number of codons in the sequence alignment after removal of sites with ambiguities or indels. NS is the number of gene sequences in the alignment. TL is the total tree length estimated under codon model M0 as the mean number of substitution per codon. N.S. indicates a non-significant LRT. The dagger symbol (†) indicates a gene for which likelihood optimization under a G-series model did meet convergence criteria. The two-fold s symbol (§) indicates that the MLEs were obtained by removing tip branches having near-zero lengths and re-fitting the model. The gene names, along with the sequence alignments, are provided in the DRYAD repository [51]