Jeremy M Beaulieu, Brian C O'Meara, Russell Zaretzki, Cedric Landerer, Juanjuan Chai, Michael A Gilchrist.
Abstract
We present a new phylogenetic approach, selection on amino acids and codons (SelAC), whose substitution rates are based on a nested model linking protein expression to population genetics. Unlike simpler codon models that assume a single substitution matrix for all sites, our model more realistically represents the evolution of protein-coding DNA under the assumption of consistent, stabilizing selection using a cost-benefit approach. This cost-benefit approach allows us to generate a set of 20 optimal amino acid-specific matrix families using just a handful of parameters and naturally links the strength of stabilizing selection to protein synthesis levels, which we can estimate. Using a yeast data set of 100 orthologs for 6 taxa, we find SelAC fits the data much better than popular models by 10^4 to 10^5 Akaike information criterion units adjusted for small sample bias. Our results also indicate that nested, mechanistic models better predict observed data patterns, highlighting the improvement in biological realism in amino acid sequence evolution that our model provides. Additional parameters estimated by SelAC indicate that a large amount of nonphylogenetic, but biologically meaningful, information can be inferred from existing data. For example, SelAC predictions of gene-specific protein synthesis rates correlate well with both empirical measurements (r=0.33-0.48) and other theoretical predictions (r=0.45-0.64) for multiple yeast species. SelAC also provides estimates of the optimal amino acid at each site. Finally, because SelAC is a nested approach based on clearly stated biological assumptions, future modifications, such as including shifts in the optimal amino acid sequence within or across lineages, are possible.
Keywords: Wright–Fisher; allele substitution; gene expression; protein function; stabilizing selection
Year: 2019 PMID: 30521036 PMCID: PMC6445302 DOI: 10.1093/molbev/msy222
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Summary of a five-gene simulation for a SelAC model where we assume , and thus, no site-specific sensitivity in the generating model. The “known” parameters were based on fitting the SelAC model to the 106-gene data set and phylogeny of Rokas et al. (2003), with gene choice being based on five evenly spaced points along the rank order of the gene-specific composite parameter ψ′. The points and associated uncertainty in the estimates of the gene-specific average protein synthesis rate, or ψ (calculated from ψ′) (a), nucleotide mutation rates under the UNREST model (b), proportion of correct optimal amino acids for a given gene (c), and estimates of the individual edge lengths (d) are based on the mean and the 2.5% and 97.5% quantiles across all 50 simulated data sets. Gene index on the x-axis refers to the arbitrary number assigned to the simulated gene.
Comparison of Maximum Likelihood Fits for SelAC and Commonly Used Models Based on Negative Log-Likelihood (), AIC, AICc, and AICw from Analyses of 100 Selected Genes from Six Yeast Taxa (Salichos and Rokas 2013).
| Model | Negative log-likelihood | Parameters estimated | AIC | AICc | ΔAICc | Model weight (AICw) |
|---|---|---|---|---|---|---|
| SelAC+Γ | 453,620.8 | 50,005 | 1,007,252 | 1,027,314 | 0 | >0.999 |
| SelAC | 464,114.8 | 50,004 | 1,028,238 | 1,048,299 | 20,985 | <0.001 |
| SelAC_M+Γ | 465,106.9 | 50,005 | 1,030,224 | 1,050,286 | 22,972 | <0.001 |
| SelAC_M | 478,302.4 | 50,004 | 1,056,613 | 1,076,674 | 49,360 | <0.001 |
| FMutSel | 597,140.7 | 178 | 1,194,637 | 1,194,638 | 167,324 | <0.001 |
| GY | 612,670.4 | 111 | 1,225,563 | 1,225,563 | 198,249 | <0.001 |
| GTR+Γ | 655,166.4 | 610 | 1,311,553 | 1,311,554 | 284,240 | <0.001 |
Note.—The subscripts M indicate model fits where the most common or “majority rule” amino acid was fixed as the optimal amino acid a* for each site. As discussed in the text, although a* for each site under M was not fitted by our algorithm, its value was determined by examining the data; as a result, these values represent additional parameters estimated from the data and are accounted for in our table. Sample size used in the calculation of AICc is assumed to be equal to the size of the matrix (number of taxa × number of sites). For the comparison between the different SelAC models and 192 other models fitted using IQTree (Nguyen et al. 2015), see supplementary table S1, Supplementary Material online. In summary, the different SelAC models and FMutSel fitted the data better than any of the IQTree models.
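The information criteria reported in the table can be sketched as follows. This is a minimal illustration using the standard textbook definitions of AIC, AICc, and Akaike weights, not the authors' own code; the function names are hypothetical, and the sample size `n` (number of taxa × number of sites) is left as an argument because the record does not state it.

```python
import math


def aic(neg_log_lik, k):
    """AIC = 2 * (negative log-likelihood) + 2 * (number of parameters)."""
    return 2.0 * neg_log_lik + 2.0 * k


def aicc(neg_log_lik, k, n):
    """AIC with the small-sample correction 2k(k+1)/(n-k-1).

    n is the sample size; the table note takes it to be
    number of taxa x number of sites (value not given in this record).
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(neg_log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores):
    """Turn a list of AICc scores into model weights that sum to 1."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]


# AICc values copied from the table, in row order (SelAC+Γ first, GTR+Γ last).
aicc_scores = [1027314, 1048299, 1050286, 1076674, 1194638, 1225563, 1311554]
weights = akaike_weights(aicc_scores)

print(round(aic(453620.8, 50005)))  # AIC of the best-fitting row
print(weights[0])                   # weight of the best-fitting row
```

With AICc differences in the tens of thousands, the relative likelihood of every model other than the best one underflows to zero in floating point, which is consistent with the weights of >0.999 and <0.001 shown in the table.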
Fig. 2. Comparisons between estimates of average protein translation rate obtained from SelAC + Γ and direct measurements of expression for individual yeast taxa across the 100 selected genes from Salichos and Rokas (2013) measured during log-growth phase. Estimates of ψ were generated by dividing the composite term ψ′ by B (|). Gene expression was measured using either RNA-Seq (a–c) or microarray (d–e). The equations in the upper left-hand corner of each panel represent the regression fit and the Pearson correlation coefficient r.
Fig. 3. Comparisons between ω, which is the nonsynonymous/synonymous mutation ratio in FMutSel, and SelAC + Γ estimates of protein functionality production rates ψ̂ (a), RNA-Seq-based measurements of mRNA abundance (b), and ROC-SEMPPER’s estimates of protein translation rates , which are based solely on S. cerevisiae’s patterns of codon usage bias (c), for S. cerevisiae across the 100 selected genes from Salichos and Rokas (2013). As in figure 2, the equations in the upper right-hand corner of each panel provide the regression fit and correlation coefficient.
Fig. 4. (a) Maximum likelihood estimates of branch lengths under SelAC + Γ for 100 selected genes from Salichos and Rokas (2013). Tests of model adequacy for S. cerevisiae (b) and S. castellii (c) indicated that, when these taxa are removed from the tree and their sequences are simulated, the parameters of SelAC + Γ exhibit functionality B(|) that is far closer to the observed (dashed black line) than data sets produced from parameters of either FMutSel or GTR + Γ.