Abstract
Phylogenetic analyses of molecular data require a quantitative model for how sequences evolve. Traditionally, the details of the site-specific selection that governs sequence evolution are not known a priori, making it challenging to create evolutionary models that adequately capture the heterogeneity of selection at different sites. However, recent advances in high-throughput experiments have made it possible to quantify the effects of all single mutations on gene function. I have previously shown that such high-throughput experiments can be combined with knowledge of underlying mutation rates to create a parameter-free evolutionary model that describes the phylogeny of influenza nucleoprotein far better than commonly used existing models. Here, I extend this work by showing that published experimental data on TEM-1 beta-lactamase (Firnberg E, Labonte JW, Gray JJ, Ostermeier M. 2014. A comprehensive, high-resolution map of a gene's fitness landscape. Mol Biol Evol. 31:1581-1592) can be combined with a few mutation rate parameters to create an evolutionary model that describes beta-lactamase phylogenies much better than most common existing models. This experimentally informed evolutionary model is superior even for homologs that are substantially diverged (about 35% divergence at the protein level) from the TEM-1 parent that was the subject of the experimental study. These results suggest that experimental measurements can inform phylogenetic evolutionary models that are applicable to homologs that span a substantial range of sequence divergence.
Keywords: deep mutational scanning; lactamase; phylogenetics; protein evolution; substitution model
Year: 2014 PMID: 25063439 PMCID: PMC4166927 DOI: 10.1093/molbev/msu220
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. The amino acid preferences for TEM-1 beta-lactamase, calculated from the data of Firnberg et al. (2014). The heights of letters are proportional to the preference for that amino acid at that position in the protein. Residues are numbered using the scheme of Ambler et al. (1991). Letters are colored according to the hydrophobicity of the amino acid. Bars above the letters indicate the secondary structure and relative solvent accessibility as calculated from the crystal structure in PDB entry 1XPB (Fonzé et al. 1995) using DSSP (Kabsch and Sander 1983; Joosten et al. 2011), with maximum solvent accessibilities taken from Tien et al. (2013). The figure was generated using “WebLogo” (Crooks et al. 2004) integrated into the “mapmuts” software package (Bloom 2014). The data and source code used to create this plot are provided through http://jbloom.github.io/phyloExpCM/example_2014Analysis_lactamase.html (last accessed July 28, 2014).
Fig. 2. Phylogenetic trees of TEM (red) and SHV (blue) beta-lactamases inferred using “codonPhyML” (Gil et al. 2013) with the codon substitution model of (A) Goldman and Yang (1994) or (B) Kosiol et al. (2007). The scale bars have units of number of codon substitutions per site. The inferred trees are similar for both models; the distance between the trees computed using the measure of Robinson and Foulds (1981) is 0.14. The TEM and SHV sequences each cluster into closely related clades: The average number of nucleotide and amino acid differences between sequence pairs within these clades is 13 and 7 for the TEM sequences, and 10 and 5 for the SHV sequences. There is extensive divergence between these two clades: The average number of nucleotide and amino acid differences between sequence pairs across the clades is 326 and 100. For both substitution models, a single transition–transversion ratio (κ) and four discrete gamma-distributed nonsynonymous–synonymous ratios (ω) were estimated by maximum likelihood. The equilibrium codon frequencies were determined empirically using the CF3x4 method (Pond et al. 2010) for the model of Goldman and Yang (1994), or the F method for the model of Kosiol et al. (2007). The data and source code used to create these trees are provided through http://jbloom.github.io/phyloExpCM/example_2014Analysis_lactamase.html (last accessed July 28, 2014).
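The within-clade and across-clade averages quoted in this caption are means of pairwise difference counts over aligned sequences. A minimal sketch of that calculation follows; the function names and the toy sequences are hypothetical illustrations, not the lactamase alignment or the code distributed with the paper.

```python
from itertools import combinations, product


def n_diffs(a, b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mean_within(seqs):
    """Average number of differences over all pairs within one clade."""
    pairs = list(combinations(seqs, 2))
    return sum(n_diffs(a, b) for a, b in pairs) / len(pairs)


def mean_between(seqs1, seqs2):
    """Average number of differences over all cross-clade pairs."""
    pairs = list(product(seqs1, seqs2))
    return sum(n_diffs(a, b) for a, b in pairs) / len(pairs)


# Toy aligned nucleotide sequences (hypothetical, for illustration only).
clade_a = ["ATGGCA", "ATGGCT", "ATGACA"]
clade_b = ["TTGCCA", "TTGCCT"]

print(mean_within(clade_a))            # average differences inside clade A
print(mean_between(clade_a, clade_b))  # average differences across clades
```

The same functions apply to protein alignments by passing amino acid strings instead of nucleotide strings.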
Table 1. Experimentally Informed Evolutionary Models Fit the Combined TEM and SHV Beta-Lactamase Phylogeny in Figure 2A Much Better than Models That Do Not Utilize Experimental Data.
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –4,020.6 | 5 (5 + 0) | |
| Experimental, | 46.4 | –4,044.8 | 4 (4 + 0) | |
| Experimental, | 77.3 | –4,059.3 | 5 (5 + 0) | |
| Experimental, | 85.7 | –4,064.5 | 4 (4 + 0) | |
| GY94, gamma | 392.4 | –4,208.8 | 13 (4 + 9) | |
| KOSI07+F, gamma | 410.7 | –4,167.0 | 64 (4 + 60) | |
| GY94, gamma | 460.9 | –4,244.1 | 12 (3 + 9) | |
| KOSI07+F, gamma | 467.0 | –4,196.1 | 63 (3 + 60) | |
| GY94, one | 528.9 | –4,278.1 | 12 (3 + 9) | |
| KOSI07+F, one | 551.3 | –4,238.3 | 63 (3 + 60) | |
| KOSI07+F, one | 632.9 | –4,280.1 | 62 (2 + 60) | |
| GY94, one | 656.2 | –4,342.7 | 11 (2 + 9) | |
| Randomized, | 724.5 | –4,382.9 | 5 (5 + 0) | |
| Randomized, | 735.1 | –4,388.2 | 5 (5 + 0) | |
| Avg. frequencies, | 820.8 | –4,371.0 | 65 (5 + 60) | |
| Avg. frequencies, | 841.8 | –4,381.5 | 65 (5 + 60) | |
| Avg. frequencies, | 858.0 | –4,390.6 | 64 (4 + 60) | |
| Avg. frequencies, | 900.7 | –4,412.0 | 64 (4 + 60) | |
| Randomized, | 1,264.9 | –4,654.1 | 4 (4 + 0) | |
| Randomized, | 1,474.5 | –4,758.9 | 4 (4 + 0) |
Note.—The difference in AIC relative to the best model (smaller ΔAIC indicates better fit), the log likelihood, the number of free parameters, and the values of key parameters are shown in this table. For each model, the branch lengths and model parameters were optimized for the fixed tree topology in figure 2A. The “Experimental” models use amino acid preferences derived from the data of Firnberg et al. (2014) plus four mutation rate parameters (eq. 8) and optionally the stringency parameter β. For the “Randomized” models, the experimentally measured amino acid preferences are randomized among sites; these models are far worse because the preferences are no longer assigned to the correct positions. For the “Avg. frequencies” models, the amino acid preferences are identical across sites and are set to the average frequency of that amino acid in the entire lactamase sequence alignment; these models are also far worse than the experimentally informed models, as they do not utilize site-specific information. Fitting the stringency parameter to a value of β > 1 improves the fit of the experimentally informed models by enhancing the importance of the site-specific amino acid preferences. Fitting the stringency parameter to a value of β < 1 improves the fit of the randomized and Avg. frequencies models by effectively equalizing the preferences across amino acids. “GY94” denotes the model of Goldman and Yang (1994) with nine equilibrium frequency parameters calculated using the CF3x4 method (Pond et al. 2010). “KOSI07+F” denotes the model of Kosiol et al. (2007) with 60 equilibrium frequency parameters calculated using the F method. All variants of GY94 and KOSI07+F have a single transition–transversion ratio (κ) estimated by maximum likelihood. Different model variants either have a single nonsynonymous–synonymous ratio (ω) or values drawn from four discrete gamma-distributed categories (Yang et al. 2000), and either a single rate or rates drawn from four discrete gamma-distributed categories (Yang 1994). The data and source code used to generate this table are provided through http://jbloom.github.io/phyloExpCM/example_2014Analysis_lactamase.html (last accessed July 28, 2014).
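The ΔAIC column can be checked from the log likelihoods and parameter counts using the standard definition AIC = 2k − 2 ln L. The sketch below assumes that k counts both optimized and empirical parameters, which is inferred from the reported numbers rather than stated in the note. The dictionary labels are hypothetical shorthand, because the full model names are not preserved in the table above.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def delta_aic(models):
    """Return each model's AIC minus the smallest AIC in the set."""
    scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
    best = min(scores.values())
    return {name: score - best for name, score in scores.items()}


# (log likelihood, total parameters) copied from three rows of table 1.
# The labels are hypothetical shorthand, not the original model names.
table1_rows = {
    "experimental_5_params": (-4020.6, 5),
    "experimental_4_params": (-4044.8, 4),
    "gy94_gamma_13_params": (-4208.8, 13),
}

for name, d in delta_aic(table1_rows).items():
    print(f"{name}: {d:.1f}")
```

For these three rows the computed differences agree with the tabulated ΔAIC values of 0.0, 46.4, and 392.4. Some other rows differ in the last digit, presumably because the log likelihoods are rounded to one decimal place.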
Table 2. Experimentally Informed Evolutionary Models Also Provide a Superior Phylogenetic Fit When the Tree Topology Is Estimated Using the Model of Kosiol et al. (2007) Rather than That of Goldman and Yang (1994).
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –4,019.9 | 5 (5 + 0) | |
| Experimental, | 48.3 | –4,045.1 | 4 (4 + 0) | |
| Experimental, | 76.1 | –4,058.0 | 5 (5 + 0) | |
| Experimental, | 85.6 | –4,063.7 | 4 (4 + 0) | |
| GY94, gamma | 398.0 | –4,210.9 | 13 (4 + 9) | |
| KOSI07+F, gamma | 402.2 | –4,162.0 | 64 (4 + 60) | |
| KOSI07+F, gamma | 455.1 | –4,189.5 | 63 (3 + 60) | |
| GY94, gamma | 464.4 | –4,245.1 | 12 (3 + 9) | |
| GY94, one | 527.9 | –4,276.9 | 12 (3 + 9) | |
| KOSI07+F, one | 529.8 | –4,226.9 | 63 (3 + 60) | |
| KOSI07+F, one | 608.3 | –4,267.1 | 62 (2 + 60) | |
| GY94, one | 651.9 | –4,339.9 | 11 (2 + 9) | |
| Randomized, | 726.3 | –4,383.1 | 5 (5 + 0) | |
| Randomized, | 737.0 | –4,388.4 | 5 (5 + 0) | |
| Avg. frequencies, | 823.7 | –4,371.8 | 65 (5 + 60) | |
| Avg. frequencies, | 844.8 | –4,382.3 | 65 (5 + 60) | |
| Avg. frequencies, | 862.1 | –4,392.0 | 64 (4 + 60) | |
| Avg. frequencies, | 907.1 | –4,414.5 | 64 (4 + 60) | |
| Randomized, | 1,265.1 | –4,653.5 | 4 (4 + 0) | |
| Randomized, | 1,474.1 | –4,758.0 | 4 (4 + 0) |
Note.—This table differs from table 1 in that the phylogenetic fit is to all TEM and SHV sequences using the tree topology in figure 2B rather than that in figure 2A.
Table 3. Experimentally Informed Evolutionary Models Also Provide a Superior Phylogenetic Fit When the Analysis Is Limited Only to TEM Beta-Lactamase Sequences.
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –2,374.3 | 5 (5 + 0) | |
| Experimental, | 23.1 | –2,386.8 | 4 (4 + 0) | |
| Experimental, | 81.8 | –2,415.2 | 5 (5 + 0) | |
| Experimental, | 83.6 | –2,417.1 | 4 (4 + 0) | |
| GY94, gamma | 252.2 | –2,492.4 | 13 (4 + 9) | |
| GY94, one | 317.6 | –2,526.1 | 12 (3 + 9) | |
| GY94, gamma | 318.5 | –2,526.5 | 12 (3 + 9) | |
| KOSI07+F, gamma | 326.9 | –2,478.7 | 64 (4 + 60) | |
| KOSI07+F, gamma | 394.9 | –2,513.8 | 63 (3 + 60) | |
| KOSI07+F, one | 412.0 | –2,522.3 | 63 (3 + 60) | |
| Randomized, | 465.8 | –2,607.2 | 5 (5 + 0) | |
| Randomized, | 466.2 | –2,607.4 | 5 (5 + 0) | |
| GY94, one | 483.3 | –2,609.9 | 11 (2 + 9) | |
| KOSI07+F, one | 556.7 | –2,595.7 | 62 (2 + 60) | |
| Avg. frequencies, | 574.6 | –2,601.6 | 65 (5 + 60) | |
| Avg. frequencies, | 577.9 | –2,603.2 | 65 (5 + 60) | |
| Avg. frequencies, | 609.1 | –2,619.8 | 64 (4 + 60) | |
| Avg. frequencies, | 622.7 | –2,626.7 | 64 (4 + 60) | |
| Randomized, | 976.6 | –2,863.6 | 4 (4 + 0) | |
| Randomized, | 1,007.8 | –2,879.2 | 4 (4 + 0) |
Note.—This table differs from table 1 in that the phylogenetic fit is only to the TEM sequences (the portion of the tree shown in red in fig. 2A).
Table 4. Experimentally Informed Evolutionary Models Also Provide a Superior Phylogenetic Fit When the Analysis Is Limited Only to SHV Beta-Lactamase Sequences.
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –1,728.5 | 5 (5 + 0) | |
| Experimental, | 34.9 | –1,746.0 | 5 (5 + 0) | |
| Experimental, | 106.2 | –1,782.7 | 4 (4 + 0) | |
| Experimental, | 116.5 | –1,787.8 | 4 (4 + 0) | |
| KOSI07+F, gamma | 489.0 | –1,914.0 | 64 (4 + 60) | |
| KOSI07+F, one | 499.7 | –1,920.4 | 63 (3 + 60) | |
| GY94, gamma | 505.8 | –1,973.4 | 13 (4 + 9) | |
| GY94, one | 514.0 | –1,978.5 | 12 (3 + 9) | |
| KOSI07+F, gamma | 555.7 | –1,948.4 | 63 (3 + 60) | |
| KOSI07+F, one | 573.6 | –1,958.3 | 62 (2 + 60) | |
| GY94, gamma | 581.5 | –2,012.3 | 12 (3 + 9) | |
| Randomized, | 601.7 | –2,029.4 | 5 (5 + 0) | |
| GY94, one | 602.6 | –2,023.8 | 11 (2 + 9) | |
| Randomized, | 602.7 | –2,029.9 | 5 (5 + 0) | |
| Avg. frequencies, | 711.5 | –2,024.3 | 65 (5 + 60) | |
| Avg. frequencies, | 715.7 | –2,026.4 | 65 (5 + 60) | |
| Avg. frequencies, | 749.8 | –2,044.5 | 64 (4 + 60) | |
| Avg. frequencies, | 758.8 | –2,048.9 | 64 (4 + 60) | |
| Randomized, | 1,047.0 | –2,253.1 | 4 (4 + 0) | |
| Randomized, | 1,071.2 | –2,265.2 | 4 (4 + 0) |
Note.—This table differs from table 1 in that the phylogenetic fit is only to the SHV sequences (the portion of the tree shown in blue in fig. 2A).
Table 5. Experimentally Informed Evolutionary Models Also Provide a Superior Phylogenetic Fit to the TEM Beta-Lactamases When the Tree Topology Is Estimated Using the Model of Kosiol et al. (2007) Rather than That of Goldman and Yang (1994).
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –2,378.5 | 5 (5 + 0) | |
| Experimental, | 25.4 | –2,392.2 | 4 (4 + 0) | |
| Experimental, | 80.6 | –2,418.8 | 5 (5 + 0) | |
| Experimental, | 83.7 | –2,421.4 | 4 (4 + 0) | |
| GY94, gamma | 257.6 | –2,499.3 | 13 (4 + 9) | |
| GY94, one | 317.7 | –2,530.4 | 12 (3 + 9) | |
| GY94, gamma | 324.1 | –2,533.6 | 12 (3 + 9) | |
| KOSI07+F, gamma | 325.7 | –2,482.4 | 64 (4 + 60) | |
| KOSI07+F, gamma | 393.5 | –2,517.3 | 63 (3 + 60) | |
| KOSI07+F, one | 402.4 | –2,521.7 | 63 (3 + 60) | |
| Randomized, | 472.0 | –2,614.5 | 5 (5 + 0) | |
| Randomized, | 472.4 | –2,614.7 | 5 (5 + 0) | |
| GY94, one | 488.0 | –2,616.5 | 11 (2 + 9) | |
| KOSI07+F, one | 550.4 | –2,596.7 | 62 (2 + 60) | |
| Avg. frequencies, | 581.6 | –2,609.3 | 65 (5 + 60) | |
| Avg. frequencies, | 584.2 | –2,610.6 | 65 (5 + 60) | |
| Avg. frequencies, | 617.3 | –2,628.2 | 64 (4 + 60) | |
| Avg. frequencies, | 629.2 | –2,634.1 | 64 (4 + 60) | |
| Randomized, | 980.7 | –2,869.9 | 4 (4 + 0) | |
| Randomized, | 1,014.6 | –2,886.8 | 4 (4 + 0) |
Note.—This table differs from table 3 in that the phylogenetic fit is to the TEM sequences using the red portion of the tree topology in figure 2B rather than the red portion of the tree topology in figure 2A.
Table 6. Experimentally Informed Evolutionary Models Also Provide a Superior Phylogenetic Fit to the SHV Beta-Lactamases When the Tree Topology Is Estimated Using the Model of Kosiol et al. (2007) Rather than That of Goldman and Yang (1994).
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –1,725.4 | 5 (5 + 0) | |
| Experimental, | 34.1 | –1,742.4 | 5 (5 + 0) | |
| Experimental, | 104.4 | –1,778.5 | 4 (4 + 0) | |
| Experimental, | 114.6 | –1,783.7 | 4 (4 + 0) | |
| KOSI07+F, gamma | 486.6 | –1,909.6 | 64 (4 + 60) | |
| KOSI07+F, one | 491.7 | –1,913.2 | 63 (3 + 60) | |
| GY94, gamma | 497.3 | –1,966.0 | 13 (4 + 9) | |
| GY94, one | 501.5 | –1,969.1 | 12 (3 + 9) | |
| KOSI07+F, gamma | 547.6 | –1,941.2 | 63 (3 + 60) | |
| KOSI07+F, one | 562.5 | –1,949.6 | 62 (2 + 60) | |
| GY94, gamma | 568.0 | –2,002.4 | 12 (3 + 9) | |
| GY94, one | 586.2 | –2,012.5 | 11 (2 + 9) | |
| Randomized, | 597.7 | –2,024.2 | 5 (5 + 0) | |
| Randomized, | 598.7 | –2,024.7 | 5 (5 + 0) | |
| Avg. frequencies, | 706.4 | –2,018.6 | 65 (5 + 60) | |
| Avg. frequencies, | 710.9 | –2,020.8 | 65 (5 + 60) | |
| Avg. frequencies, | 745.9 | –2,039.3 | 64 (4 + 60) | |
| Avg. frequencies, | 755.3 | –2,044.0 | 64 (4 + 60) | |
| Randomized, | 1,040.9 | –2,246.8 | 4 (4 + 0) | |
| Randomized, | 1,063.9 | –2,258.3 | 4 (4 + 0) |
Note.—This table differs from table 4 in that the phylogenetic fit is to the SHV sequences using the blue portion of the tree topology in figure 2B rather than the blue portion of the tree topology in figure 2A.
Fig. 3. Comparison of likelihoods on a per-site basis between the best experimentally informed site-specific evolutionary model and the best conventional nonsite-specific model. The experimentally informed models are slightly better (positive difference in log likelihood) for most sites, but far worse for a handful of sites. (A, B) The best experimentally informed lactamase model in table 1 versus the best GY94 variant in table 1. The experimentally informed model has a higher log likelihood for 72% of lactamase sites. (C, D) The best experimentally informed nucleoprotein model in table 7 versus the best GY94 variant in table 7. The experimentally informed model has a higher log likelihood for 82% of sites. For both genes, the per-site likelihoods were computed after fixing the model parameters and branch lengths to their maximum-likelihood values for the entire gene. Sites are classified in terms of their relative solvent accessibility or secondary structure as computed using DSSP (Kabsch and Sander 1983; Joosten et al. 2011) from PDB structures 1XPB (Fonzé et al. 1995) or 2IQH (Ye et al. 2006), normalizing solvent accessibilities to the values provided by Tien et al. (2013). The per-residue numerical data are in supplementary files S3 and S4, Supplementary Material online. The code and data used to create this figure are provided through http://jbloom.github.io/phyloExpCM/example_2014Analysis_lactamase.html and http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_NP_Human_1918_Descended_withbeta.html (last accessed July 28, 2014).
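The per-site comparison described in this caption reduces to subtracting two vectors of per-site log likelihoods and counting how many differences are positive. The sketch below illustrates that bookkeeping with hypothetical toy values; it does not use the per-residue data in supplementary files S3 and S4, and the function name is an assumption made for illustration.

```python
def per_site_summary(logl_experimental, logl_conventional):
    """Compare two models site by site.

    Returns the per-site log-likelihood differences (experimental minus
    conventional), the fraction of sites where the experimental model is
    better, and the summed difference over all sites.
    """
    if len(logl_experimental) != len(logl_conventional):
        raise ValueError("both models must report the same number of sites")
    diffs = [e - c for e, c in zip(logl_experimental, logl_conventional)]
    frac_better = sum(d > 0 for d in diffs) / len(diffs)
    return diffs, frac_better, sum(diffs)


# Hypothetical per-site log likelihoods for five sites (toy numbers only).
experimental = [-1.0, -2.0, -0.5, -3.0, -6.0]
conventional = [-2.0, -3.0, -1.5, -4.0, -4.5]

diffs, frac_better, total = per_site_summary(experimental, conventional)
print(diffs)        # positive where the experimental model fits better
print(frac_better)  # fraction of sites favoring the experimental model
print(total)        # overall log-likelihood difference
```

In this toy case four of five sites favor the experimental model by a small margin and one site disfavors it by a larger margin, mirroring the qualitative pattern the caption describes.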
Table 7. Fitting of the Stringency Parameter β Also Improves the Phylogenetic Fit of an Experimentally Informed Evolutionary Model for Influenza Nucleoprotein.
| Model | ΔAIC | Log Likelihood | Parameters (optimized + empirical) | Optimized Parameters |
|---|---|---|---|---|
| Experimental, | 0.0 | –12,144.2 | 1 (1 + 0) | |
| Experimental, | 391.8 | –12,341.1 | 0 (0 + 0) | None |
| GY94, gamma | 1,453.0 | –12,858.7 | 13 (4 + 9) | |
| GY94, gamma | 1,616.0 | –12,941.2 | 12 (3 + 9) | |
| KOSI07+F, gamma | 1,845.6 | –13,004.0 | 64 (4 + 60) | |
| GY94, one | 1,884.1 | –13,075.3 | 12 (3 + 9) | |
| GY94, one | 2,153.2 | –13,210.8 | 11 (2 + 9) | |
| KOSI07+F, gamma | 2,153.6 | –13,159.0 | 63 (3 + 60) | |
| KOSI07+F, one | 2,227.3 | –13,195.9 | 63 (3 + 60) | |
| KOSI07+F, one | 2,650.1 | –13,408.3 | 62 (2 + 60) | |
| Avg. frequencies, | 3,736.2 | –13,952.3 | 61 (1 + 60) | |
| Avg. frequencies, | 3,742.2 | –13,956.3 | 60 (0 + 60) | None |
Note.—The data in this table were generated by exactly repeating the analysis in the sixth table of Bloom (2014) except including the additional models listed here, which include a model with a stringency parameter. With the exception of the stringency parameter, the experimentally informed evolutionary model used here is derived entirely from experimental measurements, as both the mutation rates and amino acid preferences were measured in Bloom (2014). Only the model using fixation probabilities calculated from equation (3) is reported, as Bloom (2014) shows that this is the best model for influenza nucleoprotein. The data and source code used to generate this table are available through http://jbloom.github.io/phyloExpCM/example_2014Analysis_Influenza_NP_Human_1918_Descended_withbeta.html (last accessed July 28, 2014).
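To illustrate how a stringency parameter can sharpen or flatten site-specific preferences, the sketch below assumes that β acts as an exponent on each preference, followed by renormalization so the preferences at a site sum to one. This functional form is an assumption for illustration, since the model equations are not reproduced in this record, and the three-letter toy alphabet is hypothetical.

```python
def rescale_preferences(prefs, beta):
    """Raise each preference to the power beta and renormalize to sum to 1.

    Assumed form for illustration: beta > 1 exaggerates differences among
    amino acids at a site, beta < 1 flattens them, and beta = 1 leaves the
    preferences unchanged.
    """
    if any(p <= 0 for p in prefs):
        raise ValueError("preferences must be positive")
    powered = [p ** beta for p in prefs]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical preferences at one site over a toy three-letter alphabet.
site_prefs = [0.7, 0.2, 0.1]

print(rescale_preferences(site_prefs, 2.0))  # sharper than the input
print(rescale_preferences(site_prefs, 0.5))  # flatter than the input
print(rescale_preferences(site_prefs, 1.0))  # unchanged up to rounding
```

Under this assumed form, a fitted β above one means the phylogeny behaves as if selection were more stringent than in the experiment, while a fitted β below one moves the preferences toward uniform.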