Huai-Chun Wang, Karen Li, Edward Susko, Andrew J Roger.
Abstract
BACKGROUND: Widely used substitution models for proteins, such as the Jones-Taylor-Thornton (JTT) and Whelan and Goldman (WAG) models, are based on empirical amino acid interchange matrices estimated from databases of protein alignments, and they can incorporate the average amino acid frequencies of the data set under examination (e.g., JTT + F). Variation in the evolutionary process between sites is typically modelled by a rates-across-sites distribution such as the gamma (Γ) distribution. However, sites in proteins also vary in the kinds of amino acid interchanges that are favoured, a feature that is ignored by standard empirical substitution matrices. Here we examine the degree to which the pattern of evolution at sites differs from that expected based on empirical amino acid substitution models and evaluate the impact of these deviations on phylogenetic estimation.
Year: 2008 PMID: 19087270 PMCID: PMC2628903 DOI: 10.1186/1471-2148-8-331
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Statistical analyses of site-specific amino acid uniformity and state frequencies in 21 protein data sets.
| Protein family | Taxa | Sites | Z-test (uniformity): Rate 1 | Z-test (uniformity): Rate 2 | Z-test (uniformity): Rate 3 | Z-test (uniformity): Rate 4 | χ2 test (states) |
|---|---|---|---|---|---|---|---|
| Carboxyl_trans | 36 | 212 | 0.97 | ** | 0.05 | * | ** |
| CTP-synthetase | 65 | 212 | ** | ** | ** | ** | * |
| DNA topo IV | 49 | 228 | 0.21 | ** | ** | ** | * |
| Filament | 36 | 210 | 0.35 | 0.09 | 0.92 | 0.45 | 0.66 |
| Glu_synth_NTN | 40 | 253 | ** | ** | ** | ** | 0.01 |
| HSP70 | 34 | 432 | 0.31 | ** | * | ** | ** |
| ILVD_EDD | 51 | 310 | 0.20 | * | ** | ** | ** |
| MCM | 40 | 220 | 0.66 | * | * | 0.11 | ** |
| MreB | 32 | 275 | 0.50 | 0.10 | ** | * | 0.03 |
| Poty_coat | 34 | 212 | 0.19 | ** | ** | ** | ** |
| SecA | 70 | 203 | ** | ** | ** | ** | ** |
| Usher | 36 | 317 | * | ** | ** | ** | 0.08 |
| HSP90 | 54 | 459 | ** | ** | ** | ** | ** |
| NuoF | 41 | 405 | ** | ** | ** | ** | ** |
| Cpn60 | 41 | 466 | 0.18 | 0.04 | ** | ** | ** |
| MPP | 43 | 203 | 0.04 | 0.24 | ** | 0.03 | 0.32 |
| α-tubulin | 54 | 375 | ** | * | ** | ** | * |
| β-tubulin | 46 | 382 | ** | ** | * | ** | 0.02 |
| Actin | 48 | 363 | ** | ** | ** | * | * |
| EF-1α | 38 | 361 | 0.29 | ** | ** | ** | ** |
| EF-2 | 37 | 669 | ** | ** | ** | ** | ** |
P-values: ** < 0.001; * < 0.01. The protein family abbreviations are: Carboxyl_trans, acetyl-CoA carboxylase; Cpn60, 60-kDa chaperonin; DNA topo IV, DNA topoisomerase IV subunit A (GyrA); EF-1α, elongation factor 1α; EF-2, elongation factor 2; Filament, intermediate filament protein; Glu_synth_NTN, glutamate synthase aminotransferase; HSP70, 70-kDa heat shock protein; HSP90, 90-kDa heat shock protein; ILVD_EDD, dehydratase family proteins; NuoF, NADH dehydrogenase I chain F; MCM, minichromosome maintenance protein; MPP, mitochondrial processing peptidase sequences; MreB, a bacterial homolog of eukaryotic actin; Poty_coat, potyvirus coat protein; Usher, fimbrial usher protein.
Figure 1. Numbers of sites with a given number of states in simulated versus real HSP90 data. The original HSP90 data have 54 taxa and 459 sites. The simulated data have the same number of taxa and 100,000 sites. In the latter case the proportions of sites with each number of states were calculated and then multiplied by 459 to make the numbers directly comparable to the HSP90 data set.
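The rescaling described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `count_states` and `rescale_to_sites`, the toy alignment, and the choice to ignore gap and ambiguity characters are all hypothetical.

```python
from collections import Counter

# Assumption: these characters are treated as missing data, not as states.
MISSING = set("-?X")


def count_states(alignment):
    """Return, for each alignment column, the number of distinct amino acids."""
    n_sites = len(alignment[0])
    counts = []
    for j in range(n_sites):
        column = {seq[j] for seq in alignment} - MISSING
        counts.append(len(column))
    return counts


def rescale_to_sites(state_counts, n_target_sites):
    """Turn per-site state counts into expected numbers of sites per state
    count for an alignment with n_target_sites columns."""
    tally = Counter(state_counts)
    total = len(state_counts)
    return {k: n_target_sites * v / total for k, v in sorted(tally.items())}


# Toy "simulated" alignment: 3 taxa, 4 sites (hypothetical data).
simulated = ["ACDE", "ACDF", "AGDG"]
states = count_states(simulated)
# Proportions from the simulated data, scaled to the 459 sites of HSP90.
expected = rescale_to_sites(states, 459)
```

With a real simulated alignment of 100,000 sites, `expected` would hold the simulated counts on the same scale as the 459-site HSP90 alignment, so the two distributions can be compared bar by bar.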
Figure 2. Performance of ML tree reconstructions evaluated using simulations. The performance of ML tree reconstruction with the JTT + F + Γ model for data simulated under (A) the JTT + F + Γ model and (B) a site-specific frequency model (JTT + ssF + Γ). The site-specific frequency data were derived from the HSP90 data set. The three heatmaps in (A) and (B) represent, respectively, the proportions of "Correct tree" (i.e., taxa 1 and 2 together), "Long branch attracts tree" (i.e., taxa 1 and 3 together) and "Other tree" (i.e., taxa 1 and 4 incorrectly placed together) with regard to branch-lengths a and b. The four-taxon tree shown in (A) is the true tree (taxa 1 and 2 together, and taxa 3 and 4 together) used for simulating the data. Each box of the heatmaps represents 100 simulations for the given conditions.
Figure 3. Principal components analysis of the amino acid frequency matrix from 21 protein data sets. Each site is indicated by an open circle. The classes and the regression lines were determined as shown in the main text.
Figure 4. Average amino acid frequencies in the four site-specific classes derived from the PCA shown in Figure 3. The bottom frequency profile shows the overall frequencies of amino acids observed at all sites in the 21 amino acid alignments.
Fitting the class frequency mixture model (JTT + cF + Γ) to 25 protein data sets.
| Protein | Taxa | Sites | w(ΠF) | w(Π1) | w(Π2) | w(Π3) | w(Π4) | ΛlnL |
|---|---|---|---|---|---|---|---|---|
| Carboxyl_trans | 36 | 212 | 0.74 | 0.11 | 0.06 | 0.00 | 0.10 | 67.16 |
| CTP-synthetase | 65 | 212 | 0.28 | 0.29 | 0.13 | 0.04 | 0.24 | 225.24 |
| DNA topo IV | 49 | 228 | 0.58 | 0.15 | 0.05 | 0.02 | 0.21 | 162.77 |
| Filament | 36 | 210 | 0.81 | 0.10 | 0.00 | 0.05 | 0.05 | 39.58 |
| Glu_synth_NTN | 40 | 253 | 0.66 | 0.13 | 0.04 | 0.01 | 0.17 | 76.31 |
| HSP70 | 34 | 432 | 0.65 | 0.17 | 0.02 | 0.0002 | 0.16 | 136.71 |
| ILVD_EDD | 51 | 310 | 0.65 | 0.14 | 0.06 | 0.01 | 0.14 | 181.56 |
| MCM | 40 | 220 | 0.65 | 0.18 | 0.03 | 0.00 | 0.14 | 74.38 |
| MreB | 32 | 275 | 0.52 | 0.20 | 0.07 | 0.00 | 0.22 | 141.87 |
| Poty_coat | 34 | 212 | 0.60 | 0.17 | 0.04 | 0.02 | 0.18 | 125.57 |
| SecA | 70 | 203 | 0.40 | 0.24 | 0.09 | 0.08 | 0.19 | 217.82 |
| Usher | 36 | 317 | 0.78 | 0.10 | 0.02 | 0.004 | 0.10 | 76.11 |
| HSP90 | 54 | 459 | 0.37 | 0.19 | 0.05 | 0.09 | 0.30 | 279.92 |
| NuoF | 41 | 405 | 0.37 | 0.20 | 0.11 | 0.04 | 0.27 | 186.40 |
| Cpn60 | 41 | 466 | 0.52 | 0.19 | 0.04 | 0.03 | 0.22 | 257.04 |
| MPP | 43 | 203 | 0.73 | 0.13 | 0.03 | 0.00 | 0.11 | 74.82 |
| α-tubulin | 54 | 375 | 0.46 | 0.16 | 0.04 | 0.01 | 0.33 | 90.05 |
| β-tubulin | 46 | 382 | 0.59 | 0.15 | 0.03 | 0.02 | 0.21 | 69.84 |
| Actin | 48 | 363 | 0.58 | 0.12 | 0.03 | 0.02 | 0.25 | 41.50 |
| EF-1α | 38 | 361 | 0.60 | 0.15 | 0.05 | 0.00 | 0.21 | 104.78 |
| EF-2 | 37 | 669 | 0.52 | 0.16 | 0.06 | 0.03 | 0.22 | 273.30 |
| enolase | 60 | 305 | 0.63 | 0.13 | 0.06 | 0.00 | 0.19 | 24.08 |
| myoglobin | 80 | 153 | 0.59 | 0.14 | 0.06 | 0.03 | 0.17 | 35.73 |
| lipoprotein | 23 | 762 | 0.77 | 0.10 | 0.02 | 0.01 | 0.10 | 70.70 |
| lysozyme | 36 | 127 | 0.61 | 0.12 | 0.03 | 0.02 | 0.23 | 18.23 |
ΛlnL is the log-likelihood difference between the cF mixture model and the single frequency model (JTT + F + Γ). The p-values associated with these differences, calculated from χ2 tests with 4 degrees of freedom, are highly significant in all cases (p < 0.01). The actual p-values would be even smaller because the tests are conservative (see the main text for a discussion).
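The significance check in this note can be sketched with the standard library alone, because the upper tail of a χ2 distribution with 4 degrees of freedom has the closed form exp(-x/2)(1 + x/2). The sketch assumes the usual likelihood ratio convention, in which the test statistic is twice the log-likelihood difference; the note itself does not state this, and the function names are hypothetical.

```python
import math


def chi2_sf_4df(x):
    """Upper-tail probability of a chi-square variable with 4 degrees of
    freedom, using the closed form exp(-x/2) * (1 + x/2)."""
    if x <= 0:
        return 1.0
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


def lrt_p_value(delta_lnl):
    """p-value for a log-likelihood gain.

    Assumption: the statistic is 2 * delta_lnl, as in a standard
    likelihood ratio test.
    """
    return chi2_sf_4df(2.0 * delta_lnl)


# Smallest and largest gains listed in the table (lysozyme and HSP90).
p_lysozyme = lrt_p_value(18.23)
p_hsp90 = lrt_p_value(279.92)
```

Even the smallest gain in the table falls well below the 0.01 threshold under this convention, which agrees with the statement that all cases are significant.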
Figure 5. The performance of ML tree reconstruction with the JTT + F + Γ model and the JTT + cF + Γ model. The data were simulated under the site-specific frequency model (JTT + ssF + Γ) based on amino acid frequencies observed at each site of the HSP90 alignment. The ranges of branch-lengths a and b are 0.05–1.45 and 0.5–2.95, respectively, with an increment of 0.05. The left and right heatmaps represent, respectively, the proportions of correctly estimated trees under the JTT + F + Γ and JTT + cF + Γ models. Each box of the heatmaps represents 100 simulations for the given conditions.
Analysis of a large phylogenomic data set [29], consisting of 133 proteins from 40 taxa (24,294 sites), for two competing trees under the single frequency model and the cF mixture model.
| Tree | Single frequency model (JTT + F + Γ) | Class-frequency mixture model (JTT + cF + Γ) |
|---|---|---|
| | -745,292.15* | -738,445.15 |
| | -745,366.62 | -738,371.59* |
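The comparison in this table reduces to a few subtractions, sketched below. In each model column the starred value is the higher log-likelihood, so the two models favour different trees. The row keys `row1` and `row2` and the helper names are hypothetical; the tree topologies themselves are not given here.

```python
# Log-likelihoods from the table, in row order.
# "F" is JTT + F + Γ, "cF" is JTT + cF + Γ.
lnl = {
    "row1": {"F": -745292.15, "cF": -738445.15},
    "row2": {"F": -745366.62, "cF": -738371.59},
}


def preferred_row(model):
    """Return the row whose tree has the higher log-likelihood under model."""
    return max(lnl, key=lambda row: lnl[row][model])


def gain_from_mixture(row):
    """Log-likelihood improvement of the cF mixture over the single
    frequency model for the tree in the given row."""
    return lnl[row]["cF"] - lnl[row]["F"]


best_f = preferred_row("F")
best_cf = preferred_row("cF")
# Margin by which each model prefers its best tree over the other tree.
gap_f = lnl["row1"]["F"] - lnl["row2"]["F"]
gap_cf = lnl["row2"]["cF"] - lnl["row1"]["cF"]
```

The margins between trees (roughly 74 log-likelihood units under either model) are small next to the gain from switching models (several thousand units for both trees), which `gain_from_mixture` makes explicit.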