Roberto Del Amparo, Miguel Arenas.
Abstract
The selection of the best-fitting substitution model of molecular evolution is a traditional step for phylogenetic inferences, including ancestral sequence reconstruction (ASR). However, a few recent studies suggested that applying this procedure does not affect the accuracy of phylogenetic tree reconstruction. Here, we revisited this debated topic by analyzing the influence of selection among substitution models of protein evolution, with a focus on exchangeability matrices, on the accuracy of ASR using simulated and real data. We found that the selected best-fitting substitution model produces the most accurate ancestral sequences, especially if the data present large genetic diversity. Indeed, ancestral sequences reconstructed under substitution models with similar exchangeability matrices were similar, suggesting that if the selected best-fitting model cannot be used for the reconstruction, applying a model similar to the selected one is preferred. We conclude that selecting among substitution models of protein evolution is recommended for reconstructing accurate ancestral sequences.
Keywords: ancestral sequence reconstruction; molecular evolution; phylogenetics; protein evolution; substitution model selection; substitution models of protein evolution
Year: 2022 PMID: 35789388 PMCID: PMC9254009 DOI: 10.1093/molbev/msac144
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
FIG. 1. Influence of substitution model selection on ancestral sequence reconstruction using simulated data. Distances between the true ancestral sequences and the ancestral sequences reconstructed under the true substitution model (black bars) and under other substitution models (gray bars; from left to right, a model that is similar to, intermediate from, and far from the true model). The distances are shown as percentages. The study is based on 1,000 simulated data sets of 50 protein sequences with sequence identity 0.2 (large genetic diversity; plots on the left), 0.5 (intermediate genetic diversity; middle plots), and 0.8 (low genetic diversity; plots on the right). Error bars indicate 95% confidence intervals. The same results, with the ASR error (y-axis) starting from zero, are presented in supplementary figure S3, Supplementary Material online.
FIG. 2. Influence of substitution model selection on ancestral sequence reconstruction of the TRXB protein family. The figure shows the distance between ancestral sequences reconstructed under the best-fitting substitution model (LG + I + G) and under other substitution models (MtMam + I + G, HIVb + I + G, JTT + I + G, and Blosum62 + I + G; shown with different colors) at every internal node, as a function of the time to the root. The distances are shown as percentages. All nodes shown in the figure are internal nodes; tip nodes are excluded because their sequences are given and therefore not reconstructed.
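The distances in both figures are percentages of sites that differ between two ancestral sequences. The record above does not state the exact distance measure, so the following is only a minimal sketch. It assumes the distance is the simple proportion of mismatched sites between two aligned, equal-length sequences, and it assumes a normal approximation for the 95% confidence interval across replicates. The function names and toy sequences are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def percent_distance(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned sites at which two equal-length sequences differ.

    Assumption: a plain mismatch proportion, not necessarily the measure
    used in the study.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return 100.0 * mismatches / len(seq_a)


def mean_with_ci95(values):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate an interval")
    half_width = 1.96 * stdev(values) / sqrt(len(values))
    return mean(values), half_width


# Hypothetical toy data: one "true" ancestor and three reconstructions.
true_ancestor = "MKTAYIAKQR"
reconstructions = ["MKTAYIAKQR", "MKTAYLAKQR", "MRTAYLAKQK"]

distances = [percent_distance(true_ancestor, r) for r in reconstructions]
average, half_width = mean_with_ci95(distances)

print(distances)  # [0.0, 10.0, 30.0]
print(f"mean distance = {average:.2f}% +/- {half_width:.2f}")
```

With 1,000 replicates per condition, as in FIG. 1, the normal approximation is usually reasonable; for a handful of replicates a t-based or bootstrap interval would be a safer choice.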
Empirical Protein Families.
| Protein Family | PFAM Code | Number of Sequences | Sequence Length | Sequence Identity | Best-fitting Substitution Model |
|---|---|---|---|---|---|
| | PF07478 | 42 | 399 | 0.40 | LG + I + G |
| Thioredoxins I (TRXB) | PF00070 | 28 | 375 | 0.46 | LG + I + G |
Note.—For each data set, the table includes name of the protein family, PFAM code, number of sequences, sequence length (number of amino acids), sequence identity (ranging from 0 to 1), and the best-fitting substitution model selected with ProtTest3.
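The sequence identity column is a value between 0 and 1 per data set. The record does not say how it was computed, so the sketch below is an illustration and not the authors' procedure. It assumes identity is the mean, over all sequence pairs in an alignment, of the fraction of identical residues at columns where neither sequence has a gap. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations
from statistics import mean

GAP = "-"  # assumed gap character


def pairwise_identity(seq_a: str, seq_b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != GAP and b != GAP]
    if not compared:
        raise ValueError("no ungapped columns shared by the two sequences")
    return sum(a == b for a, b in compared) / len(compared)


def mean_pairwise_identity(alignment) -> float:
    """Average pairwise identity over all unordered pairs of aligned sequences."""
    if len(alignment) < 2:
        raise ValueError("need at least two sequences")
    return mean(pairwise_identity(a, b) for a, b in combinations(alignment, 2))


# Hypothetical toy alignment of three sequences, four columns each.
toy_alignment = ["MKTA", "MKTA", "MRTA"]
print(round(mean_pairwise_identity(toy_alignment), 3))  # 0.833
```

Other conventions exist, for example dividing by the full alignment length or by the shorter sequence length, and they can give noticeably different values for gappy alignments.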