Consequences of Substitution Model Selection on Protein Ancestral Sequence Reconstruction.

Roberto Del Amparo1,2, Miguel Arenas1,2,3.   

Abstract

The selection of the best-fitting substitution model of molecular evolution is a traditional step for phylogenetic inferences, including ancestral sequence reconstruction (ASR). However, a few recent studies suggested that applying this procedure does not affect the accuracy of phylogenetic tree reconstruction. Here, we revisited this debate topic by analyzing the influence of selection among substitution models of protein evolution, with focus on exchangeability matrices, on the accuracy of ASR using simulated and real data. We found that the selected best-fitting substitution model produces the most accurate ancestral sequences, especially if the data present large genetic diversity. Indeed, ancestral sequences reconstructed under substitution models with similar exchangeability matrices were similar, suggesting that if the selected best-fitting model cannot be used for the reconstruction, applying a model similar to the selected one is preferred. We conclude that selecting among substitution models of protein evolution is recommended for reconstructing accurate ancestral sequences.
© The Author(s) 2022. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.

Keywords:  ancestral sequence reconstruction; molecular evolution; phylogenetics; protein evolution; substitution model selection; substitution models of protein evolution

Year:  2022        PMID: 35789388      PMCID: PMC9254009          DOI: 10.1093/molbev/msac144

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   8.800


Introduction

Ancestral sequence reconstruction (ASR) constitutes a powerful framework in evolutionary biology with a variety of applications (Liberles 2007; Selberg et al. 2021). For example, it has been used to develop vaccines based on centralized (ancestral) sequences (Kothe et al. 2006; Arenas and Posada 2010a) and to understand the stability and functional properties of diverse paleoenzymes such as thioredoxins (Perez-Jimenez et al. 2011), beta-lactamases (Risso et al. 2013), RuBisCO (Shih et al. 2016), or alcohol dehydrogenases (Thomson et al. 2005), among others (Merkl and Sterner 2016). These molecular reconstructions are not only of interest to evolutionary researchers; they can also be useful in industrial processes (Thomson et al. 2005) because of the biological and physicochemical properties (e.g., high thermostability) of the resurrected enzymes (Trudeau et al. 2016). As for other phylogenetic analyses, probabilistic ASR methods (e.g., maximum likelihood) require the specification of a substitution model of molecular evolution (Yang 2006). At the protein level, the substitution model includes the rates of change among the 20 amino acids (exchangeability matrix) and the equilibrium amino acid frequencies (Arenas 2015). Traditionally, the reconstruction of ancestral protein sequences is based on empirical substitution models of evolution (Thorne 2000; Arenas 2015). Despite the severe limitations of these substitution models (e.g., all sites evolve under the same rates of change among amino acids, which is highly unrealistic; Echave et al. 2016), their mathematical simplicity (e.g., assuming site-independent evolution simplifies the likelihood function; Yang 2006) favored their establishment in phylogenetics. Empirical substitution models of evolution have been developed for diverse taxonomic, species, and protein groups (e.g., nuclear and mitochondrial proteins; Thorne 2000; Arenas 2015). 
Thus, nearly 100 empirical substitution models of protein evolution are currently available; many have been developed recently (Ng et al. 2000; Le et al. 2017; Le and Vinh 2020; Del Amparo and Arenas 2022) and still require efforts for their implementation in analytical phylogenetic frameworks (e.g., to perform substitution model selection and phylogenetic tree reconstruction, among others). For some data, the selection of an empirical substitution model is straightforward (e.g., when there is a substitution model biologically related to the study protein data); in other cases, this selection is unclear and traditionally requires the selection of a best-fitting substitution model (among the currently available substitution models) using likelihood-based methods (Yang et al. 1994; Zhang and Nei 1997; Zhang 1999; Minin et al. 2003; Lemmon et al. 2004). However, a few recent studies found that the selection of the best-fitting substitution model of protein evolution may not be mandatory for phylogenetic tree reconstruction (Abadi et al. 2019; Spielman and Shapiro 2020; Tao et al. 2020), although other studies suggested the opposite (Posada 2001; Minin et al. 2003; Dornburg et al. 2019). This debate raises the question of whether the selection among substitution models of protein evolution affects the accuracy of protein ASR, a relevant issue already mentioned in Pupko et al. (2007). For this application, a few studies suggested that substitution model selection does not affect ASR (Williams et al. 2006; Abadi et al. 2019), but they focused on ASR under substitution models of DNA evolution (Abadi et al. 2019) or ignored the influence of varying the amino acid exchangeability matrix (Williams et al. 2006). Here, we revisited this topic by evaluating the influence of selection among empirical substitution models of protein evolution, with focus on the exchangeability matrix, on the accuracy of ASR using simulated and real data. 
Overall, we found that applying the selected best-fitting substitution model improves the accuracy of the reconstructed ancestral sequences, especially for data presenting large genetic diversity.
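
As an illustration of the two components of a protein substitution model described above (the exchangeability matrix and the equilibrium frequencies), the following minimal sketch combines them into an instantaneous rate matrix using the standard time-reversible construction Q[i][j] = r[i][j] * pi[j] for i != j. It is not taken from the study: the function name `build_rate_matrix`, the three-state toy alphabet, and all numeric values are assumptions chosen for readability. A real model such as JTT or LG would use 20 states and published parameter values.

```python
def build_rate_matrix(exchangeabilities, frequencies):
    """Combine symmetric exchangeabilities r[i][j] and equilibrium
    frequencies pi[j] into a rate matrix Q with Q[i][j] = r[i][j] * pi[j]
    for i != j. Diagonal entries make each row sum to zero, and Q is
    scaled to one expected substitution per unit of time."""
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is still 0.0 here, so the row sum covers off-diagonals only.
        q[i][i] = -sum(q[i])
    # Expected substitution rate at equilibrium: -sum_i pi_i * Q_ii.
    mean_rate = -sum(frequencies[i] * q[i][i] for i in range(n))
    return [[value / mean_rate for value in row] for row in q]


# Toy three-state example (illustrative values, not a published model).
toy_r = [
    [0.0, 2.0, 1.0],
    [2.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]
toy_pi = [0.5, 0.3, 0.2]
toy_q = build_rate_matrix(toy_r, toy_pi)
```

In this construction, two models that share equilibrium frequencies but differ in their exchangeabilities yield different rate matrices, which is the component of the model that the study focuses on.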

Results

Evaluation of Substitution Model Selection for ASR Based on Simulated Protein Data

We evaluated the error of reconstructing ancestral sequences under diverse substitution models using computer simulations. We found that ancestral sequences inferred under the true (simulated) substitution model are more similar to the true ancestral sequences than ancestral sequences inferred under any other substitution model (fig. 1 and supplementary fig. S2, Supplementary Material online). In addition, we found that ancestral sequences reconstructed under a substitution model with an exchangeability matrix similar to that of the true substitution model are more accurate (in terms of similarity with the true ancestral sequences) than ancestral sequences reconstructed under a substitution model with an exchangeability matrix far from that of the true substitution model (fig. 1 and supplementary fig. S2, Supplementary Material online). Interestingly, we also found that the sequence identity of the data affects the influence of the substitution model selection on the reconstructed ancestral sequences (fig. 1 and supplementary fig. S2, Supplementary Material online). In particular, the ASR from data with low sequence identity (large genetic diversity) was more sensitive to the selection of the substitution model than the ASR from data with high sequence identity. Finally, we found that increasing the number of sequences in the data (while maintaining genetic diversity) produced qualitatively similar ASR error (compare fig. 1 and supplementary fig. S2, Supplementary Material online).
FIG. 1.

Influence of substitution model selection on ancestral sequence reconstruction using simulated data. Distances between true ancestral sequences and ancestral sequences reconstructed under true (black bars) and other substitution models (gray bars; including from the left to the right a model that is similar, intermediate, and far from the true model). The distances are shown in percentage. The study is based on 1,000 simulated data sets of 50 protein sequences with sequence identity 0.2 (large genetic diversity; plots on the left), 0.5 (intermediate genetic diversity; middle plots), and 0.8 (low genetic diversity; plots on the right). Error bars indicate 95% confidence intervals. The same results showing ASR error (y-axis) from zero are presented in supplementary figure S3, Supplementary Material online.

Evaluation of Substitution Model Selection for ASR Based on Real Protein Families

The studied real protein families showed that ancestral sequences reconstructed under different substitution models differ (fig. 2 and supplementary fig. S5, Supplementary Material online). In agreement with the results from simulated data, the distance between ancestral sequences reconstructed under different substitution models increases with the distance between the exchangeability matrices of the corresponding models, and this can be observed at every internal node (fig. 2 and supplementary fig. S5, Supplementary Material online). Interestingly, the distance between ancestral sequences reconstructed under the best-fitting substitution model and ancestral sequences reconstructed under any other substitution model increased overall going backwards in time, with a maximum divergence near the center of the tree (Deng et al. 2010), although with some fluctuations over time (fig. 2 and supplementary fig. S5, Supplementary Material online). This finding indicates that the ASR error caused by applying a substitution model that fits the data poorly can affect the reconstructed sequences at all the internal nodes, and especially sequences belonging to internal nodes that are at a greater distance from the tip nodes.
FIG. 2.

Influence of substitution model selection on ancestral sequence reconstruction of the TRXB protein family. The figure shows the distance between ancestral sequences reconstructed under the best-fitting substitution model (LG + I + G) and other substitution models (MtMam + I + G, HIVb + I + G, JTT + I + G, and Blosum62 + I + G; shown with different colors) at every internal node and as a function of the time to root. The distances are shown in percentage. Note that all the nodes shown in the figure are internal nodes, the tip nodes are excluded because their sequences are given (thus, they are not reconstructed).

Concerning the number of CTL epitopes detected in the inferred ancestral sequences of the HIV-1 env data, we found that it varies depending on the substitution model applied in the ASR (supplementary tables S1 and S2, Supplementary Material online). Interestingly and in line with our previous findings, ancestral sequences reconstructed under substitution models with similar exchangeability matrices displayed a similar number of predicted epitopes (supplementary tables S1 and S2, Supplementary Material online).

Discussion

Selecting a best-fitting (in terms of likelihood) substitution model of evolution, among the available set of substitution models, and applying this model for probabilistic phylogenetic reconstruction is a well-established methodology. It is based on natural reasoning about the phenotypic consequences of amino acid substitution events (e.g., >50 years ago Zuckerkandl and Pauling (1965) indicated that “it is the type rather than number of amino acid substitutions that is decisive”) and has been supported by multiple likelihood-based phylogenetic studies for >20 years (Yang et al. 1994; Zhang and Nei 1997; Zhang 1999; Minin et al. 2003; Lemmon et al. 2004). However, a few recent works suggested that substitution model selection has little effect on phylogenetic tree reconstruction (Abadi et al. 2019; Spielman and Shapiro 2020; Tao et al. 2020), leading to a debate in the field. With regard to ASR, one study suggested that the selection among substitution models of DNA evolution does not influence nucleotide ASR (Abadi et al. 2019), and others investigated the influence of substitution rate variation among sites (Yang 1994; but under the same exchangeability matrix) on protein ASR (Pupko et al. 2002; Williams et al. 2006) or did not quantify the protein ASR error with computer simulations (Moshe and Pupko 2019). At the DNA level, we believe that the effect of substitution model selection on the accuracy of phylogenetic reconstructions could be reduced because nucleotides have fewer character states than amino acids. Evaluating the influence of substitution rate variation among sites with a fixed exchangeability matrix is indeed relevant but still does not inform about the phylogenetic consequences of selection among substitution models of protein evolution considering different exchangeability matrices. 
Note that the currently available substitution models of protein evolution present differing empirical exchangeability matrices that are required to mimic diverse evolutionary patterns observed in nature (supplementary fig. S1, Supplementary Material online; Thorne 2000; Arenas 2015). To clarify this aspect, we revisited this topic and found that the selection among substitution models of protein evolution, with different exchangeability matrices, can seriously affect the reconstruction of ancestral sequences. We simulated protein sequences to quantify the distance between ancestral sequences inferred under diverse substitution models, including the true substitution model, and we found that applying the true substitution model produces the most accurate ancestral sequences (compared with ancestral sequences reconstructed under other substitution models). Interestingly, we found that substitution models with exchangeability matrices similar to the exchangeability matrix of the true substitution model led to more accurate ancestral sequences than substitution models with exchangeability matrices far from that of the true substitution model. In practice, this suggests that if the best-fitting substitution model is not available to perform the ASR (e.g., it is not implemented in the ASR evolutionary framework), applying a substitution model with an exchangeability matrix as similar as possible to the exchangeability matrix of the best-fitting substitution model is recommended. Next, we found that the influence of the substitution model on ASR is affected by the genetic diversity of the study data. In particular, data with large genetic diversity produce ancestral sequences more influenced by the applied substitution model. 
Note that data with large genetic diversity accumulated multiple substitution events during their evolutionary histories and thus involve a more intense contribution of the substitution model in the likelihood function of probabilistic ASR methods (Yang 2006). Indeed, any substitution model produces error, and thus applying a substitution model more frequently can increase error, especially (as we demonstrate in this study) if the applied model does not fit the studied data well. The ASR of real protein families performed under different substitution models showed that the reconstructed ancestral sequences (and also their biological properties in terms of number of CTL epitopes) differ depending on the applied substitution model. In particular, more distant substitution models produced more different ancestral sequences (with more different biological properties), in agreement with the results from the simulated data. Again, we found that using substitution models as similar as possible to the selected best-fitting substitution model is recommended when the best-fitting substitution model cannot be used for the ASR for any reason. The influence of the substitution model of protein evolution on ASR affects all the reconstructed ancestral sequences, but especially those belonging to the most internal nodes where the ASR method has to make more complicated decisions due to the long distance from the extant sequences. Although some studies suggested that the selection of the best-fitting substitution model of evolution may not be a mandatory task for phylogenetic tree reconstruction (see Introduction), here we clearly found that substitution model selection is highly recommended for the reconstruction of ancestral proteins. As indicated in the Introduction, protein ASR is frequently applied in diverse fields such as paleoenzymology and biotechnology (e.g., Perez-Jimenez et al. 2011; Holinski et al. 2017) and the reconstructed molecules should be as realistic as possible to display reliable biological properties. The observed patterns of amino acid substitution are the consequence of diverse selection constraints (e.g., selection on the protein function and stability; Lorenzo-Redondo et al. 2014; Arenas et al. 2016; Duchene et al. 2016; Echave et al. 2016; Kirchner et al. 2017; Geoghegan and Holmes 2018; Jimenez-Santos et al. 2018; Moshe and Pupko 2019) that can differ among taxonomic levels (Duchene et al. 2016; Chang et al. 2020), protein families (Rios et al. 2015; Del Amparo and Arenas 2022), and even within protein families (Del Amparo and Arenas 2022). Here we show that these different selection processes, mimicked with different substitution models, should be taken into consideration to more accurately reconstruct ancestral proteins.

Materials and Methods

Analysis of the Influence of Substitution Model Selection on ASR Using Simulated Protein Data

We simulated data to evaluate the distance between ancestral sequences reconstructed under the true (simulated) substitution model and ancestral sequences reconstructed under other (close or far from the true model; supplementary fig. S1, Supplementary Material online) substitution models. First, we simulated phylogenetic trees with random topologies using the function rtree implemented in the library ape of R (Paradis et al. 2004). Next, for each simulated tree, we simulated protein sequence evolution (we assumed a sequence length of 250 amino acids) under a particular substitution model with the function simSeq implemented in the phangorn library of R (Schliep 2011). We applied the HIVw (Nickle et al. 2007), JTT (Jones et al. 1992), Blosum62 (Henikoff and Henikoff 1992), and MtMam (Yang et al. 1998) substitution models in the simulations (true models) to include representative models of viral, nuclear, and mitochondrial proteins. We evaluated the influence of substitution model selection on ASR in six evolutionary scenarios of simulated data with variable number of protein sequences (50 and 100) and sequence identity (pairwise sequence comparisons, 0.2, 0.5, and 0.8). For each evolutionary scenario, we simulated a total of 1,000 multiple sequence alignments. As a control check, we applied ProtTest3 (Darriba et al. 2011) to verify that the true substitution models are selected as the best-fitting substitution models from the simulated data. Next, for each simulated data set, we reconstructed its ancestral sequences using the simulated phylogenetic tree (thus, avoiding potential biases from phylogenetic tree reconstruction) with the ML ASR method implemented in the function ancestral.pml of the phangorn library of R. The ASR was performed under diverse substitution models that included the true model and other models that are close and far from the true models. 
In particular, data simulated under the HIVw substitution model were evaluated with the HIVw (true), HIVb (close to the true; Nickle et al. 2007), JTT (intermediate), Blosum62, and MtMam (far from the true) substitution models; data simulated under the JTT substitution model were evaluated with the JTT (true), WAG (close; Whelan and Goldman 2001), HIVb (intermediate), Blosum62, and MtMam (far) substitution models; data simulated under the Blosum62 substitution model were evaluated with the Blosum62 (true), JTT (close), Dayhoff (intermediate), HIVb, and MtMam (far) substitution models; and data simulated under the MtMam substitution model were evaluated with the MtMam (true), MtRev (close; Adachi and Hasegawa 1996), JTT (intermediate), Blosum62, and HIVb (far) substitution models. Finally, we calculated the distance between simulated (true) ancestral sequences and ancestral sequences reconstructed under each substitution model. This distance is the sequence dissimilarity calculated as the proportion of different amino acid states (comparing every site) between the sequences.
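
The distance defined above (the proportion of sites with different amino acid states) and the pairwise sequence identity used to characterize the simulated scenarios can be sketched as follows. This is a minimal illustration rather than the authors' implementation, which relied on R libraries. The function names `p_distance` and `mean_pairwise_identity` and the toy sequences are assumptions, and the sketch assumes aligned sequences of equal length in which every site is compared (a gap character, if present, would simply count as another state).

```python
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def mean_pairwise_identity(alignment):
    """Average identity (1 - p_distance) over all pairs of sequences."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    return sum(1.0 - p_distance(a, b) for a, b in pairs) / len(pairs)


# Toy example: a "true" ancestor versus a reconstruction differing at one of five sites.
true_ancestor = "MKTAY"
reconstructed = "MKSAY"
asr_error = p_distance(true_ancestor, reconstructed)

# Toy alignment of three short sequences.
toy_alignment = ["AAAA", "AAAT", "AATT"]
identity = mean_pairwise_identity(toy_alignment)
```

Multiplying `asr_error` by 100 gives the percentage scale used in the figures.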

Analysis of the Influence of Substitution Model Selection on ASR Using Real Protein Families

We analyzed the prokaryotic protein families D-ala D-ala ligases and Thioredoxins I (TRXB; table 1) as illustrative real examples. These protein families, available from the PFAM database (table 1), include a putative group of homologs from many bacterial species (Bastolla et al. 2004) with extant sequences longer than 200 amino acids that allow well-supported phylogenetic reconstructions (Arenas et al. 2017; Arenas and Bastolla 2020) and have also been previously analyzed with ASR (Perez-Jimenez et al. 2011; Meziane-Cherif et al. 2012; Ingles-Prieto et al. 2013). As a precaution, we realigned the sequences with MAFFT (Katoh and Standley 2013). Next, we identified the best-fitting substitution model of protein evolution with ProtTest3 (table 1) and inferred an ML phylogenetic tree with RAxML-NG (Kozlov et al. 2019) under the best-fitting substitution model. We reconstructed the ancestral sequences under the best-fitting substitution model and other substitution models with close and distant exchangeability matrices (i.e., JTT, HIVb, Blosum62, and MtMam). Finally, for every internal node of the phylogenetic tree, we evaluated the distance between the ancestral sequences reconstructed under the best-fitting substitution model and the ancestral sequences reconstructed under every other substitution model.
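
The per-node comparison described in the last step can be sketched as below. This is an illustrative outline under stated assumptions, not the authors' code: the function name `per_node_distance`, the node labels, and the toy sequences are hypothetical, and the sketch assumes both reconstructions were made on the same tree so that internal nodes can be matched by label.

```python
def per_node_distance(reference, alternative):
    """For each internal node, return the percentage of sites at which the
    sequence reconstructed under an alternative model differs from the
    sequence reconstructed under the reference (best-fitting) model.

    Both arguments map a node label to its reconstructed sequence."""
    distances = {}
    for node, ref_seq in reference.items():
        alt_seq = alternative[node]
        if len(ref_seq) != len(alt_seq):
            raise ValueError(f"length mismatch at node {node!r}")
        differing = sum(1 for a, b in zip(ref_seq, alt_seq) if a != b)
        distances[node] = 100.0 * differing / len(ref_seq)
    return distances


# Hypothetical reconstructions at two internal nodes under two models.
under_best_fitting = {"node_1": "MKTAYIAK", "node_2": "MKTAYLAK"}
under_other_model = {"node_1": "MKTAYIAK", "node_2": "MRTAYIAK"}
node_distances = per_node_distance(under_best_fitting, under_other_model)
```

Pairing each entry of `node_distances` with the node's time to root would give the kind of per-node profile summarized in figure 2.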
Table 1.

Empirical Protein Families.

Protein Family | PFAM Code | Number of Sequences | Sequence Length | Sequence Identity | Best-fitting Substitution Model
D-ala D-ala ligases (DDL) | PF07478 | 42 | 399 | 0.40 | LG + I + G
Thioredoxins I (TRXB) | PF00070 | 28 | 375 | 0.46 | LG + I + G

Note.—For each data set, the table includes name of the protein family, PFAM code, number of sequences, sequence length (number of amino acids), sequence identity (ranging from 0 to 1), and the best-fitting substitution model selected with ProtTest3.

In order to provide an illustration of the biological consequences of substitution model selection in ASR, we also evaluated the predicted number of CTL epitopes in the ancestral sequences (reconstructed under different substitution models) of two alignments of the HIV-1 env region obtained from Arenas and Posada (2010b). Note that ancestral HIV-1 envelope proteins were widely used to design centralized vaccines against this virus (Nickle et al. 2003) and the accuracy of ASR can be crucial to obtain ancestral sequences with realistic immunological properties (Arenas and Posada 2010a). The first data set was an HIV-1 group M reference alignment with an outgroup from the Los Alamos HIV sequence database (41 sequences, 758 amino acids). The second data set included subtype B viruses and an outgroup (38 sequences, 810 amino acids; Doria-Rose et al. 2005). For each data set, we identified the best-fitting substitution model and inferred an ML tree under the best-fitting substitution model. Next, we reconstructed the ancestral sequences under the best-fitting substitution model and under other substitution models, both similar to and different from it. Then, we scanned the root ancestral sequences for known CTL epitopes with MHCPred (Guan et al. 2003).
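
Identifying a best-fitting substitution model, as done here with ProtTest3, amounts to ranking candidate models by a likelihood-based criterion. The text does not state which criterion was applied, so the sketch below simply assumes the Akaike (AIC) and Bayesian (BIC) information criteria, both of which are common choices. The model labels, log-likelihoods, and parameter counts are placeholder values invented for illustration; they are not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL (lower is better)."""
    return n_params * math.log(n_sites) - 2.0 * log_likelihood


def best_model(scores):
    """Return the label of the model with the lowest criterion score."""
    return min(scores, key=scores.get)


# Placeholder fits: label -> (maximized log-likelihood, free parameters).
# These numbers are made up for the example.
fits = {
    "model_A": (-10250.0, 101),
    "model_B": (-10310.0, 101),
}
alignment_length = 250  # number of sites, also a placeholder

aic_scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
bic_scores = {name: bic(lnl, k, alignment_length) for name, (lnl, k) in fits.items()}
selected = best_model(aic_scores)
```

When the candidate models have the same number of free parameters, as in this toy case, both criteria reduce to choosing the model with the highest likelihood.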

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

References (first 10 of 59 listed)

1. Ugo Bastolla; Andrés Moya; Enrique Viguera; Roeland C H J van Ham. Genomic determinants of protein folding thermodynamics in prokaryotic organisms. J Mol Biol. 2004-11-05.

2. J Adachi; M Hasegawa. Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol. 1996-04.

3. Thu Kim Le; Le Sy Vinh. FLAVI: An Amino Acid Substitution Model for Flaviviruses. J Mol Evol. 2020-04-30.

4. Alexandra Holinski; Kristina Heyn; Rainer Merkl; Reinhard Sterner. Combining ancestral sequence reconstruction with protein design to identify an interface hotspot in a key metabolic enzyme complex. Proteins. 2017-01-05.

5. Diego Darriba; Guillermo L Taboada; Ramón Doallo; David Posada. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011-02-17.

6. N A Doria-Rose; G H Learn; A G Rodrigo; D C Nickle; F Li; M Mahalanabis; M T Hensel; S McLaughlin; P F Edmonson; D Montefiori; S W Barnett; N L Haigwood; J I Mullins. Human immunodeficiency virus type 1 subtype B ancestral envelope protein is functional and elicits neutralizing antibodies in rabbits similar to those elicited by a circulating subtype B envelope. J Virol. 2005-09.

7. Alvaro Ingles-Prieto; Beatriz Ibarra-Molero; Asuncion Delgado-Delgado; Raul Perez-Jimenez; Julio M Fernandez; Eric A Gaucher; Jose M Sanchez-Ruiz; Jose A Gavira. Conservation of protein structure over four billion years. Structure. 2013-08-08.

8. Santiago Rios; Marta F Fernandez; Gianluigi Caltabiano; Mercedes Campillo; Leonardo Pardo; Angel Gonzalez. GPCRtm: An amino acid substitution matrix for the transmembrane region of class A G Protein-Coupled Receptors. BMC Bioinformatics. 2015-07-02.

9. Avery G A Selberg; Eric A Gaucher; David A Liberles. Ancestral Sequence Reconstruction: From Chemical Paleogenetics to Maximum Likelihood Algorithms and Beyond. J Mol Evol. 2021-01-24.

10. Miguel Arenas. Trends in substitution models of molecular evolution. Front Genet. 2015-10-26.
