Federico Scossa1,2, Alisdair R Fernie1,3.
Abstract
Whilst substantial research effort has been placed on understanding the interactions of plant proteins with their molecular partners, relatively few studies in plants, in contrast to work in other organisms, address how these interactions evolve. It is thought that ancestral proteins were more promiscuous than modern proteins and that specificity often evolved following gene duplication and subsequent functional refining. However, ancestral protein resurrection studies have found that some modern proteins have evolved de novo from ancestors lacking those functions. Intriguingly, the new interactions evolved as a consequence of just a few mutations and, as such, acquisition of new functions appears to be neither difficult nor rare; however, only a few of them are incorporated into biological processes before they are lost to subsequent mutations. Here, we detail the approach of ancestral sequence reconstruction (ASR), providing a primer to reconstruct the sequence of an ancestral gene. We will present case studies from a range of different eukaryotes before discussing the few instances where ancestral reconstructions have been used in plants. As ASR is used to dig into the remote evolutionary past, we will also present some alternative genetic approaches to investigate molecular evolution on shorter timescales. We argue that the study of plant secondary metabolism is particularly well suited for ancestral reconstruction studies. Indeed, its ancient evolutionary roots and highly diverse landscape provide an ideal context in which to address the focal issue around the emergence of evolutionary novelties and how this affects the chemical diversification of plant metabolism.
Keywords: APR, ancestral protein resurrection; ASR, ancestral sequence reconstruction; Ancestral sequence reconstruction; CDS, coding sequence; Evolution; GR, glucocorticoid receptor; GWAS, genome-wide association study; Genomics; InDel, insertion/deletion; MCMC, Markov chain Monte Carlo; ML, maximum likelihood; MP, maximum parsimony; MR, mineralocorticoid receptor; MSA, multiple sequence alignment; Metabolism; NJ, neighbor-joining; Phylogenetics; Plants; SFS, site frequency spectrum
Year: 2021 PMID: 33868595 PMCID: PMC8039532 DOI: 10.1016/j.csbj.2021.03.008
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Horizontal vs. vertical approach in the analysis of sequence data. ASR shifts away from the classical, “horizontal” comparison of sequence data from extant species. Starting from a multiple sequence alignment and a phylogenetic tree (with branch lengths, here represented by t1…t8), the algorithms used by ASR infer the sequences at the ancestral nodes (blue dots). These ancestral sequences can then be aligned to the extant sequences (“vertical” comparison) to identify where and when the historical mutations occurred along the evolutionary trajectories. The ancestral coding sequences can then be expressed in heterologous systems for functional assays.
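The “vertical” comparison described in the caption can be illustrated with a minimal sketch. It assumes that an inferred ancestral sequence and one of its descendants are already rows of the same alignment; the function name `list_substitutions` and the toy sequences are hypothetical and are not taken from the article.

```python
def list_substitutions(ancestor, descendant, gap="-"):
    """List substitutions between two aligned sequences as e.g. 'H3Q'.

    Positions are 1-based alignment columns. Columns with a gap in
    either sequence are skipped (indels are not reported here).
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must come from the same alignment")
    changes = []
    for pos, (a, d) in enumerate(zip(ancestor, descendant), start=1):
        if a != d and a != gap and d != gap:
            changes.append(f"{a}{pos}{d}")
    return changes


# Hypothetical ancestral and extant protein fragments (illustration only).
print(list_substitutions("MKHLT", "MKQLS"))  # ['H3Q', 'T5S']
```

Applying the same comparison along each branch of the tree would show on which branch a given substitution arose, which is the "where and when" the caption refers to.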
List of computer programs and resources for the typical steps of an ASR study. Additional software for phylogenetic inference can be found on Joseph Felsenstein’s homepage (latest updates in 2012) or in [21].
| Name | Description | References |
|---|---|---|
| JustOrthologs | fast algorithm for ortholog inference. Avoids all-vs-all BLAST searches, instead comparing CDS lengths and calculating the frequencies of dinucleotide occurrences in exon sequences | |
| Orthofinder | inference of orthologs with increased precision (takes into account the gene length bias associated with BLAST similarity scores). Provides a rooted species tree and gene trees for all orthogroups. Maps duplication events along tree branches. | |
| Orthograph | maps coding nucleotide sequences to genes of known orthology (useful for the extension of existing orthogroups) | |
| OrthoMCL | uses the Markov Cluster algorithm (MCL) to group putative orthologs and paralogs in a single orthogroup | |
| eggNOG 5.0 | A database of orthogroups and functional annotations from viral, bacterial and eukaryotic genomes | |
| Genomicus Plants v.41 | a multi-species genome browser for visualizing orthology and paralogy relationships | |
| OrthoDB | large catalogs of orthologs (across around 600 eukaryotes and > 3000 bacteria), obtained from best reciprocal hits in Smith-Waterman local sequence alignments | |
| Phylome DB | a website hosting catalogs of precomputed gene phylogenies from multiple genomes (“phylomes”). Provides high-quality orthology and paralogy relationships based on phylogenetic trees. Several plant phylomes available | |
| PLAZA 4.5 | a comparative plant genomics database hosting instances for Eudicots and Monocots. Provides sets of orthologous genes obtained through Markov clustering | |
| BAli-Phy | evolution-based tool for multiple sequence alignment. Incorporates a parametric model of sequence evolution, considering also indels | |
| ClustalΩ (omega) | a fast progressive multialignment employing sequence embedding to reduce the time required to build the guide tree | |
| Expresso | a structure-based sequence alignment tool (protein 3D models from the Protein Data Bank are used as templates to guide the sequence alignment) | |
| Historian | An evolution-based alignment software optimized for assessing indel rates and dN/dS ratios | |
| MAFFT | progressive multialignment, includes iterative refinement methods (for small-scale alignments) and structural methods for RNA | |
| MUSCLE | progressive multialignment based on k-mer counting | |
| PRANK | evolution-based algorithm for alignment of closely-related sequences. Accurate placement of insertions and deletions. | |
| ProbCons | algorithm based on a Markov model progressive alignment in combination with probabilistic sequence conservation information | |
| SATé-I and SATé-II | Co-estimation of alignments and phylogenetic trees. Iterative approach using an initial RAxML-computed tree with a MAFFT alignment, followed by further refinements through a divide-and-conquer strategy | |
| T-Coffee | consistency-based multialignment, combining a global pairwise approach (e.g., ClustalW) with a local pairwise alignment (e.g. Lalign) | |
| BMGE | Calculates an entropy score for each column in the MSA and compares it with a similarity score based on a PAM or BLOSUM matrix. Distinguishes, for each aligned character, biological variability from noise | |
| Divvier | Identifies clusters of characters of shared homology, filtering out divergent partitions; alleviates long-branch attraction in trees obtained from filtered MSAs | |
| Gblocks | Eliminates poorly aligned (highly variable) positions from a multialignment. Can be tailored to be more or less stringent according to the value of five different threshold scores | |
| Noisy | Eliminates homoplastic sites from MSAs based on character compatibility | |
| PREQUAL | A | |
| trimAl | Alignment trimming based on gap, similarity and consistency scores across all columns of an MSA | |
| BEAST2 | Bayesian analysis of molecular sequences. It uses Markov chain Monte Carlo (MCMC) as a numerical approximation to average over tree space | |
| FastME | Distance-based tree inference (Neighbor-Joining) | |
| FastTree2 | approximate-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequence | |
| IQ-TREE | infers phylogenetic trees by maximum likelihood | |
| MPBoot | Tree reconstruction based on maximum parsimony, suitable for large DNA and protein sequence alignments | |
| MrBayes | Bayesian phylogenetic inference using Markov chain Monte Carlo methods, with a large selection of evolutionary models for amino acid and DNA (codon) data | |
| PAML (v4.9j) | a package of several programs for phylogenetic analyses of DNA or protein sequences using ML. Includes the empirical Bayes method for estimation of ancestral sequences using nucleotide, codon or amino acid substitution models | |
| PhyloBayes | A popular Bayesian Markov chain Monte Carlo (MCMC) software for phylogenetic reconstruction and molecular dating. It uses non-parametric methods to characterize sequence evolution | |
| PhyML v3.0 | package for phylogenetic reconstruction using ML from nucleotide or amino acid sequences; several substitution models and tree searching algorithms implemented; introduces the criterion of minimum posterior expected error (MPEE) for ancestral sequence reconstruction | |
| Phylo-MCOA | Identifies outlier genes and species in phylogenomic datasets | |
| TreeShrink | Identifies genes leading to long branches | |
| TreSpEx | Identifies artificial signals in phylogenetic reconstructions (paralogy, long-branch attraction) | |
| ANCESCON | ASR software incorporating different substitution rates among sites (“alignment-based rate factors”) with the estimation of phylogenetic trees based on a weighted neighbor-joining method (distance-based, | |
| FastML | A user-friendly web server for computing ancestral sequences based on ML (includes marginal and joint estimates, with the time required for calculation scaling linearly with the number of sequences, hence it is applicable to very large datasets) | |
| PhyloBot | A web-based tool, designed for non-experts, integrating all common steps for a typical ASR pipeline (sequence alignment, phylogenetic inference, ancestral reconstruction, and prediction of functional effects) | |
| ProtASR/ProtASR2 | prediction of ancestral sequences using a mean-field (MF) substitution model incorporating selection on folding stability | |
| Revenant | a database of resurrected ancestral proteins | |
Fig. 2. Fitch's maximum parsimony (MP) algorithm for reconstructing ancestral states. To assign ancestral states, the tree is traversed twice. In the first pass, the algorithm proceeds from the leaves to the root and assigns to each internal node a set of characters based on the intersection of the descendant states (or their union if the intersection is empty). In the second pass, the algorithm proceeds from the root to the leaves and assigns to each internal node the state that is present in both the ancestral and the descendant node. When different equally parsimonious reconstructions are possible, multiple solutions exist (see Suppl. Fig. 1).
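The two passes described in the caption can be sketched as follows for a single site on a rooted binary tree. This is a minimal illustration under stated assumptions, not the implementation used by any of the programs listed above: the tree encoding (a dict mapping each internal node to its two children), the node names, and the alphabetical tie-break are all hypothetical choices, and the sketch returns only one of the possibly several equally parsimonious reconstructions.

```python
def fitch(tree, leaf_states, root):
    """Fitch parsimony for one site.

    tree: dict mapping each internal node to a (left, right) pair of children.
    leaf_states: dict mapping each leaf to its observed character state.
    Returns (assigned states for all nodes, minimum number of changes).
    """
    sets = {}
    changes = 0

    def up(node):
        # First pass (leaves to root): intersection of the child sets,
        # or their union (one extra change) if the intersection is empty.
        nonlocal changes
        if node in leaf_states:
            sets[node] = {leaf_states[node]}
            return sets[node]
        left, right = tree[node]
        a, b = up(left), up(right)
        common = a & b
        if common:
            sets[node] = common
        else:
            sets[node] = a | b
            changes += 1
        return sets[node]

    up(root)
    assigned = {}

    def down(node, parent_state):
        # Second pass (root to leaves): keep the parent's state when it is
        # in the node's set; otherwise pick one (assumed alphabetical tie-break).
        state = parent_state if parent_state in sets[node] else min(sets[node])
        assigned[node] = state
        for child in tree.get(node, ()):
            down(child, state)

    down(root, None)
    return assigned, changes


# Hypothetical four-leaf tree ((A,B),(C,D)) with states H or Q.
tree = {"root": ("n1", "n2"), "n1": ("A", "B"), "n2": ("C", "D")}
leaves = {"A": "H", "B": "Q", "C": "Q", "D": "Q"}
states, n_changes = fitch(tree, leaves, "root")
print(states["root"], states["n1"], states["n2"], n_changes)  # Q Q Q 1
```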
Fig. 3. Example of a maximum likelihood (ML) algorithm for the reconstruction of ancestral states. The figure represents a simple case of ancestral reconstruction using ML. We considered only a single site, with two possible character states (H or Q), across a phylogenetic tree with equal branch lengths. The algorithm first traverses the tree from the leaves to the root and, for each internal node, computes the likelihood of all possible states, also taking into account all possible states of the parent node. In the second step, the algorithm traverses the tree from the root to the leaves, assigning the ancestral states that maximise the likelihood. The figure represents the calculation for the subtree composed of the leaf nodes 4 and 5, the internal node 6 and the parent node 7 (blue rectangle).
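A two-pass procedure of the kind described in the caption can be sketched for a single site as shown below. The sketch is a dynamic-programming joint reconstruction written under several assumptions that are not taken from the article: a symmetric two-state substitution model with a hypothetical rate, one shared branch length `t`, uniform state frequencies at the root, and a hypothetical four-leaf tree. The function names `p_change` and `joint_ml` are likewise illustrative.

```python
import math

STATES = ("H", "Q")


def p_change(i, j, t, rate=1.0):
    """P(state j at the child | state i at the parent) for a symmetric
    two-state model over a branch of length t (assumed model)."""
    decay = math.exp(-2.0 * rate * t)
    return 0.5 + 0.5 * decay if i == j else 0.5 - 0.5 * decay


def joint_ml(tree, leaf_states, root, t=0.1):
    """Joint ML ancestral states for one site; all branches have length t."""
    # best[node][i] = (best subtree likelihood, best state of node)
    # given that the parent of node is in state i.
    best = {}

    def subtree_scores(node, weight):
        # For each candidate state j of `node`: weight(j) times the best
        # likelihoods of the child subtrees given j.
        scores = []
        for j in STATES:
            lik = weight(j)
            for child in tree[node]:
                lik *= best[child][j][0]
            scores.append((lik, j))
        return scores

    def up(node):
        # First pass: leaves to root.
        if node in leaf_states:
            s = leaf_states[node]
            best[node] = {i: (p_change(i, s, t), s) for i in STATES}
            return
        for child in tree[node]:
            up(child)
        if node != root:
            best[node] = {
                i: max(subtree_scores(node, lambda j, i=i: p_change(i, j, t)))
                for i in STATES
            }

    up(root)
    # At the root, weight each state by an assumed uniform prior.
    root_lik, root_state = max(subtree_scores(root, lambda j: 1.0 / len(STATES)))

    assigned = {root: root_state}

    def down(node):
        # Second pass: root to leaves, reading off the stored best states.
        for child in tree.get(node, ()):
            assigned[child] = best[child][assigned[node]][1]
            down(child)

    down(root)
    return assigned, root_lik


# Hypothetical tree ((a,b),(c,d)); this is not the tree shown in the figure.
tree = {"r": ("u", "v"), "u": ("a", "b"), "v": ("c", "d")}
leaves = {"a": "H", "b": "H", "c": "H", "d": "Q"}
states, likelihood = joint_ml(tree, leaves, "r")
print(states["r"], states["u"], states["v"])  # H H H
```

Marginal reconstruction, which several of the listed programs also offer, would instead sum over the states of the other nodes rather than maximise over them.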