Wei Cai, Jimin Pei, Nick V Grishin.
Abstract
BACKGROUND: Modern-day proteins were selected during a long evolutionary history as descendants of ancient life forms. In silico reconstruction of such ancestral protein sequences facilitates our understanding of evolutionary processes, protein classification and biological function. Additionally, reconstructed ancestral protein sequences could serve to fill in sequence space, thus aiding remote homology inference.
Year: 2004 PMID: 15377393 PMCID: PMC522809 DOI: 10.1186/1471-2148-4-33
Source DB: PubMed Journal: BMC Evol Biol ISSN: 1471-2148 Impact factor: 3.260
Figure 1. a) Correlation between average α_AB and α_ML. α_AB is the Alignment-Based rate factor, which depends solely on the given alignment. α_ML is the rate factor estimated by the maximum likelihood method, which requires an alignment and an evolutionary tree inferred from the alignment. The protein family used here is the PDZ domain.
Table 1. Difference of logarithm likelihood and CPU time when using different α vectors
| | α = 1.0 | α_AB | Δ | P* | α_ML | Δ | P* |
| Logarithm Likelihood | -5324.56 | -5087.72 | 236.84 | <0.0001 | -4987.27 | 337.29 | <0.0001 |
| CPU Time (s)+ | 213 | 213 | | | 359 | | |
The alignment tested here is a subset of the SH2 family. It includes 44 sequences, and each aligned sequence contains 83 positions (amino acids or gaps).
* The likelihood ratio test (LRT) [58] is used to test whether α_AB and α_ML are significantly different from α = 1.0. The difference in the number of free parameters between the α_AB or α_ML model and the α = 1.0 model is 82.
+ CPU times were computed on a Dell PowerEdge 8450 server (700 MHz CPU, 8 GB RAM).
Figure 2. The tree used to test ancestral sequence reconstruction. This is an arbitrarily selected evolutionary tree. Evolutionary distances are shown to scale.
Figure 3. Comparison of pairwise distances between the rebuilt tree and the original tree. a) Distance estimation assuming no rate variation among sites; b) distance estimation with rate variation among sites. The rebuilt tree is inferred from the alignment generated by evolutionary simulation performed on the original tree. The original tree is arbitrarily selected.
Table 2. Difference of logarithm likelihood and CPU time with and without optimization of the π vector
| | Calculated π | Optimized π | Δ | P* |
| Logarithm Likelihood | -5087.72 | -5055.97 | 31.75 | <0.0001 |
| CPU Time (s)+ | 213 | 14902 | | |
The alignment tested here is the same alignment used in Table 1. "Calculated π" means the frequency vector calculated directly from the alignment.
* The likelihood ratio test (LRT) [58] is used to test whether optimized π is significantly different from calculated π. The difference in number of free parameters between these two models is 19.
+ CPU times were computed on a Dell PowerEdge 8450 server (700 MHz CPU, 8 GB RAM).
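The P values in the two likelihood tables above come from a likelihood ratio test. As a minimal sketch, assuming the usual statistic 2Δ is compared against a chi-square distribution with the stated degrees of freedom (82 and 19), the tail probabilities can be checked with the Python standard library alone. The function name `chi2_sf`, the dictionary keys, and the numerical routine (series and continued-fraction evaluation of the regularized incomplete gamma function) are illustrative choices, not taken from the paper.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable with df degrees of freedom."""
    a, z = df / 2.0, x / 2.0
    if z <= 0.0:
        return 1.0
    # log of z^a * exp(-z) / Gamma(a), shared by both evaluation branches
    log_pref = a * math.log(z) - z - math.lgamma(a)
    if z < a + 1.0:
        # Series for the lower regularized incomplete gamma function
        n, term = a, 1.0 / a
        total = term
        for _ in range(1000):
            n += 1.0
            term *= z / n
            total += term
            if abs(term) < abs(total) * 1e-15:
                break
        return 1.0 - total * math.exp(log_pref)
    # Continued fraction (modified Lentz method) for the upper function
    tiny = 1e-300
    b = z + 1.0 - a
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_pref)

# (Δ log-likelihood, degrees of freedom) as reported in the tables above
tests = {
    "table1_first": (236.84, 82),
    "table1_second": (337.29, 82),
    "table2": (31.75, 19),
}
# Assumption: LRT statistic = 2 * Δ, referred to a chi-square distribution
p_values = {name: chi2_sf(2.0 * delta, df) for name, (delta, df) in tests.items()}
```

Under these assumptions all three tail probabilities fall below the 0.0001 threshold quoted in the tables.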
Figure 4. a) Correlation between the average probability of "the reconstructed amino acid" and the fraction of correct predictions. b) Correlation between the fraction of correct predictions and the average rate factor α. The protein family used here is the PDZ domain. Red filled points are sites with incorrect reconstruction.
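Panel a) relates the probability assigned to each reconstructed amino acid to how often such reconstructions are correct. One way to build this kind of comparison, assuming per-site probabilities and correctness flags are available, is to group sites into probability bins and compute the fraction correct in each bin. The function name `calibration_bins`, the bin count, and the toy data are hypothetical; the binning actually used for the figure is not stated here.

```python
def calibration_bins(probs, correct, n_bins=10):
    """Group sites by reconstruction probability.

    Returns one (mean probability, fraction correct, site count) tuple
    per non-empty bin, in order of increasing probability.
    """
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, ok))
    result = []
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        frac = sum(1 for _, ok in members if ok) / len(members)
        result.append((mean_p, frac, len(members)))
    return result

# Hypothetical toy data, not values from the paper
probs = [0.95, 0.92, 0.55, 0.52, 0.15]
correct = [True, True, True, False, False]
rows = calibration_bins(probs, correct, n_bins=10)
```

A well-calibrated reconstruction would give a fraction correct close to the mean probability in each bin.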
Table 3. Ancestral sequence reconstruction accuracy by different programs
| Root Seq. | Tree | Leaf Node Num. | Methods | |||||||||
| ANCESCON | PAML | PHYLIP $ | PAUP* | |||||||||
| -α | +α | -α | +L +α | -L +α | +L -α | -L -α | ||||||
| 1em2 | pii1 | 25 | 0.32 | 0.35 | 0.41 | 0.37 | 0.29 | 0.27 | 0.21 | 0.29 | 0.26 | |
| 1g9o | pii1 | 25 | 0.46 | 0.47 | 0.53 | 0.53 | 0.51 | 0.54 | 0.40 | 0.51 | 0.47 | |
| 1rgg | pii1 | 25 | 0.60 | 0.42 | 0.47 | 0.60 | 0.47 | 0.58 | 0.32 | 0.56 | 0.47 | |
| 1sgt | pii1 | 25 | 0.34 | 0.33 | 0.33 | 0.32 | 0.32 | 0.33 | 0.27 | 0.33 | 0.32 | |
| 1zm2 | pii1 | 25 | 0.29 | 0.3 | 0.28 | 0.25 | 0.21 | 0.25 | 0.21 | 0.27 | 0.16 | |
| 2a8v | pii1 | 25 | 0.45 | 0.42 | 0.56 | 0.55 | 0.44 | 0.46 | 0.28 | 0.50 | 0.36 | |
| 2ctb | pii1 | 25 | 0.40 | 0.39 | 0.41 | 0.38 | 0.24 | 0.24 | 0.21 | 0.29 | 0.22 | |
| Average accuracy | 0.383 | 0.390 | 0.446 | 0.431 | 0.354 | 0.381 | 0.271 | 0.393 | 0.323 | |||
| 2ctb | gef | 27 | 0.37 | 0.38 | 0.35 | 0.35 | 0.29 | 0.17 | 0.24 | 0.22 | 0.22 | |
| 2ctb | LacI | 54 | 0.64 | 0.57 | 0.44 | 0.37 | 0.49 | 0.35 | 0.42 | 0.33 | 0.34 | |
| 2ctb | pdz | 39 | 0.41 | 0.42 | 0.44 | 0.39 | 0.22 | 0.34 | 0.18 | 0.32 | 0.22 | |
| 2ctb | ph | 30 | 0.74 | 0.75 | 0.53 | 0.55 | 0.45 | 0.25 | 0.43 | 0.37 | 0.32 | |
| 2ctb | pii1 | 25 | 0.40 | 0.39 | 0.41 | 0.38 | 0.24 | 0.24 | 0.21 | 0.29 | 0.22 | |
| 2ctb | ptb | 29 | 0.39 | 0.43 | 0.39 | 0.38 | 0.29 | 0.23 | 0.26 | 0.24 | 0.23 | |
| 2ctb | sh2 | 34 | 0.42 | 0.40 | 0.43 | 0.40 | 0.30 | 0.22 | 0.20 | 0.27 | 0.22 | |
| 2ctb | sh3 | 43 | 0.82 | 0.80 | 0.62 | 0.55 | 0.69 | 0.45 | 0.66 | 0.46 | 0.54 | |
| 2ctb | GST | 140 | 0.73 | 0.73 | @ | @ | # | # | 0.47 | 0.38 | 0.33 | |
| Average accuracy& | 0.524 | 0.518 | 0.451 | 0.421 | 0.371 | 0.281 | 0.325 | 0.313 | 0.289 | |||
| 1em2 | pdz | 39 | 0.35 | 0.36 | 0.44 | 0.44 | 0.29 | 0.43 | 0.23 | 0.4 | 0.24 | |
| 1g9o | pii1 | 25 | 0.46 | 0.47 | 0.53 | 0.53 | 0.51 | 0.54 | 0.40 | 0.51 | 0.47 | |
| 1rgg | sh2 | 34 | 0.48 | 0.46 | 0.61 | 0.61 | 0.56 | 0.59 | 0.34 | 0.6 | 0.41 | |
| 1sgt | gef | 27 | 0.39 | 0.40 | 0.48 | 0.44 | 0.42 | 0.44 | 0.36 | 0.45 | 0.41 | |
| 1zm2 | ptb | 29 | 0.47 | 0.48 | 0.57 | 0.57 | 0.53 | 0.51 | 0.32 | 0.52 | 0.41 | |
| 2a8v | ph | 30 | 0.78 | | 0.71 | 0.74 | 0.60 | 0.61 | 0.50 | 0.65 | 0.50 | |
| 2ctb | LacI | 54 | 0.64 | 0.57 | 0.44 | 0.37 | 0.49 | 0.35 | 0.42 | 0.33 | 0.34 | |
| Average accuracy | 0.510 | 0.507 | 0.540 | 0.529 | 0.486 | 0.496 | 0.367 | 0.494 | 0.397 | |||
| ProbabilityΔ | 0.0026 | 0.0023 | 0.0248 | 0.0328 | 0.0007 | 0.0168 | 0.0001 | 0.0143 | 0.0005 | |||
All root sequences are taken from the PDB database; the names listed in the table are PDB IDs.
Tree topologies for gef (guanine nucleotide exchange factor), LacI (PurR/LacI family of bacterial transcription factors), pdz, ph, pii1 (a signal transduction protein), ptb, sh2, sh3 and GST (glutathione S-transferase) are inferred from multiple sequence alignments chosen from the Pfam database (version 7.3).
All tree topologies are generated from real alignments, and the distances are rescaled to make the trees comparable.
Each value in this table is the accuracy of reconstruction, i.e. the fraction of correctly reconstructed sites for the root sequence. The best reconstruction accuracy in each test is shown in bold.
α_ML means that the site-specific rate factors were estimated by the maximum likelihood method.
α_AB means that the site-specific rate factors were estimated by our empirical equation based on the given alignment (for details see Methods).
-α means that the rate factors were not considered in reconstruction.
+α means that the rate factors were considered in reconstruction.
+L means that branch lengths of the input tree were used in reconstruction, while -L means that branch lengths were estimated by the reconstruction program itself.
@: the tree topology for GST had 140 leaf nodes, too many for PAML to process.
$: rate factors estimated by PAML were used by PHYLIP in ancestral sequence reconstruction.
#: the tree topology for GST had 140 leaf nodes, too many for PAML to estimate rate factors.
&: GST is excluded from the calculation of the average.
Δ: a paired t-test [40] was used to estimate the one-tailed probability between ANCESCON and the other three reconstruction methods.
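The accuracy reported in the table above is defined in its footnotes as the fraction of correctly reconstructed sites of the root sequence. A minimal sketch of that measure follows, assuming the true and reconstructed root sequences are already aligned position by position; the function name `reconstruction_accuracy` and the example sequences are hypothetical.

```python
def reconstruction_accuracy(true_root, reconstructed):
    """Fraction of aligned sites at which the reconstructed residue equals the true one."""
    if len(true_root) != len(reconstructed):
        raise ValueError("sequences must have the same aligned length")
    if not true_root:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(true_root, reconstructed))
    return matches / len(true_root)

# Hypothetical 10-residue example, not a sequence from the paper
true_root = "MKVLAAGIVG"
reconstructed = "MKVLSAGLVG"
acc = reconstruction_accuracy(true_root, reconstructed)  # 8 of 10 sites match
```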
Figure 5. Comparison of the "BEST", "SECOND BEST", "SHUFFLE" and "RANDOM" methods by the number of new homologs detected relative to the benchmark experiment. The methods are defined in the "Methods" section. The blue portion of each bar shows the number of true positives. The red portion shows the number of false positives.
Table 4. Homology detection results of OB-fold structures using reconstructed ancestral sequences
| SCOP Superfamily/family | PDB structure | New homologs | NCBI annotation |
| Nucleic acid-binding proteins/ Anticodon-binding domain | 1b7yB, 39–151 | N/A | - |
| 1b8aA, 1–102 | N/A | - | |
| 1bbuA, 64–151 | 13431467 | DNA polymerase II small subunit | |
| 15598836 | DNA polymerase III, alpha chain | ||
| 1c0aA, 1–106 | 11261591 | DNA polymerase III, alpha chain | |
| 11499379 | conserved hypothetical protein | ||
| 1169392 | DNA polymerase III alpha subunit | ||
| 118794 | DNA polymerase III alpha subunit | ||
| 13620707 | putative DNA polymerase III, alpha chain | ||
| 14194684 | DNA polymerase III alpha subunit | ||
| 14194702 | DNA polymerase III alpha subunit | ||
| 14195653 | DNA polymerase III alpha subunit | ||
| 14195659 | DNA polymerase III alpha subunit | ||
| 15594924 | DNA polymerase III, subunit alpha | ||
| 15598836 | DNA polymerase III, alpha chain | ||
| 15601899 | DnaE | ||
| 15642243 | DNA polymerase III, alpha subunit | ||
| 15669005 | M. | ||
| 15679404 | DNA polymerase delta small subunit | ||
| 3914611 | ATP-dependent DNA helicase recG | ||
| 1cuk, 1–64 | N/A | - | |
| 1e1oA, 64–148 | 11261591 | DNA polymerase III, alpha chain XF0204 | |
| 14194684 | DNA polymerase III alpha subunit | ||
| 1fguA, 181–298 | 15219507 | hypothetical protein | |
| 15230563 | putative protein | ||
| 15790309 | Vng1255c from | ||
| 6166145 | DNA polymerase III alpha subunit | ||
| 8778702 | T1N15.20 | ||
| 1fl0A | 10957481 | hypothetical protein | |
| 1g51A, 1–104 | 14520587 | hypothetical protein | |
| 14591565 | hypothetical protein | ||
| 15595886 | hypothetical protein | ||
| 3914638 | ATP-dependent DNA helicase recG | ||
| 1otcB, 36–126 | N/A | - | |
| 1quqA, 62–152 | 15387767 | probable replication protein a 28 Kd subunit | |
| 1qvcA, 1–114 | N/A | - | |
| Nucleic acid-binding proteins/Cold shock DNA-binding domain like | 1a62, 48–125 | N/A | - |
| 1ah9 | N/A | - | |
| 1bkb, 75–139 | 15790688 | translation initiation factor eIF-5A; Eif5a | |
| 1c9oA | 6014735 | Cold shock protein CspSt | |
| 1csp | N/A | - | |
| 1d7qA | N/A | - | |
| 1mjc | N/A | - | |
| 1rl2 | N/A | - | |
| 1sro | 15671445 | N utilization substance protein A | |
| 15794781 | N utilisation substance protein A | ||
| 15803711 | transcription pausing; L factor | ||
| 2eifA, 73–132 | N/A | - | |
| Nucleic acid-binding proteins/DNA ligase, mRNA capping enzyme, domain2 | 1a0i, 241–349 | N/A | - |
| 1dgsA, 315–400 | N/A | - | |
| 1ckmA, 238–302 | N/A | - | |
| 1fviA, 190–293 | N/A | - | |
| Nucleic acid-binding proteins/Phage ssDNA-binding proteins | 1gpc | N/A | - |
| 1gvp | N/A | - | |
| 1pfs | N/A | - | |
| Nucleic acid-binding proteins/RNA polymerase subunit RBP8 | 1a1d | N/A | - |
| Staphylococcal nuclease/Staphylococcal nuclease | 1eyd | 13422779 | aldose 1-epimerase * |
| Bacterial enterotoxins/Bacterial AB5 toxins, B units | 1c4qA | N/A | - |
| 1prtF | N/A | - | |
| Bacterial enterotoxins/Superantigen toxins | 1an8, 19–94 | N/A | - |
| TIMP-like/Tissue inhibitor of metalloproteases | 1ueaB, 14–106 | N/A | - |
| Inorganic pyrophosphatase/ Inorganic pyrophosphatase | 2prd | N/A | - |
| MOP-like/BiMOP, duplicated molybdate-binding domain | 1b9mA, 127–262 | 10639288 | probable ATP-binding protein |
| 10955070 | AgtA | ||
| 1175513 | Putative ferric transport ATP-binding protein afuC | ||
| 15598450 | probable ATP-binding component of ABC transporter | ||
| 3978166 | ATPase FbpC | ||
| 4895001 | glucose ABC transporter ATPase * | ||
| Histidine kinase CheA, C-terminal domain/ Histidine kinase CheA, C-terminal domain | 1b3qA, 540–671 | N/A | - |
* Putative false positives as assessed by manual inspection.
Table 5. Comparison of the true hits among the top 10 predicted sites for ANCESCON, evolutionary trace (ET), simple conservation (SC), and conservation difference (CD) methods
| Protein Family | PDB ID# | Ligand/ substrate | Number of sites | * | ** | *** | ANCESCON | ET | SC | CD |
| adkinase | 1aky | AP5 | 188 | 42 | 20 | 18 | 3 | 9.5 | 9.1 | 8 |
| gef | 1bkd | H-Ras | 245 | 47 | 4 | 0 | 3 | 3 | 3 | 2 |
| globin | 1a6g | HEM | 147 | 21 | 1 | 1 | 2 | 5.5 | 6 | 6 |
| pdz | 1be9 | + | 81 | 15 | 2 | 1 | 6 | 4 | 4 | 2 |
| ph | 1mai | I3P | 109 | 11 | 2 | 0 | 2 | 2 | 3 | 2 |
| ptb | 1shc | PTR | 157 | 27 | 2 | 1 | 6 | 5 | 5 | 9 |
| ras | 821p | GTN | 185 | 29 | 10 | 9 | 2 | 5.6 | 8.7 | 5 |
| sh2 | 1a09 | ACE | 83 | 17 | 2 | 1 | 3 | 5 | 4 | 4 |
| sh3 | 1nlo | ACE | 57 | 9 | 1 | 1 | 2 | 5 | 4 | 0 |
| subtilase | 1av7 | SBL | 278 | 22 | 8 | 4 | 5 | 4.6 | 3.8 | 4 |
#: Representative protein structure
*: Number of sites within 5 Å of the ligand or substrate
**: Number of invariant sites, which may contain gaps
***: Number of invariant sites within 5 Å of the ligand or substrate
+: C-terminal peptide of protein CRIPT
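The footnotes above count a site as near the binding partner when it lies within 5 Å of the ligand or substrate. The sketch below shows one way to apply such a cutoff and then count true hits among the top-ranked predictions. It assumes a site qualifies when any of its atoms is within the cutoff of any ligand atom, which may differ from the criterion used in the paper; the function names and all coordinates are hypothetical.

```python
import math

def sites_near_ligand(residue_atoms, ligand_atoms, cutoff=5.0):
    """Return the set of site indices having at least one atom within
    `cutoff` angstroms of any ligand atom (assumed any-atom criterion)."""
    near = set()
    for site, atoms in residue_atoms.items():
        if any(math.dist(a, lig) <= cutoff for a in atoms for lig in ligand_atoms):
            near.add(site)
    return near

def true_hits(predicted_sites, near_sites, top_n=10):
    """Count how many of the first `top_n` predicted sites are near the ligand."""
    return sum(1 for s in predicted_sites[:top_n] if s in near_sites)

# Hypothetical coordinates in angstroms, not taken from any PDB entry
residue_atoms = {
    1: [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    2: [(10.0, 0.0, 0.0)],
    3: [(0.0, 4.0, 3.0)],
}
ligand_atoms = [(0.0, 0.0, 3.0), (0.0, 8.0, 3.0)]

near = sites_near_ligand(residue_atoms, ligand_atoms)
hits = true_hits([3, 2, 1], near, top_n=2)
```

With real data, the coordinates would come from the representative structure listed for each family, and `predicted_sites` would be the ranked output of the prediction method.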
Figure 6. Mapping of the top 10 predictions by ANCESCON onto the PDZ domain (PDB ID: 1be9) [50]. Color scheme: the ligand is shown in green and the predicted functional residues are shown in red.
Figure 7. A partial alignment of the N-terminal part of adenylyl kinases. Sites colored in red are our predictions that are within 5 Å of the ligand. Sites colored in orange are our predictions more than 5 Å from the ligand.
Figure 8. The evolutionary tree for the adenylyl kinase family generated by "Weighbor". The first cutting layer is shown. Evolutionary distances are shown to scale.
Figure 9. Mapping of the top 10 predictions by ANCESCON onto the adenylyl kinase domain (PDB ID: 1aky) [47]. Color scheme: the ligand is shown in green and the predicted functional residues are shown in red.
Figure 10. An evolutionary tree topology. Nodes C, D, E and F represent given protein sequences, while nodes A and B represent ancestral protein sequences, i.e. unknown sequences. d_YZ represents the evolutionary distance between nodes Y and Z.
Figure 11. An example showing the different cutting layers in a rooted tree. d is the average distance from the root to all leaf nodes. Nodes i and j are neighboring cutting nodes.
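The last caption refers to the average distance from the root to all leaf nodes. A minimal sketch of that quantity follows, assuming the rooted tree is stored as a mapping from each node to its children and branch lengths. The representation, the function name `average_root_to_leaf`, and the example tree are hypothetical, and the rule for placing cutting layers is not reproduced here because it is not given in this section.

```python
def average_root_to_leaf(tree, root):
    """Average path length from `root` to every leaf.

    `tree` maps a node name to a list of (child, branch_length) pairs;
    nodes absent from the mapping (or with no children) are leaves.
    """
    leaf_distances = []
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        children = tree.get(node, [])
        if not children:
            leaf_distances.append(dist)
        for child, length in children:
            stack.append((child, dist + length))
    return sum(leaf_distances) / len(leaf_distances)

# Hypothetical rooted tree; names and branch lengths are illustrative only
tree = {
    "root": [("i", 0.5), ("j", 0.25)],
    "i": [("leaf1", 0.5), ("leaf2", 1.0)],
    "j": [("leaf3", 0.75)],
}
d_avg = average_root_to_leaf(tree, "root")  # leaves lie at 1.0, 1.5 and 1.0
```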