Bryan A Moyers, Jianzhi Zhang.
Abstract
Phylostratigraphy is a method for estimating gene age, usually applied to large numbers of genes in order to detect nonrandom age-distributions of gene properties that could shed light on mechanisms of gene origination and evolution. However, phylostratigraphy underestimates gene age with a nonnegligible probability. The underestimation is more severe for genes with certain properties, creating spurious age distributions of these properties and of those correlated with them. Here we explore three strategies to reduce phylostratigraphic error/bias. First, we test several alternative homology detection methods (PSIBLAST, HMMER, PHMMER, OMA, and GLAM2Scan) in phylostratigraphy, but fail to find any that noticeably outperforms the commonly used BLASTP. Second, using machine learning, we look for predictors of error-prone genes to exclude from phylostratigraphy, but cannot identify reliable predictors. Finally, we remove from phylostratigraphic analysis genes exhibiting errors in simulation, which by definition minimizes error/bias if the simulation is sufficiently realistic. Using this last approach, we show that some previously reported phylostratigraphic trends (e.g., younger proteins tend to evolve more rapidly and be shorter) disappear or even reverse, reconfirming the necessity of controlling phylostratigraphic error/bias. Taken together, our analyses demonstrate that phylostratigraphic errors/biases are refractory to several potential solutions but can be controlled at least partially by the exclusion of error-prone genes identified via realistic simulations. These results are expected to stimulate the judicious use of error-aware phylostratigraphy and reevaluation of previous phylostratigraphic findings.
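The core age assignment, and the way a missed homolog leads to underestimation, can be illustrated with a minimal sketch. The stratum names, the `assign_stratum` function, and the hit dictionaries below are hypothetical and chosen only for illustration; they are not the strata or the pipeline used in the study. The sketch assumes that a gene is assigned to the oldest stratum in which a homology search detects a homolog.

```python
from typing import Dict, List

# Hypothetical strata, ordered from youngest (index 0) to oldest.
STRATA: List[str] = [
    "Primates", "Mammalia", "Vertebrata", "Metazoa", "Eukaryota",
]


def assign_stratum(hits: Dict[str, bool], strata: List[str] = STRATA) -> str:
    """Return the oldest stratum in which a homolog was detected.

    `hits` maps a stratum name to True if the homology search found a
    homolog in at least one species of that stratum. A gene with no
    detected homolog elsewhere falls into the youngest stratum.
    """
    oldest = 0
    for index, name in enumerate(strata):
        if hits.get(name, False):
            oldest = index
    return strata[oldest]


# A gene whose distant homolog is detected is dated to the old stratum.
detected = {"Mammalia": True, "Vertebrata": True, "Eukaryota": True}
# The same gene when the search misses the distant homolog (false negative).
missed = {"Mammalia": True, "Vertebrata": True}

print(assign_stratum(detected))
print(assign_stratum(missed))
```

Because a false negative can only remove the most distant hit, the error in this scheme is one-directional: the gene appears younger than it is, never older.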
Year: 2018 PMID: 30060201 PMCID: PMC6105108 DOI: 10.1093/gbe/evy161
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Fig. 1.—Simulation for the assessment of phylostratigraphic error. (A) The phylogenetic tree along which simulation of protein sequence evolution was performed. Branch lengths follow TimeTree estimates of divergence times from the human. (B) Comparison of maximum-likelihood genetic distance determined by TreePuzzle between human and rat real and simulated proteins (Pearson’s r = 0.61, P = 2.2 × 10^−316, slope = 0.85). Each circle represents one pair of orthologous proteins. Because the slope is < 1, the simulation apparently under-evolved the sequences, making our estimate of the false negative rate of phylostratigraphy conservative.
Fig. 2.—False negative and false positive error rates in phylostratigraphy using BLASTP default parameters and the optimal parameters of five programs. False negative (A) and positive (B) rates for protein set I. False negative (C) and positive (D) rates for protein set II. False negative (E) and positive (F) rates for protein set III. GLAM2Scan is not shown, but can be seen in relation to other programs in supplementary fig. S9, Supplementary Material online.
Spurious Correlations (Spearman’s ρ) between Estimated Gene Age and Biological Features
| | BLASTP | PSIBLAST | PHMMER | HMMER |
|---|---|---|---|---|
| Protein set I | | | | |
| Protein length | 0.14** | 0.16** | 0.03 | 0.11** |
| Evolutionary rate | −0.37*** | −0.36*** | −0.02 | −0.34*** |
| Block length | 0.35*** | 0.35*** | 0.03* | 0.32*** |
| Protein set II | | | | |
| Protein length | 0.31*** | 0.30*** | 0.28*** | 0.27*** |
| Evolutionary rate | −0.22*** | −0.08** | −0.22*** | −0.22*** |
| Block length | 0.37*** | 0.28*** | 0.37*** | 0.33*** |
| Protein set III | | | | |
| Protein length | 0.32*** | 0.36*** | 0.29*** | 0.28*** |
| Evolutionary rate | −0.12** | −0.12** | −0.13** | −0.13** |
| Block length | 0.41*** | 0.44*** | 0.42*** | 0.38*** |
Block length: length of the longest block of conserved residues.
*P < 0.05; **P < 1 × 10^−10; ***P < 1 × 10^−100.
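As a reminder of what the correlations in this table measure, the following is a minimal pure-Python sketch of Spearman’s ρ: the Pearson correlation of the ranks, with tied values given their average rank. The function names and the toy data are assumptions made for illustration and do not come from the study; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (made up): estimated age as a stratum index, and protein length.
estimated_age = [1, 1, 2, 3, 3, 4]
protein_length = [120, 150, 300, 280, 500, 900]
print(spearman_rho(estimated_age, protein_length))
```

Gene ages take only a handful of distinct stratum values, so ties are the norm rather than the exception, which is why the rank step averages over tied positions.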
Performances of the Best-Performing Machine Learning Models in Identifying Error-Prone Genes for Each Protein Set by SVM and Random Forest Methods
| | SVM: Protein Set I | SVM: Protein Set II | SVM: Protein Set III | Random Forest: Protein Set I | Random Forest: Protein Set II | Random Forest: Protein Set III |
|---|---|---|---|---|---|---|
| Best-performing model | Error ∼ L+E+B | Error ∼ L+E+B | Error ∼ L*E*B | Error ∼ L+E+B | Error ∼ L+E+B | Error ∼ B |
| Sensitivity | 0.504 | 0.253 | 0.512 | 0.711 | 0.629 | 0.633 |
| Specificity | 0.987 | 0.984 | 0.863 | 0.967 | 0.900 | 0.730 |
| Precision | 0.768 | 0.718 | 0.653 | 0.519 | 0.360 | 0.336 |
L = protein length; E = evolutionary rate; B = maximum length of conserved block.
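The three performance measures in this table follow from the confusion matrix of a binary classifier. The sketch below assumes that the positive class is "error-prone gene"; the function name and the counts are hypothetical and are not taken from the study.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute sensitivity, specificity, and precision from confusion counts.

    Assumption: a "positive" is a gene flagged as error-prone.
      tp: error-prone genes correctly flagged
      fp: non-error-prone genes wrongly flagged
      tn: non-error-prone genes correctly left unflagged
      fn: error-prone genes that were missed
    """
    return {
        "sensitivity": tp / (tp + fn),  # fraction of error-prone genes caught
        "specificity": tn / (tn + fp),  # fraction of reliable genes retained
        "precision": tp / (tp + fp),    # fraction of flagged genes truly error-prone
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=50, fp=50, tn=950, fn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Read this way, a model with high specificity but a sensitivity near 0.5 retains most reliable genes yet leaves roughly half of the error-prone genes in the analysis.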
Spearman’s Correlation between Gene Age and Gene Properties in Real Phylostratigraphy
| Gene Properties Correlated | 4,942 Proteins Randomly Chosen from Ensembl | 4,942 Proteins Randomly Chosen from OrthoMaM | 4,942 Proteins Used in Simulation | 4,619 Nonerror-Prone Proteins |
|---|---|---|---|---|
| Protein length and age | 0.16** | −0.010 | −0.09** | −0.12** |
| Evolutionary rate and age | NA | −0.18** | −0.05* | 0.002 |
*P < 0.05; **P < 1 × 10^−10; ***P < 1 × 10^−100.
NA, not applicable because the evolutionary rate cannot be estimated for some genes due to the lack of detectable orthologs.
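The last column of the table keeps 4,619 of the 4,942 simulated proteins, that is, it drops the genes that showed errors in simulation. A minimal sketch of that exclusion step is given below. It assumes that ages are encoded as integers with larger values meaning older strata and that the true age of every simulated protein is known; the function names, gene identifiers, and values are hypothetical, and the procedure used in the study may differ in detail.

```python
from typing import Dict, Set


def error_prone_genes(simulated_age: Dict[str, int], true_age: int) -> Set[str]:
    """Return genes whose age estimated from simulated sequences is too young.

    In simulation the true age is known, so any estimate younger than
    `true_age` is a phylostratigraphic error for that gene.
    """
    return {gene for gene, age in simulated_age.items() if age < true_age}


def exclude_genes(real_age: Dict[str, int], excluded: Set[str]) -> Dict[str, int]:
    """Drop the excluded genes from the real phylostratigraphic estimates."""
    return {gene: age for gene, age in real_age.items() if gene not in excluded}


# Hypothetical example: three genes, true simulated age = stratum 5.
simulated = {"geneA": 5, "geneB": 3, "geneC": 5}
real = {"geneA": 4, "geneB": 2, "geneC": 5}

flagged = error_prone_genes(simulated, true_age=5)
print(sorted(flagged))
print(exclude_genes(real, flagged))
```

Correlations between age and gene properties would then be recomputed on the retained genes only, which is the comparison the final column of the table reports.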