Bryan A Moyers, Jianzhi Zhang.
Abstract
Phylostratigraphy, originally designed for gene age estimation by BLAST-based protein homology searches of sequenced genomes, has been widely used for studying patterns and inferring mechanisms of gene origination and evolution. We previously showed by computer simulation that phylostratigraphy underestimates gene age for a nonnegligible fraction of genes and that the underestimation is more severe for genes with certain properties such as fast evolution and short protein sequences. Consequently, many previously reported age distributions of gene properties may have been methodological artifacts rather than biological realities. Domazet-Lošo and colleagues recently argued that our simulations were flawed and that phylostratigraphic bias does not impact inferences about gene emergence and evolution. Here we discuss conceptual difficulties of phylostratigraphy, identify numerous problems in Domazet-Lošo et al.'s argument, reconfirm phylostratigraphic error using simulations suggested by Domazet-Lošo and colleagues, and demonstrate that a phylostratigraphic trend claimed to be robust to error disappears when genes likely to be error-resistant are analyzed. We conclude that extreme caution is needed in interpreting phylostratigraphic results because of the inherent biases of the method and that reanalysis using genes exhibiting no error in realistic simulations may help reduce spurious findings.
Keywords: BLAST; computer simulation; de novo gene origination; disease genes; false negatives
Year: 2017 PMID: 28637261 PMCID: PMC5501971 DOI: 10.1093/gbe/evx109
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Fig. 1.—Proportions of simulated human genes with no detectable homologs in various organisms, despite the presence of simulated homologs in all organisms. Human: simulation using human sequences as ancestral sequences to initiate evolution. RandSeq: simulation using randomly shuffled human sequences as ancestral sequences to initiate evolution. RandSpecial: simulation using randomly shuffled human sequences (along with co-shuffled relative evolutionary rates) as ancestral sequences to initiate evolution. Presented are mean proportions ± one SD from nine simulation replications.
Fig. 2.—Percentage of disease genes in each age group. Rank correlation (ρ) between age group and percentage of disease genes is shown for all genes, error-resistant genes based on simulation, and non-error-resistant genes, respectively.
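The summary statistic in the Fig. 1 caption (mean proportion ± one SD across replications) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the boolean "homolog detected" encoding, and the toy data (three replicates of four genes, not the nine replications of the study). The caption does not say whether the SD is a sample or population SD; the sketch uses the sample SD.

```python
from statistics import mean, stdev


def undetected_proportion(detected_flags):
    """Fraction of simulated genes with no detectable homolog.

    detected_flags: list of booleans, True if a homology search
    (hypothetically, BLAST) found the simulated homolog.
    """
    misses = sum(1 for hit in detected_flags if not hit)
    return misses / len(detected_flags)


def summarize_replicates(replicates):
    """Return (mean, sample SD) of the undetected proportion over replicates."""
    proportions = [undetected_proportion(r) for r in replicates]
    return mean(proportions), stdev(proportions)


# Toy data, invented for illustration only.
toy_replicates = [
    [True, True, False, True],   # 1 of 4 genes missed
    [True, False, False, True],  # 2 of 4 genes missed
    [True, True, True, True],    # 0 of 4 genes missed
]

mean_prop, sd_prop = summarize_replicates(toy_replicates)
print(f"{mean_prop:.3f} ± {sd_prop:.3f}")
```

Because every simulated gene has a true homolog in every organism, any undetected gene in this setup is a false negative of the search, which is why the proportion can be read directly as an error rate.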
Rank Correlations between Estimated Gene Age and Gene Properties for Yeast Genes
| | ORF Length | RNA Abundance | Proximity of TF Binding Sites or Not | Codon Adaptation Index | Purifying Selection or Not | Optimal AUG Context |
|---|---|---|---|---|---|---|
| All ORFs | 0.386 | 0.261 | 0.077 | 0.312 | 0.316 | 0.133 |
| Error-resistant | 0.179 | 0.093 | 0.050 | 0.208 | 0.166 | 0.045 |
| Non-error-resistant | 0.429 | 0.163 | −0.002 | 0.324 | 0.331 | 0.212 |
*P < 0.05, **P < 10⁻¹⁰, ***P < 10⁻¹⁰⁰.
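The comparison in the table above, a rank correlation computed for all genes and then separately for error-resistant and non-error-resistant subsets, can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the tuple layout, the integer age-group encoding, the function names, and the toy values are all assumptions. Spearman's ρ is computed as the Pearson correlation of tie-averaged ranks.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation of two equal-length sequences."""
    return pearson(average_ranks(x), average_ranks(y))


def rho_by_subset(genes):
    """genes: list of (age_group, property_value, is_error_resistant)."""
    subsets = {
        "all": genes,
        "error_resistant": [g for g in genes if g[2]],
        "non_error_resistant": [g for g in genes if not g[2]],
    }
    return {
        name: spearman([g[0] for g in rows], [g[1] for g in rows])
        for name, rows in subsets.items()
    }


# Toy data, invented for illustration only (age group, ORF length, resistant?).
toy_genes = [
    (1, 300, True), (2, 250, True), (3, 400, True),
    (4, 500, False), (5, 900, False), (6, 1200, False),
]
toy_rho = rho_by_subset(toy_genes)
for name, rho in toy_rho.items():
    print(f"{name}: rho = {rho:.3f}")
```

Splitting by error resistance is what lets a trend driven by the method's false negatives be distinguished from one that persists among genes whose ages are unlikely to be misestimated.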