Daniel A Dalquen, Christophe Dessimoz.
Abstract
Bidirectional best hits (BBH), which entails identifying the pairs of genes in two different genomes that are more similar to each other than either is to any other gene in the other genome, is a simple and widely used method to infer orthology. A recent study has analyzed the link between BBH and orthology in bacteria and archaea and concluded that, given the very high consistency in BBH they observed among triplets of neighboring genes, a high proportion of BBH are likely to be bona fide orthologs. However, limited by their analysis setup, the previous study could not easily test the reverse question: which proportion of orthologs are BBH? In this follow-up study, we consider this question in theory and answer it based on conceptual arguments, simulated data, and real biological data from all three domains of life. Our analyses corroborate the findings of the previous study, but also show that because of the high rate of gene duplication in plants and animals, as much as 60% of orthologous relations are missed by the BBH criterion.
Keywords: bidirectional best hit; comparative genomics; evolutionary relationships; in-paralogy; orthology; reciprocal best hit
Year: 2013 PMID: 24013106 PMCID: PMC3814191 DOI: 10.1093/gbe/evt132
Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416
Figure: Performance of BBH in conceptual examples. (a) BBH recovers the orthologous pair, because the orthologous pair is closer than the paralogous pair due to evolution accrued between the duplication and speciation events (highlighted in bold). (b) BBH only identifies one of the two orthologous pairs, namely the one with higher score. This scenario is common if duplication occurs after speciations of interest. (c) BBH identifies paralogs if the orthologous counterpart is missing in both species. This might happen if the rate of gene losses is high (e.g., following a whole genome duplication). (d) BBH identifies paralogs if the departure from the molecular clock is so strong that paralogs are closer in sequence despite having started diverging before the orthologs.
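The BBH criterion and scenario (b) can be illustrated with a minimal Python sketch. The gene names, the similarity scores, and the `bidirectional_best_hits` helper are hypothetical and chosen only for illustration; this is not the pipeline used in the study. Here gene `x` in one genome has two co-orthologs `y1` and `y2` in the other genome (a duplication after speciation), and BBH keeps only the higher-scoring pair.

```python
def bidirectional_best_hits(scores):
    """Return the set of bidirectional best hits.

    `scores` maps (gene_in_A, gene_in_B) -> similarity score, where a
    higher score means more similar. Ties are broken in favour of the
    pair seen first (an arbitrary choice made for this sketch).
    """
    best_for_a = {}  # gene in A -> its best-scoring gene in B
    best_for_b = {}  # gene in B -> its best-scoring gene in A
    for (a, b), score in scores.items():
        if a not in best_for_a or score > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or score > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    # Keep a pair only if each gene is the other's best hit.
    return {(a, b) for a, b in best_for_a.items() if best_for_b[b] == a}


# Hypothetical scores: x has two co-orthologs (y1, y2); u and v are 1-to-1.
scores = {
    ("x", "y1"): 95,
    ("x", "y2"): 80,
    ("u", "v"): 90,
    ("u", "y1"): 20,
    ("x", "v"): 15,
}
true_orthologs = {("x", "y1"), ("x", "y2"), ("u", "v")}

bbh = bidirectional_best_hits(scores)
missed = true_orthologs - bbh
print(sorted(bbh))     # pairs reported by BBH
print(sorted(missed))  # orthologous pairs that BBH does not report
```

In this toy input, every pair reported by BBH is a true ortholog, yet one of the three orthologous relations is missed, which is the asymmetry between precision and recall that the figure describes.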
Figure: Relationship between the proportion of non-1-to-1 orthology and precision/recall for BBH (in red) on simulated data sets with different proportions of genes with a history of duplications. Results for Inparanoid (green) and OMA/GETHOGs (blue) are given for comparison. Each point corresponds to the mean value of five replicates. Error bars give the 95% confidence interval of the mean values in both dimensions.
Figure: Precision and recall of BBH on real biological data sets, estimated from the intersection and union sets of orthologs inferred by Inparanoid and GETHOGs, with the intersection yielding a lower bound for precision and recall and the union yielding an upper bound for precision and recall. The trendlines depict regression over the mid-points.
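The estimation against two reference sets can be sketched as follows. This is a minimal illustration under stated assumptions: the gene pairs are invented, the `precision_recall` helper is hypothetical, and the sketch simply evaluates BBH once against the intersection and once against the union of the two reference predictions, reporting both values as the ends of a range rather than asserting which end is larger.

```python
def precision_recall(predicted, reference):
    """Precision and recall of `predicted` pairs against `reference` pairs."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return precision, recall


# Invented ortholog predictions (pairs of gene identifiers).
bbh_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
inparanoid_pairs = {("a1", "b1"), ("a2", "b2"), ("a2", "b2p"), ("a4", "b4")}
gethogs_pairs = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}

# Strict reference: pairs both methods agree on.
intersection = inparanoid_pairs & gethogs_pairs
# Lenient reference: pairs predicted by at least one method.
union = inparanoid_pairs | gethogs_pairs

p_inter, r_inter = precision_recall(bbh_pairs, intersection)
p_union, r_union = precision_recall(bbh_pairs, union)

print(f"vs intersection: precision={p_inter:.2f} recall={r_inter:.2f}")
print(f"vs union:        precision={p_union:.2f} recall={r_union:.2f}")
```

Using sets of pairs keeps the two reference definitions explicit: the same function is applied to both, so any difference in the estimates comes only from the choice of reference.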
Statistics Obtained by Comparing BBH to the Intersection and Union of Inparanoid and GETHOGs Predictions on Real Data
| Data Set | No. Orthologous Pairs | % Non-1-to-1 Orthologs | % Missed by BBH |
|---|---|---|---|
| Archaea | 116,187–202,117 | 16.73–54.28 | 11.30–42.66 |
| Firmicutes | 193,354–395,959 | 20.08–64.73 | 12.93–52.25 |
| Fungi | 753,147–1,126,046 | 18.46–39.84 | 12.51–31.86 |
| γ-Proteobacteria | 126,865–180,691 | 7.48–35.88 | 5.0–27.40 |
| Metazoa | 1,049,129–3,089,297 | 45.93–80.30 | 35.98–73.69 |
| Viridiplantae | 883,507–2,231,018 | 66.73–87.25 | 46.59–75.09 |
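The "% Missed by BBH" ranges can be read as recall ranges, assuming (an interpretation, not a statement from the source) that the missed percentage is the complement of recall with respect to the same reference set. A short sketch of that arithmetic, using the values from the table and a hypothetical `recall_range` helper:

```python
# "% Missed by BBH" ranges copied from the table (low end, high end).
missed_by_bbh = {
    "Archaea": (11.30, 42.66),
    "Firmicutes": (12.93, 52.25),
    "Fungi": (12.51, 31.86),
    "γ-Proteobacteria": (5.0, 27.40),
    "Metazoa": (35.98, 73.69),
    "Viridiplantae": (46.59, 75.09),
}


def recall_range(missed_low, missed_high):
    """Convert a missed-percentage range into a recall-percentage range.

    Assumes recall (%) = 100 - missed (%), so the highest missed value
    gives the lowest recall and vice versa.
    """
    return round(100 - missed_high, 2), round(100 - missed_low, 2)


recall_by_dataset = {
    name: recall_range(low, high) for name, (low, high) in missed_by_bbh.items()
}

for name, (low, high) in recall_by_dataset.items():
    print(f"{name}: recall between {low}% and {high}%")
```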
Key Statistics for Simulated Data Sets
| % Duplications | 0 | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|
| No. of sequences | 1,000 | ||||
| Distr. of seq. length | |||||
| Min. sequence length | 50 | ||||
| Substitution model | WAG | ||||
| Insertion and deletion rate | 0.000125 | ||||
| Gene duplication rate | 0 | 0.003 | 0.0056 | 0.009 | 0.0125 |
| Gene loss rate | 0 | 0.003 | |||
| No. of species | 30 | ||||
| Seq. length (mean) | 316.6 | 326.4 | 323.3 | 325.0 | 320.3 |
| Seq. length (stderr) | 201.7 | 211.6 | 207.4 | 213.1 | 203.6 |
| Avg. % gap chars in MSA | 24.27 | 23.25 | 24.64 | 26.23 | 28.65 |
| Variance of % gap chars | 58.0 | 62.8 | 66.4 | 72.4 | 80.5 |
| Total species tree length | 763.6 | ||||
| Minimum species tree height | 31.70 | ||||
| Maximum species tree height | 77.80 | ||||
| Average species tree height | 41.36 | ||||
| Average distance between species pairs | 72.60 | ||||