Debra L Fulton, Yvonne Y Li, Matthew R Laird, Benjamin G S Horsman, Fiona M Roche, Fiona S L Brinkman.
Abstract
BACKGROUND: Orthologs (genes that have diverged after a speciation event) tend to have similar function, and so their prediction has become an important component of comparative genomics and genome annotation. The gold standard phylogenetic analysis approach of comparing available organismal phylogeny to gene phylogeny is not easily automated for genome-wide analysis; therefore, ortholog prediction for large genome-scale datasets is typically performed using a reciprocal-best-BLAST-hits (RBH) approach. One problem with RBH is that it will incorrectly predict a paralog as an ortholog when incomplete genome sequences or gene loss is involved. In addition, there is an increasing interest in identifying orthologs most likely to have retained similar function.
Year: 2006 PMID: 16729895 PMCID: PMC1524997 DOI: 10.1186/1471-2105-7-270
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An example of how RBH analysis may falsely identify a paralog as an ortholog. Illustrated is a hypothetical species tree and gene tree for the human, cattle, and mouse species, where human and cattle orthologs (unshaded genes) are being identified. If the true cattle ortholog has not yet been sequenced because of an incomplete bovine genome project, it will not be present in the gene dataset used for analysis (cattle gene crossed out with an X), and the best reciprocal BLAST hit for the human gene will be a cattle paralog (shaded gene). However, Ortholuge will detect this case as a potential paralog, because it examines the relative phylogenetic distance between genes and identifies how well their relative distances match expected species divergence.
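The failure mode described in this caption can be reproduced with a minimal sketch of reciprocal-best-hit pairing. The gene names (`hsA`, `btA`, `btA_par`) and the similarity scores are hypothetical, chosen only for illustration; the function names are not from Ortholuge or BLAST, and real pipelines would read scores from BLAST output instead of a hand-written dictionary.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical similarity scores keyed by (gene_in_A, gene_in_B); higher is better.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores, queries: Iterable[str], targets: Iterable[str]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring target gene (if it has any hit)."""
    targets = list(targets)
    hits = {}
    for q in queries:
        scored = [(scores[(q, t)], t) for t in targets if (q, t) in scores]
        if scored:
            hits[q] = max(scored)[1]
    return hits


def reciprocal_best_hits(scores: Scores, genes_a: List[str], genes_b: List[str]) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where a's best hit is b and b's best hit is a."""
    # Assumption for this sketch: scores are symmetric, so mirror each entry.
    sym = dict(scores)
    sym.update({(b, a): s for (a, b), s in scores.items()})
    a_to_b = best_hits(sym, genes_a, genes_b)
    b_to_a = best_hits(sym, genes_b, genes_a)
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Human gene hsA, its true cattle ortholog btA, and a cattle paralog btA_par.
example_scores = {("hsA", "btA"): 900.0, ("hsA", "btA_par"): 400.0}

# Complete cattle gene set: the true ortholog is paired.
complete = reciprocal_best_hits(example_scores, ["hsA"], ["btA", "btA_par"])
# Incomplete cattle gene set (btA not yet sequenced): the paralog is paired instead.
incomplete = reciprocal_best_hits(example_scores, ["hsA"], ["btA_par"])
```

With the complete gene set the pair is `("hsA", "btA")`; once `btA` is missing, RBH reports `("hsA", "btA_par")` with no indication that anything is wrong, which is the case the distance-ratio check is meant to flag.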
Figure 2. An overview of the Ortholuge method. (A) Flow-chart outlining the main steps of the method. (B) The three ratios computed by Ortholuge. The phylogenetic distances in the numerator (dark line) and denominator (dashed line) for each ratio are shown, overlaid on the phylogenetic tree (gray line) that relates the ingroups and outgroup. Note that the three ratios are related such that Ratio2 = Ratio1 × Ratio3. Therefore, ratio data are presented both as frequency histograms for all three ratios (see Fig. 4) and as Ratio1 × Ratio2 plots (see Fig. 5) for just two of the three ratios – the latter is simply another way to conveniently visualize the data.
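The relation Ratio2 = Ratio1 × Ratio3 can be checked with a short sketch. The caption does not spell out each numerator and denominator, so the definitions below are an assumption: one assignment of the three pairwise distances that satisfies the stated identity and that yields Ratio1 and Ratio2 below 1 and Ratio3 near 1 when the ingroups are closer to each other than to the outgroup. The function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def ortholuge_ratios(d_in1_in2: float, d_in1_out: float, d_in2_out: float) -> Tuple[float, float, float]:
    """Compute three distance ratios from pairwise phylogenetic distances.

    Assumed definitions (not confirmed by the caption):
      Ratio1 = d(ingroup1, ingroup2) / d(ingroup1, outgroup)
      Ratio2 = d(ingroup1, ingroup2) / d(ingroup2, outgroup)
      Ratio3 = d(ingroup1, outgroup) / d(ingroup2, outgroup)
    """
    if d_in1_out <= 0 or d_in2_out <= 0:
        raise ValueError("distances to the outgroup must be positive")
    ratio1 = d_in1_in2 / d_in1_out
    ratio2 = d_in1_in2 / d_in2_out
    ratio3 = d_in1_out / d_in2_out
    return ratio1, ratio2, ratio3


# Hypothetical distances: ingroups close to each other, both far from the outgroup.
r1, r2, r3 = ortholuge_ratios(d_in1_in2=0.2, d_in1_out=0.5, d_in2_out=0.4)

# The identity from the caption holds under these definitions.
identity_holds = math.isclose(r2, r1 * r3)
```

Because the third ratio is determined by the other two, plotting Ratio1 against Ratio2 loses no information, which is why the two-ratio plots are sufficient.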
Figure 3. Ratio 1 (R1) ratio distribution curves for selected alignment characteristics. Higher quality mouse-rat-human ortholog sequence sets were analyzed to devise the gap-masking and sequence trimming approaches. These methods were evaluated for the introduction of ratio distribution biases for selected alignment characteristics such as identity and gap length. Ratio distribution curves were plotted for several characteristics. No obvious bias was observed through the introduction of our gap masking approach or alignment trimming.
Figure 4. Histogram illustrating the distribution of RBH-predicted (i.e. putative) orthologous groups across the three Ortholuge distance ratios. The results for predicted mouse-rat-human RBH ortholog sets (EGO RBH data set; 19,200 ortholog groups) are shown. Each of the three ratios forms its own distribution: Ratio1 and Ratio2 are generally located at ratio values lower than 1 and Ratio3 is generally located about a ratio value of 1, reflecting the relative distances between ingroups and between each ingroup and the outgroup. A similar ratio analysis was performed on a RefSeq RBH dataset (see Figure 3 of [Additional file 1]).
Figure 5. Ortholuge R1 × R2 plots (Ratio1 versus Ratio2) for selected eukaryotic data, where each point represents one putative ortholog group. (A) Putative orthologous groups identified using RBH for mouse-rat-human (Figure 4 shows the corresponding histogram). (B) Putative orthologous groups for mouse-rat-human from a higher quality (more precise) dataset (see Methods). It is expected that this more precise data set comprises primarily true orthologs. (C) A lower quality data set of RBH-predicted orthologous groups for cattle-human-mouse, where cattle genes have been identified from an incomplete genome sequence. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 30. Note that most orthologous groups exhibit low Ratio1 and Ratio2 values in all three data sets. For example, in panels A and D, about 86% of orthologs have Ratio1 and Ratio2 values less than 1. However, the higher quality data set (panels B and E) contains fewer points at higher Ratio values than the RBH-predicted data set. The lower quality data set contains more points with very high Ratio2 values (i.e. only 73% of points have Ratio1 and Ratio2 values less than 1), potentially reflecting the increased occurrence of probable cattle paralogs (i.e. paralogs being misidentified as orthologs by an RBH analysis with an incomplete cattle genome).
Figure 6. Ortholuge R1 × R2 plots for the prokaryotic data, illustrating two ortholog data sets and a true-negative data set. (A) Putative orthologous groups from an RBH-predicted data set. (B) Probable true orthologs from a higher quality (more precise) data set. (C) True-negative orthologs (i.e. true paralogs) from the "gene-loss simulation" data set. Darker dots represent putative orthologous groups which have had an ingroup1 true-negative (paralog) introduced into the group. Lighter dots represent putative orthologous groups which have had an ingroup2 true-negative (paralog) introduced into the group. (D), (E), (F) are zoomed-in versions of (A), (B), (C), respectively, with axes shown from 0 to 2 instead of 0 to 10. Most putative ortholog groups (particularly for the high quality data set) exhibit low Ratio1 and Ratio2 values (for example, all values are less than 1 for the points in the high quality data set plot), whereas most true-negative groups exhibit higher Ratio1 and Ratio2 values (i.e. only 9% of ingroup1 true negative introductions, and 6% of ingroup2 true negative introductions, have points with Ratio1 and Ratio2 values less than 1).
Figure 7. R1 × R2 plots, for the prokaryotic data, illustrating the effect of introducing outgroup paralogs (outgroup ortholog true-negatives) in the analysis. Unlike for other figures of R1 × R2 plots in the paper, only ratio ranges from 0 to 2 are shown for each axis. (A) RBH-predicted orthologous groups. (B) Outgroup paralogs from a true-negative data set where all possible outgroups were replaced with next best RBH paralogs. They cannot be well distinguished from other orthologs; however, this is actually promising, since Ortholuge is in essence identifying orthologs between the ingroups only. This analysis shows that an outgroup paralog does not interfere greatly with the identification of true orthologs shared between the ingroups.
Figure 8. Example of the generation of cut-offs for classification of ssd-orthologs and probable paralogs, based on an iterative-true-negative analysis (i.e. based on an introduction of random sets of true-negatives). The particular analysis illustrated here is a Ratio1 analysis for the mouse, rat, human RefSeq RBH dataset, with true-negatives introduced into the mouse (ingroup1) set. In panel A, the number of putative orthologous groups in each ratio range for the true-negative-transformed data set is shown for the whole data set (light shaded bars) and for the introduced true-negatives only (dark shaded bars). Note how the distribution of the data set differs from that of the true negatives (i.e. introduced paralogs). In panel B, the proportion of randomly introduced true-negatives at 0.5 ratio range intervals is used to formulate cut-offs (denoted by dashed lines) for classifying ssd-orthologs and probable paralogs for the analysis. For the ssd-orthologs cut-off (left-most dashed line), no more than 10% true negatives in a given ratio range are permitted for the ssd-orthologs range. For the probable paralogs cut-off (right-most dashed line) the proportion of true negatives is at or above 50 percent. The resulting middle region bounded by these two cut-off points establishes the "uncertain" orthology class ratio range. Dashed lines denoting these particular cut-offs are also illustrated on the figure in Panel A for reference. This approach for a true-negative analysis and cut-off generation is also performed for Ratio2 [Additional file 1], and the combination of cut-offs for Ratio1 and Ratio2 is used to classify putative orthologous groups from another data set (such as an RBH-predicted data set) into the three classification levels of "probable ssd-ortholog", "uncertain" and "probable paralogs".
Panel C schematically shows the areas of an R1 × R2 plot that would be classified in this way, with the cut-off numbers in this particular example matching the RefSeq RBH-based mouse-rat-human analysis (see Table 2 for how these ranges are numerically determined).
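The cut-off rule described for panel B can be sketched as follows. This is one reading of the caption, not the authors' implementation: ratios are binned at a fixed width, the lower cut-off is taken as the upper edge of the last bin in the leading run of bins with at most 10% true-negatives, and the upper cut-off as the lower edge of the first bin with at least 50% true-negatives. The function name, the handling of empty bins (they are skipped) and the toy data are all assumptions.

```python
from typing import Dict, Optional, Sequence, Tuple


def true_negative_cutoffs(
    ratios: Sequence[float],
    is_true_negative: Sequence[bool],
    bin_width: float = 0.5,
    max_tn_ortholog: float = 0.10,
    min_tn_paralog: float = 0.50,
) -> Tuple[Optional[float], Optional[float]]:
    """Derive (ssd-ortholog cut-off, probable-paralog cut-off) for one ratio."""
    totals: Dict[int, int] = {}
    negatives: Dict[int, int] = {}
    for ratio, tn in zip(ratios, is_true_negative):
        b = int(ratio // bin_width)  # index of the bin containing this ratio
        totals[b] = totals.get(b, 0) + 1
        negatives[b] = negatives.get(b, 0) + (1 if tn else 0)

    bins = sorted(totals)  # only non-empty bins are considered
    proportion = {b: negatives[b] / totals[b] for b in bins}

    # Lower cut-off: extend rightwards while bins stay at or below the 10% limit.
    lower = None
    for b in bins:
        if proportion[b] > max_tn_ortholog:
            break
        lower = (b + 1) * bin_width

    # Upper cut-off: first bin whose true-negative proportion reaches 50%.
    upper = None
    for b in bins:
        if proportion[b] >= min_tn_paralog:
            upper = b * bin_width
            break
    return lower, upper


# Toy data: true-negative proportions per bin are 0%, 10%, 25% and 50%.
toy_ratios = [0.2] * 10 + [0.7] * 10 + [1.2] * 4 + [1.7] * 2
toy_flags = [False] * 10 + [True] + [False] * 9 + [True] + [False] * 3 + [True, False]
toy_cutoffs = true_negative_cutoffs(toy_ratios, toy_flags)
```

For the toy data the result is `(1.0, 1.5)`: ratios up to 1.0 would be called ssd-orthologs, ratios above 1.5 probable paralogs, and the region in between uncertain. Running the same procedure on Ratio2 gives the second pair of cut-offs used in the combined classification.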
Table 1. Ortholuge ratios can help predict which gene in a given putative orthologous group is likely a paralog [a].
| Ingroup1 paralog | |||
| ↓ | Ingroup2 paralog | ||
| ↓ | ↓ | - | Outgroup paralog [b] |
| variable [d] | Ingroup1 & Ingroup2 paralogs [c] | ||
| ↓ | variable [d] | Ingroup1 & Outgroup paralogs | |
| variable [d] | ↓ | ↓ | Ingroup2 & Outgroup paralogs |
a Only selected scenarios are listed. Arrows indicate relative increases or decreases in a ratio value, when compared to the highest frequency values in a histogram plot (i.e. "expected" ratio value). Smaller arrows indicate that the increase is less. In the case of the ingroup1 or ingroup2 paralog scenarios, it will depend on how divergent the paralog is and how distant the outgroup is.
b Note that an outgroup paralog cannot be discriminated from cases of orthologs, nor does this analysis need to discriminate such cases (see text). However, this has been included in the table solely to illustrate how outgroup paralog cases can be discriminated (using Ratio 3) from cases where there is a combination of an ingroup1 (or ingroup2) paralog and an outgroup paralog.
c This scenario will resemble an ingroup1 paralog scenario or ingroup2 paralog scenario, if one of the two ingroup paralogs diverged much more than the other.
d The variation may be an increase or decrease, depending on which of the two paralogs is more diverged. Ratio 3 can help resolve such cases.
Table 2. Proportion of RBH-predicted [a] orthologs that are likely ssd-orthologs [b] and likely paralogs, according to Ortholuge analysis.
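| Comparison | Class | Ratio range [c] | % true-negatives in range | % RBH-predicted orthologs in range [e] |
| rat-mouse comparison (human outgroup) | Probable ssd-orthologs | R1 ≤ 0.60 and R2 ≤ 0.55 | 0.8% | 76% |
| | Uncertain | See footnote [f] | 16% | 14% |
| | Probable paralogs | R1 > 0.80 or R2 > 0.80 | 77% [d] | 10% |
| Pseudomonas comparison | Probable ssd-orthologs | R1 ≤ 0.55 and R2 ≤ 0.70 | 1.3% | 91% |
| | Uncertain | See footnote [f] | 24% | 4% |
| | Probable paralogs | R1 > 0.75 and R2 > 0.85 | 87% | 5% |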
a RBH-predicted = Predicted to be orthologous using a Reciprocal-best BLAST hit approach.
b "Supporting-species-divergence orthologs" = orthologs that appear to have diverged only due to speciation and have diverged at an expected relative rate for the species. Such orthologs are likely to have more similar function. See text for details.
c Ratio Range for both Ratio1 (R1) and Ratio2 (R2). See Figure 8C for a schematic illustration of the cut-off ranges on a R1 × R2 plot.
d Proportion of introduced true-negatives for the 25% true-negative analysis is shown here; however, the actual number of true-negatives will be higher due to false-positives likely occurring in the original ortholog dataset. This analysis was used to estimate % false predictions in range (see text and Figure 8).
e RBH-predicted data sets were examined using the cut-offs generated by the true-negative analysis, to identify what proportion of all RBH-predicted orthologs fell within each range. For the rat-mouse comparison 6294 RefSeq-based groups were classified into "probable ssd-ortholog", "uncertain", and "probable paralog" classes. For the Pseudomonas comparison, a total of 1456 groups were classified. Note that for an analysis of the EGO-based rat-mouse data set of 19,200 groups with the same cut-offs, 76% ssd-orthologs and 16% probable paralogs were predicted (when in-paralogs were not counted, because of the lack of differentiation of gene isoforms in the EGO data set).
f This "uncertain" category falls between the other two ranges and is graphically illustrated, for ease of understanding, in Figure 8C. This category follows the formula (R1 > a and R1 < b and R2 < d) or (R2 > c and R2 < d and R1 < a), where a and b are the lower and upper cut-off values, respectively, for Ratio1 (i.e. lower = cut-off for ssd-orthologs and higher = cut-off for probable paralogs), and c and d are the lower and upper cut-off values, respectively, for Ratio2. Note this "uncertain" category also contains counts of in-paralogs detected (7% of eukaryotic data, and negligible for prokaryotic data) – see text for details.
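The three-way classification implied by the ratio ranges and by footnote f can be written as a small function. This is a sketch, not the published implementation: the function name is hypothetical, and the treatment of points lying exactly on a cut-off is an assumption (the strict inequalities in footnote f leave those points unassigned, so here anything that is neither a probable ssd-ortholog nor a probable paralog is called uncertain). The example values are the rat-mouse cut-offs listed in the table (R1 ≤ 0.60 and R2 ≤ 0.55; R1 > 0.80 or R2 > 0.80).

```python
def classify_group(r1: float, r2: float, a: float, b: float, c: float, d: float) -> str:
    """Classify one putative orthologous group from its Ratio1 and Ratio2 values.

    a, b: lower and upper cut-offs for Ratio1.
    c, d: lower and upper cut-offs for Ratio2.
    """
    if r1 <= a and r2 <= c:
        return "probable ssd-ortholog"
    if r1 > b or r2 > d:
        return "probable paralog"
    # Everything between the two ranges, including boundary points (assumption).
    return "uncertain"


# Rat-mouse cut-offs taken from the table: a=0.60, b=0.80, c=0.55, d=0.80.
RAT_MOUSE = dict(a=0.60, b=0.80, c=0.55, d=0.80)

labels = [
    classify_group(0.40, 0.40, **RAT_MOUSE),  # both ratios low
    classify_group(0.70, 0.50, **RAT_MOUSE),  # a < R1 < b and R2 < d
    classify_group(0.50, 0.60, **RAT_MOUSE),  # c < R2 < d and R1 < a
    classify_group(0.50, 0.90, **RAT_MOUSE),  # R2 above its upper cut-off
]
```

The second and third calls correspond to the two clauses of the formula in footnote f, and both fall in the uncertain class.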