Ruchira S Datta, Christopher Meacham, Bushra Samad, Christoph Neyer, Kimmen Sjölander.
Abstract
Ortholog detection is essential in functional annotation of genomes, with applications to phylogenetic tree construction, prediction of protein-protein interaction and other bioinformatics tasks. We present here the PHOG web server employing a novel algorithm to identify orthologs based on phylogenetic analysis. Results on a benchmark dataset from the TreeFam-A manually curated orthology database show that PHOG provides a combination of high recall and precision competitive with both InParanoid and OrthoMCL, and allows users to target different taxonomic distances and precision levels through the use of tree-distance thresholds. For instance, OrthoMCL-DB achieved 76% recall and 66% precision on this dataset; at a slightly higher precision (68%) PHOG achieves recall that is 10 percentage points higher (86%). InParanoid achieved 87% recall at 24% precision on this dataset, while a PHOG variant designed for high recall achieves 88% recall at 61% precision, increasing precision by 37 percentage points over InParanoid. PHOG is based on pre-computed trees in the PhyloFacts resource, and contains over 366 K orthology groups with a minimum of three species. Predicted orthologs are linked to GO annotations, pathway information and biological literature. The PHOG web server is available at http://phylofacts.berkeley.edu/orthologs/.
Year: 2009 PMID: 19435885 PMCID: PMC2703887 DOI: 10.1093/nar/gkp373
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Results of orthology prediction methods assessed on a benchmark dataset from the TreeFam-A resource. Performance was evaluated on 100 human proteins selected from the TreeFam-A manually curated orthology database, with orthologs to each human protein from mouse, zebrafish and fruit fly. Methods evaluated include several PHOG variants, OrthoMCL-DB, InParanoid and SCI-PHY. PHOG-S represents super-orthology predictions, PHOG-O represents standard orthology predictions and PHOG-T represents the tree-distance thresholded variants. PHOG-T variants PHOG-T(M), PHOG-T(Z) and PHOG-T(F) correspond to tree-distance thresholds selected for optimal performance on this dataset for mouse, zebrafish and fruit fly, respectively. Tree-distance thresholds were 0.09375 (mouse), 0.296875 (zebrafish) and 0.9375 (fruit fly). SCI-PHY uses hierarchical clustering and encoding cost measures to define functional subtypes and is included for comparison. Recall measures the fraction of TreeFam-A orthologs detected by a method. Precision measures the fraction of a method's predicted orthologs that are included in TreeFam-A. A True Positive (TP) is an orthology pair included in TreeFam-A that is also predicted by a method, a False Positive (FP) is an orthology pair predicted by a method that is not included in TreeFam-A and a False Negative (FN) is a TreeFam-A ortholog that is missed by a method. Left: recall-precision curves over the entire dataset. Right: table of results for each method for individual species as well as over the entire dataset. Values in red highlight the recall and precision for species-specific threshold selections.
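The TP, FP and FN definitions in the Figure 1 caption translate directly into recall = TP / (TP + FN) and precision = TP / (TP + FP). The following is a minimal sketch of that calculation, not the authors' evaluation code: the function name `recall_precision` and the sequence identifiers in the toy data are hypothetical, and ortholog pairs are assumed to be unordered.

```python
def recall_precision(predicted, reference):
    """Return (recall, precision) for two collections of ortholog pairs.

    Pairs are treated as unordered, so (a, b) and (b, a) count as the
    same pair. `reference` plays the role of the curated benchmark
    (TreeFam-A in the caption); `predicted` is a method's output.
    """
    predicted = {frozenset(pair) for pair in predicted}
    reference = {frozenset(pair) for pair in reference}

    tp = len(predicted & reference)   # in the benchmark and predicted
    fp = len(predicted - reference)   # predicted but not in the benchmark
    fn = len(reference - predicted)   # in the benchmark but missed

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy data with made-up identifiers (not taken from the benchmark).
reference_pairs = [
    ("HUMAN_A", "MOUSE_A"),
    ("HUMAN_A", "ZFISH_A"),
    ("HUMAN_B", "MOUSE_B"),
    ("HUMAN_B", "FLY_B"),
]
predicted_pairs = [
    ("MOUSE_A", "HUMAN_A"),  # true positive, given in reverse order
    ("HUMAN_B", "MOUSE_B"),  # true positive
    ("HUMAN_B", "MOUSE_C"),  # false positive
]

recall, precision = recall_precision(predicted_pairs, reference_pairs)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In the toy data two of the four reference pairs are recovered and one of the three predictions is wrong, so recall is 2/4 and precision is 2/3. Filtering `predicted_pairs` to a single species before calling the function would give per-species figures analogous to those in the table described in the caption.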
Figure 2. PhyloFacts ortholog identification pipeline. The input is a protein sequence, either in FASTA format (for BLAST search) or as an accession. Results of a sequence accession search are displayed in an Orthology Report including a table of all PHOGs containing the query (F) followed by a table displaying the sequences contained in these PHOGs (G). Links in the columns labelled PhyloFacts Orthology Group retrieve the corresponding PHOG report (E). BLAST results are displayed in an initial table of results (not shown); users then select one of the sequences in the table to retrieve the Orthology Report for their selected sequence. (A) Protein sequence query. In this example, the query sequence consists of two evolutionarily conserved domains: an N-terminal Ig domain (pink) followed by a transmembrane helix and a C-terminal Toll Interleukin Receptor (TIR) domain (blue). (B–D) PhyloFacts trees containing the query sequence are identified, and orthologs are extracted from the orthology group for the sequence (indicated by red subtrees). In this example, the sequence is contained in three PhyloFacts trees. The tree shown in B corresponds to sequences sharing the same overall domain architecture (global homologs). The trees shown in C and D contain sequences that share local (partial) homology along a single domain; the tree in C contains sequences having an Ig domain and the tree in D contains sequences having a TIR domain. (Note that the taxonomic distributions of these PHOGs differ, corresponding to differences in orthology predictions across these domains.) (E) PHOG report: this report displays summary data for the PHOG, followed by a table listing all the orthologs in the PHOG including a link to the sequence database from which the member was drawn, the species of origin, description and links to external resources (e.g. SwissProt, KEGG and BioCyc). (F) List of PHOGs containing the query. This table contains summary data about each PHOG, including PFAM domains, GO annotations and evidence codes and taxonomic distribution. (G) Orthology report: all members of all PhyloFacts orthology groups containing the query are gathered and presented in a table. Note that some orthologs to the query will belong to more than one PHOG (i.e. containing both the ortholog and the query); the column ‘PhyloFacts Orthology Group’ provides a link to the most informative PHOG for each sequence as well as to the PhyloFacts book containing that PHOG. GO annotations and evidence codes, PFAM domains and links to external resources (e.g. SwissProt, KEGG, BioCyc and GO) are also provided. These data are also overlaid on the phylogenetic tree for the PHOG as well as for the family tree from which the PHOG was drawn, and can be viewed using the PhyloScope tree viewer.