OrthoDB: the hierarchical catalog of eukaryotic orthologs.

Evgenia V Kriventseva, Nazim Rahman, Octavio Espinosa, Evgeny M Zdobnov.

Abstract

The concept of orthology is widely used to relate genes across different species using comparative genomics, and it provides the basis for inferring gene function. Here we present the web-accessible OrthoDB database that catalogs groups of orthologous genes in a hierarchical manner, at each radiation of the species phylogeny, from more general groups to more fine-grained delineations between closely related species. We used a COG-like and Inparanoid-like ortholog delineation procedure on the basis of all-against-all Smith-Waterman sequence comparisons to analyze 58 eukaryotic genomes, focusing on vertebrates, insects and fungi to facilitate further comparative studies. The database is freely available at http://cegg.unige.ch/orthodb.

Year: 2007 PMID: 17947323 PMCID: PMC2238902 DOI: 10.1093/nar/gkm845

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Identification of orthologous genes is the cornerstone of comparative genomics, which is increasingly becoming an essential part of modern molecular biology. Functions of orthologous genes are often preserved through evolution because, by definition, orthologous genes descend by speciation from a common ancestral gene (1–3). Although the conservation of ortholog functions is neither required nor guaranteed, it is the most likely evolutionary scenario and provides a strong working hypothesis, particularly when the ortholog copy-number is preserved over a long period of time. Identification of orthologs is intricate because it assumes knowledge of the ancestral state of the genes and requires knowledge of the complete gene repertoires. It is also complicated by gene duplication, fusion and exon shuffling, as well as pseudogenization and loss, which make the problem particularly challenging for complex eukaryotic genomes. The fast-growing number of available complete genomes facilitates a much better resolution of the gene genealogies, while at the same time greatly increasing the computational challenges. There are two main approaches to delineating orthologous genes: (i) reconciliation of gene trees with the species phylogeny and (ii) classification of all-against-all sequence comparisons of complete genomes. The phylogeny approach takes advantage of the well-studied evolution of conserved cores of globular proteins using quantitative models of amino acid substitutions (4–6). Notable examples of the tree-based approach to delineating orthologous genes are HOVERGEN (7) and TreeFam (8). Expert curation of the phylogenetic trees and the underlying multiple sequence alignments is both advantageous, providing better accuracy, and disadvantageous, limiting the comprehensiveness, homogeneity of quality and expandability to new species.
Although, given the appropriate data, phylogenetic methods are likely to give more accurate models of ancestral sequences and therefore to yield more accurate orthology predictions, their applicability to current genomic data is hindered by several factors, most importantly: (i) they require substantially more computational resources, (ii) the reconciliation of gene and species trees relies on poorly quantified models of gene duplication and loss, and (iii) they are sensitive to the completeness of predicted genes, as the evolutionary models are designed only for well-conserved globular cores of proteins, and missing data (gaps) render the approach inapplicable. The tree-based approaches also require knowledge of the species phylogeny, and although a consensus on animal phylogeny seems to be close, it is still constantly challenged. The alternative approach of clustering orthologous genes on the basis of their whole-length similarity around Best-Reciprocal-Hits (BRHs, also known as SymBets, bi-directional BeTs and best–best hits, denoting sequences most similar to each other in between-genome comparisons) was first introduced by the database of Clusters of Orthologous Groups (COGs) (9). Triggered by the earlier availability of much smaller and simpler bacterial genomes, the database quickly gained wide recognition and was later extended to eukaryotic genomes (KOGs) (9). The identification of BRHs is currently widely adopted in the field of comparative genomics for its simplicity and the feasibility of applying it to large-scale data. In terms of phylogenetic trees, BRHs can be interpreted as genes from different species with the shortest connecting path over the distance-based tree. The simplest application of this approach, using BLAST (10) for interspecies comparisons, suffers from inaccuracies in sequence distance estimates and ignores many gene duplications after speciation that are, in fact, co-orthologs that are difficult to differentiate functionally.
However, using these BRH genes as anchors of orthologous groups in different species, additional co-orthologs can be identified as genes that are more similar to them in intra-genome comparisons than to any other gene in the other genomes, as popularized by the pairwise Inparanoid approach (11). There are also a few alternative clustering heuristics with varying compromises between specificity and selectivity (12) that focus on the growing number of available eukaryotic genomes, such as the probabilistic approach of OrthoMCL (13) and the vertebrate-centric Ensembl-Compara (14). Another important feature of orthology and paralogy classification, which is currently underappreciated, is that it is relative to a particular ancestor, as orthology of genes is defined by their descent from a common ancestral gene by speciation (1–3). Therefore, the more distantly related the species considered, the more general (inclusive) the orthologous groups become, because all lineage-specific duplications since the last common ancestor should be considered co-orthologs. Conversely, orthologous groups become more fine-grained (more 1 : 1 relations) when closely related species are considered, as there has been less time for gene duplications to occur. The concept of hierarchical orthologous groups has already prompted the development of the Levels of Orthology From Trees (LOFT) tool (15) to interpret gene trees in the context of the species tree, the COrrelation COefficient-based CLustering (COCO-CL) methodology (16) to refine clusters of homologous genes, and the PHOG approach (17) to resolve orthology at each taxonomy node using explicit modeling of the ancestral sequences and relying on PHOG-BLAST (18) profile–profile comparisons. Aiming to fuel comparative genomic studies, we focused here on the most represented eukaryotic phyla; namely, we analyzed 23 fungal, 19 insect (plus one crustacean) and 15 vertebrate species with complete proteomes available (Table 1).
For this analysis, we employed our own implementation of COG-like and Inparanoid-like ortholog identification procedures from all-against-all sequence comparisons across multiple species (19–22), and here we explicitly delineate the hierarchy of the orthologous groups, consistently applying the procedure to the sets of species with varying levels of relatedness according to the species tree (Figure 1).

Table 1

Sets of covered complete proteomes

*Only the longest transcript per gene was considered.

**Including one crustacean.

Figure 1.

Example screenshot of the OrthoDB web interface (http://cegg.unige.ch/orthodb). The left panel enumerates the modes to query and browse the database by: (1) a keyword, (2) a user specified phylogenetic gene copy-number profile, (3) a common phylogenetic profile, or (4) sequence homology search; the middle panel is reserved for displaying results; and the right panel accommodates help and query history messages.

METHODS

Orthology delineation

Groups of orthologous genes were automatically identified using a strategy employed previously (19–22) that is based on all-against-all protein sequence comparisons using the Smith-Waterman algorithm as implemented in ParAlign (23) with default parameters, followed by clustering of best reciprocal hits (BRHs), from the highest-scoring ones down to an e-value cutoff of 10^-6 for triangulating BRHs or 10^-10 for unsupported BRHs, and requiring a sequence alignment overlap of at least 30 amino acids across all members of a group. Furthermore, the orthologous groups were expanded by genes that are more similar to each other within a proteome than to any gene in any of the other species, and by very similar copies sharing over 97% sequence identity, which were identified initially using CD-Hit (24). We considered only the longest transcript per gene, or the most common one as specified in UniProt (25). The outlined procedure was first applied to all species considered, and then to each subset of species according to the radiations of the phylogenetic tree.
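The core of the procedure above, best reciprocal hits with a relaxed cutoff for triangulated pairs, might be sketched as follows. This is only an illustration and not the OrthoDB implementation: the function names, the hit tuple layout (gene, species, gene, species, score, e-value) and the reading of "triangulating" as "both genes share a BRH partner in a third species" are assumptions made here. The 30-residue overlap requirement, the in-paralog expansion and the score-ordered clustering are omitted.

```python
from collections import defaultdict

def best_hits(hits):
    """Keep, for every (gene, other species) pair, the top-scoring hit.

    hits: iterable of (gene_a, species_a, gene_b, species_b, score, evalue).
    """
    best = {}
    for ga, sa, gb, sb, score, ev in hits:
        if sa == sb:
            continue  # within-species hits are not BRH candidates
        key = (ga, sb)
        if key not in best or score > best[key][2]:
            best[key] = (gb, sa, score, ev)
    return best

def reciprocal_best_hits(hits):
    """Return {frozenset({gene_a, gene_b}): evalue} for mutual best hits."""
    best = best_hits(hits)
    brh = {}
    for (ga, sb), (gb, sa, _score, ev) in best.items():
        back = best.get((gb, sa))
        if back is not None and back[0] == ga:
            # Use the weaker (larger) e-value of the two directions.
            brh[frozenset((ga, gb))] = max(ev, back[3])
    return brh

def filter_brh(brh, tri_cutoff=1e-6, pair_cutoff=1e-10):
    """Accept triangulated BRHs at tri_cutoff, unsupported ones at pair_cutoff."""
    neighbours = defaultdict(set)
    for pair, ev in brh.items():
        if ev <= tri_cutoff:
            a, b = tuple(pair)
            neighbours[a].add(b)
            neighbours[b].add(a)
    kept = set()
    for pair, ev in brh.items():
        a, b = tuple(pair)
        # Assumed meaning of "triangulating": a shared BRH partner exists.
        triangulated = bool(neighbours[a] & neighbours[b])
        if (triangulated and ev <= tri_cutoff) or ev <= pair_cutoff:
            kept.add(pair)
    return kept
```

With these defaults, a pair supported by a third species survives at an e-value of 1e-8, whereas an isolated pair with the same e-value is discarded.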

Phylogeny reconstruction

To guide computation of the ortholog hierarchy we produced the multiple alignment of concatenated single-copy orthologs, using well-aligned regions extracted with Gblocks (26) from individually aligned orthologous sequences using Muscle (27). This was used to compute the phylogenetic trees using the maximum-likelihood method as implemented in PHYML (28), employing the JTT model, a gamma correction with four discrete classes, and an estimated alpha parameter and proportion of invariable sites.
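The concatenation step can be illustrated with a short sketch. It assumes that each single-copy orthologous group has already been aligned and trimmed, and that each alignment is available as a mapping from species name to aligned sequence; the function name and this data layout are hypothetical and do not reproduce the Muscle or Gblocks interfaces.

```python
def concatenate_alignments(alignments, species):
    """Join per-group alignments into one concatenated alignment per species.

    alignments: list of {species: aligned_sequence}, one dict per
                single-copy orthologous group (assumed layout).
    species:    species names; each must be present in every alignment.
    """
    columns = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        for sp in species:
            columns[sp].append(aln[sp])
    return {sp: "".join(parts) for sp, parts in columns.items()}
```

The resulting concatenated alignment is the kind of input a maximum-likelihood program would then take for tree inference.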

DATABASE CONTENT

Overview statistics

As detailed in Table 1 we analyzed 23 complete proteomes of fungal species, 19 insects and 15 vertebrates at different levels of the species phylogeny. Overall, this effort spans 870 737 genes, 82% of which have been classified into 10 876 orthologous groups in fungi, 19 835 in insects and 23 940 in vertebrates, providing the first systematic classification of the wealth of data that will provide the basis for further comparative evolutionary analyses.

WEB INTERFACE

The database is freely accessible at http://cegg.unige.ch/orthodb.

Hierarchy of the orthologous groups

Orthology and paralogy classification is relative to the set of species considered [namely, to the particular ancestor (1–3)] and is more general (inclusive) for distantly related species, and more fine-grained (specific) for closely related species. We therefore delineated orthologous groups at each radiation node of the species phylogeny. To clearly show the hierarchy level of the classifications and to allow easy navigation along the hierarchy we display the interactive species tree (Figure 1). The default level for an initial user query is set to fungi, arthropods or vertebrates and the level can be adjusted afterwards by selecting a radiation of interest on the phylogeny. Each result page provides a precompiled Bookmarklet, a snippet of JavaScript code that can be easily bookmarked in the user browser, to allow direct query to a particular phylogeny level.
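Delineating groups at each radiation node amounts to enumerating the species set below every internal node of the species tree and running the same procedure on each set. A minimal sketch, assuming the tree is given as nested tuples with species names at the leaves (a toy representation chosen here, not the format used by OrthoDB):

```python
def leaves(tree):
    """Return the species names under a (sub)tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    out = []
    for child in tree:
        out.extend(leaves(child))
    return out

def radiation_nodes(tree):
    """Yield the species set under every internal node, root first."""
    if isinstance(tree, str):
        return
    yield tuple(leaves(tree))
    for child in tree:
        yield from radiation_nodes(child)

# Toy tree for illustration only.
toy_tree = (("human", "mouse"), "chicken")
levels = list(radiation_nodes(toy_tree))
```

Each element of `levels` is one species subset at which the orthology delineation would be repeated, from the most inclusive set down to the most closely related species.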

Stable identifiers

We assigned short identifiers to the generated orthologous groups using the Noid utility, and we will keep them unique across subsequent updates of the database to allow stable references to the data.

Querying by keywords

The database is searchable by the relevant identifiers used for the proteins or orthologous groups, as well as by keywords associated with the protein annotation in UniProt (25) or Ensembl (14). Currently the search is implemented as a MySQL full-text index and the query is interpreted in Boolean mode, which allows the use of the ‘+’ and ‘-’ operators to indicate that a word is required to be present or absent, respectively, for a match to occur; parentheses group words into subexpressions; ‘*’ serves as the wildcard operator; and a phrase is matched literally if it is enclosed within quotes (e.g. ‘cytochrome c’). The results always refer to the relevant orthologous groups, not separate genes.
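As an illustration of the Boolean-mode syntax, the sketch below assembles a search expression and pairs it with a parameterized MySQL full-text query. The `MATCH ... AGAINST (... IN BOOLEAN MODE)` form and the use of double quotes for literal phrases follow the MySQL documentation; the table name `annotations`, the columns `group_id` and `description`, and the helper function are hypothetical and do not describe the actual OrthoDB schema.

```python
def boolean_query(required=(), excluded=(), phrases=(), prefixes=()):
    """Build a MySQL Boolean-mode full-text search expression."""
    parts = [f"+{word}" for word in required]      # word must be present
    parts += [f"-{word}" for word in excluded]     # word must be absent
    parts += [f'"{phrase}"' for phrase in phrases]  # literal phrase
    parts += [f"{stem}*" for stem in prefixes]     # wildcard on word ending
    return " ".join(parts)

# Hypothetical schema; the expression is passed as a bound parameter.
SQL = (
    "SELECT DISTINCT group_id FROM annotations "
    "WHERE MATCH(description) AGAINST (%s IN BOOLEAN MODE)"
)

expression = boolean_query(
    required=["kinase"],
    excluded=["putative"],
    phrases=["cytochrome c"],
    prefixes=["oxid"],
)
```

Passing the expression as a bound parameter rather than interpolating it into the SQL text keeps user input from altering the statement.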

Filtering by phylogenetic profile

Another feature of the database interface is filtering orthologous groups by a phylogenetic profile. This can be done by activating the set of selectors next to the phylogenetic tree and specifying the ortholog copy-number requirements in the species of interest, where the ‘?’ notation stands for no restriction (‘any number’) and ‘0’, ‘1’ and ‘>1’ are self-explanatory. The ‘Filter’ button in the ‘Specify copy-number profile’ section executes the corresponding query. In addition, we provide a set of precompiled queries for phylogenetic profiles of common interest (via the selection list) that are more complicated to express, e.g. the ‘all but one’ type: all single-copy orthologs but allowing for a loss or run-away in one of the species, or multigene orthologs in all but one species, etc. This allows viewing of the gene clusters that have undergone expansions or losses in specific lineages, which is informative in the evolutionary context (29). These queries, as well as the text search, are performed with respect to the selected speciation root, marked in red on the phylogenetic tree.
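The copy-number selectors can be read as a per-species predicate on gene counts. The following sketch assumes each orthologous group is summarized as a mapping from species to copy number; the function names and data layout are illustrative, and the ‘all but one’ helper implements one possible reading (exactly one species deviates from single copy).

```python
def matches(count, rule):
    """Test one copy number against a selector: '?', '0', '1' or '>1'."""
    if rule == "?":
        return True
    if rule == "0":
        return count == 0
    if rule == "1":
        return count == 1
    if rule == ">1":
        return count > 1
    raise ValueError(f"unknown rule: {rule!r}")

def filter_groups(groups, profile):
    """Return ids of groups whose copy numbers satisfy the whole profile.

    groups:  {group_id: {species: copy_number}}; a missing species counts as 0.
    profile: {species: rule}
    """
    return [
        gid for gid, counts in groups.items()
        if all(matches(counts.get(sp, 0), rule) for sp, rule in profile.items())
    ]

def single_copy_except_one(counts, species):
    """True if exactly one of the listed species is not single-copy."""
    return sum(counts.get(sp, 0) != 1 for sp in species) == 1
```

For example, the profile `{"A": "1", "B": ">1", "C": "?"}` selects groups that are single-copy in A and expanded in B, regardless of C.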

Query by sequence homology

Not all protein identifiers are widely known, particularly for automatically annotated genomes, and functional annotations for many genes are still anticipated. We therefore provide data querying by sample sequences: a user-submitted sequence is matched using BLAST against the collected proteomes, and the top five matches from distinct proteomes are shown to the user and used to retrieve the associated orthologous groups, ranked by the number of hits to each group. Please note that if a sequence from an as yet unanalyzed species is used, the query will return the best-matching ortholog cluster; however, this may not be sufficient to assume orthology.
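The ranking step described above might look roughly as follows once the similarity search has produced its hits. The sketch assumes hits are available as (protein, proteome, score) tuples and that "top five matches from distinct proteomes" means the best hit per proteome, limited to five proteomes; both the data layout and that interpretation are assumptions, and the search itself is not shown.

```python
from collections import Counter

def top_hits_distinct_proteomes(hits, n=5):
    """Keep the best-scoring hit per proteome, for at most n proteomes."""
    best = {}
    for protein, proteome, score in sorted(hits, key=lambda h: h[2], reverse=True):
        if proteome not in best:
            best[proteome] = (protein, proteome, score)
        if len(best) == n:
            break
    return list(best.values())

def rank_groups(hits, group_of, n=5):
    """Rank orthologous groups by how many of the top hits fall into them.

    group_of: {protein: group_id}; proteins without a group are skipped.
    """
    top = top_hits_distinct_proteomes(hits, n)
    counts = Counter(group_of[p] for p, _, _ in top if p in group_of)
    return counts.most_common()
```

A group that collects hits from several proteomes is ranked above one supported by a single hit, which matches the ordering by number of hits per group.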

Export of data

All results or particular groups can be retrieved as tab-delimited text or as FASTA-formatted protein sequences annotated with the orthologous group.

FUTURE PERSPECTIVES

All current approaches to identifying orthologous groups of genes have different deficiencies, and there are ways to improve their sensitivity and specificity. The implemented infrastructure in principle does not depend on the particular choice of method, although our own implementation of a COG-like and Inparanoid-like ortholog identification procedure seems to produce reliable results according to extensive checks in the frame of our previous research projects. We also plan to test other available orthology delineation procedures.

29 in total

1. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.

Authors: J Castresana
Journal: Mol Biol Evol Date: 2000-04 Impact factor: 16.240

2. Orthology, paralogy and proposed classification for paralog subtypes.

Authors: Erik L L Sonnhammer; Eugene V Koonin
Journal: Trends Genet Date: 2002-12 Impact factor: 11.639

3. Amino acid substitution matrices from protein blocks.

Authors: S Henikoff; J G Henikoff
Journal: Proc Natl Acad Sci U S A Date: 1992-11-15 Impact factor: 11.205

4. The origin and evolution of protein superfamilies.

Authors: M O Dayhoff
Journal: Fed Proc Date: 1976-08

5. Distinguishing homologous from analogous proteins.

Authors: W M Fitch
Journal: Syst Zool Date: 1970-06

6. HOVERGEN: a database of homologous vertebrate genes.

Authors: L Duret; D Mouchiroud; M Gouy
Journal: Nucleic Acids Res Date: 1994-06-25 Impact factor: 16.971

7. Effects of amino acid substitutions in the -10 binding region of sigma E from Bacillus subtilis.

Authors: C H Jones; K M Tatti; C P Moran
Journal: J Bacteriol Date: 1992-11 Impact factor: 3.490

8. Quantification of ortholog losses in insects and vertebrates.

Authors: Stefan Wyder; Evgenia V Kriventseva; Reinhard Schröder; Tatsuhiko Kadowaki; Evgeny M Zdobnov
Journal: Genome Biol Date: 2007 Impact factor: 13.583

9. The COG database: an updated version includes eukaryotes.

Authors: Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal: BMC Bioinformatics Date: 2003-09-11 Impact factor: 3.169

10. MUSCLE: a multiple sequence alignment method with reduced time and space complexity.

Authors: Robert C Edgar
Journal: BMC Bioinformatics Date: 2004-08-19 Impact factor: 3.169
