eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations.

J Muller, D Szklarczyk, P Julien, I Letunic, A Roth, M Kuhn, S Powell, C von Mering, T Doerks, L J Jensen, P Bork.

Abstract

The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224,847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2,242,035 proteins (built from 2,590,259 proteins) and provides a broad functional description for at least 1,966,709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de.

Year: 2009 PMID: 19900971 PMCID: PMC2808932 DOI: 10.1093/nar/gkp951

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Next-generation sequencing technologies are now generating a vast amount of sequence data. This leads to a dramatic increase in the number of predicted protein sequences, which serve as a starting point for structural, functional and phylogenomic studies. In such studies, high-throughput comparative analyses are often required to transfer information between organisms, for which the concept of orthology is crucial. The original definition by Fitch (1) describes orthologs as genes that diverged through a speciation event, as opposed to paralogs, which diverged after a duplication event. This has been extended and refined by introducing the concepts of orthologous groups (OGs) (2), in-paralogs and out-paralogs (3,4). In practice, however, the identification and classification of homologous genes remain very difficult and rely on operational definitions. An enormous effort is being put into the development of different approaches to establish orthologous relationships between genes from different genomes. These include graph-based methods, such as the reciprocal-best-hit approach (5), identification of best-hit triangles (2,6–8) and clustering-based approaches (9–11), as well as tree-based methods (12–16). In addition to the quality of the grouping of genes, the practical usability of OGs is determined by the ability to provide a robust functional annotation. Thus, newer projects not only aggregate orthology information from various sources to allow comparison between methods but also aim to provide annotation tools (17,18). Nevertheless, evolutionary genealogy of genes: non-supervised OGs (eggNOG) (19) and the COG/KOG/arCOG resources (2,6,7) are still the only databases providing explicit functional annotations for the OGs at different hierarchical levels, whereby the COG/KOG resource is based on a robust manual expert annotation, which eggNOG uses and automatically extends (19).
Here, we describe the new features of the second version of eggNOG, a resource that provides OGs from the three domains of life at several levels of resolution. eggNOG v2 contains twice as many species and proteins as the previous version, additional hierarchical levels allowing higher resolution for a number of taxonomic groups, new annotation sources and an extended interface for an in-depth analysis of orthologous relationships.

CONSTRUCTION OF HIERARCHICAL OGs

The automated procedure described previously (19) has been used to assemble proteins into OGs from 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes). Complete proteomes were downloaded from the RefSeq (20), Ensembl (21), GiardiaDB (22) or TAIR (23) databases. This particular data set also forms the basis for STRING v8 (24) and STITCH v2 (25), allowing for easy integration across these databases. Altogether, the protein data set covers 2 590 259 proteins of which 2 242 035 (87%) were included in at least one of 224 847 OGs generated by eggNOG. The growing number of species and proteins included in this release drastically increased the computational time. All-against-all similarity searches have therefore been performed using Basic Local Alignment Search Tool (BLAST) (26) instead of the Smith–Waterman algorithm (27). Compared to the 4873 COGs and the 4850 KOGs that are constructed across all three domains of life and for all eukaryotes, respectively, this procedure assembles additional proteins into NOGs (440 359 proteins into 59 497 NOGs and 181 427 into 17 845 euNOGs). These complement the published COGs and KOGs built respectively for 66 and seven species (6), which are extended in eggNOG to cover 630 species encompassing, respectively, 1 547 381 and 483 043 proteins. To provide a higher resolution of OGs in frequently used taxonomic groupings, we applied our procedure to several subsets of organisms separately. We updated the previously computed more fine-grained NOGs at the level of fungi (fuNOGs), metazoans (meNOGs), insects (inNOGs), vertebrates (veNOGs) and mammals (maNOGs) and added groups for archaea (arNOGs), fishes (fiNOGs), rodents (roNOGs) and primates (prNOGs).
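The two graph-based steps named above, reciprocal best matches and triangular linkage, can be illustrated with a toy sketch. It is not the eggNOG pipeline: the protein names and scores are invented, proteins are represented as hypothetical `(genome, name)` tuples, and steps such as the handling of in-paralogs or the seeding from existing COGs and KOGs are left out.

```python
from itertools import combinations

def best_hits(scores):
    """Map (query, target genome) to the best-scoring subject in that genome."""
    best = {}
    for (query, subject), score in scores.items():
        if query[0] == subject[0]:
            continue  # within-genome hits are ignored in this toy version
        key = (query, subject[0])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}

def reciprocal_best_hits(scores):
    """Return unordered protein pairs that are each other's best hit."""
    best = best_hits(scores)
    pairs = set()
    for (query, _genome), subject in best.items():
        if best.get((subject, query[0])) == query:
            pairs.add(frozenset((query, subject)))
    return pairs

def triangles(pairs):
    """Return triples of proteins that are pairwise reciprocal best hits."""
    proteins = sorted({p for pair in pairs for p in pair})
    return [
        {a, b, c}
        for a, b, c in combinations(proteins, 3)
        if all(frozenset(e) in pairs for e in ((a, b), (a, c), (b, c)))
    ]

def merge_triangles(tris):
    """Merge triangles that share an edge (two proteins) into groups."""
    parent = list(range(len(tris)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for i, tri in enumerate(tris):
        for edge in combinations(sorted(tri), 2):
            j = edge_owner.setdefault(edge, i)
            parent[find(i)] = find(j)
    groups = {}
    for i, tri in enumerate(tris):
        groups.setdefault(find(i), set()).update(tri)
    return list(groups.values())

def symmetric(pair_scores):
    """Expand {(a, b): s} so that both (a, b) and (b, a) are present."""
    out = {}
    for (a, b), s in pair_scores.items():
        out[(a, b)] = s
        out[(b, a)] = s
    return out

# Invented toy data: four genomes, one extra paralog (b2) in genome B.
A, B, B2, C, D = ("A", "a1"), ("B", "b1"), ("B", "b2"), ("C", "c1"), ("D", "d1")
scores = symmetric({
    (A, B): 200, (A, B2): 80, (A, C): 180, (B, C): 190,
    (B, D): 150, (C, D): 160, (B2, C): 70,
})

pairs = reciprocal_best_hits(scores)
groups = merge_triangles(triangles(pairs))
print(groups)
```

In this example the triangles A-B-C and B-C-D share the edge B-C and are merged into one group, while the lower-scoring paralog b2 is not a reciprocal best hit of any protein and stays outside.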

Extending the automated annotation of protein function

An important feature of eggNOG is the functional annotation of the OGs. Our original pipeline, providing functional descriptions for the NOGs, is now complemented by an automatic inference of functional categories (FCs) which were taken from the COG database (2). The 25 FCs available from the COG resource have been widely used to assess comparative genomics studies and will enable higher-order analyses of OGs identified in any data set. We use two complementary methods to infer FCs of OGs based on the 4617 COGs (used for NOGs and arNOGs) and 4381 KOGs (used for all other OGs). The first method uses Support Vector Machines (SVM) trained on the COGs and KOGs to classify NOGs into the 25 FCs based on feature vectors. Two feature vectors were created for each OG. One was built from functional information mapped onto the eggNOG protein data set, including KEGG pathways and modules (28), GO terms (29), SMART domains (30), PFAM domains (31), UniProt keywords (32) and words from UniProt/RefSeq (20) description lines. The second feature vector also includes words from MEDLINE abstracts referring to a particular protein (24). Each attribute in the feature vector encodes the fraction of proteins in the group having the feature in question. The second method for assigning FCs makes use of the hierarchical structure of eggNOG, namely that the same proteins can be assigned to OGs at several levels in the tree of life (e.g. a KOG and a meNOG). In case an FC could not be assigned to a NOG by the SVM method, we check if most of the proteins in the NOG belong to a common functionally annotated COG or KOG, in which case we transfer the FCs from the coarse-grained level (COGs or KOGs) to the more fine-grained one (e.g. arNOGs or meNOGs). The assignment of an FC to a single NOG is achieved on the basis of a coverage value determined by the occurrence of that FC (via the proteins shared with the reference level) with respect to the total number of proteins in that NOG.
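As a rough illustration of the two ideas in this section, the sketch below builds a feature vector whose entries are the fraction of proteins in a group carrying each feature, and transfers an FC from a reference level when its coverage within the NOG is high enough. The protein names, the reference OG identifier, the category letter and the `min_coverage` threshold of 0.5 are assumptions made for the example (the text only states that "most of the proteins" must share a reference group); SVM training itself is not shown.

```python
from collections import Counter

def feature_vector(group, annotations, vocabulary):
    """Fraction of proteins in `group` that carry each feature in `vocabulary`."""
    n = len(group)
    if n == 0:
        return [0.0] * len(vocabulary)
    counts = Counter()
    for protein in group:
        # A set ensures each feature is counted once per protein.
        counts.update(set(annotations.get(protein, ())))
    return [counts[feature] / n for feature in vocabulary]

def transfer_category(nog, protein_to_ref_og, ref_og_category, min_coverage=0.5):
    """Return the FC with the highest coverage in `nog`, or None.

    Coverage is the number of proteins in the NOG whose reference-level OG
    carries the FC, divided by the total number of proteins in the NOG.
    `min_coverage` is an assumed threshold, not a value from the paper.
    """
    if not nog:
        return None
    counts = Counter()
    for protein in nog:
        fc = ref_og_category.get(protein_to_ref_og.get(protein))
        if fc is not None:
            counts[fc] += 1
    if not counts:
        return None
    fc, hits = counts.most_common(1)[0]
    return fc if hits / len(nog) > min_coverage else None

# Hypothetical example data.
group = ["p1", "p2", "p3", "p4"]
annotations = {
    "p1": ["PF00069", "kinase"],
    "p2": ["PF00069"],
    "p3": ["PF00069", "kinase"],
    "p4": [],
}
vocabulary = ["PF00069", "kinase"]
vector = feature_vector(group, annotations, vocabulary)

protein_to_kog = {"p1": "KOG_example", "p2": "KOG_example", "p3": "KOG_example"}
kog_category = {"KOG_example": "T"}
assigned = transfer_category(group, protein_to_kog, kog_category)
print(vector, assigned)
```

Here three of the four proteins map to the same annotated reference group, giving a coverage of 0.75, so its category is transferred to the fine-grained group.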

ANNOTATION RESULTS

In addition to providing functional annotations via description lines for many NOGs (19), we are now able to predict functional categories as well. At the universal level, our function annotation pipeline provides description lines for 14 956 (25%) and an FC for 6262 (11%) of the 59 497 coarse-grained NOGs. At the eukaryotic level, 7566 (42%) of the 17 845 euNOGs have a description line and 4120 (23%) have an FC. In addition, eggNOG contains 137 782 more fine-grained OGs of which 100 750 (73%) and 89 232 (65%) have been annotated with a description line and an FC, respectively (Table 1).

Table 1.

Annotation statistics at different taxonomic levels

Level	OG count	Description line, annotated	Description line (%)	Functional categories, annotated	Functional categories (%)
COG + NOG	64 370	4474 + 14 956	30.2	2824 + 6262	14.1
arNOG	9809	4144	42.2	4540	46.3
KOG + euNOG	22 695	4288 + 7566	52.2	3514 + 4120	33.6
fuNOG	9976	5661	56.7	5775	57.9
meNOG	22 691	16 636	73.3	13 490	59.5
inNOG	8049	5034	62.5	5810	72.2
veNOG	21 357	16 722	78.3	13 291	62.2
fiNOG	13 674	8903	65.1	9580	70.1
maNOG	20 222	16 959	83.9	13 075	64.7
roNOG	14 038	11 918	84.9	10 547	75.1
prNOG	17 966	14 773	82.2	13 124	73.0

At the levels for COGs (universal) and KOGs (eukaryotes), the additional automatically generated non-supervised orthologous groups (NOGs and euNOGs, respectively) are listed separately.
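The percentage columns of Table 1 and the fine-grained totals quoted in the text can be recomputed from the counts. The short check below is only a worked arithmetic example using the numbers from the table; the variable and function names are our own.

```python
# (level, OG count, OGs with a description line, OGs with a functional category)
TABLE_1 = [
    ("COG + NOG", 64370, 4474 + 14956, 2824 + 6262),
    ("arNOG", 9809, 4144, 4540),
    ("KOG + euNOG", 22695, 4288 + 7566, 3514 + 4120),
    ("fuNOG", 9976, 5661, 5775),
    ("meNOG", 22691, 16636, 13490),
    ("inNOG", 8049, 5034, 5810),
    ("veNOG", 21357, 16722, 13291),
    ("fiNOG", 13674, 8903, 9580),
    ("maNOG", 20222, 16959, 13075),
    ("roNOG", 14038, 11918, 10547),
    ("prNOG", 17966, 14773, 13124),
]
COARSE_LEVELS = {"COG + NOG", "KOG + euNOG"}

def percent(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100.0 * part / whole, 1)

def annotation_rates(rows):
    """Map each level to (description-line %, functional-category %)."""
    return {level: (percent(d, n), percent(f, n)) for level, n, d, f in rows}

def fine_grained_totals(rows):
    """Sum OG, description and FC counts over all non-coarse levels."""
    fine = [row for row in rows if row[0] not in COARSE_LEVELS]
    return tuple(sum(row[i] for row in fine) for i in (1, 2, 3))

rates = annotation_rates(TABLE_1)
totals = fine_grained_totals(TABLE_1)
print(rates["prNOG"], totals)
```

The fine-grained totals reproduce the 137 782 OGs, 100 750 description lines and 89 232 FCs given in the text.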

This enables us to assign 2 242 035 of the 2 590 259 genes (87% of the genes in the analyzed genomes) to an OG and to provide at least a broad functional description or FC for 1 966 709 of them (88% of the genes that could be assigned to an OG). The corresponding numbers for each set of OGs as well as for each individual genome are summarized in Figure 1.

Figure 1.

Statistics on the content of the eggNOG database. The eggNOG assignments for 630 complete genomes were mapped onto the tree of life. The stacked bar charts outside the tree show the proportion of genes from each genome that can be assigned to a functionally annotated orthologous group (green), an unannotated orthologous group (orange) or no orthologous group (gray). The length of each bar is proportional to the logarithm of the number of genes in the respective genome. The pie charts inside the tree show the fractions of orthologous groups at each level in the hierarchy that could be annotated with a functional category (green for NOGs, light green for extended COGs and KOGs) or not (orange for NOGs, light orange for extended COGs and KOGs). An interactive version is available in the ‘Overview’ section at: http://eggnog.embl.de. This figure was made using iTOL.

Extended features in eggNOG v2.0

To facilitate the in-depth analysis of the orthologous relationships within the groups of proteins, we now provide precomputed high-quality Multiple Sequence Alignments (MSAs) and maximum-likelihood trees via the web interface (Figure 2).

Figure 2.

Screenshot of the detailed results page. The eggNOG database was queried for the term ‘mTERF’, the mitochondrial precursor of the transcription termination factor 1. The navigation tree at the top of the page allows the user to change the view to more coarse-grained orthologous groups, for example, the mammalian orthologous groups. The tab menu, shown here, enables several in-depth interactions with the new data (i.e. MSA or phylogenetic trees, here displayed with SMART domains).

Numerous methods are available to build MSAs [e.g. ClustalW (33), Muscle (34), MAFFT (35) and PRANK (36)] but some programs appear to be more suitable for particular protein families than others (37). Thus, we applied a new approach, named Automated QUality improvement for multiple sequence Alignments (AQUA) (Muller et al., submitted for publication), which combines existing tools to deliver high-quality MSAs. The construction of the different phylogenetic trees was carried out using the following steps. One hundred bootstrap replicates were created from the MSA using the SEQBOOT program from the Phylip package (38). Following this, PhyML (39) was used to find the maximum-likelihood tree for each of the 100 bootstrap replicates and for the original alignment using default parameters. Finally, a consensus tree was constructed, using the CONSENSE program from the Phylip package. We used ReadSeq (40) to convert between the different sequence file formats used by those programs.
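The bootstrap step can be pictured as resampling alignment columns with replacement. The sketch below shows only that idea in plain Python on an invented three-sequence alignment; it does not reproduce SEQBOOT, and the tree inference and consensus steps (PhyML and CONSENSE in the text) are not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one replicate by sampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }

def bootstrap_replicates(alignment, n=100, seed=0):
    """Return `n` replicates; a fixed seed keeps the example reproducible."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n)]

# Invented toy alignment; all sequences must have the same length.
msa = {"seqA": "MKTLLA", "seqB": "MKSLIA", "seqC": "MRTLLG"}
replicates = bootstrap_replicates(msa)
print(len(replicates), replicates[0])
```

Each replicate keeps the sequence names and alignment length but draws its columns at random from the original, so a tree can be inferred per replicate and the replicate trees summarized in a consensus.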

ACCESS OPTIONS

The eggNOG resource can be queried via a web interface; data can be downloaded under the Creative Commons Attribution 3.0 License at: http://eggnog.embl.de or via FTP at: ftp://eggnog.embl.de/eggNOG/2.0/. Gene and protein names, database identifiers, amino acid sequences, or OG names can be used to query the database. As a default, the most fine-grained OGs available are displayed for maximal resolution. The user can navigate among the different levels of orthology using an available guide-tree of organisms to find the desired balance between phylogenetic coverage and functional specificity within our hierarchy of OGs. Through the new interface, users can access different information panels encompassing the detailed list of proteins belonging to a particular OG as well as the corresponding MSA and phylogenetic tree. The MSA can be interactively displayed using the Jalview applet (41) or downloaded in FASTA format. The phylogenetic trees are accessed through a dedicated iTOL (42) viewer together with mapped PFAM and SMART domains, via the ATV program applet (43), or can be downloaded in Newick format.
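An alignment downloaded in FASTA format can be read with a few lines of code. The parser below is a minimal sketch run on an invented two-sequence example; the sequence names are placeholders, and the layout of the files on the download server is not assumed here.

```python
def parse_fasta(text):
    """Parse FASTA text into {sequence name: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

# Invented aligned sequences; '-' marks alignment gaps.
EXAMPLE = """\
>seqA example protein
MKT-LLA
GG--A
>seqB another example
MKTALLA
-GGSA
"""

alignment = parse_fasta(EXAMPLE)
lengths = {len(seq) for seq in alignment.values()}
print(alignment, lengths)
```

In an aligned FASTA file every sequence has the same length, so checking that `lengths` contains a single value is a quick sanity test after download.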

CONCLUSIONS/PERSPECTIVES

With 630 genomes covered, an increased OG hierarchy, and a high coverage of newly categorized functional annotation, the new version of eggNOG is one of the most comprehensive and complete resources for deciphering the orthologous relationships between proteins from various species. The changes and improvements in the interface and the availability of the OGs for download will not only facilitate the daily use of the database, but also the integration of eggNOG in high-throughput comparative genomics studies. Our future plans include the addition of more complete genomes and development of a more scalable and flexible pipeline for generating the groups.

FUNDING

EMBL, the European Commission Programme, Eurasnet EU [Grant LSHG-CT-2005-518238 (FP6), IMPACT 213037 (FP7)]; the Novo Nordisk Foundation Center for Protein Research, the Swiss Institute of Bioinformatics; and the University of Zurich (partial, through its Research Priority Program ‘Systems Biology and Functional Genomics’). This work was supported in part by the Bundesministerium fuer Bildung und Forschung (Nationales Genomforschungsnetz Foerderkennzeichen 01GS08169). Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

42 in total

Review 1. Orthologs, paralogs, and evolutionary genomics.

Authors: Eugene V Koonin
Journal: Annu Rev Genet Date: 2005 Impact factor: 16.830

2. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark.

Authors: Julie D Thompson; Patrice Koehl; Raymond Ripp; Olivier Poch
Journal: Proteins Date: 2005-10-01

3. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.

Authors: Ivica Letunic; Peer Bork
Journal: Bioinformatics Date: 2006-10-18 Impact factor: 6.937

Review 4. A genomic perspective on protein families.

Authors: R L Tatusov; E V Koonin; D J Lipman
Journal: Science Date: 1997-10-24 Impact factor: 47.728

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

6. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Authors: J D Thompson; D G Higgins; T J Gibson
Journal: Nucleic Acids Res Date: 1994-11-11 Impact factor: 16.971

7. HCOP: a searchable database of human orthology predictions.

Authors: Tina A Eyre; Mathew W Wright; Michael J Lush; Elspeth A Bruford
Journal: Brief Bioinform Date: 2006-09-02 Impact factor: 11.622

8. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

9. MBGD: a platform for microbial comparative genomics based on the automated construction of orthologous groups.

Authors: Ikuo Uchiyama
Journal: Nucleic Acids Res Date: 2006-11-29 Impact factor: 16.971

10. PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology.

Authors: Per Eystein Saebø; Sten Morten Andersen; Jon Myrseth; Jon K Laerdahl; Torbjørn Rognes
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971