
eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses.

Jaime Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas Rattei, Lars J Jensen, Christian von Mering, Peer Bork.

Abstract

eggNOG is a public database of orthology relationships, gene evolutionary histories and functional annotations. Here, we present version 5.0, featuring a major update of the underlying genome sets, which have been expanded to 4445 representative bacteria and 168 archaea derived from 25 038 genomes, as well as 477 eukaryotic organisms and 2502 viral proteomes that were selected for diversity and filtered by genome quality. In total, 4.4M orthologous groups (OGs) distributed across 379 taxonomic levels were computed together with their associated sequence alignments, phylogenies, HMM models and functional descriptors. Precomputed evolutionary analysis provides fine-grained resolution of duplication/speciation events within each OG. Our benchmarks show that, despite doubling the number of genomes, the quality of orthology assignments and functional annotations (80% coverage) has remained essentially unchanged across this update. Finally, we improved eggNOG online services for fast functional annotation and orthology prediction of custom genomics or metagenomics datasets. All precomputed data are publicly available for download or via API queries at http://eggnog.embl.de.


Year:  2019        PMID: 30418610      PMCID: PMC6324079          DOI: 10.1093/nar/gky1085

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Identifying orthologs, those sequences diverging from a common ancestor after a speciation event, constitutes a fundamental task in molecular and evolutionary biology. Compared to paralogs, which are sequences that diverged after a duplication event, orthologs are more likely to retain their ancestral function (1,2), even at long evolutionary timescales (3). Therefore, differentiating between these two subtypes of homology relationships is crucial to produce accurate functional predictions (2,4,5). It is also essential for proper analysis in, for example, phylogenetics and comparative genomics (6) or the study of cell-type evolution (7). Hence, several databases have been developed over the years that provide precomputed orthology predictions using different approaches and operational definitions (8–13). Most of those resources, including eggNOG, are part of the international consortium Quest for Orthologs (14), where standardized benchmarking approaches (15) and reference datasets are developed and shared. eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups) is a public resource in which thousands of genomes are analyzed at once to establish orthology relationships between all their genes. Compared to similar databases, eggNOG focuses on providing: (i) comprehensive functional annotations for the inferred orthologs, (ii) predictions across thousands of genomes covering the three domains of life and viruses, and (iii) hierarchical resolution of orthology assignments and fine-grained relationships (i.e. in-paralogies) based on phylogenetic analysis. For that, a species-aware clustering algorithm based on the concept of triangulation of best reciprocal hits (16) is applied to identify Orthologous Groups (OGs): sets of homologous sequences that started diverging from the same speciation event.
Orthology relationships vary depending on the assumed reference speciation event (outgroup), with increasing resolution toward the tips of the tree of life. For this reason, eggNOG has computed orthology predictions at different taxonomic levels since its inception in 2008 (17). All OGs from all taxonomic levels are then functionally annotated and analyzed using phylogenetic methods, which allows users to further explore the history of speciation and duplication events within each OG, infer pairwise orthology relationships between specific species, or trace functional changes therein. Here, we describe eggNOG v5.0, including the following improvements over previous versions: (i) a major upgrade of the underlying databases, featuring one of the most comprehensive selections of prokaryotic, eukaryotic and viral genomes available; (ii) updates in the online service for custom (meta-)genome annotation, now including options for fast orthology prediction and improved computational power via cloud computing and (iii) better visualization options of OGs and their associated functional data.

UPDATES AND ADDITIONS SINCE PREVIOUS RELEASE

Genomes update

eggNOG 5.0 has increased the number of genomes used for inferring orthology from 2031 core organisms to 5090. Viral proteomes have also been upgraded, increasing from 352 to 2502 proteomes collected from UniProt and filtered by completeness (those with fewer than three proteins after in silico cleaving of polyproteins were discarded). In order to select the best representative prokaryotic genomes, we used the SpecI species delineation method (18) against a total set of 25 038 genomes retrieved from RefSeq (19), obtaining 4445 reference species. Similarly, 477 eukaryotic genomes were collected from Ensembl (11) and other project-oriented resources (see online methods at http://eggnog.embl.de/). In all cases, genomes and proteomes were standardized and checked for completeness and minimum quality before inclusion into the database. For instance, incomplete prokaryotic genomes missing more than 4 out of 40 universal, single-copy marker genes (20) were excluded, as well as genomes that could not be assembled to fewer than 300 contigs or genomes with an N50 of <10 000.
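
The prokaryotic quality thresholds quoted above can be illustrated with a minimal Python sketch. The function names, argument names and the input representation (a list of contig lengths plus a count of detected marker genes) are assumptions made for illustration; only the three thresholds come from the text, and this is not the pipeline actually used to build the database.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def passes_quality_filters(
    contig_lengths: Sequence[int],
    marker_genes_found: int,
    total_markers: int = 40,   # universal single-copy marker genes
    max_missing: int = 4,      # genomes missing more than this are excluded
    max_contigs: int = 300,    # assembly must have fewer contigs than this
    min_n50: int = 10_000,     # genomes with N50 below this are excluded
) -> bool:
    """Apply the three thresholds stated in the text to one genome."""
    missing = total_markers - marker_genes_found
    return (
        missing <= max_missing
        and len(contig_lengths) < max_contigs
        and n50(contig_lengths) >= min_n50
    )
```

For example, an assembly of ten 50 kb contigs with 37 of the 40 marker genes detected would pass, whereas the same assembly with only 35 marker genes would be excluded.
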

Taxonomic levels and non-supervised Orthologous Groups

An Orthologous Group (OG) is defined as a cluster of three or more homologous sequences that diverged from the same speciation event (16,17). Different OGs could therefore be inferred depending on the speciation split considered, that is, implicitly, the taxonomic resolution one considers. Older speciation events lead to larger OGs with more in-paralogs (duplication events that occurred after the speciation) and higher functional divergence among their members. By contrast, recent speciations lead to smaller and usually more functionally specific sets of orthologs. For example, this implies that vertebrate-specific OGs would yield more fine-grained functional differentiation than OGs built using all eukaryotic species. In order to better reflect this taxonomic range and improve the precision of eggNOG functional predictions, in this version we have greatly increased the number of pre-defined taxonomic levels (speciation splits) for which OGs are independently computed. In total, we applied the non-supervised eggNOG clustering method described in Jensen et al. (17) on 379 taxonomic levels, leading to 4.4M OGs (compared to 107 levels and 1.9M OGs in the previous version (21)). OGs were built using best reciprocal hits information derived from an all-against-all Smith-Waterman matrix provided by the SIMAP project (22). In addition, manually curated OGs available for the three domains of life were integrated into the corresponding levels in eggNOG, namely the bacterial subset of COGs (23), archaeal arCOGs (24) and eukaryotic KOGs (25). Similarly, viral OGs were updated using deeper taxonomic categories, now descending to the family level. The taxonomic distribution in eggNOG v5.0, as well as the number of organisms, OGs inferred, and functional annotation coverage per level is shown in Figure 1.
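
To make the idea of triangulating best reciprocal hits concrete, the following Python sketch finds reciprocal best-hit pairs and the three-species triangles they form. It is a deliberately simplified illustration, not the eggNOG clustering method: the data layout (proteins as `(species, gene)` tuples and a `best_hit` dictionary) is an assumption, and steps such as merging triangles into larger groups or handling in-paralogs are left out.

```python
def reciprocal_best_hits(best_hit):
    """Return the set of reciprocal best-hit pairs.

    best_hit maps (query_protein, target_species) -> best-scoring protein
    in target_species, where a protein is a (species, gene) tuple.
    """
    pairs = set()
    for (query, _target_species), hit in best_hit.items():
        if hit[0] == query[0]:
            continue  # ignore within-species hits in this sketch
        # The pair is reciprocal if the hit's best match in the query's
        # species is the query itself.
        if best_hit.get((hit, query[0])) == query:
            pairs.add(frozenset((query, hit)))
    return pairs


def rbh_triangles(pairs):
    """Return all triples of proteins that are pairwise reciprocal best hits."""
    neighbours = {}
    for pair in pairs:
        a, b = tuple(pair)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    triangles = set()
    for pair in pairs:
        a, b = tuple(pair)
        # Any protein linked to both a and b closes a triangle.
        for c in neighbours[a] & neighbours[b]:
            triangles.add(frozenset((a, b, c)))
    return triangles
```

Because every edge joins proteins from two different species, each triangle necessarily spans three species, which matches the minimum OG size of three sequences mentioned above.
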
Figure 1.

Taxonomic levels for which OGs have been independently computed based on (A) prokaryotic, (B) eukaryotic and (C) viral genomes. Names in blue indicate new taxonomic levels with respect to previous eggNOG versions. Numbers indicate the number of OGs per level (red), number of species covered (black) and functional annotation coverage (green).


Hierarchical consistency of OGs

Relationships between more rootward OGs and their nested children OGs at more specific taxonomic levels were explicitly tracked and ascertained to be consistent, with exceptions only for mosaic proteins with multi-domain combinations, where individual domains might have evolved independently (26,27). Hierarchical inconsistencies are the inevitable product of executing eggNOG’s clustering algorithm independently at each taxonomic level. Given that the set of species varies at each level, nested OGs might describe slightly incompatible evolutionary histories for the same set of proteins. Resolving those cases is particularly important for third-party applications (e.g. STRING (28)), in which information needs to be propagated across the hierarchy of taxonomic levels. Therefore, from version 4.5, we apply a post-processing step to ensure hierarchical consistency of all nested OGs. In this database update, we have improved our methodology by implementing a more accurate strategy based on gene-tree reconciliation. Briefly, for each hierarchical inconsistency found, we subsample the proteins spanning the affected OGs and perform gene-tree to species-tree reconciliation. Each reconciled tree sample represents a vote towards one of the conflicting evolutionary hypotheses. We combine the reconciliations by majority voting to decide how to resolve the inconsistency. Given the large number of species in this version of eggNOG, we have, however, retained some size control heuristics, such as the rule that COGs should not be merged. A full description of the reconciliation method is available at https://github.com/meringlab/og_consistency_pipeline.
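
The voting step described above can be sketched in a few lines of Python. This covers only the final majority vote; the hypothesis labels (`"merge"`, `"split"`), the function name and the treatment of ties are hypothetical choices for illustration, and the subsampling and reconciliation that produce the votes are documented in the linked repository rather than reproduced here.

```python
from collections import Counter


def resolve_by_majority(votes):
    """Pick the evolutionary hypothesis supported by most reconciled samples.

    votes: iterable of hypothesis labels, one per reconciled tree sample.
    Returns the winning label, or None when there are no votes or the top
    two hypotheses tie (tie handling is an assumption of this sketch).
    """
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the two best-supported hypotheses
    return counts[0][0]
```

For instance, three samples voting `"merge"`, `"split"`, `"merge"` would resolve the inconsistency in favour of `"merge"`.
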

Phylogenetics analysis

As in previous releases, all OGs in eggNOG v5.0 were analyzed using a comprehensive phylogenetic approach. Based on recent benchmarks (29), we adapted our phylogenomic strategy to the following steps: multiple sequence alignments inferred with Clustal Omega (30), soft alignment trimming by removing columns with fewer than five aligned residues, model testing using ModelFinder (31), maximum likelihood trees computed with IQ-TREE (32) and branch supports calculated using the ultrafast bootstrap method (33). The full workflow was executed using the ETE toolkit v3.1.1 (34), which integrates the complete pipeline as a built-in gene-tree workflow (code name ‘eggnog50_full’). For ∼57 000 OGs, computation with this pipeline was not possible because of the large gene family sizes, so a fall-back method was used in which IQ-TREE was executed with the less-sensitive option ‘-fast’. All 4.4M trees were analyzed to infer speciation and duplication events (i.e. in-paralogy relationships) using the species overlap algorithm described in (35), leading to pairwise orthology tables (differentiating one-to-one versus many-to-many relationships) for each OG.
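
The species overlap idea used to label tree nodes can be illustrated with a small self-contained Python sketch. It assumes a rooted, strictly binary gene tree and treats any shared species between the two child clades as evidence of duplication (a zero-tolerance overlap threshold); the `Node` class and function names are hypothetical, and this is a simplified illustration rather than the implementation referenced in (35).

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Node:
    name: str = ""       # sequence name (leaves only)
    species: str = ""    # species label (leaves only)
    children: list = field(default_factory=list)


def leaves(node):
    """Return all leaf nodes below (or equal to) the given node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def orthologous_pairs(root):
    """Infer orthologous sequence pairs from a rooted binary gene tree.

    An internal node is called a duplication if its two child clades share
    at least one species, and a speciation otherwise. Leaves on opposite
    sides of a speciation node are reported as orthologs.
    """
    pairs = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:
            continue
        left, right = node.children
        left_leaves, right_leaves = leaves(left), leaves(right)
        shared = {l.species for l in left_leaves} & {l.species for l in right_leaves}
        if not shared:  # speciation event
            for a, b in product(left_leaves, right_leaves):
                pairs.add(frozenset((a.name, b.name)))
        stack.extend(node.children)
    return pairs
```

In a tree where an EPX clade and an MPO clade each contain one human and one mouse sequence, the root would be labeled a duplication (both species occur on each side), so only the within-clade human and mouse pairs are returned as orthologs.
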

Functional annotations

Orthologous Groups were functionally annotated using updated versions of Gene Ontology (36), KEGG pathways (37) and SMART/PFAM domains (38), and annotations were expanded to include CAZy (39) and KEGG modules. Moreover, general free text descriptions and COG functional categories were updated for each OG using the automated text-mining and machine learning-based pipeline described in (21). In short, OGs were assigned text descriptions based on a heuristic to find the most informative text substring from either names of assigned SMART domains, assigned Gene Ontology terms, or common substrings in free text annotations from the source gene databases. In total, 80% of all OGs were annotated using at least one functional source. Finally, we improved the online visualization of functional annotations, which can now be explored from an evolutionary point of view by plotting functional descriptors together with the phylogenetic tree and the duplication/speciation events inferred for each OG (Figure 2).
Figure 2.

Visualization of the phylogeny associated with the OG ENOG5048VVQ at the vertebrate level (A) extracted from the eggNOG website. Target orthologs were restricted to primates in the phylogenetic tree to facilitate exploration (B). Duplication nodes (in-paralogies) are labeled in red, and speciation events in blue (C). The functional profile of each orthologous sequence is shown in the presence/absence matrix (D). Functional differences can be seen on either side of the duplication event separating EPX from MPO sequences (E) in both GO Slim terms (red squares in matrix D) and KEGG Modules (blue squares in matrix D), although the sequences have similar domain architectures (F).


Fast functional and orthology assignments for custom user data

eggNOG v5.0 has also improved the underlying precomputed data used by the online version of eggNOG-mapper (40), a tool for rapid annotation of custom (meta-)genomes. Moreover, our online services are now cloud-enabled, permitting intensive computations required by functional annotation of massive datasets to run on dedicated servers with hundreds of CPUs available. We have also introduced a new option for fast batch orthology assignments of custom sets of sequences, which allows users to assign orthology relationships between novel genes and all genomes represented in eggNOG.

BENCHMARK

The average quality of orthology predictions and functional annotations was benchmarked in order to estimate the effect of adding novel genomes. Both orthobench2 (41) and the Quest For Orthologs (QFO) benchmark (15) were used. Compared to eggNOG v4.5, we improved the performance in the orthobench Bilaterian (from 72.1% to 73.1% F-measure) and Gammaproteobacteria (from 93.2% to 94.7% F-measure) tests. On the other hand, the QFO benchmark allowed us to evaluate the performance of both OG-based predictions and fine-grained predictions. Results show a clear tradeoff between precision and recall depending on the strategy selected, which in turn reflects different use cases of orthology assignments. OG-based predictions produced results with high recall values, predicting more than twice the number of orthologous pairs with <10.6% drop in average Schlicker similarity compared to the benchmark average in the Enzyme Classification and Gene Ontology Conservation tests. This high recall pattern is in general preferred by probabilistic prediction methods such as interolog inference in the STRING database (28). By contrast, fine-grained predictions showed higher precision values, while maintaining a similar recall to previous eggNOG versions, which is usually preferred for accurate functional transfers. In general, for the majority of QFO benchmark tests, the performance of eggNOG 5.0 was slightly better or stayed at the Pareto line compared to the previous eggNOG version (detailed plots and results are available at http://orthology.benchmarkservice.org). Taken together, this indicates that the large increase of genomes had no major impact on the quality of the inferred orthologous groups, suggesting the eggNOG approach continues to scale well.
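
For readers unfamiliar with the metrics, the precision, recall and F-measure of a set of predicted orthologous pairs against a reference set can be computed as in the Python sketch below. It uses the standard definition of the F-measure as the harmonic mean of precision and recall; how the orthobench and QFO benchmarks derive their scores in detail is not described here, so the pair-set representation and function name are assumptions for illustration.

```python
def precision_recall_f(predicted_pairs, reference_pairs):
    """Score predicted orthologous pairs against a reference set.

    Each pair is an iterable of two sequence identifiers; the order within
    a pair is ignored. Returns (precision, recall, f_measure).
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    reference = {frozenset(pair) for pair in reference_pairs}
    true_positives = len(predicted & reference)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

A method that predicts many pairs (as OG-based predictions do) tends to raise recall at the cost of precision, while a stricter method does the opposite; the F-measure summarizes both in one number.
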

CONCLUSIONS AND PERSPECTIVES

By further streamlining and modernizing the automated approach for the construction of eggNOG orthologous groups, as well as synchronizing with improved or newly developed source databases (e.g. proGenomes for the classification of high quality prokaryotic genomes (42)), we have been able to more than double core genome coverage for eggNOG, including extensive expansion of viral gene families, largely without loss of quality of orthology reconstruction or functional annotation. Due to a supervised increase of pre-defined taxonomic levels as the basis for OG calculation, we almost tripled the number of OGs to 4.4M. Version 5 of eggNOG should thus be a useful resource for ecological, evolutionary or medical -omics analysis, also serving as an entry point for fast functional annotation of newly sequenced genes, genomes and metagenomes. We are currently working on conceptual and algorithmic improvements to keep pace with the rapidly expanding number of sequenced organisms and metagenomes.
  41 in total

Review 1.  Homology a personal view on some of the problems.

Authors:  W M Fitch
Journal:  Trends Genet       Date:  2000-05       Impact factor: 11.639

Review 2.  Molecular phylogeny: pitfalls and progress.

Authors:  D Moreira; H Philippe
Journal:  Int Microbiol       Date:  2000-03       Impact factor: 2.479

3.  Toward automatic reconstruction of a highly resolved tree of life.

Authors:  Francesca D Ciccarelli; Tobias Doerks; Christian von Mering; Christopher J Creevey; Berend Snel; Peer Bork
Journal:  Science       Date:  2006-03-03       Impact factor: 47.728

Review 4.  The evolution of cell types in animals: emerging principles from molecular studies.

Authors:  Detlev Arendt
Journal:  Nat Rev Genet       Date:  2008-11       Impact factor: 53.242

5.  The human phylome.

Authors:  Jaime Huerta-Cepas; Hernán Dopazo; Joaquín Dopazo; Toni Gabaldón
Journal:  Genome Biol       Date:  2007       Impact factor: 13.583

6.  PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions.

Authors:  Jaime Huerta-Cepas; Salvador Capella-Gutierrez; Leszek P Pryszcz; Ivan Denisov; Diego Kormes; Marina Marcet-Houben; Toni Gabaldón
Journal:  Nucleic Acids Res       Date:  2010-11-12       Impact factor: 16.971

Review 7.  Ortholog identification in the presence of domain architecture rearrangement.

Authors:  Kimmen Sjölander; Ruchira S Datta; Yaoqing Shen; Grant M Shoffner
Journal:  Brief Bioinform       Date:  2011-06-28       Impact factor: 11.622

8.  Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors:  Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal:  Mol Syst Biol       Date:  2011-10-11       Impact factor: 11.429

9.  eggNOG: automated construction and annotation of orthologous groups of genes.

Authors:  Lars Juhl Jensen; Philippe Julien; Michael Kuhn; Christian von Mering; Jean Muller; Tobias Doerks; Peer Bork
Journal:  Nucleic Acids Res       Date:  2007-10-16       Impact factor: 16.971

10.  The COG database: an updated version includes eukaryotes.

Authors:  Roman L Tatusov; Natalie D Fedorova; John D Jackson; Aviva R Jacobs; Boris Kiryutin; Eugene V Koonin; Dmitri M Krylov; Raja Mazumder; Sergei L Mekhedov; Anastasia N Nikolskaya; B Sridhar Rao; Sergei Smirnov; Alexander V Sverdlov; Sona Vasudevan; Yuri I Wolf; Jodie J Yin; Darren A Natale
Journal:  BMC Bioinformatics       Date:  2003-09-11       Impact factor: 3.169
