
eggNOG-mapper v2: Functional Annotation, Orthology Assignments, and Domain Prediction at the Metagenomic Scale.

Carlos P Cantalapiedra¹, Ana Hernández-Plaza¹, Ivica Letunic², Peer Bork³,⁴,⁵, Jaime Huerta-Cepas¹.

Abstract

Even though automated functional annotation of genes represents a fundamental step in most genomic and metagenomic workflows, it remains challenging at large scales. Here, we describe a major upgrade to eggNOG-mapper, a tool for functional annotation based on precomputed orthology assignments, now optimized for vast (meta)genomic data sets. Improvements in version 2 include a full update of both the genomes and functional databases to those from eggNOG v5, as well as several efficiency enhancements and new features. Most notably, eggNOG-mapper v2 now allows for: 1) de novo gene prediction from raw contigs, 2) built-in pairwise orthology prediction, 3) fast protein domain discovery, and 4) automated GFF decoration. eggNOG-mapper v2 is available as a standalone tool or as an online service at http://eggnog-mapper.embl.de.

Keywords: bioinformatics; computational genomics; functional annotation; metagenomics

Year: 2021 PMID: 34597405 PMCID: PMC8662613 DOI: 10.1093/molbev/msab293

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

Introduction

Inference of gene function via orthology, rather than by homology detection, is generally considered the most reliable approach for transferring functional information between molecular sequences, as orthologs are expected to retain function more often than paralogs (Gabaldón and Koonin 2013; Glover et al. 2019). However, since delineating orthology is highly demanding (both computationally and algorithmically), most automated methods rely on homology-based annotations (Götz et al. 2008; Seemann 2014; Blum et al. 2021). EggNOG-mapper relies on the eggNOG database (Huerta-Cepas et al. 2019) of orthologous groups (OGs), covering thousands of bacterial, archaeal, and eukaryotic organisms. It takes advantage of the precomputed phylogenies inferred for each OG to efficiently refine orthology assignments and thereby minimize the transfer of annotations from putative in-paralogs. The method was originally shown to provide more accurate predictions than homology-based approaches (Huerta-Cepas et al. 2017), while preserving computational performance at the genomic and metagenomic scale. Here, we present eggNOG-mapper v2, a major upgrade featuring improvements in annotation coverage, overall performance, and program capabilities (fig. 1).

Fig. 1

Workflow and new features of eggNOG-mapper v2. (A) The gene prediction stage uses Prodigal to perform protein prediction from assembled contigs. (B) During the search stage, HMMER3, Diamond, or MMseqs2 can be used to align the input proteins to eggNOG v5. (C) During the orthology inference stage, a report of orthologs is generated based on the desired taxonomic scope. (D) Finally, protein annotations and domains are transferred from orthologs to the queries and reported as tabular and GFF files.

Coverage and Performance Improvements

The underlying genome database has been updated to be in sync with eggNOG v5, spanning 4.4 million OGs and more than twice as many organisms as the previous version. This improvement increases annotation coverage and phylogenetic resolution, which is particularly noticeable when analyzing large metagenomic data sets. For instance, the reannotation of 1.75 million proteins randomly subsampled from a human-gut metagenomic gene catalog (Almeida et al. 2021) yielded a 3.23% increase in annotation coverage (56,569 newly annotated proteins), compared with eggNOG-mapper version 1. The phylogenetic resolution was also improved, obtaining significantly better alignment scores for the query sequences than previous versions (Wilcoxon test W = 1.2E+12, P-value < 2.2E−16). Moreover, although the underlying databases have doubled in size, eggNOG-mapper v2 improves the annotation rate (annotated queries per second) by 16% on average, compared with previous versions. The most important computational enhancements are database optimizations, which allow for faster queries and parallelization, and a new memory-based mode that significantly reduces the impact of slow disk I/O operations. Taken together, these changes improve annotation rates by 608% on average, with respect to eggNOG-mapper v1 (fig. 2). Compared with Prokka (Seemann 2014), one of the fastest annotation tools available for prokaryotic genomes according to recent benchmarks (Shaffer et al. 2020), eggNOG-mapper runs faster, especially on large metagenomic data sets (fig. 2).
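The coverage figure quoted above can be checked with a line of arithmetic. The sketch below assumes the subsample contains exactly 1.75 million proteins (the text gives a rounded figure), so the result is approximate.

```python
# Figures quoted in the text; the subsample size is rounded in the source.
total_proteins = 1_750_000
newly_annotated = 56_569

# Coverage gain as a percentage of all subsampled proteins.
coverage_gain_pct = 100 * newly_annotated / total_proteins
print(f"{coverage_gain_pct:.2f}%")  # 3.23%
```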

Fig. 2

Performance of eggNOG-mapper v2. (A) Average minutes to annotate input proteomes. EggNOG-mapper v2 (blue) against eggNOG-mapper v1 (red). (B) Average minutes to annotate input genomes. EggNOG-mapper v2 (blue) against Prokka (green). (C) Average minutes (in log scale) to annotate input proteins. MMseqs2 (-s 2,4,6; black) against Diamond (iterate/sensitive mode; orange). (D) Specificity (Sp), recall (Re), and F1 score of PFAM domain annotation, either from direct transference from orthologs or after realignment. Full de novo realignment results were used as reference. (E) Average minutes for PFAM domain annotation, using either PFAM full de novo (brown) or realign to orthologs domains (blue) modes. Benchmark setup: tests in (A) and (B) were done on 20 sets of 1–100 random proteomes (A) or genomes (B) from Almeida et al. (2021), and executed using 10 CPUs and 80 GB of RAM. Tests in (C) were done on 35 random sets of 10–10,000,000 proteins from proGenomes v2 (Mende et al. 2020), using 30 CPUs and 240 GB of RAM. Tests in (D) and (E) as in (C), only for sets of 10–100,000 proteins.

ORF Prediction

Another major capability added to the new eggNOG-mapper workflow is predicting ORFs directly from assembled contigs (fig. 1). ORF detection, only available for prokaryotic assemblies, is performed using Prodigal (Hyatt et al. 2010), which provides the protein sequences to be used by eggNOG-mapper for functional annotation. Prodigal modes (“normal,” “anonymous,” and “training”) as well as custom translation tables can be further chosen by the user.
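As a rough illustration of how such a run might be launched from a script, the sketch below assembles an `emapper.py` argument list for Prodigal-based gene prediction. The flag names (`--itype`, `--genepred`, `--trans_table`, `--cpu`) are written from memory of the eggNOG-mapper documentation and should be treated as assumptions to verify against `emapper.py --help`; the helper `build_genepred_command` and the file names are hypothetical.

```python
import shlex

def build_genepred_command(contigs_fasta, output_prefix, metagenome=False,
                           trans_table=None, cpus=4):
    """Assemble an emapper.py argument list for Prodigal-based gene prediction.

    Flag names are assumptions recalled from the eggNOG-mapper documentation;
    confirm them with `emapper.py --help` before use.
    """
    cmd = [
        "emapper.py",
        "-i", contigs_fasta,
        "--itype", "metagenome" if metagenome else "genome",
        "--genepred", "prodigal",
        "-o", output_prefix,
        "--cpu", str(cpus),
    ]
    if trans_table is not None:
        # Custom translation table, passed through as a number.
        cmd += ["--trans_table", str(trans_table)]
    return cmd

cmd = build_genepred_command("contigs.fna", "sample1", trans_table=11)
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting issues.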

Sequence Mapping Modes

Additionally, we have broadened the options for the initial sequence-mapping step carried out by eggNOG-mapper (fig. 1). Now, Diamond, MMseqs2, and HMMER3 (Mistry et al. 2013) modes are available, each recommended for different use cases. The default Diamond mode provides the best balance between speed and memory consumption. EggNOG-mapper v2 also implements the new Diamond mode ‘--iterate’, which iteratively applies different levels of sensitivity. Using the new --iterate mode, Diamond is twice as fast as MMseqs2 for query sets of 1,000,000 sequences, and three times faster for 10,000,000 sequences. The MMseqs2 mode provides faster results than Diamond (fig. 2) for query sets of up to 100,000 sequences at comparable sensitivity. When input data are nucleotide sequences, a direct translation is done assuming they represent coding sequences starting in an open reading frame. Alternatively, both Diamond and MMseqs2 can be executed in blastx-like mode, which is particularly useful when using sequencing reads as input data. For the annotation of long, error-prone sequencing reads, the frameshift option of Diamond can also be enabled. The HMMER3 mode is significantly slower than the other two and requires heavy databases to be downloaded. However, HMM-based searches might aid in the annotation of sequences for which distant homology relationships against the eggNOG v5 OGs cannot be detected by the Diamond and MMseqs2 modes.
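The "direct translation" of nucleotide input mentioned above can be pictured with a minimal sketch: read codons from the first base and translate them with the standard genetic code. This is an illustration only; the function name `translate_in_frame` is hypothetical, and eggNOG-mapper's own translation step may differ (for example, in how it handles alternative translation tables or ambiguous bases).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate_in_frame(cds):
    """Translate a CDS from its first base, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_in_frame("ATGGCCAAATAA"))  # MAK
```

A blastx-like search, by contrast, makes no assumption about the frame and lets the aligner evaluate all six reading frames.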

Adjusting Taxonomic Scopes

Another new feature now available with eggNOG-mapper v2 is the possibility of creating custom annotation databases constrained to specific taxonomic groups. For instance, users could easily create databases spanning only their domain or phylum of interest, therefore reducing computational times of subsequent annotation jobs. Moreover, the new version provides enhanced options to control the taxonomic scope (fig. 1) used for transferring functional annotations, which can be adjusted from automatic mode (recommended for mixed metagenomic data sets) to lineage-specific scopes (preventing transferring functional terms from orthologs of unwanted lineages).
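The effect of a lineage-specific scope can be sketched as a simple filter over candidate orthologs. The data layout below (ortholog IDs mapped to root-to-species lineages) and the function `filter_by_scope` are hypothetical and chosen for illustration; they are not eggNOG-mapper's internal representation.

```python
def filter_by_scope(orthologs, allowed_taxa):
    """Keep orthologs whose lineage contains at least one allowed taxon.

    `orthologs` maps a (hypothetical) ortholog ID to its lineage, given as a
    list of taxon names from root to species.
    """
    allowed = set(allowed_taxa)
    return {oid: lineage for oid, lineage in orthologs.items()
            if allowed.intersection(lineage)}

# Toy data; IDs and lineages are illustrative only.
orthologs = {
    "ecoli_b0001": ["Bacteria", "Proteobacteria", "Escherichia coli"],
    "bsub_00010": ["Bacteria", "Firmicutes", "Bacillus subtilis"],
    "yeast_YAL001C": ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
}
bacterial = filter_by_scope(orthologs, ["Bacteria"])
print(sorted(bacterial))  # ['bsub_00010', 'ecoli_b0001']
```

With a scope of `["Bacteria"]`, annotations from the eukaryotic ortholog would not be transferred to the query.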

Orthology Reports

Taking advantage of the rapid orthology assignments performed by eggNOG-mapper, it is now possible to report pairwise orthology relationships for each query against any of the genomes covered by eggNOG v5 (fig. 1). Although this feature is not intended to substitute more precise orthology prediction methods, it provides a very quick and simple “first-pass” approach to obtain pairwise relationships between query sequences and all eggNOG v5 organisms. Orthology reports can be further adjusted by specifying the target taxa and the type of orthologs to be reported (i.e., one-to-one, many-to-many).
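The ortholog types named above follow from how many genes sit on each side of the relationship. The sketch below uses the usual definitions (a side counts as "many" when it holds more than one co-orthologous gene); the function name and its inputs are hypothetical, and the tool's own labels and classification logic may differ.

```python
def orthology_type(n_query_side, n_target_side):
    """Classify an orthology relationship from the gene counts on each side.

    n_query_side: genes in the query's species that are co-orthologous
                  (the query plus its in-paralogs).
    n_target_side: orthologous genes in the target species.
    """
    if n_query_side < 1 or n_target_side < 1:
        raise ValueError("both sides need at least one gene")
    left = "one" if n_query_side == 1 else "many"
    right = "one" if n_target_side == 1 else "many"
    return f"{left}-to-{right}"

print(orthology_type(1, 1))  # one-to-one
print(orthology_type(3, 2))  # many-to-many
```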

Annotation Sources

In order to provide an integrated report of functional annotations per query, eggNOG-mapper v2 offers new annotation sources and improved reports (fig. 1). The functional annotation sources, which provide different levels of coverage (supplementary fig. S1, Supplementary Material online), are: predicted protein name; KEGG pathways, modules, and orthologs (Kanehisa et al. 2017); Gene Ontology labels (Gene Ontology Consortium 2018); EC numbers; BiGG reactions (Norsigian et al. 2020); CAZy terms (Lombard et al. 2014); COG functional categories (Tatusov et al. 2000); eggNOG OGs; and free text descriptions at all taxonomic levels. Reports are generated in tab-delimited and/or XLSX file formats. Moreover, when ORF prediction mode is enabled, proteins used to annotate are reported in FASTA format, together with a functionally decorated GFF file. Alternatively, eggNOG-mapper annotation reports can be used to decorate any custom GFF file.
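GFF decoration amounts to appending annotation key/value pairs to the ninth (attributes) column of each feature line. The sketch below shows the idea on a single GFF3 line; the `em_` prefix, the attribute keys, and the function `decorate_gff_line` are assumptions made for illustration and may not match the attribute names the tool actually writes.

```python
def decorate_gff_line(line, annotations, prefix="em_"):
    """Append functional annotations to the attributes column of a GFF3 line.

    `annotations` maps a feature ID to a dict of key/value annotations.
    The `em_` prefix and the key names are illustrative assumptions.
    """
    line = line.rstrip("\n")
    if line.startswith("#") or not line.strip():
        return line  # leave comments and blank lines untouched
    fields = line.split("\t")
    if len(fields) != 9:
        return line  # not a feature line
    fields[8] = fields[8].rstrip(";")
    attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
    for key, value in annotations.get(attrs.get("ID"), {}).items():
        # Escape ';' and '=', which would otherwise break attribute parsing.
        safe = str(value).replace(";", "%3B").replace("=", "%3D")
        fields[8] += f";{prefix}{key}={safe}"
    return "\t".join(fields)

gff = "contig1\tProdigal\tCDS\t10\t300\t.\t+\t0\tID=gene_1"
ann = {"gene_1": {"desc": "ATP synthase subunit alpha", "COG_cat": "C"}}
decorated = decorate_gff_line(gff, ann)
print(decorated)
```

Features whose ID has no entry in `annotations` are returned unchanged, so the same function can be applied to every line of a custom GFF file.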

Protein Domain Annotations

Along with the functional terms annotated per query, this new version of eggNOG-mapper provides PFAM (Mistry et al. 2021) and SMART (Letunic et al. 2021) protein domain predictions. PFAM domain annotations are by default transferred from the inferred orthologs with very little impact on computational cost, but also with a small proportion of false-positive and false-negative predictions (F1 score 89.7%, fig. 2). Optionally, de novo PFAM domain annotation is also available at large scales, either as a refinement phase for the orthology-based predictions (thus keeping the computational cost very low, while eliminating the risk of false positives; F1 score 98.9%, fig. 2) or by full computation (obtaining native results independent from orthology predictions). When using the de novo approach, HMMER3 searches are executed using in-memory mode for higher efficiency. Moreover, GA-based thresholds and PFAM clan disambiguation are automatically applied. Performance comparisons between the different modes are shown in figure 2.
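To make the F1 scores above concrete, the sketch below scores a set of transferred domain annotations against a de novo reference set. It assumes F1 is the usual harmonic mean of precision and recall; figure 2 reports "specificity", which the authors may define differently. The protein IDs and domain assignments are toy data, not results from the benchmark.

```python
def precision_recall_f1(predicted, reference):
    """Score predicted (protein, domain) pairs against a reference set."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: full de novo annotation is taken as the reference.
reference = {("p1", "PF00005"), ("p1", "PF00664"),
             ("p2", "PF00072"), ("p3", "PF00486")}
transferred = {("p1", "PF00005"), ("p2", "PF00072"),
               ("p3", "PF00486"), ("p3", "PF00512")}

p, r, f1 = precision_recall_f1(transferred, reference)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Here one reference domain is missed (a false negative) and one extra domain is transferred (a false positive), giving precision, recall, and F1 of 0.75 each.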

Features Comparison with Other Functional Annotation Tools

Although accuracy and performance are the main goals of eggNOG-mapper v2, other practical differences might motivate its use, or not, compared with alternative software. Those practical issues include the type of input and output data, the functional sources used to annotate query sequences, and the target taxonomic scope of each program. EggNOG-mapper is tightly bound to the eggNOG database, which covers a wide range of prokaryotic and eukaryotic organisms, and provides normalized functional annotations from multiple sources (see previous sections). As eggNOG is based on a curated selection of representative species spanning the whole tree of life, it provides a broad annotation coverage while keeping redundancy low. Other tools use smaller reference databases (e.g., SwissProt [UniProt Consortium 2021]) for general functional annotations but incorporate niche-specific functional databases for specific purposes. For instance, Prokka is suited for the annotation of prokaryotic-only genomes and includes the detection of tRNAs, transposases, and signal peptides (which are not specifically targeted by eggNOG-mapper). DRAM (Shaffer et al. 2020) and MicrobiomeAnnotator (Ruiz-Perez et al. 2021) use broad annotation sources for microbial data sets such as KOFam/KEGG classification. Both provide optional mappings to large protein reference databases comparable to eggNOG (e.g., trEMBL or Uniref90), but at the cost of more computational resources and time (Ruiz-Perez et al. 2021). In the case of DRAM, additional genomic features can be annotated, including rRNAs, tRNAs, peptidases, and carbohydrate-active enzymes. Mantis (Queirós et al. 2021) and InterProScan allow users to annotate sequences based on multiple HMM-based database sources, including eggNOG, PFAM, or PANTHER (Mi et al. 2010). On the other hand, this new version of eggNOG-mapper is particularly focused on improving the annotation process at the metagenomic scale.
For instance, it incorporates the possibility of using Prodigal for the automatic prediction of open reading frames (ORFs) from raw contigs. This option is not commonly available in other annotation tools, with only Prokka and DRAM providing similar capabilities. Moreover, eggNOG-mapper v2 allows for the direct annotation of reads via blastx-like searches, which is not a recommended mode in terms of accuracy, but offers the possibility of obtaining functional profiles of metagenomic samples before assembly. Regarding outputs, eggNOG-mapper is optimized for large-scale annotation jobs, producing GFF files decorated with functional annotations, tab-delimited files with the predicted functional terms per query, orthology assignment tables, and taxonomic placements. Other tools provide similar output files (except for the orthology and taxonomic predictions), with only DRAM and MicrobiomeAnnotator providing further statistics and graphical outputs summarizing the results. Notably, Prokka can be set to produce GenBank and Sequin files compliant with the GenBank/ENA/DDBJ format, thus facilitating the uploading of annotated genomes to public databases.

Conclusions

Overall, eggNOG-mapper v2 provides a more efficient, versatile, and scalable automated functional annotation workflow than its predecessor. Standalone versions are available at GitHub (https://github.com/eggnogdb/eggnog-mapper), together with extensive documentation and usage examples (https://github.com/eggnogdb/eggnog-mapper/wiki). For convenience, an online service for the annotation of large genomic and metagenomic data sets is also available at http://eggnog-mapper.embl.de.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online.

23 in total

1. The COG database: a tool for genome-scale analysis of protein functions and evolution.

Authors: R L Tatusov; M Y Galperin; D A Natale; E V Koonin
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

3. Functional and evolutionary implications of gene orthology.

Authors: Toni Gabaldón; Eugene V Koonin
Journal: Nat Rev Genet Date: 2013-04-04 Impact factor: 53.242

4. KEGG: new perspectives on genomes, pathways, diseases and drugs.

Authors: Minoru Kanehisa; Miho Furumichi; Mao Tanabe; Yoko Sato; Kanae Morishima
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

5. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes.

Authors: Daniel R Mende; Ivica Letunic; Oleksandr M Maistrenko; Thomas S B Schmidt; Alessio Milanese; Lucas Paoli; Ana Hernández-Plaza; Askarbek N Orakov; Sofia K Forslund; Shinichi Sunagawa; Georg Zeller; Jaime Huerta-Cepas; Luis Pedro Coelho; Peer Bork
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

6. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree.

Authors: Charles J Norsigian; Neha Pusarla; John Luke McConn; James T Yurkovich; Andreas Dräger; Bernhard O Palsson; Zachary King
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

7. Mantis: flexible and consensus-driven genome annotation.

Authors: Pedro Queirós; Francesco Delogu; Oskar Hickl; Patrick May; Paul Wilmes
Journal: Gigascience Date: 2021-06-02 Impact factor: 6.524

8. SMART: recent updates, new developments and status in 2020.

Authors: Ivica Letunic; Supriya Khedkar; Peer Bork
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

9. Pfam: The protein families database in 2021.

Authors: Jaina Mistry; Sara Chuguransky; Lowri Williams; Matloob Qureshi; Gustavo A Salazar; Erik L L Sonnhammer; Silvio C E Tosatto; Lisanna Paladin; Shriya Raj; Lorna J Richardson; Robert D Finn; Alex Bateman
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971

10. The InterPro protein families and domains database: 20 years on.

Authors: Matthias Blum; Hsin-Yu Chang; Sara Chuguransky; Tiago Grego; Swaathi Kandasaamy; Alex Mitchell; Gift Nuka; Typhaine Paysan-Lafosse; Matloob Qureshi; Shriya Raj; Lorna Richardson; Gustavo A Salazar; Lowri Williams; Peer Bork; Alan Bridge; Julian Gough; Daniel H Haft; Ivica Letunic; Aron Marchler-Bauer; Huaiyu Mi; Darren A Natale; Marco Necci; Christine A Orengo; Arun P Pandurangan; Catherine Rivoire; Christian J A Sigrist; Ian Sillitoe; Narmada Thanki; Paul D Thomas; Silvio C E Tosatto; Cathy H Wu; Alex Bateman; Robert D Finn
Journal: Nucleic Acids Res Date: 2021-01-08 Impact factor: 16.971
