MOCAT2: a metagenomic assembly, annotation and profiling framework.

Jens Roat Kultima1, Luis Pedro Coelho1, Kristoffer Forslund1, Jaime Huerta-Cepas1, Simone S Li2, Marja Driessen1, Anita Yvonne Voigt3, Georg Zeller1, Shinichi Sunagawa1, Peer Bork4.   

Abstract

UNLABELLED: MOCAT2 is a software pipeline for metagenomic sequence assembly and gene prediction with novel features for taxonomic and functional abundance profiling. The automated generation and efficient annotation of non-redundant reference catalogs by propagating pre-computed assignments from 18 databases covering various functional categories allows for fast and comprehensive functional characterization of metagenomes.
AVAILABILITY AND IMPLEMENTATION: MOCAT2 is implemented in Perl 5 and Python 2.7, designed for 64-bit UNIX systems and offers support for high-performance computer usage via LSF, PBS or SGE queuing systems; source code is freely available under the GPL3 license at http://mocat.embl.de
CONTACT: bork@embl.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2016. Published by Oxford University Press.

Year:  2016        PMID: 27153620      PMCID: PMC4978931          DOI: 10.1093/bioinformatics/btw183

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Metagenomics has enabled large-scale studies investigating the structure, function and diversity of microbial communities. The computational analysis of samples, often totaling many gigabases of sequence data, usually involves mapping reads to taxonomic and functional reference databases (which may require the de novo assembly of predicted genes), and subsequent abundance profiling. Whereas taxonomic profiling methodology has matured recently (Segata et al., 2013; Sunagawa), functional profiling still remains challenging due to the difficulties in assigning functions to millions of reads from metagenomes. Moreover, current metagenomic pipelines (Abubucker; Bose; Edwards; Glass; Huson; Lingner; Markowitz; Meinicke, 2015; Silva) for functional annotation and/or profiling mainly implement metabolic pathway or protein domain databases (Segata et al., 2013) such as KEGG (Kanehisa), SEED (Overbeek) or Pfam (Finn). Here, we present metagenomic analysis toolkit version 2 (MOCAT2), which was developed to enable functional profiling of metagenomes based on a much wider range and diversity of functional gene annotations. Its features are compared to existing tools in Supplementary Table S1.

2 The MOCAT2 pipeline

The metagenomic analysis toolkit (MOCAT) (Kultima) proceeds through the following steps: raw sequence reads are quality-filtered and subsequently assembled into longer contigs, on which open reading frames are predicted (Fig. 1).

Fig. 1. The MOCAT2 pipeline. Read quality control, assembly and gene prediction represent the original MOCAT pipeline (dark green box). Blue path: Genes are clustered into reference gene catalogs, which are functionally annotated. Orange path: To quantify functional composition, reads are mapped to the annotated gene catalog and summarized over the respective annotation categories. Taxonomic profiles (mOTU, specI and NCBI) are generated by mapping reads to mOTU and reference marker gene (RefMG) catalogs.

The main extensions in MOCAT2 enable comprehensive functional profiling by integrating, in addition to the eggNOG database, 18 publicly available resources that cover diverse functional properties (Table 1). The databases were selected to include large, widely used protein databases, as well as ones targeting specific functional categories (Supplementary Text). Each database has been filtered for relevance; for example, from the eukaryote-centered database DrugBank, only the genes with bacterial homologs were extracted.
Table 1. Databases from which functional properties are obtained

Database      Proteins     Coverage  Precision  Recall  Reference

Protein domains and families
eggNOG        7 449 593    100       100        100     Huerta-Cepas et al. (2015)
Pfam          16 230*      87        90         94      Finn et al. (2014)
Superfamily   15 438*      93        89         94      Gough et al. (2001)

(Metabolic) pathways
KEGG          7 423 864    98        93         93      Kanehisa et al. (2014)
MetaCyc       388 782      100       89         94      Caspi et al. (2014)
SEED          4 247 700    99        94         94      Overbeek et al. (2014)

Antibiotic resistance
ARDB          25 360       89        99         88      Liu and Pop (2009)
CARD          2 820        100       81         93      McArthur et al. (2013)
Resfams       123*         80        94         94      Gibson et al. (2014)

Virulence factors
MvirDB        29 357       100       95         93      Zhou et al. (2007)
PATRIC        2 194 475    93        93         93      Mao et al. (2015)
vFam          29 655       35        99         86      Skewes-Cox et al. (2014)
VFDB          1 627 380    86        89         91      Chen et al. (2012)
Victors       3 329 893    91        92         94      Mao et al. (2015)

Complex carbohydrate metabolism
dbCAN         333*         76        99         99      Yin et al. (2012)

Bacterial drug targets and exotoxins
DBETH         228          100       99         86      Chakraborty et al. (2012)
DrugBank      3 899        99        88         94      Knox et al. (2011)

Mobile genetic elements
ICEberg       13 984       98        79         91      Bi et al. (2012)
Prophages     119 183      95        88         91      Waller et al. (2014)

Coverage of each database in percent, e.g., of the 18 202 orthologous groups in KEGG (KO), 17 773 (98%) are covered and thus propagated by the eggNOG database. Coverage, precision and recall are given as percentages.

*Number of hidden Markov models (HMMs), whereby one HMM can hit several proteins and several HMMs can map to one protein.
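
The coverage figures in Table 1 are rounded percentages. As a small worked check of the KEGG example in the table note, the arithmetic can be sketched as follows; the function name `coverage_percent` is an illustrative assumption and is not part of MOCAT2.

```python
def coverage_percent(covered: int, total: int) -> int:
    """Return the share of database entries that are covered, as a whole percent."""
    if total <= 0:
        raise ValueError("total must be a positive number of entries")
    return round(100 * covered / total)


# KEGG example from the table note: 17 773 of 18 202 orthologous groups.
kegg_coverage = coverage_percent(17_773, 18_202)
print(kegg_coverage)  # 97.64... rounds to 98
```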

To avoid the computational burden of mapping reads to multiple databases, predicted genes are first clustered using CD-HIT (Huang) into a non-redundant gene set, called a reference gene catalog (Qin). Next, this gene catalog is mapped to the eggNOG database, which has wide taxonomic coverage of orthologous groups and for which sequence annotations from the other databases have been pre-computed, so that functional information from multiple databases can be transferred efficiently to the catalog. This indirect annotation methodology not only provides a 10-fold speedup compared to mapping to each database separately, but also enables annotation of short genes that would otherwise be missed (Supplementary Fig. S1). For computational efficiency, MOCAT2 uses DIAMOND (Buchfink) in the annotation step. Combined, these features yield a more than 1400-fold annotation speedup over a conventional BLAST-based annotation pipeline (Supplementary Text). Users can either create and annotate their own gene catalogs de novo, or use pre-computed and pre-annotated reference gene catalogs for the human gut and skin, mouse gut, or the ocean (Li; Oh; Sunagawa; Xiao). Finally, to quantify functional composition, reads from each sample are mapped to the annotated gene catalog and summarized over the respective annotation categories (Fig. 1).

MOCAT2 now also offers several approaches for taxonomic profiling, all of which are based on mapping reads to a benchmarked set of single-copy marker genes (Fig. 1). Taxonomic abundance estimates are calculated not only for different NCBI taxonomic levels, but also for species clusters defined based on molecular sequence identity (specI; Mende) and for species that currently lack sequenced reference genomes, based on metagenomic operational taxonomic units (mOTU; Sunagawa).
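
The indirect annotation and functional profiling steps described in this section can be illustrated with a minimal sketch. This is not the MOCAT2 implementation: the identifiers, the lookup table `OG_ANNOTATIONS` and the functions `propagate` and `functional_profile` are hypothetical, and the sketch assumes that each catalog gene has already been assigned to at most one eggNOG orthologous group and that a gene's read count is added in full to every category it is annotated with.

```python
from collections import defaultdict

# Hypothetical pre-computed table: orthologous group -> annotations per database.
OG_ANNOTATIONS = {
    "OG_A": {"KEGG": {"K00001"}, "CARD": {"ARO_1"}},
    "OG_B": {"KEGG": {"K00002"}},
}


def propagate(gene_to_og, og_annotations):
    """Transfer the annotations of each gene's orthologous group to the gene."""
    return {gene: og_annotations.get(og, {}) for gene, og in gene_to_og.items()}


def functional_profile(gene_counts, gene_annotations, database):
    """Sum per-gene read counts over the annotation categories of one database."""
    profile = defaultdict(float)
    for gene, count in gene_counts.items():
        for category in gene_annotations.get(gene, {}).get(database, ()):
            profile[category] += count
    return dict(profile)


# Toy catalog: gene4 maps to a group without pre-computed annotations.
gene_to_og = {"gene1": "OG_A", "gene2": "OG_A", "gene3": "OG_B", "gene4": "OG_X"}
gene_annotations = propagate(gene_to_og, OG_ANNOTATIONS)

# Toy read counts for one sample, one value per catalog gene.
sample_counts = {"gene1": 10, "gene2": 5, "gene3": 7, "gene4": 3}
kegg_profile = functional_profile(sample_counts, gene_annotations, "KEGG")
card_profile = functional_profile(sample_counts, gene_annotations, "CARD")
print(kegg_profile)
print(card_profile)
```

The point of the design is that the expensive search is done once, against a single database, and every further database costs only a dictionary lookup per gene.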

3 Annotation and profiling benchmarks

As complex functional annotation based on 18 databases via indirect propagation of eggNOG annotations is conceptually new, we benchmarked the (indirect) MOCAT2 annotations and functional profiles (Supplementary Table S2 and Supplementary Text). First, we compared the indirect annotations to the direct ones (generated using the annotation tool of each individual database or the recommended pipeline and cutoffs) for >65 million genes from five diverse datasets (precision and recall are listed in Table 1). Next, using data from Zeller et al., we compared the direct annotations to ones produced by COGNIZER and UProC (Bose; Meinicke), two recently developed annotation tools integrating multiple databases. In our tests, MOCAT2 annotations were similar to or more accurate than those of COGNIZER and UProC (Supplementary Table S3). Finally, the functional abundance profiles obtained using the indirect MOCAT2 annotations were very similar to those obtained using the direct method (Spearman correlation = 0.95; n = 1300).
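
The exact benchmarking procedure is given in the Supplementary Text and is not reproduced here. As a hedged illustration of the two kinds of comparison mentioned above, the following sketch computes precision and recall of indirect annotations against direct ones, and a Spearman rank correlation between two abundance profiles. It assumes that direct annotations are treated as the reference and that agreement is counted per gene-annotation pair; the function names and toy data are hypothetical.

```python
import math


def precision_recall(direct, indirect):
    """Compare gene -> set-of-annotations mappings, using `direct` as reference."""
    tp = fp = fn = 0
    for gene in set(direct) | set(indirect):
        d = direct.get(gene, set())
        i = indirect.get(gene, set())
        tp += len(d & i)  # annotation found by both methods
        fp += len(i - d)  # only found indirectly
        fn += len(d - i)  # missed by the indirect method
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Toy data: three genes annotated directly, two of them also indirectly.
direct = {"g1": {"K1"}, "g2": {"K2"}, "g3": {"K3"}}
indirect = {"g1": {"K1"}, "g2": {"K2", "K9"}}
precision, recall = precision_recall(direct, indirect)

# Toy abundance profiles for the same four categories under both methods.
rho = spearman([5.0, 1.0, 3.0, 8.0], [50.0, 12.0, 30.0, 90.0])
```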

4 Conclusions

MOCAT2 is a software pipeline for metagenomics that uses state-of-the-art approaches for assembly, annotation, and taxonomic and functional profiling in this fast-moving field. Generating and annotating gene catalogs with precomputed assignments to a large selection of functional databases allows for comprehensive and efficient functional profiling of complex microbial communities. MOCAT2 thus enables such analysis to an extent far beyond what other tools currently offer and is scalable to the anticipated deluge of metagenomic data from diverse sources.

References

1.  Integrative analysis of environmental sequences using MEGAN4.

Authors:  Daniel H Huson; Suparna Mitra; Hans-Joachim Ruscheweyh; Nico Weber; Stephan C Schuster
Journal:  Genome Res       Date:  2011-06-20       Impact factor: 9.043

2.  CoMet--a web server for comparative functional profiling of metagenomes.

Authors:  Thomas Lingner; Kathrin Petra Asshauer; Fabian Schreiber; Peter Meinicke
Journal:  Nucleic Acids Res       Date:  2011-05-26       Impact factor: 16.971

3.  ICEberg: a web-based resource for integrative and conjugative elements found in Bacteria.

Authors:  Dexi Bi; Zhen Xu; Ewan M Harrison; Cui Tai; Yiqing Wei; Xinyi He; Shiru Jia; Zixin Deng; Kumar Rajakumar; Hong-Yu Ou
Journal:  Nucleic Acids Res       Date:  2011-10-18       Impact factor: 16.971

4.  dbCAN: a web resource for automated carbohydrate-active enzyme annotation.

Authors:  Yanbin Yin; Xizeng Mao; Jincai Yang; Xin Chen; Fenglou Mao; Ying Xu
Journal:  Nucleic Acids Res       Date:  2012-05-29       Impact factor: 16.971

5.  Potential of fecal microbiota for early-stage detection of colorectal cancer.

Authors:  Georg Zeller; Julien Tap; Anita Y Voigt; Shinichi Sunagawa; Jens Roat Kultima; Paul I Costea; Aurélien Amiot; Jürgen Böhm; Francesco Brunetti; Nina Habermann; Rajna Hercog; Moritz Koch; Alain Luciani; Daniel R Mende; Martin A Schneider; Petra Schrotz-King; Christophe Tournigand; Jeanne Tran Van Nhieu; Takuji Yamada; Jürgen Zimmermann; Vladimir Benes; Matthias Kloor; Cornelia M Ulrich; Magnus von Knebel Doeberitz; Iradj Sobhani; Peer Bork
Journal:  Mol Syst Biol       Date:  2014-11-28       Impact factor: 11.429

6.  Profile hidden Markov models for the detection of viruses within metagenomic sequence data.

Authors:  Peter Skewes-Cox; Thomas J Sharpton; Katherine S Pollard; Joseph L DeRisi
Journal:  PLoS One       Date:  2014-08-20       Impact factor: 3.240

7.  Biogeography and individuality shape function in the human skin metagenome.

Authors:  Julia Oh; Allyson L Byrd; Clay Deming; Sean Conlan; Heidi H Kong; Julia A Segre
Journal:  Nature       Date:  2014-10-02       Impact factor: 49.962

8.  Real time metagenomics: using k-mers to annotate metagenomes.

Authors:  Robert A Edwards; Robert Olson; Terry Disz; Gordon D Pusch; Veronika Vonstein; Rick Stevens; Ross Overbeek
Journal:  Bioinformatics       Date:  2012-10-09       Impact factor: 6.937

9.  The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).

Authors:  Ross Overbeek; Robert Olson; Gordon D Pusch; Gary J Olsen; James J Davis; Terry Disz; Robert A Edwards; Svetlana Gerdes; Bruce Parrello; Maulik Shukla; Veronika Vonstein; Alice R Wattam; Fangfang Xia; Rick Stevens
Journal:  Nucleic Acids Res       Date:  2013-11-29       Impact factor: 16.971

Review 10.  Computational meta'omics for microbial community studies.

Authors:  Nicola Segata; Daniela Boernigen; Timothy L Tickle; Xochitl C Morgan; Wendy S Garrett; Curtis Huttenhower
Journal:  Mol Syst Biol       Date:  2013-05-14       Impact factor: 11.429
