
NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads.

Gail L Rosen, Erin R Reichenberger, Aaron M Rosenfeld.

Abstract

MOTIVATION: Datasets from high-throughput sequencing technologies have yielded a vast amount of data about organisms in environmental samples. Yet, it is still a challenge to assess the exact organism content in these samples because the task of taxonomic classification is too computationally complex to annotate all reads in a dataset. An easy-to-use webserver is needed to process these reads. While many methods exist, only a few are publicly available on webservers, and out of those, most do not annotate all reads.
RESULTS: We introduce a webserver that implements the naïve Bayes classifier (NBC) to classify all metagenomic reads to their best taxonomic match. Results indicate that NBC can assign next-generation sequencing reads to their taxonomic classification and can find significant populations of genera that other classifiers may miss.
AVAILABILITY: Publicly available at: http://nbc.ece.drexel.edu.

Year: 2010 PMID: 21062764 PMCID: PMC3008645 DOI: 10.1093/bioinformatics/btq619

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

After acquiring a sample and using next-generation technology to perform shotgun sequencing, the next step in metagenomic analysis is to assess the taxonomic content of the sample. This methodology, also known as phylogenetic analysis, gives a simple look at ‘Who is in this sample?’ The first tool used for taxonomic assessment, and one that is still widely used, is the Basic Local Alignment Search Tool (BLAST; Altschul). In recent years, several specialized webservers have been made available to the public to ease the process of taxonomically classifying reads, namely Phylopythia (McHardy), CAMERA (Seshadri), WebCARMA (Gerlach), MG-RAST (Meyer) and Galaxy (Pond). Unlike BLAST, Phylopythia and WebCARMA return more specific taxonomic information and assign reads to higher taxonomic levels using a consensus of BLAST top-hit taxonomies [aka ‘last common ancestor’ algorithms (Huson)]. In this article, we restrict our comparison to remote stand-alone webservers and exclude methods that are available only as locally installable software. Ultimately, all the metagenomic analysis webservers aim to ease analysis of complex environmental samples for users who do not have the resources to maintain their own databases and systems.

Phylopythia was the first taxonomic classification webserver to be implemented. It is based on a support vector machine (SVM) classification method and produces very good accuracy for long (≥ 1 Kbp) reads (McHardy). WebCARMA is a homology-based approach that matches environmental gene tags (EGTs) to protein families and reports good results for long and ultrashort 35-bp reads. It uses (i) BLASTX to find candidate EGTs and (ii) Pfam (protein family) hidden Markov models (HMMs) to match the EGTs against protein families during an EGT candidate selection process.

MG-RAST (Metagenome Rapid Annotation using Subsystem Technology) (Meyer), CAMERA (Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis) (Seshadri) and the Galaxy Project (Pond) are high-throughput metagenomic pipelines that aim to be all-in-one, one-stop analysis tools for metagenomic samples. For taxonomic classification of shotgun sequencing reads, MG-RAST offers a homology-based approach, SEED (Overbeek). CAMERA and Galaxy provide high-throughput implementations and custom databases for BLASTN. BLASTN yields best-hit sequence matches and is known to have reasonable accuracy (Rosen). Previously, Rosen et al. explored a machine learning method, the naïve Bayes classifier (NBC), as a way to classify fragments that can annotate more sequences than BLAST (Rosen). We now implement the algorithm on a webserver for public use and benchmark it against other web sites.
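
The ‘last common ancestor’ idea mentioned above can be illustrated with a minimal sketch. It assumes each BLAST top hit has already been mapped to a root-to-leaf lineage, and it reports the deepest rank on which all hits agree. The function name `last_common_ancestor` and the two lineages are illustrative assumptions, not taken from any of the cited tools.

```python
from typing import List, Sequence


def last_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest root-to-leaf prefix shared by every lineage."""
    if not lineages:
        return []
    shared: List[str] = []
    # zip stops at the shortest lineage, so ragged inputs are handled.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            shared.append(ranks[0])
        else:
            break
    return shared


# Hypothetical lineages for two top hits of one read (illustration only).
hits = [
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridiales",
     "Clostridiaceae", "Clostridium"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales",
     "Bacillaceae", "Bacillus"],
]
print(last_common_ancestor(hits))  # ['Bacteria', 'Firmicutes']
```

In this toy case the read would be assigned at the phylum level, which shows why consensus methods tend to place ambiguous reads at higher taxonomic levels rather than at a single best-hit organism.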

2 METHODS AND MATERIALS

We selected a previously benchmarked dataset (Gerlach): the Biogas reactor dataset (Schlüter), composed of 353 213 reads with an average length of 230 bp. We selected a real dataset as opposed to a synthetic one because we did not want to tailor the dataset to any specific database, since the database varies from web site to web site. This comparison fairly assesses each webserver's performance on a ‘real’ dataset containing known and novel organisms. We ran our tests on NBC and five other webservers in July and August of 2010. WebCARMA and MG-RAST require no parameters. Phylopythia requires the type of model to match against. MG-RAST requires an E-value cutoff under the SEED viewer (we selected the highest). We selected default BLAST parameters for the NT database for Galaxy. For NBC, we used an Nmer size of 15 and the default genome list of 1032 organisms. For CAMERA, we retained only the best top-hit organism for each read and used the ‘All Prokaryotes’ BLASTN database (with default values for the remaining parameters).

We implement the NBC approach of Rosen et al., which assigns each read a log-likelihood score. We introduce two functions of NBC: (i) the novice functionality and (ii) the expert functionality. We expect that most users will fit into the ‘novice’ category, which enables them to upload their FASTA file of reads and obtain a file of summarized results matching each read to its most likely organism, given the training database. The parameters that (expert and novice) users can choose from are as follows:

Upload File: the FASTA-formatted file of metagenomic reads. The webserver also accepts .zip, .gz and .tgz archives of several FASTA files.

Genome list: the running time of the algorithm grows linearly with the number of genomes that one scores against. So, if an expert user has prior knowledge about the expected microbes in the environment, they can select only those microbes that should be scored against. This will both speed up the computation and reduce false positives.

Nmer length: the user can select different Nmer feature sizes, but it is recommended that the novice user use N = 15 since it works well for both long and short reads (Rosen).

Email: the user's email address is required so that they can be notified where to retrieve the results when the job is completed.

Output: for a beginner, we suggest (i) uploading a FASTA file with the metagenomic reads and (ii) entering an email address. The output is a link to a directory that contains the original upload file (renamed userAnalysisFile.txt), the genomes that were scored against (masterGenomeList.txt) and a summary of the matches for each read (summarized_results.txt). The expert user may be particularly interested in the *.csv.gz files, which allow a more in-depth analysis of the ‘score distribution’ of each read.
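
To make the log-likelihood scoring concrete, the following is a minimal sketch of a naïve Bayes N-mer classifier, not the webserver's actual code. It assumes a uniform prior over genomes and add-one smoothing over the 4^N possible N-mers; both choices, along with the function names and the toy sequences, are assumptions made for illustration. The toy example uses N = 3 so that it fits on a few lines, whereas the text above recommends N = 15.

```python
import math
from collections import Counter
from typing import Dict, Iterator, Tuple

Model = Dict[str, Tuple[Counter, int]]


def nmers(seq: str, n: int) -> Iterator[str]:
    """Yield every overlapping N-mer of the sequence."""
    seq = seq.upper()
    return (seq[i:i + n] for i in range(len(seq) - n + 1))


def train(genomes: Dict[str, str], n: int) -> Model:
    """Count N-mer occurrences in each training genome."""
    model: Model = {}
    for name, seq in genomes.items():
        counts = Counter(nmers(seq, n))
        model[name] = (counts, sum(counts.values()))
    return model


def log_likelihood(read: str, counts: Counter, total: int, n: int) -> float:
    """Sum of log P(N-mer | genome), with add-one smoothing (an assumption)."""
    vocab = 4 ** n
    return sum(math.log((counts[w] + 1) / (total + vocab))
               for w in nmers(read, n))


def classify(read: str, model: Model, n: int) -> Tuple[str, float]:
    """Return the best-scoring genome; cost grows linearly with len(model)."""
    scores = {name: log_likelihood(read, counts, total, n)
              for name, (counts, total) in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy training sequences (hypothetical, far shorter than real genomes).
genomes = {
    "genome_A": "ACGTACGTACGTACGTACGT",
    "genome_B": "TTGGCCAATTGGCCAATTGG",
}
model = train(genomes, n=3)
best, score = classify("GTACGTAC", model, n=3)
print(best)  # genome_A
```

Because every read receives a score against every genome, each read always gets a best match, and restricting `genomes` to the expected organisms reduces the work per read proportionally.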

3 DISCUSSION

In Figure 1, we show the percentage of reads (out of the whole dataset) assigned to each of the top eight genera for each algorithm. All methods agree on Clostridium and Bacillus, and all methods except Galaxy agree on the prominence of Methanoculleus. CAMERA supports NBC's findings of Pseudomonas and Burkholderia, which are known to be found in sewage treatment plants (Vinnerås). [The biogas reactor contained ∼2% chicken manure, so it can have the traits of sludge waste (Schlüter).] Pseudomonas and Sorangium have been found in sludge wastes (Héry). Streptosporangium and Streptomyces are commonly found in vegetable gardens (Nolan), which is quite reasonable since this is an agricultural bioreactor. Therefore, NBC has potentially found significant populations of genera that other classifiers have missed. Thermosinus is not in NBC's training database of completed microbial genomes, so NBC found no matches for it.

Fig. 1.

Percentage of reads that are assigned to a particular genus out of all 454 reads from the Biogas reactor community. CAMERA and NBC tend to agree for over 70% of the genera shown, while MG-RAST agrees with CAMERA and NBC for nearly 50%. WebCARMA bins fewer reads, and Galaxy has high variability. For the first 5602 reads (1.5 Mb web site limit), Phylopythia classifies only eight reads to the phylum level and is not included in the graph due to its inability to make assignments at the genus level. NBC took 21 h to run and classified 100% of the reads, compared with 12 h/23% for WebCARMA, 5 h/99% for CAMERA, 2–3 h/140% for Galaxy, and a few weeks/56.2% for MG-RAST. NBC runs on a 4-core Intel machine, and speed would increase linearly with distributed computing in the future.
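
The quantity plotted in Figure 1 can be sketched as follows: count the reads assigned to each genus, divide by the total number of reads in the dataset (including unclassified ones), and keep the most frequent genera. The function name `genus_percentages` and the toy assignments are hypothetical; each webserver's real output would first need to be parsed into a read-to-genus mapping.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def genus_percentages(
    assignments: Dict[str, Optional[str]],
    total_reads: int,
    top: int = 8,
) -> List[Tuple[str, float]]:
    """Percentage of all reads assigned to each of the `top` genera."""
    # Reads mapped to None are unclassified; they count toward the
    # denominator but not toward any genus.
    counts = Counter(g for g in assignments.values() if g is not None)
    return [(genus, 100.0 * n / total_reads)
            for genus, n in counts.most_common(top)]


# Toy read-to-genus mapping (hypothetical data for illustration).
toy = {
    "read1": "Clostridium",
    "read2": "Clostridium",
    "read3": "Bacillus",
    "read4": None,
}
print(genus_percentages(toy, total_reads=len(toy)))
# [('Clostridium', 50.0), ('Bacillus', 25.0)]
```

Dividing by the whole dataset rather than by the classified reads keeps the percentages comparable between tools that classify very different fractions of the reads.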

4 CONCLUSION

The naïve Bayes classification tool is implemented on a web site for public use. We demonstrate that the tool can handle a complete pyrosequencing dataset and that it gives the full taxonomy for each read, so that users can easily analyze the taxonomic composition of their datasets. Unlike other tools, NBC classifies every read. It is also easy to use, runs an entire dataset in a reasonable amount of time and yields competitive results.

REFERENCES

1. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

2. Monitoring of bacterial communities during low temperature thermal treatment of activated sludge combining DNA phylochip and respirometry techniques.

Authors: Marina Héry; Hervé Sanguin; Sergio Perez Fabiel; Xavier Lefebvre; Timothy M Vogel; Etienne Paul; Sandrine Alfenore
Journal: Water Res Date: 2010-07-30 Impact factor: 11.236

3. MEGAN analysis of metagenomic data.

Authors: Daniel H Huson; Alexander F Auch; Ji Qi; Stephan C Schuster
Journal: Genome Res Date: 2007-01-25 Impact factor: 9.043

4. Identification of the microbiological community in biogas systems and evaluation of microbial risks from gas usage.

Authors: Björn Vinnerås; Caroline Schönning; Annika Nordin
Journal: Sci Total Environ Date: 2006-03-23 Impact factor: 7.963

5. Accurate phylogenetic classification of variable-length DNA fragments.

Authors: Alice Carolyn McHardy; Héctor García Martín; Aristotelis Tsirigos; Philip Hugenholtz; Isidore Rigoutsos
Journal: Nat Methods Date: 2006-12-10 Impact factor: 28.547

6. The metagenome of a biogas-producing microbial community of a production-scale biogas plant fermenter analysed by the 454-pyrosequencing technology.

Authors: Andreas Schlüter; Thomas Bekel; Naryttza N Diaz; Michael Dondrup; Rudolf Eichenlaub; Karl-Heinz Gartemann; Irene Krahn; Lutz Krause; Holger Krömeke; Olaf Kruse; Jan H Mussgnug; Heiko Neuweger; Karsten Niehaus; Alfred Pühler; Kai J Runte; Rafael Szczepanowski; Andreas Tauch; Alexandra Tilker; Prisca Viehöver; Alexander Goesmann
Journal: J Biotechnol Date: 2008-05-27 Impact factor: 3.307

7. Complete genome sequence of Streptosporangium roseum type strain (NI 9100).

Authors: Matt Nolan; Johannes Sikorski; Marlen Jando; Susan Lucas; Alla Lapidus; Tijana Glavina Del Rio; Feng Chen; Hope Tice; Sam Pitluck; Jan-Fang Cheng; Olga Chertkov; David Sims; Linda Meincke; Thomas Brettin; Cliff Han; John C Detter; David Bruce; Lynne Goodwin; Miriam Land; Loren Hauser; Yun-Juan Chang; Cynthia D Jeffries; Natalia Ivanova; Konstantinos Mavromatis; Natalia Mikhailova; Amy Chen; Krishna Palaniappan; Patrick Chain; Manfred Rohde; Markus Göker; Jim Bristow; Jonathan A Eisen; Victor Markowitz; Philip Hugenholtz; Nikos C Kyrpides; Hans-Peter Klenk
Journal: Stand Genomic Sci Date: 2010-01-28

8. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.

Authors: Ross Overbeek; Tadhg Begley; Ralph M Butler; Jomuna V Choudhuri; Han-Yu Chuang; Matthew Cohoon; Valérie de Crécy-Lagard; Naryttza Diaz; Terry Disz; Robert Edwards; Michael Fonstein; Ed D Frank; Svetlana Gerdes; Elizabeth M Glass; Alexander Goesmann; Andrew Hanson; Dirk Iwata-Reuyl; Roy Jensen; Neema Jamshidi; Lutz Krause; Michael Kubal; Niels Larsen; Burkhard Linke; Alice C McHardy; Folker Meyer; Heiko Neuweger; Gary Olsen; Robert Olson; Andrei Osterman; Vasiliy Portnoy; Gordon D Pusch; Dmitry A Rodionov; Christian Rückert; Jason Steiner; Rick Stevens; Ines Thiele; Olga Vassieva; Yuzhen Ye; Olga Zagnitko; Veronika Vonstein
Journal: Nucleic Acids Res Date: 2005-10-07 Impact factor: 16.971

9. CAMERA: a community resource for metagenomics.

Authors: Rekha Seshadri; Saul A Kravitz; Larry Smarr; Paul Gilna; Marvin Frazier
Journal: PLoS Biol Date: 2007-03 Impact factor: 8.029

10. Metagenome fragment classification using N-mer frequency profiles.

Authors: Gail Rosen; Elaine Garbarine; Diamantino Caseiro; Robi Polikar; Bahrad Sokhansanj
Journal: Adv Bioinformatics Date: 2008-11-16
