
ClaMS: A Classifier for Metagenomic Sequences.

Amrita Pati, Lenwood S Heath, Nikos C Kyrpides, Natalia Ivanova.

Abstract

ClaMS - "Classifier for Metagenomic Sequences" - is a Java application for binning assembled contigs in metagenomes using user-specified training sets and initial parameters. Since ClaMS trains on sequence composition-based genomic signatures, it is much faster than binning tools that rely on alignments to homologs; ClaMS can bin ~20,000 sequences in 3 minutes on a laptop with a 2.4 GHz Intel Core 2 Duo processor and 2 GB RAM. ClaMS is meant to be a desktop application for biologists and can be run on any machine under any operating system on which the Java Runtime Environment can be installed.


Year: 2011 PMID: 22180827 PMCID: PMC3235515 DOI: 10.4056/sigs.2075298

Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277

Introduction

Metagenome binning is the process of assigning nucleotide sequences in a metagenome to known taxonomic groups. Mapping sequences to their taxonomic groups of origin leads to better characterization of a metagenome, which in turn facilitates objectives such as genome assembly from metagenomes and improved assembly and annotation. Existing binning methods can be characterized along two axes: (1) composition-based versus homology-based binning tools, and (2) ab initio unsupervised classifiers versus supervised (training-based) classifiers. In unsupervised binning, a dataset is classified into pre-existing bins trained on genomic sequences without any intervention or supervision from the user. In supervised binning, the user integrates additional known facts about the dataset into the binning process by participating in training, either by specifying sequences for each training bin, by selecting the taxonomic units to which the dataset must be binned, or both. Homology-based classifiers such as MEGAN [1] rely on alignments of sequences to homologs and are extremely computation-intensive. For large metagenomic datasets sequenced using next-generation sequencing technologies, homology-based binning can be prohibitive in terms of time and computation. While existing composition-based binning tools (PhyloPythia [2], TETRA [3]) are much faster than homology-based binning tools, they are mostly unsupervised, and their accuracy is limited because information about the presence and abundance of specific phylogenetic populations is not used in the binning process, even though such information, obtained from 16S rDNA amplicon analysis, is available for many metagenomic datasets. Even in the absence of rRNA amplicon analysis experiments, some knowledge of the constituent organisms of a metagenome can be obtained by a few iterations of ab initio binning. The objective of ClaMS is to integrate this information into the binning process, thus achieving higher binning accuracy, and to provide a desktop/laptop application that is platform-independent, fast, and easily usable by biologists.

Principles

ClaMS works by characterizing a sequence with a signature vector that is derived from its composition and described as a de Bruijn chain (DBC) signature [4]. A double-stranded DNA sequence is treated as a walk in a de Bruijn graph, and artifacts such as the stationary distribution of the underlying Markov chain and the strength of connectivity of various graph components to the graph are used to compute the DBC signature. The transition probability matrix of the underlying Markov chain of even a relatively short sequence can accurately predict its stationary distribution, and this property is exploited in the computation of DBC signatures. The DBC signature is highly conserved within a species while varying between species, which can be shown both mathematically and experimentally [4]. This property also manifests at higher taxonomic levels. The DBC signature is more complex than the oligonucleotide frequency signatures used by PhyloPythia and TETRA, and different from the interpolated Markov models used by Phymm [5]. Since a DBC signature of order k incorporates information about k-mers and (k+1)-mers in its computation, it is much faster to train. While the greater amount of information used by applications such as PhyloPythia and Phymm does mean higher accuracy, ClaMS is targeted for use on assembled contigs with supervision from the user, and in this scenario accuracy is not compromised. Pre-computed signatures at various word lengths (2-4) are included with ClaMS for all finished genomes. These signatures have been computed using the taxonomy and isolate genome sequences in IMG [6] and will be updated with each release of ClaMS or on request. Users can define training sequence sets either by clicking a node in the phylogenetic tree in the ClaMS GUI or by uploading their own FASTA files of sequences. For each sequence to be binned, its signature, which is a vector, is computed. This signature is compared individually with the centroid signatures of all training sets, and the best match is declared as the bin for that sequence.
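The binning step described above (compute a signature vector per sequence, then assign the sequence to the training set with the closest centroid signature) can be sketched roughly as follows. This is a minimal illustration and not the ClaMS implementation: it substitutes the stationary distribution of a k-mer Markov chain, estimated by power iteration, for the full DBC signature of [4]; the pseudocount, the Euclidean distance, the toy sequences, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import product
import math

ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def transition_counts(seq, k):
    """Count k-mer -> next k-mer transitions (i.e. (k+1)-mers) on both strands."""
    counts = defaultdict(float)
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k):
            window = strand[i:i + k + 1]
            if all(base in ALPHABET for base in window):  # skip N and other symbols
                counts[(window[:k], window[1:])] += 1.0
    return counts


def signature(seq, k=2, iterations=200, pseudocount=1.0):
    """Stationary distribution over k-mers (a simplified stand-in for a DBC signature)."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = transition_counts(seq.upper(), k)
    # Row-normalise the transition matrix; the (assumed) pseudocount on every
    # de Bruijn edge keeps the chain ergodic even for short sequences.
    rows = {}
    for u in kmers:
        successors = [u[1:] + base for base in ALPHABET]
        weights = [counts.get((u, v), 0.0) + pseudocount for v in successors]
        total = sum(weights)
        rows[u] = [(v, w / total) for v, w in zip(successors, weights)]
    # Power iteration: pi <- pi * P
    pi = {u: 1.0 / len(kmers) for u in kmers}
    for _ in range(iterations):
        nxt = dict.fromkeys(kmers, 0.0)
        for u, edges in rows.items():
            for v, p in edges:
                nxt[v] += pi[u] * p
        pi = nxt
    return [pi[u] for u in kmers]


def centroid(signatures):
    """Component-wise mean of a list of signature vectors."""
    return [sum(column) / len(signatures) for column in zip(*signatures)]


def assign_bin(seq, centroids, k=2):
    """Return the name of the training set whose centroid is closest to seq's signature."""
    sig = signature(seq, k)
    return min(centroids, key=lambda name: math.dist(sig, centroids[name]))


# Toy training sets (invented sequences, for illustration only).
training = {
    "AT-rich": ["ATATTTAAATATTATAAATTTATATAAT" * 4],
    "GC-rich": ["GCGGCCGCGCCGGGCGCGCCGCGGCGCC" * 4],
}
centroids = {name: centroid([signature(s) for s in seqs])
             for name, seqs in training.items()}
print(assign_bin("TTATAATATTTAAATTATAT" * 3, centroids))
```

With these toy inputs the AT-rich query is expected to land in the AT-rich bin. A real signature would also use the graph-connectivity terms mentioned above, which this sketch omits.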

Results and Discussion

To demonstrate the accuracy of binning using ClaMS, we binned a real metagenome and a simulated metagenome. The real metagenome, the Phrap-assembled phosphorus removal sludge metagenome (SLU) sampled from a laboratory-scale bioreactor (IMG/M, taxon OID: 2000000000 [6]), is 56.6M bases long, has 60.45% GC, and contains 31,742 assembled contigs. The simulated metagenome, the assembled medium complexity simulated simMC dataset from FAMeS [7], has 15,109 non-chimeric contigs that were 1,000 bases or longer and were candidates for binning using ClaMS. We evaluated the results using cross-validation of the binned contigs. In the case of simMC, the correct bins of the contigs were already known for cross-validation; in the case of SLU, best hits from BLAST alignment were used to cross-validate bins.

The phylogenetic distribution of genes in the SLU dataset based on their best BLAST hits in IMG/M [6] and the 16S rRNA tree in [8] showed that the dataset was dominated by Betaproteobacteria (127 species), Gammaproteobacteria (396 species), Bacteroidetes (81 species), and the genome of Candidatus A. phosphatis. Four training sets were used to bin SLU: the longest contig belonging to Candidatus A. phosphatis in the SLU dataset (subsequently removed from the set to be binned), betaproteobacterial isolate genomes, all gammaproteobacterial isolate genomes, and all genomes of Bacteroidetes. Scaffolds assigned to each bin were then cross-validated using their existing BLAST-based class assignment in IMG/M. As part of the processing pipeline in IMG/M, the phylogenetic distribution for the metagenome is computed by aligning genes on scaffolds (using BLASTP) to the non-redundant database of sequences computed from isolate genomes stored in IMG. Results are viewable as a phylogenetic distribution of genes in the metagenome by assigning scaffolds to appropriate bins at various taxonomic levels based on the alignment of genes present on them. Results are outlined in Figure 1. Approximately 91% of the scaffolds in the Candidatus A. phosphatis bin have best BLAST matches to Betaproteobacteria, as do 77% of the scaffolds in the Betaproteobacteria bin. Similarly, 90% of the scaffolds in the Bacteroidetes bin have BLAST matches to Bacteroidetes, while the scaffolds in the Gammaproteobacteria bin are distributed between Betaproteobacteria (59%) and Gammaproteobacteria (25%). The latter misclassification could be attributed to the fact that the Gammaproteobacteria in the SLU dataset are dominated by Xanthomonadales, whose scaffolds have high GC content (64-67%) that is closer to that of Betaproteobacteria (62%) than to that of Gammaproteobacteria (48%). Moreover, the taxonomic position of Xanthomonadales is not well defined [9]. This example illustrates the dangers of relying on isolate genome sequences as a training set, especially when relatively large taxonomic groups, such as phyla or classes, are considered. Binning can often produce more accurate results if longer contigs from the sequence set to be binned, whose origins are known, are used as training sets.
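The cross-validation summarized above amounts to tallying, for each ClaMS bin, the share of its scaffolds whose best BLAST hit falls in each taxonomic class, and the GC argument rests on a simple base count. A minimal sketch, assuming per-scaffold (bin, best-hit class) pairs are already available; the function names and the toy data are hypothetical and are not taken from the paper or from IMG/M:

```python
from collections import Counter, defaultdict


def bin_composition(assignments):
    """Per-bin percentage breakdown of best-hit classes.

    assignments: iterable of (bin_name, blast_class) pairs, one per scaffold.
    Returns {bin_name: {blast_class: percentage of that bin's scaffolds}}.
    """
    per_bin = defaultdict(Counter)
    for bin_name, blast_class in assignments:
        per_bin[bin_name][blast_class] += 1
    return {
        bin_name: {cls: 100.0 * n / sum(counts.values()) for cls, n in counts.items()}
        for bin_name, counts in per_bin.items()
    }


def gc_content(seq):
    """GC percentage of a sequence, ignoring characters other than A, C, G, T."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Invented example: four scaffolds placed in one bin, three of which
# have a best BLAST hit in the matching class.
toy = [("bin_A", "class_X")] * 3 + [("bin_A", "class_Y")]
print(bin_composition(toy))
print(gc_content("GGCCAATT"))
```

Each row of the resulting table corresponds to one column of bubbles in a plot such as Figure 1.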

Figure 1

Cross-validation of scaffolds in the SLU dataset assigned by ClaMS to user-defined bins with respect to existing phylogenetic assignments of these scaffolds made by IMG/M based on their best BLAST hit. The x-axis indicates the bin whose scaffolds are being cross-validated. The y-axis indicates the bacterial classes to which scaffolds in that bin actually map in IMG/M. Bubble sizes represent percentages.

The Phrap-assembled simulated acid mine drainage dataset (simMC) from FAMeS was binned in an unsupervised manner at various phylogenetic levels. The dataset was constructed from reads collected from genomes classified to 79 genera, 60 families, 42 orders, 17 classes, and 9 phyla under the bacterial and archaeal domains. Whole genome sequences of organisms under a taxonomic unit were used to train the bin for that taxonomic unit. For example, all Alphaproteobacteria species (except those used in the simulated dataset) were used to train the Alphaproteobacteria bin. All contigs longer than 1,000 bases were binned using ClaMS. Figure 2 (Left) illustrates the sensitivity and specificity of the unsupervised binning process at various phylogenetic levels when the best two bins for a contig are considered for the correct match. For example, at the genus level, 79 bins (one for each genus) were used to bin the assembled contigs, where a bin for a particular genus was trained using genomic sequences from all isolate genomes belonging to that genus. Negatives were determined by counting sequences that could not be binned at given cut-offs for distance and contig length. Sensitivity was computed as the percentage of sequences for which bins existed that were binned correctly (the ratio of the number of true positives to the sum of the number of true positives and the number of false negatives), while specificity was computed as the ratio of the number of true negatives to the sum of the number of true negatives and the number of false positives.

Unsupervised binning of a metagenomic dataset yields relatively accurate results at the genus, family, and domain levels, but the same cannot be said of the order, class, and phylum levels, where the dispersion in the properties of the signature is much greater and the accuracy of binning is much lower. For metagenomic datasets whose dominant constituent populations are known, supervised binning while training on contigs from the same dataset is the best course of action. This is illustrated by the specificity vs. sensitivity plots in Figure 2 (Right), where binning was done on all contigs longer than 1,000 bases in the simMC dataset using training sets specified by the user. A total of 9 genera, 8 families/orders, and 6 classes were selected, and each bin was trained using contigs from the same metagenome. A combination of the two binning approaches, in which the user specifies a training set of isolate genomes instead of selecting training sequences from the same metagenome, produces better results than unsupervised binning but is less accurate than supervised binning with training contigs from the same metagenome (Figure 3).
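The sensitivity and specificity definitions used in this evaluation can be sketched as below. The two ratios follow the text directly; how each contig is counted as a true or false positive or negative is an assumption made for illustration (a contig with an existing bin counts as a true positive if that bin is among its best two candidates, and a contig without an existing bin counts as a true negative if it is left unbinned), and the function names and toy records are hypothetical.

```python
def tally(results, top_n=2):
    """Count (tp, fn, tn, fp) from per-contig records.

    results: iterable of (true_bin, ranked_bins) where true_bin is None when no
    bin exists for the contig, and ranked_bins is the list of candidate bins in
    order of preference (empty if the contig failed the distance/length cut-offs).
    """
    tp = fn = tn = fp = 0
    for true_bin, ranked_bins in results:
        candidates = ranked_bins[:top_n]
        if true_bin is not None:
            if true_bin in candidates:
                tp += 1  # correct bin is among the best top_n candidates
            else:
                fn += 1  # bin exists, but the contig was rejected or misplaced
        elif candidates:
            fp += 1      # no correct bin exists, yet the contig was binned
        else:
            tn += 1      # no correct bin exists and the contig was left unbinned
    return tp, fn, tn, fp


def sensitivity(tp, fn):
    """True positives over all contigs for which a bin existed."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negatives over all contigs for which no bin existed."""
    return tn / (tn + fp)


# Invented records, for illustration only.
toy_results = [
    ("g1", ["g1", "g2"]),
    ("g1", ["g2", "g1"]),
    ("g2", ["g3", "g1"]),
    ("g2", []),
    (None, []),
    (None, ["g1"]),
    (None, []),
    (None, []),
]
tp, fn, tn, fp = tally(toy_results)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Both ratio functions divide by zero if the corresponding class of contigs is empty, so a caller would need to guard against that case.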

Figure 2

(Left) Sensitivity and specificity on binning contigs longer than 1,000 bp in the Phrap-assembled simMC dataset at the genus, family, order, and class levels using ClaMS in an ab initio manner. The stars in matching colors indicate the same values for binning all contigs longer than 8,000 bp in the same dataset. The grey star represents the sensitivity/specificity values at the domain level. (Right) Results of binning the same contigs in simMC using user-specified bins for training.

Figure 3

Sensitivity and specificity on binning contigs longer than 1,000 bp in the Phrap-assembled simMC dataset at the genus, family, order, and class levels using ClaMS in an ab initio manner. Complete genomes were used to train the 9 genera, 8 families/orders, and 6 classes specified by the user. Observe that a large amount of noise is added to the bins at the class level because of including all complete genomes in that class. The in-built taxonomy browser in ClaMS was used to make these bin selections.

ClaMS can run in a command-line mode, which makes it convenient to include in processing pipelines and large-scale batch-processing jobs. Screenshots of the ClaMS user interface and a demonstration of its usage, including visualization of results, are available at http://clams.jgi-psf.org. The user-friendly interface, built-in taxonomy browser, bundled genomic signatures, and fast computations make ClaMS an ideal desktop supervised binning application for biologists.

8 in total

1. The evolutionary origin of Xanthomonadales genomes and the nature of the horizontal gene transfer process.

Authors: Iñaki Comas; Andrés Moya; Rajeev K Azad; Jeffrey G Lawrence; Fernando Gonzalez-Candelas
Journal: Mol Biol Evol Date: 2006-08-01 Impact factor: 16.240

2. MEGAN analysis of metagenomic data.

Authors: Daniel H Huson; Alexander F Auch; Ji Qi; Stephan C Schuster
Journal: Genome Res Date: 2007-01-25 Impact factor: 9.043

3. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods.

Authors: Konstantinos Mavromatis; Natalia Ivanova; Kerrie Barry; Harris Shapiro; Eugene Goltsman; Alice C McHardy; Isidore Rigoutsos; Asaf Salamov; Frank Korzeniewski; Miriam Land; Alla Lapidus; Igor Grigoriev; Paul Richardson; Philip Hugenholtz; Nikos C Kyrpides
Journal: Nat Methods Date: 2007-04-29 Impact factor: 28.547

4. Accurate phylogenetic classification of variable-length DNA fragments.

Authors: Alice Carolyn McHardy; Héctor García Martín; Aristotelis Tsirigos; Philip Hugenholtz; Isidore Rigoutsos
Journal: Nat Methods Date: 2006-12-10 Impact factor: 28.547

5. Metagenomic analysis of two enhanced biological phosphorus removal (EBPR) sludge communities.

Authors: Héctor García Martín; Natalia Ivanova; Victor Kunin; Falk Warnecke; Kerrie W Barry; Alice C McHardy; Christine Yeates; Shaomei He; Asaf A Salamov; Ernest Szeto; Eileen Dalin; Nik H Putnam; Harris J Shapiro; Jasmyn L Pangilinan; Isidore Rigoutsos; Nikos C Kyrpides; Linda Louise Blackall; Katherine D McMahon; Philip Hugenholtz
Journal: Nat Biotechnol Date: 2006-09-24 Impact factor: 54.908

6. TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.

Authors: Hanno Teeling; Jost Waldmann; Thierry Lombardot; Margarete Bauer; Frank Oliver Glöckner
Journal: BMC Bioinformatics Date: 2004-10-26 Impact factor: 3.169

7. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models.

Authors: Arthur Brady; Steven L Salzberg
Journal: Nat Methods Date: 2009-08-02 Impact factor: 28.547

8. IMG/M: a data management and analysis system for metagenomes.

Authors: Victor M Markowitz; Natalia N Ivanova; Ernest Szeto; Krishna Palaniappan; Ken Chu; Daniel Dalevi; I-Min A Chen; Yuri Grechkin; Inna Dubchak; Iain Anderson; Athanasios Lykidis; Konstantinos Mavromatis; Philip Hugenholtz; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2007-10-11 Impact factor: 16.971


21 in total

Review 1. A clinician's guide to microbiome analysis.

Authors: Marcus J Claesson; Adam G Clooney; Paul W O'Toole
Journal: Nat Rev Gastroenterol Hepatol Date: 2017-08-09 Impact factor: 46.802

Review 2. No more non-model species: the promise of next generation sequencing for comparative immunology.

Authors: Nolwenn M Dheilly; Coen Adema; David A Raftos; Benjamin Gourbal; Christoph Grunau; Louis Du Pasquier
Journal: Dev Comp Immunol Date: 2014-02-06 Impact factor: 3.636

3. SCGid: a consensus approach to contig filtering and genome prediction from single-cell sequencing libraries of uncultured eukaryotes.

Authors: Kevin R Amses; William J Davis; Timothy Y James
Journal: Bioinformatics Date: 2020-04-01 Impact factor: 6.937

4. Genome-scale phylogenetics reveals a monophyletic Zoopagales (Zoopagomycota, Fungi).

Authors: William J Davis; Kevin R Amses; Gerald L Benny; Derreck Carter-House; Ying Chang; Igor Grigoriev; Matthew E Smith; Joseph W Spatafora; Jason E Stajich; Timothy Y James
Journal: Mol Phylogenet Evol Date: 2019-01-11 Impact factor: 4.286

5. Microbial dark matter ecogenomics reveals complex synergistic networks in a methanogenic bioreactor.

Authors: Masaru K Nobu; Takashi Narihiro; Christian Rinke; Yoichi Kamagata; Susannah G Tringe; Tanja Woyke; Wen-Tso Liu
Journal: ISME J Date: 2015-01-23 Impact factor: 10.302

6. Genes related to ion-transport and energy production are upregulated in response to CO2-driven pH decrease in corals: new insights from transcriptome analysis.

Authors: Jeremie Vidal-Dupiol; Didier Zoccola; Eric Tambutté; Christoph Grunau; Céline Cosseau; Kristina M Smith; Michael Freitag; Nolwenn M Dheilly; Denis Allemand; Sylvie Tambutté
Journal: PLoS One Date: 2013-03-27 Impact factor: 3.240

7. The metagenome of an anaerobic microbial community decomposing poplar wood chips.

Authors: Daniel van der Lelie; Safiyh Taghavi; Sean M McCorkle; Luen-Luen Li; Stephanie A Malfatti; Denise Monteleone; Bryon S Donohoe; Shi-You Ding; William S Adney; Michael E Himmel; Susannah G Tringe
Journal: PLoS One Date: 2012-05-21 Impact factor: 3.240

Review 8. Metagenomics: tools and insights for analyzing next-generation sequencing data derived from biodiversity studies.

Authors: Anastasis Oulas; Christina Pavloudi; Paraskevi Polymenakou; Georgios A Pavlopoulos; Nikolas Papanikolaou; Georgios Kotoulas; Christos Arvanitidis; Ioannis Iliopoulos
Journal: Bioinform Biol Insights Date: 2015-05-05

9. Exploring microbial dark matter to resolve the deep archaeal ancestry of eukaryotes.

Authors: Jimmy H Saw; Anja Spang; Katarzyna Zaremba-Niedzwiedzka; Lina Juzokaite; Jeremy A Dodsworth; Senthil K Murugapiran; Dan R Colman; Cristina Takacs-Vesbach; Brian P Hedlund; Lionel Guy; Thijs J G Ettema
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2015-09-26 Impact factor: 6.237

10. Challenges of the Unknown: Clinical Application of Microbial Metagenomics.

Authors: Graham Rose; David J Wooldridge; Catherine Anscombe; Edward T Mee; Raju V Misra; Saheer Gharbia
Journal: Int J Genomics Date: 2015-09-14 Impact factor: 2.326