mTAGs: taxonomic profiling using degenerate consensus reference sequences of ribosomal RNA genes.

Guillem Salazar¹, Hans-Joachim Ruscheweyh¹, Falk Hildebrand²,³, Silvia G Acinas⁴, Shinichi Sunagawa¹.

Abstract

SUMMARY: Profiling the taxonomic composition of microbial communities commonly involves the classification of ribosomal RNA gene fragments. As a trade-off to maintain high classification accuracy, existing tools are typically limited to the genus level. Here, we present mTAGs, a taxonomic profiling tool that implements the alignment of metagenomic sequencing reads to degenerate consensus reference sequences of small subunit ribosomal RNA genes. It uses DNA fragments, that is, paired-end sequencing reads, as count units and provides relative abundance profiles at multiple taxonomic ranks, including operational taxonomic units (OTUs) based on a 97% sequence identity cutoff. At the genus rank, mTAGs outperformed other tools across several metrics, such as the F1 score by > 11% across data from different environments, and achieved competitive (F1 score) or better results (Bray-Curtis dissimilarity) at the sub-genus level.
AVAILABILITY AND IMPLEMENTATION: The software tool mTAGs is implemented in Python. The source code and binaries are freely available (https://github.com/SushiLab/mTAGs). The data and analysis scripts used in this article are available at https://doi.org/10.5281/zenodo.4352762.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2021 PMID: 34260698 PMCID: PMC8696115 DOI: 10.1093/bioinformatics/btab465

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

The relative abundance of taxa in a microbial community can be estimated by classifying sequences of phylogenetic marker genes. A common approach involves the generation of polymerase chain reaction (PCR)-derived amplicon sequences using oligonucleotide primers to target highly conserved regions of the small subunit ribosomal RNA (SSU-rRNA) gene. However, this approach has several limitations due to the introduction of errors (Acinas ) and taxonomic selection biases (Hong ) in the PCR step, and the inconsistency of results when targeting different variable regions of the SSU-rRNA gene (Claesson ). As an alternative, the generation of metagenomic data, i.e. by shotgun-sequencing of microbial community DNA, allows for an unbiased extraction of SSU-rRNA gene fragments (Logares ) and their subsequent classification to generate taxonomic profiles. However, current tools performing SSU-rRNA gene-based taxonomic profiling of metagenomes (Bengtsson-Palme ; Guo ; Shah ; Xie ) suffer from shortcomings, such as their inability to use reads originating from any region of the SSU-rRNA gene (Bengtsson-Palme ; Guo ; Xie ) or a limitation of the taxonomic resolution to the genus rank (Bengtsson-Palme ; Shah ; Xie ). The classification performance of SSU-rRNA gene fragments of PCR-targeted or metagenomic origin differs between tools using reference sequence databases of reduced complexity (e.g. Bolyen ; Matias Rodrigues ; Schloss ). The construction of such reference databases may thus be a critical factor, in particular at high taxonomic resolution, that is, at ranks below the genus level, such as the operational taxonomic unit (OTU) defined at a 97% sequence identity cutoff.
Here, we tested if the use of the International Union of Pure and Applied Chemistry (IUPAC) code for nucleotides to generate a reference database, in which each OTU is represented by a degenerate consensus sequence of all respective members, would increase the accuracy of individual SSU-rRNA sequence classification and community composition profiling at different taxonomic ranks. We implemented this approach in a new taxonomic profiler for metagenomes named mTAGs. We show an advantage of this method over simply using the longest sequence as an OTU representative, and that at the genus level, mTAGs provides higher accuracy compared to other tools that are commonly used to classify SSU-rRNA gene fragments (Bolyen ; Caporaso ; Matias Rodrigues ; Schloss ).
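To illustrate the idea of a degenerate consensus, the following minimal Python sketch collapses a set of already aligned OTU member sequences into a single sequence written in the IUPAC nucleotide code. It is not the mTAGs implementation, whose procedure is described in the Supplementary Information; the function name `degenerate_consensus`, the handling of gaps and the toy input are assumptions made for illustration only.

```python
# Illustrative sketch only; not the mTAGs code.
# Each set of bases observed in an alignment column maps to one IUPAC code.
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_consensus(aligned_seqs):
    """Collapse equal-length aligned DNA sequences into one IUPAC consensus.

    Assumptions of this sketch: sequences contain only A, C, G, T and '-',
    gaps are ignored within a column, and all-gap columns are dropped.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*aligned_seqs):
        bases = frozenset(base.upper() for base in column if base != "-")
        if not bases:
            continue  # column made of gaps only
        consensus.append(IUPAC_CODES[bases])
    return "".join(consensus)


# Hypothetical OTU with three aligned member sequences.
members = ["ACGT-A", "ACAT-A", "ATGTCA"]
print(degenerate_consensus(members))
```

In this toy example, the second column contains C and T and is therefore encoded as Y, and the third column contains G and A and is encoded as R, so a read carrying either variant can match the reference at that position.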

2 Tool description

The mTAGs tool uses a reference database, which was built by first clustering sequences into OTUs within each genus defined in the full-length SILVA SSU database version 138 (Quast ) at 97% identity. Then, for each OTU a degenerate consensus sequence was generated using the IUPAC DNA code to represent all respective member sequences (see Supplementary Information). The tool is capable of processing single-end and paired-end reads, takes advantage of the information contained in any region of the SSU-rRNA gene and provides relative abundance profiles at multiple taxonomic ranks, including OTUs. mTAGs takes shotgun-sequenced metagenomic data as input and uses hidden Markov models to detect sequence fragments from any position of the SSU-rRNA gene. These fragments are aligned to the reference database and conservatively classified to a taxonomic rank (according to the SILVA taxonomy) by determining the last common ancestor of all target sequences. The runtime of mTAGs increases linearly with the size of the metagenome (see Supplementary Information) at a rate of 53 s per million reads processed (wallclock time using eight CPU threads; 306 s in CPU time), allowing the processing of deeply sequenced metagenomes in reasonable time (i.e. ∼100 million paired-end reads in ∼1.5 h). Although the primary use of mTAGs is the taxonomic profiling of metagenomes, it can also be used for profiling SSU-rRNA amplicon data or for classifying amplicon sequence variants produced by other methods (Callahan ; Edgar, 2016).
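The conservative classification step can be sketched as a last common ancestor (LCA) computation over the taxonomic lineages of all reference sequences that a fragment aligns to. The snippet below is a simplified illustration and not the mTAGs code: the function name `last_common_ancestor`, the representation of a lineage as a rank-ordered list and the OTU labels are assumptions.

```python
# Illustrative sketch only; not the mTAGs code.
def last_common_ancestor(lineages):
    """Return the longest prefix shared by all rank-ordered lineages.

    Each lineage is a list of taxon names ordered from the highest rank
    (domain) to the lowest (OTU). An empty input yields an empty result.
    """
    if not lineages:
        return []
    shared = []
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break  # the targets disagree at this rank; stop one rank above
        shared.append(names_at_rank[0])
    return shared


# Hypothetical case: one fragment aligns equally well to two OTUs of a genus.
targets = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia-Shigella", "otu_1"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
     "Enterobacteriaceae", "Escherichia-Shigella", "otu_2"],
]
print(last_common_ancestor(targets)[-1])  # the fragment is assigned at genus rank
```

A fragment that matches a single OTU keeps its full lineage, whereas a fragment from a conserved region that matches many references is only counted at the rank where those references still agree.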

3 Results

We benchmarked the effect of differences in the generation of the reference database by classifying reads of known identity (Fig. 1A; Supplementary Fig. S1; Supplementary Information). The definition of the representative sequence for each OTU as a degenerate consensus sequence of all its respective members, rather than the longest sequence, resulted in a ∼14% increase in classification performance at the OTU level when profiling paired-end reads of 250 bp (14.0%, 14.1% and 14.0% for precision, recall and F1 score, respectively). A 25.4% increase in taxonomic profiling performance was observed, as measured by a decrease in the median Bray–Curtis dissimilarity to the true profiles from 0.355 to 0.265 (Supplementary Fig. S1). This effect was still observed for reads of 150 bp, while no effect was found for reads of 100 bp and/or higher taxonomic ranks (Supplementary Fig. S1).
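For readers unfamiliar with the profiling metric, the sketch below computes the Bray–Curtis dissimilarity between two abundance profiles (the corresponding similarity is one minus this value) and the relative change between two values. It is a generic illustration under the assumption that profiles are given as dictionaries mapping taxon names to abundances; the function names are hypothetical. The 25.4% figure quoted above matches the relative change between the two reported medians, (0.355 − 0.265) / 0.355.

```python
# Illustrative sketch only; profile layout and function names are assumptions.
def bray_curtis_dissimilarity(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two {taxon: abundance} profiles."""
    taxa = set(profile_a) | set(profile_b)
    difference = sum(
        abs(profile_a.get(t, 0.0) - profile_b.get(t, 0.0)) for t in taxa
    )
    total = sum(profile_a.get(t, 0.0) + profile_b.get(t, 0.0) for t in taxa)
    if total == 0:
        raise ValueError("both profiles are empty")
    return difference / total


def relative_change(before, after):
    """Fractional change from `before` to `after`, relative to `before`."""
    return (before - after) / before


# Hypothetical expected and estimated relative abundance profiles.
expected = {"otu_1": 0.6, "otu_2": 0.4}
estimated = {"otu_1": 0.4, "otu_2": 0.5, "otu_3": 0.1}
print(round(bray_curtis_dissimilarity(expected, estimated), 3))

# Relative change between the two medians reported in the text.
print(round(100 * relative_change(0.355, 0.265), 1))
```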

Fig. 1.

Benchmarking results on taxonomic profiling of microbial communities. (A) Internal benchmarking: benchmarking of the mTAGs reference database construction for read length of 150 bp. Values correspond to the performance in classification (F1 score) and profiling (Bray–Curtis similarity to the expected composition) at seven taxonomic ranks for the definition of the OTU representative sequence as (i) the degenerate consensus sequence of all respective members (blue) or (ii) the longest member sequence (green). The values of 10 independent evaluations are plotted. See the Supplementary Figure S1 for precision and recall values and results based on alternative read lengths. (B) External benchmarking: benchmarking of mTAGs against QIIME 1, QIIME 2, mothur and MAPseq using simulated datasets comprising the most abundant genera found in the human gut, ocean and soil environments (Almeida et al., 2018). Bray–Curtis similarity to the expected composition and F1 score values correspond to classifications at the genus-level (the lowest taxonomic rank common to all tools). To ensure comparability between the tools, the results are based on the SILVA SSU database version 128. See the Supplementary Information for more details and Supplementary Figure S2 for precision and recall values and results based on alternative reference databases. (C) Metagenomes-based benchmarking: benchmarking of mTAGs and MAPseq using metagenomic data from the second CAMI challenge (Meyer ). Values correspond to the performance in classification (F1 score) and profiling (Bray–Curtis dissimilarity to the expected composition) at seven taxonomic ranks.

For an independent evaluation and comparison of classification and profiling performance, we used simulated data from previous work (Almeida ) using SSU-rRNA datasets comprising the most abundant genera found in the human gut, ocean and soil environments (Fig. 1B; Supplementary Fig. S2) to benchmark a number of taxonomic profiling tools.
In this comparison (Fig. 1B), mTAGs achieved a median F1 score of 0.88 and a median Bray–Curtis similarity to the expected abundance profile of 0.89, outperforming other tools classifying SSU-rRNA gene fragments down to the genus level, the lowest taxonomic rank common to all tools (QIIME 1, QIIME 2, mothur and MAPseq achieved median F1 scores of 0.72, 0.80, 0.53 and 0.60 and Bray–Curtis similarities of 0.75, 0.77, 0.51 and 0.60, respectively). mTAGs had a high median precision of 0.98, comparable to the precision of MAPseq, and a median recall of 0.80, which was the highest value among the tested tools (Fig. 1B). This high classification performance was consistent for data from different environments (human gut, ocean and soil) and also when tested separately for different hyper-variable regions within the full-length SSU-rRNA gene (see Supplementary Information and Supplementary Fig. S2). To assess the performance of mTAGs for shotgun metagenomic data and at the sub-genus level, a third evaluation was performed with human- and mouse-associated metagenomes (Meyer ). This benchmark was performed in comparison with MAPseq, which was the only tool that provided outputs at the sub-genus taxonomic level (Fig. 1C; Supplementary Fig. S3). At this level (OTU level and NCBI species level for mTAGs and MAPseq, respectively), mTAGs achieved higher median Bray–Curtis similarity to the expected abundance profile, while the median F1 score was comparable between the tools (Fig. 1C). A breakdown of the F1 score showed a lower precision, but higher recall of mTAGs compared to MAPseq (Supplementary Fig. S3).
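The classification metrics reported above are related by the standard definition of the F1 score as the harmonic mean of precision and recall. The sketch below computes all three from a predicted and an expected set of taxa; scoring on taxon presence or absence is an assumption of this illustration (the exact evaluation procedure is given in the Supplementary Information), and the function names are hypothetical. Note that a median F1 score need not equal the F1 score of the median precision and recall, although here the reported values of 0.98 and 0.80 do combine to roughly 0.88.

```python
# Illustrative sketch only; scoring by taxon presence/absence is an assumption.
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted_taxa, expected_taxa):
    """Score a predicted set of taxa against the expected set."""
    predicted, expected = set(predicted_taxa), set(expected_taxa)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical genus-level result: three of four predictions are correct,
# and three of five expected genera are recovered.
predicted = {"genus_A", "genus_B", "genus_C", "genus_D"}
expected = {"genus_A", "genus_B", "genus_C", "genus_E", "genus_F"}
print(precision_recall_f1(predicted, expected))

# Combining the median precision and recall reported for mTAGs.
print(round(f1_from(0.98, 0.80), 2))
```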

4 Conclusions

With mTAGs, we introduce a freely available tool for SSU-rRNA gene-based microbial community profiling that defines degenerate consensus sequences and uses them as a reference database to enable OTU-level relative abundance estimation.

References (16 in total)

1. Comparing bacterial communities inferred from 16S rRNA gene sequencing and shotgun metagenomics.

Authors: Neethu Shah; Haixu Tang; Thomas G Doak; Yuzhen Ye
Journal: Pac Symp Biocomput Date: 2011

2. Microbial Community Analysis with Ribosomal Gene Fragments from Shotgun Metagenomes.

Authors: Jiarong Guo; James R Cole; Qingpeng Zhang; C Titus Brown; James M Tiedje
Journal: Appl Environ Microbiol Date: 2015-10-16 Impact factor: 4.792

3. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities.

Authors: Patrick D Schloss; Sarah L Westcott; Thomas Ryabin; Justine R Hall; Martin Hartmann; Emily B Hollister; Ryan A Lesniewski; Brian B Oakley; Donovan H Parks; Courtney J Robinson; Jason W Sahl; Blaz Stres; Gerhard G Thallinger; David J Van Horn; Carolyn F Weber
Journal: Appl Environ Microbiol Date: 2009-10-02 Impact factor: 4.792

4. Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit.

Authors: Fernando Meyer; Till-Robin Lesker; David Koslicki; Adrian Fritz; Alexey Gurevich; Aaron E Darling; Alexander Sczyrba; Andreas Bremges; Alice C McHardy
Journal: Nat Protoc Date: 2021-03-01 Impact factor: 13.491

5. Benchmarking taxonomic assignments based on 16S rRNA gene profiling of the microbiota from commonly sampled environments.

Authors: Alexandre Almeida; Alex L Mitchell; Aleksandra Tarkowska; Robert D Finn
Journal: Gigascience Date: 2018-05-01 Impact factor: 6.524

6. DADA2: High-resolution sample inference from Illumina amplicon data.

Authors: Benjamin J Callahan; Paul J McMurdie; Michael J Rosen; Andrew W Han; Amy Jo A Johnson; Susan P Holmes
Journal: Nat Methods Date: 2016-05-23 Impact factor: 28.547

7. QIIME allows analysis of high-throughput community sequencing data.

Authors: J Gregory Caporaso; Justin Kuczynski; Jesse Stombaugh; Kyle Bittinger; Frederic D Bushman; Elizabeth K Costello; Noah Fierer; Antonio Gonzalez Peña; Julia K Goodrich; Jeffrey I Gordon; Gavin A Huttley; Scott T Kelley; Dan Knights; Jeremy E Koenig; Ruth E Ley; Catherine A Lozupone; Daniel McDonald; Brian D Muegge; Meg Pirrung; Jens Reeder; Joel R Sevinsky; Peter J Turnbaugh; William A Walters; Jeremy Widmann; Tanya Yatsunenko; Jesse Zaneveld; Rob Knight
Journal: Nat Methods Date: 2010-04-11 Impact factor: 28.547

8. Comparison of two next-generation sequencing technologies for resolving highly complex microbiota composition using tandem variable 16S rRNA gene regions.

Authors: Marcus J Claesson; Qiong Wang; Orla O'Sullivan; Rachel Greene-Diniz; James R Cole; R Paul Ross; Paul W O'Toole
Journal: Nucleic Acids Res Date: 2010-09-29 Impact factor: 16.971

9. RiboTagger: fast and unbiased 16S/18S profiling using whole community shotgun metagenomic or metatranscriptome surveys.

Authors: Chao Xie; Chin Lui Wesley Goi; Daniel H Huson; Peter F R Little; Rohan B H Williams
Journal: BMC Bioinformatics Date: 2016-12-22 Impact factor: 3.169

10. MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis.

Authors: João F Matias Rodrigues; Thomas S B Schmidt; Janko Tackmann; Christian von Mering
Journal: Bioinformatics Date: 2017-12-01 Impact factor: 6.937