
MetaSquare: an integrated metadatabase of 16S rRNA gene amplicon for microbiome taxonomic classification.

Chun-Chieh Liao¹, Po-Ying Fu², Chih-Wei Huang¹, Chia-Hsien Chuang¹, Yun Yen³, Chung-Yen Lin¹, Shu-Hwa Chen³.

Abstract

MOTIVATION: Taxonomic classification of 16S ribosomal RNA gene amplicons is an efficient and economical approach to microbiome analysis. The 16S rRNA sequence databases used in downstream bioinformatic pipelines, such as SILVA, RDP, EzBioCloud and HOMD, are limited either by sequence redundancy or by delays in recruiting new sequences. To improve 16S rRNA gene-based taxonomic classification, we systematically merged these widely used databases and a collection of novel sequences into an integrated resource.
RESULTS: MetaSquare version 1.0 is an integrated 16S rRNA sequence database. It comprises more than 6 million sequences and improves taxonomic classification resolution for both long-read and short-read methods.
AVAILABILITY AND IMPLEMENTATION: Accessible at https://hub.docker.com/r/lsbnb/metasquare_db and https://github.com/lsbnb/MetaSquare.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Substances:
RNA, Ribosomal, 16S

Year: 2022 PMID: 35561196 PMCID: PMC9113242 DOI: 10.1093/bioinformatics/btac184

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931

1 Introduction

Metagenomics, the collective view of the genomes of all microbes in a specified habitat, has widely impacted our knowledge of many biological processes in recent decades. Researchers survey microbes for different purposes and wish to know the composition and contributions of the species present. Although the cost of whole-genome shotgun metagenomic analysis has decreased, resolving microbiome composition via 16S ribosomal RNA gene amplicon sequencing remains a mainstream strategy because of its stable performance and cost efficiency. With high-throughput next-generation sequencing, bioinformatic pipelines can produce an exhaustive list of species. Taxonomic classification is a crucial component of microbiome analysis. Bioinformatic pipelines such as QIIME 2 (Bolyen et al.) and mothur (Schloss, 2020) rely on 16S rRNA sequence databases to match sequences to taxa. One of the widely used rRNA gene sequence databases, SILVA (Quast et al., 2013), contains ∼9 million ribosomal RNA sequences from bacteria, archaea and some eukarya. Because of the complexity of the data sources, sequence duplicates and uneven coverage of clades in such data repositories have been pointed out (Agnihotry et al.). Besides, considerable effort is required to keep a database up to date. Greengenes, another widely used database (DeSantis et al.) with rich taxonomic annotations, has not been updated since 2013. The RDP, with about 3 million rRNA sequences (Cole et al.), also stopped updating in 2016. Furthermore, recent metagenomic approaches may reveal new microbial sequences, but their inclusion in these databases is delayed by curation schedules. For example, the EzBioCloud 16S rRNA gene database, derived from microbial genome assemblies, contains new bacteria, archaea and eukarya (Yoon et al.). The HOMD is a specialized 16S rRNA gene database built for exploring unique taxa in the oral microbiome (Escapa et al.). A database agglomeration effort, 16S-UDb, has also been presented (Agnihotry et al.).
In the 16S-UDb work, unified full-length, fully annotated 16S rRNA sequences were collected. That dataset can meet the requirements of 16S rRNA amplicon analyses in various designs, but the number of recruited taxa was greatly reduced by the sequence-length constraint. To improve the resolution of taxonomic analysis, we devised a data collection process to build an updated, non-redundant 16S rRNA database, MetaSquare. This database meets the need for 16S rRNA classification with both long-read and short-read methods.

2 Materials and methods

We adopted the SILVA database (version 138.1) as the starting set because it has the greatest coverage of sequence entries and is still maintained, and we agglomerated entries from the other sources into it to form the final dataset. First, we reformatted the taxonomy assignments of all datasets to comply with the Greengenes format. Next, we appended the Greengenes (version 13.5) set to the starting set, except for entries that were identical to, or substrings of, an existing entry; RDP (version 11.5), EzBioCloud (accessed 2020.02) and HOMD (version 15.2) were appended under the same criteria. We further recruited 516 16S rRNA gene sequences from reported novel genome assemblies (Pasolli et al.). Sequence duplication was identified using mothur align.seqs at each appending step. Next, we filtered sequence duplicates from the approximate merged set according to their annotation context: we picked the most detailed taxonomic annotations and preferred entries from the most recently updated database. Finally, the eukaryote sequences were excluded. We retained sequences that met these criteria: (i) 5 or fewer ambiguous bases, (ii) 8 or fewer homopolymers and (iii) longer than 600 bp, to ensure usability in long 16S rRNA amplicon taxonomic classification pipelines. The database construction workflow is shown in Supplementary Figures S1 and S2. Two analyses were conducted to evaluate database performance: QIIME 2 on a classical short-read 16S rRNA gene amplicon dataset, the V3–V4 amplicon dataset published under NCBI BioProject PRJNA715083 (Kameoka et al.), and Kraken 2 on long-read, near-full-length 16S rRNA gene amplicon datasets from PRJDB9744 (V1–V9 amplicon; Matsuo et al.) and PRJNA637202 (V3–V9 amplicon; Angell et al.). We compared the taxonomic classification output of QIIME 2 with MetaSquare (this study), SILVA, Greengenes and 16S-UDb. For the 16S rRNA gene V3–V4 region amplicon analyses, the V3–V4 regions of the 16S rRNA gene sequences were extracted using the V-Xtractor software tool (Hartmann et al.).
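The appending rule and the duplicate-resolution rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which relied on mothur. The function names, the tuple layouts and the release-year field are hypothetical, and the naive substring scan is for readability only; it would not scale to millions of sequences.

```python
def append_non_redundant(existing, incoming):
    """Return `existing` plus the entries of `incoming` whose sequence is
    neither identical to, nor a substring of, a sequence already kept.

    Both arguments are lists of (entry_id, sequence) tuples (assumed layout).
    """
    merged = [(entry_id, seq.upper()) for entry_id, seq in existing]
    for entry_id, seq in incoming:
        seq = seq.upper()
        # The `in` operator on strings covers the identical and substring cases.
        if any(seq in kept_seq for _, kept_seq in merged):
            continue
        merged.append((entry_id, seq))
    return merged


def taxonomy_depth(lineage):
    """Count the ranks that carry a name in a Greengenes-style lineage,
    e.g. 'k__Bacteria; p__Firmicutes; c__; ...' has depth 2."""
    ranks = [rank.strip() for rank in lineage.split(";")]
    return sum(1 for rank in ranks if rank[1:3] == "__" and len(rank) > 3)


def pick_annotation(candidates):
    """Among duplicates, prefer the most detailed lineage, then the most
    recently updated source. `candidates` holds (lineage, release_year)."""
    return max(candidates, key=lambda c: (taxonomy_depth(c[0]), c[1]))


# Invented toy data, for illustration only.
starting_set = [("S1", "ACGTACGTAC")]
new_entries = [("G1", "CGTACG"), ("G2", "acgtacgtac"), ("G3", "TTTTACGT")]
merged_set = append_non_redundant(starting_set, new_entries)
print([entry_id for entry_id, _ in merged_set])
```

Note that the check is one-directional, as in the text: an incoming entry is skipped when it is contained in an existing one, but a longer incoming sequence that contains an existing entry is still appended.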
The benchmarking datasets are listed in Supplementary Table S1, and the workflow for these analyses is shown in Supplementary Figure S1.
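The three inclusion criteria from the Methods can be expressed as a small filter. The sketch below is an assumption-laden illustration, not the authors' code: criterion (ii) is interpreted here as "no homopolymer run longer than 8 bases", which is one plausible reading of the text, and any character other than A, C, G, T or U is treated as ambiguous. The function and parameter names are hypothetical.

```python
import re

# Matches one maximal run of a repeated character, e.g. "AAAA".
_RUN = re.compile(r"(.)\1*")


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base in `seq`."""
    return max((len(m.group(0)) for m in _RUN.finditer(seq)), default=0)


def passes_filters(seq, max_ambiguous=5, max_homopolymer=8, min_length=600):
    """Apply the three criteria: few ambiguous bases, bounded homopolymer
    runs (assumed interpretation) and a length strictly above `min_length`."""
    seq = seq.upper()
    n_ambiguous = sum(1 for base in seq if base not in "ACGTU")
    return (
        n_ambiguous <= max_ambiguous
        and longest_homopolymer(seq) <= max_homopolymer
        and len(seq) > min_length
    )


# Invented sequences, for illustration only.
long_clean = "ACGT" * 200              # 800 bases, no ambiguity, runs of 1
too_short = "ACGT" * 100               # 400 bases
too_ambiguous = "N" * 6 + long_clean   # 6 ambiguous bases
long_run = "G" * 9 + long_clean        # homopolymer run of 9

for name, seq in [("long_clean", long_clean), ("too_short", too_short),
                  ("too_ambiguous", too_ambiguous), ("long_run", long_run)]:
    print(name, passes_filters(seq))
```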

3 Results

MetaSquare consists of a FASTA file and a taxonomy annotation file that complies with the Greengenes style. Version 1.0 contains 6 449 552 sequences (archaea: 260 555 entries; bacteria: 6 188 997 entries). The composition of MetaSquare by source is presented in Supplementary Figure S3. As shown in Figure 1, Supplementary Table S2 and Supplementary Figure S4, MetaSquare outperformed the other three rRNA databases in the number of taxa identified in the 16S rRNA amplicon analysis. Compared with 16S-UDb, MetaSquare identified many more genera (436 versus 237) in the short-read microbiome dataset (Supplementary Table S1). We also noticed very few unclassified sequences with QIIME 2 + 16S-UDb and QIIME 2 + Greengenes.
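The reported composition can be cross-checked, and a per-domain tally reproduced, with a few lines of Python. The sketch assumes the taxonomy file follows the conventional Greengenes layout (a sequence ID, a tab, then a semicolon-delimited lineage with rank prefixes such as `k__` and `g__`); the exact layout of the MetaSquare file should be confirmed against the repository. The sample lines and IDs are invented.

```python
from collections import Counter

# Entry counts as reported in the text for version 1.0.
REPORTED = {"Archaea": 260_555, "Bacteria": 6_188_997}
REPORTED_TOTAL = 6_449_552


def count_domains(lines):
    """Tally entries per top-level rank in Greengenes-style taxonomy lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        _seq_id, lineage = line.split("\t", 1)
        top_rank = lineage.split(";")[0].strip()
        domain = top_rank[3:] if top_rank.startswith("k__") else top_rank
        counts[domain] += 1
    return counts


# Invented example lines, for illustration only.
sample = [
    "demo_0001\tk__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
    "f__Lactobacillaceae; g__Lactobacillus; s__",
    "demo_0002\tk__Archaea; p__Euryarchaeota; c__Methanobacteria; "
    "o__Methanobacteriales; f__Methanobacteriaceae; g__Methanobrevibacter; s__",
    "demo_0003\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria; "
    "o__; f__; g__; s__",
]

domain_counts = count_domains(sample)
totals_agree = sum(REPORTED.values()) == REPORTED_TOTAL
print(dict(domain_counts), totals_agree)
```

On a real file, `count_domains(open(path))` would be the natural call, with `path` pointing at the taxonomy file.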

Fig. 1.

Taxonomic classification results of MetaSquare and competing databases through the QIIME 2 pipeline. We counted the non-redundant taxa identified in the 16S rRNA gene V3–V4 amplicon dataset (Kameoka et al.; NCBI PRJNA715083).

The performance of MetaSquare for long-read 16S rRNA gene amplicon taxonomic classification was assessed with Kraken 2. MetaSquare helps identify considerably more genera than the other databases (Supplementary Fig. S5). Details of the results mentioned above are available in the Supplementary Information.
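Counting non-redundant taxa at a given rank, as done for the database comparison, amounts to collecting the distinct named labels at that rank across all classified features. The sketch below assumes Greengenes-style lineage strings as classifier output; the function name and the toy assignments are invented and do not reproduce the reported results.

```python
def distinct_taxa(lineages, rank_prefix="g__"):
    """Return the set of distinct named taxa at one rank.

    Ranks left empty by the classifier (e.g. a bare 'g__') are ignored.
    """
    taxa = set()
    for lineage in lineages:
        for rank in lineage.split(";"):
            rank = rank.strip()
            if rank.startswith(rank_prefix) and len(rank) > len(rank_prefix):
                taxa.add(rank)
    return taxa


# Invented classifier output for two hypothetical reference databases.
assignments = {
    "database_A": [
        "k__Bacteria; p__Firmicutes; g__Lactobacillus",
        "k__Bacteria; p__Firmicutes; g__Lactobacillus",
        "k__Bacteria; p__Bacteroidetes; g__Bacteroides",
        "k__Bacteria; p__Actinobacteria; g__Bifidobacterium",
    ],
    "database_B": [
        "k__Bacteria; p__Firmicutes; g__Lactobacillus",
        "k__Bacteria; p__Firmicutes; g__",
        "k__Bacteria; p__Bacteroidetes; g__Bacteroides",
        "k__Bacteria; p__Actinobacteria; g__",
    ],
}

genus_counts = {name: len(distinct_taxa(rows)) for name, rows in assignments.items()}
print(genus_counts)
```

Passing `rank_prefix="f__"` or `"s__"` gives the same count at family or species level.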

4 Conclusion

We integrated essential databases to build MetaSquare for microbiome composition profiling based on 16S rRNA gene sequencing data. Overall, MetaSquare combines widely used 16S rRNA gene databases with limited data redundancy. Furthermore, it includes novel sequences to increase database coverage. At present, MetaSquare is scheduled to be updated biannually through a semi-automatic process.

Funding

This project was funded by grants MOST 108-2321-B-037-001 and 110-2314-B-001-006 from the Ministry of Science and Technology, Taiwan, and AS-GCS-109-07 from Academia Sinica, Taiwan.

Conflict of Interest: none declared.

12 in total

1. V-Xtractor: an open-source, high-throughput software tool to identify and extract hypervariable regions of small subunit (16S/18S) ribosomal RNA gene sequences.

Authors: Martin Hartmann; Charles G Howes; Kessy Abarenkov; William W Mohn; R Henrik Nilsson
Journal: J Microbiol Methods Date: 2010-08-27 Impact factor: 2.363

2. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB.

Authors: T Z DeSantis; P Hugenholtz; N Larsen; M Rojas; E L Brodie; K Keller; T Huber; D Dalevi; P Hu; G L Andersen
Journal: Appl Environ Microbiol Date: 2006-07 Impact factor: 4.792

3. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2.

Authors: Evan Bolyen; Jai Ram Rideout; Matthew R Dillon; Nicholas A Bokulich; Christian C Abnet; Gabriel A Al-Ghalith; Harriet Alexander; Eric J Alm; Manimozhiyan Arumugam; Francesco Asnicar; Yang Bai; Jordan E Bisanz; Kyle Bittinger; Asker Brejnrod; Colin J Brislawn; C Titus Brown; Benjamin J Callahan; Andrés Mauricio Caraballo-Rodríguez; John Chase; Emily K Cope; Ricardo Da Silva; Christian Diener; Pieter C Dorrestein; Gavin M Douglas; Daniel M Durall; Claire Duvallet; Christian F Edwardson; Madeleine Ernst; Mehrbod Estaki; Jennifer Fouquier; Julia M Gauglitz; Sean M Gibbons; Deanna L Gibson; Antonio Gonzalez; Kestrel Gorlick; Jiarong Guo; Benjamin Hillmann; Susan Holmes; Hannes Holste; Curtis Huttenhower; Gavin A Huttley; Stefan Janssen; Alan K Jarmusch; Lingjing Jiang; Benjamin D Kaehler; Kyo Bin Kang; Christopher R Keefe; Paul Keim; Scott T Kelley; Dan Knights; Irina Koester; Tomasz Kosciolek; Jorden Kreps; Morgan G I Langille; Joslynn Lee; Ruth Ley; Yong-Xin Liu; Erikka Loftfield; Catherine Lozupone; Massoud Maher; Clarisse Marotz; Bryan D Martin; Daniel McDonald; Lauren J McIver; Alexey V Melnik; Jessica L Metcalf; Sydney C Morgan; Jamie T Morton; Ahmad Turan Naimey; Jose A Navas-Molina; Louis Felix Nothias; Stephanie B Orchanian; Talima Pearson; Samuel L Peoples; Daniel Petras; Mary Lai Preuss; Elmar Pruesse; Lasse Buur Rasmussen; Adam Rivers; Michael S Robeson; Patrick Rosenthal; Nicola Segata; Michael Shaffer; Arron Shiffer; Rashmi Sinha; Se Jin Song; John R Spear; Austin D Swafford; Luke R Thompson; Pedro J Torres; Pauline Trinh; Anupriya Tripathi; Peter J Turnbaugh; Sabah Ul-Hasan; Justin J J van der Hooft; Fernando Vargas; Yoshiki Vázquez-Baeza; Emily Vogtmann; Max von Hippel; William Walters; Yunhu Wan; Mingxun Wang; Jonathan Warren; Kyle C Weber; Charles H D Williamson; Amy D Willis; Zhenjiang Zech Xu; Jesse R Zaneveld; Yilong Zhang; Qiyun Zhu; Rob Knight; J Gregory Caporaso
Journal: Nat Biotechnol Date: 2019-08 Impact factor: 54.908

4. Introducing EzBioCloud: a taxonomically united database of 16S rRNA gene sequences and whole-genome assemblies.

Authors: Seok-Hwan Yoon; Sung-Min Ha; Soonjae Kwon; Jeongmin Lim; Yeseul Kim; Hyungseok Seo; Jongsik Chun
Journal: Int J Syst Evol Microbiol Date: 2017-05-30 Impact factor: 2.747

5. Construction & assessment of a unified curated reference database for improving the taxonomic classification of bacteria using 16S rRNA sequence data.

Authors: Shikha Agnihotry; Aditya N Sarangi; Rakesh Aggarwal
Journal: Indian J Med Res Date: 2020-01 Impact factor: 2.375

6. Full-length 16S rRNA gene amplicon analysis of human gut microbiota using MinION™ nanopore sequencing confers species-level resolution.

Authors: Yoshiyuki Matsuo; Shinnosuke Komiya; Yoshiaki Yasumizu; Yuki Yasuoka; Katsura Mizushima; Tomohisa Takagi; Kirill Kryukov; Aisaku Fukuda; Yoshiharu Morimoto; Yuji Naito; Hidetaka Okada; Hidemasa Bono; So Nakagawa; Kiichi Hirota
Journal: BMC Microbiol Date: 2021-01-26 Impact factor: 3.605

7. Benchmark of 16S rRNA gene amplicon sequencing using Japanese gut microbiome data from the V1-V2 and V3-V4 primer sets.

Authors: Shoichiro Kameoka; Daisuke Motooka; Satoshi Watanabe; Ryuichi Kubo; Nicolas Jung; Yuki Midorikawa; Natsuko O Shinozaki; Yu Sawai; Aya K Takeda; Shota Nakamura
Journal: BMC Genomics Date: 2021-07-10 Impact factor: 3.969

8. Ribosomal Database Project: data and tools for high throughput rRNA analysis.

Authors: James R Cole; Qiong Wang; Jordan A Fish; Benli Chai; Donna M McGarrell; Yanni Sun; C Titus Brown; Andrea Porras-Alfaro; Cheryl R Kuske; James M Tiedje
Journal: Nucleic Acids Res Date: 2013-11-27 Impact factor: 16.971

9. Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets.

Authors: Isabel F Escapa; Yanmei Huang; Tsute Chen; Maoxuan Lin; Alexis Kokaras; Floyd E Dewhirst; Katherine P Lemon
Journal: Microbiome Date: 2020-05-15 Impact factor: 14.650