MetaShot: an accurate workflow for taxon classification of host-associated microbiome from shotgun metagenomic data.

B Fosso¹, M Santamaria¹, M D'Antonio², D Lovero¹, G Corrado³, E Vizza³, N Passaro⁴, A R Garbuglia⁵, M R Capobianchi⁵, M Crescenzi⁴, G Valiente⁶, G Pesole¹,⁷.

Abstract

SUMMARY: Shotgun metagenomics by high-throughput sequencing may allow deep and accurate characterization of host-associated total microbiomes, including bacteria, viruses, protists and fungi. However, the analysis of such sequencing data is still extremely challenging in terms of both overall accuracy and computational efficiency, and current methodologies show substantial variability in misclassification rate and resolution at lower taxonomic ranks or are limited to specific life domains (e.g. only bacteria). We present here MetaShot, a workflow for assessing the total microbiome composition from host-associated shotgun sequence data, and show its overall optimal accuracy performance by analyzing both simulated and real datasets.
AVAILABILITY AND IMPLEMENTATION: https://github.com/bfosso/MetaShot. CONTACT: graziano.pesole@uniba.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28130230 PMCID: PMC5447231 DOI: 10.1093/bioinformatics/btx036

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Shotgun metagenomics approaches are opening new avenues for understanding host-microbe interactions and related pathologies. However, the effective and accurate characterization of host-associated microbiomes remains a largely unsolved problem, as different methodologies show substantial variability in classification accuracy, precision and consumption of computational resources. Current methods for taxonomic binning of metagenomic shotgun reads may be classified as supervised or unsupervised, with the former depending heavily on external reference databases (Santamaria ). Unsupervised methods are generally based on taxon-specific compositional features of genome sequences (e.g. oligonucleotide composition, periodic sequence signals, etc.) (Koslicki et al., 2014; Wood and Salzberg, 2014), whereas supervised methods are generally based on similarity data obtained by aligning sampled reads to reference databases. Although the accuracy of supervised methods depends strongly on the reliability of reference databases, they usually provide a deeper level of taxonomic classification, down to the species level. This is unfeasible for k-mer based techniques, which can hardly distinguish viral genomes from bacterial and eukaryotic genomes (Bazinet and Cummings, 2012; Soueidan et al., 2015). The microbial components of environmental or clinical samples include viruses and members of all three domains of life: Archaea, Eubacteria and, among Eukaryotes, Fungi and Protists. Furthermore, in the case of clinical samples, removing host reads is a crucial step for an appropriate and accurate microbiome assessment. We present here MetaShot, a novel analysis workflow for assessing the microbiome composition from host-associated shotgun sequence data, and show its overall better performance with respect to Kraken (Wood and Salzberg, 2014) and MetaPhlAn2 (Truong et al., 2015), two comparable state-of-the-art tools.

2 Methods

The MetaShot workflow implements a two-step similarity-based approach to attain the best compromise between computational efficiency and assignment accuracy. Because the large majority of shotgun reads derive from the host, we first carry out a fast similarity-based screening to detect candidate microbial reads. A fine-grained taxonomic assessment of the much smaller set of putative microbial reads is then carried out, which also uses an iterative taxon refinement procedure (see Supplementary Material for details). A software package implementing the MetaShot pipeline is freely available at https://github.com/bfosso/MetaShot and includes a utility tool for extracting all reads assigned to a specific NCBI taxonomic ID or all those left unassigned.
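The two-step idea can be illustrated with a minimal Python sketch. It is not MetaShot's implementation: exact k-mer lookup stands in for the similarity searches, and the function names (`kmers`, `screen_reads`, `assign_reads`), the value `K = 4` and all sequences are hypothetical toy choices. The sketch only shows why a cheap first pass shrinks the input to the costlier second pass.

```python
K = 4  # toy k-mer length (assumption; chosen only to suit the short toy reads)


def kmers(seq, k=K):
    """Return the set of all k-mers found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen_reads(reads, host_index, microbial_index):
    """Step 1 (cheap): keep reads that share more k-mers with microbes than with the host."""
    candidates = {}
    for read_id, seq in reads.items():
        read_kmers = kmers(seq)
        host_hits = len(read_kmers & host_index)
        microbial_hits = len(read_kmers & microbial_index)
        if microbial_hits > host_hits:
            candidates[read_id] = seq
    return candidates


def assign_reads(candidates, references):
    """Step 2 (costlier): assign each candidate to the reference sharing most k-mers."""
    assignments = {}
    for read_id, seq in candidates.items():
        read_kmers = kmers(seq)
        scores = {taxon: len(read_kmers & kmers(ref)) for taxon, ref in references.items()}
        best_taxon, best_score = max(scores.items(), key=lambda item: item[1])
        assignments[read_id] = best_taxon if best_score > 0 else None
    return assignments


# Hypothetical toy data: one host sequence and two microbial references.
host_genome = "AAAAAAAAAAAACCCCCCCC"
references = {
    "taxon_A": "ACGTACGTAGCTAGCT",
    "taxon_B": "GGATCCGGATTCGGAA",
}
reads = {
    "r1": "AAAAAAACCCC",  # host-like read
    "r2": "ACGTACGTAG",   # resembles taxon_A
    "r3": "GGATCCGGAT",   # resembles taxon_B
}

host_index = kmers(host_genome)
microbial_index = set().union(*(kmers(ref) for ref in references.values()))

candidates = screen_reads(reads, host_index, microbial_index)
assignments = assign_reads(candidates, references)
print(sorted(candidates))
print(assignments)
```

In this toy run the host-like read is discarded in the first step, so the per-reference scoring in the second step only runs on the two remaining reads.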

3 Results

To compare the performance of MetaShot with that of Kraken (Wood and Salzberg, 2014) and MetaPhlAn2 (Truong et al., 2015), two state-of-the-art tools for analyzing shotgun metagenomics data, we used ART (Huang ) to generate an in silico human microbiota with a composition resembling a typical human sample, containing human, bacterial and viral sequences (see Table 1A and Supplementary Material for a more detailed description).

Table 1.

(A)
                 Assigned %^a                Correctly assigned %^b
                 KR       MS       MP^c      KR       MS       MP
Human (host)     100.00   99.18    0^c       100.00   99.99    0^c
Prokaryotes
  Family         57.41    97.91    5.16      96.77    98.37    97.59
  Genus          55.01    98.14    4.96      95.92    98.17    98.02
  Species        54.17    99.31    4.76      79.52    88.06    90.7
Viruses
  Family         74.78    97.74    49.32     99.16    98.53    98.48
  Genus          101.88   97.39    66.85     99.37    99.75    99.30
  Species        73.45    97.81    43.86     98.98    96.70    95.46

^a The percentage refers to the total number of reads assignable to the specific taxonomic rank.

^b The percentage refers to the relevant assigned reads.

^c MetaPhlAn2 assigns just the sequences containing specific taxon markers and does not search for human host sequences.

(A) Benchmark assessment of Kraken (KR), MetaShot (MS) and MetaPhlAn2 (MP) on a simulated dataset (see the Supplementary Material for details) consisting of 19 582 500 human (94.5%), 986 114 bacterial (4.8%) and 146 886 viral (0.7%) reads. (B) Precision (P), Recall (R), F-measure (F) and Unclassified reads (U) of KR, MS and MP on the same simulated dataset, at the Species level.

The simulated dataset also included reads from PhiX phage, which have been shown to contaminate many assembled microbial genomes (Mukherjee et al., 2015), and from human endogenous retroviruses (HERV), which escape detection by most tools designed for analyzing shotgun metagenomics data because they are simply labeled as host reads. Under specific conditions, however, these viruses can be expressed and may play a role in disease pathogenesis (Agoni et al., 2013; Li et al., 2015). Moreover, to compare MetaShot, Kraken and MetaPhlAn2 on a controlled real dataset, we analyzed a bacterial and viral mock community (Conceicao-Neto ) available in the NCBI-SRA archive (SRR3458569). The results of the benchmark assessment displayed in Table 1 show that MetaShot outperforms Kraken and MetaPhlAn2 in terms of the overall accuracy of read assignment for the Prokaryotes and Viruses simulated datasets, at the Family, Genus and Species levels. In addition, MetaShot performs better than Kraken and MetaPhlAn2 in terms of taxon assignment accuracy at the Species and Genus levels, both qualitatively (see Supplementary Tables S1–S4) and quantitatively (see Supplementary Figs S2 and S3).
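As a worked illustration, the read shares quoted in the Table 1 caption can be recomputed from the read counts, and the metrics named for panel (B) can be written out. The formulas below are the standard definitions of precision, recall and F-measure, assumed here; how the paper counts true and false positives is described in its Supplementary Material and is not reproduced. The counts passed to `precision_recall_f` are hypothetical.

```python
# Read counts of the simulated dataset, taken from the Table 1 caption.
read_counts = {"human": 19_582_500, "bacterial": 986_114, "viral": 146_886}
total_reads = sum(read_counts.values())

# Share of each component, rounded to one decimal place as in the caption.
shares = {name: round(100 * count / total_reads, 1) for name, count in read_counts.items()}
print(total_reads, shares)


def precision_recall_f(true_positives, false_positives, false_negatives):
    """Standard precision, recall and F-measure (harmonic mean of the two)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical counts, for illustration only (not taken from the paper).
precision, recall, f_measure = precision_recall_f(
    true_positives=90, false_positives=10, false_negatives=30
)
print(precision, recall, round(f_measure, 4))
```

The recomputed shares agree with the percentages given in the caption (94.5%, 4.8% and 0.7% of 20 715 500 reads).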
Finally, to test MetaShot on a real dataset, we analyzed DNA-Seq (528 034 456 paired-end reads, 2 x 100 bp) and RNA-Seq (61 318 866 paired-end reads, 2 x 100 bp) data from a sample of cervical squamous cell carcinoma of the uterus. While it is known that about 95% of these cancers harbor human papillomavirus (HPV) genomes, the specific serotype involved varies, the most common ones being HPV16, HPV18 and HPV31 (Growdon and Del Carmen, 2008). We previously established by PCR assessment that this test sample contained HPV31. HPV31 was detected only by MetaShot, in both the DNA-Seq and RNA-Seq datasets (25 359 of 25 368 total viral reads in the DNA-Seq data and 13 684 of 14 150 total viral reads in the RNA-Seq data). Kraken detected far fewer viral reads (2656 and 1565 in total for the DNA-Seq and RNA-Seq data, respectively), none of them from HPV31 (see Supplementary Table S1), and MetaPhlAn2 was likewise unable to detect HPV31. These results confirm the optimal performance of MetaShot with respect to Kraken and MetaPhlAn2 also in the case of real data analysis.

The MetaShot output consists of: (i) an interactive HTML table reporting, for each node in the inferred taxonomy, the taxon name, the NCBI taxonomy ID and the number of assigned reads; (ii) a CSV file containing the same information reported in the interactive table; (iii) a Krona graph (Ondov et al., 2011) to graphically inspect the inferred microbiome. A distinctive feature of MetaShot is the possibility to extract all unassigned reads or the set of reads assigned to a specific taxon, defined by its NCBI taxonomy ID. This feature is particularly useful for downstream analyses such as OTU generation, contig assembly for the characterization of unassigned reads, or functional annotation of the reads belonging to a specific species/strain.
In addition, this feature makes it possible to map species-specific shotgun DNA-Seq reads to their target genome, if available, in order to rule out artifacts due to chimeric contamination in GenBank reference sequences, which are usually associated with a strong positional mapping bias (Mukherjee et al., 2015). Moreover, in the case of shotgun RNA-Seq reads, mapping to the target genome allows their expression profile to be assessed precisely.

The price for the overall better accuracy of MetaShot is lower computational efficiency: MetaShot is about 2 and 3 times slower than Kraken and MetaPhlAn2, respectively, for the complete analysis of the simulated benchmark dataset (see Supplementary Material).
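The read-extraction feature described above (all reads assigned to a given taxonomy ID, or all unassigned reads) can be sketched in Python as follows. This is not the MetaShot utility itself: the function names, the in-memory dictionaries and the toy taxonomy IDs are hypothetical, and including reads assigned to descendant taxa is an assumption of this sketch rather than documented behavior.

```python
def in_clade(node, taxid, parent_of):
    """Return True if `node` equals `taxid` or descends from it."""
    while True:
        if node == taxid:
            return True
        parent = parent_of.get(node)
        if parent is None or parent == node:  # reached the root without a match
            return False
        node = parent


def extract_reads(assignments, taxid, parent_of):
    """Read IDs assigned to `taxid` or to any taxon below it (assumption)."""
    return sorted(
        read_id
        for read_id, assigned in assignments.items()
        if assigned is not None and in_clade(assigned, taxid, parent_of)
    )


def unassigned_reads(assignments):
    """Read IDs that received no taxonomic assignment."""
    return sorted(read_id for read_id, assigned in assignments.items() if assigned is None)


# Toy taxonomy as child -> parent; these are NOT real NCBI taxonomy IDs.
parent_of = {1: 1, 100: 1, 110: 100, 111: 110, 200: 1}

# Hypothetical read -> taxonomy ID assignments (None means unassigned).
assignments = {
    "read_1": 111,
    "read_2": 110,
    "read_3": 200,
    "read_4": None,
    "read_5": 100,
}

genus_reads = extract_reads(assignments, 110, parent_of)
family_reads = extract_reads(assignments, 100, parent_of)
leftover = unassigned_reads(assignments)
print(genus_reads, family_reads, leftover)
```

The extracted read IDs could then be used to subset the original FASTQ files before assembly or mapping to a target genome.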

References (13 in total)

1. MetaPhlAn2 for enhanced metagenomic taxonomic profiling.

Authors: Duy Tin Truong; Eric A Franzosa; Timothy L Tickle; Matthias Scholz; George Weingart; Edoardo Pasolli; Adrian Tett; Curtis Huttenhower; Nicola Segata
Journal: Nat Methods Date: 2015-10 Impact factor: 28.547

2. Human endogenous retrovirus-K contributes to motor neuron disease.

Authors: Wenxue Li; Myoung-Hwa Lee; Lisa Henderson; Richa Tyagi; Muzna Bachani; Joseph Steiner; Emilie Campanac; Dax A Hoffman; Gloria von Geldern; Kory Johnson; Dragan Maric; H Douglas Morris; Margaret Lentz; Katherine Pak; Andrew Mammen; Lyle Ostrow; Jeffrey Rothstein; Avindra Nath
Journal: Sci Transl Med Date: 2015-09-30 Impact factor: 17.956

3. Human papillomavirus-related gynecologic neoplasms: screening and prevention.

Authors: Whitfield B Growdon; Marcela Del Carmen
Journal: Rev Obstet Gynecol Date: 2008

4. Interactive metagenomic visualization in a Web browser.

Authors: Brian D Ondov; Nicholas H Bergman; Adam M Phillippy
Journal: BMC Bioinformatics Date: 2011-09-30 Impact factor: 3.307

5. Large-scale contamination of microbial isolate genomes by Illumina PhiX control.

Authors: Supratim Mukherjee; Marcel Huntemann; Natalia Ivanova; Nikos C Kyrpides; Amrita Pati
Journal: Stand Genomic Sci Date: 2015-03-30

6. Finding and identifying the viral needle in the metagenomic haystack: trends and challenges.

Authors: Hayssam Soueidan; Louise-Amélie Schmitt; Thierry Candresse; Macha Nikolski
Journal: Front Microbiol Date: 2015-01-07 Impact factor: 5.640

7. A comparative evaluation of sequence classification programs.

Authors: Adam L Bazinet; Michael P Cummings
Journal: BMC Bioinformatics Date: 2012-05-10 Impact factor: 3.169

8. Detection of Human Endogenous Retrovirus K (HERV-K) Transcripts in Human Prostate Cancer Cell Lines.

Authors: Lorenzo Agoni; Chandan Guha; Jack Lenz
Journal: Front Oncol Date: 2013-07-09 Impact factor: 6.244

9. WGSQuikr: fast whole-genome shotgun metagenomic classification.

Authors: David Koslicki; Simon Foucart; Gail Rosen
Journal: PLoS One Date: 2014-03-13 Impact factor: 3.240

10. Kraken: ultrafast metagenomic sequence classification using exact alignments.

Authors: Derrick E Wood; Steven L Salzberg
Journal: Genome Biol Date: 2014-03-03 Impact factor: 13.583
