MetaStorm: A Public Resource for Customizable Metagenomics Annotation.

Gustavo Arango-Argoty¹, Gargi Singh², Lenwood S Heath¹, Amy Pruden³, Weidong Xiao⁴, Liqing Zhang¹.

Abstract

Metagenomics is a rapidly growing research area that requires the analysis of large quantities of data generated by next-generation DNA sequencing technologies. The need to store, retrieve, analyze, share, and visualize such data challenges current online computational systems. Interpretation and annotation of specific information is especially challenging for metagenomic data sets derived from environmental samples, because current annotation systems offer only broad classification of microbial diversity and function. Moreover, existing resources are not configured to readily address common questions relevant to environmental systems. Here we developed a new user-friendly online metagenomic analysis server called MetaStorm (http://bench.cs.vt.edu/MetaStorm/), which facilitates customization of computational analysis for metagenomic data sets. Users can upload their own reference databases to tailor the metagenomics annotation to focus on various taxonomic and functional gene markers of interest. MetaStorm offers two major analysis pipelines: an assembly-based annotation pipeline and the standard read annotation pipeline used by existing web servers. These pipelines can be selected individually or together. Overall, MetaStorm provides enhanced interactive visualization to allow researchers to explore and manipulate taxonomic and functional annotation at various levels of resolution.

Year: 2016 PMID: 27632579 PMCID: PMC5025195 DOI: 10.1371/journal.pone.0162442

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

The field of metagenomics has arisen following the advent of next-generation DNA sequencing. Through new technologies, such as Illumina and pyrosequencing, it is now possible to directly shotgun sequence DNA extracted from various environmental samples, without the need for cloning. Metagenomics is particularly promising for advancing the understanding of the structure and function of microbial communities residing in natural, human, and engineered environments. To date, metagenomic data sets have been obtained from different regions of the human body [1, 2, 3], seas and oceans [4, 5, 6], lakes and rivers [7, 8, 9], wastewater and drinking water treatment systems [10, 11, 12, 13], soil [14, 15], and air [16, 17]. Unlike the genomic characterization of a single organism, metagenomic data sets contain DNA sequences derived from hundreds or even thousands of microbial species [18, 19]. Thus, a major computational undertaking is to annotate metagenomic samples in terms of the kinds of microbes (taxonomy) and genes (functional annotation), particularly those that are present in complex environmental samples. Various computational resources have been developed for taxonomic and functional annotation of metagenomic data sets. These resources can be classified into two main categories: 1) Web services organized as a collection of different computational resources that facilitate the storage, analysis, and retrieval of metagenomic data (e.g., MG-RAST [20] and EBI-Metagenomics [21]); 2) stand-alone programs for various aspects of metagenomic data annotation (e.g., MEGAN [22], MOCAT [23], QIIME [24], MetaPhlAn [25], MetaHIT [26], and MyTaxa [27]), which have been commonly incorporated into Web services. Generally, current services (MG-RAST and EBI-Metagenomics) annotate metagenomic samples by matching raw sequences against a fixed set of large reference sequence databases (e.g., UniProtKB [28] and Clusters of Orthologous Groups of proteins (COG) [29]).
This practice has two major limitations. First, there is a lack of user customization, particularly the inability to select specific sets of genes. Thus, all annotations are made with respect to the same reference databases, which may not be the most suitable depending on the hypotheses driving the research. The ability to select and focus on desired sets or subsets of reference sequences enables testing of domain-specific hypotheses. For instance, conclusions of studies of antibiotic resistance gene occurrence in the environment (e.g., [30]) can vary depending on the database selected, i.e., CARD [31], a specialized antibiotic resistance gene database, versus the full GenBank database. Second, because sequence reads are short, the ability to assemble reads can be critical to identifying genes of interest and avoiding loss of information. The assembly of raw reads into longer contigs/scaffolds has proved to be more effective for annotating sequence features such as operons, transcription binding sites, chromosome organization, and taxonomy [19, 32].

Here we introduce a new online metagenomic analysis server, MetaStorm, which improves on available web resources, particularly for environmental samples, while maintaining a user-friendly interface. MetaStorm offers both read matching and assembly-based annotation pipelines, while also enabling customization of reference databases. This allows users to upload databases containing curated genes of interest to facilitate functional and taxonomic annotation. MetaStorm also provides enhanced visualization of annotation results, allowing the user to explore and manipulate taxonomic and functional annotations at various levels of resolution and to compare annotations for similarities and differences across multiple data samples using various graphs.

Materials and Methods

Raw data is submitted to the MetaStorm server via a user-friendly web interface. Submitted data can remain private or be made public depending on user preference. Users are required to create an account and a profile. This profile allows them to retrieve, submit, analyze, and compare not only their own samples but also other public projects. MetaStorm stores the metagenomic samples and results in user projects, which describe the features of the metagenomic experiments. If a project is made public, the raw data and any associated results are free for download.

Required data types

MetaStorm requires the user to upload raw sequences in the widely-used FASTQ format [33]. Any high-throughput DNA sequencing technology (e.g., amplicon or shotgun sequencing) is accepted. Provision of detailed metadata associated with the samples from which the DNA sequences were derived is mandatory during the submission process. Provision of metadata is critical to help users identify similar studies that are already in the MetaStorm repository for additional sample comparisons. Data is organized in a manner that facilitates retrieval. A project may contain several samples and each sample may be nested with several associated studies within it (e.g., taxonomy annotation, antibiotic resistance, or any functional annotation using both assembly and read matching pipelines). All user, sample, and project information is stored in a relational database.
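
The project, sample, and study nesting described above maps naturally onto a relational schema. The following is a minimal sketch using Python's built-in sqlite3 module; the table names, column names, and demo rows are hypothetical and are not taken from the MetaStorm implementation, which the text does not describe in detail.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the actual MetaStorm database layout.
SCHEMA = """
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    is_public INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL REFERENCES project(id),
    name TEXT NOT NULL,
    environment TEXT
);
CREATE TABLE study (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    pipeline TEXT NOT NULL CHECK (pipeline IN ('assembly', 'read_matching')),
    reference_db TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with one project, one sample, two studies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO project (id, name, is_public) VALUES (1, 'demo', 1)")
    conn.execute(
        "INSERT INTO sample (id, project_id, name, environment) "
        "VALUES (1, 1, 'S1', 'wastewater')"
    )
    conn.executemany(
        "INSERT INTO study (sample_id, pipeline, reference_db) VALUES (?, ?, ?)",
        [(1, "assembly", "CARD"), (1, "read_matching", "SILVA")],
    )
    return conn


def studies_for_project(conn, project_id):
    """List (sample name, pipeline, reference database) for every study in a project."""
    return conn.execute(
        """SELECT sample.name, study.pipeline, study.reference_db
           FROM study JOIN sample ON study.sample_id = sample.id
           WHERE sample.project_id = ?
           ORDER BY study.id""",
        (project_id,),
    ).fetchall()
```

Keeping studies in their own table lets one sample carry several annotation runs (different pipelines or reference databases) without duplicating the sample metadata.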

Reference database

Apart from a set of standard databases (e.g., CARD [31], UniProtKB [28], and GREENGENES [34]) (Table 1), MetaStorm also allows users to upload and use their own customized databases as reference databases. The customizability of reference databases is especially useful when researchers seek to test a hypothesis by comparison against a very specific set of sequences. Neither MG-RAST nor the EBI-Metagenomics Web service allows for customized reference databases. In this way, MetaStorm enhances user control by allowing users to select the reference sequences.

Table 1

Default reference databases provided by the MetaStorm Web service.

Database	Source	Type	Number of IDs	Annotation
UniProtKB	http://www.uniprot.org/help/uniprotkb	protein	551,705	function
CARD	http://arpcard.mcmaster.ca/	protein	4,120	function
ACLAME	http://aclame.ulb.ac.be/	protein	122,154	function
BACMET	http://bacmet.biomedicine.gu.se/	protein	444	function
CAZy	http://www.cazy.org/	protein	281,237	function
SILVA	http://www.arb-silva.de/	nucleotide	1,756,783	taxonomy
COG	http://www.ncbi.nlm.nih.gov/COG/	protein	346,378	function
GREENGENES	http://greengenes.lbl.gov/cgi-bin/nph-index.cgi	nucleotide	1,262,986	taxonomy

Web-based submission

Submission of metagenomic data is made through an interactive web interface. Users are first required to log in to the MetaStorm website, select (or create) the project they wish to analyze, and select the desired method (Assembly/Read matching). Once in the project profile page, users need to insert sample information (number of samples, name of the samples, conditions, environment, and library preparation), select reference databases, upload raw FASTQ files, and finally run the annotation pipeline. To simplify the process of data submission, MetaStorm does not require external files such as Excel spreadsheets for sample description and provision of metadata (although this functionality can easily be added in a future update if necessary). This interactive tool also allows users to remove samples and projects or re-run the samples with different pipelines, visualizing the results as needed.

Main user interface of MetaStorm.

Create a new project lets users submit a project under their user profile. My Projects grants access to the data management interface, which includes: upload raw files, add samples, remove samples, visualize individual samples, and compare samples. Customize Reference Database gives access to the form for uploading a customized reference database. Browse projects lets users find samples by biome and/or location. Comparison tool allows users to compare samples from different projects. Profile allows users to modify their personal information and password.

Analysis pipeline

Once stored in the MetaStorm server, raw reads are queued for taxonomic and functional annotation. MetaStorm incorporates two pipelines, the assembly-based pipeline and the read-matching pipeline. Selecting the appropriate pipeline depends on several factors, including the design of the experiment, prior knowledge about the experiment, and the research hypothesis and goals. For instance, if the objective is to characterize the most abundant taxa in the community, the assembly pipeline may suffice [18].

Pipelines.

Overview of the computational pipelines implemented in the MetaStorm service for taxonomic and functional annotation.

Assembly pipeline

Through the assembly process, metagenomic reads are merged into large contiguous sequences, varying in length from several hundred bases to nearly complete genomes, that provide much richer information than the raw reads [18, 19]. MetaStorm provides a fully automated assembly pipeline that allows the user to visualize, compare, and analyze the taxonomic and functional content of a sample or set of samples by matching predicted genes to reference databases and computing their abundance. The pipeline for assembly and gene finding is similar to the methods reported by the MetaHIT consortium [26] (mainly the metagenome assembly and gene prediction through scaffolds). This pipeline consists of the following major procedures:

Quality control (QC): Reads are trimmed and filtered with TRIMMOMATIC [35] to remove low-quality sequences from the data set.

Assembly: The QC-filtered reads are assembled with IDBA-UD [36], a widely used metagenome assembler that has demonstrated consistent production of high-quality scaffolds [37, 38, 39]. MetaStorm uses the default parameters.

Gene prediction: Once a set of scaffolds is assembled, PRODIGAL [40] (metagenomics version), a microbial gene-finding program, is deployed to predict genes within each scaffold.

Taxonomy annotation: Predicted genes are matched to a reference database using two alignment tools (BLAST [41] and DIAMOND [42]). Two kinds of reference are currently included. The first is a pair of 16S rRNA databases (SILVA [43] and GREENGENES [34]). The 16S rRNA gene abundance is computed by first using BLASTN [41] to select, for each scaffold gene, the best hit in the reference database (same definition as the MG-RAST representative hit [44]) and then counting the number of genes assigned to each taxon (E-value<1e-10, identity>90%). Note that the taxonomy profile is computed from the abundance of predicted genes, not the number of reads. The second is a set of marker genes processed by the MetaPhlAn2 [45] pipeline. This technique is included because whole genome sequencing samples typically contain very low 16S rRNA sequence content [26, 27, 45].

Functional annotation: Predicted genes (translated proteins from PRODIGAL) are matched to the user-selected reference databases using the DIAMOND BLASTP aligner [42]. We use the representative hit strategy with an E-value<1e-10, identity>60% over the entire length [46], and a minimum length of 25 aa. The reference sequence databases used for functional annotation depend on the user's criteria. For instance, a user interested in antibiotic resistance genes may prefer to run the analysis against the CARD database [31], whereas a project related to degradation processes may use the CAZy database [47].
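
The functional annotation thresholds above can be illustrated with a short filter over tabular alignment output. This is a minimal sketch, not the MetaStorm code: it assumes the standard 12-column BLAST tabular format (outfmt 6, which DIAMOND also emits by default), and it interprets "representative hit" as the highest bit score hit per query that passes the thresholds, which is an assumption. The coverage requirement ("over the entire length") is not checked here. The names `Hit`, `parse_tabular`, and `representative_hits` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length (amino acids for protein searches)
    evalue: float
    bitscore: float


def parse_tabular(lines):
    """Parse 12-column tabular lines: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11]))


def representative_hits(hits, max_evalue=1e-10, min_identity=60.0, min_length=25):
    """Keep, per query, the passing hit with the highest bit score.

    Defaults mirror the thresholds quoted in the text for functional
    annotation (E-value<1e-10, identity>60%, minimum length 25 aa).
    """
    best = {}
    for h in hits:
        if h.evalue >= max_evalue or h.identity <= min_identity or h.length < min_length:
            continue
        current = best.get(h.query)
        if current is None or h.bitscore > current.bitscore:
            best[h.query] = h
    return best
```

For the 16S rRNA taxonomy step, the same function could be called with `min_identity=90.0`, matching the thresholds given for that step.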

Read matching pipeline

The read matching pipeline conducts taxonomic and functional annotation of metagenomic data by comparing the raw sequence reads to a reference database. This approach is also called marker gene analysis [18]. For taxonomy annotation, MetaStorm uses a matching scheme similar to MG-RAST and EBI-Metagenomics, in which reads are first trimmed and quality filtered using TRIMMOMATIC [35] and then mapped to a 16S rRNA sequence database (SILVA/GREENGENES). To speed up the read matching process, we use Bowtie2 [48], a fast and sensitive read matching tool specialized for mapping short reads to reference genomes (sensitive local alignment mode, identity>90%, best-hit alignment). It has proven to be particularly efficient for matching marker gene databases; MetaPhlAn2 [45], which uses Bowtie2 for read matching, produced more accurate results than its earlier version MetaPhlAn1 [25], which uses BLAST. MetaPhlAn2 [45], which uses a set of clade-specific genes, is also offered by MetaStorm to estimate taxonomic abundance. Functional annotation is performed by comparing the high-quality reads to the reference database using the DIAMOND BLASTX [42] aligner with the representative hit approach [44] (E-value<1e-10, identity>90%, and minimum length of 25 aa).

Sample normalization and comparison

Sample comparison consists of analyzing relative abundance across a set of samples, allowing the user to visualize similarities and differences among samples. One of the critical aspects of sample comparison is data normalization. MetaStorm implements three normalization techniques:

Scaling: Normalizes the number of matches obtained per sample to a relative abundance between 0 and 100.

RPKM: Normalizes the number of matches for each gene as Reads Per Kilobase per Million mapped reads.

Relative to 16S rRNA: Uses the normalization concept described in [30], which defines relative abundance as copies of a functional gene per copy of the 16S rRNA gene.

Normalizations are calculated differently for the two pipelines. For the assembly-based pipeline, all computations are made in terms of the number of matched genes, whereas the read-matching pipeline normalizes the samples using the number of matched reads.
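
The three normalizations can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the MetaStorm implementation: scaling is interpreted here as percent of total matches in the sample, RPKM uses the standard definition, and the 16S-relative value is written as a length-corrected ratio of functional gene matches to 16S rRNA matches, which is one plausible reading of "copies per 16S rRNA copy". All function and parameter names are hypothetical.

```python
def scale_to_percent(counts):
    """Scaling: express each category as a percentage (0 to 100) of all matches.

    Assumption: 'scaling' means percent of the sample total.
    """
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def rpkm(count, gene_length_bp, total_mapped):
    """Standard RPKM: matches per kilobase of gene per million mapped matches."""
    return count * 1e9 / (gene_length_bp * total_mapped)


def per_16s_copy(gene_count, gene_length_bp, n16s_count, n16s_length_bp):
    """Length-corrected functional gene matches per length-corrected 16S match.

    Assumption: both counts are divided by their reference lengths before
    taking the ratio; the caller supplies the 16S reference length.
    """
    if n16s_count == 0:
        raise ValueError("no 16S rRNA matches; ratio is undefined")
    return (gene_count / gene_length_bp) / (n16s_count / n16s_length_bp)
```

In the assembly-based pipeline the `count` arguments would be numbers of matched genes, and in the read-matching pipeline numbers of matched reads, following the distinction drawn in the text.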

Visualization of taxonomic abundance

MetaStorm offers interactive visualization, allowing users to see in detail the main features of the sequence make-up of each sample. A taxonomic tree encodes the relative abundance of the different lineages in the sample. For example, a user interested in the relative abundance of various kinds of Proteobacteria will find that the genus Achromobacter is the most abundant in the example sample. Unlike other metagenomic tools, such as MG-RAST and EBI-Metagenomics, we allow interactive visualization to improve the user experience. In particular, the tree allows users to keep track of various levels of the phylogenetic hierarchy. Also, when the user clicks on any specific node (taxon), all descendants of that node are displayed as a pie chart. The overall abundance at a taxonomic level can also be displayed as a pie chart. Node colors represent relative abundance. All visualization formats are available for the taxonomic annotation methods.
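
The data behind such a tree can be sketched as counts rolled up along each lineage, with the pie chart for a clicked node being the percentage share of its direct children. The code below is a hypothetical illustration of that idea, not the MetaStorm visualization code; lineages are represented as simplified tuples of taxon names, and the function names are assumptions.

```python
from collections import defaultdict


def aggregate_lineages(leaf_counts):
    """Roll match counts up the tree.

    leaf_counts maps a lineage tuple, e.g. ("Bacteria", "Proteobacteria",
    "Achromobacter"), to a match count. The result maps every lineage prefix
    (every node on the path to a leaf) to the summed count beneath it.
    """
    totals = defaultdict(int)
    for lineage, n in leaf_counts.items():
        for depth in range(1, len(lineage) + 1):
            totals[lineage[:depth]] += n
    return dict(totals)


def children_share(totals, node):
    """Percentage share of each direct child of `node` (pie chart data)."""
    depth = len(node) + 1
    kids = {
        lineage[-1]: n
        for lineage, n in totals.items()
        if len(lineage) == depth and lineage[:-1] == node
    }
    total = sum(kids.values())
    if total == 0:
        return {}
    return {name: 100.0 * n / total for name, n in kids.items()}
```

Calling `children_share(totals, ("Bacteria", "Proteobacteria"))` on aggregated counts would return the per-genus percentages that a pie chart for the Proteobacteria node could display.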

Taxonomy visualization.

Taxonomy levels are shown as pie charts (only Family and Genus are shown for illustration). The interactive tree allows users to follow the path of the abundant taxa, and the chart displays the selected taxonomy level. The right panel shows the distribution of hits under the open node in the taxonomy tree. In this example, the families under the order Rhizobiales are shown in the left panel.

Visualization of functional abundance

Functional relative abundance is described by a set of interactive pie charts and bar plots that relate functional categories to the genes involved in each category. Users can select the reference database to analyze, and all the tables can be downloaded in text format. When analyzing individual samples, read/gene counts are normalized using a linear scale from 0 to 100.

Functional and sample comparison visualization.

(A) Functional annotation is depicted by a pie chart, where the user can select the database to visualize. (B) Sample comparison visualization using stacked bars for both taxonomy and function. (C) Interactive heat map visualization where users can click on the branches to zoom in on the related functions or taxa.

Visualization of sample comparison

Visualization techniques employed by MetaStorm include heat maps, stacked bars, and interactive trees (taxonomy annotation). As for single sample visualization, the interactive tree shows relative abundance for each node (taxon) and also for each taxonomic hierarchical level, allowing a high level of specificity. These interactive visualization features are not available in other visualization tools, such as MG-RAST or EBI-Metagenomics.

Data Access

Similar to MG-RAST and EBI-Metagenomics, all the information on a project tagged public, such as raw read files, processed files, description files, and visualization tables, is freely available through MetaStorm. From the home page, the user can access descriptions of all the recently listed (public) projects and the reference databases that other users have submitted. A search tool is available for users to identify potential sets of reference sequences that match their analysis. MetaStorm's reference sharing capability aims to support 1) the focus of knowledge based on user runs and 2) the projected run time for reporting MetaStorm results. As expected, small customized databases report results faster than full reference databases. A novice user can use such a shared database for analysis and move directly to the specific biological problem, saving computing time. Moreover, the search tool enables users to find similar existing (public) metagenome samples in MetaStorm and include them for more comprehensive comparison studies. Comparison across different samples is made feasible by the normalization criteria implemented in MetaStorm. Finally, all the raw and generated files for the metagenomic analysis can be downloaded in a variety of formats by clicking on the download button of each section in the visualization page.

Results and Discussion

Compared to other metagenomic resources, such as MG-RAST and EBI-Metagenomics, MetaStorm extends the analysis and visualization of metagenomic samples by: 1) adding a fully developed assembly-based annotation pipeline, in addition to the read matching pipeline deployed by these Web servers; 2) offering a customized analysis where the user can select and upload reference databases, which enables focus on specific genes of interest as well as inter-project comparison; and 3) providing interactive visualization capabilities, including an interactive taxonomic tree, which permit users to interrogate and compare specific aspects of the sequence data. MetaStorm includes a wide variety of databases used for metagenomics analysis (see the Reference database section). Those databases are used by default by several current metagenomics resources. While the assembly pipeline implemented by MetaStorm is similar to that of the MetaHIT pipeline [26], it incorporates a more meaningful relative abundance determination in which copies are normalized to 16S rRNA gene copies [30]. Normalization enables comparison across multiple metagenomic data sets, including those generated by external labs, empowering researchers to address broad. This last feature is particularly promising for the future applicability of the MetaStorm server.

Conclusion

MetaStorm is a free and public metagenomics resource that enables more specific user customization through various improvements in visualization, data management, and user interactivity. MetaStorm offers two main metagenomic analysis pipelines: the read matching pipeline (similar to the current web resources) and the assembly pipeline. Unlike other web resources, MetaStorm incorporates user reference customization, which will help to streamline the annotation process when a research hypothesis requires specific and customized databases.

