Metagenomic analysis: the challenge of the data bonanza.

Chris I Hunter, Alex Mitchell, Philip Jones, Craig McAnulla, Sebastien Pesseat, Maxim Scheremetjew, Sarah Hunter.

Abstract

Several thousand metagenomes have already been sequenced, and this number is set to grow rapidly in the forthcoming years as the uptake of high-throughput sequencing technologies continues. Hand-in-hand with this data bonanza comes the computationally overwhelming task of analysis. Herein, we describe some of the bioinformatic approaches currently used by metagenomics researchers to analyze their data, the issues they face and the steps that could be taken to help overcome these challenges.

Year: 2012 PMID: 22962339 PMCID: PMC3504930 DOI: 10.1093/bib/bbs020

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

METAGENOMICS: A BROAD FIELD

Metagenomics is the study of the genetic material present in a given environment (for a detailed review of the field, see [1, 2]). However, the term ‘metagenomics’ covers a very broad range of technical activities, including the collection of environmental samples [3], the extraction of deoxyribonucleic acid (DNA)/ribonucleic acid (RNA)/protein from those samples, the ever-increasing variety of technologies used for sequencing [4] and the subsequent analysis and interpretation of the resulting data. In this article, we briefly review current practices in metagenomic sequence analysis and describe potential future developments that may affect them.

TAXONOMIC ANALYSIS AND METAGENOMICS

The taxonomic classification of living things has long been a central theme in biology; this is particularly true of metagenomics. Amplicon-based taxonomic studies currently dominate the field, and, at the time of writing, more than 80% of the publicly available data sets within the MG-RAST service [5] are taxonomic analyses of the 16S rRNA marker gene. Other phylogenetic classification approaches, such as those offered by Phymm [6] and PhyloPythia [7], are also being used more extensively. Such analyses are highly valuable, as particular phylogenetic groupings can be associated with important functions, and the diversity of a microbial community is thought to provide an indication of the resilience of the system (i.e. its ability to carry on functioning when conditions change). However, taxonomic studies may not necessarily reflect the complex biological processes that exist in an environment, as microbial genes can move horizontally between unrelated species. Consequently, the same functional gene can be present in a variety of taxonomic backgrounds. Furthermore, these approaches do not take account of intra-species diversity (where organisms may gain or lose function as they adapt to a specific environment) or situations where organisms may be actively engaged in only a subset of their functional repertoire.
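
The notion of community diversity can be made concrete with a diversity index. The article does not name a particular index; the minimal sketch below assumes the widely used Shannon index, and the taxon labels and read counts are invented purely for illustration.

```python
import math
from collections import Counter


def shannon_diversity(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads assigned to any taxon")
    h = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


# Hypothetical marker-gene read assignments for two samples (made-up labels).
even_sample = Counter({"taxon_A": 25, "taxon_B": 25, "taxon_C": 25, "taxon_D": 25})
skewed_sample = Counter({"taxon_A": 97, "taxon_B": 1, "taxon_C": 1, "taxon_D": 1})

print("even sample:  ", round(shannon_diversity(even_sample), 3))
print("skewed sample:", round(shannon_diversity(skewed_sample), 3))
```

An evenly populated community scores higher than one dominated by a single taxon. Note that the index says nothing about which functions those taxa carry, which is the limitation of purely taxonomic studies raised above.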

FUNCTIONAL ANALYSIS OF METAGENOMIC SAMPLES

A complementary approach is to analyze the putative functional entities (such as protein coding sequences) within the genomic and/or transcriptomic sequences from an environmental sample. This has become an increasingly realistic proposition as the power of high-throughput sequencing rises and its cost falls; it is now feasible to sequence a representative proportion of an entire metagenome at a reasonable price. The remaining challenge is to process the massive volumes of data produced by such approaches. Analysis of putative protein coding sequences typically begins with the identification and translation of open reading frames within nucleotide sequences. A minimum size constraint is usually applied, as prediction of function for very short sequences is not reliable. Frequently, pairwise sequence alignment methods, such as BLAST [8], are then used to infer function by searching for similarity to other sequences in a reference database. One of the original design specifications for BLAST was to provide a tool for fast comparison of sequences. Despite having been developed over 20 years ago, it is still one of the fastest sequence comparison algorithms available. Nevertheless, the sheer volume of sequence data produced during metagenomic studies means that BLAST-based analyses represent significant bottlenecks, which are unlikely to be addressed simply by scaling up computational resources [9].
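
The first step described above, identifying and translating open reading frames subject to a minimum size, can be sketched as follows. This is an illustration under stated assumptions rather than the method of any particular pipeline: it uses the standard genetic code, treats any stop-to-stop stretch in the six reading frames as a candidate (reads are fragments, so a start codon is not required), and the threshold `min_aa` and the example read are arbitrary.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_aa=30):
    """Return (strand, frame, peptide) for stop-free stretches of >= min_aa residues."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for segment in translate(s[frame:]).split("*"):
                if len(segment) >= min_aa:
                    orfs.append((strand, frame, segment))
    return orfs


# Made-up read: a start codon, ten alanine codons and a stop codon.
read = "ATG" + "GCT" * 10 + "TAA"
for strand, frame, peptide in find_orfs(read, min_aa=10):
    print(strand, frame, peptide)
```

Raising `min_aa` discards short candidates, trading sensitivity for fewer unreliable predictions. The surviving peptides are what would then be passed to a similarity search.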

PROTEIN SIGNATURE-BASED ANALYSES

An alternative protein sequence analysis approach is to use computational models, known as protein signatures, of the type housed in the InterPro [10] consortium of databases, such as Pfam [11], PROSITE [12], PRINTS [13], CATH-Gene3D [14] and TIGRFAMs [15]. These signatures draw on multiple sequence alignments of protein families, domains and functionally important sites. By using such alignments, protein signatures are able to model the (often few) amino acid residues that are conserved across distantly related proteins and are essential for stability and function. Identifying such residues is not possible with pairwise alignment techniques, and consequently protein signatures are usually more sensitive at detecting divergent homologs [16, 17]. Protein signature-based sequence analysis methods offer two further important advantages over their pairwise alignment-based counterparts. First, as they are built to recognize specific functional entities, such as individual protein families or particular functional domains, matches to signatures are highly accurate predictors of function. This is in contrast to pairwise alignment approaches, where the only significant matches are often to other uncharacterized sequences, meaning that no functional information can be inferred. Second, recent technological advances, such as the development of the HMMER3 algorithm [18], have led to substantial performance increases in a number of protein signature-based analysis techniques, so that they can now offer fast, as well as accurate and sensitive, alternatives to BLAST. A number of metagenomic analysis pipelines already use protein signatures to predict the functional characteristics of metagenomics data sets. For example, both CAMERA [19] and WebMGA [20] use Pfam and TIGRFAMs alongside BLAST-based approaches for functional sequence analysis. CARMA [21] and CoMet [22] also draw on Pfam for their analyses.
EMBL-EBI's recently launched resource (http://www.ebi.ac.uk/metagenomics) uses InterPro for functional characterization of metagenomic sequences. InterPro combines different types of protein signature from multiple diverse databases, providing extensive sequence coverage and fine-grained functional analyses. It also provides additional benefits, such as the association of Gene Ontology terms [23] with signatures and inference of potential involvement in biological pathways, further augmenting the annotation of protein sequences. InterPro’s utility is expected to grow in the future as investigations into over-represented amino acid sequences in metagenomic data lead to the in silico identification of novel protein families and domains, which will in turn be modeled and incorporated into the InterPro Consortium’s member databases.
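
The idea that a signature captures the few residues conserved across a family, and can therefore recognize a divergent homolog that looks weak in a pairwise comparison, can be illustrated with a deliberately simplified sketch. Real signatures in the InterPro member databases are far richer (for example profile hidden Markov models); here a "signature" is assumed to be just the set of alignment columns that are identical in every family member, and all sequences are invented and assumed to be pre-aligned without gaps.

```python
def conserved_columns(alignment):
    """Return {column_index: residue} for columns identical in all sequences."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("aligned sequences must have equal length")
    conserved = {}
    for i in range(length):
        column = {s[i] for s in alignment}
        if len(column) == 1 and "-" not in column:
            conserved[i] = alignment[0][i]
    return conserved


def matches_signature(seq, signature):
    """True if seq carries every conserved residue at the expected position."""
    return all(i < len(seq) and seq[i] == aa for i, aa in signature.items())


def percent_identity(a, b):
    """Percentage of identical positions between two pre-aligned sequences."""
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / min(len(a), len(b))


# Toy, made-up family alignment (not real proteins).
family = [
    "ACDHKLGSCTWE",
    "MCEHRVGTCSWD",
    "LCQHKIGNCAWQ",
]
signature = conserved_columns(family)

# Hypothetical divergent fragment that keeps only the conserved residues.
query = "PCNHYFGQCGWR"
print("conserved positions:", signature)
print("identity to first member: %.1f%%" % percent_identity(query, family[0]))
print("matches signature:", matches_signature(query, signature))
```

The query shares under half of its residues with any single family member, yet it retains every conserved position, which is the situation in which signature-based methods are expected to outperform pairwise alignment.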

COMPUTATIONAL ADVANCES IN METAGENOMIC ANALYSIS: THE NEED FOR SPEED

Even if protein signature-based methods are used, the time taken to analyze metagenomic data currently far outweighs the time taken to produce the sequences in the first place. It is anticipated that new paradigms, such as graphics processing unit (GPU) computing and cloud computing, may help to mitigate this bottleneck in the future. Promising work has already begun in this area. For example, the developers of Parallel-META [24] have reported a 10–15-fold increase in analysis speed using a GPU rather than a central processing unit (CPU). CloVR [25], meanwhile, provides a virtualized machine containing multiple microbial sequence analysis pipelines, including one for metagenomics. It gives users the option to run their analysis locally or on a commercial or academic cloud. The use of GPUs and other hardware-based approaches is limited by the specialist programming required to adapt software to run on these architectures. Indeed, the number of general bioinformatics applications that can be run on GPUs is still restricted because of this. Cloud computing facilities should eventually revolutionize the way metagenomics researchers work, potentially allowing even small laboratories access to vast amounts of compute power. However, there remain some drawbacks with this approach, including the relative expense of the compute (running a fully utilized compute farm is cheaper than purchasing time on a commercial cloud [26]) and potential security issues related to transferring data into the cloud environment.

METADATA PROVIDES CONTEXT TO ANALYSIS

Speed is not the only important consideration in metagenomics analysis. Critical to any metagenomic study is the extent and quality of the associated metadata, as this provides context to the experiments and allows meaningful comparisons to be made between studies. This is exemplified by the Western English Channel study [27], where multiple samples have been meaningfully compared across a large time series. The collection of detailed metadata for each sample allowed the researchers to hypothesize which factors most affected the species and functional variety at that site. In recognition of its importance, there has recently been a community-driven shift toward archiving a greater degree of sample contextual metadata with study data, which has been largely facilitated by the Genomic Standards Consortium (GSC) [28]. The mission statement of the GSC is to work toward the implementation of new genomic standards for metadata and methods of capturing and exchanging that metadata. It is immensely valuable to store standards-compliant metadata and the raw sequence data they describe in public repositories, as this allows future reuse and reinterpretation of these data by other scientists. For this reason, researchers are encouraged to submit metadata and raw sequence reads to the INSDC Nucleotide Archives either directly or via the EMBL-EBI metagenomics portal.
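
A simple completeness check of the kind a submitter might run before depositing samples is sketched below. The field names are assumptions loosely modeled on the sort of contextual information that metadata checklists ask for; they are not taken from the article, and the sample records are invented.

```python
# Hypothetical minimum set of contextual fields (an assumption for illustration).
REQUIRED_FIELDS = ("collection_date", "lat_lon", "geo_loc_name", "depth")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a sample record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


# Made-up sample records.
samples = {
    "sample_1": {
        "collection_date": "2008-01-15",
        "lat_lon": "50.25 N 4.22 W",
        "geo_loc_name": "English Channel",
        "depth": "0 m",
    },
    "sample_2": {
        "collection_date": "2008-04-22",
        "lat_lon": "",
        "geo_loc_name": "English Channel",
    },
}

for name, record in samples.items():
    gaps = missing_metadata(record)
    print(name, "complete" if not gaps else "missing: " + ", ".join(gaps))
```

Samples that fail such a check cannot be placed in context, which limits the comparisons between studies that the metadata is meant to enable.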

CONCLUSION: THE NEED FOR A CONSOLIDATED APPROACH TO METAGENOMICS

Multiple public resources already exist that allow users to view and analyze metagenomics data; however, the field still faces several challenges. It is vital that metagenomics service providers adopt a consistent policy toward metadata, metadata standards and user access to associated raw data, so that metagenomes can be interpreted appropriately by researchers. Despite improvements to functional analysis methods (including the adoption of protein signatures for increased search performance and the optimization of algorithms such as HMMER), the expense of compute remains a barrier to the full realization of metagenomics’ potential. It is hoped that collaboration between analysis providers will lead to better exploitation of new computing paradigms to solve some of these issues.

Metagenomics has historically been dominated by the taxonomic diversity approach, but next-generation sequencing is changing this, with more people beginning to investigate the functional potential of an environmental sample. Protein signatures are a sensitive way to identify protein families, domains and functionally important sites within protein sequence fragments. High-quality contextual data are essential to allow meaningful comparisons to be made between environmental samples. The EMBL-EBI metagenomics portal has recently been launched in beta. It facilitates InterPro-driven functional analysis of metagenome sequences and combines this with a metadata-rich archive of metagenomics experiments.

27 in total

1. The next-generation sequencing technology: a technology review and future perspective. (Review)

Authors: XiaoGuang Zhou; LuFeng Ren; YunTao Li; Meng Zhang; YuDe Yu; Jun Yu
Journal: Sci China Life Sci Date: 2010-02-12 Impact factor: 6.038

2. Microbial metagenomics: beyond the genome. (Review)

Authors: Jack A Gilbert; Christopher L Dupont
Journal: Ann Rev Mar Sci Date: 2011

3. The Pfam protein families database.

Authors: Robert D Finn; Jaina Mistry; John Tate; Penny Coggill; Andreas Heger; Joanne E Pollington; O Luke Gavin; Prasad Gunasekaran; Goran Ceric; Kristoffer Forslund; Liisa Holm; Erik L L Sonnhammer; Sean R Eddy; Alex Bateman
Journal: Nucleic Acids Res Date: 2009-11-17 Impact factor: 16.971

4. A primer on metagenomics. (Review)

Authors: John C Wooley; Adam Godzik; Iddo Friedberg
Journal: PLoS Comput Biol Date: 2010-02-26 Impact factor: 4.475

5. The taxonomic and functional diversity of microbes at a temperate coastal site: a 'multi-omic' study of seasonal and diel temporal variation.

Authors: Jack A Gilbert; Dawn Field; Paul Swift; Simon Thomas; Denise Cummings; Ben Temperton; Karen Weynberg; Susan Huse; Margaret Hughes; Ian Joint; Paul J Somerfield; Martin Mühling
Journal: PLoS One Date: 2010-11-29 Impact factor: 3.240

6. Community cyberinfrastructure for Advanced Microbial Ecology Research and Analysis: the CAMERA resource.

Authors: Shulei Sun; Jing Chen; Weizhong Li; Ilkay Altintas; Abel Lin; Steve Peltier; Karen Stocks; Eric E Allen; Mark Ellisman; Jeffrey Grethe; John Wooley
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

7. The Genomic Standards Consortium.

Authors: Dawn Field; Linda Amaral-Zettler; Guy Cochrane; James R Cole; Peter Dawyndt; George M Garrity; Jack Gilbert; Frank Oliver Glöckner; Lynette Hirschman; Ilene Karsch-Mizrachi; Hans-Peter Klenk; Rob Knight; Renzo Kottmann; Nikos Kyrpides; Folker Meyer; Inigo San Gil; Susanna-Assunta Sansone; Lynn M Schriml; Peter Sterk; Tatiana Tatusova; David W Ussery; Owen White; John Wooley
Journal: PLoS Biol Date: 2011-06-21 Impact factor: 8.029

8. CoMet--a web server for comparative functional profiling of metagenomes.

9. WebMGA: a customizable web server for fast metagenomic sequence analysis.

10. The Gene Ontology in 2010: extensions and refinements.
