
Functional assignment of metagenomic data: challenges and applications.

Abstract

Metagenomic sequencing provides a unique opportunity to explore Earth's limitless environments harboring scores of yet unknown and mostly unculturable microbes and other organisms. Functional analysis of the metagenomic data plays a central role in projects aiming to explore the most essential questions in microbiology, namely 'In a given environment, among the microbes present, what are they doing, and how are they doing it?' Toward this goal, several large-scale metagenomic projects have recently been conducted or are currently underway. Functional analysis of metagenomic data mainly suffers from the vast amount of data generated in these projects. The sheer amount of data requires much computational time and storage space. These problems are compounded by other factors potentially affecting the functional analysis, including sample preparation, sequencing method and average genome size of the metagenomic samples. In addition, the read-lengths generated during sequencing influence sequence assembly, gene prediction and subsequently the functional analysis. The level of confidence for functional predictions increases with increasing read-length. Usually, the most reliable functional annotations for metagenomic sequences are achieved using homology-based approaches against publicly available reference sequence databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence data, bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. Finally, we provide some examples of applications from recent metagenomic studies which have been successfully conducted in spite of the known difficulties.

Year: 2012 PMID： 22772835 PMCID： PMC3504928 DOI： 10.1093/bib/bbs033

Source DB: PubMed Journal: Brief Bioinform ISSN： 1467-5463 Impact factor: 11.622

INTRODUCTION

The microbial world shows vast diversity, and microbes inhabit almost every niche on the planet. Many of them have been shown to be important members of their given ecosystems and to play crucial roles in various environmental and host-associated biological processes. However, due to their general unculturability (it is believed that only a small percentage of bacteria in nature can be cultured [1]), up until just a few years ago it was practically impossible to sequence and analyze them in greater detail. As a result, a large fraction of microbes still remain poorly characterized and unstudied; and the means by which they exert beneficial or other effects in different environments remain largely unknown. The recent culture independent technology to study microbes inhabiting different environments, termed metagenomics [2], has opened new avenues for answering questions commonly asked in microbiology, such as ‘Which species inhabit a given environment?’ and ‘What are these microbes doing and how are they doing it?’ The basic steps involved in a typical metagenomic project to estimate the number of species and the functional repertoire of an environment include DNA or RNA sequencing using next-generation sequencers (such as Illumina and Roche 454), sequence assembly, gene prediction, functional and metabolic analysis, taxonomic binning and comparative analysis of the sequence data using specialized bioinformatics methods and tools (Figure 1, Tables 1 and 2). However, each stage of the analysis suffers heavily due to inherent problems of the metagenomic data generated, including incomplete coverage, massive volumes of raw sequence data produced by the next-generation sequencers, generally short read-lengths, species abundance and diversity and so on [3, 4].

Figure 1:

Flow chart for the analysis of a metagenome from sequencing to functional annotation. Only the basic flow of data is shown up to the gene prediction step. For the context-based annotation approach, only the gene neighborhood method has been implemented thus far on metagenomic data sets, although in principle other approaches which have been used for whole genome analysis can also be implemented and tested. *: A list of tools commonly used for these processes is provided in Table 1. Table 3 provides a list of some of the additional functional analyses that can be performed on the metagenomic sequences.

Table 1:

List of commonly used tools for sequence assembly, protein coding gene prediction, RNA gene prediction and phylogenetic classification steps of metagenomic data analysis

Process	Tools	URL/ References
Sequence assembly	Phrap	http://www.phrap.org/
	Forge	http://combiol.org/forge/
	Arachne	[5]
	JAZZ	[6]
	Celera	[7]
	Velvet	[8]
	Newbler	454 Life Sciences
	SOAPdenovo	[9]
	EULER	[10]
	ORFome assembly	[11]
	IDBA-UD	[12]
Gene prediction	Metagene	[13]
	GeneMark	[14]
	ORF-Finder	http://www.ncbi.nlm.nih.gov/projects/gorf/
	FragGeneScan	[15]
	fgenesB	http://www.softberry.com
	GLIMMER	[16]
	BLAST	[17]
RNA gene prediction	tRNAscan-SE	[18]
	Similarity-based searches for rRNA in reference databases	–
Taxonomic binning	MetaBin	[19]
	MEGAN	[20]
	WebCARMA	[21]
	PhyloPythia	[22]
	TETRA	[23]
	NBC	[24]
	TACOA	[25]

Table 2:

Current list of commonly used publicly available pipelines for the functional annotation of metagenomic data sets

Pipeline/tools	IMG/M	METAREP	CAMERA	RAMMCAP	MG-RAST	Smash community	MEGAN4	CoMet	WebMGA
Functional analysis
Homology-based
Known sequence	NCBI (NR), SMART, UniProt	NCBI (NR), UniProt	NCBI (NR)	–	NCBI (NR), SMART, UniProt	SMART, UniProt	NCBI (NR)	–	NCBI (NR)
Metagenomic data sets	IMG/M	–	–	–	IMG/M	–	–	–	–
Orthologous groups	COGs	–	COGs	COGs	COGs, eggNOGs	COGs, eggNOGs	–	–	COGs
Protein families	Pfam, TIGRfam	Pfam, TIGRfam	Pfam, TIGRfam	Pfam, TIGRfam	FIGfams	Pfam	–	Pfam	Pfam, TIGRfam
Ontology	GO	GO	GO	GO	GO	–	–	GO	GO
Enzymes, pathways and subsystems	KEGG, SEED	PRIAM	KEGG, SEED	–	KEGG, SEED	KEGG	KEGG, SEED	–	KEGG
Protein interactions	–	–	–	–	STRING	STRING	–	–	–
Motif- and pattern-based
Database	InterPro	–	–	–	–	–	–	–	–
Context-based
Approach	Gene neighborhood	–	–	–	–	Gene Neighborhood	–	–	–
Other functional analysis
Types of predictions	CRISPRs, enzymes, transporter classes	Enzymes, transmembrane helices, lipoprotein motifs	–	–	–	Protein networks	–	–	–
URL	http://img.jgi.doe.gov/m/doc/uiMap.html	http://www.jcvi.org/metarep/	http://camera.calit2.net/	–	http://metagenomics.nmpdr.org/	http://www.bork.embl.de/software/smash/	http://ab.inf.uni-tuebingen.de/software/megan/	http://comet.gobics.de/	http://weizhong-lab.ucsd.edu/metagenomic-analysis/
References	[26]	[27]	[28]	[29]	[30]	[31]	[32]	[33]	[34]

Table 3:

List of commonly used available resources for functional analysis (other than homology-, motif- and context-based) that can be performed on metagenomic data sets

Type of prediction	Resource name	URL
Carbohydrate-active enzymes	CAZy	http://www.cazy.org/
Glycosyl hydrolases	GAS	http://csbl.bmb.uga.edu/~ffzhou/GASdb/
Protein localization	PSORT	http://psort.hgc.jp/
	Cell-PLoc	http://www.csbio.sjtu.edu.cn/bioinf/Cell-PLoc/
	CELLO	http://cello.life.nctu.edu.tw/
	PA-SUB	http://webdocs.cs.ualberta.ca/~bioinfo/PA/Sub/index.html
Membrane proteins	DAS	http://www.sbc.su.se/~miklos/DAS/
	HMMTOP	http://www.enzim.hu/hmmtop/html/submit.html
	HMM-TM	http://bioinformatics.biol.uoa.gr/HMM-TM/index.jsp
	TMB-Comp	http://bmbpcu36.leeds.ac.uk/~andy/betaBarrel/TMB_Hunt_2/TMB_Comp.cgi
Lipoproteins	DOLOP	http://www.mrc-lmb.cam.ac.uk/genomes/dolop/dolop.htm
	LIPO	http://services.cbu.uib.no/tools/lipo
	SignalP	http://www.cbs.dtu.dk/services/SignalP/
	LipoP	http://www.cbs.dtu.dk/services/LipoP/
	PRED-LIPO	http://bioinformatics.biol.uoa.gr/PRED-LIPO/input.jsp
Secretory proteins (signal peptide Type I)	Tatfind	http://signalfind.org/tatfind.html
	TatP	http://www.cbs.dtu.dk/services/TatP/
	SignalP	http://www.cbs.dtu.dk/services/SignalP/
	PrediSi	http://www.predisi.de/index.html
Adhesins	SPAAN	Sachdeva et al. 2004 [75]
Transporters	TransportTP	http://bioinfo3.noble.org/transporter/
	TransAAP	http://www.membranetransport.org/transaap/TransAAP_login.html
	TCDB	http://www.tcdb.org/
Insertion sequences	ISsaga	http://issaga.biotoul.fr/ISsaga/issaga_index.php
CRISPRs	PILER	http://www.drive5.com/pilercr/
	CRISPRfinder	http://crispr.u-psud.fr/Server/
Repeats	Tandem Repeats Finder	http://tandem.bu.edu/trf/trf.html
	EMBOSS	http://emboss.sourceforge.net/
Virulence factors	VFDB	http://www.mgc.ac.cn/VFs/
	MvirDB	http://predictioncenter.llnl.gov/

These problems also adversely affect the downstream functional analysis process. For example, because of their shorter read-length, the overall functional composition recovered from pyrosequencing- or Illumina-derived reads is poorer than that recovered from longer Sanger reads [35]. Additionally, for very complex communities, partial or poor assemblies are obtained due to incomplete coverage, resulting in many short contigs and unassembled sequences. This leads to the prediction of a large number of small, fragmented genes which may not exhibit any matches in the reference sequence databases, or match with very low significance [36]. Although sequence assembly and gene prediction tools specifically developed for metagenomic data sets offer some advantages over similar tools developed for more complete genome sequences, surprisingly, no such ‘metagenome specific’ tools have yet been developed for functional analysis. Thus, appropriate tools and parameters from the current repertoire must be used to achieve comprehensive and biologically meaningful functional analysis of metagenomic data sets. The steps for sequence assembly and gene prediction of metagenomic data sets are compared in several recent comprehensive reviews [3, 4, 37, 38]. The scope of this review is to comprehensively discuss the prime objectives, methods and problems for functional and metabolic analysis of metagenomic sequence data, and to propose some solutions for the latter. Toward this, we first try to familiarize the reader with the aims of functional metagenomic analysis and the most commonly adopted publicly available tools and resources to achieve them.
Next, we discuss how the problems arising from metagenomic sequencing affect this process, and we suggest various strategies for addressing some of these issues under the present scenario. Lastly, we demonstrate that, despite these issues, metagenomic functional analysis can still be reliably used to address globally important environmental and biological questions.

OBJECTIVES OF FUNCTIONAL METAGENOMIC ANALYSIS STUDIES

Interestingly, the same microbial communities sampled at different times or from different hosts can vary significantly. For example, the gut microbiomes of 13 healthy Japanese individuals were quite different, yet they still shared many microbes [39]. Also, the community members for any given environment commonly play different roles. For example, in the human gut microbiome, segmented filamentous bacteria are known to play important roles in maintaining intestinal immunity [40, 41], whereas bifidobacteria are known to utilize complex carbohydrates and thereby exert beneficial effects on human health [42]. Thus, there are mainly two broad objectives of the functional analysis for metagenomic studies: the first is to determine what are the functional and metabolic repertoires of the different community members that enable them to exert different effects, and the second is to identify the variations, if any, within the functional compositions of the different communities, e.g. those found between healthy and diseased individuals that may be related to the cause of the disease. To determine the functional content of the member species of a microbiome, the coding and functional capacity for all (or at least the dominant) members should be comprehensively analyzed. Alternatively, if the goal of the study is to analyze and contrast the functional and metabolic capacities of different communities, then the functional and metabolic pathway profiles for the communities need to be generated and compared.

PUBLICLY AVAILABLE RESOURCES AND TOOLS FOR FUNCTIONAL ANNOTATION OF METAGENOMIC DATA

Dedicated tools for functional annotation and analysis of metagenomic data sets lag far behind the rate at which the data is being generated. Recently, some web-based, as well as local-use based, pipelines have been developed for the analysis of metagenomic data sets. Table 2 provides a list of a few well-known representative pipelines and compares the functional analysis capacity of each. Almost all of these pipelines provide integrated platforms for the functional prediction of metagenomic sequences using multiple tools and databases, which are also commonly used for the analysis of whole genome sequences. Most of the pipelines offer sufficient resources for the functional analysis of user data. However, to account for the inherent problems associated with the metagenomic data sets, it is highly recommended to evaluate the computational workflow and parameters for any given project. This can be achieved by using simulated sequencing reads generated by MetaSim [43], to assess and compare different tools before actually using them on full data sets. The analysis time of any pipeline typically depends on the size of the data sets and, in the case of web-based servers, the load of requests that are already in progress submitted by other users. Web-based servers such as CAMERA [28], MG-RAST [30] and IMG/M [26] host pre-computed results for most published metagenomes that enable users to perform comparative analysis with their own data sets. In most cases, the computed data can be visualized in the form of simple plots. However, KEGG [44] pathway maps and abundance profiles can also be obtained using the IMG/M and MG-RAST servers.
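The evaluation step recommended above can be sketched as follows. This is a minimal illustration and not part of any pipeline listed in Table 2: it assumes that each simulated read carries a known functional label (for example, the COG identifier of the gene it was drawn from) and that each candidate tool returns at most one predicted label per read. The function name, read identifiers and labels are hypothetical.

```python
def score_annotations(truth, predicted):
    """Compare predicted functional labels with the known labels of simulated reads.

    truth:     dict mapping read ID -> true label (known from the simulation)
    predicted: dict mapping read ID -> predicted label; unannotated reads are absent
    Returns (sensitivity, precision).
    """
    correct = sum(1 for read, label in predicted.items() if truth.get(read) == label)
    sensitivity = correct / len(truth) if truth else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Toy example with hypothetical read IDs and COG labels.
truth = {"r1": "COG0001", "r2": "COG0002", "r3": "COG0003", "r4": "COG0004"}
tool_a = {"r1": "COG0001", "r2": "COG0002", "r3": "COG0009"}  # annotates more, one error
tool_b = {"r1": "COG0001"}                                    # annotates little, no error

for name, calls in (("tool_a", tool_a), ("tool_b", tool_b)):
    sens, prec = score_annotations(truth, calls)
    print(f"{name}: sensitivity={sens:.2f} precision={prec:.2f}")
```

Reporting both values matters because a tool or parameter set that annotates more reads may do so at the cost of more incorrect assignments.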

STRATEGIES COMMONLY ADOPTED BY THE PIPELINES FOR THE FUNCTIONAL ANALYSIS OF METAGENOMIC DATA

Protein function is a very broad term, as function can be predicted at several different levels. For example, the Gene Ontology database [45] adopts three broad domains for classifying gene products, viz. the cellular location of the protein, the overall biological process it takes part in and the molecular function of the protein. On the other hand, the subsystem-based classification approach adopted by the SEED database [46] relies mainly on the grouping of functional roles into subsystems by curation experts. The defined subsystems may be thought of as a generalization of the term ‘pathway’. Similarly, the KEGG database [44] is a resource of pathway maps built from both genomic and chemical information of the biological systems. However, such specific functional assignment may be lacking for completely novel proteins or for those which share very weak homology with known proteins, both of which are abundant in metagenomic data sets. For such proteins, even minimal information that can be extracted related to their function can be useful, and may be the only available clues to their function. As shown in Figure 1 and Table 2, the basic tools that are implemented in almost all of the available pipelines for functional analysis of metagenomic data are the same as those which are commonly used for whole genome studies and are well known. However, their performance in the metagenomic context has yet to be evaluated and reviewed. Thus, in the current review, we have divided these tools into four categories based on their inherent approach. In the following sections, we review each approach in the context of its application to metagenomic data analysis, keeping in mind the associated problems of the data itself.

Homology-based approach

As shown in Table 2, the ‘simplest’ and most common approach adopted by all of the available pipelines for functional prediction is comparison of the predicted query proteins to existing resources of reference protein sequences, including NCBI NR [47], SMART [48] and UniProt/UniRef [49]. The IMG/M [26] and MG-RAST [30] servers also search the publicly available metagenomic data sets for homologs of the query sequences. The databases of clusters of orthologous groups (COGs) [50], non-supervised orthologous groups (NOGs) [51], protein families and domains including Pfam [52] and TIGRFAM [53], etc. are used by several pipelines to infer functional categories or to identify families and domains embedded in the query proteins. In some cases, similarities to genes found in the GO database are further explored to infer hierarchical annotations. Pathway and subsystem information for the query proteins is inferred by searching for homologs in the KEGG and SEED databases, respectively, by almost all of the pipelines. For these searches, different variants of BLAST [17] are the most preferred algorithms, including BLASTX, BLASTP, RPS-BLAST, etc. For less sensitive, but faster, searches BLAT [54] may also be used, as in the case of the MG-RAST server. Additionally, more sensitive profile- and pattern-based search methods are used by almost all of the pipelines, in which sequence profiles generated from alignments of protein families in the Pfam or TIGRFAM databases are searched using the hidden Markov model-based algorithm, HMMER [55]. For all these methods, best hits are identified based on statistical calculations and annotation information is directly applied to the query proteins. Homology-based approaches mainly suffer from the long computation time required to search for homologs for each of the sequences within the typically massive metagenomic data sets. Additionally, BLAST-based functional predictions have been estimated to include 13–15% database propagation errors [56].
Moreover, to detect a true match, the reference database being searched needs to contain at least one homolog of the query sequence. In addition, the fragmentary nature of shotgun-generated metagenomic data, which leads to partial proteins, negatively impacts homology-based function prediction. This is discussed in more detail below. The extent to which metagenomic functional annotation has been achieved using different databases is demonstrated in Figures 2 and 3. The highest fraction of metagenomic sequences was annotated using the NCBI RefSeq database, which is a comprehensive collection of non-redundant well-annotated protein sequences. On the other hand, only a small fraction of sequences could be annotated using the Swiss-Prot database, which harbors manually annotated and reviewed protein sequences. The number of proteins annotated using the COGs database was slightly lower than with RefSeq. Among the protein family and profile databases, more predictions were made using Pfam as compared to the TIGRFAM database. This could mainly be due to the greater number of protein families that are included in the Pfam database (13 672 in Pfam 26.0 release) than in the TIGRFAM database (4209 in TIGRFAM 12.0 release). The annotation using KEGG metabolic pathways is relatively low mainly due to the inherent problems of the metagenomic data sets, as discussed below. The SEED system of classification performs similarly to that of KEGG, although the number of predictions is slightly lower.
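The best-hit step of the homology-based approach can be illustrated with a short sketch. It assumes that the search results are available in the standard 12-column BLAST tabular layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score). The E-value cut-off of 1e-5, the function name and the toy records are illustrative assumptions, not settings taken from any of the pipelines discussed here.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query_id: (subject_id, evalue, bitscore)}, keeping the best hit per query.

    Hits with an E-value above max_evalue are discarded; among the remaining
    hits for a query, the one with the highest bit score is kept.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: read1 has two significant hits, read2 only a weak one.
example = "\n".join([
    "read1\tprotA\t85.0\t100\t15\t0\t1\t100\t1\t100\t1e-30\t120",
    "read1\tprotB\t60.0\t90\t36\t0\t1\t90\t5\t94\t1e-08\t55.1",
    "read2\tprotC\t35.0\t40\t26\t0\t1\t40\t10\t49\t2.5\t22.3",
])

# Hypothetical annotation table for the reference proteins.
reference_function = {"protA": "ABC transporter permease", "protB": "hypothetical protein"}

for query, (subject, evalue, bitscore) in best_hits(example).items():
    print(query, subject, evalue, reference_function.get(subject, "unknown"))
```

Because the annotation of the best hit is transferred directly to the query, any error in the reference annotation is propagated, which is the source of the database propagation errors mentioned above.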

Figure 2:

Distribution of metagenomic sequence matches in the SwissProt, RefSeq, KEGG and SEED databases at various E-value cut-offs. Smaller sequences match at lower confidence (higher E-values; lighter colors) or do not match at all in the databases. More sequences match with higher confidence (lower E-values; darker colors) as the sequence length used for the analysis increases. Pre-computed data for the metagenomes shown was derived from the MG-RAST server.

Figure 3:

Status of functional prediction of protein-coding genes from different metagenomic data sets and representatives of completely sequenced genomes. The overall functional prediction bars represent the fraction of protein-coding genes that map to at least one of the four databases: clusters of orthologous groups (COGs), Pfam, TIGRFAM and KEGG pathways. For comparative purposes, the functional annotation status for the well-studied model microbial genome, E. coli K12-W3110, the smallest microbial genome, M. genitalium, and the human genome are also shown. The data for this graph was derived from the IMG/M database. It should be noted that for uniform comparison, the prokaryotic COGs version was also used for Homo sapiens. The number of matches to eukaryotic COGs (KOG database [57]) may be higher for H. sapiens. The numbers next to the bars represent the total number of predicted protein-coding genes in each data set using the IMG/M annotation pipeline. For the Sludge [58] community, data from only the Phrap assembly, a widely used program for DNA sequence assembly, was used. Except for the Cow Rumen Viral community [59], which was sequenced using the 454 platform (average read-length > 300 bp), all other metagenomes were sequenced using the Sanger method (average read-length > 1000 bp). The following additional data sets were used: Ocean [60], Soil [61], Acid Mine Drainage [62], Human Gut [63].

Motif- or pattern-based approach

The partial proteins generated from short contigs and unassembled sequences which arise due to short read-lengths or complex environments generally exhibit very poor similarities using homology-based approaches (Figure 2). Additionally, some proteins, despite sharing a common function, are more diverse at the sequence level. The overall sequence similarity of such proteins is usually lower than the thresholds used for homology-based functional prediction; however, they still share one or more common sequence or structural patterns or motifs necessary to maintain their structure and function. Currently, databases like PROSITE [64] and PRINTS [65] present a reliable repository of such patterns or motifs against which the query metagenomic sequences may be searched either independently or through the integrated InterPro database [66]. Currently, only the IMG/M server incorporates the InterPro database. However, a general problem with motif-based annotation is that short sequence matches typically show low statistical significance and false-positive rates can be high [67]. Nevertheless, given the amount of novelty inherent in metagenomic data sets, it is recommended to run motif-based analysis in parallel with other functional prediction approaches.
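How a motif search differs from a whole-sequence comparison can be seen in a small sketch. It converts a pattern written in PROSITE-style syntax into a regular expression and scans a protein fragment for it. The converter is a simplification that handles only the common elements of the syntax (x, [..], {..}, repeat counts and the < and > anchors), and the pattern and the sequence are illustrative examples rather than entries taken from the databases.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression (simplified)."""
    regex = pattern.rstrip(".")                            # patterns may end with a period
    regex = regex.replace("-", "")                         # '-' only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")     # {ABC}: any residue except A, B, C
    regex = regex.replace("(", "{").replace(")", "}")      # x(2,4): repeat counts
    regex = regex.replace("x", ".")                        # x: any residue
    regex = regex.replace("<", "^").replace(">", "$")      # N- and C-terminal anchors
    return regex


def find_motifs(sequence, pattern):
    """Return (start, matched_text) for every non-overlapping match in the sequence."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(sequence)]


# Illustrative pattern and a made-up partial protein containing one match.
pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
fragment = "MKCAACAAALAAAAAAAAHAAAHGG"

print(prosite_to_regex(pattern))
print(find_motifs(fragment, pattern))
```

Because a match needs only a short conserved stretch, even a partial protein can be hit, but by the same token short patterns match by chance more often, which is the false-positive problem noted above.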

Context-based annotation

Metagenomic data sets contain a large number of novel sequences which share no homology with known sequences and thus remain unannotated by the previous two approaches. To overcome these limitations, gene context-based approaches may also be used. A few examples from single genome annotation projects include genomic neighborhood [68, 69], gene fusion [70, 71], phylogenetic profiling [72] and gene co-expression analysis [73]. Among these, only the genomic neighborhood approach has been implemented in the case of metagenomics. In 2007, Harrington et al. [74] applied a combination of homology-based searches and customized gene neighborhood methods to four metagenomic data sets derived from a variety of complex environments. Whereas BLAST-based methods alone annotated 70% of the sequences, their combined method inferred specific functions for 76% and non-specific functions for 83% of the sequences. However, due to the paucity of complete genomes in metagenomic data sets and the lack of knowledge about the true species origin of the sequences, this approach has its limitations. These problems may be ameliorated by increasing the sequencing depth and by improving the taxonomic assignment of the sequences. Additionally, better assemblies resulting in longer contigs will also improve the efficiency of context-based annotation methods. Currently, only IMG/M and SmashCommunity [31] can be used to view predicted genes in the genomic neighborhood context.

Other types of functional prediction

Lastly, the putative roles of the metagenomic sequences can also be inferred by running more specific analyses using dedicated tools that target prediction of carbohydrate active enzymes, glycosyl hydrolases, protein localizations, lipoproteins, adhesins, secretory proteins, transporters, CRISPRs (Clustered Regularly Interspaced Short Palindromic Repeats), insertion sequences, virulence factors, etc. A list of a few representative tools for such analysis is given in Table 3. It should be noted that the list is not comprehensive, and that a discussion about all the tools for the above-mentioned purpose is beyond the scope of this review.

GENE-CENTRIC ANALYSIS OF METAGENOMIC DATA SETS

To explore the effect of environment on the functional and metabolic contents of different communities, comparative functional analysis may be performed on the total gene-content of the communities, i.e. gene-centric analysis. For this purpose, functional profiles can be compared and contrasted across different metagenomic data sets to look for functional characteristics responsible for community differences. Normally two levels of comparison are performed, viz. comparison of abundance of functional families and pathways, and estimation of statistical parameters to ensure that the observed differences in abundance are not merely chance occurrences. Different types of abundance profiles may be generated and compared using, for example, COGs functional categories, Pfam functional families, KEGG metabolic pathways, or SEED subsystems. However, before comparing the metagenomes, proper normalizations of the data sets should be performed to account for the data-associated problems, such as partial genes and effective genome sizes (discussed later). Heat-maps are commonly used to visualize the differences in communities with respect to the above-mentioned functional or metabolic profiles (for example [60, 61, 76–78]). In addition, statistical methods, such as principal component analysis (PCA) and multidimensional scaling (MDS), may be used to reveal which factors most affect the observed data (for example [79, 80]). The common approaches and limitations of the gene-centric analysis are discussed and reviewed by Kunin et al. [3].
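The first level of comparison, abundance profiles, can be sketched in a few lines. The example below converts raw hit counts per functional category into relative abundances and then computes log2 ratios between two communities. It is a deliberately simplified illustration: it normalizes only by the total number of annotated hits and does not correct for partial genes or effective genome size, the counts are invented, and the function names and pseudocount are assumptions made for the example.

```python
import math


def relative_abundance(counts):
    """Convert raw counts per functional category into fractions of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def log2_ratios(profile_a, profile_b, pseudo=1e-6):
    """log2(a / b) per category; a small pseudocount avoids division by zero."""
    categories = sorted(set(profile_a) | set(profile_b))
    return {
        c: math.log2((profile_a.get(c, 0.0) + pseudo) / (profile_b.get(c, 0.0) + pseudo))
        for c in categories
    }


# Invented hit counts per COG functional category for two communities.
gut_counts = {"C": 50, "G": 150}     # C: energy production, G: carbohydrate metabolism
soil_counts = {"C": 100, "G": 100}

gut = relative_abundance(gut_counts)
soil = relative_abundance(soil_counts)

for category, ratio in log2_ratios(gut, soil).items():
    print(category, round(ratio, 2))
```

A table of such values across many categories and samples is what a heat-map displays; whether a given difference is larger than expected by chance still has to be assessed with an appropriate statistical test.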

PROBLEMS ASSOCIATED WITH FUNCTIONAL ANALYSIS OF METAGENOMIC DATA

The analysis and annotation of metagenomic data sets differ from those of whole genome studies mainly because the former is a complex mixture of sequences from multiple species. Even draft quality bacterial whole genome sequences represent most of the chromosomes, except for a few of the more complex regions that include repeats, insertion sequences, tRNAs, rRNAs, etc. When sequence coverage is sufficient, the assemblies obtained usually result in very long contigs with few gaps. The efficiency of gene prediction algorithms on such long contigs is quite high and most of the full-length coding DNA sequences (CDSs) can be predicted with high confidence. Functional prediction analysis can next be applied to obtain the functional repertoire of the genome. The functionally annotated CDSs can then be viewed in the context of metabolic pathways to predict the metabolic capabilities of the species under study. A metagenome can be viewed as a collection of several whole genomes. To fully understand an environment, in principle, draft quality whole genome sequences for every member should be achieved by complete DNA sequencing. However, in spite of the availability of high throughput second-generation sequencers, this is still a very expensive and daunting task. What can be best captured from a metagenomic sample is a mixture of fragmented sequences from the community members, and mostly from dominant members of the environment. When the sequencing depth is sufficient, and by the use of sequence assemblers developed specifically for metagenomic data (Table 1), draft quality assemblies for some of the member species may be achieved; e.g. a draft methanogen genome was recently assembled from a permafrost microbial community [78]. However, this still did not suffice for completely understanding the environment, as the assemblies for many other members remained poor due to the inherent complexity of the environments and lower sequencing coverage for these genomes.
Thus, for most metagenomic studies, we are left with only enormous volumes of fragmented sequences (comprised of a mixture of short contigs and singletons) from multiple species to perform analysis on. In the case of contigs, gene predictions will be more accurate, whereas the predicted genes from singletons will almost always be partial in spite of using gene prediction tools specifically developed for metagenomic data (Table 1), unless very long read-lengths were obtained during sequencing. This is mainly because the typical average read-lengths generated by next-generation sequencers providing deeper coverage, including Illumina, are still shorter (up to 300 bp for paired-end reads) than the average size of the typical prokaryotic protein coding gene (∼1000 bp [81]). The 454 pyrosequencing platform can be an alternative technology due to the longer average read-lengths it can generate (up to 700 bp for 454 GS FLX+ pyrosequencer, http://454.com/downloads/GSFLXApplicationFlyer_FINALv2.pdf), but it is not the preferred choice mainly due to its lower coverage and higher cost as compared to Illumina sequencing. To obtain the most complete information of the functional repertoire for any metagenome it is recommended to use the genes predicted from both the contigs and the singletons, even though many of the predicted CDSs are partial. In general, short query lengths negatively impact homology-based functional prediction as they may decrease the significance of pairwise similarities due to added noise. This is clearly evident from Figure 2, which shows that there are no matches for sequences of length ∼100 bp for the ‘Cow Rumen’ metagenome [79] in the lower and more significant E-value bins (E-value < 1e-10). On the other hand, as sequence length increases, the E-value bins with lower values become more populated, as in the case of the ‘Human Gut Japanese’ [39] data set. Additionally, for short sequence lengths, homology-based approaches have limited sensitivity.
For example, only ∼25% of the ‘Cow Rumen’ sequences could be annotated using GenBank, whereas >75% of the ‘Human Gut Japanese’ sequences could be annotated using the same database with the same parameters (Figure 2). These problems may be ameliorated to some extent by increasing sequencing depth or read-length so that better assemblies and gene predictions can be obtained. Another problem in metagenomic functional analysis stems from the lack of knowledge of the species of origin of the sequences. Although phylogenetic classification and binning methods specific to metagenomic sequences may be able to classify 40–93% of the reads [19] at the genus level, depending on the novelty of the data set, at the species level this percentage is expected to decrease. This indicates that at least 7–60% of the sequences still remain unclassified due to the limitations of the available tools and the paucity of reference genomes in the public databases. Thus, in spite of gaining some functional information, due to the absence of specific species information, it is extremely difficult to place many functionally annotated metagenomic sequences in the context of their actual metabolic pathways. Additionally, because most of the metagenomic sequences will be derived from the dominant species, the complete functional and metabolic repertoire of the less abundant members cannot be obtained. Other techniques complementary to metagenomics, such as single cell genomics [82], may help in overcoming this problem by providing access to the genomic DNA from unculturable microbes. However, even single cell genomics has many challenges remaining [82]. Nevertheless, if the objective of the metagenomic study is only to analyze the overall metabolic capacity of the entire community, then placing the sequences in the context of their individual genomes of origin may not pose a serious problem.
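The dependence of hit significance on query length illustrated by Figure 2 can be checked on any data set with a simple cross-tabulation. The following is a minimal sketch, assuming that the query length and best-hit E-value have already been extracted from the homology search output. The bin edges and the function names (`bin_index`, `tabulate_hits`) are hypothetical; the edges actually used for Figure 2 are not stated in the text.

```python
from collections import Counter

# Hypothetical bin edges (assumptions; not taken from Figure 2).
EVALUE_EDGES = (1e-50, 1e-10, 1e-5)
LENGTH_EDGES = (100, 300, 1000)  # query length in bp


def bin_index(value, edges):
    """Index of the first edge that value does not exceed; len(edges) if none."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def tabulate_hits(hits):
    """Count best hits per (length bin, E-value bin).

    hits: iterable of (query_length_bp, best_hit_evalue) pairs.
    """
    table = Counter()
    for length, evalue in hits:
        key = (bin_index(length, LENGTH_EDGES), bin_index(evalue, EVALUE_EDGES))
        table[key] += 1
    return table


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    example = [(95, 1e-3), (100, 1e-6), (650, 1e-60), (650, 1e-12)]
    for key, count in sorted(tabulate_hits(example).items()):
        print(key, count)
```

If short queries are indeed penalized, the counts for the lowest length bin should concentrate in the highest (least significant) E-value bins.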
Given that metagenomic studies are aimed at exploring complex environments harboring many yet uncultured and unknown microbes, the data sets are expected to possess a large number of novel sequences. As shown in Figure 3, the overall functional annotation achieved in the case of some example bacterial metagenomes is 50–75%, with the remaining sequences being unannotated. Even for ‘complete’ genomes, functional annotation is not complete. In the most studied model organism, Escherichia coli K12-W3110, and the smallest studied genome, Mycoplasma genitalium, both of which are considered ‘simpler’ systems, the overall functional annotation remains ∼90%. And, in a more complex system viz., the human genome, only ∼82% of the predicted proteins are currently annotated. For the even more complex human gut metagenome, this number decreases to ∼75%. Interestingly, while ocean and soil are also considered as ‘complex metagenomes’ on the scale of the human gut microbiome, only ∼50–55% of the sequences in these communities can be annotated. This difference in level of annotation could be due to a bias in the number of human-associated microbial genomes that have thus far been sequenced and are included in the reference sequence databases. To deal with the novelty of metagenomic data, reference genome sequencing efforts should be initiated for other environments as has been done under the Human Microbiome Project [83], which plans to sequence a large number of reference genomes from different body sites for the human microbiome. While the functional annotation of bacterial metagenomes is at a reasonable level and is gradually improving, the situation for viral metagenomes, or viromes, lags far behind. The extent of virome annotation for cow rumen [59] and human lung [80] drops to as low as 13–15% (Figure 4) in comparison to bacterial annotation (cow rumen: 32%) for similar environments. The average metagenomic read-length used for the human lung virome was only 84 bp. 
One might argue that this reduction in the percentage of functional annotation may be due to the short read-length, which is known to affect the extent and confidence level of the functional prediction process, as discussed earlier. But, surprisingly, the percentage of functional annotation for the cow rumen virome is also low (15%), despite using a longer read-length (>300 bp). Thus, this reduction in the extent of functional prediction for viromes could be mainly due to the limited number of completely sequenced viral species in the reference databases.

Figure 4:

Status of functional prediction for viral metagenomes. The bars for the Cow Rumen viral metagenome data set represent the percentage of genes predicted from assembled contigs, while those for the Human Lung viral metagenome data set [80] represent the percentage of raw reads.

The genome sizes of the individual microbial members of a community can vary greatly. It is known that larger genomes harbor a smaller relative fraction of universal and housekeeping genes, and thus contain a large number of novel genes [84, 85]. Indeed, a weakly significant positive correlation was found between the effective genome size and the potential for carrying novel genes [86]. Therefore, the average genome size in an environmental sample could also affect the comparative functional analysis of the metagenome. Recently, Beszteri et al. [87] demonstrated how, among metagenomic samples, the differences in relative gene abundance, which are often used to interpret habitat-specific adaptations, are biased by the average genome size of the communities sampled. Thus, before arriving at biological conclusions from functional analysis of metagenomic data sets, the latter should be normalized to account for their different average genome sizes.

Apart from the aforementioned problems, the analysis of metagenomic data sets can also be influenced by the sequencing technology used. For example, 454 pyrosequencing technology produces 11–35% artificial replicates, both identical reads (duplicates) and reads that begin at the same position but vary in length or contain sequencing discrepancies, which lead to biased functional annotations [88]. Replicates were also observed in an Illumina sequenced permafrost microbial community analysis [78]. Thus, the metagenomic reads should be de-replicated before in-depth functional analysis is performed.
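The de-replication step recommended above can be illustrated with a deliberately simple heuristic. The sketch below treats reads that share the same leading bases as artificial replicates and keeps the longest read of each group. This is an assumption made for illustration, not the procedure of [88] or of any published tool; the function name `dereplicate` and the default `prefix_len` are hypothetical, and discrepancies that fall inside the prefix are not detected.

```python
def dereplicate(reads, prefix_len=20):
    """Keep one representative (the longest read) per shared prefix.

    reads: iterable of (read_id, sequence) pairs.
    Reads shorter than prefix_len are grouped by their full sequence.
    On ties in length, the read seen first is kept.
    """
    best = {}
    for read_id, seq in reads:
        key = seq[:prefix_len].upper()
        kept = best.get(key)
        if kept is None or len(seq) > len(kept[1]):
            best[key] = (read_id, seq)
    return list(best.values())


if __name__ == "__main__":
    # Made-up reads: r1 and r2 are exact duplicates, r3 starts at the same
    # position but is longer, r4 is unrelated.
    example = [
        ("r1", "ACGTACGTAA"),
        ("r2", "ACGTACGTAA"),
        ("r3", "ACGTACGTAAGG"),
        ("r4", "TTTTACGTAA"),
    ]
    for read_id, seq in dereplicate(example, prefix_len=8):
        print(read_id, seq)
```

Keeping the longest member of each group retains the most sequence for gene prediction. A production workflow would normally also tolerate mismatches between replicates, which this prefix-only sketch does not.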
Both 454 pyrosequencing and the more recent Ion Torrent sequencing technologies are known to introduce frameshift errors in the reads, mostly due to homopolymer runs. Almost none of the available bioinformatics tools for functional annotation of metagenomic sequences are capable of handling such errors, although several specialized tools for frameshift detection are currently available in the public domain [89-93] and should be used for more in-depth functional analysis. In some cases, the protocols used for sample preparation, particularly the use of filters or other sample selection methods, can also lead to inappropriate biological interpretations. For example, in the first Sargasso Sea data set [94], some nitrogen-fixing genes were found to be lacking [95]. However, the lack of these genes was later attributed to the absence of their main contributors, cyanobacteria, which were likely removed during the filtration step [96].

APPLICATIONS OF METAGENOMIC FUNCTIONAL ANALYSIS

Despite the challenges for metagenomic functional analysis, many studies exploring different environments are being conducted with varying degrees of success. The application of metagenomic functional analysis is an extremely important and versatile subject, and, given the scope of the current review, it is impossible to discuss it comprehensively here. Therefore, to exemplify the successful use of metagenomic functional analysis to address some biologically and environmentally important issues, a few recent example studies are presented in the following sections. For a discussion of other studies of major interest, we recommend the comprehensive review by Wooley et al. [4].

Comparative metagenomic-based studies

Recently, in a large-scale metagenomic analysis of 124 European individuals, a catalogue of over 3.3 million human gut microbial genes was created [97]. This led to the identification of bacterial functions that are necessary for a bacterium to thrive in the gut context, and of those functions involved in homeostasis of the entire ecosystem. This catalogue not only provides a good resource for annotating new human gut-related metagenomes and for comparative analysis, it also enables future studies to discover associations between the microbial genes and human phenotypes. In another study, the gut metagenomes of four healthy individuals were compared to those of individuals with autoimmune disorders, including type I diabetes [98]. This analysis suggested that increased adhesion and flagella synthesis in diseased individuals may be involved in triggering the autoimmune response associated with type I diabetes. Recently, a comparison between the human gut environment and the oral cavity was made by comparing the two metagenomes, and clear distinctions in the functional capacities of the two niches were observed [99]. In the same study, another comparison between oral metagenomes from supragingival dental plaque and cavities of healthy and diseased individuals, respectively, suggested that the dental plaque of healthy individuals (those who have never suffered from caries) may be a genetic reservoir for novel anticaries compounds and probiotics, which are live microorganisms thought to be beneficial to the host organism.

Metagenomics studies to date have not only aimed at exploring human health-related issues, but have also attempted to address various environmental issues. Global warming resulting from the emission of greenhouse gases is a major concern worldwide. Rising global temperatures cause permafrost, a vast reservoir of natural carbon, to thaw, resulting in microbial degradation of organic matter and emission of more greenhouse gases. Comparative metagenomics of permafrost was recently applied to both the frozen and thawed states to analyze the shifts in microbial and functional composition [78]. Multiple genes involved in carbon and nitrogen cycling were found to shift rapidly during thaw. From this study, important insights about the microbial species and functional components involved in greenhouse gas emissions may be obtained.

Metagenomic data-mining-based studies

The natural diversity and abundance of metagenomic data is enormous. Over 300 independent metagenomic projects have already been completed or are underway. These facts provide a great opportunity for in-depth mining of metagenomic data and exploration of novel gene candidates useful under a variety of different scenarios. For example, the metagenomic data sets from 10 diverse sources were used to identify several novel candidates for commercially useful enzymes (CUEs) [100]. A catalogue of 510 CUEs was prepared using literature search followed by manual curation, and then the catalogue was used to find homologues in the metagenomic data sets. High-throughput functional metagenomic screening may be used to look for the presence of CUEs and other specific enzymes of interest in the metagenomes [101]. In another study, the recruitment of genomes from pathogens against the metagenomes of healthy individuals containing commensal strains of the same species was used to identify the genomic regions of individual bacterial isolates missing in the metagenomes [102]. These regions are referred to as metagenomic islands and are found to harbor several virulence-related genes specific to the pathogenic strain.

CONCLUSIONS

Metagenomic sequencing provides a unique opportunity to explore yet unknown environments in great detail. Functional analysis of the metagenomic data plays a central role in such studies by providing important clues about functional and metabolic diversity, as well as variation. While metagenomic studies continue to suffer from certain caveats that make the downstream data analysis a challenging task for bioinformaticians, the gradual improvement in metagenomic technologies and development of tools and resources that account for the known problems will relieve some of the burdens. For example, the use of next-generation sequencers producing longer read-lengths (>300 bp) will usually lead to better sequence coverage. This can then be followed by the use of sequence assembly and gene prediction tools and parameters specifically developed for metagenomic sequences which will further help in improving assembly and gene prediction efficiency, respectively, and will result in a greater number of complete predicted proteins. Better functional assignments for metagenomic data sets can be obtained by using more complete proteins. However, while comparing the abundance profiles of functions between communities, the frequencies of the functions should not be masked by the assembly, and the read depths of the contigs should be accounted for. Another common problem that is usually encountered in metagenomic data functional analysis is the long computational time that is required for BLAST-based homology searches for orthologs. The use of alternative search algorithms, such as BLAT, can provide analysis results in shorter times; however, the loss of sensitivity by BLAT-based searches should be taken into account when analyzing the results. Alternatively, profile-based search methods using the HMMER algorithm may also be used whenever pre-computed sequence profiles are available. 
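The point above about accounting for contig read depth when comparing functional abundance profiles can be made concrete with a short calculation. The following is a minimal sketch, assuming that each predicted gene already carries a functional assignment and that the mean read depth of every contig is known. The function name `functional_profile`, the input layout and the default depth of 1 for singleton reads are hypothetical choices for illustration.

```python
from collections import defaultdict


def functional_profile(genes, contig_depth, singleton_depth=1.0):
    """Return relative abundance per function, weighting each gene by read depth.

    genes: iterable of (function_id, contig_id) pairs; contig_id is None for
           genes predicted on unassembled reads (singletons).
    contig_depth: dict mapping contig_id to the mean read depth of that contig.
    """
    totals = defaultdict(float)
    for function_id, contig_id in genes:
        depth = singleton_depth if contig_id is None else contig_depth[contig_id]
        totals[function_id] += depth
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {function_id: value / grand_total for function_id, value in totals.items()}


if __name__ == "__main__":
    # Made-up example: one gene on a contig covered 8-fold, and two singleton
    # genes with another function ("F1" and "F2" are placeholder identifiers).
    genes = [("F1", "contig_1"), ("F2", None), ("F2", None)]
    print(functional_profile(genes, {"contig_1": 8.0}))
```

Counting genes without weighting would rank F2 above F1 in this example, whereas the depth-weighted profile reflects that the assembled gene was sampled by more reads. Normalization for average genome size, discussed earlier, would be a separate step.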
Certain issues, including large volumes of metagenomic sequence data, large storage requirements for the analyzed data, and the typically large number of unknown sequences in the metagenomic data still pose serious challenges for its analysis. Therefore, there is great need for the development of new, faster, more sensitive tools and more thorough resources dedicated to the functional analysis of metagenomic data sets. Also, it is strongly advised that when analyzing the data, one must be aware of any additional factors that can influence the functional analysis, including sample preparation, sequencing method, diversity of the environments, etc. Proper calibrations, normalizations and statistical tests for significance should always be performed in order to arrive at the most reliable conclusions. DNA sequence-based metagenomic functional analysis is limited in that it only provides information about the functional content of an environment. Thus, it may be complemented by other independent approaches that help to gain further insights about the more dynamic aspects of a given community. For example, a few metatranscriptomic projects have been undertaken to address which genes are actually being expressed in different environments and to what extent [103, 104]. Given that proteins are much more stable than mRNAs [105], a proteome-based analysis is expected to provide a more accurate view of the functionality of a given environment. Toward this, a few metaproteomic studies have been conducted to explore which protein products are formed and how they are involved in the cross-talk within the environment under different conditions [106-109]. The metabolome, which represents the complete set of small molecules in an organism, can influence gene expression and protein function. Therefore, metabolomics also plays a key role in understanding cellular systems and decoding the functions of genes [110, 111].
A few metabolomic analyses have been conducted to determine which metabolites are produced by the metabolic pathways active in a given community and to study host-microbe interactions [112-117]. Another alternative to the DNA-based studies used for determining microbial community composition, metalipidomics, is being implemented mainly to identify the living microbial cells in an environment [118]. Intact polar lipids (IPLs), which are the basic building blocks of biomembranes, are ubiquitous in nature and have several characteristics that make them useful as proxies for living microbial cells. To date, metalipidomic studies have not been directly used for the functional analysis of environments. However, studies seeking to identify microbes of specific functional interest may be conducted, as has been done for ammonia-oxidizing microbes from marine and estuarine sediments [119]. The functional component of the environment may then be extensively analyzed using different approaches to gain more insights about the cross-talk taking place in that environment. Thus, the application of metalipidomics to study host-associated microbial composition and functional analysis, while not yet explored, appears promising.

Read-lengths generated during metagenomic sequencing influence assembly, gene prediction and eventually functional analysis. The enormous volume of sequence data, which leads to long computational times and massive storage requirements, also impedes metagenomic functional prediction. Factors that potentially influence functional analysis of metagenomic data, including sample preparation, sequencing method, average genome size, etc., should be considered prior to analysis. A higher fraction of metagenomic sequences are annotated using BLAST against data-rich reference sequence databases such as NCBI NR as compared to SwissProt, COGs, KEGG, etc. Integrated methods using more than one approach can improve the efficiency and reliability of functional predictions. DNA-sequence-based metagenomic functional analysis should be complemented with other types of approaches, such as metatranscriptomics, metaproteomics, metabolomics and metalipidomics, to gain better insights into the dynamics of a community.

FUNDING

This work was supported by the operational expenditure fund of RIKEN.

118 in total

1. Systematic artifacts in metagenomes from complex microbial communities.

Authors: Vicente Gomez-Alvarez; Tracy K Teal; Thomas M Schmidt
Journal: ISME J Date: 2009-07-09 Impact factor: 10.302

2. Core and intact polar glycerol dibiphytanyl glycerol tetraether lipids of ammonia-oxidizing archaea enriched from marine and estuarine sediments.

Authors: Angela Pitcher; Ellen C Hopmans; Annika C Mosier; Soo-Je Park; Sung-Keun Rhee; Christopher A Francis; Stefan Schouten; Jaap S Sinninghe Damsté
Journal: Appl Environ Microbiol Date: 2011-03-25 Impact factor: 4.792

3. Induction of intestinal Th17 cells by segmented filamentous bacteria.

Authors: Ivaylo I Ivanov; Koji Atarashi; Nicolas Manel; Eoin L Brodie; Tatsuichiro Shima; Ulas Karaoz; Dongguang Wei; Katherine C Goldfarb; Clark A Santee; Susan V Lynch; Takeshi Tanoue; Akemi Imaoka; Kikuji Itoh; Kiyoshi Takeda; Yoshinori Umesaki; Kenya Honda; Dan R Littman
Journal: Cell Date: 2009-10-30 Impact factor: 41.582

4. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads.

Authors: Wolfgang Gerlach; Sebastian Jünemann; Felix Tille; Alexander Goesmann; Jens Stoye
Journal: BMC Bioinformatics Date: 2009-12-18 Impact factor: 3.169

5. PROSITE, a protein domain database for functional characterization and annotation.

Authors: Christian J A Sigrist; Lorenzo Cerutti; Edouard de Castro; Petra S Langendijk-Genevaux; Virginie Bulliard; Amos Bairoch; Nicolas Hulo
Journal: Nucleic Acids Res Date: 2009-10-25 Impact factor: 16.971

6. The NIH Human Microbiome Project.

Authors: Jane Peterson; Susan Garges; Maria Giovanni; Pamela McInnes; Lu Wang; Jeffery A Schloss; Vivien Bonazzi; Jean E McEwen; Kris A Wetterstrand; Carolyn Deal; Carl C Baker; Valentina Di Francesco; T Kevin Howcroft; Robert W Karp; R Dwayne Lunsford; Christopher R Wellington; Tsegahiwot Belachew; Michael Wright; Christina Giblin; Hagit David; Melody Mills; Rachelle Salomon; Christopher Mullins; Beena Akolkar; Lisa Begg; Cindy Davis; Lindsey Grandison; Michael Humble; Jag Khalsa; A Roger Little; Hannah Peavy; Carol Pontzer; Matthew Portnoy; Michael H Sayre; Pamela Starke-Reed; Samir Zakhari; Jennifer Read; Bracie Watson; Mark Guyer
Journal: Genome Res Date: 2009-10-09 Impact factor: 9.043

7. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data.

Authors: Miguel Pignatelli; Andrés Moya
Journal: PLoS One Date: 2011-05-23 Impact factor: 3.240

8. Metagenomic analysis of respiratory tract DNA viral communities in cystic fibrosis and non-cystic fibrosis individuals.

Authors: Dana Willner; Mike Furlan; Matthew Haynes; Robert Schmieder; Florent E Angly; Joas Silva; Sassan Tammadoni; Bahador Nosrat; Douglas Conrad; Forest Rohwer
Journal: PLoS One Date: 2009-10-09 Impact factor: 3.240

9. MetaBioME: a database to explore commercially useful enzymes in metagenomic datasets.

Authors: Vineet K Sharma; Naveen Kumar; Tulika Prakash; Todd D Taylor
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

10. Analysis and comparison of very large metagenomes with fast clustering and functional annotation.

Authors: Weizhong Li
Journal: BMC Bioinformatics Date: 2009-10-28 Impact factor: 3.169
