Tulika Prakash, Todd D Taylor.
Abstract
Metagenomic sequencing provides a unique opportunity to explore Earth's limitless environments, which harbor scores of as yet unknown and mostly unculturable microbes and other organisms. Functional analysis of the metagenomic data plays a central role in projects aiming to explore the most essential questions in microbiology, namely 'In a given environment, among the microbes present, what are they doing, and how are they doing it?' Toward this goal, several large-scale metagenomic projects have recently been conducted or are currently underway. Functional analysis of metagenomic data mainly suffers from the vast amount of data generated in these projects. The sheer amount of data requires much computational time and storage space. These problems are compounded by other factors potentially affecting the functional analysis, including sample preparation, sequencing method and average genome size of the metagenomic samples. In addition, the read-lengths generated during sequencing influence sequence assembly, gene prediction and, subsequently, the functional analysis. The level of confidence for functional predictions increases with increasing read-length. Usually, the most reliable functional annotations for metagenomic sequences are achieved using homology-based approaches against publicly available reference sequence databases. Here, we present an overview of the current state of functional analysis of metagenomic sequence data, bottlenecks frequently encountered and possible solutions in light of currently available resources and tools. Finally, we provide some examples of applications from recent metagenomic studies which have been successfully conducted in spite of the known difficulties.
Year: 2012 PMID: 22772835 PMCID: PMC3504928 DOI: 10.1093/bib/bbs033
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Flow chart for the analysis of a metagenome from sequencing to functional annotation. Only the basic flow of data is shown up to the gene prediction step. For the context-based annotation approach, only the gene neighborhood method has been implemented thus far on metagenomic data sets, although in principle other approaches that have been used for whole-genome analysis can also be implemented and tested. *: A list of tools commonly used for these processes is provided in Table 1. Table 3 provides a list of some of the additional functional analyses that can be performed on the metagenomic sequences.
Table 1: List of commonly used tools for sequence assembly, protein-coding gene prediction, RNA gene prediction and phylogenetic classification steps of metagenomic data analysis
| Process | Tools | URL/ References |
|---|---|---|
| Sequence assembly | Phrap | |
| | Forge | |
| | Arachne | [ |
| | JAZZ | [ |
| | Celera | [ |
| | Velvet | [ |
| | Newbler | 454 Life Sciences |
| | SOAPdenovo | [ |
| | EULER | [ |
| | ORFome assembly | [ |
| | IDBA-UD | [ |
| Gene prediction | Metagene | [ |
| | GeneMark | [ |
| | ORF-Finder | |
| | FragGeneScan | [ |
| | fgenesB | |
| | GLIMMER | [ |
| | BLAST | [ |
| RNA gene prediction | tRNAscan-SE | [ |
| | Similarity-based searches for rRNA in reference databases | – |
| Taxonomic binning | MetaBin | [ |
| | MEGAN | [ |
| | WebCARMA | [ |
| | PhyloPythia | [ |
| | TETRA | [ |
| | NBC | [ |
| | TACOA | [ |
Table 2: Current list of commonly used publicly available pipelines for the functional annotation of metagenomic data sets
| Pipeline/tools | IMG/M | METAREP | CAMERA | RAMMCAP | MG-RAST | Smash community | MEGAN4 | CoMet | WebMGA |
|---|---|---|---|---|---|---|---|---|---|
| Functional analysis | |||||||||
| Homology-based | |||||||||
| Known sequence | NCBI (NR), SMART, UniProt | NCBI (NR), UniProt | NCBI (NR) | – | NCBI (NR), SMART, UniProt | SMART, UniProt | NCBI (NR) | – | NCBI (NR) |
| Metagenomic data sets | IMG/M | – | – | – | IMG/M | – | – | – | – |
| Orthologous groups | COGs | – | COGs | COGs | COGs, eggNOGs | COGs, eggNOGs | – | – | COGs |
| Protein families | Pfam, TIGRfam | Pfam, TIGRfam | Pfam, TIGRfam | Pfam, TIGRfam | FIGfams | Pfam | – | Pfam | Pfam, TIGRfam |
| Ontology | GO | GO | GO | GO | GO | – | – | GO | GO |
| Enzymes, pathways and subsystems | KEGG, SEED | PRIAM | KEGG, SEED | – | KEGG, SEED | KEGG | KEGG, SEED | – | KEGG |
| Protein interactions | – | – | – | – | STRING | STRING | – | – | – |
| Motif- and pattern-based | |||||||||
| Database | InterPro | – | – | – | – | – | – | – | – |
| Context-based | |||||||||
| Approach | Gene neighborhood | – | – | – | – | Gene neighborhood | – | – | – |
| Other functional analysis | |||||||||
| Types of predictions | CRISPRs, enzymes, transporter classes | Enzymes, transmembrane helices, lipoprotein motifs | – | – | – | Protein networks | – | – | – |
| URL | – | ||||||||
| References | [ | [ | [ | [ | [ | [ | [ | [ | [ |
Table 3: List of commonly used resources for functional analyses (other than homology-, motif- and context-based) that can be performed on metagenomic data sets
| Type of prediction | Resource name | URL |
|---|---|---|
| Carbohydrate-active enzymes | CAZy | |
| Glycosyl hydrolases | GAS | |
| Protein localization | PSORT | |
| | Cell-PLoc | |
| | CELLO | |
| | PA-SUB | |
| Membrane proteins | DAS | |
| | HMMTOP | |
| | HMM-TM | |
| | TMB-Comp | |
| Lipoproteins | DOLOP | |
| | LIPO | |
| | SignalP | |
| | LipoP | |
| | PRED-LIPO | |
| Secretory proteins (signal peptide Type I) | Tatfind | |
| | TatP | |
| | SignalP | |
| | PrediSi | |
| Adhesins | SPAAN | Sachdeva |
| Transporters | TransportTP | |
| | TransAAP | |
| | TCDB | |
| Insertion sequences | ISsaga | |
| CRISPRs | PILER | |
| | CRISPRfinder | |
| Repeats | Tandem Repeats Finder | |
| | EMBOSS | |
| Virulence factors | VFDB | |
| | MvirDB | |
Figure 2: Distribution of metagenomic sequence matches in the SwissProt, RefSeq, KEGG and SEED databases at various E-value cut-offs. Shorter sequences match at lower confidence (higher E-values; lighter colors) or do not match any database entry at all. As the sequence length used for the analysis increases, more sequences match with higher confidence (lower E-values; darker colors). Pre-computed data for the metagenomes shown were derived from the MG-RAST server.
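The binning behind a plot of this kind can be sketched in a few lines. The following Python example is only an illustration under stated assumptions: the E-value tiers, the 100 bp length bins and the function names are hypothetical and are not taken from the figure or from MG-RAST. It counts, for each length bin, how many sequences fall into each confidence tier or have no match.

```python
from collections import Counter

# Hypothetical confidence tiers, most stringent first (assumed for illustration;
# the actual cut-offs used in the figure are not given here).
EVALUE_TIERS = [
    (1e-30, "E<=1e-30"),
    (1e-20, "E<=1e-20"),
    (1e-10, "E<=1e-10"),
    (1e-5, "E<=1e-5"),
]


def evalue_tier(evalue):
    """Return the label of the most stringent tier that the E-value satisfies."""
    if evalue is None:  # the sequence had no database hit at all
        return "no match"
    for cutoff, label in EVALUE_TIERS:
        if evalue <= cutoff:
            return label
    return "no match"  # hit is weaker than the most permissive cut-off


def tier_distribution(hits, length_bin=100):
    """Count sequences per (length-bin start, tier label).

    hits: iterable of (sequence_length_bp, best_evalue_or_None) pairs.
    """
    counts = Counter()
    for length, evalue in hits:
        bin_start = (length // length_bin) * length_bin
        counts[(bin_start, evalue_tier(evalue))] += 1
    return counts


if __name__ == "__main__":
    example_hits = [(80, None), (95, 1e-6), (450, 1e-40), (420, 1e-12), (430, 0.5)]
    for key, n in sorted(tier_distribution(example_hits).items()):
        print(key, n)
```

Treating hits weaker than the most permissive cut-off as "no match" is a design choice of this sketch; a real analysis might keep them in a separate tier.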
Figure 3: Status of functional prediction of protein-coding genes from different metagenomic data sets and representatives of completely sequenced genomes. The overall functional prediction bars represent the fraction of protein-coding genes that map to at least one of four databases: Clusters of Orthologous Groups (COGs), Pfam, TIGRFAM and KEGG pathways. For comparative purposes, the functional annotation status for the well-studied model microbial genome, E. coli K-12 W3110, the smallest microbial genome, M. genitalium, and the human genome are also shown. The data for this graph were derived from the IMG/M database. It should be noted that, for uniform comparison, the prokaryotic COGs version was also used for Homo sapiens. The number of matches to eukaryotic COGs (KOG database [57]) may be higher for H. sapiens. The numbers next to the bars represent the total number of predicted protein-coding genes in each data set using the IMG/M annotation pipeline. For the Sludge [58] community, only data from the Phrap assembly (Phrap being a widely used program for DNA sequence assembly) were used. Except for the Cow Rumen Viral community [59], which was sequenced using the 454 platform (average read-length > 300 bp), all metagenomes were sequenced using the Sanger method (average read-length > 1000 bp). The following additional data sets were used: Ocean [60], Soil [61], Acid Mine Drainage [62], Human Gut [63].
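The "overall functional prediction" value described in this caption is the fraction of genes with a hit in at least one of the four databases. A minimal Python sketch of that calculation follows; the input layout (a mapping from gene identifier to the set of databases with a hit) and the function name are assumptions made for illustration and do not reflect the IMG/M data format.

```python
# Database labels as named in the caption; the input structure below is hypothetical.
DATABASES = ("COG", "Pfam", "TIGRFAM", "KEGG")


def annotation_fractions(gene_hits):
    """Return (overall_fraction, per_database_fractions).

    gene_hits: dict mapping gene ID -> set of database names in which
    that gene has at least one hit (an empty set means unannotated).
    """
    total = len(gene_hits)
    if total == 0:
        raise ValueError("no genes supplied")

    # Fraction of genes annotated by each individual database.
    per_db = {
        db: sum(db in hits for hits in gene_hits.values()) / total
        for db in DATABASES
    }
    # Fraction of genes annotated by at least one of the four databases.
    overall = sum(
        any(db in hits for db in DATABASES) for hits in gene_hits.values()
    ) / total
    return overall, per_db


if __name__ == "__main__":
    genes = {
        "g1": {"COG", "Pfam"},
        "g2": set(),
        "g3": {"KEGG"},
        "g4": {"Pfam"},
    }
    overall, per_db = annotation_fractions(genes)
    print(overall, per_db)
```

Because a gene can match several databases, the overall fraction is never larger than the sum of the per-database fractions and never smaller than the largest of them.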
Figure 4: Status of functional prediction for viral metagenomes. The bars for the Cow Rumen viral metagenome data set represent the percentage of genes predicted from assembled contigs, while those for the Human Lung viral metagenome data set [80] represent the percentage of raw reads.