Rob Knight, Janet Jansson, Dawn Field, Noah Fierer, Narayan Desai, Jed A Fuhrman, Phil Hugenholtz, Daniel van der Lelie, Folker Meyer, Rick Stevens, Mark J Bailey, Jeffrey I Gordon, George A Kowalchuk, Jack A Gilbert.
Abstract
Metagenomics holds enormous promise for discovering novel enzymes and organisms that are biomarkers or drivers of processes relevant to disease, industry and the environment. In the past two years, we have seen a paradigm shift in metagenomics to the application of cross-sectional and longitudinal studies enabled by advances in DNA sequencing and high-performance computing. These technologies now make it possible to broadly assess microbial diversity and function, allowing systematic investigation of the largely unexplored frontier of microbial life. To achieve this aim, the global scientific community must collaborate and agree upon common objectives and data standards to enable comparative research across the Earth's microbiome. Improvements in comparability of data will facilitate the study of biotechnologically relevant processes, such as bioprospecting for new glycoside hydrolases or identifying novel energy sources.
Year: 2012 PMID: 22678395 PMCID: PMC4902277 DOI: 10.1038/nbt.2235
Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908
Table 1 (A, B, C, D). Recent studies, number of samples, and reported results. Studies with more samples have a higher impact and clearer biological interpretations than studies with comparable amounts of sequencing spread over fewer samples, because more samples make it possible to correlate the sequence data with biological or clinical parameters of the system. Comparisons of successive studies are shown: Table 1A, blue, marine; Table 1B, brown, human gut; Table 1C, green, human skin; Table 1D, orange, soil.
| Study | Number of samples | Sequencing target | Key results |
|---|---|---|---|
| Gilbert et al., Environmental Microbiology, 2009 | 12 monthly marine samples | 16S rRNA V6 | Evidence of seasonally structured community diversity and of seasonal succession, significantly correlated with a combination of temperature, phosphate and silicate concentrations. |
| Gilbert et al., ISME J, 2011 | 72 monthly marine samples | 16S rRNA V6 | Community had strong, repeatable seasonal patterns, with winter peaks in diversity. Change in day length explained 65% of the diversity variance. The results suggested that seasonal changes in environmental variables are more important than trophic interactions. Relationships among bacteria were stronger than those with eukaryotes or the environment. The increase in temporal sampling over Gilbert et al., 2009, increased the capability to explore community relationships. |
| Zinger et al., PLoS ONE, 2011 | 509 marine samples | 16S rRNA | High variability of bacterial community composition specific to vent and coastal ecosystems. Both pelagic and benthic bacterial community distributions correlate with surface water productivity. Differences in physical mixing may also play a fundamental role in the distribution patterns of marine bacteria, as benthic communities showed a higher dissimilarity with increasing distance than pelagic communities. |
Figure 1. Conceptual diagram of why replicated samples, especially across a gradient or along a time series, are critical for interpretation of results. Structure that is externally imposed via study design greatly improves our ability to recover biologically meaningful relationships rather than simply finding statistical differences between samples (especially important because every pair of biological samples will be different if sequenced deeply enough). In this case, we show the L4 Western English Channel ocean time series samples [22]. Sampling only during the summer (highlighted in blue) would reveal only the tip of the iceberg of variability in this ecosystem, which is driven by seasonal change (the graph shows day on the x-axis and the log of the observed number of species on the y-axis). Similar principles apply in other ecosystems whose major drivers of variation, when overlooked, can influence the results in puzzling ways or give a misleading picture of variation.
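The point of the figure can be illustrated with a minimal sketch. It is not the L4 data: the cosine curve, its mean, its amplitude, the winter peak at day 0, and the summer window are all hypothetical values chosen only to mimic a seasonal signal with a winter peak in diversity. The sketch compares the range of log richness seen with roughly monthly sampling across a year against the range seen with summer-only sampling.

```python
import math

def simulated_log_richness(day, mean=6.0, amplitude=0.5, peak_day=0):
    """Toy seasonal curve (hypothetical parameters, not measured data).

    Log richness peaks at peak_day (winter) and dips about half a year later.
    """
    return mean + amplitude * math.cos(2 * math.pi * (day - peak_day) / 365.0)

def observed_range(days):
    """Spread of log richness that a given sampling schedule would reveal."""
    values = [simulated_log_richness(d) for d in days]
    return max(values) - min(values)

full_year = range(0, 365, 30)      # roughly monthly sampling across one year
summer_only = range(152, 244, 30)  # roughly June through August only

full_range = observed_range(full_year)
summer_range = observed_range(summer_only)

print(f"full-year range of log richness:   {full_range:.3f}")
print(f"summer-only range of log richness: {summer_range:.3f}")
```

Under these assumptions the summer-only schedule recovers only a fraction of the variation that the full-year schedule reveals, which is the "tip of the iceberg" effect described in the caption.
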
| Challenge | Decision | Pitfall | Consequence |
|---|---|---|---|
| Biological and technical replicates are expensive and time-consuming | Whether to perform replication, or gamble that a single sample in each group is informative with sufficiently well-described ecosystem parameters | Non-replicated designs are often not interpretable, or are overinterpreted (e.g. attributing differences between a single healthy person and a single diseased person to the disease) | Conclusions cannot be replicated by other researchers, and may not be generalizable beyond the specific samples analyzed |
| A fixed sequencing budget must be divided among some number of samples (e.g. by multiplexing at some level) | Whether to sequence few samples deeply, or many samples more shallowly | The appropriate number of samples and sequencing depth are unknown | Few samples may be uninformative and may preclude informative analysis of variation in the system and/or replication; shallow sequencing may miss rare but important taxa or functions |
| Experimental challenges due to low yield of DNA and/or high community diversity | Whether to adopt new protocols for improved DNA extraction, amplification and/or assembly | DNA extraction and manipulation steps all introduce biases that may make it difficult to compare between studies | For unique or rare samples that require special treatment, all steps in the treatment must be considered when comparing results with those from other studies |
| Defining the dimensions of variation that matter in a given system is challenging, and often is the purpose of the study itself | Which scales and parameters to select, and how much variation to cover | “Extremes” of variation in the system being studied are expensive and difficult to obtain (tail of distribution) and may not even be extreme from the microbes’ perspective; relevant variation often unknown | Conclusions from one population or study site inappropriately generalized to other populations or study sites; relevant variation in system undiscovered; extreme efforts to obtain exotic samples are unrewarded |
| Must choose a sequencing platform | Trade-off between read length and number of sequences; must decide when to adopt new technology | All sequencing technologies and processing pipelines have drawbacks, not all of which are widely advertised; technology changes rapidly | Sequences may be too short, too few, or too error-prone to interpret, or too passé to publish |
| Interpretation of sequence data | Must decide whether to use reference-based or de novo methods for assembly, taxonomy and functional assignment, and if so which reference to use | Different reference databases give different results; de novo is unbiased but far less powerful when appropriate references exist; analyses differ as reference databases update rapidly, limiting comparisons between studies. Current assembly algorithms are insufficient for highly complex metagenome data. | Incorrect and/or hard-to-reconcile functional and taxonomic assignments |
| Metadata collection | Must decide what metadata (i.e. sample or site data) to collect and associate with sample | Too complex to be implemented; fields inconsistent with previous studies due to lack of standards-compliance; data model can’t accommodate | Chaos! |
| Centralization | Whether to centralize sample collection, metadata curation, DNA extraction, sequencing, data storage, and data analysis | Decentralization can lead to inconsistencies that make data difficult to interpret; centralization can lead to delays while funding is acquired or capacity is built, and can limit creativity | Either the dataset may be vast but too inconsistent to interpret, or it may be extremely consistent but limited in scope and/or interpretation. Specific considerations apply to each stage; the EMP currently favors decentralized sample collection and centralization of other steps on a case-by-case basis |
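The trade-off between sequencing few samples deeply and many samples shallowly can be made concrete with a small calculation. This is a sketch under stated assumptions, not a recommendation from the authors: the total read budget and the relative abundance of the rare taxon are hypothetical, and reads are treated as independent draws from the community, which ignores extraction and amplification biases.

```python
def reads_per_sample(total_reads, n_samples):
    """Evenly divide a fixed read budget among multiplexed samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_reads // n_samples

def detection_probability(relative_abundance, depth):
    """Chance of seeing at least one read from a taxon at the given depth.

    Assumes each read is an independent draw from the community.
    """
    return 1.0 - (1.0 - relative_abundance) ** depth

TOTAL_READS = 1_000_000  # hypothetical sequencing budget
RARE_TAXON = 1e-4        # hypothetical relative abundance of a rare taxon

results = {}
for n_samples in (10, 100, 1000):
    depth = reads_per_sample(TOTAL_READS, n_samples)
    results[n_samples] = (depth, detection_probability(RARE_TAXON, depth))

for n_samples, (depth, p) in results.items():
    print(f"{n_samples:>5} samples -> {depth:>7} reads each, "
          f"P(detect rare taxon) = {p:.3f}")
```

With these illustrative numbers, spreading the budget over more samples sharply reduces the chance of detecting the rare taxon in any one sample, while using fewer samples leaves less room for replication and for analysis of variation. The appropriate balance depends on the system and remains a judgment call, as the table notes.
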