Unlocking the potential of metagenomics through replicated experimental design.

Rob Knight, Janet Jansson, Dawn Field, Noah Fierer, Narayan Desai, Jed A Fuhrman, Phil Hugenholtz, Daniel van der Lelie, Folker Meyer, Rick Stevens, Mark J Bailey, Jeffrey I Gordon, George A Kowalchuk, Jack A Gilbert.

Abstract

Metagenomics holds enormous promise for discovering novel enzymes and organisms that are biomarkers or drivers of processes relevant to disease, industry and the environment. In the past two years, we have seen a paradigm shift in metagenomics to the application of cross-sectional and longitudinal studies enabled by advances in DNA sequencing and high-performance computing. These technologies now make it possible to broadly assess microbial diversity and function, allowing systematic investigation of the largely unexplored frontier of microbial life. To achieve this aim, the global scientific community must collaborate and agree upon common objectives and data standards to enable comparative research across the Earth's microbiome. Improvements in comparability of data will facilitate the study of biotechnologically relevant processes, such as bioprospecting for new glycoside hydrolases or identifying novel energy sources.

Year: 2012 PMID: 22678395 PMCID: PMC4902277 DOI: 10.1038/nbt.2235

Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908

Introduction

The Earth hosts more than 10^30 microbial cells[1], a figure that exceeds the number of known stars in the universe by nine orders of magnitude. This richness of single-celled life, the first life to evolve on the planet, still accounts for the vast majority of functional drivers of our planet’s ecosystems[2]. Yet the diversity and interdependencies of these microscopic organisms remain largely unknown. Likewise, our understanding of the functional potential of most individual microbial taxa residing within any ecosystem is extremely limited and generally restricted to measurements of gross enzymatic processes of the community. Moreover, sequenced metagenomic datasets have, to date, only played a limited role in biotechnological knowledge discovery, with the majority of novel developments occurring through heterologous expression of enzymes. Our knowledge of microbial diversity on Earth is poised to be revolutionized by the development of new technologies that will permit us to ‘see’ the ‘who, what, when, where, why, and how?’ of microbial communities. Most recently, next-generation sequencing methods have begun to rapidly improve our understanding of the functional and evolutionary processes necessary to advance the field of microbial ecology. Matching these technological strides are progress in scientific community cooperation, increases in interdisciplinary interaction, and the development of standards for experimental and sample contextual “metadata” acquisition, which are essential for downstream interpretation[3]. Here we discuss how advances in DNA sequencing, the handling of contextual data and improvements in study design can unlock the potential of metagenomics.
We discuss the need for robust experimental design[4] (e.g., replication and improved ecosystem characterization) and highlight the need for an Earth Microbiome Project that will rely on metagenomics to explore Earth’s microbial dark matter across temporal and spatial scales and simultaneously facilitate novel gene discovery. Through standardized data generation approaches and metadata collection, we stand poised to make rapid progress toward advancing biotechnological goals.

Changing the paradigm in metagenomic experimental design

For more than 80 years, it has been recognized that the majority of microbial life cannot be easily cultured in the laboratory. This has constrained understanding of microbial ecosystems and impeded our ability to discover and utilize new beneficial functions derived from microorganisms (e.g., enzymes to drive biotechnological reactions, processes to enhance bioremediation, and biomarkers for disease diagnosis and therapeutic targets). Current biotechnology is still based on a small stable of “domesticated” species, yet technical improvements in molecular microbial ecology and synthetic biology offer the potential for novel enzyme discovery and exploitation from the previously inaccessible depths of the tree of life. However, in this age of exploration and discovery, as we test the capability and limits of these new tools, it is unsurprising that the majority of studies have failed to live up to expectations. This has created a paradox, in that funding agencies are not providing the resources required to undertake metagenomic sequencing and analysis of the large and sufficiently replicated sample sets needed to produce scientifically valid investigations. Financial constraints should not compromise the need for scientific rigor. A genuine concern exists that such constraints have led some journals and reviewers to accept the argument that proper experimental design and true replication are logistically infeasible and therefore should not be required for publication of the observations made. Yet, as discovery moves from the description of apparent diversity to the genuine description of complexity and function, this is no longer acceptable or desirable. Is it possible that metagenomics has failed to deliver what it promised: a fast, cheap, and comprehensive method to explore functional biochemistry in the natural world?
It is too early to reach this conclusion, but several factors led to this perception, including underestimation of the complexity of microbial diversity, limited data concerning the source of each sample and the identity of many genes, difficulties in integrating and comparing results obtained with different technologies in different labs, mismatched expectations between researchers who sought to understand ecological patterns and those who were excited to test the limits of new technology, and the lack of agreed-upon data standards. For the discovery of enzymes such as glycoside hydrolases[5] (important for biomass breakdown), the type of biomass, biological or physicochemical pretreatment (e.g. grinding of biomass by wood-feeding insects), redox conditions, pH and temperature are important parameters to record. If we continue to develop these environmental data checklists for other types of sample sets, it will be feasible to search for relevant genes in databases created by metagenomic endeavors, which will greatly assist in finding genes relevant to a target biotechnology application. To change perspectives, national and global cooperation is needed to adopt minimum standards in experimental design and to convince funding agencies to make the appropriate levels of investment. Initial advances toward novel gene discovery using metagenomics relied on direct cloning and sequencing of DNA fragments extracted from uncultured microbial communities. Although an important step forward, these methods were also time-consuming and expensive. For example, metagenomic data generation for the first leg of the Global Ocean Sampling expedition was estimated to cost > $10 million. Although costly, the dataset is comparatively limited by today’s standards.
Since the introduction of the first wave of ‘next-generation’ highly parallel DNA sequencers in 2006, there has been an explosion in gigabase- to terabase-scale metagenomic sequencing projects[6]. An illustrative, though not exhaustive, list includes the continued Global Ocean Survey (GOS), International Census of Marine Microbes, MetaHIT, the Human Microbiome Project (HMP), TARA Oceans, DeepSoil, MetaSoil, Genomic Observatories[7], the JGI Great Prairie pilot study, and the National Ecological Observatory Network (NEON). Pioneering metagenomic studies of microbial community composition and function in different environments (e.g., acid mine drainage[8], soil/permafrost[9, 10], marine GOS[11], Hawaiian ocean time series[12], Western Channel Observatory L4[13], termite hindgut[14], cow rumen[15], human gastrointestinal tract[16], and mouse gastrointestinal tract[17]) provided a first glimpse into the potential of this approach to uncover previously unknown functional genes, phylogenetic types, and interactions among community members. Indeed, comparative metagenomic analyses have yielded considerable insight into the distribution of gene families across different ecosystems, and the role of specific functional attributes in adaptation to physical and chemical conditions[18-20]. However, these initial studies were limited by their status as pilot studies, often due to the need to develop and prove the technologies and the high cost of sequencing. Therefore, most of these studies were observational and were not able to adopt the normal scientific methodological approach of well replicated coverage of the respective ecosystems for statistically relevant analyses[21] of the biological variation. Now that sequencing costs have declined as throughput has dramatically increased, we expect, without any reasonable exceptions, rigorous experimental design to be applied to future metagenomics experiments. 
Further, we must take full advantage of this brave new world of rigorous metagenomic study design by thinking like cartographers, and creating a map that can be used to navigate the uncharted regions of the microbial universe. One example of this map could be a catalogue of all known proteins and the environments (including comprehensive metadata) in which they were found. To do this, it will be necessary to better characterize individual ecosystems with prolonged and in-depth investigations, comprehensive physical, chemical and biological contextual data, appropriate statistical design, and improved interpretation of functional and taxonomic characteristics (Box 1). A metagenomic dataset is only as good as the contextual experimental and environmental data associated with it. Just as maps require a standard format to enable comparability, in-depth investigations also must be comparable, and be able to be linked, to uncover what features are common to multiple systems, or specific to each system. Standardization efforts enable further analyses, such as determining the distribution of these elements across time and space, thereby improving our understanding of microbial dynamics across planet Earth.

Defining the playing field through shallow and deep surveys

Ultra-deep sequencing of taxonomic or functional marker genes such as the small subunit ribosomal RNA gene (SSU-rRNA) or nifH has enabled comprehensive cataloging of the inhabitants of a variety of microbial ecosystems[22-26]. Deep sequencing of a few samples can provide information about rare taxa and rare genes, but without analyzing larger numbers of samples, limitations arise: the statistical significance of observed patterns cannot be determined, the patterns of co-occurrence between genes and taxa are difficult to assess, and the dominant biotic or abiotic factors structuring communities across time and space remain undetermined. As an analogy, if naturalists in the 19th century had only focused on plant and animal diversity in a few, isolated plots instead of exploring ecosystems across broad swaths of the globe, the fields of botany and zoology would have reached a standstill, and the global patterns of biogeography, which were crucial to forming our modern understanding of ecology and evolution, would have remained unknown. Thus, for microbial biogeography, many samples from related or contrasting communities must be studied in parallel. We recognize the recent advances that have been made by the deep sequencing of a few samples (e.g. generating billions or trillions of base pairs from a single sample). Indeed, broad, shallow sequencing from many thousands of samples can help to direct which samples should be analysed in more detail using deep sequencing, which enables additional data analyses that may lead to better interpretation of the biological information. For example, in order to obtain enough information to allow reliable assembly of specific genomic fragments (using currently available sequencing technologies), deep sequencing of random shotgun DNA is essential. Recent work on rumen samples obtained from two cows illustrates this point. 
Hess and colleagues[15] were able to assemble 15 near-complete bacterial genomes from short-read length shotgun sequencing data. However, improved coverage is not the only answer, but can help to focus the question; for example, using a rough calculation of 4 Mbp (mega base pairs) per genome and a billion cells per gram, a single gram of soil could contain up to 3 Pbp (peta base pairs) of genetic data. Recently, Mackelprang et al.[9] used deep sequencing to successfully assemble a draft genome of a novel methanogen from highly diverse permafrost soil. Therefore, although soil is one of the most challenging ecosystems for metagenomics because of its high diversity, advances in new assembly algorithms show great promise for genome reassembly from deep sequence studies[27]. The question of whether to sequence deeply or shallowly across many samples is dependent on the question you want to answer. Deep sequencing is required to observe rare members of microbial communities. Regardless of the habitat in question, rare members of the community can have key functional roles, such as nutrient cycling (e.g., methanogenesis[28], nitrogen fixation[26]), pathogenesis, stimulation of the immune system, and metabolite production (e.g., butyrate in the gut, or antibiotics). Moreover, microbes that are rare in one sample may be common in another. For example, in the European MetaHIT project, metagenome sequences from fecal samples were obtained from 124 individuals, and the human gut microbes identified as being shared between individuals varied 8- to 1500-fold among different hosts[29]. Shallow sequencing, in contrast, enables the exploration of microbial community structure dynamics, which is fundamental to building a predictive understanding of an ecosystem[30].
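The trade-off between depth and breadth can be made concrete with a rough calculation. The sketch below assumes that reads are drawn independently and at random from the community, which real sequencing libraries only approximate; the function names and the example abundance are illustrative and do not come from the studies cited above.

```python
import math


def detection_probability(relative_abundance: float, n_reads: int) -> float:
    """Probability that at least one of n_reads comes from a given taxon,
    assuming independent random sampling of reads (a simplification)."""
    return 1.0 - (1.0 - relative_abundance) ** n_reads


def reads_needed(relative_abundance: float, target_probability: float = 0.95) -> int:
    """Smallest number of reads that detects the taxon at least once
    with the requested probability, under the same assumption."""
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - relative_abundance)
    )


if __name__ == "__main__":
    rare = 1e-6  # hypothetical taxon making up one millionth of the community
    print("reads needed for 95% detection:", reads_needed(rare))
    print("detection chance with 1,000 reads:", detection_probability(rare, 1_000))
```

Under these assumptions, a taxon at a relative abundance of one in a million needs on the order of three million reads to be seen at least once with 95% probability, whereas a thousand-read survey would almost always miss it. Shallow surveys can therefore describe the abundant members of many samples, but they cannot be expected to recover the rare ones.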
Recent evidence suggests that some ecosystems maintain a temporally persistent but vast microbial seed bank[31], suggesting that taxa identified by shallow surveys are merely indicative of the abundant taxa selected by the chemical, physical and biological processes leading up to and present at the time of sampling. However, one likely hypothesis states that “the dominant microorganisms in a sample are those that play the most important functional roles under normal conditions.” Hence, if one is interested in the ecology of more abundant processes or taxa, ultra-deep metagenomic sequencing is not essential; relatively small fractions of the genetic diversity contained within samples can reveal ecological patterns that help define ecosystem structure[13]. The potential for reliance on shallow sequence data (either amplicon or shotgun) for some studies is supported by a study of gnotobiotic mice harbouring a defined consortium where the complete genome sequence of every community member was known. In that study it was possible to obtain accurate descriptions of the community’s meta-transcriptome and meta-proteome based on short sequence reads[32]. Creating a highly detailed picture of an individual or environmental sample from one angle at one instant creates a static view of that sample that can be useful. However, it cannot capture temporal dynamics, or variability among individuals or habitats. Far more is gained from complementing such pictures with others, even if these others are taken at lower resolution, as it permits more accurate reconstruction of shape. Likewise, low-resolution pictures taken successively over time can provide a sense of motion and dynamics and low-resolution pictures of many different samples can provide a view of diversity and variability that cannot be obtained by a single sample. 
However, all these pictures or individual snap-shots must be well organized, as it is of little value to have them unsorted in a pile that prohibits retrieval of the series of the data sets, or images, necessary to reconstruct a view of a specific phenomenon under study. To determine dynamic processes, it is necessary to apply broad sampling (both in time and space) at an appropriate resolution to determine the frequency of the dynamics. With most studies, an increase in the number of samples analyzed has a significant impact on analytical power (Table 1). Gilbert and colleagues[33] generated a 12-sample survey of the annual changes in the microbiota of surface waters in the English Channel, and found evidence for seasonal succession driven by temperature and nutrient availability. However, when they augmented this with 60 more samples, making a contiguous 72-sample time series over six years[22], the patterns were significantly refined, with the seasonality being extremely robust, and day length being identified as the key driver of richness in the community (Figure 1; Table 1). Additionally, Arumugam and colleagues[34] used metagenomic sequencing from 22 individuals to show that human gut microbiota could be classified into 3 enterotypes, which showed no correlation to diet or ethnicity. However, Wu et al.[35] performed the same analysis on 98 individuals and demonstrated that the increased analytical power revealed distinct correlations with diet (Table 1). Other examples of the power of sampling breadth can be routinely found in the literature (Table 1), and they demonstrate that using statistically relevant experimental design is vital to generating accurate analyses.

Table 1

A, B, C, D Recent studies, number of samples, and reported results. Studies with more samples have a higher impact and clearer biological interpretations than studies with comparable amounts of sequencing but spread over fewer samples: the reason is ability to correlate information with biological or clinical parameters of the system. Three comparisons of successive studies are shown: Table 1A - blue – marine; Table 1B - brown – human gut; Table 1C - green – human skin; Table 1D - orange – soil.

Study	Number of samples	Sequencing target	Key results
Gilbert et al., Environmental Microbiology, 2009[33]	12 monthly marine samples	16S rRNA V6	Evidence of seasonally structured community diversity and for seasonal succession, significantly correlated to a combination of temperature, phosphate and silicate concentrations.
Gilbert et al., ISME J, 2011[22]	72 monthly marine samples	16S rRNA V6	Community had strong repeatable seasonal patterns, with winter peaks in diversity. Change in day length explained 65% of the diversity variance. The results suggested that seasonal changes in environmental variables are more important than trophic interactions. Relationships between Bacteria were stronger than with Eukaryotes or environment. The increase in temporal sampling over Gilbert et al., 2009, increased the capability to explore community relationships.
Zinger et al., PLoS ONE, 2011[46]	509 marine samples	16S rRNA	High variability of bacterial community composition specific to vent and coastal ecosystems. Both pelagic and benthic bacterial community distributions correlate with surface water productivity. Also, differences in physical mixing may play a fundamental role in the distribution patterns of marine bacteria, as benthic communities showed a higher dissimilarity with increasing distance than pelagic communities.

Figure 1

Conceptual diagram of why replicated samples, especially across a gradient or along a time series, are critical for interpretation of results. Structure that is externally imposed via study design greatly improves our ability to recover biologically meaningful relationships rather than simply finding statistical differences between samples (especially important because every pair of biological samples will be different if sequenced deeply enough). In this case, we show the L4 Western English Channel ocean time series samples [22]: Sampling only during the summer, highlighted in blue, would only reveal the tip of the iceberg of variability in this ecosystem, which is driven by seasonal change (the graph shows day on the x-axis; log of the observed number of species on the y-axis). Similar principles apply in other ecosystems that have other major drivers of variation that, when overlooked, can influence the results in ways that are puzzling, or give a misleading picture of variation.
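The point of this figure can be illustrated with a toy model. The sketch below is not fitted to the L4 data: it assumes, purely for illustration, that log richness follows a smooth annual cycle with a winter peak, and the mean, amplitude and sampling days are hypothetical values chosen for the example.

```python
import math


def log_richness(day_of_year: int, mean: float = 6.0, amplitude: float = 0.5) -> float:
    """Toy seasonal signal that peaks near the winter solstice and
    reaches its minimum near midsummer (hypothetical parameters)."""
    return mean + amplitude * math.cos(2.0 * math.pi * (day_of_year + 10) / 365.0)


def observed_range(days) -> float:
    """Spread of log richness seen by a given sampling scheme."""
    values = [log_richness(d) for d in days]
    return max(values) - min(values)


# One sample per month across the year: days 15, 45, ..., 345.
monthly_days = [15 + 30 * month for month in range(12)]
# The same scheme restricted to roughly June through August.
summer_days = [d for d in monthly_days if 152 <= d <= 243]

if __name__ == "__main__":
    print("range seen by year-round sampling:", observed_range(monthly_days))
    print("range seen by summer-only sampling:", observed_range(summer_days))
```

In this toy model, the summer-only scheme recovers only a small fraction of the variability that the year-round scheme sees, because all of its samples fall near the seasonal minimum. The same logic applies to any ecosystem whose main driver of variation is not spanned by the sampling design.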

Defining the effect size and the power of a study is a particularly important challenge in the design of clinical microbiome-directed trials (e.g. probiotics, prebiotics, antibiotics and stool transplants) or the natural or man-made disturbance in any terrestrial or oceanic ecosystem. A recent attempt to define effect sizes in studies of the human microbiome[36] foundered due to the lack of comparability of different datasets and methodologies for taxon detection and assignment. Such effect sizes can only be determined with sufficiently large sample sizes of “normal” versus “altered” states, studied over sufficient temporal and spatial scales to reveal variation. The dilemma, especially for human studies, is that large samples are required to determine effect size, but such studies cannot gain Institutional Review Board approval because the effect size, and therefore the correct number of subjects required to achieve statistical power, is unknown.
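The dependence of sample size on effect size can be sketched with a standard normal-approximation formula for comparing two group means. This is an illustration only: it assumes a standardized effect size (difference in means divided by the common standard deviation), a two-sided test, and roughly normal data, none of which is specified in the text, and microbiome measurements frequently violate the normality assumption. The function name and default values are hypothetical.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate subjects needed in each of two groups to detect a
    standardized mean difference, using the normal approximation
    n = 2 * ((z_(1 - alpha/2) + z_power) / effect_size) ** 2."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2.0 * ((z_alpha + z_power) / effect_size) ** 2)


if __name__ == "__main__":
    # Because the true effect size is unknown, tabulate a range of candidates.
    for d in (0.2, 0.5, 1.0):
        print(f"effect size {d}: {samples_per_group(d)} samples per group")
```

Because the required sample size scales with the inverse square of the effect size, halving the assumed effect roughly quadruples the number of subjects. Tabulating a range of plausible effect sizes in this way is one way to frame a pilot study when the true value is not yet known.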

Towards an Earth Microbiome Project

In recognition of the value of a multi-environmental survey of microbial diversity, we have instigated an initiative called the Earth Microbiome Project (EMP; www.earthmicrobiome.org). The EMP seeks to systematically characterize microbial taxonomic and functional biodiversity across global ecosystems, and to organize international environmental microbiology research by standardizing the protocols used to generate and analyze the data between studies. The Earth Microbiome Project (EMP) constitutes a restructuring and refocusing of microbial ecology. Individual projects are grouped (by single PI, or by consortium) into overarching science questions that can be used to define the fundamental purpose of a single project, or individual hypothesis-driven studies can be grouped under a larger question. While this framework provides a way to influence and globally organize environmental microbiology research, the novelty lies in the sheer scale of the endeavor and the standardization of the protocols used to generate and analyze the data between studies. The EMP standard operating procedures (SOPs) define a route to minimize bias between community analyses associated with different material extraction techniques, analytical methods and core data quality control and analysis. However, currently, the EMP does not promote a standard physical sample acquisition protocol or preservation technique, but is working to explore the impact of these variables on ecological interpretation[37]. The EMP framework promotes open access research; hence all data is being made public, including to industry, and comparable within an open access forum, which creates a data resource capable of answering and asking fundamental questions about the function of microbes in different environmental habitats. However, it is not just data that must be open access. 
The scientists themselves also need to be more accessible through open science initiatives[38], ensuring that the right researchers are able to work on the most relevant topics, making the best use of reductionist expertise. Additionally, the EMP framework enables multidisciplinary cooperation across funding agencies and scientific research areas. Stand-alone projects are mapped onto larger research themes, and these stack into overarching global questions, yielding multiple layers and scales of inquiry. This focus on multidisciplinary activity brings new dimensions to microbial investigation, through renewed interest in data processing, requirements for large-scale computational infrastructure, modeling community dynamics and functional capability, and linking the analyzed data and generated models to climate modeling informatics programs. It also merges aspects of biogeochemistry, microbiology, protein/enzyme interaction, and transcriptional feedback as we move from molecular scale processes to processes and dynamics on other scales. These range from cellular interaction, to community ecology, local, regional, national, continental and global scales. Such a broad knowledgebase will be critical for developing a predictive understanding of genes and organisms of biotechnological interest. Of course, for large scale sequencing efforts such as the EMP to be focused and coordinated, the community must avoid the “sequence everything” approach, simply because “we can.” Hypotheses must guide our selection of the most appropriate samples to sequence. To a large extent these will be sample sets that have rich metadata, and samples that have the potential to provide fundamental new knowledge.

The role of metadata acquisition in improved experimental design

Initiatives like the EMP are saved from becoming simple natural history exercises in data collection by requiring the acquisition and appropriate organization of the metadata that accompany every sequence dataset generated. These environmental and experimental metadata are the primary data of many multidisciplinary research groups, who already work together to generate a comprehensive understanding of a particular environment, e.g. a marine sampling field expedition, or a temporal exploration of soil and ecosystem dynamics in one location. Such environmental parameters give context to the origin of the sequence data we rely upon to generate interpretative analyses about the microbial dynamics in that ecosystem. They include temperature, latitude and longitude, altitude, moisture content, nutrient concentrations, and standard ontologies for geolocators and ecosystem descriptors. But these must also be accompanied by experimental metadata that appropriately describe the methods used to create the sequence data, such as sample handling, nucleic acid extraction, PCR amplification method, sequence protocol, and bioinformatic analysis. Acquisition of these metadata is essential to the EMP, as they provide ecological grounding to analyses of the taxonomic and functional capacity of the sequenced microbial community. Hence, this robust framework for routine collection of metadata and reliable standards will enable comparison between studies. A suite of such standard languages is provided by the Minimum Information about any (x) Sequence checklists (MIxS[39]). MIxS is an umbrella term to describe MIGS, MIMS and MIMARKS[3] and contains standard formats for recording environmental and experimental data.
The latest of these checklists, MIMARKS (Minimum Information about a MARKer Sequence) builds on the foundation of the MIGS (the Minimum Information about a Genome Sequence) and MIMS (the Minimum Information about a Metagenome Sequence) checklists[3], by including an expansion of the rich contextual information about each environmental sample. What is recorded depends on where the sample comes from. For example, human samples can be annotated with fields such as the age, weight, and health status of the subject, whereas seawater samples can be annotated with fields such as pH, salinity, depth and temperature. Additionally, detailed technical information such as the sequencing platform, and the genes and regions targeted are also required, making meta-analyses of many studies much easier to perform and interpret, because outliers can be traced back to technical differences or to biological differences automatically, rather than requiring the researcher to read scores of papers as is necessary for meta-analyses today[40]. This integration is especially important for finding enzymes that participate in processes that are potentially industrially useful but where the origin is irrelevant to the industrial application except for improving the possibility that the enzyme will work under the necessary conditions. We believe that the MIxS standard will play a key role for three reasons: First, it will enable large-scale projects to collect massive datasets according to standard protocols at multiple sites, and to share these data to facilitate global understanding. Second, it will enable integration of each lab’s individual projects into this universe of sequences, allowing community-level comparisons, unprecedented exploration of the diversity and distribution of life, easy detection and exclusion of contaminated samples, and the exploration of gene or taxon co-occurrence patterns. 
These features are especially crucial for accessing and integrating data from every clinic or every field site. Third, it will provide a framework for large-scale integration of efforts, especially predictive modeling. Stanislaw Ulam said, “Great scientists see analogies between theorems or theories. The very best ones see analogies between analogies.” By providing a method of integrating both the systematically collected results of large-scale projects such as the EMP and the highly distributed efforts of smaller groups, standards such as MIxS will help enable a future in which analogies across spatial scales, temporal scales, and even theories are not only possible but routine. As the cost of sequencing continues to decline, there has been a rapid adoption of the MIxS standard, and of sound sampling principles. For example, tools such as QIIME[41] and MG-RAST[42] are already MIxS-compliant and provide ways of viewing and analyzing MIxS-compliant data. INSDC has committed to incorporating a MIxS keyword as a standard, and large projects such as the HMP (https://commonfund.nih.gov/hmp/), NEON (http://www.neoninc.org/), the EMP (www.earthmicrobiome.org), the Bio Weather Map (http://bioweathermap.org/), and the Personal Genome Project (http://www.personalgenomes.org) have already pledged to support the standard. This rapid response is timely. As sequencing and computational methods co-evolve in a dynamic ‘arms race’ that spurs their mutual growth and progress, so too must data standards co-evolve. International activities such as the EMP provide test beds to allow the community to agree on standards for exchange of data products that go well beyond the trading of consensus sequences and annotations (e.g., GenBank). 
Even given the expected advances in cloud computing and the predicted decrease in computation costs according to Moore’s law, one main driver of innovation will be the need to provide analyses of datasets that are orders of magnitude larger without the corresponding need for vast increases in the bioinformatics budget. Investments in data reuse and usable data standards are critical. However, it is easier to create standards than it is to successfully promote their use. The Genomic Standards Consortium (GSC) has conducted pioneering work on minimum information checklists that have enabled provenance standards, and it is now taking on the much more complicated task of defining standards for computed data products. In this regard, journals can play a role by universally adopting such standards as a requirement for accepting and publishing manuscripts. The role of data generation in the discovery of novel enzymes and phylogenetic structure in microbial biodiversity must be complemented by improved functional and taxonomic databases that more appropriately represent the full breadth of microbial diversity. One critical aspect of this development will be mapping of metagenomic reads against reference genomes. The Earth Microbiome Project is partnered with the Genomic Encyclopedia of Bacteria and Archaea and Microbial Earth initiatives[43] that aim to improve the phylogenetic representation of sequenced genomes. These efforts combined with improved gene and protein database curation (e.g., IMG and IMG/M[44, 45]) will aid with metagenomic data interpretation, facilitating more efficient biodiscovery.
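The idea of an environment-specific checklist, as described in this section, can be sketched as a simple completeness check on a sample record. The field names below are illustrative stand-ins based on the examples given in the text (for instance pH, salinity, depth and temperature for seawater); they are assumptions made for this example and are not the official MIxS term names, and the function is hypothetical rather than part of any MIxS-compliant tool.

```python
from typing import Dict, List, Set

# Hypothetical, simplified checklists. These names are illustrative only
# and are NOT the official MIxS term names.
CORE_FIELDS: Set[str] = {"latitude", "longitude", "sequencing_platform", "target_gene"}
ENVIRONMENT_FIELDS: Dict[str, Set[str]] = {
    "seawater": {"ph", "salinity", "depth_m", "temperature_c"},
    "human": {"age", "weight_kg", "health_status"},
}


def missing_fields(sample: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty for this sample,
    combining the core checklist with the one for its environment."""
    environment = sample.get("environment")
    required = CORE_FIELDS | ENVIRONMENT_FIELDS.get(environment, set())
    return sorted(f for f in required if sample.get(f) in (None, ""))


if __name__ == "__main__":
    record = {
        "environment": "seawater",
        "latitude": 50.25,
        "longitude": -4.22,
        "sequencing_platform": "example platform",
        "target_gene": "16S rRNA V6",
        "ph": 8.1,
        "depth_m": 0,
        "temperature_c": 12.5,
    }
    print("missing:", missing_fields(record))
```

A check of this kind, run when data are deposited, is what allows later meta-analyses to filter or group samples by recorded context instead of recovering that context from the text of each publication.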

Conclusion

As occurred with many other technologies such as computing, telecommunications and photography (which, like sequencing, began with scientific applications but rapidly transformed consumers’ lives across the globe), metagenomics is in a time of transition. The community is moving through a progression in which technologies are deployed first centrally by large organizations, then by departments and then by individual laboratories, and it is perhaps not unreasonable to speculate that sequencing devices will soon be owned by individuals, perhaps even in a handheld format. Standard protocols are necessary to integrate the information and to allow easy communication across studies; after all, the role played by the internet in today’s world is only possible because computers everywhere can communicate using a set of standard, open protocols. While currently these initiatives are focused on DNA sequencing (amplicon sequencing and metagenomics), it will be necessary to determine how to integrate metabolomics, proteomics and single-cell genomics into these efforts to improve community characterization and enable more appropriate ecological inferences. The ‘omics ratio (ratio of applied techniques, e.g., genomics:transcriptomics:proteomics:metabolomics) should always be determined by the hypothesis. We believe and hope that MIxS and the EMP will enable the same type of functionality for ecologists, allowing us not just to construct a catalog of organisms on Earth but also to understand and exploit the critical processes they perform in the environment over a vast range of spatial and temporal scales.

Challenge	Decision	Pitfall	Consequence
Biological and technical replicates are expensive and time-consuming	Whether to perform replication, or gamble that a single sample in each group is informative with sufficiently well-described ecosystem parameters	Non-replicated designs are often not interpretable, or are overinterpreted (e.g., attributing differences between a single healthy and a single diseased person to the disease)	Conclusions cannot be replicated by other researchers, and may not be generalizable beyond the specific samples analyzed
A fixed sequencing budget must be divided among some number of samples (e.g. by multiplexing at some level)	Whether to sequence few samples deeply, or many samples more shallowly	The appropriate number of samples and sequencing depth are unknown	Few samples may be uninformative and may preclude informative analysis of variation in the system and/or replication; shallow sequencing may miss rare but important taxa or functions
Experimental challenges due to low yield of DNA and/or high community diversity	Whether to adopt new protocols for improved DNA extraction, amplification and/or assembly	DNA extraction and manipulation steps all introduce biases that may make it difficult to compare results between studies	For unique or rare samples that require special treatment, it is essential that all steps in the treatment be considered when comparing results to those from other studies
Defining the dimensions of variation that matter in a given system is challenging, and often is the purpose of the study itself	Which scales and parameters to select, and how much variation to cover	“Extremes” of variation in the system being studied are expensive and difficult to obtain (tail of distribution) and may not even be extreme from the microbes’ perspective; relevant variation often unknown	Conclusions from one population or study site inappropriately generalized to other populations or study sites; relevant variation in system undiscovered; extreme efforts to obtain exotic samples are unrewarded
Must choose a sequencing platform	Trade-off between read length and number of sequences; must decide when to adopt new technology	All sequencing technologies and processing pipelines have drawbacks, not all of which are widely advertised; technology changes rapidly	Sequences may be too short, too few, too error-prone to interpret, or too passé to publish
Interpretation of sequence data	Must decide whether to use reference-based or de novo methods for assembly, taxonomy and functional assignment, and if so which reference to use	Different reference databases give different results; de novo is unbiased but far less powerful when appropriate references exist; analyses differ as reference databases update rapidly, limiting comparisons between studies. Current assembly algorithms are insufficient for highly complex metagenome data.	Incorrect and/or hard-to-reconcile functional and taxonomic assignments
Metadata collection	Must decide what metadata (i.e. sample or site data) to collect and associate with sample	Too complex to be implemented; fields inconsistent with previous studies due to lack of standards-compliance; data model can’t accommodate	Chaos!
Centralization	Whether to centralize sample collection, metadata curation, DNA extraction, sequencing, data storage, and data analysis	Decentralization can lead to inconsistencies that make data difficult to interpret; centralization can lead to delays while funding is acquired or capacity is built, and can limit creativity	Either the dataset may be vast but too inconsistent to interpret, or it may be extremely consistent but limited in scope and/or interpretation. Specific considerations apply to each stage; the EMP currently favors decentralized sample collection and centralization of other steps on a case-by-case basis
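
The trade-off in the second row of the table, between sequencing few samples deeply and many samples shallowly, can be illustrated with a small calculation. The sketch below is not a method from this article: it assumes an idealized model in which every read is an independent draw from the community, so a taxon at relative abundance p is seen at least once in n reads with probability 1 - (1 - p)^n. The function names, the one-million-read budget and the 0.01% abundance are hypothetical values chosen for illustration.

```python
def reads_per_sample(total_reads, n_samples):
    """Split a fixed read budget evenly across multiplexed samples."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return total_reads // n_samples


def detection_probability(rel_abundance, reads):
    """Probability of observing a taxon at least once in `reads` reads.

    Assumes each read is an independent draw from the community, which
    ignores extraction, amplification and sequencing biases.
    """
    if not 0.0 <= rel_abundance <= 1.0:
        raise ValueError("rel_abundance must lie in [0, 1]")
    return 1.0 - (1.0 - rel_abundance) ** reads


def compare_designs(total_reads, sample_counts, rel_abundance):
    """Return (n_samples, depth, detection probability) for each design."""
    rows = []
    for n in sample_counts:
        depth = reads_per_sample(total_reads, n)
        rows.append((n, depth, detection_probability(rel_abundance, depth)))
    return rows


# Hypothetical budget and rare-taxon abundance, for illustration only.
designs = compare_designs(
    total_reads=1_000_000,
    sample_counts=[10, 100, 1000],
    rel_abundance=1e-4,
)
for n, depth, p in designs:
    print(f"{n:>5} samples x {depth:>7} reads each: P(detect rare taxon) = {p:.3f}")
```

Under these assumptions, spreading the same budget over more samples lowers the chance of detecting a rare taxon in any one sample, while fewer samples leave less room for replication. The calculation can inform that choice, but it cannot decide it without a hypothesis.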

52 in total

1. Open science is a research accelerator.

Authors: Michael Woelfle; Piero Olliaro; Matthew H Todd
Journal: Nat Chem Date: 2011-09-23 Impact factor: 24.427

2. Functional metagenomic profiling of nine biomes.

Authors: Elizabeth A Dinsdale; Robert A Edwards; Dana Hall; Florent Angly; Mya Breitbart; Jennifer M Brulc; Mike Furlan; Christelle Desnues; Matthew Haynes; Linlin Li; Lauren McDaniel; Mary Ann Moran; Karen E Nelson; Christina Nilsson; Robert Olson; John Paul; Beltran Rodriguez Brito; Yijun Ruan; Brandon K Swan; Rick Stevens; David L Valentine; Rebecca Vega Thurber; Linda Wegley; Bryan A White; Forest Rohwer
Journal: Nature Date: 2008-03-12 Impact factor: 49.962

Review 3. The microbial engines that drive Earth's biogeochemical cycles.

Authors: Paul G Falkowski; Tom Fenchel; Edward F Delong
Journal: Science Date: 2008-05-23 Impact factor: 47.728

4. Ocean time-series reveals recurring seasonal patterns of virioplankton dynamics in the northwestern Sargasso Sea.

Authors: Rachel J Parsons; Mya Breitbart; Michael W Lomas; Craig A Carlson
Journal: ISME J Date: 2011-08-11 Impact factor: 10.302

5. Metagenomic mining for microbiologists.

Authors: Tom O Delmont; Cedric Malandain; Emmanuel Prestat; Catherine Larose; Jean-Michel Monier; Pascal Simonet; Timothy M Vogel
Journal: ISME J Date: 2011-05-19 Impact factor: 10.302

6. Enterotypes of the human gut microbiome.

Authors: Manimozhiyan Arumugam; Jeroen Raes; Eric Pelletier; Denis Le Paslier; Takuji Yamada; Daniel R Mende; Gabriel R Fernandes; Julien Tap; Thomas Bruls; Jean-Michel Batto; Marcelo Bertalan; Natalia Borruel; Francesc Casellas; Leyden Fernandez; Laurent Gautier; Torben Hansen; Masahira Hattori; Tetsuya Hayashi; Michiel Kleerebezem; Ken Kurokawa; Marion Leclerc; Florence Levenez; Chaysavanh Manichanh; H Bjørn Nielsen; Trine Nielsen; Nicolas Pons; Julie Poulain; Junjie Qin; Thomas Sicheritz-Ponten; Sebastian Tims; David Torrents; Edgardo Ugarte; Erwin G Zoetendal; Jun Wang; Francisco Guarner; Oluf Pedersen; Willem M de Vos; Søren Brunak; Joel Doré; María Antolín; François Artiguenave; Hervé M Blottiere; Mathieu Almeida; Christian Brechot; Carlos Cara; Christian Chervaux; Antonella Cultrone; Christine Delorme; Gérard Denariaz; Rozenn Dervyn; Konrad U Foerstner; Carsten Friss; Maarten van de Guchte; Eric Guedon; Florence Haimet; Wolfgang Huber; Johan van Hylckama-Vlieg; Alexandre Jamet; Catherine Juste; Ghalia Kaci; Jan Knol; Omar Lakhdari; Severine Layec; Karine Le Roux; Emmanuelle Maguin; Alexandre Mérieux; Raquel Melo Minardi; Christine M'rini; Jean Muller; Raish Oozeer; Julian Parkhill; Pierre Renault; Maria Rescigno; Nicolas Sanchez; Shinichi Sunagawa; Antonio Torrejon; Keith Turner; Gaetana Vandemeulebrouck; Encarna Varela; Yohanan Winogradsky; Georg Zeller; Jean Weissenbach; S Dusko Ehrlich; Peer Bork
Journal: Nature Date: 2011-04-20 Impact factor: 49.962

7. Bacterial community variation in human body habitats across space and time.

Authors: Elizabeth K Costello; Christian L Lauber; Micah Hamady; Noah Fierer; Jeffrey I Gordon; Rob Knight
Journal: Science Date: 2009-11-05 Impact factor: 47.728

8. Linking long-term dietary patterns with gut microbial enterotypes.

Authors: Gary D Wu; Jun Chen; Christian Hoffmann; Kyle Bittinger; Ying-Yu Chen; Sue A Keilbaugh; Meenakshi Bewtra; Dan Knights; William A Walters; Rob Knight; Rohini Sinha; Erin Gilroy; Kernika Gupta; Robert Baldassano; Lisa Nessel; Hongzhe Li; Frederic D Bushman; James D Lewis
Journal: Science Date: 2011-09-01 Impact factor: 47.728

9. The Earth Microbiome Project: Meeting report of the "1 EMP meeting on sample selection and acquisition" at Argonne National Laboratory October 6 2010.

Authors: Jack A Gilbert; Folker Meyer; Janet Jansson; Jeff Gordon; Norman Pace; James Tiedje; Ruth Ley; Noah Fierer; Dawn Field; Nikos Kyrpides; Frank-Oliver Glöckner; Hans-Peter Klenk; K Eric Wommack; Elizabeth Glass; Kathryn Docherty; Rachel Gallery; Rick Stevens; Rob Knight
Journal: Stand Genomic Sci Date: 2010-12-25

10. IMG/M: the integrated metagenome data management and comparative analysis system.

Authors: Victor M Markowitz; I-Min A Chen; Ken Chu; Ernest Szeto; Krishna Palaniappan; Yuri Grechkin; Anna Ratner; Biju Jacob; Amrita Pati; Marcel Huntemann; Konstantinos Liolios; Ioanna Pagani; Iain Anderson; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971
