Metagenomic Assembly: Overview, Challenges and Applications.

Jay S Ghurye¹, Victoria Cepeda-Espinoza¹, Mihai Pop¹.

Abstract

Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.

Keywords: Assembly; Metagenomics; Microbiome

Year: 2016 PMID: 27698619 PMCID: PMC5045144

Source DB: PubMed Journal: Yale J Biol Med ISSN: 0044-0086

Introduction

DNA sequencing has become an important tool in biological research. The cost of sequencing has been rapidly decreasing, leading to the use of sequencing technologies in a broad set of biological applications. In particular, sequencing has been used to characterize the microbial communities associated with human and animal bodies as well as with many environments within our world. The use of high throughput sequencing in the analysis of microbial communities has led to the creation of a new scientific field – metagenomics – the analysis of the combined genomes of organisms co-existing in a community. A critical step in such analyses is metagenomic assembly – the stitching together of individual DNA sequences into genes or organisms. Genome assembly algorithms have been an important component of efforts to characterize the genomes of single organisms and have been key to the modern genomic revolution. In the context of single organisms the genome assembly problem has been thoroughly studied and a number of effective strategies have been developed, strategies that underlie modern assembly tools. Metagenomic data, however, pose new challenges and create new scientific questions that still await an answer. In this review, we will survey the key algorithmic paradigms underlying modern assembly tools. We will then discuss the specific challenges posed by metagenomic data and outline some of the strategies recently developed to address the complexities associated with these data. We will conclude with a discussion of specific biological findings that were made possible by the newly developed metagenomic assembly approaches.

Genome Assembly Overview

Genome assembly [1] is the reconstruction of genomes from the smaller DNA segments, called reads, that are generated by a sequencing experiment. Various sequencing technologies have been developed in the past couple of decades; Table 1 summarizes them along with their advantages and disadvantages. In many cases, reads are paired-end or mate-paired, which means that pairs of reads are sequenced from the same DNA fragment. The distance between the reads in each pair and their relative orientation are approximately known. This information is used to resolve ambiguities caused by repetitive sequences during assembly [2] as well as to order and orient the assembled contigs – the fragments of the genome that could be stitched together from the set of reads [3]. Below, we detail the main assembly approaches.

Table 1

Overview of current sequencing technologies.

Technology	Read Length	Accuracy	Time per run	Bases per run
Single Molecule Real-Time Sequencing (Pacific Biosciences)	10 kbp to 15 kbp	87% (Low)	30 minutes to 4 hours	5 – 10 Gb
Oxford Nanopore MinION Sequencing	5 kbp to 10 kbp	70% to 90% (Low)	1 to 2 days	500 Mb
Ion Semiconductor (Ion Torrent sequencing)	Up to 400 bp	98% (Medium)	2 hours	10 Gb
Sequencing by synthesis (Illumina)	50 – 300 bp	99.9% (High)	1 to 11 days	300 Gb
Sequencing by ligation (SOLiD sequencing)	75 bp	99.9% (High)	1 to 2 weeks	3 Gb
Pyrosequencing (454)	700 bp	98% (Medium)	24 hours	400 Mb
Chain termination sequencing (Sanger sequencing)	400 to 900 bp	99.9% (High)	20 mins to 3 hours	50 – 100 Kb

Algorithms for Genome Assembly

In the following we will distinguish between de novo assembly – which involves reconstructing genomes directly from the read data – and comparative assembly – where the aim is to use the sequences of previously sequenced closely related organisms to guide the construction of a new genome. The general problem of de novo assembly has been shown to be NP-hard [4], which means that no efficient (polynomial-time) algorithm is known for solving it exactly. Because of this computational intractability, heuristic methods have been devised to perform de novo assembly. The most widely used strategies (paradigms) are greedy, overlap-layout-consensus (OLC), and de Bruijn graph (see Figure 1).

Figure 1

Overview of different de novo assembly paradigms. Schematic representation of the three main paradigms for genome assembly – Greedy, Overlap-Layout-Consensus, and de Bruijn. In a greedy assembler, reads with maximum overlaps are iteratively merged into contigs. In the Overlap-Layout-Consensus approach, a graph is constructed by finding overlaps between all pairs of reads. This graph is further simplified and contigs are constructed by finding branch-less paths in the graph, and taking the consensus sequence of the overlapping reads implied by the corresponding paths. Contigs are further organized and extended using mate-pair information. In de Bruijn graph assemblers, reads are chopped into short overlapping segments (k-mers) which are organized in a de Bruijn graph structure based on their co-occurrence across reads. The graph is simplified to remove artifacts due to sequencing errors, and branch-less paths are reported as contigs.

Greedy

This is the simplest and most intuitive method of assembly. In this method, individual reads are joined together into contigs in an iterative manner, starting with the reads that overlap best and ending once no more reads or contigs can be merged. This approach is simple to implement and effective in many practical settings, and was used in several of the early genome assemblers such as TIGR [5], Phrap, and VCAKE [6]. This simple greedy method, however, has some serious drawbacks. The choices made during merging of reads/contigs are locally optimal and do not consider global relationships between reads. As a result, the approach can get stuck or can produce incorrect assemblies within repetitive sequences.
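The greedy strategy described above can be illustrated with a minimal sketch. It assumes error-free reads on a single strand and exact suffix-prefix overlaps; the function names and the `min_overlap` parameter are hypothetical choices for illustration and are not taken from TIGR, Phrap, or VCAKE.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(a, b, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # no pair overlaps any more: stop
        # Locally optimal choice: merge the best pair without looking ahead.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

For the reads ATGGCGT, GCGTACC and TACCTTA this returns the single contig ATGGCGTACCTTA. Because each merge is chosen locally, two reads that end in the same repeat can be joined incorrectly, which is the drawback noted above.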

Overlap-Layout-Consensus

This three-step approach begins with the calculation of pairwise overlaps between all pairs of reads. The overlaps are computed with a variant of a dynamic programming-based alignment algorithm, making assembly possible even if the reads contain errors. Using this information, an overlap graph is constructed where nodes are reads and edges denote overlaps between them. The layout stage consists of a simplification of the overlap graph to help identify a path that corresponds to the sequence of the genome. More precisely, a path through the overlap graph implies a 'layout' of the reads along the genome. In the consensus stage, the layout is used to construct a multiple alignment of the reads and to infer the likely sequence of the genome. This assembly paradigm was used in a number of assemblers, including Celera Assembler [7], which was used to reconstruct the human genome, and the Arachne [8] assembler used in many of the genome projects at the Broad Institute. The overlap-layout-consensus approach has also re-emerged recently as the primary paradigm used in assembling long reads with high error rates, such as those produced by the technologies from Pacific Biosciences and Oxford Nanopore.
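The overlap stage can be sketched as follows. As a simplifying assumption, the sketch replaces the dynamic programming alignment mentioned above with a mismatch-only (Hamming) comparison of suffix against prefix, so it tolerates substitutions but not insertions or deletions. The names `approx_overlap`, `build_overlap_graph` and the `max_error_rate` parameter are hypothetical.

```python
def approx_overlap(a, b, min_overlap=4, max_error_rate=0.2):
    """Longest suffix of `a` matching a prefix of `b` within a mismatch budget."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-length:], b[:length]))
        if mismatches <= max_error_rate * length:
            return length
    return 0


def build_overlap_graph(reads, min_overlap=4, max_error_rate=0.2):
    """Nodes are read indices; edge i -> j stores the overlap length."""
    graph = {i: {} for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                length = approx_overlap(a, b, min_overlap, max_error_rate)
                if length:
                    graph[i][j] = length
    return graph
```

With the reads ACGTTGCA, TTGGAGGC and AGGCTTAC, read 0 overlaps read 1 by five bases despite one mismatch, and read 1 overlaps read 2 by four bases. Note the all-against-all loop: the number of comparisons grows quadratically with the number of reads, which is why this paradigm becomes expensive at high depth of coverage.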

De Bruijn Graph

The de Bruijn graph assembly paradigm focuses on the relationship between substrings of fixed length k (k-mers) derived from the reads. The k-mers are organized in a graph structure where the nodes correspond to the prefixes and suffixes of length k-1 of the k-mers, connected by edges that represent the k-mers themselves. In this approach reads are not explicitly aligned to each other; rather, their overlaps can be inferred from the fact that they share k-mers. With this graph, the assembly problem reduces to finding an Eulerian path – a path through the graph that visits each edge exactly once. Unlike the Overlap-Layout-Consensus approach, the de Bruijn graph paradigm is affected by errors in the reads, which introduce false k-mers (false nodes and edges) in the graph. These errors must be eliminated prior to identifying an Eulerian path in the graph. All practical de Bruijn assemblers include a number of heuristic strategies for eliminating errors from the reads and the graph. This paradigm has become widely used after the introduction of high throughput and relatively low-error sequencing technologies, in part because it is easy to implement and efficient even in high depth of coverage settings. Some notable assemblers include Velvet [9], SOAPdenovo [10], SOAPdenovo2 [11], ALLPATHS [12], and SPAdes [13].
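The construction described above can be sketched in a few lines. The sketch assumes error-free reads from one strand, counts each distinct k-mer once (so repeats that occur several times are not modeled), and assumes that an Eulerian path exists. It uses Hierholzer's algorithm to find the path; all function names are hypothetical, and production assemblers rely on more elaborate graph simplification rather than a literal Eulerian traversal.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer contributes one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = sorted(set(out_deg) | set(in_deg))
    # Start at the node with one more outgoing than incoming edge, if any.
    starts = [n for n in nodes if out_deg.get(n, 0) - in_deg[n] == 1]
    start = starts[0] if starts else nodes[0]
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Concatenate node labels along a path into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])
```

For the reads ATGGCGT, GGCGTGC and CGTGCA with k = 4, `spell(eulerian_path(build_de_bruijn(reads, 4)))` reconstructs ATGGCGTGCA. A single sequencing error in one read would add up to k spurious k-mers to the graph, which is why error removal has to precede the traversal.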

Comparative Assembly

The number of organisms whose genomes have been sequenced has been rapidly increasing. These genomes can be used to assist the assembly process through a strategy called Reference Guided Assembly or Comparative Assembly. Comparative assembly consists of two steps – first, all the reads are aligned against the reference genome; then a consensus sequence is inferred from the alignments. This approach is more effective than de novo assembly in resolving repeats and is thus able to get better results than de novo approaches, especially at low depths of coverage. Long repeats are still a challenge as they lead to an ambiguous alignment of reads against the genome, though the use of mate-pair information can partly mitigate this issue and help identify the correct placement of reads. At the same time, the effectiveness of the comparative assembly approach depends on the availability of a closely related reference sequence. Differences between the genome being assembled and the reference can lead either to errors in reconstruction or to a fragmented assembly. The AMOScmp [14] comparative assembler attempts to identify such polymorphisms and rearrangements between genomes and breaks the assembly at these locations in order to avoid mis-assemblies. A number of tools were developed to help augment or improve de novo assemblies with the help of reference genomes. OSLay [15], Projector 2 [16], ABACAS [17] and r2cat [18] simply use a reference sequence to identify the correct order and orientation of contigs from a de novo assembly. An extension of this approach was proposed by Husemann et al. [19] that leverages information from multiple related genomes, weighted by their evolutionary distance from the sequence being assembled. Scaffold_builder [20] also provides functionality to join together contigs that were left unassembled by the de novo approach, thereby helping improve the assembly through the use of a reference sequence. Finally, E-RGA [21] first performs de novo and reference guided assembly independently and then merges the two assemblies using a novel data structure called a merge graph to avoid mis-assemblies and ambiguous overlaps.
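The second step of comparative assembly, deriving a consensus from read alignments, can be sketched as a per-position majority vote. The sketch assumes that an aligner has already placed each read on the reference without gaps, so an alignment is just a (start position, read) pair; the function name and the use of "N" for uncovered positions are illustrative assumptions, and this is not how AMOScmp or any other cited tool is implemented.

```python
from collections import Counter


def consensus_from_alignments(reference_length, alignments):
    """Majority-vote consensus from ungapped (start, read) placements."""
    columns = [Counter() for _ in range(reference_length)]
    for start, read in alignments:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < reference_length:
                columns[position][base] += 1
    # Uncovered positions are reported as "N" instead of copying the reference.
    return "".join(
        column.most_common(1)[0][0] if column else "N" for column in columns
    )
```

For a reference of length 10 and the placements (0, ACGTA), (2, GTACC) and (3, TTCCG), the consensus is ACGTACCGNN: the single T at position 4 is outvoted by two A bases, and the last two positions have no coverage. A run of uncovered or conflicting columns is the kind of signal at which a comparative assembler would break the assembly.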

Tradeoffs Between Different Assembly Methods

None of the methods described above is universally applicable; rather, each method has specific strengths and weaknesses depending on the characteristics of the data being assembled. The greedy method is easy to implement and is effective when the data contain no repeats or only short repeats. The Overlap-Layout-Consensus approach is effective even at high error rates; however, its efficiency rapidly degrades with depth of coverage because it starts by computing overlaps between all pairs of reads. The de Bruijn graph approach is computationally efficient even at high depths of coverage; however, it is affected by errors in the data and is, thus, most appropriate for relatively clean datasets. Comparative assembly approaches are most effective when a sufficiently closely related sequence is available (see Table 2).

Table 2

Comparison of different assembly methods. The columns in the table denote various assembly methods. The rows denote the parameters which are compared across these assembly methods. Prototypical assemblers are highlighted in each category. Assemblers marked with a * are not specifically designed for metagenomic applications.

	Greedy	OLC	De-Bruijn
Effect of repeats	✓	✓	✓
Effect of high depth of coverage	✓	✓	✗
Effect of sequencing errors	✗	✗	✓
Ease of implementation	✓	✗	✗
Assemblers	VCAKE, phrap, TIGR*	Celera Assembler, Omega, SGA	MetaVelvet, Meta-IDBA, Megahit, Meta-Ray, Meta-Spades

Metagenomics

Metagenomics is a fairly new research field focused on the analysis of sequencing data derived from mixtures of organisms. The assembly problem outlined above only becomes more complex, as the goal is no longer to assemble a single genome but to reconstruct the entire mixture (see Figure 2). Below we further detail these challenges and outline several of the approaches developed to address them.

Figure 2

Metagenomic assembly pipeline. Multiple bacterial genomes within a community are represented as circles of different colors, indicating multiple individuals from the same organism. Note the different levels of sequencing coverage for the individual organisms' genomes, due to the different abundance of the organisms in the original sample. After sequencing, redundant reads can be removed through digital normalization, reducing the computational needs for assembly. The filtered reads are then assembled into contigs, which are classified using k-mers and coverage statistics. Contigs in each group are then binned to form draft genome sequences for organisms within the population.

Metagenomic Data

Metagenomic data consist of a mixture of DNA from different organisms, which may include viral, bacterial, or eukaryotic organisms. The different organisms present in a mixture may have widely different levels of abundance, as well as different levels of relatedness with each other. These characteristics complicate the assembly process. As we described above, one of the main challenges to the assembly of single organisms is due to repetitive DNA segments within an organism's genome. For a single organism, assuming a uniform sequencing process, such repeats can be detected simply as anomalies in the depth of coverage (a two-copy repeat would contain twice as many reads as expected). Due to the uneven (and unknown) representation of the different organisms within a metagenomic mixture, simple coverage statistics can no longer be used to detect the repeats. The confounding effect of repeats on the assembly process is further exacerbated by the fact that unrelated genomes may contain nearly identical DNA (inter-genomic repeats) representing, for example, mobile genetic elements. At the other extreme, multiple individuals from the same species may harbor small genetic differences (strain variants). The decision of whether such differences can be ignored when reconstructing the corresponding genome, or whether it is appropriate to reconstruct individual-specific genomes, is not only computationally difficult but also ill-defined from a biological point of view. Furthermore, distinguishing true biological differences from sequencing errors becomes nearly impossible in a metagenomic setting. A final challenge arises from the uneven depth of sequencing coverage within a metagenomic mixture. Some organisms' genomes may be sequenced to high depths of coverage (often exceeding 1000-fold), a situation that leads to high computational costs.
In the Overlap-Layout-Consensus paradigm such high depths of coverage lead to a quadratic growth in the time necessary to compute overlaps (and in the number of overlaps that need to be processed), while in a de Bruijn graph setting, the higher depth of coverage amplifies the effect of errors on the assembly graph and may even stymie error correction algorithms (simply by chance multiple random errors can confirm each other). Due to these complications, algorithms developed for single genome assembly cannot be applied directly to metagenomics data. Below we outline some approaches that have been developed in the community to deal with such challenges.

Depth Normalization and Error Correction

As outlined above, the high depth of sequencing coverage within abundant organisms in a sample impacts both the computational cost of the assembly process and also its accuracy as errors in the reads are hard to identify and correct. Brown et al. [22] proposed a strategy named digital normalization that aims to eliminate redundant reads within regions of high depth of coverage. This approach relies on k-mer frequencies to identify and remove reads from regions with high depth of coverage, thereby reducing the redundancy of the data. Within the reduced dataset sequencing errors are more easily detected and corrected, thereby allowing the subsequent assembly process to be both more efficient (in terms of time and memory use) and more accurate (see Figure 2).
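A minimal sketch of the idea behind digital normalization follows. It processes reads in a single pass and keeps a read only while the median count of its k-mers, among the reads kept so far, is still below a cutoff. The sketch assumes an exact in-memory counter rather than the memory-efficient probabilistic counting a real implementation would need at metagenomic scale; the function name and the default values of `k` and `cutoff` are hypothetical.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if its median k-mer count so far is below `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k
        # The median is robust to a few erroneous (rare) k-mers in the read.
        if median(counts[kmer] for kmer in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept
```

Given five copies of ACGTACGG followed by one copy of TTGACCAT, with `k=4` and `cutoff=2`, the function keeps two copies of the abundant read and the single rare read. Coverage of the abundant region is capped near the cutoff while low-abundance sequence is left untouched.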

Reducing Memory Requirements During Assembly

Most metagenomic assemblers developed to date (MetaVelvet [23], Meta-IDBA [24], MEGAHIT [25] and Ray [26]) use the de Bruijn graph approach. The main assumption of this approach is that the reads contain few errors, or more precisely, that the errors can be easily corrected prior to assembly. As we mentioned above, even after filtering and error correction, many errors and polymorphisms remain in the data, causing an increase in the size of the resulting de Bruijn graph. A larger graph requires more main memory, because moving the graph to external memory would result in a loss of performance. Several approaches have been developed that allow storing and using de Bruijn graphs in a lower memory footprint than the naïve solutions. One strategy involves the use of Bloom filters to partition the graph prior to assembly, leading to a large decrease in memory size [27]. A Bloom filter is an inexact data structure that trades off accuracy for memory size. To reduce the risk of false positives (nodes or edges not present in the real graph but reported by a Bloom-filter encoded de Bruijn graph), Chikhi et al. [28] introduced an extension to the approach that also compactly represents the information that may be incorrectly reported, allowing a more precise representation of the original information without losing the space efficiency. Salikhov et al. [29] further optimized the graph representation, reducing storage by 30 to 40 percent through a series of cascading Bloom filters.
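To make the trade-off concrete, here is a minimal sketch of a Bloom filter that stores the k-mers of a de Bruijn graph and answers neighbor queries by testing the four possible extensions of a node. The class, the choice of SHA-256 as hash function, and the default sizes are assumptions made for illustration; they do not reproduce the data structures of the cited works [27-29].

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array probed by several hash functions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def successors(bloom, node):
    """Query the four possible one-base extensions of a (k-1)-mer node."""
    return [node[1:] + base for base in "ACGT" if node + base in bloom]
```

A k-mer that was added is always reported as present, so no true edge is lost. A k-mer that was never added may occasionally be reported as present as well; these false positives appear as spurious edges in the graph, which is the problem the extension by Chikhi et al. [28] addresses.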

Dealing with Genomic Variants

The approaches mentioned above address the memory requirements of assembly but not the confounding effect of genomic variants. Differences between closely related organisms can make it hard for assemblers to identify a consistent path through the assembly graph, leading to potentially fragmented assemblies. Many of the existing metagenomic assemblers try to address this issue by performing a more aggressive 'bubble popping' procedure – an approach used to correct errors in the assembly of single organisms through the de Bruijn approach. Specifically, wherever parallel paths are found within the graph that differ by only a small amount, these paths are collapsed into one, allowing the assembly to reconstruct longer contiguous segments from the metagenome. Such an approach is employed, for example, by MetaVelvet [23] and Meta-IDBA [24].
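The collapsing step can be illustrated on a single bubble. The sketch below assumes the parallel paths between a shared start and end node have already been extracted as (sequence, coverage) pairs, and that paths are compared by mismatch fraction only; the function names and the `max_divergence` threshold are hypothetical, and the actual criteria used by MetaVelvet or Meta-IDBA are not reproduced here.

```python
def divergence(a, b):
    """Fraction of mismatching positions; unequal lengths count as fully divergent."""
    if len(a) != len(b) or not a:
        return 1.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pop_bubble(branches, max_divergence=0.1):
    """Collapse near-identical parallel paths onto the best-covered one.

    `branches` is a list of (sequence, coverage) pairs for paths that share
    the same start and end node. Returns the surviving pairs.
    """
    ranked = sorted(branches, key=lambda branch: branch[1], reverse=True)
    survivors = []
    for sequence, coverage in ranked:
        for index, (kept_seq, kept_cov) in enumerate(survivors):
            if divergence(sequence, kept_seq) <= max_divergence:
                # Fold the minor variant into the dominant path.
                survivors[index] = (kept_seq, kept_cov + coverage)
                break
        else:
            survivors.append((sequence, coverage))
    return survivors
```

With branches ACGTACGTAC (coverage 30), ACGTTCGTAC (coverage 4) and TTTTTTTTTT (coverage 12), the second branch differs from the first at one position in ten and is absorbed, while the third remains a separate path. Raising `max_divergence` makes the popping more aggressive: contigs get longer, but more strain-level variation is erased.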

Detecting and Reporting Genomic Variants Within the Assembly

Differences between closely related genomes are of potential interest to biologists, and approaches such as those described above, which try to collapse such variants, may therefore hide valuable information from researchers. One of the first tools developed to find such variants after assembly is Strainer [30], a tool that analyzes the alignment pattern of reads against the reconstructed scaffold of assembled reads and provides researchers with a visualization of genetic variants found within the data. Bambus2 [31] includes a module that identifies patterns within the assembly graph that may indicate the presence of variants, an approach that has been extended in Marygold [32] through the use of SPQR trees [33] – a graph data structure that allows the efficient detection of complex 'bubble' structures within the assembly. In the more specific case of viral metagenomic samples, where a reference sequence is available, a number of approaches have been developed to reconstruct the quasi-species structure of the data (the population of variants found within a sample). These approaches include ShoRah [34], Vispa [35] and QuRe [36], and all rely on combinatorial optimization approaches to identify a small number of genomic sequences that best explain the read data. A similar approach was also proposed in Genovo [37] in the context of full metagenomic assembly, and in EMIRGE [38] to reconstruct just the 16S rRNA gene from metagenomic mixtures. These latter approaches have substantial computational costs, which limit their application to relatively small datasets.

Repeat Detection

As already mentioned, simple approaches for finding repeats based on depth of coverage anomalies are not effective within metagenomic data. An alternative approach involves the analysis of the graph structure itself, in order to find regions of the graph that appear to be 'tangled' by repeats. In Bambus2 [31] these regions are identified based on the concept of betweenness centrality [39] – a measure developed in the field of social network analysis to identify nodes in the graph that appear to have a central role (nodes traversed by many paths).
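Betweenness centrality itself is straightforward to compute on a small graph. The sketch below uses Brandes' algorithm for an unweighted, undirected graph given as an adjacency dictionary in which every neighbor also appears as a key. It illustrates the measure only; how Bambus2 applies it to contig graphs and chooses a threshold is not reproduced here.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    `graph` maps each node to an iterable of neighbours.
    """
    centrality = {node: 0.0 for node in graph}
    for source in graph:
        order = []
        predecessors = {node: [] for node in graph}
        num_paths = {node: 0 for node in graph}
        distance = {node: -1 for node in graph}
        num_paths[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search counting shortest paths
            node = queue.popleft()
            order.append(node)
            for neighbour in graph[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    num_paths[neighbour] += num_paths[node]
                    predecessors[neighbour].append(node)
        dependency = {node: 0.0 for node in graph}
        for node in reversed(order):  # accumulate from the farthest nodes back
            for pred in predecessors[node]:
                share = num_paths[pred] / num_paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                centrality[node] += dependency[node]
    # Each undirected shortest path was counted once from each endpoint.
    return {node: value / 2 for node, value in centrality.items()}
```

In a toy graph where a node R is connected to four otherwise unconnected contigs A1, A2, B1 and B2, every shortest path between two contigs passes through R, so R scores 6 (one for each of the six contig pairs) and all other nodes score 0. A collapsed repeat shared by several genomic contexts behaves like R.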

Identifying Specific Organisms within Metagenomic Samples

Even after applying the strategies outlined above, metagenomic assemblies are highly fragmented, consisting of small fragments of the genomes found in a sample. Linking together these fragments to obtain a partial reconstruction of individual genomes is challenging. A number of approaches have been developed for this purpose that leverage two complementary types of information – the DNA composition of the assembled contigs, and their depth of coverage. Sequences from the same organism have long been shown to have a similar DNA composition (in terms of frequencies of 2-mers or 4-mers) [40,41], and this information can be used to group together contigs that have similar profiles [42]. Contigs from the same organism can also be assumed to have similar sequencing depth within a sample, allowing them to be grouped together and even to separate out closely related sequences that may not be distinguishable by DNA composition alone [43]. The coverage approach can be further extended to leverage information from multiple samples containing the same organism. Contigs with correlated abundance profiles can be assumed to come from a single organism. Approaches used to identify such correlations include clustering of data based on simple correlation metrics (such as Pearson or Spearman correlation of normalized abundance profiles) [44], the formulation of the problem as an under-constrained linear system of equations [45], and the combination of DNA composition measures and coverage information within a Bayesian framework as performed in CONCOCT [46]. Nielsen et al. [44] have demonstrated the power of such approaches by reconstructing 238 high quality genome sequences (as defined by the quality standards established by the Human Microbiome Project [47]) from 396 human gut samples sequenced as part of the MetaHIT project [48].
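The two signals described above can be sketched as follows: a 4-mer composition profile per contig, and a grouping of contigs whose abundance profiles across samples are strongly Pearson-correlated. The single-pass grouping, the `threshold` value and all function names are simplifying assumptions for illustration; the cited methods use more careful clustering and statistical models.

```python
from itertools import product
from math import sqrt


def tetranucleotide_profile(sequence):
    """Relative frequency of each of the 256 possible 4-mers."""
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values()) or 1
    return [counts[kmer] / total for kmer in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def group_by_coverage(profiles, threshold=0.9):
    """Single-pass grouping of contigs by correlated abundance profiles.

    `profiles` maps a contig name to its depth of coverage in each sample.
    """
    bins = []
    for name, profile in profiles.items():
        for members in bins:
            if pearson(profile, profiles[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            bins.append([name])
    return bins
```

For coverage profiles c1 = [10, 20, 5, 40], c2 = [11, 19, 6, 42] and c3 = [50, 8, 30, 2] measured over four samples, c1 and c2 rise and fall together and end up in one bin, while c3 is anti-correlated with them and forms its own. The same `pearson` function could be applied to the output of `tetranucleotide_profile` to compare contigs by composition.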

Metagenomic Analysis Pipelines

Assembly is just a small part of the data analysis process, and the increased use of metagenomic methods in biological research has led to the development of integrated pipelines for metagenomic analysis. Such pipelines include MetAmos [49] and MOCAT [50], which are stand-alone packages, as well as CloVR [51] – a framework that enables metagenomic analyses on cloud computing frameworks.

Assembly Quality, Assembly Evaluation

It should be apparent by now that metagenomic assembly is a difficult computational problem. A largely overlooked analytical step is the validation of the resulting data. None of the algorithms described so far can be proven to correctly solve the assembly problem in a general setting, nor can one eliminate the possibility of errors introduced by programmers when implementing complex algorithmic techniques. Frequently, the quality of assemblies is evaluated through simple size statistics, such as the number and average sizes of the contigs generated. A measure developed in the context of the sequencing of the human genome, the N50 size (the weighted median contig size), is also often misused in a metagenomic context. The N50 size is the size of the largest contig c such that the sizes of all contigs at least as large as c add up to at least half of the correct genome size. In a metagenomic setting, the correct genome size is unknown, and therefore the N50 value is not a meaningful measure. A better assessment of quality can be made by aligning metagenomic contigs to related genome sequences, as done by MetaQuast [52], or by exploring the internal consistency of the assembly (in terms of uniformity of depth of coverage and consistency of the placement of mate-pairs) as done in AMOSvalidate [53]. Recently, a number of tools have been developed that view assembly as a generative probabilistic process, allowing one to assign a likelihood to a genome assembly [54,55,56], an approach that was also extended to a metagenomic context [57]. Such approaches cannot provide an absolute measure of assembly quality but can help rank multiple assemblies of the same dataset.
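The N50 definition above translates directly into code. In the sketch below the function name and the optional `genome_size` argument are hypothetical; when no genome size is supplied the function falls back to the total assembly length, a common convention that is an assumption here and not part of the definition given above.

```python
def n50(contig_lengths, genome_size=None):
    """Size of the contig at which the running total reaches half the target size."""
    # Assumption: without a known genome size, use the total assembly length.
    target = genome_size if genome_size is not None else sum(contig_lengths)
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if 2 * running_total >= target:
            return length
    return 0  # the contigs cover less than half of the target size
```

For contigs of length 80, 70, 50, 40, 30 and 20, the value is 70 relative to the assembly total (290), 50 relative to an assumed genome size of 400, and undefined (returned as 0) relative to an assumed size of 600. The same set of contigs yields three different answers depending on a quantity that is unknown for a metagenome, which illustrates why the statistic is hard to interpret in this setting.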

The Use of Metagenomic Assembly in Biological Applications

Below we highlight several examples of biological applications where metagenomic assembly approaches have been an important part of the biological results presented. These are just a few of the many studies that have been and are being conducted; a broader discussion of metagenomic analysis projects is beyond the scope of our paper.

Characterizing the Human-associated Microbiota

It has long been known that humans harbor complex microbial communities, but sequencing costs have prevented scientists from characterizing most of these microbes. The advent of inexpensive high throughput sequencing approaches has spurred a number of scientific efforts to better characterize the human-associated microbiota. The European project MetaHIT [48] focused on the characterization of the gut microbiota in healthy adults as well as in patients suffering from inflammatory bowel disease. Their initial publication surveyed 124 individuals through high throughput sequencing. The assembly of the resulting data reconstructed 3.3 million non-redundant gene sequences, most of which (99 percent) were derived from an estimated more than 1000 different bacterial species. Each individual was estimated to harbor an average of 160 microbial species. This initial study is only the beginning of understanding the true diversity of the human gut microbiota, as evidenced by the continued discovery of new gene sequences in subsequent studies such as those by Li et al. [48] and Gevers et al. [47]. The NIH-led Human Microbiome Project [58,59] has further expanded this knowledge by adding data collected from the microbiota associated with other human body sites. The gut microbiota is by far the best studied in humans, in no small part due to the ease of extracting samples from stool. The wealth of data collected from the gut microbiota has allowed scientists to address a number of interesting questions. Turnbaugh et al. [60] explored whether a core gut microbiota exists (a group of microbes present in all individuals) and found that while such a concept is hard to define at the organism level, the functions performed by the gut bacteria are highly conserved across people.
The MetaHIT data revealed a non-random clustering of individuals in terms of their gut microbiota, leading to the proposal of the concept of 'enterotype' – semi-stable states within which a person's microbiota can exist [61]. This concept is controversial and has been debated in the scientific literature. Koren et al. [62] studied the effect of factors such as clustering methodology, distance metrics, OTU-picking approaches, sequencing depth, sequence data type and 16S rRNA region on the detection of enterotypes and concluded that the concept of enterotype is not universal but rather strongly tied to the methodology used to identify clustering within the data. Huse et al. [63] recently argued that the enterotypes are primarily defined by the most dominant organisms in a sample (commonly Prevotella or Bacteroidetes within human gut communities), rather than reflecting an actual "community state". The study of the gut microbiota has also revealed the factors that influence its composition and diversity, such as diet [64,65,66], age [67,68], environment [69] and medication [70].

Premature Infant Gut Microbiome

A particularly fascinating research area is the study of the dynamic changes that occur in the human microbiota in the days and months after birth. This is not simply a matter of scientific curiosity, but also of important clinical relevance, as premature infants frequently develop necrotizing enterocolitis (NEC) – a severe intestinal disease that can lead to death. The process of microbial colonization of the human gut begins at birth and continues throughout the first year of life until the gut microbiome reaches maturity. Sterile-born babies acquire a population of microbes during the birthing process, either through the vaginal canal or from environmental introductions through cesarean delivery [71]. It is thought that in premature infants, aberrations during colonization may lead to illness or long-term health issues. Morowitz et al. [72] studied the gut microbiota within the first 3 weeks of life of a newborn baby, sequencing samples collected at four different times during this period. Their study revealed a shift in the microbial community from a community dominated by members of the Pseudomonas genus to a community dominated by organisms from the Serratia and Citrobacter genera. More importantly, however, a careful manual analysis of the assembled data revealed the presence within the developing gut microbiota of multiple Citrobacter strains. Their relative abundance was shown to change across time, demonstrating the power of methods that explicitly take into account strain structure in the reconstruction of metagenomic data. A recent study by Raveh-Sadka et al. [73] investigated a group of infants who developed NEC over a short period of time to find out which specific microbial strains were shared among co-hospitalized infants and whether the disease could be attributed to a single infectious agent. They also investigated strain-level metabolic potential and population heterogeneity.
Their study did not find any evidence for one common infective agent causing NEC and the dominant population of each bacterium acquired by each organism was genotypically distinct. This suggests the presence of barriers to the spread of bacteria among infants.

Global Ocean Microbiome

Microorganisms in the ocean environment play important roles in various bio-geological processes. Recent advances in metagenomics have enabled the study of ocean microbial communities, their structural patterns and their diversity [74]. The Sorcerer II Global Ocean Sampling expedition sequenced and analyzed 6.3 Gb of DNA from surface water samples along a transect from the Northwest Atlantic to the Eastern Tropical Pacific [75]. Gene prediction within the assembly of the resulting data allowed scientists to essentially double the number of proteins available in public databases, demonstrating the power of metagenomic approaches in surveying previously uncultured organisms. Recently, the Tara Oceans expedition collected about 35,000 samples across multiple sea depths at global scale, in order to facilitate a comprehensive study of the effect of environmental factors on ocean life [76]. While a large part of this study is focused on eukaryotic organisms, Sunagawa et al. [77] studied the bacterial microbiota of 248 samples. They generated 7.2 terabases of Illumina sequencing data, and used it to create a new annotated reference gene catalog for the ocean microbiome. Among the findings enabled by this catalog was the discovery that the vertical stratification of the composition of communities in the surface layer of the ocean is mostly driven by temperature rather than geography or other environmental factors. Surprisingly, they also found that more than 73 percent of the composition of the ocean microbiome is shared with the human gut microbiome, despite significant differences between these ecosystems. The studies of the ocean microbiome have also highlighted the broad geographical distribution of phylogenetically similar organisms, raising the question of whether specific genomic variants can be identified that correlate with or contribute to the geographical location of microbes.

Conclusion

The relatively recent development of inexpensive high throughput sequencing technologies has spurred efforts to characterize the microbial communities inhabiting the human body and the environment, leading to the development of a new field, metagenomics. The analysis of the resulting data has created the opportunity to develop new algorithms that account for the specific characteristics of metagenomic data. Here we have outlined the key challenges and opportunities created by this new field in the context of sequence assembly, the process used to reconstruct the genomes of organisms from DNA fragments. Despite advances in this field, further developments are still needed, particularly for the validation of the resulting assemblies in settings where a ground truth is not available. Also important is the development of new tools for uncovering and characterizing microbial communities at the strain level. Repetitive sequences remain a challenge even for single genomes, and their effect in metagenomic data is further amplified by the presence of cross-organismal repeats and uneven levels of representation of organisms within a sample. New sequencing technologies such as PacBio and Oxford Nanopore, which provide long but error-prone reads, can overcome some of the challenges posed by repeats; however, these approaches are still too expensive to be applied to metagenomic data. Algorithms for long read assembly are still at a preliminary stage even for single genomes, and further algorithm and software development needs to take place before these technologies can be used effectively in a metagenomic setting.

In closing, we would like to note that metagenomic approaches are not the only tools available to researchers studying microbial communities. Techniques such as metatranscriptomics [78], metaproteomics [48] and metabolomics [79] have been, and are being, developed to help provide a better understanding of the functions microbes perform in a community. Furthermore, targeted studies based on the 16S rRNA gene have already generated a wealth of data about microbial communities, although these data are primarily restricted to information about the taxonomic origin of organisms. Tremendous opportunities exist for the development of methods that combine all these different ways of interrogating microbial communities in order to provide a more complete understanding of the role these communities play in our world.
