Literature DB >> 21685107

Meta-IDBA: a de Novo assembler for metagenomic data.

Yu Peng1, Henry C M Leung, S M Yiu, Francis Y L Chin.   

Abstract

MOTIVATION: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling of a set of mixed reads from different species to form contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performances of these assemblers on metagenomic data are far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated.
RESULTS: We introduce the Meta-IDBA algorithm for assembling reads in metagenomic data, which contain multiple genomes from different species. There are two core steps in Meta-IDBA. It first tries to partition the de Bruijn graph into isolated components of different species based on an important observation. Then, for each component, it captures the slight variants of the genomes of subspecies from the same species by multiple alignments and represents the genome of one species, using a consensus sequence. Comparison of the performances of Meta-IDBA and existing assemblers, such as Velvet and Abyss for different metagenomic datasets shows that Meta-IDBA can reconstruct longer contigs with similar accuracy. AVAILABILITY: Meta-IDBA toolkit is available at our website http://www.cs.hku.hk/~alse/metaidba. CONTACT: chin@cs.hku.hk.

Entities:  

Mesh:

Year:  2011        PMID: 21685107      PMCID: PMC3117360          DOI: 10.1093/bioinformatics/btr216

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Metagenomic research studies the genetic information in an entire microbial community. It plays an important role in microbiology because over 99% of microbes can neither be isolated nor cultured (Wooley ). Recent advances in next-generation sequencing technology allow us to generate reads from genomes of multiple species in these samples in an effective manner. The set of reads obtained is very complicated which makes the assembling of genomes of species that exist in the sample extremely difficult. There are two main approaches to study reads from these samples. One is to group (called binning) the reads according to some biological markers or structural features (Huson ; Krause ; Yang ). Then, reads belonging to different species are studied. The other is to deduce the potential biological functions of the whole community by studying the reads directly using gene prediction or function annotation (Mavromatis ; Qin ; Wooley ). Since the reads from the next-generation sequencing technologies are still relatively short, it is more effective if longer contigs can be constructed from the reads before conducting the study even though we are not able to assemble the genome of every species. High quality assembly results of the contigs are desirable in both approaches. If the assembled contigs are short and erroneous, the accuracy of binning, gene prediction, function annotation, etc. will be impaired. Thus, assemblers that can generate longer and more accurate contigs will definitely facilitate the study of metagenomic data. If similar reference genomes exist in the database, one can assemble reads by first aligning them to the reference genomes (Gnerre ). However, as over 90% of microbes in metagenomic data are unknown (Wooley ), de novo assemblers are needed to assemble reads without any reference genomes. To our knowledge, currently there are no de novo metagenome-specific assemblers available. Assemblers such as EULER (Chaisson ; Chaisson and Pevzner, 2008; Pevzner ), Velvet (Zerbino and Birney, 2008; Zerbino ), Abyss (Simpson ), SOAPdenovo (Li ) are for single genome, but are used in metagenomic study (Wooley ). All these assemblers are based on the de Bruijn graph (Pevzner ), which is a common approach to perform de novo assembly. In the de Bruijn graph, a vertex represents a length-k substring called k-mer and an edge connecting vertices u and v represents u and v appearing consecutively in a read. All these assemblers generate the de Bruijn graph from reads, and then apply some error removal methods (Simpson ; Zerbino and Birney, 2008), e.g. removing tips and merging bubbles, to modify the graph based on its topological structure. Simple paths in the graph are outputted as contigs and paired-end information might be applied to further merge the contigs. These existing assemblers do not work well for metagenomic datasets, except for some very small datasets containing specific species (Pop, 2009). Two main properties of a reasonably complicated metagenomic dataset make these assemblers fail to produce long contigs: (i) polymorphisms among similar subspecies and common genomic regions shared by different species; and (ii) uneven abundance ratios of species in a sample. The polymorphism of similar subspecies, especially subspecies of the same species, consists of very similar sequences with few variations (single nucleotide variation, short insertion or deletion or genomic rearrangements, etc.) and each variation introduces a branch in the de Bruijn graph (we call these branches sp-branches). Another source of branches is due to the common or similar genomic regions, say housekeeping genes, shared by different species (we call these branches cr-branches). These two types of branches, which do not exist when assembling single genomes, would make the de Bruijn graph for metagenomic data very complicated. Since existing assemblers output simple paths in the graph as contigs, these extra branches caused by common regions in different species prevent the construction of long contigs. Some assemblers resolve branches by merging similar sequences as bubbles into one sequence. A bubble is defined as several similar paths with the same start vertex and the same end vertex (Zerbino and Birney, 2008) in the de Bruijn graph. Bubble merging helps to merge similar regions and reduce complexity of the de Bruijn graph. An important assumption used by assemblers to remove bubbles for single genome assembly is that the bubble is caused by a few single nucleotide polymorphisms (SNP) or errors in reads; thus, the simple paths inside a bubble are very similar, except for a few nucleotides. However, the ‘bubbles’ found in the graph for metagenomic dataset do not follow this assumption. Different bubbles mix together to make the start vertex and the end vertex very difficult to be identified. Some of these bubbles are formed by a mixture of sp- and cr-branches. Figure 1 shows an example of this phenomenon in which every simple path is contracted into a vertex for visualization. All branches at a vertex in this graph normally lead to some other vertex in the same component, but it is uncertain that these are bubbles for merging. If we look closer at these bubbles, even if the bubble is formed only by sp-branches (because of variations in subspecies), the multiple paths inside the bubble may differ a lot (maybe with larger insertion/deletion). Existing approaches for merging bubbles for single genome assembly do not work for this case; thus, they will fail to resolve these ‘bubbles’ and are unable to construct long contigs. Even if all bubbles can be identified, it is not easy to merge them together to form a consensus.
Fig. 1.

A component in de Bruijn graph of five E.coli subspecies.

A component in de Bruijn graph of five E.coli subspecies. To resolve branches in a de Bruijn graph, existing assemblers also try to use paired-end information to help find paths with paired-end reads support so as to eliminate those branches caused by erroneous reads and to construct longer contigs. In the case of multiple subspecies, each path will have a lot of support since they are not caused by erroneous reads, but variations in subspecies and the assemblers are not able to resolve these branches easily. Moreover, since the contigs are short, applying paired-end information becomes difficult, because usually paired-end information can only be applied to connect long contigs. To show the complexity of a de Bruijn graph for metagenomic dataset, Table 1 compares graphs and assembly results of simulated reads sampled from a single genome (Escherichia coli 536) and from five different E.coli subspecies genomes. Table 1 shows that the five E.coli subspecies contain about twice the number of k-mers as that of a single E.coli subspecies, but 150 times more branches as that of single subspecies. This makes the graph complicated and genome assembly difficult. In fact, the performance of all the assemblers is poor when there are a lot of subspecies. The N50 values of Velvet , Abyss and SOAPdenovo drop from 178 914, 32 440 and 125 404 bp for single genome data to 875, 849 and 713 bp, respectively, for metagenomic data with five subspecies. Uneven abundance ratios in metagenomic data introduce another problem in assembly, because existing assemblers cannot distinguish erroroneous reads sampled from genomes with high abundance ratios and reads from genomes with low abundance ratios. Thus, it is difficult to identify species with the low abundance ratios.
Table 1.

Assembly results of E.coli 563 and five E.coli subspecies

AssemblerE.coli 536Five E.coli subspecies
Perfect de Bruijn graph Assembler
k5050
    No. of k-mers4 859 64911 533 119
    No. of branches810130 045
Velvet
    No. of contigs22613 516
    N50178 914875
    Coverage (%)99.3390.24
Abyss
    No. of contigs33727 428
    N5032 440849
    Coverage (%)99.7794.15
SOAPdenovo
    No. of contigs24721 589
    N50125 404713
    Coverage (%)99.8194.20
Meta-IDBA
    No. of contigs2569292
    N50122 3175781
    Coverage (%)99.6488.37

Simulated length-75 reads are sampled randomly from references with 1% error and 250 insert distance with a depth of 30. The value of k is set to 50 for all assemblers.

Assembly results of E.coli 563 and five E.coli subspecies Simulated length-75 reads are sampled randomly from references with 1% error and 250 insert distance with a depth of 30. The value of k is set to 50 for all assemblers. Resolving the branches (both sp- and cr-branches) in the graph is one of the key issues for solving the problem of metagenomic assembly. In this article, we focus on this issue. We remark that the issue of uneven abundance ratios of subspecies is also very important and difficult to solve and demands dedicated research effort. Conceptually, we look at the problem from the following perspective. The cr-branches (caused by common regions in different species) should be removed to isolate one species from another. For sp-branches, since they are caused by variations of subspecies, instead of producing separate contigs for individual subspecies, it is preferable to represent possible variations (e.g. insertion) of similar sequences. The core of our proposed metagenomic assembler, Meta-IDBA, has two steps. The first step is to identify and remove cr-branches in the de Bruijn graph leaving a set of connected components, each of which hopefully corresponds to a set of subspecies of the same species. The second step is to transform each component into a multiple alignment with consensus so as to represent the contigs of different subspecies of the same species. To distinguish cr-branches from sp-branches, the main idea is based on the fact that the genome sequences of subspecies in the same species are more similar than those from different species. Thus, more common k-mers are shared by the genome sequences of subspecies of the same species. Table 2 summarizes the k-mer similarity between subspecies within different taxonomic levels.
Table 2.

k-mer Similarity (k=50) in different taxonomic level

SpeciesGenusFamilyOrderClassPhylum
Similarity (%)63.27.32.30.060.020.01

For each level, 1000 pairs of subspecies with lowest common ancestor of that level are generated randomly for k-mer similarity calculation.

k-mer Similarity (k=50) in different taxonomic level For each level, 1000 pairs of subspecies with lowest common ancestor of that level are generated randomly for k-mer similarity calculation. The similarity inside the same species is much higher than that of the genus. Sixty percent similarity means that on average two k-mers of a subspecies will share at least one with another subspecies. Moreover, although the average similarity at the genus level is 7.3%, the common k-mers concentrate at some common regions, shared by different species, and is <1% at other regions. Nevertheless, reads from species further apart in higher taxonomic levels will share fewer k-mers. Similar observations are also made in Fofanov ) that a k-mer, with k≥20, tends to occur uniquely in the genome of a single species. Therefore in the de Bruijn graph, sp-branches started from a common k-mer in similar subspecies usually have (converge to) another common k-mer within a short distance, while cr-branches started from a common k-mer in multiple species seldom have another common k-mer within a short distance. By removing those branches that cannot converge to another k-mer within a short distance, the genome of a species with several subspecies can be represented by connected components in the de Bruijn graph, where each component represents similar contigs in the genomes from similar subspecies. As the contigs for each subspecies in the same component are with slight differences, a consensus contig can be found by multiple alignment (to capture the variations of the genome of the subspecies) to represent the genome of the species. Thus, instead of outputting simple paths in the de Bruijn graph to represent contigs, connected components representing consensus contigs of species will be output. As shown in Table 1, our assembler called Meta-IDBA is more effective in assembling reads in metagenomic data, where the N50 of Meta-IDBA is 5781 bp, about seven times longer than those of Velvet, Abyss and SOAPdenovo. We agree that the improvement in contig lengths mainly stems from the multiple alignment with consensus representation of the components, which seems to be a more appropriate output for contigs belonging to different subspecies due to variations. Moreover, if the similar regions cannot be separated, it will be too expensive to do multiple alignment among resulting contigs. To summarize, our work has two major contributions: to isolate components that derive from similar subspecies of the same species in the complicated de Bruijn graph of a metagenomic dataset; and to report the contigs of the subspecies using multiple alignment to highlight possible variants.

2 METHODS

In this section, we will describe our algorithm, Meta-IDBA, for assembling reads from multiple genomes of subspecies in different species. There are two main steps in Meta-IDBA as shown in Figure 2. Initially (Step 1) sequencing reads are used to construct a de Bruijn graph using any de Bruijn graph-based assembler (Chaisson and Pevzner, 2008; Peng ; Simpson ; Zerbino and Birney, 2008). Each simple path in the de Bruijn graph might represent a contig of the genome of some species or subspecies. As there are some sequences appearing in multiple species, the de Bruijn graph of reads from different species are interconnected by cr-branches. In the second step (Step 2), based on the assumption that the genomes of subspecies from the same species share more similar regions than the genomes of subspecies from different species, Meta-IDBA divides the de Bruijn graph into many small connected components by removing cr-branches. As the genome sequences of subspecies even from the same species are not exactly the same, the consensus contig may not be represented by a simple path in the de Bruijn graph, but may be represented by a component with multiple paths (due to sp-branches) from a source vertex to a single sink vertex, so as to confine the variations of the similar regions in each genome. These small components are then merged into bigger components, which represent longer consensus contigs using paired-end reads (Step 3). In the last step of Meta-IDBA (Step 4), each component is transformed to a multiple alignment of similar contigs of different subspecies from the same species. The consensus contig, which represents the similar contigs of the subspecies in a species, is found. The steps of Meta-IDBA will be described in detail in the following sections.
Fig. 2.

Workflow of Meta-IDBA algorithm.

Workflow of Meta-IDBA algorithm.

2.1 Construct de Bruijn graph

Any genome assembler based on the de Bruijn graph approach (Chaisson ; Peng ; Pevzner ; Simpson ; Zerbino and Birney, 2008) can be used to assemble reads to form contigs as if the reads were from a single genome. The de Bruijn graph-based algorithm first divides each reads into length-k substrings called k-mers. Each k-mer is represented by a vertex in the de Bruijn graph. There is an edge from u to v if the length k −1 suffix of k-mer u is the same as the length k −1 prefix of k-mer v and k-mers u and v occur adjacently in at least one read. In the single genome assembly problem, each simple path in the de Bruijn graph represents a contig of the genome and each branch represents a repeat region in the genome or an error in the read. The IDBA genome assembler (Peng ) is used in our implementation as IDBA iteratively increases the value of k and removes k-mers with few supports from reads, more branches can be resolved and longer contigs of the genome can be obtained. However, metagenomic data contains read from genomes of multiple species, and the sp- and cr-branches in the de Bruijn graph represent common regions occurring in genomes of different species or subspecies. These kinds of branches cannot be removed by the single genome assembler and thus very short contigs will be produced especially for samples with many subspecies of the same species. Thus, additional steps are required for resolving these branches.

2.2 Divide graph into connected components

Based on the well-accepted idea (Pop, 2009) that genomes of subspecies in the same species are more similar than genomes of subspecies from different species, Meta-IDBA divides the de Bruijn graph into connected components such that each component represents a consensus contig of a species. With the assumption that genomes of subspecies from the same species are very similar, there are many similar regions along the genome, where common k-mers (e.g. k=50) do not appear too distinct and at least one common k-mer exists within a relative short region of length w (e.g. w=300 bp). On the other hand, the genomes of subspecies from different species seldom contain a common k-mer so frequently. Based on this assumption, we divide the de Bruijn graph into components by solving the following graph partitioning algorithm. Graph partition problem: given a directed graph G, the graph partition problem is to partition the graph into maximal connected components that satisfy the following property. For each vertex u in a component C, there is another vertex v in C such that: (i) starting from each out-going edge e of u, there is at least a path from u to v in C with length at most w; or (ii) for every in-coming edge e of u, there is at least a path from v to u in C with length at most w. The idea behind the formulation of the graph partition problem is as follows. If a k-mer u can reach another k-mer v in a de Bruijn graph through a path of length at most w starting from each of its out-going edges (or vice versa), k-mers u and v are likely to occur in the genomes of subspecies from the same species, which represent high similarity among the genomes of the subspecies. Otherwise, k-mers u and v may occur in genomes of different species and they should be separated. The graph partition problem can be solved by a greedy algorithm, which repeatedly checks all out-going (in-coming) edges of each vertex. If there are no paths of length at most w starting from (ending with) the out-going (in-coming) edges of a vertex u that ends with (start from) another vertex v, all the out-going (in-coming) edges of vertex u are removed from the de Bruijn graph. The removing process will be repeated until all components satisfy our requirements. The correctness of this greedy algorithm is proved in Theorem 1. The process will remove the cr-branches as well as some of the sp-branches, resulting in many small connected components with close common regions (k-mers) representing consensus contigs of a single species.

Theorem 1.

The greedy algorithm solves the graph partition problem.

Proof.

By induction of the number of removing edges. When the first edge e1=(u1, u1′) is removed by the greedy algorithm, edge e1 should not be in any component, because either there is no vertex v such that all out-going edges of u1 can access v in a path in G with length at most w or there is no vertex u such that u can access u1′ by paths in G with length at most w that end with each in-coming edges of u1′. Consider the k-th edge e=(u,uk′) removed by the greedy algorithm. Either there is no vertex v such that all out-going edges of u can access v in a path in G/{e1,…,e} with length at most w or there is no vertex u such that u can access u1′ by paths in G/{e1,…,e} with length at most w that end with each in-coming edges of uk′. Since the edges e1,…,e are not in any component, edge e should not be in any component.    ▪

2.3 Connect components by paired-end reads

After solving the graph partition problem, larger components may be broken down into smaller components, because of the erroneous reads and common regions among different species and each component will represent a consensus contig of subspecies from a single species. Meta-IDBA merges the components into larger components using paired-end reads. A paired-end read represents two reads appearing in the same genome with known order and distance (insert distance). One end of a paired-end read is considered to appear in a component if a k-mer of the read appears in the component. If the two ends of a paired-end read appear in different components, the paired-end read is considered as a support that the two components should be merged into a larger component. Meta-IDBA merges the components if there are at least α (e.g. α=10) supports from paired-end reads and the two components are connected by the shortest path, which matches with the insert distance of the pair-end reads only if the merging is unambiguous (i.e. the merging will not be performed if a component has enough supports to connect to more than one component). Moreover, only the components with consensus contig longer than insert distance are considered in this procedure, because otherwise a component caused by a repeat region shorter than insert distance will be connected to multiple components with enough supports.

2.4 Construct multiple alignment and consensus contigs

Since the genomes of the subspecies from the same species are similar with small variants, it is very difficult to determine long contigs (simple path) from each subspecies, because these contigs are interconnected in the de Bruijn graph. Instead of representing a contig of a species by a simple path, a consensus contig of the species (which has many subspecies) is represented by a component, which is usually a direct acyclic graph (with exception when there is repeat region in the contig) to capture the small variants among the genomes of different subspecies in the same species. Each simple path in the component represents a short contig of a subspecies and the relative positions of these short contigs can be obtained from the structure of the directed acyclic graph. In order to obtain a contig of a single species from the component, multiple alignment of the short contigs represented by simple paths is performed. Meta-IDBA starts with an alignment of some short simple subpaths at the source vertex of the component and progressively constructs the alignments of longer contigs with reference to the longest subpath, because it provides more information of the genomes of the subspecies than the short subpaths with deletions. As the relative position of each simple path can be determined, the alignment between each simple subpath can be found locally. Finally, each component is represented by a consensus contig obtained through multiple alignment.

3 RESULTS

3.1 Simulated data

To compare the performance of Meta-IDBA with the other assemblers, we constructed three datasets with different complexities. The NCBI RefSeq (Pruitt ) database was used for generating the simulated sequencing reads. Table 3 summarizes the composition and experiment result of each dataset.
Table 3.

The compositions and experiment results of simulated datasets

FeaturesLow-complexity
Medium-complexity
High-complexity
Taxonomic level≤ Genus≤ Family≤ Class
No. of species2510
No. of cases10105
Expression levelUniformLog normalUniformLog normalUniformLog normal
Meta-IDBA
    Component accuracy (%)98.8298.8098.3298.1699.3599.1
    N5018 72920 68911 11114 61082469553
    Coverage (%)91.7689.2087.3981.6291.4784.16
    No. of contigs2674218013 627771655 24938 500
    No. of bases5 418 6955 223 47420 615 42217 644 01268 465 18858 088 054
    No. of error contigs99262190223
    No. of error bases57 64544 427115 05082 159336 429371 334
    Time (min)12.512.154.155.2120.1115.7
Velvet
    N5011 43797713356343319831997
    Coverage (%)87.2983.7086.8384.8892.3975.77
    No. of contigs2309223012 83911 208106 91737 567
    No. of bases4 876 1104 674 62115 824 79816 017 08570 072 91544 948 173
    No. of error contigs97161419659
    No. of error bases54 36434 79662 47070 148401 434230 525
    Time (min)16.815.244.745.096.495.8
Abyss
    N50239536081188157014842511
    Coverage (%)95.0693.4093.8088.9794.5686.53
    No. of contigs10 724844848 40935 796123 58385 837
    No. of bases8 199 6658 151 93031 072 59226 252 90594 857 36381 759 539
    No. of error contigs15204542426389
    No. of error bases24 03522 73343 11239 118173 856163 246
    Time (min)38.337.5147.7145.7319.0323.1
SOAPdenovo
    N50745782332502217118061351
    Coverage (%)93.5793.9194.9793.7797.2787.37
    No. of contigs7742756642 15841 446124 756116 723
    No. of bases6 253 6996 210 92325 421 06724 160 48280 261 15363 438 459
    No. of error contigs67202694159
    No. of error bases16 33728 98757 28360 601185 439160 779
    Time (min)11.310.239.837.684.785.1
The compositions and experiment results of simulated datasets For low-complexity datasets, two species, each having at least two subspecies within a genus, were selected. Length-75 sequencing reads were sampled from the selected reference genome at 30× depth. In all datasets, the error rate and insert distance were set to 1% and 250, respectively. Medium-complexity and high-complexity datasets were generated similarly according to the properties shown in Table 3. Two distributions of abundance ratios were used to generate simulated sequencing reads. The first one was uniform distribution, for a situation without uneven abundance ratios. The other one was log normal distribution, since some research on species richness estimation in metagenomic data shows that log normal distribution fits the data well (Hong ; Youssef and Elshahed, 2008). For comparison, Velvet, Abyss, SOAPdenovo and Meta-IDBA were executed on the above three datasets. In all experiments, the k-value of the de Bruijn graph was set to 50. Default values were used for all assemblers, except option ‘-M 3 -F’ is activated to merge similar regions for SOAPdenovo. The quality of assembler output was measured in three aspects on resultant contigs: N50 for contiguity, coverage for completeness and number of erroneous bases for accuracy. If the assembler generates scaffolds with wildcard nucleotide symbol ‘N’ inside, the ‘N’ will be removed to split the scaffolds into contigs. Correctness was checked by alignment, and a contig is considered as correct if it can be aligned to a reference with 95% similarity by BLAT (Kent, 2002). The 95% similarity means that the sum of mismatch bases and the in-del length should not be >5% of a contig. In the calculation of N50 and coverage, only correct contigs are considered. For Meta-IDBA, one more measure, is considered. The component accuracy indicates how accurate graph separation method works. A component is considered as correct only if 95% of the bases, inside one component, are actually from the same species. The experimental results for datasets of different complexities are shown in Figures 3–5, and the average values of all measurements are summarized in Table 3. Experiment results of low-complexity datasets. Experiment results of medium-complexity datasets. Experiment results of high-complexity datasets. For the low-complexity dataset, Meta-IDBA had the longest N50 in most cases, i.e. about ≥1.5 times that of Velvet, Abyss and SOAPdenovo. In a few cases, Velvet had similar or better N50 than Meta-IDBA, because the graph separation algorithm in Meta-IDBA partitioned some parts of the graph into components which could be better handled by bubble merging for low-complexity datasets. When considering coverage, all the assemblers had similar performances. In general, Abyss and SOAPdenovo had slightly better coverage and produced more contigs. Velvet and Meta-IDBA had similar numbers of erroneous contigs, which are a little more than Abyss and SOAPdenovo generated. Component accuracy showed that Meta-IDBA can separate contigs from different species accurately. When considering log normal distribution, it is interesting to note that N50 of the assembly results increased in some cases. This is because some subspecies, having low coverage after introducing different abundance ratios, can be considered removed from the dataset, reducing the complexity of the dataset. Consequently, the resultant contigs from the other similar species did not form complicated components. Overall, the N50 of Meta-IDBA for the log normal distribution was always better than that of the uniform distribution, because we applied the de Bruijn graph based assembler IDBA (Peng ), which does not rely on the read coverage of the genome very much. For medium-complexity datasets, Meta-IDBA always gave the best N50 among all assemblers. This means Meta-IDBA is able to group similar regions from different subspecies of the same species together effectively when the complexity of graph increases, while the bubble merging method failed in this situation. The performances of all assemblers in coverage and accuracy were similar to those of low-complexity data. The performance of the high-complexity datasets further confirmed that Meta-IDBA can handle complicated graphs. The N50 of the other assemblers is much shorter, while that of Meta-IDBA only decreased slightly. The other measures remained similar as before. All the experiments were executed in an eight-core machine with 144 GB memory. The runtime of Meta-IDBA and Velvet shown in Table 3 is more or less the same, but shorter than that of Abyss.

3.2 Real data

A real metagenomic sequencing dataset (SRX024329) from NCBI (http://www.ncbi.nlm.nih.gov/) was used to evaluate the performance of all assemblers in practice. It is a human metagenome sample from the G_DNA_Tongue dorsum of a female participant in the dbGaP study ‘HMP Core Microbiome Sampling Protocol A (HMP-A)’. Meta-IDBA provided the longest contigs among the assemblers (Table 4). It is difficult to access the accuracy of these results, since there are no references for most of the species in the real dataset. However, based on the results of simulated experiments, we have high confidence that Meta-IDBA can produce correct contigs and components.
Table 4.

Experimental results of real data

AssemblerNo. of contigsTotal basesN50Maximum
Meta-IDBA121 92474 493 7482380371 462
Velvet199 31080 297 709738207 709
Abyss203 983102 106 241956121 166
SOAPdenovo271 500110 655 983591367 374
Experimental results of real data

3.3 Multiple alignment of component

In the last step of Meta-IDBA, multiple alignments are performed among contigs in each component. Because of the restrictive properties of the components, all contigs in one component represent similar subsequences in some subspecies of the same species. Figure 6 presents part of the multiple alignment of a component from five E.coli subspecies, which shows similar contigs with a small number of variants. We can confirm confidently that the multiple alignment, as shown in Figure 6, represents contigs from similar subspecies.
Fig. 6.

Multiple alignment of a component in five E.coli subspecies. Consensus is shown in the first row. Contigs are separated by spaces. The conserved nucleotides are represented by dots. The difference between contigs and consensus is highlighted.

Multiple alignment of a component in five E.coli subspecies. Consensus is shown in the first row. Contigs are separated by spaces. The conserved nucleotides are represented by dots. The difference between contigs and consensus is highlighted.

4 DISCUSSION AND CONCLUSION

We tackled the assembly difficulty caused by the polymorphism in similar species in the metagenomic environment. Similar regions between species make the de Bruijn graph more complicated. Based on the observation that the genomes of subspecies from the same species share much more common k-mers than the genomes of subspecies from different species, we define the component to present the similar regions among subspecies from the same species. The assembly problem can then be modeled as a graph partition problem. After that, we designed an algorithm to identify the components from de Bruijn graph. Finally, paired-end reads are used to further connect components together. Experiments on datasets of different complexities showed that Meta-IDBA can usually produce the longest contigs with similar accuracy and coverage when compared with other assemblers, especially for high-complexity datasets which contain more branches since there are more genomes in the dataset. Besides reconstructing longer contig, Meta-IDBA provides a multiple alignment of similar contigs from different subspecies in the same species, which represents the variants among genomes of these subspecies. The multiple alignment may be used to study the structural variations of genomes of different subspecies or determine conserved regions, which have biological functions for the subspecies. As the multiple alignments are constructed by a greedy algorithm, further study on finding the optimal multiple alignment may improve the accuracy and the usage of Meta-IDBA in analyzing metagenomic data. At present, Meta-IDBA cannot reconstruct the contigs of each single subspecies, because their genomes have a lot of common regions. Further study on using pair-end information and read coverage for reconstructing these contigs will be carried out. There are cases in which Meta-IDBA fails to separate reads from different species into components. One case is in low- complexity dataset, which consists of Streptococcus pyogenes and S.dysgalactiae. The k-mer similarity of these two species is 17.69%, which is relatively higher than the other pair of genomes (~1%). On the other hand, the component accuracy of this dataset is 93%, which means many reads in the components are shared by these two species and some components represent similar regions of these two species. Therefore, it might not be necessary to separate the reads in these components, since these components may represent regions with the significant biological functions necessary for both species. For metagenomic dataset with uneven abundance ratios, because the IDBA genome assembler does not depend much on coverage to create the de Bruijn graph, the change in abundance ratios will not affect its performance too much. If each species is sampled with high enough coverage (the required coverage depends on the error rate and read length), they can be assembled by Meta-IDBA. However, uneven abundance ratios will affect the sampling rates of reads of different species, but the difference in sampling rates can also provide information to separate the reads sampled from species with low abundance ratios and those from species with high abundance ratios (Wu and Ye, 2010). More research should be performed for studying how to make use of this information to improve the accuracy of Meta-IDBA. Since our graph partition algorithm cannot distinguish erroneous edges and correct edges inside the graph, it relies very much on the quality of the de Bruijn graph generated by the assembler. If there are many false positive edges, the graph may be partitioned into many small components. Similarly, if there are many species (in practice, there are many more species (>1000) than those used in our simulation), there would be more edges in the graph, which also leads to many small components. As shown in Table 1, there is a big gap between the assembly results of single species and metagenomic. Much can be done to improve the quality of metagenomic assembly. Funding: This research was supported in parts by HKU Genomics SRT funding, GRF HKU 711611 and GRF HKU 719709E. Conflict of Interest: none declared.
  20 in total

1.  An Eulerian path approach to DNA fragment assembly.

Authors:  P A Pevzner; H Tang; M S Waterman
Journal:  Proc Natl Acad Sci U S A       Date:  2001-08-14       Impact factor: 11.205

2.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

3.  How independent are the appearances of n-mers in different genomes?

Authors:  Yuriy Fofanov; Yi Luo; Charles Katili; Jim Wang; Yuri Belosludtsev; Thomas Powdrill; Chetan Belapurkar; Viacheslav Fofanov; Tong-Bin Li; Sergey Chumakov; B Montgomery Pettitt
Journal:  Bioinformatics       Date:  2004-04-15       Impact factor: 6.937

4.  Predicting microbial species richness.

Authors:  Sun-Hee Hong; John Bunge; Sun-Ok Jeon; Slava S Epstein
Journal:  Proc Natl Acad Sci U S A       Date:  2005-12-20       Impact factor: 11.205

5.  MEGAN analysis of metagenomic data.

Authors:  Daniel H Huson; Alexander F Auch; Ji Qi; Stephan C Schuster
Journal:  Genome Res       Date:  2007-01-25       Impact factor: 9.043

6.  Short read fragment assembly of bacterial genomes.

Authors:  Mark J Chaisson; Pavel A Pevzner
Journal:  Genome Res       Date:  2007-12-14       Impact factor: 9.043

7.  Use of simulated data sets to evaluate the fidelity of metagenomic processing methods.

Authors:  Konstantinos Mavromatis; Natalia Ivanova; Kerrie Barry; Harris Shapiro; Eugene Goltsman; Alice C McHardy; Isidore Rigoutsos; Asaf Salamov; Frank Korzeniewski; Miriam Land; Alla Lapidus; Igor Grigoriev; Paul Richardson; Philip Hugenholtz; Nikos C Kyrpides
Journal:  Nat Methods       Date:  2007-04-29       Impact factor: 28.547

8.  A human gut microbial gene catalogue established by metagenomic sequencing.

Authors:  Junjie Qin; Ruiqiang Li; Jeroen Raes; Manimozhiyan Arumugam; Kristoffer Solvsten Burgdorf; Chaysavanh Manichanh; Trine Nielsen; Nicolas Pons; Florence Levenez; Takuji Yamada; Daniel R Mende; Junhua Li; Junming Xu; Shaochuan Li; Dongfang Li; Jianjun Cao; Bo Wang; Huiqing Liang; Huisong Zheng; Yinlong Xie; Julien Tap; Patricia Lepage; Marcelo Bertalan; Jean-Michel Batto; Torben Hansen; Denis Le Paslier; Allan Linneberg; H Bjørn Nielsen; Eric Pelletier; Pierre Renault; Thomas Sicheritz-Ponten; Keith Turner; Hongmei Zhu; Chang Yu; Shengting Li; Min Jian; Yan Zhou; Yingrui Li; Xiuqing Zhang; Songgang Li; Nan Qin; Huanming Yang; Jian Wang; Søren Brunak; Joel Doré; Francisco Guarner; Karsten Kristiansen; Oluf Pedersen; Julian Parkhill; Jean Weissenbach; Peer Bork; S Dusko Ehrlich; Jun Wang
Journal:  Nature       Date:  2010-03-04       Impact factor: 49.962

Review 9.  A primer on metagenomics.

Authors:  John C Wooley; Adam Godzik; Iddo Friedberg
Journal:  PLoS Comput Biol       Date:  2010-02-26       Impact factor: 4.475

10.  Phylogenetic classification of short environmental DNA fragments.

Authors:  Lutz Krause; Naryttza N Diaz; Alexander Goesmann; Scott Kelley; Tim W Nattkemper; Forest Rohwer; Robert A Edwards; Jens Stoye
Journal:  Nucleic Acids Res       Date:  2008-02-19       Impact factor: 16.971

View more
  111 in total

1.  TruSPAdes: barcode assembly of TruSeq synthetic long reads.

Authors:  Anton Bankevich; Pavel A Pevzner
Journal:  Nat Methods       Date:  2016-02-01       Impact factor: 28.547

Review 2.  Analytical tools and databases for metagenomics in the next-generation sequencing era.

Authors:  Mincheol Kim; Ki-Hyun Lee; Seok-Whan Yoon; Bong-Soo Kim; Jongsik Chun; Hana Yi
Journal:  Genomics Inform       Date:  2013-09-30

3.  Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold.

Authors:  Jurgen F Nijkamp; Mihai Pop; Marcel J T Reinders; Dick de Ridder
Journal:  Bioinformatics       Date:  2013-09-20       Impact factor: 6.937

4.  Bambus 2: scaffolding metagenomes.

Authors:  Sergey Koren; Todd J Treangen; Mihai Pop
Journal:  Bioinformatics       Date:  2011-09-16       Impact factor: 6.937

5.  Viral assemblage composition in Yellowstone acidic hot springs assessed by network analysis.

Authors:  Benjamin Bolduc; Jennifer F Wirth; Aurélien Mazurie; Mark J Young
Journal:  ISME J       Date:  2015-06-30       Impact factor: 10.302

6.  Assessment of metagenomic assemblers based on hybrid reads of real and simulated metagenomic sequences.

Authors:  Ziye Wang; Ying Wang; Jed A Fuhrman; Fengzhu Sun; Shanfeng Zhu
Journal:  Brief Bioinform       Date:  2020-05-21       Impact factor: 11.622

Review 7.  Application of computational approaches to analyze metagenomic data.

Authors:  Ho-Jin Gwak; Seung Jae Lee; Mina Rho
Journal:  J Microbiol       Date:  2021-02-10       Impact factor: 3.422

8.  MetAMOS: a modular and open source metagenomic assembly and analysis pipeline.

Authors:  Todd J Treangen; Sergey Koren; Daniel D Sommer; Bo Liu; Irina Astrovskaya; Brian Ondov; Aaron E Darling; Adam M Phillippy; Mihai Pop
Journal:  Genome Biol       Date:  2013-01-15       Impact factor: 13.583

9.  Ray Meta: scalable de novo metagenome assembly and profiling.

Authors:  Sébastien Boisvert; Frédéric Raymond; Elénie Godzaridis; François Laviolette; Jacques Corbeil
Journal:  Genome Biol       Date:  2012-12-22       Impact factor: 13.583

10.  Longitudinal Analysis of Microbiota in Microalga Nannochloropsis salina Cultures.

Authors:  Haifeng Geng; Kenneth L Sale; Mary Bao Tran-Gyamfi; Todd W Lane; Eizadora T Yu
Journal:  Microb Ecol       Date:  2016-03-08       Impact factor: 4.552

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.