Yu Peng, Henry C M Leung, S M Yiu, Francis Y L Chin.
Abstract
MOTIVATION: Next-generation sequencing techniques allow us to generate reads from a microbial environment in order to analyze the microbial community. However, assembling a set of mixed reads from different species into contigs is a bottleneck of metagenomic research. Although there are many assemblers for assembling reads from a single genome, there are no assemblers for assembling reads in metagenomic data without reference genome sequences. Moreover, the performance of these single-genome assemblers on metagenomic data is far from satisfactory, because of the existence of common regions in the genomes of subspecies and species, which make the assembly problem much more complicated.
Year: 2011 PMID: 21685107 PMCID: PMC3117360 DOI: 10.1093/bioinformatics/btr216
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A component in the de Bruijn graph of five E. coli subspecies.
Assembly results of E. coli 563 and five E. coli subspecies
| Assembler | E. coli 563 | Five E. coli subspecies |
|---|---|---|
| Perfect de Bruijn graph Assembler | ||
| k | 50 | 50 |
| No. of k-mers | 4 859 649 | 11 533 119 |
| No. of branches | 810 | 130 045 |
| Velvet | ||
| No. of contigs | 226 | 13 516 |
| N50 | 178 914 | 875 |
| Coverage (%) | 99.33 | 90.24 |
| Abyss | ||
| No. of contigs | 337 | 27 428 |
| N50 | 32 440 | 849 |
| Coverage (%) | 99.77 | 94.15 |
| SOAPdenovo | ||
| No. of contigs | 247 | 21 589 |
| N50 | 125 404 | 713 |
| Coverage (%) | 99.81 | 94.20 |
| Meta-IDBA | ||
| No. of contigs | 256 | 9292 |
| N50 | 122 317 | 5781 |
| Coverage (%) | 99.64 | 88.37 |
Simulated reads of length 75 were sampled randomly from the reference genomes with a 1% error rate, an insert distance of 250 and a depth of 30. The value of k was set to 50 for all assemblers.
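The read sampling described in this note can be sketched roughly as follows. This is an illustration only, not the authors' simulator: the function name `simulate_reads`, the substitution-only error model and the single-end sampling (the 250 insert distance of paired reads is ignored) are all assumptions made here.

```python
import random


def simulate_reads(reference, read_len=75, depth=30, error_rate=0.01, seed=0):
    """Sample reads uniformly at random from `reference`.

    Assumptions (not from the source): errors are substitutions only,
    reads are single-end, and positions are drawn uniformly.
    """
    rng = random.Random(seed)
    # Number of reads needed so that total read bases ~= depth * genome length.
    n_reads = depth * len(reference) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        read = list(reference[start:start + read_len])
        for i, base in enumerate(read):
            # Replace the base with a different one with probability error_rate.
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads
```

With the defaults, a reference of 1000 bases yields 400 reads of length 75, i.e. a depth of 30.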
k-mer similarity (k = 50) at different taxonomic levels
| | Species | Genus | Family | Order | Class | Phylum |
|---|---|---|---|---|---|---|
| Similarity (%) | 63.2 | 7.3 | 2.3 | 0.06 | 0.02 | 0.01 |
For each level, 1000 pairs of subspecies whose lowest common ancestor is at that level were generated randomly for the k-mer similarity calculation.
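The exact similarity formula is not given here, so the sketch below is one plausible reading rather than the authors' definition: it assumes that similarity is the number of distinct k-mers shared by two genomes, divided by the size of the smaller k-mer set, as a percentage. The names `kmers` and `kmer_similarity` are hypothetical, and reverse complements are not considered.

```python
def kmers(seq, k):
    """Return the set of distinct k-mers in `seq` (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(seq_a, seq_b, k=50):
    """Percentage of shared distinct k-mers, relative to the smaller set.

    Assumption: this normalisation is chosen for illustration only.
    """
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    if not a or not b:
        return 0.0
    return 100.0 * len(a & b) / min(len(a), len(b))
```

Averaging `kmer_similarity` over many randomly chosen genome pairs at one taxonomic level would give a figure comparable in spirit to a column of the table above.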
Fig. 2. Workflow of the Meta-IDBA algorithm.
Composition and experimental results of the simulated datasets
| Features | Low-complexity | | Medium-complexity | | High-complexity | |
|---|---|---|---|---|---|---|
| Taxonomic level | ≤ Genus | | ≤ Family | | ≤ Class | |
| No. of species | 2 | | 5 | | 10 | |
| No. of cases | 10 | | 10 | | 5 | |
| Expression level | Uniform | Log normal | Uniform | Log normal | Uniform | Log normal |
| Meta-IDBA | ||||||
| Component accuracy (%) | 98.82 | 98.80 | 98.32 | 98.16 | 99.35 | 99.1 |
| N50 | 18 729 | 20 689 | 11 111 | 14 610 | 8246 | 9553 |
| Coverage (%) | 91.76 | 89.20 | 87.39 | 81.62 | 91.47 | 84.16 |
| No. of contigs | 2674 | 2180 | 13 627 | 7716 | 55 249 | 38 500 |
| No. of bases | 5 418 695 | 5 223 474 | 20 615 422 | 17 644 012 | 68 465 188 | 58 088 054 |
| No. of error contigs | 9 | 9 | 26 | 21 | 90 | 223 |
| No. of error bases | 57 645 | 44 427 | 115 050 | 82 159 | 336 429 | 371 334 |
| Time (min) | 12.5 | 12.1 | 54.1 | 55.2 | 120.1 | 115.7 |
| Velvet | ||||||
| N50 | 11 437 | 9771 | 3356 | 3433 | 1983 | 1997 |
| Coverage (%) | 87.29 | 83.70 | 86.83 | 84.88 | 92.39 | 75.77 |
| No. of contigs | 2309 | 2230 | 12 839 | 11 208 | 106 917 | 37 567 |
| No. of bases | 4 876 110 | 4 674 621 | 15 824 798 | 16 017 085 | 70 072 915 | 44 948 173 |
| No. of error contigs | 9 | 7 | 16 | 14 | 196 | 59 |
| No. of error bases | 54 364 | 34 796 | 62 470 | 70 148 | 401 434 | 230 525 |
| Time (min) | 16.8 | 15.2 | 44.7 | 45.0 | 96.4 | 95.8 |
| Abyss | ||||||
| N50 | 2395 | 3608 | 1188 | 1570 | 1484 | 2511 |
| Coverage (%) | 95.06 | 93.40 | 93.80 | 88.97 | 94.56 | 86.53 |
| No. of contigs | 10 724 | 8448 | 48 409 | 35 796 | 123 583 | 85 837 |
| No. of bases | 8 199 665 | 8 151 930 | 31 072 592 | 26 252 905 | 94 857 363 | 81 759 539 |
| No. of error contigs | 15 | 20 | 45 | 42 | 426 | 389 |
| No. of error bases | 24 035 | 22 733 | 43 112 | 39 118 | 173 856 | 163 246 |
| Time (min) | 38.3 | 37.5 | 147.7 | 145.7 | 319.0 | 323.1 |
| SOAPdenovo | ||||||
| N50 | 7457 | 8233 | 2502 | 2171 | 1806 | 1351 |
| Coverage (%) | 93.57 | 93.91 | 94.97 | 93.77 | 97.27 | 87.37 |
| No. of contigs | 7742 | 7566 | 42 158 | 41 446 | 124 756 | 116 723 |
| No. of bases | 6 253 699 | 6 210 923 | 25 421 067 | 24 160 482 | 80 261 153 | 63 438 459 |
| No. of error contigs | 6 | 7 | 20 | 26 | 94 | 159 |
| No. of error bases | 16 337 | 28 987 | 57 283 | 60 601 | 185 439 | 160 779 |
| Time (min) | 11.3 | 10.2 | 39.8 | 37.6 | 84.7 | 85.1 |
Experimental results on real data
| Assembler | No. of contigs | Total bases | N50 | Maximum contig length |
|---|---|---|---|---|
| Meta-IDBA | 121 924 | 74 493 748 | 2380 | 371 462 |
| Velvet | 199 310 | 80 297 709 | 738 | 207 709 |
| Abyss | 203 983 | 102 106 241 | 956 | 121 166 |
| SOAPdenovo | 271 500 | 110 655 983 | 591 | 367 374 |
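For readers unfamiliar with the N50 column used in these tables, the sketch below computes it under the common definition: the length of the contig at which the running total, taken from the longest contig downwards, first reaches half of the total assembled bases. Whether the assemblers compared here apply a minimum contig length or a genome-size normalisation before reporting N50 is not stated, so the function `n50` should be read as an assumption-laden illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that brings the running total to at least
        # half of all assembled bases defines N50.
        if running >= half:
            return length
    return 0
```

For example, contigs of lengths 2 to 10 total 54 bases; the three longest (10, 9, 8) sum to 27, so N50 is 8.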
Fig. 6. Multiple alignment of a component in five E. coli subspecies. The consensus is shown in the first row. Contigs are separated by spaces. Conserved nucleotides are represented by dots. Differences between the contigs and the consensus are highlighted.