Chengxi Ye, Zhanshan Sam Ma, Charles H Cannon, Mihai Pop, Douglas W Yu.
Abstract
BACKGROUND: The very large memory requirements for the construction of assembly graphs for de novo genome assembly limit current algorithms to super-computing environments.
Year: 2012 PMID: 22537038 PMCID: PMC3369186 DOI: 10.1186/1471-2105-13-S6-S1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. From overlap graph to a string graph. (a) An overlap graph, in which all the overlaps are recorded. (b) The string graph, in which the transitive overlap (a, c) is removed.
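The step from panel (a) to panel (b) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes overlaps are given as directed pairs of read names, ignores overlap lengths and read orientation, and the function name `remove_transitive_overlaps` is hypothetical.

```python
def remove_transitive_overlaps(overlaps):
    """Return the overlaps that are not implied by a two-step path.

    An edge (u, w) is treated as transitive when some other read v
    has both (u, v) and (v, w) in the overlap set.
    """
    # Successor lists of the original overlap graph.
    successors = {}
    for u, v in overlaps:
        successors.setdefault(u, set()).add(v)

    reduced = set()
    for u, w in overlaps:
        # Look for an intermediate read v with u -> v and v -> w.
        is_transitive = any(
            w in successors.get(v, ()) for v in successors[u] if v != w
        )
        if not is_transitive:
            reduced.add((u, w))
    return reduced


# The three-read example from the caption: a overlaps b, b overlaps c,
# and a also overlaps c directly.
overlap_graph = {("a", "b"), ("b", "c"), ("a", "c")}
string_graph = remove_transitive_overlaps(overlap_graph)
```

On the caption's example, only the edge (a, c) is dropped, because it is already implied by the path through b.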
Figure 2. A node with branches. (a) A node with branches in a de Bruijn graph. (b) The binary implementation of (a). (c) A node with branches in a sparse k-mer graph. (d) The binary implementation of (c). The k-mers that are nodes in the graph are enclosed in square blocks. Neighbouring nucleotides, which indicate the edges of the graph, are circled.
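One common way to store k-mers in binary form is to pack each nucleotide into two bits. The sketch below illustrates that idea only; the bit assignment (A=0, C=1, G=2, T=3) and the helper names `pack_kmer` and `unpack_kmer` are assumptions and are not taken from the figure.

```python
# Assumed 2-bit code per nucleotide; the figure does not specify one.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack_kmer(kmer):
    """Pack a k-mer string into an integer, two bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def unpack_kmer(value, k):
    """Recover the k-mer string from its packed integer form."""
    bases = []
    # Read the 2-bit fields from the most significant base to the least.
    for shift in range(2 * (k - 1), -1, -2):
        bases.append(DECODE[(value >> shift) & 3])
    return "".join(bases)


packed = pack_kmer("ACGT")        # binary 00 01 10 11
restored = unpack_kmer(packed, 4)
```

With this layout a 31-mer fits in a single 64-bit word, which is why packed k-mers are a natural key for the node table of a k-mer graph.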
Figure 3. Breadth-first search bubble removal. Removing unwanted structures in the sparse de Bruijn graph. (a) Before removal. (b) After removal.
Assembly performance comparison on the fruit fly genome
| Metric | | | | |
|---|---|---|---|---|
| Time (hr) | 5.5 | 3 | 3 | 1 |
| Memory peak (GB) | 46 | 31 | 14 | 2 |
| > 100 bp (# contigs) | 23,992 | 23,104 | 20,580 | 20,429 |
| Sum (kbp) | 113,580 | 113,574 | 112,395 | 113,650 |
| Mean size (bp) | 4,734 | 4,916 | 5,461 | 5,563 |
| N50 (bp) | 18,317 | 19,576 | 25,461 | 28,355 |
| N95 (bp) | 66 | 61 | 67 | 74 |
| Corr NG50 (bp) | 18,317 | 19,576 | 25,461 | 28,355 |
| Corr NG95 (bp) | 0 | 0 | 0 | 0 |
| Longest contig (bp) | 162,263 | 190,104 | 195,709 | 273,977 |
| Coverage (%) | 96.24 | 96.82 | 95.53 | 97.83 |
| Misjoins | 0 | 6 | 0 | 0 |
The performance on the fruit fly genome dataset, genome size: 120,291 kbp. Programs are run using default settings.
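For readers unfamiliar with the contiguity statistics in these tables, the sketch below shows the usual definitions of N50 and NG50: the contig length at which the running total of contigs, sorted from longest to shortest, first reaches half of the assembly size (N50) or half of the genome size (NG50). It is a generic illustration with made-up contig lengths; the function name `nx_length` is hypothetical, and the correction step behind the "Corr" rows is not reproduced here.

```python
def nx_length(contig_lengths, fraction=0.5, total=None):
    """Contig length at which the cumulative sum reaches `fraction` of `total`.

    With total=None the assembly size is used (N50, N95, ...).
    Passing the genome size as `total` gives NG50, NG95, ...
    Returns 0 when the contigs never reach the target.
    """
    lengths = sorted(contig_lengths, reverse=True)
    target = fraction * (sum(lengths) if total is None else total)
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


# Toy contig lengths, not data from the paper.
contigs = [80, 70, 50, 40, 30, 20]
n50 = nx_length(contigs)                               # target 145
ng50 = nx_length(contigs, total=400)                   # target 200
ng95 = nx_length(contigs, fraction=0.95, total=400)    # target 380, never reached
```

Under this definition, an NG95 of 0 simply means that the contigs together cover less than 95% of the reference genome size.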
Assembly performance comparison on the rice genome
| Metric | | | | |
|---|---|---|---|---|
| Time (hr) | 13 | 7 | 16 | 5 |
| Memory peak (GB) | 69 | 51 | 29 | 4 |
| > 100 bp (# contigs) | 458,456 | 397,252 | 444,545 | 386,604 |
| Sum (kbp) | 253,708 | 225,618 | 258,106 | 262,988 |
| Mean size (bp) | 553 | 568 | 581 | 680 |
| N50 (bp) | 538 | 310 | 655 | 734 |
| N95 (bp) | 38 | 0 | 40 | 31 |
| Corr NG50 (bp) | 538 | 310 | 655 | 733 |
| Corr NG95 (bp) | 38 | 0 | 0 | 0 |
| Longest contig (bp) | 23,220 | 23,939 | 26,869 | 26,890 |
| Coverage (%) | 69.2 | 62.3 | 71.3 | 71.5 |
| Misjoins | 1 | 34 | 10 | 9 |
The performance on the rice genome dataset, genome size: 370,733 kbp. Programs are run using default settings.
Assembly performance on the human genome
| Metric | | | | | | |
|---|---|---|---|---|---|---|
| Memory peak (GB) | 14 | 16 | 19 | 30 | 49 | 51 |
| > 100 bp (# | 3,195 | 1,984 | 714 | 2,727 | 1,554 | 1,359 |
| Sum (G bp) | 2.37 | 2.79 | 2.83 | 2.29 | 2.72 | 2.88 |
| Mean size (bp) | 743 | 1,406 | 3,961 | 839 | 1,751 | 2,121 |
| N50 (bp) | 2,130 | 6,479 | 79,906 | 2,121 | 6,319 | 49,572 |
| N90 (bp) | 244 | 631 | 10,441 | 304 | 872 | 1,021 |
| Longest contig (bp) | 50,800 | 124,293 | 801,692 | 47,164 | 124,292 | 537,017 |
Assembly performance on the E. coli genome (ERR022075)
| Metric | | | |
|---|---|---|---|
| Time (hr) | 2 | 1 | 0.7 |
| Memory peak (GB) | 3.5 | 9.1 | 0.7 |
| > 100 bp (# contigs) | 430 | 632 | 485 |
| Sum (bp) | 4,556,772 | 4,413,080 | 4,577,604 |
| Mean size (bp) | 10,597 | 6,983 | 9,438 |
| N50 (bp) | 57,655 | 19,067 | 57,830 |
| N95 (bp) | 5,629 | 128 | 5,906 |
| Corr NG50 (bp) | 57,655 | 19,067 | 57,828 |
| Corr NG95 (bp) | 5,629 | 125 | 5,676 |
| Longest contig (bp) | 166,107 | 120,922 | 173,976 |
| Coverage (%) | 99.90 | 96.53 | 99.94 |
| Misjoins | 1 | 1 | 2 |
Assembly performance on the human chromosome 14
| Metric | | | | |
|---|---|---|---|---|
| Time (hr) | 6 | 2.5 | 6. | 1.9 |
| Memory peak (GB) | 49 | 37 | 30 | 3 |
| > 100 bp (# contigs) | 85,181 | 129,046 | 84,719 | 55,024 |
| Sum (kbp) | 88,663 | 89,854 | 87,908 | 86,296 |
| Mean size (bp) | 1,041 | 696 | 1,038 | 1,568 |
| N50 (bp) | 3,568 | 1,499 | 3,117 | 3,890 |
| N95 (bp) | 179 | 184 | 197 | 202 |
| Corr NG50 (bp) | 3,475 | 1,487 | 3,065 | 3,760 |
| Corr NG95 (bp) | 175 | 178 | 192 | 198 |
| Coverage (%) | 98.54 | 98.86 | 98.42 | 97.56 |
| Longest contig (bp) | 61,018 | 16,043 | 49,584 | 60,797 |
| Misjoins | 24 | 62 | 47 | 61 |
This dataset was downloaded from http://gage.cbcb.umd.edu/data/index.html; genome size: 88,289,540 bp.
Assembly performance on the NA12878 human genome
| Metric | | | |
|---|---|---|---|
| Memory peak (GB) | 26 | 29 | 29 |
| > 100 bp (# | 2,740 | 2,800 | 2,744 |
| Sum (G bp) | 2.33 | 2.57 | 2.70 |
| Mean size (bp) | 743 | 919 | 3,961 |
| N50 (bp) | 2,054 | 2,647 | 2,915 |
| N95 (bp) | 318 | 335 | 380 |
| Longest contig (bp) | 36,460 | 38,864 | 50,441 |
| Corr NG50 (bp)* | 1,502 | 2,213 | 2,610 |
| Corr NG95 (bp)* | 0 | 0 | 114 |
| Misjoins* | 21 | 19 | 17 |
* The corrected statistics are calculated by mapping back to human chromosome 14.