Yongchao Liu, Bertil Schmidt, Douglas L Maskell.
Abstract
BACKGROUND: Next-generation sequencing technologies have driven an explosive increase in DNA sequencing throughput and have spurred the recent development of de novo short read assemblers. However, existing assemblers require long execution times and large amounts of compute resources to assemble large genomes from large quantities of short reads.
Year: 2011 PMID: 21867511 PMCID: PMC3167803 DOI: 10.1186/1471-2105-12-354
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Schematic diagram of the PASHA assembly pipeline: (i) k-mer generation and distribution over a number of MPI processes; (ii) distributed preliminary de Bruijn graph construction and simplification over a number of MPI processes; (iii) bubble merging and contig generation; and (iv) scaffolding.
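Stage (i) of the pipeline can be illustrated with a minimal single-process sketch. It is not PASHA's implementation: the use of canonical k-mers, the CRC32 hash, and the names `canonical`, `owner`, and `distribute` are all assumptions made for illustration, and the MPI processes are simulated as in-memory buckets.

```python
import zlib

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmers(read: str, k: int):
    """Yield every length-k substring of a read, left to right."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(kmer: str, num_procs: int) -> int:
    """Map a k-mer to a process rank (hypothetical hash; PASHA's own is not given here)."""
    return zlib.crc32(canonical(kmer).encode("ascii")) % num_procs


def distribute(reads, k: int, num_procs: int):
    """Simulate distribution: one set of canonical k-mers per process rank."""
    buckets = [set() for _ in range(num_procs)]
    for read in reads:
        for kmer in kmers(read, k):
            buckets[owner(kmer, num_procs)].add(canonical(kmer))
    return buckets


buckets = distribute(["ACGTAC", "GTACGT"], k=3, num_procs=4)
for rank, bucket in enumerate(buckets):
    print(rank, sorted(bucket))
```

Because the owner depends only on the canonical form, a k-mer and its reverse complement always land on the same rank, so each rank can build its part of the graph without duplicating vertices.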
Figure 2. Linkage construction for two adjacent k-mers in a read.
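The idea of linking adjacent k-mers can be sketched generically. Two k-mers that start at neighbouring positions in a read overlap by k-1 bases, so the link between them is fully described by a single base on each side. The sketch below is a plain de Bruijn graph illustration, not necessarily the encoding shown in the figure; it ignores reverse complements, and `build_links` is a hypothetical name.

```python
from collections import defaultdict


def build_links(reads, k: int):
    """Record, for each k-mer, the bases that extend it to the right and to the left."""
    out_links = defaultdict(set)  # k-mer -> bases that follow it in some read
    in_links = defaultdict(set)   # k-mer -> bases that precede it in some read
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]
            right = read[i + 1:i + k + 1]
            # left and right share k-1 bases, so one base identifies each direction
            out_links[left].add(right[-1])
            in_links[right].add(left[0])
    return out_links, in_links


out_links, in_links = build_links(["ACGTA"], k=3)
print(dict(out_links))
print(dict(in_links))
```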
Short read datasets for assembler assessment
| | Bacillus | Bordetella | E.coli | Yoruban male |
|---|---|---|---|---|
| library insert size | 160 bp | 198 bp* | 200 bp | 200 bp |
| read length (bp) | 36 | 36 | 36 | 36-42 |
| no. of reads | 16,633,474 | 12,549,138 | 20,816,448 | 3,758,659,514 |
| coverage | 142× | 111× | 162× | 44× |
| genome size | 4,215,606 | 4,086,189 | 4,639,675 | 3,101,788,170** |
* Insert size estimated from the assembly because the real library insert size was unavailable; ** total length of all scaffolds in the GRCh37/hg19 build of the human reference sequence.
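The coverage row of the table above is consistent with the usual estimate, number of reads times read length divided by genome size. A short check for the three bacterial datasets follows; the Yoruban dataset is left out because its reads have mixed lengths (36 to 42), so a single read length would be an assumption. The function name `coverage` is illustrative.

```python
def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Estimated sequencing depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


# (number of reads, read length, genome size) copied from the table above
datasets = {
    "Bacillus": (16_633_474, 36, 4_215_606),
    "Bordetella": (12_549_138, 36, 4_086_189),
    "E.coli": (20_816_448, 36, 4_639_675),
}

for name, (num_reads, read_length, genome_size) in datasets.items():
    print(f"{name}: {coverage(num_reads, read_length, genome_size):.1f}x")
```

Rounded to the nearest integer, the results match the reported 142×, 111×, and 162×.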
Assembly results for Bacillus
| | PASHA | Velvet | ABySS | SOAPdenovo |
|---|---|---|---|---|
| no. of scaffolds | 20 | 80 | 66 | 98 |
| NG50 | 1,435,675 | 670,481 | 424,309 | 487,364 |
| NG80 | 182,534 | 117,643 | 124,700 | 96,291 |
| max | 2,044,786 | 919,263 | 890,628 | 918,694 |
| mean | 208,124 | 52,046 | 67,457 | 42,399 |
| genome coverage | 92.27% | 98.69% | 97.92% | 97.60% |
| incorrect contigs (mean) | 5(61,643) | 1(44,055) | 1(70,485) | 2(22,680) |
| time (in seconds) | 332 | 433 | 747 | 467 |
PASHA uses the parameters "k = 29", Velvet uses "k = 29, -exp_cov = auto, -cov_cutoff = auto", ABySS uses "k = 29, n = 10" and SOAPdenovo uses "k = 23, insert_length = 160".
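The NG50 and NG80 rows in these tables are not defined in this record. Under the common definition, NGx is the length of the shortest scaffold in the smallest set of longest scaffolds whose combined length covers at least x% of the reference genome size. A minimal sketch under that assumption, using made-up scaffold lengths rather than data from the paper:

```python
def ng(lengths, genome_size: int, fraction: float = 0.5) -> int:
    """Return NGx for the given fraction (0.5 -> NG50), or 0 if the target is never reached."""
    target = genome_size * fraction
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Toy scaffold lengths (hypothetical, for illustration only)
toy_scaffolds = [50, 40, 30, 20, 10]
print(ng(toy_scaffolds, genome_size=200))                # NG50
print(ng(toy_scaffolds, genome_size=200, fraction=0.8))  # NG80
```

Unlike N50, which is normalised by the total assembly length, this measure is normalised by the genome size, so assemblers that produce different total lengths remain directly comparable.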
Assembly results for Bordetella
| | PASHA | Velvet | ABySS | SOAPdenovo |
|---|---|---|---|---|
| no. of scaffolds | 228 | 294 | 287 | 298 |
| NG50 | 24,517 | 18,063 | 18,150 | 17,870 |
| NG80 | 10,006 | 8,237 | 9,215 | 8,157 |
| max | 121,801 | 75,085 | 75,809 | 74,881 |
| mean | 16,508 | 12,797 | 13,520 | 12,583 |
| genome coverage | 70.44% | 68.45% | 53.67% | 72.45% |
| incorrect contigs (mean) | 166(5,521) | 150(6,834) | 138(12,172) | 81(10,261) |
| time (in seconds) | 207 | 292 | 484 | 293 |
PASHA uses the parameters "k = 31", Velvet uses "k = 31, -exp_cov = auto, -cov_cutoff = auto", ABySS uses "k = 31, n = 10" and SOAPdenovo uses "k = 25, insert_length = 198".
Assembly results for E.coli
| | PASHA | Velvet | ABySS | SOAPdenovo |
|---|---|---|---|---|
| no. of scaffolds | 64 | 179 | 124 | 166 |
| NG50 | 164,390 | 95,486 | 96,308 | 105,781 |
| NG80 | 63,677 | 43,814 | 43,972 | 41,901 |
| max | 297,975 | 268,283 | 268,372 | 221,692 |
| mean | 71,305 | 25,465 | 37,381 | 27,406 |
| genome coverage | 97.44% | 98.67% | 95.58% | 97.97% |
| incorrect contigs (mean) | 8(6,145) | 5(9,909) | 5(39,765) | 8(7,202) |
| time (in seconds) | 325 | 490 | 595 | 533 |
PASHA uses the parameters "k = 31", Velvet uses "k = 31, -exp_cov = auto, -cov_cutoff = auto", ABySS uses "k = 33, n = 10" and SOAPdenovo uses "k = 23, insert_length = 215".
Assembly results for the Yoruban male genome without scaffolding
| | PASHA | ABySS |
|---|---|---|
| NG50 | 503 | 513 |
| max | 18,981 | 15,909 |
| mean | 581 | 543 |
| median | 283 | 261 |
| genome coverage | 66.47% | 68.90% |
| no. of contigs | 3,518,718 | 3,916,628 |
| incorrect contigs (mean) | 39,419(467) | 31,189(413) |
| sum (bps) | 2,045,433,773 | 2,125,482,148 |
Assembly results for the Yoruban male genome with scaffolding
| | PASHA | ABySS |
|---|---|---|
| NG50 | 2,294 | 1,326 |
| max | 54,491 | 29,862 |
| mean | 1,948 | 1,170 |
| median | 973 | 636 |
| genome coverage | 66.94% | 71.52% |
| no. of scaffolds | 1,133,810 | 1,893,930 |
| incorrect contigs (mean) | 70,160(367) | 27,367(726) |
| sum (bps) | 2,208,249,938 | 2,216,254,604 |
Runtime of PASHA and utilized compute resources for different stages
| Stages | Time (h) | No. of CPUs |
|---|---|---|
| k-mer generation and distribution | 0.7 | 32 |
| de Bruijn graph construction and simplification | 3.1 | 32 |
| bubble merging and contig generation | 11.6 | 8 |
| scaffolding | 5.9 | 8 |
| overall | 21.3 | N/A |
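The overall figure is the sum of the per-stage times, and weighting each stage by its CPU count gives a rough CPU-hour total. The sketch below takes the times and CPU counts from the table; the name of the first stage is assumed from the pipeline description in Figure 1, and the CPU-hour figure assumes every CPU was busy for the whole stage, which the table does not state.

```python
# (stage, wall-clock hours, number of CPUs) from the runtime table
stages = [
    ("k-mer generation and distribution", 0.7, 32),  # stage name assumed from Figure 1
    ("de Bruijn graph construction and simplification", 3.1, 32),
    ("bubble merging and contig generation", 11.6, 8),
    ("scaffolding", 5.9, 8),
]

total_hours = sum(hours for _, hours, _ in stages)
cpu_hours = sum(hours * cpus for _, hours, cpus in stages)

print(f"overall wall-clock time: {total_hours:.1f} h")
print(f"approximate CPU-hours:   {cpu_hours:.1f}")
```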
Figure 3. Execution time of PASHA on different numbers of CPU cores.