Dimitrios Kleftogiannis, Panos Kalnis, Vladimir B Bajic.
Abstract
A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and reduce the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and comment on some findings regarding suitable computational resources for assembly.
Year: 2013 PMID: 24086547 PMCID: PMC3785575 DOI: 10.1371/journal.pone.0075505
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. DiMA (Diginorm-MSP-Velvet) strategy.
This figure depicts the DiMA assembly strategy combined with the Velvet assembler. The process begins by cleaning the original data with a three-phase Digital Normalization algorithm. The cleaned data are distributed across different disk partitions based on the MSP algorithm. Then velveth runs on each partition, followed by velvetg. These programs constitute the Velvet assembler’s two distinct phases (hash-based overlap computation and graph construction), and their results are stored on disk. A merging phase creates the final assembly graph, and Velvet’s traversal algorithm produces the final results.
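The cleaning step can be illustrated with a minimal sketch of the read filter usually associated with digital normalization: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. This is not the three-phase procedure named in the caption. The function names, the exact `Counter` (practical tools generally rely on a memory-bounded probabilistic counter), and the parameter values are assumptions for illustration, and reverse complements are ignored.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers is below `cutoff`.

    Counts are exact here (an assumption made for clarity, not for
    memory efficiency).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate possible
        # Counter returns 0 for k-mers that have not been seen yet.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: five copies of one read plus one unrelated read.
toy_reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept_reads = digital_normalization(toy_reads, k=4, cutoff=2)
```

With these toy settings the redundant copies are dropped once their k-mers are already well covered, while the unrelated read is retained.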
NGS data used in the experiments.
| | | | | |
|---|---|---|---|---|
| | 2,860,307 | 1,294,104 | 101 | 0.15 |
| | 4,603,060 | 2,050,868 | 101 | 0.24 |
| | 143,819,757 | 36,504,800 | 101 | 4.7 |
| | 373,481,773 | 303,118,594 | 124 | 46.5 |
Memory-efficient techniques.
| | | | | |
|---|---|---|---|---|
| SparseAssembler | DBG | Exploits sparseness | No | |
| Gossamer | DBG | Succinct Data Structure (Bitmap) | Yes | |
| Minia | DBG | Probabilistic Data Structure (BF) | No | |
| SGA | OLC | FM-index | No | |
| Minimum Substring Partitioning | Pre-processing | On-disk processing based on heuristics | No | |
| Diginorm | Pre-processing | Elimination of redundant information and errors | Yes | |
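As a rough illustration of the Minimum Substring Partitioning entry in the table above, the sketch below groups consecutive k-mers of a read that share the same lexicographically smallest length-p substring into one "super k-mer" and assigns it to a partition by hashing that substring. The function names, the use of `zlib.crc32` as the hash, and the in-memory lists standing in for disk partitions are assumptions; reverse complements are not handled.

```python
import zlib


def minimum_substring(kmer, p):
    """Lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def super_kmers(read, k, p):
    """Yield (minimum substring, super k-mer) pairs for one read.

    A super k-mer merges a run of consecutive k-mers that share the
    same minimum p-substring.
    """
    if len(read) < k:
        return
    start = 0
    current = minimum_substring(read[:k], p)
    for i in range(1, len(read) - k + 1):
        m = minimum_substring(read[i:i + k], p)
        if m != current:
            # k-mers start..i-1 span read[start : i - 1 + k]
            yield current, read[start:i - 1 + k]
            start, current = i, m
    yield current, read[start:]


def partition_reads(reads, k, p, n_partitions):
    """Distribute super k-mers into partitions (lists stand in for disk files)."""
    partitions = [[] for _ in range(n_partitions)]
    for read in reads:
        for m, sk in super_kmers(read, k, p):
            index = zlib.crc32(m.encode()) % n_partitions
            partitions[index].append(sk)
    return partitions


example_read = "ACGTTGCATGCA"
example_pairs = list(super_kmers(example_read, k=5, p=2))
example_partitions = partition_reads([example_read], k=5, p=2, n_partitions=4)
```

Because the partition index depends only on the minimum substring, identical k-mers from different reads end up in the same partition, which is what allows each partition to be processed independently.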
Fragment assembly results for
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 679 | 914 | 1,039 | 1,009 | 627 | 1,341 | 2,373 | 467 |
| | 8,127 | 5,427 | 4,277 | 4,700 | 8,688 | 3,344 | 1,669 | 12,363 |
| | 3,185,299 | 2,817,839 | 2,783,007 | 2,860,307 | 2,863,078 | 2,877,916 | 2,905,031 | 2,844,437 |
| | 356,537 | 26,709 | 19,703 | 636,820 | 28,179 | 50,827 | 97,791 | 28,997 |
| | 9,639 (0.34%) | 48,720 (1.70%) | 93,467 (3.27%) | 30,262 (1.06%) | 8,700 (0.30%) | 13,765 (0.48%) | 16,230 (0.57%) | 21,059 (0.74%) |
| | 5,860 | 2,402 | 3,348 | 34,668 | 5,094 | 4,444 | 4,551 | 3,948 |
| | 28 | 2 | 9 | 50 | 30 | 10 | 42 | 22 |
| | 674 | 912 | 1,037 | 981 | 625 | 1,333 | 2,364 | 476 |
| | 8,053 | 5,427 | 4,277 | 4,672 | 8,547 | 3,337 | 1,665 | 11,850 |
| | 2:32 | 4:52 | 0:52 | 33:29 | 3:14 | 3:10 | 3:20 | 1:39 |
| | 0.31 | 3 | 0.11 | 1.27 | 0.96 | 0.96 | 0.96 | 1.7 |
Fragment assembly results for chromosome 14.
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 52,785 | 67,160 | 52,926 | 55,002 | 61,039 | 68,253 | 52,085 |
| | 264 | 123 | 161 | 273 | 233 | 252 | 325 |
| | 101,600,523 | 73,046,277 | 74,079,569 | 79,129,375 | 80,448,331 | 81,139,464 | 81,190,207 |
| | 28,034,067 | 3,861,802 | 3,318,028 | 5,365,076 | 7,803,232 | 7,554,603 | 6,844,058 |
| | 66,811,187 (46.45%) | 72,145,106 (50.16%) | 71,829,430 (49.94%) | 67,835,777 (47.17%) | 67,535,775 (46.96%) | 68,640,864 (47.73%) | 66,461,819 (46.21%) |
| | 1,811,908 | 797,409 | 1,256,381 | 1,742,555 | 1,644,762 | 1,688,868 | 1,506,630 |
| | 3,849 | 1,247 | 1,447 | 3,920 | 2,026 | 4,874 | 3,824 |
| | 55,175 | 69,103 | 55,230 | 56,146 | 61,351 | 68,849 | 53,589 |
| | 172 | 0 | 107 | 177 | 159 | 171 | 195 |
| | 1:1:37 | 3:6:50 | 1:33:13 | 1:18:16 | 1:21:8 | 1:15:09 | 2:27:46 |
| | 1.72 | 3 | 0.76 | 3.34 | 8.7 | 1.2 | 49.3 |
* The SGA program failed, similarly to what was reported in ref. [18].
Fragment assembly results for *.
| | | | | | |
|---|---|---|---|---|---|
| | 73,065 | 69,110 | 184,131 | 388,411 | 414,813 |
| | 2,318 | 2,312 | 708 | 260 | 161 |
| | 494,097,945 | 227,494,682 | 232,965,134 | 228,316,538 | 237,258,668 |
| | 267,660,451 | 2,539,466 | 11,619,692 | 39,540,500 | 67,160,134 |
| | 123,265,375 (33%) | 129,390,711 (34.64%) | 127,532,299 (34.15%) | 136,372,260 (36.51%) | 138,953,569 (37.20%) |
| | 1,084,237 | 443,936 | 867,642 | 746,248 | 892,388 |
| | 10,566 | 1,709 | 7,945 | 5,696 | 10,101 |
| | 73,842 | 69,523 | 181,952 | 385,691 | 409,935 |
| | 2,178 | 2,250 | 696 | 203 | 155 |
| | 13:30:22 | 48:42:50 | 7:40:31 | 7:32:41 | 7:13:36 |
| | 17.7 | 1.28 | 21.8 | 19.7 | 3.2 |
* Gossamer, SGA and the original Velvet failed to produce results.
Ranking of memory-efficient assemblers based on the quality of the assembly.
|
|
|
|
|
|
|
|
| |
|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
| 2:32 | 4:52 | 0:52 | 33:29 | 3:14 | 3:10 | 3:20 | |
|
| 0.31 | 3 | 0.11 | 1.27 | 0.96 | 0.96 | 0.96 | |
|
|
|
|
|
|
|
|
|
|
|
| 3:15 | 7:37 | 1:23 | 55:37 | 5:12 | 5:15 | 6:04 | |
|
| 0.36 | 3 | 0.17 | 2.01 | 0.96 | 0.96 | 0.96 | |
|
|
|
|
|
|
|
|
|
|
|
| 1:1:37 | 3:6:50 | 1:33:13 |
| 1:18:16 | 1:21:8 | 1:15:09 | |
|
| 1.72 | 3 | 0.76 | 3.34 | 8.7 | 1.2 | ||
|
|
|
|
|
|
|
|
|
|
|
| 13:30:22 |
| 48:42:50 |
| 7:40:31 | 7:32:41 | 7:13:36 | |
|
| 17.7 | 1.28 | 21.8 | 19.7 | 3.2 |
Maximum memory utilization for each assembler in GB.
| | | | | |
|---|---|---|---|---|
| | 0.31 | 0.36 | 1.72 | 17.7 |
| | 3 | 3 | 3 | Failed |
| | 0.11 | 0.17 | 0.76 | 1.2 |
| | 1.27 | 2.01 | Failed | Failed |
| | 0.96 | 0.96 | 3.34 | 21.8 |
| | 0.96 | 0.96 | 8.7 | 19.7 |
| | 0.96 | 0.96 | 1.2 | 3.2 |
| | 1.7 | 2.4 | 49.3 | Failed |
* Typically, a program that requires less than 4 GB RAM can run on a laptop; 4-8 GB RAM on a desktop; 8-32 GB RAM on a workstation; and more than 32 GB RAM on a server.
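The rule of thumb in the footnote above can be written as a small lookup. This is only a sketch: the function name is hypothetical, and because the footnote's ranges share their endpoints, assigning exactly 8 GB to "desktop" and exactly 32 GB to "workstation" is an assumption.

```python
def machine_class(peak_memory_gb):
    """Map a peak memory requirement in GB to the machine class from the footnote.

    Boundary handling (8 GB -> desktop, 32 GB -> workstation) is an
    assumption; the footnote does not say which side the endpoints fall on.
    """
    if peak_memory_gb < 4:
        return "laptop"
    if peak_memory_gb <= 8:
        return "desktop"
    if peak_memory_gb <= 32:
        return "workstation"
    return "server"


# Example values taken from the memory table above.
classes = {gb: machine_class(gb) for gb in (0.31, 3, 8.7, 21.8, 49.3)}
```

Applied to the table, most of the reported runs fall into the laptop class, and only the largest peaks call for a workstation or server.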
Executing assembly on the cloud (Amazon EC2) *.
|
|
|
|
|
|
|
|
| |
|---|---|---|---|---|---|---|---|---|
|
|
| M1/0 | M2/0.015 | M1/0 | M2/0.17 | M2/0.008 | M2/0.006 | M1/0 |
|
| 24:20 | 7:45 | 5:46 | 82:23 | 3:50 | 3:8 | 20:30 | |
|
| 2:32 | 4:52 | 0:52 | 33:29 | 3:14 | 3:10 | 3:20 | |
|
|
| M1/0 | M2/0.027 | M1/0 | M2/0.22 | M2/0.011 | M2/0.011 | M1/0 |
|
| 29:29 | 13:41 | 10:23 | 109:16 | 5:28 | 5:33 | 28:27 | |
|
| 3:15 | 7:37 | 1:24 | 55:37 | 5:12 | 5:15 | 6:4 | |
|
|
| M2/0.21 | M2/0.59 | M2/0.41 | M2/0.16 | M3/1.1 | M2/0.17 | |
|
| 1:45:32 | 4:57:12 | 3:22:27 |
| 1:18:21 | 1:23:23 | 1:23:22 | |
|
| 1:1:37 | 3:6:50 | 1:33:13 | 1:18:16 | 1:21:8 | 1:15:09 | ||
|
|
| M3/10.9 | M2/7.24 | M3/6.14 | M3/6.04 | M2/0.93 | ||
|
| 13:38:23 |
| 60:20:16 |
| 7:40:16 | 7:33:15 | 7:47:42 | |
|
| 13:30:22 | 48:42:50 | 7:40:31 | 7:32:41 | 7:13:36 |
* We use the cheapest eligible instance for each combination of assembler and dataset. We report the financial cost and the time required per assembly. For comparison, we also report the execution time on local memory-equivalent machines. These comparisons should be considered with caution.
Cost-equivalent number of assemblies per week between local and cloud execution.*
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| | 127,145 | 430 | 563,539 | 38 | 915 | 843 | 151,572 |
| | 105,050 | 239 | 300,774 | 29 | 607 | 601 | 108,840 |
| | 30 | 10 | 15 | - | 40 | 13 | 38 |
| | 1 | - | 1 | - | 2 | 2 | 6 |
* Below this threshold, it is cheaper to utilize a cloud system instead of running the assembly on a local machine. Computations are based on formula (I) and use prices from Amazon EC2 (June 2013).
Fragment assembly results for
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 2,218 | 6,927 | 2,887 | 2,652 | 2,143 | 3,582 | 5,580 | 2,164 |
| | 3,211 | 607 | 2,246 | 2,275 | 3,369 | 1,822 | 989 | 3,245 |
| | 4,985,042 | 4,372,958 | 4,502,157 | 4,386,839 | 4,580,783 | 4,581,634 | 4,676,725 | 4,603,060 |
| | 490,566 | 370,300 | 51,725 | 110,663 | 67,933 | 129,282 | 307,705 | 69,147 |
| | 34,461 (0.76%) | 236,516 (5.14%) | 119,552 (2.60%) | 303,703 (6.60%) | 22,572 (0.49%) | 49,767 (1.08%) | 47,324 (1.03%) | 88,863 (1.93%) |
| | 2,716 | 3,609 | 3,125 | 1,288 | 6,850 | 6,486 | 3,508 | 11,233 |
| | 1 | 1 | 1 | 1 | 5 | 1 | 5 | 2 |
| | 2,224 | 6,923 | 2,890 | 2,658 | 2,147 | 3,579 | 5,579 | 2,169 |
| | 3,201 | 607 | 2,239 | 2,267 | 3,326 | 1,812 | 988 | 3,232 |
| | 3:15 | 7:37 | 1:23 | 55:37 | 5:12 | 5:15 | 6:04 | 2:33 |
| | 0.36 | 3 | 0.17 | 2.01 | 0.96 | 0.96 | 0.96 | 2.2 |