Yongchao Liu, Bertil Schmidt, Douglas L Maskell.
Abstract
BACKGROUND: Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for de novo assembly, both in assembly quality and in scalability to large-scale short read datasets.
Year: 2011 PMID: 21447171 PMCID: PMC3072957 DOI: 10.1186/1471-2105-12-85
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Program workflow and data dependence between different stages.
Figure 2. Workflow of each PE for distributed spectrum construction.
Figure 3. Workflow of each PE for distributed error-free read filtering.
Figure 4. Pseudocode of the CUDA kernel of the voting algorithm.
Figure 5. Workflow of each PE for fixing erroneous reads.
Simulated and real short read datasets
| Datasets | Read length | Coverage | Error rate | No. of Reads |
|---|---|---|---|---|
| D30X1.5 | 36 | 30 | 1.5% | 3866000 |
| D30X3.0 | 36 | 30 | 3.0% | 3866000 |
| D75X1.5 | 36 | 75 | 1.5% | 9666000 |
| D75X3.0 | 36 | 75 | 3.0% | 9666000 |
| D150X1.5 | 72 | 150 | 1.5% | 9666000 |
| D150X3.0 | 72 | 150 | 3.0% | 9666000 |
| SRR006331 | 36 | 69 | - | 1693848 |
| SRR016146 | 51 | 81 | - | 4438066 |
| SRR001665 | 36 | 162 | - | 20816448 |
Definitions for the read binary classification test
| Classification | Read condition: erroneous | Read condition: error-free |
|---|---|---|
| Detected as erroneous | TP | FP |
| Detected as error-free | FN | TN |
Summary of the classification test for simulated datasets
| Dataset | Algorithm | TP | FP | FN | TN | Sensitivity (%) | Specificity (%) |
|---|---|---|---|---|---|---|---|
| D30X1.5 | DecGPU | 1620660 | 349908 | 253 | 1895179 | 99.98 | 84.41 |
| D30X1.5 | hSHREC | 1617685 | 13998 | 3228 | 2231089 | 99.80 | 99.38 |
| D30X3.0 | DecGPU | 2575411 | 660533 | 306 | 629750 | 99.99 | 48.81 |
| D30X3.0 | hSHREC | 2571520 | 31367 | 4197 | 1258916 | 99.84 | 97.57 |
| D75X1.5 | DecGPU | 4053688 | 23 | 1024 | 5611265 | 99.97 | 100.00 |
| D75X1.5 | hSHREC | 4053827 | 4990124 | 885 | 621164 | 99.98 | 11.07 |
| D75X3.0 | DecGPU | 6435328 | 3481 | 1621 | 3225570 | 99.97 | 99.89 |
| D75X3.0 | hSHREC | 6436305 | 3129803 | 644 | 99248 | 99.99 | 3.07 |
| D150X1.5 | DecGPU | 6406078 | 2 | 5395 | 3254525 | 99.92 | 100.00 |
| D150X1.5 | hSHREC | 6411346 | 3185858 | 127 | 68669 | 100.00 | 2.11 |
| D150X3.0 | DecGPU | 8578176 | 1 | 8651 | 1079172 | 99.90 | 100.00 |
| D150X3.0 | hSHREC | 8586743 | 1056392 | 84 | 22781 | 100.00 | 2.11 |
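The sensitivity and specificity columns can be checked against the raw counts. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which this excerpt does not spell out; the function names are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Percentage of truly erroneous reads that were detected as erroneous."""
    return 100.0 * tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Percentage of truly error-free reads that were detected as error-free."""
    return 100.0 * tn / (tn + fp)


# Counts taken from the D30X1.5 / DecGPU row of the classification table.
tp, fp, fn, tn = 1620660, 349908, 253, 1895179

print(f"sensitivity = {sensitivity(tp, fn):.2f}%")
print(f"specificity = {specificity(tn, fp):.2f}%")
```

Under these definitions the D30X1.5 / DecGPU counts round to the 99.98 and 84.41 listed in the table.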
The error rates and execution time comparison for DecGPU and Hybrid SHREC
| Dataset | Original error rate (%) | Corrected error rate (%): DecGPU, one fixing | Corrected error rate (%): DecGPU, two fixing | Corrected error rate (%): hSHREC | Time (s): DecGPU, one fixing | Time (s): DecGPU, two fixing | Time (s): hSHREC |
|---|---|---|---|---|---|---|---|
| D30X1.5 | 1.498 | 0.426 | 0.341 | 0.713 | 125 | 145 | 2721 |
| D30X3.0 | 3.003 | 1.773 | 1.625 | 2.014 | 164 | 217 | 2882 |
| D75X1.5 | 1.500 | 0.347 | 0.248 | 3.936 | 288 | 348 | 4380 |
| D75X3.0 | 3.000 | 1.262 | 0.988 | 4.058 | 375 | 473 | 5079 |
| D150X1.5 | 1.500 | 0.579 | 0.348 | 3.233 | 981 | 1118 | 11047 |
| D150X3.0 | 3.001 | 1.781 | 1.241 | 4.082 | 1254 | 1489 | 12951 |
Performance comparison with respect to the three R measures
| Dataset | Algorithm | CC | IC | EU | EI | | | |
|---|---|---|---|---|---|---|---|---|
| D30X1.5 | DecGPU | 1275967 | 191 | 809207 | 893 | 61.19 | 0.01 | 0.05 |
| D30X1.5 | hSHREC | 1736112 | 10960 | 214851 | 125381 | 88.49 | 0.56 | 6.95 |
| D30X3.0 | DecGPU | 1611459 | 344 | 2567906 | 2932 | 38.55 | 0.01 | 0.08 |
| D30X3.0 | hSHREC | 2983112 | 27448 | 764097 | 326466 | 79.03 | 0.73 | 9.38 |
| D75X1.5 | DecGPU | 3373714 | 388 | 1844213 | 530 | 64.65 | 0.01 | 0.02 |
| D75X1.5 | hSHREC | 1431267 | 27988 | 3256061 | 2219648 | 30.35 | 0.59 | 47.67 |
| D75X3.0 | DecGPU | 5425615 | 746 | 5013497 | 1122 | 51.97 | 0.01 | 0.02 |
| D75X3.0 | hSHREC | 757454 | 29924 | 9248234 | 1250738 | 7.55 | 0.30 | 12.76 |
| D150X1.5 | DecGPU | 7242425 | 2913 | 3196883 | 1004 | 69.36 | 0.03 | 0.04 |
| D150X1.5 | hSHREC | 741722 | 37618 | 9034830 | 3345778 | 7.56 | 0.38 | 34.47 |
| D150X3.0 | DecGPU | 11221669 | 7593 | 9655700 | 2121 | 53.73 | 0.04 | 0.05 |
| D150X3.0 | hSHREC | 1152718 | 71504 | 18896523 | 3136637 | 5.73 | 0.36 | 15.94 |
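The column labels of the three R measures did not survive in this excerpt. As a hedged observation, the first two of them are numerically consistent with CC / (CC + IC + EU) and IC / (CC + IC + EU), expressed as percentages; the formulas and the function name below are inferred from the numbers, not quoted from the source. The third column depends on EI and a denominator that cannot be recovered from this table, so it is not reproduced here.

```python
def correction_ratios(cc: int, ic: int, eu: int) -> tuple[float, float]:
    """Return (CC share, IC share) in percent of CC + IC + EU.

    Assumption: CC + IC + EU is the number of base errors originally present,
    i.e. each one was either corrected, miscorrected, or left unchanged.
    """
    total = cc + ic + eu
    return 100.0 * cc / total, 100.0 * ic / total


# CC, IC, EU counts from the D30X1.5 rows of the table above.
for name, cc, ic, eu in [
    ("DecGPU", 1275967, 191, 809207),
    ("hSHREC", 1736112, 10960, 214851),
]:
    r_cc, r_ic = correction_ratios(cc, ic, eu)
    print(f"{name}: {r_cc:.2f} {r_ic:.2f}")
```

For these two rows the computed pairs round to 61.19 / 0.01 and 88.49 / 0.56, matching the first two unlabeled columns.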
Figure 6. Percentage of mapped reads as a function of maximum number of mismatches.
Assembly quality and parameters for different assemblers
| Dataset | Type | Assembler | N50 | N90 | MAX | #Seq | Parameters |
|---|---|---|---|---|---|---|---|
| SRR006331 | | Velvet | 6229 | 1830 | 21166 | 288 | k = 23, cov_cutoff = auto |
| SRR006331 | | D-Velvet | 7411 | 1549 | 17986 | 282 | k = 23, cov_cutoff = auto |
| SRR006331 | | ABySS | 5644 | 1505 | 15951 | 334 | k = 24 |
| SRR006331 | | D-ABySS | 4789 | 1216 | 12090 | 371 | k = 24 |
| SRR016146 | | Velvet | 34052 | 7754 | 112041 | 301 | k = 31, cov_cutoff = auto |
| SRR016146 | | D-Velvet | 34898 | 7754 | 134258 | 292 | k = 31, cov_cutoff = auto |
| SRR016146 | | ABySS | 34124 | 7758 | 112038 | 297 | k = 33 |
| SRR016146 | | D-ABySS | 34889 | 7916 | 134314 | 297 | k = 33 |
| SRR001665 | | Velvet | 17900 | 4362 | 73058 | 601 | k = 29, cov_cutoff = auto |
| SRR001665 | | D-Velvet | 18484 | 4687 | 73058 | 586 | k = 29, cov_cutoff = auto |
| SRR001665 | | ABySS | 18161 | 4364 | 71243 | 603 | k = 30 |
| SRR001665 | | D-ABySS | 18161 | 4604 | 73060 | 595 | k = 30 |
| SRR001665 | | Velvet | 95486 | 26570 | 268283 | 179 | k = 31, exp_cov = auto, cov_cutoff = auto |
| SRR001665 | | D-Velvet | 95429 | 26570 | 268084 | 175 | k = 31, exp_cov = auto, cov_cutoff = auto |
| SRR001665 | | ABySS | 96308 | 25780 | 268372 | 124 | k = 33, n = 10 |
| SRR001665 | | D-ABySS | 96904 | 27002 | 210775 | 122 | k = 33, n = 10 |
Assembly quality and parameters after further tuning parameters for some datasets
| Dataset | Type | Assembler | N50 | N90 | MAX | #Seq | Parameters |
|---|---|---|---|---|---|---|---|
| SRR006331 | | D-ABySS | 6130 | 1513 | 16397 | 311 | k = 24, c = 7 |
| SRR001665 | | D-ABySS | 20068 | 5147 | 73062 | 565 | k = 31, c = 12 |
| SRR001665 | | D-Velvet | 101245 | 30793 | 269944 | 146 | k = 31, exp_cov = 36, cov_cutoff = 13 |
Execution time and MBPS of DecGPU on different number of compute resources
| Dataset | Stage | Metric | 4 CPU cores | 8 CPU cores | 16 CPU cores | 32 CPU cores | 1 GPU | 2 GPUs | 4 GPUs | 8 GPUs |
|---|---|---|---|---|---|---|---|---|---|---|
| SRR006331 | Spectrum | Time (s) | 36 | 19 | 11 | 7 | 21 | 15 | 9 | 9 |
| SRR006331 | Spectrum | MBPS | 1.7 | 3.2 | 5.5 | 8.7 | 2.9 | 4.1 | 6.8 | 6.8 |
| SRR006331 | EC | Time (s) | 35 | 38 | 41 | 42 | 9 | 11 | 18 | 23 |
| SRR006331 | EC | MBPS | 1.7 | 1.6 | 1.5 | 1.5 | 6.8 | 5.5 | 3.4 | 2.7 |
| SRR016146 | Spectrum | Time (s) | 194 | 96 | 51 | 30 | 121 | 86 | 46 | 48 |
| SRR016146 | Spectrum | MBPS | 1.2 | 2.4 | 4.4 | 7.5 | 1.9 | 2.6 | 4.9 | 4.7 |
| SRR016146 | EC | Time (s) | 194 | 168 | 175 | 206 | 63 | 53 | 43 | 45 |
| SRR016146 | EC | MBPS | 1.2 | 1.3 | 1.3 | 1.1 | 3.6 | 4.3 | 5.3 | 5.0 |
| SRR001665 | Spectrum | Time (s) | 473 | 247 | 136 | 86 | 297 | 231 | 133 | 137 |
| SRR001665 | Spectrum | MBPS | 1.6 | 3.0 | 5.5 | 8.7 | 2.5 | 3.2 | 5.6 | 5.5 |
| SRR001665 | EC | Time (s) | 266 | 223 | 251 | 306 | 94 | 85 | 85 | 99 |
| SRR001665 | EC | MBPS | 2.8 | 3.4 | 3.0 | 2.4 | 8.0 | 8.8 | 8.8 | 7.6 |
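MBPS is not expanded in this excerpt. The tabulated values are consistent with reading it as millions of bases processed per second, that is, number of reads × read length / execution time / 10^6. The sketch below rests on that assumption and takes read counts and lengths from the dataset table; the function name is illustrative.

```python
def mbps(num_reads: int, read_length: int, seconds: float) -> float:
    """Millions of bases processed per second (assumed meaning of MBPS)."""
    return num_reads * read_length / seconds / 1e6


# (reads, read length) from the dataset table; times from the table above.
datasets = {
    "SRR006331": (1693848, 36),
    "SRR016146": (4438066, 51),
    "SRR001665": (20816448, 36),
}

# Spectrum construction on 4 CPU cores.
for name, seconds in [("SRR006331", 36), ("SRR016146", 194), ("SRR001665", 473)]:
    reads, length = datasets[name]
    print(f"{name}: {mbps(reads, length, seconds):.1f} MBPS")
```

With these inputs the three results round to 1.7, 1.2 and 1.6, the values listed for spectrum construction on 4 CPU cores.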
Figure 7. Execution time comparison between DecGPU and CUDA-EC.
FPP and maximal N for representative α values
| α | FPP | Maximal N |
|---|---|---|
| 1 | 2.5 × 10⁻² | 536870912 |
| 0.5 | 5.7 × 10⁻⁴ | 268435456 |
| 0.25 | 5.7 × 10⁻⁶ | 134217728 |
| 0.125 | 3.6 × 10⁻⁸ | 67108864 |
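The FPP (false positive probability) values above can be reproduced with the standard Bloom filter estimate FPP ≈ (1 − e^(−α))^h, taking α = h·N / N_B. The number of hash functions (h = 8) and the bit-vector size (N_B = 2^32 bits) used below are assumptions inferred from the tabulated numbers, not values stated in this excerpt, and the function names are illustrative.

```python
import math

NUM_HASHES = 8             # assumption: h, inferred from the table values
BIT_VECTOR_BITS = 2 ** 32  # assumption: N_B, inferred from the maximal N column


def fpp(alpha: float, h: int = NUM_HASHES) -> float:
    """Approximate Bloom filter false positive probability for alpha = h*N/N_B."""
    return (1.0 - math.exp(-alpha)) ** h


def max_items(alpha: float, bits: int = BIT_VECTOR_BITS, h: int = NUM_HASHES) -> int:
    """Largest N for which h*N/N_B does not exceed alpha."""
    return int(alpha * bits / h)


for alpha in (1, 0.5, 0.25, 0.125):
    print(f"alpha={alpha}: FPP={fpp(alpha):.1e}, max N={max_items(alpha)}")
```

Under these assumptions, halving α halves the number of elements the filter may hold while lowering the FPP by roughly two orders of magnitude, which is the trade-off the table illustrates.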