Arghya Kusum Das, Sayan Goswami, Kisung Lee, Seung-Jong Park.
Abstract
BACKGROUND: Long-read sequencing has shown promise in overcoming the short-length limitations of second-generation sequencing by providing more complete assemblies. However, computation on long sequencing reads is challenged by their higher error rates (e.g., 13% vs. 1%) and higher cost ($0.3 vs. $0.03 per Mbp) compared to short reads.
Keywords: Hadoop; Hybrid error correction; Illumina; NoSQL; PacBio
Year: 2019 PMID: 31856721 PMCID: PMC6923905 DOI: 10.1186/s12864-019-6286-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Widest Path Example: Select correct path for high coverage error k-mers
Fig. 2 Skewness in k-mer coverage statistics
Fig. 3 Indel error correction
Fig. 4 Error correction steps
Fig. 5 Substitution error correction
Fig. 6 De Bruijn graph construction and k-mer count
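The figure captions above mention k-mer counting, de Bruijn graph construction, and a widest path search. As a rough illustration of how these pieces can fit together, the sketch below counts k-mers in a few toy reads, links k-mers that overlap by k-1 bases, and searches for the path whose minimum k-mer coverage is largest. It is a single-machine toy, not ParLECH's distributed implementation. The function names, the choice of k = 3, the reads, and the definition of path width as the minimum coverage along the path are all assumptions made for this example.

```python
import heapq
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across the reads (its coverage)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(counts):
    """Connect each k-mer to the k-mers that overlap it by k-1 bases."""
    by_prefix = defaultdict(list)
    for kmer in counts:
        by_prefix[kmer[:-1]].append(kmer)
    return {kmer: by_prefix.get(kmer[1:], []) for kmer in counts}


def widest_path(counts, edges, source, target):
    """Return (width, path) maximizing the minimum coverage along the path.

    Assumption for this sketch: width = smallest k-mer count on the path.
    This is a Dijkstra-style search with max-min instead of sum.
    """
    best = {source: counts[source]}
    parent = {source: None}
    heap = [(-counts[source], source)]  # max-heap via negated widths
    while heap:
        neg_width, node = heapq.heappop(heap)
        width = -neg_width
        if width < best.get(node, 0):
            continue  # stale heap entry
        if node == target:
            break
        for nxt in edges.get(node, []):
            candidate = min(width, counts[nxt])
            if candidate > best.get(nxt, 0):
                best[nxt] = candidate
                parent[nxt] = node
                heapq.heappush(heap, (-candidate, nxt))
    if target not in best:
        return 0, []
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]


# Hypothetical reads: the error k-mer CGA has high coverage (8), but the
# branch behind it (GAT, ATC) is barely supported.
reads = ["ACGTCA"] * 5 + ["ACGA"] * 7 + ["ACGATCA"]
counts = kmer_counts(reads, 3)
edges = de_bruijn_edges(counts)
width, path = widest_path(counts, edges, "ACG", "TCA")
print(width, path)  # 5 ['ACG', 'CGT', 'GTC', 'TCA']
```

In this toy input a greedy walk from ACG would step to CGA because its count (8) beats CGT (5), whereas the max-min search prefers the CGT branch because every k-mer on it has coverage of at least 5.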
Datasets
| Data | Accn. # (PacBio) | Accn. # (Illumina) | #Reads (PacBio) | #Reads (Illumina) | Data size in GB (PacBio) | Data size in GB (Illumina) | Read length (PacBio, avg) | Read length (Illumina) | %Reads aligned (PacBio) | %Reads aligned (Illumina) |
|---|---|---|---|---|---|---|---|---|---|---|
| E. coli | DevNet | ERR022075 | 282394 | 45440200 | 1.032 | 13.50 | 1120 | 101 | 78.97 | 99.44 |
| Yeast | DevNet | SRR567755 | 231594 | 4503422 | 0.53 | 1.20 | 5874 | 101 | 82.12 | 93.75 |
| Fruit fly | BergmanLab | ERX645969 | 6701498 | 179363706 | 55 | 59 | 4328 | 101 | 51.14 | 95.56 |
| Human | DevNet | SRX016231 | 23897260 | 1420689270 | 312 | 452 | 6587 | 101 | 72.3 | 79.60 |
Experimental environment
| Maximum #nodes | 128 |
|---|---|
| Processor | Intel IvyBridge Xeon |
| #cores per node | 20 |
| DRAM per node | 64 GB |
| Disk per node | 250 GB hard disk drive |
| Network | 56 Gbps InfiniBand |
Accuracy comparison (Alignments)
| Data | Methodology | #Reads | #Bases | N50 | #Aligned Reads | #Aligned bases | %Aligned reads | %Aligned bases |
|---|---|---|---|---|---|---|---|---|
| E. coli | Original | 282394 | 316367409 | 3414 | 223017 | 237497013 | 78.97 | 75.07 |
| | LoRDEC | 282394 | 307987923 | 3422 | 247227 | 266373078 | 87.55 | 86.49 |
| | Jabba | 149836 | 149322524 | 2517 | 148293 | 141563938 | 98.97 | 94.80 |
| | Proovread | 263206 | 284871906 | 1222 | 241948 | 246138387 | 91.92 | 86.40 |
| | ParLECH (Indel) | 282394 | 309367145 | 3394 | 264574 | 285070391 | 93.69 | 92.15 |
| | ParLECH (Indel+Subst) | 282394 | 309367145 | 3394 | 264720 | 295438268 | | |
| Yeast | Original | 231594 | 1360457697 | 2990 | 190184 | 1206524663 | 82.12 | 88.69 |
| | LoRDEC | 231594 | 1345253694 | 2982 | 196669 | 1171490123 | 84.92 | 87.08 |
| | Jabba | 152882 | 634947441 | 2173 | 151359 | 634732955 | 99.02 | 99.09 |
| | Proovread | 225032 | 1307137185 | 1693 | 211323 | 1100350212 | 93.90 | 84.18 |
| | ParLECH (Indel) | 231594 | 1389446261 | 2994 | 199332 | 1240945939 | 86.07 | 89.31 |
| | ParLECH (Indel+Subst) | 231594 | 1389446261 | 2994 | 201857 | 1254987596 | | |
| Fruit fly | Original | 6701498 | 29007475325 | 15154 | 3427146 | 13355041639 | 51.14 | 46.04 |
| | LoRDEC | 6701498 | 30025673204 | 15154 | 3654326 | 14919815143 | 54.53 | 49.69 |
| | Jabba | 4423855 | 10820828565 | 14302 | 3921032 | 9455816742 | 88.63 | 87.38 |
| | Proovread | 6511617 | 20174923756 | 8603 | 5450784 | 14497076095 | 83.70 | 71.86 |
| | ParLECH (Indel) | 6701498 | 30117416348 | 15154 | 4417627 | 18799138439 | 65.92 | 62.42 |
| | ParLECH (Indel+Subst) | 6701498 | 30117416348 | 15154 | 4557627 | 19983756932 | | |
The best results are shown in boldface.
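The percentage columns in the alignment table appear to be simple ratios of the count columns. The sketch below checks that reading against one fully populated row. It assumes that %Aligned reads = #Aligned Reads / #Reads and %Aligned bases = #Aligned bases / #Bases, each rounded to two decimals; the helper name `percent` is hypothetical.

```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# E. coli, ParLECH (Indel) row, values copied from the table above.
reads, bases = 282394, 309367145
aligned_reads, aligned_bases = 264574, 285070391

print(percent(aligned_reads, reads))  # 93.69
print(percent(aligned_bases, bases))  # 92.15
```

Under the same assumption, the ratio can be computed for any row that lists all four counts.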
Accuracy comparison (Gain)
| Data | Methodology | TP | FP | FN | %Gain |
|---|---|---|---|---|---|
| E. coli | LoRDEC | 31264830 | 330659 | 4230385 | 87.15 |
| | Jabba | 10386868 | 105445 | 244608 | 96.7 |
| | Proovread | 23541209 | 318191 | 3942940 | 84.49 |
| | ParLECH (Indel) | 33229635 | 355464 | 3275190 | 90.05 |
| | ParLECH (Indel+Subst) | 34521649 | 250129 | 2088511 | |
| Yeast | LoRDEC | 322660270 | 8989628 | 62594234 | 81.42 |
| | Jabba | 171200961 | 3004132 | 9543906 | 93.06 |
| | Proovread | 313517992 | 8734915 | 60820684 | 83.21 |
| | ParLECH (Indel) | 355708411 | 20037769 | 51642375 | 82.40 |
| | ParLECH (Indel+Subst) | 368206322 | 19556218 | 39626015 | |
| Fruit fly | LoRDEC | 732799376 | 34190591 | 84891209 | 85.43 |
| | Jabba | 188817493 | 18141254 | 45042597 | 93.2 |
| | Proovread | 613007402 | 30867421 | 72123053 | 84.96 |
| | ParLECH (Indel) | 785735162 | 37126377 | 97826995 | 84.73 |
| | ParLECH (Indel+Subst) | 799834035 | 34065158 | 86789341 | |
The best results are shown in boldface.
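The page does not state how %Gain is derived from TP, FP, and FN. A commonly used definition in error-correction benchmarks is gain = (TP - FP) / (TP + FN), and the sketch below assumes that definition. It reproduces the populated LoRDEC and ParLECH (Indel) values for E. coli to within rounding, which suggests, but does not prove, that the table uses the same formula. The function name `gain_percent` is hypothetical.

```python
def gain_percent(tp, fp, fn):
    """Assumed gain metric: (TP - FP) / (TP + FN), as a percentage."""
    return 100.0 * (tp - fp) / (tp + fn)


# E. coli rows copied from the table above.
print(round(gain_percent(31264830, 330659, 4230385), 2))  # LoRDEC: 87.15
print(round(gain_percent(33229635, 355464, 3275190), 2))  # ParLECH (Indel): 90.05
```

Read this way, gain rewards corrected errors (TP) and penalizes both newly introduced errors (FP) and errors left uncorrected (FN).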
Fig. 7 Scalability of ParLECH. (a) Time to correct indel errors of the fruit fly dataset. (b) Time to correct substitution errors of the fruit fly dataset
Fig. 8 Comparing execution time of ParLECH with existing error correction tools. (a) Time for hybrid correction of indel errors in E. coli long reads (1.032 GB). (b) Time for correction of substitution errors in E. coli short reads (13.50 GB)
Effects of different traversal algorithms
| Data | Methodology | #Reads | #Bases | #Aligned Reads | #Aligned bases | %Aligned reads | %Aligned bases |
|---|---|---|---|---|---|---|---|
| E. coli | ParLECH | 282394 | 309367145 | 264574 | 285070391 | 93.69 | 92.15 |
| | ParLECH | 282394 | 307987923 | 247227 | 266373078 | 87.55 | 86.49 |
| | ParLECH | 282394 | 328966341 | 216543 | 233312807 | 76.68 | 70.92 |
| Yeast | ParLECH | 231594 | 1389446261 | 199332 | 1240945939 | 86.07 | 89.31 |
| | ParLECH | 231594 | 1355153783 | 196669 | 1171490123 | 84.92 | 86.44 |
| | ParLECH | 231594 | 1399628927 | 175478 | 1045262567 | 75.77 | 74.68 |
| Fruit fly | ParLECH | 6701498 | 30117416348 | 4417627 | 18799138439 | 65.92 | 62.42 |
| | ParLECH | 6701498 | 30193752318 | 3654326 | 14919815143 | 54.53 | 49.41 |
| | ParLECH | 6701498 | 32131749687 | 2946734 | 12030871508 | 43.97 | 37.44 |
Resource consumption of ParLECH compared with existing error correction tools on the E. coli dataset
| Error correction tool | CPU-Hour (single node) | Peak memory usage |
|---|---|---|
| LoRDEC | 10 | 20.65 |
| Jabba | 18 | 11.16 |
| Proovread | 89 | 31.77 |
| ParLECH (configured for least execution time) | 11.67 | 23.80 |
| ParLECH (configured to use lower DRAM) | 29.37 | 5 |
Correcting a human genome
| PacBio data size | 312 GB |
|---|---|
| Illumina data size | 452 GB |
| #nodes used | 128 |
| Time | 28.6 h |
| %Aligned reads (Indel) | 78.3 |
| %Aligned bases (Indel) | 75.43 |
| %Gain (Indel) | 82.38 |
| Time (Indel + Subst) | 3.4 h |
| %Aligned reads (Indel + Subst) | 79.73 |
| %Aligned bases (Indel + Subst) | 80.24 |
| %Gain (Indel + Subst) | 84.51 |