Olivia Choudhury, Ankush Chakrabarty, Scott J Emrich.
Abstract
Second-generation DNA sequencing techniques generate short reads that can result in fragmented genome assemblies. Third-generation sequencing platforms mitigate this limitation by producing longer reads that span complex and repetitive regions. However, the usefulness of such long reads is limited by their high sequencing error rates. To exploit the full potential of these longer reads, it is imperative to correct the underlying errors. We propose HECIL (Hybrid Error Correction with Iterative Learning), a hybrid error correction framework that determines a correction policy for erroneous long reads based on optimal combinations of decision weights obtained from short read alignments. We demonstrate that HECIL outperforms state-of-the-art error correction algorithms on an overwhelming majority of evaluation metrics across diverse, real-world data sets including E. coli, S. cerevisiae, and the malaria vector mosquito A. funestus. Additionally, we provide an optional avenue for improving the performance of HECIL's core algorithm: an iterative learning paradigm that enhances the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.
Year: 2018 PMID: 29967328 PMCID: PMC6028576 DOI: 10.1038/s41598-018-28364-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Comparison of k-mer-based and alignment-based metrics (with % improvement) for proovread, LoRDEC, CoLoRMap, and HECIL, evaluated on E. coli, E. coli (Sequel-sequenced), S. cerevisiae, and A. funestus.
| Data | Evaluation Metric | Original | proovread | LoRDEC | CoLoRMap | HECIL (Iter 1) | HECIL (Iter 5) |
|---|---|---|---|---|---|---|---|
| | # unique | 81,523,648 | 78,925,288 (3.1) | 80,708,419 (1.0) | 80,399,425 (1.3) | | |
| | # valid | 14,531,881 | 11,463,127 (−21.1) | 10,240,970 (−29.5) | 15,026,950 (3.4) | | |
| | # aligned reads | 31,071 | 23,453 (−24.5) | 30,837 (−0.7) | 31,271 (0.6) | | |
| | # aligned bases | 86,642,500 | 71,320,858 (−17.6) | 79,365,407 (−8.4) | 83,344,272 (−3.8) | | |
| | % matched bases | 76.9 | 87.9 (14.3) | 85.2 (10.7) | 87.5 (13.7) | | |
| | PI | 94.8 | 99.4 (4.8) | 99.2 (4.6) | | | |
| | # unique | 1,982,480,568 | 84,739,287 (95.7) | 86,825,382 (95.6) | 85,031,655 (95.7) | | |
| | # valid | 11,890,472 | 11,365,013 (−4.4) | 10,167,397 (−14.4) | 12,626,801 (6.1) | | |
| | # aligned reads | 1,158,421 | 910,384 (−21.4) | 1,161,432 (0.2) | 1,189,253 (2.6) | | |
| | # aligned bases | 4,343,460,105 | 3,963,123,749 (−8.7) | 4,471,081,390 (2.9) | 4,416,369,371 (1.6) | | |
| | % matched bases | 85.1 | 93.1 (9.4) | 92.8 (9.0) | 93.7 (10.1) | | |
| | PI | 85.0 | 93.1 (9.5) | 92.8 (9.1) | 93.7 (10.2) | | |
| | # unique | 1,870,396,869 | 1,871,451,237 (−0.0) | 1,868,238,946 (0.1) | 1,869,232,456 (0.0) | | |
| | # valid | 36,904,129 | 32,436,294 (−12.1) | 30,534,546 (−17.2) | 37,797,300 (2.4) | | |
| | # aligned reads | 224,694 | 222,976 (−0.7) | 221,692 (−1.3) | 223,641 (−0.4) | | |
| | # aligned bases | 1,229,724,663 | 1,205,706,114 (−1.9) | 1,171,490,123 (−4.7) | 1,207,729,568 (−1.7) | | |
| | % matched bases | 78.8 | 83.1 (5.4) | 83.4 (5.8) | 85.6 (8.6) | | |
| | PI | 93.8 | 96.3 (2.6) | 98.3 (4.8) | 98.3 (4.8) | | |
| | # unique | 692,831,731 | 649,989,172 (6.1) | 653,931,808 (5.6) | 662,366,838 (4.4) | | |
| | # valid | 211,908,809 | 172,074,427 (−18.8) | 229,625,736 (8.3) | 222,195,325 (4.8) | | |
| | # aligned reads | 190,217 | 94,536 (−50.3) | | 190,166 (−0.0) | 190,229 (0.0) | |
| | # aligned bases | 671,881,278 | 401,850,047 (−40.1) | 655,072,426 (−2.5) | 660,848,583 (−1.6) | | |
| | % matched bases | 84.0 | 81.4 (−3.1) | 83.1 (−1.0) | 82.1 (−2.2) | | |
| | PI | 94.5 | 96.8 (2.4) | 95.6 (1.1) | 97.1 (2.7) | | |
| | # unique | 216,327,700 | | 205,883,182 (4.8) | 206,986,374 (4.3) | 205,064,188 (5.2) | |
| | # valid | 80,612,612 | 72,716,589 (−9.8) | 82,568,831 (2.4) | 81,027,437 (0.5) | | |
| | # aligned reads | 59,163 | 32,726 (−44.6) | 59,165 (0.0) | 59,159 (−0.0) | | |
| | # aligned bases | 231,326,514 | 149,049,154 (−35.5) | 234,098,182 (1.2) | 233,435,402 (0.9) | | |
| | % matched bases | 86.3 | 83.2 (−3.5) | 87.0 (0.8) | 85.6 (−0.8) | | |
| | PI | 94.3 | 96.9 (2.7) | 96.6 (2.4) | 97.2 (3.0) | | |
| | # unique | 265,998,542 | 250,267,133 (5.9) | 252,291,701 (5.1) | 254,293,778 (4.4) | | |
| | # valid | 96,317,177 | 86,396,798 (−10.3) | 106,713,483 (10.7) | 101,431,900 (5.3) | | |
| | # aligned reads | 73,779 | 43,530 (−41.0) | 73,757 (−0.0) | 73,750 (−0.0) | | |
| | # aligned bases | 278,976,792 | 190,054,632 (−31.8) | 280,699,552 (0.6) | 280,831,201 (0.6) | | |
| | % matched bases | 84.3 | 82.7 (−1.9) | 85.6 (1.5) | 84.5 (0.2) | | |
| | PI | 94.8 | 96.9 (2.2) | 96.3 (1.5) | 97.4 (2.7) | | |
For HECIL, metrics are reported before and after iterative learning: iteration 1 (the core algorithm alone) and iteration 5 (after four rounds of learning).
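The parenthesized values in the table are consistent with a percent change relative to the Original column, signed so that a positive number is an improvement (for # unique, a decrease counts as the improvement) and cut to one decimal place. The sketch below reproduces a few entries under those assumptions. The function name `pct_improvement`, the `lower_is_better` flag, and the truncation rule are inferred from the numbers and are not stated in the source.

```python
import math


def pct_improvement(original, corrected, lower_is_better=False):
    """Percent change relative to `original`, signed so positive means better.

    Assumption: values are truncated (not rounded) to one decimal place,
    which is what the tabulated numbers suggest.
    """
    change = (corrected - original) / original * 100.0
    if lower_is_better:
        change = -change
    return math.trunc(change * 10) / 10


# E. coli rows, proovread column
print(pct_improvement(81_523_648, 78_925_288, lower_is_better=True))  # 3.1
print(pct_improvement(31_071, 23_453))                                # -24.5
```

Rounding instead of truncating would give 3.2 for the first entry, which is why truncation is assumed here.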
Comparison of assembly-based metrics (with % improvement) for proovread, LoRDEC, CoLoRMap, and HECIL, evaluated on E. coli with downsampled short reads (D-SR; 18x coverage, the lowest coverage) and with the original short reads, E. coli (Sequel-sequenced), S. cerevisiae, and A. funestus (merged flowcells).
| Data | Evaluation Metric | Original | proovread | LoRDEC | CoLoRMap | HECIL (Iter 1) | HECIL (Iter 5) |
|---|---|---|---|---|---|---|---|
| E. coli (D-SR) | # Contigs | 182 | 29 (84.0) | 28 (84.6) | 24 (86.8) | | — |
| | Largest contig | 69,266 | 567,484 (719.2) | 885,819 (1178.8) | 813,262 (1074.1) | | — |
| | Total length | 3,508,197 | 4,235,031 (20.7) | 4,068,085 (15.9) | 4,036,161 (15.0) | | — |
| | N50 | 24,663 | 189,712 (669.2) | 179,638 (628.3) | 184,367 (647.5) | | — |
| | NG50 | 17,847 | 212,621 (1091.3) | 190,621 (968.0) | 210,913 (1081.7) | | — |
| | Aligned base (%) - Ref/Query | 83/84 | 87/89 | 92/93 | 48/92 | | — |
| | Average Identity (1–1) - Ref/Query | 88/88 | 93/93 | 97/97 | 97/97 | | — |
| E. coli | # Contigs | 182 | 26 (85.7) | 24 (86.8) | | | |
| | Largest contig | 69,266 | 605,792 (774.5) | 920,903 (1229.5) | 1,089,140 (1472.4) | | |
| | Total length | 3,508,197 | 4,629,719 (31.9) | 4,623,137 (31.7) | 4,624,793 (31.8) | | |
| | N50 | 24,663 | 231,774 (839.7) | 226,456 (818.2) | 239,066 (869.3) | | |
| | NG50 | 17,847 | 231,774 (1198.6) | 226,456 (1168.8) | 239,066 (1239.5) | | |
| | Aligned base (%) - Ref/Query | 82/87 | 92/92 | 98/98 | 54/94 | | |
| | Average Identity (1–1) - Ref/Query | 91/91 | 95/95 | 96/96 | 97/97 | | |
| E. coli (Sequel-sequenced) | # Contigs | 84 | 34 (59.5) | 29 (65.4) | 29 (65.4) | | |
| | Largest contig | 88,975 | 775,707 (771.8) | 884,469 (894.0) | 1,363,678 (1432.6) | | |
| | Total length | 5,389,574 | 6,012,453 (11.5) | 5,821,596 (8.0) | 5,819,632 (7.9) | | |
| | N50 | 18,611 | 119,735 (543.3) | 117,028 (528.8) | 127,892 (587.1) | | |
| | NG50 | 13,903 | 116,255 (736.1) | 113,036 (713.0) | 118,087 (749.3) | | |
| | Aligned base (%) - Ref/Query | 78/80 | 89/89 | 95/95 | 67/92 | | |
| | Average Identity (1–1) - Ref/Query | 88/88 | 92/92 | 92/92 | 93/93 | | |
| S. cerevisiae | # Contigs | 26 | 32 (−23.0) | 28 (−7.6) | | | |
| | Largest contig | 1,543,990 | 1,537,979 (−0.3) | 1,552,711 (0.5) | 1,555,857 (0.7) | | |
| | Total length | 12,341,981 | 12,485,995 (1.1) | | 12,315,869 (−0.2) | 12,435,702 (0.7) | |
| | N50 | 777,602 | 777,713 (0.0) | 818,962 (5.3) | 932,935 (19.9) | | |
| | NG50 | 777,602 | 777,713 (0.0) | 818,962 (5.3) | 932,935 (19.9) | | |
| | Aligned base (%) - Ref/Query | 95/90 | 91/91 | 95/95 | 78/97 | | |
| | Average Identity (1–1) - Ref/Query | 92/92 | 93/93 | 97/97 | 98/98 | | |
| A. funestus (merged flowcells) | # Contigs | 998 | 712 (28.6) | 788 (21.0) | 847 (15.1) | | |
| | Largest contig | 71,070 | 36,306 (−48.9) | 75,298 (5.9) | 72,306 (1.7) | | |
| | Total length | 25,405,949 | 8,371,287 (−67.0) | 26,745,092 (5.2) | 26,802,126 (5.5) | | |
| | N50 | 13,038 | 14,802 (13.5) | 15,118 (15.9) | 14,555 (11.6) | | |
| | NG50 | 71,070 | 45,637 (−35.7) | 77,294 (8.7) | 76,306 (7.3) | | |
| | Aligned base (%) - Ref/Query | 20/87 | 23/93 | 27/96 | 20/95 | | |
| | Average Identity (1–1) - Ref/Query | 83/83 | 87/87 | 95/95 | 92/92 | | |
For HECIL, metrics are reported before and after iterative learning: iteration 1 (the core algorithm alone) and iteration 5 (after four rounds of learning).
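For readers unfamiliar with the N50 and NG50 rows, the following minimal sketch uses the standard definitions: N50 is the contig length at which the running total of contigs, sorted longest first, reaches half of the assembly length, and NG50 uses half of the reference genome length instead. The function name `n50`, the `total` parameter, and the contig lengths are illustrative and not taken from the source.

```python
def n50(lengths, total=None):
    """Length of the contig at which the cumulative length of contigs,
    sorted longest first, reaches half of `total`.

    With total=None the assembly length is used (N50); passing the
    reference genome size as `total` gives NG50.
    """
    target = (sum(lengths) if total is None else total) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the contigs cover less than half of `total`


contigs = [80, 70, 50, 40, 30, 20]   # made-up contig lengths
print(n50(contigs))                  # N50 over the 290-base assembly
print(n50(contigs, total=400))       # NG50 for a hypothetical 400-base genome
```

When the assembly is shorter than the genome, NG50 is at most N50, which matches rows such as the Sequel-sequenced E. coli data where NG50 is slightly below N50.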
Comparison of k-mer-based and alignment-based metrics with downsampled E. coli short reads using HECIL’s core algorithm.
| Evaluation Metric | All SRs | 50% SRs | 25% SRs | 12% SRs |
|---|---|---|---|---|
| #unique | 78,693,704 | 78,292,463 | 78,097,941 | 78,008,319 |
| #valid | 15,973,826 | 15,889,155 | 15,737,641 | 15,576,317 |
| #aligned reads | 31,332 | 31,328 | 31,322 | 31,318 |
| #aligned bases | 87,582,014 | 87,359,227 | 87,288,475 | 87,196,236 |
| % matched bases | 88.4 | 88.4 | 88.3 | 88.3 |
| PI | 99.7 | 99.7 | 99.7 | 99.6 |
Comparison of runtime and maximum memory footprint for correcting long reads.
| Data | Method | Runtime (hh:mm:ss) | Memory (GB) |
|---|---|---|---|
| | proovread | 6:15:37 | 11.4 |
| | LoRDEC | 0:38:53 | 6.2 |
| | CoLoRMap | 2:48:23 | 28.9 |
| | HECIL (Iter 1; Iter 5) | 1:16:55; 4:47:52 | 9.1; 9.1 |
| 4 | proovread | 42:53:06 | 34.6 |
| | LoRDEC | 17:47:27 | 24.3 |
| | CoLoRMap | 26:20:23 | 40.9 |
| | HECIL (Iter 1; Iter 5) | 19:33:47; 59:18:23 | 26.5; 26.5 |
| 4 | proovread | 20:54:15 | 14.5 |
| | LoRDEC | 3:43:12 | 6.1 |
| | CoLoRMap | 7:57:49 | 38.2 |
| | HECIL (Iter 1; Iter 5) | 5:14:09; 21:19:24 | 11.2; 11.2 |
| | proovread | 76:13:47 | 8.8 |
| | LoRDEC | 35:08:13 | 3.1 |
| | CoLoRMap | 90:50:12 | 23.4 |
| | HECIL (Iter 1; Iter 5) | 46:06:47; 162:21:37 | 8.3; 8.3 |
| | proovread | 36:32:25 | 7.3 |
| | LoRDEC | 11:25:05 | 6.7 |
| | CoLoRMap | 32:18:30 | 20.7 |
| | HECIL (Iter 1; Iter 5) | 17:38:01; 51:37:34 | 6.9; 6.9 |
Runtime includes index construction, alignment of short and long reads, and error correction (after the first and fifth iterations). Only the best and worst A. funestus results are shown.
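To compare the runtimes numerically, the time stamps can be converted to seconds. The helper below is a small sketch. The name `to_seconds` is illustrative, and a two-field value is read as mm:ss, which is an assumption about how the shortest runtime was written.

```python
def to_seconds(stamp):
    """Convert 'hh:mm:ss' (hours may exceed 24) or 'mm:ss' to seconds."""
    seconds = 0
    for field in stamp.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


# First data set in the table: HECIL after iteration 1 vs. iteration 5
iter1 = to_seconds("1:16:55")
iter5 = to_seconds("4:47:52")
print(iter1, iter5, round(iter5 / iter1, 2))
```

For this data set, five iterations take a little under four times as long as the core algorithm alone, while the memory column is unchanged between iterations.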
Figure 1. Improvement of alignment-based metrics (# fewer unique k-mers, additional aligned long reads, additional aligned bases, additional percent matched bases) for E. coli, S. cerevisiae, and A. funestus with iterative learning. The 0th iteration denotes the original data set and the 1st iteration indicates the corrected data set obtained from running HECIL’s core algorithm.
Figure 2. Illustration of HECIL’s core algorithm. The orange rectangle denotes an erroneous long read and the purple rectangles represent aligned short reads. (A) Quick correction with high consensus. (B) Optimization-based correction: the green dashed box depicts the objective function values, from which the optimal short read (green rectangle) is selected for correction.
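The two branches in this caption can be illustrated with a toy per-position decision rule. This is only a sketch under stated assumptions and not HECIL's actual implementation: the text here does not give the consensus threshold, the objective function, or the decision weights, so `quick_threshold`, `w_identity`, `w_support`, and the form of `objective` are all hypothetical.

```python
from collections import Counter


def correct_position(long_base, aligned, quick_threshold=0.9,
                     w_identity=0.5, w_support=0.5):
    """Choose the base for one long-read position.

    aligned: list of (base, identity) pairs, one per short read covering
    the position; identity is that read's alignment identity in [0, 1].
    All thresholds and weights are hypothetical placeholders.
    """
    if not aligned:
        return long_base  # no short-read evidence: leave the base unchanged

    counts = Counter(base for base, _ in aligned)
    top_base, top_count = counts.most_common(1)[0]

    # (A) Quick correction: the short reads agree strongly enough.
    if top_count / len(aligned) >= quick_threshold:
        return top_base

    # (B) Optimization-based correction: score every short read with a
    # weighted objective and take the base from the best-scoring read.
    def objective(read):
        base, identity = read
        support = counts[base] / len(aligned)
        return w_identity * identity + w_support * support

    best_base, _ = max(aligned, key=objective)
    return best_base


mixed = [("A", 0.80), ("A", 0.82), ("C", 0.99), ("G", 0.70)]
print(correct_position("T", mixed))                                # balanced weights
print(correct_position("T", mixed, w_identity=1.0, w_support=0.0))  # identity only
```

With balanced weights the better-supported base wins, while weighting identity alone selects the single most similar short read, which shows how the choice of weights changes the correction.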
Figure 3. Iterative learning procedure of HECIL. Other hybrid error correction algorithms can replace the core algorithm.