Ergude Bao, Lingxiao Lan.
Abstract
BACKGROUND: Third-generation PacBio SMRT long reads can effectively address the read length limitation of second-generation sequencing technology, but they contain approximately 15% sequencing errors. Several error correction algorithms have been designed to efficiently reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus yield low throughput. This loss of bases could limit the completeness of downstream assemblies and the accuracy of analysis.
Keywords: Error correction; PacBio long reads; Throughput
Year: 2017 PMID: 28381259 PMCID: PMC5382505 DOI: 10.1186/s12859-017-1610-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Illustrations of the approaches discussed in the “Background” section. The similar repeat based alignment approach, the long read support based validation approach and the adjacent alignment based validation approach are illustrated in (1), (2) and (3), respectively. (1) Long read region A in r1 does not have its true genome region in contig c1, but it could be aligned to its similar repeat B (shaded), which is the true genome region of long read region B in contig c1. (2) Adjacent long read regions A and B in r1 are aligned to contig region A in c1 and contig region B in c2, respectively. These alignments are accepted after validation with a sufficient number of long reads r1, r2, r3 and r4 supporting them. (3) Adjacent long read regions A and B in r1 are aligned to the adjacent contig regions A and B in c1, respectively, and are thus accepted after validation.
Correspondence of algorithm steps, approaches and problems addressed
| Steps | Approaches | Problems |
|---|---|---|
| 1 | Similar repeat based alignment | Lack of reference data |
| 3-4 | Long read support and adjacent alignment based validation | Error richness |
| 5 | Similar repeat based alignment | Lack of reference data |
Fig. 2 Illustration of the HALC algorithm. Long reads r1 to r3 are aligned to contigs c1 to c3 with a relatively low identity requirement based on the similar repeat based alignment approach in (1), and a contig graph is constructed to validate the alignments and correct the long reads based on the long read support based validation approach and the adjacent alignment based validation approach in (2). (1) The long read region r1(B) (region B of r1; the same notation is used below) is error rich, so it is aligned either to its true genome region in the contigs, c1(B), or to its similar repeat c3(E) (shaded). The long read regions r1(C), r2(C) and r3(C) do not have their true genome regions in the contigs and thus are aligned to their similar repeat c3(G) (shaded). The aligned contig region c1(A B) is split into c1(A) and c1(B), and the long read regions are split accordingly. (2) A contig graph is constructed, with vertices A, B, D, E and G representing the aligned contig regions connected by weighted edges. Edge (A,B) (the edge between vertices A and B; the same notation is used below) is weighted 0, since the contig regions A and B are adjacent. Edges (B,G) and (G,D) are weighted 0, since sufficient adjacent long read regions are aligned to contig regions B and G, and to G and D, respectively. As a result, a path of the minimum total edge weight to correct all the long reads is found, containing vertices A, B, G and D. The long read regions r1(C), r2(C) and r3(C) are corrected using their similar repeats and can be refined with the initial short reads.
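The path selection described in the Fig. 2 caption can be sketched as a shortest-path search over the contig graph. The caption states only that a path of minimum total edge weight is found; the use of Dijkstra's algorithm here is an assumption made for illustration, not necessarily the method HALC implements. The zero weights on (A,B), (B,G) and (G,D) come from the caption, whereas the edges (A,E) and (E,G) and their weight of 1 are hypothetical values standing in for unvalidated alignments.

```python
import heapq


def min_weight_path(edges, source, target):
    """Return (path, total_weight) for the lightest path from source to target.

    edges maps (u, v) vertex pairs to non-negative weights (directed).
    Dijkstra's algorithm is an assumed stand-in for the search in the caption.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, []).append((v, weight))

    best = {source: 0}
    previous = {}
    heap = [(0, source)]
    while heap:
        dist, u = heapq.heappop(heap)
        if u == target:
            break
        if dist > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, weight in adjacency.get(u, []):
            candidate = dist + weight
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                previous[v] = u
                heapq.heappush(heap, (candidate, v))

    if target not in best:
        return None, float("inf")

    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


# Vertices are the aligned contig regions named in the Fig. 2 caption.
edges = {
    ("A", "B"): 0,  # adjacent contig regions (from the caption)
    ("B", "G"): 0,  # supported by sufficient long reads (from the caption)
    ("G", "D"): 0,  # supported by sufficient long reads (from the caption)
    ("A", "E"): 1,  # hypothetical edge and weight
    ("E", "G"): 1,  # hypothetical edge and weight
}
path, total = min_weight_path(edges, "A", "D")
print(path, total)
```

With these weights the search returns the path through A, B, G and D, matching the path named in the caption; the alternative route through the similar repeat E is rejected because its hypothetical edges carry a higher weight.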
Evaluation of error correction performance
| Method | Throughput | Alignment ratio | Alignment identity | Genome fraction | N reads | Average read length | Sensitivity | Gain | Specificity |
|---|---|---|---|---|---|---|---|---|---|
| (a) E. coli | | | | | | | | | |
| Initial | 100.0% | 50.4% | 95.2% | 100.0% | 75152 | 2381 | - | - | - |
| PacBioToCA^a | 24.2% | 100.0% | 100.0% | 99.5% | 53447 | 810 | - | - | - |
| LSC | 53.5% | 98.7% | 99.9% | 99.7% | 115960 | 825 | 52.6% | 51.7% | 99.9% |
| Proovread | 57.4% | 100.0% | 99.9% | 99.7% | 44986 | 2284 | 57.4% | 56.8% | 99.9% |
| CoLoRMap | 42.8% | 99.7% | 100.0% | 99.9% | 70582 | 1084 | 42.7% | 42.2% | 99.9% |
| ECTools | 23.5% | 99.9% | 99.2% | 99.4% | 8095 | 5211 | 23.4% | 21.8% | 99.8% |
| LoRDEC | 60.8% | 97.8% | 100.0% | 99.8% | 70164 | 1549 | 60.7% | 60.5% | 100.0% |
| Jabba | 52.8% | 99.6% | 100.0% | 98.6% | 26459 | 3568 | 52.8% | 52.7% | 100.0% |
| HALC | 64.6% | 98.6% | 99.9% | 99.8% | 78731 | 1467 | 64.4% | 64.0% | 99.9% |
| (b) A. thaliana | | | | | | | | | |
| Initial | 100.0% | 32.4% | 92.4% | 82.4% | 490418 | 2645 | - | - | - |
| PacBioToCA^a | 10.7% | 99.2% | 99.7% | 63.9% | 260834 | 535 | - | - | - |
| LSC | 25.9% | 100.0% | 99.5% | 71.4% | 659123 | 509 | 24.2% | 22.3% | 99.7% |
| Proovread | 27.8% | 99.8% | 99.7% | 79.8% | 125786 | 2864 | 26.5% | 24.9% | 99.7% |
| CoLoRMap | 21.4% | 99.4% | 99.7% | 69.3% | 230933 | 1203 | 20.5% | 19.2% | 99.8% |
| ECTools | 11.3% | 99.8% | 99.5% | 63.1% | 21354 | 6886 | 10.8% | 9.8% | 99.8% |
| LoRDEC | 28.0% | 86.4% | 99.5% | 74.4% | 847963 | 428 | 25.9% | 22.8% | 99.6% |
| Jabba | 10.8% | 99.6% | 99.7% | 56.1% | 51353 | 2726 | 10.5% | 9.9% | 99.9% |
| HALC | 34.7% | 96.5% | 99.5% | 85.8% | 548872 | 819 | 33.2% | 29.7% | 99.3% |
| (c) Maylandia zebra | | | | | | | | | |
| Initial | 100.0% | 46.9% | 91.3% | 91.9% | 1307812 | 10082 | - | - | - |
| LoRDEC | 33.6% | 97.9% | 99.7% | 89.5% | 7372455 | 601 | 32.4% | 29.8% | 99.6% |
| HALC | 41.2% | 98.7% | 99.6% | 90.7% | 4833536 | 1123 | 40.2% | 37.5% | 99.4% |
The long reads of tests (a)-(c) are from E. coli, A. thaliana and Maylandia zebra, respectively. The initial long reads and the long reads error-corrected by PacBioToCA, LSC, Proovread, CoLoRMap, ECTools, LoRDEC, Jabba and HALC are compared in the tests. The performance measurements are listed in the “Performance measurements” section.
^a Some measurements are not available without the correspondence information between a split long read and its initial long read.
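The sensitivity, gain and specificity columns can be read with the definitions commonly used when evaluating read error correction. These definitions are an assumption here, since the “Performance measurements” section is not reproduced in this record: sensitivity = TP / (TP + FN), gain = (TP - FP) / (TP + FN) and specificity = TN / (TN + FP), where a true positive is an erroneous base that was corrected and a false positive is a correct base that was wrongly changed. The counts in the sketch below are hypothetical and are not taken from the table.

```python
def correction_metrics(tp, fp, tn, fn):
    """Base-level metrics under the assumed (commonly used) definitions.

    tp: erroneous bases that were corrected
    fp: correct bases that were wrongly changed
    tn: correct bases left unchanged
    fn: erroneous bases left uncorrected
    """
    errors = tp + fn          # all bases that were erroneous before correction
    correct = tn + fp         # all bases that were correct before correction
    return {
        "sensitivity": tp / errors,
        "gain": (tp - fp) / errors,
        "specificity": tn / correct,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = correction_metrics(tp=640, fp=4, tn=8996, fn=360)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Under these definitions gain can never exceed sensitivity, because it subtracts the newly introduced errors from the corrected ones; the table is consistent with that ordering in every row.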
Evaluation of long read assemblies
| Method | N Contigs | N50 | Largest contig length | N Covered bases | EPKB |
|---|---|---|---|---|---|
| (a) A. thaliana | | | | | |
| PacBioToCA | 1629 | 27806 | 110971 | 51672726 | 119.4 |
| LSC | 1284 | 29305 | 105354 | 40390383 | 128.6 |
| Proovread | 1193 | 37828 | 233854 | 50379300 | 123.9 |
| CoLoRMap | 1324 | 35477 | 150127 | 52913748 | 113.1 |
| ECTools | 1218 | 40122 | 182238 | 56377176 | 143.5 |
| LoRDEC | 1331 | 28133 | 104370 | 40034274 | 145.1 |
| Jabba | 618 | 29548 | 87500 | 22681404 | 102.1 |
| HALC | 1147 | 44684 | 296154 | 54821730 | 119.9 |
| (b) Maylandia zebra | | | | | |
| LoRDEC | 37460 | 8878 | 115008 | 204752772 | 121.6 |
| HALC | 41434 | 12062 | 153787 | 324763656 | 92.9 |
The contigs of tests (a)-(b) are for A. thaliana and Maylandia zebra, respectively. The contigs assembled from the long reads error-corrected by PacBioToCA, LSC, Proovread, CoLoRMap, ECTools, LoRDEC, Jabba and HALC are compared in the tests. The performance measurements are listed in the “Performance measurements” section.
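For readers unfamiliar with the N50 column, the sketch below computes it under the standard definition: the length of the contig at which the running total, taken over contigs sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths in the example are hypothetical and are not taken from the table.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare doubled running total to avoid fractional halves.
        if 2 * running >= total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example_lengths))
```

Because N50 depends only on the length distribution, it says nothing about correctness, which is why the table reports it alongside the number of covered bases and EPKB.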
Evaluation of error correction performance on the transcriptomic data set of S. cerevisiae
| Method | Throughput | Alignment ratio | Alignment identity | Transcriptome fraction |
|---|---|---|---|---|
| Initial | 100.0% | 7.0% | 99.5% | 17.0% |
| LSC | 31.2% | 20.2% | 99.8% | 16.5% |
| CoLoRMap | 26.0% | 29.8% | 99.8% | 20.6% |
| LoRDEC | 33.2% | 18.8% | 99.9% | 16.6% |
| Jabba | 33.7% | 14.9% | 100.0% | 15.5% |
| HALC | 49.8% | 30.5% | 99.4% | 21.4% |
The initial long reads and the long reads error-corrected by LSC, CoLoRMap, LoRDEC, Jabba and HALC are compared. The performance measurements are listed in the “Performance measurements” section.
Fig. 3 Percentage of genome covered above various long read coverages on the E. coli data. The percentage of genome bases (y-axis) is plotted against long read coverage from 1× to 50× (x-axis), corresponding to the error correction results of the different algorithms in Table 2(a).
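The quantity plotted in Fig. 3 can be sketched as follows, assuming that each corrected long read has already been aligned to the genome and reduced to a half-open interval [start, end) of genome positions. The function name, the interval representation and the toy data are assumptions made for illustration; the record does not describe how the authors computed the curve.

```python
def coverage_percentages(genome_length, alignments, thresholds):
    """Percentage of genome bases covered by at least t reads, for each t.

    alignments: iterable of half-open (start, end) genome intervals.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1

    # Prefix sums turn the difference array into per-base depth.
    depth = []
    current = 0
    for position in range(genome_length):
        current += diff[position]
        depth.append(current)

    return {
        t: 100.0 * sum(1 for d in depth if d >= t) / genome_length
        for t in thresholds
    }


# Toy genome of 10 bases with three hypothetical read alignments.
result = coverage_percentages(10, [(0, 6), (4, 10), (5, 8)], [1, 2, 3])
print(result)
```

Evaluating the function at thresholds 1 to 50 would give one curve of the kind shown in the figure for each error correction method.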