Giles Miclotte1, Mahdi Heydari1, Piet Demeester1, Stephane Rombauts2, Yves Van de Peer3, Pieter Audenaert1, Jan Fostier1.
Abstract
BACKGROUND: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned.
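As a rough illustration of the first step named in the abstract, the sketch below builds a plain (uncompacted) de Bruijn graph from a handful of short reads. The function name `build_de_bruijn`, the toy reads, and the choice of k are assumptions made for this example only; the graph construction used by the tools discussed here may well differ (for instance by compacting unbranched paths or filtering low-coverage k-mers).

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read.

    Hypothetical helper for illustration: nodes are (k-1)-mers and every
    k-mer seen in a read contributes one edge.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy input (made up): two overlapping short reads.
reads = ["ACGTAC", "CGTACG"]
graph = build_de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because both toy reads share the same k-mers, they collapse onto the same four edges, which is the property that lets a graph summarise highly redundant short-read data.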
Keywords: Error correction; Maximal exact matches; Sequence analysis; de Bruijn graph
Year: 2016 PMID: 27148393 PMCID: PMC4855726 DOI: 10.1186/s13015-016-0075-7
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 To align a read to the de Bruijn graph, a seed-and-extend algorithm is used. First, MEMs are found between the read and the graph; then a path in the graph is found between these seeds, creating the final alignment
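The seeding step in the caption above can be sketched as follows. This is a minimal, quadratic-time scan that reports maximal exact matches (MEMs) between a read and a single sequence, which stands in here for the sequence of one graph node. The function name `find_mems`, the toy strings, and the `min_len` threshold are assumptions for illustration; a practical tool would use an index rather than this brute-force comparison.

```python
def find_mems(read, ref, min_len):
    """Return (read_pos, ref_pos, length) for each maximal exact match.

    A match is maximal when it cannot be extended to the left or to the
    right without hitting a mismatch or the end of either sequence.
    """
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Not left-maximal: the match would have started one base earlier.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the bases agree.
            length = 0
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# Toy example (made up): the read differs from the reference at two positions.
read = "ACGTTTGCA"
ref = "ACGTAAGCA"
mems = find_mems(read, ref, min_len=3)
print(mems)
```

The two seeds found here flank the erroneous bases in the read; in a seed-and-extend scheme, the region between consecutive seeds is what the path search through the graph has to bridge.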
Fig. 2 Expected coverage by exact regions of size for reads of size 10,000 with and errors, expressed as percentages of the whole read as a function of the minimal size of the exact regions
Fig. 3 Expected percentage of reads of size 10,000 that contain at least one exact region of size , for reads with and errors
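One way to reason about the quantity in Fig. 3 is a simple independence model: assume every base of a read is wrong with the same probability p, independently of its neighbours, and ask how likely it is that the read contains at least one error-free stretch of k or more bases. The sketch below computes that probability exactly with a small dynamic program. The model, the function name `prob_exact_region`, and the error rate 0.15 are assumptions chosen for illustration; the error rates used in the figures are not legible in this record, and the figures may rest on a different derivation.

```python
def prob_exact_region(n, k, p):
    """Probability that n bases contain an error-free run of length >= k.

    Assumes each base is erroneous independently with probability p.
    state[r] holds the probability that the current error-free run has
    length r and no run of length k has occurred so far.
    """
    state = [0.0] * k
    state[0] = 1.0
    hit = 0.0  # probability mass that has already reached a run of length k
    for _ in range(n):
        new = [0.0] * k
        for r, mass in enumerate(state):
            if mass == 0.0:
                continue
            new[0] += mass * p  # an error resets the run
            if r + 1 == k:
                hit += mass * (1.0 - p)  # the run reaches k: absorb
            else:
                new[r + 1] += mass * (1.0 - p)
        state = new
    return hit


# Illustrative sweep over the minimal region size for a 10,000-base read.
for k in (20, 40, 60):
    print(k, round(prob_exact_region(10_000, k, 0.15), 4))
```

Under this model the probability drops quickly once k grows beyond the typical distance between errors, which is why the minimal seed length matters when seeding on noisy long reads.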
The data sets and reference genomes
| | ID | Number of reads | Number of bases (Mbp) | Maximal read length (bp) | N50 (bp) | Estimated coverage |
|---|---|---|---|---|---|---|
| E. coli | | | | | | |
| Reference | NC_000913 (a) | | | | | |
| Short reads | ART | 28.4 M | 2840 | 100 | 100 | 600× |
| Long reads | SRR1284073 (b) | 163 K | 649 | 49,424 | 13,578 | 135× |
| A. hydrophila | | | | | | |
| Reference | NC_008570 (a) | | | | | |
| Short reads | ART | 4.74 M | 474 | 100 | 100 | 100× |
| Long reads | | 515 | 4.74 | 24,430 | 10,421 | 1× |
| S. cerevisiae | | | | | | |
| Reference | NC_001133 (a) | | | | | |
| Short reads | ART | 9.72 M | 2430 | 250 | 250 | 200× |
| Long reads | SRR1284074 (b) | 1.96 M | 5580 | 37,008 | 3973 | 453× |
| | SRR1284662 (b) | | | | | |
| O. tauri | | | | | | |
| Reference | NC_014426 (a) | | | | | |
| Short reads | [ | 9.72 M | 1778 | 76 | 76 | 135× |
| Long reads | [ | 225 K | 1135 | 22,892 | 7322 | 86× |
| A. thaliana | | | | | | |
| Reference | NC_003070 (a) | | | | | |
| Short reads | ART | 23.9 M | 5975 | 250 | 250 | 49× |
| Long reads | SRR1284093 (b) | 327 K | 1439 | 86,350 | 14,256 | 12× |
| | SRR1284094 (b) | | | | | |
| D. melanogaster | | | | | | |
| Reference | Release 5 (c) | | | | | |
| Short reads | ART | 24.1 M | 6025 | 250 | 250 | 49× |
| Long reads | SRR1204085 (b) | 327 K | 686 | 55,988 | 12,478 | 6× |
| | SRR1204086 (b) | | | | | |

(a) Reference genome available at http://www.ncbi.nlm.nih.gov/nuccore
(b) Reads available at http://www.ncbi.nlm.nih.gov/sra
(c) Reference genome available at http://www.fruitfly.org/sequence/release5genomic.shtml
Average CPU time per read for LoRDEC, proovread and Jabba
| | LoRDEC (ms) | proovread (ms) | Jabba (ms) |
|---|---|---|---|
| E. coli | 111 | 1782 | 47 |
| A. hydrophila | 582 | 5652 | 11 |
| S. cerevisiae | 172 | – | 28 |
| O. tauri | 462 | 3165 | 9 |
| A. thaliana | 633 | 2128 | 100 |
| D. melanogaster | 289 | 1699 | 53 |
Results for proovread on S. cerevisiae have been left out because they did not compute in 3 days
Fig. 4 Nx plots for E. coli, A. hydrophila and S. cerevisiae
Fig. 5 Nx plots for O. tauri, A. thaliana and D. melanogaster
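For readers unfamiliar with the Nx statistic plotted in Figs. 4 and 5 (N50 is the case x = 50), the sketch below shows one common way to compute it from a list of read lengths. The function name `nx` and the toy lengths are assumptions for illustration, and this follows the usual definition relative to the total length of the given read set; the record does not state which total the plots are normalised against.

```python
def nx(lengths, x):
    """Largest length L such that reads of length >= L hold at least x% of all bases.

    Returns 0 for an empty input.
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest read down until x% of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * x:
            return length
    return 0


# Toy read lengths (made up), 20 bases in total.
lengths = [2, 3, 4, 5, 6]
n50 = nx(lengths, 50)
print("N50 =", n50)
```

An Nx plot is obtained by evaluating this function for x from 0 to 100; a curve that stays high for longer indicates that more of the data sits in long reads.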
Results for LoRDEC, proovread and Jabba
| | Gain (%) | Accuracy (%) | Error-free (%) | Aligned (%) | Throughput (%) | N50 (bp) |
|---|---|---|---|---|---|---|
| E. coli | | | | | | |
| Uncorrected reads | | 85.16 | 0 | 59.16 | | 13,578 |
| LoRDEC_n | 96.46 | 99.47 | 13.74 | 82.16 | 62.30 | 4661 |
| LoRDEC | 98.83 | 99.82 | 79.31 | 88.95 | 63.70 | 7618 |
| proovread | 99.64 | 99.94 | 89.64 | 99.57 | 58.65 | 5706 |
| Jabba | 99.70 | 99.95 | 95.70 | 99.23 | 57.04 | 12,760 |
| A. hydrophila | | | | | | |
| Uncorrected reads | | 86.84 | 0 | 100 | | 10,421 |
| LoRDEC_n | 99.21 | 99.89 | 25.29 | 96.72 | 94.79 | 7625 |
| LoRDEC | 99.93 | 99.99 | 86.74 | 99.76 | 95.35 | 9695 |
| proovread | 99.99 | 99.99 | 96.53 | 99.99 | 95.40 | 9803 |
| Jabba | 99.74 | 99.96 | 97.66 | 99.98 | 98.04 | 10,215 |
| S. cerevisiae | | | | | | |
| Uncorrected reads | | 83.21 | 1.50 | 27.99 | | 3969 |
| LoRDEC_n | 91.17 | 98.51 | 44.02 | 77.77 | 21.72 | 2869 |
| LoRDEC | 92.08 | 98.67 | 60.82 | 83.12 | 30.43 | 3802 |
| proovread | – | – | – | – | – | – |
| Jabba | 99.87 | 99.97 | 98.35 | 99.93 | 27.67 | 8373 |
| O. tauri | | | | | | |
| Uncorrected reads | | 83.83 | 0.05 | 23.10 | | 7322 |
| LoRDEC_n | 91.04 | 98.55 | 63.60 | 85.05 | 31.43 | 985 |
| LoRDEC | 91.51 | 98.62 | 66.76 | 85.42 | 31.54 | 1043 |
| proovread | 98.11 | 99.69 | 80.28 | 90.55 | 26.31 | 1501 |
| Jabba | 99.06 | 99.84 | 83.33 | 93.31 | 13.81 | 4183 |
| A. thaliana | | | | | | |
| Uncorrected reads | | 83.32 | 8.00 | 47.82 | | 14,256 |
| LoRDEC | 90.43 | 98.40 | 59.35 | 50.69 | 46.09 | 904 |
| proovread | 91.11 | 98.51 | 69.71 | 96.66 | 42.08 | 7788 |
| Jabba | 99.47 | 99.91 | 96.67 | 99.85 | 39.87 | 12,647 |
| D. melanogaster | | | | | | |
| Uncorrected reads | | 85.70 | 22.97 | 41.72 | | 12,478 |
| LoRDEC | 89.18 | 98.45 | 54.29 | 49.24 | 44.78 | 1119 |
| proovread | 97.07 | 99.58 | 67.72 | 98.36 | 43.49 | 11,476 |
| Jabba | 99.51 | 99.93 | 96.24 | 99.81 | 38.20 | 15,553 |
| Jabba_p | 99.51 | 99.93 | 96.24 | 99.82 | 38.22 | 15,564 |
Results for proovread on S. cerevisiae have been left out because they did not compute in 3 days. The subscript p indicates that the tool used the reference genome instead of short reads. The subscript n indicates that the tool used uncorrected short reads
Peak memory usage for LoRDEC, proovread and Jabba
| | LoRDEC (MB) | proovread (MB) | Jabba (MB) |
|---|---|---|---|
| E. coli | 2946 | 17,035 | 175 |
| A. hydrophila | 1205 | 617 | 103 |
| S. cerevisiae | 2693 | – | 401 |
| O. tauri | 2208 | 12,963 | 328 |
| A. thaliana | 3876 | 7042 | 5098 |
| D. melanogaster | 3936 | 6656 | 4099 |
Results for proovread on S. cerevisiae have been left out because they did not compute in 3 days
Peak memory usage for the index in Jabba, with different sparseness factors on A. hydrophila
| Sparseness factor | Memory (MB) |
|---|---|
| 1 | 103 |
| 2 | 62 |
| 3 | 48 |
| 4 | 41 |
Runtimes and peak memory usage for Karect, with a limit of 64 GB memory
| | CPU time (h) | Memory (GB) |
|---|---|---|
| E. coli | 5.26 | 35.0 |
| A. hydrophila | 3.57 | 9.7 |
| S. cerevisiae | 1.99 | 60.4 |
| O. tauri | 1.60 | 37.7 |
| A. thaliana | 18.27 | 50.3 |
| D. melanogaster | 16.01 | 50.6 |
Throughput and N50 for proovread on A. hydrophila without preprocessing the short reads with Karect
| | Throughput | N50 |
|---|---|---|
| proovread without Karect | 94.59 % | 7303 bp |
| proovread with Karect | 95.40 % | 9803 bp |