Jeremy R Wang, James Holt, Leonard McMillan, Corbin D Jones.
Abstract
BACKGROUND: Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limit their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging "hybrid" assemblies that use long reads for scaffolding and short reads for accuracy.
Keywords: BWT; FM-Index; Hybrid error correction; Long read; Pacbio; de novo assembly
Year: 2018 PMID: 29426289 PMCID: PMC5807796 DOI: 10.1186/s12859-018-2051-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Illustration of the seed-and-bridge correction strategy using short and long k-mers. Implicit de Bruijn graphs with arbitrary k can be inferred from an FM-index. The use of a short, fixed k often does not resolve “hairball” and other structures in the graph caused by low-complexity and repetitive genomic elements. Longer K-mers may dramatically simplify the bridging step if sufficiently long seeds can be found. Illustrative seed-and-bridge paths are shown for short k-mer and long K-mer graphs. Seed k-mers are shown in orange, and the correct path in black. The two-pass (k, K) seed-and-bridge correction implemented in FMLRC allows the correction of short, nonrepetitive segments in the first pass, then seeding larger K-mers and bridging to resolve more complex sequences.
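The idea in the caption can be sketched in a few lines of Python. This is a toy illustration, not FMLRC's implementation: the `FMIndex` class, `successors`, `bridge`, the `min_count` threshold, and the example sequence are all hypothetical, the index is built by naive rotation sorting, and reverse complements are ignored. It shows how k-mer counts from backward search define the edges of an implicit de Bruijn graph for any k, and how a longer K can remove an ambiguity that a short k leaves open.

```python
ALPHABET = "ACGT"

class FMIndex:
    """Toy FM-index over a set of short reads (naive construction, demo only)."""

    def __init__(self, reads):
        text = "".join(read + "$" for read in reads)
        n = len(text)
        # Sort all rotations of the concatenated text; quadratic, fine for a demo.
        order = sorted(range(n), key=lambda i: text[i:] + text[:i])
        self.bwt = "".join(text[i - 1] for i in order)
        # first[c]: number of symbols in the text that sort before c.
        self.first, total = {}, 0
        for c in sorted(set(text)):
            self.first[c] = total
            total += text.count(c)
        # occ[c][i]: occurrences of c in bwt[:i].
        self.occ = {c: [0] * (n + 1) for c in self.first}
        for i, symbol in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (symbol == c)

    def count(self, pattern):
        """Number of occurrences of pattern in the reads, via backward search."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo

def successors(index, kmer, min_count=2):
    """Out-neighbours of kmer in the implicit de Bruijn graph with k = len(kmer)."""
    candidates = (kmer[1:] + base for base in ALPHABET)
    return [c for c in candidates if index.count(c) >= min_count]

def bridge(index, seed, target, max_steps, min_count=2):
    """Depth-first search for all paths from the seed k-mer to the target k-mer."""
    paths = []
    stack = [(seed, seed)]  # (current k-mer, sequence spelled so far)
    while stack:
        kmer, seq = stack.pop()
        if kmer == target and len(seq) > len(seed):
            paths.append(seq)
            continue
        if len(seq) - len(seed) >= max_steps:
            continue
        for nxt in successors(index, kmer, min_count):
            stack.append((nxt, seq + nxt[-1]))
    return paths

genome = "AACGTTCAGGCGTTGACC"  # hypothetical sequence; "CGTT" occurs twice
reads = [genome[i:i + 10] for i in range(len(genome) - 9)] * 2
reads.append("AACGATCAGG")     # one read with an error, filtered by min_count
index = FMIndex(reads)

short_paths = bridge(index, "AACG", "GACC", max_steps=14)      # k = 4
long_paths = bridge(index, "AACGTT", "TTGACC", max_steps=12)   # K = 6
print(len(short_paths), len(long_paths))  # 2 1
```

With k = 4 the repeated "CGTT" creates a branch, so the bridge finds both the true sequence and a shortcut that skips the segment between the repeats. With K = 6 every k-mer has a single supported successor and only the true path remains.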
Choosing k and K
| k | K = – | K = 49 | K = 59 | K = 69 | K = 79 | K = 89 |
|---|---|---|---|---|---|---|
| E. coli: matching bases | | | | | | |
| 17 | 404736174 | | 404809579 | 404608992 | 404361044 | 403888297 |
| 19 | 403117580 | 404580392 | 404571352 | 404418084 | 404297325 | 403826761 |
| 21 | 403002237 | 404365615 | 404367089 | 404255830 | 404131272 | 403841312 |
| 23 | 403516577 | 404381062 | 404378242 | 404240202 | 404107504 | 404041491 |
| 25 | 403819785 | 404461301 | 404480970 | 404453527 | 404363292 | 404236385 |
| E. coli: gain | | | | | | |
| 17 | 0.1011 | 0.388 | 0.4709 | 0.5258 | 0.5468 | 0.521 |
| 19 | 0.3823 | 0.5887 | 0.612 | 0.6245 | 0.6279 | 0.6172 |
| 21 | 0.4879 | 0.634 | 0.6429 | 0.6459 | 0.6442 | 0.6345 |
| 23 | 0.5137 | 0.641 | 0.6474 | | 0.6457 | 0.6361 |
| 25 | 0.523 | 0.6396 | 0.6453 | 0.6461 | 0.6422 | 0.6318 |
| S. cerevisiae: matching bases | | | | | | |
| 17 | 1250679980 | | 1252340540 | 1251288299 | 1250445441 | 1249925285 |
| 19 | 1250052124 | 1252517259 | 1252462544 | 1252139063 | 1251853858 | 1251785285 |
| 21 | 1248322270 | 1251887685 | 1251963458 | 1251672602 | 1251758116 | 1251744201 |
| 23 | 1248801294 | 1252245368 | 1252387319 | 1252408890 | 1252545735 | 1252558864 |
| 25 | 1249574404 | 1252269051 | 1252478532 | 1252557840 | 1252778626 | 1252739127 |
| S. cerevisiae: gain | | | | | | |
| 17 | 0.0264 | 0.224 | 0.3159 | 0.3946 | 0.452 | 0.4871 |
| 19 | 0.1172 | 0.3903 | 0.443 | 0.4822 | 0.5096 | 0.5273 |
| 21 | 0.2527 | 0.4938 | 0.5129 | 0.527 | 0.5367 | 0.5434 |
| 23 | 0.3319 | 0.5153 | 0.5251 | 0.5332 | 0.5388 | |
| 25 | 0.3728 | 0.5155 | 0.5226 | 0.5287 | 0.5334 | 0.5372 |
This table shows the results of running FMLRC with many different values of k and K on E. coli and S. cerevisiae datasets.
Test cases with K = – had no second correction pass using the long K-mer, so they use a single short k-mer pass only. After correcting the reads, we aligned the results using BLASR [22] and gathered statistics on the alignments. Matching bases is the number of matching bases across all mappings. Gain is defined as (TP−FP)/(TP+FN) (see “Correction accuracy” section). For each statistic, the best result is bolded in the table above. To summarize, increasing k and K tends to increase the gain but decrease the total matching bases, a tradeoff between sensitivity and specificity. Additionally, every tested value of K for the long K-mer pass improves the results over a single k-mer pass.
After aligning the corrected reads to a reference genome, sensitivity, specificity, and gain were computed
| Method | Reads aligned | TP | FN | Sens. | Spec. | Gain | CPU (s) | Mem (GB) |
|---|---|---|---|---|---|---|---|---|
| E. coli K12 | | | | | | | | |
| Canu | 58982 | 657793 | 1571579 | 0.2951 | 0.9987 | 0.1482 | 17421 | 2.99 |
| CoLoRMap | 81485 | 5538038 | 35474190 | 0.1350 | 0.9998 | 0.1332 | 137777 | 23.06 |
| FMLRC | | 13562639 | 15069242 | 0.4737 | 0.9996 | 0.4689 | | 4.60 |
| Jabba | 75620 | | | | | | 22922 | 63.87 |
| LoRDEC | 81138 | 3278911 | 36691424 | 0.0820 | 0.9998 | 0.0808 | 61305 | |
| LoRMA | 81051 | 1135657 | 144161 | 0.8874 | | 0.8669 | 54240 | 45.72 |
| Sprai | 75532 | 463636 | 293039 | 0.6127 | | 0.5783 | 44302 | 33.53 |
| S. cerevisiae W303 | | | | | | | | |
| Canu | 142765 | 1562542 | 5441239 | 0.2231 | 0.9992 | 0.1401 | 108175 | |
| CoLoRMap | 210423 | 18901871 | 79065115 | 0.1929 | 0.9992 | 0.1857 | 2815200 | 45.49 |
| FMLRC | 211270 | | 49204291 | 0.3929 | 0.9991 | 0.3829 | | 17.77 |
| Jabba | | 29893606 | | | | | 187968 | 367.92 |
| LoRDEC | 210151 | 8468872 | 96577493 | 0.0806 | 0.9997 | 0.0776 | 212495 | 3.56 |
| LoRMA | 204323 | 3063164 | 221583 | 0.9325 | | 0.9176 | 223358 | 49.52 |
| Sprai | 192670 | 2013269 | 3063751 | 0.3965 | 0.9996 | 0.3288 | 215261 | 49.52 |
| A. thaliana Ler-0 | | | | | | | | |
| Canu | 574065 | 12002535 | 72017120 | 0.1429 | 0.9986 | 0.1030 | 1301971 | 10.92 |
| CoLoRMap | 1075381 | 170235345 | 2056204621 | 0.0765 | 0.9983 | 0.0737 | 6802359 | 106.08 |
| FMLRC | | | 1164804073 | 0.2751 | 0.993 | 0.2601 | | 16.26 |
| Jabba | 813495 | 320742341 | 2945173 | | 0.9968 | | 3641309 | 333.44 |
| LoRDEC | 1113617 | 78276025 | 2022520259 | 0.0373 | 0.9979 | 0.0337 | 1111800 | |
| LoRMA | 903298 | 2217661 | | 0.4439 | 0.9986 | 0.3715 | 17281259 | 70.28 |
| Sprai | 751684 | 18960255 | 30734331 | 0.3815 | | 0.3631 | 5996657 | 8.11 |
For A. thaliana and S. cerevisiae, FMLRC produced more total true positives (corrected loci) than any other method while maintaining competitive sensitivity and gain. Methods with higher average specificity, notably Jabba, often discard a higher proportion of reads, reporting only those with the highest-confidence corrected sequence. FMLRC also requires significantly less CPU time than other hybrid error correction methods, and comparable memory. LoRDEC and FMLRC CPU time and memory results include construction of the BWT.
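The accuracy columns can be reproduced from raw counts with a short calculation. The sketch below uses the gain definition given in the text, (TP−FP)/(TP+FN), and assumes the standard definitions of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP). The function name `correction_metrics` is hypothetical. TP and FN are the E. coli Canu values from the table; the FP and TN counts are not reported there, so the values used here are placeholders only.

```python
def correction_metrics(tp, fp, fn, tn):
    """Return (sensitivity, specificity, gain) from per-base correction counts."""
    sensitivity = tp / (tp + fn)   # fraction of true errors that were corrected
    specificity = tn / (tn + fp)   # fraction of correct bases left untouched
    gain = (tp - fp) / (tp + fn)   # net errors removed, relative to all errors
    return sensitivity, specificity, gain

# TP and FN from the E. coli Canu row; FP and TN are placeholder values.
sens, spec, gain = correction_metrics(tp=657793, fp=1000, fn=1571579, tn=999000)
print(round(sens, 4))  # 0.2951, matching the Sens. column for that row
```

Because gain subtracts false positives before normalizing, it can never exceed sensitivity, and it goes negative when a method introduces more errors than it removes.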
Long-read and hybrid correction and assembly methods
| Method | Correction | Assembly | Preassembly | Citation |
|---|---|---|---|---|
| Miniasm | | Long-read | | [ |
| Canu | Long-read | Long-read | | [ |
| Sprai | Long-read | | | [ |
| LoRMA | Long-read | | | [ |
| hybridSPAdes | | Hybrid | | [ |
| DBG2OLC | | Hybrid | X | [ |
| Cerulean | | Hybrid | X | [ |
| ECTools | Hybrid | | X | [ |
| LoRDEC | Hybrid | | | [ |
| Jabba | Hybrid | | | [ |
| CoLoRMap | Hybrid | | | [ |
| Nanocorr | Hybrid | | | [ |
| FMLRC | Hybrid | | | Our method |
All of the compared methods are shown along with their mode of error correction and assembly, each either long-read only or “hybrid” using complementary short-read data. “Preassembly” indicates whether a hybrid method requires the short-read data to be preassembled using a different method.
Long-read and hybrid correction assembly statistics
| Dataset | Method | # contigs | N50 | Genome fraction | Error rate |
|---|---|---|---|---|---|
| E. coli | Canu + Miniasm | | 4631922 | 99.832 | 0.00444 |
| Genome: 5Mb | CoLoRMap + Miniasm | | 4723063 | 84.485 | 0.02322 |
| Pacbio: 450Mb | FMLRC + Miniasm | | 4646838 | 99.757 | 0.00029 |
| Illumina: 3.4Gb | Jabba + Miniasm | 76 | 72751 | 92.923 | |
| | LoRDEC + Miniasm | | 4688727 | 97.504 | 0.00321 |
| | LoRMA + Miniasm | 107 | 64214 | 89.871 | 0.00099 |
| | Sprai + Miniasm | | 4639974 | 99.989 | 0.00092 |
| | Miniasm | | | 0.002 | 0.01333 |
| | Canu | 2 | | | 0.00012 |
| | hybridSPAdes | 2 | 4469733 | 99.967 | 0.0001078 |
| | DBG2OLC | 2 | 4585967 | 98.210 | 0.00225 |
| | Cerulean | 16 | 1258842 | 98.959 | 0.09500 |
| S. cerevisiae | Canu + Miniasm | 36 | 729798 | | 0.00521 |
| Genome: 12Mb | CoLoRMap + Miniasm | | 766539 | 83.464 | 0.00761 |
| Pacbio: 1.3Gb | FMLRC + Miniasm | 32 | 771324 | 87.717 | 0.00175 |
| Illumina: 18Gb | Jabba + Miniasm | 186 | 62337 | 72.239 | |
| | LoRDEC + Miniasm | 61 | 597849 | 85.563 | 0.00941 |
| | LoRMA + Miniasm | 292 | 49850 | 74.632 | 0.00103 |
| | Sprai + Miniasm | 39 | 561985 | 88.697 | 0.00193 |
| | Miniasm | 29 | 566484 | 0.009 | 0.03738 |
| | Canu | 26 | | | 0.00068 |
| | hybridSPAdes | 229 | 568823 | 87.303 | 0.00899 |
| | DBG2OLC | 32 | 530806 | 0.067 | 0.03407 |
| | Cerulean | 78 | 466556 | | 0.03687 |
| A. thaliana | Canu + Miniasm | 2100 | 74153 | 81.844 | 0.01410 |
| Genome: 120Mb | CoLoRMap + Miniasm | 963 | 404022 | 60.225 | 0.01853 |
| Pacbio: 11Gb | FMLRC + Miniasm | 1923 | 57751 | 67.275 | 0.00401 |
| Illumina: 13Gb | Jabba + Miniasm | 1632 | 57307 | 62.796 | |
| | LoRDEC + Miniasm | 2232 | 30229 | 43.088 | 0.01107 |
| | LoRMA + Miniasm | | 26316 | 0.361 | 0.00448 |
| | Sprai + Miniasm | 1475 | 169744 | 91.070 | 0.00824 |
| | Miniasm | 740 | 615512 | 0.003 | 0.03409 |
| | Canu | 419 | | | 0.00760 |
| | DBG2OLC | 440 | 754404 | 87.477 | 0.00388 |
Miniasm performs neither read correction nor consensus calling, so the resulting assembly has the same error profile as the input reads.
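For readers unfamiliar with the N50 column, the sketch below computes it under the usual definition: the contig length L such that contigs of length at least L together cover at least half of the total assembly length. The source does not say which tool produced its statistics, so this is an assumption about the metric rather than a description of the authors' pipeline, and the contig lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    if not contig_lengths:
        raise ValueError("n50 is undefined for an empty assembly")
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical contig lengths: total 54, and 10 + 9 + 8 = 27 reaches half.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

N50 says nothing about correctness on its own, which is why the table reports it alongside genome fraction and error rate.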