Ehsan Haghshenas, Hossein Asghari, Jens Stoye, Cedric Chauve, Faraz Hach.
Abstract
Third-generation sequencing technologies from companies such as Oxford Nanopore and Pacific Biosciences have paved the way for building more contiguous and potentially gap-free assemblies. The larger effective length of their reads has provided a means to overcome the challenges of short to mid-range repeats. Currently, accurate long read assemblers are computationally expensive, whereas faster methods are not as accurate. Moreover, despite recent advances in third-generation sequencing, researchers still tend to generate accurate short reads for many of the analysis tasks. Here, we present HASLR, a hybrid assembler that uses error-prone long reads together with high-quality short reads to efficiently generate accurate genome assemblies. Our experiments show that HASLR is not only the fastest assembler but also the one with the lowest number of misassemblies on most of the samples, while being on par with other assemblers in terms of contiguity and accuracy.
Keywords: Bioinformatics; Genomics; Sequence Analysis
Year: 2020 PMID: 32781410 PMCID: PMC7419660 DOI: 10.1016/j.isci.2020.101389
Source DB: PubMed Journal: iScience ISSN: 2589-0042
Figure 1. Two Backbone Graphs Built from a Real PacBio Dataset Sequenced from a Yeast Genome
Each graph is visualized with Bandage (Wick et al., 2015) and colored using its rainbow coloring feature. Each chromosome is colored with a full rainbow spectrum. (Left) The backbone graph built from all SR contigs. (Right) The backbone graph built from unique SR contigs. As can be seen, using only unique SR contigs to build the backbone graph resolves many of the complexities and ambiguities in the graph. However, excluding non-unique SR contigs can produce a more fragmented graph (some chromosomes are split into multiple paths rather than a single one) and therefore a more fragmented assembly.
Figure 2. Precision and Recall Results in Identification of Unique Short Read Contigs on Six Different Reference Genomes
Precision is shown with blue dots and recall with orange dots. Precision is consistently high across the experiments, and in every experiment recall jumps sharply at a length threshold of 300.
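For readers less familiar with the two metrics in Figure 2, the following is a minimal sketch of how precision and recall could be computed for a set of contigs predicted to be unique. The function name and the contig identifiers are hypothetical, and the sketch does not model the length threshold or the way uniqueness is predicted, since the caption does not describe either.

```python
def precision_recall(predicted_unique, truly_unique):
    """Return (precision, recall) for contigs predicted to be unique.

    predicted_unique: iterable of contig ids flagged as unique (hypothetical input)
    truly_unique: iterable of contig ids that are unique in the reference
    """
    predicted = set(predicted_unique)
    truth = set(truly_unique)
    true_positives = len(predicted & truth)
    # Precision: fraction of predicted-unique contigs that really are unique.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of truly unique contigs that were recovered.
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Illustrative ids only; these are not values from the paper.
p, r = precision_recall({"c1", "c2", "c3"}, {"c1", "c2", "c4", "c5"})
print(f"precision={p:.3f} recall={r:.3f}")
```

High precision with lower recall, as in the figure at small thresholds, would mean that few repetitive contigs are mislabeled as unique while some truly unique contigs are missed.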
Comparison between Draft Assemblies Obtained by Different Tools on Simulated Data
| Genome | Assembler | Contigs | Genome Fraction (%) | NGA50 | Misassemblies (Extensive + Local) | Mismatch Rate | Indel Rate | Time | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| E. coli | Canu | 1 | 99.648 | 4,625,313 | 0 + 0 | 0.86 | 15.85 | 30:18 | 4.16 |
| | Flye | 1 | 99.937 | 4,639,833 | 0 + 0 | 0.34 | 25.31 | 5:59 | 12.10 |
| | wtdbg2 | 135 | 96.158 | 107,864 | 4 + 79 | 216.99 | 492.12 | 0:46 | 19.36 |
| | miniasm | 4 | 99.470 | 4,178,447 | 0 + 1 | 52.24 | 646.11 | 0:41 | 2.56 |
| | Minia | 162 | 97.713 | 58,763 | 0 + 0 | 0.26 | 0.00 | 0:26 | 3.04 |
| | SPAdes | 79 | 98.333 | 176,163 | 1 + 2 | 1.69 | 0.11 | 6:56 | 113.92 |
| | hybridSPAdes | 1 | 100.000 | 4,641,652 | 0 + 0 | 6.18 | 0.32 | 8:05 | 113.92 |
| | Unicycler | 1 | 99.997 | 4,641,530 | 0 + 0 | 3.12 | 0.45 | 18:43 | 21.56 |
| | DBG2OLC | 2 | 92.497 | 2,647,379 | 0 + 0 | 0.28 | 30.05 | 4:37 | 1.35 |
| | MaSuRCA | 1 | 99.874 | 4,636,209 | 0 + 4 | 0.56 | 0.19 | 5:21 | 32.52 |
| | Wengan | 1 | 100.000 | 4,641,731 | 0 + 0 | 2.54 | 5.36 | 2:21 | 3.19 |
| | HASLR | 1 | 99.999 | 4,643,699 | 0 + 0 | 2.00 | 42.89 | 0:41 | 3.04 |
| Yeast | Canu | 21 | 98.831 | 910,628 | 0 + 0 | 3.18 | 25.44 | 44:10 | 5.51 |
| | Flye | 19 | 99.418 | 916,686 | 6 + 1 | 11.37 | 49.72 | 9:03 | 19.65 |
| | wtdbg2 | 490 | 92.871 | 77,726 | 24 + 191 | 259.00 | 577.63 | 1:58 | 28.35 |
| | miniasm | 18 | 96.637 | 776,254 | 0 + 0 | 54.28 | 709.35 | 1:49 | 6.63 |
| | Minia | 608 | 94.104 | 39,673 | 0 + 0 | 0.46 | 0.04 | 1:03 | 5.05 |
| | SPAdes | 211 | 95.231 | 151,550 | 0 + 0 | 5.62 | 0.69 | 16:16 | 113.93 |
| | hybridSPAdes | 38 | 97.840 | 797,316 | 2 + 12 | 41.54 | 2.12 | 19:41 | 113.93 |
| | Unicycler | 52 | 97.893 | 799,601 | 0 + 1 | 8.81 | 0.44 | 57:47 | 22.99 |
| | DBG2OLC | 18 | 98.492 | 771,063 | 1 + 0 | 5.9 | 85.95 | 13:29 | 1.21 |
| | MaSuRCA | 17 | 99.476 | 919,651 | 0 + 3 | 5.97 | 0.56 | 15:10 | 32.66 |
| | Wengan | 22 | 97.065 | 796,244 | 0 + 0 | 6.14 | 24.48 | 4:14 | 5.55 |
| | HASLR | 18 | 96.597 | 796,649 | 0 + 0 | 5.39 | 76.63 | 1:52 | 10.48 |
| C. elegans | Canu | 10 | 99.847 | 13,775,238 | 3 + 1 | 5.88 | 67.73 | 5:15:05 | 13.76 |
| | Flye | 16 | 99.798 | 15,266,425 | 8 + 0 | 1.10 | 55.35 | 1:01:26 | 89.50 |
| | wtdbg2 | 4,487 | 95.468 | 81,074 | 194 + 506 | 246.33 | 657.89 | 15:57 | 29.45 |
| | miniasm | 37 | 99.696 | 7,468,924 | 3 + 7 | 68.24 | 864.11 | 20:37 | 19.35 |
| | Minia | 13,546 | 86.788 | 10,047 | 13 + 4 | 0.76 | 0.11 | 6:18 | 8.36 |
| | SPAdes | 3,219 | 94.713 | 58,307 | 30 + 62 | 6.42 | 1.36 | 2:45:34 | 114.80 |
| | hybridSPAdes | 340 | 98.643 | 924,797 | 67 + 197 | 73.26 | 9.14 | 3:11:50 | 114.79 |
| | Unicycler | NA | | | | | | | |
| | DBG2OLC | 16 | 99.692 | 6,732,354 | 10 + 7 | 8.55 | 174.21 | 2:04:23 | 7.99 |
| | MaSuRCA | 18 | 99.609 | 4,614,507 | 34 + 123 | 14.89 | 4.56 | 2:07:41 | 33.76 |
| | Wengan | 46 | 98.917 | 2,042,350 | 53 + 20 | 7.26 | 59.81 | 28:21 | 11.18 |
| | HASLR | 25 | 99.182 | 6,455,832 | 0 + 0 | 14.74 | 230.58 | 10:45 | 22.42 |
| Human | Canu | 1,461 | 97.279 | 15,045,226 | 854 + 99 | 37.7 | 196.78 | 562:14:04 | 58.72 |
| | Flye | NA | | | | | | | |
| | wtdbg2 | 122,438 | 92.735 | 87,595 | 3,436 + 13,041 | 224.02 | 598.87 | 10:25:19 | 190.07 |
| | miniasm | 2,528 | 97.170 | 10,294,834 | 374 + 181 | 71.56 | 775.18 | 110:33:23 | 511.16 |
| | Minia | 593,601 | 80.704 | 4,537 | 1,016 + 16 | 1.55 | 0.13 | 3:29:08 | 8.91 |
| | SPAdes | NA | | | | | | | |
| | hybridSPAdes | NA | | | | | | | |
| | Unicycler | NA | | | | | | | |
| | DBG2OLC | 1,906 | 91.013 | 14,385,033 | 221 + 246 | 8.43 | 201.56 | 81:18:15 | 69.53 |
| | MaSuRCA | NA | | | | | | | |
| | Wengan | 1,776 | 94.617 | 11,216,374 | 185 + 70 | 3.84 | 33.5 | 20:12:12 | 38.08 |
| | HASLR | 897 | 91.213 | 17,025,446 | 2 + 5 | 11.32 | 207.88 | 6:06:43 | 58.55 |
Note: Mismatch and indel rates are reported per 100 kbp. Unicycler crashed on the C. elegans dataset after hitting the maximum recursion limit. On the human dataset, Flye, SPAdes, hybridSPAdes, and Unicycler failed because they exceeded the memory limit, and MaSuRCA failed with a segmentation fault.
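As a small worked illustration of the "per 100 kbp" normalization used in the note above, the sketch below scales an event count by the number of aligned bases. The function name and the example counts are hypothetical and are not taken from the paper.

```python
def rate_per_100kbp(event_count, aligned_bases):
    """Scale a raw count of mismatches or indels to events per 100,000 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return event_count * 100_000 / aligned_bases


# Hypothetical example: 50 mismatches observed over 1 Mbp of aligned sequence.
print(rate_per_100kbp(50, 1_000_000))  # 5.0
```

Normalizing this way lets rates be compared across genomes of very different sizes, such as the bacterial and human datasets in the table.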
Statistics of Real Long Read Datasets
| Dataset | Technology | N50 Length | Estimated Coverage | Total Size (Gb) | Aligned Size (Gb) | Avg. Alignment Identity (%) |
|---|---|---|---|---|---|---|
| E. coli | ONT R9.4 | 63,747 | 1,080 | 5.01 | 4.31 | 85.03 |
| (K-12 MG1655) | Illumina | 2 × 151 | 372 | 1.73 | – | – |
| Yeast | PacBio | 8,561 | 132 | 1.61 | 1.42 | 86.90 |
| (S288C) | Illumina | 2 × 150 | 82 | 1.00 | – | – |
| C. elegans | PacBio | 16,675 | 47 | 4.73 | 4.32 | 87.43 |
| (Bristol) | Illumina | 2 × 100 | 67 | 6.76 | – | – |
| Human | PacBio | 19,960 | 59 | 182.51 | 163.51 | 85.85 |
| (CHM1) | Illumina | 2 × 151 | 41 | 127.76 | – | – |
Note: Alignment statistics were obtained by aligning long reads against their reference genome using lordFAST (Haghshenas et al., 2019).
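The Estimated Coverage column appears consistent with the usual definition of coverage as total sequenced bases divided by genome size. The sketch below illustrates that arithmetic; the function name is hypothetical, and the E. coli genome size of roughly 4.64 Mbp is an assumption that is not stated in the table.

```python
def estimated_coverage(total_bases, genome_size):
    """Approximate sequencing depth as total sequenced bases over genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# 5.01 Gb of ONT reads (from the table) over an assumed ~4.64 Mbp E. coli genome.
ont_coverage = estimated_coverage(5.01e9, 4.64e6)
print(round(ont_coverage))  # 1080
```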
Comparison between Assemblies Obtained by Different Tools on Real Data
| Dataset | Assembler | Contigs | Genome Fraction (%) | NGA50 | Misassemblies (Extensive + Local) | Mismatch Rate | Indel Rate | Time | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| E. coli | Canu | 1 | 99.976 | 3,647,271 | 2 + 6 | 108.85 | 1,254.40 | 702:57:07 | 32.39 |
| | Flye | NA | | | | | | | |
| | wtdbg2 | 9 | 79.114 | 141,474 | 38 + 72 | 245.82 | 1,501.74 | 4:57 | 28.05 |
| | miniasm | 3 | 99.992 | 3,106,217 | 4 + 10 | 279.13 | 1,263.23 | 50:00 | 55.56 |
| | Minia | 177 | 97.698 | 57,763 | 0 + 0 | 0.24 | 0.02 | 2:22 | 4.76 |
| | SPAdes | 95 | 98.281 | 133,063 | 0 + 9 | 1.16 | 0.15 | 34:51 | 114.29 |
| | hybridSPAdes | 15 | 99.964 | 3,863,268 | 2 + 7 | 7.16 | 0.50 | 3:38:13 | 114.29 |
| | Unicycler | NA | | | | | | | |
| | DBG2OLC | 1 | 99.950 | 3,539,045 | 3 + 4 | 46.86 | 335.82 | 8:25 | 8.74 |
| | MaSuRCA | 1 | 99.988 | 3,892,134 | 3 + 7 | 2.82 | 0.50 | 30:28 | 32.66 |
| | Wengan | 3 | 99.998 | 3,346,596 | 3 + 2 | 4.74 | 9.24 | 20:02 | 14.37 |
| | HASLR | 2 | 99.992 | 3,970,011 | 2 + 2 | 22.62 | 79.85 | 3:18 | 5.78 |
| Yeast | Canu | 23 | 99.724 | 739,932 | 29 + 2 | 8.85 | 7.99 | 1:00:19 | 5.97 |
| | Flye | 19 | 99.511 | 566,399 | 28 + 2 | 11.60 | 28.41 | 26:10 | 17.49 |
| | wtdbg2 | 28 | 97.668 | 640,895 | 20 + 3 | 10.65 | 27.17 | 3:04 | 16.26 |
| | miniasm | 88 | 98.292 | 547,238 | 21 + 34 | 31.45 | 381.55 | 5:59 | 15.58 |
| | Minia | 722 | 93.758 | 33,472 | 1 + 1 | 1.67 | 0.81 | 1:18 | 6.36 |
| | SPAdes | 246 | 95.054 | 126,338 | 4 + 2 | 6.44 | 1.47 | 17:11 | 114.09 |
| | hybridSPAdes | 61 | 97.207 | 436,584 | 28 + 20 | 44.77 | 3.71 | 20:58 | 114.09 |
| | Unicycler | 51 | 97.555 | 531,185 | 15 + 5 | 15.13 | 4.22 | 2:09:27 | 36.90 |
| | DBG2OLC | 24 | 63.275 | 229,397 | 25 + 10 | 28.37 | 58.43 | 9:51 | 0.99 |
| | MaSuRCA | 24 | 99.262 | 538,374 | 30 + 8 | 11.83 | 5.85 | 23:15 | 32.69 |
| | Wengan | 29 | 96.258 | 528,763 | 14 + 10 | 11.86 | 34.29 | 6:38 | 8.64 |
| | HASLR | 28 | 95.735 | 530,856 | 11 + 5 | 8.13 | 100.64 | 2:25 | 11.30 |
| C. elegans | Canu | 172 | 99.665 | 561,201 | 723 + 596 | 65.28 | 58.82 | 4:15:23 | 11.62 |
| | Flye | 64 | 99.638 | 558,112 | 550 + 450 | 50.50 | 52.89 | 1:08:43 | 31.60 |
| | wtdbg2 | 288 | 98.994 | 561,292 | 329 + 596 | 26.82 | 79.72 | 14:13 | 21.19 |
| | miniasm | 174 | 99.537 | 540,855 | 505 + 432 | 79.10 | 393.94 | 20:12 | 19.95 |
| | Minia | 17,388 | 86.274 | 7,198 | 33 + 27 | 1.34 | 0.99 | 8:05 | 6.61 |
| | SPAdes | 7,234 | 92.003 | 23,152 | 257 + 256 | 11.87 | 4.72 | 2:00:57 | 74.10 |
| | hybridSPAdes | 2,336 | 96.720 | 84,003 | 633 + 638 | 108.04 | 15.96 | 2:47:32 | 74.11 |
| | Unicycler | 858 | 97.102 | 139,992 | 940 + 692 | 58.36 | 45.47 | 23:49:29 | 105.06 |
| | DBG2OLC | 206 | 99.100 | 421,196 | 546 + 383 | 44.75 | 80.61 | 2:34:44 | 11.36 |
| | MaSuRCA | 216 | 97.013 | 471,366 | 368 + 504 | 49.20 | 23.50 | 1:57:49 | 33.48 |
| | Wengan | 270 | 93.341 | 341,861 | 308 + 336 | 35.75 | 121.11 | 45:45 | 8.02 |
| | HASLR | 261 | 97.431 | 453,631 | 259 + 331 | 26.08 | 140.40 | 15:35 | 17.93 |
| CHM1 | Canu | 2,110 | 96.084 | 2,329,909 | 6,715 + 7,048 | 145.81 | 120.69 | 689:26:01 | 70.44 |
| | Flye | NA | | | | | | | |
| | wtdbg2 | 3,723 | 92.896 | 2,081,842 | 3,535 + 6,286 | 118.45 | 72.54 | 11:35:22 | 202.41 |
| | miniasm | NA | | | | | | | |
| | Minia | 697,240 | 65.977 | 1,823 | 955 + 823 | 87.93 | 13.17 | 3:13:13 | 9.56 |
| | SPAdes | NA | | | | | | | |
| | hybridSPAdes | NA | | | | | | | |
| | Unicycler | NA | | | | | | | |
| | DBG2OLC | 2,118 | 95.547 | 1,599,466 | 3,718 + 8,690 | 116.81 | 116.89 | 78:21:08 | 64.94 |
| | MaSuRCA | 3,781 | 93.782 | 1,761,291 | 4,984 + 7,491 | 180.83 | 57.53 | 350:35:59 | 225.63 |
| | Wengan | 4,474 | 88.948 | 875,489 | 2,771 + 7,577 | 115.65 | 160.71 | 18:19:47 | 112.73 |
| | HASLR | 1,469 | 92.664 | 1,699,092 | 2,097 + 7,661 | 113.06 | 281.74 | 6:32:33 | 60.75 |
Note: Mismatch and indel rates are reported per 100 kbp. Flye, SPAdes, hybridSPAdes, and Unicycler failed on the human genome dataset because they exceeded the memory limit. Unicycler did not finish on the E. coli dataset within one month. Flye failed on the E. coli dataset with the error "No disjointigs were assembled."
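The NGA50 column in the comparison tables is a contiguity metric. As commonly defined (for example in QUAST), it is the NG50 statistic computed over aligned blocks, that is, contigs broken at misassembly breakpoints, with the reference genome size as the denominator. The sketch below shows only the NG50 step on a list of block lengths; the function name and the example lengths are hypothetical, and the alignment and breaking steps are assumed to have been done already.

```python
def ng50(block_lengths, genome_size):
    """Return the largest length L such that blocks of length >= L
    cover at least half of the reference genome, or None if they never do."""
    half_genome = genome_size / 2
    running_total = 0
    # Walk blocks from longest to shortest, accumulating covered bases.
    for length in sorted(block_lengths, reverse=True):
        running_total += length
        if running_total >= half_genome:
            return length
    return None  # the blocks cover less than half of the genome


# Hypothetical aligned-block lengths against a 200 bp "genome".
print(ng50([50, 40, 30, 20, 10], 200))  # 30
```

Because the denominator is the reference size rather than the assembly size, a fragmented or incomplete assembly cannot inflate this value by simply dropping short contigs.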
Gene Completeness Analysis
| Dataset | Assembler | Complete (%) | Complete Single-Copy (%) | Complete Duplicated (%) | Fragmented (%) | Missing (%) | Total BUSCO Groups |
|---|---|---|---|---|---|---|---|
| E. coli | Canu | 4.1 | 4.1 | 0.0 | 16.8 | 79.1 | 440 |
| | Flye | NA | | | | | |
| | wtdbg2 | 1.8 | 1.8 | 0.0 | 9.1 | 89.1 | 440 |
| | miniasm | 3.0 | 3.0 | 0.0 | 18.0 | 79.0 | 440 |
| | Minia | 99.8 | 99.3 | 0.5 | 0.2 | 0.0 | 440 |
| | SPAdes | 100.0 | 99.5 | 0.5 | 0.0 | 0.0 | 440 |
| | hybridSPAdes | 100.0 | 99.5 | 0.5 | 0.0 | 0.0 | 440 |
| | Unicycler | NA | | | | | |
| | DBG2OLC | 35.9 | 35.7 | 0.2 | 33.0 | 31.1 | 440 |
| | MaSuRCA | 99.7 | 98.6 | 1.1 | 0.0 | 0.3 | 440 |
| | Wengan | 100.0 | 99.5 | 0.5 | 0.0 | 0.0 | 440 |
| | HASLR | 97.8 | 97.3 | 0.5 | 1.6 | 0.6 | 440 |
| Yeast | Canu | 96.6 | 94.8 | 1.8 | 0.2 | 3.2 | 2,137 |
| | Flye | 94.6 | 93.0 | 1.6 | 0.1 | 5.3 | 2,137 |
| | wtdbg2 | 88.4 | 86.8 | 1.6 | 0.8 | 10.8 | 2,137 |
| | miniasm | 25.8 | 25.6 | 0.2 | 5.2 | 69.0 | 2,137 |
| | Minia | 96.3 | 94.9 | 1.4 | 0.1 | 3.6 | 2,137 |
| | SPAdes | 96.3 | 94.5 | 1.8 | 0.2 | 3.5 | 2,137 |
| | hybridSPAdes | 96.6 | 94.8 | 1.8 | 0.1 | 3.3 | 2,137 |
| | Unicycler | 96.4 | 94.7 | 1.7 | 0.1 | 3.5 | 2,137 |
| | DBG2OLC | 57.1 | 56.5 | 0.6 | 0.5 | 42.4 | 2,137 |
| | MaSuRCA | 96.3 | 94.1 | 2.2 | 0.1 | 3.6 | 2,137 |
| | Wengan | 96.5 | 94.9 | 1.6 | 0.0 | 3.5 | 2,137 |
| | HASLR | 95.8 | 94.4 | 1.4 | 0.1 | 4.1 | 2,137 |
| C. elegans | Canu | 97.4 | 96.8 | 0.6 | 1.1 | 1.5 | 3,131 |
| | Flye | 98.6 | 98.0 | 0.6 | 0.3 | 1.1 | 3,131 |
| | wtdbg2 | 97.1 | 96.5 | 0.6 | 1.3 | 1.6 | 3,131 |
| | miniasm | 83.2 | 82.8 | 0.4 | 6.5 | 10.3 | 3,131 |
| | Minia | 80.4 | 79.9 | 0.5 | 9.0 | 10.6 | 3,131 |
| | SPAdes | 91.4 | 90.8 | 0.6 | 4.1 | 4.5 | 3,131 |
| | hybridSPAdes | 96.4 | 95.8 | 0.6 | 1.3 | 2.3 | 3,131 |
| | Unicycler | 97.7 | 97.1 | 0.6 | 0.7 | 1.6 | 3,131 |
| | DBG2OLC | 97.5 | 95.8 | 1.7 | 0.6 | 1.9 | 3,131 |
| | MaSuRCA | 95.5 | 94.1 | 1.4 | 0.4 | 4.1 | 3,131 |
| | Wengan | 91.6 | 91.1 | 0.5 | 0.9 | 7.5 | 3,131 |
| | HASLR | 97.1 | 96.7 | 0.4 | 0.8 | 2.1 | 3,131 |
Note: We used the enterobacterales_odb10, saccharomycetes_odb10, and nematoda_odb10 gene sets to assess gene completeness of the E. coli, yeast, and C. elegans assemblies, respectively. We were not able to obtain gene completeness results for the human dataset due to time restrictions.
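The percentages in the gene completeness table follow the usual BUSCO convention: the Complete value is the sum of the single-copy and duplicated values, and Complete, Fragmented, and Missing together account for all BUSCO groups. The sketch below checks one row against that convention. The function name and the rounding tolerance are assumptions chosen for illustration.

```python
def busco_row_is_consistent(complete, single, duplicated, fragmented, missing,
                            tolerance=0.15):
    """Check that BUSCO percentages add up, allowing for one-decimal rounding."""
    complete_matches = abs(complete - (single + duplicated)) <= tolerance
    total_matches = abs((complete + fragmented + missing) - 100.0) <= tolerance
    return complete_matches and total_matches


# HASLR row for the first dataset in the table: 97.8 = 97.3 + 0.5, and 97.8 + 1.6 + 0.6 = 100.0.
print(busco_row_is_consistent(97.8, 97.3, 0.5, 1.6, 0.6))
```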
Figure 3. Distribution of Repeats in HASLR's Assembly of CHM1 Dataset Identified Using RepeatMasker
Figure 4. Performance of HASLR in Assembling Different Datasets on Subsampled Coverage