Abstract
Next-generation sequencing (NGS) technologies have driven an exponential rise in high-throughput sequencing projects and prompted the development of new read-assembly algorithms. Recent advances in NGS platforms such as Ion Torrent, Illumina, and PacBio have drastically reduced the cost of generating reads for the genomes of new organisms. Genome research has produced high-quality reference genomes for several organisms, and de novo assembly has been key to gene discovery and other studies. More powerful analytical algorithms are needed to handle the growing volume of sequence data. We present a thorough comparison of de novo assembly algorithms so that new users can clearly understand the main approaches: overlap-layout-consensus, de Bruijn graph, string-graph-based assembly, and hybrid assembly. We also address the computational efficiency of each algorithm, the challenges faced by the assembly tools used, and the impact of repeats. Our results compare the relative performance of the different assemblers, and other assembly differences, with and without a reference genome. We hope that this analysis will further the application of de novo sequencing and support the future development of assembly algorithms. © 2021 Dida and Yi.
Keywords: DNA sequences; De novo assembly; De-Bruijn-Graph; Overlap-Layout-Consensus; String-Graph based assembly
Year: 2021 PMID: 34307867 PMCID: PMC8279138 DOI: 10.7717/peerj-cs.636
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. The general workflow of the OLC method.
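The overlap and layout stages of the OLC workflow can be illustrated with a toy example. The sketch below is a greedy simplification written for this illustration, not the method of any assembler compared here: the function names `overlap` and `greedy_assemble` are hypothetical, reads are assumed to be error-free and on the same strand, and the consensus stage is omitted because exact overlaps leave nothing to reconcile.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        # Overlap stage: all-against-all comparison (quadratic, toy scale only).
        for a, b in permutations(reads, 2):
            n = overlap(a, b, min_len)
            if n > best_len:
                best_len, best_a, best_b = n, a, b
        if best_len == 0:
            break  # no remaining overlaps: the rest stay as separate contigs
        # Layout stage: join the two sequences along their shared bases.
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads


contigs = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"])
print(contigs)  # ['ATGGCGTACGTTA']
```

The all-against-all overlap step is what makes this approach expensive for large numbers of short reads, which is why real OLC tools rely on indexing and approximate overlap detection rather than the exhaustive loop shown here.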
Figure 2. The general workflow of the DBG method.
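A small sketch may help make the DBG workflow and the impact of repeats concrete. It assumes error-free, single-strand reads; the helper names `de_bruijn_graph` and `walk_unbranched` are hypothetical, and the walk only checks out-degree, which is a simplification of how real assemblers extract unbranched paths.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


reads = ["ATGGCGT", "GCGTACG", "TACGTTA"]

# With k=4 the 3-mer "CGT" occurs twice in the underlying sequence, so its node
# has two successors and the walk stops at the repeat.
print(walk_unbranched(de_bruijn_graph(reads, 4), "ATG"))   # ATGGCGT

# With k=5 every 4-mer node is unique and the full sequence is recovered.
print(walk_unbranched(de_bruijn_graph(reads, 5), "ATGG"))  # ATGGCGTACGTTA
```

The two calls show the usual trade-off in choosing k: a larger k resolves short repeats, but it also requires reads to overlap by at least k-1 bases for the graph to stay connected.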
Dataset information.
| Short | Illumina | GCF_000001735.4 | 304.3M | 2.3 |
| Long | PacBio | | 1.9G | 8.7 |
| Short | Illumina | GCF_000007825.1 | 443.6M | 3.0 |
| Long | PacBio | | 1.2G | 25.6 |
| Short | Illumina | GCF_000002985.6 | 373.8M | N/A |
| Long | PacBio | | 1.5G | N/A |
| Short | Illumina | GCF_000005845.2 | 326.7M | 62.9 |
| Long | PacBio | | 3.3G | 488.9 |
| Short | Illumina | GCF_000001405.39 | 860.9M | N/A |
| Long | PacBio | | 6.8G | 2.3 |
| Short | Illumina | GCF_000146045.2 | 3.0G | 225.0 |
| Long | PacBio | | 5.3G | 288.7 |
| Short | Illumina | GCF_000013425.1 | 480.7M | 167.8 |
| Long | PacBio | | 2.5G | 715.5 |
Notes.
All reads were obtained from NCBI.
Assembly statistics of the assemblers on the Arabidopsis thaliana dataset.
| A5-MiSeq | 11739 | 637 | 7876 | 36.65 | |
| Canu | 162 | 10862 | 48 | 43.09 | |
| Falcon | 735 | 10115 | 225 | 40.95 | |
| Flye | 1826 | 77106 | 860 | 37.24 | |
| Hinge | 522 | 29990 | 145 | 41.02 | |
| SGA | 98 | 10612 | 7 | 38.14 | |
| SPAdes | 26750 | 722 | 9629 | 36.8 | |
| SOAPdenovo2 | 31 | 89813 | 1 | 35.94 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
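The L50 definition in the note can be expressed in a few lines. The sketch below also returns N50, the length of the contig at which the 50% threshold is reached, since the two are normally reported together; the function name `n50_l50` and the contig lengths in the example are illustrative and are not taken from the tables.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are taken from longest to shortest. L50 is the number of contigs
    needed to reach at least half of the total assembly length, and N50 is the
    length of the last contig added to reach it.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of seven contigs, 300 bases in total.
print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # (70, 2)
```

A low L50 together with a high N50 indicates that a few long contigs carry most of the assembly, which is the pattern usually expected from a contiguous assembly.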
Assembly statistics of datasets with a reference genome.
NGA50 is NG50 computed from aligned block lengths rather than contig lengths. LA75 is the analogous aligned-block counterpart of L75. GF is genome fraction, DR is duplication ratio, LA is largest alignment, and TAL is total aligned length.
| A5-MiSeq | 6.496 | 1.027 | 55632 | 7821611 | - | 618 | - | 8163 | 70 | |
| Canu | 0.5 | 3.624 | 103154 | 1957236 | - | 8557 | - | 145 | 698 | |
| Falcon | 2.785 | 1.753 | 52449 | 5753720 | - | 7414 | - | 605 | 662 | |
| Flye | 71.501 | 1.029 | 323229 | 87783639 | 68321 | 37683 | 31223 | 2014 | 147 | |
| Hinge | 0.52 | 1.748 | 72349 | 1082572 | - | - | - | - | 295 | |
| SGA | 0.151 | 1.005 | 17368 | 181086 | - | 10612 | - | 22 | 10 | |
| SOAPdenovo2 | 0.119 | 1.093 | 84661 | 143517 | - | 84661 | - | 4 | 0.17 | |
| SPAdes | 15.989 | 1.042 | 31938 | 19105410 | - | 606 | - | 21082 | 17 | |
| A5-MiSeq | 5.761 | 1.014 | 31356 | 316547 | 94223 | - | - | - | 70 | |
| Canu | 18.489 | 1.096 | 15861 | 1099326 | 1271827 | - | - | - | 75 | |
| Falcon | 17.567 | 1.486 | 30007 | 1415685 | 34235 | - | - | - | 260 | |
| Flye | 17.877 | 1.022 | 63366 | 991230 | 3779838 | - | - | - | 17 | |
| Hinge | 21.753 | 1.544 | 10569 | 1823312 | 2010834 | - | - | - | 41 | |
| SGA | 5.924 | 1.031 | 31081 | 329564 | 18389 | - | - | - | 12 | |
| SPAdes | 5.602 | 1.71 | 13119 | 304717 | 24307 | - | - | - | 11 | |
| SOAPdenovo2 | 5.632 | 1.009 | 31143 | 307668 | 94674 | - | - | - | 0.73 | |
| A5-MiSeq | 91.246 | 1.004 | 194019 | 4238262 | 207510 | 46684 | 55105 | 69 | 240 | |
| Canu | 89.572 | 1.013 | 146928 | 4211826 | 4056254 | 24795 | 51958 | - | 22 | |
| Falcon | 48.749 | 2.294 | 20252 | 5152719 | 9582 | 2499 | 4511 | - | 1040 | |
| Flye | 89.557 | 1.011 | 146930 | 4201449 | 1349247 | 28464 | 51984 | - | 52 | |
| Hinge | 89.623 | 2.012 | 147472 | 8360148 | 3358761 | 28410 | 72928 | - | 565 | |
| SGA | 91.188 | 1.004 | 193965 | 4237989 | 196903 | 52805 | 55100 | 62 | 30 | |
| SPAdes | 90.968 | 1.009 | 194101 | 4227658 | 216146 | 54887 | 55987 | 63 | 7 | |
| SOAPdenovo2 | 91.082 | 1.002 | 193904 | 4224949 | 196831 | 41435 | 52593 | 77 | 26 | |
| A5-MiSeq | 2.437 | 1.015 | 18785 | 75924817 | - | 583 | - | 86057 | 1075 | |
| Canu | 1.383 | 1.421 | 108455 | 59774470 | - | 10241 | - | 5608 | 4320 | |
| Falcon | 0.874 | 1.012 | 64590 | 27427277 | - | 12924 | - | 1409 | 631 | |
| Flye | 0.866 | 1.03 | 93090 | 27601994 | - | 26154 | - | 762 | 67 | |
| HiFiasm | 10.488 | 1.014 | 74358 | 330374434 | - | 21579 | - | 10562 | 17 | |
| Hinge | 0.207 | 2.197 | 60176 | 13733461 | - | 1386 | - | - | 35 | |
| SGA | 0.013 | 1.039 | 7137 | 401793 | - | 596 | - | - | 506 | |
| SPAdes | 0.005 | 1.064 | 16600 | 166272 | - | 827 | - | 160 | 35 | |
| A5-MiSeq | 93.834 | 1.004 | 238989 | 11422884 | 87302 | 77574 | 72006 | 98 | 294 | |
| Canu | 95.269 | 1.144 | 546941 | 13088915 | 789964 | 160884 | 199769 | 56 | 229 | |
| Falcon | 3.176 | 10.781 | 17395 | 4111569 | - | 3172 | - | - | 2036 | |
| Flye | 94.704 | 1.018 | 546784 | 11700248 | 904738 | 230022 | 218721 | 38 | 65 | |
| Hinge | 93.224 | 1.943 | 532988 | 22012911 | 1015480 | 195155 | 337524 | 83 | 79 | |
| SPAdes | 93.195 | 1.028 | 538406 | 11406775 | 234358 | 149088 | 147545 | 48 | 39 | |
| SOAPdenovo2 | 93.518 | 1.003 | 328324 | 11377122 | 109730 | 102238 | 93828 | 79 | 5 | |
| A5-MiSeq | 88.836 | 1.006 | 171163 | 2518098 | 177125 | 72014 | 72014 | 30 | 71 | |
| Canu | 92.505 | 1.032 | 329791 | 2692695 | 2907970 | 92196 | 92541 | 21 | 61 | |
| Falcon | 89.225 | 6.2 | 24089 | 15571683 | 15602 | 4915 | 11704 | 2471 | 817 | |
| Flye | 92.502 | 1.005 | 329938 | 2618043 | 2896520 | 92541 | 92541 | 19 | 58 | |
| Hinge | 92.561 | 2.033 | 323857 | 5303669 | 1311923 | 91667 | 151987 | 44 | 157 | |
| SGA | 88.509 | 1.008 | 171354 | 2509007 | 109236 | 54884 | 54884 | 35 | 11 | |
| SPAdes | 86.544 | 1.057 | 259183 | 2442671 | 180321 | 91655 | 103869 | 25 | 35 | |
| SOAPdenovo2 | 88.562 | 1.003 | 171154 | 2504468 | 174162 | 72014 | 72014 | 30 | 0.36 |
Notes.
All assemblers except SOAPdenovo2 and SPAdes are long-read assemblers.
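The reference-based metrics named for this table (NGA50, GF, DR, LA, TAL) can be sketched from a list of aligned blocks. This is a minimal illustration under stated assumptions, not the evaluation pipeline used for the table: each block is assumed to be a half-open `(start, end)` interval in reference coordinates, block length is approximated by `end - start` (ignoring indels), and the names `nga50`, `covered_bases`, and `alignment_summary` are hypothetical.

```python
def nga50(block_lengths, ref_length):
    """Length of the aligned block at which the running sum of blocks, taken
    longest first, reaches half of the reference length (None if never reached)."""
    running = 0
    for length in sorted(block_lengths, reverse=True):
        running += length
        if running >= ref_length / 2:
            return length
    return None


def covered_bases(intervals):
    """Number of reference positions covered by at least one interval."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping or adjacent: extend the run
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def alignment_summary(intervals, ref_length):
    lengths = [end - start for start, end in intervals]
    covered = covered_bases(intervals)
    return {
        "NGA50": nga50(lengths, ref_length),
        "GF": 100.0 * covered / ref_length,                 # genome fraction, %
        "DR": sum(lengths) / covered if covered else None,  # duplication ratio
        "LA": max(lengths, default=0),                      # largest alignment
        "TAL": sum(lengths),                                # total aligned length
    }


# Hypothetical blocks on a 1000-base reference; two of them overlap.
print(alignment_summary([(0, 400), (300, 700), (650, 800)], 1000))
# {'NGA50': 400, 'GF': 80.0, 'DR': 1.1875, 'LA': 400, 'TAL': 950}
```

In this formulation a duplication ratio above 1 means some reference positions are covered by more than one aligned block, and NGA50 is undefined whenever the aligned blocks sum to less than half of the reference.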
Figure 3. Comparison of misassemblies across datasets for each assembler.
The y-axis shows the total number of aligned bases divided by the reference length, counted over the contigs whose total number of misassemblies is at most X. All assemblers except SOAPdenovo2 and SPAdes are long-read assemblers.
Unaligned and mismatch statistics for each dataset.
| A5-MiSeq | 188 | 120181 | 33388 | 4523 | |
| Canu | 0 | 0 | 49425 | 8398 | |
| Falcon | 1 | 4663 | 45626 | 69358 | |
| Flye | 1 | 4544 | 1350240 | 1438162 | |
| Hinge | 95 | 1145948 | 9363 | 20687 | |
| SGA | 0 | 0 | 8 | 2 | |
| SOAPdenovo2 | 0 | 0 | 163 | 12 | |
| SPAdes | 232 | 162162 | 157325 | 12794 | |
| A5-MiSeq | 93 | 751664 | 11700 | 117 | |
| Canu | 9 | 146028 | 37826 | 282 | |
| Falcon | 71 | 820032 | 49672 | 674 | |
| Flye | 4 | 93567 | 35165 | 256 | |
| Hinge | 5 | 161003 | 51871 | 1693 | |
| SGA | 519 | 1724075 | 11977 | 108 | |
| SOAPdenovo2 | 359 | 1477993 | 11590 | 99 | |
| SPAdes | 66 | 636544 | 11508 | 98 | |
| A5-MiSeq | 96 | 236146 | 52596 | 812 | |
| Canu | 82 | 451545 | 92909 | 1309 | |
| Falcon | 375 | 1870266 | 110250 | 3622 | |
| Flye | 9 | 165451 | 92818 | 1327 | |
| Hinge | 18 | 350499 | 191453 | 28035 | |
| SGA | 51 | 56900 | 52556 | 856 | |
| SOAPdenovo2 | 22 | 117911 | 52189 | 794 | |
| SPAdes | 350 | 432044 | 52367 | 782 | |
| A5-MiSeq | 1519 | 1327725 | 176747 | 25981 | |
| Canu | 398 | 4419980 | 729243 | 65863 | |
| Falcon | 1 | 136597 | 310378 | 98715 | |
| Flye | 42 | 383749 | 342934 | 77101 | |
| HiFiasm | 20 | 587213 | 541861 | 712152 | |
| Hinge | 147 | 2007596 | 281097 | 29752 | |
| SGA | 109 | 175934 | 1620 | 83 | |
| SPAdes | 15 | 20781 | 2253 | 107 | |
| A5-MiSeq | 39 | 38622 | 63755 | 5780 | |
| Canu | 1 | 1782 | 125881 | 16218 | |
| Falcon | 3 | 48343 | 73016 | 15369 | |
| Flye | 0 | 0 | 75321 | 11172 | |
| Hinge | 22 | 150873 | 193317 | 491909 | |
| SOAPdenovo2 | 9 | 10420 | 64077 | 6005 | |
| SPAdes | 58 | 48546 | 65866 | 5594 | |
| A5-MiSeq | 2 | 3052 | 34933 | 1233 | |
| Canu | 0 | 0 | 5337 | 379 | |
| Falcon | 23 | 188293 | 39852 | 21850 | |
| Flye | 0 | 0 | 5024 | 302 | |
| Hinge | 0 | 0 | 15470 | 29413 | |
| SGA | 17 | 24149 | 34775 | 1206 | |
| SOAPdenovo2 | 5 | 9944 | 30565 | 1007 | |
| SPAdes | 7 | 7580 | 34733 | 1202 | |
Assembly statistics of the assemblers on the Bacillus cereus dataset.
| A5-MiSeq | 180 | 94223 | 39 | 35.37 | |
| Canu | 68 | 858185 | 3 | 35.32 | |
| Falcon | 444 | 21016 | 87 | 35.45 | |
| Flye | 16 | 3779838 | 2 | 35.28 | |
| Hinge | 40 | 825555 | 4 | 35.31 | |
| SGA | 765 | 18848 | 75 | 35.47 | |
| SPAdes | 146 | 86886 | 17 | 35.34 | |
| SOAPdenovo2 | 535 | 24261 | 55 | 35.39 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
Assembly statistics of the assemblers on the Caenorhabditis elegans dataset.
| Canu | 102 | 5867748 | 1 | 62.67 | |
| Falcon | 857 | 15548 | 196 | 62.54 | |
| Flye | 1 | 5953794 | 1 | 62.71 | |
| Hinge | 52 | 3641048 | 2 | 62.47 | |
| SPAdes | 5 | 538 | 3 | 53.38 | |
| SOAPdenovo2 | 5 | 500 | 3 | 52.06 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
Assembly statistics of the assemblers on the Escherichia coli dataset.
| A5-MiSeq | 173 | 197188 | 8 | 50.5 | |
| Canu | 94 | 4056254 | 1 | 50.55 | |
| Falcon | 1643 | 6143 | 431 | 49.78 | |
| Flye | 24 | 1072054 | 3 | 50.62 | |
| Hinge | 36 | 3356412 | 2 | 50.47 | |
| SGA | 114 | 196903 | 7 | 50.64 | |
| SPAdes | 413 | 196647 | 9 | 50.19 | |
| SOAPdenovo2 | 79 | 216146 | 6 | 50.68 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
Assembly statistics of the assemblers on the human dataset.
| A5-MiSeq | 120438 | 602 | 47999 | 40.62 | |
| Canu | 5365 | 13580 | 1908 | 98.62 | |
| Falcon | 1961 | 14011 | 715 | 39.72 | |
| Flye | 1232 | 28044 | 381 | 40.39 | |
| HiFiasm | 15377 | 21918 | 6260 | 39.59 | |
| Hinge | 1591 | 14749 | 480 | 39.54 | |
| SGA | 917 | 905 | 140 | 47.12 | |
| SPAdes | 135 | 1882 | 30 | 50.94 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
Assembly statistics of the assemblers on the Saccharomyces cerevisiae dataset.
| A5-MiSeq | 369 | 91325 | 80 | 38.12 | |
| Canu | 193 | 710827 | 7 | 36.20 | |
| Falcon | 490 | 13483 | 309 | 27.58 | |
| Flye | 26 | 904913 | 8 | 38.19 | |
| Hinge | 110 | 754764 | 13 | 37.74 | |
| SPAdes | 376 | 93238 | 40 | 38.14 | |
| SOAPdenovo2 | 293 | 234358 | 17 | 38.13 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.
Assembly statistics of the assemblers on the Staphylococcus aureus dataset.
| A5-MiSeq | 36 | 244991 | 9 | 32.69 | |
| Canu | 9 | 2907970 | 1 | 32.73 | |
| Falcon | 2035 | 9071 | 746 | 33.21 | |
| Flye | 1 | 2896520 | 1 | 23.74 | |
| Hinge | 20 | 929468 | 3 | 32.73 | |
| SGA | 81 | 109236 | 17 | 32.66 | |
| SPAdes | 47 | 127345 | 8 | 32.66 | |
| SOAPdenovo2 | 41 | 180321 | 5 | 32.69 |
Notes.
L50 is the minimum number of contigs whose combined length accounts for 50% of the assembled bases.