Ming-Feng Hsieh, Chin Lung Lu, Chuan Yi Tang.
Abstract
BACKGROUND: Next-generation sequencing technologies revolutionized genomics by producing high-throughput reads at low cost, and this progress has prompted the recent development of de novo assemblers. Multiple assembly methods based on the de Bruijn graph have been shown to be efficient for Illumina reads. However, sequencing errors introduced by the sequencer complicate de novo assembly and affect the quality of downstream genomic research.
Keywords: DNA sequencing; De Bruijn graph; De novo genome assembly
Year: 2020 PMID: 33203354 PMCID: PMC7672897 DOI: 10.1186/s12859-020-03788-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The clustering effect of Clover on Leptospira shermani assemblies
| p | Nodes | Contigs Num | Contigs Total (kb) | Contigs Max (kb) | Contigs N50 (kb) | Scaffolds Num | Scaffolds Total (kb) | Scaffolds Max (kb) | Scaffolds N50 (kb) | Time (min) | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33,692,986 | 5070 | 3818 | 9 | 1.0 | 182 | 3947 | 178 | 54 | 45.8 | 17.7 |
| 1 | 15,969,082 | 1201 | 3875 | 34 | 5.5 | 117 | 3897 | 196 | 85 | 30.5 | 16.2 |
| 2 | 11,917,810 | 1103 | 3859 | 27 | 6.1 | 117 | 3880 | 145 | 63 | 36.0 | 11.7 |
p: the level of error allowance on the k-mers; Nodes: the number of nodes used to build the de Bruijn graph; Num: the number of sequences produced; Total: the total length of the sequences produced; Max: the maximum length of the sequences produced; N50: the N50 statistic calculated with respect to the total length of the sequences produced; Time: the run time to assemble the genome; Memory: the memory required to assemble the genome.
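The N50 statistic in the legend above can be sketched as follows. This is a minimal illustration of the standard definition (the largest length L such that sequences of length at least L cover half of the total assembled length); the function name `n50` and the example lengths are assumptions made for illustration and are not taken from the paper or from Clover.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that the sequences of length >= L
    together cover at least half of the total assembled length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input: no sequences were produced


# Hypothetical contig lengths in kb (illustrative only, not from the table).
example_lengths = [80, 70, 50, 40, 30, 20, 10]
example_n50 = n50(example_lengths)
print(example_n50)
```

Because this variant is normalised by the total length produced, it depends only on the assembly itself and not on the reference genome size.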
Fig. 1 The flowchart of the Clover pipeline
Fig. 2 An example of 5-mers clustering while allowing 1 error
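To make the idea of clustering 5-mers with one allowed error concrete, here is a generic sketch. It is not Clover's actual procedure, which this page does not describe: the greedy, frequency-ordered strategy, the function names `hamming` and `cluster_kmers`, and the k-mer counts are all assumptions chosen for illustration. The parameter `p` plays the role of the error allowance.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_kmers(kmer_counts, p=1):
    """Greedily assign each k-mer to a representative within p mismatches.

    k-mers are visited from most to least frequent, so abundant k-mers
    become representatives and rare neighbours are folded into them.
    Returns a dict mapping every k-mer to its representative.
    """
    representatives = []
    assignment = {}
    # Sort by descending count, breaking ties alphabetically for determinism.
    for kmer in sorted(kmer_counts, key=lambda s: (-kmer_counts[s], s)):
        for rep in representatives:
            if hamming(kmer, rep) <= p:
                assignment[kmer] = rep
                break
        else:
            # No existing representative is close enough: start a new cluster.
            representatives.append(kmer)
            assignment[kmer] = kmer
    return assignment


# Hypothetical 5-mer counts (illustrative only).
counts = {"ACGTA": 30, "ACGTT": 2, "ACCTA": 1, "TTGCA": 25, "TTGCC": 1}
clusters = cluster_kmers(counts, p=1)
print(clusters)
```

In this toy input the five distinct 5-mers collapse into two clusters, which mirrors in miniature how a larger error allowance reduces the number of graph nodes.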
Comparison of assemblers on Staphylococcus aureus (SA), Rhodobacter sphaeroides (RS) and human chromosome 14 (HG)
| Data (Mb) | Assembler | Contigs Num | Contigs N50 (kb) | Contigs E-size (kb) | Contigs Errs | Contigs N50C (kb) | Contigs E-sizeC (kb) | Scaffolds Num | Scaffolds N50 (kb) | Scaffolds E-size (kb) | Scaffolds Errs | Scaffolds N50C (kb) | Scaffolds E-sizeC (kb) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SA | Clover | 128 | 43.9 | 53.1 | 13 | 41.3 | 50.5 | 1490 | 947 | 2 | 890 | | |
| 2.9 | ABySS | 129.1 | 181.1 | 16 | 61 | 170 | 199 | 107 | 127 | | | | |
| | Bambus2 | 109 | 50.2 | 69.1 | 178 | 16.7 | 19.5 | 17 | 1084 | 1120 | 1084 | | |
| | CABOG | Could not run because of incompatible read lengths in one library | | | | | | | | | | | |
| | MSR-CA | 94 | 59.2 | 60.4 | 22 | 49.2 | 51.4 | 17 | 1 | 1022 | 1039 | | |
| | SGA | 1252 | 4.0 | 4.7 | 4.0 | 4.7 | 546 | 208 | 166 | 2 | 208 | 164 | |
| | SOAPdenovo | 107 | 58 | 62.7 | 67.5 | 99 | 332 | 302 | 288 | 227 | | | |
| | SPAdes | 98 | 62.6 | 87.9 | 9 | 57.0 | 75.1 | 41 | 1703 | 1144 | 2 | 684 | 570 |
| | Velvet | 162 | 48.4 | 60.3 | 19 | 41.5 | 49.8 | 45 | 762 | 664 | 18 | 284 | 282 |
| RS | Clover | 453 | 20.1 | 23.8 | 19 | 59 | 2483 | 1795 | 1 | 2483 | 1795 | | |
| 4.6 | ABySS | 644 | 19.7 | 25.1 | 57 | 13.3 | 18.5 | 414 | 51 | 56 | 46 | 47 | |
| | Bambus2 | 93.2 | 94.5 | 360 | 12.8 | 16.3 | 92 | 2439 | 1375 | 1 | 390 | 1106 | |
| | CABOG | 322 | 20.2 | 24.1 | 31 | 17.9 | 21.5 | 130 | 66 | 520 | 3 | 65 | 381 |
| | MSR-CA | 395 | 22.1 | 24.2 | 32 | 19.1 | 21.5 | 3 | | | | | |
| | SGA | 3067 | 2.3 | 3.3 | 2.3 | 3.3 | 2096 | 51 | 53 | 51 | 53 | | |
| | SOAPdenovo | 204 | 401 | 14.6 | 18.7 | 166 | 660 | 688 | 660 | 559 | | | |
| | SPAdes | 768 | 11.8 | 13.7 | 7 | 11.7 | 13.5 | 352 | 718 | 840 | 718 | 840 | |
| | Velvet | 583 | 15.7 | 18.6 | 24 | 14.5 | 16.9 | 178 | 353 | 380 | 16 | 301 | 352 |
| HG | Clover | 24,527 | 3.4 | 5.3 | 718 | 3.2 | 5.0 | 2089 | 839 | 943 | 385 | | |
| 88.3 | ABySS | 21,222 | 14.7 | 19.0 | 1876 | 10.4 | 13.4 | 19,249 | 18 | 24 | 13 | 19 | |
| | Bambus2 | 13,592 | 5.9 | 23.3 | 8175 | 4.3 | 6.3 | 1792 | 324 | 528 | 240 | 200 | 274 |
| | CABOG | 2346 | 393 | 549 | 39 | 309 | 457 | | | | | | |
| | MSR-CA | 30,103 | 4.9 | 6.8 | 1656 | 4.3 | 5.9 | 1425 | 893 | 1420 | 1430 | 282 | 407 |
| | SGA | 56,939 | 2.7 | 3.8 | 2.7 | 3.7 | 30,975 | 83 | 113 | 24 | 81 | 111 | |
| | SOAPdenovo | 21,818 | 16.7 | 21.9 | 6587 | 7.8 | 10.4 | 13,502 | 454 | 533 | 384 | 227 | 276 |
| | SPAdes | 16,854 | 12.7 | 16.7 | 1519 | 10.4 | 13.6 | 9245 | 173 | 223 | 199 | 129 | 162 |
| | Velvet | 45,564 | 2.3 | 3.3 | 3665 | 2.1 | 3.0 | 3565 | 8659 | 86 | 124 | | |
Num: the number of sequences produced; N50: the N50 statistic calculated with respect to the genome size; E-size: the most likely size of the sequence containing a random base in the genome; Errs: the number of misjoins and, for contigs, also the number of indels > 5 bases; N50C: the N50 calculated after splitting all sequences at error locations; E-sizeC: the E-size calculated after splitting all sequences at error locations. The best result in each column, for each dataset, is indicated in bold.
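The metrics in this legend can be sketched in a few lines. The sketch assumes the commonly used formulation of E-size as the sum of squared sequence lengths divided by the genome size, and an N50 normalised by genome size rather than by assembly length. The function names, the example lengths, and the error positions are hypothetical and do not come from the paper's evaluation scripts.

```python
def e_size(lengths, genome_size):
    """Expected length of the sequence covering a random genome base.

    Assumes the usual formulation: sum of squared lengths / genome size.
    """
    return sum(length * length for length in lengths) / genome_size


def n50_genome(lengths, genome_size):
    """N50 computed with respect to the genome size, not the assembly size."""
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= genome_size / 2:
            return length
    return 0  # the assembly covers less than half of the genome


def split_at_errors(length, error_positions):
    """Break one sequence into pieces at the given error coordinates."""
    cuts = [0] + sorted(error_positions) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


# Hypothetical assembly: three sequences (kb) for a 100 kb genome.
lengths = [50, 30, 20]
print(e_size(lengths, 100), n50_genome(lengths, 100))

# Corrected variants: split the longest sequence at two hypothetical errors.
corrected = split_at_errors(50, [10, 35]) + [30, 20]
print(e_size(corrected, 100), n50_genome(corrected, 100))
```

Splitting at error locations can only shorten sequences, so the corrected values (N50C, E-sizeC) are never larger than the uncorrected ones.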
Comparison of assemblers on run times and memory requirements
| Assembler | S. aureus Time | S. aureus Memory | R. sphaeroides Time | R. sphaeroides Memory | Human chromosome 14 Time | Human chromosome 14 Memory |
|---|---|---|---|---|---|---|
| Clover | 5.6 min | 10.1 GB | 13.9 min | 11.0 GB | 10.4 h | 59.3 GB |
| ABySS | 5.1 min | 0.5 GB | 11.6 min | 0.5 GB | 6.7 h | 3.3 GB |
| Bambus2 | 55.5 min | 2.3 GB | 3.7 h | 12.3 GB | 5.1 d | 190.3 GB |
| CABOG | NA* | NA* | 2.9 h | 12.3 GB | 22.9 h | 190.4 GB |
| MSR-CA | 25.5 min | 26.2 GB | 41.3 min | 28.3 GB | 1.3 d | 34.6 GB |
| SGA | 35.5 min | 1.1 GB | 1.1 h | 3.5 GB | 18.8 h | 35.0 GB |
| SOAPdenovo | 2.3 min | 3.1 GB | 1.8 min | 5.0 GB | 2.1 h | 8.0 GB |
| SPAdes | 56.8 min | 6.1 GB | 29.5 min | 4.5 GB | 10.9 h | 22.0 GB |
| Velvet | 5.4 min | 0.4 GB | 7.3 min | 0.5 GB | 11.7 h | 72.3 GB |
Time: the run time to assemble the genome; Memory: the memory required to assemble the genome.
*NA, could not run because of incompatible read lengths in one library
The results of two bacterial genome sequencing projects
| | | |
|---|---|---|
| Length of sequence | 3,957,368 bp | 3,826,919 bp |
| Number of contigs | 165 | 58 |
| GC content | 39% | 51% |
| Number of protein-coding sequences | 3682 | 3565 |
| Number of tRNA genes | 75 | 72 |
| Number of rRNA genes | 6 | 10 |
| GenBank accession number | CP003856 | ALJX00000000 |
The NGS datasets of these two bacterial sequencing projects are available for download at https://oz.nthu.edu.tw/~d9562563/src.html