Alexandre Souvorov, Richa Agarwala, David J Lipman.
Abstract
SKESA is a DeBruijn graph-based de-novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources. SKESA has been used for assembling over 272,000 read sets in the Sequence Read Archive at NCBI and for real-time pathogen detection. Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.
Keywords: Contamination; De-novo assembly; DeBruijn graphs; Illumina reads; Sequence quality
Year: 2018 PMID: 30286803 PMCID: PMC6172800 DOI: 10.1186/s13059-018-1540-z
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Run time comparison using 56 inputs in the run time set
| Run time (seconds) | SKESA (4 cores, 16 Gb) | SPAdes (4 cores, 16 Gb) | MegaHit (4 cores, 16 Gb) | SKESA (8 cores, 32 Gb) | SPAdes (8 cores, 32 Gb) | MegaHit (8 cores, 32 Gb) | SKESA (12 cores, 32 Gb) | SPAdes (12 cores, 32 Gb) | MegaHit (12 cores, 32 Gb) |
|---|---|---|---|---|---|---|---|---|---|
| <=300 | 6 | 1 | 6 | 16 | 2 | 24 | 32 | 3 | 37 |
| 301−400 | 3 | 0 | 2 | 16 | 1 | 12 | 11 | 3 | 11 |
| 401−500 | 5 | 2 | 8 | 7 | 3 | 7 | 2 | 1 | 5 |
| 501−600 | 6 | 1 | 10 | 6 | 1 | 8 | 3 | 3 | 0 |
| 601−700 | 10 | 1 | 6 | 0 | 3 | 2 | 3 | 3 | 0 |
| > 700 | 26 | 51 | 24 | 11 | 46 | 3 | 5 | 43 | 3 |
| Median | 688 | 2303 | 616 | 359 | 1319 | 328 | 275 | 1086 | 240 |
Best of three wall-clock times is used for each input, method, and resource combination
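
The best-of-three rule and the binning used in the run time table can be illustrated with a minimal sketch. The function names and the example timings below are hypothetical and are not taken from the SKESA benchmarks; only the bin boundaries follow the table.

```python
from statistics import median

# Bin labels and upper bounds mirror the run time table (seconds).
BIN_LABELS = ["<=300", "301-400", "401-500", "501-600", "601-700", ">700"]
UPPER_BOUNDS = [300, 400, 500, 600, 700]


def best_of(times):
    """Return the fastest (minimum) wall-clock time among repeated runs."""
    return min(times)


def bin_label(seconds):
    """Map a run time to the first bin whose upper bound it does not exceed."""
    for upper, label in zip(UPPER_BOUNDS, BIN_LABELS):
        if seconds <= upper:
            return label
    return BIN_LABELS[-1]


def summarize(runs):
    """runs maps an input name to its list of repeated wall-clock times."""
    best = {name: best_of(times) for name, times in runs.items()}
    counts = {label: 0 for label in BIN_LABELS}
    for seconds in best.values():
        counts[bin_label(seconds)] += 1
    return counts, median(best.values())


# Hypothetical timings, invented for illustration only.
example_runs = {
    "input_a": [310.0, 295.5, 402.1],
    "input_b": [720.3, 698.0, 705.9],
    "input_c": [512.4, 530.0, 509.9],
}
counts, med = summarize(example_runs)
```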
Number of misassemblies in 381 inputs in the benchmark set
| Count | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| 0 | 214 | 172 | 128 |
| 1 | 83 | 98 | 91 |
| 2 | 40 | 43 | 66 |
| 3 | 13 | 30 | 30 |
| 4 | 9 | 12 | 18 |
| 5 | 7 | 7 | 15 |
| 6 | 2 | 3 | 10 |
| 7 | 2 | 0 | 5 |
| 8 | 1 | 1 | 3 |
| 9 | 0 | 0 | 2 |
| 10+ | 10 | 15 | 13 |
| Median | 0 | 1 | 1 |
Mismatches per 100 Kb as reported by QUAST for benchmark and contamination sets

Benchmark set
| Measure | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| Median | 0.08 | 2.76 | 1.89 |
| Maximum | 7.78 | 41.60 | 31.94 |
| Average | 0.40 | 3.21 | 2.79 |

Assembly counts in benchmark set
| Mismatches range | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| 0 | 105 | 1 | 1 |
| 0.01−1 | 247 | 40 | 80 |
| 1.01−2 | 9 | 76 | 121 |
| 2.01−3 | 9 | 89 | 58 |
| 3.01−4 | 1 | 71 | 45 |
| > 4 | 10 | 104 | 76 |

Mismatches reported in contamination set
| Set | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| No contamination | 0 | 1.44 | 3.83 |
| 3x contamination | 0 | 1.42 | 3.21 |
| 6x contamination | 0 | 1.44 | 3.02 |
| 9x contamination | 0.02 | 1.61 | 4.38 |
| 12x contamination | 0.02 | 1.52 | 4.96 |
| 15x contamination | 0.04 | 1.50 | 5.83 |
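
As a rough illustration of the unit used in these tables, the sketch below normalizes a mismatch count to a rate per 100 Kb. It assumes the rate is taken over aligned bases, which is how QUAST is understood to report it; the function name and the example numbers are hypothetical.

```python
def mismatches_per_100kb(mismatches, aligned_bases):
    """Scale a mismatch count to a rate per 100,000 aligned bases.

    Assumption: the denominator is the number of aligned bases,
    not the total assembly or reference length.
    """
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100_000 * mismatches / aligned_bases


# Hypothetical example: 4 mismatches over 5,000,000 aligned bases.
rate = mismatches_per_100kb(4, 5_000_000)
```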
Deviation of assembly length produced by the assemblers from the assembly length of the reference as computed using aligned length reported by QUAST and assembly lengths for benchmark and contamination sets

Benchmark set
| Measure | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| Median | 2.72 | 10.91 | 5.59 |
| Maximum | 135.75 | 775.14 | 407.78 |
| Average | 4.61 | 57.98 | 24.23 |

Deviation in contamination set
| Contamination | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| None | 1.33 | 1.68 | 1.35 |
| 3x | 1.36 | 1.68 | 1.33 |
| 6x | 1.33 | 1.68 | 1.30 |
| 9x | 1.36 | 1.67 | 1.47 |
| 12x | 1.41 | 1.68 | 2.05 |
| 15x | 1.44 | 1.68 | 2.96 |
Contiguity for benchmark, random, and contamination sets

Benchmark set
| N50 measure | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| <=10 Kb | 14 | 69 | 19 |
| >10−50 Kb | 40 | 41 | 46 |
| >50−100 Kb | 41 | 56 | 67 |
| >100−250 Kb | 191 | 169 | 197 |
| >250−500 Kb | 77 | 43 | 48 |
| > 500 Kb | 18 | 3 | 4 |
| Median | 170,647 | 117,340 | 124,833 |
| Minimum | 1832 | 364 | 687 |
| Maximum | 1,197,860 | 622,367 | 617,087 |
| Average | 195,141 | 131,823 | 146,706 |

N50 statistic in contamination set
| Contamination | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| None | 282,763 | 260,531 | 202,384 |
| 3x | 282,763 | 260,531 | 202,384 |
| 6x | 282,763 | 260,532 | 202,384 |
| 9x | 225,630 | 260,531 | 151,916 |
| 12x | 77,455 | 260,531 | 107,175 |
| 15x | 42,440 | 260,531 | 65,124 |

Random set
| N50 measure | SKESA | SPAdes | MegaHit |
|---|---|---|---|
| <=10 Kb | 6 | 10 | 6 |
| >10−50 Kb | 349 | 206 | 285 |
| >50−100 Kb | 788 | 409 | 1516 |
| >100−250 Kb | 2307 | 2369 | 2889 |
| >250−500 Kb | 1324 | 1616 | 266 |
| > 500 Kb | 226 | 390 | 38 |
| Median | 170,877 | 208,907 | 117,074 |
| Minimum | 2414 | 209 | 4182 |
| Maximum | 1,545,488 | 1,530,182 | 1,499,532 |
| Average | 213,847 | 255,079 | 136,339 |
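
For readers unfamiliar with the contiguity measure in these tables, N50 is the length of the contig at which the cumulative sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. A minimal sketch of that standard definition follows; the function name and the example lengths are illustrative and are not drawn from the paper's data.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig that brings the running total to at least half of the
    summed length. An empty input returns 0.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if 2 * running >= total:
            return length
    return 0


# Illustrative lengths only: total is 290, so half is 145.
# The running sum is 100, then 180, so the N50 is 80.
example_n50 = n50([100, 80, 50, 30, 20, 10])
```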
Fig. 1 Substrings mismatches: mismatches per 100 Kb seen in assemblies of SPAdes and MegaHit for inputs in the substrings set. SKESA has no mismatches at any length in this set
Fig. 2 Substrings contiguity: N50 for assemblies generated by SKESA, SPAdes, and MegaHit for inputs in the substrings set
Fig. 3 Substrings deviation: deviation for assemblies generated by SKESA, SPAdes, and MegaHit for inputs in the substrings set. Values are not shown for input length 22, where MegaHit has a value of almost 100, and for input lengths 34 and 56, for which SPAdes did not produce an assembly
Fig. 4 SKESA flowchart: flowchart describing the main steps in the algorithm used by SKESA for assembly
Fig. 5 Main distribution in SRR2821438: histogram of the frequency of 21-mers seen in SRR2821438, with counts up to 400 on the X axis and the number of 21-mers with that count on the Y axis
Fig. 6 Small distributions in SRR2821438: histogram of the frequency of 21-mers seen in SRR2821438, with counts between 325 and 2000 on the X axis and the number of 21-mers with that count on the Y axis
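
The histograms described in Figs. 5 and 6 plot, for each count, how many distinct 21-mers occur that many times in the reads. The sketch below shows one simple way such a histogram could be tabulated. It is not SKESA's implementation: it counts k-mers on the given strand only, without merging reverse complements, and the function names and toy reads are hypothetical.

```python
from collections import Counter


def kmer_counts(reads, k=21):
    """Count every k-mer occurrence across the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute no k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def frequency_histogram(counts):
    """Map each occurrence count to the number of distinct k-mers with that count."""
    return Counter(counts.values())


# Toy reads with k=3 so the result is easy to check by hand.
toy_counts = kmer_counts(["ACGTACGT", "ACGT"], k=3)
toy_histogram = frequency_histogram(toy_counts)
```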
Runs and species for testing running time performance
| SRA run | Species |
|---|---|
| SRR2820668 | |
| SRR2822445 | |
| SRR2821368 | |
| SRR2821369 | |
| SRR2823707 | |
| SRR2823715 | |
| SRR2823716 | |
| SRR2824043 | |
| SRR2822462 | |
| SRR2818794 | |
| SRR1284629 | |
| SRR2821773 | |
| SRR1515967 | |
| SRR1576778 | |
| SRR1576808 | |
| SRR2822449 | |
| ERR008613 | |
| ERR022075 | |
| SRR530851 | |
| SRR587217 | |
| SRR2817810 | |
| SRR2817811 | |
| SRR2822309 | |
| ERR351267 | |
| SRR2821438 | |
| SRR2820617 | |
| SRR2820618 | |
| SRR1501122 | |
| SRR1427234 | |
| SRR1505904 | |
| SRR1427243 | |
| SRR1501128 | |
| SRR1510963 | |
| SRR941212 | |
| SRR2823701 | |
| SRR2822442 | |
| SRR2820663 | |
| SRR498276 | |
| SRR2814419 | |
| SRR2814420 | |
| SRR2819198 | |
| SRR2812569 | |
| SRR2812570 | |
| SRR1206476 | |
| SRR2822404 | |
| SRR2820641 | |
| SRR2820657 | |
| SRR2822469 | |
| SRR2820294 | |
| SRR2819094 | |
| SRR2820674 | |
| SRR2815879 | |
| SRR2817447 | |
| SRR2818033 | |
| SRR2818092 | |
| SRR2818127 | |