Swati C Manekar, Shailesh R Sathe.
Abstract
The rapid development of high-throughput sequencing technologies means that hundreds of gigabytes of sequencing data can be produced in a single study. Many bioinformatics tools require counts of substrings of length k in DNA/RNA sequencing reads obtained for applications such as genome and transcriptome assembly, error correction, multiple sequence alignment, and repeat detection. Recently, several techniques have been developed to count k-mers in large sequencing datasets, with a trade-off between the time and memory required to perform this function. We assessed several k-mer counting programs and evaluated their relative performance, primarily on the basis of runtime and memory usage. We also considered additional parameters such as disk usage, accuracy, parallelism, the impact of compressed input, performance in terms of counting large k values, and the scalability of the application to larger datasets. We make specific recommendations for the setup of a current state-of-the-art program and suggestions for further development.
Year: 2018 PMID: 30346548 PMCID: PMC6280066 DOI: 10.1093/gigascience/giy125
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
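The abstract describes k-mer counting as tallying the substrings of length k that occur in sequencing reads. The following is a minimal sketch of that operation, not the method of any tool assessed in the study. The function names, the `canonical` option (merging each k-mer with its reverse complement), and the choice to skip k-mers containing bases other than A, C, G, or T are assumptions made for illustration.

```python
from collections import Counter

# Illustrative assumptions: only A, C, G, T are counted; other symbols
# (for example N) cause the k-mer that contains them to be skipped.
_VALID_BASES = frozenset("ACGT")
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(reads, k, canonical=False):
    """Count every substring of length k across an iterable of reads.

    With canonical=True, a k-mer and its reverse complement are counted
    under whichever of the two sorts first lexicographically.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        # A read shorter than k yields an empty range and is ignored.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if not _VALID_BASES.issuperset(kmer):
                continue
            if canonical:
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers(["ACGTA", "CGTAC"], 3).items()):
        print(kmer, n)
```

A dictionary keyed by k-mer strings like this holds every distinct k-mer in memory at once, which is one way to see why the abstract frames the problem as a trade-off between time and memory for inputs of hundreds of gigabytes.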
Ontology of k-mer counting approaches
| Approach | Disk-based | In-memory |
|---|---|---|
| Hash table | Gerbil | Squeakr |
| Sorting | KMC3 | Turtle |
| Burst tries | - | KCMBT |
| Enhanced suffix array | - | Tallymer |
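The table above separates hash-table approaches from sorting approaches. As a toy in-memory illustration of the difference (it does not reproduce how Gerbil, Squeakr, KMC3, or Turtle are implemented, and all function names are hypothetical), the sketch below counts the same k-mers once by hashing and once by sorting followed by a run-length scan.

```python
from collections import Counter
from itertools import groupby


def iter_kmers(reads, k):
    """Yield every substring of length k from each read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k]


def count_by_hashing(reads, k):
    """Hash-table idea: increment a per-k-mer counter as k-mers stream in."""
    return dict(Counter(iter_kmers(reads, k)))


def count_by_sorting(reads, k):
    """Sorting idea: sort all k-mers so equal ones become adjacent,
    then measure the length of each run of identical k-mers."""
    kmers = sorted(iter_kmers(reads, k))
    return {kmer: sum(1 for _ in run) for kmer, run in groupby(kmers)}


if __name__ == "__main__":
    reads = ["ACGAC", "GAC"]
    # Both strategies must agree on the final counts.
    assert count_by_hashing(reads, 2) == count_by_sorting(reads, 2)
    print(count_by_sorting(reads, 2))
```

The two functions return identical counts; they differ in the working data they keep, a table of distinct k-mers versus a sorted list of all k-mer occurrences.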
Datasets used in our study
| Sr. no. | Dataset ID | Organism | Genome size (Mb) | Input FASTQ/FASTA file size (GB) (1 GB = 10^9 bytes) | Average read length (bases) | Total no. of bases (Gb) | Total no. of reads |
|---|---|---|---|---|---|---|---|
| 1 | FV | | 214 | 10.9 | 353 | 4.5 | 12,803,137 |
| 2 | DM | | 122 | 10.5 | 76 | 3.7 | 48,432,878 |
| 3 | MB | | 472 | 197.1 | 100 | 56.3 | 562,968,372 |
| 4 | HS1 | | 2,991 | 292.1 | 151 | 123.7 | 819,148,264 |
| 5 | HS2 | | 2,991 | 339.5 | 100 | 135.3 | 1,339,740,542 |
| 6 | NC | | 41 | 23.3 | 7,778.3 | 22.9 | 2,942,564 |
| 7 | AT | | 120 | 72.7 | 4,804.6 | 36.1 | 7,515,360 |
Experimental results for the FV dataset
| SN | Tools (version; compression type) | k = 28 | | | | k = 55 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) |
| 1 | Jellyfish (2.2.6) | 138.33 | 7.9 | 0 | 1,093.55 (consistent) | 226 | | 0 | |
| 2 | DSK (2.2.0) | 56.33 | 6.35 | 6 | 866.50 (consistent) | 78.33 | 7.04 | 5 | 633.49 (declined from ∼1,174 to ∼129.7) |
| 3 | DSK (2.2.0; gzip) | 197 | 4 | 6 | 402.71 (first 80% of time consistent with ∼300; last 20% inconsistent to ∼1,200 with sudden increase) | 222 | 6 | 5 | 441.21 (first 75% of time consistent with ∼390; last 25% inconsistent to ∼1,200 with sudden increase) |
| 4 | KAnalyze (2.0.0) | | 10 | | 509.20 (initially in the range 1,000–2,000 then declined to ∼200) | | 11 | | 337.46 (initially in the range 1,000–2,000 then declined to ∼150) |
| 5 | KAnalyze (2.0.0; gzip) | 1,999 | 9 | 22.9 | 507.84 (first 30% of time inconsistent in the range 2,250–750; last 70% consistent with sudden drop to ∼200) | 3,395 | 11 | 12.8 | 360.456 (first 25% of time inconsistent to ∼900; last 75% consistent to ∼200 with a sudden drop) |
| 6 | KMC3 | 38.66 | 7.66 | | 998.10 (consistent) | | 11.2 | 4 | 987.891 (consistent) |
| 7 | KMC3 (gzip) | 35 | 7 | 2.2 | 1,004.61 (consistent) | 37 | 11 | 0 | 1,056.25 (consistent) |
| 8 | Gerbil (1.0) | | | | | 60.33 | | | 1,030.50 (consistent) |
| 9 | Gerbil (1.0; gzip) | 49 | 0.82 | 1.5 | 858.46 (first 50% of time consistent with ∼600; last 50% suddenly increased to ∼1,200) | 55 | 1 | 1 | 880.77 (first 50% of time consistent with ∼600; last 50% suddenly increased to ∼1,300) |
| 10 | KCMBT (1.0) | 137.5 | | 0 | 628.87 (inconsistent) | Not supported | | | |
| 11 | MSPKmerCounter (0.1) | 59.33 | 4.45 | 1 | 811.70 (phase 1: consistent (∼200); phase 2: consistent (∼1,500)) | 67.33 | 4.61 | 1 | 770.87 (phase 1: consistent (∼200); phase 2: consistent (∼1,500)) |
| 12 | aTurtle (0.3) | 671 | 14 | 0 | | 1,185 | 26 | 0 | |
| 13 | GenomeTester4 | 214 | 26 | 0 | | Not supported | | | |
| 14 | BFCounter (1.0) | 1,731 | 3 | 0 | 274.10 (first 80% of time almost 100, then gradual increase to ∼1,000) | 1,790 | 9 | 0 | |
| 15 | BFCounter (1.0; gzip) | 1,847 | 3 | 0 | 259.70 (inconsistent) | 1,889 | 9 | 0 | 251.49 (inconsistent) |
Experimental results for the HS2 dataset
| SN | Tools (version; compression type) | k = 28 | | | | k = 55 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) |
| 1 | Jellyfish (2.2.6) | | | 0 | | | | 0 | |
| 2 | DSK (2.2.0) | | 13 | | | 7,982 | 13 | | |
| 3 | DSK (2.2.0; gzip) | 10,360 | 10 | 146 | 242.01 (inconsistent) | 10,199 | 12 | 109 | 240.21 (first 60% of time consistent with ∼300, last 40% suddenly increasing to ∼650) |
| 4 | KAnalyze (2.0.0) | Failed: “IO error writing segment file: no space left on device” | Failed: “IO error writing segment file: no space left on device” | ||||||
| 5 | KAnalyze (2.0.0; gzip) | Failed: “IO error writing segment file: no space left on device” | Failed: “IO error writing segment file: no space left on device” | ||||||
| 6 | KMC3 | 4,252 | 10 | 85 | 218.02 (increased towards end to ∼600, otherwise up to ∼12) | | 11 | 29 | 214.99 (increased towards end to ∼600, otherwise up to ∼12) |
| 7 | KMC3 (gzip) | 2,362 | 10 | 86 | 580.72 (inconsistent) | 1,995 | 11 | 29 | 556.31 (inconsistent) |
| 8 | Gerbil (1.0) | 4,553 | | | 371.26 (increased towards end to ∼1,000, otherwise up to ∼250) | 4,260 | | | 317.65 (initially ∼250, increasing towards end to ∼1,000) |
| 9 | Gerbil (1.0; gzip) | 3,358 | 5 | 74 | 553.59 (first 70% of time consistent to ∼400, then increasing to ∼1,000 for last 30%) | 3,121 | 9 | 23 | 507.19 (first 70% of time consistent to ∼450, then increasing to ∼1,000 for last 30%) |
| 10 | KCMBT (1.0) | >15 hours | Not supported | ||||||
| 11 | MSPKmerCounter (0.1) | 3,128 | 6 | 22.2 | 120.17 (consistent) | 3,124 | 9 | 5.7 | 340.49 (consistent) |
| 12 | aTurtle (0.3) | Aborted (core dumped) | Aborted (core dumped) | ||||||
| 13 | GenomeTester4 | >15 hours | Not supported | ||||||
| 14 | BFCounter (1.0) | >15 hours | >15 hours | ||||||
| 15 | BFCounter (1.0; gzip) | >15 hours | >15 hours | ||||||
Test machine configuration
| Processor | Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz |
|---|---|
| Main memory | 64 GB |
| Hard disk drive | 1 TB |
| CPU(s) | 16 |
| Online CPU(s) list | 0–15 |
| Thread(s) per core | 2 |
| Core(s) per socket | 16 |
| No. of sockets | 1 |
Summary of Tables 4–8
| Dataset ID | k | Time | | RAM | | Disk | | CPU utilization (%) | |
|---|---|---|---|---|---|---|---|---|---|
| | | Highest | Lowest | Highest | Lowest | Highest | Lowest | Highest | Lowest |
| FV | 28 | KAnalyze | Gerbil | KCMBT | Gerbil | KAnalyze | Gerbil, KMC3 | Gerbil | GenomeTester4, aTurtle |
| | 55 | KAnalyze | KMC3 | Jellyfish | Gerbil | KAnalyze | Gerbil | Jellyfish | BFCounter, aTurtle |
| DM | 28 | BFCounter | KMC3 | GenomeTester4 | Gerbil | KAnalyze | Gerbil | Gerbil | GenomeTester4, aTurtle |
| | 55 | BFCounter | KMC3 | aTurtle | Gerbil | KAnalyze | KMC3 | KMC3 | BFCounter, aTurtle |
| MB | 28 | KAnalyze | Jellyfish | aTurtle | Gerbil | KAnalyze | Gerbil | Jellyfish | KCMBT, aTurtle |
| | 55 | KAnalyze | Jellyfish | Jellyfish | Gerbil | KAnalyze | Gerbil | Jellyfish | DSK |
| HS1 | 28 | DSK | KMC3 | DSK | Gerbil | DSK | Gerbil | Gerbil | DSK |
| | 55 | DSK | KMC3 | DSK | Gerbil, KMC3 | DSK | Gerbil | Gerbil | DSK |
| HS2 | 28 | DSK | Jellyfish | Jellyfish | Gerbil | DSK | Gerbil | Jellyfish | DSK |
| | 55 | Jellyfish | KMC3 | Jellyfish | Gerbil | DSK | Gerbil | Jellyfish | DSK |
Figure 1: Analysis of time and memory utilization of k-mer counting algorithms for the NC and AT datasets for different values of k (28, 40, 55, 65, 100, 125, 150, 175, and 200).
Experimental results for the DM dataset
| SN | Tools (version; compression type) | k = 28 | | | | k = 55 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) |
| 1 | Jellyfish (2.2.6) | 77 | 4 | 0 | 1,055.25 (consistent) | 71 | 9 | 0 | 917.79 (consistent) |
| 2 | DSK (2.2.0) | 52 | 2 | 4.2 | 736.09 (initially ∼600, increasing to ∼1,173) | 49 | 2 | 2.7 | 622.36 (initially ∼500, increasing to ∼1,150) |
| 3 | DSK (2.2.0; gzip) | 183 | 4.68 | 3.6 | 331.88 (first 90% of time consistent with ∼270, last 10% suddenly increasing to ∼1,200) | 173 | 4.45 | 2.4 | 300.70 (first 90% of time consistent with ∼250 then suddenly increasing to ∼1,200) |
| 4 | KAnalyze (2.0.0) | 794 | 10 | | 695.64 (gradually declined) | 393 | 11 | | 829.45 (gradually declined from ∼2,000 to ∼100) |
| 5 | KAnalyze (2.0.0; gzip) | 822 | 9 | 14.4 | 691.15 (first 40% of time consistent with ∼1,250; last 60% inconsistent with sudden drop to ∼200) | 411 | 11 | 12.9 | 843.79 (first 60% of time inconsistent in the range 2,250–900, rest of time inconsistent with sudden drop to ∼200) |
| 6 | KMC3 | | 5 | 2.23 | 942.26 (consistent) | | 8 | | |
| 7 | KMC3 (gzip) | 35 | 5 | 1.64 | 739.01 (last 20% of time consistent with ∼1,250, rest consistent with ∼700) | 31 | 8 | 0 | 637.29 (last 20% of time consistent with ∼1,250, rest consistent with ∼600) |
| 8 | Gerbil (1.0) | 20 | | | | 16.5 | | 4 | 1,010.89 (consistent) |
| 9 | Gerbil (1.0; gzip) | 33 | 0.82 | 1.31 | 821.69 (first 55% of time consistent with ∼700, last 45% suddenly increased to ∼1,200) | 29 | 0.81 | 0 | 685.74 (first 55% of time consistent with ∼600, last 45% inconsistent with sudden increase to ∼1,200) |
| 10 | KCMBT (1.0) | 61 | 2 | 0 | 595.37 (initially ∼300 then increased towards end to ∼900) | Not supported | |||
| 11 | MSPKmerCounter (0.1) | 234 | 5 | 14.2 | 912.92 (both phases: consistent) | 219 | 5 | 11.2 | 914.62 (phase 1: initially ∼1,000 then declined to ∼300; phase 2: consistent) |
| 12 | aTurtle (0.3) | 423 | 7 | 0 | | 330 | | 0 | |
| 13 | GenomeTester4 | 144 | | 0 | | Not supported | | | |
| 14 | BFCounter (1.0) | | 1 | 0 | 307.53 (consistent) | | 2 | 0 | |
| 15 | BFCounter (1.0; gzip) | 1,002 | 2 | 0 | 321.78 (last 20% of time ∼500, rest is ∼300) | 559 | 2 | 0 | 306.30 (last 20% of time ∼500, rest ∼300) |
Experimental results for the HS1 dataset
| SN | Tools (version; compression type) | k = 28 | | | | k = 55 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) |
| 1 | Jellyfish (2.2.6) | >15 hours (system hang) | >15 hours (system hang) | ||||||
| 2 | DSK (2.2.0) | | | | | | | | |
| 3 | DSK (2.2.0; gzip) | 9,240 | 11 | 134 | 218.77 (inconsistent) | 8,480 | 12 | 104 | 284.68 (inconsistent) |
| 4 | KAnalyze (2.0.0) | Failed: “IO error writing segment file: no space left on device” | Failed: “IO error writing segment file: No space left on device” | ||||||
| 5 | KAnalyze (2.0.0; gzip) | Failed: “IO error writing segment file: no space left on device” | Failed: “IO error writing segment file: No space left on device” | ||||||
| 6 | KMC3 | | 10 | 78 | 276.64 (gradually declined) | | | 28 | 270.55 (inconsistent) |
| 7 | KMC3 (gzip) | 1,964 | 11 | 79 | 620.84 (inconsistent) | 1,626 | 11 | 29 | 663.31 (inconsistent) |
| 8 | Gerbil (1.0) | 4,078 | | | | 3,818 | | | |
| 9 | Gerbil (1.0; gzip) | 2,849 | 6 | 66 | 569.83 (first 70% of time consistent to ∼420, then increasing to ∼1,000 for last 30%) | 2,614 | 11 | 22 | 541.63 (first 70% of time consistent to ∼400, then increasing to ∼1,000 for last 30%) |
| 10 | KCMBT (1.0) | >23 hours | Not supported | ||||||
| 11 | MSPKmerCounter (0.1) | >15 hours (phase 2 failed: “OutOfMemoryError”) | >15 hours (phase 2 failed: “OutOfMemoryError”) | ||||||
| 12 | aTurtle (0.3) | Aborted (core dumped) | Aborted (core dumped) | ||||||
| 13 | GenomeTester4 | >15 hours | Not supported | ||||||
| 14 | BFCounter (1.0) | >15 hours | >15 hours | ||||||
| 15 | BFCounter (1.0; gzip) | >15 hours | >15 hours | ||||||
Experimental results for the MB dataset
| SN | Tools (version; compression type) | k = 28 | | | | k = 55 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) | Time (s) | RAM (GB) | Disk (GB) | CPU utilization (%) (comment) |
| 1 | Jellyfish (2.2.6) | | 15 | 0 | | | | 0 | |
| 3 | DSK (2.2.0) | 3,358 | 12 | 59 | 185.09 (consistent) | 3,039 | 11 | 45 | |
| 4 | KAnalyze (2.0.0) | | 10 | | 279.40 (initially ∼2,000, then declining to ∼150) | | 11 | | 248.04 (declined from ∼2,000 to ∼100) |
| 5 | KMC3 | 2,019 | 9 | 36 | 216.93 (initially in the range 12–400; increasing towards end to ∼600) | 1,804 | 10 | 14 | 211.12 (initially in the range 12–400, increasing towards end to ∼600) |
| 6 | KMC3 (bz2) | 3,341 | 11 | 36.3 | 289.46 (first 90% of time consistent in the range 200–400; last 10% up to ∼1,300) | 3,250 | 11 | 13 | 282.77 (first 90% consistent in range 200–400; last 10% up to ∼1,300) |
| 7 | Gerbil (1.0) | 2,238 | | | 269.52 (initially within 150, increasing towards end to ∼800) | 1,941 | | | 250.32 (initially within 150, increasing towards end to ∼800) |
| 8 | Gerbil (1.0; bz2) | 3,487 | 2 | 30.7 | 306.37 (first 90% of time consistent with ∼270; last 10% suddenly increasing to ∼1,300) | 3,137 | 3 | 11 | 304.02 (first 90% of time consistent with ∼270; last 10% suddenly increasing to ∼1,300) |
| 9 | KCMBT (1.0) | 1,644 | 34 | 0 | | Not supported | | | |
| 10 | MSPKmerCounter (0.1) | 11,094 | 8 | 173 | 316.90 (consistent) | 8,759 | 9 | 118 | 1,284.05 (consistent) |
| 11 | aTurtle (0.3) | 8,764 | | 0 | | >15 hours | | | |
| 12 | GenomeTester4 | 3,520 | 60 | 0 | 153.67 (consistent) | Not supported | |||
| 13 | BFCounter (1.0) | 18,950 | 10 | 0 | 300.37 (consistent) | 15,264 | 19 | 0 | 295.40 (first 50% up to ∼254 then increasing to ∼434) |
Figure 2: Scalability comparison of different k-mer counting tools based on the number of threads.