Swati C Manekar1, Shailesh R Sathe1.
Abstract
BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, and measuring sequencing error rates. Different methods for cardinality estimation in sequencing data have been developed in recent years.
Keywords: Distinct k-mers; Hashing; High-throughput sequencing; K-mer abundance histogram; Singleton k-mers; Streaming algorithms
Year: 2019 PMID: 31015787 PMCID: PMC6446480 DOI: 10.2174/1389202919666181026101326
Source DB: PubMed Journal: Curr Genomics ISSN: 1389-2029 Impact factor: 2.236
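The quantities named in the abstract can be illustrated with a minimal exact-counting sketch. This is not how any of the compared tools work internally; it assumes toy reads held in memory, counts k-mers as plain substrings (no canonical or reverse-complement handling), and uses hypothetical helper names `kmer_counts` and `abundance_histogram`.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (empty range).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def abundance_histogram(counts):
    """Map abundance a -> number of distinct k-mers seen exactly a times."""
    return Counter(counts.values())

reads = ["ACGTACGT", "CGTACGTT"]   # toy reads, illustrative only
counts = kmer_counts(reads, k=4)
hist = abundance_histogram(counts)

F0 = len(counts)   # number of distinct k-mers
f1 = hist[1]       # number of singletons (k-mers seen exactly once)
print(F0, f1, dict(hist))
```

Exact counting of this kind needs memory proportional to the number of distinct k-mers, which is the cost that streaming estimators try to avoid.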
Fig. (1) Computation time versus datasets of varying size (left) and memory usage versus datasets of varying size (right). Three of these datasets are human datasets with large coverage. Runtime is reported in seconds and memory usage in megabytes (MB). Note that H. sapiens 2 has an average read length of 100 bases; hence, the plot for k = 125 has no data for H. sapiens 2. Abbreviations: FV = F. vesca; HS1 = H. sapiens 1; HS2 = H. sapiens 2; and NA19238 = human genome NA19238.
Fig. (2) Speedup and memory usage for various numbers of threads.
F. vesca results of considered tools for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
| |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
| ||
| 1 | DSK 2.2.0 | 117 | 20,337 / 3.7 | 140 | 11,627 / 3 | 217 | 12,260 / 2 | 315 | 11,865 / 34 | 326 | 13,269 / 79 |
| CPU utilization (%) | 770.492 (Initial 45% of time consistent to 95, remaining 55% inconsistent in the range of 1600 – 10) | 690.89 | 607.77 (Initial 20% of time consistent to 1500, remaining 80% inconsistent in the range of 1600 – 100) | 567.65 (Inconsistent in the range of 1600 – 100) | 538.23 | ||||||
| 2 | ntCard 1.0.0 | 534 / – | |||||||||
| CPU utilization (%) | 388.46 (In initial 50% of time declined from 1100 to 100, rest 50% of time consistent with 100) | 399.87 (In initial 50% of time declined from 1000 to 100, rest 50% of time consistent with 100) | 408.2 (In initial 60% of time declined from 900 to 130, rest 40% of time consistent with 100) | 383.91 (In initial 50% of time declined from 900 to 140, rest 50% of time consistent with 100) | 342.56 (In initial 50% of time declined from 900 to 100, rest 50% of time consistent with 100) | ||||||
| 3 | KmerStream 1.1 | 197 | 107 / – | 187 | 106 / – | 184 | 106 / – | 182 | 106 / – | 176 | 106 / – |
| CPU utilization (%) | |||||||||||
| 4 | KmerGenie 1.7048 | 630 | 144 / – | 144 / – | 144 / – | 152 / – | 159 / – | ||||
| CPU utilization (%) | 115.10 (Consistent in the range of 94 – 200) | 115.55 (Consistent in the range of 94 – 200) | 118.03 (Consistent in the range of 94 – 200) | 115.11 (Consistent in the range of 94 – 200) | 120.01 (Consistent in the range of 94 – 200) | ||||||
| 5 | Khmer 2.1.1 | 229 | 231 | 235 | 227 | 25 / – | 226 | ||||
| CPU utilization (%) | |||||||||||
| 6 | Khmer 2.1.1 | ERROR: Khmer only supports | |||||||||
| CPU utilization (%) | 1,191.52 | ||||||||||
Best results are indicated in bold font and average results are underlined. Abbreviations: sec = Seconds, GB = Gigabytes, MB = Megabytes.
H. sapiens 1 results of considered tools for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
| |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
| ||
| 1 | DSK 2.2.0 | 5,046 | 23,106 / 102.8 | 5,881 | 24,240 / 67.7 | 7,033 | 21,731 / 5.6 | 5,767 | 21,678 / 41.9 | 4,348 | 20,511 / 32.7 |
| CPU utilization (%) | 470.34 (Initial 45% of time consistent to 800, next 55% inconsistent in the range of 1600 – 300) | 499.35 (Initial 40% of time consistent to 800, next 60% inconsistent in the range of 1600–550) | 419.15 (Initial 35% of time consistent to 800, next 65% inconsistent in the range of 1200–700) | 427.96 (Initial 40% of time consistent to 780, next 65% inconsistent in the range of 1300–650) | 348.97 (Initial 50% of time consistent in the range 550–10, next 50% in the range of 1600–550) | ||||||
| 2 | ntCard 1.0.0 | 530 / – | |||||||||
| CPU utilization (%) | 144.31 (Inconsistent in the range of 70 – 200) | 129.13 | 121.40 | 108.72 | |||||||
| 3 | KmerStream 1.1 | 5,637 | 57 / – | 5,107 | 56 / – | 4,854 | 57 / – | 4,539 | 56 / – | 4,315 | 58 / – |
| CPU utilization (%) | 98.90 (Consistent in the range of 80 – 106) | ||||||||||
| 4 | KmerGenie 1.7048 | 16,619 | 223 / – | 277 / – | 196 / – | 248 / – | 7,956 | 173 / – | |||
| CPU utilization (%) | 115.84 (Consistent in the range of 94 – 200) | 118.30 (Consistent in the range of 94 – 200) | 121.90 (Consistent in the range of 94 – 200) | 126.66 (Consistent in the range of 94 – 200) | 134.02 (Consistent in the range of 100 – 200) | ||||||
| 5 | Khmer 2.1.1 | 9,281 | 9,248 | 9,424 | |||||||
| CPU utilization (%) | |||||||||||
| 6 | Khmer 2.1.1 | ERROR: Khmer only supports | |||||||||
| CPU utilization (%) | 1,590.34 | ||||||||||
Best results are indicated in bold font and average results are underlined. Abbreviations: sec = Seconds, GB = Gigabytes, MB = Megabytes.
H. sapiens 2 results of considered tools for k = 25, 50, 75 and 100.
|
|
|
|
|
|
| |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
| |||||||
| 1 | DSK 2.2.0 | 5,065 | 25,400 / 113.7 | 5,049 | 25,739 / 69.7 | 4,778 | 25,322 / 49.8 | 3,168 | 27,930 / 31.7 | |||||
| CPU utilization (%) | 418.559 (Inconsistent) | 484.26 (Inconsistent) | 391.295 (Consistent) | 247.251 (Initial 60% of time consistent, last 40% inconsistent (1250 – 10)) | ||||||||||
| 2 | ntCard 1.0.0 | 538 / – | ||||||||||||
| CPU utilization (%) | 387.25 (Highly inconsistent in the range of 100 – 800) | 341.52 (Highly inconsistent in the range of 90–700) | 283.68 (Highly inconsistent in the range of 50–650) | 202.91 | ||||||||||
| 3 | KmerStream 1.1 | 6,677 | 48 / – | 6,282 | 49 / – | 5,877 | 47 / – | 5,422 | 48 / – | |||||
| CPU utilization (%) | ||||||||||||||
| 4 | KmerGenie 1.7048 | 17,109 | 145 / – | 13,526 | 145 / – | 9,997 | 145 / – | 6,562 | 144 / – | |||||
| CPU utilization (%) | 119.03 (Consistent in the range of 94 – 200) | 124.12 (Consistent in the range of 94 – 200) | 132.73 (Consistent in the range of 100 – 200) | 150.38 (Consistent in the range of 100 – 200) | ||||||||||
| 5 | Khmer 2.1.1 | 14,378 | ||||||||||||
| CPU utilization (%) | ||||||||||||||
| 6 | Khmer 2.1.1 | ERROR: Khmer only supports | ||||||||||||
| CPU utilization (%) | 1,592.78 | |||||||||||||
Best results are indicated in bold font and average results are underlined. Abbreviations: sec = Seconds, GB = Gigabytes, MB = Megabytes.
Human genome NA19238 results of considered tools for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
| |||||
|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
|
|
| ||
| 1 | DSK 2.2.0 | 11,829 | 25,991 / 220 | 12,254 | 24,795 / 155.8 | 13,933 | 25,325 / 136.9 | 14,516 | 27,400 / 122.9 | 14,215 | 25,800 / 277 |
| CPU utilization (%) | 437.71 (Initial 50% of time consistent to 600, rest 50% consistent to 300) | 478.81 | 489.37 | 515.74 (Initial 40% of time consistent to 850, rest 60% inconsistent in the range of 1200 – 300) | 479.66 (Initial 40% of time consistent to 850, rest 60% inconsistent in the range of 1200 – 300) | ||||||
| 2 | ntCard 1.0.0 | 527 / – | |||||||||
| CPU utilization (%) | 146.36 | 139.33 | 133.68 | 127.19 | 123.27 | ||||||
| 3 | KmerStream 1.1 | 9,138 | 75 / – | 8,822 | 76 / – | 8,599 | 76 / – | 8,278 | 76 / – | 8,013 | 76 / – |
| CPU utilization (%) | |||||||||||
| 4 | KmerGenie 1.7048 | 32,071 | 144 / – | 144 / – | 144 / – | 144 / – | 144 / – | ||||
| CPU utilization (%) | 113.09 (Consistent in the range of 94 – 200) | 115.26 (Consistent in the range of 94 – 200) | 117.38 (Consistent in the range of 94 – 200) | 118.57 (Consistent in the range of 94 – 200) | 121.68 (Consistent in the range of 94 – 200) | ||||||
| 5 | Khmer 2.1.1 | 12,952 | 12,896 | 12,807 | 12,802 | 12,582 | |||||
| CPU utilization (%) | |||||||||||
| 6 | Khmer 2.1.1 | ERROR: Khmer only supports | |||||||||
| CPU utilization (%) | 1,588.08 | ||||||||||
Best results are indicated in bold font and average results are underlined. Abbreviations: sec = Seconds, GB = Gigabytes, MB = Megabytes.
Streaming algorithms employed in the comparative experiment along with their estimated output.
|
|
|
|
|---|---|---|
| KmerGenie [ | Arbitrary large length | |
| KmerStream [ | Arbitrary large length | |
| Khmer 2.1.1 | Arbitrary large length | |
| Khmer 2.1.1 | Full | ≤ 32 |
| ntCard [ | Arbitrary large length |
F0: the number of distinct k-mers in the input read set; f1: the number of singletons in the input read set.
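One way to see how F0 and f1 can be estimated without storing every k-mer is hash-based subsampling. The sketch below is a generic illustration under stated assumptions, not the algorithm of ntCard, KmerGenie, KmerStream or Khmer: it keeps only k-mers whose hash is divisible by a hypothetical `sample_rate`, counts those exactly, and scales the results back up. The function names and the synthetic reads are invented for the example.

```python
import hashlib
import random
from collections import Counter

def kmer_hash(kmer):
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def estimate_f0_f1(reads, k, sample_rate=16):
    """Estimate F0 and f1 from the ~1/sample_rate of k-mers selected by hash."""
    sampled = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Selection depends only on the k-mer itself, so every occurrence
            # of a selected k-mer is counted and its abundance stays exact.
            if kmer_hash(kmer) % sample_rate == 0:
                sampled[kmer] += 1
    f0_est = len(sampled) * sample_rate
    f1_est = sum(1 for c in sampled.values() if c == 1) * sample_rate
    return f0_est, f1_est

# Synthetic data: overlapping 100-base reads from a random 20,000-base sequence.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(20000))
reads = [genome[i:i + 100] for i in range(0, len(genome) - 99, 50)]

f0_est, f1_est = estimate_f0_f1(reads, k=21)
print(f0_est, f1_est)
```

Memory use here shrinks roughly by the factor `sample_rate`, at the price of sampling error in the estimates; the tools compared in this study use more refined estimators.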
Sequence datasets.
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|
| 1 | F. vesca | 214 | 353 | 4.5 | 10.9 | 12,803,137 |
| 2 | H. sapiens 1 | 2,991 | 151 | 123.7 | 292.1 | 819,148,264 |
| 3 | H. sapiens 2 | 2,991 | 100 | 135.3 | 339.5 | 1,339,740,542 |
| 4 | Human genome for the individual NA19238 | 5,712.43 | 250 | 228.5 | 507.6 | 913,959,800 |
Estimated values of F0 and f1 by ntCard, KmerGenie 1.7048 and Khmer 2.1.1 for F. vesca for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 583,137,847 | 583,676,933 | 530,329,548 | 9.06 | 601,623,000 | 3.17 | 591,556,352 | 1.42 | ||
| 323,527,880 | 323,786,587 | 271,301,619 | 16.14 | 340,844,130 | 5.35 | - | - | |||
| 50 | 914,031,454 | 914,604,363 | 912,466,570 | 0.17 | 950,320,256 | 3.97 | 920,025,399 | 0.65 | ||
| 602,056,795 | 602,439,035 | 601,435,694 | 0.10 | 636,040,064 | 5.64 | - | - | |||
| 75 | 1,098,780,218 | 1,099,778,393 | 1,096,628,468 | 0.20 | 1,145,461,053 | 4.25 | 1,128,906,482 | 2.67 | ||
| 776,776,680 | 777,519,706 | 773,700,917 | 0.40 | 821,351,861 | 5.74 | - | - | |||
| 100 | 1,191,576,112 | 1,192,817,786 | 1,190,435,159 | 0.10 | 1,244,998,932 | 4.48 | 1,239,588,773 | 3.87 | ||
| 876,971,790 | 878,084,632 | 876,638,409 | 0.04 | 928,811,850 | 5.91 | - | - | |||
| 125 | 1,232,899,836 | 1,233,719,858 | 1,232,868,822 | 0.00 | 1,291,769,640 | 4.77 | 1,295,874,877 | 4.86 | ||
| 933,435,198 | 934,166,463 | 935,890,616 | 0.26 | 991,233,936 | 6.19 | - | - |
Column ‘Error%’ shows errors in percent. Column ‘DSK 2.2.0’ shows the exact values of F0 and f1.
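The Error% values can be checked against the exact DSK 2.2.0 counts. The sketch below assumes Error% is the absolute relative error expressed in percent; the source does not state the formula, but this assumption reproduces the first two Error% values in the k = 25 row of the table above. The function name `error_percent` is hypothetical.

```python
def error_percent(estimate, exact):
    """Absolute relative error of an estimate against the exact value, in percent."""
    return abs(estimate - exact) / exact * 100

exact_f0 = 583_137_847   # F. vesca, k = 25: exact F0 (DSK 2.2.0 column)

# Two estimated F0 values from the same row, each listed with an Error% value.
first = round(error_percent(530_329_548, exact_f0), 2)
second = round(error_percent(601_623_000, exact_f0), 2)
print(first, second)  # 9.06 3.17
```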
Estimated values of F0 and f1 by ntCard, KmerGenie 1.7048 and Khmer 2.1.1 for H. sapiens 1 for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 11,217,637,486 | 11,216,386,861 | 6,751,865,989 | 39.81 | 16,165,719,040 | 30.61 | 15,836,062,038 | 557.13 | ||
| 8,490,459,593 | 8,485,840,663 | 3,714,396,067 | 56.25 | 13,017,354,240 | 34.78 | - | - | |||
| 50 | 13,699,865,268 | 13,695,660,817 | 13,564,472,565 | 0.99 | 21,248,181,675 | 35.52 | 21,159,640,890 | 294.99 | ||
| 10,675,454,671 | 10,673,259,381 | 10,575,153,639 | 0.94 | 17,847,311,013 | 40.18 | - | - | |||
| 75 | 12,875,754,286 | 12,857,905,621 | 12,817,488,354 | 0.45 | 20,643,246,978 | 37.63 | 20,592,981,331 | 206.38 | ||
| 9,779,384,551 | 9,757,713,845 | 9,702,616,635 | 0.78 | 17,212,356,585 | 43.18 | - | - | |||
| 100 | 10,630,623,336 | 10,611,400,488 | 10,606,879,146 | 0.22 | 17,214,539,980 | 38.25 | 16,959,187,669 | 151.24 | ||
| 7,575,718,688 | 7,554,340,962 | 7,577,620,078 | 0.03 | 13,906,537,380 | 45.52 | - | - | |||
| 125 | 7,080,173,077 | 7,071,172,312 | 7,066,649,232 | 0.19 | 11,641,131,786 | 39.18 | 11,712,142,863 | 88.90 | ||
| 4,469,409,703 | 4,460,297,170 | 4,454,226,447 | 0.34 | 8,873,732,302 | 49.63 | - | - |
Column ‘Error%’ shows errors in percent. Column ‘DSK 2.2.0’ shows the exact values of F0 and f1.
Estimated values of F0 and f1 by ntCard, KmerGenie 1.7048 and Khmer 2.1.1 for H. sapiens 2 for k = 25, 50, 75 and 100.
|
|
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 6,317,577,945 | 6,321,370,851 | 4,440,606,371 | 29.71 | 6,493,972,870 | 2.79 | 6,447,772,640 | 2.02 | ||
| 3,726,921,849 | 3,728,402,513 | 2,124,581,580 | 42.99 | 3,900,164,180 | 4.65 | - | - | |||
| 50 | 7,576,436,303 | 7,575,568,030 | 7,507,180,258 | 0.91 | 7,768,465,452 | 2.53 | 7,808,249,016 | 2.97 | ||
| 4,610,596,550 | 4,612,437,322 | 4,554,225,163 | 1.22 | 4,803,170,055 | 4.18 | - | - | |||
| 75 | 6,645,775,719 | 6,634,466,410 | 6,620,493,247 | 0.38 | 6,770,079,405 | 1.87 | 6,820,243,427 | 2.56 | ||
| 3,601,391,827 | 3,590,235,204 | 3,574,106,736 | 0.76 | 3,726,960,545 | 3.49 | - | - | |||
| 100 | 2,055,560,283 | 2,054,217,987 | 2,055,238,955 | 0.02 | 2,067,553,684 | 0.58 | 2,055,943,971 | 0.02 | ||
| 1,668,703,535 | 1,667,591,992 | 1,668,568,360 | 0.01 | 1,679,933,472 | 0.67 | - | - |
Column ‘Error%’ shows errors in percent. Column ‘DSK 2.2.0’ shows the exact values of F0 and f1.
Estimated values of F0 and f1 by ntCard, KmerGenie 1.7048 and Khmer 2.1.1 for human genome NA19238 for k = 25, 50, 75, 100 and 125.
|
|
|
|
|
|
|
|
|
|
|
|
|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 15,695,189,022 | 15,698,708,120 | 7,874,114,589 | 49.83 | 15,774,752,125 | 0.51 | 15,821,852,733 | 0.80 | ||
| 12,590,059,674 | 12,589,842,091 | 4,097,744,520 | 67.45 | 12,660,147,875 | 0.56 | - | - | |||
| 50 | 21,386,123,607 | 21,388,812,027 | 21,129,592,460 | 1.20 | 21,527,611,504 | 0.66 | 21,394,927,325 | 0.04 | ||
| 18,003,764,501 | 18,008,739,129 | 17,708,817,546 | 1.64 | 18,125,064,461 | 0.67 | - | - | |||
| 75 | 22,940,994,545 | 22,903,722,006 | 22,768,767,319 | 0.75 | 23,081,411,904 | 0.61 | 23,158,655,606 | 0.94 | ||
| 19,455,121,869 | 19,413,069,020 | 19,200,406,066 | 1.31 | 19,580,787,984 | 0.65 | - | - | |||
| 100 | 22,825,882,964 | 22,795,280,303 | 22,837,840,964 | 0.05 | 22,981,764,792 | 0.68 | 22,657,973,838 | 0.74 | ||
| 19,311,399,602 | 19,271,642,379 | 19,350,438,432 | 0.20 | 19,462,845,864 | 0.78 | - | ||||
| 125 | 21,623,019,167 | 21,584,850,645 | 21,572,310,929 | 0.23 | 21,771,418,913 | 0.69 | 21,674,560,549 | 0.24 | ||
| 18,103,091,932 | 18,054,837,842 | 17,990,743,190 | 0.62 | 18,227,641,426 | 0.69 | - | - |
Column ‘Error%’ shows errors in percent. Column ‘DSK 2.2.0’ shows the exact values of F0 and f1.
Summary of results from Appendix (Tables A1-A4).
|
|
|
|
|
| |||
|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
| ||
| 25 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | Khmer 2.1.1 | KmerStream 1.1 | Khmer 2.1.1 | |
| 50 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 75 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 100 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 125 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 25 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | Khmer 2.1.1 | KmerStream 1.1 | Khmer 2.1.1 | |
| 50 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 75 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 100 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 125 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | ntCard 1.0.0 | ntCard 1.0.0 | Khmer 2.1.1 | |
| 25 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | Khmer 2.1.1 | KmerStream 1.1 | Khmer 2.1.1 | |
| 50 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 75 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 | |
| 100 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |
| 25 | ntCard 1.0.0 | Khmer 2.1.1 | Khmer 2.1.1 | Khmer 2.1.1 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |
| 50 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |
| 75 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |
| 100 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |
| 125 | ntCard 1.0.0 | KmerGenie 1.7048 | Khmer 2.1.1 | ntCard 1.0.0 | KmerStream 1.1 | Khmer 2.1.1 (unique-kmers.py) | |