Abstract
MOTIVATION: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of billions of strings; in such cases, the memory consumption and query efficiency of the data structure are concrete challenges.
Year: 2022 PMID: 35758794 PMCID: PMC9235479 DOI: 10.1093/bioinformatics/btac245
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Basic statistics for the datasets used in the experiments, for k = 31: number of k-mers (n), paths (p), and bases (N)
| Dataset | n | p | N | ⌈log2 N⌉ |
|---|---|---|---|---|
| Cod | 502 465 200 | 2 406 681 | 574 665 630 | 30 |
| Kestrel | 1 150 399 205 | 682 344 | 1 170 869 525 | 31 |
| Human | 2 505 445 761 | 13 014 641 | 2 895 884 991 | 32 |
| Bacterial | 5 350 807 438 | 26 449 008 | 6 144 277 678 | 33 |
Fig. 1. A schematic representation of the proposed dictionary data structure. The input of the example contains p = 4 strings (pictorially separated by a ‘.’ symbol, but practically by the Endpoints array) for a total of N = 405 bases, and k-mers for k = 31. There are M = 24 minimizers for m = 8 and z = 28 super-k-mers, thus the Sizes and Offsets arrays have length, respectively, and z = 28. All minimizers have bucket size equal to 1 except for 3 of them (i.e. AACCTGAA, ATCCTGAA, TGTCAAAG) that have bucket size equal to 2. The picture also shows an example of Lookup for the k-mer g = ACATCCTGAAAATTGTCAAAGAATGGCGGCG, whose minimizer r = ATCCTGAA is highlighted in bold font. The flow of the algorithm is represented by the arrows. First, the function f returns the identifier of r as f(r) = 5. Then the bucket size of r is computed: in this case, we have , indicating that there are 2 super-k-mers to consider. The offsets of the super-k-mers are retrieved as and . The two super-k-mers are scanned in Strings starting at and , respectively. (At most k − m + 1 = 24 k-mers are considered in each super-k-mer, as highlighted by the gray box. See the Supplementary Material for a discussion about this point.) Lastly, the k-mer g is found at position w = 8 in the second super-k-mer, i.e. at offset . Since there are two strings before the one containing g, then j = 2, and we have to discard invalid ranks for the calculation of the identifier i of g. Therefore, we return
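The Lookup flow described in the caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain `dict` stands in for the minimal perfect hash function f, minimizers are chosen by lexicographic order, a super-k-mer is simplified to a maximal run of consecutive k-mers with the same minimizer string, and the names `minimizer`, `build`, and `lookup` are hypothetical. The identifier is computed as the absolute position minus j(k − 1), on the reasoning that each of the j preceding strings contributes k − 1 positions in the concatenation where no valid k-mer starts.

```python
from bisect import bisect_right

def minimizer(kmer, m):
    # Assumption: lexicographically smallest m-mer; real tools often use a hash order.
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def build(strings, k, m):
    """Return (Strings, Endpoints, f, Sizes, Offsets) for a list of input strings."""
    text = "".join(strings)
    endpoints, runs, base = [], {}, 0
    for s in strings:
        prev, run_start, run_len = None, 0, 0
        for i in range(len(s) - k + 1):
            r = minimizer(s[i:i + k], m)
            if r != prev:  # minimizer changed: close the current super-k-mer
                if prev is not None:
                    runs.setdefault(prev, []).append((base + run_start, run_len))
                prev, run_start, run_len = r, i, 0
            run_len += 1
        if prev is not None:
            runs.setdefault(prev, []).append((base + run_start, run_len))
        base += len(s)
        endpoints.append(base)  # exclusive end of each string in the concatenation
    ordered = sorted(runs)
    f = {r: i for i, r in enumerate(ordered)}  # stand-in for an MPHF
    sizes, offsets = [0], []
    for r in ordered:
        offsets.extend(runs[r])  # (offset, number of k-mers) per super-k-mer
        sizes.append(sizes[-1] + len(runs[r]))  # prefix sums of bucket sizes
    return text, endpoints, f, sizes, offsets

def lookup(index, kmer, m):
    """Return the identifier of kmer, or -1 if it is not in the dictionary."""
    text, endpoints, f, sizes, offsets = index
    k = len(kmer)
    b = f.get(minimizer(kmer, m))
    if b is None:  # a real MPHF cannot detect this; the scan below would simply fail
        return -1
    for off, count in offsets[sizes[b]:sizes[b + 1]]:  # super-k-mers of the bucket
        for w in range(count):
            pos = off + w
            if text[pos:pos + k] == kmer:
                j = bisect_right(endpoints, pos)  # strings that end before pos
                return pos - j * (k - 1)  # discard positions with no valid k-mer
    return -1

index = build(["ACGTACGTGA", "TTGACCA"], k=5, m=3)
print(lookup(index, "TGACC", m=3))
```

Because `Sizes` holds prefix sums, the bucket of minimizer number b is the slice `offsets[sizes[b]:sizes[b + 1]]`, so a bucket's size costs two array reads and no per-bucket length needs to be stored.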
Bucket size distribution (%) for k = 31 and the first k-mers of the human genome, by varying minimizer length m
| Size / m | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 13.7 | 19.8 | 29.7 | 42.4 | 61.5 | 79.5 | 89.8 | 94.4 | 96.3 | 97.1 | 97.5 |
| 2 | 7.5 | 10.6 | 14.4 | 17.7 | 19.4 | 13.6 | 7.3 | 3.9 | 2.4 | 1.7 | 1.4 |
| 3 | 5.2 | 7.3 | 8.8 | 10.4 | 8.4 | 3.7 | 1.4 | 0.8 | 0.5 | 0.4 | 0.4 |
| 4 | 4.0 | 5.5 | 6.0 | 7.0 | 4.1 | 1.3 | 0.5 | 0.3 | 0.2 | 0.2 | 0.2 |
| 5 | 3.2 | 4.4 | 4.5 | 5.0 | 2.2 | 0.6 | 0.3 | 0.2 | 0.1 | 0.1 | 0.1 |
Fig. 2. A schematic view of the skew index component of the dictionary (a), comprising partitions , each consisting of an MPHF f and a compact vector V. Let us consider an example (b) of Lookup for g = GAACCTGAAAACATCCTGAAAATTGTCAAAG and . Suppose that the bucket for the minimizer contains s = 13 super-k-mers (whose offsets are in the picture), thus it belongs to partition i = 3 because . (Each integer in V3 is less than , so it can be coded in bits.) Now, also suppose that g is located in the 9th super-k-mer of the bucket (i.e. that of index 8). It would then be time-consuming to fully scan the 8 super-k-mers before the 9th. Therefore, we retrieve 8 (the index of the 9th super-k-mer, where g is located) from and know that g has to be searched for in the super-k-mer whose offset is
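The idea in the caption, jumping straight to the right super-k-mer of a heavy bucket instead of scanning all of them, can be illustrated as follows. This is only a sketch under assumptions: a `dict` from k-mer to super-k-mer index replaces the MPHF f plus compact vector V of a partition, the partitioning rule itself is not modelled, and the function names are hypothetical. The helper `bits_per_entry` shows the generic arithmetic that an index into a bucket of s super-k-mers fits in ⌈log2 s⌉ bits.

```python
from math import ceil, log2

def bits_per_entry(bucket_size):
    """Bits needed to store an index in the range [0, bucket_size)."""
    return max(1, ceil(log2(bucket_size)))

def build_skew_map(text, super_kmers, k):
    """Map every k-mer of one heavy bucket to the index of its super-k-mer.

    super_kmers is a list of (offset, number_of_kmers) pairs into text.
    """
    where = {}  # stand-in for an MPHF plus a compact vector of small integers
    for idx, (off, count) in enumerate(super_kmers):
        for w in range(count):
            where[text[off + w:off + w + k]] = idx
    return where

def skew_lookup(text, super_kmers, where, kmer):
    """Scan only the super-k-mer that the map points to; return position or -1."""
    k = len(kmer)
    idx = where.get(kmer)
    if idx is None:
        return -1
    off, count = super_kmers[idx]
    for w in range(count):
        if text[off + w:off + w + k] == kmer:
            return off + w
    return -1
```

With a real MPHF, a k-mer outside the bucket would still map to some index, so the final scan of the selected super-k-mer is what rejects negative queries.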
Space in bits/k-mer (bpk) and Lookup time (indicated by Lkp+ for positive queries; by Lkp– for negative) in average ns/k-mer for regular and canonical SSHash dictionaries by varying minimizer length m
| Dataset | bpk | Lkp+ | Lkp– | bpk | Lkp+ | Lkp– | bpk | Lkp+ | Lkp– | bpk | Lkp+ | Lkp– |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cod | m = 15 | | | | | | | | | m = 18 | | |
| Regular | 6.60 | 1236 | 1267 | 6.82 | 1100 | 1174 | | | | 7.21 | 1015 | 1157 |
| Canonical | 7.68 | 945 | 768 | | | | 8.18 | 786 | 672 | 8.47 | 755 | 658 |
| Kestrel | | | | | | | m = 18 | | | m = 19 | | |
| Regular | 6.19 | 1137 | 1323 | | | | 6.79 | 1005 | 1245 | 7.12 | 997 | 1240 |
| Canonical | | | | 7.68 | 790 | 722 | 8.09 | 743 | 696 | 8.51 | 730 | 691 |
| Human | m = 17 | | | m = 18 | | | | | | | | |
| Regular | 7.44 | 1591 | 1668 | 7.67 | 1459 | 1573 | 7.95 | 1406 | 1547 | | | |
| Canonical | 8.76 | 1150 | 936 | 9.04 | 1054 | 881 | | | | 9.80 | 958 | 838 |
| Bacterial | m = 18 | | | | | | | | | m = 21 | | |
| Regular | 7.42 | 1535 | 1867 | 7.80 | 1425 | 1813 | | | | 8.70 | 1368 | 1774 |
| Canonical | 8.75 | 1129 | 1043 | | | | 9.75 | 1028 | 947 | 10.34 | 998 | 956 |
Note: For each dataset, we indicate promising configurations in bold font.
Dictionary space in total GB and average bits/k-mer (bpk)
| Dictionary | Cod (GB) | Cod (bpk) | Kestrel (GB) | Kestrel (bpk) | Human (GB) | Human (bpk) | Bacterial (GB) | Bacterial (bpk) |
|---|---|---|---|---|---|---|---|---|
| dBG-FM, | 0.22 | 3.48 | 0.44 | 3.07 | — | — | — | — |
| dBG-FM, | 0.27 | 4.38 | 0.55 | 3.86 | — | — | — | — |
| dBG-FM, | 0.39 | 6.16 | 0.78 | 5.43 | — | — | — | — |
| Pufferfish, sparse | 1.75 | 27.80 | 3.69 | 25.66 | 8.87 | 28.32 | 18.91 | 28.28 |
| | 1.49 | 23.70 | 3.37 | 23.40 | 7.50 | 23.96 | 16.09 | 24.06 |
| Pufferfish, dense | 2.69 | 42.76 | 5.97 | 41.54 | 14.11 | 45.04 | 30.70 | 45.89 |
| | 2.43 | 38.66 | 5.65 | 39.28 | 12.74 | 40.68 | 27.88 | 41.68 |
| Blight, | 0.91 | 14.53 | 2.16 | 15.00 | 5.04 | 16.11 | 11.40 | 17.04 |
| Blight, | 1.04 | 16.57 | 2.45 | 17.04 | 5.67 | 18.13 | 12.74 | 19.05 |
| Blight, | 1.17 | 18.61 | 2.74 | 19.06 | 6.32 | 20.17 | 14.12 | 21.11 |
| SSHash, regular | 0.44 | 6.98 | 0.93 | 6.48 | 2.59 | 8.28 | 5.50 | 8.22 |
| SSHash, canonical | 0.50 | 7.92 | 1.00 | 7.30 | 2.94 | 9.39 | 6.17 | 9.22 |
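The two columns reported per dataset are related by simple arithmetic, which the following sketch makes explicit. It assumes that 1 GB means 10^9 bytes (an assumption, though it agrees with the table up to rounding of the GB figures) and takes the number of k-mers n from the dataset statistics table; the function name is hypothetical.

```python
def bits_per_kmer(size_gb, num_kmers):
    """Convert a total size in GB into average bits per k-mer."""
    # Assumption: 1 GB = 10**9 bytes, 8 bits per byte.
    return size_gb * 1e9 * 8 / num_kmers

# SSHash, regular, on the Bacterial dataset: 5.50 GB for n = 5 350 807 438 k-mers.
print(round(bits_per_kmer(5.50, 5_350_807_438), 2))
```

Small differences from the reported bpk values are expected, because the GB column is itself rounded to two decimals.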
Dictionary Lookup time in average ns/k-mer
| Dictionary | Cod (Lkp+) | Cod (Lkp–) | Kestrel (Lkp+) | Kestrel (Lkp–) | Human (Lkp+) | Human (Lkp–) | Bacterial (Lkp+) | Bacterial (Lkp–) |
|---|---|---|---|---|---|---|---|---|
| dBG-FM, | 22 980 | 16 501 | 23 934 | 16 764 | — | — | — | — |
| dBG-FM, | 15 013 | 10 919 | 15 929 | 11 462 | — | — | — | — |
| dBG-FM, | 11 386 | 7929 | 11 703 | 8073 | — | — | — | — |
| Pufferfish, sparse | 1110 | 700 | 5456 | 769 | 13 656 | 862 | 27 748 | 983 |
| Pufferfish, dense | 624 | 439 | 635 | 485 | 720 | 519 | 816 | 582 |
| Blight, | 2520 | 2751 | 2743 | 3104 | 2820 | 3329 | 3105 | 3913 |
| Blight, | 1800 | 1643 | 1916 | 1820 | 2008 | 1975 | 2095 | 2146 |
| Blight, | 1571 | 1317 | 1692 | 1472 | 1780 | 1610 | 1859 | 1751 |
| SSHash, regular | 1045 | 1158 | 1042 | 1265 | 1338 | 1530 | 1389 | 1780 |
| SSHash, canonical | 834 | 690 | 882 | 781 | 990 | 854 | 1051 | 995 |
Query time for streaming membership queries for various dictionaries
| Dictionary | Cod (Tot) | Cod (Avg) | Kestrel (Tot) | Kestrel (Avg) | Human (Tot) | Human (Avg) | Bacterial (Tot) | Bacterial (Avg) |
|---|---|---|---|---|---|---|---|---|
| (a) high-hit workload | | | | | | | | |
| Query file | SRR12858649 | | SRR11449743 | | SRR5833294 | | SRR5901135 | |
| Hits | 81.37% | | 74.60% | | 91.65% | | 87.79% | |
| Pufferfish, sparse | 0.6 | 214 | 14.1 | 609 | 17.0 | 651 | 9.1 | 691 |
| Pufferfish, dense | 0.2 | 92 | 8.5 | 368 | 10.5 | 402 | 5.3 | 404 |
| Blight, | 2.1 | 766 | 32.5 | 1400 | 27.3 | 1041 | 11.4 | 864 |
| Blight, | 1.2 | 453 | 16.6 | 714 | 17.5 | 670 | 8.6 | 648 |
| Blight, | 0.8 | 282 | 10.8 | 464 | 11.5 | 440 | 5.8 | 434 |
| SSHash, regular | 0.5 | 166 | 6.2 | 267 | 8.2 | 311 | 3.0 | 223 |
| SSHash, canonical | 0.3 | 111 | 5.1 | 219 | 6.7 | 253 | 2.4 | 184 |
| (b) low-hit workload | | | | | | | | |
| Query file | SRR11449743 | | SRR12858649 | | SRR5901135 | | SRR5833294 | |
| Hits | 0.659% | | 0.484% | | 0.002% | | 0.086% | |
| Pufferfish, sparse | 14.6 | 627 | 0.9 | 312 | 11.3 | 855 | 25.5 | 975 |
| Pufferfish, dense | 8.7 | 374 | 0.2 | 92 | 5.8 | 435 | 13.6 | 518 |
| Blight, | 72.2 | 3112 | 6.6 | 2407 | 35.7 | 2704 | 253.2 | 9675 |
| Blight, | 45.9 | 1978 | 3.0 | 1115 | 19.1 | 1445 | 117.7 | 4498 |
| Blight, | 18.1 | 780 | 1.8 | 655 | 14.4 | 1088 | 32.2 | 1232 |
| SSHash, regular | 10.7 | 463 | 0.9 | 314 | 6.2 | 463 | 14.3 | 544 |
| SSHash, canonical | 5.1 | 220 | 0.4 | 155 | 2.5 | 183 | 6.4 | 244 |
Note: The query time is reported as total time in minutes (tot), and average ns/k-mer (avg). We also indicate the query file (SRR number) and the percentage of hits. Both high-hit ( hits) and low-hit ( hits) workloads are considered.
Dictionary construction times in minutes (using a single processing thread) and peak internal memory used during construction in GB (Blight’s performance was the same for all values of b in the experiment)
| Dictionary | Cod (min) | Cod (GB) | Kestrel (min) | Kestrel (GB) | Human (min) | Human (GB) | Bacterial (min) | Bacterial (GB) |
|---|---|---|---|---|---|---|---|---|
| dBG-FM, | 28.5 | 0.5 | 100.0 | 0.7 | — | — | — | — |
| dBG-FM, | 28.5 | 0.6 | 100.0 | 0.9 | — | — | — | — |
| dBG-FM, | 28.5 | 0.7 | 100.0 | 1.1 | — | — | — | — |
| Pufferfish, sparse | 15.5 | 3.3 | 35.2 | 6.7 | 86.0 | 19.4 | 200.8 | 40.1 |
| Pufferfish, dense | 13.0 | 2.8 | 29.2 | 5.9 | 70.7 | 14.0 | 173.2 | 30.4 |
| Blight | 5.0 | 3.3 | 11.0 | 7.0 | 25.0 | 7.5 | 50.0 | 15.8 |
| SSHash, regular | 1.5 | 2.6 | 3.8 | 5.7 | 12.5 | 15.4 | 29.6 | 33.4 |
| SSHash, canonical | 2.0 | 2.8 | 4.4 | 5.8 | 16.2 | 17.3 | 36.0 | 36.6 |