Yaron Orenstein, David Pellow, Guillaume Marçais, Ron Shamir, Carl Kingsford.
Abstract
With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
Year: 2017 PMID: 28968408 PMCID: PMC5645146 DOI: 10.1371/journal.pcbi.1005777
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
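As a concrete illustration of the UHS definition in the abstract, the property can be checked by brute force for tiny parameters: a set of k-mers is a universal hitting set for L if no L-long sequence avoids it. The sketch below is ours, not part of the paper (the function name and the two-letter example alphabet are illustrative; real DNA instances with the paper's k and L are far too large for enumeration):

```python
from itertools import product

def is_uhs(kmers, k, L, alphabet="ACGT"):
    """Brute-force check of the UHS property: every L-long string over
    the alphabet must contain at least one k-mer from the set.
    Only feasible for tiny |alphabet|**L."""
    kmers = set(kmers)
    for tup in product(alphabet, repeat=L):
        s = "".join(tup)
        if not any(s[i:i + k] in kmers for i in range(L - k + 1)):
            return False  # found an L-long sequence that avoids the set
    return True
```

For example, over the two-letter alphabet {A, C} with k = 2 and L = 3, the set {AA, CC, AC} hits every 3-long string, while {AA, CC} misses the string ACA.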
Length of longest sequence avoiding an unavoidable set for different values of k.
For each value k, a minimum decycling set was removed from a complete de Bruijn graph, and the length L of the longest sequence, represented as a longest path, was calculated.
| k | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L | 5 | 11 | 20 | 45 | 70 | 117 | 148 | 239 | 311 | 413 | 570 | 697 | 931 |
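The computation behind this table can be mimicked on toy instances: remove a k-mer set from the complete de Bruijn graph (nodes are (k-1)-mers, edges are the surviving k-mers) and take the longest remaining path; a path with e edges spells a sequence of e + k - 1 characters. A sketch of that idea, under stated assumptions (the function name is ours; it assumes the removed set is decycling, so the remaining graph is acyclic, and it uses recursion, so it only suits small k):

```python
from itertools import product

def longest_avoiding(removed, k, alphabet="ACGT"):
    """Length of the longest sequence containing no k-mer from `removed`,
    computed as a longest path in the de Bruijn graph minus those edges.
    Assumes `removed` is a decycling set (remaining graph is acyclic)."""
    adj = {}
    for tup in product(alphabet, repeat=k):
        m = "".join(tup)
        if m not in removed:
            adj.setdefault(m[:-1], []).append(m[1:])  # prefix -> suffix edge
    if not adj:
        return k - 1  # all k-mers removed; only (k-1)-long sequences avoid the set
    state, memo = {}, {}

    def dfs(v):  # longest number of edges on a path starting at node v
        if state.get(v) == 1:
            raise ValueError("remaining graph has a cycle: no finite bound")
        if v in memo:
            return memo[v]
        state[v] = 1
        best = max((1 + dfs(w) for w in adj.get(v, [])), default=0)
        state[v] = 2
        memo[v] = best
        return best

    nodes = set(adj) | {w for ws in adj.values() for w in ws}
    # a path with e edges spells a sequence of e + k - 1 characters
    return max(dfs(v) for v in nodes) + k - 1
```

On the toy instance above (alphabet {A, C}, k = 2, removed set {AA, CC, AC}), only the edge CA survives, so the longest avoiding sequence is "CA" of length 2.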
Fig 1. Performance of DOCKS.
For different combinations of k and L we ran DOCKS over the DNA alphabet. (A) Set sizes. The results are shown as a fraction of the total number of k-mers, |Σ|^k. The broken lines show the decycling set size for each k. (B) Running time in seconds. Note that the y-axis is in log scale. (C) Maximum memory usage in megabytes. Note that the y-axis is in log scale.
Fig 2. Comparison of the sizes of the universal sets generated by the different heuristics.
The histogram shows the size of the universal sets generated by DOCKS, DOCKSany, and DOCKSanyX with X = 625. The results are for k = 10 and 20 ≤ L ≤ 200. The size of the decycling set is provided as a lower bound for comparison.
Fig 3. Performance of the ILP solver compared to DOCKS.
For each combination of 5 ≤ k ≤ 10 and 20 ≤ L ≤ 200, we ran the ILP solver for up to 24 hours starting from a DOCKS feasible solution. The histograms show the percent improvement in k-mer set size achieved by the ILP solver compared to DOCKS. For L > 60 and all tested values of k, the improvement was <1%.
The number of 10-mers needed to hit all 30-long sequences in four genomes: two bacterial genomes (A. tropicalis, C. crescentus), the worm C. elegans, and a mammalian genome, H. sapiens.
The genome sizes are quoted after removing all Ns and ambiguous codes. We tested three algorithms: minimizers picking the lexicographically smallest 10-mer, minimizers picking the first 10-mer in a random k-mer ordering, and selection using the set produced by DOCKS. When a 30-long window contained multiple DOCKS-selected 10-mers, the lexicographically smallest was chosen. # mers is the number of distinct 10-mers selected, and avg. dist. is the average distance between two consecutive selected 10-mers.
| Species | Genome size (Mbp) | Method | # mers (thousands) | avg. dist. |
|---|---|---|---|---|
| A. tropicalis | 0.393 | lexicographic | 32.9 | 9.48 |
| | | randomized | 28.0 | 11.0 |
| | | DOCKS | 23.7 | 12.4 |
| C. crescentus | 4 | lexicographic | 114.0 | 10.2 |
| | | randomized | 89.6 | 11.0 |
| | | DOCKS | 66.0 | 12.4 |
| C. elegans | 100 | lexicographic | 286.0 | 8.83 |
| | | randomized | 277.0 | 11.0 |
| | | DOCKS | 145.0 | 12.4 |
| H. sapiens | 2900 | lexicographic | 543.0 | 9.13 |
| | | randomized | 389.0 | 10.9 |
| | | DOCKS | 154.0 | 12.1 |
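The selection schemes compared in the table above can be sketched in a few lines (a simplified illustration, not the DOCKS implementation; the function name is ours). For each L-long window, the minimizer is the position of the smallest k-mer under some ordering; restricting the candidates to a universal hitting set is what guarantees, by the UHS property, that every window still contains at least one candidate:

```python
def select_positions(seq, k, L, key=None, universal=None):
    """For every L-long window, pick one k-mer position: the smallest
    k-mer under `key` (lexicographic by default), optionally restricted
    to k-mers belonging to a universal hitting set `universal`."""
    key = key or (lambda m: m)
    chosen = set()
    for i in range(len(seq) - L + 1):
        cands = [(key(seq[j:j + k]), j)
                 for j in range(i, i + L - k + 1)
                 if universal is None or seq[j:j + k] in universal]
        chosen.add(min(cands)[1])  # nonempty whenever `universal` is a UHS
    return sorted(chosen)
```

In the table's terms, "# mers" corresponds to the number of distinct k-mers at the selected positions, and "avg. dist." to the mean gap between consecutive selected positions.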