Guillaume Marçais, David Pellow, Daniel Bork, Yaron Orenstein, Ron Shamir, Carl Kingsford.
Abstract
MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.
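The selection procedure described above can be illustrated with a minimal sketch. It assumes the usual formulation in which a window holds w consecutive k-mers and the smallest k-mer of each window under a given ordering is selected; the leftmost tie-breaking rule, the function names and the hash-based ordering `hashed_order` are illustrative assumptions, not taken from the paper.

```python
import hashlib


def minimizer_positions(seq, k, w, order=None):
    """Start positions of the k-mers selected as minimizers.

    A window is w consecutive k-mers; in each window the smallest k-mer
    under `order` is selected (leftmost on ties, an assumption).
    """
    if order is None:
        order = lambda kmer: kmer  # lexicographic order on the string
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        # Compare (rank of k-mer, position) so ties go to the leftmost k-mer.
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def hashed_order(kmer):
    """Hypothetical pseudo-random ordering: compare k-mers by a hash."""
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()


print(minimizer_positions("ACGTACGT", k=3, w=3))  # [0, 1, 4]
print(minimizer_positions("ACGTACGT", k=3, w=3, order=hashed_order))
```

On a homopolymer such as `"AAAAAAAA"`, the lexicographic ordering with this tie-breaking rule selects the first k-mer of every window, which is one way the "too many k-mers" behavior can arise.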
Year: 2017 PMID: 28881970 PMCID: PMC5870760 DOI: 10.1093/bioinformatics/btx235
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Window W, starting at position i, and window W′, starting at position i − 1. There are 3 different qualitative cases for the start position of the smallest k-mer m: i − 1 (left dot), i + w − 1 (right dot) or in the range [i, i + w − 2]
Statistics on the sparsity and density factor of the universal hitting sets generated by random ordering, the DOCKS universal hitting set and the lexicographic ordering
| Ordering | sparsity (%) | % | d | estimated d |
|---|---|---|---|---|
| random | 0.07 | 51 | 1.999 | 1.998 |
| DOCKS | 13.3 | 21 | 1.737 | 1.733 |
| lexicographic | 0.00 | 100 | 2.236 | 2.000 |
The computation is done with k = 10, w = 10 on a binary alphabet. The values for the random ordering are averages over 1000 different randomized orderings. d is the density factor; the last column is the density factor estimated from the sparsity of the set by Equation 3. The difference between these numbers is due to the imperfect nature of Hypotheses 2.
Fig. 2. Distribution of the separation between minimizers for k = 7 and w = 11 on DNA sequences. (A) Results on a de Bruijn sequence of order w + k. (B) Results computed on the human reference genome (hg19). Each line represents a different minimizer scheme using a different ordering. Note that previous heuristic orderings all behave like the randomized orderings (uniform distribution) except for separations of 1 and 2. The universal k-mer ordering computed by DOCKS has a noticeably different distribution, with a mode and a higher mean.
Statistics on the distribution of the distances between minimizers in Figure 2
| Sequence | Ordering | d | distance, mean ± stdev | low sep. (%) |
|---|---|---|---|---|
| dbg | lexico. | 2.18 | 5.5 ± 3.4 | 27 |
| dbg | random | 2.00 | 6.0 ± 3.2 | 18 |
| dbg | Minimap | 2.05 | 5.9 ± 3.2 | 21 |
| dbg | KMC2 | 1.97 | 6.1 ± 3.2 | 18 |
| dbg | UMD Ovl | 1.91 | 6.3 ± 3.0 | 14 |
| dbg | Kraken | 1.88 | 6.4 ± 2.9 | 11 |
| dbg | DOCKS | 1.75 | 6.9 ± 2.5 | 4.6 |
| human | lexico. | 2.34 | 5.1 ± 3.4 | 33 |
| human | random | 2.02 | 6.0 ± 3.2 | 19 |
| human | Minimap | 2.09 | 5.8 ± 3.3 | 22 |
| human | KMC2 | 2.02 | 5.9 ± 3.3 | 19 |
| human | UMD Ovl | 1.97 | 6.1 ± 3.1 | 17 |
| human | Kraken | 1.93 | 6.2 ± 3.0 | 13 |
| human | DOCKS | 1.77 | 6.7 ± 2.6 | 6.2 |
The table reports the density factor (d), the mean distance between minimizers (mean ± stdev) and the percentage of selected k-mers that are consecutive or separated by one base (low sep.). These were computed on a de Bruijn sequence (dbg) and on the human genome sequence (human).
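As a rough illustration of how such statistics could be derived from a list of selected positions, the sketch below computes the gaps between consecutive minimizers. It assumes that the density factor equals (w + 1) divided by the mean distance, which is consistent with the rows above (for w = 11, 12 / 6.0 = 2.00 for the random ordering on dbg) but is not stated in this excerpt. The function name and the use of the population standard deviation are also assumptions.

```python
from statistics import mean, pstdev


def separation_stats(positions, w):
    """Summarize distances between consecutive selected k-mer positions.

    `positions` must be sorted start positions of selected k-mers.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    mean_gap = mean(gaps)
    return {
        # Assumed relation: density factor = (w + 1) / mean distance.
        "density_factor": (w + 1) / mean_gap,
        "mean": mean_gap,
        "stdev": pstdev(gaps),  # population stdev, an assumption
        # Distance 1 = consecutive k-mers, distance 2 = separated by one base.
        "low_sep_pct": 100 * sum(g <= 2 for g in gaps) / len(gaps),
    }


print(separation_stats([0, 2, 4, 6, 12], w=11))
```

The input can be any sorted list of minimizer start positions, for example one produced by a minimizer selection routine run with the ordering under study.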
Statistics on the distribution of the bin sizes
| Sequence | Ordering | # bins | avg size (megabases) | max ratio | KL divergence |
|---|---|---|---|---|---|
| dbg | lexico. | 16384 | 4.19 | 367 | 1.37 |
| dbg | random | 12003 | 5.73 | 9.32 | 0.23 |
| dbg | Minimap | 13267 | 5.18 | 10.4 | 0.27 |
| dbg | KMC2 | 12370 | 5.56 | 166 | 1.13 |
| dbg | UMD Ovl | 13108 | 5.24 | 274 | 1.37 |
| dbg | Kraken | 12502 | 5.5 | 210 | 1.31 |
| dbg | DOCKS | 4063 | 16.9 | 7.44 | 0.05 |
| human | lexico. | 16285 | 0.175 | 96.4 | 0.91 |
| human | random | 11388 | 0.251 | 38.3 | 0.62 |
| human | Minimap | 12280 | 0.233 | 69.5 | 0.65 |
| human | KMC2 | 12287 | 0.233 | 74.5 | 0.77 |
| human | UMD Ovl | 11015 | 0.259 | 26.1 | 0.61 |
| human | Kraken | 10389 | 0.275 | 26.3 | 0.57 |
| human | DOCKS | 4046 | 0.706 | 23.5 | 0.12 |
The table reports the number of bins created (# bins), the average bin size (avg size, in millions of bases), the ratio of the largest bin size to the average (max ratio), and the Kullback-Leibler divergence between B, the distribution of the bin sizes, and U, the uniform distribution (a smaller divergence implies that the distribution B is closer to the uniform distribution). These were computed on a de Bruijn sequence (dbg) and the human genome sequence (human).
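The bin statistics described above could be computed along the following lines. This is a sketch under stated assumptions: B is taken to be the bin sizes normalized to sum to 1, U is uniform over the same bins, and the natural logarithm is used. The excerpt does not specify these details, and the function name is hypothetical.

```python
import math


def bin_size_stats(bin_sizes):
    """Summarize a list of bin sizes (number of bases assigned to each bin)."""
    n = len(bin_sizes)
    total = sum(bin_sizes)
    avg = total / n
    # Assumed definition of B: each bin's share of all bases.
    probs = [size / total for size in bin_sizes]
    # KL(B || U) with U uniform over n bins: sum p * log(p / (1/n)).
    kl = sum(p * math.log(p * n) for p in probs if p > 0)
    return {
        "n_bins": n,
        "avg_size": avg,
        "max_ratio": max(bin_sizes) / avg,
        "kl_divergence": kl,
    }


print(bin_size_stats([30, 10]))
```

Under these assumptions, perfectly balanced bins give a max ratio of 1 and a divergence of 0, and both values grow as a few bins absorb most of the sequence.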