Will P M Rowe.
Abstract
Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching.
Year: 2019 PMID: 31519212 PMCID: PMC6744645 DOI: 10.1186/s13059-019-1809-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Glossary of terms
| Term | Definition |
|---|---|
| Bit-pattern observable | The run of 0s in a binary string |
| Bit vector | An array data structure that holds bits |
| Canonical k-mer | The smaller of the two hash values obtained from a k-mer and its reverse complement |
| Hash function | A function that takes input data of arbitrary size and maps it to a bit string that is of fixed size and typically smaller than the input |
| Jaccard similarity | A similarity measure defined as the intersection of sets, divided by their union |
| K-mer decomposition | The process of extracting all sub-sequences of length k from a sequence |
| Minimizer | The smallest hash value in a set |
| Multiset | A set that allows for multiple instances of each of its elements (i.e. element frequency) |
| Register | A quickly accessible bit vector used to hold information |
| Sketch | A compact data structure that approximates a data set |
| Stochastic averaging | A process used to reduce the variance of an estimator |
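Several of the glossary terms (k-mer decomposition, canonical k-mer, Jaccard similarity) can be illustrated with a minimal Python sketch. The function names below are hypothetical and do not come from the article, and the choice of BLAKE2b as the hash function is an assumption made only for illustration; any well-mixing hash would serve.

```python
import hashlib

# Hypothetical helper names for illustration; none come from the article.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an arbitrary choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def canonical_hashes(seq, k):
    """K-mer decomposition: hash every length-k sub-sequence, keeping the
    smaller hash of each k-mer and its reverse complement (canonical k-mer)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        hashes.add(min(hash_kmer(kmer), hash_kmer(reverse_complement(kmer))))
    return hashes


def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    # Returning 1.0 for two empty sets is a convention chosen here.
    return len(a & b) / len(union) if union else 1.0


a = canonical_hashes("ACGTACGTGG", k=4)
b = canonical_hashes("ACGTACGTCC", k=4)
print(jaccard(a, b))
```

Because canonical hashes are used, a sequence and its reverse complement decompose to the same set, so the comparison does not depend on which strand was sequenced.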
Fig. 1 a Sketching applied to a genomic data stream. The genomic data stream is viewed via a window; the window size may be equivalent to the length of a sequence read, a genome or some other arbitrary length. The sequence within the window is decomposed into a set of constituent k-mers; each k-mer can be evaluated against its reverse complement to keep only the canonical k-mer. As k-mers are generated, they are sketched and the sketch data structure may be updated. The sketch can be evaluated, allowing feedback to the data stream process. b Common sketching algorithms applied to a single k-mer from a set, using example parameters. MinHash KHF: the k-mer is hashed by three functions, giving three values (green, blue, purple). The number of hash functions corresponds to the length of the sketch. Each value is evaluated against the corresponding position in the sketch; i.e. green is compared against the first value, blue against the second, and purple against the third. The sketch updates with any new minimum; e.g. the blue value is smaller than the existing one in this position (3 < 66), so replaces it. Bloom filter: the k-mer is hashed by two functions, giving two values (red and orange). The output range of the hash functions corresponds to the length of the sketch, here 0–3. The hash values are used to set bits to 1 at the corresponding positions. CountMin sketch: the k-mer is hashed by two functions, giving two values (red and brown). Each function corresponds to a row in the sketch, here 0 or 1, and the output range of the functions corresponds to the length of the rows, here 0–2. So the first hash value (red) gives matrix position 0,0 and the second gives 1,1. The counters held at these positions in the matrix are incremented. HyperLogLog: the k-mer is hashed by one function, giving a single value (10011). The prefix (brown) corresponds to a register, and the suffix (blue) corresponds to the bit-pattern observable. The suffix is compared to the existing value in register 1, is found to have more leading zeros and so replaces the existing value in register 1.
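The per-k-mer update rules described in Fig. 1b can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions, not the implementation used by any of the tools discussed: the function names are hypothetical, the seeded BLAKE2b hash merely stands in for a family of independent hash functions, and the HyperLogLog register here stores the count of leading zeros (as in the caption) rather than the rank used in full implementations.

```python
import hashlib


def h(kmer, seed, bits=32):
    """Seeded hash returning a `bits`-bit integer (assumed stand-in for a hash family)."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


def minhash_khf_update(sketch, kmer):
    """MinHash KHF: one hash function per sketch position; keep each minimum."""
    for i in range(len(sketch)):
        sketch[i] = min(sketch[i], h(kmer, seed=i))


def bloom_add(bits, kmer, n_hashes=2):
    """Bloom filter: set the bit addressed by each hash value to 1."""
    for i in range(n_hashes):
        bits[h(kmer, seed=i) % len(bits)] = 1


def bloom_contains(bits, kmer, n_hashes=2):
    """True if every addressed bit is set (false positives are possible)."""
    return all(bits[h(kmer, seed=i) % len(bits)] for i in range(n_hashes))


def countmin_add(rows, kmer):
    """CountMin: one hash function per row; increment the addressed counter."""
    for r, row in enumerate(rows):
        row[h(kmer, seed=r) % len(row)] += 1


def countmin_estimate(rows, kmer):
    """Frequency estimate: the minimum of the addressed counters."""
    return min(row[h(kmer, seed=r) % len(row)] for r, row in enumerate(rows))


def split_hash(value, prefix_bits, bits):
    """Split a hash into (register index, leading zeros of the suffix)."""
    suffix_bits = bits - prefix_bits
    register = value >> suffix_bits
    suffix = value & ((1 << suffix_bits) - 1)
    return register, suffix_bits - suffix.bit_length()


def hll_update(registers, kmer, prefix_bits=1, bits=5):
    """HyperLogLog: keep the longest run of leading zeros seen per register."""
    register, zeros = split_hash(h(kmer, seed=0, bits=bits), prefix_bits, bits)
    registers[register] = max(registers[register], zeros)


kmer = "ACGTA"
minhash = [2**32] * 3               # three hash functions, sketch of length 3
minhash_khf_update(minhash, kmer)
bloom = [0] * 4                     # positions 0-3
bloom_add(bloom, kmer)
cms = [[0] * 3 for _ in range(2)]   # two rows, each of length 3
countmin_add(cms, kmer)
hll = [0, 0]                        # one prefix bit, so two registers
hll_update(hll, kmer)

# The hash value 10011 from the caption: prefix 1, suffix 0011.
print(split_hash(0b10011, prefix_bits=1, bits=5))  # (1, 2)
```

The array sizes mirror the example parameters in the figure; in practice they would be far larger, since sketch length and hash range control the accuracy of each estimate.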
Examples of bioinformatic software utilizing sketching algorithms
| Software | Purpose | Sketching algorithm |
|---|---|---|
| GROOT | Variant detection in metagenomes | MinHash (KHF) |
| mashtree | Phylogenetic tree construction | MinHash (KMV) |
| MashMap | Long read alignment | Minimizer/MinHash (KMV) |
| MASH | Sequence analysis | MinHash (KMV) |
| sourmash | Sequence analysis | MinHash (KMV) |
| finch | Sequence analysis | MinHash (KMV) |
| MiniMap2 | Read alignment | Minimizer |
| ABySS | Genome assembly | Bloom filter |
| Lighter | Sequencing error correction | Bloom filter |
| BIGSI | Sequence index and search | Bloom filter |
| khmer | Sequence analysis | Count-Min sketch |
| FastEtch | Genome assembly | Count-Min sketch |
| dashing | Sequence analysis | HyperLogLog |
| krakenUniq | Metagenome classification | HyperLogLog |
| HULK | Sequence analysis | Histosketch |
| ntCard | Sequence analysis | ntCard |
| BBsketch | Sequence analysis | MinHash |
| MHAP | Genome assembly | MinHash |
| MC-MinH | Sequence clustering | MinHash |
| KmerGenie | Sequence analysis | Count-Min sketch variant |
| Squeakr | Sequence analysis | Counting Quotient Filter |
| Mantis | Sequence index and search | Counting Quotient Filter |
| kssd | Sequence analysis | K-mer Substring Space Decomposition |
An up-to-date list is provided in [40]