Hamid Mohamadi, Justin Chu, Benjamin P Vandervalk, Inanc Birol.
Abstract
MOTIVATION: Hashing has been widely used for indexing, querying and rapid similarity search in many bioinformatics applications, including sequence alignment, genome and transcriptome assembly, k-mer counting and error correction. Hence, expediting hashing operations would have a substantial impact in the field, making bioinformatics applications faster and more efficient.
Year: 2016 PMID: 27423894 PMCID: PMC5181554 DOI: 10.1093/bioinformatics/btw397
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Performance of ntHash. (a) A good hash function should have its values uniformly and independently distributed across the target domain. One way to measure this is through the correlation coefficients between the bits of the hashed values. The plot shows natural statistical fluctuations for smaller sample sets (100 data points, the area above the diagonal). The correlations dissipate rapidly for large sample sets (100 000 data points, the area below the diagonal). (b) Runtime for hashing 250 bp DNA sequences with different k-mer lengths from 50 to 250. ntHash outperforms all other hash methods when hashing more than two subsequent k-mers, i.e. k < 249. (c) Comparing multi-hashing runtime of ntHash with the leading hash functions for one billion 50-mers. ntHash performs over 20× faster than the closest competitor, cityhash. Grey, orange and blue bars refer to calculation of one, three and five hash functions, respectively.
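The bit-independence check described in panel (a) can be sketched as follows. This is a minimal illustration, not the authors' benchmark code: `hash64` is a stand-in built on BLAKE2b from Python's `hashlib` rather than ntHash, and the helper names, the random k-mer generator and the default of inspecting only 16 of the 64 bits are assumptions made to keep the example short and fast.

```python
import hashlib
import math
import random


def hash64(kmer: str) -> int:
    """Stand-in 64-bit hash (BLAKE2b truncated to 8 bytes). NOT ntHash."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def random_kmers(n: int, k: int, seed: int = 0) -> list:
    """Generate n random DNA strings of length k (hypothetical test data)."""
    rng = random.Random(seed)
    return ["".join(rng.choice("ACGT") for _ in range(k)) for _ in range(n)]


def bit_columns(values, n_bits: int = 64) -> list:
    """Return one list per bit position, holding that bit for every value."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits)]


def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; report 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def max_abs_offdiag_correlation(values, n_bits: int = 16) -> float:
    """Largest |correlation| between any two distinct bit positions."""
    cols = bit_columns(values, n_bits)
    worst = 0.0
    for i in range(n_bits):
        for j in range(i + 1, n_bits):
            worst = max(worst, abs(pearson(cols[i], cols[j])))
    return worst


if __name__ == "__main__":
    # Sample sizes are chosen for speed, not to match the figure exactly.
    for n in (100, 5000):
        hashes = [hash64(s) for s in random_kmers(n, 50, seed=1)]
        print(n, max_abs_offdiag_correlation(hashes, n_bits=16))
```

Under these assumptions, the largest off-diagonal correlation for 100 hashed 50-mers is expected to be noticeably larger than for several thousand, which mirrors the contrast between the small and large sample sets in the figure. Swapping `hash64` for another hash function applies the same check to it.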