Hamid Mohamadi, Hamza Khan, Inanc Birol.
Abstract
Motivation: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to count the unique k-mers, or even better, to build a histogram of k-mer frequencies, would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task.
Year: 2017 PMID: 28453674 PMCID: PMC5408799 DOI: 10.1093/bioinformatics/btw832
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. 64-bit hash value generated by ntHash. The s left bits are used for sampling the k-mers in input datasets and the r right bits are used as resolution bits for building the reduced multiplicity table, with
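The bit layout described in the Fig. 1 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `split_hash` and `is_sampled` are hypothetical, the values of `s` and `r` are left as parameters because the caption does not state them, and the sampling rule (keep a k-mer when its `s` left bits are all zero, which retains roughly one in 2^s hash values) is an assumption.

```python
MASK64 = (1 << 64) - 1  # all 64 bits set


def split_hash(h: int, s: int, r: int) -> tuple[int, int]:
    """Return (s left bits, r right bits) of a 64-bit hash value h."""
    if not 0 <= h <= MASK64:
        raise ValueError("h must fit in 64 bits")
    if s < 0 or r < 0 or s + r > 64:
        raise ValueError("need s >= 0, r >= 0 and s + r <= 64")
    left = h >> (64 - s)          # the s most significant bits
    right = h & ((1 << r) - 1)    # the r least significant bits
    return left, right


def is_sampled(h: int, s: int) -> bool:
    """Assumed sampling rule: keep the k-mer if its s left bits are all zero."""
    left, _ = split_hash(h, s, 0)
    return left == 0
```

Under this sketch, the `r` right bits of a sampled hash would serve as the index into a table with 2^r entries.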
Dataset specification
| Dataset | Read number | Read length | Total bases | Size |
|---|---|---|---|---|
| HG004 | 868,593,056 | 250 bp | 217,148,264,000 | 480 GB |
| NA19238 | 913,959,800 | 250 bp | 228,489,950,000 | 500 GB |
| PG29 | 6,858,517,737 | 250 bp | 1,714,629,434,250 | 2.4 TB |
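The "Total bases" column is the product of read number and read length, which can be checked directly. The short sketch below also derives the number of k-mer occurrences each dataset contributes, assuming every read has exactly the listed length and contains no ambiguous bases; the helper names are hypothetical.

```python
# (read number, read length in bp, reported total bases) from the table above
DATASETS = {
    "HG004": (868_593_056, 250, 217_148_264_000),
    "NA19238": (913_959_800, 250, 228_489_950_000),
    "PG29": (6_858_517_737, 250, 1_714_629_434_250),
}


def total_bases(n_reads: int, read_len: int) -> int:
    """Total sequenced bases for fixed-length reads."""
    return n_reads * read_len


def kmer_occurrences(n_reads: int, read_len: int, k: int) -> int:
    """k-mer occurrences (not distinct k-mers): each read yields read_len - k + 1."""
    return n_reads * max(read_len - k + 1, 0)


for name, (n_reads, read_len, reported) in DATASETS.items():
    computed = total_bases(n_reads, read_len)
    status = "matches" if computed == reported else "differs from"
    print(f"{name}: computed {computed:,} {status} the reported total")
```

The count from `kmer_occurrences` is the number of items a streaming estimator has to hash, which is much larger than the number of distinct k-mers reported in the accuracy tables.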
Fig. 2. k-mer frequency histograms for human genomes HG004 and NA19238 (rows 1 and 2, respectively), and the white spruce genome PG29 (row 3). We have used DSK k-mer counting results as our ground truth in evaluation (orange circle data points). The k-mer coverage frequency results of ntCard and KmerGenie for different values of k (the four columns from left to right) are shown with the symbols (+) and (), respectively.
Accuracy of algorithms in estimating F0 and f1 for HG004 reads
| k | | DSK | ntCard | KmerGenie | KmerStream | Khmer |
|---|---|---|---|---|---|---|
| 32 | f1 | 13,319,957,567 | | 0.97% | 7.04% | – |
| | F0 | 16,539,753,749 | | 0.64% | 5.12% | 0.67% |
| 64 | f1 | 17,898,672,342 | | 0.35% | 0.73% | – |
| | F0 | 21,343,659,785 | | 0.22% | 0.66% | 0.15% |
| 96 | f1 | 18,827,062,018 | | 0.36% | 0.87% | – |
| | F0 | 22,313,944,415 | | 0.24% | 0.69% | 0.31% |
| 128 | f1 | 18,091,241,186 | | 0.76% | 0.40% | – |
| | F0 | 21,555,678,676 | | 0.25% | 0.62% | 0.30% |
The DSK column reports the exact k-mer counts, and columns for the other tools report percent errors.
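To make the two reported quantities concrete, here is a minimal sketch that counts k-mers exactly on a toy input. It uses the standard definitions, with F0 as the number of distinct k-mers and f1 as the number of k-mers that occur exactly once. The function names are hypothetical, only forward-strand k-mers are counted (no canonical or reverse-complement handling), and `percent_error` assumes the error is the absolute relative difference from the exact count, which the table note does not spell out.

```python
from collections import Counter


def kmer_histogram(reads: list[str], k: int) -> Counter:
    """Map multiplicity i -> number of distinct k-mers seen exactly i times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def distinct_kmers(hist: Counter) -> int:
    """F0: total number of distinct k-mers."""
    return sum(hist.values())


def singleton_kmers(hist: Counter) -> int:
    """f1: number of k-mers that occur exactly once."""
    return hist.get(1, 0)


def percent_error(estimate: float, exact: float) -> float:
    """Assumed error definition: absolute relative difference, in percent."""
    return 100.0 * abs(estimate - exact) / exact


toy_reads = ["ACGTACGT", "CGTAC"]
hist = kmer_histogram(toy_reads, k=4)
print("F0 =", distinct_kmers(hist), "f1 =", singleton_kmers(hist))
```

Exact counting like this keeps every distinct k-mer in memory, which is what makes it impractical at the scale of the datasets above and motivates streaming estimators.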
Fig. 3. Runtime of DSK, ntCard, KmerGenie, KmerStream and Khmer for all three datasets, HG004, NA19238 and PG29. We measured the runtime of all algorithms for different values of k. As the plots show, ntCard estimates the full k-mer coverage frequency histograms >15× faster than KmerStream.
Accuracy of algorithms in estimating F0 and f1 for NA19238 reads
| k | | DSK | ntCard | KmerGenie | KmerStream | Khmer |
|---|---|---|---|---|---|---|
| 32 | f1 | 14,881,561,565 | | 0.53% | 6.36% | – |
| | F0 | 18,091,801,391 | | 0.40% | 4.64% | 1.82% |
| 64 | f1 | 19,074,667,480 | | 0.75% | 0.68% | – |
| | F0 | 22,527,419,136 | | 0.77% | 0.65% | 1.22% |
| 96 | f1 | 19,420,503,673 | | 0.22% | 0.66% | – |
| | F0 | 22,932,238,161 | | 0.16% | 0.66% | 0.46% |
| 128 | f1 | 17,902,027,438 | | 0.21% | 0.85% | – |
| | F0 | 21,421,517,759 | | 0.13% | 0.76% | 1.05% |
The DSK column reports the exact k-mer counts, and columns for the other tools report percent errors.
Accuracy of algorithms in estimating F0 and f1 for PG29 reads
| k | | DSK | ntCard | KmerGenie | KmerStream | Khmer |
|---|---|---|---|---|---|---|
| 32 | f1 | 27,430,910,938 | | 15.33% | 9.41% | – |
| | F0 | 42,642,198,777 | | 11.02% | 7.37% | 8.86% |
| 64 | f1 | 44,344,130,469 | | 16.36% | 2.61% | – |
| | F0 | 67,800,291,613 | | 11.14% | 1.73% | 11.18% |
| 96 | f1 | 43,300,244,443 | | 17.51% | 0.73% | – |
| | F0 | 69,855,690,006 | | 11.13% | 0.57% | 9.36% |
| 128 | f1 | 32,089,613,024 | | 0.40% | 14.82% | – |
| | F0 | 58,195,246,941 | | 0.30% | 8.35% | 7.39% |
The DSK column reports the exact k-mer counts, and columns for the other tools report percent errors.