Jim Shaw, Yun William Yu.
Abstract
MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the 'lowest-ordered' k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well.
Year: 2022 PMID: 36124869 PMCID: PMC9563685 DOI: 10.1093/bioinformatics/btab790
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Simplified definitions of concepts discussed in Sections 2 and 3
| Term | Simplified definition |
|---|---|
| | Function which selects k-mers from a string based on windows of |
| | A k-mer selection method has this property if for every |
| Minimizer | |
| Word-based methods | 1-local k-mer selection methods that select a k-mer if its prefix is in a specified set |
| Open syncmer | 1-local k-mer selection method that selects a k-mer if the smallest s-mer inside the k-mer is at the |
| Closed syncmer/charged context | 1-local k-mer selection method that selects a k-mer if the smallest s-mer inside the k-mer is at the first or last position |
| Conservation | Percentage of bases in a long string |
| Spread | A k-mer selection method has this property if with high likelihood, the k-mers chosen are not too close together |
| | Probability vector of |
| | Vector of length |
| | Vector which is an upper bound on |
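
The selection rules defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes plain lexicographic order as the ordering on s-mers and k-mers (practical tools typically use a hash-based order), breaks ties by taking the leftmost smallest element, and treats the open-syncmer offset `t` as 1-based. All function names are hypothetical.

```python
def smer_argmin(kmer: str, s: int) -> int:
    """0-based offset of the smallest s-mer inside kmer (leftmost on ties).

    Assumption: lexicographic order stands in for the s-mer ordering.
    """
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    return min(range(len(smers)), key=lambda i: smers[i])


def is_open_syncmer(kmer: str, s: int, t: int) -> bool:
    """Select the k-mer if its smallest s-mer sits at 1-based offset t."""
    return smer_argmin(kmer, s) == t - 1


def is_closed_syncmer(kmer: str, s: int) -> bool:
    """Select the k-mer if its smallest s-mer is at the first or last position."""
    return smer_argmin(kmer, s) in (0, len(kmer) - s)


def open_syncmers(seq: str, k: int, s: int, t: int) -> list[int]:
    """Start positions of all k-mers in seq that are open syncmers."""
    return [i for i in range(len(seq) - k + 1)
            if is_open_syncmer(seq[i:i + k], s, t)]


def minimizers(seq: str, k: int, w: int) -> list[int]:
    """Start positions of the smallest k-mer in each window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(range(start, start + w), key=lambda i: kmers[i]))
    return sorted(picked)


def density(selected: list[int], seq: str, k: int) -> float:
    """Fraction of the k-mers in seq that were selected."""
    return len(selected) / (len(seq) - k + 1)
```

Note that `is_open_syncmer` and `is_closed_syncmer` look only at the k-mer itself, which is what makes these methods 1-local, whereas `minimizers` needs a window of `w` consecutive k-mers to make each decision.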
Fig. 2. Three examples of mutation configurations. Circles represent bases around i, while black and white indicate mutated or unmutated bases, respectively. Boxes indicate an unmutated k-mer overlapping position i
Properties of discussed methods
| Method (parameters) | ( | Density | ( |
|---|---|---|---|
| Random minimizer ( | | | |
| Miniception | | | |
| Open syncmer ( | 1 | | |
| Closed syncmer ( | 1 | | |
| Words-based method ( | 1 | depends on | |
| ( | 1 | = | |
Note: The ∼ sign denotes up to a small error term. means that there is no window guarantee. Words-based methods and open syncmers may have a window guarantee for some parameters (e.g. t = 1 for open syncmers), but usually do not.
Fig. 3. Fraction of upper bound achieved for conservation. 95% confidence intervals are built from 100 simulations for methods with empirically deduced values. Note that some methods have different densities due to parameter constraints; this is mentioned in the labels
Fig. 4. Histogram of chaining scores corresponding to alignments of reads for E. coli W (bc1087) against its assembly. Minimap2 with open syncmers and minimap2 with minimizers were compared against each other, with parameters chosen so that density is fixed at 1/5. Mean chaining scores are given in the legend. t = 3 is the optimal value of t by Theorem 8
Open syncmer versus minimizer mappings for two versions of minimap2 on long-reads. We fix parameters so the density is 1/5 for both seeding methods
| Long-read dataset | Reference | OS \ M (avg. mapQ) | M \ OS (avg. mapQ) | OS | M | Total no. of reads | % increase in mapped reads |
|---|---|---|---|---|---|---|---|
| | | 312 (21.56) | 102 (11.57) | 194 455 | 194 245 | 196 901 | 0.108 |
| | | 548 (20.30) | 187 (11.33) | 220 459 | 220 098 | 226 906 | 0.164 |
| | | 11 434 (19.80) | 3679 (12.10) | 143 724 | 135 969 | 251 838 | 5.70 |
| Downsampled human ONT (rel3) | Human (CHM13) | 370 (3.53) | 103 (2.52) | 37 819 | 37 552 | 51 210 | 0.711 |
| Downsampled human ONT (rel3) | Mouse (GRCm38) | 2467 (2.32) | 1005 (1.90) | 19 214 | 17 752 | 51 210 | 8.23 |
Note: OS is the subset of reads successfully mapped using open syncmers, and M similarly for minimizers. OS \ M is the set of reads which are uniquely mapped by open syncmers, and M \ OS is the set of reads uniquely mapped by minimizers. The average mapQ output by minimap2 within each of these two sets is given in parentheses (Section 5.2).
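
The relationship between the columns of this table is not spelled out here, but the numbers are consistent with the following reading, given as a hedged sketch: the difference between the reads mapped only by open syncmers and the reads mapped only by minimizers equals the difference between the two mapped-read totals, and the last column is that difference relative to the minimizer total. The function and variable names are hypothetical, and the formula is an inference from the reported values, not a definition quoted from the paper.

```python
def percent_increase(n_os: int, n_m: int) -> float:
    """Percentage increase in mapped reads going from minimizers (M) to open syncmers (OS).

    Assumption: the table's last column is 100 * (|OS| - |M|) / |M|.
    """
    return 100.0 * (n_os - n_m) / n_m


# (reads only in OS, reads only in M, |OS|, |M|, reported % increase),
# copied row by row from the table.
ROWS = [
    (312, 102, 194_455, 194_245, 0.108),
    (548, 187, 220_459, 220_098, 0.164),
    (11_434, 3_679, 143_724, 135_969, 5.70),
    (370, 103, 37_819, 37_552, 0.711),
    (2_467, 1_005, 19_214, 17_752, 8.23),
]


def rows_are_consistent(rows, tol: float = 0.01) -> bool:
    """Check both identities for every row, within a rounding tolerance."""
    for only_os, only_m, n_os, n_m, reported in rows:
        if only_os - only_m != n_os - n_m:
            return False
        if abs(percent_increase(n_os, n_m) - reported) > tol:
            return False
    return True
```

Calling `rows_are_consistent(ROWS)` confirms that every row satisfies both identities to within rounding of the reported percentages.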
Open syncmer versus minimizer mapping statistics as a function of read length for the rel3 ONT read set mapping onto CHM13
| Human ONT (rel3) read lengths | OS \ M (avg. mapQ) | M \ OS (avg. mapQ) | OS | M | Total no. of reads | % change in mapped reads |
|---|---|---|---|---|---|---|
| 100-1000 bp | 546 (4.25) | 146 (2.96) | 6068 | 5668 | 44 397 | 7.05% |
| 1000-2000 bp | 165 (3.72) | 38 (1.71) | 3277 | 3150 | 10 097 | 4.03% |
| 2000-3000 bp | 53 (3.70) | 14 (5.71) | 1840 | 1801 | 4151 | 2.16% |
| >3000 bp | 152 (3.27) | 51 (2.25) | 32 706 | 32 605 | 36 968 | 0.31% |
Note: OS is the set of mapped reads with open syncmers, M is the set of mapped reads with minimizers. Values in parentheses indicate average mapping qualities calculated by minimap2.
Fig. 5. Sensitivity and precision investigation of simulated ONT cDNA data. Mapped reads were classified into successes and errors based on the true transcript location. We repeated the experiment 9 times to get a 95% confidence interval. 18 104 reads were generated for each parameterization. The two left plots show that reducing k or switching to syncmers improves alignment quality. The top right plot shows the success-error curves and the bottom right plot shows the CPU time taken for each method
Fig. 6. Errors versus time taken for each method. Each point is for a specific value of k. Syncmers provide different parameterizations when modifying density and k. Parameters close to the bottom-left are well-performing