Christopher Pockrandt1,2,3,4, Mai Alzamel5,6, Costas S Iliopoulos5, Knut Reinert3,4.
Abstract
MOTIVATION: Computing the uniqueness of k-mers for each position of a genome while allowing for up to e mismatches is computationally challenging. However, it is crucial for many biological applications such as the design of guide RNA for CRISPR experiments. More formally, the uniqueness or (k, e)-mappability of a position is the reciprocal of the number of times the k-mer starting at that position occurs in the genome approximately, i.e. with up to e mismatches.
Year: 2020 PMID: 32246826 PMCID: PMC7320602 DOI: 10.1093/bioinformatics/btaa222
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. (k, e)-frequency vectors F for k=4 and e ∈ {0, 1} on the same sequence. A frequency of 1 indicates that the k-mer starting at that position in the text is unique in the entire sequence, without errors in (a) and with up to one mismatch in (b). (a) (4, 0)-frequency and (b) (4, 1)-frequency
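The definition of the (k, e)-frequency and its reciprocal, the (k, e)-mappability, can be made concrete with a brute-force sketch. This is not the indexed algorithm the paper describes: it compares every pair of k-mers directly, considers the forward strand only, and uses a toy sequence that is not the one shown in the figure. The function names are illustrative assumptions.

```python
def within_mismatches(a: str, b: str, e: int) -> bool:
    """Return True if equal-length strings a and b differ in at most e positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > e:
                return False
    return True


def frequency(text: str, k: int, e: int) -> list[int]:
    """Naive (k, e)-frequency: for each start position, count the k-mers of
    `text` (including the k-mer itself) that match with up to e mismatches."""
    kmers = [text[i:i + k] for i in range(len(text) - k + 1)]
    return [sum(within_mismatches(q, t, e) for t in kmers) for q in kmers]


def mappability(text: str, k: int, e: int) -> list[float]:
    """(k, e)-mappability: the reciprocal of the (k, e)-frequency."""
    return [1.0 / f for f in frequency(text, k, e)]


if __name__ == "__main__":
    toy = "ACGTACGA"  # hypothetical example sequence
    print(frequency(toy, 4, 0))    # [1, 1, 1, 1, 1]
    print(frequency(toy, 4, 1))    # [2, 1, 1, 1, 2]
    print(mappability(toy, 4, 1))  # [0.5, 1.0, 1.0, 1.0, 0.5]
```

The quadratic pairwise comparison is only workable for short toy inputs; it is meant to pin down the definition, not to suggest how a genome-scale tool would compute it.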
Fig. 2. The optimum search scheme for two mismatches consists of three searches with four pieces each. The arrows indicate the order in which the pieces are searched. The error bounds below each piece are cumulative: the lower bound is the minimum number of errors that must have been spent, and the upper bound the maximum number of errors that may have been spent, by the end of the corresponding piece. Each search is written as S=(piece order, cumulative lower bounds, cumulative upper bounds). Illustrated for searching the 8-mer CGTACAAG. The forward search covers the error distributions 0010, 0011, 0020; the backward search covers 2000, 1100, 0200, 1010, 0110; and the bidirectional search covers 0000, 0001, 0002, 1000, 1001, 0100, 0101. (a) Forward search: S=(1234, 0011, 0022); (b) backward search: S=(4321, 0002, 0122) and (c) bidirectional search: S=(3214, 0000, 0112)
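The coverage claim in this caption can be checked mechanically. The sketch below enumerates every way of distributing at most two errors over four pieces and tests which of the three searches accepts it, reading each triple as (piece order, cumulative lower bounds, cumulative upper bounds). The encoding and function names are assumptions made for illustration; only the three triples are taken from the caption.

```python
from itertools import product

# (piece order, cumulative lower bounds, cumulative upper bounds), from the caption
SEARCHES = {
    "forward": ((1, 2, 3, 4), (0, 0, 1, 1), (0, 0, 2, 2)),
    "backward": ((4, 3, 2, 1), (0, 0, 0, 2), (0, 1, 2, 2)),
    "bidirectional": ((3, 2, 1, 4), (0, 0, 0, 0), (0, 1, 1, 2)),
}


def covers(search, distribution) -> bool:
    """True if the cumulative error count stays within the bounds after every
    piece, visiting pieces in the order given by the search."""
    order, lower, upper = search
    spent = 0
    for piece, lo, hi in zip(order, lower, upper):
        spent += distribution[piece - 1]  # pieces are numbered from 1
        if not lo <= spent <= hi:
            return False
    return True


def coverage(pieces: int = 4, max_errors: int = 2) -> dict:
    """Map each error distribution with at most max_errors errors to the
    names of the searches that cover it."""
    result = {}
    for dist in product(range(max_errors + 1), repeat=pieces):
        if sum(dist) <= max_errors:
            result[dist] = [name for name, s in SEARCHES.items() if covers(s, dist)]
    return result


if __name__ == "__main__":
    for dist, names in sorted(coverage().items()):
        print("".join(map(str, dist)), names)
```

Running it lists 15 distributions, each covered by exactly one search, which matches the three groups of 3, 5 and 7 distributions named in the caption.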
Fig. 3. Searching s overlapping k-mers by applying optimum search schemes to their common infix and extending the matches using backtracking. Illustrated for k=11 and s=4. (a) First, the common overlap (light gray) is searched using optimum search schemes. Second, the search of T1 and T2 is continued recursively by extending the previously identified approximate matches of the infix in the index by GC to the left (allowing for the remaining number of errors; medium gray). T1 and T2 are then retrieved separately by backtracking in the index by one character to the left and one character to the right, respectively (allowing for an error, if any are left; dark gray). T3 and T4 are extended analogously in a recursive manner. (b) The same strategy presented as a backtracking tree. It is traversed for all occurrences reported by the search of the infix T[4, 11] using optimum search schemes. Each edge also has to account for remaining errors, i.e. approximate string matching is performed using backtracking
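The interval arithmetic behind this caption is short enough to spell out. The sketch below computes the infix shared by s consecutive k-mers and how many characters each k-mer still needs to the left and right of it, using 1-based inclusive positions as in T[4, 11]. It covers only the bookkeeping, not the index search or the backtracking itself, and the function names are illustrative assumptions.

```python
def common_infix(start: int, k: int, s: int) -> tuple[int, int]:
    """1-based inclusive interval shared by the s k-mers that start at
    positions start, start + 1, ..., start + s - 1."""
    if not 1 <= s <= k:
        raise ValueError("need 1 <= s <= k for the k-mers to share an infix")
    return start + s - 1, start + k - 1


def extensions(start: int, k: int, s: int) -> list[tuple[int, int]]:
    """For each of the s k-mers, the number of characters it extends the
    common infix to the left and to the right."""
    lo, hi = common_infix(start, k, s)
    return [(lo - i, (i + k - 1) - hi) for i in range(start, start + s)]


if __name__ == "__main__":
    print(common_infix(1, 11, 4))  # (4, 11)
    print(extensions(1, 11, 4))    # [(3, 0), (2, 1), (1, 2), (0, 3)]
```

For k=11 and s=4 the shared infix has k - s + 1 = 8 characters; T1 and T2 both need at least two more characters on the left, which corresponds to the shared extension step in the caption before the two searches diverge.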
Table 1. Running times for computing the frequency of the human genome (GRCh38) using 16 threads
Instances are taken from the experiments by

| Tool | (36, 0) | (24, 1) | (36, 2) | (50, 2) | (75, 3) |
|---|---|---|---|---|---|
| GEM exact | 5 h 10 m | N/A | N/A | N/A | N/A |
| GEM heuristic | 23 m | N/A | 7 h 11 m | 5 h 50 m | 4 h 26 m |
| GenMap | 3 m | 23 m | 1 h 19 m | 42 m | 1 h 27 m |

Typical Illumina read length with growing number of mismatches

| Tool | (101, 0) | (101, 1) | (101, 2) | (101, 3) | (101, 4) |
|---|---|---|---|---|---|
| GEM exact | 44 m | 7 h 28 m | 7 h 34 m | 7 h 45 m | 8 h 8 m |
| GEM heuristic | 28 m | 2 h 40 m | 3 h 17 m | 3 h 31 m | 3 h 49 m |
| GenMap | 2 m | 7 m | 17 m | 46 m | 2 h 42 m |
Note: Timeouts of 1 day are represented as N/A.
Table 2. (30, 2)-mappability on four strains of E. coli assigned to the phylogenetic group B1 based on the known marker genes by Clermont et al.
| Strain | All: unique | All: pseudo | All: distance (mean ± SD) | Non-adjacent: unique | Non-adjacent: pseudo | Non-adjacent: distance (mean ± SD) |
|---|---|---|---|---|---|---|
| IAI1 | 171 942 | 4992 | 27 ± 627 | 1829 | 81 | 2476 ± 5560 |
| SE11 | 305 439 | 10 365 | 15 ± 447 | 2356 | 176 | 1942 ± 4708 |
| 11128 | 260 305 | 40 101 | 20 ± 953 | 2494 | 685 | 2049 ± 9517 |
| 11368 | 434 033 | 108 968 | 13 ± 912 | 3142 | 1116 | 1674 ± 10 592 |
Note: We computed the mean distance of the unique marker sequences and their standard deviation.
Fig. 4. Illustration of the experiments performed on E. coli sequences in Tables 2 and 3. (a) Four strains belonging to the same phylogenetic group. The sequence in light gray is conserved within this group and serves as a marker sequence. The light gray k-mers belonging to this marker sequence are also all found in the other strains. The k-mers in dark gray are unique among all four strains and allow each of the strains to be distinguished. (b) Six sequences belonging to two different phylogenetic groups. Marker sequences are highlighted in light and dark gray. They only occur in one of the groups and are present in all of its strains
Table 3. (30, 2)-mappability on six strains of E. coli of the groups A and B1
| Group | Strain | All: unique | All: distance (mean ± SD) | Non-adjacent: unique |
|---|---|---|---|---|
| A | W3110 | 109 375 | 41 ± 731 | 2398 |
| A | HS | 111 179 | 39 ± 709 | 2414 |
| B1 | IAI1 | 125 042 | 37 ± 680 | 3063 |
| B1 | SE11 | 127 302 | 38 ± 690 | 3123 |
| B1 | 11128 | 121 325 | 42 ± 766 | 3275 |
| B1 | 11368 | 131 121 | 41 ± 814 | 3473 |
Note: Only k-mers that perfectly separated the strains in A from those in B1 were counted, i.e. a k-mer was counted if and only if it matched all strains of A and no strain of B1, or vice versa.
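The separation criterion in this note can be illustrated with plain set operations. The sketch below is a simplification and not the procedure used for the table: it matches k-mers exactly rather than with up to two mismatches, ignores the reverse strand, and runs on short made-up sequences. All names are illustrative assumptions.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """All distinct k-mers of a sequence (exact matching, forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def separating_kmers(group_a: list[str], group_b: list[str], k: int):
    """Return (markers_a, markers_b): k-mers present in every strain of one
    group and absent from every strain of the other group."""
    sets_a = [kmer_set(s, k) for s in group_a]
    sets_b = [kmer_set(s, k) for s in group_b]
    in_all_a = set.intersection(*sets_a)
    in_all_b = set.intersection(*sets_b)
    in_any_a = set.union(*sets_a)
    in_any_b = set.union(*sets_b)
    return in_all_a - in_any_b, in_all_b - in_any_a


if __name__ == "__main__":
    # Hypothetical toy "strains"; real genomes would need an index, not Python sets.
    group_a = ["ACGTAC", "TACGTA"]
    group_b = ["GGGTTT", "CGGGTT"]
    markers_a, markers_b = separating_kmers(group_a, group_b, k=3)
    print(sorted(markers_a))  # ['ACG', 'CGT', 'GTA', 'TAC']
    print(sorted(markers_b))  # ['GGG', 'GGT', 'GTT']
```

A k-mer such as TTT, which occurs in only one strain of the second group, is excluded from both result sets because it is not present in all strains of its own group.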