Martin C Frith, Laurent Noé, Gregory Kucherov.
Abstract
MOTIVATION: Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.
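As a minimal illustration of sparse seeding with exact-match seeds, the sketch below indexes a reference sequence only at every nth position and then looks up every position of a query. The function names (`every_nth_positions`, `seed_hits`) and the toy sequences are hypothetical and chosen for this example; they are not taken from the paper's software.

```python
from collections import defaultdict

def every_nth_positions(seq, n):
    """Sparse subset of start positions: 0, n, 2n, ..."""
    return range(0, len(seq), n)

def seed_hits(query, ref, seed_len, ref_positions):
    """Return (query_pos, ref_pos) pairs where an exact match of
    length seed_len starts, using only the given reference positions."""
    index = defaultdict(list)
    for j in ref_positions:
        if j + seed_len <= len(ref):           # skip seeds that run off the end
            index[ref[j:j + seed_len]].append(j)
    hits = []
    for i in range(len(query) - seed_len + 1):  # every query position is tried
        for j in index.get(query[i:i + seed_len], ()):
            hits.append((i, j))
    return hits

ref = "acgtacgtacgt"
query = "gtacg"
dense = seed_hits(query, ref, 4, every_nth_positions(ref, 1))
sparse = seed_hits(query, ref, 4, every_nth_positions(ref, 2))
print(len(dense), len(sparse))
```

Indexing only a subset of positions shrinks the index and the hit list, at the cost of possibly missing some similarities; this is the trade-off that the sensitivity and hit-count figures quantify.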
Year: 2020 PMID: 33346833 PMCID: PMC8016470 DOI: 10.1093/bioinformatics/btaa1054
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Parameters of the T92 DNA model
| PAM | κ | %g + c | %identity | Transitions per transversion |
|---|---|---|---|---|
| 20 | 1 | 50 | 82 | 0.5 |
| 50 | 1 | 50 | 64 | 0.5 |
| 20 | 3 | 50 | 83 | 1.4 |
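The three parameter rows can be checked numerically. The sketch below rests on assumptions that are not stated in the text above: that PAM means expected substitutions per 100 sites, that the column holding 1 or 3 is the transition/transversion rate ratio κ, and that with 50% g + c the T92 model reduces to Kimura's two-parameter model. The function name `k80_summary` is illustrative only.

```python
import math

def k80_summary(pam, kappa):
    """Percent identity and transitions per transversion under Kimura's
    two-parameter model (assumed equivalent to T92 at 50% g + c)."""
    d = pam / 100.0                  # assumed: substitutions per site
    beta_t = d / (kappa + 2.0)       # each of the two transversion rates, times t
    alpha_t = kappa * beta_t         # transition rate, times t
    e_tv = math.exp(-4.0 * beta_t)
    e_mix = math.exp(-2.0 * (alpha_t + beta_t))
    p_ts = 0.25 + 0.25 * e_tv - 0.5 * e_mix   # P(site differs by a transition)
    p_tv = 0.5 - 0.5 * e_tv                   # P(site differs by a transversion)
    return 100.0 * (1.0 - p_ts - p_tv), p_ts / p_tv

for pam, kappa in [(20, 1), (50, 1), (20, 3)]:
    identity, ratio = k80_summary(pam, kappa)
    print(pam, kappa, round(identity), round(ratio, 1))
```

Under these assumptions the rounded outputs agree with the %identity and transitions-per-transversion columns of the table.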
Non-overlapping DNA words
| Word length | Constructed words | Number constructed | Maximum number |
|---|---|---|---|
| 2 | ry | 4 | 4 |
| 3 | abb | 9 | 9 |
| 4 | abbb | 27 | 27 |
| 5 | abbbb | 81 | 81 |
| 6 | abbbbb | 243 | 251 |
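The "Number" column can be reproduced with a short sketch. It assumes a reading of the word patterns that is inferred from the counts rather than stated in the text above: `a` is one fixed letter and `b` is any of the other three letters, while `r` is a purine (a or g) and `y` a pyrimidine (c or t). The helper names are hypothetical, and the overlap check assumes all words have the same length.

```python
from itertools import product

def abb_words(length, first="a", rest="cgt"):
    """Words of the form a b b ... b: one fixed letter, then other letters."""
    return [first + "".join(tail) for tail in product(rest, repeat=length - 1)]

def ry_words():
    """Words of the form ry: purine followed by pyrimidine."""
    return [r + y for r in "ag" for y in "ct"]

def is_non_overlapping(words):
    """True if no proper suffix of any word equals a prefix of any word
    (equal-length words assumed), so two occurrences can never overlap."""
    words = list(words)
    for u in words:
        for v in words:
            for k in range(1, min(len(u), len(v))):
                if u[-k:] == v[:k]:
                    return False
    return True

print(len(ry_words()), [len(abb_words(n)) for n in range(3, 7)])
print(is_non_overlapping(ry_words()), is_non_overlapping(abb_words(4)))
```

With this reading the construction gives 3^(length − 1) words, which matches the constructed counts in the table; for length 6 the table reports a larger maximum (251) than this construction reaches (243).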
Fig. 1. Sensitivity (y-axis) and spurious hit count (x-axis) for exact-match seeds with every-nth sparsity. Sensitivity was measured on sequence pairs with PAM distance 20 (left panel) or 50 (right panel). Seed lengths 5–14 were tested, shown in gray in the left panel.
Fig. 2. Sensitivity (y-axis) and spurious hit count (x-axis) for exact-match seeds with every-nth or word-based sparsity (A, B, C, D). Sensitivity was measured on sequence pairs with PAM distance 20. Seed lengths 5–14 were tested, as shown in (A).
Variance-to-mean ratios
| Words | VMR1 | VMR2 |
|---|---|---|
| ry | | |
| ryn | | |
| rynn | 0.375 | |
| rrry, ryrr, ryyr, yyyr | | |
| rrrry, rryrr, ryryr, ryyrr, ryyry, ryyyy, yyyrr, yyyry | | |
| rrrrry, rryrry, rryryy, ryrrrr, ryrrry, ryryry, ryyrrr, ryyrry, ryyryr, ryyryy, ryyyry, ryyyyy, yryrry, yyyrrr, yyyrry, yyyyry | | |
| | | |
| ryy | | |
| rrrry, yrrry, yrryy, yryyy | | |
| rrrrry, yrrrry, yrrryr, yrrryy, yrryry, yrryyy, yyryry, yyryyy | | |
| rrryrrr, rrryryr, rryrryr, rryyrrr, rryyrry, rryyryr, rryyyrr, rryyyyr, ryryyrr, ryyyryr, ryyyyyr, ryyyyyy, yryyryr, yryyyrr, yryyyyr, yyryyrr | | |
| rrrrrrry, rryrrryy, ryrrrryr, ryrrrryy, ryrrryry, yrrrrrry, yrrrrryr, yrrrrryy, yrrryrry, yrryrryr, yrryrryy, yryrrryy, yryrryry, yryrryyr, yryrryyy, yryryryy, yyrrrryr, yyrrrryy, yyrrryry, yyrrryyr, yyrrryyy, yyrryryr, yyrryryy, yyrryyry, yyrryyyr, yyrryyyy, yyryryyr, yyryryyy, yyryyryy, yyyryyyr, yyyryyyy, yyyyyyyr | 0.151 | 0.281 |
Note: Bold values are known to be the minimum possible, for that sparsity and word length.
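To make the notion of a variance-to-mean ratio concrete, the sketch below computes one exactly for the number of word occurrences in a uniformly random r/y sequence of fixed length, by enumerating all such sequences. This is a generic illustration only: the precise definitions of VMR1 and VMR2 are not given in the text above, so the function `vmr_exact` is a hypothetical helper and is not claimed to reproduce the table's values.

```python
from fractions import Fraction
from itertools import product

def count_occurrences(seq, words):
    """Number of positions in seq where any word of the set starts."""
    return sum(seq.startswith(w, i) for i in range(len(seq)) for w in words)

def vmr_exact(words, seq_len, alphabet="ry"):
    """Exact variance/mean of the occurrence count over all sequences of
    length seq_len drawn uniformly from the given alphabet."""
    counts = [count_occurrences("".join(s), words)
              for s in product(alphabet, repeat=seq_len)]
    n = len(counts)
    mean = Fraction(sum(counts), n)
    var = Fraction(sum(c * c for c in counts), n) - mean ** 2
    return var / mean

print(vmr_exact(["ry"], 10), vmr_exact(["rr"], 10))
```

In this toy setting the non-overlapping word `ry` has a lower ratio than the self-overlapping word `rr`, because occurrences of `ry` cannot cluster at adjacent positions.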
Fig. 3. Sensitivity (y-axis) and spurious hit count (x-axis) for exact-match seeds with word-based sparsity (A, B, C, D). Sensitivity was measured on sequence pairs with PAM distance 20. Seed lengths 5–14 were tested, as shown in (C). In this figure, the sensitivity is shown relative to every-nth sparsity: (% of related sequence pairs found by word-restricted seeds)/(% of related sequence pairs found by every-nth seeds).
Fig. 4. Sparsity of minimizers, with three orderings. Red line: sparsity of abb words. Blue line: sparsity of abbb words. The diagonal gray line in (A), and the horizontal gray line in (B), show the expected minimizer sparsity.
Fig. 5. Sensitivity (y-axis) and spurious hit count (x-axis) for exact-match seeds at minimizer positions. ‘w’ means window length. Seed lengths 5–14 were tested, shown in gray in the left panel.
Fig. 6. Sensitivity (y-axis) and spurious hit count (x-axis) for exact-match seeds at either word positions or minimizer positions. Seed lengths 5–14 were tested. Sensitivity was measured on sequence pairs with PAM distance 20 (A, C) or 50 (B, D).
Fig. 7. Sensitivity (y-axis) at different evolutionary distances (x-axis), for minimap seeds and word-based seeds. Here, ‘sensitivity’ is the average number of conserved seeds over 1000 pairs of length-1000 sequences from human chromosome 22.
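For readers unfamiliar with minimizers, the sketch below selects minimizer positions using the commonly used definition: in every window of `w` consecutive k-mers, keep the start position of the smallest k-mer under some ordering, breaking ties by the leftmost position. Whether this matches the exact variant and orderings used for the figures is an assumption; the function name and the optional `key` parameter are hypothetical.

```python
def minimizer_positions(seq, k, w, key=None):
    """Start positions of k-mers that are minimal in at least one window
    of w consecutive k-mers. `key` maps a k-mer to its rank in the chosen
    ordering; by default plain lexicographic order is used."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    rank = key if key is not None else (lambda kmer: kmer)
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # ties are broken by position, so the leftmost minimum wins
        best = min(window, key=lambda i: (rank(kmers[i]), i))
        chosen.add(best)
    return sorted(chosen)

positions = minimizer_positions("gattacagatta", k=3, w=4)
print(positions)
```

Passing a different `key` changes the ordering and therefore which positions are kept, which is the kind of variation compared across orderings in Fig. 4.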
Seed patterns designed by Iedera for PAM 20, κ = 3, alignment length 64
| Weight | Pattern |
|---|---|
| 5 | RYNN@@ |
| 6 | RY@@@@NN |
| 7 | RYN@@@@NN |
| 8 | RYN@@@@NNN |
| 9 | RYN@@@@@@NNN |
| 10 | RYN@@@nnNN@@@NN |
| 11 | RYN@@@nnNN@@@NNN |
| 12 | RYN@@@@NNnn@@@@NNN |
| 13 | RYN@@@@NN@nn@@@NNNN |
| 14 | RYN@@@@NN@nn@@@@NNNN@ |
| | |
| 5 | RYY@@@@ |
| 6 | RYYN@@@@ |
| 7 | RYYN@@@@@@ |
| 8 | RYYNN@@@@@@ |
| 9 | RYY@@@@@@@@NN |
| 10 | RYY@@@@@@@@NNN |
| 11 | RYYN@@@@@@@@NNN |
| 12 | RYYN@@@@@@@@@@NNN |
| 13 | RYYN@@@@@@@@@@NNNN |
| 14 | RYYN@@@@@@@@@@NNNNN |
Fig. 8. Sensitivity (y-axis) and random hit count (x-axis) of seeding methods, for sequences with transition/transversion bias (κ = 3) and PAM distance 20. Seed weights 5–14 were tested. ‘Transition seeds’ allow transition substitutions at all positions.