Mireille Régnier, Philippe Chassignet.
Abstract
Repetitive patterns in genomic sequences have great biological significance as well as algorithmic implications. Analytic combinatorics makes it possible to derive formulas for the expected length of repetitions in a random sequence. The asymptotic results, which generalize previous work on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.
Keywords: K-mers; combinatorics; probability
Year: 2016 PMID: 27376057 PMCID: PMC4896921 DOI: 10.3389/fbioe.2016.00035
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
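The abstract mentions simulations on random sequences. A minimal Python sketch of such a simulation is given below. It assumes that a "profile" records, for each length k, the number of distinct k-mers occurring at least twice; this definition is an assumption made for illustration, and the paper's exact definition may differ. The function names `random_sequence` and `repeated_kmer_profile` are hypothetical.

```python
import random
from collections import Counter


def random_sequence(length, alphabet="01", weights=None, seed=None):
    """Draw an i.i.d. random sequence over `alphabet` (uniform if weights is None)."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, weights=weights, k=length))


def repeated_kmer_profile(seq, k_min, k_max):
    """For each k in [k_min, k_max], count the distinct k-mers seen at least twice.

    Assumption: this is only one plausible reading of a repetition profile.
    """
    profile = {}
    for k in range(k_min, k_max + 1):
        # Count every (overlapping) window of length k.
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        profile[k] = sum(1 for c in counts.values() if c >= 2)
    return profile


# Example: one random binary sequence; average over many seeds for a mean profile.
seq = random_sequence(10_000, alphabet="01", seed=0)
profile = repeated_kmer_profile(seq, 8, 30)
```

Counting with a hash table keeps the sketch short; for genome-scale inputs a suffix array or suffix tree would likely be a better fit, but that choice is outside what the record states.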
Mean profile for 100 random binary sequences.
| Length | Observed | Predicted (term 1) | Predicted (term 2) | Predicted (total) | Observed | Asymptotic | | |
|---|---|---|---|---|---|---|---|---|
| 11 | 0.29 | 0 | 0.3 | 0.3 | −0.0803 | |||
| 12 | 7.91 | 0 | 8.3 | 8.3 | 0.1341 | |||
| 13 | 87.87 | 0.1 | 86.9 | 87.1 | 0.2902 | 0.0843 | ||
| 14 | 552.88 | 1.2 | 550.3 | 551.5 | 0.4094 | 0.3340 | ||
| 15 | 2456.77 | 86.6 | 2366.4 | 2453.0 | 0.5061 | 0.4962 | ||
| 16 | 8269.20 | 209.4 | 8069.1 | 8278.5 | 0.5848 | 0.6181 | ||
| 17 | 22516.20 | 406.1 | 22097.7 | 22503.8 | 0.6497 | 0.7136 | ||
| 18 | 51085.15 | 4823.8 | 46267.2 | 51091.0 | 0.7028 | 0.7897 | ||
| 19 | 99387.01 | 6636.1 | 92717.6 | 99353.7 | 0.7460 | 0.8504 | ||
| 20 | 169303.03 | 37415.5 | 131882.6 | 169298.1 | 0.7805 | 0.8984 | ||
| 21 | 256358.10 | 42003.9 | 214454.4 | 256458.3 | 0.8074 | 0.9357 | ||
| 22 | 349801.23 | 137615.9 | 212264.2 | 349880.1 | 0.8276 | 0.9635 | ||
| 23 | 434625.83 | 134807.6 | 299824.7 | 434632.4 | 0.8416 | 0.9830 | ||
| 24 | 495572.93 | 122283.1 | 373279.8 | 495562.8 | 0.8501 | 0.9949 | ||
| 25 | 522788.19 | 255284.4 | 267476.3 | 522760.7 | 0.8536 | 0.9998 | ||
| 26 | 513374.76 | 211204.2 | 302252.5 | 513456.7 | 0.8524 | 0.9982 | ||
| 27 | 472126.51 | 315154.7 | 157087.0 | 472241.6 | 0.8470 | 0.9906 | ||
| 28 | 408946.76 | 242583.4 | 166360.3 | 408943.7 | 0.8377 | 0.9772 | ||
| 29 | 335080.05 | 273441.0 | 61579.7 | 335020.7 | 0.8248 | 0.9582 | ||
| 30 | 260999.29 | 198163.4 | 62712.5 | 260875.9 | 0.8086 | 0.9339 | ||
| 31 | 194100.36 | 137502.0 | 56463.1 | 193965.1 | 0.7894 | 0.9043 | ||
| 32 | 138437.13 | 122218.3 | 16090.9 | 138309.2 | 0.7675 | 0.8699 | ||
| 33 | 95017.33 | 80937.1 | 14067.8 | 95004.9 | 0.7431 | 0.8346 | ||
| 34 | 63082.67 | 60397.1 | 2744.6 | 63141.7 | 0.7165 | 0.7993 | ||
| 35 | 40742.97 | 38411.9 | 2368.9 | 40780.8 | 0.6882 | 0.7639 | ||
| 36 | 25679.21 | 23888.2 | 1817.4 | 25705.6 | 0.6582 | 0.7286 | ||
| 37 | 15860.59 | 15622.9 | 255.8 | 15878.7 | 0.6270 | 0.6933 | ||
| 38 | 9645.84 | 9455.0 | 194.2 | 9649.2 | 0.5948 | 0.6580 | ||
| 39 | 5791.32 | 5772.7 | 15.9 | 5788.6 | 0.5617 | 0.6227 | ||
| 40 | 3433.87 | 3426.4 | 12.1 | 3438.5 | 0.5278 | 0.5874 | ||
| 41 | 2032.57 | 2027.2 | 0.4 | 2027.6 | 0.4938 | 0.5520 | ||
| 42 | 1188.84 | 1189.0 | 0.3 | 1189.3 | 0.4590 | 0.5167 | ||
| 43 | 692.28 | 694.8 | 0.2 | 695.0 | 0.4240 | 0.4814 | ||
| 44 | 402.75 | 405.1 | 0 | 405.1 | 0.3889 | 0.4461 | ||
| 45 | 233.35 | 235.7 | 0 | 235.7 | 0.3535 | 0.4108 | ||
| 46 | 135.42 | 137.0 | 0 | 137.0 | 0.3182 | 0.3755 | ||
| 47 | 78.39 | 79.6 | 0 | 79.6 | 0.2828 | 0.3401 | ||
| 48 | 44.69 | 46.2 | 0 | 46.2 | 0.2463 | 0.3048 | ||
| 49 | 25.35 | 26.8 | 0 | 26.8 | 0.2096 | 0.2695 | ||
| 50 | 14.57 | 15.6 | 0 | 15.6 | 0.1737 | 0.2342 | ||
| 51 | 8.44 | 9.0 | 0 | 9.0 | 0.1383 | 0.1989 | ||
| 52 | 4.76 | 5.2 | 0 | 5.2 | 0.1012 | 0.1636 | ||
| 53 | 2.76 | 3.0 | 0 | 3.0 | 0.0658 | 0.1282 | ||
| 54 | 1.74 | 1.8 | 0 | 1.8 | 0.0359 | 0.0929 | ||
| 55 | 1.02 | 1.0 | 0 | 1.0 | 0.0013 | 0.0576 | ||
| 56 | 0.64 | 0.6 | 0 | 0.6 | −0.0289 | 0.0223 | − | |
| 57 | 0.32 | 0.3 | 0 | 0.3 | −0.0739 | −0.0130 | ||
| 58 | 0.18 | 0.2 | 0 | 0.2 | −0.1112 | −0.0483 | ||
| 59 | 0.16 | 0.1 | 0 | 0.1 | −0.1188 | −0.0836 | ||
| 60 | 0.12 | 0.07 | 0 | 0.07 | −0.1375 | −0.1190 | ||
| 61 | 0.08 | 0.04 | 0 | 0.04 | −0.1637 | −0.1543 | ||
| 62 | 0.06 | 0.02 | 0 | 0.02 | −0.1824 | −0.1896 | ||
| 63 | 0.04 | 0.01 | 0 | 0.01 | −0.2087 | −0.2249 | ||
| 64 | 0.04 | 0.008 | 0 | 0.008 | −0.2087 | −0.2602 | ||
Profile for the sequence from .
| Length | Observed | Predicted (term 1) | Predicted (term 2) | Predicted (total) | | | | |
|---|---|---|---|---|---|---|---|---|
| 6 | 4 | 0 | 0.05 | 0.05 | ||||
| 7 | 1975 | 0 | 4e+02 | 4e+02 | ||||
| 8 | 41349 | 0 | 2e+04 | 2e+04 | ||||
| 9 | 178523 | 781.2 | 213568.8 | 214350.1 | ||||
| 10 | 382032 | 66858.4 | 617279.6 | 684137.9 | ||||
| 11 | 542386 | 171711.2 | 742379.1 | 914090.3 | ||||
| 12 | 570499 | 407976.5 | 215942.2 | 623918.7 | ||||
| 13 | 459330 | 259860.7 | 6512.5 | 266373.2 | ||||
| 14 | 305002 | 87488.6 | 0 | 87488.6 | ||||
| 15 | 169317 | 25704.4 | 0 | 25704.4 | ||||
| 16 | 86379 | 7264.7 | 0 | 7264.7 | ||||
| 17 | 40391 | 2028.2 | 0 | 2028.2 | ||||
| 18 | 17432 | 564.1 | 0 | 564.1 | ||||
| 19 | 7866 | 156.7 | 0 | 156.7 | ||||
| 20 | 3830 | 43.5 | 0 | 43.5 | ||||
| 21 | 1957 | 12.1 | 0 | 12.1 | ||||
| 22 | 1229 | 3.4 | 0 | 3.4 | ||||
| 23 | 910 | 0.9 | 0 | 0.9 | ||||
| 24 | 733 | 0.3 | 0 | 0.3 | ||||
| 25 | 617 | 0.07 | 0 | 0.07 | ||||
| 26 | 561 | 0.02 | 0 | 0.02 | ||||
| 27 | 492 | 0.006 | 0 | 0.006 | ||||
| 28 | 446 | 0.002 | 0 | 0.002 | ||||
| 29 | 436 | 0.0005 | 0 | 0.0005 | ||||
| 30 | 397 | 0.0001 | 0 | 0.0001 | ||||
| 31 | 374 | 1e−05 | 0 | 1e−05 | ||||
| 32 | 359 | 2e−06 | 0 | 2e−06 | ||||
| 33 | 322 | 2e−08 | 0 | 2e−08 | ||||
| … | … | … | ||||||
Mean profile for 100 random degenerated quaternary sequences.
| Length | Observed | Predicted (term 1) | Predicted (term 2) | Predicted (total) | Observed | Asymptotic | | |
|---|---|---|---|---|---|---|---|---|
| 6 | 0.03 | 0 | 0.0 | 0.0 | −0.2359 | |||
| 7 | 363.29 | 0 | 363.9 | 363.9 | 0.3967 | |||
| 8 | 21236.17 | 0 | 21252.2 | 21252.2 | 0.6704 | |||
| 9 | 214371.12 | 781.6 | 213574.7 | 214356.3 | 0.8260 | 0.7242 | ||
| 10 | 684344.68 | 66877.4 | 617315.1 | 684192.5 | 0.9041 | 0.9280 | ||
| 11 | 914013.67 | 171742.8 | 742383.0 | 914125.8 | 0.9235 | 0.9985 | ||
| 12 | 623870.12 | 407973.4 | 215914.6 | 623888.0 | 0.8978 | 0.9655 | ||
| 13 | 266366.73 | 259826.1 | 6510.8 | 266336.9 | 0.8406 | 0.8792 | ||
| 14 | 87424.58 | 87471.6 | 0 | 87471.6 | 0.7656 | 0.7930 | ||
| 15 | 25704.95 | 25698.5 | 0 | 25698.5 | 0.6832 | 0.7068 | ||
| 16 | 7253.72 | 7262.9 | 0 | 7262.9 | 0.5981 | 0.6206 | ||
| 17 | 2025.99 | 2027.6 | 0 | 2027.6 | 0.5123 | 0.5344 | ||
| 18 | 565.97 | 563.9 | 0 | 563.9 | 0.4265 | 0.4482 | ||
| 19 | 155.90 | 156.7 | 0 | 156.7 | 0.3397 | 0.3620 | ||
| 20 | 43.52 | 43.5 | 0 | 43.5 | 0.2539 | 0.2758 | ||
| 21 | 12.28 | 12.1 | 0 | 12.1 | 0.1688 | 0.1895 | ||
| 22 | 3.06 | 3.4 | 0 | 3.4 | 0.0753 | 0.1033 | ||
| 23 | 0.80 | 0.9 | 0 | 0.9 | −0.0150 | 0.0171 | − | |
| 24 | 0.28 | 0.3 | 0 | 0.3 | −0.0857 | −0.0691 | − | |
| 25 | 0.14 | 0.1 | 0 | 0.1 | −0.1323 | −0.1553 | − | |
GC-content is 0.6664.
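GC-content is the fraction of G and C among the nucleotides of a sequence. The short Python sketch below computes it, and also derives letter probabilities for a random model with a prescribed GC-content. The symmetric model P(G) = P(C) and P(A) = P(T) is an assumption made here for illustration; the record does not state how the random quaternary sequences were parameterized. Both function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C among the A/C/G/T symbols of seq (case-insensitive)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T symbols")
    return gc / total


def base_weights(gc):
    """Letter probabilities for a target GC-content.

    Assumed model (not confirmed by the source):
    P(G) = P(C) = gc / 2 and P(A) = P(T) = (1 - gc) / 2.
    """
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


# Weights matching the GC-content quoted for the table above.
weights = base_weights(0.6664)
```

Symbols other than A, C, G, T (for example N) are ignored in the denominator, which is one common convention among several.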
Distribution of the extinction level for 100 random degenerated quaternary sequences.
| Extinction level | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|
| Number of sequences | 26 | 42 | 18 | 7 | 7 |
GC-content is 0.6664.
| 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 16 | 13 | 19 | 14 | 14 | 6 | 1 | 1 | 2 | 1 | 1 | 0 | 2 |
| 0.50 | 22.25 | 22.25 | 22.25 | 22.25 | 44.51 |
| 0.55 | 19.32 | 22.42 | 22.74 | 25.80 | 45.16 |
| 0.60 | 16.83 | 22.92 | 24.27 | 30.20 | 47.18 |
| 0.65 | 14.69 | 23.82 | 27.06 | 35.81 | 50.83 |
| 0.70 | 12.81 | 25.25 | 31.60 | 43.25 | 56.63 |
| 0.75 | 11.13 | 27.43 | 38.80 | 53.62 | 65.64 |
| 0.80 | 9.58 | 30.83 | 50.63 | 69.13 | 79.99 |
| 0.85 | 8.13 | 36.49 | 71.78 | 94.91 | 104.80 |
| 0.90 | 6.70 | 47.45 | 116.72 | 146.40 | 155.45 |
| 0.95 | 5.15 | 77.70 | 259.56 | 300.72 | 309.05 |