Leming Zhou, Jonathan Stanton, Liliana Florea.
Abstract
BACKGROUND: To meet the needs of gene annotation for newly sequenced organisms, optimized spaced seeds can be incorporated into cross-species sequence alignment programs to accurately align gene sequences to the genome of a related species. So far, seed performance has been tested for comparisons between closely related species, such as human and mouse, or on simulated data. As the number and variety of genomes increase, it becomes desirable to identify a small set of universal seeds that perform optimally or near-optimally on a large range of comparisons.
Year: 2008 PMID: 18215286 PMCID: PMC2375135 DOI: 10.1186/1471-2105-9-36
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Scatterplot of seed sensitivity values between the CHK and DOG comparisons, for weight W = 16 and for the (16, 6, 0) combination. The linear and non-linear regression (solid) and the 95% confidence interval (dotted) curves are shown.
Comparison of seed sensitivity distributions between models. Seed population sizes for (12,10,0), (14,8,0) and (16,6,0) are 352716, 203490 and 54264, respectively. Explanation of columns: $y_{min}$, $y_{max}$: minimum and maximum sensitivity in the compared model; $x_{max}$: maximum sensitivity in the reference model; $t\sigma$: half the length of the 95% prediction interval ($t = 1.96$ when $\alpha = 0.05$); $T(x, y) = ((a + b x_{max} + c x_{max}^2 + d x_{max}^3) - t\sigma)/y_{max}$, where a, b, c and d are the coefficients of the regression curve, and $\sigma_y$ is the estimated regression standard error of prediction for a given x value. Because the number of values is large, $\sigma_y \simeq \sigma$ for all y. Outliers are those points that satisfy either of the following two criteria: $y - \hat{y} > 2.5\sigma$ (U) or $\hat{y} - y > 2.5\sigma$ (L). Here, $\hat{y} = a + bx + cx^2 + dx^3$.
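| Models (reference-compared) | x_max | y_max | y_min | tσ | T(x, y) | Outliers U | Outliers L |
| W = 12 | | | | | | | |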
| CHK-DOG | 0.742 | 0.992 | 0.936 | 0.002 | 0.996 | 0.04% | 1.75% |
| DOG-CHK | 0.992 | 0.742 | 0.429 | 0.028 | 0.947 | 0.99% | 0.09% |
| CHK-MUS | 0.742 | 0.964 | 0.827 | 0.005 | 0.991 | 0.01% | 1.74% |
| MUS-CHK | 0.964 | 0.742 | 0.429 | 0.017 | 0.978 | 1.70% | 0.00% |
| CHK-ZFS | 0.742 | 0.476 | 0.196 | 0.006 | 0.995 | 1.52% | 0.10% |
| ZFS-CHK | 0.476 | 0.742 | 0.429 | 0.006 | 0.984 | 0.04% | 1.36% |
| DOG-MUS | 0.992 | 0.964 | 0.827 | 0.004 | 0.996 | 1.63% | 0.03% |
| MUS-DOG | 0.964 | 0.992 | 0.936 | 0.001 | 0.999 | 0.02% | 1.95% |
| DOG-ZFS | 0.992 | 0.476 | 0.196 | 0.033 | 0.866 | 1.59% | 0.01% |
| ZFS-DOG | 0.476 | 0.992 | 0.936 | 0.003 | 0.995 | 0.03% | 1.71% |
| MUS-ZFS | 0.964 | 0.476 | 0.196 | 0.023 | 0.934 | 1.98% | 0.00% |
| ZFS-MUS | 0.476 | 0.964 | 0.827 | 0.006 | 0.988 | 0.01% | 1.70% |
| W = 14 | | | | | | | |
| CHK-DOG | 0.567 | 0.971 | 0.900 | 0.005 | 0.990 | 0.00% | 1.71% |
| DOG-CHK | 0.971 | 0.567 | 0.335 | 0.027 | 0.913 | 3.26% | 0.00% |
| CHK-MUS | 0.567 | 0.901 | 0.758 | 0.009 | 0.985 | 0.00% | 1.66% |
| MUS-CHK | 0.901 | 0.567 | 0.335 | 0.018 | 0.953 | 1.89% | 0.00% |
| CHK-ZFS | 0.567 | 0.299 | 0.132 | 0.005 | 0.979 | 1.61% | 0.01% |
| ZFS-CHK | 0.299 | 0.567 | 0.335 | 0.007 | 0.983 | 0.00% | 1.48% |
| DOG-MUS | 0.971 | 0.901 | 0.758 | 0.005 | 0.996 | 1.43% | 0.01% |
| MUS-DOG | 0.901 | 0.971 | 0.900 | 0.002 | 0.997 | 0.01% | 1.69% |
| DOG-ZFS | 0.971 | 0.299 | 0.132 | 0.024 | 0.836 | 1.65% | 0.00% |
| ZFS-DOG | 0.299 | 0.971 | 0.900 | 0.006 | 0.988 | 0.00% | 1.66% |
| MUS-ZFS | 0.901 | 0.299 | 0.132 | 0.018 | 0.882 | 2.09% | 0.00% |
| ZFS-MUS | 0.299 | 0.901 | 0.758 | 0.012 | 0.978 | 0.00% | 1.70% |
| W = 16 | | | | | | | |
| CHK-DOG | 0.405 | 0.930 | 0.821 | 0.010 | 0.982 | 0.00% | 1.98% |
| DOG-CHK | 0.930 | 0.405 | 0.248 | 0.021 | 0.907 | 1.65% | 0.00% |
| CHK-MUS | 0.405 | 0.808 | 0.633 | 0.013 | 0.976 | 0.00% | 1.98% |
| MUS-CHK | 0.808 | 0.405 | 0.248 | 0.015 | 0.944 | 1.86% | 0.00% |
| CHK-ZFS | 0.405 | 0.176 | 0.086 | 0.004 | 0.959 | 1.99% | 0.00% |
| ZFS-CHK | 0.176 | 0.405 | 0.248 | 0.006 | 0.986 | 0.00% | 1.90% |
| DOG-MUS | 0.930 | 0.808 | 0.633 | 0.006 | 0.991 | 1.62% | 0.00% |
| MUS-DOG | 0.808 | 0.930 | 0.821 | 0.003 | 0.996 | 0.04% | 1.49% |
| DOG-ZFS | 0.930 | 0.176 | 0.086 | 0.015 | 0.804 | 1.68% | 0.00% |
| ZFS-DOG | 0.176 | 0.930 | 0.821 | 0.012 | 0.976 | 0.00% | 1.97% |
| MUS-ZFS | 0.808 | 0.176 | 0.086 | 0.012 | 0.858 | 1.96% | 0.00% |
| ZFS-MUS | 0.176 | 0.808 | 0.633 | 0.018 | 0.964 | 0.00% | 1.97% |
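The T(x, y) statistic and the outlier criteria defined in the caption of the table above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the cubic coefficients and the standard error σ are assumed to have been fitted beforehand by some regression routine.

```python
from typing import Iterable, Sequence, Tuple


def predict(coeffs: Sequence[float], x: float) -> float:
    """Cubic regression curve: y_hat = a + b*x + c*x^2 + d*x^3."""
    a, b, c, d = coeffs
    return a + b * x + c * x**2 + d * x**3


def t_statistic(coeffs: Sequence[float], x_max: float, y_max: float,
                sigma: float, t: float = 1.96) -> float:
    """T(x, y): lower 95% prediction bound at x_max, relative to y_max."""
    return (predict(coeffs, x_max) - t * sigma) / y_max


def outlier_fractions(points: Iterable[Tuple[float, float]],
                      coeffs: Sequence[float], sigma: float,
                      k: float = 2.5) -> Tuple[float, float]:
    """Fractions of points lying more than k*sigma above (U) or below (L) the curve."""
    upper = lower = total = 0
    for x, y in points:
        total += 1
        residual = y - predict(coeffs, x)
        if residual > k * sigma:
            upper += 1
        elif -residual > k * sigma:
            lower += 1
    if total == 0:
        raise ValueError("no points given")
    return upper / total, lower / total
```

A value of T close to 1 means that the seed that is best in the reference model is predicted, even at the lower edge of the prediction interval, to be nearly as sensitive as the best seed in the compared model.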
Seeds optimized for the CHK, DOG, MUS, ZFS comparisons, for weight W = 10..16, using hill-climbing. For large weights (e.g., W ≥ 16), the fixed span k = 22 may significantly constrain the range of seeds, and therefore the seeds produced under this model may not be optimal in practice.
| Model | n1 | n0 | nx | W | Sensitivity | Seed |
| CHK | 9 | 11 | 2 | 10 | 0.9033819146 | 1x11011011x11000000000 |
| CHK | 9 | 9 | 4 | 11 | 0.8373594847 | 1xx1011011011xx1000000 |
| CHK | 10 | 8 | 4 | 12 | 0.7617553847 | 1xx1011011011xx1100000 |
| CHK | 11 | 7 | 4 | 13 | 0.6781533749 | 11x1101101x011xx110000 |
| CHK | 12 | 6 | 4 | 14 | 0.5907201266 | 11x11011011011xxx11000 |
| CHK | 12 | 4 | 6 | 15 | 0.5096393921 | 11x110xxx1011011011xx1 |
| CHK | 14 | 4 | 4 | 16 | 0.4328248093 | 11x110xx11011011011x11 |
| DOG | 8 | 10 | 4 | 10 | 0.9991951771 | 11xx1101x011x100000000 |
| DOG | 9 | 9 | 4 | 11 | 0.9976354479 | 11x110110x0x1x11000000 |
| DOG | 10 | 8 | 4 | 12 | 0.9943213958 | 11x110x1011x1x11000000 |
| DOG | 10 | 6 | 6 | 13 | 0.9882916262 | 11x1101x1x0xx011x11000 |
| DOG | 11 | 5 | 6 | 14 | 0.9790689537 | 11xx11011x0x0x1011x110 |
| DOG | 12 | 4 | 6 | 15 | 0.9652197277 | 11x110x1x010x1011x1x11 |
| DOG | 13 | 3 | 6 | 16 | 0.9461146848 | 11x110x1x01xx1011x1111 |
| MUS | 8 | 10 | 4 | 10 | 0.9946391036 | 1xx1011011xx1100000000 |
| MUS | 9 | 9 | 4 | 11 | 0.9868058410 | 11x1101x011xx110000000 |
| MUS | 9 | 7 | 6 | 12 | 0.9737371393 | 1x1101x010x1x011xx1000 |
| MUS | 10 | 6 | 6 | 13 | 0.9538071885 | 1x1101xx10x1x011x11000 |
| MUS | 11 | 5 | 6 | 14 | 0.9260863280 | 11xx110x1x010x1011x110 |
| MUS | 11 | 3 | 8 | 15 | 0.8901430490 | 11x110x1x01xx1011x1xx1 |
| MUS | 13 | 3 | 6 | 16 | 0.8445827211 | 11x1101xxx011x11011x11 |
| ZFS | 9 | 11 | 2 | 10 | 0.7113928043 | 1x11011011x11000000000 |
| ZFS | 10 | 10 | 2 | 11 | 0.6041894577 | 11x11011011x1100000000 |
| ZFS | 10 | 8 | 4 | 12 | 0.4894026703 | x1x11011011011xx100000 |
| ZFS | 12 | 8 | 2 | 13 | 0.3972876500 | 11x11011011011x1100000 |
| ZFS | 12 | 6 | 4 | 14 | 0.3090795416 | 1xx110110x1011011x1100 |
| ZFS | 13 | 5 | 4 | 15 | 0.2419994803 | 11x1101101xx11011x1100 |
| ZFS | 14 | 4 | 4 | 16 | 0.1863098023 | 11xx11011011x11011x110 |
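The numeric columns of the table above can be recomputed from the seed strings themselves. The sketch below counts the `1`, `0` and `x` symbols of a seed and derives its weight. The rule that each `x` position contributes half a unit of weight is an assumption inferred from the table rows (for example, 9 + 2/2 = 10 in the first CHK row); the function names are hypothetical.

```python
from collections import Counter
from typing import Tuple


def seed_composition(seed: str) -> Tuple[int, int, int]:
    """Return (n1, n0, nx): the number of '1', '0' and 'x' symbols in the seed."""
    counts = Counter(seed)
    unknown = set(counts) - {"1", "0", "x"}
    if unknown:
        raise ValueError(f"unexpected seed symbols: {sorted(unknown)}")
    return counts["1"], counts["0"], counts["x"]


def seed_weight(seed: str) -> float:
    """Weight W = n1 + nx/2 (assumption: an 'x' counts as half a match position)."""
    n1, _, nx = seed_composition(seed)
    return n1 + nx / 2
```

For the first CHK seed, `seed_composition("1x11011011x11000000000")` gives `(9, 11, 2)` and `seed_weight` gives `10.0`, in agreement with that row.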
Figure 2. Sensitivity maxima for DOG, MUS, CHK and ZFS comparisons, for seeds of weight W = 10..18. For each weight W, (n1, n0, nx) combinations are shown right-to-left starting with nx = 0 and subsequently increasing nx. Sensitivity drop rates between consecutive weights are shown at the top of the plots.
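The symbol-count combinations available for a given weight can be enumerated with a short sketch. It assumes the fixed span k = 22 mentioned in the seed table caption, assumes that the third count is the number of `x` positions and contributes half a unit of weight each (so it must be even for an integer weight), and uses hypothetical function names. The population-size helper additionally assumes that the first seed position is fixed to `1`; this is an inference, not something the source states, though under it the counts match the quoted population sizes for the three combinations with no `x` positions.

```python
from math import comb
from typing import List, Tuple

SPAN = 22  # fixed seed span k, taken from the seed table caption


def combinations_for_weight(weight: int, span: int = SPAN) -> List[Tuple[int, int, int]]:
    """List (n1, n0, nx) with n1 + nx/2 == weight and n1 + n0 + nx == span."""
    combos = []
    nx = 0
    while True:
        n1 = weight - nx // 2
        n0 = span - n1 - nx
        if n1 < 1 or n0 < 0:
            break
        combos.append((n1, n0, nx))
        nx += 2  # only even nx keeps the weight an integer
    return combos


def population_size_no_x(n1: int, n0: int) -> int:
    """Number of seeds with nx = 0, assuming the first position is always '1'."""
    return comb(n1 + n0 - 1, n0)
```

Under these assumptions, `combinations_for_weight(12)` starts with `(12, 10, 0)`, `(11, 9, 2)`, `(10, 8, 4)`, and `population_size_no_x(12, 10)` returns 352716.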
Theoretical (T) and empirical (E) sensitivities of optimal seeds in the four models. Letters C, D, M, Z indicate the CHK, DOG, MUS and ZFS comparisons, respectively. C_do represents sensitivity values for the optimal DOG seeds (do) when applied to the CHK (C) model. Values are averaged within each weight group.
| W | C_co | C_do | C_mo | C_zo | D_co | D_do | D_mo | D_zo |
| 10 | 0.881 | 0.875 | 0.878 | 0.881 | 0.998 | 0.999 | 0.999 | 0.998 |
| 10 | 0.904 | 0.896 | 0.897 | 0.905 | 0.983 | 0.984 | 0.984 | 0.982 |
| 11 | 0.808 | 0.800 | 0.803 | 0.808 | 0.996 | 0.996 | 0.996 | 0.996 |
| 11 | 0.876 | 0.859 | 0.856 | 0.875 | 0.976 | 0.978 | 0.976 | 0.976 |
| 12 | 0.717 | 0.708 | 0.715 | 0.716 | 0.990 | 0.990 | 0.990 | 0.989 |
| 12 | 0.822 | 0.819 | 0.822 | 0.831 | 0.967 | 0.970 | 0.968 | 0.967 |
| 13 | 0.628 | 0.619 | 0.624 | 0.627 | 0.978 | 0.979 | 0.979 | 0.977 |
| 13 | 0.779 | 0.756 | 0.761 | 0.790 | 0.954 | 0.954 | 0.953 | 0.953 |
| 14 | 0.555 | 0.542 | 0.549 | 0.554 | 0.967 | 0.969 | 0.969 | 0.965 |
| 14 | 0.735 | 0.706 | 0.731 | 0.763 | 0.943 | 0.942 | 0.943 | 0.943 |
| 15 | 0.483 | 0.472 | 0.480 | 0.481 | 0.952 | 0.955 | 0.954 | 0.949 |
| 15 | 0.706 | 0.676 | 0.686 | 0.733 | 0.927 | 0.927 | 0.926 | 0.930 |
| 16 | 0.412 | 0.400 | 0.407 | 0.410 | 0.931 | 0.935 | 0.935 | 0.927 |
| 16 | 0.667 | 0.610 | 0.627 | 0.679 | 0.912 | 0.909 | 0.910 | 0.910 |
| W | M_co | M_do | M_mo | M_zo | Z_co | Z_do | Z_mo | Z_zo |
| 10 | 0.992 | 0.992 | 0.993 | 0.992 | 0.663 | 0.647 | 0.654 | 0.663 |
| 10 | 0.978 | 0.978 | 0.979 | 0.978 | 0.745 | 0.716 | 0.725 | 0.751 |
| 11 | 0.981 | 0.982 | 0.982 | 0.981 | 0.547 | 0.529 | 0.534 | 0.547 |
| 11 | 0.970 | 0.969 | 0.969 | 0.969 | 0.678 | 0.636 | 0.626 | 0.676 |
| 12 | 0.963 | 0.963 | 0.963 | 0.961 | 0.432 | 0.417 | 0.426 | 0.433 |
| 12 | 0.957 | 0.958 | 0.958 | 0.957 | 0.570 | 0.562 | 0.565 | 0.591 |
| 13 | 0.932 | 0.933 | 0.934 | 0.930 | 0.342 | 0.329 | 0.335 | 0.344 |
| 13 | 0.940 | 0.938 | 0.940 | 0.941 | 0.504 | 0.464 | 0.474 | 0.525 |
| 14 | 0.904 | 0.905 | 0.906 | 0.900 | 0.276 | 0.260 | 0.267 | 0.278 |
| 14 | 0.925 | 0.922 | 0.927 | 0.929 | 0.436 | 0.394 | 0.425 | 0.488 |
| 15 | 0.867 | 0.870 | 0.871 | 0.861 | 0.220 | 0.207 | 0.215 | 0.222 |
| 15 | 0.909 | 0.906 | 0.908 | 0.913 | 0.395 | 0.358 | 0.365 | 0.447 |
| 16 | 0.822 | 0.825 | 0.826 | 0.815 | 0.172 | 0.160 | 0.166 | 0.173 |
| 16 | 0.889 | 0.882 | 0.886 | 0.890 | 0.350 | 0.283 | 0.302 | 0.368 |
Distances between models: KLD is the Kullback-Leibler Divergence.
| CHK-DOG (DOG-CHK) | 4.857 (4.077) |
| CHK-MUS (MUS-CHK) | 2.351 (2.012) |
| CHK-ZFS (ZFS-CHK) | 0.870 (0.885) |
| DOG-MUS (MUS-DOG) | 0.468 (0.491) |
| DOG-ZFS (ZFS-DOG) | 8.366 (9.997) |
| MUS-ZFS (ZFS-MUS) | 5.352 (6.329) |
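Each pair in the table above carries two values because the Kullback-Leibler divergence is not symmetric. A minimal sketch for two discrete distributions illustrates this. The source does not say here which distributions were compared or which logarithm base was used, so the natural logarithm, the function name and the example distributions are all assumptions.

```python
from math import log
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KLD(p || q) = sum_i p_i * ln(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with p_i = 0 contributes nothing
        if qi == 0:
            return float("inf")  # p puts mass where q has none
        total += pi * log(pi / qi)
    return total
```

With the hypothetical distributions `p = [0.5, 0.5]` and `q = [0.9, 0.1]`, `kl_divergence(p, q)` and `kl_divergence(q, p)` differ, just as the forward and reverse values differ within each row of the table.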