Matteo Comin, Davide Verzotto.
Abstract
BACKGROUND: The classification of protein sequences using string algorithms provides valuable insights for protein function prediction. Several methods, based on a variety of different patterns, have been previously proposed. Almost all string-based approaches discover patterns that are not "independent," and therefore the associated scores count the contribution of patterns that cover the same region of a sequence multiple times.
Year: 2010 PMID: 20122187 PMCID: PMC3009487 DOI: 10.1186/1471-2105-11-S1-S16
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Example of meet between a sequence and a suffix of the other sequence.
Meet between s1 and a suffix of s2, with sequences s1 = aabababab and s2 = babacacac of length 9.
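The meet in this example can be reproduced with a short sketch. It assumes that the meet is the position-wise consensus of the two aligned strings, with a don't-care symbol `.` at every mismatch and with leading and trailing don't-cares trimmed. The function names `meet` and `meets_with_suffixes` are hypothetical, and the choice of the suffix starting at position 3 of s2 is inferred from the counters example that follows, not stated in the caption.

```python
def meet(x: str, y: str, dont_care: str = ".") -> str:
    """Position-wise consensus of x and y (assumed definition of a meet).

    Matching characters are kept, mismatches become don't-cares, and
    don't-cares at either end of the result are trimmed.
    """
    aligned = "".join(a if a == b else dont_care for a, b in zip(x, y))
    return aligned.strip(dont_care)


def meets_with_suffixes(s1: str, s2: str) -> dict[int, str]:
    """Meet of s1 with every suffix of s2, keyed by 1-based suffix start."""
    return {start + 1: meet(s1, s2[start:]) for start in range(len(s2))}


s1 = "aabababab"
s2 = "babacacac"
meets = meets_with_suffixes(s1, s2)
print(meets[3])  # a.a.a
```

Under these assumptions, aligning s1 with the suffix of s2 that starts at position 3 yields the pattern a.a.a.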
Example of counters I1 and I2 of a meet.
| position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| I1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| I2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |

Counters I1 and I2 of the pattern p = a.a.a (that is, the meet between s1 and a suffix of s2) for each position of s1 and s2, with sequences s1 = aabababab and s2 = babacacac of length 9.
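The counter values can be checked with a small sketch. It assumes, based only on the values shown, that each counter marks the position at which the occurrence induced by the meet starts in its sequence, rather than every occurrence of the pattern. The function name `meet_counters` and the 1-based `suffix_start` parameter are hypothetical.

```python
def meet_counters(s1: str, s2: str, suffix_start: int, dont_care: str = "."):
    """Return (pattern, I1, I2) for the meet of s1 with a suffix of s2.

    suffix_start is the 1-based position where the suffix of s2 begins.
    I1 and I2 hold a 1 at the position where the meet's occurrence starts
    in s1 and s2 respectively (assumed reading of the counters).
    """
    offset = suffix_start - 1
    aligned = "".join(
        a if a == b else dont_care for a, b in zip(s1, s2[offset:])
    )
    pattern = aligned.strip(dont_care)
    i1 = [0] * len(s1)
    i2 = [0] * len(s2)
    if pattern:
        # Index of the first solid (non-don't-care) character of the alignment.
        first = len(aligned) - len(aligned.lstrip(dont_care))
        i1[first] = 1
        i2[offset + first] = 1
    return pattern, i1, i2


pattern, i1, i2 = meet_counters("aabababab", "babacacac", suffix_start=3)
print(pattern)  # a.a.a
print(i1)       # [0, 1, 0, 0, 0, 0, 0, 0, 0]
print(i2)       # [0, 0, 0, 1, 0, 0, 0, 0, 0]
```

Note that a.a.a also matches s1 at position 4, which the counter does not mark. This is why the sketch records only the occurrence induced by the meet.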
Irredundant vs. maximal patterns.
| No. | Sequence 1 | Sequence 2 | Length 1 | Length 2 | Total length | Maximals | Irredundants | % of irredundants |
|---|---|---|---|---|---|---|---|---|
| 1. | 1alo | 1bjt | 597 | 760 | 1357 | ≫16697 | 1256 | ≪7.5 |
| 2. | 1qax | 1cxp | 316 | 466 | 782 | 8397 | 682 | 8.1 |
| 3. | 1gai | 1nmt | 472 | 227 | 699 | 7037 | 612 | 8.7 |
| 4. | 1cvu | 1lgr | 511 | 368 | 879 | 9014 | 787 | 8.7 |
| 5. | 1gpe | 1yrg | 392 | 343 | 735 | 6853 | 653 | 9.5 |
| 6. | 1qqj | 3pcc | 415 | 236 | 651 | 5090 | 566 | 11.1 |
| 7. | 1bxk | 1ofg | 352 | 220 | 572 | 3549 | 489 | 13.8 |
| 8. | 1ebf | 2nac | 169 | 188 | 357 | 1126 | 277 | 24.6 |
| 9. | 1a03 | 1mho | 90 | 88 | 178 | 257 | 108 | 42.0 |
| 10. | 1gpt | 1ayj | 47 | 50 | 97 | 64 | 45 | 70.3 |
Number of irredundant and maximal common patterns over 10 pairs of protein sequences taken from experiments in Table 4. Rows are sorted according to the percentage of irredundants over the total number of maximal patterns.
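As a quick arithmetic check, the last column can be recomputed from the two pattern counts. The sketch below assumes the percentage is 100 × irredundants / maximals rounded to one decimal. Row 1 is left out because its maximal count is given only as a bound. The name `pct_irredundant` is hypothetical.

```python
# (row number, maximal patterns, irredundant patterns) for rows 2-10.
rows = [
    (2, 8397, 682),
    (3, 7037, 612),
    (4, 9014, 787),
    (5, 6853, 653),
    (6, 5090, 566),
    (7, 3549, 489),
    (8, 1126, 277),
    (9, 257, 108),
    (10, 64, 45),
]


def pct_irredundant(maximals: int, irredundants: int) -> float:
    """Irredundant patterns as a percentage of maximal patterns."""
    return round(100 * irredundants / maximals, 1)


percentages = [pct_irredundant(m, i) for _, m, i in rows]
print(percentages)
# [8.1, 8.7, 8.7, 9.5, 11.1, 13.8, 24.6, 42.0, 70.3]
```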
Experiments of Liao and Noble.
| No. | Target family | Training pos. | Training neg. | Test pos. | Test neg. | No. | Target family | Training pos. | Training neg. | Test pos. | Test neg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.3.5.2 | 12 | 2330 | 9 | 1746 | 27 | 7.3.10.1 | 11 | 423 | 95 | 3653 |
| 1 | 2.56.1.2 | 11 | 2509 | 8 | 1824 | 28 | 3.32.1.11 | 46 | 3880 | 5 | 421 |
| 2 | 3.1.8.1 | 19 | 3002 | 8 | 1263 | 29 | 3.32.1.13 | 43 | 3627 | 8 | 674 |
| 3 | 3.1.8.3 | 17 | 2686 | 10 | 1579 | 30 | 7.3.6.1 | 33 | 3203 | 9 | 873 |
| 4 | 1.27.1.1 | 12 | 2890 | 6 | 1444 | 31 | 7.3.6.2 | 16 | 1553 | 26 | 2523 |
| 5 | 1.27.1.2 | 10 | 2408 | 8 | 1926 | 32 | 7.3.6.4 | 37 | 3591 | 5 | 485 |
| 6 | 3.42.1.1 | 29 | 3208 | 10 | 1105 | 33 | 2.38.4.1 | 30 | 3682 | 5 | 613 |
| 7 | 1.45.1.2 | 33 | 3650 | 6 | 663 | 34 | 2.1.1.1 | 90 | 3102 | 31 | 1068 |
| 8 | 1.4.1.1 | 26 | 2256 | 23 | 1994 | 35 | 2.1.1.2 | 99 | 3412 | 22 | 758 |
| 9 | 2.9.1.2 | 17 | 2370 | 14 | 1951 | 36 | 3.32.1.1 | 42 | 3542 | 9 | 759 |
| 10 | 1.4.1.2 | 41 | 3557 | 8 | 693 | 37 | 2.38.4.3 | 24 | 2946 | 11 | 1349 |
| 11 | 2.9.1.3 | 26 | 3625 | 5 | 696 | 38 | 2.1.1.3 | 113 | 3895 | 8 | 275 |
| 12 | 1.4.1.3 | 40 | 3470 | 9 | 780 | 39 | 2.1.1.4 | 88 | 3033 | 33 | 1137 |
| 13 | 2.44.1.2 | 11 | 307 | 140 | 3894 | 40 | 2.38.4.5 | 26 | 3191 | 9 | 1104 |
| 14 | 2.9.1.4 | 21 | 2928 | 10 | 1393 | 41 | 2.1.1.5 | 94 | 3240 | 27 | 930 |
| 15 | 3.42.1.5 | 26 | 2876 | 13 | 1437 | 42 | 7.39.1.2 | 20 | 3204 | 7 | 1121 |
| 16 | 3.2.1.2 | 37 | 3002 | 16 | 1297 | 43 | 2.52.1.2 | 12 | 3060 | 5 | 1275 |
| 17 | 3.42.1.8 | 34 | 3761 | 5 | 552 | 44 | 7.39.1.3 | 13 | 2083 | 14 | 2242 |
| 18 | 3.2.1.3 | 44 | 3569 | 9 | 730 | 45 | 1.36.1.2 | 29 | 3477 | 7 | 839 |
| 19 | 3.2.1.4 | 46 | 3732 | 7 | 567 | 46 | 3.32.1.8 | 40 | 3374 | 11 | 927 |
| 20 | 3.2.1.5 | 46 | 3732 | 7 | 567 | 47 | 1.36.1.5 | 10 | 1199 | 26 | 3117 |
| 21 | 3.2.1.6 | 48 | 3894 | 5 | 405 | 48 | 7.41.5.1 | 10 | 2241 | 9 | 2016 |
| 22 | 2.28.1.1 | 18 | 1246 | 44 | 3044 | 49 | 7.41.5.2 | 10 | 2241 | 9 | 2016 |
| 23 | 3.3.1.2 | 22 | 3280 | 7 | 1043 | 50 | 1.41.1.2 | 36 | 3692 | 6 | 615 |
| 24 | 3.2.1.7 | 48 | 3894 | 5 | 405 | 51 | 2.5.1.1 | 13 | 2345 | 11 | 1983 |
| 25 | 2.28.1.3 | 56 | 3875 | 6 | 415 | 52 | 2.5.1.3 | 14 | 2525 | 10 | 1803 |
| 26 | 3.3.1.5 | 13 | 1938 | 16 | 2385 | 53 | 1.41.1.5 | 17 | 1744 | 25 | 2563 |
Experiments presented in [8] and associated with 54 protein families of SCOP version 1.53, ordered here by progressive number. For each target family ID, the table gives the number of positive and negative sequences in the training and test sets.
Comparison of results against state-of-the-art methods
| Learning algorithm | Mean ROC | Mean ROC50 | Mean mRFP |
|---|---|---|---|
| Irredundant Class | | 0.524 | 0.0554 |
| Local Alignment ("ekm," | | 0.600 | |
| Local Alignment ("eig," | 0.925 | | 0.0541 |
| Word Correlation Matrices ( | 0.904 | 0.447 | 0.0778 |
| Pairwise | 0.896 | 0.464 | 0.0837 |
| Mismatch ( | 0.872 | 0.400 | 0.0837 |
| Spectrum ( | 0.824 | 0.294 | 0.1535 |
| Fisher | 0.773 | 0.250 | 0.2040 |
The comparison is based on the mean ROC score, the mean ROC50 score (ROC curve up to the first 50 false positives), and the mean median rate of false positives (mRFP) for the Irredundant Class and state-of-the-art methods. The best result for each score is reported in bold.
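The two truncated scores can be illustrated with a minimal sketch. It assumes the commonly used definitions: ROC_n is the area under the ROC curve up to the first n false positives, normalized to [0, 1], and the median RFP is the fraction of negatives ranked above the median-ranked positive. Both definitions, the lower-median convention, and the names `ranked_labels`, `roc_n`, and `median_rfp` are assumptions, not taken from the source. Ties between scores are not treated specially.

```python
def ranked_labels(scores, labels):
    """Labels (1 = positive, 0 = negative) sorted by decreasing score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def roc_n(ranked, n=None):
    """Normalized area under the ROC curve up to the first n false positives.

    With n=None the full ROC score is returned; n=50 gives ROC50.
    """
    total_pos = sum(ranked)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative")
    n = total_neg if n is None else min(n, total_neg)
    tp = fp = area = 0
    for is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
            area += tp  # positives ranked above this false positive
            if fp == n:
                break
    return area / (n * total_pos)


def median_rfp(ranked):
    """Fraction of negatives ranked above the median positive (lower median)."""
    total_pos = sum(ranked)
    total_neg = len(ranked) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative")
    median_index = (total_pos + 1) // 2
    tp = fp = 0
    for is_pos in ranked:
        if is_pos:
            tp += 1
            if tp == median_index:
                return fp / total_neg
        else:
            fp += 1


ranked = ranked_labels([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 0, 0])
print(roc_n(ranked))        # full ROC score of this toy ranking
print(roc_n(ranked, n=50))  # ROC50; here n is capped at the 3 negatives
print(median_rfp(ranked))
```

A higher ROC or ROC50 score and a lower mRFP indicate a better ranking, which is consistent with how the table is read.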
Figure 1. ROC scores distributions. (a) ROC scores distribution for the Irredundant Class and state-of-the-art methods. (b) ROC scores across families.
Figure 2. ROC scores family-by-family comparisons. (a) Family-by-family ROC scores comparison of the Irredundant Class against Mismatch. (b) Family-by-family ROC scores comparison of the Irredundant Class against Local Alignment version "eig."
Figure 3. Irredundant patterns footprint for protein family 50. Histogram of the irredundant patterns footprint for S100 proteins (family no. 50 of Table 4).
Figure 4. Irredundant patterns footprint for protein families 32 and 53. Histogram of the irredundant patterns footprint for: (a) plant defensins (family no. 32 of Table 4) and (b) bacterial repressors (family no. 53).