Pavel Kuksa, Pai-Hsi Huang, Vladimir Pavlovic.
Abstract
BACKGROUND: Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels and trained on sequences known to belong to particular folds or superfamilies (the labeled data set) can attain significantly improved accuracy when this data is supplemented with protein sequences that lack any class tags (the unlabeled data). In this study, we present a principled and biologically motivated computational framework that exploits the unlabeled data more effectively by using only the sequence regions that are more likely to be biologically relevant, which improves prediction accuracy. Because over-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve the performance of the resulting classifiers.
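The idea of supplementing a labeled sequence with regions taken from unlabeled neighbors can be illustrated with a minimal sketch. It assumes a plain k-mer spectrum feature map and a neighborhood representation obtained by averaging the features of the extracted regions; the function names, the `(sequence, start, end)` hit format, and the cosine normalization are hypothetical choices made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import sqrt


def spectrum_features(seq, k=3):
    """Count every contiguous k-mer in seq (a spectrum feature map)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def extract_regions(hits):
    """Keep only the matched region of each hit.

    Each hit is a hypothetical (sequence, start, end) tuple, where
    sequence[start:end] is the region judged to be significant.
    """
    return [seq[start:end] for seq, start, end in hits]


def neighborhood_features(regions, k=3):
    """Average the spectrum features over all neighbor regions."""
    total = Counter()
    for region in regions:
        total.update(spectrum_features(region, k))
    n = max(len(regions), 1)
    return {kmer: count / n for kmer, count in total.items()}


def kernel(fa, fb):
    """Cosine-normalized dot product between two sparse feature maps."""
    dot = sum(v * fb.get(kmer, 0.0) for kmer, v in fa.items())
    norm_a = sqrt(sum(v * v for v in fa.values()))
    norm_b = sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, using `extract_regions` before `neighborhood_features` means that k-mers from the unmatched flanks of a neighbor never enter the representation, which is the intuition behind using only the relevant regions.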
Year: 2009 PMID: 19426450 PMCID: PMC2681072 DOI: 10.1186/1471-2105-10-S4-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Contiguous .
Figure 2. Spectrum (.
Figure 3. Extracting only statistically significant regions (red/light color) from the hits.
Figure 4. The SCOP (Structural Classification of Proteins) hierarchy.
Figure 5. ROC50 plots of four competing methods using the triple-(1,3) and mismatch-(5,1) kernels with PDB, Swiss-Prot and NR as unlabeled databases for remote homology prediction.
Experimental results on the remote homology detection task for all competing methods using the triple(1,3) kernel.
| | neighborhood (no clustering) | | | clustered neighborhood | | |
| dataset | ROC | ROC50 | p-value | ROC | ROC50 | p-value |
| PDB | | | | | | |
| full sequence | .9476 | .7582 | - | .9515 | .7633 | - |
| region | .9708 | | | .9716 | | |
| no tails (full seq.) | .9443 | .7522 | .5401 | .9472 | .7559 | .5324 |
| max length (full seq.) | .9471 | .7497 | .4407 | .9536 | .7584 | .5468 |
| Swiss-Prot | | | | | | |
| full sequence | .9245 | .6908 | - | .9464 | .7474 | - |
| region | .9752 | | | .9732 | | |
| no tails (full seq.) | .9361 | .6938 | .8621 | .9395 | .7160 | .6259 |
| max length (full seq.) | .9300 | .6514 | .2589 | .9348 | .6817 | .1369 |
| NR | | | | | | |
| full sequence | .9419 | .7328 | - | .9556 | .7566 | - |
| region | .9824 | | | .9861 | | |
| no tails (full seq.) | .9575 | .7438 | .6640 | .9602 | .7486 | .8507 |
| max length (full seq.) | .9513 | .7401 | .8656 | .9528 | .7595 | .8696 |
* p-value: signed-rank test on ROC50 scores against full sequence in the corresponding setting
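The ROC50 score reported in these tables is conventionally the area under the ROC curve up to the first 50 false positives, scaled to lie between 0 and 1. A minimal sketch of that computation follows, assuming higher scores mean "more likely positive" and no special handling of tied scores; the function name `roc_n` is a hypothetical choice.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives, in [0, 1].

    scores: classifier outputs, higher means more likely positive.
    labels: truthy for positives, falsy for negatives.
    """
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    num_pos = sum(1 for _, y in ranked if y)
    num_neg = len(ranked) - num_pos
    cutoff = min(n, num_neg)  # fewer than n negatives: use all of them
    if num_pos == 0 or cutoff == 0:
        return 0.0
    true_pos = false_pos = area = 0
    for _, y in ranked:
        if y:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # positives ranked above this false positive
            if false_pos == cutoff:
                break
    return area / (cutoff * num_pos)
```

With `n` set larger than the number of negatives, the same function returns the ordinary ROC area, so one routine covers both score columns.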
Experimental results for all competing methods on the remote homology detection task using the mismatch(5,1) kernel.
| | neighborhood (no clustering) | | | clustered neighborhood | | |
| dataset | ROC | ROC50 | p-value | ROC | ROC50 | p-value |
| PDB | | | | | | |
| full sequence | .9389 | .7203 | - | .9414 | .7230 | - |
| region | .9698 | | | .9705 | | |
| no tails (full seq.) | .9379 | .7287 | .9390 | .9378 | .7301 | .7605 |
| max length (full seq.) | .9457 | .7359 | .4725 | .9526 | .7491 | .3817 |
| Swiss-Prot | | | | | | |
| full sequence | .9253 | .6685 | - | .9378 | .7258 | - |
| region | .9757 | | | .9773 | | |
| no tails (full seq.) | .9290 | .6750 | .9813 | .9344 | .6874 | .5600 |
| max length (full seq.) | .9185 | .6094 | .1436 | .9223 | .6201 | .0279 |
| NR | | | | | | |
| full sequence | .9475 | .7233 | - | .9544 | .7510 | - |
| region | .9837 | | | .9874 | | |
| no tails (full seq.) | .9554 | .7083 | .7930 | .9584 | .7211 | .7501 |
| max length (full seq.) | .9508 | .7421 | .7578 | .9518 | .7613 | .9387 |
* p-value: signed-rank test on ROC50 scores against full sequence in the corresponding setting
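The p-values compare paired ROC50 scores of a method against the full-sequence baseline. A minimal sketch of such a comparison follows, assuming the test is the two-sided Wilcoxon signed-rank test. It uses a normal approximation without tie or continuity corrections, so it is only indicative for small samples, and the function name `signed_rank_test` is a hypothetical choice.

```python
from math import erfc, sqrt


def signed_rank_test(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    a, b: paired scores, e.g. per-task ROC50 of two methods.
    Returns (w_plus, w_minus, p_value).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    # Rank the absolute differences, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    std = sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / std
    return w_plus, w_minus, erfc(abs(z) / sqrt(2))
```

A small p-value indicates that one method scores consistently higher than the other across the paired tasks.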
Multi-class remote fold recognition using the triple(1,3) kernel
| Method | Error | Top-5 Error | Balanced Error | Top-5 Balanced Error | F1 | Top-5 F1 |
| full sequence | 50.81 | 17.92 | 71.95 | 27.80 | 28.92 | 73.93 |
| region | | | | | | |
| no tails (full seq.) | 48.21 | 19.71 | 70.42 | 33.37 | 30.91 | 73.39 |
| max. length (full seq.) | 51.63 | 23.13 | 76.96 | 39.21 | 26.85 | 66.99 |
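The error columns in the multi-class tables can be read with the help of a short sketch. It assumes that top-5 error counts an example as correct when its true fold appears among the five highest-ranked predictions, and that balanced error averages the per-class error rates so that small folds weigh as much as large ones. These definitions and the function names are assumptions made for illustration; F1 is omitted.

```python
from collections import defaultdict


def top_k_error(true_labels, ranked_predictions, k=1):
    """Percentage of examples whose true class is not in the top-k predictions."""
    misses = sum(
        1 for y, ranked in zip(true_labels, ranked_predictions)
        if y not in ranked[:k]
    )
    return 100.0 * misses / len(true_labels)


def balanced_top_k_error(true_labels, ranked_predictions, k=1):
    """Mean of the per-class top-k error rates, as a percentage."""
    per_class = defaultdict(lambda: [0, 0])  # class -> [misses, count]
    for y, ranked in zip(true_labels, ranked_predictions):
        per_class[y][0] += y not in ranked[:k]
        per_class[y][1] += 1
    rates = [misses / count for misses, count in per_class.values()]
    return 100.0 * sum(rates) / len(rates)
```

For example, a classifier that always ranks the majority class first can have a low plain error but a high balanced error, which is why both columns are reported.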
Multi-class remote fold recognition performance using the mismatch(5,1) kernel
| Method | Error | Top-5 Error | Balanced Error | Top-5 Balanced Error | F1 | Top-5 F1 |
| full sequence | 50.49 | 22.31 | 76.44 | 38.61 | 24.96 | 65.58 |
| region | | | | | | |
| no tails (full seq.) | 51.79 | 20.85 | 79.66 | 35.72 | 22.72 | 66.68 |
| max. length (full seq.) | 56.03 | 26.06 | 86.68 | 47.05 | 15.04 | 58.36 |
Multi-class remote fold recognition using the mismatch(5,2) kernel
| Method | Error | Top-5 Error | Balanced Error | Top-5 Balanced Error | F1 | Top-5 F1 |
| Without clustering | | | | | | |
| full seq. | 50.16 | 21.82 | 67.17 | 32.55 | 37.43 | 71.40 |
| region | 42.83 | 13.68 | 61.43 | | 40.36 | |
| no tails (full seq.) | 50.16 | 21.82 | 71.81 | 32.59 | 30.17 | 69.12 |
| max. length (full seq.) | 52.44 | 24.43 | 77.31 | 39.17 | 23.98 | 65.22 |
| With clustering | | | | | | |
| full seq. | 50.33 | 19.71 | 70.04 | 27.21 | 32.10 | 75.03 |
| region | | | | 22.82 | | 79.03 |
| no tails (full seq.) | 48.37 | 20.68 | 69.83 | 32.27 | 31.48 | 70.03 |
| max. length (full seq.) | 52.44 | 23.29 | 77.05 | 36.52 | 26.84 | 68.02 |
Comparison of performance against the state-of-the-art methods for remote homology detection
| | PDB | | Swiss-Prot | | NR | |
| | ROC | ROC50 | ROC | ROC50 | ROC | ROC50 |
| triple(1,3), full seq. | .9475 | .7582 | .9245 | .6908 | .9419 | .7327 |
| triple(1,3), region | .9708 | | .9752 | .8556 | .9824 | .8861 |
| triple(1,3), region, clustering | | .8246 | .9732 | | .9861 | |
| mismatch(5,1), full seq. | .9389 | .7203 | .9253 | .6685 | .9423 | .7233 |
| mismatch(5,1), region | .9698 | .8048 | .9757 | .8280 | .9837 | .8824 |
| mismatch(5,1), region, clustering | .9705 | .8038 | | .8414 | | .8885 |
| profile(5,7.5) | .9511 | .7205 | .9709 | .7914 | .9734 | .8151 |
Comparison with the state-of-the-art methods for multi-class remote fold recognition
| Method | Error | Top-5 Error | Balanced Error | Top-5 Balanced Error | F1 | Top-5 F1 |
| mismatch (full seq.) | 50.49 | 22.31 | 76.44 | 38.61 | 24.96 | 65.58 |
| triple (full seq.) | 50.81 | 17.92 | 71.95 | 27.80 | 28.92 | 73.93 |
| mismatch (region) | 44.79 | 13.36 | 67.26 | 25.40 | 33.17 | 77.45 |
| triple (region) | | | | | | |
| profile(5,7.5) | 45.11 | 15.80 | 71.27 | 31.55 | 32.34 | 75.68 |
| profile(5,7.5)† | 46.30 | 14.50 | 62.80 | 23.50 | - | - |
†: directly quoted from [12]
Figure 6. Ranking quality (0–1 top-.
Figure 7. Ranking quality (top-.
Figure 8. The importance of extracting only the relevant region from neighboring sequences (middle) for inferring sequence labels.
The number of neighbors (mean/median/maximum) and the number of observed features with and without clustering for the remote fold recognition task
| Method | Without Clustering | | With Clustering | |
| | # neighbors | # features | # neighbors | # features |
| full seq. | 135/99/490 | 192,378,952 | 64/41/356 | 120,990,413 |
| region | 64/41/356 | 34,807,209 | 50/26/352 | 28,738,521 |
| no tails (full seq.) | 75/17/402 | 57,575,176 | 23/11/325 | 29,649,870 |
| max. length (full seq.) | 70/16/431 | 39,915,003 | 22/12/279 | 14,634,511 |
Running time for kernel matrix computation (3860 × 3860), [s]
| Method | mismatch(5,1) | mismatch(5,2) | triple(1,3) |
| full seq. | 12,084 | 13,593 | 153 |
| region | 2,624 | 3,195 | 73 |
| region+clustering | 2,412 | 2,998 | 64 |