Huzefa Rangwala, Christopher Kauffman, George Karypis.
Abstract
BACKGROUND: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models.
Year: 2009 PMID: 20028521 PMCID: PMC2805646 DOI: 10.1186/1471-2105-10-439
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. (a) Example input sequence along with its PSI-BLAST profile matrix of dimensions n × 20, where n is the sequence length. (b) Example wmer with w = 3, giving a window of length seven (2w + 1), with features extracted from the original PSI-BLAST matrix. (c) Encoded vector of length 7 × 20 formed by linearizing the sub-matrix. (d) Flexible encoding in which the three central residues use the finer representation and the two residues flanking them on each side use a coarser representation, an averaging statistic. The length of this vector is 5 × 20.
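The two encodings described in the caption can be illustrated with a minimal sketch. The function names (`window_rows`, `encode_wmer`, `encode_flexible`), the parameter f as the half-width of the finely represented center, and the zero-padding at sequence ends are assumptions made for illustration, not details taken from svmPRAT itself.

```python
from typing import List

Matrix = List[List[float]]


def window_rows(profile: Matrix, i: int, w: int) -> Matrix:
    """Return the 2w+1 profile rows centred on residue i.

    Positions that fall outside the sequence are filled with zero rows
    (an assumption; the caption does not say how ends are handled).
    """
    width = len(profile[0])
    rows = []
    for j in range(i - w, i + w + 1):
        if 0 <= j < len(profile):
            rows.append(list(profile[j]))
        else:
            rows.append([0.0] * width)
    return rows


def encode_wmer(profile: Matrix, i: int, w: int) -> List[float]:
    """Linearize the (2w+1) x 20 sub-matrix into one vector (panel c)."""
    return [v for row in window_rows(profile, i, w) for v in row]


def mean_rows(rows: Matrix) -> List[float]:
    """Column-wise average of a block of rows (the coarse statistic)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def encode_flexible(profile: Matrix, i: int, w: int, f: int) -> List[float]:
    """Keep the central 2f+1 rows as-is and average each flank (panel d)."""
    rows = window_rows(profile, i, w)
    left, centre, right = rows[: w - f], rows[w - f : w + f + 1], rows[w + f + 1 :]
    blocks: Matrix = []
    if left:
        blocks.append(mean_rows(left))
    blocks.extend(centre)
    if right:
        blocks.append(mean_rows(right))
    return [v for row in blocks for v in row]


if __name__ == "__main__":
    # Toy 10 x 20 "profile": row r holds the value r in every column.
    toy = [[float(r)] * 20 for r in range(10)]
    print(len(encode_wmer(toy, 5, 3)))         # 7 rows of 20 values
    print(len(encode_flexible(toy, 5, 3, 1)))  # 5 rows of 20 values
```

With w = 3 and f = 1 the flexible vector has 3 fine rows plus one averaged row per flank, which matches the 5 × 20 length given in the caption; with f = w both flanks are empty and the two encodings coincide.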
Problem-specific Datasets.
| Problem | Source | Type | #C | #Seq | #Res | #CV | % |
|---|---|---|---|---|---|---|---|
| Disorder Prediction | DisPro [ | Binary | 2 | 723 | 215612 | 10 | 30 |
| Protein-DNA Site | DISIS [ | Binary | 2 | 693 | 127240 | 3 | 20 |
| Residue-wise Contact | SVM [ | Regression | ∞ | 680 | 120421 | 15 | 40 |
| Local Structure | Profnet [ | Multiclass | 16 | 1600 | 286238 | 3 | 40 |
#C, #Seq, #Res, #CV, and % denote the number of classes, sequences, residues, and cross-validation folds, and the maximum pairwise sequence identity between the sequences, respectively. ∞ represents the regression problem.
Classification Performance on the Disorder Dataset.
| 3 | 0.775 | 0.312 | 0.350 | - | - | - | - | - | - | - | - | ||
| 7 | 0.815 | 0.366 | 0.380 | 0.816 | 0.384 | 0.816 | 0.383 | - | - | - | - | ||
| 11 | 0.821 | 0.378 | 0.826 | 0.391 | 0.396 | 0.826 | 0.400 | 0.824 | 0.404 | 0.823 | 0.403 | ||
| 13 | 0.823 | 0.384 | 0.829 | 0.398 | 0.832* | 0.405 | 0.830 | 0.404 | 0.828 | 0.407 | 0.826 | 0.409 | |
| 3 | 0.370 | 0.811 | 0.369 | - | - | - | - | - | - | - | - | ||
| 7 | 0.845 | 0.442 | 0.450 | 0.848 | 0.445 | 0.845 | 0.442 | - | - | - | - | ||
| 11 | 0.848 | 0.464 | 0.855 | 0.478 | 0.858 | 0.482 | 0.480 | 0.855 | 0.470 | 0.853 | 0.468 | ||
| 13 | 0.848 | 0.473 | 0.855 | 0.484 | 0.859 | 0.490 | 0.861* | 0.492 | 0.860 | 0.487 | 0.857 | 0.478 | |
| 3 | 0.815 | 0.377 | 0.379 | - | - | - | - | - | - | - | - | ||
| 7 | 0.847 | 0.446 | 0.461 | 0.852 | 0.454 | 0.851 | 0.454 | - | - | - | - | ||
| 11 | 0.848 | 0.469 | 0.856 | 0.482 | 0.860 | 0.491 | 0.491 | 0.861 | 0.485 | 0.862 | 0.485 | ||
| 13 | 0.847 | 0.473 | 0.856 | 0.485 | 0.861 | 0.491 | 0.864 | 0.495 | 0.865* | 0.494 | 0.864 | 0.492 | |
| 3 | 0.836 | 0.418 | 0.423 | - | - | - | - | - | - | - | - | ||
| 7 | 0.860 | 0.472 | 0.476 | 0.860 | 0.473 | 0.859 | 0.468 | - | - | - | - | ||
| 11 | 0.861 | 0.490 | 0.867 | 0.496 | 0.498 | 0.868 | 0.495 | 0.866 | 0.488 | 0.865 | 0.485 | ||
| 13 | 0.860 | 0.497 | 0.867 | 0.503 | 0.870 | 0.503 | 0.871* | 0.503 | 0.870 | 0.498 | 0.868 | 0.492 | |
| 3 | 0.428 | 0.841 | 0.428 | - | - | - | - | - | - | - | - | ||
| 7 | 0.869 | 0.497 | 0.499 | 0.869 | 0.494 | 0.867 | 0.489 | - | - | - | - | ||
| 11 | 0.871 | 0.516 | 0.875 | 0.518 | 0.517 | 0.877 | 0.512 | 0.874 | 0.508 | 0.873 | 0.507 | ||
| 13 | 0.869 | 0.519 | 0.875 | 0.522 | 0.878 | 0.521 | 0.879** | 0.519 | 0.879 | 0.518 | 0.876 | 0.514 | |
DISPro [7] reports a ROC score of 0.878. The numbers in bold show the best models for a fixed w parameter, as measured by ROC. 𝒫, ℬ, and 𝒮 represent the PSI-BLAST profile, BLOSUM62, and YASSPP scoring matrices, respectively. soe, rbf, and lin represent the three different kernels studied using W as the base kernel. * denotes the best classification results in the sub-tables, and ** denotes the best classification results achieved on this dataset. For the best model we report a Q2 accuracy of 84.60% with an se rate of 0.33.
Runtime Performance of svmPRAT on the Disorder Dataset (in seconds).
| #KER (w = f = 11) | NO (w = f = 11) | YES (w = f = 11) | SP (w = f = 11) | #KER (w = f = 13) | NO (w = f = 13) | YES (w = f = 13) | SP (w = f = 13) | #KER (w = f = 15) | NO (w = f = 15) | YES (w = f = 15) | SP (w = f = 15) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.93e+10 | 83993 | 45025 | 1.86 | 1.92e+10 | 95098 | 53377 | 1.78 | 1.91e+10 | 106565 | 54994 | 1.93 |
| 1.91e+10 | 79623 | 36933 | 2.15 | 1.88e+10 | 90715 | 39237 | 2.31 | 1.87e+10 | 91809 | 39368 | 2.33 |
| 2.01e+10 | 99501 | 56894 | 1.75 | 2.05e+10 | 112863 | 65035 | 1.73 | 2.04e+10 | 125563 | 69919 | 1.75 |
The runtime performance of svmPRAT was benchmarked for learning a classification model on a 64-bit Intel Xeon CPU 2.33 GHz processor. #KER denotes the number of kernel evaluations for training the SVM model. NO denotes runtime in seconds when the CBLAS library was not used, YES denotes the runtime in seconds when the CBLAS library was used, and SP denotes the speedup achieved using the CBLAS library.
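As a worked check of the speedup column, the sketch below assumes SP is the ratio of the runtime without CBLAS to the runtime with CBLAS. The function name `speedup` is hypothetical; the timings are the w = f = 11 values from the table.

```python
def speedup(seconds_without_cblas: float, seconds_with_cblas: float) -> float:
    """Ratio of the NO runtime to the YES runtime (assumed definition of SP)."""
    if seconds_with_cblas <= 0:
        raise ValueError("runtime with CBLAS must be positive")
    return seconds_without_cblas / seconds_with_cblas


# (NO, YES, reported SP) for the three rows at w = f = 11.
timings_w11 = [
    (83993, 45025, 1.86),
    (79623, 36933, 2.15),
    (99501, 56894, 1.75),
]

if __name__ == "__main__":
    for no, yes, reported in timings_w11:
        ratio = speedup(no, yes)
        print(f"NO={no} s  YES={yes} s  NO/YES={ratio:.3f}  reported SP={reported}")
```

The computed ratios agree with the reported SP values to within 0.01, which suggests the reported figures are this ratio up to rounding or truncation.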
Disorder Prediction Performance at CASP8.
| Method | ROC | Q2 | S |
|---|---|---|---|
| MULTICOM | 0.92 | 0.81 | 0.61 |
| CBRC-DP_DR | 0.91 | 0.81 | 0.62 |
| GS-MetaServer2 | 0.91 | 0.83 | 0.66 |
| McGuffin | 0.91 | 0.82 | 0.64 |
| DISOclust | 0.91 | 0.82 | 0.64 |
| GeneSilicoMeta | 0.90 | 0.83 | 0.655 |
| Poodle | 0.90 | 0.80 | 0.61 |
| CaspIta | 0.89 | 0.78 | 0.571 |
| fais-server | 0.89 | 0.78 | 0.56 |
| MULTICOM-CMFR | 0.89 | 0.82 | 0.64 |
| MARINER* | 0.88 | 0.80 | 0.61 |
* MARINER used svmPRAT to train models for disorder prediction for its participation in CASP8, using the kernel with w = f = 11. We used the 723 sequences with disordered residues from the DisPro [7] dataset. The results are the official results from the CASP organizers and were presented by Dr. Joel Sussman at the Weizmann Institute of Science. Q2 denotes the 2-state accuracy of the prediction, and S is a weighted accuracy rewarding the prediction of disordered residues.
Residue-wise Contact Order Estimation Performance.
| 3 | 0.704 | 0.696 | 0.692 | - | - | - | - | - | - | - | - | ||
| 7 | 0.712 | 0.683 | 0.719 | 0.677 | 0.672 | 0.722 | 0.672 | - | - | - | - | ||
| 11 | 0.711 | 0.681 | 0.720 | 0.673 | 0.667 | 0.725 | 0.666 | 0.724 | 0.666 | 0.722 | 0.667 | ||
| 15 | 0.709 | 0.680 | 0.719 | 0.672 | 0.726** | 0.665 | 0.726 | 0.664 | 0.725 | 0.664 | 0.723 | 0.664 | |
CC and RMSE denote the average correlation coefficient and RMSE values, respectively. The numbers in bold show the best models as measured by CC for a fixed w parameter. 𝒫 and 𝒮 represent the PSI-BLAST profile and YASSPP scoring matrices, respectively. soe, rbf, and lin represent the three different kernels studied using W as the base kernel. * denotes the best regression results in the sub-tables, and ** denotes the best regression results achieved on this dataset. For the best results the se rate for the CC values is 0.003. The published results [15] use the default rbf kernel to give CC = 0.600 and RMSE = 0.78.
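For readers unfamiliar with the two regression metrics, the sketch below computes the Pearson correlation coefficient and the RMSE for one sequence and then averages them over sequences. Averaging per sequence is an assumption about how the reported means were formed, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson_cc(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(predicted)
    if n != len(observed) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length series."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty series of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def average_metrics(
    pairs: List[Tuple[Sequence[float], Sequence[float]]]
) -> Tuple[float, float]:
    """Mean CC and mean RMSE over (predicted, observed) pairs, one per sequence."""
    ccs = [pearson_cc(p, o) for p, o in pairs]
    errors = [rmse(p, o) for p, o in pairs]
    return sum(ccs) / len(ccs), sum(errors) / len(errors)


if __name__ == "__main__":
    toy_pairs = [
        ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
        ([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    ]
    mean_cc, mean_rmse = average_metrics(toy_pairs)
    print(f"mean CC = {mean_cc:.3f}, mean RMSE = {mean_rmse:.3f}")
```

A higher CC and a lower RMSE both indicate better agreement, which is why the two columns in the table move in opposite directions as the models improve.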
Classification Performance on the Protein-DNA Interaction Site Prediction.
| 0.463 | 0.469 | 0.748 | 0.452 | |||
| 0.465 | 0.462 | 0.759 | 0.466 | |||
| 0.466 | 0.468 | 0.763 | 0.468 | |||
The numbers in bold show the best models for a fixed w parameter, as measured by ROC. 𝒫 and 𝒮 represent the PSI-BLAST profile and YASSPP scoring matrices, respectively. soe, rbf, and lin represent the three different kernels studied using W as the base kernel. * denotes the best classification results in the sub-tables, and ** denotes the best classification results achieved on this dataset. For the best model we report a Q2 accuracy of 83.0% with an se rate of 0.34.
Classification Performance on the Local Structure Alphabet Dataset.
| 0.82 | 64.9 | 0.81 | 64.7 | 0.81 | 64.2 | |
| 0.83 | 67.3 | 0.82 | 67.7 | 0.82 | 67.7 | |
| 0.84 | 66.4 | 0.84 | 66.9 | 0.83 | 67.2 | |
| 0.85 | 68.0 | 0.84 | 68.5 | 0.83 | 68.9** | |
Setting w = f gave the best results when tested on a few sample points; because this problem is computationally expensive, we did not test a wider set of parameters. ** denotes the best-scoring model based on the Q16 scores. For this best model the se rate is 0.21.