Shan Gao, Shuo Xu, Yaping Fang, Jianwen Fang.
Abstract
BACKGROUND: Identification of phosphorylation sites by computational methods is becoming increasingly important because it reduces the need for labor-intensive and costly experiments and can improve our understanding of the common properties and underlying mechanisms of protein phosphorylation.
Year: 2012 PMID: 22759584 PMCID: PMC3380725 DOI: 10.1186/1477-5956-10-S1-S7
Source DB: PubMed Journal: Proteome Sci ISSN: 1477-5956 Impact factor: 2.480
Average classification accuracy of different classifiers with 560 features
| window size | LS-SVMs | MTL-Feat3 | MTLS-SVMs |
|---|---|---|---|
| 3 | 0.7381 | 0.727 | 0.728 |
| 5 | 0.754 | 0.7462 | 0.7459 |
| 7 | 0.7611 | 0.7595 | 0.7595 |
| 9 | 0.7498 | 0.741 | 0.74 |
| 11 | 0.7504 | 0.7455 | 0.7478 |
| 13 | 0.7491 | 0.7403 | 0.7416 |
| 15 | 0.7439 | 0.7355 | 0.7394 |
| 17 | 0.7439 | 0.729 | 0.7316 |
| 19 | 0.7325 | 0.7251 | 0.727 |
| 21 | 0.7325 | 0.7192 | 0.7176 |
| opt* | 0.7939 | 0.791 | 0.7936 |
Five-fold cross-validation and grid fitting of parameters are used to estimate the performance of all classifiers. *The optimized window sizes (3, 17, 7 and 9) for the 4 kinase family datasets are used to build the classifiers.
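As a quick check on the table above, the following minimal sketch picks, for each classifier, the fixed window size with the highest average accuracy. The numbers are copied from the table; the helper name `best_window` is an illustrative choice and does not come from the paper.

```python
# Average accuracies copied from the table above (window sizes 3 to 21).
WINDOW_SIZES = [3, 5, 7, 9, 11, 13, 15, 17, 19, 21]
ACCURACY = {
    "LS-SVMs": [0.7381, 0.754, 0.7611, 0.7498, 0.7504,
                0.7491, 0.7439, 0.7439, 0.7325, 0.7325],
    "MTL-Feat3": [0.727, 0.7462, 0.7595, 0.741, 0.7455,
                  0.7403, 0.7355, 0.729, 0.7251, 0.7192],
    "MTLS-SVMs": [0.728, 0.7459, 0.7595, 0.74, 0.7478,
                  0.7416, 0.7394, 0.7316, 0.727, 0.7176],
}


def best_window(accuracies, window_sizes=WINDOW_SIZES):
    """Return (window_size, accuracy) for the highest accuracy in a column."""
    best_index = max(range(len(accuracies)), key=lambda i: accuracies[i])
    return window_sizes[best_index], accuracies[best_index]


best = {name: best_window(values) for name, values in ACCURACY.items()}
for name, (window, accuracy) in best.items():
    print(f"{name}: window {window}, accuracy {accuracy}")
```

All three classifiers peak at window size 7 among the fixed window sizes, and the `opt` row, which uses a separate window size per kinase family, is higher still for each of them.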
Classification accuracy of different classifiers with 560 features for 4 kinase datasets
| Window size | CDK SVM-rbf | CDK SVM-linear | CDK RF | CK2 SVM-rbf | CK2 SVM-linear | CK2 RF | PKA SVM-rbf | PKA SVM-linear | PKA RF | PKC SVM-rbf | PKC SVM-linear | PKC RF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.8598 | | 0.83 | 0.7783 | 0.7796 | 0.7326 | 0.6656 | 0.6678 | 0.6289 | 0.6723 | 0.6758 | 0.6113 |
| 5 | 0.8013 | 0.8122 | 0.7579 | 0.806 | 0.8112 | 0.7935 | 0.7156 | 0.7178 | 0.7111 | 0.7173 | 0.724 | 0.7069 |
| 7 | 0.7578 | 0.7581 | 0.7455 | 0.8655 | 0.8724 | 0.8599 | 0.7567 | | 0.7589 | 0.7242 | 0.7196 | 0.7253 |
| 9 | 0.7305 | 0.7077 | 0.7171 | 0.8706 | 0.8688 | 0.8548 | 0.7456 | 0.7489 | 0.7622 | 0.7253 | | 0.7183 |
| 11 | 0.7223 | 0.724 | 0.7226 | 0.8654 | 0.8617 | 0.8619 | 0.7433 | 0.7478 | 0.7511 | 0.7161 | 0.7298 | 0.7023 |
| 13 | 0.721 | 0.7103 | 0.7049 | 0.867 | 0.8705 | 0.874 | 0.7367 | 0.7378 | 0.7311 | 0.7287 | 0.7299 | 0.7299 |
| 15 | 0.7211 | 0.7023 | 0.7049 | 0.8724 | 0.8723 | 0.8792 | 0.7322 | 0.7267 | 0.7156 | 0.7264 | 0.7286 | 0.7253 |
| 17 | 0.7087 | 0.7038 | 0.717 | 0.8812 | | 0.874 | 0.7211 | 0.7167 | 0.7089 | 0.7253 | 0.7286 | 0.7252 |
| 19 | 0.7142 | 0.6995 | 0.7049 | 0.8759 | 0.8844 | 0.8757 | 0.72 | 0.6978 | 0.7167 | 0.7173 | 0.7194 | 0.7194 |
| 21 | 0.7263 | 0.6887 | 0.72 | 0.8759 | 0.8775 | 0.8739 | 0.7133 | 0.7022 | 0.7044 | 0.7082 | 0.7309 | 0.6999 |
*Best performance for each kinase family by SVM with a linear kernel; the corresponding window size is selected as the final optimized window size for that kinase family.
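To make the notion of window size concrete, here is a minimal sketch of cutting a fixed-size window of residues centred on a candidate site. The function name `extract_window`, the padding character `X` for positions past the sequence ends, and the example peptide are all assumptions made for illustration; the source does not say how sequence termini are handled.

```python
def extract_window(sequence, site_index, window_size, pad="X"):
    """Return the window_size residues centred on sequence[site_index].

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention, not taken from the source).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd so the site sits in the centre")
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index is outside the sequence")
    half = window_size // 2
    residues = []
    for position in range(site_index - half, site_index + half + 1):
        if 0 <= position < len(sequence):
            residues.append(sequence[position])
        else:
            residues.append(pad)
    return "".join(residues)


# Made-up peptide for illustration only; index 4 is a serine (S).
peptide = "MKRASPEKL"
print(extract_window(peptide, 4, 7))  # KRASPEK
print(extract_window(peptide, 1, 5))  # XMKRA
```

Under this reading, the window sizes 3 to 21 in the table correspond to 1 to 10 flanking residues on each side of the site.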
Figure 1. Performance with different feature numbers. *Window size 7 across 4 kinase family datasets. #Optimized window sizes (3, 17, 7 and 9) across 4 kinase family datasets.
Classification Accuracy of different classifiers with selected features
| Methods | Window size | Feature number | aveAc |
|---|---|---|---|
| MetaPred | NA | NA | |
| LS-SVMs | 7 | 560 | 0.7611 |
| LS-SVMs | opt | 560 | 0.7939 |
| *MT-Feat3 | 7 | 25 | 0.7605 |
| *MT-Feat3 | opt | 23 | 0.7972 |
| MTLS-SVMs | 7 | 560 | 0.7595 |
| MTLS-SVMs | opt | 560 | 0.7936 |
| *MTLS-SVMs | 7 | 20 | 0.7621 |
| *MTLS-SVMs | opt | 26 | |
| #MTLS-SVMs | 7 | 12 | 0.7455 |
| #MTLS-SVMs | opt | 18 | |
Five-fold cross-validation and grid fitting of parameters are used to estimate the performance of all the classifiers.
*Using features selected by the MT-Feat3 method.
#Using features filtered by the Metric multi-dimensional scaling method.
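The evaluation protocol named in the note above, five-fold cross-validation combined with a grid over parameter values, can be sketched as follows. This is only an illustration under stated assumptions: a one-dimensional k-nearest-neighbour classifier stands in for the paper's LS-SVM-based models, the grid is over the number of neighbours rather than any SVM parameter, and the data are synthetic. None of the function names come from the paper.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_x, train_y, query, n_neighbors):
    """Majority label among the n_neighbors training points closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:n_neighbors])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(xs, ys, n_neighbors, k=5, seed=0):
    """Mean accuracy over k folds, each held out once for testing."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct = sum(
            knn_predict(train_x, train_y, xs[i], n_neighbors) == ys[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(xs, ys, grid, k=5):
    """Score every grid value by cross-validation and return the best one."""
    results = {value: cross_val_accuracy(xs, ys, value, k) for value in grid}
    best_value = max(results, key=results.get)
    return best_value, results


# Synthetic two-class data (not from the paper): two overlapping Gaussians.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(50)] + [rng.gauss(2.0, 1.0) for _ in range(50)]
ys = [0] * 50 + [1] * 50

best_k, results = grid_search(xs, ys, grid=[1, 3, 5, 7])
print("best n_neighbors:", best_k)
print("mean accuracy per grid value:", results)
```

Every grid value is scored on the same five folds because the fold split uses a fixed seed, so differences between grid values are not caused by different splits.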
Selected features by MT-Feat3
| Feature category | Subset 1 (20 features) | Subset 2 (26 features) |
|---|---|---|
| Backbone electrostatic interactions | AVBF000101*# | AVBF000101*# |
| | AVBF000102*# | AVBF000102*# |
| | AVBF000104*# | AVBF000104*# |
| | AVBF000105*# | AVBF000105*# |
| | AVBF000106* | AVBF000106* |
| | AVBF000107*# | AVBF000107*# |
| | AVBF000108*# | AVBF000108*# |
| | AVBF000109* | AVBF000109*# |
| Hydrophobicity | ROSM880104* | ROSM880104* |
| | ROSM880105*# | ROSM880105* |
| Apparent partition energies | GUYH850103* | GUYH850103* |
| Negative charge | FAUJ880112* | FAUJ880112* |
| Fractional occurrence in left helix regions | RACS820103* | RACS820103* |
| Side chain conformation | YANJ020101* | YANJ020101* |
| Others | CHAM830108 | SNEP660101 |
| | PALJ810113 | BUNA790103 |
| | WILM950104 | CRAJ730101 |
| | BURA740101 | TANS770102 |
| | JOND920102 | BULH740101 |
| | AVBF000103# | GEIM800103 |
| | | PALJ810107 |
| | | GEIM800105 |
| | | VELV850101 |
| | | COSI940101# |
| | | ISOY800107 |
| | | CHOP780211 |
Features in Subset 1 are selected by MT-Feat3 with window size 7. Features in Subset 2 are selected by MT-Feat3 with optimized window sizes across 4 kinase family datasets.
*Common features shared by subset 1 (20 features) and subset 2 (26 features).
# Features filtered by the Metric multi-dimensional scaling method.
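The relationships described in these notes can be checked with plain set operations. In the sketch below the feature identifiers and the `#` marks are copied from the table; the rows that carry a single identifier are read as belonging to Subset 2, an assumption that fits the stated count of 26 features. The variable names are illustrative.

```python
subset_1 = {
    "AVBF000101", "AVBF000102", "AVBF000104", "AVBF000105", "AVBF000106",
    "AVBF000107", "AVBF000108", "AVBF000109", "ROSM880104", "ROSM880105",
    "GUYH850103", "FAUJ880112", "RACS820103", "YANJ020101", "CHAM830108",
    "PALJ810113", "WILM950104", "BURA740101", "JOND920102", "AVBF000103",
}
subset_2 = {
    "AVBF000101", "AVBF000102", "AVBF000104", "AVBF000105", "AVBF000106",
    "AVBF000107", "AVBF000108", "AVBF000109", "ROSM880104", "ROSM880105",
    "GUYH850103", "FAUJ880112", "RACS820103", "YANJ020101", "SNEP660101",
    "BUNA790103", "CRAJ730101", "TANS770102", "BULH740101", "GEIM800103",
    "PALJ810107", "GEIM800105", "VELV850101", "COSI940101", "ISOY800107",
    "CHOP780211",
}

# Features marked "#" in each column (removed by metric multi-dimensional scaling).
filtered_1 = {
    "AVBF000101", "AVBF000102", "AVBF000104", "AVBF000105",
    "AVBF000107", "AVBF000108", "ROSM880105", "AVBF000103",
}
filtered_2 = {
    "AVBF000101", "AVBF000102", "AVBF000104", "AVBF000105",
    "AVBF000107", "AVBF000108", "AVBF000109", "COSI940101",
}

common = subset_1 & subset_2      # features marked "*"
subset_3 = subset_1 - filtered_1  # Subset 1 after filtering
subset_4 = subset_2 - filtered_2  # Subset 2 after filtering

print(len(subset_1), len(subset_2), len(common))  # 20 26 14
print(len(subset_3), len(subset_4))               # 12 18
```

The filtered sizes, 12 and 18, agree with the feature numbers listed for the #MTLS-SVMs rows in the table of classification accuracy with selected features.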
Figure 2. Two-dimensional map by metric multi-dimensional scaling method. (A) Subset 1 selected by MT-Feat3 with window size 7. Redundant features (in circles) are removed, leading to subset 3. (B) Subset 2 selected by MT-Feat3 with optimized window sizes. Redundant features (in circles) are removed, leading to feature subset 4. All the removed features are marked # in Table 4.