Yong Gao, Weilin Hao, Jing Gu, Diwei Liu, Chao Fan, Zhigang Chen, Lei Deng.
Abstract
BACKGROUND: Post-translational modifications (PTMs) occur on almost all proteins and often strongly affect the functions of modified proteins. Phosphorylation is a crucial PTM mechanism with important regulatory functions in biological systems. Identifying the potential phosphorylation sites of a target protein may increase our understanding of the molecular processes in which it takes part.
Keywords: Ensemble learning; Phosphorylation sites; Structural neighborhood properties
Year: 2016 PMID: 27437197 PMCID: PMC4943517 DOI: 10.1186/s40709-016-0042-y
Source DB: PubMed Journal: J Biol Res (Thessalon) ISSN: 1790-045X Impact factor: 1.889
Performance of the two-step feature selection method
| Kinase family | Feature set | AUC | Accuracy | Recall | Specificity | CC | F1 |
|---|---|---|---|---|---|---|---|
| CK2 | Optimal features | 0.877 | 0.963 | 0.433 | 0.992 | 0.433 | 0.429 |
| CK2 | All features | 0.842 | 0.954 | 0.350 | 0.986 | 0.370 | 0.366 |
| MAPK | Optimal features | 0.839 | 0.952 | 0.483 | 0.973 | 0.480 | 0.480 |
| MAPK | All features | 0.833 | 0.959 | 0.400 | 0.985 | 0.424 | 0.423 |
| PKA | Optimal features | 0.858 | 0.948 | 0.375 | 0.980 | 0.426 | 0.432 |
| PKA | All features | 0.846 | 0.926 | 0.335 | 0.959 | 0.279 | 0.310 |
| PKC | Optimal features | 0.857 | 0.952 | 0.303 | 0.985 | 0.396 | 0.363 |
| PKC | All features | 0.821 | 0.948 | 0.226 | 0.984 | 0.282 | 0.274 |
| SRC | Optimal features | 0.900 | 0.951 | 0.558 | 0.973 | 0.510 | 0.499 |
| SRC | All features | 0.890 | 0.946 | 0.241 | 0.985 | 0.317 | 0.294 |
Fig. 1 The framework of PredPhos. Phosphorylation sites in the training set were mapped to Protein Data Bank (PDB) entries using BLAST. Each residue is encoded using 51 site features, 51 Euclidean neighborhood features and 51 Voronoi neighborhood features. The first feature selection step uses a random forest algorithm: features are ranked in descending order by Z-score and the top 80 are selected. The second step is a wrapper-based feature selection: features are evaluated by tenfold cross-validation with the SVM algorithm, and redundant features are removed by sequential backward elimination. Finally, an ensemble of n classifiers is built using different subsets, and the final result is determined by majority vote among the outputs of the n classifiers.
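The selection and voting steps named in the Fig. 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the `evaluate` callback (a stand-in for the tenfold cross-validated SVM score), the greedy stopping rule and the tie-breaking rule in the vote are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


def top_k_features(scores: Dict[str, float], k: int) -> List[str]:
    """Step 1 (sketch): rank features by importance score, descending, keep the top k."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:k]


def backward_elimination(
    features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> List[str]:
    """Step 2 (sketch): drop a feature whenever removing it does not lower the score.

    `evaluate` is a hypothetical callback; in the described framework it would
    correspond to a cross-validated classifier score on the given feature subset.
    """
    selected = list(features)
    best = evaluate(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for name in list(selected):
            candidate = [f for f in selected if f != name]
            score = evaluate(candidate)
            if score >= best:
                # Removing `name` did not hurt, so treat it as redundant.
                selected, best = candidate, score
                improved = True
                break
    return selected


def majority_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine 0/1 predictions from n classifiers; ties count as negative (assumption)."""
    n = len(predictions)
    return [1 if 2 * sum(votes) > n else 0 for votes in zip(*predictions)]
```

In this sketch, `top_k_features` would be called with k = 80 on the random forest scores, `backward_elimination` would prune that list, and `majority_vote` would combine the outputs of the classifiers trained on different subsets.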
Performance comparison on the independent test dataset
| Tool | Kinase family | Sensitivity (Sn) | Specificity (Sp) | Precision (Pre) | CC | F1 |
|---|---|---|---|---|---|---|
| PPSP | PKA | 1.000 | 0.540 | 0.096 | 0.228 | 0.176 |
| PPSP | PKC | 0.400 | 0.527 | 0.031 | −0.028 | 0.058 |
| PPSP | CK2 | 0.500 | 0.390 | 0.038 | −0.047 | 0.071 |
| PPSP | SRC | 0.538 | 0.859 | 0.286 | 0.304 | 0.373 |
| PPSP | MAPK | 0.571 | 0.380 | 0.043 | −0.021 | 0.081 |
| Kinasephos | PKA | 0.125 | 0.877 | 0.048 | 0.001 | 0.069 |
| Kinasephos | PKC | 0.200 | 0.863 | 0.053 | 0.034 | 0.083 |
| Kinasephos | CK2 | 0.500 | 0.976 | 0.500 | 0.476 | 0.500 |
| Kinasephos | SRC | 0.115 | 0.960 | 0.231 | 0.103 | 0.154 |
| Kinasephos | MAPK | 0.571 | 0.937 | 0.308 | 0.381 | 0.400 |
| NetphosK | PKA | 0.375 | 0.914 | 0.176 | 0.204 | 0.240 |
| NetphosK | PKC | 0.200 | 0.802 | 0.037 | 0.001 | 0.063 |
| NetphosK | CK2 | 0.500 | 0.805 | 0.111 | 0.158 | 0.182 |
| NetphosK | SRC | 0.038 | 1.000 | 1.000 | 0.187 | 0.074 |
| NetphosK | MAPK | 0.286 | 0.979 | 0.400 | 0.311 | 0.333 |
| GPS | PKA | 0.500 | 0.871 | 0.160 | 0.222 | 0.242 |
| GPS | PKC | 0.600 | 0.695 | 0.070 | 0.119 | 0.125 |
| GPS | CK2 | 0.500 | 0.854 | 0.143 | 0.202 | 0.222 |
| GPS | SRC | 0.462 | 0.871 | 0.273 | 0.265 | 0.343 |
| GPS | MAPK | 0.571 | 0.789 | 0.118 | 0.182 | 0.195 |
| PredPhos | PKA | 0.571 | 0.779 | 0.100 | 0.164 | 0.170 |
| PredPhos | PKC | 0.824 | 0.870 | 0.452 | 0.544 | 0.583 |
| PredPhos | CK2 | 1.000 | 0.659 | 0.176 | 0.341 | 0.300 |
| PredPhos | SRC | 0.789 | 0.802 | 0.234 | 0.356 | 0.361 |
| PredPhos | MAPK | 0.375 | 0.986 | 0.600 | 0.452 | 0.462 |
Fig. 2 Performance comparison with and without feature selection. The performances for PKA, PKC, CK2, SRC and MAPK are shown in panels a, b, c, d and e, respectively
Fig. 3 The proportions of residue-based features, Euclidean features and Voronoi features in the top 10 list ranked by the two-step feature selection method for the 5 kinase families
Performance comparison on the non-kinase-specific dataset
| Methods | Accuracy | Recall | Specificity | Precision | CC | F1 |
|---|---|---|---|---|---|---|
| Netphos | 0.66 | 0.51 | 0.68 | 0.14 | 0.11 | 0.21 |
| PredPhos | 0.77 | 0.60 | 0.82 | 0.38 | 0.23 | 0.45 |
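The column headings in the performance tables correspond to standard binary-classification metrics. The sketch below shows how they are usually computed from confusion-matrix counts. It assumes that CC denotes the Matthews correlation coefficient, which the tables themselves do not state. The function name and the convention of returning 0.0 for an undefined ratio are illustrative choices.

```python
import math
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute common metrics from true/false positive/negative counts."""

    def ratio(num: float, den: float) -> float:
        # Return 0.0 when the denominator is zero (a convention, not a standard).
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)          # sensitivity (Sn)
    specificity = ratio(tn, tn + fp)     # Sp
    precision = ratio(tp, tp + fp)       # Pre
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient, assumed here to be what "CC" reports.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = ratio(tp * tn - fp * fn, mcc_den)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "cc": cc,
        "f1": f1,
    }
```

On heavily imbalanced data, where negatives far outnumber positives, accuracy and specificity can stay high even when recall is low, so CC and F1 are often the more informative columns to compare.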