Paul D Yoo, Yung Shwen Ho, Bing Bing Zhou, Albert Y Zomaya.
Abstract
BACKGROUND: Post-translational modifications have a substantial influence on the structure and function of proteins. Phosphorylation is one of the most common post-translational modifications of intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for understanding diverse cellular signalling processes in both humans and animals. In this study, we propose a new machine-learning-based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites in a target sequence. The proposed method proves to be more accurate and exhibits a much more stable predictive performance than existing phosphorylation site predictors.
Year: 2008 PMID: 18541042 PMCID: PMC2442102 DOI: 10.1186/1471-2105-9-272
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Comparison of encoding schemes.
| Models | Accuracy (Ac) | Sensitivity (Sn) | Specificity (Sp) | Correlation-Coefficient (Cc) | Type I ER | Type II ER |
| CompactPSSM | 0.73 | 0.71 | 0.75 | 0.46 | 0.13 | 0.14 |
| PSSM | 0.70 | 0.69 | 0.71 | 0.40 | 0.15 | 0.15 |
| OE | 0.58 | 0.60 | 0.56 | 0.16 | 0.23 | 0.19 |
| Hydrophobicity | 0.58 | 0.56 | 0.61 | 0.17 | 0.19 | 0.23 |
| SARAH1 | 0.61 | 0.80 | 0.42 | 0.24 | 0.29 | 0.10 |
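The column headings in these tables can be made concrete with a small sketch that derives each measure from confusion-matrix counts. The formulas for Ac, Sn and Sp are the standard ones, and Cc is assumed here to be the Matthews correlation coefficient. Expressing the Type I and Type II error rates as fractions of all samples is also an assumption; it is suggested by the fact that in every row above the two error rates sum to roughly 1 − Ac. The counts in the example are hypothetical and were chosen only so that Ac, Sn and Sp match the CompactPSSM row.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Derive the tabulated measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Ac": (tp + tn) / total,          # accuracy
        "Sn": tp / (tp + fn),             # sensitivity
        "Sp": tn / (tn + fp),             # specificity
        # Assumption: Cc is the Matthews correlation coefficient
        "Cc": (tp * tn - fp * fn) / denom if denom else 0.0,
        # Assumption: error rates are fractions of all samples
        "TypeI": fp / total,              # false positives
        "TypeII": fn / total,             # false negatives
    }

# Hypothetical counts for a balanced set of 1,000 sites
m = classification_metrics(tp=355, tn=375, fp=125, fn=145)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives Ac = 0.73, Sn = 0.71 and Sp = 0.75, and the correlation coefficient comes out close to the tabulated 0.46.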
Prediction results of machine learning models on PS-Benchmark_1 dataset.
| Models | Accuracy (Ac) | Sensitivity (Sn) | Specificity (Sp) | Correlation-Coefficient (Cc) | Var. | Time |
| Ada-SVM | 0.791 | 0.776 | 0.806 | 0.583 | 0.031 | 51.593 |
| SVM | 0.798 | 0.787 | 0.809 | 0.596 | 0.030 | 32.886 |
| kNN | 0.767 | 0.753 | 0.781 | 0.534 | 0.032 | 35.630 |
| GRNN | 0.759 | 0.724 | 0.793 | 0.518 | 0.041 | 85.422 |
| MLP | 0.752 | 0.715 | 0.789 | 0.505 | 0.046 | 180.344 |
| RBFN | 0.737 | 0.685 | 0.788 | 0.475 | 0.044 | 68.654 |
| DT (J48) | 0.732 | 0.718 | 0.747 | 0.465 | 0.025 | 8.393 |
| KLR | 0.726 | 0.682 | 0.772 | 0.456 | 0.038 | 156.690 |
Figure 1. Prediction scores simulated by Adaptive-LEKM and SVM.
Prediction results of Adaptive-LEKM for the four kinase families.
| K-Families | Accuracy (Ac) | Sensitivity (Sn) | Specificity (Sp) | Correlation-Coefficient (Cc) | Type I ER | Type II ER |
| CDK | 0.777 | 0.455 | 0.992 | 0.900 | | |
| CK2 | 0.840 | 0.765 | 0.888 | 0.660 | | |
| PKA | 0.816 | 0.561 | 0.987 | 0.640 | | |
| PKC | 0.726 | 0.475 | 0.898 | 0.420 | | |
| Avg. | 0.790 | 0.564 | 0.941 | 0.655 | | |
| Var. | 0.050 | 0.142 | 0.056 | 0.196 | | |
The experimental results of SiteSeek are written in bold; the others are consensus results from the literature, as reported by Kim et al. (2004).
Prediction results of Adaptive-LEKM for the four kinase groups.
| K-Groups | Accuracy (Ac) | Sensitivity (Sn) | Specificity (Sp) | Correlation-Coefficient (Cc) | Type I ER | Type II ER |
| AGC | 0.862 | 0.796 | 0.913 | 0.719 | 0.048 | 0.090 |
| CAMK | 0.821 | 0.721 | 0.900 | 0.638 | 0.056 | 0.123 |
| CMGC | 0.900 | 0.891 | 0.907 | 0.796 | 0.054 | 0.046 |
| TK | 0.792 | 0.667 | 0.892 | 0.580 | 0.060 | 0.148 |
| Avg. | 0.844 | 0.769 | 0.903 | 0.683 | 0.055 | 0.102 |
| Var. | 0.047 | 0.097 | 0.009 | 0.094 | 0.005 | 0.044 |
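The "Avg." and "Var." rows can be checked directly from the four group rows. The sketch below recomputes them with Python's standard `statistics` module. It uses the sample standard deviation (n − 1 denominator) for the "Var." row, which is an assumption: the source does not define that row, but the tabulated values agree with a standard deviation rather than a variance.

```python
import statistics

# Ac, Sn, Sp and Cc per kinase group, copied from the table above
results = {
    "AGC":  (0.862, 0.796, 0.913, 0.719),
    "CAMK": (0.821, 0.721, 0.900, 0.638),
    "CMGC": (0.900, 0.891, 0.907, 0.796),
    "TK":   (0.792, 0.667, 0.892, 0.580),
}

summary = {}
for name, column in zip(("Ac", "Sn", "Sp", "Cc"), zip(*results.values())):
    avg = statistics.mean(column)
    spread = statistics.stdev(column)  # sample standard deviation (n - 1)
    summary[name] = (avg, spread)
    print(f"{name}: avg={avg:.3f} spread={spread:.3f}")
```

For accuracy, for example, this yields an average of 0.844 and a spread of 0.047, matching the table.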
Predictive performance of phosphorylation site predictors.
| Predictors | Accuracy (Ac) | Sensitivity (Sn) | Specificity (Sp) | Correlation-Coefficient (Cc) | Type I Error | Type II Error |
| PredPhospho | 0.843 | 0.821 | 0.862 | 0.684 | 0.076 | 0.079 |
| NetPhosK | 0.836 | 0.790 | 0.876 | 0.670 | 0.066 | 0.099 |
| Scansite | 0.827 | 0.755 | 0.883 | 0.647 | 0.066 | 0.107 |
| DISPHOS | 0.805 | 0.773 | 0.827 | 0.601 | 0.092 | 0.106 |
Although each predictor was trained using its own training dataset, they were all tested on the same benchmark dataset (PS-Benchmark_1), which contains 1,668 polypeptide chains (refer to the section "PS-Benchmark_1").
Four main kinase groups.
| DMPK_group | CaM-KIalpha | Abl | CDK_group | CK2 alpha |
| GRK_group | CaM-KI_group | ALK | CDK1 | CK2 beta |
| GRK-1 | CaM-KII_group | Axl | CDK11 | CK2_group |
| GRK-2 | CaM-KIIalpha | Csk | CDK2 | N/A |
| GRK-3 | CaM-KIV | EGFR | CDK4 | |
| GRK-4 | CaM-Kkalpha | EphA2 | CDK5 | |
| GRK-5 | CaM-Kkbeta | EphA3 | CDK6 | |
| GRK-6 | CDPK | EphA4 | CDK7 | |
| NDR1 | CHK1 | EphA8 | CDK9 | |
| NDR2 | CHK2 | EphB1 | CLK1 | |
| PDK1 | DAPK_group | EphB2 | DYRK1A | |
| PDK2 | DAPK1 | EphB3 | DYRK1B | |
| PDK_alpha | DAPK2 | EphB5 | DYRK2 | |
| PKA_group | DAPK3 | FAK | DYRK3 | |
| PKA alpha | MAPKAPK2 | Fer | GSK-3_group | |
| PKB_group | MARK_group | FGFR_group | GSK-3alpha | |
| PKB beta | MLCK_group | FGFR1 | GSK-3beta | |
| PKC_group | PHK_group | FGFR3 | MAPK_group | |
| PKC alpha | Pim-1 | FGFR4 | MAPK1 | |
| PKC beta | PKD1 | JAK_group | MAPK10 | |
| PKC delta | PKD2 | JAK1 | MAPK11 | |
| PKC epsilon | PKD3 | JAK2 | MAPK12 | |
| PKC eta | RSK_group | JAK3 | MAPK13 | |
| PKC gamma | RSK-1 | Met | MAPK14 | |
| PKC iota | RSK-2 | PDGFR_group | MAPK3 | |
| PKC theta | RSK-3 | PDGFR alpha | MAPK4 | |
| PKC zeta | RSK-5 | PDGFR beta | MAPK6 | |
| PKG | N/A | Ret | MAPK7 | |
| PKG1 | | Src | MAPK8 | |
| PKG1A | | Src_group | MAPK9 | |
| PKG1B | | Syk | PRP4 | |
| PKG2 | | Tec | N/A | |
| RSK_group | | Tie2 | | |
| RSK-1 | | TRKA | | |
| RSK-2 | | TRKB | | |
| RSK-3 | | N/A | | |
| RSK-5 | | | | |
| SGK_group | | | | |
Figure 2. Comparison of different data dimensions.
Hydrophobicity scale: nonpolar → polar distribution of amino acid chains, pH 7 (kcal/mol) [54].
| No. | Amino Acid | Feature Value | No. | Amino Acid | Feature Value |
| 1 | I | 4.92 | 11 | Y | -0.14 |
| 2 | L | 4.92 | 12 | T | -2.57 |
| 3 | V | 4.04 | 13 | S | -3.40 |
| 4 | P | 4.04 | 14 | H | -4.66 |
| 5 | F | 2.98 | 15 | Q | -5.54 |
| 6 | M | 2.35 | 16 | K | -5.55 |
| 7 | W | 2.33 | 17 | N | -6.64 |
| 8 | A | 1.81 | 18 | E | -6.81 |
| 9 | C | 1.28 | 19 | D | -8.72 |
| 10 | G | 0.94 | 20 | R | -14.92 |
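As an illustration of how such a scale can serve as an input encoding, the sketch below maps a window of residues around a candidate site to its hydrophobicity values. The feature values are copied from the table above. The window half-width, the zero padding at sequence ends and the example peptide are all hypothetical choices made for this sketch; the record does not state the windowing used by SiteSeek.

```python
# Feature values copied from the table above (kcal/mol, pH 7)
HYDROPHOBICITY = {
    "I": 4.92, "L": 4.92, "V": 4.04, "P": 4.04, "F": 2.98,
    "M": 2.35, "W": 2.33, "A": 1.81, "C": 1.28, "G": 0.94,
    "Y": -0.14, "T": -2.57, "S": -3.40, "H": -4.66, "Q": -5.54,
    "K": -5.55, "N": -6.64, "E": -6.81, "D": -8.72, "R": -14.92,
}

def encode_window(sequence, centre, half_width=4, pad=0.0):
    """Hydrophobicity values for positions centre-half_width..centre+half_width.

    Positions outside the sequence, and residues missing from the scale,
    are filled with `pad` (an assumption of this sketch).
    """
    values = []
    for i in range(centre - half_width, centre + half_width + 1):
        if 0 <= i < len(sequence):
            values.append(HYDROPHOBICITY.get(sequence[i], pad))
        else:
            values.append(pad)
    return values

# Made-up peptide; the serine at index 3 is the candidate site
print(encode_window("MKRSPTLE", centre=3, half_width=2))
# [-5.55, -14.92, -3.4, 4.04, -2.57]
```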
Rose hydrophobicity scale [55].
| No. | Amino Acid | Feature Value | No. | Amino Acid | Feature Value |
| 1 | A | 0.74 | 11 | L | 0.85 |
| 2 | R | 0.64 | 12 | K | 0.52 |
| 3 | N | 0.63 | 13 | M | 0.85 |
| 4 | D | 0.62 | 14 | F | 0.88 |
| 5 | C | 0.91 | 15 | P | 0.64 |
| 6 | Q | 0.62 | 16 | S | 0.66 |
| 7 | E | 0.62 | 17 | T | 0.70 |
| 8 | G | 0.72 | 18 | W | 0.85 |
| 9 | H | 0.78 | 19 | Y | 0.76 |
| 10 | I | 0.88 | 20 | V | 0.86 |
SARAH1 Scale.
| No. | Amino Acid | Binary Code | No. | Amino Acid | Binary Code |
| 1 | C | 1, 1, 0, 0, 0 | 11 | G | 0, 0, 0, -1, -1 |
| 2 | F | 1, 0, 1, 0, 0 | 12 | T | 0, 0, -1, 0, -1 |
| 3 | I | 1, 0, 0, 1, 0 | 13 | S | 0, 0, -1, -1, 0 |
| 4 | V | 1, 0, 0, 0, 1 | 14 | R | 0, -1, 0, 0, -1 |
| 5 | L | 0, 1, 1, 0, 0 | 15 | P | 0, -1, 0, -1, 0 |
| 6 | W | 0, 1, 0, 1, 0 | 16 | N | 0, -1, -1, 0, 0 |
| 7 | M | 0, 1, 0, 0, 1 | 17 | D | -1, 0, 0, 0, -1 |
| 8 | H | 0, 0, 1, 1, 0 | 18 | Q | -1, 0, 0, -1, 0 |
| 9 | Y | 0, 0, 1, 0, 1 | 19 | E | -1, 0, -1, 0, 0 |
| 10 | A | 0, 0, 0, 1, 1 | 20 | K | -1, -1, 0, 0, 0 |
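The SARAH1 codes in the table follow a regular pattern: each residue is a five-component vector with exactly two non-zero entries, both +1 for the first ten residues and both −1 for the last ten. A minimal sketch of using the scale as a sequence encoding is given below. The codes are copied from the table; the function name `encode_sarah1` and the choice to concatenate the per-residue codes into one flat vector are assumptions made for illustration.

```python
# Codes copied from the SARAH1 table above
SARAH1 = {
    "C": (1, 1, 0, 0, 0), "F": (1, 0, 1, 0, 0), "I": (1, 0, 0, 1, 0),
    "V": (1, 0, 0, 0, 1), "L": (0, 1, 1, 0, 0), "W": (0, 1, 0, 1, 0),
    "M": (0, 1, 0, 0, 1), "H": (0, 0, 1, 1, 0), "Y": (0, 0, 1, 0, 1),
    "A": (0, 0, 0, 1, 1),
    "G": (0, 0, 0, -1, -1), "T": (0, 0, -1, 0, -1), "S": (0, 0, -1, -1, 0),
    "R": (0, -1, 0, 0, -1), "P": (0, -1, 0, -1, 0), "N": (0, -1, -1, 0, 0),
    "D": (-1, 0, 0, 0, -1), "Q": (-1, 0, 0, -1, 0), "E": (-1, 0, -1, 0, 0),
    "K": (-1, -1, 0, 0, 0),
}

def encode_sarah1(peptide):
    """Concatenate the five-component code of every residue in `peptide`."""
    flat = []
    for residue in peptide:
        flat.extend(SARAH1[residue])
    return flat

# A two-residue example gives a vector of length 10
print(encode_sarah1("ST"))
# [0, 0, -1, -1, 0, 0, 0, -1, 0, -1]
```

A window of n residues therefore becomes a vector of length 5n, compared with 20n for a one-hot (orthogonal) encoding of the 20 amino acids.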
Figure 3. Two-dimensional LBG vector quantisation.
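For readers unfamiliar with LBG (Linde-Buzo-Gray) vector quantisation, the following is a minimal two-dimensional sketch of the general technique: start from the centroid of all points, split every codeword into two perturbed copies, and refine the codebook with nearest-neighbour assignment and centroid updates. It is not taken from the SiteSeek implementation. The perturbation size, the fixed number of refinement passes and the toy data are all assumptions of this sketch.

```python
def nearest(point, codebook):
    """Index of the codeword closest to `point` (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, codebook[k])))

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def lbg(points, levels, eps=0.01, iterations=20):
    """Grow a codebook of 2**levels codewords by repeated splitting."""
    codebook = [centroid(points)]
    for _ in range(levels):
        # Split: replace each codeword by two copies shifted by +eps and -eps
        codebook = [tuple(c + s * eps for c in cw)
                    for cw in codebook for s in (1, -1)]
        # Refine: assign points to the nearest codeword, then move each
        # codeword to the centroid of its cell (empty cells stay in place)
        for _ in range(iterations):
            cells = [[] for _ in codebook]
            for p in points:
                cells[nearest(p, codebook)].append(p)
            codebook = [centroid(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook

# Toy data: two well-separated clusters of four points each
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
codebook = lbg(points, levels=1)
print(sorted(codebook))  # [(0.5, 0.5), (10.5, 10.5)]
```

A fixed pass count keeps the sketch short; a fuller version would stop once the average distortion no longer decreases by more than a chosen threshold.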
Figure 4. SiteSeek basic architecture.