Hamid D Ismail, Ahoi Jones, Jung H Kim, Robert H Newman, Dukka B Kc.
Abstract
Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in the field of bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 2.0 (RF-Phos 2.0), to predict phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 2.0, which uses random forest with sequence and structural features, is able to identify putative sites of phosphorylation across many protein families. In side-by-side comparisons based on 10-fold cross validation and an independent dataset, RF-Phos 2.0 compares favorably to other popular mammalian phosphosite prediction methods, such as PhosphoSVM, GPS2.1, and Musite.
Year: 2016 PMID: 27066500 PMCID: PMC4811047 DOI: 10.1155/2016/3281590
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
The benchmark sequences of known phosphorylation sites.
| Residue | Number of sequences | Number of sites |
|---|---|---|
| Ser | 6,635 | 20,964 |
| Thr | 3,227 | 5,685 |
| Tyr | 1,392 | 2,163 |
The number of windows before and after redundancy removal for size = 9.
| Residue | Positive windows (before) | Positive windows (after) | Negative windows (used) |
|---|---|---|---|
| Ser | 20577 | 1554 | 1543 |
| Thr | 5596 | 707 | 453 |
| Tyr | 2124 | 267 | 226 |
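The window counts above come from cutting a fixed-size sequence window around each candidate residue. The following is a minimal sketch of that step, not code from RF-Phos 2.0. It assumes that "size = 9" means a total window length of 9 centred on the site and that positions beyond the termini are padded with `X`. The helpers `extract_windows` and `remove_redundant` are hypothetical, and exact-duplicate filtering is only a stand-in, since the redundancy criterion is not stated here.

```python
def extract_windows(sequence, residue, size=9, pad="X"):
    """Return (1-based position, window) pairs centred on each `residue`.

    Assumption: `size` is the total window length (must be odd), and
    positions past either terminus are filled with `pad`.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the site sits in the centre")
    flank = size // 2
    padded = pad * flank + sequence + pad * flank
    windows = []
    for i, aa in enumerate(sequence):
        if aa == residue:
            # Residue i sits at index i + flank in `padded`,
            # so the window centred on it starts at index i.
            windows.append((i + 1, padded[i:i + size]))
    return windows


def remove_redundant(windows):
    """Keep the first occurrence of each distinct window string.

    Stand-in for redundancy removal: only exact duplicates are dropped.
    """
    seen = set()
    unique = []
    for position, window in windows:
        if window not in seen:
            seen.add(window)
            unique.append((position, window))
    return unique


# Toy sequence (hypothetical), windows of length 5 around serine.
toy_windows = extract_windows("MKSAAS", "S", size=5)
```

With these assumptions, every window has the candidate residue at its central index, which keeps positive and negative windows directly comparable.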
Independent test set.
| Residue | Positive/negative |
|---|---|
| Ser | 307/307 |
| Thr | 68/68 |
| Tyr | 51/51 |
Figure 1. Accuracy versus number of trees for serine.
Figure 2. Feature distribution. (a) Distribution of the feature importance of all 593 features for Ser (top), Thr (middle), and Tyr (bottom). Features and corresponding indices are noted. Dashed lines represent boundaries between feature indices. (b) Top ten important features for Ser (top), Thr (middle), and Tyr (bottom). The bar labels indicate the feature type to which each important feature belongs. CTD: composition, transition, and distribution; ASA: accessible surface area; SF: sequence features; ACH: average cumulative hydrophobicity; OP: overlapping properties.
Feature types and their count percentage in the top-ten important features for each phosphosite.
| Residues | Features |
|---|---|
| S | ASA (30%), CTD (20%), QSO (20%), ACH (10%), OP (10%), and SF (10%) |
| T | CTD (50%), OP (20%), QSO (20%), and SF (10%) |
| Y | CTD (60%), ASA (20%), IG (10%), and OP (10%) |
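The percentages in this table are simple shares of the ten top-ranked features. As an illustration only, the sketch below recomputes the Thr row from a list of feature-type labels. The list `thr_top_ten` is rebuilt from the tabulated percentages rather than taken from the ranked features themselves, and `type_percentages` is a hypothetical helper.

```python
from collections import Counter


def type_percentages(feature_types):
    """Map each feature type to its share (in percent) of the list."""
    counts = Counter(feature_types)
    total = len(feature_types)
    return {ftype: 100.0 * n / total for ftype, n in counts.most_common()}


# Assumption: ten labels reconstructed from the Thr row of the table;
# the order within the list carries no meaning.
thr_top_ten = ["CTD"] * 5 + ["OP"] * 2 + ["QSO"] * 2 + ["SF"]
thr_shares = type_percentages(thr_top_ten)
```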
Evaluation metrics obtained from 10-fold cross validation for the model trained using either the entire set of 593 features (“all”) or the top 100 features (“100”). Results using all 593 features are shown in boldface.
| Metric | S (all) | S (100) | T (all) | T (100) | Y (all) | Y (100) |
|---|---|---|---|---|---|---|
| Accuracy (%) | | 80.00 | | 84.00 | | 85.00 |
| Precision (%) | | 79.00 | | 87.00 | | 88.00 |
| Sensitivity (%) | | 81.00 | | 87.00 | | 84.00 |
| Specificity (%) | | 80.00 | | 79.00 | | 84.00 |
| F1 score (%) | | 80.00 | | 87.00 | | 86.00 |
| MCC | | 0.61 | | 0.66 | | 0.69 |
| AUC | | 0.85 | | 0.85 | | 0.88 |
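For reference, the metrics in this table all follow from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts in the example are hypothetical (a balanced set of 100 positives and 100 negatives) and are not taken from the paper, and `binary_metrics` and `f1_from_rates` are illustrative names.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on positives
    specificity = tn / (tn + fp)          # recall on negatives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


def f1_from_rates(precision, sensitivity):
    """Harmonic mean of precision and sensitivity (same units as the inputs)."""
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts: 81 of 100 positives and 80 of 100 negatives correct.
example = binary_metrics(tp=81, fp=20, tn=80, fn=19)
```

As a consistency check on the tabulated values, the harmonic mean of the Ser precision and sensitivity (79.00 and 81.00) is about 80.0, and the same calculation for Thr and Tyr gives about 87.0 and 86.0.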
Figure 3. The receiver operating characteristic (ROC) curve of RF-Phos 2.0 using 10-fold cross validation.
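The area under a ROC curve can be read as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The sketch below computes AUC from that pairwise definition, which is equivalent to the trapezoidal area under the empirical ROC curve. The labels and scores are toy values, and `roc_auc` is an illustrative helper rather than the evaluation code behind the figure.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Runs in O(P * N) time, which is fine for small illustrative sets.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical scores): 3 of the 4 pairs are ranked correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```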
Scoring metrics using 10-fold cross validation.
| Residue | Method | AUC | Sen (%) | Sp (%) | MCC |
|---|---|---|---|---|---|
| S | NetPhosK | 0.63 | 50.9 | 67.8 | 0.08 |
| S | GPS 2.1 | 0.74 | 33.1 | 93.3 | 0.20 |
| S | Swaminathan | 0.70 | 31.3 | 88.7 | 0.13 |
| S | NetPhos | 0.70 | 34.1 | 86.7 | 0.12 |
| S | PPRED | 0.75 | 32.3 | 91.6 | 0.17 |
| S | Musite | 0.81 | 41.4 | 93.7 | 0.25 |
| S | PhosphoSVM | 0.84 | 44.4 | 94.0 | 0.30 |
| S | RF-Phos 2.0 | | | | |
| T | NetPhosK | 0.60 | 62.0 | 56.8 | 0.07 |
| T | GPS 2.1 | 0.70 | 38.1 | 92.3 | 0.20 |
| T | Swaminathan | 0.72 | 28.0 | 92.5 | 0.14 |
| T | NetPhos | 0.66 | 34.3 | 83.7 | 0.09 |
| T | PPRED | 0.73 | 30.3 | 91.0 | 0.13 |
| T | Musite | 0.78 | 33.8 | 94.8 | 0.22 |
| T | PhosphoSVM | 0.82 | 37.3 | 95.0 | 0.25 |
| T | RF-Phos 2.0 | | | | |
| Y | NetPhosK | 0.60 | 39.5 | 74.2 | 0.08 |
| Y | GPS 2.1 | 0.61 | 34.5 | 78.9 | 0.08 |
| Y | Swaminathan | 0.62 | 60.5 | 57.0 | 0.09 |
| Y | NetPhos | 0.65 | 34.7 | 84.5 | 0.13 |
| Y | PPRED | 0.70 | 43.0 | 82.7 | 0.17 |
| Y | Musite | 0.72 | 38.4 | 86.7 | 0.18 |
| Y | PhosphoSVM | 0.74 | 41.9 | 87.3 | 0.21 |
| Y | RF-Phos 2.0 | | | | |
Scoring metrics using an independent test dataset.
| Residue | Method | Sen (%) | Sp (%) | MCC |
|---|---|---|---|---|
| S | NetPhosK | 80.13 | 38.79 | 0.10 |
| S | GPS 2.1 | 94.79 | 28.62 | 0.14 |
| S | NetPhos | 76.55 | 54.20 | 0.16 |
| S | PHOSFER | 74.59 | 65.51 | 0.22 |
| S | Musite | 55.70 | 87.39 | 0.31 |
| S | PhosphoSVM | 63.84 | 81.76 | 0.29 |
| S | RF-Phos 2.0 | | | |
| T | NetPhosK | 69.12 | 50.82 | 0.06 |
| T | GPS 2.1 | 95.59 | 20.84 | 0.07 |
| T | NetPhos | 54.41 | 77.43 | 0.12 |
| T | PHOSFER | 77.94 | 64.77 | 0.14 |
| T | Musite | 48.53 | 93.55 | 0.26 |
| T | PhosphoSVM | 70.59 | 78.16 | 0.19 |
| T | RF-Phos 2.0 | | | |
| Y | NetPhosK | 25.49 | 83.23 | 0.04 |
| Y | GPS 2.1 | 98.04 | 21.42 | 0.09 |
| Y | NetPhos | 64.71 | 67.50 | 0.13 |
| Y | PHOSFER | 62.75 | 59.29 | 0.08 |
| Y | Musite | 47.06 | 88.77 | 0.20 |
| Y | PhosphoSVM | 82.35 | 64.18 | 0.18 |
| Y | RF-Phos 2.0 | | | |