Salma Jamal1, Waseem Ali1, Priya Nagpal2, Abhinav Grover3, Sonam Grover4.
Abstract
BACKGROUND: Post-translational modification (PTM) is a biological process that alters proteins and is therefore involved in the regulation of various cellular activities and in pathogenesis. Protein phosphorylation is an essential process and one of the most-studied PTMs: it occurs when a phosphate group is added to a serine (Ser, S), threonine (Thr, T), or tyrosine (Tyr, Y) residue. Dysregulation of protein phosphorylation can lead to various diseases, most commonly neurological disorders, Alzheimer's disease, and Parkinson's disease, which makes it necessary to predict the S/T/Y residues that can be phosphorylated in an uncharacterized amino acid sequence. Despite a surplus of sequencing data, current experimental methods of PTM prediction are time-consuming, costly, and error-prone, so a number of computational methods have been proposed to replace them. However, phosphorylation prediction remains limited, owing to substrate specificity, performance, and the diversity of its features.
Keywords: MRMR; Post-translational modification; Random forest; Support vector machine; Symmetrical uncertainty
Year: 2021; PMID: 34030700; PMCID: PMC8142496; DOI: 10.1186/s12967-021-02851-0
Source DB: PubMed; Journal: J Transl Med; ISSN: 1479-5876; Impact factor: 5.531
Fig. 1 Overall workflow of the proposed approach for prediction of phosphorylation sites
The number of phosphorylation sites
| Residue | Training set: positive | Training set: negative | Testing set: positive | Testing set: negative |
|---|---|---|---|---|
| Serine | 107,668 | 87,180 | 26,916 | 21,795 |
| Threonine | 45,952 | 21,932 | 11,488 | 5,483 |
| Tyrosine | 29,078 | 253 | 7,269 | 63 |
Initial number and types of different features used to encode sequence fragments
| Feature types | Features | Number |
|---|---|---|
| Physicochemical property-based | Amino acid composition, average flexibility indices, hydrophobicity indices, net charge, partition coefficient, residue volume and molecular weight | 147 (21 × 7) |
| Sequence-based | Binary-encoding | 420 (21 × 20) |
| Structural level | Accessible surface area; secondary structure (coil, helix and strand) and disordered regions | 105 (21 × 5) |
| Functional features | Gene ontology (GO) terms for (1) biological process (BP), (2) molecular function (MF) and (3) cellular component (CC); protein domain; KEGG pathway | 555 GO, 177 domain, 114 KEGG pathway |
| Functional annotation | UP_SEQ_FEATURE and UP_KEYWORDS | 526 |
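The binary-encoding row above (420 = 21 × 20) suggests that each candidate site is represented by a 21-residue fragment with 20 bits per position. The following is a minimal sketch of that idea, not the authors' implementation: the alphabet order, the `X` padding symbol, the example sequence, and the helper names `extract_fragments` and `binary_encode` are all assumptions made for illustration.

```python
# Assumed alphabet order for the 20 standard amino acids (not stated in the source).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 21          # fragment length implied by the "21 x 20" count in the table
HALF = WINDOW // 2   # residues on each side of the central S/T/Y


def extract_fragments(sequence, targets="STY", pad="X"):
    """Return (position, fragment) pairs, one per S/T/Y residue.

    Sequence ends are padded with `pad` (an assumed convention) so that
    every fragment has exactly WINDOW residues with the target in the centre.
    """
    padded = pad * HALF + sequence + pad * HALF
    return [
        (i, padded[i:i + WINDOW])
        for i, residue in enumerate(sequence)
        if residue in targets
    ]


def binary_encode(fragment):
    """One-hot encode a fragment: 20 bits per position.

    Symbols outside the alphabet (such as the padding symbol) map to all zeros.
    """
    vector = []
    for residue in fragment:
        bits = [0] * len(AMINO_ACIDS)
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            bits[idx] = 1
        vector.extend(bits)
    return vector


# Hypothetical protein sequence, used only to exercise the functions.
example = "MKTAYIAKQRQISFVKSHFSRQ"
for position, fragment in extract_fragments(example):
    print(position, fragment, len(binary_encode(fragment)))
```

Each fragment yields a 420-element vector regardless of where the site falls in the protein, which is what allows fragments from different proteins to share one feature matrix.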
Fig. 2 ROC curve on (a) random forest and (b) support vector machine models using the independent test set for serine phosphorylation site prediction
Fig. 3 ROC curve on (a) random forest and (b) support vector machine models using the independent test set for threonine phosphorylation site prediction
Fig. 4 ROC curve on (a) random forest and (b) support vector machine models using the independent test set for tyrosine phosphorylation site prediction
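The area under an ROC curve such as those in Figs. 2 to 4 can be read as the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that pairwise definition; the function name `roc_auc` and the labels and scores are hypothetical, and the quadratic loop is meant for illustration rather than for test sets of the size used in the study.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 1 (phosphorylated) or 0 (non-phosphorylated)
    scores: classifier scores, higher meaning "more likely positive"
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores, not taken from the study.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(roc_auc(labels, scores))
```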
Performance comparison with individual and combined feature encoding schemes for pS site prediction on the independent dataset
| Attributes | Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-measure (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Physicochemical property | RF | 71.5 | 79 | 62.3 | 72.1 | 75.4 | 0.42 | 0.74 |
| | SVM | 64.31 | 79.9 | 45.0 | 64.2 | 71.2 | 0.26 | 0.62 |
| Structure | RF | 87.34 | 86.5 | 80.6 | 90.2 | 88.3 | 0.74 | 0.94 |
| | SVM | 70.12 | 96.4 | 37.7 | 65.6 | 78.1 | 0.43 | 0.67 |
| Sequence | RF | 70.3 | 80.1 | 58.2 | 70.3 | 74.9 | 0.39 | 0.73 |
| | SVM | 77.65 | 99.1 | 51.2 | 77.5 | 83.1 | 0.59 | 0.75 |
| Functional features | RF | 62.87 | 93.9 | 24.6 | 60.6 | 73.6 | 0.26 | 0.59 |
| | SVM | 62.58 | 93.4 | 24.6 | 60.5 | 73.4 | 0.25 | 0.59 |
| Functional annotation | RF | 62.75 | 90.7 | 28.2 | 60.9 | 72.9 | 0.24 | 0.60 |
| | SVM | 62.50 | 93.6 | 24.1 | 60.4 | 73.4 | 0.25 | 0.58 |
| Combined | | | | | | | | |
| | SVM | 88.50 | 79.9 | 99.1 | 99.1 | 88.5 | 0.79 | 0.89 |
Performance metrics for best results are highlighted in bold
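The threshold-based columns in these performance tables (accuracy, sensitivity, specificity, precision, F-measure and MCC) all follow from the four confusion-matrix counts. A minimal sketch of the standard definitions is given here; the function name `classification_metrics` and the counts are hypothetical and are not values from the study.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics as fractions in [0, 1].

    tp, fn: phosphorylated sites predicted correctly / missed
    tn, fp: non-phosphorylated sites predicted correctly / wrongly flagged
    Assumes each class and each predicted label occurs at least once,
    so the sensitivity, specificity and precision denominators are non-zero.
    """
    sensitivity = tp / (tp + fn)            # recall on positive sites
    specificity = tn / (tn + fp)            # recall on negative sites
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Hypothetical confusion-matrix counts for illustration only.
metrics = classification_metrics(tp=80, fn=20, tn=90, fp=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, so it is less easily inflated when one class dominates the test set.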
Performance comparison with individual and combined feature encoding schemes for pT site prediction on the independent dataset
| Attributes | Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-measure (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Physicochemical property | RF | 77.54 | 92.1 | 47.1 | 78.5 | 84.7 | 0.45 | 0.76 |
| | SVM | 67.05 | 77.7 | 55.9 | 64.9 | 70.7 | 0.34 | 0.66 |
| Structure | RF | 89.58 | 94.8 | 78.8 | 90.3 | 92.5 | 0.75 | 0.96 |
| | SVM | 77.66 | 99 | 33 | 75.6 | 85.7 | 0.47 | 0.66 |
| Sequence | RF | 71.79 | 86 | 42.1 | 75.7 | 80.5 | 0.31 | 0.69 |
| | SVM | 73.74 | 94.4 | 69.5 | 70.3 | 79.9 | 0.16 | 0.57 |
| Functional features | RF | 72.75 | 92.1 | 32.3 | 74.0 | 82.1 | 0.31 | 0.63 |
| | SVM | 72.43 | 93.0 | 29.3 | 73.4 | 82.0 | 0.29 | 0.61 |
| Functional annotation | RF | 68.5 | 92.6 | 18.1 | 70.3 | 79.9 | 0.16 | 0.57 |
| | SVM | 68.34 | 92.3 | 31.7 | 70.3 | 79.8 | 0.15 | 0.55 |
| Combined | | | | | | | | |
| | SVM | 73.96 | 100 | 19.4 | 72.2 | 83.9 | 0.37 | 0.59 |
Performance metrics for best results are highlighted in bold
Performance comparison with individual and combined feature encoding schemes for pY site prediction on the independent dataset
| Attributes | Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | F-measure (%) | MCC | AUC |
|---|---|---|---|---|---|---|---|---|
| Physicochemical property | RF | 77.19 | 77.5 | 46 | 99.4 | 87.1 | 0.05 | 0.65 |
| | SVM | 79.21 | 79.6 | 30.2 | 99.2 | 88.4 | 0.02 | 0.54 |
| Structure | RF | 99.3 | 100 | 79.4 | 99.3 | 99.7 | 0.43 | 0.95 |
| | SVM | 96.08 | 96.4 | 63.5 | 99.7 | 98 | 0.27 | 0.79 |
| Sequence | RF | 69.74 | 70 | 41.3 | 99 | 82 | 0.02 | 0.59 |
| | SVM | 99 | 99.8 | 11.1 | 99.2 | 99.5 | 0.18 | 0.55 |
| Functional features | RF | 98.09 | 98.6 | 39.7 | 99.5 | 99.0 | 0.27 | 0.70 |
| | SVM | 97.73 | 98.2 | 39.7 | 99.5 | 98.9 | 0.24 | 0.69 |
| Functional annotation | RF | 95.51 | 96.1 | 31.7 | 99.4 | 97.7 | 0.12 | 0.68 |
| | SVM | 95.24 | 95.8 | 31.7 | 99.4 | 97.6 | 0.12 | 0.63 |
| Combined | | | | | | | | |
| | SVM | 99.46 | 99.7 | 23.8 | 99.7 | 99.7 | 0.71 | 0.87 |
Performance metrics for best results are highlighted in bold
Performance comparison of different existing tools for pS/pT/pY site prediction
| Phosphorylation site | Methods | Sensitivity (%) | Specificity (%) | MCC | AUC |
|---|---|---|---|---|---|
| Serine | PhosPred-RF | 79.70 | 75.00 | 0.54 | 0.85 |
| | PhosphoSVM | 44.43 | 94.04 | 0.29 | 0.84 |
| | PPRED | 32.27 | 91.6 | 0.16 | 0.75 |
| | iPhos-PseEn | 79.64 | 79.78 | 0.39 | – |
| Threonine | PhosPred-RF | 73.80 | 72.60 | 0.46 | 0.81 |
| | PhosphoSVM | 37.31 | 94.99 | 0.25 | 0.81 |
| | PPRED | 34.32 | 83.65 | 0.09 | 0.65 |
| | iPhos-PseEn | 71.51 | 80.68 | 0.34 | – |
| Tyrosine | PhosPred-RF | 72.70 | 64.00 | 0.36 | 0.76 |
| | PhosphoSVM | 41.92 | 87.34 | 0.20 | 0.73 |
| | PPRED | 43.04 | 82.65 | 0.16 | 0.70 |
| | iPhos-PseEn | 76.18 | 76.29 | 0.32 | – |
Performance metrics for best results are highlighted in bold