Md Mehedi Hasan, Md Mamunur Rashid, Mst Shamima Khatun, Hiroyuki Kurata.
Abstract
Protein phosphorylation on serine (S) and threonine (T) has emerged as a key mechanism in the control of many biological processes. Recently, phosphorylation in microbial organisms has attracted much attention for its critical roles in cellular processes such as cell growth and cell division. Here a novel machine learning predictor, MPSite (Microbial Phosphorylation Site predictor), was developed to identify microbial phosphorylation sites using enhanced sequence features. The final feature vectors were optimized via a Wilcoxon rank sum test. A random forest classifier was then trained on the optimized features to build the predictor. Benchmarking with 5-fold cross-validation and independent dataset tests showed that MPSite achieves robust performance on S- and T-phosphorylation site prediction. It also outperformed other existing methods on the comprehensive independent datasets. We anticipate that MPSite will be a powerful tool for proteome-wide prediction of microbial phosphorylation sites and will facilitate hypothesis-driven functional interrogation of phosphorylated proteins. A web application with the curated datasets is freely available at http://kurata14.bio.kyutech.ac.jp/MPSite/ .
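The feature optimization step named in the abstract can be illustrated with a minimal sketch of Wilcoxon rank sum screening. It assumes each feature column is tested independently between positive and negative sites and kept when its two-sided p-value falls below a cutoff. The cutoff `alpha=0.05`, the normal approximation without tie correction, and the function names are illustrative assumptions, not details taken from MPSite; the random forest training step is not shown.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_sum_p_value(pos, neg):
    """Two-sided rank sum p-value (normal approximation, no tie correction)."""
    n1, n2 = len(pos), len(neg)
    ranks = average_ranks(list(pos) + list(neg))
    w = sum(ranks[:n1])                      # rank sum of the positive group
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    return math.erfc(abs(z) / math.sqrt(2))

def select_features(X, y, alpha=0.05):
    """Return indices of columns that differ between classes (assumed cutoff)."""
    keep = []
    for col in range(len(X[0])):
        pos = [row[col] for row, label in zip(X, y) if label == 1]
        neg = [row[col] for row, label in zip(X, y) if label == 0]
        if rank_sum_p_value(pos, neg) < alpha:
            keep.append(col)
    return keep
```

The retained column indices would then define the reduced feature vectors passed to a classifier.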
Year: 2019 PMID: 31164681 PMCID: PMC6547684 DOI: 10.1038/s41598-019-44548-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. A computational framework of MPSite.
Figure 2. Sequence logo representation of pS and pT sites. The local sequence neighborhood of 10 upstream and 10 downstream residues surrounding the phosphorylation sites was used to plot the sequence logos. Two-sample logos show the dominance of surface-accessible residues in microbial pS and pT sites.
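The 21-residue neighborhood described in the Figure 2 caption (10 residues on each side of the candidate site) can be extracted as sketched below. The padding character `-` for sites near a terminus and the function name `extract_windows` are assumptions for illustration; the source does not say how truncated windows were handled.

```python
def extract_windows(sequence, residues="ST", flank=10, pad="-"):
    """Return (1-based position, residue, window) for every S/T in the sequence.

    Each window spans `flank` residues on both sides of the site, so its
    length is 2 * flank + 1. Positions beyond the termini are filled with
    `pad` (an assumed convention).
    """
    padded = pad * flank + sequence + pad * flank
    windows = []
    for pos, aa in enumerate(sequence):
        if aa in residues:
            # In the padded string the site sits at index pos + flank,
            # so the window starts at index pos.
            windows.append((pos + 1, aa, padded[pos:pos + 2 * flank + 1]))
    return windows
```

With the default `flank=10`, every returned window has 21 characters with the candidate S or T at the center.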
AUC values of different feature schemes and classifiers on the training dataset via a 5-fold CV test.
| Method | pS-site | pT-site |
|---|---|---|
| AAC | 0.639 | 0.646 |
| PSSM | 0.608 | 0.616 |
| PKA | 0.691 | 0.813 |
| AIP | 0.671 | 0.685 |
| BE | 0.683 | 0.690 |
| AFC | 0.725 | 0.826 |
| SSF | 0.641 | 0.662 |
| MPSite | 0.822 | 0.853 |
| DT | 0.807 | 0.817 |
| SVM | 0.819 | 0.838 |
| NB | 0.789 | 0.811 |
| ANN | 0.774 | 0.784 |
The AUC scores of the AAC, PSSM, PKA, AIP, BE, AFC, and SSF schemes were measured using the RF algorithm. The AUC values of MPSite, DT, SVM, NB, and ANN were estimated by integrating the four descriptors PKA, AIP, BE, and AFC.
Figure 3. ROC curves of the various prediction models using a 5-fold CV test on the training datasets. (A) Performance on the pS site dataset and (B) performance on the pT site dataset. ‘MPSite’ indicates the optimum performance of the combined four features via the WR scheme.
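The area under a ROC curve equals the probability that a randomly chosen positive site receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name and the toy scores are hypothetical and are not MPSite outputs.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score contribute 0.5.
    This is quadratic in the number of samples, which is fine for a sketch.
    """
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```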
Performance of MPSite based on the training datasets via a 5-fold CV test.
| Measure | pS sites | pT sites |
|---|---|---|
| Specificity | 0.897 | 0.901 |
| Sensitivity | 0.503 | 0.596 |
| Accuracy | 0.766 | 0.799 |
| MCC | 0.452 | 0.522 |
| AUC | 0.822 | 0.853 |
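Alongside AUC, threshold-dependent measures such as sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC) are commonly reported for binary site predictors. The sketch below computes them from confusion-matrix counts using their standard definitions. The function name and the example counts are hypothetical and do not come from the MPSite datasets.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Standard threshold-dependent measures from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity: recall on positives
    sp = tn / (tn + fp)                       # specificity: recall on negatives
    ac = (tp + tn) / (tp + fn + tn + fp)      # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Ac": ac, "MCC": mcc}

# Hypothetical counts with twice as many negatives as positives.
example = binary_metrics(tp=50, fn=50, tn=180, fp=20)
```

Because accuracy weights both classes by their size, a predictor tuned for high specificity on a negative-heavy dataset can show good accuracy while sensitivity stays modest, which is why MCC is usually reported as well.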
Figure 4. AUC values for different window sizes based on 5-fold cross-validation tests. (A) pS and (B) pT site prediction.
Performance comparison of pS and pT site prediction on the independent dataset.
| Method | Specificity | Sensitivity | Accuracy | MCC |
|---|---|---|---|---|
| pS sites | | | | |
| MPSite | 0.811 | 0.412 | 0.678 | 0.239 |
| ANN | 0.803 | 0.261 | 0.622 | 0.124 |
| DT | 0.801 | 0.291 | 0.631 | 0.157 |
| NB | 0.801 | 0.271 | 0.624 | 0.133 |
| SVM | 0.802 | 0.361 | 0.655 | 0.183 |
| NetPhosBac | 0.678 | 0.331 | 0.562 | −0.006 |
| pT sites | | | | |
| MPSite | 0.818 | 0.616 | 0.751 | 0.432 |
| ANN | 0.806 | 0.465 | 0.692 | 0.292 |
| DT | 0.803 | 0.499 | 0.702 | 0.322 |
| NB | 0.801 | 0.446 | 0.683 | 0.283 |
| SVM | 0.805 | 0.565 | 0.725 | 0.372 |
| NetPhosBac | 0.883 | 0.101 | 0.622 | 0.011 |