Xiaowei Zhao, Wenyi Zhang, Xin Xu, Zhiqiang Ma, Minghao Yin.
Abstract
As one of the most widespread protein post-translational modifications, phosphorylation is involved in many biological processes, such as the cell cycle and apoptosis. Identifying phosphorylated substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of phosphorylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of phosphorylation sites is highly desirable because of its convenience and speed. In this paper, a new bioinformatics tool named CKSAAP_PhSite was developed that ignores kinase information and uses only primary sequence information to predict protein phosphorylation sites. The key feature of CKSAAP_PhSite is that it uses the composition of k-spaced amino acid pairs as the encoding scheme, with a support vector machine as the predictor. CKSAAP_PhSite achieved a sensitivity of 84.81%, a specificity of 86.07% and an accuracy of 85.43% for serine; a sensitivity of 78.59%, a specificity of 82.26% and an accuracy of 80.31% for threonine; and a sensitivity of 74.44%, a specificity of 78.03% and an accuracy of 76.21% for tyrosine. Experimental results obtained from cross-validation and an independent benchmark suggest that the method is very promising for predicting phosphorylation sites and can serve as a useful supplementary tool for the community. For public access, CKSAAP_PhSite is available at http://59.73.198.144/cksaap_phsite/.
Year: 2012 PMID: 23110047 PMCID: PMC3478286 DOI: 10.1371/journal.pone.0046302
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An example of the CKSAAP encoding scheme with k = 0, 1, 2, 3 for sequence fragment AAACD.
| k | The corresponding feature vector (441 components) |
| 0 | (2, 1, 0, …, 0) |
| 1 | (1, 1, 1, …, 0) |
| 2 | (0, 1, 1, …, 0) |
| 3 | (0, 0, 1, …, 0) |
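The counting behind the AAACD example can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the residue ordering in `ALPHABET`, the treatment of 'O' as a 21st symbol (which gives 21 × 21 = 441 pairs), and the function names `cksaap_counts` and `cksaap_encode` are all hypothetical choices made so that the first three components correspond to the pairs AA, AC and AD.

```python
from itertools import product

# Assumption: 20 standard residues plus 'O' as an extra symbol, in this order.
# 21 * 21 = 441 ordered pairs, matching the vector length in the table.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYO"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, AD, ..., OO
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap_counts(fragment, k):
    """Count every residue pair separated by exactly k other residues."""
    counts = [0] * len(PAIRS)
    for i in range(len(fragment) - k - 1):
        pair = fragment[i] + fragment[i + k + 1]
        counts[PAIR_INDEX[pair]] += 1
    return counts


def cksaap_encode(fragment, k_max):
    """Concatenate the 441-component count vectors for k = 0 .. k_max."""
    features = []
    for k in range(k_max + 1):
        features.extend(cksaap_counts(fragment, k))
    return features


for k in range(4):
    print(k, cksaap_counts("AAACD", k)[:3])
```

With this ordering, the leading components printed for k = 0, 1, 2, 3 line up with the leading entries of the vectors in the table. The sketch returns raw counts; whether the tool normalizes them is not stated here.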
Comparison of the two encoding schemes on the training dataset.
| Site | Encoding scheme | Sn (%) | Sp (%) | Acc (%) | MCC |
| S | Binary | 80.37±0.69 | 84.89±0.73 | 82.63±0.61 | 0.653±0.012 |
| S | CKSAAP_PhSite | 84.81±0.52 | 86.07±0.56 | 85.43±0.82 | 0.709±0.005 |
| T | Binary | 60.05±0.95 | 90.17±0.64 | 75.12±0.53 | 0.528±0.008 |
| T | CKSAAP_PhSite | 78.59±0.51 | 82.26±0.86 | 80.31±0.62 | 0.599±0.015 |
| Y | Binary | 65.07±1.09 | 81.36±1.06 | 73.15±0.87 | 0.471±0.017 |
| Y | CKSAAP_PhSite | 74.44±0.74 | 78.03±0.21 | 76.21±0.32 | 0.524±0.006 |
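For readers who want to reproduce figures of this kind, the sketch below computes sensitivity, specificity and accuracy from confusion-matrix counts, together with the Matthews correlation coefficient (MCC), which is commonly reported alongside them. The formulas are the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy, MCC) from raw counts."""
    sn = tp / (tp + fn)                      # fraction of true sites recovered
    sp = tn / (tn + fp)                      # fraction of non-sites rejected
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall fraction correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc


# Made-up counts for illustration only.
sn, sp, acc, mcc = binary_metrics(tp=80, fn=20, tn=90, fp=10)
print(f"Sn={sn:.2%} Sp={sp:.2%} Acc={acc:.2%} MCC={mcc:.3f}")
```

On a balanced dataset the MCC is close to Sn + Sp − 1, which is a quick sanity check when reading tables of this form.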
Figure 1. ROC curves of CKSAAP_PhSite and the binary encoding scheme in terms of serine (S) site prediction based on the training dataset.
Figure 2. ROC curves of CKSAAP_PhSite and the binary encoding scheme in terms of threonine (T) site prediction based on the training dataset.
Figure 3. ROC curves of CKSAAP_PhSite and the binary encoding scheme in terms of tyrosine (Y) site prediction based on the training dataset.
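As a reminder of how ROC curves such as those in Figures 1 to 3 are obtained from a classifier's decision scores, here is a minimal sketch. It is a generic threshold sweep, not the procedure used by the authors; the function names `roc_points` and `auc` and the toy labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, one per distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples sharing this score so ties yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data: 1 = phosphorylated site, 0 = non-phosphorylated site.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))
```

A curve that lies closer to the top-left corner, and therefore has a larger area, indicates better separation of positive and negative sites across all thresholds.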
The top 20 features ranked by the IG-based feature selection method.
| Rank | S | T | Y |
| 1 | SP | TP | Y××P |
| 2 | S×S | P×××P | D×Y |
| 3 | S×××S | L××L | L×L |
| 4 | R××S | LL | D××Y |
| 5 | S××S | SP | LL |
| 6 | S×××××S | T××××P | L×××××L |
| 7 | S××E | P××××S | DY |
| 8 | S××××S | L××××L | V××L |
| 9 | S×R | T×P | YE |
| 10 | S××××E | T×××T | P×××P |
| 11 | SS | L×L | E××Y |
| 12 | RS | T×××××P | F×××V |
| 13 | R××××S | P××P | L×××L |
| 14 | S×××E | PE | P×××××P |
| 15 | R×××××S | PP | L×A |
| 16 | S×E | L××G | L××L |
| 17 | L×××L | S××T | D××S |
| 18 | R×××S | P×T | L××××L |
| 19 | E×E | RP | P×Y |
| 20 | L××L | L×××L | Y×××P |
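A ranking like the one above can be illustrated with a short sketch, assuming that IG stands for information gain and that each feature is reduced to a discrete value (here, whether a k-spaced pair is present in the fragment). Both points are assumptions, as are the function names and the toy data; the authors' exact procedure is not described in this record.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the entropy remaining after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels, top=20):
    """Return feature names sorted by decreasing information gain."""
    scored = sorted(((information_gain(col, labels), name)
                     for name, col in columns.items()), reverse=True)
    return [name for _, name in scored[:top]]


# Made-up presence/absence data for two pair features over four fragments.
labels = [1, 1, 0, 0]
columns = {"SP": [1, 1, 0, 0], "LL": [1, 0, 1, 0]}
print(rank_features(columns, labels, top=2))
```

In this toy data the first feature separates the classes perfectly and the second carries no information, so the first is ranked higher.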
Figure 4. Three Two Sample Logos of the position-specific residue composition surrounding phosphorylated and non-phosphorylated sites.
(A) Serine site logo, (B) threonine site logo, (C) tyrosine site logo. The three logos were generated using the web server http://www.twosamplelogo.org/, and only residues significantly enriched or depleted around phosphorylated sites (t-test, P<0.05) are shown.
Comparison of the two encoding schemes on the training dataset containing no ‘O’ residues.
| Site | Encoding scheme | Sn (%) | Sp (%) | Acc (%) | MCC |
| S | Binary | 80.29±0.52 | 84.43±0.54 | 82.36±0.45 | 0.648±0.023 |
| S | CKSAAP_PhSite | 84.41±0.37 | 85.46±0.48 | 84.94±0.59 | 0.699±0.007 |
| T | Binary | 62.13±0.28 | 88.61±0.72 | 73.82±0.47 | 0.512±0.004 |
| T | CKSAAP_PhSite | 78.36±0.67 | 81.72±0.36 | 79.59±0.71 | 0.568±0.028 |
| Y | Binary | 66.14±0.87 | 75.36±0.82 | 71.04±0.49 | 0.423±0.021 |
| Y | CKSAAP_PhSite | 72.15±0.63 | 76.18±0.51 | 74.16±0.52 | 0.484±0.005 |
Performance of DISPHOS, PPRED, NetPhos, and our predictors in terms of serine (S) site prediction on the independent dataset.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC |
| DISPHOS | 81.03 | 62.86 | 70.10 | 0.432 |
| PPRED | 72.62 | 56.42 | 62.87 | 0.286 |
| NetPhos | 78.90 | 55.64 | 64.91 | 0.343 |
| CKSAAP_PhSite | 79.45 | 78.03 | 78.59 | 0.566 |
Performance of DISPHOS, PPRED, NetPhos, and our predictors in terms of threonine (T) site prediction on the independent dataset.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC |
| DISPHOS | 70.06 | 73.04 | 71.93 | 0.421 |
| PPRED | 48.26 | 70.34 | 62.12 | 0.187 |
| NetPhos | 47.78 | 74.75 | 64.70 | 0.231 |
| CKSAAP_PhSite | 79.16 | 78.88 | 78.98 | 0.567 |
Performance of DISPHOS, PPRED, NetPhos, and our predictors in terms of tyrosine (Y) site prediction on the independent dataset.
| Method | Sn (%) | Sp (%) | Acc (%) | MCC |
| DISPHOS | 55.24 | 74.19 | 66.62 | 0.298 |
| PPRED | 43.01 | 65.35 | 56.42 | 0.084 |
| NetPhos | 45.80 | 69.30 | 59.92 | 0.154 |
| CKSAAP_PhSite | 52.10 | 79.53 | 68.58 | 0.329 |