Yong-Zi Chen, Yu-Rong Tang, Zhi-Ya Sheng, Ziding Zhang.
Abstract
BACKGROUND: As one of the most common protein post-translational modifications, glycosylation is involved in a variety of important biological processes. Computational identification of glycosylation sites in protein sequences has become increasingly important in the post-genomic era. A new encoding scheme was employed to improve the prediction of mucin-type O-glycosylation sites in mammalian proteins.
Year: 2008 PMID: 18282281 PMCID: PMC2335299 DOI: 10.1186/1471-2105-9-101
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Prediction accuracy of O-glycosylation sites based on different encoding schemes^a
| Site | Encoding scheme | Feature selection | Sn (%) | Sp (%) | Ac (%) | MCC |
| S | Binary^b | No selection | 74.2 ± 1.7 | 81.9 ± 3.0 | 78.0 ± 1.9 | 0.567 ± 0.039 |
| | Binary^c | No selection | 76.5 ± 3.5 | 74.6 ± 3.6 | 75.6 ± 3.1 | 0.523 ± 0.060 |
| | CKSAAP | No selection | 77.9 ± 1.7 | 86.5 ± 3.0 | 82.2 ± 1.8 | 0.655 ± 0.037 |
| | CKSAAP^c | No selection | 79.0 ± 5.2 | 83.0 ± 2.4 | 81.0 ± 2.6 | 0.628 ± 0.050 |
| | CKSAAP | CC | 80.7 ± 3.3 | 85.6 ± 3.9 | | |
| | CKSAAP | IE | 82.1 ± 2.3 | 83.9 ± 3.8 | 83.0 ± 2.4 | 0.665 ± 0.048 |
| T | Binary^b | No selection | 74.8 ± 4.1 | 78.3 ± 1.7 | 76.6 ± 2.3 | 0.536 ± 0.045 |
| | Binary^c | No selection | 77.8 ± 3.4 | 76.6 ± 3.2 | 77.2 ± 2.4 | 0.548 ± 0.048 |
| | CKSAAP | No selection | 80.4 ± 2.2 | 82.3 ± 2.9 | 81.3 ± 2.3 | 0.631 ± 0.045 |
| | CKSAAP^c | No selection | 80.3 ± 1.9 | 85.7 ± 1.9 | 83.0 ± 1.8 | 0.666 ± 0.038 |
| | CKSAAP | CC | 80.3 ± 1.8 | 82.5 ± 2.3 | | |
| | CKSAAP | IE | 80.8 ± 1.5 | 81.9 ± 3.1 | 81.3 ± 2.2 | 0.631 ± 0.045 |
a The SVM-based prediction algorithm used the RBF kernel function. The CC-based feature selection resulted in the highest accuracy, and the corresponding values of Ac and MCC are shown in bold. Each measurement is given as the average value ± standard deviation. b In this encoding scheme, the window size was optimally set to 41. c The method was trained and tested on new negative-site datasets in which <40% identity between positive and negative sites was not required.
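The four measures reported in the table can be computed from confusion-matrix counts. The following is a minimal sketch that assumes the standard definitions of sensitivity (Sn), specificity (Sp), accuracy (Ac) and the Matthews correlation coefficient (MCC); the record itself does not spell out the formulas, and the function name and the example counts are hypothetical.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, Ac, MCC) from confusion-matrix counts.

    Sn, Sp and Ac are returned as percentages; MCC stays on its -1..1 scale.
    Standard definitions are assumed (not taken from the source).
    """
    sn = 100.0 * tp / (tp + fn)                    # sensitivity
    sp = 100.0 * tn / (tn + fp)                    # specificity
    ac = 100.0 * (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ac, mcc

# Hypothetical counts for a balanced test set (100 positives, 100 negatives).
sn, sp, ac, mcc = classification_metrics(tp=80, fn=20, tn=85, fp=15)
print(f"Sn={sn:.1f} Sp={sp:.1f} Ac={ac:.1f} MCC={mcc:.3f}")
```

On a balanced (1:1) dataset, Ac computed this way is simply the mean of Sn and Sp, which is consistent with the rows above (for example, 74.2 and 81.9 average to about 78.0).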
Comparison of CKSAAP_OGlySite with NetOGlyc 3.1
| Site | Method | Sn (%) | Sp (%) | Ac (%) | MCC |
| S | Binary^a,b | 49.7 ± 4.8 | 88.0 ± 0.8 | 81.7 ± 1.4 | 0.364 ± 0.054 |
| | CKSAAP_OGlySite^a,b | 56.7 ± 3.2 | 95.6 ± 0.4 | 89.1 ± 0.8 | 0.575 ± 0.040 |
| | NetOGlyc 3.1^b | 54.9 ± 0.3 | 91.6 ± 0.7 | 85.6 ± 0.5 | 0.473 ± 0.011 |
| T | Binary^a,b | 60.8 ± 0.8 | 85.4 ± 1.3 | 81.3 ± 1.2 | 0.416 ± 0.026 |
| | CKSAAP_OGlySite^a,b | 68.8 ± 1.7 | 92.9 ± 0.3 | 88.9 ± 0.2 | 0.608 ± 0.009 |
| | NetOGlyc 3.1^b | 76.9 ± 0.0 | 86.1 ± 0.6 | 84.6 ± 0.5 | 0.549 ± 0.009 |
a The method was trained and tested on datasets with a 1:5 ratio of O-glycosylation sites to non-glycosylation sites. b Each measurement is given as the average value ± standard deviation.
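With a 1:5 ratio of positive to negative sites, overall accuracy is dominated by specificity. A small sketch of that relation, assuming the standard definitions of Sn, Sp and Ac (the function and parameter names are hypothetical):

```python
def accuracy_from_rates(sn, sp, neg_per_pos=1.0):
    """Overall accuracy (%) implied by Sn and Sp (%) for a given class ratio.

    neg_per_pos is the number of negative sites per positive site,
    e.g. 5 for the 1:5 datasets.
    """
    return (sn + neg_per_pos * sp) / (1.0 + neg_per_pos)

# CKSAAP_OGlySite, S sites: Sn = 56.7, Sp = 95.6 on the 1:5 datasets.
print(round(accuracy_from_rates(56.7, 95.6, neg_per_pos=5), 1))  # 89.1
```

The result agrees with the reported Ac for that row. Because the table reports averages over several datasets, other rows need not match this identity to the last digit.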
The top 20 features selected by correlation coefficient (CC-) and information entropy (IE-) based methods
| Site | S | T | ||
| Top 20 features | CC | IE | CC | IE |
| 1 | ||||
| 2 | ||||
| 3 | AXXXP | |||
| 4 | SXS | TXXXT | ||
| 5 | TXXXXT | |||
| 6 | TXT | |||
| 7 | AXP | |||
| 8 | ||||
| 9 | SXXXS | |||
| 10 | TXXXP | TXA | ||
| 11 | TS | |||
| 12 | SXXXT | SXXXXP | SXXXP | |
| 13 | TP | |||
| 14 | SS | |||
| 15 | SP | PXXA | ||
| 16 | TXXA | SA | SXXXXT | |
| 17 | PXA | TXXS | ||
| 18 | ST | |||
| 19 | PP | |||
| 20 | AXXXXT | PXXXT | ||
a TXXP represents a 2-spaced amino acid pair of TP, where X stands for any amino acid. The same representation applies to the other k-spaced amino acid pairs. b The k-spaced amino acid pairs in bold type are those ranked among the top 20 features by both feature selection methods.
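The k-spaced pair notation in footnote a can be made concrete with a short sketch that counts such pairs in a sequence segment. This is only an illustration of the notation: the maximum spacing `k_max`, the function names and the example segment are assumptions, not values taken from the record.

```python
from collections import Counter

def pair_label(first, second, k):
    """Label a k-spaced pair, e.g. ('T', 'P', 2) -> 'TXXP'."""
    return first + "X" * k + second

def k_spaced_pair_counts(segment, k_max):
    """Count every k-spaced residue pair in a segment for k = 0..k_max."""
    counts = Counter()
    for k in range(k_max + 1):
        # Residues at positions i and i + k + 1 are separated by k residues.
        for i in range(len(segment) - k - 1):
            counts[pair_label(segment[i], segment[i + k + 1], k)] += 1
    return counts

# Hypothetical sequence segment, used only for illustration.
counts = k_spaced_pair_counts("TSAPTTP", k_max=4)
print(counts["TP"], counts["TXXP"], counts["TXXXXT"])
```

Each label counted this way corresponds to one candidate feature of the kind listed in the table (TP, TXXP, TXXXXT and so on).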
Figure 1. The occurrences of each amino acid in the top 500 k-spaced amino acid pairs. The top 500 k-spaced amino acid pairs resulted from the CC-based feature selection and were ranked by considering all the cross-validation tests.
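A correlation-based feature ranking of the kind referred to in the caption might look like the sketch below. It assumes that each feature is scored by the absolute Pearson correlation between its values and the class labels; the record does not give the exact definition of the CC criterion, so this scoring rule, the function names and the toy data are all assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def rank_features_by_cc(rows, labels, names):
    """Rank features by |correlation| between each feature column and the labels."""
    scored = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scored.append((abs(pearson(column, labels)), name))
    # Highest absolute correlation first.
    return [name for score, name in sorted(scored, key=lambda t: -t[0])]

# Toy data: three hypothetical features, four sites (1 = glycosylated, 0 = not).
names = ["TP", "SXS", "AXP"]
rows = [[2, 1, 0], [3, 0, 0], [0, 1, 1], [1, 0, 1]]
labels = [1, 1, 0, 0]
ranking = rank_features_by_cc(rows, labels, names)
print(ranking)
```

Taking the absolute value lets features that are depleted around glycosylation sites rank as highly as features that are enriched there; a top-N list is then just the first N names of the ranking.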
Performance of S+T predictor^a
| Datasets^b | Encoding | Sn (%) | Sp (%) | Ac (%) | MCC |
| 1:1 | CKSAAP^c,d | 82.9 ± 1.3 | 83.4 ± 1.8 | 83.2 ± 1.6 | 0.667 ± 0.033 |
| 1:5 | CKSAAP^c,d | 63.7 ± 1.7 | 95.1 ± 0.3 | 89.8 ± 0.4 | 0.617 ± 0.017 |
a In this predictor, the datasets of S and T sites were combined to train a single prediction model, so O-glycosylated S and T sites can be predicted by one predictor. b The predictor was trained and tested on datasets with two different ratios of O-glycosylation to non-glycosylation sites (i.e. 1:1 and 1:5). c Each measurement is given as the average value ± standard deviation. d No feature selection method was applied.
Figure 2. ROC curves of O-glycosylation site prediction based on balanced datasets. (A) Prediction of O-glycosylated S sites. (B) Prediction of O-glycosylated T sites. No feature selection was carried out for the CKSAAP encoding.
Figure 3. ROC curves of O-glycosylation site prediction based on 1:5 datasets. (A) Prediction of O-glycosylated S sites. (B) Prediction of O-glycosylated T sites. No feature selection was carried out for the CKSAAP encoding.
Prediction performance based on the datasets filtered by amino acid composition^a,b
| Site | Encoding scheme | Sn (%) | Sp (%) | Ac (%) | MCC |
| S | Binary^c | 73.9 ± 3.8 | 83.1 ± 5.9 | 78.5 ± 3.2 | 0.590 ± 0.068 |
| | CKSAAP^c,d | 79.3 ± 2.0 | 86.8 ± 2.0 | 83.1 ± 1.8 | 0.677 ± 0.032 |
| T | Binary^c | 77.7 ± 2.7 | 83.1 ± 3.0 | 80.4 ± 2.5 | 0.612 ± 0.052 |
| | CKSAAP^c,d | 81.1 ± 1.8 | 88.0 ± 1.1 | 84.5 ± 1.1 | 0.699 ± 0.023 |
a The predictors were based on datasets with a 1:1 ratio of O-glycosylation to non-glycosylation sites. b The cut-off value for the correlation coefficient between the amino acid compositions of any two sequence segments was set to 0.95. c Each measurement is given as the average value ± standard deviation. d No feature selection method was applied.
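The composition-based filtering in footnote b can be sketched as follows. This is an assumption-laden illustration: it takes the composition of a segment to be the fraction of each of the 20 standard amino acids, uses the Pearson correlation, and applies a greedy pass that keeps a segment only if its correlation with every previously kept segment does not exceed the cut-off. The record states only the 0.95 cut-off, not which segments were compared or in what order, and all names and example segments are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(segment):
    """Fraction of each of the 20 standard amino acids in a segment."""
    return [segment.count(aa) / len(segment) for aa in AMINO_ACIDS]

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def filter_by_composition(segments, cutoff=0.95):
    """Greedily keep segments whose composition correlates <= cutoff with all kept ones."""
    kept = []
    for seg in segments:
        comp = composition(seg)
        if all(pearson(comp, composition(k)) <= cutoff for k in kept):
            kept.append(seg)
    return kept

# Hypothetical segments: the second is a shuffled copy of the first.
segments = ["TPSATTPAPS", "SPATTPSTAP", "GKLVEDRWQN"]
filtered = filter_by_composition(segments)
print(filtered)
```

In this toy input the shuffled copy has the same composition as the first segment, so it exceeds the cut-off and is dropped, while the third segment shares no residues with the first and is kept.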