Hui-Ju Kao, Chien-Hsun Huang, Neil Arvin Bretaña, Cheng-Tsung Lu, Kai-Yao Huang, Shun-Long Weng, Tzong-Yi Lee.
Abstract
Protein O-GlcNAcylation, involving the β-attachment of a single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues, is an O-linked glycosylation catalyzed by O-GlcNAc transferase (OGT). Molecular-level investigation of the basis for OGT's substrate specificity should aid understanding of how O-GlcNAc contributes to diverse cellular processes. Due to an increasing number of O-GlcNAcylated peptides with site-specific information identified by mass spectrometry (MS)-based proteomics, we were motivated to characterize the substrate site motifs of O-GlcNAc transferases. In this investigation, a non-redundant dataset of 410 experimentally verified O-GlcNAcylation sites was manually extracted from dbOGAP, OGlycBase and UniProtKB. After detection of conserved motifs using maximal dependence decomposition (MDD), a profile hidden Markov model (profile HMM) was adopted to learn a first-layered model for each identified OGT substrate motif. A support vector machine (SVM) was then used to generate a second-layered model learned from the output values of the profile HMMs in the first layer. The two-layered predictive model was evaluated using five-fold cross-validation, which yielded a sensitivity of 85.4%, a specificity of 84.1%, and an accuracy of 84.7%. Additionally, an independent testing set from PhosphoSitePlus, which was non-homologous to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (84.05%) and outperform other O-GlcNAcylation site prediction tools. A case study indicated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation. The method has been implemented as a web-based system, OGTSite, which is now freely available at http://csb.cse.yzu.edu.tw/OGTSite/.
Year: 2015 PMID: 26680539 PMCID: PMC4682369 DOI: 10.1186/1471-2105-16-S18-S10
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Data statistics of positive and negative training data.
| Data resource | Residue | Number of O-GlcNAcylated sites (positive data) | Number of non-O-GlcNAcylated sites (negative data) | Number of non-O-GlcNAcylated sites (balanced negative data) |
|---|---|---|---|---|
| dbOGAP | Serine | 250 | 18,570 | - |
| dbOGAP | Threonine | 142 | 11,240 | - |
| OGlycBase | Serine | 24 | 1,013 | - |
| OGlycBase | Threonine | 24 | 694 | - |
| UniProtKB | Serine | 66 | 4,851 | - |
| UniProtKB | Threonine | 51 | 3,255 | - |
| Combined (non-redundant) | Serine | 261 | 17,381 | 261 |
| Combined (non-redundant) | Threonine | 149 | 10,587 | 149 |
| Combined (non-redundant) | Serine + Threonine | 410 | 27,968 | 410 |
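The balanced negative column matches the number of positives per residue type (261 serine, 149 threonine). This record does not state how those negatives were selected, so the following is only a minimal sketch that assumes plain random sampling without replacement. The function name `balance_negatives` and the placeholder site identifiers are hypothetical; only the set sizes come from the table.

```python
import random

def balance_negatives(positives, negatives, seed=0):
    """Draw as many negatives as there are positives, without replacement.

    Assumption: uniform random sampling; the source does not specify the scheme.
    """
    if len(negatives) < len(positives):
        raise ValueError("not enough negatives to match the positives")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(negatives, len(positives))

# Placeholder identifiers standing in for real sequence windows.
# Only the counts (261/17,381 and 149/10,587) are taken from the table.
serine_pos = [f"S_pos_{i}" for i in range(261)]
serine_neg = [f"S_neg_{i}" for i in range(17381)]
threonine_pos = [f"T_pos_{i}" for i in range(149)]
threonine_neg = [f"T_neg_{i}" for i in range(10587)]

balanced_serine = balance_negatives(serine_pos, serine_neg)
balanced_threonine = balance_negatives(threonine_pos, threonine_neg)
balanced_negatives = balanced_serine + balanced_threonine
```

Sampling each residue type separately keeps the serine-to-threonine ratio of the negatives equal to that of the positives, which is what the per-residue counts in the table suggest.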
Figure 1. Analytical flowchart of MDD clustering.
Figure 2. Conceptual diagram of constructing the two-layered prediction model from MDD-identified substrate motifs.
Figure 3. Amino acid composition surrounding the O-GlcNAcylation sites. (A) Comparison of amino acid composition between positive data (410 O-GlcNAcylation sites) and negative data (27,968 non-O-GlcNAcylation sites). (B) Position-specific amino acid composition surrounding the O-GlcNAcylation sites. (C) TwoSampleLogo (p-value < 0.05) between positive data and negative data.
Figure 4. Tree view of potential OGT substrate motifs identified by MDD clustering on 410 O-GlcNAcylation sites.
Five-fold cross validation results on profile HMMs learned from all data and seven MDD-clustered subgroups.
| Models | Number of positive data | Number of negative data | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|
| Single HMM with all data | 410 | 410 | 68.8% | 70.7% | 69.8% | 0.395 |
| HMM with OGT1 | 100 | 100 | 93.0% | 89.0% | 91.0% | 0.821 |
| HMM with OGT2 | 105 | 105 | 83.8% | 71.4% | 77.6% | 0.557 |
| HMM with OGT3 | 95 | 95 | 85.3% | 75.8% | 80.5% | 0.613 |
| HMM with OGT4 | 39 | 39 | 71.8% | 74.4% | 73.1% | 0.462 |
| HMM with OGT5 | 30 | 30 | 73.3% | 73.3% | 73.3% | 0.467 |
| HMM with OGT6 | 19 | 19 | 78.9% | 73.7% | 76.3% | 0.527 |
| HMM with OGT7 | 22 | 22 | 72.7% | 68.2% | 70.5% | 0.410 |
| MDD-clustered HMMs (Combined 7 OGT HMMs) | 410 | 410 | 83.7% | 77.1% | 80.4% | 0.609 |
| Two-layered model (7 HMMs + 1 SVM) | 410 | 410 | 85.4% | 84.1% | 84.7% | 0.695 |
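How the five folds were drawn is not described in this record. As an illustration only, the sketch below partitions 820 sites (410 positives plus 410 balanced negatives) into five shuffled, unstratified folds; the function name `five_fold_indices` and the seed are hypothetical choices.

```python
import random

def five_fold_indices(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold partition.

    Assumption: simple shuffled folds, not necessarily the authors' scheme.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into n_folds groups.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# 410 positive + 410 balanced negative sites, as in the combined rows.
splits = list(five_fold_indices(820))
```

Each site appears in exactly one test fold, so the reported Sn, Sp, Acc and MCC can be pooled over all 820 held-out predictions.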
Comparison of independent testing results between our methods and three other O-GlcNAcylation prediction tools.
| Methods | TP | FN | TN | FP | Sn | Sp | Acc | MCC |
|---|---|---|---|---|---|---|---|---|
| Single HMM with all data | 609 | 347 | 40072 | 20904 | 63.70% | 65.72% | 65.69% | 0.076 |
| MDD-clustered HMMs | 833 | 123 | 45212 | 15764 | 87.13% | 74.15% | 74.38% | 0.171 |
| Two-layered model | 828 | 128 | 51224 | 9752 | 86.61% | 84.01% | 84.05% | 0.231 |
| YinOYang | 449 | 507 | 50619 | 10357 | 46.97% | 83.01% | 82.46% | 0.097 |
| O-GlcNAcScan | 411 | 545 | 51219 | 9757 | 42.99% | 84.00% | 83.37% | 0.089 |
| O-GlcNAcPRED | 554 | 402 | 38414 | 22562 | 57.95% | 63.00% | 62.92% | 0.053 |
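The Sn, Sp, Acc and MCC columns can be recomputed from the TP, FN, TN and FP counts using the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The sketch below assumes those standard formulas; the function name `binary_metrics` is hypothetical, and the counts are the two-layered model row from the table.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    sn = tp / (tp + fn)                     # sensitivity (recall on positives)
    sp = tn / (tn + fp)                     # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)   # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}

# Counts from the "Two-layered model" row of the independent test.
two_layered = binary_metrics(tp=828, fn=128, tn=51224, fp=9752)
for name, value in two_layered.items():
    print(f"{name}: {value:.4f}")
```

The independent set holds 956 positives against 60,976 negatives, so even at roughly 84% specificity the false positives (9,752) far outnumber the true positives (828). That imbalance is why MCC stays near 0.23 here while the balanced cross-validation table reports much higher values.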
Figure 5. A case study of O-GlcNAcylation site prediction on Synapsin-1 (Syn1) of Rattus norvegicus.