Theo Mauri, Laurence Menu-Bouaouiche, Muriel Bardor, Tony Lefebvre, Marc F Lensink, Guillaume Brysbaert.
Abstract
BACKGROUND: O-GlcNAcylation is an essential post-translational modification (PTM) in mammalian cells. It consists of the addition of an N-acetylglucosamine (GlcNAc) residue onto serines or threonines by an O-GlcNAc transferase (OGT). Inhibition of OGT is lethal, and misregulation of this PTM can lead to diverse pathologies including diabetes, Alzheimer's disease and cancers. Knowing the location of O-GlcNAcylation sites and being able to predict them accurately is therefore of prime importance for a better understanding of this process and its related pathologies.
Keywords: O-GlcNAc; OGT; dataset; glycosylation; machine learning; post-translational modification
Year: 2021 PMID: 34135600 PMCID: PMC8197665 DOI: 10.2147/AABC.S294867
Source DB: PubMed Journal: Adv Appl Bioinform Chem ISSN: 1178-6949
Definition of the Classes of Amino Acids
(A) Side chain size classes
| Class (code) | Amino acids |
|---|---|
| No residue (Empty) (E) | NA |
| Glycine (G) | Gly |
| Very Small (V) | Ala, Val |
| Small (S) | Ser, Thr, Ile, Leu, Cys |
| Normal (N) | Asp, Asn, Glu, Gln, Met |
| Long (L) | Arg, Lys |
| Aromatic (A) | Phe, Trp, Tyr, His |
| Proline (P) | Pro |
(B) Polarity classes
| Class (code) | Amino acids |
|---|---|
| Polar uncharged with hydroxyl group (A) | Ser, Thr |
| Polar uncharged with amide (B) | Asn, Gln |
| Positively charged polar (C) | Arg, Lys, His |
| Negatively charged polar (D) | Asp, Glu |
| Non-polar sulfur-containing (E) | Met, Cys |
| Non-polar aromatic (F) | Tyr, Phe, Trp |
| Non-polar aliphatic (G) | Ala, Val, Leu, Ile, Pro |
| Glycine (H) | Gly |
| No residue (I) | NA |
Abbreviations: Gly, Glycine; Ala, Alanine; Val, Valine; Ser, Serine; Thr, Threonine; Ile, Isoleucine; Leu, Leucine; Cys, Cysteine; Asp, Aspartic Acid; Asn, Asparagine; Glu, Glutamic Acid; Gln, Glutamine; Met, Methionine; Arg, Arginine; Lys, Lysine; Phe, Phenylalanine; Trp, Tryptophan; Tyr, Tyrosine; His, Histidine; Pro, Proline.
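As an illustration of how the two class definitions above could be applied, the following minimal sketch maps one-letter amino acid codes to the side chain size and polarity class codes and encodes a window of residues around a site. The function name `encode_window`, the 0-based site index and the choice of window half-width are assumptions made for this example; they are not taken from the authors' implementation. Positions that fall outside the sequence receive the "No residue" code of the corresponding table.

```python
# Side chain size classes, keyed by one-letter amino acid code.
SIZE_CLASSES = {
    "G": "G",                                           # Glycine
    "A": "V", "V": "V",                                 # Very Small
    "S": "S", "T": "S", "I": "S", "L": "S", "C": "S",   # Small
    "D": "N", "N": "N", "E": "N", "Q": "N", "M": "N",   # Normal
    "R": "L", "K": "L",                                 # Long
    "F": "A", "W": "A", "Y": "A", "H": "A",             # Aromatic
    "P": "P",                                           # Proline
}

# Polarity classes, keyed by one-letter amino acid code.
POLARITY_CLASSES = {
    "S": "A", "T": "A",                                 # hydroxyl group
    "N": "B", "Q": "B",                                 # amide
    "R": "C", "K": "C", "H": "C",                       # positively charged
    "D": "D", "E": "D",                                 # negatively charged
    "M": "E", "C": "E",                                 # Met, Cys
    "Y": "F", "F": "F", "W": "F",                       # aromatic
    "A": "G", "V": "G", "L": "G", "I": "G", "P": "G",   # aliphatic
    "G": "H",                                           # Glycine
}

EMPTY_SIZE = "E"      # "No residue" code in the size classes
EMPTY_POLARITY = "I"  # "No residue" code in the polarity classes


def encode_window(sequence, site, half_width, classes, empty):
    """Encode positions site-half_width..site+half_width as class codes.

    `site` is a 0-based index (an assumption of this sketch). Positions
    outside the sequence are encoded with the `empty` code. Residues
    missing from `classes` (for example X or U) raise a KeyError.
    """
    codes = []
    for i in range(site - half_width, site + half_width + 1):
        if 0 <= i < len(sequence):
            codes.append(classes[sequence[i]])
        else:
            codes.append(empty)
    return "".join(codes)


# Hypothetical toy peptide, serine at index 1, window of +/- 2 residues.
size_code = encode_window("MSTPGK", 1, 2, SIZE_CLASSES, EMPTY_SIZE)
polarity_code = encode_window("MSTPGK", 1, 2, POLARITY_CLASSES, EMPTY_POLARITY)
```

Note that the letter codes of the two schemes overlap (for instance "A" means Aromatic in one and hydroxyl-bearing polar in the other), so the two encodings should be kept as separate features.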
Figure 1. Steps of the machine learning training and testing. Representation of the machine learning pipeline, with over- and undersampling to create the training/testing data and the various models from the different algorithms.
Evaluation of Commonly Used Methods for O-GlcNAcylated Sites Prediction on Our Dataset
| | YoY + | YoY ++ | YoY +++ | YoY ++++ | OGP-II | OGTSite |
|---|---|---|---|---|---|---|
| TP | 267 | 172 | 79 | 21 | 358 | 270 |
| FP | 8158 | 3233 | 1068 | 221 | 8830 | 4084 |
| TN | 30507 | 35432 | 37597 | 38444 | 29835 | 34581 |
| FN | 283 | 378 | 471 | 529 | 192 | 280 |
| Sensitivity (%) | 48.55 | 31.27 | 14.36 | 3.82 | 65.09 (81.05) | 49.09 (85.4) |
| Specificity (%) | 78.97 | 91.67 | 97.25 | 99.43 | 77.16 (95.91) | 89.44 (84.1) |
| Precision (PPV) (%) | 3.17 | 5.05 | 6.89 | 8.68 | 3.90 | 6.20 |
| NPV (%) | 99.08 | 98.95 | 98.78 | 98.65 | 99.36 | 99.20 |
| Accuracy (%) | 78.55 | 90.47 | 96.04 | 98.32 | 76.99 (91.43) | 88.87 (84.7) |
| FDR (%) | 96.83 | 94.95 | 93.11 | 91.32 | 96.10 | 93.80 |
| Total | 39215 | 39215 | 39215 | 39215 | 39215 | 39215 |
Notes: Statistical measures for YinOYang (at different stringency thresholds; the more “+” signs, the more stringent), O-GlcNAcPred II and OGTSite. Where available, the published performance of each software on its own data is given in brackets.
Abbreviations: YoY, YinOYang; OGP-II, O-GlcNAcPred II; TP, True Positives; FP, False Positives; TN, True Negatives; FN, False Negatives; PPV, Positive Predictive Value; NPV, Negative Predictive Value; FDR, False Discovery Rate.
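The percentages in the table above follow from the four confusion-matrix counts. The sketch below uses the standard definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN), accuracy = (TP+TN)/total, FDR = FP/(FP+TP)) and applies them to the OGTSite column. The function name `confusion_metrics` is an assumption made for this example.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return standard binary classification measures, in percent."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
        "accuracy": 100 * (tp + tn) / total,
        "fdr": 100 * fp / (fp + tp),
    }


# Counts from the OGTSite column of the table.
ogtsite = confusion_metrics(tp=270, fp=4084, tn=34581, fn=280)
for name, value in ogtsite.items():
    print(f"{name}: {value:.2f}")
```

With only 550 positives among 39215 sites, even a specificity close to 90% leaves several thousand false positives, which is why the PPV stays in the single digits while the NPV and accuracy look high.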
Figure 2. WebLogo representing the proportion of amino acids around sites. WebLogo representing the proportion of each amino acid in a -/+ 10 frame around (A) O-GlcNAcylated sites and (B) non-O-GlcNAcylated sites.
Figure 3. Composition in side chain size classes. Composition of side chain size classes around (A) O-GlcNAcylated sites and (B) non-O-GlcNAcylated sites. Classes are detailed in Table 1A. Random corresponds to the composition of any position in a random sequence from UniProt.
Figure 4. Composition in polarity classes. Composition of polarity classes around (A) O-GlcNAcylated sites and (B) non-O-GlcNAcylated sites. Classes are detailed in Table 1B. Random corresponds to the composition of any position in a random sequence from UniProt.
Figure 5. Number of serine and threonine residues around positive and negative sites. Histograms representing the mean and median of the number of serine and threonine residues around positive (blue) and negative (red) sites.
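A count of neighbouring serines and threonines, such as the one summarised in Figure 5, could be computed along the following lines. This is a minimal sketch: the function name `count_ser_thr`, the 0-based site index, the default half-width of 10 (borrowed from the frame of Figure 2) and the exclusion of the central site from the count are all assumptions, since the caption does not specify them. The toy sequences are hypothetical.

```python
from statistics import mean, median


def count_ser_thr(sequence, site, half_width=10):
    """Count S and T residues within `half_width` positions of `site`.

    `site` is 0-based and is itself excluded from the count
    (both are assumptions of this sketch).
    """
    start = max(0, site - half_width)
    end = min(len(sequence), site + half_width + 1)
    return sum(
        1 for i in range(start, end)
        if i != site and sequence[i] in "ST"
    )


# Hypothetical (sequence, site index) pairs standing in for a site list.
toy_sites = [("ASTSAATK", 2), ("GGSGG", 2), ("TTTTT", 2)]
counts = [count_ser_thr(seq, pos) for seq, pos in toy_sites]
summary = (mean(counts), median(counts))
```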
Figure 6. Predicted flexibility of O-GlcNAcylated and non-O-GlcNAcylated sites. Flexibility predicted with DynaMine for positive (red) and negative (blue) datasets depending on the nature of the site: serine, threonine, or both.
Percentage of β-Like, α-Like and Other Backbone Angles from −3 to +2 of O-GlcNAcylated Sites (Positive) and Non O-GlcNAcylated Sites (Negative)
| | Positive | Negative |
|---|---|---|
| Beta-like | 61.91% | 59.36% |
| Alpha-like | 29.65% | 27.11% |
| Other | 8.44% | 13.53% |
Note: Predictions made with SPIDER 3.
Sensitivity and PPV of the Three ML Algorithms Tested on Undersampled (Equal) and Not Undersampled (Real) Data
| | Min (Equal/Real) | Max (Equal/Real) | Mean (Equal/Real) | Median (Equal/Real) | Standard Deviation (Equal/Real) |
|---|---|---|---|---|---|
| | 97.35 | 99.12 | 98.58 | 98.58 | 0.62 |
| | 47.46 | 51.61 | 48.98 | 48.69 | 1.36 |
| | 13.64 | 82.30 | 48.76 | 49.11 | 32.18 |
| | 13.51 | 86.11 | 47.04 | 46.08 | 27.47 |
| | 29.20 | 51.33 | 37.70 | 38.94 | 6.82 |
| | 31.13 | 43.94 | 36.53 | 37.33 | 4.13 |
| | 8.90 | 98.23 | 36.81 | 15.93 | 42.11 |
| | 11.11 | 48.88 | 30.79 | 29.16 | 14.62 |
| | 34.51 | 53.10 | 42.74 | 42.92 | 5.99 |
| | 33.05 | 43.48 | 37.76 | 37.59 | 3.70 |
| | 28.32 | 53.10 | 39.12 | 38.94 | 7.56 |
| | 28.32 | 53.10 | 39.12 | 38.94 | 4.34 |
Notes: Undersampled testing data contain equal numbers of positive and negative sites (50%/50%), whereas non-undersampled data keep the real proportions (1.4%/98.6%). Statistics that correspond to real data are set in bold. Blue background contains results for sensitivity, white background for PPV. Values are indicated in %.
Abbreviations: PPV, Positive Predictive Value; ML, Machine Learning; RF, Random Forest; GBT, Gradient Boosting Tree; SVM, Support Vector Machine.
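The difference between balanced (50%/50%) and real (1.4%/98.6%) test data matters mostly for PPV, because PPV depends on the proportion of positives whereas sensitivity and specificity do not. The sketch below illustrates this with Bayes' rule, assuming a classifier whose sensitivity and specificity stay fixed between the two settings. It reuses the OGTSite counts from the evaluation table (270 TP, 280 FN, 34581 TN, 4084 FP) purely as a worked example; it does not reproduce the results of the ML models in the table above, and the function name `ppv_at_prevalence` is an assumption.

```python
def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule.

    All inputs and the result are fractions in [0, 1].
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Sensitivity and specificity derived from the OGTSite counts.
sens = 270 / (270 + 280)
spec = 34581 / (34581 + 4084)

# Same classifier, two test-set compositions.
ppv_balanced = ppv_at_prevalence(sens, spec, 0.5)          # 50%/50%
ppv_real = ppv_at_prevalence(sens, spec, 550 / 39215)      # about 1.4% positives

print(f"PPV on balanced data: {100 * ppv_balanced:.1f}%")
print(f"PPV on real proportions: {100 * ppv_real:.1f}%")
```

Under these assumptions the same classifier goes from a PPV above 80% on balanced data to roughly 6% at the real proportion of positives, which is why results on undersampled test sets should not be read as expected performance on whole proteins.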
Figure 7. Network of GO terms (Molecular Functions) of partners of OGT. Network visualisation of GO terms (Molecular Functions) of proteins known to interact with the human OGT in the IMEx interaction database; enrichment performed with ClueGO (see Materials and Methods for parameters).