Jagat S Chauhan, Adil H Bhat, Gajendra P S Raghava, Alka Rao.
Abstract
Glycosylation is one of the most abundant post-translational modifications (PTMs) and is required for various structure/function modulations of proteins in a living cell. Although elucidated only recently in prokaryotes, this type of PTM is present across all three domains of life. In prokaryotes, two types of protein-glycan linkage are most widespread: N-linked, where a glycan moiety is attached to the amide group of Asn, and O-linked, where a glycan moiety is attached to the hydroxyl group of Ser/Thr/Tyr. Because of their biological ubiquity, significance, and technological applications, the study of prokaryotic glycoproteins is a fast-emerging area of research. Here we describe new Support Vector Machine (SVM) based algorithms (models) developed for predicting glycosylated residues (glycosites) with high accuracy in prokaryotic protein sequences. The models use binary profile of patterns, composition profile of patterns, and position-specific scoring matrix profile of patterns as training features. The study employs an extensive dataset of 107 N-linked and 116 O-linked glycosites extracted from 59 experimentally characterized glycoproteins of prokaryotes. This dataset includes validated N-glycosites from phyla Crenarchaeota, Euryarchaeota (domain Archaea), and Proteobacteria (domain Bacteria), and validated O-glycosites from phyla Actinobacteria, Bacteroidetes, Firmicutes, and Proteobacteria (domain Bacteria). In view of the current understanding that glycosylation occurs on folded proteins in bacteria, hybrid models have been developed using information on predicted secondary structures and accessible surface area in various combinations with the training features. Using these models, N-glycosites and O-glycosites could be predicted with an accuracy of 82.71% (MCC 0.65) and 73.71% (MCC 0.48), respectively. An evaluation of the best-performing models on 28 independent prokaryotic glycoproteins confirms the suitability of these models for predicting N- and O-glycosites in potential glycoproteins from the aforementioned organisms with reasonably high confidence. A web server, GlycoPP, implementing these models is freely available at http://www.imtech.res.in/raghava/glycopp/.
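To make the "profile of patterns" features concrete, the following is a minimal sketch of how a sequence window around a candidate residue could be turned into a binary profile and a composition profile. The window half-width (`flank`), the padding symbol `X`, and all function names are illustrative assumptions, not details taken from the abstract. The PSSM profile is omitted because it typically requires an external alignment search.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for positions beyond the sequence termini


def extract_window(seq, site, flank=10):
    """Return the pattern of length 2*flank+1 centred on 0-based index `site`."""
    padded = PAD * flank + seq + PAD * flank
    # After padding, the residue at `site` sits at index site+flank,
    # so the window starts at index `site` of the padded string.
    return padded[site:site + 2 * flank + 1]


def binary_profile(pattern):
    """One-hot encode each position: 21 bits (20 residues + padding) per position."""
    alphabet = AMINO_ACIDS + PAD
    vec = []
    for aa in pattern:
        bits = [0] * len(alphabet)
        # Non-standard residue codes are mapped to the padding symbol (assumption).
        bits[alphabet.index(aa if aa in alphabet else PAD)] = 1
        vec.extend(bits)
    return vec


def composition_profile(pattern):
    """Fraction of each of the 20 amino acids in the pattern (padding ignored)."""
    residues = [aa for aa in pattern if aa in AMINO_ACIDS]
    n = len(residues) or 1  # avoid division by zero for an all-padding pattern
    return [residues.count(aa) / n for aa in AMINO_ACIDS]


# Hypothetical sequence; the Asn at 0-based index 2 is the candidate site.
window = extract_window("MKNASTLLN", site=2, flank=3)
bpp = binary_profile(window)
cpp = composition_profile(window)
```

In a hybrid model of the kind described above, predicted secondary-structure and surface-accessibility values for the same window would be appended to these vectors before training the SVM.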
Year: 2012 PMID: 22808107 PMCID: PMC3392279 DOI: 10.1371/journal.pone.0040155
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
An evaluation of performances of some of the well-known models for glycosylation prediction on prokaryotic glycoproteins.
| Type of glycosylation | Prediction tool | Threshold | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
| N-linked | NetNglyc [1] | 0.5 | 81.75 | 10.16 | 34.41 | −0.11 |
| N-linked | NetNglyc [1] | 0.6 | 50.79 | 42.68 | 45.43 | −0.06 |
| N-linked | NetNglyc [1] | 0.7 | 15.87 | 76.02 | 55.65 | −0.09 |
| N-linked | NetNglyc [1] | 0.9 | 0.79 | 98.37 | 65.32 | −0.03 |
| N-linked | EnsembleGly [3] | 0.3 | 92.86 | 0.41 | 31.72 | −0.2 |
| N-linked | EnsembleGly [3] | 0.5 | 90.48 | 1.63 | 31.72 | −0.18 |
| N-linked | EnsembleGly [3] | 0.7 | 79.37 | 10.98 | 34.14 | −0.13 |
| N-linked | EnsembleGly [3] | 0.9 | 51.59 | 47.15 | 48.66 | −0.01 |
| O-linked | NetOglyc [2] | 0 | 8.38 | 95.64 | 87.96 | 0.05 |
| O-linked | EnsembleGly [3] | 0 | 9.5 | 93.37 | 86 | 0.03 |
Footnotes: (1: http://www.cbs.dtu.dk/services/NetNGlyc/, 2: http://www.cbs.dtu.dk/services/NetOGlyc-3.0/, 3: http://turing.cs.iastate.edu/EnsembleGly/).
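The table above shows sensitivity falling and specificity rising as the decision threshold increases. The sketch below illustrates that trade-off with a simple threshold sweep. The scores and labels are made-up illustrative values, not outputs of any of the tools listed, and the rule "score >= threshold means predicted glycosite" is an assumption.

```python
def sweep(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold.

    A residue is called a glycosite when its score is >= the threshold
    (assumed decision rule). `labels` holds 1 for true glycosites, 0 otherwise.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        rows.append((t, tp / (tp + fn), tn / (tn + fp)))
    return rows


# Hypothetical prediction scores for four glycosites and four non-glycosites.
scores = [0.95, 0.80, 0.65, 0.55, 0.72, 0.58, 0.40, 0.20]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
rows = sweep(scores, labels, thresholds=[0.5, 0.6, 0.7, 0.9])
for t, sens, spec in rows:
    print(f"threshold={t:.1f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

A predictor that discriminates poorly on a new class of proteins can only trade one error type for the other along this sweep, which is consistent with the near-zero MCC values reported at every threshold.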
Experimentally characterized glycan linkages at known glycosites of bacteria and archaea.
| Sugar linkage | Class | Example glycoproteins |
| | | |
| Glc-Asn | | Flagellin, Slg |
| βGalNAc-Asn, | | Flagellin, Slg, Cytochrome subunit |
| | | |
| Bac-Asn | | AcrA, PEB3, CgpA, HisJ, ZnuA, jlpA etc. |
| GlcNAc-Asn | | HmcA |
| Hexose-Asn, dihexose-Asn, Glu-Asn, Gal-Asn | | Adhesins |
| | | |
| Man-Ser/Thr | | Glycosidases, Cell surface lipoproteins, Secreted antigens, Superoxide dismutase, Heparinase, Chondroitinase etc. |
| Fucose | | Putative cell division proteins, exported proteins, outer membrane proteins etc. |
| β-GalNAc-Ser/Thr | | Slg |
| β-D-Gal-Ser/Thr | | Slg, SgsE, SgtA etc. |
| β-GlcNAc-Ser/Thr, HexNAc | | Glycocin F, Flagellin |
| Bac/DATDH-Ser | | Pilin, CcoP, CycB etc. |
| FucNAc-Ser | | Pilin |
| Rha-Ser/Thr, Deoxyhexose-Ser | | Flagellin |
Footnotes: Detailed information about the attached glycans and glycoproteins can be obtained from www.proglycprot.org.
An analysis of experimentally observed secondary structures in prokaryotic glycosites.
| Protein name (source organism) | PDB ID | Presence of glycan in structure | Validated glycosites in full-length protein sequence | Position of glycosites in PDB entry sequence | SS |
| | | | | | |
| Tetrabrachion (Staphylothermus marinus) | 1YBK, 1FE6 | – | N44, N605, N641, N685, N708, N1279, N1402 | N44 (N1279*) | |
| Chondroitinase ABC (Proteus vulgaris) | 1HN0 | – | N282, N338, N345, N515, N675, N856, N963 | N282, N963 & N675; N338, N345 & N515; N856 | |
| PotD (Escherichia coli) | 1POT, 1POY | – | N26, N62 | N26; N62 | |
| AcrA (Campylobacter jejuni) | 2K32, 2K33 (NMR) | Heptasaccharide | N123, N273 | N42 (N123*) | |
| PEB3 (Campylobacter jejuni) | 2HXW | – | N90 | N90 | |
| HmcA (Desulfovibrio gigas) | 1Z1N | Trisaccharide (NAG, NAA, any epimer of NAG), | N290 | N261 | |
| | | | | | |
| Chondroitinase-AC (Pedobacter heparinus) | 1CB8, 1HM2, 1HM3, 1HMU, 1HMW | Tetrasaccharide Man-(Rha)-GlcUA-Xyl, | S328, S455 | S328; S455 | |
| Chondroitinase-B (Pedobacter heparinus) | 1DBG, 1DBO, 1OFL, 1OFM | Heptasaccharide galactose-β(1–4)[galactose-α(1–3)](2-O-Me)fucose-β(1–4)xylose-β(1–4)glucuronic acid-α(1–2)[rhamnose-α(1–4)]mannose-α(1- | S234 | S234 | |
| Heparinase II (Pedobacter heparinus) | 2FUQ, 2FUT | Tetrasaccharide Man-(Rha)-GlcUA-Xyl (xylose-β(1–4)glucuronic acid-α(1–2)[rhamnose-α(1–4)]mannose-α(1- | T134 | T134 | |
| Fimbrial protein (Neisseria gonorrhoeae) | 2HI2, 2HIL, 2PIL, 1AY2 | Disaccharides α-D-galactopyranosyl-(1→3)-2,4-diacetamido-2,4-dideoxy-β-D-glucopyranoside (bacillosamine, Bac); Gal-DADDGlc; and GlcNAc-α1,3-Gal | S70 | S63 | |
| Glycocin F (Lactobacillus plantarum) | 2KUY (NMR) | Two N-Acetylglucosamines | S39 | S18 | |
| Endo-β-N-acetylglucosaminidase F3 (Flavobacterium meningosepticum) | 1EOM, 1EOK | – | T88 | T49 | |
Footnotes: All crystal structures are obtained from www.rcsb.org. All structures are at a resolution of 1.4 Å or above.
Symbols used: –: no sugar detected, *: corresponding position in full-length protein sequence, F: flexible regions with turns/loops/coils/bends or no assigned secondary structure, H: helix, B: beta sheet, 1: intradomain, 2: interdomain, 3: no assigned domain.
Figure 1. GlycoPP webserver schema.
A flowchart of the methodology used to develop the GlycoPP webserver for predicting N- and O-glycosites in prokaryotic protein sequences.
Figure 2. Sequence contexts of prokaryotic glycosites.
Two sample weblogos depicting amino acids enriched and depleted around prokaryotic N-glycosites (logo A) and prokaryotic O-glycosites (logo B), relative to the percentages of these amino acids around the corresponding non-glycosylated sites in prokaryotes. Similarly, logos C and D compare the probabilities of amino acids around prokaryotic N- and O-glycosites with the probabilities around eukaryotic N- and O-glycosites, respectively. The datasets of eukaryotic N- and O-glycosites used to generate the weblogos were obtained from SWISS-PROT (2011 release).
Figure 3. Predicted secondary structures around prokaryotic glycosites.
Average percentage of secondary structures predicted in and around N-glycosites (panel A) and O-glycosites (panel B) in prokaryotic glycoproteins. The graph indicates a general likelihood of locating a glycosylated residue in coils/turns in a protein.
Figure 4. Predicted surface accessibility of prokaryotic glycosites.
Average percentage of exposed and buried residues predicted in and around N-glycosites (panel A) and O-glycosites (panel B) in prokaryotic glycoproteins. The graph suggests that glycosylated residues are more accessible on the protein surface than non-glycosylated ones.
Combined performance statistics of SVM employing solo features and hybrid approaches in predicting N-glycosites (using balanced dataset).
| Feature | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC |
| CPP | 59.81 | 64.49 | 62.15 | 0.24 | 0.65019 |
| CPP+SS | 63.55 | 69.16 | 66.36 | 0.33 | 0.68731 |
| CPP+ASA | 71.03 | 69.16 | 70.09 | 0.40 | 0.77203 |
| CPP+SS+ASA | 70.09 | 67.29 | 68.69 | 0.37 | 0.71159 |
| BPP | | | | | |
| BPP+SS | 82.24 | 80.37 | 81.31 | 0.63 | 0.88453 |
| BPP+ASA | | | | | |
| BPP+SS+ASA | 84.11 | 80.37 | 82.24 | 0.65 | 0.88497 |
| PPP | 76.42 | 69.81 | 73.11 | 0.46 | 0.76833 |
| PPP+SS | 75.70 | 71.03 | 73.36 | 0.47 | 0.78880 |
| PPP+ASA | 77.57 | 71.03 | 74.30 | 0.49 | 0.78636 |
| PPP+SS+ASA | 75.70 | 71.96 | 73.83 | 0.48 | 0.79334 |
Footnotes: BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS-secondary structure and ASA- Accessible surface area.
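As a worked check on how the reported statistics relate to one another, the sketch below computes sensitivity, specificity, accuracy, and MCC from a confusion matrix using their standard definitions. The counts are back-calculated for the CPP row under the assumption that the balanced dataset contains 107 glycosites and 107 non-glycosites. They are an inference from the table, not figures reported in the source.

```python
from math import sqrt


def metrics(tp, fn, tn, fp):
    """Standard binary-classification statistics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is a correlation coefficient in [-1, 1]; define it as 0 when undefined.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, accuracy, mcc


# Inferred counts for the CPP row (assumes 107 positives and 107 negatives).
sens, spec, acc, mcc = metrics(tp=64, fn=43, tn=69, fp=38)
print(f"{100 * sens:.2f} {100 * spec:.2f} {100 * acc:.2f} {mcc:.2f}")
# prints: 59.81 64.49 62.15 0.24
```

The output matches the CPP row, which also shows that the MCC and AUC columns hold coefficients on a 0 to 1 scale (MCC can be negative) rather than percentages.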
Combined performance statistics of SVM classifiers employing solo features and hybrid approaches in predicting O-glycosites (using balanced dataset).
| Feature | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC | AUC |
| CPP | 68.10 | 72.41 | 70.26 | 0.41 | 0.74071 |
| CPP+SS | 70.69 | 71.55 | 71.12 | 0.42 | 0.75743 |
| CPP+ASA | 67.24 | 75.00 | 71.12 | 0.42 | 0.75780 |
| CPP+SS+ASA | 72.41 | 75.00 | 73.71 | 0.47 | 0.76955 |
| BPP | 66.38 | 67.24 | 66.81 | 0.34 | 0.73023 |
| BPP+SS | 69.83 | 68.10 | 68.97 | 0.38 | 0.74160 |
| BPP+ASA | 77.59 | 61.21 | 69.40 | 0.39 | 0.71143 |
| BPP+SS+ASA | 65.52 | 72.41 | 68.97 | 0.38 | 0.73766 |
| PPP | | | | | |
| PPP+SS | 73.28 | 73.28 | 73.28 | 0.47 | 0.76806 |
| PPP+ASA | 74.14 | 71.55 | 72.84 | 0.46 | 0.77341 |
| PPP+SS+ASA | | | | | |
Footnotes: BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS-secondary structure and ASA- Accessible surface area.
Comparative performances of existing well-known glycosylation prediction tools and GlycoPP models on independent dataset of prokaryotic glycoproteins.
| Prediction of N-glycosites | | | | | | |
| Models (Threshold) | NetNglyc [1] (0.5) | EnsembleGly [3] (0.7) | GlycoPP-BPP (−0.1) | GlycoPP-CPP (0.3) | GlycoPP-PPP (−0.2) | GlycoPP-BPP+ASA |
| Sensitivity (%) | 88.89 | 94.44 | 89.47 | 68.42 | 78.95 | |
| Specificity (%) | 25.00 | 11.36 | 73.68 | 73.68 | 73.68 | |
| Accuracy (%) | 43.55 | 35.48 | 81.58 | 71.05 | 76.32 | |
| MCC | 0.15 | 0.09 | 0.64 | 0.42 | 0.53 | |
| Prediction of O-glycosites | | | | | | |
| Models (Threshold) | | | | | | |
| Sensitivity (%) | 100.00 | 6.67 | 72.13 | 72.55 | 77.05 | |
| Specificity (%) | 3.19 | 93.05 | 73.77 | 68.18 | 70.49 | |
| Accuracy (%) | 8.27 | 88.28 | 72.95 | 70.36 | 73.77 | |
| MCC | 0.04 | | 0.46 | 0.41 | 0.48 | |
Footnotes: 1: http://www.cbs.dtu.dk/services/NetNGlyc/, 2: http://www.cbs.dtu.dk/services/NetOGlyc-3.0/, 3: http://turing.cs.iastate.edu/EnsembleGly/, BPP- Binary profile of patterns, CPP- Composition profile of patterns, PPP- PSSM profile of patterns, MCC- Matthews correlation coefficient, AUC- Area under curve, SS-secondary structure and ASA- Accessible surface area.