Kai-Yao Huang, Fang-Yu Hung, Hui-Ju Kao, Hui-Hsuan Lau, Shun-Long Weng.
Abstract
BACKGROUND: Protein phosphoglycerylation, the addition of 1,3-bisphosphoglyceric acid (1,3-BPG) to a lysine residue of a protein to form 3-phosphoglyceryl-lysine, is a reversible, non-enzymatic post-translational modification (PTM) that plays a regulatory role in glucose metabolism and glycolysis. As the number of experimentally verified phosphoglycerylated sites has increased significantly, statistical and machine learning methods have become essential for investigating the characteristics of these sites. Research into phosphoglycerylation is currently very limited, and only a few resources are available for the computational identification of phosphoglycerylation sites.
RESULT: We present a bioinformatics investigation of phosphoglycerylation sites based on sequence-based features. TwoSampleLogo analysis reveals that the regions surrounding phosphoglycerylation sites contain a relatively high proportion of positively charged amino acids, especially in the upstream flanking region. In addition, according to the PTM-Logo results, non-polar and aliphatic amino acids are more abundant around phosphoglycerylated lysines, which may play a functional role in discriminating between phosphoglycerylation and non-phosphoglycerylation sites. Several types of features were used to build prediction models on the training dataset, including amino acid composition (AAC), amino acid pair composition (AAPC), positional weighted matrix and position-specific scoring matrix (PSSM). To improve predictive power, the top features ranked by F-score were combined into the final feature set, and predictive models were trained using decision tree (DT), random forest (RF) and support vector machine (SVM) classifiers. Evaluation by five-fold cross-validation showed that the selected features were the most effective in discriminating between phosphoglycerylated and non-phosphoglycerylated sites.
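As an illustration of the simplest feature type named in the abstract, amino acid composition, the following is a minimal sketch of how a 20-dimensional composition vector might be computed for a sequence fragment centred on a lysine. The function name `aac_features`, the 15-residue window and the example fragment are assumptions made for illustration; the abstract does not state the window length or the implementation used.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# fragment maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac_features(fragment: str) -> list[float]:
    """Return the fraction of each standard amino acid in a fragment.

    Characters outside the 20 standard residues (for example a '-'
    used to pad fragments near a protein terminus) are ignored.
    """
    residues = [aa for aa in fragment.upper() if aa in AMINO_ACIDS]
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    counts = Counter(residues)
    return [counts[aa] / len(residues) for aa in AMINO_ACIDS]


# Hypothetical 15-residue fragment with a lysine (K) at the centre;
# it is not taken from the dataset described in the abstract.
example = "ARLGKEVKAALGRKE"
features = aac_features(example)
k_fraction = features[AMINO_ACIDS.index("K")]
```

Each fragment then contributes one fixed-length vector, which is the form a DT, RF or SVM classifier expects as input.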
Keywords: 3-Phosphoglyceryl-lysine (pgK); Post-translational modification (PTM); Protein phosphoglycerylation; Sequence-based features
Year: 2020 PMID: 33297954 PMCID: PMC7727188 DOI: 10.1186/s12859-020-03916-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Composition of amino acids surrounding phosphoglycerylation sites. a Comparison of AAC between 89 positive and 178 negative sequences. b Position-specific AAC of 89 phosphoglycerylated fragment sequences. c Comparison of position-specific AAC between phosphoglycerylated and non-phosphoglycerylated sequences based on TwoSampleLogo analysis
Fig. 2 The frequency differences of 20 × 20 amino acid pairs between phosphoglycerylated sites and non-phosphoglycerylated sites
Fig. 3 The motif analysis based on position-specific amino-acid probability backgrounds of 89 phosphoglycerylated sequences
Five-fold cross-validation results of the DT, RF and SVM models trained on a single type of feature
| Training feature | Classifier | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|---|
| AAC | DT | 59.6 | 55.6 | 56.9 | 0.14 |
| AAC | RF | 59.6 | 59.0 | 59.2 | 0.18 |
| AAC | SVM | 56.2 | 59.6 | 58.4 | 0.15 |
| AAPC | DT | 59.6 | 47.8 | 51.7 | 0.07 |
| AAPC | RF | 48.3 | 62.9 | 58.1 | 0.11 |
| AAPC | SVM | 60.7 | 47.8 | 52.1 | 0.08 |
| B62 | DT | 44.9 | 70.8 | 62.2 | 0.16 |
| B62 | RF | 55.1 | 55.1 | 55.1 | 0.10 |
| B62 | SVM | 51.7 | 43.3 | 46.1 | −0.05 |
| PSSM | DT | 34.8 | 71.9 | 59.6 | 0.07 |
| PSSM | RF | 58.4 | 52.2 | 54.3 | 0.10 |
| PSSM | SVM | 39.3 | 59.6 | 52.8 | −0.01 |
| PCAAC | DT | 50.6 | 71.3 | 64.4 | 0.22 |
| PCAAC | RF | 50.6 | 50.6 | 50.6 | 0.01 |
| PCAAC | SVM | 58.4 | 68.0 | 64.8 | 0.25 |
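The four measures in these tables can be read as the standard confusion-matrix quantities. The sketch below assumes those standard definitions and uses counts back-calculated from the AAC + DT row together with the training set sizes (89 positive, 178 negative sites). The counts are an inference made for illustration, not values reported in the source, and `binary_metrics` is a hypothetical helper name.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Compute sensitivity, specificity, accuracy and MCC from counts."""
    sensitivity = tp / (tp + fn)   # fraction of true sites recovered
    specificity = tn / (tn + fp)   # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; report 0.0 then.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Assumed counts: 53 of 89 positives and 99 of 178 negatives classified
# correctly, chosen so that the percentages match the AAC + DT row.
m = binary_metrics(tp=53, fn=36, tn=99, fp=79)
print({name: round(value, 3) for name, value in m.items()})
```

With a 1:2 ratio of positive to negative sites, accuracy is dominated by the negative class, so MCC is the more informative single number when comparing rows.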
Five-fold cross-validation results of the DT, RF and SVM models trained with multiple types of features
| Training feature | Classifier | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|---|
| AAC + AAPC | DT | 53.9 | 50.6 | 51.7 | 0.04 |
| AAC + AAPC | RF | 58.4 | 60.1 | 59.6 | 0.18 |
| AAC + AAPC | SVM | 53.9 | 57.3 | 56.2 | 0.11 |
| AAC + B62 | DT | 42.7 | 66.3 | 58.4 | 0.09 |
| AAC + B62 | RF | 59.6 | 59.0 | 59.2 | 0.18 |
| AAC + B62 | SVM | 68.5 | 34.8 | 46.1 | 0.03 |
| AAC + PSSM | DT | 32.6 | 64.6 | 53.9 | −0.03 |
| AAC + PSSM | RF | 59.6 | 59.0 | 59.2 | 0.18 |
| AAC + PSSM | SVM | 39.3 | 59.6 | 52.8 | −0.01 |
| AAPC + PSSM | DT | 31.5 | 69.7 | 56.9 | 0.01 |
| AAPC + PSSM | RF | 62.9 | 54.5 | 57.3 | 0.16 |
| AAPC + PSSM | SVM | 39.3 | 59.6 | 52.8 | −0.01 |
| AAC + AAPC + B62 | DT | 40.4 | 63.5 | 55.8 | 0.04 |
| AAC + AAPC + B62 | RF | 60.7 | 54.5 | 56.6 | 0.14 |
| AAC + AAPC + B62 | SVM | 68.5 | 34.3 | 45.7 | 0.03 |
| AAC + AAPC + PSSM | DT | 31.5 | 64.6 | 53.6 | −0.04 |
| AAC + AAPC + PSSM | RF | 62.9 | 62.9 | 62.9 | 0.24 |
| AAC + AAPC + PSSM | SVM | 69.7 | 39.3 | 49.4 | 0.09 |
Fig. 4 Comparison of ROC curves among the models trained using various features based on five-fold cross-validation
Five-fold cross-validation results of the DT, RF and SVM models trained with the selected features
| Classifier | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|
| DT | 59.6 | 58.4 | 58.8 | 0.17 |
| RF | 70.8 | 70.8 | 70.8 | 0.40 |
| SVM | 77.5 | 73.6 | 74.9 | 0.49 |
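The selected features behind this table were, according to the abstract, the top features ranked by F-score. The source does not spell out the formula, so the sketch below assumes a commonly used two-class definition: the squared distances of the class means from the overall mean, divided by the sum of the within-class sample variances. The function names and the toy data are hypothetical.

```python
from statistics import mean, variance


def f_score(pos: list[float], neg: list[float]) -> float:
    """Score one feature from its values in the positive and negative sets.

    Assumed definition: between-class separation of the means divided by
    the sum of the within-class sample variances. Each class needs at
    least two values for the sample variance to be defined.
    """
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    return between / within if within else 0.0


def rank_features(x_pos: list[list[float]], x_neg: list[list[float]]) -> list[int]:
    """Return feature indices ordered from highest to lowest F-score."""
    n_features = len(x_pos[0])
    scores = [
        f_score([row[j] for row in x_pos], [row[j] for row in x_neg])
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Toy data (not from the paper): feature 0 separates the two classes,
# feature 1 has the same mean in both and should rank last.
x_pos = [[0.9, 0.5], [0.8, 0.4], [1.0, 0.6]]
x_neg = [[0.1, 0.5], [0.2, 0.6], [0.0, 0.4]]
ranking = rank_features(x_pos, x_neg)
```

Keeping only the leading entries of `ranking` gives a reduced feature set; how many features to keep would normally be chosen by cross-validation, and the abstract does not report the number used.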
Comparison of independent testing results between our method and the available prediction tools
| Classifier | Sensitivity (%) | Specificity (%) | Accuracy (%) | MCC |
|---|---|---|---|---|
| Phogly-PseAAC | 59.5 | 67.4 | 67.2 | 0.09 |
| iPGK-PseAAC | 37.8 | 96.2 | 94.5 | 0.27 |
| iDPGK (our method) | 75.7 | 64.9 | 70.3 | 0.41 |
Fig. 5 Comparison of the predictive performance between the proposed models and existing prediction tools based upon independent testing
Fig. 6 The analytical flowchart of the identification of protein phosphoglycerylation sites
Data statistics of the training and testing datasets after the removal of homologous sequences using the CD-HIT program
| Sequence identity cut-off | Number of phosphoglycerylation sites | Number of non-phosphoglycerylation sites |
|---|---|---|
| Raw data | 150 | 3997 |
| 90% | 107 | 3031 |
| 80% | 104 | 2610 |
| 70% | 98 | 2319 |
| 60% | 96 | 2040 |
| 50% | 93 | 1845 |
| 40% | 89 | 1318 |
| Training data | 89 | 178 |
| Independent testing data | 37 | 74 |