Wang-Ren Qiu, Meng-Yue Guan, Qian-Kun Wang, Li-Liang Lou, Xuan Xiao.
Abstract
Pupylation is an important post-translational modification of proteins and plays a key role in the cell function of microorganisms. Accurate prediction of pupylation proteins and their specific sites is of great significance for the study of basic biological processes and the development of related drugs, since it would greatly reduce experimental costs and improve work efficiency. In this work, we first constructed a model for identifying pupylation proteins. To improve the pupylation protein prediction model, the KNN scoring matrix model based on functional domain GO annotation and the Word Embedding model were used to extract features, and Random Under-sampling (RUS) and the Synthetic Minority Over-sampling Technique (SMOTE) were applied to balance the dataset. Finally, the balanced data sets were input into Extreme Gradient Boosting (XGBoost). In 10-fold cross-validation, the accuracy (ACC), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC) were 95.23%, 0.8100, and 0.9864, respectively. For the pupylation site prediction model, six feature encodings (i.e., TPC, AAI, One-hot, PseAAC, CKSAAP, and Word Embedding) were used to extract protein sequence features, and the chi-square test was employed for feature selection. Rigorous 10-fold cross-validation indicated that the model achieves high accuracy and outperforms its existing counterparts. Finally, for the convenience of researchers, PUP-PS-Fuse has been established at https://bioinfo.jcu.edu.cn/PUP-PS-Fuse, with http://121.36.221.79/PUP-PS-Fuse/ as a backup.
Keywords: chi-square test; multiple features; post-translational modification; pupylation; word embedding
Year: 2022 PMID: 35557849 PMCID: PMC9088680 DOI: 10.3389/fendo.2022.849549
Source DB: PubMed Journal: Front Endocrinol (Lausanne) ISSN: 1664-2392 Impact factor: 6.055
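The abstract and the tables report ACC, Sn, Sp, and MCC. As a reminder of how these are conventionally derived from confusion-matrix counts, here is a minimal Python sketch. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Return ACC, Sn, Sp and MCC (as fractions) from confusion-matrix counts.

    Assumes both classes are non-empty, i.e. tp + fn > 0 and tn + fp > 0.
    """
    total = tp + fn + tn + fp
    acc = (tp + tn) / total   # accuracy
    sn = tp / (tp + fn)       # sensitivity: recall on positive samples
    sp = tn / (tn + fp)       # specificity: recall on negative samples
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts for illustration only (not results from the paper).
m = binary_metrics(tp=90, fn=10, tn=80, fp=20)
print({name: round(value, 4) for name, value in m.items()})
```

MCC is useful alongside ACC on imbalanced data because it uses all four cells of the confusion matrix, so a classifier that simply predicts the majority class scores near zero.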
Figure 1. The framework of PUP-PS-Fuse. Rounded squares represent data sets, cylinders represent feature extraction methods, rectangles and ellipses represent feature selection methods, and diamonds represent classifiers. RUS stands for Random Under-sampling and SMOTE for Synthetic Minority Over-sampling Technique.
Data sets for the prediction of pupylation proteins and pupylation sites.
| Datasets | Positive | Negative | Ratio |
|---|---|---|---|
| Pupylation proteins | 201 | 1126 | 1:5.6 |
| Pupylation site training | 186 | 186 | 1:1 |
| Pupylation site test | 87 | 191 | 1:2.2 |
Positive represents the number of positive samples, and Negative represents the number of negative samples.
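The Ratio column follows directly from the counts: it is the number of negative samples per positive sample, rounded to one decimal place. A small Python check, using only the counts listed in the table (the helper name `negatives_per_positive` is hypothetical):

```python
# Positive and negative sample counts copied from the table above.
datasets = {
    "Pupylation proteins": (201, 1126),
    "Pupylation site training": (186, 186),
    "Pupylation site test": (87, 191),
}


def negatives_per_positive(positive, negative):
    """Number of negative samples per positive sample, to one decimal."""
    return round(negative / positive, 1)


ratios = {name: negatives_per_positive(p, n) for name, (p, n) in datasets.items()}
for name, r in ratios.items():
    print(f"{name}: 1:{r}")
```

Only the protein data set is strongly imbalanced, which is why the balancing step matters for the protein model.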
The prediction results of different feature extraction and balance methods for predicting pupylation proteins.
| Data | Feature | ACC (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| Unbalanced | GO-KNN | 94.36 | 77.08 | 97.45 | 0.7731 | 0.9530 |
| | CBOW | 91.91 | 67.42 | 96.27 | 0.6700 | 0.9553 |
| | PUP-P-Fuse | 92.07 | 60.25 | 97.77 | 0.6615 | 0.9647 |
| Balanced | PUP-P-Fuse | | | 96.00 | | |
GO-KNN and CBOW represent two feature extraction methods for predicting pupylation proteins, and PUP-P-Fuse is a fusion of the above two methods.
The bold values indicate the best performance in each column for the same metric; the same convention applies to the following tables.
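To illustrate the balancing step compared above, the following is a simplified pure-Python sketch of random under-sampling of the majority class followed by SMOTE-style interpolation in the minority class. It is not the authors' implementation: the toy 2-D data, the target size of 60 per class, `k=5`, and the function names are all assumptions made for illustration.

```python
import math
import random


def random_undersample(majority, n_keep, rng):
    """RUS: keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, n_keep)


def smote(minority, n_new, k, rng):
    """SMOTE idea: each synthetic point lies on the segment between a minority
    sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data (assumption): 20 minority and 112 majority points in 2-D.
rng = random.Random(0)
minority = [[rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)] for _ in range(20)]
majority = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(112)]

target = 60  # hypothetical per-class size after balancing
majority_bal = random_undersample(majority, target, rng)
minority_bal = minority + smote(minority, target - len(minority), k=5, rng=rng)
print(len(majority_bal), len(minority_bal))
```

Combining the two techniques avoids discarding most of the majority class (as RUS alone would) and avoids generating a very large share of synthetic minority samples (as SMOTE alone would).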
Figure 2. The prediction results of different features on balanced data for predicting pupylation proteins.
The prediction results of different classifiers for predicting pupylation proteins.
| Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| XGBoost | | | | | |
| Ensemble Learning | 93.87 | 90.61 | 94.48 | 0.7874 | 0.9788 |
| SVM | 91.36 | 93.65 | 90.96 | 0.7335 | 0.9689 |
| RF | 92.87 | 82.40 | 94.75 | 0.7355 | 0.9703 |
| KNN | 83.88 | 96.90 | 81.55 | 0.6104 | 0.9585 |
Figure 3. ROC curves of different classifiers for predicting the pupylation protein.
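The AUC values in the tables summarize ROC curves such as these in a single number. One standard way to compute AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal Python sketch follows; the labels and scores are made up for illustration and are not data from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores: 3 positives, 4 negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

Here 11 of the 12 positive-negative pairs are ordered correctly, so the AUC is 11/12.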
The prediction results of different classifiers on the testing set of pupylation proteins.
| Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| XGBoost | 84.66 | 80.99 | 86.62 | 0.6630 | 0.9251 |
| Ensemble Learning | 85.34 | 80.96 | 87.41 | 0.6738 | 0.9376 |
| SVM | 85.48 | 88.78 | 83.85 | 0.6955 | 0.9317 |
| RF | 84.55 | 79.15 | 87.79 | 0.6571 | 0.9270 |
| KNN | 78.56 | 83.97 | 75.61 | 0.5653 | 0.8868 |
The effect of different feature extraction methods on the training set of pupylation sites.
| Features | ACC (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| AAI | 56.71 | 56.21 | 57.52 | 0.1380 | 0.6148 |
| One-Hot | 57.49 | 59.49 | 55.95 | 0.1550 | 0.6296 |
| PseAAC | 61.56 | 62.00 | 61.64 | 0.2367 | 0.6597 |
| Word Embedding | 69.92 | 73.36 | 66.55 | 0.4001 | 0.7645 |
| CKSAAP | 68.84 | 68.92 | 69.20 | 0.3818 | 0.7596 |
| TPC | 70.36 | 70.69 | 70.65 | 0.4143 | 0.7697 |
| PUP-S-Fuse | | | | | |
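Of the encodings compared above, CKSAAP (composition of k-spaced amino acid pairs) is easy to show in a few lines: for each gap size, it records the frequency of every ordered residue pair separated by that gap. The sketch below is an illustrative Python version under stated assumptions; the maximum gap, the normalization by the number of valid pairs, and the example peptides are not taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def cksaap(seq, max_gap=2):
    """Return 400 * (max_gap + 1) pair frequencies for gaps 0..max_gap."""
    features = []
    for gap in range(max_gap + 1):
        counts = [0] * len(PAIRS)
        n_pairs = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in PAIR_INDEX:  # skip non-standard residues or padding
                counts[PAIR_INDEX[pair]] += 1
                n_pairs += 1
        features.extend(c / n_pairs if n_pairs else 0.0 for c in counts)
    return features


# Hypothetical peptide window centred on a lysine (illustration only).
window = "MKTAYIAKQRQISFV"
vec = cksaap(window, max_gap=2)
print(len(vec))

# Tiny example: "AKAK" has the adjacent pairs AK, KA, AK.
example = cksaap("AKAK", max_gap=1)
print(example[PAIR_INDEX["AK"]])
```

TPC (tripeptide composition) follows the same counting pattern, but over contiguous triples, which yields 8,000 features instead of 400 per gap.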
The effect of the feature fusion PUP-S-Fuse, using the chi-square test, for predicting pupylation sites.
| Features | ACC (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| K = 200 | 89.09 | 88.82 | 89.57 | 0.7830 | 0.9531 |
| K = 400 | 91.21 | 92.89 | 89.52 | 0.8256 | 0.9565 |
| K = 600 | | | | | 0.9599 |
| K = 800 | 91.99 | 93.31 | 90.55 | 0.8400 | 0.9634 |
| K = 1,000 | 92.00 | 92.27 | | 0.8394 | |
| K = 1,200 | 90.70 | 91.77 | 89.70 | 0.8145 | 0.9604 |
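The table above varies K, which presumably is the number of fused features retained after chi-square ranking. As a hedged illustration of that kind of selection, the Python sketch below scores each non-negative feature by comparing its per-class sum with the sum expected if the feature were independent of the label, then keeps the K highest-scoring columns. Whether the authors used exactly this variant of the chi-square statistic is not stated; the function names and the toy matrix are assumptions.

```python
def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against the labels."""
    n = len(y)
    classes = sorted(set(y))
    class_prob = {c: sum(1 for label in y if label == c) / n for c in classes}
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob[c] * total  # under independence from the label
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k best features and the reduced matrix."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Toy matrix (assumption): features 0 and 2 depend on the class, 1 and 3 do not.
X = [
    [3, 1, 0, 2],
    [4, 1, 0, 2],
    [5, 1, 1, 2],
    [0, 1, 4, 2],
    [1, 1, 5, 2],
    [0, 1, 3, 2],
]
y = [1, 1, 1, 0, 0, 0]
keep, X_selected = select_top_k(X, y, k=2)
print(keep)
```

In practice K is tuned by cross-validation, as the table does: too few features discard signal, while too many reintroduce noisy columns.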
The prediction results of different classifiers for predicting pupylation sites.
| Algorithms | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| EL | | 93.97 | | | 0.9599 |
| SVM | 91.72 | | 88.59 | 0.8377 | |
| RF | 86.72 | 87.50 | 86.24 | 0.7361 | 0.9347 |
| KNN | 81.34 | 90.37 | 75.63 | 0.6706 | 0.9388 |
| XGBoost | 78.49 | 79.02 | 77.94 | 0.5703 | 0.8622 |
EL, ensemble learning.
Figure 4. The ROC curves of different classification methods for predicting pupylation sites (EL, ensemble learning).
Comparison of methods on the independent dataset for predicting pupylation sites.
| Methods | Acc (%) | Sn (%) | Sp (%) | MCC | AUC |
|---|---|---|---|---|---|
| iPUP | 73 | 40 | 88 | 0.32 | |
| GPS-PUP | 68 | 21 | 89 | 0.13 | |
| PUPS | 67 | 17 | 89 | 0.08 | |
| pbPUP | 79 | 48 | 82 | 0.45 | |
| PUP-Fuse | 82 | 59 | 91 | 0.55 | |
| PUP-S-Fuse | | | | | |