Xuanguo Nan, Lingling Bao, Xiaosa Zhao, Xiaowei Zhao, Arun Kumar Sangaiah, Gai-Ge Wang, Zhiqiang Ma.
Abstract
Protein pupylation is a type of post-translational modification that plays a crucial role in the cellular function of bacterial organisms. To gain better insight into the mechanisms underlying pupylation, an initial but important step is to identify pupylation sites. To date, several computational methods have been established for the prediction of pupylation sites; these usually construct the negative samples artificially from verified pupylated proteins to train the classifiers. However, if this process is not done properly, it can dramatically degrade the performance of the final predictor. In this work, unlike previous computational methods, we propose an enhanced positive-unlabeled learning algorithm (EPuL) for the pupylation site prediction problem, which uses only positive and unlabeled samples. First, we separate the training dataset into a positive dataset and an unlabeled dataset, which contains the remaining non-annotated lysine residues. Then, the EPuL algorithm selects an initial set of reliable negatives and iteratively picks out further non-pupylation sites. In 10-fold cross-validation, the proposed method achieved an accuracy of 90.24%, an area under the curve (AUC) of 0.93, and a Matthews correlation coefficient (MCC) of 0.81. A user-friendly web server for predicting pupylation sites was developed and made freely available at http://59.73.198.144:8080/EPuL.
Keywords: positive-unlabeled learning algorithm; prediction; pupylation sites; support vector machine; web server
Year: 2017 PMID: 28872627 PMCID: PMC6151806 DOI: 10.3390/molecules22091463
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
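The two-step idea described in the abstract (pick an initial set of reliable negatives from the unlabeled lysines, then iteratively move further likely non-pupylation sites into the negative set) can be illustrated with a minimal sketch. This is not the EPuL algorithm itself: the keywords mention a support vector machine, whereas this sketch swaps in a plain nearest-centroid rule so that it runs with the standard library only. The function names, the `seed_fraction` parameter, and the "farthest from the positive centroid" seeding rule are all assumptions made for illustration.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def pu_two_step(positives, unlabeled, seed_fraction=0.3, max_iter=10):
    """Illustrative two-step positive-unlabeled scheme (NOT the published EPuL).

    Returns (negatives, remaining): the samples moved into the negative set
    and the samples that stay unlabeled.
    """
    pos_c = centroid(positives)

    # Step 1 (assumed rule): seed the reliable negatives with the unlabeled
    # samples that lie farthest from the positive centroid.
    ranked = sorted(unlabeled, key=lambda x: dist(x, pos_c), reverse=True)
    n_seed = max(1, int(len(ranked) * seed_fraction))
    negatives, remaining = ranked[:n_seed], ranked[n_seed:]

    # Step 2: repeatedly refit the negative centroid and move every remaining
    # sample that is closer to it than to the positive centroid.
    for _ in range(max_iter):
        neg_c = centroid(negatives)
        newly = [x for x in remaining if dist(x, neg_c) < dist(x, pos_c)]
        if not newly:
            break  # nothing changed, so the split has converged
        negatives.extend(newly)
        remaining = [x for x in remaining if dist(x, neg_c) >= dist(x, pos_c)]
    return negatives, remaining
```

In a real pipeline each tuple would be a feature vector for one lysine-centered fragment, and the centroid rule would be replaced by whichever classifier is actually being trained.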
Ten-fold cross-validation performance of EPuL, PUL-PUP, PSoL and SVM_balance.
| Method | Sn (%) | Sp (%) | ACC (%) | MCC | AUC |
|---|---|---|---|---|---|
| EPuL | 84.21 | 95.45 | 90.24 | 0.81 | 0.93 |
| PUL-PUP | 82.24 | 91.57 | 88.92 | 0.74 | 0.92 |
| PSoL | 67.50 | 73.60 | 70.55 | 0.42 | 0.80 |
| SVM_balance | 76.71 | 63.65 | 69.88 | 0.40 | 0.77 |
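The columns Sn, Sp, ACC and MCC are confusion-matrix metrics. The record does not spell out their formulas, so the sketch below assumes the standard definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The function name `binary_metrics` and the example counts in the usage note are hypothetical and are not taken from the study.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sn = tp / (tp + fn)                    # sensitivity: recall on positives
    sp = tn / (tn + fp)                    # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "ACC": acc, "MCC": mcc}
```

For example, `binary_metrics(tp=80, fn=20, tn=90, fp=10)` gives Sn = 0.80, Sp = 0.90, ACC = 0.85 and an MCC of roughly 0.70. Because MCC uses all four cells, it is less easily inflated than ACC when negatives far outnumber positives.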
Figure 1. The ROC curve of EPuL on the training dataset.
Independent test performance of EPuL, PUL-PUP, PSoL and SVM_balance.
| Method | Sn (%) | Sp (%) | ACC (%) | MCC | AUC |
|---|---|---|---|---|---|
| EPuL | 72.41 | 71.57 | 71.63 | 0.24 | 0.78 |
| PUL-PUP | 68.97 | 70.83 | 70.71 | 0.22 | 0.77 |
| PSoL | 51.72 | 73.14 | 71.62 | 0.13 | 0.74 |
| SVM_balance | 62.07 | 67.4 | 67.05 | 0.15 | 0.7 |
Performance of EPuL and four existing pupylation site predictors on the independent testing dataset. AUC does not depend on the threshold, so each predictor's single AUC value is repeated across its three rows.
| Predictor | Threshold | Sn (%) | Sp (%) | ACC (%) | MCC | AUC |
|---|---|---|---|---|---|---|
| GPS-PUP | High | 31.03 | 89.46 | 85.62 | 0.16 | 0.6 |
| GPS-PUP | Medium | 34.48 | 85.54 | 82.19 | 0.14 | 0.6 |
| GPS-PUP | Low | 41.38 | 76.72 | 74.43 | 0.1 | 0.6 |
| iPUP | High | 48.28 | 82.84 | 80.55 | 0.2 | 0.66 |
| iPUP | Medium | 51.72 | 76.47 | 74.83 | 0.16 | 0.66 |
| iPUP | Low | 55.17 | 72.06 | 70.94 | 0.15 | 0.66 |
| pbPUP | High | 17.24 | 88.48 | 83.75 | 0.04 | 0.6 |
| pbPUP | Medium | 31.03 | 80.15 | 76.89 | 0.07 | 0.6 |
| pbPUP | Low | 41.38 | 69.85 | 67.96 | 0.07 | 0.6 |
| PUL-PUP | High | 51.72 | 83.33 | 81.24 | 0.22 | 0.77 |
| PUL-PUP | Medium | 65.52 | 76.72 | 75.97 | 0.24 | 0.77 |
| PUL-PUP | Low | 68.97 | 72.79 | 72.54 | 0.23 | 0.77 |
| EPuL | High | 37.93 | 89.46 | 86.04 | 0.21 | 0.78 |
| EPuL | Medium | 58.62 | 79.90 | 78.49 | 0.23 | 0.78 |
| EPuL | Low | 68.97 | 74.02 | 73.68 | 0.24 | 0.78 |
Figure 2. Top 25 k-spaced amino acid pairs.
Figure 3. Two-sample logos of the composition of k-spaced amino acid pairs surrounding pupylation and non-pupylation sites.
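Figures 2 and 3 refer to the composition of k-spaced amino acid pairs. The record does not give the encoding details, so the sketch below assumes the commonly used definition: for each gap k, count every residue pair separated by exactly k positions and divide by the number of such positions in the fragment. The function name `cksaap`, the `k_max` default, and the sparse dictionary output (only observed pairs are stored) are illustrative assumptions, not the study's settings.

```python
from collections import Counter


def cksaap(fragment, k_max=2):
    """Composition of k-spaced amino acid pairs for one sequence fragment.

    Returns a dict mapping (k, pair) to a frequency, where `pair` is a
    two-letter string made of residue i and residue i + k + 1.
    """
    features = {}
    for k in range(k_max + 1):
        n_windows = len(fragment) - k - 1  # number of k-spaced positions
        if n_windows <= 0:
            continue  # fragment too short for this gap
        counts = Counter(
            fragment[i] + fragment[i + k + 1] for i in range(n_windows)
        )
        for pair, count in counts.items():
            features[(k, pair)] = count / n_windows
    return features
```

For the toy fragment `"AKAKA"` with `k_max=1`, the adjacent pairs (k = 0) are AK and KA at frequency 0.5 each, and the one-spaced pairs (k = 1) are AA at 2/3 and KK at 1/3. A fixed-length feature vector could be obtained by enumerating every possible pair for each k and filling unobserved entries with zero.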