Wenzheng Bao, Zhu-Hong You, De-Shuang Huang.
Abstract
Recent experiments have revealed that pupylation is a signal for the selective regulation of proteins in several serious human diseases. As one of the most significant post-translational modifications in biology and disease, pupylation can play a key role in regulating the biological processes of various diseases. Effective identification of this type of modification helps in understanding how proteins perform their biological functions and contributes to understanding the molecular mechanism, which is the foundation of drug design. Existing algorithms for identifying such modified sites often have defects, such as low accuracy and long running time. In this research, the pupylation site identification model, CIPPN, demonstrates better performance than other existing approaches in this field. The proposed predictor achieves an Acc value of 89.12% and an Mcc value of 0.7949 in 10-fold cross-validation tests on the Pupdb database (http://cwtung.kmu.edu.tw/pupdb). Significantly, the algorithm not only investigates the sequential, structural and evolutionary hallmarks around pupylation sites but also compares the differences of pupylation in terms of the environmental, conservation and functional characteristics of substrates. Therefore, the proposed feature description approach and algorithm results prove useful for further experimental investigation of the identification of this modification.
Keywords: classification; disease; post-translational modification
Year: 2017 PMID: 29312575 PMCID: PMC5752488 DOI: 10.18632/oncotarget.22335
Source DB: PubMed Journal: Oncotarget ISSN: 1949-2553
Prediction results on the Pupdb database under 10-fold cross-validation with AAIndex PCA features
| Subset | Sn(%) | Sp(%) | Acc(%) | Mcc | AUC |
|---|---|---|---|---|---|
| 1 | 65.21 | 96.45 | 80.83 | 0.6491 | 0.8017 |
| 2 | 73.42 | 95.36 | 84.39 | 0.7049 | 0.8115 |
| 3 | 69.43 | 97.56 | 83.50 | 0.6980 | 0.8231 |
| 4 | 64.43 | 96.23 | 80.33 | 0.6398 | 0.7667 |
| 5 | 72.02 | 98.32 | 85.17 | 0.7291 | 0.8091 |
| 6 | 65.32 | 97.67 | 81.50 | 0.6656 | 0.8073 |
| 7 | 68.64 | 97.53 | 83.09 | 0.6912 | 0.8137 |
| 8 | 69.43 | 98.64 | 84.04 | 0.7117 | 0.8342 |
| 9 | 67.57 | 98.67 | 83.12 | 0.6970 | 0.8451 |
| 10 | 77.71 | 98.63 | 88.17 | 0.7806 | 0.8072 |
| Average | 69.55 | 97.51 | 83.53 | 0.6987 | 0.8119 |
Sn is the sensitivity on each of the ten subsets of the Pupdb, and Sp is the specificity on the same subsets. Acc and Mcc are the accuracy and the Matthews correlation coefficient, and AUC is the area under the ROC curve.
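The four threshold-based measures in the table can all be derived from the confusion matrix of a binary classifier. The following is a minimal sketch of the standard definitions, not the authors' code. The counts are hypothetical: they assume 10,000 positive and 10,000 negative samples, chosen only so that Sn and Sp match the values reported for subset 1.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Return (Sn, Sp, Acc, Mcc) from confusion-matrix counts."""
    sn = tp / (tp + fn)                      # sensitivity: recall on positives
    sp = tn / (tn + fp)                      # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)    # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Hypothetical counts (assumption): equal class sizes with
# Sn = 65.21% and Sp = 96.45%, as reported for subset 1.
sn, sp, acc, mcc = binary_metrics(tp=6521, fn=3479, tn=9645, fp=355)
print(f"Sn={sn:.4f} Sp={sp:.4f} Acc={acc:.4f} Mcc={mcc:.4f}")
```

With equal class sizes, Acc reduces to the mean of Sn and Sp, and these counts give an accuracy of 80.83% and an Mcc of about 0.649, in line with the first row of the table.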
Figure 1. The ROC curves of the AAIndex PCA features
Figure 2. The ROC curves of the AAIndex BLOSUM62 PCA features
Prediction results on the Pupdb database under 10-fold cross-validation with AAIndex BLOSUM62 PCA features
| Subset | Sn(%) | Sp(%) | Acc(%) | Mcc | AUC |
|---|---|---|---|---|---|
| 1 | 99.81 | 80.48 | 90.15 | 0.8183 | 0.8127 |
| 2 | 95.57 | 80.11 | 87.84 | 0.7660 | 0.8157 |
| 3 | 99.27 | 76.79 | 88.03 | 0.7806 | 0.8287 |
| 4 | 99.72 | 83.54 | 91.63 | 0.8437 | 0.7903 |
| 5 | 99.34 | 81.59 | 90.46 | 0.8224 | 0.8107 |
| 6 | 99.43 | 85.75 | 92.59 | 0.8599 | 0.8102 |
| 7 | 99.52 | 77.53 | 88.52 | 0.7898 | 0.8167 |
| 8 | 99.62 | 76.42 | 88.02 | 0.7817 | 0.8397 |
| 9 | 99.75 | 80.47 | 90.11 | 0.8175 | 0.8576 |
| 10 | 87.57 | 80.06 | 83.82 | 0.6782 | 0.8162 |
| Average | 97.96 | 80.28 | 89.12 | 0.7949 | 0.8199 |
Sn is the sensitivity on each of the ten subsets of the Pupdb, and Sp is the specificity on the same subsets. Acc and Mcc are the accuracy and the Matthews correlation coefficient, and AUC is the area under the ROC curve.
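The "Average" row summarizes the ten cross-validation subsets. The sketch below illustrates the general idea of a 10-fold split and of averaging per-subset scores; the splitting helper `k_fold_indices` and the sample count are hypothetical, and the source does not say how the subsets were actually drawn. The Acc values are copied from the table above.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Each fold serves once as the test subset; the other nine form the training set.
folds = k_fold_indices(n_samples=100, k=10)   # 100 is a hypothetical size
for test_idx in folds:
    train_idx = [j for f in folds if f is not test_idx for j in f]
    # A model would be trained on train_idx and scored on test_idx here.

# Acc(%) per subset, copied from the AAIndex BLOSUM62 PCA table.
acc_per_subset = [90.15, 87.84, 88.03, 91.63, 90.46,
                  92.59, 88.52, 88.02, 90.11, 83.82]
print(f"Average Acc = {mean(acc_per_subset):.2f}%")
```

Averaging the ten Acc values this way reproduces the 89.12% reported in the "Average" row.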
Comparison with other methods on the Pupdb database
| Method | Sn(%) | Sp(%) | Acc(%) | Mcc | AUC |
|---|---|---|---|---|---|
| PUL-PUP | 82.24 | 91.57 | 88.92 | 0.7413 | 0.7238 |
| PSoL | 67.50 | 73.60 | 70.55 | 0.4118 | 0.6378 |
| SVM_balance | 76.71 | 63.65 | 69.88 | 0.4071 | 0.6571 |
| Naïve Bayesian | 82.78 | 86.40 | 84.59 | 0.6923 | 0.7528 |
| DEC–SVM | 75.49 | 77.87 | 77.70 | 0.5338 | 0.7891 |
| SET–SVM | 93.77 | 77.87 | 79.05 | 0.7256 | 0.8013 |
| IMP-PUP | 94.58 | 78.12 | 79.34 | 0.7371 | 0.8031 |
| AAIndex PCA+Neural Network | 65.50 | 99.52 | 82.51 | 0.6914 | 0.8119 |
| AAIndex BLOSUM62 PCA+Neural Network | 97.96 | 80.28 | 89.12 | 0.7949 | 0.8199 |
Comparison of different features
| Features | Sn(%) | Sp(%) | Acc(%) | Mcc | AUC |
|---|---|---|---|---|---|
| Binary Encoding | 43.36 | 75.80 | 59.58 | 0.2026 | 0.6472 |
| AA Composition | 64.14 | 52.79 | 58.46 | 0.1704 | 0.6121 |
| AA Pair Composition | 62.46 | 62.48 | 62.47 | 0.2494 | 0.6917 |
| Grouping AA Composition | 41.78 | 76.04 | 58.91 | 0.1897 | 0.5919 |
| Physicochemical Properties | 55.53 | 63.93 | 59.73 | 0.1953 | 0.5976 |
| KNN Features | 64.94 | 55.85 | 60.39 | 0.2088 | 0.6477 |
| Secondary Tendency Structure | 59.96 | 57.40 | 58.68 | 0.1737 | 0.6211 |
| PSSM | 51.20 | 69.39 | 60.30 | 0.2094 | 0.6374 |
| Binary Coding | 64.04 | 78.60 | 71.63 | 0.4310 | 0.6271 |
| PSSM2 | 61.11 | 68.94 | 65.11 | 0.3014 | 0.7921 |
| AAIndex PCA | 65.50 | 99.17 | 82.32 | 0.6868 | 0.8119 |
| AAIndex BLOSUM62 PCA | 97.96 | 80.28 | 89.12 | 0.7949 | 0.8199 |
The selected properties from the AAIndex database
| No. | AAIndex ID | Name of Properties |
|---|---|---|
| 1 | CHOP780207 | Normalized frequency of C-terminal non helical region |
| 2 | DAYM780201 | Relative mutability |
| 3 | EISD860102 | Atom-based hydrophobic moment |
| 4 | FAUJ880108 | Localized electrical effect |
| 5 | FAUJ880111 | Positive charge |
| 6 | FINA910103 | Helix termination parameter at position j-2, j-1, j |
| 7 | JANJ780101 | Average accessible surface area |
| 8 | KARP850103 | Flexibility parameter for two rigid neighbors |
| 9 | KLEP840101 | Net charge |
| 10 | KRIW710101 | Side chain interaction parameter |
| 11 | KRIW790102 | Fraction of site occupied by water |
| 12 | NAKH920103 | AA composition of EXT of single-spanning proteins |
| 13 | QIAN880101 | Weights for alpha-helix at the window position of -6 |
Figure 3. The steps of the AAIndex PCA features
The first step takes the protein sequences to be predicted. The second step extracts amino acid segments from those protein sequences. The third step transforms each amino acid segment into a property matrix. The fourth step applies Principal Component Analysis (PCA) to the property matrix.
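The four steps in this caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the property values are random placeholders for the 13 selected AAIndex properties, the window half-width of 3 and the example sequence are hypothetical, and the choice to flatten each segment's property matrix and run PCA across segments is one possible reading of the caption. Segments are centred on lysine (K), the residue that pupylation modifies.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_PROPS = 13  # number of AAIndex properties selected in the source

# ASSUMPTION: random placeholder values stand in for the real AAIndex
# entries (20 amino acids x 13 properties); a zero row encodes padding "X".
rng = np.random.default_rng(0)
PROPS = np.vstack([rng.normal(size=(20, N_PROPS)), np.zeros((1, N_PROPS))])
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS + "X")}

def lysine_segments(seq, half_window=3):
    """Step 2: cut a window centred on every lysine (K), padding with X."""
    padded = "X" * half_window + seq + "X" * half_window
    return [padded[i:i + 2 * half_window + 1]
            for i, aa in enumerate(seq) if aa == "K"]

def property_matrix(segment):
    """Step 3: map a segment to a (window x 13) property matrix."""
    return PROPS[[AA_INDEX[aa] for aa in segment]]

def pca(X, n_components):
    """Step 4: project the centred rows of X onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # step 1: hypothetical sequence
segments = lysine_segments(seq)
X = np.array([property_matrix(s).ravel() for s in segments])
features = pca(X, n_components=2)
print(segments, features.shape)
```

Each row of `features` is the reduced representation of one candidate site and would be the input to the downstream classifier.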
Figure 4. The steps of the AAIndex BLOSUM62 PCA features
The first step takes the amino acid segments extracted from the protein sequences. The second step transforms each amino acid segment into a property matrix. These two steps are the same as the second and third steps of the AAIndex PCA features. The third step introduces the BLOSUM62 matrix, which describes the interaction between amino acid residues; the property matrix is multiplied by the BLOSUM62 matrix to obtain a novel interaction matrix. The fourth step applies Principal Component Analysis (PCA) to this interaction matrix.
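The caption does not spell out how the dimensions of the property matrix and the BLOSUM62 matrix are aligned. One way to make the product well defined, shown here purely as an assumption, is to one-hot encode each segment (window x 20), multiply by the 20 x 20 substitution matrix, and then by the 20 x 13 property table. Both `SUBST` and `PROPS` below are random placeholders, not the real BLOSUM62 scores or AAIndex values, and the example segments are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)

# ASSUMPTION: placeholders for the real data. PROPS stands in for the
# 20 x 13 AAIndex property table; SUBST stands in for the 20 x 20 BLOSUM62
# matrix (random symmetric integers, not the real scores).
PROPS = rng.normal(size=(20, 13))
raw = rng.integers(-4, 5, size=(20, 20))
SUBST = np.triu(raw) + np.triu(raw, 1).T

def one_hot(segment):
    """Encode a segment as a (window x 20) indicator matrix."""
    m = np.zeros((len(segment), 20))
    for row, aa in enumerate(segment):
        m[row, AMINO_ACIDS.index(aa)] = 1.0
    return m

def interaction_matrix(segment):
    """Step 3: weight each residue's properties by substitution scores.

    Shapes: (window x 20) @ (20 x 20) @ (20 x 13) -> (window x 13).
    """
    return one_hot(segment) @ SUBST @ PROPS

def pca(X, n_components):
    """Step 4: project the centred rows of X onto the top principal axes."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

# Hypothetical lysine-centred windows (steps 1 and 2 already applied).
segments = ["AYIAKQR", "SFVKSHF", "GLIKVQA"]
X = np.array([interaction_matrix(s).ravel() for s in segments])
features = pca(X, n_components=2)
print(features.shape)
```

In this reading, each row of the interaction matrix mixes the properties of all twenty amino acids according to how readily they substitute for the residue at that position, so evolutionary similarity enters the feature before PCA is applied.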