Taigang Liu, Yufang Qin, Yongjie Wang, Chunhua Wang.
Abstract
Prior knowledge of a protein's structural class can offer useful clues to its function and its tertiary structure. Although considerable effort has gone into finding a fast and effective computational approach to structural class prediction, it remains a challenging problem in bioinformatics. The position-specific score matrix (PSSM) profile has been shown to be a useful source of information for improving prediction performance, but this information has not been adequately explored. In this study, we present a feature extraction technique based on gapped-dipeptide composition computed directly from the PSSM. Feature selection is then performed with support vector machine-recursive feature elimination (SVM-RFE), and the selected features are used to construct the final predictor. Jackknife tests on four working datasets show that the method, which extracts features solely from the PSSM, achieves overall accuracies of 90.3% to 98.6% and could serve as a promising tool for predicting protein structural class.
Keywords: feature selection; gapped-dipeptide; position-specific score matrix; protein structural class; recursive feature elimination; support vector machine
Year: 2015 PMID: 26712737 PMCID: PMC4730262 DOI: 10.3390/ijms17010015
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
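The abstract describes features built from gapped-dipeptide composition computed directly from the PSSM but does not give the formula. The sketch below is one plausible reading, not the authors' definition: for a gap of `gap` residues, feature (i, j) is the average over positions k of the score for residue type i at position k times the score for type j at position k + gap + 1. The sigmoid scaling and the function names `sigmoid_scale` and `gapped_dipeptide_features` are assumptions made for illustration.

```python
import math


def sigmoid_scale(pssm):
    """Map raw PSSM scores into (0, 1). This normalisation is an assumption."""
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in pssm]


def gapped_dipeptide_features(pssm, gap):
    """Return n_types**2 values (400 for a standard 20-column PSSM).

    Feature (i, j) averages pssm[k][i] * pssm[k + gap + 1][j] over all
    valid positions k, i.e. type i followed by type j with `gap`
    residues in between.
    """
    length = len(pssm)
    n_types = len(pssm[0])
    step = gap + 1
    pairs = length - step
    if pairs <= 0:
        raise ValueError("sequence too short for this gap")
    feats = [0.0] * (n_types * n_types)
    for k in range(pairs):
        first, second = pssm[k], pssm[k + step]
        for i in range(n_types):
            for j in range(n_types):
                feats[i * n_types + j] += first[i] * second[j]
    return [f / pairs for f in feats]


# Toy 3-residue profile with 2 columns instead of 20, for readability.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(gapped_dipeptide_features(toy, gap=0))  # [0.0, 0.5, 0.5, 0.0]
print(gapped_dipeptide_features(toy, gap=1))  # [1.0, 0.0, 0.0, 0.0]
```

Concatenating the vectors for several gap values would give a larger feature pool of the kind that a selection step such as SVM-RFE then prunes.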
Figure 1. Effect of the number of top-ranked features (top K) on the overall accuracy.
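A curve like the one in Figure 1 can be produced by evaluating the top K ranked features for several values of K with a jackknife (leave-one-out) test. The following is a minimal sketch under stated assumptions: the feature ranking is supplied by the caller (in the paper it would come from SVM-RFE), and a nearest-centroid classifier stands in for the SVM so that the example needs no external library. All function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x."""
    sums, counts = {}, {}
    for vec, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, value in enumerate(vec):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        dist = sum((acc[d] / counts[label] - x[d]) ** 2 for d in range(len(x)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(xs, ys):
    """Leave each sample out once, train on the rest, and score it."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


def top_k_sweep(xs, ys, ranking, ks):
    """Jackknife accuracy using only the first k ranked feature columns."""
    results = {}
    for k in ks:
        cols = ranking[:k]
        reduced = [[row[c] for c in cols] for row in xs]
        results[k] = jackknife_accuracy(reduced, ys)
    return results


# Toy data: column 0 separates the classes, column 1 is misleading noise.
xs = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [0.9, 2.0]]
ys = ["a", "a", "b", "b"]
print(top_k_sweep(xs, ys, ranking=[0, 1], ks=[1, 2]))
```

In this toy case, adding the second, uninformative feature lowers the jackknife accuracy, which is the kind of effect a top-K curve makes visible.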
Prediction performances on four datasets by our method.
| Dataset | All-α accuracy (%) | All-β accuracy (%) | α/β accuracy (%) | α + β accuracy (%) | Overall accuracy (%) | All-α MCC | All-β MCC | α/β MCC | α + β MCC |
|---|---|---|---|---|---|---|---|---|---|
| Z277 | 97.1 | 98.4 | 97.5 | 96.9 | 97.5 | 0.96 | 0.98 | 0.97 | 0.96 |
| Z498 | 98.1 | 100 | 98.5 | 97.7 | 98.6 | 0.96 | 1 | 0.98 | 0.98 |
| 1189 | 94.2 | 93.2 | 92.5 | 83.0 | 90.9 | 0.89 | 0.91 | 0.89 | 0.82 |
| 25PDB | 94.8 | 92.3 | 87.0 | 86.4 | 90.3 | 0.88 | 0.89 | 0.87 | 0.84 |
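The Matthews correlation coefficient (MCC) columns are reported per class. A common way to obtain a per-class MCC from a multi-class confusion matrix is to treat each class one-vs-rest; the sketch below assumes that convention, which the source does not spell out. The confusion matrix is made-up toy data, not taken from the paper.

```python
import math


def one_vs_rest_counts(confusion, c):
    """Return (tp, fp, fn, tn) for class c; rows are true labels, columns predictions."""
    n = len(confusion)
    tp = confusion[c][c]
    fn = sum(confusion[c]) - tp
    fp = sum(confusion[r][c] for r in range(n)) - tp
    tn = sum(sum(row) for row in confusion) - tp - fn - fp
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical 2-class confusion matrix.
toy = [[9, 1],
       [2, 8]]
tp, fp, fn, tn = one_vs_rest_counts(toy, 0)
print("sensitivity:", tp / (tp + fn))
print("MCC:", mcc(tp, fp, fn, tn))
```

Unlike per-class accuracy, MCC also penalises false positives, so a class can have high accuracy and a noticeably lower MCC.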
Comparison of different methods by the jackknife test for the Z277 dataset.
| Method | All-α (%) | All-β (%) | α/β (%) | α + β (%) | Overall (%) |
|---|---|---|---|---|---|
| Neural network | 68.6 | 85.2 | 86.4 | 56.9 | 74.7 |
| Component-coupled | 84.3 | 82.0 | 81.5 | 67.7 | 79.1 |
| LogitBoost | 81.4 | 88.5 | 92.6 | 72.3 | 84.1 |
| IGA-SVM | 84.3 | 88.5 | 92.6 | 70.7 | 84.5 |
| CWT-PCA-SVM | 85.7 | 90.2 | 87.7 | 80.1 | 85.9 |
| Markov-SVM | 90.0 | 85.2 | 86.4 | 81.5 | 85.9 |
| SVM fusion | 85.7 | 90.2 | 93.8 | 80.0 | 87.7 |
| AAC-PSSM-AC | 88.6 | 95.1 | 97.5 | 81.5 | 91.0 |
| Our method | 97.1 | 98.4 | 97.5 | 96.9 | 97.5 |
Comparison of different methods by the jackknife test for the Z498 dataset.
| Method | All-α (%) | All-β (%) | α/β (%) | α + β (%) | Overall (%) |
|---|---|---|---|---|---|
| Neural network | 86.0 | 96.0 | 88.2 | 86.0 | 89.2 |
| Component-coupled | 93.5 | 88.9 | 90.4 | 84.5 | 89.2 |
| SVM fusion | 99.1 | 96.0 | 80.9 | 91.5 | 91.4 |
| Markov-SVM | 91.6 | 94.4 | 96.3 | 91.5 | 93.6 |
| IGA-SVM | 96.3 | 93.6 | 97.8 | 89.2 | 94.2 |
| LogitBoost | 92.6 | 96.0 | 97.1 | 93.0 | 94.8 |
| CWT-PCA-SVM | 94.4 | 96.8 | 97.0 | 92.3 | 95.2 |
| AAC-PSSM-AC | 94.4 | 96.8 | 97.8 | 93.8 | 95.8 |
| Our method | 98.1 | 100 | 98.5 | 97.7 | 98.6 |
Performance comparison of different methods on the 1189 dataset.
| Method | All-α (%) | All-β (%) | α/β (%) | α + β (%) | Overall (%) |
|---|---|---|---|---|---|
| AADP-PSSM | 69.1 | 83.7 | 85.6 | 35.7 | 70.7 |
| AAC-PSSM-AC | 80.7 | 86.4 | 81.4 | 45.2 | 74.6 |
| Comb_11,10,6 ¹ | 80.2 | 83.6 | 85.4 | 44.6 | 74.8 |
| SCPRED | 89.1 | 86.7 | 89.6 | 53.8 | 80.6 |
| LCC-PSSM | 89.2 | 88.8 | 85.6 | 58.5 | 81.2 |
| RKS-PPSC | 89.2 | 86.7 | 82.6 | 65.6 | 81.3 |
| MODAS | 92.3 | 87.1 | 87.9 | 65.4 | 83.5 |
| PSSM-SPINE-S | 98.2 | 91.5 | 83.8 | 72.2 | 86.3 |
| Our method | 94.2 | 93.2 | 92.5 | 83.0 | 90.9 |
¹ This result was evaluated using a 10-fold cross-validation test.
Performance comparison of different methods on the 25PDB dataset.
| Method | All-α (%) | All-β (%) | α/β (%) | α + β (%) | Overall (%) |
|---|---|---|---|---|---|
| AADP-PSSM | 83.3 | 78.1 | 76.3 | 54.4 | 72.9 |
| AAC-PSSM-AC | 85.3 | 81.7 | 73.7 | 55.3 | 74.1 |
| Comb_11,10,6 ¹ | 86.1 | 80.8 | 80.6 | 60.1 | 76.7 |
| LCC-PSSM | 91.7 | 80.8 | 79.8 | 64.0 | 79.0 |
| SCPRED | 92.6 | 80.1 | 74.0 | 71.0 | 79.7 |
| MODAS | 92.3 | 83.7 | 81.2 | 68.3 | 81.4 |
| RKS-PPSC | 92.8 | 83.3 | 85.8 | 70.1 | 82.9 |
| PSSM-SPINE-S | 96.8 | 93.7 | 90.1 | 87.0 | 92.2 |
| Our method | 94.8 | 92.3 | 87.0 | 86.4 | 90.3 |
¹ This result was evaluated using a 10-fold cross-validation test.
The compositions of four datasets adopted in this study.
| Dataset | All-α | All-β | α/β | α + β | Total |
|---|---|---|---|---|---|
| Z277 | 70 | 61 | 81 | 65 | 277 |
| Z498 | 107 | 126 | 136 | 129 | 498 |
| 1189 | 223 | 294 | 334 | 241 | 1092 |
| 25PDB | 443 | 443 | 346 | 441 | 1673 |
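The class sizes above also explain how the overall accuracies relate to the per-class figures. Assuming each per-class accuracy is the fraction of that class predicted correctly, the overall accuracy is the class-size-weighted mean of the per-class values. The short check below applies this to the per-class accuracies reported for our method and recovers the reported overall values; the function name is illustrative.

```python
def overall_accuracy(per_class_acc, class_sizes):
    """Class-size-weighted mean of per-class accuracies (in percent)."""
    weighted = sum(a * n for a, n in zip(per_class_acc, class_sizes))
    return weighted / sum(class_sizes)


# Class order: All-α, All-β, α/β, α + β (values copied from the tables).
sizes = {
    "Z277": [70, 61, 81, 65],
    "Z498": [107, 126, 136, 129],
    "1189": [223, 294, 334, 241],
    "25PDB": [443, 443, 346, 441],
}
accuracies = {
    "Z277": [97.1, 98.4, 97.5, 96.9],
    "Z498": [98.1, 100.0, 98.5, 97.7],
    "1189": [94.2, 93.2, 92.5, 83.0],
    "25PDB": [94.8, 92.3, 87.0, 86.4],
}

for name in sizes:
    print(name, round(overall_accuracy(accuracies[name], sizes[name]), 1))
# Z277 97.5
# Z498 98.6
# 1189 90.9
# 25PDB 90.3
```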