Yaoxin Wang1, Yingjie Xu2, Zhenyu Yang1, Xiaoqing Liu3, Qi Dai1.
Abstract
Many combinations of protein features are used to improve protein structural class prediction, but the redundancy among these features is often ignored. To select the important features with strong classification ability, we propose a recursive feature selection method based on random forest. We evaluated the proposed method in four experiments and compared it with available competing prediction methods. The results indicate that the proposed feature selection method effectively improves the efficiency of protein structural class prediction: fewer than 5% of the features are used, yet the prediction accuracy improves by 4.6-13.3%. We further compared different protein features and found that features derived from the predicted secondary structure achieve the best performance. This understanding can be used to design more powerful prediction methods for the protein structural class.
Year: 2021 PMID: 34055035 PMCID: PMC8123985 DOI: 10.1155/2021/5529389
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
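The abstract describes recursive feature selection with a random forest but gives no algorithmic detail. As a rough illustration only, the sketch below uses scikit-learn's `RFE` wrapper around a `RandomForestClassifier` as a stand-in; it is not the authors' implementation. The synthetic data, the step size, the number of trees, and the choice to keep 4 of 100 features are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic four-class data standing in for protein feature vectors (assumption).
X, y = make_classification(
    n_samples=300,
    n_features=100,
    n_informative=5,
    n_redundant=10,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# The random forest supplies feature importances; RFE repeatedly refits it
# and drops the least important features until the target count remains.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
selector = RFE(forest, n_features_to_select=4, step=0.1)  # drop 10% of features per round
selector.fit(X, y)

# Indices of the features that survived the elimination.
selected = np.flatnonzero(selector.support_)
print("kept features:", selected)
```

Here `selector.ranking_` assigns rank 1 to the retained features, and larger ranks to features eliminated in earlier rounds.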
Protein distribution of different structural classes among four protein data sets.
| Data set | All-α | All-β | α/β | α+β | Total |
|---|---|---|---|---|---|
| 25PDB | 443 | 443 | 346 | 441 | 1673 |
| D640 | 138 | 154 | 177 | 171 | 640 |
| FC699 | 130 | 269 | 377 | 82 | 858 |
| 1189 | 223 | 294 | 334 | 241 | 1092 |
Sensitivity (Sens), specificity (Spec), and F1 of the proposed method on four data sets.
| Data set | Class | Sens (%) | Spec (%) | F1 (%) |
|---|---|---|---|---|
| 25PDB | All-α | 94.81 | 98.29 | 95.02 |
| | All-β | 95.26 | 98.13 | 95.05 |
| | α/β | 89.88 | 95.25 | 86.39 |
| | α+β | 85.71 | 97.16 | 88.52 |
| D640 | All-α | 97.10 | 97.81 | 94.70 |
| | All-β | 92.86 | 99.18 | 95.02 |
| | α/β | 97.18 | 92.87 | 90.05 |
| | α+β | 80.70 | 98.93 | 87.90 |
| FC699 | All-α | 97.69 | 99.45 | 97.32 |
| | All-β | 98.51 | 99.49 | 98.70 |
| | α/β | 95.23 | 99.38 | 97.16 |
| | α+β | 96.34 | 97.68 | 88.27 |
| 1189 | All-α | 94.62 | 96.55 | 90.95 |
| | All-β | 89.80 | 98.50 | 92.63 |
| | α/β | 82.04 | 94.20 | 84.05 |
| | α+β | 81.74 | 92.95 | 79.12 |
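To make the three per-class measures concrete, the sketch below computes them from one-vs-rest confusion counts using the standard definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), F1 = 2TP/(2TP+FP+FN)); the record itself does not reproduce the paper's formulas, so treating these as the definitions used is an assumption. The counts are not reported in the source: they are back-computed here so that they are consistent with the 25PDB All-α row (443 proteins in the class, 1673 in total), and the function name is hypothetical.

```python
def one_vs_rest_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, F1) as fractions for one class."""
    sensitivity = tp / (tp + fn)       # share of the class that is recovered
    specificity = tn / (tn + fp)       # share of other classes correctly rejected
    f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and sensitivity
    return sensitivity, specificity, f1


# Illustrative counts inferred from the reported percentages (assumption):
# 420 + 23 = 443 All-α proteins, 21 + 1209 = 1230 proteins of other classes.
sens, spec, f1 = one_vs_rest_metrics(tp=420, fn=23, fp=21, tn=1209)
print(f"Sens {100 * sens:.2f}%  Spec {100 * spec:.2f}%  F1 {100 * f1:.2f}%")
# prints: Sens 94.81%  Spec 98.29%  F1 95.02%
```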
Prediction accuracies (variances in the brackets) of the proposed method for four data sets and comparison with other reported results.
| Data set | Method | All-α (%) | All-β (%) | α/β (%) | α+β (%) | Overall (%) |
|---|---|---|---|---|---|---|
| 25PDB | AADP-PSSM | 69.1 | 83.7 | 85.6 | 35.7 | 70.7 |
| | AAC-PSSM-AC | 85.3 | 81.7 | 73.7 | 55.3 | 74.1 |
| | SCPRED | 92.6 | 80.1 | 74.0 | 71.0 | 79.7 |
| | MODAS | 92.3 | 83.7 | 81.2 | 68.3 | 81.4 |
| | RKS-PPSC | 92.8 | 83.3 | 85.8 | 70.1 | 82.9 |
| | Ding et al. | 95.0 | 81.3 | 83.2 | 77.6 | 84.3 |
| | Xia et al. | 92.6 | 72.5 | 71.7 | 71.0 | 77.2 |
| | Zhang et al. | 95.7 | 80.8 | 82.4 | 75.5 | 83.7 |
| | Ding et al. | 91.7 | 80.8 | 79.8 | 64.0 | 79.0 |
| | Zhang et al. | 94.4 | 83.3 | 83.5 | 73.2 | 83.6 |
| | This paper | 94.8 | 95.3 | 89.9 | 85.7 | 91.5 |
| D640 | SCEC | 73.9 | 61.0 | 81.9 | 33.9 | 62.3 |
| | SCPRED | 90.6 | 81.8 | 85.9 | 66.7 | 80.8 |
| | RKS-PPSC | 89.1 | 85.1 | 88.1 | 71.4 | 83.1 |
| | Ding et al. | 92.8 | 88.3 | 85.9 | 66.1 | 82.7 |
| | Zhang et al. | 92.0 | 81.8 | 87.6 | 74.3 | 83.6 |
| | Kong et al. | 94.2 | 80.5 | 87.6 | 77.2 | 84.5 |
| | This paper | 97.1 | 92.8 | 97.1 | 80.7 | 91.7 |
| FC699 | SCPRED | — | — | — | — | 87.5 |
| | 11 features | 97.7 | 88.0 | 89.1 | 84.2 | 89.6 |
| | Kong et al. | 96.2 | 90.7 | 96.3 | 69.5 | 92.0 |
| | This paper | 97.7 | 98.5 | 95.2 | 96.3 | 96.7 |
| 1189 | AADP-PSSM | 69.1 | 83.7 | 85.6 | 35.7 | 70.7 |
| | AAC-PSSM-AC | 80.7 | 86.4 | 81.4 | 45.2 | 74.6 |
| | SCPRED | 89.1 | 86.7 | 89.6 | 53.8 | 80.6 |
| | MODAS | 92.3 | 87.1 | 87.9 | 65.4 | 83.5 |
| | RKS-PPSC | 89.2 | 86.7 | 82.6 | 65.6 | 81.3 |
| | Zhang et al. | 92.4 | 84.4 | 84.4 | 73.4 | 83.6 |
| | Ding et al. | 89.2 | 88.8 | 85.6 | 58.5 | 81.2 |
| | Zhang et al. | 91.5 | 86.7 | 82.0 | 66.4 | 81.8 |
| | Kong et al. | 91.9 | 84.4 | 85.3 | 72.2 | 83.5 |
| | This paper | 94.6 | 89.7 | 82.1 | 81.7 | 86.6 |
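The overall accuracies reported for the proposed method can be cross-checked against the other two tables: if overall accuracy is the fraction of all proteins classified correctly, it equals the class-size-weighted mean of the per-class sensitivities. The sketch below performs that check with the class sizes and sensitivities listed above, in the class order All-α, All-β, α/β, α+β. The helper name is hypothetical, and the weighted-mean relation is assumed rather than stated in this record.

```python
# Class sizes per data set, in the order All-α, All-β, α/β, α+β.
class_sizes = {
    "25PDB": [443, 443, 346, 441],
    "D640": [138, 154, 177, 171],
    "FC699": [130, 269, 377, 82],
    "1189": [223, 294, 334, 241],
}

# Per-class sensitivities (%) of the proposed method, same class order.
sensitivities = {
    "25PDB": [94.81, 95.26, 89.88, 85.71],
    "D640": [97.10, 92.86, 97.18, 80.70],
    "FC699": [97.69, 98.51, 95.23, 96.34],
    "1189": [94.62, 89.80, 82.04, 81.74],
}


def overall_accuracy(sizes, sens_pct):
    """Class-size-weighted mean of per-class sensitivities, in percent."""
    return sum(n * s for n, s in zip(sizes, sens_pct)) / sum(sizes)


overall = {
    name: round(overall_accuracy(class_sizes[name], sensitivities[name]), 1)
    for name in class_sizes
}
print(overall)
# prints: {'25PDB': 91.5, 'D640': 91.7, 'FC699': 96.7, '1189': 86.6}
```

The four values agree with the "This paper" rows of the accuracy comparison to one decimal place.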
Figure 1. Comparison of the overall accuracies of all experiments with the selected feature sets for four data sets.
Figure 2. Comparison of the overall prediction accuracies of four kinds of protein features.