Xia-Yu Xia, Meng Ge, Zhi-Xin Wang, Xian-Ming Pan.
Abstract
Because of the increasing gap between the data from sequencing and structural genomics, the accurate prediction of the structural class of a protein domain solely from the primary sequence has remained a challenging problem in structural biology. Traditional sequence-based predictors generally select several sequence features and then feed them directly into a classification program to identify the structural class. The current best sequence-based predictor achieved an overall accuracy of 74.1% when tested on a widely used, non-homologous benchmark dataset, 25PDB. In the present work, we built a multiple linear regression (MLR) model to convert the 440-dimensional (440D) sequence feature vector extracted from the Position Specific Scoring Matrix (PSSM) of a protein domain to a 4-dimensional (4D) structural feature vector, which could then be used to predict the four major structural classes. We performed 10-fold cross-validation and jackknife tests of the method on a large non-homologous dataset containing 8,244 domains distributed among the four major classes. Our approach outperformed all existing sequence-based methods, with an overall accuracy of 83.1%, which is even higher than that of methods based on predicted secondary structure.
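The 440D-to-4D conversion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what the four output dimensions encode or how the final class is chosen, so the sketch assumes one-hot class indicators as regression targets and an argmax decision rule. The function names (`add_bias`, `fit_mlr`, `to_structural_vector`, `predict_class`) and the synthetic data are hypothetical stand-ins for real PSSM-derived features.

```python
import numpy as np

def add_bias(X):
    """Prepend a constant column so the regression has an intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_mlr(X, labels, n_classes=4):
    """Least-squares fit of a linear map from features to 4 targets.

    Assumption: targets are one-hot class indicators (not stated in the source).
    """
    Y = np.eye(n_classes)[np.asarray(labels)]          # shape (n, 4)
    W, *_ = np.linalg.lstsq(add_bias(X), Y, rcond=None)
    return W                                            # shape (d + 1, 4)

def to_structural_vector(W, X):
    """Map each feature vector to its 4D regression output."""
    return add_bias(X) @ W

def predict_class(W, X):
    """Assumed decision rule: pick the largest of the 4 outputs."""
    return np.argmax(to_structural_vector(W, X), axis=1)

# Synthetic stand-in for 440D PSSM-derived features (hypothetical data).
rng = np.random.default_rng(0)
n_features = 440
centers = rng.normal(size=(4, n_features))
train_labels = rng.integers(0, 4, size=1000)
test_labels = rng.integers(0, 4, size=200)
X_train = centers[train_labels] + 0.5 * rng.normal(size=(1000, n_features))
X_test = centers[test_labels] + 0.5 * rng.normal(size=(200, n_features))

W = fit_mlr(X_train, train_labels)
accuracy = np.mean(predict_class(W, X_test) == test_labels)
print(f"4D output shape: {to_structural_vector(W, X_test).shape}, "
      f"test accuracy: {accuracy:.3f}")
```

The synthetic classes are deliberately well separated, so the accuracy printed here says nothing about real protein domains; the point is only the shape of the pipeline, in which a single linear map reduces 440 features to 4 values before classification.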
Year: 2012 PMID: 22723837 PMCID: PMC3378576 DOI: 10.1371/journal.pone.0037653
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Discrimination of the protein domains between paired structural class groups.
A) Discrimination of the all-α from all-β domains. B) Discrimination of the all-α from mixed αβ domains. C) Discrimination of the all-β from mixed αβ domains. D) Discrimination of the α/β from α+β domains.
Performance of the 10-fold cross-validation and jackknife tests using the D8244 dataset.
| Class | 10-fold cross-validation | | | | Jackknife | | | |
| | Sn (%) | Sp (%) | MCC | GC2 | Sn (%) | Sp (%) | MCC | GC2 |
| All-α | 91.9 | 97.2 | 0.88 | | 92.0 | 97.3 | 0.89 | |
| All-β | 84.6 | 96.1 | 0.82 | | 85.0 | 96.2 | 0.82 | |
| α/β | 83.1 | 94.4 | 0.78 | | 83.2 | 94.5 | 0.79 | |
| α+β | 73.7 | 89.0 | 0.62 | | 74.4 | 89.0 | 0.63 | |
| Overall | 82.8 | | | 0.56 | 83.1 | | | 0.56 |
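For readers who want to reproduce measures like the Sn, Sp, MCC and GC2 columns from a confusion matrix, the following is a minimal sketch. It assumes the standard one-vs-rest definitions of sensitivity, specificity and the Matthews correlation coefficient, and the commonly cited chi-square form of the generalized squared correlation, GC2 = χ² / (N(K − 1)); the source page does not spell out its formulas, so treat these as assumptions. The function names and the example matrix are hypothetical.

```python
import numpy as np

def per_class_metrics(conf):
    """One-vs-rest Sn, Sp and MCC per class.

    conf[i, j] = number of domains of true class i predicted as class j.
    Assumes every class occurs and is predicted at least once.
    """
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp      # true class i, predicted as something else
    fp = conf.sum(axis=0) - tp      # predicted class i, actually something else
    tn = total - tp - fn - fp
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, mcc

def overall_accuracy(conf):
    """Fraction of all domains assigned to their true class."""
    conf = np.asarray(conf, dtype=float)
    return np.trace(conf) / conf.sum()

def gc2(conf):
    """Generalized squared correlation (assumed chi-square definition)."""
    conf = np.asarray(conf, dtype=float)
    n, k = conf.sum(), conf.shape[0]
    expected = np.outer(conf.sum(axis=1), conf.sum(axis=0)) / n
    chi2 = ((conf - expected) ** 2 / expected).sum()
    return chi2 / (n * (k - 1))

# Hypothetical two-class example: rows are true classes, columns predictions.
example = np.array([[8, 2],
                    [1, 9]])
sn, sp, mcc = per_class_metrics(example)
print(sn, sp, mcc, overall_accuracy(example), gc2(example))
```

Under these definitions a perfectly diagonal confusion matrix gives 1.0 for every measure, and for two classes GC2 reduces to the square of the MCC.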
Performance of the blind test using the independent D1185 dataset.
| Class | Accuracies | | | |
| | Sn (%) | Sp (%) | MCC | GC2 |
| All-α | 95.6 | 95.6 | 0.88 | |
| All-β | 81.0 | 94.7 | 0.76 | |
| α/β | 78.9 | 94.2 | 0.71 | |
| α+β | 71.9 | 87.4 | 0.60 | |
| Overall | 80.1 | | | 0.50 |
Figure 2. The effect of the number of sequence features used on the overall prediction accuracies for the two datasets 25PDB and D8244.
Comparison of the jackknife test results between our method and other competing structural class prediction methods using the 25PDB dataset.
| Algorithm | Reference | Accuracies (%) | | | | | GC2 |
| | | All-α | All-β | α/β | α+β | Overall | |
| SVM (Gaussian kernel) | | 68.6 | 59.6 | 59.8 | 28.6 | 53.9 | 0.17 |
| Bagging with random tree | | 58.7 | 47.0 | 35.5 | 24.7 | 41.8 | 0.06 |
| Logistic regression | | 71.1 | 65.3 | 67.1 | 37.3 | 60.0 | 0.25 |
| StackingC ensemble | | 74.6 | 67.9 | 70.2 | 32.4 | 61.3 | 0.26 |
| Specific tri-peptides | | 60.6 | 60.7 | 67.9 | 44.3 | 58.6 | – |
| LLSC-PRED | | 75.2 | 67.5 | 62.1 | 44.0 | 62.2 | 0.27 |
| SVM | | 77.4 | 66.4 | 61.3 | 45.4 | 62.7 | 0.28 |
| AAD-CGR | | 64.3 | 65.0 | 65.0 | 61.7 | 64.0 | – |
| CWT-PCA-SVM | | 76.5 | 67.3 | 66.8 | 45.8 | 64.0 | – |
| AADP-PSSM | | 83.3 | 78.1 | 76.3 | 54.4 | 72.9 | – |
| AAC-PSSM-AC | | 85.3 | 81.7 | 73.7 | 55.3 | 74.1 | – |
| SCPRED | | 92.6 | 80.1 | 74.0 | 71.0 | 79.7 | 0.55 |
| MODAS | | 92.3 | 83.7 | 81.2 | 68.3 | 81.4 | 0.58 |
| RKS-PPSC | | 92.8 | 83.3 | 85.8 | 70.1 | 82.9 | – |
| SVM | | 92.6 | 81.3 | 81.5 | 76.0 | 82.9 | – |
| This work | | 92.6 | 72.5 | 71.7 | 71.0 | 77.2 | 0.50 |