Huilin Wang, Mingjun Wang, Hao Tan, Yuan Li, Ziding Zhang, Jiangning Song.
Abstract
X-ray crystallography is the primary approach for solving the three-dimensional structure of a protein. However, a major bottleneck of this method is that the multi-step experimental procedure, which comprises sequence cloning, protein material production, purification, crystallization and, ultimately, structure determination, often fails to yield diffraction-quality crystals. Accordingly, predicting the propensity of a protein to successfully undergo these experimental procedures from its sequence may help narrow down laborious experimental efforts and facilitate target selection. A number of bioinformatics methods based on protein sequence information have been developed for this purpose. However, our knowledge of the important determinants of the propensity of a protein sequence to produce high diffraction-quality crystals remains largely incomplete. In practice, most existing methods perform worse when evaluated on larger and updated datasets. To address this problem, we constructed an up-to-date benchmark dataset and subsequently developed a new approach, termed 'PredPPCrys', using the support vector machine (SVM). Using a comprehensive set of multifaceted sequence-derived features in combination with a novel multi-step feature selection strategy, we identified and characterized the relative importance and contribution of each feature type to the prediction performance for the five individual experimental steps required for successful crystallization. The resulting optimal candidate features were used as inputs to build the first-level SVM predictor (PredPPCrys I). Next, the prediction outputs of PredPPCrys I were used as inputs to build second-level SVM classifiers (PredPPCrys II), which led to significantly enhanced prediction performance. Benchmarking experiments indicated that our PredPPCrys method outperforms most existing procedures on both up-to-date and previous datasets.
In addition, the predicted crystallization targets of currently non-crystallizable proteins were provided as compendium data, which are anticipated to facilitate target selection and design for the worldwide structural genomics consortium. PredPPCrys is freely available at http://www.structbioinfor.org/PredPPCrys.
Year: 2014 PMID: 25148528 PMCID: PMC4141844 DOI: 10.1371/journal.pone.0105902
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic illustration of the PredPPCrys approach.
The details of each of the six major steps are discussed within the main text.
Number of selected features after one-step and two-step mRMR feature selection for 5-class prediction.
| Feature type | CLF (A) | CLF (B) | MF (A) | MF (B) | PF (A) | PF (B) | CF (A) | CF (B) | CRYS (A) | CRYS (B) |
| AAindex1 | 50 | 18 | 110 | 24 | 85 | 31 | 47 | 6 | 110 | 13 |
| PROFEAT | 0 | 67 | 22 | 157 | 13 | 147 | 81 | 206 | 23 | 164 |
| AA composition (AA type 1) | 2 | 5 | 4 | 3 | 2 | 2 | 3 | 1 | 2 | 2 |
| AA group (AA type 3) | 0 | 3 | 2 | 1 | 0 | 0 | 4 | 2 | 2 | 2 |
| Tri-peptide composition | 4 | 21 | 12 | 14 | 6 | 11 | 5 | 9 | 8 | 11 |
| Secondary structure | 10 | 27 | 25 | 22 | 7 | 13 | 20 | 18 | 25 | 18 |
| Disorder | 6 | 9 | 3 | 5 | 15 | 10 | 2 | 3 | 1 | 7 |
| Exposure related information | 105 | 78 | 111 | 43 | 136 | 60 | 76 | 28 | 119 | 52 |
| Burial related information | 121 | 73 | 7 | 26 | 35 | 24 | 62 | 29 | 5 | 29 |
| Other | 5 | 4 | 10 | 7 | 5 | 6 | 5 | 3 | 9 | 6 |
| Number of some combination features selected for each class | | | | | | | | | | |
| AAindex1 & Exposed | 95 | 63 | 103 | 34 | 130 | 51 | 68 | 22 | 116 | 44 |
| AAindex1 & Buried | 117 | 60 | 0 | 21 | 31 | 20 | 57 | 25 | 1 | 26 |
| AA types & Exposed/Buried | 11 | 23 | 9 | 12 | 6 | 2 | 8 | 7 | 3 | 7 |
| Statistical analysis of some selected feature types | | | | | | | | | | |
| Exposure/Burial ratio | 1.15 | 0.94 | 0.06 | 0.60 | 0.26 | 0.40 | 0.82 | 1.03 | 23.8 | 0.56 |
| Percentage of AAindex-related features (%) | 87.3 | 47 | 71 | 26.3 | 82 | 34 | 57.3 | 17.7 | 75.7 | 27.7 |
Feature selection was performed based on benchmark datasets.
CLF, MF, PF, CF and CRYS represent assignment of 5-class experimental steps.
A denotes the one-step mRMR feature selection.
B denotes the two-step mRMR feature selection.
AA (amino acid) composition denotes the 20 standard amino acid compositions.
Exposure-related information: the features integrate the predicted exposed residue information.
Burial-related information: the features integrate the predicted buried residue information.
AAindex1 & Exposed: average values of physicochemical properties using the amino acid index (AAindex1) in all the predicted exposed residues (Table S2).
AAindex1 & Buried: average values of physicochemical properties using the amino acid index (AAindex1) in all the predicted buried residues.
AA types & Exposed/Buried: frequency of the 20 standard AAs (type 1), hydrophobic/hydrophilic/neutral/positive/negative AAs (type 2) and AA groups (type 3) in all predicted exposed or buried residues.
Exposure/burial ratio: ratio of the features integrating the predicted exposed residue information to that integrating the predicted buried residue information.
Percentage of AAindex-related features denotes the frequency of AAindex-related features within the selected set.
Further explanations are included in Table S2.
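The ranking idea behind mRMR (minimum redundancy, maximum relevance) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes features have already been discretized, uses mutual information for both relevance and redundancy, and covers only a single greedy pass (the two-step variant is not defined in this section). The feature names and toy values are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, labels, k):
    """Greedy mRMR: repeatedly pick the feature with the highest
    relevance to the labels minus mean redundancy with features already chosen.

    features: dict mapping feature name -> list of discretized values.
    """
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: 8 proteins, binary class label, 3 discretized features.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "hydrophobicity": [0, 0, 0, 1, 1, 1, 1, 1],
    "hydrophobicity_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # fully redundant duplicate
    "disorder": [0, 1, 0, 0, 1, 1, 0, 1],
}
top2 = mrmr_rank(features, labels, 2)
```

In this toy case the duplicate feature is as relevant as the original but is penalized for redundancy, so the weaker but complementary feature is ranked second.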
Performance comparison of the SVM models trained based on various feature subsets selected using different methods on the 5-class benchmark datasets.
| Feature selection method | CLF | MF | PF | CF | CRYS |
| one-step mRMR + IFS | 0.691 | 0.769 | 0.722 | 0.684 | 0.760 |
| one-step mRMR + FFS | 0.711 | 0.767 | | 0.665 | 0.753 |
| two-step mRMR + IFS | 0.698 | 0.763 | 0.759 | | 0.756 |
| two-step mRMR + FFS | | | 0.779 | 0.645 | |
Performance was evaluated based on the AUC score.
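For readers unfamiliar with the AUC, it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; it is an illustration of the metric only, with hypothetical labels and scores, and is not the evaluation code used in the study.

```python
def auc_score(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    A positive scored above a negative counts 1, a tie counts 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: two positives, two negatives.
example_auc = auc_score([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based formulas give the same value more efficiently on large test sets.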
Prediction performance of the primary classifier built based on the best-performing final feature subset, along with the number of final selected features for each class.
| Class | Number of final selected features | AUC | MCC | Accuracy (%) | Specificity (%) | Sensitivity (%) | Precision (%) |
| CLF | 31 | 0.727 | 0.339 | 67.8 | 62.7 | 71.4 | 73.3 |
| MF | 43 | 0.777 | 0.384 | 70.3 | 69.6 | 71.8 | 50.4 |
| PF | 54 | 0.790 | 0.445 | 73.8 | 70.5 | 75.5 | 83.3 |
| CF | 229 | 0.707 | 0.289 | 62.7 | 74.8 | 58.8 | 87.8 |
| CRYS | 37 | 0.765 | 0.309 | 69.2 | 69.1 | 69.3 | 34.2 |
Performance on the benchmark training dataset was evaluated based on AUC, MCC, Accuracy, Specificity, Sensitivity and Precision, using a 5-fold cross-validation test.
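The threshold-dependent measures listed above all derive from the four confusion-matrix counts. The following sketch uses the standard definitions; the counts in the example are hypothetical and do not come from the study.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "precision": ratio(tp, tp + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Hypothetical counts: 100 positives and 100 negatives.
m = binary_metrics(tp=70, fp=40, tn=60, fn=30)
```

Because precision depends on the ratio of positives to negatives, a class with few positives can show low precision even when sensitivity and specificity are both moderate.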
Performance comparison of SVM classifiers with different kernel functions and parameters.
| Class | Model | POLY | | | RBF | | | SIG | | |
| | | 1/γ | | AUC | 1/γ | | AUC | 1/γ | | AUC |
| CLF | Initial model | 31 | 1 | 0.714 | 31 | 1 | 0.717 | 31 | 1 | 0.717 |
| | Optimized model | 97 | 1 | 0.726 | | | | 183 | 1 | 0.727 |
| MF | Initial model | 43 | 1 | 0.766 | 43 | 1 | 0.767 | 43 | 1 | 0.766 |
| | Optimized model | | | | 38 | 1 | 0.768 | 179 | 1 | 0.768 |
| PF | Initial model | 54 | 1 | 0.762 | 54 | 1 | 0.763 | 54 | 1 | 0.761 |
| | Optimized model | 2 | 3 | 0.795 | | | | 58 | 1 | 0.763 |
| CF | Initial model | 229 | 1 | 0.681 | 284 | 1 | 0.666 | 284 | 1 | 0.654 |
| | Optimized model | | | | 231 | 0.2 | 0.682 | 252 | 9 | 0.693 |
| CRYS | Initial model | 37 | 1 | 0.750 | 37 | 1 | 0.738 | 37 | 1 | 0.750 |
| | Optimized model | | | | 115 | 0.5 | 0.754 | 98 | 0.125 | 0.752 |
Performance was evaluated based on the AUC scores using independent tests.
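The three kernels compared here (POLY, RBF, SIG) can be written out explicitly. The sketch below assumes the LIBSVM-style parameterisation, in which γ scales the dot product or squared distance; this section does not name the SVM library, and the `degree` and `coef0` defaults shown are LIBSVM's defaults rather than values reported by the authors. Under that assumption, an initial 1/γ equal to the number of selected features (for example 31 for CLF) would correspond to the common default γ = 1/(number of features).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(u, v, gamma, coef0=0.0, degree=3):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * <u, v> + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)


# Hypothetical 2-feature vectors; gamma = 1 / number_of_features = 0.5.
u, v = [1.0, 2.0], [3.0, 4.0]
gamma = 1.0 / len(u)
k_poly = poly_kernel(u, v, gamma)
k_rbf = rbf_kernel(u, v, gamma)
k_sig = sigmoid_kernel(u, v, gamma)
```

A larger 1/γ (smaller γ) makes the RBF kernel decay more slowly with distance, giving a smoother decision boundary.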
Figure 2. Correlations between the probability outputs of any two classes.
Results were evaluated based on the training dataset.
Figure 3. ROC curves for different predictors.
(A), CLF; (B), MF; (C), PF; (D), CF; and (E), CRYS class. Taking the CLF class as an example, the ROC curves compare the first-level predictor PredPPCrys I (corresponding to the CLF class feature in panel A), the predictors built using the outputs of classifiers for the other classes as inputs, and the second-level predictor PredPPCrys II. All predictors were built using the optimized SVM parameters based on the respective training datasets, and subsequently tested on the corresponding independent test datasets.
Performance comparison of PredPPCrys I, PredPPCrys II and previous methods, including PPCPred, ParCrys, OBScore, CRYSTAP2, XtalPred, SVMCRYs, SCMCRYS and XtalPred-RF.
| Experimental step | Method | AUC | MCC | Accuracy (%) | Specificity (%) | Sensitivity (%) | Precision (%) |
| CLF | PredPPCrys I | 0.711 | 0.296 | 65.33 | 63.58 | 66.50 | 73.16 |
| | PredPPCrys I (−) | 0.697 | 0.291 | 64.70 | 64.40 | 64.94 | 70.48 |
| | PredPPCrys II | | | | | | |
| | PredPPCrys II (−) | 0.710 | 0.307 | 65.66 | 64.40 | 66.61 | 71.01 |
| MF | PPCPred | 0.683 | 0.334 | 68.06 | 67.99 | 68.22 | 47.20 |
| | PredPPCrys I | 0.772 | 0.380 | 69.93 | 68.21 | 72.88 | 49.95 |
| | PredPPCrys I (−) | 0.776 | 0.398 | 69.86 | 67.37 | 75.03 | 52.47 |
| | PredPPCrys II | | | | | | |
| | PredPPCrys II (−) | 0.809 | 0.461 | 74.32 | 74.42 | 74.10 | 58.18 |
| PF | PPCPred | 0.612 | 0.183 | 58.83 | 62.23 | 57.08 | 74.57 |
| | PredPPCrys I | 0.800 | 0.460 | 74.83 | 70.52 | 77.02 | 83.77 |
| | PredPPCrys I (−) | 0.779 | 0.437 | 72.85 | 72.65 | 72.95 | 83.89 |
| | PredPPCrys II | | | | | | |
| | PredPPCrys II (−) | 0.872 | 0.588 | 80.22 | 82.55 | 79.09 | 90.31 |
| CF | PPCPred | 0.432 | −0.014 | 55.23 | 32.21 | 61.24 | 75.53 |
| | PredPPCrys I | 0.712 | | 67.05 | 67.65 | 66.91 | 89.42 |
| | PredPPCrys I (−) | 0.693 | 0.258 | 66.04 | 65.63 | 66.14 | 88.42 |
| | PredPPCrys II | | 0.175 | | | | |
| | PredPPCrys II (−) | 0.692 | 0.186 | 59.12 | 65.63 | 57.48 | 86.90 |
| CRYS | ParCrys | 0.611 | 0.132 | 59.66 | 60.56 | 55.91 | 25.40 |
| | OBScore | 0.638 | 0.184 | 59.28 | 57.78 | 65.49 | 27.14 |
| | CRYSTAP2 | 0.599 | 0.123 | 51.64 | 48.10 | 67.78 | 22.28 |
| | XtalPred | - | 0.224 | 65.04 | 65.61 | 62.51 | 29.31 |
| | SVMCRYs | - | 0.142 | 55.11 | 52.78 | 65.70 | 23.39 |
| | PPCPred | 0.704 | 0.254 | 63.63 | 62.09 | 70.67 | 29.03 |
| | XtalPred-RF | - | 0.205 | 60.94 | 59.67 | 66.41 | 27.56 |
| | SCMCRYS | - | 0.145 | 60.93 | 62.01 | 56.24 | 25.48 |
| | PredPPCrys I | 0.770 | 0.326 | 69.65 | 69.30 | 71.13 | 35.23 |
| | PredPPCrys I (−) | 0.794 | 0.379 | 72.63 | 73.30 | 70.32 | 43.46 |
| | PredPPCrys II | | | | | | |
| | PredPPCrys II (−) | 0.858 | 0.502 | 78.35 | 78.16 | 79.02 | 51.35 |
Performance was evaluated based on independent test datasets.
(−) denotes that our proposed method PredPPCrys was tested on the independent test datasets with a 25% sequence identity cutoff compared with the training datasets.
Figure 4. ROC curves displaying the performance of our methods (PredPPCrys I and II predictors), compared to previous procedures, on independent test datasets for predicting the propensity of targets to successfully pass each experimental step.
(A), CLF; (B), MF; (C), PF; (D), CF and (E), CRYS class. PredPPCrys-I denotes the first-level predictors of PredPPCrys, PredPPCrys-II denotes the second-level predictors of PredPPCrys, while PredPPCrys-II_POLY, PredPPCrys-II_RBF and PredPPCrys-II_SIG denote the best-performing SVM classifiers built with the SVM_POLY, SVM_RBF and SVM_SIG kernels in the second-level predictors, respectively.
Figure 5. Statistical significance of the contributions of selected features to the prediction performance of the five classes, evaluated based on the negative logarithmic value of the p-value (-log(P)) calculated using t-tests.
Contribution significance was determined using t-tests, and only the final selected feature types that made a significant contribution (p<0.01) to performance were included in the analysis. The vertical and horizontal axes display the contributory features. The pie chart insets denote the percentages of selected feature types in the final feature subset for each class.