Huilin Wang1, Liubin Feng1, Ziding Zhang2, Geoffrey I Webb3, Donghai Lin1, Jiangning Song3,4,5.
Abstract
The failure of multi-step experimental procedures to yield diffraction-quality crystals is a major bottleneck in protein structure determination. Accordingly, several bioinformatics methods have been successfully developed and employed to select crystallizable proteins. Unfortunately, the majority of existing in silico methods only allow the prediction of crystallization propensity, seldom enabling computational design of protein mutants that can be targeted for enhancing protein crystallizability. Here, we present Crysalis, an integrated crystallization analysis tool that builds on support-vector regression (SVR) models to facilitate computational protein crystallization prediction, analysis, and design. More specifically, the functionality of this new tool includes: (1) rapid selection of target crystallizable proteins at the proteome level, (2) identification of site non-optimality for protein crystallization and systematic analysis of all potential single-point mutations that might enhance protein crystallization propensity, and (3) annotation of target proteins based on predicted structural properties. We applied the design mode of Crysalis to identify site non-optimality for protein crystallization on a proteome scale, focusing on proteins currently classified as non-crystallizable. Our results revealed that site non-optimality is based on biases related to residues, predicted structures, physicochemical properties, and sequence loci, which provides an in-depth understanding of the features influencing protein crystallization. Crysalis is freely available at http://nmrcen.xmu.edu.cn/crysalis/.
Year: 2016 PMID: 26906024 PMCID: PMC4764925 DOI: 10.1038/srep21383
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
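The abstract describes a design mode that systematically analyzes all potential single-point mutations. A minimal sketch of that enumeration is shown here; `toy_score` is a hypothetical placeholder and not the Crysalis SVR model, and the function names are illustrative assumptions rather than part of the tool.

```python
from typing import Callable, Iterator, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def single_point_mutants(seq: str) -> Iterator[Tuple[int, str, str, str]]:
    """Yield (1-based position, wild type, mutant residue, mutant sequence)."""
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield i + 1, wt, aa, seq[:i] + aa + seq[i + 1:]


def rank_mutations(seq: str, score: Callable[[str], float]) -> List[Tuple[float, str]]:
    """Rank every single-point mutation by its score gain over the wild type."""
    base = score(seq)
    gains = [
        (score(mut) - base, f"{wt}{pos}{aa}")
        for pos, wt, aa, mut in single_point_mutants(seq)
    ]
    return sorted(gains, reverse=True)


def toy_score(seq: str) -> float:
    # Hypothetical stand-in for a trained predictor: the fraction of
    # residues that are not K, Q, or E. It only makes the sketch runnable.
    return 1.0 - sum(seq.count(r) for r in "KQE") / len(seq)


if __name__ == "__main__":
    for gain, label in rank_mutations("MKQE", toy_score)[:3]:
        print(label, round(gain, 3))
```

A sequence of length n yields 19n candidate mutants, so scoring a whole proteome in this way depends on a fast predictor.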
Statistics of the final selected features after one-step and two-step feature selection for each class of experimental procedure.
| Feature type | CLF (OFS) | CLF (TFS) | MF (OFS) | MF (TFS) | PF (OFS) | PF (TFS) | CF (OFS) | CF (TFS) | CRYs (OFS) | CRYs (TFS) |
|---|---|---|---|---|---|---|---|---|---|---|
| AAc | 1 | 0 | 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 |
| Tri-peptide | 2 | 2 | 6 | 4 | 10 | 9 | 3 | 1 | 9 | 6 |
| AAindex | 7 | 1 | 22 | 20 | 10 | 8 | 1 | 1 | 24 | 21 |
| KSAAP | 86 | 21 | 19 | 13 | 66 | 64 | 94 | 65 | 26 | 13 |
| GKSAAP | 4 | 0 | 50 | 38 | 13 | 13 | 1 | 0 | 40 | 24 |
| Total | 100 | 24 | 100 | 78 | 100 | 95 | 100 | 68 | 100 | 65 |
Feature selection was performed on the benchmark datasets.
OFS (one-step feature selection) denotes results after one-step feature selection (mRMR).
TFS (two-step feature selection) denotes results after two-step feature selection (mRMR+FFS).
AAc: features generated based on 20 standard amino acids.
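As an illustration of the simplest feature type in the table, the sketch below computes a 20-dimensional composition vector. It assumes that AAc refers to the relative frequency of each standard residue, which the footnote does not state explicitly; the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> dict:
    """Return the relative frequency of each standard residue in seq.

    Non-standard characters (for example X) are ignored; this handling
    is an assumption made for the sketch.
    """
    counted = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    n = len(counted)
    return {aa: (counted.count(aa) / n if n else 0.0) for aa in AMINO_ACIDS}


if __name__ == "__main__":
    comp = amino_acid_composition("AAKC")
    print(comp["A"], comp["K"], comp["C"])
```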
Performance comparison of first-level SVR predictors from PredPPCrys and Crysalis.
| Class | Model | Features# | AUC | MCC | ACC(%) | SPE(%) | SEN(%) | PRE(%) |
|---|---|---|---|---|---|---|---|---|
| CLF | PredPPCrys | 31 | 0.727 | 0.339 | 67.8 | 62.7 | 71.4 | 73.3 |
| CLF | Crysalis | 25 | 0.732 | 0.334 | 66.7 | 68.6 | 65.5 | 76.2 |
| MF | PredPPCrys | 43 | 0.777 | 0.384 | 70.3 | 69.6 | 71.8 | 50.4 |
| MF | Crysalis | 78 | 0.767 | 0.400 | 70.4 | 68.4 | 75.1 | 49.8 |
| PF | PredPPCrys | 54 | 0.790 | 0.445 | 73.8 | 70.5 | 75.5 | 83.3 |
| PF | Crysalis | 95 | 0.790 | 0.447 | 74.4 | 69.0 | 77.1 | 83.4 |
| CF | PredPPCrys | 229 | 0.707 | 0.289 | 62.7 | 74.8 | 58.8 | 87.8 |
| CF | Crysalis | 68 | 0.737 | 0.329 | 70.7 | 73.3 | 63.1 | 85.5 |
| CRYs | PredPPCrys | 37 | 0.765 | 0.309 | 69.2 | 69.1 | 69.3 | 34.2 |
| CRYs | Crysalis | 65 | 0.773 | 0.326 | 69.2 | 68.5 | 72.2 | 34.7 |
The performance of all models was evaluated using the benchmark datasets.
#The number of final selected features used for training the first-level SVR predictors.
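The threshold-dependent measures reported in the table (MCC, ACC, SPE, SEN, PRE) follow their standard definitions from a binary confusion matrix. The sketch below shows those definitions; the counts in the usage example are made up for illustration and are not taken from the benchmark datasets.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute standard binary classification measures from confusion counts."""
    sen = tp / (tp + fn)                   # sensitivity (recall)
    spe = tn / (tn + fp)                   # specificity
    pre = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SEN": sen, "SPE": spe, "PRE": pre, "ACC": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    for name, value in binary_metrics(tp=80, fp=30, tn=70, fn=20).items():
        print(f"{name}: {value:.3f}")
```

Precision depends on the class balance of the dataset as well as on SEN and SPE, which is one reason PRE differs so much between classes with otherwise similar sensitivity and specificity.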
Figure 1. AUC-based performance following removal or inclusion of each of the four feature types in the optimal feature set.
The graphs illustrate the impact on prediction performance of the Crysalis first-level models for all five prediction classes. (A) Performance of the trained models following removal of the corresponding feature type. (B) Performance of the trained models using the individual feature type. All predicted models are compared with the best models that were trained using the optimal feature sets.
Prediction performance comparison of Crysalis and other existing methods.
| Experimental step | Method | AUC | MCC | ACC(%) | SPEC(%) | SENS(%) | PRE(%) |
|---|---|---|---|---|---|---|---|
| CLF | PredPPCrys I | 0.711 | 0.296 | 65.33 | 63.58 | 66.50 | 73.16 |
| CLF | PredPPCrys II | 0.725 | 0.322 | 66.54 | 65.56 | 67.20 | 74.44 |
| CLF | Crysalis I | 0.731 | 0.332 | 66.98 | 66.60 | 67.22 | 75.56 |
| CLF | Crysalis II | | | | | | |
| MF | PPCPred | 0.683 | 0.334 | 68.06 | 67.99 | 68.22 | 47.20 |
| MF | PredPPCrys I | 0.772 | 0.380 | 69.93 | 68.21 | 72.88 | 49.95 |
| MF | PredPPCrys II | | 0.416 | 71.95 | 71.36 | 73.30 | 52.70 |
| MF | Crysalis I | 0.759 | 0.377 | 70.23 | 69.93 | 70.99 | 49.25 |
| MF | Crysalis II | | | | | | |
| PF | PPCPred | 0.612 | 0.183 | 58.83 | 62.23 | 57.08 | 74.57 |
| PF | PredPPCrys I | 0.800 | 0.460 | 74.83 | 70.52 | 77.02 | 83.77 |
| PF | PredPPCrys II | | | | | | |
| PF | Crysalis I | 0.796 | 0.436 | 73.87 | 67.80 | 73.87 | 82.47 |
| PF | Crysalis II | 0.801 | 0.411 | 71.22 | 72.67 | 70.48 | 83.48 |
| CF | PPCPred | 0.432 | −0.014 | 55.23 | 32.21 | 61.24 | 75.53 |
| CF | PredPPCrys I | 0.712 | 0.280 | 67.05 | 67.65 | 66.91 | 89.42 |
| CF | PredPPCrys II | 0.735 | 0.175 | 68.89 | | | |
| CF | Crysalis I | 0.739 | 0.281 | 65.50 | 70.59 | 64.23 | 89.80 |
| CF | Crysalis II | | | 62.57 | | 56.93 | 93.97 |
| CRYs | ParCrys | 0.611 | 0.132 | 59.66 | 60.56 | 55.91 | 25.40 |
| CRYs | OBScore | 0.638 | 0.184 | 59.28 | 57.78 | 65.49 | 27.14 |
| CRYs | CRYSTAP2 | 0.599 | 0.123 | 51.64 | 48.10 | 67.78 | 22.28 |
| CRYs | XtalPred | — | 0.224 | 65.04 | 65.61 | 62.51 | 29.31 |
| CRYs | SVMCRYs | — | 0.142 | 55.11 | 52.78 | 65.70 | 23.39 |
| CRYs | PPCPred | 0.704 | 0.254 | 63.63 | 62.09 | 70.67 | 29.03 |
| CRYs | XtalPred-RF | — | 0.205 | 60.94 | 59.67 | 66.41 | 27.56 |
| CRYs | PredPPCrys I | 0.770 | 0.326 | 69.65 | 69.30 | 71.13 | 35.23 |
| CRYs | PredPPCrys II | | 0.428 | 76.04 | 76.21 | 75.30 | 42.64 |
| CRYs | Crysalis I | 0.788 | 0.339 | 71.00 | 70.89 | 71.41 | 35.50 |
| CRYs | Crysalis II | | | | | | |
Performance was evaluated using independent test datasets. Note that most methods (ParCrys, OBScore, CRYSTAP2, XtalPred, and SVMCRYs) only provide one-class prediction (CRYs) and PPCPred includes four-class (MF, PF, CF, and CRYs) predictors. Thus, we only compared the performance of these tools for valid classes. In the case of PredPPCrys, we compared its performance with Crysalis for all five classes.
Figure 2. Schematic illustration of the Crysalis prediction mode (A) and design mode (B).
Statistical analysis of site non-optimality for protein crystallizability engineering using the independent test dataset for the CRYs class (sequence redundancy removed at 25% sequence identity).
| | All | Secondary structure: Coil (296571) | Secondary structure: Helix (260752) | Secondary structure: Sheet (88276) | Disorder: Disorder (66129) | Disorder: Order (536962) | Buried/Exposed: Exposed (275189) | Buried/Exposed: Buried (370410) | Side chain entropy: SCE (96423) | Side chain entropy: SCE_E (74506) | Side chain entropy: SCE_B (21917) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.005 | 52.2% | 49.8% | 56.1% | 49.1% | 57.3% | 52.3% | 51.1% | 53.2% | 36.3% | 38.0% | 30.4% |
| 0.01 | 32.3% | 30.2% | 35.4% | 29.8% | 39.3% | 32.2% | 31.7% | 32.7% | 21.4% | 22.9% | 16.3% |
| 0.02 | 15.3% | 14.2% | 16.9% | 14.6% | 22.7% | 15.1% | 15.9% | 15.0% | 10.6% | 11.6% | 7.01% |
| 0.05 | 4.08% | 3.92% | 4.36% | 3.83% | 8.80% | 3.76% | 4.65% | 3.67% | 3.37% | 3.80% | 1.90% |
| 0.1 | 1.22% | 1.27% | 1.23% | 1.02% | 3.83% | 0.99% | 1.53% | 0.99% | 1.26% | 1.44% | 0.62% |
| 0.2 | 0.33% | 0.41% | 0.27% | 0.22% | 1.64% | 0.19% | 0.46% | 0.23% | 0.43% | 0.48% | 0.24% |

| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| 0.005 | 35.4% | 55.4% | 46.0% | 49.6% | 51.7% | 54.3% | 67.5% | 50.6% | 62.3% |
| 0.01 | 22.7% | 36.0% | 29.7% | 31.2% | 30.3% | 34.1% | 48.9% | 30.4% | 45.6% |
| 0.02 | 12.4% | 19.0% | 15.9% | 15.7% | 13.9% | 16.0% | 29.3% | 13.7% | 28.1% |
| 0.05 | 4.45% | 6.01% | 5.28% | 4.42% | 3.81% | 4.01% | 10.8% | 3.18% | 11.8% |
| 0.1 | 1.62% | 2.35% | 2.00% | 1.35% | 1.30% | 1.08% | 3.76% | 0.78% | 5.66% |
| 0.2 | 0.56% | 0.95% | 0.77% | 0.36% | 0.47% | 0.23% | 0.90% | 0.14% | 2.75% |
(a) The dataset contains 2,342 proteins, including 1,814 proteins currently classified as non-crystallizable. Residue numbers for the different groups are shown in brackets.
(b) Statistical analysis of side-chain entropy considered three residues with high conformational entropies (KQE). SCE denotes the number of KQE residues in the entire sequence, while SCE_E and SCE_B denote the numbers of KQE residues annotated as localized to exposed or buried regions, respectively.
(c) N-terminal and C-terminal denote the initial and final 20 residues of a protein sequence, respectively. The Intermediate group comprises all remaining residues, excluding the N-terminal and C-terminal residues.
Site non-optimality analysis of 20 standard amino acids for protein crystallizability.
| AAs | Number | 0.005 | 0.01 | 0.02 | 0.05 | 0.1 | 0.2 |
|---|---|---|---|---|---|---|---|
| Total | 645599 | 52.2% | 32.2% | 15.3% | 4.08% | 1.22% | 0.33% |
| A | 57068 | 53.8% | 27.4% | 9.68% | 1.54% | 0.24% | 0.21% |
| C | 8137 | 85.7% | 73.8% | 45.9% | 16.1% | 4.48% | 0.96% |
| D | 34265 | 46.9% | 29.7% | 15.7% | 5.05% | 1.48% | 0.34% |
| E | 39524 | 25.5% | 16.6% | 9.55% | 3.94% | 1.73% | 0.75% |
| F | 27004 | 59.1% | 42.3% | 24.3% | 6.65% | 2.00% | 0.42% |
| G | 45831 | 36.9% | 18.0% | 6.12% | 1.23% | 0.32% | 0.08% |
| H | 14891 | 50.2% | 38.5% | 26.2% | 12.9% | 7.20% | 4.02% |
| I | 36964 | 53.7% | 35.5% | 17.0% | 4.05% | 1.02% | 0.16% |
| K | 32967 | 37.0% | 20.2% | 9.45% | 2.77% | 1.00% | 0.27% |
| L | 68034 | 59.1% | 40.2% | 20.3% | 5.88% | 1.81% | 0.51% |
| M | 14613 | 56.3% | 34.7% | 15.6% | 3.60% | 0.82% | 0.07% |
| N | 25557 | 76.7% | 53.1% | 27.6% | 7.09% | 1.86% | 0.33% |
| P | 29532 | 45.5% | 23.6% | 9.33% | 1.81% | 0.38% | 0.04% |
| Q | 24284 | 53.0% | 30.8% | 13.8% | 3.25% | 0.82% | 0.11% |
| R | 35632 | 74.6% | 49.5% | 24.9% | 6.13% | 1.55% | 0.29% |
| S | 44801 | 72.5% | 45.5% | 22.0% | 5.34% | 1.38% | 0.23% |
| T | 35036 | 44.4% | 23.4% | 8.66% | 1.78% | 0.33% | 0.03% |
| V | 43817 | 46.2% | 25.9% | 10.1% | 1.95% | 0.49% | 0.06% |
| W | 9114 | 37.0% | 23.2% | 10.4% | 2.40% | 0.51% | 0.08% |
| Y | 20786 | 46.0% | 25.5% | 9.64% | 1.76% | 0.33% | 0.01% |
Crysalis performed the computational design of the CRYs class using proteins currently classified as non-crystallizable in the independent test datasets with 25% sequence identity.
Classification of the 20 amino acids in the GKSAAP encoding scheme according to six different types of physicochemical properties.
| Physicochemical property | Low | Middle | High |
|---|---|---|---|
| Accessible surface area | ACGILFV | HMPSTWY | RNDQEK |
| Side-chain orientation | NDQEK | ARGHPSTWYV | CILMF |
| Charge | DE | NQCILMFSWTYVAGP | KHR |
| Hydrogen-bond donors | ADCGST | NQEHILMPV | RKFWY |
| Hydrophobicity | THGSQ | RKNDEP | FIWLVMYCA |
| van der Waals potential | ILMFWY | RCQEHKPV | ANDGST |
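The grouping in this table can be turned into features by replacing each residue with its class label and counting class pairs separated by k positions. The sketch below does this for the accessible surface area row only. It is a hedged illustration of a grouped k-spaced pair encoding: the normalization to frequencies, the single-letter class labels, and the function name are assumptions, and the exact GKSAAP definition used by Crysalis may differ.

```python
from collections import Counter
from itertools import product

# Accessible surface area classes, copied from the table (Low, Middle, High).
ASA_CLASSES = {"L": "ACGILFV", "M": "HMPSTWY", "H": "RNDQEK"}
RESIDUE_TO_CLASS = {
    aa: label for label, members in ASA_CLASSES.items() for aa in members
}


def grouped_k_spaced_pairs(seq: str, k: int) -> dict:
    """Frequency of each class pair whose members are separated by k residues.

    k = 0 means adjacent residues. Frequencies are normalized by the
    number of pairs (an assumption made for this sketch).
    """
    groups = [RESIDUE_TO_CLASS[aa] for aa in seq]
    pairs = Counter(
        groups[i] + groups[i + k + 1] for i in range(len(groups) - k - 1)
    )
    total = sum(pairs.values())
    return {
        a + b: (pairs[a + b] / total if total else 0.0)
        for a, b in product("LMH", repeat=2)
    }


if __name__ == "__main__":
    features = grouped_k_spaced_pairs("AKHA", k=1)
    print({pair: freq for pair, freq in features.items() if freq > 0})
```

With three classes, each property and each value of k contributes nine features, which is far fewer than the 400 pairs of an ungrouped k-spaced residue-pair encoding.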