Phasit Charoenkwan, Watshara Shoombuatong, Hua-Chin Lee, Jeerayut Chaijaruwanich, Hui-Ling Huang, Shinn-Ying Ho.
Abstract
Existing methods for predicting protein crystallization achieve high accuracy by combining various types of complementary features with complex ensemble classifiers, such as support vector machine (SVM) and Random Forest classifiers. It is desirable to develop a simple and easily interpretable prediction method with informative sequence features that provides insight into protein crystallization. This study proposes an ensemble method, SCMCRYS, to predict protein crystallization, in which each classifier is built using a scoring card method (SCM) that estimates propensity scores of p-collocated amino acid (AA) pairs (p=0 for a dipeptide). The SCM classifier determines whether a sequence is crystallizable according to a weighted-sum score: the weights are the composition of the p-collocated AA pairs, and the propensity scores of these AA pairs are estimated using a statistical approach with optimization. SCMCRYS predicts crystallization by simple voting over a number of SCM classifiers. The experimental results show that a single SCM classifier using dipeptide composition, with an accuracy of 73.90%, is comparable to the best previously developed SVM-based classifier, SVM_POLY (74.6%), and to our proposed SVM-based classifier using the same dipeptide composition (77.55%). The SCMCRYS method, with an accuracy of 76.1%, is comparable to the state-of-the-art ensemble methods PPCpred (76.8%) and RFCRYS (80.0%), which use SVM and Random Forest classifiers, respectively. This study also performs a mutagenesis analysis based on SCM. The result leads to the hypothesis that the mutagenesis of surface residues Ala and Cys has large and small probabilities, respectively, of enhancing protein crystallizability, considering the estimated crystallizability scores together with the solubility, melting point, molecular weight and conformational entropy of amino acids under generalized conditions.
The propensity scores of amino acids and dipeptides for estimating protein crystallizability can aid biologists in designing mutations of surface residues to enhance protein crystallizability. The source code of SCMCRYS is available at http://iclab.life.nctu.edu.tw/SCMCRYS/.
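The weighted-sum scoring described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released code. It assumes that a p-collocated pair consists of the residues at positions i and i+p+1 (so that p=0 gives ordinary dipeptides), that a pair with no entry in the score table contributes zero, that a sequence is called crystallizable when its score exceeds a threshold, and that the ensemble uses a strict majority vote. The function names and the toy score table are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def pair_composition(seq: str, p: int = 0) -> Dict[str, float]:
    """Fraction of each p-collocated AA pair in seq.

    Assumption: a p-collocated pair is (seq[i], seq[i + p + 1]),
    so p = 0 yields adjacent residues (dipeptides).
    """
    gap = p + 1
    pairs = [seq[i] + seq[i + gap] for i in range(len(seq) - gap)]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {pair: n / len(pairs) for pair, n in counts.items()}


def scm_score(seq: str, scores: Dict[str, float], p: int = 0) -> float:
    """Weighted sum: composition of each pair times its propensity score.

    Assumption: pairs missing from the score table contribute 0.
    """
    comp = pair_composition(seq, p)
    return sum(weight * scores.get(pair, 0.0) for pair, weight in comp.items())


def scm_predict(seq: str, scores: Dict[str, float],
                threshold: float, p: int = 0) -> bool:
    """Hypothetical decision rule: crystallizable if the score exceeds the threshold."""
    return scm_score(seq, scores, p) > threshold


def vote(predictions: List[bool]) -> bool:
    """Strict majority vote over the individual SCM classifiers (assumed rule)."""
    return sum(predictions) * 2 > len(predictions)


# Toy example with made-up propensity scores for two dipeptides.
toy_scores = {"AC": 300.0, "CA": 600.0}
toy_score = scm_score("ACAC", toy_scores, p=0)
```

In the toy example, "ACAC" contains the dipeptides AC, CA and AC, so the compositions are 2/3 and 1/3 and the score is the correspondingly weighted sum of the two propensity scores.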
Year: 2013 PMID: 24019868 PMCID: PMC3760885 DOI: 10.1371/journal.pone.0072368
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Some existing methods for predicting protein crystallization from sequences.
| Method | Classifier | Sequence Features (no. of feature types) | Single/Ensemble | Year |
| OB-Score | Single Threshold | PCP (1) | Single | 2006 |
| SECERT | SVM | AAC, DPC, TPC (3) | Single | 2006 |
| CRYSTALP | Naïve Bayes | AAC, PAAC (2) | Single | 2007 |
| XtalPred | Logarithm Method | AAC, PCP, SS (3) | Single | 2007 |
| ParCrys | Parzen Window Density Estimator | AAC, PCP, Low complexity region (3) | Single | 2008 |
| CRYSTALP2 | Gaussian radial basis function network | AAC, DPC, TPC, PAAC, PCP (5) | Single | 2009 |
| SVMCRYS | SVM | AAC, TPC, PCP, SS (4) | Single | 2010 |
| PPCpred | SVM | PCP, AAC, SS, Disorder, Solvent accessibility (5) | Ensemble | 2011 |
| RFCRYS | Random Forest | AAC, DPC, TPC, PCP, Sequence Length (5) | Ensemble | 2012 |
| SCMCRYS | SCM | PAAC (1) | Ensemble | This study |
SS is defined as secondary structure.
AAC is defined as amino acid composition.
DPC is defined as dipeptide composition.
TPC is defined as tripeptide composition.
PCP is defined as physicochemical properties.
PAAC is defined as p-collocated amino acid pair composition.
PseAAC is defined as Pseudo amino acid composition.
Performance of the Init-SCM method using the p-collocated AA pairs.
| p-collocated AA pairs | Test Accuracy (%) | MCC | Sensitivity | Specificity |
| p = 0 | 71.47 | 0.30 | 0.33 | 0.91 |
| p = 1 | 71.72 | 0.30 | 0.32 | 0.92 |
| p = 2 | 71.05 | 0.29 | 0.32 | 0.91 |
| p = 3 | 71.42 | 0.30 | 0.37 | 0.89 |
| p = 4 | 71.02 | 0.29 | 0.33 | 0.90 |
| p = 5 | 71.14 | 0.28 | 0.27 | 0.93 |
| p = 6 | 70.74 | 0.27 | 0.29 | 0.92 |
| p = 7 | 70.21 | 0.25 | 0.19 | 0.96 |
| p = 8 | 70.77 | 0.27 | 0.21 | 0.96 |
| p = 9 | 70.13 | 0.26 | 0.31 | 0.90 |
| Mean | 70.97±0.52 | 0.28±0.02 | 0.29±0.05 | 0.91±0.02 |
Mean performance of the SCM method using the p-collocated AA pairs.
| p-collocated AA pairs | Test Accuracy (%) | MCC | Sensitivity | Specificity |
| p = 0 | 73.90±0.57 | 0.38±0.02 | 0.45±0.03 | 0.88±0.01 |
| p = 1 | 72.63±0.77 | 0.35±0.02 | 0.46±0.03 | 0.86±0.02 |
| p = 2 | 71.28±0.90 | 0.32±0.01 | 0.46±0.03 | 0.84±0.03 |
| p = 3 | 73.30±0.59 | 0.37±0.01 | 0.49±0.03 | 0.86±0.02 |
| p = 4 | 73.14±0.47 | 0.37±0.01 | 0.48±0.02 | 0.86±0.01 |
| p = 5 | 71.10±0.48 | 0.32±0.01 | 0.47±0.02 | 0.83±0.02 |
| p = 6 | 72.78±0.47 | 0.36±0.01 | 0.49±0.03 | 0.85±0.02 |
| p = 7 | 71.73±0.59 | 0.33±0.01 | 0.47±0.03 | 0.84±0.02 |
| p = 8 | 72.85±0.40 | 0.36±0.01 | 0.46±0.04 | 0.86±0.02 |
| p = 9 | 72.55±0.84 | 0.36±0.01 | 0.49±0.03 | 0.85±0.03 |
Figure 1. Heat map of the propensity scores of dipeptides obtained from the SCM method.
Figure 2. Distribution of locations of high-score dipeptides on the two typical sequences 3K9I and Q4V970.
The distribution of locations of high-score dipeptides on the two typical sequences 3K9I and Q4V970 correctly predicted as crystallizable and non-crystallizable proteins, respectively.
Comparisons of the proposed method SCMCRYS with existing classifiers.
| Classifiers | Type | Test Accuracy (%) | MCC | Sensitivity | Specificity |
| CRYSTALP2 | single | 55.3 | 0.19 | 0.74 | 0.46 |
| SVMCRYS | single | 56.3 | 0.21 | 0.75 | 0.47 |
| SVM_POLY | single | 74.6 | 0.40 | 0.48 | 0.88 |
| SVM_DPC | single | 77.55 | 0.47 | 0.45 | 0.94 |
| SCM (Dipeptide) | single | 73.90 | 0.38 | 0.45 | 0.88 |
| PPCpred | ensemble | 76.8 | 0.47 | 0.61 | 0.85 |
| RFCRYS | ensemble | 80.0 | 0.53 | 0.51 | 0.95 |
| SCMCRYS | ensemble | 76.1 | 0.44 | 0.46 | 0.91 |
Results come from the work RFCRYS [14].
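The four measures reported in these tables can be computed from a confusion matrix as sketched below. The record does not state the formulas, so this assumes the standard definitions of accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC); the function name and the example counts are hypothetical and are not taken from the study.

```python
import math
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Standard binary-classification measures from confusion-matrix counts.

    tp/fn: crystallizable proteins predicted correctly/incorrectly.
    tn/fp: non-crystallizable proteins predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "mcc": mcc,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }


# Made-up counts for illustration only.
example = classification_metrics(tp=50, fn=50, tn=90, fp=10)
```

With these made-up counts the classifier has sensitivity 0.5, specificity 0.9 and accuracy 0.7. A pattern of low sensitivity with high specificity, as seen in several rows above, means the classifier misses many crystallizable proteins while rarely mislabeling non-crystallizable ones.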
Performances of SVM using amino acid composition (AAC) and p-collocated AA pairs.
| Feature Name | Test Accuracy (%) | MCC | Sensitivity | Specificity |
| AAC | 73.12 | 0.35 | 0.38 | 0.91 |
| p = 0 | 77.55 | 0.47 | 0.45 | 0.94 |
| p = 1 | 77.02 | 0.46 | 0.49 | 0.91 |
| p = 2 | 76.57 | 0.44 | 0.47 | 0.91 |
| p = 3 | 76.65 | 0.44 | 0.44 | 0.93 |
| p = 4 | 77.02 | 0.45 | 0.45 | 0.93 |
| p = 5 | 76.46 | 0.44 | 0.48 | 0.91 |
| p = 6 | 77.69 | 0.47 | 0.50 | 0.91 |
| p = 7 | 76.37 | 0.44 | 0.44 | 0.93 |
| p = 8 | 76.82 | 0.45 | 0.47 | 0.92 |
| p = 9 | 77.52 | 0.47 | 0.51 | 0.91 |
| Mean (p = 0 to 9) | 76.967 | 0.453 | 0.47 | 0.92 |
The propensity scores of amino acids to be crystallizable and related physicochemical properties.
| Amino acid | Crystallizability Score (rank) | Solubility Score (rank) | Melting point (rank) | Molecular weight (rank) | Conformational Entropy (rank) |
| E-Glu | 486.38 (1) | 570.85 (2) | 249 (13) | 147.13 (14) | 1.81 (17) |
| G-Gly | 454.90 (2) | 378.05 (14) | 290 (5) | 75.07 (1) | 0 (1) |
| A-Ala | 451.38 (3) | 599.42 (1) | 297 (3) | 89.09 (2) | 0 (1) |
| H-His | 451.23 (4) | 406.18 (12) | 277 (10) | 155.16 (16) | 0.96 (9) |
| V-Val | 449.23 (5) | 424.17 (6) | 293 (4) | 117.15 (5) | 0.51 (4) |
| I-Ile | 445.63 (6) | 414.75 (9) | 284 (7) | 131.17 (12) | 0.89 (8) |
| Y-Tyr | 429.83 (7) | 339.80 (19) | 344 (1) | 181.19 (19) | 0.98 (10) |
| M-Met | 423.63 (8) | 420.88 (7) | 283 (8) | 149.21 (6) | 1.61 (14) |
| W-Trp | 408.88 (9) | 350.00 (18) | 282 (9) | 204.23 (20) | 0.98 (10) |
| K-Lys | 398.30 (10) | 445.27 (4) | 224 (17) | 146.19 (13) | 1.94 (18) |
| L-Leu | 395.95 (11) | 440.73 (5) | 337 (2) | 131.17 (8) | 0.78 (7) |
| D-Asp | 394.53 (12) | 507.95 (3) | 270 (11) | 133.10 (11) | 1.25 (12) |
| F-Phe | 392.83 (13) | 420.12 (8) | 284 (6) | 165.19 (17) | 0.58 (6) |
| T-Thr | 392.45 (14) | 411.02 (10) | 253 (12) | 119.12 (15) | 1.63 (15) |
| R-Arg | 376.90 (15) | 370.58 (16) | 238 (14) | 174.20 (18) | 2.03 (19) |
| P-Pro | 372.28 (16) | 406.23 (11) | 222 (18) | 115.13 (4) | 0 (1) |
| Q-Gln | 364.80 (17) | 400.02 (13) | 185 (19) | 146.15 (9) | 2.11 (20) |
| C-Cys | 357.43 (18) | 363.83 (17) | 178 (20) | 121.16 (7) | 0.55 (5) |
| N-Asn | 346.48 (19) | 376.65 (15) | 236 (15) | 132.12 (10) | 1.57 (13) |
| S-Ser | 271.93 (20) | 334.10 (20) | 228 (16) | 105.09 (3) | 1.71 (16) |
| R | 1.00 | 0.52 | 0.54 | 0.05 | −0.32 |
| R1 | 1.00 | 0.69 | 0.61 | −0.12 | −0.40 |
| R2 | 1.00 | 0.93 | 0.90 | 0.30 | −0.60 |
R is the correlation between the crystallizability scores and each of the other physicochemical properties of amino acids.
R1 is the correlation between the crystallizability scores and each of the other physicochemical properties of the sequences in the training dataset.
R2 is the correlation between the crystallizability scores and each of the other physicochemical properties of the sequences in a set consisting of the 20 sequences with the highest crystallizability scores and the 20 sequences with the lowest.
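The R value for the solubility column can be checked directly from the table. The sketch below assumes that R is the Pearson correlation coefficient, which the record does not state explicitly. The two lists transcribe the crystallizability and solubility scores of the 20 amino acids in table order (E, G, A, ..., S), and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Scores transcribed from the table above, in row order E, G, A, H, V, I, Y, M,
# W, K, L, D, F, T, R, P, Q, C, N, S.
cryst = [486.38, 454.90, 451.38, 451.23, 449.23, 445.63, 429.83, 423.63,
         408.88, 398.30, 395.95, 394.53, 392.83, 392.45, 376.90, 372.28,
         364.80, 357.43, 346.48, 271.93]
solub = [570.85, 378.05, 599.42, 406.18, 424.17, 414.75, 339.80, 420.88,
         350.00, 445.27, 440.73, 507.95, 420.12, 411.02, 370.58, 406.23,
         400.02, 363.83, 376.65, 334.10]

r_solubility = pearson(cryst, solub)  # expected to be close to the tabulated 0.52
```

Under the Pearson assumption, the transcribed values give a coefficient of approximately 0.52, in agreement with the R row for the solubility score.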
Figure 3. Scatter plot of the correlation between solubility scores and crystallizability scores, where R = 0.52.
The five top-ranked physicochemical properties in the AAindex database having the highest absolute correlation with the crystallizability scores of amino acids.
| Rank | AAIndex | Correlation R | Description |
| 1 | AURR980101 | 0.61 | Normalized positional residue frequency at helix termini N4′ (Aurora-Rose, 1998) |
| 2 | MAXF760106 | −0.57 | Normalized frequency of alpha region (Maxfield-Scheraga, 1976) |
| 3 | FASG760102 | 0.54 | Melting point (Fasman, 1976) |
| 4 | NAKH900113 | 0.54 | Ratio of average and computed composition (Nakashima et al., 1990) |
| 5 | SNEP660104 | −0.53 | Principal component IV (Sneath, 1966) |
Figure 4. The three-dimensional structure of Rho GDP-dissociation inhibitor.
(a) The predicted structure of a wild-type Rho GDP-dissociation inhibitor and (b) the structure of a mutant Rho GDP-dissociation inhibitor (NDelta66: K135,138,141A;L196F mutant; 1fso).
The datasets for evaluating the predictors of protein crystallization, obtained from Mizianty and Kurgan [13].
| Dataset | Number in [13] | Number in this study | Final dataset: Positive | Final dataset: Negative |
| CRYS-TRN | 3587 | 3575 | 1197 | 2378 |
| CRYS-TEST | 3585 | 3572 | 1198 | 2374 |
Some short sequences and sequences containing non-amino-acid characters were removed.
Figure 5. The system flowchart of the SCMCRYS method.