Lukasz Kurgan, Ali A Razib, Sara Aghakhani, Scott Dick, Marcin Mizianty, Samad Jahandideh.
Abstract
BACKGROUND: Current protocols yield crystals for <30% of known proteins, indicating that automatically identifying crystallizable proteins may improve high-throughput structural genomics efforts. We introduce CRYSTALP2, a kernel-based method that predicts the propensity of a given protein sequence to produce diffraction-quality crystals. This method utilizes the composition and collocation of amino acids, isoelectric point, and hydrophobicity, as estimated from the primary sequence, to generate predictions. CRYSTALP2 extends its predecessor, CRYSTALP, by enabling predictions for sequences of unrestricted size and provides improved prediction quality.
Year: 2009 PMID: 19646256 PMCID: PMC2731098 DOI: 10.1186/1472-6807-9-50
Source DB: PubMed Journal: BMC Struct Biol ISSN: 1472-6807
Selected set of features.
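| Feature type | Composition | Collocated dipeptides | Collocated tripeptides | pI and hydrophobicity |
| features | L, Y | | EFV, IVV, TKV, F-TK, K-TV, M-DS, P-PE, Q-QQ, R-PS, DP-V, LR-F, MG-S, SA-D, VT-G, YV-E, F-E-F, K-I-R, N-P-G, S-T-S | pI, average hydrophobicity |
| # features | 2 | 65 | 19 | 2 |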
The underscored features in the "Collocated dipeptides" column indicate features that were also used in CRYSTALP.
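The feature names in the table above can be read as simple sequence statistics. The following is a minimal sketch, not the authors' implementation. It assumes that composition is the fraction of residues of a given type and that each "-" in a collocation name (for example F-TK or F-E-F) stands for exactly one arbitrary intervening residue. Both conventions, the function names, and the choice to return raw counts are assumptions made for illustration. pI and hydrophobicity are omitted because this record does not state which scales were used.

```python
import re


def composition(seq: str, residue: str) -> float:
    """Fraction of positions in seq occupied by the given amino acid."""
    return seq.count(residue) / len(seq) if seq else 0.0


def collocation_count(seq: str, pattern: str) -> int:
    """Count occurrences (overlaps included) of a collocation pattern.

    Assumption: each '-' in the pattern matches exactly one arbitrary residue.
    """
    regex = "".join("." if ch == "-" else re.escape(ch) for ch in pattern)
    # A zero-width lookahead lets overlapping occurrences be counted.
    return len(re.findall(f"(?={regex})", seq))


def feature_vector(seq, residues=("L", "Y"), patterns=("EFV", "F-TK", "F-E-F")):
    """Build a small, illustrative feature dictionary for one sequence."""
    feats = {f"comp_{r}": composition(seq, r) for r in residues}
    feats.update({f"colloc_{p}": collocation_count(seq, p) for p in patterns})
    return feats


if __name__ == "__main__":
    # Hypothetical toy sequence, used only to exercise the functions.
    print(feature_vector("MEFVLYFATK"))
```

Only three of the listed collocations are used as defaults here; the full feature set would pass all selected patterns in the same way.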
Comparison of prediction quality measured via accuracy, MCC and AROC between the proposed and five competing methods.
| Dataset | Method | Accuracy (%) | MCC | AROC |
| D418¹ | SECRET | 70.0 | 0.34 | N/A |
| | CRYSTALP | 77.5 | 0.55 | N/A |
| | CRYSTALP2 | 77.5 | 0.55 | N/A |
| TEST-RL² | CRYSTALP | 46.5 | -0.07 | N/A |
| | SECRET | 58.1 | 0.16 | 0.58 |
| | ParCrys-W | 67.4 | 0.38 | 0.84 |
| | OB-Score | 69.8 | 0.40 | 0.71 |
| | ParCrys | 79.1 | 0.58 | 0.84 |
| | XtalPred⁴ | 76.7 | 0.54 | 0.82 |
| | CRYSTALP2 | 69.8 | 0.40 | 0.72 |
| TEST² | OB-Score | 64.6 | 0.32 | 0.68 |
| | ParCrys-W | 68.0 | 0.37 | 0.75 |
| | ParCrys | 71.5 | 0.45 | 0.75 |
| | XtalPred⁴ | 79.2 | 0.58 | 0.83 |
| | CRYSTALP2 | 75.7 | 0.52 | 0.79 |
| TEST-NEW | ParCrys³ | 70.6 | 0.43 | 0.75 |
| | XtalPred⁴ | 70.0 | 0.40 | 0.76 |
| | CRYSTALP2⁵ | 69.3 | 0.39 | 0.74 |
The AROC values for ParCrys, OB-Score and SECRET were taken from [23].
¹ Results based on a tenfold cross-validation test on the D418 dataset.
² Results based on training the classification model on the FEAT dataset and testing on the TEST-RL or TEST dataset, respectively.
³ Results based on the ParCrys server at
⁴ Results based on the XtalPred server at
⁵ Results based on training the classification model on the FEAT dataset and testing on the TEST-NEW dataset.
⁶ N/A means that the corresponding result was not reported and cannot be duplicated or computed.
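For readers unfamiliar with the AROC column, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it in that pairwise form. It is a generic illustration, not the evaluation code used in the study, and the function name and the toy scores are hypothetical.

```python
def aroc(scores, labels):
    """Area under the ROC curve in its pairwise (Mann-Whitney) form.

    scores: predicted propensities, higher meaning more likely positive.
    labels: 1 for positive (e.g. crystallizable), 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical scores for four chains: two positives, two negatives.
    print(aroc([0.9, 0.3, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of examples, which is acceptable for datasets of the sizes compared here; a rank-based version would be preferable for much larger sets.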
Figure 1. ROC curves for ParCrys, XtalPred, and CRYSTALP2 on the TEST-NEW (top panel), TEST (middle panel) and TEST-RL (bottom panel) datasets.
Comparison of predictions generated by CRYSTALP2, XtalPred and ParCrys on the TEST, TEST-RL and TEST-NEW datasets.
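| CRYSTALP2 | XtalPred, TEST: correct prediction | XtalPred, TEST: incorrect prediction | XtalPred, TEST-RL: correct prediction | XtalPred, TEST-RL: incorrect prediction | XtalPred, TEST-NEW: correct prediction | XtalPred, TEST-NEW: incorrect prediction | ParCrys, TEST-NEW: correct prediction | ParCrys, TEST-NEW: incorrect prediction |
| correct prediction | 91 | 18 | 49 | 11 | 1091 | 295 | 1130 | 256 |
| incorrect prediction | 23 | 12 | 17 | 9 | 310 | 304 | 283 | 331 |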
The table provides a breakdown of the predictions into those that were correct for both CRYSTALP2 and XtalPred (or ParCrys), correct for CRYSTALP2 and incorrect for XtalPred (or ParCrys), incorrect for CRYSTALP2 and correct for XtalPred (or ParCrys), and incorrect for both CRYSTALP2 and XtalPred (or ParCrys).
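The joint counts in this table can be checked against the accuracies reported in the method comparison table. The sketch below assumes that rows refer to CRYSTALP2 and that the four column pairs correspond, in order, to XtalPred on TEST, XtalPred on TEST-RL, XtalPred on TEST-NEW and ParCrys on TEST-NEW. That assignment is inferred here from the column totals and the reported accuracies rather than stated in the record, and the function name is illustrative.

```python
def accuracies(both, only_a, only_b, neither):
    """Per-method accuracy (%) from a 2x2 table of joint prediction outcomes.

    both:    chains predicted correctly by methods A and B
    only_a:  correct for A, incorrect for B
    only_b:  incorrect for A, correct for B
    neither: incorrect for both
    """
    total = both + only_a + only_b + neither
    acc_a = 100.0 * (both + only_a) / total
    acc_b = 100.0 * (both + only_b) / total
    return acc_a, acc_b, total


# Counts copied from the table; A is CRYSTALP2, B is the named comparator.
# The dataset/comparator labels are the inferred column assignment.
TABLE = {
    "TEST, XtalPred": (91, 18, 23, 12),
    "TEST-RL, XtalPred": (49, 11, 17, 9),
    "TEST-NEW, XtalPred": (1091, 295, 310, 304),
    "TEST-NEW, ParCrys": (1130, 256, 283, 331),
}

if __name__ == "__main__":
    for name, counts in TABLE.items():
        acc_a, acc_b, total = accuracies(*counts)
        print(f"{name}: n={total}, CRYSTALP2={acc_a:.2f}%, comparator={acc_b:.2f}%")
```

Under this reading the totals come to 144, 86 and 2000 chains, and the marginal accuracies agree with the reported values to within rounding, which supports the assumed column order.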
Comparison of prediction quality measured via accuracy, MCC and AROC between the proposed method that uses the set of 88 features (including composition, collocation, pI and hydrophobicity), a method that uses the 86 composition and collocation features, and a method that uses only pI and hydrophobicity features.
| Dataset | Features | Accuracy (%) | MCC | AROC |
| TEST-RL | only pI and hydrophobicity (2 features) | 67.4 | 0.38 | 0.63 |
| | only composition and collocation (86 features) | 62.8 | 0.26 | 0.66 |
| | CRYSTALP2 (88 features) | 69.8 | 0.40 | 0.72 |
| TEST | only pI and hydrophobicity (2 features) | 66.0 | 0.37 | 0.66 |
| | only composition and collocation (86 features) | 63.2 | 0.26 | 0.69 |
| | CRYSTALP2 (88 features) | 75.7 | 0.52 | 0.79 |
| TEST-NEW | only pI and hydrophobicity (2 features) | 68.8 | 0.41 | 0.71 |
| | only composition and collocation (86 features) | 61.9 | 0.24 | 0.66 |
| | CRYSTALP2 (88 features) | 69.3 | 0.39 | 0.74 |
Results are based on training the classification model on the FEAT dataset and testing on the TEST-RL, TEST, and TEST-NEW datasets, respectively.
Figure 2. ROC curves for CRYSTALP2, the predictions based on the 86 composition and collocation features, and a method that uses only pI and hydrophobicity features. Top panel shows results on the TEST-NEW dataset, middle panel on the TEST dataset and bottom panel on the TEST-RL dataset.
Figure 3. Scatter plots showing the relation between the selected input features for crystallizable (denoted by green markers) and noncrystallizable (red markers) protein chains from the FEAT dataset. Panel A shows the relation between the summed values of the 6 collocations associated with crystallizable proteins (x-axis) and the 6 collocations associated with the noncrystallizable proteins (y-axis). Panel B shows the relation between pI (x-axis) and hydrophobicity (y-axis).
Figure 4. Prediction quality (y-axis) measured with accuracy and MCC (shown using markers) when training the CRYSTALP2 model on subsets of the FEAT training dataset (x-axis) and testing on the TEST-NEW, TEST and TEST-RL datasets. The solid and dashed lines show linear regression trends associated with the increasing size of the training dataset for the accuracy and MCC, respectively.
Comparison of prediction quality measured with sensitivity and specificity for the prediction of the crystallizable and noncrystallizable proteins by the CRYSTALP2 method.
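| Dataset | Crystallizable: sensitivity (%) | Crystallizable: specificity (%) | Noncrystallizable: sensitivity (%) | Noncrystallizable: specificity (%) |
| TEST-RL | 74.4 | 65.1 | 65.1 | 74.4 |
| TEST | 79.1 | 72.2 | 72.2 | 79.1 |
| TEST-NEW | 76.1 | 62.6 | 62.5 | 76.1 |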
Results are based on training the classification model on the FEAT dataset and testing on the TEST-RL, TEST, and TEST-NEW datasets, respectively.
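The sensitivity and specificity values above can be tied back to the accuracy and MCC reported for CRYSTALP2. The sketch below does this for the TEST dataset. It assumes 144 chains in total (the sum of the TEST counts in the prediction comparison table) split equally into 72 crystallizable and 72 noncrystallizable chains. The equal split is an assumption not stated in this record, and the function names are illustrative. With equal class sizes the result does not depend on which class is treated as positive.

```python
from math import sqrt


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Recover integer confusion-matrix counts from rates given as fractions."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_neg - tn, tn, n_pos - tp  # tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Percentage of correctly classified examples."""
    return 100.0 * (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # TEST row: 79.1% and 72.2%; the 72/72 class split is an assumption.
    counts = confusion_from_rates(0.791, 0.722, n_pos=72, n_neg=72)
    print(counts, round(accuracy(*counts), 1), round(mcc(*counts), 2))
```

Under the assumed split, the recovered counts reproduce the 75.7% accuracy and 0.52 MCC listed for CRYSTALP2 on TEST, which suggests the tables are mutually consistent.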