Fanchi Meng, Chen Wang, Lukasz Kurgan.
Abstract
BACKGROUND: Predictors of the propensity of protein sequences for successful crystallization have been actively developed for over a decade. In recent years, a few novel methods expanded the scope of these predictions to cover additional steps of the protein production and structure determination pipelines. The predictive performance of the current methods is modest, for two reasons: their only input is the protein sequence, and the experimental annotations of these data may be inconsistent because they were collected across many laboratories and centers. Even so, these modest levels of predictive quality remain practical when compared with the reported low success rates of crystallization, which are below 10%. We focus on another important aspect: the high computational cost of running the predictors that offer the expanded scope.
Keywords: Prediction; Protein production; Protein structure determination; Structural genomics; Target selection; X-ray crystallography
Year: 2018 PMID: 29295714 PMCID: PMC6389161 DOI: 10.1186/s12859-017-1995-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary of the considered and selected feature sets
| Feature types | The complete set of considered features | Features used to predict MF step | Features used to predict PF step | Features used to predict CF step | Features used to predict CR step |
|---|---|---|---|---|---|
| Amino acid composition | 420 | 2 | 2 | 0 | 3 |
| Clusters of amino acid types | 336 | 2 | 3 | 1 | 1 |
| Physicochemical properties of amino acids | 448 | 3 | 2 | 2 | 5 |
| Physicochemical properties of proteins | 4 | 1 | 1 | 1 | 1 |
| Sequence complexity and intrinsic disorder | 68 | 1 | 0 | 0 | 1 |
| Total | 1276 | 9 | 8 | 4 | 11 |
Fig. 1 Sample result generated by the fDETECT webserver
Fig. 2 Comparative analysis of runtime. The analysis covers the four methods that predict the four steps of the protein production and crystallization process. Proteins in the TESTsmall dataset were divided into five equally sized subsets with increasing sequence length (very short, short, medium, long and very long). For each subset of proteins and each of the four predictors we show the average runtime [msec] as bars, the numerical value of the average inside the bars, and the corresponding standard deviation as error bars
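The binning described in the Fig. 2 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `runtime_by_length_bin`, the `(sequence_length, runtime_msec)` record layout, and the use of the sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev

def runtime_by_length_bin(records, n_bins=5):
    """Split (sequence_length, runtime_msec) records into n_bins equally
    sized subsets of increasing length and return (mean, std) per subset.

    Hypothetical helper; the record layout is an assumption.
    """
    if len(records) < n_bins:
        raise ValueError("need at least one record per bin")
    # Sort by sequence length so that bins go from very short to very long.
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    summary = []
    for i in range(n_bins):
        chunk = ordered[i * n // n_bins:(i + 1) * n // n_bins]
        times = [t for _, t in chunk]
        # Sample standard deviation (assumed); 0.0 for a single-item bin.
        spread = stdev(times) if len(times) > 1 else 0.0
        summary.append((mean(times), spread))
    return summary

# Toy data: runtime grows with length (values are made up).
toy = [(length, float(length)) for length in range(1, 11)]
bins = runtime_by_length_bin(toy)
```

Each tuple in `bins` corresponds to one bar and its error bar in a plot of this kind.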
Comparison of estimated and measured runtime
| Predictor | Coefficients of the linear fit to the measured runtime values | Runtime estimated using linear fit for the human proteome | Runtime estimated using linear fit for the TESTsmall dataset | Runtime measured for the TESTsmall dataset |
|---|---|---|---|---|
| fDETECT | | 0.71 h | 0.90 min | 0.88 min |
| PPCpred | | 76.15 days | 27.49 h | 28.35 h |
| Crysalis | | 0.73 h | 0.77 min | 0.77 min |
| PredPPCrys | | 79.12 days | 34.04 h | 34.48 h |
The analysis covers the four methods that predict the four steps of the protein production and crystallization process. The second column shows the coefficients of a linear fit of the measured runtime against protein sequence length on the TESTsmall dataset, i.e., runtime = a*sequence_length + b. The total runtimes estimated with that linear fit for the proteins in the complete human proteome and in the TESTsmall benchmark dataset are listed in columns three and four, respectively. The right-most column shows the total runtime that was empirically measured on the TESTsmall dataset.
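The estimation procedure in this note (fit runtime = a*sequence_length + b on measured values, then sum the fitted runtime over a larger set of proteins) can be sketched as below. This is an illustrative sketch under stated assumptions: ordinary least squares is assumed as the fitting method, the function names `fit_line` and `estimate_total_runtime` are hypothetical, and the numbers are made up rather than taken from the table.

```python
def fit_line(lengths, runtimes):
    """Ordinary least-squares fit of runtime = a * length + b.

    Least squares is an assumption; the source only says "linear fit".
    """
    n = len(lengths)
    mean_x = sum(lengths) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, runtimes))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def estimate_total_runtime(a, b, lengths):
    """Sum the fitted per-protein runtime over a collection of lengths."""
    return sum(a * length + b for length in lengths)

# Made-up measurements: sequence lengths and runtimes in msec.
measured_lengths = [100, 200, 300, 400]
measured_runtimes = [205.0, 405.0, 605.0, 805.0]
a, b = fit_line(measured_lengths, measured_runtimes)

# Extrapolate to a different (hypothetical) set of proteins.
total_msec = estimate_total_runtime(a, b, [50, 150])
```

Comparing the estimate on the benchmark set with the measured total, as the last two columns of the table do, is a simple check on how well the linear model describes the runtime.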
Predictive performance on the TESTlarge dataset
| Metric | Predictor | Material production (MF) average | MF ±std | MF p-value | Purification (PF) average | PF ±std | PF p-value | Crystallization (CF) average | CF ±std | CF p-value | Diffraction-quality crystallization (CR) average | CR ±std | CR p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | fDETECT | | ±0.05 | | | ±0.05 | | | ±0.05 | | | ±0.04 | |
| AUC | Crysalis | | ±0.05 | 0.269 | 0.58 | ±0.05 | <0.001 | | ±0.05 | 0.234 | 0.60 | ±0.04 | <0.001 |
| MCC | fDETECT | | ±0.07 | | | ±0.07 | | | ±0.09 | | | ±0.07 | |
| MCC | Crysalis | 0.10 | ±0.08 | 0.011 | 0.10 | ±0.07 | <0.001 | 0.03 | ±0.08 | <0.001 | 0.16 | ±0.06 | <0.001 |
| Accuracy | fDETECT | | ±2.0 | | | ±2.3 | | | ±3.4 | | | ±3.3 | |
| Accuracy | Crysalis | 74.8 | ±2.1 | 0.012 | 70.9 | ±2.3 | <0.001 | 63.6 | ±2.9 | <0.001 | 58.1 | ±3.2 | <0.001 |
We report the average AUC, MCC and accuracy and their corresponding standard deviations over 100 bootstrap tests (each test is based on 25% of randomly chosen proteins). Statistical significance of the differences between fDETECT and Crysalis was measured with a paired t-test; the measured values are normal, which we verified with the Anderson-Darling test at the 0.05 significance level. For each outcome, the best results that are not significantly different from each other (p-value >0.05) are given in bold font.
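The evaluation protocol in this note (100 bootstrap tests on 25% of the proteins, followed by a paired t-test between two predictors) can be sketched as follows. This is a hedged illustration, not the authors' implementation. Assumptions made here: each test samples proteins without replacement, AUC is computed by pairwise comparison of positive and negative scores, and the p-value uses a normal approximation to the t distribution (reasonable at 99 degrees of freedom); an exact p-value could instead come from a routine such as `scipy.stats.ttest_rel`. All function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_pairs(labels, scores_a, scores_b,
                        n_tests=100, fraction=0.25, seed=0):
    """Return n_tests (AUC_a, AUC_b) pairs, each computed on the same
    random subset of proteins (sampling without replacement is assumed)."""
    rng = random.Random(seed)
    n = len(labels)
    k = max(2, round(fraction * n))
    pairs = []
    while len(pairs) < n_tests:
        idx = rng.sample(range(n), k)
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:  # skip subsets that contain a single class
            continue
        pairs.append((auc(y, [scores_a[i] for i in idx]),
                      auc(y, [scores_b[i] for i in idx])))
    return pairs

def paired_t(pairs):
    """Paired t statistic with a two-sided p-value from a normal
    approximation (an assumption made to stay within the standard library)."""
    diffs = [a - b for a, b in pairs]
    sd = stdev(diffs)
    if sd == 0:
        return (0.0, 1.0) if mean(diffs) == 0 else (float("inf"), 0.0)
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p
```

Pairing matters here: both predictors are scored on the same subset in every test, so the t-test operates on per-test differences rather than on two independent samples.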
Predictive performance on the TESTsmall dataset
| Metric | Predictor | Material production (MF) average | MF ±std | MF p-value | Purification (PF) average | PF ±std | PF p-value | Crystallization (CF) average | CF ±std | CF p-value | Diffraction-quality crystallization (CR) average | CR ±std | CR p-value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | fDETECT | | ±0.11 | | 0.64 | ±0.11 | | 0.55 | ±0.11 | | | ±0.07 | |
| AUC | PPCpred | 0.64 | ±0.11 | 0.004 | | ±0.12 | <0.001 | | ±0.12 | <0.001 | | ±0.08 | 0.054 |
| AUC | Crysalis | | ±0.11 | 0.392 | 0.59 | ±0.11 | <0.001 | 0.56 | ±0.10 | 0.366 | 0.60 | ±0.08 | <0.001 |
| AUC | PredPPCrys | 0.62 | ±0.11 | <0.001 | 0.59 | ±0.11 | 0.002 | 0.48 | ±0.12 | <0.001 | 0.62 | ±0.08 | 0.001 |
| AUC | XtalPred | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.59 | ±0.09 | <0.001 |
| AUC | XtalPred-RF | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±0.08 | 0.392 |
| AUC | TargetCrys | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±0.07 | 0.734 |
| AUC | CRYSTALP2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±0.08 | 0.419 |
| MCC | fDETECT | | ±0.17 | | 0.15 | ±0.19 | | | ±0.18 | | 0.20 | ±0.13 | |
| MCC | PPCpred | 0.19 | ±0.20 | 0.039 | | ±0.21 | <0.001 | | ±0.18 | 0.752 | 0.23 | ±0.15 | 0.253 |
| MCC | Crysalis | | ±0.18 | 0.115 | 0.11 | ±0.19 | 0.018 | 0.07 | ±0.16 | <0.001 | 0.15 | ±0.15 | 0.005 |
| MCC | PredPPCrys | 0.12 | ±0.17 | <0.001 | 0.06 | ±0.18 | <0.001 | 0.00 | ±0.19 | <0.001 | 0.19 | ±0.15 | 0.314 |
| MCC | XtalPred | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.18 | ±0.18 | 0.580 |
| MCC | XtalPred-RF | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±0.14 | 0.039 |
| MCC | TargetCrys | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.21 | ±0.12 | 0.889 |
| MCC | CRYSTALP2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.23 | ±0.14 | 0.297 |
| Accuracy | fDETECT | | ±4.9 | | 72.7 | ±6.2 | | | ±6.9 | | | ±6.7 | |
| Accuracy | PPCpred | 77.4 | ±5.4 | 0.040 | | ±6.7 | <0.001 | | ±6.6 | 0.770 | | ±7.6 | 0.211 |
| Accuracy | Crysalis | 77.7 | ±4.9 | 0.104 | 71.4 | ±6.0 | 0.018 | 65.1 | ±6.1 | <0.001 | 57.5 | ±7.3 | 0.004 |
| Accuracy | PredPPCrys | 75.6 | ±4.8 | <0.001 | 70.0 | ±6.0 | <0.001 | 61.9 | ±7.5 | <0.001 | | ±7.5 | 0.348 |
| Accuracy | XtalPred | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±8.8 | 0.315 |
| Accuracy | XtalPred-RF | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±7.2 | 0.117 |
| Accuracy | TargetCrys | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±6.2 | 1.000 |
| Accuracy | CRYSTALP2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | | ±6.7 | 0.256 |
We report the average AUC, MCC and accuracy and their corresponding standard deviations over 100 bootstrap tests (each test is based on 25% of randomly chosen proteins). Statistical significance of the differences between fDETECT and each other method was measured with a paired t-test; the measured values are normal, which we verified with the Anderson-Darling test at the 0.05 significance level. For each outcome, the best results that are not significantly different from each other (p-value >0.05) are given in bold font. NA means that a given method does not provide this type of prediction.
Fig. 3 ROC curves for the four predictors of the four steps of the crystallization pipeline: failure of material production (panel a), failure to purify (panel b), failure to crystallize (panel c) and success in yielding diffraction-quality crystals (panel d). The curves were computed on the TESTsmall dataset
Fig. 4 Correlation between propensities generated by fDETECT and the three other methods that cover multiple steps of the crystallization pipeline: PPCpred, PredPPCrys and Crysalis. We report the values of the Pearson correlation coefficient (PCC) for each of the four steps: MF (on blue background), PF (green), CF (orange) and CR (grey)