Chen Jin1, Zhuangwei Shi2, Chuanze Kang1, Ken Lin2, Han Zhang2.
Abstract
The X-ray diffraction technique is one of the most common methods of ascertaining protein structures, yet only 2-10% of proteins can produce diffraction-quality crystals. Several computational methods have been proposed to predict protein crystallization. Nevertheless, the current state-of-the-art computational methods are limited by the scarcity of experimental data, so the prediction accuracy of existing models has not reached the ideal level. To address these problems, we propose a novel transfer-learning-based framework for protein crystallization prediction, named TLCrys. The framework proceeds in two steps: pre-training and fine-tuning. The pre-training step adopts an attention mechanism to extract both global and local information from the protein sequences. The representation learned in the pre-training step is regarded as knowledge to be transferred and fine-tuned to enhance the performance of crystallization prediction. During pre-training, TLCrys adopts a multi-task learning method, which not only improves the learning ability of protein encoding, but also enhances the robustness and generalization of the protein representation. The multi-head self-attention layer guarantees that different levels of the protein representation can be extracted by the fine-tuning step. During transfer learning, the fine-tuning strategy used by TLCrys improves the task-specialized learning ability of the network. Our method outperforms all previous predictors significantly across five stages of crystallization prediction. Furthermore, the proposed methodology can be well generalized to other protein sequence classification tasks.
Keywords: attention mechanism; fine-tuning; pre-training; protein crystallization; transfer learning
Year: 2022 PMID: 35055158 PMCID: PMC8778968 DOI: 10.3390/ijms23020972
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Structure of the attention module in the TLCrys pre-training step.
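To make the multi-head self-attention idea concrete, the following is a minimal pure-Python sketch of scaled dot-product attention split across several heads. It is not the TLCrys implementation: the helper names (`random_matrix`, `multi_head_self_attention`), the toy sequence length of 5, the embedding width of 12, and the randomly initialised projection matrices are all assumptions made for illustration. Only the choice of 6 heads follows the ablation table.

```python
import math
import random

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights or embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_self_attention(x, num_heads, rng):
    """x is a (sequence_length x d_model) matrix of residue embeddings.

    Returns the attended representation and the per-head attention weights.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs, all_weights = [], []
    for _ in range(num_heads):
        # Each head gets its own query, key and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
        # Scaled dot-product scores between every pair of sequence positions.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        all_weights.append(weights)
    # Concatenate the heads along the feature axis, then apply an output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), all_weights

rng = random.Random(0)
x = random_matrix(5, 12, rng)  # toy input: 5 residues, 12-dimensional embeddings
output, attn = multi_head_self_attention(x, num_heads=6, rng=rng)
print(len(output), len(output[0]), len(attn))
```

Each head attends over the whole sequence with its own projections, which is one way to read the claim that different levels of the representation are captured in parallel.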
Figure 2. The architecture of TLCrys consists of two parts: (a) self-supervised pre-training of protein representation models on protein sequences and Gene Ontology annotations; (b) supervised fine-tuning on the protein crystallization dataset with the pre-trained parameters.
Parameters of TLCrys.
| 512 | 128 | 5 | 9 |
| 1 | 64 | 128 | 4 |
Statistics of datasets
| Tasks | Dataset | Clone f. | Material Production f. | Purification f. | Crystallization f. | Crystallization |
|---|---|---|---|---|---|---|
| CLF | Train | N:9502 | P:14428 | | | |
| CLF | Test | N:1939 | P:2852 | | | |
| MF | Train | N:17017 | P:6913 | | | |
| MF | Test | N:3347 | P:1444 | | | |
| PF | Train | - | N:2318 | P:4702 | | |
| PF | Test | - | N:474 | P:932 | | |
| CF | Train | - | N:224 | P:631 | | |
| CF | Test | - | N:35 | P:138 | | |
| CRYs | Train | N:19509 | P:4421 | | | |
| CRYs | Test | N:3892 | P:899 | | | |
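The class balance implied by these counts can be worked out directly. The sketch below computes the positive fraction of each training set and the accuracy a trivial majority-class predictor would reach on each test set. Reading "N" as negative and "P" as positive samples is an assumption, and the function names are illustrative.

```python
# (negatives, positives) per task, copied from the dataset statistics table.
# Assumption: "N" denotes negative samples and "P" denotes positive samples.
train = {
    "CLF": (9502, 14428),
    "MF": (17017, 6913),
    "PF": (2318, 4702),
    "CF": (224, 631),
    "CRYs": (19509, 4421),
}
test = {
    "CLF": (1939, 2852),
    "MF": (3347, 1444),
    "PF": (474, 932),
    "CF": (35, 138),
    "CRYs": (3892, 899),
}

def positive_fraction(n_neg, n_pos):
    """Share of positive samples in a dataset."""
    return n_pos / (n_neg + n_pos)

def majority_baseline_accuracy(n_neg, n_pos):
    """Accuracy of a predictor that always outputs the larger class."""
    return max(n_neg, n_pos) / (n_neg + n_pos)

for task in train:
    frac = positive_fraction(*train[task])
    base = majority_baseline_accuracy(*test[task])
    print(f"{task}: {frac:.1%} positive in train, majority baseline {base:.1%} on test")
```

Under that reading, fewer than one in five CRYs samples is positive, so a predictor that always answers "negative" would already reach about 81% accuracy on the CRYs test set. This is useful context when comparing accuracy against MCC and F1 score.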
Comparison of our two models with other methods on the test sets.
| Task | Model | AUC | MCC | ACC (%) | SPEC (%) | SEN (%) | PRE (%) | F1 Score (%) |
|---|---|---|---|---|---|---|---|---|
| CLF | PredPPCrys I | 0.711 | 0.296 | 65.33 | 63.58 | 66.50 | 73.16 | 69.67 |
| CLF | PredPPCrys II | 0.725 | 0.322 | 66.54 | 65.56 | 67.20 | 74.44 | 70.63 |
| CLF | Crysalis I | 0.731 | 0.332 | 66.98 | 66.60 | 67.22 | 75.56 | 71.15 |
| CLF | Crysalis II | 0.756 | 0.365 | 68.34 | 69.95 | 68.34 | 76.85 | 72.35 |
| CLF | Direct learning | 0.701 | 0.326 | 64.65 | 76.14 | 56.84 | 77.81 | 65.69 |
| CLF | TLCrys | 0.817 | 0.455 | 72.90 | 74.28 | 71.96 | 80.46 | 77.00 |
| MF | PredPPCrys I | 0.772 | 0.380 | 69.93 | 68.21 | 72.88 | 49.95 | 59.27 |
| MF | PredPPCrys II | 0.793 | 0.416 | 71.95 | 71.36 | 73.30 | 52.70 | 61.32 |
| MF | Crysalis I | 0.759 | 0.377 | 70.23 | 69.93 | 70.99 | 49.25 | 58.15 |
| MF | Crysalis II | 0.793 | 0.427 | 73.08 | 73.58 | 73.09 | 54.15 | 62.21 |
| MF | Direct learning | 0.745 | 0.307 | 73.31 | 88.67 | 37.74 | 58.98 | 46.03 |
| MF | TLCrys | 0.848 | 0.446 | 78.37 | 92.53 | 45.57 | 72.47 | 55.90 |
| PF | PredPPCrys I | 0.800 | 0.460 | 74.83 | 70.52 | 77.02 | 83.77 | 80.25 |
| PF | PredPPCrys II | 0.872 | 0.579 | 79.73 | 81.43 | 78.86 | 89.31 | 83.76 |
| PF | Crysalis I | 0.796 | 0.436 | 73.87 | 67.80 | 73.87 | 82.47 | 77.93 |
| PF | Crysalis II | 0.793 | 0.427 | 73.08 | 73.58 | 73.09 | 54.15 | 62.21 |
| PF | Direct learning | 0.778 | 0.505 | 78.52 | 60.97 | 73.09 | 54.15 | 62.21 |
| PF | TLCrys | 0.861 | 0.583 | 81.58 | 70.25 | 87.34 | 85.24 | 86.27 |
| CF | PredPPCrys I | 0.712 | 0.280 | 67.05 | 67.65 | 66.91 | 89.42 | 76.54 |
| CF | PredPPCrys II | 0.735 | 0.175 | 69.47 | 68.89 | 69.50 | 97.80 | 81.26 |
| CF | Crysalis I | 0.739 | 0.281 | 65.50 | 70.59 | 64.23 | 89.80 | 74.89 |
| CF | Crysalis II | 0.752 | 0.337 | 62.57 | 85.29 | 56.93 | 93.97 | 70.90 |
| CF | Direct learning | 0.694 | 0.123 | 71.10 | 31.43 | 81.16 | 82.35 | 81.75 |
| CF | TLCrys | 0.785 | 0.459 | 79.77 | 68.57 | 82.61 | 91.20 | 86.69 |
| CRYs | ParCrys | 0.611 | 0.132 | 59.66 | 60.56 | 55.91 | 25.40 | 34.93 |
| CRYs | OBScore | 0.638 | 0.184 | 59.28 | 57.78 | 65.49 | 27.14 | 38.38 |
| CRYs | CRYSTALP2 | 0.599 | 0.123 | 51.64 | 48.10 | 67.78 | 22.28 | 33.54 |
| CRYs | XtalPred | - | 0.224 | 65.04 | 65.61 | 62.51 | 29.31 | 39.91 |
| CRYs | SVMCRYs | - | 0.142 | 55.11 | 52.78 | 65.70 | 23.39 | 34.50 |
| CRYs | PPCPred | 0.704 | 0.254 | 63.63 | 62.09 | 70.67 | 29.03 | 41.15 |
| CRYs | XtalPred-RF | - | 0.205 | 60.94 | 59.67 | 66.41 | 27.56 | 38.95 |
| CRYs | SCMCRYS | - | 0.145 | 60.93 | 62.01 | 56.24 | 25.48 | 35.07 |
| CRYs | PredPPCrys I | 0.770 | 0.326 | 69.65 | 69.30 | 71.13 | 35.23 | 47.12 |
| CRYs | PredPPCrys II | 0.838 | 0.428 | 76.04 | 76.21 | 75.30 | 42.64 | 54.45 |
| CRYs | Crysalis I | 0.788 | 0.339 | 71.00 | 70.89 | 71.41 | 35.50 | 47.42 |
| CRYs | Crysalis II | 0.838 | 0.435 | 76.27 | 76.28 | 76.20 | 42.84 | 54.85 |
| CRYs | DeepCrystal | 0.858 | 0.477 | 77.83 | 77.43 | 79.51 | 45.90 | 58.20 |
| CRYs | Direct learning | 0.801 | 0.367 | 83.79 | 95.99 | 31.03 | 64.14 | 41.83 |
| CRYs | TLCrys | 0.879 | 0.546 | 87.24 | 94.96 | 53.84 | 71.18 | 61.30 |
Area Under Curve (AUC), Matthews Correlation Coefficient (MCC), Accuracy (ACC), Specificity (SPEC), Sensitivity (SEN), and Precision (PRE).
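All of the threshold-based metrics above follow from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name `binary_metrics` is illustrative. The example counts are hypothetical: they are not reported in the source and were chosen only to be consistent with the CRYs test set size (899 positives, 3892 negatives, reading "P" and "N" as positive and negative samples). AUC is omitted because it requires the ranked prediction scores rather than hard labels.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)                      # sensitivity (recall)
    spec = tn / (tn + fp)                     # specificity
    pre = tp / (tp + fp)                      # precision
    acc = (tp + tn) / (tp + fp + tn + fn)     # accuracy
    f1 = 2 * pre * sen / (pre + sen)          # harmonic mean of PRE and SEN
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SPEC": spec, "SEN": sen, "PRE": pre, "F1": f1, "MCC": mcc}

# Hypothetical counts (an assumption, not taken from the source).
metrics = binary_metrics(tp=484, fp=196, tn=3696, fn=415)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it is less flattered by a large negative class.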
Ablation experiments on CRYs task.
| Models | AUC | MCC | ACC (%) | F1 Score (%) |
|---|---|---|---|---|
| without multi-head attention | 0.874 | 0.498 | 86.57 | 54.36 |
| 2 heads | 0.875 | 0.537 | 86.69 | 61.32 |
| 4 heads | 0.880 | 0.529 | 86.99 | 59.09 |
| TLCrys (6 heads) | 0.879 | 0.546 | 87.24 | 61.30 |
| 8 heads | 0.879 | 0.541 | 87.30 | 60.10 |
Predicted probability values of TLCrys and other predictors for Sox transcription factor proteins.
| Model | Sox9 FL (−) | Sox9 HMG (+) | Sox17 FL (−) | Sox17 HMG (+) | Sox17 EK-HMG (+) |
|---|---|---|---|---|---|
| TLCrys | 0.156 | 0.674 | 0.260 | 0.791 | 0.681 |
| DeepCrystal | 0.315 | 0.676 | 0.430 | 0.643 | 0.633 |
| TargetCrys | 0.032 | 0.045 | 0.037 | 0.029 | 0.031 |
| Crysalis II | 0.474 | 0.55 | 0.474 | 0.553 | 0.555 |
| Crysalis I | 0.438 | 0.482 | 0.487 | 0.567 | 0.557 |
| PPCPred | 0.039 | 0.658 | 0.089 | 0.462 | 0.523 |
| CrystalP2 | 0.327 | 0.459 | 0.470 | 0.436 | 0.402 |
“+” represents crystallizable protein and “−” represents non-crystallizable protein.
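One way to read this table is to turn each probability into a call and check it against the "+" / "−" labels. The sketch below does that with a decision threshold of 0.5, which is an assumption: the source does not state a threshold, and individual predictors may use a different one. It also reports a separation margin, defined here for illustration as the lowest probability given to a crystallizable construct minus the highest probability given to a non-crystallizable one.

```python
# Labels and probabilities copied from the table; 1 = crystallizable ("+"), 0 = not ("−").
# Column order: Sox9 FL, Sox9 HMG, Sox17 FL, Sox17 HMG, Sox17 EK-HMG.
labels = [0, 1, 0, 1, 1]
probabilities = {
    "TLCrys":      [0.156, 0.674, 0.260, 0.791, 0.681],
    "DeepCrystal": [0.315, 0.676, 0.430, 0.643, 0.633],
    "TargetCrys":  [0.032, 0.045, 0.037, 0.029, 0.031],
    "Crysalis II": [0.474, 0.55, 0.474, 0.553, 0.555],
    "Crysalis I":  [0.438, 0.482, 0.487, 0.567, 0.557],
    "PPCPred":     [0.039, 0.658, 0.089, 0.462, 0.523],
    "CrystalP2":   [0.327, 0.459, 0.470, 0.436, 0.402],
}

def correct_calls(probs, threshold=0.5):
    """Number of constructs whose thresholded prediction matches the label."""
    return sum(int(p >= threshold) == y for p, y in zip(probs, labels))

def separation_margin(probs):
    """Lowest positive-class probability minus highest negative-class probability."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    return min(positives) - max(negatives)

for model, probs in probabilities.items():
    print(f"{model}: {correct_calls(probs)}/5 correct, margin {separation_margin(probs):+.3f}")
```

A positive margin means some threshold separates the two groups perfectly for that predictor, and a larger margin means the separation is less sensitive to where the threshold is placed.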