Bo Qiang, Junyong Lai, Hongwei Jin, Liangren Zhang, Zhenming Liu.
Abstract
A large proportion of lead compounds are derived from natural products, yet most natural products have not been fully profiled against their potential targets. To help address this problem, a model using transfer learning was built to predict targets for natural products. The model was pre-trained on a processed ChEMBL dataset and then fine-tuned on a natural product dataset. Benefiting from transfer learning and a data balancing technique, the model achieved a highly promising area under the receiver operating characteristic curve (AUROC) score of 0.910 with limited task-related training samples. Embedding space analysis demonstrates that the model's outputs for natural products are reliable, since fine-tuning reduces the difference between the embedding distributions. Case studies on drug datasets confirmed the model's performance: the fine-tuned model successfully recovered all the targets of 62 drugs. Compared with a previous study, our model achieved better results both in AUROC validation and in its success rate for ranking active targets among the top predictions. The target prediction model using transfer learning can be applied in the field of natural product-based drug discovery and has the potential to find more lead compounds or to assist researchers in drug repurposing.
Keywords: deep learning; natural product; target prediction; transfer learning
Year: 2021 PMID: 33924898 PMCID: PMC8124298 DOI: 10.3390/ijms22094632
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
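The pre-train/fine-tune workflow described in the abstract can be sketched with a toy logistic model in plain Python. This is only a minimal illustration of the idea; the paper uses a deep network on ChEMBL and natural-product data, and all datasets, sizes, and learning rates below are hypothetical stand-ins:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(w, data, lr, epochs):
    """Plain SGD on a single-output logistic model; `data` is (features, label) pairs."""
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            grad = p - y  # derivative of the log-loss w.r.t. the logit
            for i in range(len(w)):
                w[i] -= lr * grad * x[i]

def log_loss(w, data):
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

random.seed(0)
# Hypothetical stand-ins: a large "ChEMBL-like" source set and a small,
# noisier "natural-product-like" target set.
source = [([1.0, random.gauss(i % 2, 0.3)], i % 2) for i in range(200)]
target = [([1.0, random.gauss(i % 2, 0.5)], i % 2) for i in range(20)]

w = [0.0, 0.0]
train(w, source, lr=0.05, epochs=5)    # pre-training on the source set
before = log_loss(w, target)
train(w, target, lr=0.005, epochs=20)  # fine-tuning with a lower learning rate
after = log_loss(w, target)
```

The key design choice mirrored here is that fine-tuning reuses the pre-trained weights and continues training at a smaller learning rate, so the limited target-domain data adjusts rather than overwrites what was learned from the source domain.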
Figure 1 (a) Grid search for the pre-trained network using fivefold cross-validation. (b) Grid search for the pre-trained network over the intersection of the COCONUT dataset and the ChEMBL dataset. The values in the grids are mean AUROC scores; higher scores are shaded a deeper blue, and a score of 0.65 defines the baseline of the transparent color.
Figure 2 The effect of different sets of transfer-learning hyperparameters. Models F1–F6 correspond to fine-tuning with: F1, learning rate 5 × 10⁻², batch size 32; F2, learning rate 5 × 10⁻², batch size 64; F3, learning rate 5 × 10⁻², batch size 128; F4, learning rate 5 × 10⁻³, batch size 32; F5, learning rate 5 × 10⁻³, batch size 64; F6, learning rate 5 × 10⁻³, batch size 128.
Figure 3 The average AUROC of the pre-trained and fine-tuned models on the test set. Models 1–12 use the different hyperparameter sets listed in Table 1. The green boxes are the distributions of AUROC scores computed from the random-split test sets before fine-tuning the pre-trained models, and the blue boxes are the distributions of AUROC scores computed from the same random-split test sets after 100 epochs of fine-tuning.
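The AUROC scores reported throughout can be computed directly from ranked predictions. A minimal pure-Python sketch (not the paper's code) using the Mann–Whitney interpretation of AUROC:

```python
def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive outranks a randomly chosen negative
    (ties counted as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

perfect = auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # → 1.0
```

This pairwise formulation is O(n²) but exact; production code would typically use a sorting-based implementation such as scikit-learn's `roc_auc_score`.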
Validation scores of the pre-trained model and fine-tuned model. Improved values are bolded in the table; cells marked "—" are not available.

| Model | AUROC | Sensitivity (SE) | Specificity (SP) | Precision (PR) | Accuracy (ACC) | Matthews Correlation Coefficient (MCC) |
|---|---|---|---|---|---|---|
| Pre-trained model | 0.8461 | 0.3842 | 0.9932 | 0.2884 | 0.9889 | 0.3274 |
| Pre-trained model | 0.8548 | 0.4167 | 0.9935 | 0.4158 | 0.9872 | 0.4098 |
| Fine-tuned model | — | — | 0.9401 ± 0.000012 | 0.06617 ± 0.000113 | 0.9380 ± 0.000008 | 0.1908 ± 0.000319 |
| Fine-tuned model | 0.7646 | — | 0.9159 | 0.0654 | 0.9117 | 0.1636 |
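The non-AUROC columns of the table above all follow from the confusion matrix. A sketch with hypothetical counts (the TP/FP/TN/FN numbers are illustrative, not from the paper):

```python
import math

def metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics as defined in the validation table."""
    se = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    pr = tp / (tp + fp)                    # precision
    acc = (tp + tn) / (tp + fp + tn + fn)  # accuracy
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"SE": se, "SP": sp, "PR": pr, "ACC": acc, "MCC": mcc}

m = metrics(tp=40, fp=10, tn=940, fn=10)
```

The table's pattern of high SP/ACC but low PR/MCC is typical of target prediction: negatives vastly outnumber positives, so accuracy alone is uninformative and MCC is the more balanced summary.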
Figure 4 (a) Dimensionality reduction image of the pre-trained model's embedding space and (b) dimensionality reduction image of the fine-tuned model's embedding space. The two-dimensional space is generated by a step-wise algorithm; more details of the dimensionality reduction method can be found in Section 3.4.
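The paper's exact step-wise reduction is specified in Section 3.4; as a generic illustration of projecting high-dimensional embeddings to two dimensions, a plain PCA step (not necessarily the paper's algorithm) can be written with NumPy:

```python
import numpy as np

def pca_2d(x):
    """Project row-vector embeddings onto their top-2 principal components."""
    x = x - x.mean(axis=0)                 # center the embeddings
    cov = np.cov(x, rowvar=False)          # feature-by-feature covariance
    vals, vecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top2 = vecs[:, np.argsort(vals)[::-1][:2]]
    return x @ top2

# Hypothetical 32-dimensional embeddings for 100 molecules.
emb = np.random.default_rng(0).normal(size=(100, 32))
proj = pca_2d(emb)
```

Step-wise pipelines commonly apply a linear reduction like this first and then a nonlinear method (e.g. t-SNE) on the reduced coordinates, which is cheaper and less noisy than running the nonlinear step on the raw embeddings.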
Figure 5 (a) Number of molecules with a certain number of targets in the ChEMBL dataset and (b) number of molecules with a certain number of targets in the intersection of the ChEMBL dataset and the COCONUT dataset.
Figure 6 AUROC curves of models with the data balancing method removed in the pre-training and fine-tuning steps.
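Figure 6 shows the cost of removing data balancing. One common balancing technique is random oversampling of the minority class; a sketch of that idea (the paper's exact balancing scheme may differ):

```python
import random

def oversample(pairs, seed=0):
    """Duplicate minority-class samples at random until both classes are
    the same size. `pairs` is a list of (sample, label) with labels 0/1."""
    rng = random.Random(seed)
    pos = [p for p in pairs if p[1] == 1]
    neg = [p for p in pairs if p[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    balanced = pairs + extra
    rng.shuffle(balanced)
    return balanced

# Hypothetical imbalanced task: 5 active molecules vs. 95 inactive.
data = [(i, 1) for i in range(5)] + [(i, 0) for i in range(95)]
balanced = oversample(data)
```

Without some such correction, a model on target-prediction data can reach high accuracy by predicting "inactive" everywhere, which is consistent with the low sensitivity of the unbalanced baselines.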
Comparison between the fine-tuned model and STarFish.
| Model | Probability of Top15 | Probability of Top20 | AUROC |
|---|---|---|---|
| Fine-tuned Model | 0.817 | 0.860 | 0.910 |
| STarFish | 0.621 | 0.653 | 0.899 |
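The "Probability of TopK" columns measure how often at least one known active target appears among a molecule's top-K ranked predictions. A sketch with hypothetical molecules and targets (names are illustrative only):

```python
def topk_hit_rate(ranked_targets, active_targets, k):
    """Fraction of molecules whose top-k predicted targets include at
    least one known active target ('Probability of TopK')."""
    hits = sum(
        1 for mol, ranked in ranked_targets.items()
        if set(ranked[:k]) & active_targets.get(mol, set())
    )
    return hits / len(ranked_targets)

# Hypothetical ranked predictions and known actives for two molecules.
preds = {"mol1": ["t1", "t2", "t3"], "mol2": ["t4", "t5", "t6"]}
actives = {"mol1": {"t2"}, "mol2": {"t9"}}
rate = topk_hit_rate(preds, actives, k=2)  # → 0.5
```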
Figure 7 The effect of different cutoff ratios on the SE. The two curves correspond to the STarFish model and the fine-tuned model; validation is carried out on the same test set.
Structure and targets of fluphenazine. Potential targets whose reported interactions are weaker than our data-cleaning standards are marked with asterisks (*) and bolded.

| Structure of Fluphenazine | Predicted Experimental High-Frequency Targets | Recommended High-Frequency Targets |
|---|---|---|
| (structure image not reproduced) | ChEMBL 217, ChEMBL 234, ChEMBL 224, ChEMBL 3371, ChEMBL 225, ChEMBL 287, ChEMBL 1833, ChEMBL 223, ChEMBL 231, ChEMBL 2056, ChEMBL 1867, ChEMBL 319, ChEMBL 1916, ChEMBL 1942, ChEMBL 315 | ChEMBL 214, |
Figure 8 Structures of the approved drugs for which the fine-tuned model predicted all the targets correctly. The drug names and indication information are from DrugBank.
Figure 9 The experimental kinase targets and predicted kinase targets of bosutinib.
Figure 10 Model structure of the target prediction model.
Grid search space of pre-trained models.

| Model | Learning Rate | Batch Size |
|---|---|---|
| Model 1 | 5 × 10⁻² | 256 |
| Model 2 | 5 × 10⁻² | 512 |
| Model 3 | 5 × 10⁻² | 1024 |
| Model 4 | 5 × 10⁻³ | 256 |
| Model 5 | 5 × 10⁻³ | 512 |
| Model 6 | 5 × 10⁻³ | 1024 |
| Model 7 | 5 × 10⁻⁴ | 256 |
| Model 8 | 5 × 10⁻⁴ | 512 |
| Model 9 | 5 × 10⁻⁴ | 1024 |
| Model 10 | 5 × 10⁻⁵ | 256 |
| Model 11 | 5 × 10⁻⁵ | 512 |
| Model 12 | 5 × 10⁻⁵ | 1024 |
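The 12-model grid above can be enumerated programmatically. A minimal sketch, where the `evaluate` callback is a hypothetical stand-in for the paper's fivefold cross-validation returning a mean AUROC:

```python
from itertools import product

# The 12 hyperparameter combinations listed in the table above.
learning_rates = [5e-2, 5e-3, 5e-4, 5e-5]
batch_sizes = [256, 512, 1024]
grid = list(product(learning_rates, batch_sizes))

def grid_search(evaluate):
    """Return the (learning_rate, batch_size) pair with the best score.
    `evaluate(lr, bs)` would train the model and return mean CV AUROC."""
    return max(grid, key=lambda hp: evaluate(*hp))
```

For example, `grid_search` with a scoring function peaked at (5 × 10⁻³, 512) returns exactly that pair.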