Shenmin Guan, Ning Fu.
Abstract
Machine intelligence (MI), including machine learning and deep learning, has been regarded as a promising way to reduce the prohibitively high cost of drug development. However, a dilemma within MI has limited its wide application: machine learning models are easier to interpret but yield worse predictive performance than deep learning models. Therefore, we propose a pipeline called Class Imbalance Learning with Bayesian Optimization (CILBO) to improve the performance of machine learning models in drug discovery. To demonstrate the efficacy of the CILBO pipeline, we developed an example model to predict antibacterial candidates. Comparison of the antibacterial prediction performance between our model and a well-known deep learning model published by Stokes et al. suggests that our model can perform as well as the deep learning model in drug activity prediction. The CILBO pipeline we propose provides a simple, alternative approach to accelerate preliminary screenings and decrease the cost of drug discovery.
Year: 2022 PMID: 35136094 PMCID: PMC8827090 DOI: 10.1038/s41598-022-05717-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Workflow of the final model construction.
Best hyperparameters suggested by Bayesian optimization.
*Frame indicates the hyperparameters for treating imbalanced datasets.
Figure 2. ROC-AUC of our final model.
Confusion matrix of our final model.
| Predicted | Actual | |
|---|---|---|
| | Non-antibacterial | Antibacterial |
| Non-antibacterial | 221 | 0 |
| Antibacterial | 5 | 7 |
This confusion matrix is based on the testing set of our final model; molecules with a prediction score above 0.5 were regarded as predicted antibacterials.
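The thresholding rule in the note above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `confusion_counts`, the 0/1 label encoding, and the toy labels and scores are all assumptions made for the example. The only detail taken from the text is that a score strictly above 0.5 counts as a predicted antibacterial.

```python
from collections import Counter


def confusion_counts(y_true, scores, threshold=0.5):
    """Tally (actual, predicted) pairs.

    Labels are assumed to be 1 for antibacterial and 0 for non-antibacterial.
    A score strictly above `threshold` is treated as a predicted antibacterial.
    """
    counts = Counter()
    for actual, score in zip(y_true, scores):
        predicted = 1 if score > threshold else 0
        counts[(actual, predicted)] += 1
    return counts


# Hypothetical labels and scores for illustration only (not the paper's data).
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.10, 0.40, 0.50, 0.30, 0.70, 0.95]

counts = confusion_counts(y_true, scores)
true_negatives = counts[(0, 0)]
false_positives = counts[(0, 1)]
false_negatives = counts[(1, 0)]
true_positives = counts[(1, 1)]
print(true_negatives, false_positives, false_negatives, true_positives)
```

Keying the tally by explicit `(actual, predicted)` pairs avoids any ambiguity about which axis of the matrix holds which quantity. Note that a score of exactly 0.5 falls on the non-antibacterial side under the "above 0.5" rule.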
Figure 3. Plot of the prediction results by both models. The plot shows the 162 molecules that have empirically tested antibacterial information and that received the top or bottom antibacterial prediction scores from Stokes’ model[10]. Blue dots represent non-antibacterials; orange dots represent antibacterials. The X-axis (Pred_Score_Forest) is the score predicted by our final model, a random forest classifier; the Y-axis (Pred_Score_Net) is the score predicted by Stokes’ final model[10], a graph neural network.
| Hyperparameter | Value type (range or choices) |
|---|---|
| n_estimators | Integer (5, 5000) |
| criterion | Categorical (["gini", "entropy"]) |
| max_depth | Integer (1, 6000) |
| min_samples_split | Integer (2, 200) |
| min_samples_leaf | Integer (1, 200) |
| bootstrap | Categorical ([True, False]) |
| class_weight | Categorical (["balanced", "balanced_subsample", None]) |
| sampling_strategy | Categorical (["majority", "not minority", "not majority"]) |
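As a rough illustration of how the search space in this table might be written down in code, the sketch below encodes each row as either an integer range or a list of choices and draws one candidate configuration. Several things here are assumptions rather than details from the source: the names `SEARCH_SPACE` and `sample_configuration` are hypothetical, the integer bounds are treated as inclusive, and the draw is plain uniform random sampling used as a stand-in. A Bayesian optimizer would instead propose candidates based on the scores of earlier trials.

```python
import random

# Search space transcribed from the table: tuples are integer ranges
# (assumed inclusive on both ends), lists are categorical choices.
SEARCH_SPACE = {
    "n_estimators": (5, 5000),
    "criterion": ["gini", "entropy"],
    "max_depth": (1, 6000),
    "min_samples_split": (2, 200),
    "min_samples_leaf": (1, 200),
    "bootstrap": [True, False],
    "class_weight": ["balanced", "balanced_subsample", None],
    "sampling_strategy": ["majority", "not minority", "not majority"],
}


def sample_configuration(space, rng):
    """Draw one configuration uniformly at random.

    This is a stand-in for an optimizer's proposal step, not Bayesian
    optimization itself.
    """
    config = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            low, high = domain
            config[name] = rng.randint(low, high)  # inclusive bounds
        else:
            config[name] = rng.choice(domain)
    return config


rng = random.Random(0)  # fixed seed so the draw is repeatable
candidate = sample_configuration(SEARCH_SPACE, rng)
print(candidate)
```

The last two entries, `class_weight` and `sampling_strategy`, are the ones that address class imbalance; searching over them alongside the ordinary forest hyperparameters is what lets a single optimization run choose the imbalance treatment as well.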