| Literature DB >> 35501680 |
Yang Liu, Hansaim Lim, Lei Xie.
Abstract
BACKGROUND: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A big challenge in developing robust and generalizable deep learning models for QSAR is the lack of a large amount of data with high-quality and balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), for chemical compound representations that capture substructure information using unlabeled data. The representations can be used for predicting chemical properties such as binding affinity, toxicity, and others. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds as well as labeled and partially labeled pharmacological data to improve the generalizability of neural network models.
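The noisy-student self-training idea described in the abstract can be sketched in a few lines. This is a toy illustration only: a nearest-centroid classifier stands in for the paper's neural networks, and the handling of partially labeled multi-task data that distinguishes PLANS from plain Noisy Student is omitted.

```python
import numpy as np

def fit_centroids(X, y):
    """Nearest-centroid 'model' (a stand-in for the paper's MLP)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict(centroids, X):
    """Assign each row of X to the class with the nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

def self_train_noisy_student(X_lab, y_lab, X_unlab, rounds=2, noise=0.1, seed=0):
    """Noisy-student loop: the teacher pseudo-labels unlabeled compounds,
    then a student is retrained on labeled + pseudo-labeled data with
    input noise, and the student becomes the next teacher."""
    rng = np.random.default_rng(seed)
    centroids = fit_centroids(X_lab, y_lab)              # teacher
    for _ in range(rounds):
        pseudo = predict(centroids, X_unlab)             # hard pseudo-labels
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        X_noisy = X_all + rng.normal(0.0, noise, X_all.shape)  # noised student input
        centroids = fit_centroids(X_noisy, y_all)        # student -> next teacher
    return centroids
```

In the paper the unlabeled pool comes from millions of ChEMBL compounds; here any unlabeled feature matrix can be passed as `X_unlab`.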
Keywords: Artificial intelligence; Chemical embedding; Deep neural network; Drug discovery; Drug metabolism; Drug toxicity; Drug-target interaction; Graph neural network; Self-supervised learning; Semi-supervised learning
Year: 2022 PMID: 35501680 PMCID: PMC9063120 DOI: 10.1186/s12859-022-04681-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Classification results of the baseline models, MLP with NS, and MLP with PLANS using ECFP for the representation of chemical structures
| Dataset | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Cyp450 | SVM | 56.37 ± 0.93 | 0.53 ± 0.01 | 0.81 ± 0.01 | 0.64 ± 0.01 |
| | RF | 54.44 ± 0.98 | 0.42 ± 0.01 | 0.81 ± 0.02 | 0.55 ± 0.01 |
| | AdaBoost | 52.40 ± 1.01 | 0.25 ± 0.02 | | 0.39 ± 0.02 |
| | XGBoost | 55.13 ± 1.21 | 0.57 ± 0.01 | 0.75 ± 0.01 | 0.65 ± 0.01 |
| | MLP | 51.97 ± 1.12 | 0.64 ± 0.03 | 0.72 ± 0.02 | 0.68 ± 0.02 |
| | MLP + mixup | 54.28 ± 0.79 | 0.60 ± 0.02 | 0.73 ± 0.01 | 0.66 ± 0.01 |
| | MLP + NS | 56.11 ± 1.63 | 0.64 ± 0.01 | 0.76 ± 0.02 | 0.69 ± 0.01 |
| | MLP + mixup + NS | 56.48 ± 1.45 | 0.60 ± 0.04 | 0.76 ± 0.02 | 0.67 ± 0.02 |
| | MLP + PLANS | 58.94 ± 0.96 | 0.72 ± 0.02 | 0.78 ± 0.01 | |
| | MLP + PLANS + mixup | 58.04 ± 0.70 | 0.69 ± 0.02 | 0.76 ± 0.01 | 0.72 ± 0.01 |
| | MLP + PLANS + balancing | 59.02 ± 1.12 | 0.76 ± 0.02 | | |
| | MLP + PLANS + balancing + mixup | | 0.68 ± 0.03 | 0.78 ± 0.01 | 0.73 ± 0.01 |
The best performance is highlighted in bold. The upper part shows the results for the CYP450 dataset and the lower part shows the results for the Tox21 dataset. Note that AdaBoost (underlined) achieves the best recall. However, it is heavily affected by data imbalance: its precision and F1 scores were much lower than those of the other models
Fig. 1 GINFP training loss and ECFP construction. Every bit of the ECFP and of the predicted ECFP constructed from GINFP is shown as a bar below and above the x-axis, respectively. The predicted ECFP values are shown after sigmoid activation
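Predicting each ECFP bit through a sigmoid, as in Fig. 1, corresponds to a per-bit binary classification; a natural reconstruction objective is binary cross-entropy over the fingerprint bits. A minimal sketch of that loss (an assumption for illustration; the paper's exact training loss may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ecfp_reconstruction_loss(logits, ecfp_bits):
    """Mean binary cross-entropy between sigmoid-activated predictions
    and the 0/1 ECFP bit vector. `logits` are raw model outputs; the
    small eps guards against log(0)."""
    p = sigmoid(logits)
    eps = 1e-9
    return -np.mean(ecfp_bits * np.log(p + eps)
                    + (1.0 - ecfp_bits) * np.log(1.0 - p + eps))
```

Confident, correct logits (large positive for 1-bits, large negative for 0-bits) drive the loss toward zero; flipped predictions make it large.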
Classification results of the baseline models, MLP with NS, and MLP with PLANS using GINFP for the representation of chemical structures
| Dataset | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Cyp450 | SVM | 58.19 ± 0.81 | 0.68 ± 0.01 | 0.77 ± 0.01 | 0.72 ± 0.01 |
| | RF | 54.38 ± 0.78 | 0.46 ± 0.01 | 0.79 ± 0.02 | 0.58 ± 0.01 |
| | AdaBoost | 48.13 ± 3.86 | 0.23 ± 0.09 | 0.83 ± 0.14 | 0.35 ± 0.07 |
| | XGBoost | 54.93 ± 0.78 | 0.59 ± 0.02 | 0.76 ± 0.01 | 0.66 ± 0.01 |
| | MLP | 57.31 ± 1.47 | 0.74 ± 0.02 | 0.75 ± 0.02 | 0.74 ± 0.01 |
| | MLP + mixup | 57.42 ± 0.46 | 0.65 ± 0.03 | 0.78 ± 0.02 | 0.71 ± 0.01 |
| | MLP + NS | 59.83 ± 0.41 | 0.70 ± 0.02 | 0.79 ± 0.01 | 0.74 ± 0.01 |
| | MLP + mixup + NS | 58.50 ± 0.48 | 0.65 ± 0.00 | 0.78 ± 0.01 | 0.71 ± 0.01 |
| | MLP + PLANS | 60.61 ± 1.00 | 0.79 ± 0.01 | | |
| | MLP + PLANS + mixup | 59.95 ± 1.41 | 0.72 ± 0.02 | 0.79 ± 0.01 | 0.75 ± 0.01 |
| | MLP + PLANS + balancing | | 0.75 ± 0.02 | | |
| | MLP + PLANS + balancing + mixup | 60.58 ± 1.38 | 0.73 ± 0.02 | 0.78 ± 0.02 | 0.76 ± 0.01 |
The evaluation metric of the best-performing model is highlighted in bold. The upper part shows the results for the CYP450 dataset and the lower part shows the results for the Tox21 dataset
Fig. 2 Sample distribution before and after data balancing. Blue bars represent the original samples. Orange bars represent the samples added from the ChEMBL24 dataset
Fig. 3 Analysis of training results with and without data balancing. Blue bars represent correctly predicted samples. Orange bars represent samples the model failed to recall. Red bars represent samples incorrectly classified into the class by the model. The subpanels show zoomed-in views of the classes, excluding the all-negative class
Fig. 4 Overview of the workflow
Fig. 5 GIN model architecture and GINFP. Sum is used for node-level pooling and mean is used for graph-level pooling
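The pooling choices named in the Fig. 5 caption can be sketched in NumPy: a GIN update sums neighbour features at the node level, and a mean over nodes gives the graph-level embedding. Illustrative only; `W` stands in for GIN's MLP and `eps` for its (possibly learnable) epsilon.

```python
import numpy as np

def gin_layer(H, A, W, eps=0.0):
    """One GIN update: (1 + eps) * h_v + sum of neighbour features,
    followed by a linear map with ReLU standing in for the MLP.
    H: (n_nodes, d) node features, A: (n, n) adjacency, W: (d, d_out)."""
    agg = (1.0 + eps) * H + A @ H      # SUM aggregation over neighbours
    return np.maximum(agg @ W, 0.0)    # ReLU(MLP(...)) sketch

def graph_readout(H):
    """MEAN over nodes yields the graph-level embedding (GINFP-style)."""
    return H.mean(axis=0)
```

For a triangle graph with one-hot node features and an identity weight matrix, every node aggregates to the all-ones vector, so the readout is all ones as well.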
Fig. 6 Architectures of the MLP models
Hyperparameter screening for conventional models
| SVM | | | | |
|---|---|---|---|---|
| Hyperparameters | C | Kernel function | Gamma | Degree |
| Screened range | [0.5, 1.0] | RBF, Sigmoid, Polynomial | Scale, auto | [2, 6] |
| Best | 1.0 | RBF | Scale | N/A |
Fig. 7 Statistics of chemical molecule graphs for the datasets used in our experiments. The upper and lower parts of the first two panels for the ChEMBL dataset use different y-axis scales because a large number of nodes/edges are concentrated in a few bins