Emna Harigua-Souiai, Mohamed Mahmoud Heinhane, Yosser Zina Abdelkrim, Oussama Souiai, Ines Abdeljaoued-Tej, Ikram Guizani.
Abstract
Drug discovery and repurposing against COVID-19 is a highly relevant topic, with huge efforts dedicated to delivering novel therapeutics targeting SARS-CoV-2. In this context, computer-aided drug discovery is of interest in orienting the early high-throughput screenings and in optimizing the hit identification rate. We herein propose a pipeline for Ligand-Based Drug Discovery (LBDD) against SARS-CoV-2. Through an extensive search of the literature and multiple steps of filtering, we integrated information on 2,610 molecules having a validated effect against SARS-CoV and/or SARS-CoV-2. The chemical structures of these molecules were encoded through multiple systems to be readily useful as input to conventional machine learning (ML) algorithms or deep learning (DL) architectures. We assessed the performances of seven ML algorithms and four DL algorithms in classifying molecules into two classes: active and inactive. The Random Forests (RF), Graph Convolutional Network (GCN), and Directed Acyclic Graph (DAG) models achieved the best performances. These models were further optimized through hyperparameter tuning and achieved cross-validated ROC-AUC scores of 85, 83, and 79% for the RF, GCN, and DAG models, respectively. An external validation step on the FDA-approved drugs collection revealed a superior potential of DL algorithms to achieve drug repurposing against SARS-CoV-2 based on the dataset herein presented. Namely, GCN and DAG achieved a true positive rate of more than 50%, assessed on the confirmed hits of a PubChem bioassay.
Keywords: SARS-CoV-2; artificial neural network; deep learning; drug discovery and repurposing; graph convolutional networks; machine learning
Year: 2021 PMID: 34912370 PMCID: PMC8667578 DOI: 10.3389/fgene.2021.744170
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. Anticoronavirus dataset composition. (A) Distribution of the pairwise chemical similarity among the molecules based on the Tanimoto coefficient. (B) Proportions of “active” and “inactive” molecules within each experimental category.
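The pairwise Tanimoto similarity mentioned in panel (A) can be illustrated with a minimal sketch. It assumes each molecule is already encoded as a set of "on" fingerprint bit indices; the record does not say which fingerprint type was used, and the function names and toy fingerprints below are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        # Assumption: two empty fingerprints are treated as having zero similarity.
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def pairwise_tanimoto(fingerprints):
    """Similarity for every unordered pair of molecules, in combination order."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Hypothetical toy fingerprints, not taken from the dataset.
toy_fingerprints = [{1, 4, 7, 9}, {1, 4, 8}, {2, 3}]
similarities = pairwise_tanimoto(toy_fingerprints)
```

A histogram of `similarities` over the whole dataset would give a distribution of the kind the caption describes.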
FIGURE 2. ROC-AUC scores of all models for three different datasets (heterogeneous, undersampled homogeneous, and oversampled homogeneous). (A) ROC-AUC scores achieved by all models under the random 80/10/10 split. (B) ROC-AUC scores achieved by all models under the scaffold 80/10/10 split. (C) Boxplots of the ROC-AUC scores achieved by each model on all validation subsets (heterogeneous, undersampled homogeneous, and oversampled homogeneous included) and with both splitting proportions (80/10/10; 60/20/20). (D) Boxplots of the ROC-AUC scores achieved by each model on all test subsets (heterogeneous, undersampled homogeneous, and oversampled homogeneous included) and with both splitting proportions (80/10/10; 60/20/20).
Performances of 11 algorithms in predicting activity class of the anticoronavirus dataset. Optimized settings based on the MoleculeNet benchmarks were considered for all models.
| Model | ROC-AUC (Train) | ROC-AUC (Validation) | ROC-AUC (Test) | F1-score (Train) | F1-score (Validation) | F1-score (Test) | Recall (Train) | Recall (Validation) | Recall (Test) |
|---|---|---|---|---|---|---|---|---|---|
| GraphConv | 0.99 | 0.80 | 0.86 | 0.98 | 0.75 | 0.79 | 0.98 | 0.75 | 0.80 |
| DAG | 0.99 | 0.82 | 0.87 | 0.99 | 0.72 | 0.73 | 0.98 | 0.68 | 0.68 |
| GAT | 0.75 | 0.77 | 0.82 | 0.62 | 0.65 | 0.69 | 0.54 | 0.55 | 0.61 |
| GCN | 0.94 | 0.82 | 0.87 | 0.86 | 0.75 | 0.79 | 0.88 | 0.75 | 0.82 |
| LR | 0.99 | 0.81 | 0.89 | 0.97 | 0.76 | 0.82 | 0.97 | 0.77 | 0.82 |
| SVM | 0.99 | 0.86 | 0.90 | 0.97 | 0.80 | 0.82 | 0.97 | 0.79 | 0.82 |
| RF | 0.99 | 0.86 | 0.90 | 0.99 | 0.78 | 0.81 | 0.99 | 0.80 | 0.81 |
| MTC | 0.81 | 0.77 | 0.84 | 0.67 | 0.71 | 0.68 | 0.99 | 0.99 | 0.99 |
| IRV-MTC | 0.82 | 0.82 | 0.85 | 0.75 | 0.78 | 0.76 | 0.88 | 0.89 | 0.90 |
| Robust MTC | 0.83 | 0.80 | 0.85 | 0.71 | 0.73 | 0.71 | 0.97 | 0.96 | 0.99 |
| XGBoost | 0.93 | 0.84 | 0.88 | 0.85 | 0.76 | 0.80 | 0.82 | 0.73 | 0.84 |
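For readers less familiar with the three metrics in this table, the following is a minimal sketch of how ROC-AUC, recall, and F1-score are defined for a binary active/inactive task. It is not the authors' evaluation code; the labels, scores, and the 0.5 decision threshold are hypothetical.

```python
def roc_auc(y_true, y_score):
    """Probability that a random active is scored above a random inactive (ties count 0.5)."""
    actives = [s for y, s in zip(y_true, y_score) if y == 1]
    inactives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not actives or not inactives:
        raise ValueError("ROC-AUC needs at least one molecule of each class")
    wins = sum(
        1.0 if a > i else 0.5 if a == i else 0.0
        for a in actives
        for i in inactives
    )
    return wins / (len(actives) * len(inactives))


def recall_and_f1(y_true, y_pred):
    """Recall and F1-score for the positive ("active") class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, f1


# Hypothetical labels (1 = active) and predicted scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # assumed 0.5 threshold

auc = roc_auc(y_true, y_score)
recall, f1 = recall_and_f1(y_true, y_pred)
```

Because F1 also depends on precision, a model that labels most molecules as active can show a very high recall alongside a modest F1-score, which is one reason to read the two columns together.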
FIGURE 3. Performances of the optimized models. (A) Radar plots of the models’ performances assessed on the train set (left) and the test set (right) through ROC-AUC, F1-score, Accuracy, Cohen’s Kappa, MCC, and Recall. (B) The ROC curve of all three models. (C) The Precision-Recall (PR) curve of all three models.
Tenfold cross-validation results for the best classifiers. Scores are presented as mean values ± SD based on 10 iterations.
| Model | ROC-AUC | F1-score | Recall |
|---|---|---|---|
| RF | 0.85 ± 0.026 | 0.78 ± 0.027 | 0.76 ± 0.032 |
| DAG model | 0.79 ± 0.013 | 0.73 ± 0.052 | 0.74 ± 0.103 |
| GCN model | 0.83 ± 0.026 | 0.73 ± 0.037 | 0.70 ± 0.082 |
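The "mean ± SD over 10 iterations" reporting used in this table can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: the helper names are hypothetical, the folds are plain shuffled folds without stratification, all 2,610 molecules from the abstract are assumed to enter the cross-validation, and the sample standard deviation is used because the record does not say which SD estimator was reported.

```python
import random
import statistics


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, non-overlapping folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train_idx, test_idx


def summarize(scores):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Dataset size taken from the abstract; using it here is an assumption.
splits = list(kfold_indices(2610, k=10))

# Hypothetical placeholder scores for two folds, only to show the aggregation step.
mean_score, sd_score = summarize([0.80, 0.90])
```

In practice one model would be trained and scored per fold, and the ten resulting scores passed to `summarize`.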
FIGURE 4. ROC-AUC scores of the best classifiers tested on stratified subsets of the data (homogeneous, heterogeneous, and mixed).
External validation of the three models’ performances in comparison with experimental results from the PubChem bioassay AID_1409594. Columns 2–5 report TP, TN, FP, and FN counts based on the overall predictions of the algorithms. Columns 6–9 report the TP, TN, FP, and FN counts based on the subselection of molecules with prediction confidence higher than 80%.
| Model | TP (all molecules) | TN (all molecules) | FP (all molecules) | FN (all molecules) | TP (>80% confidence) | TN (>80% confidence) | FP (>80% confidence) | FN (>80% confidence) |
|---|---|---|---|---|---|---|---|---|
| RF | 4 | 490 | 119 | 13 | 1 | 425 | 12 | 8 |
| DAG | 7 | 719 | 340 | 10 | 3 | 359 | 99 | 3 |
| GCN | 8 | 877 | 182 | 9 | 5 | 835 | 147 | 9 |
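The rates implied by these counts can be derived with a short sketch. The counts are transcribed from the table exactly as printed; the class and variable names are hypothetical, and the true and false positive rate formulas are the standard definitions rather than anything specific to this study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int
    tn: int
    fp: int
    fn: int

    def true_positive_rate(self):
        """Fraction of confirmed hits that the model predicted as active."""
        total = self.tp + self.fn
        return self.tp / total if total else 0.0

    def false_positive_rate(self):
        """Fraction of non-hits that the model predicted as active."""
        total = self.fp + self.tn
        return self.fp / total if total else 0.0


# Counts copied from the table as printed (order: TP, TN, FP, FN).
ALL_MOLECULES = {
    "RF": Counts(4, 490, 119, 13),
    "DAG": Counts(7, 719, 340, 10),
    "GCN": Counts(8, 877, 182, 9),
}
ABOVE_80_CONFIDENCE = {
    "RF": Counts(1, 425, 12, 8),
    "DAG": Counts(3, 359, 99, 3),
    "GCN": Counts(5, 835, 147, 9),
}

rates = {
    name: (counts.true_positive_rate(), counts.false_positive_rate())
    for name, counts in ALL_MOLECULES.items()
}
```

The same two methods can be applied to `ABOVE_80_CONFIDENCE` to compare how the confidence threshold trades recovered hits against false positives.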