Pietro Delre, Giovanna J Lavado, Giuseppe Lamanna, Michele Saviano, Alessandra Roncaglioni, Emilio Benfenati, Giuseppe Felice Mangiatordi, Domenico Gadaleta.
Abstract
Drug-induced cardiotoxicity is a common side effect of drugs in clinical use or under postmarket surveillance and is commonly due to off-target interactions with the cardiac human ether-a-go-go-related gene (hERG) potassium channel. Therefore, prioritizing drug candidates based on their hERG-blocking potential is a mandatory step in the early preclinical stage of a drug discovery program. Herein, we trained and validated 30 ligand-based classifiers of hERG-related cardiotoxicity based on 7,963 curated compounds extracted from the freely accessible repository ChEMBL (version 25). Different machine learning algorithms were tested, namely, random forest, K-nearest neighbors, gradient boosting, extreme gradient boosting, multilayer perceptron, and support vector machine. The application of 1) best practices for data curation, 2) the feature selection method VSURF, and 3) the synthetic minority oversampling technique (SMOTE) to handle the unbalanced data allowed for the development of highly predictive models (maximum balanced accuracy (BA) = 0.91, maximum AUC = 0.95). Remarkably, the temporal validation not only supported the predictivity of the classifiers presented herein but also suggested their ability to outperform models commonly used in the literature. From a methodological point of view, the study puts forward a new computational workflow, freely available in the GitHub repository (https://github.com/PDelre93/hERG-QSAR), as a valuable tool for building highly predictive models of hERG-mediated cardiotoxicity.
Keywords: QSAR; cardiotoxicity; consensus modeling; hERG; ligand-based
Year: 2022 PMID: 36133824 PMCID: PMC9483173 DOI: 10.3389/fphar.2022.951083
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.988
Partitioning schemes before (top) and after (bottom) the application of the applicability domain (AD) at each considered toxicity threshold. For hERG-DB, the composition of the training set (TS), validation set (VS), and external set (ES) is reported at each threshold: the total number of chemicals (#), the number of hERG blockers (ACT) and hERG non-blockers (INA), and the ratio between non-blockers and blockers (INA:ACT).
| Dataset | # (pIC50 = 6) | INA (pIC50 = 6) | ACT (pIC50 = 6) | INA:ACT (pIC50 = 6) | # (pIC50 = 5) | INA (pIC50 = 5) | ACT (pIC50 = 5) | INA:ACT (pIC50 = 5) |
|---|---|---|---|---|---|---|---|---|
| Starting composition | | | | | | | | |
| TS | 6371 | 5388 | 983 | 5:1 | 6371 | 3295 | 3076 | 1:1 |
| VS | 1592 | 1346 | 246 | 5:1 | 1592 | 821 | 771 | 1:1 |
| ES | 792 | 676 | 116 | 5:1 | 792 | 365 | 427 | 1:1 |
| Applicability domain (AD) | | | | | | | | |
| VS | 1583 | 1338 | 245 | 5:1 | 1579 | 810 | 769 | 1:1 |
| ES | 754 | 642 | 112 | 5:1 | 754 | 350 | 404 | 1:1 |
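The INA:ACT column can be reproduced directly from the class counts. The short sketch below recomputes the ratio for the training set at both thresholds; the function names `imbalance_ratio` and `ratio_label` are hypothetical and introduced here only for illustration.

```python
def imbalance_ratio(inactive: int, active: int) -> float:
    """Return the number of non-blockers (INA) per blocker (ACT)."""
    return inactive / active


def ratio_label(inactive: int, active: int) -> str:
    """Format the ratio as a rounded 'N:1' label, as in the INA:ACT column."""
    return f"{round(imbalance_ratio(inactive, active))}:1"


# Training-set counts taken from the partitioning table.
ts_pic50_6 = {"INA": 5388, "ACT": 983}
ts_pic50_5 = {"INA": 3295, "ACT": 3076}

for name, counts in [("pIC50 = 6", ts_pic50_6), ("pIC50 = 5", ts_pic50_5)]:
    ratio = imbalance_ratio(counts["INA"], counts["ACT"])
    print(name, round(ratio, 2), ratio_label(counts["INA"], counts["ACT"]))
```

The stricter threshold (pIC50 = 6) leaves roughly five non-blockers per blocker, which is the imbalance the SMOTE step is meant to address, whereas pIC50 = 5 yields nearly balanced classes.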
FIGURE 1. PCA based on the physicochemical properties returned by the compounds belonging to TS, VS, and ES.
FIGURE 2. Flowchart showing the main steps of the adopted computational workflow.
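One step of the workflow named in the abstract is SMOTE, which balances the classes by creating synthetic minority-class points. The record does not say which implementation the authors used, so the following is only a minimal pure-Python sketch of the general idea: each synthetic point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The function name `smote_sketch` and its parameters are assumptions made for this illustration, not the authors' code.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    Illustrative only: a hypothetical helper, not the workflow's implementation.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy descriptor vectors standing in for minority-class (blocker) compounds.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6, k=2, seed=1)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region already occupied by the minority class.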
Performance on the VS of the models developed using pIC50 = 6 (top) and 5 (bottom). For each model, the following statistics are reported: balanced accuracy (BA), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), area under the ROC curve (AUC), and the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs). The top-performing model selected for additional validation is indicated in bold.
| Balancing | Method | BA | SE | SP | MCC | AUC | TP | FP | TN | FN |
|---|---|---|---|---|---|---|---|---|---|---|
| Toxicity threshold pIC50 = 6 | | | | | | | | | | |
| - | … | … | … | … | … | … | … | … | … | … |
| - | GB | 0.82 | 0.68 | 0.96 | 0.68 | 0.94 | 170 | 52 | 1286 | 75 |
| - | KNN | 0.84 | 0.72 | 0.96 | 0.70 | 0.91 | 176 | 51 | 1287 | 69 |
| - | MLP | 0.76 | 0.56 | 0.96 | 0.57 | 0.89 | 136 | 54 | 1284 | 109 |
| - | XGB | 0.84 | 0.73 | 0.95 | 0.69 | 0.94 | 179 | 60 | 1278 | 66 |
| - | SVM | 0.84 | 0.72 | 0.96 | 0.69 | 0.93 | 176 | 58 | 1280 | 69 |
| SMOTE | (S)RF | 0.85 | 0.73 | 0.96 | 0.72 | 0.95 | 178 | 46 | 1292 | 67 |
| SMOTE | (S)GB | 0.83 | 0.72 | 0.95 | 0.66 | 0.82 | 177 | 72 | 1266 | 68 |
| SMOTE | … | … | … | … | … | … | … | … | … | … |
| SMOTE | (S)MLP | 0.81 | 0.86 | 0.76 | 0.48 | 0.91 | 211 | 322 | 1016 | 34 |
| SMOTE | (S)XGB | 0.78 | 0.74 | 0.82 | 0.46 | 0.87 | 181 | 239 | 1099 | 64 |
| SMOTE | … | … | … | … | … | … | … | … | … | … |
| Toxicity threshold pIC50 = 5 | | | | | | | | | | |
| - | … | … | … | … | … | … | … | … | … | … |
| - | … | … | … | … | … | … | … | … | … | … |
| - | KNN | 0.83 | 0.84 | 0.82 | 0.66 | 0.91 | 647 | 151 | 659 | 122 |
| - | MLP | 0.80 | 0.83 | 0.77 | 0.60 | 0.88 | 640 | 192 | 618 | 129 |
| - | XGB | 0.83 | 0.84 | 0.82 | 0.65 | 0.91 | 643 | 150 | 660 | 126 |
| - | … | … | … | … | … | … | … | … | … | … |
| SMOTE | (S)RF | 0.83 | 0.84 | 0.82 | 0.67 | 0.92 | 645 | 143 | 673 | 118 |
| SMOTE | (S)GB | 0.83 | 0.84 | 0.82 | 0.64 | 0.92 | 637 | 147 | 672 | 123 |
| SMOTE | (S)KNN | 0.82 | 0.86 | 0.78 | 0.64 | 0.90 | 652 | 178 | 641 | 108 |
| SMOTE | (S)MLP | 0.77 | 0.84 | 0.69 | 0.65 | 0.85 | 645 | 143 | 665 | 126 |
| SMOTE | (S)XGB | 0.84 | 0.85 | 0.83 | 0.68 | 0.90 | 681 | 136 | 647 | 115 |
| SMOTE | (S)SVM | 0.84 | 0.84 | 0.84 | 0.68 | 0.91 | 650 | 130 | 678 | 121 |
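The threshold-dependent statistics in this table follow from the four confusion-matrix counts by their standard definitions. As a worked check, the sketch below recomputes SE, SP, BA, and MCC for the unbalanced KNN model at pIC50 = 6 (TP = 176, FP = 51, TN = 1287, FN = 69); the function name `classification_metrics` is hypothetical. AUC is omitted because it requires the ranked prediction scores, which the table does not report.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute SE, SP, BA and MCC from confusion-matrix counts."""
    se = tp / (tp + fn)          # sensitivity: fraction of blockers found
    sp = tn / (tn + fp)          # specificity: fraction of non-blockers found
    ba = (se + sp) / 2           # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when a marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SE": se, "SP": sp, "BA": ba, "MCC": mcc}


# KNN, pIC50 = 6, no balancing (counts from the table).
knn = classification_metrics(tp=176, fp=51, tn=1287, fn=69)
print({name: round(value, 2) for name, value in knn.items()})
```

Rounded to two decimals, the recomputed values agree with the tabulated row (SE 0.72, SP 0.96, BA 0.84, MCC 0.70).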
Performance of the consensus models on the VS and on the ES (temporal validation) developed using pIC50 = 6 (top) and 5 (bottom). For each model, the following statistics are reported: balanced accuracy (BA), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), area under the ROC curve (AUC), the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), and the total number of molecules (#). The top-performing models selected for temporal validation are indicated in bold.
| Method | Dataset | BA | SE | SP | MCC | AUC | TP | FP | TN | FN | # |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Toxicity threshold pIC50 = 6 | | | | | | | | | | | |
| (S)SVM+(S)KNN | VS | 0.91 | 0.93 | 0.90 | 0.72 | 0.93 | 207 | 117 | 1048 | 15 | 1387 |
| (S)SVM+(S)KNN | ES | 0.72 | 0.66 | 0.77 | 0.34 | 0.73 | 55 | 107 | 365 | 29 | 556 |
| BRF+(S)SVM | VS | 0.91 | 0.95 | 0.87 | 0.69 | 0.95 | 215 | 153 | 1043 | 11 | 1422 |
| BRF+(S)SVM | ES | 0.71 | 0.60 | 0.81 | 0.33 | 0.72 | 53 | 102 | 454 | 35 | 644 |
| BRF+(S)KNN | VS | 0.91 | 0.94 | 0.88 | 0.68 | 0.95 | 208 | 152 | 1031 | 13 | 1404 |
| BRF+(S)KNN | ES | 0.72 | 0.68 | 0.76 | 0.33 | 0.73 | 53 | 112 | 348 | 25 | 538 |
| Toxicity threshold pIC50 = 5 | | | | | | | | | | | |
| … | … | … | … | … | … | … | … | … | … | … | … |
| … | … | … | … | … | … | … | … | … | … | … | … |
| BRF + GB | VS | 0.87 | 0.88 | 0.86 | 0.74 | 0.93 | 618 | 104 | 622 | 86 | 1430 |
| BRF + GB | ES | 0.70 | 0.66 | 0.74 | 0.41 | 0.73 | 216 | 71 | 212 | 120 | 619 |
| SVM + GB | VS | 0.87 | 0.88 | 0.86 | 0.74 | 0.93 | 614 | 103 | 603 | 78 | 1398 |
| SVM + GB | ES | 0.71 | 0.66 | 0.75 | 0.41 | 0.74 | 223 | 69 | 213 | 116 | 621 |
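The consensus models cover fewer molecules (#) than the sets they are evaluated on, which would be consistent with a rule that keeps a prediction only when the paired classifiers agree. This record does not state the actual consensus rule, so the sketch below is an assumption made purely for illustration: two models vote, agreeing predictions are kept, and disagreements are left unpredicted. The names `consensus_predict` and `coverage` are hypothetical.

```python
from typing import List, Optional


def consensus_predict(pred_a: List[int], pred_b: List[int]) -> List[Optional[int]]:
    """Keep a label only where both models agree; otherwise return None.

    ASSUMPTION: agreement-based consensus; the source does not specify the rule.
    """
    return [a if a == b else None for a, b in zip(pred_a, pred_b)]


def coverage(consensus: List[Optional[int]]) -> float:
    """Fraction of compounds that received a consensus prediction."""
    return sum(label is not None for label in consensus) / len(consensus)


# Toy predictions (1 = blocker, 0 = non-blocker) from two hypothetical models.
model_a = [1, 0, 1, 0]
model_b = [1, 1, 1, 0]

labels = consensus_predict(model_a, model_b)
print(labels, coverage(labels))
```

Under such a rule, statistics are computed only on the compounds with a consensus label, so a gain in accuracy comes at the cost of reduced coverage.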
Comparison of the performance on the ES (temporal validation) of the best-performing model presented in this study (BRF + SVM) and different classifiers available in the literature. The following statistics are reported: balanced accuracy (BA), sensitivity (SE), specificity (SP), Matthews correlation coefficient (MCC), and the total number of molecules (#).
| Statistic | BRF + SVM | OCHEM-I | OCHEM-II | Cardprep | ADMET2.0 | DeepHIT | CardioTox |
|---|---|---|---|---|---|---|---|
| BA | 0.72 | 0.60 | 0.60 | 0.63 | 0.63 | 0.62 | 0.68 |
| SE | 0.67 | 0.24 | 0.24 | 0.66 | 0.89 | 0.80 | 0.70 |
| SP | 0.76 | 0.95 | 0.95 | 0.59 | 0.36 | 0.44 | 0.65 |
| MCC | 0.43 | 0.28 | 0.28 | 0.26 | 0.31 | 0.24 | 0.35 |
| … | … | … | … | … | … | … | … |
FIGURE 3. Comparison of balanced accuracies (BAs) and Matthews correlation coefficients (MCCs) for the selected model on the ES. Blue bars refer to BA, while orange bars refer to MCC.
FIGURE 4. Dialog box to set up the calculation using the KNIME workflow.
FIGURE 5. Output tables returned by the KNIME workflow for the three compounds examined: mibefradil, sertindole, and terfenadine.