Eleni Kotsampasakou, Floriane Montanari, Gerhard F Ecker.
Abstract
Drug-induced liver injury (DILI) is a major issue for both patients and the pharmaceutical industry due to insufficient means of prevention/prediction. In the current work we present a 2-class classification model for DILI, generated with Random Forest and 2D molecular descriptors on a dataset of 966 compounds. In addition, predicted transporter inhibition profiles were included in the models. The initially compiled dataset of 1773 compounds was reduced via a 2-step approach to 966 compounds, resulting in a significant increase (p-value < 0.05) in model performance. The models have been validated via 10-fold cross-validation and against three external test sets of 921, 341 and 96 compounds, respectively. The final model showed an accuracy of 64% (AUC 68%) for 10-fold cross-validation (average of 50 iterations) and comparable values for two test sets (AUC 59%, 71% and 66%, respectively). In the study we also examined whether the predictions of our in-house transporter inhibition models for BSEP, BCRP, P-glycoprotein, and OATP1B1 and 1B3 contributed to improving the DILI model. Finally, the model was implemented with open-source 2D RDKit descriptors in order to be provided to the community as a Python script.
Keywords: 2-class classification; Data curation; Drug-induced liver injury; Liver transporters; Random Forest; Toxicity reports
Year: 2017 PMID: 28652195 PMCID: PMC6422282 DOI: 10.1016/j.tox.2017.06.003
Source DB: PubMed Journal: Toxicology ISSN: 0300-483X Impact factor: 4.221
Classification models for DILI reported in literature. Acc stands for accuracy, Sen for sensitivity, Spec for specificity, BA for balanced accuracy, CV for cross validation, EV for external validation and IV for internal validation.
| Reference | Descriptors | Classification algorithm | Data used | Reported performance |
|---|---|---|---|---|
| | 2D molecular descriptor | Ensemble recursive partitioning | 382 drugs for CV; 54 drugs for EV | CV: 76% Acc; 76% Sen; 75% Spec. EV: 81% Acc; 70% Sen; 90% Spec |
| | Radial distribution function molecular descriptors | Linear discriminant analysis | 74 drugs for CV; 13 drugs for EV | CV: 84% Acc; 78% Sen; 90% Spec. EV: 82% Acc |
| | Molecular descriptors | 4 commercial QSAR programs | ~1600 drugs for CV; 18 drugs for EV | CV: 39% Sen; 87% Spec. EV: 89% Sen |
| | topological | k-nearest neighbor | 37 drugs for EV | 84% Acc; 74% Sen; 94% Spec |
| | 2D fragments and Dragon molecular descriptors | Support vector machine | 531 drugs for CV; 18 compounds for EV | CV: 62–68% Accs. EV: 78% Acc |
| | extended connectivity functional class fingerprints of maximum diameter 6 (ECFC_6) | Linear discriminant analysis | 295 compounds for CV; 237 compounds for EV | CV: 59% Acc; 53% Sen; 65% Spec. EV: 60% Acc; 56% Sen; 67% Spec |
| | PaDEL molecular descriptor | Ensemble of mixed learning | 1087 compounds for CV; 120 compounds for EV | CV: 68% Accs; 67% Sen; 70% Spec. EV: 75% Acc; 82% Sen; 65% Spec |
| | functional class fingerprints (FCFP_6) | Bayesian models | 888 drugs for training; 3 data sets with 40–148 drugs for EV | EV: 60–70% Accs |
| | Mold2 chemical descriptor | Decision Forest | 197 drugs for CV; three data sets for EV | CV: 70% Acc. EV: 62–69% Accs |
| | physicochemical descriptors and fingerprints | Ensemble classifier | 677 compounds for CV | 81% BA; 66% Sen; 95% Spec |
| | ISIDA fragment descriptors | SVM | 424 drugs for CV | 66% BA |
| | Encoding layers based on SMILES, PaDEL descriptors | Deep Learning | 190, 475 & 1065 compounds for CV; 185, 320, 236, 198 & 119 compounds for EV | CV: 70–88% Accs; 70–90% Sens; 70–87% Specs. EV: 62–87% Accs; 62–83% Sens; 62–93% Specs |
| | 2D and 3D physicochemical descriptors | SVM with a genetic algorithm | 3712 compounds for training; 221 compounds for IV; 269 compounds for EV | IV: 75% Acc; 73% AUC |
| | FP4 fingerprints | SVM | 1317 compounds for training; 88 compounds for EV | Training set: 66% Acc; 85% Sen; 34% Spec; 55% AUC. EV: 75% Acc; 93% Sen; 38% Spec; 61% AUC |
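The abbreviations used in the table above follow the usual confusion-matrix definitions. A minimal sketch of how Acc, Sen, Spec and BA are obtained from true/false positive and negative counts is given here; the function name and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return Acc, Sen, Spec and BA from confusion-matrix counts."""
    sen = tp / (tp + fn)                    # sensitivity: fraction of DILI positives recovered
    spec = tn / (tn + fp)                   # specificity: fraction of DILI negatives recovered
    acc = (tp + tn) / (tp + fn + tn + fp)   # accuracy over all compounds
    ba = (sen + spec) / 2                   # balanced accuracy: mean of Sen and Spec
    return {"Acc": acc, "Sen": sen, "Spec": spec, "BA": ba}


# Hypothetical counts, not taken from any of the cited studies.
metrics = classification_metrics(tp=70, fn=30, tn=45, fp=5)
```

Because accuracy depends on the class ratio while balanced accuracy does not, the two can differ noticeably on imbalanced data sets, which is worth keeping in mind when comparing rows that report only one of them.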
Description of the sources upon which the training set was built. In number of compounds, “+” denotes the number of DILI-positive compounds and “−” the number of negative compounds. These numbers correspond to the number of compounds remaining after data curation on a source-by-source basis.
| Source name | Type of data | Number of compounds | Label choice |
|---|---|---|---|
| | | 132 (100+/32−) | “severely” and “moderately” toxic are considered positives |
| | FDA reports database | 382 (75+/307−) | Authors' classification |
| | Text mining | 902 (620+/282−) | Authors' classification |
| | Compilation of published data | 385 (252+/133−) | Authors' classification |
| | Clinical data for hepatotoxicity | 499 (294+/205−) | Authors' classification |
| | FDA-approved labels | 279 (218+/61−) | “most DILI concern” and “less DILI concern” are considered positives |
| | SIDER_2 database | 835 (188+/647−) | Authors' classification |
| | Post-marketing safety data | 1948 (651+/1297−) | Authors' classification, keeping only highest class certainty |
| | LiverTox database | 583 (409+/174−) | “hepatotoxic” and “possible hepatotoxic” are considered positives |
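The "Label choice" column amounts to mapping each source's own annotation onto a binary DILI class. A minimal sketch of such a mapping follows. The function name, the handling of unlisted labels, and the example label "unclassified" are assumptions made for illustration; the negative label used in the example is borrowed from the test-set description in this document, since the table above states only the positive labels.

```python
def binarize(label, positives, negatives):
    """Map a source-specific annotation to 1 (DILI positive), 0 (negative) or None."""
    if label in positives:
        return 1
    if label in negatives:
        return 0
    return None  # assumption: labels in neither set are left unassigned


# Positive labels as stated for the FDA-approved labels source.
fda_positives = {"most DILI concern", "less DILI concern"}
# Assumption: negative label taken from the test-set description.
fda_negatives = {"verified no DILI concern"}

classes = [
    binarize(label, fda_positives, fda_negatives)
    for label in ["most DILI concern", "verified no DILI concern", "unclassified"]
]
```

Returning `None` for unlisted labels keeps ambiguous annotations out of both classes instead of silently treating them as negatives.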
Description of the sources upon which the test set was built. In number of compounds, “+” denotes the number of DILI-positive compounds and “−” the number of negative compounds. These numbers correspond to the number of compounds remaining after data curation on a source-by-source basis.
| Source name | Type of data | Number of compounds | Label choice |
|---|---|---|---|
| | Micromedex reports of adverse reactions | 341 (221+/120−) | Authors' classification |
| | Compilation of public data, data from PharmaPendium and Leadscope | 921 (519+/402−) | Authors' classification |
| | Compilation of public data and LiverTox | 96 (50+/46−) | “most DILI concern” and “less DILI concern” are considered positives, “verified no DILI concern” as negatives |
| Merged | The 3 external datasets were merged and the common compounds with contradictory class labels were removed | 996 (541+/455−) | Maintenance of the class labels of the original external test sets |
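The merging step described in the last row (pool the three external sets, then remove compounds whose class labels disagree) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: compounds are assumed to be matched by a shared identifier (the source does not say how duplicates were detected), and the function name and toy data are hypothetical.

```python
def merge_test_sets(*datasets):
    """Merge {compound_id: label} mappings, dropping compounds with conflicting labels."""
    labels_seen = {}
    for dataset in datasets:
        for compound, label in dataset.items():
            # Collect every label reported for this compound across all sets.
            labels_seen.setdefault(compound, set()).add(label)
    # Keep a compound only if all sets that contain it agree on its label.
    return {c: next(iter(s)) for c, s in labels_seen.items() if len(s) == 1}


# Hypothetical toy data: 1 = DILI positive, 0 = DILI negative.
set_a = {"cpd1": 1, "cpd2": 0, "cpd3": 1}
set_b = {"cpd2": 0, "cpd3": 0, "cpd4": 1}
set_c = {"cpd4": 1, "cpd5": 0}

merged = merge_test_sets(set_a, set_b, set_c)  # cpd3 is dropped (labelled 1 and 0)
```

Compounds that occur in several sets with the same label are kept once, so the merged set is smaller than the sum of the individual sets, as in the table above.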
Chart 1. Overlap of DILI positives and negatives across different numbers of sources.
Statistical performance of the final Random Forest (100 trees) models: A) using all 2D MOE descriptors and transporter predictions (DILI_MOE_transp_RF model), B) using only the 2D MOE descriptors (DILI_MOE_RF model), and C) the open-source model (DILI_RDKit_RF100).
| | Accuracy | Sensitivity | Specificity | AUC | Precision |
|---|---|---|---|---|---|
| A) DILI_MOE_transp_RF100 | | | | | |
| 10-fold CV (average ± standard deviation for 50 iterations) | 0.65 ± 0.01 | 0.68 ± 0.01 | 0.61 ± 0.01 | 0.69 ± 0.01 | 0.65 ± 0.01 |
| Mulliner 921 cpds | 0.57 | 0.63 | 0.50 | 0.59 | 0.62 |
| Liew 341 cpds | 0.67 | 0.72 | 0.56 | 0.71 | 0.75 |
| Chen 96 cpds | 0.59 | 0.54 | 0.65 | 0.61 | 0.63 |
| Merged test set 966 cpds | 0.59 | 0.68 | 0.50 | 0.62 | 0.62 |
| B) DILI_MOE_RF100 | | | | | |
| 10-fold CV (average ± standard deviation for 50 iterations) | 0.65 ± 0.01 | 0.68 ± 0.01 | 0.61 ± 0.01 | 0.69 ± 0.01 | 0.65 ± 0.01 |
| Mulliner 921 cpds | 0.58 | 0.60 | 0.55 | 0.59 | 0.63 |
| Liew 341 cpds | 0.68 | 0.68 | 0.67 | 0.71 | 0.79 |
| Chen 96 cpds | 0.63 | 0.56 | 0.70 | 0.66 | 0.67 |
| Merged test set 966 cpds | 0.60 | 0.64 | 0.56 | 0.62 | 0.63 |
| C) DILI_RDKit_RF100 | | | | | |
| 10-fold CV (average ± standard deviation for 50 iterations) | 0.64 ± 0.01 | 0.70 ± 0.01 | 0.57 ± 0.01 | 0.69 ± 0.01 | 0.63 ± 0.01 |
| Mulliner 921 cpds | 0.60 | 0.64 | 0.54 | 0.62 | 0.64 |
| Liew 332 cpds | 0.67 | 0.72 | 0.56 | 0.71 | 0.72 |
| Chen 95 cpds | 0.64 | 0.64 | 0.64 | 0.73 | 0.64 |
| Merged test set 966 cpds | 0.60 | 0.67 | 0.52 | 0.64 | 0.63 |
Notes: The number of compounds for the external datasets is slightly different for the predictions on model C because for some compounds (peptides), some descriptor values computed by RDKit were too large to be handled by the machine learning algorithm.
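One way to screen out compounds like those mentioned in the note is to check each descriptor vector for values that are non-finite or exceed the numeric range of the learner. The sketch below assumes a single-precision (float32) limit, which is what some machine learning libraries use internally; the note does not state the actual limit or how the affected compounds were identified, and the function names and example values are hypothetical.

```python
import math

# Largest finite IEEE 754 single-precision value (assumed limit of the learner).
FLOAT32_MAX = (2 - 2**-23) * 2.0**127


def is_usable(descriptors, limit=FLOAT32_MAX):
    """True if every descriptor value is finite and within +/- limit."""
    return all(math.isfinite(v) and abs(v) <= limit for v in descriptors)


def drop_unusable(rows):
    """Split {compound: descriptor list} into kept rows and dropped compound names."""
    kept = {name: values for name, values in rows.items() if is_usable(values)}
    dropped = sorted(set(rows) - set(kept))
    return kept, dropped


# Hypothetical descriptor vectors; the values are made up for illustration.
rows = {
    "small_molecule": [180.16, 3.0, 1.2],
    "peptide": [1500.0, 4.2e40, 0.7],        # one value exceeds the float32 range
    "failed_calc": [float("nan"), 1.0, 2.0],  # descriptor calculation returned NaN
}
kept, dropped = drop_unusable(rows)
```

Reporting the dropped compounds alongside the kept ones makes it easy to state the reduced test-set sizes, as done for model C in the table.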