Yangyang Wang, Qingxin Xiao, Peng Chen, Bing Wang.
Abstract
Drug-induced liver injury (DILI) is a major concern in drug development and drug safety. If DILI cannot be effectively predicted during development, the drug may later have to be withdrawn from the market. Predicting DILI is therefore crucial at the early stages of drug research. This work presents a 2-class ensemble classifier model for predicting DILI, built from 2D molecular descriptors and fingerprints on a dataset of 450 compounds. The purpose of our study is to identify the key molecular fingerprints associated with DILI risk, and then to obtain a reliable ensemble model that predicts DILI risk from these key factors. Experimental results suggested that 8 molecular fingerprints are critical for predicting DILI, and also yielded the best ratio of molecular fingerprints to molecular descriptors. In 5-fold cross-validation, the ensemble vote classifier obtained an accuracy of 77.25%, and the accuracy on the test set was 81.67%. This model could be used for drug-induced liver injury prediction.
Keywords: Drug-induced liver injury; ensemble classifier; molecular fingerprints; quantitative structure–activity relationship (QSAR)
Year: 2019 PMID: 31443562 PMCID: PMC6747689 DOI: 10.3390/ijms20174106
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. Concept map of the drug-induced liver injury (DILI) modeling process.
Performance Comparison of Base Classifiers on the Whole Training Dataset.
| No | Descriptor | LR | SVM | GBDT | AdaBT | XGBT | RF | ExtraTrees | LGBT | CatBT |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | AP2DFP | | 0.6978 | | | | 0.7067 | 0.6944 | 0.6911 | |
| 2 | EstateFP | 0.7078 | 0.7044 | | 0.6811 | | | 0.7211 | | |
| 3 | ExtendedFP | | 0.7322 | 0.7511 | 0.7355 | | | 0.7333 | | |
| 4 | FP | 0.7133 | 0.6878 | | 0.7111 | | | 0.7056 | | |
| 5 | GraphOnlyFP | 0.7067 | 0.6689 | | 0.7056 | | | 0.7089 | | |
| 6 | KRFP | 0.7500 | 0.7344 | 0.7522 | 0.7211 | | | | | |
| 7 | MaccsFP | 0.7300 | 0.7045 | 0.7389 | 0.7256 | | | | | |
| 8 | nAP2DFP | 0.6933 | 0.6822 | | 0.6889 | | | | | 0.7033 |
| 9 | nKRFP | 0.7522 | 0.7056 | | 0.7356 | 0.7544 | | | | |
| 10 | nSubstructureFP | 0.7111 | 0.7033 | | 0.7111 | | | | 0.7289 | |
| 11 | PubchemFP | 0.7278 | 0.6956 | | 0.7100 | | | 0.7167 | | |
| 12 | SubstructureFP | 0.7300 | 0.7267 | | | 0.7244 | | | 0.7189 | |
| | Number (Top 5) | | | | | | | | | |
The bold numbers in each row denote the top 5 classifiers for the given fingerprint descriptor.
Sorted Average Accuracies of the Top 5 Classifiers with Respect to Different Fingerprints.
| No | Fingerprint | Average Accuracy |
|---|---|---|
| | ExtendedFP | 0.7693 |
| | KRFP | 0.7662 |
| | MaccsFP | 0.7600 |
| | nKRFP | 0.7598 |
| | FP | 0.7531 |
| | nSubstructureFP | 0.7529 |
| | SubstructureFP | 0.7493 |
| | PubchemFP | 0.7464 |
| | EstateFP | 0.7311 |
| | GraphOnlyFP | 0.7230 |
| | AP2DFP | 0.7160 |
| | nAP2DFP | 0.7067 |
Figure 2. Selection of the top n molecular fingerprints from the 12 molecular fingerprints by the top 5 classifiers.
Figure 3. Selection of the best weight ratio of fingerprints to molecular descriptors. Abbreviations: FPs, fingerprints; Mols, molecular descriptors.
Performance Comparison of Base Classifiers on the Test Dataset.
| Algorithms/Fingerprints | LR | SVC | GBDT | ADB | XGB | Random Forest | Extra Trees | LGB | CatB |
|---|---|---|---|---|---|---|---|---|---|
| AP2DFP | 0.7000 | 0.6600 | 0.6200 | 0.5800 | 0.6800 | 0.6640 | 0.6480 | 0.6800 | 0.6360 |
| EstateFP | 0.6600 | 0.6800 | 0.7000 | 0.6800 | 0.7000 | 0.6880 | 0.7200 | 0.7000 | 0.7160 |
| ExtendedFP | 0.7800 | 0.7400 | 0.7000 | 0.7600 | 0.7400 | 0.7480 | 0.7360 | 0.6600 | 0.7800 |
| FP | 0.6600 | 0.7000 | 0.7440 | 0.6600 | 0.7000 | 0.7240 | 0.6760 | 0.7400 | 0.7320 |
| GraphOnlyFP | 0.6000 | 0.6000 | 0.6720 | 0.6400 | 0.7200 | 0.6840 | 0.6720 | 0.7200 | 0.6960 |
| KRFP | 0.7200 | 0.6400 | 0.7240 | 0.6600 | 0.7000 | 0.7840 | 0.7600 | 0.7400 | 0.7520 |
| MaccsFP | 0.7600 | 0.7200 | 0.7200 | 0.7000 | 0.7200 | 0.7520 | 0.7040 | 0.7200 | 0.7360 |
| nAP2DFP | 0.6600 | 0.6200 | 0.6360 | 0.6000 | 0.6600 | 0.7040 | 0.6600 | 0.6400 | 0.6880 |
| nKRFP | 0.6600 | 0.6200 | 0.7160 | 0.7400 | 0.7200 | 0.7520 | 0.7480 | 0.7000 | 0.7320 |
| nSubstructureFP | 0.7200 | 0.6400 | 0.6400 | 0.5400 | 0.6400 | 0.5920 | 0.6280 | 0.6200 | 0.6200 |
| PubchemFP | 0.7200 | 0.7600 | 0.7040 | 0.6200 | 0.7400 | 0.7760 | 0.7360 | 0.6600 | 0.7480 |
| SubstructureFP | 0.7400 | 0.7600 | 0.7200 | 0.7200 | 0.7000 | 0.7360 | 0.7280 | 0.6800 | 0.7440 |
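The top-5 selection and averaging used to rank fingerprints can be illustrated on a few rows of the test-set table above. This is only a minimal sketch, not the authors' code: the helper names `top_k` and `top_k_mean` are hypothetical, and the published ranking was computed from training-set cross-validation accuracies, not from these test-set values.

```python
# Test-set accuracies copied from three rows of the table above.
CLASSIFIERS = ["LR", "SVC", "GBDT", "ADB", "XGB",
               "Random Forest", "Extra Trees", "LGB", "CatB"]
TEST_ACCURACY = {
    "ExtendedFP": [0.7800, 0.7400, 0.7000, 0.7600, 0.7400, 0.7480, 0.7360, 0.6600, 0.7800],
    "KRFP":       [0.7200, 0.6400, 0.7240, 0.6600, 0.7000, 0.7840, 0.7600, 0.7400, 0.7520],
    "MaccsFP":    [0.7600, 0.7200, 0.7200, 0.7000, 0.7200, 0.7520, 0.7040, 0.7200, 0.7360],
}


def top_k(accuracies, k=5):
    """Return the k (classifier, accuracy) pairs with the highest accuracy."""
    ranked = sorted(zip(CLASSIFIERS, accuracies), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def top_k_mean(accuracies, k=5):
    """Average accuracy of the k best classifiers for one fingerprint."""
    best = top_k(accuracies, k)
    return sum(acc for _, acc in best) / len(best)


# Rank fingerprints by the mean accuracy of their five best classifiers.
ranking = sorted(TEST_ACCURACY, key=lambda fp: top_k_mean(TEST_ACCURACY[fp]), reverse=True)

for fp in ranking:
    print(fp, round(top_k_mean(TEST_ACCURACY[fp]), 4))
```

On these three rows the order is ExtendedFP, KRFP, MaccsFP, which happens to match the head of the sorted training-set table, although the averages themselves differ.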
Performance Comparison of Several Hepatotoxicity Prediction Models.
| Model Name | No. of Compounds | Test Method | Q (%) | SE (%) | SP (%) | AUC(%) |
|---|---|---|---|---|---|---|
| Bayesian | 295 | 10-fold CV×100 | 58.5 | 52.8 | 65.5 | 62.0 |
| Decision Forest | 197 | 10-fold CV×2000 | 69.7 | 57.8 | 77.9 | – |
| Naive Bayesian | 420 | Test set | 72.6 | 72.5 | 72.7 | – |
| Our Method | 450 | 5-fold CV×1000 | 77.25 | 64.38 | 85.83 | 75.10 |
| Our Method | 450 | Test set | 81.67 | 64.55 | 96.15 | 80.35 |
Abbreviations: Q: accuracy; SE: sensitivity; SP: specificity; AUC: area under the curve.
Performance Comparison of Previous Models.
| Model Name | No. of Compounds | Test Method | Q (%) | SE (%) | SP (%) | AUC (%) | MCC (%) |
|---|---|---|---|---|---|---|---|
| Decision Forest | 451 | 5-fold CV | 72.9 | 62.8 | 79.8 | – | 51.4 |
| Our Method | 450 | 5-fold CV | 76.9 | 62.2 | 87.0 | 74.6 | 43.2 |
Abbreviations: Q, accuracy; SE, sensitivity; SP, specificity; AUC, area under the curve; MCC, Matthews correlation coefficient.
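For reference, Q, SE, SP and MCC can all be derived from a binary confusion matrix using their standard definitions. The sketch below assumes DILI-positive compounds are the positive class; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the study. AUC is omitted because it requires ranked scores, not just counts.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positives predicted correctly/incorrectly
    tn/fp: negatives predicted correctly/incorrectly
    """
    total = tp + fn + tn + fp
    q = (tp + tn) / total   # accuracy (Q)
    se = tp / (tp + fn)     # sensitivity (SE)
    sp = tn / (tn + fp)     # specificity (SP)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Q": q, "SE": se, "SP": sp, "MCC": mcc}


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=40, fn=10, tn=45, fp=5)
print({name: round(value, 4) for name, value in example.items()})
```

The tables report these quantities as percentages, so the returned fractions would be multiplied by 100 for comparison.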
Summary of the 12 Types of Molecular Fingerprints.
| Fingerprint Type | Abbreviation | Pattern Type | Size (bits) |
|---|---|---|---|
| CDK | FP | Hash fingerprints | 1024 |
| CDK Extended | ExtendedFP | Hash fingerprints | 1024 |
| CDK GraphOnly | GraphOnlyFP | Hash fingerprints | 1024 |
| Estate | EstateFP | Structural features | 79 |
| MACCS | MaccsFP | Structural features | 166 |
| Pubchem | PubchemFP | Structural features | 881 |
| Substructure | Substructure | Structural features | 307 |
| Substructure Count | nSubstructure | Structural features count | 307 |
| Klekota-Roth | KRFP | Structural features | 4860 |
| Klekota-Roth Count | nKRFP | Structural features count | 4860 |
| 2D Atom Pairs | AP2D | Structural features | 780 |
| 2D Atom Pairs Count | nAP2DC | Structural features count | 780 |
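To give a feel for the dimensionality involved, the sizes in the table above can be tallied directly. The sketch below assumes that selected fingerprints are concatenated side by side into one feature vector, which this record does not state; `feature_offsets` is a hypothetical helper that shows where each block would sit under that assumption.

```python
# Bit lengths copied from the fingerprint summary table above.
FINGERPRINT_BITS = {
    "FP": 1024, "ExtendedFP": 1024, "GraphOnlyFP": 1024,
    "EstateFP": 79, "MaccsFP": 166, "PubchemFP": 881,
    "Substructure": 307, "nSubstructure": 307,
    "KRFP": 4860, "nKRFP": 4860,
    "AP2D": 780, "nAP2DC": 780,
}


def feature_offsets(names):
    """Map each fingerprint to its (start, end) column range when concatenated.

    Assumes plain side-by-side concatenation (an illustration, not the paper's
    documented procedure). Returns the mapping and the total width.
    """
    offsets = {}
    start = 0
    for name in names:
        size = FINGERPRINT_BITS[name]
        offsets[name] = (start, start + size)
        start += size
    return offsets, start


total_bits = sum(FINGERPRINT_BITS.values())
print(total_bits)  # 16092 bits across all 12 fingerprints

offsets, width = feature_offsets(["MaccsFP", "EstateFP"])
print(offsets, width)
```

The two Klekota-Roth fingerprints alone account for 9,720 of the 16,092 bits, which is one reason a fingerprint-selection step is useful on a dataset of 450 compounds.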
Figure 4. Flowchart of the ensemble classifier system with top 5 classifiers and top 8 fingerprint filters.
Figure 5. Flowchart of the ensemble model.
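The voting step of such an ensemble can be sketched in a few lines. This record does not say whether the ensemble vote classifier uses hard (majority) or soft (probability-averaged) voting, so the example below assumes hard voting over five base classifiers; `majority_vote` and the prediction values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per compound.

    predictions: one list of labels per base classifier, all of equal length.
    With binary labels and an odd number of classifiers there are no ties;
    otherwise the label encountered first among the tied ones wins.
    """
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all classifiers must predict the same compounds")
    voted = []
    for labels in zip(*predictions):  # labels for one compound
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions from 5 base classifiers on 4 compounds
# (1 = DILI-positive, 0 = DILI-negative).
base_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
print(majority_vote(base_predictions))  # [1, 0, 1, 0]
```

Using an odd number of base classifiers, as with the top 5 here, avoids ties in the two-class setting.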