| Literature DB >> 33330410 |
Ting Li, Weida Tong, Ruth Roberts, Zhichao Liu, Shraddha Thakkar.
Abstract
Drug-induced liver injury (DILI) is one of the most frequently cited reasons for the high drug attrition rate and for drug withdrawals from the market. The large accumulation of high-throughput transcriptomic profiles and advances in deep learning provide an unprecedented opportunity to improve the suboptimal performance of DILI prediction. In this study, we developed an eight-layer Deep Neural Network (DNN) model for DILI prediction using transcriptomic profiles of human cell lines (the LINCS L1000 dataset) together with the largest available binary DILI annotation dataset, DILI severity and toxicity (DILIst). The developed models were evaluated by Monte Carlo cross-validation (MCCV), a permutation test, and an independent validation (IV) set. The developed DNN model achieved an area under the receiver operating characteristic curve (AUC) of 0.802 and 0.798, and a balanced accuracy of 0.741 and 0.721, for the training and IV sets, respectively, outperforming conventional machine learning algorithms, including K-nearest neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). Moreover, the developed DNN model provided a more balanced sensitivity of 0.839 and specificity of 0.603. The DNN model also showed superior predictive performance for oncology drugs, and functional and network analysis of the genes driving its predictions revealed their relevance to the underlying mechanisms of DILI. The proposed DNN model could be a promising tool for early detection of DILI potential in the pre-clinical setting.
Keywords: DILI; deep learning–artificial neural network; high throughput transcriptomics; machine learning; risk assessment; toxicity prediction model
Year: 2020 PMID: 33330410 PMCID: PMC7728858 DOI: 10.3389/fbioe.2020.562677
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
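For orientation, the sketch below shows what an eight-layer feed-forward DNN of the kind described in the abstract could look like in Keras, taking the 978 LINCS L1000 landmark genes as input. The layer widths and training settings are illustrative assumptions, not the authors' published configuration; only the ELU activation and Adam optimizer follow the hyperparameter choice reported in Figure 2.

```python
# A minimal Keras sketch of an eight-layer feed-forward DNN for binary DILI
# classification from LINCS L1000 profiles (978 landmark genes). Layer widths
# are illustrative assumptions, not the published architecture; the ELU
# activation and Adam optimizer follow the choice reported in Figure 2.
from tensorflow import keras
from tensorflow.keras import layers

def build_dili_dnn(n_genes: int = 978) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(n_genes,)),
        layers.Dense(1024, activation="elu"),
        layers.Dense(512, activation="elu"),
        layers.Dense(256, activation="elu"),
        layers.Dense(128, activation="elu"),
        layers.Dense(64, activation="elu"),
        layers.Dense(32, activation="elu"),
        layers.Dense(16, activation="elu"),
        layers.Dense(1, activation="sigmoid"),  # probability of DILI-positive
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC(name="auc")])
    return model

# Example usage (X_train / y_train would come from the L1000 / DILIst data):
# model = build_dili_dnn()
# model.fit(X_train, y_train, epochs=100, batch_size=128, validation_split=0.1)
```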
FIGURE 1 Overall workflow for model development: (1) 6,000 treatments were split into training (80%) and independent validation (20%) sets using stratified sampling. (2) A 100-iteration Monte Carlo cross-validation (MCCV) was carried out for hyperparameter optimization of the four algorithms: Deep Neural Network (DNN), K-nearest neighbors (KNN), Support Vector Machine (SVM), and Random Forest (RF). (3) The optimized models were used to predict the independent validation set. (4) The DNN model was further evaluated with a "Y-scrambling"-based permutation test.
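A minimal scikit-learn sketch of the splitting scheme in Figure 1 is given below, using synthetic placeholders for the treatment-level L1000 profiles (X) and binary DILIst labels (y): a stratified 80/20 train/IV split followed by a 100-iteration MCCV on the training portion.

```python
# Sketch of the splitting scheme in Figure 1: a stratified 80/20 split into
# training and independent validation (IV) sets, then a 100-iteration Monte
# Carlo cross-validation (MCCV) on the training portion. X and y below are
# synthetic placeholders, not the study's data.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((6000, 978))   # placeholder: ~6,000 treatments x 978 landmark genes
y = rng.integers(0, 2, size=6000)      # placeholder binary DILI labels

# (1) Stratified 80/20 split into training and IV sets.
X_train, X_iv, y_train, y_iv = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

# (2) 100-iteration MCCV on the training set for hyperparameter optimization.
mccv = StratifiedShuffleSplit(n_splits=100, test_size=0.20, random_state=42)
for fit_idx, val_idx in mccv.split(X_train, y_train):
    X_fit, X_val = X_train[fit_idx], X_train[val_idx]
    y_fit, y_val = y_train[fit_idx], y_train[val_idx]
    # ... fit a candidate model on (X_fit, y_fit) and score it on (X_val, y_val)
```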
FIGURE 2 Hyperparameter optimization of the DNN models: we investigated a total of 16 hyperparameter combinations of four activation functions (i.e., Tanh, ReLU, SELU, and ELU) and four optimizers (i.e., Adam, Adadelta, RMSProp, and SGD). Each hyperparameter combination was evaluated with a 100-iteration MCCV, and the mean and standard deviation of the resulting scores are shown for each combination. The combination of ELU and Adam was chosen for developing the optimized DNN model.
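The grid in Figure 2 amounts to the 16 pairings of four activation functions and four optimizers; the sketch below is schematic, with evaluate_by_mccv as a hypothetical placeholder (not a function from the study's code) for training and scoring one DNN configuration over the MCCV splits.

```python
# Schematic sketch of the 16-combination grid in Figure 2: every pairing of the
# four activation functions and four optimizers is scored by MCCV and the best
# pairing is retained (ELU + Adam in the paper).
from itertools import product

ACTIVATIONS = ["tanh", "relu", "selu", "elu"]
OPTIMIZERS = ["adam", "adadelta", "rmsprop", "sgd"]

def evaluate_by_mccv(activation: str, optimizer: str) -> float:
    """Placeholder: train one DNN configuration over the 100 MCCV splits and
    return its mean validation score. The real training loop is not shown."""
    return 0.0

scores = {(act, opt): evaluate_by_mccv(act, opt)
          for act, opt in product(ACTIVATIONS, OPTIMIZERS)}   # 4 x 4 = 16 combinations
best_activation, best_optimizer = max(scores, key=scores.get)
```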
FIGURE 3 Performance comparison based on MCCV results: the distributions of AUC scores from the 100-iteration MCCV of DNN, SVM, RF, and KNN are plotted as violin plots with embedded box plots. DNN models falling outside 1.5× the interquartile range are highlighted as black dots.
Performance of the four classifiers with optimized hyperparameters (a sketch of how these metrics can be computed follows the table).
| Dataset | Classifiers | AUC | MCC | F1 | Cohen's kappa | Balanced Accuracy | Accuracy | Sensitivity | Specificity |
| Training | DNN | 0.802 | 0.497 | 0.809 | | 0.741 | 0.761 | 0.851 | 0.630 |
| | KNN | 0.762 | 0.441 | 0.789 | 0.436 | 0.713 | 0.735 | 0.834 | 0.591 |
| | SVM | 0.778 | 0.478 | 0.805 | 0.472 | 0.729 | 0.753 | 0.856 | 0.602 |
| | RF | 0.771 | 0.549 | 0.837 | 0.491 | 0.727 | 0.774 | 0.977 | 0.476 |
| IV | DNN | 0.798 | 0.458 | 0.795 | | 0.721 | 0.743 | 0.839 | 0.603 |
| | KNN | 0.764 | 0.409 | 0.778 | 0.405 | 0.698 | 0.721 | 0.821 | 0.574 |
| | SVM | 0.777 | 0.455 | 0.804 | 0.438 | 0.709 | 0.743 | 0.888 | 0.529 |
| | RF | 0.747 | 0.502 | 0.824 | 0.436 | 0.700 | 0.752 | 0.975 | 0.424 |
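The eight metrics reported in the table map directly onto standard scikit-learn functions; the sketch below illustrates the computations on synthetic placeholder labels and probabilities rather than the study's data.

```python
# How the eight reported metrics can be computed with scikit-learn from true
# labels, hard predictions, and predicted probabilities. y_true / y_prob below
# are synthetic placeholders, not data from the study.
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, confusion_matrix, f1_score,
                             matthews_corrcoef, recall_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)   # placeholder DILIst labels
y_prob = rng.random(200)                # placeholder model output probabilities
y_pred = (y_prob >= 0.5).astype(int)    # default 0.5 decision threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "AUC":               roc_auc_score(y_true, y_prob),
    "MCC":               matthews_corrcoef(y_true, y_pred),
    "F1":                f1_score(y_true, y_pred),
    "Cohen's kappa":     cohen_kappa_score(y_true, y_pred),
    "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    "Accuracy":          accuracy_score(y_true, y_pred),
    "Sensitivity":       recall_score(y_true, y_pred),   # TP / (TP + FN)
    "Specificity":       tn / (tn + fp),                 # TN / (TN + FP)
}
```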
FIGURE 4 Absolute differences, |MCCV − IV|, of the performance metrics: the absolute differences of the eight performance metrics between the 100-iteration MCCV results and the independent validation (IV) set were calculated for each of the four classifiers.
FIGURE 5 Permutation tests for the developed DNN model: the distributions of the eight performance metrics from the 100-iteration MCCV are shown. Distributions for DNN models trained on the real training data and on Y-scrambled training data are shown in orange and gray, respectively.
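As a rough illustration of the Y-scrambling idea behind Figure 5, the sketch below compares MCCV performance on real versus label-shuffled data, using synthetic data and a logistic regression as a lightweight stand-in for the DNN.

```python
# Sketch of the "Y-scrambling" permutation test illustrated in Figure 5: the
# labels are randomly shuffled so that any apparent performance reflects chance
# alone. Synthetic data and a logistic regression stand in for the L1000
# profiles and the DNN; the paper used 100 MCCV iterations (reduced here for speed).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 978))   # placeholder expression matrix (978 landmark genes)
y = rng.integers(0, 2, size=500)      # placeholder binary DILIst labels

def mccv_auc(X, y, n_splits=20):
    """Mean AUC over an n_splits Monte Carlo cross-validation."""
    cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=0)
    aucs = []
    for tr, te in cv.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
    return float(np.mean(aucs))

real_auc = mccv_auc(X, y)                    # performance with the real labels
null_auc = mccv_auc(X, rng.permutation(y))   # performance with Y-scrambled labels
print(f"real AUC: {real_auc:.3f}   Y-scrambled AUC: {null_auc:.3f}")
```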
FIGURE 6 Model performance for individual therapeutic classes according to the first level of the WHO ATC codes.
Ingenuity Pathway Analysis (IPA) based on the 200 high-frequency genes from correctly predicted transcriptomic profiles.
| IPA modules | # Genes | Genes | p-value |
| GADD45 Signaling | 10 | CCNB1, CCND1, CCND3, CDK1, CDK4, CDKN1A, GADD45A, GADD45B, PCNA, TP53 | 2.13E-14 |
| Cell Cycle: G2/M DNA Damage Checkpoint Regulation | 9 | CCNB1, CCNB2, CDK1, CDKN1A, GADD45A, PLK1, RPS6KA1, TOP2A, TP53 | 8.66E-09 |
| Estrogen-mediated S-phase Entry | 7 | CCND1, CDC25A, CDK1, CDK4, CDKN1A, CDKN1B, MYC | 1.16E-08 |
| Cyclins and Cell Cycle Regulation | 10 | CCNB1, CCNB2, CCND1, CCND3, CDC25A, CDK1, CDK4, CDKN1A, CDKN1B, TP53 | 2.12E-08 |
| ATM Signaling | 10 | ATF1, CCNB1, CCNB2, CDC25A, CDK1, CDKN1A, GADD45A, GADD45B, NFKBIA, TP53 | 2.94E-08 |
| Liver Cancer | 46 | ATF5, BIRC5, C2CD5, CANT1, CCDC92, CCNB1, CCNB2, CCND3, CDC45, CGRRF1, CNDP2, CYTH1, DDIT4, DUSP4, GRB10, HMGCS1, HMOX1, HSPA8, HSPD1, IFRD2, IGFBP3, INPP1, INSIG1, JADE2, KEAP1, KIF14, LBR, LGMN, LOXL1, LSM5, MYC, NPC1, NR2F6, NRIP1, POLR2I, RELB, SPR, STXBP1, TIAM1, TIPARP, TLE1, TOP2A, TP53, USP1, WASHC4, WDTC1 | 2.26E-01 |
| Hepatoblastoma | 1 | TP53 | 3.06E-02 |
| Nucleus | 55 | TOP2A, KDM5B, CDKN1B, FHL2, GABPB1, PRSS23, HOXA10, RGS2, ZFP36, CCND3, TUBB6, CCND1, MYC, RPS6KA1, IER3, TSC22D3, IGFBP3, NOSIP, CDC25A, UGDH, NPEPL1, MELK, BIRC5, ATF5, TP53, RBKS, NOTCH1, PCNA, CDCA4, GLRX, ADRB2, FOXO4, RELB, DNAJB1, CCNB2, CCNB1, NFIL3, USP1, HMOX1, SCAND1, WDTC1, SPDEF, HSPA8, XBP1, SPAG7, GADD45B, UBE2C, TIPARP, SORBS3, PSMB8, NR2F6, FOSL1, NFKBIA, NET1, POLE2 | 1.96E-03 |
| Nucleolus | 20 | TOP2A, TCERG1, TLE1, MYO10, PARP2, PLK1, PWP1, ACAT2, JMJD6, CDK4, MYC, NRIP1, RRS1, NUSAP1, TIMELESS, CCDC86, RBM34, POLR2I, RAE1, TP53 | 6.81E-03 |
| Nucleoplasm | 33 | TOP2A, ATF1, RNH1, PCNA, STXBP1, KEAP1, FOXO4, TSEN2, SMC4, CNDP2, GNAI2, RELB, CDC20, CCND1, PUF60, SPR, MYC, EPB41L2, HMG20B, CBR3, PARP2, GADD45A, CGRRF1, CDC25B, JMJD6, CCNA2, UGDH, PLCB3, TNIP1, STUB1, ATF5, DLD, RAD9A | 1.01E-02 |
| Midbody | 7 | PLK1, CDK1, KIF14, KEAP1, BIRC5, KIF20A, GNAI2 | 2.27E-02 |
| Cytoplasm | 52 | KDM5B, RNH1, CASC3, TSEN2, SMC4, CNDP2, HOXA10, CDC20, PCMT1, RGS2, CCND3, TUBB6, PNP, CCND1, SPR, KIF5C, RPS6KA1, NUSAP1, PARP2, TSC22D3, NOSIP, KCTD5, CDC25A, CDC25B, NPEPL1, MELK, BIRC5, ATF5, TP53, RBKS, PCNA, INPP1, PRUNE1, HSPD1, GNAI2, EPB41L2, CYTH1, UBQLN2, MYO10, GADD45B, UBE2C, GADD45A, PLK1, SORBS3, PSMB8, MLLT11, EIF5, GNPDA1, TNIP1, BAMBI, CDK1, RAD9A | 2.27E-02 |
| Spindle Microtubule | 5 | PLK1, NUSAP1, CDK1, BIRC5, AURKA | 2.27E-02 |
| Cyclin-Dependent Protein Kinase Holoenzyme Complex | 4 | CCND3, CDKN1A, CCND1, CDK4 | 3.36E-02 |
| Positive Regulation Of Apoptotic Process | 10 | MLLT11, TOP2A, NET1, NOTCH1, MELK, GADD45B, GADD45A, IGFBP3, HMOX1, SPDEF | 2.75E-02 |
FIGURE 7 Cytoscape network analysis of protein–protein interactions (PPIs): 482 high-confidence PPIs were extracted from the STRING database (version 11.0) based on the 200 high-frequency genes derived from the DNN model. Panels (A,B) show the top two subnetworks of the PPI network obtained with the MCODE plug-in for Cytoscape. Hepatotoxicity-related genes are highlighted in green. Node size is proportional to gene frequency.
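One possible way to reproduce the first step of this analysis outside Cytoscape is sketched below: querying the STRING REST API for high-confidence interactions among a gene list and mining a dense subnetwork with networkx as a crude stand-in for MCODE. The endpoint, parameters, and column names follow STRING's public API documentation but should be verified against the current version.

```python
# Sketch of retrieving high-confidence PPIs for a gene list from the STRING REST
# API and extracting a dense subnetwork with networkx, as a rough stand-in for
# the STRING + MCODE/Cytoscape workflow described above.
import networkx as nx
import requests

genes = ["TP53", "CDK1", "CCNB1", "CCNB2", "CCND1", "PLK1", "PCNA"]  # small example subset

resp = requests.get(
    "https://string-db.org/api/tsv/network",
    params={
        "identifiers": "\r".join(genes),  # STRING expects %0d-separated identifiers
        "species": 9606,                  # Homo sapiens
        "required_score": 700,            # "high confidence" on STRING's 0-1000 scale
    },
    timeout=30,
)
resp.raise_for_status()

# Build an undirected PPI graph from the returned edge table.
lines = resp.text.strip().splitlines()
header = lines[0].split("\t")
a, b = header.index("preferredName_A"), header.index("preferredName_B")
G = nx.Graph()
for row in lines[1:]:
    cols = row.split("\t")
    G.add_edge(cols[a], cols[b])

# Densest connected component as a crude proxy for an MCODE-style cluster.
densest = max(nx.connected_components(G), key=lambda c: nx.density(G.subgraph(c)))
print(sorted(densest))
```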
FIGURE 8 Performance of the proposed DNN model under two different sampling strategies: (A) balanced data sampling; (B) drug-based data splitting.
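The two strategies in Figure 8 can be sketched as follows, using synthetic placeholders: (A) class balancing by under-sampling the majority class, and (B) a drug-based split in which all profiles of a compound stay on one side of the train/validation boundary.

```python
# Sketch of the two alternative sampling strategies in Figure 8, on synthetic
# placeholders: (A) balancing DILI-positive and DILI-negative treatments by
# under-sampling the majority class, and (B) drug-based splitting, which keeps
# all profiles of the same compound on one side of the split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n = 600
X = rng.standard_normal((n, 978))     # placeholder expression profiles
y = rng.integers(0, 2, size=n)        # placeholder DILI labels
drugs = rng.integers(0, 150, size=n)  # placeholder compound identifiers

# (A) Balanced sampling: randomly under-sample the majority class.
pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
k = min(len(pos), len(neg))
balanced_idx = np.concatenate([rng.choice(pos, k, replace=False),
                               rng.choice(neg, k, replace=False)])

# (B) Drug-based splitting: no compound appears in both training and validation.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, valid_idx = next(gss.split(X, y, groups=drugs))
assert set(drugs[train_idx]).isdisjoint(drugs[valid_idx])
```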