Sanjay Rathee, Meabh MacMahon, Anika Liu, Nicholas M Katritsis, Gehad Youssef, Woochang Hwang, Lilly Wollman, Namshik Han.
Abstract
Drug-induced liver injury (DILI) is a class of adverse drug reactions (ADR) that causes problems in both clinical and research settings. It is the most frequent cause of acute liver failure in the majority of Western countries and is a major cause of attrition of novel drug candidates. Manual trawling of the literature is the main route of deriving information on DILI from research studies, which makes the process inefficient and prone to human error. An automated AI model capable of retrieving DILI-related articles from the large body of published literature could therefore be invaluable for the drug discovery community. In this study, we built an artificial intelligence (AI) model that combines natural language processing (NLP) and machine learning (ML) to address this problem. The model uses NLP to filter out uninformative text (e.g., stop words) and uses customized functions to extract relevant keywords in the form of singletons, pairs, and triplets. These keywords are processed by an apriori pattern mining algorithm to extract relevant patterns, which are used to estimate initial weightings for an ML classifier. Along with pattern importance and frequency, a list of FDA-approved drugs that mention DILI adds extra confidence to the classification. Together, these methods yield a DILI classifier (DILI C) with 94.91% cross-validation accuracy and 94.14% external validation accuracy. To make DILI C as accessible as possible, including to researchers without coding experience, an R Shiny app capable of classifying single or multiple entries for DILI was developed and made available at https://researchmind.co.uk/diliclassifier/. The app source code (https://github.com/sanjaysinghrathi/DILI-Classifier) and an extended ISMB video talk (https://www.youtube.com/watch?v=j305yIVi_f8) are available as supplementary materials.
Keywords: artificial intelligence (AI); drug-induced liver injury (DILI); machine learning (ML); natural language processing (NLP); pattern mining; shiny app
Year: 2022 PMID: 35846129 PMCID: PMC9277181 DOI: 10.3389/fgene.2022.867946
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1. Steps of DILI from a dataset of DILI-positive and DILI-negative articles to validations showing integration of FDA and SIDER datasets.
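The keyword-extraction and pattern-mining steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their app is written in R): the stop-word list, the reading of singletons, pairs, and triplets as contiguous n-grams, the function names, and the `min_support` threshold are all assumptions made for illustration, and a plain document-frequency support filter stands in for the full apriori algorithm.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the paper's list is not given here).
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "is", "of", "the", "to", "was", "with"}


def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stop words."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return [w for w in cleaned.split() if w not in STOP_WORDS]


def keywords(tokens, max_n=3):
    """Return the set of singletons, pairs, and triplets (contiguous n-grams)."""
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def frequent_patterns(docs, min_support=0.5):
    """Keep keywords whose document frequency reaches min_support.

    This is only the support-threshold part of apriori-style mining; it does
    not perform level-wise candidate generation or pruning.
    """
    counts = Counter()
    for doc in docs:
        counts.update(keywords(tokenize(doc)))  # each keyword counted once per document
    n_docs = len(docs)
    return {k: c / n_docs for k, c in counts.items() if c / n_docs >= min_support}


# Hypothetical example abstracts, not taken from the study's dataset.
example_docs = [
    "Drug-induced liver injury was reported.",
    "Acute liver injury in the patient.",
    "No hepatic findings.",
]
patterns = frequent_patterns(example_docs, min_support=0.6)
print(sorted(patterns))
```

In a pipeline like the one described, the surviving patterns and their support values would then feed the feature weights of a downstream classifier; that step is omitted here.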
Confusion matrix of GBM classifier applied to standout abstract cohort.
| | | True class | |
|---|---|---|---|
| | | Positive | Negative |
| Predicted class | Positive | 1335 | 44 |
| | Negative | 101 | 1362 |
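As a worked example, the standard classification metrics can be derived directly from the four counts in the confusion matrix. The sketch below assumes the rows are the predicted class and the columns are the true class, as the table labels indicate; the variable names are illustrative only.

```python
# Counts read from the confusion matrix (rows: predicted, columns: true).
tp, fp = 1335, 44    # predicted positive: true positives, false positives
fn, tn = 101, 1362   # predicted negative: false negatives, true negatives

total = tp + fp + fn + tn
accuracy = (tp + tn) / total        # share of all abstracts classified correctly
precision = tp / (tp + fp)          # share of predicted positives that are truly positive
recall = tp / (tp + fn)             # share of true positives that were recovered
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"n={total} accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With these counts the classifier is more precise than it is sensitive: false negatives (101) outnumber false positives (44).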
FIGURE 2. Prediction probability plot.
FIGURE 3. Internal accuracies for all ML classifiers (EN, elastic net; LR, logistic regression; SVM, support vector machines; CN, convolution network; RF, random forest; GBM, gradient boosting machine; FSB, feature selection-based model) showing that GBM has the highest accuracy.
Results for the GBM model applied to the validation set and additional external sets of DILI and the non-DILI literature. The inclusion of FDA and SIDER datasets improved the GBM model.
| | Validation set (14211) | | | | Additional external set (2000) | | | |
|---|---|---|---|---|---|---|---|---|
| Model | Accuracy | F1 score | Recall | Precision | Accuracy | F1 score | Recall | Precision |
| GBM (abstract only) | 0.9386 | 0.9376 | 0.9631 | 0.9133 | 0.8845 | 0.8936 | 0.9700 | 0.8284 |
| GBM (+FDA) | 0.9406 | 0.9396 | 0.9659 | 0.9147 | 0.8915 | 0.8992 | 0.9680 | 0.8395 |
| GBM (+SIDER) | 0.9414 | 0.9408 | 0.9602 | 0.9221 | 0.9025 | 0.9094 | 0.9790 | 0.8491 |
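The size of the improvement can be read off the table with a short calculation. The sketch below copies the reported values, checks that each F1 score agrees with the harmonic mean of the reported recall and precision (within rounding), and computes the accuracy gain of each variant over the abstract-only model. The dictionary layout and names are illustrative assumptions, not part of the original analysis.

```python
# Values copied from the results table: (accuracy, F1, recall, precision).
results = {
    "GBM (abstract only)": {"validation": (0.9386, 0.9376, 0.9631, 0.9133),
                            "external": (0.8845, 0.8936, 0.9700, 0.8284)},
    "GBM (+FDA)": {"validation": (0.9406, 0.9396, 0.9659, 0.9147),
                   "external": (0.8915, 0.8992, 0.9680, 0.8395)},
    "GBM (+SIDER)": {"validation": (0.9414, 0.9408, 0.9602, 0.9221),
                     "external": (0.9025, 0.9094, 0.9790, 0.8491)},
}


def f1_from(recall, precision):
    """F1 as the harmonic mean of recall and precision."""
    return 2 * precision * recall / (precision + recall)


baseline = results["GBM (abstract only)"]
gains = {}        # accuracy gain over the abstract-only model
consistent = {}   # whether the reported F1 matches recall/precision within rounding

for model, sets in results.items():
    for name, (acc, f1, rec, prec) in sets.items():
        consistent[(model, name)] = abs(f1_from(rec, prec) - f1) < 5e-4
        gains[(model, name)] = round(acc - baseline[name][0], 4)
        print(f"{model:20s} {name:10s} gain={gains[(model, name)]:+.4f} "
              f"f1_consistent={consistent[(model, name)]}")
```

On these figures the gains are larger on the additional external set than on the validation set, which is where the abstract-only model started from a lower accuracy.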