| Literature DB >> 34017245 |
Weixin Xie, Limei Wang, Qi Cheng, Xueying Wang, Ying Wang, Hongyuan Bi, Bo He, Weixing Feng.
Abstract
Clinical drug-drug interactions (DDIs) have been a major cause of not only medical errors but also adverse drug events (ADEs). The published literature on DDI clinical toxicity continues to grow significantly, and high-performance DDI information retrieval (IR) text mining methods are in high demand. The effectiveness of IR and its machine learning (ML) algorithm depends on the availability of a large amount of training and validation data that have been manually reviewed and annotated. In this study, we investigated how active learning (AL) might improve ML performance in clinical safety DDI IR analysis. We recognized that a direct application of AL would not address several primary challenges in DDI IR from the literature: the vast majority of abstracts in PubMed are negative, the existing positive and negative labeled samples do not represent the general sample distribution, and potentially biased samples may arise during uncertainty sampling in an AL algorithm. Therefore, we developed several novel sampling and ML schemes to improve AL performance in DDI IR analysis. In particular, random negative sampling was added as a part of AL since it incurs no manual labeling cost. We also used two ML algorithms in the AL process to differentiate random negative samples from manually labeled negative samples, and we updated both the training and validation samples during the AL process to avoid or reduce biased sampling. Two supervised ML algorithms, support vector machine (SVM) and logistic regression (LR), were used to investigate the consistency of our proposed AL algorithm. Because the ultimate goal of clinical safety DDI IR is to retrieve all DDI toxicity-relevant abstracts, a recall rate of 0.99 was set in developing the AL methods.
When we used our newly proposed AL method with SVM, the precision in differentiating the positive samples from manually labeled negative samples improved from 0.45 in the first round to 0.83 in the second round, and the precision in differentiating the positive samples from random negative samples improved from 0.70 to 0.82 in the first and second rounds, respectively. When our proposed AL method was used with LR, the improvements in precision followed a similar trend. However, the other AL algorithms tested did not show improved precision, largely because of biased samples caused by the uncertainty sampling or differences between the training and validation data sets.
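The combination described above can be sketched in code. The following is an illustrative sketch, not the authors' implementation: all data are synthetic, and the names (`ml1`, `ml2`, `pool`, sample sizes) are assumptions chosen to mirror the abstract's description of one AL round with uncertainty sampling, random negative sampling, and two separate classifiers.

```python
# Illustrative sketch of one active-learning round combining uncertainty
# sampling with random negative sampling; synthetic data stand in for
# TF-IDF-style abstract feature vectors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Labeled positives/negatives plus a large unlabeled pool (in the paper,
# the vast majority of PubMed abstracts are negative).
X_pos = rng.normal(+1.0, 1.0, size=(200, 20))
X_neg = rng.normal(-1.0, 1.0, size=(200, 20))
pool = rng.normal(-1.0, 1.2, size=(5000, 20))

# Random negative sampling: draw pool items and treat them as negatives;
# this costs no manual labeling because true positives are rare.
rand_neg = pool[rng.choice(len(pool), size=1000, replace=False)]

# Two separate models, per the paper's scheme: ML1 separates positives from
# random negatives, ML2 separates positives from manually labeled negatives.
ml1 = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_pos, rand_neg]),
    np.r_[np.ones(len(X_pos)), np.zeros(len(rand_neg))])
ml2 = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_pos, X_neg]),
    np.r_[np.ones(len(X_pos)), np.zeros(len(X_neg))])

# Uncertainty sampling: pick the pool items whose predicted probability is
# closest to 0.5; these would be sent for manual annotation next round.
proba = ml1.predict_proba(pool)[:, 1]
uncertain_idx = np.argsort(np.abs(proba - 0.5))[:50]
print(len(uncertain_idx))  # 50 items queued for annotation
```

SVM could be swapped in for `LogisticRegression` (e.g., a linear SVM with probability calibration) without changing the overall loop.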
Keywords: active learning (AL); drug–drug interaction (DDI); information retrieval (IR); random negative sampling; text mining (TM)
Year: 2021 PMID: 34017245 PMCID: PMC8130007 DOI: 10.3389/fphar.2020.582470
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
Inclusion and exclusion criteria for clinical drug safety drug–drug interaction (DDI) abstract selection.
| Inclusion | Clinical trial DDI toxicity study: Phase I/II/III clinical trials in which drug combination and/or single drug toxicity data are evaluated and reported |
| Pharmaco-epidemiological DDI study: Pharmaco-epidemiology studies in which toxicities from drug combinations are reported and compared to toxicities from a single drug | |
| DDI and adverse drug event (ADE) case reports: DDI-induced ADE cases in which the time sequential drug and ADE are reported in clinical settings | |
| Exclusion | a) Clinical PK DDI study: Both single drug and drug combination exposures (i.e., pharmacokinetics) are evaluated either in patients or healthy volunteers |
| b) Clinical PK PG study: The single drug exposure (i.e., pharmacokinetics) is evaluated among patients who have different genotypes in CYP450 and UGT enzymes and drug transporters | |
| c) | |
| d) Drug interaction detection algorithms or software | |
| e) Compliance of avoiding DDIs | |
| f) Concordance of DDI reporting among different drug interaction knowledge bases | |
| g) Comparison of the performance of DDI clinical decision systems | |
| h) Drug–alcohol/food interactions | |
| i) Drug/test interactions | |
| j) Case report studies | |
| k) Review papers | |
| l) Cell culture and animal studies | |
| m) Other studies that are not related to drug interactions |
FIGURE 1. Distribution of the standard deviations (SDs) of the initial 46,604 terms from all the selected abstracts.
FIGURE 2. Frequency distribution of one representative term (“advanc”) in all the selected texts.
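The screening implied by Figures 1 and 2 can be illustrated with a toy example. This is not the authors' code: the three mini-abstracts are invented, and the crude suffix-stripping function is a stand-in for a real Porter stemmer (which maps "advanced" to the stem "advanc" seen in Figure 2).

```python
# Toy sketch: per-term frequency vectors across abstracts, then the
# standard deviation of each term's frequencies (cf. Figures 1-2).
import numpy as np
from collections import Counter

abstracts = [
    "advanced drug interaction reported",
    "drug toxicity advanced trial",
    "single drug pharmacokinetics study",
]

# Crude stand-in for Porter stemming ("advanced" -> "advanc"); a real
# pipeline would use a proper stemmer such as NLTK's PorterStemmer.
def stem(word):
    return word[:-2] if word.endswith("ed") else word

counts = [Counter(stem(w) for w in a.split()) for a in abstracts]
vocab = sorted(set().union(*counts))
freq = np.array([[c[t] for t in vocab] for c in counts], dtype=float)

# Per-term SD across abstracts: terms with near-zero SD vary little across
# documents and carry less signal for classification, so they are natural
# candidates to drop from the initial 46,604-term vocabulary.
sds = freq.std(axis=0)
print(dict(zip(vocab, sds.round(2))))
```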
Training and validation sample sizes for four different active learning (AL) methods in two rounds.
| AL algorithm | ML | First round (SVM): training | First round (SVM): validation | Second round (SVM): training | Second round (SVM): validation | First round (LR): training | First round (LR): validation | Second round (LR): training | Second round (LR): validation |
|---|---|---|---|---|---|---|---|---|---|
| Traditional AL | | 200+* 200-* | 200+* 200-* | 208+* 567-* | 200+* 200-* | 200+* 200-* | 200+* 200-* | 205+* 473-* | 200+* 200-* |
| Traditional AL with random negative sampling | | 200+* 1000R* | 200+* 200-* 200R* | 217+* 1104R* | 200+* 200-* 200R* | 200+* 1000R* | 200+* 200-* 200R* | 211+* 1085R* | 200+* 200-* 200R* |
| AL with two separate ML algorithms and integrated random negative sampling | ML1 | 200+* 1000R* | 200+* 200-* 200R* | 209+* 1213R* | 200+* 200-* 200R* | 200+* 1000R* | 200+* 200-* 200R* | 206+* 1106R* | 200+* 200-* 200R* |
| | ML2 | 200+* 200-* | | 209+* 412-* | | 200+* 200-* | | 206+* 306-* | |
| AL with two separate ML algorithms, integrated random negative sampling, and validation sample update | ML1 | 200+* 1000R* | 209+* 412-* 196R* | 204+* 1053R* | 209+* 412-* 196R* | 200+* 1000R* | 204+* 391-* 199R* | 202+* 1046R* | 204+* 391-* 199R* |
| | ML2 | 200+* 200-* | | 205+* 306-* | | 200+* 200-* | | 202+* 295-* | |

Note: +* = positive samples; -* = negative samples; R* = random negative samples.
FIGURE 3. Active learning workflow with a single machine learning (ML) algorithm, uncertainty sampling, and no validation update.
FIGURE 4. Active learning workflow with a single machine learning (ML) algorithm, uncertainty sampling plus random negative sampling, and no validation update.
FIGURE 5. Active learning workflow with two separate ML algorithms, uncertainty sampling plus random negative sampling, and no validation update.
FIGURE 6. Active learning workflow with two separate ML algorithms, uncertainty sampling plus random negative sampling, and validation sample update.
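The four workflows in Figures 3 through 6 differ only in which steps are switched on. The runnable skeleton below is purely illustrative: the stub functions (`annotate`, `uncertainty_sample`) and all sizes are invented to show how the variants relate, not to reproduce the authors' pipeline.

```python
# Minimal skeleton of the workflow variants in Figures 3-6; models and
# manual annotation are replaced with stubs so the control flow is visible.
import random

random.seed(0)
pool = list(range(1000))  # stand-in for unlabeled PubMed abstracts

def annotate(items):
    # Stub for the manual labeling step (the costly part of AL).
    return [(i, i % 7 == 0) for i in items]

def uncertainty_sample(candidates, k):
    # Stub for "items nearest the decision boundary".
    return random.sample(candidates, k)

def al_round(train, val, *, random_neg=False, update_validation=False):
    queried = uncertainty_sample(pool, 20)
    train = train + annotate(queried)          # Figure 3: this step only
    if random_neg:                             # Figures 4-6: free negatives
        train += [(i, False) for i in random.sample(pool, 100)]
    if update_validation:                      # Figure 6: keep validation
        val = val + annotate(uncertainty_sample(pool, 10))  # in sync
    return train, val

train, val = al_round([], [], random_neg=True, update_validation=True)
print(len(train), len(val))  # 120 training items (20 labeled + 100 random), 10 validation
```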
Performances (SVM and LR) of two active learning methods with single machine learning (ML) algorithms.
| Two AL methods’ performances for SVM | TP | FP | TN | FN | Recall | Precision | F-score | Precision (recall = 0.99) | |
|---|---|---|---|---|---|---|---|---|---|
| Uncertainty sampling, a single ML algorithm, and no validation data update | 1st round | 190 | 2 | 198 | 10 | 0.95 | 0.99 | 0.97 | 0.97 |
| 2nd round | 134 | 0 | 200 | 66 | 0.67 | 1.00 | 0.80 | 0.94 | |
| Uncertainty sampling and random negative sampling, a single ML algorithm, and no validation data update | 1st round | 176 | 0 | 400 | 24 | 0.88 | 1.00 | 0.94 | 0.96 |
| 2nd round | 138 | 0 | 400 | 62 | 0.69 | 1.00 | 0.82 | 0.94 | |
| Two AL methods’ performances for LR | TP | FP | TN | FN | Recall | Precision | F-score | Precision (recall = 0.99) | |
| Uncertainty sampling, a single ML algorithm, and no validation data update | 1st round | 187 | 7 | 193 | 13 | 0.94 | 0.97 | 0.95 | 0.96 |
| 2nd round | 161 | 1 | 199 | 39 | 0.81 | 0.99 | 0.89 | 0.94 | |
| Uncertainty sampling and random negative sampling, a single ML algorithm, and no validation data update | 1st round | 176 | 5 | 395 | 24 | 0.88 | 0.97 | 0.92 | 0.95 |
| 2nd round | 159 | 1 | 399 | 41 | 0.80 | 0.99 | 0.88 | 0.94 | |
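The metric columns in these tables follow directly from the TP/FP/TN/FN counts. As a check, the standard definitions reproduce the first SVM row (TP = 190, FP = 2, TN = 198, FN = 10):

```python
# Recall, precision, and F-score from confusion-matrix counts, rounded to
# two decimals as in the tables.
def metrics(tp, fp, tn, fn):
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return round(recall, 2), round(precision, 2), round(f_score, 2)

print(metrics(190, 2, 198, 10))  # -> (0.95, 0.99, 0.97), matching the first row
```

Note that TN does not enter any of the three metrics, which is why precision-oriented evaluation suits this heavily negative-skewed retrieval task.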
Performance (SVM and LR) of an active learning method with two separate machine learning (ML) algorithms.
| Performance | TP | FP | TN | FN | Recall | Precision | F-score | Precision (Recall = 0.99) | ||
|---|---|---|---|---|---|---|---|---|---|---|
| Uncertainty sampling and random negative sampling, two separate ML algorithms, and no validation data update (SVM) | ML1 | 1st round | 178 | 0 | 400 | 22 | 0.89 | 1.00 | 0.94 | 0.96 |
| 2nd round | 132 | 0 | 400 | 68 | 0.66 | 1.00 | 0.80 | 0.94 | ||
| ML2 | 1st round | 190 | 6 | 394 | 10 | 0.95 | 0.97 | 0.96 | 0.94 | |
| 2nd round | 150 | 0 | 400 | 50 | 0.75 | 1.00 | 0.86 | 0.94 | ||
| Uncertainty sampling and random negative sampling, two separate ML algorithms, and no validation data update (LR) | ML1 | 1st round | 180 | 3 | 397 | 20 | 0.90 | 0.98 | 0.94 | 0.96 |
| 2nd round | 142 | 2 | 398 | 58 | 0.71 | 0.99 | 0.83 | 0.93 | ||
| ML2 | 1st round | 182 | 0 | 400 | 18 | 0.91 | 1.00 | 0.95 | 0.95 | |
| 2nd round | 152 | 2 | 398 | 48 | 0.76 | 0.99 | 0.86 | 0.94 | ||
Performance of our proposed active learning (AL) method evaluated using SVM and LR.
| Performance | TP | FP | TN | FN | Recall | Precision | F-score | Precision (Recall = 0.99) | ||
|---|---|---|---|---|---|---|---|---|---|---|
| AL with two separate ML algorithms integrated random negative sampling, and validation sample update (SVM) | ML1 | 1st round | 182 | 6 | 602 | 27 | 0.87 | 0.97 | 0.92 | 0.70 |
| 2nd round | 178 | 0 | 608 | 31 | 0.85 | 1.00 | 0.92 | 0.82 | ||
| ML2 | 1st round | 199 | 98 | 510 | 10 | 0.95 | 0.67 | 0.79 | 0.45 | |
| 2nd round | 184 | 4 | 604 | 25 | 0.88 | 0.98 | 0.93 | 0.83 | ||
| AL with two separate ML algorithms integrated random negative sampling, and validation sample update (LR) | ML1 | 1st round | 159 | 8 | 582 | 45 | 0.78 | 0.95 | 0.86 | 0.81 |
| 2nd round | 161 | 18 | 572 | 43 | 0.79 | 0.90 | 0.88 | 0.84 | ||
| ML2 | 1st round | 194 | 58 | 532 | 10 | 0.95 | 0.77 | 0.85 | 0.60 | |
| 2nd round | 188 | 6 | 584 | 16 | 0.92 | 0.97 | 0.94 | 0.90 | ||
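The rightmost column, "Precision (Recall = 0.99)", reflects the study's requirement that essentially all relevant abstracts be retrieved. The paper's tables report the values directly; one common way to compute such a figure, sketched below on synthetic scores (the data and function name are illustrative assumptions), is to lower the decision threshold until recall reaches the target and report precision there.

```python
# Sketch: precision at a fixed recall target, via a threshold sweep over
# classifier scores sorted in descending order. Data here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
y = np.r_[np.ones(200), np.zeros(600)]          # 200 positives, 600 negatives
scores = np.r_[rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 600)]

def precision_at_recall(y, scores, target=0.99):
    order = np.argsort(-scores)                 # rank by score, best first
    tp = np.cumsum(y[order])                    # true positives at each cutoff
    recall = tp / y.sum()                       # nondecreasing in cutoff size
    k = np.searchsorted(recall, target)         # smallest cutoff reaching target
    return tp[k] / (k + 1)                      # precision at that cutoff

p = precision_at_recall(y, scores)
print(round(float(p), 2))
```

Demanding recall of 0.99 forces the threshold low, which is why the precision values in this column sit well below the plain precision column.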
FIGURE 7. Performance of ML1 and ML2 on uncertainty samples with SVM and logistic regression as the machine learning (ML) methods.
FIGURE 8. Distribution patterns of various sample types by principal component analysis.
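A projection like Figure 8 is typically produced by fitting PCA on the pooled feature vectors and plotting the first two components per sample type. The sketch below uses synthetic stand-ins for the three sample types; the group names and dimensions are illustrative only.

```python
# Illustrative 2-D PCA projection (cf. Figure 8) of positive, manually
# labeled negative, and random negative samples, via SVD of centered data.
import numpy as np

rng = np.random.default_rng(2)
groups = {
    "positive": rng.normal(+1.0, 1.0, size=(100, 30)),
    "manual negative": rng.normal(-1.0, 1.0, size=(100, 30)),
    "random negative": rng.normal(-0.5, 1.5, size=(100, 30)),
}
X = np.vstack(list(groups.values()))

# PCA from scratch: center, take SVD, keep the top-2 right singular vectors.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ Vt[:2].T
print(proj.shape)  # (300, 2): one 2-D point per sample, ready to scatter-plot
```

Overlap or separation between the three clouds in such a plot indicates how well random negatives approximate manually labeled negatives, which is exactly what the two-model scheme exploits.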