Jake Vasilakes, Rubina Rizvi, Genevieve B Melton, Serguei Pakhomov, Rui Zhang.
Abstract
OBJECTIVES: This study evaluated and compared a variety of active learning strategies, including a novel strategy we proposed, as applied to the task of filtering incorrect semantic predications in SemMedDB.
Keywords: active machine learning; clinical medicine; drug interactions; medical informatics; natural language processing; supervised machine learning
Year: 2018 PMID: 30740594 PMCID: PMC6367018 DOI: 10.1093/jamiaopen/ooy021
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. An overview of the active learning system development process.
Figure 2. The active learning process. From an initial labeled set L, train the ML model θ, choose the most informative example from the unlabeled set U using the query strategy QS and the updated θ, query the oracle for its label, and update L.
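The loop described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the one-dimensional threshold model, the `oracle` function, and the synthetic pool are all assumptions made for the example, and the query strategy shown is a simple closest-to-boundary (least confidence style) rule.

```python
import random

def train(labeled):
    """Fit theta for a toy 1-D model: the midpoint between the two class means.
    (Hypothetical stand-in for the ML model; requires both classes in `labeled`.)"""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def least_confidence(theta, unlabeled):
    """Query strategy QS: pick the example closest to the decision boundary,
    i.e. the one the current model is least sure about."""
    return min(unlabeled, key=lambda x: abs(x - theta))

def active_learning(labeled, unlabeled, oracle, budget):
    """Pool-based active learning: train, query, label, update, repeat."""
    labeled = list(labeled)      # L
    unlabeled = list(unlabeled)  # U
    theta = train(labeled)       # initial model from the seed set
    for _ in range(budget):
        if not unlabeled:
            break
        x = least_confidence(theta, unlabeled)  # choose with QS and current theta
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))          # query the oracle, update L
        theta = train(labeled)                  # retrain on the updated L
    return theta, labeled

def oracle(x):
    """Hypothetical annotator: the true label is 1 when x >= 0.5."""
    return 1 if x >= 0.5 else 0

rng = random.Random(0)
pool = [rng.random() for _ in range(200)]  # synthetic unlabeled pool
seed = [(0.05, 0), (0.95, 1)]              # initial labeled set with both classes
theta, labeled = active_learning(seed, pool, oracle, budget=10)
```

Swapping `least_confidence` for another function with the same signature is enough to try a different query strategy under this sketch.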
Area under the learning curve (ALC) and number of training examples required to reach target area under the ROC curve (AUC) of the uncertainty, representative, and combined query strategies evaluated on the substance interactions and clinical medicine datasets
| Type | Query strategy | Substance interactions: ALC | Substance interactions: training examples to target AUC | Clinical medicine: ALC | Clinical medicine: training examples to target AUC |
|---|---|---|---|---|---|
| Baseline | Passive | 0.590 | 1295 | 0.491 | 2473 |
| | SM | 0.597 | 1218 | 0.541 | 2093 |
| Uncertainty | LC | 0.606 | 1051 | 0.543 | 2043 |
| | LCB2 | 0.607 | 1060 | 0.542 | 2089 |
| | D2C | 0.623 | 891 | 0.548 | 2166 |
| Representative | Density | 0.622 | 905 | 0.547 | 2136 |
| | Min-Max | 0.634 | 657 | | |
| Combined | ID ( | 0.626 | 771 | 0.534 | 2157 |
| | ID ( | | | 0.542 | 2146 |
| | ID ( | 0.635 | 653 | | 2174 |
| | ID (dynamic | 0.641 | 587 | 0.549 | 2180 |
Bold values indicate the best performing method for that metric.
Novel algorithm.
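The two metrics in the table can be illustrated with a short Python sketch. It assumes that ALC is the trapezoidal area under the AUC-versus-training-set-size curve divided by the range of training-set sizes, and that the second metric is the smallest training-set size at which the AUC first reaches the target. The exact normalization and target value used in the study are not stated here, so both function definitions and the sample numbers are illustrative assumptions only.

```python
def area_under_learning_curve(sizes, aucs):
    """Trapezoidal area under AUC vs. training-set size, divided by the size range.
    (Assumed normalization; the study may define ALC differently.)"""
    area = 0.0
    for i in range(1, len(sizes)):
        width = sizes[i] - sizes[i - 1]
        area += width * (aucs[i] + aucs[i - 1]) / 2.0
    return area / (sizes[-1] - sizes[0])

def examples_to_target(sizes, aucs, target):
    """Smallest training-set size whose AUC reaches the target, or None if never reached."""
    for n, auc in zip(sizes, aucs):
        if auc >= target:
            return n
    return None

# Made-up learning curve, for illustration only (not data from the study).
sizes = [100, 200, 300, 400]
aucs = [0.50, 0.60, 0.70, 0.70]

alc = area_under_learning_curve(sizes, aucs)
n_target = examples_to_target(sizes, aucs, 0.65)
```

Under these definitions, a strategy that climbs faster yields both a higher ALC and a smaller number of examples to the target, which is the pattern the table compares across query strategies.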
Figure 3. Average area under the ROC curve (AUC) learning curves for the uncertainty-based, representative-based, and combined query strategy types for the substance interactions and clinical medicine datasets. Rows correspond to query strategy types. Columns correspond to the datasets.
Figure 4. Query patterns of the low, middle, and high performing query strategies for the substance interactions and clinical medicine datasets overlaid on a visualization of U generated using t-SNE along with their corresponding learning curves. Dark blue corresponds to the first examples queried. Yellow corresponds to the last examples queried. Columns correspond to the substance interactions and clinical medicine datasets, respectively. Rows from the top correspond to the low, middle, and high performing query strategies, respectively.