Kazuma Hashimoto, Georgios Kontonatsios, Makoto Miwa, Sophia Ananiadou.
Abstract
Systematic reviews require expert reviewers to manually screen thousands of citations in order to identify all articles relevant to the review. Active learning text classification is a supervised machine learning approach that has been shown to significantly reduce the manual annotation workload by semi-automating the citation screening process of systematic reviews. In this paper, we present a new topic detection method that induces an informative representation of studies to improve the performance of the underlying active learner. Our proposed topic detection method uses a neural network-based vector space model to capture semantic similarities between documents. We first represent documents within the vector space and cluster them into a predefined number of clusters. The centroids of the clusters are treated as latent topics. We then represent each document as a mixture of latent topics. For evaluation purposes, we employ the active learning strategy using both our novel topic detection method and a baseline topic model (i.e., Latent Dirichlet Allocation). The results demonstrate that our method achieves high sensitivity in identifying eligible studies and a significantly reduced manual annotation cost when compared to the baseline method. This observation is consistent across two clinical and three public health reviews. The tool introduced in this work is available from https://nactem.ac.uk/pvtopic/.
Keywords: Active learning; Citation screening; Document embeddings; Paragraph vectors; Systematic reviews; Topic modelling
Year: 2016 PMID: 27293211 PMCID: PMC4981645 DOI: 10.1016/j.jbi.2016.06.001
Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317
Fig. 1. Detecting latent topics using paragraph vectors.
Fig. 2. Examples of topics and descriptive topic labels extracted by the paragraph vector-based topic detection method (i.e., PV topic detection) and the LDA topic model from an abstract within the Cooking Skills dataset. Topic labels that are present in the abstract are highlighted with solid green lines for the paragraph vector-based topic detection method and with dashed blue lines for LDA. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
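The pipeline summarised in the abstract (embed documents, cluster them, treat centroids as latent topics, express each document as a mixture of topics) can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: the toy 2-D vectors stand in for learned paragraph vectors, plain k-means is used as the clustering step, and the mixture weights come from a softmax over negative Euclidean distances, which is only one plausible way to turn centroid proximity into a mixture. The names `kmeans` and `topic_mixture` are hypothetical.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda j: math.dist(v, centroids[j]))
            clusters[nearest].append(v)
        for j, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def topic_mixture(vector, centroids):
    """Assumed weighting: softmax over negative distances to each centroid,
    so that closer centroids (latent topics) receive more weight."""
    scores = [math.exp(-math.dist(vector, c)) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


# Toy 2-D "document vectors" standing in for learned paragraph vectors.
docs = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
        [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]

centroids = kmeans(docs, k=2)  # k is the predefined number of topics
mixtures = [topic_mixture(d, centroids) for d in docs]

for doc, mix in zip(docs, mixtures):
    print(doc, [round(w, 3) for w in mix])
```

Each row of `mixtures` is a vector of non-negative weights summing to one, which is the kind of document representation that would then be passed to the active learner as features.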
Characteristics of clinical and social science reviews used for experimentation.
| Dataset | Scientific domain | # citations | Ratio of eligible to ineligible studies (%) |
|---|---|---|---|
| COPD | Clinical | 1606 | 12 |
| ProtonBeam | Clinical | 4751 | 5 |
| Cooking Skills | Public health | 11,515 | 2 |
| Tobacco packaging | Public health | 3210 | 5 |
| Youth development | Public health | 15,544 | 10 |
Fig. 3. High-level view of a certainty-based active learning strategy [4] used for citation screening.
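As a rough illustration of a certainty-based loop, the sketch below repeatedly retrains a scorer on the citations labelled so far and asks the reviewer to label the unlabelled citation the scorer is most confident is eligible. It is a sketch under stated assumptions, not the system described in the paper: `relevance_score` is a toy nearest-mean scorer standing in for a real classifier, and `certainty_based_screening`, `oracle`, `seed_ids`, and `budget` are hypothetical names.

```python
import math


def relevance_score(x, labelled):
    """Toy stand-in classifier: distance to the mean ineligible vector minus
    distance to the mean eligible vector (higher = more likely eligible)."""
    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]
    pos = [v for v, y in labelled if y == 1]
    neg = [v for v, y in labelled if y == 0]
    if not pos or not neg:
        return 0.0  # no basis for ranking until both classes are seen
    return math.dist(x, mean(neg)) - math.dist(x, mean(pos))


def certainty_based_screening(pool, oracle, seed_ids, batch_size=1, budget=None):
    """Label the seed set, then repeatedly label the unlabelled items that
    the current scorer ranks as most likely to be eligible."""
    labelled = {i: oracle(i) for i in seed_ids}
    order = list(seed_ids)
    unlabelled = [i for i in range(len(pool)) if i not in labelled]
    while unlabelled and (budget is None or len(labelled) < budget):
        train = [(pool[i], y) for i, y in labelled.items()]
        unlabelled.sort(key=lambda i: relevance_score(pool[i], train),
                        reverse=True)
        batch, unlabelled = unlabelled[:batch_size], unlabelled[batch_size:]
        for i in batch:
            labelled[i] = oracle(i)  # the human reviewer supplies the label
            order.append(i)
    return order, labelled


# Toy pool: feature vectors (e.g. topic mixtures) and hidden true labels.
pool = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0],
        [3.1, 2.9], [2.9, 3.2], [0.3, 0.0], [3.2, 3.1]]
truth = [0, 0, 0, 1, 1, 1, 0, 1]

order, labels = certainty_based_screening(
    pool, oracle=lambda i: truth[i], seed_ids=[0, 3], budget=5)
print(order, sum(labels.values()))
```

In this toy run the reviewer labels five of the eight citations, and the loop surfaces the eligible ones first, which is the behaviour that lets screening stop early.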
Fig. 4. Performance (yield and burden) achieved by the AL_LDA and AL_PV models when applied to the clinical COPD dataset.
Fig. 5. Performance (yield and burden) achieved by the AL_LDA and AL_PV models when applied to the public health Cooking Skills dataset.
Fig. 6. WSS@95% achieved by the AL_PV and AL_LDA active learning models across clinical and public health reviews.
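For readers unfamiliar with the metric, WSS@95% (work saved over sampling at 95% recall) is commonly defined as (TN + FN) / N − 0.05, i.e. the fraction of citations left unscreened once 95% of the eligible studies have been found, minus the 5% that random sampling would be expected to save at the same recall. The sketch below computes it from a screening order; it assumes this common definition applies here, and the function name `wss_at_recall` and the toy ranking are hypothetical.

```python
import math


def wss_at_recall(ranked_labels, recall=0.95):
    """Work saved over sampling for a screening order.

    ranked_labels: 1 (eligible) or 0 (ineligible), in the order in which
    citations are screened. Unscreened citations count as TN + FN.
    """
    n = len(ranked_labels)
    total_relevant = sum(ranked_labels)
    if n == 0 or total_relevant == 0:
        raise ValueError("need at least one citation and one eligible study")
    needed = math.ceil(recall * total_relevant)
    found = 0
    for screened, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            # (TN + FN) / N is the unscreened remainder of the collection.
            return (n - screened) / n - (1.0 - recall)
    raise AssertionError("unreachable: the full list always reaches the target")


# Toy ranking: 3 eligible studies among 10 citations, all found by rank 4.
ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
wss = wss_at_recall(ranking)
print(round(wss, 2))
```

For the toy ranking, six of ten citations remain unscreened when the recall target is met, so the score is 0.6 − 0.05 = 0.55; higher values mean more screening effort saved.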