Claire Stansfield, Gillian Stokes, James Thomas.
Abstract
Manual screening of citation records could be reduced by using machine classifiers to remove records of very low relevance. This seems particularly feasible for update searches, where a machine classifier can be trained from past screening decisions. However, feasibility is unclear for broad topics. We evaluate the performance and implementation of machine classifiers for update searches of public health research using two case studies. The first study evaluates the impact of using different sets of training data on classifier performance, comparing recall and screening reduction with a manual screening 'gold standard'. The second study uses screening decisions from a review to train a classifier that is applied to rank the update search results. A stopping threshold was applied in the absence of a gold standard. Time spent screening titles and abstracts of different relevancy-ranked records was measured.
Keywords: information retrieval; supervised machine learning; systematic reviews as topic; update search
Year: 2021 PMID: 34747151 PMCID: PMC9299040 DOI: 10.1002/jrsm.1537
Source DB: PubMed Journal: Res Synth Methods ISSN: 1759-2879 Impact factor: 9.308
Description of the training and test sets
| Exclusion (Ex) and inclusion criteria for TRoPHI | Training set | Test set A | Test set B | Test set C |
|---|---|---|---|---|
| Description | Searches between January 2012 and June 2013 | Searches July 2013–March 2015 | Searches during 2010, publication date 2009–2010 | Searches during 2008, publication date 2007–2008 |
| Ex1: Focus is not on health promotion or public health | 16,536 | 8502 | 6796 | 5351 |
| Ex2: Study is not a prospective evaluation of an intervention | 1799 | |||
| Ex3: Study has no control or comparison group | ||||
| Ex4: Item is a review (consider for Database of Promoting Health Interventions) | 312 | |||
| Include 1: non‐randomised controlled trial (non‐RCT) | 220 | 213 | 103 | 96 |
| Include 2: Randomised controlled trial (RCT) (this includes a true randomised method, or quasi‐randomisation such as alternate allocation) | 892 | 653 | 286 | 365 |
The classifier was trained on 20,050 references; numbers were adjusted following additional duplicate removal.
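As a quick consistency check, the test-set columns of the table above can be totalled. The sketch below is illustrative only: it assumes the blank cells mean that the test sets were coded with Ex1 and the two inclusion codes alone, and the names `TEST_SETS` and `summarise` are hypothetical.

```python
# Counts per coding category, transcribed from the table above.
# Assumption: blank cells in the test-set columns are treated as zero.
TEST_SETS = {
    "A": {"Ex1": 8502, "non-RCT": 213, "RCT": 653},
    "B": {"Ex1": 6796, "non-RCT": 103, "RCT": 286},
    "C": {"Ex1": 5351, "non-RCT": 96, "RCT": 365},
}


def summarise(counts):
    """Return (total records, included records, included share) for one test set."""
    total = sum(counts.values())
    included = counts["non-RCT"] + counts["RCT"]
    return total, included, included / total


for name, counts in TEST_SETS.items():
    total, included, share = summarise(counts)
    print(f"Test set {name}: {total} records, {included} included ({share:.1%})")
```

The totals give the size of each test set, and the included share indicates how sparse relevant records are in each search period.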
Performance of the classifiers on test set A (n = 9368), B (n = 7185) and C (n = 5812)
| Classifier | Training criteria | Set | RCTs: precision % | RCTs: recall % | Non‐RCTs: precision % | Non‐RCTs: recall % | Screening reduction % |
|---|---|---|---|---|---|---|---|
| RCT classifier | Include RCT in any human health domain | A | 12.3 | 99.7 | 3.4 | 85.9 | 43.3 |
| | | B | 8.1 | 99.7 | 2.4 | 83.5 | 50.9 |
| | | C | 11.1 | 99.2 | 2.6 | 87.5 | 43.6 |
| Custom 1 | Include any studies that are in the health promotion domain (all studies without Ex1 code) | A | 11.7 | 99.1 | 3.8 | 99.5 | 40.9 |
| | | B | 7.8 | 100 | 2.8 | 100 | 49.3 |
| | | C | 13.6 | 99.5 | 3.5 | 96.9 | 54.2 |
| Custom 2 | Include any RCTs, non‐RCTs or reviews in the health promotion domain (all studies without Ex1, Ex2 or Ex3 codes) | A | 16.5 | 98.8 | 5.4 | 98.6 | 58.4 |
| | | B | 12.2 | 99.3 | 4.3 | 97.1 | 67.6 |
| | | C | 19.6 | 98.1 | 4.9 | 93.8 | 68.6 |
| Custom 3 | Include any studies that are RCTs or non‐RCTs in the health promotion domain (all studies without Ex1, Ex2, Ex3 or Ex4 codes) | A | 19.7 | 98.0 | 6.3 | 96.2 | 65.4 |
| | | B | 14.6 | 99.0 | 5.1 | 97.1 | 72.9 |
| | | C | 23.2 | 97.8 | 5.8 | 92.7 | 73.5 |
Ex1, Ex2, Ex3, Ex4 are described in Table 1.
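The percentages in the table above appear consistent with the usual definitions: precision as the share of classifier-retained records that are relevant, recall as the share of relevant records that are retained, and screening reduction as the share of all records that no longer need manual screening. The following is a minimal sketch under that assumption. The function name and the two retained counts are hypothetical; the counts were back-calculated to reproduce the RCT classifier row for test set A and are not reported in the source.

```python
def screening_metrics(total_records, relevant_total, relevant_retained, retained):
    """Return (precision %, recall %, screening reduction %).

    total_records     -- all records in the test set
    relevant_total    -- relevant records according to manual screening
    relevant_retained -- relevant records kept by the classifier
    retained          -- all records kept by the classifier for manual screening
    """
    precision = 100 * relevant_retained / retained
    recall = 100 * relevant_retained / relevant_total
    reduction = 100 * (total_records - retained) / total_records
    return precision, recall, reduction


# Test set A: 9368 records, of which 653 are RCTs (from the tables above).
# Assumption: 5312 records retained, 651 of them RCTs (illustrative values).
precision, recall, reduction = screening_metrics(
    total_records=9368, relevant_total=653, relevant_retained=651, retained=5312
)
print(f"precision {precision:.1f}%, recall {recall:.1f}%, reduction {reduction:.1f}%")
```

Because precision depends on how many irrelevant records remain in the retained set, it stays low even when recall is close to 100%, which matches the pattern in the table.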
BOX 1 Definitions of the performance parameters
FIGURE 1 Relevance scores of two classifiers (n = 21,404)
FIGURE 2 Relevant records per volume screened (n = 8,449)
FIGURE 3 Precision of the 62 relevant records within relevance ranking scores
FIGURE 4 Contribution of all the studies shown by publication year (N = 336)
Time taken to screen citation records at different relevance scores (screening time per record in seconds, based on five records per category)
| Relevance score | Minimum | Maximum | Mean | Total |
|---|---|---|---|---|
| 90–99 | 138 | 280 | 214.2 | 1071 |
| 80–89 | 198 | 320 | 266 | 1330 |
| 70–79 | 15 | 303 | 132.2 | 661 |
| 60–69 | 30 | 132 | 78.4 | 392 |
| 50–59 | 35 | 134 | 93.2 | 466 |
| 40–49 | 10 | 222 | 62 | 310 |
| 30–39 | 10 | 140 | 61.8 | 309 |
| 20–29 | 8 | 130 | 59.8 | 299 |
| 18–19 | 7 | 15 | 11.2 | 56 |
| 17–18 | 5 | 17 | 12.2 | 61 |
| 16–17 | 5 | 15 | 8.4 | 42 |
| 16 | 3 | 18 | 11 | 55 |
| 15–16 | 4 | 30 | 12.2 | 61 |
| 14–15 | 5 | 15 | 9.4 | 47 |
| 13–14 | 5 | 8 | 6.6 | 33 |
| 13 | 3 | 15 | 7 | 35 |
FIGURE 5 Screening time for records by relevance score (n = 5, for each relevance score range)
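In the screening-time table, each Mean value equals the Total divided by the five records timed in that relevance band. The sketch below recomputes the means and compares the bands scoring 20 or above with those below 20. The cut-off of 20 is an assumption made here for illustration (it is where the band width changes in the table), and the names `BANDS` and `mean_time` are hypothetical.

```python
RECORDS_PER_BAND = 5  # five records were timed in each relevance band

# (relevance band, lower bound of band, total screening time in seconds),
# transcribed from the screening-time table.
BANDS = [
    ("90–99", 90, 1071), ("80–89", 80, 1330), ("70–79", 70, 661),
    ("60–69", 60, 392), ("50–59", 50, 466), ("40–49", 40, 310),
    ("30–39", 30, 309), ("20–29", 20, 299), ("18–19", 18, 56),
    ("17–18", 17, 61), ("16–17", 16, 42), ("16", 16, 55),
    ("15–16", 15, 61), ("14–15", 14, 47), ("13–14", 13, 33),
    ("13", 13, 35),
]


def mean_time(bands):
    """Mean seconds per record across the given bands."""
    total_seconds = sum(total for _, _, total in bands)
    return total_seconds / (len(bands) * RECORDS_PER_BAND)


# Assumed cut-off for illustration only: relevance score of 20.
high = [band for band in BANDS if band[1] >= 20]
low = [band for band in BANDS if band[1] < 20]

for label, _, total in BANDS:
    print(f"{label}: mean {total / RECORDS_PER_BAND:.1f} s")
print(f"score >= 20: {mean_time(high):.2f} s per record")
print(f"score < 20:  {mean_time(low):.2f} s per record")
```

With the transcribed totals this gives roughly 121 seconds per record for the higher bands and about 10 seconds per record for the lower ones, though with only five records per band the comparison is indicative at best.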