Piotr Przybyła, Austin J Brockmeier, Georgios Kontonatsios, Marie-Annick Le Pogam, John McNaught, Erik von Elm, Kay Nolan, Sophia Ananiadou.
Abstract
Screening references is a time-consuming step necessary for systematic reviews and guideline development. Previous studies have shown that human effort can be reduced by using machine learning software to prioritise large reference collections such that most of the relevant references are identified before screening is completed. We describe and evaluate RobotAnalyst, a Web-based software system that combines text-mining and machine learning algorithms for organising references by their content and actively prioritising them based on a relevancy classification model trained and updated throughout the process. We report an evaluation over 22 reference collections (most are related to public health topics) screened using RobotAnalyst with a total of 43 610 abstract-level decisions. The number of references that needed to be screened to identify 95% of the abstract-level inclusions for the evidence review was reduced on 19 of the 22 collections. Significant gains over random sampling were achieved for all reviews conducted with active prioritisation, as compared with only two of five when prioritisation was not used. RobotAnalyst's descriptive clustering and topic modelling functionalities were also evaluated by public health analysts. Descriptive clustering provided more coherent organisation than topic modelling, and the content of the clusters was apparent to the users across a varying number of clusters. This is the first large-scale study using technology-assisted screening to perform new reviews, and the positive results provide empirical evidence that RobotAnalyst can accelerate the identification of relevant studies. The results also highlight the issue of user complacency and the need for a stopping criterion to realise the work savings.
Year: 2018 PMID: 29956486 PMCID: PMC6175382 DOI: 10.1002/jrsm.1311
Source DB: PubMed Journal: Res Synth Methods ISSN: 1759-2879 Impact factor: 5.273
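The screening loop described in the abstract — a relevancy model retrained on the decisions made so far and used to re-rank the unscreened references — can be illustrated with a minimal sketch. The token-count Naive Bayes scorer and the example abstracts below are hypothetical stand-ins for illustration only, not the classifier RobotAnalyst actually uses:

```python
import math
from collections import Counter

def train(decided):
    """Fit token counts for a toy Naive Bayes relevancy model.
    `decided` is a list of (abstract_text, is_relevant) pairs."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for text, relevant in decided:
        tokens = text.lower().split()
        if relevant:
            pos.update(tokens); n_pos += 1
        else:
            neg.update(tokens); n_neg += 1
    return pos, neg, n_pos, n_neg

def score(text, model):
    """Log-odds of relevance with add-one smoothing."""
    pos, neg, n_pos, n_neg = model
    s = math.log((n_pos + 1) / (n_neg + 1))  # class prior
    for t in text.lower().split():
        s += math.log((pos[t] + 1) / (sum(pos.values()) + 2))
        s -= math.log((neg[t] + 1) / (sum(neg.values()) + 2))
    return s

def prioritise(unscreened, decided):
    """Rank unscreened abstracts, most-likely-relevant first.
    Called again after each batch of decisions, so the ranking
    is updated throughout the screening process."""
    model = train(decided)
    return sorted(unscreened, key=lambda t: score(t, model), reverse=True)

# Hypothetical example data
decided = [("smoking cessation trial adults", True),
           ("hospital admin billing codes", False)]
unscreened = ["randomised trial of smoking cessation",
              "billing software update notes"]
ranked = prioritise(unscreened, decided)
```

In an active-prioritisation workflow the reviewer screens the top-ranked reference, the decision is appended to `decided`, and `prioritise` is re-run, so likely inclusions surface early in the process.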
Figure 1. An information flow diagram of the information processing and user interaction available in RobotAnalyst
Details of reference collections used in the evaluation experiments, including their topical areas (surv denotes surveillance reviews), origin (with relevant NICE guideline if applicable), overall size, and percentage of relevant references (averaged in case of parallel reviews)
| ID | Topic | Origin | Size | Relevant (%) |
|---|---|---|---|---|
| TUB | Tuberculosis | NICE | 4678 | 2.42 |
| BC | Behaviour change: individual approaches | NICE | 1502 | 13.72 |
| BC‐S | Behaviour change: individual approaches (surv) | NICE | 937 | 21.66 |
| BC‐C | Choice architecture in behaviour change (surv) | NICE | 959 | 15.33 |
| WC‐D | Walking and cycling (surv, database search) | NICE | 304 | 27.30 |
| WC‐C | Walking and cycling (surv, citation search) | NICE | 468 | 12.18 |
| WC‐F | Walking and cycling (surv, focused search) | NICE | 86 | 9.30 |
| PAP | Physical activity and pregnancy | NICE | 320 | 11.88 |
| WGP | Weight gain and pregnancy | NICE | 110 | 11.82 |
| PW‐S | Preventing excess weight gain (surv, self‐weighing) | NICE | 157 | 8.28 |
| PW‐E | Preventing excess weight gain (surv, eating patterns) | NICE | 719 | 5.15 |
| WM | Weight management (surv) | NICE | 665 | 29.62 |
| SH | Sexual health | NICE | 3760 | 1.36 |
| QSH | Quality and safety in hospitals | IUMSP | 4964 | 18.63 |
| LD | Learning difficulties | NICE | 2148 | 0.28 |
| OCM | Osteoarthritis: care and management (surv) | NICE | 2986 | 15.00 |
| HB | Hepatitis B: diagnosis and management (surv) | NICE | 1523 | 3.81 |
Guideline: https://www.nice.org.uk/guidance/ng33.
Guideline: https://www.nice.org.uk/guidance/ph49.
Guideline: https://www.nice.org.uk/guidance/ph41.
Guideline: https://www.nice.org.uk/guidance/ng7.
Guideline: https://www.nice.org.uk/guidance/ph47.
Guideline: https://www.nice.org.uk/guidance/ng68.
Guideline: https://www.nice.org.uk/guidance/cg177.
Guideline: https://www.nice.org.uk/guidance/cg165.
Results of the controlled experiments performed on two reference collections, each screened using three procedures in parallel, with performance measured using WSS@95% (work saved over sampling at 95% recall) and AUR (area under the recall curve)
| Collection | WSS@95% | AUR | Procedure |
|---|---|---|---|
| TUB | * 70.74% | 0.9078 | AL only |
| TUB | * 69.67% | 0.9196 | topics + AL |
| TUB | * 11.65% | 0.7699 | topics only |
| BC | * 29.89% | 0.7983 | AL only |
| BC | * 46.53% | 0.8040 | topics + AL |
| BC | −1.80% | 0.4729 | topics only |
Values of WSS@95% that were significantly greater than expected under random sampling (exact test, significance level 0.01) are starred.
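Assuming the standard definition of work saved over sampling (Cohen et al.), WSS@95% is the fraction of references left unscreened once 95% of the abstract-level inclusions have been found, minus the 5% that random ordering would forgo. A minimal sketch:

```python
import math

def wss_at_recall(order, relevant, target=0.95):
    """Work saved over sampling at `target` recall.
    `order` is the sequence of reference IDs in screening order;
    `relevant` is the set of IDs judged relevant at abstract level."""
    n, total_rel = len(order), len(relevant)
    needed = math.ceil(target * total_rel)  # inclusions to find
    found = 0
    for screened, ref in enumerate(order, start=1):
        if ref in relevant:
            found += 1
        if found >= needed:
            # unscreened fraction minus the (1 - target) baseline
            return (n - screened) / n - (1 - target)
    return 0.0
```

For example, if both of a collection's 2 inclusions appear in the first 2 of 20 prioritised references, WSS@95% is (20 − 2)/20 − 0.05 = 0.85; a negative value, as in some rows above, means the ordering did worse than the random-sampling baseline.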
Results of the unconstrained experiments, each involving a junior or senior reviewer screening a collection using all features of the system, with performance measured using the WSS@95% and AUR metrics, grouped by whether relevancy-based (AL-based) prioritisation was used throughout
| Collection | Reviewer | WSS@95% | AUR | AL prioritisation |
|---|---|---|---|---|
| BC‐C | Senior | * 6.89% | 0.7276 | Yes |
| WC‐D | Senior | * 29.54% | 0.8477 | Yes |
| WC‐C | Senior | * 22.35% | 0.7904 | Yes |
| PAP | Senior | * 40.63% | 0.8398 | Yes |
| WGP | Senior | * 36.82% | 0.7893 | Yes |
| PW‐S | Senior | * 63.15% | 0.8285 | Yes |
| PW‐E | Senior | * 38.81% | 0.8369 | Yes |
| WM | Senior | * 23.72% | 0.8374 | Yes |
| SH | Senior | * 66.17% | 0.8858 | Yes |
| QSH | Junior | * 39.84% | 0.8914 | Yes |
| QSH | Senior | * 31.32% | 0.8818 | Yes |
| LD | Senior | * 50.45% | 0.9058 | Yes |
| OCM | Senior | * 63.99% | 0.9377 | Yes |
| BC‐S | Senior | * 9.41% | 0.6519 | No |
| WC‐F | Senior | 8.95% | 0.5244 | No |
| HB | Junior | ‐3.62% | 0.7347 | No |
Values of WSS@95% that were significantly greater than expected under random sampling (exact test, significance level 0.01) are starred.
Figure 2. Cumulative recall curves and median decision times for three screening tasks. The times are smoothed by using the medians within a sliding window of 51 interdecision intervals. A graphical depiction of WSS@95% is shown as the difference between the recall curve and the recall expected under a randomly sampled ordering of the references
Figure 3. Cumulative recall curves for the Tuberculosis collection when using active learning versus topic-based screening at the beginning followed by random sampling for the remainder
Figure 4. Running recall curves for the Hospital care quality collection, computed by comparing the decisions made by the senior and junior reviewer at a given stage of the process to a baseline decision set (see explanation in text)
Descriptive clustering outlier detection accuracy of six reviewers split between two collections (the average accuracy per collection in parentheses)
| Method | Reviewer 1 | Reviewer 2 | Reviewer 3 | (Average) | Reviewer 4 | Reviewer 5 | Reviewer 6 | (Average) |
|---|---|---|---|---|---|---|---|---|
| Spectral clustering | 75% | 69% | 92% | (78.67%) | 49% | 83% | 69% | (67%) |
| Topic modelling | 63% | 15% | 75% | (51%) | 30% | 58% | 38% | (42%) |
Reviewers used any apparent coherence of the references and the description to choose an outlier reference. Accuracy is computed for 100 tasks with a chance rate of 20%.
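Because each outlier task is a choice among five candidates, guessing succeeds 20% of the time, and the significance of an observed accuracy over 100 tasks can be checked with an exact binomial tail. A small sketch (the 49% figure is the lowest reviewer accuracy in the table above, used here purely as an example):

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the probability of
    scoring k or more correct out of n tasks by chance alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# Even the weakest reviewer (49 correct of 100 five-option tasks)
# is far above the 20% chance rate.
p_value = binom_tail(49, 100, 0.2)
```

With a chance rate of 0.2 (mean 20 correct, standard deviation 4), 49 correct answers lie many standard deviations above chance, so the tail probability is vanishingly small.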