Iain J Marshall, Anna Noel-Storr, Joël Kuiper, James Thomas, Byron C Wallace.
Abstract
Machine learning (ML) algorithms have proven highly accurate for identifying Randomized Controlled Trials (RCTs) but are not used much in practice, in part because the best way to make use of the technology in a typical workflow is unclear. In this work, we evaluate ML models for RCT classification (support vector machines, convolutional neural networks, and ensemble approaches). We trained and optimized support vector machine and convolutional neural network models on the titles and abstracts of the Cochrane Crowd RCT set. We evaluated the models on an external dataset (Clinical Hedges), allowing direct comparison with traditional database search filters. We estimated area under receiver operating characteristics (AUROC) using the Clinical Hedges dataset. We demonstrate that ML approaches better discriminate between RCTs and non-RCTs than widely used traditional database search filters at all sensitivity levels; our best-performing model also achieved the best results to date for ML in this task (AUROC 0.987, 95% CI, 0.984-0.989). We provide practical guidance on the role of ML in (1) systematic reviews (high-sensitivity strategies) and (2) rapid reviews and clinical question answering (high-precision strategies) together with recommended probability cutoffs for each use case. Finally, we provide open-source software to enable these approaches to be used in practice.
Year: 2018 PMID: 29314757 PMCID: PMC6030513 DOI: 10.1002/jrsm.1287
Source DB: PubMed Journal: Res Synth Methods ISSN: 1759-2879 Impact factor: 5.273
Figure 1 Tree diagram: The false positive burden associated with using a high‐sensitivity search compounded by RCTs being a minority class. Illustrative figures, assuming that 1.6% of all articles are RCTs (based on PubMed search; approximately 423 000 in total), and a search filter with 98.4% sensitivity and 77.9% specificity (the performance of the Cochrane HSSS based on data from McKibbon et al9). The 2 blue shaded boxes together represent the search retrieval. The search filter thus retrieves a total of 6 201 349 articles, of which only 416 232 (or 6.7%) are actually RCTs (being the precision statistic) [Colour figure can be viewed at wileyonlinelibrary.com]
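The arithmetic behind the Figure 1 tree diagram can be sketched directly. This is an illustrative reconstruction, not the authors' code; the total article count is an assumption chosen so that 1.6% prevalence yields roughly 423 000 RCTs, so the derived figures are approximate rather than an exact match for the caption.

```python
# Illustrative arithmetic from the Figure 1 caption: a high-sensitivity filter
# still yields low precision because RCTs are a minority class.
total_articles = 26_000_000   # ASSUMED PubMed size, so 1.6% ~= 423,000 RCTs
prevalence = 0.016
sensitivity = 0.984           # Cochrane HSSS, per McKibbon et al
specificity = 0.779

rcts = total_articles * prevalence
non_rcts = total_articles - rcts

true_positives = sensitivity * rcts                 # RCTs retrieved
false_positives = (1 - specificity) * non_rcts      # non-RCTs retrieved
retrieved = true_positives + false_positives        # total search retrieval

precision = true_positives / retrieved
number_needed_to_screen = 1 / precision             # articles read per RCT found

print(f"retrieved: {retrieved:,.0f}")
print(f"precision: {precision:.1%}")
print(f"number needed to screen: {number_needed_to_screen:.1f}")
```

With these inputs, precision lands near the caption's 6.7%, i.e., roughly 15 articles must be screened for every RCT found.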
Figure 2 Receiver operating characteristic scatterplot for conventional database filters (based on data published by McKibbon et al9), with the 2 comparator strategies from this analysis labeled. RCT PT tag, the single‐term strategy based on the manually applied PT tag (the high‐precision comparator); Cochrane HSSS, the Cochrane Highly Sensitive Search Strategy (the high‐sensitivity comparator) [Colour figure can be viewed at wileyonlinelibrary.com]
Figure 3 The Cochrane Crowd/EMBASE project pipeline. Source articles (titles and abstracts) are identified via a sensitive database search filter. Articles already tagged as being RCTs (via Emtree PT tag) are sent directly to CENTRAL. Articles predicted to have <10% probability of being RCTs via an SVM classifier are directly excluded. The remaining articles are considered by the crowd
Figure 4 Schematic illustrating the separating plane in support vector machines, here depicted in 2 dimensions. The separating plane (a straight line in this two‐dimensional case) is depicted as the black line, and the margin is depicted in gray. The instances nearest to the margin (the support vectors) are highlighted in white
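The family of model Figure 4 depicts can be demonstrated in a few lines. This is a minimal toy sketch using scikit-learn, not the authors' trained pipeline; the example documents and labels are invented for illustration.

```python
# Minimal sketch (not the authors' pipeline): a linear SVM over n-gram
# features, the same model family as Figure 4 depicts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = [
    "A randomized controlled trial of drug A versus placebo",
    "Patients were randomly assigned to intervention or control",
    "A retrospective cohort study of outcomes after surgery",
    "Case report: an unusual presentation of disease X",
]
labels = [1, 1, 0, 0]  # 1 = RCT, 0 = non-RCT

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(docs, labels)

# decision_function gives the signed distance from the separating plane;
# its sign indicates which side of the margin a new abstract falls on.
score = clf.decision_function(["Participants were enrolled in two arms"])[0]
print("predicted RCT" if score > 0 else "predicted non-RCT")
```

On real data the score would be calibrated to a probability so that the cutoffs discussed below can be applied.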
Figure 5 Schematic illustrating convolutional neural network architecture for text classification. Here, y_i is the label (RCT or not) for document i, w is a weight vector associated with the classification layer, and x_i is the vector representation of document i induced by the CNN
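The forward pass Figure 5 schematizes can be written out in plain NumPy. This is a didactic sketch with random, untrained weights, assumed dimensions, and a tiny invented vocabulary; it shows the shape of the computation (embed, convolve over word windows, max-pool, score with w), not the authors' architecture or hyperparameters.

```python
# Schematic forward pass of a text-classification CNN (untrained, random
# weights): embed tokens, convolve filters over windows of adjacent words,
# max-pool over time, then score the document vector x_i with weights w.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"patients": 0, "were": 1, "randomly": 2, "assigned": 3, "to": 4, "groups": 5}
d, n_filters, width = 8, 4, 3                 # ASSUMED embedding dim, filters, window

E = rng.normal(size=(len(vocab), d))          # embedding matrix
F = rng.normal(size=(n_filters, width * d))   # convolutional filters
w = rng.normal(size=n_filters)                # classification-layer weights

tokens = "patients were randomly assigned to groups".split()
X = E[[vocab[t] for t in tokens]]             # document as a matrix of word vectors

# Convolution: apply each filter to every window of `width` adjacent words.
windows = np.stack([X[i:i + width].ravel() for i in range(len(tokens) - width + 1)])
feature_maps = np.tanh(windows @ F.T)         # shape: (n_windows, n_filters)

x_i = feature_maps.max(axis=0)                # max-over-time pooling -> document vector
p_rct = 1 / (1 + np.exp(-w @ x_i))            # sigmoid score: probability of RCT
print(round(float(p_rct), 3))
```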
Area under receiver operating characteristics (ROC) curves for the ML strategies evaluated
| Model | Area Under ROC Curve (95% CI) |
|---|---|
| SVM | 0.975 (0.972‐0.979) |
| CNN | 0.978 (0.974‐0.982) |
| SVM + CNN | 0.982 (0.979‐0.985) |
| **SVM + PT** | **0.987 (0.983‐0.989)** |
| CNN + PT | 0.984 (0.980‐0.988) |
Bold text signifies best performing model.
Figure 6 Receiver operating characteristics of the machine learning algorithms trained on plain text alone: (1) support vector machine and (2) convolutional neural network, each shown both as a single model and as the bagged result of 10 models (each trained on all RCTs and a different random sample of non‐RCTs). The points depict the 3 conventional database filters, which use plain text only and do not require use of MeSH/PT tags. The blue shaded area on the left of the figure is enlarged at the bottom right
Figure 7 Left: Receiver operating characteristics curve (zoomed to accentuate variance), showing the effect of balanced sampling: the individual models are depicted in light blue; the magenta curve depicts the performance of the consensus classification (the mean probability of being an RCT across the component models). Right: Cumulative performance (area under receiver operating characteristics curve) of bagging multiple models trained on balanced samples. Performance increases until approximately 6 models are included, then plateaus
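The balanced-bagging scheme Figure 7 describes can be sketched as follows. This is an illustrative reconstruction on synthetic data, not the authors' code; logistic regression stands in for the SVM/CNN component models, and all sizes and data are invented.

```python
# Sketch of the balanced-bagging scheme in Figure 7: each component model
# trains on all RCTs plus a different random sample of non-RCTs; the
# consensus score is the mean predicted probability across models.
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for SVM/CNN

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))                      # synthetic features
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 1.2).astype(int)  # minority "RCT" class

rct_idx = np.flatnonzero(y == 1)
non_idx = np.flatnonzero(y == 0)

def consensus_proba(X_new, n_models=10):
    scores = []
    for _ in range(n_models):
        sample = rng.choice(non_idx, size=len(rct_idx), replace=False)
        idx = np.concatenate([rct_idx, sample])      # balanced training set
        model = LogisticRegression().fit(X[idx], y[idx])
        scores.append(model.predict_proba(X_new)[:, 1])
    return np.mean(scores, axis=0)                   # mean probability = consensus

p = consensus_proba(X[:5])
print(p.round(3))
```

Per the right panel of Figure 7, the gain from adding component models saturates at around 6 models.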
Figure 8 Receiver operating characteristics curve: hybrid/ensembled models including use of the manually applied PT tag. The area bounded by the blue shaded region on the left‐hand plot is enlarged on the right to illustrate differences between the models and conventional database filters. Note that the RCT PT tag has become more sensitive between 2009 (McKibbon et al9) and 2017 (the reanalysis conducted here), reflecting the retrospective application of the tag to missed RCTs, including through data provided to PubMed by the Cochrane Collaboration11
Performance on highly sensitive (systematic review) task, with comparison to conventional database filters
| Model | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Precision, % (95% CI) | Number Needed to Screen |
|---|---|---|---|---|
| SVM | 98.5 (97.8‐99.0) | 71.7 (71.3‐72.1) | 10.4 (9.9‐10.9) | 9.6 |
| CNN | 98.5 (97.8‐99.0) | 61.2 (60.7‐61.6) | 7.8 (7.5‐8.2) | 12.8 |
| SVM + CNN | 98.5 (97.8‐99.0) | 68.8 (68.4‐69.3) | 9.6 (9.1‐10.0) | 10.4 |
| CNN + PT | 98.5 (97.8‐99.0) | 82.1 (81.7‐82.4) | 15.5 (14.8‐16.2) | 6.5 |
| **SVM + CNN + PT** | **98.5 (97.8‐99.0)** | **84.0 (83.6‐84.3)** | **17.1 (16.3‐17.8)** | **5.8** |
| Cochrane HSSS | 98.5 (97.8‐99.0) | 76.9 (76.6‐77.3) | 12.5 (11.9‐13.1) | 8.0 |
Abbreviation: Cochrane HSSS, the Cochrane Highly Sensitive Search Strategy.
For the machine learning approaches, a predictive cutoff was chosen to achieve a fixed sensitivity of 98.5% (matching the Cochrane HSSS); better‐performing classifiers will achieve better specificity (ie, retrieve fewer non‐RCTs) at this sensitivity level. Bold text signifies best performing model.
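One way the fixed-sensitivity cutoff described in the table note might be chosen is sketched below. This is an assumption-laden illustration on synthetic scores, not the authors' procedure: the idea is to place the threshold at the quantile of RCT scores that still captures the target fraction, then read off the resulting specificity.

```python
# Choosing a probability cutoff for a high-sensitivity (systematic review)
# workflow: the highest threshold that still achieves the target sensitivity.
# `scores` and `labels` are synthetic stand-ins for classifier output.
import numpy as np

def cutoff_for_sensitivity(scores, labels, target=0.985):
    """Return the largest threshold whose sensitivity >= target."""
    pos_scores = np.sort(scores[labels == 1])
    # Sensitivity >= target means at most (1 - target) of RCTs may fall
    # below the threshold, so it must sit at or below that quantile.
    k = int(np.floor((1 - target) * len(pos_scores)))
    return pos_scores[k]

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
scores = np.clip(rng.normal(0.3 + 0.4 * labels, 0.2), 0, 1)  # toy scores

t = cutoff_for_sensitivity(scores, labels, target=0.985)
sens = (scores[labels == 1] >= t).mean()
spec = (scores[labels == 0] < t).mean()
print(f"cutoff={t:.3f}  sensitivity={sens:.3f}  specificity={spec:.3f}")
```

The high-precision task in the next table is the mirror image: fix specificity (here 97.5%) via the quantile of non-RCT scores and read off sensitivity.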
Performance on highly specific search task, with comparison to conventional database filters
| Model | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Precision, % (95% CI) | Number Needed to Screen |
|---|---|---|---|---|
| SVM | 82.3 (80.3‐84.1) | 97.5 (97.4‐97.7) | 52.5 (50.5‐54.5) | 1.9 |
| CNN | 93.4 (92.0‐94.6) | 97.5 (97.3‐97.6) | 55.4 (53.5‐57.3) | 1.8 |
| SVM + CNN | 93.3 (91.9‐94.4) | 97.5 (97.4‐97.6) | 55.6 (53.6‐57.5) | 1.8 |
| **SVM + PT** | **95.1 (94.0‐96.2)** | **97.5 (97.3‐97.6)** | **55.8 (53.9‐57.7)** | **1.8** |
| SVM + CNN + PT | 95.1 (94.0‐96.2) | 97.5 (97.3‐97.6) | 55.8 (53.9‐57.7) | 1.8 |
| PubMed PT | 94.8 (93.6‐95.9) | 97.5 (97.3‐97.6) | 55.7 (53.8‐57.5) | 1.8 |
Abbreviation: PubMed PT, Randomized Controlled Trial publication‐type tag as a single search term.
For the machine learning approaches, a predictive cutoff was chosen to achieve a fixed specificity of 97.5%; better‐performing classifiers will achieve better sensitivity (ie, miss fewer RCTs) at this specificity level. Bold text signifies best performing model.
Figure 9 PubMed PT information is used where present; where absent, the best‐performing text‐alone approach is automatically used, with a modest reduction in accuracy
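The decision logic Figure 9 describes amounts to a simple fallback. The sketch below is an illustration of that logic only: the record structure, `pt_tags` field, and `classify_text` function are hypothetical stand-ins, not the authors' implementation.

```python
# The hybrid strategy of Figure 9: trust the manually applied PT tag when
# one exists; otherwise fall back to the text-only classifier score.
# `classify_text` and the record fields are HYPOTHETICAL stand-ins.
def is_rct(record, classify_text, cutoff=0.5):
    """record: dict with optional 'pt_tags' list plus 'title'/'abstract' text."""
    tags = record.get("pt_tags")
    if tags is not None:  # PT information present: use it directly
        return "Randomized Controlled Trial" in tags
    # No PT tags (eg, article not yet indexed): use text alone, accepting
    # the modest reduction in accuracy the caption notes.
    return classify_text(record["title"] + " " + record["abstract"]) >= cutoff

# Toy classifier standing in for the trained text model.
toy_model = lambda text: 0.9 if "randomized" in text.lower() else 0.1

tagged = {"pt_tags": ["Randomized Controlled Trial"], "title": "", "abstract": ""}
untagged = {"title": "A randomized trial", "abstract": "Two arms were compared."}
print(is_rct(tagged, toy_model), is_rct(untagged, toy_model))
```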
Search space for CNN hyperparameters

| Hyperparameter | Search Space |
|---|---|
| Class weighting | RCTs: 1.0 to 30.0; non‐RCTs: 1.0 |
Search space for SVM hyperparameters

| Hyperparameter | Search Space |
|---|---|
| Class weighting | RCTs: 1.0 to 30.0; non‐RCTs: 1.0 |
Separate n‐gram features were compiled for words appearing in the title and abstract. We hypothesized that certain words appearing in the title (for example, randomized) might have more importance than if they appeared in the abstract. A model that was able to weight these differently might therefore perform better.
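One way to realize separate title and abstract features, shown here as an assumption rather than the authors' exact implementation, is to vectorize the two fields independently and concatenate, so that "randomized" in a title receives its own learned weight distinct from "randomized" in an abstract.

```python
# Sketch: separate n-gram features for title vs abstract via two
# independent vectorizers whose outputs are concatenated (an assumed
# realization of the idea, not the authors' exact code).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = pd.DataFrame({
    "title": ["A randomized trial of drug A", "A cohort study of smoking"],
    "abstract": ["Patients were randomly assigned to arms.",
                 "We followed 1200 adults for ten years."],
})
labels = [1, 0]  # 1 = RCT, 0 = non-RCT

features = ColumnTransformer([
    ("title_ngrams", TfidfVectorizer(ngram_range=(1, 2)), "title"),
    ("abstract_ngrams", TfidfVectorizer(ngram_range=(1, 2)), "abstract"),
])
clf = make_pipeline(features, LinearSVC()).fit(docs, labels)
print(clf.predict(docs))
```

The SVM can then assign a larger weight to a title occurrence of a term than to the same term in the abstract, which is exactly the flexibility the hypothesis calls for.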