Aaron M Cohen, Neil R Smalheiser, Marian S McDonagh, Clement Yu, Clive E Adams, John M Davis, Philip S Yu.
Abstract
OBJECTIVE: For many literature review tasks, including systematic review (SR) and other aspects of evidence-based medicine, it is important to know whether an article describes a randomized controlled trial (RCT). Current manual annotation is not complete or flexible enough for the SR process. In this work, highly accurate machine learning predictive models were built that include confidence predictions of whether an article is an RCT.
Keywords: Evidence-Based Medicine; Information Retrieval; Natural Language Processing; Randomized Controlled Trials as Topic; Support Vector Machines; Systematic Reviews
Year: 2015 PMID: 25656516 PMCID: PMC4457112 DOI: 10.1093/jamia/ocu025
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497
Performance of the main classifier using the CITATION_PLUS_MESH_MODEL, showing 5× 2-way cross-validation performance on the entire training dataset, as well as performance on the held-out testing sets corresponding to human-related articles published in the years 2011 and 2012
| CITATION_PLUS_MESH_MODEL | 1987–2010 cross-validation | 2011 | 2012 |
|---|---|---|---|
| AUC | 0.976 | 0.973 (0.9714, 0.9746) | 0.972 (0.9704, 0.9736) |
| AVERAGE_PRECISION | 0.877 | 0.873 | 0.870 |
| F1 | 0.820 | 0.822 | 0.824 |
| | 0.985 | 0.985 | 0.985 |
| MSE | */0.048 | 0.013/0.045 | 0.013/0.044 |
AUC = area under the receiver operating characteristic curve; AVERAGE_PRECISION = average precision at RCT positive rankings; F1 = balanced F-measure (harmonic mean of precision and recall); MSE = mean squared error of the confidence predictions, reported as with/without use of the enhanced Rüping confidence estimation method. The confidence estimation method was not used for the cross-validation model selection runs because of its increased run-time, so that entry is marked *. The 95% confidence intervals for AUC on the 2011 and 2012 data sets are shown in parentheses next to these values.
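The four metrics defined in the note above can be illustrated on a toy ranking. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and MSE is taken here to be the mean squared difference between each confidence prediction and the 0/1 RCT label.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision-at-k over every rank k that holds a positive item."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def f1_score(labels: Sequence[int], predicted: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_squared_error(labels: Sequence[int], confidences: Sequence[float]) -> float:
    """Mean squared difference between confidence and the 0/1 label."""
    return sum((c - y) ** 2 for y, c in zip(labels, confidences)) / len(labels)


if __name__ == "__main__":
    # Hypothetical toy data: 1 = RCT, 0 = not an RCT.
    toy_labels = [1, 0, 1, 1, 0, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
    toy_predicted = [1 if s >= 0.5 else 0 for s in toy_scores]  # assumed threshold
    print("AUC:", auc(toy_labels, toy_scores))
    print("AVERAGE_PRECISION:", average_precision(toy_labels, toy_scores))
    print("F1:", f1_score(toy_labels, toy_predicted))
    print("MSE:", mean_squared_error(toy_labels, toy_scores))
```

AUC and AVERAGE_PRECISION depend only on the ranking of the scores, whereas F1 depends on a chosen decision threshold and MSE on how well the confidence values themselves are calibrated.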
Performance of the final classifier using the CITATION_ONLY_MODEL, showing 5× 2-way cross-validation performance on the entire training dataset, as well as performance on the held-out testing sets corresponding to human-related articles published in the years 2011 and 2012
| CITATION_ONLY_MODEL | 1987–2010 cross-validation | 2011 | 2012 |
|---|---|---|---|
| AUC | 0.969 | 0.966 (0.9642, 0.9678) | 0.965 (0.9632, 0.9668) |
| AVERAGE_PRECISION | 0.855 | 0.854 | 0.852 |
| F1 | 0.800 | 0.807 | 0.811 |
| | 0.984 | 0.984 | 0.984 |
| MSE | */0.052 | 0.014/0.048 | 0.014/0.048 |
AUC = area under the receiver operating characteristic curve; AVERAGE_PRECISION = average precision at RCT positive rankings; F1 = balanced F-measure (harmonic mean of precision and recall); MSE = mean squared error of the confidence predictions, reported as with/without use of the enhanced Rüping confidence estimation method. The confidence estimation method was not used for the cross-validation model selection runs because of its increased run-time, so that entry is marked *. The 95% confidence intervals for AUC on the 2011 and 2012 data sets are shown in parentheses next to these values.
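The "5× 2-way cross-validation" named in the table captions can be sketched as follows, assuming it means five repetitions of a random two-fold split in which each half serves once for training and once for testing. That reading, the function name `five_by_two_splits`, and the unstratified shuffling are assumptions; the record does not describe how the authors built their folds.

```python
import random
from typing import Iterator, List, Tuple


def five_by_two_splits(
    n_items: int, repeats: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs: `repeats` random halvings,
    each used in both directions, giving 2 * repeats pairs in total."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    half = n_items // 2
    for _ in range(repeats):
        rng.shuffle(indices)
        first = sorted(indices[:half])
        second = sorted(indices[half:])
        yield first, second  # train on the first half, test on the second
        yield second, first  # then swap the roles of the two halves


if __name__ == "__main__":
    for train, test in five_by_two_splits(10):
        print("train:", train, "test:", test)
```

A metric such as AUC would be computed on each of the ten test halves and then averaged to give a single cross-validation figure.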
Performance comparisons between several alternate modeling approaches and the final classifier models
| MODELING APPROACH | ||||
|---|---|---|---|---|
| MODEL | FORWARD SELECTION | CHISQ FEATURE FILTERING | POSTHOC DROP LOW WEIGHTS | FORWARD SELECTION PLUS PUBTYPES |
| AUC | ||||
| CITATION_ONLY_MODEL | 0.966 | 0.948 | 0.967 | 0.969 |
| CITATION_PLUS_MESH_MODEL | 0.973 | 0.959 | 0.973 | 0.974 |
| AVERAGE_PRECISION | ||||
| CITATION_ONLY_MODEL | 0.854 | 0.781 | 0.853 | 0.840 |
| CITATION_PLUS_MESH_MODEL | 0.873 | 0.826 | 0.873 | 0.856 |
| F1 | ||||
| CITATION_ONLY_MODEL | 0.807 | 0.727 | 0.808 | 0.778 |
| CITATION_PLUS_MESH_MODEL | 0.822 | 0.771 | 0.823 | 0.794 |
| MSE | ||||
| CITATION_ONLY_MODEL | 0.014 | 0.020 | 0.014 | 0.015 |
| CITATION_PLUS_MESH_MODEL | 0.013 | 0.016 | 0.013 | 0.014 |
| NUMBER OF FEATURES | ||||
| CITATION_ONLY_MODEL | 44,114,421 | 102,023 | 102,262 | 44,114,572 |
| CITATION_PLUS_MESH_MODEL | 34,636,788 | 113,177 | 110,982 | 34,636,939 |
AUC = area under the receiver operating characteristic curve; AVERAGE_PRECISION = average precision at RCT positive rankings; F1 = balanced F-measure (harmonic mean of precision and recall); MSE = mean squared error of the confidence predictions.
Figure 1: This graph shows the correspondence between the predicted RCT confidence, centered at each 0.10-width range between 0.0 and 1.0, and the prevalence of articles determined to describe RCTs by manual review. Samples were chosen randomly across four searches corresponding to Cochrane topics, and none of the chosen articles were tagged in MEDLINE with the “Randomized Controlled Trial” publication type. The estimated prevalence is slightly below the predicted confidence, likely for two reasons. First, to keep the manual review task modest, both the binning used to group the confidence ranges and the number of samples in each bin are somewhat coarse. Second, and more importantly, the manually reviewed samples do not represent a uniform random sample from MEDLINE: they were specifically chosen to not have the MEDLINE RCT_PT. Since all of these articles had previously been reviewed by MEDLINE annotators and not tagged with this publication type, it is reasonable to expect that they would have a somewhat lower than predicted chance of being RCTs. Still, among the articles with high predicted confidence, a large fraction were designated as RCTs by the reviewer.
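The comparison described in the caption, predicted confidence grouped into 0.10-wide ranges against the observed fraction of RCTs in each range, can be sketched as below. This is a minimal illustration and not the authors' analysis code; the function name `prevalence_by_bin` and the sample values are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def prevalence_by_bin(
    confidences: Sequence[float], is_rct: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, int, Optional[float]]]:
    """Return one (bin_center, sample_count, observed_prevalence) tuple per
    equal-width confidence bin on [0, 1]; prevalence is None for empty bins."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for c, y in zip(confidences, is_rct):
        b = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the top bin
        counts[b] += 1
        positives[b] += y
    result = []
    for b in range(n_bins):
        center = (b + 0.5) / n_bins
        prevalence = positives[b] / counts[b] if counts[b] else None
        result.append((center, counts[b], prevalence))
    return result


if __name__ == "__main__":
    # Hypothetical manually reviewed sample: 1 = judged to be an RCT.
    sample_conf = [0.05, 0.07, 0.55, 0.58, 0.95, 0.97, 0.92]
    sample_rct = [0, 0, 1, 0, 1, 1, 0]
    for center, n, prev in prevalence_by_bin(sample_conf, sample_rct):
        if n:
            print(f"confidence ~{center:.2f}: n={n}, prevalence={prev:.2f}")
```

With well-calibrated confidences, the observed prevalence in each bin would sit close to the bin center; small per-bin counts, as in this toy sample, make the estimate coarse.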