Alison O'Mara-Eves, James Thomas, John McNaught, Makoto Miwa, Sophia Ananiadou.
Abstract
BACKGROUND: The large and growing number of published studies, and their increasing rate of publication, make the task of identifying relevant studies in an unbiased way for inclusion in systematic reviews both complex and time-consuming. Text mining has been offered as a potential solution: by automating some of the screening process, reviewer time can be saved. The evidence base around the use of text mining for screening has not yet been pulled together systematically; this systematic review fills that research gap. Focusing mainly on non-technical issues, the review aims to increase awareness of the potential of these technologies and promote further collaborative research between the computer science and systematic review communities.
Year: 2015 PMID: 25588314 PMCID: PMC4320539 DOI: 10.1186/2046-4053-4-5
Source DB: PubMed Journal: Syst Rev ISSN: 2046-4053
Definitions of performance measures reported in the studies
| Measure | Number of studies | Definition | Formula |
|---|---|---|---|
| Recall (sensitivity) | 22 | Proportion of correctly identified positives amongst all real positives | $\frac{TP}{TP + FN}$ |
| Precision | 18 | Proportion of correctly identified positives amongst all items identified as positive | $\frac{TP}{TP + FP}$ |
| F measure | 10 | Combines precision and recall. Values of $\beta < 1$ give more weight to precision, whilst values of $\beta > 1$ give more weight to recall | $F_\beta = \frac{(\beta^2 + 1)\,TP}{(\beta^2 + 1)\,TP + FP + \beta^2\,FN}$ |
| ROC (AUC) | 10 | Area under the curve traced out by graphing the true positive rate against the false positive rate. 1.0 is a perfect score and 0.50 is equivalent to a random ordering | |
| Accuracy | 8 | Proportion of agreements to total number of documents | $\frac{TP + TN}{TP + FP + FN + TN}$ |
| Work saved over sampling | 8 | The percentage of papers that the reviewers do not have to read because they have been screened out by the classifier | |
| Time | 7 | Time taken to screen (usually in minutes) | |
| Burden | 4 | The fraction of the total number of items that a human must screen (active learning) | |
| Yield | 3 | The fraction of items that are identified by a given screening approach (active learning) | |
| Utility | 5 | Relative measure of burden and yield that takes into account reviewer preferences for weighting these two concepts (active learning) | |
| Baseline inclusion rate | 2 | The proportion of includes in a random sample of items before prioritisation or classification takes place. The number to be screened is determined using a power calculation | |
| Performance (efficiency)^a | 2 | Number of relevant items selected divided by the time spent screening, where relevant items were those marked as included by two or more people | |
| Specificity | 2 | The proportion of correctly identified negatives (excludes) out of the total number of negatives | $\frac{TN}{TN + FP}$ |
| True positives | 2 | The number of correctly identified positives (includes) | $TP$ |
| False negatives | 1 | The number of incorrectly identified negatives (excludes) | $FN$ |
| Coverage | 1 | The ratio of positives in the data pool that are annotated during active learning | |
| Unit cost | 1 | Expected time to label an item multiplied by the unit cost of the labeler (salary per unit of time), as calculated from their (known or estimated) salary | $\text{time}_{\text{expected}} \times \text{cost}_{\text{unit}}$ |
| Classification error | 1 | Proportion of disagreements to total number of documents | $100\,\% - \text{accuracy}\,\%$ |
| Error | 1 | Total number of falsely classified items divided by the total number of items | $\frac{FP + FN}{TP + FP + FN + TN}$ |
| Absolute screening reduction | 1 | Number of items excluded by the classifier that do not need to be manually screened | $TN + FN$ |
| Prioritised inclusion rate | 1 | The proportion of includes out of the total number screened, after prioritisation or classification takes place | |
TP = true positives, TN = true negatives, FP = false positives, FN = false negatives.
^a Performance is the term used by Felizardo [13], whilst efficiency was used by Malheiros [14].
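The confusion-matrix measures defined in the table can be illustrated with a minimal Python sketch. The class name `Confusion`, its method names, and the example counts are hypothetical and chosen only for illustration. The F measure is written in the standard $F_\beta$ form using TP, FP and FN, which is an assumption about the form used in the included studies. The sketch does not guard against empty denominators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Screening outcomes, with includes as positives and excludes as negatives."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    def recall(self) -> float:
        # Correctly identified positives amongst all real positives.
        return self.tp / (self.tp + self.fn)

    def precision(self) -> float:
        # Correctly identified positives amongst all items flagged as positive.
        return self.tp / (self.tp + self.fp)

    def specificity(self) -> float:
        # Correctly identified negatives out of all real negatives.
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        # Agreements divided by the total number of documents.
        return (self.tp + self.tn) / self.total

    def error(self) -> float:
        # Falsely classified items divided by the total number of items.
        return (self.fp + self.fn) / self.total

    def f_beta(self, beta: float = 1.0) -> float:
        # beta > 1 weights recall more heavily, beta < 1 weights precision.
        b2 = beta * beta
        return (1 + b2) * self.tp / ((1 + b2) * self.tp + b2 * self.fn + self.fp)

    def absolute_screening_reduction(self) -> int:
        # Items excluded by the classifier, so not screened manually.
        return self.tn + self.fn


# Hypothetical counts for a review of 1,000 citations.
c = Confusion(tp=90, tn=700, fp=200, fn=10)
print(round(c.recall(), 3), round(c.precision(), 3), round(c.f_beta(2.0), 3))
```

With these made-up counts, recall is high while precision is low, a pattern that is typical when a screening tool is tuned to avoid missing relevant studies.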
Not used in the included studies, though worthy of note, is the ‘G-mean’. This is the geometric mean of sensitivity and specificity, and it is often used as an alternative metric to the F score when evaluating classification on imbalanced datasets. G-mean evaluates classification performance on predicted labels, whilst AUC evaluates classification performance on predicted scores. Note that these metrics alone do not always reflect the goal in systematic reviews [15].
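The distinction between label-based and score-based evaluation can be sketched as follows. The function names `g_mean` and `auc`, and the example data, are hypothetical. The AUC here is computed through its rank-based interpretation (the probability that a randomly chosen include receives a higher score than a randomly chosen exclude, with ties counted as one half), which is one common way to obtain the area under the ROC curve and not necessarily how any included study computed it.

```python
import math
from typing import Sequence


def g_mean(tp: int, tn: int, fp: int, fn: int) -> float:
    """Geometric mean of sensitivity and specificity, computed from labels."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC, computed from classifier scores (1 = include, 0 = exclude)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count include/exclude pairs ranked correctly; a tie counts as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts and classifier scores.
print(round(g_mean(tp=90, tn=700, fp=200, fn=10), 3))
print(round(auc([0.9, 0.8, 0.4, 0.3, 0.2], [1, 0, 1, 0, 0]), 3))
```

G-mean needs a decision threshold to have been applied already, whereas AUC depends only on how the scores order the items. This is why the two can rank the same classifier differently.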
Figure 1. Brief timeline of developments in the use of text mining technologies for reducing screening burden in systematic reviews.
The number of studies implicitly or explicitly addressing screening workload problems (n = 44)
| Workload reduction approach | Number of studies |
|---|---|
| Reducing number needed to screen | 30 |
| Text mining as a second screener | 6 |
| Increasing the rate (speed) of screening | 7 |
| Improving workflow through screening prioritisation | 12 |
Note. Some studies adopted more than one approach to workload reduction, so column total is greater than 44 studies.
Cross tabulation showing the number of studies employing certain research designs by the aspects of text mining that were compared (n = 44)
| What aspect of text mining was compared | Retrospective simulation | Prospective—case study | Prospective—controlled trial | Prospective—other | Total—what was compared |
|---|---|---|---|---|---|
| Classifiers/algorithms | 13 | 0 | 0 | 3 | 16 |
| Number of features | 2 | 0 | 0 | 0 | 2 |
| Feature extraction/sets (e.g., BoW) | 8 | 0 | 0 | 2 | 10 |
| Views (e.g., T&A, MeSH) | 5 | 0 | 0 | 1 | 6 |
| Training set size | 2 | 0 | 0 | 0 | 2 |
| Kernels | 2 | 0 | 0 | 0 | 2 |
| Topic specific versus general training data | 3 | 0 | 0 | 1 | 4 |
| Other optimisations | 9 | 0 | 0 | 4 | 13 |
| No comparison | 5 | 5 | 4 | 1 | 15 |
| Total—study design (duplicates removed) | (27) | (5) | (4) | (8) | (44) |
Note. Many studies compared more than one aspect of text mining, so the column total for ‘Total—what was compared’ is greater than 44. The row ‘Total—study design (duplicates removed)’ shows the number of studies of each design type rather than the column totals, because the column totals would count the same study more than once where it compared multiple aspects of text mining technologies.