Allard J van Altena, René Spijker, Mariska M G Leeflang, Sílvia Delgado Olabarriaga.
Abstract
When performing a systematic review, researchers screen the articles retrieved after a broad search strategy one by one, which is time-consuming. Computerised support of this screening process has been applied with varying success. This is partly due to the dependency on large amounts of data to develop models that predict inclusion. In this paper, we present an approach to choose which data to use in model training and compare it with established approaches. We used a dataset of 50 Cochrane diagnostic test accuracy reviews, and each was used as a target review. From the remaining 49 reviews, we selected those that most closely resembled the target review's clinical topic using the cosine similarity metric. Included and excluded studies from these selected reviews were then used to develop our prediction models. The performance of models trained on the selected reviews was compared against models trained on studies from all available reviews. The prediction models performed best with a larger number of reviews in the training set and on target reviews that had a research subject similar to other reviews in the dataset. Our approach using cosine similarity may reduce computational costs for model training and the duration of the screening process.
Keywords: computerised support; cosine similarity; machine learning; screening automation; training sample selection
Year: 2021 PMID: 34390193 PMCID: PMC9292892 DOI: 10.1002/jrsm.1518
Source DB: PubMed Journal: Res Synth Methods ISSN: 1759-2879 Impact factor: 9.308
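The review-selection step described in the abstract (ranking the other reviews by cosine similarity to the target review and keeping the closest ones) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the text representation (plain term counts over a whitespace tokenisation), the function names, and the toy review texts are all assumptions made for the example.

```python
import math
from collections import Counter


def term_counts(text):
    """Bag-of-words term counts for a lower-cased, whitespace-tokenised text."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar_reviews(target_id, review_texts, k):
    """Return the ids of the k reviews most similar to the target review.

    review_texts maps a (hypothetical) review id to a text describing
    the review's clinical topic.
    """
    target_vec = term_counts(review_texts[target_id])
    scored = [
        (cosine_similarity(target_vec, term_counts(text)), review_id)
        for review_id, text in review_texts.items()
        if review_id != target_id
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [review_id for _, review_id in scored[:k]]


# Toy data, invented for illustration only.
toy_reviews = {
    "target": "dementia cognitive test accuracy",
    "a": "dementia cognitive screening test",
    "b": "liver fibrosis ultrasound",
    "c": "alzheimer dementia imaging",
}
selected = most_similar_reviews("target", toy_reviews, k=2)
```

With these toy texts, the two dementia-related reviews are selected and the liver review is left out. The studies from the selected reviews would then form the training set for the target review.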
Document characteristics after cleaning
| Characteristic | Value |
|---|---|
| Number of DTA reviews | 50 |
| Total number of documents | 266,966 |
| Included documents | 4661 |
| # Words per document | 922 [0–9795] |
| # Unique words per document | 70 [9–529] |
| Per review | |
| # Documents | 5339 [64–43,363] |
| # Included documents | 93 [2–619] |
| % Included documents | 4% [ |
| Missing abstracts | |
| # All documents | 45,033 (17%) |
| # Included documents | 359 (7%) |
Note: Ranges are given as mean [minimum‐maximum].
Reasons for missing abstract
| | All | Inclusions |
|---|---|---|
| Foreign language | 16,075 | 81 |
| Before 1975 | 14,721 | 24 |
| Not journal article | 23,368 | 142 |
Note: Three major characteristics were found: (1) the document was written in a foreign language and not available in English, (2) the document was published before (approximately) 1975 and was not digitally available, and (3) the document was not a primary research publication (e.g., comment or case report). The characteristics overlap, as a document might both be written in a foreign language and be published before 1975.
Review groups according to disease (target condition)
| Group | # Reviews | ICD‐10 | Disease |
|---|---|---|---|
| 1 | 2 | A | Tuberculosis |
| 2 | 4 | B | Parasitic |
| 3 | 8 | C | Cancer |
| 4 | 12 | G and F | Dementia and Alzheimer |
| 5 | 4 | K | Liver |
| 6 | 5 | M | Musculoskeletal system |
| 7 | 3 | Q | Down syndrome |
| 8 (other) | 12 | ‐ | Various |
Metadata collected for each review [Colour table can be viewed at wileyonlinelibrary.com]
| Identifier | # docs. | # incl. | ICD‐10 | Secondary ICD‐10 | Disease group |
|---|---|---|---|---|---|
| CD007394 | 2545 | 95 | B44.0 | | 2 |
| CD007427 | 1521 | 123 | M75.4 | | 6 |
| CD007431 | 2074 | 24 | M54.3 | M54.5 | 6 |
| CD008054 | 3217 | 274 | N87.9 | | Other |
| CD008081 | 970 | 26 | H35.81 | E14.3 | Other |
| CD008643 | 15083 | 11 | S32.001A | M54.5 | 6 |
| CD008686 | 3966 | 7 | M53.9 | M54.5 | 6 |
| CD008691 | 1316 | 73 | I25.10 | Z94 | Other |
| CD008760 | 64 | 12 | I85 | | Other |
| CD008782 | 10507 | 45 | G30 | F06.7 | 4 |
| CD008803 | 5220 | 99 | H44.51 | | Other |
| CD009020 | 1584 | 162 | M75.101 | M25.5 | 6 |
| CD009135 | 791 | 77 | B55.0 | | 2 |
| CD009185 | 1615 | 92 | N10 | | Other |
| CD009323 | 3881 | 122 | C25.9 | C24.1 | 3 |
| CD009372 | 2248 | 25 | I61.9 | | Other |
| CD009519 | 5971 | 104 | C34.90 | C80 | 3 |
| CD009551 | 1911 | 46 | B44.0 | | 2 |
| CD009579 | 6455 | 138 | B65 | | 2 |
| CD009591 | 7991 | 144 | N80 | | Other |
| CD009593 | 14922 | 78 | A15.3 | U84.9 | 1 |
| CD009647 | 2785 | 56 | E86 | | Other |
| CD009786 | 2065 | 10 | C56 | C80 | 3 |
| CD009925 | 6531 | 460 | Q90.2 | | 7 |
| CD009944 | 1181 | 117 | C16.9 | C80 | 3 |
| CD010023 | 981 | 52 | S92.2 | | Other |
| CD010173 | 5495 | 23 | C06.9 | C80 | 3 |
| CD010276 | 5495 | 54 | C06.9 | C80 | 3 |
| CD010339 | 12807 | 114 | K80 | | 5 |
| CD010386 | 625 | 2 | F03 | F06.7 | 4 |
| CD010409 | 43363 | 76 | C51 | C77.4 | 3 |
| CD010438 | 3250 | 39 | D68.9 | T14.9 | Other |
| CD010542 | 348 | 20 | K70 | | 5 |
| CD010632 | 1504 | 32 | F03 | F06.7 | 4 |
| CD010633 | 1573 | 4 | G31.8 | F02.8 | 4 |
| CD010653 | 8002 | 45 | F20 | | 4 |
| CD010705 | 114 | 23 | A15.3 | U84.9 | 1 |
| CD010771 | 322 | 48 | F03 | | 4 |
| CD010772 | 316 | 47 | F03 | | 4 |
| CD010775 | 241 | 11 | G30 | F03 | 4 |
| CD010783 | 10905 | 30 | G30 | F03 | 4 |
| CD010860 | 94 | 7 | G30 | F03 | 4 |
| CD010896 | 169 | 6 | G31.0 | F03 | 4 |
| CD011134 | 1953 | 215 | C18 | C80 | 3 |
| CD011145 | 10872 | 202 | F03 | | 4 |
| CD011548 | 12708 | 113 | K80 | | 5 |
| CD011549 | 12705 | 2 | K80 | | 5 |
| CD011975 | 8201 | 619 | Q90.2 | | 7 |
| CD011984 | 8192 | 454 | Q90.2 | | 7 |
| CD012019 | 10317 | 3 | N80 | | Other |
Note: Colours are added to highlight disease groups.
FIGURE 1 Overview of workflow for the approaches using different training data: (a) similar data (SIMILAR) and random data (RANDOM), and (b) all data (ALL). Feature extraction was implemented using TF‐IDF (term frequency inverse document frequency). The prediction model was implemented using the Random Forest classifier [Colour figure can be viewed at wileyonlinelibrary.com]
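As a rough illustration of the TF‐IDF feature extraction named in the Figure 1 caption, the sketch below computes the textbook weighting, term frequency multiplied by log(N / document frequency), for a handful of documents. The exact variant used in the study (smoothing, normalisation, tokenisation) is not stated here, so the formula choice and the function name `tfidf_vectors` are assumptions; the Random Forest step is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency),
    the plain textbook form without smoothing (an assumption).
    """
    n_docs = len(documents)
    tokenised = [doc.lower().split() for doc in documents]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenised:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        length = len(tokens)
        # An empty document (e.g., a missing abstract) yields an empty vector.
        vectors.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy documents, invented for illustration only.
vectors = tfidf_vectors(["accuracy ultrasound", "accuracy biopsy", ""])
```

A term that occurs in every document receives weight zero under this variant, while terms specific to one document receive the highest weights. The resulting vectors would be the input features for the classifier.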
FIGURE 2 Boxplot of model performance stratified by the training set size. Performance is shown separately for the RANDOM, SIMILAR, and ALL approaches [Colour figure can be viewed at wileyonlinelibrary.com]
P‐values for SIMILAR versus ALL performance and SIMILAR versus RANDOM performance
| SIMILAR training set size | SIMILAR versus ALL (49 reviews) | SIMILAR versus RANDOM |
|---|---|---|
| 1 | <0.001 | 0.03 |
| 2 | <0.001 | <0.01 |
| 5 | <0.001 | 0.66 |
| 10 | 0.05 | 1.00 |
Note: SIMILAR is stratified by the training set size. is the median WSS@95 performance over all models.
P‐value is significant.
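The tables report performance as WSS@95 (work saved over sampling at 95% recall). A minimal sketch of the commonly used definition follows: rank the documents by predicted score, count how many must be screened to find 95% of the included documents, and subtract the 5% saving that random screening would achieve at the same recall. Whether the study computed the metric in exactly this way is an assumption, and the function name and toy scores are invented for illustration.

```python
import math


def wss_at_recall(scores, labels, recall=0.95):
    """Work saved over sampling at the given recall level.

    scores: predicted inclusion scores, higher means more likely included.
    labels: 1 for included documents, 0 for excluded documents.
    """
    n = len(labels)
    n_included = sum(labels)
    if n == 0 or n_included == 0:
        raise ValueError("need at least one document and one inclusion")

    # Number of included documents that must be found to reach the recall
    # level; the small epsilon guards against floating-point overshoot.
    needed = math.ceil(recall * n_included - 1e-9)

    # Screen documents from the highest to the lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    screened = 0
    for _, label in ranked:
        screened += 1
        found += label
        if found >= needed:
            break

    # Fraction of documents left unscreened, minus the expected saving
    # of random screening at the same recall.
    return (n - screened) / n - (1 - recall)


# Toy example: both included documents are ranked at the top of ten.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_wss = wss_at_recall(toy_scores, toy_labels)
```

In the toy example only 2 of 10 documents need to be screened, so the metric is 0.8 minus 0.05. A model that ranks no better than chance scores close to zero.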
FIGURE 3 Boxplot of SIMILAR performance stratified by the training set size. The results are shown for groups 1–7 and other [Colour figure can be viewed at wileyonlinelibrary.com]
P‐values for other versus disease groups 1–7 performance, both are stratified by the training set size
| Training set size | P‐value (other versus groups 1–7) |
|---|---|
| 1 | 0.009 |
| 2 | <0.001 |
| 5 | 0.005 |
| 10 | 0.002 |
| 49 | 0.038 |
Note: All P‐values are significant. is the median WSS@95 performance over all models.
Pearson correlation between the performance and cosine similarity for each training set size in the SIMILAR approach
| Training set size | Pearson correlation (performance versus cosine similarity) |
|---|---|
| 1 | 0.40 |
| 2 | 0.47 |
| 5 | 0.36 |
| 10 | 0.32 |
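For reference, the Pearson correlation between two paired series (here, per-review performance and cosine similarity) can be computed as in the sketch below. The function name and the input values are hypothetical; the study's own data are not reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sum of cross-products and sums of squares around the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")

    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical paired values, invented for illustration only.
similarity = [0.10, 0.20, 0.30, 0.40]
performance = [0.05, 0.15, 0.10, 0.30]
r = pearson_r(similarity, performance)
```

A coefficient near 1 would indicate that target reviews with more similar training reviews tend to show higher performance; the values in the table (0.32 to 0.47) correspond to a moderate positive association.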