Faith Wavinya Mutinda, Kongmeng Liew, Shuntaro Yada, Shoko Wakamiya, Eiji Aramaki.
Abstract
BACKGROUND: Meta-analyses aggregate results of different clinical studies to assess the effectiveness of a treatment. Despite their importance, meta-analyses are time-consuming and labor-intensive as they involve reading hundreds of research articles and extracting data. The number of research articles is increasing rapidly and most meta-analyses are outdated shortly after publication as new evidence has not been included. Automatic extraction of data from research articles can expedite the meta-analysis process and allow for automatic updates when new results become available. In this study, we propose a system for automatically extracting data from research abstracts and performing statistical analysis.
Keywords: Automatic data extraction; Automatic meta-analysis; Evidence-based medicine; Named entity recognition (NER); Natural language processing (NLP)
Year: 2022 PMID: 35717167 PMCID: PMC9206132 DOI: 10.1186/s12911-022-01897-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 3.298
Fig. 1 Proposed system architecture
Corpus statistics
| Category | Sub-category | # tags |
|---|---|---|
| Participants | Total-participants | 1094 |
| | Intervention-participants | 887 |
| | Control-participants | 784 |
| | Age | 231 |
| | Eligibility | 925 |
| | Ethnicity | 101 |
| | Condition | 327 |
| | Location | 186 |
| Intervention | Intervention | 1067 |
| Control | Control | 979 |
| Outcomes | Outcome | 5053 |
| | Outcome-measure | 1081 |
| | bin-abs-iv | 556 |
| | bin-percent-iv | 1376 |
| | cont-mean-iv | 366 |
| | cont-median-iv | 270 |
| | cont-sd-iv | 129 |
| | cont-q1-iv | 4 |
| | cont-q3-iv | 4 |
| | bin-abs-cv | 465 |
| | bin-percent-cv | 1148 |
| | cont-mean-cv | 327 |
| | cont-median-cv | 247 |
| | cont-sd-cv | 124 |
| | cont-q1-cv | 4 |
| | cont-q3-cv | 4 |
Fig. 2 A sample abstract with PICO elements highlighted. The top part shows the abstract while the bottom part shows the PICO elements transformed into a structured format
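The transformation into a structured format described in this caption can be illustrated with a minimal sketch. It assumes the NER step yields (text, tag) pairs. The `entities` list and the `to_record` helper are hypothetical, the example values are made up, and only the tag names come from the corpus sub-categories.

```python
from collections import defaultdict

# Hypothetical (text, tag) pairs, as an NER model might emit for one abstract.
# Tag names follow the corpus sub-categories; the values are invented.
entities = [
    ("120", "Total-participants"),
    ("drug A", "Intervention"),
    ("placebo", "Control"),
    ("overall response", "Outcome"),
    ("45", "bin-abs-iv"),
    ("30", "bin-abs-cv"),
]


def to_record(spans):
    """Group tagged spans by tag so each sub-category maps to a list of texts."""
    record = defaultdict(list)
    for text, tag in spans:
        record[tag].append(text)
    return dict(record)


record = to_record(entities)
print(record["Intervention"], record["bin-abs-iv"])
```

Keeping a list per tag is a deliberate choice in this sketch: an abstract can mention several outcomes, so each sub-category may hold more than one span.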
Fig. 3 Visualization system interface
NER models results
(a) BioBERT model results

| Sub-category | BioBERT Precision | BioBERT Recall | BioBERT F1 | BioBERT_split Precision | BioBERT_split Recall | BioBERT_split F1 |
|---|---|---|---|---|---|---|
| Total-participants | 0.95 | 0.94 | 0.94 | 0.94 | ||
| Intervention-participants | 0.91 | 0.78 | ||||
| Control-participants | 0.87 | 0.85 | 0.88 | |||
| Age | 0.66 | 0.97 | 0.79 | 0.66 | 0.96 | 0.78 |
| Eligibility | 0.75 | 0.77 | 0.76 | 0.77 | 0.74 | 0.76 |
| Ethnicity | 0.82 | 0.89 | 0.86 | 0.82 | ||
| Condition | 0.86 | 0.84 | 0.75 | 0.79 | ||
| Location | 0.75 | 0.80 | 0.73 | 0.81 | 0.77 | |
| Intervention | 0.85 | 0.82 | 0.84 | 0.85 | 0.82 | 0.84 |
| Control | 0.78 | 0.80 | 0.79 | 0.77 | 0.76 | 0.77 |
| Outcome | 0.82 | 0.81 | 0.81 | 0.84 | 0.80 | 0.82 |
| Outcome-measure | 0.79 | 0.90 | 0.84 | 0.81 | 0.88 | 0.84 |
| bin-abs-iv | 0.75 | 0.78 | 0.77 | 0.81 | 0.78 | 0.79 |
| bin-abs-cv | 0.79 | 0.83 | 0.77 | 0.80 | 0.79 | |
| bin-percent-iv | 0.88 | 0.87 | 0.83 | 0.86 | 0.84 | |
| bin-percent-cv | 0.87 | 0.82 | 0.84 | |||
| cont-mean-iv | 0.78 | 0.83 | 0.80 | 0.86 | 0.83 | |
| cont-mean-cv | 0.86 | 0.81 | 0.84 | 0.83 | ||
| cont-median-iv | 0.80 | 0.75 | ||||
| cont-median-cv | 0.76 | 0.74 | ||||
| cont-sd-iv | 0.68 | 0.79 | 0.80 | 0.85 | 0.82 | |
| cont-sd-cv | 0.76 | 0.84 | 0.80 | 0.72 | 0.85 | 0.78 |
| cont-q1-iv | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| cont-q1-cv | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| cont-q3-iv | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| cont-q3-cv | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Bold texts represent the best score for each sub-category
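As a quick check on how the three columns relate, the sketch below recomputes F1 as the harmonic mean of precision and recall for the Age row (BioBERT: precision 0.66, recall 0.97). The function name `f1_score` is our own. It assumes the table uses the standard F1 definition. Because the reported values are rounded, recomputed scores can differ in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Age row, BioBERT columns: precision 0.66, recall 0.97
age_f1 = f1_score(0.66, 0.97)
print(round(age_f1, 2))  # 0.79, matching the reported F1

# Rows such as cont-q1-iv, where precision and recall are both 0.00
print(f1_score(0.0, 0.0))  # 0.0
```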
Results of selected meta-analysis
| Study | Outcome | Gold Ee | Gold Ne | Gold Ec | Gold Nc | System Ee | System Ne | System Ec | System Nc |
|---|---|---|---|---|---|---|---|---|---|
| Alba et al. [ | Pathological complete response | 14 | 47 | 16 | 46 | 14 | 48 | 16 | 46 |
| Ando et al. [ | | 23 | 37 | 10 | 38 | 23 | 37 | 10 | 38 |
| Gluz et al. [ | | 70 | 154 | 52 | 182 | | | | |
| Loibl et al. [ | | 92 | 160 | 49 | 158 | | 160 | 49 | 158 |
| Sikov et al. [ | | 60 | 110 | 43 | 105 | NA | NA | | |
| Tung et al. [ | | 9 | 40 | 10 | 36 | NA | NA | | |
| Minckwitz et al. [ | | 90 | 158 | 67 | 157 | | | | |
| Wu et al. [ | | 24 | 62 | 8 | 63 | 24 | 62 | 8 | 63 |
| Zhang et al. [ | | 18 | 47 | 6 | 44 | 18 | 47 | 6 | 44 |
| Alba et al. [ | Objective response rate | 36 | 47 | 32 | 46 | 37 | 48 | 32 | 46 |
| Wu et al. [ | | 58 | 62 | 46 | 63 | 58 | 62 | 46 | 63 |
| Zhang et al. [ | | 42 | 47 | 34 | 44 | 42 | 47 | 34 | 44 |
Ee is the number of events in the intervention group, Ne is the number of participants in the intervention group, Ec is the number of events in the control group, and Nc is the number of participants in the control group. NA indicates that the information was not available in the abstract. Marked values fall into two kinds: NER model prediction errors, and values where extra pre-processing was required.
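To illustrate how four counts per study feed a statistical analysis, the sketch below computes a per-study risk ratio and a Mantel-Haenszel fixed-effect pooled risk ratio from the gold values for pathological complete response. The source does not state which pooling method the system uses, so the choice of Mantel-Haenszel is an assumption made for illustration. The function names are our own.

```python
# Gold values for pathological complete response, as (Ee, Ne, Ec, Nc):
# events and participants in the intervention group, then in the control group.
gold_pcr = {
    "Alba": (14, 47, 16, 46),
    "Ando": (23, 37, 10, 38),
    "Gluz": (70, 154, 52, 182),
    "Loibl": (92, 160, 49, 158),
    "Sikov": (60, 110, 43, 105),
    "Tung": (9, 40, 10, 36),
    "Minckwitz": (90, 158, 67, 157),
    "Wu": (24, 62, 8, 63),
    "Zhang": (18, 47, 6, 44),
}


def risk_ratio(ee, ne, ec, nc):
    """Risk in the intervention group divided by risk in the control group."""
    return (ee / ne) / (ec / nc)


def mantel_haenszel_rr(studies):
    """Mantel-Haenszel pooled risk ratio (assumed method, not confirmed by the source)."""
    numerator = sum(ee * nc / (ne + nc) for ee, ne, ec, nc in studies)
    denominator = sum(ec * ne / (ne + nc) for ee, ne, ec, nc in studies)
    return numerator / denominator


ando_rr = risk_ratio(*gold_pcr["Ando"])
pooled_rr = mantel_haenszel_rr(gold_pcr.values())
print(f"Ando RR = {ando_rr:.2f}, pooled RR = {pooled_rr:.2f}")
```

A pooled risk ratio above 1 means the event was more frequent in the intervention groups. Running the same functions on the system-extracted values would show how extraction errors propagate into the pooled estimate.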