Kevin Kallmes, Karl Holub, Nicole Hardy.
Abstract
BACKGROUND: Systematic reviews depend on time-consuming extraction of data from the PDFs of underlying studies. To date, automation efforts have focused on extracting data from the text, and no approach has yet succeeded in fully automating ingestion of quantitative evidence. However, the majority of relevant data is generally presented in tables, and the tabular structure is more amenable to automated extraction than free text.
Keywords: automated data extraction; clinical comparative data; data elements; data reporting conventions; statistic formats; systematic review; table structure
Year: 2021 PMID: 34821562 PMCID: PMC8663462 DOI: 10.2196/33124
Source DB: PubMed Journal: JMIR Form Res ISSN: 2561-326X
Defining tabular concepts.
| Concept | Definition | Example |
| Data element | A characteristic or quality being measured | Mortality |
| Metric | A measured instance of a data element, a descriptive statistic | 2/59 (3.4%) |
| Arm | A subset of an experiment’s participants that are assigned a specific intervention | Placebo group |
| Time point | The point(s) in time in an experiment when measurement of data elements is performed | 6-month follow-up |
| Measurement context | The combination of a data element, arm, and time point in experimental reporting | Mortality in the placebo group at 6 months |
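The concepts defined above can be sketched as a small data model. This is a minimal illustration, not a schema from the article; the class and field names (`MeasurementContext`, `Metric`, `raw`) are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementContext:
    """Combination of data element, arm, and time point (names are illustrative)."""
    data_element: str
    arm: str
    time_point: str


@dataclass(frozen=True)
class Metric:
    """A measured instance of a data element, kept as the raw cell text."""
    context: MeasurementContext
    raw: str


# Values taken from the example column of the table above.
ctx = MeasurementContext("Mortality", "Placebo group", "6-month follow-up")
metric = Metric(ctx, "2/59 (3.4%)")
```

Keeping the metric as raw text leaves the choice of statistic format to a later parsing step.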
Figure 1. Example of a 2×1 context with arms (blue) nested in data elements (red) on rows and time points (green) on the columns. Table from Chellappa et al [17]. ANOVA: analysis of variance.
Figure 2. A 1×1 baseline table reporting data elements on rows and arms on columns. Arm sizes are embedded in intervention headers (red), category labels are reported indented in the data element array (blue), and statistic formats are reported in headers (green). Table from Gauto Benitez et al [18]. HFNC: high flow nasal cannula.
Tagging hierarchy of table structural attributes.
| Tags | Applied when |
| | |
| In table | The article reports baseline characteristics for the study population in a table, broken out by arm |
| Participant level | The article reports baseline characteristics for the study population in a table, reported for each participant |
| No arm-level breakout | The article reports baseline characteristics for the entire study population in a table, with no breakout |
| | |
| In table | The article reports outcomes in a table, including the primary outcome(s), at one or more follow-up time points |
| Participant level | The article reports outcomes in a table, including the primary outcome(s), at one or more follow-up time points, reported for each participant |
| Secondary only, in table | The article reports outcomes in a table, but not all the primary outcomes of the study, instead focusing on data points other than primary outcomes |
| | |
| Rotated 90 degrees | One or more baseline or outcome tables in the article is rotated 90 degrees in either direction on the page but is otherwise normal |
| Multipage | One or more baseline or outcome tables in the article overflows beyond its starting page but is otherwise normal |
Tagging hierarchy of measurement context attributes.
| Tags | Applied when |
| | |
| 1×1 | Only two pieces of context are shown in the table dimensions: one on rows and one on columns (eg, data elements on rows and arms on columns) |
| 2×1 | Three pieces of context are shown in the table dimensions: two on rows and one on columns (eg, arms nested in data elements on rows, time points on columns) |
| 1×2 | Three pieces of context are shown in the table dimensions: one on rows and two on columns |
| | |
| Embedded in arm | Arm sizes are reported as part of the arm or intervention label (eg, “Placebo [n=25]”) |
| Separate array | Arm sizes are reported in a distinct column or row in the table |
| | |
| Full name | The entire name of the intervention(s) for the arm is shown in the arm header |
| Acronym or abbreviation | An acronym or shortened version of the intervention name(s) for the arm is shown in the arm header |
| Control/experimental | The arm header is labeled with “Control” and “Experimental” or “Treatment” or “Intervention” |
| Alternate labels | Any header labeling scheme not identified above is used |
| | |
| Contains unit of time | The time point header contains an amount of time, including units |
| Pre/post | The time point header is labeled “Pre/Post,” “Before/After,” or “Baseline/Follow-up” |
| Incremental numbered | Time point headers are labeled with numbers or letters in order of time (eg, “t1,” “t2”) |
^a Context is tagged as “Embedded” when individual header cells include 2 elements of context (eg, “Baseline BMI”).
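The “Embedded in arm” case (eg, “Placebo [n=25]”) lends itself to a simple pattern match. The sketch below is one possible approach, not the authors' method; the function name `split_arm_header` and the accepted bracket styles are assumptions.

```python
import re
from typing import Optional, Tuple

# Matches a label followed by "[n=25]" or "(N = 25)" at the end of the header.
ARM_SIZE = re.compile(
    r"^(?P<label>.*?)\s*[\[(]\s*[nN]\s*=\s*(?P<n>\d+)\s*[\])]\s*$"
)


def split_arm_header(header: str) -> Tuple[str, Optional[int]]:
    """Split an arm header into (label, arm size); size is None if not embedded."""
    m = ARM_SIZE.match(header)
    if m is None:
        return header.strip(), None
    return m.group("label").strip(), int(m.group("n"))


embedded = split_arm_header("Placebo [n=25]")
plain = split_arm_header("Control")
```

A header with no embedded size falls through unchanged, which corresponds to the “Separate array” case where the size must be read from another row or column.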
Tagging hierarchy of metric attributes.
| Tags | Applied when |
| | |
| In header | The statistic format or just constituents are reported in the header of the array of metrics |
| In description or footnotes | The statistic format or constituents are reported in the description or footnotes; these may apply to the entire table or be annotations for arrays |
| | |
| In header | The units of data elements are reported in each array header |
| In descriptions or footnotes | The units of continuous data elements are reported in the description or footnotes; these may apply to the entire table or be annotations for arrays |
| Not relevant | The article includes no continuous data elements or the continuous data elements are unitless (eg, scale data) |
| | |
| Continuous | The format is used for continuous data elements |
| Dichotomous | The format is used for dichotomous data elements, specifically when only a single category is implied (eg, “Mortality” or “Gender Male”) |
| Categorical | The format is used for categorical data elements; this also applies when a dichotomous data element explicitly lists all categories (eg, “Smoking” has separate arrays for “Yes” and “No”) |
| | |
| Separate array | Category labels are in an entirely separate (delimited) array from the data element header array |
| Same array | Category labels are in the data element header array, with no distinction from other data element labels |
| Same array indented | Category labels are in the data element header array, but are nested under the categorical data element header via white space or list indentation |
| In cell | Categories are all reported in the same cell (eg, “Gender M/F” with metrics “11/9”) |
^a If multiple cases apply, the lowest in the table is the classification.
^b If multiple cases apply, the lowest in the table is the classification. If units are missing on one or more data elements, this classification should be left empty.
^c The formats under each tag are created as they are encountered in articles.
^d This classification is left empty in the event that no categorical data elements are reported.
Figure 3. A screenshot of the interactive tagging hierarchy applied across the 78 studies included in this pilot survey. Two filters were applied: “Outcomes Reported In Table” was selected first, and then “Mean ± SD,” meaning the sunburst plot is filtered to studies for which both tags are present. The right menu displays the 38 studies for which this is true as well as statistics about how common the tags in question are across all included studies.
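The filtering described in the Figure 3 caption amounts to keeping studies whose tag set contains every selected tag. A minimal sketch follows, assuming tags are stored as one set per study; the study IDs and the function name `filter_by_tags` are hypothetical and do not come from the article.

```python
from typing import Dict, List, Set


def filter_by_tags(study_tags: Dict[str, Set[str]], required: Set[str]) -> List[str]:
    """Return IDs of studies whose tag set contains every required tag."""
    return sorted(sid for sid, tags in study_tags.items() if required <= tags)


# Hypothetical study IDs and tag assignments, for illustration only.
studies = {
    "study_a": {"Outcomes Reported In Table", "Mean ± SD"},
    "study_b": {"Outcomes Reported In Table", "Mean (SD)"},
    "study_c": {"Mean ± SD"},
}
selected = filter_by_tags(studies, {"Outcomes Reported In Table", "Mean ± SD"})
```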
Baseline and outcome reporting per article and per table.
| Type | Frequency per article (N=78), n (%) | Frequency per table (N=174), n (%) |
| | 66 (85) | 67 (38.5) |
| Arm-level breakout | 64 (97) | 65 (97) |
| No arm-level breakout | 2 (3) | 2 (3) |
| | 72 (92) | 107 (61.5) |
| Arm-level breakout | 69 (96) | 104 (97.2) |
| Secondary only | 2 (3) | 2 (1.9) |
| Participant level | 1 (1) | 1 (0.9) |
Measurement context reporting per article (N=77).
| Type | Frequency per relevant^a article, n (%) |
| | 57 (74) |
| Embedded in arm | 50 (88) |
| In separate array | 6 (11) |
| In description | 1 (1) |
| | 77 (100) |
| Control/experimental | 26 (34) |
| Acronyms or abbreviated | 25 (33) |
| Full name | 23 (30) |
| Alternate labels | 3 (4) |
| | 52 (68) |
| Pre/post | 24 (46) |
| Contains unit of time | 22 (42) |
| Incremental numbering | 6 (12) |
^a One article reported no baseline or outcome data in tables and was thus left out from the measurement context analysis.
Dimensions of tabular reporting of measurement context (N=174).
| Context dimensions | Frequency per table, n (%) |
| | 99 (56.9) |
| DEs^a on rows, arms on columns | 90 (91) |
| Arms on rows, DEs on columns | 5 (5) |
| Arm on rows, TP^b on columns | 2 (2) |
| TPs on rows, arms on columns | 2 (2) |
| | 34 (19.5) |
| TPs nested in DEs on rows, arms on columns | 18 (53) |
| Arms nested in DEs on rows, TPs on columns | 8 (24) |
| DEs nested in arms on rows, TPs on columns | 2 (6) |
| Arms nested in TPs on rows, DEs on columns | 2 (6) |
| DEs and TPs embedded in rows, arms on columns | 2 (6) |
| TPs nested in arms on rows, DEs on columns | 1 (3) |
| DEs nested in TPs on rows, arms on columns | 1 (3) |
| | 26 (14.9) |
| DEs on rows, TPs nested in arms on columns | 16 (62) |
| DEs on rows, arms nested in TPs on columns | 5 (19) |
| Arms on rows, TPs nested in DEs on columns | 3 (12) |
| DEs on rows, arms and TPs embedded on columns | 2 (8) |
| | 15 (8.6) |
| Stratified reporting | 8 (53) |
| Only reports comparative statistics | 7 (47) |
^a DE: data element.
^b TP: time point.
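The 1×1, 2×1, and 1×2 labels count how many pieces of context sit on the rows and on the columns. A small sketch of that counting rule, assuming each axis is described as a list of context names ordered from outer to inner nesting (the function name and this representation are illustrative only):

```python
from typing import Sequence


def context_dimensions(row_context: Sequence[str], column_context: Sequence[str]) -> str:
    """Label a table by the number of context pieces on rows and on columns."""
    return f"{len(row_context)}×{len(column_context)}"


# "TPs nested in DEs on rows, arms on columns"
nested_rows = context_dimensions(["data element", "time point"], ["arm"])
# "DEs on rows, TPs nested in arms on columns"
nested_cols = context_dimensions(["data element"], ["arm", "time point"])
```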
Continuous metric reporting format in tables (N=77).
| Continuous metrics reported | Frequency per relevant^a article, n (%) |
| | 69 (90) |
| Mean ± SD | 41 (59) |
| Mean (SD) | 25 (36) |
| Mean and SD in separate arrays | 2 (3) |
| Mean | 2 (3) |
| Mean SD | 1 (1) |
| Mean (CI; lower-higher) | 1 (1) |
| Mean ± SD (range; min-max) | 1 (1) |
| | 21 (27) |
| Median (IQR; 25th percentile-75th percentile) | 11 (52) |
| Median (range, min-max) | 4 (19) |
| Median [IQR; 25th percentile-75th percentile] | 3 (14) |
| Median [IQR; 25th percentile, 75th percentile] | 1 (5) |
| Median (IQR; 25th percentile to 75th percentile) | 1 (5) |
| Median (IQR) | 1 (5) |
| Min-Max (median) | 1 (5) |
^a One article reported no baseline or outcome data in tables and was thus left out from continuous data characterization.
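Several of the formats listed above can be recognized with regular expressions once the format is known from the header or footnotes. The sketch below covers three of them under assumed format keys (`mean_pm_sd`, `mean_paren_sd`, `median_iqr`); it is an illustration, not the extraction method used in the article. The format must be supplied by the caller because a cell such as “12 (3)” is ambiguous on its own.

```python
import re
from typing import Optional, Tuple

NUM = r"(-?\d+(?:\.\d+)?)"

# Format keys are hypothetical names for three formats from the table above.
PATTERNS = {
    "mean_pm_sd": re.compile(rf"^\s*{NUM}\s*(?:±|\+/-)\s*{NUM}\s*$"),
    "mean_paren_sd": re.compile(rf"^\s*{NUM}\s*\(\s*{NUM}\s*\)\s*$"),
    # Median with 25th and 75th percentiles in () or [], separated by -, –, a comma, or "to".
    "median_iqr": re.compile(
        rf"^\s*{NUM}\s*[\(\[]\s*{NUM}\s*(?:-|–|,|to)\s*{NUM}\s*[\)\]]\s*$"
    ),
}


def parse_continuous(cell: str, fmt: str) -> Optional[Tuple[float, ...]]:
    """Parse a metric cell under a known format; return None if it does not match."""
    m = PATTERNS[fmt].match(cell)
    if m is None:
        return None
    return tuple(float(g) for g in m.groups())


mean_sd = parse_continuous("12.5 ± 3.1", "mean_pm_sd")
median_iqr = parse_continuous("40 [35, 52]", "median_iqr")
```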
Dichotomous and categorical formats and labels in tables (N=77).
| Statistics reported | Frequency per relevant^a article, n (%) |
| | 40 (52) |
| n (%) | 30 (75) |
| n | 3 (8) |
| n/(N–n) | 3 (8) |
| % | 3 (8) |
| n, % | 2 (5) |
| n and % in separate arrays | 1 (3) |
| | 47 (61) |
| n (%) | 40 (85) |
| n | 8 (17) |
| | 47 (61) |
| Same array, indented | 35 (74) |
| Separate array | 7 (15) |
| In cell | 7 (15) |
| Same array, unindented | 1 (2) |
^a One article reported no baseline or outcome data in tables and was thus left out from dichotomous/categorical data characterization.
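For the most common dichotomous format, “n (%)”, a parsed cell can be cross-checked against the arm size. The following is a sketch under assumptions: the function names and the 0.5 percentage point tolerance are illustrative choices, not values from the article.

```python
import re
from typing import Optional, Tuple

# "30 (75)" or "2 (3.4%)": a count followed by a percentage in parentheses.
N_PCT = re.compile(r"^\s*(\d+)\s*\(\s*(\d+(?:\.\d+)?)\s*%?\s*\)\s*$")


def parse_n_pct(cell: str) -> Optional[Tuple[int, float]]:
    """Parse an "n (%)" cell into (count, percentage); None if it does not match."""
    m = N_PCT.match(cell)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2))


def pct_consistent(n: int, pct: float, arm_size: int, tol: float = 0.5) -> bool:
    """Check that the reported percentage agrees with n / arm_size within tol points."""
    return abs(100 * n / arm_size - pct) <= tol


parsed = parse_n_pct("2 (3.4%)")
ok = pct_consistent(2, 3.4, 59)  # 2 of 59 participants, as in "2/59 (3.4%)"
```

A failed consistency check can flag a misread cell or a mismatched arm size before the value enters a meta-analysis.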