Katie R Bradwell, Jacob T Wooldridge, Benjamin Amor, Tellen D Bennett, Adit Anand, Carolyn Bremer, Yun Jae Yoo, Zhenglong Qian, Steven G Johnson, Emily R Pfaff, Andrew T Girvin, Amin Manna, Emily A Niehaus, Stephanie S Hong, Xiaohan Tanner Zhang, Richard L Zhu, Mark Bissell, Nabeel Qureshi, Joel Saltz, Melissa A Haendel, Christopher G Chute, Harold P Lehmann, Richard A Moffitt.
Abstract
OBJECTIVE: The goals of this study were to harmonize data from electronic health records (EHRs) into common units and to impute units that were missing.
Keywords: SARS-CoV-2; data accuracy; data collection; electronic health records; reference standards
Year: 2022 PMID: 35435957 PMCID: PMC9196692 DOI: 10.1093/jamia/ocac054
Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 7.942
Example canonical units table
| Measured variable | Enclave codeset ID | Canonical unit concept ID | Canonical unit concept name | Maximum plausible value | Minimum plausible value | Measurement table row count |
|---|---|---|---|---|---|---|
| Respiratory rate | 286601963 | 8483 | Counts per minute | 200 | 0 | 201 976 073 |
| Sodium, mmol/L | 887473517 | 8753 | Millimole per liter | 250 | 50 | 147 177 271 |
| SpO2 | 780678652 | 8554 | Percent | 100 | 0 | 145 403 614 |
| Systolic blood pressure | 186465804 | 8876 | Millimeter mercury column | 400 | 0 | 136 188 546 |
| Temperature | 656562966 | 586323 | Degree Celsius | 45 | 25 | 123 986 764 |
| Glucose, mg/dL | 59698832 | 8840 | Milligram per deciliter | 1000 | 0 | 104 743 184 |
| Heart rate | 596956209 | 8483 | Counts per minute | 500 | 0 | 67 530 040 |
| Height | 754731201 | 9546 | Meter | 3 | 0 | 53 998 207 |
| Body weight | 776390058 | 9529 | Kilogram | 500 | 0.1 | 42 113 217 |
| Diastolic blood pressure | 573275931 | 8876 | Millimeter mercury column | 200 | 0 | 42 024 537 |
Note: Chosen canonical units and plausible value ranges for the 10 most frequent measured variables in the data, among those selected for unit harmonization and inference.
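The table above pairs each measured variable with a canonical unit and a plausible value range. A minimal sketch of how such a table could drive harmonization follows. The canonical unit and the 0.1 to 500 kg range for body weight come from the table; the dictionary layout, the function name `harmonize`, the list of source units, the inclusive bounds, and the choice to return `None` for unknown units or implausible values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Optional

# Canonical unit and plausible range for body weight, taken from the table above.
CANONICAL = {"body_weight": {"unit": "kilogram", "min": 0.1, "max": 500.0}}

# Hypothetical conversion factors into the canonical unit (illustrative subset).
TO_CANONICAL = {
    ("body_weight", "kilogram"): 1.0,       # equivalent ("identity") unit
    ("body_weight", "gram"): 0.001,         # nonequivalent unit
    ("body_weight", "pound"): 0.45359237,   # nonequivalent unit
}


def harmonize(variable: str, value: float, unit: str) -> Optional[float]:
    """Convert a value to the canonical unit, or return None if that is not possible.

    Assumption: values outside the plausible range (bounds inclusive) are rejected.
    """
    factor = TO_CANONICAL.get((variable, unit))
    if factor is None:
        return None  # unit not mapped for this variable
    converted = value * factor
    spec = CANONICAL[variable]
    if not (spec["min"] <= converted <= spec["max"]):
        return None  # implausible after conversion
    return converted
```

For example, `harmonize("body_weight", 150, "pound")` yields roughly 68.04 kg, while a value of 2000 pounds is rejected because it exceeds the 500 kg upper bound.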
Figure 1. Diversity of equivalent and nonequivalent units across measured variables. Units present per measured variable and their equivalence to the selected canonical unit. Units equivalent to the canonical unit are labeled “identity,” and nonequivalent units are labeled “non-identity.”
Figure 2. Unit conversion workflow summary. Overview of the process for harmonizing units in the OMOP measurement table. SME: subject matter expert.
Figure 3. Unit inference and harmonization workflows. (A) Unit-inference threshold validation workflow. Known units were masked to assess the range of KS test P values obtained when comparing values in equivalent units across populations. After plotting all P values together, a final threshold of 1e−5 was selected and then used to identify units when they were missing. (B) Unit inference workflow. Process for sampling and performing KS tests on values across data partner and measurement concept combinations, checking for P values above the 1e−5 threshold, and applying thresholds to omit unit inference in cases where units cannot be confidently assigned. (C) Unit harmonization workflow. Conversion of values for each record into the canonical unit. KS test: Kolmogorov-Smirnov test.
Figure 4. KS test P-value threshold validation. KS P values for equivalent versus nonequivalent units per data partner ID/measurement concept name. CRP was omitted because visual inspection showed several completely overlapping value distributions in nonequivalent units. CRP: C-reactive protein; KS test: Kolmogorov-Smirnov test.
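The inference step described in Figures 3 and 4 can be sketched as follows: values with a missing unit are converted under each candidate unit and compared, by a two-sample KS test, against reference values already in the canonical unit. Only the 1e−5 threshold comes from the captions. The function names, the tie-breaking rule (keep the candidate with the highest P value above the threshold), and the P-value formula (a standard asymptotic approximation to the Kolmogorov distribution, used here to keep the sketch dependency-free) are assumptions rather than the authors' code.

```python
import math

P_THRESHOLD = 1e-5  # threshold reported in the Figure 3 caption


def ks_statistic(a, b):
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every observation equal to x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_p_value(d, n, m):
    """Asymptotic two-sided P value (an approximation, assumed for this sketch)."""
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.3:
        return 1.0  # the alternating series converges poorly here; P is ~1
    total = 0.0
    for k in range(1, 101):
        term = (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, 2.0 * total))


def infer_unit(values, reference, candidate_factors):
    """Return the candidate unit whose converted values match the reference, else None.

    candidate_factors maps a hypothetical unit name to its factor into the canonical unit.
    """
    best_unit, best_p = None, P_THRESHOLD
    for unit, factor in candidate_factors.items():
        converted = [v * factor for v in values]
        p = ks_p_value(ks_statistic(converted, reference), len(converted), len(reference))
        if p > best_p:
            best_unit, best_p = unit, p
    return best_unit
```

Returning `None` when no candidate clears the threshold mirrors the caption's point that inference is omitted when a unit cannot be confidently assigned.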
Figure 5. Unit inference omission criteria: omitting variables where units cannot be uniquely assigned. The standard deviation of the log median harmonized values (above the KS test P-value threshold) was used as a measure of closeness of different populations of values, and was compared to the log of the minimum conversion factor to determine the level of overlap expected between different units. Ratios: 0.125–0.25 (right-most shaded segment), 0.25–0.5 (middle shaded segment), and >0.5 (left-most shaded segment). KS test: Kolmogorov-Smirnov test.
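One possible reading of the comparison in Figure 5 is a ratio of the spread of log medians to the log of the minimum conversion factor, binned into the shaded ranges named in the caption. The sketch below follows that reading. The direction of the ratio, the base-10 logarithm, the use of the sample standard deviation, and both function names are assumptions; the caption does not specify them, nor does it state which band triggers omission.

```python
import math
import statistics


def overlap_ratio(medians, min_conversion_factor):
    """Spread of log10 medians divided by |log10| of the minimum conversion factor.

    medians: median harmonized values, one per population (at least two).
    """
    gap = abs(math.log10(min_conversion_factor))
    if gap == 0:
        raise ValueError("a conversion factor of 1 leaves no gap between units")
    spread = statistics.stdev(math.log10(m) for m in medians)
    return spread / gap


def ratio_band(ratio):
    """Map a ratio onto the ranges listed in the Figure 5 caption."""
    if ratio > 0.5:
        return ">0.5"
    if ratio >= 0.25:
        return "0.25-0.5"
    if ratio >= 0.125:
        return "0.125-0.25"
    return "<0.125"  # below the shaded ranges
```

Under this reading, a small ratio means the populations sit much closer together than the gap between units, so a unit can be assigned with less risk of confusion.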
Counts and percentages of harmonized and inferred units
| Metric | Count | Percentage |
|---|---|---|
| Total measurements with values present | 1 607 758 125 | N/A |
| Total measurements with valid units | 933 030 577 | 58.0 |
| Total measurements without units | 674 727 548 | 42.0 |
| Total nonequivalent units harmonized | 725 051 924 | 45.1 |
| Total harmonized | 1 416 354 459 | 88.1 |
| Total units inferred | 527 400 086 | 78.2 |
Note: Harmonized and inferred unit counts and percentages were calculated across all measured variables out of a total of 1 607 758 125 measurements with values, of which 674 727 548 (42%) had missing units.
The percentage for total units inferred is out of total records that were missing units.
All other percentages are out of total measurements with values.
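The percentages in the table can be reproduced from the counts. The short check below uses only the figures reported above; the variable names and the helper `pct` are illustrative.

```python
# Counts copied from the table above.
total_with_values = 1_607_758_125
with_valid_units = 933_030_577
without_units = 674_727_548
nonequivalent_harmonized = 725_051_924
total_harmonized = 1_416_354_459
units_inferred = 527_400_086


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the table."""
    return round(100 * part / whole, 1)


# Denominator: all measurements with values.
valid_pct = pct(with_valid_units, total_with_values)
missing_pct = pct(without_units, total_with_values)
nonequivalent_pct = pct(nonequivalent_harmonized, total_with_values)
harmonized_pct = pct(total_harmonized, total_with_values)

# Denominator: only the records that were missing units.
inferred_pct = pct(units_inferred, without_units)
```

The inferred-unit percentage uses a different denominator from the other rows, which is why it does not sum with them.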
Figure 6. Overview and examples of successfully harmonized and inferred units. (A) Percentage of values with harmonized and inferred units by measurement variable. Roughly half of the data had correct units and did not require conversion (light green), while roughly half had their units inferred (blue). A minority of values had units that needed conversion (dark green), and the smallest group of data had nonsensical or mislabeled units (black). (B) Original units and their values for body weight and harmonized data for body weight. (C) Inferred versus observed harmonized value distributions.