Kenneth A Scott, Sara Deakyne Davies, Rachel Zucker, Toan Ong, Emily McCormick Kraus, Michael G Kahn, Jessica Bondy, Matt F Daley, Kate Horle, Emily Bacon, Lisa Schilling, Tessa Crume, Romana Hasnain-Wynia, Seth Foldy, Gregory Budney, Arthur J Davidson.
Abstract
Introduction: Learning health systems can help estimate chronic disease prevalence through distributed data networks (DDNs). Concerns remain about bias introduced into DDN prevalence estimates when individuals who seek care across multiple systems are counted more than once. This paper describes a process to deduplicate individuals for DDN prevalence estimates.
Keywords: electronic health records; medical record linkage; network; public health informatics; public health surveillance
Year: 2021 PMID: 35860322 PMCID: PMC9284932 DOI: 10.1002/lrh2.10297
Source DB: PubMed Journal: Learn Health Syst ISSN: 2379-6146
FIGURE 1 STROBE flow diagram representing the number of unique patients across two distributed data network partners participating in a study of identity management's influence on type 1 and type 2 diabetes prevalence
FIGURE 2 Generating stratified, deduplicated estimates of diabetes prevalence through a distributed query process that minimizes exchange of protected health information (PHI)
Example of reconciliation process for selecting data partners to contribute demographic and geographic data for individuals seen in multiple health care systems (selection criteria are highlighted)
| Network identifier | Diagnosis present: data partner 1 | Diagnosis present: data partner 2 | Final 2017 visit date: data partner 1 | Final 2017 visit date: data partner 2 | Selected data partner |
|---|---|---|---|---|---|
| CID1 | Yes | No | January 1, 2017 | December 31, 2017 | 1 |
| CID2 | Yes | Yes | December 31, 2017 | January 1, 2017 | 1 |
| CID3 | No | No | January 1, 2017 | December 31, 2017 | 2 |
| CID4 | Yes | Yes | January 1, 2017 | January 1, 2017 | 2 (random) |
| CID5 | No | No | December 31, 2017 | December 31, 2017 | 1 (random) |
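The five example rows above are consistent with a three-step selection rule: prefer a data partner that recorded the diagnosis, then the partner with the most recent final visit, then a random choice among any remaining ties. The sketch below encodes that reading. The rule order is inferred from the table rows rather than quoted from the authors, and the names `PartnerRecord` and `select_partner` are hypothetical.

```python
import random
from dataclasses import dataclass
from datetime import date


@dataclass
class PartnerRecord:
    partner: int          # data partner number (1 or 2)
    has_diagnosis: bool   # diagnosis present at this partner
    last_visit: date      # final 2017 visit date at this partner


def select_partner(records, rng=random):
    """Pick the partner that contributes demographic and geographic data.

    Returns (partner_number, chosen_at_random). The rule order is an
    assumption inferred from the example table, not the authors' code.
    """
    # Step 1: if any partner recorded the diagnosis, consider only those.
    with_dx = [r for r in records if r.has_diagnosis]
    candidates = with_dx if with_dx else list(records)

    # Step 2: keep the partner(s) with the most recent final visit.
    latest = max(r.last_visit for r in candidates)
    candidates = [r for r in candidates if r.last_visit == latest]

    # Step 3: break any remaining tie at random.
    if len(candidates) == 1:
        return candidates[0].partner, False
    return rng.choice(candidates).partner, True


if __name__ == "__main__":
    jan1, dec31 = date(2017, 1, 1), date(2017, 12, 31)
    examples = {
        "CID1": [PartnerRecord(1, True, jan1), PartnerRecord(2, False, dec31)],
        "CID2": [PartnerRecord(1, True, dec31), PartnerRecord(2, True, jan1)],
        "CID3": [PartnerRecord(1, False, jan1), PartnerRecord(2, False, dec31)],
        "CID4": [PartnerRecord(1, True, jan1), PartnerRecord(2, True, jan1)],
        "CID5": [PartnerRecord(1, False, dec31), PartnerRecord(2, False, dec31)],
    }
    for cid, recs in examples.items():
        partner, was_random = select_partner(recs)
        print(cid, partner, "(random)" if was_random else "")
```

Under this reading, CID1 resolves to partner 1 even though partner 2 has the later visit, because the diagnosis criterion is applied first; CID4 and CID5 fall through to the random tie-break.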
Distribution of demographic characteristics and disease prevalence for patient populations (<19 years old) with any encounter during the study period among two data partners, seven‐county Denver metro area, 2017
| Characteristic | Data partner 1 | Data partner 2 | P-value |
|---|---|---|---|
| Number of patients | 58 351 | 167 569 | n/a |
| Diabetes prevalence (per 1000) | | | |
| Type 1 | 1.6 | 4.1 | <.0001 |
| Type 2 | 1.2 | 0.9 | .03 |
| Sex (percent) | | | <.0001 |
| Female | 50 | 48 | |
| Male | 50 | 52 | |
| Unknown | 0 | <1 | |
| Age group in years (percent) | | | <.0001 |
| 0‐3 | 21 | 33 | |
| 4‐6 | 16 | 16 | |
| 7‐9 | 16 | 15 | |
| 10‐12 | 17 | 14 | |
| 13‐15 | 17 | 14 | |
| 16‐17 | 13 | 8 | |
| Race and ethnicity (percent) | | | <.0001 |
| Non‐Hispanic (NH) White | 13 | 46 | |
| Hispanic | 68 | 32 | |
| NH Black | 14 | 8 | |
| NH Asian | 4 | 3 | |
| NH American Indian or Alaska Native | <1 | <1 | |
| NH multiple races | 1 | 4 | |
| NH race unknown or not reported | 2 | 7 | |
| Residing in census tract with ≥20% below federal poverty level (percent) | | | <.0001 |
| Yes | 44 | 18 | |
| No | 56 | 82 | |
Note: P‐values calculated using Pearson's Chi‐squared test.
Some addresses could not be geolocated to the census tract.
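For readers who want to see how a Pearson chi-squared P-value of the kind noted above is obtained for a two-group comparison, the following is a minimal sketch for a 2×2 table (cases and non-cases at two partners). It assumes no continuity correction, which the source does not specify, and the counts in the usage example are hypothetical rather than taken from the study. The function name `pearson_chi2_2x2` is illustrative only; comparisons with more than two categories (age group, race and ethnicity) need the general r×c form.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-squared test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g., data partners); columns are cases and
    non-cases. No continuity correction is applied (an assumption).
    Returns (chi2_statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 30 cases among 10,000 patients at one partner,
    # 60 cases among 10,000 patients at the other.
    chi2, p = pearson_chi2_2x2(30, 9970, 60, 9940)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```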
Distribution of demographic characteristics and disease prevalence for patient populations (<19 years old) with any encounter during the study period among two data partners, by duplicate status, seven‐county Denver metro area, 2017
| Characteristic | Duplicate: Yes | Duplicate: No | P-value |
|---|---|---|---|
| Number of patients | 7628 | 210 809 | |
| Diabetes prevalence (per 1000) | | | |
| Type 1 | 5 | 3.4 | .03 |
| Type 2 | 4 | 0.8 | <.0001 |
| Sex (percent) | | | .13 |
| Female | 49 | 48 | |
| Male | 51 | 52 | |
| Unknown | 0 | <1 | |
| Age group in years (percent) | | | <.0001 |
| 0‐3 | 20 | 30 | |
| 4‐6 | 19 | 16 | |
| 7‐9 | 17 | 15 | |
| 10‐12 | 15 | 15 | |
| 13‐15 | 17 | 15 | |
| 16‐18 | 11 | 9 | |
| Race and ethnicity (percent) | | | <.0001 |
| Non‐Hispanic (NH) White | 10 | 40 | |
| Hispanic | 64 | 39 | |
| NH Black | 16 | 9 | |
| NH Asian | 4 | 3 | |
| NH American Indian or Alaska Native | <1 | <1 | |
| NH multiple races | 1 | 3 | |
| NH race unknown or not reported | 3 | 6 | |
| Residing in census tract with ≥20% below federal poverty level (percent) | | | <.0001 |
| Yes | 46 | 23 | |
| No | 54 | 77 | |
Some addresses could not be geolocated to the census tract.
Prevalence (per 1000) of Type 1 and Type 2 diabetes among patient populations (<19 years) for all encounter types from two health care systems, before and after deduplication, seven‐county Denver Metropolitan Area, Colorado, 2017
| Characteristic | Type 1: before deduplication | Type 1: after deduplication | Type 2: before deduplication | Type 2: after deduplication |
|---|---|---|---|---|
| Overall | 3.4 (3.2, 3.6) | 3.5 (3.3, 3.7) | 1.0 (0.9, 1.1) | 0.9 (0.8, 1.0) |
| Sex | ||||
| Female | 3.6 (3.2, 4.0) | 3.7 (3.3, 4.1) | 1.1 (0.9, 1.3) | 1.0 (0.8, 1.2) |
| Male | 3.3 (3.0, 3.6) | 3.3 (3.0, 3.6) | 0.8 (0.6, 1.0) | 0.8 (0.6, 1.0) |
| Age in years | ||||
| 0‐3 | 0.4 (0.2, 0.6) | 0.4 (0.2, 0.6) | 0 (0, 0) | 0 (0, 0) |
| 4‐6 | 1.6 (1.2, 2.0) | 1.6 (1.2, 2.0) | 0 (0, 0) | 0 (0, 0) |
| 7‐9 | 3.2 (2.6, 3.8) | 3.2 (2.6, 3.8) | 0.3 (0.1, 0.5) | 0.3 (0.1, 0.5) |
| 10‐12 | 5.5 (4.7, 6.3) | 5.6 (4.8, 6.4) | 0.8 (0.5, 1.1) | 0.8 (0.5, 1.1) |
| 13‐15 | 7.1 (6.2, 8.0) | 7.2 (6.3, 8.1) | 2.2 (1.7, 2.7) | 2.0 (1.5, 2.5) |
| 16‐17 | 7.5 (6.3, 8.7) | 7.7 (6.5, 8.9) | 4.9 (4.0, 5.8) | 4.8 (3.8, 5.8) |
| Race | ||||
| Non‐Hispanic (NH) White | 5.5 (5.0, 6.0) | 5.5 (5.0, 6.0) | 0.5 (0.3, 0.7) | 0.5 (0.3, 0.7) |
| Hispanic | 1.9 (1.6, 2.2) | 1.9 (1.6, 2.2) | 1.4 (1.2, 1.6) | 1.3 (1.1, 1.5) |
| NH Black | 3.1 (2.3, 3.9) | 3.0 (2.2, 3.8) | 1.8 (1.2, 2.4) | 1.6 (1.0, 2.2) |
| NH Asian | 1.0 (0.3, 1.7) | 1.1 (0.3, 1.9) | 0.6 (0.0, 1.2) | 0.6 (0.0, 1.2) |
| NH American Indian or Alaska Native | 4.3 (−0.6, 9.2) | 4.5 (−0.6, 9.6) | 2.9 (−1.1, 6.9) | 3.0 (−1.1, 7.2) |
| NH multiple races | 2.5 (1.3, 3.7) | 2.5 (1.3, 3.7) | 0.4 (−0.1, 0.9) | 0.4 (−0.1, 0.9) |
| NH race unknown or not reported | 3.4 (2.4, 4.4) | 3.4 (2.4, 4.4) | 0.2 (0.0, 0.4) | 0.2 (−0.1, 0.5) |
| Residing in census tract with > = 20% below federal poverty level | ||||
| Yes | 2.2 (1.8, 2.6) | 2.1 (1.7, 2.5) | 1.4 (1.1, 1.7) | 1.3 (1.0, 1.6) |
| No | 3.9 (3.6, 4.2) | 3.9 (3.6, 4.2) | 0.8 (0.7, 0.9) | 0.8 (0.7, 0.9) |
Some addresses could not be geolocated to the census tract.
FIGURE 3 Opportunities for duplication bias when estimating disease prevalence from two data partners
Factors influencing prevalence (per 1000) under several scenarios of overlapping populations at two hypothetical data partners (DP1, DP2) with 1000 patients each
| Scenario | DP1 prevalence per 1000 | DP2 prevalence per 1000 | Case overlap (n) | Population overlap (n) | Aggregated prevalence per 1000 | Deduplicated prevalence per 1000 |
|---|---|---|---|---|---|---|
| Set theory notation | n(A)/n(C) × 1000 | n(B)/n(D) × 1000 | n(A ∩ B) | n(C ∩ D) | (n(A) + n(B)) / (n(C) + n(D)) × 1000 | n(A ∪ B) / n(C ∪ D) × 1000 |
| Low prevalence: No overlap | 5 | 5 | 0 | 0 | 5 | 5 |
| Low prevalence: High population overlap | 5 | 5 | 0 | 995 | 5 | 9.95 |
| Low prevalence: Complete case overlap | 5 | 5 | 5 | 5 | 5 | 2.51 |
| Low prevalence: Complete overlap | 5 | 5 | 5 | 1000 | 5 | 5 |
| High prevalence: No overlap | 500 | 500 | 0 | 0 | 500 | 500 |
| High prevalence: High population overlap | 500 | 500 | 0 | 500 | 500 | 666.6 |
| High prevalence: Complete case overlap | 500 | 500 | 500 | 500 | 500 | 333.3 |
| High prevalence: Complete overlap | 500 | 500 | 500 | 1000 | 500 | 500 |
Note: Aside from the top row, which uses set theory notation to represent the meaning of each column, each row illustrates how different combinations of conditions impact duplication bias in the prevalence estimate. Conditions that influence the degree of bias include: the prevalence of a condition (eg, low or high), the degree of overlap in the overall population [none, high (complete among non‐cases), or complete], and the degree of overlap in the case population (none or complete). Bias is introduced when overlap is disproportionate among cases and non‐cases.
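The arithmetic behind the aggregated and deduplicated columns can be checked with a short sketch. It uses the set notation from the table (A and B are the case sets, C and D the patient populations at DP1 and DP2) together with inclusion-exclusion, n(A ∪ B) = n(A) + n(B) − n(A ∩ B). The function names are hypothetical and the code is illustrative rather than the authors' implementation.

```python
def aggregated_prevalence(cases_a, cases_b, pop_c, pop_d, per=1000):
    """Prevalence when both partners' counts are simply summed."""
    return per * (cases_a + cases_b) / (pop_c + pop_d)


def deduplicated_prevalence(cases_a, cases_b, pop_c, pop_d,
                            case_overlap, pop_overlap, per=1000):
    """Prevalence after counting each overlapping individual once.

    Uses inclusion-exclusion for both the case sets (A, B) and the
    populations (C, D).
    """
    union_cases = cases_a + cases_b - case_overlap
    union_pop = pop_c + pop_d - pop_overlap
    return per * union_cases / union_pop


if __name__ == "__main__":
    # (label, n(A), n(B), n(C), n(D), n(A ∩ B), n(C ∩ D)) from the table,
    # with 1000 patients at each hypothetical data partner.
    scenarios = [
        ("Low prevalence: High population overlap", 5, 5, 1000, 1000, 0, 995),
        ("Low prevalence: Complete case overlap", 5, 5, 1000, 1000, 5, 5),
        ("High prevalence: High population overlap", 500, 500, 1000, 1000, 0, 500),
        ("High prevalence: Complete case overlap", 500, 500, 1000, 1000, 500, 500),
    ]
    for label, a, b, c, d, case_ov, pop_ov in scenarios:
        agg = aggregated_prevalence(a, b, c, d)
        dedup = deduplicated_prevalence(a, b, c, d, case_ov, pop_ov)
        print(f"{label}: aggregated {agg:.2f}, deduplicated {dedup:.2f}")
```

In the low-prevalence, high-population-overlap row, for example, the union holds 10 cases among 1005 unique patients, which gives the tabulated 9.95 per 1000 against an aggregated estimate of 5.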