Mackenzie J Edmondson1, Chongliang Luo1, Rui Duan2, Mitchell Maltenfort3, Zhaoyi Chen4,5, Kenneth Locke1, Justine Shults1, Jiang Bian4,5, Patrick B Ryan6, Christopher B Forrest3, Yong Chen7.
Abstract
Clinical research networks (CRNs), made up of multiple healthcare systems each with patient data from several care sites, are beneficial for studying rare outcomes and increasing generalizability of results. While CRNs encourage sharing aggregate data across healthcare systems, individual systems within CRNs often cannot share patient-level data due to privacy regulations, prohibiting multi-site regression which requires an analyst to access all individual patient data pooled together. Meta-analysis is commonly used to model data stored at multiple institutions within a CRN but can result in biased estimation, most notably in rare-event contexts. We present a communication-efficient, privacy-preserving algorithm for modeling multi-site zero-inflated count outcomes within a CRN. Our method, a one-shot distributed algorithm for performing hurdle regression (ODAH), models zero-inflated count data stored in multiple sites without sharing patient-level data across sites, resulting in estimates closely approximating those that would be obtained in a pooled patient-level data analysis. We evaluate our method through extensive simulations and two real-world data applications using electronic health records: examining risk factors associated with pediatric avoidable hospitalization and modeling serious adverse event frequency associated with a colorectal cancer therapy. In simulations, ODAH produced bias less than 0.1% across all settings explored while meta-analysis estimates exhibited bias up to 12.7%, with meta-analysis performing worst in settings with high zero-inflation or low event rates. Across both applied analyses, ODAH estimates had less than 10% bias for 18 of 20 coefficients estimated, while meta-analysis estimates exhibited substantially higher bias. Relative to existing methods for distributed data analysis, ODAH offers a highly accurate, computationally efficient method for modeling multi-site zero-inflated count data.
Year: 2021 PMID: 34608222 PMCID: PMC8490431 DOI: 10.1038/s41598-021-99078-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. On the left, a histogram displaying counts generated with a Poisson–Logit hurdle distribution with 10% prevalence and a fixed zero-truncated event rate. On the right, a hierarchical diagram visualizing the data generation process in a Poisson–Logit hurdle framework. Independent realizations are generated from a Bernoulli process, with underlying probability modeled using a logit link. Realizations equal to 0 are zero counts, while realizations equal to 1 are positive counts. The positive counts are generated by a zero-truncated Poisson distribution, with underlying event rate modeled using a log link.
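The two-stage generation process described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' simulation code: the function name `simulate_hurdle`, the arguments `X`, `beta`, and `gamma`, and the sampling scheme for the zero-truncated Poisson are all assumptions made for this example.

```python
import numpy as np


def simulate_hurdle(X, beta, gamma, seed=None):
    """Draw one Poisson-Logit hurdle count per row of the design matrix X.

    beta  : logit-link coefficients for the probability of a positive count
    gamma : log-link coefficients for the zero-truncated Poisson event rate
    (All names are hypothetical, chosen for illustration only.)
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)

    # Stage 1: Bernoulli hurdle with success probability from a logit link.
    prob_positive = 1.0 / (1.0 + np.exp(-(X @ beta)))
    is_positive = rng.random(X.shape[0]) < prob_positive

    # Stage 2: zero-truncated Poisson for rows that cleared the hurdle,
    # with the event rate given by a log link.
    rate = np.exp(X[is_positive] @ gamma)
    # Exact sampler: draw the first arrival time of a unit-rate Poisson
    # process, conditional on it falling before `rate`, then add the
    # Poisson number of arrivals in the remaining time.
    u = rng.random(rate.shape[0])
    first_arrival = -np.log1p(-u * (1.0 - np.exp(-rate)))
    remaining = np.maximum(rate - first_arrival, 0.0)

    counts = np.zeros(X.shape[0], dtype=np.int64)
    counts[is_positive] = 1 + rng.poisson(remaining)
    return counts
```

With an intercept-only design, `simulate_hurdle(np.ones((n, 1)), [np.log(0.1 / 0.9)], [np.log(rate)])` gives roughly 10% positive counts, as in the histogram; the value of `rate` is left to the reader because the caption's event rate is not recoverable here.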
Figure 2. Visual representation of the one-shot distributed algorithm for hurdle regression (ODAH). In the initialization round, coefficient and variance estimates from fitting separate hurdle models at each collaborating site are sent to the lead site; these estimates are then used together with lead site estimates in a meta-analysis to produce initial estimates for ODAH, which are sent to each collaborating site. In the surrogate likelihood estimation round, first-order and second-order gradients are computed at each site, evaluated at the received initial estimates, and sent to the lead site. These gradients are used in conjunction with data from the lead site to construct a surrogate likelihood function for each component of the hurdle model; the two functions are then maximized to produce the corresponding surrogate maximum likelihood estimates.
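The initialization round in Figure 2 combines per-site coefficient and variance estimates in a meta-analysis. The caption does not say which meta-analytic estimator is used, so the sketch below assumes a fixed-effect, inverse-variance weighted average; the function name `inverse_variance_pool` and its arguments are hypothetical.

```python
import numpy as np


def inverse_variance_pool(estimates, variances):
    """Pool per-site coefficient estimates by inverse-variance weighting.

    estimates : array of shape (n_sites, n_coefficients)
    variances : array of the same shape with each estimate's variance
    Assumption: a fixed-effect meta-analysis; the source only says
    "meta-analysis" without naming the estimator.
    """
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)

    # Sites with smaller variance (more information) get larger weights.
    weights = 1.0 / var
    total_weight = weights.sum(axis=0)

    pooled = (weights * est).sum(axis=0) / total_weight
    pooled_variance = 1.0 / total_weight
    return pooled, pooled_variance
```

Only the summary arrays cross site boundaries in this step, which is what allows the initial estimates to be formed without sharing patient-level data.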
Simulation settings varying baseline outcome prevalence, baseline event rate, and size of lead site.
| Prevalence (%) | Event rate | Lead site size |
|---|---|---|
| 5 | 0.03 | 20,000 |
| 2.5 | 0.03 | 20,000 |
| 1 | 0.03 | 20,000 |
| 0.5 | 0.03 | 20,000 |
| 2.5 | 0.25 | 20,000 |
| 2.5 | 0.01 | 20,000 |
| 2.5 | 0.005 | 20,000 |
| 2.5 | 0.03 | 38,000 |
| 2.5 | 0.03 | 56,000 |
| 2.5 | 0.03 | 74,000 |
Summary statistics describing patient population across six CHOP primary care sites.
| | Site 1 | Site 2 | Site 3 | Site 4 | Site 5 | Site 6 | Total |
|---|---|---|---|---|---|---|---|
| | (n = 5456) | (n = 9111) | (n = 7893) | (n = 27,288) | (n = 7996) | (n = 13,074) | (n = 70,818) |
| Female | 2589 (47.5%) | 4427 (48.6%) | 3862 (48.9%) | 13,458 (49.3%) | 4013 (50.2%) | 6494 (49.7%) | 34,843 (49.2%) |
| Male | 2867 (52.5%) | 4684 (51.4%) | 4031 (51.1%) | 13,830 (50.7%) | 3983 (49.8%) | 6580 (50.3%) | 35,975 (50.8%) |
| Caucasian | 3476 (63.7%) | 5508 (60.5%) | 4783 (60.6%) | 15,747 (57.7%) | 4649 (58.1%) | 9158 (70.0%) | 43,321 (61.2%) |
| Other | 1980 (36.3%) | 3603 (39.5%) | 3110 (39.4%) | 11,541 (42.3%) | 3347 (41.9%) | 3916 (30.0%) | 27,497 (38.8%) |
| Mean (SD) | 8.02 (5.48) | 7.95 (5.58) | 7.77 (5.50) | 7.54 (5.60) | 7.60 (5.57) | 7.04 (5.37) | 7.57 (5.54) |
| Median [min, max] | 7.87 [0.0216, 18.0] | 7.67 [0.0181, 18.0] | 7.44 [0.0376, 17.9] | 6.79 [0.0158, 17.9] | 7.02 [0.0170, 17.9] | 6.10 [0.0202, 17.9] | 6.97 [0.0158, 18.0] |
| Public | 1997 (36.6%) | 3410 (37.4%) | 2339 (29.6%) | 9545 (35.0%) | 2477 (31.0%) | 3438 (26.3%) | 23,206 (32.8%) |
| Private/self-pay | 3459 (63.4%) | 5701 (62.6%) | 5554 (70.4%) | 17,743 (65.0%) | 5519 (69.0%) | 9636 (73.7%) | 47,612 (67.2%) |
| Mean (SD) | 5.19 (5.14) | 5.00 (4.75) | 4.84 (4.35) | 4.51 (4.59) | 5.34 (4.86) | 5.17 (4.85) | 4.88 (4.72) |
| Median [min, max] | 3.52 [0.243, 65.3] | 3.68 [0.276, 85.3] | 3.50 [0.276, 70.8] | 3.16 [0.238, 97.5] | 3.95 [0.233, 73.2] | 3.83 [0.253, 85.3] | 3.50 [0.233, 97.5] |
| At least one avoidable hospitalization (AH) | 71 (1.3%) | 70 (0.8%) | 33 (0.4%) | 878 (3.2%) | 76 (1.0%) | 396 (3.0%) | 1524 (2.2%) |
| No AHs | 5385 (98.7%) | 9041 (99.2%) | 7860 (99.6%) | 26,410 (96.8%) | 7920 (99.0%) | 12,678 (97.0%) | 69,294 (97.8%) |
| Mean (SD) | 1.38 (1.19) | 1.51 (1.82) | 1.48 (1.64) | 1.46 (1.16) | 1.46 (0.901) | 1.47 (1.58) | 1.47 (1.31) |
| Median [min, max] | 1.00 [1.00, 10.0] | 1.00 [1.00, 15.0] | 1.00 [1.00, 10.0] | 1.00 [1.00, 10.0] | 1.00 [1.00, 5.00] | 1.00 [1.00, 16.0] | 1.00 [1.00, 16.0] |
| Mean (SD) | 3.43 (2.05) | 4.67 (2.76) | 4.74 (2.75) | 4.72 (2.72) | 4.92 (2.77) | 4.75 (2.74) | 4.64 (2.72) |
| Median [min, max] | 3.25 [0.0766, 8.74] | 4.58 [0.0766, 8.74] | 4.83 [0.0766, 8.74] | 4.74 [0.0766, 8.74] | 5.08 [0.0766, 8.74] | 4.74 [0.0766, 8.74] | 4.58 [0.0766, 8.74] |
Figure 3. Distribution of total number of avoidable hospitalizations (AHs) for patients with at least one AH in CHOP data sample.
Figure 4. Map detailing locations of OneFlorida clinical partners.
Summary statistics describing patient population across three OneFlorida clinical sites.
| | Site 1 | Site 2 | Site 3 | Total |
|---|---|---|---|---|
| | (n = 48) | (n = 226) | (n = 386) | (n = 660) |
| Female | 22 (45.8%) | 90 (39.8%) | 178 (46.1%) | 290 (43.9%) |
| Male | 26 (54.2%) | 136 (60.2%) | 208 (53.9%) | 370 (56.1%) |
| Caucasian | 25 (52.1%) | 165 (73.0%) | 302 (78.2%) | 492 (74.5%) |
| Other | 23 (47.9%) | 61 (27.0%) | 84 (21.8%) | 168 (25.5%) |
| Mean (SD) | 51.8 (9.55) | 56.2 (11.9) | 57.2 (11.9) | 56.5 (11.8) |
| Yes | 12 (25.0%) | 9 (4.0%) | 226 (58.5%) | 247 (37.4%) |
| No | 36 (75.0%) | 217 (96.0%) | 160 (41.5%) | 413 (62.6%) |
| Mean (SD) | 5.23 (0.52) | 5.27 (0.75) | 5.24 (0.87) | 5.25 (0.81) |
| Mean (SD) | 1.81 (1.71) | 2.11 (2.19) | 1.47 (1.72) | 1.72 (1.91) |
| Patients with 0 SAEs | 12 (25%) | 53 (23.5%) | 151 (39.1%) | 216 (32.7%) |
| Zero-truncated mean (SD) | 2.42 (1.56) | 2.75 (2.11) | 2.42 (1.61) | 2.55 (1.82) |
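The zero-truncated mean in the last row can be related to an overall mean by simple arithmetic: zeros contribute nothing to the total count, so the mean among patients with at least one event is the overall mean rescaled by the share of non-zero patients. The sketch below assumes that the unlabeled "Mean (SD)" row directly above "Patients with 0 SAEs" is the overall SAE count per patient; that label is not recoverable from the table, and the function name is hypothetical.

```python
def zero_truncated_mean(overall_mean, n_total, n_zero):
    """Mean count among patients with at least one event.

    overall_mean : mean count over all patients, zeros included
    n_total      : total number of patients
    n_zero       : number of patients with a count of zero
    """
    n_positive = n_total - n_zero
    if n_positive <= 0:
        raise ValueError("need at least one patient with a positive count")
    # The total count is overall_mean * n_total, and all of it comes
    # from the n_positive patients with a non-zero count.
    return overall_mean * n_total / n_positive
```

Applied to the Total column under that assumption, `zero_truncated_mean(1.72, 660, 216)` gives about 2.56, in line with the reported 2.55 once rounding of the inputs is taken into account.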
Figure 5. Simulation results for estimating the zero-truncated Poisson component covariate coefficient. (A) Results for Setting A, fixing lead site size at 20,000 and the event-rate parameter at −3.6 while varying outcome prevalence. (B) Results for Setting B, fixing lead site size at 20,000 and the prevalence parameter at −3.7 (2.5% prevalence) while varying event rate. (C) Results for Setting C, fixing the prevalence parameter at −3.7 (2.5% prevalence) and the event rate while varying proportion of observations in lead site. Horizontal blue line represents the true value of the coefficient.
Figure 6. Plots depicting results from CHOP avoidable hospitalization analysis. Log odds ratio (A) and log relative risk (B) estimates (along with corresponding 95% confidence intervals) for each covariate in the fitted hurdle model. Dashed horizontal line represents pooled estimate, our gold standard for comparing methods.
Figure 7. Plots depicting results from OneFlorida serious adverse event application. Log odds ratio (A) and log relative risk (B) estimates (along with corresponding 95% confidence intervals) for each covariate in the fitted hurdle model. Dashed horizontal line represents pooled estimate, our gold standard for comparing methods.
Figure 8. Geographical map of 27 CHOP primary care sites across the greater Philadelphia region. In the left map, the proportion of patients of Caucasian race is depicted for each site. In the right map, the proportion of patients using public insurance (Medicaid) at each site is depicted. The size of each site on each map is proportional to the number of patients at the given site. Stars indicate sites used in our data analysis.