Teresa J Filshtein (1), Xiang Li (2), Scott C Zimmerman (1), Sarah F Ackley (1), M Maria Glymour (1), Melinda C Power (2)
1. Department of Epidemiology and Biostatistics, University of California San Francisco School of Public Health, San Francisco, CA.
2. Department of Epidemiology, Milken Institute School of Public Health, George Washington University, Washington, DC.
Abstract
BACKGROUND: Integrating results from multiple samples is often desirable, but privacy restrictions may preclude full data pooling, and most datasets do not include fully harmonized variable sets. We propose a simulation-based method that leverages partial information across datasets to guide the creation of synthetic data, based on explicit assumptions about the underlying causal structure. The synthetic data permit pooled analyses that adjust for all desired confounders while respecting privacy restrictions.
METHODS: This proof-of-concept project uses data from the Health and Retirement Study (HRS) and the Atherosclerosis Risk in Communities (ARIC) study. We specified an estimand of interest and a directed acyclic graph (DAG) summarizing the presumed causal structure for the effect of glycated hemoglobin (HbA1c) on cognitive change. We derived publicly reportable statistics to describe the joint distribution of the variables in our DAG. These summary estimates were used as data-generating rules to create synthetic datasets. After pooling, we imputed missing covariates in the synthetic datasets and used the synthetic data to estimate the pooled effect of HbA1c on cognitive change, adjusting for all desired covariates.
RESULTS: Covariate distributions, model coefficients, and associated standard errors for our model estimating the effect of HbA1c on cognitive change were similar between the cohort-specific original data and the preimputation synthetic data. The estimate from the pooled synthetic data incorporates control for confounders measured in either original dataset.
DISCUSSION: Our approach has advantages over meta-analysis or individual-level pooling/data harmonization when privacy concerns preclude data sharing and key confounders are not uniformly measured across datasets.
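The workflow in METHODS (summary statistics used as data-generating rules, pooling, imputation of unmeasured covariates, then a pooled adjusted estimate) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the variable names, the toy DAG (age and education affecting HbA1c and cognitive change), the linear-Gaussian rules, and all numeric values are hypothetical and are not taken from HRS or ARIC. A single stochastic regression imputation stands in for whatever imputation procedure the authors actually used.

```python
import numpy as np


def simulate_cohort(rules, n, rng):
    """Draw n synthetic records by walking the DAG in topological order.

    `rules` maps each variable to (intercept, {parent: slope}, residual_sd);
    the dict must list parents before their children.
    """
    data = {}
    for var, (intercept, slopes, resid_sd) in rules.items():
        mean = np.full(n, float(intercept))
        for parent, slope in slopes.items():
            mean += slope * data[parent]
        data[var] = mean + rng.normal(0.0, resid_sd, size=n)
    return data


def ols(y, *xs):
    """Ordinary least squares with an intercept; returns (coefficients, design)."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, X


# Hypothetical "publicly reportable" summary statistics for two cohorts.
cohort_a_rules = {
    "age": (70.0, {}, 8.0),
    "education": (12.0, {}, 3.0),
    "hba1c": (5.0, {"age": 0.01, "education": -0.02}, 0.8),
    "cog_change": (0.5, {"age": -0.01, "education": 0.02, "hba1c": -0.05}, 0.5),
}
cohort_b_rules = {  # education is not measured in this hypothetical cohort
    "age": (65.0, {}, 6.0),
    "hba1c": (5.2, {"age": 0.01}, 0.9),
    "cog_change": (0.6, {"age": -0.01, "hba1c": -0.06}, 0.5),
}

rng = np.random.default_rng(0)
n_a, n_b = 5000, 5000
syn_a = simulate_cohort(cohort_a_rules, n_a, rng)
syn_b = simulate_cohort(cohort_b_rules, n_b, rng)

# Pool the synthetic cohorts; covariates a cohort lacks become NaN.
pooled = {
    v: np.concatenate([syn_a[v], syn_b.get(v, np.full(n_b, np.nan))])
    for v in cohort_a_rules
}

# Impute the missing covariate from the rows where it was generated.
observed = ~np.isnan(pooled["education"])
predictors = ["age", "hba1c", "cog_change"]
beta_imp, X_obs = ols(
    pooled["education"][observed], *(pooled[p][observed] for p in predictors)
)
resid_sd = np.std(pooled["education"][observed] - X_obs @ beta_imp)
n_missing = int((~observed).sum())
X_mis = np.column_stack(
    [np.ones(n_missing), *(pooled[p][~observed] for p in predictors)]
)
pooled["education"][~observed] = X_mis @ beta_imp + rng.normal(0.0, resid_sd, n_missing)

# Pooled outcome model adjusting for every covariate in the toy DAG.
beta_out, _ = ols(
    pooled["cog_change"], pooled["hba1c"], pooled["age"], pooled["education"]
)
hba1c_effect = beta_out[1]
print(f"Pooled adjusted HbA1c coefficient: {hba1c_effect:.3f}")
```

Only the summary rules cross the privacy boundary in this sketch; individual records are regenerated on the analyst's side. A real application would need richer generating rules (nonlinear terms, categorical variables) and would propagate imputation uncertainty, for example through multiple imputation.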