Raymond Heatherly (1), Joshua C Denny (2), Jonathan L Haines (3), Dan M Roden (4), Bradley A Malin (5)
1. Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA. Electronic address: r.heatherly@vanderbilt.edu.
2. Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA; Department of Medicine, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA.
3. Department of Epidemiology and Biostatistics, University School of Medicine, Case Western Reserve University, USA.
4. Department of Medicine, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA; Department of Pharmacology, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA.
5. Department of Biomedical Informatics, School of Medicine, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA; Department of Electrical Engineering and Computer Science, School of Engineering, Vanderbilt University, 2525 West End Avenue, Suite 1030, Nashville, TN 37203, USA.
Abstract
OBJECTIVE: Electronic medical record (EMR) data are increasingly incorporated into genome-phenome association studies. Investigators hope to share these data, but there are concerns that they may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system, through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions.
METHODS: We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries made using the protected data with those made using the original data, through the correlation (r²) of the p-values of association significance.
RESULTS: Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than anonymizing small study-specific datasets unto themselves. When evaluated using the correlation of genome-phenome association strengths on anonymized data versus original data, across all nine simulated sites, results from the largest-scale anonymizations (population ∼100,000) retained better utility than those on smaller sizes (population ∼6000-75,000). We observed a general trend of increasing r² for larger data set sizes: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately-sized datasets, and r²=0.9934 for large-sized datasets.
CONCLUSIONS: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.
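The two measurements named in the METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the anonymization algorithm or whether p-values were transformed before correlation, so the sketch assumes (a) a record's quasi-identifier is its full set of clinical codes and (b) r² is the squared Pearson correlation of the raw p-values. The function names and the toy data are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every distinct set of clinical codes is shared by
    at least k records (assumption: the code set is the quasi-identifier)."""
    counts = Counter(frozenset(codes) for codes in records)
    return all(n >= k for n in counts.values())


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy data: each record is one patient's set of clinical codes.
records = [
    {"250.00", "401.9"},
    {"250.00", "401.9"},
    {"272.4"},
    {"272.4"},
]

# Hypothetical association p-values for the same tests, computed on the
# original data and on the anonymized data.
p_original = [0.001, 0.02, 0.30, 0.75]
p_anonymized = [0.002, 0.03, 0.28, 0.80]

if __name__ == "__main__":
    print("2-anonymous:", is_k_anonymous(records, 2))
    print("3-anonymous:", is_k_anonymous(records, 3))
    print("r^2 of p-values:", r_squared(p_original, p_anonymized))
```

An r² close to 1 would indicate that association p-values computed on the anonymized data track those from the original data closely, which is the sense of "utility" used in the RESULTS.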