Size matters: how population size influences genotype-phenotype association studies in anonymized data.

Raymond Heatherly, Joshua C Denny, Jonathan L Haines, Dan M Roden, Bradley A Malin.

Abstract

OBJECTIVE: Electronic medical record (EMR) data are increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns that it may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system, through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions.
METHODS: We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries using the protected data and the original data through the correlation (r²) of the p-values of association significance.
RESULTS: Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than anonymizing small study-specific datasets on their own. When evaluated using the correlation of genome-phenome association strengths on anonymized data versus original data across all nine simulated sites, results from the largest-scale anonymizations (population ∼100,000) retained better utility than those from smaller populations (∼6000-75,000). We observed a general trend of increasing r² for larger data set sizes: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately-sized datasets, and r²=0.9934 for large-sized datasets.
CONCLUSIONS: This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.
Copyright © 2014 Elsevier Inc. All rights reserved.
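
The two ideas named in the abstract, the k-anonymity property and the r² between p-values from original and anonymized data, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the toy clinical codes, and the toy p-values are hypothetical, and the abstract does not say whether p-values were transformed (for example, with a negative log) before correlating, so the sketch correlates them as given. It only checks whether a set of records is already k-anonymous; it does not perform the generalization or suppression a real anonymization algorithm would apply.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every distinct combination of codes is shared by >= k records."""
    counts = Counter(frozenset(codes) for codes in records)
    return all(n >= k for n in counts.values())


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical toy data, not taken from the study.
records = [{"250.00", "401.9"}, {"401.9", "250.00"}, {"272.4"}, {"272.4"}]
print(is_k_anonymous(records, k=2))

# Hypothetical p-values for the same associations before and after anonymization.
p_original = [0.001, 0.04, 0.20, 0.65]
p_anonymized = [0.002, 0.05, 0.18, 0.70]
print(round(r_squared(p_original, p_anonymized), 4))
```

An r² close to 1 in such a comparison would indicate that the anonymized data reproduce the association signals of the original data closely, which is the sense in which the abstract reports utility.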

Keywords:  Anonymization; Clinical codes; Data publishing; Privacy

Year:  2014        PMID: 25038554      PMCID: PMC4260994          DOI: 10.1016/j.jbi.2014.07.005

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317

