Anonymizing datasets with demographics and diagnosis codes in the presence of utility constraints.

Giorgos Poulis, Grigorios Loukides, Spiros Skiadopoulos, Aris Gkoulalas-Divanis.

Abstract

Publishing data about patients that contain both demographics and diagnosis codes is essential to perform large-scale, low-cost medical studies. However, preserving the privacy and utility of such data is challenging, because it requires: (i) guarding against identity disclosure (re-identification) attacks based on both demographics and diagnosis codes, (ii) ensuring that the anonymized data remain useful in intended analysis tasks, and (iii) minimizing the information loss, incurred by anonymization, to preserve the utility of general analysis tasks that are difficult to determine before data publishing. Existing anonymization approaches are not suitable for being used in this setting, because they cannot satisfy all three requirements. Therefore, in this work, we propose a new approach to deal with this problem. We enforce the requirement (i) by applying (k, k^m)-anonymity, a privacy principle that prevents re-identification from attackers who know the demographics of a patient and up to m of their diagnosis codes, where k and m are tunable parameters. To capture the requirement (ii), we propose the concept of utility constraint for both demographics and diagnosis codes. Utility constraints limit the amount of generalization and are specified by data owners (e.g., the healthcare institution that performs anonymization). We also capture requirement (iii), by employing well-established information loss measures for demographics and for diagnosis codes. To realize our approach, we develop an algorithm that enforces (k, k^m)-anonymity on a dataset containing both demographics and diagnosis codes, in a way that satisfies the specified utility constraints and with minimal information loss, according to the measures. Our experiments with a large dataset containing more than 200,000 electronic health records show the effectiveness and efficiency of our algorithm.
Copyright © 2016 Elsevier Inc. All rights reserved.
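
The privacy principle named in the abstract, (k, k^m)-anonymity, can be illustrated with a small check. The sketch below is not the authors' algorithm: it only tests whether a dataset already satisfies one common reading of the principle, namely that every combination of a record's demographics with up to m of its diagnosis codes matches at least k records. The function name `is_k_km_anonymous`, the record layout (a demographics tuple plus a set of codes), and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def is_k_km_anonymous(records, k, m):
    """Check an assumed reading of (k, k^m)-anonymity.

    records: iterable of (demographics, codes) pairs, where demographics
    is a tuple of (possibly generalized) values and codes is a set of
    diagnosis codes. Hypothetical layout, not taken from the paper.
    """
    # Group the code sets of all records by their exact demographic tuple.
    groups = defaultdict(list)
    for demographics, codes in records:
        groups[tuple(demographics)].append(frozenset(codes))

    for code_sets in groups.values():
        # An attacker who knows only the demographics must see >= k records.
        if len(code_sets) < k:
            return False
        # Count how many records in the group contain each set of 1..m codes.
        support = defaultdict(int)
        for codes in code_sets:
            for size in range(1, min(m, len(codes)) + 1):
                for itemset in combinations(sorted(codes), size):
                    support[itemset] += 1
        # Every code combination that occurs must be shared by >= k records.
        if any(count < k for count in support.values()):
            return False
    return True


# Made-up sample: three records with identical generalized demographics.
sample = [
    (("[30-40]", "M"), {"250.00", "401.9"}),
    (("[30-40]", "M"), {"250.00", "401.9"}),
    (("[30-40]", "M"), {"250.00"}),
]
print(is_k_km_anonymous(sample, k=2, m=2))  # True
print(is_k_km_anonymous(sample, k=3, m=1))  # False: "401.9" appears in only 2 records
```

A check like this says nothing about utility constraints or information loss; it covers only requirement (i) and would typically be run on the output of an anonymization step.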

Keywords:  Demographics; Diagnosis codes; Generalization; Privacy; Suppression; Utility constraints

Year:  2016        PMID: 27832965     DOI: 10.1016/j.jbi.2016.11.001

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  3 in total

1.  Returning to our roots: The use of geospatial data for nurse-led community research.

Authors:  Kelli N DePriest; Timothy M Shields; Frank C Curriero
Journal:  Res Nurs Health       Date:  2019-10-10       Impact factor: 2.228

2.  Privacy Policy and Technology in Biomedical Data Science.

Authors:  April Moreno Arellano; Wenrui Dai; Shuang Wang; Xiaoqian Jiang; Lucila Ohno-Machado
Journal:  Annu Rev Biomed Data Sci       Date:  2018-07

3.  Probabilistic record linkage of de-identified research datasets with discrepancies using diagnosis codes.

Authors:  Boris P Hejblum; Griffin M Weber; Katherine P Liao; Nathan P Palmer; Susanne Churchill; Nancy A Shadick; Peter Szolovits; Shawn N Murphy; Isaac S Kohane; Tianxi Cai
Journal:  Sci Data       Date:  2019-01-08       Impact factor: 6.444
