Optimizing annotation resources for natural language de-identification via a game theoretic framework.

Muqun Li, David Carrell, John Aberdeen, Lynette Hirschman, Jacqueline Kirby, Bo Li, Yevgeniy Vorobeychik, Bradley A Malin.

Abstract

OBJECTIVE: Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as to support translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible to find and detect potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, such an assumption may not hold true and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, which enables an HCO to optimize their investments given the expected capabilities of an adversarial recipient.
METHODS: We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize their investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. In addition, we compare policy options where, when an attacker is fined for misuse, a monetary penalty is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator).
RESULTS: Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), the precision and recall of the HCO's de-identification are 0.86 and 0.8, respectively. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending all of its budget on de-identification and still deter the attacker, with a precision of 0.77 and a recall of 0.61 for the de-identification. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to an HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation.
CONCLUSIONS: A game theoretic framework can guide HCOs toward optimized decision making about natural language de-identification investments before they share EMR data.
Copyright © 2016 Elsevier Inc. All rights reserved.
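
The leader-follower structure described under METHODS and RESULTS can be illustrated with a toy model. The sketch below is not the authors' formulation: the abstract gives no payoff functions, so every payoff term, the names `Option`, `Params`, and `best_strategy`, and all numeric values are assumptions made for illustration. Only the $2000 budget and the two precision/recall pairs (0.86/0.80 and 0.77/0.61) are taken from the abstract. The HCO (leader) commits to a de-identification option or withholds the data; the attacker (follower) observes that choice and attacks only if the expected payoff is positive.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Option:
    """One de-identification investment level the HCO (leader) can commit to."""
    name: str
    cost: float       # dollars spent on annotation and model training
    precision: float  # share of removed terms that really were identifiers
    recall: float     # share of true identifiers that were removed


@dataclass(frozen=True)
class Params:
    """Hypothetical game parameters; none of these values come from the study."""
    n_identifiers: int = 100              # true identifiers in the shared notes
    benefit: float = 3000.0               # value to the HCO of sharing the data
    utility_loss_per_fp: float = 5.0      # loss per non-identifier term removed
    hco_loss_per_leak: float = 15.0       # HCO harm per exploited identifier
    attacker_gain_per_leak: float = 20.0  # attacker gain per leaked identifier
    attack_cost: float = 500.0            # attacker's fixed cost of an attack
    expected_fine: float = 400.0          # detection probability times the fine
    fine_to_hco: bool = False             # True: the fine is paid to the HCO


def leaked(opt: Option, p: Params) -> float:
    """Expected number of identifiers left in the released text."""
    return p.n_identifiers * (1.0 - opt.recall)


def attacker_attacks(opt: Option, p: Params) -> bool:
    """Follower's best response: attack only if the expected payoff is positive."""
    payoff = p.attacker_gain_per_leak * leaked(opt, p) - p.attack_cost - p.expected_fine
    return payoff > 0.0


def hco_payoff(opt: Option, p: Params) -> float:
    """Leader's payoff, given that the attacker best-responds to `opt`."""
    true_pos = p.n_identifiers * opt.recall
    false_pos = true_pos * (1.0 - opt.precision) / opt.precision
    payoff = p.benefit - opt.cost - p.utility_loss_per_fp * false_pos
    if attacker_attacks(opt, p):
        payoff -= p.hco_loss_per_leak * leaked(opt, p)
        if p.fine_to_hco:
            payoff += p.expected_fine
    return payoff


def best_strategy(options: Sequence[Option], p: Params) -> Tuple[Optional[Option], float]:
    """Return the leader's best option; None means 'do not release' (payoff 0)."""
    best, best_payoff = None, 0.0
    for opt in options:
        value = hco_payoff(opt, p)
        if value > best_payoff:
            best, best_payoff = opt, value
    return best, best_payoff


OPTIONS = (
    Option("full", 2000.0, 0.86, 0.80),    # budget and P/R from the abstract
    Option("partial", 800.0, 0.77, 0.61),  # P/R from the abstract, cost assumed
    Option("minimal", 100.0, 0.60, 0.30),  # entirely hypothetical
)
```

Under these made-up parameters, `best_strategy(OPTIONS, Params())` selects the "partial" option and the attacker is deterred, which mirrors the qualitative finding that an HCO need not exhaust its budget. Setting `fine_to_hco=True` shifts the optimum to the "minimal" option, which invites the attack, a toy version of the baiting incentive. Lowering `benefit` to 500 makes withholding the data the best choice.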

Keywords:  Electronic medical records; Game theory; Natural language processing; Privacy

Year:  2016        PMID: 27020263      PMCID: PMC4996128          DOI: 10.1016/j.jbi.2016.03.019

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


