Literature DB >> 28943741

Scalable Iterative Classification for Sanitizing Large-Scale Datasets.

Bo Li1, Yevgeniy Vorobeychik1, Muqun Li1, Bradley Malin1.   

Abstract

Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.

Entities:  

Keywords:  Privacy preserving; game theory; weak structured data sanitization

Year:  2016        PMID: 28943741      PMCID: PMC5607782          DOI: 10.1109/TKDE.2016.2628180

Source DB:  PubMed          Journal:  IEEE Trans Knowl Data Eng        ISSN: 1041-4347            Impact factor:   6.977


  21 in total

1.  Standards for privacy of individually identifiable health information. Office of the Assistant Secretary for Planning and Evaluation, DHHS. Final rule.

Authors: 
Journal:  Fed Regist       Date:  2000-12-28

2.  Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text.

Authors:  David Carrell; Bradley Malin; John Aberdeen; Samuel Bayer; Cheryl Clark; Ben Wellner; Lynette Hirschman
Journal:  J Am Med Inform Assoc       Date:  2012-07-06       Impact factor: 4.497

3.  Recognizing Medication related Entities in Hospital Discharge Summaries using Support Vector Machine.

Authors:  Son Doan; Hua Xu
Journal:  Proc Int Conf Comput Ling       Date:  2010-08

4.  A de-identifier for medical discharge summaries.

Authors:  Ozlem Uzuner; Tawanda C Sibanda; Yuan Luo; Peter Szolovits
Journal:  Artif Intell Med       Date:  2007-11-28       Impact factor: 5.326

5.  Replacing personally-identifying information in medical records, the Scrub system.

Authors:  L Sweeney
Journal:  Proc AMIA Annu Fall Symp       Date:  1996

6.  Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.

Authors:  David S Carrell; David J Cronkite; Bradley A Malin; John S Aberdeen; Lynette Hirschman
Journal:  Methods Inf Med       Date:  2016-07-13       Impact factor: 2.176

7.  De-identification of health records using Anonym: effectiveness and robustness across datasets.

Authors:  Guido Zuccon; Daniel Kotzur; Anthony Nguyen; Anton Bergheim
Journal:  Artif Intell Med       Date:  2014-04-03       Impact factor: 5.326

8.  The MITRE Identification Scrubber Toolkit: design, training, and assessment.

Authors:  John Aberdeen; Samuel Bayer; Reyyan Yeniterzi; Ben Wellner; Cheryl Clark; David Hanauer; Bradley Malin; Lynette Hirschman
Journal:  Int J Med Inform       Date:  2010-10-14       Impact factor: 4.046

9.  A game theoretic framework for analyzing re-identification risk.

Authors:  Zhiyu Wan; Yevgeniy Vorobeychik; Weiyi Xia; Ellen Wright Clayton; Murat Kantarcioglu; Ranjit Ganta; Raymond Heatherly; Bradley A Malin
Journal:  PLoS One       Date:  2015-03-25       Impact factor: 3.240

10.  Development and evaluation of an open source software tool for deidentification of pathology reports.

Authors:  Bruce A Beckwith; Rajeshwarri Mahaadevan; Ulysses J Balis; Frank Kuo
Journal:  BMC Med Inform Decis Mak       Date:  2006-03-06       Impact factor: 2.796

View more
  3 in total

1.  The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight.

Authors:  David S Carrell; David J Cronkite; Muqun Rachel Li; Steve Nyemba; Bradley A Malin; John S Aberdeen; Lynette Hirschman
Journal:  J Am Med Inform Assoc       Date:  2019-12-01       Impact factor: 4.497

2.  Scalable Iterative Classification for Sanitizing Large-Scale Datasets.

Authors:  Bo Li; Yevgeniy Vorobeychik; Muqun Li; Bradley Malin
Journal:  IEEE Trans Knowl Data Eng       Date:  2016-11-11       Impact factor: 6.977

3.  Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

Authors:  David S Carrell; Bradley A Malin; David J Cronkite; John S Aberdeen; Cheryl Clark; Muqun Rachel Li; Dikshya Bastakoty; Steve Nyemba; Lynette Hirschman
Journal:  J Am Med Inform Assoc       Date:  2020-07-01       Impact factor: 4.497

  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.