Bo Li, Yevgeniy Vorobeychik, Muqun Li, Bradley Malin.
Abstract
Cheap ubiquitous computing enables the collection of massive amounts of personal data in a wide variety of domains. Many organizations aim to share such data while obscuring features that could disclose personally identifiable information. Much of this data exhibits weak structure (e.g., text), such that machine learning approaches have been developed to detect and remove identifiers from it. While learning is never perfect, and relying on such approaches to sanitize data can leak sensitive information, a small risk is often acceptable. Our goal is to balance the value of published data and the risk of an adversary discovering leaked identifiers. We model data sanitization as a game between 1) a publisher who chooses a set of classifiers to apply to data and publishes only instances predicted as non-sensitive and 2) an attacker who combines machine learning and manual inspection to uncover leaked identifying information. We introduce a fast iterative greedy algorithm for the publisher that ensures a low utility for a resource-limited adversary. Moreover, using five text data sets we illustrate that our algorithm leaves virtually no automatically identifiable sensitive instances for a state-of-the-art learning algorithm, while sharing over 93% of the original data, and completes after at most 5 iterations.
Keywords: Privacy preserving; game theory; weak structured data sanitization
Year: 2016 PMID: 28943741 PMCID: PMC5607782 DOI: 10.1109/TKDE.2016.2628180
Source DB: PubMed Journal: IEEE Trans Knowl Data Eng ISSN: 1041-4347 Impact factor: 6.977
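
Illustrative sketch (not the paper's algorithm): the abstract describes a publisher that iteratively trains classifiers and withholds instances still predicted as sensitive until a resource-limited adversary can find little leaked identifying information. The Python sketch below shows one way such a loop could look under simplifying assumptions; the function name sanitize_iteratively, its parameters, and the TF-IDF plus logistic-regression detector standing in for the adversary's learner are assumptions made here for illustration, not details taken from the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def sanitize_iteratively(texts, is_sensitive, max_iters=5):
    """Iteratively withhold instances that a detector trained on the
    remaining data still flags as containing identifiers.

    texts: list of candidate instances the publisher wants to release.
    is_sensitive: list of 0/1 labels, 1 if the instance is known (e.g.,
        via annotation) to contain identifying information.
    Returns the indices of instances deemed safe to publish.
    (Illustrative sketch only; not the authors' exact procedure.)
    """
    remaining = list(range(len(texts)))
    for _ in range(max_iters):
        labels = [is_sensitive[i] for i in remaining]
        if len(set(labels)) < 2:
            break  # detector can no longer be trained; stop

        # Stand-in for the adversary's learner: TF-IDF + logistic regression.
        detector = make_pipeline(TfidfVectorizer(),
                                 LogisticRegression(max_iter=1000))
        detector.fit([texts[i] for i in remaining], labels)

        # Withhold everything the detector still predicts as sensitive.
        preds = detector.predict([texts[i] for i in remaining])
        kept = [i for i, p in zip(remaining, preds) if p == 0]

        if len(kept) == len(remaining):
            break  # fixed point: no further instances are flagged
        remaining = kept
    return remaining

Retraining on the shrinking candidate set each round loosely mirrors the iterative idea in the abstract: once easily detectable identifiers are withheld, the next detector focuses on instances a learning-based adversary might still uncover, and the loop stops when no further instances are flagged or the iteration budget is exhausted.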