A unified framework for evaluating the risk of re-identification of text de-identification tools.

Martin Scaiano, Grant Middleton, Luk Arbuckle, Varada Kolhatkar, Liam Peyton, Moira Dowling, Debbie S Gipson, Khaled El Emam.

Abstract

OBJECTIVES: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases.
METHODS: We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method.
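As a rough illustration of the baseline mentioned above, the following sketch computes a micro-averaged recall alongside per-type recall. The identifier types and counts are hypothetical and are not taken from the University of Michigan data set; the function names are likewise assumptions made for this example. It shows only why a pooled recall figure can look reassuring while a rare identifier type is frequently missed.

```python
# Hypothetical counts: annotated identifiers per type (gold) and how many
# of those the de-identification tool detected (found).
gold = {"DATE": 900, "NAME": 90, "ID": 10}
found = {"DATE": 891, "NAME": 81, "ID": 5}


def micro_recall(gold, found):
    """Pool all identifier instances, then compute a single recall value."""
    total_gold = sum(gold.values())
    total_found = sum(found[t] for t in gold)
    return total_found / total_gold


def per_type_recall(gold, found):
    """Compute recall separately for each identifier type."""
    return {t: found[t] / gold[t] for t in gold}


micro = micro_recall(gold, found)
by_type = per_type_recall(gold, found)

print(f"micro-averaged recall: {micro:.3f}")
for id_type, recall in by_type.items():
    print(f"{id_type}: {recall:.2f}")
```

With these made-up counts, the pooled figure is dominated by the frequent DATE type, so the low recall on the rare ID type barely affects it.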
RESULTS: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework, we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data.
DISCUSSION: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
CONCLUSIONS: This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools.
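The abstract reports that the 95% confidence intervals of the risk estimates fell below the relevant thresholds, without giving the threshold values or the estimator. The sketch below shows one generic way such a check could be carried out, assuming a normal approximation to the interval of a mean. The risk values, the thresholds, and the function names are all hypothetical; this is not the estimator used in the paper.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, lower, upper) using a normal-approximation 95% interval."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    half_width = 1.96 * std_err
    return mean, mean - half_width, mean + half_width


def below_threshold(values, threshold):
    """True when the upper end of the 95% interval is under the threshold."""
    _, _, upper = mean_ci95(values)
    return upper < threshold


# Hypothetical per-sample risk estimates, not values from the study.
risks = [0.004, 0.006, 0.005, 0.007, 0.003]
mean, lower, upper = mean_ci95(risks)

print(f"mean={mean:.4f}, 95% CI=({lower:.4f}, {upper:.4f})")
print("acceptable at 0.01: ", below_threshold(risks, 0.01))   # hypothetical threshold
print("acceptable at 0.006:", below_threshold(risks, 0.006))  # hypothetical threshold
```

Comparing the upper bound of the interval, rather than the mean alone, is the more conservative choice: it asks that the risk be acceptable even at the pessimistic end of the estimate. With only a handful of samples, a normal approximation is crude, and a t-based or bootstrap interval may be more appropriate.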

Keywords: Data sharing; De-identification; Evaluation framework; Medical text; Natural language processing; Re-identification risk

Year: 2016 PMID: 27426236 DOI: 10.1016/j.jbi.2016.07.015

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

Cited

6 in total

1. Ethics and Epistemology in Big Data Research.

Authors: Wendy Lipworth; Paul H Mason; Ian Kerridge; John P A Ioannidis
Journal: J Bioeth Inq Date: 2017-03-20 Impact factor: 1.352

2. Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing. (Review)

Authors: A Névéol; P Zweigenbaum
Journal: Yearb Med Inform Date: 2017-09-11

3. The UAB Informatics Institute and 2016 CEGS N-GRID de-identification shared task challenge.

Authors: Duy Duc An Bui; Mathew Wyatt; James J Cimino
Journal: J Biomed Inform Date: 2017-05-03 Impact factor: 6.317

4. Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms: A case study on psychiatric evaluation notes.

Authors: Azad Dehghan; Aleksandar Kovacevic; George Karystianis; John A Keane; Goran Nenadic
Journal: J Biomed Inform Date: 2017-06-07 Impact factor: 6.317

5. Evaluating the re-identification risk of a clinical study report anonymized under EMA Policy 0070 and Health Canada Regulations.

Authors: Janice Branson; Nathan Good; Jung-Wei Chen; Will Monge; Christian Probst; Khaled El Emam
Journal: Trials Date: 2020-02-18 Impact factor: 2.279

6. Deidentification of free-text medical records using pre-trained bidirectional transformers.

Authors: Alistair E W Johnson; Lucas Bulgarelli; Tom J Pollard
Journal: Proc ACM Conf Health Inference Learn (2020) Date: 2020-04-02
