De-identification of clinical narratives through writing complexity measures.

Muqun Li, David Carrell, John Aberdeen, Lynette Hirschman, Bradley A Malin.

Abstract

PURPOSE: Electronic health records contain a substantial quantity of clinical narrative, which is increasingly reused for research purposes. To share data on a large scale while respecting privacy, it is critical to remove patient identifiers. De-identification tools based on machine learning have been proposed; however, model training is usually based on either a random group of documents or a pre-existing document type designation (e.g., discharge summary). This work investigates whether inherent features, such as writing complexity, can identify document subsets that enhance de-identification performance.
METHODS: We applied an unsupervised clustering method to group two corpora based on writing complexity measures: a collection of over 4500 documents of varying document types (e.g., discharge summaries, history and physical reports, and radiology reports) from Vanderbilt University Medical Center (VUMC) and the publicly available i2b2 corpus of 889 discharge summaries. We compared the performance (via recall, precision, and F-measure) of de-identification models trained on such clusters with that of models trained on documents grouped randomly or by VUMC document type.
RESULTS: For the Vanderbilt dataset, training and testing de-identification models on the same stylometric cluster (average F-measure of 0.917) tended to outperform models based on clusters of random documents (average F-measure of 0.881). Increasing the size of a training subset sampled from a specific cluster could also yield improved results (e.g., for subsets from a certain stylometric cluster, the F-measure rose from 0.743 to 0.841 when the training size increased from 10 to 50 documents, and reached 0.901 when the training subset reached 200 documents). For the i2b2 dataset, training and testing on the same clusters based on complexity measures (average F-measure of 0.966) did not significantly surpass randomly selected clusters (average F-measure of 0.965).
CONCLUSIONS: Our findings illustrate that, in environments consisting of a variety of clinical documentation, de-identification models trained on document clusters defined by writing complexity measures outperform models trained on random groups and, in many instances, models trained on groups defined by document type.
Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
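
As a rough illustration of the ideas in the abstract, the following Python sketch computes a few simple writing-complexity measures for a document and the precision, recall, and F-measure used to score de-identification output. The abstract does not list the specific complexity measures or the clustering method the authors used, so the three features here (type-token ratio, mean sentence length, mean word length) and the function names `complexity_features` and `precision_recall_f` are assumptions chosen for illustration, not the authors' implementation.

```python
import re


def complexity_features(text):
    """Return illustrative writing-complexity measures for one document.

    These features are assumptions for this sketch; the abstract does not
    say which measures the study actually used.
    """
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[a-z']+", text.lower())
    if not words or not sentences:
        return {"type_token_ratio": 0.0,
                "mean_sentence_len": 0.0,
                "mean_word_len": 0.0}
    return {
        # Distinct words divided by total words (lexical diversity).
        "type_token_ratio": len(set(words)) / len(words),
        # Average number of words per sentence.
        "mean_sentence_len": len(words) / len(sentences),
        # Average number of characters per word.
        "mean_word_len": sum(len(w) for w in words) / len(words),
    }


def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and balanced F-measure from counts.

    tp: identifiers correctly detected; fp: non-identifiers flagged;
    fn: identifiers missed.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    note = "The cat sat. The dog sat."
    print(complexity_features(note))
    print(precision_recall_f(tp=90, fp=10, fn=30))
```

Feature vectors like these could be passed to any unsupervised clustering routine to form document groups, with a separate de-identification model trained and scored per group; that pipeline is left out here because the abstract does not specify it.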

Keywords:  Electronic medical records; Natural language processing; Privacy

Year: 2014; PMID: 25106934; PMCID: PMC4215974; DOI: 10.1016/j.ijmedinf.2014.07.002

Source DB: PubMed; Journal: Int J Med Inform; ISSN: 1386-5056; Impact factor: 4.046


