Preparing an annotated gold standard corpus to share with extramural investigators for de-identification research.

Todd Lingren, Yizhao Ni, Louise Deleger, Megan Kaiser, Laura Stoutenborough, Keith Marsolo, Michal Kouril, Katalin Molnar, Imre Solti.

Abstract

OBJECTIVE: The current study aims to fill the gap in available healthcare de-identification resources by creating a new sharable dataset that contains realistic Protected Health Information (PHI) without reducing the value of the data for de-identification research. By releasing the annotated gold standard corpus under a Data Use Agreement, we hope to encourage other computational linguists to experiment with our data and develop new machine learning models for de-identification. This paper describes: (1) the modifications required by the Institutional Review Board before sharing the de-identification gold standard corpus; (2) our efforts to keep the PHI as realistic as possible; and (3) the tests that show the effectiveness of these efforts in preserving the value of the modified data set for machine learning model development.
MATERIALS AND METHODS: In a previous study we built an original de-identification gold standard corpus annotated with true PHI from 3503 randomly selected clinical notes covering the 22 most frequent clinical note types at our institution. In the current study we modified the original gold standard corpus to make it suitable for external sharing by replacing HIPAA-specified PHI with newly generated realistic PHI. We then evaluated the research value of this new dataset by comparing the performance of an existing, published in-house de-identification system trained on the new de-identification gold standard corpus with the performance of the same system trained on the original corpus. We assessed the potential benefits of using the new de-identification gold standard corpus to identify PHI in the i2b2 and PhysioNet datasets, which were released by other groups for de-identification research. We also measured the effectiveness of the i2b2 and PhysioNet de-identification gold standard corpora in identifying PHI in our original clinical notes.
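As an illustration of the replacement step, the sketch below swaps annotated PHI spans for surrogates of the same type and recomputes the annotation offsets so the gold standard stays aligned with the modified text. It is a minimal sketch under stated assumptions, not the procedure used in the study: the `SURROGATES` pools, the `replace_phi` function, the `(start, end, type)` span format, and the choice to map repeated mentions within a note to the same surrogate are all hypothetical.

```python
import random

# Hypothetical surrogate pools (assumption); a real resource would be far larger.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Rivera", "Taylor Brooks"],
    "LOCATION": ["Springfield", "Riverton", "Lakewood"],
}


def replace_phi(text, spans, rng):
    """Replace annotated PHI spans with surrogates of the same type.

    spans: non-overlapping (start, end, phi_type) character offsets.
    Returns (new_text, new_spans) with offsets recomputed for new_text.
    """
    mapping = {}  # same original mention -> same surrogate within this note
    pieces, new_spans = [], []
    cursor = 0    # position in the original text
    out_len = 0   # length of the output built so far
    for start, end, phi_type in sorted(spans):
        key = (phi_type, text[start:end])
        if key not in mapping:
            mapping[key] = rng.choice(SURROGATES[phi_type])
        surrogate = mapping[key]
        # Copy the untouched text between the previous span and this one.
        pieces.append(text[cursor:start])
        out_len += start - cursor
        # Emit the surrogate and record where it lands in the output.
        pieces.append(surrogate)
        new_spans.append((out_len, out_len + len(surrogate), phi_type))
        out_len += len(surrogate)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), new_spans


note = "John Smith was seen in Cincinnati. John Smith returns Monday."
annotations = [(0, 10, "NAME"), (23, 33, "LOCATION"), (35, 45, "NAME")]
shared_note, shared_annotations = replace_phi(note, annotations, random.Random(0))
```

Returning the shifted spans together with the text matters because surrogates rarely have the same length as the strings they replace, so the original offsets would no longer point at the PHI.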
RESULTS: Performance of the de-identification system trained on the new gold standard corpus was very close to its performance when trained on the original corpus (overall F-measures of 92.56 vs. 93.48). The best i2b2/PhysioNet/CCHMC cross-training performance was obtained when training on the new shared CCHMC gold standard corpus, although it remained lower than that of corpus-specific training.
DISCUSSION AND CONCLUSION: We successfully modified a de-identification dataset for external sharing while preserving the de-identification research value of the modified gold standard corpus, with only a limited drop in machine learning de-identification performance.
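For readers unfamiliar with the metric, the F-measure reported above combines precision and recall over the PHI annotations. The sketch below is a minimal illustration that assumes exact-match scoring of `(start, end, type)` spans; the abstract does not state the matching criterion, and the `f_measure` function and the example spans are hypothetical.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F) for exact-match span scoring.

    gold, predicted: iterables of hashable spans, e.g. (start, end, phi_type).
    beta=1.0 gives the balanced F1 measure.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical spans: three of four predictions match the gold standard exactly.
gold_spans = [(0, 10, "NAME"), (23, 33, "LOCATION"), (35, 45, "NAME"), (54, 60, "DATE")]
pred_spans = [(0, 10, "NAME"), (23, 33, "LOCATION"), (35, 45, "NAME"), (46, 53, "NAME")]
precision, recall, f1 = f_measure(gold_spans, pred_spans)
```

Comparing two training corpora then amounts to computing this score for the same system and test set under each training condition.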
Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.

Keywords: Automated de-identification; De-identification gold standard; Health Insurance Portability and Accountability Act; Natural Language Processing; Privacy of patient data; Protected Health Information

Year: 2014; PMID: 24556292; PMCID: PMC4125487; DOI: 10.1016/j.jbi.2014.01.014

Source DB: PubMed; Journal: J Biomed Inform; ISSN: 1532-0464; Impact factor: 6.317