De-identification of clinical notes in French: towards a protocol for reference corpus development.

Cyril Grouin, Aurélie Névéol.

Abstract

BACKGROUND: To facilitate research applying Natural Language Processing to clinical documents, tools and resources are needed for the automatic de-identification of Electronic Health Records.
OBJECTIVE: This study investigates methods for developing a high-quality reference corpus for the de-identification of clinical documents in French.
METHODS: A corpus comprising a variety of clinical document types covering several medical specialties was pre-processed with two automatic de-identification systems from the MEDINA suite of tools: a rule-based system and a system using Conditional Random Fields (CRF). The pre-annotated documents were revised by two human annotators trained to mark ten categories of Protected Health Information (PHI). The human annotators worked independently and were blind to the system that produced the pre-annotations they were revising. The best pre-annotation system was applied to another random selection of 100 documents. After revision by one annotator, this set was used to train a statistical de-identification system.
RESULTS: Two gold standard sets of 100 documents were created based on the consensus of two human revisions of the automatic pre-annotations. The annotation experiment showed that (i) automatic pre-annotation obtained with the rule-based system performed better (F=0.813) than the CRF system (F=0.519), (ii) the human annotators spent more time revising the pre-annotations obtained with the rule-based system (from 102 to 160 minutes for 50 documents), compared to the CRF system (from 93 to 142 minutes for 50 documents), (iii) the quality of human annotation is higher when pre-annotations are obtained with the rule-based system (F-measure ranging from 0.970 to 0.987), compared to the CRF system (F-measure ranging from 0.914 to 0.981). Finally, only 20 documents from the training set were needed for the statistical system to outperform the pre-annotation systems that were trained on corpora from a medical speciality and hospital different from those in the reference corpus developed herein.
CONCLUSION: We find that better pre-annotations increase the quality of the reference corpus but require more revision time. A statistical de-identification method outperforms our rule-based system when as few as 20 custom training documents are available.
Copyright © 2013 Elsevier Inc. All rights reserved.
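
The F-measures reported in the RESULTS are the harmonic mean of precision and recall over PHI annotations. The sketch below shows one way such a score could be computed. It assumes exact matching on span offsets and PHI category, which the abstract does not specify. The `Span` representation, the `f_measure` helper, and the toy annotations are hypothetical and do not come from the study.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (document id, start offset, end offset, PHI category).
Span = Tuple[str, int, int, str]


def f_measure(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), assuming exact span-and-category matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data, for illustration only (not taken from the corpus).
gold = {
    ("doc1", 0, 5, "NAME"),
    ("doc1", 20, 30, "DATE"),
    ("doc2", 3, 9, "CITY"),
}
pre_annotated = {
    ("doc1", 0, 5, "NAME"),
    ("doc1", 20, 30, "DATE"),
    ("doc2", 12, 15, "PHONE"),
}

precision, recall, f1 = f_measure(pre_annotated, gold)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

The same function could score either a system's pre-annotations or a human revision against the consensus gold standard. A relaxed (overlap-based) matching criterion would yield different values.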

Keywords:  Confidentiality; Electronic Health Records; France; Information Dissemination; Natural Language Processing

Year:  2013        PMID: 24380818     DOI: 10.1016/j.jbi.2013.12.014

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  9 in total

Review 1.  Clinical Natural Language Processing in 2014: Foundational Methods Supporting Efficient Healthcare.

Authors:  A Névéol; P Zweigenbaum
Journal:  Yearb Med Inform       Date:  2015-08-13

2.  De-identifying Spanish medical texts - named entity recognition applied to radiology reports.

Authors:  Irene Pérez-Díez; Raúl Pérez-Moraga; Adolfo López-Cerdán; Jose-Maria Salinas-Serrano; María de la Iglesia-Vayá
Journal:  J Biomed Semantics       Date:  2021-03-29

Review 3.  Clinical Data Reuse or Secondary Use: Current Status and Potential Future Progress.

Authors:  S M Meystre; C Lovis; T Bürkle; G Tognola; A Budrionis; C U Lehmann
Journal:  Yearb Med Inform       Date:  2017-09-11

4.  Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.

Authors:  David S Carrell; David J Cronkite; Bradley A Malin; John S Aberdeen; Lynette Hirschman
Journal:  Methods Inf Med       Date:  2016-07-13       Impact factor: 2.176

Review 5.  Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis.

Authors:  S Velupillai; D Mowery; B R South; M Kvist; H Dalianis
Journal:  Yearb Med Inform       Date:  2015-08-13

6.  The OpenDeID corpus for patient de-identification.

Authors:  Jitendra Jonnagaddala; Aipeng Chen; Sean Batongbacal; Chandini Nekkantti
Journal:  Sci Rep       Date:  2021-10-07       Impact factor: 4.379

Review 7.  Clinical Natural Language Processing in languages other than English: opportunities and challenges.

Authors:  Aurélie Névéol; Hercules Dalianis; Sumithra Velupillai; Guergana Savova; Pierre Zweigenbaum
Journal:  J Biomed Semantics       Date:  2018-03-30

8.  Design of an extensive information representation scheme for clinical narratives.

Authors:  Louise Deléger; Leonardo Campillos; Anne-Laure Ligozat; Aurélie Névéol
Journal:  J Biomed Semantics       Date:  2017-09-11

9.  De-identifying free text of Japanese electronic health records.

Authors:  Kohei Kajiyama; Hiromasa Horiguchi; Takashi Okumura; Mizuki Morita; Yoshinobu Kano
Journal:  J Biomed Semantics       Date:  2020-09-21