De-identification of clinical notes in French: towards a protocol for reference corpus development.

Abstract

BACKGROUND: To facilitate research applying Natural Language Processing to clinical documents, tools and resources are needed for the automatic de-identification of Electronic Health Records.
OBJECTIVE: This study investigates methods for developing a high-quality reference corpus for the de-identification of clinical documents in French.
METHODS: A corpus comprising a variety of clinical document types covering several medical specialties was pre-processed with two automatic de-identification systems from the MEDINA suite of tools: a rule-based system and a system using Conditional Random Fields (CRF). The pre-annotated documents were revised by two human annotators trained to mark ten categories of Protected Health Information (PHI). The human annotators worked independently and were blind to the system that produced the pre-annotations they were revising. The best pre-annotation system was applied to another random selection of 100 documents. After revision by one annotator, this set was used to train a statistical de-identification system.
RESULTS: Two gold standard sets of 100 documents were created based on the consensus of two human revisions of the automatic pre-annotations. The annotation experiment showed that (i) automatic pre-annotation obtained with the rule-based system performed better (F=0.813) than the CRF system (F=0.519), (ii) the human annotators spent more time revising the pre-annotations obtained with the rule-based system (from 102 to 160 minutes for 50 documents), compared to the CRF system (from 93 to 142 minutes for 50 documents), (iii) the quality of human annotation is higher when pre-annotations are obtained with the rule-based system (F-measure ranging from 0.970 to 0.987), compared to the CRF system (F-measure ranging from 0.914 to 0.981). Finally, only 20 documents from the training set were needed for the statistical system to outperform the pre-annotation systems that were trained on corpora from a medical speciality and hospital different from those in the reference corpus developed herein.
CONCLUSION: We find that better pre-annotations increase the quality of the reference corpus but require more revision time. A statistical de-identification method outperforms our rule-based system when as few as 20 custom training documents are available.
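
NOTE: The F-measures reported above compare a set of PHI annotations against a reference set. The abstract does not say how matches were counted, so the following is only a minimal sketch that assumes exact matching on character offsets and category, with the balanced F-measure (F1). The `Span` type, the category labels, and the toy data are hypothetical and are not taken from the study.

```python
from __future__ import annotations

from typing import NamedTuple


class Span(NamedTuple):
    """A PHI annotation: character offsets plus a category label (hypothetical schema)."""
    start: int
    end: int
    category: str


def precision_recall_f1(gold: set[Span], predicted: set[Span]) -> tuple[float, float, float]:
    """Exact-match scoring: a predicted span counts only if offsets and category both match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations, invented for illustration only.
gold = {
    Span(0, 11, "NAME"),
    Span(20, 30, "DATE"),
    Span(45, 59, "PHONE"),
    Span(70, 75, "CITY"),
}
predicted = {
    Span(0, 11, "NAME"),
    Span(20, 30, "DATE"),
    Span(45, 59, "PHONE"),
    Span(80, 85, "CITY"),   # wrong offsets: counts as a false positive
    Span(90, 95, "NAME"),   # spurious span: also a false positive
}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} F1={f:.3f}")
```

Under a more lenient criterion (for example, counting partially overlapping spans as matches) the same annotations would score higher, so F-measures are only comparable when the matching rule is the same.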

Keywords: Confidentiality; Electronic Health Records; France; Information Dissemination; Natural Language Processing

Year: 2013 PMID: 24380818 DOI: 10.1016/j.jbi.2013.12.014

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

Cited

9 in total

Review 1. Clinical Natural Language Processing in 2014: Foundational Methods Supporting Efficient Healthcare.

2. De-identifying Spanish medical texts - named entity recognition applied to radiology reports.

Review 3. Clinical Data Reuse or Secondary Use: Current Status and Potential Future Progress.

4. Is the Juice Worth the Squeeze? Costs and Benefits of Multiple Human Annotators for Clinical Text De-identification.

Review 5. Recent Advances in Clinical Natural Language Processing in Support of Semantic Analysis.

6. The OpenDeID corpus for patient de-identification.

Review 7. Clinical Natural Language Processing in languages other than English: opportunities and challenges.

8. Design of an extensive information representation scheme for clinical narratives.

9. De-identifying free text of Japanese electronic health records.