Tobias Kolditz, Christina Lohr, Johannes Hellrich, Luise Modersohn, Boris Betz, Michael Kiehntopf, Udo Hahn.
Abstract
We devised annotation guidelines for the de-identification of German clinical documents and assembled a corpus of 1,106 discharge summaries and transfer letters with 44K annotated protected health information (PHI) items. After three iteration rounds, our annotation team finally reached an inter-annotator agreement of 0.96 on the instance level and 0.97 on the token level of annotation (averaged pair-wise F1 score). To establish a baseline for automatic de-identification on our corpus, we trained a recurrent neural network (RNN) and achieved F1 scores greater than 0.9 on most major PHI categories.
Keywords: Confidentiality; Data Anonymization; Natural Language Processing
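The agreement figure in the abstract is an averaged pair-wise F1 score. As a rough illustration of how such a number can be computed on the instance level, the sketch below treats each annotator's output as a set of (start, end, label) spans and averages F1 over all annotator pairs. The span representation, the label names, the function names, and the exact-match criterion are assumptions made for this example; the abstract does not describe the authors' actual procedure.

```python
from itertools import combinations


def pairwise_f1(a, b):
    """F1 between two annotators' span sets, counting only exact matches.

    F1 is symmetric here: 2 * |A and B| / (|A| + |B|), so it does not
    matter which annotator is treated as the reference.
    """
    if not a and not b:
        return 1.0  # two empty annotations agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def averaged_pairwise_f1(annotations):
    """Mean F1 over all unordered pairs of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(pairwise_f1(a, b) for a, b in pairs) / len(pairs)


# Hypothetical annotations of one document: (start, end, label) spans.
# The labels are illustrative, not the corpus's real PHI categories.
ann_a = {(0, 6, "NAME"), (10, 20, "DATE"), (25, 31, "LOCATION")}
ann_b = {(0, 6, "NAME"), (10, 20, "DATE")}
ann_c = {(0, 6, "NAME"), (10, 20, "DATE"), (25, 31, "LOCATION"), (40, 45, "AGE")}

score = averaged_pairwise_f1({"A": ann_a, "B": ann_b, "C": ann_c})
print(f"averaged pair-wise F1: {score:.3f}")
```

A token-level variant would follow the same pattern with (token index, label) pairs in place of spans, which rewards partially overlapping annotations and may explain why token-level agreement tends to be slightly higher.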
Year: 2019 PMID: 31437914 DOI: 10.3233/SHTI190212
Source DB: PubMed Journal: Stud Health Technol Inform ISSN: 0926-9630