Active deep learning to detect demographic traits in free-form clinical notes.

Amir Feder, Danny Vainstein, Roni Rosenfeld, Tzvika Hartman, Avinatan Hassidim, Yossi Matias.

Abstract

The free-form portions of clinical notes are a significant source of information for research, but before they can be used, they must be de-identified to protect patients' privacy. De-identification efforts have focused on known identifier types (names, ages, dates, addresses, IDs, etc.). However, a note can contain residual "Demographic Traits" (DTs), unique enough to re-identify the patient when combined with other such facts. Here we examine whether any residual risks remain after removing these identifiers. After manually annotating over 140,000 words of medical notes, we found no remaining directly identifying information, and a low prevalence of demographic traits, such as marital status or housing type. We developed an annotation guide to the discovered DTs and used it to label MIMIC-III and i2b2-2006 clinical notes as test sets. We then designed a "bootstrapped" active learning iterative process for identifying DTs: we tentatively labeled as positive all sentences in the DT-rich note sections, used these to train a binary classifier, manually corrected acute errors, and retrained the classifier. This train-and-correct process may be iterated. Our active learning process significantly improved the classifier's accuracy. Moreover, our BERT-based model outperformed non-neural models when trained on both tentatively labeled data and manually relabeled examples. To facilitate future research and benchmarking, we also produced and made publicly available our human-annotated DT-tagged datasets. We conclude that directly identifying information is virtually non-existent in the multiple medical note types we investigated. Demographic traits are present in medical notes, but can be detected with high accuracy using a cost-effective human-in-the-loop active learning process, and redacted if desired.
Copyright © 2020 Elsevier Inc. All rights reserved.
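
The train-and-correct loop described in the abstract can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: a small bag-of-words Naive Bayes model stands in for the BERT-based classifier, "acute errors" are approximated as the sentences where the model disagrees most strongly with the current label, and the section name in DT_RICH_SECTIONS, the toy sentences, the simulated_annotator function, and the rounds/budget parameters are all assumptions made for the example.

```python
import math
from collections import Counter

# Assumption: which note sections count as "DT-rich" (not specified in the abstract).
DT_RICH_SECTIONS = {"social history"}


def tokenize(sentence):
    return sentence.lower().split()


def train_naive_bayes(sentences, labels):
    """Fit a tiny bag-of-words Naive Bayes model (a stand-in for a BERT classifier)."""
    counts = {0: Counter(), 1: Counter()}
    class_totals = Counter(labels)
    for sentence, label in zip(sentences, labels):
        counts[label].update(tokenize(sentence))
    vocab = set(counts[0]) | set(counts[1])
    return counts, class_totals, vocab


def positive_probability(model, sentence):
    """Return P(label = 1 | sentence) under the Naive Bayes model."""
    counts, class_totals, vocab = model
    n_examples = sum(class_totals.values())
    log_scores = {}
    for label in (0, 1):
        total_tokens = sum(counts[label].values())
        # Laplace smoothing on the class prior and on the token likelihoods.
        score = math.log((class_totals[label] + 1) / (n_examples + 2))
        for token in tokenize(sentence):
            if token in vocab:
                score += math.log(
                    (counts[label][token] + 1) / (total_tokens + len(vocab))
                )
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])


def bootstrap_active_learning(examples, oracle, rounds=2, budget=2):
    """examples: list of (section, sentence); oracle: sentence -> 0 or 1 (human label)."""
    sentences = [sentence for _, sentence in examples]
    # Step 1: tentatively label every sentence in a DT-rich section as positive.
    labels = [1 if section in DT_RICH_SECTIONS else 0 for section, _ in examples]
    reviewed = set()
    # Step 2: train a binary classifier on the tentative labels.
    model = train_naive_bayes(sentences, labels)
    for _ in range(rounds):
        # Step 3: send the strongest model/label disagreements for manual correction.
        ranked = sorted(
            (
                (abs(labels[i] - positive_probability(model, s)), i)
                for i, s in enumerate(sentences)
                if i not in reviewed
            ),
            reverse=True,
        )
        for _, i in ranked[:budget]:
            labels[i] = oracle(sentences[i])
            reviewed.add(i)
        # Step 4: retrain on the corrected labels.
        model = train_naive_bayes(sentences, labels)
    return model, labels


# Toy data and a simulated annotator, both hypothetical.
EXAMPLES = [
    ("social history", "patient lives alone in an apartment"),
    ("social history", "married with two children"),
    ("social history", "denies tobacco use"),
    ("medications", "aspirin 81 mg daily"),
    ("hospital course", "patient is married and lives with spouse"),
    ("hospital course", "started on iv antibiotics"),
]


def simulated_annotator(sentence):
    return int(any(t in {"married", "lives", "apartment"} for t in tokenize(sentence)))


model, labels = bootstrap_active_learning(EXAMPLES, simulated_annotator)
```

In this sketch the budget caps how many sentences a human reviews per round, which is where the cost saving of a human-in-the-loop process would come from; a real system would replace simulated_annotator with actual annotator input.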

Keywords:  Active machine learning; Data anonymization; Deep learning; Natural language processing; Personally identifiable information

Year:  2020        PMID: 32428572     DOI: 10.1016/j.jbi.2020.103436

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  2 in total

1.  A comparative study on deep learning models for text classification of unstructured medical notes with various levels of class imbalance.

Authors:  Hongxia Lu; Louis Ehwerhemuepha; Cyril Rakovski
Journal:  BMC Med Res Methodol       Date:  2022-07-02       Impact factor: 4.612

2.  A Review on Human-AI Interaction in Machine Learning and Insights for Medical Applications. (Review)

Authors:  Mansoureh Maadi; Hadi Akbarzadeh Khorshidi; Uwe Aickelin
Journal:  Int J Environ Res Public Health       Date:  2021-02-22       Impact factor: 3.390

