Xi Yang, Tianchen Lyu, Qian Li, Chih-Yin Lee, Jiang Bian, William R Hogan, Yonghui Wu.
Abstract
BACKGROUND: De-identification is a critical technology for facilitating the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested considerable effort in developing methods and corpora for de-identifying clinical notes. These annotated corpora are valuable resources for developing automated systems that de-identify clinical text at local hospitals. However, existing studies have often used training and test data collected from the same institution, and few have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions.
Keywords: Cross institutions; De-identification; Deep learning; EHR; Protected health information
Year: 2019 PMID: 31801524 PMCID: PMC6894104 DOI: 10.1186/s12911-019-0935-4
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
PHI distributions in the 2014 i2b2/UTHealth de-identification corpus and UF Health clinical notes
| PHI Category | 2014 i2b2/UTHealth Training | 2014 i2b2/UTHealth Validation | UF Health Training | UF Health Validation | UF Health Test |
|---|---|---|---|---|---|
| DATE | 9067 | 3104 | 2056 | 774 | 1872 |
| NAME | 5472 | 1868 | 856 | 356 | 771 |
| AGE | 1507 | 490 | 158 | 86 | 164 |
| ID | 1142 | 364 | 156 | 41 | 137 |
| PHONE | 406 | 128 | 50 | 28 | 47 |
| WEB | 6 | 1 | 0 | 0 | 4 |
| INSTITUTE | 1926 | 592 | 128 | 72 | 119 |
| STREET | 280 | 72 | 25 | 6 | 21 |
| CITY | 502 | 152 | 43 | 26 | 45 |
| ZIP | 276 | 76 | 34 | 11 | 20 |
| Total | 20584 | 6847 | 3506 | 1400 | 3200 |
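One way to compare how PHI categories are distributed across the two corpora is to normalize each count by its column total. The following is a minimal sketch using only the training-column counts from the table above; the dictionary and function names are illustrative and do not come from the source.

```python
# Annotation counts copied from the training columns of the table above.
I2B2_TRAIN = {
    "DATE": 9067, "NAME": 5472, "AGE": 1507, "ID": 1142, "PHONE": 406,
    "WEB": 6, "INSTITUTE": 1926, "STREET": 280, "CITY": 502, "ZIP": 276,
}
UF_TRAIN = {
    "DATE": 2056, "NAME": 856, "AGE": 158, "ID": 156, "PHONE": 50,
    "WEB": 0, "INSTITUTE": 128, "STREET": 25, "CITY": 43, "ZIP": 34,
}


def category_shares(counts):
    """Return each category's fraction of all annotations in one split."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


i2b2_shares = category_shares(I2B2_TRAIN)
uf_shares = category_shares(UF_TRAIN)

# Print the two distributions side by side as percentages.
print(f"{'Category':<10} {'i2b2':>7} {'UF':>7}")
for category in I2B2_TRAIN:
    print(f"{category:<10} {i2b2_shares[category]:>7.1%} {uf_shares[category]:>7.1%}")
```

Summing each dictionary also gives a quick check that the transcribed counts agree with the Total row.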
Fig. 1 An overview of the LSTM-CRFs model with knowledge-based features derived from the local resources
Performance of LSTM-CRFs trained with different word embeddings (trained using i2b2 training set and evaluated using i2b2 validation set)
| Model | Embedding | Strict Precision | Strict Recall | Strict F1 score | Relax Precision | Relax Recall | Relax F1 score |
|---|---|---|---|---|---|---|---|
| LSTM-CRFs | GoogleNews | 0.9679 | 0.9263 | 0.9466 | 0.9783 | 0.9362 | 0.9567 |
| LSTM-CRFs | CommonCrawl | 0.9697 | 0.9401 | | 0.9797 | 0.9498 | |
| LSTM-CRFs | MIMIC-word2vec | 0.9669 | 0.9341 | 0.9502 | 0.9774 | 0.9443 | 0.9606 |
| LSTM-CRFs | MIMIC-fastText | 0.9631 | 0.9380 | 0.9504 | 0.9758 | 0.9504 | 0.9629 |
| LSTM-CRFs | MADE | 0.9662 | 0.9158 | 0.9403 | 0.9782 | 0.9271 | 0.9520 |
Best F1 scores are highlighted in bold
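The CommonCrawl row in the table above lists precision and recall but no F1 value. Assuming the F1 scores are the standard harmonic mean of precision and recall (an assumption, although it reproduces the GoogleNews row), the missing values can be approximated from the rounded figures. The function name below is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Sanity check against a row that reports all three values (GoogleNews, strict).
googlenews_strict = f1_score(0.9679, 0.9263)

# Approximate F1 for the CommonCrawl row from its rounded precision and recall.
commoncrawl_strict = f1_score(0.9697, 0.9401)
commoncrawl_relax = f1_score(0.9797, 0.9498)

print(f"GoogleNews strict:  {googlenews_strict:.4f}")
print(f"CommonCrawl strict: {commoncrawl_strict:.4f}")
print(f"CommonCrawl relax:  {commoncrawl_relax:.4f}")
```

Under these assumptions the sketch gives roughly 0.9547 (strict) and 0.9645 (relax) for CommonCrawl. These figures are recomputed from rounded inputs, so the last digit may differ from the published values.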
Performance of LSTM-CRFs models on UF test set
| Model | Training data | Fine Tuning | Strict Precision | Strict Recall | Strict F1 | Relax Precision | Relax Recall | Relax F1 |
|---|---|---|---|---|---|---|---|---|
| LSTM-CRFs | i2b2 | NA | 0.8883 | 0.8274 | 0.8568 | 0.9288 | 0.8651 | 0.8958 |
| LSTM-CRFs + Lexical | i2b2 | NA | 0.8767 | 0.8509 | 0.8636 | 0.9314 | 0.9041 | 0.9175 |
| LSTM-CRFs + Lexical + Knowledge | i2b2 | NA | 0.8767 | 0.8706 | 0.8736 | 0.9229 | 0.9166 | 0.9197 |
| LSTM-CRFs + Lexical + Knowledge | i2b2 | UF | 0.9474 | 0.9109 | | 0.9776 | 0.9400 | |
| LSTM-CRFs + Lexical + Knowledge | UF | NA | 0.9408 | 0.8992 | 0.9195 | 0.9705 | 0.9277 | 0.9486 |
| LSTM-CRFs + Lexical + Knowledge | i2b2 + UF | NA | 0.9352 | 0.9163 | 0.9257 | 0.9681 | 0.9484 | 0.9582 |
Best F1 scores are highlighted in bold
Performances for each PHI category achieved by the customized LSTM-CRFs model using fine-tuning
| Entity Type | Strict Precision | Strict Recall | Strict F1 score | Relax Precision | Relax Recall | Relax F1 score |
|---|---|---|---|---|---|---|
| DATE | 0.9807 | 0.977 | 0.9789 | 0.985 | 0.9813 | 0.9831 |
| AGE | 0.9861 | 0.8659 | 0.9221 | 0.9861 | 0.8659 | 0.9221 |
| ID | 0.9173 | 0.8905 | 0.9037 | 0.9624 | 0.9343 | 0.9481 |
| NAME | 0.9029 | 0.8807 | 0.8917 | 0.9694 | 0.9455 | 0.9573 |
| PHONE | 0.9048 | 0.8085 | 0.8539 | 0.9762 | 0.8723 | 0.9213 |
| ZIP | 0.75 | 0.75 | 0.75 | 0.9 | 0.9 | 0.9 |
| INSTITUTE | 0.75 | 0.5042 | 0.603 | 0.9375 | 0.6303 | 0.7538 |
| CITY | 0.9048 | 0.4222 | 0.5758 | 1 | 0.4667 | 0.6364 |
| STREET | 0.55 | 0.5238 | 0.5366 | 0.85 | 0.8095 | 0.8293 |
| WEB | 0 | 0 | 0 | 0 | 0 | 0 |