Alistair E W Johnson, Lucas Bulgarelli, Tom J Pollard.
Abstract
The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates in order to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models is scarce; many target identifiers are highly heterogeneous (for example, there are uncountable variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. As a result, patient data is often withheld when sharing would be beneficial, and identifiable patient data is often divulged when a deidentified version would suffice. In recent years, advances in machine learning methods have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, lack of portability of models, and paucity of training data. Code to develop our model is open source, allowing for broad reuse.
Keywords: HIPAA; PHI; deidentification; electronic health records; named entity recognition; natural language processing; neural networks
Year: 2020 PMID: 34350426 PMCID: PMC8330601 DOI: 10.1145/3368555.3384455
Source DB: PubMed Journal: Proc ACM Conf Health Inference Learn (2020)
Figure 1: Architecture of the model with example text and predictions. Text is tokenized and fed into 12 identically constructed transformer blocks. Weights within the transformer blocks are initialized using various publicly available pretrained models. The final output of the transformer blocks is fed into a linear classification layer. Note the use of sub-word tokenization (represented by two hashes before the sub-word), and the class of the intermediate punctuation tokens.
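The sub-word notation in the caption can be illustrated with a minimal sketch of greedy longest-match sub-word splitting, the scheme commonly described for WordPiece-style vocabularies. The vocabulary `TOY_VOCAB` and the function name `wordpiece` are hypothetical and chosen only for illustration; pretrained models ship their own vocabularies, and the source does not describe the tokenizer beyond the two-hash convention.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-words by greedy longest-match-first lookup.

    Pieces that continue a word are prefixed with two hashes ("##").
    If some part of the word cannot be matched, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from any real model).
TOY_VOCAB = {"col", "##umb", "##ian", "library", "55"}

if __name__ == "__main__":
    for w in ["columbian", "library", "xyz"]:
        print(w, wordpiece(w, TOY_VOCAB))
```

Because the classifier emits one label per sub-word, a word such as "columbian" receives several predictions, which is why the figure shows a class for each piece and for intermediate punctuation tokens.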
Categories of PHI in each dataset and number of tokens in each class (n, %).
| Category | Type | Dernoncourt-Lee | i2b2 2006 | i2b2 2014 | PhysioNet |
|---|---|---|---|---|---|
| All | - | 60725 | 19498 | 28867 | 1779 |
| Age | Age | 126 (0.2) | 16 (0.1) | 1997 (6.9) | 4 (0.2) |
| Contact | | - | - | 5 (0.0) | - |
| | Fax | - | - | 10 (0.0) | - |
| | Phone | 2500 (4.1) | 232 (1.2) | 524 (1.8) | 53 (3.0) |
| | URL | - | - | 2 (0.0) | - |
| Date | Date | 36594 (60.3) | 7098 (36.4) | 12482 (43.2) | 482 (27.1) |
| | Dateyear | - | - | - | 46 (2.6) |
| ID | BioID | - | - | 1 (0.0) | - |
| | Device | - | - | 15 (0.1) | - |
| | Healthplan | - | - | 1 (0.0) | - |
| | ID | - | 4809 (24.7) | - | - |
| | IDNum | 1785 (2.9) | - | 456 (1.6) | - |
| | Medicalrecord | - | - | 1033 (3.6) | - |
| | Other | - | - | - | 3 (0.2) |
| Location | City | - | - | 654 (2.3) | - |
| | Country | 88 (0.1) | - | 183 (0.6) | - |
| | Hospital | 3457 (5.7) | 2400 (12.3) | 2312 (8.0) | - |
| | Location | - | 263 (1.3) | - | 367 (20.6) |
| | Location-other | 1494 (2.5) | - | 17 (0.1) | - |
| | Organization | - | - | 206 (0.7) | - |
| | State | 232 (0.4) | - | 504 (1.7) | - |
| | Street | 73 (0.1) | - | 352 (1.2) | - |
| | Zip | 118 (0.2) | - | 352 (1.2) | - |
| Name | Doctor | 12883 (21.2) | 3751 (19.2) | 4797 (16.6) | - |
| | Hcpname | - | - | - | 593 (33.3) |
| | Patient | 1375 (2.3) | 929 (4.8) | 2195 (7.6) | - |
| | Ptname | - | - | - | 54 (3.0) |
| | Ptnameinitial | - | - | - | 2 (0.1) |
| | Relativeproxyname | - | - | - | 175 (9.8) |
| | Username | - | - | 356 (1.2) | - |
| Profession | Profession | - | - | 413 (1.4) | - |
The number of tokens is calculated by splitting annotated entities using whitespace characters.
In the PhysioNet corpus, ages under 89 years are not treated as PHI.
In the 2006 i2b2 corpus, year is not annotated as PHI.
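The token counts in the table follow the rule stated in the first footnote: each annotated entity is split on whitespace and the resulting pieces are counted per type. A minimal sketch of that calculation is given below. The function names and the example annotations are hypothetical; only the whitespace-splitting rule and the n (%) presentation come from the source.

```python
from collections import Counter


def count_tokens(entities):
    """Count whitespace-delimited tokens per entity type.

    `entities` is an iterable of (entity_type, entity_text) pairs.
    """
    counts = Counter()
    for entity_type, text in entities:
        # str.split() with no argument splits on any run of whitespace.
        counts[entity_type] += len(text.split())
    return counts


def with_percentages(counts):
    """Return {type: (n, percent of all PHI tokens)} with one decimal place."""
    total = sum(counts.values())
    return {k: (n, round(100 * n / total, 1)) for k, n in counts.items()}


if __name__ == "__main__":
    # Invented annotations, for illustration only.
    example = [
        ("Doctor", "Jane Q. Smith"),
        ("Date", "2138-01-05"),
        ("Hospital", "General Hospital"),
        ("Date", "5 January"),
    ]
    print(with_percentages(count_tokens(example)))
```

Under this rule a three-word name contributes three tokens, so the totals in the "All" row are token counts rather than entity counts.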
Performance of models developed using the i2b2 2014 challenge training set and evaluated on the i2b2 2014 challenge test set. Models use all lower case text and an uncased vocabulary unless otherwise specified. Each token is treated as a distinct entity. Binary evaluation involves collapsing all labeled entities into a single “PHI” group.
| Model | Multi-class PPV | Multi-class Se | Multi-class F1 | PHI vs. not PHI PPV | PHI vs. not PHI Se | PHI vs. not PHI F1 |
|---|---|---|---|---|---|---|
| | 98.66 | 98.15 | 98.40 | 99.08 | 98.57 | 98.82 |
| | 98.56 | 97.77 | 98.16 | 99.00 | 98.20 | 98.60 |
| | 98.61 | 97.90 | 98.25 | 98.98 | 98.27 | 98.62 |
| | 98.36 | 97.38 | 97.87 | 98.90 | 97.91 | 98.40 |
| | 98.34 | 97.88 | 98.11 | 98.80 | 98.33 | 98.57 |
| | 98.25 | 98.06 | 98.15 | 98.66 | 98.47 | 98.57 |
| | 95.27 | 91.60 | 93.36 | 96.95 | 93.18 | 95.03 |
| | 98.16 | 98.32 | 98.23 | | | |
| Hartman et al. | 85.7 | 99.1 | 91.7 | - | - | - |
| Liu et al. | 97.94 | 96.04 | 96.98 | 99.30 | 97.28 | 98.28 |
The PHI vs. not PHI evaluation in Dernoncourt et al. used a subset of classes based upon HIPAA and is not directly comparable to other results.
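The two evaluation modes in the table can be sketched as follows: token-level PPV, sensitivity, and F1 are computed over per-token labels, and the binary evaluation first collapses every entity label into a single "PHI" class. This is an illustrative sketch, not the authors' evaluation code. The function name, the outside label `"O"`, and the convention that a token predicted with the wrong class counts as both a false positive and a false negative are assumptions.

```python
def token_metrics(truth, pred, binary=False, outside="O"):
    """Token-level PPV, sensitivity and F1 for two equal-length label lists.

    With binary=True every label other than `outside` is collapsed to "PHI"
    before scoring, so confusing one PHI class for another is no longer an error.
    """
    if binary:
        truth = [outside if t == outside else "PHI" for t in truth]
        pred = [outside if p == outside else "PHI" for p in pred]
    pairs = list(zip(truth, pred))
    tp = sum(1 for t, p in pairs if p != outside and p == t)
    # Assumed convention: a wrong-class prediction is a false positive ...
    fp = sum(1 for t, p in pairs if p != outside and p != t)
    # ... and also a false negative for the true class.
    fn = sum(1 for t, p in pairs if t != outside and p != t)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    se = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * ppv * se / (ppv + se) if ppv + se else 0.0
    return {"PPV": ppv, "Se": se, "F1": f1}


if __name__ == "__main__":
    # Invented labels, for illustration only.
    truth = ["AGE", "O", "LOCATION", "DATE", "O"]
    pred = ["AGE", "O", "O", "ID", "DATE"]
    print(token_metrics(truth, pred))
    print(token_metrics(truth, pred, binary=True))
```

In this invented example the DATE token predicted as ID is an error in the multi-class setting but a correct detection in the binary setting, which is why the binary columns in the table are uniformly higher.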
Performance comparison of BERT against the model of Dernoncourt et al. for individual entities within the i2b2 2014 test corpus.
| Entity type | Model | Precision | Recall | F1 |
|---|---|---|---|---|
| AGE | BERT | 97.12 | 98.23 | 97.67 |
| | Dernoncourt et al. | 98.97 | 97.60 | 98.28 |
| CONTACT | BERT | 98.31 | 98.46 | 98.38 |
| | Dernoncourt et al. | 98.80 | 98.33 | 98.57 |
| DATE | BERT | 99.43 | 99.26 | 99.35 |
| | Dernoncourt et al. | 99.06 | 99.52 | 99.29 |
| ID | BERT | 96.73 | 97.66 | 97.20 |
| | Dernoncourt et al. | 99.29 | 98.76 | 99.02 |
| LOCATION | BERT | 97.14 | 94.12 | 95.60 |
| | Dernoncourt et al. | 95.96 | 95.74 | 95.85 |
| NAME | BERT | 99.12 | 98.29 | 98.70 |
| | Dernoncourt et al. | 98.22 | 99.15 | 98.68 |
| PROFESSION | BERT | 96.39 | 92.49 | 94.40 |
| | Dernoncourt et al. | 87.99 | 79.71 | 83.64 |
Performance of models developed using the training dataset specified in the row, and evaluated on the test set for the corpus specified in the column. All models are trained using the same hyperparameters with the uncased base architecture.
| Metric | Training set | i2b2 2014 | i2b2 2006 | PhysioNet | Dernoncourt-Lee |
|---|---|---|---|---|---|
| F1 | i2b2 2014 | 98.62 | 81.62 | 87.95 | 88.32 |
| | i2b2 2006 | 92.77 | 98.45 | 75.37 | 86.85 |
| | PhysioNet | 83.84 | 52.28 | 95.61 | 78.54 |
| | Dernoncourt-Lee | 84.02 | 63.13 | 90.27 | 97.42 |
| Se | i2b2 2014 | 98.27 | 72.55 | 96.05 | 83.10 |
| | i2b2 2006 | 92.11 | 97.71 | 77.19 | 80.85 |
| | PhysioNet | 76.26 | 36.68 | 95.61 | 68.40 |
| | Dernoncourt-Lee | 84.89 | 61.57 | 95.61 | 97.59 |
| PPV | i2b2 2014 | 98.98 | 93.27 | 81.11 | 94.25 |
| | i2b2 2006 | 93.45 | 99.20 | 73.64 | 93.81 |
| | PhysioNet | 93.09 | 90.93 | 95.61 | 92.19 |
| | Dernoncourt-Lee | 83.16 | 64.76 | 85.49 | 97.25 |
Rate of false negatives (FN) and false positives (FP) for models with a minimum desired sensitivity. Results are calculated on the i2b2 2014 test set (414,661 tokens) using the lowest threshold for model predictions which has at least the specified sensitivity.
| Required Sensitivity | PPV | F1 | FN/1000 | FP/1000 |
|---|---|---|---|---|
| 100 | 0 | 0 | 0 | 1000 |
| 99.7 | 49.86 | 66.47 | 0.14 | 47.18 |
| 99.0 | 96.82 | 97.90 | 0.47 | 1.53 |
| 98.27 | 98.92 | 98.60 | 0.81 | 0.51 |
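The operating points in the table can be reproduced in spirit with a small threshold sweep. The sketch below assumes each token carries a model score for being PHI and is flagged when the score meets the threshold; the caption does not state which quantity is thresholded, so this orientation, the function name, and the example scores are all assumptions. Among the thresholds that reach the required sensitivity, the sketch keeps the one with the fewest false positives and reports errors per 1000 tokens.

```python
def operating_point(phi_scores, is_phi, required_se):
    """Pick the threshold meeting `required_se` with the fewest false positives.

    phi_scores: per-token score for the token being PHI (assumed convention).
    is_phi:     per-token ground truth as booleans.
    Returns a dict of metrics, or None if no threshold reaches the sensitivity.
    """
    n_tokens = len(phi_scores)
    n_phi = sum(1 for y in is_phi if y)
    best = None
    for threshold in sorted(set(phi_scores)):
        flagged = [s >= threshold for s in phi_scores]
        tp = sum(1 for f, y in zip(flagged, is_phi) if f and y)
        fp = sum(1 for f, y in zip(flagged, is_phi) if f and not y)
        fn = n_phi - tp
        se = tp / n_phi if n_phi else 0.0
        if se < required_se:
            continue
        if best is None or fp < best["fp"]:
            best = {
                "threshold": threshold,
                "Se": se,
                "PPV": tp / (tp + fp) if tp + fp else 0.0,
                "fp": fp,
                "fn": fn,
                "FN/1000": 1000 * fn / n_tokens,
                "FP/1000": 1000 * fp / n_tokens,
            }
    return best


if __name__ == "__main__":
    # Invented scores and labels, for illustration only.
    scores = [0.99, 0.95, 0.60, 0.40, 0.30, 0.10, 0.05, 0.02]
    labels = [True, True, True, False, True, False, False, False]
    print(operating_point(scores, labels, required_se=1.0))
    print(operating_point(scores, labels, required_se=0.75))
```

Expressing errors per 1000 tokens makes the trade-off concrete: in the table, moving from 99.0 to 99.7 required sensitivity removes about a third of a missed token per 1000 while adding roughly 45 false positives per 1000.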
Examples of ambiguous false negatives produced by the model. Top: missed location (nationality). Middle top: “ci” token labeled as PHI resulting in a false negative. Middle bottom: General location descriptor. Bottom: Conjunction considered as false negative.
| Example | Token | Prediction | Truth |
|---|---|---|---|
| Top | 55 | AGE | AGE |
| | y | | |
| | / | | |
| | o | | |
| | columbian | | LOCATION |
| Middle top | 2138 | DATE | DATE |
| | ci | | ID |
| | : | | ID |
| | 100417 | ID | ID |
| Middle bottom | goes | | |
| | to | | |
| | the | | |
| | library | | LOCATION |
| | daily | | |
| Bottom | in | | |
| | electrical | PROFESSION | PROFESSION |
| | and | | PROFESSION |
| | avionics | PROFESSION | PROFESSION |
| | mechanics | PROFESSION | PROFESSION |