Seo Hyun Oh, Min Kang, Youngho Lee.
Abstract
OBJECTIVE: De-identifying protected health information (PHI) in medical documents is important, and a prerequisite to de-identification is the identification of PHI entity names in clinical documents. This study aimed to compare the performance of three pre-trained language models that have recently attracted significant attention, and to determine which is most suitable for PHI recognition.
Keywords: Artificial Intelligence; Big Data; Data Anonymization; Deep Learning; Medical Informatics
Year: 2022 PMID: 35172087 PMCID: PMC8850174 DOI: 10.4258/hir.2022.28.1.16
Source DB: PubMed Journal: Healthc Inform Res ISSN: 2093-3681
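The comparison in this study rests on fine-tuning each pre-trained model for token-level PHI recognition. The following is a minimal sketch of such a setup, assuming the Hugging Face transformers library; the checkpoint name, label set, and training sentence are illustrative, not the authors' actual configuration.

```python
# Minimal sketch: fine-tuning a pre-trained transformer for PHI token
# classification. Checkpoint, labels, and data are illustrative only.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PATIENT", "I-PATIENT", "B-DATE", "I-DATE"]  # subset of PHI tags
id2label = dict(enumerate(labels))
label2id = {v: k for k, v in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels),
    id2label=id2label, label2id=label2id,
)

# One IOB-tagged training sentence (hypothetical).
tokens = ["John", "Smith", "was", "admitted", "on", "2019-01-05"]
tags = ["B-PATIENT", "I-PATIENT", "O", "O", "O", "B-DATE"]

# Align word-level tags to subword tokens; special tokens get -100 so the
# loss ignores them. (Continuation subwords simply inherit the word's tag
# here, a common simplification.)
enc = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
aligned = [
    -100 if wid is None else label2id[tags[wid]]
    for wid in enc.word_ids(batch_index=0)
]
enc["labels"] = torch.tensor([aligned])

# A single optimization step; a real run would loop over a DataLoader.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**enc).loss
loss.backward()
optimizer.step()
```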
Figure 1. Pipeline showing the inputs and outputs of deep learning models: BERT (bidirectional encoder representations from transformers), RoBERTa (robustly optimized BERT pre-training approach), and XLNet (a model built based on Transformer-XL). IOB: inside-outside-beginning.
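The IOB scheme referenced in Figure 1 marks the first token of an entity with B-, continuation tokens with I-, and non-entity tokens with O. A small sketch of converting word-level entity spans to IOB tags; the span format and example sentence are assumptions for illustration:

```python
# Sketch: converting word-level entity spans to IOB tags, as in Figure 1.
def spans_to_iob(num_words, spans):
    """spans: list of (start_word, end_word_exclusive, entity_type)."""
    tags = ["O"] * num_words
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

words = ["Seen", "by", "Dr", "Jane", "Doe", "at", "Mercy", "Hospital"]
print(spans_to_iob(len(words), [(2, 5, "DOCTOR"), (6, 8, "HOSPITAL")]))
# ['O', 'O', 'B-DOCTOR', 'I-DOCTOR', 'I-DOCTOR', 'O', 'B-HOSPITAL', 'I-HOSPITAL']
```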
Figure 2. Model structure of BERT (bidirectional encoder representations from transformers).
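For context on Figure 2: BERT is pre-trained with a masked language modeling (autoencoding) objective. In standard notation (not given in this record), it maximizes

$$\max_{\theta} \; \sum_{t \in \mathcal{M}} \log p_{\theta}\!\left(x_t \mid \hat{\mathbf{x}}\right),$$

where $\hat{\mathbf{x}}$ is the input sequence with the positions in $\mathcal{M}$ replaced by [MASK] tokens; each masked token is predicted independently given the full bidirectional context.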
Figure 3. Permutation model with autoregressive and autoencoding models.
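Figure 3 corresponds to XLNet's permutation language modeling objective, which retains an autoregressive factorization while still capturing bidirectional context by taking the expectation over factorization orders:

$$\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}\!\left[\sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)\right],$$

where $\mathcal{Z}_T$ denotes the set of all permutations of a length-$T$ sequence.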
Recall, precision, and F1-score of BERT, RoBERTa, and XLNet
| Model | Recall | Precision | F1-score |
|---|---|---|---|
| BERT | 0.85 | 0.86 | 0.85 |
| RoBERTa | 0.92 | 0.93 | 0.93 |
| XLNet | 0.95 | 0.96 | 0.96 |
BERT: bidirectional encoder representations from transformers, RoBERTa: robustly optimized BERT pre-training approach, XLNet: a model built based on Transformer-XL.
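For reference, the F1-score above is the harmonic mean of precision and recall:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

For the XLNet row, $2 \times 0.96 \times 0.95 \,/\, (0.96 + 0.95) \approx 0.955$, consistent with the reported 0.96.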
Training time of BERT, RoBERTa, and XLNet
| Model | Training time (s) |
|---|---|
| BERT | 348 |
| RoBERTa | 685 |
| XLNet | 1,215 |
BERT: bidirectional encoder representations from transformers, RoBERTa: robustly optimized BERT pre-training approach, XLNet: a model built based on Transformer-XL.
PHI entity performance evaluation of BERT, RoBERTa, and XLNet
| PHI entity | Recall (BERT) | Recall (RoBERTa) | Recall (XLNet) | Precision (BERT) | Precision (RoBERTa) | Precision (XLNet) | F1-score (BERT) | F1-score (RoBERTa) | F1-score (XLNet) | Support |
|---|---|---|---|---|---|---|---|---|---|---|
| MEDICALRECORD | 0.97 | 0.98 | 0.99 | 0.98 | 0.95 | 0.98 | 0.98 | 0.96 | 0.98 | 1,849 |
| DATE | 0.98 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 0.98 | 0.99 | 0.99 | 13,251 |
| IDNUM | 0.93 | 0.91 | 0.93 | 0.88 | 0.75 | 0.88 | 0.90 | 0.82 | 0.90 | 642 |
| AGE | 0.94 | 0.97 | 0.97 | 0.85 | 0.95 | 0.98 | 0.89 | 0.96 | 0.98 | 628 |
| PHONE | 0.91 | 0.84 | 0.95 | 0.75 | 0.79 | 0.97 | 0.82 | 0.81 | 0.96 | 707 |
| ZIP | 0.88 | 0.95 | 0.99 | 0.75 | 0.96 | 0.97 | 0.81 | 0.95 | 0.98 | 377 |
| STATE | 0.68 | 0.92 | 0.94 | 0.92 | 0.58 | 0.88 | 0.78 | 0.71 | 0.91 | 248 |
| PATIENT | 0.49 | 0.90 | 0.97 | 0.53 | 0.88 | 0.96 | 0.51 | 0.89 | 0.97 | 2,135 |
| DOCTOR | 0.57 | 0.91 | 0.95 | 0.44 | 0.91 | 0.96 | 0.50 | 0.91 | 0.96 | 2,562 |
| HOSPITAL | 0.31 | 0.85 | 0.90 | 0.32 | 0.73 | 0.81 | 0.31 | 0.78 | 0.85 | 1,539 |
| CITY | 0.11 | 0.80 | 0.82 | 0.69 | 0.58 | 0.66 | 0.19 | 0.67 | 0.73 | 390 |
| STREET | 0.13 | 0.93 | 0.95 | 0.12 | 0.90 | 0.89 | 0.12 | 0.92 | 0.92 | 183 |
| COUNTRY | 0.00 | 0.12 | 0.61 | 0.00 | 0.92 | 0.81 | 0.00 | 0.21 | 0.73 | 122 |
| DEVICE | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 29 |
| LOCATION-OTHER | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 15 |
| ORGANIZATION | 0.00 | 0.07 | 0.45 | 0.00 | 0.40 | 0.71 | 0.00 | 0.12 | 0.55 | 92 |
| PROFESSION | 0.00 | 0.47 | 0.51 | 0.00 | 0.33 | 0.54 | 0.00 | 0.39 | 0.53 | 136 |
| USERNAME | 0.00 | 0.89 | 0.95 | 0.00 | 0.88 | 0.79 | 0.00 | 0.89 | 0.86 | 148 |
BERT: bidirectional encoder representations from transformers, RoBERTa: robustly optimized BERT pre-training approach, XLNet: a model built based on Transformer-XL.
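Per-entity scores like those above are conventionally computed at the entity level from IOB sequences. A minimal sketch using the seqeval library; this tooling is an assumption, as the record does not state how the metrics were computed, and the data here is hypothetical:

```python
# Sketch: entity-level precision/recall/F1 from IOB tag sequences.
from seqeval.metrics import classification_report

y_true = [["O", "B-CITY", "O", "B-DATE", "I-DATE"]]
y_pred = [["O", "B-HOSPITAL", "O", "B-DATE", "I-DATE"]]

# Reports per-entity precision, recall, F1, and support; the CITY entity
# counts as a miss because the predicted type differs.
print(classification_report(y_true, y_pred))
```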
Example comparison of valid and predicted tags in BERT
| Sentence | pain | persisted | even | after | returning | from | blounstown |
|---|---|---|---|---|---|---|---|
| Valid tags | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘B-CITY’ |
| Prediction tags | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘O’ | ‘B-HOSPITAL’ |
BERT: bidirectional encoder representations from transformers.
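Mismatches like the one above (a city token predicted as a hospital) can be surfaced by aligning the valid and predicted sequences; a small sketch reproducing the table's data:

```python
# Sketch: listing tokens where the predicted tag disagrees with the valid tag.
tokens = ["pain", "persisted", "even", "after", "returning", "from", "blounstown"]
valid = ["O", "O", "O", "O", "O", "O", "B-CITY"]
pred = ["O", "O", "O", "O", "O", "O", "B-HOSPITAL"]

for tok, v, p in zip(tokens, valid, pred):
    if v != p:
        print(f"{tok}: expected {v}, predicted {p}")
# blounstown: expected B-CITY, predicted B-HOSPITAL
```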