Peng Wang1, Yong Li2, Liang Yang3, Simin Li3, Linfeng Li3, Zehan Zhao4, Shaopei Long2, Fei Wang5, Hongqian Wang5, Ying Li5, Chengliang Wang1.
Abstract
BACKGROUND: With the popularization of electronic health records in China, digitalized clinical data hold great potential for real-world medical research. However, these data usually contain a great deal of protected health information, and using them directly may raise privacy issues. Deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem, and rule-based, machine learning-based, and deep learning-based methods have been proposed to solve it. However, these methods still struggle with the scarcity of Chinese electronic health record data and the complex features of the Chinese language.
Keywords: CRF; EHR; PHI; TinyBert; algorithm; data augmentation; de-identification; de-identify; development; health information; health record; machine learning; medical record; model; patient information; personal information; privacy; protected data; protected information
Year: 2022 PMID: 36040774 PMCID: PMC9472063 DOI: 10.2196/38154
Source DB: PubMed Journal: JMIR Med Inform
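The abstract frames deidentification as a named entity recognition (NER) task over PHI categories such as personal names, locations, organizations, and dates (the entity types reported in the tables below). As an illustration only, not the authors' code, the sketch below shows how a Chinese EHR sentence could be represented with character-level BIO labels and how labeled spans are recovered from them; the example sentence and tag names are assumptions.

```python
# Minimal sketch (not the authors' code): deidentification framed as
# character-level NER with BIO labels over PHI types (PER, LOC, ORG, DAT).
from typing import List, Tuple

def bio_to_spans(chars: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_type, surface_text) pairs from a BIO-tagged character sequence."""
    spans, buf, cur = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if cur:
                spans.append((cur, "".join(buf)))
            cur, buf = tag[2:], [ch]
        elif tag.startswith("I-") and cur == tag[2:]:
            buf.append(ch)
        else:
            if cur:
                spans.append((cur, "".join(buf)))
            cur, buf = None, []
    if cur:
        spans.append((cur, "".join(buf)))
    return spans

# Hypothetical example sentence (illustrative only).
chars = list("患者张三于2020年3月入住重庆医院")
tags = ["O", "O", "B-PER", "I-PER", "O",
        "B-DAT", "I-DAT", "I-DAT", "I-DAT", "I-DAT", "I-DAT", "I-DAT",
        "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(bio_to_spans(chars, tags))  # [('PER', '张三'), ('DAT', '2020年3月'), ('ORG', '重庆医院')]
```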
Figure 1. The proposed model for deidentifying protected health information in Chinese electronic health records. BERT: bidirectional encoder representations from transformers; CRF: conditional random field; FFN: feed-forward network; MHA: multi-head attention; PER: personal name.
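Figure 1 describes a BERT-style encoder (distilled to TinyBERT) followed by a CRF decoding layer. The following is a hedged sketch of that kind of architecture, assuming PyTorch, the Hugging Face transformers library, and the pytorch-crf package; none of these are confirmed as the authors' implementation, and the checkpoint name and tag set are placeholders.

```python
# Sketch only: a BERT-style encoder with a CRF decoding layer for PHI tagging.
# Assumes: pip install torch transformers pytorch-crf. Checkpoint name is a placeholder.
import torch
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF

class BertCrfTagger(nn.Module):
    def __init__(self, encoder_name: str, num_tags: int, dropout: float = 0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # e.g. a TinyBERT/Chinese BERT checkpoint
        hidden = self.encoder.config.hidden_size
        self.dropout = nn.Dropout(dropout)
        self.emission = nn.Linear(hidden, num_tags)              # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)               # transition scores + Viterbi decoding

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.emission(self.dropout(hidden))
        mask = attention_mask.bool()
        if labels is not None:
            # Negative log-likelihood of the gold tag sequence under the CRF
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)              # best tag sequence per sentence
```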
Figure 2. The TinyBERT knowledge distillation process used in our model. FFN: feed-forward network; Attn: attention layer; L(emb): embedding loss; L(tr): transformer layer loss; L(pr): prediction loss; A: attention map values; Z: predicted logit vectors; S: student network; T: teacher network.
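Figure 2 lists an embedding loss, a transformer-layer loss, and a prediction loss. These terms match the standard TinyBERT distillation objective; the exact weighting and layer mapping used here are not stated in this record, so the formulas below are the general formulation rather than the authors' exact equations. With S the student, T the teacher, A attention maps, H hidden states, z logits, t a temperature, and W_e, W_h learned projections:

$$\mathcal{L}_{emb} = \mathrm{MSE}\big(E^{S}W_{e},\, E^{T}\big)$$

$$\mathcal{L}_{tr} = \frac{1}{h}\sum_{i=1}^{h}\mathrm{MSE}\big(A_{i}^{S},\, A_{i}^{T}\big) + \mathrm{MSE}\big(H^{S}W_{h},\, H^{T}\big)$$

$$\mathcal{L}_{pred} = \mathrm{CE}\big(z^{T}/t,\; z^{S}/t\big)$$

The total distillation loss is the sum of these terms over the mapped student layers.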
Figure 3. An example of the linearization operation used in data augmentation with a generation approach (DAGA), part of our data augmentation method. PER: personal name.
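Figure 3 illustrates the linearization step of DAGA, in which tag tokens are interleaved with sentence tokens so that a language model can be trained on labeled sequences and then sample new labeled sentences. Below is a minimal sketch of one common linearization scheme (a label token inserted before each non-O character); it follows the published DAGA recipe in general terms and is not necessarily the authors' exact implementation.

```python
# Sketch of DAGA-style linearization: interleave label tokens with characters so a
# language model can be trained on labeled sequences and sample synthetic ones.
def linearize(chars, tags):
    out = []
    for ch, tag in zip(chars, tags):
        if tag != "O":                # prepend the tag token for labeled characters
            out.append(f"[{tag}]")
        out.append(ch)
    return out

chars = list("张三于重庆医院就诊")
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(" ".join(linearize(chars, tags)))
# [B-PER] 张 [I-PER] 三 于 [B-ORG] 重 [I-ORG] 庆 [I-ORG] 医 [I-ORG] 院 就 诊
```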
Statistical information for the raw data and hybrid augmented data for each type of entity.
| Entity types | Training set: original, n | Training set: DAGAa, n | Training set: MRb, n | Training set: total, n | Evaluation set (original), n | Test set (original), n |
| PERc | 1448 | 4327 | 2892 | 8667 | 631 | 628 |
| LOCd | 302 | 1384 | 589 | 2275 | 102 | 105 |
| ORGe | 846 | 2188 | 1692 | 4726 | 275 | 303 |
| DATf | 3013 | 7412 | 6011 | 16,436 | 999 | 1034 |
| Total | 5609 | 15,311 | 11,184 | 32,104 | 2007 | 2070 |
aDAGA: data augmentation with a generation approach.
bMR: mention replacement.
cPER: personal name.
dLOC: location.
eORG: organization name.
fDAT: date.
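The MR counts in the table above come from mention replacement, an augmentation that swaps an entity mention for another mention of the same type drawn from the corpus. The following is a hedged sketch of that idea, not the authors' exact procedure; the mention pools are invented for illustration.

```python
# Sketch of mention replacement (MR): substitute an entity mention with another
# mention of the same type sampled from the training corpus. Pools are illustrative.
import random

mention_pool = {
    "PER": ["李四", "王五"],
    "ORG": ["北京医院", "第三人民医院"],
}

def mention_replace(chars, tags, rng=random):
    new_chars, new_tags, i = [], [], 0
    while i < len(chars):
        if tags[i].startswith("B-") and tags[i][2:] in mention_pool:
            etype = tags[i][2:]
            j = i + 1
            while j < len(chars) and tags[j] == f"I-{etype}":
                j += 1
            repl = rng.choice(mention_pool[etype])   # same-type replacement mention
            new_chars.extend(repl)
            new_tags.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(repl) - 1))
            i = j
        else:
            new_chars.append(chars[i])
            new_tags.append(tags[i])
            i += 1
    return new_chars, new_tags

chars = list("张三于重庆医院就诊")
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG", "O", "O"]
print(*mention_replace(chars, tags))
```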
Settings for each benchmark.
| Models | Settings | Parameters, n | Description |
| Gated recurrent units | 1 layer,a 512 dimsb | 2,190,000 | The parameters were randomly initialized. |
| BiLSTMc | 1 layer, 512 dims | 2,210,000 | The parameters were randomly initialized. |
| Base BERTd | 12 layers, 768 dims, 12 headse | 110,000,000 | The base BERT was pretrained on the English Wikipedia corpus. |
| Chinese-BERT-wwm | 12 layers, 768 dims, 12 heads | 110,000,000 | The base BERT was pretrained on the Chinese Wikipedia corpus with a whole word masking training strategy. |
| Chinese-BERT-wwm-ext | 12 layers, 768 dims, 12 heads | 110,000,000 | The base BERT was pretrained on the Chinese Wikipedia corpus, news, and question-answer pairs with a whole word masking training strategy. |
| Chinese-BERT-base | 12 layers, 768 dims, 12 heads | 147,000,000 | The base BERT was pretrained on the Chinese Wikipedia corpus with char, glyph, and pinyin embedding. |
| Chinese-BERT-large | 24 layers, 1024 dims, 12 heads | 374,000,000 | The base-BERT-large model with more layers and larger dims was pretrained on the Chinese Wikipedia corpus using char, glyph, and pinyin embedding. |
| PCL-MedBERT | 12 layers, 768 dims, 12 heads | 110,000,000 | A BERT model that was pretrained on the Chinese medicine corpus. |
| PCL-MedBERT-wwm | 12 layers, 768 dims, 12 heads | 110,000,000 | A BERT model that was pretrained on the Chinese medicine corpus with whole word masking training. |
| TinyBERT | 6 layers, 768 dims, 12 heads | 67,000,000 | A BERT distilled from the Chinese-BERT-wwm. |
aLayer: transformer blocks.
bDims: embedding dimensions.
cBiLSTM: bidirectional long short-term memory.
dBERT: bidirectional encoder representations from transformers.
eHeads: attention heads.
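The layer/dimension/head settings in this table map directly onto standard transformer configuration fields. As an illustration only (assuming the Hugging Face transformers library, which the authors may or may not have used), the 6-layer TinyBERT row could be expressed as follows; the FFN size is an assumption not stated in the table.

```python
# Illustrative mapping of the table's settings onto a standard BERT config
# (Hugging Face transformers); values taken from the TinyBERT row above.
from transformers import BertConfig

tinybert_cfg = BertConfig(
    num_hidden_layers=6,        # "6 layers" (transformer blocks)
    hidden_size=768,            # "768 dims" (embedding dimensions)
    num_attention_heads=12,     # "12 heads" (attention heads)
    intermediate_size=3072,     # assumed FFN size; not stated in the table
)
print(tinybert_cfg.num_hidden_layers, tinybert_cfg.hidden_size)
```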
Comparison of each benchmark model after fine-tuning on the raw data and the hybrid augmented data. Italics indicate the best performance.
| Models | Dataraw: P,b % | Dataraw: R,c % | Dataraw: F1,d % | DataDAGA+MRa: P, % | DataDAGA+MRa: R, % | DataDAGA+MRa: F1, % |
| Gated recurrent units | 94.92 | 93.04 | 93.97 | 95.9 | 95.02 | 95.46 |
| BiLSTMe | 97.12 | 95.99 | 96.55 | 97.53 | 97.39 | 97.46 |
| Base BERTf | | 98.7 | 98.63 | 98.65 | 98.85 | 98.75 |
| Chinese-BERT-wwm | 98.35 | 98.5 | 98.43 | 98.5 | 98.90 | 98.7 |
| Chinese-BERT-wwm-ext | 98.4 | 98.5 | 98.45 | 98.65 | 98.90 | 98.78 |
| Chinese-BERT-base | 82.92 | 85.36 | 84.12 | 96.86 | 97.05 | 96.96 |
| Chinese-BERT-large | 95.42 | 95.7 | 95.56 | 97.27 | 96.57 | 96.92 |
| PCL-MedBERT | 98.37 | 99.08 | 98.72 | 98.36 | 98.79 | 98.58 |
| PCL-MedBERT-wwm | 98.42 | | | 98.46 | 98.89 | 98.67 |
| Our model | 97.84 | 98.6 | 98.22 | | | |
aDAGA+MR: data augmentation with a generation approach and mention replacement.
bP: precision.
cR: recall.
dF1: F1 score.
eBiLSTM: bidirectional long short-term memory.
fBERT: bidirectional encoder representations from transformers.
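For reference, the precision, recall, and F1 values in these tables follow the usual entity-level definitions, where TP, FP, and FN count correctly predicted, spuriously predicted, and missed entity mentions (the exact matching criterion is not stated in this record):

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2PR}{P + R}$$

For example, the gated-recurrent-units raw-data row is consistent with F1 as the harmonic mean of P and R: 2 × 94.92 × 93.04 / (94.92 + 93.04) ≈ 93.97.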
Ablation studies of each model fine-tuned on different data sets. Italics indicate the best performance.
| Models | Dataraw: P,c % | Dataraw: R,d % | Dataraw: F1,e % | DataMRa: P, % | DataMRa: R, % | DataMRa: F1, % | DataDAGAb: P, % | DataDAGAb: R, % | DataDAGAb: F1, % |
| Gated recurrent units | 94.92 | 93.04 | 93.97 | 95.68 | 94.2 | 94.94 | 94.64 | 94.59 | 94.61 |
| BiLSTMf | 97.12 | 95.99 | 96.55 | 97.72 | 97.15 | 97.43 | 97.14 | 96.86 | 97 |
| Base BERTg | | 98.7 | 98.63 | 98.25 | 98.6 | 98.43 | 98.6 | 98.5 | 98.55 |
| Chinese-BERT-wwm | 98.35 | 98.5 | 98.43 | 98.5 | 98.7 | 98.6 | 98.45 | 98.7 | 98.58 |
| Chinese-BERT-wwm-ext | 98.4 | 98.5 | 98.45 | 98.11 | 98.7 | 98.4 | 98.8 | 98.9 | 98.85 |
| Chinese-BERT-base | 82.92 | 85.36 | 84.12 | 88.37 | 88.88 | 88.63 | 94.42 | 95.7 | 95.06 |
| Chinese-BERT-large | 95.42 | 95.7 | 95.56 | 94.95 | 96.42 | 95.68 | 97.53 | 97.25 | 97.39 |
| PCL-MedBERT | 98.37 | 99.08 | 98.72 | 98.18 | 98.89 | 98.53 | 98.7 | | 98.96 |
| PCL-MedBERT-wwm | 98.42 | | | 98.99 | | 99.13 | | | |
| Our model | 97.84 | 98.6 | 98.22 | 98.32 | | 98.68 | 98.18 | 99.08 | 98.6 |
aMR: mention replacement.
bDAGA: data augmentation with a generation approach.
cP: precision.
dR: recall.
eF1: F1 score.
fBiLSTM: bidirectional long short-term memory.
gBERT: bidirectional encoder representations from transformers.
Performance comparison of our model on various entity types after fine-tuning our model with different data sets. Italics indicate the best performance.
| Methods | PERa: P,e % | PERa: R,f % | PERa: F1,g % | LOCb: P, % | LOCb: R, % | LOCb: F1, % | ORGc: P, % | ORGc: R, % | ORGc: F1, % | DATd: P, % | DATd: R, % | DATd: F1, % |
| Dataraw | 99.21 | 99.52 | 99.36 | 96.15 | 95.24 | 95.69 | 97.06 | 98.02 | 97.54 | 97.42 | 98.55 | 97.98 |
| DataDAGAh | 99.37 | | 99.6 | 95.28 | 96.19 | 95.73 | 96.43 | 98.02 | 97.22 | 98.27 | 99.23 | 98.75 |
| DataMRi | 99.36 | 99.36 | 99.36 | 94.44 | 97.14 | 95.77 | 96.1 | 97.69 | 96.89 | | | |
| DataDAGA+MR | 99.68 | | | | | | | | | 98.65 | 99.13 | 98.89 |
aPER: personal name.
bLOC: location.
cORG: organization name.
dDAT: date.
eP: precision.
fR: recall.
gF1: F1 score.
hDAGA: data augmentation with a generation approach.
iMR: mention replacement.
Symbols and meanings of additionally built training sets.
| Symbols | Meaning |
| | Randomly selected sample comprising 10% of dataraw. |
| | Randomly selected sample comprising 50% of dataraw. |
| | Mixed data from |
| | Mixed data from |
aDAGA: data augmentation with a generation approach.
bMR: mention replacement.
Results of TinyBERT after fine-tuning on different data volumes.
| Data volume | P,a % | R,b % | F1,c % |
| | 91.33 | 95.26 | 93.26 |
| | 97.46 | 98.36 | 97.91 |
| | 98.13 | 98.89 | 98.51 |
| | 98.51 | 99.08 | 98.8 |
aP: precision.
bR: recall.
cF1: F1 score.
dDAGA: data augmentation with a generation approach.
eMR: mention replacement.
Efficiency comparison of the benchmark models.
| Models | CPUa time, seconds | Difference vs our model, % | GPUb time, seconds | Difference vs our model, % |
| Gated recurrent units | 100.76 | –36.31 | 56.45 | –9.52 |
| BiLSTMc | 98.61 | –37.68 | 54.94 | –11.94 |
| Base BERTd | 262.81 | 39.8 | 78.02 | 20.03 |
| Chinese-BERT-wwm | 259.96 | 39.16 | 78.07 | 20.08 |
| Chinese-BERT-wwm-ext | 263.23 | 39.89 | 77.64 | 19.64 |
| Chinese-BERT-base | 220.93 | 28.38 | 76.28 | 18.21 |
| Chinese-BERT-large | 698.99 | 77.36 | 117.05 | 46.7 |
| PCL-MedBERT | 261.53 | 39.5 | 76.44 | 18.38 |
| PCL-MedBERT-wwm | 260.38 | 39.23 | 78.02 | 20.03 |
| Our model | 158.22 | N/Ae | 62.39 | N/A |
aCPU: central processing unit.
bGPU: graphics processing unit.
cBiLSTM: bidirectional long short-term memory.
dBERT: bidirectional encoder representations from transformers.
eN/A: not applicable.
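The CPU and GPU times above are inference-time measurements. The following is a generic, hedged sketch of how such wall-clock comparisons are typically collected with PyTorch; it is not the authors' benchmarking script, and the batch handling and synchronization details are assumptions.

```python
# Generic sketch of wall-clock inference timing on CPU vs GPU with PyTorch.
# Not the authors' benchmarking code; model and data loader are placeholders.
import time
import torch

def timed_inference(model, loader, device):
    model.to(device).eval()
    start = time.perf_counter()
    with torch.no_grad():
        for input_ids, attention_mask in loader:
            model(input_ids.to(device), attention_mask.to(device))
    if device.type == "cuda":
        torch.cuda.synchronize()   # wait for queued GPU kernels before stopping the clock
    return time.perf_counter() - start

# Example usage:
# cpu_s = timed_inference(tagger, test_loader, torch.device("cpu"))
# gpu_s = timed_inference(tagger, test_loader, torch.device("cuda"))
```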
Figure 4. Examples of the results of fine-tuning our model on the hybrid augmented data set. DAT: date; ORG: organization name; DAGA: data augmentation with a generation approach; MR: mention replacement.
Figure 5. Training curves of our model on (A) the raw data set and (B) the hybrid augmented data set.