Kohei Kajiyama, Hiromasa Horiguchi, Takashi Okumura, Mizuki Morita, Yoshinobu Kano.
Abstract
BACKGROUND: Electronic data sources are becoming increasingly available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study aimed to improve de-identification performance for Japanese EHRs using classic machine learning, deep learning, and rule-based methods, depending on the dataset.
Keywords: De-identification; Electronic health records; Japanese language
Year: 2020 PMID: 32958039 PMCID: PMC7504663 DOI: 10.1186/s13326-020-00227-9
Source DB: PubMed Journal: J Biomed Semantics
Dataset characteristics
| Dataset name | MedNLP | Dummy-EHRs | Pathology Reports |
|---|---|---|---|
| # of documents | 50 reports | 32 pairs of records and summaries | 1000 reports |
| # of sentences | 2244 | 8183 | 3012 |
| # of tokens | 42,621 | 154,132 | 194,449 |
| # of all tags | 490 | 3017 | 295 |
| # of <a> (age) tags | 56 | 39 | 0 |
| # of <h> (hospital) tags | 75 | 170 | 31 |
| # of <p> (person) tags | 0 | 135 | 224 |
| # of <x> (sex) tags | 4 | 16 | 0 |
| # of <t> (time) tags | 355 | 2657 | 40 |
| Example in original Japanese text | 工場に勤めている<a>64歳</a>の<x>男性</x>。 | 施設入所中で寝たきりの<a>86歳</a><x>女性</x>。全介助 | <<院外標本 <h>静大皮フ科クリニック</h>、<p>桑田 智</p> |
| Example translated into English | A <a>64-year-old</a> <x>man</x> works in a factory | An <a>86-year-old</a> <x>woman</x> bedridden in a nursing home. Total assistance required | <<Ex-hospital sample <h>Shizudai Dermatology Clinic</h>, <p>Satoshi Kuwata</p> |
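The inline annotation shown in the example rows can be turned into character-offset entities with a small parser. The following is a minimal sketch, not the authors' tooling: the function name `extract_entities` and the choice of character offsets (rather than tokens from a Japanese tokenizer) are assumptions made for illustration.

```python
import re

# Tag letters as used in the dataset table: a=age, h=hospital, p=person, x=sex, t=time.
TAG_NAMES = {"a": "age", "h": "hospital", "p": "person", "x": "sex", "t": "time"}

# Matches <a>...</a> style spans and tolerates stray spaces inside the brackets.
TAG_PATTERN = re.compile(r"<\s*([ahpxt])\s*>(.*?)<\s*/\s*\1\s*>", re.DOTALL)


def extract_entities(annotated):
    """Return (plain_text, entities); each entity is (tag_name, start, end, surface).

    Offsets are character positions in the plain text, end exclusive.
    """
    plain_parts = []
    entities = []
    cursor = 0  # position in the annotated string
    offset = 0  # position in the plain string being built
    for match in TAG_PATTERN.finditer(annotated):
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        surface = match.group(2).strip()
        entities.append((TAG_NAMES[match.group(1)], offset, offset + len(surface), surface))
        plain_parts.append(surface)
        offset += len(surface)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), entities


plain, entities = extract_entities("工場に勤めている<a>64歳</a>の<x>男性</x>。")
print(plain)
print(entities)
```

Unrecognized markup such as the leading `<<` in the pathology example is left in the plain text untouched, because the pattern only accepts the five tag letters.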
Overall results
| Run | P | R | F | A |
|---|---|---|---|---|
| C3 | 89.59 | 91.67 | 90.62 | 99.58 |
| B3 | 91.67 | 86.57 | 89.05 | 99.54 |
| B1 | 90.05 | 87.96 | 88.99 | 99.49 |
| B2 | 90.82 | 87.04 | 88.89 | 99.52 |
| C1 | 92.42 | 84.72 | 88.41 | 99.49 |
| A1 | 91.50 | 84.72 | 87.98 | 99.47 |
| C2 | 91.50 | 84.72 | 87.98 | 99.46 |
| A2 | 90.15 | 84.72 | 87.35 | 99.41 |
| D1 | 86.10 | 74.54 | 79.90 | 99.36 |
| G1 | 82.09 | 76.39 | 79.14 | 99.38 |
| D3 | 85.87 | 73.15 | 79.00 | 99.35 |
| D2 | 80.81 | 74.07 | 77.29 | 99.24 |
| H2 | 76.17 | 75.46 | 75.81 | 99.28 |
| H1 | 75.81 | 75.46 | 75.64 | 99.27 |
| H3 | 74.88 | 74.54 | 74.71 | 99.26 |
P, R and F were calculated at the phrase level: P, precision; R, recall; F, F1-measure; and A, accuracy. A was calculated at the word level (the agreement ratio of B-*, I-* and O labels).
The first column gives the participants' team names, where the first letter is a team ID and the following number is a submission run ID.
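The difference between phrase-level P/R/F and word-level accuracy can be illustrated with a short sketch over BIO label sequences. This is a generic illustration under stated assumptions, not the shared-task scorer: the function names are hypothetical, a predicted phrase counts as correct only if its type and both boundaries match exactly, and a stray `I-*` label without a preceding `B-*` of the same type is ignored.

```python
def bio_to_spans(labels):
    """Convert a BIO label sequence into a set of (type, start, end) spans, end exclusive."""
    spans = set()
    start, kind = None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and kind == label[2:]
        if continues:
            continue
        if kind is not None:
            spans.add((kind, start, i))  # close the span that was open
        if label.startswith("B-"):
            start, kind = i, label[2:]
        else:
            start, kind = None, None  # "O" or a stray "I-*" opens nothing
    return spans


def evaluate(gold, pred):
    """Return (precision, recall, f1, accuracy) for two aligned BIO sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Word-level accuracy: share of positions whose B-*/I-*/O label agrees.
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold) if gold else 0.0
    return precision, recall, f1, accuracy


gold = ["O", "B-age", "I-age", "O", "B-sex", "O"]
pred = ["O", "B-age", "I-age", "O", "O", "O"]
print(evaluate(gold, pred))
```

In this toy case one of two gold phrases is missed, so recall drops to one half while word-level accuracy stays at five of six labels. The same effect explains why the A column in the table stays near 99 while F varies widely: most words carry the O label.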
Detailed results for each privacy type in MedNLP-1 (De-identification task)
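| Run | <a> age P | <a> age R | <a> age F | <x> sex P | <x> sex R | <x> sex F | <t> time P | <t> time R | <t> time F | <h> hospital name P | <h> hospital name R | <h> hospital name F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|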
| C3 | 90.32 | 87.5 | 88.89 | 100 | 100 | 100 | 87.16 | 91.49 | 89.27 | 97.30 | 94.74 | 96.00 |
| B3 | 90.00 | 84.38 | 87.10 | 100 | 50.00 | 66.67 | 91.30 | 89.36 | 90.32 | 97.06 | 86.84 | 91.67 |
| B1 | 93.33 | 87.5 | 90.32 | 100 | 100 | 100 | 90.65 | 89.36 | 90.00 | 89.47 | 89.47 | 89.47 |
| B2 | 90.00 | 84.38 | 87.10 | 100 | 100 | 100 | 91.24 | 88.65 | 89.93 | 91.89 | 89.47 | 90.67 |
| C1 | 96.67 | 90.62 | 93.55 | 100 | 50.00 | 66.67 | 91.18 | 87.94 | 89.53 | 93.55 | 76.32 | 84.06 |
| A1 | 92.86 | 81.25 | 86.67 | 100 | 50.00 | 66.67 | 91.04 | 86.52 | 88.73 | 91.89 | 89.47 | 90.67 |
| C2 | 96.67 | 90.62 | 93.55 | 100 | 50.00 | 66.67 | 89.13 | 87.23 | 88.17 | 96.77 | 78.95 | 86.96 |
| A2 | 92.86 | 81.25 | 86.67 | 100 | 50.00 | 66.67 | 89.05 | 86.52 | 87.77 | 91.89 | 89.47 | 90.67 |
| D1 | 92.31 | 75.00 | 82.76 | 100 | 50.00 | 66.67 | 82.84 | 78.72 | 80.73 | 96.15 | 65.79 | 78.12 |
| G1 | 80.65 | 78.12 | 79.37 | 100 | 50.00 | 66.67 | 84.56 | 81.56 | 83.03 | 72.73 | 63.16 | 67.61 |
| D3 | 88.89 | 75.00 | 81.36 | 100 | 50.00 | 66.67 | 83.08 | 76.60 | 79.70 | 96.15 | 65.79 | 78.12 |
| D2 | 92.31 | 75.00 | 82.76 | 100 | 50.00 | 66.67 | 75.86 | 78.01 | 76.92 | 96.15 | 65.79 | 78.12 |
| H2 | 83.87 | 81.25 | 82.54 | 100 | 100 | 100 | 73.79 | 75.89 | 74.83 | 77.78 | 73.68 | 75.68 |
| H1 | 80.65 | 78.12 | 79.37 | 100 | 100 | 100 | 75.86 | 78.01 | 76.92 | 70.27 | 68.42 | 69.33 |
| H3 | 83.87 | 81.25 | 82.54 | 100 | 100 | 100 | 73.79 | 75.89 | 74.83 | 70.27 | 68.42 | 69.33 |
P, R and F were calculated at the phrase level: P, precision; R, recall; F, F1-measure.
The first column gives the participants' team names, where the first letter is a team ID and the following number is a submission run ID.
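Within each privacy type, the three values are consistent with F being the harmonic mean of P and R. The sketch below recomputes F for a few rows of the detailed table; the helper name `f1` is an assumption, and small differences could arise in other rows because the published P and R are already rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (run, privacy type, P, R, reported F) copied from the detailed results table.
rows = [
    ("C3", "age", 90.32, 87.5, 88.89),
    ("B3", "sex", 100.0, 50.0, 66.67),
    ("B1", "hospital name", 89.47, 89.47, 89.47),
]

for run, tag, p, r, reported in rows:
    print(run, tag, round(f1(p, r), 2), reported)
```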
Rules used for our rule-based method, original Japanese with English translations
| Option 1 | Main rule (Japanese) | Main rule (English) | Option 2 |
|---|---|---|---|
| 翌 (next) | 一昨年 | two years ago | より (from) |
| 前 (before) | 昨年 | last year | まで (until) |
| 入院前 (before hospitalization) | 先月 | last month | 代 (‘s) |
| 入院後 (after hospitalization) | 先週 | last week | 前半 (early) |
| 来院から (after visit) | 昨日 | yesterday | 後半 (last) |
| 午前 (a.m.) | 今年 | this year | -- (from) |
| 午後 (p.m.) | 今月 | this month | -- (from) |
| 発症から (after onset) | 今週 | this week | 以上 (over) |
| 発症してから (after onset) | 今日 | today | 以下 (under) |
| 治療してから (after care) | 本日 | today | から (from) |
| | 来年 | next year | 時 (when) |
| | 来月 | next month | 頃 (about) |
| | 来週 | next week | ごろ (about) |
| | 翌日 | tomorrow | ころ (about) |
| | 再来週 | the week after next | 上旬 (early) |
| | 明後日 | day after tomorrow | 中旬 (mid) |
| | 同年 | same year | 下旬 (late) |
| | 同月 | same month | 春 (spring) |
| | 同日 | same day | 夏 (summer) |
| | 翌年 | following year | 秋 (fall) |
| | 翌日 | the next day | 冬 (winter) |
| | 翌朝 | the next morning | 朝 (morning) |
| | 前日 | the previous day | 昼 (noon) |
| | 未明 | early morning | 夕 (evening) |
| | その後 | after that | 晩 (night) |
| | xx年 | xx (year) | 早朝 (early morning) |
| | xx月 | xx (month) | 明朝 (early morning) |
| | xx週間 | xx (week) | 以前 (before) |
| | xx日 | xx (day) | 以降 (after) |
| | xx時 | xx (o’clock) | 夕刻 (evening) |
| | xx分 | xx (minutes) | ほど (about) |
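One way to read the rule table is as a pattern with three slots. The sketch below assumes that a time expression is an optional Option 1 string, followed by one main rule, followed by an optional Option 2 string, and that `xx` stands for a run of digits. These combination semantics are an assumption, since the table itself does not state them. Only a subset of the table entries is included, and the names `TIME_PATTERN` and `tag_time` are hypothetical.

```python
import re

# Subsets of the rule table, for illustration only.
OPTION_1 = ["翌", "前", "入院前", "入院後", "来院から", "午前", "午後",
            "発症から", "発症してから", "治療してから"]
MAIN_FIXED = ["一昨年", "昨年", "先月", "先週", "昨日", "今年", "今月", "今週",
              "今日", "本日", "来年", "来月", "来週", "翌日", "同日", "前日", "その後"]
OPTION_2 = ["より", "まで", "から", "頃", "ごろ", "ころ", "上旬", "中旬", "下旬",
            "以前", "以降", "ほど"]


def alternation(words):
    """Build a regex alternation, longest entries first so longer strings win."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# Assumed reading: [Option 1]? + main rule + [Option 2]?, with "xx" as ASCII or full-width digits.
TIME_PATTERN = re.compile(
    f"(?:{alternation(OPTION_1)})?"
    f"(?:{alternation(MAIN_FIXED)}|[0-9０-９]+(?:週間|年|月|日|時|分))"
    f"(?:{alternation(OPTION_2)})?"
)


def tag_time(text):
    """Wrap every matched time expression in <t>...</t>."""
    return TIME_PATTERN.sub(lambda m: f"<t>{m.group(0)}</t>", text)


print(tag_time("昨年より頭痛あり"))
print(tag_time("入院前3日から発熱"))
```

Sorting each alternation by length matters because Python's `re` takes the first alternative that matches rather than the longest one, so `発症してから` has to be tried before `発症から`.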
Fig. 1. Conceptual figure of our LSTM-based model, showing the embedding and NER parts in separate figures. + means concatenation. The first figure shows the embedding part, where W is the xth input word, L is the ith letter of the word W, r denotes right to left (forward) LSTM, l denotes left to right (backward) LSTM, and V is an intermediate node that corresponds to W. The second figure shows the NER part, where fl denotes forward LSTM, bl denotes backward LSTM, and c denotes the concatenated vector. Finally, a CRF layer is shown with example predicted named entities in the BIO annotation style.
LSTM parameter settings
| Parameter | Value |
|---|---|
| Word embedding size | 200 |
| Character embedding size | 100 |
| Hidden layer of character | 100 |
| Hidden layer of LSTM | 300 |
| Learning rate | 0.001 |
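The settings above can be collected in a small configuration object, which also makes the vector sizes implied by the concatenations in Fig. 1 explicit. This is a sketch under assumptions: the class and field names are hypothetical, and the derived sizes assume that each listed hidden size applies per direction and that the forward and backward outputs are concatenated. The table does not confirm either point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LstmConfig:
    """Hyperparameters from the parameter table; field names are illustrative."""
    word_embedding_size: int = 200
    char_embedding_size: int = 100
    char_hidden_size: int = 100
    lstm_hidden_size: int = 300
    learning_rate: float = 0.001

    @property
    def token_vector_size(self):
        # Assumption: word embedding concatenated with the forward and
        # backward character-LSTM outputs, each of size char_hidden_size.
        return self.word_embedding_size + 2 * self.char_hidden_size

    @property
    def crf_input_size(self):
        # Assumption: forward and backward word-level LSTM outputs are
        # concatenated before the CRF layer.
        return 2 * self.lstm_hidden_size


config = LstmConfig()
print(config.token_vector_size, config.crf_input_size)
```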
Evaluation results for each tag and in total, for different methods (rule, CRF, LSTM) and different evaluation datasets (MedNLP, dummy EHR, and pathology reports). M, d, and P respectively denote training data from MedNLP, dummy EHR, and pathology reports; M + d denotes training data consisting of MedNLP and dummy EHR; "all" denotes all three datasets combined; the other machine learning configurations use the target evaluation dataset as their training data. F1-score, precision, and recall are shown as values multiplied by 100. The best scores for each tag type and each evaluation metric are presented in bold typeface. All evaluations were done by four-fold cross-validation.
| Evaluation Results on MedNLP dataset | |||||||||||||
| tag type | #of tags | scores | Rule | CRF | CRF | CRF | CRF | CRF | LSTM | LSTM | LSTM | LSTM | LSTM |
| total | 490 | F1 | 82.62 | 43.85 | 0.71 | 26.40 | 67.34 | 83.07 | 41.26 | 0.43 | 67.35 | 57.03 | |
| prec | 78.90 | 46.20 | 2.50 | 21.51 | 66.54 | 81.33 | 41.07 | 0.48 | 66.98 | 57.94 | |||
| recall | 79.95 | 42.33 | 0.41 | 59.76 | 68.38 | 86.12 | 41.57 | 0.38 | 68.17 | 56.34 | |||
| age | 56 | F1 | 71.12 | 30.00 | 0.00 | 32.55 | 53.04 | 95.83 | 71.11 | 0.00 | 84.72 | 87.50 | |
| prec | 78.24 | 37.50 | 0.00 | 26.93 | 56.85 | 95.83 | 71.11 | 0.00 | 84.72 | 87.50 | |||
| recall | 65.47 | 28.13 | 0.00 | 46.05 | 50.00 | 95.83 | 71.11 | 0.00 | 84.72 | 87.50 | |||
| hospital | 75 | F1 | 84.73 | 43.25 | 0.00 | 26.02 | 70.04 | 66.67 | 13.33 | 13.89 | 66.67 | 41.67 | |
| prec | 80.75 | 66.67 | 0.00 | 20.55 | 91.67 | 75.00 | 11.11 | 10.67 | 70.83 | 45.83 | |||
| recall | 81.71 | 27.50 | 0.00 | 53.06 | 60.42 | 62.50 | 16.67 | 20.00 | 63.89 | 38.89 | |||
| person | 0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | |
| sex | 4 | F1 | 16.67 | 16.67 | 0.00 | 14.65 | 25.00 | 0.00 | 20.00 | 0.00 | 25.00 | 25.00 | |
| prec | 25.00 | 12.50 | 0.00 | 8.68 | 25.00 | 0.00 | 20.00 | 0.00 | 25.00 | 25.00 | |||
| recall | 12.50 | 25.00 | 0.00 | 50.00 | 25.00 | 0.00 | 20.00 | 0.00 | 25.00 | 25.00 | |||
| time | 355 | F1 | 50.00 | 16.67 | 47.43 | 0.98 | 14.65 | 70.57 | 67.22 | 42.98 | 89.78 | 82.67 | |
| prec | 50.00 | 25.00 | 45.16 | 2.50 | 8.68 | 65.46 | 66.26 | 39.46 | 88.68 | 81.53 | |||
| recall | 50.00 | 12.50 | 50.19 | 0.61 | 50.00 | 76.50 | 68.30 | 47.94 | 91.00 | 82.67 | |||
| Evaluation Results on Pathology Report dataset | |||||||||||||
| tag type | #of tags | scores | Rule | CRF | CRF | CRF | CRF | CRF | LSTM | LSTM | LSTM | LSTM | LSTM |
| all | 71 | F1 | 13.97 | 74.26 | 0.00 | 0.62 | 1.45 | 57.63 | 0.00 | 0.00 | 1.45 | 81.25 | |
| prec | 8.65 | 86.72 | 0.00 | 1.47 | 10.00 | 64.98 | 0.00 | 0.00 | 10.00 | 82.48 | |||
| recall | 43.33 | 65.16 | 0.00 | 0.39 | 0.78 | 54.06 | 78.84 | 0.00 | 0.00 | 0.78 | |||
| age | 0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | |
| hospital | 31 | F1 | 31.19 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 25.00 | 0.00 | 13.33 | 0.00 | |
| prec | 26.47 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 25.00 | 0.00 | 13.33 | 0.00 | |||
| recall | 41.28 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 25.00 | 0.00 | 13.33 | 0.00 | |||
| person | 224 | F1 | 0.00 | 91.08 | 0.00 | 0.00 | 6.25 | 71.31 | 95.19 | 0.00 | 0.00 | 0.00 | |
| prec | 0.00 | 0.00 | 0.00 | 10.00 | 74.79 | 95.19 | 0.00 | 0.00 | 0.00 | ||||
| recall | 0.00 | 87.21 | 0.00 | 0.00 | 4.55 | 69.63 | 95.19 | 0.00 | 0.00 | 0.00 | |||
| sex | 0 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | |
| time | 40 | F1 | 9.25 | 10.57 | 0.00 | 2.00 | 0.00 | 18.82 | 3.81 | 0.00 | 6.25 | 19.44 | |
| prec | 5.25 | 16.67 | 0.00 | 1.79 | 0.00 | 20.83 | 6.67 | 0.00 | 10.00 | 19.44 | |||
| recall | 9.09 | 0.00 | 2.27 | 0.00 | 19.32 | 25.00 | 2.67 | 0.00 | 4.55 | 19.44 | |||
| Evaluation Results on Dummy EHR dataset | |||||||||||||
| tag type | #of tags | scores | Rule | CRF | CRF | CRF | CRF | CRF | LSTM | LSTM | LSTM | LSTM | LSTM |
| total | 3017 | F1 | 43.74 | 66.97 | 44.01 | 19.67 | 67.13 | 65.79 | 63.99 | 20.33 | 1.60 | 68.19 | |
| prec | 42.89 | 66.77 | 67.35 | 56.72 | 67.60 | 68.27 | 68.76 | 26.68 | 2.22 | 72.79 | |||
| recall | 44.75 | 33.28 | 12.34 | 66.69 | 63.63 | 60.20 | 17.03 | 1.25 | 67.24 | 60.04 | |||
| age | 39 | F1 | 48.46 | 29.35 | 0.00 | 38.87 | 33.82 | 50.00 | 22.38 | 0.00 | 50.00 | 41.67 | |
| prec | 51.97 | 28.85 | 0.00 | 41.56 | 35.72 | 50.00 | 19.05 | 0.00 | 50.00 | 45.83 | |||
| recall | 50.46 | 30.00 | 0.00 | 36.71 | 32.50 | 50.00 | 32.38 | 0.00 | 50.00 | 41.67 | |||
| hospital | 170 | F1 | 15.98 | 47.85 | 33.19 | 0.00 | 35.73 | 22.22 | 35.79 | 0.00 | 40.00 | 43.33 | |
| prec | 10.07 | 38.75 | 0.00 | 44.91 | 35.90 | 28.33 | 34.48 | 0.00 | 37.50 | 45.83 | |||
| recall | 39.06 | 43.73 | 29.42 | 0.00 | 53.60 | 37.81 | 29.17 | 37.33 | 0.00 | 41.67 | |||
| person | 135 | F1 | 0.00 | 26.96 | 0.00 | 0.00 | 28.36 | 15.48 | 0.00 | 0.00 | 45.83 | 37.50 | |
| prec | 0.00 | 26.79 | 0.00 | 0.00 | 29.91 | 19.64 | 0.00 | 0.00 | 45.83 | 37.50 | |||
| recall | 0.00 | 30.71 | 0.00 | 0.00 | 27.99 | 13.39 | 0.00 | 0.00 | 45.83 | 37.50 | |||
| sex | 16 | F1 | 35.92 | 29.17 | 0.00 | 90.08 | 33.93 | 0.00 | 40.00 | 0.00 | 50.00 | 50.00 | |
| prec | 44.27 | 50.00 | 0.00 | 95.83 | 50.00 | 0.00 | 40.00 | 0.00 | 50.00 | 50.00 | |||
| recall | 43.13 | 20.83 | 0.00 | 85.63 | 27.08 | 0.00 | 40.00 | 0.00 | 50.00 | 50.00 | |||
| time | 2657 | F1 | 49.48 | 71.28 | 42.14 | 21.20 | 70.60 | 68.33 | 83.93 | 51.97 | 48.89 | 85.70 | |
| prec | 51.81 | 71.44 | 64.94 | 59.35 | 71.24 | 70.94 | 84.82 | 52.59 | 48.89 | 86.51 | |||
| recall | 47.38 | 71.15 | 32.08 | 13.58 | 70.00 | 66.08 | 83.29 | 51.46 | 48.89 | 84.93 | |||
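The four-fold cross-validation named in the caption of the evaluation tables can be sketched as follows. How the authors assigned items to folds (by document or by sentence, shuffled or not) is not stated here, so the interleaved assignment below is an assumption, as are the function names.

```python
def k_fold_splits(items, k=4):
    """Yield (train, test) lists; fold i holds every k-th item starting at index i."""
    items = list(items)
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


def cross_validate(items, train_and_score, k=4):
    """Average the score returned by train_and_score(train, test) over k folds."""
    scores = [train_and_score(train, test) for train, test in k_fold_splits(items, k)]
    return sum(scores) / len(scores)


# Toy usage: the "score" is just the held-out fraction of each fold.
documents = [f"doc{i}" for i in range(10)]
print(cross_validate(documents, lambda train, test: len(test) / len(documents)))
```

Each item appears in exactly one test fold, so every tagged phrase in a dataset contributes to the reported scores once.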