Xianglong Chen1, Chunping Ouyang1, Yongbin Liu1, Yi Bu2.
Abstract
Electronic medical records are an integral part of medical texts, and entity recognition in them has motivated many studies proposing entity extraction methods. In this paper, we propose an entity extraction model for Chinese Electronic Medical Records (CEMRs). In the input layer, word embeddings and dictionary-feature embeddings serve as input vectors, where each word embedding combines a character representation and a word representation. The input vectors are fed to a bidirectional long short-term memory (Bi-LSTM) network to capture contextual features, and a conditional random field (CRF) then captures dependencies between neighboring tags. On the body classification task the F1 value reached 90.65%, and on the anatomic region recognition task it reached 93.89%. On both tasks, our model outperformed state-of-the-art models such as Bi-LSTM-CRF, Bi-LSTM-Attention, and Vote. Experiments also show that our model handles low-frequency and unknown entities well; with a small training dataset, our method improved the F1 value by 2-4% over the basic Bi-LSTM-CRF model. Additionally, on the anatomic region recognition task, we combined the proposed extraction model with 12 hand-designed rules and a domain dictionary; the weighted F1 value for extracting the three specific entity types then reached 84.36%.
Keywords: Bi-LSTM-CRF; domain dictionary; electronic medical records; entity recognition; rules
Year: 2020 PMID: 32295174 PMCID: PMC7215438 DOI: 10.3390/ijerph17082687
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. The general architecture of our proposed model. The model has three parts: vector representation, the entity extraction model, and rules for the specific task.
An example of the tag sequence.

| Word Sequence | ‘Right Side’ | ‘Internal Mammary’ | ‘Lymph Node’ | ‘Swollen’ | ‘Change’ | ‘Not’ | ‘Significant’ | 。 |
|---|---|---|---|---|---|---|---|---|
| Tag Sequence | B-b | I-b | E-b | O | O | O | O | O |
| Entity Type | Body | Body | Body | None | None | None | None | None |
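The B/I/E/O scheme in the table above can be decoded back into entity spans. A minimal sketch (the function name `decode_bioe` is our own, not from the paper):

```python
def decode_bioe(tokens, tags):
    """Recover entity spans from a B/I/E/O tag sequence.

    'B-b' opens a body entity, 'I-b' continues it, 'E-b' closes it,
    and 'O' marks tokens outside any entity.
    """
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        elif tag.startswith("E-") and current:
            current.append(token)
            entities.append(" ".join(current))
            current = []
        else:  # 'O' or a malformed sequence resets the buffer
            current = []
    return entities

tokens = ["Right Side", "Internal Mammary", "Lymph Node",
          "Swollen", "Change", "Not", "Significant"]
tags = ["B-b", "I-b", "E-b", "O", "O", "O", "O"]
print(decode_bioe(tokens, tags))  # ['Right Side Internal Mammary Lymph Node']
```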
Figure 2. Word vector as a combination of character embedding and word embedding. Each word trains a word vector. The word-embedding training process is highlighted in color and the character-embedding training process in white.
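One plausible way to realize this combination, shown here as a sketch rather than the paper's exact method, is to pool the character vectors of a word and concatenate the result with its word-level vector:

```python
def combine_embeddings(char_vecs, word_vec):
    """Average a word's character vectors and concatenate the result
    with the word-level vector (one combination strategy; the paper's
    exact fusion may differ)."""
    dim = len(char_vecs[0])
    char_avg = [sum(v[i] for v in char_vecs) / len(char_vecs) for i in range(dim)]
    return char_avg + list(word_vec)

chars = [[1.0, 2.0], [3.0, 4.0]]   # two character vectors of dimension 2
word = [0.5, 0.5, 0.5]             # word vector of dimension 3
print(combine_embeddings(chars, word))  # [2.0, 3.0, 0.5, 0.5, 0.5]
```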
N-gram templates used to obtain text segments.

| N-Gram | Templates |
|---|---|
| 1-gram | |
| 2-gram | |
| 3-gram | |
| 4-gram | |
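N-gram templates of this kind enumerate all character windows of length 1 to 4 over the input, which can then be matched against the domain dictionary. A sketch of that enumeration (our own helper, not the paper's code):

```python
def ngram_segments(chars, max_n=4):
    """Enumerate every 1- to max_n-gram text segment of a character
    sequence; each segment can then be looked up in a domain dictionary."""
    segments = []
    for n in range(1, max_n + 1):
        for i in range(len(chars) - n + 1):
            segments.append("".join(chars[i:i + n]))
    return segments

print(ngram_segments(list("abcd"), max_n=2))
# ['a', 'b', 'c', 'd', 'ab', 'bc', 'cd']
```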
Figure 3. The main architecture of our entity extraction model.
Description of the six main rules.

| Target | Rules |
|---|---|
| Sentence segmentation | Segment sentences at periods and semicolons; in special cases the sentence end is unmarked but the sentence beginning is numbered. |
| Candidate sentence for primary tumor site | Contains any of the keywords ‘cancer’, ‘CA’, ‘MT’. |
| Candidate sentence for lesion size | Contains (“cm” or “CM” or “MM”) together with (‘density’ or ‘shadow’). |
| Candidate sentence for metastasis site | Contains the keyword ‘Metastasis’. If the sentence also contains the keyword for the primary tumor site, the candidate sentence begins after that keyword. |
| Special case processing | Append ‘lymph node’ after a body-part entity (such as ‘Mediastinum’). |
| Primary site lesion size extraction | If the primary tumor site appears in a lesion-size candidate sentence, extract the lesion-size entity from that sentence. |
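Several of these rules reduce to simple delimiter and keyword checks. A rough sketch of the first three, using illustrative English keywords in place of the original Chinese ones (function names are our own):

```python
import re

def segment_sentences(text):
    """Rule 1: split on ASCII or Chinese periods and semicolons."""
    return [s for s in re.split(r"[.;。；]", text) if s.strip()]

def is_tumor_candidate(sentence):
    """Rule 2: a sentence with a tumor keyword is a candidate for the
    primary tumor site."""
    return any(k in sentence for k in ("cancer", "CA", "MT"))

def is_size_candidate(sentence):
    """Rule 3: lesion-size candidates need a unit and a finding word."""
    has_unit = any(u in sentence for u in ("cm", "CM", "MM"))
    has_finding = any(w in sentence for w in ("density", "shadow"))
    return has_unit and has_finding

text = "A 2cm high density shadow; no sign of cancer."
sents = segment_sentences(text)
print([(s, is_size_candidate(s), is_tumor_candidate(s)) for s in sents])
```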
Statistics of entities and sentences.

| Type | Number of Entities | Number of Sentences |
|---|---|---|
| CCKS-2017 body | 8310 | 6523 |
| CHIP-2018 body | 13,124 | 5117 |
| CHIP-2018 lesion size | 1669 | |
Parameter setting of the proposed method.

| Parameter | Value |
|---|---|
| Word embedding size | 200 |
| Dictionary-feature embedding size | 100 |
| Hidden neurons per hidden layer | 300 |
| batch_size | 64 |
| tag_indices | 4 |
| Learning rate | 0.005 |
| Number of epochs | 10 |
| Dropout | 0.5 |
| Optimizer | Adam |
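For reference, the same settings can be collected in one configuration mapping; reading `tag_indices` as the number of tag types (B, I, E, O) is our assumption, and the key names are our own:

```python
# Hyperparameters from the parameter table, gathered in one place.
CONFIG = {
    "word_embedding_size": 200,        # word-vector dimension
    "dict_feature_embedding_size": 100,
    "hidden_units_per_layer": 300,     # Bi-LSTM hidden size
    "batch_size": 64,
    "num_tags": 4,                     # tag_indices: B, I, E, O (assumed)
    "learning_rate": 0.005,
    "epochs": 10,
    "dropout": 0.5,
    "optimizer": "Adam",
}
print(CONFIG["learning_rate"])  # 0.005
```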
Comparative results of four different feature combination models (CHIP-2018 Task 1, anatomic site; CCKS-2017 Task 2, body category).

| Methods | CHIP-2018 P | CHIP-2018 R | CHIP-2018 F1 | CCKS-2017 P | CCKS-2017 R | CCKS-2017 F1 |
|---|---|---|---|---|---|---|
| Char+Bi-LSTM | 93.26 | 89.63 | 91.41 | 89.13 | 87.01 | 88.05 |
| Word+Bi-LSTM | 91.76 | 86.92 | 89.27 | 85.47 | 83.68 | 84.57 |
| Char+word+Bi-LSTM | | 90.22 | 91.92 | 89.52 | 87.83 | 88.67 |
| Char+word+dict+Bi-LSTM | 93.31 | 93.89 | 93.60 | 90.68 | 89.59 | 90.13 |
| Char+word+dict+Bi-LSTM (Highway+concat) | 93.58 | | | | | |
The best experimental results in Table 6 are shown in bold.
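As a sanity check, the F1 columns in the table are the harmonic mean of the corresponding P and R; for example, the Char+word+dict row:

```python
def f1(p, r):
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * p * r / (p + r)

# P = 93.31 and R = 93.89 reproduce the reported F1 of 93.60.
print(round(f1(93.31, 93.89), 2))  # 93.6
```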
Figure 4. F1-score performance at different training epochs.
Comparative results between state-of-the-art models and our method (F1).

| Methods | 2017 CCKS Task 2 on Body Category | 2018 CHIP Task 1 on Anatomic Site | Overall |
|---|---|---|---|
| Rule-based | 82.32 | * | * |
| CRF | 86.89 | 88.62 | 87.34 |
| Bi-LSTM-CRF | 88.05 | 91.41 | 89.73 |
| Bi-LSTM-CRF-N-F [ ] | 85.77 | * | * |
| Vote [ ] | 87.42 | * | * |
| Bi-LSTM-Attention [ ] | 89.21 | 92.37 | 91.46 |
| Our method | 90.65 | 93.89 | |
The best experimental results in Table 7 are shown in bold. *: no result reported on the CHIP-2018 dataset.
F1 score of each method under different data types.

| Methods | Unknown Entity Test Set | Low-Frequency Entity Test Set | High-Frequency Entity Test Set |
|---|---|---|---|
| Bi-LSTM-CRF | 39.65 | 56.83 | 91.13 |
| Bi-LSTM-Attention [ ] | 46.32 | 64.71 | 94.52 |
| Our method | | | |
The best experimental results in Table 8 are shown in bold.
F1-scores of the first submitted method on each category.
| Test Method | Primary Site | Lesion Size | Metastasis Site |
|---|---|---|---|
| Local test | 76.92 | 78.74 | 85.18 |
| Submitted | 65.49 | 62.84 | 61.82 |
Results of various methods on the CHIP-2018 test set.

| Method | P | R | F1 |
|---|---|---|---|
| First submission | 62.73 | 61.90 | 62.31 |
| CWD-Bi-LSTM+rule | 67.84 | 67.32 | 67.57 |
| CWD-Bi-LSTM+rule+dictionary | 76.38 | 75.69 | 76.03 |
| CWD-Bi-LSTM+rule*+dictionary | | | 84.36 |
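The weighted F1 reported in the abstract (84.36%) is a support-weighted average over the three entity categories. A sketch using the local-test F1 scores from Table 9 and illustrative entity counts (the per-category supports are not reported in this record):

```python
def weighted_f1(scores, supports):
    """Support-weighted average of per-category F1 scores."""
    total = sum(supports)
    return sum(f * n for f, n in zip(scores, supports)) / total

f1s = [76.92, 78.74, 85.18]   # local-test F1 per category (Table 9)
supports = [100, 50, 80]      # illustrative counts, not from the paper
print(round(weighted_f1(f1s, supports), 2))  # 80.19
```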
Model robustness verification.
| Noise Ratio | 0% | 1% | 2% | 3% | 4% | 5% | 6% | 7% | 8% | 9% | 10% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| F1-score | 93.13 | 92.89 | 92.67 | 92.86 | 92.25 | 91.96 | 92.61 | 92.39 | 92.01 | 92.49 | 92.28 |
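The noise-ratio experiment above can be reproduced in outline by randomly corrupting a fraction of the gold tags before training. A hedged sketch (our own helper, not the paper's code):

```python
import random

def inject_label_noise(tags, ratio, tagset=("B-b", "I-b", "E-b", "O"), seed=0):
    """Randomly replace a given fraction of tags with a different tag
    from the tagset, to probe model robustness to annotation noise."""
    rng = random.Random(seed)
    tags = list(tags)
    k = int(len(tags) * ratio)
    for i in rng.sample(range(len(tags)), k):
        choices = [t for t in tagset if t != tags[i]]
        tags[i] = rng.choice(choices)
    return tags

clean = ["O"] * 100
noisy = inject_label_noise(clean, 0.05)   # 5% noise ratio
print(sum(a != b for a, b in zip(clean, noisy)))  # 5
```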
Comparison of results under different fractions of the training dataset (F1).

| Methods | 5% | 10% | 15% | 20% | 25% |
|---|---|---|---|---|---|
| Bi-LSTM-CRF | 60.68 | 70.28 | 78.52 | 83.26 | 85.48 |
| Our method | 70.45 | 77.62 | 82.47 | 85.38 | 86.93 |
Figure 5. The effect of the number of rules on the results.