Xiaoling Cai, Shoubin Dong, Jinlong Hu.
Abstract
BACKGROUND: Named Entity Recognition (NER), a key step in health information extraction, faces many challenges in Chinese Electronic Medical Records (EMRs). First, the casual use of Chinese abbreviations and doctors' personal writing styles can produce multiple expressions for the same entity, and no common Chinese medical dictionary exists to support accurate entity extraction. Second, electronic medical records contain entities from a variety of categories, and the length of entities varies greatly across categories, which increases the difficulty of extraction for Chinese NER. Entity boundary detection therefore becomes the key to accurate entity extraction from Chinese EMRs, and we need a model that supports recognition of entities of multiple lengths without relying on any medical dictionary.
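Character-level Chinese NER is typically framed as BIO sequence labelling, where boundary detection amounts to recovering entity spans from per-character tags. A minimal decoding sketch (not the authors' code; the sentence and the `SYM` category label are hypothetical):

```python
# Minimal BIO decoding sketch for character-level Chinese NER (illustrative,
# not the paper's implementation). Each character carries a BIO tag; an
# entity spans from a B- tag through the following matching I- tags.

def decode_bio(chars, tags):
    """Return (entity_text, category) pairs recovered from BIO tags."""
    entities, current, category = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(("".join(current), category))
            current, category = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == category:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), category))
            current, category = [], None
    if current:
        entities.append(("".join(current), category))
    return entities

# Hypothetical example: "患者腹痛三天" ("patient, abdominal pain for three days")
chars = list("患者腹痛三天")
tags = ["O", "O", "B-SYM", "I-SYM", "O", "O"]
print(decode_bio(chars, tags))  # [('腹痛', 'SYM')]
```

A model that mislabels a single boundary character (e.g. tagging 三 as `I-SYM`) changes the extracted span entirely, which is why boundary errors dominate the error analysis later in this record.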
Keywords: Chinese electronic medical records; Named entity recognition; Part of speech
Year: 2019 PMID: 30961622 PMCID: PMC6454585 DOI: 10.1186/s12911-019-0762-7
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1: Example for reduced POS tagging
Fig. 2: The data preprocessing flowchart
Fig. 3: The structure of the SM-LSTM-CRF model
The number of entities and the average number of characters per entity
| | Anatomical Part | Symptom Description | Independent Symptoms | Drug | Operation |
|---|---|---|---|---|---|
| Train Set | 7838 / 2.5 | 2066 / 1.5 | 3055 / 2.5 | 1005 / 3.4 | 1116 / 7.9 |
| Test Set | 6339 / 2.7 | 918 / 1.6 | 1327 / 2.8 | 813 / 3.4 | 735 / 7.3 |
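Each cell above reads "entity count / average characters per entity". As a quick check of the length spread the abstract mentions, the overall mean entity length on the train set can be computed as a count-weighted average (plain arithmetic on the table, nothing model-specific):

```python
# Count-weighted mean entity length over the five train-set categories;
# values (count, avg characters per entity) are taken from the table above.
train = {
    "Anatomical Part":      (7838, 2.5),
    "Symptom Description":  (2066, 1.5),
    "Independent Symptoms": (3055, 2.5),
    "Drug":                 (1005, 3.4),
    "Operation":            (1116, 7.9),
}
total = sum(n for n, _ in train.values())
mean_len = sum(n * l for n, l in train.values()) / total
print(total, round(mean_len, 2))  # 15080 entities, mean ~2.82 characters
```

The overall mean sits near 2.8 characters, while Operation entities average 7.9, which illustrates the multi-length problem the model has to handle.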
Hyper-parameters
| Parameter | Value |
|---|---|
| Character embedding size | 50 |
| POS embedding size | 50 |
| Initial learning rate | 0.001 |
| Batch size | 32 |
| Maximum training epochs | 20 |
| Size of LSTM hidden units | 200 |
| Optimizer | Adam |
Performance of Self-matching attention mechanism
| | P | R | F |
|---|---|---|---|
| LSTM-CRF | 0.6568 | 0.6904 | 0.6732 |
| SM-LSTM-CRF | 0.6858 | 0.6991 | 0.6991 |
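The self-matching (SM) layer lets every character attend over the whole sentence before LSTM-CRF decoding, so boundary decisions can use sentence-wide context. The paper's exact formulation is not reproduced here; a generic dot-product self-attention over character vectors, in pure Python, conveys the idea (an illustrative sketch only — the paper's score function may differ):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_matching(H):
    """For each position i, mix all positions j weighted by the dot-product
    similarity H[i]·H[j] (generic self-attention; the SM layer in the paper
    may use a different, learned score function)."""
    out = []
    for hi in H:
        scores = [sum(a * b for a, b in zip(hi, hj)) for hj in H]
        weights = softmax(scores)
        ctx = [sum(w * hj[d] for w, hj in zip(weights, H))
               for d in range(len(hi))]
        out.append(ctx)
    return out

# Toy "sentence" of 3 character vectors in 2 dimensions
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
C = self_matching(H)  # each row of C is a convex combination of rows of H
```

Because the attention weights sum to one, each output vector stays inside the convex hull of the input character vectors; the gain reported above comes from this sentence-level mixing on top of the LSTM-CRF.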
Performance after adding POS information
| | LSTM-CRF (Character) | LSTM-CRF (Character + POS) | SM-LSTM-CRF (Character) | SM-LSTM-CRF (Character + POS) |
|---|---|---|---|---|
| P | 0.6568 | 0.7606 | 0.6858 | 0.7819 |
| R | 0.6904 | 0.7487 | 0.6991 | 0.7713 |
| F | 0.6732 | 0.7546 | 0.6991 | 0.7765 |
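F here is the standard F1 score, the harmonic mean of precision and recall, F = 2PR/(P+R). A one-line check reproduces, for example, the LSTM-CRF figures in the table:

```python
def f1(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r)

print(round(f1(0.6568, 0.6904), 4))  # 0.6732 (LSTM-CRF, Character)
print(round(f1(0.7606, 0.7487), 4))  # 0.7546 (LSTM-CRF, Character + POS)
```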
Comparison of the different POS tagging methods
| | LSTM-CRF (Initial POS tagging) | LSTM-CRF (Reduced POS tagging) | SM-LSTM-CRF (Initial POS tagging) | SM-LSTM-CRF (Reduced POS tagging) |
|---|---|---|---|---|
| P | 0.7606 | 0.7959 | 0.7819 | 0.8054 |
| R | 0.7487 | 0.7829 | 0.7713 | 0.7961 |
| F | 0.7546 | 0.7894 | 0.7765 | 0.8007 |
Performance comparison of different algorithms
| | P | R | F1 |
|---|---|---|---|
| N-gram-Based RNN-CRF | 0.5254 | 0.4056 | 0.4578 |
| Character-Based LSTM-CRF | 0.6568 | 0.6904 | 0.6732 |
| Attention-Based CNN-LSTM-CRF | 0.7224 | 0.7248 | 0.7236 |
| SM-LSTM-CRF | 0.8054 | 0.7961 | 0.8007 |
Fig. 4: Performance comparison of different models on the five types of entities
The distribution of entities in the incorrect results
| | Correct boundary, Wrong category | Wrong boundary, Correct category | Wrong boundary, Wrong category |
|---|---|---|---|
| LSTM-CRF | 67 | 1863 | 987 |
| SM-LSTM-CRF | 60 | 1357 | 446 |
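The table shows that boundary errors, not category confusions, dominate. Totalling the two wrong-boundary columns quantifies the improvement (simple arithmetic on the counts above):

```python
# Wrong-boundary errors (correct-category + wrong-category columns)
# per model, from the error-distribution table above.
lstm_crf = 1863 + 987   # 2850 wrong-boundary errors
sm_lstm  = 1357 + 446   # 1803 wrong-boundary errors
reduction = 1 - sm_lstm / lstm_crf
print(lstm_crf, sm_lstm, round(reduction * 100, 1))  # 2850 1803 36.7
```

So the self-matching model removes roughly a third of the boundary errors, consistent with the claim that boundary detection is where the SM layer helps.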
Fig. 5: Performance comparison of different input methods on the five types of entities
The distribution of entities with various length in the correct results
| | 1–5 | 6–10 | 10–15 | > 15 |
|---|---|---|---|---|
| SM-LSTM-CRF(char) | 6886 | 667 | 52 | 4 |
| SM-LSTM-CRF(char + POS) | 7440 | 485 | 21 | 3 |
| SM-LSTM-CRF (char + Reduced POS tagging) | 7335 | 673 | 59 | 3 |
| Standard set | 9014 | 1002 | 104 | 12 |
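Dividing each row by the standard set gives the per-length recall implicit in this table, e.g. for the char + Reduced POS tagging variant (counts copied from the table above):

```python
# Correctly recognized entities per length bucket for
# SM-LSTM-CRF (char + Reduced POS tagging), vs. the standard set.
correct  = {"1-5": 7335, "6-10": 673, "10-15": 59, ">15": 3}
standard = {"1-5": 9014, "6-10": 1002, "10-15": 104, ">15": 12}
recall = {k: round(correct[k] / standard[k], 3) for k in standard}
print(recall)  # {'1-5': 0.814, '6-10': 0.672, '10-15': 0.567, '>15': 0.25}
```

Recall drops steadily with entity length, which matches the abstract's point that long, multi-character entities (e.g. operation names averaging 7.9 characters) remain the hard case.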