Lejun Gong1,2, Zhifei Zhang1, Shiqi Chen1.
Abstract
Background: Clinical named entity recognition is a basic task in mining electronic medical record text. Chinese electronic medical records pose particular challenges: the text contains many compound entities, sentence components are frequently missing, and entity boundaries are unclear. Moreover, corpora of Chinese electronic medical records are difficult to obtain.
Methods: To address these characteristics of Chinese electronic medical records, this study proposes a Chinese clinical entity recognition model based on deep learning pretraining. The model uses word embeddings trained on a domain corpus and fine-tunes an entity recognition model pretrained on a related corpus. BiLSTM and Transformer are each used as feature extractors to identify four types of clinical entities (diseases, symptoms, drugs, and operations) in the text of Chinese electronic medical records.
Results: The model achieved 75.06% Macro-P, 76.40% Macro-R, and 75.72% Macro-F1 on the test dataset.
Conclusions: These experiments show that the proposed Chinese clinical entity recognition model based on deep learning pretraining can effectively improve recognition performance.
Year: 2020 PMID: 33299537 PMCID: PMC7707942 DOI: 10.1155/2020/8829219
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1: Pipeline of deep learning pretraining.
Labeling rules.
| Entity types | Definition | Medical entities |
|---|---|---|
| Disease | Diagnoses made by doctors for patients, and entities ending with “病” or “症”, are collectively referred to as diseases. | 肺内隔离症 |
| Symptoms | Symptoms of discomfort, abnormalities, normal or abnormal examination results, or an unhealthy state of the patient, as well as the patient's self-reported history. | 声音嘶哑、无结核病史 |
| Drug | The specific drug name or class of drug given to the patient during treatment. | 地塞米松、抗生素 |
| Operation | This includes examinations and treatments. An examination is a test item given to a patient in order to discover, rule out, confirm, or find out more about a disease. Treatment refers to the procedures and interventions applied to patients to cure a disease or relieve symptoms. | 拍胸片、抗感染、胸腔穿刺术 |
Distribution of entities among the training set and the test set.
| Data | Disease | Symptoms | Drug | Operation | The total number of entities |
|---|---|---|---|---|---|
| Training set | 701 | 2648 | 546 | 2138 | 6033 |
| Test set | 273 | 1043 | 208 | 918 | 2442 |
Figure 2: LSTM cell.
Figure 3: BiLSTM-CRF.
Figure 4: Transformer.
data pairs. For any query element Q, the weight coefficient of each value V is obtained by calculating the similarity between Q and the corresponding key K; the weighted sum of the values V then gives the final attention value.
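The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the text only says "similarity", so the scaled dot product and the softmax normalization used here are assumptions borrowed from the standard Transformer formulation, and the function names `softmax` and `attention` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw similarity scores into weights that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Return (weights, output) for one query against (key, value) pairs.

    Assumption: similarity is the dot product of Q and K scaled by
    sqrt(d), where d is the query dimension.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the value vectors, one output component at a time.
    output = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(dim)
    ]
    return weights, output


if __name__ == "__main__":
    q = [1.0, 0.0]
    ks = [[1.0, 0.0], [0.0, 1.0]]
    vs = [[10.0, 0.0], [0.0, 10.0]]
    w, out = attention(q, ks, vs)
    print(w, out)
```

In this toy call the query is closer to the first key, so the first value receives the larger weight and dominates the output vector.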
Comparison results of different dimensions.
| Different dimensions | Macro-P (%) | Macro-R (%) | Macro-F1 (%) |
|---|---|---|---|
| Random embedding | 69.52 | 69.70 | 69.38 |
| 50 embeddings | 53.42 | 54.31 | 53.74 |
| … | … | … | … |
| 300 embeddings | 55.36 | 61.03 | 57.88 |
Comparison of different recognition models and different word embeddings.
| Models | Dataset | Macro-P (%) | Macro-R (%) | Macro-F1 (%) |
|---|---|---|---|---|
| BiLSTM-CRF + embedding | Second | 68.37 | 70.84 | 69.58 |
| … | … | … | … | … |
| Transformer-CRF + embedding | Second | 52.70 | 69.50 | 59.90 |
| Transformer-CRF + EMR embedding | First | 52.70 | 72.10 | 60.70 |
Comparisons between pretraining and not pretraining.
| Models | Macro-P (%) | Macro-R (%) | Macro-F1 (%) |
|---|---|---|---|
| BioModel | 72.48 | 72.54 | 72.51 |
| … | … | … | … |
Performances of BioModel-fine.
| Types | P (%) | R (%) | F1 (%) |
|---|---|---|---|
| Disease | 77.07 | 75.09 | 76.07 |
| Drug | 70.81 | 71.15 | 70.98 |
| Operation | 79.28 | 80.56 | 79.91 |
| Symptom | 71.74 | 74.12 | 72.91 |
| … | … | … | … |
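The per-type scores reported for BioModel-fine can be sanity-checked with a short script. The sketch below assumes the three numeric columns are precision, recall, and F1 in that order, and that F1 is the harmonic mean of precision and recall; the names `f1_score`, `REPORTED`, and `check_reported` are hypothetical helpers, not part of the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-type (precision, recall, F1) values copied from the table above.
REPORTED = {
    "Disease": (77.07, 75.09, 76.07),
    "Drug": (70.81, 71.15, 70.98),
    "Operation": (79.28, 80.56, 79.91),
    "Symptom": (71.74, 74.12, 72.91),
}


def check_reported(rows, tolerance=0.01):
    """Map each entity type to whether its reported F1 matches 2PR/(P+R)."""
    return {
        name: abs(f1_score(p, r) - f1) <= tolerance
        for name, (p, r, f1) in rows.items()
    }


if __name__ == "__main__":
    for name, ok in check_reported(REPORTED).items():
        p, r, _ = REPORTED[name]
        print(f"{name}: recomputed F1 = {f1_score(p, r):.2f}, consistent = {ok}")
```

Under these assumptions, each recomputed F1 agrees with the reported value to within rounding, which supports reading the columns as precision, recall, and F1.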
Figure 5: Two types of F1 comparison on entity.
Five examples of long entities identified by the BioModel-fine model.
| No | Identified medical entities |
|---|---|
| 1 | 双侧腋下扪及黄豆大小淋巴结 (symptom) |
| 2 | 右肺中叶大片密度增高阴影 (symptom) |
| 3 | 两肺纹理间可见边界不清的粟粒样微小淡结节影 (symptom) |
| 4 | 急性心肌梗塞 (disease) |
| 5 | 结核 PCR 扩增实验 (operation) |