Lejun Gong1,2, Zhifei Zhang1, Shiqi Chen1.
Abstract
Background: Clinical named entity recognition is a basic task in mining the text of electronic medical records. It faces several challenges: the text of Chinese electronic medical records contains many compound entities, frequently omits sentence components, and has unclear entity boundaries. Moreover, corpora of Chinese electronic medical records are difficult to obtain.
Year: 2020 PMID: 33299537 PMCID: PMC7707942 DOI: 10.1155/2020/8829219
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
Figure 1. Pipeline of deep learning pretraining.
Labeling rules.
| Entity types | Definition | Medical entities |
|---|---|---|
| Disease | Diagnoses made by doctors for patients, as well as entities ending with “病” or “症”, are collectively referred to as diseases. | 肺内隔离症 |
| Symptoms | Symptoms of discomfort, abnormalities, normal or abnormal examination results, or an unhealthy state of the patient, as well as the patient's self-reported history. | 声音嘶哑、无结核病史 |
| Drug | The specific drug name or class of drug given to the patient during treatment. | 地塞米松、抗生素 |
| Operation | Includes examination items and treatments. An examination item is given to a patient in order to discover, rule out, confirm, or learn more about a disease. Treatment refers to the procedures and interventions applied to patients to cure the disease or relieve symptoms. | 拍胸片、抗感染、胸腔穿刺术 |
Distribution of entities among the training set and the test set.
| Data | Disease | Symptoms | Drug | Operation | The total number of entities |
|---|---|---|---|---|---|
| Training set | 701 | 2648 | 546 | 2138 | 6033 |
| Test set | 273 | 1043 | 208 | 918 | 2442 |
Figure 2. LSTM cell.
Figure 3. BiLSTM-CRF.
Figure 4. Transformer.
data pairs. For any query element Q, the weight coefficient of each value V is obtained by calculating the similarity between Q and the key K that corresponds to that V. The final attention value is then the weighted sum of the values V.
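The computation described above can be sketched for a single query as follows. This is a minimal illustration, not the paper's implementation: the text only says "similarity", so the scaled dot product and the softmax normalization used here are assumptions (they are the usual choices in Transformer models), and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Return (weights, attention value) for one query Q over <Key, Value> pairs.

    Assumption: similarity(Q, K) is the dot product scaled by sqrt(dimension).
    """
    scale = math.sqrt(len(query))
    # Similarity between Q and every K.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Normalize similarities into weight coefficients that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the values V.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Toy data (hypothetical): three <Key, Value> pairs in two dimensions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attention([1.0, 0.0], keys, values)
print(weights, output)
```

Keys that are more similar to the query receive larger weights, so their values dominate the weighted sum.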
Comparison of results for different embedding dimensions.
| Different dimensions | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|
| Random embedding | 69.52 | 69.70 | 69.38 |
| 50 embeddings | 53.42 | 54.31 | 53.74 |
| | | | |
| 300 embeddings | 55.36 | 61.03 | 57.88 |
Comparison of different recognition models and different word embeddings.
| Models | Dataset | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|---|
| BiLSTM-CRF + embedding | Second | 68.37 | 70.84 | 69.58 |
| | | | | |
| Transformer-CRF + embedding | Second | 52.70 | 69.50 | 59.90 |
| Transformer-CRF + EMR embedding | First | 52.70 | 72.10 | 60.70 |
Comparison of models with and without pretraining.
| Models | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|
| BioModel | 72.48 | 72.54 | 72.51 |
| | | | |
Performance of BioModel-fine.
| Types | P | R | F1 |
|---|---|---|---|
| Disease | 77.07 | 75.09 | 76.07 |
| Drug | 70.81 | 71.15 | 70.98 |
| Operation | 79.28 | 80.56 | 79.91 |
| Symptom | 71.74 | 74.12 | 72.91 |
| | | | |
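The three numeric columns of the table above are consistent with precision, recall, and F1, where F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the per-type precision and recall values in the table. The `macro_average` helper is an assumption about how a macro score is formed (an unweighted mean over entity types); the source does not spell out its averaging procedure, and all function names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(scores):
    """Unweighted mean over entity types (assumed definition of a macro score)."""
    return sum(scores) / len(scores)


# (precision, recall, reported F1) copied from the BioModel-fine table.
per_type = {
    "Disease": (77.07, 75.09, 76.07),
    "Drug": (70.81, 71.15, 70.98),
    "Operation": (79.28, 80.56, 79.91),
    "Symptom": (71.74, 74.12, 72.91),
}

for name, (precision, recall, reported) in per_type.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported:.2f}")

# Macro score under the assumed definition, printed without a claimed value.
print(macro_average([f1_score(p, r) for p, r, _ in per_type.values()]))
```

Each recomputed F1 agrees with the reported value to within rounding, which supports reading the columns in the order precision, recall, F1.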
Figure 5. Two types of F1 comparison on entity.
Five examples of long entities identified by the BioModel-fine model.
| No | Identified medical entities |
|---|---|
| 1 | 双侧腋下扪及黄豆大小淋巴结 (symptom) |
| 2 | 右肺中叶大片密度增高阴影 (symptom) |
| 3 | 两肺纹理间可见边界不清的粟粒样微小淡结节影 (symptom) |
| 4 | 急性心肌梗塞 (disease) |
| 5 | 结核 PCR 扩增实验 (operation) |