Luqi Li, Jie Zhao, Li Hou, Yunkai Zhai, Jinming Shi, Fangfang Cui.
Abstract
BACKGROUND: Clinical named entity recognition (CNER) is important for medical information mining and for building high-quality knowledge graphs. Chinese electronic medical records (EMRs) differ in textual style from general natural language and contain many specialized and uncommon clinical terms, so CNER on Chinese EMRs remains difficult. It is therefore important to reduce semantic interference and to improve the model's ability to learn internal features on its own from a small training corpus.
Keywords: Attention mechanism; Chinese electronic medical records; Named entity recognition
Year: 2019 PMID: 31801540 PMCID: PMC6894110 DOI: 10.1186/s12911-019-0933-6
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Example of Chinese EMRs
| Chinese electronic medical record text | English translation |
|---|---|
| 缘于入院前20余日于我院诊为乙状结肠癌, 在全麻下行乙状结肠癌根治术, 术中见:腹腔内无明显腹水, 腹腔、盆腔、大网膜无明显转移结节, 肝脏质地大小正常, 未触及肿物, 胆囊未触及结石。 | More than 20 days before hospitalization, the patient was diagnosed with sigmoid colon cancer in our hospital, and radical resection of sigmoid colon cancer was performed under general anesthesia. Intraoperative findings: no obvious ascites in the abdominal cavity; no obvious metastatic nodules in the abdominal cavity, pelvic cavity, or omentum; liver of normal texture and size, with no mass palpated; no gallbladder stones palpated. |
Fig. 1 The processing flow of our method
Details of custom dictionaries
| Dictionary | Source | Number |
|---|---|---|
| Dic_anatomy | SNOMED CT: Body structure; Sogou: Human anatomy | 6,401 |
| Dic_drug | SNOMED CT: Pharmaceutical/biologic product; Sogou: Generic list of commonly prescribed drugs | 35,151 |
| Dic_operation | SNOMED CT: Procedure; Sogou: Surgical classification code (ICD-9-CM3) | 5,840 |
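The record does not say how the custom dictionaries are turned into model features. As a minimal sketch under that caveat, one common approach is forward maximum matching, which labels each character with the name of the dictionary whose longest entry covers it. The function name `dictionary_features`, the `max_len` parameter, and the toy entries are all hypothetical; the toy terms are borrowed from the example EMR text and are not claimed to be actual dictionary content.

```python
def dictionary_features(text, dictionaries, max_len=10):
    """Label each character with the dictionary covering it (forward maximum matching).

    Characters not covered by any dictionary entry are labelled "O".
    This is an illustrative assumption, not the authors' documented procedure.
    """
    features = ["O"] * len(text)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            for name, entries in dictionaries.items():
                if candidate in entries:
                    match = (name, length)
                    break
            if match:
                break
        if match:
            name, length = match
            for j in range(i, i + length):
                features[j] = name
            i += length  # skip past the matched term
        else:
            i += 1
    return features


# Hypothetical toy dictionaries; the real ones hold thousands of entries.
toy_dictionaries = {
    "Dic_anatomy": {"肝脏", "胆囊"},
    "Dic_operation": {"乙状结肠癌根治术"},
}

for ch, feat in zip("行乙状结肠癌根治术", dictionary_features("行乙状结肠癌根治术", toy_dictionaries)):
    print(ch, feat)
```

Matching the longest entry first keeps a long procedure name from being split into shorter anatomy terms it happens to contain.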
Fig. 2 The structure of BiLSTM-Att-CRF model
Distribution of entities in two datasets
| Entity type (CCKS 2018) | Training set (600) | Test set (400) | Entity type (CCKS 2017) | Training set (300) | Test set (100) |
|---|---|---|---|---|---|
| Anatomical Part | 7838 (52%) | 6339 (63%) | Body Part | 10,719 (36%) | 3021 (32%) |
| Symptom Description | 2066 (14%) | 918 (9%) | Symptom | 7831 (26%) | 2311 (24%) |
| Independent Symptom | 3055 (20%) | 1327 (13%) | Diagnosis | 722 (2%) | 553 (6%) |
| Drug | 1005 (7%) | 813 (8%) | Test | 9546 (32%) | 3143 (33%) |
| Operation | 1116 (7%) | 735 (7%) | Treatment | 1048 (4%) | 465 (5%) |
Example of the manually annotated records
Clinical text: 患者1个月前无明显诱因出现上腹部不适 … (One month ago, the patient developed upper abdominal discomfort without obvious cause …)
| Entity | Start Position | End Position | Entity Type |
|---|---|---|---|
| 上腹部 (upper abdomen) | 13 | 15 | Anatomy |
| 不适 (discomfort) | 16 | 17 | Symptom |
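The positions in this annotation are consistent with 0-based, inclusive character offsets: counting from 0, characters 13 to 15 of the clinical text are 上腹部 and characters 16 to 17 are 不适. The sketch below converts such spans into per-character tags. The BIO tagging scheme and the function name `spans_to_bio` are assumptions for illustration; the record does not state which tagging scheme the authors used.

```python
def spans_to_bio(text, spans):
    """Convert (start, end, entity_type) spans into per-character BIO tags.

    Offsets are assumed to be 0-based and inclusive at both ends,
    which matches the annotated example above.
    """
    tags = ["O"] * len(text)
    for start, end, entity_type in spans:
        tags[start] = f"B-{entity_type}"
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{entity_type}"
    return tags


text = "患者1个月前无明显诱因出现上腹部不适"
spans = [(13, 15, "Anatomy"), (16, 17, "Symptom")]
tags = spans_to_bio(text, spans)

for ch, tag in zip(text, tags):
    print(ch, tag)
```

Because the end offset is inclusive, the Python slice for an entity is `text[start:end + 1]`.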
Hyper-parameters of deep learning models
| Parameter | Value |
|---|---|
| Character embedding size | 200 |
| Additional features embedding size | 100 |
| Maximum training Epoch | 30 |
| Batch size | 32 |
| Time steps | 150 |
| Learning rate | 0.001 |
| Size of LSTM hidden units | 300 |
| Dropout rate | 0.2 |
| Optimizer | Adam |
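For reproducibility, the hyper-parameters above can be collected in a single immutable configuration object. This is only a sketch: the class name `TrainingConfig` and its field names are hypothetical, while the values are copied from the table.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainingConfig:
    """Hyper-parameters from the table; field names are illustrative."""
    char_embedding_size: int = 200
    feature_embedding_size: int = 100  # "Additional features embedding size"
    max_epochs: int = 30
    batch_size: int = 32
    time_steps: int = 150
    learning_rate: float = 0.001
    lstm_hidden_units: int = 300
    dropout_rate: float = 0.2
    optimizer: str = "Adam"


config = TrainingConfig()
print(asdict(config))
```

Freezing the dataclass guards against accidental changes to a setting partway through an experiment.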
Performance comparison of BiLSTM-Att-CRF model and basic models
| Model | CCKS 2018 Precision | CCKS 2018 Recall | CCKS 2018 F-score | CCKS 2017 Precision | CCKS 2017 Recall | CCKS 2017 F-score |
|---|---|---|---|---|---|---|
| CRF | 85.63% | 79.58% | 82.49% | 87.32% | 83.06% | 85.14% |
| CRF + POS | 85.94% | 81.04% | 83.42% | 88.97% | 85.12% | 87.01% |
| CRF + Dic |  | 82.82% | 85.79% | 90.34% | 86.06% | 88.15% |
| CRF + POS + Dic | 88.72% | 83.57% | 86.04% |  | 87.11% | 89.14% |
| BiLSTM-CRF | 85.20% | 83.09% | 84.13% | 90.09% | 89.24% | 89.66% |
| BiLSTM-CRF + POS | 85.23% | 82.73% | 83.96% | 89.36% | 89.74% | 89.55% |
| BiLSTM-CRF + Dic | 85.62% | 83.34% | 84.46% | 90.17% | 90.44% | 90.30% |
| BiLSTM-CRF + POS + Dic | 86.02% | 82.93% | 84.45% | 90.31% | 90.12% | 90.22% |
| BiLSTM-Att-CRF | 86.51% | 84.38% | 85.43% | 90.11% | 90.47% | 90.29% |
| BiLSTM-Att-CRF + POS | 86.62% | 84.36% | 85.48% | 90.33% | 89.88% | 90.10% |
| BiLSTM-Att-CRF + Dic | 86.97% | 84.79% | 85.87% | 89.87% |  | 90.31% |
| BiLSTM-Att-CRF + POS + Dic | 87.09% |  |  | 90.41% | 90.49% |  |
The bold values denote the highest values
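The record does not define the F-score, but the reported figures are consistent with the balanced F-score (F1), the harmonic mean of precision and recall. The sketch below, with the hypothetical helper name `f_score`, checks this against two complete rows of the table.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# BiLSTM-CRF on CCKS 2018: precision 85.20%, recall 83.09%, reported F-score 84.13%
print(round(f_score(85.20, 83.09), 2))  # 84.13

# BiLSTM-Att-CRF on CCKS 2018: precision 86.51%, recall 84.38%, reported F-score 85.43%
print(round(f_score(86.51, 84.38), 2))  # 85.43
```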
Fig. 3 Recognition performance of five entity types in CCKS 2018 dataset
Recall of BiLSTM-Att-CRF model and other basic models
| Entity type | CRF | BiLSTM-CRF | BiLSTM-Att-CRF |
|---|---|---|---|
| Anatomical Part | 80.88% | 83.55% | |
| Symptom Description | 86.06% | 84.22% | |
| Independent Symptom | 83.33% | 88.78% | |
| Drug | 69.25% | 71.74% | |
| Operation | 64.90% | 80.05% |
The bold values denote the highest values
Example of recognition performance of BiLSTM-Att-CRF model and other basic models
| Clinical text | CRF | BiLSTM-CRF | BiLSTM-Att-CRF |
|---|---|---|---|
| 行乙状结肠癌根治术 (undergo radical resection of sigmoid colon cancer) | OP: 乙状结肠癌根治术* (radical resection of sigmoid colon cancer) | OP: 乙状结肠癌根治术* (radical resection of sigmoid colon cancer) | OP: 乙状结肠癌根治术* (radical resection of sigmoid colon cancer) |
| 行局麻下区域淋巴结切除术 (undergo regional lymphadenectomy under local anesthesia) | OP: 淋巴结切除术 (lymphadenectomy) | OP: 局麻下区域淋巴结切除术 (regional lymphadenectomy under local anesthesia) | OP: 区域淋巴结切除术* (regional lymphadenectomy) |
| 行胰腺肿瘤射频消融+无水酒精注射+胆总管十二指肠吻合+胃空肠吻合+胆囊切除术 (undergo radiofrequency ablation of pancreatic tumors + ethanol injection + choledochoduodenostomy + gastrojejunostomy + cholecystectomy) | AP: 胰腺 (pancreas); Drug: 酒精 (ethanol); AP: 胆总管十二指肠 (common bile duct and duodenum); OP: 胆囊切除术* (cholecystectomy) | OP: 胰腺肿瘤射频消融* (radiofrequency ablation of pancreatic tumors); AP: 胃空肠 (gastrojejunal); OP: 胆囊切除术* (cholecystectomy) | OP: 胰腺肿瘤射频消融* (radiofrequency ablation of pancreatic tumors); OP: 胆总管十二指肠吻合* (choledochoduodenostomy); OP: 无水酒精注射* (ethanol injection); OP: 胃空肠吻合* (gastrojejunostomy); OP: 胆囊切除术* (cholecystectomy) |
| 术后行2周期紫衫+DDP化疗 (2 cycles of taxol + DDP chemotherapy after the operation) | __ | OP: DDP化疗 (DDP chemotherapy) | Drug: 紫衫* (taxol); Drug: DDP* |
An asterisk (*) denotes a correct recognition. OP: Operation; AP: Anatomical Part.
Fig. 4 Recognition performance of different attention widths