Yu Zhang, Xuwen Wang, Zhen Hou, Jiao Li.
Abstract
BACKGROUND: Electronic health records (EHRs) are important data resources for clinical studies and applications. Physicians or clinicians describe patients' disorders or treatment procedures in EHRs using free-text (unstructured) clinical notes. The narrative information plays an important role in patient treatment and clinical research. However, it is challenging to make machines understand the clinical narratives.
Keywords: bidirectional LSTM-CRF; clinical named entity recognition; diagnosis; electronic health records; human body; machine learning; physical examination; syndrome; treatment
Year: 2018 PMID: 30559093 PMCID: PMC6315256 DOI: 10.2196/medinform.9965
Source DB: PubMed Journal: JMIR Med Inform
An example of the manually annotated gold standard.
| Entity | pos_b | pos_e | Entity type |
| 咳嗽 (cough) | 21 | 22 | Symptom |
| 发热 (fever) | 24 | 25 | Symptom |
| 查体 (examination) | 32 | 33 | Test |
| 咽部 (throat) | 35 | 36 | Body part |
| 充血 (congestion) | 38 | 39 | Symptom |
| 双扁桃体 (double tonsils) | 41 | 44 | Body part |
| 肿大 (swollen) | 46 | 47 | Symptom |
| 双肺 (lung) | 49 | 50 | Body part |
| 呼吸音 (breath sound) | 51 | 53 | Test |
| 胸片 (chest x-ray) | 67 | 68 | Test |
| 支气管肺炎 (bronchopneumonia) | 74 | 78 | Diagnosis |
| 头孢哌酮 (cefoperazone) | 84 | 87 | Treatment |
| 炎琥宁 (andrographolide) | 89 | 91 | Treatment |
| 布地奈德 (budesonide) | 102 | 105 | Treatment |
| 沙丁胺醇 (salbutamol) | 107 | 110 | Treatment |
| 气道 (airway) | 113 | 114 | Body part |
pos_b: start position.
pos_e: end position.
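Sequence labelers are usually trained on one tag per character rather than on span tables like the one above. The sketch below converts (pos_b, pos_e, entity type) rows into BIO tags. It assumes that positions are inclusive (consistent with 咳嗽 spanning 21 to 22) and that they count from 1, which the source does not state; the function name and the `offset` parameter are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (pos_b, pos_e, entity type), both ends inclusive


def spans_to_bio(text_len: int, spans: List[Span], offset: int = 1) -> List[str]:
    """Turn inclusive character spans into one BIO tag per character.

    offset=1 assumes positions count from 1 (an assumption, not stated in
    the source); pass offset=0 if they count from 0.
    """
    tags = ["O"] * text_len
    for pos_b, pos_e, etype in spans:
        start, end = pos_b - offset, pos_e - offset
        if not (0 <= start <= end < text_len):
            raise ValueError(f"span {(pos_b, pos_e)} is outside the text")
        tags[start] = f"B-{etype}"          # first character of the entity
        for i in range(start + 1, end + 1):
            tags[i] = f"I-{etype}"          # remaining characters
    return tags


# Toy five-character sentence (not from the corpus): 咳嗽、发热
tags = spans_to_bio(5, [(1, 2, "Symptom"), (4, 5, "Symptom")])
print(tags)
# ['B-Symptom', 'I-Symptom', 'O', 'B-Symptom', 'I-Symptom']
```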
Distribution of entities among the training set and the test set.
| Dataset | Number of patients | Body part | Diagnosis | Symptom | Test | Treatment | Total |
| Training set | 300 | 10,719 | 722 | 7831 | 9546 | 1048 | 29,866 |
| Test set | 100 | 3021 | 553 | 2311 | 3143 | 465 | 9493 |
| All | 400 | 13,740 | 1275 | 10,142 | 12,689 | 1513 | 39,359 |
Figure 1. A long short-term memory unit. i_t: input gate; f_t: forget gate; c_t: memory cell; o_t: output gate; h_t: output vector of the LSTM.
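For reference, the gates named in this caption are commonly related by the standard LSTM update below. This is the textbook formulation, not necessarily the exact variant used in the paper; the input x_t, the weight matrices W and U, the biases b, and the candidate cell \tilde{c}_t are notation assumed here.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```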
Figure 2. Architecture of the bidirectional long short-term memory-conditional random fields. LSTM: long short-term memory; CRF: conditional random fields; B-dis: B-diagnosis; I-dis: I-diagnosis.
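In an architecture of this kind, the CRF layer typically selects the best tag sequence by combining per-character scores from the bidirectional LSTM with tag-to-tag transition scores. The following is a minimal Viterbi decoding sketch under that assumption; the score values are made up for illustration and the function is not taken from the paper.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][tag]: score of `tag` at position t (e.g. from a BiLSTM).
    transitions[prev][tag]: score of moving from `prev` to `tag`;
    missing pairs default to 0.0.
    """
    if not emissions:
        return []
    tags = list(emissions[0])
    score = dict(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for tag in tags:
            def total(prev: str) -> float:
                return score[prev] + transitions.get(prev, {}).get(tag, 0.0)
            best_prev = max(tags, key=total)
            new_score[tag] = total(best_prev) + emit[tag]
            pointers[tag] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Illustrative scores: taken alone, the emissions prefer "O" then "I-dis",
# but the transition penalty on O -> I-dis makes "B-dis", "I-dis" win.
emissions = [
    {"O": 1.0, "B-dis": 0.8, "I-dis": 0.0},
    {"O": 0.0, "B-dis": 0.1, "I-dis": 2.0},
]
transitions = {"O": {"I-dis": -10.0}}
decoded = viterbi(emissions, transitions)
print(decoded)  # ['B-dis', 'I-dis']
```

The transition scores are what let the model rule out sequences such as an I-dis tag that does not follow a B-dis or I-dis tag.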
Overall performance of the bidirectional long short-term memory-conditional random fields model, conditional random fields–based models with different feature combinations, and the dictionary-based model.
| Model | Precision | Recall | F1 score |
| Dictionary-based model | 0.5215 | 0.6855 | 0.5924 |
| CRF model + BOC | 0.8792 | 0.8316 | 0.8547 |
| CRF model + BOC + POS tags | 0.9065 | 0.8529 | 0.8789 |
| CRF model + BOC + POS tags + CT | 0.9144 | 0.8658 | 0.8895 |
| CRF model + BOC + POS tags + CT + POCIS | 0.9203 | 0.8709 | 0.8949 |
| Bidirectional LSTM-CRF model | 0.9112 | 0.8974 | 0.9043 |
CRF: conditional random fields.
BOC: bag-of-characters.
POS: part-of-speech.
CT: character types.
POCIS: position of the character in the sentence.
LSTM-CRF: long short-term memory-conditional random fields.
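The four CRF feature groups listed above (BOC, POS tags, CT, POCIS) can be illustrated with a per-character feature function. This is a sketch only: the window size, the character-type categories, the feature names, and the use of the raw index as the position feature are all assumptions, and POS tags are taken as an optional input because the source does not say how they were produced.

```python
import unicodedata
from typing import Dict, List, Optional


def char_type(ch: str) -> str:
    """Coarse character type (CT); the categories are illustrative assumptions."""
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if unicodedata.category(ch).startswith("P"):
        return "punct"
    if "\u4e00" <= ch <= "\u9fff":
        return "cjk"
    return "other"


def char_features(sentence: str, i: int,
                  pos_tags: Optional[List[str]] = None,
                  window: int = 2) -> Dict[str, str]:
    """Build a feature dict for the character at index i."""
    feats: Dict[str, str] = {}
    # BOC: the characters in a +/- `window` context around position i.
    for d in range(-window, window + 1):
        j = i + d
        feats[f"char[{d:+d}]"] = sentence[j] if 0 <= j < len(sentence) else "<PAD>"
    # CT: type of the current character.
    feats["type"] = char_type(sentence[i])
    # POCIS: position of the character in the sentence.
    feats["position"] = str(i)
    # POS: tag of the current character, if a tagger supplied one.
    if pos_tags is not None:
        feats["pos"] = pos_tags[i]
    return feats


# Toy sentence (not from the corpus): 发热3天。
feats = char_features("发热3天。", 2)
print(feats)
```

Dicts of this shape are the usual input format for CRF toolkits, which is why each feature is encoded as a named string value.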
Detailed performance of the bidirectional long short-term memory-conditional random fields–based, conditional random fields–based, and dictionary-based clinical named entity recognition approaches.
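| Entity type | Bidirectional LSTM-CRF: Precision | Bidirectional LSTM-CRF: Recall | Bidirectional LSTM-CRF: F1 score | CRF (all features): Precision | CRF (all features): Recall | CRF (all features): F1 score | Dictionary-based: Precision | Dictionary-based: Recall | Dictionary-based: F1 score |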
| Body part | 0.8873 | 0.8444 | 0.8653 | 0.8909 | 0.8186 | 0.8532 | 0.6081 | 0.6452 | 0.6261 |
| Diagnosis | 0.8086 | 0.7486 | 0.7775 | 0.8148 | 0.6763 | 0.7391 | 0.3545 | 0.6058 | 0.4473 |
| Symptom | 0.9584 | 0.9675 | 0.9630 | 0.9715 | 0.9580 | 0.9647 | 0.7591 | 0.7594 | 0.7592 |
| Test | 0.9314 | 0.9510 | 0.9411 | 0.9459 | 0.9233 | 0.9345 | 0.7093 | 0.6949 | 0.7020 |
| Treatment | 0.7833 | 0.7075 | 0.7435 | 0.7581 | 0.6538 | 0.7021 | 0.2240 | 0.6108 | 0.3278 |
| Total | 0.9112 | 0.8974 | 0.9043 | 0.9203 | 0.8709 | 0.8949 | 0.5215 | 0.6855 | 0.5924 |
LSTM-CRF: long short-term memory-conditional random fields.
CRF: conditional random fields.
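Precision, recall, and F1 figures of this kind are usually computed at the entity level. The sketch below assumes strict matching, in which a prediction counts as correct only when its start, end, and type all equal a gold entity; the source does not state the matching criterion, and the predicted spans in the example are invented.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (pos_b, pos_e, entity type)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(gold: Set[Entity], pred: Set[Entity]):
    """Strict entity-level scores: exact span and type match required."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall, f1_from(precision, recall)


# Gold entities taken from the annotation example; predictions are made up.
gold = {(21, 22, "Symptom"), (24, 25, "Symptom"), (74, 78, "Diagnosis")}
pred = {(21, 22, "Symptom"), (74, 76, "Diagnosis")}
p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, f1)
```

Applying `f1_from` to a precision and recall pair from the tables reproduces the reported F1 score up to rounding of the published inputs.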
Figure 3. Comparison of F1 scores between the dictionary-based approach and the machine learning–based approaches among 5 entity types. LSTM-CRF: long short-term memory-conditional random fields; CRF: conditional random fields.
Distribution of different types of errors in the results of the conditional random fields model based on all the 4 types of features (N=1386).
| GT-P (N=143) | P-GT (N=604) | INTERSECT (GT vs P; N=639) |
| 尿蛋白- (urinary protein-) | 2型糖尿病 (type 2 diabetes) | 右侧丘脑腔隙性脑梗死 versus 右侧丘脑 + 腔隙性脑梗死 (right thalamic lacunar infarction vs right thalamic+lacunar infarction) |
| 低血糖 (hypoglycemia) | 冠心病 (coronary disease) | 胃肠 versus 急性胃肠炎 (stomach and intestine vs acute gastroenteritis) |
| 对称 (symmetry) | 胸 (chest) | 糖尿病肾病 versus 糖尿病 + 肾病 (diabetic nephropathy vs diabetes+nephropathy) |
| 瞳孔 (pupil) | 腔隙性脑梗 (lacunar clog) | 右下后 versus 右下后牙 (right lower back vs lower right posterior teeth) |
| 冠心病 (coronary disease) | 脂肪肝 (fatty liver) | 水肿 versus 脑水肿 (edema vs brain edema) |
| 肿, (swollen,) | 比重 (proportion) | 氨溴索注射液祛痰 versus 氨溴索注射液 (ambroxol injection to remove phlegm vs ambroxol injection) |
| 胃粘膜 (gastric mucosa) | 角膜 (cornea) | 皮肤、粘膜 versus 皮肤 + 黏膜 (skin、mucous membrane vs skin+mucous membrane) |
| 寒战 (chill) | 脑萎缩 (encephalatrophy) | 脂肪肝 versus 肝 (fatty liver vs liver) |
| 无力 (faintness) | 峰值 (peak value) | 尼群地平药物 versus 尼群地平 (nitrendipine drug vs nitrendipine) |
| 皮肤 (skin) | 活动障碍 (activity disorder) | 脑 versus 脑梗死 (brain vs cerebral infarction) |
GT-P: Ground-truth entities that were not identified by CRF.
P-GT: Entities recognized by CRF that are not in the ground truth.
INTERSECT: Entities for which the ground-truth span and the span predicted by CRF overlap but do not match exactly.
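The three error categories can be derived from span overlap, as in the sketch below. It is one plausible reading of the definitions: entity types are ignored, spans are inclusive (start, end) pairs, and INTERSECT is reported per overlapping gold and predicted pair. How the paper actually counted these cases is not stated, and the spans in the example are invented.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # inclusive (start, end)


def overlaps(a: Span, b: Span) -> bool:
    """True when two inclusive spans share at least one character."""
    return a[0] <= b[1] and b[0] <= a[1]


def categorize_errors(gold: Set[Span], pred: Set[Span]) -> Dict[str, List]:
    """Split non-matching spans into GT-P, P-GT, and INTERSECT."""
    exact = gold & pred              # exact matches are not errors
    gold_err = gold - exact
    pred_err = pred - exact
    intersect = [(g, p)
                 for g in sorted(gold_err)
                 for p in sorted(pred_err)
                 if overlaps(g, p)]
    matched_gold = {g for g, _ in intersect}
    matched_pred = {p for _, p in intersect}
    return {
        "GT-P": sorted(gold_err - matched_gold),    # missed entirely
        "P-GT": sorted(pred_err - matched_pred),    # spurious predictions
        "INTERSECT": intersect,                     # boundary mismatches
    }


# A 10-character gold entity predicted as two pieces (4 + 6 characters),
# one missed entity, one exact match, and one spurious prediction.
gold = {(1, 10), (20, 22), (30, 31)}
pred = {(1, 4), (5, 10), (30, 31), (40, 41)}
errors = categorize_errors(gold, pred)
print(errors)
# {'GT-P': [(20, 22)], 'P-GT': [(40, 41)],
#  'INTERSECT': [((1, 10), (1, 4)), ((1, 10), (5, 10))]}
```

The first gold span mirrors the split pattern in the table, where one long diagnosis is predicted as a body part followed by a shorter diagnosis.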