An Fang1,2, Jiahui Hu2, Wanqing Zhao2, Ming Feng3, Ji Fu3, Shanshan Feng3, Pei Lou2, Huiling Ren2, Xianlai Chen4,5.
Abstract
OBJECTIVE: Pituitary adenomas are the most common type of pituitary disorder. They usually occur in young adults and often affect the patient's physical development, capacity to work, and fertility. Clinical free text in the electronic medical records (EMRs) of patients with pituitary adenomas contains abundant diagnosis and treatment information. However, this information has not been well utilized because extracting it from unstructured clinical text is difficult. This study aims to enable machines to process clinical information intelligently and to automatically extract clinical named entities for pituitary adenomas from Chinese EMRs.
Keywords: Chinese electronic medical records; Clinical information extraction; Clinical named entity recognition; Deep learning; Pituitary adenomas
Year: 2022 PMID: 35321705 PMCID: PMC8941801 DOI: 10.1186/s12911-022-01810-z
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 An example of the original EMR texts of pituitary adenomas
Examples of the clinical named entities extracted from EMR texts of pituitary adenomas
| Entities | Start positions | End positions | Entity types |
|---|---|---|---|
| 隐匿起病 (latent onset) | 7 | 11 | Disease course |
| 慢性病程 (chronic course of disease) | 12 | 16 | Disease course |
| 向心性肥胖 (centripetal obesity) | 30 | 35 | Symptom |
| 皮肤菲薄 (thin skin) | 36 | 40 | Symptom |
| 双下肢 (lower limbs) | 72 | 75 | Body region |
| 水肿 (edema) | 75 | 77 | Symptom |
| 糖尿病病史 (medical history of diabetes) | 83 | 88 | Disease |
| 高血压病史 (medical history of hypertension) | 91 | 96 | Disease |
| 肝硬化 (liver cirrhosis) | 99 | 102 | Disease |
| 双肾结石 (double kidney stones) | 103 | 107 | Disease |
| 左肾囊肿 (left renal cyst) | 108 | 112 | Disease |
| 开腹胆囊切除术 (open cholecystectomy) | 168 | 175 | Surgery |
| 皮肤发红 (skin redness) | 314 | 318 | Symptom |
| 锁骨上脂肪垫 (supraclavicular fat pad) | 350 | 356 | Symptom |
| 水牛背 (buffalo hump) | 360 | 363 | Symptom |
| 巩膜 (sclera) | 378 | 380 | Body region |
| 黄染 (yellow) | 382 | 384 | Symptom |
| 腹膨隆 (abdominal distension) | 409 | 412 | Symptom |
| 微腺瘤 (microadenoma) | 480 | 483 | Disease |
| 双侧肾上腺增粗 (bilateral adrenal thickening) | 491 | 498 | Disease |
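The start and end positions in the table above appear to be character offsets with an exclusive end, so that `end - start` equals the number of characters in the entity. The sketch below checks that reading against a few rows copied from the table. The `record` string is a hypothetical stand-in built from padding characters, since the original EMR text is not reproduced here, and the function name is illustrative only.

```python
# Rows copied from the table above: (entity text, start, end, entity type).
ENTITIES = [
    ("隐匿起病", 7, 11, "Disease course"),
    ("向心性肥胖", 30, 35, "Symptom"),
    ("双下肢", 72, 75, "Body region"),
    ("水肿", 75, 77, "Symptom"),
    ("开腹胆囊切除术", 168, 175, "Surgery"),
]


def span_length_matches(text, start, end):
    """Return True when the span width equals the entity's character count.

    This holds only if `end` is exclusive, as in Python slicing.
    """
    return end - start == len(text)


# Hypothetical record: 72 padding characters followed by two adjacent
# entities. It is NOT the original EMR text, only a stand-in for slicing.
record = "＊" * 72 + "双下肢水肿"

if __name__ == "__main__":
    for text, start, end, label in ENTITIES:
        print(text, label, span_length_matches(text, start, end))
    # Adjacent entities share a boundary: one ends at 75, the next starts at 75.
    print(record[72:75], record[75:77])
```

Under this reading, an entity can be recovered from a record with a plain slice, `record[start:end]`, and adjacent entities such as 双下肢 and 水肿 share the boundary offset 75 without overlapping.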
Token distribution of the seven entity types across the training, validation, and test sets
| Entity | Training set | Validation set | Test set |
|---|---|---|---|
| Symptom | 10,880 | 3633 | 3655 |
| Body region | 981 | 451 | 507 |
| Disease | 3760 | 1260 | 1339 |
| Surgery | 616 | 215 | 165 |
| Medication | 742 | 205 | 197 |
| Family history | 137 | 46 | 61 |
| Disease course | 281 | 82 | 104 |
| All | 17,367 | 5892 | 6028 |
Token distribution of the seven entity types across the four kinds of EMR text
| Entity | Current medical history | Past medical history | Case characteristics | Family history |
|---|---|---|---|---|
| Symptom | 14,006 | 180 | 3982 | 0 |
| Body region | 929 | 39 | 971 | 0 |
| Disease | 1609 | 3733 | 984 | 3 |
| Surgery | 123 | 605 | 268 | 0 |
| Medication | 673 | 217 | 154 | 0 |
| Family history | 1 | 2 | 45 | 196 |
| Disease course | 0 | 0 | 467 | 0 |
| All | 17,341 | 4876 | 6871 | 199 |
Fig. 2 The processing flow of our methods
Fig. 3 Entity distribution in the domain dictionary
Fig. 4 The architecture of the BiLSTM-CRF model
Fig. 5 The architecture of the BERT-BiLSTM-CRF model
Performance comparison of all models (overall results under strict and relaxed matching)
| Model | Strict P (%) | Strict R (%) | Strict F1 (%) | Relaxed P (%) | Relaxed R (%) | Relaxed F1 (%) |
|---|---|---|---|---|---|---|
| Dictionary | 42.74 | 43.44 | 43.08 | 64.08 | 65.13 | 64.60 |
| CRF | 92.18 | 87.63 | 89.85 | 96.18 | 91.43 | 93.75 |
| + pos | 91.89 | 88.28 | 90.05 | 96.01 | 92.23 | 94.08 |
| + radical | 91.90 | 87.84 | 89.82 | 96.00 | 91.76 | 93.83 |
| + type | 92.31 | 87.82 | 90.01 | 96.52 | 91.82 | 94.11 |
| + index | 91.07 | 86.16 | 88.55 | 95.66 | 90.51 | 93.01 |
| + pos + radical | 91.90 | 87.95 | 89.88 | 96.13 | 92.01 | 94.03 |
| + pos + type | 92.32 | 88.60 | 90.42 | 96.39 | 92.50 | 94.40 |
| + pos + index | 91.08 | 87.43 | 89.22 | 95.62 | 91.79 | 93.66 |
| + radical + type | 91.98 | 87.99 | 89.94 | 96.22 | 92.04 | 94.09 |
| + radical + index | 91.12 | 87.18 | 89.10 | 95.59 | 91.45 | 93.47 |
| + type + index | 91.33 | 86.37 | 88.78 | 96.02 | 90.79 | 93.33 |
| + pos + radical + type | 92.43 | 88.49 | 90.42 | 96.51 | 92.40 | 94.41 |
| + pos + radical + index | 91.21 | 87.45 | 89.29 | 95.63 | 91.69 | 93.62 |
| + pos + type + index | 91.27 | 87.58 | 89.39 | 95.86 | 91.99 | 93.89 |
| + radical + type + index | 91.22 | 87.19 | 89.16 | 95.79 | 91.57 | 93.63 |
| + pos + radical + type + index | 91.46 | 87.75 | 89.57 | 95.86 | 91.97 | 93.88 |
| BiLSTM-CRF | 90.24 | 89.64 | 89.94 | 94.98 | 94.36 | 94.67 |
| BERT-BiLSTM-CRF | 92.00 | 90.54 | | 96.34 | 94.81 | |
Bold means best performance of all models
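The record does not define the strict and relaxed criteria. A common convention in clinical named entity recognition, assumed here, is that a strict match requires identical boundaries and entity type, while a relaxed match requires the same type and any overlap in span. The sketch below scores a toy prediction under both criteria and computes F1 as the harmonic mean of precision and recall. The function names and the toy spans are illustrative and are not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlaps(a, b):
    """True when two (start, end, type) spans share at least one character.

    End offsets are treated as exclusive.
    """
    return a[0] < b[1] and b[0] < a[1]


def is_match(gold, pred, relaxed):
    """Assumed criteria: the type must agree; boundaries must be identical
    (strict) or merely overlap (relaxed)."""
    if gold[2] != pred[2]:
        return False
    if relaxed:
        return overlaps(gold, pred)
    return gold[:2] == pred[:2]


def score(gold, pred, relaxed=False):
    """Return (precision, recall, F1) for lists of (start, end, type) spans."""
    hit_pred = sum(any(is_match(g, p, relaxed) for g in gold) for p in pred)
    hit_gold = sum(any(is_match(g, p, relaxed) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy data: one exact hit, one boundary error, one type error.
gold = [(30, 35, "Symptom"), (72, 75, "Body region"), (75, 77, "Symptom")]
pred = [(30, 35, "Symptom"), (72, 74, "Body region"), (75, 77, "Disease")]

if __name__ == "__main__":
    print("strict :", score(gold, pred, relaxed=False))
    print("relaxed:", score(gold, pred, relaxed=True))
    # Strict precision and recall of the CRF row, taken from the table above.
    print("CRF strict F1:", round(f1_score(92.18, 87.63), 2))
```

In the toy data, the boundary error counts as a miss under strict matching but as a hit under relaxed matching, which is why relaxed scores in the table are consistently higher than strict ones. Applied to the strict precision and recall of the CRF row, `f1_score` reproduces the reported 89.85 to two decimals.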
Fig. 6 The strict F1 values of the models for seven medical entity types
Fig. 7 The relaxed F1 values of the models for seven medical entity types
Fig. 8 The overlap of tokens between CRF model and BERT-BiLSTM-CRF model