Dong Xu1, Meizhuo Zhang2, Tianwan Zhao1, Chen Ge1, Weiguo Gao2, Jia Wei2, Kenny Q Zhu1.
Abstract
OBJECTIVE: This study proposes a data-driven framework that takes unstructured free-text narratives in Chinese Electronic Medical Records (EMRs) as input and converts them into structured time-event-description triples, where the description is either an elaboration or an outcome of the medical event.
Year: 2015 PMID: 26295801 PMCID: PMC4546596 DOI: 10.1371/journal.pone.0136270
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. An example of a Chinese EMR and the structured records extracted from it.
(a) A paragraph from an HPI in Chinese with inline literal translation in English. (b) Time-event-description triples in Chinese with inline literal translation in English.
Fig 2. An example of word segmentation and part-of-speech tagging in Chinese.
Fig 3. Chinese EMR information extraction system workflow.
Fig 4. Dynamic term assembly using body part prefixes.
Term recognition by the core lexica.
| | Number of terms in GD-400 | Number of recognized terms | Number of correctly recognized terms | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|---|---|
| Disease | 1179 | 1069 | 810 | 68.8 | 75.7 | 72.1 |
| Drug | 858 | 774 | 707 | 82.4 | 91.3 | 86.6 |
| Body Part | 752 | 1678 | 556 | 73.9 | 33.1 | 45.8 |
| Procedure | 117 | 100 | 60 | 51.3 | 60.0 | 55.3 |
| Symptom | 4970 | 4919 | 4395 | 88.4 | 89.3 | 88.9 |
| Clinical Test | 1143 | 1172 | 1042 | 88.9 | 91.2 | 90.0 |
| Total | 9019 | 9712 | 7570 | 83.9 | 77.9 | 80.8 |
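The percentage columns in the table above can be recomputed from the three count columns. The following is a minimal sketch, assuming the standard definitions (recall = correct / terms in GD-400, precision = correct / recognized, F1 = harmonic mean of the two); the function name `prf` is hypothetical and not taken from the paper. Recomputed values can differ from the published ones in the last digit (for example, the Disease row), presumably because of rounding.

```python
from __future__ import annotations


def prf(gold: int, recognized: int, correct: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) as percentages rounded to one decimal.

    gold       -- number of annotated terms in the gold standard (GD-400)
    recognized -- number of terms the lexica recognized
    correct    -- number of recognized terms that are correct
    """
    recall = correct / gold
    precision = correct / recognized
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 1) for x in (recall, precision, f1))


# Counts copied from the "Drug", "Body Part" and "Total" rows of the table.
print(prf(858, 774, 707))     # Drug
print(prf(752, 1678, 556))    # Body Part
print(prf(9019, 9712, 7570))  # Total
```

For these three rows the recomputed figures agree with the published recall, precision and F1-score columns.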
Fig 5. Number of new terms added with each iteration.
Number of new (a) drug and (b) disease terms added in each iteration with the fixed threshold that yields the best F1-score.
Term recognition after pattern iteration.
| | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|
| Disease | 79.4 | 76.2 | |
| Drug | 87.5 | 91.9 | |
| Body Part | 71.0 | 34.2 | |
| Procedure | 51.3 | 60.0 | 55.3 |
| Symptom | 87.5 | 89.5 | 88.5 |
| Clinical Test | 88.9 | 91.2 | 90.0 |
| Total | 84.8 | 79.0 | 81.8 |
Term recognition after prefix enrichment.
| | | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|---|
| Disease | After body part prefix | 80.0 | 76.8 | 78.4 |
| | After directional or extensional prefix | 85.6 | 82.2 | 83.9 |
| Drug | After body part prefix | 87.5 | 91.9 | 89.8 |
| | After directional or extensional prefix | 87.5 | 91.9 | 89.8 |
| Body Part | After body part prefix | 64.7 | 47.3 | 54.1 |
| | After directional or extensional prefix | 79.0 | 57.1 | 66.3 |
| Procedure | After body part prefix | 53.8 | 63.6 | 58.3 |
| | After directional or extensional prefix | 54.7 | 65.3 | 59.5 |
| Symptom | After body part prefix | 90.3 | 92.4 | 91.4 |
| | After directional or extensional prefix | 93.3 | 95.5 | 94.4 |
| Clinical Test | After body part prefix | 96.0 | 93.6 | 94.8 |
| | After directional or extensional prefix | 96.7 | 94.3 | 95.5 |
| Total | After body part prefix | 86.8 | 85.1 | 85.9 |
| | After directional or extensional prefix | 90.5 | 89.6 | 90.0 |
Comparison of two methods in term recognition: recall, precision and F1-score.
| | Our method (enriched lexica) | Lei | Lei |
|---|---|---|---|
| Disease | 83.9 (85.6/82.2) | 84.8 (80.6/89.6) | 80.2 (76.2/84.7) |
| Drug | 89.8 (87.5/91.9) | 88.2 (80.2/98.2) | 85.5 (77.6/95.1) |
| Body Part | 66.3 (79.0/57.1) | 68.4 (59.1/81.2) | 66.9 (57.8/79.4) |
| Procedure | 59.5 (54.7/65.3) | 60.0 (47.1/82.6) | 55.7 (43.8/76.8) |
| Symptom | 94.4 (93.3/95.5) | 94.0 (92.4/95.6) | 93.4 (91.9/95.1) |
| Clinical Test | 95.5 (96.7/94.3) | 94.7 (92.3/97.2) | 94.2 (91.8/96.8) |
| Total | 89.6 (90.5/88.6) | 89.9 (86.1/94.2) | 88.6 (84.7/92.8) |
Values are F1-score (recall/precision) (%)
Detailed results of auxiliary description matching on GD-400.
| Method | Recall (%) | Precision (%) | F1-score (%) |
|---|---|---|---|
| Baseline | 71.2 | 81.6 | 76.1 |
| | 89.2 | 85.6 | |
| SVM without NGD feature | 86.8 | 80.9 | 83.8 |
| Patrick and Li | 89.5 | 78.8 | 83.8 |