| Literature DB >> 27595047 |
Sijia Liu¹, Hongfang Liu², Vipin Chaudhary¹, Dingcheng Li².
Abstract
It is widely acknowledged that natural language processing is indispensable for processing electronic health records (EHRs). However, poor performance in relation detection tasks, such as coreference (linguistic expressions pertaining to the same entity/event), may affect the quality of EHR processing. Hence, there is a critical need to advance research on relation detection from EHRs. Most clinical coreference resolution systems are based on either supervised machine learning or rule-based methods. The need for a manually annotated corpus hampers the use of such systems at large scale. In this paper, we present an infinite mixture model method using definite sampling to resolve coreferent relations among mentions in clinical notes. A similarity measure function is proposed to determine the coreferent relations. Our system achieved a 0.847 F-measure on the i2b2 2011 coreference corpus. These promising results and the unsupervised nature of the method make it possible to apply the system in big-data clinical settings.
Year: 2016 PMID: 27595047 PMCID: PMC5009297
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. Pipeline of our proposed coreference resolution system
Generative model of i2b2 mentions under the finite mixture model using LDA
| For each document D: |
| Choose the document-entity distribution parameter θ_D ~ Dirichlet(α) |
| For each mention m_n at position n: |
| Choose an entity e_n drawn from the document-entity distribution Multinomial(θ_D) |
| Choose a headword w_n from the multinomial entity-headword distribution Multinomial(φ_{e_n}) |
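As a concrete illustration of this generative story, here is a minimal NumPy sketch of one document's generative process, assuming symmetric Dirichlet priors α and β and a fixed number of entities K (the hyperparameter names and values are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(n_mentions, K, vocab_size, alpha=0.1, beta=0.01):
    """Generate (entity, headword) pairs for one note under the finite mixture model."""
    # Document-entity distribution: theta_D ~ Dirichlet(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    # Entity-headword distributions: phi_k ~ Dirichlet(beta), one row per entity
    phi = rng.dirichlet(np.full(vocab_size, beta), size=K)
    mentions = []
    for _ in range(n_mentions):
        e = rng.choice(K, p=theta)            # entity ~ Multinomial(theta_D)
        w = rng.choice(vocab_size, p=phi[e])  # headword ~ Multinomial(phi_e)
        mentions.append((e, w))
    return mentions

print(generate_document(n_mentions=5, K=3, vocab_size=20))
```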
Figure 2. A sample Chinese Restaurant Process illustration for a clinical note
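In the infinite mixture model, the Chinese Restaurant Process removes the fixed K: each mention joins an existing entity with probability proportional to that entity's current size, or opens a new entity with probability proportional to a concentration parameter. A minimal sketch of this prior (the concentration value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_assignments(n_mentions, concentration=1.0):
    """Sample entity assignments for n mentions from a Chinese Restaurant Process."""
    assignments, counts = [], []  # counts[k] = mentions already assigned to entity k
    for i in range(n_mentions):
        # Mention i joins entity k with prob counts[k] / (i + concentration),
        # or opens a new entity with prob concentration / (i + concentration).
        weights = np.array(counts + [concentration], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)      # new entity ("new table")
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

print(crp_assignments(10))
```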
Set of features used in similarity functions
| ID | Features | Type | Definition | Values |
| 1 | Token distance | Soft | The number of tokens between two mentions | integer ≥ 0 |
| 2 | CUI matching | Soft | If the two CUI concepts match | 0, 1 |
| 3 | CUI not existing | Soft | If either of the mentions does not have a CUI extracted | 0, 1 |
| 4 | Head token matching | Soft | If the head tokens of the two mentions are the same | 0, 1 |
| 5 | Mention type matching | Hard | If the mention types match | 0, 1 |
| 6 | isPerson matching | Hard | If the mentions both refer to personal entities or both to nonpersonal entities | 0, 1 |
| 7 | Number matching | Hard | If the singular/plural forms match | 0, 1 |
| 8 | New entity indicator | Hard | If the later mention phrase contains a new entity indicator | 0, 1 |
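One plausible reading of the Hard/Soft split above, not spelled out in this record: hard features act as gates that veto a link outright, while soft features contribute graded, weighted evidence. A hypothetical similarity function along these lines (the feature encoding and weights are illustrative, not from the paper):

```python
def similarity(m1, m2, tokens_between, weights=(0.5, 1.0, -0.5, 1.0)):
    """Score a candidate coreference link between an antecedent m1 and mention m2.

    m1, m2 are dicts with the features from the table above, e.g.
    {"cui": "C0020538", "head": "hypertension", "type": "problem",
     "is_person": False, "is_plural": False, "new_entity_indicator": False}
    """
    # Hard features (rows 5-8): any mismatch vetoes the link outright.
    if m1["type"] != m2["type"]:
        return 0.0
    if m1["is_person"] != m2["is_person"]:
        return 0.0
    if m1["is_plural"] != m2["is_plural"]:
        return 0.0
    if m2["new_entity_indicator"]:  # the later mention introduces a new entity
        return 0.0

    # Soft features (rows 1-4): weighted graded evidence.
    w_dist, w_cui, w_no_cui, w_head = weights
    score = w_dist / (1.0 + tokens_between)     # closer mentions score higher
    if m1["cui"] is None or m2["cui"] is None:
        score += w_no_cui                       # penalty when a CUI is missing
    elif m1["cui"] == m2["cui"]:
        score += w_cui
    if m1["head"] == m2["head"]:
        score += w_head
    return score
```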
Algorithm for definite sampling of non-pronoun mentions
| Initialize: Length of the document n; number of entities K = 0; entity assignments e = ∅ |
| For each mention position i = 1, …, n: |
| For each antecedent position j = 1, …, i: |
| Calculate the similarity s(i, j) between mention i and mention j |
| End |
| Update e_i by maximum likelihood estimation: e_i = argmax_j s(i, j), where choosing j = i opens a new entity (K = K + 1) |
| End |
| Return the entity assignments e_1, …, e_n |
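As the algorithm reads, definite sampling replaces a stochastic Gibbs draw with a deterministic maximum: each non-pronoun mention joins the antecedent entity with the highest similarity, or starts a new entity when no antecedent scores high enough. A sketch assuming the hypothetical similarity() above, with an illustrative threshold standing in for the new-entity (j = i) option:

```python
def definite_sampling(mentions, distances, threshold=0.5):
    """Assign each mention to an entity by maximizing similarity over antecedents.

    mentions:  list of mention feature dicts (see similarity() above).
    distances: distances[i][j] = token distance between mentions i and j.
    Returns a list where entry i is the entity id assigned to mention i.
    """
    entities, K = [], 0
    for i, m in enumerate(mentions):
        best_j, best_score = None, threshold
        for j in range(i):  # scan all antecedents of mention i
            s = similarity(mentions[j], m, distances[i][j])
            if s > best_score:
                best_j, best_score = j, s
        if best_j is None:
            K += 1                              # no antecedent is similar enough:
            entities.append(K)                  # open a new entity
        else:
            entities.append(entities[best_j])   # join the antecedent's entity
    return entities
```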
Figure 3. cTAKES analysis engine pipeline for mention feature extraction
Statistics of the training and testing datasets
| | Number of documents | Number of mentions | Number of chains | Number of chained mentions | Data source |
| Training | 123 | 12338 | 1182 | 5428 | Pittsburgh Progress |
| Testing | 493 | 66345 | 7050 | 32123 | Pittsburgh Progress, Pittsburgh Discharge, Beth Discharge, Partners Discharge |
Comparison of the F-measures of different models on the i2b2 dataset
| Methods | Test | Person | Problem | Treatment | Overall |
| Baseline | 0.166 | 0.593 | 0.249 | 0.306 | 0.519 |
| Exact string matching | 0.634 | 0.667 | 0.727 | 0.815 | 0.765 |
| Infinite Mixture Model | 0.711 | 0.734 | 0.826 | 0.826 | 0.847 |
Performance of the infinite mixture model on i2b2 data (cells report precision/recall/F-measure)
| Category | B³ | MUC | BLANC | CEAF | Average |
| Test | 0.946/0.971/0.958 | 0.444/0.242/0.313 | 0.599/0.684/0.629 | 0.939/0.916/0.927 | 0.711 |
| Person | 0.570/0.617/0.593 | 0.759/0.961/0.848 | 0.982/0.878/0.924 | 0.436/0.825/0.570 | 0.734 |
| Problem | 0.934/0.940/0.937 | 0.716/0.599/0.652 | 0.773/0.832/0.800 | 0.924/0.903/0.913 | 0.826 |
| Treatment | 0.936/0.960/0.948 | 0.745/0.596/0.663 | 0.770/0.837/0.800 | 0.910/0.877/0.893 | 0.826 |
| Overall | 0.886/0.930/0.907 | 0.741/0.806/0.772 | 0.965/0.875/0.915 | 0.843/0.879/0.861 | 0.847 |
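For reference on how the slash-separated cells are computed, here is a minimal implementation of the B³ metric in its standard formulation (textbook B³, not code from the paper):

```python
def b_cubed(gold, pred):
    """B-cubed precision/recall/F over two clusterings given as mention -> chain id."""
    def clusters(assign):
        out = {}
        for mention, cid in assign.items():
            out.setdefault(cid, set()).add(mention)
        return out

    gold_c, pred_c = clusters(gold), clusters(pred)
    p = r = 0.0
    for m in gold:
        g, s = gold_c[gold[m]], pred_c[pred[m]]
        overlap = len(g & s)
        p += overlap / len(s)   # per-mention precision contribution
        r += overlap / len(g)   # per-mention recall contribution
    n = len(gold)
    p, r = p / n, r / n
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: two gold chains {a,b,c} and {d,e}; the prediction merges them.
gold = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
pred = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
print(b_cubed(gold, pred))  # perfect recall, reduced precision
```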