
An Infinite Mixture Model for Coreference Resolution in Clinical Notes.

Sijia Liu¹, Hongfang Liu², Vipin Chaudhary¹, Dingcheng Li².

Abstract

It is widely acknowledged that natural language processing is indispensable for processing electronic health records (EHRs). However, poor performance in relation detection tasks, such as coreference resolution (identifying linguistic expressions that pertain to the same entity or event), may degrade the quality of EHR processing. Hence, there is a critical need to advance research on relation detection from EHRs. Most clinical coreference resolution systems are based on either supervised machine learning or rule-based methods. The need for a manually annotated corpus hampers the use of such systems at large scale. In this paper, we present an infinite mixture model method that uses definite sampling to resolve coreferent relations among mentions in clinical notes. A similarity measure function is proposed to determine the coreferent relations. Our system achieved a 0.847 F-measure on the i2b2 2011 coreference corpus. These promising results and the unsupervised nature of the method make it possible to apply the system in big-data clinical settings.


Year: 2016 PMID: 27595047 PMCID: PMC5009297

Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc

Introduction

The rapid increase in the volume of electronic health records (EHRs) has created a huge opportunity for clinical research and practice [1,2]. It is widely acknowledged that applying natural language processing (NLP) to unstructured clinical narratives complements structured EHR data [3]. However, the full utility of NLP has been hampered by poor performance in relation detection tasks. For example, the lack of coreference resolution (coreference being linguistic expressions that pertain to the same entity or event) may lead to misclassification in named entity recognition, patient risk prediction, cohort identification, clinical decision support and other clinical applications. Current state-of-the-art medical coreference resolution systems still run into trouble in cases requiring domain knowledge [4]. Hence, there is a critical need to advance research on relation detection from EHRs.

Coreference resolution is a challenging problem in NLP. It tries to resolve the coreferent relations among different nouns, noun phrases and pronouns. Coreferent relations are common in natural language discourse, both in EHRs and in general domains. For example, in the sentence “He is minimally responsive today which appears to be near his baseline and does not have any specific complaints”, the pronoun “which” refers to the phrase “minimally responsive”. This is an example of a pronoun that is coreferent with its antecedent phrase. In another example, the phrase “his anticoagulation” might be mentioned as “anticoagulation therapy” later in a discourse, which shows how different noun phrases can be coreferent. The real-world objects mentioned in a discourse by noun phrases or pronouns are usually called entities, and the phrases or pronouns appearing in the discourse are called mentions. A mention may consist of several words of free text, and each such word is called a token.
Coreference resolution has been studied for years in NLP. There are mainly three kinds of solutions: rule-based methods, supervised machine learning methods and unsupervised learning methods [3,5]. Rule-based methods, also known as heuristics-based methods, incorporate linguistic knowledge and lexical patterns to resolve coreferent relations. Supervised machine learning methods use gold standard annotations as input to train classifiers that fit the data [6,7]. Based on how the mentions are organized for the classifiers, the two most widely used models for this approach are the mention-pair model [8–10] and the entity-mention model [11,12]. Unsupervised nonparametric Bayesian models make assumptions about the generative process of a discourse and use Gibbs sampling [13,14] or Expectation-Maximization [15] to infer the model parameters. Unsupervised Bayesian methods [16–18] do not require human effort such as annotation and labeling in the data preparation stage, which is an advantage in big-data applications.

In the clinical domain, a coreference dataset annotated from clinical notes was released with mention annotations and gold standard results by the i2b2 challenge organizers in 2011 [4]. The best two systems in the challenge achieved F-measures above 0.9, and both used supervised machine learning methods [19,20]. Although some strong rules can help identify coreferent relations [21], it is difficult to design a rule-based system for clinical applications because of the variety of discourse styles and the complexity of hidden information that is not directly observable from context [21–23]. According to the competition results, supervised learning methods outperformed rule-based methods on the i2b2 clinical corpus. However, there is currently no system that uses unsupervised models in the clinical domain.
The complexity of the inference phase makes unsupervised Bayesian models difficult to deploy in practical applications. In addition, these models have some drawbacks when applied directly to clinical discourse. The distribution of mention frequency in general-domain text is quite different from that in clinical discourse. There are many more singleton mentions in clinical notes; in most cases there is one very long chain of patient mentions, while other non-personal entities appear no more than twice. Thus, if we view the discourse as a generative process, the probability of generating a new entity in clinical notes is much higher. Moreover, the vocabulary of concepts in clinical corpora is much larger than that of common English articles. This makes the posterior distribution sparser and the latent variables in these models harder to infer. However, nonparametric Bayesian models are still promising for clinical applications if more domain knowledge is added as constraints in the sampling process.

Our study aims to propose an unsupervised model for clinical coreference resolution that achieves performance comparable to state-of-the-art coreference resolution approaches. Training an unsupervised Bayesian coreference model involves maximizing the posterior probability over the mentions in the given corpus. The challenge is to define a posterior probability such that optimizing it also optimizes the coreference evaluation metrics. In this paper, we present an infinite mixture model to resolve the coreferent relations between annotated mentions. A similarity measure between two mentions is proposed to evaluate the likelihood that the two mentions are coreferent, and the similarity score is then used to determine whether a given mention has an antecedent. The paper is organized as follows: we first introduce our system pipeline. We then propose our infinite mixture model and evaluate the performance of our system on the i2b2 dataset. Finally, we discuss the performance and error analysis along with some potential further improvements.

Methods

Our proposed coreference resolution system consists of several components; the pipeline is illustrated in Figure 1. The first component is the file preprocessor. It reads both the free text of the clinical records and the corresponding annotated mentions, then extracts lexical, syntactic and concept features for each mention. Once the annotated mentions have features, we use definite sampling within a Bayesian mixture model to infer the entity index of each mention. After the entity indices are obtained, the mentions are chained according to their entity indices for evaluation.

Figure 1.

Pipeline of our proposed coreference resolution system

Finite Mixture Model using Latent Dirichlet Allocation

We first adopt a commonly used probabilistic topic modeling method, Latent Dirichlet Allocation (LDA) [15], as a finite mixture model for coreference resolution. Conventional LDA assumes that each document is generated from a multinomial distribution over topics and that each topic is a multinomial distribution over words. The parameters of these multinomial distributions are drawn from Dirichlet distributions with arbitrary hyperparameters. To adapt LDA to a coreference resolution system, several modifications to the model are needed. First, the generative process of a document has to be redefined as shown in Table 1. Note that in the coreference LDA model each topic contains only one type of mention, which is reasonable because mentions with different types or properties, such as gender or number (singular or plural), should not be coreferent. To resolve pronoun mentions, we simply match the pronoun with the most recent antecedent mention of matching number (singular or plural). For example, the pronoun “those” only matches the most recent plural antecedent, and all singular antecedents are skipped. In addition, because LDA takes single tokens as observed data, we use only the head word of each mention to train the model.

Table 1

Generative Model of i2b2 Mention of finite mixture model using LDA

For each document D:
    Choose the document-entity distribution parameter θ drawn from Dir(α)
    For each mention m in position n:
        Choose an entity drawn from the document-entity distribution Multinomial(θ)
        Choose a head word from the multinomial entity-head word distribution
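The generative process in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the symmetric Dirichlet prior, and the toy entity-head word distributions are hypothetical and are not taken from the paper or from JGibbLDA.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) of dimension k via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_mentions, entity_head_dists, alpha, rng):
    """Generate (entity, head word) pairs following the process in Table 1.

    entity_head_dists holds one dict per entity mapping head word -> probability.
    These are hypothetical toy values; in a trained model they would be inferred.
    """
    k = len(entity_head_dists)
    # Document-entity distribution parameter theta ~ Dir(alpha)
    theta = sample_dirichlet(alpha, k, rng)
    mentions = []
    for _ in range(n_mentions):
        # Entity ~ Multinomial(theta)
        entity = rng.choices(range(k), weights=theta)[0]
        # Head word ~ multinomial entity-head word distribution
        words = list(entity_head_dists[entity])
        probs = [entity_head_dists[entity][w] for w in words]
        head = rng.choices(words, weights=probs)[0]
        mentions.append((entity, head))
    return theta, mentions

rng = random.Random(0)
toy_dists = [{"pain": 0.7, "ache": 0.3}, {"scan": 1.0}]
theta, mentions = generate_document(5, toy_dists, alpha=1.0, rng=rng)
print(theta, mentions)
```

Note that the number of entities is fixed by the length of `entity_head_dists`, which is exactly the limitation of a finite mixture model for this task.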

We conducted an experiment on LDA using our optimized head word extraction method. Our LDA coreference resolution system is implemented in Java on top of JGibbLDA [24]. Unsurprisingly, the finite mixture model cannot infer most of the coreferent entities correctly. The overall unweighted average of the coreference evaluation metrics is only a 0.475 F-measure, even lower than the 0.519 of the baseline that outputs no coreference relations.

There are several reasons why LDA fails as a coreference resolution method. The most important one is that, although LDA can be used as an unsupervised clustering method, its strategy for grouping mentions cannot be applied directly to coreference resolution. In conventional LDA, the co-occurrence of words is used to calculate the topic-word probability, and this probability is proportional to the number of appearances of the word. That is, the more often two words appear together in a document, the more likely they are to be put into one topic. This causes errors when LDA is used for coreference resolution, because coreference is different from co-occurrence: in most cases, even if two mentions appear together many times, they may not belong to the same entity. For example, in a clinical note a patient may have “problem A” and “problem B”, which clearly belong to different entities. They should therefore not be judged coreferent, yet LDA inference tends to group them. Another issue is that LDA is a finite mixture model, so the number of components has to be set arbitrarily before model training, and this number can significantly affect the clustering output. For coreference resolution, however, we do not know the number of entities until we get the result. As a result, the generative model of LDA has to be modified to solve coreference resolution problems.

Infinite Mixture Model using Mention Similarity

To address the problems of LDA as a coreference resolution method, we adapted the Hierarchical Dirichlet Process (HDP) [25] into our framework. Like LDA, HDP was proposed as a topic modeling method; it can be regarded as an extension of LDA that replaces the finite multinomial topic distribution with an infinite mixture model, the Dirichlet Process (DP) [26]. Therefore, the total number of topics in a corpus does not need to be fixed to an arbitrary value before training. Our model is inspired by Haghighi and Klein's work [18], and a definite sampling inference algorithm is proposed to improve performance on clinical corpora. Given the mention type assigned in the previous step, the entity is generated from a modified Chinese Restaurant Process (CRP) derived from the DP. A sample illustration of the CRP for a clinical note is shown in Figure 2. The first entity refers to the patient in the clinical note, and the second entity refers to the clinician who wrote the note. Each entity is regarded as a table in the CRP, while each document is a restaurant with an infinite number of tables, namely entities. Each mention is regarded as a customer who finds a table to sit at in the restaurant. The conventional CRP behaves in a “rich get richer” manner, because the probability of a customer being assigned to a certain table is proportional to the number of customers already at that table. In contrast, in our discourse generative model, the probabilities are calculated from the extracted feature vectors of each mention and each entity candidate. This discourse generative model is thus an infinite mixture model.
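To make the contrast concrete, the sketch below computes the seating probabilities of a conventional CRP next to a similarity-proportional variant. The concentration parameter `alpha`, the function names and the example numbers are assumptions for illustration; the paper does not give the exact normalization used in its modified CRP.

```python
def crp_probabilities(table_counts, alpha):
    """Conventional CRP: probability that the next customer joins each existing
    table, followed by the probability of opening a new table."""
    denom = sum(table_counts) + alpha
    probs = [count / denom for count in table_counts]
    probs.append(alpha / denom)
    return probs

def similarity_probabilities(scores):
    """Hypothetical variant: assignment weights proportional to similarity scores
    rather than table sizes. Returns all zeros when no candidate is similar."""
    total = sum(scores)
    if total == 0:
        return [0.0 for _ in scores]
    return [s / total for s in scores]

# A long patient chain (8 mentions) and a short clinician chain (2 mentions).
print(crp_probabilities([8, 2], alpha=1.0))
# The same two candidate entities scored by (made-up) mention similarity.
print(similarity_probabilities([0.2, 1.4]))
```

With counts alone, the long patient chain attracts most new mentions regardless of their content, which is the behaviour the similarity-based variant is meant to avoid.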

Figure 2.

A sample Chinese Restaurant Process illustration for clinical note

A similarity score $f(m_i, m_j)$ is calculated between each pair of mentions. The weights of the features can be set arbitrarily according to the importance of each feature. Some mention properties, such as mention number and mention type, must match for any pair of coreferent mentions. We call these properties hard constraints and the others soft constraints. The combination of hard constraints is represented by a product of binary terms (either 0 or 1) in the similarity function, which lets the similarity function act in a rule-based manner for these strict rules. The similarity function of two mentions is defined as:

$$f(m_i, m_j) = \prod_{k=1}^{N} f_k(m_i, m_j) \cdot \sum_{k'=1}^{N'} w_{k'} f_{k'}(m_i, m_j)$$

where $m_i$ and $m_j$ refer to any two mentions. The terms $f_k(m_i, m_j)$ are hard constraints, all of which must be non-zero for the similarity score to be non-zero. The terms $w_{k'} f_{k'}(m_i, m_j)$ are the weighted soft constraints, which are the main measure of similarity, and $w_{k'}$ is the weight of the $k'$-th feature. $k$ and $k'$ are the feature indices for hard and soft constraints respectively, while $N$ and $N'$ are the numbers of hard and soft constraints.

Several features are extracted for each mention before the pair-wise mention similarities are calculated. The features are: singular/plural of the head token; part-of-speech (POS) tag of the head token; token position of the first token of the mention; gender (male/female/neutral); whether the mention is related to a person (true/false); whether it contains new entity indicators (“further”, “another”); and the Concept Unique Identifier (CUI) of the most specific concept within the mention. The number property of each mention is extracted from the POS tag of the head token. In the output of the Penn Treebank POS tagger [27], plural nouns are assigned the same POS tag as singular nouns with an added “S” suffix. For example, the POS tag of a singular common noun is NN and the corresponding plural tag is NNS.
From this rule, we can decide whether the head token is singular or plural. We can decide whether a mention is a personal mention from both the POS tag and the annotated mention type.

CUIs from the UMLS Metathesaurus [28] are used to identify each non-person mention. For most mention strings with more than one token, several concepts are found within the mention. We use only the most specific concept, which is assumed to be the one with the maximum number of tokens among the returned CUI query results. For example, for the mention “total abdominal hysterectomy”, the UMLS dictionary lookup algorithm in cTAKES returns three terms: “total abdominal hysterectomy” (C0404079), “abdominal hysterectomy” (C0404077) and “hysterectomy” (C0020699). In this example, “C0404079” is returned.

The dependency relations within each mention are also considered when we choose the head word of the mention. For each sentence, the POS tagger and dependency parser are run to obtain the dependency relations within the mentions in the sentence. The dependency parser we use is the ClearTK dependency parser [29], which is included in cTAKES. For each word token, the dependency parser returns its parent. From the parent information, we construct a mention dependency tree, which is a subtree of the sentence dependency tree. The root node of the mention subtree is then used as the head token of the mention. For example, for the mention “a CT scan on the chest”, we choose the token “scan” as the head token because it is the root of the dependency tree. Similarly, for “her CT scan”, the token “scan” is chosen as the head token. This is an improvement over the head word matching in [18], which simply used the last token of a mention as the head token. The extracted features are then used to calculate the similarity, taking the values listed in Table 2.
Features 1 to 4 are soft constraints and features 5 to 8 are hard constraints, so both N and N′ equal 4. In the token distance feature, β is the coefficient of the exponential function that controls how quickly the token distance function decays, and T is the maximum distance between the mentions considered. In practice, β is set to 0.5 and T is set to one tenth of the total number of mentions in the document.
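The feature extraction rules described above (number from the POS tag, most specific concept, and head token from the dependency tree) can be sketched as follows. The function names and the hand-written parent indices are hypothetical; in the actual system these values come from the cTAKES POS tagger, dictionary lookup and dependency parser.

```python
def is_plural(pos_tag):
    """Penn Treebank noun tags mark plurals with a trailing 'S' (NN -> NNS)."""
    return pos_tag.startswith("NN") and pos_tag.endswith("S")

def most_specific_cui(candidates):
    """Pick the CUI of the concept term with the most tokens.

    candidates: list of (term, cui) pairs returned by a dictionary lookup.
    """
    term, cui = max(candidates, key=lambda pair: len(pair[0].split()))
    return cui

def head_token(mention_indices, parents):
    """Return the index of the root of the mention's dependency subtree.

    parents[i] is the index of token i's dependency parent (-1 for the
    sentence root). The head is the token whose parent lies outside the mention.
    """
    span = set(mention_indices)
    for i in mention_indices:
        if parents[i] not in span:
            return i
    raise ValueError("mention has no token attached outside its span")

# "a CT scan on the chest" with an assumed, hand-written parse:
tokens = ["a", "CT", "scan", "on", "the", "chest"]
parents = [2, 2, -1, 2, 5, 3]
print(tokens[head_token(range(len(tokens)), parents)])

lookup = [
    ("total abdominal hysterectomy", "C0404079"),
    ("abdominal hysterectomy", "C0404077"),
    ("hysterectomy", "C0020699"),
]
print(most_specific_cui(lookup))
```

Using the subtree root rather than the last token is what lets “a CT scan on the chest” and “her CT scan” share the head token “scan”.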

Table 2

Set of features used in similarity functions

ID	Features	Type	Definition	Values
1	Token distance	Soft	The number of tokens between two mentions	exp(−β|p1−p2|) if |p1−p2| ≤ T; 0 otherwise
2	CUI matching	Soft	if the two CUI concepts match	0, 1
3	CUI not existing	Soft	If either of the mentions does not have CUI extracted	0, 1
4	Head token matching	Soft	If the head tokens of the two mentions are the same	0, 1
5	Mention type matching	Hard	If the mention types match	0, 1
6	isPerson matching	Hard	If the mentions both refer to personal entities or both to nonpersonal entities	0, 1
7	Number matching	Hard	If the singular/plural forms match	0, 1
8	New entity indicator	Hard	If the later mention phrase contains a new entity indicator	0, 1

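As an illustration of how the features in Table 2 combine, the following sketch multiplies the binary hard constraints and then adds up the weighted soft constraints. Several details are assumptions rather than the authors' implementation: the `Mention` fields, the placeholder weights (the paper does not report them), and the reading of feature 8 as a term that is 0 when the later mention contains a new entity indicator.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    position: int                 # token position of the first token
    head: str                     # head token
    mention_type: str             # e.g. "problem", "test", "treatment"
    is_person: bool
    is_plural: bool
    cui: Optional[str] = None     # most specific CUI, if any
    has_new_entity_indicator: bool = False  # e.g. "another", "further"

# Placeholder weights; the paper only says they can be set arbitrarily.
WEIGHTS = {"distance": 1.0, "cui_match": 1.0, "cui_missing": 0.0, "head_match": 1.0}

def similarity(later, earlier, beta=0.5, t_max=50, weights=WEIGHTS):
    """Product of hard constraints times weighted sum of soft constraints."""
    hard = (
        int(later.mention_type == earlier.mention_type)
        * int(later.is_person == earlier.is_person)
        * int(later.is_plural == earlier.is_plural)
        * int(not later.has_new_entity_indicator)  # assumed reading of feature 8
    )
    if hard == 0:
        return 0.0
    d = abs(later.position - earlier.position)
    soft = {
        "distance": math.exp(-beta * d) if d <= t_max else 0.0,
        "cui_match": float(later.cui is not None and later.cui == earlier.cui),
        "cui_missing": float(later.cui is None or earlier.cui is None),
        "head_match": float(later.head.lower() == earlier.head.lower()),
    }
    return hard * sum(weights[name] * value for name, value in soft.items())

a = Mention(position=10, head="scan", mention_type="test", is_person=False, is_plural=False)
b = Mention(position=12, head="scan", mention_type="test", is_person=False, is_plural=False)
print(similarity(b, a))
```

Because the hard constraints enter as a product, a single mismatch forces the score to zero no matter how strongly the soft features agree.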

Inference by Definite Sampling

In this step, we first divide all mentions into three categories: personal mentions, pronoun mentions and non-personal mentions. The personal and pronoun mentions are simply the mentions with the annotated types “person” and “pronoun”, respectively, in the i2b2 dataset, while non-personal mentions consist of the types test, problem and treatment. Typically, the personal entities in clinical notes are limited to the following categories: the patient, the author (who is likely to be a clinician), other attending physicians and the patient's family [21]. Personal pronouns like “I”, “she” and “him” are also annotated as person entities, so we treat these pronouns as personal mentions. Given this observation, it is reasonable to apply a dedicated method to personal mentions. It is also safe to assume that most singular third-person pronouns, for example “he”, “she”, “his” and “her”, refer to the patient [21], and we use this rule to directly assign the patient entity to singular third-person pronoun mentions in the person category. In addition, descriptive mention phrases like “a 60 years old male” or “the female” are likely to refer to the patient as well, so mentions containing the token “patient”, “pt”, “male” or “female” are also assigned to the patient entity. All other personal mentions are clustered using exact string matching.

For non-personal mentions, instead of directly assigning an entity index as we do for personal mentions, we try to find the best antecedent for a mention with a sampling strategy. Rather than using Gibbs sampling to decide which antecedent mention the current mention should corefer with, we employ a definite sampling method for our CRP, derived from a similarity function that estimates how likely two mentions are to belong to the same entity. The return value of the similarity function is called the similarity score.
After the similarity scores are calculated for all the antecedents, the definite sampling algorithm, which uses maximum likelihood estimation (MLE) to estimate the optimal entity, proceeds as shown in Table 3. Here we use the mention similarity defined earlier as the likelihood. The algorithm performs pair-wise comparisons between the current mention and each of its antecedents and chooses the entity assignment that maximizes the mention similarity (likelihood). Recall that the hard constraints require some mention properties to match. If any of these properties are mismatched, the similarity score is zero. A positive threshold on the similarity score therefore filters out mentions whose similarity scores are all zero, and our algorithm assigns a new entity index to such a mention. In this manner new entities are generated. Once we have iterated over all the mentions in a document, the sampling step is done, and each mention has an entity index. The generative process of our CRP is quite similar to the conventional infinite mixture model, except that the entity assignment probability is proportional to the mention similarity rather than to the element count of each cluster.
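The rules for personal mentions described above are simple enough to sketch directly. The pronoun set below extends the paper's examples with “him”, the patient is given entity index 0 by convention, and whitespace tokenization is assumed; none of these choices is confirmed by the paper.

```python
# Illustrative sets; the paper lists "he", "she", "his", "her" as examples.
THIRD_PERSON_SINGULAR = {"he", "she", "his", "her", "him"}
PATIENT_TOKENS = {"patient", "pt", "male", "female"}
PATIENT = 0  # entity index reserved for the patient (an assumed convention)

def assign_person_entities(mentions):
    """Assign an entity index to each personal mention, in document order.

    Singular third-person pronouns and mentions containing a patient token go
    to the patient entity; all others are clustered by exact string matching.
    """
    entity_of_string = {}
    next_index = PATIENT + 1
    assignments = []
    for text in mentions:
        tokens = text.lower().split()
        if len(tokens) == 1 and tokens[0] in THIRD_PERSON_SINGULAR:
            assignments.append(PATIENT)
        elif PATIENT_TOKENS & set(tokens):
            assignments.append(PATIENT)
        else:
            key = text.lower()
            if key not in entity_of_string:
                entity_of_string[key] = next_index
                next_index += 1
            assignments.append(entity_of_string[key])
    return assignments

example = ["the patient", "she", "her daughter", "a 60 years old male", "her daughter", "I"]
print(assign_person_entities(example))
```

The example also shows a weakness discussed later in the paper: a mention such as “her daughter” is only linked to other mentions with exactly the same string.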

Table 3

Algorithm for definite sampling for non-pronoun mentions

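Initialize: length of the document $n$; number of entities $K = 0$; entity assignment $\vec{e} = \{e_1, \ldots, e_n\}$

For each mention position $i = 1, \ldots, n$:

    For each antecedent position $j = 1, \ldots, i - 1$:

        Calculate the similarity between mentions $i$ and $j$: $p(e_i = e_j \mid \vec{e}) \propto f(m_i, m_j)$

    End

    Update $e_i$ by maximum likelihood estimation:

    $$e_i = \begin{cases} \arg\max_{e_j} p(e_i = e_j \mid \vec{e}), & \max_{e_j} p(e_i = e_j \mid \vec{e}) > \mathrm{Thres} \\ K, \text{ with } K = K + 1, & \text{otherwise} \end{cases}$$

End

Return $\vec{e}$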
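A minimal Python sketch of the procedure in Table 3 is given below. It assumes that antecedents are the mentions strictly before the current one and that ties keep the earliest best antecedent; the toy similarity function and the example mentions are hypothetical stand-ins for the feature-based score.

```python
def definite_sampling(mentions, similarity, threshold):
    """Assign an entity index to each mention in document order.

    Each mention takes the entity of its most similar antecedent if that
    similarity exceeds the threshold; otherwise a new entity is created.
    Entity indices start at 1, matching K = K + 1 before assignment.
    """
    assignments = []
    n_entities = 0
    for i, mention in enumerate(mentions):
        best_score, best_entity = 0.0, None
        for j in range(i):  # antecedents only
            score = similarity(mention, mentions[j])
            if score > best_score:
                best_score, best_entity = score, assignments[j]
        if best_entity is not None and best_score > threshold:
            assignments.append(best_entity)
        else:
            n_entities += 1
            assignments.append(n_entities)
    return assignments

def toy_similarity(a, b):
    """Hypothetical stand-in: 2.0 when the strings match, 0.0 otherwise."""
    return 2.0 if a == b else 0.0

doc = ["echocardiogram", "chest pain", "echocardiogram", "aspirin", "chest pain"]
print(definite_sampling(doc, toy_similarity, threshold=1.1))
```

The procedure is deterministic: unlike Gibbs sampling, it makes a single pass and never revisits an assignment, which is what keeps inference cheap.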

The mentions of the pronoun type in the i2b2 dataset are restricted to non-personal pronouns. All personal pronouns like “him” or “I” are annotated as the person type. Therefore, we only need to handle non-personal pronouns like “it”, “this”, “those” or “which”. A pronoun is detected as plural if it is “these” or “those”. We then simply use the most recent mention that satisfies the hard constraint on number.

There are two common output formats for entity clusters used by evaluation programs for coreference resolution metrics: by chains and by pairs. The official evaluation script released with the i2b2 dataset requires the chain output format. Thus, the mentions with the same sampled entity index are output as a chain labeled with its entity type, which excludes “pronoun”. The official documentation requires that each output chain include at least one mention that is not a pronoun.
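The pronoun rule amounts to a backward scan over the preceding mentions. In this sketch the antecedent representation, a list of (entity index, is plural) pairs, is an assumption made for illustration.

```python
PLURAL_PRONOUNS = {"these", "those"}

def resolve_pronoun(pronoun, antecedents):
    """Return the entity of the most recent antecedent whose number matches.

    antecedents: list of (entity_index, is_plural) pairs for the mentions that
    precede the pronoun, in document order. Returns None if nothing matches.
    """
    want_plural = pronoun.lower() in PLURAL_PRONOUNS
    for entity, is_plural in reversed(antecedents):
        if is_plural == want_plural:
            return entity
    return None

preceding = [(1, False), (2, True), (3, False)]
print(resolve_pronoun("those", preceding))  # skips the singular entity 3
print(resolve_pronoun("it", preceding))
```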

Experimental Setup and Results

Our proposed system is implemented using Apache cTAKES [30], a natural language processing system specifically designed for extracting information from clinical documents. cTAKES is an open source project implemented in Java and based on Apache UIMA [31]. The aggregate analysis engine descriptor of UIMA makes it straightforward to design free text processing pipelines. Our analysis engine pipeline is illustrated in Figure 3. Both the markable concept files and the raw text files are loaded by the collection reader. In the i2b2 dataset, each line contains only one sentence, so the line number can be used as the sentence index. Word tokens are separated by spaces. The lookup window annotator is used to improve the efficiency of the subsequent UMLS dictionary lookup. It annotates all the markable mentions in the clinical free text according to the concepts manually annotated by domain experts. As a result, the dictionary lookup function only searches within the spans of annotated mentions instead of every token in the clinical notes. The CUIs returned by the dictionary lookup are used as a feature when calculating mention similarities.

Figure 3.

cTAKES analysis engine pipeline for mention feature extraction

The clinical coreference resolution system is evaluated on the i2b2 2011 Track 1C dataset [4]. In Track 1C, the mentions are already manually annotated, with spans and types provided for coreference resolution. The terminology used in this paper is the more common one and differs slightly from that of the i2b2 2011 Challenge, in which “mention” and “entity” are called “markable” and “chain”, respectively. The Pittsburg Progress dataset is used as the training dataset. The system is developed on the training dataset and then tested on a larger testing dataset. All of the results presented are obtained on the complete i2b2 training set, which contains Beth Discharge, Partners Discharge, Pittsburg Discharge and Pittsburg Progress. Some statistics of the training and testing datasets are shown in Table 4. Note that since our proposed system does not use any supervised techniques, the coreference evaluation metrics show only negligible differences between the training and testing datasets.

Table 4

Statistics of training and testing dataset

Dataset	Number of documents	Number of mentions	Number of chains	Number of chained mentions	Data source
Training	123	12338	1182	5428	Pittsburg Progress
Testing	493	66345	7050	32123	Pittsburg Progress, Pittsburg Discharge, Beth Discharge, Partners Discharge

We performed several experiments to evaluate our proposed method. First, we compared the F-measures of two baseline systems (Table 5). The first baseline is a system that outputs no coreference chains; that is, all mentions are singletons and none of the mentions corefer with others. This system has an overall F-measure of 0.519. The second baseline is an exact string matching system. It uses the head token obtained from the dependency parser and treats only mentions with exactly the same head token as belonging to the same entity. This system is implemented by increasing the weight of the string matching term in our proposed similarity function to a very large value and then setting the similarity threshold to a value only slightly smaller than that weight. As a result, only head token matching can affect the mention clustering. In practice, we set the mention clustering threshold to 1.1; this optimizes our results in that either CUI-matched concepts or string-matched mentions can be clustered. The distance threshold is 0.1 for the token distance feature. With this parameter setup, our method outperforms the other systems.

Table 5

Comparison of the F measure of different models in i2b2 dataset

Methods	Test	Person	Problem	Treatment	Overall
Baseline	0.166	0.593	0.249	0.306	0.519
Exact string matching	0.634	0.667	0.727	0.815	0.765
Infinite Mixture Model	0.719	0.734	0.826	0.826	0.847

The evaluation results are obtained from the evaluation script provided by the i2b2 2011 Challenge organizers, and the per-metric results of the infinite mixture model are listed in Table 6. The system performance is measured by MUC [32], B-cubed [33], CEAF [34] and BLANC [35]. Each metric has precision, recall and F-measure. The evaluation uses the unweighted average of MUC, B-cubed and CEAF as the final result. In the i2b2 dataset, the gold standard annotations contain the type “pronoun”. After coreference resolution, mentions of the “pronoun” type are merged into the other four entity types, so the evaluation metrics cover only four entity types, without “pronoun”. Note that the overall precision, recall and F-measure are different from the average over the categories for all the coreference measures. This is because the overall evaluation considers both singleton and pronoun mentions, whereas correctly clustered pronoun mentions do not contribute to the per-category evaluations. In our system, the overall F-measure of 0.847 is higher than that of the highest category, i.e. problem with 0.826. In addition, the evaluation script provides only the overall F-measure; it does not report overall precision and recall.
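As a worked check of how the final score is formed, the snippet below recomputes the F-measures and their unweighted average from the Overall row of Table 6. It assumes the cells in that table are ordered precision/recall/F-measure; the function names are illustrative and are not part of the i2b2 evaluation script.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def unweighted_average(values):
    """Plain arithmetic mean, as used for the final score."""
    return sum(values) / len(values)

# (precision, recall) pairs from the Overall row of Table 6.
overall = {"B3": (0.886, 0.930), "MUC": (0.741, 0.806), "CEAF": (0.843, 0.879)}

f_scores = {name: round(f_measure(p, r), 3) for name, (p, r) in overall.items()}
final_score = round(unweighted_average(list(f_scores.values())), 3)
print(f_scores, final_score)
```

BLANC is reported in Table 6 but, as stated above, does not enter the average.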

Discussion

Our proposed infinite mixture model with mention clustering achieves good results in the clinical domain while using only a limited number of features. The definite sampling algorithm also outperforms the conventional posterior-probability-based inference method when inferring coreferent relations. Although the current feature set is relatively small, it can be extended into a feature-rich system by incorporating more features into the mention similarity function. Our proposed model also broadens the range of methods available for coreference resolution. For example, since we already have a quantified, weighted feature vector to represent the distance between mentions, other clustering methods can be introduced to solve the mention clustering problem. We can also sample new mentions from the multivariate distribution of these features to improve the performance of nonparametric Bayesian models. In previous solutions, only a few hard constraints like type, number and gender were considered as features in the generative process.

Comparing performance across individual categories, we found that the test and person categories are the bottleneck of our system. For personal mentions, a feature-rich supervised machine learning classifier achieves a 0.902 F-measure in [20], which is substantially higher than the 0.734 F-measure of our system. The classifier in [20] used several lexical features such as text, tokens and character trigrams for patient mention identification. Although these features are not as interpretable as those used in Xu's system [19] and in ours, the classifier is still able to correctly identify most of the patient mentions. This observation indicates the variety of latent patterns among patient mentions in clinical notes. There are also some mentions of patients' family members or of attending clinicians other than the authors of the notes.
Our features for personal mentions may have difficulty resolving these entities. Patient mentions can be easily annotated because there is only one patient in each clinical note, and patient mentions usually make up the majority of personal mentions. This makes supervised learning for patient identification feasible. In addition, mentions that are represented by Protected Health Information (PHI) make the problem more challenging. For example, the gender of the patient cannot be obtained directly by looking up a dictionary of first names. To improve our system's performance, pre-training a patient classifier using supervised methods can be considered, while the other categories remain unsupervised. A boost in both the person category and the overall performance can then be expected.

For non-personal mentions, the complexity and variety of test mentions make this category the hardest in the corpus. The MUC score for test in our system is much lower than those of the other categories. It seems that our model is too simple for test mentions. For example, “follow-up echo” can refer to “another echocardiogram”. In this case, even if we can detect that “echo” is an abbreviation, it remains challenging to decide whether it refers to “echocardiogram”. Usually, the word “another” indicates that the mention has not appeared before; in this example, however, it has. For these reasons, this chain is hard to link. In another example, the mention “the patient’s abdominal exam” can refer to “the patient’s evaluation”. Since “exam” and “evaluation” have the same meaning, they could be clustered. However, it is hard for any system that does not use supervised learning to detect that these two words have the same meaning. Even a supervised learning method will not be able to detect this relation if the training set does not contain it.
To correct these errors, more features should be analyzed according to how the entities might be mentioned by the note authors. Beyond specific improvements targeting the existing errors, further exploration can improve the proposed system. The hierarchy of UMLS concepts can be used when calculating the likelihood. For example, if a parent-child relation is detected between two mentions, the mentions may still be coreferent. Other medical ontologies such as SNOMED can also be included as features. In addition, a spatial adjective detection method could be added so that two mentions of the same kind of test on different body parts are marked as not coreferent. Gibbs sampling can also be considered for updating the weights of the feature vector, so that the final weights optimize the mixture similarity, which would yield a fully nonparametric Bayesian model.
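
The extensions proposed in this discussion (an ontology parent-child feature and a spatial-modifier check added to a weighted mention similarity) might be sketched as follows. This is only an illustration under stated assumptions, not the similarity function of the described system: the `Mention` fields, the toy `PARENT` hierarchy, the `SPATIAL` word list, the feature set and the weights are all hypothetical, and a real system would query UMLS or SNOMED instead of a hand-written dictionary.

```python
from dataclasses import dataclass

# Hypothetical toy hierarchy (child concept -> parent concept).
# A real system would look this up in UMLS or SNOMED.
PARENT = {"echocardiogram": "cardiac imaging", "abdominal exam": "physical exam"}

# Hypothetical list of spatial modifiers.
SPATIAL = {"left", "right", "upper", "lower", "anterior", "posterior"}


@dataclass(frozen=True)
class Mention:
    text: str      # surface form in the note
    concept: str   # normalized concept name (assumed to come from an upstream mapper)
    category: str  # e.g. "test", "problem", "treatment", "person"


def spatial_terms(mention):
    """Return the spatial modifiers that appear in the mention text."""
    return {tok for tok in mention.text.lower().split() if tok in SPATIAL}


def features(a, b):
    """Binary feature vector for a pair of mentions."""
    same_category = float(a.category == b.category)
    same_concept = float(a.concept == b.concept)
    parent_child = float(
        PARENT.get(a.concept) == b.concept or PARENT.get(b.concept) == a.concept
    )
    sa, sb = spatial_terms(a), spatial_terms(b)
    # Conflict only when both mentions carry spatial modifiers and they differ.
    spatial_conflict = float(bool(sa) and bool(sb) and sa != sb)
    return [same_category, same_concept, parent_child, spatial_conflict]


# Illustrative weights only; the negative weight penalizes spatial conflicts.
WEIGHTS = [0.2, 0.6, 0.3, -1.0]


def similarity(a, b, weights=WEIGHTS):
    """Weighted sum of the pairwise features."""
    return sum(w * f for w, f in zip(weights, features(a, b)))
```

With these illustrative weights, a pair linked by a parent-child relation scores higher than an unrelated pair in the same category, and two mentions of the same test with conflicting spatial modifiers receive a negative score. How such weights would actually be learned (for instance by the Gibbs sampling mentioned above) is left open here.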

Conclusion

We have described an infinite mixture model method for coreference resolution in the clinical domain. Unlike other solutions for the same task, we modeled the generative process of clinical discourse as a Chinese Restaurant Process. The proposed infinite mixture model uses definite sampling with maximum likelihood estimation of mention similarity to estimate the clusters of entities. The system achieves an F-measure of 0.847 on the i2b2 2011 coreference dataset, and the infinite mixture model is shown to be both effective and promising for further studies and applications.
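
To make the clustering idea concrete, the following is a minimal sketch of a Chinese Restaurant Process style assignment in which each mention is placed deterministically in the highest-scoring cluster instead of being sampled. It is an assumption about what such a procedure could look like, not the definite sampling algorithm of this paper: the scoring rule, the `alpha` and `new_entity_likelihood` parameters, and the `token_overlap` similarity are hypothetical stand-ins.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercased tokens; a stand-in for a learned mention similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)


def crp_greedy_clusters(mentions, similarity, alpha=1.0, new_entity_likelihood=0.25):
    """Cluster mentions in document order with a CRP-style score.

    Existing cluster: (cluster size) * (best similarity to a member).
    New cluster:      alpha * new_entity_likelihood.
    The highest score wins; ties go to opening a new cluster.
    """
    clusters = []
    for mention in mentions:
        best_cluster = None
        best_score = alpha * new_entity_likelihood
        for cluster in clusters:
            likelihood = max(similarity(mention, other) for other in cluster)
            score = len(cluster) * likelihood  # size term mirrors the CRP prior
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is None:
            clusters.append([mention])  # open a new "table" (entity)
        else:
            best_cluster.append(mention)
    return clusters


mentions = [
    "his anticoagulation",
    "chest pain",
    "anticoagulation therapy",
    "the chest pain",
]
clusters = crp_greedy_clusters(mentions, token_overlap)
for entity_id, members in enumerate(clusters):
    print(entity_id, members)
```

The number of clusters is not fixed in advance, which is the property that motivates an infinite mixture model. The size factor reproduces the rich-get-richer behavior of the Chinese Restaurant Process, while `alpha` controls how readily a new entity is created.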

Table 6

Performance of infinite mixture model on i2b2 data

Category	B3	MUC	BLANC	CEAF	Average
Test	0.946/0.971/0.958	0.444/0.242/0.313	0.599/0.684/0.629	0.939/0.916/0.927	0.711
Person	0.570/0.617/0.593	0.759/0.961/0.848	0.982/0.878/0.924	0.436/0.825/0.570	0.734
Problem	0.934/0.940/0.937	0.716/0.599/0.652	0.773/0.832/0.800	0.924/0.903/0.913	0.826
Treatment	0.936/0.960/0.948	0.745/0.596/0.663	0.770/0.837/0.800	0.910/0.877/0.893	0.826
Overall	0.886/0.930/0.907	0.741/0.806/0.772	0.965/0.875/0.915	0.843/0.879/0.861	0.847

10 in total

1. The Unified Medical Language System (UMLS): integrating biomedical terminology.

Authors: Olivier Bodenreider
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

Review 2. Evaluating the state of the art in coreference resolution for electronic medical records.

Authors: Ozlem Uzuner; Andreea Bodnari; Shuying Shen; Tyler Forbush; John Pestian; Brett R South
Journal: J Am Med Inform Assoc Date: 2012-02-24 Impact factor: 4.497

3. Lexical patterns, features and knowledge resources for coreference resolution in clinical notes.

Authors: Phil Gooch; Abdul Roudsari
Journal: J Biomed Inform Date: 2012-03-17 Impact factor: 6.317

4. Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules.

Authors: Siddhartha Reddy Jonnalagadda; Dingcheng Li; Sunghwan Sohn; Stephen Tze-Inn Wu; Kavishwar Wagholikar; Manabu Torii; Hongfang Liu
Journal: J Am Med Inform Assoc Date: 2012-06-16 Impact factor: 4.497

5. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications.

Authors: Guergana K Savova; James J Masanz; Philip V Ogren; Jiaping Zheng; Sunghwan Sohn; Karin C Kipper-Schuler; Christopher G Chute
Journal: J Am Med Inform Assoc Date: 2010 Sep-Oct Impact factor: 4.497

6. A supervised framework for resolving coreference in clinical records.

Authors: Bryan Rink; Kirk Roberts; Sanda M Harabagiu
Journal: J Am Med Inform Assoc Date: 2012-05-19 Impact factor: 4.497

7. A classification approach to coreference in discharge summaries: 2011 i2b2 challenge.

Authors: Yan Xu; Jiahua Liu; Jiajun Wu; Yue Wang; Zhuowen Tu; Jian-Tao Sun; Junichi Tsujii; Eric I-Chao Chang
Journal: J Am Med Inform Assoc Date: 2012-04-13 Impact factor: 4.497

Review 8. Coreference resolution: a review of general methodologies and applications in the clinical domain.

Authors: Jiaping Zheng; Wendy W Chapman; Rebecca S Crowley; Guergana K Savova
Journal: J Biomed Inform Date: 2011-08-12 Impact factor: 6.317

9. A rule based solution to co-reference resolution in clinical text.

Authors: Ping Chen; David Hinote; Guoqing Chen
Journal: J Am Med Inform Assoc Date: 2012-10-11 Impact factor: 4.497

10. Exploiting the potential of large databases of electronic health records for research using rapid search algorithms and an intuitive query interface.

Authors: A Rosemary Tate; Natalia Beloff; Balques Al-Radwan; Joss Wickson; Shivani Puri; Timothy Williams; Tjeerd Van Staa; Adrian Bleach
Journal: J Am Med Inform Assoc Date: 2013-11-22 Impact factor: 4.497
