A rule based solution to co-reference resolution in clinical text.

Abstract

OBJECTIVE: To build an effective co-reference resolution system tailored to the biomedical domain.
METHODS: Experimental materials used in this study were provided by the 2011 i2b2 Natural Language Processing Challenge. The 2011 i2b2 challenge involves co-reference resolution in medical documents. Concept mentions have been annotated in clinical texts, and the mentions that co-refer in each document are linked by co-reference chains. Normally, there are two ways of constructing a system to automatically discover co-referent links. One is to manually build rules for co-reference resolution; the other is to use machine learning systems to learn automatically from training datasets and then perform the resolution task on testing datasets.
RESULTS: The existing co-reference resolution systems are able to find some of the co-referent links; our rule based system performs well, finding the majority of the co-referent links. Our system achieved 89.6% overall performance on multiple medical datasets.
CONCLUSIONS: Manually crafting rules based on observation of training data is a valid way to achieve high performance in this co-reference resolution task for the critical biomedical domain.

Keywords: Co-reference Resolution; Computational Linguistics; Natural Language Processing; Rule-based system

Year: 2012 PMID: 23059732 PMCID: PMC3756251 DOI: 10.1136/amiajnl-2011-000770

Source DB: PubMed Journal: J Am Med Inform Assoc ISSN: 1067-5027 Impact factor: 4.497

Background

Co-reference resolution is the process of linking together concepts that refer to the same entity. The ability to have computers automatically find this type of relation in text documents is of interest to people in the field of artificial intelligence because it can lead to having systems that can summarize texts and answer questions posed about information contained within those documents.1 2 Automatic summaries and question answering systems could be of great value to personnel in the healthcare industry as well. Because of these possibilities, a Natural Language Processing Challenge was hosted in 2011 by the i2b2 (Informatics for Integrating Biology and the Bedside) in order to advance co-reference resolution technology for the field of automatic biomedical document analysis and understanding. Annotated data was provided by four institutions: Partners HealthCare, Beth Israel Deaconess Medical Center, the University of Pittsburgh, and the Mayo Clinic. These data include the original texts for medical documents, a concept file for each document that describes concepts mentioned in the texts, and chain files that identify manually created co-reference chains in each of the texts as an example of how chains are to look after processing. The concept mentions to be linked are nouns or descriptive phrases in the medical texts that represent people, actions, objects, or ideas and have been given types accordingly. Two methods were adopted by the hosts of the challenge to annotate the datasets in the i2b2 shared task. The first method is the i2b2 style annotations that include five concept categories: people, problems, tests, treatments, and pronouns. The other method used is ODIE (Ontology Development and Information Extraction) style annotations that include eight categories: disease or syndrome, sign or symptom, procedure, people, other, none, laboratory or test result, and anatomical site. 
Each type of concept mention will only co-refer with a concept mention of the same type, with the exception of pronouns, which can co-refer with any type of mention.3 The challenge was divided into three tracks; the University of Houston–Downtown (UHD) team participated in two of these. The first track is to find markables and then find co-reference between them. The second track is to find co-reference relations between already marked concepts in the ODIE style of annotation. The third track is to find co-reference relations between already marked concepts in the i2b2 style of annotation.

Objective

The aim of this study was to build an effective rule-based co-reference resolution system and compare its performance with that of some publicly available co-reference systems. For the critical biomedical domain, time and effort spent in building carefully-crafted rules could be a well justified necessity to achieve the desired performance required in practice. To fully evaluate our approach, we conducted a comprehensive study that examined the performance of three publicly available general purpose co-reference resolution systems.

Materials and methods

The data used in this challenge came in two sets, training and test data. The training data differs from the test data in that it includes gold standard co-reference chains, which could be used as a guide for constructing co-reference rules, either by hand or by machine learning algorithms. The datasets are from the four institutions named above. Each of the institutions provided over 1000 documents1 in which concept mentions and co-reference were marked by hand. The test data consisted of 323 documents marked using the i2b2 annotation style and 59 documents marked using the ODIE style annotations, all of which were taken from the pool of over 1000 documents provided by the four institutions. Our method was to test three well developed, publicly available co-reference resolution systems, as well as an algorithm constructed by our team, on the training data provided by i2b2. The best performing system was used to create output from the test data provided by i2b2 at the end of the challenge. Because the specifications for co-reference resolution in the i2b2 challenge were well defined and the type of data provided is specific,3 we adopted a rule-based approach for the system built in this study. For this challenge, the UHD team participated only in the second and third tracks. That means our method was designed only to find co-reference chains in text documents when the gold standard concepts are already given as input. It is important to note that the publicly available systems are responsible for discovering their own concept markables, whereas the UHD rule based system is given them. For a more equal evaluation and comparison of the four systems, the three publicly available systems would need to be capable of using the gold standard data as input rather than relying on their own markable discovery.
Attempts to do just that were made; however, since none of the systems has a facility for inputting such data, and the source code for each of the systems was not available to create one, it was impossible for the UHD team to have the three systems use the gold standard data as the rule based system does. When processing the datasets provided by i2b2, the gold standard concept files that came with the data were used to mark the concepts in the text documents. The system was developed by examining a sample of files from the pool of training data (15 per dataset, chosen because we felt they were representative of the data as a whole) and constructing linking functions, or rules, based on observation. The linking functions were then checked against the unused training data to get an idea of which rules worked and which did not. The system consists of six components, and uses four data sources to aid in creating co-referent links. The general architecture for our system is depicted in figure 1.

Figure 1

Co-reference resolution system architecture.

Data input and access

The first two routines in the system are made to read in the text being examined and the concepts that are to be linked from the files provided by i2b2. The document handler breaks the text into tokens using white space boundaries, with each space character indicating the end of one word and the beginning of the next. The text is then stored in a two-dimensional array, where the first dimension is the line number, and the second dimension is the word number. A representation of this operation is depicted in figure 2.

Figure 2

Representation of document handler functionality. Access the article online to view this figure in colour.

The document handler controls access to this matrix and gives the system a way to easily find the location of the concepts in the text, and a way to search the words surrounding the concepts for information about the concept. The concept handler reads in each concept and stores it in an array giving each concept a number based on its position in the array. Each element in the array holds the start line, start word, end line, end word, type, and the text within each concept. The concept handler gives easy access to the attributes of each concept. An example of concept storage can be found in figure 3.

Figure 3

Representation of concept handler functionality. Access the article online to view this figure in colour.
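
The two handlers described above can be pictured with a minimal Python sketch. It is not the UHD implementation: the names `tokenize_document`, `Concept`, and `concept_tokens` are hypothetical, and the sketch assumes 0-based line and word indices, which may differ from the indexing used in the i2b2 concept files.

```python
from dataclasses import dataclass


def tokenize_document(text: str) -> list[list[str]]:
    """Build a line-by-word matrix by splitting each line on white space."""
    return [line.split() for line in text.splitlines()]


@dataclass
class Concept:
    """One concept mention: its span in the matrix, its type, and its text."""
    start_line: int
    start_word: int
    end_line: int
    end_word: int
    ctype: str
    text: str


def concept_tokens(matrix: list[list[str]], c: Concept) -> list[str]:
    """Return the tokens covered by a concept (end positions are inclusive)."""
    if c.start_line == c.end_line:
        return matrix[c.start_line][c.start_word:c.end_word + 1]
    # Multi-line span: rest of the first line, whole middle lines, start of the last.
    tokens = matrix[c.start_line][c.start_word:]
    for line in range(c.start_line + 1, c.end_line):
        tokens += matrix[line]
    tokens += matrix[c.end_line][:c.end_word + 1]
    return tokens
```

With this layout, the position of a concept in the concept array serves as its identifier, and the words around a concept are reached by indexing the same matrix just outside the concept's span.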

Main linker

The next routine in the algorithm is the main linker, which matches all the concepts that are not in the person category. Every concept that passes through this linker is compared to each of the other concepts of the same type in the document, and links are recorded if they meet the programmed criteria. Decisions made by this linker are binary, meaning two concepts either match or do not match. At this stage, every link that is detected is kept, which means a concept can have links to many concepts within the document, rather than the at most two links that are characteristic of co-reference chains. The main linker uses string matching, the UMLS4 (Unified Medical Language System) database, and the WordNet5 database to determine if two concepts might have the same meaning. The main linker traverses the concept list, runs each concept through its set of rules, and stores detected links in a list of pairs that is organized later on in the chain builder.

Non-personal pronoun match

The first step with each concept is to check if it is a pronoun type. If it is a pronoun type concept and the word is ‘which’ or ‘that’, it is linked to the concept that immediately precedes it, provided fewer than two words separate the two concepts. Other pronouns occur in the data, but every rule written for them resulted in a performance loss when tested across the unused training data; we were unable to build a reliable rule for any other pronoun. Example: ‘… deep wound culture showed MRSA which is sensitive to…’.
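
As an illustration only, the ‘which’/‘that’ rule can be sketched in Python as follows. The `Mention` tuple, the type label `"pronoun"`, and the restriction to mentions on the same line are assumptions made for this sketch, not details taken from the UHD system.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    line: int    # line index of the mention
    start: int   # index of its first word on that line
    end: int     # index of its last word on that line (inclusive)
    ctype: str   # concept type, e.g. "problem" or "pronoun"
    text: str


RELATIVE_PRONOUNS = {"which", "that"}


def link_relative_pronouns(mentions: list[Mention]) -> list[tuple[int, int]]:
    """Link 'which'/'that' to the immediately preceding mention.

    `mentions` must be sorted by position. A link (antecedent, pronoun) is
    recorded when fewer than two words separate the two mentions.
    """
    links = []
    for i in range(1, len(mentions)):
        prev, cur = mentions[i - 1], mentions[i]
        if cur.ctype != "pronoun" or cur.text.lower() not in RELATIVE_PRONOUNS:
            continue
        if cur.line != prev.line:
            continue  # assumption: only mentions on the same line are linked
        gap = cur.start - prev.end - 1  # number of words between the mentions
        if 0 <= gap < 2:
            links.append((i - 1, i))
    return links
```

For the example sentence, a mention ‘MRSA’ at word 4 followed by a pronoun mention ‘which’ at word 5 has a gap of zero words, so the pair is linked.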

Be phrase match

The next step with each concept is to check the type of the concepts that immediately precede and follow the concept. If they are of the same type, the text in between the two concepts is examined and if it contains any words that indicate it is a ‘be phrase’, the two concepts are linked because they are probably saying ‘something is something’. Words and phrases that are commonly found in the ‘be phrases’ are stored in the rule database, and were added to the database manually by the UHD team based on observations of gold standard links. Example: ‘Resolution of organism is methicillin-resistant Staphylococcus…’.
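
A rough Python sketch of this check is given below. The contents of `BE_WORDS` are placeholders, since the actual entries of the rule database are not listed here, and the `Mention` tuple and same-line restriction are likewise assumptions of the sketch.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    line: int
    start: int   # first word index on the line
    end: int     # last word index on the line (inclusive)
    ctype: str
    text: str


# Placeholder list; the real rule database was filled in by hand.
BE_WORDS = {"is", "are", "was", "were"}


def link_be_phrases(tokens: list[list[str]],
                    mentions: list[Mention]) -> list[tuple[int, int]]:
    """Link neighbouring mentions of the same type joined by a 'be phrase'."""
    links = []
    for i in range(len(mentions) - 1):
        left, right = mentions[i], mentions[i + 1]
        if left.ctype != right.ctype or left.line != right.line:
            continue
        # Words strictly between the two mentions.
        between = tokens[left.line][left.end + 1:right.start]
        if any(word.lower() in BE_WORDS for word in between):
            links.append((i, i + 1))
    return links
```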

Match by meaning

After the ‘be phrase’ match, the concepts are examined and linked by their meanings. First, the concepts are conditioned by filtering out what we refer to as ‘common words’. These common words include conjunctions (and, or, as, but, etc), adjectives (large, blue, painful, etc), and pronouns (he, she, it, etc). The conjunctions and pronouns that are filtered out are chosen to be eliminated from the concept if they appear in the common words table of the rule database. Each of the words that appear in the common words table was manually placed there by the UHD team. Adjectives are detected by searches in the WordNet database. After elimination of the common words, any non-letter characters, such as punctuation and hyphens, are removed. After this conditioning, the concepts are compared to every other concept of the same type on the document in three ways.

Head and synonym match

First, every leftover word in the concept is compared to every leftover word in each of the other concepts of the same type in the document by a word comparison method. This word comparison method will declare the words a match if the first 80% of the characters in the shorter word match the same number of characters in the longer word, or if they are found to be WordNet synonyms. If every word in one of the concepts is matched to a word in the other concept, a link between the two is recorded. Example: ‘abscess abscesses’.
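
The word comparison can be sketched as below. This is a hedged illustration: the rounding of the 80% cut-off (rounded up here) is not specified above, and `SYNONYMS` is a tiny hypothetical stand-in for the WordNet lookups the system performs.

```python
import math

# Hypothetical stand-in for WordNet synonym lookups.
SYNONYMS = {frozenset({"physician", "doctor"})}


def words_match(a: str, b: str, ratio: float = 0.8) -> bool:
    """True if the words are listed synonyms or share an 80% prefix."""
    a, b = a.lower(), b.lower()
    if frozenset((a, b)) in SYNONYMS:
        return True
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return False
    n = math.ceil(ratio * len(shorter))  # assumption: round the cut-off up
    return shorter[:n] == longer[:n]


def concepts_match(words_a: list[str], words_b: list[str]) -> bool:
    """Link two concepts if every word of one matches some word of the other."""
    def covered(src: list[str], dst: list[str]) -> bool:
        return bool(src) and all(any(words_match(w, v) for v in dst) for w in src)
    return covered(words_a, words_b) or covered(words_b, words_a)
```

For ‘abscess’ and ‘abscesses’ the shorter word has seven letters, so the first six characters (‘absces’) are compared and the words match.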

UMLS match

The second comparison is through the UMLS database. Both concepts are searched for in the MRCONSO table of the UMLS database after the conditioning, and if they are found in the database and their UMLS concept numbers match, a link between the two is recorded. Example: ‘renal’ and ‘kidney’ both have C011773 for a concept number in the UMLS database.

Acronym match

The third type of comparison is a check for acronyms. The first letters of each word in concepts that have two or more words are taken and are compared to whole words in other concepts, and if a whole word is found that matches either all the first letters, or some of them in order, a link is recorded. Example: ‘Methicillin-resistant Staphylococcus aureus MRSA’. After performing these steps, a phrase like ‘Recurrent soft tissue abscess in the gluteal region’ will link to ‘tissue abscesses’ because tissue is present in both mentions, abscess matches to abscesses by way of the head match, and since there are only two words in the second mention, all the other words in the first mention are ignored.
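
One possible reading of the acronym check is sketched below in Python. Two points are assumptions of the sketch rather than facts about the UHD system: hyphens are treated as word boundaries when the first letters are collected, and ‘some of them in order’ is interpreted as the candidate word being an in-order subsequence of the first letters, with a hypothetical minimum length of two characters.

```python
import re


def initials(phrase: str) -> str:
    """First letters of a phrase with two or more words, else an empty string."""
    words = [w for w in re.split(r"[\s\-]+", phrase) if w]
    if len(words) < 2:
        return ""
    return "".join(w[0].lower() for w in words)


def is_subsequence(short: str, long: str) -> bool:
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def acronym_match(phrase: str, other: str, min_len: int = 2) -> bool:
    """True if a whole word in `other` is built from the initials of `phrase`."""
    letters = initials(phrase)
    if not letters:
        return False
    for word in other.split():
        candidate = word.lower()
        if len(candidate) >= min_len and is_subsequence(candidate, letters):
            return True
    return False
```

Under these assumptions ‘Methicillin-resistant Staphylococcus aureus’ yields the letters ‘mrsa’, which match the whole word ‘MRSA’.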

People linker

All concept types are processed through the same path in the algorithm except for the mentions of type ‘person’ or ‘people’. These mentions are processed by the people linker. As with the main linker, all decisions made by this linker are binary.

Identifying people mentions

When the people linker is called to examine a document, it runs through several subroutines to identify ‘person’ type mentions as being doctors or the subject of the document.

Medical personnel

The first step performs internet searches on each concept mention. The mention being processed is sent to a search engine, and the results are scanned for certain key words to indicate if the mention is referring to a doctor or medical personnel. Every mention that is found to be of medical personnel is stored in a list for later use. Example: when ‘optometrist’ is sent to the search engine, it returns many results like this: Doctors of Optometry and their Education | American Optometric… http://www.aoa.org/x5879.xml Doctors of Optometry and their Education. Doctors of optometry are the nation's largest eye care profession, serving patients in nearly 6500 communities across… These results are searched for keywords such as doctor, clinic, hospital, medical, etc. If two or more of those types of medical keywords are present, the mention is marked as being medical personnel.
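
The keyword test applied to the search results can be sketched as follows. The sketch deliberately leaves out the search request itself and takes the returned text as input; the keyword list and the function name are illustrative, and matching on word prefixes (so that ‘doctors’ counts for ‘doctor’) is an assumption.

```python
import re

# Illustrative keywords; the full list used by the system is not given above.
MEDICAL_KEYWORDS = ("doctor", "clinic", "hospital", "medical")


def looks_like_medical_personnel(search_text: str,
                                 keywords: tuple[str, ...] = MEDICAL_KEYWORDS,
                                 threshold: int = 2) -> bool:
    """True if at least `threshold` distinct keywords start a word in the text."""
    text = search_text.lower()
    hits = sum(
        1 for keyword in keywords
        if re.search(r"\b" + re.escape(keyword), text)
    )
    return hits >= threshold
```

Counting distinct keywords, rather than total occurrences, is one way to read ‘two or more of those types of medical keywords’; a snippet that repeats a single keyword many times would not pass under this reading.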

The subject (patient)

The second step is to find a name in the document to represent the subject of the document. The function checks whether each concept meets the following criteria: it is not a pronoun; it was not found to be a doctor by the previous check; it does not have the doctor salutation, Dr; it has no medical title, MD, at the end; and it does not contain common words stored in the rule database, such as ‘patient’, or words that would indicate it is a family member. A concept that meets all of these criteria is marked as the subject of the document. If no concept fits these criteria, the first occurrence of a concept that says ‘patient’ or ‘pt’ is marked as the subject, since the patient has been the subject of the document in every document observed by the UHD team. After an appropriate representation of the subject is found, every concept that contains the word ‘patient’ or ‘pt’ and no words that refer to a family member is linked to the subject concept.

The subject's gender

The third step is to find the gender of the subject. This function simply counts the number of masculine and feminine pronouns in the document; the type that is more frequent is declared to be the gender of the subject.
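
A minimal sketch of this count is shown below. The pronoun lists are illustrative, and returning `None` on a tie is an assumption, since the behaviour for equal counts is not described above.

```python
import re
from collections import Counter
from typing import Optional

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}


def subject_gender(text: str) -> Optional[str]:
    """Guess the subject's gender from the more frequent pronoun class."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    masculine = sum(counts[word] for word in MASCULINE)
    feminine = sum(counts[word] for word in FEMININE)
    if masculine == feminine:
        return None  # assumption: undecided when the counts are equal
    return "male" if masculine > feminine else "female"
```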

Matching people mentions

After gathering information about the ‘person’ and ‘people’ type concept mentions, the algorithm moves on to actually create links between these mentions.

Introduction match

If two concepts are found to be no more than two words apart with one starting with a doctor salutation, or ending with a medical title, and the other was marked as referring to a doctor by the internet searches or by the database which stores words that identify mentions as medical personnel (eg, Attending), the two concepts are linked as this likely indicates an introduction of someone. Example: ‘Please follow-up with your Optometrist, Dr Smith 2019-01-16 at 8:30’.

Partial match

After linking the introductions, a matching function is run that works the same way as the head matching function in the main linker. Certain words are removed from concepts, such as salutations, pronouns, titles, and single letters, as well as punctuation; they are then compared to each other. If all of the words, up to 80% of the length of the word, in each concept appear in the other concept, a link between them is recorded. This match will link people's names together, including those that appear with an initial for the first name in one instance and the full name in another. Example: ‘You will see Edward L, Smith … on your visit to Dr Smith's clinic’.

Pronoun linking

The next step in the people linker is to match third person pronouns to the names that refer to them. This is done by searching the sentence that contains pronoun concepts.

Third person, no proper names in the sentence

If the sentence has only pronoun mentions in it, each pronoun in that sentence is linked to the subject concept if it is of the same gender as the subject. If it is not of the same gender as the subject, it is linked to the closest preceding concept that is not a pronoun.

Third person, with proper names in the sentence

If there is one name in the sentence, and the name's position in the sentence is before the pronoun, then it is linked to that name. If there are multiple names in the sentence, any pronoun that is the gender of the subject is linked to the subject and the others are linked to the first name in the sentence that is found to be a doctor.

Other pronouns, including first and second person pronouns

After this, any person concepts that are first person pronouns are linked together, and any second person pronouns are linked to the subject. The last step is to link any pronoun type mention that is the word ‘this’ to the next person mention if it is within three words of it and the next mention is not a doctor; then, any pronoun type mention that is the word ‘who’ is linked to the previous person type mention that is not any type of pronoun.

Link filtering

After the semantic links are made in the main linker, they are passed over to filters to eliminate links that actually refer to two different entities based on clues found in the sentences surrounding the mentions in question. These clues include descriptive phrases such as dates, locations, or descriptive modifiers not included in the span of the mention. These clues are found by using regular expressions for dates and keywords stored in the rule database for locations and descriptive modifiers compared by string matching. These clues are only searched for if the word preceding or following each mention is one of the keywords stored in the rule database. Examples include in, on, are, is, etc. The filter portion of the algorithm also eliminates links using WordNet; any mention that is found to be an adjective with no noun included has any links to it removed.

Building the chains

Once the linkers and the filter have finished their jobs, the final output is created from the ‘web’ of links that has been made. The first concept with links is found and each link is traversed to the next concept, and each of those links is followed in a recursive fashion. A list of each concept visited is kept, and though concepts can be linked more than one time, they are added only once to the list. After every link has been examined in the ‘web’, the list of concepts is sorted according to each concept's position in the text. Concepts that appear in the beginning of the text are at the top of the list. Once a chain is constructed, it is written to an output file in the i2b2 format.
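
The chain-building step amounts to collecting the connected groups in the ‘web’ of links. The sketch below uses an iterative traversal with an explicit stack in place of the recursion described above, and it assumes that concepts are numbered in order of their position in the text, so that sorting the numbers sorts the chain by position. The function name is hypothetical.

```python
from collections import defaultdict


def build_chains(links: list[tuple[int, int]]) -> list[list[int]]:
    """Group pairwise links into chains, each sorted by concept number."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    chains = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Follow every link reachable from `start`, visiting each concept once.
        stack, chain = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            chain.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        chains.append(sorted(chain))
    return chains
```

Concepts that never appear in a link are absent from the graph and so never form a chain in this sketch.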

Results and discussion

There are a number of systems publicly available for co-reference resolution. For the purpose of comparison, we conducted experiments with three widely adopted systems—Beautiful Anaphora Resolution Toolkit (BART), the Stanford co-reference system, and LingPipe—and provide their performance based on i2b2 testing data. Each system was evaluated in two ways. The first method was to compare each link with the provided co-reference chain annotations, and count it as correct only if it matches exactly with the provided annotation. With this method, single unlinked concept mentions which are not co-referent to any other mentions, called ‘singletons’, are not considered, and links that fall in the same chain but skip an antecedent are considered incorrect. An example of this can be seen in figure 4.

Figure 4

Example of exact match scoring. Highlighted sections are concept mentions. Blue arrows are correct links; red is counted as incorrect even though it is a co-reference because it skips a mention. Access the article online to view this figure in colour.

This scoring method is referred to as ‘exact match’ scoring and is a method we devised before receiving the i2b2 evaluation script. This method was used because the i2b2 evaluation script was not made available until late in the course of the challenge; after use it seemed to give a better representation of the performance of the systems as we could measure individual concept type performance. The second method of evaluation is with a script provided by i2b2 that conducts four types of examinations of the chain output for each system: B-Cubed,6 MUC,7 Blanc,8 and CEAF.9 Since many of the concept mentions, approximately 30–60%, depending on the dataset, are singletons, the i2b2 evaluation script will produce a much higher score than the exact match method because it considers singletons to be correct co-reference chains. Overall performance results using both methods are listed at the end of this section. The results are in the form of an f1 score, which is the harmonic mean of precision and recall.
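
A minimal sketch of exact match scoring, as described above, is given below. It assumes that chains are lists of concept numbers in text order and that a link is a pair of neighbouring concepts in a chain; the function names are hypothetical and this is not the script used in the study.

```python
def adjacent_links(chains: list[list[int]]) -> set[tuple[int, int]]:
    """Links between neighbouring mentions; singleton chains yield none."""
    return {(c[i], c[i + 1]) for c in chains for i in range(len(c) - 1)}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match_f1(gold_chains: list[list[int]],
                   predicted_chains: list[list[int]]) -> float:
    """Score predicted links that match gold links exactly."""
    gold = adjacent_links(gold_chains)
    predicted = adjacent_links(predicted_chains)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return f1_score(precision, recall)
```

A predicted link from concept 0 to concept 2 earns no credit against a gold chain 0, 1, 2, because it skips the antecedent in between, which is the behaviour illustrated in figure 4.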

Beautiful Anaphora Resolution Toolkit

BART was developed from a project done at the 2007 Johns Hopkins Summer workshop (http://www.bart-coref.org/).10 Once set up, text is sent to it through a web service, and output is returned in xml format. The output contains detected concept mentions and if they belong to a chain, the chain identifier is included in the xml tag of the concept mention. A translator was created to compare the BART output to the chain files included with the input texts. Only concept mentions detected by the BART system and listed by the i2b2 annotations were considered for testing; all other mentions and co-referent links were discarded. Individual concept type linking scores using the exact match scoring are listed in table 1. The i2b2 evaluation script results for each training dataset are shown in table 2.

Table 1

Exact match f1 scores for the four systems on individual concept mention types in the Beth Israel, Partners Healthcare, and Mayo Clinic datasets, measured across unused training data

System	Dataset	People	Problems	Test	Treatments	All others
UHD	Beth Israel	0.958	0.690	0.389	0.597	N/A
	Partners Healthcare	0.953	0.696	0.462	0.624	N/A
	Mayo Clinic	0.593	0.667	N/A	0.500	0.453
BART	Beth Israel	0.590	0.202	0.166	0.300	N/A
	Partners Healthcare	0.475	0.206	0.253	0.263	N/A
	Mayo Clinic	0.410	0.000	N/A	0.000	0.000
Stanford	Beth Israel	0.205	0.076	0.000	0.096	N/A
	Partners Healthcare	0.251	0.073	0.074	0.061	N/A
	Mayo Clinic	0.069	0.000	N/A	0.000	0.000
LingPipe	Beth Israel	0.243	0.015	0.029	0.092	N/A
	Partners Healthcare	0.139	0.067	0.088	0.066	N/A
	Mayo Clinic	0.071	0.000	N/A	0.000	0.000

BART, Beautiful Anaphora Resolution Toolkit; N/A, not applicable; UHD, University of Houston–Downtown.

Table 2

i2b2 evaluation script overall f1 score results of the unused training data for all four systems

System	Beth Israel	Partners Healthcare	Mayo Clinic
UHD	0.891	0.912	0.789
BART	0.775	0.712	0.436
Stanford	0.627	0.633	0.423
LingPipe	0.628	0.601	0.423

BART, Beautiful anaphora resolution toolkit; UHD, University of Houston–Downtown.

Stanford co-reference system

The Stanford co-reference system is an ongoing project by the Stanford Natural Language Processing Group (http://nlp.stanford.edu/software/dcoref.shtml).11 It uses a ‘multi-pass sieve’ to perform co-reference resolution, which is a layered approach to detecting links between mentions. It starts with the strongest match first, then uses more and more relaxed criteria for matches as it runs down the layers of co-referring rules. Input was supplied as raw text in a string, and output from this system comes in the form of a map stored in an array. Each element of the array holds the location, in the form of line number and word number in the text, of a source mention and a destination mention. A simple mapping function was constructed to convert the Stanford concept locations to i2b2 concept locations. Only concept mentions that were found by the Stanford system and listed by the i2b2 annotations were considered; all other mentions and co-referent links were discarded. Individual concept type linking scores using the exact match scoring are listed in table 1. The i2b2 evaluation script results for each training dataset are shown in table 2.

LingPipe

LingPipe is a suite of natural language processing tools provided by the Alias-i company as a commercial natural language processing product (http://alias-i.com/lingpipe). LingPipe performs co-reference resolution through a set of heuristic algorithms that link together mentions found by internal functions.12 Input for the system was through command line functions specifying the location of the input text documents, and output was a text document containing xml tags surrounding discovered concept mentions and a chain identifier if the mention was found to be co-referent. A translator similar to the one used to map the BART system output was constructed to make the data useable in this study. Individual concept type linking scores using the exact match scoring are listed in table 1. The i2b2 evaluation script results for each training dataset are shown in table 2.

Our system

The rule based approach was chosen for the UHD algorithm strictly because of the specific nature of the challenge. Machine learning algorithms can perform as well as, if not better than, rule based algorithms; however, they can take much more time to construct. Rule based algorithms rely on human knowledge for their performance rather than gathering their own information. For that reason they can be quicker to build, but are less adaptable to changes in the structure and types of data. Our system could conceivably be used for other types of English texts if given concept markables in the same style as the i2b2 data. It is not restricted to the types given for the challenge and will attempt to process concepts of any type given to it. The algorithm only looks for matching concept types before testing co-reference between the concepts. Whether this algorithm would work well with documents other than medical documents is as yet untested. The algorithm did adapt well to new data sources in the context of this competition. The University of Pittsburgh training data was released near the close of the challenge; the UHD algorithm showed similar performance on that data, with an f1 score about 1% higher than on the data that was used to construct it. Individual concept type linking scores using the exact match scoring are listed in table 1. The i2b2 evaluation script results for each training dataset are shown in table 2.

Combining results

Once result data were collected, combinations of link results from the rule based system and the BART system were examined, since the BART system showed the largest number of correct link predictions among the publicly available systems. After combining the results from the two systems as a union of the sets, the statistics showed an increase of about 1% in recall but a decline of about 15% in precision, bringing the f1 score down overall. The combination of our system and BART was the only one attempted, as it was felt that no better gain would be achieved from the other systems in union with ours since BART performed the best, and the time it takes to test the combinations would be better spent improving our own system. Since recall of the combination of the UHD and BART systems is only 1% higher, it can be said that the UHD system found nearly all of the correct co-referent links that the other systems found, with a much higher precision.
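
The effect of taking a union can be illustrated with a small Python sketch. The link sets below are made-up toy data, not results from the challenge, and the function name is hypothetical; the point is only that a union can add a few correct links (raising recall) while also importing the other system's incorrect links (lowering precision).

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Precision and recall of a set of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy, made-up link sets for illustration only.
gold = {(0, 1), (1, 2), (3, 4), (5, 6)}
system_a = {(0, 1), (1, 2), (3, 4)}       # precise, misses one gold link
system_b = {(5, 6), (7, 8), (9, 10)}      # finds the missing link plus two wrong ones
combined = system_a | system_b            # union of the two link sets

p_a, r_a = precision_recall(system_a, gold)
p_c, r_c = precision_recall(combined, gold)
```

In this toy case recall rises from 3 of 4 to 4 of 4 gold links, while precision falls from 3 of 3 to 4 of 6 predicted links.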

Challenge participation

In order to participate in the challenge, each team participating was given test data that did not include the gold standard co-reference chains. After processing the data, each team submitted the data for evaluation by the hosts of the challenge. The system used for our submission to the challenge was the rule based system constructed by the UHD team since it showed the highest performance on the unused training data.

Challenge results

The system the UHD team constructed had an average f1 score of 0.895 on all of the datasets provided for testing. This score was the only score provided by the hosts of the competition and represents the performance of the UHD system on all of the datasets during the competition evaluation. According to the hosts of the competition, our team ranked fourth in the challenge, with the top four performing systems being in a close tie for first place. The highest performing system had an f1 score of 0.915.

Conclusion

Since the goal of the 2011 i2b2 Natural Language Processing task was to mark concept mentions as co-referent or not, the rule based system developed for this study was used to mark links in the test data released by the organization for the challenge. This decision was made based on the results from cross-checking the performance of each system on the training data provided. The results show the BART system performed the best of the three publicly available co-reference systems tested in this study on this specific collection of data. The results also show that manually creating rules for co-reference based on observation of training data is a valid way to accomplish this co-reference task, particularly with the person type concepts in the i2b2 style annotations, and in this case performed well using the guidelines laid out by the hosts of the competition. The results listed in this paper show that the rule based system outperformed the three publicly available systems. This is because the publicly available systems are general purpose systems designed to detect co-reference of people and named entities and must discover their own markables, whereas the UHD rule based system was designed specifically for this challenge and these markables. The public systems should be given credit, though, for being able to detect co-referent links in this environment while discovering their own markables. It is not a stretch to imagine that these systems took a fair amount of time to develop, and they can perform in many situations, whereas the UHD rule based system will operate only in the context of i2b2 or ODIE marked documents, which represent a variety of clinical reports from different institutions. Development costs can be higher for machine learning algorithms, like the BART and Stanford systems. However, in specific contexts such as this competition, high performance can be achieved with lower cost rule based algorithms. The UHD rule based algorithm could be used, theoretically, in any domain as long as concepts are annotated in one of the two styles used in this challenge.