Using the weighted keyword model to improve information retrieval for answering biomedical questions.

Abstract

Physicians ask many complex questions during the patient encounter. Information retrieval systems that can provide immediate and relevant answers to these questions can be invaluable aids to the practice of evidence-based medicine. In this study, we first automatically identify topic keywords from ad hoc clinical questions with a Conditional Random Field model that is trained over thousands of manually annotated clinical questions. We then report on a linear model that assigns query weights based on their automatically identified semantic roles: topic keywords, domain-specific terms, and their synonyms. Our evaluation shows that this weighted keyword model improves information retrieval from the Text Retrieval Conference Genomics track data.

Year: 2009 PMID: 21347188 PMCID: PMC3041568

Source DB: PubMed Journal: Summit Transl Bioinform ISSN: 2153-6430

Introduction

Clinicians and biomedical researchers often need to search a vast body of literature in order to make informed decisions [1,2]. Information retrieval and question answering systems (e.g., [3]) help clinicians and biomedical researchers access relevant information. Most existing information retrieval systems require users to enter query terms, which are then used to search for relevant documents. However, observational studies (e.g., [1,4-6]) have shown that clinicians typically have complex information needs and ask complex questions. Questions 1 and 2 are two examples from a collection of 4,653 questions posed by more than 100 primary care physicians [1,4-6] that is maintained and published by the National Library of Medicine (NLM).

Question 1: “Thirty-eight-year-old woman with bloody diarrhea, worse over the past week. I treated her with Flagyl empirically. I saw her two days later and she was lots better. No more blood, no fever. Now her report comes back and the clostridium difficile is negative but she’s growing salmonella. Should I finish the Flagyl or discontinue it?”

Question 2: “The maximum dose of estradiol valerate is 20 milligrams every 2 weeks. We use 25 milligrams every month which seems to control her hot flashes. But is that adequate for osteoporosis and cardiovascular disease prevention?”

Similarly, biomedical scientists also pose complex questions that require complex answers [7,8]. Question 3 is one such example, taken from the TREC Genomics Track evaluation data.

Question 3: “What effect does the insulin receptor gene have on tumorigenesis?”

In this paper, we first report on applying natural language processing approaches to automatically extract topic keywords from complex biomedical questions. In the above three examples, the keywords are salmonella infections for question 1, estradiol valerate and osteoporosis and cardiovascular disease prevention for question 2, and insulin receptor gene and tumorigenesis for question 3.

We then report on a weighted keyword model for query-term weight assignment. We have implemented this model in our clinical question answering system, AskHERMES. Section 2, below, reviews the background of this research. Section 3 describes the model. The evaluation methods, results and discussion are in Sections 4, 5 and 6, respectively. Section 7 briefly describes the AskHERMES system in which the weighted keyword model has been implemented. Conclusions and future work are described in Section 8.

Background

Although the literature reports various models for weighting query terms for question answering (see articles in the TREC evaluation), and it is common practice to assign weights based on the perceived importance of a query term, methods for identifying that importance are, to our knowledge, ad hoc: most models rely on simple algorithms (e.g., ranking query terms by their IDF value [9]). In contrast, we weight query terms based on automatically identified keywords and domain-specific terminology. We then developed a linear model incorporating the identified keywords to improve information retrieval.

Model

The weighted keyword model begins by automatically identifying semantically rich topic keywords, as shown in questions 1–3. Query term weights are based on the identified keywords, and the UMLS concepts and their synonyms. In this section, we first briefly describe our approaches for automatic keyword identification and then describe our weighted keyword model.

Automatic Topic Keyword Identification

We developed a probabilistic model to automatically identify topic keywords from ad hoc clinical questions. Our model is trained and tested on the NLM’s 4,653 clinical questions, which have been annotated by physicians who assigned one to three keywords to each clinical question. Using the annotated questions, we trained a supervised machine-learning system based on conditional random fields. Our ten-fold cross-validation results showed that the system achieved 67.6% precision, 50.8% recall, and a 58% F-score for automatic keyword identification. Details of the approaches are described in Yu and Cao (2009) [10].
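As a quick consistency check on the reported figures, the F-score can be recomputed from precision and recall. The sketch below assumes the balanced F-score (harmonic mean of precision and recall), which the text does not state explicitly; the function name is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (assumed definition): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for keyword identification.
value = f_score(0.676, 0.508)
print(f"F-score: {value:.1%}")  # rounds to 58.0%, in line with the reported 58%
```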

The Weighted Keyword Model

To judge whether a query term is a domain-specific biomedical term, we applied MMTx, an implementation of MetaMap [11], to map the question to concepts in the UMLS. The UMLS incorporates concept synonyms, which are used for query expansion. We used the methods described in Section 3.1 to identify the topic keywords. We group query terms into five categories:

Original Word: a non-stop single word embedded in the original question that is neither a keyword nor mapped to the UMLS.
UMLS Concept: a single-word or multi-word term embedded in the original question that can be mapped to the UMLS.
Keyword: a single-word or multi-word term embedded in the original question that is identified as a topic keyword.
Keyword Synonym: a synonymous term of a keyword.
UMLS Synonym: a synonymous term of a UMLS concept that is not a keyword.

Each query term is assigned the baseline weight of its IDF value. We calculated the IDF values from more than 17 million citations in the MEDLINE collection. Our weighted keyword model increases the baseline IDF value if the query term is identified as a keyword of the question. In addition, we experimented with increasing the weights of query terms based on the group to which they belong. Our experiments with different weighting models indicated that most have similar impacts on information retrieval. One of the models is shown below:

Original Words: the baseline IDF value
UMLS Synonym: 2*IDF
UMLS Concept: 3*IDF
Keyword Synonym: 4*IDF
Keywords: 5*IDF
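The example weighting scheme can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the IDF variant (log of collection size over document frequency), the document frequencies, the synonym "carcinogenesis", and all function names are assumptions. Rendering the weights as term^boost in Lucene's classic query syntax is likewise only one plausible way to hand them to the search engine; the text does not say how this was done.

```python
import math

# Multipliers from the example model in the text (original words keep the baseline IDF).
GROUP_MULTIPLIER = {
    "original_word": 1,
    "umls_synonym": 2,
    "umls_concept": 3,
    "keyword_synonym": 4,
    "keyword": 5,
}


def idf(doc_freq: int, n_docs: int) -> float:
    """Assumed IDF variant, log(N / df); the paper does not give its formula."""
    return math.log(n_docs / doc_freq)


def term_weight(group: str, doc_freq: int, n_docs: int) -> float:
    """Weight of a query term: group multiplier times its baseline IDF."""
    return GROUP_MULTIPLIER[group] * idf(doc_freq, n_docs)


def boosted_query(weights: dict) -> str:
    """Render term weights as term^boost pairs; multi-word terms become quoted phrases."""
    parts = []
    for term, weight in weights.items():
        text = f'"{term}"' if " " in term else term
        parts.append(f"{text}^{weight:.2f}")
    return " ".join(parts)


# Hypothetical groups and document frequencies for Question 3 (not from the paper).
N_DOCS = 17_000_000
terms = {
    "effect": ("original_word", 2_000_000),
    "insulin receptor gene": ("keyword", 5_000),
    "tumorigenesis": ("keyword", 40_000),
    "carcinogenesis": ("keyword_synonym", 120_000),
}
weights = {t: term_weight(g, df, N_DOCS) for t, (g, df) in terms.items()}
print(boosted_query(weights))
```

Because every group weight is a constant multiple of the IDF, the ordering of terms within a group is still determined by IDF alone; only the relative standing of the groups changes.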

Evaluation Methods

Currently, there is no evaluation data available for clinical information retrieval and question answering. The only available biomedical information retrieval evaluation data is the Genomics Track of the Text REtrieval Conference (TREC). TREC Genomics incorporates more than 160,000 full-text biomedical articles [7]. The 2006 and 2007 tasks focused on information retrieval for question answering [7,12]; a sample question from the tasks is “What is the role of IDE in Alzheimer’s disease?” We therefore evaluated the weighted keyword model using the TREC Genomics evaluation.

Systems

The purpose of this study is to compare different weighted keyword models for information retrieval. LUCENE is a high-performance, full-featured text search engine [13] that has been shown to be robust on biomedical texts [3]. We therefore implemented all our systems with LUCENE. The top 1,000 sentences of output from each system were used for evaluation. The following weighted keyword models were evaluated:

Original Words: In this system, only the non-stop words embedded in the original question were used as query terms. There were no weighted keywords.
Reweight: In this system, we increased the weight of keywords.
Query Expansion: In this system, we expanded the queries with the UMLS synonyms.
Query Expansion & Reweight: In this system, we included query terms from all five groups and weighted each group differently, as described in Section 3.2.

Data

There were 28 and 36 questions posed in TREC Genomics 2006 and 2007, respectively. However, two questions were excluded by the TREC Genomics organizers [7,8]; 19 questions returned no result for related questions. The purpose of our study is to evaluate the effectiveness of the weighted keyword model for information retrieval. We used the remaining 43 questions for our evaluation.

Evaluation Metrics

We used the evaluation package published by the TREC Genomics Track (a Python script, available at http://ir.ohsu.edu/genomics/) to report the document-level retrieval performance. As stated in [8], the TREC Genomics judges judged a document relevant if any text in that document was relevant to a question. A character-based mean average precision (MAP) measure is used by TREC Genomics to compare the accuracy of the extracted answers.
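For readers unfamiliar with the measure, document-level MAP can be sketched as follows. This is a minimal illustration of the standard definition (uninterpolated average precision per question, averaged over questions), not the TREC Genomics evaluation script; the function names and the toy document identifiers are assumptions.

```python
def average_precision(ranked_ids: list, relevant_ids: set) -> float:
    """Average of precision@k taken at each rank k where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Dividing by all relevant documents penalizes relevant documents never retrieved.
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs: dict, qrels: dict) -> float:
    """Mean of per-question average precision over all judged questions."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with made-up document IDs.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d9", "d7"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d7", "d8"}}
print(mean_average_precision(runs, qrels))
```

A question whose relevant documents are all missing from the ranking contributes an average precision of zero, which is one reason per-question scores can vary so widely.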

Evaluation Results

Table 1 shows the average MAPs of the four systems for document retrieval for question answering using the TREC Genomics data. The baseline system is Original Words, which achieved a MAP score of 0.042. Query expansion improved the average MAP score by 28.6%, and the reweight system improved it by 9.5%. The absolute MAP improvements and their statistical significance are shown in Table 2. The improvement of reweight was statistically significant. Query Expansion and Expansion & Reweight both had larger standard deviations, which made their performance differences statistically non-significant. Figure 1 shows the MAP scores of the four systems for a subset of the TREC Genomics questions; we report only those questions with MAP scores >0.03. As shown in Figure 1, the MAP scores ranged from close to zero to close to 0.7 in response to different questions. This variation in the MAP scores leads to the large standard deviations shown in Table 1.
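The relative improvements quoted above follow directly from the average MAP scores in Table 1. The short sketch below reproduces that arithmetic; the function name is illustrative, and the figure for Expansion & Reweight is derived here from Table 1 rather than stated in the text.

```python
def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain of a system over the baseline, as a fraction of the baseline."""
    return (system - baseline) / baseline


# Average MAP scores from Table 1.
baseline_map = 0.042
systems = {
    "Query expansion": 0.054,
    "Reweight": 0.046,
    "Expansion & Reweight": 0.053,
}
for name, score in systems.items():
    print(f"{name}: {relative_improvement(baseline_map, score):.1%}")
# Query expansion: 28.6%
# Reweight: 9.5%
# Expansion & Reweight: 26.2%
```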

Table 1:

Average MAP scores (standard deviations in parentheses) of four systems for document retrieval for question answering using the TREC Genomics data.

Original Words	Query expansion	Reweight	Expansion & Reweight
.042 (.085)	.054 (.117)	.046 (.092)	.053 (.116)

Table 2:

Improvement in MAP scores of three systems (query expansion, reweight, and expansion & reweight) over the original words system.

	Query Expansion	Reweight	Expansion & Reweight
Average MAP improvement (St. Dev.)	.012 (.051)	.004 (.009)	.011 (.054)
p-value	.119	.005	.183

Figure 1:

The mean average precision (MAP) scores of 19 TREC Genomics questions for four systems. The original words system takes all non-stop words of an ad hoc question as a bag-of-words query to return relevant documents. Reweight is built on top of the original words system; it increases the weights of terms that are identified as keywords of the question. Query expansion incorporates synonyms from the UMLS. Expansion & reweight assigns different weights to different groups of query terms as described in Section 3.2.

Figure 2:

AskHERMES system components

Discussion

Our work shows that, for most of the questions, a reweight system significantly outperforms a non-reweight system (p=0.005, Table 2). We have tried different reweight combinations and found that in all cases, increasing the weights of keywords yields significant improvements (data not shown). Our results demonstrate the effectiveness of weighted keywords for improving information retrieval. We do not compare our absolute MAP scores with those of the teams that participated in the TREC Genomics competition, as the absolute MAP scores depend upon many other factors, including data preprocessing and passage ranking.

Our results show that although query expansion improved the MAP scores for most of the questions, these improvements were not statistically significant. Our results are consistent with the reports in TREC Genomics. Query expansion was widely used in both the 2006 and 2007 TREC Genomics competitions [7,8]. Few teams have reported that query expansion statistically improves information retrieval. Teams report that the performance of query expansion varies for different topics (e.g., [14]). Reasons for this include failure to identify synonyms [15], which depends upon the correct mapping to external knowledge resources. The variations in the performance of query expansion explain our results, in which the improvement from weighted keywords diminished after query expansion.

Our topic keyword model was trained over thousands of clinical questions, and it is interesting that the model can be used directly to capture the keywords in genomics questions and to improve information retrieval in the genomics domain. The results suggest that both our keyword identification model and the weighted keyword model generalize beyond the clinical domain. On the other hand, whether the weighted keyword model can actually improve information retrieval and question answering in the clinical domain still needs to be tested.

Implementing the Weighted Keyword Model in the AskHERMES System

Our long-term goal is to develop an advanced medical question answering system to assist physicians in their clinical decision making. We have created a prototype of such a system, called AskHERMES (Help physicians to Extract and aRticulate Multimedia information for answering clinical quEstionS), which can be accessed at http://www.askhermes.org. Figure 2 shows the AskHERMES system components. We have previously shown AskHERMES to outperform several other baseline information retrieval systems for answering definitional questions [3,16]. Currently, AskHERMES attempts to answer all types of clinical questions. In this study, we have integrated the weighted keyword model into the AskHERMES system, and our preliminary observation shows that the model slightly increases AskHERMES’ performance for question answering. Figure 3 shows the answers of two models (with and without weighted keywords) to a sample clinical question. A physician (Dr. Andrew Bennett) examined the outputs of both models. He concluded that none of the text outputs directly answered the question, although the answers can be identified from the source articles. He also concluded that the weighted output is more on target than the unweighted one in both text outputs and source answers. This evaluation seems to support the view that the weighted model outperforms the unweighted one. However, a formal evaluation is required to draw any general conclusions.

Figure 3:

The outputs of two models, with and without weighted keywords in response to a sample clinical question. The keyword “head trauma” was automatically identified by AskHERMES. Each answer can be linked to its source page. “Human” indicates that the source page is a human study.

Conclusions and Future Work

Our contributions include a robust keyword identification system that is trained on thousands of ad hoc clinical questions and a linear model for incorporating the identified keywords as a way to improve information retrieval. Our evaluation results with the TREC Genomics data show an improvement in information retrieval with the weighted keyword model. We also demonstrate that the weighted keyword model can be easily integrated into a clinical question answering system. Evaluating the effectiveness of the weighted keyword model for improving clinical question answering remains future work. The key is to create evaluation data, which is an important but challenging long-term task. In addition, we hope to explore our weighted keyword models in open-domain information retrieval and question answering.

9 in total

1. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.

Authors: A R Aronson
Journal: Proc AMIA Symp Date: 2001

2. Answering physicians' clinical questions: obstacles and potential solutions.

Authors: John W Ely; Jerome A Osheroff; M Lee Chambliss; Mark H Ebell; Marcy E Rosenbaum
Journal: J Am Med Inform Assoc Date: 2004-11-23 Impact factor: 4.497

3. Accessing bioscience images from abstract sentences.

Authors: Hong Yu; Minsuk Lee
Journal: Bioinformatics Date: 2006-07-15 Impact factor: 6.937

4. A cognitive evaluation of four online search engines for answering definitional questions posed by physicians.

Authors: Hong Yu; David Kaufman
Journal: Pac Symp Biocomput Date: 2007

5. Development, implementation, and a cognitive evaluation of a definitional question answering system for physicians.

Authors: Hong Yu; Minsuk Lee; David Kaufman; John Ely; Jerome A Osheroff; George Hripcsak; James Cimino
Journal: J Biomed Inform Date: 2007-03-12 Impact factor: 6.317

6. Automatically extracting information needs from Ad Hoc clinical questions.

Authors: Hong Yu; Yong-Gang Cao
Journal: AMIA Annu Symp Proc Date: 2008-11-06

7. Lifelong self-directed learning using a computer database of clinical questions.

Authors: J W Ely; J A Osheroff; K J Ferguson; M L Chambliss; D C Vinson; J L Moore
Journal: J Fam Pract Date: 1997-11 Impact factor: 0.493

8. Analysis of questions asked by family doctors regarding patient care.

Authors: J W Ely; J A Osheroff; M H Ebell; G R Bergus; B T Levy; M L Chambliss; E R Evans
Journal: BMJ Date: 1999-08-07

9. An evaluation of information-seeking behaviors of general pediatricians.

Authors: Donna M D'Alessandro; Clarence D Kreiter; Michael W Peterson
Journal: Pediatrics Date: 2004-01 Impact factor: 7.124
