Asma Ben Abacha, Dina Demner-Fushman.
Abstract
BACKGROUND: One of the challenges in large-scale information retrieval (IR) is developing fine-grained and domain-specific methods to answer natural language questions. Despite the availability of numerous sources and datasets for answer retrieval, Question Answering (QA) remains a challenging problem due to the difficulty of the question understanding and answer extraction tasks. One of the promising tracks investigated in QA is mapping new questions to formerly answered questions that are "similar".
Keywords: Consumer Health Questions; Deep Learning; Information Retrieval; Machine Learning; Medical Question-Answer Dataset; Question Answering; Recognizing Question Entailment
Year: 2019 PMID: 31640539 PMCID: PMC6805558 DOI: 10.1186/s12859-019-3119-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Neural Network Architecture
Description of training and test datasets
| Datasets | Type/Domain | # pairs | Positive Examples (Entailment/Similarity) |
|---|---|---|---|
| SNLI (2015) | Inference pairs of open-domain sentences. | 550,152 (train) | PS: A child in a light and dark green ensemble sits in a chair in front of a typewriter looking off-camera. HS: A child sitting in front of a desk. |
| MultiNLI (2017) | Inference pairs of open-domain sentences. | 392,702 (train) | PS: On the island of the Giudecca, you’ll find another of the great Palladio-designed churches (one of two in Venice), the Redentore. HS: There are two church in Venice that were designed by Palladio. |
| SemEval-cQA (2016) | Similar questions from the Qatar Living forum. | 3169 (train) | PQ: Books. Where can i donate books? HQ: english books. Where to buy english books? Is there a public library in doha? thanks |
| Clinical-QE (2016) | Entailment pairs of questions asked by doctors. | 8588 | PQ: Patient is reluctant to take medications so I have been treating with smaller doses than I would with some other patients. How do I control her hypertension and still get her cooperation? HQ: Patient reluctant to take medication. How to control hypertension and still get her cooperation? |
| Quora (2017) | Open-domain question similarity pairs. | 404,279 | PQ: I’ve been working out in the gym for the last three months but I’m not successful in gaining weight. Should I go for a mass gainer? Is it safe? HQ: I have been working out from few months but I am unable to gain mass/weight.Which mass gainer should I take? |
| New Test Data (CHQs) | Entailment pairs of consumer health questions. | 850 | PQ: IHSS heart condition and WPW heart condition. Is there any way you could send me information on both these heart conditions? My son has to get tested for them eventually and I would just like information to understand the conditions of both of them more. HQ: What is Wolff-Parkinson-White syndrome ? |
Accuracy (%) of RQE methods using the respective training and test sets of four datasets: SNLI, MultiNLI, Quora, and Clinical-QE
| Methods | SNLI (textual) | MultiNLI (textual) | Quora (questions) | Clinical-QE (questions) |
|---|---|---|---|---|
| Neural Network (NN) | 79.50 | 73.71 | 81.34 | 71.45 |
| NN + GloVe embeddings | | | | 93.12 |
| Logistic Regression + Features | 75.91 | 67.88 | 67.79 | |
The best scores are in bold.
Accuracy (%) of RQE methods when trained using the training sets of SNLI, MultiNLI, Quora and Clinical-QE, and tested on our test set of 850 consumer health questions
| Methods | SNLI (training) | MultiNLI (training) | Quora (training) | Clinical-QE (training) |
|---|---|---|---|---|
| Neural Network (NN) | 48.94 | 54.59 | 52.35 | 48.71 |
| NN + GloVe embeddings | 49.41 | 54.82 | 52.82 | 57.18 |
| Logistic Regression + Features | 67.05 | 64.94 | 52.11 | |
The best scores are in bold.
Results (%) of the hybrid method (Logistic Regression + IR) on community QA datasets (SemEval-cQA-Test 2016 and SemEval-cQA-Test 2017)
| Systems | Test Sets | Acc | P | R | F1 | MAP | MRR |
|---|---|---|---|---|---|---|---|
| Hybrid Method (Logistic Regression + IR) | cQA-16-Test | | | | | | |
| | cQA-16-Test | | | | | | |
| | cQA-16-Test | | | | | | |
| Hybrid Method (Logistic Regression + IR) | cQA-17-Test | | | | | | |
| | cQA-17-Test | | | | | | |
| | cQA-17-Test | | | | | | |
The best scores are in bold.
Fig. 2 Examples of QA pairs generated from an article about Langerhans Cell Histiocytosis (NCI)
Fig. 3 Examples of QA pairs generated from an article about Acromegaly (A.D.A.M encyclopedia)
Fig. 4 Overview of the RQE-based Question Answering System
Fig. 5 Evaluation Interface: the reference answers used by LiveQA assessors were provided to help judge the retrieved answers
Inter-Annotator Agreement (IAA) over all ratings in the manual evaluation of the retrieved answers
| Assessors | IAA P (%) | IAA F1 (%) | Partial IAA P (%) | Partial IAA F1 (%) |
|---|---|---|---|---|
| A vs. B | 80.80 | 89.38 | 90.13 | 94.81 |
| A vs. C | 77.92 | 87.59 | 88.42 | 93.85 |
| Average | 79.36 | 88.48 | 89.27 | 94.33 |
Partial IAA is computed over two ratings: “Correct” and “Incorrect”.
LiveQA Measures: Average Score (main score), Success@i+ and Precision@i+ on LiveQA’17 Test Data
| Measures | IR-based System | IR+RQE System | LiveQA’17 Best Results | LiveQA’17 Median Results |
|---|---|---|---|---|
| avgScore(0-3) | 0.711 | | 0.637 | 0.431 |
| succ@2+ | 0.442 | | 0.392 | 0.245 |
| succ@3+ | 0.192 | 0.25 | | 0.142 |
| succ@4+ | 0.077 | | 0.098 | 0.059 |
| prec@2+ | 0.46 | | 0.404 | 0.331 |
| prec@3+ | 0.2 | 0.257 | | 0.178 |
| prec@4+ | 0.08 | | 0.101 | 0.077 |
Evaluation of the first retrieved answer for each question. N.B. Evaluating the RQE system alone is not relevant, as explained previously (“RQE-based QA Approach” section). The best scores are in bold.
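The LiveQA measures in the table above can be illustrated with a minimal sketch. It assumes the usual TREC LiveQA conventions, which this record does not spell out: each first answer is judged on a 1-4 scale, avgScore(0-3) maps those grades to 0-3 and averages over all questions (unanswered questions count as 0), succ@i+ divides the number of answers graded i or above by the total number of questions, and prec@i+ divides it by the number of questions the system answered. The function name `liveqa_measures` and the input layout are hypothetical.

```python
from typing import Dict, Optional, Sequence


def liveqa_measures(grades: Sequence[Optional[int]]) -> Dict[str, float]:
    """Compute LiveQA-style measures from per-question grades.

    grades[i] is the 1-4 judgment of the first answer returned for
    question i, or None when the system returned no answer.
    (Assumed conventions; see the lead-in paragraph.)
    """
    n_questions = len(grades)
    if n_questions == 0:
        raise ValueError("at least one question is required")

    answered = [g for g in grades if g is not None]

    # Map 1-4 grades to 0-3 scores; unanswered questions contribute 0.
    results = {"avgScore(0-3)": sum(g - 1 for g in answered) / n_questions}

    for i in (2, 3, 4):
        hits = sum(1 for g in answered if g >= i)
        # succ@i+ is normalized by all questions,
        # prec@i+ only by the questions that were answered.
        results[f"succ@{i}+"] = hits / n_questions
        results[f"prec@{i}+"] = hits / len(answered) if answered else 0.0

    return results


if __name__ == "__main__":
    # Toy grades for five questions, one of them unanswered.
    for name, value in liveqa_measures([4, 2, 1, None, 3]).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, prec@i+ is never lower than succ@i+ for the same system, which matches the pattern in the table.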
Common Measures: MAP and MRR on LiveQA’17 Test Questions
| Measures | IR-based System | IR+RQE System |
|---|---|---|
| Fully answered questions | 29% | 27% |
| Correctly answered questions | 51% | 54% |
| MAP@10 | 0.282 | 0.311 |
| MRR@10 | 0.281 | 0.333 |
Evaluation of top 10 answers for each question
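For readers unfamiliar with the two ranking measures, the sketch below shows one common way to compute MAP@10 and MRR@10 from binary relevance judgments of ranked answers. It is an illustration only: the helper names are hypothetical, and average precision is normalized here by the number of relevant answers found in the top k, which is an assumption and may differ from the normalization used in the evaluation reported above.

```python
from typing import Callable, List, Sequence


def reciprocal_rank(rels: Sequence[bool], k: int = 10) -> float:
    """Return 1/rank of the first relevant answer in the top k, else 0."""
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(rels: Sequence[bool], k: int = 10) -> float:
    """Average of precision@rank over the relevant answers in the top k.

    Assumption: normalized by the number of relevant answers retrieved
    in the top k, not by the total number of relevant answers.
    """
    hits = 0
    precisions: List[float] = []
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_over_questions(
    metric: Callable[[Sequence[bool], int], float],
    runs: Sequence[Sequence[bool]],
    k: int = 10,
) -> float:
    """Average a per-question metric over all questions."""
    if not runs:
        raise ValueError("at least one question is required")
    return sum(metric(rels, k) for rels in runs) / len(runs)


if __name__ == "__main__":
    # One list per question: True marks a relevant answer at that rank.
    runs = [
        [False, True, False, True],
        [True, False],
        [False, False],
    ]
    print("MAP@10:", mean_over_questions(average_precision, runs))
    print("MRR@10:", mean_over_questions(reciprocal_rank, runs))
```

MRR rewards placing the first relevant answer early, while MAP also accounts for where the remaining relevant answers appear in the ranking.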