Wonjin Yoon, Richard Jackson, Aron Lagerberg, Jaewoo Kang.
Abstract
MOTIVATION: Current studies in extractive question answering (EQA) have modeled the single-span extraction setting, where a single answer span is a label to predict for a given question-passage pair. This setting is natural for general domain EQA as the majority of the questions in the general domain can be answered with a single span. Following general domain EQA models, current biomedical EQA (BioEQA) models utilize the single-span extraction setting with post-processing steps.
Year: 2022 PMID: 35713500 PMCID: PMC9344839 DOI: 10.1093/bioinformatics/btac397
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Examples of two types of BioEQA. Factoid questions require a single phrase as an answer, while list questions require multiple phrases as their answer. Answers are underlined in the corresponding passage and annotated using the BIO tagging scheme
Fig. 2. Distribution of naturally posed questions in the general and biomedical domains
Fig. 3. Overview of the BioBERT model performing QA as sequence tagging. The question and passage of a sample form the input token sequence after tokenization. The input sequence is fed into BioBERT to produce contextualized representations. The final layer of the model is a sequence tagging layer that predicts a tag/label for each token representation
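The captions above describe answers annotated with BIO tags and a tagging layer that labels every token. A minimal sketch of how such per-token tags could be turned back into a list of answer phrases is given below. The function name `decode_bio`, the bare `B`/`I`/`O` label strings, and the example sentence are illustrative assumptions, not taken from the paper, and real model output would first need sub-word pieces merged back into words.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect every contiguous B/I run as one answer span (hypothetical helper)."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new span starts; close the previous one if it is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            # Continue the span that is currently open.
            current.append(token)
        else:
            # "O", or a stray "I" with no preceding "B": close any open span.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Made-up passage with three answer spans, as in a list question.
tokens = ["Symptoms", "include", "dry", "cough", ",", "high", "fever", "and", "fatigue"]
tags = ["O", "O", "B", "I", "O", "B", "I", "O", "B"]
answers = decode_bio(tokens, tags)
print(answers)
```

Because each `B` tag opens a new span, one pass over the passage can yield zero, one, or many answers, which is what distinguishes this setting from single-span extraction.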
Statistics of list-type questions in the original BioASQ datasets and two different versions of the pre-processed datasets
| Dataset | Config. | Train questions | Train samples | Test questions | Test samples |
|---|---|---|---|---|---|
| BioASQ 7b | Original | 556 | 5324 | 88 | 393 |
| BioASQ 7b | Single-span | 529 | 7722 | 88 | 393 |
| BioASQ 7b | Seq-Tag | 527 | 3610 | 88 | 393 |
| BioASQ 8b | Original | 644 | 5717 | 75 | 383 |
| BioASQ 8b | Single-span | 614 | 8416 | 75 | 383 |
| BioASQ 8b | Seq-Tag | 610 | 3914 | 75 | 383 |
Note: The column Sample denotes the number of data points that are composed of a question and passage pair.
Performance comparison among the models on the BioASQ 7b and 8b list question datasets
| Language model | System | BioASQ 7b Prec. | BioASQ 7b Recall | BioASQ 7b F1 | BioASQ 8b Prec. | BioASQ 8b Recall | BioASQ 8b F1 |
|---|---|---|---|---|---|---|---|
| BioBERT | | 0.5941 (0.0072) | 0.3869 (0.0069) | 0.4295 (0.0069) | 0.4476 (0.0186) | 0.3275 (0.0101) | 0.3382 (0.0115) |
| BioBERT | | 0.5911 (0.0181) | 0.3966 (0.0074) | 0.4364 (0.0093) | 0.4581 (0.0071) | 0.3335 (0.0049) | 0.3428 (0.0054) |
| BioBERT | | 0.4247 (0.0112) | 0.5772 (0.0125) | 0.4498 (0.0116) | 0.3888 (0.0105) | 0.5936 (0.0126) | 0.4355 (0.0083) |
| BlueBERT | | 0.5408 (0.0107) | 0.3668 (0.0050) | 0.4031 (0.0065) | 0.4941 (0.0077) | 0.3535 (0.0053) | 0.3656 (0.0057) |
| BlueBERT | | 0.4048 (0.0064) | 0.6171 (0.0072) | 0.4538 (0.0047) | 0.3368 (0.0089) | 0.5698 (0.0066) | 0.3917 (0.0068) |
| PubMedBERT | | 0.5709 (0.0099) | 0.3964 (0.0070) | 0.4328 (0.0067) | 0.4754 (0.0115) | 0.3502 (0.0055) | 0.3622 (0.0070) |
| PubMedBERT | | 0.4260 (0.0099) | 0.6276 (0.0102) | 0.4758 (0.0088) | 0.3775 (0.0122) | 0.5855 (0.0077) | 0.4254 (0.0085) |
Note: Reported scores were micro-averaged across the 10 testing batches. Standard deviations are shown in parentheses. F1 score is the official metric for the list questions of the BioASQ dataset. We used full abstracts as input passages for all systems.
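To make the precision, recall, and F1 columns concrete, the sketch below scores a single list question by set overlap and then averages the per-question scores. It assumes exact, case-insensitive string matching and a plain mean over questions; synonym handling in the official BioASQ evaluation and the pooling across batches mentioned in the note are not modeled. The names `list_question_scores` and `mean_scores` are hypothetical.

```python
from typing import Iterable, List, Tuple

Scores = Tuple[float, float, float]


def list_question_scores(predicted: Iterable[str], gold: Iterable[str]) -> Scores:
    """Precision, recall and F1 for one list question (exact-match assumption)."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def mean_scores(per_question: List[Scores]) -> Scores:
    """Average each metric over questions."""
    n = len(per_question)
    return (
        sum(s[0] for s in per_question) / n,
        sum(s[1] for s in per_question) / n,
        sum(s[2] for s in per_question) / n,
    )


# Two toy questions: one under-predicting system output, one over-predicting.
q1 = list_question_scores(["a", "b"], ["a", "b", "c", "d"])
q2 = list_question_scores(["x", "y", "z", "w"], ["x"])
avg_precision, avg_recall, avg_f1 = mean_scores([q1, q2])
print(avg_precision, avg_recall, avg_f1)
```

When F1 is computed per question and then averaged, the mean F1 generally differs from the harmonic mean of the mean precision and mean recall. The same pattern appears in the table above: for example, 0.5941 and 0.3869 would give a harmonic mean of about 0.469, while the reported F1 is 0.4295.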
Performance of the sequence tagging approach with different sequence tagging layers on BioASQ 8b list questions
| Tagging layer | Precision | Recall | F1 score |
|---|---|---|---|
| Linear | 0.3984 (0.0051) | 0.6016 (0.0037) | 0.4402 (0.0047) |
| BiLSTM | 0.4015 (0.0146) | 0.5787 (0.0231) | 0.4370 (0.0121) |
| BiLSTM-CRF | 0.3868 (0.0126) | 0.5925 (0.0072) | 0.4312 (0.0084) |
Note: Averages and standard deviations of five independent runs are reported in the table. Standard deviations are shown in parentheses.
Fig. 4. Frequency polygon and histograms of the number of answers (predicted and gold). The answer-count distributions of the model predictions are marked as Yoon (2019b) and SeqTag (Ours). The answer-count distribution of the gold standard (testing dataset) is marked as Answer. The bin width is 2.0 and the y-axis indicates the number of questions in each bin. The number of answers predicted by the baseline model peaks in the first bin [1, 3), whereas our model and the gold answers peak in the second bin [3, 5). Questions predicted as having 0 answers are excluded from this graph
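The binning described in the caption (width 2, first bin [1, 3), zero-answer questions dropped) can be sketched as follows. The helper name `bin_answer_counts` and the sample counts are made up for illustration and do not reproduce the data behind the figure.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def bin_answer_counts(
    counts: Iterable[int], width: int = 2, start: int = 1
) -> Dict[Tuple[int, int], int]:
    """Count questions per half-open bin [lo, lo + width), skipping counts below start."""
    histogram: Counter = Counter()
    for n in counts:
        if n < start:
            continue  # questions with 0 answers are excluded, as in the caption
        lo = start + ((n - start) // width) * width
        histogram[(lo, lo + width)] += 1
    return dict(sorted(histogram.items()))


# Hypothetical number of predicted answers for eight questions.
predicted_counts = [0, 1, 2, 2, 3, 4, 4, 7]
histogram = bin_answer_counts(predicted_counts)
print(histogram)
```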
Proportion of list-type questions that state the number of answers in the question text, out of all list questions in the testing datasets
| Dataset | Test question | Question with number | Proportion |
|---|---|---|---|
| BioASQ 7b | 88 | 23 | 26.1% |
| BioASQ 8b | 75 | 10 | 13.3% |
Performance of our model with different training data on the BioASQ 8b test set
| | Training data | List-F1 | Factoid-MRR | Utility |
|---|---|---|---|---|
| (1) | List 8b | 0.4310 (0.0056) | — | Focused on multi-answer questions (requires metadata) |
| (2) | Factoid 8b | — | 0.3759 (0.0034) | Focused on single-answer questions (requires metadata) |
| (3) | List 8b + Factoid 8b | 0.4148 (0.0081) | 0.3795 (0.0183) | |
| | Difference of (3) from (1) or (2) | -0.0162 | 0.0036 | |
Note: Our sequence tagging approach makes it possible to train a single universal model that can answer questions without knowing the metadata (question type) of the given question. Means and standard deviations over five individual runs are reported in the table.