George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artiéres, Axel-Cyrille Ngonga Ngomo, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, Georgios Paliouras.
Abstract
BACKGROUND: This article provides an overview of the first BIOASQ challenge, a competition on large-scale biomedical semantic indexing and question answering (QA), which took place between March and September 2013. BIOASQ assesses the ability of systems to semantically index very large numbers of biomedical scientific articles, and to return concise and user-understandable answers to given natural language questions by combining information from biomedical articles and ontologies.
Year: 2015 PMID: 25925131 PMCID: PMC4450488 DOI: 10.1186/s12859-015-0564-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of semantic indexing and question answering in the biomedical domain. The BIOASQ challenge focuses on pushing systems towards implementing pipelines that can realize the workflow shown in the figure. Starting with a variety of data sources (lower right corner of the figure), semantic indexing and integration bring the data into a form that can be used to respond effectively to domain-specific questions. A semantic QA system associates ontology concepts with each question and uses the semantic index of the data to retrieve the relevant pieces of information. The retrieved information is then turned into a concise, user-understandable form, which may be, for example, a ranked list of candidate answers (e.g., in factoid questions, like “What are the physiological manifestations of disorder Y?”) or a collection of text snippets, ideally forming a coherent summary (e.g., in “What is known about the metabolism of drug Z?”). The figure also illustrates how these steps are mapped to the BIOASQ challenge tasks: Task 1a is shown in blue and Task 1b in red.
Basic statistics about the training data for Task1a
| | |
|---|---|
| | 10,876,004 |
| | 26,563 |
| | 12.55 |
| | 22 |
Number of articles for each test dataset in each batch
| | | | |
|---|---|---|---|
| 1 | 1,942 (1,553) | 4,869 (3,414) | 7,578 (2,616) |
| 2 | 830 (726) | 5,551 (3,802) | 10,139 (3,918) |
| 3 | 790 (761) | 7,144 (3,983) | 8,722 (2,969) |
| 4 | 2,233 (586) | 4,623 (2,360) | 1,976 (1,318) |
| 5 | 6,562 (5,165) | 8,233 (3,310) | 1,744 (1,209) |
| 6 | 4,414 (3,530) | 8,381 (3,156) | 1,357 (696) |
| Total | 16,763 (12,321) | 38,801 (20,025) | 31,570 (12,726) |
The numbers in parentheses are the articles that had been annotated by the curators by the time of the Task 1a evaluation (September 2013).
Figure 2. Interesting cases when evaluating hierarchical classifiers: (a) over-specialization, (b) under-specialization, (c) alternative problems, (d) pairing problem, (e) long distance problem. Nodes surrounded by circles are the true classes, while nodes surrounded by rectangles are the predicted classes. LCaF is based on the notion of adding all ancestors of the predicted (rectangles) and true (circles) classes. However, adding all the ancestors has the undesirable effect of over-penalizing errors that happen to nodes with many ancestors. Thus, LCaF uses the notion of the Lowest Common Ancestor to limit the addition of ancestors.
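The idea of adding all ancestors before comparing label sets can be sketched as follows. This is not LCaF itself: it is only the plain ancestor-augmented precision, recall and F1 that the caption describes as the starting point, and it omits the Lowest Common Ancestor restriction. The toy hierarchy, the `PARENTS` map and all function names are illustrative assumptions.

```python
def ancestors(node, parents):
    """Return every ancestor of `node`, given a child -> list-of-parents map."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def augment(labels, parents):
    """Extend a label set with all ancestors of each label."""
    extended = set(labels)
    for label in labels:
        extended |= ancestors(label, parents)
    return extended


def hierarchical_prf(true_labels, predicted_labels, parents):
    """Set-based precision, recall and F1 over ancestor-augmented label sets."""
    true_aug = augment(true_labels, parents)
    pred_aug = augment(predicted_labels, parents)
    overlap = len(true_aug & pred_aug)
    precision = overlap / len(pred_aug) if pred_aug else 0.0
    recall = overlap / len(true_aug) if true_aug else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy hierarchy: root -> a -> b -> c, and root -> d.
PARENTS = {"a": ["root"], "b": ["a"], "c": ["b"], "d": ["root"]}

# Over-specialization: the true class is "b", the predicted class is its child "c".
print(hierarchical_prf({"b"}, {"c"}, PARENTS))

# Under-specialization: the true class is "c", the predicted class is its parent "b".
print(hierarchical_prf({"c"}, {"b"}, PARENTS))
```

Because every ancestor enters both sets, a deep node contributes many shared or missing elements, which is the over-penalization effect the caption mentions.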
Types of questions in Task 1b and respective examples along with the golden answers in each case
| Question type | Required answers | Example question | Golden exact answer | Golden ideal answer |
|---|---|---|---|---|
| Yes/No | Exact + Ideal | Is miR-21 related to carcinogenesis? | Yes | Yes. It has been demonstrated in several experimental studies that miR-21 has oncogenic potential, and is significantly disregulated in numerous types of cancer. Therefore, miR-21 is closely related to carcinogenesis. |
| Factoid | Exact + Ideal | Which is the most common disease attributed to malfunction or absence of primary cilia? | “autosomal recessive polycystic kidney disease” | When ciliary function is perturbed, photoreceptors may die, kidney tubules develop cysts, limb digits multiply and brains form improperly. Malformation of primary cilia in the collecting ducts of kidney tubules is accompanied by development of autosomal recessive polycystic kidney disease. |
| List | Exact + Ideal | Which human genes are more commonly related to craniosynostosis? | [“MSX2”, “RECQL4”, “SOX6”, “FGFR1”, “FGFR2”, “FGFR”] | The genes that are most commonly linked to craniosynostoses are the members of the Fibroblast Growth Factor Receptor family FGFR3 and to a lesser extent FGFR1 and FGFR2. Some variants of the disease have been associated with the triplication of the MSX2 gene and mutations in NELL-1. |
| Summary | Ideal | What is the mechanism of action of abiraterone? | - | Abiraterone acts by inhibiting cytochrome P450 17α-hydroxylase (CYP17A1), a critical step in androgen biosynthesis, thus leading to inhibition of androgen biosynthesis. |
Statistics of the training and test data for Task 1b
| | | | | |
|---|---|---|---|---|
| | 29 | 100 | 100 | 82 |
| | 8 | 25 | 26 | 26 |
| | 5 | 18 | 20 | 16 |
| | 8 | 31 | 31 | 23 |
| | 8 | 26 | 23 | 17 |
| | 4.8 | 5.3 | 6.0 | 12.9 |
| | 10.3 | 11.4 | 12.1 | 5.4 |
| | 14.0 | 17.1 | 17.4 | 15.9 |
| | 3.6 | 21.8 | 5.5 | 4.5 |
Figure 3. An illustration of article-offset pairs. Article 1 has n characters and a golden snippet starting at offset 3 and ending at offset 10.
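One way to make the article-offset representation concrete is to treat each snippet as the set of (article, character offset) pairs it covers and to compare a returned snippet with a golden one by their overlap. The sketch below is only an illustration under assumptions: both offsets are taken to be inclusive, and the `Snippet` class, the function names and the overlap-based precision and recall are hypothetical choices, not the challenge's official scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    """A text span identified by an article and two character offsets."""
    article_id: str
    start: int  # offset of the first character (assumed inclusive)
    end: int    # offset of the last character (assumed inclusive)

    def positions(self):
        """Return the set of (article, offset) pairs covered by the snippet."""
        return {(self.article_id, i) for i in range(self.start, self.end + 1)}


def covered_positions(snippets):
    """Union of the positions covered by a collection of snippets."""
    covered = set()
    for snippet in snippets:
        covered |= snippet.positions()
    return covered


def snippet_precision_recall(golden, returned):
    """Character-overlap precision and recall of returned vs. golden snippets."""
    gold = covered_positions(golden)
    found = covered_positions(returned)
    overlap = len(gold & found)
    precision = overlap / len(found) if found else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall


# The golden snippet of the figure (offsets 3 to 10) and a hypothetical system snippet.
golden = [Snippet("article1", 3, 10)]
returned = [Snippet("article1", 7, 14)]
print(snippet_precision_recall(golden, returned))
```

Working on sets of positions means that overlapping snippets from the same article are not counted twice.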
Correspondence of reference and submitted systems for Task1a
| Reference | Systems |
|---|---|
| [ | system1, system2, system3, system4, system5 |
| [ | cole_hce1, cole_hce2, utai_rebayct, utai_rebayct_2 |
| [ | mc1, mc2, mc3, mc4, mc5 |
| [ | Wishart-* |
| [ | RMAI, RMAIP, RMAIR, RMAIN, RMAIA |
| Baselines [ | MTIFL, MTI, bioasq_baseline |
Average ranks for each system across the batches of Task 1a for the measures MiF and LCaF
| System | Batch 1 MiF | Batch 1 LCaF | Batch 2 MiF | Batch 2 LCaF | Batch 3 MiF | Batch 3 LCaF |
|---|---|---|---|---|---|---|
| MTIFL | | | 2.75 | 2.75 | 4.0 | 4.0 |
| system3 | 2.75 | 2.75 | | | 2.0 | 2.0 |
| system2 | - | - | 1.75 | 2.0 | 3.0 | 3.0 |
| system1 | - | - | - | - | | |
| MTI | - | - | - | - | 3.25 | 3.0 |
| RMAIP | 2.50 | | 5.0 | 4.5 | 5.25 | 5.5 |
| RMAI | 3.25 | 3.0 | 5.0 | 4.5 | 8.5 | 7.25 |
| RMAIR | 6.25 | 6.0 | 4.5 | 3.25 | 6.25 | 6.25 |
| RMAIA | 5.75 | 5.5 | 4.0 | 5.25 | 7.25 | 5.75 |
| RMAIN | 4.50 | 3.25 | 6.0 | 5.0 | 6.5 | 6.25 |
| Wishart-S3-NP | 8.75 | 9.0 | 14.25 | 15.0 | - | - |
| Wishart-S1-KNN | 8.75 | 9.25 | 12.25 | 12.5 | - | - |
| Wishart-S5-Ensemble | 9.5 | 8.0 | 9.50 | 10.25 | - | - |
| mc4 | 14.75 | 14.25 | 21.0 | 21.0 | 21.5 | 21.25 |
| mc3 | 11.0 | 11.25 | 19.75 | 19.75 | 22.0 | 21.5 |
| mc5 | 11.25 | 10.0 | 15.0 | 14.75 | 17.0 | 17.0 |
| cole_hce2 | 9.25 | 9.5 | 11.25 | 9.25 | 12.75 | 12.0 |
| bioasq_baseline | 14.0 | 14.0 | 17.75 | 16.75 | 20.75 | |
| cole_hce1 | 13.5 | 13.5 | 14.75 | 14.0 | 16.0 | 14.75 |
| mc1 | 8.75 | 8.25 | 13.75 | 13.25 | 13.0 | 13.5 |
| mc2 | 11.25 | 11.5 | 17.75 | 18.25 | 14.25 | 15.75 |
| utai_rebayct | 15.5 | 16.0 | 16.75 | 17.5 | 19.25 | 21.5 |
| Wishart-S2-IR | 9.75 | 10.75 | 8.5 | 9.25 | - | - |
| Wishart-S5-Ngram | - | - | 10.5 | 9.75 | - | - |
| utai_rebayct_2 | - | - | - | - | 18.25 | 18.5 |
| TCAM-S1 | - | - | - | - | 11.25 | 12.25 |
| TCAM-S2 | - | - | - | - | 12.25 | 12.25 |
| TCAM-S3 | - | - | - | - | 12.5 | 12.5 |
| TCAM-S4 | - | - | - | - | 12.0 | 12.75 |
| TCAM-S5 | - | - | - | - | 12.75 | 12.0 |
| FU_System | - | - | - | - | 24.0 | 23.25 |
A hyphen (-) is used whenever the system participated fewer than 4 times in the batch. The 4 best runs in each batch for each system were considered for its ranking.
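The rule in the note above can be illustrated with a small sketch. It assumes that "best" means the numerically lowest rank and that the reported value is the plain mean of the 4 best per-test-set ranks; the function name and the sample ranks are hypothetical.

```python
def average_rank(ranks, runs_required=4):
    """Average the best (lowest) `runs_required` ranks of a system in one batch.

    Returns None when the system has fewer runs than required, which
    corresponds to the hyphen shown in the table.
    """
    if len(ranks) < runs_required:
        return None
    best = sorted(ranks)[:runs_required]
    return sum(best) / runs_required


# Hypothetical ranks of one system over the test sets of a batch.
print(average_rank([1, 3, 2, 5, 4, 6]))

# A system with only two runs in the batch gets no average rank.
print(average_rank([2, 3]))
```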
Average ranks for each system for each batch of phase A of Task 1b
| System | Batch 1 | Batch 2 | Batch 3 |
|---|---|---|---|
| Top 100 Baseline | 1.0 | | |
| Top 50 Baseline | 2.5 | 2.375 | 1.75 |
| MCTeamMM | 3.625 | 4.5 | 3.5 |
| MCTeamMM10 | 3.625 | 4.5 | 3.5 |
| Wishart-S1 | 4.25 | 3.875 | - |
| Wishart-S2 | - | 4.125 | - |
The MAP measure was used to rank the systems. A hyphen (symbol -) is used whenever the system did not participate in the corresponding batch.
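For readers unfamiliar with MAP, the sketch below uses the common textbook definition of mean average precision: for each question, precision is averaged over the ranks at which relevant items appear and normalized by the number of relevant items, and the result is then averaged over questions. Whether the challenge used exactly this normalization is not stated here, so treat it as an assumption; the function names and document identifiers are illustrative.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant items.

    Assumes `ranked` contains no duplicate items.
    """
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # precision at rank k
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of the average precision over (ranked, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Two hypothetical questions with ranked results and golden items.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
    (["d5", "d6"], {"d6"}),
]
print(mean_average_precision(runs))
```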
Results for batch 1 for concepts in phase A of Task1b
| | | | | | |
|---|---|---|---|---|---|
| Top 100 Baseline | 0.080 | 0.858 | 0.123 | 0.472 | 0.275 |
| Top 50 Baseline | 0.121 | 0.759 | 0.172 | 0.458 | 0.203 |
| Wishart-S1 | 0.464 | 0.429 | 0.366 | 0.342 | 0.063 |
| MCTeamMM | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| MCTeamMM10 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
Average ranks for each system and each batch of phase B of Task 1b, for the “exact” answers
| System | Batch 1 | Batch 2 | Batch 3 |
|---|---|---|---|
| Wishart-S1 | | | - |
| Wishart-S2 | | - | - |
| Wishart-S3 | | - | - |
| Baseline1 | 4.66 | | |
| Baseline2 | 4.33 | 4.0 | 2.66 |
| main system | 6.0 | 4.33 | 3.0 |
| system 2 | - | 5.33 | 3.33 |
| system 3 | - | 5.5 | 3.66 |
| system 4 | - | 5.5 | - |
The final rank is computed from the individual ranks of the systems for the different types of questions. A hyphen (-) is used whenever the system did not participate in the corresponding batch.
Average scores for each system and each batch of phase B of Task 1b for the “ideal” answers
| System | Batch 1 | Batch 2 | Batch 3 |
|---|---|---|---|
| Wishart-S1 | | | - |
| Wishart-S2 | | - | - |
| Wishart-S3 | | - | - |
| Baseline1 | 2.86 | 3.02 | |
| Baseline2 | 2.73 | 2.87 | 3.17 |
| main system | 3.35 | 3.39 | 3.13 |
| system 2 | - | 3.34 | 3.07 |
| system 3 | - | 3.34 | 2.98 |
| system 4 | - | 3.34 | - |
The final score is calculated as the average of the individual scores of the systems for the different evaluation criteria. A hyphen (-) is used whenever the system did not participate in the corresponding batch. The scores are given by experts who read and evaluated the “ideal” answers, and they range from 1 to 5, with 5 being the best score.
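As a minimal sketch of the averaging described above, the function below takes one expert score per criterion (the four criteria listed for the manual evaluation of “ideal” answers) and returns their mean. How scores are further aggregated over questions and experts is not specified here, so this covers a single answer only; the function name and the sample scores are hypothetical.

```python
CRITERIA = (
    "information recall",
    "information precision",
    "information repetition",
    "readability",
)


def ideal_answer_score(scores):
    """Average the 1-5 expert scores of one answer over the four criteria."""
    missing = [name for name in CRITERIA if name not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"score for {name!r} must be between 1 and 5")
    return sum(scores[name] for name in CRITERIA) / len(CRITERIA)


# Hypothetical expert scores for a single "ideal" answer.
example = {
    "information recall": 4,
    "information precision": 3,
    "information repetition": 5,
    "readability": 4,
}
print(ideal_answer_score(example))
```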
The “ideal” answers returned from the system Wishart-S1 along with the golden one
| Wishart-S1 answer | Golden answer |
|---|---|
| Benzodiazepine (BZD) overdose (OD) continues to cause significant morbidity and mortality in the UK. Flumazenil is an effective antidote but there is a risk of seizures, particularly in those who have co-ingested tricyclic antidepressants. (PMID: 21785147) Flumazenil is a benzodiazepine antagonist. It is widely used as an antidote in comatose patients suspected of having ingested a benzodiazepine overdose. (PMID: 19500521) | Flumazenil should be used in all patients presenting with suspected benzodiazepine overdose. Flumazenil is a potent benzodiazepine receptor antagonist that competitively blocks the central effects of benzodiazepines and reverses behavioral, neurologic, and electrophysiologic effects of benzodiazepine overdose. Clinical efficacy and safety of flumazenil in treatment of benzodiazepine overdose has been confirmed in a number of rigorous clinical trials. In addition, flumazenil is also useful to reverse benzodiazepine induced sedation and to diagnose benzodiazepine overdose. |
Figure 4. Screenshot of the annotation tool’s search and data selection screen with the section for document results expanded. The search interface accepts a number of keywords that are sent in parallel to each of the GOPUBMED services. Upon retrieval of the last response, results are combined and returned to the frontend. The client creates one request for each of the result domains (concepts, documents, statements). Whenever results are retrieved for a domain, the respective section of the GUI is updated immediately. Each search result displays the title of the result.
Figure 5. Screenshot of the answer formulation and annotation with document snippets. The process of formulating the answer to the selected question and its annotation with document snippets by the domain expert is shown. The user can either dismiss items that were selected in the previous step, or add snippets (i.e., document fragments) as annotations to the answer.
Technologies used in Task 1a from the participating systems along with the feature representation of the documents
| Reference | Approach | Technologies | Representation |
|---|---|---|---|
| [ | flat | SVMs, MetaLabeler [ | unigrams, bigrams |
| [ | hierarchical | SVMs, Bayes networks | unigrams, bigrams |
| [ | flat | MetaMap [, information retrieval, search engines | unigrams |
| [ | flat | k-NN, SVMs | unigrams, bigrams, trigrams |
| [ | flat | k-NN, learning-to-rank | unigrams |
Unigrams, bigrams and trigrams refer to the word level.
Evaluation measures for Phase A of Task 1b
| | | |
|---|---|---|
| concepts | mean precision, recall, | |
| articles | mean precision, recall, | |
| snippets | mean precision, recall, | |
| triples | mean precision, recall, | |
Evaluation measures for the “exact” answers in Phase B of Task 1b
| | | |
|---|---|---|
| yes/no | yes or no | |
| factoid | up to 5 entity names | strict and lenient accuracy, |
| list | a list of entity names | |
Criteria for the manual evaluation of the “ideal” answers in Phase B of Task 1b
| Criterion | Explanation | Score |
|---|---|---|
| information recall | All the necessary information is reported. | 1–5 |
| information precision | No irrelevant information is reported. | 1–5 |
| information repetition | The answer does not repeat the same information multiple times. | 1–5 |
| readability | The answer is easily readable and fluent. | 1–5 |
Evaluation measures for the “ideal” answers in Phase B of Task 1b
| | | |
|---|---|---|
| any | paragraph-sized text | |
The candidate resources that were examined for inclusion in the BIOASQ challenge by type
| Type | Resources |
|---|---|
| Drugs | |
| Targets | |
| Diseases | |
| General Purpose | |
| Document Sources | |
| Linked Data | |
The final selected resources are highlighted.