Julien Gobeill1, Arnaud Gaudinat2, Emilie Pasche3, Dina Vishnyakova4, Pascale Gaudet5, Amos Bairoch5, Patrick Ruch6.
Abstract
Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this context, question-answering (QA) engines are designed to display answers that are automatically extracted from the retrieved documents. Standard QA engines in the literature process a user question, then retrieve relevant documents, and finally extract possible answers from these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step and exploits curated biological data to infer answers that are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly, usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers, with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/
Year: 2015 PMID: 26384372 PMCID: PMC4572360 DOI: 10.1093/database/bav081
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
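The GOCat approach described in the abstract proposes GO concepts that curators assigned to similar, already-annotated abstracts. A minimal sketch of this idea, assuming a k-nearest-neighbour vote over a curated corpus with a simple Jaccard term-set similarity standing in for the classifier's real similarity measure (function and variable names are illustrative, not from the authors' code):

```python
# Illustrative sketch (not the GOCat implementation): propose GO concepts for a
# new abstract by finding similar curated abstracts and aggregating the GO
# identifiers their curators assigned, weighted by similarity.
from collections import defaultdict

def propose_go_concepts(query_terms, curated_corpus, k=5):
    """curated_corpus: list of (set_of_terms, set_of_go_ids) pairs,
    standing in for GOA-annotated abstracts."""
    def jaccard(a, b):
        # Simple set overlap as a stand-in similarity measure.
        return len(a & b) / len(a | b) if a | b else 0.0

    # Keep the k most similar curated abstracts.
    neighbours = sorted(curated_corpus,
                        key=lambda doc: jaccard(query_terms, doc[0]),
                        reverse=True)[:k]

    # Each neighbour votes for its curated GO concepts; closer neighbours
    # contribute more, so answers never seen in the query text can still rank.
    scores = defaultdict(float)
    for terms, go_ids in neighbours:
        sim = jaccard(query_terms, terms)
        for go_id in go_ids:
            scores[go_id] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The key property, as the abstract notes, is that proposed answers come from curated data rather than from strings explicitly present in the retrieved documents.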
Figure 1.Evolution of the number of documents dealing with ‘QA’ in MEDLINE, compared with ‘Big Data’.
Figure 2.Deep QA. In standard QA, answers are extracted from some retrieved documents. In Deep QA, curated data are exploited to build a supervised classification model, which is then used to generate answers.
Figure 3.Overall workflow of the EAGLi platform. The input is a question formulated in natural language, the output is a set of candidate answers extracted from a set of retrieved MEDLINE abstracts.
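The workflow in Figure 3 can be summarised as three stages: retrieve abstracts for the question, extract candidate answers from each, and rank candidates by aggregated score. A hedged sketch under that reading (the stage functions are assumed placeholders, not the EAGLi API):

```python
# Illustrative pipeline sketch (assumed structure, not the EAGLi code):
# question -> retrieved abstracts -> per-abstract candidate answers -> ranking.
from collections import Counter

def answer_question(question, retrieve, extract_answers, top_n=5):
    """retrieve: question -> list of abstracts (e.g. 100 MEDLINE abstracts);
    extract_answers: abstract -> list of candidate GO concepts."""
    abstracts = retrieve(question)
    votes = Counter()
    for abstract in abstracts:
        # Redundancy across abstracts raises a candidate's score.
        votes.update(extract_answers(abstract))
    return [concept for concept, _ in votes.most_common(top_n)]
```

In this view, the study's comparison amounts to swapping the `extract_answers` stage between the dictionary-based classifiers and GOCat while keeping retrieval fixed.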
Performance of different combinations of Information Retrieval (IR) component and GO classifier on the micro-reading and macro-reading tasks, in terms of Top Precision (P0) and Recall at rank r
| Task | Benchmark | IR component | GO classifier | P0 | R at rank r |
|---|---|---|---|---|---|
| Micro-reading | GOA benchmark | N/A | EAGL | 0.23 | 0.17 |
| | | | GOCat | 0.48* (+109%) | 0.37* (+117%) |
| Macro-reading | CTD benchmark | PubMed | EAGL | 0.34 | 0.15 |
| | | | GOCat | 0.69* (+103%) | 0.33* (+120%) |
| | | | GoPubMed | 0.39 | 0.16 |
| | | Vectorial | EAGL | 0.33 | 0.14 |
| | | | GOCat | 0.66* (+100%) | 0.33* (+135%) |
| | UniProt benchmark | PubMed | EAGL | 0.33 | 0.45 |
| | | | GOCat | 0.58* (+76%) | 0.73* (+62%) |
| | | | GoPubMed | 0.22 | 0.21 |
| | | Vectorial | EAGL | 0.34 | 0.49 |
| | | | GOCat | 0.58* (+70%) | 0.75* (+53%) |
For Recall at rank r, according to the average number of expected good answers for each benchmark, r was 5 for the GOA and UniProt benchmarks (respectively 2.8 and 1.3 expected good answers) and 100 for the CTD benchmark (30 expected good answers). For the GOCat classifier results, performance improvements (+x%) are given relative to the EAGL classifier. Statistically significant improvements (P < 0.05) are marked with an asterisk (*).
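The two measures in the table can be made concrete. A minimal sketch, assuming the standard per-question definitions (P0: is the top-ranked answer correct; R at rank r: fraction of gold answers among the first r proposals), which would then be averaged over questions:

```python
# Sketch of the evaluation measures (assumed per-question definitions,
# not the authors' evaluation code).

def top_precision(proposed, gold):
    """P0: 1.0 if the top-ranked proposed answer is in the gold set, else 0.0."""
    return 1.0 if proposed and proposed[0] in gold else 0.0

def recall_at_rank(proposed, gold, r):
    """Fraction of gold answers found among the first r proposals."""
    return len(set(proposed[:r]) & set(gold)) / len(gold)
```

For example, with 2 gold GO terms and both found in the top 5 proposals, R at rank 5 is 1.0 regardless of how many incorrect proposals surround them.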
Output of the QA engine with different classifiers used for answer extraction
| Answer extractor | # | Answers proposed by the QA engine | Correctness and GO level |
|---|---|---|---|
| GoPubMed | 1 | GO:0005694 chromosome | |
| | 2 | GO:0005737 cytoplasm | X |
| | 3 | GO:0016020 membrane | |
| | 4 | GO:0005912 adherens junction | |
| | 5 | GO:0005886 plasma membrane | X |
| EAGL | 1 | GO:0005912 adherens junction | |
| | 2 | GO:0005915 zonula adherens | |
| | 3 | GO:0005923 tight junction | |
| | 4 | GO:0005886 plasma membrane | X |
| | 5 | GO:0005694 chromosome | |
| GOCat | 1 | GO:0005634 nucleus | X |
| | 2 | GO:0005737 cytoplasm | X |
| | 3 | GO:0005886 plasma membrane | X |
| | 4 | GO:0005911 cell–cell junction | |
| | 5 | GO:0005913 cell–cell adherens junction | |
The question submitted was ‘what cellular component is the location of ARVCF?’, with PubMed used as IR component. The table shows the top five most confident answers proposed by the QA engine, and whether these GO terms (marked X) are present in the ARVCF record in UniProtKB. The GO level is the maximum number of nodes in the GO graph between the correct term and the root. There were three associated GO terms in the gold file; all three were returned by the QA system with GOCat.
Output of the QA engine with different classifiers used for answer extraction
| Answer extractor | # | Answers proposed by the QA engine | Correctness and GO level |
|---|---|---|---|
| GoPubMed | 1 | GO:0005488 binding | X |
| | 2 | GO:0004707 MAP kinase activity | |
| | 3 | GO:0004871 signal transducer activity | X |
| | 4 | GO:0003824 catalytic activity | X |
| | 5 | GO:0031993 light transducer activity | |
| | 6 | GO:0060089 molecular transducer activity | X |
| | 7 | GO:0047322 [hydroxymethylglutaryl-CoA reductase (NADPH)] kinase activity | |
| | 8 | GO:0050405 [acetyl-CoA carboxylase] kinase activity | |
| | 9 | GO:0033736 L-lysine 6-oxidase activity | |
| | 10 | GO:0005138 interleukin-6 receptor binding | |
| EAGL | 1 | GO:0005128 erythropoietin receptor binding | |
| | 2 | GO:0018822 nitrile hydratase activity | |
| | 3 | GO:0003824 catalytic activity | X |
| | 4 | GO:0004601 peroxidase activity | X |
| | 5 | GO:0004096 catalase activity | |
| | 6 | GO:0052716 hydroquinone:oxygen oxidoreductase activity | |
| | 7 | GO:0000257 nitrilase activity | |
| | 8 | GO:0033968 glutaryl-7-aminocephalosporanic-acid acylase activity | |
| | 9 | GO:0004806 triglyceride lipase activity | |
| | 10 | GO:0005344 oxygen transporter activity | |
| GOCat | 1 | GO:0005515 protein binding | X |
| | 2 | GO:0042803 protein homodimerization activity | X |
| | 3 | GO:0008270 zinc ion binding | |
| | 4 | GO:0000287 magnesium ion binding | |
| | 5 | GO:0003677 DNA binding | X |
| | 6 | GO:0003700 sequence-specific DNA binding transcription factor activity | X |
| | 7 | GO:0030170 pyridoxal phosphate binding | X |
| | 8 | GO:0008144 drug binding | X |
| | 9 | GO:0020037 heme binding | X |
| | 10 | GO:0004674 protein serine/threonine kinase activity | X |
The question submitted was ‘What molecular functions are affected by Nitriles?’, with PubMed used as IR component. The table shows the top 10 most confident answers proposed by the QA engine, and whether these GO terms (marked X) are present in the Nitriles record in the CTD database. The GO level is the maximum number of nodes in the GO graph between the correct term and the root. There were 182 possible GO terms for this question.