Payam Karisani, Zhaohui S Qin, Eugene Agichtein.
Abstract
The bioCADDIE dataset retrieval challenge brought together different approaches to retrieving biomedical datasets relevant to a user’s query, expressed as a text description of the needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments that evaluate both probabilistic and machine learning-driven information retrieval techniques as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than the other methods. We also show that, although a rich space of potential representations and features is available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine that provides access to biomedical datasets. Retrieval performance is expected to improve further with additional training data created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system.
Database URL: https://github.com/emory-irlab/biocaddie
Year: 2018 PMID: 29688379 PMCID: PMC5887275 DOI: 10.1093/database/bax104
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. The system architecture and data flow diagram for biomedical dataset retrieval and the bioCADDIE challenge.
Features describing the query-dataset match, used by the Learning to Rank (LTR) model
| Group No | Feature name | Description |
|---|---|---|
| 1 | | BM25 similarity score of the whole dataset |
| 1 | | BM25 similarity score of the TITLE |
| 1 | | BM25 similarity score of the TEXT |
| 1 | | BM25 similarity score of the METADATA |
| 2 | | |
| 2 | | |
| 2 | | |
| 3 | | |
| 3 | | |
| 3 | | |
| 4 | | |
| 4 | | |
| 4 | | |
| 5 | | |
| 5 | | |
| 5 | | |
| 6 | | Number of common word 2-grams in the query and TITLE |
| 6 | | Number of common word 2-grams in the query and TEXT |
| 6 | | Number of common word 2-grams in the query and METADATA |
| 6 | | Number of common word 2-grams in the query and the whole dataset |
| 7 | DistanceFromStart | Position of the first query term in the dataset TEXT field |
| 8 | DomainWeight | Ratio of the number of datasets belonging to the dataset’s web domain to the total number of datasets in the corpus |
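
The lexical features in groups 6 to 8 can be illustrated with a minimal sketch. The whitespace tokenizer, the use of distinct (rather than repeated) 2-grams, the return value of -1 when no query term occurs, and all function names are assumptions made for illustration; the source does not specify them.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer (an assumption; the source names no tokenizer)."""
    return text.lower().split()


def bigrams(tokens: List[str]) -> Set[Tuple[str, str]]:
    """Set of distinct adjacent word pairs."""
    return set(zip(tokens, tokens[1:]))


def common_bigram_count(query: str, field: str) -> int:
    """Group 6 style feature: number of distinct word 2-grams shared by query and field."""
    return len(bigrams(tokenize(query)) & bigrams(tokenize(field)))


def first_query_term_position(query: str, text: str) -> int:
    """Group 7 style feature: index of the first token in `text` that is a query term.

    Returns -1 when no query term occurs (a hypothetical convention).
    """
    query_terms = set(tokenize(query))
    for position, token in enumerate(tokenize(text)):
        if token in query_terms:
            return position
    return -1


def domain_weight(domain: str, corpus_domains: List[str]) -> float:
    """Group 8 style feature: share of corpus datasets that come from `domain`."""
    return corpus_domains.count(domain) / len(corpus_domains)
```

For example, `common_bigram_count("energy metabolism in obese mice", "Energy metabolism in mice")` counts the shared pairs `(energy, metabolism)` and `(metabolism, in)`.
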
Initial (first phase) retrieval model parameters, with the range of values and the empirically tuned best value for each parameter
| Parameter | Description | Range | Best value |
|---|---|---|---|
| | Weight of TITLE in the retrieval | 0.1, 0.3, 0.5, 0.7 | 0.1 |
| | Weight of TEXT in the retrieval | 0.1, 0.3, 0.5, 0.7 | 0.3 |
| | Weight of METADATA in the retrieval | 0.1, 0.3, 0.5, 0.7 | 0.5 |
| | K1 parameter in BM25 | 0.6, 1, 1.4, 1.8 | 1.8 |
| | B parameter in BM25 | 0.3, 0.5, 0.7, 0.9 | 0.7 |
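
To show how the tuned values in this table might fit together, the following is a minimal sketch of field-weighted BM25 scoring. It uses the best values listed above (K1 = 1.8, B = 0.7, field weights 0.1, 0.3 and 0.5). The Lucene-style IDF, the linear combination of per-field scores, and the `stats` layout are assumptions; the source does not state how the field weights enter the retrieval model.

```python
import math
from collections import Counter
from typing import Dict, List

K1, B = 1.8, 0.7  # best values from the parameter table
FIELD_WEIGHTS = {"TITLE": 0.1, "TEXT": 0.3, "METADATA": 0.5}  # best values from the table


def bm25_field_score(query_terms: List[str], field_tokens: List[str],
                     doc_freq: Dict[str, int], num_docs: int, avg_len: float,
                     k1: float = K1, b: float = B) -> float:
    """Okapi BM25 score of a single field (Lucene-style IDF, an assumption)."""
    tf = Counter(field_tokens)
    # Length normalization: fields longer than average are penalized when b > 0.
    length_norm = 1.0 - b + b * len(field_tokens) / avg_len
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = doc_freq.get(term, 0)
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * length_norm)
    return score


def weighted_score(query_terms: List[str], dataset: Dict[str, List[str]],
                   stats: Dict[str, dict],
                   weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Weighted sum of per-field BM25 scores (hypothetical combination rule)."""
    return sum(
        weight * bm25_field_score(query_terms, dataset[field],
                                  stats[field]["df"], stats[field]["num_docs"],
                                  stats[field]["avg_len"])
        for field, weight in weights.items()
    )
```

Here `stats[field]` is a hypothetical per-field record holding the document frequencies (`df`), the corpus size (`num_docs`) and the average field length (`avg_len`, assumed to be greater than zero).
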
Query reformulation parameters, with the range of values and the empirically tuned best value for each parameter
| Parameter | Description | Range | Best value |
|---|---|---|---|
| | Top datasets selected for WIG model, BRF and external expansion | 5, 10, 30 | 5 |
| | Number of terms added to the query by BRF | 5, 10, 30 | 5 |
| | Weight of the terms selected by BRF | 0.1, 0.3, 0.5 | 0.1 |
| | Number of terms added to the query using external resources | 5, 10, 30 | 10 |
| | Weight of the terms added using external resources | 0.1, 0.3, 0.5 | 0.5 |
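
The BRF parameters in this table (top 5 datasets, 5 added terms, term weight 0.1) can be read as the inputs to a feedback loop like the one sketched below. Selecting expansion terms by raw frequency in the top-ranked datasets and giving original query terms a weight of 1.0 are assumptions made for illustration; the source does not describe the term selection rule.

```python
from collections import Counter
from typing import Dict, List


def brf_expand(query_terms: List[str], ranked_docs: List[List[str]],
               top_docs: int = 5, num_terms: int = 5,
               term_weight: float = 0.1) -> Dict[str, float]:
    """Blind relevance feedback sketch: treat the top-ranked datasets as relevant
    and add their most frequent new terms to the query at a reduced weight."""
    # Original query terms keep full weight (hypothetical choice).
    weights = {term: 1.0 for term in query_terms}
    counts: Counter = Counter()
    for doc_tokens in ranked_docs[:top_docs]:
        counts.update(t for t in doc_tokens if t not in weights)
    # Frequency-based selection is an assumption, not the authors' stated method.
    for term, _ in counts.most_common(num_terms):
        weights[term] = term_weight
    return weights
```

The defaults mirror the best values in the table; `ranked_docs` is a hypothetical list of tokenized datasets in first-phase ranking order.
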
Performance results for the steps described in the Methodology: retrieval system architecture and implementation section
| Model | NDCG | MAP | P@10 |
|---|---|---|---|
| 0.457 | 0.187 | 0.499 | |
| 0.446 | 0.180 | 0.463 | |
| 0.535 | 0.261 | 0.601 | |
| 0.534 | 0.259 | ||
| 0.547 | 0.590 | ||
| 0.586 |
The bold numbers indicate the highest achieved performance.
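
For readers unfamiliar with the three reported metrics, the sketch below computes P@10, average precision (MAP is its mean over queries) and NDCG for one ranked list. The linear gain in DCG and the choice to build the ideal ranking from all judged relevance grades are assumptions; the official challenge evaluation may use a different formulation.

```python
import math
from typing import List


def precision_at_k(rels: List[int], k: int = 10) -> float:
    """Fraction of the top k results that are relevant (grade > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k


def average_precision(rels: List[int], total_relevant: int) -> float:
    """Mean of the precision values at each rank where a relevant result appears."""
    hits, precision_sum = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


def dcg(rels: List[int]) -> float:
    """Discounted cumulative gain with linear gain (an assumption)."""
    return sum(r / math.log2(rank + 1) for rank, r in enumerate(rels, start=1))


def ndcg(rels: List[int], all_judged_rels: List[int]) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering of judged grades."""
    ideal = dcg(sorted(all_judged_rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0
```

`rels` holds the relevance grades of the returned datasets in rank order, and `all_judged_rels` holds the grades of every judged dataset for that query.
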
Figure 2. Retrieval performance of the query reformulation extensions.
Performance changes in using LTR
| Method | NDCG | MAP | P@10 |
|---|---|---|---|
| | 0.457 | 0.187 | 0.499 |
| | 0.539 | 0.254 | 0.462 |
The bold numbers indicate the highest achieved performance.
Performance results for the steps described in the Methodology: retrieval system architecture and implementation section, with Leave-One-Out cross validation
| Model | NDCG | MAP | P@10 |
|---|---|---|---|
| 0.465 | 0.194 | 0.495 | |
| 0.450α | 0.185α | 0.481 | |
| 0.563α | 0.279α | ||
| 0.559α | 0.277α | 0.619α | |
| 0.619α | |||
| 0.561α | 0.283α | 0.624α |
Changes marked with α are statistically significant relative to BM25Opt under a paired t-test (p < 0.05).
The bold numbers indicate the highest achieved performance.
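
The significance marker above refers to a paired t-test over per-query scores. As a minimal sketch of that test, the function below computes the paired t statistic with the standard library only; the function name and inputs are illustrative, and the per-query scores themselves are not given in this record.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], system: Sequence[float]) -> float:
    """Paired t statistic for per-query scores of two systems on the same queries.

    t = mean(d) / (stdev(d) / sqrt(n)), where d are the per-query differences
    and stdev is the sample standard deviation (n - 1 in the denominator).
    """
    if len(baseline) != len(system):
        raise ValueError("both systems must be scored on the same queries")
    diffs = [s - b for b, s in zip(baseline, system)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
```

The statistic is compared against a t distribution with n - 1 degrees of freedom to obtain the p-value; `scipy.stats.ttest_rel` performs both steps if SciPy is available.
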
Performance changes in using LTR, with Leave-One-Out cross validation
| Method | NDCG | MAP | P@10 |
|---|---|---|---|
| | 0.457 | 0.187 | 0.499 |
| | 0.550 | 0.272 | 0.524 |
The bold numbers indicate the highest achieved performance.
Retrieval performance for retrained LTR using the extended training data
| Method | NDCG | MAP | P@10 |
|---|---|---|---|
| | 0.457 | 0.187 | 0.499 |
| IROpt | | | |
| | 0.539 | 0.254 | 0.462 |
| | 0.552 | 0.267 | 0.539 |
The bold numbers indicate the highest achieved performance.
Figure 3. Retrieval performance of the LTR framework when extended training data is used.
Feature group ablation in learning to rank model
| Rank | Category | NDCG after omission |
|---|---|---|
| 1 | (group 1) BM25 scores | 0.538 |
| 2 | (group 3) unigram IDF in the dataset fields | 0.544 |
| 3 | (group 5) unigram in the whole (concatenated) dataset fields | 0.548 |
| 4 | (group 7) DistanceFromStart | 0.550 |
| 5 | (group 2) unigram TF in the dataset fields | 0.550 |
| 6 | (group 8) DomainWeight | 0.553 |
| 7 | (group 6) shared bigrams | 0.557 |
| 8 | (group 4) unigram TF-IDF in the dataset fields | 0.558 |
Group numbers refer to the feature groups listed in Table 1 (the features describing the query-dataset match).
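
The ranking in this table follows from sorting the groups by NDCG after omission: the lower the NDCG once a group is removed, the more the model appears to rely on it. A small sketch using only the values from the table (the dictionary name is illustrative):

```python
# NDCG after omitting each feature group, copied from the ablation table.
ABLATION_NDCG = {
    "(group 1) BM25 scores": 0.538,
    "(group 3) unigram IDF in the dataset fields": 0.544,
    "(group 5) unigram in the whole (concatenated) dataset fields": 0.548,
    "(group 7) DistanceFromStart": 0.550,
    "(group 2) unigram TF in the dataset fields": 0.550,
    "(group 8) DomainWeight": 0.553,
    "(group 6) shared bigrams": 0.557,
    "(group 4) unigram TF-IDF in the dataset fields": 0.558,
}

# Ascending NDCG after omission; Python's sort is stable, so tied groups
# keep the order in which they appear in the table.
ranked_groups = sorted(ABLATION_NDCG.items(), key=lambda item: item[1])

for rank, (group, score) in enumerate(ranked_groups, start=1):
    print(f"{rank}. {group}: {score:.3f}")
```
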
Performance improvements using external expansion resources
| External expansion resource | NDCG | MAP | P@10 |
|---|---|---|---|
| API | 0.534 | 0.259 | |
| Web search | 0.547 | 0.586 | |
| Web search + API | 0.590 |
Changes in retrieval performance before and after query modification for query numbers 1, 3, 10 and 15. Query terms enclosed in ‘[]’ are added using external resources, and query terms enclosed in ‘<>’ are added by BRF. Query terms marked with ‘+’ are the keywords with the highest weight in the WIG model
| Query No | Original query terms and automatically expanded terms | NDCG before modification | NDCG after modification |
|---|---|---|---|
| 1 | Find protein sequencing data related to bacterial+ chemotaxis+ across all databases+ | 0.111 | 0.291 (+162%) |
| 3 | Search for all data types related to gene TP53INP1+ in relation to p53+ activation across all databases+ | 0.342 | 0.710 (+107%) |
| 10 | Search for data of all types related to energy metabolism+ in obese+ M. musculus+ | 0.373 | 0.436 (+16%) |
| 15 | Find data on the NF-kB+ signaling pathway in MG (Myasthenia+ gravis+) patients | 0.603 | 0.524 (-13%) |
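
The percentages in the last column are relative changes in NDCG. The sketch below recomputes them from the before and after values in the table; it assumes the reported figures are truncated toward zero rather than rounded, which is what the table values suggest (for query 3 the change is about 107.6%, reported as +107%).

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (query number, NDCG before, NDCG after), copied from the table above.
QUERY_NDCG = [(1, 0.111, 0.291), (3, 0.342, 0.710), (10, 0.373, 0.436), (15, 0.603, 0.524)]

for query_no, before, after in QUERY_NDCG:
    # int() truncates toward zero (assumed to match the table's convention).
    print(f"query {query_no}: {int(percent_change(before, after)):+d}%")
```
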