Mohamed Reda Bouadjenek, Karin Verspoor.
Abstract
In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors.
Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
Year: 2017 PMID: 29220457 PMCID: PMC5737205 DOI: 10.1093/database/bax062
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Figure 1. Box-plots of the overlap similarity between the queries and different fields of their associated relevant datasets.
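The caption does not define the overlap similarity. A minimal sketch, assuming it is the standard overlap coefficient over token sets (the tokenizer and the function names are illustrative, not taken from the paper):

```python
def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer (assumption)."""
    return set(text.lower().split())


def overlap_similarity(query, field_text):
    """Overlap coefficient |Q ∩ F| / min(|Q|, |F|); returns 0.0 if either side is empty."""
    q, f = tokenize(query), tokenize(field_text)
    if not q or not f:
        return 0.0
    return len(q & f) / min(len(q), len(f))
```

Computed per field (title, description, keywords and so on) for each query and each of its relevant datasets, scores of this kind could be summarized as box-plots like those in Figure 1.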
bioCADDIE dataset collection details
| Category | # | Repository | DocID | Title | Description | Keywords | Organisms | PMID | Genes | Diseases | Treatment | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clinical trials | 1 | ClinicalTrials | x | x | x | x | — | — | — | x | x | 192 500 |
| Clinical trials | 2 | CTN | x | x | x | x | x | — | — | — | — | 46 |
| Gene expression | 3 | ArrayExpress | x | x | x | — | x | — | — | — | — | 60 881 |
| Gene expression | 4 | GEMMA | x | x | x | — | x | — | — | — | — | 2285 |
| Gene expression | 5 | GEO | x | x | x | — | x | — | — | — | — | 105 033 |
| Gene expression | 6 | Nursadatasets | x | x | x | x | x | — | — | — | — | 389 |
| Imaging data | 7 | CVRG | x | x | x | — | — | — | — | — | — | 29 |
| Imaging data | 8 | NeuroMorpho | x | x | — | — | x | — | — | — | x | 34 082 |
| Imaging data | 9 | CIA | x | x | — | — | x | — | — | x | — | 63 |
| Imaging data | 10 | OpenFMRI | x | x | x | — | x | — | — | — | — | 36 |
| Phenotype | 11 | MPD | x | x | x | — | x | — | — | — | — | 235 |
| Phenotype | 12 | PhenoDisco | x | x | x | — | — | — | — | x | — | 429 |
| Physiological signals | 13 | PhysioBank | x | x | x | — | — | — | — | — | — | 70 |
| Physiological signals | 14 | YPED | x | x | x | — | x | x | — | — | — | 21 |
| Protein structure | 15 | PDB | x | x | x | x | x | x | x | — | — | 113 493 |
| Proteomic data | 16 | PeptideAtlas | x | x | x | — | x | x | — | — | x | 76 |
| Proteomic data | 17 | ProteomeXchange | x | x | — | x | x | — | — | — | — | 1716 |
| Unspecified | 18 | BioProject | x | x | x | x | x | — | — | — | — | 155 850 |
| Unspecified | 19 | Dataverse | x | x | x | — | — | — | — | — | — | 60 303 |
| Unspecified | 20 | Dryad | x | x | x | x | — | — | — | — | — | 67 455 |
| Total | | | 794 992 | | 759 131 | 531 449 | 474 206 | 113 590 | 113 493 | 192 992 | 226 658 | |
(x) means the information is provided, (—) means the information is not provided. These marks do not imply any ‘positive’ or ‘negative’ information except for the presence or the absence of the considered information in the metadata section.
Details of the queries
| Queries | | Organisms | Genes | Diseases | Category |
|---|---|---|---|---|---|
| Query 1: | Find protein sequencing data related to bacterial chemotaxis across all databases | — | — | — | + |
| Query 2: | Search for data of all types related to MIP-2 gene related to biliary atresia across all databases | — | + | + | — |
| Query 3: | Search for all data types related to gene TP53INP1 in relation to p53 activation across all databases | — | + | — | — |
| Query 4: | Find all data types related to inflammation during oxidative stress in human hepatic cells across all databases | + | — | + | — |
| Query 5: | Search for gene expression and genetic deletion data that mention CD69 in memory augmentation studies across all databases | — | + | — | + |
| Query 6: | Search for data of all types related to the LDLR gene related to cardiovascular disease across all databases | — | + | + | — |
| Query 7: | Search for gene expression datasets on photo transduction and regulation of calcium in blind | + | — | — | + |
| Query 8: | Search for proteomic data related to regulation of calcium in blind | + | — | — | + |
| Query 9: | Search for data of all types related to the ob gene in obese | + | + | — | — |
| Query 10: | Search for data of all types related to energy metabolism in obese | + | — | — | — |
| Query 11: | Search for all data for the HTT gene related to Huntington's disease across all databases | — | + | + | — |
| Query 12: | Search for data on neural brain tissue in transgenic mice related to Huntington's disease | + | — | + | — |
| Query 13: | Search for all data on the SNCA gene related to Parkinson's disease across all databases | — | + | + | — |
| Query 14: | Search for data on nerve cells in the substantia nigra in mice across all databases | + | — | — | — |
| Query 15: | Find data on the NF-kB signaling pathway in | — | + | + | — |
(+) means the concept is present in the query, (—) means the concept is not present in the query. These marks do not imply any ‘positive’ or ‘negative’ information except for the presence or the absence of the considered concept in the query.
Figure 2. Architecture overview.
Sample of terms extracted from the qrels and added to Query 1 (a)
| Query 1: Find protein sequencing data related to bacterial chemotaxis across all databases | |||||
|---|---|---|---|---|---|
| Term added | P@100 | Recall | infAP | AP | infNDCG |
| 0.370 | |||||
| 0.410 | |||||
| 0.300 | |||||
| 0.300 | |||||
| 0.310 | |||||
| 0.220 | 0.039 | ||||
| Complex | 0.160 | 0.086 | 0.036 | 0.208 | |
| Protein | 0.320 | 0.206 | |||
| Xanthomonas | 0.200 | 0.019 | 0.191 | ||
| Maritima | 0.200 | 0.010 | 0.030 | 0.165 | |
| Domain | 0.081 | 0.04 | 0.115 | ||
| Thermotoga | 0.200 | 0.010 | 0.030 | 0.030 | |
(a) Values in bold are improvements over the baseline.
Figure 3. The utility of query expansion for the 15 queries.
Retrieval performance summary of the baseline (a)
| Metric | Queried fields | |
|---|---|---|
| Description | Title and description | |
| Baseline 1 | Baseline 2 | |
| 0.6067 | ||
| 0.2171 | ||
| 0.2088 | ||
| 0.3575 |
(a) Values in bold are improvements over the baseline.
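For readers unfamiliar with the metrics named in this record, the following is a minimal sketch of plain P@k, average precision (AP) and MAP over ranked lists of dataset identifiers. It does not cover the inferred variants (infAP, infNDCG) used in the challenge, and the function names are illustrative:

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant item appears."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def mean_average_precision(runs):
    """MAP over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With 15 queries, `runs` would hold 15 such pairs, and MAP is the unweighted mean of the per-query AP values.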
Figure 4. The utility of the literature-based fields.
Figure 5. The utility of the Gene fields on the queries that mention genes.
Figure 6. The utility of the disease-based fields on the queries that mention diseases.
Figure 7. Analysis of the query qrels with respect to the repository categories for Query 1, Query 5, Query 7 and Query 8.
Figure 8. The utility of the category-based filter on the queries that mention a specific type of biomedical data.
Multi-field query retrieval performance when querying one field (on the diagonal) or two fields simultaneously (off the diagonal) on the set of 15 bioCADDIE queries (a)
| Fields | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Title | Description | Keywords | Organisms | Art. Title | Art. Abstract | Genes | Diseases | Treatment | Metric | ||
| Fields | Title | 0.5667 | 0.5200 | 0.4400 | 0.3667 | 0.3933 | 0.5600 | 0.4000 | 0.2067 | ||
| 0.1509 | 0.0982 | 0.1283 | 0.1108 | 0.0864 | 0.1543 | 0.0995 | 0.0644 | ||||
| 0.1438 | 0.0999 | 0.1088 | 0.1136 | 0.1024 | 0.1461 | 0.0903 | 0.0624 | ||||
| 0.2615 | 0.3575 | 0.2361 | 0.2438 | 0.2346 | 0.1977 | 0.2609 | 0.1778 | 0.1510 | |||
| Description | – | 0.6067 | 0.5733 | 0.6333 | 0.6867 | 0.4933 | 0.6067 | 0.4867 | 0.4133 | ||
| – | 0.2171 | 0.1792 | 0.2362 | 0.2297 | 0.2034 | 0.2171 | 0.1676 | 0.1422 | |||
| – | 0.2088 | 0.1822 | 0.2276 | 0.2171 | 0.2074 | 0.2083 | 0.1688 | 0.1306 | |||
| – | 0.3241 | 0.3525 | 0.3607 | 0.3191 | 0.3604 | 0.3098 | 0.2702 | ||||
| Keywords | – | – | 0.2267 | 0.2667 | 0.3067 | 0.3533 | 0.3133 | 0.2600 | 0.1533 | ||
| – | – | 0.0322 | 0.0321 | 0.0314 | 0.0343 | 0.0344 | 0.0316 | 0.0168 | |||
| – | – | 0.0407 | 0.0391 | 0.0447 | 0.0469 | 0.0437 | 0.0397 | 0.0256 | |||
| – | – | 0.1462 | 0.1470 | 0.2207 | 0.1627 | 0.1404 | 0.1246 | 0.0996 | |||
| Organisms | – | – | – | 0.0067 | 0.1467 | 0.2000 | 0.1867 | 0.1267 | 0.0400 | ||
| – | – | – | 0.0001 | 0.0081 | 0.0123 | 0.0071 | 0.0145 | 0.0007 | |||
| – | – | – | 0.0001 | 0.0130 | 0.0211 | 0.0072 | 0.0136 | 0.0011 | |||
| – | – | – | 0.0076 | 0.1598 | 0.0923 | 0.0163 | 0.0439 | 0.0099 | |||
| Art. Title | – | – | – | – | 0.1733 | 0.2133 | 0.2400 | 0.2267 | 0.1267 | ||
| – | – | – | – | 0.0113 | 0.0164 | 0.0153 | 0.0258 | 0.0057 | |||
| – | – | – | – | 0.0230 | 0.0360 | 0.0237 | 0.0346 | 0.0092 | |||
| – | – | – | – | 0.1136 | 0.1188 | 0.1189 | 0.1125 | 0.0545 | |||
| Art. Abstract | – | – | – | – | – | 0.1667 | 0.2800 | 0.2133 | 0.2067 | ||
| – | – | – | – | – | 0.0149 | 0.0201 | 0.0263 | 0.0125 | |||
| – | – | – | – | – | 0.0266 | 0.0294 | 0.0362 | 0.0197 | |||
| – | – | – | – | – | 0.1035 | 0.1053 | 0.1033 | 0.0906 | |||
| Genes | – | – | – | – | – | – | 0.1933 | 0.1600 | 0.0467 | ||
| – | – | – | – | – | – | 0.0088 | 0.0175 | 0.0015 | |||
| – | – | – | – | – | – | 0.0091 | 0.0165 | 0.0025 | |||
| – | – | – | – | – | – | 0.0096 | 0.0459 | 0.0148 | |||
| Diseases | – | – | – | – | – | – | – | 0.1200 | 0.1000 | ||
| – | – | – | – | – | – | – | 0.0152 | 0.0117 | |||
| – | – | – | – | – | – | – | 0.0140 | 0.0119 | |||
| – | – | – | – | – | – | – | 0.0394 | 0.0436 | |||
| Treatment | – | – | – | – | – | – | – | – | 0.0333 | ||
| – | – | – | – | – | – | – | – | 0.0006 | |||
| – | – | – | – | – | – | – | – | 0.0012 | |||
| – | – | – | – | – | – | – | – | 0.0099 |
(a) Values in bold are improvements over the baseline. (–) is used to avoid duplicating the results as the table is symmetric.
Figure 9. Example of Query 2 transformed into the Lucene query syntax targeting multiple fields. Note that concept terms identified in the query are boosted with a factor of 2.
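The figure itself is not reproduced in this record. As a rough illustration of the idea, the sketch below assembles a multi-field query string in Lucene's classic query syntax (`field:"term"` clauses, with `^2` as the boost). The field names, the clause layout and the helper `build_multifield_query` are assumptions made for illustration, not the authors' implementation:

```python
def quote(term):
    """Wrap a term in double quotes, escaping backslashes and embedded quotes."""
    return '"' + term.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_multifield_query(free_text_terms, concepts,
                           text_fields=("title", "description"), boost=2):
    """Build an OR query: plain terms on text fields, boosted concept terms on concept fields.

    concepts maps a (hypothetical) field name to the concept terms found in the query.
    """
    clauses = []
    for field in text_fields:
        for term in free_text_terms:
            clauses.append(f"{field}:{quote(term)}")
    for field, terms in concepts.items():
        for term in terms:
            clauses.append(f"{field}:{quote(term)}^{boost}")
    return " OR ".join(clauses)


# Query 2 mentions a gene (MIP-2) and a disease (biliary atresia).
query_2 = build_multifield_query(
    ["MIP-2", "biliary atresia"],
    {"genes": ["MIP-2"], "diseases": ["biliary atresia"]},
)
```

Quoting every term keeps multi-word concepts such as "biliary atresia" together as a phrase and avoids having to escape characters like the hyphen in "MIP-2".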
Figure 10. Performance of the multi-field query method with respect to the baselines.
Figure 11. Performance of the query expansion methods compared with the baselines.
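The Rocchio method mentioned in the abstract needs an initial execution of the query to obtain a feedback set. A minimal sketch of the positive-feedback part of Rocchio, assuming tokenized feedback documents, length-normalized term frequencies and illustrative weights (`alpha`, `beta`, `n_new_terms` and the function name are not taken from the paper, and the negative-feedback term is omitted):

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, n_new_terms=5):
    """Return the query terms plus the highest-weighted new terms from the feedback set.

    feedback_docs is a list of token lists, e.g. the top-ranked results of a first run.
    """
    weights = Counter()
    for term in query_terms:
        weights[term] += alpha
    docs = [doc for doc in feedback_docs if doc]  # ignore empty documents
    for doc in docs:
        for term, count in Counter(doc).items():
            # Centroid of the feedback set, using length-normalized term frequency.
            weights[term] += beta * (count / len(doc)) / len(docs)
    candidates = [t for t in weights if t not in query_terms]
    # Highest weight first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda t: (-weights[t], t))
    return list(query_terms) + candidates[:n_new_terms]
```

A lexicon-based expansion, by contrast, would look up related terms directly and skip the first retrieval pass, which is the efficiency argument made in the abstract.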