
A content-based dataset recommendation system for researchers: a case study on the Gene Expression Omnibus (GEO) repository.

Braja Gopal Patra, Kirk Roberts, Hulin Wu.

Abstract

Researchers increasingly make their data publicly available to support experimental reproducibility and data reuse, and sharing data with fellow researchers also increases the visibility of the work. At the same time, other researchers are held back by a lack of data resources. To address this, many repositories and knowledge bases have been established to ease data sharing, and over the past two decades the number of datasets added to these repositories has grown exponentially. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. It is therefore challenging for a researcher to keep track of all the relevant repositories for potential use. A dataset recommender system that recommends datasets to a researcher based on his or her previous publications can thus enhance productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that, beyond the corpus, two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter relevant datasets from non-relevant ones, researchers are better represented by a set of interests than by the entire body of their research. This second idea is implemented using a non-parametric clustering technique that groups each researcher's publications into clusters. Datasets are then recommended to each researcher using the cosine similarity between the vector representations of the publication clusters and the datasets. In a manual evaluation by five researchers, the proposed method achieved a maximum normalized discounted cumulative gain at 10 (NDCG@10), partial precision at 10 (p@10) and strict p@10 of 0.89, 0.78 and 0.61, respectively. To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, reduce researchers' workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/.
© The Author(s) 2020. Published by Oxford University Press.
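
The pipeline outlined in the abstract (represent a researcher by clusters of publications, rank datasets by cosine similarity to those clusters, score the ranking with NDCG@10) can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-words vectors, the greedy threshold clustering (a simple stand-in for the unspecified non-parametric technique), the 0.3 threshold, the choice to score each dataset by its best-matching cluster, and the dataset labels `GSE_A`, `GSE_B` and `GSE_C` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed stand-in for any text representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_publications(pub_vectors, threshold=0.3):
    """Greedy threshold clustering; the number of clusters is not fixed in advance.

    Each cluster is represented by the summed vector of its members.
    """
    clusters = []
    for vec in pub_vectors:
        best = max(clusters, key=lambda c: cosine(c, vec), default=None)
        if best is not None and cosine(best, vec) >= threshold:
            best.update(vec)  # Counter.update adds term counts in place
        else:
            clusters.append(Counter(vec))
    return clusters


def recommend(pub_texts, datasets, k=10, threshold=0.3):
    """Rank datasets by their highest cosine similarity to any interest cluster."""
    clusters = cluster_publications([tf_vector(t) for t in pub_texts], threshold)
    if not clusters:
        return []
    scored = []
    for ds_id, description in datasets.items():
        ds_vec = tf_vector(description)
        scored.append((max(cosine(c, ds_vec) for c in clusters), ds_id))
    scored.sort(reverse=True)
    return [ds_id for _, ds_id in scored[:k]]


def ndcg_at_k(relevances, k=10):
    """NDCG@k for graded relevance labels listed in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


# Hypothetical researcher with two interests and three hypothetical datasets.
publications = [
    "influenza infection gene expression time course",
    "influenza virus host gene expression",
    "breast cancer tumor methylation",
]
datasets = {
    "GSE_A": "influenza gene expression mouse lung",
    "GSE_B": "tumor methylation breast cancer cohort",
    "GSE_C": "soil bacteria metagenome",
}
print(recommend(publications, datasets, k=2))
```

Scoring against separate interest clusters, rather than one vector built from all publications, keeps a minority interest (here, the single cancer paper) from being drowned out by the researcher's dominant topic.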


Year:  2020        PMID: 33002137      PMCID: PMC7659921          DOI: 10.1093/database/baaa064

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


