
A content-based literature recommendation system for datasets to improve data reusability - A case study on Gene Expression Omnibus (GEO) datasets.

Braja Gopal Patra, Vahed Maroufy, Babak Soltanalizadeh, Nan Deng, W Jim Zheng, Kirk Roberts, Hulin Wu.

Abstract

OBJECTIVE: The centrality of data to biomedical research is difficult to overstate, and the same is true of the biomedical literature's role in disseminating empirical findings on the scientific questions asked of such data. But the connections between the literature and related datasets are often weak, hampering the ability of scientists to move easily between existing datasets and existing findings to derive new scientific hypotheses. This work aims to recommend relevant literature articles for datasets, with the ultimate goal of increasing the productivity of researchers. Our approach to literature recommendation for datasets is part of the dataset reusability platform developed at the University of Texas Health Science Center at Houston for datasets related to gene expression. This platform incorporates datasets from Gene Expression Omnibus (GEO). An average of 34 datasets were added to GEO daily in the last five years (i.e., 2014 to 2018), demonstrating the need for automatic methods to connect these datasets with relevant literature. The relevant literature for a given dataset may describe that dataset, provide a scientific finding based on that dataset, or even describe prior and related work on the dataset's topic that is of interest to users of the dataset.
MATERIALS AND METHODS: We adopt an information retrieval paradigm for literature recommendation. In our experiments, distributional semantic features are created from the title and abstract of MEDLINE articles. Then, related articles are identified for datasets in GEO. We evaluate multiple distributional methods such as TF-IDF, BM25, Latent Semantic Analysis, Latent Dirichlet Allocation, word2vec, and doc2vec. Top similar papers are recommended for each dataset using cosine similarity between the dataset's vector representation and every paper's vector representation. We also propose several novel re-ranking and normalization methods over embeddings to improve the recommendations.
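The retrieval step described above (vectorize the text, then rank articles by cosine similarity to the dataset) can be sketched as follows. This is a minimal illustration using a plain TF-IDF weighting, not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, and the function names `tfidf_vectors`, `cosine`, and `recommend` are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF (an assumption; other weighting variants are equally plausible).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(dataset_text, articles, k=10):
    """Rank articles (dict: id -> title plus abstract) against one dataset description."""
    ids = list(articles)
    vectors = tfidf_vectors([dataset_text] + [articles[i] for i in ids])
    query, doc_vectors = vectors[0], vectors[1:]
    scored = [(i, cosine(query, v)) for i, v in zip(ids, doc_vectors)]
    # Highest similarity first; keep only the top k recommendations.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Calling `recommend(dataset_description, {"pmid": "title and abstract text", ...})` returns up to `k` `(id, score)` pairs. Swapping in BM25 or an embedding model would change only how the vectors or scores are produced, not the ranking loop.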
RESULTS: The top-performing literature recommendation technique, BM25, achieved a strict precision at 10 of 0.8333 and a partial precision at 10 of 0.9000 in a manual evaluation of 36 datasets. Evaluation on a larger, automatically collected benchmark shows small but consistent gains from emphasizing the similarity of dataset and article titles.
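The abstract does not define strict and partial precision at 10. One common reading, assumed here, is that each recommended article receives a graded judgment (2 = relevant, 1 = partially relevant, 0 = not relevant), that the strict score counts only grade 2, and that the partial score counts grades 1 and 2. Under that assumption, the per-dataset metric and its mean over datasets could be computed as in the sketch below. The grading scale and the names `precision_at_k` and `mean_precision_at_k` are hypothetical.

```python
def precision_at_k(judgments, k=10, min_grade=2):
    """Fraction of the top-k recommendations judged at or above min_grade.

    judgments: relevance grades in ranked order (assumed scale: 2, 1, 0).
    min_grade=2 gives the strict score, min_grade=1 the partial score.
    Missing positions (fewer than k results) count as non-relevant.
    """
    hits = sum(1 for grade in judgments[:k] if grade >= min_grade)
    return hits / k


def mean_precision_at_k(all_judgments, k=10, min_grade=2):
    """Average precision at k over several datasets."""
    scores = [precision_at_k(j, k, min_grade) for j in all_judgments]
    return sum(scores) / len(scores) if scores else 0.0
```

With illustrative grades `[2, 2, 1, 0, 2, 2, 2, 1, 2, 2]` for a single dataset, the strict score is 7/10 and the partial score is 9/10.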
CONCLUSION: This work is a first step toward developing a literature recommendation tool that recommends relevant literature for datasets. This will hopefully lead to a better data reuse experience.
Copyright © 2020 Elsevier Inc. All rights reserved.

Keywords:  Cosine similarity; Gene Expression Omnibus (GEO); Literature recommendation; Re-ranking; Vector space model

Year:  2020        PMID: 32151769     DOI: 10.1016/j.jbi.2020.103399

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  7 in total

1.  A content-based dataset recommendation system for researchers-a case study on Gene Expression Omnibus (GEO) repository.

Authors:  Braja Gopal Patra; Kirk Roberts; Hulin Wu
Journal:  Database (Oxford)       Date:  2020-01-01       Impact factor: 3.451

2.  An informatics research platform to make public gene expression time-course datasets reusable for more scientific discoveries.

Authors:  Braja Gopal Patra; Babak Soltanalizadeh; Nan Deng; Leqing Wu; Vahed Maroufy; Canglin Wu; W Jim Zheng; Kirk Roberts; Hulin Wu; Ashraf Yaseen
Journal:  Database (Oxford)       Date:  2020-11-28       Impact factor: 4.462

3.  Personalized Online Learning Resource Recommendation Based on Artificial Intelligence and Educational Psychology.

Authors:  Xin Wei; Shiyun Sun; Dan Wu; Liang Zhou
Journal:  Front Psychol       Date:  2021-12-23

4.  Analysis and Validation of Hub Genes in Blood Monocytes of Postmenopausal Osteoporosis Patients.

Authors:  Yi-Xuan Deng; Wen-Ge He; Hai-Jun Cai; Jin-Hai Jiang; Yuan-Yuan Yang; Yan-Rong Dan; Hong-Hong Luo; Yu Du; Liang Chen; Bai-Cheng He
Journal:  Front Endocrinol (Lausanne)       Date:  2022-01-13       Impact factor: 5.555

5.  Identification and validation of ferroptosis key genes in bone mesenchymal stromal cells of primary osteoporosis based on bioinformatics analysis.

Authors:  Yu Xia; Haifeng Zhang; Heng Wang; Qiufei Wang; Pengfei Zhu; Ye Gu; Huilin Yang; Dechun Geng
Journal:  Front Endocrinol (Lausanne)       Date:  2022-08-25       Impact factor: 6.055

6.  Scientific paper recommendation systems: a literature review of recent publications.

Authors:  Christin Katharina Kreutz; Ralf Schenkel
Journal:  Int J Digit Libr       Date:  2022-10-05

7.  Identification of potential biomarkers and available drugs for oral squamous cell carcinoma.

Authors:  Zhijun Zhang; Fei Bi; Zhuang Zhang; Weidong Tian; Weihua Guo
Journal:  Transl Cancer Res       Date:  2021-01       Impact factor: 1.241

