Retro: concept-based clustering of biomedical topical sets.

Lana Yeganova, Won Kim, Sun Kim, W John Wilbur.

Abstract

MOTIVATION: Clustering methods can be useful for automatically grouping documents into meaningful clusters, improving human comprehension of a document collection. Although there are clustering algorithms that can achieve the goal for relatively large document collections, they do not always work well for small and homogenous datasets.
METHODS: In this article, we present Retro, a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. Further, the statistical significance of candidate clusters is tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering.
RESULTS: We test our system on five disease datasets from OMIM® and evaluate the results based on MeSH® term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of quality of clusters. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene® database, a resource in PubMed®. Empirical evaluation confirms the method is useful for small homogenous datasets in producing meaningful clusters with descriptive titles.
AVAILABILITY AND IMPLEMENTATION: A web-based demonstration of the algorithm applied to a collection of sets from the HomoloGene database is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/CLUSTERING_HOMOLOGENE/index.html.
CONTACT: lana.yeganova@nih.gov
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
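
The key-phrase step named in METHODS can be illustrated with a small sketch. It is not the authors' implementation: the abstract states only that a hypergeometric model is used to discover key phrases and that some multiple testing correction is applied. The function names, the toy counts, and the choice of the Benjamini-Hochberg procedure are all assumptions made here for illustration. In the abstract the correction is applied to candidate clusters; it is applied to phrase p-values below only to keep the example short.

```python
from math import comb


def hypergeom_tail(population, with_phrase, sample, observed):
    """P(X >= observed) for a hypergeometric draw.

    population  : documents in the background collection
    with_phrase : background documents containing the phrase
    sample      : documents in the topical set
    observed    : topical-set documents containing the phrase
    """
    upper = min(sample, with_phrase)
    total = comb(population, sample)
    hits = sum(
        comb(with_phrase, i) * comb(population - with_phrase, sample - i)
        for i in range(observed, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices rejected at false discovery rate alpha (assumed correction)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank  # largest rank that passes its threshold
    return sorted(order[:cutoff])


# Hypothetical toy data: phrase -> (background count, topical-set count)
BACKGROUND_SIZE = 1000
SET_SIZE = 20
counts = {
    "gene mutation": (50, 12),
    "patients": (600, 13),
    "protein": (300, 6),
}

phrases = list(counts)
pvals = [hypergeom_tail(BACKGROUND_SIZE, counts[p][0], SET_SIZE, counts[p][1])
         for p in phrases]
significant = [phrases[i] for i in benjamini_hochberg(pvals)]
```

A phrase that is rare in the background but frequent in the topical set gets a small tail probability and becomes a candidate cluster title. Phrases that occur about as often as chance predicts are filtered out.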

Year: 2014 | PMID: 25075115 | PMCID: PMC4221121 | DOI: 10.1093/bioinformatics/btu514

Source DB: PubMed | Journal: Bioinformatics | ISSN: 1367-4803 | Impact factor: 6.937


