Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Neil R Smalheiser, Aaron M Cohen, Gary Bonifield.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as low-dimensional vectors (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of each dimension are transparent to inspection. We have created a near-comprehensive vector representation of words, and of selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector representation is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics do, and the two are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed titles + abstracts are all publicly available at http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one that allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
Copyright © 2019 Elsevier Inc. All rights reserved.
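
The abstract does not spell out how the implicit metrics are computed, so the following is only a generic sketch of the two operations it refers to: scoring a word pair from low-dimensional vectors, and checking how strongly two similarity metrics agree by rank correlation (rho). Cosine similarity is swapped in here as a common stand-in and is not claimed to be the authors' metric. The words, the 4-dimensional vectors, and the "other metric" scores are all hypothetical toy values, not data from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

# Hypothetical toy vectors (assumption: 4 dimensions, invented values).
toy_vectors = {
    "aspirin":   [0.9, 0.1, 0.0, 0.2],
    "ibuprofen": [0.8, 0.2, 0.1, 0.1],
    "kidney":    [0.1, 0.9, 0.3, 0.0],
    "renal":     [0.2, 0.8, 0.4, 0.1],
    "protein":   [0.0, 0.2, 0.9, 0.5],
}
pairs = [("aspirin", "ibuprofen"), ("kidney", "renal"),
         ("aspirin", "kidney"), ("renal", "protein")]

# Metric A: cosine similarity on the toy vectors.
metric_a = [cosine_similarity(toy_vectors[a], toy_vectors[b]) for a, b in pairs]
# Metric B: invented scores standing in for a second, independent method.
metric_b = [0.70, 0.95, 0.30, 0.10]

print("rank agreement (rho):", spearman_rho(metric_a, metric_b))
```

A rho well below 1 between two metrics that each track human judgement would be consistent with the abstract's claim that the metrics are complementary rather than redundant.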

Keywords:  Dimensional reduction; Implicit features; Natural language processing; Pvtopic; Semantic similarity; Text mining; Vector representation; Word2vec

Year:  2019        PMID: 30654030      PMCID: PMC6557457          DOI: 10.1016/j.jbi.2019.103096

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317

