Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Neil R Smalheiser, Aaron M Cohen, Gary Bonifield.

Abstract

Neural embeddings are a popular set of methods for representing words, phrases or text as a low-dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of each dimension is transparent to inspection. We have created a near-comprehensive vector representation of words, and of selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector representation is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics do, and are only partially correlated with them (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed titles + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one that allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
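
The abstract does not spell out how the vectors or the implicit metrics are built, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes cosine similarity as a stand-in word-word metric over low-dimensional vectors whose dimensions carry readable labels, and it shows how a rank correlation (rho) between two similarity metrics could be computed. The dimension labels, the toy vectors, and the second set of scores are all hypothetical.

```python
import math

# Hypothetical, human-readable dimension labels (an assumption for illustration).
DIMENSIONS = ["cardiology", "oncology", "genetics"]

# Hypothetical toy vectors; real vectors would be derived from a corpus.
VECTORS = {
    "heart":  [0.9, 0.1, 0.2],
    "artery": [0.8, 0.0, 0.1],
    "tumor":  [0.1, 0.9, 0.4],
    "gene":   [0.2, 0.4, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))

if __name__ == "__main__":
    pairs = [("heart", "artery"), ("heart", "tumor"),
             ("tumor", "gene"), ("artery", "gene")]
    metric_a = [cosine(VECTORS[a], VECTORS[b]) for a, b in pairs]
    # Hypothetical scores from some other method, e.g. a neural embedding.
    metric_b = [0.70, 0.35, 0.40, 0.10]
    for (a, b), score in zip(pairs, metric_a):
        print(f"{a}-{b}: {score:.3f}")
    print(f"Spearman rho between the two metrics: {spearman(metric_a, metric_b):.3f}")
```

Because each dimension has a label, a high similarity score can be traced back to the dimensions that contribute most to the dot product, which is one way to read "transparent to inspection". A rho well below 1 between two metrics would indicate that they rank word pairs differently, and so are not redundant.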

Keywords: Dimensional reduction; Implicit features; Natural language processing; Pvtopic; Semantic similarity; Text mining; Vector representation; Word2vec

Year: 2019 PMID: 30654030 PMCID: PMC6557457 DOI: 10.1016/j.jbi.2019.103096

Source DB: PubMed Journal: J Biomed Inform ISSN: 1532-0464 Impact factor: 6.317

19 in total

1. A probabilistic similarity metric for Medline records: a model for author name disambiguation.

Authors: Vetle I Torvik; Marc Weeber; Don R Swanson; Neil R Smalheiser
Journal: AMIA Annu Symp Proc Date: 2003

2. Towards a framework for developing semantic relatedness reference standards.

Authors: Serguei V S Pakhomov; Ted Pedersen; Bridget McInnes; Genevieve B Melton; Alexander Ruggieri; Christopher G Chute
Journal: J Biomed Inform Date: 2010-10-31 Impact factor: 6.317

3. Semantic Similarity and Relatedness between Clinical Terms: An Experimental Study.

Authors: Serguei Pakhomov; Bridget McInnes; Terrence Adam; Ying Liu; Ted Pedersen; Genevieve B Melton
Journal: AMIA Annu Symp Proc Date: 2010-11-13

4. Corpus domain effects on distributional semantic modeling of medical terms.

Authors: Serguei V S Pakhomov; Greg Finley; Reed McEwan; Yan Wang; Genevieve B Melton
Journal: Bioinformatics Date: 2016-08-16 Impact factor: 6.937

5. Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Authors: Neil R Smalheiser; Aaron M Cohen; Gary Bonifield
Journal: J Biomed Inform Date: 2019-01-14 Impact factor: 6.317

6. Representing Documents via Latent Keyphrase Inference.

Authors: Jialu Liu; Xiang Ren; Jingbo Shang; Taylor Cassidy; Clare R Voss; Jiawei Han
Journal: Proc Int World Wide Web Conf Date: 2016-04

7. Author Name Disambiguation in MEDLINE.

Authors: Vetle I Torvik; Neil R Smalheiser
Journal: ACM Trans Knowl Discov Data Date: 2009-07-01 Impact factor: 2.713

8. Three journal similarity metrics and their application to biomedical journals.

Authors: Jennifer L D'Souza; Neil R Smalheiser
Journal: PLoS One Date: 2014-12-23 Impact factor: 3.240

9. PubMed related articles: a probabilistic topic-based model for content similarity.

Authors: Jimmy Lin; W John Wilbur
Journal: BMC Bioinformatics Date: 2007-10-30 Impact factor: 3.169

10. Topic detection using paragraph vectors to support active learning in systematic reviews.

Authors: Kazuma Hashimoto; Georgios Kontonatsios; Makoto Miwa; Sophia Ananiadou
Journal: J Biomed Inform Date: 2016-06-10 Impact factor: 6.317

4 in total

1. Design of a generic, open platform for machine learning-assisted indexing and clustering of articles in PubMed, a biomedical bibliographic database.

Authors: Neil R Smalheiser; Aaron M Cohen
Journal: Data Inf Manag Date: 2018-05-22

2. Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings.

Authors: Neil R Smalheiser; Aaron M Cohen; Gary Bonifield
Journal: J Biomed Inform Date: 2019-01-14 Impact factor: 6.317

3. A web-based tool for automatically linking clinical trials to their publications.

Authors: Neil R Smalheiser; Arthur W Holt
Journal: J Am Med Inform Assoc Date: 2022-04-13 Impact factor: 4.497

4. Refining electronic medical records representation in manifold subspace.

Authors: Bolin Wang; Yuanyuan Sun; Yonghe Chu; Di Zhao; Zhihao Yang; Jian Wang
Journal: BMC Bioinformatics Date: 2022-04-01 Impact factor: 3.169
