A comparison of word embeddings for the biomedical natural language processing.

Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, Hongfang Liu.

Abstract

BACKGROUND: Word embeddings are widely used in biomedical Natural Language Processing (NLP) applications because the vector representations capture useful semantic properties and linguistic relationships between words. Different textual resources (e.g., Wikipedia and biomedical literature corpora) have been used in biomedical NLP to train word embeddings, and these embeddings are commonly used as feature input to downstream machine learning models. However, there has been little work on evaluating word embeddings trained from different textual resources.
METHODS: In this study, we empirically evaluated word embeddings trained from four different corpora, namely clinical notes, biomedical publications, Wikipedia, and news. For the former two resources, we trained word embeddings using unstructured electronic health record (EHR) data available at Mayo Clinic and articles (MedLit) from PubMed Central, respectively. For the latter two resources, we used publicly available pre-trained word embeddings, GloVe and Google News. The evaluation was done qualitatively and quantitatively. For the qualitative evaluation, we randomly selected medical terms from three categories (i.e., disorder, symptom, and drug), and manually inspected the five most similar words computed by embeddings for each term. We also analyzed the word embeddings through a 2-dimensional visualization plot of 377 medical terms. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluations. For the intrinsic evaluation, we evaluated the word embeddings' ability to capture medical semantics by measuring the semantic similarity between medical terms using four published datasets: Pedersen's dataset, Hliaoutakis's dataset, MayoSRS, and UMNSRS. For the extrinsic evaluation, we applied word embeddings to multiple downstream biomedical NLP applications, including clinical information extraction (IE), biomedical information retrieval (IR), and relation extraction (RE), with data from shared tasks.
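As an illustration of the nearest-neighbor inspection and the intrinsic evaluation described in METHODS, the following is a minimal sketch rather than the authors' implementation. It assumes cosine similarity as the similarity measure and Spearman rank correlation as the agreement measure, neither of which is specified in the abstract. The vectors, term pairs, and human ratings are invented toy values, and the helper names (`cosine`, `most_similar`, `spearman`) are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real embeddings
# would be loaded from a trained model and have far more dimensions.
EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "hyperglycemia": [0.8, 0.2, 0.1],
    "insulin": [0.7, 0.3, 0.2],
    "headache": [0.1, 0.9, 0.1],
    "migraine": [0.2, 0.8, 0.0],
    "aspirin": [0.1, 0.6, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(term, k=5):
    """Return the k vocabulary terms closest to `term`, best first."""
    scores = [
        (other, cosine(EMBEDDINGS[term], vec))
        for other, vec in EMBEDDINGS.items()
        if other != term
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical term pairs with made-up human similarity ratings.
RATED_PAIRS = [
    ("diabetes", "hyperglycemia", 3.8),
    ("headache", "migraine", 3.6),
    ("diabetes", "insulin", 3.0),
    ("headache", "aspirin", 2.5),
    ("diabetes", "migraine", 1.0),
]

model_scores = [cosine(EMBEDDINGS[a], EMBEDDINGS[b]) for a, b, _ in RATED_PAIRS]
human_scores = [rating for _, _, rating in RATED_PAIRS]

print(most_similar("diabetes", k=2))
print(round(spearman(model_scores, human_scores), 3))
```

Under these assumptions, a higher correlation means the embedding's similarity ordering agrees more closely with the human ratings; with a real benchmark, the term pairs and ratings would come from the published dataset rather than from a hand-written list.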
RESULTS: The qualitative evaluation shows that the word embeddings trained from EHR and MedLit can find more similar medical terms than those trained from GloVe and Google News. The intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings trained from EHR is closer to human experts' judgments on all four tested datasets. The extrinsic quantitative evaluation shows that the word embeddings trained on EHR achieved the best F1 score of 0.900 for the clinical IE task; no word embeddings improved the performance for the biomedical IR task; and the word embeddings trained on Google News had the best overall F1 score of 0.790 for the RE task.
CONCLUSION: Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained from EHR and MedLit capture the semantics of medical terms better, and find semantically relevant medical terms closer to human experts' judgments, than those trained from GloVe and Google News. Second, there is no consistent global ranking of word embeddings across all downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained from biomedical domain corpora do not necessarily perform better than those trained from general domain corpora on any given downstream biomedical NLP task.
Copyright © 2018 Elsevier Inc. All rights reserved.

Keywords:  Information extraction; Information retrieval; Machine learning; Natural language processing; Word embeddings

Year:  2018        PMID: 30217670      PMCID: PMC6585427          DOI: 10.1016/j.jbi.2018.09.008

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


