
Comparing general and specialized word embeddings for biomedical named entity recognition.

Rigo E Ramos-Vargas, Israel Román-Godínez, Sulema Torres-Ramos.

Abstract

Increased interest in the use of word embeddings as word representations for biomedical named entity recognition (BioNER) has highlighted the need for evaluations that aid in selecting the best word embedding to use. One common criterion for selecting a word embedding is the type of source from which it is generated: general (e.g., Wikipedia, Common Crawl) or specific (e.g., biomedical literature). Using specific word embeddings for the BioNER task has been strongly recommended, on the grounds that they provide better coverage and semantic relationships among medical entities. To the best of our knowledge, most studies have focused on improving BioNER task performance by, on the one hand, combining several features extracted from the text (for instance, linguistic, morphological, character embedding, and word embedding itself) and, on the other, testing several state-of-the-art named entity recognition algorithms. Such studies, however, do not pay great attention to the influence of the word embeddings, and do not facilitate observing their real impact on the BioNER task. For this reason, the present study evaluates three well-known NER algorithms (CRF, BiLSTM, BiLSTM-CRF) on two corpora (DrugBank and MedLine) using two classic word embeddings, GloVe Common Crawl (general) and Pyysalo PM + PMC (specific), as the only features. Furthermore, three contextualized word embeddings (ELMo, Pooled Flair, and Transformer) are compared in their general and specific versions. The aim is to determine whether general embeddings can perform better than specialized ones on the BioNER task. To this end, four experiments were designed. The first set out to identify the combination of classic word embedding, NER algorithm, and corpus that results in the best performance. The second evaluated the effect of corpus size on performance. The third assessed the semantic cohesiveness of the classic word embeddings and their correlation with several gold standards, while the fourth evaluated the performance of general and specific contextualized word embeddings on the BioNER task. Results show that the classic general word embedding GloVe Common Crawl performed better on the DrugBank corpus, despite having lower word coverage and weaker internal semantic relationships than the classic specific word embedding, Pyysalo PM + PMC, while among the contextualized word embeddings the specific ones gave the best results. We conclude, therefore, that when classic word embeddings are used as features for the BioNER task, general ones can be considered a good option. When contextualized word embeddings are used, on the other hand, the specific ones are the best option.
© 2021 Ramos-Vargas et al.
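
The word coverage and gold-standard correlation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy vectors, and the toy similarity scores below are all hypothetical, and the use of cosine similarity with Spearman rank correlation is an assumption, since the abstract does not name the measures used.

```python
import math

def coverage(corpus_tokens, embedding):
    """Fraction of distinct corpus tokens that have a vector in the embedding."""
    types = set(corpus_tokens)
    if not types:
        return 0.0
    return sum(1 for t in types if t in embedding) / len(types)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def gold_standard_correlation(embedding, gold_pairs):
    """Spearman correlation between cosine similarity and human scores.

    gold_pairs holds (word1, word2, human_score) triples; pairs containing
    an out-of-vocabulary word are skipped. Returns None if fewer than two
    pairs remain.
    """
    model, human = [], []
    for w1, w2, score in gold_pairs:
        if w1 in embedding and w2 in embedding:
            model.append(cosine(embedding[w1], embedding[w2]))
            human.append(score)
    if len(model) < 2:
        return None
    return pearson(ranks(model), ranks(human))

# Hypothetical toy data, for illustration only.
toy_embedding = {
    "aspirin": [1.0, 0.0],
    "ibuprofen": [0.9, 0.1],
    "kidney": [0.0, 1.0],
    "renal": [0.2, 0.8],
}
toy_corpus = ["aspirin", "kidney", "warfarin", "aspirin"]
toy_gold = [
    ("aspirin", "ibuprofen", 3.8),
    ("kidney", "renal", 3.5),
    ("aspirin", "kidney", 1.0),
    ("aspirin", "warfarin", 3.0),  # skipped: "warfarin" has no vector
]

toy_coverage = coverage(toy_corpus, toy_embedding)
toy_rho = gold_standard_correlation(toy_embedding, toy_gold)
```

With real data, `embedding` would be a word-to-vector mapping loaded from a pretrained file, and `gold_pairs` would come from a human-rated similarity or relatedness benchmark.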


Keywords:  BiLSTM-CRF; BioNER; DrugBank; ELMo embeddings; GloVe Common Crawl; MedLine; Pooled Flair embeddings; Pyysalo PM + PMC; Transformer embeddings; Word embeddings

Year: 2021; PMID: 33817030; PMCID: PMC7959609; DOI: 10.7717/peerj-cs.384

Source DB: PubMed; Journal: PeerJ Comput Sci; ISSN: 2376-5992

