Literature DB >> 29236676

Supervised Learning and Knowledge-Based Approaches Applied to Biomedical Word Sense Disambiguation.

Abstract

Word sense disambiguation (WSD) is an important step in biomedical text mining, which is responsible for assigning an unequivocal concept to an ambiguous term, improving the accuracy of biomedical information extraction systems. In this work we followed supervised and knowledge-based disambiguation approaches, with the best results obtained by supervised means. In the supervised method we used bag-of-words as local features, and word embeddings as global features. In the knowledge-based method we combined word embeddings, concept textual definitions extracted from the UMLS database, and concept association values calculated from the MeSH co-occurrence counts from MEDLINE articles. Also, in the knowledge-based method, we tested different word embedding averaging functions to calculate the surrounding context vectors, with the goal to give more importance to closest words of the ambiguous term. The MSH WSD dataset, the most common dataset used for evaluating biomedical concept disambiguation, was used to evaluate our methods. We obtained a top accuracy of 95.6 % by supervised means, while the best knowledge-based accuracy was 87.4 %. Our results show that word embedding models improved the disambiguation accuracy, proving to be a powerful resource in the WSD task.

Entities: Chemical Disease Gene Species

Keywords: Biomedical text mining; information extraction; word embeddings

Mesh：

Year: 2017 PMID： 29236676 PMCID： PMC6042812 DOI： 10.1515/jib-2017-0051

Source DB: PubMed Journal: J Integr Bioinform ISSN： 1613-4516

Introduction

Nowadays new biomedical entities such as proteins, genes, mutations, and diseases are constantly being discovered, which leads to the growth of the biomedical lexicon. However, a new discovered biomedical entity is not always assigned to a new biomedical term, causing the biomedical term to have several possible senses. These ambiguous terms hinder the automatic extraction of biomedical information. Word sense disambiguation (WSD) is responsible for solving these ambiguities in textual documents, having the automatic capability to assign the correct meaning to ambiguous words given the surrounding textual context. WSD is a challenging artificial intelligence problem that has been studied for the last years [1]. Particularly in the biomedical field, there are much more ambiguous terms increasing the WSD difficulty [2], [3]. For assessing the biomedical WSD systems, some datasets containing biomedical ambiguous terms were proposed [4], [5], being the currently most used the MSH WSD dataset that was proposed by Jimeno-Yepes et al. [6]. This last dataset was created by automatic means using the unified medical language system (UMLS) Metathesaurus [7] and the medical subject headings (MeSH) [8] indexing of MEDLINE articles. In a previous work [9] we applied WSD supervised and knowledge-based methods in a subset of the MSH WSD dataset, achieving top accuracies of 94.7 and 85.1 % respectively. In this paper we extended our previous work by (1) testing more supervised classifiers, (2) combining bag-of-words as local features and word embeddings as global features in the supervised approach, (3) using different word embedding averaging functions, in the knowledge-based method, to calculate the surrounding context vectors, (4) extracting concept definitions for every ambiguous term in the MSH WSD making possible to apply our knowledge-based method to all the ambiguous terms in the MSH WSD dataset.

Related Works

Schuemie et al. [2] makes an overview of WSD in the biomedical domain until 2005. In [10] the authors showed that metadata information and well structured ontologies can play an important role to improve disambiguation. In 2011, Jimeno-Yepes et al. [6] proposed the MSH WSD dataset achieving a supervised accuracy of 93.9 %, and a knowledge-based accuracy of 83.8 %. The results of the following works are with respect of this same dataset. Some knowledge-based WSD approaches use semantic similarity measures from the UMLS achieving accuracies of 80.7 % [11] and 75.0 % [12]. Another knowledge-based method uses the UMLS semantic network achieving an accuracy of 60.3 % [13]. McInnes and Stevenson [3] explored supervised and knowledge-based WSD methods achieving a supervised state-of-the-art accuracy around 97.0 % and a knowledge-based accuracy around 78.0 %. Their supervised system relies in a vector space model, calculating the cosine between the vector representing the ambiguous term and each of the vectors representing the possible senses. Another knowledge-based method is presented in [14] achieving an accuracy of 89.1 %, where the authors developed a method to generate word-concept probabilities from a knowledge-base. The state-of-the-art accuracy by knowledge-based means is 92.2 %, and it was obtained using a method proposed by Sabbir et al. [15] that uses neural concept embeddings. Another supervised state-of-the-art accuracy of 96.0 % was achieved using a combination of unigrams and word embeddings with a SVM classifier by Jimeno-Yepes [16]. As far as we know, Iacobacci et al. [17] were the first to weight word embeddings considering the absolute word distance between a specific word and the ambiguous term in the problem of disambiguation. Word embeddings are a recent technique that maps words to high dimensional numeric vectors, being generated from unlabeled training data [18]. These models have shown to improve text mining tasks, such as named entity recognition [19] and word sense disambiguation [20], [21].

Implementation

In this work we applied supervised and knowledge-based methods to biomedical WSD. Bag-of-words features were used only in the supervised setting. Although, in both approaches, we used word embedding models generated by unlabeled MEDLINE abstracts. These word embeddings were used to calculate embedding vectors of the surrounding contexts of the ambiguous terms, which we will denominate as context vectors or context embeddings. For the knowledge-based approach we extracted concept unique identifier (CUI) textual definitions from the UMLS Metathesaurus, and calculate CUI embedding vectors, which we will denominate as concept vectors or concept embeddings. To find the most plausible meaning for a specific ambiguous term, our knowledge-based method calculates similarities between the context vector and the concept vectors weighted by CUI-CUI association values. Each step is explained more detailed below.

Dataset

We evaluated our proposed methods in the MSH WSD dataset [6], which is the most adopted for assessing biomedical WSD methods. This dataset is composed by a total of 203 ambiguous entities, of which 88 are regular terms, 106 are abbreviations, and 9 are a mix of both. Most of the ambiguous entities have only two possible senses, where a minor part of 14 terms have from three up to five senses. For each possible sense there is a maximum of 100 instances. Each instance is a MEDLINE abstract where the ambiguous term occurs. The dataset has a total of 37.090 distinct MEDLINE abstracts.

Word Embeddings

The word embedding models were generated using PubMed articles which are biomedical domain-specific. MEDLINE abstracts corresponding to the years 1900 to 2015 were used, which contained around 15 million documents involving a total of around 800 thousand unique words. Six word embedding models were trained using windows of 5, 20, and 50 words, and vector sizes of 100 and 300. For generating the word embedding vectors we used the continuous-bag-of-words model proposed by Mikolov et al. [18], implemented in the Gensim framework [22]. The word embedding models were used to calculate the context embedding vectors and the CUI embedding vectors, with the last ones only being used in the knowledge-based approach.

Context Embeddings

The context embeddings are vectors that represent the surrounding contexts of the ambiguous terms. Each surrounding context of an ambiguous term is composed by the words of the respective abstract (excluding the ambiguous term occurrences) in the MSH WSD dataset. All the context vectors were weighted using the inverse document frequency (IDF) scheme and normalized.

Supervised

In the supervised setting the term frequency (TF) component was added to the IDF weighting. However, since the cross-validation technique was used, these TF-IDF weights were fitted to a linear regression (from the labels of the current training fold), estimating new weights for each word. These final weights were the ones used to weight the word embeddings in the test fold.

Knowledge-Based

In the knowledge-based method we tested five different word embedding averaging functions: the TF-IDF weighting scheme, and four word distance decay functions using also the IDF scheme. The objective of using decay functions was to give greater importance to closest words of the ambiguous terms. The absolute word distance d between some specific word and the closest occurrence of an ambiguous term was defined as being the input of the decay function. Summarily the weighting schemes used were (IDF weighting included): Term frequency; No decay: f(d) = 1; Fractional decay: f(d) = 1/d; Exponential decay: f(d) = exp(−d); Logarithmic decay: f(d) = 1/ln(1 + d).

Supervised Learning Classification

We tested five machine learning classifiers from the scikit-learn framework [23]: decision tree (DT), k-nearest neighbor (k-NN, k = 5), logistic regression (LR), multi-layer perceptron (MLP), and support vector machine (SVM). To train the classifiers, bag-of-words features (unigrams, bigrams) and the context embeddings were used.

Knowledge-Based Method

Concept Embeddings

CUI textual definitions were extracted from UMLS knowledge sources to create the concept embedding vectors weighted by the TF-IDF scheme. All the concept vectors were normalized. https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html

CUI-CUI Association Values

We calculated CUI-CUI association values as normalized Pointwise Mutual Information (nPMI) from the MeSH co-occurrence counts in MEDLINE articles. The nPMI values are between 0 and 1, with 0 representing no association, and 1 a perfect association. So, a concept in relation to himself has a value of 1. Since there are many CUIs, and consequently much more CUI-CUI relations, we considered only the nPMI values greater or equal than 0.3. https://ii.nlm.nih.gov/MRCOC.shtml

Method

Our knowledge-based method came from the idea to compare the surrounding contexts of the ambiguous terms with the concept textual definitions, in order to find the most similar concept (meaning) given a specific context. With that in mind, we extended this baseline approach calculating a score for each possible CUI (meaning) of an ambiguous term as shown in equation (1). Accordingly to equation (1), CUI represents the target meaning, CUI represents any other related concept, is the context vector, and is the concept vector of the related concept CUI. Each context is compared to the concept textual definitions by their cosine similarities CS(, ), which are weighted by their nPMI(CUI, CUI) association values. The value N is the total number of relations considered, that is the number of non-zero nPMI values, and it is used to normalize the final score. For each possible CUI is calculated a score, and the one who get the highest score is considered the correct meaning.

Results

All the results, presented in Table 1–Table 8, were obtained dividing the dataset into five folds. The average accuracies and the respective standard deviations across the five folds are shown.

Table 1:

Supervised learning WSD accuracies (standard deviations) with bag-of-words as local features.

	U	B	U + B
DT	0.9067 (0.0030)	0.8335 (0.0045)	0.9019 (0.0018)
kNN	0.9324 (0.0017)	0.8850 (0.0043)	0.9354 (0.0019)
LR	0.9205 (0.0025)	0.8704 (0.0018)	0.9101 (0.0024)
MLP	0.9401 (0.0013)	0.9224 (0.0010)	0.9445 (0.0022)
SVM	0.9511 (0.0013)	0.9253 (0.0028)	0.9552 (0.0022)

Accuracies are the average across five folds. Five classifiers were tested. U, unigrams; B, bigrams; DT, decision tree; kNN, k-nearest neighbor (k = 5); LR, logistic regression; MLP, multi-layer perceptron; SVM, support vector machine. The top accuracy is shown in bold.

Table 8:

Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings.

	S100			S300
	W5	W20	W50	W5	W20	W50
CS	0.8318 (0.0020)	0.8407 (0.0025)	0.8468 (0.0034)	0.8355 (0.0018)	0.8477 (0.0014)	0.8501 (0.0026)
nPMI ≥ 0.8	0.8304 (0.0018)	0.8395 (0.0023)	0.8461 (0.0030)	0.8340 (0.0017)	0.8466 (0.0013)	0.8491 (0.0023)
nPMI ≥ 0.5	0.8155 (0.0029)	0.8290 (0.0041)	0.8343 (0.0042)	0.8190 (0.0030)	0.8323 (0.0042)	0.8352 (0.0040)
nPMI ≥ 0.3	0.8560 (0.0012)	0.8704 (0.0013)	0.8730 (0.0014)	0.8573 (0.0010)	0.8705 (0.0018)	0.8719 (0.0025)

IDF word embedding averaging with logarithmic decay, f(d) = 1/ln(1 + d), was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

Supervised learning WSD accuracies (standard deviations) with bag-of-words as local features. Accuracies are the average across five folds. Five classifiers were tested. U, unigrams; B, bigrams; DT, decision tree; kNN, k-nearest neighbor (k = 5); LR, logistic regression; MLP, multi-layer perceptron; SVM, support vector machine. The top accuracy is shown in bold. Supervised learning WSD accuracies (standard deviations) with word embeddings as global features. Accuracies are the average across five folds. Five classifiers were tested. S, Size; W, window; DT, decision tree; kNN, k-nearest neighbor (k = 5); LR, logistic regression; MLP, multi-layer perceptron; SVM, support vector machine. The top accuracy is shown in bold. Supervised learning WSD accuracies (standard deviations) with unigrams (bag-of-words) as local features and word embeddings as global features. Accuracies are the average across five folds. Five classifiers were tested. S, Size; W, window; DT, decision tree; kNN, k-nearest neighbor (k = 5); LR, logistic regression; MLP, multi-layer perceptron; SVM, support vector machine. The top accuracy is shown in bold. Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings. TF-IDF word embedding averaging was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold. Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings. IDF word embedding averaging with no decay, f(d) = 1, was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold. Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings. IDF word embedding averaging with fractional decay, f(d) = 1/d, was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold. Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings. IDF word embedding averaging with exponential decay, f(d) = exp(−d), was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold. Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings. IDF word embedding averaging with logarithmic decay, f(d) = 1/ln(1 + d), was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

Supervised

Supervised learning results were tested using five distinct classifiers as described in Section 3.4. Table Table 1 shows the results using only bag-of-words features (unigrams, bigrams) with a best accuracy of 95.5 % using a SVM classifier. Table 2 shows the results using only word embeddings with a best accuracy of 95.1 % using a MLP classifier. The combination of unigrams and word embeddings (3) improved the individual accuracies achieving a best accuracy of 95.6 % using a MLP classifier. One can see that the differences of using different word embedding models are not significant.

Table 2:

Supervised learning WSD accuracies (standard deviations) with word embeddings as global features.

	S100			S300
	W5	W20	W50	W5	W20	W50
DT	0.9219 (0.0017)	0.9185 (0.0030)	0.9194 (0.0033)	0.9186 (0.0013)	0.9186 (0.0025)	0.9166 (0.0017)
kNN	0.9452 (0.0024)	0.9452 (0.0024)	0.9447 (0.0017)	0.9449 (0.0019)	0.9444 (0.0023)	0.9441 (0.0025)
LR	0.9500 (0.0013)	0.9495 (0.0008)	0.9495 (0.0011)	0.9505 (0.0012)	0.9508 (0.0008)	0.9509 (0.0013)
MLP	0.9503 (0.0011)	0.9498 (0.0016)	0.9501 (0.0012)	0.9503 (0.0010)	0.9514 (0.0014)	0.9508 (0.0016)
SVM	0.9449 (0.0018)	0.9452 (0.0026)	0.9431 (0.0012)	0.9452 (0.0025)	0.9446 (0.0012)	0.9444 (0.0008)

Accuracies are the average across five folds. Five classifiers were tested. S, Size; W, window; DT, decision tree; kNN, k-nearest neighbor (k = 5); LR, logistic regression; MLP, multi-layer perceptron; SVM, support vector machine. The top accuracy is shown in bold.

Table 3:

Supervised learning WSD accuracies (standard deviations) with unigrams (bag-of-words) as local features and word embeddings as global features.

	S100			S300
	W5	W20	W50	W5	W20	W50
DT	0.9244 (0.0018)	0.9215 (0.0031)	0.9229 (0.0038)	0.9218 (0.0028)	0.9194 (0.0016)	0.9191 (0.0035)
kNN	0.9464 (0.0024)	0.9468 (0.0026)	0.9467 (0.0022)	0.9475 (0.0017)	0.9473 (0.0026)	0.9468 (0.0021)
LR	0.9515 (0.0013)	0.9514 (0.0010)	0.9515 (0.0008)	0.9519 (0.0012)	0.9524 (0.0011)	0.9520 (0.0011)
MLP	0.9557 (0.0010)	0.9556 (0.0006)	0.9555 (0.0003)	0.9544 (0.0011)	0.9550 (0.0009)	0.9545 (0.0011)
SVM	0.9490 (0.0008)	0.9486 (0.0011)	0.9481 (0.0015)	0.9499 (0.0013)	0.9496 (0.0009)	0.9482 (0.0016)

Knowledge-Based

Knowledge-based results were tested using five distinct word embedding averaging functions as described in Section 3.3.2 (Table 4–Table 8). Different thresholds (0.3, 0.5, 0.8, 1.0) for the nPMI values were imposed to filter out the most weighty relations. The threshold 1.0 is the particular case of the baseline scenario where only the cosine similarity between the context vector and the possible CUI (meaning) vector is computed. One can see that, in all the word embedding averaging functions, the threshold 0.3 produced the best accuracies proving that the addition of more related concepts leads to a better score refinement. The fractional decay averaging function obtained the highest results (Table 6), while the exponential decay averaging function obtained the lowest results (Table 7) even when compared to the baseline TF-IDF weighting (Table 4). Also, the word embedding models with higher windows achieved slightly higher accuracies. The top accuracy, 87.4 %, is presented in Table 6 and it was achieved using the fractional decay averaging function, the nPMI threshold set to 0.3, and the word embedding model with size of 100 and window of 50.

Table 4:

Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings.

	S100			S300
	W5	W20	W50	W5	W20	W50
CS	0.8144 (0.0012)	0.8254 (0.0010)	0.8321 (0.0026)	0.8181 (0.0010)	0.8319 (0.0015)	0.8337 (0.0039)
nPMI ≥ 0.8	0.8132 (0.0014)	0.8243 (0.0011)	0.8314 (0.0024)	0.8168 (0.0008)	0.8312 (0.0016)	0.8332 (0.0036)
nPMI ≥ 0.5	0.8005 (0.0041)	0.8152 (0.0038)	0.8197 (0.0038)	0.8030 (0.0037)	0.8174 (0.0031)	0.8209 (0.0034)
nPMI ≥ 0.3	0.8430 (0.0022)	0.8573 (0.0022)	0.8600 (0.0010)	0.8446 (0.0026)	0.8566 (0.0025)	0.8582 (0.0015)

TF-IDF word embedding averaging was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

Table 6:

Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings.

	S100			S300
	W5	W20	W50	W5	W20	W50
CS	0.8415 (0.0022)	0.8473 (0.0018)	0.8502 (0.0033)	0.8457 (0.0019)	0.8533 (0.0019)	0.8533 (0.0037)
nPMI ≥ 0.8	0.8395 (0.0022)	0.8459 (0.0024)	0.8493 (0.0032)	0.8438 (0.0019)	0.8515 (0.0023)	0.8518 (0.0039)
nPMI ≥ 0.5	0.8234 (0.0013)	0.8348 (0.0012)	0.8376 (0.0015)	0.8267 (0.0028)	0.8377 (0.0030)	0.8396 (0.0031)
nPMI ≥ 0.3	0.8617 (0.0017)	0.8720 (0.0016)	0.8744 (0.0021)	0.8622 (0.0020)	0.8730 (0.0025)	0.8736 (0.0021)

IDF word embedding averaging with fractional decay, f(d) = 1/d, was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

Table 7:

Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings.

	S100			S300
	W5	W20	W50	W5	W20	W50
CS	0.8259 (0.0013)	0.8270 (0.0031)	0.8278 (0.0036)	0.8278 (0.0018)	0.8302 (0.0024)	0.8276 (0.0022)
nPMI ≥ 0.8	0.8236 (0.0011)	0.8255 (0.0032)	0.8264 (0.0033)	0.8255 (0.0021)	0.8283 (0.0028)	0.8264 (0.0023)
nPMI ≥ 0.5	0.8057 (0.0022)	0.8137 (0.0007)	0.8150 (0.0027)	0.8092 (0.0035)	0.8162 (0.0017)	0.8168 (0.0021)
nPMI ≥ 0.3	0.8378 (0.0022)	0.8458 (0.0032)	0.8459 (0.0030)	0.8404 (0.0029)	0.8469 (0.0027)	0.8478 (0.0038)

IDF word embedding averaging with exponential decay, f(d) = exp(−d), was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

Discussion

In this paper we extended our previous work [9] by applying more settings to the supervised and knowledge-based approaches. Furthermore, we extracted textual definitions for every CUI included in the MSH WSD dataset, making it possible to apply our knowledge-based method to the entire dataset. As expected, the supervised classifiers obtained the highest results with a top accuracy around 95.6 %, while on the other hand our knowledge-based approach obtained a best accuracy around 87.4 %. Our supervised accuracy is very close to the state-of-the-art accuracy of 96.0 %, which was also obtained using a combination of unigrams and word embeddings with a SVM classifier [16]. Our knowledge-based method and results are comparable with other proposed knowledge-based approaches. In [6], Jimeno-Yepes et al. proposed the MSH WSD dataset, and tested four knowledge-based methods, where the automatic extracted corpus (AEC) method obtained the best accuracy around 84.5 %. McInnes and Pedersen [12] developed a knowledge-based method based on semantic similarity measures between UMLS concepts, and obtained an accuracy of 75 % in the same dataset. In [14], the authors used word-concept probabilities achieving a knowledge-based accuracy around 89 %. Our method is similar to the one proposed by Tulkens et al. [24], which also compared concept representations with the representations of the context of ambiguous terms, who obtained an accuracy of 84.0 % on the same dataset. As far as we know, the knowledge-based state-of-the-art accuracy in the MSH WSD dataset is 92.2 %, and it was obtained using a method proposed by Sabbir et al. [15] that uses neural word/concept embeddings. Our work showed that the word embeddings and their averaging function plays a key role in the WSD problem.

Table 5:

Knowledge-based WSD accuracies (standard deviations) using CUI association values (MeSH term co-occurrences), CUI definitions (UMLS), and word embeddings.

	S100			S300
	W5	W20	W50	W5	W20	W50
CS	0.8164 (0.0024)	0.8286 (0.0011)	0.8341 (0.0024)	0.8203 (0.0017)	0.8352 (0.0023)	0.8365 (0.0034)
nPMI ≥ 0.8	0.8154 (0.0024)	0.8277 (0.0008)	0.8334 (0.0020)	0.8193 (0.0012)	0.8343 (0.0020)	0.8357 (0.0028)
nPMI ≥ 0.5	0.8019 (0.0043)	0.8178 (0.0031)	0.8236 (0.0040)	0.8057 (0.0043)	0.8203 (0.0032)	0.8245 (0.0025)
nPMI ≥ 0.3	0.8458 (0.0023)	0.8600 (0.0018)	0.8635 (0.0007)	0.8471 (0.0019)	0.8598 (0.0014)	0.8611 (0.0010)

IDF word embedding averaging with no decay, f(d) = 1, was used to calculate the surrounding context vectors. Accuracies are the average across five folds. S, Size; W, window; CS, cosine similarity between term context vector and concept vector only; nPMI, normalized pointwise mutual information; nPMI ≥ thresh, cosine similarity plus related concepts with a nPMI value higher than the threshold. The top accuracy is shown in bold.

4 in total

1. deepBioWSD: effective deep neural word sense disambiguation of biomedical text data.

Authors: Ahmad Pesaranghader; Stan Matwin; Marina Sokolova; Ali Pesaranghader
Journal: J Am Med Inform Assoc Date: 2019-05-01 Impact factor: 4.497

2. Extraction of chemical-protein interactions from the literature using neural networks and narrow instance representation.

Authors: Rui Antunes; Sérgio Matos
Journal: Database (Oxford) Date: 2019-01-01 Impact factor: 3.451

3. An Efficient Parallelized Ontology Network-Based Semantic Similarity Measure for Big Biomedical Document Clustering.

Authors: Meijing Li; Tianjie Chen; Keun Ho Ryu; Cheng Hao Jin
Journal: Comput Math Methods Med Date: 2021-11-09 Impact factor: 2.238

4. Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets.

Authors: Shikhar Vashishth; Denis Newman-Griffis; Rishabh Joshi; Ritam Dutt; Carolyn P Rosé
Journal: J Biomed Inform Date: 2021-08-12 Impact factor: 6.317

4 in total