BIOSSES: a semantic sentence similarity estimation system for the biomedical domain.

Gizem Sogancioglu^1,2, Hakime Öztürk¹, Arzucan Özgür¹.

Abstract

MOTIVATION: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.
METHODS: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.
RESULTS: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric.
AVAILABILITY AND IMPLEMENTATION: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/ . CONTACT: gizemsogancioglu@gmail.com or arzucan.ozgur@boun.edu.tr.

Year: 2017 PMID: 28881973 PMCID: PMC5870675 DOI: 10.1093/bioinformatics/btx238

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Semantic text similarity estimation is a research problem that aims to calculate the similarities among texts based on their meanings and semantic content, rather than their shallow or syntactic representation. Measures of semantic text similarity play a crucial role in many natural language processing (NLP) applications such as machine translation (Finch ), automatic summarization (Wang ), and question answering (Jeon ). Several approaches for semantic sentence similarity computation have been proposed for generic English. These approaches are in general based on computing word-level similarities and combining these to obtain sentence-level similarity scores. Corpus-based measures such as Latent Semantic Analysis (LSA), knowledge-based measures that utilize general-domain ontologies including WordNet (Miller, 1995), and string-based measures such as edit distance have been effectively used for word-level similarity computation (Li ; Liu ; Mihalcea ). The SemEval Semantic Textual Similarity (STS) task series, which has been conducted annually since 2012, has also boosted research in this area (Agirre , 2013, 2014, 2016; Agirrea ). The manually annotated training and test datasets provided by STS enabled the development and comparison of different approaches for semantic text similarity estimation. Supervised machine learning methods that integrate different features such as WordNet and corpus-based features, syntactic features, and features based on the distributed dense vector representation of words were shown to be effective for semantic text similarity computation (Han ; Šarić ; Sultan ). Publicly available tools such as ADW (Align, Disambiguate and Walk) (Pilehvar ; Pilehvar and Navigli, 2015) and SEMILAR (Semantic Similarity Toolkit) (Rus ) for generic domain sentence semantic similarity computation have also been developed.
ADW is a knowledge-based system that uses the Topic-sensitive PageRank algorithm (Haveliwala, 2002) over a graph generated using WordNet to model the similarity between linguistic items of different granularity such as words, sentences, and documents (Pilehvar ; Pilehvar and Navigli, 2015). ADW was evaluated on the SemEval 2012 data set and was shown to outperform the top three ranked systems (Pilehvar ). SEMILAR is a toolkit that implements several measures based on WordNet or LSA (Rus ). Different algorithms such as the optimal matching and the quadratic assignment problem algorithm are applied for assessing the similarity of sentence pairs by using the calculated word-level similarities (Rus ). The general domain state-of-the-art systems ADW and SEMILAR are considered as baseline models in our study.

Assessing the similarity between two sentences is an important problem in the biomedical domain as well, due to the huge amount of information available in textual format, which renders effective retrieval, extraction and summarization of information vital. The excessive use of domain-specific language along with the rich variety of expressions and inadequate training corpora make measuring sentence similarity in the biomedical domain a difficult task. Therefore, semantic text similarity measures to be used in biomedical NLP studies call for domain-specific approaches including the use of biomedical domain-specific corpora or biomedical knowledge sources. As an example, consider the following two sentences taken from (Wang ) and (Fu ), respectively.

S1: This form of necrosis, also termed necroptosis, requires the activity of receptor-interacting protein kinase 1 and its related kinase 3.
S2: Moreover, other reports have also shown that necroptosis could be induced via modulating RIP1 and RIP3.

The example sentences S1 and S2 are on the same topic and are similar to each other. The ‘receptor-interacting protein kinase 1’ in S1 is the same concept as ‘RIP1’ in S2; likewise ‘kinase 3’ and ‘RIP3’ refer to the same biomedical term. Domain-independent semantic text similarity measures developed for generic English can neither recognize these concepts nor give high weight to them while estimating the similarity between the sentences. These examples illustrate that new approaches that can handle both biomedical and domain-independent words are needed for sentence similarity computation in the biomedical domain.

Garla and Brandt (2012) compared knowledge-based (ontology-based) and distributional (corpus-based) similarity measures and observed that knowledge-based measures are more effective for semantic similarity computation in the biomedical domain. Most previous work on semantic similarity in the biomedical domain focused on computing ontology-based similarity between terms (Aouicha and Taieb, 2016; Harispe ; Mabotuwana ; Pedersen ; Pesquita ; Sánchez and Batet, 2011). Several studies showed that the use of biomedical ontologies to measure semantic similarity provided valuable information for a number of tasks performed in this domain such as similarity computation between gene products (Lord ), scoring protein–protein interactions (Jain and Bader, 2010) as well as disambiguation of biomedical terms (McInnes and Pedersen, 2013). To the best of our knowledge, there is neither a manually annotated benchmark data set, nor a comprehensive study on sentence-level semantic similarity computation in the biomedical domain.
Although sentence-level semantic similarity computation has recently been used as a component in a text-mining system for evidence-based medicine (Hassanzadeh ) and for biomedical question answering (Papagiannopoulou ), these studies used general domain semantic similarity computation methods and did not perform any domain-specific adaptation. In this study, we show that general domain state-of-the-art sentence similarity computation systems fail to effectively model sentence similarity in the biomedical domain. We propose new approaches specifically adapted for the biomedical domain that can be categorized into four areas: string similarity measures, ontology based measures, a distributional vector model and a supervised method combining these different measures. Besides a general domain ontology, namely WordNet (Miller, 1995), we also exploit a biomedical ontology, UMLS (Unified Medical Language System) (Bodenreider, 2004). The distributional vector representations of sentences are learned using a large biomedical corpus of full text articles. In addition, we present a manually annotated benchmark data set for biomedical sentence similarity estimation, which can be used for training and evaluation in future studies in this area.

2 System and methods

2.1 BIOSSES dataset

Since there are no suitable datasets that comprise sentence pairs from the biomedical domain, we created a benchmark dataset for biomedical sentence similarity estimation. The dataset comprises 100 sentence pairs, in which each sentence was selected from the TAC (Text Analysis Conference) Biomedical Summarization Track Training Dataset containing articles from the biomedical domain. The TAC dataset consists of 20 reference articles and, for each reference article, between 12 and 20 citing articles. We selected the BIOSSES sentence pairs from citing sentences, i.e. sentences that have a citation to a reference article, instead of choosing random sentence pairs, the majority of which would be unrelated. Our motivation to use the TAC data set was that both semantically related and irrelevant sentence pairs occur in the annotation files. Some of the citing sentences cite the same reference article for similar reasons, such as referring to a recent study on protein–protein interactions. Sentences citing the same reference article for a similar reason in general have some degree of semantic similarity. On the other hand, there are also some citing sentences that cite reference articles that are written about different topics or research fields (e.g. one refers to a study on microbiology, the other mentions research on embryology). Such citing sentences are expected to have lower or no semantic similarity. Therefore, it was possible to obtain sentence pairs with different similarity degrees by using this approach over the TAC dataset.

The sentence pairs were evaluated by five different human experts who judged their similarity and gave scores ranging from 0 (no relation) to 4 (equivalent). The score range was described based on the guidelines of SemEval 2012 Task 6 on STS (Agirre ). Besides the annotation instructions, example sentences from the biomedical literature were provided to the annotators for each of the similarity degrees. These example sentence pairs, which are scored between 0 and 4, are shown in Table 1.

Table 1.

Example annotations

Sentence 1	Sentence 2	Comment	Score
Here we show that both C/EBPα and NFI-A bind the region responsible for miR-223 upregulation upon RA treatment.	Isoleucine could not interact with ligand fragment 44, which contains amino group.	The two sentences are on different topics.	0
Membrane proteins are proteins that interact with biological membranes.	Previous studies have demonstrated that membrane proteins are implicated in many diseases because they are positioned at the apex of signaling pathways that regulate cellular processes.	The two sentences are not equivalent, but are on the same topic.	1
This article discusses the current data on using anti-HER2 therapies to treat CNS metastasis as well as the newer anti-HER2 agents.	Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis.	The two sentences are not equivalent, but share some details.	2
We were able to confirm that the cancer tissues had reduced expression of miR-126 and miR-424, and increased expression of miR-15b, miR-16, miR-146a, miR-155 and miR-223.	A recent study showed that the expression of miR-126 and miR-424 had reduced by the cancer tissues.	The two sentences are roughly equivalent, but some important information differs/missing.	3
Hydrolysis of β-lactam antibiotics by β-lactamases is the most common mechanism of resistance for this class of antibacterial agents in clinically important Gram-negative bacteria.	In Gram-negative organisms, the most common β-lactam resistance mechanism involves β-lactamase-mediated hydrolysis resulting in subsequent inactivation of the antibiotic.	The two sentences are completely or mostly equivalent, as they mean the same thing.	4

Table 2 shows the Pearson correlation of the scores of each annotator with respect to the average scores of the remaining four annotators. There is a strong association among the scores of the annotators. The lowest correlation is 0.902, which can be considered as an upper bound for an algorithmic measure evaluated on this dataset.

Table 2.

Correlation scores among annotators

	Correlation r
Annotator A	0.952
Annotator B	0.958
Annotator C	0.917
Annotator D	0.902
Annotator E	0.941

The distribution of the scores by each of the annotators is illustrated in Figure 1. The distribution suggests that there are enough instances for each of the similarity degrees in our dataset.

Fig. 1

Distribution of the similarity scores in the dataset

The BIOSSES dataset of sentence pairs and the annotators’ scores are publicly available at http://tabilab.cmpe.boun.edu.tr/BIOSSES/DataSet.html.

2.2 String similarity measures

We evaluated the character- and term-based string similarity approaches briefly described in the following subsections using the annotated dataset. Simple pre-processing steps consisting of removal of the punctuation marks (Dot, Comma, Colon, Exclamation Mark, Semicolon, Slash Mark, Dash, Question Mark) and stop-words (http://www.ranks.nl/stopwords) were applied to the sentence pairs before applying the similarity algorithms. The implementations of the string similarity methods in the SimMetrics Library (https://github.com/Simmetrics/simmetrics) were used.
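The pre-processing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word set is a tiny stand-in for the ranks.nl list, and lowercasing and replacing punctuation with a space (rather than deleting it) are choices made here, not details given in the text.

```python
# Punctuation marks named in the text: dot, comma, colon, exclamation mark,
# semicolon, slash, dash, question mark.
PUNCTUATION = ".,:!;/-?"

# Assumption: a tiny illustrative stop-word set, NOT the ranks.nl list.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "is", "that", "to"}


def preprocess(sentence):
    """Strip the listed punctuation marks and drop stop-words."""
    # Assumption: punctuation is replaced by a space and text is lowercased.
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    tokens = sentence.translate(table).lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]


tokens = preprocess("Necroptosis requires the activity of RIP1 and RIP3.")
```

The resulting token list is what the string-based measures in the following subsections would operate on.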

Qgram similarity

Qgram similarity (Ukkonen, 1992) is typically used in approximate string matching by ‘sliding’ a window of length q over the characters of a string to create ‘q’ length grams for matching. A match is then rated as the number of q-gram matches within the second string over the possible q-grams obtained from the first string.
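A minimal sketch of the rating described above, assuming q = 3 and no padding of the strings. The function names are hypothetical, and the SimMetrics implementation used in the study may differ in tokenization and normalization.

```python
from collections import Counter


def qgrams(text, q=3):
    """Slide a window of length q over the characters of text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_similarity(first, second, q=3):
    """Share of the first string's q-grams that also occur in the second."""
    grams_first = qgrams(first, q)
    grams_second = qgrams(second, q)
    total = sum(grams_first.values())
    if total == 0:
        return 0.0  # first string is shorter than q
    # Counter intersection keeps the minimum count of each shared q-gram.
    matches = sum((grams_first & grams_second).values())
    return matches / total
```

Note that this rating is asymmetric: swapping the two strings can change the score when they yield different numbers of q-grams.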

Block distance

Block distance (Krause, 1987), also known as Manhattan Distance, computes the distance between two points by summing the absolute differences of their corresponding components. The equation for block distance between a point A = (A_1, A_2, …, A_n) and a point B = (B_1, B_2, …, B_n) in n-dimensional space is:

$$d(A, B) = \sum_{i=1}^{n} |A_i - B_i| \qquad (1)$$

In our case, A_i refers to the count of term i in sentence A and B_i refers to the count of term i in sentence B.
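As a small illustration of block distance over term counts, the sketch below builds the count vectors implicitly from two token lists. The function name is hypothetical.

```python
from collections import Counter


def block_distance(tokens_a, tokens_b):
    """Sum of absolute differences between the term counts of two sentences."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocabulary = set(counts_a) | set(counts_b)
    # A Counter returns 0 for a term that is absent from a sentence.
    return sum(abs(counts_a[term] - counts_b[term]) for term in vocabulary)
```

For the token lists `["a", "b", "b"]` and `["b", "c"]`, each of the three terms differs by one occurrence, so the distance is 3.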

Jaccard similarity

Jaccard similarity (Jaccard, 1908) measures the similarity between two sets and is computed as the number of common terms over the number of unique terms in both sets (Equation 2):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2)$$

In our case, set A consists of the unique words in the first sentence and set B consists of the unique words in the second sentence.

Overlap coefficient

Overlap coefficient (Lawlor, 1980) is a similarity measure that differs from Jaccard similarity in that the number of common terms is divided by the size of the smaller of the two sets (Equation 3):

$$O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)} \qquad (3)$$
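The two set-based measures above differ only in their denominator, which a short sketch makes concrete. Returning 0.0 for empty input is a convention chosen here, not something stated in the text.

```python
def jaccard(tokens_a, tokens_b):
    """Common unique terms over all unique terms of both sentences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sentences score 0
    return len(set_a & set_b) / len(union)


def overlap_coefficient(tokens_a, tokens_b):
    """Common unique terms over the size of the smaller set."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0  # assumption: an empty sentence scores 0
    return len(set_a & set_b) / smaller


s1 = ["necroptosis", "requires", "activity", "rip1", "rip3"]
s2 = ["necroptosis", "induced", "via", "modulating", "rip1", "rip3"]
```

With these token lists, three of eight unique terms are shared, so the Jaccard score is 3/8, while the overlap coefficient divides by the five terms of the shorter sentence and gives 3/5.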

Levenshtein distance

Levenshtein distance (Levenshtein, 1966) is a simple edit distance, which consists of the operations for transforming one of the given strings to the other, where an operation is defined as an insertion, deletion, substitution or copying of a character. The distance is defined as the minimum number of operations required to change one string into the other. The Levenshtein distance and block distance values, once normalized to the range [0, 1], are converted into similarity values by subtracting them from 1.
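The edit distance can be computed with the standard dynamic-programming recurrence, sketched below. Dividing by the length of the longer string before subtracting from 1 is an assumption made here to keep the similarity in [0, 1]; the text does not state how the library normalizes the distance.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1  # copying a character is free
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or copy
        previous = current
    return previous[-1]


def levenshtein_similarity(source, target):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, target) / longest
```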

2.3 Distributional vector model

Paragraph vector model

The word2vec model (Mikolov ), which constructs distributed representations of words, has been widely adopted in many recent NLP tasks, including tasks in the biomedical domain (Aydin ; Chiu ; Moen and Ananiadou, 2013; Muneeb ). In this model, a large amount of unlabeled text data is used in training to represent words in a new low-dimensional space as real-valued vectors. The model’s ability to consider the word context allows us to easily relate word vectors in a semantic way (e.g. similar words have similar vectors). Word2vec is an unsupervised neural network-based learning model with two variants, namely Skip-Gram and Continuous-Bag-of-Words (CBOW). In the CBOW approach, the words are predicted based on their surrounding words ignoring the word order, whereas in Skip-Gram, a word is used to predict its surrounding words while considering how distant they are in the text.

The paragraph vector model was presented following the word2vec model as a way to describe sentences (Le and Mikolov, 2014). We utilized the paragraph vector method to capture semantic information from the texts. The difference between this model and the word2vec model is that the paragraphs are also mapped to distributed vector representations and used to predict the next word in the given context together with the distributed vector representations of the words in the paragraph. We trained a paragraph vector model by using a subset of the Open Access Subset of PubMed Central (http://www.ncbi.nlm.nih.gov/pmc/) dataset, which comprises ∼4G text data of ∼37K articles. The size of the output sentence vectors was set to 100 and the Skip-Gram approach was employed.

2.4 Ontology-based similarity

Ontologies are widely used for measuring semantic similarity between concepts/terms, since their representation links terms semantically. Due to the fact that a sentence consists of a set of words, we can utilize ontology-based word-level similarity measures to compute semantic similarity scores between sentences. To make our proposed algorithms clearer, we first briefly introduce the WordNet (Section 2.4.1) and the UMLS ontologies (Section 2.4.2), then describe the ontology-based word-level similarity algorithms (Section 2.4.3). Finally, we present our proposed approaches (Section 2.4.4), which exploit the word-level algorithms described in Section 2.4.3 to obtain sentence-level similarity scores.

2.4.1 WordNet

WordNet (Miller, 1995) is a large English lexical thesaurus that has been widely used for computing semantic similarity by using the measures described in Section 2.4.3. According to the structure of WordNet, each word consists of a form ‘f’, which is a string, and a sense ‘s’ represented by a set of synonyms that have that meaning. Words in WordNet are categorized according to their syntactic categories such as verb, noun, adjective, and adverb. Since the same word can be interpreted as having different part-of-speech (POS) tags according to the contexts it occurs in, this syntactic categorization allows the same word to be stored separately in the taxonomy under each of its possible POS tags. In addition, words and word senses are connected to each other with various types of relationships. The types of relationships most commonly used for measuring semantic similarity are listed below:

Synonymy is the basic relation type in WordNet, since sets of synonyms (synsets) are used to represent word senses.
Hyponymy and hypernymy represent the hierarchical relations between a word and its sub-name and super-name, respectively.
Antonymy represents the relation between a name and its opposite-name.

2.4.2 UMLS

UMLS (Bodenreider, 2004) is a comprehensive thesaurus consisting of >1.7 million biomedical concepts. It comprises vocabulary sources on specialized topics such as MeSH, consisting of medical subject headings, OMIM, containing genetic knowledge bases, and SnomedCT, which consists of the concepts belonging to clinical repositories. Since UMLS consists of various terminology sources, some concepts can overlap. In other words, the same concept can belong to different sources. To be able to use multiple sources as a single resource in the UMLS Metathesaurus, concept unique identifiers are assigned to the concepts.

2.4.3 Word-level similarity methods

The rich semantic information carried by ontologies enables the computation of semantic similarity scores among concepts. In this subsection, we briefly describe the ontology-based path-based and information content (IC)-based similarity metrics that are employed in our proposed sentence-level similarity computation method. Path-based approaches utilize the structure of the taxonomy, whereas IC-based approaches use extra information that is learned from corpus statistics.

The Path algorithm (Rada ) measures the semantic similarity of two concepts by calculating the shortest path between them in the taxonomy. The intuition behind the algorithm is that the shorter the path between concepts in a hierarchy, the more similar they are:

$$\mathrm{sim}_{path}(c_1, c_2) = 2 \cdot depth_{max} - len(c_1, c_2) \qquad (4)$$

In Equation 4, the len function computes the length of the shortest path between concepts c1 and c2, and depthmax refers to the maximum depth of the taxonomy. The shortest path between c1 and c2 counts all nodes between them, including c1 and c2 themselves. For example, given the sample taxonomy provided in Figure 2, len(c1,c2) is equal to 4 for the terms ‘protein’ and ‘beta-lactams’ and depthmax is 5, so the Path score is computed as:

$$\mathrm{sim}_{path}(\text{protein}, \text{beta-lactams}) = 2 \cdot 5 - 4 = 6 \qquad (5)$$

Since the maximum depth of the taxonomy is constant, this measure does not take into consideration the specificity of the concepts.

Fig. 2

Hierarchical relationships among a small subset of proteins and antibiotics

Similarly, the Leacock and Chodorow (LCH) measure (Leacock and Chodorow, 1998) takes the maximum depth of the taxonomy into account and the similarity is determined as:

$$\mathrm{sim}_{LCH}(c_1, c_2) = -\log \frac{len(c_1, c_2)}{2 \cdot depth_{max}} \qquad (6)$$

Unlike the Path and LCH measures, the Wu and Palmer (WP) measure (Wu and Palmer, 1994) accounts for the specificity of the concepts through the concept depth feature. WP similarity between concepts c1 and c2 is measured as twice the depth of the lowest common subsumer (lcs) of the given concepts over the sum of the depths of c1 and c2:

$$\mathrm{sim}_{WP}(c_1, c_2) = \frac{2 \cdot depth(lcs(c_1, c_2))}{depth(c_1) + depth(c_2)} \qquad (7)$$

The effect of concept depth can be illustrated by applying the WP and the Path metrics to two concept pairs from the sample taxonomy in Figure 2: (cephem, ampicillin) and (antibiotic, enzyme). Although the Path algorithm gives the same semantic similarity score for the two pairs, which have different specificity, WP estimates that cephem and ampicillin are more similar than antibiotic and enzyme. The result of the WP metric is reasonable for this example, since a path of the same length between deeper concepts corresponds to a smaller semantic distance.

Both the concept depth feature and the frequency of the concept in a corpus give an idea about the specificity of the concept. Motivated by these facts, IC is used for measuring the semantic similarity between concepts. The IC of a concept is defined as the negative log likelihood of encountering concept c in a given corpus:

$$IC(c) = -\log P(c) \qquad (12)$$

The probability of encountering concept c is given as:

$$P(c) = \frac{freq(c)}{N} \qquad (13)$$

In Equation (13), N denotes the total number of words in the corpus used, while freq(c) is the number of occurrences of concept c in the corpus. The Resnik (Resnik, 1995) similarity measure is determined as the IC of the lowest common subsumer of concepts c1 and c2:

$$\mathrm{sim}_{Resnik}(c_1, c_2) = IC(lcs(c_1, c_2)) \qquad (14)$$

The Lin (Lin, 1998) similarity between concepts c1 and c2 is calculated as twice the IC of the lowest common subsumer of the concepts over the sum of the ICs of c1 and c2:

$$\mathrm{sim}_{Lin}(c_1, c_2) = \frac{2 \cdot IC(lcs(c_1, c_2))}{IC(c_1) + IC(c_2)} \qquad (15)$$

Jiang and Conrath (JCN) (Jiang and Conrath, 1997) measures the semantic similarity between concepts c1 and c2 as in Equation (16), which uses the ICs of the concepts and their lowest common subsumer:

$$\mathrm{sim}_{JCN}(c_1, c_2) = \frac{1}{IC(c_1) + IC(c_2) - 2 \cdot IC(lcs(c_1, c_2))} \qquad (16)$$
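To make the contrast between path-based and IC-based measures concrete, the sketch below evaluates them on a toy taxonomy. Everything in it is hypothetical: the taxonomy is not the one in Figure 2, the corpus frequencies are invented for illustration, and the formulas are the commonly used forms of LCH, WP, Resnik, Lin and JCN, which may differ in detail from the implementations used in the study.

```python
import math

# Hypothetical toy taxonomy (child -> parent); NOT the paper's Figure 2.
PARENT = {
    "substance": None,
    "protein": "substance",
    "enzyme": "protein",
    "antibiotic": "substance",
    "beta-lactam": "antibiotic",
    "penicillin": "beta-lactam",
    "cephem": "beta-lactam",
}
# Hypothetical frequencies; a concept's count includes its descendants.
FREQ = {"substance": 100, "protein": 40, "enzyme": 15, "antibiotic": 60,
        "beta-lactam": 30, "penicillin": 12, "cephem": 8}
N = 100


def ancestors(concept):
    """Chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    return len(ancestors(concept))  # the root has depth 1


def lcs(c1, c2):
    """Lowest common subsumer: first ancestor of c1 that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    return next(a for a in ancestors(c1) if a in above_c2)


def path_len(c1, c2):
    """Number of nodes on the shortest path, both endpoints included."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1


DEPTH_MAX = max(depth(c) for c in PARENT)


def sim_lch(c1, c2):
    return -math.log(path_len(c1, c2) / (2 * DEPTH_MAX))


def sim_wp(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def ic(concept):
    return -math.log(FREQ[concept] / N)


def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))


def sim_lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))


def sim_jcn(c1, c2):
    distance = ic(c1) + ic(c2) - 2 * ic(lcs(c1, c2))
    return math.inf if distance == 0 else 1 / distance
```

In this toy taxonomy the pairs (cephem, penicillin) and (antibiotic, protein) have the same path length of 3 nodes, so any measure built only on path length and maximum depth scores them identically, whereas `sim_wp` returns 0.75 for the deeper pair and 0.5 for the shallower one.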

2.4.4 Sentence-level ontology-based methods

In this section, we introduce our sentence-level ontology-based methods, namely the WordNet-based Similarity Method (WBSM), the UMLS-based Similarity Method (UBSM) and the combined ontology method (COM). The general design of these approaches is shown in Figure 3. There are two main tasks in the general flow: the calculation of word-level similarities (Section 2.4.3) and the adaptation of word-level similarities to obtain a sentence-level score (the sentence-level similarity method). Although the three proposed methods use the same algorithms for these tasks, they differ from each other by using different ontologies for word-level similarity calculation.

Fig. 3

Sentence-level similarity module

Inspired by the study of Li , we developed a sentence-level similarity method, which is an algorithm to adapt word-level similarities to sentence-level. The algorithm is explained below using a walk-through example.

A walk-through example

S1: Necroptosis requires the activity of RIP1 and RIP3.
S2: Necroptosis could be induced via modulating RIP1 and RIP3.

Given two sentences S1 and S2, dictionary D is constructed, which consists of the union of the unique words from the two sentences. D for the example sentences S1 and S2 is:

D: {Necroptosis, requires, the, activity, of, RIP1, and, RIP3, could, be, induced, via, modulating}

D is used to build the semantic vectors D1 and D2 for S1 and S2, respectively, which have the same dimension as the dictionary. For instance, in order to build a semantic vector for S1, each word in the dictionary is compared with every word in S1 and the highest similarity score is assigned to the corresponding dimension index in the semantic vector. As shown in Figure 4, D is obtained by using all distinct words in S1 and S2. For determining the score of the 10th dimension of the semantic vector D1, the ontology-based word-level similarity scores between each word in S1 and the 10th word of D are computed. Since the highest score among all similarity scores is 0.33, the score of the 10th index of D1 is set to 0.33. This process is repeated for the remaining indexes of the semantic vector D1. Then, the same algorithm is applied to create the semantic vector D2. Finally, the cosine similarity between D1 and D2 gives the semantic similarity score between the two sentences S1 and S2.
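The walk-through can be sketched in a few lines. The word-level similarity is passed in as a function; `exact_match` below is a hypothetical placeholder that stands in for an ontology-based measure, so the resulting score is only illustrative.

```python
import math


def semantic_vector(dictionary, sentence_words, word_sim):
    """For each dictionary word, keep its best similarity to any sentence word."""
    return [max((word_sim(entry, word) for word in sentence_words), default=0.0)
            for entry in dictionary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(s1_words, s2_words, word_sim):
    # Dictionary D: union of the unique words, in order of first appearance.
    dictionary = list(dict.fromkeys(s1_words + s2_words))
    d1 = semantic_vector(dictionary, s1_words, word_sim)
    d2 = semantic_vector(dictionary, s2_words, word_sim)
    return cosine(d1, d2)


def exact_match(word_a, word_b):
    """Placeholder word-level measure (assumption): 1 if identical, else 0."""
    return 1.0 if word_a == word_b else 0.0


s1 = "Necroptosis requires the activity of RIP1 and RIP3".split()
s2 = "Necroptosis could be induced via modulating RIP1 and RIP3".split()
score = sentence_similarity(s1, s2, exact_match)
```

With the placeholder measure, D has 13 entries, D1 and D2 contain 8 and 9 ones respectively, and 4 dimensions are shared, so the cosine is 4 / sqrt(72). Substituting an ontology-based `word_sim` fills the zero entries with graded values such as the 0.33 in the example.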

Fig. 4

Illustration of the proposed sentence-level ontology-based similarity algorithm which constructs semantic vectors of sentences

WBSM. WBSM takes two sentences to be compared as inputs and returns the semantic similarity score by exploiting WordNet. We used the WS4J library (https://github.com/Sciss/ws4j) for calculating the similarities between words by utilizing the WordNet ontology. The algorithms described in Section 2.4.3 were evaluated for WBSM. These measures were calculated using the Is-A relations in the WordNet ontology. Then, the sentence-level similarity method was used to combine word-level similarity scores into a sentence-level score.

UBSM. Unlike WBSM, UBSM uses METAMAP (Aronson, 2001), which is a tool for extracting medical concepts from text, rather than treating each word as a concept. This approach is more reliable, since concepts can consist of more than one word. The METAMAP tool is run on both sentences S1 and S2 and a dictionary is constructed from the unique mapped concepts/phrases in the two sentences. Therefore, the word-level similarity method utilizing UMLS takes concepts mapped by METAMAP as inputs. The rest of the methodology for constructing the sentence-level vectors is the same as in WBSM. The UMLS::Similarity (McInnes ) web interface was used to calculate the similarity of the concepts that were mapped by METAMAP. The scope of UMLS::Similarity is limited to the OMIM (Online Mendelian Inheritance in Man) and MeSH (Medical Subject Headings) ontologies, which are subsets of the UMLS ontology. The Parent/Child (PAR/CHD) relationship was used as the relationship parameter in the UMLS::Similarity web interface. The algorithms described in Section 2.4.3 were evaluated for UBSM.

COM. The major motivation behind the COM was to benefit from both biomedical domain and general domain ontologies, since sentences in biomedical text consist of both general terms and biomedical-specific terms. To utilize the knowledge from both the UMLS and WordNet ontologies, we propose a new approach that combines the two methods at the sentence level. As shown in Figure 5, the sentence-level COM takes the similarity scores of WBSM and UBSM for a sentence pair, then combines these scores by using Equation 17, where λ represents the weight parameter:

$$\mathrm{sim}_{COM}(S_1, S_2) = \lambda \cdot \mathrm{sim}_{WBSM}(S_1, S_2) + (1 - \lambda) \cdot \mathrm{sim}_{UBSM}(S_1, S_2) \qquad (17)$$

When λ is set to 0.5, equal weight is given to the similarity scores obtained from the WordNet and UMLS ontologies. When λ is set to a value >0.5, higher weight is given to the similarity score obtained from WordNet, and when it is set to a value smaller than 0.5, higher weight is given to the similarity score obtained from UMLS.
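A minimal sketch of the weighted combination, assuming it takes the form λ · WBSM + (1 − λ) · UBSM, which is consistent with the description of λ above. The function name and the input scores are hypothetical.

```python
def combined_ontology_score(wbsm_score, ubsm_score, lam=0.5):
    """Weighted sum of the WordNet-based and UMLS-based sentence scores."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # lam > 0.5 favours WordNet (WBSM); lam < 0.5 favours UMLS (UBSM).
    return lam * wbsm_score + (1.0 - lam) * ubsm_score


# Hypothetical scores for one sentence pair.
balanced = combined_ontology_score(0.8, 0.4, lam=0.5)
wordnet_only = combined_ontology_score(0.8, 0.4, lam=1.0)
```

Sweeping `lam` over a grid such as 0.0, 0.1, …, 1.0 and keeping the value with the highest correlation against the gold standard is one plausible way to select the weight.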

Fig. 5

Sentence-level COM

If a word does not occur in one of the ontologies (UMLS or WordNet), the similarity score between that word and any other word with respect to that ontology is considered to be 0.

2.5 Supervised combination of similarity measures

We combined our unsupervised semantic similarity measures within a supervised method. We used the similarity scores computed by the unsupervised COM, Paragraph Vector and Qgram similarity as features in a supervised regression model. Linear Regression implemented in the Weka library (Hall ) was used as the supervised model. A linear regression model can be expressed as in Equation (18) (Alpaydin, 2014; Buckley and James, 1979; Raftery ), where y is the dependent variable, each x_i is an input variable, and k is equal to the number of predictors (input variables):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k \qquad (18)$$

The βs correspond to the parameters of the linear regression model, which are estimated from the training data. Therefore, in our supervised similarity model, the predicted sentence similarity score (y) is calculated from the similarity scores (x_i) that were obtained by the unsupervised methods. The supervised system exploiting the results of the unsupervised similarity computation methods is illustrated in Figure 6. The pre-processed sentences are given to each unsupervised system as inputs. Then, the output score of each system, which is the semantic similarity score for the given pair, is used as a feature in our supervised system.
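The prediction step of such a model is a weighted sum of the three feature scores plus an intercept, as the sketch below shows. The coefficients and feature values are hypothetical placeholders chosen for illustration; they are not the values fitted in the study, and the fitting itself (done with Weka in the paper) is not reproduced here.

```python
def predict_similarity(features, intercept, weights):
    """Linear model: intercept plus the weighted sum of the feature scores."""
    if len(features) != len(weights):
        raise ValueError("one weight is needed per feature")
    return intercept + sum(w * x for w, x in zip(weights, features))


# Hypothetical coefficients (NOT fitted values from the paper).
BETA_0 = 0.1
BETAS = [1.2, 1.5, 0.9]  # weights for COM, Paragraph Vector, Qgram

# Hypothetical unsupervised scores for one sentence pair.
pair_features = [0.7, 0.8, 0.6]
predicted = predict_similarity(pair_features, BETA_0, BETAS)
```

Because the gold standard ranges from 0 to 4 while the unsupervised scores are similarity values, the fitted coefficients also rescale the features to the annotation range.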

Fig. 6

Supervised combination of similarity measures

3 Experimental results

The proposed sentence-level semantic similarity estimation algorithms are evaluated using the manually annotated dataset described in Section 2.1. For each sentence pair in the dataset, the mean of the scores assigned by the five human annotators was taken as the gold standard. The Pearson correlation (Pearson, 1895) between the gold standard scores and the scores estimated by the algorithms was used as the evaluation metric. The strength of correlation can be assessed by the general guideline proposed by Evans (1996) as follows:

very strong: 0.80–1.00
strong: 0.60–0.79
moderate: 0.40–0.59
weak: 0.20–0.39
very weak: 0.00–0.19

Since there is no previous study on sentence semantic similarity computation developed specifically for the biomedical domain, we considered the domain-independent state-of-the-art approaches ADW (Pilehvar ; Pilehvar and Navigli, 2015) and SEMILAR (Rus ) introduced in Section 1 as our baseline models. According to the results shown in Table 3, both ADW and SEMILAR obtain moderate correlation based on Evans’ definition (Evans, 1996). The poor results of these generic-domain similarity estimation systems demonstrate the need for new approaches for this domain-specific research field.
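The evaluation metric and the Evans guideline can be sketched as follows. The function names are hypothetical, and treating each band's lower bound as inclusive is a choice made here to cover values that fall between the listed ranges.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (spread_x * spread_y)


def evans_label(r):
    """Map a correlation to the strength bands of Evans (1996)."""
    r = abs(r)
    if r >= 0.80:
        return "very strong"
    if r >= 0.60:
        return "strong"
    if r >= 0.40:
        return "moderate"
    if r >= 0.20:
        return "weak"
    return "very weak"
```

Applied to the correlations reported in Table 3, `evans_label(0.586)` and `evans_label(0.419)` both give "moderate" for ADW and SEMILAR, while `evans_label(0.836)` gives "very strong" for the supervised model.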

Table 3

Experimental results of the presented approaches

Methods	Pearson correlation
Domain-independent systems
ADW	0.586
SEMILAR	0.419
String similarity measures
Qgram	0.754
Jaccard	0.710
Block	0.752
Levenshtein	0.592
Overlap coefficient	0.695
Word Embeddings based Similarity
Paragraph Vector	0.787
Ontology-based similarity
WBSM-Path	0.644
WBSM-Resnik	0.234
WBSM-Lin	0.495
WBSM-WP	0.354
WBSM-JCN	0.623
WBSM-LCH	0.287
UBSM-Path	0.651
UBSM-Resnik	0.473
UBSM-Lin	0.645
UBSM-WP	0.576
UBSM-JCN	0.624
UBSM-LCH	0.333
COM (λ = 0.5)	0.710
Supervised semantic similarity system
Linear regression	0.836

We evaluated several string similarity measures on our dataset. All string-based methods, as well as the other evaluated methods, were run both with the preprocessing described in Section 2.2 and without it. Preprocessing improved the performance of all methods, so Table 3 reports the results obtained with preprocessing. Preprocessing contributed more to the performance of the string similarity measures than to that of the other methods: the increase in Pearson correlation ranged from 10 to 31% for the string similarity measures. This result is expected, as string-based approaches do not take the semantic information of the text into consideration and are therefore highly sensitive to small changes.

Paragraph Vector is an unsupervised approach, which we trained on a large unlabeled corpus of biomedical text to learn semantic information. The strong correlation obtained by the Paragraph Vector method (0.787) shows that it is a promising method for representing sentences as vectors while capturing semantics.

For both WBSM and UBSM, using the path algorithm as the word-level similarity approach yielded the best performance, with Pearson correlation scores of 0.644 and 0.651, respectively. Therefore, for the combined ontology approach, we used the path algorithm for computing both the WordNet-based and the UMLS-based scores. The weighted sum of the similarity scores obtained from the WordNet- and UMLS-based methods was then assigned as the final similarity score. The best combination was achieved when the weight parameter lambda was set to 0.5 (λ = 0.5) in Equation (17).
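The weighted combination and the choice of λ can be illustrated with a short sketch. Equation (17) is not reproduced in this text, so the convex form λ · WordNet score + (1 − λ) · UMLS score, the side to which λ is attached, the grid of candidate values, and the function names are all assumptions. The scores in the usage lines are made up.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def combined_ontology_score(wordnet_score: float, umls_score: float, lam: float = 0.5) -> float:
    """Assumed convex combination of the two ontology-based scores."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * wordnet_score + (1.0 - lam) * umls_score


def best_lambda(
    wordnet_scores: Sequence[float],
    umls_scores: Sequence[float],
    gold: Sequence[float],
    steps: int = 10,
) -> float:
    """Grid search over lam in {0, 1/steps, ..., 1} maximising Pearson r."""
    candidates = [i / steps for i in range(steps + 1)]

    def correlation(lam: float) -> float:
        combined = [
            combined_ontology_score(w, u, lam)
            for w, u in zip(wordnet_scores, umls_scores)
        ]
        return pearson(combined, gold)

    return max(candidates, key=correlation)


if __name__ == "__main__":
    # Made-up per-pair scores, for illustration only.
    wordnet = [0.1, 0.9, 0.4, 0.6]
    umls = [0.8, 0.2, 0.9, 0.1]
    gold = [0.45, 0.55, 0.65, 0.35]
    print(best_lambda(wordnet, umls, gold))
```

Selecting λ on the same pairs that are later used for evaluation would risk an optimistic estimate; the sketch only shows the mechanics of the search.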
The comparison between the COM and the methods that use a single ontology shows that unifying the biomedical information coming from a biomedical ontology with general-domain information increased the overall performance. The results of the combined ontology approach support our hypothesis that domain-specific text benefits from exploiting both general-domain and domain-specific ontologies. The increase in the correlation of the combined model (0.710) over the individual path-based scores (0.644 and 0.651) indicates that the combination is useful.

The evaluation of the supervised model was performed using stratified 10-fold cross-validation over all the sentence pairs, due to the small size of the dataset. The final result for the supervised semantic similarity system was obtained by averaging the individual correlation results of each fold. As the learning model, Linear Regression implemented in the Weka library (Hall ) was employed. The experimental results indicate that the supervised combination of the similarity scores computed by the different methods outperforms each individual unsupervised method. This shows that the scores of the unsupervised systems complement each other. Although each of the combined unsupervised methods obtained a strong association with the gold standard, combining them with a supervised algorithm led to a very strong correlation. The Supervised Semantic Similarity System, which exploits the scores of the unsupervised systems as features, produced the highest correlation (0.836) of all evaluated methods.
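The cross-validated evaluation can be sketched as below. This is a simplified illustration, not the Weka setup used in the study: it uses plain shuffled k-fold splits (the stratification mentioned above is omitted), fits ordinary least squares through the normal equations, and all function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit_linear(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares; returns [b0, b1, ..., bk]."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    m = len(rows[0])
    # Augmented normal equations [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def predict(beta: Sequence[float], x: Sequence[float]) -> float:
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def cross_validated_pearson(X, y, k: int = 10, seed: int = 0) -> float:
    """Mean of the per-fold Pearson correlations over k shuffled folds."""
    indices = list(range(len(y)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    fold_scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        beta = fit_linear([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = [predict(beta, X[i]) for i in test_idx]
        fold_scores.append(pearson(predictions, [y[i] for i in test_idx]))
    return sum(fold_scores) / len(fold_scores)
```

Here `X` would hold one row of unsupervised similarity scores per sentence pair and `y` the corresponding gold standard scores. With 100 pairs and k = 10, each per-fold correlation is computed on only 10 pairs, so the individual fold values can vary considerably before averaging.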

4 Discussion

In this study, we presented and compared several approaches to measure semantic sentence similarity in the biomedical domain. We demonstrated the need for adapted or new approaches for domain-specific semantic sentence-level similarity, since our results showed that state-of-the-art domain-independent semantic similarity measures are inadequate when applied to biomedical text. Another important contribution of this research is that, as the first methods attempted in the unexplored research area of biomedical sentence-level semantic similarity computation, our approaches provide a strong baseline, together with a hand-crafted benchmark dataset, for further studies.

Because ontologies enable the computation of semantic distances between concepts, ontology-based measures were used in our semantic similarity computation study. Since the sentences in our dataset are selected from biomedical articles, we utilized WordNet as the general domain ontology and UMLS as the biomedical domain-specific ontology. The evaluations indicated that the COM, which utilizes both the WordNet and UMLS ontologies, achieved better results in estimating the similarity between biomedical sentences than the methods in which a single ontology was utilized. This outcome is reasonable, since sentences in the biomedical domain comprise both biomedical and general concepts. Thus, the knowledge extracted from WordNet and the knowledge extracted from UMLS complement each other and contribute to the overall performance of the system.

Besides UMLS, there are various biomedical ontologies specializing in different subtopics of the biomedical domain, such as the ChEBI ontology focusing on chemical entities (Degtyarenko ), the Interaction Network Ontology specializing in the domain of molecular interactions (Özgür ), and the Human Phenotype Ontology providing a controlled vocabulary for phenotypic features related to human diseases (Köhler ).
Integrating the semantic similarity scores computed by using different biomedical ontologies might contribute to the performance of the COM. As future work, we aim to make use of the knowledge obtained from different biomedical ontologies, in order to enhance our system to respond to a wider range of concepts and relationships.

Our results revealed that the unsupervised Paragraph Vector approach, which learns distributional vector representations of sentences from a biomedical corpus, is a promising method for biomedical semantic similarity computation. Finally, we presented a supervised semantic similarity estimation system based on a linear regression model, which exploits high-level features. The high-level features consist of the similarity scores of the best performing unsupervised systems, namely Qgram, Paragraph Vector and the COM. Combining the unsupervised methods with the help of a supervised learning model increased the overall performance of the system, which shows that estimating the similarity with different approaches contributes to the overall performance.

The manually annotated dataset and the developed semantic similarity estimation systems are publicly available. We believe that our biomedical domain-specific semantic sentence-level similarity measures can be used in various applications of biomedical NLP such as automatic summarization, question answering, text categorization and text retrieval. The upper bound in this study can be considered as the performance of a typical human, which is 90.2% according to the correlations between the human annotators. Although our best performing system achieved high correlation with human annotations (83.6%), there is still room for improvement for biomedical domain-specific semantic sentence similarity estimation.

20 in total

1. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program.

Authors: A R Aronson
Journal: Proc AMIA Symp Date: 2001

2. Computing semantic similarity between biomedical concepts using new information content approach.

Authors: Mohamed Ben Aouicha; Mohamed Ali Hadj Taieb
Journal: J Biomed Inform Date: 2015-12-17 Impact factor: 6.317

3. Measures of semantic similarity and relatedness in the biomedical domain.

Authors: Ted Pedersen; Serguei V S Pakhomov; Siddharth Patwardhan; Christopher G Chute
Journal: J Biomed Inform Date: 2006-06-10 Impact factor: 6.317

4. A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain.

Authors: Sébastien Harispe; David Sánchez; Sylvie Ranwez; Stefan Janaqi; Jacky Montmain
Journal: J Biomed Inform Date: 2013-11-21 Impact factor: 6.317

5. Semantic similarity estimation in the biomedical domain: an ontology-based information-theoretic perspective.

Authors: David Sánchez; Montserrat Batet
Journal: J Biomed Inform Date: 2011-04-02 Impact factor: 6.317

6. UMLS-Interface and UMLS-Similarity : open source software for measuring paths and semantic similarity.

Authors: Bridget T McInnes; Ted Pedersen; Serguei V S Pakhomov
Journal: AMIA Annu Symp Proc Date: 2009-11-14

7. Semantic similarity in the biomedical domain: an evaluation across knowledge sources.

Authors: Vijay N Garla; Cynthia Brandt
Journal: BMC Bioinformatics Date: 2012-10-10 Impact factor: 3.169

8. The Interaction Network Ontology-supported modeling and mining of complex interactions represented with multiple keywords in biomedical literature.

Authors: Arzucan Özgür; Junguk Hur; Yongqun He
Journal: BioData Min Date: 2016-12-19 Impact factor: 2.522

9. The Human Phenotype Ontology in 2017.

Authors: Sebastian Köhler; Nicole A Vasilevsky; Mark Engelstad; Erin Foster; Julie McMurry; Ségolène Aymé; Gareth Baynam; Susan M Bello; Cornelius F Boerkoel; Kym M Boycott; Michael Brudno; Orion J Buske; Patrick F Chinnery; Valentina Cipriani; Laureen E Connell; Hugh J S Dawkins; Laura E DeMare; Andrew D Devereau; Bert B A de Vries; Helen V Firth; Kathleen Freson; Daniel Greene; Ada Hamosh; Ingo Helbig; Courtney Hum; Johanna A Jähn; Roger James; Roland Krause; Stanley J F Laulederkind; Hanns Lochmüller; Gholson J Lyon; Soichi Ogishima; Annie Olry; Willem H Ouwehand; Nikolas Pontikos; Ana Rath; Franz Schaefer; Richard H Scott; Michael Segal; Panagiotis I Sergouniotis; Richard Sever; Cynthia L Smith; Volker Straub; Rachel Thompson; Catherine Turner; Ernest Turro; Marijcke W M Veltman; Tom Vulliamy; Jing Yu; Julie von Ziegenweidt; Andreas Zankl; Stephan Züchner; Tomasz Zemojtel; Julius O B Jacobsen; Tudor Groza; Damian Smedley; Christopher J Mungall; Melissa Haendel; Peter N Robinson
Journal: Nucleic Acids Res Date: 2016-11-28 Impact factor: 16.971

10. Semantic similarity in biomedical ontologies.

Authors: Catia Pesquita; Daniel Faria; André O Falcão; Phillip Lord; Francisco M Couto
Journal: PLoS Comput Biol Date: 2009-07-31 Impact factor: 4.475
