| Literature DB >> 30975071 |
Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald.
Abstract
BACKGROUND: Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in a low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from the biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set.
Keywords: Natural language processing; Neural embedding models; Semantics
Year: 2019 PMID: 30975071 PMCID: PMC6460644 DOI: 10.1186/s12859-019-2789-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
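The evaluation setup described in the abstract — comparing sentence vectors by cosine similarity and correlating the resulting scores with the human annotations via Pearson's r — can be sketched as follows. This is a minimal illustration, not the authors' code; the function names are my own.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two dense sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson_r(xs, ys):
    """Pearson correlation between model similarities and human scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)
```

In the benchmark, each sentence pair gets one cosine score per model; `pearson_r` over the 100 pairs then yields the coefficients reported in the tables below.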
Highest correlation coefficients obtained with different methods

| Method | r |
|---|---|
| Jaccard | 0.751 |
| Q-gram (q = 3) | 0.723 |
| fastText (skip-gram, max pooling) | 0.766 |
| fastText (CBOW, max pooling) | 0.253 |
| Sent2vec | 0.798 |
| Skip-thoughts | 0.485 |
| Paragraph vector (PV-DM) | 0.819 |
| Paragraph vector (PV-DBOW) | 0.804 |
| Jaccard, q-gram, Paragraph vector (PV-DBOW) and sent2vec | 0.846 |
| Supervised linear regression (combination of Jaccard, Q-gram, sent2vec, Paragraph vector (PV-DM), skip-thoughts, fastText) | 0.871 |

r Pearson correlation, CBOW Continuous Bag of Words, PV-DM Paragraph Vector Distributed Memory, PV-DBOW Paragraph Vector Distributed Bag of Words
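The two string-based baselines in the table, Jaccard and Q-gram similarity, are simple set-overlap measures. A minimal sketch of their standard definitions (my own implementation, which may differ in detail from the exact variants used in the study):

```python
def jaccard(s1, s2):
    """Jaccard similarity: overlap of word-token sets."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b)

def qgram(s1, s2, q=3):
    """Q-gram similarity: Jaccard overlap of character q-gram sets."""
    grams = lambda s: {s[i:i + q] for i in range(len(s) - q + 1)}
    a, b = grams(s1.lower()), grams(s2.lower())
    return len(a & b) / len(a | b)
```

Both return 1.0 for identical sentences and 0.0 for sentences sharing no words (or q-grams), so their outputs are directly comparable to the cosine scores of the embedding models.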
Baseline values for our analysis, as reported by Sogancioglu et al. [9]
| Method | r |
|---|---|
| Jaccard | 0.710 |
| Q-gram | 0.754 |
| Paragraph Vector (PV-DBOW) | 0.787 |
| Supervised linear regression (Combined ontology method, Paragraph vector, Q-gram) | 0.836 |
PV-DBOW Paragraph Vector Distributed Bag of Words
Fig. 1 Scatterplots showing the correlations between given similarities and scores assigned by human annotators
Average estimated cosine similarities per model for sentence pairs in the negation and antonym subsets and for a reference set of highly similar sentences. Lower values indicate lower estimated semantic similarity; higher values indicate higher estimated semantic similarity
| | Sent2vec | Skip-thoughts | PV-DM | PV-DBOW | fastText CBOW | fastText skip-gram |
|---|---|---|---|---|---|---|
| Subset of highly similar sentences | 0.706 | 0.899 | 0.652 | 0.568 | 0.938 | 0.971 |
| Negation subset | 0.967 | 0.999 | 0.930 | 0.936 | 0.945 | 0.979 |
| Antonym subset | 0.983 | 0.999 | 0.968 | 0.960 | 0.976 | 0.989 |
PV-DM Paragraph Vector Distributed Memory, PV-DBOW Paragraph Vector Distributed Bag of Words, CBOW Continuous Bag of Words
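The near-ceiling similarities for the negation and antonym subsets reflect that a contradicting sentence differs from the original by only one or two tokens. A bag-of-words toy example (my own sketch, not one of the evaluated models) shows why surface-level overlap alone keeps the cosine high:

```python
import math
from collections import Counter

def bow_cosine(s1, s2):
    """Cosine similarity between bag-of-words count vectors."""
    c1, c2 = Counter(s1.lower().split()), Counter(s2.lower().split())
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

orig = "rip1 was reported to interact with rip3."
neg = "rip1 was reported to not interact with rip3."
# The two sentences share 7 of 8 tokens, so their count
# vectors stay nearly parallel despite the contradiction.
```

Embedding models trained purely on co-occurrence inherit the same blind spot, which is what the negation and antonym rows above quantify.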
Characteristics of the PMC Open Access dataset
| Characteristic | Value |
|---|---|
| File size | 45 GB |
| Number of articles | > 1,700,000 |
| Total number of tokens | 8,126,457,106 |
| Number of unique words | 31,974,798 |
| Number of sentences | 277,809,416 |
| Average line length before post-processing (number of characters) | 162 |
| Longest line length before post-processing (number of characters) | 111,562 |
Fig. 2 Pre- and post-processing steps for preparing the training corpus and steps for calculating the similarity metrics
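The concrete pipeline steps are given in Fig. 2 and are not reproduced in this record. As a generic sketch of the kind of corpus preparation involved — lowercasing and sentence splitting, consistent with the lowercased benchmark sentences shown below — one might write:

```python
import re

def prepare_corpus(text):
    """Illustrative preprocessing: lowercase the text, split it into
    sentences at terminal punctuation, and tokenize on whitespace.
    This is a generic sketch, not the pipeline from Fig. 2."""
    sentences = re.split(r"(?<=[.!?])\s+", text.lower())
    return [sentence.split() for sentence in sentences if sentence.strip()]
```

Applied to the full 45 GB PMC Open Access dump, a pipeline of this shape yields the sentence and token counts listed in the dataset table above.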
Example sentences from the BIOSSES benchmark set
| Sentence 1 | Sentence 2 | Comment | Score |
|---|---|---|---|
| This article discusses the current data on using anti-HER2 therapies to treat CNS metastasis as well as the newer anti-HER2 agents. | Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis. | The two sentences are not equivalent, but share some details. | 2 |
| The up-regulation of miR-146a was also detected in cervical cancer tissues. | The expression of miR-146a has been found to be up-regulated in cervical cancer. | The two sentences are completely or mostly equivalent. | 4 |
Example sentences from the contradiction-via-negation subset
| Sentence 1 (original sentence) | Sentence 2 (negated sentence) |
|---|---|
| Rip1 was reported to interact with rip3. | Rip1 was reported to not interact with rip3. |
| Moreover, other reports have also shown that necroptosis could be induced via modulating rip1 and rip3. | Moreover, other reports have also shown that necroptosis could not be induced via modulating rip1 or rip3. |
Example sentences from the contradiction-via-antonyms subset
| Sentence 1 (original sentence) | Sentence 2 (antonym-substituted sentence) |
|---|---|
| When expressed alone in primary cells however, oncogenic ras induces premature senescence, a putative tumour suppressor mechanism to protect from uncontrolled proliferation. | When expressed alone in primary cells however, oncogenic ras inhibits premature senescence, a putative tumour suppressor mechanism to protect from uncontrolled proliferation. |
| Two recent studies used rnai-mediated tet2 knock-down in vitro to suggest that tet2 depletion led to impaired hematopoietic differentiation and to preferential myeloid commitment. | Two recent studies used rnai-mediated tet2 knock-down in vitro to suggest that tet2 depletion led to enhanced hematopoietic differentiation and to preferential myeloid commitment. |