Gizem Sogancioglu, Hakime Öztürk, Arzucan Özgür.
Abstract
MOTIVATION: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.
Year: 2017 PMID: 28881973 PMCID: PMC5870675 DOI: 10.1093/bioinformatics/btx238
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Example annotations
| Sentence 1 | Sentence 2 | Comment | Score |
|---|---|---|---|
| Here we show that both C/EBP | Isoleucine could not interact with ligand fragment 44, which contains amino group. | The two sentences are on different topics. | 0 |
| Membrane proteins are proteins that interact with biological membranes. | Previous studies have demonstrated that membrane proteins are implicated in many diseases because they are positioned at the apex of signaling pathways that regulate cellular processes. | The two sentences are not equivalent, but are on the same topic. | 1 |
| This article discusses the current data on using anti-HER2 therapies to treat CNS metastasis as well as the newer anti-HER2 agents. | Breast cancers with HER2 amplification have a higher risk of CNS metastasis and poorer prognosis. | The two sentences are not equivalent, but share some details. | 2 |
| We were able to confirm that the cancer tissues had reduced expression of miR-126 and miR-424, and increased expression of miR-15b, miR-16, miR-146a, miR-155 and miR-223. | A recent study showed that the expression of miR-126 and miR-424 had reduced by the cancer tissues. | The two sentences are roughly equivalent, but some important information differs/missing. | 3 |
| Hydrolysis of | In Gram-negative organisms, the most common | The two sentences are completely or mostly equivalent, as they mean the same thing. | 4 |
Correlation scores among annotators
| Annotator | Correlation |
|---|---|
| Annotator A | 0.952 |
| Annotator B | 0.958 |
| Annotator C | 0.917 |
| Annotator D | 0.902 |
| Annotator E | 0.941 |
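The correlations above can be illustrated with a short sketch of the Pearson coefficient. The record does not state what each annotator was compared against, so correlating one annotator with the mean of the remaining annotators is an assumption here, and the scores in the example are hypothetical values on the 0-4 scale, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores (NOT taken from the dataset): one annotator's ratings
# and the mean rating of the other annotators for the same sentence pairs.
annotator = [0, 1, 2, 3, 4, 2]
others_mean = [0.25, 1.0, 2.5, 3.0, 3.75, 1.5]
r = pearson(annotator, others_mean)
```

The same function applies to the system evaluation: replace `annotator` with a method's predicted similarities and `others_mean` with the gold scores.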
Fig. 1. Distribution of the similarity scores in the dataset
Fig. 2. Hierarchical relationships among a small subset of proteins and antibiotics
Fig. 3. Sentence-level similarity module
Fig. 4. Illustration of the proposed sentence-level ontology-based similarity algorithm which constructs semantic vectors of sentences
Fig. 5. Sentence-level COM
Fig. 6. Supervised combination of similarity measures
Experimental results of the presented approaches
| Methods | Pearson correlation |
|---|---|
| Domain-independent systems | |
| ADW | 0.586 |
| SEMILAR | 0.419 |
| String similarity measures | |
| Qgram | 0.754 |
| Jaccard | 0.710 |
| Block | 0.752 |
| Levenshtein | 0.592 |
| Overlap coefficient | 0.695 |
| Word embedding-based similarity | |
| Paragraph Vector | 0.787 |
| Ontology-based similarity | |
| WBSM-Path | 0.644 |
| WBSM-Resnik | 0.234 |
| WBSM-Lin | 0.495 |
| WBSM-WP | 0.354 |
| WBSM-JCN | 0.623 |
| WBSM-LCH | 0.287 |
| UBSM-Path | 0.651 |
| UBSM-Resnik | 0.473 |
| UBSM-Lin | 0.645 |
| UBSM-WP | 0.576 |
| UBSM-JCN | 0.624 |
| UBSM-LCH | 0.333 |
| COM | 0.710 |
| Supervised semantic similarity system | |
| Linear regression | 0.836 |
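To make the string similarity rows more concrete, the following is a minimal sketch of three of the listed measures (Jaccard, overlap coefficient, Levenshtein). It is not the authors' implementation: the lowercase whitespace tokenizer, the function names, and the normalization of the edit distance into a similarity are all assumptions made for illustration.

```python
def tokens(sentence):
    """Lowercase whitespace tokenization (an assumed, simplistic choice)."""
    return set(sentence.lower().split())


def jaccard(a, b):
    """Size of the shared token set divided by the size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def overlap_coefficient(a, b):
    """Size of the shared token set divided by the smaller set's size."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def levenshtein(a, b):
    """Character-level edit distance via a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Sentence pair from the example annotations table (human score 1).
s1 = "Membrane proteins are proteins that interact with biological membranes."
s2 = ("Previous studies have demonstrated that membrane proteins are "
      "implicated in many diseases because they are positioned at the apex "
      "of signaling pathways that regulate cellular processes.")
scores = {
    "jaccard": jaccard(s1, s2),
    "overlap": overlap_coefficient(s1, s2),
    "levenshtein": levenshtein_similarity(s1, s2),
}
```

Each measure returns a value in [0, 1], so the outputs could serve as features for a supervised combination such as the linear regression row above; how the original system scaled or combined them is not stated in this record.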