
Protocol for a reproducible experimental survey on biomedical sentence similarity.

Alicia Lara-Clares, Juan J Lastra-Díaz, Ana Garcia-Serrano.

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for several reasons: previous results are copied without confirmation, the source code and data needed to replicate methods and experiments are unavailable, and the experimental setup is not defined in detail, among others. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly established. There are also other significant gaps in the literature on biomedical sentence similarity: (1) several unexplored sentence similarity methods that deserve to be studied have not been evaluated; (2) a benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR), has not yet been evaluated; (3) the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of sentence similarity methods has not been studied; and (4) software and data resources for the reproducibility of methods and experiments in this line of research are lacking. To address these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date and, for the first time, reproducible experimental survey on biomedical sentence similarity.
The experimental survey will be based on our own software replication and evaluation of all the methods under study on a single software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.


Year:  2021        PMID: 33760855      PMCID: PMC7990182          DOI: 10.1371/journal.pone.0248663

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


