PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database.

Rezarta Islamaj, W John Wilbur, Natalie Xie, Noreen R Gonzales, Narmada Thanki, Roxanne Yamashita, Chanjuan Zheng, Aron Marchler-Bauer, Zhiyong Lu.

Abstract

This study proposes a text similarity model to support biocuration of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published peer-reviewed articles. To facilitate information access for database users, it is desirable to identify the specific referenced articles that support the assertions of curator-composed sentences. Moreover, CDD curators desire an alert system that scans newly published literature and proposes articles relevant to existing CDD records. Our approach to addressing these needs is a text similarity method that automatically maps a curator-written statement to candidate sentences extracted from the list of referenced articles, as well as from articles in the PubMed Central database. To evaluate this proposal, we paired CDD description sentences with the top 10 matching sentences from the literature and gave these pairs to curators for review. Through this exercise, we found that we could map the articles in the reference list to the CDD description statements with an accuracy of 77%. In the dataset reviewed by curators, we successfully provided references for 86% of the curator statements. In addition, we suggested new articles for curator review, and curators accepted 50% of them for addition to the reference list. Through this process, we developed a substantial corpus of similar sentences from biomedical articles on protein sequence, structure and function research, which constitutes the CDD text similarity corpus.
This corpus contains 5159 sentence pairs judged for their similarity on a scale from 1 (low) to 5 (high), doubly annotated by four CDD curators. Curator-assigned similarity scores have a Pearson correlation coefficient of 0.70 and an inter-annotator agreement of 85%. To date, this is the largest biomedical text similarity resource that has been manually judged, evaluated and made publicly available to the community to foster research and development of text similarity algorithms. Published by Oxford University Press 2019.
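
The matching step described above (pairing a curator-written statement with its top 10 candidate sentences) can be illustrated with a minimal sketch. The abstract does not specify the similarity model itself, so this example substitutes plain TF-IDF cosine similarity as a stand-in. The function names (`tokenize`, `tfidf_vectors`, `cosine`, `top_k_matches`) and the example sentences are hypothetical and are not taken from the paper or from CDD.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict) per token list.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, chosen only
    for illustration; the paper's actual weighting is not given here.
    """
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_matches(statement, candidates, k=10):
    """Return up to k (score, sentence) pairs, highest score first."""
    vectors = tfidf_vectors(
        [tokenize(statement)] + [tokenize(c) for c in candidates]
    )
    query, rest = vectors[0], vectors[1:]
    scored = [(cosine(query, vec), cand) for vec, cand in zip(rest, candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented example sentences, for illustration only.
statement = "This domain binds ATP and catalyzes phosphorylation."
candidates = [
    "The kinase domain binds ATP and catalyzes phosphorylation of substrates.",
    "Samples were stored at room temperature.",
    "ATP levels were measured.",
]
for score, sentence in top_k_matches(statement, candidates, k=2):
    print(f"{score:.2f}  {sentence}")
```

In a curation setting, the ranked pairs would then go to curators for review, as in the evaluation described in the abstract. A lexical measure like this one would miss paraphrases that share few words, which is one reason a manually judged corpus is useful for developing stronger models.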

Year:  2019        PMID: 31267135      PMCID: PMC6606757          DOI: 10.1093/database/baz064

Source DB:  PubMed          Journal:  Database (Oxford)        ISSN: 1758-0463            Impact factor:   3.451


