A simple and practical dictionary-based approach for identification of proteins in Medline abstracts.

Sergei Egorov, Anton Yuryev, Nikolai Daraselia.

Abstract

OBJECTIVE: The aim of this study was to develop a practical and efficient protein identification system for biomedical corpora.
DESIGN: The developed system, called ProtScan, uses a carefully constructed dictionary of mammalian proteins together with a specialized tokenization algorithm to identify and tag protein name occurrences in biomedical texts. It also takes advantage of the Medline "Name-of-Substance" (NOS) annotation. The dictionaries for ProtScan were constructed semi-automatically from various public-domain sequence databases, followed by an intensive expert curation step.
MEASUREMENTS: The recall and precision of the system were determined using 1000 randomly selected and hand-tagged Medline abstracts.
RESULTS: The developed system identifies protein occurrences in Medline abstracts with 98% precision and 88% recall. It was also found to process approximately 300 abstracts per second. Without the NOS annotation, precision and recall were 98.5% and 84%, respectively.
CONCLUSION: The developed system appears to be well suited for protein-based Medline indexing and can help to improve biomedical information retrieval. Further approaches to improving ProtScan's recall are also discussed.
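
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea it describes: dictionary lookup over normalized tokens, followed by a precision and recall calculation against hand-tagged spans. Everything in it is an assumption made for illustration, including the toy dictionary entries, the letter/digit tokenization rule, the greedy longest-match strategy, and the names `tokenize`, `tag_proteins`, and `precision_recall`. It is not ProtScan's actual algorithm or data.

```python
import re

# Hypothetical toy dictionary: normalized token sequence -> canonical name.
# The real ProtScan dictionaries are expert-curated and are not reproduced here.
PROTEIN_DICT = {
    ("cd", "44"): "CD44",
    ("tumor", "necrosis", "factor", "alpha"): "TNF",
    ("tnf", "alpha"): "TNF",
    ("insulin", "receptor"): "INSR",
}
MAX_LEN = max(len(key) for key in PROTEIN_DICT)

# Assumed tokenization rule: runs of letters and runs of digits are separate
# tokens, so "CD44", "CD-44" and "CD 44" all normalize to ("cd", "44").
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")


def tokenize(text):
    """Return (lowercased token, start offset, end offset) triples."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def tag_proteins(text):
    """Greedy longest-match lookup; returns (start, end, canonical name) triples."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in PROTEIN_DICT:
                hits.append((tokens[i][1], tokens[i + n - 1][2], PROTEIN_DICT[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits


def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of tagged spans."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    sentence = "CD-44 binds the insulin receptor, unlike TNF-alpha."
    for start, end, name in tag_proteins(sentence):
        print(sentence[start:end], "->", name)
```

Splitting letters from digits is one simple way to make spelling variants such as "CD44" and "CD-44" hit the same dictionary entry. A production system would also need to handle ambiguous short symbols and case-sensitive names, which this sketch ignores.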

Year:  2004        PMID: 14764613      PMCID: PMC400515          DOI: 10.1197/jamia.M1453

Source DB:  PubMed          Journal:  J Am Med Inform Assoc        ISSN: 1067-5027            Impact factor:   4.497


