Tagging gene and protein names in biomedical text.

Lorraine Tanabe, W John Wilbur.

Abstract

MOTIVATION: The MEDLINE database of biomedical abstracts contains scientific knowledge about thousands of interacting genes and proteins. Automated text processing can aid in the comprehension and synthesis of this valuable information. The fundamental task of identifying gene and protein names is a necessary first step towards making full use of the information encoded in biomedical text. This remains a challenging task due to the irregularities and ambiguities in gene and protein nomenclature. We propose to approach the detection of gene and protein names in scientific abstracts as part-of-speech tagging, the most basic form of linguistic corpus annotation.
RESULTS: We present a method for tagging gene and protein names in biomedical text using a combination of statistical and knowledge-based strategies. This method incorporates automatically generated rules from a transformation-based part-of-speech tagger, and manually generated rules from morphological clues, low frequency trigrams, indicator terms, suffixes and part-of-speech information. Results of an experiment on a test corpus of 56K MEDLINE documents demonstrate that our method to extract gene and protein names can be applied to large sets of MEDLINE abstracts, without the need for special conditions or human experts to predetermine relevant subsets.
AVAILABILITY: The programs are available on request from the authors.
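
The idea of treating gene-name detection as a tagging problem refined by rules can be illustrated with a toy sketch. This is not the authors' program: the abstract does not give the actual rule set, so the indicator terms, the suffix, the regular expression, and the function names below (`initial_tags`, `apply_rules`, `tag_gene_names`) are all hypothetical stand-ins. The sketch assumes a trivial initial tagger that labels every token `O`, then rewrites tags with a few hand-written rules based on morphology, a suffix, and a neighbouring indicator term.

```python
import re

# Hypothetical rule resources; the paper's real lists are not given in the abstract.
INDICATOR_TERMS = {"gene", "protein", "receptor"}
ENZYME_SUFFIX = "ase"
# Morphological clue: token mixes letters and digits, e.g. "p53" or "BRCA1".
ALNUM_MIX = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9-]+$")


def initial_tags(tokens):
    """Stand-in for a real part-of-speech tagger: every token starts as 'O'."""
    return ["O"] * len(tokens)


def apply_rules(tokens, tags):
    """Rewrite initial tags with simple hand-written rules (toy example)."""
    tags = list(tags)
    for i, tok in enumerate(tokens):
        lower = tok.lower()
        if ALNUM_MIX.match(tok):
            # Morphological rule: letter/digit mixtures look like gene symbols.
            tags[i] = "GENE"
        elif len(lower) > 5 and lower.endswith(ENZYME_SUFFIX):
            # Suffix rule: long words ending in "-ase" are often enzymes.
            # This over-generates (e.g. "increase"); a real system needs filters.
            tags[i] = "GENE"
    for i in range(len(tokens) - 1):
        # Contextual rule: a capitalised, still-untagged token directly
        # before an indicator term is retagged as a gene/protein name.
        if (tokens[i + 1].lower() in INDICATOR_TERMS
                and tags[i] == "O"
                and tokens[i][0].isupper()):
            tags[i] = "GENE"
    return tags


def tag_gene_names(text):
    """Whitespace-tokenize text and return (token, tag) pairs."""
    tokens = text.split()
    return list(zip(tokens, apply_rules(tokens, initial_tags(tokens))))


if __name__ == "__main__":
    for token, tag in tag_gene_names("The p53 protein binds Mdm2 and a tyrosine kinase"):
        print(token, tag)
```

A transformation-based tagger would learn such rewrite rules automatically from an annotated corpus; here they are written by hand only to show how an initial tagging can be corrected rule by rule.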

Year:  2002        PMID: 12176836     DOI: 10.1093/bioinformatics/18.8.1124

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  73 in total

1.  A simple and practical dictionary-based approach for identification of proteins in Medline abstracts.

Authors:  Sergei Egorov; Anton Yuryev; Nikolai Daraselia
Journal:  J Am Med Inform Assoc       Date:  2004-02-05       Impact factor: 4.497

2.  Semantic relations asserting the etiology of genetic diseases.

Authors:  Thomas C Rindflesch; Bisharah Libbus; Dimitar Hristovski; Alan R Aronson; Halil Kilicoglu
Journal:  AMIA Annu Symp Proc       Date:  2003

3.  Identification of related gene/protein names based on an HMM of name variations.

Authors:  L Yeganova; L Smith; W J Wilbur
Journal:  Comput Biol Chem       Date:  2004-04       Impact factor: 2.877

4.  NLProt: extracting protein names and sequences from papers.

Authors:  Sven Mika; Burkhard Rost
Journal:  Nucleic Acids Res       Date:  2004-07-01       Impact factor: 16.971

5.  Toward an automatic method for extracting cancer- and other disease-related point mutations from the biomedical literature.

Authors:  Emily Doughty; Attila Kertesz-Farkas; Olivier Bodenreider; Gary Thompson; Asa Adadey; Thomas Peterson; Maricel G Kann
Journal:  Bioinformatics       Date:  2010-12-07       Impact factor: 6.937

6.  Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts.

Authors:  A M Cohen; W R Hersh; C Dubay; K Spackman
Journal:  BMC Bioinformatics       Date:  2005-04-22       Impact factor: 3.169

7.  Quantitative assessment of dictionary-based protein named entity tagging.

Authors:  Hongfang Liu; Zhang-Zhi Hu; Manabu Torii; Cathy Wu; Carol Friedman
Journal:  J Am Med Inform Assoc       Date:  2006-06-23       Impact factor: 4.497

8.  The value of parsing as feature generation for gene mention recognition.

Authors:  Larry H Smith; W John Wilbur
Journal:  J Biomed Inform       Date:  2009-04-02       Impact factor: 6.317

9.  Biological entity recognition with conditional random fields.

Authors:  Ying He; Mehmet Kayaalp
Journal:  AMIA Annu Symp Proc       Date:  2008-11-06

10.  Recent progress in automatically extracting information from the pharmacogenomic literature. (Review)

Authors:  Yael Garten; Adrien Coulet; Russ B Altman
Journal:  Pharmacogenomics       Date:  2010-10       Impact factor: 2.533
