
Using BLAST for identifying gene and protein names in journal articles.

M Krauthammer, A Rzhetsky, P Morozov, C Friedman.

Abstract

We describe a system that automatically identifies gene and protein names in journal articles, an important and non-trivial first step in extracting knowledge about protein and gene actions. Our system uses a database of gene and protein names and is based on BLAST [Altschul et al., Nucleic Acids Res. 25 (1997) 3389-3402], a popular tool for DNA and protein sequence comparison. We describe a method that maps sequences of text characters into sequences of nucleotides that can be processed by BLAST. We demonstrate that this approach is feasible: the system matches gene and protein names, including names that are not part of the system database, with a recall of 78.8% and a precision of 71.7%. An analysis of the results suggests techniques that can be used to improve performance further.
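
The character-to-nucleotide mapping mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual encoding, so everything below is an assumption made for illustration: the symbol set `ALPHABET`, the fixed codeword length of three nucleotides, the `TTT` placeholder for unknown characters, and the helper names `encode` and `decode` are all hypothetical, not the authors' scheme.

```python
from itertools import product

# Hypothetical symbol set: 26 letters, 10 digits, 7 punctuation/space symbols.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 -/().,"
CODE_LEN = 3  # assumed codeword length; 4**3 = 64 codewords cover 43 symbols

# All nucleotide triplets in a fixed order: AAA, AAC, AAG, AAT, ACA, ...
CODEWORDS = ["".join(p) for p in product("ACGT", repeat=CODE_LEN)]
CHAR_TO_CODE = dict(zip(ALPHABET, CODEWORDS))
CODE_TO_CHAR = {code: ch for ch, code in CHAR_TO_CODE.items()}
UNKNOWN = CODEWORDS[-1]  # "TTT", not assigned to any symbol in ALPHABET


def encode(text):
    """Translate text into a nucleotide string, one triplet per character."""
    return "".join(CHAR_TO_CODE.get(ch, UNKNOWN) for ch in text.lower())


def decode(seq):
    """Translate a nucleotide string back to text; unknown triplets become '?'."""
    chunks = [seq[i:i + CODE_LEN] for i in range(0, len(seq), CODE_LEN)]
    return "".join(CODE_TO_CHAR.get(chunk, "?") for chunk in chunks)


if __name__ == "__main__":
    name = "IL-2 receptor"
    dna = encode(name)
    print(dna)
    print(decode(dna))
```

Under this assumed encoding, a dictionary of gene and protein names and the article text would both be translated into nucleotide strings, so that a sequence-comparison tool such as BLAST could search for approximate matches between them. A fixed codeword length keeps the mapping reversible, so a hit at nucleotide offset `i` corresponds to character offset `i // CODE_LEN` in the original text.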

Year:  2000        PMID: 11163982     DOI: 10.1016/s0378-1119(00)00431-5

Source DB:  PubMed          Journal:  Gene        ISSN: 0378-1119            Impact factor:   3.688


