Generation of a large gene/protein lexicon by morphological pattern analysis.

Abstract

The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work, we processed MEDLINE documents to obtain a collection of over two million names, of which we estimate that perhaps two-thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high-quality subset of gene/protein names. Here we describe an approach based on generating certain classes of names that are characterized by common morphological features. Within each class, inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names, and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set, leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.
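
The pipeline outlined in the abstract (group names into morphological classes, then apply per-class acceptance criteria) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not define its 193 classes, so `morph_signature` uses a simple character-shape encoding as a stand-in, and the hand-written rules in `ACCEPT_RULES` are hypothetical placeholders for criteria that the authors learned with ILP. The last lines only check the arithmetic of the filter step from the figures reported in the abstract.

```python
from collections import defaultdict


def morph_signature(name: str) -> str:
    """Collapse a name to a coarse shape (assumed encoding, not the paper's).

    Uppercase letters -> 'A', lowercase -> 'a', digits -> '0'; other
    characters are kept as-is. Runs of the same symbol are collapsed.
    """
    shape = []
    for ch in name:
        if ch.isdigit():
            sym = "0"
        elif ch.isalpha():
            sym = "A" if ch.isupper() else "a"
        else:
            sym = ch
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return "".join(shape)


def group_by_class(names):
    """Group candidate names by their morphological signature."""
    classes = defaultdict(list)
    for name in names:
        classes[morph_signature(name)].append(name)
    return dict(classes)


# Hypothetical per-class criteria; in the paper these are learned by ILP.
ACCEPT_RULES = {
    "A0": lambda n: len(n) >= 3,        # e.g. uppercase symbol + number
    "a0": lambda n: n.startswith("p"),  # e.g. p53-style names
}


def select(names, rules):
    """Keep names whose class has a rule and whose rule accepts them."""
    kept = []
    for signature, members in group_by_class(names).items():
        rule = rules.get(signature)
        if rule is None:
            continue  # no criteria for this class: drop its members
        kept.extend(n for n in members if rule(n))
    return kept


candidates = ["BRCA1", "p53", "x9", "the cell", "IL-2"]
selected = select(candidates, ACCEPT_RULES)

# Arithmetic check on the filter step, using the abstract's own figures.
removed = 1_240_462 - 1_145_913
removed_fraction = removed / 1_240_462
```

With these toy rules, `selected` contains `BRCA1` and `p53`. The reported counts imply 94,549 names removed, roughly 7.6% of the ILP-selected subset, which is consistent with the abstract's rounded figure of 8%.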

Substances: Proteins

Year: 2004 PMID: 15290756 DOI: 10.1142/s0219720004000399

Source DB: PubMed Journal: J Bioinform Comput Biol ISSN: 0219-7200 Impact factor: 1.122

Cited

7 in total

1. Identification of related gene/protein names based on an HMM of name variations.

Authors: L Yeganova; L Smith; W J Wilbur
Journal: Comput Biol Chem Date: 2004-04 Impact factor: 2.877

2. A flexible framework for deriving assertions from electronic medical records.

Authors: Kirk Roberts; Sanda M Harabagiu
Journal: J Am Med Inform Assoc Date: 2011-07-01 Impact factor: 4.497

3. CoPub Mapper: mining MEDLINE based on search term co-publication.

Authors: Blaise T F Alako; Antoine Veldhoven; Sjozef van Baal; Rob Jelier; Stefan Verhoeven; Ton Rullmann; Jan Polman; Guido Jenster
Journal: BMC Bioinformatics Date: 2005-03-11 Impact factor: 3.169

4. Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators.

Authors: Neil R Smalheiser; Vetle I Torvik; Amanda Bischoff-Grethe; Lauren B Burhans; Michael Gabriel; Ramin Homayouni; Alireza Kashef; Maryann E Martone; Guy A Perkins; Diana L Price; Andrew C Talk; Ruth West
Journal: J Biomed Discov Collab Date: 2006-07-03

5. Building a protein name dictionary from full text: a machine learning term extraction approach.

6. Incorporating rich background knowledge for gene named entity classification and recognition.

7. Processing biological literature with customizable Web services supporting interoperable formats.