Generation of a large gene/protein lexicon by morphological pattern analysis.

Lorraine Tanabe, W John Wilbur.

Abstract

The identification of gene/protein names in natural language text is an important problem in named entity recognition. In previous work we processed MEDLINE documents to obtain a collection of over two million names, of which we estimate that perhaps two thirds are valid gene/protein names. Our problem has been how to purify this set to obtain a high-quality subset of gene/protein names. Here we describe an approach based on generating classes of names that are characterized by common morphological features. Within each class, inductive logic programming (ILP) is applied to learn the characteristics of those names that are gene/protein names. The criteria learned in this manner are then applied to our large set of names. We generated 193 classes of names, and ILP led to criteria defining a select subset of 1,240,462 names. A simple false positive filter was applied to remove 8% of this set, leaving 1,145,913 names. Examination of a random sample from this gene/protein name lexicon suggests it is composed of 82% (±3%) complete and accurate gene/protein names, 12% names related to genes/proteins (too generic, a valid name plus additional text, part of a valid name, etc.), and 6% names unrelated to genes/proteins. The lexicon is freely available at ftp.ncbi.nlm.nih.gov/pub/tanabe/Gene.Lexicon.
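As a rough illustration of what grouping names by shared morphological features might look like, the sketch below collapses each name into a coarse character-class signature and buckets names by that signature. The signature scheme and the function names `shape` and `group_by_shape` are assumptions made for this example only; the abstract does not specify how the 193 classes were defined, and the ILP learning step is not shown. The final lines simply recompute the share of names removed by the false positive filter from the two counts reported above.

```python
import re
from collections import defaultdict


def shape(name: str) -> str:
    """Collapse a name into a coarse signature (hypothetical scheme).

    Runs of uppercase letters become "A", runs of lowercase letters
    become "a", runs of digits become "0"; other characters are kept.
    """
    s = re.sub(r"[A-Z]+", "A", name)
    s = re.sub(r"[a-z]+", "a", s)
    s = re.sub(r"[0-9]+", "0", s)
    return s


def group_by_shape(names):
    """Bucket names by signature, preserving input order within a bucket."""
    classes = defaultdict(list)
    for n in names:
        classes[shape(n)].append(n)
    return dict(classes)


# Counts reported in the abstract.
before_filter = 1_240_462
after_filter = 1_145_913

# Share of the ILP-selected subset removed by the false positive filter.
removed_fraction = (before_filter - after_filter) / before_filter

if __name__ == "__main__":
    print(group_by_shape(["BRCA1", "TP53", "p53", "IL-2", "TNF-alpha"]))
    print(f"removed: {removed_fraction:.1%}")
```

In a pipeline of the kind the abstract describes, each bucket would then be handed to a per-class learner that decides which of its members are genuine gene/protein names; here the buckets are only printed.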

Year:  2004        PMID: 15290756     DOI: 10.1142/s0219720004000399

Source DB:  PubMed          Journal:  J Bioinform Comput Biol        ISSN: 0219-7200            Impact factor:   1.122


  7 in total

1.  Identification of related gene/protein names based on an HMM of name variations.

Authors:  L Yeganova; L Smith; W J Wilbur
Journal:  Comput Biol Chem       Date:  2004-04       Impact factor: 2.877

2.  A flexible framework for deriving assertions from electronic medical records.

Authors:  Kirk Roberts; Sanda M Harabagiu
Journal:  J Am Med Inform Assoc       Date:  2011-07-01       Impact factor: 4.497

3.  CoPub Mapper: mining MEDLINE based on search term co-publication.

Authors:  Blaise T F Alako; Antoine Veldhoven; Sjozef van Baal; Rob Jelier; Stefan Verhoeven; Ton Rullmann; Jan Polman; Guido Jenster
Journal:  BMC Bioinformatics       Date:  2005-03-11       Impact factor: 3.169

4.  Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators.

Authors:  Neil R Smalheiser; Vetle I Torvik; Amanda Bischoff-Grethe; Lauren B Burhans; Michael Gabriel; Ramin Homayouni; Alireza Kashef; Maryann E Martone; Guy A Perkins; Diana L Price; Andrew C Talk; Ruth West
Journal:  J Biomed Discov Collab       Date:  2006-07-03

5.  Building a protein name dictionary from full text: a machine learning term extraction approach.

Authors:  Lei Shi; Fabien Campagne
Journal:  BMC Bioinformatics       Date:  2005-04-07       Impact factor: 3.169

6.  Incorporating rich background knowledge for gene named entity classification and recognition.

Authors:  Yanpeng Li; Hongfei Lin; Zhihao Yang
Journal:  BMC Bioinformatics       Date:  2009-07-17       Impact factor: 3.169

7.  Processing biological literature with customizable Web services supporting interoperable formats.

Authors:  Rafal Rak; Riza Theresa Batista-Navarro; Jacob Carter; Andrew Rowley; Sophia Ananiadou
Journal:  Database (Oxford)       Date:  2014-07-08       Impact factor: 3.451
