Gene name identification and normalization using a model organism database.

Alexander A Morgan, Lynette Hirschman, Marc Colosimo, Alexander S Yeh, Jeff B Colombe.

Abstract

Biology has now become an information science, and researchers are increasingly dependent on expert-curated biological databases to organize the findings from the published literature. We report here on a series of experiments related to the application of natural language processing to aid in the curation process for FlyBase. We focused on listing the normalized form of genes and gene products discussed in an article. We broke this into two steps: gene mention tagging in text, followed by normalization of gene names. For gene mention tagging, we adopted a statistical approach. To provide training data, we were able to reverse engineer the gene lists from the associated articles and abstracts, to generate text labeled (imperfectly) with gene mentions. We then evaluated the quality of the noisy training data (precision of 78%, recall 88%) and the quality of the HMM tagger output trained on this noisy data (precision 78%, recall 71%). In order to generate normalized gene lists, we explored two approaches. First, we explored simple pattern matching based on synonym lists to obtain a high recall/low precision system (recall 95%, precision 2%). Using a series of filters, we were able to improve precision to 50% with a recall of 72% (balanced F-measure of 0.59). Our second approach combined the HMM gene mention tagger with various filters to remove ambiguous mentions; this approach achieved an F-measure of 0.72 (precision 88%, recall 61%). These experiments indicate that the lexical resources provided by FlyBase are complete enough to achieve high recall on the gene list task, and that normalization requires accurate disambiguation; different strategies for tagging and normalization trade off recall for precision.
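The synonym-list approach and the reported F-measures can be illustrated with a minimal sketch. This is not the authors' system: the lexicon, the gene identifiers, the single-token matching, and the function names `build_index`, `normalize`, and `f_measure` are all assumptions made for illustration. It shows how matching against a synonym list gives high recall, how dropping synonyms shared by more than one gene is one possible filter that trades recall for precision, and how the balanced F-measure follows from precision and recall.

```python
import re
from collections import defaultdict


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def build_index(lexicon):
    """Map each lower-cased synonym to the set of gene IDs that list it."""
    index = defaultdict(set)
    for gene_id, synonyms in lexicon.items():
        for synonym in synonyms:
            index[synonym.lower()].add(gene_id)
    return index


def normalize(text, index, drop_ambiguous=True):
    """Return gene IDs whose synonyms occur as whole tokens in the text.

    With drop_ambiguous=True, a synonym shared by several genes is ignored,
    which is one simple (assumed) way to filter ambiguous mentions.
    """
    tokens = set(re.findall(r"[a-z0-9-]+", text.lower()))
    found = set()
    for token in tokens:
        ids = index.get(token, set())
        if drop_ambiguous and len(ids) != 1:
            continue
        found |= ids
    return found


# Hypothetical toy lexicon: identifiers and synonyms are illustrative only
# and are not taken from FlyBase.
LEXICON = {
    "GENE:0001": ["wg", "wingless"],
    "GENE:0002": ["hh", "hedgehog"],
    "GENE:0003": ["a", "arc"],
    "GENE:0004": ["a", "abnormal"],
}

index = build_index(LEXICON)
sentence = "Expression of wg and hh in a wing disc"
unfiltered = normalize(sentence, index, drop_ambiguous=False)
filtered = normalize(sentence, index, drop_ambiguous=True)

# Figures reported in the abstract, rounded to two decimals.
pattern_matching_f = round(f_measure(0.50, 0.72), 2)
hmm_with_filters_f = round(f_measure(0.88, 0.61), 2)
```

In this toy sentence the article "a" matches a synonym shared by two genes, so the unfiltered pass returns all four identifiers while the filtered pass keeps only the two unambiguous ones. Plugging the abstract's precision and recall figures into `f_measure` reproduces the reported values of 0.59 and 0.72.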

Year: 2004    PMID: 15542014    DOI: 10.1016/j.jbi.2004.08.010

Source DB: PubMed    Journal: J Biomed Inform    ISSN: 1532-0464    Impact factor: 6.317


  21 in total

1.  Soft tagging of overlapping high confidence gene mention variants for cross-species full-text gene normalization.

Authors:  Cheng-Ju Kuo; Maurice H T Ling; Chun-Nan Hsu
Journal:  BMC Bioinformatics       Date:  2011-10-03       Impact factor: 3.169

2.  A fault model for ontology mapping, alignment, and linking systems.

Authors:  Helen L Johnson; K Bretonnel Cohen; Lawrence Hunter
Journal:  Pac Symp Biocomput       Date:  2007

3.  Rapidly retargetable approaches to de-identification in medical records.

Authors:  Ben Wellner; Matt Huyck; Scott Mardis; John Aberdeen; Alex Morgan; Leonid Peshkin; Alex Yeh; Janet Hitzeman; Lynette Hirschman
Journal:  J Am Med Inform Assoc       Date:  2007-06-28       Impact factor: 4.497

4.  Bioinformatics and cancer research: building bridges for translational research. (Review)

Authors:  Gonzalo Gómez-López; Alfonso Valencia
Journal:  Clin Transl Oncol       Date:  2008-02       Impact factor: 3.405

5.  Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease.

Authors:  Marco Masseroli; Halil Kilicoglu; François-Michel Lang; Thomas C Rindflesch
Journal:  BMC Bioinformatics       Date:  2006-06-08       Impact factor: 3.169

6.  Overview of BioCreAtIvE task 1B: normalized gene lists.

Authors:  Lynette Hirschman; Marc Colosimo; Alexander Morgan; Alexander Yeh
Journal:  BMC Bioinformatics       Date:  2005-05-24       Impact factor: 3.169

7.  Gene and protein nomenclature in public databases.

Authors:  Katrin Fundel; Ralf Zimmer
Journal:  BMC Bioinformatics       Date:  2006-08-09       Impact factor: 3.169

8.  Automatically annotating documents with normalized gene lists.

Authors:  Jeremiah Crim; Ryan McDonald; Fernando Pereira
Journal:  BMC Bioinformatics       Date:  2005-05-24       Impact factor: 3.169

9.  Literature mining of protein-residue associations with graph rules learned through distant supervision.

Authors:  Ke Ravikumar; Haibin Liu; Judith D Cohn; Michael E Wall; Karin Verspoor
Journal:  J Biomed Semantics       Date:  2012-10-05

10.  Using nanoinformatics methods for automatically identifying relevant nanotoxicology entities from the literature.

Authors:  Miguel García-Remesal; Alejandro García-Ruiz; David Pérez-Rey; Diana de la Iglesia; Víctor Maojo
Journal:  Biomed Res Int       Date:  2012-12-27       Impact factor: 3.411
