Gene name ambiguity of eukaryotic nomenclatures.

Lifeng Chen, Hongfang Liu, Carol Friedman.

Abstract

MOTIVATION: With more and more scientific literature published online, the effective management and reuse of this knowledge has become problematic. Natural language processing (NLP) may be a potential solution by extracting, structuring and organizing biomedical information in online literature in a timely manner. One essential task is to recognize and identify genomic entities in text. 'Recognition' can be accomplished using pattern matching and machine learning, but these techniques are not adequate for 'identification'. In order to identify genomic entities, NLP needs a comprehensive resource that specifies and classifies genomic entities as they occur in text and that associates them with normalized terms and unique identifiers so that the extracted entities are well defined. Online organism databases are an excellent basis for creating such a lexical resource. However, gene name ambiguity is a serious problem because it affects the appropriate identification of gene entities. In this paper, we explore the extent of the problem and suggest ways to address it.
RESULTS: We obtained gene information from 21 organisms and quantified naming ambiguities within species, across species, with English words and with medical terms. When the case (of letters) was retained, official symbols displayed negligible intra-species ambiguity (0.02%) and modest ambiguities with general English words (0.57%) and medical terms (1.01%). In contrast, the across-species ambiguity was high (14.20%). The inclusion of gene synonyms increased intra-species ambiguity substantially and full names contributed greatly to gene-medical-term ambiguity. A comprehensive lexical resource that covers gene information for the 21 organisms was then created and used to identify gene names by using a straightforward string matching program to process 45,000 abstracts associated with the mouse model organism while ignoring case and gene names that were also English words. We found that 85.1% of correctly retrieved mouse genes were ambiguous with other gene names. When gene names that were also English words were included, 233% additional 'gene' instances were retrieved, most of which were false positives. We also found that authors prefer to use synonyms (74.7%) to official symbols (17.7%) or full names (7.6%) in their publications.
CONTACT: lifeng.chen@dbmi.columbia.edu
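
As an illustration only, the two steps described in the results (measuring across-species name ambiguity and running a case-insensitive string match that skips English words) could be sketched as follows. This is not the authors' program: the toy lexicon, the identifiers, the English word set, and the function names `cross_species_ambiguity` and `find_gene_mentions` are all hypothetical, and the ambiguity ratio used here (share of distinct names seen in more than one species) is an assumed definition that may differ from the one used in the paper.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon of (species, gene_id, name); not data from the paper.
LEXICON = [
    ("mouse", "MGI:0001", "Cat"),
    ("human", "HGNC:0001", "CAT"),
    ("mouse", "MGI:0002", "Trp53"),
    ("fly", "FB:0001", "wg"),
    ("human", "HGNC:0002", "WAS"),
    ("mouse", "MGI:0003", "Was"),
]

# Stand-in for a general English word list (assumption).
ENGLISH_WORDS = {"cat", "was"}


def cross_species_ambiguity(lexicon, ignore_case=False):
    """Return the fraction of distinct names that occur in more than one species."""
    species_by_name = defaultdict(set)
    for species, _gene_id, name in lexicon:
        key = name.lower() if ignore_case else name
        species_by_name[key].add(species)
    ambiguous = sum(1 for s in species_by_name.values() if len(s) > 1)
    return ambiguous / len(species_by_name)


def find_gene_mentions(text, lexicon, skip_english=True):
    """Case-insensitive whole-token lookup of gene names in text.

    Each hit is (surface form, sorted list of candidate (species, gene_id)).
    More than one candidate means the mention is ambiguous.
    """
    ids_by_name = defaultdict(set)
    for species, gene_id, name in lexicon:
        ids_by_name[name.lower()].add((species, gene_id))

    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        token = match.group().lower()
        if skip_english and token in ENGLISH_WORDS:
            continue  # drop names that are also ordinary English words
        if token in ids_by_name:
            hits.append((match.group(), sorted(ids_by_name[token])))
    return hits


if __name__ == "__main__":
    sentence = "Trp53 was deleted in the cat model"
    print(cross_species_ambiguity(LEXICON))
    print(cross_species_ambiguity(LEXICON, ignore_case=True))
    print(find_gene_mentions(sentence, LEXICON))
    print(find_gene_mentions(sentence, LEXICON, skip_english=False))
```

In this toy setting, ignoring case merges "Cat"/"CAT" and "WAS"/"Was" across species, and switching off the English-word filter makes the matcher return "was" and "cat" as gene mentions, which mirrors the kind of false positives the abstract reports.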

Year:  2004        PMID: 15333458     DOI: 10.1093/bioinformatics/bth496

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937

