
GeneTUKit: a software for document-level gene normalization.

Minlie Huang, Jingchen Liu, Xiaoyan Zhu.

Abstract

MOTIVATION: Linking gene mentions in an article to entries in biological databases can greatly facilitate indexing and querying of the biological literature. The high ambiguity of gene names makes this task particularly challenging. Manual annotation is costly, time consuming and labor intensive, so assistive tools for the task are of high value.
RESULTS: We developed GeneTUKit, a document-level gene normalization tool for full-text articles. It employs both the local context surrounding gene mentions and the global context of the whole full-text document, and it can normalize genes of different species simultaneously. In BioCreAtIvE III, the system obtained good results among 37 runs: it ranked first, fourth and seventh in terms of TAP-20, TAP-10 and TAP-5, respectively, on the 507 full-text test articles.
AVAILABILITY AND IMPLEMENTATION: The software is available at http://www.qanswers.net/GeneTUKit/.

Year: 2011 PMID: 21303863 PMCID: PMC3065680 DOI: 10.1093/bioinformatics/btr042

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Gene normalization is one of the most challenging tasks in bio-literature mining because gene names are highly ambiguous: they may refer to orthologous or entirely different genes, may be named after phenotypes and other biomedical terms, or may resemble common names of non-gene entities (Hakenberg et al., 2008). Annotating full-text articles manually is time consuming and labor intensive, so a good assistive tool may facilitate the process greatly.

There is a large body of work on gene mention normalization. ProMiner (Hanisch et al., 2005), a strict dictionary-based approach, relies heavily on the quality of its gene dictionaries. Xu et al. (2007) proposed a method that uses gene profiles generated from PubMed abstracts for gene disambiguation. GNAT (Hakenberg et al., 2008) is a rule-based and machine learning (ML)-based gene normalization system that uses extensive background knowledge. Built from open-source libraries and publicly available resources, GENO (Wermter et al., 2009) employs a carefully crafted suite of symbolic and statistical methods. Moara (Neves et al., 2010) is a Java library for extracting and normalizing gene and protein mentions, currently designed for four model organisms.

Our software departs from previous systems in two aspects. First, it combines local and global contexts to normalize genes at the document level. The goal is not to normalize every mention correctly, but to suggest a list of normalized genes for a target document in order to assist human annotators. Most previous systems normalize genes at the mention level and employ only the local context surrounding a mention (e.g. the sentence where the mention was recognized). However, given the high ambiguity of gene names, local context alone may be insufficient: inter-sentential or document-level context can be helpful. Second, the software is designed to normalize genes of many different species simultaneously in full-text articles. It is not limited to any specific organism, but rather deals with all species present in a gene database (Entrez Gene in this article).

2 METHODS AND SYSTEM

The workflow of our software is shown in Figure 1. The software has four main modules. The first module is for gene mention recognition, the second one for gene ID candidate generation and the third one for gene ID disambiguation. In the fourth module, the software generates confidence scores for each gene ID, where the confidence score indicates the strength of the association between a gene ID and the document.

Fig. 1.

The workflow of GeneTUKit. Numbers in shaded boxes are gene IDs. The real-number values in the last box are confidence scores.

The first module uses three methods for recognizing gene mentions. The first method is a conditional random field-based approach, trained on the training dataset of the BioCreAtIvE II Gene Mention Recognition Task (Smith et al., 2008). The second method is a dictionary-based recognition approach whose dictionary was compiled from Entrez Gene. The third method is ABNER (Settles, 2005), an open-source named entity recognition system for biomedical literature. The input text is processed by the three methods separately, and a mention is kept only if it is recognized by at least two of them. If two mentions are similar but have different boundaries, the overlapping part is taken as the final mention.

The second module generates gene ID candidates for each recognized mention. In this module, an open-source indexing package, Lucene (http://lucene.apache.org/), was used to index all the genes in Entrez Gene. Each mention was then issued as a query, and the top 50 gene IDs were returned as candidates. Both the mention text and the Entrez Gene entries were processed by the following rules, applied in sequence: (i) removing special characters such as dashes and underscores; (ii) removing stop words; (iii) changing words such as ‘hBCL’ into ‘h BCL’; (iv) separating digits, Greek and Roman letters from alphabetic letters; and (v) converting the text to lowercase letters.

The third module disambiguates gene IDs by means of a ranking algorithm. The algorithm was trained on the 32 full-text articles provided by BioCreAtIvE III. Each article has a list of tuples (gene mention, gene ID and species); however, the annotations do not give the positions where a gene mention was recognized.
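The five normalization rules applied to mentions and Entrez Gene entries before indexing and querying can be illustrated with a minimal Python sketch. It is not the GeneTUKit implementation: the stop-word list, the regular expressions and the function names are assumptions made for illustration, and Roman numerals are left untouched because the text does not say how they are detected.

```python
import re
from itertools import groupby

# Illustrative stop-word list (an assumption; the paper does not list its stop words).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the"}


def char_class(ch: str) -> str:
    """Classify a character as a digit, a Greek letter, another letter, or other."""
    if ch.isdigit():
        return "digit"
    if "\u0370" <= ch <= "\u03ff":  # Greek and Coptic Unicode block
        return "greek"
    if ch.isalpha():
        return "alpha"
    return "other"


def split_classes(token: str) -> str:
    """Rule (iv): insert a space wherever the character class changes."""
    return " ".join("".join(group) for _, group in groupby(token, key=char_class))


def normalize(text: str) -> str:
    """Apply rules (i) to (v) in sequence (hypothetical helper)."""
    # (i) Replace dashes, underscores and other punctuation with spaces.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # (ii) Drop stop words.
    tokens = [t for t in text.split() if t.lower() not in STOP_WORDS]
    # (iii) Split a single lowercase prefix from an uppercase run: hBCL -> h BCL.
    tokens = [re.sub(r"^([a-z])(?=[A-Z]{2,})", r"\1 ", t) for t in tokens]
    # (iv) Separate digits and Greek letters from other letters.
    tokens = [split_classes(part) for t in tokens for part in t.split()]
    # (v) Convert to lowercase.
    return " ".join(tokens).lower()
```

Under these assumptions, `normalize("hBCL-2")` returns `"h bcl 2"` and `normalize("TGF-β1")` returns `"tgf β 1"`, so a mention and a database entry that differ only in punctuation, case or spacing map to the same query string.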
The training samples were generated as follows: each gene ID candidate that appears in the manual annotation list is taken as a positive sample, and every other candidate as a negative one. For each gene ID candidate and its corresponding mention, we extract features from the local and global contexts.

Some local context features are as follows: (i) the ranking score of the gene ID given by the Lucene index; (ii) whether the species of the ID is implied by the gene mention, as in hBCL; (iii) the edit distance between the mention and the official symbol of the ID; (iv) the minimal edit distance between the mention and all synonyms of the ID; and (v) whether at least one word indicating gene functions of the gene ID appears in the sentences from which the mention was recognized. The words indicating gene functions are obtained from the corresponding gene symbols after removing common words (such as protein, gene, etc.) and words containing capital letters or digits (e.g. VDR, p65).

The document-level, global context features include the following: (i) whether the species of the gene ID appears in the document; (ii) whether the species of the ID appears in the title; (iii) whether the species of the ID is the nearest species in the paragraph where the mention is recognized; and (iv) if the mention has a full (or abbreviated) name in the document, the minimal edit distance between the synonyms of the ID and that full (or abbreviated) name.

In constructing these features, we used dictionary-based matching to recognize species, as such a simple method can produce fairly good performance. For finding full/abbreviated name mappings, we adopted the method of Schwartz and Hearst (2003). Once the features were obtained, we used a ranking algorithm, ListNet (Cao et al., 2007), to rank the gene IDs for each mention, and only the top-ranked ID was kept for further processing.
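The two edit-distance features mentioned above can be sketched in Python as follows. The text does not specify which edit distance is used, so this example assumes the standard Levenshtein distance with unit costs; the function names, the feature keys and the fallback for an empty synonym list are hypothetical.

```python
from typing import Dict, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def edit_features(mention: str, official_symbol: str,
                  synonyms: Sequence[str]) -> Dict[str, int]:
    """Distance to the official symbol and minimal distance to any synonym."""
    to_symbol = edit_distance(mention, official_symbol)
    # Assumption: with no synonyms, fall back to the official-symbol distance.
    to_synonyms = min((edit_distance(mention, s) for s in synonyms),
                      default=to_symbol)
    return {"dist_official_symbol": to_symbol, "min_dist_synonyms": to_synonyms}
```

For example, `edit_features("bcl 2", "bcl2", ["bcl 2", "b cell lymphoma 2"])` gives a distance of 1 to the official symbol and 0 to the closest synonym. The same `edit_distance` helper could serve the global feature that compares synonyms with a full or abbreviated name.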
The fourth module uses a support vector machine (SVM) classifier to generate a confidence score for each predicted gene ID, measuring the association between the gene ID and the document. The training examples were constructed in the same way as in the third module. The features were constructed as follows: (i) the best value of each feature used in the third module, since each ID may correspond to many mentions (for the edit distance features, ‘best’ means ‘minimal’; for the ranking score feature, ‘best’ means ‘maximal’); (ii) the total number of gene mentions associated with the ID; and (iii) the highest rank of the ID among all the mentions associated with the ID.
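The aggregation of mention-level values into per-ID, document-level features can be sketched as below. This is an illustrative Python example only: the `MentionCandidate` record, its field names and the output keys are assumptions, only two of the third-module features are carried along, and the SVM itself is not shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class MentionCandidate:
    """One (mention, gene ID) pair from the third module (hypothetical record)."""
    gene_id: str         # gene ID assigned to the mention
    lucene_score: float  # ranking score from candidate generation
    edit_distance: int   # edit distance between mention and official symbol
    rank: int            # rank of the ID for that mention (1 = top)


def document_features(cands: Iterable[MentionCandidate]) -> Dict[str, Dict[str, float]]:
    """Aggregate mention-level values into one feature dictionary per gene ID."""
    by_id: Dict[str, List[MentionCandidate]] = defaultdict(list)
    for c in cands:
        by_id[c.gene_id].append(c)
    return {
        gid: {
            "best_lucene_score": max(c.lucene_score for c in group),   # best = maximal
            "best_edit_distance": min(c.edit_distance for c in group), # best = minimal
            "num_mentions": len(group),
            "highest_rank": min(c.rank for c in group),                # 1 is the highest
        }
        for gid, group in by_id.items()
    }
```

Each resulting dictionary would then be turned into a feature vector for the classifier, whose output serves as the confidence score of the gene ID for the document.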

3 RESULTS

We evaluated the system on the BioCreAtIvE III GN corpus (Lu and Wilbur, 2010) in terms of Threshold Average Precision (TAP-k, with k = 5, 10 and 20) (Carroll et al., 2010). For training, we used the 32 articles with gold-standard human annotation. For testing, there are two datasets: the first has 50 articles with gold-standard annotation, and the second has 507 articles whose ground truth was inferred from 37 team submissions (referred to as the silver standard). The 507 articles include the 50 articles of the first dataset. Table 1 shows the official evaluation results from BioCreAtIvE III. We have also measured performance in terms of average precision. Manual error analysis revealed two major error types: (i) wrongly recognized gene mentions and (ii) wrong species mapping. The Supplementary Material provides a more detailed analysis at http://www.qanswers.net/GeneTUKit/evaluation.html.

Table 1.

The evaluation results on the BioCreAtIvE III GN corpus

Measures	50 articles (gold standard)	507 articles (silver standard)
TAP-5	0.2973 (4/37)	0.4086 (7/37)
TAP-10	0.3125 (4/37)	0.4511 (4/37)
TAP-20	0.3248 (4/37)	0.4648 (1/37)
Average precision of TOP k recommendations
k = 5	0.4880	0.5764
k = 10	0.4340	0.4993
k = 20	0.3231	0.3984

The number in parentheses is the rank of our score among the 37 submissions.


4 CONCLUSION

GeneTUKit is a software tool designed for document-level gene normalization, which employs features from both the local context and the global context of the whole full-text article. It can normalize genes of many different species. Given a target article, the software outputs a list of normalized genes, each associated with a confidence score.

Funding: Natural Science Foundation of China (No. 60803075); Chinese 973 project (No. 2007CB311003).

Conflicts of Interest: none declared.

9 in total

1. A simple algorithm for identifying abbreviation definitions in biomedical text.

Authors: Ariel S Schwartz; Marti A Hearst
Journal: Pac Symp Biocomput Date: 2003

2. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text.

Authors: Burr Settles
Journal: Bioinformatics Date: 2005-04-28 Impact factor: 6.937

3. Gene symbol disambiguation using knowledge-based profiles.

Authors: Hua Xu; Jung-Wei Fan; George Hripcsak; Eneida A Mendonça; Marianthi Markatou; Carol Friedman
Journal: Bioinformatics Date: 2007-02-21 Impact factor: 6.937

4. High-performance gene name normalization with GeNo.

Authors: Joachim Wermter; Katrin Tomanek; Udo Hahn
Journal: Bioinformatics Date: 2009-02-02 Impact factor: 6.937

5. Moara: a Java library for extracting and normalizing gene and protein mentions.

Authors: Mariana L Neves; José-María Carazo; Alberto Pascual-Montano
Journal: BMC Bioinformatics Date: 2010-03-26 Impact factor: 3.169

6. Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics.

Authors: Hyrum D Carroll; Maricel G Kann; Sergey L Sheetlin; John L Spouge
Journal: Bioinformatics Date: 2010-05-26 Impact factor: 6.937

7. Inter-species normalization of gene mentions with GNAT.

Authors: Jörg Hakenberg; Conrad Plake; Robert Leaman; Michael Schroeder; Graciela Gonzalez
Journal: Bioinformatics Date: 2008-08-15 Impact factor: 6.937

8. ProMiner: rule-based protein and gene entity recognition.

Authors: Daniel Hanisch; Katrin Fundel; Heinz-Theodor Mevissen; Ralf Zimmer; Juliane Fluck
Journal: BMC Bioinformatics Date: 2005-05-24 Impact factor: 3.169

9. Overview of BioCreative II gene mention recognition.

Authors: Larry Smith; Lorraine K Tanabe; Rie Johnson nee Ando; Cheng-Ju Kuo; I-Fang Chung; Chun-Nan Hsu; Yu-Shi Lin; Roman Klinger; Christoph M Friedrich; Kuzman Ganchev; Manabu Torii; Hongfang Liu; Barry Haddow; Craig A Struble; Richard J Povinelli; Andreas Vlachos; William A Baumgartner; Lawrence Hunter; Bob Carpenter; Richard Tzong-Han Tsai; Hong-Jie Dai; Feng Liu; Yifei Chen; Chengjie Sun; Sophia Katrenko; Pieter Adriaans; Christian Blaschke; Rafael Torres; Mariana Neves; Preslav Nakov; Anna Divoli; Manuel Maña-López; Jacinto Mata; W John Wilbur
Journal: Genome Biol Date: 2008-09-01 Impact factor: 13.583

