
DAVID gene ID conversion tool.

Da Wei Huang, Brad T Sherman, Robert Stephens, Michael W Baseler, H Clifford Lane, Richard A Lempicki.

Abstract

Our current biological knowledge is spread over many independent bioinformatics databases that use many different types of gene and protein identifiers. The heterogeneous and redundant nature of these identifiers limits data analysis across bioinformatics resources, and it is an even more serious bottleneck for larger datasets, such as gene lists derived from microarray and proteomic experiments. The DAVID Gene ID Conversion Tool (DICT), a web-based application, converts users' input gene or gene product identifiers from one type to another in a more comprehensive and high-throughput manner, using a uniquely enhanced ID-ID mapping database. AVAILABILITY: http://david.abcc.ncifcrf.gov/conversion.jsp.


Keywords: database; gene; microarray; protein; proteome

Year: 2008 PMID: 18841237 PMCID: PMC2561161 DOI: 10.6026/97320630002428

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

Our current biological knowledge is spread over many independent bioinformatics databases, containing both novel and redundant data. These databases and platforms each use their own selection of gene or gene product identifier (ID) types. To leverage heterogeneous annotations across different bio-sources during data analysis, one of the immediate tasks is to convert users' IDs from one type to another as required by the individual source. For example, after an Affymetrix microarray experiment, one must typically translate Affymetrix IDs to gene names, GenBank accessions, RefSeq accessions, UniProt IDs, etc. for further analysis. This process is time-consuming and tedious; more importantly, an incomplete or inaccurate translation can easily result in the loss of key information during data analysis. NCBI’s Entrez Gene [1] is a popular bioinformatics source for the translation of gene IDs from one type to another. In addition, several ID translation tools offer this service in a high-throughput fashion [2-6] (supplementary file 1), based either on Entrez Gene or on the UniProt/PIR mapping databases [7]. The goal of the DAVID Gene ID Conversion Tool (DICT), one of the components of the DAVID Bioinformatics Resources [8,9], is to provide a more comprehensive means for batch translation among common gene/protein ID types. The important features and advances of the DICT are:
1) Enhanced translation capability over other similar tools.
2) Extensive ID type coverage, including more than 20 main and secondary ID types.
3) A batch mode interface that supports one-to-one, one-to-many and many-to-many ID relationships.
4) Hyperlinks to in-depth information about genes, so that users can examine any potential translation errors.
5) A summary table of the overall translation, generated for quality control purposes.
6) The capability to handle a mixture of ID types as well as a ‘not sure’ type.
Approximately 130,000 ID conversion jobs have been run with the DICT since 2007 (based on a survey of the web log files). The usefulness of the tool motivated us to write this application note, which introduces the availability, enhanced conversion capability and interface features of the tool to more researchers who have ID conversion needs. The technical details behind the features are not discussed here, but can be found in our related works, which are referenced in the appropriate sections.

Methodology

A unique backend database for ID-ID mapping information

A comprehensive backend ID-ID mapping database is the most important foundation for better ID-ID translation. The unique advance of the DICT is that its backend ID mapping database, the DAVID Knowledgebase [10], does not simply adopt the popular NCBI Entrez Gene or UniProt ID mapping information as other similar tools do. Instead, the DAVID Knowledgebase was constructed by comprehensively re-agglomerating ID-ID relationships with a unique procedure called the DAVID Gene Concept [10]. This procedure maximizes the recovery of ID-ID links that were missed in the original systems (e.g. the NCBI, PIR and UniProt systems). The newly identified ID-ID links, together with the existing ID-ID mapping information from the original systems, are stored in the tables of a relational database, where heavy table indexing and a specialized schema are used to speed up database queries.
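The idea of storing ID-ID links in indexed relational tables can be illustrated with a minimal sketch. The table name `id_map`, its columns, the index, and the placeholder identifiers below are all assumptions made for illustration; they are not the actual DAVID Knowledgebase schema, which is described in [10].

```python
import sqlite3

# Placeholder links for illustration only; these are not real identifiers.
EXAMPLE_LINKS = [
    ("AFFY_ID", "probe_1", "REFSEQ", "refseq_1"),
    ("AFFY_ID", "probe_1", "REFSEQ", "refseq_2"),   # one-to-many
    ("AFFY_ID", "probe_2", "REFSEQ", "refseq_2"),   # many-to-one
    ("AFFY_ID", "probe_1", "UNIPROT", "uniprot_1"),
]


def build_mapping_db(links):
    """Create an in-memory ID-ID link table (hypothetical schema) with an index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE id_map ("
        "src_type TEXT, src_id TEXT, dst_type TEXT, dst_id TEXT)"
    )
    # The index covers the columns used in the lookup below.
    conn.execute("CREATE INDEX idx_src ON id_map (src_type, src_id, dst_type)")
    conn.executemany("INSERT INTO id_map VALUES (?, ?, ?, ?)", links)
    conn.commit()
    return conn


def convert(conn, src_type, src_id, dst_type):
    """Return all target IDs linked to one source ID, sorted for stable output."""
    rows = conn.execute(
        "SELECT DISTINCT dst_id FROM id_map "
        "WHERE src_type = ? AND src_id = ? AND dst_type = ? "
        "ORDER BY dst_id",
        (src_type, src_id, dst_type),
    )
    return [row[0] for row in rows]
```

Calling `convert(build_mapping_db(EXAMPLE_LINKS), "AFFY_ID", "probe_1", "REFSEQ")` returns both linked placeholder accessions, which shows how a one-to-many relationship falls out of a plain link table.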

Interactive web-based interface

The DICT is a web-based application that requires no configuration or installation on the client's computer. The output of the program is organized into two panels, left and right (Figure 1). The left panel provides the translation summary and options for ambiguous IDs. The right panel displays the final translation result. Various hyperlinks to in-depth gene information are provided so that users can examine any potential errors or alternative translation choices. The results can be either copied into other tools, such as a spreadsheet, or downloaded as a tab-delimited text file. An additional button lets users import the translation result directly into DAVID for further analysis with other DAVID analytic functions [9].

Figure 1

Layouts of the submission page (a) and result page (b) of the DICT.

The results appear within a few seconds and are presented in an HTML page. The user can also save the results directly in text format.
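A downloaded tab-delimited result file could be loaded and summarized along the following lines. This is a sketch only: the column headers `From` and `To` and the sample identifiers are assumptions, and the real download may use different column names and carry additional columns.

```python
import csv
import io
from collections import defaultdict


def load_conversion_table(text, from_col="From", to_col="To"):
    """Parse tab-delimited text into {source ID: [target IDs]}.

    The column names are hypothetical defaults, not the tool's documented headers.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    mapping = defaultdict(list)
    for row in reader:
        mapping[row[from_col]].append(row[to_col])
    return dict(mapping)


def summarize(mapping, submitted_ids):
    """Build a small quality-control summary for a submitted ID list."""
    converted = [i for i in submitted_ids if i in mapping]
    return {
        "submitted": len(submitted_ids),
        "converted": len(converted),
        "unconverted": [i for i in submitted_ids if i not in mapping],
        "one_to_many": [i for i in converted if len(mapping[i]) > 1],
    }
```

A summary of this kind mirrors the purpose of the translation summary in the left panel: it makes unconverted and ambiguous IDs visible before the result is used downstream.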

DICT Features

The improved ID-ID conversion capability and the extensive ID coverage

The DICT covers dozens of commonly used types of gene and protein identifiers (Table 1 under supplementary material). Importantly, all ID types are fully cross-convertible by the DICT. In addition, the DICT introduces a special type, ‘not sure’, as an aid to users who are unsure which type of identifiers their list contains, or whose list contains a mixture of many types. In both cases, the tool systematically searches all possible identifier types and suggests appropriate choices to the user. Most importantly, with the uniquely constructed ID-ID mapping information in the DAVID Knowledgebase, the ID-to-ID cross-reference capability is largely improved [10]. Accordingly, the conversion quality and success rate of the tool are enhanced compared to other similar tools. In supplementary files 2 and 3, the translation results from nine ID translation tools (e.g. ONTO-Translate, MatchMiner, AliasServer, IDConverter, etc.), based on the same set of example IDs, are compared side by side. For these particular examples, the DICT handles various combinations of translation tasks more comprehensively than other similar tools.
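One simple way to picture the ‘not sure’ behaviour is to look up every submitted ID under every known type and rank the types by the number of matches. The sketch below follows that idea; it is not the DICT's actual algorithm, and the function name, the type labels and the placeholder IDs are hypothetical.

```python
from collections import Counter


def suggest_id_types(input_ids, known_ids_by_type):
    """Rank candidate ID types by how many submitted IDs each one recognizes.

    known_ids_by_type maps a type label to the set of IDs known for that type.
    Types with no matches are left out of the suggestions.
    """
    hits = Counter()
    for id_type, known in known_ids_by_type.items():
        hits[id_type] = sum(1 for i in input_ids if i in known)
    return [(id_type, n) for id_type, n in hits.most_common() if n > 0]


# Placeholder data for illustration only.
KNOWN = {
    "TYPE_A": {"a1", "a2", "a3"},
    "TYPE_B": {"b1"},
    "TYPE_C": set(),
}
```

For a mixed list such as `["a1", "a2", "b1", "zzz"]`, the function suggests `TYPE_A` first and `TYPE_B` second, leaving the final choice to the user.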

The high-throughput capability and entire database download

The DICT can efficiently convert up to three thousand gene IDs at a time, which is sufficient for typical high-throughput data analysis. Moreover, if users want to convert IDs genome-wide, such as all Affymetrix IDs to RefSeq, the entire DAVID Knowledgebase is available for download [10].
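For lists longer than the three-thousand-ID limit stated above, a user could split the list into several jobs before submission. The helper below is a hypothetical client-side sketch; removing duplicates first is a design choice of this example, not a documented requirement of the tool.

```python
def batches(ids, max_per_job=3000):
    """Yield de-duplicated ID batches no larger than max_per_job."""
    unique = list(dict.fromkeys(ids))  # drop duplicates, keep first-seen order
    for start in range(0, len(unique), max_per_job):
        yield unique[start:start + max_per_job]
```

Each yielded batch can then be pasted or uploaded as a separate conversion job. For genome-wide conversions, downloading the Knowledgebase, as noted above, avoids batching altogether.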

References

1. DAVID: Database for Annotation, Visualization, and Integrated Discovery.

Authors: Glynn Dennis; Brad T Sherman; Douglas A Hosack; Jun Yang; Wei Gao; H Clifford Lane; Richard A Lempicki
Journal: Genome Biol Date: 2003-04-03 Impact factor: 13.583

2. UniProt: the Universal Protein knowledgebase.

Authors: Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

3. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data.

Authors: Maximilian Diehn; Gavin Sherlock; Gail Binkley; Heng Jin; John C Matese; Tina Hernandez-Boussard; Christian A Rees; J Michael Cherry; David Botstein; Patrick O Brown; Ash A Alizadeh
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. AliasServer: a web server to handle multiple aliases used to refer to proteins.

Authors: Florian Iragne; Aurélien Barré; Nicolas Goffard; Antoine De Daruvar
Journal: Bioinformatics Date: 2004-04-01 Impact factor: 6.937

5. Babel's tower revisited: a universal resource for cross-referencing across annotation databases.

Authors: Sorin Drăghici; Sivakumar Sellamuthu; Purvesh Khatri
Journal: Bioinformatics Date: 2006-10-26 Impact factor: 6.937

6. IDconverter and IDClight: conversion and annotation of gene and protein IDs.

Authors: Andreu Alibés; Patricio Yankilevich; Andrés Cañada; Ramón Díaz-Uriarte
Journal: BMC Bioinformatics Date: 2007-01-10 Impact factor: 3.169

7. MatchMiner: a tool for batch navigation among gene and gene product identifiers.

Authors: Kimberly J Bussey; David Kane; Margot Sunshine; Sudar Narasimhan; Satoshi Nishizuka; William C Reinhold; Barry Zeeberg; Weinstein Ajay; John N Weinstein
Journal: Genome Biol Date: 2003-03-25 Impact factor: 13.583

8. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists.

Authors: Da Wei Huang; Brad T Sherman; Qina Tan; Joseph Kir; David Liu; David Bryant; Yongjian Guo; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki
Journal: Nucleic Acids Res Date: 2007-06-18 Impact factor: 16.971

10. DAVID Knowledgebase: a gene-centered database integrating heterogeneous gene annotation resources to facilitate high-throughput gene functional analysis.

Authors: Brad T Sherman; Da Wei Huang; Qina Tan; Yongjian Guo; Stephan Bour; David Liu; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki
Journal: BMC Bioinformatics Date: 2007-11-02 Impact factor: 3.169
