Overview of biological database mapping services for interoperation between different 'omics' datasets.

Shweta S Chavan, John D Shaughnessy, Ricky D Edmondson.

Abstract

Many primary biological databases are dedicated to providing annotation for a specific type of biological molecule such as a clone, transcript, gene or protein, but often with limited cross-references. Therefore, enhanced mapping is required between these databases to facilitate the correlation of independent experimental datasets. For example, molecular biology experiments conducted on samples (DNA, mRNA or protein) often yield more than one type of 'omics' dataset as an object for analysis (eg a sample can have a genomics as well as proteomics expression dataset available for analysis). Thus, in order to map the two datasets, the identifier type from one dataset is required to be linked to another dataset, so preventing loss of critical information in downstream analysis. This identifier mapping can be performed using identifier converter software relevant to the query and target identifier databases. This review presents the publicly available web-based biological database identifier converters, with comparison of their usage, input and output formats, and the types of available query and target database identifier types.

Year: 2011 PMID: 22155608 PMCID: PMC3525252 DOI: 10.1186/1479-7364-5-6-703

Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639

Introduction

Many primary biological databases are dedicated to providing annotation for a specific type of biological molecule such as a clone, transcript, gene or protein (eg the National Center for Biotechnology Information [NCBI] Entrez Gene[1] database provides annotation for genes, whereas UniProtKB[2,3] provides this for proteins). Other, secondary databases provide relevant information about the attributes of these molecules, such as pathway, function(s) or structural information (eg the Kyoto Encyclopedia of Genes and Genomes [KEGG][4] and Protein Data Bank [PDB][5]). Often, these databases provide limited cross-references for interoperation between databases. Thus, enhanced mapping between these databases is required to facilitate the correlation of independent experimental datasets, and this mapping can be provided by identifier (Id) mapping services. Id mapping services are tools to connect one type of database Id to the corresponding Id in another database. This mapping includes three types of relationships: one-to-one, one-to-many and many-to-many. One-to-many and many-to-many relationships are required to account for biological processes such as alternative splicing, resulting in one gene giving rise to multiple transcripts, the presence of several isoforms of a single protein and other similar processes occurring in a cell. Also, gene expression data such as microarray data are known to have multiple probes targeting a single transcript and vice versa (eg Affymetrix[6] probes), which can be described by many-to-many relationships. Thus, mapping Ids of multiple databases to one another facilitates the correlation of different types of 'omics' datasets which, in turn, might provide meaningful insights into the biological processes occurring in a cell. Several Id mapping services are publicly available (Table 1).
The seven Id converters that are discussed in detail in this review were selected to represent the majority of Id converters, as well as the major biological databases, used in high-throughput genomics and proteomics datasets. They have many common features, such as: (i) supporting one-to-many and many-to-many relationships; (ii) providing mapping counts, and the details of the database Ids used as a 'bridge' to link to target database Ids; (iii) providing a web-based graphical user interface (GUI) which allows submission of a single Id or a batch search for multiple Ids; and (iv) hyperlinking the output database Ids to their original database for reference.
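
The relationship types described in the Introduction can be illustrated with a short Python sketch. It is not taken from any of the reviewed converters: the function name `classify_relationships` and all Ids are hypothetical placeholders, and the extra 'many-to-one' label is simply the mirror image of one-to-many, added here so that every pair receives a label.

```python
from collections import defaultdict


def classify_relationships(pairs):
    """Label each (source, target) pair by how often its two Ids occur.

    A pair is 'one-to-one' only if its source Id has a single target
    and its target Id has a single source.
    """
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for source, target in pairs:
        targets_of[source].add(target)
        sources_of[target].add(source)

    labels = {}
    for source, target in pairs:
        many_targets = len(targets_of[source]) > 1
        many_sources = len(sources_of[target]) > 1
        if many_targets and many_sources:
            labels[(source, target)] = "many-to-many"
        elif many_targets:
            labels[(source, target)] = "one-to-many"
        elif many_sources:
            labels[(source, target)] = "many-to-one"
        else:
            labels[(source, target)] = "one-to-one"
    return labels


# Hypothetical placeholder Ids, not real database entries.
example_pairs = [
    ("GENE_A", "TX_A1"),   # one gene, two transcripts (alternative splicing)
    ("GENE_A", "TX_A2"),
    ("GENE_B", "TX_B1"),   # a simple one-to-one link
    ("PROBE_1", "TX_C1"),  # probes and transcripts linked in both directions
    ("PROBE_1", "TX_C2"),
    ("PROBE_2", "TX_C1"),
]
labels = classify_relationships(example_pairs)
```

Labelling individual pairs is a simplification; a converter would more likely report the relationship for a whole query, but the counting logic is the same.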

Table 1

Link to web interface of various Id converters

Mapping services	Link
Gene/Clone ID converter	http://idconverter.bioinfo.cnio.es/
ID mapping by UniProt	http://www.uniprot.org/?tab=mapping
MatchMiner	http://discover.nci.nih.gov/matchminer/MatchMinerLookup.jsp
DAVID gene ID conversion tool	http://david.abcc.ncifcrf.gov/conversion.jsp
g:Convert	http://biit.cs.ut.ee/gprofiler/gconvert.cgi
CRONOS	http://mips.helmholtz-muenchen.de/genre/proj/cronos/index.html
bioDBnet:db2db	http://biodbnet.abcc.ncifcrf.gov/db/db2db.php

These Id converters also differ in a number of ways, such as in: (i) type of query input; (ii) target output databases; (iii) species; (iv) data sources (and therefore coverage of the genome or proteome of a particular species); (v) ease of use; (vi) database update frequency; (vii) possible conversion types (eg protein-to-gene, gene-to-transcripts etc.); (viii) speed of conversion; (ix) availability of a detailed help section or tutorial describing the intended use of the application; and (x) the algorithm used for mapping database Ids. In general, these services establish mapping links using existing cross-references or by using sequence alignment information to determine the match. Few Id converters have their own published algorithm for establishing the mapping, in addition to using the existing cross-references.

Many of the Id converters provide an intuitive web-based user interface, generally having three components: input types (ie query databases), output types (ie target databases) and the species under consideration. One such web application is Clone/Gene ID Converter [7]. It provides the option for several query and target databases for human, mouse and rat species. The output can be customised by selecting from a number of output databases which are divided into several logical levels, such as gene, gene clone, protein and functional annotation. Further, detailed references are provided for the resultant output Ids by hyperlinking them to their original data sources. The output can also be obtained in a spreadsheet or text format. A detailed list of input/output databases, availability and other pertinent features can be found in Table 2. A useful piece of information provided on the interface itself is the specific version that was used as the data source for individual databases. This is of importance, considering the frequent updates of sequence databases and the increasing number of novel findings about the biological molecules in the respective research areas.
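
As a rough illustration of how existing cross-references can be chained through a 'bridge' database, the following Python sketch joins two cross-reference tables and records which bridge Ids support each result. The function `map_via_bridge`, the table layout and all Ids are assumptions made for this example; the sketch does not reproduce the algorithm of any particular converter.

```python
def map_via_bridge(query_ids, query_to_bridge, bridge_to_target):
    """Map query Ids to target Ids through an intermediate (bridge) Id type.

    Returns {query_id: {target_id: [bridge_ids that produced the link]}}.
    Queries with no link are kept with an empty dict so they can be
    reported as unmapped.
    """
    result = {}
    for query_id in query_ids:
        hits = {}
        for bridge_id in sorted(query_to_bridge.get(query_id, ())):
            for target_id in sorted(bridge_to_target.get(bridge_id, ())):
                hits.setdefault(target_id, []).append(bridge_id)
        result[query_id] = hits
    return result


# Hypothetical cross-reference tables (placeholder Ids only).
probe_to_gene = {
    "PROBE_1": {"GENE_X"},
    "PROBE_2": {"GENE_X", "GENE_Y"},
}
gene_to_protein = {
    "GENE_X": {"PROT_X1", "PROT_X2"},
    "GENE_Y": {"PROT_X2"},
}

result = map_via_bridge(
    ["PROBE_1", "PROBE_2", "PROBE_3"], probe_to_gene, gene_to_protein
)
# Mapping count per query Id, analogous to the counts converters report.
mapping_counts = {query_id: len(hits) for query_id, hits in result.items()}
```

Keeping the bridge Ids alongside each target makes it possible to show the user how a given link was derived, which is the kind of detail the converters above expose in their output.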

Table 2

Comparison of various Id converters

Features of mapping services	Gene/Clone ID converter	ID mapping by UniProt	MatchMiner	DAVID gene ID conversion tool	g:Convert	CRONOS	bioDBnet:db2db
Interface	Web-based GUI form	Web-based GUI form	Web-based GUI form, command line	Web-based GUI form	Web-based GUI form	Web-based GUI form	Web-based GUI form
Output format	Html, text, spreadsheet	Html, text	Html, text, spreadsheet	Html, text, spreadsheet	Html, text, spreadsheet, minimal (no header)	Html, email for batch mode	Text, spreadsheet
Organisms	Human, mouse, rat	Human, mouse, rat and many other species	Human, mouse	Human, about another 90,000 species	Human, mouse, rat and 31 other Ensembl-supported genome species	Human, mouse, rat, cow, dog, and fruit fly	A specified list could not be found
Input/output clone or transcript	Clone Ids, Affymetrix Ids, GenBank Accession (Additional output: EMBL)^a	GenBank, EMBL, DDBJ	Affymetrix Ids, GenBank Accession, EST, IMAGE Clone Id, FISH-mapped BAC Clone Id	Affymetrix Id, Agilent Id, Illumina Id, GenBank Accession, Gene symbol, GenPept Accession, NCBI GI, RefSeq RNA/Genomic accession	Affymetrix, Agilent, CCDS Ids, Ensembl transcript, Illumina, RefSeq DNA/Genomic	Ensembl/FlyBase Transcript ID, EMBL, Affymetrix, Agilent, CCDS	Affymetrix, Agilent, GenBank, RefSeq Genomic, RefSeq Nucleotide
Input/output gene	HUGO gene names, Entrez gene Ids, Ensembl gene Ids, UniGene cluster Ids, RefSeq RNAs (Additional output: CCDS)^a	Entrez Gene, HGNC, Ensembl, UniGene, TIGR (JCVI)	Gene Symbol HUGO/Alias, Name, UniGene Cluster Id, Entrez Gene Id, RefSeq RNA	Entrez gene Id, Ensembl gene/transcript Id, RefSeq mRNA accession, UniGene Id	Ensembl Gene, Entrez Gene, RefSeq mRNA, UniGene	Gene Name, Ensembl/FlyBase Gene ID, GI, GeneID, HGNC, RefSeq mRNA	Entrez Gene ID, Ensembl Gene ID, UniGene
Input/output protein	RefSeq peptides, SwissProt names (Additional output: IPI, PDB)^a	UniProtKB, RefSeq, GenPept, IPI, PDB	RefSeq protein	PIR accession, PIR Id, PIR NREF Id, RefSeq Protein accession, UniProt Id/accession, UniRef Id	Ensembl Protein, IPI, PDB, RefSeq Protein	Protein Name, UniProt, Ensembl/FlyBase Protein ID, IPI, PIR	UniProt Accession, Ensembl Protein ID, GenPept, RefSeq Protein, UniProt
Input/output other information	(Additional output: PubMed, GO, KEGG, Reactome, Chromosomal locations from Ensembl, UCSC Genome Browser, OMIM)^a	SGD, GeneRif, NCBI Taxon, and others	Cytogenetic location: UCSC (Additional output: PubMed, GO, KEGG, Reactome, Chromosomal locations from Ensembl, UCSC Genome Browser, OMIM)^a	"Not sure" type also accepted, and many other secondary database identifiers also supported	UCSC, PubMed, GO and many other secondary databases	dbSNP, UniSTS, MGI, orfnames, MIM, MORBID, CDD	GO, InterPro, Biocarta, KEGG, dbSNP, H-Invitational (H-Inv), HomoloGene, MGC, MIM, UniSTS, Taxon, and other secondary databases

^a All input Id types are potential output Id types as well (eg as in ID mapping by UniProt). In some cases, however, there are additional output Id types available to choose from which are not available as input Id types using that particular converter. Such output Id types are given in parentheses and indicated as 'Additional output' (eg as in Clone/Gene ID Converter). GUI, graphical user interface.

Another similar type of Id converter is ID mapping hosted by UniProt [3]. It supports almost all organisms, with monthly updates, and maps approximately 90 database sources, including primary sequence databases and secondary functional/structural annotation databases. The input and output database options are accordingly divided into UniProt, other sequence database, three-dimensional structure, protein-protein interaction, protein family/group, two-dimensional gel, genome annotation, organism-specific gene database, phylogenomics, enzyme and pathway, gene expression and other database types, which are listed in Table 2. The output is provided in the form of a tab-delimited table indicating the Ids in the query that could be mapped to those in the target databases, along with a list of unique target database Ids and a list of those Ids that could not be mapped. ID mapping by UniProt also provides an application programming interface (API) for programmatic access, as well as file transfer protocol (FTP) downloads if the user wishes to have a local Id mapping service for large datasets (> 100,000 Ids).

MatchMiner[8] is a tool that provides a clean interface with an interesting Batch Merge option, along with Interactive Lookup and Batch Lookup. Interactive Lookup and Batch Lookup can be used for generic Id conversion for single and multiple query input, respectively. The Batch Merge option is intended to be used to merge two query lists of different database Id types into a single list, by determining which of the Ids from the two lists refer to the same gene or biological entity. In the Id conversion html output, hyperlinks are provided from each output Id to the original database for some databases (eg Entrez Gene, UniGene[9]), but not all (eg Affymetrix). Also, MatchMiner follows a hierarchy of source reliability while searching for an Id and specifies the source database in the output; for example, if the input Id is a GenBank[10] Accession, then the algorithm first searches for the Id in University of California Santa Cruz (UCSC)[11] known genes. Only if it is not found there does it search through UniGene and then UCSC expressed sequence tags (ESTs). Details of the hierarchy of source reliability for all source databases can be found in the original article on MatchMiner [8]. Another unique feature of MatchMiner is that it provides a command line interface option for querying, which can be useful in cases where MatchMiner is to be integrated as a part of a pipeline or as a filtering step in a workflow. This feature has certain system requirements; details can be found at http://discover.nci.nih.gov/matchminer/command.jsp. Thus, MatchMiner provides certain unique features that can be useful for specialised Id conversion needs.

Most of the Id converters use the available data sources to create mapping; however, the DAVID gene ID conversion tool[12-14] uses its own knowledge base, which is based on the DAVID gene concept,[13] in addition to the primary Entrez- and UniProt-based mapping. The data source used by this tool includes 20 main gene/protein Id types, in addition to other secondary Id types. It can also handle a mixture of Id types when the input Id type is specified as 'unsure'. The output yields summary statistics for the conversion, including Id count, presence in the DAVID database and conversion status as successful or otherwise, with possible choices for ambiguities, such as when the input Id does not belong to the database specified by the user but exists as an Id in another database. The DAVID Knowledgebase is available for download as well.

bioDBnet[15] provides a converter, 'db2db', which has wide coverage, with 153 database Id types covering genes, proteins, pathways and other biological concepts as its data source. It also provides other menu options for Id conversion: 'dbFind', for when the input Id type is unknown; 'dbWalk', where the user can direct the type of conversion and the intermediate databases to 'walk' through; and 'dbReport', which provides an all-inclusive search from one Id to all other available Ids and annotations. Thus, bioDBnet provides flexible interface options and, importantly, is updated weekly.

Likewise, g:Convert, which is a part of g:Profiler,[16] provides mappings which are mainly based on the Ensembl database,[17] created through a three-level index of gene, transcript and protein Ensembl Ids. By contrast, the cross-reference navigation server (CRONOS)[18] provides mappings which are based on primary resources such as UniProt, RefSeq[19] and Ensembl. These mappings, which are validated by eliminating ambiguous gene names, provide an all-inclusive search from one Id to all other available Ids and annotations.
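
The source-reliability hierarchy described for MatchMiner above can be sketched as an ordered fallback lookup. This is only an illustration of the idea, not MatchMiner's actual implementation: the function `lookup_with_hierarchy`, the in-memory tables and all accessions and gene labels are hypothetical, and only the order of sources for a GenBank Accession (UCSC known genes, then UniGene, then UCSC ESTs) is taken from the text.

```python
def lookup_with_hierarchy(query_id, sources):
    """Return (source_name, hit) from the first source that knows query_id.

    `sources` is an ordered list of (name, table) pairs, most reliable
    first. Less reliable sources are consulted only if all earlier ones
    have no entry. Returns (None, None) when nothing matches.
    """
    for name, table in sources:
        if query_id in table:
            return name, table[query_id]
    return None, None


# Hypothetical lookup tables keyed by placeholder GenBank-style accessions.
genbank_sources = [
    ("UCSC known genes", {"ACC_1": "GENE_A"}),
    ("UniGene", {"ACC_1": "GENE_A_ALT", "ACC_2": "GENE_B"}),
    ("UCSC ESTs", {"ACC_3": "GENE_C"}),
]

hit_1 = lookup_with_hierarchy("ACC_1", genbank_sources)
hit_2 = lookup_with_hierarchy("ACC_2", genbank_sources)
hit_3 = lookup_with_hierarchy("ACC_3", genbank_sources)
hit_4 = lookup_with_hierarchy("ACC_4", genbank_sources)
```

Returning the source name together with the hit mirrors the way the output is described as specifying which source database produced each match.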

Conclusion

This review is by no means comprehensive, but is intended to be representative of the currently available Id converters. There are several other Id converters that are part of other integrative analysis systems which are not reviewed here but might be of interest to researchers, such as Babelomics,[20] BioMart,[21] ID Converter System[22] and BridgeDB [23]. Many users share their feedback on these tools in internet forums (eg http://biostar.stackexchange.com/questions/22/gene-id-conversion-tool). Comparisons have also been made using a test set of Ids to test the performance of different Id converters (eg http://www.scribd.com/doc/18966500/Id-Converters-Test), and these might aid in the selection of an appropriate Id converter. Such comparative analysis is not presented in this review, as the intended use of each of the Id converters is different and each has its own unique features which may not be measured by direct comparison. It is, however, recommended that the choice of an Id converter application be based on the researcher's conversion needs; for example, the availability of the required input and output Id types, an acceptable mapping algorithm and the database update frequency, which are described in this review and summarised in Table 2, as well as other factors that might be of interest for the biological experiment being conducted.
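
For readers who wish to run the kind of test-set comparison mentioned above, one simple measure is the fraction of test Ids for which each converter returns at least one target Id. The Python sketch below assumes the converter outputs have already been collected into dictionaries; the function `mapping_coverage`, the converter names and all Ids are hypothetical, and coverage alone does not capture the differences in intended use noted in this review.

```python
def mapping_coverage(test_ids, results):
    """Fraction of test Ids that each converter mapped to >= 1 target Id.

    `results` maps a converter name to {query_id: set_of_target_ids}.
    A query Id that is absent, or present with an empty set, counts as
    unmapped.
    """
    if not test_ids:
        raise ValueError("test_ids must not be empty")
    coverage = {}
    for converter, mapping in results.items():
        mapped = sum(1 for query_id in test_ids if mapping.get(query_id))
        coverage[converter] = mapped / len(test_ids)
    return coverage


# Hypothetical test set and converter outputs (placeholder Ids only).
test_ids = ["ID_1", "ID_2", "ID_3", "ID_4"]
results = {
    "converter_A": {"ID_1": {"T1"}, "ID_2": {"T2", "T3"}, "ID_3": {"T4"}},
    "converter_B": {"ID_1": {"T1"}, "ID_2": set(), "ID_4": {"T9"}},
}
coverage = mapping_coverage(test_ids, results)
```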

References

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules.

Authors: J L Sussman; D Lin; J Jiang; N O Manning; J Prilusky; O Ritter; E E Abola
Journal: Acta Crystallogr D Biol Crystallogr Date: 1998-11-01

3. GenBank.

Authors: Dennis A Benson; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Barbara A Rapp; David L Wheeler
Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971

4. The human genome browser at UCSC.

Authors: W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal: Genome Res Date: 2002-06 Impact factor: 9.043

5. UniProt: the Universal Protein knowledgebase.

Authors: Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. NetAffx: Affymetrix probesets and annotations.

Authors: Guoying Liu; Ann E Loraine; Ron Shigeta; Melissa Cline; Jill Cheng; Venu Valmeekam; Shaw Sun; David Kulp; Michael A Siani-Rose
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

7. MatchMiner: a tool for batch navigation among gene and gene product identifiers.

Authors: Kimberly J Bussey; David Kane; Margot Sunshine; Sudar Narasimhan; Satoshi Nishizuka; William C Reinhold; Barry Zeeberg; Weinstein Ajay; John N Weinstein
Journal: Genome Biol Date: 2003-03-25 Impact factor: 13.583

8. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. The Universal Protein Resource (UniProt).

Authors: Amos Bairoch; Rolf Apweiler; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
