The Booly aliasing resource: a database of grouped biological identifiers.

Abstract

Redundancy among sequence identifiers is a recurring problem in bioinformatics. Here, we present a rapid and efficient method of fingerprinting identifiers to ascertain whether two or more aliases are identical. A number of tools and approaches have been developed to resolve differing names for the same genes and proteins; however, each of these methods has limitations associated with its particular goals. We have taken a different approach to the aliasing problem by simplifying the way aliases are stored and curated, with the objective of simultaneously achieving speed and flexibility. Our approach (Booly-hashing) is to link identifiers with their corresponding hash keys derived from unique fingerprints such as gene or protein sequences. This tool has proven invaluable for designing a new data integration platform known as Booly, and has wide applicability to situations in which a dedicated, efficient aliasing system is required. Compared with other aliasing techniques, the Booly-hashing methodology provides 1) reduced run time complexity, 2) increased flexibility (aliasing of other data types, e.g. pharmaceutical drugs), 3) no required assumptions regarding gene clusters or hierarchies, and 4) simplicity in data addition, updating, and maintenance. The new Booly-hashing aliasing model has been incorporated as a central component of the Booly data integration platform we have recently developed and should be broadly applicable to other situations in which an efficient, streamlined aliasing system is required. This aliasing tool and database, which allows users to quickly group the same genes and proteins together, can be accessed at: http://booly.ucsd.edu/alias. AVAILABILITY: The database is available for free at http://booly.ucsd.edu/alias.

Year: 2011 PMID: 21544171 PMCID: PMC3082858 DOI: 10.6026/97320630006083

Source DB: PubMed Journal: Bioinformation ISSN: 0973-2063

Background

A common problem confronted by bioinformaticians is the need to resolve whether two or more identifiers are identical, i.e., are aliases of each other. A number of aliasing services have attempted to resolve the differing naming conventions created by both computational and manual labelling methods (AliasServer, DAVID, HGNC, SEGUID, MagicMatch, NCBI, ENSEMBL) [1-7]. These services differ in their technology and solutions, with the general strategy of either 1) using in-house generated unique identifiers (NCBI, DAVID, ENSEMBL), or 2) generating unique fingerprints (AliasServer, MagicMatch, SEGUID) by way of cryptographic hashing algorithms, which digest large arbitrary blocks of data (e.g., sequence) and return a fixed-size bit string [8]. As each of these systems is designed with a specific goal in mind, none of them is optimized for answering the single root question: are two identifiers the same? (Figure 1a)
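The fixed-size property of such a digest can be illustrated with a minimal sketch. It uses SHA-1 from Python's standard `hashlib`, the 160-bit algorithm the paper names for Booly-hashing. The function name `fingerprint`, the whitespace and case normalization, and the toy sequences are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def fingerprint(sequence: str) -> str:
    """Return the SHA-1 digest of a sequence as a 40-character hex string.

    Normalization (dropping whitespace, upper-casing) is an assumption of
    this sketch; the paper does not state how sequences are canonicalized.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


# Toy sequences (hypothetical, not real proteins).
short_key = fingerprint("MST")
long_key = fingerprint("MSTAGKVIKCKAAVLWE" * 50)

# The digest length does not depend on the input length.
print(len(short_key), len(long_key))      # 40 40
# Formatting differences vanish after normalization.
print(fingerprint("mst\n") == short_key)  # True
```

Because every input maps to a key of the same size, comparing two sequences reduces to comparing two short strings.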

Figure 1

Booly aliasing resource. (a) Difference between other aliasing approaches and the Booly-hashing method. The single question we wish to answer efficiently is whether two identifiers (e.g., FBgn0000055 and ADH) are one and the same. Booly-hashing utilizes a 160-bit SHA-1 hash key to generate unique fingerprints of sequences and their identifiers, represented as a 40-character hexadecimal number. Identifiers with the same hash keys are considered aliases of each other. Other approaches require knowledge of the source of the original identifier or knowledge of a conversion format, requiring additional steps that increase complexity and programming. (b) Comparison of two commonly used aliasing tools in bioinformatics (AliasServer and DAVID Gene Conversion Tool) against the Booly-hashing resource.

In the course of designing a comprehensive data warehousing and comparison application called Booly [9], we recognized a need for a dedicated aliasing tool designed to efficiently and flexibly resolve alias identities. One of the main tasks of Booly is to mix and match datasets using combinations of Boolean operations. A common usage of such a tool is data aggregation between multiple sources (e.g. the aggregation of Gene Ontology data with a home-brew spreadsheet table for annotation). When identifiers from both datasets are in the same format (e.g., gene symbol), integrating the data is trivial. However, integration becomes more challenging when formats must be converted, and it turns into an unwieldy aliasing problem. This problem is compounded when comparing multiple datasets with differing identifier formats. Furthermore, Booly was created to compare content that extends beyond sequence data (e.g., databases of pharmaceutical drugs, human diseases, or other web-based content). With these requirements in mind, we designed an aliasing system (Booly-hashing) that can quickly resolve heterogeneous identifiers from multiple sources while maintaining the flexibility to handle aliases from multiple entities.

Booly-hashing is an aliasing database resource that utilizes a 160-bit Secure Hash Algorithm (SHA) hash key to generate unique fingerprints of sequences and their identifiers, represented as a 40-character hexadecimal number (Figure 1a) [10]. Our streamlined approach requires the storage of only the hash key and its associated identifier. Current aliasing methods that utilize hashing technology require the source of the identifiers to be known (AliasServer, SEGUID) [1, 5]. This limits the ability to find aliases of identifiers from heterogeneous sources. Our simplified technique is more broadly applicable, as it allows conversion to known hash keys for any identifier regardless of originating source. Another aliasing approach is to convert aliases into known reference identifiers (e.g. RefSeq, GenBank Gene ID) such that one can then easily ascertain whether two identifiers are the same (DAVID) [3]. However, this approach is insufficient, as some reference databases are incomplete and lack the overlap required to be inclusive of all known sequences and their identifiers (see Table 1). In contrast, our aliasing approach utilizes the sequence hash key as a singular point of conversion. Finally, unlike other sequence-related aliasing technologies, we have developed our Booly-hashing infrastructure to accommodate aliases from other sources such as pharmaceutical drugs or keyword aliases. In the case of pharmaceutical drugs, for example, the unique fingerprint is the chemical formula, which remains intact despite multiple brand names. A comparison of the features of our approach and other aliasing tools can be found in Figure 1b.

In aggregate, our aliasing method allows one to efficiently and accurately ascertain whether two or more identifiers are aliases of each other. Furthermore, our streamlined approach is flexible and easy to modify and update. We have incorporated this aliasing model as a core component of Booly, our data integration platform designed to aid researchers in making new connections leading to novel discoveries in the laboratory. This generalized aliasing system should be of similar utility for the development of other comparative tools that also have the simple requirement of rapidly deciding whether two identifiers are the same. Additionally, we have created an online tool that takes as input a list of identifiers and groups them into clusters sharing the same gene or protein sequence.
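The idea that identifiers sharing a hash key are aliases can be sketched as follows. This is a minimal in-memory illustration, not the Booly implementation: the class `AliasTable`, its method names, and the placeholder sequences are all hypothetical, and the identifiers FBgn0000055 and ADH are borrowed from the Figure 1 example only as labels.

```python
import hashlib
from collections import defaultdict


def hash_key(fingerprint_text: str) -> str:
    """SHA-1 hex digest (40 characters) of a fingerprint such as a sequence."""
    normalized = "".join(fingerprint_text.split()).upper()  # assumed normalization
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


class AliasTable:
    """Stores only (identifier, hash key) pairs; the identifier source is not needed."""

    def __init__(self):
        self._keys = defaultdict(set)  # identifier -> set of hash keys

    def add(self, identifier: str, fingerprint_text: str) -> None:
        self._keys[identifier].add(hash_key(fingerprint_text))

    def are_aliases(self, a: str, b: str) -> bool:
        """Two identifiers are aliases if they share at least one hash key."""
        return bool(self._keys.get(a, set()) & self._keys.get(b, set()))

    def group(self, identifiers):
        """Cluster the given identifiers by shared hash key."""
        clusters = defaultdict(list)
        for ident in identifiers:
            for key in sorted(self._keys.get(ident, ())):
                clusters[key].append(ident)
        return dict(clusters)


table = AliasTable()
# Placeholder sequences, NOT the real sequences behind these identifiers.
table.add("FBgn0000055", "MSFTLTNKNV")
table.add("ADH", "MSFTLTNKNV")
table.add("OTHER1", "MKKLLPTAAA")

print(table.are_aliases("FBgn0000055", "ADH"))  # True
print(table.are_aliases("ADH", "OTHER1"))       # False
groups = table.group(["FBgn0000055", "ADH", "OTHER1"])
print(sorted(groups.values()))  # [['FBgn0000055', 'ADH'], ['OTHER1']]
```

In this sketch the alias check is a set intersection on precomputed keys, so no conversion to a reference identifier format is involved. Mapping each identifier to a set of keys, rather than a single key, is a design assumption that lets one name (for example a gene symbol) point to more than one sequence.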

Summary

The process of determining whether two or more identifiers are aliases of each other is a common recurring problem in bioinformatics. To this end, we have created a streamlined aliasing method that utilizes a fingerprint such as a sequence or chemical formula for the purpose of creating unique hash-key identifiers. Our approach affords us a number of advantages over existing aliasing solutions, including a reduction in run time complexity, increased flexibility, flexible alias clusters, and simplicity in addition of new data, updating, and maintenance. In addition to performing well for Booly, these advantages should allow better integration of data containing heterogeneous identifiers leading to new connections and novel discoveries within many fields of science.

Author Contributions

EB advised on the study and helped write the manuscript. LHD conceived of the study, was responsible for its design and coordination, implemented the Booly aliasing resource, and wrote the manuscript. All authors read and approved the final manuscript.

8 in total

1. AliasServer: a web server to handle multiple aliases used to refer to proteins.

Authors: Florian Iragne; Aurélien Barré; Nicolas Goffard; Antoine De Daruvar
Journal: Bioinformatics Date: 2004-04-01 Impact factor: 6.937

2. MagicMatch--cross-referencing sequence identifiers across databases.

Authors: Mike Smith; Victor Kunin; Leon Goldovsky; Anton J Enright; Christos A Ouzounis
Journal: Bioinformatics Date: 2005-06-16 Impact factor: 6.937

3. A database of unique protein sequence identifiers for proteome studies.

Authors: György Babnigg; Carol S Giometti
Journal: Proteomics Date: 2006-08 Impact factor: 3.984

4. Booly: a new data integration platform.

Authors: Long H Do; Francisco F Esteves; Harvey J Karten; Ethan Bier
Journal: BMC Bioinformatics Date: 2010-10-13 Impact factor: 3.169

5. The HUGO Gene Nomenclature Database, 2006 updates.

Authors: Tina A Eyre; Fabrice Ducluzeau; Tam P Sneddon; Sue Povey; Elspeth A Bruford; Michael J Lush
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. DAVID gene ID conversion tool.

Authors: Da Wei Huang; Brad T Sherman; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki
Journal: Bioinformation Date: 2008-07-30

7. Ensembl 2009.

Authors: T J P Hubbard; B L Aken; S Ayling; B Ballester; K Beal; E Bragin; S Brent; Y Chen; P Clapham; L Clarke; G Coates; S Fairley; S Fitzgerald; J Fernandez-Banet; L Gordon; S Graf; S Haider; M Hammond; R Holland; K Howe; A Jenkinson; N Johnson; A Kahari; D Keefe; S Keenan; R Kinsella; F Kokocinski; E Kulesha; D Lawson; I Longden; K Megy; P Meidl; B Overduin; A Parker; B Pritchard; D Rios; M Schuster; G Slater; D Smedley; W Spooner; G Spudich; S Trevanion; A Vilella; J Vogel; S White; S Wilder; A Zadissa; E Birney; F Cunningham; V Curwen; R Durbin; X M Fernandez-Suarez; J Herrero; A Kasprzyk; G Proctor; J Smith; S Searle; P Flicek
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

8. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Vyacheslav Chetvernin; Deanna M Church; Michael Dicuccio; Ron Edgar; Scott Federhen; Michael Feolo; Lewis Y Geer; Wolfgang Helmberg; Yuri Kapustin; Oleg Khovayko; David Landsman; David J Lipman; Thomas L Madden; Donna R Maglott; Vadim Miller; James Ostell; Kim D Pruitt; Gregory D Schuler; Martin Shumway; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Alexandre Souvorov; Grigory Starchenko; Roman L Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2007-11-27 Impact factor: 16.971
