
The HUGO Gene Nomenclature Database, 2006 updates.

Tina A Eyre, Fabrice Ducluzeau, Tam P Sneddon, Sue Povey, Elspeth A Bruford, Michael J Lush.

Abstract

The HUGO Gene Nomenclature Committee (HGNC) aims to give every human gene a unique and ideally meaningful name and symbol. The HGNC database, previously known as Genew, contains over 22,000 public records with approved human gene nomenclature and associated information. The database has undergone major improvements throughout the last year, is publicly available for online searching at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl and has a new custom downloads interface at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl.

Year: 2006 PMID: 16381876 PMCID: PMC1347509 DOI: 10.1093/nar/gkj147

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

OVERVIEW

The HUGO Gene Nomenclature Committee (HGNC) maintains a database of unique and approved human gene names and symbols (1). Current estimates put the total number of protein-coding human genes at 20 000–25 000 (2,3), and over 18 000 of these have now been assigned HGNC-approved nomenclature. We also assign nomenclature to other specific features such as fragile sites and disease loci inferred by linkage. This nomenclature is hand-curated and represents the gold standard, to be used in all publications and databases where a specific gene is discussed or referenced. HGNC data can be accessed in two main ways. First, for specific online searches, the HGNC database search engine, Searchgenes, is available at http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/searchgenes.pl with both simple and advanced search options. Second, custom downloads allow the user to download large volumes of data in their own preferred format using our custom download script (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl). The HGNC database migrated from Microsoft Access to PostgreSQL at the end of March 2005. This change has brought not only easier curation for the database editors and greatly improved quality control checking, but also increased search speed and flexibility for both editors and users. In addition, custom downloads are now available to the public, allowing retrieval of precise sets of genes and data about those genes.

IMPROVEMENTS SINCE 2004

Renaming the database

Previously the HGNC database was referred to as Genew (1); however, following the change from Microsoft Access to PostgreSQL in March 2005 it was decided to change this to the easily recognized name of the ‘HGNC Database’. The term Genew was little known and this move seemed more in line with our policy for assigning unique and meaningful nomenclature. HGNC identification numbers, the unique identifiers associated with each gene record in the HGNC database, should now be referred to using the HGNC: prefix. This syntax has been adopted by all the major genome databases that display HGNC data, including Entrez Gene (4), Ensembl (5) and GeneCards (6).
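As an illustration of the prefixed identifier syntax, the following Python sketch converts between a bare numeric identifier and its HGNC: form. The function names are hypothetical and are not part of any HGNC software; the sketch assumes only that an identifier is a positive integer written directly after the HGNC: prefix.

```python
import re

# Assumed shape of a prefixed identifier: the literal prefix, then digits only.
_HGNC_ID = re.compile(r"^HGNC:(\d+)$")


def format_hgnc_id(number: int) -> str:
    """Return the prefixed form of a numeric HGNC identifier."""
    if number <= 0:
        raise ValueError("HGNC identifiers are assumed to be positive integers")
    return f"HGNC:{number}"


def parse_hgnc_id(text: str) -> int:
    """Extract the numeric part of a prefixed identifier, rejecting other input."""
    match = _HGNC_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a prefixed HGNC identifier: {text!r}")
    return int(match.group(1))
```

Rejecting unprefixed input, rather than guessing, keeps a bare number from being mistaken for an identifier from some other database.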

Database editing

The HGNC database is implemented in PostgreSQL version 8.03. It consists of 28 tables containing in total over 500 000 records. The database now integrates public and confidential data, submitted to the HGNC both by independent researchers and by larger-scale projects, such as the Human Genome Sequencing Consortium. This includes the results of our custom BLAST server, making 200 000 sequences searchable and inter-linked with HGNC gene records. Quality control checking is used to enforce formats on the data entered and to check its integrity, and can now be performed on various levels. First, the database checks for invalid formats or missing required data when an editor attempts to save a modified record. Second, scripts are used to error-check records containing newly approved nomenclature prior to release. If an error is found, that record is held back from release into the public domain and the editor responsible is automatically notified. Third, all data are regularly monitored and any inconsistencies are listed on a quality control web page. The HGNC editors are now able to curate the database remotely, using a web-based editing tool on a secure server with SSL encryption. All transactions are logged, providing an audit trail, and SQL triggers are now used to automatically add certain details to the gene records, such as the name of the editor and the date on which modifications were made.
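The first level of checking and the trigger-based audit trail described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the table names, column names and trigger name are all hypothetical rather than taken from the HGNC schema.

```python
import sqlite3

# Hypothetical two-table schema: one gene table and one audit table.
# The CHECK constraint stands in for format checks made when a record is saved;
# the trigger stands in for SQL triggers that record who changed what, and when.
SCHEMA = """
CREATE TABLE gene (
    hgnc_id INTEGER PRIMARY KEY,
    symbol  TEXT NOT NULL CHECK (length(symbol) > 0),
    editor  TEXT NOT NULL
);
CREATE TABLE audit_log (
    hgnc_id     INTEGER NOT NULL,
    old_symbol  TEXT,
    new_symbol  TEXT,
    editor      TEXT,
    modified_at TEXT
);
CREATE TRIGGER log_symbol_change
AFTER UPDATE OF symbol ON gene
BEGIN
    INSERT INTO audit_log (hgnc_id, old_symbol, new_symbol, editor, modified_at)
    VALUES (OLD.hgnc_id, OLD.symbol, NEW.symbol, NEW.editor, datetime('now'));
END;
"""


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def rename_symbol(conn: sqlite3.Connection, hgnc_id: int,
                  new_symbol: str, editor: str) -> None:
    """Change a symbol; the trigger writes the audit row automatically."""
    conn.execute(
        "UPDATE gene SET symbol = ?, editor = ? WHERE hgnc_id = ?",
        (new_symbol, editor, hgnc_id),
    )
    conn.commit()
```

Because the audit row is written by the database rather than by the editing tool, every client that modifies a record leaves the same trail.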

Online improvements

The HGNC database front-end and editor are web-based and written in Perl. The HTML::Template Perl module is used to allow rapid generation of complex data editing and viewing forms containing multiple gene records from simple repeating units. In addition, special-purpose forms can be rapidly generated to support new projects or new applications of HGNC data. Both Searchgenes and the Symbol Report Form results format have been given a new look using new website templates developed in Macromedia Dreamweaver MX2004. It is now very easy to link to a particular Symbol Report Form using a URL that specifies either the HGNC ID or the approved symbol. Linking by HGNC ID is preferred and is more reliable in the long term, since HGNC IDs are constant for any given gene whereas approved symbols may change. When one entry has been merged into another entry, the merged entry remains in the database with ‘Symbol Withdrawn’ status, the text ∼withdrawn is added to the symbol and the gene name is replaced with text indicating the entry it has been merged into. On rare occasions when an entry is split, the original HGNC ID remains associated with the most appropriate entry.
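The handling of a merged entry can be made concrete with a small sketch. Everything named here is hypothetical: the GeneRecord class, the withdraw_into function, the placeholder symbols and the exact wording of the replacement gene name are assumptions for illustration, and an ASCII tilde is used in place of the ∼ character. Only the three effects (the status change, the withdrawn suffix and the replaced name) come from the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical, minimal view of a gene record."""
    hgnc_id: int
    symbol: str
    name: str
    status: str = "Approved"


def withdraw_into(record: GeneRecord, target: GeneRecord) -> GeneRecord:
    """Return the withdrawn form of `record` after merging it into `target`.

    The HGNC ID is left untouched, so links made by ID still resolve.
    """
    return replace(
        record,
        symbol=f"{record.symbol}~withdrawn",
        # Assumed wording; the source says only that the name points to the target.
        name=f"symbol withdrawn, see {target.symbol}",
        status="Symbol Withdrawn",
    )
```

A link built from the old symbol no longer matches the stored symbol after the merge, whereas a link built from the ID still reaches the same record, which is why linking by ID is the more durable choice.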

Custom data downloads—basic use

Predefined downloads of HGNC data are now available from our custom downloads page (http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl) in both plain text and HTML formats. The previously available static file downloads have been phased out; the new system is more convenient and flexible, and includes improved documentation. A variety of data are available, including approved gene symbol and name, literature and database aliases, chromosomal location, sequence accession numbers and a gene family name (where applicable). Links to relevant entries in other databases, such as Ensembl (5), GENATLAS (7), GeneCards (6), GeneClinics/GeneTests (8), IMGT (9), Entrez Gene (4), MGD (10), PubMed (11), OMIM (11), RefSeq (11), Swiss-Prot (12), UCSC (13) and Vega (14) are also provided. A particularly important feature of the custom downloads pages is that the results are generated dynamically, so that they are up to date whenever the user returns to the saved URL. The URL also encodes the format of the data, so the format is preserved as the database develops and new fields are added.
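The idea that a saved URL carries both the query and the output format can be sketched as follows. The base address is the custom downloads URL given in the abstract; the parameter names (col, format, where) and the column names are assumptions made for this illustration and should be checked against the script's own documentation before use.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Address of the custom downloads script as given in the abstract.
BASE_URL = "http://www.gene.ucl.ac.uk/cgi-bin/nomenclature/gdlw.pl"


def build_download_url(columns, output_format="text", where=""):
    """Encode the requested columns, output format and filter in one URL.

    The parameter names used here are hypothetical.
    """
    params = [("col", column) for column in columns]
    params.append(("format", output_format))
    if where:
        params.append(("where", where))
    return f"{BASE_URL}?{urlencode(params)}"


def describe_url(url):
    """Decode a saved URL back into its request parameters."""
    return parse_qs(urlsplit(url).query)
```

Since the column list travels inside the URL, a saved link keeps returning the same columns in the same order even if further fields are later added to the database.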

Custom data downloads—advanced use

More advanced users may use the script directly to select custom views of HGNC data using simple SQL ‘WHERE’ clauses. This enables data for a particular group of genes to be displayed. The data returned may also be limited by chromosome. Documentation for this feature is available online. Users may specify the output format of their searches. The ‘HTML’ option gives a simple HTML table of results with hyperlinks to the HGNC gene symbol reports, as well as to a limited set of relevant entries in external databases. The ‘Gene Report Table’ format produces a series of tables, each containing data for a single gene with more links. The ‘Text’ output format is particularly useful for downloading data into a tab-delimited file that may be processed further, imported into other databases or viewed in spreadsheet programs. A valuable debugging option when using the WHERE field is the ‘Show SQL’ output option, which displays the SQL query without executing it. Users can directly include a particular table of data within their own web pages by using the ‘PHP Code’ output option to generate code to be embedded in a PHP document. This technique is used to generate dynamically updated Gene Family Report pages. Finally, the ‘Perl Code’ format generates a snippet of code that uses the LWP::Simple module to download the data specified in that search. This option facilitates automatic downloads of HGNC data. Again, the format of the results is specified by the code and will be maintained even when modifications to the database structure are made.
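For readers who do not work in Perl, an automated download of the ‘Text’ output might look roughly like the Python sketch below. It is not the code generated by the ‘Perl Code’ option. The function names are hypothetical, the response is assumed to be UTF-8 text with a header row, and the column headings and rows in SAMPLE are made-up placeholders rather than real HGNC data.

```python
import csv
import io
from urllib.request import urlopen


def fetch_text(url, timeout=30):
    """Download the body of a saved custom downloads URL (assumed UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_tab_delimited(text):
    """Turn tab-delimited output with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in gene names literally
    )
    return list(reader)


# Made-up placeholder rows, used only to show the shape of the parsed result.
SAMPLE = (
    "HGNC ID\tApproved Symbol\tChromosome\n"
    "HGNC:00001\tGENEA\t1p\n"
    "HGNC:00002\tGENEB\t2q\n"
)
```

In use, the two steps would be chained as parse_tab_delimited(fetch_text(saved_url)), where saved_url is whatever address the custom downloads page produced for the search.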

USAGE OF THE HGNC DATABASE

The HGNC custom downloads script received 506 000 hits between January 1 and June 30, 2005, an average of 2800 per day (excluding queries made by HGNC staff and major web crawlers). Searchgenes was queried 290 000 times in this same period. Nearly all (99%) of our custom downloads users make use of the WHERE clause functionality, rather than downloading the entire data set. Of these users, 41% selected plain-text output and 59% requested the Gene Report output, suggesting that the download script is frequently being used as an application program interface (API) to serve specific subsets of HGNC data to external applications. Consistent with this, the most popular searches were for single records specified by HGNC ID (78%) or approved symbol (18%). Multiple gene records can be returned using inexact query terms with the keywords ‘LIKE’ or ‘ILIKE’, or with the ‘IN’ keyword to identify records matching a list of queries. Less than 1% of searches used these inexact terms, again suggesting the use of the download script as an API. These inexact queries are nevertheless valuable for concurrently downloading, viewing or linking to a set of records of interest, such as those belonging to a particular group of genes.
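The three kinds of query mentioned above (an exact match, a list match with IN and a pattern match with ILIKE) can be written as WHERE clause fragments along the following lines. The helper names and any column names passed to them are hypothetical, and doubling single quotes is the standard SQL way of escaping a string literal; whether the download script accepts exactly this quoting is an assumption.

```python
def quote_literal(value: str) -> str:
    """Quote a string as an SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def where_equals(column: str, value: str) -> str:
    """Exact match on one record, the most common kind of request."""
    return f"{column} = {quote_literal(value)}"


def where_in(column: str, values) -> str:
    """Match any record whose column value appears in the given list."""
    values = list(values)
    if not values:
        raise ValueError("IN requires at least one value")
    quoted = ", ".join(quote_literal(value) for value in values)
    return f"{column} IN ({quoted})"


def where_ilike(column: str, pattern: str) -> str:
    """Case-insensitive pattern match; % is the SQL wildcard."""
    return f"{column} ILIKE {quote_literal(pattern)}"
```

A clause such as where_ilike("symbol", "ABC%") would select every record whose symbol starts with those letters in any case, which is the sort of group retrieval that inexact queries make possible.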

FUTURE DIRECTIONS

In the near future, the HGNC website will provide an online form for direct submission of sequences to the database, to streamline the flow of data. In addition, Searchgenes will be superseded by an improved search facility offering new fields, such as Name Aliases, as well as fields such as locus type that are currently available only in the downloadable dataset.

CONCLUSIONS

The developments described here have provided much-needed automation and opened the way for continued improvements in database flexibility and agility. As a result, the HGNC database is now far better able to respond to the needs of both its editors and the community.

CITATION

Authors are requested to cite this article and the database in the following format: ‘The HGNC Database, HUGO Gene Nomenclature Committee (HGNC), Department of Biology, University College London, Wolfson House, 4 Stephenson Way, London NW1 2HE, UK (URL: )’. [Include month and year in which you retrieved the data cited.]

14 in total

1. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003.

Authors: Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

2. IMGT, the international ImMunoGeneTics database.

Authors: Marie-Paule Lefranc
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. Comparison of the current RefSeq, Ensembl and EST databases for counting genes and gene discovery.

Authors: Thomas P Larsson; Christian G Murray; Tobias Hill; Robert Fredriksson; Helgi B Schiöth
Journal: FEBS Lett Date: 2005-01-31 Impact factor: 4.124

5. Genatlas database, genes and development defects.

Authors: J Frézal
Journal: C R Acad Sci III Date: 1998-10

6. GeneTests-GeneClinics: genetic testing information for a growing audience.

Authors: Roberta A Pagon; Peter Tarczy-Hornoch; Patricia K Baskin; Joseph E Edwards; Maxine L Covington; Miriam Espeseth; Christine Beahler; Thomas D Bird; Bradley Popovich; Charli Nesbitt; Cynthia Dolan; Kathi Marymee; Nancy B Hanson; Whitney Neufeld-Kaiser; Gina McCullough Grohs; Tracy Kicklighter; Cynthia Abair; Audin Malmin; Matthew Barclay; Rajasri Dharani Palepu
Journal: Hum Mutat Date: 2002-05 Impact factor: 4.878

7. Genew: the Human Gene Nomenclature Database, 2004 updates.

Authors: Hester M Wain; Michael J Lush; Fabrice Ducluzeau; Varsha K Khodiyar; Sue Povey
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

8. Human Gene-Centric Databases at the Weizmann Institute of Science: GeneCards, UDB, CroW 21 and HORDE.

Authors: Marilyn Safran; Vered Chalifa-Caspi; Orit Shmueli; Tsviya Olender; Michal Lapidot; Naomi Rosen; Michael Shmoish; Yakov Peter; Gustavo Glusman; Ester Feldmesser; Avital Adato; Inga Peter; Miriam Khen; Tal Atarot; Yoram Groner; Doron Lancet
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

9. Ensembl 2005.

Authors: T Hubbard; D Andrews; M Caccamo; G Cameron; Y Chen; M Clamp; L Clarke; G Coates; T Cox; F Cunningham; V Curwen; T Cutts; T Down; R Durbin; X M Fernandez-Suarez; J Gilbert; M Hammond; J Herrero; H Hotz; K Howe; V Iyer; K Jekosch; A Kahari; A Kasprzyk; D Keefe; S Keenan; F Kokocinsci; D London; I Longden; G McVicker; C Melsopp; P Meidl; S Potter; G Proctor; M Rae; D Rios; M Schuster; S Searle; J Severin; G Slater; D Smedley; J Smith; W Spooner; A Stabenau; J Stalker; R Storey; S Trevanion; A Ureta-Vidal; J Vogel; S White; C Woodwark; E Birney
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
