GeneSeeker: extraction and integration of human disease-related information from web-based genetic databases.

M A van Driel, K Cuelenaere, P P C W Kemmeren, J A M Leunissen, H G Brunner, Gert Vriend.

Abstract

The identification of genes underlying human genetic disorders requires the combination of data related to cytogenetic localization, phenotypes and expression patterns, to generate a list of candidate genes. In the field of human genetics, it is normal to perform this combination analysis by hand. We report on GeneSeeker (http://www.cmbi.ru.nl/GeneSeeker/), a web server that gathers and combines data from a series of databases. All database searches are performed via the web interfaces provided with the original databases, guaranteeing that the most recent data are queried, and obviating data warehousing. GeneSeeker makes the same selection of candidate genes that a human geneticist would make, reducing a time-consuming process to a few minutes. GeneSeeker is particularly well suited for syndromes in which the disease gene displays altered expression patterns in the affected tissue(s).

Year:  2005        PMID: 15980578      PMCID: PMC1160196          DOI: 10.1093/nar/gki435

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

The identification of causative genes in human genetic disorders will be accelerated by the wealth of ‘omics’ information being generated. Geneticists consult a number of databases to search for these genes. Each database concentrates on a different (molecular) aspect. In addition, each database has its own user interface, its own formats for presenting the data and sometimes even its own ontology. Data, such as gene localization and expression patterns, may be distributed over multiple databases. Geneticists normally collect phenotypic and/or expression data and the genes in the chromosomal region(s) of interest, and combine these to get a list of candidate genes. The rationale for this is that the gene that causes a disease is most probably expressed in the tissues affected by that disease (1–3). Using model organisms, such as the mouse, it is often possible to obtain information on genes, proteins, protein interactions and other functional attributes that can be transferred to Homo sapiens by means of synteny and protein homology relationships. The use of data from other species (such as mouse) often proves helpful in identifying the location or function of the equivalent human gene (4). GeneSeeker mimics this multi-species identification strategy (5).

MATERIALS AND METHODS

Databases used

Table 1 lists the databases that GeneSeeker queries. These are divided over database groups (DB-groups). All databases are accessed through their standard WWW interfaces except MIMMAP and OXFORD. MIMMAP is a reformatted version of the OMIM (6) gene mapping information. OXFORD is used to translate human to mouse chromosomal locations, and is described in more detail in the pre-processing section. We use SRS (Sequence Retrieval System, Lion Biosciences, Cambridge, UK) to access these two databases (7). The SRS parser was modified to allow searches for chromosomal ranges.

Table 1

Databases accessed by the GeneSeeker

Database            URL
DB-group 1: localization databases (human)
    OXFORD (15)
    MIMMAP (6)
    GDB (16)
DB-group 2: localization databases (mouse)
    MGD (15)
Datasets used in the interface
    GXD thesaurus       Van Steensel et al. (10)
    Zuerich dataset     Brewer et al. (11,12)
DB-group 3: expression/phenotype databases
    PubMed (National Library of Medicine, Bethesda, MD)
    OMIM (6)
    UniProt (9) (Swiss-Prot, TrEMBL, etc.)
    GXD (17)
    MLC (15)
    TBASE (18) (was tbase, merged January 2005)
‘Link out’ database
    GeneCards (14)

Data processing

The layout of the GeneSeeker web server is shown in Figure 1. The user query consists of a chromosomal band range using standard nomenclature (e.g. 7p15–p21). This cytogenetic localization is passed through DB-group 1. Syntenic regions in the mouse are sought in DB-group 2 using an Oxford grid. Tissues of interest or phenotypic features of a syndrome can be specified by the user as a Boolean expression that is split up and processed by DB-group 3. This modular set-up makes it easy to add extra DB-groups in the future. For every database, a plug-in was designed to perform all tasks from user-query pre-processing to query-result post-processing. These plug-ins deal with a series of technical topics, such as query reformatting, generating the correct URL, filling in the form on that database's web interface, requesting all hits at once rather than in chunks, parsing the database HTML output and so on.

Figure 1

Overview of GeneSeeker. The query, which consists of a cytogenetic localization, a phenotypic description and expression data, is divided over the three DB-groups that use the database-specific plug-ins to deal with all topics ranging from user-query pre-processing to post-processing of the query output. Results from each DB-group are merged with a Boolean OR. The results of the three DB-groups are combined as specified in the user query.
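
The plug-in tasks listed above (query reformatting, URL generation, parsing of HTML output) can be illustrated with a minimal Python sketch. It is not the GeneSeeker implementation: the class name `ExamplePlugin`, the endpoint `https://db.example.org/search`, the parameter names `band` and `limit`, and the canned HTML are all hypothetical, and the sketch performs no network access.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urlencode

Hit = Tuple[str, str]  # (gene name, hyperlink to the database record)


class _LinkCollector(HTMLParser):
    """Collects (link text, href) pairs from the anchor tags of a result page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Hit] = []
        self._href: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text seen while an anchor is open is taken as the gene name.
        if self._href and data.strip():
            self.links.append((data.strip(), self._href))
            self._href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


class ExamplePlugin:
    """Hypothetical plug-in for one remote database."""

    base_url = "https://db.example.org/search"  # assumed endpoint, not a real database

    def preprocess(self, query: str) -> str:
        # Query reformatting: replace the typographic en dash by a plain hyphen.
        return query.replace("\u2013", "-").strip()

    def build_url(self, query: str) -> str:
        # "limit=all" stands in for requesting all hits at once (assumed parameter).
        return self.base_url + "?" + urlencode({"band": self.preprocess(query), "limit": "all"})

    def parse(self, html: str) -> List[Hit]:
        # Post-processing: extract gene names and hyperlinks from the HTML output.
        collector = _LinkCollector()
        collector.feed(html)
        return collector.links


plugin = ExamplePlugin()
url = plugin.build_url("7p15\u2013p21")
hits = plugin.parse('<ul><li><a href="/gene/TWIST1">TWIST1</a></li></ul>')
```

Keeping all database-specific details inside one class mirrors the modular set-up described in the text: adding a database would mean adding one such class, under the assumptions stated here.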

The name of a gene can vary from database to database. The gene for the multi-drug resistance-associated protein 1, for example, is stored as ABCC1, MRP or MRP1, depending on the database used. These gene nomenclature problems have to be solved because GeneSeeker depends on the gene names in the combination steps. For each DB-group the results are integrated with a Boolean OR. The resulting gene lists of the three DB-groups are combined according to the Boolean logic specified in the user query.
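
The two combination steps can be sketched with Python sets. The gene lists below are invented for illustration, and the final expression is only one possible user query; the sketch assumes that gene names have already been mapped to a single nomenclature.

```python
from typing import Iterable, Set


def merge_group(results: Iterable[Set[str]]) -> Set[str]:
    """Boolean OR: union of the gene sets returned by all databases in one DB-group."""
    merged: Set[str] = set()
    for genes in results:
        merged |= genes
    return merged


# Hypothetical per-database results for each DB-group.
human_localization = merge_group([{"MYH8", "MYH1", "TP53"}, {"MYH8", "PMP22"}])
mouse_localization = merge_group([{"MYH8", "MYH4"}])
expression = merge_group([{"MYH8", "MYH4", "ACTA1"}, {"MYH1"}])

# One possible user query: (human OR mouse localization) AND expression.
candidates = (human_localization | mouse_localization) & expression
print(sorted(candidates))  # ['MYH1', 'MYH4', 'MYH8']
```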

Implementation issues

Parallelization

The database plug-ins run in parallel to minimize the waiting time. A queuing system prevents excessive loads on remote servers. The plug-ins return the results of the queries to GeneSeeker as a list containing the gene names and corresponding database hyperlinks.
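
A minimal sketch of this arrangement, using Python's standard `concurrent.futures` module, is given below. It is an assumption-laden illustration rather than the GeneSeeker code: the plug-ins are stand-in functions that return canned hits, and the queuing system is approximated by a single semaphore that caps the number of simultaneous requests (the limit of 2 is arbitrary).

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, str]  # (gene name, hyperlink to the database record)
Plugin = Callable[[str], List[Hit]]

MAX_CONCURRENT_REQUESTS = 2  # assumed cap; stands in for the queuing system
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_REQUESTS)


def run_plugin(plugin: Plugin, query: str) -> List[Hit]:
    """Run one plug-in while holding a request slot."""
    with _slots:
        return plugin(query)


def run_all(plugins: Dict[str, Plugin], query: str) -> Dict[str, List[Hit]]:
    """Submit every plug-in to a thread pool and collect the results per database."""
    with ThreadPoolExecutor(max_workers=max(1, len(plugins))) as pool:
        futures = {name: pool.submit(run_plugin, fn, query) for name, fn in plugins.items()}
        return {name: future.result() for name, future in futures.items()}


# Stand-in plug-ins returning canned hits; real plug-ins would query remote servers.
def fake_omim(query: str) -> List[Hit]:
    return [("MYH8", "https://example.org/omim/MYH8")]


def fake_gxd(query: str) -> List[Hit]:
    return [("Myh8", "https://example.org/gxd/Myh8")]


results = run_all({"OMIM": fake_omim, "GXD": fake_gxd}, "17p12-p13.1")
```

A per-server semaphore would model the load limit more closely; a single global cap is used here only to keep the sketch short.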

Mouse–human synteny

An Oxford grid (8) is used to find the homologous genes and gene regions in the mouse genome for all human chromosome locations entered by the user. A human chromosomal band range is translated into the corresponding mouse chromosome locations. Two mouse locations are combined if the genetic distance is shorter than a user-specified value (defaults to 10 cM). We regenerate this Oxford grid weekly to ensure that the latest synteny information is used in each query.
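
The combination of nearby mouse locations can be sketched as an interval merge. The representation of a location as a chromosome plus a start and end position in cM is an assumption made for this example, as are the coordinates; only the 10 cM default comes from the text.

```python
from typing import List, Tuple

Region = Tuple[str, float, float]  # (mouse chromosome, start in cM, end in cM)


def merge_regions(regions: List[Region], max_gap_cm: float = 10.0) -> List[Region]:
    """Combine regions on the same chromosome whose gap is shorter than max_gap_cm."""
    merged: List[Region] = []
    for chrom, start, end in sorted(regions):
        if merged:
            last_chrom, last_start, last_end = merged[-1]
            if chrom == last_chrom and start - last_end < max_gap_cm:
                # Extend the previous region instead of starting a new one.
                merged[-1] = (last_chrom, last_start, max(last_end, end))
                continue
        merged.append((chrom, start, end))
    return merged


# Hypothetical mouse locations obtained for one human band range.
regions = [("11", 35.0, 40.0), ("11", 46.0, 52.0), ("11", 70.0, 71.0), ("14", 2.0, 5.0)]
print(merge_regions(regions))
# [('11', 35.0, 52.0), ('11', 70.0, 71.0), ('14', 2.0, 5.0)]
```

With the default threshold, the first two regions on chromosome 11 are combined (gap of 6 cM), whereas the third stays separate (gap of 18 cM).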

Gene nomenclature

Inconsistent gene nomenclature is resolved using gene synonym information from the UniProt database (9). We use the MGD human homologue information to interconvert mouse and human gene names. We maintain local copies of these conversion tables because nearly all queries require that gene nomenclature problems be solved.
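
The use of local conversion tables can be sketched as two dictionary look-ups. The tables below are tiny hypothetical excerpts: the ABCC1/MRP/MRP1 synonyms are the example given in the text, whereas the table layout, the mouse entry and the function name `canonical` are assumptions.

```python
from typing import Dict

# Hypothetical excerpt of a local synonym table (synonym -> preferred human name).
SYNONYMS: Dict[str, str] = {"ABCC1": "ABCC1", "MRP": "ABCC1", "MRP1": "ABCC1"}

# Hypothetical excerpt of a local mouse-to-human homologue table.
MOUSE_TO_HUMAN: Dict[str, str] = {"Abcc1": "ABCC1"}


def canonical(name: str) -> str:
    """Map a mouse or human gene name to one preferred human gene name."""
    human = MOUSE_TO_HUMAN.get(name, name)  # mouse name -> human homologue, if known
    return SYNONYMS.get(human, human)       # synonym -> preferred name, if known


names = {canonical(n) for n in ["MRP", "MRP1", "ABCC1", "Abcc1"]}
print(names)  # {'ABCC1'}
```

Names that appear in neither table are returned unchanged, so unknown genes are kept rather than dropped.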

User interface

The GeneSeeker interface consists of the query form shown in Figure 2 and an options form that usually requires no user input. A genetic localization and the phenotypic/expression terms should be entered for a meaningful search. Databases that generate more noise than signal can be removed from the query. The user can also suppress the display of housekeeping genes or a specified list of genes. The options form contains a thesaurus (10) that can help the user to select the correct expression terms: for example, when the user is interested in a genetic trait that results in abnormalities in the brain, selection of the ‘brain’ category returns the hints ‘brain or hindbrain or forebrain…’. Hints for the genetic localization data can be found in a table containing frequently aberrant chromosomal bands in specific disorders taken from literature (11,12). On request, the user can be notified by email when a GeneSeeker search has completed. All parameters are linked to help screens. The results are presented in four tables (Figure 3).

Figure 2

An example of a GeneSeeker query. Trismus-Pseudocamptodactyly syndrome (TPC; MIM 158300) has been linked to 17p12–p13.1 (13). TPC is characterized by defects in muscle tissue, mainly in the limbs and/or mouth. The options form is not shown.

Figure 3

The output of GeneSeeker for the Trismus-Pseudocamptodactyly syndrome query (see Figure 2). It has been shown that mutations in the MYH8 gene can cause TPC (13). Top left table: genes that agree perfectly with the user query. Top right table: genes found in mouse syntenic regions that cannot be mapped automatically on the human genome, but match the expression pattern. Bottom left table: genes found in mouse syntenic regions that match the expression pattern, but map on the human genome outside the candidate cytogenetic region. Bottom right table: human genes in the candidate cytogenetic region that do not match the phenotype/expression pattern. All genes are hyperlinked to the underlying database, and, when possible, to GeneCards (14).
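
The assignment of a gene to one of the four result tables can be sketched as a small decision function. This is one reading of the caption above, not the GeneSeeker code: it assumes a query of the form "localization AND expression", and the function and argument names are hypothetical. The argument `in_human_region` is `None` when the gene cannot be mapped on the human genome.

```python
from typing import Optional


def result_table(in_human_region: Optional[bool],
                 in_mouse_synteny: bool,
                 matches_expression: bool) -> Optional[str]:
    """Return the table a gene would be listed in, or None if it is not reported."""
    if in_human_region:
        # Human gene inside the candidate cytogenetic region.
        return "top left" if matches_expression else "bottom right"
    if in_mouse_synteny and matches_expression:
        # Found via mouse synteny: unmapped in human, or mapped outside the region.
        return "top right" if in_human_region is None else "bottom left"
    return None


print(result_table(True, True, True))    # top left
print(result_table(None, True, True))    # top right
print(result_table(False, True, True))   # bottom left
print(result_table(True, False, False))  # bottom right
```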

RESULTS AND DISCUSSION

GeneSeeker offers a user-friendly quick scan of several databases that are commonly used by geneticists to identify candidate genes for specific Mendelian diseases. As such, GeneSeeker uses those databases that are most appropriate for the questions asked. Several aspects are likely to change in the near future as genomics and genetics develop. For example, our usage of an Oxford grid can be improved or replaced as soon as consensus is reached among the various databases about the localization of genes on the mouse and human genomes. Expression pattern information (e.g. microarray data) is growing rapidly, and is expected to become useful for GeneSeeker in the near future. At the moment, publicly available expression information is still sparse, scattered and not yet standardized. In its present form, GeneSeeker is best suited for syndromes in which one can assume aberrant or absent gene expression in the affected tissues. GeneSeeker allows the user to query heterogeneous databases and obtain good candidate genes for the disease of interest based on positional, expression and model data (5). With the present hardware set-up, GeneSeeker can perform ∼1000 searches per day.
  18 in total

1.  UniProt: the Universal Protein knowledgebase.

Authors:  Rolf Apweiler; Amos Bairoch; Cathy H Wu; Winona C Barker; Brigitte Boeckmann; Serenella Ferro; Elisabeth Gasteiger; Hongzhan Huang; Rodrigo Lopez; Michele Magrane; Maria J Martin; Darren A Natale; Claire O'Donovan; Nicole Redaschi; Lai-Su L Yeh
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

2.  SRS: information retrieval system for molecular biology data banks.

Authors:  T Etzold; A Ulyanov; P Argos
Journal:  Methods Enzymol       Date:  1996       Impact factor: 1.600

Review 3.  Gene-based approach to human gene-phenotype correlations.

Authors:  T P Dryja
Journal:  Proc Natl Acad Sci U S A       Date:  1997-10-28       Impact factor: 11.205

4.  TBASE: a computerized database for transgenic animals and targeted mutations.

Authors:  R P Woychik; J S Wassom; D Kingsbury; D A Jacobson
Journal:  Nature       Date:  1993-05-27       Impact factor: 49.962

5.  The Oxford Grid.

Authors:  J H Edwards
Journal:  Ann Hum Genet       Date:  1991-01       Impact factor: 1.670

6.  The Mouse Gene Expression Database (GXD).

Authors:  M Ringwald; J T Eppig; D A Begley; J P Corradi; I J McCright; T F Hayamizu; D P Hill; J A Kadin; J E Richardson
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

7.  GDB: the Human Genome Database.

Authors:  S I Letovsky; R W Cottingham; C J Porter; P W Li
Journal:  Nucleic Acids Res       Date:  1998-01-01       Impact factor: 16.971

8.  Probing the gene expression database for candidate genes.

Authors:  M A van Steensel; J Celli; J H van Bokhoven; H G Brunner
Journal:  Eur J Hum Genet       Date:  1999-12       Impact factor: 4.246

9.  Mutation of perinatal myosin heavy chain associated with a Carney complex variant.

Authors:  Mark Veugelers; Michael Bressan; Deborah A McDermott; Stanislawa Weremowicz; Cynthia C Morton; C Charlton Mabry; Jean-François Lefaivre; Alan Zunamon; Anne Destree; Jean-Marie Chaudron; Craig T Basson
Journal:  N Engl J Med       Date:  2004-07-29       Impact factor: 91.245

10.  A chromosomal deletion map of human malformations.

Authors:  C Brewer; S Holloway; P Zawalnyski; A Schinzel; D FitzPatrick
Journal:  Am J Hum Genet       Date:  1998-10       Impact factor: 11.025
