GBA server: EST-based digital gene expression profiling.

Xin Wu, Michael G Walker, Jingchu Luo, Liping Wei.

Abstract

Expressed Sequence Tag-based gene expression profiling can be used to discover functionally associated genes on a large scale. Currently available web servers and tools focus on finding differentially expressed genes in different samples or tissues rather than finding co-expressed genes. To fill this gap, we have developed a web server that implements the GBA (Guilt-by-Association) co-expression algorithm, which has been successfully used in finding disease-related genes. We have also annotated UniGene clusters with links to several important databases such as GO, KEGG, OMIM, Gene, IPI and HomoloGene. The GBA server can be accessed and downloaded at http://gba.cbi.pku.edu.cn.

Year: 2005 PMID: 15980560 PMCID: PMC1160240 DOI: 10.1093/nar/gki480

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The sequencing and analysis of Expressed Sequence Tags (ESTs) is one of the three most important techniques used to study gene expression, the other two being DNA microarray and SAGE. Vast amounts of EST data are now available, and the volume is growing rapidly. Currently there are over 25 million EST sequences in NCBI's dbEST database, of which six million were added in 2004 alone. Because the same gene may be represented by many different EST sequences, the UniGene database (1) was developed to partition nucleotide sequences into a non-redundant set of gene-oriented clusters. Although it can be difficult to use data from more than one microarray experiment because the platforms are often different, it is straightforward to use EST data across different libraries, tissues and disease stages. Analysis of these EST data, often with the help of the UniGene database, holds tremendous value for the study of gene expression. For instance, Baranova et al. (2) developed software named HsAnalyst that identified novel tumor markers and potential targets for anti-tumor therapy using data in dbEST and UniGene. Stanton et al. (3) built differential gene expression profiles from dbEST and UniGene to identify tissue-enriched genes in mouse pancreas, mammary gland and heart. Ewing et al. (4) built digital gene expression profiles of the rice genome using dbEST and analyzed these profiles using Pearson's correlation coefficient; they found two co-expressed clusters of contigs encoding proteins with seed-related functions. Walker et al. (5) developed the Guilt-by-Association (GBA) algorithm, which uses the Fisher exact test to find genes that have similar expression patterns to that of a query gene across EST libraries; using known disease-related genes as a query (bait), they successfully identified new genes that are associated with schizophrenia, Parkinson's disease and prostate cancer (5,6). Thompson et al. (7) applied the GBA algorithm to identify groups of genes involved in common cellular processes, which they named ‘functional modules’, in pregnancy, breast cancer and ovarian cancer, and validated the results using real-time PCR.

Despite the proven importance of EST-based expression profiling, there is a shortage of web servers that conduct such analysis. The deficiency is especially clear when contrasted with the large number of available web servers for microarray analysis. Furthermore, the EST servers that do exist, including Digital Differential Display (DDD), cDNA Digital Gene Expression Displayer (DGED), xProfiler and GEPIS (8), all aim to find differentially expressed genes in different pools of tissues or samples. For instance, Scheurle et al. (9) used DDD to find up-regulated and down-regulated genes in solid tumors. Currently there is no web server that can find co-expressed genes based on ESTs in cDNA libraries and UniGene. In particular, the GBA algorithm is not publicly available as a web server or standalone software.

Given the value of finding co-expressed genes, we have developed a new web server, named GBA, that has two types of functions. First, given a gene sequence, it uses the GBA algorithm to find other genes (clustered by UniGene) that have statistically similar expression patterns across all EST libraries. Thus a user can input a novel sequence and find the most closely co-expressed genes, which can offer valuable information on the function of the input sequence. Alternatively, the user can input a gene known to be involved in a disease and find new genes co-expressed with it that may also be involved in the same disease. Second, we have linked UniGene clusters to a variety of other databases, including Gene, IPI (10), HomoloGene, OMIM (11), GOA (12) and KEGG (13). The server allows a user to input a UniGene ID and retrieve extensive information about the gene, such as functional categories and known pathways. Conversely, it allows a user to input an ID from GO, KEGG, etc. and retrieve all UniGene clusters involved in the particular GO function or KEGG pathway.

DESCRIPTION OF FUNCTIONS

The GBA server supports four functions: GBA Engine and Gene Matcher, which implement the GBA algorithm, and UniGene Annotation and Annotation Linker, which link UniGene clusters to several important molecular databases.

GBA Engine

We have downloaded and parsed all cDNA libraries from EMBL (14) and CGAP (15) and stored the data in a relational database. Because libraries with too few entries are under-sequenced and do not adequately reflect the true expression levels of genes, GBA Engine allows the user to specify the minimum number of cDNA sequences in a library in order for it to be included in the analysis. Table 1 shows how using a different minimum number of cDNA sequences affects the number of libraries included in the analysis. Based on previous experience, we recommend using a cutoff of 500 or 1000.

Table 1

Number of human and mouse libraries in the CGAP database using different minimum numbers of cDNA sequences

No. of cDNAs in library	No. of human libraries	No. of mouse libraries
>5000	276	253
>2000	488	401
>1000	713	516
>500	1117	584
>100	3774	651
Total	8570	965
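
The library-size cutoff described above can be illustrated with a minimal Python sketch. It is not the server's code (the server is written in Java and reads from a relational database); the function name, the library IDs and their sizes are hypothetical, and the strict "greater than" comparison is an assumption that follows the ">" notation of Table 1.

```python
from typing import Dict


def filter_libraries(library_sizes: Dict[str, int], min_cdnas: int = 500) -> Dict[str, int]:
    """Keep only libraries with more than `min_cdnas` cDNA sequences.

    `library_sizes` maps a library ID to its number of cDNA sequences.
    The strict '>' mirrors the notation of Table 1 (an assumption).
    """
    return {lib: n for lib, n in library_sizes.items() if n > min_cdnas}


# Hypothetical library IDs and sizes, for illustration only.
libraries = {"lib_A": 120, "lib_B": 750, "lib_C": 5200, "lib_D": 40}
kept = filter_libraries(libraries, min_cdnas=500)
print(sorted(kept))
```

Raising the cutoff trades the number of libraries (and hence statistical power) against the reliability of each presence/absence call, which is the trade-off Table 1 quantifies.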

A gene is considered present (expressed) in a library if at least one cDNA sequence corresponding to the gene is found in the library. The presence/absence of a gene in all libraries forms a vector that represents its expression profile. For a pair of genes, A and B, GBA Engine converts their expression profiles into a 2 × 2 contingency table, showing the numbers of libraries where both A and B are present, where A is present but B is absent, where B is present but A is absent, and where both A and B are absent. GBA Engine then applies the Fisher exact test to the null hypothesis that there is no association between A and B and determines a P-value for statistical significance. Because multiple statistical tests are performed, GBA Engine can optionally apply a Bonferroni correction to the P-value to reduce the false positive rate. Given a gene (represented by its UniGene ID), GBA Engine returns a list of other UniGene clusters that have similar expression patterns across cDNA libraries, ranked by their P-values.
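
The procedure in this paragraph can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the server's Java implementation: the function names and the toy profiles are hypothetical, the test is taken to be one-sided (testing for positive association), and the Bonferroni correction is applied by multiplying each P-value by the number of genes tested against the query and capping the result at 1.

```python
from math import comb
from typing import Dict, List, Sequence, Tuple


def contingency(a: Sequence[int], b: Sequence[int]) -> Tuple[int, int, int, int]:
    """Build the 2 x 2 table from two presence/absence profiles.

    Returns (both present, A only, B only, both absent).
    """
    both = sum(1 for x, y in zip(a, b) if x and y)
    a_only = sum(1 for x, y in zip(a, b) if x and not y)
    b_only = sum(1 for x, y in zip(a, b) if y and not x)
    neither = sum(1 for x, y in zip(a, b) if not x and not y)
    return both, a_only, b_only, neither


def fisher_one_sided(both: int, a_only: int, b_only: int, neither: int) -> float:
    """One-sided Fisher exact test P-value for positive association.

    Sums hypergeometric probabilities of tables with at least `both`
    co-occurrences, holding the row and column totals fixed.
    """
    n = both + a_only + b_only + neither
    row_a = both + a_only  # libraries in which A is present
    col_b = both + b_only  # libraries in which B is present
    upper = min(row_a, col_b)
    tail = sum(
        comb(row_a, k) * comb(n - row_a, col_b - k)
        for k in range(both, upper + 1)
    )
    return tail / comb(n, col_b)


def rank_coexpressed(
    query: str, profiles: Dict[str, Sequence[int]], bonferroni: bool = True
) -> List[Tuple[str, float]]:
    """Rank all other genes by P-value against the query gene."""
    others = {g: p for g, p in profiles.items() if g != query}
    n_tests = len(others)
    results = []
    for gene, profile in others.items():
        p = fisher_one_sided(*contingency(profiles[query], profile))
        if bonferroni:
            p = min(1.0, p * n_tests)  # Bonferroni: scale by number of tests
        results.append((gene, p))
    return sorted(results, key=lambda r: r[1])


# Toy profiles over eight hypothetical libraries (1 = present, 0 = absent).
profiles = {
    "geneQ": [1, 1, 1, 1, 0, 0, 0, 0],
    "geneX": [1, 1, 1, 1, 0, 0, 0, 0],
    "geneY": [1, 0, 1, 0, 1, 0, 1, 0],
}
ranking = rank_coexpressed("geneQ", profiles)
print(ranking)
```

In this toy example geneX, whose profile is identical to the query's, ranks ahead of geneY, whose presence is unrelated to the query's.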

Gene Matcher

If a user has an anonymous gene sequence, its corresponding UniGene cluster must be identified before GBA Engine or UniGene Annotation can be run. Gene Matcher provides an interface that runs BLAST (16) to query the gene sequence against all UniGene sequences and identify the matching UniGene cluster.

UniGene Annotation and Annotation Linker

To help users better understand the functions of UniGene clusters, we have integrated UniGene clusters with several molecular databases including Gene, IPI, HomoloGene, OMIM, GO and KEGG, which we downloaded, parsed and stored in a relational database. Given a UniGene ID, UniGene Annotation returns detailed information on the gene locus, orthologs, disease association, GO categories and pathways. Conversely, given an ID from one of the other databases, Annotation Linker returns all UniGene clusters that are associated with the corresponding locus, ortholog, GO category or pathway.
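
The two lookup directions can be pictured as a bidirectional index. The Python sketch below is an in-memory stand-in for the relational database the server actually uses; the class and method names are hypothetical. The only annotation loaded is the one reported in this article (UniGene cluster Hs.370408 in KEGG pathway hsa00350).

```python
from collections import defaultdict
from typing import DefaultDict, Set, Tuple

Annotation = Tuple[str, str]  # (database name, identifier in that database)


class AnnotationIndex:
    """Bidirectional mapping between UniGene IDs and external annotations."""

    def __init__(self) -> None:
        self._by_cluster: DefaultDict[str, Set[Annotation]] = defaultdict(set)
        self._by_annotation: DefaultDict[Annotation, Set[str]] = defaultdict(set)

    def add(self, unigene_id: str, database: str, identifier: str) -> None:
        """Record one link in both directions."""
        key = (database, identifier)
        self._by_cluster[unigene_id].add(key)
        self._by_annotation[key].add(unigene_id)

    def annotations_for(self, unigene_id: str) -> Set[Annotation]:
        """UniGene Annotation direction: cluster -> external entries."""
        return set(self._by_cluster.get(unigene_id, set()))

    def clusters_for(self, database: str, identifier: str) -> Set[str]:
        """Annotation Linker direction: external entry -> clusters."""
        return set(self._by_annotation.get((database, identifier), set()))


index = AnnotationIndex()
index.add("Hs.370408", "KEGG", "hsa00350")  # link reported in this article

print(index.annotations_for("Hs.370408"))
print(index.clusters_for("KEGG", "hsa00350"))
```

Storing each link in both directions keeps both kinds of query a single lookup; in a relational database the same effect would typically come from indexing a link table on both columns.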

IMPLEMENTATION

The GBA server is a three-tier application developed in Java. In the Database Tier, we developed the GBA tool Java package to parse the molecular databases integrated in the GBA server and used the POSTGRES database system to store the data. In the Logic Tier, we developed Java programs for data processing and statistical tests. In the Web Tier, we used Apache Tomcat for the web server, Struts as the MVC (Model–View–Controller) framework and Java Mail to handle emailing of the results. We update our database whenever the NCBI UniGene database is updated. We also track external databases such as GO and OMIM regularly and update the links in our database when these databases are updated. Complete updates of our database are scheduled monthly. In addition, users can download and install the GBA server locally and update it according to their own needs. (See the GBA Administrator Guide on the website for help.)

EXAMPLE APPLICATION

To demonstrate the use of the GBA server, we used it to identify genes that may be associated with Parkinson's disease, one of the most common neurodegenerative disorders, starting from genes already known to be associated with the disease, such as PTEN induced putative kinase 1 (PINK1). It has been suggested that a mutant form of PINK1 damages neurons by stress-induced apoptosis and mitochondrial dysfunction (17). Using PINK1 (UniGene ID Hs.389171) as bait, we searched the cDNA libraries with >500 sequences for the top 50 genes that have similar expression patterns, using GBA Engine with a Bonferroni correction (Figure 1a). Part of the result is shown in Figure 1b. In particular, we found that hit #3 (UniGene ID Hs.370408), in turn, shares a large number of co-expressed genes (measured by GBA) with several genes known to be associated with Parkinson's disease, such as alpha-synuclein (SNCA) (data not shown) (18). We looked for more information on this gene using UniGene Annotation (Figure 2). The linked entry in the Gene (LocusLink) database shows that the gene is Catechol-O-methyltransferase (COMT), which encodes a protein that ‘catalyzes the transfer of a methyl group from S-adenosylmethionine to catecholamines, including the neurotransmitters dopamine, epinephrine, and norepinephrine’ (19). The linked entry in the OMIM database shows that ‘COMT is important in the metabolism of catechol drugs used in the treatment of hypertension, asthma, and Parkinson disease.’ Figure 2 shows that COMT is involved in KEGG pathway hsa00350 (Tyrosine Metabolism). Figure 3 shows the input and output using the KEGG Gateway of Annotation Linker to find all UniGene clusters involved in the pathway. This example demonstrates the power of using GBA Engine to find additional genes that are involved in a disease process and can serve as potential drug targets. It also demonstrates the value of using UniGene Annotation and Annotation Linker for further in-depth analysis.

Figure 1

Input and output of GBA Engine using PINK1 as an example. (a) Example input of GBA Engine. (b) Example output of GBA Engine showing genes having similar expression patterns to PINK1.

Figure 2

Input and output of UniGene Annotation using Hs.370408 as an example. (a) Example input of UniGene Annotation. (b) Example output of UniGene Annotation showing detailed information on the gene.

Figure 3

Input and output of Annotation Linker using pathway hsa00350 as an example. (a) Example input of the KEGG Gateway of Annotation Linker. (b) Example output of the KEGG Gateway of Annotation Linker showing all UniGene clusters involved in the pathway.

DISCUSSION

The GBA server is the first to make the GBA algorithm publicly available. It also integrates UniGene clusters with a variety of molecular databases for the first time and provides a public, web-based user interface. With the rapid accumulation of available EST data, and because data in new cDNA libraries from new experiments can be easily used together with existing ones, the effectiveness of the GBA server will continue to increase. Currently the GBA server supports the three species that have the most available EST data: human, mouse and rat. In the future, we will add more species for which sufficient EST data are available.

18 in total

1. Pharmaceutical target discovery using Guilt-by-Association: schizophrenia and Parkinson's disease genes.

Authors: M G Walker; W Volkmuth; T M Klingler
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1999

2. Identification and confirmation of a module of coexpressed genes.

Authors: H Garrett R Thompson; Joseph W Harris; Barbara J Wold; Stephen R Quake; James P Brody
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

3. Identifying tissue-enriched gene expression in mouse tissues using the NIH UniGene database.

Authors: Jo-Ann L Stanton; Andrew B Macgregor; David P L Green
Journal: Appl Bioinformatics Date: 2003

4. A public database for gene expression in human cancers.

Authors: A Lal; A E Lash; S F Altschul; V Velculescu; L Zhang; R E McLendon; M A Marra; C Prange; P J Morin; K Polyak; N Papadopoulos; B Vogelstein; K W Kinzler; R L Strausberg; G J Riggins
Journal: Cancer Res Date: 1999-11-01 Impact factor: 12.701

5. Cancer gene discovery using digital differential display.

Authors: D Scheurle; M P DeYoung; D M Binninger; H Page; M Jahanzeb; R Narayanan
Journal: Cancer Res Date: 2000-08-01 Impact factor: 12.701

6. Catechol-O-methyltransferase and Parkinson's disease.

Authors: Chun-Hwi Tai; Ruey-Meei Wu
Journal: Acta Med Okayama Date: 2002-02 Impact factor: 0.892

7. In silico screening for tumour-specific expressed sequences in human genome.

Authors: A V Baranova; A V Lobashev; D V Ivanov; L L Krukovskaya; N K Yankovsky; A P Kozlov
Journal: FEBS Lett Date: 2001-11-09 Impact factor: 4.124

8. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology.

Authors: Evelyn Camon; Michele Magrane; Daniel Barrell; Vivian Lee; Emily Dimmer; John Maslen; David Binns; Nicola Harte; Rodrigo Lopez; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

9. Database resources of the National Center for Biotechnology.

Authors: David L Wheeler; Deanna M Church; Scott Federhen; Alex E Lash; Thomas L Madden; Joan U Pontius; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Tatiana A Tatusova; Lukas Wagner
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

10. The EMBL Nucleotide Sequence Database.

Authors: Carola Kanz; Philippe Aldebert; Nicola Althorpe; Wendy Baker; Alastair Baldwin; Kirsty Bates; Paul Browne; Alexandra van den Broek; Matias Castro; Guy Cochrane; Karyn Duggan; Ruth Eberhardt; Nadeem Faruque; John Gamble; Federico Garcia Diez; Nicola Harte; Tamara Kulikova; Quan Lin; Vincent Lombard; Rodrigo Lopez; Renato Mancuso; Michelle McHale; Francesco Nardone; Ville Silventoinen; Siamak Sobhany; Peter Stoehr; Mary Ann Tuli; Katerina Tzouvara; Robert Vaughan; Dan Wu; Weimin Zhu; Rolf Apweiler
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
