T-profiler: scoring the activity of predefined groups of genes using gene expression data.

André Boorsma, Barrett C Foat, Daniel Vis, Frans Klis, Harmen J Bussemaker.

Abstract

One of the key challenges in the analysis of gene expression data is how to relate the expression level of individual genes to the underlying transcriptional programs and cellular state. Here we describe T-profiler, a tool that uses the t-test to score changes in the average activity of predefined groups of genes. The gene groups are defined based on Gene Ontology categorization, ChIP-chip experiments, upstream matches to a consensus transcription factor binding motif or location on the same chromosome. If desired, an iterative procedure can be used to select a single, optimal representative from sets of overlapping gene groups. T-profiler makes it possible to interpret microarray data in a way that is both intuitive and statistically rigorous, without the need to combine experiments or choose parameters. Currently, gene expression data from Saccharomyces cerevisiae and Candida albicans are supported. Users can upload their microarray data for analysis on the web at http://www.t-profiler.org.

Year: 2005 PMID: 15980543 PMCID: PMC1160244 DOI: 10.1093/nar/gki484

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

An important technique in the post-genomic era is the simultaneous measurement of the transcript levels of all genes from a genome by microarray experiments (1,2). In recent years, the amount of data from such experiments has rapidly increased (3,4). Furthermore, the combination of chromatin-immunoprecipitation and microarray technology (‘ChIP-chip’) has made it possible to globally measure the binding of transcription factors to gene promoters (5,6). There has also been an explosion in the number of computational methods for analyzing microarray data. Among the most popular are algorithms such as hierarchical clustering (7), K-means clustering (8) and self-organizing maps (9). A limitation of these clustering methods is the need to have gene expression profiles across multiple hybridizations. Alternative methods have been developed that can take a single genome-wide expression pattern as input, such as motif-based correlation or regression (10–12). To obtain easily interpretable information on changes in the cellular state in terms of functional annotation, methods such as Funspec (13), GO term finder (14), GOAL (15) and GeneXpress () score the significance of overlap between predefined gene groups [from Gene Ontology (GO) (16) or the MIPS database (17)] and the subset of induced or repressed genes. These methods are based on the cumulative hypergeometric distribution (also referred to as Fisher's exact test). A disadvantage of these methods is that they require individual genes to be significantly up- or down-regulated in order to contribute to the score. We previously developed a method that can score GO categories without the need to apply cut-offs to the expression level of individual genes (18). This algorithm, now named T-profiler, uses the t-test to score the difference between the mean expression level of predefined groups of genes and that of all other genes on the microarray (see Methods). A similar approach was independently pioneered by Pavlidis et al. (19). T-profiler is currently suitable for the analysis of Saccharomyces cerevisiae and Candida albicans gene expression profiles, and in the near future will be extended to other organisms.

METHODS

For a given gene group G, the t-value is given by the following formula:

$$t_G = \frac{\mu_G - \mu_{G'}}{s\,\sqrt{\dfrac{1}{N_G} + \dfrac{1}{N_{G'}}}}$$

where

$$s = \sqrt{\frac{(N_G - 1)\,s_G^2 + (N_{G'} - 1)\,s_{G'}^2}{N_G + N_{G'} - 2}}$$

Here μG is the mean expression log-ratio of the NG genes in gene group G, μG′ is the mean expression log-ratio of the remaining NG′ genes and s is the pooled standard deviation, as obtained from the estimated variances sG² and sG′² for groups G and G′. The associated two-tailed P-value can be calculated from t using the t-distribution with NG − 2 degrees of freedom and is corrected for multiple testing by multiplying it by the number of gene groups that are being tested in parallel (Bonferroni correction). The corrected value is referred to as the E-value; all groups with an E-value of ≤0.05 are considered to be significantly regulated. To reduce the influence of outliers, which may result in false positives or false negatives, we discard the highest and lowest expression value in each gene group. This method is similar to the jack-knife procedure (20).
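The scoring step described above can be illustrated with a minimal Python sketch. It is not the T-profiler implementation: the function names are hypothetical, the pooled-variance form of the t-statistic is assumed from the definitions in the text, and the two-tailed P-value is approximated with the standard normal distribution (reasonable only for large gene counts) so that the example needs nothing beyond the standard library.

```python
import math
from statistics import mean, variance

def trim_extremes(values):
    """Drop the single highest and lowest value of a gene group (outlier guard)."""
    if len(values) <= 2:
        return list(values)
    return sorted(values)[1:-1]

def group_t_value(group, rest):
    """Pooled-variance t-value of a (trimmed) gene group against all other genes."""
    g = trim_extremes(group)
    n_g, n_r = len(g), len(rest)
    pooled_var = ((n_g - 1) * variance(g) + (n_r - 1) * variance(rest)) / (n_g + n_r - 2)
    return (mean(g) - mean(rest)) / math.sqrt(pooled_var * (1 / n_g + 1 / n_r))

def e_value(t, n_groups):
    """Bonferroni-corrected two-tailed P-value.

    Assumption: the normal approximation replaces the t-distribution here.
    """
    p_two_tailed = math.erfc(abs(t) / math.sqrt(2))
    return p_two_tailed * n_groups

# Toy log-ratios: the group contains two outliers (5.0 and -3.0) that are discarded.
group = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, -3.0]
rest = [0.0, 0.1, -0.1, 0.2, -0.2]
t = group_t_value(group, rest)
print(t, e_value(t, n_groups=100))
```

Because every gene contributes to the group mean, no per-gene cut-off is needed; only the number of groups tested in parallel enters the correction.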

Gene groups sharing a common motif in their upstream region

Motif groups are defined as genes with a match to a particular consensus motif within 600 base pairs upstream of the open reading frame (ORF) (21), allowing no overlap with neighboring ORFs. The consensus motifs used in T-profiler are derived from three different sources. First, motifs were extracted from the SCPD database (). Next, motifs were found by comparing the genome sequences of highly related yeast species (22,23). Finally, motifs discovered from various microarray experiments using the REDUCE algorithm (11,24) were added. Most of these motifs are similar or identical to motifs described in the literature. In total, 153 motif groups are included in the T-profiler calculation. Far less information is available about regulatory sequences of C.albicans. It was recently reported that about one-third of S.cerevisiae regulatory elements are conserved in C.albicans (25). T-profiler therefore uses the list of S.cerevisiae motifs, supplemented with newly discovered C.albicans regulatory motifs, to score C.albicans expression data.
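As an illustration of how a motif group could be assembled, the sketch below converts an IUPAC consensus into a regular expression and scans the last 600 bases of each upstream sequence. Several points are assumptions rather than details from the text: both strands are searched, the upstream sequences are supplied already truncated at neighboring ORFs, and the ORF labels and function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def consensus_to_regex(consensus):
    """Compile an IUPAC consensus motif into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def motif_group(consensus, upstream, window=600):
    """Return ORFs whose upstream region matches the consensus on either strand.

    `upstream` maps ORF name -> upstream sequence written 5'->3' and ending at
    the ORF start (assumed to be pre-truncated at neighboring ORFs).
    """
    pattern = consensus_to_regex(consensus)
    members = set()
    for orf, seq in upstream.items():
        region = seq.upper()[-window:]  # the bases closest to the ORF start
        if pattern.search(region) or pattern.search(reverse_complement(region)):
            members.add(orf)
    return members

# Toy sequences; AGGGG is the stress response element recognized by Msn2/4.
upstream = {
    "ORF_A": "TTTTAGGGGTTTT",   # match on the given strand
    "ORF_B": "TTTTCCCCTTTTT",   # match on the reverse complement
    "ORF_C": "ATATATATAT",      # no match
}
print(sorted(motif_group("AGGGG", upstream)))
```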

Gene groups bound by a common transcription factor based on ChIP-chip data

The binding of transcription factors to their global DNA targets can be measured by ChIP-chip experiments. In S.cerevisiae this technique has been explored on a large scale by Lee et al. (5) and Harbison et al. (6). We used the transcription factor binding (TFB) data for 203 transcription factors from Harbison et al. (6) as input into T-profiler; the binding of 84 of these regulators was measured under various environmental conditions. A gene was considered to be part of a TFB group if the P-value reported by the authors was <0.001. In addition, TFB groups were required to have at least seven gene members. This resulted in 252 TFB groups that were used for T-profiler analysis.
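The two filters described above (P-value < 0.001 and at least seven members) can be sketched as follows. The record layout, with one tuple per factor/condition, ORF and P-value, is an assumption made for illustration and does not reflect the actual file format of the published binding data.

```python
from collections import defaultdict

def tfb_groups(binding_records, p_cutoff=0.001, min_members=7):
    """Build TFB gene groups from (experiment, orf, p_value) records.

    `experiment` is a hypothetical label combining factor and condition,
    for example "HSF1_heat".
    """
    groups = defaultdict(set)
    for experiment, orf, p_value in binding_records:
        if p_value < p_cutoff:
            groups[experiment].add(orf)
    # Keep only groups that are large enough to be scored.
    return {exp: genes for exp, genes in groups.items() if len(genes) >= min_members}

# Toy records: TF_A binds seven genes confidently, TF_B only six.
records = [("TF_A", f"gene{i}", 0.0001) for i in range(7)]
records.append(("TF_A", "gene99", 0.01))  # fails the P-value cut-off
records += [("TF_B", f"gene{i}", 0.0001) for i in range(6)]
print({exp: len(genes) for exp, genes in tfb_groups(records).items()})
```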

GO categories

The third type of gene group is based on membership of a specific GO category (16). In GO, each gene is classified according to biological process, molecular function and cellular component. The GO gene group contains the genes associated with a specific GO category as well as all of its child categories. Only GO groups with more than six members were used for calculation. This resulted in 1389 GO-derived gene groups that were used for T-profiler analysis. Significant scores of GO groups give direct information about which functions or cellular processes are expected to have changed as a result of the altered gene expression. It should be kept in mind, however, that, unlike in the case of motif and ChIP-chip based gene groups, the t-values for GO categories are not directly related to a molecular mechanism.
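To make the construction of GO gene groups concrete, the following sketch propagates gene annotations from child categories up to their parents and then applies the size filter. The input dictionaries and category names are hypothetical, and the ontology is assumed to be free of cycles.

```python
def go_groups(children, annotations, min_members=7):
    """Collect, for each GO category, its own genes plus those of all descendants.

    children:    category -> list of direct child categories
    annotations: category -> set of genes annotated directly to that category
    Only groups with more than six members (min_members=7) are returned.
    """
    cache = {}

    def genes_under(term):
        if term not in cache:
            genes = set(annotations.get(term, ()))
            for child in children.get(term, ()):
                genes |= genes_under(child)
            cache[term] = genes
        return cache[term]

    terms = set(children) | set(annotations)
    for kids in children.values():
        terms.update(kids)
    return {t: genes_under(t) for t in terms if len(genes_under(t)) >= min_members}

# Toy hierarchy: root -> {a, b}, a -> {c}
children = {"root": ["a", "b"], "a": ["c"]}
annotations = {
    "a": {"g1", "g2", "g3"},
    "b": {"g4", "g5", "g6", "g7"},
    "c": {"g8", "g9", "g10", "g11"},
}
print({term: len(genes) for term, genes in go_groups(children, annotations).items()})
```

In this toy hierarchy, category "a" passes the filter only because it inherits the four genes of its child "c"; this inheritance is also why nested GO groups overlap so strongly.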

Iterative removal of redundant gene groups

Several of the predefined gene groups scored by T-profiler show strong mutual overlap: the GO categories used by T-profiler are hierarchically organized; consensus motifs can match similar sequences; and ChIP-chip experiments can reveal that similar sets of genes are bound by different transcription factors and/or under different conditions. The t-values for overlapping gene groups are strongly correlated and therefore mutually redundant. Following the idea of forward selection of non-redundant motifs in REDUCE (11), we implemented an iterative procedure to select a non-redundant set of gene groups among those that have t-values significantly different from zero. At each step, we subtract the mean expression level of the genes in the gene group with the highest absolute t-value from all genes in that gene group. The t-values are then recalculated for all other gene groups, and the procedure is repeated until even the most significantly regulated gene group has a P-value > 0.05. In the case of nested GO categories at different levels in the hierarchy, this procedure will naturally select the most appropriate level for a given branch of annotation.
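The iterative procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, outlier trimming is omitted for brevity, the P-value uses a normal approximation to stay within the standard library, and the stopping rule is applied to the Bonferroni-corrected value, which is one possible reading of the criterion given in the text.

```python
import math
from statistics import mean, variance

def t_value(values, members):
    """Pooled-variance t-value of the genes in `members` against all other genes."""
    inside = [v for g, v in values.items() if g in members]
    outside = [v for g, v in values.items() if g not in members]
    n_in, n_out = len(inside), len(outside)
    pooled = ((n_in - 1) * variance(inside) + (n_out - 1) * variance(outside)) / (n_in + n_out - 2)
    return (mean(inside) - mean(outside)) / math.sqrt(pooled * (1 / n_in + 1 / n_out))

def select_nonredundant(values, groups, cutoff=0.05):
    """Forward selection of gene groups; returns [(group_name, t_value), ...]."""
    values = dict(values)          # work on a copy; the input is left untouched
    remaining = dict(groups)
    selected = []
    while remaining:
        scores = {name: t_value(values, members) for name, members in remaining.items()}
        best = max(scores, key=lambda name: abs(scores[name]))
        # Assumption: normal approximation, corrected for the number of groups tested.
        corrected = math.erfc(abs(scores[best]) / math.sqrt(2)) * len(groups)
        if corrected > cutoff:
            break
        selected.append((best, scores[best]))
        members = [g for g in remaining.pop(best) if g in values]
        shift = mean([values[g] for g in members])
        for g in members:          # remove the group's mean signal from its genes
            values[g] -= shift
    return selected

# Toy data: "sub" is nested in "big"; "noise" is an unregulated group.
values = {f"g{i}": (2.1 if i % 2 == 0 else 1.9) for i in range(10)}
values.update({f"g{i}": (0.1 if i % 2 == 0 else -0.1) for i in range(10, 40)})
groups = {
    "big": {f"g{i}" for i in range(10)},
    "sub": {f"g{i}" for i in range(5)},
    "noise": {f"g{i}" for i in range(10, 15)},
}
print(select_nonredundant(values, groups))
```

In the toy data, the nested group "sub" scores significantly on its own, but it is no longer significant once the mean of "big" has been subtracted, so only "big" is reported.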

Aneuploidy test

Hughes et al. (26) described the discovery of chromosomal aberrations in yeast deletion mutants based on gene expression profiles. These are often duplications or deletions of an entire chromosome. By applying T-profiler at the level of whole chromosomes, where gene groups are defined as the set of all genes on a specific chromosome, it is possible to detect such aneuploidy. A statistically significant chromosomal t-value does not necessarily point, however, to aneuploidy, as it may also be caused by normal differential regulation by a transcription factor whose targets are preferentially located on the same chromosome. In the aneuploid dataset from Hughes et al. (26) we observed an absolute t-value > 10 for almost all deleted or duplicated chromosomes; such extreme t-values are therefore a good indicator of aneuploidy.
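For S.cerevisiae, chromosome gene groups can be derived directly from the systematic ORF names, in which the second letter encodes the chromosome (A for chromosome 1 through P for chromosome 16). The sketch below uses that naming convention and flags chromosomes using the |t| > 10 rule of thumb mentioned above; the function names are hypothetical and the t-values in the usage example are invented for illustration.

```python
import re
from collections import defaultdict

# Systematic names: Y + chromosome letter (A-P) + arm (L/R) + number + strand (W/C),
# optionally followed by a suffix such as "-A".
ORF_NAME = re.compile(r"^Y([A-P])[LR]\d{3}[WC](-[A-Z])?$")

def chromosome_of(orf):
    """Return the chromosome number (1-16) encoded in a systematic ORF name, or None."""
    match = ORF_NAME.match(orf.upper())
    if match is None:
        return None            # e.g. mitochondrial ORFs such as Q0045
    return ord(match.group(1)) - ord("A") + 1

def chromosome_groups(orfs):
    """Group ORFs by chromosome; names that cannot be parsed are skipped."""
    groups = defaultdict(set)
    for orf in orfs:
        chrom = chromosome_of(orf)
        if chrom is not None:
            groups[chrom].add(orf)
    return dict(groups)

def flag_aneuploidy(chromosome_t_values, threshold=10.0):
    """Return chromosomes whose absolute t-value exceeds the threshold."""
    return sorted(c for c, t in chromosome_t_values.items() if abs(t) > threshold)

print(chromosome_groups(["YAL001C", "YNL153C", "YNR001C", "Q0045"]))
print(flag_aneuploidy({1: 0.5, 3: -12.0, 14: 23.0}))  # invented t-values
```

A flagged chromosome is a candidate for aneuploidy only; as noted above, a significant chromosomal t-value can also arise from ordinary regulation of co-located targets.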

AN EXAMPLE

Gene expression datasets can be uploaded as a tab-delimited text file with the systematic ORF name in the first column and the log-transformed expression data in the second column. The upload of an expression profile comparing cells 80 min after a heat shift from 30 to 37°C from the Environmental Stress Response data set of Gasch et al. (23) will serve as an example. After uploading, the user is presented with some basic information about the dataset, including the number of genes, the average and the standard deviation (Figure 1A). Importantly, no cut-offs are applied; all values are used for calculation.
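A file in the upload format described above can be read and summarized with a short script. The sketch below is illustrative only: the function names are hypothetical, and the choice to skip header lines and rows with missing or non-numeric values is an assumption, not documented T-profiler behavior.

```python
import csv
import io
from statistics import mean, stdev

def read_expression(handle):
    """Read a tab-delimited file: systematic ORF name, then log-transformed value."""
    values = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue
        try:
            values[row[0].strip()] = float(row[1])
        except ValueError:
            continue               # header line or missing measurement
    return values

def summarize(values):
    """Basic statistics comparable to those shown after upload."""
    data = list(values.values())
    return {"genes": len(data), "mean": mean(data), "sd": stdev(data)}

# Toy input; io.StringIO stands in for an opened file.
text = "ORF\tlog_ratio\nYAL001C\t1.0\nYAL002W\t-1.0\nYAL003W\t\nYAL005C\t3.0\n"
expression = read_expression(io.StringIO(text))
print(summarize(expression))
```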

Figure 1

Screenshots of the various T-profiler analysis results. (A) Statistics of the uploaded gene expression dataset for cells assayed 80 min after the temperature shift from 30 to 37°C (23). The type of analysis can be selected from the panels to the right. (B) Scoring consensus motifs. Only significantly scoring motifs are shown (E-value < 0.05). By selecting the motifs in the left column, information about the genes containing this motif and their expression levels can be obtained. (C) Scoring GO categories. Only a subset of the 50 significantly changed categories is shown. (D) Scoring ChIP-chip based gene groups. (E) Graph showing the t-value for each chromosome, obtained from the gene expression profile of the mutant pfd2Δ, in which chromosome 14 is duplicated. (F) The same result as in (C), but now with redundant gene groups removed by our iterative procedure.

Next, the user can follow links to results for four different types of predefined gene groups: genes whose promoter region matches a specific consensus motif (Figure 1B), genes that belong to a specific GO category (Figure 1C), genes whose promoter is significantly bound by a specific transcription factor according to a ChIP-chip experiment (Figure 1D) and genes that reside on a specific chromosome (Figure 1E). The statistical parameters that are output by T-profiler for any group scored are (i) a t-value measuring the up-regulation (t > 0) or down-regulation (t < 0) in units of the standard error of the difference and (ii) an E-value that is Bonferroni corrected for the parallel testing of the large number of categories, which represents the number of groups with the same t-value or higher that would be observed by chance. Typically, only a small subset of the gene groups considered will score as differentially expressed (Figure 1). Figure 1B shows consensus motifs associated with differential regulation. The heat shock response motif (HSF1) and the general stress response motif (MSN2/4) score positively, whereas the PAC and rRPE motifs, both over-represented in genes involved in rRNA biosynthesis (27), score negatively. The up-regulation of genes under the control of the HSF1 motif is specific to heat-shocked cells, whereas the down-regulation of genes involved in rRNA biosynthesis and the up-regulation of genes containing MSN2/4 motifs are typical of the environmental stress response (23). Figure 1D shows which transcription factors and corresponding ChIP-chip conditions are associated with differential regulation. The fact that genes bound by the transcription factor Hsf1p score positively whereas the genes bound by the ribosome-regulating transcription factors Rap1p, Sfp1p and Fhl1p score negatively is consistent with the motif-based results. Figure 1C shows the results of T-profiler analysis based on GO; in total, 50 categories have a significant t-value. Most of the positively scoring categories are involved in heat shock and stress response, whereas most of the negatively scoring categories consist mainly of ribosomal genes. Again, the results compare well with the results obtained by T-profiler using motif and ChIP-chip based gene groups. However, the large number of similar GO categories reported makes it harder to interpret the results. Figure 1F shows how this problem is resolved by the iterative removal of redundant categories. Finally, in Figure 1E, the high t-value of chromosome 14 points to a duplication of chromosome 14 in the deletion mutant pfd2Δ.

CONCLUSION

T-profiler analyzes genome-wide expression patterns one experiment at a time, without the need to tune any parameters. Our use of the t-test to score gene groups eliminates the need to impose a threshold on the expression level of individual genes. A group can be scored as significantly induced or repressed even if the expression of none of its individual member genes changes significantly. This feature greatly increases the sensitivity to small-amplitude coordinate changes in the expression of groups of genes. Representing a transcriptome by a relatively small set of statistically robust and easily interpretable t-values allows for seamless comparison between hybridizations, even across different platforms and laboratories. We plan to extend the functionality of T-profiler to multiple experiments in the near future.

27 in total

Review 1. Exploring expression data: identification and analysis of coexpressed genes.

Authors: L J Heyer; S Kruglyak; S Yooseph
Journal: Genome Res Date: 1999-11 Impact factor: 9.043

2. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae.

Authors: J D Hughes; P W Estep; S Tavazoie; G M Church
Journal: J Mol Biol Date: 2000-03-10 Impact factor: 5.469

3. Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation.

Authors: L J Jensen; S Knudsen
Journal: Bioinformatics Date: 2000-04 Impact factor: 6.937

4. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

5. Regulatory element detection using correlation with expression.

Authors: H J Bussemaker; H Li; E D Siggia
Journal: Nat Genet Date: 2001-02 Impact factor: 38.330

6. Widespread aneuploidy revealed by DNA microarray expression profiling.

Authors: T R Hughes; C J Roberts; H Dai; A R Jones; M R Meyer; D Slade; J Burchard; S Dow; T R Ward; M J Kidd; S H Friend; M J Marton
Journal: Nat Genet Date: 2000-07 Impact factor: 38.330

7. Exploring gene expression data with class scores.

Authors: Paul Pavlidis; Darrin P Lewis; William Stafford Noble
Journal: Pac Symp Biocomput Date: 2002

8. CLICK: a clustering algorithm with applications to gene expression analysis.

Authors: R Sharan; R Shamir
Journal: Proc Int Conf Intell Syst Mol Biol Date: 2000

9. Genomic expression programs in the response of yeast cells to environmental changes.

Authors: A P Gasch; P T Spellman; C M Kao; O Carmel-Harel; M B Eisen; G Storz; D Botstein; P O Brown
Journal: Mol Biol Cell Date: 2000-12 Impact factor: 4.138

10. NCBI GEO: mining millions of expression profiles--database and tools.

Authors: Tanya Barrett; Tugba O Suzek; Dennis B Troup; Stephen E Wilhite; Wing-Chi Ngau; Pierre Ledoux; Dmitry Rudnev; Alex E Lash; Wataru Fujibuchi; Ron Edgar
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
