
Next station in microarray data analysis: GEPAS.

David Montaner, Joaquín Tárraga, Jaime Huerta-Cepas, Jordi Burguet, Juan M Vaquerizas, Lucía Conde, Pablo Minguez, Javier Vera, Sach Mukherjee, Joan Valls, Miguel A G Pujana, Eva Alloza, Javier Herrero, Fátima Al-Shahrour, Joaquín Dopazo.

Abstract

The Gene Expression Profile Analysis Suite (GEPAS) has been running for more than four years. During this time it has evolved to keep pace with the new interests and trends in the still-changing world of microarray data analysis. GEPAS has been designed to provide an intuitive yet powerful web-based interface that offers diverse analysis options, from the early step of preprocessing (normalization of Affymetrix and two-colour microarray experiments and other preprocessing options) to the final step of the functional annotation of the experiment (using Gene Ontology, pathways, PubMed abstracts etc.), and includes different possibilities for clustering, gene selection, class prediction and array-comparative genomic hybridization management. GEPAS is extensively used by researchers in many countries and its records indicate an average usage rate of 400 experiments per day. The web-based pipeline for microarray gene expression data, GEPAS, is available at http://www.gepas.org.


Year: 2006 PMID: 16845056 PMCID: PMC1538867 DOI: 10.1093/nar/gkl197

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

It is quite common for the introduction of a new technology to be accompanied by claims and promises that often cannot be fulfilled. This hype is then followed by a wave of disappointment with the technology. Fortunately, as they reach a certain degree of maturity, DNA microarray technologies do not seem to have suffered this fate. During an initial period, DNA microarray publications dealt with issues such as reproducibility and sensitivity. Many classical microarray papers dating from the late nineties were mere proof-of-principle experiments (1,2), in which only cluster analysis was applied. Later, sensitivity became a main concern as a natural reaction against quite liberal interpretations of microarray experiments made by some researchers, such as the use of fold-change criteria to select differentially expressed genes. It soon became obvious that genome-scale experiments should be carefully analysed because many apparent associations happened merely by chance (3). In this context, different methods for the adjustment of P-values, which are considered standard today, started to be extensively used (4,5). More recently, the use of microarrays as predictors of clinical outcomes (6), despite not being free of criticism (7), fuelled the use of the methodology because of its practical implications. There are still some concerns about the cross-platform coherence of results, but it seems clear that intra-platform reproducibility is high (8) and, despite the fact that gene-by-gene results are not always the same, the biological themes emerging from the different platforms are increasingly consistent (9). This points to the importance of interpreting experiments in terms of their biological implications instead of merely comparing lists of genes (10,11). Keeping pace with the trends mentioned above, the Gene Expression Profile Analysis Suite (GEPAS) has been growing during the last 4 years. The first release was more oriented towards clustering and data preprocessing (12). Successive releases showed a package more oriented towards gene selection, class prediction and the functional annotation of experiments (13,14). The version presented here includes several modules, some of which are new while others are existing tools that have been completely rewritten to include new functionality. GEPAS is not a simple web server; it constitutes one of the largest resources for integrated microarray data analysis available over the web. It has been running for more than four years and, by the end of 2005, an average of 400 experiments were analysed per day, summed over all of its modules. GEPAS is used by researchers worldwide, as can be seen in the usage map, where all sessions are mapped to their geographic locations. It also offers on-line tutorials that can be used in courses. In the new version (3.0) we present new modules for the normalization of Affymetrix experiments, for differential gene expression and for the evaluation of cluster quality, as well as another module for array-comparative genomic hybridization (Array-CGH) data management. Another conceptual novelty is the connection of GEPAS to the PupaSuite tools (15–17), which offers the possibility of analysing polymorphisms in the light of the results of the gene expression analysis.

GENERAL OVERVIEW

GEPAS aims to tackle the most common problems in microarray data analysis in a simple but rigorous way. Thus, after an essential normalization step, there are different ‘workflows’, or sequences of steps, that can be followed depending on the aim of the experiment: class discovery, differential gene expression, class prediction or genomic copy number estimation, to cite just the most common objectives of microarray experiments. Class discovery, either in genes or in experiments, is achieved by using clustering methods. GEPAS includes some commonly used clustering methods such as hierarchical clustering (18), SOTA (19,20), SOM (21), K-means (22) and SOM-Tree (23). The evaluation of cluster quality, a scarcely addressed issue, has been implemented here in the Cluster Accuracy Analysis Tool (CAAT) module (see below). Differential gene expression implies finding genes whose expression differs significantly between two or more classes, or is related to a continuous experimental factor (e.g. the concentration of a metabolite) or to survival data. A new, more complete module for differential gene expression is presented in this new version of GEPAS (see below). The Tnasas module for class prediction implements different classifiers, such as diagonal linear discriminant analysis (DLDA) (24), nearest neighbour (NN) (25), support vector machines (SVM) (26), random forest (27) and shrunken centroids (PAM) (28), of known efficiency as class predictors using microarray data (24). Cross-validation error is calculated in a way that avoids the well-known selection bias problem (29,30). See the Tnasas help for a more detailed description of the methods and the error estimation strategy. Array-CGH (31) data can be analysed through the ISACGH module, which allows predicting copy number, relating these values to gene expression and performing functional annotation through the Babelomics (11) suite. Finally, functional annotation is carried out with the Babelomics suite, which can be used either as an independent suite or as an integrated part of GEPAS. Figure 1 illustrates, following the metaphor of a subway line, the interconnections of the different tools in the GEPAS environment.
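The selection bias problem mentioned in this overview arises when genes are chosen using all samples and only the classifier is cross-validated. The following is a minimal sketch of the unbiased alternative, in which gene selection is repeated inside every leave-one-out fold. It is not the Tnasas implementation: the function names, the mean-difference gene filter and the nearest-centroid classifier are assumptions chosen for brevity, and the data are made up.

```python
from statistics import mean

def top_genes(X, y, k):
    """Rank genes by absolute difference of class means (a crude filter) and keep k."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append((abs(mean(a) - mean(b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]

def nearest_centroid_predict(X_train, y_train, x, genes):
    """Assign x to the class whose centroid (on the selected genes) is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[g] for r in rows) for g in genes]
        dist = sum((x[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error(X, y, k):
    """Leave-one-out error with gene selection repeated inside every fold."""
    errors = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        genes = top_genes(X_train, y_train, k)  # selected without sample i
        if nearest_centroid_predict(X_train, y_train, X[i], genes) != y[i]:
            errors += 1
    return errors / len(X)

# Toy data (hypothetical): rows are arrays, columns are genes, labels are classes.
X = [
    [5.0, 1.0, 0.2], [5.2, 0.9, 0.8], [4.9, 1.1, 0.5],
    [1.0, 1.0, 0.4], [1.2, 0.8, 0.9], [0.9, 1.2, 0.1],
]
y = [0, 0, 0, 1, 1, 1]
error = loocv_error(X, y, k=1)
```

The key design point is that `top_genes` never sees the held-out sample; selecting genes once on the full dataset before the loop would give an optimistically biased error estimate.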

Figure 1

Map of GEPAS functionalities as a subway line. Data (Affymetrix, two-colour or raw) are introduced from the left side and pass through the preprocessor. Then different types of analyses can be performed: gene selection (T-rex) in different situations (two or more classes, correlation or survival; see text for details) and class prediction (Tnasas) are two types of supervised analysis. Array-CGH data can be analysed through the red line, ISACGH. Unsupervised analysis can also be performed using different methods. CAAT allows co-expressed genes to be mapped onto their chromosomal coordinates, allowing the study of RIDGES (54). All the tools end up in Babelomics (11), which allows two different types of analysis: comparison of two sets of genes, or analysis of blocks of functionally related genes.

NORMALIZATION AND PREPROCESSING

GEPAS now implements normalization facilities for both two-colour and Affymetrix arrays. The DNMAD (32) module normalizes two-colour arrays using print-tip loess (33) with a number of different options. DNMAD accepts Genepix (Axon Instruments) GPR files as input. The expresso module normalizes Affymetrix CEL files using standard Bioconductor (34) tools, in particular the affy package (35). Besides a friendly web interface, we provide the user with the speed and, above all, the physical memory available on our server. More information can be found in the corresponding tutorial web pages. In addition, the preprocessor (36) module performs further preprocessing of the data (log-transformations, standardizations, imputation of missing values and so on).
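The preprocessing steps listed above (log-transformation, imputation of missing values, standardization) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names are hypothetical, missing values are represented as `None`, and imputation uses the gene's own mean, which is a deliberately simple stand-in and not necessarily the method used by the GEPAS preprocessor.

```python
import math

def log2_transform(matrix):
    """Log2-transform positive intensities; missing or non-positive values become None."""
    return [[math.log2(v) if v is not None and v > 0 else None for v in row]
            for row in matrix]

def impute_row_mean(matrix):
    """Replace missing values with the mean of the observed values of the same gene (row)."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        out.append([fill if v is None else v for v in row])
    return out

def standardize_rows(matrix):
    """Centre each gene to mean 0 and scale it to unit sample standard deviation."""
    out = []
    for row in matrix:
        m = sum(row) / len(row)
        sd = math.sqrt(sum((v - m) ** 2 for v in row) / (len(row) - 1))
        out.append([(v - m) / sd if sd > 0 else 0.0 for v in row])
    return out

# Toy matrix (hypothetical): rows are genes, columns are arrays.
raw = [
    [4.0, 8.0, None, 16.0],
    [2.0, 2.0, 2.0, 2.0],
]
logged = log2_transform(raw)
imputed = impute_row_mean(logged)
scaled = standardize_rows(imputed)
```

Genes with zero variance are mapped to all zeros rather than divided by zero; whether such flat genes should instead be filtered out is a separate design decision.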

CLUSTERING AND CLUSTER QUALITY ESTIMATION

Although clustering is one of the most popular, albeit often improperly used (30), methodologies in the analysis of microarray data, there are very few alternatives for estimating the quality of the results found. We have included a module, CAAT, which provides many options for the visualization and intuitive manipulation of hierarchical and non-hierarchical clustering results. Many visualization modes, browsing options and cluster extraction possibilities are currently available. Moreover, CAAT provides some descriptive measures for each partition (average profiles, standard deviation profiles, inter- and intra-cluster distances) as well as a global estimate of cluster quality by the silhouette method (37), which performs well in noisy situations such as microarray analysis (38). CAAT can submit data to other tools such as the Babelomics (11) functional annotation suite or ISACGH (Figure 1). More detailed information can be found in the CAAT documentation.
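As an illustration of the silhouette method referred to above, the sketch below computes the silhouette width s(i) = (b - a) / max(a, b) for every profile, where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The function names, the Euclidean distance and the toy data are assumptions; the distance and options that CAAT actually uses are not specified here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_widths(points, labels):
    """Return the silhouette width of every point (needs at least two clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:  # singleton cluster: width defined as 0
            widths.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy profiles (hypothetical): two tight, well-separated groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
good = silhouette_widths(profiles, [0, 0, 1, 1])
bad = silhouette_widths(profiles, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

The average width over all profiles serves as the global quality score: values close to 1 indicate compact, well-separated clusters, while values near or below 0 suggest a poor partition.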

DIFFERENTIAL GENE EXPRESSION

This version of GEPAS includes new methods for differential gene expression analysis under different conditions. The old module pomelo has been replaced by the new module T-rex (Tools for RElevant gene seleXion), which is much faster and offers new tests for different situations. T-rex distinguishes among four conceptually different testing cases:

(i) Finding genes differentially expressed between two discrete classes (e.g. case/control and so on). A number of authors (39,40) have found that the classical t-statistic, which was widely used in early work on the analysis of differential expression, can be highly unreliable for microarray data. Problems arise mainly as a consequence of statistical issues relating to the SD term in the denominator of the t-statistic. For example, many non-differentially expressed genes may by chance have small observed SDs, which may cause these genes to be erroneously selected. GEPAS now implements the following tests:

(a) The t-test, which is still available.

(b) An empirical Bayes methodology that allows fitting hierarchical mixture models to identify differentially expressed genes (41). One of the advantages of this methodology is that it fits a global model taking into account all genes in the dataset.

(c) A novel test for the analysis of microarray data that combines inference for differential expression and variability (CLEAR-test) (J. Valls, M. Grau, X. Sole, P. Hernandez, D. Montaner, J. Dopazo, M. A. Peinado, G. Capella, M. A. G. Pujana and V. Moreno, manuscript submitted). Most tests evaluate differential expression by using estimated variability, but no inference is made in terms of the variability itself. CLEAR-test evaluates both whether genes show large fold changes and whether their variability is high.

(d) A data-adaptive approach to the analysis of differential expression, in which an effective test statistic is learned directly from microarray data. This approach has been shown to ameliorate many of the problems associated with both the t-statistic and simple moderated statistics like SAM (42), and to produce good results under a range of conditions (43).

(ii) Finding genes differentially expressed between more than two classes (e.g. different types of cancers and so on). Together with the classical ANOVA methodology we make available the same CLEAR-test mentioned above (41). While the mathematical treatment of this kind of data is similar to that of two classes, in our tools we treat the case of more than two classes separately because of its different conceptual implications.

(iii) Finding genes whose expression is correlated to a continuous variable (e.g. the level of a metabolite). Regression analysis of gene expression on any numerical independent variable has been implemented. C routines have been compiled for the particular architecture of our computers in order to achieve maximal speed. Estimates of Pearson's and Spearman's correlation coefficients, as well as P-values for testing the null hypothesis of no correlation, can also be obtained with T-rex.

(iv) Finding genes whose expression is related to survival times. GEPAS uses C routines to estimate a Cox proportional hazards regression model (44). Right-censored data are allowed, as are replicated survival times. Censoring variables should be provided by the researcher together with the survival times.

When appropriate, P-values adjusted for multiple testing are provided. Three methodologies are implemented. One of them controls the FWER (family-wise error rate) (45) while the others control the FDR (false discovery rate) (46). Our implementations make use of the p.adjust function in the stats R package and the qvalue package (47) from Bioconductor.
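The multiple-testing adjustment mentioned in this section can be sketched as follows. The text does not state which specific procedures T-rex exposes, so the two shown here are assumptions chosen as representatives of each family: the Holm step-down procedure for FWER control and the Benjamini-Hochberg step-up procedure for FDR control, both of which are among the methods offered by R's p.adjust. The function names and the P-values are hypothetical.

```python
def holm(pvalues):
    """Holm step-down adjustment (controls the family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending P-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for k, i in enumerate(order):
        rank = m - k  # rank of this P-value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene P-values from any of the tests above.
pvals = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = holm(pvals)
fdr_adjusted = benjamini_hochberg(pvals)
```

Both functions return the adjusted values in the original gene order, so they can be compared directly against a chosen significance threshold. FDR-adjusted values are never larger than the corresponding FWER-adjusted ones, which reflects the less conservative error criterion.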

FUNCTIONAL ANNOTATION

Functional annotation of the experiments gives the researcher clues for the interpretation of the experiment. There are a number of tools that make use of gene functional annotations to try to understand the global changes in gene expression in microarray experiments (48), but probably one of the most complete packages in this respect is the Babelomics suite (11,49). This suite of programs for functional annotation of genome-scale experiments has undergone a deep modification described in detail elsewhere (49). In brief, Babelomics can now compare two groups of genes and test simultaneously for the significant over-abundance of diverse biological themes such as GO terms, KEGG pathways, Interpro motifs, Swissprot keywords, Transfac® motifs, CisRed motifs, relative abundance in tissues and bioentities extracted from PubMed, with the proper multiple testing adjustment. This is carried out by the FatiGO+ module, the evolution of the FatiGO program (50). Additionally, there are two modules designed to search for functionally related blocks of genes that are co-ordinately over- or under-expressed, using either the FatiScan (51) or the GSEA (52) algorithm. Despite its general scope (Babelomics is not restricted to microarrays but is applicable to any type of large-scale experiment) and the possibility of using it alone as an independent resource, the Babelomics suite has been fully integrated into GEPAS. The modules for gene selection (T-rex) or class prediction (Tnasas) can submit the genes selected as relevant to the FatiGO+ module for testing against the rest of the genes. Likewise, the modules for clustering (hierarchical, K-means, SOM, SOTA), through their cluster viewers or through CAAT, can submit the genes within the selected cluster to be tested against the rest of the genes. A similar operation can be performed from within ISACGH with the genes contained in the selected chromosomal region. Moreover, arrangements of genes can be sent from T-rex to FatiScan to test for blocks of functionally related genes that are co-ordinately over- or under-expressed. Sets of arrays can also be submitted to GSEA for the same purpose.
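The comparison of a selected gene list against the rest of the genes for over-abundance of an annotation can be sketched with a one-sided Fisher's exact test on a 2x2 table. That this is the exact statistic used by FatiGO+ is an assumption here, and the function name and counts are hypothetical. In a real analysis, the resulting P-values (one per term) would then go through a multiple-testing adjustment, as the text notes.

```python
from math import comb

def enrichment_pvalue(hits_in_list, list_size, hits_in_rest, rest_size):
    """One-sided Fisher's exact test.

    Probability of observing at least `hits_in_list` annotated genes in the
    selected list under the hypergeometric null, i.e. when the annotation is
    independent of list membership.
    """
    total = list_size + rest_size
    annotated = hits_in_list + hits_in_rest
    upper = min(annotated, list_size)
    tail = sum(
        comb(annotated, k) * comb(total - annotated, list_size - k)
        for k in range(hits_in_list, upper + 1)
    )
    return tail / comb(total, list_size)

# Hypothetical counts for one GO term: 8 of 10 selected genes carry it,
# against 10 of the 90 remaining genes.
p_enriched = enrichment_pvalue(8, 10, 10, 90)
# Same term, but only 1 of 10 selected genes carries it.
p_not_enriched = enrichment_pvalue(1, 10, 17, 90)
```

Exact integer arithmetic with `math.comb` avoids the rounding problems of factorial-based formulas, at the cost of speed on very large gene sets.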

ARRAY-CGH

Genetic aberrations, which are the molecular basis of many diseases, have classically been studied through CGH. The introduction of microarray-based CGH methods (array-CGH) has revolutionized this methodology in terms of resolution and throughput (31,53) but, at the same time, has generated a need for new algorithms and software for dealing with this type of data. We have included in GEPAS a new module, ISACGH, which completely replaces the old viewer InSilicoCGH (14). ISACGH includes two new and efficient methods for accurate estimation of genomic copy number from array-CGH hybridization data, integrated into a web-based system that allows, for the first time, the combined study of gene expression and genomic copy number. Several visualization options offer a convenient representation of the results. Moreover, the link to the Babelomics (11,49) tools allows, for the first time in a tool of this type, the production of functional annotations (using different relevant biological information such as Gene Ontology, pathways, etc.) for the detected chromosomal regions of interest (amplified or deleted). We use DAS technology (Distributed Annotation System), which allows remote mapping of information (our predictions) from a server (our server) to a client (Ensembl), to represent the ISACGH predictions and data on the Ensembl chromosomal coordinates. ISACGH generically maps data onto their chromosomal coordinates, so any other data, not only genomic hybridizations, can be mapped. Thus CAAT can send groups of co-expressed genes to ISACGH, which might be useful for defining regions of increased gene expression, also known as RIDGES (54).
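To make the idea of copy number estimation from array-CGH data concrete, the sketch below smooths log2 ratios ordered by chromosomal position with a running median and labels each probe as gained, lost or normal. This is a generic textbook-style illustration and not one of the two ISACGH methods, which are not described in this section. The function names, the window size, the threshold of 0.3 and the data are all assumptions.

```python
from statistics import median

def running_median(values, window):
    """Median of each value's neighbourhood (odd window); edges use a truncated window."""
    half = window // 2
    return [median(values[max(0, i - half): i + half + 1]) for i in range(len(values))]

def call_copy_number(log2_ratios, window=3, threshold=0.3):
    """Label each probe, ordered by chromosomal position, as 'gain', 'loss' or 'normal'."""
    calls = []
    for value in running_median(log2_ratios, window):
        if value > threshold:
            calls.append("gain")
        elif value < -threshold:
            calls.append("loss")
        else:
            calls.append("normal")
    return calls

# Hypothetical log2(test/reference) ratios along one chromosome: an isolated
# noisy probe (0.9), an amplified region and a deleted region.
ratios = [0.02, -0.05, 0.9, 0.01, 0.03, 0.6, 0.7, 0.65, 0.05, -0.6, -0.7, -0.55, 0.0]
calls = call_copy_number(ratios)
```

The median smoothing removes the isolated outlier while preserving the contiguous gained and lost segments, which is the basic reason position-aware methods are preferred over thresholding each probe independently.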

Polymorphisms affecting gene expression

Although the study of regulatory polymorphisms is not new, there has been a recent revival of interest in them, mainly because of the availability of high-throughput data and methodologies that allow their characterisation (55). The corresponding GEPAS modules (CAAT, Tnasas and T-rex) have a unique feature in this regard: the possibility of connecting the genes found to be regulated in a microarray experiment to possible regulatory SNPs in those genes. In particular, clustering and gene selection methods can be connected to the PupaSuite (15–17).

DISCUSSION

GEPAS is a long-term project that aims to provide the scientific community with an advanced set of tools for microarray data analysis without sacrificing easy and intuitive use. It has been running uninterruptedly for more than four years and has grown to include more tools as new algorithms were introduced in the microarray data analysis arena (12–14). The GEPAS team has aimed to deliver a coherent set of state-of-the-art and widely established algorithms, rather than a simple collection of as many tools as possible. In fact, every new tool included is a response to a new or emerging requirement from our users. As the Functional Genomics node of the Spanish Institute of Bioinformatics (INB) and as part of the Spanish Network of Cancer Centers (RTICCC), we are in direct contact with researchers, from whom we get much of the feedback necessary to build a useful tool. GEPAS, integrated with the Babelomics suite (11,49), provides the tools for performing the most common analyses of microarray data. Moreover, it has been conceived as a workflow that helps the user to carry out a series of consecutive analysis steps with simple mouse clicks. GEPAS has been designed to take full advantage of the properties of the web: connectivity, cross-platform functionality and remote usage. Its modular architecture allows easy implementation of new tools and facilitates the connectivity of GEPAS from and to other web-based tools. Users of GEPAS range from the experimentalist without much experience in bioinformatics or deep statistical skills, interested only in data analysis, to the bioinformatician who invokes some of the tools remotely for different purposes. GEPAS runs on a high-end cluster (with 20 dedicated AMD Opteron CPUs at 2.4 GHz) with a large amount of RAM (6 GB). This makes it possible to use tools that are usually beyond the capabilities of the hardware available to many end users (e.g. normalization tools are highly RAM-consuming). In addition, there is a teaching programme related to GEPAS, with on-line tutorials that can be freely used. Although other alternatives are available for microarray data analysis, there is no other similar resource on the web with the number of possibilities offered by GEPAS.

40 in total

1. Gene expression data preprocessing.

Authors: J Herrero; R Díaz-Uriarte; J Dopazo
Journal: Bioinformatics Date: 2003-03-22 Impact factor: 6.937

2. On parametric empirical Bayes methods for comparing multiple groups using replicated gene expression profiles.

Authors: C M Kendziorski; M A Newton; H Lan; M N Gould
Journal: Stat Med Date: 2003-12-30 Impact factor: 2.373

3. GEPAS: A web-based resource for microarray gene expression data analysis.

Authors: Javier Herrero; Fátima Al-Shahrour; Ramón Díaz-Uriarte; Alvaro Mateos; Juan M Vaquerizas; Javier Santoyo; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. Statistical significance for genomewide studies.

Authors: John D Storey; Robert Tibshirani
Journal: Proc Natl Acad Sci U S A Date: 2003-07-25 Impact factor: 11.205

Review 5. Integrating 'omic' information: a bridge between genomics and systems biology.

Authors: Hui Ge; Albertha J M Walhout; Marc Vidal
Journal: Trends Genet Date: 2003-10 Impact factor: 11.639

Review 6. Comparison and meta-analysis of microarray data: from the bench to the computer desk.

Authors: Yves Moreau; Stein Aerts; Bart De Moor; Bart De Strooper; Michal Dabrowski
Journal: Trends Genet Date: 2003-10 Impact factor: 11.639

7. Data-adaptive test statistics for microarray data.

Authors: Sach Mukherjee; Stephen J Roberts; Mark J van der Laan
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

8. Ontological analysis of gene expression data: current tools, limitations, and open problems.

Authors: Purvesh Khatri; Sorin Drăghici
Journal: Bioinformatics Date: 2005-06-30 Impact factor: 6.937

Review 9. Genomic microarrays in human genetic disease and cancer.

Authors: Donna G Albertson; Daniel Pinkel
Journal: Hum Mol Genet Date: 2003-08-05 Impact factor: 6.150

10. GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data.

Authors: Juan M Vaquerizas; Lucía Conde; Patricio Yankilevich; Amaya Cabezón; Pablo Minguez; Ramón Díaz-Uriarte; Fátima Al-Shahrour; Javier Herrero; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
