GEPAS, a web-based tool for microarray data analysis and interpretation.

Joaquín Tárraga, Ignacio Medina, José Carbonell, Jaime Huerta-Cepas, Pablo Minguez, Eva Alloza, Fátima Al-Shahrour, Susana Vegas-Azcárate, Stefan Goetz, Pablo Escobar, Francisco Garcia-Garcia, Ana Conesa, David Montaner, Joaquín Dopazo.

Abstract

Gene Expression Profile Analysis Suite (GEPAS) is one of the most complete and extensively used web-based packages for microarray data analysis. During its more than 5 years of activity it has been continuously updated to keep pace with the state of the art in the changing microarray data analysis arena. GEPAS offers diverse analysis options that include well-established as well as novel algorithms for normalization, gene selection, class prediction, clustering and functional profiling of the experiment. New options for time-course (or dose-response) experiments, microarray-based class prediction, new clustering methods and new tests for differential expression have been included. The new pipeliner module allows the execution of sequential analysis steps to be automated by means of a simple but powerful graphic interface. An extensive re-engineering of GEPAS has been carried out, which includes the use of web services and Web 2.0 technology features, a new user interface with persistent sessions and a new extended database of gene identifiers. GEPAS is currently the most cited web tool in its field. It is extensively used by researchers in many countries, and its records indicate an average usage rate of 500 experiments per day. GEPAS is available at http://www.gepas.org.

Year: 2008 PMID: 18508806 PMCID: PMC2447723 DOI: 10.1093/nar/gkn303

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Since their introduction in the mid 1990s (1), microarrays have revolutionized the way in which the research community addresses biological problems. Their success relies on their application to classifying types of tumours (2), predicting disease outcome (3) or even predicting the response to treatments (4). These practical applications of microarrays, although not free of criticism (5), have definitely fuelled the use of the methodology. In this scenario, the real bottleneck in the use of microarray technologies is the data analysis step (6). The web-based package Gene Expression Profile Analysis Suite (GEPAS) has grown over the last 5 years (7–10), trying to keep pace with the state of the art in algorithms for high-throughput gene expression data analysis and responding to the demands of the microarray community. Although originally designed to analyse microarray data, the most important modules of GEPAS are not tied to the technology or to the microarray platforms used to obtain the gene expression data. Rather, GEPAS is oriented towards analysing high-throughput gene expression data and testing different types of genome-scale hypotheses. GEPAS is not a web server for a single tool; it constitutes one of the largest resources for integrated microarray data analysis available over the web. GEPAS is used by researchers worldwide, as can be seen in the usage map, where each session is mapped to its geographic location (http://bioinfo.cipf.es/access_map/map.html). By the end of 2007, an average of 500 experiments per day were being analysed in GEPAS. The recent release 4.0 presented here includes new modules, new tests in existing modules, technical improvements (GEPAS is now based on web services technology and includes Web 2.0 features) and a more powerful and intuitive interface, which includes graphical tools to define workflows and persistent private sessions.

GENERAL OVERVIEW

GEPAS has been designed for the analysis of high-throughput gene expression data. Today this means microarray data analysis, but the situation might change in the future and the data could come from different platforms or technologies. Although some of its modules are platform dependent, the core of GEPAS aims to analyse gene expression data and test hypotheses in a simple but rigorous way. Many different biological questions can be addressed through gene-expression experiments. Nevertheless, there are usually three types of objectives in this context: ‘class comparison’, ‘class prediction’ and ‘class discovery’ (6). The first two objectives fall into the category of supervised methods and usually involve the application of tests to define differentially expressed genes, or the use of different procedures to predict class membership on the basis of the values observed for a number of ‘key’ genes. Clustering methods belong to the last category, also known as unsupervised analysis, because no previous information about the class structure of the data set is used in the study. Accordingly, GEPAS is composed of the following modules:

Normalization and pre-processing

GEPAS implements normalization facilities for both two-colour and Affymetrix arrays. Normalization in two-colour arrays is performed using print-tip loess (11) with a number of different options. Affymetrix CEL files are normalized using standard Bioconductor (12) tools, in particular the package affy (13). Besides a friendly web interface, GEPAS gives the user access to the speed and, above all, the physical memory available on our server. In addition, the pre-processor (14) module performs some pre-processing of the data (log-transformations, standardizations, imputation of missing values, etc.).
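The pre-processing operations listed above can be illustrated with a minimal Python sketch on a single gene profile. The function names are hypothetical, and the choice of row-mean imputation is an assumption made for brevity; the text does not specify which imputation or standardization variants the pre-processor module actually uses.

```python
import math
from statistics import mean, pstdev

def log2_transform(row):
    # Log-transform positive values; keep missing (None) or non-positive values as missing.
    return [math.log2(v) if v is not None and v > 0 else None for v in row]

def impute_row_mean(row):
    # Assumption: replace missing values with the mean of the observed values of the gene.
    observed = [v for v in row if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in row]

def standardize(row):
    # Centre the profile to mean 0 and scale it to unit (population) standard deviation.
    mu = mean(row)
    sd = pstdev(row)
    if sd == 0:
        return [0.0 for _ in row]
    return [(v - mu) / sd for v in row]

raw = [8.0, None, 2.0, 4.0]        # one gene across four arrays, one value missing
logged = log2_transform(raw)       # [3.0, None, 1.0, 2.0]
filled = impute_row_mean(logged)   # [3.0, 2.0, 1.0, 2.0]
scaled = standardize(filled)
```

The order of the steps matters in this sketch: imputing after the log transformation keeps the filled value on the same scale as the observed ones.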

Class discovery

Clustering techniques are used for class discovery either in genes or in experiments. GEPAS includes the best-performing clustering methods according to different independent benchmarks (15,16). Among the many methods available, those most extensively used for gene expression data clustering are hierarchical clustering (17), SOM (18), SOTA (19) and K-means (20). It is worth mentioning that the version of SOM implemented here can automatically find the optimal number of clusters (21). The evaluation of cluster quality, a rarely addressed issue, has been implemented here using the silhouette method (22), which performs optimally in noisy situations such as microarray data (23), along with some descriptive measures for each cluster partition (average profiles, standard deviation profiles, inter- and intra-cluster distances).
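As an illustration of the silhouette method mentioned above, the following Python sketch computes the silhouette width s(i) = (b - a) / max(a, b) for each profile, where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. The Euclidean distance and the function names are assumptions for this example and are not taken from the GEPAS implementation; at least two clusters are required.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_widths(points, labels):
    """Return the silhouette width of every point (requires two or more clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # singleton cluster: width is conventionally 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_widths(profiles, [0, 0, 1, 1])   # compact, well separated clusters
bad = silhouette_widths(profiles, [0, 1, 0, 1])    # clusters that mix distant profiles
average_good = sum(good) / len(good)
average_bad = sum(bad) / len(bad)
```

Widths close to 1 indicate well-separated clusters, whereas negative widths suggest that profiles have been assigned to the wrong cluster, so the average width can be used to compare alternative partitions.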

Differential gene expression

GEPAS implements tests for finding genes with significant differences in expression between two or more classes, or whose expression is related to a continuous experimental factor (e.g. the concentration of a metabolite) or to survival data. For two-class comparisons, GEPAS implements the popular t-test, the empirical Bayes test (24), the CLEAR-test that combines differential expression and variability (25), the data-adaptive test (26) and the SAM test (27). For comparisons involving more than two classes GEPAS uses the classical ANOVA. In order to find genes whose expression is significantly correlated to a continuous variable (e.g. the level of a metabolite), regression analysis and estimates of Pearson's and Spearman's correlation coefficients can be obtained. Finally, for finding genes whose expression is related to survival times, GEPAS estimates a Cox proportional hazards regression model (28). Right-censored data are allowed, as are replicated survival times; the researcher should provide the censoring variable together with the survival times. When appropriate, P values adjusted for multiple testing are provided. Three methodologies are implemented: one of them controls the FWER (family-wise error rate) (29), while the others control the FDR (false discovery rate) (30).
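The difference between FWER and FDR control can be sketched in Python with two widely used procedures: Holm's step-down method (FWER) and the Benjamini-Hochberg step-up method (FDR). These two procedures are chosen here as familiar examples; the text above does not state which specific procedures GEPAS implements, and the function names are hypothetical.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted P values (controls the family-wise error rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending P values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted

def benjamini_hochberg_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # descending
    adjusted = [0.0] * m
    running_min = 1.0
    for step, i in enumerate(order):
        rank = m - step  # 1-based rank of this P value in ascending order
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]   # one raw P value per gene
fwer = holm_adjust(pvals)
fdr = benjamini_hochberg_adjust(pvals)
```

For the same raw P values, the FDR-adjusted values are never larger than the Holm-adjusted ones, which reflects the less stringent error criterion and explains why FDR control is often preferred when thousands of genes are tested.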

Predictors

A new module for class prediction (31) has been implemented. The module includes different classifiers, such as diagonal linear discriminant analysis (DLDA) (32), k-nearest neighbour (KNN) (33), support vector machines (SVM) (34), SOM (18) and shrunken centroids (PAM) (35), all of which are known to perform well as class predictors on microarray data (32). Cross-validation error is calculated in such a way as to avoid the well-known selection bias problem (36); see ref. (31) for details. Once the model has been trained, it can be used to predict the class of new samples. This implementation is unique among similar programmes.
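Selection bias arises when the genes are chosen using all samples and the classifier is only afterwards cross-validated. The following Python sketch shows the unbiased alternative: gene selection is repeated inside every leave-one-out fold, using the training samples only. The gene score (absolute difference between class means), the nearest-centroid classifier and all names are simplifying assumptions for a two-class problem and do not reproduce the classifiers listed above.

```python
from statistics import mean

def select_genes(samples, labels, n_genes):
    """Rank genes by absolute difference between the two class means."""
    classes = sorted(set(labels))
    scores = []
    for g in range(len(samples[0])):
        m0 = mean(s[g] for s, l in zip(samples, labels) if l == classes[0])
        m1 = mean(s[g] for s, l in zip(samples, labels) if l == classes[1])
        scores.append((abs(m0 - m1), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:n_genes]]

def nearest_centroid(train, train_labels, genes, sample):
    """Assign the sample to the class with the closest centroid on the chosen genes."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_labels)):
        members = [s for s, l in zip(train, train_labels) if l == label]
        centroid = [mean(s[g] for s in members) for g in genes]
        dist = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_error(samples, labels, n_genes):
    """Leave-one-out error with gene selection repeated inside each fold."""
    errors = 0
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        genes = select_genes(train, train_labels, n_genes)  # left-out sample never seen
        if nearest_centroid(train, train_labels, genes, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

# Toy data: rows are samples, columns are genes; only gene 0 separates the classes.
samples = [
    [1.0, 5.0, 0.2], [1.2, 5.1, 0.9], [0.9, 4.9, 0.4],
    [3.0, 5.0, 0.5], [3.2, 5.2, 0.1], [2.9, 4.8, 0.8],
]
labels = ["A", "A", "A", "B", "B", "B"]
error_rate = loocv_error(samples, labels, n_genes=1)
```

Because the left-out sample plays no part in choosing the genes, the resulting error rate is an honest estimate of how the complete procedure (selection plus classification) would perform on new samples.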

Time-course and dose–response gene-expression experiments

A new module for the analysis of multi-series time-course and dose–response microarray experiments has been added. In this type of experiment, the researcher aims to study gene expression changes across time or across dosages and to evaluate trend differences between the various experimental groups (37). This module implements and extends the maSigPro statistical approach for the study of gene expression changes along time and the specific trend differences between various experimental groups (38). The method is a two-step regression approach in which individual series are identified by dummy variables. The procedure first fits a global regression model that considers all experimental series and a maximum complexity in the time/dosage-dependent response. This first step identifies differentially expressed genes at a given false positive control rate. In the second step, a variable selection method is applied to find the best model for each gene and to analyse particular significant profile differences between series. Finally, significant genes are clustered and displayed showing these trend differences.
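The role of the dummy variables can be made concrete with a small Python sketch that builds one row of a polynomial regression design for a given time point and series. This is only an illustration of the encoding under stated assumptions (a quadratic response and hypothetical function and series names); the model fitting and the variable selection step are omitted, and the code is not taken from maSigPro.

```python
def design_row(time, series, series_levels, degree=2):
    """Build one design-matrix row for a multi-series polynomial regression.

    Columns: intercept, one dummy per non-reference series, and then, for each
    power p = 1..degree, time**p followed by time**p multiplied by each dummy.
    The first entry of series_levels is taken as the reference series.
    """
    others = series_levels[1:]
    dummies = [1.0 if series == level else 0.0 for level in others]
    row = [1.0] + dummies
    for p in range(1, degree + 1):
        t = float(time) ** p
        row.append(t)
        row.extend(t * d for d in dummies)
    return row

levels = ["control", "treated"]   # hypothetical series names
design = [design_row(t, s, levels) for s in levels for t in (0, 1, 2)]
```

In a design of this kind, the coefficients of the plain time terms describe the profile of the reference series, while the coefficients of the terms multiplied by a dummy describe how another series departs from that profile, which is what allows trend differences between groups to be tested.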

Functional profiling

There are many tools available that use gene functional annotations to interpret the global changes in gene expression observed in microarray experiments (39). One of the most complete packages for functional profiling analysis is probably the Babelomics suite (40,41). This suite of programs for functional annotation of genome-scale experiments has undergone a deep modification, described in detail elsewhere (Al-Shahrour, submitted to this issue). Babelomics performs functional enrichment analysis, that is, it compares two lists of genes and tests simultaneously for significant over-abundance of diverse biologically relevant terms. These terms define functional modules such as GO, KEGG pathways or Interpro motifs, regulatory modules such as Transfac® motifs, CisRed motifs or miRNA binding motifs, and other types of modules such as those defined by relative abundance in tissues or by bioentities extracted from PubMed. All the tests are further adjusted for multiple testing effects (42,43). Additionally, gene set enrichment analysis can be performed with different algorithms (44,45) and several sources of information (46). The Babelomics suite is fully integrated into GEPAS. Gene expression analyses resulting in lists of genes to be compared (different clusters, differentially expressed genes, etc.) can be submitted to Babelomics for functional enrichment analysis. Moreover, arrangements of genes according to, for example, differential expression or other criteria can be sent to Babelomics to be studied by gene set enrichment analysis. This allows the discovery of pathways or functional modules of genes that are coordinately activated or deactivated in the experiment studied.
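A functional enrichment test of the kind described above can be sketched in Python as a one-sided hypergeometric test, which asks how likely it is to find at least k annotated genes in a selected list by chance. The use of this particular test, the function name and the numbers are assumptions for illustration; the text does not specify which test Babelomics applies.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the annotation.

    k: annotated genes found in the selected list
    n: size of the selected list
    K: annotated genes in the whole reference set
    N: size of the whole reference set
    """
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical example: 5 of 20 genes carry a term and 3 of the 4 selected genes have it.
p_value = hypergeom_upper_tail(k=3, n=4, K=5, N=20)
```

In practice one such P value is obtained for every term tested, so the values must then be adjusted for multiple testing before any term is declared significantly over-represented.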

Entry points and data formats

There are two entry points to GEPAS: platform dependent and platform independent. GEPAS accepts and normalizes different types of microarray data, which include Affymetrix CEL files and 13 different two-channel array formats including Agilent, Genepix and others. Once the files are normalized, any type of analysis can be applied. On the other hand, there is a simple format by means of which data from other platforms, other technologies (e.g. SAGE) and even of a different nature (e.g. proteomics, ChIP-on-chip data) can be input into any of the GEPAS modules. A very simple text file can be used for this purpose, containing the numeric gene expression values as a tab-delimited matrix in which rows correspond to genes and columns to experiments. Information on the experiments can be stored in the first rows, each starting with a # symbol. The first column contains the gene identifiers.
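A reader for the platform-independent format described above could look like the following Python sketch. It assumes only what the text states (tab-delimited values, annotation rows starting with #, gene identifiers in the first column); the function name, the `#NAMES` label in the example and the treatment of empty cells as missing values are assumptions.

```python
import io

def parse_expression_matrix(handle):
    """Parse a tab-delimited expression matrix.

    Lines starting with '#' are kept as experiment annotations. Every other
    non-empty line is 'gene_id<TAB>value<TAB>value...'; empty or non-numeric
    cells are stored as None (missing).
    """
    annotations, genes, matrix = [], [], []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            annotations.append(line[1:].strip().split("\t"))
            continue
        fields = line.split("\t")
        genes.append(fields[0])
        row = []
        for cell in fields[1:]:
            try:
                row.append(float(cell))
            except ValueError:
                row.append(None)
        matrix.append(row)
    return annotations, genes, matrix

# Hypothetical file content with one annotation row and one missing value.
example = io.StringIO(
    "#NAMES\tcontrol_1\tcontrol_2\ttreated_1\n"
    "geneA\t1.5\t1.7\t3.2\n"
    "geneB\t0.2\t\t0.1\n"
)
annotations, genes, matrix = parse_expression_matrix(example)
```

Any object that yields lines works as `handle`, so the same function can read an opened file or, as in the example, an in-memory string.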

WHAT IS NEW IN VERSION 4.0?

The novelties added to this version have been described in more detail above, in the general overview of the programme. In summary, we have implemented a number of new tests that were absent from previous versions, as well as entirely new modules. Many more options for normalization have been added (support for 12 more formats). New tests for differential expression, such as an improved version of the CLEAR test (25) and the popular SAM test (27), have been implemented. The module for cluster visualization has also been extensively improved. Much work has been invested in implementing an improved tool for protein and gene ID conversion, which includes a large number of species and databases. The converter tool now supports more than 10 species and more than 40 gene ID references for human [including single nucleotide polymorphism (SNP) and orthologous information]. In general, almost all the modules of GEPAS have undergone improvements to some extent. We have included a complete new module that allows the analysis of multi-series time-course and dose–response microarray experiments. The module is an implementation of the maSigPro statistical approach for the study of gene expression changes along time and the specific trend differences between various experimental groups (38). Another new module performs clustering with a version of SOM (21) that automatically finds the number of clusters. Babelomics has its own catalogue of novelties, which are described in an accompanying paper. In addition, there are technical novelties such as the re-engineering to web services, the inclusion of Web 2.0 technology features, the new session interface and the pipeliner, which are described below. In terms of resources invested, the novelties included in GEPAS go far beyond the work demanded by a conventional web server that offers a single facility.

The pipeliner: a graphic module for easy implementation of workflows

Microarray data analysis consists of a series of steps that can be carried out by sequentially running different GEPAS modules (e.g. normalization + pre-processing + gene selection + functional profiling of significant genes). If some of these steps have to be repeated systematically many times (as would happen, for example, in a microarray core facility), it is convenient to save the sequence of operations as a workflow and reuse it in future analyses. The possibility of saving and storing operations is also useful when a researcher uses a non-default set of parameters in the tools. The advanced ‘pipeliner’ module allows users to define workflows for repetitive tasks in a completely visual manner, by choosing, dragging and dropping icons representing the different modules in the package (without the need for any scripting skills). Figure 1 shows the graphic interface that allows sequences of operations to be defined and the parameters used in them to be set. The workflows defined with this Java applet can be stored in the sessions and loaded again from them.

Figure 1.

The pipeliner interface with the available modules on the left and the customization options window below. Modules can be dragged and dropped on the screen and the sequence of execution is defined by linking them. Clicking on a module brings up the corresponding parameters window below. Workflows defined in this straightforward manner can be stored in the session manager and used in future sessions.
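The idea of a stored workflow, that is, an ordered list of modules together with their (possibly non-default) parameters, can be sketched in a few lines of Python. This is a conceptual illustration only: the pipeliner itself is a graphical tool, and the module names, parameters and JSON storage used here are hypothetical.

```python
import json
import math

def log2_transform(matrix, offset=0.0):
    # Log-transform every value, optionally adding an offset first.
    return [[math.log2(v + offset) for v in row] for row in matrix]

def filter_by_range(matrix, min_range=1.0):
    # Keep only the genes whose expression range reaches min_range.
    return [row for row in matrix if max(row) - min(row) >= min_range]

# Registry mapping module names to the functions that implement them.
REGISTRY = {"log2_transform": log2_transform, "filter_by_range": filter_by_range}

def run_workflow(workflow, matrix):
    # Apply each step in order, passing the stored parameters to the module.
    for step in workflow:
        matrix = REGISTRY[step["module"]](matrix, **step["params"])
    return matrix

workflow = [
    {"module": "log2_transform", "params": {"offset": 0.0}},
    {"module": "filter_by_range", "params": {"min_range": 1.5}},
]
saved = json.dumps(workflow)      # the workflow can be stored as plain text
reloaded = json.loads(saved)      # and loaded again in a later session
result = run_workflow(reloaded, [[2.0, 8.0], [4.0, 8.0]])
```

Storing the parameters with each step is what makes a saved workflow reproducible: rerunning it on new data applies exactly the same settings without any manual intervention.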

Internal re-engineering, technological improvements and the session interface

GEPAS has been completely re-engineered and is now based on SOAP web services and on Web 2.0 technology features such as AJAX. This has facilitated the design of a new interface that allows asynchronous use, as well as the management of projects, jobs and users. Users can thus choose between the traditional anonymous sessions without logging in (as in previous versions) and logging into the new environment with a username and password. This new environment offers persistent sessions in which data are kept stored, as well as different facilities for tracking the operations performed. Both options are free. GEPAS is now running on a high-end cluster with 10 dedicated Intel XEON Quad-Core CPUs at 2.0 GHz (a total of 40 cores) and a large amount of RAM (60 GB in total). In this way we can offer high computing power to end users. An improved module for protein and gene ID conversion, including a large number of species and databases, is used behind the scenes. This module allows any microarray file to be imported regardless of the IDs used in the platform. More species and gene references have been added, and the converter module now supports more than 10 species and more than 40 ID references for human (including SNP and orthologous information). This module has been implemented in Java to speed up performance. Besides the web interface, a public web service Application Programming Interface is provided, allowing anyone to access the data from their own code.

Related training activities

In addition, there is a teaching programme related to GEPAS (http://bioinfo.cipf.es/docus/courses/courses.html) with on-line tutorials that can be freely used (http://bioinfo.cipf.es/docus/courses/on-line.html).

GEPAS usage

The impact on the user community has been estimated from the corresponding number of Google Scholar citations. According to the number of citations, GEPAS is by far the most popular web resource in its category, with 196 citations [252 if the citations of SOTA (19) are included]. The updated citations for the web tools with a significant presence in the scientific community can be found at: http://bioinfo.cipf.es/docus/tools-citations/microarrays. GEPAS is used by a broad research community in many countries and its records indicate an average usage rate of around 500 users per day. The geographical distribution of users can be monitored in real time at: http://bioinfo.cipf.es/access_map/map.html. The web-based pipeline for microarray gene expression data, GEPAS, is available at http://www.gepas.org.

Future plans

We are working on several improvements that will be released in an upcoming version. These include normalization for one-channel Agilent arrays, for exon arrays (both Agilent and Affymetrix), for tiling arrays and for Illumina arrays. New tests for differential expression will be included. A new version of the predictor with more predictor tools and new cross-validation methods will also be implemented. ISACGH (47), for array-CGH analysis, will be fully integrated in GEPAS, and interfaces to databases such as ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) or Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) will be provided.

DISCUSSION

GEPAS is a long-term, ongoing and ambitious project that aims to provide the scientific community with an advanced set of tools for high-throughput gene expression data analysis without sacrificing ease of use. Since its official release in 2003 (7), GEPAS has been running without interruption and has grown to include more tools in order to keep pace with the novelties in the microarray data analysis arena (7–9). GEPAS aims to be a consistent set of both state-of-the-art and widely established algorithms, rather than a simple collection of as many tools as possible. In fact, every new tool included in the package has been a response to a new or emerging requirement from our users. As the Functional Genomics node of the Spanish Institute of Bioinformatics (INB; http://www.inab.org) and as part of the Spanish Network of Cancer (RTICC; http://www.rticcc.org) and the Network of Centres for Research in Rare Diseases (CIBERER, http://www.ciberer.es), we are in direct contact with researchers, from whom we get much of the feedback necessary to build a useful tool. We are also integrated in the EMERALD project (http://www.microarray-quality.org/), where we will provide input on data mining methodologies such as clustering, gene selection or predictors, to assess the implications of QA/QC.

GEPAS, integrated with the Babelomics suite (40,41), offers all the methods necessary to perform the most common analyses of microarray data. GEPAS has been designed to take full advantage of the properties of the web: connectivity, cross-platform functionality and remote usage. Its modular architecture based on web services allows easy implementation of new tools and facilitates the connectivity of GEPAS from and to other web-based tools. It cannot be ruled out that the technologies and the platforms will change in the future. Such foreseeable changes can only affect the entry point and the technology-related part of GEPAS (that is, the normalization). The important contribution of GEPAS is its potential for analysing high-throughput gene expression data and for testing different types of hypotheses in this context, regardless of the technology that has produced the data.

The step of functional interpretation is typically made by studying the enrichment in pre-defined modules of genes related to each other by any interesting biological property (common function, regulation, chromosomal location, etc.) as a function of some parameter derived from the experiment. Thus, functional enrichment methods (39) are used to find gene modules significantly over-represented among the relevant genes selected in the experiment. Over-representation of a given gene module means that genes with a particular property have been activated or deactivated in the experiment. Recently, gene set enrichment methods have been superseding conventional functional enrichment methods for the functional interpretation of high-throughput gene-expression data, given their higher sensitivity (39,48,49). Both families of methods, along with several definitions of modules (functional, transcriptional, text-mining based, and phenotype and tissue based), are implemented in the Babelomics module, fully integrated in GEPAS. GEPAS is now running on a high-end cluster that offers high computing power. This allows the use of tools (for example, normalization tools are highly RAM-consuming) that are usually beyond the capabilities of the hardware available to many end users. Although there are many alternatives for microarray data analysis, there is no other similar resource over the web with the number of possibilities offered by GEPAS.

38 in total

1. Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information.

Authors: Fátima Al-Shahrour; Ramón Díaz-Uriarte; Joaquín Dopazo
Journal: Bioinformatics Date: 2005-04-19 Impact factor: 6.937

2. Data-adaptive test statistics for microarray data.

Authors: Sach Mukherjee; Stephen J Roberts; Mark J van der Laan
Journal: Bioinformatics Date: 2005-09-01 Impact factor: 6.937

3. Combined static and dynamic analysis for determining the quality of time-series expression profiles.

Authors: Itamar Simon; Zahava Siegfried; Jason Ernst; Ziv Bar-Joseph
Journal: Nat Biotechnol Date: 2005-12 Impact factor: 54.908

Review 4. Microarray data analysis: from disarray to consolidation and consensus.

Authors: David B Allison; Xiangqin Cui; Grier P Page; Mahyar Sabripour
Journal: Nat Rev Genet Date: 2006-01 Impact factor: 53.242

5. maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments.

Authors: Ana Conesa; María José Nueda; Alberto Ferrer; Manuel Talón
Journal: Bioinformatics Date: 2006-02-15 Impact factor: 6.937

Review 6. Roadmap for developing and validating therapeutically relevant genomic classifiers.

Authors: Richard Simon
Journal: J Clin Oncol Date: 2005-09-06 Impact factor: 44.544

7. Quantitative monitoring of gene expression patterns with a complementary DNA microarray.

Authors: M Schena; D Shalon; R W Davis; P O Brown
Journal: Science Date: 1995-10-20 Impact factor: 47.728

Review 8. Computational cluster validation in post-genomic data analysis.

Authors: Julia Handl; Joshua Knowles; Douglas B Kell
Journal: Bioinformatics Date: 2005-05-24 Impact factor: 6.937

9. GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data.

Authors: Juan M Vaquerizas; Lucía Conde; Patricio Yankilevich; Amaya Cabezón; Pablo Minguez; Ramón Díaz-Uriarte; Fátima Al-Shahrour; Javier Herrero; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

10. BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments.

Authors: Fátima Al-Shahrour; Pablo Minguez; Joaquín Tárraga; David Montaner; Eva Alloza; Juan M Vaquerizas; Lucía Conde; Christian Blaschke; Javier Vera; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971
