
GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data.

Juan M Vaquerizas, Lucía Conde, Patricio Yankilevich, Amaya Cabezón, Pablo Minguez, Ramón Díaz-Uriarte, Fátima Al-Shahrour, Javier Herrero, Joaquín Dopazo.

Abstract

The Gene Expression Profile Analysis Suite, GEPAS, has been running for more than three years. With >76,000 experiments analysed during the last year and a daily average of almost 300 analyses, GEPAS can be considered a well-established and widely used platform for gene expression microarray data analysis. GEPAS is oriented to the analysis of whole series of experiments. Its design and development have been driven by the demands of the biomedical community, probably the most active collective in the field of microarray users. Although clustering methods have obviously been implemented in GEPAS, our interest has focused more on methods for finding genes differentially expressed among distinct classes of experiments or correlated to diverse clinical outcomes, as well as on building predictors. There is also a great interest in CGH-arrays which fostered the development of the corresponding tool in GEPAS: InSilicoCGH. Much effort has been invested in GEPAS for developing and implementing efficient methods for functional annotation of experiments in the proper statistical framework. Thus, the popular FatiGO has expanded to a suite of programs for functional annotation of experiments, including information on transcription factor binding sites, chromosomal location and tissues. The web-based pipeline for microarray gene expression data, GEPAS, is available at http://www.gepas.org.


Year: 2005 PMID: 15980548 PMCID: PMC1160260 DOI: 10.1093/nar/gki500

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

GEPAS, which stands for Gene Expression Profile Analysis Suite, is a web tool designed and oriented to the analysis of DNA microarray gene expression experiments. The emphasis in the development of new tools for GEPAS has been driven by the requirements of data analysis in the most active fields using microarray technologies, which are, without doubt, biomedical applications [e.g. (1–4)]. As a consequence, much stress has been put on the implementation of proper methods for gene selection, predictors, CGH-arrays and functional annotation of experiments. More classical data analysis approaches, such as clustering, have also been incorporated into GEPAS, as well as different options for data preprocessing. GEPAS has been conceived as an integrated web-based pipeline for the analysis of gene expression patterns where different methods can be used within an integrated interface that provides a user-friendly environment to end users. The way in which the methods are connected has been designed to guide the user by suggesting all the available possibilities to continue with the analysis and to prevent possible inappropriate uses of the tools. GEPAS, which was originally the backbone of the pipeline of microarray data analysis of the CNIO, was made public three years ago and first published in 2003 (5,6). In the years since, GEPAS has become a de facto standard for many researchers and its use has undergone a spectacular growth. In terms of the scope of analysis, GEPAS is the most complete web-based resource that can be found nowadays. Our aim is to keep GEPAS ‘living’ by the continuous addition of new algorithms. Here we report the new modules, some trends observed in its use and some novelties.

SCOPE OF GEPAS

As previously mentioned, GEPAS is experiment oriented. This means that facilities for data manipulation, such as row and column management, are deliberately absent from its design. With the exception of the module DNMAD, which can take as input Genepix (Axon Instruments) GPR files from a scanner (see below), GEPAS accepts as input preselected data (usually coming from a database) in a very simple format: a tab-delimited text file containing genes in rows and experiments in columns (except the first column, which contains the identifiers for the genes). Several preprocessing facilities are provided. These include normalization along with different kinds of data transformation, such as missing value imputation, filtering of ‘flat’ patterns and extraction of genes based on functional properties. GEPAS permits two main types of experimental designs: those oriented towards class discovery, for which different clustering methods are available, and those related to supervised questions, which mainly include gene selection and building predictors. GEPAS includes two tools for dealing with these two problems. In addition, there is now great interest in tools that allow CGH-arrays to be handled. GEPAS includes a module for mapping either genomic or mRNA hybridizations over the corresponding chromosomal locations, with different facilities for data visualization. Finally, GEPAS provides a module for functional annotation of experiments that includes the popular FatiGO (8), as well as a variety of new tools.
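The input format and two of the preprocessing steps described above can be illustrated with a short Python sketch. It is not GEPAS code. The header line, the use of empty cells for missing values, row-mean imputation and the standard-deviation threshold for ‘flat’ patterns are all assumptions made for this example, and the function names are hypothetical.

```python
import io
import statistics


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix: genes in rows, experiments in columns.

    Assumption of this sketch: the first line is a header naming the
    experiments, and an empty cell marks a missing value.
    """
    header = handle.readline().rstrip("\n").split("\t")
    experiments = header[1:]
    matrix = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        # First column is the gene identifier, the rest are expression values.
        values = [float(f) if f.strip() else None for f in fields[1:]]
        matrix[fields[0]] = values
    return experiments, matrix


def impute_row_mean(matrix):
    """Replace missing values by the mean of the observed values of that gene."""
    imputed = {}
    for gene, values in matrix.items():
        observed = [v for v in values if v is not None]
        if not observed:
            continue  # a gene with no observed value cannot be imputed
        mean = statistics.fmean(observed)
        imputed[gene] = [mean if v is None else v for v in values]
    return imputed


def filter_flat(matrix, min_sd):
    """Keep only genes whose standard deviation across experiments is >= min_sd."""
    return {g: v for g, v in matrix.items() if statistics.pstdev(v) >= min_sd}


example = io.StringIO(
    "gene\texp1\texp2\texp3\n"
    "g1\t1.0\t\t3.0\n"
    "g2\t0.5\t0.5\t0.5\n"
    "g3\t-1.0\t0.0\t1.0\n"
)
experiments, raw = read_expression_matrix(example)
complete = impute_row_mean(raw)
kept = filter_flat(complete, min_sd=0.1)
print(experiments)
print(sorted(kept))
```

In this toy input the missing value of g1 is replaced by the mean of its two observed values, and g2 is discarded because its profile is constant across experiments.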

GEPAS AT A GLANCE

GEPAS includes a number of interconnected tools implemented as individual modules that can be used either independently or within the pipeline (Figure 1). Since the previous version (6), GEPAS has undergone a number of technical improvements which have not had much impact on its external aspect but have notably changed its performance. Internal links among modules have been improved and redesigned in order to avoid wrong pathways in the pipeline. Some CPU-intensive modules have been moved to dedicated computers (in particular DNMAD, Pomelo and Tnasas). The structure of GEPAS is as follows.

Figure 1

The GEPAS pipeline. The figure summarizes the most important features of the GEPAS pipeline. Black arrows show the flow of information from the raw data to the three main types of analysis: CGH-array, unsupervised clustering and supervised analysis (gene selection or predictors). Functional annotation is possible from the latter two options. Grey arrows represent the possibility to re-analyse parts of the experiments.

Preprocessing

DNMAD (9) is for normalization using print-tip loess (10,11), with different possibilities. Some additional options have been included in this new version: the possibility of using a spot's flags, optional use of background subtraction and the possibility of using global loess (instead of print-tip). We have also included a better management of flagged dots, new diagnostic plots (the density plots for either raw or background-corrected red and green channels) and automatic dye-swap. DNMAD can take as input Genepix (Axon Instruments) GPR files. Preprocessor (12) performs some preprocessing of the data (log-transformations, standardizations, imputation of missing values, etc.). Data can also be filtered on the basis of their functional labels [GO terms (13)] using the Knowledge Filtering module (6). IDconverter, a new module, maps lists of accession numbers and identifiers among different clone, gene or protein standards. IDconverter includes distinct levels of information such as gene level (gene HUGO name, Ensembl gene, Unigene cluster, LocusLink, RefSeq, gene location, gene description), clone level (Affymetrix, GenBank accession number, IMAGE Clone ID) and protein level (SwissProt, TrEMBL, now UNIPROT). Chromosomal locations are obtained from Ensembl.
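To make the quantities involved in two-colour normalization concrete, the following Python sketch computes the log ratio M and the average log intensity A of a spot, with optional background subtraction, and then centres M within each print-tip group. Median centring is used here only as a much simpler stand-in for print-tip loess, which fits an intensity-dependent curve. The function names and the dictionary layout are hypothetical and are not taken from DNMAD.

```python
import math
import statistics
from collections import defaultdict


def ma_values(red, green, red_bg=0.0, green_bg=0.0):
    """Return (M, A) for one spot, or None if a corrected channel is not positive.

    M = log2(R / G) is the log ratio and A = 0.5 * log2(R * G) is the
    average log intensity, both computed after background subtraction.
    """
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g), 0.5 * math.log2(r * g)


def median_center_by_print_tip(spots):
    """Subtract from each M value the median M of its print-tip group.

    `spots` is a list of dicts with the keys 'id', 'tip' and 'M'
    (a layout assumed for this sketch). Returns a dict id -> centred M.
    """
    by_tip = defaultdict(list)
    for spot in spots:
        by_tip[spot["tip"]].append(spot["M"])
    medians = {tip: statistics.median(ms) for tip, ms in by_tip.items()}
    return {spot["id"]: spot["M"] - medians[spot["tip"]] for spot in spots}


m, a = ma_values(1124.0, 356.0, red_bg=100.0, green_bg=100.0)
print(m, a)

spots = [
    {"id": "s1", "tip": 1, "M": 1.0},
    {"id": "s2", "tip": 1, "M": 2.0},
    {"id": "s3", "tip": 1, "M": 3.0},
    {"id": "s4", "tip": 2, "M": 0.5},
    {"id": "s5", "tip": 2, "M": 1.5},
]
centred = median_center_by_print_tip(spots)
print(centred)
```

A loess-based method would replace the per-group median by a smooth curve of M against A, so that the correction can vary with spot intensity.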

Analysis

Unsupervised clustering includes different methods such as aggregative clustering (14), SOTA (15,16), SOM (17), K-means (18) (which is a new addition in this version of GEPAS) and SOM-Tree (19).

Supervised analysis includes gene selection and predictors. Gene selection: the analysis of genes differentially expressed between two or more classes, related to a continuous experimental factor (e.g. the concentration of a metabolite) or related to survival is performed by the module Pomelo (6). Different methods for multiple testing adjustment are included (20–22). Predictors: the module Tnasas (for ‘This is not a substitute for a statistician’) implements a simple, although effective, way of building class predictors from microarray data. The error rate is computed taking into account the effect of gene selection and is therefore not biased downwards by the ‘selection bias’ problem so common in many microarray studies [e.g. (23,24)].

For the analysis of CGH-arrays, given the growing interest in microarray-based CGH (array CGH) (25), we have expanded the capabilities of the InSilicoCGH tool, which allows the mapping of the results of microarray hybridizations onto chromosome coordinates. The InSilicoCGH module has been designed for the simultaneous analysis of genomic and mRNA hybridizations on the same expression array. It can also deal with BAC-arrays. We have added a new option: the zoom. This magnifies the view of the desired chromosomal location in order to facilitate detection of the precise position of chromosomal gains and losses; in general, it allows hybridization values at gene level to be viewed in more detail. Figure 2 is a screenshot of the zoom tool.
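Multiple testing adjustment matters in gene selection because thousands of genes are tested at once. The text does not name the specific methods implemented in Pomelo, so the following Python sketch shows only one widely used adjustment, the Benjamini-Hochberg false discovery rate procedure, as an illustrative assumption. The function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    For the p-value of rank k (1 = smallest) among m tests the raw adjusted
    value is p * m / k; a running minimum taken from the largest rank
    downwards keeps the adjusted values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# One p-value per gene, e.g. from a test comparing two classes of experiments.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
print(adj_p)
```

Genes would then be reported as differentially expressed when their adjusted p-value falls below the chosen false discovery rate.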

Figure 2

The zoom tool of InSilicoCGH in action. Clicking on the desired chromosomal region produces a pop-up window with a zoom facility. The user can freely move around the point chosen and can easily visualize in detail the hybridization values. Borders of deleted or amplified regions can be precisely defined in this way.

Functional analysis of experiments

Functional annotation of microarray experiments is an important aspect of analysis that very few packages incorporate. Several modules for functional annotation of microarray experiments are available. These programs, along with similar ones, are discussed in an accompanying paper by our group (7). All these tools, in addition to being connected to GEPAS (because of their obvious usefulness for the analysis of microarray data), are grouped as an independent resource called Babelomics (7). Babelomics has, as its general purpose, the facilitation of functional annotation in any type of high-throughput experiment (proteomics, interactomics, massive sequencing, etc.). FatiGO (8) allows significant asymmetrical distributions of GO terms between groups of genes to be found. FatiWise (6) does the same with InterPro motifs (26), KEGG (27) pathways and SwissProt keywords, when available. TransFAT performs the same operations for putative transcription factor binding sites in the promoter regions of genes as predicted by the program Match (28), from the Transfac® database (29). TMT, the Tissues Mining Tool, is a web application to extract significant information related to the differential expression of two sets of genes in tissues. FatiScan allows the detection of modest but coordinated changes in gene expression values by applying the FatiGO algorithm to lists of genes ordered according to their differences in expression.

In terms of its internal architecture, GEPAS is a collection of programs mainly written in C++, although some were written in other programming languages such as R [DNMAD (9)] or PERL [Preprocessor (12)]. These modules are interconnected by PERL wrappers.
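The idea of an asymmetrical distribution of a functional term between two groups of genes can be illustrated with a 2x2 contingency table and a two-sided Fisher's exact test. The text above does not state which test the tools apply, so the choice of test is an assumption of this Python sketch, and the function name and the counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: genes in group 1 annotated with the term, b: genes in group 1 without it,
    c: genes in group 2 annotated with the term, d: genes in group 2 without it.
    The p-value sums the hypergeometric probabilities of all tables with the
    same margins that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


# Hypothetical counts: 8 of 10 genes in group 1 carry the term, 1 of 10 in group 2.
p = fisher_exact_two_sided(8, 2, 1, 9)
print(p)
```

Because many terms are tested for the same pair of gene groups, the resulting p-values would need a multiple testing adjustment before being interpreted.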

A PIPELINE OF MICROARRAY DATA ANALYSIS TOOLS

The efficiency of a modular package such as GEPAS lies largely in its degree of integration of the different data analysis tools. Users can move through a complete pipeline of data analysis in a transparent way, without needing to perform any reformatting operation. In addition, a properly designed workflow can help to prevent possible wrong operations in microarray data analysis owing to misconceptions. Figure 1 illustrates the structure of the GEPAS pipeline. Raw data can be loaded and normalized. Several data transformation options are available through the Preprocessor tool. Depending on the particular problem addressed, data can be directed to any of the three main types of analysis: CGH-array, unsupervised clustering and supervised analysis (gene selection or predictors). A functional annotation is possible from the last two options. GEPAS has been designed in a way that prevents possible misuses of the methods implemented in the package.
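One way to picture how a workflow can prevent wrong pathways is to represent the pipeline as a table of allowed transitions between steps and to check every proposed path against it. The Python sketch below encodes only the black arrows described for Figure 1. The step names and the validation function are hypothetical and do not reflect how GEPAS is implemented, and the re-analysis (grey) arrows are left out.

```python
# Allowed next steps, following the flow described for Figure 1:
# raw data -> normalization -> preprocessing -> one of three analyses,
# with functional annotation reachable only from the last two.
ALLOWED_NEXT = {
    "raw_data": {"normalization"},
    "normalization": {"preprocessing"},
    "preprocessing": {
        "cgh_array",
        "unsupervised_clustering",
        "supervised_analysis",
    },
    "cgh_array": set(),
    "unsupervised_clustering": {"functional_annotation"},
    "supervised_analysis": {"functional_annotation"},
    "functional_annotation": set(),
}


def validate_path(steps):
    """Return (True, 'ok') if every consecutive pair of steps is allowed.

    Otherwise return (False, message) naming the first invalid transition.
    Unknown step names are treated as having no allowed successors.
    """
    for current, following in zip(steps, steps[1:]):
        if following not in ALLOWED_NEXT.get(current, set()):
            return False, f"{current} -> {following} is not an allowed transition"
    return True, "ok"


good = validate_path(
    ["raw_data", "normalization", "preprocessing",
     "supervised_analysis", "functional_annotation"]
)
bad = validate_path(["preprocessing", "cgh_array", "functional_annotation"])
print(good)
print(bad)
```

An interface built on such a table would only offer the user the steps listed for the current module, which is the guiding behaviour the text describes.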

TRAINING PROGRAMME AND GEPAS

In addition to the tools, a collection of on-line tutorials is available on the GEPAS web page. They can be used to learn how to use the tools or as part of a course. The structure of the tutorials includes some theory, a guided example and several examples based on publicly available datasets. There are tutorials for (i) normalization using DNMAD, (ii) data preprocessing using the Preprocessor tool, (iii) data clustering using the different algorithms available (UPGMA, SOM, SOTA), (iv) selection of differentially expressed genes using the Pomelo tool and (v) functional annotation using FatiGO. The tutorials are currently used in different courses, such as a masters in bioinformatics (Spain) and the international FCUL-IGC Post-Graduate Programme in Bioinformatics.

CONSOLIDATION OF GEPAS AS A WIDELY USED PACKAGE

Our records indicate that, since March 2004, GEPAS has been used to analyse >76 000 experiments, with a daily average of almost 300 uses (statistics can be checked on the different pages for GEPAS and on the particular pages for Pomelo, Tnasas, DNMAD and FatiGO, which are independently monitored). Compared with last year's records (35 000 experiments per year with a daily average of 130) (6), there has been a clear increase in the use of the tool. The distribution of users has also changed. Whereas one year ago it was used most by Spanish researchers (25%), followed by US (.edu and .net domains) (15%), French (10%), UK (5%) and other users (Japanese, German, Dutch, etc.) (6), the profile of users during this last year has changed to 23% US (.edu and .net), 9% French, 6% Spanish, 5% UK and others. These figures suggest that GEPAS is becoming more popular among US-based researchers. Obviously the usage in all countries has increased, since the remainder of the percentages appear to maintain the same level while the absolute number of uses has increased 2-fold.

CONCLUSIONS

Despite the availability of many programs and packages for microarray data analysis, there are still many aspects of the analysis with poor or incomplete coverage. There are a number of options for analysing DNA microarray data. Most of the software available for microarray data analysis focuses on unsupervised cluster methods, which, in many cases, are used for inadequate purposes (23). There are also different initiatives such as BASE (30), Bioconductor (31) and BRB tools, but these are in some cases dependent on a particular computer operating system and usually require the user to have previous training in statistics. GEPAS can be considered the most complete web-based resource that can be found nowadays. Since the first release (5,6), GEPAS has avoided the temptation to become a list of as many methods as possible and has instead evolved to cope with new challenges that have emerged in the field of microarray data analysis. Much work has been invested in the implementation of a useful workflow. GEPAS provides the user with an integrated environment in which modules can be found for different types of analysis that respond to real analysis demands. Modules are connected in such a way as to avoid improper use of the tools. From a technical point of view, GEPAS has been designed with the intention of taking full advantage of the properties of the web: connectivity, cross-platform compatibility and remote usage. The modular architecture allows the addition of new tools and facilitates the connectivity of GEPAS from and to other web-based tools. With >76 000 experiments analysed during the last year and a daily average of almost 300 uses, GEPAS can be considered a consolidated tool in the field of microarray data analysis.

24 in total

1. TRANSFAC: an integrated system for gene expression regulation.

Authors: E Wingender; X Chen; R Hehl; H Karas; I Liebich; V Matys; T Meinhardt; M Prüss; I Reuter; F Schacherer
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

3. A hierarchical unsupervised growing neural network for clustering gene expression patterns.

Authors: J Herrero; A Valencia; J Dopazo
Journal: Bioinformatics Date: 2001-02 Impact factor: 6.937

4. Assembly of microarrays for genome-wide measurement of DNA copy number.

Authors: A M Snijders; N Nowak; R Segraves; S Blackwood; N Brown; J Conroy; G Hamilton; A K Hindle; B Huey; K Kimura; S Law; K Myambo; J Palmer; B Ylstra; J P Yue; J W Gray; A N Jain; D Pinkel; D G Albertson
Journal: Nat Genet Date: 2001-11 Impact factor: 38.330

5. The KEGG resource for deciphering the genome.

Authors: Minoru Kanehisa; Susumu Goto; Shuichi Kawashima; Yasushi Okuno; Masahiro Hattori
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. MATCH: A tool for searching transcription factor binding sites in DNA sequences.

Authors: A E Kel; E Gössling; I Reuter; E Cheremushkin; O V Kel-Margoulis; E Wingender
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

7. GEPAS: A web-based resource for microarray gene expression data analysis.

Authors: Javier Herrero; Fátima Al-Shahrour; Ramón Díaz-Uriarte; Alvaro Mateos; Juan M Vaquerizas; Javier Santoyo; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

8. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.

Authors: Fátima Al-Shahrour; Ramón Díaz-Uriarte; Joaquín Dopazo
Journal: Bioinformatics Date: 2004-01-22 Impact factor: 6.937

Review 9. Expression profiling--best practices for data generation and interpretation in clinical trials.

Authors:
Journal: Nat Rev Genet Date: 2004-03 Impact factor: 53.242

10. Gene expression profiling predicts clinical outcome of breast cancer.

Authors: Laura J van 't Veer; Hongyue Dai; Marc J van de Vijver; Yudong D He; Augustinus A M Hart; Mao Mao; Hans L Peterse; Karin van der Kooy; Matthew J Marton; Anke T Witteveen; George J Schreiber; Ron M Kerkhoven; Chris Roberts; Peter S Linsley; René Bernards; Stephen H Friend
Journal: Nature Date: 2002-01-31 Impact factor: 49.962
