Pomelo II: finding differentially expressed genes.

Edward R Morrissey, Ramón Diaz-Uriarte.

Abstract

Pomelo II (http://pomelo2.bioinfo.cnio.es) is an open-source, web-based, freely available tool for the analysis of gene (and protein) expression and tissue array data. Pomelo II implements: permutation-based tests for class comparisons (t-test, ANOVA) and regression; survival analysis using the Cox model; contingency table analysis with Fisher's exact test; and linear models (of which the t-test and ANOVA are special cases) that allow additional covariates for complex experimental designs and use empirical Bayes moderated statistics. Permutation-based and Cox model analyses use parallel computing, which makes it possible to take advantage of multicore CPUs and computing clusters. Access to, and further analysis of, additional biological information and annotations (PubMed references, Gene Ontology terms, KEGG and Reactome pathways) is available either for individual genes (from clickable links in tables and figures) or for sets of genes. The source code is available, allowing the software to be extended and reused. A comprehensive test suite is also available, and covers both the user interface and the numerical results. The possibility of including additional covariates, parallelization of computation, open-source availability of the code and a comprehensive testing suite make Pomelo II a unique tool.

Year: 2009 PMID: 19435879 PMCID: PMC2703955 DOI: 10.1093/nar/gkp366

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

There is a continuous demand for web-based applications for the analysis of genomic and proteomic data. For end-users, a key feature of web-based applications is that they make few demands on users' software and hardware, since only a web browser is needed (1). Moreover, and of particular relevance when dealing with large datasets, computational capabilities are not limited by the user's hardware (only by the servers'). In this context, web-based applications allow developers to take advantage of the increased availability of multicore processors and clusters built with off-the-shelf components. These are probably the major opportunities for significant performance gains in the near future (2–5). When deployed in a computing cluster, parallelization [such as that provided by MPI (6)] harvests computational resources that are rarely available to individual researchers and can deliver significant decreases in waiting time, while being completely transparent to the end-user. Moreover, web-based applications can offer a user interface and experience very similar to that of desktop applications [e.g. by using Javascript (7)]. Finally, web-based tools offer the opportunity to quickly bring new methodological developments to many potential users. Interpretation of results (8,9) can also be easily supported by web-based tools, by linking to additional sources of information (e.g. PubMed references, Gene Ontology terms, etc.), which also permits further analysis with this additional information, such as identifying features (e.g. pathways, GO terms, etc.) that might be characteristic of the set of differentially expressed genes. In addition to the above general features of ‘omics’ web-based applications, when searching for differentially expressed genes (and similarly for protein and tissue array analysis) it is of course imperative to incorporate the best statistical practices in the field. Depending on the type of response data, different tests should be applied.
The most common type of data (gene expression data for different types of patients) is often analyzed to search for differentially expressed genes using ANOVA, t-tests and related approaches that compare two or more classes. Tissue array data, however, are of a categorical or presence/absence nature, and require contingency table methods. Survival data, in contrast, require methods (such as the Cox model) that can explicitly deal with right-censored observations. Thus, a tool for the search of differentially expressed genes (proteins) should incorporate the above methods to cover some of the more common needs of wet-lab researchers. In all these cases, and regardless of the type of test, it is by now well recognized (10,11) that multiple testing problems should be taken into account. In addition, since many microarray studies are really observational studies with human patients, it is often necessary to include additional clinical covariates to minimize confounding problems (12,13). Moreover, in some cases statistical methodology exists that allows us to borrow information from all genes in the array when carrying out the test for each gene, using moderated statistics and empirical Bayes approaches [e.g. (14)]. Finally, availability of source code, under an open-source license, is well recognized as an important feature of bioinformatics applications (13,15): it allows for fast methodological development based upon previous work by permitting other researchers to extend the methods and provide improvements and bug fixes, it makes it possible to verify claims made by method developers, and it ensures that the international research community remains the owner of the tools it needs to carry out its work. The impact of code availability is further enhanced when standard best practices in software development [see review and references in (16)] and the usual open-source development mode (17) are followed.
Of particular importance, especially with applications that perform complex analyses, is the provision of testing suites that make it possible to verify the results of the analyses performed by the web-based application.

POMELO II: UNIQUE FEATURES

There are several other web-based applications that can be used to identify differentially expressed genes (18–31). However, all of these fail one or more of the requirements mentioned in the Introduction section. Many of them incorporate some of the same procedures as Pomelo II, but few offer as comprehensive a set of analyses as Pomelo II does. Most tools are limited to two-class comparisons. Multi-class comparisons are only available from EMAAS (18), EzArray (24), GenePublisher (26), WebArray (27) and GEPAS (32). Survival analysis and regression are only available in GEPAS (32), but GEPAS is not open source. Contingency table analysis is not available in any tool other than Pomelo II. Several other tools make it explicit that they run in clusters (20,33), which allows for load balancing and swapping jobs to idle nodes. However, with the exception of EMAAS (18), parallel computing seems not to be used by any other tool. More importantly for microarray data analysis, a unique characteristic of Pomelo II is that it allows the incorporation of additional covariates, such as age or sex, a much needed feature in many microarray studies with human subjects, where these variables can have an effect on gene expression (12,13). As part of our emphasis on this feature, when using additional covariates, the user is alerted to possible aliasing and confounding and to the available degrees of freedom (Input and output section).
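The two covariate checks mentioned above (possible confounding, and the degrees of freedom left after adding covariates) can be illustrated with a small calculation. This is only a sketch under stated assumptions: the function names and the data are hypothetical, the design is assumed to be full rank (no aliasing), and nothing here is taken from the Pomelo II source code.

```python
from collections import Counter


def cross_tabulate(class_labels, covariate):
    """Count subjects in each (class, covariate level) cell."""
    return Counter(zip(class_labels, covariate))


def degrees_of_freedom(class_labels, covariates):
    """Return (residual df, df used per term) for a class + covariates model.

    Assumption: full-rank design. A numerical covariate uses 1 df and a
    categorical covariate uses (number of levels - 1) df.
    """
    used = {"intercept": 1, "class": len(set(class_labels)) - 1}
    for name, values in covariates.items():
        if all(isinstance(v, (int, float)) for v in values):
            used[name] = 1
        else:
            used[name] = len(set(values)) - 1
    return len(class_labels) - sum(used.values()), used


# Hypothetical data: 8 subjects, two classes, two covariates.
classes = ["case"] * 4 + ["control"] * 4
sex = ["F", "F", "F", "F", "M", "M", "M", "F"]
age = [51, 63, 47, 58, 60, 55, 49, 66]

table = cross_tabulate(classes, sex)
residual_df, used_df = degrees_of_freedom(classes, {"sex": sex, "age": age})
```

In this made-up example every case is female and most controls are male, so the cross-tabulation has an empty ("case", "M") cell: class and sex are nearly confounded. With only eight subjects, the intercept, class, sex and age terms already use four degrees of freedom.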

FUNCTIONALITY, INPUT, OUTPUT

Available statistical methods

Pomelo II incorporates a range of validated, well-known statistical methods for identifying differentially expressed genes (or proteins). Fisher's exact test is available for contingency tables (this test is especially useful with tissue array data). Linear regression, with a P-value obtained by a permutation test, is of interest when we try to model the values of an interval-scaled variable using gene expression data. The Cox model is a widely used method for censored data, such as when we want to find the relationship between patient survival and gene expression. Two-class comparisons are available as a permutation-based t-test, as a parametric t-test using moderated statistics with an empirical Bayes approach (14) as implemented in Limma (34), and as a paired t-test (also using Limma). Class comparisons for two or more classes are available as ANOVA, using permutation for significance, or as linear models, using Limma. If using linear models, we can adjust for the possible effects of ‘additional covariates’ (e.g. sex, age, etc.). For all the tests implemented, we return unadjusted P-values as well as FDR-adjusted P-values, using the approach of Benjamini and Hochberg [see details and discussion in (10,11)].
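A permutation-based t-test for a single gene can be sketched in a few lines. The code below is an illustration only: the Welch-type t-statistic, the function names, the default number of permutations and the (hits + 1)/(n_perm + 1) estimate are assumptions made for this example, not a description of how Pomelo II computes its P-values.

```python
import math
import random


def t_statistic(x, y):
    """Two-sample t-statistic with an unequal-variance (Welch) standard error."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    se = math.sqrt(vx / nx + vy / ny)
    if se == 0:
        # Degenerate case: no within-group variation at all.
        return 0.0 if mx == my else math.copysign(math.inf, mx - my)
    return (mx - my) / se


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided P-value obtained by randomly reassigning the class labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    nx = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Small tolerance so splits equal to the observed one count as ties.
        if abs(t_statistic(pooled[:nx], pooled[nx:])) >= observed - 1e-12:
            hits += 1
    # Adding 1 to numerator and denominator keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one gene in two classes.
group_a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
group_b = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
p_value = permutation_p_value(group_a, group_b)
```

In a real analysis this calculation is repeated for every gene, which is why the permutation tests are natural candidates for parallel execution.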

Input and output

Input files are plain text files. For all analyses (except survival analysis), two files are needed: the expression data and the class labels. In addition, for linear models, if additional covariates are used, a file with the additional covariates is required. For survival analysis, three files are needed: expression data, survival times and censoring indicators. A screenshot of the main input screen, showing the methods available, is shown in Figure 1a.

Figure 1.

Three input screens from Pomelo II. (a) Initial input, showing available statistical methods. (b) Additional covariates check page, with figures showing distributions at different levels of the class variable. (c) Additional covariates check page, showing degrees of freedom available and help.

When using linear models, the user can use additional covariates. These are other subject attributes (e.g. subject age, gender, weight, etc.), often readily available from the clinical history. This information allows Pomelo II to check whether gene expression differences or similarities may be due to these factors instead of being due to membership of a certain class. When entering additional covariates for the linear model, the user can choose which of the covariates to use. In addition, we show plots of each of the covariates at the different levels of the class variable (Figure 1b). This allows the user to check that the program has correctly interpreted the variables, since the plots show which are numerical and which are categorical. In addition, it alerts the user to the possible existence of confounding and aliasing (35). Suppose that in a study comparing expression profiles of breast cancer patients with non-breast cancer control subjects, most breast cancer patients were females and the non-breast cancer subjects were males. This situation would be readily detectable with the plots provided by Pomelo II. In addition, some studies have small numbers of subjects but try to correct for too many covariates; when entering additional covariates, the user is informed about the available degrees of freedom, as well as the degrees of freedom used by each of the covariates included; help is available immediately, explaining the meaning of the table (Figure 1c).

The main output from the program is a table with the results of the analysis and a heatmap.
The results table contains a header indicating the test used, the number of permutations and which covariates were used (if any); see Figure 2a and b for two examples, corresponding to a permutation t-test and a Cox model. The table shows an index corresponding to the original ordering in the data file, gene names, P-values (unadjusted), FDR-adjusted P-values and statistics (and the absolute value of the statistic); in the case of Cox models, an additional column, ‘Warnings’, might show warnings from the fit (e.g. lack of convergence). At the bottom of the output, there is a heatmap (Figure 2c), where you can filter which genes are plotted and how, and choose the color scale. Both the tables and the heatmap are clickable and will take you to a page with additional information [our IDConverter Light (36)]; they also allow you to send selected genes (based on user-specified selection criteria) to PaLS (37) to examine PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways that are common to that set of genes.
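The FDR-adjusted P-values reported next to the unadjusted ones follow the Benjamini and Hochberg approach. A minimal sketch of that step-up adjustment is given below; the function name and the example P-values are made up for illustration, and the code is not taken from Pomelo II.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical unadjusted P-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
```

For these four values the adjusted P-values are approximately 0.02, 0.04, 0.04 and 0.02: each raw P-value is multiplied by the number of tests divided by its rank, and then capped by the adjusted value of the next larger P-value.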

Figure 2.

Output. Output table from a permutation t-test (a) and a Cox model (b), and a heatmap with dendrogram, showing available options for heatmap redrawing (c).

If you have run an ‘Anova, linear models (limma)’ test, the output will also contain a Class compare section with a button. Clicking on the button takes us to a class compare page (Figure 3a), where we can compare specific pairs of classes. For each comparison, a table will appear (e.g. Figure 3b) with (moderated) t-statistics (and associated P-values and FDR-corrected P-values), similar to the one in Figure 2. The Class compare page is provided because in linear models (ANOVAs) with three or more classes, we might be interested in comparing particular pairs of classes in addition to the overall F-test (if our linear model had only two classes originally, this option is not really necessary, since of course the overall F-test is equivalent to the t-test for the two-class comparison). Note that a particular two-class comparison in a, say, three-class analysis is not necessarily identical to conducting just a two-class comparison with a t-test: in linear models, we use all available data to estimate the error term and, moreover, the empirical Bayes method implemented in Limma (14) borrows information from gene expression data across all classes. Thus, in experiments that comprise more than two classes, it is always preferable to carry out specific contrasts after a full, global model is fitted to all the data, rather than conducting many two-group analyses that discard information from the other groups.

Figure 3.

Output from linear model. (a) Class comparison page. (b) Output table from one of the two-class comparisons. (c) Details of Class comparison, showing Venn diagram and table of up- and down-regulated genes for each two-class comparison.

From the Class compare page, we can also obtain differential expression tables which, again, are particularly useful with more than two classes. They are also useful with two classes since the F-statistic, which is always of positive sign, gives no indication of whether the mean of the first group is larger or smaller than the mean of the second group. As shown in Figure 3c, for the user-selected group comparisons we obtain Venn diagrams that provide quick visual information about the number of up- and down-regulated genes in each two-class comparison and their intersection (e.g. in the figure, 656 genes are up-regulated in both the contrast between classes 0 and 1 and the contrast between classes 0 and 2). We also obtain a table showing which genes are differentially expressed in each two-class comparison; we use color codes (green and red) and the ‘<’ and ‘>’ signs to allow for fast differentiation between up- and down-regulated genes. The FDR threshold below which genes are considered differentially expressed can be changed by the user, and the Venn diagram and table will be regenerated automatically.
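The counts shown in such a Venn diagram amount to set operations on the genes that pass the FDR threshold in each contrast. The sketch below uses hypothetical gene names, statistics and FDR values, and assumes that a positive statistic means up-regulation; it illustrates the idea and is not Pomelo II's code.

```python
def regulated_sets(results, fdr_threshold=0.05):
    """Split genes with FDR below the threshold into (up, down) sets.

    `results` maps gene name -> (statistic, FDR-adjusted P-value).
    Assumption: a positive statistic means up-regulation.
    """
    up = {g for g, (stat, fdr) in results.items() if fdr < fdr_threshold and stat > 0}
    down = {g for g, (stat, fdr) in results.items() if fdr < fdr_threshold and stat < 0}
    return up, down


def venn_counts(first, second):
    """Sizes of the three regions of a two-set Venn diagram."""
    return {
        "only_first": len(first - second),
        "both": len(first & second),
        "only_second": len(second - first),
    }


# Hypothetical results for two contrasts (class 0 vs 1, class 0 vs 2).
contrast_0_vs_1 = {"g1": (3.2, 0.01), "g2": (-2.8, 0.02), "g3": (0.4, 0.80), "g4": (2.9, 0.03)}
contrast_0_vs_2 = {"g1": (4.1, 0.001), "g2": (1.0, 0.40), "g3": (-3.5, 0.01), "g4": (2.5, 0.20)}

up_01, down_01 = regulated_sets(contrast_0_vs_1)
up_02, down_02 = regulated_sets(contrast_0_vs_2)
up_counts = venn_counts(up_01, up_02)

# Relaxing the threshold changes the sets, so the counts are recomputed.
up_01_relaxed, _ = regulated_sets(contrast_0_vs_1, fdr_threshold=0.25)
up_02_relaxed, _ = regulated_sets(contrast_0_vs_2, fdr_threshold=0.25)
up_counts_relaxed = venn_counts(up_01_relaxed, up_02_relaxed)
```

With the default threshold only g1 is up-regulated in both contrasts; at the relaxed threshold g4 joins the intersection, which mirrors how the diagram and table change when the user adjusts the FDR cut-off.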

Documentation, help, tutorials

Online help, including full documentation, pre-run examples, sample files and loading of sample data sets, is available from the main page of Pomelo II. We also provide video tutorials (see http://pomelo2.bioinfo.cnio.es/help/flash_tutorials/video_tutorials.html) of some of the most common or most involved analyses. In most screens, help on options specific to that step is available by clicking on the ‘?’ symbol (e.g. see Figure 1c). The help files are licensed under a Creative Commons license (http://www.creativecommons.org), allowing for redistribution and classroom use.

IMPLEMENTATION, AVAILABILITY, MATURITY AND TESTING

Most of the statistical functionality is written in R (38) or in C/C++, with extensive use of parallelization using MPI (6) and R interfaces to MPI [via the R packages Rmpi (39), by H. Yu, and papply (40), by D. Currie]. Parallelization is used in all permutation-based tests and in the Cox model computations. Cox model fitting uses the survival package, by T. Therneau (41). For linear models, we use Limma (34), by G. K. Smyth and collaborators. The web interface is written in Python and Javascript. Control of the application, fault tolerance, and booting and halting of the LAM/MPI universes are accomplished by a combination of Python and shell scripts. We create a new LAM/MPI universe for each run of each application, and the actual nodes/CPUs that are used in a LAM/MPI universe are determined at run time (thus excluding nodes that are down). Our publicly accessible installation, available at http://pomelo2.bioinfo.cnio.es, runs on a cluster of 31 nodes, each with two dual-core AMD Opteron 2.2 GHz CPUs and 6 GB RAM, running Debian GNU/Linux. Shared storage space uses RAID 50, which provides protection against hard disk failure, as well as access to results and data from nodes different from the one where computations started. Redundancy and load balancing of the web service are achieved with Linux Virtual Server with heartbeat and mon, which ensures balancing of the master nodes for MPI and of the non-parallelized executions. All of the code (including repository history) is available under open-source licenses (GNU GPL and Affero GPL) from Launchpad at http://launchpad.net/pomelo2.
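Permutation tests parallelize well because the permutations can be split into independent chunks whose counts are simply added at the end. The sketch below shows that chunk-and-combine pattern with Python's standard `concurrent.futures` module rather than MPI, which is what Pomelo II uses. The difference of means as test statistic, the function names, the per-chunk seeding scheme and the data are all assumptions made for this illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def mean_difference(x, y):
    """Difference between the two group means (a simple test statistic)."""
    return sum(x) / len(x) - sum(y) / len(y)


def count_extreme(x, y, n_perm, seed):
    """Count permutations at least as extreme as the observed split."""
    rng = random.Random(seed)  # Each chunk gets its own generator.
    observed = abs(mean_difference(x, y))
    pooled = list(x) + list(y)
    nx = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean_difference(pooled[:nx], pooled[nx:])) >= observed - 1e-12:
            hits += 1
    return hits


def chunked_permutation_p_value(x, y, n_perm=2000, n_workers=4, seed=0):
    """Split the permutations into chunks, run them on workers, add the counts."""
    base, extra = divmod(n_perm, n_workers)
    sizes = [base + (1 if i < extra else 0) for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(count_extreme, x, y, size, seed + i)
            for i, size in enumerate(sizes)
        ]
        hits = sum(f.result() for f in futures)
    return (hits + 1) / (n_perm + 1)


# Hypothetical expression values for one gene in two classes.
group_a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
group_b = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]
p_value = chunked_permutation_p_value(group_a, group_b)
```

Threads are used here only to keep the example portable; in CPython they give no real speed-up for this kind of CPU-bound work. On a multicore machine or a cluster, the same chunks would be handed to separate processes or MPI ranks.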

Testing, maturity and number of accesses

Pomelo II includes a comprehensive test suite that uses FunkLoad (http://funkload.nuxeo.org). These tests cover the user interface, the handling of error conditions and incorrectly formatted files, and the numerical output. They can be run on demand and whenever changes are introduced in the software, thus ensuring appropriate quality control and regression testing. The complete code is also available, under the GNU GPL and Affero GPL licenses, from http://launchpad.net/functional-testing (go to the Pomelo II directory in the source code). An additional test using Selenium (http://www.openqa.org/selenium/) is available (http://pomelo2.bioinfo.cnio.es/tests.html); these tests verify that the AJAX component of the application runs correctly under different operating systems and browsers. Pomelo II is a mature application. The server has been running for more than four years. In the last 2 years, over 6000 experimental datasets have been analyzed. Usage and testing include four groups at the developers' institution (CNIO) and users worldwide.

FUNDING

Fundacion de Investigacion Medica Mutua Madrileña and Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science (MEC); Red Tematica de Investigacion Cooperativa COMBIOMED. Funding for open access charge: Red Tematica de Investigacion Cooperativa COMBIOMED. Conflict of interest statement. None declared.

26 in total

Review 1. Open source software for the analysis of microarray data.

Authors: Sandrine Dudoit; Robert C Gentleman; John Quackenbush
Journal: Biotechniques Date: 2003-03 Impact factor: 1.993

Review 2. Epidemiology, cancer genetics and microarrays: making correct inferences, using appropriate designs.

Authors: John D Potter
Journal: Trends Genet Date: 2003-12 Impact factor: 11.639

3. GenePublisher: Automated analysis of DNA microarray data.

Authors: Steen Knudsen; Christopher Workman; Thomas Sicheritz-Ponten; Carsten Friis
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

4. caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer.

Authors: Satish Patel; James Lyons-Weiler
Journal: Appl Bioinformatics Date: 2004

5. ArrayQuest: a web resource for the analysis of DNA microarray data.

Authors: Gary L Argraves; Saurin Jani; Jeremy L Barth; W Scott Argraves
Journal: BMC Bioinformatics Date: 2005-12-01 Impact factor: 3.169

6. Next station in microarray data analysis: GEPAS.

Authors: David Montaner; Joaquín Tárraga; Jaime Huerta-Cepas; Jordi Burguet; Juan M Vaquerizas; Lucía Conde; Pablo Minguez; Javier Vera; Sach Mukherjee; Joan Valls; Miguel A G Pujana; Eva Alloza; Javier Herrero; Fátima Al-Shahrour; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

7. IDconverter and IDClight: conversion and annotation of gene and protein IDs.

Authors: Andreu Alibés; Patricio Yankilevich; Andrés Cañada; Ramón Díaz-Uriarte
Journal: BMC Bioinformatics Date: 2007-01-10 Impact factor: 3.169

8. MAGMA: analysis of two-channel microarrays made easy.

Authors: Hubert Rehrauer; Stefan Zoller; Ralph Schlapbach
Journal: Nucleic Acids Res Date: 2007-05-21 Impact factor: 16.971

9. EMAAS: an extensible grid-based rich internet application for microarray data analysis and management.

Authors: G Barton; J Abbott; N Chiba; D W Huang; Y Huang; M Krznaric; J Mack-Smith; A Saleem; B T Sherman; B Tiwari; C Tomlinson; T Aitman; J Darlington; L Game; M J E Sternberg; S A Butcher
Journal: BMC Bioinformatics Date: 2008-11-25 Impact factor: 3.169

10. MIDAW: a web tool for statistical analysis of microarray data.

Authors: Chiara Romualdi; Nicola Vitulo; Micky Del Favero; Gerolamo Lanfranchi
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971
