Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite.

Ramón Díaz-Uriarte, Andreu Alibés, Edward R Morrissey, Andrés Cañada, Oscar M Rueda, Mariana L Neves.

Abstract

Asterias (http://www.asterias.info) is an open-source, web-based suite for the analysis of gene expression and aCGH data. Asterias implements validated statistical methods, and most of the applications use parallel computing, which permits taking advantage of multicore CPUs and computing clusters. Access to, and further analysis of, additional biological information and annotations (PubMed references, Gene Ontology terms, KEGG and Reactome pathways) are available either for individual genes (from clickable links in tables and figures) or sets of genes. The applications range from array normalization, imputation and preprocessing to differential gene expression analysis, class and survival prediction, and aCGH analysis. The source code is available, allowing for extension and reuse of the software. The links and analysis of additional functional information, parallelization of computation and open-source availability of the code make Asterias a unique suite that can exploit features specific to web-based environments.

Year: 2007 PMID: 17488846 PMCID: PMC1933128 DOI: 10.1093/nar/gkm229

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Web-based applications are well suited for the analysis of microarray and genomic data. They do not require the user to install or upgrade any software, the computational capabilities (a concern with the large data sets common in genomic studies) are not limited by the user's hardware (only by the server) and, with the recent advances in web technologies, can offer a user interface and experience very similar to that of desktop applications. Integrated suites that carry out a complete set of analyses of several different types of data can be very appealing for many users, as the applications within the suite present a similar interface, have homogeneous input requirements and allow the analysis of various types of data that many wet-lab researchers deal with routinely (e.g., from microarray data normalization to aCGH). In addition, web-based tools offer the opportunity to quickly bring new methodological developments to many potential users. Therefore, there is room for additional work in integrated web-based suites to incorporate key statistical and methodological advances.

Web-based tools: requirements and desirable features

Web-based tools do not need to compromise on statistical rigor and can use validated and state-of-the-art methods. When trying to discover differentially expressed genes, multiple testing problems should be taken into account (1,2) and, since many microarray studies are really observational studies with human patients, it is often necessary to include additional clinical covariates to minimize confounding problems (3,4). In addition, we can also borrow information from all genes in the array when carrying out the test for each gene, using moderated statistics and Empirical Bayes approaches (5). When dealing with classification and prediction, it is crucial to avoid biases that lead to overoptimistic estimates of the error rates. These biases include ‘selection bias’ (6,7) and bias caused by selecting and reporting the error rate of the classifier (among a set of classifiers) with the smallest cross-validated error rate (8,9). Additionally, gene selection in the context of classification often yields many solutions with similar prediction errors, but which share few common genes (10–12); being unaware of the possible instability of our results can lead to a false sense of certainty that the given set is special and distinct. In addition to statistical rigor, a modern tool should incorporate the increasing availability of multicore processors and clusters built with off-the-shelf components, which are probably the major opportunities for significant performance gains in the near future (13,14). MPI (15) is one approach to parallelize computations over several CPUs and/or processor cores, thus decreasing execution time. Interestingly, web-based applications are well suited for this task; if deployed in a computing cluster, the parallelization, while transparent for the user, permits harvesting computational resources that are rarely available to individual researchers. 
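As an illustration of the multiple testing issue raised above, the following is a minimal Python sketch of one widely used correction, the Benjamini-Hochberg false discovery rate adjustment. The text does not state which procedure any given application uses, so the choice of method, the function name `benjamini_hochberg` and the example p-values are assumptions made for illustration only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest, so that the
    # adjusted values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values (illustrative numbers, not real data).
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(pvals)
for p, q in zip(pvals, adj):
    print(f"p = {p:<6} adjusted = {q:.4f}")
```

Genes would then be reported as differentially expressed when the adjusted value falls below the chosen false discovery rate, rather than when the raw p-value falls below a fixed threshold.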
To help in the interpretation of results (16,17), web-based tools are ideally suited to link to additional sources of information, such as PubMed references, Gene Ontology (GO) terms, the UCSC and Ensembl databases, and KEGG and Reactome pathways. Moreover, it is possible to carry out further analysis with this additional information, such as highlighting features (e.g. pathways, GO terms) that might be characteristic of a set of selected genes, or that are, say, very common among the genes that tend to be repeatedly selected as relevant for a classification problem. This usage of additional information can help us understand whether there are biological commonalities behind the possible multiple solutions (see above). Finally, the availability of source code under an open-source license allows other researchers to further improve the method, provide bug fixes and use the code for instruction and teaching; it also permits verification of claims by method developers, encourages reproducible research, and ensures that the international research community remains the owner of the tools it needs to carry out its work (18). These features facilitate fast methodological development based on previous work, and expedite the transfer of results to applied research. The value of the source code is further enhanced if best practices (19) as well as common open-source practices (including public code repositories and open bug tracking) are followed, ultimately allowing the building of a community of contributors (20).
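The idea of highlighting annotation features that are common among a set of selected genes can be sketched as a simple frequency count. This is only an illustration under stated assumptions: the function name `common_terms`, the threshold parameter `min_fraction`, and the gene and pathway labels are hypothetical, and the sketch is not the implementation used by any Asterias application.

```python
from collections import Counter


def common_terms(gene_list, annotations, min_fraction=0.5):
    """Return {term: fraction} for terms annotating at least
    min_fraction of the genes in gene_list."""
    genes = set(gene_list)
    if not genes:
        return {}
    counts = Counter()
    for gene in genes:
        # set() so a term listed twice for one gene is counted once.
        for term in set(annotations.get(gene, ())):
            counts[term] += 1
    n = len(genes)
    kept = {t: c / n for t, c in counts.items() if c / n >= min_fraction}
    # Most frequent terms first; ties broken alphabetically.
    return dict(sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])))


# Hypothetical annotation table: gene identifier -> list of terms.
annotations = {
    "g1": ["pathway_A", "pathway_B"],
    "g2": ["pathway_A"],
    "g3": ["pathway_A", "pathway_C"],
    "g4": ["pathway_B"],
}
selected = ["g1", "g2", "g3", "g4"]
shared = common_terms(selected, annotations, min_fraction=0.5)
print(shared)
```

A formal enrichment test against a background gene list would be a natural next step; the frequency threshold is used here only because it is the simplest way to show the idea.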

ASTERIAS: UNIQUE FEATURES

Some of the currently available web-based suites include RACE (26), MIDAW (27), Gepas (28) and CARMAweb (29). All of these, however, fail one or more of the above requirements. We have thus developed Asterias to fulfill those requirements. First, Asterias is the only web-based application we know of that was designed, from the beginning, to make extensive use of parallelization in its computations. The speed-up can be dramatic when run in a computing cluster (in our own installation of 30 dual-processor servers, some applications speed up by factors of 30× to 50×). Second, Asterias, as with some other suites, includes tools that cover the complete range of needs of many researchers (from normalization to aCGH analysis, including imputation, differential expression and class prediction), but Asterias is the only suite that includes tools for searching for large sets of predictive genes (GeneSrF), and gene selection, molecular signatures and prediction with survival data (SignS). Third, we provide statistically rigorous and state-of-the-art methods, from the well-known BioConductor limma package (5), in the study of differential expression, to the best available methods for aCGH analysis, as reported in recent reviews (30,31). Moreover, we facilitate the analysis of multiple solutions in class prediction and gene selection tools (e.g. frequency of genes in bootstrap and cross-validation runs, and similarity of solutions with regard to biological role via an analysis of additional information; see below). Fourth, the development of Asterias includes functional and regression testing of our applications, using publicly available and open-source tests; this is also a unique feature of Asterias. In addition, the newest release of Asterias includes two important additions.
We make (virtually) all of our code available under open-source licenses (GNU GPL and Affero GPL) and have an open-source development mode, including open bug tracking and full repository history available. Finally, as an important novelty with respect to our latest release, the user can analyze the results (e.g. the genes that have been selected as good prognosis classifiers) and examine PubMed references, GO terms, KEGG pathways or Reactome pathways for those genes using the new PaLS web server. PaLS, coupled with the examination of multiple solutions, can ease the biological interpretation of the results, especially in studies of gene selection and classification. Asterias shares some common history with the GEPAS suite (28), and one of the authors of Asterias (RD-U) was heavily involved in the development of GEPAS (32–34) and related tools (35–37). Nowadays, Asterias and GEPAS only share the tool DNMAD (although the R code in Asterias' DNMAD has changed to adapt it to the latest BioConductor releases) and a similar approach to web server load-balancing, via Pound or LVS, with everything else being different. A brief history of the split can be found at http://asterias.bioinfo.cnio.es/Asterias.Gepas.html. The main differences between Asterias and GEPAS are our strong commitment to parallel computing, differences in the type of applications being developed (e.g. SignS, ADaCGH, GeneSrF, PomeloII) and software development mode (all of our code is available under open-source licenses, including complete repositories and functional tests).
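The frequency of genes in bootstrap runs, mentioned above as a way to examine multiple solutions, can be illustrated with a short Python sketch. It deliberately swaps in a very simple selector (rank genes by the absolute difference of class means and keep the top k) in place of the methods actually used by GeneSrF or SignS; the function names, the stratified resampling scheme and the toy data are all assumptions made for this example.

```python
import random
from collections import Counter


def top_k_genes(data, labels, sample_idx, k):
    """Rank genes by |mean(class 1) - mean(class 0)| on the resample."""
    scores = {}
    for gene, values in data.items():
        g0 = [values[i] for i in sample_idx if labels[i] == 0]
        g1 = [values[i] for i in sample_idx if labels[i] == 1]
        scores[gene] = abs(sum(g1) / len(g1) - sum(g0) / len(g0))
    return sorted(scores, key=lambda g: (-scores[g], g))[:k]


def selection_frequency(data, labels, k=1, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each gene is selected."""
    rng = random.Random(seed)
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    counts = Counter()
    for _ in range(n_boot):
        # Stratified bootstrap: resample within each class so that
        # both classes are always present in the resample.
        sample = rng.choices(idx0, k=len(idx0)) + rng.choices(idx1, k=len(idx1))
        counts.update(top_k_genes(data, labels, sample, k))
    return {gene: counts[gene] / n_boot for gene in data}


# Toy data (illustrative only): four samples per class, three genes.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = {
    "g_signal": [0.1, 0.0, -0.1, 0.2, 5.0, 5.2, 4.9, 5.1],
    "g_noise1": [0.5, -0.5, 0.3, -0.2, 0.4, -0.6, 0.1, 0.0],
    "g_noise2": [-0.3, 0.8, -0.7, 0.2, 0.9, -0.9, 0.3, -0.1],
}
freq = selection_frequency(data, labels, k=1, n_boot=100, seed=1)
print(freq)
```

A gene that is selected in nearly every resample is a stable choice, whereas several genes with intermediate frequencies indicate that many gene sets perform similarly and that any single reported set should be interpreted with caution.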

FUNCTIONALITY, INPUT, OUTPUT

Figures 1 and 2 show the main functionality provided by each of the Asterias applications, the relationships between the tools and the main input and output of each application. All the analysis tools are accessible from preP, but can also be accessed directly, and preP can be accessed either directly or from DNMAD.

Figure 1. Asterias: functionality and data and information flow between sets of applications (see details in Figure 2). References for ADaCGH methods are: circular binary segmentation (21), wavelet-based smoothing (22), SW-ARRAY (23) and ACE (24). The method implemented in SignS is from (25).

Figure 2. Asterias: input/output and data and information flow between applications. Black and blue arrows involve files, green arrows URLs. Olive boxes denote graphical output.

Input to all applications consists of plain text files with tab-separated columns. Further details are provided in the online help of each application. Output of most applications includes both text-like output, with clickable links to IDClight (38) and PaLS, and graphical output. Some applications (e.g. IDconverter) can also provide tabular output in other formats (e.g. Microsoft Excel). Screenshots of output are provided in the Supplementary Data.
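To make the tab-separated input format concrete, the following Python sketch reads and validates such a file. The exact layout is an assumption for illustration (a header row of sample names, gene identifiers in the first column, `NA` for missing values); the online help of each application remains the authority on what that application accepts, and `read_expression_table` is a hypothetical name.

```python
import csv
import io


def read_expression_table(text):
    """Parse tab-separated text into (sample_names, {gene_id: [floats]})."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # drop blank lines
    if not rows:
        raise ValueError("empty input")
    header, body = rows[0], rows[1:]
    samples = header[1:]
    table = {}
    for line_no, row in enumerate(body, start=2):
        # Every data row must have as many columns as the header.
        if len(row) != len(header):
            raise ValueError(
                f"row {line_no}: expected {len(header)} columns, found {len(row)}"
            )
        gene_id, raw = row[0], row[1:]
        try:
            # Assumed convention: "NA" marks a missing value.
            values = [float("nan") if x == "NA" else float(x) for x in raw]
        except ValueError:
            raise ValueError(f"row {line_no}: non-numeric value") from None
        table[gene_id] = values
    return samples, table


# Hypothetical two-gene, two-sample file.
example = "ID\ts1\ts2\ngeneA\t1.5\t2.0\ngeneB\tNA\t-0.3\n"
samples, table = read_expression_table(example)
print(samples, table)
```

Rejecting ragged rows and non-numeric cells at upload time gives the user an immediate, specific error message instead of a failure deep inside the statistical code.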

IMPLEMENTATION

Most of the statistical functionality is written in R (39), with some code in C/C++ (Pomelo II and several pieces of dynamically loadable code in R packages), and extensive use of parallelization using MPI and R interfaces to MPI. The R code uses standard R or BioConductor packages (some of them modified to allow parallel computation) and our own packages (e.g. varSelRF, ADaCGH). Full details on the R and BioConductor packages used are provided in the help pages of each application. The web interfaces and input data validation are written in Python (with some legacy Perl and PHP in DNMAD and IDconverter). Clickable figures and tables are usually generated using R, with additional post-processing using Python. The database server for IDconverter, IDClight and PaLS is MySQL. Scripts for database management and generation are also written in Python. JavaScript is used in several applications, most notably in Pomelo II (AJAX), but also on clickable figures and collapsible trees. Booting and halting the LAM/MPI universes is accomplished by a combination of Python and shell scripts. We create a new LAM/MPI universe for each run of each application, and the actual nodes/CPUs that are used in a LAM/MPI universe are determined at run-time (thus excluding nodes that are down).
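The run-time selection of nodes described above can be sketched as follows. This is not the Asterias code: the function names, the TCP reachability check on port 22 and the host names are assumptions, and the output is written in the style of a LAM boot schema (one host per line with a `cpu=` count), which should be checked against the LAM/MPI documentation before use.

```python
import socket


def tcp_reachable(host, port=22, timeout=1.0):
    """Assumed health check: the node counts as 'up' if it accepts
    a TCP connection on the given port within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def build_boot_schema(nodes, is_up=tcp_reachable):
    """Return host-file text listing only the nodes that are up.

    nodes is a list of (hostname, cpu_count) pairs.
    """
    lines = [f"{host} cpu={cpus}" for host, cpus in nodes if is_up(host)]
    if not lines:
        raise RuntimeError("no compute nodes available")
    return "\n".join(lines) + "\n"


# Hypothetical cluster of dual-CPU nodes; a stub replaces the real
# network check so that the example runs anywhere.
cluster = [("node01", 2), ("node02", 2), ("node03", 2)]
down = {"node02"}
schema = build_boot_schema(cluster, is_up=lambda host: host not in down)
print(schema, end="")
```

Generating the host list immediately before each run, rather than keeping a static file, is what allows a run to proceed on a reduced set of nodes when some of them are down.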

Documentation, help, bug tracking

Online help, including tutorials, examples and sample files, is available for all applications. Pomelo II includes additional tutorials as flash movies. The online tutorials and examples are licensed under a Creative Commons license (http://www.creativecommons.org), allowing for redistribution and classroom use. The R packages have, additionally, help available in the standard R format. Bug tracking is available from the Bioinformatics.org project page http://bioinformatics.org/bugs/?group_id=630.

Availability

Our publicly accessible installation runs on a cluster with 30 dual-CPU nodes with Debian GNU/Linux. The web service is load-balanced (we are currently using Linux Virtual Server, but have used Pound in the past), which ensures balancing of the master nodes for MPI and of the non-parallelized applications (e.g. preP). All of the code (except, temporarily, for PaLS) is available under open-source licenses (either GNU GPL v.2 or Affero Public License). The complete repositories can be downloaded from Bioinformatics.org (http://bioinformatics.org/asterias) or Launchpad (https://launchpad.net/asterias). The R package varSelRF is also available from the R repositories.

Testing, maturity and number of accesses

Asterias includes a test suite that uses FunkLoad (http://funkload.nuxeo.org). The test suite tests the user interface, the handling of error conditions and incorrectly formatted files, and the numerical output. It can be run on demand and whenever new changes are introduced in the software, thus ensuring appropriate quality control and regression testing. The complete code is also available (see ‘Functional testing’ in the repositories). For Pomelo II (which makes extensive use of AJAX), additional tests using Selenium (http://www.openqa.org/selenium/) are available (http://pomelo2.bioinfo.cnio.es/tests.html); these tests verify that the application runs correctly under different operating systems and browsers. Asterias is a mature suite. Its oldest application, DNMAD (40), has been running since October 2003, and the newest one, PaLS, has been running since October 2006. The rest of the applications have been running for at least a year, often considerably longer. The number of data sets analyzed (note that these are counts of actual numbers of successfully uploaded files, not just hits) in the 10-month period between February 1, 2006 and November 30, 2006 ranges from 3700 and 2900 for preP and Pomelo II, respectively, to 530 and 340 for SignS and GeneSrF, respectively. The exceptions are IDconverter and IDClight, which have over 70 daily uses.
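The regression testing of numerical output mentioned above can be illustrated independently of any particular test framework. The sketch below does not use the FunkLoad or Selenium APIs; it only shows, with the Python standard library, the kind of tolerance-based comparison such a test might perform. The function name, the tolerances and the values are assumptions.

```python
import math


def compare_numeric_output(actual, expected, rel_tol=1e-6, abs_tol=1e-9):
    """Return a list of mismatch descriptions; empty means a match."""
    problems = []
    for key in sorted(set(actual) | set(expected)):
        if key not in actual:
            problems.append(f"{key}: missing from actual output")
        elif key not in expected:
            problems.append(f"{key}: not present in reference output")
        elif not math.isclose(actual[key], expected[key],
                              rel_tol=rel_tol, abs_tol=abs_tol):
            problems.append(
                f"{key}: expected {expected[key]!r}, got {actual[key]!r}"
            )
    return problems


# Hypothetical stored reference values and the output of a new run.
reference = {"geneA": 0.0123, "geneB": 0.5}
new_run = {"geneA": 0.0123000001, "geneB": 0.47}
problems = compare_numeric_output(new_run, reference)
for line in problems:
    print(line)
```

Comparing with a tolerance rather than exact equality avoids spurious failures from harmless floating-point differences between platforms or library versions, while still flagging genuine changes in results.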

Future work

Our main development effort is focused on making Asterias easy to install and deploy, from laptops to clusters of workstations. We are currently re-implementing all of Asterias using Pylons (http://pylonshq.com), a Python web framework, together with installation scripts that ease the configuration, management and monitoring of the computing nodes and parallel computing layers. We are also exploring other languages and paradigms, such as QHTML (41), built on top of Mozart/Oz, to solve the problem that ‘Building web-based applications requires the mastering of a number of languages/technologies (e.g. HTML, CSS, CGI, ASP, PHP, XML, etc.). Such languages and technologies were created to address different aspects on a by-need, evolutionary manner. The result is a plethora of tools that are fitted together in an ad hoc fashion’ (41). In both cases, our ultimate objective is to develop a general framework (or at least a large enough set of case examples) that will make it much simpler for any bioinformatician/biostatistician to take new ideas and developments from the primary methodological research and make them quickly available as web-based applications. These web-based applications should be capable of using advances in computing and hardware (multicore CPUs, computing clusters built with off-the-shelf components, parallel computing and concurrency) and web technologies (e.g. AJAX).

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
