
Babelomics: an integrative platform for the analysis of transcriptomics, proteomics and genomic data with advanced functional profiling.

Ignacio Medina, José Carbonell, Luis Pulido, Sara C Madeira, Stefan Goetz, Ana Conesa, Joaquín Tárraga, Alberto Pascual-Montano, Ruben Nogales-Cadenas, Javier Santoyo, Francisco García, Martina Marbà, David Montaner, Joaquín Dopazo.

Abstract

Babelomics is a response to the growing need to integrate and analyze different types of genomic data in an environment that allows easy functional interpretation of the results. Babelomics includes a complete suite of methods for the analysis of gene expression data, including normalization (covering most commercial platforms), pre-processing, differential gene expression (case-control, multiclass, survival or continuous values), predictors and clustering, as well as methods for large-scale genotyping assays (case-control studies and TDTs, with population stratification analysis and correction). All these genomic data analysis facilities are integrated and connected to multiple options for the functional interpretation of the experiments. Different methods of functional enrichment or gene set enrichment can be used to understand the functional basis of the experiment analyzed. Many sources of biological information can be used for this purpose, including functional (GO, KEGG, Biocarta, Reactome, etc.), regulatory (Transfac, Jaspar, ORegAnno, miRNAs, etc.), text-mining and protein-protein interaction modules. Finally, a tool for the de novo functional annotation of sequences has been included in the system, which provides support for the functional analysis of non-model species. Mirrors of Babelomics and command line execution of its individual components are now possible. Babelomics is available at http://www.babelomics.org.


Year: 2010 PMID: 20478823 PMCID: PMC2896184 DOI: 10.1093/nar/gkq388

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

High-throughput technologies such as transcriptomics (microarrays), proteomics, large-scale genotyping [genome-wide association studies (GWAS)] and next generation sequencing produce huge amounts of data that cannot feasibly be interpreted without automatic procedures for functional profiling (1). The idea behind this new version of Babelomics is to integrate primary (normalization, calls, etc.) and secondary [signatures, predictors, associations, transmission/disequilibrium tests (TDTs), clustering, etc.] analysis tools within a multi-purpose platform that allows relating some of these genomic data and/or interpreting them by means of different functional enrichment or gene set methods. Such interpretation draws not only on functional definitions [GO (2), KEGG (3), Biocarta, Interpro (4), Reactome (5)] but also on regulatory information [Transfac (6), Jaspar (7), ORegAnno (8)], protein–protein interactions (9), text-mining module definitions (10) and de novo annotations produced through the Blast2GO system (11). Babelomics (12,13) and the Gene Expression Pattern Analysis Suite (GEPAS) (14,15) have been running uninterruptedly for more than 8 years. Currently, Babelomics has an average of more than 200 experiments analyzed per day (http://bioinfo.cipf.es/webstats/babelomics/awstats.babelomics.bioinfo.cipf.es.html), distributed among many different countries (http://bioinfo.cipf.es/toolsusage). Since their last release, a total of 65 280 anonymous users and 4521 registered users have used these tools. In terms of technology, Babelomics has been re-engineered, sped up and transformed into web services. Babelomics has a new interface that allows the definition of persistent sessions and asynchronous use (a job can be left running and its results retrieved later) through a queue system. Moreover, the complete program can be installed locally, and its modules can now be invoked independently as command line programs and integrated into analysis pipelines.

STRUCTURE OF THE PROGRAM

The program is organized into different sections, which are described below. Two of them are related to data input and data management; the rest offer different options for data analysis. Novelties with respect to previous versions are indicated in the corresponding sections.

Upload data

This section is new. The upload process has been separated from the tools. Different parsers check the integrity and format of the data. The data accepted are: expression microarrays (one-channel Affymetrix and Agilent, and two-channel Agilent, GenePix and generic), array-CGH (Agilent), generic data matrices of expression, array-CGH and SNPs, lists of identifiers (gene, transcript, protein, SNP, functional terms, ranks), lists of annotations (gene annotation, extended annotations) and some other data such as dendrogram descriptions (in Newick format; http://evolution.genetics.washington.edu/phylip/newicktree.html), Blast results, protein–protein interaction data and genotype data (in standard PED and MAP formats; http://pngu.mgh.harvard.edu/∼purcell/plink/data.shtml). The program documentation contains details on the formats used. To ease the process, data can be uploaded either according to their format, as described above, or according to the tool to be used.
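As an illustration of the kind of integrity check a genotype parser can perform, the following minimal sketch reads MAP and PED text and verifies that every PED row carries exactly two alleles per marker. It is not the Babelomics parser: the function names `parse_map` and `parse_ped` are hypothetical, and the sketch assumes the usual whitespace-delimited PLINK layout (four MAP columns; six leading PED columns followed by allele pairs).

```python
from typing import Dict, List


def parse_map(text: str) -> List[str]:
    """Return the SNP identifiers listed in MAP-format text.

    Assumes the 4-column layout: chromosome, SNP id, genetic distance, position.
    """
    snp_ids = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(
                f"MAP line {line_no}: expected 4 columns, got {len(fields)}"
            )
        snp_ids.append(fields[1])
    return snp_ids


def parse_ped(text: str, n_snps: int) -> List[Dict]:
    """Parse PED-format text and check each row against the marker count.

    Assumes 6 leading columns (family, individual, father, mother, sex,
    phenotype) followed by two allele columns per SNP.
    """
    expected = 6 + 2 * n_snps
    samples = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != expected:
            raise ValueError(
                f"PED line {line_no}: expected {expected} columns, got {len(fields)}"
            )
        alleles = fields[6:]
        samples.append({
            "family": fields[0],
            "individual": fields[1],
            "father": fields[2],
            "mother": fields[3],
            "sex": fields[4],
            "phenotype": fields[5],
            # Pair consecutive allele columns into one genotype per SNP.
            "genotypes": [
                (alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)
            ],
        })
    return samples
```

A row with a missing allele column is rejected with a `ValueError` that names the offending line, which is the behavior one would want before any association test is run.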

Preprocessing

This section allows preprocessing of the data loaded in the previous section. Almost all the options (except normalization of Affymetrix and two-color Agilent and GenePix arrays) are new. All the array types loaded in the previous section can be normalized by different methods. The limma (16) and affy packages from Bioconductor (17) are implemented. This section also includes an editor that allows easy addition and/or modification of labels and tags, which will later be used as class or category labels for gene selection, prediction, etc. Other preprocessing facilities for normalized data are available, such as log-transformations, replicate merging, missing value imputation, etc. A converter of identifiers is available with more than 80 cross-referenced identifiers for genes, proteins, transcripts, microarray probesets, pathways, functional annotations and regulatory regions. Finally, a facility for obtaining the existing annotations for lists of gene identifiers is available. All the annotations used in Babelomics (functional, regulatory, etc.) can be used for this purpose.
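The preprocessing steps named above (log-transformation, replicate merging and missing value imputation) can be sketched for a single expression row as follows. This is a minimal illustration under stated assumptions, not the Babelomics implementation: the function names are hypothetical, missing values are represented as `None`, replicates are merged by their mean, and imputation uses the row mean, which is only one of several possible strategies.

```python
import math
from statistics import mean
from typing import Dict, List, Optional

Row = List[Optional[float]]  # one gene across samples; None marks a missing value


def log2_transform(row: Row) -> Row:
    """Apply log2 to every observed value; missing values stay missing."""
    return [None if v is None else math.log2(v) for v in row]


def merge_replicates(row: Row, labels: List[str]) -> Dict[str, Optional[float]]:
    """Average the columns that share a label, ignoring missing values."""
    groups: Dict[str, List[float]] = {}
    for value, label in zip(row, labels):
        groups.setdefault(label, [])
        if value is not None:
            groups[label].append(value)
    # A label whose replicates are all missing remains missing.
    return {label: (mean(vals) if vals else None) for label, vals in groups.items()}


def impute_row_mean(row: Row) -> Row:
    """Replace missing values with the mean of the observed values in the row."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing to impute from
    fill = mean(observed)
    return [fill if v is None else v for v in row]
```

Note that `math.log2` raises `ValueError` for zero or negative intensities, so background-corrected data may need filtering or an offset before the transformation.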

Expression

This section corresponds to the GEPAS (14,15) functionality. It includes tests for differential expression [two- or multi-class comparison, survival analysis, correlation to continuous parameters or time/dosage series analysis (18)]; methods for class prediction (19) such as SVM (support vector machines), KNN (k-nearest neighbors), Random Forest and Naive Bayes, with different feature selection methods [differential expression, genetic algorithms, principal component analysis (PCA)]; and clustering methods for both samples and genes that implement the algorithms UPGMA (unweighted pair group method with arithmetic mean), SOTA (self organizing tree algorithm) (20), K-means and SOM (self organizing maps). Users of GEPAS will find several novelties here. For differential gene expression, these are the ‘limma’ methods for one-, two- and multi-class comparison and new multiple testing correction methods [Benjamini–Hochberg (22), Benjamini–Yekutieli (23), Bonferroni, Holm (24) and Hochberg (25)]. The predictor module adds the random forest (26) and Naive Bayes methods, different algorithms for feature selection (see above), new parameter tuning options and new representations of the results with receiver operating characteristic curves and new metrics (area under the curve, root mean squared error, Matthews correlation coefficient and accuracy). Biclustering (21) has also been added as a new clustering method.
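To make the multiple testing corrections listed above concrete, the sketch below computes Bonferroni and Benjamini–Hochberg adjusted P-values from a list of raw P-values. It follows the textbook definitions and is not taken from the Babelomics code; the function names are hypothetical.

```python
from typing import List


def bonferroni(pvalues: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values (false discovery rate control).

    The P-value with ascending rank r is scaled by m / r, and monotonicity is
    enforced by taking a running minimum from the largest rank downwards.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

The adjusted values are returned in the input order, so they can be attached directly to the corresponding genes. Bonferroni controls the family-wise error rate and is the more conservative of the two.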

Genomics

This section is completely new. It implements the analysis of SNP-based genotyping data (27). The module can deal with GWAS case–control studies and carries out chi-square, Fisher and linear or logistic model tests. For trios, the program can carry out the TDT. Functionality is taken from the PLINK program (28). Again, the results of the tests can be analyzed with the functional analysis module (see below), and the novel pathway-based analysis (PBA) strategies can be applied (29). Population stratification can be analyzed by identity-by-state (IBS) methods, as described in (28) and in the PLINK documentation (http://pngu.mgh.harvard.edu/∼purcell/plink/strat.shtml). This is a simple but potentially powerful approach to population stratification that can use the whole-genome SNP data.
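For orientation, the two simplest statistics mentioned above can be written out for a single biallelic SNP. The sketch below uses the textbook forms of the allelic chi-square test on a 2x2 table and of the TDT (a McNemar-type statistic on transmissions from heterozygous parents), both with one degree of freedom. Babelomics delegates these tests to PLINK, so this is only an illustrative sketch with hypothetical function names, not the code that is actually run.

```python
import math
from typing import Tuple


def chi2_sf_1df(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def allelic_chi_square(case_a1: int, case_a2: int,
                       ctrl_a1: int, ctrl_a2: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) on allele counts.

    Rows are cases and controls; columns are the two alleles.
    """
    n = case_a1 + case_a2 + ctrl_a1 + ctrl_a2
    denom = ((case_a1 + case_a2) * (ctrl_a1 + ctrl_a2)
             * (case_a1 + ctrl_a1) * (case_a2 + ctrl_a2))
    if denom == 0:
        raise ValueError("a row or column of the 2x2 table is empty")
    stat = n * (case_a1 * ctrl_a2 - case_a2 * ctrl_a1) ** 2 / denom
    return stat, chi2_sf_1df(stat)


def tdt(transmitted: int, untransmitted: int) -> Tuple[float, float]:
    """TDT statistic from heterozygous-parent transmissions of one allele."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative transmissions")
    stat = (transmitted - untransmitted) ** 2 / total
    return stat, chi2_sf_1df(stat)
```

For example, 60 transmissions against 40 non-transmissions give a TDT statistic of 4.0, which corresponds to a P-value of roughly 0.046.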

Functional analysis

This module inherits the functionality of the previous version of Babelomics (12,13), although many novelties have been included. Apart from functional enrichment methods such as the popular FatiGO (30), and gene set analysis methods such as the segmentation test (31) or the logistic regression model (32) (a new addition), other testing strategies have been added. Thus, the Genecodis method (33), which finds concurrent annotations in ranked lists of genes, is one of the new methods included. It is also possible to carry out PBA in GWAS experiments by means of the novel module Gesbap (29). Gene modules defined using text-mining-derived functional annotations related to medical terms and chemical compounds can also be used for gene-set analysis (GSA) (10). Another module uses gene expression data already available in databases to define tissue-specific or phenotype-specific gene expression profiles. These can be used to check the similarity of a particular experiment to the standard profiles of healthy or diseased tissues (34). Finally, as another novelty, an additional module (9) makes it possible to find significant subnetworks of protein–protein interactions associated with the genomic experiment analyzed. Many gene modules are available to produce a functional interpretation of the results, including functional definitions [GO (2), GOSlim, KEGG (3), Biocarta (http://www.biocarta.com/), Interpro (4), Reactome (5)], regulatory information [Transfac (6), Jaspar (7), ORegAnno (8), miRNA target genes from miRBase (35)], protein–protein interactions (9) and text-mining module definitions (10). Additionally, users can define their own gene modules by uploading them (see upload section) or by using the annotation tools available in this version of Babelomics (see Blast2GO below). Jaspar, ORegAnno, Reactome and protein–protein interactions are new modules in this release. Different filters can be used to test a sub-selection of gene modules, thereby excluding superfluous tests that would only reduce the statistical power of the method used. Finally, the popular Blast2GO (11) module can be used to produce annotations of genes of non-model organisms. Such annotations can be stored and further used to analyze genomic experiments.
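The idea behind functional enrichment can be illustrated with a one-sided hypergeometric test, which is equivalent to a one-sided Fisher exact test of a gene list against the rest of a background set. The text above does not specify the test used by each method, so this is a generic sketch under that assumption; the function name and parameter names are hypothetical.

```python
from math import comb


def enrichment_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k annotated genes in a list of n genes.

    k: genes in the list that carry the annotation (e.g. a GO term)
    n: size of the gene list
    K: genes in the background that carry the annotation
    N: size of the background (the list is assumed to be a subset of it)
    """
    if not (0 <= n <= N and 0 <= K <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # number of ways to draw the list from the background
    # Sum the hypergeometric probabilities for k, k+1, ..., min(n, K).
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total
```

Each gene module tested yields one such P-value, so the results need a multiple testing correction; this is also why filtering out superfluous modules beforehand helps preserve statistical power.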

Utilities

Several utilities are available to facilitate the annotation of genes or to produce different visualizations of the data. Thus, the new modules for annotation and identifier conversion already described in the preprocessing section can also be found here. Several new utilities to produce graphical representations of the results (histograms and boxplots, cluster representations, PCA viewers and the GO hierarchy viewer) are also available. Supplementary Figure 1 shows some of these graphics.

Technical details

Babelomics is designed as a web application, so it can be used from any operating system: Windows, GNU/Linux and MacOS. Babelomics has been tested in many browsers, including Firefox 3.x, Safari 4.x, Chrome 2.x, Opera 9.x, IE 7.x and IE 8.x. The code has been almost entirely rewritten in Java (except some routines for the normalization of microarrays, which are in R), which resulted in a speed-up of almost 30x with respect to previous versions. All the modules are web services and can be used from the command line. A convenient queue system has been implemented. Babelomics runs on a high-end cluster with 10 dedicated Intel XEON Quad-Core CPUs at 2.0 GHz (a total of 40 cores) and a large amount of RAM (60 GB in total).

CONCLUSIONS

Today’s Babelomics is a long-term project that started in 2001 with the publication of the clustering method SOTA (20) for microarray data analysis, followed by the popular FatiGO (30) for functional enrichment analysis, both of which are now constituent parts of Babelomics. Later, different methods (both functional enrichment and gene set enrichment) along with a number of gene module definitions were assembled as the prototype of the previous Babelomics (12,13), while different methods for gene expression data analysis (differential expression, predictors, clustering) gave rise in parallel to the GEPAS (14,15) project. Both packages have become popular in their respective areas: Babelomics is the third most cited web-based tool for functional analysis (Supplementary Table 1) and GEPAS the most cited web tool for microarray data analysis (Supplementary Table 2). This new version of Babelomics embeds the gene expression data analysis functionality of GEPAS within the functional profiling framework of the previous Babelomics and includes new modules for the analysis of genomic data (genotyping and genomic copy number alterations). Many new methods and new module definitions of different nature (functional, regulatory, phenotypic, etc.) have been included in this version. The Babelomics project aims to provide the scientific community with an advanced set of methods for the integrated analysis of genomic data within the context of functional profiling analysis, without sacrificing user-friendly and intuitive use. As the Functional Genomics node of the Spanish Institute of Bioinformatics (INB; http://www.inab.org) and as part of the Spanish Network of Cancer (RTICC; http://www.rticcc.org) and the Network of Centres for Research in Rare Diseases (CIBERER; http://www.ciberer.es), we have direct contact with researchers, who have provided much of the feedback necessary to make Babelomics a useful tool. Although there are many tools for the functional profiling of high-throughput experiments (Supplementary Tables 1 and 2), Babelomics is widely used and offers a combination of features and a degree of integration that make it unique among the resources available.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Funding for open access charge: Spanish Ministry of Science and Innovation (MICINN) (projects BIO2008-04212 and CEN-2008-1002); Red Temática de Investigación Cooperativa en Cancer (RTICC, partial) (RD06/0020/1019); Instituto de Salud Carlos III (MICINN).

Conflict of interest statement. None declared.

29 in total

1. Biclustering algorithms for biological data analysis: a survey.

Authors: Sara C Madeira; Arlindo L Oliveira
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2004 Jan-Mar Impact factor: 3.710

2. maSigPro: a method to identify significantly differential expression profiles in time-course microarray experiments.

Authors: Ana Conesa; María José Nueda; Alberto Ferrer; Manuel Talón
Journal: Bioinformatics Date: 2006-02-15 Impact factor: 6.937

3. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research.

Authors: Ana Conesa; Stefan Götz; Juan Miguel García-Gómez; Javier Terol; Manuel Talón; Montserrat Robles
Journal: Bioinformatics Date: 2005-08-04 Impact factor: 6.937

4. miRBase: microRNA sequences, targets and gene nomenclature.

Authors: Sam Griffiths-Jones; Russell J Grocock; Stijn van Dongen; Alex Bateman; Anton J Enright
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

5. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

Authors: V Matys; O V Kel-Margoulis; E Fricke; I Liebich; S Land; A Barre-Dirrie; I Reuter; D Chekmenev; M Krull; K Hornischer; N Voss; P Stegmaier; B Lewicki-Potapov; H Saxel; A E Kel; E Wingender
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

6. New developments in the InterPro database.

Authors: Nicola J Mulder; Rolf Apweiler; Teresa K Attwood; Amos Bairoch; Alex Bateman; David Binns; Peer Bork; Virginie Buillard; Lorenzo Cerutti; Richard Copley; Emmanuel Courcelle; Ujjwal Das; Louise Daugherty; Mark Dibley; Robert Finn; Wolfgang Fleischmann; Julian Gough; Daniel Haft; Nicolas Hulo; Sarah Hunter; Daniel Kahn; Alexander Kanapin; Anish Kejariwal; Alberto Labarga; Petra S Langendijk-Genevaux; David Lonsdale; Rodrigo Lopez; Ivica Letunic; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Jaina Mistry; Alex Mitchell; Anastasia N Nikolskaya; Sandra Orchard; Christine Orengo; Robert Petryszak; Jeremy D Selengut; Christian J A Sigrist; Paul D Thomas; Franck Valentin; Derek Wilson; Cathy H Wu; Corin Yeats
Journal: Nucleic Acids Res Date: 2007-01 Impact factor: 16.971

7. Next station in microarray data analysis: GEPAS.

Authors: David Montaner; Joaquín Tárraga; Jaime Huerta-Cepas; Jordi Burguet; Juan M Vaquerizas; Lucía Conde; Pablo Minguez; Javier Vera; Sach Mukherjee; Joan Valls; Miguel A G Pujana; Eva Alloza; Javier Herrero; Fátima Al-Shahrour; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2006-07-01 Impact factor: 16.971

8. BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments.

Authors: Fátima Al-Shahrour; Pablo Minguez; Juan M Vaquerizas; Lucía Conde; Joaquín Dopazo
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

9. Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies.

Authors: Ignacio Medina; David Montaner; Nuria Bonifaci; Miguel Angel Pujana; José Carbonell; Joaquin Tarraga; Fatima Al-Shahrour; Joaquin Dopazo
Journal: Nucleic Acids Res Date: 2009-06-05 Impact factor: 16.971

10. SNOW, a web-based tool for the statistical analysis of protein-protein interaction networks.

Authors: Pablo Minguez; Stefan Götz; David Montaner; Fatima Al-Shahrour; Joaquin Dopazo
Journal: Nucleic Acids Res Date: 2009-05-19 Impact factor: 16.971
