
Model-based gene set analysis for Bioconductor.

Sebastian Bauer, Peter N Robinson, Julien Gagneur.

Abstract

Gene Ontology and other forms of gene-category analysis play a major role in the evaluation of high-throughput experiments in molecular biology. Single-category enrichment analysis procedures such as Fisher's exact test tend to flag large numbers of redundant categories as significant, which can complicate interpretation. We have recently developed an approach called model-based gene set analysis (MGSA) that substantially reduces the number of redundant categories returned by gene-category analysis. In this work, we present the Bioconductor package mgsa, which makes the MGSA algorithm available to users of the R language. Our package provides a simple and flexible application programming interface for applying the approach.

Availability: The mgsa package has been made available as part of Bioconductor 2.8. It is released under the conditions of the Artistic License 2.0.

Contact: peter.robinson@charite.de; julien.gagneur@embl.de.

Year: 2011 PMID: 21561920 PMCID: PMC3117381 DOI: 10.1093/bioinformatics/btr296

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Gene Ontology (GO) analysis and other forms of gene-set enrichment analysis have become a standard exploratory tool for understanding the results of large-scale genomics experiments and for generating new hypotheses (Robinson and Bauer, 2011). Most early approaches investigated GO terms one at a time, for example by testing each term for significant enrichment of responder genes using Fisher's exact test. In contrast, two recent methods, GenGO and model-based gene set analysis (MGSA), have been developed as global approaches that aim to find the combination of GO terms that best explains the observed biological response (Bauer et al., 2010; Lu et al., 2008). Such global, or ‘model-based’, approaches avoid problems connected with the statistical dependencies inherent in large ontologies such as the GO, in which gene annotations are propagated to ancestor terms, or in any collection of gene sets in which the categories share many annotated genes. MGSA analyzes all GO categories at once by modeling gene response as a function of the combination of active GO terms. It employs probabilistic inference via a Metropolis-Hastings algorithm to estimate the probability that each category is active. The MGSA approach naturally takes category overlap into account and avoids the multiple testing corrections required in single-category enrichment analysis. More details of the procedure can be found in the original publication, where we also demonstrated that MGSA substantially improves upon single-category statistical enrichment analysis methods and GenGO. Real-life applications have shown the utility of the method in identifying concise yet informative lists of categories (Bauer et al., 2010; Ott et al., 2011). In our original work, we integrated a first implementation of MGSA into the Ontologizer application, a tool for GO analysis that allows users to inspect the results in an interactive environment (Bauer et al., 2008). Here, we present an implementation of MGSA for users of Bioconductor (Gentleman et al., 2004).
The mgsa package wraps a fast C-based implementation of the MGSA algorithm into a flexible application programming interface (API) and utilizes OpenMP to take advantage of the multi-core processing units that modern computer hardware offers (Dagum and Menon, 1998).
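The contrast between single-category testing and a model-based analysis can be illustrated with a small base R sketch. It is not the mgsa implementation: the toy gene sets, the fixed parameter values (alpha, beta and p) and the single-toggle proposal are assumptions made for illustration, and the model is a simplified reading of the MGSA idea in which a gene is expected to respond if at least one active set contains it.

```r
set.seed(1)

# Toy data (hypothetical): 60 genes, three gene sets, 12 responder genes.
population <- paste0("g", 1:60)
sets <- list(
  A        = paste0("g", 1:12),   # the set that actually explains the response
  A_parent = paste0("g", 1:24),   # superset of A, mimicking an ancestor GO term
  B        = paste0("g", 30:40)   # unrelated set
)
observations <- paste0("g", 1:12)
observed <- population %in% observations

# Single-category analysis: one-sided Fisher's exact test for each set.
fisher_p <- sapply(sets, function(s) {
  in_set <- population %in% s
  tab <- table(factor(in_set, levels = c(TRUE, FALSE)),
               factor(observed, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")$p.value
})

# Simplified model: alpha and beta are assumed false-positive and
# false-negative rates, p is an assumed prior probability that a set is active.
alpha <- 0.05; beta <- 0.10; p <- 0.10
membership <- sapply(sets, function(s) as.numeric(population %in% s))  # genes x sets

log_posterior <- function(active) {
  # A gene is "hidden on" if at least one active set contains it.
  hidden <- as.vector(membership %*% active) > 0
  sum(log(ifelse(hidden,
                 ifelse(observed, 1 - beta, beta),
                 ifelse(observed, alpha, 1 - alpha)))) +
    sum(log(ifelse(active == 1, p, 1 - p)))
}

# Metropolis-Hastings: propose toggling one randomly chosen set per step.
n_steps <- 20000; burn_in <- 2000
active  <- rep(0, length(sets))
current <- log_posterior(active)
counts  <- rep(0, length(sets))
for (step in seq_len(n_steps)) {
  proposal <- active
  j <- sample(length(sets), 1)
  proposal[j] <- 1 - proposal[j]
  candidate <- log_posterior(proposal)
  if (log(runif(1)) < candidate - current) {
    active  <- proposal
    current <- candidate
  }
  if (step > burn_in) counts <- counts + active
}
marginals <- setNames(counts / (n_steps - burn_in), names(sets))

print(signif(fisher_p, 2))  # per-set p-values of the single-category tests
print(round(marginals, 3))  # estimated probability that each set is active
```

In this toy setting, Fisher's exact test flags both A and its superset A_parent, whereas the sampled marginals concentrate on A alone. This is the kind of redundancy reduction that the model-based approach is designed to achieve.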

2 AVAILABILITY AND USAGE

The mgsa package is part of Bioconductor 2.8 and can therefore be installed directly within the R environment together with all its dependencies. Refer to the Bioconductor Web page at http://www.bioconductor.org/ for installation procedures. Once the package is installed and loaded, the method can be accessed through the function mgsa. To invoke the function, one needs to specify the observations, a vector of gene identifiers corresponding to the study set (e.g. the set of differentially expressed genes), and the gene sets, a list of vectors of gene identifiers for each of the GO terms (or other gene sets or categories) to be analyzed. To simplify the usage of GO, the readGAF function takes as input a GAF (Gene Annotation Format) file, in which gene annotations are stored, and computes the gene sets of all GO categories including direct and indirect annotations. GAF files are available from the GO homepage and are updated regularly. The function uses the GO.db package to load the structure of the GO, so no external file is needed for the ontology itself. If goa.filename contains the location of a GAF file and observations is a vector of character strings identifying the genes of the study set, then an MGSA analysis requires only two calls: readGAF(goa.filename) builds the gene sets, and mgsa, called with observations and the object returned by readGAF, fits the model. A detailed tutorial is provided in the package vignette, which can be opened from within R with vignette("mgsa").
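The GO workflow described above can be sketched as follows. The wrapper name run_go_mgsa and its two checks are hypothetical conveniences added here so that the block can be sourced without the package or an annotation file; only readGAF and mgsa are functions of the package.

```r
# Hypothetical wrapper around the two calls described above.
run_go_mgsa <- function(goa.filename, observations) {
  if (!file.exists(goa.filename)) {
    stop("GAF file not found: ", goa.filename)
  }
  if (!requireNamespace("mgsa", quietly = TRUE)) {
    stop("The 'mgsa' package is not installed; see the Bioconductor installation instructions.")
  }
  # Build the gene sets of all GO categories from the annotation file.
  go <- mgsa::readGAF(goa.filename)
  # Fit the MGSA model to the study set and return the fitted object.
  mgsa::mgsa(observations, go)
}
```

A call would then look like fit <- run_go_mgsa(goa.filename, observations), with goa.filename and observations defined as in the text.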

3 APPLICATION

The mgsa package is not restricted to the GO but allows analysis with arbitrary gene sets. We illustrate this flexibility on a dataset that compares gene expression between two yeast strains differing by a single allele (PHO84; Gagneur et al., 2009). We ask which transcription factor(s) could together best explain the set of 84 transcripts that show differential expression. We stored these transcripts as a vector of gene identifiers, observations.

MacIsaac et al. (2006) have compiled a regulatory network for yeast by integrating in vivo transcription factor binding data from ChIP/chip with transcription factor motif analysis and sequence conservation. We defined as gene sets the sets of targets of each transcription factor of the network with intermediate cutoffs for binding intensities and conservation (MacIsaac et al., 2006). This network contains a total of 2514 targets across 116 transcription factors. We stored it as a named list of vectors of gene identifiers, sets. For instance, the first item of the list contains the vector of genes that are targets of the transcription factor ABF1 as predicted by MacIsaac et al. (2006).

Calling the mgsa function on observations and sets and plotting the fitted result displays the marginal posterior probabilities of the 10 most likely sets (Fig. 1). MGSA infers changes in activity for the PHO4 transcription factor (posterior = 0.9995 ± 2 × 10^-4). Allele variation in the transporter PHO84 affects cellular phosphate levels and regulation of the whole PHO pathway (Gagneur et al., 2009). These transcriptional changes are known to be mediated by the transcription factor PHO4, which is exactly the factor that MGSA identified.
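The data layout used in this application can be sketched with placeholder values. The gene identifiers and set memberships below are invented stand-ins, not the MacIsaac network or the 84 observed transcripts, and the model fit runs only if the mgsa package is installed. The column name estimate in the result table is assumed from the package documentation.

```r
# Placeholder gene sets (hypothetical): a named list of character vectors,
# one vector of target gene identifiers per transcription factor.
sets <- list(
  ABF1  = c("g01", "g02", "g03", "g04"),
  PHO4  = c("g05", "g06", "g07", "g08", "g09"),
  OTHER = c("g03", "g09", "g10", "g11", "g12")
)

# Placeholder study set (hypothetical): identifiers of the responder genes.
observations <- c("g05", "g06", "g07", "g08")

# Simple overlap count per set, useful as a sanity check before fitting.
overlap <- sapply(sets, function(s) sum(s %in% observations))
print(overlap)

if (requireNamespace("mgsa", quietly = TRUE)) {
  library(mgsa)
  fit <- mgsa(observations, sets)   # fit the model on arbitrary gene sets
  plot(fit)                         # marginal posterior of the most likely sets
  res <- setsResults(fit)           # per-set result table
  # Assumes the table has an 'estimate' column holding the posterior.
  print(head(res[order(res$estimate, decreasing = TRUE), ], 10))
}
```

Any collection of gene sets can be supplied in the same way, provided that it is a named list of identifier vectors drawn from the same identifier space as the study set.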

Fig. 1.

Transcription factor target set enrichment. The posterior probability is shown for the 10 transcription factors with the highest marginal probabilities. Categories whose posterior is above 0.5 are interpreted to be ‘active’ according to the MGSA model (Bauer et al., 2010).

4 CONCLUSION

The mgsa package gives users of Bioconductor programmatic access to MGSA. It can thus be incorporated into scripts and pipelines written in R and combined with the many other packages of the bioinformatics community. The package comes with a simple but flexible API, which allows researchers to use as a source of gene sets not only GO but also other categorization schemes, such as the KEGG pathways or the Broad Institute gene sets, which are readily available through other Bioconductor packages, for instance GSEABase (Morgan et al.).

Funding: Deutsche Forschungsgemeinschaft (DFG RO 2005/4-1). We thank the lab of Lars Steinmetz for financial support.

Conflict of Interest: none declared.

REFERENCES

1. Ontologizer 2.0--a multifunctional tool for GO term enrichment analysis and data exploration.

Authors: Sebastian Bauer; Steffen Grossmann; Martin Vingron; Peter N Robinson
Journal: Bioinformatics Date: 2008-05-29 Impact factor: 6.937

2. GOing Bayesian: model-based gene set analysis of genome-scale data.

Authors: Sebastian Bauer; Julien Gagneur; Peter N Robinson
Journal: Nucleic Acids Res Date: 2010-02-19 Impact factor: 16.971

3. Bioconductor: open software development for computational biology and bioinformatics.

Authors: Robert C Gentleman; Vincent J Carey; Douglas M Bates; Ben Bolstad; Marcel Dettling; Sandrine Dudoit; Byron Ellis; Laurent Gautier; Yongchao Ge; Jeff Gentry; Kurt Hornik; Torsten Hothorn; Wolfgang Huber; Stefano Iacus; Rafael Irizarry; Friedrich Leisch; Cheng Li; Martin Maechler; Anthony J Rossini; Gunther Sawitzki; Colin Smith; Gordon Smyth; Luke Tierney; Jean Y H Yang; Jianhua Zhang
Journal: Genome Biol Date: 2004-09-15 Impact factor: 13.583

4. MicroRNAs differentially expressed in postnatal aortic development downregulate elastin via 3' UTR and coding-sequence binding sites.

Authors: Claus Eric Ott; Johannes Grünhagen; Marten Jäger; Daniel Horbelt; Simon Schwill; Klaus Kallenbach; Gao Guo; Thomas Manke; Petra Knaus; Stefan Mundlos; Peter N Robinson
Journal: PLoS One Date: 2011-01-31 Impact factor: 3.240

5. A probabilistic generative model for GO enrichment analysis.

Authors: Yong Lu; Roni Rosenfeld; Itamar Simon; Gerard J Nau; Ziv Bar-Joseph
Journal: Nucleic Acids Res Date: 2008-08-01 Impact factor: 16.971

6. An improved map of conserved regulatory sites for Saccharomyces cerevisiae.

Authors: Kenzie D MacIsaac; Ting Wang; D Benjamin Gordon; David K Gifford; Gary D Stormo; Ernest Fraenkel
Journal: BMC Bioinformatics Date: 2006-03-07 Impact factor: 3.169

7. Genome-wide allele- and strand-specific expression profiling.

Authors: Julien Gagneur; Himanshu Sinha; Fabiana Perocchi; Richard Bourgon; Wolfgang Huber; Lars M Steinmetz
Journal: Mol Syst Biol Date: 2009-06-16 Impact factor: 11.429
