ESS++: a C++ objected-oriented algorithm for Bayesian stochastic search model exploration.

Leonardo Bottolo, Marc Chadeau-Hyam, David I Hastie, Sarah R Langley, Enrico Petretto, Laurence Tiret, David Tregouet, Sylvia Richardson.

Abstract

SUMMARY: ESS++ is a C++ implementation of a fully Bayesian variable selection approach for single and multiple response linear regression. ESS++ works well both when the number of observations is larger than the number of predictors and in the 'large p, small n' case. In the current version, ESS++ can handle several hundred observations, thousands of predictors and a few responses simultaneously. The core engine of ESS++ for the selection of relevant predictors is based on Evolutionary Monte Carlo. Our implementation is open source, allowing community-based alterations and improvements.

AVAILABILITY: C++ source code and documentation including compilation instructions are available under GNU licence at http://bgx.org.uk/software/ESS.html.

Year:  2011        PMID: 21233165      PMCID: PMC3035799          DOI: 10.1093/bioinformatics/btq684

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

In recent years, the biological sciences have taken full advantage of relatively inexpensive high-throughput technologies. New experiments at a systemic level have been conceived to dissect the role of genetic and environmental factors in the development of common diseases or to identify risk factors for complex phenotypes (Heinig et al., 2010). The dimensions and diversity of available genetic, genomic and other 'omics datasets pose new theoretical and computational problems requiring multi-level data integration and efficient statistical analysis tools. When the aim is to predict the variation of pathophysiological or complex phenotypes, regression models are widely used. In this setting, Bayesian variable selection (BVS) allows the construction of parsimonious regression models for high-dimensional datasets, adopting prior specifications that encode the expected sparsity of the underlying biology and facilitate the interpretation of the results. Moreover, in problems where no single model stands out, model uncertainty is taken into account by reporting competing models with their posterior evidence. ESS++ is a C++ implementation of a fully Bayesian variable selection approach for linear regression that can analyse single and multiple responses in an integrated way (Bottolo and Richardson, 2010; Petretto et al., 2010). Whereas other approaches (Servin and Stephens, 2007) consider one predictor at a time, ESS++ performs an efficient search for combinations of covariates that predict the variation of single and multiple responses. Like Shotgun Stochastic Search (Hans et al., 2007), ESS++ is also designed to work under the “large p, small n” paradigm, i.e. when the number of predictors p is large with respect to the number of observations n, thus making fully Bayesian analysis feasible in genetics/genomics experiments. When the number of predictors is large, the multimodality of the model space is a known issue in variable selection.
ESS++ explores the 2^p-dimensional model space using an extension of parallel tempering called Evolutionary Monte Carlo, which combines Markov chain Monte Carlo (MCMC) and genetic algorithms. Specifically, ESS++ runs multiple tempered chains in parallel that exchange information about the sets of covariates selected in the regression models. Since chains with higher temperatures flatten the posterior density, global moves (between chains) allow the algorithm to jump from one local mode to another. Local moves (within chains) permit the fine exploration of alternative models, resulting in a combined algorithm in which the chains mix efficiently and do not become trapped in local modes.
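
The exchange of information between tempered chains can be illustrated with a minimal Python sketch of a single global move. This is not the ESS++ source: it shows only the standard parallel-tempering swap between two adjacent chains (the evolutionary crossover moves are omitted), and the function names, the representation of a model as a set of covariate indices and the convention that chain k targets the posterior raised to the power 1/temperature are all assumptions made for illustration.

```python
import math
import random


def swap_acceptance(log_post_i, log_post_j, temp_i, temp_j):
    """Metropolis acceptance probability for exchanging the models of two chains.

    Assumes chain k targets posterior ** (1 / temp_k); log_post_* are the
    untempered log-posterior values of the models currently held by each chain.
    """
    log_ratio = (log_post_j - log_post_i) * (1.0 / temp_i - 1.0 / temp_j)
    return 1.0 if log_ratio >= 0.0 else math.exp(log_ratio)


def exchange_move(models, log_posts, temps, rng):
    """Propose a swap between a randomly chosen pair of adjacent chains.

    models    : list of frozensets of covariate indices, one per chain
    log_posts : untempered log-posterior of each chain's current model
    temps     : temperatures, with temps[0] == 1 for the untempered chain
    Returns True if the swap was accepted (lists are modified in place).
    """
    i = rng.randrange(len(models) - 1)
    j = i + 1
    if rng.random() < swap_acceptance(log_posts[i], log_posts[j], temps[i], temps[j]):
        models[i], models[j] = models[j], models[i]
        log_posts[i], log_posts[j] = log_posts[j], log_posts[i]
        return True
    return False
```

When a hotter chain holds a model with higher posterior density than the colder chain next to it, the swap is always accepted, which is how a mode discovered on the flattened surface can reach the untempered chain.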

2 EXAMPLES OF ESS++ APPLICATION

In this section, we present the results of applying ESS++ to investigate genetic regulation. To discover the genetic causes of variation in the expression (i.e. transcription) of genes, gene expression data are treated as a quantitative phenotype while genotype data (SNPs) are used as predictors, a type of analysis known as expression Quantitative Trait Loci (eQTL) analysis. In this context, it is important to distinguish cis-eQTLs, where the genetic control points (SNPs) are located close to the location of the transcribed gene, from trans-acting eQTLs, which lie on a different chromosome. Here, we use a larger dataset (Heinig et al., 2010) to reanalyse three genes (Cd36, Ascl3 and Hopx) that were presented in Petretto et al. (2010). In particular, for each gene we investigate the ability of ESS++ to find a parsimonious set of predictors (polygenic regulation) that explains the joint variability of gene expression in seven tissues (adrenal gland, aorta, fat, heart, kidney, liver, skeletal muscle) using 1304 SNPs and 29 observations, taken from the rat inbred lines that were studied. We ran ESS++ for 2.2M sweeps with 200K as burn-in using four chains. The prior requires two main user-defined parameters: the a priori expected model size and the SD of the model size. We set these to E(p_γ) = 2 and sd(p_γ) = 2, respectively, meaning the prior model size is likely to range from 0 to 8. For each gene, Figure 1 shows the marginal posterior probability of inclusion (MPPI), a measure of the marginal contribution of each predictor. For the first gene, Cd36 (Figure 1a), ESS++ confirms shared genetic effects due to a single cis-eQTL (SNP J664145) and the in silico prediction of its systemic effect in all tissues (Aitman et al., 1999).
For the second gene, Ascl3 (Figure 1b), ESS++ also finds a single cis-acting genetic control point (SNP J697407) for the variation of the gene expression in all seven tissues, highlighting the fact that the second trans-acting locus found in Petretto et al. (2010) was specific to the four tissues considered there (adrenal gland, fat, heart, kidney). The landscape of the MPPI is much more complicated for the last gene, Hopx (Figure 1c), although the locus with the highest MPPI (SNP WKY-G-j-20h03) is the one identified in Petretto et al. (2010).
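
To make the prior settings above concrete, the following Python sketch converts an expected model size and SD into the two hyperparameters of a beta-binomial prior on the number of included predictors. The beta-binomial form is an assumption made here for illustration (the text only states the mean and SD), and the function names are hypothetical rather than part of ESS++.

```python
import math


def beta_binomial_hyperparameters(p, mean_size, sd_size):
    """Solve for (a, b) so that a beta-binomial(p, a, b) has the given mean and SD."""
    pi = mean_size / p                    # implied prior inclusion probability
    binom_var = p * pi * (1.0 - pi)       # variance if pi were fixed (binomial)
    ratio = sd_size ** 2 / binom_var      # overdispersion relative to the binomial
    if not 1.0 < ratio < p:
        raise ValueError("SD not attainable with a beta-binomial prior")
    total = (p - ratio) / (ratio - 1.0)   # a + b, from var = binom_var * (a+b+p)/(a+b+1)
    return pi * total, (1.0 - pi) * total


def log_beta(x, y):
    """Logarithm of the beta function."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def model_size_pmf(k, p, a, b):
    """Prior probability that exactly k of the p predictors are included."""
    log_choose = math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
    return math.exp(log_choose + log_beta(k + a, p - k + b) - log_beta(a, b))


if __name__ == "__main__":
    # Settings from the example: 1304 SNPs, E(p_gamma) = 2, sd(p_gamma) = 2.
    a, b = beta_binomial_hyperparameters(1304, 2.0, 2.0)
    mass_0_to_8 = sum(model_size_pmf(k, 1304, a, b) for k in range(9))
    print(f"a = {a:.3f}, b = {b:.3f}, P(0 <= size <= 8) = {mass_0_to_8:.3f}")
```

Under this assumed prior, summing the probabilities of sizes 0 to 8 gives a direct check of how much prior mass falls in the range quoted in the text.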
Fig. 1. Marginal posterior probability of inclusion (MPPI) obtained running ESS++ on a multiple tissues mapping experiment for three different genes: for each gene, the set of SNPs associated with high MPPI (>0.50) are highlighted, showing monogenic control for (a) Cd36 gene (SNP J664145) and (b) Ascl3 gene (SNP J697407), with evidence for polygenic control for (c) Hopx gene (SNP WKY-G-j-20h03 and SNP J590621).

One of the distinctive features of ESS++ is the possibility to inspect the best models visited during the MCMC run. For instance, for the Hopx gene the 10 best visited models are all polygenic, SNP WKY-G-j-20h03 is included in all of them, and altogether they account for about 15% of the posterior mass. Finally, on a 3 GHz desktop computer and with the same MCMC specifications, ESS++ runs roughly 15 times faster than the Matlab implementation of Petretto et al. (2010) (in 36, 74 and roughly 400 minutes for the examples above).
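
As an illustration of how the MPPI and the best visited models relate to the MCMC output, the sketch below summarises a list of models recorded after burn-in. It uses simple visit frequencies, which is an assumption: the estimator implemented in ESS++ may weight models differently, and the function name and data layout are hypothetical.

```python
from collections import Counter


def summarise_visited_models(visited, n_predictors, top=10):
    """Summarise models recorded from the untempered chain after burn-in.

    visited      : list of frozensets of predictor indices, one per sweep
    n_predictors : total number of candidate predictors p
    Returns (mppi, best): mppi[j] is the fraction of sweeps whose model
    includes predictor j; best lists the `top` most frequently visited
    models with the fraction of sweeps spent in each.
    """
    n_sweeps = len(visited)
    inclusion = [0] * n_predictors
    for model in visited:
        for j in model:
            inclusion[j] += 1
    mppi = [count / n_sweeps for count in inclusion]
    counts = Counter(visited)
    best = [(sorted(model), count / n_sweeps)
            for model, count in counts.most_common(top)]
    return mppi, best
```

Summing the frequencies in `best` gives the share of sweeps covered by the top models, the frequency-based analogue of the posterior mass quoted above for Hopx.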

3 DOCUMENTATION AND IMPLEMENTATION

ESS++ is written in C++. Documentation of the algorithm (provided with the code and in the Supplementary Material) details not only the installation on different platforms and the contents of the package, but also how to run ESS++. The command line of ESS++ is simple and requires few specifications from the user: the response and predictor matrices (-Y file_name, -X file_name); the number of sweeps and the burn-in period (-nsweep int, -burn_in int); whether a hyperprior on the regression coefficients is required (-g_set); whether the user prefers a standard or detailed output for the summary statistics (-out file_name, -out_full file_name); and whether additional output files (MCMC move histories) are required (-history). The set-up of ESS++ is highly customizable by the user through the modification of the -par file. Among several other settings it is possible to define: the a priori expected value and the SD of the number of predictors (E_P_GAM, SD_P_GAM); the number of chains and their initial distance (NB_CHAINS, B_T); the parameters for the evolutionary part of the algorithm, such as the proportion of local and global moves (P_MUTATION); and the weighting of different types of global moves (P_DR). We refer the reader to Table 1 of the accompanying documentation for full details on all the parameters that can be entered in ESS++. The C++ implementation of ESS++ is open source. Its natural object-oriented structure favours community-based alterations and improvements. ESS++ is memory efficient and can be run, even for very large datasets, on a desktop computer. However, when thousands of observations are collected, the calculation of the (marginal) likelihood, which relies on costly linear algebra operations (QR decomposition, matrix multiplication), becomes rate limiting. A future development for ESS++ will be the translation of some of these linear algebra operations into Compute Unified Device Architecture (CUDA).
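
The command-line options listed above could be assembled programmatically, for example when launching one run per gene. The Python sketch below only builds the argument list. The flag names come from the text, but the executable name `./ESS`, the file names and the assumption that -par is followed by a file name are hypothetical and should be checked against the package documentation.

```python
def build_ess_command(y_file, x_file, n_sweep, burn_in, out_file,
                      par_file=None, full_output=False, g_set=False,
                      history=False, executable="./ESS"):
    """Return an argument list for one ESS++ run (nothing is executed here)."""
    if n_sweep <= 0 or burn_in < 0:
        raise ValueError("n_sweep must be positive and burn_in non-negative")
    cmd = [executable,
           "-Y", y_file,              # response matrix
           "-X", x_file,              # predictor matrix
           "-nsweep", str(n_sweep),   # number of sweeps
           "-burn_in", str(burn_in)]  # burn-in period
    # Standard or detailed summary output.
    cmd += ["-out_full" if full_output else "-out", out_file]
    if par_file is not None:
        cmd += ["-par", par_file]     # assumed form: -par followed by a file name
    if g_set:
        cmd.append("-g_set")          # hyperprior option as named in the text
    if history:
        cmd.append("-history")        # also write MCMC move histories
    return cmd


if __name__ == "__main__":
    # Settings from the example above: 2.2M sweeps with 200K as burn-in.
    print(" ".join(build_ess_command("Y.txt", "X.txt", 2_200_000, 200_000,
                                     "hopx_results", history=True)))
```

Once the executable has been compiled, the returned list can be passed to `subprocess.run`.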

Funding: Medical Research Council Clinical Sciences Centre (L.B.); Nouvelle Société Française d'Athérosclérose and EU Community's Seventh Framework Programme under grant agreement no. 226756 (to M.C.-H.); PHC ALLIANCE 2009 (19419PH) grant (to D.T. and E.P.); the Wellcome Trust (S.R.L.); EU Community's Seventh Framework Programme under grant agreement no. HEALTH-F4-2010-241504 (EURATRANS to E.P.); MRC grant (G0600609 to S.R.).

Conflict of Interest: none declared.

REFERENCES

1.  New insights into the genetic control of gene expression using a Bayesian multi-tissue approach.

Authors:  Enrico Petretto; Leonardo Bottolo; Sarah R Langley; Matthias Heinig; Chris McDermott-Roe; Rizwan Sarwar; Michal Pravenec; Norbert Hübner; Timothy J Aitman; Stuart A Cook; Sylvia Richardson
Journal:  PLoS Comput Biol       Date:  2010-04-08       Impact factor: 4.475

2.  Identification of Cd36 (Fat) as an insulin-resistance gene causing defective fatty acid and glucose metabolism in hypertensive rats.

Authors:  T J Aitman; A M Glazier; C A Wallace; L D Cooper; P J Norsworthy; F N Wahid; K M Al-Majali; P M Trembling; C J Mann; C C Shoulders; D Graf; E St Lezin; T W Kurtz; V Kren; M Pravenec; A Ibrahimi; N A Abumrad; L W Stanton; J Scott
Journal:  Nat Genet       Date:  1999-01       Impact factor: 38.330

3.  A trans-acting locus regulates an anti-viral expression network and type 1 diabetes risk.

Authors:  Matthias Heinig; Enrico Petretto; Chris Wallace; Leonardo Bottolo; Maxime Rotival; Han Lu; Yoyo Li; Rizwan Sarwar; Sarah R Langley; Anja Bauerfeind; Oliver Hummel; Young-Ae Lee; Svetlana Paskas; Carola Rintisch; Kathrin Saar; Jason Cooper; Rachel Buchan; Elizabeth E Gray; Jason G Cyster; Jeanette Erdmann; Christian Hengstenberg; Seraya Maouche; Willem H Ouwehand; Catherine M Rice; Nilesh J Samani; Heribert Schunkert; Alison H Goodall; Herbert Schulz; Helge G Roider; Martin Vingron; Stefan Blankenberg; Thomas Münzel; Tanja Zeller; Silke Szymczak; Andreas Ziegler; Laurence Tiret; Deborah J Smyth; Michal Pravenec; Timothy J Aitman; Francois Cambien; David Clayton; John A Todd; Norbert Hubner; Stuart A Cook
Journal:  Nature       Date:  2010-09-08       Impact factor: 49.962

4.  Imputation-based analysis of association studies: candidate regions and quantitative traits.

Authors:  Bertrand Servin; Matthew Stephens
Journal:  PLoS Genet       Date:  2007-05-30       Impact factor: 5.917
