Stuart Aitken1, Alastair M Kilpatrick2, Ozgur E Akman1. 1. MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh EH4 2XU, UK, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK, Department of Pediatrics, University of California San Diego, La Jolla, CA 92093, USA, Centre for Systems, Dynamics and Control, College of Engineering, Mathematics & Physical Sciences, University of Exeter, Exeter EX4 4QF, UK. 2. MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh EH4 2XU, UK, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK, Department of Pediatrics, University of California San Diego, La Jolla, CA 92093, USA, Centre for Systems, Dynamics and Control, College of Engineering, Mathematics & Physical Sciences, University of Exeter, Exeter EX4 4QF, UK MRC Human Genetics Unit, IGMM, University of Edinburgh, Edinburgh EH4 2XU, UK, School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK, Department of Pediatrics, University of California San Diego, La Jolla, CA 92093, USA, Centre for Systems, Dynamics and Control, College of Engineering, Mathematics & Physical Sciences, University of Exeter, Exeter EX4 4QF, UK.
Abstract
MOTIVATION: Model selection and parameter inference are complex problems of long-standing interest in systems biology. Selecting between competing models arises commonly as underlying biochemical mechanisms are often not fully known, hence alternative models must be considered. Parameter inference yields important information on the extent to which the data and the model constrain parameter values. RESULTS: We report Dizzy-Beats, a graphical Java Bayesian evidence analysis tool implementing nested sampling - an algorithm yielding an estimate of the log of the Bayesian evidence Z and the moments of model parameters, thus addressing two outstanding challenges in systems modelling. A likelihood function based on the L1-norm is adopted as it is generically applicable to replicated time series data. AVAILABILITY AND IMPLEMENTATION: http://sourceforge.net/p/bayesevidence/home/Home/.
MOTIVATION: Model selection and parameter inference are complex problems of long-standing interest in systems biology. Selecting between competing models arises commonly as underlying biochemical mechanisms are often not fully known, hence alternative models must be considered. Parameter inference yields important information on the extent to which the data and the model constrain parameter values. RESULTS: We report Dizzy-Beats, a graphical Java Bayesian evidence analysis tool implementing nested sampling - an algorithm yielding an estimate of the log of the Bayesian evidence Z and the moments of model parameters, thus addressing two outstanding challenges in systems modelling. A likelihood function based on the L1-norm is adopted as it is generically applicable to replicated time series data. AVAILABILITY AND IMPLEMENTATION: http://sourceforge.net/p/bayesevidence/home/Home/.
Bayesian methods provide a sound basis for ranking alternative systems biology models and for characterizing the extent to which parameters are constrained by models and data (Kirk ). Markov Chain Monte Carlo methods have been applied to model selection (Schmidl ) and to parameter inference in systems biology (Hug ; Kanodia ), but often require considerable algorithmic and conceptual development. Nested sampling promises to ease these complex computational tasks: Recent biological applications include (Aitken and Akman, 2013; Kirk ; Pullen and Morris, 2014).General purpose code for nested sampling is available in R (Skilling, 2006; Aitken and Akman, 2013), and biological applications of the MultiNest tool (Feroz ) have been reported (Kirk ; Pullen and Morris, 2014). A C-based command-line application implementing nested sampling and providing a Systems Biology Markup Language (SBML) interface has recently been released (Johnson ), but no graphical tool is currently available. Thus we sought to add nested sampling to the widely used Dizzy chemical kinetics simulation tool (Ramsey ) (over 200 citations as of November 2014). In addition, we have added an optimization function and SBML 3.1 compatibility. However, as Dizzy’s command language has operators that cannot be captured in SBML 3.1, and SBML 3.1 has features not supported by Dizzy, this feature is restricted to the intersection of the modelling languages.
2 Methods
Nested sampling calculates two of the central results of Bayesian inference: the posterior distribution of the parameters θ, and the evidence , that is, the support for the data D under hypothesis H (Skilling, 2006), through a sampling strategy. A selection between two alternative models H0 and H1 can be made by calculating the ratio of their posterior probabilities (1), a calculation that can be decomposed into the Bayesian evidence (Z0 and Z1) and the prior probability of the respective hypotheses.
The evidence (2) is a scalar quantity that can be viewed as an integral of the likelihood (L) over the elements of mass () associated with the prior density . The prior mass can be accumulated from its elements () in any order. The enclosed prior of likelihood can be defined (3), and this allows the evidence to be written as a one-dimensional integral of the (inverse) likelihood L(X) over the unit range (taking the enclosed prior mass X to be the primary variable) (4) (Skilling, 2006).
Given a sequence of decreasing values where the likelihood can be evaluated, the evidence can be approximated numerically as a weighted sum. Inferences about the posterior can be obtained from the sequence of m discarded points generated by sampling, P. Each point is assigned the weight , from which the first and second moments of each parameter in θ can be estimated—for more details see Skilling (2006) and Aitken and Akman (2013). The size of the population of active points (points θ within the evolving constraint ) used to sample the parameter space is the only parameter of the algorithm that the user must specify. For complex likelihood functions with multiple modes, this number may be as high as 10 000, and for a single mode as low as 200.Dizzy-Beats updates the user interface of the original Dizzy program (Ramsey ) retaining the text of the model in the left editing panel (see Fig. 1) and placing the original simulator choices in the simulation tab on the right. A histogram plot for visualizing the results of stochastic simulations is added to the simulation viewing formats. Two new tabs add optimization and inference capabilities, and both require a data file to be specified (a simple comma separated format is used, with column headers matching the names of species in the model). Using the simulation tab, the user can select parameters, modify their values and see the simulation results plotted over the data. This allows a manual tuning and exploration of the model’s fit to the data. Computational optimization using simulated annealing can be run to explore a larger parameter space. Similarly, the inference tab requires users to select parameters to be included in inference by nested sampling and to input their prior range. A uniform prior is assumed as is typical in nested sampling. A graph of log likelihood or of the samples of the selected parameters can be viewed as nested sampling progresses to monitor progress. The stopping heuristics of Aitken and Akman (2013) are implemented but the user can in addition specify the maximum number of iterations, and must specify the number of active points. The outputs are a file summarizing the results, and a second listing the posterior samples for further analysis.
Fig. 1.
Dizzy-Beats: an application for parameter inference and model selection
Dizzy-Beats: an application for parameter inference and model selectionA likelihood function based on the L1-norm is used for optimization and inference—this is defined by Equations (5) and (6) (Sivia and Skilling, 2006).Equation (5) defines the normalizing constant as the expected value of the moduli of the differences between the replicate observations at time t and the values predicted by the kinetic model (μ). The product of the probabilities of the median observation at time t () defines the likelihood function for a time series x of m samples [Equation (6)]. Maximization of this likelihood minimizes the sum of the moduli of the residuals (rather than their squares) on the basis that the testable information is restricted to the expected value of the modulus of the difference between theory and experiment. Should we know both the mean and variance, maximum entropy considerations would lead instead to the Gaussian distribution (Sivia and Skilling, 2006). Time points where the replicates are most dissimilar contribute least to the likelihood as is larger—as is desirable.
3 Discussion
Dizzy-Beats is a graphical application for simulating and optimizing systems models based on an established simulator (Ramsey ) and its simple textual model syntax, to which we have added SBML 3.1 import/export functionality. Uniquely, Dizzy-Beats provides model comparison and parameter inference functions through the nested sampling algorithm in a graphical application. Comparable functions are implemented in BioBayes (Vyshemirsky and Girolami, 2008); however, users must edit the XML representation of the model should they wish to make modifications. SYSBIONS (Johnson ) implements nested sampling but all interaction is via the command-line. The use of a likelihood based on the L1-norm derived from biological replicate data makes fewer assumptions than a Gaussian error model (Johnson ; Vyshemirsky and Girolami, 2008), and is less computationally complex than a transitional likelihood function derived from reaction propensities (Aitken and Akman, 2013; Heron ).
Funding
This work was funded by BBSRC grant [BB/I023461/1] (Bayesian evidence analysis tools for systems biology; S.A. and O.E.A.).Conflict of Interest: none declared.