
A comparison of approximate versus exact techniques for Bayesian parameter inference in nonlinear ordinary differential equation models.

Amani A Alahmadi1,2, Jennifer A Flegg3, Davis G Cochrane1, Christopher C Drovandi4, Jonathan M Keith1.   

Abstract

The behaviour of many processes in science and engineering can be accurately described by dynamical system models consisting of a set of ordinary differential equations (ODEs). Often these models have several unknown parameters that are difficult to estimate from experimental data, in which case Bayesian inference can be a useful tool. In principle, exact Bayesian inference using Markov chain Monte Carlo (MCMC) techniques is possible; however, in practice, such methods may suffer from slow convergence and poor mixing. To address this problem, several approaches based on approximate Bayesian computation (ABC) have been introduced, including Markov chain Monte Carlo ABC (MCMC ABC) and sequential Monte Carlo ABC (SMC ABC). While the system of ODEs describes the underlying process that generates the data, the observed measurements invariably include errors. In this paper, we argue that several popular ABC approaches fail to adequately model these errors because the acceptance probability depends on the choice of the discrepancy function and the tolerance without any consideration of the error term. We observe that the so-called posterior distributions derived from such methods do not accurately reflect the epistemic uncertainties in parameter values. Moreover, we demonstrate that these methods provide minimal computational advantages over exact Bayesian methods when applied to two ODE epidemiological models with simulated data and one with real data concerning malaria transmission in Afghanistan.
© 2020 The Authors.


Keywords:  Bayesian inference; Markov chain Monte Carlo; approximate Bayesian computation; compartmental models; epidemiology models; ordinary differential equations

Year:  2020        PMID: 32269786      PMCID: PMC7137938          DOI: 10.1098/rsos.191315

Source DB:  PubMed          Journal:  R Soc Open Sci        ISSN: 2054-5703            Impact factor:   2.963


Introduction

Models of dynamical systems consisting of sets of ordinary differential equations (ODEs) are an essential tool to describe many processes in science and engineering. ODE models contain parameters such as kinetic rates and initial concentrations. However, these parameters often cannot be measured directly by experiments, or there is inherent uncertainty in the parameter values. As such, these parameter values need to be estimated using statistical techniques such as maximum likelihood estimation or Bayesian inference. In the last decade much research has focused on estimating the unknown parameters of ODE systems under a Bayesian framework. One reason is that the Bayesian approach provides appropriate quantification of the uncertainty of parameters (and hence model predictions) through the posterior distribution. Markov chain Monte Carlo (MCMC) techniques were first developed by Metropolis et al. [1]. This class of techniques creates a Markov chain which has the posterior distribution as its limiting distribution. The state of the chain after a number of steps is used as a sample from the posterior distribution, and the quality of this sample improves as the number of steps gets larger. The original algorithm proposed by Metropolis was generalized by Hastings [2] to give the Metropolis–Hastings algorithm. Exact Bayesian inference techniques have grown steadily more sophisticated over time, increasing the efficiency and complexity of sampling schemes. The modern Bayesian toolbox now includes schemes such as sequential Monte Carlo (SMC) [3], the Metropolis adjusted Langevin algorithm (MALA) [4] and hybrid (Hamiltonian) Monte Carlo (HMC) [5,6]. These schemes improve on the Metropolis–Hastings algorithm, enabling efficient sampling from high-dimensional, strongly correlated posterior distributions. However, there are many models that possess a computationally intractable likelihood function, ruling out exact Bayesian methods.
This has led to the development of approximate Bayesian computation (ABC). The ABC methodology first appeared as the ABC rejection algorithm [7], which avoids calculation of the likelihood function. The theory was generalized and substantiated by Beaumont et al. [8]. To obtain samples more efficiently, an MCMC approach to ABC was formulated by Marjoram et al. [9]. In the context of dynamical systems, both approximate and exact Bayesian techniques involve numerical solution of the set of ODEs for each proposed set of parameters in order to evaluate how well the numerical solution matches the observed data. A desire to avoid the computational costs associated with numerical solution of the ODEs has led to the development of Gaussian Process (GP) models [10-12] for ODE parameter inference. Dass et al. [13] proposed a two-step method to approximate the posterior distribution of unknown parameters in an ODE model. In the first step, data are generated from the ODE using a numerical method, and the second step uses the Laplace approximation to marginalize the posterior for each parameter. This method gives a fast approach compared to a full Bayesian computational scheme. ABC methods based on SMC have been proposed [14] and many authors have developed approaches to improve the performance of the SMC ABC algorithm (see for example Beaumont et al. [15]). An SMC ABC approach was developed by Toni et al. [16], with application to dynamical systems. Their algorithm is theoretically sound, but we question the validity of the Bayesian posteriors they produce when they apply ABC to several examples involving ODE models. In their examples, the observed data are synthetically generated: the ODE model is solved at an assumed true parameter value and measurement error is added. However, when 'simulating' data within their ABC procedure, only the ODE model is solved, without generating measurement error.
In this paper, we show that such an approach generates parameter distributions that are sensitive to the ABC tolerance, and will eventually converge onto a point mass if the tolerance is continually reduced. Thus, this approach fails to correctly characterize the uncertainty as a Bayesian approach would aim to do. In order to 'correctly' apply ABC to ODE models, one must simulate from the assumed measurement error model after solving the ODE. However, we also show in this paper that an exact Bayesian approach is more computationally efficient than this 'correct' ABC implementation, calling into question the need for ABC in the first place when attempting to estimate the posterior distribution for ODE models. Given that the Toni et al. [16] paper is highly cited, we are concerned that other researchers might follow their ABC approach for calibrating ODE models. For example, Gupta et al. [17] compared the performance of MCMC, parallel tempering (PT) and SMC ABC (using the ABC-SysBio package) in estimating parameters in ODE models. The authors analysed simulated data with measurement error added, taken from the ABC-SysBio package. Then, when applying MCMC and PT to infer model parameters, they assumed 'a likelihood function with 1% Gaussian error'. However, when using SMC ABC, no noise was added to the simulated data. Consequently, not only does this make the comparison invalid, but the resulting approximate posterior distribution produced by SMC ABC also does not represent the uncertainty around the parameter values. In another example, Silk et al. [18] present applications to molecular dynamical systems in which they 'have focused on the sequential ABC algorithm proposed by Toni et al. [16]'. Silk et al. [18] mention that they simulate the model 'subject to some small added zero-mean Gaussian noise with covariance 0.01I', so they have clearly used 'noise added in the simulation step, σ is considered known' (option 2 from Toni's thesis [19, p. 154]).
However, for many real problems this is not practical, and we fear that ABC users might revert to the option of simulating without noise. Two other examples of assuming known noise are in da Costa et al. [20] and Costa et al. [21]. The authors assumed the uncertainties to be 'additive, uncorrelated, Gaussian, with zero mean and' a known standard deviation, as they stated on pages 2801 and 1295, respectively. Moreover, Barnes et al. [22] presented an implementation of ABC SMC for ODEs (section 4.1 of their paper) and used the ABC-SysBio package. We show in our paper that using this package with ODE models can give an incorrect approximation to the posterior when estimation of the noise is not considered. There is no explanation in Barnes et al. [22] regarding the authors' assumption about the noise. The same issue appeared in Toni & Stumpf [23] and Sun et al. [24]: they applied SMC ABC to an ODE model, but there are no details regarding the authors' assumptions about the noise. Understanding the overall noise (uncertainty) associated with the unknown parameter values when conducting parameter estimation using ODE models is important, especially when we aim to use these ODE models to inform real-world applications. The remainder of this paper is organized as follows. In §2, we introduce a simple method of exact Bayesian inference and two methods of approximate Bayesian computation (MCMC ABC and SMC ABC), complemented with a discussion of the approximation to a point mass that results from SMC ABC and MCMC ABC. Applications of MCMC, MCMC ABC and SMC ABC to two ODE epidemiological models with simulated data and one with real epidemiological data are presented in §3. Section 4 presents further discussion, a comparison of the presented methods and our conclusions.

Bayesian techniques for ODE parameter inference

Bayesian techniques such as Markov chain Monte Carlo (MCMC) methodologies are sampling-based methods that involve sampling the posterior density

p(θ|y) = p(y|θ) p(θ) / p(y) ∝ p(y|θ) p(θ),   (2.1)

or an approximation to equation (2.1) in the case of approximate Bayesian computation (ABC) approaches, where y = (y_1, …, y_n) is the observed data, θ are the unknown parameters, p(θ|y) is the posterior distribution, p(y|θ) is the likelihood and p(θ) is the prior. In this section we discuss application of these Bayesian frameworks in the context of inferring parameters for ODE models. MCMC techniques as developed by Metropolis et al. [1] and Hastings [2] can be used to sample from the posterior distribution in equation (2.1). The Metropolis–Hastings algorithm constructs a Markov chain for which the stationary and limiting distribution is the posterior distribution. After running the chain for a sufficient amount of time, samples from the chain can be considered draws from the posterior distribution. An implementation of the Metropolis–Hastings algorithm is given in appendix A. However, MCMC methods require the computation of the likelihood function, p(y|θ), in equation (2.1). As a result, ABC methods were developed to sample from an approximation to the posterior in cases for which the likelihood is intractable or too computationally costly to compute. Instead of calculating the likelihood as before, a distance between the observed data, y, and simulated data, z, is calculated, and for a sufficiently small distance the parameter proposals are accepted. For more explanation see appendix A. ABC targets an approximate posterior [26]:

p_ε(θ, z|y) ∝ 1(ρ(z, y) ≤ ε) f(z|θ) p(θ),   (2.2)

where 1(·) is an indicator function that takes the value one if its logical argument is true and zero otherwise, and f(z|θ) is the model that generates simulations z given θ. The accuracy of ABC approaches depends on choosing a suitable discrepancy function ρ(z, y) and an appropriate tolerance ε [9].
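To make the accept/reject idea behind the approximate posterior concrete, here is a minimal ABC rejection sampler on a toy Gaussian problem of our own choosing (not one of the paper's ODE models): the data are N(θ, 1) draws, and the discrepancy compares sample means.

```python
import random
import statistics

random.seed(1)

def abc_rejection(y_obs, prior_sample, simulate, rho, eps, n_accept):
    """Basic ABC rejection: keep proposed parameters whose simulated
    data fall within tolerance eps of the observed data under rho."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample()
        z = simulate(theta)
        if rho(z, y_obs) <= eps:
            accepted.append(theta)
    return accepted

# Toy setting: observed data are 50 draws from N(true_theta, 1).
true_theta = 2.0
y = [random.gauss(true_theta, 1.0) for _ in range(50)]

samples = abc_rejection(
    y_obs=y,
    prior_sample=lambda: random.uniform(-10.0, 10.0),
    simulate=lambda th: [random.gauss(th, 1.0) for _ in range(50)],
    rho=lambda z, yo: abs(statistics.mean(z) - statistics.mean(yo)),
    eps=0.2,
    n_accept=200,
)
```

As the text notes, the acceptance rate collapses when the prior is far from the posterior; here the wide uniform prior already makes most proposals wasted, which motivates the MCMC ABC and SMC ABC variants discussed next.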
In practice, the discrepancy function typically compares sets of summary statistics s( · ) for the observed and simulated datasets. ABC rejection sampling is very simple to implement, though it can suffer from extremely low acceptance rates when the prior distribution is dissimilar to the posterior distribution [9]. To counteract this deficiency, a more efficient ABC technique based on MCMC was developed [9]. For more details see appendix A. Furthermore, in order to improve the low acceptance rate of the basic ABC algorithm, an SMC ABC algorithm was proposed in Sisson et al. [14], based on the SMC sampler methodology developed by Del Moral et al. [27]. The SMC ABC algorithm converges to the approximate posterior distribution through a number of intermediate distributions with a distance threshold that is sequentially decreased; see appendix A. The efficiency of the SMC ABC algorithm depends not only on the model complexity and the amount of data available, but also on the choice of the decreasing sequence of ε (the tolerances) and the choice of perturbation kernel K, according to Filippi et al. [28]. There are various ways to construct the decreasing sequence of ε, either manually or adaptively as proposed in Drovandi & Pettitt [29] and Del Moral et al. [30]. In the adaptive method, the value of ε is chosen to be the αth quantile of the discrepancies between the observed data and the simulated data generated in the (t − 1)th population (see appendix A), where 0 ≤ α ≤ 1. In this paper, we used the latter method of selecting the sequence of tolerance thresholds, and we stopped the algorithm when we reached a final ε that set the desired level of agreement between simulated and real data, following Liepe et al. [31]. The choice of perturbation kernel affects the acceptance rate in SMC ABC and the time consumed by the algorithm, as explained in Filippi et al. [28].
Perturbation kernels can be divided into two classes: component-wise perturbation kernels and multivariate perturbation kernels. For component-wise perturbation kernels, one can use a uniform distribution or a univariate Gaussian distribution to perturb the particle θ* sampled from the previous population {θ_{t−1}}. The standard deviation of the kernel can be fixed in advance for each population, but more recently practitioners have adaptively chosen the width of the kernel (Beaumont et al. [15], Didelot et al. [32], Filippi et al. [28]). If the model parameters are correlated, a component-wise perturbation kernel can fail to capture the structure of the true posterior, leading to a low acceptance rate. To overcome this problem, a multivariate normal distribution with a covariance matrix Σ_t that depends on the covariance of the previous population can be used to perturb the particles [28]:

θ** ∼ N(θ*, Σ_t),   (2.3)

where Σ_t is computed from the particles θ_{t−1} of the previous population for which the corresponding simulated data z_{t−1} satisfy ρ(z_{t−1}, y) < ε_t (remembering ε_t < ε_{t−1}) and their associated weights. To further improve the performance of SMC ABC, we adopted a method proposed in Prangle [33] to adaptively update the discrepancy function, ρ(z, y). We used a weighted Euclidean distance function:

ρ(z, y) = √( Σ_j [ζ_j (z_j − y_j)]² ),   (2.4)

where y_j is the jth observation, z_j is the jth simulated observation in the simulated data z = (z_1, …, z_n) and ζ_j is a tunable scaling factor that allows the contribution of the jth coordinate to the discrepancy function to be normalized. The reason for normalizing the coordinates is to prevent any of them dominating the acceptance decision in the algorithm. In non-adaptive methods, the values of ζ_j are determined in advance and fixed. Fixing ζ_j from the first iteration in SMC ABC will not guarantee that the jth coordinate will be normalized in later iterations, because after the first round of SMC ABC we are no longer sampling from the prior, so the normalizing scale needs to be adapted.
To adapt the values of ζ_j in each iteration, Prangle [33] proposed calculating the median absolute deviation (MAD) of the jth coordinate of the simulated data vectors from the previous iteration (including those rejected). The value of the next ε is also determined using these distances; for more details see algorithm 4 in [33]. Note that Prangle [33] defined the discrepancy function in terms of summary statistics for z and y, as is usual in ABC. Here we have used the coordinates of z and y directly, following the approach of Toni et al. [16] for inference of ODE model parameters. Vaart et al. [34] proposed an ABC method called error-calibrated ABC that implements a general methodology introduced by Wilkinson [35]. In their method, the estimation of the noise is incorporated into the ABC technique through an ABC acceptance probability in which the noise is assumed to be normally distributed and independent. This results in improved estimates of the parameter values and their uncertainty. An implementation of the error-calibrated ABC algorithm is given in algorithm 1.
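The MAD-based adaptation of the weights ζ_j and the resulting weighted Euclidean distance can be sketched as follows (a minimal illustration of the mechanics, not the paper's R code):

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

def adapt_weights(prev_sims):
    """Prangle-style adaptation: zeta_j = 1 / MAD of the j-th coordinate
    across all simulated data vectors from the previous iteration."""
    return [1.0 / mad(list(col)) for col in zip(*prev_sims)]

def weighted_euclidean(z, y, zeta):
    """rho(z, y) = sqrt(sum_j (zeta_j * (z_j - y_j))^2)."""
    return math.sqrt(sum((w * (a - b)) ** 2 for w, a, b in zip(zeta, z, y)))

# Coordinates on very different scales end up with comparable influence.
prev_sims = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]
zeta = adapt_weights(prev_sims)  # [1.0, 0.05]
d = weighted_euclidean([2.0, 30.0], [3.0, 50.0], zeta)
```

In the second coordinate the raw difference is 20, but after weighting it contributes the same as a unit difference in the first coordinate, which is exactly the normalization the text describes.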
Algorithm 1.

Error-calibrated ABC algorithm, Vaart et al. [34].

1: Repeat n times:

(a) Draw θ ∼ p(θ).

(b) Simulate z from model given θ.

2: Find ẑ, the simulated value that minimizes ρ(z, y).

3: For each data type k, calculate λ̂_k, the standard deviation of all corresponding residuals ẑ − y.

4: Accept (θ, z) with probability p ∝ χ²_l(s) · s^(1 − l/2) / c, where s = Σ_{j=1}^{l} ((z_j − y_j)/λ̂_k)² and c is equal to the maximum acceptance probability across all runs.
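Step 4 can be sketched as below. Note this is our reading of the acceptance rule: since the χ²_l density is s^(l/2−1) e^(−s/2) up to a constant, multiplying by s^(1−l/2) leaves a weight proportional to exp(−s/2), i.e. the Gaussian likelihood of the standardized residuals. The toy inputs are hypothetical.

```python
import math

def chi2_pdf(s, l):
    """Density of the chi-squared distribution with l degrees of freedom."""
    return (s ** (l / 2.0 - 1.0) * math.exp(-s / 2.0)
            / (2 ** (l / 2.0) * math.gamma(l / 2.0)))

def acceptance_probs(draws, y, lam_hat):
    """For each (theta, z) pair compute s = sum_j ((z_j - y_j)/lam_hat)^2
    and the unnormalized weight chi2_l(s) * s**(1 - l/2) (proportional to
    exp(-s/2)), then normalize by the maximum weight c across all runs."""
    l = len(y)
    weights = []
    for _theta, z in draws:
        s = sum(((zj - yj) / lam_hat) ** 2 for zj, yj in zip(z, y))
        weights.append(chi2_pdf(s, l) * s ** (1.0 - l / 2.0))
    c = max(weights)
    return [w / c for w in weights]

# A close simulation gets acceptance probability 1; a distant one, nearly 0.
y = [0.0, 0.0]
probs = acceptance_probs([("near", [0.1, 0.1]), ("far", [2.0, 2.0])], y, lam_hat=1.0)
```

Unlike the hard threshold of plain ABC, this soft acceptance probability carries the assumed Gaussian noise scale λ̂_k into the inference, which is why the method yields calibrated uncertainty estimates.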


Bayesian inference for parameters in ODE models

Consider a Q-dimensional dynamical system for the state variable vector, x(t), described by the system of ODEs:

dx/dt = f(x, t; θ),   (2.5)

where x is a Q × 1 vector of the dependent variables, f is a Q × 1 vector-valued Lipschitz continuous function with respect to x, θ is an M × 1 vector of model parameters, t is the independent variable (often time) and dx/dt represents the derivative of x with respect to the independent variable. Given the dynamical system in equation (2.5), along with values for the parameter vector, θ, and the initial condition, x_0, the solution to the system can be approximated numerically. We denote an experimental observation at time t_k by the Q × 1 vector y_k. Experimental observations are taken at K time points; the times are stored in a K × 1 vector t = (t_1, t_2, …, t_K) and the observations are stored in the Q × K matrix y = (y_1, y_2, …, y_K). These observations are usually associated with some unknown noise process, characterized by one or more variance parameters, say σ². The (approximate) solution for the dependent variables at time t_k, given θ and x_0, is denoted by the Q × 1 vector x(t_k; θ, x_0). The solution for the dependent variables at times t is stored in the Q × K matrix x(t; θ, x_0). In a Bayesian setting, the posterior distribution for θ and σ² given y is:

p(θ, σ²|y) ∝ p(y|θ, σ²) p(θ) p(σ²),   (2.6)

where p(y|θ, σ²) is the likelihood, and p(θ) and p(σ²) are independent priors for θ and σ², respectively.

Observation model

In this paper, we assume that each observation, y_k for k = 1, …, K, has an associated additive noise process, ε_k, such that

y_k = x(t_k; θ, x_0) + ε_k,   (2.7)

where ε_k is a Q × 1 vector and x(t_k; θ, x_0) is the solution for the dependent variables at time t_k, given θ and x_0. Under a Gaussian error model (we assume a Gaussian model for simplicity and illustration purposes, but any kind of error model could be assumed), and assuming the ε_k are independent of each other, y_k follows a multivariate normal distribution:

y_k ∼ N(x(t_k; θ, x_0), Σ(σ²)),   (2.8)

where Σ(σ²) is a diagonal matrix with diagonal elements associated with the Q dependent variables. Hence, the likelihood function is given by

p(y|θ, σ², x_0) = Π_{k=1}^{K} N(y_k; x(t_k; θ, x_0), Σ(σ²)),   (2.9)

and the posterior density is

p(θ, x_0, σ²|y) ∝ p(y|θ, σ², x_0) p(θ) p(x_0) p(σ²).   (2.10)

The Bayesian techniques discussed in appendix A can be used to sample from the posterior distribution of {θ, x_0, σ²} in the case of MCMC, and an approximation to the posterior in the case of ABC methods.
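Because the covariance is diagonal and the errors are independent across times, the log-likelihood in equation (2.9) reduces to a sum of univariate normal log-densities. A small self-contained sketch (our own illustration):

```python
import math

def gaussian_loglik(y, x_sol, sigma2):
    """log p(y | theta, sigma^2) under independent additive Gaussian noise
    with a diagonal covariance: y and x_sol are lists of K Q-vectors
    (observations and ODE solution at the K observation times); sigma2
    holds one variance per dependent variable."""
    ll = 0.0
    for y_k, x_k in zip(y, x_sol):
        for q, (obs, pred) in enumerate(zip(y_k, x_k)):
            v = sigma2[q]
            ll += -0.5 * math.log(2.0 * math.pi * v) - (obs - pred) ** 2 / (2.0 * v)
    return ll

# One observation of one state, predicted exactly: only the normalizing term.
ll = gaussian_loglik([[1.0]], [[1.0]], [1.0])
```

Evaluating this sum is cheap relative to the numerical ODE solve producing `x_sol`, which is the point made in the next section: the likelihood evaluation is not the bottleneck.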

Model misspecification in ABC methods for ODE models

Exact Bayesian methods, such as MCMC, generate samples directly from p(θ, σ²|y) (at least in the limiting sense), rather than from the approximate posterior p_ε(θ, z|y) shown in equation (2.2). This applies in general, but in particular to parameter inference in ODE models, so long as an observational model has been defined, leading to a posterior distribution such as in equation (2.10). Evaluating this expression requires solving a system of ODEs to obtain x(t; θ, x_0) for a specific collection of parameters. It is then straightforward, at least for this simple noise model, to evaluate the likelihood and the posterior density up to a normalizing constant. One of the major motivations for using likelihood-free methods, such as ABC, is that they are applicable even when evaluating the likelihood is difficult or impossible. That motivation is not present here, since solving the system of ODEs for each proposed parameter vector is the main computational burden involved in evaluating equation (2.10), and this is still necessary for ABC, at least using the method proposed by Toni et al. [16]. Their method does still have a computational advantage, in that it avoids evaluating the density of the noise model, which may be prohibitive for certain models. However, where the simple independent Gaussian noise model of equation (2.8) is appropriate, or some other simple noise model applies, the contribution of these density evaluations to the overall computational burden will be negligible. The method of Toni et al. [16] actually avoids even simulating draws from the noise model. This is made clear in the following text, which appears in the thesis by Toni [19, p. 154], the basis of the work in Toni et al. [16]: 'We explore the differences between three different inference approaches: 1. No noise added in the simulation step, θ is the unknown parameter. This framework has been introduced in Chapter 3 and used throughout this thesis. 2. Noise added in the simulation step, σ is considered known and θ unknown. 3. Noise added in the simulation step, both σ and θ are unknown.' Since option 1 (the approach taken throughout the thesis and associated papers) avoids adding noise in the simulation step, the method is applicable regardless of the noise model.
However, this generality comes at a cost and, as we explain in this section, results in an approximate distribution of the form of equation (2.2) that contains no information about parameter uncertainty. Toni et al. [16] adapted the Sisson et al. [14] SMC ABC algorithm and used it to infer parameters in ODE models. However, the method they devised differs in two crucial aspects from standard practice in implementing ABC. The first is that they do not simulate data vectors z* from the same model they assume for the data, which is of the form shown in equation (2.9). Instead, they generate z* by merely solving the underlying system of ODEs for each proposed value of the parameter vector θ**. The simulated data z* are thus a deterministic function of θ**, without any added noise, and in effect the underlying likelihood model used in the resulting ABC algorithm is a point mass concentrated at the solution of the system of ODEs. In this sense, the data generation model used by Toni et al. [16] is misspecified. A second departure from standard ABC practice is that the discrepancy function used by Toni et al. [16] directly computes a distance between the simulated and observed data, originally using the Euclidean distance

ρ(z, y) = √( Σ_j (z_j − y_j)² ),   (2.11)

but they also experimented with alternative metrics. In this paper, we experiment with the more general adaptively weighted distance function in equation (2.4). All of these discrepancy functions have in common that they can only take a zero value when the simulated data z* exactly correspond to the observed data y.
In contrast, practical ABC methods more commonly use a discrepancy function based on the distance between vectors of summary statistics s(z*) and s(y), which have much lower dimension than the simulated and observed data vectors z* and y. In that case, the discrepancy function is zero whenever s(z*) = s(y), which can occur even if z* ≠ y. A practical reason for basing the discrepancy on a vector of summary statistics is that this places weaker constraints on the acceptability of proposed pairs (θ**, z*). If the summary statistics are sufficient for θ, nothing is lost by using s(y) instead of y, but more usually the summary statistics capture much, but not all, of the information y can reveal about θ. As a result of these two departures from standard ABC practice, it will not in general be possible for the discrepancy ρ(z, y) in equation (2.11) to be arbitrarily small. The problem is that, since the generative data model does not include a noise term, there may be no parameter vector θ for which the solution to the system of ODEs exactly corresponds to the observations y, and hence there is some minimum attainable discrepancy ε_0 > 0. Consequently, ρ(z, y) will always be greater than 0, which is considered a misspecification in ABC estimation, according to Frazier et al. [36]. Under ideal conditions, there is a unique θ_0 such that ρ(x(θ_0), y) = ε_0 > 0, where x(θ) is the unique solution to the system of ODEs with parameter vector θ. Therefore, as ε → ε_0 from above, the approximate posterior p_ε(θ, z|y) approaches a Dirac delta function at the point (θ_0, x(θ_0)). It follows that the approximate 'posterior' p_ε(θ, z|y) targeted by the method of Toni et al. [16] contains no information about the posterior variance of the parameters. A practical demonstration of this is provided in the results below, in which small to moderate changes in the noise model used to simulate the observations resulted in no change in the posterior variance estimated by the ABC methods.
On the other hand, changing the noise model used to simulate observations did affect the location of the posterior and the final ε that guaranteed a good acceptance rate. However, our results presented below demonstrate that the shapes of the contours of distributions of the form of equation (2.2) for ε > ε_0 may resemble those of the true posterior, and we propose that it may be possible to find some ε > ε_0 for which p_ε approximates the true posterior. Finding a good way to do this is left for future work.
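The collapse to a point mass described above can be seen in a deliberately trivial toy example of our own construction (not one of the paper's ODE models): the 'simulator' is deterministic (z(θ) = θ), so the spread of accepted parameters is set entirely by the tolerance, not by the noise in the data.

```python
import random
import statistics

random.seed(3)

# Observations: theta0 plus Gaussian measurement error.
theta0, sigma, n = 1.0, 0.5, 25
y = [random.gauss(theta0, sigma) for _ in range(n)]
ybar = statistics.mean(y)

def abc_no_noise(eps, n_accept=300):
    """ABC where 'simulation' only solves the deterministic model
    (trivially z(theta) = theta) and draws no measurement error, so the
    discrepancy is simply the distance between model and data means."""
    out = []
    while len(out) < n_accept:
        th = random.uniform(-4.0, 4.0)
        if abs(th - ybar) <= eps:
            out.append(th)
    return out

wide = abc_no_noise(0.5)
narrow = abc_no_noise(0.05)
spread_wide = statistics.stdev(wide)
spread_narrow = statistics.stdev(narrow)
```

Under the uniform prior the accepted parameters are uniform on an interval of half-width ε, so their standard deviation is about ε/√3: it shrinks towards zero as ε does, and says nothing about the true posterior standard deviation σ/√n of the data-generating model.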

Test problems

The ABC and MCMC techniques described in appendix A were compared against each other when conducting parameter inference for epidemiological compartmental models. The Bayesian parameter inference software developed in this paper was validated using the method of posterior quantiles [37] on a computationally inexpensive model described in §3.1, before being implemented on a more demanding nonlinear system of ODEs describing malaria transmission in §3.2.

Test problem 1—susceptible–infected–recovered model

Susceptible–infected–recovered (SIR) models categorize hosts into one of three different compartments at time t. Individuals are considered susceptible (S), if they are able to be infected by the pathogen, infected (I) if currently infected with the pathogen or recovered (R) if they have successfully cleared the pathogen. The flow of individuals between compartments in the SIR model is visualized in figure 1.
Figure 1.

The transition of individuals between susceptible, infected and recovered states in an epidemiological compartmental model.

SIR models and their variants, in both deterministic and stochastic forms, are among the most fundamental epidemiological models and have found use describing diseases as diverse as influenza, herpes and malaria [38]. In this test problem we use the SIR model to represent the fraction of the total population (P) in each category as follows:

s(t) = S(t)/P,   i(t) = I(t)/P,   r(t) = R(t)/P,

where S(t), I(t) and R(t) are the numbers of susceptible, infected and recovered individuals in the population at time t (weeks). The deterministic, constant-population SIR model without demographics can be described mathematically as:

ds/dt = −β s i,
di/dt = β s i − γ i,   (3.1)
dr/dt = γ i,

where β is the infection rate and γ is the recovery rate. When conducting parameter inference for this system, the observable data are y = (y_1, …, y_n), where y_k = i(t_k) is the proportion of the population infected at time t_k, for k = 1, …, n. The parameters of interest are θ = {β, γ}.
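The system (3.1) can be solved with a classical fourth-order Runge–Kutta scheme, as the paper does. Below is a minimal pure-Python sketch (our own illustration; the paper's implementation is in R), using the parameter values and initial conditions from figure 2:

```python
def sir_rhs(state, beta, gamma):
    """Right-hand side of the SIR system (3.1) in proportions s, i, r."""
    s, i, r = state
    return (-beta * s * i, beta * s * i - gamma * i, gamma * i)

def rk4_sir(beta, gamma, state0, weeks, dt=0.01):
    """Classical fourth-order Runge-Kutta solve, saving the state at
    whole-week time points (the observation grid used in the paper)."""
    per_week = int(round(1.0 / dt))
    state = list(state0)
    saved = [tuple(state)]
    for n in range(1, weeks * per_week + 1):
        k1 = sir_rhs(state, beta, gamma)
        k2 = sir_rhs([x + 0.5 * dt * k for x, k in zip(state, k1)], beta, gamma)
        k3 = sir_rhs([x + 0.5 * dt * k for x, k in zip(state, k2)], beta, gamma)
        k4 = sir_rhs([x + dt * k for x, k in zip(state, k3)], beta, gamma)
        state = [x + dt * (a + 2 * b + 2 * c + d) / 6.0
                 for x, (a, b, c, d) in zip(state, zip(k1, k2, k3, k4))]
        if n % per_week == 0:
            saved.append(tuple(state))
    return saved

# Figure 2's setting: beta = 0.9, gamma = 0.333, I(0) = 1.27e-6.
sol = rk4_sir(0.9, 0.333, (1.0 - 1.27e-6, 1.27e-6, 0.0), weeks=50)
infected = [s[1] for s in sol]
```

Because the right-hand sides sum to zero, s + i + r is conserved at every step, which is a convenient sanity check on the solver.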

Simulation results

A test dataset was generated by solving the system of equations (3.1) in the interval [0, 50] using a fourth-order Runge–Kutta method and storing the solution at weekly intervals (figure 2), using true model parameters β = 0.9 and γ = 0.333. To generate observations y, normal noise was added to the solution. For the ABC approaches, a discrepancy function ρ(z, y) was used to compare the infected proportions in the dataset y with a solution to the equations z = (z_1, …, z_n) for proposed parameters θ:

ρ(z, y) = √( (1/n) Σ_{k=1}^{n} (z_k − y_k)² ),   (3.2)

where n is the number of observed data points. The priors for β, γ and σ² were taken to be vague: uniform distributions for β and γ, and an inverse-gamma distribution for σ². For the MCMC approach, normal proposal distributions were used, with the tuning parameters in the algorithm adapted to maintain an acceptance ratio between 0.3 and 0.5 [25].
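The data-generation and discrepancy steps can be sketched as follows. Note the root-mean-square form of the distance, the Gaussian-bump stand-in for the weekly infected proportions and the noise scale σ = 0.01 are our assumptions for illustration, not the paper's exact choices:

```python
import math
import random

random.seed(2)

def discrepancy(z, y):
    """Root-mean-square distance between simulated and observed
    infected proportions (our reading of the discrepancy used here)."""
    n = len(y)
    return math.sqrt(sum((zk - yk) ** 2 for zk, yk in zip(z, y)) / n)

# Stand-in for the weekly infected proportions from an RK4 solve of (3.1);
# any smooth curve illustrates the mechanics.
true_curve = [0.5 * math.exp(-(((t - 15.0) / 8.0) ** 2)) for t in range(51)]
sigma = 0.01
observed = [i + random.gauss(0.0, sigma) for i in true_curve]

d0 = discrepancy(true_curve, observed)  # roughly the noise scale sigma
```

The distance between the noise-free curve and the noisy observations settles near the noise scale, which is precisely why a deterministic simulator can never drive the discrepancy to zero: the minimum attainable distance ε_0 is set by the measurement error.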
Figure 2.

Test data showing the proportion of population infected over time, obtained from equation (3.1) with β = 0.9, γ = 0.333, S(0) = 1 − 1.27 × 10−6, I(0) = 1.27 × 10−6 and R(0) = 0. Red line shows the continuous infection curve while the blue points are the observations to be used to infer the model parameters.

Given the observations y, the parameter vector θ = {β, γ} was estimated using MCMC and SMC ABC, and the results from these methods were compared. The noise variance σ² was additionally estimated when using MCMC. As discussed in §2.2, the distributions derived using the ABC approaches are not an approximation to the true posterior of the ODE model parameters, since the noise is not estimated; we therefore cannot use the standard deviation of the distributions from the ABC approaches as a measure of performance. Instead, we compared the CPU times, the number of iterations and the mean absolute errors (MAE), although MAE may favour over-concentrated posterior approximations. The formula used is MAE = (1/N) Σ_{i=1}^{N} |θ_i − θ_true|, where the θ_i are the N posterior samples for θ produced by each method. We applied the MCMC approach using appendix A, algorithm 2. Table 1 shows that the MCMC chain took 12 401 steps to reach convergence, which required approximately 6.58 min.
Algorithm 2.

The Metropolis–Hastings algorithm [1,2].

1: Initialize θ0.

2: for t = 1 to T do

3: Propose θ′ from a proposal distribution q(θ′|θ_{t−1}).

4: Calculate α(θ_{t−1}, θ′) = min(1, [p(y|θ′) p(θ′) q(θ_{t−1}|θ′)] / [p(y|θ_{t−1}) p(θ_{t−1}) q(θ′|θ_{t−1})]).

5: Set θ_t = θ′ with probability α, else set θ_t = θ_{t−1}.

6: end for
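A compact random-walk version of algorithm 2 is sketched below on a toy target of our own choosing (not the paper's SIR posterior); with a symmetric Gaussian proposal the q terms in the acceptance ratio cancel.

```python
import math
import random

random.seed(4)

def metropolis_hastings(log_post, theta0, n_iter, step):
    """Random-walk Metropolis-Hastings (algorithm 2). With a symmetric
    proposal, alpha = min(1, p(y|th')p(th') / (p(y|th)p(th))), which we
    evaluate on the log scale for numerical stability."""
    chain = [theta0]
    for _ in range(n_iter):
        prev = chain[-1]
        prop = random.gauss(prev, step)
        log_alpha = log_post(prop) - log_post(prev)
        accept = math.log(random.random() + 1e-300) < log_alpha
        chain.append(prop if accept else prev)
    return chain

# Toy target: a standard normal log-density (unnormalized).
chain = metropolis_hastings(lambda th: -0.5 * th * th, 0.0, 20000, 1.0)
burned = chain[2000:]
post_mean = sum(burned) / len(burned)
post_var = sum((x - post_mean) ** 2 for x in burned) / len(burned)
```

For the SIR problem, `log_post` would combine the Gaussian log-likelihood of equation (2.9), evaluated at the RK4 solution, with the log-priors for β, γ and σ²; the step size plays the role of the adaptively tuned proposal standard deviation.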

Table 1.

The number of iterations, computational time (min) and mean absolute error for parameter inference in the SIR model.

method     iterations   CPU time    MAE (β)   MAE (γ)
MCMC       12 401       6.58 min    0.0038    0.0034
SMC ABC    141 408      11.29 min   0.0035    0.0029
We next applied SMC ABC as outlined in appendix A, algorithm 6, using our own implementation in R. In the SMC ABC code we used T = 11 populations, each with 1000 particles, used component-wise uniform kernels whose widths adapted based on the previous particle distributions [28], and used uniform priors 2 units wide and centred at zero for both parameters. The tolerance sequence was selected adaptively such that in population t the new threshold ε_t was the 25th percentile of the distances from the previous population, t − 1 (as explained in §2). The algorithm terminated when we reached a target tolerance of ε = 0.067056, which had been chosen as the distance between the true ODE solution and the generated observations y.
Algorithm 6.

SMC ABC algorithm [14,16,27].

1: Initialize ε_t for t = 1, …, T, where ε_t > ε_{t+1} > 0.

2: for t=0toT do

3: for i=1toN do

4:  if t=0 then

5:   Sample θ from p(θ).

6:  else

7:   Sample θ from the previous population θt1(i) with normalized weights

  wt1(i) and use a perturbation kernel Kt to sample θKt(|θ).

8:  end if

9:   if p(θ)=0 then

10:   Go to line 4.

11:  else

12:   Simulate z from model given θ.

13:  end if

14:  if ρ(z,y)ϵt then

15:   Go to line 4.

16:  else

17:   Set θt(i)=θ and calculate the weight for the particle θt(i):

  wt(i)={1ift=0p(θt(i))j=1Nwt1(j)Kt(θt(i)|θt1(j))ift>0.

18:  end if

19: end for

20: set ϵt+1 to be α-quantile of saved distances vector

21: ϵt=ϵt+1

22: Normalize the weights.

23: end for

24: Return particles θT(i).
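The particle weight in line 17 of algorithm 6 is the standard SMC ABC importance weight. A minimal sketch of that single computation, where the kernel and prior density functions are placeholders supplied by the caller:

```python
def smc_weight(theta, prev_particles, prev_weights, kernel_pdf, prior_pdf):
    """w_t = p(theta) / sum_j w_{t-1}^(j) * K_t(theta | theta_{t-1}^(j))
    for a particle accepted in population t > 0 (weights are 1 at t = 0)."""
    denom = sum(w * kernel_pdf(theta, prev)
                for w, prev in zip(prev_weights, prev_particles))
    return prior_pdf(theta) / denom
```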

Comparing SMC ABC with MCMC, we found that SMC ABC required a longer run time than MCMC (11.29 min versus 6.58 min; table 1). In addition, it can be seen in figure 3b that the estimated joint posterior resulting from the 11th population of the SMC ABC method has a smaller variance than that of the MCMC method. Table 2 shows that both methods achieved good point estimates for the parameters β and γ.
Figure 3.

Scatter plot of sample draws for γ and β using MCMC (a) and SMC ABC (b). The contour lines contain the stated proportions of sample draws from the joint posterior and are produced using the R function ‘HPDregionplot’.

Table 2.

True values of the parameters β and γ and their estimated values (each estimate is the median of the sampled values) using MCMC and SMC ABC for the SIR model example.

parameter | true value | MCMC   | SMC ABC
β         | 0.9        | 0.8968 | 0.8964
γ         | 0.3333     | 0.3308 | 0.3304
We also compared credible intervals for the solution to the system of ODEs using MCMC and ABC. To do this, we followed the procedure in Gelman & Rubin [25] to simulate the posterior predictive distribution (PPD) for a future observation y. This produces a PPD whose variance depends on the posterior uncertainty (and hence on the observational noise). This is straightforward with MCMC, because we take samples from the posterior, solve the model and then simulate the noise, which has been estimated within MCMC. With the SMC ABC method, however, we take samples from the posterior and solve the model without simulating the noise. So, as the tolerance gets smaller and smaller, the predictive intervals also get smaller and do not have the correct coverage of the observed data. Figure 4 shows that the credible intervals obtained using SMC ABC are much narrower than those obtained using MCMC, highlighting that the variation within the sample does not contain useful information about the inferred uncertainty of the estimates. Note that this problem is not reduced by the tunable elements of the algorithm; for example, we tried several different perturbation kernels proposed by Filippi et al. [28], such as component-wise perturbation kernels that adaptively choose their width based on the previous population, and multivariate normal perturbation kernels that are sometimes useful when parameters are highly correlated. The resulting credible intervals were not significantly affected by the choice of perturbation kernel (comparison results not shown).
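The posterior predictive simulation just described can be sketched as follows; `solve_model` stands in for the ODE solve and is an assumed placeholder. For MCMC each posterior draw carries its own σ, so noise is simulated; ABC draws carry no σ, which is why their predictive bands collapse.

```python
import random

def posterior_predictive(samples, solve_model, seed=0):
    """For each posterior draw (theta, sigma): solve the model at theta,
    then add N(0, sigma^2) noise. Passing sigma = 0 mimics the ABC case,
    where no noise term is estimated or simulated."""
    rng = random.Random(seed)
    return [[m + rng.gauss(0.0, sigma) for m in solve_model(theta)]
            for theta, sigma in samples]
```

Pointwise quantiles of the returned draws give the shaded predictive bands of the figures.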
Figure 4.

Plot of the posterior predictive credible intervals estimated using MCMC and SMC ABC fitted with the weekly infection cases. The blue dots represent the noisy data. The shaded areas are created using the posterior predictive samples. It can be seen that the result derived from MCMC covered most of the data points while the ABC derived result produces unrealistically narrow credible intervals.

In addition, we used the method proposed in Prangle [33] within our R code, involving an adaptive distance function to improve the performance of the SMC ABC method. In this algorithm, the scale parameters ζ in the distance function (equation (2.4)) are updated in each iteration (calculated using the median absolute deviation, MAD) and are used to choose the value of the next ε. In principle, it might be possible to use a variant of this technique to choose ε so that the resulting sample reflects the shape and spread of the true posterior distribution. However, it is not clear how many rounds of adaptation would produce such an ε; in other words, it is not clear how and when to terminate the SMC ABC algorithm. In the literature there are several termination criteria one can use, such as stopping when the algorithm reaches a certain value of ε or a target acceptance rate; alternatively, one can treat a specified total number of simulations as a tuning parameter and terminate when that budget is exhausted [33]. One option available when using an adaptive kernel width is to stop the algorithm when the width of the kernel becomes negligible, or when 1 − (ε_t/ε_{t−1}) falls below some threshold. However, we found that none of these methods terminates the algorithm in such a way as to produce the correct shape and spread of the posterior distribution. Figure 5 illustrates that if we run the algorithm for 30 rounds, the estimated posterior shrinks towards a point estimate of the parameters. If we run the algorithm for six rounds, the resulting estimate is not a good representation of the true posterior: it is too wide and hence misleading. An estimated posterior distribution similar to that obtained using MCMC can be somewhat artificially generated by running the SMC ABC algorithm for 16 rounds, but at present the algorithm lacks an independent way of identifying this stopping point. As far as we are aware, there are as yet no clear guidelines for determining how many rounds of SMC ABC are needed, when dealing with an ODE model, to guarantee a good approximation to the true posterior of the parameters.
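The per-coordinate scales ζ used by the adaptive distance can be computed with the median absolute deviation. A minimal sketch of that step; the weighted-distance form is our reading of equation (2.4), not a verbatim reproduction:

```python
import math
import statistics

def mad_scales(sim_summaries):
    """zeta_k = median absolute deviation of coordinate k across the
    simulated summaries, used to rescale each coordinate of the distance."""
    scales = []
    for k in range(len(sim_summaries[0])):
        col = [s[k] for s in sim_summaries]
        med = statistics.median(col)
        scales.append(statistics.median([abs(x - med) for x in col]))
    return scales

def scaled_distance(z, y, scales):
    """Euclidean distance with each coordinate divided by its MAD scale."""
    return math.sqrt(sum(((zi - yi) / s) ** 2
                         for zi, yi, s in zip(z, y, scales)))
```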
Figure 5.

An illustration of SMC ABC for the SIR model using algorithm 4 in Prangle [33]. The red curve represents the posterior estimate resulting from the sixth round and is clearly overdispersed (by comparison to the sample obtained using MCMC—purple scatter points). The black curve, obtained after 16 rounds of SMC ABC, is the closest approximation to the posterior obtained using SMC ABC. The blue curve is the posterior estimate obtained after 30 rounds of SMC ABC, and has clearly shrunk too much around the true parameter values (dashed light blue lines).


The impact of the noise on the inference

To demonstrate the impact of the noise on the parameter inference, we applied MCMC and SMC ABC to observations y generated using different values of σ², specifically σ² ∈ {0.0001, 0.0005, 0.001}. We plot the resulting posterior marginals for both parameters in figure 6. The variance of the estimated posterior derived from MCMC increases as σ² increases, indicating that the posterior variance is affected by the amount of noise, as expected. In contrast, the variances of the estimated posteriors derived from SMC ABC are almost the same for small to moderate amounts of noise; when the noise parameter is increased further, only the location of the estimated posterior changes. This illustrates the point made in §2.2: these posteriors do not provide valid information about the uncertainty in the parameter estimates. As a result, conducting parameter estimation for ODE models using this ABC framework is not recommended.
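The expected behaviour of the MCMC posterior can be checked against a conjugate special case. For a normal likelihood with known variance σ² and a flat prior, the posterior standard deviation of the mean is σ/√n, so the posterior width grows in direct proportion to the noise level. This is a simplified stand-in for the ODE setting, not the paper's model:

```python
import math

def posterior_sd_normal_mean(sigma, n):
    """Flat-prior posterior sd of a normal mean from n observations with
    noise sd sigma: sigma / sqrt(n). Doubling the noise doubles the
    posterior width, the behaviour the MCMC posteriors show in figure 6."""
    return sigma / math.sqrt(n)
```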
Figure 6.

Plots of the estimated posterior marginal densities for the parameters β and γ obtained using MCMC (red) and SMC ABC (blue) with different amounts of noise. The black solid lines represent the true values of the parameters. It is clear that the variance of the posteriors derived from MCMC is affected by increasing the noise, but this is not the case for the posteriors obtained using SMC ABC.


Including the error term in the ABC algorithm

Including the error term in the ABC algorithm may overcome this limitation. Work by van der Vaart et al. [34], inspired by Wilkinson [35] and explained in §2, argued that the acceptance of the proposed parameters should be with respect to the error term rather than with respect to some tolerance level. In their method, they assumed that the error term follows a normal distribution. This method is promising and can capture posterior shapes similar to those derived from MCMC, as figure 7 shows. A significant drawback of this approach is that the acceptance rate is very low, so a large number of simulations is needed. This leads to a long computational time, which is prohibitive for ODE model parameter inference: the total CPU time when we applied the van der Vaart et al. [34] algorithm to the first test problem to obtain 1000 samples was 29 h, derived from 2 × 10⁶ simulations, far larger than the time required by the MCMC method for this example (5.25 min).
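The error-calibrated acceptance step can be sketched as follows. This is our schematic reading of the Wilkinson [35] idea as used by van der Vaart et al. [34]: a draw is retained with a probability given by the assumed Gaussian error density of its discrepancy, rather than by a hard tolerance. The function arguments and the toy example are placeholders, not the authors' implementation.

```python
import math
import random

def error_calibrated_abc(prior_sample, simulate, discrepancy, y, sigma,
                         n_accept, seed=0):
    """Accept theta with probability exp(-rho^2 / (2 sigma^2)), i.e. the
    N(0, sigma^2) error density at rho normalized by its maximum; a small
    sigma makes acceptance rare, hence the long run times reported."""
    rng = random.Random(seed)
    accepted, n_sims = [], 0
    while len(accepted) < n_accept:
        theta = prior_sample(rng)
        rho = discrepancy(simulate(theta, rng), y)
        n_sims += 1
        if rng.random() < math.exp(-rho * rho / (2.0 * sigma * sigma)):
            accepted.append(theta)
    return accepted, n_sims

# toy: uniform prior on [0, 1], deterministic model z = theta
acc, n = error_calibrated_abc(lambda r: r.random(), lambda t, r: t,
                              lambda z, y: abs(z - y), 0.5, 0.05, 50)
```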
Figure 7.

Scatter plot of posterior distribution sample draws for β and γ from MCMC (black) and error-calibrated ABC (red) for the SIR model example.


Example 2—nonlinear ODE model of malaria transmission

Work by White et al. [39], acknowledging the lack of reliable data in some countries where malaria control or elimination is particularly desirable, showed the utility of a compartmental mathematical model in predicting the effects of various elimination strategies, compared with the more complex models of Gu et al. [40] and Maire et al. [41]. The model describes the transmission of malaria using four population compartments, whose temporal evolution is governed by four ODEs. The model is illustrated in figure 8 and can be described mathematically by the following equations:
Figure 8.

The flow of individuals between susceptible, infected and recovered states in the model of White et al. [39].

The flow of individuals between susceptible, infected and recovered states in the model of White et al. [39]. Here λ is the force of infection, in which R0, the average number of secondary infections arising from a single infected individual in a susceptible population, is expressed as a function of time to incorporate the seasonal forcing associated with malaria transmission, and is of the form R0(t) = A cos(2π(t − ϕ)) + r0. The model is parametrized in terms of a number of constants, as described in table 3.
Table 3.

The parameter values used in simulation of the White et al. [39] model.

parameter        | value      | source
P (people)       | 29 203 486 | Worldometers [42]
L (years)        | 66.67      | Maude et al. [43]
dimm (years)     | 0.93       | Aguas et al. [44]
din (years)      | 0.11       | assumed
dtreat0 (weeks)  | 2          | Maude et al. [43]
p1               | 0.87       | Aguas et al. [44]
p2               | 0.08       | Aguas et al. [44]
A                | 0.67       | assumed
r0               | 1.23       | assumed
ϕ                | 3/12       | assumed
η0               | 0.11       | assumed
The observed data, y(t), are taken to be the number of observable clinical infections, C(t). For our purposes here, the parameter vector of interest is θ = (η0, din); in the case of MCMC, it is θ = (η0, din, σ²), where η0 is the proportion of individuals with clinical infection who receive treatment, din is the average duration of an untreated sensitive infection, and σ² is the variance of the noise associated with the data, which we assumed to be normally distributed.
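With the constants in table 3, the seasonal forcing term can be evaluated directly. A minimal sketch, assuming time t is measured in years to match the ϕ = 3/12 convention:

```python
import math

def r0_seasonal(t, A=0.67, phi=3.0 / 12.0, r0=1.23):
    """R0(t) = A*cos(2*pi*(t - phi)) + r0: transmission peaks once per year,
    at t = phi (mod 1), with amplitude A about the baseline r0 (table 3)."""
    return A * math.cos(2.0 * math.pi * (t - phi)) + r0
```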

Application: malaria in Afghanistan

Afghanistan is a landlocked country located between South Asia and Central Asia. Despite the fact that most of the country is desert, there is significant rainfall and snowfall [45], which provides a fertile environment for mosquito-borne diseases such as malaria. We use monthly data from cases registered nationwide across all regions of Afghanistan in the period from January 2005 to September 2015 from Anwar et al. [46] as shown in figure 9.
Figure 9.

Plot of the posterior predictive credible intervals from MCMC and SMC ABC fitted with the monthly Afghanistan data. The blue dots represent the data. The shaded areas are created by the posterior predictive samples. The result derived from MCMC covered most of the data points while the ABC methods were unable to cover most of the data.

In the ODE system in equation (3.4), I1 and I2 represent the numbers of infected individuals with no prior immunity and with prior immunity, respectively. However, in the case of the data from Afghanistan, each data point represents the total number of malaria cases presenting at hospitals in a month. In order to calculate the cumulative number of cases over time, we added an extra ODE in which W(t) is the cumulative number of observed (that is, treated) cases. To compute the number of new cases in each month, we subtract the cumulative case counts of consecutive months. The values of the model parameters used are shown in table 3, and the initial conditions are given by the equilibrium solution of the system in equation (3.4), with the addition of W(t = 0) = 0 for the new ODE. As with the first test problem, for the ABC approaches we used the discrepancy function in equation (3.2) to compare the clinical infections given in the dataset y with a simulated solution x. The priors for η0, din and σ² were taken as follows. A logistic transformation was used for η0 ∈ [0, 1], while a log transform was applied to din ∈ [0, ∞) and σ² ∈ [0, ∞), so that each transformed parameter had support over the real line; this step was used to improve the acceptance rate of the proposals. In the MCMC method, zero-mean normal proposal distributions were used, with standard deviations equal to (0.007, 0.07, 0.1) for the parameters (η0, din, σ²), respectively. The same form of proposal distribution was used with MCMC ABC, but with standard deviations equal to (0.1, 0.1) for the parameters (η0, din). All of the methods converged to similar values for both parameters under investigation (η0, din), as shown in table 4. SMC ABC consumed significantly more CPU time than the other methods, as shown in table 5. Moreover, SMC ABC needed 143 031 model simulations to obtain 500 accepted values, while MCMC and MCMC ABC needed just 5838 and 4050 iterations, respectively, to converge (tables 4 and 5).
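The step of recovering monthly case counts from the cumulative counter W(t) is a first difference. A minimal sketch:

```python
def monthly_cases(cumulative_w):
    """New observed cases per month from the cumulative treated-case
    count W sampled at consecutive month boundaries."""
    return [b - a for a, b in zip(cumulative_w, cumulative_w[1:])]
```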
Table 4.

The estimated values (the median of the posterior) of the parameters η0 and din from MCMC, MCMC ABC and SMC ABC for the Afghanistan data.

parameter | MCMC    | MCMC ABC | SMC ABC
η0        | 0.0525  | 0.04685  | 0.0459
din       | 0.23035 | 0.2483   | 0.2453
Table 5.

The number of iterations and computational time (min) for parameter inference in the malaria model, applied to the Afghanistan data.

method   | iterations | CPU time
MCMC     | 5838       | 45.13 min
MCMC ABC | 4050       | 39.99 min
SMC ABC  | 143 031    | 522.34 min
Since a real dataset is used here, the true parameter values are unknown. As a consequence, applying MCMC ABC was difficult, because this lack of information makes the choice of an appropriate ε problematic. In order to select a tolerance level, we adopted the method of van der Vaart et al. [34]: solve the ODE model with different proposals for the parameters drawn from the priors, compute the distances (equation (3.2)) between these solutions and the observed data, and choose the solution that minimizes this distance. We then used this best-fitting solution to estimate the MCMC ABC tolerance, which was ε = 116 230.8. We then applied SMC ABC for six populations with an adaptively chosen sequence of tolerances ε = (244 616.4, 244 616.4, 244 616.4, 176 677.1, 116 966.8, 100 042.7). The estimated joint posteriors of η0 and din can be seen in figure 10: all have the same shape and similar position, but the variances are very different. Figure 9 shows that the posterior predictive distribution from MCMC covers most of the data points; however, the predictive intervals from the ABC methods (here showing only the SMC ABC result from the last population) are very tight and cover the data points poorly. The reason is that, as the SMC ABC algorithm converges towards a point estimate without any consideration of the noise in the data, the tolerance gets smaller, and most of the accepted parameter values then come from a tight region around a point estimate (figure 10). Thus, the predictive intervals from the ABC methods do not achieve appropriate coverage of the data, consistent with the discussion in §2.2.
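The tolerance-selection heuristic described above amounts to taking the minimum discrepancy over pilot simulations from the prior. A minimal sketch, with a deterministic toy model as an assumption:

```python
def pilot_tolerance(prior_draws, solve, discrepancy, y):
    """Solve the model at each prior draw and return the smallest
    discrepancy to the observed data, used as the MCMC ABC tolerance."""
    return min(discrepancy(solve(theta), y) for theta in prior_draws)
```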
Figure 10.

Scatter plot of posterior distribution sample draws for η0 and din obtained using MCMC (a), MCMC ABC (b) and SMC ABC (c) for the model applied to the Afghanistan data. The contour lines contain the stated proportions of sample draws and were produced using the R function ‘HPDregionplot’.


Discussion

Our investigation into exact and approximate methods for inferring parameters in ODE dynamical systems under a Bayesian framework highlights some limitations of current methods. The main problem we identified is that the observations of the system are often noisy, so when we infer the parameters of such a system it is inappropriate not to simulate the noise process. Estimation of the noise parameter is standard in exact Bayesian inference (MCMC), but it is not current practice in the ABC-based approaches for systems of ODEs that we investigated here. The general idea when applying the ABC-based methods considered here to a dynamical system is to compare the noisy observations to solutions generated using the ODEs (a deterministic model that does not take the noise in the observations into consideration) for each proposed set of parameters. The parameters are accepted based on some tolerance ε that likewise does not depend on the noise term. To illustrate this limitation, we compared the popular methods MCMC, MCMC ABC and SMC ABC for estimating model parameters in ODEs. In the second example presented in this paper, the computational time consumed by MCMC ABC is shorter than that of the other methods (MCMC and SMC ABC). However, when dealing with a deterministic model, the estimated posteriors derived from current ABC methods do not provide useful estimates of the true posteriors; in particular, they do not contain appropriate information about the uncertainty of the parameter values. Being able to naturally quantify uncertainty in posterior distributions is one of the main advantages of Bayesian statistical inference over other approaches, given that the output of Bayesian inference is a probability distribution rather than a point estimate. Here, we have shown that the ABC methods are not able to capture the distribution, but instead converge to point estimates of the best parameter values.
To demonstrate the effect of not including the noise term in the estimation, we compared the results in the first example for different noise values. The distributions estimated using the ABC methods are not affected by the noise values (for low to moderate amounts of noise), which illustrates that the posterior variances estimated using ABC do not depend on the noise in the data. As a result, these posteriors do not represent the full probability distribution of the parameters under investigation. In an attempt to improve SMC ABC, we tried an adaptive SMC ABC on the first test problem using several different perturbation kernels proposed in Filippi et al. [28]: component-wise perturbation kernels and multivariate normal perturbation kernels that adapt their width from the previous population. Adapting the width of the kernels may affect the accuracy of the estimates, but it still does not capture the true posterior shape when compared with the posterior obtained using an MCMC method. We also tried another adaptive SMC ABC method, the adaptive distance function of Prangle [33] described in §2. This method did make it possible to obtain a distribution that more closely approximates the posterior using SMC ABC, but there are no existing criteria to identify an appropriate iteration at which to terminate the algorithm. Including the error term in the ABC algorithm may improve the ABC posterior, as we saw when we applied the van der Vaart et al. [34] method in §2; however, the long computational times required by this approach are a notable drawback for ODE model parameter inference. The comparison conducted in this paper demonstrates that using exact Bayesian inference (MCMC) for ODE parameter estimation is a practical alternative (Gelman et al. [47]), despite the difficulty involved in calculating the likelihood.
We found that MCMC gave accurate estimates of the parameter values, and the resulting posterior gave appropriate information about the uncertainty of the parameters. Furthermore, the variance of the MCMC posterior distribution changed as the noise in the data changed, as one would expect; the same was not true for the ABC methods considered. The time consumed by the MCMC algorithm was slightly larger than that of MCMC ABC; however, since the resulting posteriors were more appropriate, the extra effort to calculate the likelihood is deemed worthwhile. In addition, choosing an appropriate ε when applying MCMC ABC is difficult, especially when working with real data. With simulated data, it is possible to find an appropriate ε from the distance between the true solution of the ODE model and the noisy data, but this is not possible in a real application where the true solution is unknown. In this case, to determine an acceptable tolerance level, we adapted the approach of van der Vaart et al. [34] to find the best-fitting solution and from it an appropriate ε. We found that, among all the methods, SMC ABC is the easiest to implement, but it consumes the most computational time and, as we have observed, produces inappropriately shaped posterior distributions. The first example presented in this paper involves a likelihood function that is easy to compute, so a likelihood-based approach such as MCMC, or an importance sampling method like SMC, is certainly to be preferred over likelihood-free methods such as ABC. Most of the computational cost of the MCMC and SMC ABC methods is consumed in solving the ODE model many times: to compute the likelihood for MCMC, or to perform the simulations for SMC ABC. However, in the second example we found that more effort is needed to construct the likelihood function when applying MCMC.
In addition, when we chose an uninformative prior for the parameters, the SMC ABC algorithm located the appropriate region of the parameter space easily, while it was more difficult to choose initial parameters that achieve rapid convergence with MCMC. We would currently recommend that users of ABC methods be careful when applying them to ODEs, unless a sensible choice of error model and summary statistics can be made. Deciding what constitutes a sensible choice for the ABC algorithm remains difficult and is an important topic of current and future work.
Algorithm 3.

ABC rejection algorithm, pre-specification of ε.

1: while the number of accepted θ < N do
2:   Draw θ∗ ∼ p(θ).
3:   Simulate z from the model given θ∗.
4:   if ρ(z, y) ≤ ε then
5:     Accept θ∗.
6:   end if
7: end while
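Algorithm 3 can be sketched compactly. This is an illustrative version with an assumed toy setup (uniform prior, and a deterministic model whose output equals the parameter), not the authors' implementation:

```python
import random

def abc_rejection(prior_sample, simulate, discrepancy, y, eps, n_accept, seed=0):
    """Pre-specified-tolerance ABC rejection: draw from the prior,
    simulate, and keep draws whose discrepancy is at most eps."""
    rng = random.Random(seed)
    accepted = []
    while len(accepted) < n_accept:
        theta = prior_sample(rng)
        if discrepancy(simulate(theta, rng), y) <= eps:
            accepted.append(theta)
    return accepted

# toy: uniform prior on [0, 1], deterministic model z = theta
acc = abc_rejection(lambda r: r.random(), lambda t, r: t,
                    lambda z, y: abs(z - y), 0.5, 0.1, 25)
```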

Algorithm 4.

ABC rejection algorithm, post-determination of ε.

1: for t = 1 to T do
2:   Draw θ_t ∼ p(θ) and simulate z_t from the model given θ_t.
3:   Compute the discrepancy ρ_t = ρ(z_t, y).
4: end for
5: Sort {θ_t, ρ_t}_{t=1}^{T} into ascending order, based on ρ.
6: Keep the N = αT values of θ_t with the lowest discrepancy, hence defining ε.
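Algorithm 4's post-determination of ε keeps the best α fraction of draws, and the largest kept discrepancy then defines ε. A minimal sketch:

```python
def abc_keep_best(draws, distances, alpha):
    """Sort draws by discrepancy, keep the alpha*T best; the largest
    kept discrepancy defines eps after the fact."""
    n_keep = int(alpha * len(draws))
    order = sorted(range(len(draws)), key=distances.__getitem__)[:n_keep]
    kept = [draws[i] for i in order]
    eps = max(distances[i] for i in order)
    return kept, eps
```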

Algorithm 5.

MCMC ABC with early rejection [49].

1: Obtain θ_0 and z_0 using ABC rejection sampling.
2: for t = 1 to T do
3:   Draw θ∗ ∼ q(θ | θ_{t−1}).
4:   Compute r = [p(θ∗) q(θ_{t−1} | θ∗)] / [p(θ_{t−1}) q(θ∗ | θ_{t−1})].
5:   if U(0, 1) < r then
6:     Simulate z from the model given θ∗.
7:     if ρ(z, y) ≤ ε then
8:       Set θ_t = θ∗; else set θ_t = θ_{t−1}.
9:     end if
10:  else
11:    θ_t = θ_{t−1}.
12:  end if
13: end for
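Algorithm 5 can be sketched with a symmetric random-walk proposal, so the q terms in r cancel and r reduces to the prior ratio; the cheap prior test is applied before the expensive model simulation, which is the "early rejection". The uniform-prior toy setup is an assumption for demonstration:

```python
import math
import random

def mcmc_abc_early(theta0, log_prior, simulate, discrepancy, y, eps,
                   prop_sd, n_iter, seed=0):
    """MCMC ABC with early rejection: test the prior ratio first and only
    simulate from the model when that test passes; accept the move if the
    simulated data fall within tolerance eps of the observations."""
    rng = random.Random(seed)
    theta, chain = list(theta0), []
    for _ in range(n_iter):
        prop = [x + rng.gauss(0.0, s) for x, s in zip(theta, prop_sd)]
        # early rejection: for a symmetric q, r = p(prop) / p(theta)
        if math.log(rng.random()) < log_prior(prop) - log_prior(theta):
            if discrepancy(simulate(prop, rng), y) <= eps:
                theta = prop
        chain.append(list(theta))
    return chain

# toy: uniform prior on [0, 1], deterministic model z = theta
lp = lambda th: 0.0 if 0.0 <= th[0] <= 1.0 else float('-inf')
chain = mcmc_abc_early([0.5], lp, lambda t, r: t[0], lambda z, y: abs(z - y),
                       0.5, 0.2, [0.1], 2000)
```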

References (20 in total; the first 10 as listed):

1. Marjoram P, Molitor J, Plagnol V, Tavaré S. Markov chain Monte Carlo without likelihoods. Proc Natl Acad Sci USA, 2003.

2. Beaumont MA, Zhang W, Balding DJ. Approximate Bayesian computation in population genetics. Genetics, 2002.

3. Drovandi CC, Pettitt AN. Estimation of parameters for macroparasite population evolution using approximate Bayesian computation. Biometrics, 2011.

4. Wilkinson RD. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Stat Appl Genet Mol Biol, 2013.

5. van der Vaart E, Prangle D, Sibly RM. Taking error into account when fitting models using Approximate Bayesian Computation. Ecol Appl, 2018.

6. Gupta S, Hainsworth L, Hogg JS, Lee REC, Faeder JR. Evaluation of parallel tempering to accelerate Bayesian parameter estimation in systems biology. Proc Euromicro Int Conf Parallel Distrib Netw Based Process, 2018.

7. Toni T, Stumpf MPH. Simulation-based model selection for dynamical systems in systems and population biology. Bioinformatics, 2009.

8. Toni T, Welch D, Strelkowa N, Ipsen A, Stumpf MPH. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J R Soc Interface, 2009.

9. Filippi S, Barnes CP, Cornebise J, Stumpf MPH. On optimality of kernels for approximate Bayesian computation using sequential Monte Carlo. Stat Appl Genet Mol Biol, 2013.

10. Maude RJ, Pontavornpinyo W, Saralamba S, Aguas R, Yeung S, Dondorp AM, Day NPJ, White NJ, White LJ. The last man standing is the most resistant: eliminating artemisinin-resistant malaria in Cambodia. Malar J, 2009.
