
Approximate Bayesian inference for complex ecosystems.

Abstract

Mathematical models have been central to ecology for nearly a century. Simple models of population dynamics have allowed us to understand fundamental aspects underlying the dynamics and stability of ecological systems. What has remained a challenge, however, is to meaningfully interpret experimental or observational data in light of mathematical models. Here, we review recent developments, notably in the growing field of approximate Bayesian computation (ABC), that allow us to calibrate mathematical models against available data. Estimating the population demographic parameters from data remains a formidable statistical challenge. Here, we attempt to give a flavor and overview of ABC and its applications in population biology and ecology and eschew a detailed technical discussion in favor of a general discussion of the advantages and potential pitfalls this framework offers to population biologists.


Year: 2014 PMID: 25152812 PMCID: PMC4136695 DOI: 10.12703/P6-60

Source DB: PubMed Journal: F1000Prime Rep ISSN: 2051-7599

Introduction

Theoretical population biology has been crucial for our understanding of ecosystems [1]. Mathematical models can explain elegantly what might appear as bewilderingly complex variations in species abundances. Seminal work starting in the early 20th century [2-4] has, in fact, become so familiar to population biologists and beyond that today we are hardly surprised to see complex oscillatory patterns or complex dependencies of population dynamics on a myriad of environmental and demographic factors [5]. Many of these phenomena can straightforwardly be explained in terms of relatively simple population dynamics models. The success of these models has also meant that ecological ideas are coming to pervade the analysis of other interacting systems, including cancer [6], stem cells [7,8], and even the banking system [9,10], all of which are characterized by the interactions between different entities that affect the overall dynamics of the system and its stability. Simple models are beguiling and shape our intuition and allow us to explain trends in data. In many important scenarios, however, different factors come together with sometimes complex patterns resulting from their interplay. Thus, understanding realistic systems—subject to a multitude of internal and external factors—is hard [11,12]. This is further complicated in situations where models are used to make predictions or assess different types of interventions in silico prior to their implementation in, for example, conservation biology. 
These challenges are not unique to theoretical ecology, of course, and recent years have seen concerted efforts to tackle the so-called inverse problem: estimating parameters of a model from data [13]; choosing from among a set of plausible candidate models the model that is best able to explain the data [14]; or inferring mechanistic or statistical dependencies between the different state variables making up a system (in an ecological case, these would, for example, be the species considered in the model). Below, we consider population dynamical models of the form

$$\frac{dx}{dt} = f(x; \Theta), \qquad (1)$$

where a vector, $x = (x_1, \ldots, x_N)$, describes the abundances of the N species in the ecosystem. These are assumed to change as a result of interactions among the species (and potentially external factors) according to some rate laws, $f(x; \Theta)$, where, with slight abuse of notation, we will also implicitly allow for stochastic dynamics. The community matrix of the ecological system (1) is, of course, given by

$$A = (a_{ij}), \qquad a_{ij} = \left. \frac{\partial f_i(x; \Theta)}{\partial x_j} \right|_{x = x^*}, \qquad (2)$$

evaluated at an equilibrium $x^*$, which captures the ecological relationships among the species. Finally, the (vector-valued) parameter Θ denotes the typically unknown demographic and system parameters (for example, birth, death, and migration rates) as well as parameters characterizing the interactions between and within species. Below, we will discuss methods that allow us to infer the parameters, Θ, and choose between different potential models (for example, $f_1(x; \Theta_1), f_2(x; \Theta_2), \ldots$). The statistical toolset that we will discuss, centered primarily around ABC [15,16], complements traditional mathematical approaches that have been used in theoretical population biology to great effect since the 1950s. But the aim here is, rather than to focus on general mathematical laws governing the behavior and fate of natural populations, to make models as specific as possible to a given problem, to identify the key factors driving an ecosystem's dynamics, or to make predictions about the future of an ecosystem. There are well-defined statistical frameworks to deal with parameter inference.
Model selection, the process of comparing the ability of different models to explain some data, continues to attract the attention of statisticians and domain experts in different scientific disciplines [14]. But for many challenging real-world problems, conventional statistical approaches become computationally too cumbersome very quickly. This class of problems includes many stochastic processes, highly structured populations, and those where different types of data need to be considered. Often, it is still straightforward to establish simulation models (in general, real-world problems tend to defy purely analytical approaches), even when conventional statistical inference is computationally out of reach. Arguably, many of the most contentious problems in population biology (or science in general) fall into this category. A model abstracts from reality what are known or believed to be the essential features of a real natural (or technological or social) system. This fact alone has in the past added to some controversies: as “all models are wrong” [17], it is necessary to identify the model that best captures the dynamics of the real system and allows us to understand them quantitatively and qualitatively. Thus, we need statistical tools that allow us to deal with complex systems, many of which are expected to stretch conventional statistics. Here, we discuss a viable alternative, ABC, that maintains most if not all of the advantages of the Bayesian inferential apparatus but can be extended to problems defying conventional statistics.
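To make the rate laws f(x; Θ) and the community matrix introduced above concrete, the following minimal sketch uses a two-species Lotka-Volterra predator-prey model. The model, the parameter values, and the function names (`rate_law`, `community_matrix`) are illustrative assumptions and are not taken from the article; the community matrix is approximated by central finite differences at the coexistence equilibrium.

```python
def rate_law(x, theta):
    """Hypothetical rate law f(x; theta): Lotka-Volterra predator-prey dynamics."""
    prey, predator = x
    a, b, c, d = theta  # prey growth, predation, predator death, conversion
    return [a * prey - b * prey * predator,
            d * prey * predator - c * predator]


def community_matrix(f, x_eq, theta, h=1e-6):
    """Central-difference estimate of A[i][j] = d f_i / d x_j at the point x_eq."""
    n = len(x_eq)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        up, down = list(x_eq), list(x_eq)
        up[j] += h
        down[j] -= h
        f_up, f_down = f(up, theta), f(down, theta)
        for i in range(n):
            A[i][j] = (f_up[i] - f_down[i]) / (2.0 * h)
    return A


theta = (1.0, 0.5, 0.8, 0.2)          # assumed parameter values
a, b, c, d = theta
x_eq = [c / d, a / b]                 # coexistence equilibrium of this model
A = community_matrix(rate_law, x_eq, theta)
for row in A:
    print([round(v, 4) for v in row])
```

The off-diagonal entries have opposite signs, which is the predator-prey signature in the community matrix; the diagonal vanishes at this equilibrium because the sketch includes no self-limitation terms.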

Model calibration and parameter estimation

Given a model, f(x;Θ), and some data, D, we need to infer the parameters, Θ, from the data. The likelihood [18] is defined as the probability of obtaining some data D given a parameter value Θ',

$$L(\Theta') = \Pr(D \mid \Theta').$$

This is the central quantity in likelihood inference; crucially, the likelihood contains all the information about the parameter that can be extracted from the data D. In Bayesian inference [19], the likelihood, together with the prior distribution of Θ, Pr(Θ), strikes a balance between what is or can be known about the parameter prior to having seen the data, and the information contained in the data, to give rise to the posterior distribution,

$$\Pr(\Theta \mid D) = \frac{\Pr(D \mid \Theta)\,\Pr(\Theta)}{\Pr(D)}. \qquad (3)$$

Here, Pr(D) denotes the evidence. It is often thought of as a normalization constant but does in fact contain information about the ability of a model to describe the data. Obtaining the posterior distribution, or a sample from it, is computationally demanding. In general, computing the evidence Pr(D), which is typically a high-dimensional integral, is complicated. Sometimes, the focus, therefore, may shift from consideration of the whole (posterior) distribution to the maximum (mode) of the posterior distribution; this maximum a posteriori estimate is the Bayesian equivalent to the maximum likelihood estimate. To recover the additional information contained in the full distribution, a wealth of computational statistical approaches have been developed. Markov chain Monte Carlo (MCMC) methods have become the main workhorses of computational Bayesian statistics and have allowed us to generate samples from the posterior distribution. Recent years have witnessed increased interest in these and related methods, such as population and sequential Monte Carlo techniques, but even the most sophisticated approaches reach their limits when the number of parameters or the complexity of the model increases. The first problem, the so-called curse of dimensionality, is shared by all statistical inference procedures.
The second problem is more interesting. For example, we may ask ourselves whether there are simpler versions, $\tilde{f}(x; \tilde{\Theta})$, of the model that we are considering, f(x;Θ), that would, despite the simplification or coarse-graining (by simplification we typically mean that the dimension of the parameter vector is smaller in the simplified model, that is, $\dim(\tilde{\Theta}) < \dim(\Theta)$), allow us to draw meaningful, verifiable (or falsifiable) mechanistic insights from the available data. In principle, this might appear to exacerbate the statistical problem, for we would have to find computationally affordable and sufficiently discriminatory ways of deciding if and when a simpler model is a good approximation to the original model, f(x;Θ). We will return to this point again below. First, however, we discuss ABC methods, which form an alternative approach to tackling statistically challenging problems in a Bayesian framework and which have become a popular alternative to conventional (or exact) Bayesian inference in many applications, especially in evolutionary, population, and systems biology.
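For a one-dimensional parameter, the posterior and the evidence Pr(D) described in this section can be evaluated directly on a grid, which makes the roles of likelihood, prior, and normalization explicit. The sketch below assumes a toy data set (7 survivors out of 10 individuals), a binomial likelihood, and a uniform prior on the survival probability; none of these choices come from the article, and the function name `grid_posterior` is hypothetical.

```python
import math


def grid_posterior(k, n, grid_size=1001):
    """Posterior density of a survival probability on a regular grid over [0, 1]."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    width = 1.0 / (grid_size - 1)
    prior = [1.0] * grid_size                      # uniform prior density
    likelihood = [math.comb(n, k) * t ** k * (1.0 - t) ** (n - k) for t in thetas]
    # Evidence Pr(D): integral of likelihood times prior, here a simple Riemann sum.
    evidence = sum(l * p for l, p in zip(likelihood, prior)) * width
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
    return thetas, posterior, evidence


thetas, posterior, evidence = grid_posterior(k=7, n=10)
best = max(range(len(posterior)), key=posterior.__getitem__)
map_estimate = thetas[best]   # maximum a posteriori estimate
print(f"evidence = {evidence:.4f}, MAP = {map_estimate:.2f}")
```

With a flat prior the maximum a posteriori estimate coincides with the maximum likelihood estimate k/n. The grid approach scales exponentially with the number of parameters, which is one way to see why sampling-based methods are needed for realistic models.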

Approximate Bayesian computation

In ABC, we stay as close as possible [16] to the model of interest but forgo evaluation of the likelihood in favor of a comparison between simulated and real data [15,20,21]. For many systems, the likelihood becomes computationally intractable, either because of the model complexity or the detailed nature of the data. Nevertheless, the underlying model can still be simulated. The principal underlying insight of ABC is that we can consider

$$\Pr(D \mid \Theta) \approx \Pr(\Delta[D, D^*] \le \epsilon \mid \Theta), \qquad (4)$$

where $D^*$ is data obtained by simulating from our model with parameter Θ, Δ[x,x'] is a distance function that can be chosen flexibly to suit the problem at hand, and ε is a tolerance threshold that reflects the desired accuracy of our inference. The essential problem is that for any complicated problem, it is impossible to obtain the precise dataset, D, by simulating from the model f(x;Θ), even if we know the true parameter (we ignore the artificial problem of deterministic dynamics with no observational noise). If we increase the threshold ε, our inference becomes more approximate, but the chance of obtaining a simulated dataset for which $\Delta[D, D^*] \le \epsilon$ increases. The comparison of real and simulated data is particularly straightforward for ecological time-series data, for example [22]. Here, D might take the form of vectors of population abundances, $x(t) = (x_1(t), \ldots, x_n(t))$, for n species collected at t=1,…,m time-points. In this case, the Euclidean (or any other vector) norm provides a suitable distance. The analysis of dynamical systems is thus relatively straightforward in an ABC framework. Generalization to compartmental or spatio-temporal models (or both) is straightforward [23]: if we can simulate data efficiently, we can appeal to the Bayesian inference formalism via ABC (keeping in mind the nature of the approximation and the tolerance threshold ε). Instead of comparing the data, we can compare aspects of the data, such as summary statistics. This has been one of the main advantages as well as sources of contention for ABC inference.
We call a statistic, s, of the data sufficient if and only if

$$\Pr(\Theta \mid D) = \Pr(\Theta \mid s(D)).$$

In this case, we can replace the data by the sufficient statistic without any loss of information about the parameter, Θ. The attraction of using sufficient summary statistics lies in the fact that their dimension, $d_s$, is typically much smaller than the dimension of the data itself, $d_D$ (in the above example of n species sampled at m time points, $d_D = n \times m$); that is, $d_s \ll d_D$. Especially in population genetics, which has inspired the rise of ABC methods since the late 1990s, the use of summary statistics has been popular (see, for example, [24-28]). With the use of summary statistics, the likelihood can be written as

$$\Pr(D \mid \Theta) \approx \Pr(\Delta[s(D), s(D^*)] \le \epsilon \mid \Theta), \qquad (5)$$

with potentially an appropriate change in the distance function and ε. Although Equation 5 works very well for parameter inference if s is sufficient, it is important to note that sufficient statistics are few and far between for any real-world problem. Unfortunately, ABC requires appropriate sufficient statistics (or comparisons of the data directly, as in the case of time-series problems). There have been attempts to generate collections of statistics that together fulfil sufficiency properties [28-33], but these are computationally expensive in their own right. So far, we have implicitly considered ABC in a simple rejection framework: (a) we sample a parameter from a suitable prior, (b) we simulate the model for the parameter, and then (c) we compare the simulated and the real data (or their respective summary statistics) and accept a parameter as a draw from the ABC posterior if the distance is below some threshold. Steps (a) to (c) are repeated until a sufficiently large number of parameter values have been accepted. The posterior in this case is represented as a sum over indicator functions, $\sum_i \mathbb{1}(\Delta[y, y^*_i] \le \epsilon)$, where $y^*_i$ denotes the i-th simulated counterpart of $y$ and either $y = D$ or $y = s(D)$, depending on whether the data or sufficient summaries are used in the inference process.
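Steps (a) to (c) of the rejection scheme can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the implementation used in the cited work: the model (noisy discrete-time logistic growth with a single unknown growth rate r), the uniform prior on [0, 1], the tolerance, and all function names are hypothetical, and the "observed" data are themselves simulated with r = 0.5.

```python
import math
import random


def simulate(r, rng, steps=10, x0=10.0, capacity=100.0, noise=0.02):
    """Hypothetical model: logistic growth with lognormal process noise."""
    x, trajectory = x0, []
    for _ in range(steps):
        x = x + r * x * (1.0 - x / capacity)
        x = max(x * math.exp(rng.gauss(0.0, noise)), 0.0)
        trajectory.append(x)
    return trajectory


def distance(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def abc_rejection(observed, n_accept, eps, rng, summary=None, max_sims=100_000):
    """Return accepted growth rates and the number of simulations used."""
    s = summary if summary is not None else (lambda d: d)  # identity: compare raw data
    target = s(observed)
    accepted, n_sims = [], 0
    while len(accepted) < n_accept and n_sims < max_sims:
        r = rng.uniform(0.0, 1.0)                    # (a) draw from the prior
        simulated = simulate(r, rng)                 # (b) simulate the model
        n_sims += 1
        if distance(target, s(simulated)) <= eps:    # (c) compare and accept
            accepted.append(r)
    return accepted, n_sims


rng = random.Random(1)
observed = simulate(0.5, rng)   # synthetic data with known r = 0.5
sample, n_sims = abc_rejection(observed, n_accept=100, eps=30.0, rng=rng)
posterior_mean = sum(sample) / len(sample)
print(f"accepted {len(sample)} of {n_sims} simulations; mean r = {posterior_mean:.2f}")
```

Passing a `summary` function, for instance a hypothetical `lambda d: [d[-1]]` that keeps only the final abundance, switches the comparison from the full time series to summary statistics. Such a summary is not sufficient for r in this model, so the accepted sample will generally be broader than when the full trajectories are compared.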
This framework is as simple as it is impractical: like all rejection samplers, it is limited to small problems involving fewer than a handful of parameters. It has been possible to construct ABC-MCMC samplers [21], but the real workhorses of most ABC approaches to real-world problems are based on sequential Monte Carlo (SMC) approaches [22,34]; ABC-SMC has become a very popular field of research (arguably inspiring more detailed analysis also in exact SMC samplers), and recent developments are allowing us to tackle larger and more complicated systems [35]. The most widely used flavor of ABC-SMC proceeds by constructing a set of intermediate distributions that start from the prior and increasingly resemble the posterior. To do so, a sequence of decreasing thresholds, $\epsilon_1, \epsilon_2, \ldots, \epsilon_T$ (with $\epsilon_1 > \epsilon_2 > \ldots > \epsilon_T$), is defined and the sequence of distributions is constructed by sampling parameter vectors from the previous distribution (or the prior in the first step), perturbing them by using some perturbation kernel function, and accepting those parameter vectors for which the distance between real and simulated data falls below the threshold $\epsilon_t$. Choice of the thresholds and the nature of the perturbation kernels determine the computational efficiency and runtime of the inference, but both can be tuned to speed up the process and tackle larger problems [36,37].
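The sequential scheme described above can be sketched for a single parameter as follows. This is a simplified illustration with several assumptions that are not from the article: the same hypothetical noisy logistic model as a stand-in for f(x;Θ), a uniform prior on [0, 1], a Gaussian perturbation kernel whose variance is twice the weighted variance of the previous population (one common heuristic), a hand-picked threshold schedule, and importance weights that correct for sampling from the perturbed previous population rather than from the prior.

```python
import math
import random


def simulate(r, rng, steps=10, x0=10.0, capacity=100.0, noise=0.02):
    """Hypothetical model: logistic growth with lognormal process noise."""
    x, trajectory = x0, []
    for _ in range(steps):
        x = x + r * x * (1.0 - x / capacity)
        x = max(x * math.exp(rng.gauss(0.0, noise)), 0.0)
        trajectory.append(x)
    return trajectory


def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def abc_smc(observed, thresholds, n_particles, rng, lo=0.0, hi=1.0):
    """Return the final weighted particle population for the growth rate r."""
    particles, weights, sd = [], [], None
    for t, eps in enumerate(thresholds):
        if t > 0:
            mean = sum(w * p for w, p in zip(weights, particles))
            var = sum(w * (p - mean) ** 2 for w, p in zip(weights, particles))
            sd = math.sqrt(2.0 * var) or 1e-3       # kernel width; fallback if var == 0
        new_particles, new_weights = [], []
        while len(new_particles) < n_particles:
            if t == 0:
                candidate = rng.uniform(lo, hi)      # first population: sample the prior
            else:
                parent = rng.choices(particles, weights=weights)[0]
                candidate = rng.gauss(parent, sd)    # perturb a previous particle
                if not lo <= candidate <= hi:
                    continue                         # zero prior density
            if distance(observed, simulate(candidate, rng)) > eps:
                continue                             # fails the current threshold
            if t == 0:
                weight = 1.0
            else:
                prior_density = 1.0 / (hi - lo)
                proposal = sum(w * normal_pdf(candidate, p, sd)
                               for w, p in zip(weights, particles))
                weight = prior_density / proposal
            new_particles.append(candidate)
            new_weights.append(weight)
        total = sum(new_weights)
        particles, weights = new_particles, [w / total for w in new_weights]
    return particles, weights


rng = random.Random(2)
observed = simulate(0.5, rng)   # synthetic data with known r = 0.5
particles, weights = abc_smc(observed, thresholds=[60.0, 40.0, 25.0],
                             n_particles=100, rng=rng)
weighted_mean = sum(w * p for w, p in zip(weights, particles))
print(f"weighted posterior mean r = {weighted_mean:.2f}")
```

Because each population is proposed from the previous one, later and stricter thresholds are reached with far fewer wasted simulations than when every candidate is drawn from the prior.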

Model selection and checking

So far, we have assumed that we have a single model that describes our system of interest. The Bayesian framework readily provides us with credible intervals for parameters, but it is also possible to assign probabilities for different models to be the correct model, conditional on the available data and the set of competing models, $M_1, \ldots, M_q$. In the likelihood framework, the comparison of general (that is, non-nested) models is made possible only through the use of information criteria; in the Bayesian framework, the posterior probability of a model is given analogously to Equation 3 by [14]

$$\Pr(M_i \mid D) = \frac{\Pr(D \mid M_i)\,\Pr(M_i)}{\sum_{j=1}^{q} \Pr(D \mid M_j)\,\Pr(M_j)}, \qquad \Pr(D \mid M_i) = \int \Pr(D \mid \Theta_i, M_i)\,\Pr(\Theta_i \mid M_i)\,d\Theta_i, \qquad (6)$$

where $\Pr(D \mid M_i)$ is also known as the marginal likelihood of model $M_i$. In principle, Bayesian model selection allows us to compare any number of arbitrary models. An additional advantage is that the selection via the marginal likelihood, Equation 6, automatically strikes a balance between the ability of models to reproduce or explain the observed data, the complexity of the model, and the robustness of the inference. Equation 6 can be interpreted in the ABC framework [22,38,39], and ABC model selection has been an area of great interest and activity [40-45]. Although model selection is indeed straightforward if experimental and simulated data are compared directly, it has been shown that model selection becomes unreliable when summary statistics instead of the data are compared [46,47]: summary statistics are sufficient for model selection for only a very restricted set of problems. Constructing sets of statistics that are sufficient for model selection (they must be sufficient for every model considered and across the models; this is an area of active research [48,49]), while possible in principle, is computationally enormously demanding. In many ecological problems, however, we deal with spatio-temporal time-series data, for which model selection is possible. Our aim in such cases is typically to identify the most promising mechanistic descriptions of a complex system.
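In a rejection setting, ABC model selection can be sketched by treating the model label as one more quantity to be inferred: a model is drawn from the model prior, its parameter from the parameter prior, and the fraction of accepted simulations belonging to each model estimates its posterior probability. The two candidate models below (logistic and unbounded exponential growth), the priors, and the tolerance are illustrative assumptions; full time series are compared directly, in line with the caveat about summary statistics.

```python
import math
import random


def simulate_logistic(r, rng, steps=10, x0=10.0, capacity=100.0, noise=0.02):
    """Hypothetical model 1: density-dependent (logistic) growth."""
    x, trajectory = x0, []
    for _ in range(steps):
        x = x + r * x * (1.0 - x / capacity)
        x = max(x * math.exp(rng.gauss(0.0, noise)), 0.0)
        trajectory.append(x)
    return trajectory


def simulate_exponential(r, rng, steps=10, x0=10.0, noise=0.02):
    """Hypothetical model 2: density-independent (exponential) growth."""
    x, trajectory = x0, []
    for _ in range(steps):
        x = (x + r * x) * math.exp(rng.gauss(0.0, noise))
        trajectory.append(x)
    return trajectory


def distance(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def abc_model_selection(observed, models, n_accept, eps, rng, max_sims=200_000):
    """Estimate posterior model probabilities from accepted (model, parameter) draws."""
    names = list(models)
    counts = {name: 0 for name in names}
    accepted = n_sims = 0
    while accepted < n_accept and n_sims < max_sims:
        name = rng.choice(names)            # uniform prior over models
        r = rng.uniform(0.0, 1.0)           # uniform prior over the growth rate
        simulated = models[name](r, rng)
        n_sims += 1
        if distance(observed, simulated) <= eps:
            counts[name] += 1
            accepted += 1
    if accepted == 0:
        raise RuntimeError("no simulation was accepted; increase eps or max_sims")
    return {name: count / accepted for name, count in counts.items()}


rng = random.Random(4)
observed = simulate_logistic(0.5, rng)   # synthetic data from the logistic model
models = {"logistic": simulate_logistic, "exponential": simulate_exponential}
probabilities = abc_model_selection(observed, models, n_accept=200, eps=30.0, rng=rng)
print(probabilities)
```

The estimate depends on the tolerance: a very large ε accepts almost everything and simply returns the model prior, so the threshold has to be small enough for the models to be distinguishable.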
If no single model emerges from such a comparison, then we need to investigate those models that have comparably high marginal likelihoods. Simulations from the respective model posteriors can then be used, for example, to develop more discriminatory experimental designs that allow us to further distinguish among these models [50]. This, too, is an area of continuing importance for ABC.

Applicability of approximate Bayesian computation: an outlook

ABC methods were born out of a need to tackle problems that defy conventional statistical methodologies. It has become clear, however, that whenever suitable Bayesian alternatives that do deal with the proper likelihood are available, ABC is computationally too expensive by comparison. The reason for this is primarily the fact that the representation of the posterior (as a weighted sum over Dirac δ functions) is not very efficient. So when alternatives are available, they ought to be used. In parallel to their role in computationally demanding applications, ABC techniques have, more recently, also attracted attention as an inferential framework in their own right [16,51]. From this, interesting new approaches to deal with real-world problems may well emerge [52]. In conclusion, ABC-based methods are best suited to those problems for which other likelihood-based (or exact) Bayesian inference procedures do not yet exist. This appears to still include a host of challenging and interesting problems. Many stochastic and highly structured spatio-temporal problems in ecology, epidemiology, and evolutionary genetics clearly fall into this category. The recent developments discussed above mean that ABC has become a viable new way of tackling computationally demanding parameter inference problems. Given a model, and as long as we can simulate it, ABC gives us a handle to evaluate approximate posterior distributions, which then can be further evaluated. Sensitivity and robustness analyses, but also predictions of future behavior or the likely effects of any interventions or perturbations, can be carried out by simulating the model with parameters sampled from the posterior. There is enormous scope for basing the exploration of, for example, policy or conservation measures on the available data in this way. ABC has, for example, been used in experimental design [50,53] and in synthetic biology [14,54] to generate designs of molecular pathways that exhibit certain types of behavior.
In such cases, we replace the observed data, D, by a representation of the desired behavior (such as the desired abundance of a species). Then the inference procedure is used to identify the scenario for which we are most likely to observe this outcome. Such predictions then reflect the best available evidence in light of the data and the model. As an aside, it is worth keeping in mind that the technical challenges of statistical inference and modeling can often be minor compared with the difficulties in communicating the results to policy makers or the general public. Many of the most pressing problems in ecology have become highly emotive topics as they nearly always involve a conflict between parties that have very different priorities (see, for example, [45,55]). In many complicated situations, the nuance and cautiousness that accompany how we present such analyses could be taken for wavering or lack of reliability. Here, however, ABC, with its explicit focus on simulation, may even have an advantage, as the underlying rationale is so straightforwardly explained and easy to understand.
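The idea of assessing interventions by simulating the model with parameters sampled from the posterior can be sketched as follows. Everything in this example is an assumption made for illustration: the posterior sample is a hand-written placeholder rather than the output of a real analysis, the model is a noisy logistic growth model, and the intervention is a hypothetical proportional harvest.

```python
import math
import random
import statistics


def final_abundance(r, harvest, rng, steps=30, x0=50.0, capacity=100.0, noise=0.02):
    """Hypothetical model: noisy logistic growth with a proportional harvest rate."""
    x = x0
    for _ in range(steps):
        x = x + r * x * (1.0 - x / capacity) - harvest * x
        x = max(x * math.exp(rng.gauss(0.0, noise)), 0.0)
    return x


# Placeholder posterior sample of the growth rate r (assumed, not real results).
posterior_sample = [0.42, 0.45, 0.47, 0.50, 0.50, 0.52, 0.55, 0.58]


def predictive_summary(harvest, rng, n_rep=500):
    """2.5% quantile, median, and 97.5% quantile of the predicted final abundance."""
    outcomes = sorted(final_abundance(rng.choice(posterior_sample), harvest, rng)
                      for _ in range(n_rep))
    return (outcomes[int(0.025 * n_rep)],
            statistics.median(outcomes),
            outcomes[int(0.975 * n_rep)])


rng = random.Random(3)
baseline = predictive_summary(harvest=0.0, rng=rng)
harvested = predictive_summary(harvest=0.2, rng=rng)
print("no harvest:  ", [round(v, 1) for v in baseline])
print("20% harvest: ", [round(v, 1) for v in harvested])
```

The spread of each predictive interval combines parameter uncertainty (the posterior sample) with the stochasticity of the model itself, which is the kind of uncertainty statement that the simulation-based rationale makes easy to explain.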
