
The impact of experimental design choices on parameter inference for models of growing cell colonies.

Andrew Parker, Matthew J Simpson, Ruth E Baker.

Abstract

To better understand development, repair and disease progression, it is useful to quantify the behaviour of proliferative and motile cell populations as they grow and expand to fill their local environment. Inferring parameters associated with mechanistic models of cell colony growth using quantitative data collected from carefully designed experiments provides a natural means to elucidate the relative contributions of various processes to the growth of the colony. In this work, we explore how experimental design impacts our ability to infer parameters for simple models of the growth of proliferative and motile cell populations. We adopt a Bayesian approach, which allows us to characterize the uncertainty associated with estimates of the model parameters. Our results suggest that experimental designs that incorporate initial spatial heterogeneities in cell positions facilitate parameter inference without the requirement of cell tracking, while designs that involve uniform initial placement of cells require cell tracking for accurate parameter inference. As cell tracking is an experimental bottleneck in many studies of this type, our recommendations for experimental design provide for significant potential time and cost savings in the analysis of cell colony growth.


Keywords:  approximate Bayesian computation; cell migration; cell proliferation; cell spreading; experimental design

Year:  2018        PMID: 30225025      PMCID: PMC6124093          DOI: 10.1098/rsos.180384

Source DB:  PubMed          Journal:  R Soc Open Sci        ISSN: 2054-5703            Impact factor:   2.963


Introduction

The study of how cell populations grow and spread is integral to understanding and predicting the invasion of cancer, the speed of wound repair and the robustness of embryonic development [1-3]. However, the extent to which cell populations grow and spread is governed by multiple processes, including motility, proliferation, adhesion and cell death, making it difficult to elucidate the relative contributions of these processes to the growth and invasion of a cell colony [4]. As such, in vitro cell biology assays are routinely used to probe the mechanisms by which cells interact, and the key processes involved in the growth and expansion of cell colonies. These in vitro assays generally involve seeding a population of cells on a two-dimensional substrate, and observing the population as the individual cells move and proliferate and the density of the monolayer increases towards confluence. A useful approach to interpret the results of these assays involves using a mathematical model that incorporates mechanistic descriptions of processes such as cell motility and proliferation. By parametrizing and validating the models using quantitative data from in vitro assays, it is possible to provide quantitative insights into the mechanisms driving the growth and spreading of a cell population, and make experimentally testable predictions. However, it is not always clear how best to choose the experimental design, nor which summary statistics of the data to collect, in order to accurately and efficiently parametrize and validate models. In this work, we use a two-dimensional lattice-based exclusion process model that incorporates both motility and proliferation mechanisms. Our goal is to assess how our ability to accurately infer model parameters is affected by changes in the experimental design. 
Parameter inference is performed in a Bayesian framework using approximate Bayesian computation (ABC), allowing us to quantify the uncertainty of our parameter estimates and bypass the need to compute a likelihood function for the mechanistic model. By quantifying the information gain using the different experimental protocols, we are able to provide guidelines for experimental design in terms of the selection of experimental geometry and the collection of relevant quantitative summary statistics from imaging data.

Experimental design

Typically, there are two main types of two-dimensional in vitro experiments that are considered at the level of the population. The first experiment, shown in figure 1a, is often referred to as a growth-to-confluence assay. Here, we observe a population of cells seeded, initially at low density, as the cells move and proliferate and the population increases in number to eventually occupy the whole domain under observation [7,8]. The second experiment we consider is shown in figure 1b and is often referred to as a scratch assay. It involves perturbing a population of cells at, or near, confluence by scraping away a region of the population and observing the resulting spread of the population into this, now empty, region [9,10]. In this work, we will explore the extent to which models of these two types of assays can be parametrized using various summary statistics of these experimental data.
Figure 1.

Two experimental designs considered in this work. (a) Images from a growth-to-confluence assay using MDA MB 231 breast cancer cells. See [5] for more information. (b) Images from a scratch assay with PC3 prostate cancer cells. See [6] for more information. (c) Schematics of the model. We use an on-lattice model, as described in the text, and vary the initial condition to replicate the experiments in (a,b): we seed cells uniformly at random over 24 rows (left), 12 rows (middle) or six rows (right). The images in (a) are reproduced from Simpson et al. [5] with kind permission, whereas the images in (b) are reproduced from Johnston et al. [6] with kind permission.


Approximate Bayesian computation and summary statistics

Parameter inference is generally approached in one of two ways: through either a frequentist or a Bayesian framework [11,12]. In frequentist inference, one typically seeks a point estimate of a parameter through maximum-likelihood estimation, and captures uncertainty in the estimate through the generation of confidence intervals. A Bayesian approach instead derives a posterior distribution for the model parameters given observed data [13]. The posterior, π(θ | D), satisfies π(θ | D) ∝ L(D | θ) π(θ), where L(D | θ) is the likelihood of obtaining the data, D, from the given model and parameters, θ, and the prior, π(θ), captures any previous knowledge of the parameters. For the mechanistic model we use in this work, the likelihood is intractable and so we use ABC to generate an approximate posterior distribution. ABC has been used previously in the biological sciences, particularly to target complex problems in systems biology [14,15], population genetics [16] and ecology [17]. There is also growing popularity in the use of ABC specifically in the investigation of cell biology processes [18-20]. ABC relies on repeated simulation of the model using parameters drawn from the prior distribution, and the acceptance of these parameter sets whenever the model output is sufficiently close to the experimental data. These accepted parameter sets are then used to estimate the posterior distribution. In using ABC in practice, there are several important user choices, most importantly the means of comparison between simulated and experimental data (typically through summary statistics and a suitable distance metric) and the threshold used for accepting or rejecting parameter sets [21,22]. In particular, the summary statistics represent lower-dimensional descriptions of the data, and their choice is vitally important as they impact the amount of information gained about model parameters through the use of ABC [23].
The major new insights provided in this work are a quantitative understanding of how both the choices of experimental geometry and summary statistics impact the quality of the posteriors generated using ABC. The use of ABC rejection allows us to compare the quality of the posteriors resulting from a wide range of possible summary statistics in a computationally efficient way, and we quantify the information gain using the Kullback–Leibler divergence [24]. We also demonstrate how data-cloning ABC (ABC-DC) [25] can potentially be used to obtain maximum-likelihood estimates (MLEs) of model parameters more efficiently than ABC rejection. Instead of targeting the likelihood of the observed data, the data cloning approach targets the likelihood corresponding to a large number of copies (also known as clones) of the data, where each data clone is assumed independent of the others. Data cloning results in a posterior distribution that has the MLE as its mean, and the variance can be related to the asymptotic variance of the MLE [26]. ABC-DC uses ABC Markov chain Monte Carlo with data cloning to facilitate convergence towards a MLE [25].

Aims and outline

The aim of this work is to demonstrate the use of Bayesian inference methods to characterize the motile and proliferative behaviour of individual cells within growing cell colonies. In particular, we aim to estimate parameters of a lattice-based volume exclusion model for cell colony growth in a variety of experimental geometries. In addition, we demonstrate the use of ABC-DC for more efficient maximum-likelihood estimation. In §2, we introduce our model, and the various ABC algorithms and summary statistics we employ in this work. In §3, we assess how experimental design choices impact the quality of estimated posterior distributions, using both ABC rejection and ABC-DC, and we conclude with a discussion of our results in §4. To confirm the accuracy of our inference process, and the relevance of the mechanistic model to the experimental data, we conclude by sampling from the posterior distributions and exploring whether key summary statistics predicted by the parametrized model, such as the size of the population and the displacement of individual cells, reflect our experimental observations within some reasonable confidence interval.

Methods

Mechanistic model

We employ a simple two-dimensional lattice-based exclusion process model akin to that of Simpson et al. [27], whereby N(t) cells occupy a square lattice with R rows and C columns at time t. During each time step of duration τ, we choose N(t) cells at random, with replacement, to attempt movement and proliferation events into orthogonally adjacent lattice sites with probabilities Pm and Pp, respectively. For each chosen cell, we draw a uniform random number, r ∼ U(0, 1). If r ≤ Pm, the cell attempts a movement event into one of the four orthogonally adjacent lattice sites with equal probability, and if Pm < r ≤ Pm + Pp the cell attempts a proliferation event, whereby a daughter cell is placed into one of these lattice sites, each with equal probability. If Pm + Pp < r ≤ 1, no movement or proliferation event is attempted. If a cell attempts to move, or to place a daughter cell, into an occupied lattice site or outside of the domain, the attempted event is aborted. These parameters in the discrete model are related to the classical diffusion coefficient, D, and cell proliferation rate, λ, via D = lim Δ,τ→0 PmΔ²/(4τ) and λ = lim τ→0 Pp/τ [28]. To replicate experimental images, we take R = 24, C = 32, where lattice sites have length Δ = 18.75 μm (corresponding to the approximate diameter of the cells considered in typical experiments). Simulations are initialized with cell positions randomly distributed in the first rows of the domain, with the number of initially occupied rows chosen to mimic potential experimental conditions. To interpolate between the growth-to-confluence and scratch assay designs, we choose three initial conditions (figure 1c), initializing cells uniformly at random across either 24, 12 or six rows of the domain. We note that, in reality, the size of the experimental domain is much larger than that captured in the experimental images [6].
To confirm that our results are insensitive to this choice of domain size, we repeated our investigations with simulations carried out on a larger domain, with data collected on a subset of this domain corresponding to the experimental image size. Results are shown in electronic supplementary material, figure S6.
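The update rule described above can be sketched in a few lines of code. This is a minimal illustration of the exclusion process, not the authors' implementation; the function and parameter names (simulate_colony, init_rows, n_init) are our own, and the handling of cells whose lattice positions become stale within a time step is one reasonable choice among several.

```python
import numpy as np

def simulate_colony(P_m, P_p, T_steps=288, R=24, C=32, init_rows=6, n_init=24, rng=None):
    """One realization of the lattice-based exclusion process: each time step,
    N(t) cells are chosen at random with replacement and attempt to move
    (prob P_m) or proliferate (prob P_m < r <= P_m + P_p) into one of the four
    orthogonal neighbours; events into occupied or out-of-domain sites abort."""
    rng = np.random.default_rng(rng)
    occ = np.zeros((R, C), dtype=bool)
    # seed n_init cells uniformly at random in the first init_rows rows
    sites = rng.choice(init_rows * C, size=n_init, replace=False)
    occ[np.unravel_index(sites, (init_rows, C))] = True
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    for _ in range(T_steps):
        cells = np.argwhere(occ)
        n = len(cells)
        for idx in rng.integers(0, n, size=n):  # sample cells with replacement
            r = rng.random()
            if r > P_m + P_p:
                continue                        # no event attempted
            i, j = cells[idx]
            if not occ[i, j]:                   # cell already moved this step
                continue
            di, dj = moves[rng.integers(4)]
            ti, tj = i + di, j + dj
            if 0 <= ti < R and 0 <= tj < C and not occ[ti, tj]:
                if r <= P_m:                    # movement: vacate origin site
                    occ[i, j] = False
                occ[ti, tj] = True              # move target, or daughter cell
    return occ

occ = simulate_colony(P_m=0.25, P_p=0.0025, rng=1)
print(occ.sum())  # final cell number N(T), at least the initial 24
```

Because there is no cell death in the model, the final cell number is bounded below by the initial number and above by the lattice size R × C.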

In silico data

As our aim in this work is to better understand how experimental design impacts our ability to infer model parameters, we use our mechanistic model to generate in silico (observed) data that closely replicate that available from experiments (figure 1a,b) and attempt to infer the parameters used to generate these in silico data. We use a time step of τ = 1/24 h, and model parameters Pm = 0.25 and Pp = 0.0025 (motivated by estimates for the diffusivity and proliferation rates from similar experiments [6]). We record the final positions of cells in our in silico experiments after T = 12 h, equating to 288 time steps. We also record trajectory data for five randomly chosen cells by recording their positions every eight time steps (corresponding to every 20 min [5]). As experiments are typically repeated several times to ensure reproducibility of results [5], we repeat simulations M = 10 times and average the resulting statistics. To confirm the consistency of our results for larger values of the final simulation time, T, we repeat our methodology for longer periods of time, either T = 24 h or T = 36 h. Results are shown in the electronic supplementary material, figure S5.

Inference

We use ABC to estimate posterior distributions for the model parameters, θ = (Pm, Pp). ABC rejection is performed by repeatedly sampling from the prior distribution, π(θ), simulating the model, and accepting parameters that result in simulation output sufficiently close to the observed data. The accepted parameters are used to compute the approximate posterior distribution. In order to assess how close the simulated and observed data are, we consider a range of summary statistics (detailed in §2.3.2) and an appropriate distance function (described in §2.3.3). In order to quantify the performance of different summary statistics in inferring model parameters, we choose a uniform (uninformative) prior, subject to the constraint Pm + Pp ≤ 1.

Approximate Bayesian computation rejection

Algorithm 1 describes the ABC rejection algorithm we use in this work. In algorithm 1, we avoid directly specifying an acceptance threshold, ε, by accepting the 1st percentile of the samples (ranked in terms of the distance between the simulated and observed data) and taking the number of samples, K, sufficiently large to obtain an accurate approximation to the true posterior. The accepted samples are used to estimate a posterior distribution using bivariate kernel density estimation with a Gaussian kernel [29].
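The percentile-based acceptance rule can be sketched as follows. Rather than reproduce the lattice model, this example applies the scheme to a toy Gaussian-mean problem; the function names and the toy likelihood are illustrative assumptions, not the paper's algorithm 1 verbatim.

```python
import numpy as np

def abc_rejection(simulate, distance, prior_sample, s_obs, K=1000, quantile=0.01, rng=None):
    """ABC rejection without an explicit threshold: draw K parameter sets from
    the prior, simulate, and keep the `quantile` fraction whose simulated
    summary statistics are closest to the observed data."""
    rng = np.random.default_rng(rng)
    thetas = np.array([prior_sample(rng) for _ in range(K)])
    dists = np.array([distance(simulate(th, rng), s_obs) for th in thetas])
    keep = dists <= np.quantile(dists, quantile)  # 1st percentile by default
    return thetas[keep]

# toy example: infer the mean of a Gaussian from its sample mean
s_obs = np.random.default_rng(0).normal(2.0, 1.0, 100).mean()
accepted = abc_rejection(
    simulate=lambda th, rng: rng.normal(th, 1.0, 100).mean(),
    distance=lambda s, s0: abs(s - s0),
    prior_sample=lambda rng: rng.uniform(-10, 10),
    s_obs=s_obs, K=5000, rng=1)
print(accepted.mean())  # accepted samples concentrate near the true mean, 2
```

In the paper the accepted samples would then be smoothed with bivariate Gaussian kernel density estimation to produce the posterior surface.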

Summary statistics

In order to compare observed and simulated data, we reduce the dimension of the data using summary statistics. How best to choose summary statistics for parameter inference is an ongoing research question, and some automated procedures have been developed for this choice, typically through either minimizing a loss of information function [30], or maximizing the gain of information [23] through a measure such as the Kullback–Leibler divergence [24].

Kullback–Leibler divergence. The Kullback–Leibler divergence is a measure of the relative difference between two continuous distributions, F and G, with densities f and g, and is defined as IKL(F, G) = ∫ f(θ) log[ f(θ)/g(θ) ] dθ. In this work, we generate bivariate posteriors and priors which are each functions of θ = (Pm, Pp). We discretize these distributions onto a fine mesh, 512 × 512 in size, for plotting, and we compute the Kullback–Leibler divergence, IKL, by numerically integrating over both dimensions.

The observed data consist of the positions of all cells at the terminal time of the assay, together with the tracks of five randomly chosen cells, i = 1, …, 5. We let N(t) be the number of cells at time t = τ × n, i.e. at the nth iteration of a simulation, so that n = 288 corresponds to the final simulation time, T = 12 h. We also let xi(t) = (Xi(t), Yi(t)) be the (x, y) lattice coordinates of cell i at time t, so that |δXij(t)| = |Xi(t) − Xj(t)| is the horizontal distance between cells i and j at time t, and similarly for δYij. We summarize these data using statistics motivated by random walk models [5,20,23], considering thirteen statistics in total, labelled as in table 1.
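The numerical integration of the Kullback–Leibler divergence over a discretized mesh can be sketched as follows. The sanity check against a pair of zero-mean Gaussians, for which the divergence has a closed form, is our own addition and not part of the paper.

```python
import numpy as np

def kl_divergence(post, prior, dx, dy):
    """I_KL = integral of post * log(post / prior) over the discretized
    parameter mesh, approximated by a Riemann sum with cell area dx * dy."""
    mask = post > 0  # 0 * log(0) contributes nothing to the integral
    return np.sum(post[mask] * np.log(post[mask] / prior[mask])) * dx * dy

# sanity check: KL( N(0, I) || N(0, 4I) ) in 2-D has the closed form
# 2*log(2) - 3/4 (approx. 0.636), which the mesh integration should recover
x = np.linspace(-8, 8, 512)
y = np.linspace(-8, 8, 512)
dx, dy = x[1] - x[0], y[1] - y[0]
X, Y = np.meshgrid(x, y)
gauss = lambda X, Y, s: np.exp(-(X**2 + Y**2) / (2 * s**2)) / (2 * np.pi * s**2)
post, prior = gauss(X, Y, 1.0), gauss(X, Y, 2.0)
print(kl_divergence(post, prior, dx, dy))  # approx. 0.636
```

A 512 × 512 mesh, matching the discretization reported in the text, is more than fine enough for the Riemann sum to agree with the analytic value to several decimal places.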
Table 1.

A list of the summary statistics and the corresponding notation adopted. Note, apart from cell trajectory statistics, all summary statistics are evaluated at the final time, T.

summary statistic | notation
number of cells | N
size of largest cluster, k-connected | κk, k = 4, 8
binning variance, bin size = k | Qk, k = 2, 4, 8
Manhattan displacement of cell i | ∥xi∥
Manhattan tortuosity of cell i | Γi
smallest gyration tensor eigenvalue | λ
correlation function | CXY(l)
normalized two-dimensional correlation function | ĈXY(l)
one-dimensional correlation function | CY(l)
normalized one-dimensional correlation function | ĈY(l)
The first summary statistic we consider is the final cell number, N(T).

In order to assess whether cells are forming clusters, the second and third summary statistics we consider relate to the size of the largest cluster of cells, computed using the Matlab function bwconncomp. We consider either 4-connected clusters (cells orthogonally adjacent are part of a single cluster), κ4(T), or 8-connected clusters (cells diagonally adjacent are also part of a single cluster), κ8(T).

Summary statistics four to six correspond to the binning variance [5], Qk(T), which quantifies the deviation from the average number of cells expected in quadrats of the domain. It is computed from n(b, T), the number of cells in bin b at time T, where B is the number of bins when the bin width is k.

As summary statistics seven and eight, we consider trajectory statistics from the five tracked cells (labelled i = 1, …, 5): either the total Manhattan displacement of each cell, ∥xi∥, that is, the sum of the horizontal and vertical distances moved over the trajectory, or the Manhattan tortuosity of the trajectory, Γi, which compares the total path length with the net displacement.

Summary statistic nine is the smallest eigenvalue, λ, of the gyration tensor [31], which quantifies the spread of cells along the lesser of the two principal axes.

Summary statistics ten and eleven consider the distribution of pairs of cells, using the pairwise correlation functions, where the argument l indicates the number of pairs of cells separated by distance l. We consider pairwise correlations that measure only the vertical separation between cells, CY(l) (due to the heterogeneity of the initial condition in the y direction), or the total separation of cells, CXY(l); in each case, pairs of cells separated by lattice distance l are counted using the indicator function, 1{·}.

Finally, summary statistics twelve and thirteen are the correlation functions normalized by the expected number of pairs at each distance l = 1, …, L, which accounts for the density of cells in the domain, where C and R are the numbers of columns and rows in the lattice, respectively.
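Two of the position-based summary statistics can be sketched as follows. The paper's exact formulas are not reproduced in the text above, so these are plausible implementations of the verbal definitions (pair counts by vertical lattice separation, and the variance of quadrat counts); the function names are our own.

```python
import numpy as np
from itertools import combinations

def correlation_1d(occ, L=24):
    """C_Y(l): number of pairs of occupied sites whose rows differ by l."""
    rows = np.argwhere(occ)[:, 0]
    C_Y = np.zeros(L, dtype=int)
    for yi, yj in combinations(rows, 2):
        l = abs(int(yi) - int(yj))
        if 1 <= l <= L:
            C_Y[l - 1] += 1
    return C_Y

def binning_variance(occ, k):
    """Q_k: variance of cell counts over the k x k quadrats of the domain
    (assumes the lattice dimensions are divisible by k)."""
    R, C = occ.shape
    counts = occ.reshape(R // k, k, C // k, k).sum(axis=(1, 3))
    return counts.var()

# example: 48 cells seeded uniformly at random on the 24 x 32 lattice
rng = np.random.default_rng(0)
occ = np.zeros((24, 32), dtype=bool)
occ[np.unravel_index(rng.choice(24 * 32, 48, replace=False), (24, 32))] = True
print(correlation_1d(occ).sum(), binning_variance(occ, 8))
```

The total of the pair counts over all separations is bounded by the number of cell pairs, N(N − 1)/2, which provides a quick consistency check.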

Computing distances

Experiments are typically performed multiple times in similar conditions, so we simulate multiple datasets, and call each set a replicate. The observed data consist of M replicates, and the simulated data consist of M replicates for each of K samples. From the simulated data, we compute each summary statistic, sj(l), where j = 1, …, 13 indexes the 13 different summary statistics, and l = 1, …, L indexes the elements of each statistic (L = 24 for the pairwise correlation function statistics because there are 24 rows in the lattice, and L = 1 otherwise). First we average over the replicates, then we compute the median absolute deviation (MAD) for each statistic, MADj(l). For a univariate dataset, X = {Xk}, this is defined as MAD = median(|Xk − median(X)|). We then compute the distance for each statistic, j = 1, …, 13, as the absolute difference between the simulated and observed statistics, normalized by MADj(l) and averaged over the L elements. Note that this choice of distance function weights vectors of summary statistics of different lengths equally, which is of particular importance when performing parameter inference using combinations of summary statistics, as in this work. We consider combinations of A = 1, 2 or 3 summary statistics from the 13 statistics listed in table 1, summing the corresponding individual distances, where the subscript a indexes the specific summary statistics used.
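A sketch of the MAD-normalized, length-averaged distance described above; the exact form of the paper's distance function is not shown in the extracted text, so the element-wise averaging here is an assumption consistent with the stated equal weighting of statistics of different lengths.

```python
import numpy as np

def mad(x, axis=None):
    """Median absolute deviation: median(|X - median(X)|)."""
    med = np.median(x, axis=axis, keepdims=True)
    return np.median(np.abs(x - med), axis=axis)

def statistic_distance(s_sim, s_obs, mad_l):
    """MAD-normalized distance between simulated and observed summary
    statistics of length L, averaged over the L elements so statistics of
    different lengths carry equal weight."""
    return np.mean(np.abs(np.asarray(s_sim) - np.asarray(s_obs)) / mad_l)

# synthetic example: 500 simulated samples of a length-24 count statistic
sims = np.random.default_rng(0).poisson(30, size=(500, 24))
mad_l = mad(sims, axis=0)                      # per-element MAD, shape (24,)
print(statistic_distance(sims[0], sims.mean(axis=0), mad_l))
```

For a combination of A statistics, the individual distances computed this way would simply be summed.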

Regression adjustment

After performing ABC rejection (and during ABC-DC), we perform regression adjustment to derive a more accurate estimate of the posterior distribution [16,25,32]. We assume that the accepted parameters, θk, satisfy the following local-linear regression model: θk = α + (sk − sobs)Tβ + ξk for k ∈ I = {k : dk < δ}, where B = |I|, sk denotes the vector of summary statistics considered in the distance function (2.15), sobs is the corresponding vector of observed summary statistics, and the ξk are uncorrelated Gaussian random variables with zero mean and common variance σ². The regression model is solved to find the least-squares estimates, (α̂, β̂), and the accepted parameter sets are adjusted according to θk* = θk − (sk − sobs)Tβ̂.
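A minimal sketch of this style of local-linear regression adjustment (in the spirit of Beaumont-type ABC post-processing); the synthetic one-parameter, one-statistic example is our own and the function name is illustrative.

```python
import numpy as np

def regression_adjust(thetas, stats, s_obs):
    """Fit theta_k = alpha + (s_k - s_obs) @ beta + xi_k by least squares over
    the accepted samples, then shift each theta to its fitted value at the
    observed statistics: theta_k* = theta_k - (s_k - s_obs) @ beta_hat."""
    X = np.column_stack([np.ones(len(stats)), stats - s_obs])
    coef, *_ = np.linalg.lstsq(X, thetas, rcond=None)
    beta_hat = coef[1:]
    return thetas - (stats - s_obs) @ beta_hat

# accepted parameters whose summary statistic tracks the parameter plus noise
rng = np.random.default_rng(0)
theta = rng.uniform(0, 1, 200)
stats = theta[:, None] + rng.normal(0, 0.05, (200, 1))  # one summary statistic
adj = regression_adjust(theta, stats, s_obs=np.array([0.5]))
print(adj.std() < theta.std())  # adjustment tightens the accepted sample
```

The adjustment removes the component of posterior spread that is explained by the discrepancy between simulated and observed summary statistics, which is why it sharpens the estimated posterior.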

Data-cloning approximate Bayesian computation

Data-cloning ABC involves considering a dataset, D(K), containing K clones of the experimental data, D, that is, D(K) = (D, …, D), where each clone is assumed independent of the others. The likelihood of the cloned data is then L(D(K) | θ) = L(D | θ)^K [25,26]. Hence, the posterior distribution resulting from the cloned data satisfies π(θ | D(K)) ∝ L(D | θ)^K π(θ), and the MLE of θ is equivalent to the mean of π(θ | D(K)) as K → ∞ [26]. Intuitively, we can see this by reasoning that, if all of the independent model-generated datasets are close to the experimental data, we are K times as likely to have selected a sensible candidate parameter. The ABC-DC algorithm was proposed by Picchini et al. [25], and it uses ABC Markov chain Monte Carlo [33]. The algorithm works in two stages: first, the acceptance threshold, ε, is decreased according to a tolerance scheme {ε1, ε2, …, εP}, and then the number of clones, K, is increased according to a clone scheme {K1, K2, …, KQ}, where subscripts index the population number. The MLE is approximated by averaging the final parameter population, and the 90% credible interval can be used for comparison to ABC rejection. For each sample, a proposal θ# is accepted or rejected according to an acceptance probability [25] that approximates a Gaussian centred on the experimental observation with variance ε, weighted according to the MAD, MADj(l), of the summary statistic j under consideration, where the MAD is estimated using ABC rejection. The ABC-DC method is described in full in algorithm 2. The numbers of samples for the decreasing-tolerance stage are denoted {r1, …, rP}, and {rP+1, …, rP+Q} denotes the numbers of samples for the increasing-clones stage. The scheme we use has P = 5 and Q = 4.
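The effect of data cloning can be illustrated in a conjugate Gaussian toy model, where the cloned posterior is available in closed form; this example is our own and is not part of the paper.

```python
import numpy as np

# Data cloning in a conjugate Gaussian toy model: with a N(0, 10^2) prior and
# a N(theta, 1) likelihood for each of n observations, the posterior for K
# clones is Gaussian with precision 1/10^2 + K*n. As K grows, its mean tends
# to the MLE (the sample mean) and its variance shrinks like 1/K.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 50)
mle = x.mean()
prior_prec, lik_prec = 1 / 10**2, float(len(x))
for K in (1, 10, 100):
    post_prec = prior_prec + K * lik_prec
    post_mean = K * lik_prec * mle / post_prec   # prior mean is 0
    print(K, post_mean, 1 / post_prec)           # mean -> MLE, variance -> 0
```

This is exactly the behaviour ABC-DC exploits: the cloned posterior concentrates on the MLE, so averaging the final parameter population approximates the maximum-likelihood estimate.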

Results

Our aim in this work is to provide insights into how both the choices of experimental geometry and summary statistics impact the quality of the posteriors generated using ABC for experiments that are typically used to characterize growing and spreading cell populations. We generate in silico (observed) data using the model outlined in §§2.1 and 2.2 for the range of experimental designs illustrated in figure 1c, and quantify the quality of the posteriors resulting from the use of ABC rejection with various summary statistics using the Kullback–Leibler divergence between the prior and posterior distributions (details of the methods used can be found in §2.3). In §§3.1 and 3.2, we ask whether it is possible to quantify, using only a single summary statistic of the data, the relative contributions of proliferation and motility in a number of cell spreading experiments. In §§3.3, we explore the extent to which parameter estimates can be improved using multiple summary statistics. Finally, in §3.4 we demonstrate the use of ABC-DC to efficiently find MLEs, and compare these estimates to those generated using ABC rejection.

Can we infer both model parameters using a single summary statistic?

We first ask whether we can jointly infer both model parameters, Pm and Pp, using a single summary statistic in the distance function (2.15). To do this, we perform ABC rejection on in silico (observed) data gathered from simulations with Pm = 0.25 and Pp = 0.0025, where we replicate different experimental designs by varying the model initial conditions: we initialize the model with 24 cells in all 24 rows, or in 12 rows, or in six rows, of the domain, as shown in figure 1c. Note that this means we change the average density within the region into which cells are initialized, but that the overall cell number, and hence the average total density, is constant. Figure 2a–f shows the posteriors generated for selected summary statistics of the data. The average information gain, and its deviation, for each of the three experimental designs and all summary statistics, is summarized in figure 2g.
Figure 2.

Posteriors generated using a single summary statistic as the experimental design is varied. The parameters used to generate the in silico (observed) data are indicated using dashed white lines. (a,b) The posterior distributions corresponding to the most informative summary statistic for inferring Pp ((a), N) or Pm ((b), ∥x∥) for the growth-to-confluence assay (cells initialized uniformly at random over 24 rows). (c,d) The posterior distributions generated using the summary statistic CY for the growth-to-confluence assay (c) and the scratch assay (d) (cells initialized uniformly at random over six rows). (e,f) The posterior distributions for summary statistic Q8 for the growth-to-confluence assay (e) and the scratch assay (f). (g) The information gain in moving from the prior to the posterior for each summary statistic and for each experimental design, where the error bars denote the standard deviation.

Figure 2a,b shows that for a growth-to-confluence assay design (where cells are initialized across all 24 rows of the domain), an accurate estimate of Pp can be obtained using the number of cells, N, as a summary statistic, and an accurate estimate of Pm can be obtained using the Manhattan displacement, ∥x∥, as a summary statistic. However, no single summary statistic can be used to infer both Pp and Pm with any degree of confidence. Figure 2c,d, however, demonstrates that, in contrast to a growth-to-confluence assay design, using a scratch assay design (where cells are initialized only in six rows of the domain) enables both Pm and Pp to be accurately estimated using a single summary statistic, for example the one-dimensional correlation function summary statistic, CY. Finally, figure 2e,f demonstrates how the predicted posterior can vary for a single summary statistic as the experimental design is changed. We use as an example the binning variance statistic, Q8, which is the deviation in cell numbers from the mean when the domain is divided into 8 × 8 bins.
Increasing the rate of movement, Pm, means clusters are broken up more rapidly, while increasing the rate of proliferation, Pp, makes clusters larger. As a result, in silico (observed) data generated using small Pm and Pp result in the same binning variance as data generated with large Pm and Pp. The effect is more pronounced for a scratch assay design due to the initial confinement of cells. Figure 2g summarizes the quality of the predicted posteriors resulting from the use of each of the summary statistics listed in table 1 for each experimental design considered in figure 1c. Our results demonstrate that statistics based on cell numbers provide the same or less information as cells are increasingly confined initially, whereas the opposite is true for the summary statistics that relate to correlations in cell positions. Both model parameters, Pm and Pp, can be inferred using a single summary statistic, without the need for cell trajectory information, but only when cells are initially confined to a scratch-assay-like geometry. For growth-to-confluence assays, cell trajectory data are necessary to accurately infer both model parameters.

What is the impact of the initial number of cells on the posteriors?

Next, we consider how changing the initial number of cells in the assay (24, 48 or 72 cells) affects the quality of the predicted posterior distribution. We consider either a growth-to-confluence assay (cells initialized uniformly at random across the domain, figure 3a–c) or a scratch assay design (cells initialized uniformly into six rows of the domain, figure 3d–f). Figure 3a–c demonstrates that, for the growth-to-confluence assay, increasing the initial cell number does not facilitate accurate inference of both model parameters using a single summary statistic. In line with the results presented in figure 2, figure 3d–f indicates that both model parameters can be accurately estimated using a single summary statistic when the scratch assay is used; however, increasing the initial cell number does not significantly increase the quality of the posterior. We quantify the quality of the estimated posterior for each summary statistic and each experimental design in the bar plots in figure 3c,f. For almost every summary statistic, the quality of the posterior increases as the initial cell number increases, likely due to a reduction in the variance of each summary statistic.
Figure 3.

Posteriors generated using a single summary statistic as the initial cell number is varied. The parameters used to generate the in silico (observed) data are indicated using dashed white lines. (a–c) Results for a growth-to-confluence assay generated using 24 cells (a) and 72 cells (b). (d–f) Results for a scratch assay generated using 24 cells (d) and 72 cells (e). The posterior distributions in (a,b) and (d,e) are generated using the pairwise correlations summary statistic CY. (c,f) The information gained in moving from the prior to the posterior for each summary statistic for each experimental design, where the error bars denote the standard deviation.


Are trajectory statistics necessary for maximal information gain?

It is natural to ask whether multiple summary statistics can be combined to increase the quality of the estimated posteriors and, further, whether combining different summary statistics can avoid the need to collect cell trajectory data to accurately infer parameters of a growth-to-confluence assay. To answer this question, we consider all combinations of two or three summary statistics from the 13 under consideration, computing the resulting posteriors and information gain for each combination. Figure 4 demonstrates that, for a growth-to-confluence assay, cell trajectory data are necessary to generate a posterior that allows for accurate estimation of both parameters Pm and Pp (figure 4a,b). In particular, trajectory information is necessary to estimate the motility parameter, Pm. It is interesting to note that, in both cases (with and without trajectory information), the number of cells summary statistic, N, is included in the best-performing two-summary-statistic combinations, likely because it facilitates accurate estimation of the proliferation parameter, Pp. For a scratch assay, cell trajectory information is not necessary to accurately infer both model parameters; however, as expected, including cell trajectory data enables more accurate inference of the cell motility parameter, Pm. Figure 4e quantifies the quality of the estimated posteriors for a range of experimental designs, initial cell densities and combinations of one, two or three summary statistics. The results demonstrate that little extra information is gained by considering more than two summary statistics and that, as expected, accurate parameter inference for growth-to-confluence assays requires cell trajectory data.
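A minimal sketch of how several summary statistics can be pooled in ABC rejection: each statistic's discrepancy is normalized by a characteristic scale before the distances are combined, so that no single statistic dominates. The simulator here is a toy stand-in, not the paper's lattice simulation, and the names (`simulate`, `abc_rejection`) and scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(pm, pp):
    """Toy stand-in simulator returning (cell number N, pairwise correlation C).
    A real implementation would run the lattice-based exclusion process."""
    n = 24 * np.exp(pp * 24) + rng.normal(0, 1)       # crude growth proxy
    c = 1.0 / (1.0 + 10 * pm) + rng.normal(0, 0.01)   # crude clustering proxy
    return np.array([n, c])

def abc_rejection(observed, scales, n_samples=20000, quantile=0.01):
    """ABC rejection with a combined, scale-normalized Euclidean distance."""
    thetas = np.column_stack([rng.uniform(0, 1, n_samples),      # Pm prior
                              rng.uniform(0, 0.1, n_samples)])   # Pp prior
    stats = np.array([simulate(pm, pp) for pm, pp in thetas])
    dist = np.sqrt((((stats - observed) / scales) ** 2).sum(axis=1))
    eps = np.quantile(dist, quantile)          # keep the closest 1%
    return thetas[dist <= eps]

observed = simulate(0.25, 0.024)               # synthetic "observed" data
posterior = abc_rejection(observed, scales=np.array([5.0, 0.05]))
print(posterior.mean(axis=0))                  # posterior means for (Pm, Pp)
```

Because N here is informative about Pp and C about Pm, combining the two identifies both parameters, mirroring the pattern reported in figure 4.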
Figure 4.

Posterior distributions and information gain when considering combinations of summary statistics (SS). (a–d) Posteriors resulting from the most informative two-summary-statistic combination when considering either a growth-to-confluence assay (a,b), or a scratch assay (c,d), excluding trajectory information (a,c), or including trajectory information (b,d). (e) The information gain in moving from the prior to the posterior for each of the best-performing summary statistic combinations, varying the experimental design, and excluding trajectory information (dark shading) or including trajectory information (light shading), where the error bars denote the standard deviation.


Can data-cloning approximate Bayesian computation provide an efficient means to estimate model parameters?

Finally, we investigate whether the ABC-DC algorithm proposed by Picchini et al. [25] can provide an efficient means to estimate model parameters. ABC-DC involves using a number of copies of the observed data (called ‘clones’) that are treated as independently generated datasets. ABC-DC applies ABC Markov chain Monte Carlo [33] over two stages: first the acceptance threshold, ε, is decreased; then the number of clones, K, is increased. This has the effect of concentrating the posterior around the maximum-likelihood estimate (MLE), with the posterior mean providing an approximation to the MLE and the 90% credible interval available for quantifying the uncertainty in parameter predictions. Figure 5 shows the results of using ABC-DC with the two-dimensional pairwise correlation function, C, to estimate posterior distributions for both Pm and Pp. Figure 5a,c shows how the posterior changes as the number of clones is increased, relative to ABC rejection, while figure 5b,d shows the Markov chains of both Pm and Pp. Our results indicate that, as expected, the ABC-DC algorithm results in a tighter posterior, focused approximately about the MLEs found using ABC-DC with a single instance of the data, K = 1. To understand how the MLEs generated using ABC-DC differ from those generated using ABC rejection, we examine the posteriors generated using each algorithm together with three summary statistics: the number of cells, N; the two-dimensional pairwise correlation statistic, C; and the Manhattan displacement, ∥x∥. We calculate MLEs and the 5% and 95% empirical percentiles of the accepted parameters. The results in table 2 indicate that the MLEs are similar for both ABC rejection and ABC-DC, though the credible intervals for ABC-DC are much smaller, and not a good representation of the true uncertainty in each parameter.
However, ABC-DC can provide significant computational savings in estimating the MLE, with the greatest gains for summary statistics that are relatively expensive to compute, such as C.
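A compressed sketch of the ABC-DC idea described above. This is a simplification of Picchini et al.'s algorithm: the kernel-weighted clone likelihood is replaced by a simple mean-distance acceptance rule, the simulator is a one-parameter toy stand-in, and the pilot initialization and schedules are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_stat(theta):
    """Toy stand-in simulator returning one noisy summary statistic; a real
    implementation would run the lattice model and compute, e.g., C."""
    return theta + rng.normal(0, 0.05)

def abc_dc(observed, eps_schedule, clone_schedule, n_iter=2000):
    """Sketch of ABC-DC: ABC-MCMC in which the distance is averaged over K
    independently simulated 'clones'; epsilon is first decreased, then K is
    increased, which concentrates the chain around the MLE."""
    # Initialize the chain at the best of a small pilot rejection run.
    pilot = rng.uniform(0, 1, 200)
    theta = pilot[np.argmin([abs(simulate_stat(t) - observed) for t in pilot])]
    chain = []
    stages = list(zip(eps_schedule, clone_schedule))
    for eps, k in stages:
        for _ in range(n_iter // len(stages)):
            prop = theta + rng.normal(0, 0.05)      # random-walk proposal
            if 0.0 <= prop <= 1.0:                  # uniform prior support
                dists = [abs(simulate_stat(prop) - observed) for _ in range(k)]
                if np.mean(dists) < eps:            # accept across all clones
                    theta = prop
            chain.append(theta)
    return np.array(chain)

observed = 0.25
# Stages 1-2: shrink epsilon at K = 1; stage 3: grow K at the final epsilon.
chain = abc_dc(observed, eps_schedule=[0.15, 0.1, 0.1],
               clone_schedule=[1, 1, 16])
```

Increasing K makes acceptance require a good match across all clones at once, which is why the late-stage chain, and hence the reported credible interval, is much narrower than the ABC rejection posterior.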
Figure 5.

Example results from the ABC-DC algorithm for a scratch assay using the two-dimensional pairwise correlations statistic, C. (a,c) Estimates of the posteriors for Pm and Pp, respectively, generated using kernel density estimation. We compare the posteriors for Pm and Pp generated using ABC-DC (yellow), ABC-DC (before the clone numbers are increased) (red) and ABC rejection (blue). (b,d) The trajectories of parameters accepted during the Markov chain Monte Carlo step of the ABC-DC algorithm.

Table 2.

ABC-DC for a scratch assay experiment. The MLEs and the 5%–95% empirical percentiles are shown for parameters accepted using ABC rejection and ABC-DC, for different summary statistics. The speed up of the ABC-DC method over ABC rejection is also given.

statistic: parameter    ABC rejection MLE (90% CI)    ABC-DC MLE (90% CI)    speed up
∥x∥: Pm                 0.250 (0.231, 0.300)          0.247 (0.239, 0.255)   1.65
C: Pm                   0.239 (0.174, 0.310)          0.236 (0.186, 0.280)   4.50
C: Pp (×10−2)           0.241 (0.215, 0.270)          0.228 (0.219, 0.236)   4.50
N: Pp (×10−2)           0.243 (0.220, 0.282)          0.246 (0.235, 0.257)   1.72

Predictions for experimental data

We can use our model to infer parameter distributions for different cell types in a growth-to-confluence assay, and subsequently make predictions about long-time behaviours, such as the time taken to reach confluence. In figure 6, we use data from the growth-to-confluence experiments detailed in [5] to infer motility and proliferation rates for two different cell types: 3T3 fibroblast cells and MDA MB 231 breast cancer cells. In particular, for each cell type, we use 10 experimental replicates to estimate the bivariate posterior distribution for Pm and Pp, with data collected 12 h into the experiment. Figure 6a,b shows the posteriors generated using ABC rejection with the cell number, N, and Manhattan displacement, ∥x∥, as summary statistics. For the 3T3 fibroblast cell line, we obtain Pm = 0.0465 ± 0.00341 and Pp = 0.00163 ± 0.000403, while for the MDA MB 231 breast cancer cells we obtain Pm = 0.0227 ± 0.00254 and Pp = 0.000931 ± 0.000427. In each case, we give the posterior mean and the standard error. In real terms, for the 3T3 cells this gives a diffusivity of D = 97.986 ± 7.196 μm2 h−1 and a proliferation rate of λ = 0.0392 ± 0.00968 h−1, while for the 231 cells we have D = 47.850 ± 5.347 μm2 h−1 and λ = 0.0223 ± 0.0102 h−1.
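The mapping from the dimensionless lattice parameters Pm and Pp to the dimensional diffusivity D and proliferation rate λ is not restated here. A minimal sketch, assuming the standard 2D lattice random-walk relations D = Pm Δ²/(4τ) and λ = Pp/τ; the lattice spacing Δ and step duration τ used below are illustrative assumptions, not the paper's calibrated values:

```python
def to_dimensional(pm, pp, delta_um, tau_h):
    """Convert per-step motility/proliferation probabilities into a
    diffusivity (um^2 / h) and a proliferation rate (1 / h), assuming
    D = Pm * Delta^2 / (4 * tau) and lambda = Pp / tau."""
    diffusivity = pm * delta_um ** 2 / (4.0 * tau_h)
    rate = pp / tau_h
    return diffusivity, rate

# Illustrative values: Delta = 20 um cell diameter, tau = 1/24 h per step.
D, lam = to_dimensional(pm=0.0465, pp=0.00163, delta_um=20.0, tau_h=1 / 24)
```

Any such conversion scales linearly in Pm and Pp, so the posterior standard errors convert by the same factors.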
Figure 6.

Posterior distributions for the results of posterior predictive checks. (a,b) Posterior distributions for 231 breast cancer and 3T3 fibroblast cell lines, respectively. (c,d) Posterior predictive check using data collected 24 h into the experiment, for the 231 breast cancer and 3T3 fibroblast cell lines, respectively, and the cell number, N, as a summary statistic. The ten experimental replicates are ordered according to increasing cell number at 24 h. For each replicate, we plot the experimental data (blue cross), the mean cell number (red line) and the 95% confidence interval (shaded region). (e,f) Posterior predictive check using data collected 24 h into the experiment, for the 231 breast cancer and 3T3 fibroblast cell lines, respectively, and the Manhattan displacement, ∥x∥, as a summary statistic. Details are as for (c,d).

To check the validity of our model and inferred posterior distributions, we collected data from the same experimental replicates at 24 h and used these to perform a posterior predictive check. In figure 6c–f, we present the results of performing the posterior predictive check using both the cell number, N, and the Manhattan displacement, ∥x∥, as summary statistics. We produce results using, for each dataset, 1000 samples from the estimated posteriors, and visually compare the statistics for each experimental replicate with the average of the predicted statistics and their 95% confidence intervals. These plots show strong support for the model for each cell type. We see that for the number of cells statistic, all of the experimental statistics fall within the 95% confidence intervals. For the Manhattan displacement statistic, two of the 10 replicates for the 231 cells have an experimental statistic that falls outside the 95% confidence interval of the simulated statistic.
In this case, we suspect that mapping the positions of slow-moving cells (such as the 231 cells) onto a lattice has introduced artefacts, causing a discrepancy between the simulated and experimental displacements.
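A minimal sketch of the posterior predictive check procedure described above. The forward model here is a toy exponential-growth stand-in for the lattice simulation, and the toy posterior, replicate data and function names are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_cell_number(pm, pp, n0=24, t=24.0):
    """Toy stand-in forward model for the cell number N at time t; a real
    check would rerun the lattice simulation for each posterior draw."""
    return n0 * np.exp(pp * t) * np.exp(rng.normal(0, 0.05))

def posterior_predictive_check(post_samples, observed_stats, n_draws=1000):
    """Simulate the summary statistic once per posterior draw, then record
    whether each observed replicate falls inside the central 95% interval."""
    idx = rng.integers(0, len(post_samples), n_draws)
    sims = np.array([simulate_cell_number(pm, pp)
                     for pm, pp in post_samples[idx]])
    lo, hi = np.percentile(sims, [2.5, 97.5])
    inside = [lo <= s <= hi for s in observed_stats]
    return lo, hi, inside

# Toy posterior around (Pm, Pp) = (0.05, 0.0016) and ten toy replicates.
post = np.column_stack([rng.normal(0.05, 0.005, 5000),
                        rng.normal(0.0016, 0.0002, 5000)])
replicates = [simulate_cell_number(0.05, 0.0016) for _ in range(10)]
lo, hi, inside = posterior_predictive_check(post, replicates)
```

Because each draw re-samples the parameters, the interval reflects both parameter uncertainty and simulation noise; replicates falling outside it flag model misfit, as for the 231 displacement statistic above.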

Discussion

Mechanistic models together with experimental data from cell biology assays provide an excellent opportunity to characterize the extent to which cell proliferation and cell motility can drive the growth and expansion of a cell colony. Drawing quantitative distinctions between the relative contributions of cell proliferation and motility for different cell types and biological scenarios requires the accurate inference of model parameters using quantitative summaries of experimental data. In this work, we have applied ABC to explore the extent to which different experimental designs and quantitative summary statistics impact our ability to accurately infer parameters of a simple mechanistic model of cell growth and spreading. The model we present is the simplest possible individual-based model of a two-dimensional cell biology experiment in which we can simulate cell motility and cell proliferation with a simple mechanism for contact inhibition. Our rationale for working with this fundamental model is that it is crucial to explore whether it is possible to parametrize this simple model using routinely collected experimental data, before attempting inference for a more biologically realistic model with a greater number of unknown parameters. Our results suggest that, for growth-to-confluence experiments, cell trajectory information, in addition to summary statistics such as cell numbers and spatial correlations, is required for accurate parameter inference. On the other hand, for a scratch assay geometry, it is possible to accurately infer both motility and proliferation parameters without recourse to tracking individual cells over multiple frames of a microscopy video. As cell tracking provides the bottleneck in the analysis of many experimental studies, this renders experimental geometry an important aspect to consider in the design of experiments.
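The kind of individual-based model described above can be sketched compactly. This is an illustration of a lattice-based exclusion process with motility, proliferation and contact inhibition, not the paper's exact algorithm; the domain size, boundary conditions and update order are assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(5)

MOVES = ((0, 1), (0, -1), (1, 0), (-1, 0))

def neighbour(i, j, rows, cols):
    """Pick one of the four nearest-neighbour sites (periodic boundaries)."""
    di, dj = MOVES[rng.integers(4)]
    return (i + di) % rows, (j + dj) % cols

def step(lattice, pm, pp):
    """One step of a minimal exclusion-process sketch: each cell attempts a
    division with probability Pp and a move with probability Pm; either
    event is aborted if the chosen target site is occupied, giving a crude
    form of contact inhibition."""
    rows, cols = lattice.shape
    cells = list(zip(*np.nonzero(lattice)))
    rng.shuffle(cells)                             # random update order
    for i, j in cells:
        if rng.random() < pp:                      # proliferation attempt
            ni, nj = neighbour(i, j, rows, cols)
            if lattice[ni, nj] == 0:
                lattice[ni, nj] = 1                # place daughter cell
        if rng.random() < pm:                      # movement attempt
            ni, nj = neighbour(i, j, rows, cols)
            if lattice[ni, nj] == 0:
                lattice[ni, nj], lattice[i, j] = 1, 0

def scratch_init(rows=40, cols=40, n_rows=6):
    """Scratch-assay design: cells fill a central band of rows."""
    lattice = np.zeros((rows, cols), dtype=int)
    start = (rows - n_rows) // 2
    lattice[start:start + n_rows, :] = 1
    return lattice

lattice = scratch_init()
for _ in range(50):
    step(lattice, pm=0.25, pp=0.0024)
```

Swapping `scratch_init` for a uniform random seeding reproduces the growth-to-confluence design, so the two geometries compared in this work differ only in their initial condition.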
In addition, our results demonstrate that increasing the initial cell number in cell biology assays can improve the quality of estimated posterior distributions by decreasing the noise in the summary statistics, a result also found for a similar model by Ross and co-workers [18]. We also demonstrated that posteriors of significantly increased quality can be generated using a combination of two summary statistics, but that further increases in the number of summary statistics provide little increase in the quality of the posteriors. In electronic supplementary material, figures S1–S4, we demonstrate that our results are consistent across a range of parameter values and a range of experimental durations. However, it should be noted that, because we collected data at only the final time point of the experiment, some summary statistics lose efficacy if experiments are run over a sufficiently long time period that the population grows to confluence. For example, in this case, the number of cells, N, at the final time point provides little information on the model parameters. Finally, we have shown that the ABC-DC approach provides an efficient means to approximate the MLEs of model parameters, with significant improvements in computational efficiency where the summary statistics are time-consuming to calculate.

Conclusion

In summary, our results suggest that experimental design choices that incorporate initial spatial heterogeneities in cell positions facilitate parameter inference without the requirement of cell tracking, while designs that initially seed cells uniformly require cell tracking for accurate parameter inference. As cell tracking is often a major technical limitation of such experimental studies, our recommendations for experimental design choice could lead to significant time and cost savings in the analysis of these commonly used experiments.
References (10 of 23 shown)

1.  Markov chain Monte Carlo without likelihoods.

Authors:  Paul Marjoram; John Molitor; Vincent Plagnol; Simon Tavare
Journal:  Proc Natl Acad Sci U S A       Date:  2003-12-08       Impact factor: 11.205

2.  Approximate Bayesian computation in population genetics.

Authors:  Mark A Beaumont; Wenyang Zhang; David J Balding
Journal:  Genetics       Date:  2002-12       Impact factor: 4.562

3.  Multi-scale modeling of a wound-healing cell migration assay.

Authors:  Anna Q Cai; Kerry A Landman; Barry D Hughes
Journal:  J Theor Biol       Date:  2006-10-28       Impact factor: 2.691

4.  Data cloning: easy maximum likelihood estimation for complex ecological models using Bayesian Markov chain Monte Carlo methods.

Authors:  Subhash R Lele; Brian Dennis; Frithjof Lutscher
Journal:  Ecol Lett       Date:  2007-07       Impact factor: 9.492

5.  Simulating invasion with cellular automata: connecting cell-scale and population-scale properties.

Authors:  Matthew J Simpson; Alistair Merrifield; Kerry A Landman; Barry D Hughes
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2007-08-17

6.  A simplified method for quantifying cell migration/wound healing in 96-well plates.

Authors:  Patrick Y K Yue; Emily P Y Leung; N K Mak; Ricky N S Wong
Journal:  J Biomol Screen       Date:  2010-03-05

7.  Experimental and modelling investigation of monolayer development with clustering.

Authors:  Matthew J Simpson; Benjamin J Binder; Parvathi Haridas; Benjamin K Wood; Katrina K Treloar; D L Sean McElwain; Ruth E Baker
Journal:  Bull Math Biol       Date:  2013-04-13       Impact factor: 1.758

8.  Illuminating breast cancer invasion: diverse roles for cell-cell interactions. (Review)

Authors:  Kevin J Cheung; Andrew J Ewald
Journal:  Curr Opin Cell Biol       Date:  2014-08-17       Impact factor: 8.382

9.  A novel approach for choosing summary statistics in approximate Bayesian computation.

Authors:  Simon Aeschbacher; Mark A Beaumont; Andreas Futschik
Journal:  Genetics       Date:  2012-09-07       Impact factor: 4.562

10.  Interpreting scratch assays using pair density dynamics and approximate Bayesian computation.

Authors:  Stuart T Johnston; Matthew J Simpson; D L Sean McElwain; Benjamin J Binder; Joshua V Ross
Journal:  Open Biol       Date:  2014-09       Impact factor: 6.411

Cited by (1 in total)

1.  A free boundary model of epithelial dynamics.

Authors:  Ruth E Baker; Andrew Parker; Matthew J Simpson
Journal:  J Theor Biol       Date:  2018-12-19       Impact factor: 2.691

