Francesco Di Lauro1, Wasiur R KhudaBukhsh2, István Z Kiss3, Eben Kenah4, Max Jensen3, Grzegorz A Rempała4. 1. Big Data Institute, University of Oxford, Oxford, OX3 7LF, UK. 2. Department of Mathematics, University of Nottingham, Nottingham, NG7 2RD, UK. 3. Department of Mathematics, University of Sussex, Brighton, BN1 9RH, UK. 4. Department of Biostatistics, The Ohio State University, Columbus, OH 43210, USA.
Abstract
We present a new method for analysing stochastic epidemic models under minimal assumptions. The method, dubbed dynamic survival analysis (DSA), is based on a simple yet powerful observation, namely that population-level mean-field trajectories described by a system of partial differential equations may also approximate individual-level times of infection and recovery. This idea gives rise to a certain non-Markovian agent-based model and provides an agent-level likelihood function for a random sample of infection and/or recovery times. Extensive numerical analyses on both synthetic and real epidemic data from foot-and-mouth disease in the UK (2001) and COVID-19 in India (2020) show good accuracy and confirm the method's versatility in likelihood-based parameter estimation. The accompanying software package gives prospective users a practical tool for modelling, analysing and interpreting epidemic data with the help of the DSA approach.
We present a new method for analysing stochastic epidemic models under minimal assumptions. The method, dubbed dynamic survival analysis (DSA), is based on a simple yet powerful observation, namely that population-level mean-field trajectories described by a system of partial differential equations may also approximate individual-level times of infection and recovery. This idea gives rise to a certain non-Markovian agent-based model and provides an agent-level likelihood function for a random sample of infection and/or recovery times. Extensive numerical analyses on both synthetic and real epidemic data from foot-and-mouth disease in the UK (2001) and COVID-19 in India (2020) show good accuracy and confirm the method's versatility in likelihood-based parameter estimation. The accompanying software package gives prospective users a practical tool for modelling, analysing and interpreting epidemic data with the help of the DSA approach.
The standard approach to building a stochastic compartmental epidemic model is to make use of a continuous-time Markov chain (CTMC) to keep track of the sizes of the compartments over time (e.g. number of individuals with different immunological statuses) using counting processes [1]. Following the random time change representation of Poisson processes [2,3], the trajectory equations for those counting processes are written in terms of independent, unit rate Poisson processes. When the size of the population under consideration is large, those counting processes, appropriately scaled, converge to deterministic, continuous real-valued functions satisfying certain ordinary differential equations (ODEs) by virtue of the functional law of large numbers (FLLN) for Poisson processes [4,5]. This convergence provides a link between the stochastic and the deterministic world and the limiting ODEs are often referred to as the mean-field equations in the literature. Famous examples include the classical Kermack–McKendrick equations for the susceptible–infected–recovered (SIR) epidemic model [6].However, this astounding popularity of the standard Markov models or the corresponding mean-field ODE models seems to belie their apparent lack of faithfulness to the underlying biology of the disease. To quote van Kampen [7],‘Non-Markov is the rule, Markov is the exception’.Indeed, the population count-based Markov models assume exponentially distributed inter-event times. As a consequence, the instantaneous rates of infection and recovery are assumed constant regardless of key epidemiologically relevant covariates, such as the age of infection (see §2), time since vaccination, etc. As shown in [8] (in particular, see table 1 and fig. 1), the estimates obtained by assuming a Markovian model when the underlying model is non-Markovian could be significantly biased. While there are more advanced stochastic models that do incorporate those covariates (as we will also do in this paper), those models are often fit to data in an ad hoc fashion; or are too computationally expensive to be useful for practical purposes. Our aim in this work is to build a principled and rigorous statistical approach to fitting those more advanced stochastic models to data without compromising on simplicity.In this paper, we present a survival analytic approach, dubbed dynamic survival analysis (DSA), that constructs probability distributions of individual times of infection and recovery from population-level (mean-field) trajectory equations. In [9], a subset of the authors first employed this idea in the context of the classical Kermack–McKendrick Markovian (SIR) epidemics described by their mean-field ODEs. Here, we extend the idea to the vastly more realistic class of non-Markovian models that allow non-exponential contact interval [8] and infectious periods. The theoretical underpinning is laid down by an extension of the so-called Sellke construction [1,10], which we describe in detail in §2.2.There are several advantages of DSA. First, DSA does not require knowledge of the size of the susceptible population, which is almost always unknown in real epidemics and often assumed to be the population of the entire city, state, or even a country. In fact, DSA not only avoids this ad hoc adjustment, but also provides a ready estimate of the effective population size, tracking of which could provide further insights into an ongoing epidemic. Second, DSA does not require the whole epidemic trajectory and works with only a random sample of infection and, if available, recovery times. Third, on the strength of its survival analytic foundation, DSA is able to handle censoring, truncation and aggregation of data (over time and population) in a straightforward manner. We illustrate some of these features of the DSA method below.The rest of the paper is structured as follows. §2 describes the stochastic model in terms of measure-valued processes and the so-called Sellke construction, along with their large population mean-field approximations. In §3, we describe the DSA modelling approach in detail before conducting extensive numerical analysis in §4. We apply the DSA method to the epidemics of foot-and-mouth disease (FMD) in the UK and COVID-19 in India. In §4, we also provide synthetic data analysis so that DSA could be compared against the true data-generating model. Finally, we conclude with a short discussion in §5. For the sake of completeness, additional mathematical derivations and numerical figures are provided in appendices, where we also compare the performances of Markovian and non-Markovian DSA on the FMD dataset.
Stochastic model
Because we want to keep track of important epidemiological covariates along with counts of individuals in different compartments, our primary tool will be measure-valued processes, which are naturally capable of carrying more information than raw population counts. The measure-valued representation will also allow us to turn an inherently non-Markovian model into a Markov model, albeit on a more abstract state space. While the age of infection (§2) is the most natural choice for ‘age’, one may also use the notion of age to account for other important covariates that describe time since some specific event. For instance, the biological age, time since vaccination are important for certain infectious diseases. Therefore, we use the term ‘age’ in a broad sense and keep track of the ages of individuals with different immunological statuses (susceptible, infected, recovered/removed).Suppose we have n susceptible and m infected individuals initially. We assume that m depends on n in the sense that m/n → ρ as n → ∞ for some ρ ∈ (0, 1). Let us now define the following stochastic processes:where N(t), N(t) and N(t) are, respectively, the total numbers of susceptible, infected and recovered individuals in the population at time t. The quantities s(t), i(t) and r(t) are the ages of the kth susceptible, infected and recovered individuals (following some specific ordering convention). The set-function δ is the Dirac measure, i.e. for a set A, the function δ(A) takes value 1 if x ∈ A and 0 otherwise. The stochastic processes , and keep track of the age distribution of the population of individuals. For instance, taking the ‘age’ for the infected individuals to represent the age of infection, gives us the number of infected individuals whose ages of infection lie in the set [3.5, 7]. Now, define the stochastic processThe process X is a Markov process. Although we do not explicitly show the dependence of the stochastic process X on the initial size of the susceptible population n, it is worth keeping in mind.
Contact intervals and infectious periods
We adopt the pairwise model of [8] to describe the dynamics of the epidemic process under the stochastic mass-action set-up. There are two types of events: infection and natural recovery. In order to describe the intensities (of the Markov process X) corresponding to these two types of events, let us introduce two functions: and . The function β(u, v) describes the instantaneous intensity of an infectious contact between a susceptible individual of age u and an infectious individual of age v. That is, the probability that a susceptible individual of age u will be infected by an infectious individual of age v in the next δt time unit is n−1β(u, v)δt under the stochastic law of mass-action, where δt is assumed infinitesimally small. In the language of the pairwise model [8] of infectious diseases, the function β characterizes the probability law of the contact intervals. The function γ is the hazard function that characterizes the probability law of the infectious period. Note that neither of these two probability laws needs to be exponential, even though X itself is a Markov process (see [11] for a similar example in the context of a stochastic chemical reaction network (CRN)). The infection and natural recovery processes are assumed independent. We also assume that recovered individuals can no longer infect others or be infected.The stochastic process X can be simulated by extending the standard Doob–Gillespie’s stochastic simulation algorithm (SSA) in a straightforward manner. An alternative approach to simulating individual trajectories is the Sellke construction, which also provides the theoretical underpinning to the DSA approach. For the sake of simplicity, we will assume in the following that the function β(u, v) depends only on the age v of the infected individual and not on the age u of the susceptible individual, i.e. β(u, v) = β(v). This will allow for a simpler and a more intuitive description of the Sellke construction. The general case of β(u, v) is considered in appendix A.
Sellke construction
The classical Sellke construction [1] provides an alternative individual-based description of the standard stochastic mass-action (SIR) epidemic model. It can be shown that the resultant epidemic process is equivalent to the original population-level stochastic model in the sense that the counts of individuals with different immunological statuses have the same probability law under both constructions. However, the crux of the Sellke construction is that it describes the epidemic process in terms of individual survival probabilities (i.e. for an initially susceptible individual, the probability of remaining susceptible until time t). This is useful for parameter inference. The classical Sellke construction can be adapted to the age-structured epidemic model of ours in a straightforward fashion.To each of the initial n susceptible individuals, we assign a threshold, an exponentially distributed random variable with mean one. Let U denote the threshold corresponding to the ith susceptible individual. The random variables U1, U2, …, U are independent. Let U(1), U(2), …, U( be the corresponding order statistics, i.e. U(1) ≤ U(2) ≤ … ≤ U(. Let us now define the cumulative infection pressurewhere, for a point measure and a measurable function f, the notation 〈, f〉 denotes the integration of the function f with respect to the measure ν, i.e.The epidemic process proceeds as follows: The first infection occurs when the cumulative infection pressure exceeds the smallest individual threshold, i.e. when for the first time; the second infection occurs when , and so on. Note that infected individuals recover following an infectious period that has a probability law characterized by the hazard function γ. Therefore, it is possible that the cumulative infection pressure becomes constant when the last infected individual recovers and there are no more infected individuals. Susceptible individuals whose thresholds are never exceeded by the cumulative infection pressure escape infection and never leave the susceptible compartment. Figure 1 provides a pictorial description of the Sellke construction.
Figure 1
Sellke construction. Here, we begin with a single infected individual. The arrows point to the times of infection. The orange horizontal lines indicate the infectious period of each infected individual. The probability density function (PDF) of the infectious periods is shown in the inset (Weibull with shape c = 1.9 and scale 1).
Sellke construction. Here, we begin with a single infected individual. The arrows point to the times of infection. The orange horizontal lines indicate the infectious period of each infected individual. The probability density function (PDF) of the infectious periods is shown in the inset (Weibull with shape c = 1.9 and scale 1).Let us denote the time of infection of an initially susceptible individual by T. In essence, the Sellke construction specifies an individual-level survival function: The probability that an initially susceptible individual i remains susceptible until time t, conditional on the history (filtration) of the epidemic process, is given bywhere is the threshold of the individual i. This survival probability will play a crucial role in devising the DSA-likelihood function. It is worth pointing out that the random variable T is improper because some individuals may escape infection with positive probability.From the classical theory of stochastic epidemiology, we know that appropriately scaled population counts in CTMC-based epidemic models converge to solutions to ODEs in the large population (mean-field) limit [1]. They are a consequence of the FLLN-type approximation theorems for Markov processes [4,5]. The intuition is that the stochastic fluctuation, which is typically described in terms of a zero-mean martingale after a Doob–Meyer decomposition of the counting processes around the mean, vanishes in the limit. A similar intuition holds true for measure-valued Markov processes. Indeed, the scaled process n−1X converges to a vector of deterministic measure-valued functions in the limit of n → ∞. Furthermore, when the limiting measure-valued functions admit densities, it is possible to describe them using partial differential equation (PDE) (e.g. see [12,13] and appendices).
Mean-field limit
We are interested in the limit of the epidemic process as n → ∞ with m/n → ρ, for some ρ ∈ (0, 1). Therefore, in the limit, the total scaled population size is (1 + ρ). We scale the system this way because we wish to interpret the susceptible curve as a survival function, which takes the value of one at zero. We shall elaborate further on this point in §3 on DSA.Under some technical assumptions on the intensities and the initial population size (more precise statement in appendix A), the scaled stochastic process n−1X converges to a deterministic continuous function as n → ∞, where the components and are measure-valued functions. The main technical tools required to establish the convergence are borrowed from existing probability theory literature. In particular, similar techniques and derivations can be found in [12-16]. A brief, intuitive sketch of the proof of convergence of the scaled process n−1X to the deterministic function x is provided in appendix A for the sake of completeness.The densities and of and satisfy the system of PDE given in (A 5) in appendix A. Because of our simplifying assumption β(u, v) = β(v), it makes sense to integrate out the age component for the susceptible and the recovered individuals. Therefore, by definingwe can write the limiting system as follows:with initial conditions z(0) = 1, z(0) = 0 and such thatand boundary conditionUsing the method of characteristics on (2.4), we getwhere is the survival function of the probability distribution characterized by the hazard function γ. That is, . Unfortunately, y does not admit an explicit solution. However, efficient numerical methods exist. We describe the solution scheme that we adopted in appendix B. The limiting proportion of recovered individuals z is also fully described by the limiting density y of infected individualsFor different choices of the functions β and γ depending on the particular infectious disease in question, one can solve (2.4) numerically and fit to data. Typically, one would assume a parametric representation of the functions β and γ, and then attempt to infer those parameters based on data. However, a common problem in epidemiological literature is that the choice of the likelihood function is often ad hoc and strictly speaking, unjustifiable. To this end, the DSA method [9,17-20] provides, in a principled way, a likelihood function based on a random sample of transfer times.[1] In the next section, we describe the DSA method in greater detail.
Dynamic survival analysis and parameter inference
The DSA method combines dynamical systems theory and survival analysis. For a given dynamical system, typically described by ODEs or PDEs for population counts/proportions, the DSA method provides an alternative interpretation that characterizes probability laws of transfer times [9,17,19,21]. The mathematical underpinning is provided by a novel application of the Sellke construction.Rewriting (2.4) and with the initial condition z(0) = 1, we immediately seewhich is precisely the limit of the survival function according to the Sellke construction in (2.3) as n → ∞. That is,Note that the random variable T in the Sellke construction depends on n even though we do not show it explicitly to keep the notations simple. Therefore, the function z, the limiting proportion of susceptible individuals, can be interpreted as a survival function. However, the survival function z is improper because z(∞) > 0. The quantity z(∞) is precisely the limiting proportion of susceptible individuals that forever escape the infection. The survival function z can be made proper by conditioning on individuals who get infected [9]. Another important observation is that the ‘time to infection’ random variables associated with the initially susceptible individuals become independent in the limit of n → ∞. This phenomenon is sometimes referred to as mean-field independence [21,22].
Likelihood contribution of infection times
Let us denote by θ the set of parameters required to describe the contact interval distribution in terms of β and the infectious period in terms of γ. On account of the Sellke construction, we can treat the function z as an improper survival function for the (improper) random variable T, the time to infection for an initially susceptible individual. Therefore, we can define the conditional probability density function (PDF)for the infection times, where τ : = 1 − z(T). Also, set τ : = τ∞. The PDF f is proper by virtue of the conditioning.Most epidemic and pandemic trajectories are only partially observed. A crucial advantage of the DSA approach is that it does not require the whole trajectory. Suppose we have a random sample of infection times t1, t2, …, t from an epidemic trajectory observed partially until time T, for some finite, positive number T. Then, following the mean-field independence, the contribution of the infection times to the DSA likelihood function is given byThe contribution ℓ can be modified in a straightforward fashion if the infection times are censored and/or truncated [19].
Likelihood contribution of recovery times
Now, let us describe the contribution of the recovery times to the DSA likelihood. While the recovery times are often not observed, or only partially observed (with further possibility of censoring or truncation), when available they can be incorporated into the DSA likelihood function, rendering it more informative. There are two possible scenarios. Let us consider the simpler case first: we have a random sample s1, s2, …, s of infectious periods. Then, denoting the PDF of the probability law characterized by the hazard function γ by r, the contribution of the random sample of infectious periods to the DSA likelihood function is given byNow, let us consider the second case: we do not directly observe individual infectious periods, but only observe recovery times. Suppose u1, u2, …, u is a random sample of recovery times of M individuals whose infection times are unknown. They are precisely a random sample of the sum of two independent random variables: time to infection and infectious period. Therefore, we can define the convolution-form PDFconditional on the partially observed epidemic trajectory until time T, whereNow, with the conditional PDF of the recovery times given in (3.4), we can write down the contribution of the random sample u1, u2, …, u of recovery times as follows:The conditional PDF g, in general, does not admit a closed-form expression. However, it can be computed numerically.
The DSA likelihood
Suppose we have a random sample t1, t2, …, t of infection times, a random sample s1, s2, …, s of infectious periods and a random sample u1, u2, …, u of recovery times. Then, the DSA likelihood function is given byNote that it is not necessary to have data on recovery times. The likelihood contribution ℓ(θ) is adequate for parameter inference. See [17], where parameter inference was done for the COVID-19 pandemic in the state of OH, USA, based only on infection times. When information on recovery times is unavailable, we simply set and by adopting the convention .Often it is easier to work with the log-likelihood function. Therefore, for the purpose of parameter inference, we also define the DSA log-likelihood functionThe maximum-likelihood estimate (MLE) of the parameter θ is then numerically obtained by maximizing the log-likelihood function . That is,We present numerical results in §4. For Bayesian methods, we need to introduce a prior for the parameter θ and then implement a Markov Chain Monte Carlo (MCMC) algorithm to approximate the posterior distribution of the parameter θ. However, we do not pursue the Bayesian path in this paper.
Mean-field limits as Chapman–Kolmogorov equations
An alternative way to view DSA is to interpret the limiting trajectory equations as satisfying Chapman–Kolmogorov equations (written in the differential form) for certain probability distributions. Let us pick a random individual embedded in an infinitely large population (mean-field) and follow in time. Let denote a time-inhomogeneous CTMC that keeps track of whether an individual is in the susceptible compartment () or not (). We specify the time-dependent instantaneous transition rates of W(t) as follows:where x is the mean-field FLLN limit of the stochastic process n−1X. Write for , the marginal distribution of the Markov process W. We of course have . Then, following the previous discussion, DSA, in essence, is tantamount to writingIt is straightforward to verify that p satisfieswhich is the time-inhomogeneous Chapman–Kolmogorov equation (in the differential form) for the marginal distribution. It is in this viewpoint that we say the limiting mean-field equations given in equation (A 5) satisfy the Chapman–Kolmogorov equations for the probability distribution p. It is worth mentioning that the time derivative (d/d)p gives us what is popularly known as the chemical master equation (CME) in the physical sciences literature. Note that our Chapman–Kolmogorov viewpoint is somewhat different from the notion of a generalized master equation (GME) [23,24] in that we are not attempting to describe the original stochastic system in §2 with equation (3.11), but rather constructing a Markov chain whose Chapman–Kolmogorov equations are given by the mean-field limit of the original stochastic process.Viewing the limiting trajectory equations as satisfying Chapman–Kolmogorov equations also reveals that, if we have data only on individual infection times, the likelihood function ℓ in equation (3.2) is essentially a Markov likelihood function. Here, we described the Chapman–Kolmogorov viewpoint on the simplistic state space . In general, we could construct a Markov process that keeps track of the immunological status along with the age of the individual. Accordingly, DSA can be shown to be tantamount to describing the transition kernel for W(t) in terms of the mean-field trajectory equations x and their densities y, y, y. Since this viewpoint is only a side note and not the main aim of the paper, we leave the discussion for a future work. We do, however, refer interested readers to [25], where the authors use a stationary GME with memory terms and show that the effect of molecular memory is equivalent to the introduction of a feedback in the context of intracellular reaction processes. We also remark that in certain examples, e.g. [26], the non-Markovian formulation can be shown to be equivalent to a Markovian formulation in that the steady state of the non-Markovian process can be reduced to that of an equivalent Markov process.
Estimate of effective population size
In addition to giving a simple product-form likelihood function for θ, DSA also gives a ready estimate of the effective population size. Given k, the number of cases observed by time T, the effective population size can be estimated by the discount estimatorIn similar vein, we can also estimate the final size of the epidemic as follows:Refer to [9,17] for further discussions on this.
Numerical results
In this section, we demonstrate how the DSA method can be used for inference of model parameters from infectious disease outbreak data using the likelihood functions described in §3. Typical outbreak data consist of population-level aggregated counts (such as the daily number of newly positive cases). Hence, we use this scenario as a benchmark for numerical validation. At the beginning, we will analyse synthetic data and make several simplifying assumptions, which we will gradually remove in favour of more realistic models when considering datasets from real epidemic outbreaks, such as the FMD epidemic in the UK and the COVID-19 pandemic in India.
Synthetic data
We begin by carrying out DSA analysis on synthetic data. We begin by keeping the premise deliberately simple: we assume that the family of the infectious period is known in that the functional form of the hazard function γ (or the PDF characterized by γ) is known, but the parameters are to be inferred along with the initial condition of the PDE (A 5) and a constant infection rate, β. To this end, we begin by assuming the infectious period is a Gamma random variable. The rationale behind this choice is the flexibility of the Gamma distribution and its historical importance in the infectious disease epidemiology, as a natural generalization of the Exponential distribution [27-31]. The proposed inference scheme, of course, works for any other distribution, such as the log-logistic or Weibull (see §4.2). All the code to reproduce the results in this section is available online,[2] and a brief description of the numerical scheme used to solve the PDE can be found in appendix B.
Description of data
The Sellke construction is an excellent means to generate exact simulations of an epidemic. We simulate an outbreak on a population of N = 10 000 individuals. Epidemics are run until no infected individuals are present in the population. Datasets consist of the series of infection and recovery times taken from the simulation, without noise nor delays.We consider three different scenarios, characterized by different availability of data: we either work with only recovery times, with only infection times, or with both. We generate 1000 datasets from the same initial conditions, to characterize the distribution of the estimates. Estimates are found by means of a mix of global and local optimization routines.The objective is to infer the initial proportion of infected individuals ρ = 50/9950, the per-contact infection rate β = 0.25, and the parameters of the distribution of infectious period, which is a Gamma distribution with mean μ = 9 and variance σ2 = 6. Results are shown in figures 2 and 3.
Figure 2
Inferred parameters (a) ρ and (b) β. Each figure shows histograms for different scenarios of data availability, as denoted in the legend. In each figure part, the true parameter is represented by the downward triangle; the square is the average value inferred when considering only infectious times, the diamond when considering only recovery times and the upward triangle when considering both.
Figure 3
Inferred infectious period distribution mean and standard deviation. Black dots represent the true values.
Inferred parameters (a) ρ and (b) β. Each figure shows histograms for different scenarios of data availability, as denoted in the legend. In each figure part, the true parameter is represented by the downward triangle; the square is the average value inferred when considering only infectious times, the diamond when considering only recovery times and the upward triangle when considering both.Inferred infectious period distribution mean and standard deviation. Black dots represent the true values.We find that inference based on only infection times using the likelihood function ℓ(θ) in (3.2) results in wider distributions for all inferred parameters, suggesting greater uncertainty, than inference based on both. This is expected because the likelihood function ℓ(θ) in (3.7) is more informative than the likelihood function ℓ(θ) in (3.2). In general, the true parameters are always near the mode of the distributions of the inferred parameters. It is worth noting that when the infection rate β is overestimated, the initial proportion of infected individuals ρ is underestimated, and vice versa. This suggests a potential statistical unidentifiability of the parameters. Outbreaks starting with a higher number of infected individuals but smaller transmission rate may be hard to distinguish from those that start with a smaller number of infected individuals but with higher transmission rate.The mean and the standard deviation of the distribution of the infectious period are reported in figure 3. We observe that inference based only on infection times, in general, accurately captures the mean of the distribution of the infectious period but tends to overestimate the variance. The overall quality of inference improves significantly when recovery times are also available.
Foot-and-mouth disease
Let us now turn to real datasets. We consider the 2001 FMD outbreak in the UK. The outbreak began at the end of February 2001 and ended in September 2001, affecting more than 2000 farms. Policy makers' efforts to control the epidemic resulted in the culling of millions of herds and flocks [32]. Because of the specific interventions taken to control the outbreak, we interpret the infectious period in the DSA model as the time from when the disease hit a farm to the elimination of infected herds, i.e. the time to removal. Since this quantity is unlikely to be exponentially distributed, we fit a more flexible Gamma distribution to it. For the contact interval distribution characterized by the hazard function β, we assume a Weibull distribution, which is in line with other methods present in the literature [33]. We note that both these choices may be viewed as generalizing the usual Markovian model based on two Exponential distributions.The dataset[3] consists of daily incidence of infected premises by time of report, {t, I}, with no information on removal times (figure 4). For each day t, we distribute the number of new cases I uniformly in the interval (t, t). Furthermore, we consider only the first 80 days of data, to exclude the noisy tail and potentially confounding effects of strict measures. These simplifying assumptions allow us to maximize the likelihood ℓ(θ) in (3.2). Since the original data points are too noisy, we consider the 7-day moving average of the counts, starting from day 6. This results in a smoother dataset that is less noisy, although a bit delayed with respect to the true one.
Figure 4
Visualization of the FMD outbreak. New daily cases since the first day of data (February 2001), to last day where a new case was confirmed (September 2001). The data points in black are excluded from the analysis.
Visualization of the FMD outbreak. New daily cases since the first day of data (February 2001), to last day where a new case was confirmed (September 2001). The data points in black are excluded from the analysis.Maximum-likelihood estimates are obtained by means of a mix of global and local optimization routines. The distributions of inferred contact interval and infectious period are shown in figure 5. The shapes of the inferred distributions are in line with findings from other studies of same outbreak [34]. Our model with Weibull contact interval distribution and Gamma infectious period does not consider the incubation period explicitly. Once both infectious period and contact interval distributions are known, we can find R0 using the formula [35], where S, we recall, is the survival function of the infectious period distribution. This gives a point-estimate of R0 = 2.55. Finally, the effective population size inferred was .
Figure 5
The best-fitting PDF of the contact interval and the infectious period inferred from the FMD data.
The best-fitting PDF of the contact interval and the infectious period inferred from the FMD data.We compute confidence intervals using a bootstrap method, which we describe now. We first solve the limiting PDE (A 5) with the MLE estimates. From the solution, we compute the distribution of infection times PDF (3.1). This distribution is used to generate n = 500 synthetic datasets with as many datapoints as the original one, consisting of simulated dates of infections, on which we repeat the inference. Each new set of inferred parameters is then used to produce both the estimate for R0 (shown in figure 11) and the (t, I(t)) incidence curve that we can compare against the true data.
Figure 11
FMD R0 estimates from the bootstrap method.
Finally, when computing confidence intervals, we compensate for other sources of noise that cannot be explicitly accounted for in our the model but are present in real-world data, such as testing limits, day-of-the-week effects and various sources of delays. This variance-adjustment is done by inflating the confidence intervals by a factor determined by taking the square root of the variance between the data points and the 7-day moving average. Results are shown in figure 6. As can be verified, the trajectories do capture the epidemic trend quite well in that all the data points lie within the variance-adjusted 95% confidence interval.
Figure 6
Variance-adjusted confidence intervals for the FMD dataset.
Variance-adjusted confidence intervals for the FMD dataset.
Comparison with Markovian compartmental models
We show the difference between a flexible non-Markovian model and the standard Markovian compartmental SIR model (with Exponential contact intervals and Exponential infectious periods) by comparing their performance for the FMD epidemic. Both models tend to perform better when the data on the full course of the epidemic are considered (not shown). Here, we present an analysis based on the more realistic situation where only early data are available. We are interested both in inference of the infectious period and the contact interval distributions, and in forward predictions. For the purpose of prediction, we include in our model only observations from the first 20 days, corresponding to roughly 300 cases, with a peak daily case count of 22 infected premises. The first three days are excluded from the inference because there were no cases reported on day 2 and day 3. It is worth remembering that the animal culling policy was introduced on 15 March 2001, which is after the last data point considered, although a national movement ban was already in place for the whole period. In both scenarios, we consider the curves obtained from the MLE estimates projected to day 70 and compared with real case count and cumulative infections. Results are shown in figures 7, and figure 13 in appendix C.
Figure 7
Comparison between (a) Weibull–Gamma model and the (b) Exponential–Exponential model. Only the first 20 days of reported cases are considered for the fitting, and the solutions (grey curves) are compared with the observed incidence (black dots) in the following 50 days of data. In the insets, the distributions are inferred from the MLE estimates. These distributions are Weibull(shape = 3.163, scale = 4.153), Gamma(shape = 4.675, scale = 0.8). For the standard exponential models, we have an average contact interval period of 6.7 days and average time to removal of 10.2 days.
Figure 13
Comparison of cumulative incidence between (a) Weibull–Gamma model and standard stochastic (b) Exponential–Exponential model.
Comparison between (a) Weibull–Gamma model and the (b) Exponential–Exponential model. Only the first 20 days of reported cases are considered for the fitting, and the solutions (grey curves) are compared with the observed incidence (black dots) in the following 50 days of data. In the insets, the distributions are inferred from the MLE estimates. These distributions are Weibull(shape = 3.163, scale = 4.153), Gamma(shape = 4.675, scale = 0.8). For the standard exponential models, we have an average contact interval period of 6.7 days and average time to removal of 10.2 days.The DSA model is better able to capture the dynamics of the real epidemic, and gives substantially more reliable predictions, even when only a few observations are taken into account. Another important aspect is the interpretability of the results. The DSA model allows us to get a realistic idea of how both the contact interval and the infectious period distributions appear, whereas standard Exponential–Exponential models do not provide much insight, beyond looking at the rate of the underlying Exponential distributions. Another important result that the DSA model is able to achieve is the estimation of the effective population size. This can be interpreted as the number of farms that were potentially involved in the dynamics, and it is therefore an upper limit on the total number of farms that might become infected. For this model, the median estimated population size was (90% confidence interval [1740 − 32 224]; see also figure 14), in line with results from the literature [34].
Figure 14
Frequency histogram of effective population size estimated from the first 20 days of data for the FMD.
Third wave of COVID-19 in India
The analysis of FMD outbreak data makes use of only infection times. As the synthetic data analysis suggests that inference based only on infection times tends to be poorer compared to when both infection times as well as recovery times are available, we now analyse an epidemic where both times are available.In a global effort to document and control the ongoing COVID-19 pandemic, many governments provided freely available population-level datasets that we can use as case studies for inference when both infection and recovery times are known. Various countries adopted strong non-pharmaceutical measures that drastically changed the local dynamics of the epidemic, resulting in several distinct epidemic waves. At the same time, new SARS-CoV-2 variants emerged with markedly different epidemiological characteristics. To curtail the impact of such exogenous factors, we consider only the third wave in India.[4] Data consist of daily incidence and prevalence of cases, recoveries and deaths, meaning that we have data to inform both likelihoods in (3.2) and (3.6). The observed period spans from 15 February 2021 to 31 June 2021 inclusive (figure 8). For this dataset, we assume both the contact interval and the infectious period to be Gamma distributed.
Figure 8
Indian wave of COVID-19 cases. The Delta-wave to which we fit the model is highlighted in purple, and spans from 15 February 2021 to 31 June 2021.
Indian wave of COVID-19 cases. The Delta-wave to which we fit the model is highlighted in purple, and spans from 15 February 2021 to 31 June 2021.Similar to our approach on the FMD data, daily cases are distributed uniformly across the day. Because the DSA method requires only a random sample of infection and recovery times, we work with a dataset generated by taking a random sample (without replacement) of size 3000. We do not consider exogenous factors such as under-reporting of cases as they are beyond the scope of this paper. It is worth noting, however, that these exogenous factors surely have an impact on the results and can be accounted for by a more refined model.The best-fitting inferred contact interval and infectious period distributions are shown in figure 9. There are roughly in line with estimates of viral load and recovery distributions, respectively, from the literature [36]. The point estimate for the reproduction rate is R0 = 1.69. Although R0 of the SARS-CoV-2 Delta variant is estimated to be in the range 3–8 [37], it is more realistic to compare our estimate with R calculated from observed cases in that period, as our model uses only that source of information. The recovery distribution has a mean of 5.6 days and a variance of 26 days, so it is rather wide and right-skewed. The contact interval distribution is more peaked, with a slightly lower mean (around 4.5 days) and a variance of roughly 10. It is important to notice that infection times represent the collection of specimens from infected individuals, and recovery times follow country-specific healthcare system protocols, so they do not necessarily coincide with the true infectious distributions. Furthermore, the infectious period starts immediately after the incubation time has passed, while time to recovery is usually calculated from the onset of symptoms. Finally, the estimated effective population size is 31 million people. This is likely an underestimate because of the underreporting of cases in India during the Delta wave.
Figure 9
The best-fitting contact interval and infectious period distributions inferred from Indian Delta wave data. The distributions are, respectively, Gamma(4.5, 10), and Gamma(5.5, 20).
The best-fitting contact interval and infectious period distributions inferred from Indian Delta wave data. The distributions are, respectively, Gamma(4.5, 10), and Gamma(5.5, 20).Confidence intervals are computed in a similar way to the FMD analysis, with two major differences: the 7-day moving averages result in a curve that is too delayed with respect to the actual one because of exponential growth/decline. Although this effect may be accounted for by considering exponential moving averages, we preferred not to modify the data that way. For a similar reason, computing the variance-adjusted confidence intervals that take into account all the noise that cannot be explained by the model is not possible. Therefore, the confidence intervals, displayed in figure 10, underestimate the true variability of the underlying process, but seem to be generally in good agreement with the data. Interestingly, repeating the inference on different subsets of the original dataset does not produce significantly different estimates for the two distributions of interest. This suggests that the method is robust, not only because we have many data points to inform the likelihood, but also because we consider both the infection times and the recovery/death times. The distribution of the estimates of the reproduction number is shown in appendix C (figure 12).
Figure 10
Confidence intervals for Indian wave. (a) Daily number of new cases, (b) daily number of recoveries or deaths (referred to as removals).
Figure 12
COVID-19 R0 estimates from the bootstrap method.
Confidence intervals for Indian wave. (a) Daily number of new cases, (b) daily number of recoveries or deaths (referred to as removals).
Discussion
In this paper, we presented a method called DSA to both model and infer parameters of non-Markovian epidemic models. A crucial advantage of DSA is that it makes the entire toolkit of survival analysis available for making inference on dynamical systems. Therefore, DSA handles censored, truncated data in a straightforward and principled way. For instance, see [20] for an application of the DSA method adapted to a simple Markovian susceptible–exposed–infected–recovered (SEIR) model, where a snapshot of COVID-19 positivity data gathered through mass testing is used to analyse transmission in an Ohio prison. The analysis helped uncover the grave COVID-19 situation in correctional facilities in Ohio. Also, see [18] where we used the DSA approach coupled with approximate Bayesian computation (ABC) method to quantify the population-level effect of the mass vaccination campaign against COVID-19 in Israel. The analysis further helped quantify the indirect effect of vaccination on the unvaccinated young population in Israel. In [19], the DSA method was used to analyse the individual-level epidemic data from the Ebola pandemic in the Democratic Republic of Congo, suggesting success of the ring vaccination and contact tracing efforts evident from much lower estimates of the effective population size than previous analyses.In this paper, we adopted the law of mass-action to model the interactions among the individuals for the sake of simplicity. Under the law of mass-action, an infected individual can potentially infect any susceptible individual in the population. This is in contrast with network-based models, where infected individuals can only infect their neighbours (connections defined by the graph adjacency matrix) [17,37-39]. However, inferring the underlying network structure is a non-trivial task and often infeasible. This is particularly true when the underlying network exhibits complex substructures [41]. Therefore, the mass-action models are still routinely used despite being unrealistic in many epidemics. Nevertheless, an immediate future direction for us would be to develop the DSA methodology for a non-Markovian network model.The crux of the DSA methodology lies in the change in perspective about dynamical systems—one that views them as describing probability distributions of times of infection and recovery, as opposed to describing (scaled) counts. As such, the method is completely general and could be quickly adapted to the particular setting of any infectious disease. We hope the software package [42] will help translate the DSA methodology into a useful practical tool in modern infectious disease epidemiology.