Literature DB >> 32956345

Fast estimation of time-varying infectious disease transmission rates.

Mikael Jagan1,2, Michelle S deJonge1, Olga Krylova1, David J D Earn1,2.   

Abstract

Compartmental epidemic models have been used extensively to study the historical spread of infectious diseases and to inform strategies for future control. A critical parameter of any such model is the transmission rate. Temporal variation in the transmission rate has a profound influence on disease spread. For this reason, estimation of time-varying transmission rates is an important step in identifying mechanisms that underlie patterns in observed disease incidence and mortality. Here, we present and test fast methods for reconstructing transmission rates from time series of reported incidence. Using simulated data, we quantify the sensitivity of these methods to parameters of the data-generating process and to mis-specification of input parameters by the user. We show that sensitivity to the user's estimate of the initial number of susceptible individuals-considered to be a major limitation of similar methods-can be eliminated by an efficient, "peak-to-peak" iterative technique, which we propose. The method of transmission rate estimation that we advocate is extremely fast, for even the longest infectious disease time series that exist. It can be used independently or as a fast way to obtain better starting conditions for computationally expensive methods, such as iterated filtering and generalized profiling.

Entities:  

Mesh:

Year:  2020        PMID: 32956345      PMCID: PMC7549798          DOI: 10.1371/journal.pcbi.1008124

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


1 Introduction

The transmission rate of an infectious disease is a salient quantity in the study of epidemics. Changes in the transmission rate over time greatly influence the spread of infection [1, 2]. Quantifying how it changes over time can elucidate factors governing disease spread (e.g., weather [3], contact patterns [4]), inform epidemic forecasts, and suggest strategies for epidemic control [5]. In practice, we do not observe transmission directly. Instead, we observe the number of cases of infection (disease incidence) or number of deaths from infection (disease mortality) reported over time, and must reconstruct time-varying transmission rates from these data [6-13]. Utilizing historical mortality records, it is possible to identify patterns in transmission dating far back in time. Most notably, the London Bills of Mortality and the Registrar General’s Weekly Returns enable investigation of transmission patterns continuously from the mid-17th century to the present, for a number of infectious diseases including cholera [14] and smallpox [15]. A mechanistic understanding of long infectious disease time series—three centuries of weekly data in the case of smallpox [15]—requires methods of transmission rate estimation that are both accurate and fast, and therefore tractable for long time scales. Simulation-based methods of transmission rate estimation from reported incidence or mortality have been developed using the susceptible-infected-removed (SIR) model for infectious disease dynamics [16]. Markov chain Monte Carlo (MCMC [17, 18]) and sequential Monte Carlo (as in iterated filtering [8, 19, 20]) methods are statistically rigorous, but not tractable for long time scales owing to high computational cost. Generalized profiling [21, 22], which combines trajectory and gradient matching, is faster, but still too slow for convenient exploration of time series spanning hundreds of years. (Several CPU hours were required to apply generalized profiling to just 26 years of weekly data [22].) In comparison, Finkenstädt and Grenfell’s popular “time series SIR” (tSIR) method [7, 23] is extremely fast, using a simple discretization of a continuous-time SIR model to reduce transmission rate estimation to a local regression problem. However, the tSIR method assumes that the duration of infection is equal to the time step, that natural death of susceptible individuals can be ignored, and that cumulative incidence approximates cumulative births. The latter two assumptions are reasonable for pre-vaccination measles, when most susceptible individuals were eventually infected [6]. However, in many contexts (e.g., with pathogens less transmissible than measles), susceptible mortality over time scales of interest and the difference between incidence and births are non-negligible. In their unpublished PhD and MSc theses, Krylova (Ch. 4 in [24]) and deJonge [25] separately modified a fast discretization method originally proposed by Fine and Clarkson [6]. Krylova’s approach has been employed to estimate the amplitude of seasonal variation in measles transmission in 20th century New York City [9]. Unlike the tSIR method and unlike Fine and Clarkson, Krylova’s and deJonge’s methods do not place constraints on the infectious period or ignore susceptible mortality. Here, we present a new algorithm based on deJonge’s method and compare its performance to the methods of Fine and Clarkson and Krylova. Our main investigative approach is to apply each method to simulated reported incidence data with known underlying transmission rate, so that error in transmission rate estimates can be quantified exactly. Our analysis of the methods reveals a shared sensitivity to process and observation error. We mitigate this issue by introducing a smoothing step. The methods are additionally sensitive to error in the user’s estimate of the initial number of susceptible individuals, which is rarely known with any precision. We propose a fast, iterative technique for estimating this parameter from time series of incidence, births, and natural mortality, eliminating a long-standing barrier to the use of fast methods of transmission rate reconstruction.

2 Methods

In §§2.1 and 2.2 below, we present three fast methods for estimating time-varying transmission rates, based on a mechanistic model of disease spread. In §§2.3–2.7, we outline our systematic analysis of the sensitivity of the methods to parameters of the data-generating process and to error in the user-specified values of input parameters. Finally, in §2.8, we introduce peak-to-peak iteration (PTPI), a technique for estimating the initial number of susceptible individuals. Essential notation is summarized in Table 1.
Table 1

Notation.

Unless otherwise stated, simulations of reported incidence time series use the reference values listed here. If a symbol is to be interpreted differently in relation to disease incidence and disease mortality data, then the correct definition is indicated by (I) and (M), respectively.

SymbolNameDefinitionRef. val.UnitNotes
tkkth observation timeTime of the kth observation in time series data, for k = 0, …, n.t0 + kΔtyears
t0Transient periodDuration of the transient in system (1) that is ignored in simulations of reported incidence, before observations are recorded.2000yearsSystem (1) is numerically integrated between t = 0 years and t = t0, and observed starting at t = t0. This is done so that simulations reflect dynamics near the attractor of system (1).
ΔtObservation intervalTime between successive observations in time series data.1weeksDisease mortality is reported weekly in the London Bills of Mortality.
nTime series lengthTime between the initial and final observations in time series data, in units Δt, given by (tnt0)/Δt.1042If Δt = 1 week, then 1042Δt = ⌊20 × 365/7⌋Δt ≃ 20 years.
[·]ΔtNearest kΔt roundingFor time lengths t, [t]Δt=[tΔt]Δt, where [⋅] denotes nearest integer rounding.
〈⋅〉Long-term averagingFor functions x(t), x=limt1t0tx(s)ds. For sequences xk, x=limn1nk=0nxk.
(S(t), I(t), R(t))StateNumber of (susceptible, infected, removed) individuals in the population at time t.“Removed” individuals have either recovered from the disease and gained permanent immunity or died from the disease.
N(t)Population sizeS(t) + I(t) + R(t).
B(t)BirthsNumber of births that occur during the time interval [t − Δt, t).
Q(t)Cumulative incidenceNumber of susceptibles who become infected during the time interval [t0, t).
Z(t)IncidenceNumber of susceptibles who become infected during the time interval [t − Δt, t).Q(t) − Q(t − Δt)
C(t)Reported disease (I) incidence or (M) mortalityNumber of (I) infections or (M) disease-induced deaths reported during the time interval [t − Δt, t). Eq (31) C is an abbreviation of “cases”, which are reported as infections or as deaths.
(S0, I0, R0)Initial state (t = t0)(S(t0), I(t0), R(t0)).(S*, I*, R*)(S*, I*, R*) denotes the state of system (1) after numerical integration between t = 0 years and t = t0 with seasonally forced transmission rate β(t) (Eq (27)), constant vital rates νc and μc, and initial state (t = 0 years) (S^,I^,R^) (Eq (32); see below).
N0Initial population size (t = t0)N(t0).S* + I* + R*
(S^,I^,R^) Endemic equilibriumEndemic equilibrium of system (1) with constant transmission rate (β ≡ 〈β〉) and constant vital rates (νμμc). Eq (32)
N^0 Initial population size (t = 0 years)N(0).106
xkEstimation input/outputWithin an estimation algorithm (Boxes 13), the supplied or estimated value of x(tk) (x = C, B, μ, Z, S, I, β).varies
ν(t)Birth rateNumber of births per unit time relative to N^0, at time t.νcyear−1In simulations of reported incidence, ν(t) is modeled as a constant νc. In general, estimation of β(t) from data does not require the underlying ν(t) to be constant.
νcBirth rate (constant)See ν(t) above.0.04year−1
μ(t)Natural mortality rateNumber of natural deaths per unit time per capita, at time t.μcyear−1In simulations of reported incidence, μ(t) is modeled as a constant μc. In general, estimation of β(t) from data does not require the underlying μ(t) to be constant.
μcNatural mortality rate (constant)See μ(t) above.0.04year−1
tgenMean generation intervalMean time between onset of infection (in infector) and subsequent transmission of infection (by infector) [26, 27].13daysThe reference value is the sum of the observed mean latent and infectious periods, which for measles are 8 days and 5 days, respectively [16].
γRemoval rateNumber of removals (recoveries or deaths from disease) per unit time per infected.1/tgenday−1
prepCase reporting probability(I) Probability that an infection is reported, or (M) the case fatality ratio times the probability that a death from disease is reported.0.25 or 1If we simulate data with under-reporting, then we use prep = 0.25 as a reference value. Otherwise, we set prep = 1.
trepMean case reporting delayMean time between infection and reporting of (I) infection or (M) disease-induced death.2 or 0weeksIf we simulate data with reporting delays, then we use trep = 2 weeks as a reference value. Otherwise, we set trep = 0 weeks.
β(t)Transmission rateNumber of infections per unit time per susceptible per infected, at time t. Eq (27) year−1In simulations of reported incidence, β(t) is modeled by the seasonal forcing function defined in Eq (27). In general, estimation of β(t) from data does not require it vary seasonally or even periodically.
βMean transmission rateContinuous-time average of the seasonally forced β(t), equal to limt1t0tβ(s)ds.β*year−1β* ≈ 5.6 × 10−4year−1 is the value of 〈β〉 that satisfies Eq (2) with R0=20, νc = μc = 0.04 year−1, tgen = γ−1 = 13 days, and N^0=106.
R0 Basic reproduction numberNumber of individuals that a typical infected person is expected to infect in an otherwise completely susceptible population. Eq (2) For measles in the 20th century, R020 [16].
αSeasonal amplitudeAmplitude of the seasonally forced β(t) relative to 〈β〉.0.08For measles, α ≈ 0.08 [2]. We require α ∈ [0, 1] to ensure that the seasonal forcing function defined in Eq (27) is non-negative.
ϕ(t; ϵ)Environmental noise (realized)Phase shift in the seasonally forced β(t), at time t.Normal(0, ϵ2)ϕ is a realization of a continuous-time stochastic process defined by a set {Φ(t; ϵ)} of independent and Normal(0, ϵ2)-distributed random variables.
ϵStandard deviation of environmental noiseSee ϕ(t; ϵ) above.0.5
qLoess smoothing parameterRough number of nearest neighbours weighted in local regression (i.e., when fitting loess curves to time series), determining the degree of smoothing.See §2.2.6 for an exact definition.

Notation.

Unless otherwise stated, simulations of reported incidence time series use the reference values listed here. If a symbol is to be interpreted differently in relation to disease incidence and disease mortality data, then the correct definition is indicated by (I) and (M), respectively.

2.1 Model of disease transmission

We assume that the principal mechanisms of disease spread in the focal population are captured by the SIR model [16], formulated with time-varying rates of birth, death, and transmission. Expressing the model as a system of ordinary differential equations, we write where S, I, and R are the numbers of individuals who are susceptible, infected, and removed, respectively; N = S + I + R is the population size; and is the population size at an initial time, defined to be 0 years for simplicity. (We reserve the notation N0 for N(t0), where t0 > 0 years is the length of a transient; see Table 1.) The time-varying parameters are ν(t) birth rate, the number of births per unit time relative to ; μ(t) natural mortality rate, the number of natural deaths per unit time per capita (i.e., relative to N); and β(t) transmission rate, the number of infections per unit time per susceptible per infected. The constant parameter γ is the rate of removal from the infected compartment (due to recovery or death from disease) per infected individual. In Eq (1a) and Eq (1b), we use mass action incidence β(t)SI rather than standard incidence β(t)SI/N. Mass action incidence is essential for reproducing transitions in epidemic patterns resulting from changes in the birth rate [2, 28]. In Eq (1a), we write the net birth rate as rather than ν(t). This formulation is for convenience: the factor of does not affect dynamics, but ensures that ν(t) and μ(t) have the same scale. The SIR model (1) assumes that the focal population is homogeneously mixed and subject to the mass action principle, which states that incidence is proportional to the product of the densities of susceptibles and infecteds [16]. The model further assumes that the latent period (time from infection to onset of infectiousness) can be ignored and that the infectious period (time from onset of infectiousness to recovery or death from disease) is exponentially distributed [29]. The distributions of the latent and infectious periods affect disease dynamics [28, 30, 31], but Krylova and Earn [28] showed that the effect on long-term dynamical structure is typically small when the mean generation interval is fixed (see Fig 11 in [28]). For this reason, we assign the mean generation interval implied by the SIR model (1) (tgen = γ−1) the value of the sum of the observed mean latent and infectious periods. This sum is the true mean generation interval if the latent and infectious periods are both exponentially distributed, and is a good estimate of the true mean generation interval more generally [28]. Transmissibility of infection is typically measured by the basic reproduction number , defined as the number of individuals that a typical infected person is expected to infect in an otherwise completely susceptible population [16]. If the birth and death rates are constant (ν ≡ ν and μ ≡ μc), and if the transmission rate has a well-defined average 〈β〉 [32], then the basic reproduction number for the SIR model (1) can be written as [28]

2.2 Estimating β(t) from time series data

Here, we examine three fast methods for estimating time-varying transmission rates β(t). The methods take as input (i) time series of reported disease incidence or disease mortality, (ii) time series of births and natural mortality, and (iii) values for input parameters, such as the mean generation interval tgen. By assumption, the time series data are available at discrete, equally spaced time points where Δt is the observation interval. The methods return as output a time series estimate of β(t), denoted by or simply β, which can be averaged (§2.2.5) or smoothed (§2.2.6) to distill temporal patterns of interest. Missing data must be imputed: the three methods are recursive, so they break down as soon as they encounter a missing value. Imputation can be accomplished most simply via linear interpolation between available data. More sophisticated techniques accounting for uncertainty in missing values are described in [33].

2.2.1 FC method

We review the method first described by Fine and Clarkson [6], referred to here as the “FC method”. Let S(t) and I(t) be the number of susceptibles and infecteds in the population at time t. S decreases when susceptibles become infected or die and increases when susceptibles are born. Let Z(t) and B(t) be the number of infections and births, respectively, that occur during the time interval [t − Δt, t). Assuming that natural mortality was negligible, Fine and Clarkson reconstructed S from Z and B with the recursion Fine and Clarkson further assumed that the observation interval Δt was equal to the mean generation interval tgen, so that prevalence could be approximated by incidence. That is, for all t. They derived an expression for Z(t + Δt) via the mass action principle Rearranging Eq (6), they obtained an estimate of β(t), given by Fine and Clarkson applied Eqs (4), (5), and (7) to estimate S(t), I(t), and β(t) (for k = 0, …, n), after specifying (i) the initial number of susceptibles S0 = S(t0), and (ii) values of Z(t) and B(t) from incidence and birth data, respectively. A limitation of the FC method is the constraint requiring Δt = tgen. For some diseases, this is a minor issue, because incidence and birth data can be aggregated so that the time between successive aggregates is approximately equal to tgen. For example, the mean generation interval of measles is approximately two weeks, so Fine and Clarkson [6] aggregated pairs of weekly observations. A second, more serious limitation is the assumption, implicit in Eqs (4) and (5), that natural mortality is negligible. We discuss this issue in §3.1.

2.2.2 S method

Krylova (Ch. 4 in [24]) generalized the FC method in order to eliminate the constraint requiring Δt = tgen and account for natural mortality. Her approach is based on the SEIR model, which distinguishes “exposed” individuals in the latent stage of infection from infectious individuals. Here, we adapt Krylova’s approach to the SIR model (1) and refer to our approach as the “S method.” We define S, I, Z, and B as in the FC method. Let μ(t) be the per capita natural mortality rate at time t, and let Q(t) be the total number of infections occurring between the initial observation time t0 and current time t (i.e., cumulative incidence). The observation interval Δt is no longer constrained to be equal to the mean generation interval tgen. We reconstruct S recursively by discretizing Eq (1a): Eq (8) is the result of applying the forward Euler method for numerical integration to Eq (1a), and replacing the incidence and birth terms with Z(t + Δt) and B(t + Δt), respectively. Eq (8) is identical to Eq (4) of the FC method, except that it includes a natural mortality term. In order to estimate β(t), we note that, by definition, dQ/dt is the rate at which individuals enter the infected compartment. From Eq (1b), this is If the mean generation interval tgen is short enough that I and μ are roughly constant between times t and t + tgen, then dI/dt ≈ 0 in that interval, and using Eq (1b) we can write In this case, dQ/dt is also (approximately) the rate at which individuals leave the infected compartment, tgen time after infection: Note that Eq (11) is also valid if the generation interval is narrowly distributed around its mean tgen (even if tgen is long). Discretizing Eqs (9) and (11) using forward Euler, we obtain two approximations of Z(t + Δt): Rearranging Eq (12) yields an estimate of β(t), given by and an estimate of I(t), given by Since data are available only at the observation times t (Eq (3)), the value of Z(t + Δt − tgen) in Eq (14) will be observed only if tgen is an integer multiple of Δt. In general, tgen is not divisible by Δt. Therefore, in practice, we replace tgen in Z(t + Δt − tgen) with the nearest integer multiple of Δt, denoted here by [tgen]Δ: Thus, the S method is defined by Eq (13), coupled with Eqs (8) and (15) for the reconstruction of S and I. The S method requires users to specify (i) input parameters S0 = S(t0) and tgen = γ−1, and (ii) values of Z(t), B(t), and μ(t) from incidence, birth, and natural mortality data, respectively. The FC method is a special case of the S method, obtained by setting Δt = tgen and μ(t) ≡ 0.

2.2.3 SI method

DeJonge [25] improved Krylova’s method (Ch. 4 in [24]) by reconstructing I directly from Eq (1b) instead of relying on the approximation in Eq (11). Here, we improve deJonge’s discretization, which employs the forward Euler method, by instead combining forward and backward Euler. One way to do this is to use the trapezoidal method: whereas forward and backward Euler take f ′(t)Δt and f ′(t + Δt)Δt, respectively, to approximate integrals , the trapezoidal method takes the average , which is less prone to error. Our discretization, which we call the “SI method”, is consistently more accurate than deJonge’s and others (see §S9 of S1 Text for a comparison of nine possible algorithms). Numerically integrating Eq (1a) and Eq (1b) using the trapezoidal method, and replacing the incidence and birth terms with Z(t + Δt) and B(t + Δt), respectively, we obtain and Eq (17) eliminates an important problem with Eq (15) of the S method, which estimates I(t) ≈ 0 if Z(t + Δt − [tgen]Δ) = 0, leading to division by zero in Eq (13). Discretizing Eq (9) using forward and backward Euler, we obtain two approximations of Z(t + Δt): Rearranging Eq (18) yields two estimates of β(t), whose average supplies a more accurate estimate (see §S9 of S1 Text), given by Thus, the SI method is defined by Eq (20), coupled with Eqs (16) and (17) for the reconstruction of S and I. Compared to the S method, the SI method, in principle, requires one additional input parameter, namely the initial number of infecteds I0 = I(t0). In §3.6, we show that, in practice, this additional information is not necessary.

2.2.4 Estimating true incidence from reported incidence

Let C(t) be the number of infections reported during the time interval [t − Δt, t). We estimate true incidence Z from reported incidence C via where prep is the probability that an infection is reported and [trep]Δ is the mean time between infection and reporting, rounded to the nearest integer multiple of the observation interval Δt. Eq (21) has the limitation that multiplying by does not correct for under-reporting if, by chance, C(t + [trep]Δ) = 0. In this situation, not only is the result Z(t) ≈ 0 incorrect, but we divide by zero in the FC and S methods when we substitute Eqs (5) and (15) in Eqs (7) and (13), respectively. If C(t + [trep]Δ) = C(t + [trep]Δ + Δt) = 0, then the SI method also suffers: Eq (20) gives β(t) ≈ 0. To circumvent these issues, we replace zeros in reported incidence time series by linearly interpolating between nonzero values prior to estimating true incidence using Eq (21). We do not replace leading and trailing zeros. If what we observe is deaths from disease, rather than infections, then we have the complication that only a fraction of infections end in death. In this situation, we can still use Eq (21) to estimate Z, provided we interpret (i) C as reported disease mortality, (ii) prep as the case fatality ratio times the probability that a death from disease is reported, and (iii) trep as the mean time between infection and reporting of disease-induced death. A more sophisticated method of inferring true incidence from reported data is described in [34].

2.2.5 Averaging raw estimates of β(t)

Given fixed time series data and input parameters, the FC, S, and SI methods return estimates of β(t) that are entirely determined (not random). In the absence of additional data observed from the same population, it is difficult to assign confidence to the output. However, if an estimate is approximately periodic (with apparent period T) and contains m complete cycles, and if we assume β(t) is truly periodic, then we can view as containing a sample of m estimates of the true cycle, with some variance, and use its mean as an estimator instead of any one of the m cycles. For such an estimate defined on the interval [t0, t0 + mT), the mean and variance are given by In §3.3, we apply the S and SI methods to simulated data to estimate an underlying, seasonally forced β(t) (Eq (27)), which has a period of 1 year. We linearly interpolate the raw time series estimate β and compute the average 1-year cycle in the interpolant βint(t) using Eq (22a). Comparing this average to the true 1-year cycle, we are able to assess bias in the two methods. Note that and s2(t) can be used to obtain a formal, likelihood-based measure of confidence in estimates (see §2.3.4 in [35]).

2.2.6 Smoothing raw estimates of β(t)

Process and observation error introduce random fluctuations in reported incidence on top of longer-term (e.g., seasonal) variation. In §3.2, we show that noise in reported incidence is spuriously amplified in β, the raw time series estimate of β(t). To distill temporal patterns of interest from the noise, we fit a smooth loess (short for local regression; see Ch. 8.1 in [36]) curve βloess(t; q) to the points and use βloess(t; q) as our final estimate of β(t). Here, q ∈ {5, …, n + 1} is an integer-valued parameter controlling the degree of smoothing. At times t ∈ [t0, t], the fitted value βloess(t; q) is obtained as follows: Order the distances d = |t − t| of the time points t (Eq (3)) from t, letting denote the ith smallest distance (for i = 1, …, n + 1). Fit a quadratic polynomial p2(t) to the points . This is done by weighted least squares using tricube weights Hence only time points t nearer to t than the qth nearest time point are weighted in the fit. Define βloess(t; q) = p2(t). Typically, smoother fits are obtained with greater q [36, 37]. The optimal value of q for a given time series β, denoted by qopt, is that which minimizes error in βloess(t; q) relative to β(t). In §3.4, we estimate β(t) from simulated data, smooth β using each value of q on a grid, and use our knowledge of β(t) to determine qopt. We show that it is possible for smoothing to eliminate much of the error in β attributable to process and observation error. Thus, in §2.2.7, we explicitly define the FC, S, and SI methods with loess smoothing as a final step. In practice, β(t) is not known, so we cannot determine qopt. In this case, qopt can be estimated using statistical methods, such as time series cross-validation [38]. However, reasonable results can be obtained much more quickly by inspecting βloess(t; q) directly and increasing q from 4 until a desirable degree of smoothing is achieved (e.g., until noise on the time scale of weeks is visibly reduced, and patterns on the time scale of months are easier to discern).

2.2.7 Summary

In Boxes 1–3 below, we summarize the three methods derived in §§2.2.1–2.2.6 for estimating time-varying transmission rates β(t) from time series data with observation times t (Eq (3)). We use the notation x to refer to the value supplied or computed for x(t) within the estimation algorithms (x = C, B, μ, Z, S, I, β).

Box 1. FC method (Fine & Clarkson 1982 [6])

where Δt is assumed to be roughly equal to tgen, and natural mortality is assumed to be negligible. Users must specify: a time series of reported incidence or reported disease mortality, with zeros replaced via linear interpolation between nonzero values; a time series of births; input parameters S0, tgen, prep, and trep.

Box 2. S method (adapted from Krylova 2011 [24])

Users must specify: a time series of reported incidence or reported disease mortality, with zeros replaced via linear interpolation between nonzero values; a time series of births; a time series of the per capita natural mortality rate; input parameters S0, tgen = γ−1, prep, and trep; loess smoothing parameter q.

Box 3. SI method (adapted from deJonge 2014 [25])

Users must specify: a time series of reported incidence or reported disease mortality, with zeros replaced via linear interpolation between nonzero values; a time series of births; a time series of the per capita natural mortality rate; input parameters S0, I0, tgen = γ−1, prep, and trep; loess smoothing parameter q. In Box 4, we provide instructions for input specification based on our analysis of the methods. β is sensitive to mis-specification of S0, but not I0 (cf. §3.6.1). If the user’s estimate of S0 is uncertain, and if the incidence time series Z is roughly periodic, then a more accurate estimate of S0 may be obtained via peak-to-peak iteration (PTPI; cf. §3.7). If S is negative for any k, then it is likely that the case reporting probability prep was underestimated or that births were systematically under-reported by B. This can be resolved by correcting the estimate of prep or correcting B, then restarting the algorithm. Users should apply close to the minimal correction necessary to prevent negative S. q must be tuned to the β time series. An estimate of qopt can be obtained using statistical methods, such as time series cross-validation [38]. However, q can be tuned quickly through visual inspection of βloess(t; q): one can increase q from 5 until a desirable degree of smoothing is achieved (e.g., until noise on the time scale of weeks is visibly reduced, and patterns on the time scale of months are easier to discern).

2.3 Simulating reported incidence data

In order to compare the performance of the FC, S, and SI methods in estimating β(t), we apply the methods to simulated reported incidence data with known underlying β(t). Here, we outline our methods for simulating these data using the SIR model (1).

2.3.1 Seasonal forcing of β(t) with environmental stochasticity

We reproduce seasonal fluctuation in the transmission rate by modeling β(t) in Eq (1) as a sinusoidal forcing function with period equal to one year: Here, α ∈ [0, 1] is the amplitude of seasonal forcing relative to the mean 〈β〉. We introduce stochastic fluctuation by adding a randomly generated phase shift: ϕ is a realization of a continuous-time stochastic process consisting of independent, Normal(0, ϵ2)-distributed random variables. It models leading to random noise in the transmission rate. Modeling environmental stochasticity with a random phase shift rather than additive noise conveniently avoids negative β(t): β(t) oscillates between 〈β〉(1 − α) and 〈β〉(1 + α) regardless of the distribution of the noise. In practice, we take the values of ϕ at times t (Eq (3)) and linearly interpolate to obtain values in between. This helps to make simulations of Eqs (1) and (9) with adaptive time steps (cf. §2.3.2) reproducible.

2.3.2 Generating incidence time series with demographic stochasticity

We supplement Eq (1) with Eq (9), so that trajectories of the resulting system record changes in cumulative incidence Q. In this system, we employ the noisy transmission rate β(t) (Eq (28)) and constant vital rates νc and μc. We then either (i) numerically integrate the differential equations to approximate their solution, or (ii) treat the system more realistically as an event-driven, continuous-time Markov process (with event rates specified by terms in the differential equations) and use the adaptive tau-leaping algorithm for stochastic simulation [39, 40]. The latter approach accounts for in disease dynamics. We prevent disease fadeout in simulations with demographic stochasticity by setting the rates of infected recovery and death to zero whenever I = 1. In both methods of simulation, we record the state of the system at times t (Eq (3)), choosing initial state where S0 + I0 + R0 = N0 = N(t0). Finally, we derive incidence Z from Q via first differences:

2.3.3 Introducing observation error

Observation error due to under-reporting (prep < 1) and reporting delays (trep > 0 weeks) creates discrepancies between true incidence Z and reported incidence C. We introduce random observation error to simulated incidence time series with delayed binomial sampling: For simulations without observation error, we set prep < 1 and trep > 0 weeks.

2.3.4 Parametrization

The simulation method outlined in §§2.3.1–2.3.3 is parametrized by For most simulations, we assign parameters the reference values listed in Table 1. We consider different values when we investigate the sensitivity of β(t) estimates to data-generating parameters (cf. §2.6.1). We bypass transient dynamics by choosing t0 = 2000 years and numerically integrating system (1) between 0 years and t0 in order to obtain a point (S*, I*, R*) very near the attractor. For this step, we exclude environmental noise, defining β(t) as in Eq (27), and take the initial state to be the endemic equilibrium of the unforced system (system (1) with β ≡ 〈β〉 and ν ≡ μ ≡ μc):

2.4 Creating mock birth and natural mortality time series

In addition to reported incidence data, the FC, S, and SI methods require time series of births and the per capita natural mortality rate. For simplicity, we create mock time series by (i) choosing constant vital rates and , then (ii) setting and for all k. Note that is the result of integrating the net birth rate in the SIR model (1), given by , between successive observation times using . We specify and , where νc and μc are the data-generating vital rates (cf. §2.3.4), except when we investigate the sensitivity of β(t) estimates to incorrect vital data (cf. §2.6.2). For example, to model under-reporting of births, we simply set .

2.5 Measuring β(t) estimation error

When we simulate reported incidence data, the underlying transmission rate β(t) is defined beforehand via Eq (27) and known for all t. We use this knowledge to quantify the error in estimates of β(t) obtained from the data. Specifically, given an estimate defined at time points t (Eq (3)), we compute the relative root mean square error (RRMSE), defined as where Note that by “underlying transmission rate” we mean the transmission rate excluding environmental noise. Although we simulate data using the noisy β(t), defined in Eq (28), our aim is to reconstruct the noiseless β(t), defined in Eq (27).

2.6 Sensitivity analysis

Error in β(t) estimation from reported incidence data depends on how the data were generated. The number of cases reported over time is influenced by features of the disease (e.g., the natural history of infection), population (e.g., contact patterns), and case reporting (e.g., the frequency and accuracy of reports). In our simulations of reported incidence, there are 14 (cf. §2.3.4), whose values are summarized in the vector Estimation error also depends on how accurately certain data-generating parameters are specified by users of the FC, S, and SI methods. The initial observation time t0, observation interval Δt, and time series length n are always known exactly. Other parameters (〈β〉, α, ϵ, , νc, and μc) influence our simulations of reported incidence, but in practice are not parameters of the FC, S, and SI methods. In practice, users are required to specify only S0, tgen, prep, trep, and (with the SI method) I0. However, when we test the methods here, we do specify vital rates νc and μc in order to create mock (constant) birth and natural mortality time series (cf. §2.4). The specified values of these 7 are summarized in the vector First, we investigate the sensitivity of the methods to the data-generating parameter values . Then, we examine their sensitivity to error in the user’s specification ′ of the input parameters. Here, we describe our analysis using the notation to refer to transmission rate estimates constructed with user input ′, from data generated by parameter values .

2.6.1 Sensitivity to data-generating parameters

In §3.5, we consider the ideal situation in which the input ′ corresponds exactly to the data-generating . In this case, how sensitive is error in to ? For example, is β(t) estimated more accurately for diseases with longer mean generation interval tgen, etc.? To answer these questions, we perform the following steps on a grid of data-generating parameter values : Simulate 1000 reported incidence time series using . Create corresponding mock (constant) birth and natural mortality time series (cf. §2.4), specifying and in the input ′. Estimate β(t) from the simulated data, specifying , , , , and in the input ′. Compute the median RRMSE in the estimates (1000 estimates corresponding to 1000 simulations). We repeat this analysis 6 times, corresponding to 2 methods of β(t) estimation (S or SI) and 3 methods of data simulation: without demographic stochasticity and without observation error (fixing prep = 1, trep = 0 weeks), with demographic stochasticity but without observation error (fixing prep = 1, trep = 0 weeks), or with demographic stochasticity and with observation error (fixing prep = 0.25 unless sensitivity to prep is considered, trep = 2 weeks). Environmental stochasticity (ϵ = 0.5) is included in all simulations.

2.6.2 Sensitivity to mis-specification of input parameters

In §3.6, we fix the data-generating and consider the more realistic situation in which components of the input ′ differ from the corresponding components of by a potentially large factor. In this case, how sensitive is error in to error in ′? For example, how important is having an accurate estimate of tgen, etc.? To answer these questions, we perform the following steps: Simulate 1000 reported incidence time series using fixed data-generating parameter values . (We assign the reference values listed in Table 1.) For each point on a grid of input parameter values ′: Create mock (constant) birth and natural mortality time series, taking and from the input ′. Estimate β(t) from the simulated data, taking , , , , and from the input ′. Compute the median RRMSE in the estimates (1000 estimates corresponding to 1000 simulations). We repeat this analysis 6 times, as outlined at the end of §2.6.1.

2.7 Asymptotic analysis

Here, we examine analytically the propagation of input error to the output of the SI method. (Similar expressions for propagated errors are obtained by analyzing the S method.) Our analysis here supports numerical results presented in §3.6 concerning the sensitivity of β(t) estimation error to mis-specification of input parameters.

2.7.1 Explicit solutions of the (S, I) difference equations

The SI method uses Eq (26a) to Eq (26c) to recursively reconstruct S(t) and I(t) from time series of reported incidence, births, and natural mortality. After substitution of Eq (26a), Eq (26b) and Eq (26c) can be written as where r = [trep]Δ/Δt is the mean case reporting delay in units of the observation interval, rounded to the nearest integer. Eq (37) are linear, first order difference equations, whose explicit solutions are obtained using standard algebraic techniques (see Eq 1.2.4 in [41]) and given by with the conventions and if a < b. As we show in §2.7.2, explicit solutions of Eq (37) facilitate asymptotic analysis.

2.7.2 Propagation of input error to (S, I)

We consider the special case in which the vital rates are constant and set and μ = μc for all k (cf. §2.4). Then Eq (38) simplify to where we have made explicit the dependence of S and I on input parameters S0, I0, νc, μc, tgen = γ−1, and prep. Using Eq (39), we can derive exact expressions for the error propagated to S and I in the SI method as a result of assigning an incorrect value to an input parameter. If the initial number of susceptibles is truly S0, but we specify , where ω > 0, then the error propagated to S is where is the life expectancy in the population. Similarly, specifying for I0 yields an error in I, where tinf = (γ + μc)−1 is the mean time between infection and removal from the infected compartment, accounting for the possibility of natural death during infection. Eqs (40) and (41) show that the errors propagated to S and I vanish as k → ∞; we exploit this fact to improve susceptible reconstruction (cf. §2.8). Mis-specifying νc by assigning a value creates an error in S that increases in magnitude over time and converges to a limit: Unlike Eq (42), the exact expression for Err(S, μc ← ωμc) is not readily simplified and is difficult to interpret: However, if C has a well-defined long-term average 〈C〉 (this will be true if, for instance, C is periodic), then Err(S, μc ← ωμc) has a well-defined long-term average 〈Err(S, μc ← ωμc)〉 with a simple form. Replacing C in Eq (43) with 〈C〉, simplifying the resulting expression, then taking the limit as k → ∞, we obtain We can similarly show the following, still assuming that 〈C〉 is well-defined: Here, is the (incorrect) mean time spent infected that results when ωμc is incorrectly specified for μc (Eq (45)) or ωtgen is incorrectly specified for tgen (Eq (46)).

2.7.3 Propagation of error in (S, I) to β

Let β(Z, Z, S, I) be the raw SI method estimate of β(t), given by the right hand side of Eq (26d). Suppose that, due to propagated error (cf. §2.7.2), the arguments are incorrect by a factor, so that where ω, ω, ω > 0. Then the computed β will have relative error Hence severe underestimation of S or I (ω ≪ 1 or ω ≪ 1) causes the relative error in β to blow up.

2.8 Estimating S0 via peak-to-peak iteration

Reconstruction of susceptibles S(t) is a necessary step in the reconstruction of β(t) using the FC, S, and SI methods. In §3.6, we show that susceptible reconstruction requires accurate specification of the initial number of susceptibles S0 = S(t0). However, reliable estimates of S0 have, to this point, been difficult to obtain in practice. We propose a technique for iteratively improving estimates of S0, requiring only incidence, birth, and natural mortality data at times t (Eq (3)). Crucially, our technique, which we call “peak-to-peak iteration” (PTPI), enables accurate susceptible reconstruction without direct observation of the susceptible population size at the initial time. Our approach is motivated by application of the SI method to simulated data. When we incorrectly guessed the value of S0 and attempted to reconstruct S(t) via Eq (26b), the absolute error in the reconstruction decreased monotonically over time (k). (Eq (40) shows that the error propagated from S0 to S vanishes as k → ∞.) Consequently, if the underlying dynamics are at least approximately periodic, and if t0 and t occur at the same phase of the cycle, then S is actually a better estimate of S0 than our initial guess. In this situation, instead of reconstructing β(t) directly, we can use S as an updated estimate of S0, and reconstruct S(t) more accurately. This procedure can be repeated any number of times, and, with simulated data, we observe rapid convergence to an accurate estimate of S0 (cf. §3.7). The key point is that the reconstructed final state can be used as an improved estimate of the initial state only if the initial and final states occur at the same phase of the cycle. This will not be true unless the observation period (the time between the first and last observations in time series data) is an integer multiple of the period of the underlying dynamics. We can ensure this by choosing appropriate times at which to start and stop S(t) reconstruction. In noisy periodic data, the points in a cycle that are easiest to identify robustly are the peaks. Consequently, we ignore observations (i) prior to the time t of the first peak in the incidence time series and (ii) after the time t of the last peak that occurs near an integer multiple of the apparent period after the first peak. For the truncated time series, the iterations converge to an accurate estimate of S(t) starting from an initial guess, and we recover the corresponding accurate estimate of S0 by solving Eq (26b) backwards in time, from t to t0: The complete PTPI algorithm, which consists of finding t and t (truncation step) and estimating S0 (iteration step), is outlined in Boxes 5 and 6 below. In §3.7, we assess the performance of PTPI by applying the technique to simulated data with known underlying S0, starting from an incorrect initial estimate of S0.

Box 5. Peak-to-peak iteration: Truncation step

Goal: Given a roughly periodic time series of incidence, we want to find the time t of the first peak and the time t of the last peak occurring at the same phase of the cycle. These times are necessary for the iteration step (Box 6). Algorithm: Smooth the raw incidence time series Z by applying a (2ℓ1 + 1)-point central moving average, computed via Choose minimal ℓ1 large enough to remove spurious peaks in Z caused by noise, while retaining true peaks. Identify the period T of the smoothed incidence time series from its power spectrum, and calculate the number of embedded cycles, given by . Construct the set indexing peaks in : Choose minimal ℓ2 large enough to ensure that indexes true peaks in , but not spurious peaks caused by noise (any that remain after smoothing). Define , the set of times of peaks in , and record the time of the first peak, given by . For i = 0, …, m, define τ = t+ iT and find the element of nearest τ, namely . The resulting subset should contain successive time points that are roughly one period apart, i.e., the corresponding peaks in should occur at the same phase of the cycle. Record the time of the last such peak, given by .

Box 6. Peak-to-peak iteration: Iteration step

Goal: We want to produce an accurate estimate of the initial number of susceptibles S0 = S(t0), given a roughly periodic time series of incidence, a time series of births, a time series of the per capita natural mortality rate, times t and t as defined in the truncation step (Box 5), and an initial estimate of S0. Algorithm: Define an initial estimate of S(t). (We use the initial estimate of S0.) Reconstruct S(t) between times t and t using Eq (26b), starting with the current estimate of S(t). Update the estimate of S(t) with the estimate of S(t) obtained in (ii). Repeat (ii) and (iii) until the sequence of estimates of S(t) converges (to within a desirable tolerance). Reconstruct S(t) between times t0 and t using Eq (51), starting with the final estimate of S(t) obtained in (iv). The reconstruction is performed backwards in time, from t to t0. Record the estimate of S0 = S(t0) computed in (v). This value can be passed back to Eq (26b), allowing for reconstruction of S(t) between times t0 and t, as usual.

3 Results

In §3.1, we compare the performance of the FC, S, and SI methods in estimating β(t) from an idealized reported incidence time series. In §3.2, we show how process and observation error create spurious noise in estimates of β(t). In §§3.3 and 3.4, we examine averaging and smoothing as ways to distill temporal patterns of interest from noisy estimates of β(t). In §§3.5 and 3.6, we summarize our systematic analysis of the sensitivity of β(t) estimation error to data-generating parameters and to mis-specification of input parameters by the user. In §3.7, addressing apparent sensitivity to mis-specification of the initial number of susceptibles S0, we assess the performance of PTPI as a method of estimating S0. Finally, in §3.8, we report the run times of the S and SI methods and PTPI. The results reported here are entirely reproducible using the annotated R code available in S1 File.

3.1 Example of β(t) estimation using the FC, S, and SI methods

We applied the FC, S, and SI methods without input error to estimate S(t) and β(t) from an idealized reported incidence time series, simulated without process or observation error. The time series estimates S and β are shown in Fig 1. The S and SI methods estimated S(t) and β(t) accurately at every time point, whereas the FC method captured seasonality but failed otherwise. In the FC method, S neglects natural mortality (Eq (24b)), so it increases without bound while β decays to zero due to division by S (Eq (24d)).
Fig 1

Example of S(t) and β(t) estimation using the FC, S, and SI methods.

Plotted are the susceptible population size S(t) and seasonally forced transmission rate β(t) (Eq (27)) underlying 20 years of weekly reported incidence, together with time series estimates S and β obtained from the data by the FC [blue], S [green], and SI [red] methods. The reported incidence time series (Δt = 1 week, n = ⌊20 × 365/7⌋ = 1042) was simulated without process or observation error (ϵ = 0, prep = 1), using reference values (Table 1) for all other data-generating parameters. The three estimation methods were applied without input error, i.e., all input parameters were assigned their true (data-generating) values. [Panel A] S(t) scaled by 1/N0, describing the number of susceptibles as a proportion of the initial population size. Grey lines show that the absolute error in the FC method estimate of S(t) increases linearly as μc 〈S〉t, where μc is the constant per capita natural mortality rate and 〈S〉 is the continuous-time average of S(t). [Panel B] β(t) scaled by 1/〈β〉, describing the transmission rate relative to its mean. RRMSE (Eq (33)) in the β time series generated by the (FC, S, SI) method is roughly (0.3355, 0.0240, 0.0021).

Example of S(t) and β(t) estimation using the FC, S, and SI methods.

Plotted are the susceptible population size S(t) and seasonally forced transmission rate β(t) (Eq (27)) underlying 20 years of weekly reported incidence, together with time series estimates S and β obtained from the data by the FC [blue], S [green], and SI [red] methods. The reported incidence time series (Δt = 1 week, n = ⌊20 × 365/7⌋ = 1042) was simulated without process or observation error (ϵ = 0, prep = 1), using reference values (Table 1) for all other data-generating parameters. The three estimation methods were applied without input error, i.e., all input parameters were assigned their true (data-generating) values. [Panel A] S(t) scaled by 1/N0, describing the number of susceptibles as a proportion of the initial population size. Grey lines show that the absolute error in the FC method estimate of S(t) increases linearly as μc 〈S〉t, where μc is the constant per capita natural mortality rate and 〈S〉 is the continuous-time average of S(t). [Panel B] β(t) scaled by 1/〈β〉, describing the transmission rate relative to its mean. RRMSE (Eq (33)) in the β time series generated by the (FC, S, SI) method is roughly (0.3355, 0.0240, 0.0021). Fig 1A confirms that the absolute error in the FC method estimate of S(t) increases linearly as μc 〈S〉t, where μc is the constant per capita natural mortality rate and 〈S〉 is the continuous-time average of S(t). In practice, the FC method fails whenever natural mortality in the underlying population is non-negligible. Since the S and SI methods address this limitation at effectively no computational cost, we do not present further analysis of the FC method. In Fig 1B, the SI method estimate of β(t) was very accurate (RRMSE ≈ 0.2%), whereas the S method estimate peaked too early and too high (RRMSE ≈ 2.4%).

3.2 Effects of process and observation error

We applied the S and SI methods without input error to four reported incidence time series C, simulated using the same parameter values but with different levels of process and observation noise. The first simulation was purely deterministic, while the remaining three included (i) environmental stochasticity [ES], (ii) ES and demographic stochasticity [ES+DS], or (iii) ES, DS, and observation error [ES+DS+OE]. Fig 2 shows the resulting estimates Z, I, and β of true incidence Z(t), prevalence I(t), and the seasonally forced transmission rate β(t).
Fig 2

Effects of process and observation error on the S and SI methods.

Plotted are the estimates [Row A] Z, [Row B] I, and [Row C] β of true incidence Z(t), prevalence I(t), and the seasonally forced transmission rate β(t) (Eq (27)) obtained by applying the [Left] S and [Right] SI methods without input error to each of four simulated reported incidence time series (indicated by the legend; Δt = 1 week, n = ⌊3 × 365/7⌋ = 156). The first simulation was purely deterministic [dark grey] (ϵ = 0, prep = 1), while the remaining three accounted for (i) environmental stochasticity [ES, light grey] (ϵ = 0.5, prep = 1), (ii) ES and demographic stochasticity [ES+DS, blue] (ϵ = 0.5, prep = 1), or (iii) ES, DS, and observation error [ES+DS+OE, red] (ϵ = 0.5, prep = 0.25). Reference values (Table 1) were assigned to all other data-generating parameters, in all four simulations. The left and right panels in Row A are identical, because the S and SI methods compute Z identically (compare Eqs (25a) and (26a)). RRMSE in the β time series is (0.0239, 0.0375, 0.1126, 0.1432) with the S method and (0.0021, 0.0153, 0.0494, 0.0591) with the SI method (order follows the legend). Note that the underlying β(t) was the same in all simulations; it is not plotted in Row C, but is close to perfectly represented by the dark grey curve in the right panel (RRMSE ≈ 0.2%). Due to process error, the underlying Z(t) and I(t) (also not shown) varied between the deterministic, ES, and ES+DS simulations.

Effects of process and observation error on the S and SI methods.

Plotted are the estimates [Row A] Z, [Row B] I, and [Row C] β of true incidence Z(t), prevalence I(t), and the seasonally forced transmission rate β(t) (Eq (27)) obtained by applying the [Left] S and [Right] SI methods without input error to each of four simulated reported incidence time series (indicated by the legend; Δt = 1 week, n = ⌊3 × 365/7⌋ = 156). The first simulation was purely deterministic [dark grey] (ϵ = 0, prep = 1), while the remaining three accounted for (i) environmental stochasticity [ES, light grey] (ϵ = 0.5, prep = 1), (ii) ES and demographic stochasticity [ES+DS, blue] (ϵ = 0.5, prep = 1), or (iii) ES, DS, and observation error [ES+DS+OE, red] (ϵ = 0.5, prep = 0.25). Reference values (Table 1) were assigned to all other data-generating parameters, in all four simulations. The left and right panels in Row A are identical, because the S and SI methods compute Z identically (compare Eqs (25a) and (26a)). RRMSE in the β time series is (0.0239, 0.0375, 0.1126, 0.1432) with the S method and (0.0021, 0.0153, 0.0494, 0.0591) with the SI method (order follows the legend). Note that the underlying β(t) was the same in all simulations; it is not plotted in Row C, but is close to perfectly represented by the dark grey curve in the right panel (RRMSE ≈ 0.2%). Due to process error, the underlying Z(t) and I(t) (also not shown) varied between the deterministic, ES, and ES+DS simulations. Noise of any type introduces random fluctuations in C on top of longer-term (e.g., seasonal) variation. Noise in C is propagated to Z (Fig 2A) and I (Fig 2B), because (i) in both the S and SI methods, we scale C by a constant factor of to compute Z (Eqs (25a) and (26a)); (ii) in the S method, we scale C by a constant factor of [prep(γ + μk)Δt]−1 to compute I (Eq (25c) after substitution of Eq (25a)); and (iii) in the SI method, I contains a weighted sum of C terms (Eq (38b)). Noise in Z and I is amplified in β (Fig 2C), distorting the correct temporal pattern, for the following reason. When Z and I are close to zero, small absolute changes in either yield large relative changes in the ratio Z/I and in turn β, which contains a factor of Z/I in the S method (Eq (25d)) and (Z + Z)/(2I) in the SI method (Eq (26d)). Hence low amplitude noise in Z and I appears as spurious, higher amplitude noise in β. This is an important issue in practice, because the incidence of endemic diseases is typically very small relative to the population size, and periodic fluctuations bringing incidence even closer to zero are common for many diseases [4, 14, 42]. Fig 2 shows that the SI method is much better than the S method at resisting noise propagation. One reason is the effective smoothing of incidence in the SI method, which replaces Z with (Z + Z)/2 in the computation of β (compare Eqs (25d) and (26d)). We expose a second reason in §3.2.1 below by comparing the variance in I induced by observation error, between the two methods. (We expect similar results for process error.)

3.2.1 Propagation of noise from C to I

Consider the S and SI method estimates of prevalence I(t), Here, g = [tgen]Δ/Δt and r = [trep]Δ/Δt are the mean generation interval and case reporting delay in units of the observation interval, rounded to the nearest integer. These estimates are obtained from Eq (25c) (after substitution of Eqs (25a)) and (38b) when we assume a constant natural mortality rate μc. Following §2.3.3, suppose reported incidence is generated from true incidence Z(t) via . Then the variance of C is It follows from Eqs (54) and (55) and the identity Var(aX) = a2 Var(X) that If Z(t) has a well-defined average 〈Z〉, then replacing instances of Z in Eq (56) with 〈Z〉 and taking the limit as k → ∞, we obtain the average variances Comparing these with 〈Var(C)〉 = 〈Z〉prep(1 − prep) using reference parameter values tgen = γ−1 = 13 days, μc = 0.04year−1, and Δt = 1 week, we obtain where tinf = (γ + μc)−1 is the mean time spent infected. Hence, while both the S and SI methods suffer from propagation of noise from reported incidence C to estimated prevalence I, particularly for prep ≪ 1, the S method tends to be much worse (by a factor of 3.44/0.93 ≈ 3.7 in this example). Comparative resistance to noise propagation is a distinct advantage of the SI method over the S method.

3.3 Averaging the raw estimate of β(t)

Fig 3A displays two raw estimates β (S and SI methods, applied without input error) of a seasonally forced β(t), each spanning 1000 years (only the first 10 years are shown). The estimates embed 1000 1-year cycles, which are displayed in Fig 3B and 3C together with their 1-year average (cf. §2.2.5).
Fig 3

Bias and variance in 1-year cycles embedded in three estimates of a seasonally forced β(t).

[Panel A] In black, the seasonally forced β(t) (Eq (27)) underlying 1000 years of simulated reported incidence data. In (transparent) colour, raw estimates β obtained from the data by the S [green] and SI [red] methods, both applied without input error. Only the first 10 of 1000 years are shown. [Panels B and C] In black, the true 1-year cycle in the seasonally forced β(t). In light (transparent) colour, the 1000 1-year cycles embedded in the linear interpolant βint(t) of β. In dark colour, the average 1-year cycle (Eq (22a)) in βint(t). Results are shown for both the S [Panel B, green] and SI [Panel C, red] methods. [Panel D] Like Panel C, except for a smooth loess curve βloess(t; q) (q = 53) fit to β, instead of the interpolant βint(t). [Details] A reported incidence time series with 1000 years of weekly observations (Δt = 1 week, n = 52153) was simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters.

Bias and variance in 1-year cycles embedded in three estimates of a seasonally forced β(t).

[Panel A] In black, the seasonally forced β(t) (Eq (27)) underlying 1000 years of simulated reported incidence data. In (transparent) colour, raw estimates β obtained from the data by the S [green] and SI [red] methods, both applied without input error. Only the first 10 of 1000 years are shown. [Panels B and C] In black, the true 1-year cycle in the seasonally forced β(t). In light (transparent) colour, the 1000 1-year cycles embedded in the linear interpolant βint(t) of β. In dark colour, the average 1-year cycle (Eq (22a)) in βint(t). Results are shown for both the S [Panel B, green] and SI [Panel C, red] methods. [Panel D] Like Panel C, except for a smooth loess curve βloess(t; q) (q = 53) fit to β, instead of the interpolant βint(t). [Details] A reported incidence time series with 1000 years of weekly observations (Δt = 1 week, n = 52153) was simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters. Both estimates suffered from spurious noise distorting the correct seasonal pattern, caused by process and observation error in the data-generating process (cf. §3.2). As in Fig 2C, the variance was markedly smaller with the SI method. Averaging the embedded 1-year cycles recovered the true 1-year cycle from the noise. In the absence of input error, the S method appears to carry a slight bias (peaking early and too high, as in Fig 1), whereas the SI method is nearly unbiased. While some existing infectious disease time series span several centuries [15], in practice, averaging as in Fig 3B and 3C is sensible only over time intervals during which the underlying seasonal pattern in transmission is roughly stationary.

3.4 Smoothing the raw estimate of β(t)

Regardless of whether averaging is employed, comparison of Fig 3C and 3D shows that it is helpful to smooth the β time series by fitting a loess curve βloess(t; q) (cf. §2.2.6). An appropriate degree of smoothing (i.e., choice of loess smoothing parameter q) eliminated spurious noise without significantly increasing bias. Fig 4A quantifies the effect of smoothing β using the optimal value qopt for parameter q (cf. §2.2.6). It plots RRMSE before and after smoothing as a function of the amount of noise in the simulated reported incidence data, which was modulated by varying the case reporting probability prep between 0.01 and 1 (more noise for smaller prep; see Eq (31)).
Fig 4

Reduction in β(t) estimation error with optimal loess smoothing.

The horizontal axis measures the case reporting probability prep, for which 41 values equally spaced on a logarithmic scale between 0.01 and 1 were considered. Using each value of prep and reference values (Table 1) for all other parameters, 100 reported incidence time series (Δt = 1 week, n = 1042) were simulated accounting for environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (measured by prep). The underlying seasonally forced β(t) (Eq (27)) was estimated from reported incidence using the S and SI methods, both applied without input error, yielding two raw estimates β per simulation. Smooth loess curves βloess(t; q) (q = 10, …, 110; cf. §2.2.6) were fit to each β time series. The optimal q for a given time series, denoted by qopt, was defined as the value that minimized RRMSE (Eq (33)) in βloess(t; q). Overall, for each value of prep and each β(t) estimation method (S and SI), 100 values of qopt were obtained corresponding to 100 β time series. Plotted on the vertical axis as functions of prep are the median and 5th and 95th percentiles of [Panel A] RRMSE in the raw estimates β [dashed lines] and optimal loess estimates βloess(t; qopt) [solid lines] and [Panel B] qopt. Lines and bands indicate the median and 5th–95th percentile range, respectively. Results for the S and SI methods are shown in green and red, respectively.

Reduction in β(t) estimation error with optimal loess smoothing.

The horizontal axis measures the case reporting probability prep, for which 41 values equally spaced on a logarithmic scale between 0.01 and 1 were considered. Using each value of prep and reference values (Table 1) for all other parameters, 100 reported incidence time series (Δt = 1 week, n = 1042) were simulated accounting for environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (measured by prep). The underlying seasonally forced β(t) (Eq (27)) was estimated from reported incidence using the S and SI methods, both applied without input error, yielding two raw estimates β per simulation. Smooth loess curves βloess(t; q) (q = 10, …, 110; cf. §2.2.6) were fit to each β time series. The optimal q for a given time series, denoted by qopt, was defined as the value that minimized RRMSE (Eq (33)) in βloess(t; q). Overall, for each value of prep and each β(t) estimation method (S and SI), 100 values of qopt were obtained corresponding to 100 β time series. Plotted on the vertical axis as functions of prep are the median and 5th and 95th percentiles of [Panel A] RRMSE in the raw estimates β [dashed lines] and optimal loess estimates βloess(t; qopt) [solid lines] and [Panel B] qopt. Lines and bands indicate the median and 5th–95th percentile range, respectively. Results for the S and SI methods are shown in green and red, respectively. Using the optimal loess estimate βloess(t; qopt) instead of the raw estimate β significantly reduced RRMSE—by at least 46% for the S method and 17% for the SI method across all simulations. Although raw estimates generated by the SI method were consistently more accurate (expected in light of Fig 3B and 3C), optimal loess estimates were comparable between the S and SI methods for prep > 0.2 (RRMSE ≈ 3%). For prep < 0.2 (severe under-reporting of cases), optimal smoothing failed to an increasing extent to recover the underlying β(t) from noise in β. In this setting, the S method was greatly outperformed by the SI method, which is more resilient to noise in reported incidence (cf. §3.2). Fig 4B shows that median qopt was roughly constant for prep > 0.1, with More smoothing (greater q) was required to minimize RRMSE for prep > 0.1. More generally, Fig 4 indicates that the S and SI methods should always include a smoothing step. Hence, in the remaining analysis, we always smooth β.

3.5 Sensitivity to data-generating parameters

Here, we characterize the sensitivity of β(t) estimation error to parameters of the data-generating process. As in §§3.1–3.4, we consider the ideal case in which the user-specified values of all input parameters are equal to the true (data-generating) values. The details of our analysis are outlined in §2.6.1. Fig 5 plots the median RRMSE in estimates of a seasonally forced β(t) (Eq (27)) from 1000 realizations of a reported incidence time series, as a bivariate function of the mean 〈β〉 and amplitude α of seasonal forcing. To aid interpretation, the 〈β〉 axis was scaled to measure the basic reproduction number (Eq (2)).
Fig 5

Sensitivity of β(t) estimation error to the mean 〈β〉 and amplitude α of seasonal forcing.

Contained in each panel are heatmaps of median RRMSE (Eq (33)) in estimates of a seasonally forced β(t) (Eq (27)) from simulated reported incidence time series, as a bivariate function of the mean 〈β〉 and amplitude α of seasonal forcing. The 〈β〉 axis has been scaled to measure the basic reproduction number (Eq (2)). When simulating reported incidence, reference values (Table 1) were assigned to all data-generating parameters except 〈β〉 and α. A grid of pairs with levels and α = 0, 0.01, …, 0.2 was considered, with 〈β〉 defined for each value of via Eq (2). For each parametrization, 1000 simulations were performed with environmental stochasticity [ES] (ϵ = 0.5) and with or without demographic stochasticity [DS] and observation error [OE], as indicated by row: [Row A] without DS or OE (prep = 1, trep = 0 weeks), [Row B] with DS but without OE (prep = 1, trep = 0 weeks), [Row C] with DS and OE (prep = 0.25, trep = 2 weeks). Corresponding mock birth and natural mortality time series were created, then β(t) was estimated from the data using [Left] the S method and [Right] the SI method, all without input error. For each set of estimates of β(t) (1000 estimates per parametrization, per simulation method, per estimation method), the median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate heatmap, coloured according to the logarithmic scale on the right. The darkest blue indicates median RRMSE less than 0.01.

Sensitivity of β(t) estimation error to the mean 〈β〉 and amplitude α of seasonal forcing.

Contained in each panel are heatmaps of median RRMSE (Eq (33)) in estimates of a seasonally forced β(t) (Eq (27)) from simulated reported incidence time series, as a bivariate function of the mean 〈β〉 and amplitude α of seasonal forcing. The 〈β〉 axis has been scaled to measure the basic reproduction number (Eq (2)). When simulating reported incidence, reference values (Table 1) were assigned to all data-generating parameters except 〈β〉 and α. A grid of pairs with levels and α = 0, 0.01, …, 0.2 was considered, with 〈β〉 defined for each value of via Eq (2). For each parametrization, 1000 simulations were performed with environmental stochasticity [ES] (ϵ = 0.5) and with or without demographic stochasticity [DS] and observation error [OE], as indicated by row: [Row A] without DS or OE (prep = 1, trep = 0 weeks), [Row B] with DS but without OE (prep = 1, trep = 0 weeks), [Row C] with DS and OE (prep = 0.25, trep = 2 weeks). Corresponding mock birth and natural mortality time series were created, then β(t) was estimated from the data using [Left] the S method and [Right] the SI method, all without input error. For each set of estimates of β(t) (1000 estimates per parametrization, per simulation method, per estimation method), the median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate heatmap, coloured according to the logarithmic scale on the right. The darkest blue indicates median RRMSE less than 0.01. Fig 6 plots median RRMSE as a univariate function of each of 6 additional parameters—the initial states S0 and I0, vital rates νc and μc, mean generation interval tgen, and case reporting probability prep—with the focal parameter assigned values between and 4 times its reference value (Table 1). The horizontal axis measures the ratio of the focal parameter’s data-generating value to its reference value, so that commensurate deviations from the reference case can be compared across the 6 parameters.
Fig 6

Sensitivity of β(t) estimation error to data-generating parameters other than 〈β〉 and α.

Plotted in each panel is the median RRMSE (Eq (33)) in estimates of a seasonally forced β(t) (Eq (27)) from simulated reported incidence time series (Δt = 1 week, n = 1042), as a univariate function of each of 5 or 6 data-generating parameters (indicated by the legend). When simulating reported incidence, reference values (Table 1) were assigned to all but the focal parameter, which was assigned 41 values logarithmically spaced between and 4 times its reference value. The horizontal axis (logarithmic scale) measures the ratio of the focal parameter’s true value to its reference value, so that commensurate deviations from the reference case can be compared across parameters. For each parametrization, 1000 simulations were performed with environmental stochasticity [ES] (ϵ = 0.5) and with or without demographic stochasticity [DS] and observation error [OE], as indicated by row: [Row A] without DS or OE (prep = 1, trep = 0 weeks), [Row B] with DS but without OE (prep = 1, trep = 0 weeks), or [Row C] with DS and OE (prep = 0.25 except when prep is the focal parameter, trep = 2 weeks). Corresponding mock birth and natural mortality time series were created, then β(t) was estimated from the data using [Left] the S method and [Right] the SI method, all without input error. For each set of estimates of β(t) (1000 estimates per parametrization, per simulation method, per estimation method), the median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate panel and graph.

Sensitivity of β(t) estimation error to data-generating parameters other than 〈β〉 and α.

Plotted in each panel is the median RRMSE (Eq (33)) in estimates of a seasonally forced β(t) (Eq (27)) from simulated reported incidence time series (Δt = 1 week, n = 1042), as a univariate function of each of 5 or 6 data-generating parameters (indicated by the legend). When simulating reported incidence, reference values (Table 1) were assigned to all but the focal parameter, which was assigned 41 values logarithmically spaced between and 4 times its reference value. The horizontal axis (logarithmic scale) measures the ratio of the focal parameter’s true value to its reference value, so that commensurate deviations from the reference case can be compared across parameters. For each parametrization, 1000 simulations were performed with environmental stochasticity [ES] (ϵ = 0.5) and with or without demographic stochasticity [DS] and observation error [OE], as indicated by row: [Row A] without DS or OE (prep = 1, trep = 0 weeks), [Row B] with DS but without OE (prep = 1, trep = 0 weeks), or [Row C] with DS and OE (prep = 0.25 except when prep is the focal parameter, trep = 2 weeks). Corresponding mock birth and natural mortality time series were created, then β(t) was estimated from the data using [Left] the S method and [Right] the SI method, all without input error. For each set of estimates of β(t) (1000 estimates per parametrization, per simulation method, per estimation method), the median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate panel and graph. In order to produce Figs 5 and 6, we assigned reference values (Table 1) to all but the focal data-generating parameter(s) (e.g., all except 〈β〉 and α in Fig 5). We fit loess curves βloess(t; q) to all raw estimates β of β(t), and recorded the RRMSE in βloess(t; q). Motivated by Fig 4B and Eq (59), we fixed q = q*, taking q* = 65 with the S method and q* = 53 with the SI method. A pattern in our interpretation of Figs 5 and 6 below is that error in β(t) estimation is sensitive to a parameter if changes in that parameter (i) cause incidence Z(t) or prevalence I(t) to approach zero more frequently or more closely, or (ii) increase noise in estimated incidence Z or estimated prevalence I. Both outcomes incorrectly increase noise in β (cf. §3.2). When the noise in β is extreme, setting q = q* can undersmooth the time series (q* < qopt). In this case, smaller RRMSE is attainable by determining qopt and setting q = qopt. Nevertheless, we did not find qopt for each of the 5 × 106 time series considered by Figs 5 and 6, which would have increased the total computation time by a factor of 100. Consequently, Figs 5 and 6 may overestimate the sensitivity of β(t) estimation error to data-generating parameters. (In §S5.3 of S1 Text, we show that the quantitative effect of choosing q* over qopt is likely to be small.)

3.5.1 Sensitivity to the basic reproduction number and seasonal amplitude α (Fig 5)

For fixed α, median RRMSE was a non-monotonic function of . The reason is that changes in (effective) are responsible for dynamical transitions that alter the structure of solutions of the SIR model (1) [28, 42, 43]. Specifically, as is increased from 2 to 32, minimum incidence Zmin and minimum prevalence Imin on the attractor varies non-monotonically (see Fig 2 in [28]). Smaller Zmin and Imin yield more noise in β, and correspondingly greater RRMSE. For fixed , Imin decreases monotonically as α is increased from 0 to 1 (see Fig 11 in [43]), so we expect median RRMSE to increase monotonically with α, as observed in Fig 5.

3.5.2 Sensitivity to the initial state (S0, I0) (Fig 6)

RRMSE is sensitive to the data-generating S0, but not I0. The reference values of S0 and I0 are taken from a point (S*, I*, R*) on the attractor of the SIR model (1) with seasonally forced β(t) and constant vital rates νc and μc (cf. §2.3.4). When S0 is far from S*, the solution of system (1) undergoes extreme fluctuation before relaxing to the attractor, and both Z and I approach zero during the transient, generating spurious noise at the start of the β time series. Note that I0 differing from I* has a much smaller effect on dynamics than S0 differing from S* by the same factor. Since I* ≪ S*, the perturbation of (S0, I0, R0) from the attractor is much smaller.

3.5.3 Sensitivity to vital rates νc and μc (Fig 6)

Median RRMSE was a non-monotonic function of the data-generating birth rate νc. This behaviour arises because scaling νc is dynamically equivalent to scaling by the same factor [2, 28], and median RRMSE is a non-monotonic function of (cf. §3.5.1 above). Changing the data-generating natural mortality rate μc had a negligible effect on RRMSE. This is unsurprising, because natural death is dominated by recovery and disease-induced death in governing the rate of infected decrease. That is, γ ≫ μ(t) in Eq (1b), so changes in μc by up to a factor of 4 have little effect on dynamics.

3.5.4 Sensitivity to the mean generation interval tgen (Fig 6)

Median RRMSE increased rapidly as the data-generating tgen was made smaller than 2−4/5 (roughly 0.57) times its reference value of 13 days. A period-doubling bifurcation occurs near this value of tgen, and the attractor of the SIR model (1) acquires a 2-year cycle with much smaller Zmin and Imin (see §S5.3.1 of S1 Text). Propagation of noise to β intensifies, resulting in greater RRMSE. The performance of the S method fluctuates more as a function of tgen than that of the SI method. This occurs because the S method rounds tgen in the numerator of Eq (25c) to the nearest integer multiple of Δt, and the rounding error oscillates as a function of tgen. The SI method does not require rounding, so these fluctuations are not observed.

3.5.5 Sensitivity to the case reporting probability prep (Fig 6)

When the reported incidence data contain observation error (Fig 6C), RRMSE is additionally sensitive to the case reporting probability prep. Decreasing prep increases noise in reported incidence C (Eq (31)), which is propagated to estimated incidence Z, estimated prevalence I, and in turn β (cf. §3.2). Fig 6 suggests weak sensitivity to prep. However, noise in Z and I is amplified in β to the extent that Z and I are close to zero (cf. §3.2). Hence, for example, if the data-generating tgen were assigned a value smaller than half its reference value of 13 days, then we would have observed more acute sensitivity to prep as a result of closer approaches to zero by Z and I (cf. §3.5.4 above).

3.5.6 S method versus SI method (Figs 5 and 6)

Both the S and SI methods performed well, estimating β(t) with median RRMSE less than 10% across most parametrizations. However, by resisting noise propagation (cf. §3.2), the SI method was significantly less sensitive to the data-generating parameters and to the addition of demographic stochasticity and observation error.

3.6 Sensitivity to mis-specification of input parameters

In §3.5, we considered the ideal situation in which the user knows the true (data-generating) values of the input parameters. Here, we examine the more realistic situation in which the user specifies input parameters with some error. The effect of mis-specification is particularly important for parameters that are difficult to estimate accurately, such as the case reporting probability prep. The details of our analysis are outlined in §2.6.2. We restrict our attention to application of the SI method to reported incidence data simulated with process and observation error. Differences in RRMSE between methods of data simulation and β(t) estimation are dominated (by an order of magnitude) by the increase in RRMSE resulting from mis-specified input parameters. Fig 7A plots the median RRMSE in estimates of β(t) from 1000 realizations of a reported incidence time series, as a univariate function of the factor by which an input parameter—one of the initial states S0 and I0, mean generation interval tgen, vital rates νc and μc, and case reporting parameters prep and trep—was mis-specified. The specified value of the focal parameter was varied between and 4 times its true (data-generating) value, and the remaining parameters were specified without error.
Fig 7

Sensitivity of β(t) estimation error to the user-specified values of input parameters.

[Panel A] Median RRMSE (Eq (33)) in estimates of β(t) from simulated reported incidence time series (Δt = 1 week, n = 1042), as a univariate function of the factor by which an input parameter was mis-specified. One thousand simulations were performed using fixed values (Table 1) for all data-generating parameters. The simulations accounted for environmental stochasticity [ES] (ϵ = 0.5), demographic stochasticity [DS], and observation error [OE] (prep = 0.25, trep = 2 weeks). For each simulation, corresponding mock birth and natural mortality time series were created, and β(t) was estimated from the data using the SI method. True (data-generating) values were specified for all input parameters except the focal parameter (indicated by the legend), for which 41 values logarithmically spaced between and 4 times the true value were specified in turn. Each input parametrization yielded 1000 estimates of β(t), whose median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate graph. [Panel B] Result of repeating the analysis from Panel A in which S0 was specified with varying amounts of error, but with the initially erroneous value of S0 updated using the method of peak-to-peak iteration (PTPI; 25 iterations) prior to β(t) estimation. The original result, obtained without PTPI, is presented for comparison.

Sensitivity of β(t) estimation error to the user-specified values of input parameters.

[Panel A] Median RRMSE (Eq (33)) in estimates of β(t) from simulated reported incidence time series (Δt = 1 week, n = 1042), as a univariate function of the factor by which an input parameter was mis-specified. One thousand simulations were performed using fixed values (Table 1) for all data-generating parameters. The simulations accounted for environmental stochasticity [ES] (ϵ = 0.5), demographic stochasticity [DS], and observation error [OE] (prep = 0.25, trep = 2 weeks). For each simulation, corresponding mock birth and natural mortality time series were created, and β(t) was estimated from the data using the SI method. True (data-generating) values were specified for all input parameters except the focal parameter (indicated by the legend), for which 41 values logarithmically spaced between and 4 times the true value were specified in turn. Each input parametrization yielded 1000 estimates of β(t), whose median RRMSE was calculated (after smoothing with fixed q; see Eq (59)) and displayed as one point in the appropriate graph. [Panel B] Result of repeating the analysis from Panel A in which S0 was specified with varying amounts of error, but with the initially erroneous value of S0 updated using the method of peak-to-peak iteration (PTPI; 25 iterations) prior to β(t) estimation. The original result, obtained without PTPI, is presented for comparison.

3.6.1 Sensitivity to error in the specified initial state

Fig 7 shows that error in the specified value of S0 is propagated non-negligibly to estimates of β(t), while mis-specification of I0 has practically no effect on β(t) estimation error. Eqs (40) and (41) show that specifying incorrect values and for S0 and I0 creates errors in S and I that vanish geometrically as k → ∞. However, since tlife ≫ tinf, the decay is significantly slower in S. Indeed, with reference values μc = 0.04 year−1, tgen = γ−1 = 13 days, and Δt = 1 week, we find that a factor of 10 reduction in error between times t and t requires just i = 5 in the infected time series, compared to i = 3002 in the susceptible time series (roughly 58 years with Δt = 1 week). Hence, in practice, accurate reconstruction of S(t), I(t), and in turn β(t) relies on accurate specification of S0, but not I0. We address sensitivity to mis-specification of S0 in §3.7 below.

3.6.2 Sensitivity to error in the specified birth rate and case reporting probability

Mis-specifying νc or prep by a factor of 21/10 (7.2%) yielded median RRMSE greater than 30%. Mis-specifying by a factor of 2−1/10 (−6.7%) led to even worse estimates of β(t), with median RRMSE exceeding 100% (not visible in Fig 7A). Eqs (42) and (47) show that specifying incorrect values and for νc and prep generates absolute errors in S that tend to increase over time (k) to a limit. In practice, systematic underestimation of births by the B time series (modeled here by ) and overestimation of incidence by the Z time series () can cause S to eventually take negative values. Once this happens, attempts by the S and SI methods to reconstruct β(t) fail completely. While this failure may seem concerning, it should be viewed as a tool for diagnosing incorrect birth and case reporting rates: if the S or SI method yields negative S for any k, then one should speculate that births were underestimated or that incidence was overestimated, and retry the algorithm with a scaled up B time series and/or with greater prep (as Z is computed by scaling reported incidence by a factor of ; see Eqs (25a) and (26a)). Of course, overcorrection is also undesirable (cf. right half of Fig 7A). In our work, we have found that a brief exploration of possible adjustments—factors by which to increase B and/or prep—suffices to identify ones that prevent both negative S and pronounced transient dynamics at the start of the susceptible time series (indicating under- or overcorrection).

3.7 Solution of the S estimation problem using PTPI

In §3.6, we showed that the performance of the S and SI methods is highly sensitive to mis-specification of the initial number of susceptibles S0. Here, we assess PTPI as a way to iteratively improve initially poor estimates of S0 prior to reconstruction of S(t) and β(t). Fig 8 demonstrates PTPI for an example in which S0 was overestimated by a factor of 4 by a user of the SI method. PTPI yielded increasingly accurate estimates of S0 and correspondingly more accurate reconstructions of S(t) (Fig 8B) and β(t) (Fig 8C). Fig 7B repeats our analysis from §3.6, except using PTPI (25 iterations) to update the incorrect estimate of S0 prior to reconstructing β(t). We see that application of PTPI in conjunction with the SI method enables accurate β(t) reconstruction independently of errors in the initial estimate of S0. This result is unsurprising in light of Fig 9, which shows that PTPI converges rapidly (in fewer than 10 iterations) to an accurate estimate of S0 independently of the initial guess. Due to process error in the underlying dynamics, the relative error in the limiting estimate of S0 varied between the 1000 realizations of reported incidence considered (5th–95th percentile range [−11.9, 12.5]%, median 0.9%). Process error creates variance in the time between peaks in incidence (see Fig 8A), violating the periodicity assumption of PTPI (the theoretical basis of the technique; cf. §2.8). Nevertheless, Figs 7–9 demonstrate that PTPI can significantly improve S(t) and β(t) reconstruction from roughly periodic incidence data.
Fig 8

Example of S(t) and β(t) reconstruction with an overestimate of S0 corrected by peak-to-peak iteration (PTPI).

[Panel A] Truncation step of PTPI (Box 5). Plotted is a reconstruction of true incidence Z(t) from a simulated reported incidence time series, before [Z, black] and after [, yellow] smoothing with a 13-point central moving average. Vertical lines indicate peaks in . The times of the first peak in and the last peak occurring at the same phase of the cycle (in this case, the last peak) are denoted by t and t. [Panel B] Iteration step of PTPI (Box 6), where the initial estimates of both S0 = S(0) and S(t) were taken to be 4 times the true (data-generating) value of S0. Plotted in grey are successive reconstructions of S(t) between times t and t, generated by updating the estimate of S(t) with the estimate of S(t) obtained in the previous iteration. Dashed continuations to the left of t display estimation of S0 backwards in time from estimates of S(t). Plotted in black is the result of reconstructing S(t) starting from the final estimate of S0, which was obtained after 25 iterations and had a relative error of roughly 1.4% (compared to 300% in the initial estimate). [Panel C] The sequence of reconstructions of β(t) corresponding to the estimates of S0 shown in Panel B. [Details] Twenty years of weekly reported incidence (Δt = 1 week, n = 1042) were simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters. Z(t), S(t) and β(t) were reconstructed from reported incidence using the SI method without input error (apart from mis-specification of S0).

Fig 9

Convergence of estimates of S0 obtained using peak-to-peak iteration (PTPI).

S0 was estimated by applying PTPI (25 iterations) to 1000 incidence time series (i.e., 1000 realizations of a reported incidence time series, scaled by ). An initial guess for S0 was taken to be or 4 times the true (data-generating) value. For each initial guess, this process generated 1000 sequences of 26 estimates of S0. Plotted are the median [black lines] and 5th–95th percentile range [grey bands] of the estimate of S0 at each iteration, for the first 10 iterations. The vertical axis measures (on a logarithmic scale) the ratio of the estimated and true values of S0, hence convergence close to 1 [dashed green line] represents convergence of the estimates close to the true value. [Details] One thousand reported incidence time series (Δt = 1 week, n = 1042) were simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters, including S0 (hence S0 was the same in all simulations). True incidence was estimated from reported incidence via Eq (26a) (with reporting parameters prep and trep correctly specified), yielding 1000 time series of estimated incidence. Corresponding mock (constant) birth and natural mortality time series were created (with vital rates νc and μc correctly specified), and these data (estimated incidence, births, natural mortality) were passed to the PTPI algorithm, allowing for iterative re-estimation of S0.

Example of S(t) and β(t) reconstruction with an overestimate of S0 corrected by peak-to-peak iteration (PTPI).

[Panel A] Truncation step of PTPI (Box 5). Plotted is a reconstruction of true incidence Z(t) from a simulated reported incidence time series, before [Z, black] and after [, yellow] smoothing with a 13-point central moving average. Vertical lines indicate peaks in . The times of the first peak in and the last peak occurring at the same phase of the cycle (in this case, the last peak) are denoted by t and t. [Panel B] Iteration step of PTPI (Box 6), where the initial estimates of both S0 = S(0) and S(t) were taken to be 4 times the true (data-generating) value of S0. Plotted in grey are successive reconstructions of S(t) between times t and t, generated by updating the estimate of S(t) with the estimate of S(t) obtained in the previous iteration. Dashed continuations to the left of t display estimation of S0 backwards in time from estimates of S(t). Plotted in black is the result of reconstructing S(t) starting from the final estimate of S0, which was obtained after 25 iterations and had a relative error of roughly 1.4% (compared to 300% in the initial estimate). [Panel C] The sequence of reconstructions of β(t) corresponding to the estimates of S0 shown in Panel B. [Details] Twenty years of weekly reported incidence (Δt = 1 week, n = 1042) were simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters. Z(t), S(t) and β(t) were reconstructed from reported incidence using the SI method without input error (apart from mis-specification of S0).

Convergence of estimates of S0 obtained using peak-to-peak iteration (PTPI).

S0 was estimated by applying PTPI (25 iterations) to 1000 incidence time series (i.e., 1000 realizations of a reported incidence time series, scaled by ). An initial guess for S0 was taken to be or 4 times the true (data-generating) value. For each initial guess, this process generated 1000 sequences of 26 estimates of S0. Plotted are the median [black lines] and 5th–95th percentile range [grey bands] of the estimate of S0 at each iteration, for the first 10 iterations. The vertical axis measures (on a logarithmic scale) the ratio of the estimated and true values of S0, hence convergence close to 1 [dashed green line] represents convergence of the estimates close to the true value. [Details] One thousand reported incidence time series (Δt = 1 week, n = 1042) were simulated with environmental noise in transmission (ϵ = 0.5), demographic stochasticity, and random under-reporting of cases (prep = 0.25), using reference values (Table 1) for the remaining parameters, including S0 (hence S0 was the same in all simulations). True incidence was estimated from reported incidence via Eq (26a) (with reporting parameters prep and trep correctly specified), yielding 1000 time series of estimated incidence. Corresponding mock (constant) birth and natural mortality time series were created (with vital rates νc and μc correctly specified), and these data (estimated incidence, births, natural mortality) were passed to the PTPI algorithm, allowing for iterative re-estimation of S0.

3.8 Run time

We implemented the S and SI methods and PTPI in R and ran them on a MacBook Pro with a 2.4 GHz Quad-Core Intel Core i5 chip. The S and SI methods are both extremely fast, requiring a total of 0.124 and 0.376 seconds, respectively, to generate a reconstruction of β(t) from 1000 years of weekly reported incidence (Δt = 1 week, n = 52142). Application of PTPI in conjunction with either method increases the run time with each iteration, but the total run time remains inconsequential due to the rate of convergence of the iterations to a limiting estimate of S0. For example, when we applied PTPI to the same simulated data, the truncation step (Box 5) added 0.094 seconds to the total run time, while the iteration step (Box 5) added 1.01 seconds per iteration on average.

4 Discussion

We have compared three fast methods of estimating the time-varying transmission rate β(t) from reported incidence time series, all based on discretizations of the SIR model (1). Fine and Clarkson’s method [6], referred to here as the FC method, fails rapidly in practice, because it treats natural mortality in the susceptible population as negligible. Although Krylova’s method [24], adapted here as the S method, corrects this limitation of the FC method and is accurate for certain simulated data, her method suffers from extreme sensitivity to process and observation error. Specifically, noise in reported incidence is spuriously propagated to its estimates of β(t). Our algorithm for transmission rate estimation, referred to here as the SI method and based on deJonge’s method [25], is much more resilient to noise in reported incidence and therefore superior to the S method. Like its predecessors, the SI method is sensitive to (i) certain input parameters: the initial number of susceptible individuals S0, the case reporting probability prep, and the mean generation interval tgen; as well as (ii) vital data: times series of births and natural mortality without substantial systematic errors. The requirement of a good estimate of S0 has been a major barrier to use of existing fast methods of β(t) estimation (including those presented in [6, 24, 25]). We have proposed and demonstrated PTPI as a valid and fast technique for obtaining accurate estimates of S0 from poor initial guesses, conditional on periodic dynamics (epidemic recurrence with a fixed period). Use of the SI method in conjunction with PTPI represents a major advance over the existing fast methods. Estimation of the case reporting probability prep is possible using maximum likelihood approaches, including trajectory matching. However, a fast way to obtain a crude estimate of prep is to divide cumulative reported incidence over the time interval [t0, t], by the cumulative incidence that is expected from the unforced SIR model (system (1) with β ≡ 〈β〉, ν ≡ νc, and μ ≡ μc) at equilibrium: This approximation can be made in temporal subintervals to obtain a time-varying reporting rate, which would replace the constant prep in Eq (26a). Sensitivity of the SI method to mis-specification of the mean generation interval (tgen) may be of greater concern, though if the distribution of the incubation period (time from infection to onset of symptoms) is narrow, then tgen will be well approximated by the (observable) mean serial interval [44]. Overall, the SI method, in conjunction with PTPI, represents a highly tractable approach to reconstructing susceptibles and β(t) from infectious disease time series that span decades or centuries. It makes fewer assumptions about the disease and population of interest than the regression-based tSIR method [7, 23] (i.e., it does not require an infectious period equal to the observation interval, ignore susceptible mortality, or assume that cumulative incidence approximates cumulative births). Moreover, it is significantly less complex and much less computationally demanding than simulation-based methods of inference, such as iterated filtering [8, 19, 20] and generalized profiling [21, 22]. Even when the observed infectious disease time series is short enough that simulation-based methods are tractable, the approach to transmission rate reconstruction that we promote here can be usefully employed to provide better starting conditions at negligible computational cost.

Text supplement.

A .pdf document containing annotated R code, making the results reported here completely reproducible by the reader. (PDF) Click here for additional data file.

Source files.

A .zip archive containing all of the source files needed to compile S1 Text. (ZIP) Click here for additional data file.
  26 in total

1.  Destabilization of epidemic models with the inclusion of realistic distributions of infectious periods.

Authors:  A L Lloyd
Journal:  Proc Biol Sci       Date:  2001-05-07       Impact factor: 5.349

2.  Transients and attractors in epidemics.

Authors:  Chris T Bauch; David J D Earn
Journal:  Proc Biol Sci       Date:  2003-08-07       Impact factor: 5.349

3.  Parameterizing state-space models for infectious disease dynamics by generalized profiling: measles in Ontario.

Authors:  Giles Hooker; Stephen P Ellner; Laura De Vargas Roditi; David J D Earn
Journal:  J R Soc Interface       Date:  2010-11-17       Impact factor: 4.118

4.  Inference for nonlinear dynamical systems.

Authors:  E L Ionides; C Bretó; A A King
Journal:  Proc Natl Acad Sci U S A       Date:  2006-11-22       Impact factor: 11.205

5.  How generation intervals shape the relationship between growth rates and reproductive numbers.

Authors:  J Wallinga; M Lipsitch
Journal:  Proc Biol Sci       Date:  2007-02-22       Impact factor: 5.349

6.  Forecasting Epidemics Through Nonparametric Estimation of Time-Dependent Transmission Rates Using the SEIR Model.

Authors:  Alexandra Smirnova; Linda deCamp; Gerardo Chowell
Journal:  Bull Math Biol       Date:  2017-05-02       Impact factor: 1.758

7.  The cohort effect in childhood disease dynamics.

Authors:  Daihai He; David J D Earn
Journal:  J R Soc Interface       Date:  2016-07       Impact factor: 4.118

8.  Measles in England and Wales--I: An analysis of factors underlying seasonal patterns.

Authors:  P E Fine; J A Clarkson
Journal:  Int J Epidemiol       Date:  1982-03       Impact factor: 7.196

9.  The interval between successive cases of an infectious disease.

Authors:  Paul E M Fine
Journal:  Am J Epidemiol       Date:  2003-12-01       Impact factor: 4.897

10.  Likelihood-based estimation of continuous-time epidemic models from time-series data: application to measles transmission in London.

Authors:  Simon Cauchemez; Neil M Ferguson
Journal:  J R Soc Interface       Date:  2008-08-06       Impact factor: 4.118

View more
  2 in total

1.  Patterns of smallpox mortality in London, England, over three centuries.

Authors:  Olga Krylova; David J D Earn
Journal:  PLoS Biol       Date:  2020-12-21       Impact factor: 8.029

2.  Response: Commentary: Statistical Modeling for the Prediction of Infectious Disease Dissemination With Special Reference to COVID-19 Spread.

Authors:  Subhash Kumar Yadav; Yusuf Akhter
Journal:  Front Public Health       Date:  2022-01-31
  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.