
Estimating the basic reproduction number at the beginning of an outbreak.

Sawitree Boonpatcharanon¹, Jane M Heffernan²,³, Hanna Jankowski²,³.

Abstract

We compare several popular methods of estimating the basic reproduction number, R0, focusing on the early stages of an epidemic, and assuming weekly reports of new infecteds. We study the situation when data is generated by one of three standard epidemiological compartmental models: SIR, SEIR, and SEAIR; and examine the sensitivity of the estimators to the model structure. As some methods are developed assuming specific epidemiological models, our work adds a study of their performance in both well-specified (data generating model and method model are the same) and misspecified (data generating model and method model differ) settings. We also study R0 estimation using Canadian COVID-19 case report data. In this study we focus on examples of influenza and COVID-19, though the general approach is easily extendable to other scenarios. Our simulation study reveals that some estimation methods tend to work better than others; however, no single best method was clearly detected. In the discussion, we provide recommendations for practitioners based on our results.

Year: 2022; PMID: 35714080; PMCID: PMC9205483; DOI: 10.1371/journal.pone.0269306

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.752

Introduction

The basic reproduction number, R0 (also called the basic reproductive ratio), is defined as the expected number of new infections produced by a single (typical/average) infectious individual, when introduced into a totally susceptible population. R0 is used in epidemiological studies of infectious diseases to gauge how contagious/transmissible an infectious disease is: if R0 < 1, the disease will die out, and if R0 > 1 infection can increase in the population. It is also used to determine how effective vaccination or other disease mitigation strategies need to be in order to protect populations from infection. At the outset of an infectious disease outbreak, an immediate goal is to determine R0, so that public health and healthcare decision makers can be informed. For example, at the debut of the COVID-19 pandemic, reports of R0 estimates were plentiful (see e.g. [1-6]). In the recent MERS-CoV, 2009 H1N1, and 2003 SARS epidemics, there were also numerous studies of R0 globally (see [7-19] for a small snapshot). There are many statistical and mathematical methods that can be used to estimate R0 [20-29]. A main difficulty in R0 estimation is that the methods often depend on data that is not available, or that suffers from collection, reporting, or other biases. Different estimators utilize different approaches to deal with these difficulties. Broadly speaking, estimators can be classified as real-time (requiring little computation time) and non-real-time (requiring more extensive computation). Real-time estimators typically rely on simple epidemic models (such as the Susceptible-Infectious-Recovered, a.k.a. SIR, compartmental modelling framework) and/or simplifications of models in an attempt to remove dependence on unobservables. Non-real-time methods generally handle unobservables via Bayesian or Monte Carlo approaches, at the cost of computing time.
Often, real-time methods also assume some prior knowledge of other parameters, such as the serial interval (SI). It is therefore important to study the effects of misspecification of either the modelling framework or the input parameters on these estimators. For example, suppose an R0 estimator has been constructed to work within a SIR disease modelling framework. Infectious diseases, however, can include periods of infection that are not infectious. The infectious period can also be split into various stages of asymptomatic and symptomatic infection, which ultimately affect the case reporting rate to public health. Therefore, methods that are based on the SIR modelling framework can produce erroneous estimates of R0, and differences in R0 estimates may simply reflect poor estimator structure or application to data for which the model is misspecified. A recent study [27] has discussed several nuances of different estimator methods that can affect R0 estimates, but the effect of misspecification is only touched on briefly there. In this work, we compare six different estimators of R0: four real-time estimators and two estimators which require longer computation times. The four real-time estimators are based on an SIR or similar framework, while the two other estimators can be tuned to extensions of the SIR model. We then simulate data generated from one of three compartmental epidemiological models, the SIR, SEIR, and SEAIR models, which track susceptible (S), exposed (E), asymptomatically infectious (A), symptomatically infectious (I), and recovered (R) individuals in their modelling frameworks. We note that three of the real-time estimators assume that the serial interval is known, and therefore we also consider the situation when this serial interval is guessed incorrectly in these estimators. Our work thus studies the effect of compartmental model and/or serial interval misspecification on the real-time estimators.
Moreover, non-real-time methods require specification of the epidemiological model by the investigator, and our work studies the effect of compartmental model misspecification on these. The report of our findings is organized as follows. We first provide an introduction to three compartmental infectious disease models that we use to generate case data. Six R0 estimators are then introduced, including a discussion of their underlying compartmental model structure assumptions. We then apply each estimator to data generated from the three compartmental models, and to Canadian COVID-19 data for the provinces of British Columbia, Ontario, Quebec, and also for the country as a whole. Early epidemic dynamics are discussed using the inflection point (or turning point) in the epidemic growth curve, the point at which the curvature in the epidemic growth curve changes; early time points are those before this point. We employ parameter values representative of respiratory virus epidemics, and in particular, influenza and COVID-19 [30-35]. We note that while daily data may sometimes be available during an infectious disease outbreak, it may not be complete and can include a reporting delay. We thus have chosen to use weekly case reports. Weekly case report data is also typical of outbreaks of influenza, a respiratory virus, and a chosen pathogen of study.

Methods

Epidemiological models

We focus on three compartmental epidemiological models that form the basis of all infectious disease models [36-38]:

SIR: Susceptible–Infectious–Recovered
SEIR: Susceptible–Exposed–Infectious–Recovered
SEAIR: Susceptible–Exposed–(Asymptomatic Infectious)–(Symptomatic Infectious)–Recovered

The models are each composed of three to five compartments (with labels matching the model name). Individuals transition from one compartment to the next after random waiting times with pre-specified distributions. Here, we assume that these distributions are exponential, so that the mean dynamics are described by systems of ordinary differential equations (ODEs). We use the notation θ = (β, σ, ρ, γ) to denote the vector of parameters for the models; see Table 1 for details. The ODE systems for all three are provided in the S1 Appendix, as well as their corresponding flow diagrams. All models are considered without inclusion of demography, i.e. birth and death. The total population is fixed throughout the simulation and denoted by N, with initial values of S0 and I0 for the S and I populations, respectively, and all other compartments initially empty. Therefore, for all three models N is equal to S0 + I0, and this is approximately equal to S0 since S0 ≫ I0. For the SIR model, for all t ≥ 0 it also holds that S(t) + I(t) + R(t) = N. Similarly, S(t) + E(t) + I(t) + R(t) = N for the SEIR model, and S(t) + E(t) + A(t) + I(t) + R(t) = N for the SEAIR model.
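To make the exponential-transition assumption concrete, the following is a minimal sketch of a stochastic SIR simulation that returns weekly counts of new infections. It uses the standard Gillespie algorithm at the population level, which is an assumption made here for brevity; it is not the authors' agent-based C++ implementation, and the function name, the per-day rate units, and the default arguments are all illustrative.

```python
import random


def simulate_sir_weekly(beta, gamma, s0=10_000, i0=1, weeks=15, seed=0):
    """Gillespie simulation of a stochastic SIR model (rates per day).

    Returns a list with the number of new infections in each week.
    """
    rng = random.Random(seed)
    n = s0 + i0                      # fixed total population, no demography
    s, i = s0, i0
    t = 0.0
    horizon = 7.0 * weeks
    weekly = [0] * weeks
    while i > 0:
        rate_inf = beta * s * i / n  # S -> I events
        rate_rec = gamma * i         # I -> R events
        total = rate_inf + rate_rec
        t += rng.expovariate(total)  # exponential waiting time to next event
        if t >= horizon:
            break
        if rng.random() < rate_inf / total:
            s -= 1
            i += 1
            weekly[int(t // 7)] += 1
        else:
            i -= 1
    return weekly


if __name__ == "__main__":
    # Hypothetical run with beta = 1/3 and gamma = 1/5 per day.
    print(simulate_sir_weekly(1 / 3, 1 / 5, seed=1))
```

Because a single initial case can recover before infecting anyone, some seeds give an outbreak that dies out almost immediately; this is expected behaviour for a stochastic model started from I0 = 1.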

Table 1

SIR, SEIR, SEAIR model parameters and values, R0, serial interval.

(a) Model contact rate notation
model	β	σ	ρ	γ
SIR	S → I: βI(t)/N			I → R: γ
SEIR	S → E: βI(t)/N	E → I: σ		I → R: γ
SEAIR	S → E: βI(t)/N	E → A: σ	A → I: ρ	I → R: γ
(b) Model parameters, R₀, and serial interval
model	θ	R₀ = R₀(θ)	serial interval
SIR	(β, γ)	β/γ	1/γ
SEIR	(β, σ, γ)	β/γ	1/γ + 1/σ
SEAIR	(β, σ, ρ, γ)	β/γ + β/ρ	1/γ + 1/σ
(c) Parameter values θ for simulations (rates per day, listed in the order given in (b))
model	influenza 1	influenza 2	COVID-19
SIR	(1/3, 1/5)	(1/3, 1/5)	(1/2, 5/26)
SEIR	(1/3, 1/3, 1/5)	(5/9, 1/2, 1/3)	(13/11, 1/3, 5/11)
SEAIR	(1/3, 1/3, 1/2, 1/5)	(5/12, 1/2, 1, 1/3)	(26/57, 1/3, 2/7, 5/11)

Data is generated using the SIR, SEIR, and SEAIR compartmental model structures using a stochastic agent-based modelling framework implemented in C++. The simulations progress at the level of individual hosts in the applicable model disease status compartments. The simulation moves forward using “event times” that are assigned to each infected individual in the population and are determined by the characteristics of the compartment to which the individual currently belongs. Such event times correspond to infection events, when an infected individual transmits the infection to a susceptible, and times at which infected individuals progress to the next stage of infection or recover. The C++ model is based on previous work [39, 40]. Again, we note that all event times are assumed to be exponentially distributed with mean 1/ξ, where ξ refers to the model parameter associated with the same transition in the system of ordinary differential equations (see Table 1). We conduct 1000 agent-based model simulations for each of the SIR, SEIR, and SEAIR frameworks with parameters as given in Table 1. Model parameters were taken from the literature, and are representative of pandemic influenza (R0 ∈ [1.2, 7], serial interval ∈ [1.5, 9.5]) and COVID-19 (R0 ∈ [1.6, 3.4], serial interval ∈ [4.2, 7.5]) [3, 30–32]. The first influenza (influenza 1 in Table 1) example parameters are such that R0 = 5/3 for SIR and SEIR and R0 = 7/3 for SEAIR. For this example, the serial interval is 5 days for the SIR model and 8 days for the SEIR and SEAIR models. The second influenza (influenza 2) example parameters are such that R0 = 5/3 and the serial interval is 5 days for each of the SIR, SEIR, and SEAIR models. The COVID-19 parameters are such that R0 = 2.6 and the serial interval is 5.2 days, again, for all models. The incubation period in the SEAIR COVID-19 model has a mean of 6.5 days [32]. For each epidemic, the population size N is set to 10,001, where S(0) = 10,000 and I(0) = 1.
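The R0 and serial interval values quoted above can be checked directly against the formulas in Table 1(b). The short sketch below does this with exact fractions. It assumes that each parameter vector in Table 1(c) is ordered as (β, σ, ρ, γ), with the entries a model does not use dropped; this ordering is an assumption, chosen because it reproduces the values stated in the text. The helper names are illustrative.

```python
from fractions import Fraction as F


def r0(beta, gamma, rho=None):
    """R0 = beta/gamma, plus beta/rho when an asymptomatic infectious stage exists."""
    return beta / gamma + (beta / rho if rho is not None else 0)


def serial_interval(gamma, sigma=None):
    """Mean serial interval: 1/gamma, plus 1/sigma when an exposed stage exists."""
    return 1 / gamma + (1 / sigma if sigma is not None else 0)


# Scenario -> model -> named rates (per day), read off Table 1(c)
# under the assumed ordering (beta, sigma, rho, gamma).
SCENARIOS = {
    "influenza 1": {
        "SIR": dict(beta=F(1, 3), gamma=F(1, 5)),
        "SEIR": dict(beta=F(1, 3), sigma=F(1, 3), gamma=F(1, 5)),
        "SEAIR": dict(beta=F(1, 3), sigma=F(1, 3), rho=F(1, 2), gamma=F(1, 5)),
    },
    "influenza 2": {
        "SIR": dict(beta=F(1, 3), gamma=F(1, 5)),
        "SEIR": dict(beta=F(5, 9), sigma=F(1, 2), gamma=F(1, 3)),
        "SEAIR": dict(beta=F(5, 12), sigma=F(1, 2), rho=F(1), gamma=F(1, 3)),
    },
    "COVID-19": {
        "SIR": dict(beta=F(1, 2), gamma=F(5, 26)),
        "SEIR": dict(beta=F(13, 11), sigma=F(1, 3), gamma=F(5, 11)),
        "SEAIR": dict(beta=F(26, 57), sigma=F(1, 3), rho=F(2, 7), gamma=F(5, 11)),
    },
}


def summarize(params):
    """Return (R0, serial interval in days) for one parameter set."""
    return (
        r0(params["beta"], params["gamma"], params.get("rho")),
        serial_interval(params["gamma"], params.get("sigma")),
    )


for scenario, models in SCENARIOS.items():
    for model, params in models.items():
        value, interval = summarize(params)
        print(f"{scenario:12s} {model:6s} R0 = {value}  serial interval = {interval} days")
```

Under this ordering, every COVID-19 parameter set gives R0 = 13/5 = 2.6 and a serial interval of 26/5 = 5.2 days, and the influenza 1 SEAIR set gives R0 = 7/3 with a serial interval of 8 days, matching the text.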

R0 and the serial distribution

The serial distribution is the distribution of the time from when an infected individual (the infector) becomes symptomatic to when a person infected by the infector (the infectee) becomes symptomatic. For the SIR model, this is the same as the time spent in the I compartment, and in particular, the serial distribution is exponential with mean 1/γ when exponential distributions are assumed throughout the model [41]. We summarize the serial intervals for our models in Table 1 [41]. In the literature, the serial distribution may also be referred to as the serial interval, although this most often refers to the mean of the serial distribution, or alternatively, a range indicating highly likely values from the serial distribution. Here, we will use the convention that the serial interval refers to the mean of the serial distribution. For diseases such as influenza, it may be reasonable to assume that the serial distribution is known a priori. For other situations, such as newly emerging diseases, such assumptions are less valid.

Methods for estimating R0

Many methods exist to estimate R0. We refer to [29] for a recent review. If the transition rates in the compartmental models are known, then R0 can be easily calculated using the formulas listed in Table 1. However, full transition rates are generally not known in practice, and hence statistical estimation methods are required. The main difficulty in estimation is that complete data is unavailable for the full epidemiological model. Here, we consider six different methods of estimating R0. For simplicity, we name the methods WP, seqB, ID, IDEA, plug-n-play, and fullBayes in this work. A summary of the methods and their key properties is given in Table 2 for reference.

Table 2

Summary of estimation methods for R0.

method	summary
WP	White & Pagano Method, due to [42]. Serial distribution can be assumed known or can be estimated using MLE; method developed under branching process model; simple method which yields real-time estimates (when serial interval is unknown the method takes longer to compute).
seqB	Sequential Bayes Method, due to [43]. Serial distribution assumed known (only the mean is used); method developed assuming SIR model and uses sequential Bayes methods; simple method which yields real-time estimates.
ID	Incidence Decay Method (see [44]). Serial distribution assumed known (only the mean is used); method developed assuming an SIR model structure and uses least squares estimation. It is a simple method which yields real-time estimates.
IDEA	The Incidence Decay and Exponential Adjustment Method is presented in [44]. Serial distribution assumed known (only the mean is used); method developed assuming SIR model and uses least squares estimation; simple method which yields real-time estimates. IDEA uses a slightly more complex model for fitting than ID.
plug-n-play	Plug-and-Play Method. See [45]. Serial distribution assumed unknown; method selects one of SIR/SEIR/SEAIR model; implementations available though not real-time (depending on input selection). Generally, this approach fits the complete model using maximum likelihood, relying on Monte Carlo to fill in missing observations. The R package, called POMP, is quite technical and can be difficult to implement [45].
fullBayes	Full Bayes Method. See [46]. Serial distribution assumed unknown; method selects one of SIR/SEIR/SEAIR model; not real-time. This approach fits the complete model using Bayesian priors, relying on MCMC to fill in missing observations. Can be quite technical in implementation.

The first four (WP, seqB, ID, and IDEA) are real-time methods based on simplifications of the full ODE epidemiological models. This simplification is necessitated by the fact that the full data is unobservable. In these methods, estimation of R0 is coupled with either estimation or prior knowledge of the serial distribution. The two latter methods (plug-n-play and fullBayes) do not simplify the full epidemic models, but handle the issue of unobservable data by Monte Carlo simulation (plug-n-play method) or Bayesian priors with MCMC used to handle estimation due to model complexity (fullBayes method). As such, these methods are more computationally intensive. These two methods estimate the unknown transition rate parameter vector θ in the epidemic model. They do not require prior knowledge of any parameter values, including the serial distribution. Indeed, since the methods result in estimates of θ, these can then in turn be used to derive an estimate of the serial distribution. The methods do, however, assume that the form of the epidemic model is specified in advance, in the sense that the user can decide whether the SIR, SEIR, or the SEAIR model is more appropriate for the particular disease. In contrast, the WP, seqB, ID, and IDEA methods all rely on simplifications, and are not able to allow for such tailoring. Although the plug-n-play and fullBayes methods are more computationally intensive and not considered “real-time”, we note that modern day access to computational power is blurring this line of distinction. Our implementations of fullBayes and plug-n-play were done on a non-specialized desktop computer and without special consideration to computing time in the implementations. The time required to obtain the estimates was less than two minutes in both cases, and we do not consider this to be prohibitive. Furthermore, more careful programming could yield even faster estimates. A more detailed discussion is available in the Computational Time section.

WP: Maximum likelihood estimation of a branching model

[42] developed a straightforward estimation method whereby either the serial distribution is known, or the serial distribution is estimated along with R0. The method assumes that only the number of infectious individuals at discrete time points (e.g. daily or weekly) is observable, and both approaches (serial distribution known and unknown) use maximum likelihood. Recall that I(t) denotes the number of infecteds (i.e. the individuals in compartment I) at time t. Using our notation, and assuming that the times $t_0 = 0, t_1, t_2, \ldots, t_k$ are integers which count, for example, the number of days or weeks since the beginning of the pandemic (time zero), [42] obtain the log-likelihood

$$\ell(R_0, p) = \sum_{i=1}^{k} \left\{ I(t_i) \log \mu(t_i) - \mu(t_i) - \log I(t_i)! \right\}, \quad \text{where} \quad \mu(t_i) = R_0 \sum_{j=1}^{\min(\kappa, i)} I(t_{i-j})\, p(t_j),$$

and p is a vector denoting the (discrete and finite) serial distribution on $t_1, \ldots, t_\kappa$. That is, if Y is the random variable representing the serial distribution then $p(t_j) = P(t_{j-1} \le Y < t_j)/P(Y \le t_\kappa)$. If p is known (notably, this includes knowing the value of $t_\kappa$, which describes the support of p) then the maximum likelihood estimate of R0 is straightforward to compute. In the SIR model with exponential transitions, $p(t_j)$ is a truncated geometric distribution. If p is unknown, then [42] recommend discretizing a gamma distribution to simplify estimation. Other models (SEIR and SEAIR) do not have simple closed form expressions for $p(t_j)$ (see [41]). We found that for coarse data (e.g. weekly) the discretization and mean dominate the values of p more so than the actual distribution chosen. The WP method assumes an underlying branching process, which is none of the SIR/SEIR/SEAIR models from which our data sets are generated. This model assumes, in particular, that the population size “available” to be infected remains constant throughout, which does not hold for our simulated ODE models. As such, estimates should only be considered early on in the epidemic.
In our simulations presented below, we highlight the inflection point of each epidemic, and the WP method should only be considered valid before this time. The method has been implemented in [47]; see also [48] for details on the R package called R0. In our simulations, we found this implementation to have some numerical instability issues, which are most likely caused by the particular parameters of our simulated data sets. This instability was particularly pronounced when p was assumed unknown, and most often the algorithm would not yield a solution. For this reason, we programmed our own implementation, for which we used a simple grid search. The built-in alternative optimization function in R uses the bisection method, and was very sensitive to the starting value (a small change in the starting value could change the R0 estimate by a factor on the order of a thousand). In comparison, the grid search approach performed better, although it was still not ideal. The likelihood surface is very flat, which resulted in a non-unique MLE (we report only a default value). This property of the likelihood surface is most likely what also causes the issues we observed for our data in the implementation of the R0 R package [48]. Furthermore, note that the log-likelihood assumes that the serial distribution is discrete, and that this discretization matches the observed data. That is, if data is observed weekly, the serial distribution is only known on a weekly timescale. This discretization can affect the serial distribution considerably, particularly if the timescale is quite coarse.
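When the serial distribution p is known, the White and Pagano likelihood has Poisson means that are linear in R0, so setting the derivative of the log-likelihood to zero gives the maximizer in closed form: the total number of observed cases divided by the total serial-weighted number of earlier cases. The sketch below illustrates this together with the discretization of an exponential serial distribution onto the observation grid. It is a minimal illustration under these assumptions, not the authors' grid-search implementation or the code of the R0 package, and the function names are hypothetical.

```python
import math


def discretized_exponential(mean, step, k):
    """Discretize an exponential serial distribution onto t_j = j * step.

    Returns p_j = P(t_{j-1} <= Y < t_j) / P(Y <= t_k) for j = 1, ..., k.
    """
    def cdf(t):
        return 1.0 - math.exp(-t / mean)

    total = cdf(k * step)
    return [(cdf(j * step) - cdf((j - 1) * step)) / total for j in range(1, k + 1)]


def wp_known_serial(cases, p):
    """Closed-form MLE of R0 when the discretized serial distribution p is known.

    cases[0] is the count at time zero; cases[i] is the count at the i-th
    observation time. The Poisson mean at time i is R0 * c_i, with
    c_i = sum_j p[j] * cases[i - j], so the MLE is sum(cases[1:]) / sum(c_i).
    """
    denom = 0.0
    for i in range(1, len(cases)):
        upper = min(len(p), i)
        denom += sum(p[j - 1] * cases[i - j] for j in range(1, upper + 1))
    if denom == 0.0:
        raise ValueError("no earlier cases carry serial-distribution weight")
    return sum(cases[1:]) / denom


if __name__ == "__main__":
    # Weekly grid, exponential serial distribution with a 5-day mean,
    # truncated to three weeks (all values are illustrative).
    p = discretized_exponential(mean=5.0, step=7.0, k=3)
    print(p)
    print(wp_known_serial([1, 3, 6, 14, 30], p))
```

With all serial mass on a single step and counts that double at every step, the estimate is exactly 2, which is a convenient sanity check.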

seqB: Sequential Bayes estimation using an SIR approximation

[43] developed a Bayesian approach used to estimate R0. As above, it is assumed that infectious counts are observed at periodic times such as days or weeks. The basic idea is to start with a mildly informative prior on R0 and then update sequentially. The approach is based on the SIR model, and assumes that the mean of the serial distribution is known (under the SIR model, this is equivalent to knowing the parameter γ, which is the inverse of the mean of the serial distribution). [43] note that under the SIR model, and considering the time interval $t_{i+1} - t_i$,

$$I(t_{i+1}) \approx I(t_i) \exp\{(t_{i+1} - t_i)\,\gamma\,(R - 1)\},$$

where $R = R_0 S(t_i)/N \approx R_0$ at the beginning of an infection. Using this result, seqB assumes that the conditional distribution of $I(t_{i+1})$, conditional on $I(t_i)$ and R0, is Poisson with mean $\lambda = I(t_i) \exp\{(t_{i+1} - t_i)\,\gamma\,(R_0 - 1)\}$. In the approach, γ is known, and a prior is placed on R0. With N0 also assumed known, posterior estimates are found using a hierarchical or sequential Bayes approach. Note that the method cannot handle data sets where there are no new infections observed in some time interval $t_{i+1} - t_i$ (as this results in a Poisson mean of zero). Therefore, the times at which infectious counts are observed must be sufficiently coarse so that all counts are non-zero (e.g. weeks instead of days). The method would also be inappropriate for situations where long intervals between cases are observed in the initial stages of the epidemic. This was observed, for example, in Canada for the first cases of COVID-19. Although the above development is based on the SIR model, the resulting approximation behaves similarly to a branching process, much like the WP method. We therefore again consider this estimator valid only in the early stages, which for our simulations translates to times prior to the inflection points of the epidemic.
The posterior distribution of R0 will have the same support as the prior, and placing a discretized prior on R0 makes computations relatively straightforward, since the normalizing constant of the posterior is easy to compute. In the R implementation in [48], called R0, the initial prior on R0 is assumed to be uninformative. Their package focuses on the posterior mode, and much like their implementation of the WP method, uses a discretized version of the serial distribution (which could affect the input value of γ). We again chose to use our own implementation, and report the posterior mean, which minimizes the Bayes risk under squared error loss.
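The sequential update can be sketched on a discretized grid of R0 values. The example below assumes a flat prior on a grid over (0, 10), equally spaced observation times, and the Poisson model with mean I(t_i) exp{Δt γ (R0 − 1)} described above; the grid bounds, grid size, and function name are illustrative choices, not those of the authors or of the R0 package. Each Bayes update multiplies the current posterior by the next likelihood term, so accumulating log-likelihood terms from the flat prior gives the same final posterior as updating one interval at a time.

```python
import math


def seqb_posterior(cases, gamma, dt=7.0, r_max=10.0, n_grid=1000):
    """Grid-based sequential Bayes posterior for R0.

    cases: counts at equally spaced times (all must be positive).
    gamma: inverse of the mean serial interval (per day).
    dt:    spacing between observations (days).
    Returns (grid, posterior probabilities, posterior mean).
    """
    if any(c <= 0 for c in cases):
        raise ValueError("seqB needs strictly positive counts in every interval")
    grid = [r_max * (g + 0.5) / n_grid for g in range(n_grid)]
    log_post = [0.0] * n_grid  # flat (uninformative) discretized prior
    for prev, cur in zip(cases[:-1], cases[1:]):
        for g, r in enumerate(grid):
            lam = prev * math.exp(dt * gamma * (r - 1.0))  # Poisson mean
            log_post[g] += cur * math.log(lam) - lam - math.lgamma(cur + 1)
    top = max(log_post)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    posterior = [w / total for w in weights]
    mean = sum(r * w for r, w in zip(grid, posterior))
    return grid, posterior, mean


if __name__ == "__main__":
    # Illustrative weekly counts that roughly double each week.
    _, _, post_mean = seqb_posterior([10, 20, 41, 82, 165], gamma=0.2)
    print(round(post_mean, 3))
```

The explicit check for zero counts mirrors the limitation noted in the text: a zero count in one interval gives a Poisson mean of zero for the next.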

ID and IDEA: Least squares estimation using incidence decay approximations

[44] introduced two simplified models describing the relationship between R0 and other epidemic parameters in the SIR model. The first of these is the incidence decay (ID) model, where

$$I(s) = R_0^{s}. \tag{1}$$

In the model, time s is measured in units re-scaled based on the serial distribution. Recall that under the SIR model the serial distribution is exponential with mean 1/γ. We then have in (1) that $s = \gamma t$, where t is time in the original units. As (1) is only valid for a short (and unknown) period of time, [44] proposed a second alternative formulation, where a decay factor d was introduced in order to reflect the often observed outbreak decline. In the incidence decay and exponential adjustment (IDEA) model, the relationship becomes instead

$$I(s) = \left( \frac{R_0}{(1+d)^{s}} \right)^{s}. \tag{2}$$

Under the ID model, we can solve (1) to obtain $R_0 = I(s)^{1/s}$. Of course, this relationship is not valid for real data across all values of s, as $I(s)$ is stochastic. To obtain an estimate of R0, least squares is a natural option, and hence the ID estimator is the minimizer of

$$\sum_{i=1}^{k} \left( \log R_0 - \frac{\log I(s_i)}{s_i} \right)^{2},$$

where $s_1, \ldots, s_k$ are the re-scaled observation times, which yields

$$\hat{R}_0^{\mathrm{ID}} = \exp\left\{ \frac{1}{k} \sum_{i=1}^{k} \frac{\log I(s_i)}{s_i} \right\}.$$

As noted above, the number of infectious people increases rapidly at the beginning of an outbreak, so a method based on (1) is expected to underestimate R0. The IDEA model was introduced to overcome this issue. As in the ID model, we solve (2) and use least squares estimation to obtain its estimate. The IDEA estimator is then defined as the minimizer of

$$\sum_{i=1}^{k} \left( \log R_0 - s_i \log(1+d) - \frac{\log I(s_i)}{s_i} \right)^{2}.$$

Unlike in the ID model, we also need to minimize over d to solve the optimization problem, and hence we require k ≥ 2. Minimizing, we obtain

$$\hat{R}_0^{\mathrm{IDEA}} = \exp\left\{ \frac{\sum_{i=1}^{k} s_i^{2} \sum_{i=1}^{k} \frac{\log I(s_i)}{s_i} - \sum_{i=1}^{k} s_i \sum_{i=1}^{k} \log I(s_i)}{k \sum_{i=1}^{k} s_i^{2} - \left( \sum_{i=1}^{k} s_i \right)^{2}} \right\}.$$

Details of these calculations are given in the S1 Appendix. Note that the formula is not valid for k = 1, where the denominator vanishes. Both the ID and IDEA methods are straightforward and estimate R0 directly, as long as the mean of the serial distribution is known. Both models were built under the SIR assumption. In our simulations we examine the effect of misspecification of the underlying epidemic model.
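Both estimators reduce to a few sums, as the following sketch shows. It assumes the log-scale least squares form: taking logarithms of the ID relationship I(s) = R0^s and the IDEA relationship I(s) = (R0/(1+d)^s)^s and dividing by s gives log I(s)/s = log R0 − s log(1+d), with d = 0 for ID, so ID averages log I(s)/s and IDEA takes the intercept of a straight-line fit against s. It also assumes observations at times dt, 2·dt, … re-scaled by the mean serial interval. These conventions and the function names are assumptions made for illustration; the S1 Appendix should be consulted for the authors' exact derivation.

```python
import math


def rescaled_times(k, serial_mean, dt):
    """Observation times dt, 2*dt, ..., k*dt in units of the mean serial interval."""
    return [(i + 1) * dt / serial_mean for i in range(k)]


def id_estimate(cases, serial_mean, dt=7.0):
    """ID estimator: exp of the average of log I(s_i) / s_i."""
    s = rescaled_times(len(cases), serial_mean, dt)
    y = [math.log(c) / si for c, si in zip(cases, s)]
    return math.exp(sum(y) / len(y))


def idea_estimate(cases, serial_mean, dt=7.0):
    """IDEA estimator: intercept of the least squares line y_i = a + b * s_i,

    where y_i = log I(s_i) / s_i, a = log R0 and b = -log(1 + d).
    """
    k = len(cases)
    if k < 2:
        raise ValueError("IDEA needs at least two observations")
    s = rescaled_times(k, serial_mean, dt)
    logs = [math.log(c) for c in cases]
    sum_s = sum(s)
    sum_s2 = sum(si * si for si in s)
    sum_y = sum(lc / si for lc, si in zip(logs, s))
    sum_sy = sum(logs)  # s_i * y_i = log I(s_i)
    intercept = (sum_s2 * sum_y - sum_s * sum_sy) / (k * sum_s2 - sum_s ** 2)
    return math.exp(intercept)


if __name__ == "__main__":
    # Noise-free IDEA data with R0 = 2 and d = 0.1, one serial interval per step.
    data = [(2.0 / 1.1 ** s) ** s for s in (1, 2, 3, 4)]
    print(id_estimate(data, serial_mean=7.0), idea_estimate(data, serial_mean=7.0))
```

On noise-free data generated with a positive decay factor, IDEA recovers R0 while ID falls below it, which illustrates the underestimation discussed in the text.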

plug-n-play: Maximum likelihood using sequential Monte Carlo for partially observed epidemics

Maximum likelihood is one of the more popular approaches used to estimate unknown parameters in a statistical model. The general idea is to find the parameter set θ which maximizes the likelihood (probability model) evaluated at the observed data. The difficulty for our setting is that our compartmental models (see the discussion of the epidemiological models) rely on data which is unobservable. In particular, the models require that the exact times of infections are known while we observe only daily or weekly counts of infectious individuals. The WP method [42], which also uses maximum likelihood, gets around this issue by creating a simplified model with a likelihood which relies only on observable data. Another alternative, discussed in [49], is to maximize the full likelihood and fill in the unobservables using many Monte Carlo simulations in a way which matches the fixed observable data points. Such an approach is often referred to as “plug-n-play”. The plug-n-play inferential method of [49] is based on likelihood inference using sequential Monte Carlo of partially observed Markov processes (POMP), also known as hidden Markov models or state-space models. The plug-and-play terminology comes from the fact that inference is based on Monte Carlo simulations from the model and does not require explicit expressions of the transition probabilities, which can be quite complicated. The algorithm for this method has been implemented in the R package POMP [45]. This software package can be accessed from the comprehensive R archive network (CRAN), see also [50]. As mentioned previously, the basic idea is to generate complete epidemic data in a way which matches the observed weekly infectious observations. To simplify the implementation, complete continuous-time data is not generated but rather an approximation is generated with observations of all components at a discretized time-scale Δt (single value selected by the user). 
These discretized epidemics are generated using sequential Monte Carlo methods. An estimate of θ is then obtained via maximum likelihood using iterated filtering. The implementation in [50] allows for the selection of the model SIR, SEIR, or SEAIR. We refer to [49, 50] for additional details. The algorithm returns estimates of θ, as well as an estimate of R0 derived via the formula regardless of the epidemiological model. We refer to the estimate thus obtained as the plug-n-play estimator. R code detailing our simulations and choices of input values is provided as S1 File.
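The central ingredient of the plug-n-play approach is a sequential Monte Carlo (particle filter) estimate of the likelihood that only requires the ability to simulate from the model. The sketch below shows a bootstrap particle filter for a discretized stochastic SIR model observed through weekly counts. It is not the POMP package and does not implement iterated filtering; the chain-binomial transition, the Poisson measurement model with a small offset, the population size, and all names are assumptions made for illustration.

```python
import math
import random


def binomial(rng, n, p):
    """Draw a Binomial(n, p) variate as a sum of Bernoulli trials (fine for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def sir_loglik(cases, beta, gamma, n_pop=500, i0=1, particles=100,
               steps_per_week=7, seed=1):
    """Bootstrap particle filter estimate of the log-likelihood of weekly counts."""
    rng = random.Random(seed)
    dt = 7.0 / steps_per_week
    state = [(n_pop - i0, i0)] * particles  # each particle is an (S, I) pair
    p_rec = 1.0 - math.exp(-gamma * dt)
    loglik = 0.0
    for y in cases:
        proposals, log_w = [], []
        for s, i in state:
            new_week = 0
            for _ in range(steps_per_week):
                # Discretized (chain-binomial) approximation with time step dt.
                p_inf = 1.0 - math.exp(-beta * i * dt / n_pop)
                inf = binomial(rng, s, p_inf)
                rec = binomial(rng, i, p_rec)
                s, i = s - inf, i + inf - rec
                new_week += inf
            # Assumed measurement model: Poisson around the simulated weekly
            # incidence; the 0.5 offset keeps the mean strictly positive.
            mean = new_week + 0.5
            log_w.append(y * math.log(mean) - mean - math.lgamma(y + 1))
            proposals.append((s, i))
        top = max(log_w)
        w = [math.exp(v - top) for v in log_w]
        loglik += top + math.log(sum(w) / particles)  # log of the average weight
        state = rng.choices(proposals, weights=w, k=particles)  # resample
    return loglik


if __name__ == "__main__":
    weekly = [2, 5, 11, 20]  # illustrative weekly counts
    for b in (0.2, 0.4, 0.8):
        print(b, round(sir_loglik(weekly, beta=b, gamma=0.2), 2))
```

A maximum likelihood estimate of θ would maximize this noisy log-likelihood over the parameters. A crude grid over β, as in the usage lines, stands in here for iterated filtering, which perturbs the parameters inside the filter instead.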

fullBayes: Bayesian inference for partially observed epidemics

Similar to the plug-n-play approach of the previous section, this is a simulation approach in which the incomplete observed data is replaced with complete data via simulations. The main difference is that the complete data is generated by placing a prior on its distribution in a Bayesian inferential approach. Some examples of epidemiological inference under the Bayesian paradigm are described in [46].

In order to describe the method we first need to introduce some additional notation. We do this for the SEAIR model, as all other models are simplifications of this case. Recall that we have observed infection counts $I(t_1), \ldots, I(t_k)$ at times $t_1, \ldots, t_k$. Let m denote the vector with jth element $m_j$ given by the cumulative sum of the observed counts up to time $t_j$. As such, m describes the entirety of the observed data. For a time interval [0, T] the complete epidemic includes much more information, namely the individual times of exposure and the individual times of transition into the asymptomatic, infectious, and recovered states. We assume that $m_0 = 1$. We also assume that all people who are infected in week j will recover in week j + 1. Furthermore, we assume that the number of exposed and asymptomatic people in week j is also equal to $m_j - m_{j-1}$. We let τ denote the epidemic path which contains all of this information. As in [46], the first infection is treated separately as a parameter of the model. Hence a prior is placed on this variable. Recall that θ denotes the vector of compartmental model parameters; see Table 1(b). An independent prior π(θ) is also placed on θ, and samples from the joint posterior distribution of θ, τ, and the first infection variable are obtained. The marginal distribution of θ under this joint posterior is π(θ|m), which is the posterior distribution of θ given the observable data, and the distribution we are interested in.

We now calculate the likelihood for the SEAIR model. The joint prior distribution of the unknown rate parameters θ is made up of independent gamma distributions given by Γ(α, k) with mean k/α. We assume that α is the same for the parameters β, σ, ρ, γ, while k varies and if appropriate will be denoted by $k_\beta, k_\sigma, k_\rho, k_\gamma$. In the simulations we take α = 1 and $k_\beta = k_\sigma = 3$, $k_\rho = 2$, $k_\gamma = 5$. The prior distribution on the first infection variable is exponential with rate one, and this is independent from the θ vector. Calculations given in the S1 Appendix give the posterior marginal distributions for the first infection variable and for θ, all of which are gamma distributions with closed form expressions for the parameters. Some sensitivity analysis to the prior distributions was conducted (see S1 Appendix), and changing the prior did not visibly affect the results.

The general approach we take is now described using the following steps.

1. Use Markov chain Monte Carlo (MCMC) to simulate from the joint posterior distribution of θ, τ, and the first infection variable given m.
2. From Step 1, we obtain a sequence of samples, indexed by l = 1, …, b + B, from this posterior distribution. Here, b denotes the burn-in period for the MCMC results, and B denotes the number of MCMC samples collected.
3. To obtain an estimate of θ from the samples l = b + 1, …, b + B, one option is to simply average the sampled values of θ. Instead, we treat each sampled epidemic path as a sample from the full posterior model, and calculate the posterior mean of θ given that path using the formulas given in the S1 Appendix. Average the posterior means to obtain an estimate of θ.
4. The final reported estimate of R0 is obtained from the estimate of θ in Step 3 using the appropriate formula in Table 1.

In our simulations, we take b = 100 and B = 1000, and refer to the estimator as fullBayes. The MCMC algorithm we use is Metropolis-within-Gibbs. Namely, there are three main components to the posterior distribution: θ, τ, and the first infection variable. In the S1 Appendix, the posterior distributions for θ and for the first infection variable are obtained in closed form. Given one observation of the three components, the algorithm generates the next observation as follows.

1. Sample the first infection variable from its conditional posterior distribution.
2. Sample θ from its conditional posterior distribution.
3. Sample τ using a Metropolis step:
   (a) Propose a new path τ′: for each i = 1, …, k, the exposure, asymptomatic, infectious, and recovery times of the individuals j indexed by the cumulative counts m for that week are each IID uniformly distributed on an interval between observation times.
   (b) Accept the proposal with probability min{1, α}, where α is the Metropolis–Hastings acceptance ratio. This reduces to the ratio of the posterior densities at the proposed and current paths, noting that with the proposal distribution in (a), we have that g(τ|τ′)/g(τ′|τ) = 1.

Details are provided in the S1 Appendix. The chain is initialized by sampling θ from its prior distribution.
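The closed-form gamma posteriors used in the Gibbs steps follow from conjugacy between gamma priors and exponential waiting times. The sketch below shows the textbook update for a single rate, using recovery as the example: with a Γ(α, k) prior (rate α, shape k, mean k/α) and n fully observed infectious periods with total duration T, the posterior is gamma with shape k + n and rate α + T. This is a generic illustration and an assumption about the form of the update; the exact expressions for the SEAIR model are in the S1 Appendix and are not reproduced here. The function names and the example durations are hypothetical.

```python
import random


def gamma_conjugate_update(shape, rate, durations):
    """Posterior (shape, rate) for an exponential rate with a gamma prior.

    durations: fully observed waiting times governed by the rate.
    """
    return shape + len(durations), rate + sum(durations)


def posterior_mean(shape, rate):
    """Mean of a gamma distribution parameterized by shape and rate."""
    return shape / rate


def draw_rate(shape, rate, rng):
    """One Gibbs draw of the rate; random.gammavariate takes (shape, scale)."""
    return rng.gammavariate(shape, 1.0 / rate)


if __name__ == "__main__":
    # Prior with shape k = 5 and rate alpha = 1, three infectious periods (days).
    shape, rate = gamma_conjugate_update(5, 1, [4.0, 6.0, 5.0])
    print(shape, rate, posterior_mean(shape, rate))
    print(draw_rate(shape, rate, random.Random(0)))
```

Averaging the closed-form posterior means over sampled epidemic paths, rather than averaging raw draws of the rate, is the idea behind Step 3 of the estimation procedure.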

Real world COVID-19 data

We consider an example for the COVID-19 pandemic in Canada. The first case of COVID-19 was recorded on January 25th, 2020 in Toronto, Ontario [51]. For the first few weeks, isolated cases arrived; however, strict contact tracing kept the epidemic from taking hold. We therefore do not consider the first four weeks of the pandemic timeline (there were very few cases, and most weeks had zero cases at this stage). In late February, the pandemic took hold and cases began to grow exponentially with community transmission [51]. Approximately one month after this, non-pharmaceutical measures were imposed and most provinces went into some form of lockdown. We therefore do not consider data from much beyond lockdown initiation, as these measures would decrease the transmission rate. We estimate R0 for all of Canada, and for the three most populous provinces, British Columbia (BC), Ontario, and Quebec. In Ontario, strict restrictions were imposed following March break (a one week school break during the winter), which fell around March 20th, 2020. In Quebec, lockdown was imposed around March 24th, and strict public measures were implemented around March 17th in BC. Epidemic data is provided from [52]. Public health mitigation data and dates are provided by [51].

Workflow

The goal of our study is to quantify R0 estimation in well-specified and misspecified settings, including misspecification of the model and serial distribution. For all models we therefore consider data coming from SIR, SEIR, and SEAIR epidemiological models, and the real-world COVID-19 pandemic in Canada. We study the R0 estimation methods as follows:

1. Using synthetic data provided by the SIR, SEIR, and SEAIR models, we apply the following methods for well-specified and misspecified settings:
   (a) WP method, assuming
       - the serial distribution (SD) is known and set to exponential with correct mean (5 days for influenza 1 and 2 and 5.2 days for COVID-19)
       - SD is known and set to exponential with incorrect mean (3 days for influenza 1, 2 and 7 days for influenza 2, and 4.2 and 7.5 days for COVID-19)
       - SD is unknown and estimated from a gamma distribution with unknown mean and variance (using a grid search algorithm)
   (b) seqB method, assuming
       - SD has the correct mean (5 days for influenza 1 and 2 and 5.2 days for COVID-19)
       - SD has an incorrect mean (3 days for influenza 1, 2 and 7 days for influenza 2, and 4.2 and 7.5 days for COVID-19)
   (c) ID and IDEA methods, assuming
       - SD has the correct mean (5 days for influenza 1 and 2 and 5.2 days for COVID-19)
       - SD has an incorrect mean (3 days for influenza 1, 2 and 7 days for influenza 2, and 4.2 and 7.5 days for COVID-19)
   (d) plug-n-play and fullBayes methods, developed assuming
       - SIR
       - SEIR (SEIR and SEAIR data only)
       - SEAIR (SEAIR data only)

   In these examples, the outbreaks are followed for 15 weeks, and this is the timeline given in our results. This timeline is presented only as a comparison to what is happening at the earliest stages. It also, however, improves the comparison between methods. Our comments below focus only on the time period before the inflection point (denoted as a vertical blue line for all methods).

2. Using real world data, we apply the WP, seqB, ID, and IDEA methods with known SI, using incorrect and true values for COVID-19. We then apply WP, fullBayes and plug-n-play. Estimates are generated using weeks 5 to 10 for Canada, BC, Ontario, and Quebec. The date that lockdown was implemented is indicated by a vertical line for all three provinces. No such line is given for all of Canada, as the measures were handled provincially and not nationally. When considering the results, recall that seqB and IDEA methods require at least two weeks of observations.

Results

Epidemic simulations

Fig 1 plots the number of individuals in compartment I for each model structure, and each parameter set. The grey lines plot the simulation outcomes while the black lines plot the mean of the simulation data. Although the complete epidemic path is simulated, we assume that only the weekly number of infectious people is actually available. The epidemics are followed for 15 weeks, which covers the first 100 days of an outbreak. Simulation data is recorded at every event time. Weekly data is extracted from each simulation and saved in a data file for use by all of the R0 estimators employed here. The blue vertical line indicates the point of inflection, where the concavity/curvature of the black line changes. The inflection points are at 7, 12, and 9 weeks for the influenza 1 parameter values, 6, 7, and 7 weeks for influenza 2, and 3, 5, and 6 weeks for COVID-19, for the SIR, SEIR, and SEAIR models, respectively. These points are used to determine appropriate time intervals for R0 estimation for each model, since R0 estimates are associated with early exponential growth and can be affected by decreases in the growth rate as the epidemic continues towards and past the point of inflection. Thus, “early in the epidemic” means prior to the point of inflection. In real data, this time point would be unknown. Code and files containing all results have been provided in the S1 File.
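
The simulate-then-extract procedure described above can be sketched as follows. This is a minimal illustration, not the code from the S1 File: it assumes a standard stochastic SIR model simulated event by event (Gillespie algorithm), with hypothetical per-day rates `beta` and `gamma`, a hypothetical population size, and a simple second-difference rule for locating the inflection point of a smooth mean curve.

```python
import random


def simulate_sir_weekly(beta, gamma, n_pop, i0=1, weeks=15, seed=0):
    """Event-by-event (Gillespie) stochastic SIR simulation, reported weekly.

    beta and gamma are per-day transmission and recovery rates (hypothetical).
    Returns the number of infectious individuals at the end of each week.
    """
    rng = random.Random(seed)
    s, i, t = n_pop - i0, i0, 0.0
    weekly = []
    next_report = 7.0
    while len(weekly) < weeks:
        rate_inf = beta * s * i / n_pop
        rate_rec = gamma * i
        total = rate_inf + rate_rec
        if total == 0.0:
            # No infectious individuals remain, so the count stays at zero.
            weekly.append(i)
            continue
        dt = rng.expovariate(total)
        # Record the current state at each report time passed before the next event.
        while len(weekly) < weeks and t + dt >= next_report:
            weekly.append(i)
            next_report += 7.0
        t += dt
        if rng.random() * total < rate_inf:
            s, i = s - 1, i + 1  # infection event
        else:
            i -= 1  # recovery event
    return weekly


def inflection_week(curve):
    """First week (1-indexed) at which the second difference stops being positive."""
    second = [curve[j + 2] - 2 * curve[j + 1] + curve[j] for j in range(len(curve) - 2)]
    for j in range(1, len(second)):
        if second[j - 1] > 0 and second[j] <= 0:
            return j + 2  # second[j] is centred on week j + 2
    return None


# Hypothetical parameter values, chosen only to produce an outbreak.
runs = [simulate_sir_weekly(0.4, 0.2, 1000, seed=k) for k in range(50)]
mean_curve = [sum(week) / len(runs) for week in zip(*runs)]
```

The second-difference rule is only meaningful on a smooth curve such as the mean over many runs; on a single noisy outbreak it would need smoothing first.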

Fig 1

The number of infectious individuals (y-axis) at time t in weeks (x-axis); from left to right: SIR, SEIR, and SEAIR; from top to bottom the examples are influenza 1, influenza 2, then COVID-19.

Individual simulated outbreaks from 1000 simulations are shown as grey lines, and their average is denoted as a black line. The blue vertical dashed lines show the inflection points for each model.

R0 estimates

Using synthetic data from the SIR, SEIR and SEAIR epidemiological models

We summarize our numerical results in plots comparing the average mean squared error (MSE), side-by-side boxplots, as well as tables reporting the median R0 estimates and their standard deviations. Again, these are all provided in a separate file as S1 File. In the main manuscript, we show only plots comparing the MSE of the various methods for the SIR data for the influenza 1 and 2 examples (Figs 2, 3 and 5), and SEAIR for the COVID-19 example (Figs 4 and 5). The MSE plots do not include the WP method where the serial distribution is estimated, as here the MSE was much too large to report. This can be ascertained from the Tables and the side-by-side boxplots provided in the Supplementary Material (in particular, see Tables 7, 12 and 17 in S1 File).
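
For readers who want to reproduce this kind of summary, the sketch below shows how bias, variance and MSE relate for a set of estimates from repeated simulations at one time point. The function name `error_summary` and the numbers are hypothetical; the only fact relied on is the standard identity MSE = bias² + variance (with the variance computed using divisor n).

```python
def error_summary(estimates, true_r0):
    """Bias, variance and MSE of repeated R0 estimates against a known true value."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_r0
    # Population-style variance (divisor n) so that mse == bias**2 + variance.
    variance = sum((e - mean_est) ** 2 for e in estimates) / n
    mse = sum((e - true_r0) ** 2 for e in estimates) / n
    return {"bias": bias, "variance": variance, "mse": mse}


# Hypothetical estimates from four simulated outbreaks with true R0 = 2.
summary = error_summary([1.8, 2.1, 2.4, 1.9], true_r0=2.0)
```

Reporting bias and variance alongside MSE lets a practitioner see whether a low MSE comes from a stable but shifted estimator or from an unbiased but noisy one.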

Fig 2

Influenza example 1 estimated MSE of R0 estimators assuming known serial interval (SI) with SIR data (week on x-axis).

The inflection point is indicated by the blue dashed vertical line.

Fig 3

Influenza example 2 estimated MSE of R0 estimators assuming known serial interval (SI) with SIR data (week on x-axis).

The inflection point is indicated by the blue dashed vertical line.

Fig 5

Estimated MSE of R0 estimators assuming unknown serial interval (SI) (week on x-axis).

For both influenza examples the data is SIR while for the COVID-19 example the data is SEAIR. The inflection point is indicated by the blue dashed vertical line.

Fig 4

COVID-19 estimated MSE of R0 estimators assuming known serial interval (SI) with SEAIR data (week on x-axis).

The inflection point is indicated by the blue dashed vertical line.

Figs 2 and 3 plot the MSE of the estimated R0 values relative to the true R0 value, for the WP, seqB, ID, and IDEA methods for the influenza 1 and 2 examples, using SIR data, and assuming a known serial interval. These plots provide examples of the well-specified and misspecified cases, using the true and misspecified values of the known serial interval. Of the methods presented in these plots, seqB performs best, followed by ID. When SEIR and SEAIR data are considered, all estimators have larger MSE. However, our conclusion does not change (see Sections 1–3 of the additional file included as S1 File) and seqB and ID still perform best. Finally, considering both bias and variance, as shown in the totality of boxplots and tables in the S1 File, our conclusion remains the same.

Fig 4 plots the MSE of the estimated R0 values relative to the true R0 value for the WP, seqB, ID and IDEA methods for the COVID-19 example, using SEAIR data. These plots provide examples of misspecification given an incorrect serial interval (serial intervals of 4.2 and 7.5 days are incorrect, and 5.2 days is the true value), and given misspecified data, where SEAIR data is used for these methods that relate best to the SIR model framework. Here, again, seqB performs best, followed by ID. This is also true when SIR and SEIR data are considered, and when considering bias and variance as presented in the totality of boxplots and tables in the S1 File.

We plot the MSE of R0 estimates calculated using the fullBayes and plug-n-play methods in Fig 5 for the influenza 1 and 2 examples using SIR data and SIR model structure, and for the COVID-19 example using SEAIR data, but with SIR, SEIR and SEAIR model structures. In all cases presented in this figure, we find that plug-n-play outperforms fullBayes. fullBayes performs well in the long term, but this is not our goal, since R0 estimates are needed early on in the epidemic. A review of all of the cases presented in the S1 File confirms our conclusion.

Computational time is a crucial factor as real-time estimates are desirable. Table 3 shows computational time for the SEIR model for a single data set, using a 1.60GHz/8GB RAM 64-bit operating system, x64-based processor. The results in this work are based on fullBayes with 1000 iterations and plug-n-play with 1000 particles and 10 IF iterations, where IF stands for the iterated filtering algorithm. The fullBayes method was implemented in R, and it is possible that faster implementations can be achieved using a different programming language. In comparison, the real-time methods (WP, seqB, ID, and IDEA) take less than one second each to compute.

Table 3

Computational time for the SEIR model for one data set (IF: Iterated filtering algorithm).

method	iterations	time
fullBayes	1000 iterations	8 minutes
fullBayes	3000 iterations	19.76 minutes
plug-n-play (1000 particles)	5 IF iterations	3.10 minutes
plug-n-play (1000 particles)	10 IF iterations	5.82 minutes
plug-n-play (1000 particles)	100 IF iterations	58.44 minutes
plug-n-play (1000 particles)	1000 IF iterations	9.77 hours
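
As a quick reading aid for Table 3, the sketch below converts each row to minutes per iteration. The timings are copied from the table; the variable names are illustrative. Under this arithmetic the cost per iteration is roughly constant within each method, which suggests that run time scales approximately linearly with the number of iterations on the hardware reported.

```python
# (method, iterations, total minutes), with values copied from Table 3.
timings = [
    ("fullBayes", 1000, 8.0),
    ("fullBayes", 3000, 19.76),
    ("plug-n-play", 5, 3.10),
    ("plug-n-play", 10, 5.82),
    ("plug-n-play", 100, 58.44),
    ("plug-n-play", 1000, 9.77 * 60.0),  # 9.77 hours converted to minutes
]

# Minutes per iteration for each row.
per_iteration = [(method, n, minutes / n) for method, n, minutes in timings]

for method, n, cost in per_iteration:
    print(f"{method:12s} {n:5d} iterations: {cost:.4f} min/iteration")
```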

Based on the estimator outcomes, our recommendations are as follows. When the serial interval is known, we recommend seqB and ID. We also recommend plug-n-play when the serial interval is known. When the serial interval is unknown, plug-n-play performs the best. Overall, we recommend that a suite of these estimators be used: plug-n-play, seqB, and ID. When the serial interval is unknown, a range of serial intervals can be provided to the seqB and ID methods to compare to the plug-n-play results. Practitioners, however, should consider their own preferences as to bias and variability of the estimators. We note here that as this study is focused on data observed weekly, our results may not be applicable to data observed, for example, daily, as the effect of the serial distribution on the results may be different. We also assumed that our data did not suffer from collection bias, under-reporting, or reporting delay. These issues are important, but beyond the scope of this work. However, it is our belief that weekly data, as considered here, is less sensitive to some of these issues than more fine-grained data.
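
One way to follow the suggestion above of supplying a range of serial intervals is sketched here. It assumes the incidence decay (ID) idea in its simplest form, where incidence at generation s is modelled as R0^s, and fits log incidence by least squares through the origin. This fitting choice, the function name `id_estimate`, and the weekly counts are illustrative assumptions, not the exact estimator or data used in this study.

```python
import math


def id_estimate(weekly_cases, serial_interval_days):
    """Least-squares fit of log(cases) = s * log(R0), with s in serial intervals.

    weekly_cases[k] is the count for week k + 1; weeks with zero cases are skipped.
    """
    numerator = 0.0
    denominator = 0.0
    for week, cases in enumerate(weekly_cases, start=1):
        if cases <= 0:
            continue  # log is undefined for zero counts
        s = 7.0 * week / serial_interval_days  # elapsed time in generations
        numerator += s * math.log(cases)
        denominator += s * s
    if denominator == 0.0:
        raise ValueError("need at least one week with a positive case count")
    return math.exp(numerator / denominator)


# Hypothetical weekly counts that double every week.
cases = [2, 4, 8, 16]
sensitivity = {si: id_estimate(cases, si) for si in (3.5, 7.0, 14.0)}
```

In this toy series the estimate grows with the assumed serial interval, which is why reporting results over a range of plausible values is more informative than a single number.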

Using real world COVID-19 data

Fig 6 shows plots of estimates of R0 for all six estimators as applied to real world COVID-19 epidemic data from Canada. The provinces of BC (second column), Ontario (third column), and Quebec (last column) are studied, as well as the entire nation (first column). The WP, seqB, ID and IDEA methods are applied using assumed known serial intervals of 2, 5, and 8 days. We compare our estimates to the R0 estimates previously reported in reference [3] (black horizontal lines) for the Greater Toronto Area, which represents approximately 1/6 of the Canadian population. In summary, seqB, ID and plug-n-play estimates perform best. seqB produces estimates within the range denoted by the black horizontal lines for all serial interval values considered. The same is true for early estimates for plug-n-play. The ID method achieves the lower estimate for all geographic jurisdictions. It is sensitive to the choice of serial interval value, however, and higher serial interval values may drive the estimate above the upper bound. See, for example, the subplots for Canada and Ontario. Given the findings here, we again recommend a combination of seqB, ID, and plug-n-play methods for estimation of R0.

Fig 6

R0 estimators (y-axis) for COVID-19 data in Canada.

Data from [52]. The x-axis shows time in weeks where t = 0 denotes January 25, 2020—the date of the first known case in Canada [51]. The vertical gray line shows the date of lockdown for each of the provinces (there was no national lockdown date) [51]; while the horizontal lines denote estimates of R0 from reference [3]. The provinces of BC (second column), Ontario (third column), and Quebec (last column) are studied, as well as the entire nation (first column). The WP, seqB, ID and IDEA methods are applied using assumed known serial intervals of 2, 5, and 8 days.

Conclusion

The basic reproduction number, R0, is an important parameter for estimation early in an epidemic so that public health interventions can be informed. As many estimators exist, and the assumptions of the estimators as well as their dependency on particular biological estimates (i.e., the serial interval) vary between methods, it is expected that R0 estimates will differ. It is thus important to understand which estimators provide better outcomes under both true and misspecified conditions. Since respiratory viruses (especially influenza and coronaviruses, e.g., COVID-19 of late) affect the global population every year, we have chosen to study the estimators of R0 for these types of infections, which are typically modelled using SIR, SEIR and SEAIR compartmental models. We have also chosen to consider weekly case data, as this is characteristic of pandemic influenza and other pandemic respiratory infection outbreak reported data, globally (with the exception of COVID-19, which was reported almost daily in most regions until early 2022). We have considered six estimators that are commonly used when determining R0 for any infectious disease outbreak. We discussed the advantages and disadvantages of each method, including dependencies on proper estimates of the serial distribution, and the computational resources needed to run each estimator. Our simulations consider a variety of well-specified and misspecified settings. Briefly, we find that the WP method can provide estimates close to the true R0 value if the SD is known, but when the SD is unknown, the method suffers greatly (see Tables 7, 12 and 17 in S1 File). The seqB method performs well given SIR data but underperforms if there is any misspecification; the ID and IDEA methods are useful due to their simplicity. ID outperforms the IDEA model, but ID estimates have slightly higher MSE compared to seqB. 
fullBayes estimates can have large variabilities, and are sensitive to the underlying model structure, but the plug-n-play method provides consistent estimates even with only one week of data. Considering both bias and variability, as well as misspecification, we find that the performance of the seqB, ID, and plug-n-play estimators is best, providing estimates of R0 that are closest to the true value under both correctly specified and misspecified cases. Notably, plug-n-play does not require prior knowledge of the serial distributions. However, if the serial interval is known, seqB and ID outperform plug-n-play. Furthermore, seqB and ID require less computational time, and are easier to implement. The choice of R0 estimator is ultimately up to the practitioner. In our analysis we have shown that some R0 estimators can be greatly affected by even a small level of misspecification. Given that biological certainty may be lacking at the beginning of an infectious disease outbreak, the number of disease stages needed in a model and a proper distribution of the serial interval may not be known. This means that a range of R0 results will ensue, and the accuracy of the estimates will be unclear. We therefore recommend that a suite of estimators be used when estimating R0. Given the current study results, we recommend that seqB, ID, and plug-n-play methods be included in any suite. plug-n-play does not require knowledge of the serial distribution and provides close to true estimates under different model structures quickly. seqB and ID should be implemented using a range of known serial intervals, to provide sensitivity analysis and confidence in R0 estimation. We do however note that plug-n-play may be difficult to implement for some, since the R package is quite technical [45]. Daily case reporting data has been available for the most recent COVID-19 pandemic. Daily data was not provided during the 2009 H1N1 pandemic, however. 
Furthermore, there may be issues with daily reporting (such as periodicity and reporting delay), whereby public health may choose to use weekly reporting data over daily data as the weekly data would be more reliable. We have thus only considered weekly case reporting data in this study, as weekly case reporting data can be expected in many future epidemics and pandemics. It is important to note that First Few Hundred (FF100) studies, whereby the first few hundred cases of a new virus are followed in detail at the beginning of an infectious disease outbreak, have been implemented during the 2009 H1N1 and COVID-19 pandemics [53-60]. In these cases the serial distribution, and the need to consider exposed and/or asymptomatic periods of infection, can be quickly determined, enabling earlier and more certain estimates of R0. Given that First Few Hundred protocols are not implemented in much of the globe, weekly case report data may, however, still be considered the norm for future pandemics. In our current study we have assumed perfect data with no unobserved infections, no reporting delay, and no data collection bias. These issues are intuitively expected to affect R0 estimates. We intend to continue our study of R0 estimation considering these aspects in our epidemiological data sets. In summary, our work has various strengths, and some limitations. A unique strength of our work is the study of model misspecification. We are unaware of previous work in this direction. We did not consider all possible estimators of R0, but focused on those most commonly used in the field of Infectious Disease Modelling. We selected a variety of influenza and COVID-19 scenarios for our simulations, which provide considerable information on the behaviour of these estimators. We did not investigate other infectious diseases, such as Ebola, which could potentially have quite different parameters. 
Our overall recommendations are, however, general, and are therefore widely applicable. Lastly, we considered only the scenario of perfect data. Alternative settings are beyond the scope of this work; these, along with other infectious diseases and potentially more estimators, will be considered in future work.

A supplementary file contains additional simulations results (both tables and boxplots) as well as some further technical details.

8 Oct 2021

PONE-D-21-22343

Estimating the basic reproduction number at the beginning of an outbreak under incomplete data

PLOS ONE Dear Dr. Heffernan, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Nov 22 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Inés P. Mariño, Ph.D. Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. 
We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: I Don't Know Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified. Reviewer #1: No Reviewer #2: Yes ********** 4. 
Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: No ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: In “Estimating the basic reproduction number at the beginning of an outbreak under incomplete data” by Boonpatcharanon and colleagues, different methods to estimate R0, the basic reproduction number, are compared considering the first 100 days of an epidemic. The authors apply frequentist and Bayesian approaches to estimate R0 under three different infection models: SIR, SEIR and SEAIR. These models differ in the allowed transitions between states of individuals (susceptible, exposed, (asymptomatic/symptomatic) infected, recovered). The authors conclude with a recommendation but also highlight that it always depend on the data which approach to choose; they recommend sensitivity analyses. The aim of the study is described and motivated. The manuscript is well written. However, there are some issues the authors should consider to facilitate the readability of the manuscript. Major issues: 1) “Incomplete data” sounds like missing data related to counts of, e.g., infected individuals. Should “incomplete” also comprise incomplete knowledge/information on transmission and course of infection? Please clarify (in the manuscript and probably in the title). Additionally, please add information on the required (observed) data underlying the R0 estimation/calculation. 
2) Please provide real data applications to support the assumptions in the simulation study and to illustrate the investigated methods on real data. For influenza, weekly case reports are published for several seasons, for example by the ECDC (European Centre for Disease Prevention and Control), the Government of Canada or the CDC (Centers for Disease Control and Prevention). 3) Please provide the (documented) source code for the investigation to redo the analysis (including data simulation and figure/table preparation). 4) Introduction: a. Could the authors elaborate more on data misspecification? Maybe through an own paragraph including examples of misspecifications and their possible influence on the R0 estimates? This issue is related to the reliability of an R0 estimation in the epidemic situation itself. The benefit/value of the R0 estimation depends heavily on the population under investigation, i.e. whether this population is a random sample of the total population or a for the total population not representative subpopulation (i.e. comprising, e.g., more or fewer infected individuals or different transmission probabilities than in the total population). Issues to be considered are for example the test strategy (which individuals are tested or must provide a test result; related to the number of unreported cases) and the test quality (reliable test results). b. As the study is about the early stage of an epidemic (first 15 weeks), could the authors additionally include this time frame into the considerations about misspecification? In case of a “new” disease, the knowledge on which, e.g., R0 estimation is based is limited in the early days. Could the authors please highlight the important issues unique to the beginning of an epidemic/pandemic – compared to the subsequent time? Besides in the beginning of an epidemic, is it also possible to consider a time point within an epidemic with a very low number of infected individuals, e.g. 
between two waves or two seasons (in case of seasonality as for influenza)? Please clarify “early stage”. Please add a motivation for considering only the first 15 weeks.
c. Please add a motivation for the decision to consider SIR, SEIR and SEAIR only.

5) Materials and Methods:
a. Please provide the underlying assumptions related to the data for the investigation (i.e. no unobserved infections, no reporting delay, …).
b. Please include a section about the simulation study. The approach description should not be part of the results section and the parameter choice should not be part of the method description. Please aggregate.
c. In some parts, methods are provided in the results section and vice versa. Please check and separate.
d. Lines 64-77: Please provide a supporting figure for illustration, if possible. Furthermore, please consider the inclusion of Table 1 in this figure and, if possible, remove Table 1.
e. Line 97: Please introduce the methods briefly (including the reference to the respective subsection) and provide the abbreviations used throughout the manuscript. Then, refer to Table 3. Otherwise, the subsequent sections cannot be followed easily.
f. Line 103: Please consider describing “serial distribution” earlier in the manuscript because it was already used earlier. Suggestion: Provide a section with definitions needed for the models (SIR, SEIR, SEAIR). Furthermore, please consider a summary of all parameters that are set to some selected values in the investigation. A table (or subheadings after re-ordering) might help.
g. Part 0.2.1
i. Please check notations and definitions. For example:
1. Line 133: “or” instead of “, or,“.
2. Please unify kappa and k.
3. Line 135: “both” does not fit “the method”, which is one method. Please check.
4. Lines 137/138: Please add the origin for “number of days or weeks”.
5. Line 139: Please clarify min(kappa, t). What is t?
6. Line 139: Please clarify the relation between I(t – t_j) and I(t), if there is one, otherwise please define I(time difference / interval).
7. Line 159: Please clarify “built-in alternative optimisation”. Where is it “built-in”?
ii. Please provide p(t_j) for all models.
iii. Lines 149/150: Please explain the limitation.
iv. Line 150: Please provide the section reference for the simulations.
v. Instability issues (lines 156-165):
1. Might the instability be an indicator for non-adequateness of the applied method?
2. Please consider including the observed instability issues in the results section to clearly separate methods and results (introduction of new subsection headings might help). Is it possible to quantify these issues?
3. Was the implementation of the grid search approach in comparison to the original implementation validated? If so, how?
h. Part 0.2.2:
i. As long intervals without new infections are problematic for this approach, this approach might be better suited for situations after the start of a new “wave” with rapidly increasing numbers of newly detected infections. Did the authors investigate scenarios in which the numbers only increased slowly, or were the scenarios adapted to this method? In the latter case, a comparison in a non-adequate scenario would be of interest to guide future method applications. Especially in the beginning of a pandemic, such situations might occur.
ii. Lines 214/215: Please state the adaptations in more detail. Was the implementation in comparison to the original implementation validated? If so, how?
i. Part 0.2.3:
i. Lines 233/234: Please clarify “beginning of an outbreak”. The authors state that the number of infectious individuals rapidly decreases in the beginning, but in the beginning of a new disease few individuals are infected/infectious and the number of infected/infectious people increases. Otherwise, I would expect that R0 is overestimated as the estimate does not decrease fast enough. Please clarify.
ii. Lines 244/245: Please provide a reference to the specifications of the misspecification.
j. Part 0.2.4:
i. Please provide (throughout the manuscript) names of R packages besides the reference.
ii. Line 275: Please explain “particle”.
iii. Equation after line 279: R0 is probably not a single value as delta_t is probably a sequence. Please check and adapt, if necessary.
iv. Line 280: Please clarify what “regardless of the epidemiological model” relates to (and what is model-dependent).
v. Line 282: Please check the reference to the appendix. Appendix 1.3 is “Least square estimation for the IDEA method”. Please provide more comments in the source code (Appendix 1.4) and please check line breaks to facilitate reading.
k. Part 0.2.5:
i. Line 292: Please provide the respective simplifications in the subsequent derivations.
ii. Lines 294/295: Please describe m more clearly. Please explain additionally (besides the equation) m_j in words. Definition of m0 should be provided with the definition of m_j.
iii. Line 295: Please clarify “epidemic” and “much more information”.
iv. Lines 296/297: Please check the conditions for i.
v. Lines 299/300: What is the impact if an individual needs more than one week to recover? What is the motivation for one week? Please add.
vi. Line 334: “obtained” instead of “obtain”

6) Results:
a. Lines 351-353: Please additionally consider the case that the population studied is not a random sample of the target population. Alternatively, please clearly state (when defining the study design) the assumption that the population studied is a random sample and discuss this assumption as a limitation.
b. Lines 377 to 380: Do the results change if the other methods are also only applied to the subset of samples? Please comment.
c. Lines 380/381: Please define bias and variability. Did the authors also consider a joint measure of bias and variability?
d. Line 382: A figure cannot study. Please rephrase throughout the manuscript.
e. Please consider adding further subsections to provide more guidance to the reader.
f. Line 405: Computation time is provided but the related section follows later on. Please reorder.
g. Part 1.1: Could the authors please provide computational aspects for all models?

7) Discussion:
a. Please provide a paragraph about strengths and limitations.
b. Please compare the results (at least in parts) with other studies.

8) Abbreviations, parameters, model names, method names and other short forms:
a. Please introduce all in the main part of the manuscript. E.g. ODE, MCMC, IID, S0, I0, S, I, S(t), SD, … are missing.
b. Please check the usage for consistency, e.g. S versus S(t).
c. Please state which parameters are 0 at t=0.

9) Figures:
a. Please provide axis titles at the respective axis and not in the description.
b. In case the legend only comprises one symbol/colour differing between figure panels, please consider providing this information as panel title above the respective plot panel. This also allows a shorter description.
c. Please introduce all abbreviations, parameters and model names in the figure description.
d. In case of boxplots, please provide complete boxplots. In case a zoomed-in boxplot is needed, the complete one should be provided in the supplement.
e. Please provide information in the description of the boxplots so that the reader is able to identify scenarios with misspecifications.

10) Tables:
a. Please introduce all abbreviations, parameters, model names and method names in the table description.
b. Please provide a description that allows the reader to understand the table without the part in the main manuscript where the table is cited for the first time.

Minor issues:

1) Section numbering in the main part: Please remove the leading “0.”. Please check the complete numbering and doubling of section headings, e.g. “Results” and “1. Results” and supporting information starts with 1.2.
2) Please consider avoiding “flu” and using “influenza” throughout the manuscript.

3) Materials and methods:
a. Line 58: Please clarify “approximately”.
b. Line 77: It should probably be I(0) = 1 (first round bracket is misplaced).
c. Line 85: Please provide information on the meaning of “inflection” in lay terms (i.e. related to the course of infection/pandemic).
d. Line 126: Please provide some additional information on the computer.
e. Part 0.2.2:
i. Equation after line 192: To stick to the notation throughout the manuscript, please consider replacing s by t, i.e. S(t) and dt.
ii. Line 194: Please consider replacing | by “given”, i.e. “conditional distribution of I(t_j+1) given I(t_j) and R_0”. This would facilitate reading.
iii. Line 196: Please introduce N0.
f. Part 0.2.3:
i. Please introduce s and d.
ii. Line 230: Please delete “obvious”.
iii. Equation (4): Please consider using additional brackets so that it is clear to which terms the sum sign belongs.
iv. Lines 243/244: “However, …” instead of “…, however.”.

4) Figure 1:
a. Lines 84/85: Please consider removing parts of figure descriptions from the main text that should be part of the description accompanying the respective figure itself, i.e. below the figure panel(s).
b. Please introduce the meaning of “inflection”.

5) Table 2: Please clarify the meaning of Y_i (exponentially distributed with a mean of 1). Later on, it is a mean of 1/gamma (provided as an example). Or other natural numbers. Please consider a consistent notation.

6) Supporting information:
a. Part 1.2:
i. Please provide references for the models and their chosen parametrisation.
ii. Please introduce all parameters in more detail, even if they are introduced in the main text. Providing all definitions facilitates reading. The authors could consider introducing a separate section within 1.2 for definitions. An alternative might be to provide the definitions in the main text, e.g. in a table.
b. Part 1.3:
i. Please provide the partial derivatives and a few more steps of the solving process.

Reviewer #2: This manuscript describes an interesting simulation study comparing 6 different methods of estimating the R0 coefficient (WP, seqB, ID, IDEA, plug-n-play and fullBayes). The data are simulated via three different compartmental models, SIR, SEIR and SEAIR. Methods are intended to be tested both under the well-specified model and parameters and under the miss-specified ones. The strength of this work is the large range of methods tested, from the more classical and simplified models to the fully Bayesian ones. However, while the idea of comparing the performance of the methods is good and promising and the spectrum of methods compared is broad, the study and manuscript suffer from several weaknesses.

The biggest problem is a misunderstanding of two random duration variables involved in the epidemiological analysis of a pandemic: the infectious period and the serial interval. The first is the random length of time a subject remains infectious, the second is the random time between when the infector develops symptoms and when the infected develops symptoms in a chain of transmission (see for example: Zhou X-H, You C, et al, 2020, the Lancet). These two intervals are in general quite different in mean; for instance, for COVID-19 infection the mean infectious period is around 8-10 days (He X, Lau EHY, et al 2020, Nature; Zhou X-H, You C, et al, 2020, the Lancet) while the mean serial interval is around 4-5 days (Nishiura et al 2020, IJID; Du et al, 2020, CDC; Zhou X-H, You C, et al, 2020, the Lancet). The mix-up between these two intervals (and distributions) is evident on page 4 when it says: “The serial distribution is the distribution of the random amount of time that an individual is infected..”. This inaccuracy has consequences for the simulation study.
In fact, data generated according to an SIR model with parameters beta and gamma have by construction a mean infectious period of 1/gamma (fixed at 5 days for simulations). The problem arises when methods adopted for R0 estimation depend on the serial interval distribution, instead of the infectious period distribution, which is the case for WP (White and Pagano 2007), ID and IDEA (Fisman 2013). In these cases models will not be well specified even when the authors present them as being so. This can explain why in Fig 5, for example, the WP, ID and IDEA methods (lines 1, 3 and 4) seem to perform better when the gamma parameter is incorrect (right panel) than when it is correct (left panel). And comparing Fig 5 and 6 for the same methods, performance is improved when the model is miss-specified (R0 estimated assuming SIR with SEIR data). The authors need to address this point first.

A second point is inherent in the design of the simulation and the presentation of the results. The data are indeed simulated under a single choice of parameters, which may not be sufficient to draw general conclusions. Here, the parameters are chosen with respect to a given infection (influenza). It seems to me that adding other parameter choices would add value to the study. In addition, attention should again be paid to the fact that the gamma parameter refers to the distribution of the infectious period and not to the distribution of the serial interval.

The results are presented by boxplots, which is a good idea. However, on the one hand, some graphs are repeated several times (e.g. the WP case (SD = exp mean 5/7) with the SIR data is repeated 3 times in Fig 2, 3 and 5), and I believe that a way could be found to avoid this. On the other hand, the results should also be presented numerically in tables, specifying for each setting the bias and variability of the simulated results at the inflection point, or a summary of both (mean square error).
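The numerical summary requested here can be illustrated with a minimal Python sketch. It assumes a set of simulated R0 estimates and a known true R0; the function name `summarize_estimates` and the numbers are hypothetical, chosen only for illustration, and are not results from the study. With the variance computed by dividing by n, the mean square error equals the squared bias plus the variance.

```python
from statistics import fmean, pvariance


def summarize_estimates(estimates, true_r0):
    """Return (bias, variance, mse) of R0 estimates against a known true value."""
    mean_est = fmean(estimates)
    bias = mean_est - true_r0
    # Population variance (divide by n) so that mse == bias**2 + variance.
    variance = pvariance(estimates, mu=mean_est)
    mse = fmean([(e - true_r0) ** 2 for e in estimates])
    return bias, variance, mse


# Hypothetical estimates from five simulated epidemics with true R0 = 2.0.
estimates = [1.6, 1.9, 2.1, 1.7, 2.2]
bias, variance, mse = summarize_estimates(estimates, true_r0=2.0)
print(f"bias={bias:.3f} variance={variance:.3f} mse={mse:.3f}")
```

Reporting the three numbers together shows whether a large error comes from systematic over- or underestimation (bias) or from spread between simulation runs (variance).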
An application to real data would also be interesting, in order to see how different the R0 estimates produced by the considered methods can be on observed incidence data. I would personally be interested in seeing these results for the COVID-19 outbreak.

Finally, a thorough review of the English language is necessary.

Specific points:

Page 2, line 26: “…serial interval, infectious period…”. Please define all quantities when they are introduced.
Page 3, line 74: Here gamma is set to 1/3, while in the Results section it is set to 1/5 (or 7/5 with weekly data).
Page 3, line 100: “ODE epidemiological model”. Please define.
Page 12, line 391: “Note that here the mean of the serial distribution was incorrect by only two days…”. Here the authors do not comment on the fact that performance is better with the wrong serial distribution (see my comment above). In addition, the amount of miss-specification (2 days) is chosen by the authors, and they can modify it if it seems insufficient to show some effect. I recommend testing a range of parameter choices.
Page 15, lines 490-91: “Asymptomatic infected (infected, no symptoms, not infection)”. Replace with: (infected, no symptoms, infection)

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Miriam Kesselmeier

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]
While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

25 Jan 2022

We thank the reviewers for their comments. We have added new examples to our study. We have also revised the manuscript for enhanced clarity and understanding. We have provided a detailed response to reviewers as an attachment.

Submitted filename: ProjectInc_revisions_replytoreviewers.pdf

4 Mar 2022

PONE-D-21-22343R1

Estimating the basic reproduction number at the beginning of an outbreak

PLOS ONE Dear Dr. Heffernan, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we consider that the manuscript is much improved but still does not fully meet PLOS ONE’s publication criteria. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised by one of the reviewers. Please submit your revised manuscript by Apr 18 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

PONE-D-21-22343R2

Estimating the basic reproduction number at the beginning of an outbreak

PLOS ONE Dear Dr. Heffernan, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jun 18 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Inés P. Mariño, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors have addressed almost all of my comments. An issue that was not addressed, yet, can be found in lines 410-425. There are still "influenza one" and "influenza two", although the authors stated that they have changed all to "influenza 1" and "influenza 2", respectively. Please adapt. Coming back to strength and limitations in the discussion, my wording was not clear. I am sorry for the inconvenience. My suggestion was to additionally provide a paragraph on strengths and limitations (and approaches to mitigate them) of the conducted study with respect to, e.g., selected scenarios and the selection of the investigated estimators. Additionally to the strengths and limitations of the investigated estimators. Maybe a simple reordering might be a solution for highlighting.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Miriam Kesselmeier

12 May 2022

Dear Editor and Reviewers,

We thank the Editor and both reviewers for their consideration and careful reading of our manuscript. Reviewer #2 is now completely satisfied with the manuscript and is not requesting a single change. Reviewer #1 has requested two minor changes and we have implemented both.

Comment 1: The authors have addressed almost all of my comments.

(a) An issue that was not addressed, yet, can be found in lines 410-425. There are still "influenza one" and "influenza two", although the authors stated that they have changed all to "influenza 1" and "influenza 2", respectively. Please adapt.

We missed these entries in the previous revision. This has been done.

(b) Coming back to strength and limitations in the discussion, my wording was not clear. I am sorry for the inconvenience. My suggestion was to additionally provide a paragraph on strengths and limitations (and approaches to mitigate them) of the conducted study with respect to, e.g., selected scenarios and the selection of the investigated estimators. Additionally to the strengths and limitations of the investigated estimators. Maybe a simple reordering might be a solution for highlighting.

We have summarized some strengths and limitations in a final paragraph to the Conclusion.

Submitted filename: reply_to_reviewer.pdf

19 May 2022

Estimating the basic reproduction number at the beginning of an outbreak

PONE-D-21-22343R3

Dear Dr. Heffernan,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible, no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Inés P. Mariño, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

7 Jun 2022

PONE-D-21-22343R3

Estimating the basic reproduction number at the beginning of an outbreak

Dear Dr. Heffernan:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Inés P. Mariño
Academic Editor
PLOS ONE

1. Transmissibility of 2009 pandemic influenza A(H1N1) in New Zealand: effective reproduction number and influence of age, ethnicity and importations.

Authors: S Paine; G N Mercer; P M Kelly; D Bandaranayake; M G Baker; Q S Huang; G Mackereth; A Bissielo; K Glass; V Hope
Journal: Euro Surveill Date: 2010-06-17

Introduction

Methods

Epidemiological models

R0 and the serial distribution

Methods for estimating R0

WP: Maximum likelihood estimation of a branching model

seqB: Sequential Bayes estimation using an SIR approximation

ID and IDEA: Least square estimation using incidence decay approximations

plug-n-play: Maximum likelihood using sequential Monte Carlo for partially observed epidemics

fullBayes: Bayesian inference for partially observed epidemics

Real world COVID-19 data

Workflow

Results

Epidemic simulations

The number of infectious individuals (y-axis) at time t in weeks (x-axis); from left to right: SIR, SEIR, and SEAIR; from top to bottom the examples are influenza 1, influenza 2, then COVID-19.

R0 estimates

Using synthetic data from the SIR, SEIR and SEAIR epidemiological models

Influenza example 1 estimated MSE of R0 estimators assuming known serial interval (SI) with SIR data (week on x-axis).

Influenza example 2 estimated MSE of R0 estimators assuming known serial interval (SI) with SIR data (week on x-axis).

COVID-19 estimated MSE of R0 estimators assuming known serial interval (SI) with SEAIR data (week on x-axis).

Estimated MSE of R0 estimators assuming unknown serial interval (SI) (week on x-axis).

Using real world COVID-19 data

R0 estimators (y-axis) for COVID-19 data in Canada.

Conclusion

A supplementary file contains additional simulation results (both tables and boxplots) as well as some further technical details.

1. Transmissibility of 2009 pandemic influenza A(H1N1) in New Zealand: effective reproduction number and influence of age, ethnicity and importations.

2. Comparing methods for estimating R0 from the size distribution of subcritical transmission chains.

3. Comparative estimation of the reproduction number for pandemic influenza from daily case notification data.

4. Natural variation in HIV infection: Monte Carlo estimates that include CD8 effector cells.

5. Estimation of the serial interval of influenza.

6. Serial interval and incubation period of COVID-19: a systematic review and meta-analysis.

7. The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks.

8. Real time bayesian estimation of the epidemic potential of emerging infectious diseases.

9. Simulating the SARS outbreak in Beijing with limited data.

10. Estimating epidemic exponential growth rate and basic reproduction number. (Review)

1. Comparative Dynamics of Delta and Omicron SARS-CoV-2 Variants across and between California and Mexico.