
Bayesian data assimilation for estimating instantaneous reproduction numbers during epidemics: Applications to COVID-19.

Xian Yang^1,2, Shuo Wang^2,3,4, Yuting Xing^2, Ling Li^5, Richard Yi Da Xu^6, Karl J Friston^7, Yike Guo^1,2.

Abstract

Estimating the changes of epidemiological parameters, such as instantaneous reproduction number, Rt, is important for understanding the transmission dynamics of infectious diseases. Current estimates of time-varying epidemiological parameters often face problems such as lagging observations, averaging inference, and improper quantification of uncertainties. To address these problems, we propose a Bayesian data assimilation framework for time-varying parameter estimation. Specifically, this framework is applied to estimate the instantaneous reproduction number Rt during emerging epidemics, resulting in the state-of-the-art 'DARt' system. With DARt, time misalignment caused by lagging observations is tackled by incorporating observation delays into the joint inference of infections and Rt; the drawback of averaging is overcome by instantaneously updating upon new observations and developing a model selection mechanism that captures abrupt changes; the uncertainty is quantified and reduced by employing Bayesian smoothing. We validate the performance of DARt and demonstrate its power in describing the transmission dynamics of COVID-19. The proposed approach provides a promising solution for making accurate and timely estimation for transmission dynamics based on reported data.

Year: 2022 PMID: 35196320 PMCID: PMC8923496 DOI: 10.1371/journal.pcbi.1009807

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Epidemic modelling is important for understanding the transmission dynamics of, and responding to, the emerging COVID-19 pandemic [1-8]. Since the pioneering work by Kermack and McKendrick [9], various epidemic models with different governing equations have been developed to describe the transmission dynamics of infectious diseases [10]. For common infectious diseases such as influenza, the epidemiological parameters are related to the nature of the virus and treated as constants during the epidemic outbreak. Such models are not applicable to the emerging COVID-19 pandemic, where extensive government control measures have been implemented and continuously revised. Due to the impacts of control measures, the epidemiological parameters (e.g., infection rates) linked to human behaviours could change substantially. In particular, the instantaneous reproduction number Rt (written R hereafter), defined as the expected number of secondary cases occurring at time t, divided by the number of infected cases scaled by their relative infectiousness, has drawn extensive attention [11]. Estimating such time-varying parameters from epidemiological observations (e.g., daily reports of confirmed cases) is useful for nowcasting transmission [12], for retrospectively assessing intervention impacts and for developing vaccine strategies [6,13,14]. All these applications depend on a reliable system for estimating the time-varying parameters accurately and in a timely manner. Imprecise estimation or inappropriate interpretation could feed misinformation. Several systems [10,15-18] have been proposed to estimate the time-varying epidemiological parameters in practice; however, this remains a challenging task due to the following issues [19].

Lagging observations. Given a mathematical model of transmission dynamics, the ideal observable data for inferring the time-varying parameters would be the number of infections.
However, the actual infection number is often unknown and can only be inferred from other epidemiological observations (e.g., the daily confirmed cases). Such observations lag behind the infection events because of inevitable time delays between the infection of individual patients and the detection of the cases (e.g., days for symptom onset [20]). Direct parameter estimation based on lagging observations, without adjusting for the time delay, leads to temporally inaccurate estimates [19,21]. To address this problem, a two-step strategy, which first estimates infections from epidemiological observations with a temporal transformation and then estimates the parameters, has been commonly used in practice [21]. A simple temporal shift of observations by the mean observation delay turns out to be insufficient when the observation delay is relatively long or the transmission dynamics change rapidly, both of which are seen in the COVID-19 pandemic [21]. The backward convolution method (i.e., subtracting a time delay, drawn from a given distribution, from each observation time) leads to an over-smoothed reconstruction of the infection number and biased parameter estimates [10]. Deconvolution methods [22], which invert the observation process, are mathematically more accurate but sensitive to the optimisation procedure (e.g., the stopping criterion) of the ill-posed inverse problem. In addition, the infection number is often calculated as a point estimate, so the uncertainty from the observation process is neglected [23]. As an alternative to the two-step strategy, we investigate a new Bayesian method that jointly estimates both the infection number and the epidemiological parameters with uncertainty by explicitly parameterising the observation delay.

Averaging inference.
There are two general paradigms for estimating time-varying parameters: 1) reformulating the problem as an inference of static or quasi-static parameters, so that various methods for static parameter estimation can be used; 2) developing inference methods for explicit time-varying parameter estimation. In the first paradigm, the time-varying parameter is usually parameterised with several static parameters (e.g., the initial value and the exponential decay rate [17]). The quasi-static method assumes that the parameter evolves so slowly that it can be treated as static within a short period. For example, Cori et al. [16] proposed a sliding-window method, ‘EpiEstim’, which uses a segment of observations for the averaging inference of R, assuming R remains the same within the sliding window. However, this assumption does not hold for rapidly changing transmission dynamics, where the window size affects the accuracy. Best practices for selecting the sliding window are still under investigation [21]. Instead of adopting a local sliding window, Flaxman et al. [13] defined several periods according to the dates of intervention measures, assuming a constant R within each period. This approach requires additional information about the intervention timeline, which could be inaccurate, and does not capture abrupt changes of R. In contrast to these window-based methods, data assimilation [24] is a window-free alternative that has been less explored for parameter estimation in computational epidemiology. By applying sequential Bayesian inference [25,26], data assimilation supports instantaneous updating of model states as new observation data become available. A Bayesian model selection mechanism [27] can also be used to model the switching transmission dynamics under interventions, thereby avoiding the drawback of averaging inference.
Unlike the common compartment models [28-30] used in concurrent data assimilation studies of COVID-19 modelling, we use the renewal process, which takes into account the changing infectiousness of the virus during the infection period. Moreover, we propose a Bayesian smoothing scheme that allows historical estimates to be corrected based on subsequent observations.

Quantification of uncertainty. The credibility of a parameter estimate is as important as the estimate itself, especially for policymaking. The uncertainty comes from multiple sources, including the intrinsic uncertainties of epidemic modelling, data observation and inference processes. Firstly, the uncertainty of epidemiological models affects the final estimates. For example, R estimation is found to be sensitive to the assumed distribution of generation time intervals [21]. Secondly, the uncertainty resulting from systematic errors (e.g., weekend misreporting) and random errors (e.g., spike noise) in the observation processes should be properly quantified. During the COVID-19 pandemic, for example, we have seen different reporting standards and time delays across countries and regions, with different levels of uncertainty. Thirdly, the uncertainty could be enlarged or smoothed in the inference processes. For example, the use of a sliding window could smooth the parameter estimates but may simultaneously miscalculate the uncertainty, due to overfitting within the sliding window. To provide reliable credible intervals (CrI) for parameter estimates, all three types of uncertainty should be considered and reported as part of the final estimates. A state-of-the-art package for R estimation, EpiEstim (Version 2) [31], allows users to account for the uncertainty from epidemiological parameters by resampling over a range of plausible values.
However, this tool cannot handle the uncertainty from imperfect observations or the side effects associated with the sliding window. Recently, ‘EpiNow2’ [18] was proposed to integrate the uncertainty of the observation process, but its inference is still based on the sliding window. In this work, we deal with model and data uncertainty in the data assimilation framework [24], with a Bayesian smoothing mechanism that continuously integrates both the latest and historical observations into the inference flow, thereby alleviating spurious variability in the estimates.

To tackle these practical issues, we propose a comprehensive Bayesian data assimilation system for estimating time-varying epidemiological parameters together with their uncertainty. In particular, we focus on the joint estimation of infection numbers and R as a real-world application. In contrast to the Bayesian approach for estimating the basic reproduction number R0 at the beginning stage of an epidemic outbreak [32], our system uses a sequential updating scheme. The evolution of the transmission dynamics is described by a hierarchical transition process, which is informed by newly reported data through an observation model with explicit observation delay. A model selection mechanism is built into the transition process to detect abrupt changes under interventions.

Results

1. Bayesian data assimilation for epidemiological parameter estimation

We propose a Bayesian data assimilation approach, as illustrated in Fig 1, to estimate the time-varying parameters based on epidemiological observations. This framework is applicable to various epidemic models when the governing equations and observation functions are available. Given an epidemic model (e.g., the renewal process), we can construct a latent state X_t at time t, which consists of the time-varying variables and parameters of the governing equations. The epidemiological observations C_{1:T} up to the latest observation time T are made during the observation process of the latent state X_t. The problem of estimating the time-varying parameters can be formulated as a Bayesian inference problem of p(X_t | C_{1:T}) for each time step t. In contrast to inferring the ‘pseudo’ dynamics (i.e., reformulating into a static/quasi-static problem), our method directly estimates the ‘real’ dynamics by assimilating information from the observations into the epidemic model forecast.

Fig 1

Illustration of the inference of the Bayesian data assimilation system for time-varying parameter estimation. The latent state X_t includes the variables and parameters of an epidemic model to be estimated. The epidemiological observation is denoted as C_t and is linked to the latent state via the observation function. For each time step, the estimate of the latent state p(X_t | C_{1:T}) is continually updated according to ongoing reported observations using sequential Bayesian updating with forward filtering and backward smoothing. (A) Forward filtering at each time step. The posterior p(X_{t−1} | C_{1:t−1}) estimated at the previous step t−1 is transformed into the prior p(X_t | C_{1:t−1}) for the current step t, calculated from the state transition model as detailed in the Methods section. Together with the likelihood p(C_t | X_t) obtained from the epidemiological observation at the current step, the posterior of the current step p(X_t | C_{1:t}) is estimated. (B) At the same time, backward smoothing is used to compute p(X_t | C_{1:T}), taking account of all the observations C_{1:T} up to the time T by applying a Bayesian smoothing method (see the Methods section for more information).

As illustrated in Fig 1, the Bayesian data assimilation has two phases: forward filtering and backward smoothing. Forward filtering uses the up-to-date prior from the state transition model and the likelihood determined by the latest observation to update the current latent state, by computing its posterior distribution following Bayes’ rule. To implement this Bayesian updating process, we adopt a particle filter method [26] to efficiently approximate the posterior distribution. Backward smoothing looks back to refine previous state estimates as more observations accumulate, reducing the uncertainty of parameter estimation. That is, the estimate of the latent state at time t is smoothed retrospectively, given all observations available up to time T (T>t). Please refer to the Methods section for more detailed explanations.
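
To make the forward filtering phase concrete, the following is a minimal sketch of a bootstrap particle filter in Python. It is not the DARt implementation: the toy model (a scalar random-walk state observed with Gaussian noise), the function name and all parameter values are assumptions chosen only to show the predict, weight and resample cycle that a particle filter uses to approximate the posterior at each step.

```python
import math
import random


def bootstrap_particle_filter(observations, n_particles=500, step_sd=0.1,
                              obs_sd=0.5, init_mean=0.0, init_sd=1.0, seed=0):
    """Forward filtering for a toy model (an assumption, not the DARt model):
    x_t = x_{t-1} + Normal(0, step_sd^2),  c_t = x_t + Normal(0, obs_sd^2).
    Returns the posterior mean of x_t after each observation."""
    rng = random.Random(seed)
    particles = [rng.gauss(init_mean, init_sd) for _ in range(n_particles)]
    means = []
    for c in observations:
        # Prediction: push each particle through the transition model,
        # giving samples from the prior for the current step.
        particles = [x + rng.gauss(0.0, step_sd) for x in particles]
        # Update: weight each particle by the Gaussian likelihood of c.
        weights = [math.exp(-0.5 * ((c - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        weights = [w / total for w in weights]
        means.append(sum(w * x for w, x in zip(weights, particles)))
        # Resampling: draw particles in proportion to their weights,
        # giving samples from the posterior for the current step.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means


filtered = bootstrap_particle_filter([2.0] * 50)
print(round(filtered[-1], 2))
```

Backward smoothing would add a second pass that reweights the stored particles from time T back to time 1; it is omitted here to keep the sketch short.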

2. DARt: A data assimilation system for R estimation

To apply the proposed Bayesian data assimilation approach in a real-world problem, we developed the ‘DARt’ (Data Assimilation for R estimation) system for the R estimation. The transmission dynamics is described by the governing equations of the renewal process, where R is the key epidemiological parameter driving the number of incident infections j. We construct the latent state including the variable j, the time-varying parameter R and the auxiliary variable M for switching dynamics. Notably, M is to indicate the switching dynamics of the epidemiological parameter: M = 0 indicates smooth changes, while M = 1 indicates an abrupt change. As detailed in the Methods section, the dynamics of the latent state can be described using a hierarchical transition model, where R, j and M can be estimated by DARt. Under the modelling of convolutional observation process, we test the capacity of DARt with different observation inputs and kernels. The performance of the DARt system is validated and compared to that of the state-of-the-art EpiEstim and EpiNow2 systems through simulations and real-world applications. The results confirm its power of estimation and adequacy for practical use. We have made the system available online for broad use in R estimation for both research and policy assessment.
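
As a rough illustration of how a switching indicator can drive the latent state, the sketch below draws one transition of (M, R, j). The specific distributions are assumptions made for illustration, since the actual hierarchical transition model is defined in the Methods section: M is drawn as a Bernoulli variable with a hypothetical probability `p_switch`, R follows a Gaussian random walk when M = 0 and is redrawn uniformly on [0, `r_max`] when M = 1, and j is set to its expected value under the renewal equation rather than sampled.

```python
import random


def transition(r_prev, past_infections, gen_time_pmf, rng,
               p_switch=0.05, walk_sd=0.05, r_max=5.0):
    """One illustrative step of a switching transition (all settings assumed).

    past_infections is ordered oldest to newest; gen_time_pmf[0] is the
    generation-time weight for a lag of one day."""
    # Switching indicator: 1 means an abrupt change, 0 a smooth change.
    m = 1 if rng.random() < p_switch else 0
    if m == 1:
        # Abrupt change: forget the previous value and redraw R.
        r = rng.uniform(0.0, r_max)
    else:
        # Smooth change: Gaussian random walk, kept non-negative.
        r = max(0.0, rng.gauss(r_prev, walk_sd))
    # Renewal equation: new infections are R times the past infections
    # weighted by the generation-time distribution.
    force = sum(w * j for w, j in zip(gen_time_pmf, reversed(past_infections)))
    j = r * force
    return m, r, j


rng = random.Random(1)
m, r, j = transition(1.5, [10.0, 10.0, 10.0], [0.5, 0.3, 0.2], rng)
print(m, round(r, 3), round(j, 3))
```

In a particle filter, a step like this would be applied to every particle, so the fraction of resampled particles with M = 1 gives an estimate of the probability of an abrupt change at that time.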

Validation through simulation

Due to the lack of ground-truth R in real-world epidemics, we conduct a set of simulation experiments using synthetic data for validation. Fig 2 illustrates the design of the simulation experiments, where a synthetic R is adopted as the ground truth against which the estimated R is validated. We also estimated R using the state-of-the-art R estimation packages EpiEstim [31] and EpiNow2 [18] to compare their effectiveness in overcoming the three aforementioned issues (i.e., lagging, averaging and uncertainty).

Fig 2

Validation experiment of the DARt system on simulated data.

First, the ground-truth R sequence is synthesised using a piecewise Gaussian random walk split by several abrupt change points. The sequence of incident infections j is simulated based on a renewal process parameterised by the synthetic R. The observation process includes applying a convolution kernel that represents the probabilistic observation delay to obtain the expected observation, and adding Gaussian noise that represents the reporting error to obtain the noisy ‘real’ observation C. The inputs (in grey) to the DARt system are the distribution of generation time, the observation kernel and the simulated noisy observation C. The system outputs are the estimated R, the estimated j and the change indicator M. These outputs are compared with the synthetic R, j and the times of abrupt changes. Also, the observation function is applied to the estimated j to compute the estimated observation with uncertainty, which is compared to the ‘real’ observation.


Experimental settings

In the simulation experiments, we compare the performance of DARt with that of two comparative methods: EpiEstim and EpiNow2. These two methods are applied under their default settings with a 7-day smoothing window. When applying EpiEstim, we adopt the two-step strategy that shifts C backwards in time by the median observation delay (5 days in the simulation). As the current implementation of EpiNow2 (https://github.com/epiforecasts/EpiNow2) only supports Gamma and Lognormal distributions as time delays, we set the generation time distribution to be a Gamma distribution with shape and scale equal to 4.44 and 1.89, respectively (obtained by fitting a Gamma distribution to the Weibull distribution reported by Ferretti et al. [3]). With the simulated R curve and the generation time distribution, we follow the renewal process to simulate the infection curve j (initialised to 1). Then, the observation curve of onset cases is generated using the incubation time distribution [3] (i.e., the lognormal distribution with log mean and standard deviation of 1.644 and 0.363 days, respectively) as the observation time delay. Similar to the experiments in other related work [12,13], all comparative models start estimation when the daily observation exceeds a threshold number, which is set to 10 in our experiments.

Fig 3A shows the synthetic R curve following a piecewise Gaussian random walk that mimics the scenario of two successive interventions and one resurgence. To approximate the early stage of exponential growth, the simulation starts with R_0 = 3.2 (i.e., the basic reproduction number) and follows a Gaussian random walk R_t ~ Gaussian(R_{t−1}, 0.05^2). At t = 23, we set R_23 = 1.6, indicating the mitigation outcome of soft interventions. After the soft interventions, the epidemic is still uncontrolled, and R resumes the Gaussian random walk described above. At t = 33, R decreases abruptly to a value under 1, where we set R_33 = 0.5 to indicate the suppression effects of intensive interventions (e.g., lockdown). After the epidemic has been controlled for a while, one outbreak happens at t = 83 with R_83 = 3. The evolution of R after this resurgence follows the random walk for a few days.
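
The simulation design above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' script. The parameter values (R starting at 3.2, random-walk standard deviation 0.05, jumps at t = 23, 33 and 83, the Gamma and lognormal settings, j initialised to 1) come from the text. The discretisation of the delay distributions, the 20-day kernel length, the noise scale (proportional to the square root of the expected count) and all function names are assumptions.

```python
import math
import random


def gamma_pdf(x, shape, scale):
    return math.exp((shape - 1) * math.log(x) - x / scale
                    - math.lgamma(shape) - shape * math.log(scale))


def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def discretise(pdf, max_days):
    """Evaluate a density at days 1..max_days and renormalise (a crude
    discretisation chosen for brevity)."""
    values = [pdf(d) for d in range(1, max_days + 1)]
    total = sum(values)
    return [v / total for v in values]


def lagged_sum(series, kernel, t):
    """Sum of kernel[d-1] * series[t-d] over the lags d available at time t."""
    return sum(kernel[d - 1] * series[t - d]
               for d in range(1, min(t, len(kernel)) + 1))


def simulate(n_days=100, seed=0, noise_level=1.0):
    rng = random.Random(seed)
    gen_time = discretise(lambda x: gamma_pdf(x, 4.44, 1.89), 20)
    delay = discretise(lambda x: lognormal_pdf(x, 1.644, 0.363), 20)
    jumps = {23: 1.6, 33: 0.5, 83: 3.0}  # abrupt change points from the text
    r = [3.2]
    for t in range(1, n_days):
        # Piecewise Gaussian random walk, reset at each change point.
        r.append(jumps[t] if t in jumps else max(0.0, rng.gauss(r[-1], 0.05)))
    j = [1.0]
    for t in range(1, n_days):
        # Renewal process: infections driven by R and past infections.
        j.append(r[t] * lagged_sum(j, gen_time, t))
    c = []
    for t in range(n_days):
        # Observation: delayed infections plus Gaussian reporting noise.
        expected = lagged_sum(j, delay, t)
        noise = rng.gauss(0.0, noise_level * math.sqrt(expected))
        c.append(max(0.0, expected + noise))
    return r, j, c


r, j, c = simulate()
print(len(r), r[23], r[33], r[83])
```

Setting `noise_level=0.0` returns the expected observation curve, which is convenient for checking that an estimator recovers R in the noise-free case before noise is added.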

Fig 3

Simulation results.


(A) Synthetic R, simulated j and C curves. (B) Comparison of the synthetic R (in red) with the estimated R curves from DARt, EpiEstim and EpiNow2. (C) Estimated M from DARt, indicating sharp changes of R. (D) The simulated j, j from DARt, and j from EpiNow2. (E) Comparison of the distributions of estimated C from DARt and EpiNow2, with 95% CrI, against the simulated C curve. (F) Comparison of the DARt estimated R results with and without smoothing.

To simulate the real-world noisy observations, we added Gaussian noise with zero mean and standard deviation equal to N times of . The results presented in the next section are obtained with N = 1. To further investigate the performance of all compared models under different levels of noise, we also show the results when N is chosen from {0,1,2,3} in Fig F in S1 Text. Notably, in the rest of the main manuscript both the generation and incubation time distributions are truncated and normalised, i.e., values smaller than 0.1 are discarded. A sensitivity analysis is reported in Fig G in S1 Text, showing the impact of different choices of threshold. Fig H in S1 Text presents the uncertainties resulting from different settings of the time distributions.

Simulation results

All simulation results are presented in Fig 3 and discussed as follows.

Correctness of R estimation: Fig 3B compares the synthetic R with the estimated R from DARt, EpiEstim and EpiNow2. The R from DARt matches the synthetic R better than that from the other methods, with fewer fluctuations and a faster response to abrupt changes. The results demonstrate that the proposed model can mitigate the influence of noisy observations and overcome the weakness of averaging. The probabilities of abrupt changes are captured by M, as shown in Fig 3C. Even with observation noise, DARt can still detect abrupt changes.

Correctness of j estimation: Fig 3D shows the simulated j, the DARt estimated j and the EpiNow2 estimated j. The DARt estimated j with 95% CrI matches the simulated j well. In contrast, the estimated j curve from EpiNow2 is over-smoothed, so that sharp changes in the simulated j are not captured. In particular, the peak value of j from EpiNow2 deviates greatly from the simulated value.

Accuracy in recovering observations: Fig 3E compares the distributions of reconstructed C from DARt and EpiNow2. Compared with C from EpiNow2, C from DARt with 95% CrI generally matches the simulated C well.

Effectiveness of DARt smoothing: Fig 3F illustrates the effectiveness of backward smoothing by comparing the DARt estimated R results with and without smoothing, showing the expected smoothing effect on the estimated R with reduced CrI. The results from DARt without smoothing are clearly affected by local fluctuations, which are probably due to observation noise. With the introduction of smoothing, both the uncertainties and the local fluctuations are reduced.

Applicability to real-world data

We applied DARt to estimate R in four different regions during the emerging pandemic. Each region represents a distinct epidemic dynamic, allowing us to test the effectiveness and robustness of DARt in different scenarios.
1) Wuhan: When the outbreak of COVID-19 happened in Wuhan, the government responded with very stringent interventions such as a total lockdown. By studying the evolution of its R, we can check the capability of DARt in detecting abrupt changes of R.
2) Hong Kong: The daily increase of reported cases in Hong Kong has remained at a low level for most of the time, with the maximum daily cases under 200. As no stringent interventions have been introduced in this densely populated city, Hong Kong offers an ideal scenario for studying the change of R under mild interventions.
3) United Kingdom: The daily infection number in the UK changed significantly this year, varying from 2,000 to 6,000. The UK was one of the first countries to initiate a mass immunisation campaign; therefore, its instantaneous R is a useful metric for checking the real-world efficacy of vaccination on the way towards ‘herd immunity’.
4) Sweden: Sweden is representative of countries with less stringent intervention policies; its data show a clear misreporting pattern that repeats periodically. This makes Sweden an ideal case for examining the robustness of DARt to considerable observation noise.

Epidemic dynamics in these four regions

The inference results for R and reconstructed observations for these four regions are shown in Fig 4. For Wuhan, the observation data are the number of onset cases compiled retrospectively from epidemic surveys, while for Hong Kong, UK and Sweden, the observation data are the number of reported confirmed cases. Notably, we use the onset-to-confirmed delay distribution from [33] together with the distribution of incubation time proposed in [3] to approximate the observation delay. As the ground-truth R is not available, we validate the results by checking whether the estimated distributions of observation match well with the observation curve of C. As shown in the top panel of each subplot in Fig 4, the CrIs of estimated C distributions (in blue) match most parts of the original observations (in yellow), confirming the reliability of our R estimation.
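
The agreement between the estimated observation distributions and the reported curve is assessed from the plots. One way such a check could be quantified, offered here only as an illustrative assumption and not as a step reported by the authors, is the fraction of days on which the reported value falls inside the 95% CrI computed from posterior samples. The function names and the sample layout (one list of samples per day) are hypothetical.

```python
import math


def credible_interval(samples, level=0.95):
    """Equal-tailed interval from sorted samples, using conservative
    (outward-rounded) order statistics."""
    ordered = sorted(samples)
    n = len(ordered)
    lo_idx = int(math.floor((1 - level) / 2 * (n - 1)))
    hi_idx = int(math.ceil((1 + level) / 2 * (n - 1)))
    return ordered[lo_idx], ordered[hi_idx]


def coverage(observed, samples_by_day, level=0.95):
    """Fraction of observed values lying inside that day's interval."""
    inside = 0
    for value, samples in zip(observed, samples_by_day):
        lo, hi = credible_interval(samples, level)
        if lo <= value <= hi:
            inside += 1
    return inside / len(observed)


# Two days with identical posterior samples 0..100: the first observation
# falls inside the interval, the second does not.
samples = [list(range(101)), list(range(101))]
print(coverage([50, 200], samples))
```

A coverage rate well below the nominal level would suggest that the estimated intervals are too narrow or that the observation model is misspecified.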

Fig 4

Epidemic dynamics in Wuhan (A), Hong Kong (B), United Kingdom (C) and Sweden (D). The top row of each subplot shows the number of daily observations (in yellow), the estimated daily observations (in blue) and the estimated daily infections (in green). The middle row compares the DARt R estimation (in black) with the EpiEstim results (in blue) and the EpiNow2 results (in yellow). The distributions of all estimated R are shown with 95% CrI. The bottom row shows the probabilities of abrupt changes (M = 1) (in green bars).

Fig 4A shows the results of testing using Wuhan’s onset data [1]. We observe a sharp decrease in Wuhan’s R after January 21, 2020, which is also indicated by M, as the probability of abrupt changes peaked at this time (in green bars). A strict lockdown was enforced in Wuhan from January 23, 2020. The sharp decrease in Wuhan’s R is likely to be the result of this intervention. The small offset between the exact lockdown date and the time of the sharp decrease might be due to noisy onset observations and the approximated incubation time distribution. After the lockdown, R decreased smoothly, indicating that people’s increasing awareness of the disease and the precautionary measures taken had made an impact. From the beginning of February 2020, the value of R remained below 1 for most of the time, with the enforcement of the quarantine policy and an increase in hospital beds to accept all diagnosed patients. It is noted that the onset curve has a peak on February 1, 2020, due to a major correction in the reporting standard. Neither the R nor the j curve from our model was severely affected by this fluctuation, highlighting the robustness of our model thanks to the smoothing mechanism. The results from Wuhan suggest that our switching mechanism can address the issue of averaging and automatically detect sharp changes in epidemiological dynamics. The results from DARt are also compared with results from EpiEstim and EpiNow2.
EpiEstim generates results with significant local variations and delays, while EpiNow2 derives a smooth R curve with no obvious delays. However, the immediate impact of the lockdown is not well detected by EpiNow2.

Fig 4B shows the results inferred from Hong Kong’s reported confirmed cases [34] during the most recent outbreak, from November 2020 to March 2021. In Hong Kong, the number of infections remained low for most of the time and the government continuously imposed soft interventions. In the middle of November 2020, a newly imported case triggered a new outbreak, resulting in a large R value. However, R soon returned to around 1, as the government further tightened social distancing measures at that time. From late January 2021, Hong Kong started to implement mandatory lockdowns in restricted areas. Since then, the number of daily cases has remained at a low level. The results from DARt show a trend similar to those from EpiEstim and EpiNow2. The delays in the results of EpiEstim are still significant. We also investigate the performance of DARt using different types of observations and present the results of R estimation using onsets and confirmed cases in S1 Text.

Fig 4C shows the inference results from the United Kingdom’s reported confirmed cases [35]. The United Kingdom was one of the first countries in the world to authorise the emergency use of COVID-19 vaccines, and it rolled out its COVID-19 mass vaccination programme in early December 2020. By mid-February 2021, the United Kingdom had hit its target of 15 million first-dose COVID-19 vaccinations, encompassing the top four priority groups for vaccination. As of April 22, 2021, the UK had reached its target of 33 million (63% of adults) first-dose COVID-19 vaccinations and 11 million (21% of adults) second doses.
Three months after the mass vaccination programme started, the number of infection cases was on a continuous downward trend. However, after the further easing of COVID-19 restrictions in mid-May 2021, R gradually increased. During the Euro 2020 football tournament (from June 11 to July 11), the R value remained above 1. R decreased immediately after Euro 2020 finished and has since remained around 1. The results from DARt, EpiEstim and EpiNow2 are generally consistent. However, DARt accurately detected and responded to the impact of the completion of Euro 2020 in mid-July. In addition to studying the whole country, we applied DARt to typical cities in England to investigate the local epidemic dynamics as well (please refer to section 4 of S1 Text).

Fig 4D shows the inferred epidemic dynamics in Sweden from the daily reported data [36]. The daily reported cases in Sweden showed dramatic local fluctuations that were likely caused by misreporting. The reported cases dropped to 0 on Saturdays, Sundays and Mondays. Fluctuations of this kind in the observations could induce spurious fluctuations in R curves. Therefore, we used Sweden’s data to further test the robustness of our scheme in the presence of undesirable local fluctuations in observations. The results suggest that the influence of such periodic fluctuations is smoothed out by DARt and EpiNow2 to yield a consistent R curve, whereas the results from EpiEstim show significant local fluctuations.

To summarise, DARt has been applied to four different regions to investigate the transmission dynamics of COVID-19 and to demonstrate its real-world applicability and effectiveness.
Consistent with the findings in the simulation study, DARt has shown its advantages in the following aspects: 1) Instantaneity: DARt adopts a window-free sequential Bayesian inference approach to detect and indicate abrupt epidemic changes; 2) Robustness: with Bayesian smoothing, the R curve from DARt is stable in the presence of observation noise; 3) Temporal accuracy: DARt performs a joint estimation of R and j by explicitly encoding the lag into observation kernels.

Discussion and conclusion

In this paper, we have proposed a Bayesian data assimilation scheme for estimating time-varying epidemiological parameters based on observations. To study a real-world application scenario, we focus on estimating R and provide a state-of-the-art R estimation tool, DARt, which supports a wide range of observations. In the DARt system, epidemic states can be updated using newly observed data, following a data assimilation process in the framework of sequential Bayesian updating. For the model inference, a particle filtering/smoothing method is used to approximate the R distribution in both the forward and backward directions of time, ensuring that the R at each time step assimilates information from all time points. By taking the Bayesian approach, we have handled the uncertainty in R estimation by accommodating observation uncertainty in the likelihood mapping, and introduced Bayesian smoothing to incorporate sufficient information from observations. Our method provides a smooth R curve together with its posterior distribution. We have demonstrated that the inferred R curves can describe different observations accurately. Our work is important not only in revealing the epidemic dynamics but also in assessing the impact of interventions. The sequential inference mechanism of R estimation takes into account the accuracy of time alignment and provides an abrupt change indicator. Unlike approaches that directly incorporate interventions as co-factors into the epidemic model [13,37], our method offers a promising alternative for intervention assessment.

We have made some approximations to facilitate the implementation. First, the observation time and generation time distributions are truncated to fixed and identical lengths. Theoretically, these two distributions can be of any length, while most values are quite small in practice. In our state transition model, one variable of the latent state is a vectorised form of infection numbers over a period.
The purpose of vectorisation is to facilitate implementation by making the transition process Markovian. The length of this vector variable is determined by the lengths of the effective observation time and generation time distributions. Truncating these two distributions to a limited length, by discarding small values, facilitates vectorisation. Apart from truncation, we have assumed that these two distributions do not change over the course of the epidemic. However, as we have discussed previously [23], introducing interventions, such as an increased testing capacity, would affect the observation time. The distribution of generation time would also change as the virus evolves. It is possible to extend our model by adding a time-varying observation function. For example, the testing capability and time-varying mortality rate could also be considered in the observation process.

Second, we approximate the variance of observation error empirically. Given that the variance of observations is unknown and could change over time and across regions, the standard deviation of the Gaussian likelihood function is not set to a fixed value in our scheme. Instead, we estimate the region-specific, time-varying observation variance from the observational data. Although the empirical estimation yielded reasonable results for the four regions and the cities in the UK (see S1 Text), it may generate implausible results in some scenarios, for example, when the epidemic is growing or resurging explosively, leading to an overestimation of observation variance. Adaptive inference of the error variance would be needed to tackle this issue.

The third approximation is implicit in the use of a particle filter to approximate the posterior distributions over model state variables, including Rt, with a limited number of samples (i.e., particles). Particle filtering makes no assumption about the form of posterior distributions.
By contrast, the variational equivalent of the particle filter, namely variational filtering [38], provides an analytical approximation to the posterior probability and can be regarded as the limiting solution of an idealised particle filter with an infinite number of particles [39]. Considering the importance of both the mean value of Rt and its estimation uncertainty for advising governments on policymaking, an analytical approximation is desirable to help properly quantify uncertainty.

Finally, change detection is approximated by the change indicator M, which is included as part of the latent state and inferred during particle filtering. This work opens an avenue to explore variational Bayesian inference for switching state models [40]. Crucially, variational procedures enable us to assess model evidence (a.k.a. marginal likelihood) and hence allow automatic model selection. Examples of variational Bayes and model comparison to optimise the parameters and structure of epidemic models can be found in previous studies [41]. These variational procedures can be effectively applied to change detection.

In conclusion, our work provides a practical scheme for accurate and robust estimation of time-varying epidemiological parameters. It opens a new avenue to study epidemic dynamics within the Bayesian data assimilation framework. We provide an open-source Rt estimation package as well as an associated Web service that may facilitate other researchers' work in computational epidemiology and practical use in policy development and impact assessment.

Methods

The proposed Bayesian data assimilation framework for estimating epidemiological parameters includes three main components: 1) a state transition model, describing the evolution of the latent state; 2) an observation function, defining the observation process and describing the relationship between the latent state and observations; 3) a sequential Bayesian engine, estimating time-varying model parameters with uncertainty by assimilating the prior state information provided by the transition model and the newly available observation. In this section, we introduce a real-world application of the proposed data assimilation framework to estimate one of the key epidemiological parameters, $R_t$. The epidemic dynamics are characterised by the renewal process, which is the foundation of our state transition model. We then describe the observation function, linking a sequence of infection numbers with the observation data. Next, we present a detailed state transition model and propose the sequential Bayesian update module.

1. Renewal process for modelling epidemic dynamics

Common $R_t$ estimation methods include compartment model-based methods (e.g., SIR and SEIR [42]) and time-since-infection models based on the renewal process [31]. Their relationships are discussed in Section 1.1 of S1 Text. Comparative studies in [21] show that EpiEstim, one of the renewal process-based methods, outperforms other methods in terms of accuracy and timeliness. Given the renewal process, the key transition equation derived from the process is:

$$j_t = R_t \sum_{k=1}^{T_w} w_k \, j_{t-k} \qquad (1)$$

where $j_t$ is the number of incident infection cases on day $t$, $T_w$ is the time span of the set $\{w_k\}$, and each $w_k$ is the probability that a secondary infection occurs $k$ days after the primary infection, describing the distribution of generation time [10]. The profile of $w_k$ is related to the biological characteristics of the virus and is generally assumed to be time-independent during the epidemic. Considering the simplicity and superior performance of applying the renewal process to model epidemic dynamics, our work adopts Eq (1) as the basic transition function for joint estimation of $R_t$ and $j_t$.
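As an illustration of the renewal equation, the following minimal Python sketch rolls Eq (1) forward deterministically. The function names, the three-day generation-time distribution and the seed value are hypothetical choices made for this example; they are not taken from the DARt code.

```python
def expected_infections(r_t, w, past_infections):
    """Renewal equation: R_t * sum_k w_k * j_{t-k}.

    w[k - 1] is the generation-time probability for a lag of k days;
    past_infections[-k] is j_{t-k} (most recent day last).
    """
    lags = min(len(w), len(past_infections))
    return r_t * sum(w[k - 1] * past_infections[-k] for k in range(1, lags + 1))


def simulate(r_series, w, seed_infections):
    """Deterministic roll-out of the renewal equation (no sampling noise)."""
    infections = list(seed_infections)
    for r_t in r_series:
        infections.append(expected_infections(r_t, w, infections))
    return infections


# Hypothetical generation-time distribution over lags of 1, 2 and 3 days.
w = [0.2, 0.5, 0.3]
curve = simulate([2.0, 2.0, 0.5], w, seed_infections=[10.0])
```

With these illustrative inputs, infections rise while the reproduction number is 2 and fall once it drops to 0.5, which is the qualitative behaviour the renewal process is meant to capture.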

2. Observation process

In epidemiology, the daily infection number $j_t$ cannot be measured directly but is reflected in observations such as the case reports of onset, confirmed infections and deaths. There is an inevitable time delay between the real date of infection and the date of reporting, due to the incubation time, report delay, etc. Taking account of this time delay, we model the observation process as a convolution between a kernel $\varphi$ and the infection numbers in the $T_\varphi$ most recent days:

$$C_t = \sum_{k=d}^{T_\varphi} \varphi_k \, j_{t-k} \qquad (2)$$

where $C_t$ is the observation data, and $\varphi_k$ is the probability that an infected individual is detected $k$ days after infection. $T_\varphi$ is the maximum dependency window. It is assumed that the daily infections before this window do not affect the current observation $C_t$. Since there is a delay between infection and observation, we suppose the most recent infection that can be observed by $C_t$ is at the time $t-d$, where $d$ is a constant determined by the distribution of observation delay.

To accommodate various observation types (e.g., the number of daily reported cases, onsets, deaths and infected cases), DARt chooses the appropriate time delay distribution accordingly. For example, for the input of onsets, the infection-to-onset time distribution is chosen as the kernel in the observation function. For the input of daily reported cases, the infection-to-onset and the onset-to-report delays are used together as the kernel in the observation function. These delay distributions can either be obtained directly from the literature or be inferred from case reports that contain individual observation delays [3]. Detailed descriptions of the observation functions for different epidemic curves can be found in S1 Text (Section 1.2).
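The observation process can be sketched as a delayed convolution. This is a minimal illustration that assumes a hypothetical kernel `phi` indexed by delay in days and a minimum delay `d`; neither the values nor the function name come from the DARt implementation.

```python
def expected_observation(infections, phi, d):
    """Expected observation on the last day t of `infections`.

    Computes sum over k = d .. len(phi) - 1 of phi[k] * j_{t-k}, where
    phi[k] is the probability that an infection is observed k days later.
    Entries phi[0 .. d-1] are ignored because infections more recent than
    t - d cannot be observed yet.
    """
    t = len(infections) - 1
    total = 0.0
    for k in range(d, len(phi)):
        if t - k < 0:  # ran out of infection history
            break
        total += phi[k] * infections[t - k]
    return total


# Hypothetical delay kernel with a maximum dependency window of 4 days.
phi = [0.0, 0.0, 0.3, 0.5, 0.2]
infections = [100, 80, 60, 40, 20, 10]
c_t = expected_observation(infections, phi, d=2)
```

Here the two most recent days (20 and 10 infections) do not contribute at all, which is the time misalignment that the joint inference of infections and the reproduction number is designed to handle.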

3. Sequential Bayesian Inference

In Fig 5, we illustrate the Bayesian inference scheme of DARt with the following details.

Fig 5

Three components of DARt inference model: state transition model, observation function and sequential Bayesian update module with two phases (forward filtering and backward smoothing).

The latent state that can be observed in $C_t$ is defined as $X_t = \langle R_{t^*}, M_{t^*}, \vec{j}_{t^*} \rangle$, where $R_{t^*}$ is the instantaneous reproduction number, $M_{t^*}$ is a binary state variable indicating different evolution patterns of $R$, and $\vec{j}_{t^*} = [j_{t^*-T_j+1}, \ldots, j_{t^*}]$ is a vectorised form of the infection numbers $j$. Here $t^*$ denotes the time of the most recent infection that can be detected at time $t$, given the observation delay, and $T_j$ is the length of the vector, chosen so that $C_t$ depends only on $\vec{j}_{t^*}$ and $j_{t^*}$ depends only on $\vec{j}_{t^*-1}$ via the renewal process.

State transition model

In our model, the indirectly observable variables $j_t$ and $R_t$ are included in the latent state. The state transition function for $R_t$ is commonly assumed to follow a Gaussian random walk [18] or to be constant within a sliding window, as implemented in EpiEstim. Such simplifications cannot capture an abrupt change in $R_t$ under stringent intervention measures. To capture such abrupt changes, we introduce an auxiliary binary latent variable $M_t$ to indicate the switching dynamics of the epidemiological parameters under interventions, without assuming a pre-defined evolution pattern (e.g., constant or exponential decay). $M_t = 0$ indicates a smooth evolution corresponding to minimal or consistent interventions; $M_t = 1$ indicates an abrupt change of $R_t$ corresponding to new interventions or an outbreak. The smooth evolution is modelled as a Gaussian random walk, while the abrupt change is captured by resetting the parameter memory, assuming a uniform probability distribution for the next time step of estimation. Doing so provides an automatic way of framing a new epidemic period, which was done manually in [13]. Given $M_t$, the transition of $R_t$ is modelled as:

$$p(R_t \mid R_{t-1}, M_t) = \begin{cases} \mathcal{N}(R_{t-1}, \sigma_R^2), & M_t = 0 \\ U[0, R_{t-1} + \Lambda], & M_t = 1 \end{cases} \qquad (3)$$

where $\mathcal{N}(R_{t-1}, \sigma_R^2)$ is a Gaussian distribution with mean $R_{t-1}$ and variance $\sigma_R^2$, describing the random walk with the randomness controlled by $\sigma_R$. $U[0, R_{t-1}+\Lambda]$ is a uniform distribution between 0 and $R_{t-1}+\Lambda$, allowing a sharp decrease while limiting the amount of increase. This is because we assume that $R_t$ can decrease significantly when an intervention is introduced, but is unlikely to increase dramatically, as the characteristics of the disease would not change instantly. The transition of the change indicator $M_t$ is modelled as a discrete Markovian process with fixed transition probabilities, which control the sensitivity of change detection:

$$p(M_t \mid M_{t-1}) = \begin{cases} \alpha, & M_t = 0 \\ 1 - \alpha, & M_t = 1 \end{cases} \qquad (4)$$

where $\alpha$ is a value close to and lower than 1.
The above function means that the value of $M_t$ is independent of $M_{t-1}$, while the probability of Mode II (i.e., $M_t = 1$) is quite small. This is because frequent abrupt changes in $R_t$ are unlikely. For the incident infection $j_t$, the state transition can be modelled based on Eq (1) as a distribution $p(j_t \mid j_{t-1}, \ldots, j_{t-T_w}, R_t)$ whose expectation is $R_t \sum_{k=1}^{T_w} w_k \, j_{t-k}$. To make the transition process Markovian, we vectorise the infection numbers as follows. Suppose the infection numbers that can be observed in $C_t$ are all included in $\vec{j}_{t^*}$, where $t^* = t-d$, and the length of this vector, $T_j$, is larger than or equal to $T_\varphi-d+1$. We also require $T_j$ to be not smaller than $T_w$. Therefore, all the historical information needed to infer $j_{t^*}$ is available from $\vec{j}_{t^*-1}$, i.e., $\vec{j}_{t^*}$ only depends on $\vec{j}_{t^*-1}$ (the transition is Markovian). The state transition process and observation process are illustrated in Fig 6.
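The switching transition described above can be sketched in a few lines of Python. This is a hedged illustration rather than the DARt implementation: the function names are hypothetical, clipping the random walk at zero is an assumption added here to keep the reproduction number non-negative, and the new infection count is set to its renewal-equation mean instead of being sampled, because the sampling distribution is not recoverable from this text.

```python
import random


def step_mode(alpha, rng):
    """Change indicator: 0 with probability alpha, 1 otherwise.

    The draw does not depend on the previous value of the indicator.
    """
    return 0 if rng.random() < alpha else 1


def step_r(r_prev, mode, sigma, lam, rng):
    """Reproduction number transition under the two modes."""
    if mode == 0:
        # Smooth mode: Gaussian random walk (clipping at 0 is an assumption).
        return max(0.0, rng.gauss(r_prev, sigma))
    # Abrupt mode: reset memory with a uniform draw on [0, r_prev + lam].
    return rng.uniform(0.0, r_prev + lam)


def step_infections(j_vec, r_t, w):
    """Shift the infection vector by one day (j_vec[0] is the most recent day).

    The newest entry is the renewal-equation mean; older entries are copied,
    and the oldest one is dropped so the vector keeps a fixed length.
    """
    newest = r_t * sum(w_k * j_k for w_k, j_k in zip(w, j_vec))
    return [newest] + j_vec[:-1]


rng = random.Random(0)
mode = step_mode(alpha=0.95, rng=rng)
r_new = step_r(r_prev=1.5, mode=mode, sigma=0.1, lam=0.5, rng=rng)
j_new = step_infections([4.0, 10.0], r_t=2.0, w=[0.2, 0.5])
```

Copying the older entries unchanged is what makes the vectorised state Markovian: everything needed for the next step is carried inside the current state.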

Fig 6

Illustration of the hierarchical transition process and observation process.

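The most recent infection that can be observed by $C_t$ is at the time $t^* = t-d$, where $d$ is a constant determined by the distribution of observation delay. Suppose $T_j$ is the length of the vector $\vec{j}_{t^*}$ such that $C_t$ depends only on $\vec{j}_{t^*}$ and $j_{t^*}$ depends only on $\vec{j}_{t^*-1}$ via the renewal process. Therefore, $T_j \geq \max(T_w, T_\varphi-d+1)$. The case $T_j = T_\varphi-d+1$ is depicted in this figure.

The latent state in our model is then defined as $X_t = \langle R_{t^*}, M_{t^*}, \vec{j}_{t^*} \rangle$, which contributes to $C_t$ at time $t$. The state transition function of $\vec{j}_{t^*}$ is therefore Markovian:

$$p(\vec{j}_{t^*} \mid \vec{j}_{t^*-1}, R_{t^*}) = p(j_{t^*} \mid \vec{j}_{t^*-1}, R_{t^*}) \prod_{m=1}^{T_j-1} \delta\big(\vec{j}_{t^*}^{(m)}, \vec{j}_{t^*-1}^{(m+1)}\big) \qquad (5)$$

where $\vec{j}_{t^*}^{(m)}$ is the $m$-th component of the latent variable $\vec{j}_{t^*}$ and $\delta(x, y)$ is the Kronecker delta function (please refer to S1 Text for more details). With Eqs (3)–(5), evaluated at time $t^*$, the latent state transition function $p(X_t \mid X_{t-1})$ can be obtained as a Markov process:

$$p(X_t \mid X_{t-1}) = p(\vec{j}_{t^*} \mid \vec{j}_{t^*-1}, R_{t^*}) \, p(R_{t^*} \mid R_{t^*-1}, M_{t^*}) \, p(M_{t^*} \mid M_{t^*-1}) \qquad (6)$$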

Forward filtering

We formulate the inference of the latent state $X_t = \langle R_{t^*}, M_{t^*}, \vec{j}_{t^*} \rangle$ from the observations $C_t$ within a data assimilation framework. A sequential Bayesian filtering approach is adopted to infer the time-varying latent state, updating the posterior estimate with the latest observations following Bayes' rule. Unlike Bayesian inference of static parameters, where the prior is fixed, the prior here is updated at every time step. This filtering mechanism computes the posterior distribution of the latent state by assimilating the forecast from the forward transition model with the information from the new epidemiological observations. For the implementation of this Bayesian updating process, we adopt a particle filter method [26] to efficiently approximate the posterior distribution through Sequential Monte Carlo (SMC) sampling. This eschews any fixed-form assumptions for the posterior, of the sort used in variational filtering and dynamic causal modelling [38].

Let us denote the observation history between time 1 and $t$ as $C_{1:t} = [C_1, C_2, \ldots, C_t]$. Given the previous estimate $p(X_{t-1} \mid C_{1:t-1})$ and the new observation $C_t$, we would like to update the estimate of $X_t$, i.e., $p(X_t \mid C_{1:t})$, following Bayes' rule with the assumption that $C_t$ is conditionally independent of $C_{1:t-1}$ given $X_t$:

$$p(X_t \mid C_{1:t}) = \frac{p(C_t \mid X_t) \, p(X_t \mid C_{1:t-1})}{p(C_t \mid C_{1:t-1})} \qquad (7)$$

where $p(X_t \mid C_{1:t-1})$ is the prior and $p(C_t \mid X_t)$ is the likelihood. The prior can be written in the marginalised format:

$$p(X_t \mid C_{1:t-1}) = \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid C_{1:t-1}) \, dX_{t-1} \qquad (8)$$

where $X_t$ is assumed to be conditionally independent of $C_{1:t-1}$ given $X_{t-1}$, and the transition $p(X_t \mid X_{t-1})$ is defined in Eq (6) based on the underlying renewal process. The likelihood $p(C_t \mid X_t)$ can be calculated assuming the observation uncertainty follows a Gaussian distribution:

$$p(C_t \mid X_t) = \mathcal{N}\big(C_t; H(X_t), \sigma_C^2\big) \qquad (9)$$

where $H$ is the observation function with a kernel chosen according to the type of observation, and $\sigma_C^2$ is the variance of observation error, estimated empirically. To show the benefits of using this Gaussian likelihood function, we also ran simulations using a Poisson likelihood that does not account for observation noise.
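The results can be found in Fig B in S1 Text, where the estimates fluctuate dramatically under noisy observations. By substituting Eq (8) into Eq (7), we obtain the iterative update of $p(X_t \mid C_{1:t})$ given the transition $p(X_t \mid X_{t-1})$ and likelihood $p(C_t \mid X_t)$:

$$p(X_t \mid C_{1:t}) \propto p(C_t \mid X_t) \int p(X_t \mid X_{t-1}) \, p(X_{t-1} \mid C_{1:t-1}) \, dX_{t-1} \qquad (10)$$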
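The predict, weight and resample cycle behind this update can be illustrated with a generic bootstrap particle filter. The sketch below is a toy under stated assumptions, not the DARt code: the state is reduced to a single reproduction number, the observation model (`100.0 * r`) and all numeric settings are hypothetical, and multinomial resampling is chosen only for brevity.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def filter_step(particles, observation, transition, observe, obs_sd, rng):
    """One predict/update/resample cycle of a bootstrap particle filter."""
    # Predict: push every particle through the state transition.
    predicted = [transition(p, rng) for p in particles]
    # Update: weight each particle by the Gaussian observation likelihood.
    weights = [gaussian_pdf(observation, observe(p), obs_sd) for p in predicted]
    if sum(weights) == 0.0:
        return predicted  # degenerate case: keep the unweighted forecast
    # Resample in proportion to the weights (multinomial resampling).
    return rng.choices(predicted, weights=weights, k=len(predicted))


rng = random.Random(1)
particles = [rng.uniform(0.0, 3.0) for _ in range(2000)]  # flat prior
for _ in range(5):
    particles = filter_step(
        particles,
        observation=150.0,
        transition=lambda r, g: max(0.0, g.gauss(r, 0.05)),  # random walk
        observe=lambda r: 100.0 * r,  # toy observation model
        obs_sd=10.0,
        rng=rng,
    )
posterior_mean = sum(particles) / len(particles)
```

After a few repeated observations of 150 cases, the particle cloud concentrates around a reproduction number of about 1.5, and the spread of the particles is the filter's estimate of the remaining uncertainty.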

Backward smoothing

The estimate $p(X_t \mid C_{1:t})$ from the aforementioned forward filtering only includes the past and present information flows, corresponding to the prior $p(X_t \mid C_{1:t-1})$ and the likelihood $p(C_t \mid X_t)$, respectively. The filtering estimates would be accurate if all related infections were fully observed in $C_{1:t}$. However, this is not the case, owing to the observation delay. To reduce the uncertainty from forward filtering, we adopt the Bayesian backward smoothing technique, estimating the latent state at a time $t$ retrospectively, given all observations available up to time $T$ ($T > t$). Compared with other parameter estimation methods [13], Bayesian data assimilation takes advantage of this additional information to smooth the inference results and reduce the uncertainty caused by incomplete observations. More specifically, the smoothing mechanism can be described as follows: given a sequence of observations $C_{1:T}$ up to time $T$ and the filtering results $p(X_t \mid C_{1:t})$, for all times $t < T$ the smoothed posterior follows the standard backward recursion:

$$p(X_t \mid C_{1:T}) = p(X_t \mid C_{1:t}) \int \frac{p(X_{t+1} \mid X_t) \, p(X_{t+1} \mid C_{1:T})}{p(X_{t+1} \mid C_{1:t})} \, dX_{t+1}$$
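One common way to realise backward smoothing with particles is backward simulation, sketched below. The text here does not say which particle smoother DARt uses, so this is an assumed variant shown for illustration only; the function names and the random-walk transition density are hypothetical.

```python
import math
import random


def random_walk_pdf(x_next, x_prev, sd=0.2):
    """Unnormalised Gaussian random-walk density of x_next given x_prev."""
    return math.exp(-0.5 * ((x_next - x_prev) / sd) ** 2)


def backward_sample(filtered, transition_pdf, rng):
    """Draw one smoothed trajectory from stored filtering particles.

    filtered[t] is a list of equally weighted particles from the forward
    pass. Starting from the last time step, each earlier state is drawn
    with probability proportional to how well it explains the state
    already chosen for the following step.
    """
    path = [rng.choice(filtered[-1])]
    for t in range(len(filtered) - 2, -1, -1):
        nxt = path[-1]
        weights = [transition_pdf(nxt, p) for p in filtered[t]]
        if sum(weights) == 0.0:
            weights = None  # fall back to a uniform draw
        path.append(rng.choices(filtered[t], weights=weights, k=1)[0])
    path.reverse()  # return the trajectory in chronological order
    return path


# Two filtering particles at the first step, one at the second.
filtered = [[0.0, 5.0], [5.1]]
path = backward_sample(filtered, random_walk_pdf, random.Random(0))
```

In this toy case the later state (5.1) makes the earlier particle 0.0 implausible under the random walk, so the smoothed trajectory selects 5.0: information from later observations sharpens earlier estimates.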

Data

We obtained daily onset or confirmed cases of four different regions (Wuhan, Hong Kong, Sweden, UK) from publicly available sources [1,34-36]. For Wuhan, we adopted the daily number of onset patients from the retrospective study [1] (from the end of December 2019 to early March 2020). For the UK, we downloaded the daily reported cases (cases by date reported) from the official UK Government website for data and insights on Coronavirus (COVID-19) [35] (from the start of January 2021 to the end of August 2021), accessed on 30 August 2021. Data for UK cities were downloaded from the same source [35] (from the start of January 2021 to the start of September 2021), accessed on 2 September 2021. For Sweden, we downloaded the daily number of confirmed cases from the Our World in Data COVID-19 dataset [36] (from the middle of January 2021 to the start of September 2021), accessed on 2 September 2021. For Hong Kong, we downloaded the case reports from the government website [34] (from the end of November 2020 to the end of March 2021), which include descriptive details of each confirmed case of COVID-19 infection in Hong Kong. For asymptomatic patients, whose onset date is unknown, we set the onset date to the reported date; records whose onset date is unclear were removed. Only local cases and their related cases are considered, while imported cases and their related cases are excluded.

We release DARt as open-source software for epidemic research and for intervention policy design and monitoring. The source code of our method and our web service is publicly available online (https://github.com/Kerr93/DARt).
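The record-cleaning rules for the Hong Kong case reports can be expressed as a short filter. The field names (`origin`, `onset`, `reported`) and the way asymptomatic or unclear onsets are encoded are hypothetical; the actual government files use their own schema.

```python
def clean_case_records(records):
    """Apply the onset-date rules to a list of case records.

    - Imported cases (and cases related to them) are dropped.
    - Asymptomatic cases get their reported date as the onset date.
    - Cases whose onset date is unclear (None here) are dropped.
    """
    cleaned = []
    for rec in records:
        if rec["origin"] != "local":
            continue
        if rec["onset"] == "asymptomatic":
            cleaned.append({**rec, "onset": rec["reported"]})
        elif rec["onset"] is None:
            continue
        else:
            cleaned.append(rec)
    return cleaned


# Hypothetical records for illustration only.
records = [
    {"id": 1, "origin": "local", "onset": "2021-01-03", "reported": "2021-01-06"},
    {"id": 2, "origin": "local", "onset": "asymptomatic", "reported": "2021-01-07"},
    {"id": 3, "origin": "local", "onset": None, "reported": "2021-01-08"},
    {"id": 4, "origin": "imported", "onset": "2021-01-02", "reported": "2021-01-05"},
]
cleaned = clean_case_records(records)
```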

Supplementary document containing some supporting information.

Fig A. Illustrations of three types of observations and the corresponding distributions of delay between the real infection date and observation.

Fig B. Comparison between the simulation results using Poisson likelihood and Gaussian likelihood in DARt (both with 95% CrI).

Fig C. Comparison between the estimated daily infections. The infection estimated by DARt is drawn in black with 95% CrI. The ground-truth simulated infection is in red and the back-calculated infection is in yellow.

Fig D. Comparison of estimated R curves of Hong Kong using different observations. Subplot A) shows R estimations (in black) from confirmed cases (in yellow). Subplot B) shows R estimations (in black) from daily onset (in yellow).

Fig E. Epidemic dynamics in London, Leicester, Birmingham, Liverpool, Manchester, Sheffield, and Leeds. The top row of each subplot shows the number of daily observations (in yellow), the estimated daily observations (in blue) and the estimated daily infections (in green). The middle row shows the DARt results of the R curve with 95% CrI (in black), while the probability of having abrupt changes (i.e., M = 1) is shown in the bottom row (in green).

Fig F. The R estimation results under different levels of observation noise: A) N = 0, B) N = 1, C) N = 2 and D) N = 3, where the added Gaussian noise has a standard deviation equal to N times the unperturbed observation.

Fig G. The R estimation results of DARt with different truncation thresholds: A) 0.01, B) 0.05 and C) 0.1.

Fig H. A) The R estimation results of DARt obtained from the generation time and observation delay distributions with uncertainties. B) The R estimation results of DARt obtained from the generation time distribution following a Lognormal distribution.

Table A. Transition probabilities of M.

Table B. Simulation results using synthetic data in the main manuscript. ΔR-mean/ΔJ-mean and ΔR-sd/ΔJ-sd are the mean and standard deviation of the differences between synthetic R/J and estimated R/J.
Since EpiEstim does not estimate J, we leave the corresponding values as NA. (PDF)

6 Aug 2021

Dear Dr Yang,

Thank you very much for submitting your manuscript "Bayesian data assimilation for estimating transmission dynamics in computational epidemiology" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.
Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alison L. Hill
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The review is uploaded as an attachment

Reviewer #2: A summary of my comments (detailed below):

- The method described in this paper is interesting and promising, given its success at handling the issues it sets out to address (lags, uncertainty and averaging), and the demonstrated comparisons to results output from EpiEstim.
- It is necessary to add a thorough comparison to EpiNow.
- The deconvolution discussion and comparisons should be extended, and the accuracy of Supplementary Figure 3 should be carefully checked (and commented on, depending on the results of the check).
- This paper needs to consider uncertainty from potential misspecification of the generation interval distribution and the observation delay distribution.
- The analysis of the sensitivity of the method to truncation needs to be separate from the issue of adding variance to the estimates of the parameters for the generation interval distribution and observation delay distribution.
- Some issues with writing, organization, and figures should be fixed.

The instantaneous reproductive number Rt has become a ubiquitous tool for assessing the state of the COVID-19 pandemic in different regions across the globe. It has been used to inform policy decisions such as whether to impose or lift restrictions like lockdowns and mask mandates, and to determine the effect of such interventions.
Yet computing Rt from the data that is typically available to researchers during the course of the pandemic is nontrivial because of the forward-looking nature of the interpretation of Rt as the average number of secondary infections a person who is infectious at time t will produce in the next timestep, a measure which approximates the number of secondary infections a person infected at time t will produce over the course of their infection using only data accessible up to time t. Moreover, Rt computations depend on epidemiological parameters that are often unknown or even unknowable in real analyses. Although best practices for computing Rt have been proposed since the beginning of the pandemic (e.g., Gostic et al 2020), several problems have remained as important challenges to computing Rt accurately. These include the lag between time of infection and measured case numbers from testing, hospitalization, or deaths; the difficulty in properly accounting for multiple sources of uncertainty; and the ambiguity in selecting a window size for averaging in computing Rt using what are currently state of the art methods.

This paper advances the current state of the field by proposing a novel Bayesian method, DARt, for addressing these three problems using the technique of data assimilation. It uses simulated data to demonstrate advantages of this method compared to methods for computing Rt using the package EpiEstim, which is currently a widely used, recommended method for computing Rt in real applications. One particularly interesting feature of the method is how it handles abrupt changes in Rt through the incorporation of a binary latent variable that infers the probability of large changes in Rt in short periods of time. This paper is not the first, however, to implement a Bayesian method with the goal of addressing lagged data, uncertainty and averaging in computations of Rt.
In particular, the package EpiNow (and its more recent iteration EpiNow2), developed by Abbott et al 2020, implements another Bayesian method meant to solve these issues within the context of best practices for computing Rt. While the authors of this paper briefly mention EpiNow2, they do not compare their methods for inferring the curve of incident infections from the curve of observations and Rt to the methods used by EpiNow2 in detail. Given the importance of EpiNow2 as a highly developed tool for solving problems this paper is addressing as well, the authors' choice to compare only to EpiEstim and not to EpiNow2 is a major oversight that prevents the researchers from putting their work in context with currently available tools.

The paper also deemphasizes the importance of the Goldstein et al deconvolution method for taking into account lags in Rt calculations, comparing DARt to deconvolution only in section 2.3 of the supplementary information. Deconvolution has the serious disadvantages of not allowing for computation of uncertainty in the inference of the incident infection curve and requiring a window size choice in its calculation of Rt, but it is typically very successful at producing a point estimate of the infection curve from the curve of observations when the distributions of generation interval and delays to observation are assumed to be known and with an appropriate choice of stopping criterion for the Richardson-Lucy algorithm. The lack of agreement between the simulated infection curve and the deconvolved infection curve in Supplementary Figure 3 is surprising and may indicate an issue with the implementation of the deconvolution code.
Such an issue can occur if the indexing of time in the delay distribution is off compared to the indexing of time in the vectorized curve of observations (one must be careful to keep track of whether the delay distribution before truncation mathematically begins with the probability that the delay equals 0 days vs. the delay equaling 1 day; such an off-by-one error can cause disagreements like the one shown here - in particular, the disagreement between the location of the peak of the deconvolved curve and the true infection curve - although it is not the only possible issue). This disagreement should be carefully checked; if the deconvolution is truly not reproducing the peak correctly, this requires at least a comment on the intuition for why this is happening. This paper also neglects a source of uncertainty that is very important to real-life calculations of Rt. Namely, the observation delay distribution is very rarely known exactly (or even very accurately at all); even the generation interval distribution can be unknown, for example in the case of new COVID-19 variants. The authors provide a sensitivity analysis in supplementary figure 5 that is meant to address this issue, but the analysis is insufficient for a few reasons. First, it seems to test the effects of different thresholds for the truncation and samples from generation time distribution and observation delay distribution with variance in the parameters in a single plot. Second, there is no demonstrated analysis of the sensitivity of the method to cases where the shape of the assumed generation interval and/or observation delay are misspecified (with the wrong shape/scale/logmean/variance compared to the generation interval and observation delay used to generate the observation curve). Such misspecification can, at least for some methods, severely impact the accuracy of the inferred infection and Rt curves. 
Ideally, a method for inferring the infection curve and Rt with uncertainty would take into account uncertainty derived from the likely misspecification of generation interval and observation delay distribution in real applications; barring this, which is difficult to model, the sensitivity analysis should at least be modified to: a) test the effects of delay distribution minimum probability threshold and uncertainty in the delay distributions separately, and b) test the sensitivity of the method to cases where the delay distributions are misspecified, not just uncertain.

There are a few more minor issues that need to be addressed. In no particular order:

- There are some scattered language and grammatical errors which need to be edited.
- The arbitrary choice of Delta in the prior for Mt is not ideal, especially since the decision to allow Rt to increase discontinuously for the Hong Kong analysis (and the constraint that Rt not increase discontinuously in other cases) is inserted "by hand" rather than "discovered" by means of the inference.
- The extended description of the algorithm in the "Results" section seems potentially misplaced; I would suggest considering making this discussion much more concise and putting more of these details in the Methods section.
- In figure 3, c) and d) should have the same scale on the y axis.
- In the Sweden analysis of Figure 6, in my opinion it is somewhat disingenuous to directly input the raw data into EpiEstim with small window sizes, as in reality, a reasonable researcher would always try to smooth out the zeros and obvious reporting errors in this dataset before entering into EpiEstim (of course, the difference is that this smoothing would be by hand and thus require more of an arbitrary choice, whereas the described method does this without arbitrary choices).

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information.
This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Submitted filename: Yang_et_al_review.pdf

27 Sep 2021

Submitted filename: response_to_reviewers.docx

18 Nov 2021

Dear Dr Yang,

Thank you very much for submitting your manuscript "Bayesian data assimilation for estimating epidemic evolution: a COVID-19 study" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available.
The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Alison L. Hill
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************

A link appears below if there are any accompanying review attachments. If you believe any reviews to be missing, please contact ploscompbiol@plos.org immediately:

[LINK]

Reviewer's Responses to Questions

Comments to the Authors: Please note here if the review is uploaded as an attachment.

Reviewer #1: The review is uploaded as an attachment

Reviewer #2: The main contributions and importance of the paper have not changed since the original submission; the comments below are specific to this submission. First, I think the sensitivity analyses in section 6 of the Supplement are a nice addition in response to previous comments from the reviewers.

Major comment:

1. The comparison between DARt, EpiNow, and EpiEstim is no longer "apples to apples," as both DARt and EpiNow are implemented taking into account an observation delay, but EpiEstim is now implemented as "plug and play" with no consideration of observation delay. In the first submission, EpiEstim was used in conjunction with Richardson-Lucy deconvolution to take into account observation delays in computing the incident infections before computing Rt. If the authors want to separate the issue of inferring incident infections using deconvolution from the accuracy of the computation of Rt itself in their application of EpiEstim, they could still take into account observation delays to align time values properly using a more simplistic method such as subtracting the mean or median of the observation delay distribution from the reported times. Some sort of correction for the observation delay still needs to be made for the EpiEstim Rt calculations as otherwise issues with inferring the shape of Rt are intermixed with issues with inferring the timing in Rt in EpiEstim but not in the other two comparisons, making conclusions about the comparison with EpiEstim misleading.

Minor comments:

2. The main text's writing needs to be significantly edited. Although the authors definitely made an effort to address writing issues since the first submission, this submission still has many grammatical and language issues throughout. The text of the Supplement is written much more clearly.

3. The word "easement" needs to be replaced in the main text and supplement (say, with "easing" of restrictions) because its meaning is currently misused.

4. Overall, moving the details of the algorithm from the "Results" to the "Methods" as suggested in the first round of reviews has improved the flow and readability; however, now some aspects of the "Discussion" don't make sense, as they aren't explained in the text until after the Discussion section. Specifically:
- particle filtering needs to be mentioned in the algorithm summary in "Results" in order for it to be mentioned in the "Discussion and Conclusion" section
- same for the definition of M_t

5. Replace "the Bayes rule" with "Bayes' rule" throughout.

6. I still feel that the discussion of reporting issues with the Sweden data has issues (lines 364-371). The periodicity and predictability of the under-reporting and subsequent correction are not the same as random noise, so I don't think it is appropriate to say that the Sweden comparison shows robustness to "observation noise".

7. The formatting of citations in the text is awkward. For example, when multiple citations are listed[A], [B-E], the comma outside the bracket is unclear. One way of fixing this would be to instead write[A, B-E], all in brackets. Moreover, although in the text usually there is no space between the text and the citation[example], sometimes there is a space [example 2]. Please check this for consistency throughout.

8. Throughout the text and supplement the word "synthesised" is used where the word "synthetic" would be more appropriate/conventional.

9. It's interesting that the EpiNow2 results are consistently over-smoothed. Is this because the analysis uses default parameters or is it impossible to improve this even by choosing parameters more deliberately? Please at least comment on this in a sentence when discussing EpiNow results (preferably, include analysis in a supplemental figure).

10. Dates need to be consistently formatted - for example in line 317, "21st of Jan. 2020" is different from later date formatting. Saying "Jan. 21, 2020" or "1/21/2020" would be more in line with common usage.

11. Consider providing the Supplement as a .pdf: Since the Supplement is provided in a .docx instead of .pdf as it stands, much of the formatting is lost when I open it on my machine.

12. In Supplementary Table 2, Rt-mean, Rt-sd, Jt-mean and Jt-sd are not the best notation because these represent *differences* between the synthetic and estimated values. I would suggest using notation that indicates this, such as $\Delta$Rt-mean, $\Delta$Rt-sd, etc.

13. In Supplementary Table 2, I expect that the EpiEstim row will change if an effort is made to take into account observation delays, as suggested in the major comment.
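The simpler alignment suggested in Reviewer #2's major comment (subtracting the mean of the observation delay distribution from the reported times) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the discrete daily delay distribution, and the toy numbers are hypothetical and are not taken from the manuscript, the review, or the EpiEstim package.

```python
from typing import List, Optional, Sequence


def mean_delay(delay_pmf: Sequence[float]) -> float:
    """Mean of a discrete delay distribution, where delay_pmf[d] = P(delay = d days).

    The weights are normalised here, so they need not sum exactly to 1.
    """
    total = sum(delay_pmf)
    if total <= 0:
        raise ValueError("delay_pmf must have positive total mass")
    return sum(d * p for d, p in enumerate(delay_pmf)) / total


def shift_by_mean_delay(
    reported: Sequence[float], delay_pmf: Sequence[float]
) -> List[Optional[float]]:
    """Move each daily report back in time by the rounded mean delay.

    The output has the same length as the input. The most recent days,
    for which no report has arrived yet, are left as None.
    """
    shift = round(mean_delay(delay_pmf))
    n = len(reported)
    shifted: List[Optional[float]] = [None] * n
    for t in range(shift, n):
        shifted[t - shift] = reported[t]
    return shifted


# Hypothetical toy inputs: a delay of 1 to 3 days with mean 2 days.
toy_delay = [0.0, 0.25, 0.5, 0.25]
toy_reports = [0, 0, 5, 8, 13]
aligned = shift_by_mean_delay(toy_reports, toy_delay)
```

A shift of this kind only corrects the timing of the series; unlike deconvolution, it does not undo the smoothing that the delay distribution applies to the incidence curve, which is the trade-off the reviewer describes.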
Additional Editorial comments:

**********

We suggest the authors develop a title that better reflects the focus of the study. Otherwise, interested readers may not be able to find this paper. We advise against using the word "evolution" in the title, since in biology that generally means the Darwinian process of mutation and selection, which is not modeled here. Since you have specifically focused on a method for estimating Rt during emerging epidemics, we would recommend replacing this word with "growth" or "real-time reproduction number" or something similar. For example, "Bayesian data assimilation for estimating effective reproduction numbers during epidemics: applications to COVID-19".

Similar comments apply to the wording used in the abstract. In the abstract, symbols should not be used before they are defined. Please use words instead of the symbol for "Rt" when it's first introduced.

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No
Reviewer #2: No

Figure Files: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements: Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility: To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References: Review your reference list to ensure that it is complete and correct.
If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Submitted filename: Yang_et_al_review_R1.pdf

15 Dec 2021

Submitted filename: response_to_reviewers.docx

5 Jan 2022

Dear Dr Yang,

We are pleased to inform you that your manuscript 'Bayesian data assimilation for estimating instantaneous reproduction numbers during epidemics: applications to COVID-19' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.
Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Alison L. Hill
Associate Editor
PLOS Computational Biology

Rob De Boer
Deputy Editor
PLOS Computational Biology

***********************************************************

16 Feb 2022

PCOMPBIOL-D-21-00782R2

Bayesian data assimilation for estimating instantaneous reproduction numbers during epidemics: applications to COVID-19

Dear Dr Guo,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo
PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom
ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

27 in total

1. A guide to R - the pandemic's misunderstood metric.

Authors: David Adam
Journal: Nature Date: 2020-07 Impact factor: 49.962

2. Temporal dynamics in viral shedding and transmissibility of COVID-19.

Authors: Xi He; Eric H Y Lau; Peng Wu; Xilong Deng; Jian Wang; Xinxin Hao; Yiu Chung Lau; Jessica Y Wong; Yujuan Guan; Xinghua Tan; Xiaoneng Mo; Yanqing Chen; Baolin Liao; Weilie Chen; Fengyu Hu; Qing Zhang; Mingqiu Zhong; Yanrong Wu; Lingzhai Zhao; Fuchun Zhang; Benjamin J Cowling; Fang Li; Gabriel M Leung
Journal: Nat Med Date: 2020-04-15 Impact factor: 53.440

3. Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe.

Authors: Seth Flaxman; Swapnil Mishra; Axel Gandy; H Juliette T Unwin; Thomas A Mellan; Helen Coupland; Charles Whittaker; Harrison Zhu; Tresnia Berah; Jeffrey W Eaton; Mélodie Monod; Azra C Ghani; Christl A Donnelly; Steven Riley; Michaela A C Vollmer; Neil M Ferguson; Lucy C Okell; Samir Bhatt
Journal: Nature Date: 2020-06-08 Impact factor: 49.962

4. Sequential Data Assimilation of the Stochastic SEIR Epidemic Model for Regional COVID-19 Dynamics.

Authors: Ralf Engbert; Maximilian M Rabe; Reinhold Kliegl; Sebastian Reich
Journal: Bull Math Biol Date: 2020-12-08 Impact factor: 1.758

5. Model-informed COVID-19 vaccine prioritization strategies by age and serostatus.

Authors: Kate M Bubar; Kyle Reinholt; Stephen M Kissler; Marc Lipsitch; Sarah Cobey; Yonatan H Grad; Daniel B Larremore
Journal: Science Date: 2021-01-21 Impact factor: 47.728

6. Real time bayesian estimation of the epidemic potential of emerging infectious diseases.

Authors: Luís M A Bettencourt; Ruy M Ribeiro
Journal: PLoS One Date: 2008-05-14 Impact factor: 3.240

7. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2).

Authors: Ruiyun Li; Sen Pei; Bin Chen; Yimeng Song; Tao Zhang; Wan Yang; Jeffrey Shaman
Journal: Science Date: 2020-03-16 Impact factor: 47.728

8. Modelling transmission and control of the COVID-19 pandemic in Australia.

Authors: Sheryl L Chang; Nathan Harding; Cameron Zachreson; Oliver M Cliff; Mikhail Prokopenko
Journal: Nat Commun Date: 2020-11-11 Impact factor: 14.919

9. Effective immunity and second waves: a dynamic causal modelling study.

Authors: Karl J Friston; Thomas Parr; Peter Zeidman; Adeel Razi; Guillaume Flandin; Jean Daunizeau; Oliver J Hulme; Alexander J Billig; Vladimir Litvak; Cathy J Price; Rosalyn J Moran; Anthony Costello; Deenan Pillay; Christian Lambert
Journal: Wellcome Open Res Date: 2020-09-30