Grace Y Yi1,2, Pingbo Hu1, Wenqing He1. 1. Department of Statistical and Actuarial Sciences University of Western Ontario, London Ontario Canada N6A 5B7. 2. Department of Computer Science University of Western Ontario, London Ontario Canada N6A 5B7.
Abstract
The coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread stealthily and presented a tremendous threat to the public. It is important to investigate the transmission dynamics of COVID-19 to help understand the impact of the disease on public health and the economy. In this article, we develop a new epidemic model that utilizes a set of ordinary differential equations with unknown parameters to delineate the transmission process of COVID-19. The model accounts for asymptomatic infections as well as the lag between symptom onset and the confirmation date of infection. To reflect the transmission potential of an infected case, we derive the basic reproduction number from the proposed model. Using the daily reported number of confirmed cases, we describe an estimation procedure for the model parameters, which involves adapting the iterated filter-ensemble adjustment Kalman filter (IF-EAKF) algorithm. To illustrate the use of the proposed model, we examine the COVID-19 data from Quebec for the period from 2 April 2020 to 10 May 2020 and carry out sensitivity studies under a variety of assumptions. Simulation studies are used to evaluate the performance of the proposed model under a variety of settings.
The coronavirus disease 2019 (COVID-19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has spread stealthily and presented a tremendous threat to the public. It is important to investigate the transmission dynamics of COVID-19 to help understand the impact of the disease on public health and the economy. In this article, we develop a new epidemic model that utilizes a set of ordinary differential equations with unknown parameters to delineate the transmission process of COVID-19. The model accounts for asymptomatic infections as well as the lag between symptom onset and the confirmation date of infection. To reflect the transmission potential of an infected case, we derive the basic reproduction number from the proposed model. Using the daily reported number of confirmed cases, we describe an estimation procedure for the model parameters, which involves adapting the iterated filter-ensemble adjustment Kalman filter (IF-EAKF) algorithm. To illustrate the use of the proposed model, we examine the COVID-19 data from Quebec for the period from 2 April 2020 to 10 May 2020 and carry out sensitivity studies under a variety of assumptions. Simulation studies are used to evaluate the performance of the proposed model under a variety of settings.
The coronavirus disease 2019 (COVID‐19), caused by the severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2), has spread stealthily and represented a tremendous threat to worldwide public health. To help understand the virus transmissions, it is imperative to investigate the transmission dynamics quantitatively. In the literature, a variety of epidemic models have been developed to study various types of infectious diseases, including the susceptible‐infectious‐recovered (SIR) model, the susceptible‐infectious‐susceptible (SIS) model, the susceptible‐exposed‐infectious‐recovered (SEIR) model, the Reed–Frost model, and their variants. A review of epidemic models can be found in Duan et al. (2015).Applications of those epidemic models have been extensive. To name a few, Osthus et al. (2017) used the SIR model to forecast seasonal influenza. Shah & Gupta (2013) applied the SEIR model to examine the transmission processes of vector‐borne diseases. Ng & Orav (1990) proposed a generalized Reed–Frost model to predict human immunodeficiency virus (HIV) incidence in San Francisco's homosexual population, and Ng, Turinici & Danchin (2003) developed a modified SEIR model, called the susceptible‐exposed‐infectious‐recovered‐protection (SEIRP) model, to study the outbreak of the severe acute respiratory syndrome (SARS) in China.Recently, a number of new models have been explored to study the dynamics of COVID‐19. For example, Tang et al. (2020) proposed a generalized SEIR model to incorporate presymptomatic COVID‐19 cases and study the implications of the intervention measures such as contact tracing, quarantine, and isolation in China. Tuite, Fisman & Greer (2020) generalized the SEIR model by incorporating the information on interventions and severities of clinical symptoms to examine the potential impact of case‐based and noncase‐based nonpharmaceutical interventions for the population of Ontario, Canada. Mandal et al. (2021) proposed a deterministic model by stratifying the population into three age groups and incorporating asymptomatic cases to explore various strategies for lifting lockdowns in India. Also, the IHME COVID‐19 Forecasting Team (2021) used the SEIR model to characterize the trajectories of COVID‐19 infections and the effects of nonpharmaceutical interventions in the United States.In this article, we propose a new epidemic model, called the susceptible‐exposed‐asymptomatic‐symptomatic‐active‐removed (SEASAR) model, to delineate the COVID‐19 transmission dynamics. We describe the target population by stratifying it into six subpopulations, consisting of individuals who are, respectively, susceptible, exposed, asymptomatic, symptomatic, active, and removed. This model generalizes the SIR and SEIR models by accounting for asymptomatic infections and the lag between symptom onset and the diagnosis time. Consistent with the SIR and SEIR models, we make two routine assumptions: (1) the population is homogeneous and (2) there are no inbound and outbound travels. Similar to the SIR and SEIR models, the SEASAR model is a deterministic model that utilizes ordinary differential equations to describe the transmission dynamics of COVID‐19. We derive the basic reproduction number from the proposed model to provide a scalar measure of the pandemic.To implement the proposed model, we develop an estimation procedure for the model parameters by adapting the iterated filter‐ensemble adjustment Kalman filter (IF‐EAKF) algorithm (e.g., Ionides, Bretó & King, 2006), where sampling from Bayesian posterior distributions is employed. We illustrate the use of the proposed model by analyzing the COVID‐19 data from Quebec for the period from 2 April 2020 to 10 May 2020. We conduct sensitivity analyses to assess how the estimation of the model parameters and the predicted results may change as the model assumptions are altered. We compare the analysis results by applying the proposed method in contrast to the SIR and SEIR models as well as a neural network model. Simulation studies provide a basis for assessing the model performance under different settings.The remainder of this article is organized as follows. We introduce the deterministic SEASAR model and elaborate on its rationale in Section 2. In Section 3, we present the stochastic model for the observed data and establish its connection with the SEASAR model, together with the initialization setup. Section 4 describes the estimation procedure by adapting the IF‐EAKF algorithm. In Section 5, we utilize the proposed SEASAR model to analyze the Quebec COVID‐19 data for the period from 2 April 2020 to 10 May 2020, and Section 6
reports our simulation studies. The article concludes with a discussion of some outstanding issues.
MODEL FRAMEWORK
Via a meta‐analysis, He, Yi & Zhu (2020) estimated that about of individuals with COVID‐19 are asymptomatic. Because of the incubation period, there is a time lag between symptom onset and being confirmed as an infected individual. To incorporate these features of COVID‐19, we develop the SEASAR model to be described as follows.
Illustration of the Proposed Model
To illustrate the ideas, we first consider a static framework by focusing on a given time point. We divide the target population into six subpopulations with specific features, denoted by , , , , , and , respectively. Specifically, represents the subpopulation of susceptible cases (i.e., those at risk of becoming infected with the novel coronavirus), is the subpopulation of exposed cases (i.e., those who are infected but do not have the infectious ability yet and are still in the latent period) (e.g., Bai et al., 2020), stands for the subpopulation of asymptomatic infections (i.e., those cases who are infectious but exhibit no symptoms), represents the subpopulation of symptomatic infections (i.e., those cases who exhibit symptoms and are infectious, but who are not yet confirmed), is the subpopulation of active cases (i.e., those confirmed cases who have either not recovered or died), and denotes the subpopulation of removed cases (i.e., those confirmed cases who have recovered or died from COVID‐19).Next, we introduce parameters to facilitate dynamic changes among the subpopulations. Let denote the average latent period, defined as the average time (in days, say) from being infected to having the infectious ability. Various studies have been conducted to estimate the value of (e.g., Bai et al., 2020; Guan et al., 2020), so here we take as being available. Let denote the average symptomatic transmission rate, defined as the average number of individuals infected by a symptomatic case per unit time. Let the average asymptomatic transmission rate be denoted by , defined as the average number of individuals infected by an asymptomatic case per unit time. As asymptomatic infections are regarded as less infectious than symptomatic cases (e.g., Li et al., 2020), is a constant between 0 and 1. Let denote the average fraction of symptomatic infections relative to all infections, let denote the average rate for asymptomatic infections to develop symptoms per unit time, and let denote the average recovery rate of asymptomatic infections per unit time. Let stand for the average time (in days, say) from symptom onset to the time of being a confirmed case, let denote the average time (in days, say) from being a confirmed case to having recovered, and let represent the average time (in days, say) from being a confirmed case to death due to COVID‐19.Figure 1 is a flowchart showing the relationship among the six subpopulations. A black solid arrow between two subpopulations indicates that the members of one subpopulation can transition into the other subpopulation; a red dashed arrow between two subpopulations indicates that members in one subpopulation can be infected by members in the other subpopulation. The parameters on black solid lines determine the number of people who move from one subpopulation to another subpopulation per unit time, and the corresponding parameters on red dashed lines determine the number of people infected by asymptomatic or symptomatic cases per unit time. We assume that we have a homogeneous population and that any confirmed COVID‐19 case must be quarantined immediately and cannot infect other cases thereafter. Thus, there is no transition from to other compartments except . Further, we assume that asymptomatic individuals are not tested for COVID‐19, whereas all symptomatic cases are assumed to be confirmed at a certain time. The former assumption is fairly reasonable, especially in the early stage of the pandemic when test kits are scarce. However, the latter assumption is less realistic, because in reality, some symptomatic cases may never be confirmed because of false negative results or not being tested. These two assumptions basically consider the setting where an initially asymptomatic individual can move into compartment only if this person shows symptoms before recovery or death, and those asymptomatic cases who never show symptoms cannot directly enter compartment .
Figure 1
Illustration of the SEASAR model. The population is divided into six compartments: (susceptible), (exposed), (asymptomatic), (symptomatic), (active), and (removed).
Illustration of the SEASAR model. The population is divided into six compartments: (susceptible), (exposed), (asymptomatic), (symptomatic), (active), and (removed).
Deterministic Dynamic Model
Figure 1 shows a static chart for the transitions among the six subpopulations at a given time point. However, for any time period, the transitions are not static but dynamic. To characterize this dynamic feature, we modify the six subpopulations discussed in Section 2.1 by showing their dependence on time , and let , , , , , and denote the respective sizes of the corresponding six subpopulations (or state compartments) at time . We assume that the size of each state compartment at follows a certain distribution with the mean, denoted by the same symbol with the asterisk removed (e.g., for ). While those sizes are treated as random, here we focus on delineating the dynamic changes in their means to reflect the underlying links.To be specific, let denote the vector of the six average subpopulation sizes at time . We represent the dynamic changes in via the ordinary differential equations (ODEs):
where, under the assumption of no immigration or emigration of individuals, represents the time‐invariant total size at any time point . Any of the Equations (1)–(6) is determined by the other five equations because of the total size constraint. Such a constraint, however, is not applied to the sum of , , , , , and .Our reasoning for Equations (1)–(6) may be found in Appendix B of the accompanying Supplementary Material. For ease in referring to Equations (1)–(6), let denote the vector of parameters of primary interest; then
where represents the vector function determined by the right‐hand side of Equations (1)–(6), and we call these six equations the SEASAR model. Figure S.1 in the Supplementary Material displays a flowchart of the transmission dynamics for the six subpopulations together with the associated values.
The Basic Reproduction Number
Knowing the value of allows us to describe the six subpopulation sizes using Equations (1)–(6). Further, it enables us to describe the severity of the pandemic using a simple measure, the basic reproduction number, denoted , which is defined as the expected number of cases infected by one case in a population consisting of individuals susceptible to infection.A large value of indicates a severe pandemic. Usually, comparing to 1 describes the spread of the disease. “” suggests that the infection is spreading in the population, and “” indicates a dying‐down situation. In Appendix A of the Supplementary Material, we show that the value of derived from the SEASAR model equals
DATA AND THE STOCHASTIC MODEL
The Observed Data and the Stochastic Model
For , let denote the number of confirmed cases to be reported on day , which we regard as a random variable. We assume that follows a normal distribution:
with mean and variance . As in Section 2.1, we assume that only symptomatic individuals may be confirmed as cases; thus, by Figure 1, a case that is being confirmed at time corresponds to a transition from state to state at . By the definition of , can be regarded as the average proportion of symptomatic cases that are confirmed on any given day, and, hence, represents the mean number of confirmed cases on day . Consequently, in (7) is given by , which links with the SEASAR model.Let denote the initial time point from which we start examining the data. Since the number of confirmed COVID‐19 cases is reported on a daily basis, we take the fundamental unit of time to be a day, and let denote the examination days with being a specified positive integer.In the study period , let represent the observed values for the process . Being called the observational error variance (OEV) (e.g., Li et al., 2020), in (7) is often characterized heuristically based on previously observed data via an assumed function form. For example, for , may be fixed as
bearing in mind that other specifications to characterize the value of are also possible.By time , let , , and denote the reported cumulative number of recoveries, of confirmed cases, and of deaths from COVID‐19, respectively. For , let denote the number of individuals who report that their COVID‐19 symptoms appeared on day but are not yet confirmed to have COVID‐19 on day ; is related to in the SEASAR model via for . Let record the available data when the study begins and write .In contrast, the unobserved variables , , , , , and , , , and represent the observed data that are used to describe an estimation procedure for the model parameters in Equations (1)–(6), where and are used to initialize the mean sizes of the six subpopulations, together with the assumptions as outlined in Section 3.2; and is used to estimate the model parameters , as described in Section 4.
Initial Mean Sizes of the Subpopulations
At the beginning point of the study, the initial mean sizes of the six subpopulations, , are given. Table S.1 in the Supplementary Material summarizes the relationship of the initial mean sizes of the six subpopulations to be used in Section 4.2, with , , and being defined as in the following.In contrast to , which is defined in Section 3.1, let denote the cumulative number of recovered asymptomatic cases by time ; however, is unavailable. To facilitate the relationship between the observed and unobserved values, let denote the ratio of the unobserved value to the observed value of , and let represent the ratio of the unobserved size to the observed size . Motivated by Hao et al. (2020), let denote the ratio of the unobserved to the observed data in over the time window , where the function represents the largest integer that is less than or equal to . Notationally, one may write , though it is understood that and the data in do not have an intrinsic connection.The introduction of the ratios , , and does not enable us to determine the values of unobservable , , and , but these ratios do offer us an informative way to describe the pandemic situation in relative scales of the observed values by time . For instance, at the early stage of the pandemic, testing kits are limited, so the number of recoveries from confirmed cases is likely to be much smaller than the corresponding number of recoveries by asymptomatic individuals, and such a scenario can be informatively described by a large value of . If is greater than 1, then there are more asymptomatic infections than symptomatic infections. Despite the lack of information about , and , one may conduct sensitivity analyses by changing their values to describe different scenarios with different degrees of severity of the pandemic, as we report in Section 5.
ESTIMATION PROCEDURE
Here we adapt the IF‐EAKF algorithm in combination with the fourth‐order Runge–Kutta (RK4) method (e.g., Süli & Mayers, 2003, p. 328) to estimate the SEASAR model parameter . The basic idea is to embed the original SEASAR model with its time‐invariant parameter into an expanded yet “artificial” model with a time‐varying parameter, say , so that the problem of estimating the original parameter is converted to estimating sequentially, where the enhanced model assumes the same form as we assumed in Equations (1)–(6) with the parameter replaced by . The introduction of the “artificial” time‐varying parameter allows us to implement the IF‐EAKF algorithm for the estimation of by sequentially examining the observed data over for each .Considering this enlarged model offers us the flexibility of using the observed data
sequentially to update estimates of the model parameters using the Bayesian principle. For , the time‐dependent parameter in the enlarged model is treated as random, and its prior and posterior distributions are updated for the sequential time points in , as described in Section 4.2. To implement the IF‐EAKF algorithm, we specify the prior information concerning the parameters at the study entry point in Section 4.1 and then present the implementation procedure in Section 4.2.
Initialization of the Model Parameters
Here we assume that the prior information concerning the parameters at the initial time point is not informative except for constraining the parameters to certain ranges to reflect our prior knowledge about them.To be specific, the transmission rate of symptomatic infections is such that to cover the reported values in the literature, including Hao et al. (2020) and Li et al. (2020). The multiplicative factor is assumed to satisfy , the fraction of symptomatic infections relative to all infections is restricted so that , the average rate of asymptomatic infections who develop symptoms per unit time is constrained to be , and the average recovery rate of asymptomatic infections per unit time is such that . The average time from symptom onset to being confirmed is considered to be between 1 and 10 days (Kramer, 2020), the average time from being confirmed to recovery is between 7 and 42 days (WHO, 2020), and the average time from being confirmed as a symptomatic case until death occurs ranges from 14 to 56 days (WHO, 2020).
An Estimation Algorithm
In this subsection, we describe the detail of adapting the IF‐EAKF algorithm to iteratively update the estimates of the parameters of the SEASAR model. Let be a prespecified integer, which is reasonably large, and let denote the iteration number. The estimation procedure consists of iterations for as follows:First, we describe how to obtain an initial estimate of for at iteration , elaborated in the following four parts.Part 1:
,
:
Part 2:
,
:
Part 3:
,
:
Part 4:
:Stage 1: Determine prior values for and .Step 1: Let denote a prior distribution for parameter , which we assume is the uniform distribution over the ranges specified in Section 4.1, with the assumption that the parameter components in are independent of each other.Step 2: Generate values from , denoted , where for each , .Step 3: Using the initial size of symptomatic infections, we generate prior values for , denoted , by setting Then calculate the sample mean and variance:
together with the pairwise sample covariances:
where is a symbol for , , , , , , , and .Stage 2: Generate posterior values for and .We employ the following steps to generate posterior values for and from their posterior distribution. The derivations may be found in Appendix E of the Supplementary Material.Step 1: Generate posterior values for , denoted . For each , is determined by
where is given by Equation (8) with and taken as , the number of confirmed cases reported on day .Step 2: Generate posterior values for , denoted , with , where each component of equals
with representing a symbol for , , , , , , , and .Stage 1: Generate prior values for , , and .Step 1: Generate prior values for , denoted , by setting , where we denote
for each .Step 2: Using the RK4 method, generate prior values for , denoted , where for each ,
. Specifically, let , , , and . Then we setStep 3: Generate prior values for , denoted , by setting . Then calculate
together with the pairwise sample covariance
where represents each of the symbols , , , , , or .Stage 2: Generate posterior values for , , and . This stage is similar to Stage 2 in Part 1.Step 1: Generate posterior values for and , respectively, denoted and .Step 2: Generate posterior values for , denoted , where for each ,
, with each component of given by
Here represents each of the symbols , , , , , or .Stage 1: Generate prior values for , , and .Step 1: Generate prior values for , denoted , by settingStep 2: Generate prior values for , denoted , using the RK4 method where, like the result specified in Equation (11),Step 3: Similar to Step 3 of Stage 1 in Part 2, generate prior values for .Stage 2: Similar to Stage 2 in Part 2, generate posterior values of , , and .We repeat Part 3 for and obtain a sequence of posterior values for , denoted . Then, we calculate
which is then adopted as the initial value of at iteration .Next, we describe the iterative procedures to update estimates of for . The iterations are similar to those sketched in the preceding Parts 1–4 except that the prior distribution of in Step 1 of Stage 1 of Part 1 is taken to be , where is the output of iteration , is a discount factor representing a value in , and is a user‐specified positive‐definite matrix with representing the dimension of parameter vector (i.e., ).Let denote the estimate of the model parameter at the th iteration. To reduce the Monte Carlo error induced during the simulation procedure, we run the preceding algorithm repeatedly, say times, and let denote the resulting estimates of . Let be the final estimate of , where is often set to be a large number, for example, , as in our numerical studies.The implementation steps are summarized in Algorithm 1. We now comment on the specification of , , , and . While the choice of may, in principle, be driven by the consideration of “the larger, the better”, our numerical experience suggests that a value in the range 100–500 usually works well in combination with suitable values of and . The iteration number and the discount factor are often specified in an ad hoc way by inspecting the evolution of the posterior distributions over iterations. The inclusion of the discount factor ensures that in Algorithm 1 will converge within a reasonable number of iterations. Li et al. (2020, Supplementary Material, p. 8) commented that a small value of makes the algorithm “quench” too fast and miss the maximum likelihood estimate, while a value of close to 1 seems to delay the prompt convergence of the algorithm. In applications, Li et al. (2020) suggested setting to be a value in . Here, we specify as a diagonal matrix with the th diagonal element set as the variance of the uniform distribution over the range of the th parameter in , which we specified in Section 4.1.
ANALYSIS OF THE QUEBEC COVID‐19 DATA
As an illustration, we used our proposed model to analyze the observed data concerning the daily reported number of COVID‐19 cases in the Canadian province of Quebec for the period between 2 April 2020 and 10 May 2020. To implement the procedure described in Section 4.2, we developed the R source code, which is available from the second author upon request. Using the notation in Section 3.1, the examination days are listed in with and being 2 April 2020. By time , the reported cumulative numbers of recoveries, of confirmed cases, and of deaths from COVID‐19 are available at the website https://www.canada.ca/en/public‐health/services/diseases/2019‐novel‐coronavirus‐infection.html, giving with , , and . In addition, the observed values for the process are available at https://www.canada.ca/en/public‐health/services/diseases/2019‐novel‐coronavirus‐infection.html.The data for Quebec are not available. However, using other reported data for either Canada or Quebec, we may roughly approximate for and . Let denote the number of COVID‐19 patients with symptom onset on day in Canada, which is available in the section labelled “Epidemic curve” of the website https://health‐infobase.canada.ca/covid‐19/epidemiological‐summary‐covid‐19‐cases.html?stat=num&measure=total_last7&map=pt∖%23a4. Let and denote the numbers of confirmed COVID‐19 cases on day in Canada and Quebec, respectively, which can be found at the website https://www.canada.ca/en/public‐health/services/diseases/2019‐novel‐coronavirus‐infection.html. Assuming that the number of confirmed cases in Quebec and in Canada in a day appears in nearly the same ratio as the number of individuals with symptom onset in Quebec to that in Canada, is approximated by for and . As is only used to help describe the initial value , we hope this treatment of offers us a reasonable approximation, and we stress that this approximation yields a less accurate estimate of than that obtained from the setting where would have been available.To assess the performance of the SEASAR model in terms of parameter estimation as well as prediction, we split the study period into two parts: the period from 2 April to 30 April 2020 (denoted with ) and the period from 1 May to 10 May 2020 (denoted with ). The data for the first period, called the training data, are used to estimate the model parameters by applying Algorithm 1. The data in the second period, called the test data, are used to assess the prediction performance of our proposed model.
Estimation and Prediction Results
We ran Algorithm 1 with (7) and (8) on the training data, where we set , , , and fixed the average latent period at 5.2 days, an estimate reported by Bai et al. (2020) and Hao et al. (2020). For the initial sizes of the six subpopulations discussed in Section 3.2, we took , , and .In Table 1, we reported the estimates (EST) of the model parameters in Equations (1)–(6) as well as the estimate of the basic reproduction number
. The estimate of suggests that the pandemic situation in Quebec for the period from 2 April 2020 to 30 April 2020 was not under control and the virus was spreading in the province.
Table 1
Analysis of the Quebec data: The estimates of the model parameters.
Parameter
θ
μ
α
β
γ
F
B
J
|R0
EST
0.88
0.10
0.48
0.01
0.97
2.36
39.58
37.56
|1.06
Analysis of the Quebec data: The estimates of the model parameters.Next, we evaluated the prediction performance using the test data. With replaced by its estimate, we used the RK4 method to derive the estimates of recursively from the SEASAR model for by using
a form similar to that specified in Equation (11), where is defined as indicated in Equation (11) with replaced by , and . Then we took the ratio of the resulting estimates of and as a predicted (or fitted) value to for .To evaluate the prediction performance, for , we report the absolute prediction error (APE), defined , and the relative absolute prediction error (Rel.APE) in percent, defined as , where two settings are considered with representing the daily net number of confirmed cases (i.e., ) or the daily cumulative number of confirmed cases (i.e., ), and we let denote the corresponding predicted value of . The results are reported in Table 2.
Table 2
Analysis of the Quebec data using the proposed SEASAR model: Prediction performance for the daily net and daily cumulative numbers in 10 days for the period 1 May 2020 to 10 May 2020.
Net number
Cumulative number
APE
Rel.APE
APE
Rel.APE
Day 1
241.07
21.72
36.96
0.13
Day 2
133.30
13.22
170.26
0.57
Day 3
11.52
1.29
181.78
0.60
Day 4
128.27
16.92
53.50
0.17
Day 5
98.08
12.35
44.57
0.14
Day 6
12.11
1.33
32.46
0.10
Day 7
7.29
0.80
25.17
0.07
Day 8
2.46
0.27
22.71
0.07
Day 9
79.38
0.95
102.08
0.29
Day 10
186.22
25.34
288.31
0.79
Note: APE and Rel.APE (in percent) represent the absolute prediction error and the relative absolute prediction error, respectively.
Analysis of the Quebec data using the proposed SEASAR model: Prediction performance for the daily net and daily cumulative numbers in 10 days for the period 1 May 2020 to 10 May 2020.Note: APE and Rel.APE (in percent) represent the absolute prediction error and the relative absolute prediction error, respectively.To visualize the results, we displayed in Figure 2 the mean estimates of the daily net number and of the daily cumulative number of cases for the period from 2 April 2020 to 30 April 2020 (in green), as well as the mean predicted daily net numbers and the predicted daily cumulative numbers of cases for the period from 1 May 2020 to 10 May 2020 (in blue), in comparison to the actual observed daily net numbers and the daily cumulative numbers of cases from 2 April 2020 to 10 May 2020 (in red). While the APE values vary greatly from day to day and it is difficult to quantify an acceptable range for the prediction error, examining the Rel.APE values gives us some insight into the prediction performance. The Rel.APE values for predicting a daily net number of confirmed cases seem to be acceptable and they are very small for predicting a daily cumulative number of confirmed cases. This suggests that prediction of a daily cumulative number of confirmed cases seems fairly acceptable.
Figure 2
The model fitting to the number (in green) in the period 2 April 2020 to 30 April 2020, and the model prediction to the number (in blue) in the period 1 May 2020 to 10 May 2020, as opposed to the reported number (in red) in the period 2 April 2020 to 10 May 2020, where the number represents the daily net number of confirmed cases (left) and the daily cumulative number of confirmed cases (right).
The model fitting to the number (in green) in the period 2 April 2020 to 30 April 2020, and the model prediction to the number (in blue) in the period 1 May 2020 to 10 May 2020, as opposed to the reported number (in red) in the period 2 April 2020 to 10 May 2020, where the number represents the daily net number of confirmed cases (left) and the daily cumulative number of confirmed cases (right).Finally, to mitigate the error due to data aggregation that arises on weekends, we calculated APE and Rel.APE for the weekly net numbers of confirmed cases, which are 178.938 and 0.028, respectively. We also displayed these results in Figure S.3 of the Supplementary Material in a manner similar to that used in Figure 2.
A Comparison with the Neural Network, SIR, and SEIR Models
To further assess the performance of the SEASAR model, we compared its prediction performance to that of the neural network (NN), SIR, and SEIR models by examining prediction of the daily net, daily cumulative, and weekly net numbers of confirmed cases.The NN method is commonly used by the machine learning community. Basically, it contains three elements: the input layer, the hidden layer(s) with a number of nodes, and the output layer. Here we take the input data as the observed time series with and being 2 April 2020. We use the NN model with one hidden layer having three nodes, with the same implementation details as specified in Chen et al. (2021). The R function nnetar was used to fit the training data, and the R function forecast was used for prediction.Next we considered the SIR model, which divides the target population into three subpopulations, respectively denoted , , and , of susceptible cases, of infectious cases (i.e., those who are infected and are themselves infectious), and of removed cases (i.e., those cases who recover or die from COVID‐19). Let , , and denote the mean size of the corresponding subpopulation at time , which are, by definition, related to the subpopulations classified by the SEASAR model via , , and . Let denote the average transmission rate, defined as the average number of individuals infected by an infectious case per unit time, and let denote the average removal rate, defined as the average rate of death or recovery. Under the same assumptions for the SEASAR model, the SIR model is characterized by the following ordinary differential equations:Now we turn to the SEIR model that stratifies the population into four subpopulations. In addition to the three subpopulations considered in the SIR model, now denoted , , and , the SEIR model further considers the subpopulation of exposed cases (i.e., those who are infected but not yet infectious, and are still in the latent period), denoted . Let , , , and denote the mean size of the corresponding subpopulation at time , which, by definition, are related to the quantities in the SEASAR model by the corresponding , , , and . Under the same assumptions that apply to the SEASAR model, the SEIR model is characterized by the ordinary differential equations
where and have meanings similar to those of and in the SIR model, and is the latent period defined in Section 2.1 and is fixed at the value as in Section 5.1.Modifying the estimation procedure that we outlined in Section 4.2, we obtained estimates of the parameters for the SIR and SEIR models, where the implementation details may be found in Appendices I and J in the Supplementary Material; we set , and , as in Section 5.1. As we previously reported in Section 5.1, in Table 3 we summarized the APE and Rel.APE values for prediction of the daily net number and the daily cumulative number obtained from the fitted NN, SIR, and SEIR models for the same time period 1 May 2020 to 10 May 2020. For the prediction of the weekly net number, we estimated that the average APE values for the NN, SIR, and SEIR were 678.07, 1354.636, and 1863.149, respectively, and that the average Rel.APE values for the NN, SIR, and SEIR were 0.106, 0.212, and 0.292, respectively. In contrast, the proposed SEASAR model yielded a much smaller average APE and average Rel.APE, which were 178.938 and 0.028, respectively.
Table 3
Analysis of the Quebec data using the NN, SIR, and SEIR models: Prediction performance for the daily net and daily cumulative numbers in 10 days for the period 1 May 2020 to 10 May 2020.
NN model
SIR model
SEIR model
Net number
Cumulative number
Net number
Cumulative number
Net number
Cumulative number
Day
APE
Rel.APE
APE
Rel.APE
APE
Rel.APE
APE
Rel.APE
APE
Rel.APE
APE
Rel.APE
1
293.09
26.40
292.76
1.02
383.85
34.58
1939.29
6.77
31.91
2.87
6195.44
21.63
2
192.90
19.14
485.65
1.64
284.45
28.22
2223.75
7.50
145.76
14.46
6341.19
21.38
3
77.31
8.67
562.96
1.84
171.05
19.18
2394.80
7.84
273.72
30.69
6614.91
21.65
4
56.58
7.46
506.37
1.62
39.65
5.23
2434.46
7.78
419.80
55.38
7034.71
22.47
5
20.55
2.59
485.82
1.51
78.26
9.86
2512.72
7.83
395.98
49.87
7430.69
23.15
6
95.46
10.49
581.28
1.76
196.87
21.63
2709.59
8.21
292.29
32.12
7722.98
23.40
7
96.46
10.59
677.74
2.00
200.48
22.01
2910.08
8.58
303.70
33.33
8026.68
23.66
8
97.46
10.69
775.21
2.23
204.10
22.38
3114.17
8.94
315.23
34.56
8341.91
23.94
9
21.46
2.57
796.67
2.23
130.71
15.64
3244.88
9.10
403.87
48.31
8745.78
24.52
10
79.54
10.82
717.13
1.97
32.33
4.40
3277.21
9.00
517.63
70.42
9263.40
25.45
Note: APE and Rel.APE (in percent) represent the absolute prediction error and the relative absolute prediction error, respectively.
Analysis of the Quebec data using the NN, SIR, and SEIR models: Prediction performance for the daily net and daily cumulative numbers in 10 days for the period 1 May 2020 to 10 May 2020.Note: APE and Rel.APE (in percent) represent the absolute prediction error and the relative absolute prediction error, respectively.To visualize the results obtained from the SEASAR model (in verde olive), the NN model (in purple), the SIR model (in green), and the SEIR model (in blue), we displayed in Figure 3 the mean predicted daily net numbers, the mean predicted daily cumulative numbers, and the mean predicted weekly net numbers of confirmed cases for the period from 1 May 2020 to 10 May 2020, in comparison to the corresponding reported numbers for the same period. Our proposed SEASAR model appeared to outperform the SIR, SEIR, and NN models with respect to prediction during the period represented by the test data. The SEIR and SIR models performed quite differently; the SEIR model resulted in over‐estimated values, whereas the SIR model produced under‐estimated results, though the degree of bias varied considerably. It would be interesting to investigate what causes these systematic biases of the SEIR and SIR models, though such an investigation is beyond the scope of this article.
Figure 3
The average prediction to the number in the period 1 May 2020 to 10 May 2020 using the SEASAR model (in verde olive), SIR model (in green), SEIR model (in blue), and NN model (in purple), as opposed to the real number (in red) in the period 1 May 2020 to 10 May 2020, where the number represents the daily net number (left), daily cumulative number (middle), and weekly net number (right) of confirmed cases.
The average prediction to the number in the period 1 May 2020 to 10 May 2020 using the SEASAR model (in verde olive), SIR model (in green), SEIR model (in blue), and NN model (in purple), as opposed to the real number (in red) in the period 1 May 2020 to 10 May 2020, where the number represents the daily net number (left), daily cumulative number (middle), and weekly net number (right) of confirmed cases.Finally, to see how various assumptions might affect the results, we conducted sensitivity analyses for 12 settings. For details, see Appendix H in the Supplementary Material.
SIMULATION STUDY
To assess the performance of our proposed method, we conducted simulation studies using the estimation procedure described in Section 4.2. In this section, we report the performance of the SEASAR model by assuming that the two required conditions are met: (1) the population is homogeneous and (2) the population is closed, i.e., there is no immigration or emigration. In Appendix K of the Supplementary Material, we assess the model performance when those conditions are violated.
Data Generation
First, we generated data from the SEASAR model specified in Equations (1)–(6), together with (7) and (8), for a period of study days. We fixed , , and . Then we used Equation (12) to generate data with the initial value of equal to . To see what the trajectories of look like, we plotted in Figure S.6 in the Supplementary Material with set to the value 200.With the availability of , we calculated for , and then used (7) and (8) to iteratively generate realizations of for , where we set , the number of confirmed cases in Quebec on 1 April 2020. Thus, we obtained the data . We repeated this data‐generation process times, and let record each copy of for , where we used , and represents a realized value of in the th generated data sample.
Assessment of the Estimation Performance
Here we focused on evaluating estimation of the model parameter as well as prediction of the daily numbers of confirmed cases. To this end, for , we divided into two subsets, i.e., and , with and , where . We considered different values of to evaluate the model performance. Here represents the training data used to estimate the SEASAR model parameters, and denotes the test data used to evaluate the prediction performance.When assessing the estimate of , we considered three cases for the training data with , 34, or 39, respectively, called “Case 1”, “Case 2”, and “Case 3”. In each instance, we separately ran Algorithm 1 on each set of training data for to estimate the model parameter , where we considered different specifications of , , and . Let denote the resulting estimates for for each configuration. We reported the average bias of the estimates (BIAS), which we calculated using ; the average relative estimate bias (RBIAS), which we calculated using ; and the sample standard deviation (SSD), which we calculated using , where and denote the th element of and , respectively, and .Table 4 summarizes the results for the settings with and , where is taken as 100, 300, or 500; additional results for and 1000 may be found in Table S.6 in the Supplementary Material. While different choices of lead to different degrees of estimation bias, as expected, the incurred bias or relative bias for those values of fall in acceptable ranges in general. Interestingly, a larger value of does not necessarily yield estimates with a smaller bias. On the contrary, increasing the value of does help reduce the sample standard deviation. Overall, it seems that with suitable values of and , setting to be a value between 100 and 500 may be plausible.
Table 4
Simulation study in Section 6 with and : Estimation results for the SEASAR model parameter; the entries with are the original values times .
n=100
Case 1
Case 2
Case 3
Parameter
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
θ
−0.01
−0.05
0.02
−0.01
−0.06
0.02
−0.01
−0.07
0.02
μ
0.15
0.34
0.11
0.19
0.41
0.11
0.20
0.45
0.11
α
−0.01
−0.08
0.02
−0.02
−0.09
0.02
−0.02
−0.10
0.02
β
1.59∗
0.02
0.02
1.60∗
0.02
0.02
1.49∗
0.02
0.02
γ
−0.06
−0.08
0.10
−0.07
−0.10
0.11
−0.08
−0.11
0.11
F
−8.47∗
−0.94∗
0.16
−0.01
−1.33∗
0.16
−0.02
−1.68∗
0.16
B
0.33
0.01
3.66
0.28
0.01
3.77
0.48
0.02
3.67
J
1.00
0.03
4.21
0.89
0.03
4.45
1.05
0.03
4.36
n=300
Case 1
Case 2
Case 3
Parameter
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
θ
−0.01
−0.06
0.02
−0.01
−0.07
0.02
−0.01
−0.07
0.02
μ
0.17
0.38
0.09
0.20
0.44
0.09
0.22
0.49
0.09
α
−0.02
−0.08
0.02
−0.02
−0.10
0.02
−0.02
−0.11
0.02
β
1.92∗
0.02
0.02
1.18∗
0.01
0.02
1.34∗
0.02
0.02
γ
−0.06
−0.09
0.10
−0.08
−0.11
0.10
−0.09
−0.12
0.11
F
−8.06∗
−0.90∗
0.16
−0.01
−1.45∗
0.16
−0.02
−1.80∗
0.16
B
0.46
0.02
2.13
0.56
0.02
2.19
0.40
0.02
2.25
J
1.06
0.03
2.52
0.87
0.03
2.59
0.96
0.03
2.49
n=500
Case 1
Case 2
Case 3
Parameter
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
BIAS
RBIAS
SSD
θ
−0.01
−0.06
0.02
−0.01
−0.07
0.02
−0.02
−0.08
0.02
μ
0.18
0.39
0.08
0.21
0.46
0.09
0.23
0.50
0.09
α
−0.02
−0.09
0.02
−0.02
−0.10
0.02
−0.02
−0.11
0.02
β
1.90∗
0.02
0.02
1.35∗
0.02
0.02
1.25∗
0.02
0.02
γ
−0.07
−0.09
0.09
−0.08
−0.11
0.10
−0.09
−0.13
0.10
F
−8.83∗
−0.98∗
0.16
−0.01
−1.48∗
0.16
−0.02
−1.86∗
0.16
B
0.44
0.02
1.68
0.48
0.02
1.68
0.47
0.02
1.73
J
0.94
0.03
2.04
1.00
0.03
1.99
1.00
0.03
2.03
Note: BIAS and RBIAS represent the average bias of the estimates and the average relative bias of the estimates, respectively; and SSD stands for the sample standard deviation of the estimates.
Simulation study in Section 6 with and : Estimation results for the SEASAR model parameter; the entries with are the original values times .Note: BIAS and RBIAS represent the average bias of the estimates and the average relative bias of the estimates, respectively; and SSD stands for the sample standard deviation of the estimates.To evaluate the prediction performance of the SEASAR model, we used different training data to build a prediction model and then compared the performance over different prediction windows. We first fixed or 39 for the training data to build a prediction model in simulation . Then for each scenario, we compared the prediction performance over a short and a relatively long time period by using the test data with and 10, respectively. For convenience, we used the labels “Scenario‐1‐short”, “Scenario‐1‐long”, “Scenario‐2‐short”, and “Scenario‐2‐long”, respectively, to indicate those settings with , and .To assess the discrepancies between the predicted values and the corresponding values generated from the model for the test data, for , we calculated the total absolute prediction error (TAPE): , and the total relative absolute prediction error (TRAPE): , where we considered two settings for . In Setting 1, we let represent the daily net number of confirmed cases , and let denote its predicted value. In Setting 2, we let represent the daily cumulative number of confirmed cases with , and let denote the corresponding predicted value.Figure S.7 of the Supplementary Material shows the boxplots of the TAPEs and TRAPEs for the four scenarios under the two settings. To gain an overall sense of the prediction error, in Table 5 we reported the average TAPE (ATAPE) and the average of TRAPE (ATRAPE) over the simulations, together with the associated sample standard deviations (SSD). As expected, with a given training dataset to build a SEASAR model for prediction, the ATAPE and ATRAPE for a shorter prediction window are smaller than the corresponding values for a longer prediction window. Furthermore, the former case incurs less variation than the latter one.
Table 5
Simulation study in Section 6: Prediction performance of the proposed SEASAR model for Scenario‐1‐short (), Scenario‐1‐long (), Scenario‐2‐short (), and Scenario‐2‐long ().
TAPE
TRAPE
ATAPE
SSD
ATRAPE
SSD
Setting 1
Scenario‐1‐short
24.77
8.71
0.31
0.11
Scenario‐1‐long
49.07
15.03
0.67
0.21
Scenario‐2‐short
22.92
7.51
0.41
0.14
Scenario‐2‐long
45.57
10.72
0.90
0.22
Setting 2
Scenario‐1‐short
105.32
77.35
0.02
0.02
Scenario‐1‐long
262.25
196.44
0.05
0.04
Scenario‐2‐short
103.72
80.48
0.02
0.01
Scenario‐2‐long
245.09
182.93
0.04
0.03
Note: and represent the size of the training and test data, respectively. ATAPE and ATRAPE represent the averages of total absolute prediction error and the total relative absolute prediction error, respectively.
Simulation study in Section 6: Prediction performance of the proposed SEASAR model for Scenario‐1‐short (), Scenario‐1‐long (), Scenario‐2‐short (), and Scenario‐2‐long ().Note: and represent the size of the training and test data, respectively. ATAPE and ATRAPE represent the averages of total absolute prediction error and the total relative absolute prediction error, respectively.
DISCUSSION
In this article, we introduced a new epidemic model, called the SEASAR model, to describe the transmission process for COVID‐19, where the population is divided into six subpopulations, called susceptible, exposed, asymptomatic, symptomatic, active, and removed. While the proposed SEASAR model extends the SIR and SEIR models to accommodate the manifestations of COVID‐19 related to asymptomatic infections and varying lag times between symptom onset and diagnosis, it has certain limitations as we outline below.The model basically delineates settings that are reasonably characterized by time‐invariant parameters in . It does not, however, facilitate the investigation of other features of the data such as weekly cycles related to varying testing and reporting rates among weekdays and weekends. To gain further insight into transmission, one may prefer to use the SEASAR model to predict the number of cumulative cases rather than daily cases. Further, prediction over a short‐term window tends to be more reliable than that over a longer period. In contrast, dividing the study period into five intervals to allow for interval‐dependent model parameters, Hao et al. (2020) proposed a generalized SEIR model to describe the COVID‐19 dynamic progression in Wuhan during the period from 1 January 2020 to 8 March 2020. With 10 unknown parameters, they focused on the estimation of only two parameters and replaced the other parameters with the estimates reported in the literature. To better describe the dynamic changes of the population, it is useful to develop a more flexible model with time‐varying model parameters.Consistent with the SIR and SEIR models, our proposed model requires two standard conditions: (1) the population is homogeneous and (2) the population size remains invariant over time. In applications, it is difficult to satisfy these conditions. By dropping the assumption of no immigration or emigration, Li et al. (2020) modified the SEIR model to delineate the COVID‐19 transmission in 375 cities of China. Their method requires the availability of inter‐city mobility data. It would be interesting to extend our proposed model with these two assumptions relaxed. For example, one could perhaps add immigration and emigration to the SEASAR model to reflect the population dynamics if the observed data include such information for estimation of the associated parameters. One may further stratify the six subpopulations by pandemic‐related factors, such as age and medical conditions, to achieve more homogeneous subpopulations. Such a development typically requires rich information at the individual level; merely having data at the population level, such as , , and in the estimation framework we have considered here, is not sufficient for building a more refined model than the SEASAR model.As noted by a referee, the model parameter cannot be well estimated based on the observed data , , and , even if it is combined with an additional distributional assumption such as the one identified in (7) and the availability assumption for and with . The use of Bayesian priors, as we have suggested following (S21) in Appendix D of Section S2 in the Supplementary Material, comes into play to help resolve the issues of nonidentifiability or nonestimability of the model parameters. While imposing prior information enables us to estimate using the posterior distribution, this does not mean that the nonidentifiability issue is eliminated. We should note that estimation results based on the posterior distribution in such a circumstance may be greatly affected by the choice of priors.The implementation of the IF‐EAKF algorithm hinges on the availability of the variance of , as shown, for instance, in the expression indicated in Equation (9). Here is assumed to be determined by the previously observed data as identified in Equation (8). While this scheme has been used in many studies of infectious diseases such as the study of influenza by Pei et al. (2018), the study of the West Nile virus by DeFelice et al. (2017), and the study of the respiratory syncytial virus by Reis & Shaman (2016), the reasonableness of the resulting analysis relies on how is specified. It would be interesting to develop an estimation procedure that can also accommodate estimation of the unknown parameter . This research warrants in‐depth studies, which extend beyond the scope of this article.Another important aspect concerns the quality of the observed data, an issue that should not be overlooked (Yi, 2017). Under‐reporting or over‐reporting confirmed cases can occur on a daily basis due to reasons related to varying incubation times, insufficient test capacity, test errors, delay in data aggregation, and so on. Available COVID‐19 data often involve measurement error (such as recall bias when reporting exposure to a COVID‐19 infected case) and missing observations. It would be interesting to refine the proposed model to address those issues, and such research warrants a careful investigation.Appendix S1 Supplementary MaterialClick here for additional data file.