Literature DB >> 34320273

Discovering dynamic models of COVID-19 transmission.

Jinwen Liang¹, Xueliang Zhang², Kai Wang², Manlai Tang³, Maozai Tian^1,2.

Abstract

Existing models about the dynamics of COVID-19 transmission often assume the mechanism of virus transmission and the form of the differential equations. These assumptions are hard to verify. Due to the biases of country-level data, it is inaccurate to construct the global dynamic of COVID-19. This research aims to provide a robust data-driven global model of the transmission dynamics. We apply sparse identification of nonlinear dynamics (SINDy) to model the dynamics of COVID-19 global transmission. One advantage is that we can discover the nonlinear dynamics from data without assumptions in the form of the governing equations. To overcome the problem of biased country-level data on the number of reported cases, we propose a robust global model of the dynamics by using maximin aggregation. Real data analysis shows the efficiency of our model.

Entities: Chemical

Keywords: COVID-19; global transmission; maximin aggregation; sparse identification of nonlinear dynamics (SINDy)

Mesh：

Year: 2021 PMID： 34320273 PMCID： PMC8447335 DOI： 10.1111/tbed.14263

Source DB: PubMed Journal: Transbound Emerg Dis ISSN： 1865-1674 Impact factor: 4.521

INTRODUCTION

The outbreak of a new virus named SARS‐CoV‐2 was initially identified in mid‐December 2019 in Wuhan, Hubei Province, China. COVID‐19, the disease caused by this coronavirus, was characterized as a pandemic by WHO on 11 March 2020. As of 5 April 2021, the number of confirmed cases rose to 131,419,173 worldwide, and the number of deaths reached 2,854,842. Research on the epidemic has sprung up in the past year. Prediction of the infectious cases, modelling the dynamics including differential equations and identifying relationships between the infectious cases and other factors, such as environmental factors, are the main concerns of statisticians. Studying the patterns of transmission can simulate the development of the current epidemic and predict the trend of the epidemic in the future, further, help governments make a decision. We focus on mechanistic equations for disease dynamics from case notification data for COVID‐19 in this paper. It is a huge challenge to discover the governing equations from real data. The data on major communicable diseases are more open, and the dynamic model of its transmission is relatively traditional and fixed. Classical mathematical models describing the spread of infectious diseases include the susceptible, infective, and recovered (SIR) model (Beretta & Takeuchi, 1995), susceptible, exposed, infective, and recovered (SEIR) model (Ma et al., 2004), susceptible, exposed, infective, diagnosed, and recovered model (C. Liu et al., 2004) and susceptible, infective, recovered and dead model (Rui & Tian, 2021), etc. The idea is to divide the population into susceptible (S), exposed (E), infected (I), confirmed (C) and recovered (R) populations, then reveal the law of epidemic transmission through the infection mechanism that how individuals move between the compartments. And these models are used to study the spread of infectious diseases such as measles, smallpox, rabies, Ebola viruses, etc. Due to the characteristic that COVID‐19 has a relatively long incubation period and quarantine measures are implemented, it is hard to describe with the existing epidemic models. Researchers have been expanded these models in many ways to account for observed epidemic patterns (Chowdhury et al., 2020; Frenkel & Schwartz, 2021; Mandal et al., 2020; Paiva et al., 2020; Rajendran & Jayagopal, 2021; Zhao & Chen, 2020). The above models all prespecify assumptions of the mechanism of dynamic transmission of COVID‐19. These assumptions are hard to verify, which reduces the possibility to receive accurate results. So we hope the dynamic transmission mechanism of COVID‐19 is not assumed in advance, and the governing equations are determined by the statistical method according to recorded time series data. Horrocks and Bauch (2020) showed that sparse identification of nonlinear dynamics (SINDy) can be applied to epidemiological data to yield models that describe observed epidemic patterns. The SINDy framework was proposed by Brunton et al. (2016) to extract governing equations from data. SINDy can be used for the discovery of new models (see Kaiser et al., 2017; Mangan et al., 2016, 2017; Markus et al., 2018; Rudy et al., 2016; Tran & Ward, 2017). So far, SINDy has not yet been tested in model discovery from empirical data on COVID19 dynamics. For one specific country or region, we can use SINDy as in Horrocks and Bauch (2020). But one difficulty we face is that country‐level or regional‐level data are heterogeneous, if we fit a regression model using global data with a mixture of all country or regional level data, there would be a significant error. Estimating regressions over the pool of all available data are likely to estimate effects that might be strong for one country but very weak for another country, resulting in too many effects that are not replicable. For example, L. Liu et al. (2021) used a dynamic panel data model to generate density forecasts for daily active COVID‐19 infections for a panel of countries/regions. Siwiak et al. (2020) provided a high‐resolution global model of the pandemic but prespecified assumptions of the mechanism. For inhomogeneous data, Bühlmann and Meinshausen (2015) showed a different type of aggregation can still lead to consistent estimation of the effects which are common in all heterogeneous data, the so‐called maximin effects (Meinshausen & Bühlmann, 2015). The maximin aggregation, also called magging, is simple and general. Applying maximin effects to the SINDy framework, we can provide a robust global transmission model. In this article, we apply SINDy to time series data from COVID‐19 to discover dynamic models that govern its epidemic patterns on each grouped data, then use maximin aggregation to get a global dynamic model. One advantage of this research is we create a global model of the early stages of the pandemic that would overcome the problem of the heterogeneous data on the number of notification cases; the other advantage is we determine governing equations according to recorded time series data. The remainder of this paper is structured as follows. In Section 2, we introduce SINDy and magging estimating procedures. In Section 3, we do some descriptive analysis of the data of the Johns Hopkins University dataset. In Section 4, a systematical analysis based on SINdy and magging is carried out. Finally, in Section 5, we give concluding remarks and future research proposals.

MATERIALS AND METHODS

For given time t, and N represent the cumulative number of the total population, the number of the susceptible individuals, the number of the infected cases, the cumulative number of the recovery cases and the cumulative number of the death cases at time t, respectively. The SINDy algorithm (Brunton et al., 2016) performs the intractable brute force search for all possible model functional terms. The historical data , which are sampled at several times , are arranged into two large matrices: where is the derivative of (t). Then, we construct an augmented library consisting of candidate nonlinear functions of the columns of . For example, may consist of constant, polynomial and trigonometric terms Here, higher polynomials are denoted as , , etc. For example, denotes the quadratic nonlinearities in the state variable , given by A sparse regressionproblem is set up with the sparse vectors of coefficients , where is modelled as a matrix of independent identically distributed Gaussian entries with zero mean, and noise magnitude . The matrix has dimensions , where is the number of candidate nonlinear functions. And has dimensions . Sparse regression techniques, such as LASSO (Tibshirani, 1996), adaptive LASSO ((Hui, 2006)), smoothly clipped absolute deviation (Fan & Li, 2001), elastic‐net (Hui & Hastie, 2005), etc have been researched in a large number of studies. In this article, we use sequential least‐squares proposed in Brunton et al. (2016); e algorithm is given below.

ALGORITHM 1 SINDy algorithm

Input: , , λ is a sparsification knob. Initial guess: . Find small coefficients and threshold and regress dynamics onto remaining terms to find sparse . Repeat step 2 until convergence. Output: . The kth row of the dynamical system is reconstructed as follows: Applying SINDy to discover a continuous‐time model involves determining the derivative vector . Using the discrete system, the response vector is . Take the SIR model for an example, in its most elementary version it can be written in discrete time as follows: where N is the (fixed) size of the population, is the average number of contacts per person per time and is the rate of recovery or mortality. Now we add country or regional level g, . The kth row of the dynamical system is as follows: As estimates for the , we use the sequential threshold least‐squares estimator where is a tuning parameter, the calculating process is illustrated in Algorithm 1, and Akaike's information criterion (AIC; Bozdogan, 1987) is used to select . Otherwise, the sparse regression technique as we mentioned above is also a good choice. Here, let the empirical explained variance in group g is So the estimator for according to Bühlmann and Meinshausen (2015) is is approximately a combination of the weighted estimators of each group, and the weights are computed by quadratic programming. The optimization and computation can be implemented in a very efficient way; refer to Bühlmann and Meinshausen (2015) for more details. The steps in the Matlab environment are as follows: Step 1. Calculate the empirical covariance matrix Step 2. Step 3. Use ‘quadprog’ to calculate the weight vector x. [x,fval,exitflag,output,lambda] = quadprog(H,d,[],[],Aeq,beq,lb), d = zeros(G,1); Aeq = ones(1,G);beq = 1; lb = zeros(G,1). Step 4. The combination of the weighted estimators is .

DATA DESCRIPTION

The data we use in this paper were collected by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. Original global data are updated once a day. The reported time series started on 22 January 2020. The infected population is calculated as , where represents the cumulative confirmed patients at time t, R is the cumulative recovered patients at time t and D denotes the cumulative deaths cases at time t. Adding up regional data from the same country except for ‘Diamond Princess’, there are 191 countries in the global daily reports. There is a comparison between the numbers of cumulative confirmed cases on 10 March 2020, and 10 March 2021, as illustrated in Figure 1. The global epidemic is getting worse and worse.

FIGURE 1

Bubble plot of numbers of cumulative confirmed cases on 10 March 2020, and 10 March 2021. The larger the bubble, the severer the situation

Bubble plot of numbers of cumulative confirmed cases on 10 March 2020, and 10 March 2021. The larger the bubble, the severer the situation By 10 March 2021, the top 10 countries with the largest number of confirmed cases are shown in Figure 2.

FIGURE 2

Number of cumulative confirmed cases for selected top ten countries on 10 March 2021

Number of cumulative confirmed cases for selected top ten countries on 10 March 2021 The susceptible population means more easily infected group and is an indispensable part of compartment models. But original dataset does not include susceptible cases. Here we add an assumption that the total population is fixed with constraint , which results in . So we can replace with in constructing differential equations. Accordingly, holds in our real data analysis. SINDy was then applied to the CSSE dataset using a function library consisting of all polynomials involving , and up to second order. That is which is 15 dimensional. And the discrete system is as follows: : Polynomial terms often appear in traditional compartmental epidemic models, so we choose polynomials as library functions. When a third‐order library is used instead, the results will be different due to the number of increasing parameters. Furthermore, there are severe noises and errors in the CSSE dataset, the final estimation is terrible. We only show our baseline analysis using a second‐order polynomial library. The AIC criterion is used to choose the optimal tuning parameter , which balances sparsity and overfitting. Models yielding the lowest AIC score across the grid will finally be selected. The evaluation of the global transmission of COVID‐19 is based on the top 10 countries with the largest number of confirmed cases. We choose dates whose deaths number are greater than 20 as the start of the event time and dates as the end in our analysis. Another way to group the global population is according to belonging to continents. As we know, the world is divided into seven continents, namely Asia, Europe, Africa, North America South America, Oceania, and Antarctica. No Antarctic data are reported in the table. Summing up data of the same continent, we get six summary data of continents, respectively. Different from country‐level data, the start of the event time is when deaths number exceeds 350, and the end is . We also analyze global summary data adding up all country and regional data as a comparison. The period is from 18 February 2020 to 13 April 2020, a 56‐day interval.

RESULTS

The calculating coefficients are illustrated in Table 1. Three models all remain one order terms and second‐order terms make no difference.

TABLE 1

Coefficients comparison between three different models using a function library of polynomials up to second order

Methods	Terms	1	C	I	R	D
Country based	C Eq.	2722	−1.06	1.14	1.06	0.383
	I Eq.	2399	0.0018	−0.0059	−0.0195	−0.0923
	R Eq.	178.0	0.241	−0.233	−0.254	−0.217
	D Eq.	0	0.0565	0.0565	0.0565	0.0565
Continent based	C Eq.	2700	0.0069	−0.0147	−0.0218	0.0633
	I Eq.	−861.0	0.802	−0.81	−0.784	−0.32
	R Eq.	1611.0	0.0041	−0.0036	−0.0121	0.0458
	D Eq.	24.2	−0.0073	0.0082	0.0074	0.0186
Global based	C Eq.	0	−0.397	0.488	0.39	0
	I Eq.	0	−0.415	0.477	0.411	0
	R Eq.	0	0.0267	0	−0.0286	0
	D Eq.	0	−0.0127	0.0161	0.012	0

C, confirmed cases; CD, product of confirmed cases and death cases, C eq., equation of confirmed cases; CI, product of confirmed cases and infected cases; CR, product of confirmed cases and recovered cases; D, death cases, D eq., equation of death cases; I, infected cases; I eq., equation of infected cases; ID, product of infected cases and death cases; IR, product of infected cases and recovered cases; R, recovered cases; RD, product of recovered cases and death cases; R eq., equation of recovered cases.

Coefficients comparison between three different models using a function library of polynomials up to second order C, confirmed cases; CD, product of confirmed cases and death cases, C eq., equation of confirmed cases; CI, product of confirmed cases and infected cases; CR, product of confirmed cases and recovered cases; D, death cases, D eq., equation of death cases; I, infected cases; I eq., equation of infected cases; ID, product of infected cases and death cases; IR, product of infected cases and recovered cases; R, recovered cases; RD, product of recovered cases and death cases; R eq., equation of recovered cases. To see the results more clearly, we predict the confirmed cases, infected cases, recovered cases, and deaths cases in 30 days of the early period. The confirmed cases, infected cases, recovered cases, and death case on 18 January 2020, was 557, 510, 30, and 17, respectively, which is the start point of the differential equations. Figure 3 displays the complete results. Global‐based predictive cases are much less than country‐based and continent‐based predictive cases. To better illustrate the results, we use biaxial coordinates. Though the red line and the green line are closer, global‐based predictions deviate most from true data because there is a mixture of all country‐level biases. Continent‐based models perform better in earlier confirmed and infected cases prediction, while country‐based models perform better in latter prediction. For both recovered and death cases prediction, it is clear that country‐based models are more accurate than continent‐based models.

FIGURE 3

Comparison of the predictive number of confirmed cases, infected cases, recovery cases and death cases, respectively, in the first 30 days. Country based results (blue star line), continent based results (black rhombus), global‐based results (red ring line) and true data (green dot line)

DISCUSSION

In this paper, we demonstrate that SINDy can discover dynamical models from COVID‐19 data. The data have some limitations. Case notifications are typically under‐reported and biased even when they are available. Infectious cases are measured with error because individuals who are asymptomatic are not captured in the official statistics. It is hard to decide whether death cases are dying with or dying of COVID‐19. Also, counts of the number of recovered individuals are often inaccurate. Due to state policies towards the disease are different, country‐level data are heterogeneous. Fitting model curves on global data bear a significant error, as these data are a mixture of all country‐level biases. Estimating regressions over the pool of all available data is likely to estimate effects that might be strong for one country but very weak for another country, resulting in too many effects that are not replicable. To overcome the problem of biased and heterogeneous country‐level data on the number of cases, we provide a robust global model of the dynamics by using the maximin aggregation technique. This model is simple and conservative, can reduce the complexity of data sources and extract a structure with the common contribution to all country data. Regression algorithms and tuning parameter selection too influence the final results. Our approach replaces the susceptible time series with the confirmed time series based on the assumption. According to the research from Eastin and Eastin (2020), almost everyone is susceptible to COVID‐19, which increases the difficulty for counts of accurate susceptible populations. One problem is how to construct a library of functions, we can incorporate domain knowledge and other related methods, for example, compartment models. We select polynomials functions as a basis due to forms of compartment models. Deep methods (Long et al., 2018, 2019) are another technique that can provide more flexible representations to discover differential equations, but lack interpretations. Regression coefficients in our model are fixed, they may be varying due to seasonal factors, etc. In our future research, we will consider regression coefficients are functions too. In conclusion, we have shown that SINDy and maximin aggregation can be applied to COVID‐19 data to yield models that describe global epidemic patterns.

CONFLICTS OF INTEREST

The authors declare no conflict of interest with respect to the research, authorship and/or publication of this article.

AUTHOR CONTRIBUTIONS

All authors conceived the study, carried out the analysis, discussed the results, drafted the manuscript, critically read the manuscript. Specifically, Dr. Jinwen Liang and Professor Xueliang Zhang contributed to all of the following: (1) conception and design of the work, acquisition of data; and (2) drafting the article or revising it critically for important intellectual content; Professors Kai Wang, Manlai Tang and Tian Maozai contributed to all of the following: data analysis and interpretation of data; final approval of the version to be published and agreement to be accountable for all aspects of the work.

ETHICAL STATEMENT

The data of daily reports were collected from CSSE at Johns Hopkins University and thus neither ethical approval nor individual consent was not applicable.

16 in total

Discovering dynamic models of COVID-19 transmission.

INTRODUCTION

MATERIALS AND METHODS

ALGORITHM 1 SINDy algorithm

DATA DESCRIPTION

RESULTS

DISCUSSION

CONFLICTS OF INTEREST

AUTHOR CONTRIBUTIONS

ETHICAL STATEMENT

1. Sparse identification of nonlinear dynamics for rapid model recovery.

2. Discovering governing equations from data by sparse identification of nonlinear dynamical systems.

3. Joint estimation of case fatality rate of COVID-19 and power of quarantine strategy performed in Wuhan, China.

4. Data-driven discovery of partial differential equations.

5. A model based study on the dynamics of COVID-19: Prediction and control.

6. Algorithmic discovery of dynamic models from infectious disease data.

7. A data-driven model to describe and forecast the dynamics of COVID-19 transmission.

8. Sparse identification of nonlinear dynamics for model predictive control in the low-data limit.

9. Discovering dynamic models of COVID-19 transmission.

10. Modeling the epidemic dynamics and control of COVID-19 outbreak in China.

1. SINDy-SA framework: enhancing nonlinear system identification with sensitivity analysis.

2. Discovering dynamic models of COVID-19 transmission.