Forecasting the number of confirmed new cases of COVID-19 in Italy for the period from 19 May to 2 June 2020.

Abstract

In this paper we forecast the spread of the coronavirus disease 2019 (COVID-19) outbreak in Italy in the time window from May 19 to June 2, 2020. In particular, we forecast the number of new daily confirmed cases. We propose a forecast procedure that combines a log-polynomial model with a first-order integer-valued autoregressive model. An out-of-sample comparison with forecasts from an autoregressive integrated moving average (ARIMA) model indicates that our procedure outperforms the ARIMA model. The root mean square error (RMSE) of the ARIMA model is always greater than that of our procedure and generally more than twice as high. We have also conducted Diebold and Mariano (1995) tests of equal mean square error (MSE). The test results confirm that forecasts from our procedure are significantly more accurate at all horizons. We think that the advantage of our approach comes from the fact that it explicitly takes into account the number of swabs.

Entities: Disease

Keywords: COVID-19; Real-time forecasts; Time series

Year: 2021 PMID: 33521404 PMCID: PMC7835080 DOI: 10.1016/j.idm.2021.01.003

Source DB: PubMed Journal: Infect Dis Model ISSN: 2468-0427

Introduction

Coronavirus disease 2019 (COVID-19) is defined as illness caused by a novel coronavirus now called “severe acute respiratory syndrome coronavirus 2” (SARS-CoV-2; formerly called 2019-nCoV), which was first identified amid an outbreak of respiratory illness cases in Wuhan City, Hubei Province, China. The World Health Organization declared COVID-19 a pandemic on March 11, 2020. Public health officials are increasingly using mathematical, statistical and computational tools to guide intervention plans against infectious disease outbreaks. In particular, when an outbreak occurs, statistical models can be useful to forecast its trajectory. In this paper, we propose a simple econometric procedure to predict the spread of COVID-19. The forecasts are obtained for Italy in the time window from May 19 to June 2, 2020. A number of authors have tried to forecast the evolution of the COVID-19 outbreak by using time series methods; these include classical ARIMA models (Alzahrani et al., 2020; Benvenuto et al., 2020; Chintalapudi et al., 2020; Perone, 2020; Yang et al., 2020), a hybrid ARIMA-wavelet-based model (Chakraborty and Ghosh, 2020), a log-polynomial model (Peracchi, 2020), machine learning models (Dal Molin Ribeiro et al., 2020) and artificial neural network models (Wieczorek et al., 2020). In order to forecast the number of daily new diagnosed cases, we use a log-polynomial model together with a first-order INteger-valued AutoRegressive (INAR(1)) model. The log-polynomial model is used to forecast the ratio of the number of daily new diagnosed cases to the number of swabs, and the INAR(1) model is used to forecast the number of swabs. The forecast of the number of daily new diagnosed cases is then obtained by multiplying these two forecasts. An out-of-sample comparison with forecasts from an ARIMA model is also considered. Our procedure appears to provide very good forecast performance.
The advantage of our approach probably comes from the fact that it explicitly takes into account the number of swabs. In fact, we think that it is necessary to consider the number of swabs to capture the fluctuations of the number of daily new confirmed cases around the trend. To the best of our knowledge, no model with this characteristic is present in the literature. The structure of the paper is as follows. Section 2 describes the data set. Section 3 illustrates the statistical models. Section 4 presents real-time forecasts of confirmed cases. Section 5 concludes the paper.

Characteristics of the data

In this paper we use daily time-series data from the Italian Civil Protection Department (Dipartimento della Protezione Civile or DPC). Data are available at: https://github.com/pcm-dpc/COVID-19/tree/master/dati-andamentonazionale. The DPC website contains different daily time series at various levels of aggregation, starting February 24, 2020. For a complete description of this data set, see Peracchi (2020). Here, we consider two national-level time series from March 18 to June 2, 2020: the number of new daily COVID-19 confirmed cases (“Nuovi positivi”) and the number of daily nasal swabs (“Tamponi”). Fig. 1 illustrates the behavior of the new daily confirmed cases. This series seems to fluctuate around a linear time trend.

Fig. 1

Number of new daily COVID-19 confirmed cases from March 18 to June 2, 2020

The number of daily swab tests in Fig. 2 shows significant weekly fluctuations too. It is interesting to note that there is a strong positive linear relationship between “Nuovi positivi” and “Tamponi” (see the scatter plot in Fig. 3). The correlation coefficient between these variables (0.88) is significantly different from zero.

Fig. 2

Number of Tamponi from March 18 to June 2, 2020

Fig. 3

Scatter plot Nuovi positivi versus Tamponi, February 24 to December 25, 2020

This correlation suggests that a model that does not take into account the number of performed swabs might not provide good predictive performance. In order to address this problem, we adopt the following procedure. First, we use a log-polynomial model to forecast the ratio Nuovi positivi/Tamponi, then we use an INAR(1) model to forecast the number of Tamponi, and finally we combine the two forecasts.

The statistical models

In this section we briefly introduce the log-polynomial and INAR(1) models and present our forecast procedure.

The log-polynomial model

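Let $Y_t$ be a non-negative integer-valued random variable. We say that $Y_t$ follows a log-polynomial model of order $n$ if

$$\log Y_t = \beta_0 + \beta_1 t + \dots + \beta_n t^n + u_t, \tag{1}$$

where $\beta_0, \beta_1, \dots, \beta_n$ are $n+1$ unknown parameters and $u_t$ is a random term with mean zero and variance $\sigma^2$. To fix ideas, we consider the case $n = 2$. If we let the historical data be denoted by $Y_1, \dots, Y_T$, then we can write the forecasts of $Y_{T+h}$ for $h = 1, 2, \dots$ as

$$\tilde{Y}_{T+h} = \exp\left(\beta_0 + \beta_1 (T+h) + \beta_2 (T+h)^2\right).$$

It is important to note that the predictor $\tilde{Y}_{T+h}$ will systematically underestimate the conditional mean of $Y_{T+h}$. In fact, the Jensen inequality, together with the condition $E[u_t] = 0$, implies that

$$E[\exp(u_t)] \ge \exp(E[u_t]) = 1.$$

Duan (1983) suggested that a consistent estimator of $E[\exp(u_t)]$ is given by

$$\frac{1}{T} \sum_{t=1}^{T} \exp(\hat{u}_t),$$

where $\hat{u}_t$ is a least squares residual in the original log form regression (1). Thus an appropriate feasible predictor for $Y_{T+h}$ would be the so-called Duan’s smearing estimator

$$\hat{Y}_{T+h} = \exp\left(\hat\beta_0 + \hat\beta_1 (T+h) + \hat\beta_2 (T+h)^2\right) \frac{1}{T} \sum_{t=1}^{T} \exp(\hat{u}_t), \tag{2}$$

where $\hat\beta_0, \hat\beta_1, \hat\beta_2$ are the Ordinary Least Squares estimates of the parameters $\beta_0, \beta_1, \beta_2$, respectively. In the sequel we will use this predictor.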
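As an illustration of the smearing correction, the following Python sketch fits a log-polynomial trend by ordinary least squares and compares the naive predictor with the smeared one. It is a minimal sketch under stated assumptions, not the authors' code: the function names (`solve_linear`, `fit_log_polynomial`, `forecast_log_polynomial`) and the synthetic ratio series are made up for the example.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_log_polynomial(y, order=2):
    """OLS fit of log(y_t) on 1, t, ..., t^order for t = 1..T.

    Returns the coefficient list and Duan's smearing factor,
    i.e. the sample mean of exp(residual).
    """
    T = len(y)
    rows = [[float(t) ** k for k in range(order + 1)] for t in range(1, T + 1)]
    logy = [math.log(v) for v in y]
    k = order + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * ly for r, ly in zip(rows, logy)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    resid = [ly - sum(b * x for b, x in zip(beta, r)) for r, ly in zip(rows, logy)]
    smear = sum(math.exp(u) for u in resid) / T
    return beta, smear

def forecast_log_polynomial(beta, smear, T, h):
    """Return (naive, smeared) forecasts of Y at time T + h."""
    t = float(T + h)
    naive = math.exp(sum(b * t ** k for k, b in enumerate(beta)))
    return naive, smear * naive

# Synthetic ratio series (made-up data): quadratic log-trend plus alternating noise.
y = [math.exp(-1.0 - 0.05 * t + 0.0002 * t * t + (0.3 if t % 2 else -0.3))
     for t in range(1, 41)]
beta, smear = fit_log_polynomial(y, order=2)
naive, smeared = forecast_log_polynomial(beta, smear, T=len(y), h=1)
print(f"smearing factor = {smear:.4f}")
print(f"naive = {naive:.5f}, smeared = {smeared:.5f}")
```

Because the regression includes an intercept, the residuals average to zero, so the smearing factor is never below one and the smeared forecast is never below the naive one.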

The INAR(1) model

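The INAR(1) process was introduced by McKenzie (1988), Al-Osh and Alzaid (1987) and Alzaid and Al-Osh (1988) in order to model non-negative integer-valued time series. Before introducing the INAR(1) model, it is necessary to give the definition of the thinning operator, as reported by Steutel and van Harn (1979). Let $X$ be a non-negative integer-valued random variable and $\alpha \in [0,1]$. The binomial thinning operator, denoted by $\circ$, is defined as

$$\alpha \circ X = \sum_{i=1}^{X} B_i,$$

where $\{B_i\}$ is a sequence of independent and identically distributed (i.i.d.) random variables that follow a Bernoulli distribution with parameter $\alpha$, and that is independent of $X$. A discrete-time non-negative integer-valued stochastic process $\{X_t\}$ is said to be an INAR(1) process if it satisfies the equation

$$X_t = \alpha \circ X_{t-1} + \varepsilon_t,$$

where $\{\varepsilon_t\}$ is an innovation sequence of i.i.d. non-negative integer-valued random variables not depending on past values of $\{X_t\}$. Following Bourguignon et al. (2016) we also introduce the definition of a first-order seasonal INAR process. A discrete-time non-negative integer-valued stochastic process $\{X_t\}$ is said to be a first-order seasonal INAR process with seasonal period $s$ (INAR(1)$_s$) if it satisfies the following equation:

$$X_t = \alpha \circ X_{t-s} + \varepsilon_t,$$

where $\{\varepsilon_t\}$ is an innovation sequence of i.i.d. non-negative integer-valued random variables not depending on past values of $\{X_t\}$, and the positive integer $s$ denotes the seasonal period. If $\{\varepsilon_t\}$ is a sequence of independent and identically distributed Poisson random variables with mean $\lambda$, then the process is called a Poisson INAR(1)$_s$ process. In the sequel we will use a Poisson INAR(1)$_s$ process. The unknown parameters of this model are $\alpha$ and $\lambda$. We can estimate $\alpha$ and $\lambda$ using the conditional least squares (CLS) method. Let $X_1, \dots, X_T$ be a sample of a Poisson INAR(1)$_s$ process. For all $t$, $\mathcal{F}_t$ denotes the σ-algebra generated by the random variables $X_1, \dots, X_t$ and $E_\theta$ denotes the expectation with respect to the true parameter $\theta = (\alpha, \lambda)$. The CLS estimator of $\theta$ is given by

$$\hat\theta = \arg\min_{\theta} S_T(\theta),$$

where $S_T(\theta) = \sum_{t=s+1}^{T} \left[X_t - E_\theta(X_t \mid \mathcal{F}_{t-1})\right]^2 = \sum_{t=s+1}^{T} \left(X_t - \alpha X_{t-s} - \lambda\right)^2$.
It is possible to show that the CLS estimates of $\alpha$ and $\lambda$ are given, respectively, as

$$\hat\alpha = \frac{(T-s) \sum_{t=s+1}^{T} X_t X_{t-s} - \sum_{t=s+1}^{T} X_t \sum_{t=s+1}^{T} X_{t-s}}{(T-s) \sum_{t=s+1}^{T} X_{t-s}^2 - \left(\sum_{t=s+1}^{T} X_{t-s}\right)^2}, \qquad \hat\lambda = \frac{1}{T-s} \left(\sum_{t=s+1}^{T} X_t - \hat\alpha \sum_{t=s+1}^{T} X_{t-s}\right).$$

Now, we consider the problem of forecasting a future value $X_{T+h}$, $h = 1, 2, \dots$, based on the series up to time $T$. Following Bourguignon et al. (2016) we will use the predictor

$$\hat{X}_{T+h} = \hat\alpha^{q} X_{T-r} + \hat\lambda \, \frac{1 - \hat\alpha^{q}}{1 - \hat\alpha}, \tag{3}$$

where $q = [h/s]$ and $r = qs - h$, with $[x]$ denoting the upper integer part of $x$, that is, $[x] = \min\{n \in \mathbb{Z} : x \le n\}$.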
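The thinning operator, the CLS estimates and the h-step predictor can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the parameter values (α = 0.6, λ = 5, s = 7) and the simulated series are assumptions, and the simulation by explicit Bernoulli draws is only practical for small counts.

```python
import math
import random

def binomial_thinning(alpha, x, rng):
    """alpha o x: number of successes among x i.i.d. Bernoulli(alpha) draws."""
    return sum(1 for _ in range(x) if rng.random() < alpha)

def poisson_draw(lam, rng):
    """Poisson(lam) draw by Knuth's multiplication method (small lam only)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_seasonal_inar1(alpha, lam, s, n, rng):
    """Simulate X_t = alpha o X_{t-s} + eps_t with Poisson(lam) innovations."""
    # Start the first season near the stationary mean lam / (1 - alpha).
    x = [poisson_draw(lam / (1.0 - alpha), rng) for _ in range(s)]
    for t in range(s, n):
        x.append(binomial_thinning(alpha, x[t - s], rng) + poisson_draw(lam, rng))
    return x

def cls_estimates(x, s):
    """Conditional least squares: regress X_t on X_{t-s} with an intercept."""
    cur, lag = x[s:], x[:-s]
    m = len(cur)  # m = T - s usable pairs
    s_cl = sum(a * b for a, b in zip(cur, lag))
    s_c, s_l = sum(cur), sum(lag)
    s_ll = sum(b * b for b in lag)
    alpha = (m * s_cl - s_c * s_l) / (m * s_ll - s_l ** 2)
    lam = (s_c - alpha * s_l) / m
    return alpha, lam

def forecast_seasonal_inar1(x, s, h, alpha, lam):
    """h-step predictor alpha^q X_{T-r} + lam (1 - alpha^q) / (1 - alpha)."""
    q = math.ceil(h / s)
    r = q * s - h
    base = x[len(x) - 1 - r]  # X_{T-r}; x[-1] is X_T
    return alpha ** q * base + lam * (1.0 - alpha ** q) / (1.0 - alpha)

rng = random.Random(2020)
series = simulate_seasonal_inar1(alpha=0.6, lam=5.0, s=7, n=500, rng=rng)
alpha_hat, lam_hat = cls_estimates(series, s=7)
print(f"alpha_hat = {alpha_hat:.3f}, lambda_hat = {lam_hat:.3f}")
for h in (1, 7, 8):
    print(h, round(forecast_seasonal_inar1(series, 7, h, alpha_hat, lam_hat), 2))
```

For h = 1 and s = 7 the predictor uses q = 1 and r = 6, so the forecast for tomorrow is built from the observation made on the same weekday one week earlier.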

A procedure to forecast the confirmed cases of COVID-19

In this subsection we illustrate our procedure to forecast the number of daily new infections. We denote with Z the variable new infections and with X the number of swabs. Then, we consider the ratio Y = Z/X. Our predictor (denoted by $\hat{Z}_{T+h}$) is obtained in three steps:
Step 1. Estimate a log-polynomial model of order 1 for the variable Y and use the predictor (2) to forecast $Y_{T+h}$, h = 1,2,....
Step 2. Estimate a seasonal INAR(1)$_7$ model (seasonal period s = 7) for the variable X and use the predictor (3) to forecast $X_{T+h}$, h = 1,2,....
Step 3. Obtain $\hat{Z}_{T+h}$ by $\hat{Z}_{T+h} = \hat{Y}_{T+h} \cdot \hat{X}_{T+h}$.
To the best of our knowledge, this is the first paper presenting such a combination of two different approaches (log-polynomial and INAR(1) models) to forecast the confirmed cases of COVID-19.
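The final step of the procedure is a horizon-by-horizon product of the two forecast paths. The short Python sketch below shows this combination only; the function name `combine_forecasts` and all numbers are hypothetical and are not taken from the paper.

```python
def combine_forecasts(ratio_forecasts, swab_forecasts):
    """Step 3: multiply the forecast ratio by the forecast number of swabs."""
    if len(ratio_forecasts) != len(swab_forecasts):
        raise ValueError("both forecast paths must cover the same horizons")
    return [y * x for y, x in zip(ratio_forecasts, swab_forecasts)]

# Made-up forecasts for h = 1, 2, 3 (illustration only).
ratio_hat = [0.010, 0.009, 0.008]        # forecast of Y = Z / X
swabs_hat = [60000.0, 65000.0, 40000.0]  # forecast of X
cases_hat = combine_forecasts(ratio_hat, swabs_hat)
print([round(z, 1) for z in cases_hat])
```

In this made-up example the forecast ratio falls steadily, yet the forecast number of cases follows the swings in the swab forecast, which is how the procedure lets testing volume shape the short-run fluctuations.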

Real-time forecasting performance of the proposed procedure

In this section we evaluate the real-time forecasting performance of our procedure and compare it with the forecasts from an ARIMA model for the number of confirmed new cases. We denote the forecasts from our procedure with P1 and the ARIMA forecasts with P2. We recall that a variable $y_t$ follows an ARIMA(p, d, q) model if

$$(1 - L)^d \phi(L) y_t = \theta(L) u_t,$$

where $L$ is the lag operator, $d$ is the degree of first differencing involved, $\phi(L)$ and $\theta(L)$ are the autoregressive and moving average polynomials of orders p and q respectively, and $u_t$ is a normally distributed white noise process with mean 0 and variance $\sigma^2$. ARIMA models often outperform more sophisticated structural models in terms of short-run forecasting ability. Therefore, the ARIMA forecasting technique provides a useful benchmark by which other forecasting techniques may be appraised. In particular, we have identified (using the BIC criterion) and estimated an ARIMA(2,1,2) model for the number of confirmed new cases. In our real-time forecast evaluation exercise we use a recursive forecasting scheme, expanding the model estimation sample before each new forecast. In the first run the models are estimated on the period March 18 - May 18, 2020 (training set) and the evaluation period is May 19 - June 2, 2020 (test set); in the second run the models are estimated on the period March 18 - May 19, 2020 and the test set is May 20 - June 2, 2020; and so forth. In this way we obtain, for each procedure P1 and P2, 15 one-day-ahead predictions, 14 two-day-ahead predictions, 13 three-day-ahead predictions, and so on.
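The bookkeeping of the recursive (expanding-window) scheme can be checked with a short Python sketch. It only enumerates forecast origins and target dates; no model is fitted. The function name `forecast_schedule` is an assumption, and the maximum horizon of 7 is taken from the horizons reported in Table 1.

```python
from datetime import date, timedelta

def forecast_schedule(first_origin, last_day, max_h):
    """List (origin, horizon, target) triples for an expanding training window.

    The training set always starts on the same day and ends at `origin`;
    a forecast is kept only if its target date falls inside the test period.
    """
    schedule = []
    origin = first_origin
    while origin < last_day:
        for h in range(1, max_h + 1):
            target = origin + timedelta(days=h)
            if target <= last_day:
                schedule.append((origin, h, target))
        origin += timedelta(days=1)
    return schedule

schedule = forecast_schedule(date(2020, 5, 18), date(2020, 6, 2), max_h=7)
counts = {h: sum(1 for _, hh, _ in schedule if hh == h) for h in range(1, 8)}
for h, n in counts.items():
    print(f"h = {h}: {n} predictions")
```

Each additional day of horizon loses one prediction, because the last admissible forecast origin moves one day earlier.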

Motivation of the time span March 18 - June 2 2020

It is important to underline that our procedure, like many other time series methods, assumes that the underlying data generating process of the time series is constant. For actual time series data this assumption is often invalid as shifting environmental conditions may cause the underlying data generating process to change. In the period March 18 - June 2, 2020 (within which both the training sets and test sets are included) the assumption of constancy of the data generating process can be considered a plausible assumption, because Italy went into lockdown from March 9 to June 3, when the Italian Government granted free movement to all citizens across Regions.

Performance measures

We follow the usual practice in the literature and evaluate the point forecasts from the different models using both the square root of the mean squared forecast errors (RMSE) and the mean of the absolute forecast errors (MAE). For a given horizon h, the RMSE is the square root of the average of the squared forecast errors $(W_{T+i} - \hat{W}_{T+i})^2$ and the MAE is the average of the absolute forecast errors $|W_{T+i} - \hat{W}_{T+i}|$, both taken over the forecasts available at that horizon, where $\hat{W}_{T+i}$ is the predicted value of $W_{T+i}$ based on the estimates of the model parameters from all the data available up to the current date T − h + i. These measures capture the average deviation of the forecast from the realized value over the test set. Table 1 shows the RMSE and the MAE for the two forecasting procedures (P1 and P2). The procedure with the smallest forecasting error for each horizon is displayed in bold. We note that the proposed procedure P1 always outperforms the alternative ARIMA-based procedure P2. The RMSE of procedure P2 is always greater than that of procedure P1 and generally more than twice as high.
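For a fixed horizon, the two accuracy measures reduce to simple averages over the paired realized and predicted values. The Python sketch below is a minimal illustration with made-up numbers; the function names `rmse` and `mae` are assumptions.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared forecast error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(actual, predicted):
    """Mean of the absolute forecast errors."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return sum(abs(e) for e in errors) / len(errors)

# Made-up realized values and same-horizon forecasts (illustration only).
actual = [300, 320, 310]
predicted = [290, 325, 322]
print(f"RMSE = {rmse(actual, predicted):.3f}, MAE = {mae(actual, predicted):.3f}")
```

Because the RMSE squares the errors before averaging, it penalizes the single large miss (12 cases) more heavily than the MAE does, so the RMSE is never smaller than the MAE.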

Table 1

RMSE and MAE of the forecasts.

RMSE
	h = 1	h = 2	h = 3	h = 4	h = 5	h = 6	h = 7
P1	58.66	60.15	60.48	55.54	58.89	62.13	64.64
P2	121.95	127.54	115.79	103.60	130.95	179.50	197.33
MAE
	h = 1	h = 2	h = 3	h = 4	h = 5	h = 6	h = 7
P1	47.56	49.01	49.47	44.38	47.71	50.54	53.05
P2	92.27	95.82	95.57	85.34	114.56	141.18	166.52

Note. The table shows the RMSE and MAE of the forecasts obtained from P1 and P2 procedures. The values in bold represent the best model according to each measure of error and for each forecast horizon.
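The statement that the P2 RMSE is generally more than twice the P1 RMSE can be checked directly from the values in Table 1. The Python sketch below computes the ratio at each horizon; the variable names are assumptions, while the numbers are copied from the table.

```python
# RMSE values from Table 1, horizons h = 1..7.
rmse_p1 = [58.66, 60.15, 60.48, 55.54, 58.89, 62.13, 64.64]
rmse_p2 = [121.95, 127.54, 115.79, 103.60, 130.95, 179.50, 197.33]

ratios = [p2 / p1 for p1, p2 in zip(rmse_p1, rmse_p2)]
for h, ratio in enumerate(ratios, start=1):
    print(f"h = {h}: P2/P1 RMSE ratio = {ratio:.2f}")

above_two = sum(1 for ratio in ratios if ratio > 2.0)
print(f"ratio above 2 at {above_two} of {len(ratios)} horizons")
```

The ratio exceeds one at every horizon and exceeds two at every horizon except h = 3 and h = 4.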

We have also conducted Diebold and Mariano (1995) tests of equal mean square error (MSE). The results of the tests (available upon request) confirm that forecasts from procedure P1 are significantly more accurate at all horizons. In particular, the good performance of the procedure P1, when we consider the forecast one step ahead, is shown in Fig. 4.
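Since the test results are not reported in the text, the following Python sketch only illustrates how a Diebold and Mariano (1995) statistic for equal MSE can be computed from two series of forecast errors. It assumes a squared-error loss, a rectangular kernel with h − 1 autocovariance lags and a two-sided normal approximation; the function name `diebold_mariano` and the error series are made up, so the output says nothing about the paper's results.

```python
import math

def diebold_mariano(errors_1, errors_2, h=1):
    """DM statistic and two-sided normal p-value for equal MSE.

    errors_1, errors_2: forecast errors of two procedures for the same targets.
    h: forecast horizon; autocovariances of the loss differential up to
       lag h - 1 enter the long-run variance.
    """
    d = [a * a - b * b for a, b in zip(errors_1, errors_2)]  # loss differential
    n = len(d)
    d_bar = sum(d) / n

    def autocov(k):
        return sum((d[t] - d_bar) * (d[t - k] - d_bar) for t in range(k, n)) / n

    long_run_var = autocov(0) + 2.0 * sum(autocov(k) for k in range(1, h))
    if long_run_var <= 0.0:
        # Can happen in small samples with the rectangular kernel.
        raise ValueError("non-positive long-run variance estimate")
    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value

# Made-up one-step-ahead errors (illustration only).
e1 = [1, -2, 1, -1, 2, -1, 1, -2]
e2 = [3, -4, 5, -3, 4, -5, 3, -4]
stat, p_value = diebold_mariano(e1, e2, h=1)
print(f"DM statistic = {stat:.3f}, p-value = {p_value:.2e}")
```

A negative statistic means that the first procedure has the smaller average squared error. With only 15 or fewer forecasts per horizon, as in the evaluation described above, the normal approximation is rough and a small-sample correction may be preferable.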

Fig. 4

Forecasts of confirmed new cases 1 step ahead from P1 procedure, from May 19 to June 2, 2020.

We close this section considering Fig. 5. It presents a comparison between the actual number of confirmed new cases and the predicted values obtained from P1 procedure and P2 procedure.

Fig. 5

Forecasts of confirmed new cases from P1 and P2 (ARIMA) procedures, from May 19 to June 2, 2020. All the forecasts are performed at time May 18.

Here, all the forecasts are performed at time May 18 for the period from May 19 to June 2, 2020. It is evident that the forecasts from procedure P1 match the actual number of confirmed new cases very well, while the forecasts from procedure P2 are characterized by a larger deviation.

Conclusions and future work

This paper has aimed to provide a forecast of the COVID-19 diffusion in Italy by using a data-driven approach. In particular, we have proposed a new procedure based on the combined use of a log-polynomial model and an INAR(1) model. We have used this procedure to forecast the number of new daily confirmed cases in the time window from May 19 to June 2, 2020. An out-of-sample comparison with forecasts from an autoregressive integrated moving average (ARIMA) model has been considered. This comparison indicates that our procedure outperforms the ARIMA model. The accuracy of the short-term forecasts from our procedure is very high over the period considered. We think that the advantage of our approach comes from the fact that it explicitly takes into account the number of swabs. The accuracy of the predictions could be improved by including the reproduction number (R) values in our procedure. Unfortunately, we think that this is not possible, as our procedure combines two models that do not allow the introduction of variables other than the number of swabs (X) and the ratio of new daily confirmed cases to the number of swabs (Y). This happens because X is modeled by an autoregression and Y is considered a function of time only. One possibility could be to replace the log-polynomial model for Y with a bivariate Vector AutoRegressive (VAR) model for Y and R. We will consider this approach in future work. Another point that we intend to investigate in future research concerns the comparison of the predictive performance of our procedure with a forecasting model that includes the reproduction number values (examples of such models can be found in Chintalapudi et al. (2020b) and Zhao et al. (2020)). We hope that the results of our analysis contribute to the elucidation of critical aspects of this outbreak, providing a useful perspective in Italy and internationally on how the pandemic spreads.
In particular, forecasts can be helpful to anticipate the need for resources such as protective equipment, medical ventilators, and hospital beds. Finally, it is important to note that a limitation of our analysis is that it has been conducted using only the number of confirmed new cases that have been officially notified. The current numbers of confirmed cases are known to be vastly underestimated due to the limited testing available. There are almost certainly many more cases of COVID-19 that have not been diagnosed than those that have. Nevertheless, while these data are biased, they are the only source of information available that can guide our efforts to understand the outbreak dynamics.

Declaration of competing interest

The authors declare that they have no conflict of interest.

8 in total

1. Research on COVID-19 based on ARIMA model^Δ-Taking Hubei, China as an example to see the epidemic in Italy.

Authors: Qiuying Yang; Jie Wang; Hongli Ma; Xihao Wang
Journal: J Infect Public Health Date: 2020-06-20 Impact factor: 3.718

2. COVID-19 outbreak reproduction number estimations and forecasting in Marche, Italy.

Authors: Nalini Chintalapudi; Gopi Battineni; Getu Gamo Sagaro; Francesco Amenta
Journal: Int J Infect Dis Date: 2020-05-11 Impact factor: 3.623

3. Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil.

Authors: Matheus Henrique Dal Molin Ribeiro; Ramon Gomes da Silva; Viviana Cocco Mariani; Leandro Dos Santos Coelho
Journal: Chaos Solitons Fractals Date: 2020-05-01 Impact factor: 5.944

4. Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis.

Authors: Tanujit Chakraborty; Indrajit Ghosh
Journal: Chaos Solitons Fractals Date: 2020-04-30 Impact factor: 5.944

5. COVID-19 virus outbreak forecasting of registered and recovered cases after sixty day lockdown in Italy: A data driven model approach.

Authors: Nalini Chintalapudi; Gopi Battineni; Francesco Amenta
Journal: J Microbiol Immunol Infect Date: 2020-04-13 Impact factor: 4.399

6. Application of the ARIMA model on the COVID-2019 epidemic dataset.

Authors: Domenico Benvenuto; Marta Giovanetti; Lazzaro Vassallo; Silvia Angeletti; Massimo Ciccozzi
Journal: Data Brief Date: 2020-02-26

7. Forecasting the spread of the COVID-19 pandemic in Saudi Arabia using ARIMA prediction model under current public health interventions.

Authors: Saleh I Alzahrani; Ibrahim A Aljamaan; Ebrahim A Al-Fakih
Journal: J Infect Public Health Date: 2020-06-08 Impact factor: 3.718
