Eunju Hwang1. 1. Department of Applied Statistics, Gachon University, South Korea.
Abstract
This paper is devoted to modeling and predicting COVID-19 confirmed cases through a multiple linear regression. Especially, prediction intervals of the COVID-19 cases are extensively studied. Due to long-memory feature of the COVID-19 data, a heterogeneous autoregression (HAR) is adopted with Growth rates and Vaccination rates; it is called HAR-G-V model. Top eight affected countries are taken with their daily confirmed cases and vaccination rates. Model criteria results such as root mean square error (RMSE), mean absolute error (MAE), R 2 , AIC and BIC are reported in the HAR models with/without the two rates. The HAR-G-V model performs better than other HAR models. Out-of-sample forecasting by the HAR-G-V model is conducted. Forecast accuracy measures such as RMSE, MAE, mean absolute percentage error and root relative square error are computed. Furthermore, three types of prediction intervals are constructed by approximating residuals to normal and Laplace distributions, as well as by employing bootstrap procedure. Empirical coverage probability, average length and mean interval score are evaluated for the three prediction intervals. This work contributes three folds: a novel trial to combine both growth rates and vaccination rates in modeling COVID-19; construction and comparison of three types of prediction intervals; and an attempt to improve coverage probability and mean interval score of prediction intervals via bootstrap technique.
This paper is devoted to modeling and predicting COVID-19 confirmed cases through a multiple linear regression. Especially, prediction intervals of the COVID-19 cases are extensively studied. Due to long-memory feature of the COVID-19 data, a heterogeneous autoregression (HAR) is adopted with Growth rates and Vaccination rates; it is called HAR-G-V model. Top eight affected countries are taken with their daily confirmed cases and vaccination rates. Model criteria results such as root mean square error (RMSE), mean absolute error (MAE), R 2 , AIC and BIC are reported in the HAR models with/without the two rates. The HAR-G-V model performs better than other HAR models. Out-of-sample forecasting by the HAR-G-V model is conducted. Forecast accuracy measures such as RMSE, MAE, mean absolute percentage error and root relative square error are computed. Furthermore, three types of prediction intervals are constructed by approximating residuals to normal and Laplace distributions, as well as by employing bootstrap procedure. Empirical coverage probability, average length and mean interval score are evaluated for the three prediction intervals. This work contributes three folds: a novel trial to combine both growth rates and vaccination rates in modeling COVID-19; construction and comparison of three types of prediction intervals; and an attempt to improve coverage probability and mean interval score of prediction intervals via bootstrap technique.
The COVID-19 pandemic has lasted for more than a year although vaccinations against the COVID-19 has been recently distributed around the world. The severe virus is still threatening human health, social activity and global economy, despite struggles and efforts of experts to overcome the crisis. Many researchers have attempted to help with health systems of COVID-19. Statisticians and econometricians contribute to preventing the spread of COVID-19 through a variety of time series models, by modeling and forecasting the COVID-19 confirmed and death cases. Notable time series studies have emerged to model the COVID-19 data. From the beginning of Benvenuto et al. [1], who used an ARIMA (autoregressive integrated moving average) model, there have been so many ARIMA time series analyses for the COVID-19 datasets [2], [3], [4], [5], [6], [7], [8], [9], [10]. Adding sophisticated theories of probability distributions and differential equations, mathematical models have been developed by [12], [13], [14], [15], [16], [17], [18], [19], [11] for efficient predictions of the COVID-19 pandemic. In the area of artificial intelligence and machine learning, stochastic modelings for the COVID-19 epidemic disease have been conducted to support the forecast analyses [20], [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31].Among recent remarkable works of the COVID-19 modeling, we refer to Chowell and Luo [32], Bracher et al. [33], Ekinci [34], Hwang and Yu [35]: Chowell and Luo [32] introduced ensemble methodology for forecasting growth process of epidemic outbreaks using differential equations. They achieved the improved coverage rate and mean interval score of prediction intervals. Bracher et al. [33] combined multiple models to improve reliability of forecast outputs. Ekinci [34] adopted an ARMA-GARCH (generalized autoregressive conditional heteroscedasticity) process for modeling the growth rate of daily COVID-19 confirmed cases, considering conditional variance and asymmetric effect, while [35] employed a HAR (heterogeneous AR) model to capture the long memory embedded in the COVID-19 data. The GARCH and HAR models in Ekinci [34], Hwang and Yu [35], which are popular and robust time series models for financial volatility, still show good predictive performance on the COVID-19 time series data.Vaccinations against the COVID-19 have been widely distributed in the world. According to Tregoning et al. [36] it is reported that the development and deployment of vaccines for the COVID-19 are a remarkable science success story: within 16 months of the first vaccine trials, 2.8 billion vaccine doses have been administered. Thanks to Mathieu et al. [37] a global database of COVID-19 vaccinations are available. Shim [38] evaluated the impact of COVID-19 vaccination program through a mathematical model based on age-structured classification, while [39] studied forecasting of the number of fully vaccinated people and examined future vaccination rate by the ARIMA model. Tregoning et al. [36] and Higdon [40] conducted systematic reviews of COVID-19 vaccines efficacy and effectiveness against the SARS-CoV-2 infection and disease. Vaccines and booster shots affect the new COVID-19 spread in the local region. Vaccination rates are one of the primary components that influence strongly the COVID-19 confirmed cases.This work concerns itself with a time series analysis for the COVID-19 in the world. Vaccination rates are considered as a regressor variable in a multiple time series regression for the purpose of the COVID-19 forecast analysis. In addition, the growth rates of daily COVID-19 confirmed cases are adopted as another regression factor because the growth rates directly determine new numbers in the near future. The COVID-19 confirmed cases have long-memory property with algebraically decaying autocorrelation coefficients. As in Hwang and Yu [35], we use the HAR model that is an additive cascade regression model with a main feature of the long memory. Overall, to forecast the COVID-19 confirmed cases, this paper suggests a HAR model with Growth rates and Vaccination rates, called the HAR-G-V model, which is a form of multiple linear regression. The multiple linear regression and its statistical theories are well-known (for example, see Montgomery and Peck [41]). Their coefficient parameter estimation and related analysis are frequently used, as a standard-level study, to achieve statistically significant results.The HAR model is a powerful time series model especially for realized volatility of financial asset, (see Corsi [42], Hong et al. [43], Hwang and Hong [44], Wilms et al. [45]). Its background is deferred to a recent work [35] of author, which is also a HAR model-application to the COVID-19 data. The HAR model has moving-averages as regressor variables: for example, in a HAR(3) model with lags (1,7,14), the previous value, 7-day moving average and 14-day moving average are used as explanatory variables. The conventional HAR model is a kind of a linear AR model: for example, the HAR(3) model with lags (1,7,14) is theoretically a linear AR(14) model. However, adding the growth rates to the HAR model results in a nonlinear AR model. Not only the averages but also the growth rates are important factors that determine the COVID-19 spread. In the notable work [34], the growth rate itself of the daily COVID-19 confirmed cases is used for time series modeling. In contrast, the current work uses both averages and growth rates, in a form of nonlinear AR model, and at the same time, in a multiple linear regression model. The modified (nonlinear) HAR() model is employed with one-day, one-week, two-week moving growth rates as well as one-day, one-week, two-week moving averages, with lags (1,7,14), if . In addition, two vaccination rates are imposed in our time series model: 1st and 2nd vaccination rates, which are different from country to country. The following three main issues: (i) reflecting long memory of COVID-19 data, (ii) adopting a nonlinear autoregressive effect with both moving averages and moving rates, and (iii) taking vaccination rates that play a decisive role in the COVID-19 situation, motivate us to choose the multiple regression of the HAR-G-V model in this study.For the empirical analysis of the COVID-19 time series data in the world, top eight affected countries are taken: Brazil, France, Germany, India, Italy, Russia, UK and USA. Their daily COVID-19 confirmed cases and vaccination rates are analyzed through HAR-G-V models of order 2, 3 with lags (1,7), (1,7,14), respectively. In the HAR models with/without growth rates and vaccination rates, model criteria measures such as root mean square error (RMSE), mean absolute error (MAE), , AIC and BIC are evaluated for the comparison. In addition, these are compared to ARIMA models, which are popular in time series literature including COVID-19. The proposed HAR-G-V model outperforms other HAR models and ARIMA models with respect to RMSE and MAE. It is because the both rates have strong influences on the spread of COVID-19. The fitted HAR-G-V models are well-matched to real daily COVID-19 confirmed cases.To investigate forecast performance of the HAR-G-V model, out-of-sample forecasting is conducted to evaluate RMSE, MAE, mean absolute percentage error (MAPE) and root relative square error (RRSE) of forecasts. As a main goal of this work, prediction intervals are established. Their empirical coverage probability, average length and mean interval score are reported. The mean interval score was assessed in Chowell and Luo [32], to account the forecasting performances across a diversity of epidemic datasets. The interval score was suggested by Gneiting and Raftery [46]. It is a scoring rule for prediction interval, which rewards for a narrow prediction interval and incurs a penalty if the observation misses the interval. See [46] for its details, who dealt with the interval score and other scoring rules of prediction intervals.Based on the residuals of the HAR-G-V model, prediction intervals, (), are constructed by approximating the residual distribution. The following three approaches are attempted: (i) Approximation by normal distribution; it does not provide the best performance because residuals do not follow the normal distribution. (ii) Approximation by Laplace distribution; it seems to fit well the residual distribution and to improve the normal approximation by enhancing coverage probability in some countries. However, the Laplace distribution makes longer prediction interval and thus it has the worst mean interval score in the comparison of 95% prediction intervals. (iii) In order to improve the previous two approximations, the bootstrap technique is applied to the residuals to construct the bootstrap prediction interval.The validity of the bootstrap technique in a regression model was initially established by Freedman [47]. Also, the bootstrap procedure was used by [48] to estimate the mean squared error of forecast for the AR() model. As a useful application of the bootstrap, Thombs and Schucany [49] dealt with the bootstrap prediction intervals for the AR models. For weak dependence structure in a network vector autoregression, Hong and Hwang [50] implemented the bootstrap prediction intervals using the stationary bootstrap and the classical residual bootstrap. In the current work, in order to construct a better prediction interval, the classical residual-bootstrap method is used. This residual resampling-technique not only improves empirical coverage but also shortens average length of prediction interval, and thus the proposed bootstrap prediction interval yields the best mean interval score. So as to find how much the bootstrap method can improve the normal and Laplace approaches, bootstrap improvement ratio (BIR) of the mean interval score is suggested. In cases of India, Italy and Russia, which have so long prediction intervals by the normal and Laplace approximations, their BIRs are evaluated as more than 0.7, which means that mean interval scores of the bootstrap prediction intervals can be enhanced by more than 70% from those of other two prediction intervals.This research has contributions in three aspects: (i) This present work combines both growth rates and vaccination rates, for modeling and forecasting the COVID-19 pandemic, through a nonlinear type of HAR models. It provides results of well-fitting to actual daily COVID-19 confirmed cases and good criteria measures such as small values of RMSE and MAE. (ii) Forecasting accuracy performances of RMSE, MAE, MAPE and RRSE are evaluated with small values and, moreover, in behalf of more informative interval estimation, three types of prediction intervals are constructed and compared. (iii) Bootstrap method can improve prediction intervals not only by yielding the empirical coverage, close to the nominal coverage, but also by shortening average length and thus resulting in better mean interval score. Consequently, the combination of the HAR-G-V model and the bootstrap technique achieves an accurate prediction interval with the best mean interval score. An outline of this study is drawn in the flow chart of Fig. 1
.
Fig. 1
Flow chart for this research of HAR-G-V model.
Flow chart for this research of HAR-G-V model.The rest of the paper is organized as follows: In Section 2, model and data are presented. In Section 3 empirical results of estimation and prediction are explored. Conclusion and discussion are stated in Section 4.
Model and data
HAR-G-V model for a triangular array
We first introduce the HAR-G-V model, which is used for a triangular array of observations. Suppose that a sample of size is observed, and the dataset is a triangular array with mean zero and finite variance, but it is not necessarily stationary. In the expression , we omit subscript and use in the model just for notational simplicity.The HAR-G-V model of order , , is defined as follows:where the HAR components are given bywith positive integers , and the growth rate components are given by log difference of shifted values:with some constant , called a shift threshold, such that for all . The last two regressors and are used as the 1st and 2nd vaccination rates at time , respectively. Coefficients
are parameters to be estimated for model-fitting and forecasting, and is an i.i.d. sequence of random variables with mean zero and finite variance.The HAR-G-V model of order in (1) can be regarded as a nonlinear AR model of order with additional regression variables, which make the model a multiple regression form. As mentioned above, as explanatory variables, the model has moving averages and moving rates, both of which are functions of the observations . In other words, we may writefor some real-valued function defined on -dimensional real space , i.e., is a composite of algebraic additions, divisions and logarithmic function, along with parameters
and . The shift threshold is estimated as any positive number only with condition for all in the logarithm function. Instead of a general function , we consider a specific semiparametric AR model with the form of (1).In the HAR-G-V model (1), if , then model (1) is denoted by HAR-V model, with only vaccination rates, and if , then it is HAR-G model, with only growth rates. If all , then model (1) is the HAR() model, which is used in the analysis of [35].In the next section of an empirical study, the four models; HAR, HAR-V, HAR-G and HAR-G-V models of order , are compared together, by applying eight datasets of COVID-19 cases. Order is chosen as , and lags ( are (1,7) for and (1,7,14) for . According to Linton et al. [51] the incubation period of the COVID-19 is 14 to 21 days and so the recent work [35] of author dealt with (1,7,14) for and (1,7,14,21) for . However, since this current work handles additionally growth rates and vaccination rates, (if we use , then there are 10 coefficients in the multiple regression), and also since the results of order are expected to be similar to those of order , we omit an empirical study for .
COVID-19 data
This work considers the new daily COVID-19 confirmed cases in top eight affected countries with their growth rates and vaccination rates. The daily confirmed cases (7-day smoothed) and vaccination rates are taken from Our World in Data in https://ourworldindata.org/covid-vaccinations, (see Mathieu et al. [37]). The confirmed cases are applied by 7-day moving-averaging as in Ekinci [34] so as to smoothen the daily data. Eight countries: Brazil, France, Germany, India, Italy, Russia, UK and USA, are chosen to apply the HAR models and to construct the prediction intervals as a main goal. Statistics summaries of daily confirmed cases (7-day smoothed) are displayed in Table 1
along with those of their one-day, one-week and two-week growth rates.
Table 1
Statistics of daily confirmed cases (7-day smoothed) and their growth rates; size is the period from the starting date in 2021 to 08/24/2021; SD = standard deviation.
Brazil
France
Germany
India
Italy
Russia
UK
USA
Starting
02/09
01/26
01/16
02/15
01/18
03/02
01/10
01/19
Size N
190
204
214
184
212
169
220
211
Daily confirmed cases
Mean
56917.9
18765.2
8429.5
116687.6
9691.6
14028.4
14912.9
60591.1
SD
13699.6
10707.5
6303.9
117743.3
6890.7
6435.6
13467.5
40198.2
Min
28337.6
1840.6
538.0
12751.6
639.3
8016.4
1517.7
11299.3
Median
60438.8
20287.4
7864.8
50359.8
9430.5
9488.6
8662.2
57301.1
Max
77327.6
45641.6
21595.7
391232.0
23211.9
24490.6
48675.4
171103.9
Skewness
−0.5867
0.0849
0.4389
1.2000
0.3819
0.5397
0.8334
0.8423
Kurtosis
−0.7192
−0.6760
−1.0002
−0.0274
−1.0363
−1.5331
−0.5727
0.0264
One-day growth rate
Mean
−0.0027
0.0003
−0.0033
0.0050
−0.0031
0.0038
−0.0018
−0.0006
SD
0.0386
0.0815
0.0683
0.0428
0.0451
0.0163
0.0737
0.0460
Min
−0.2419
−0.3246
−0.1898
−0.1225
−0.1636
−0.0295
−0.5146
−0.1538
Median
−0.0002
0.0017
0.0032
−0.0013
−0.0047
−0.0010
−0.0030
−0.0086
Max
0.2165
0.2546
0.2958
0.1337
0.1284
0.0706
0.5197
0.1960
Skewness
−0.5975
−0.3307
0.02783
0.2650
0.4591
1.5086
0.1375
1.0606
Kurtosis
13.4556
2.4453
1.6424
−0.4308
1.1372
2.6062
−0.5964
2.9285
One-week growth rate
Mean
−0.0173
0.0057
−0.0265
0.0353
−0.0232
0.0309
−0.0109
−0.0035
SD
0.1298
0.3277
0.3176
0.2838
0.2847
0.1006
0.3320
0.2420
Min
−0.3572
−0.7531
−0.7623
−0.4066
−0.5087
−0.1171
−0.9072
−0.4321
Median
0.0070
−0.0234
0.0027
−0.0531
−0.0147
−0.0072
−0.0481
−0.0379
Max
0.2275
1.1725
0.5079
0.5356
0.7688
0.3528
0.6126
0.6937
Skewness
−0.6097
0.8542
−0.2229
0.2691
0.6975
1.5848
0.3378
0.7216
Kurtosis
0.2087
1.2571
−0.7937
−1.0847
0.3750
1.6817
−0.9006
−0.2771
Two-week growth rate
Mean
−0.0391
0.0141
−0.0652
0.0722
−0.0471
0.0693
−0.0154
−0.0005
SD
0.1889
0.6056
0.5886
0.5626
0.5492
0.1876
0.5574
0.4625
Min
−0.4629
−1.1577
−1.2154
−0.7770
−0.9424
−0.1296
−1.2253
−0.6608
Median
−0.0260
0.0170
0.0161
−0.0896
−0.0907
−0.0020
−0.1433
−0.1470
Max
0.3476
1.6405
0.8989
1.0268
1.3463
0.5938
1.0605
1.0772
Skewness
−0.2535
0.7179
−0.2469
0.2241
0.6688
1.4191
0.4948
0.7940
Kurtosis
−0.4413
0.4999
−1.0432
−1.1951
0.0393
0.7863
−1.0508
−0.4554
Statistics of daily confirmed cases (7-day smoothed) and their growth rates; size is the period from the starting date in 2021 to 08/24/2021; SD = standard deviation.Because the vaccine supply differs from country to country, the sample size is different for each country. More specifically, the start date in the sample period is set as the first day of the 2nd vaccination, while the end date is all 08.24.2021. The size in each country, shown in Table 1, is the length between the start date and 08.24.2021.As introduced above, model (1) is assumed to have mean zero, and so we adopt as standardized values of COVID-19 confirmed cases for each country. In other words, , for , where is daily COVID-19 confirmed case at time , and are its sample mean and sample standard deviation, respectively. Note that generates a triangular array with mean zero and variance one, (but not stationary).With lags (1,7,14) for , the growth rates in (2) are computed as one-day, one-week and two-week growth rates of daily numbers shifted by . The shift threshold may be chosen as some positive number so that for all . Either , for any , or are possible. Here we select threshold because it implies thatwhich is log difference of original COVID-19 confirmed cases . These growth rates are depicted in Fig. 2
, where we see one-day, one-week, two-week growth rates of the eight countries. Positive values during a time period imply that the numbers of COVID-19 confirmed cases increase during the period; for example, in India, during the period before 60 on time axis, growth rates are positive, which means new daily COVID-19 confirmed numbers increase, while during the period between 75 and 140, growth rates are negative, which means new daily COVID-19 numbers decrease. These facts are also shown on the plot of India in Fig. 5. In Russia, between 80 and 110 on time axis, growth rates are positive, which indicates increasing COVID-19 during the period, as seen on the plot of Russia in Fig. 5. In Fig. 3
, the 1st and 2nd vaccination rates of the eight countries are shown, until date 08.24.2021. India and Russia have low vaccination rates, whereas Italy, UK and USA have high rates. Data in Figs. 2 and 3 are used as regressor variables in the HAR-G-V model of a triangular array.
Fig. 2
Growth rates of daily COVID-19 confirmed cases (7-day smoothed).
Fig. 5
Daily confirmed cases (7-day smoothed) and HAR-G-V model fitting with its residuals.
Fig. 3
Vaccination rates.
Growth rates of daily COVID-19 confirmed cases (7-day smoothed).Vaccination rates.A main characteristic of the HAR model is the long memory, which means that historical observation has a persistent impact on future value. The long memory can be identified by autocorrelation coefficient functions (ACFs) with algebraically decay rates. In order to see whether the COVID-19 data are suitable for the HAR model, their sample ACFs are illustrated in Fig. 4
. The COVID-19 confirmed cases have the long-memory effects as seen in Fig. 4, where the sample ACFs of the confirmed cases are depicted for the eight countries. The sample ACFs are decreasing very slowly. This feature leads us to adopt the HAR models for modeling the COVID-19 pandemic time series. In the next section, we discuss empirical results of the COVID-19 confirmed cases with their estimation and forecasting through the HAR models with/without the growth rates and vaccination rates. Because the proposed models are a type of multiple linear regression model, which is popular and standard in statistical analysis, and also because earlier works [43], [44] dealt with a simulation study in the HAR model, in this work, the simulation is omitted and real data analysis only is discussed.
Fig. 4
Sample autocorrelation coefficient function (ACF) of daily COVID-19 confirmed cases (7-day smoothed).
Sample autocorrelation coefficient function (ACF) of daily COVID-19 confirmed cases (7-day smoothed).Daily confirmed cases (7-day smoothed) and HAR-G-V model fitting with its residuals.
Empirical study
Parameter estimation
This section presents the empirical results of the proposed models with daily confirmed cases and vaccination rates. Four models; HAR, HAR-V, HAR-G and HAR-G-V models, are compared by reporting some criteria measures of the parameter estimation. Additionally, these results are compared with those by popular ARIMA models.Suppose the observations of size are relabelled by , where is chosen so that , with , 14, for , 3, respectively. We use the least squares method that minimizes the sum of squared errors (SSE):The regression model in (1) is written as where and . The ordinary least squared (OLS) estimator is obtained by minimizing the SSE and given asTo see the performance of the OLS regression, we compute criteria measures such as root mean square error (RMSE), mean absolute error (MAE), , AIC and BIC.Let . Using the residuals , RMSE and MAE are given byAlso, , AIC and BIC are given bywhere is the log-likelihood function and is the number of parameters.For the HAR(-G-V) models with order , these performance measures in the eight countries are reported in Tables 2
and 3
. In order to compare the results with existing model, Table 4
gives the performances by ARIMA models. In the columns of RMSE and MAE of Table 2, Table 3, Table 4, the values in the parentheses are the errors multiplied by the standard deviations of the daily confirmed cases in each country, where the standard deviations are given in Table 1. Because HAR and HAR-V models have fewer parameters, AIC and BIC values in these models are somewhat smaller than those in the HAR-G and HAR-G-V models.
Table 2
Criteria of HAR() models with/without growth rates and vaccination rates; Brazil, France, Germany and India.
RMSE
MAE
R2
AIC
BIC
Brazil
p=2
HAR
0.14876(2038.06)
0.09671(1324.91)
0.979
−167.2
−160.9
HAR-V
0.14634(2004.85)
0.09508(1302.64)
0.980
−169.0
−156.3
HAR-G
0.14782(2025.06)
0.09731(1333.16)
0.979
−165.5
−152.8
HAR-G-V
0.14493(1985.55)
0.09363(1282.75)*
0.980
−168.4
−149.4
p=3
HAR
0.14864(2036.32)
0.09726(1332.36)
0.979
−165.5
−156.0
HAR-V
0.14582(1997.74)
0.09575(1311.75)
0.980
−168.3
−152.4
HAR-G
0.14698(2013.58)
0.09684(1326.60)
0.979
−163.5
−144.5
HAR-G-V
0.14473(1982.75)*
0.09437(1292.96)
0.980
−164.9
−139.5
France
p=2
HAR
0.18192(1948.01)
0.09762(1045.33)
0.969
−104.4
−97.88
HAR-V
0.17893(1915.97)
0.09195(984.65)
0.970
−106.7
−93.69
HAR-G
0.17065(1827.26)
0.09503(1017.59)
0.973
−124.7
−111.7
HAR-G-V
0.17042(1824.78)
0.09471(1014.13)
0.973
−121.2
−101.7
p=3
HAR
0.17907(1917.40)
0.09266(992.16)
0.970
−108.4
−98.65
HAR-V
0.17779(1903.79)
0.08936(956.84)*
0.971
−107.1
−90.87
HAR-G
0.17061(1826.83)
0.09438(1010.62)
0.973
−120.8
−101.3
HAR-G-V
0.17040(1824.58)*
0.09428(1009.51)
0.973
−117.3
−91.27
Germany
p=2
HAR
0.10697(674.34)
0.05891(371.62)
0.989
−322.5
−315.9
HAR-V
0.10508(662.44)
0.05684(358.35)
0.989
−325.6
−312.4
HAR-G
0.10205(643.37)
0.05991(377.65)
0.990
−337.3
−324.1
HAR-G-V
0.10146(639.62)
0.05940(374.49)
0.990
−335.6
−315.9
p=3
HAR
0.10563(665.93)
0.05684(358.29)
0.989
−325.5
−315.6
HAR-V
0.10475(660.37)
0.05563(350.72)*
0.989
−324.9
−308.4
HAR-G
0.10194(642.59)
0.05952(375.18)
0.990
−333.8
−314.0
HAR-G-V
0.10124(638.23)*
0.05919(373.15)
0.990
−332.5
−306.1
India
p=2
HAR
0.01323(1558.30)
0.00919(1082.32)
1.000
−984.0
−977.8
HAR-V
0.01303(1533.87)
0.00896(1055.56)
1.000
−985.4
−972.9
HAR-G
0.01318(1552.68)
0.00922(1085.87)
1.000
−981.3
−968.7
HAR-G-V
0.01299(1530.08)
0.00903(1062.89)
1.000
−982.2
−963.4
p=3
HAR
0.01186(1396.15)
0.00783(921.56)
1.000
−1019
−1010
HAR-V
0.01181(1390.10)
0.00776(913.84)
1.000
−1017
−1001
HAR-G
0.01139(1340.87)
0.00744(875.97)
1.000
−1027
−1008
HAR-G-V
0.01125(1324.41)*
0.00731(860.22)*
1.000
−1027
−1002
Table 3
Criteria of HAR() models with/without growth rates and vaccination rates; Italy, Russia, UK and USA.
RMSE
MAE
R2
AIC
BIC
Italy
p=2
HAR
0.03619(249.42)
0.01897(130.77)
0.999
−748.3
−741.8
HAR-V
0.03608(248.61)
0.01882(129.71)
0.999
−745.6
−732.5
HAR-G
0.03565(245.71)
0.01891(130.32)
0.999
−750.3
−737.1
HAR-G-V
0.03553(244.85)
0.01881(129.64)
0.999
−747.7
−727.9
p=3
HAR
0.03613(248.97)
0.01888(130.15)
0.999
−747.0
−737.2
HAR-V
0.03603(248.27)
0.01866(128.57)*
0.999
−744.2
−727.7
HAR-G
0.03541(244.01)
0.01887(130.06)
0.999
−749.0
−729.3
HAR-G-V
0.03530(243.28)*
0.01871(128.96)
0.999
−746.2
−719.9
Russia
p=2
HAR
0.01644(105.83)
0.01193(76.79)
1.000
−829.5
−823.4
HAR-V
0.01628(104.77)
0.01203(77.43)
1.000
−828.6
−816.5
HAR-G
0.01585(101.98)
0.01218(78.38)
1.000
−827.0
−824.8
HAR-G-V
0.01548(101.35)
0.01224(78.79)
1.000
−834.9
−816.7
p=3
HAR
0.01629(104.84)
0.01198(77.07)
1.000
−803.5
−821.3
HAR-V
0.01615(103.98)
0.01207(77.68)
1.000
−829.0
−813.8
HAR-G
0.01534(98.74)
0.01184(76.21)*
1.000
−843.0
−824.7
HAR-G-V
0.01530(98.50)*
0.01188(76.49)
1.000
−839.8
−815.4
UK
p=2
HAR
0.03639(490.06)
0.02126(286.43)
0.998
−776.6
−769.9
HAR-V
0.03526(474.89)
0.02144(288.81)
0.999
−785.5
−772.2
HAR-G
0.03612(486.43)
0.02111(284.39)
0.998
−775.6
−762.3
HAR-G-V
0.03435(462.64)
0.02109(284.11)
0.999
−792.3
−772.3
p=3
HAR
0.03288(442.83)
0.01987(267.61)
0.999
−816.3
−806.3
HAR-V
0.03144(423.47)
0.01976(266.07)
0.999
−830.7
−814.1
HAR-G
0.03165(426.26)
0.01932(260.16)
0.999
−826.0
−806.1
HAR-G-V
0.03081(414.95)*
0.01927(259.54)*
0.999
−833.1
−806.5
USA
p=2
HAR
0.03952(1588.63)
0.02708(1088.51)
0.998
−709.9
−703.4
HAR-V
0.03914(1573.53)
0.02707(1088.02)
0.998
−709.7
−696.6
HAR-G
0.03897(1566.44)
0.02712(1090.36)
0.998
−711.5
−698.3
HAR-G-V
0.03862(1552.69)
0.02700(1085.47)
0.998
−710.9
−691.2
p=3
HAR
0.03912(1572.87)
0.02690(1081.42)
0.998
−711.9
−702.0
HAR-V
0.03886(1562.19)
0.02692(1082.19)
0.998
−710.5
−694.1
HAR-G
0.03843(1544.88)
0.02624(1054.95)*
0.998
−712.9
−693.2
HAR-G-V
0.03821(1536.29)*
0.02633(1058.66)
0.998
−711.1
−684.9
Table 4
Comparison with ARIMA models.
RMSE
MAE
R2
AIC
BIC
Brazil
ARIMA(1,1,1)
0.14527(1990.21)
0.09655(1322.72)
0.979
−186.9
−177.23
France
ARIMA(1,1,1)
0.17485(1872.25)
0.10083(1079.63)
0.969
−125.8
−115.9
Germany
ARIMA(1,2,1)
0.10476(660.43)
0.05947(374.92)
0.989
−347.0
−336.9
India
ARIMA(1,1,1)
0.011297(1330.12)
0.00671(789.64)*
1.000
−1112.7
−1103.1
Italy
ARIMA(1,2,1)
0.03474(239.36)*
0.01829(126.06)*
0.999
−808.6
−798.6
Russia
ARIMA(1,2,1)
0.01615(103.95)
0.01252(80.57)
1.000
−897.9
−888.6
UK
ARIMA(1,2,0)
0.02763(370.39)*
0.01674(224.51)*
0.999
−942.2
−935.4
USA
ARIMA(1,2,1)
0.03854(1549.26)
0.02661(1069.80)
0.998
−760.9
750.9
Criteria of HAR() models with/without growth rates and vaccination rates; Brazil, France, Germany and India.Criteria of HAR() models with/without growth rates and vaccination rates; Italy, Russia, UK and USA.Comparison with ARIMA models.As for RMSE in Tables 2 and 3, HAR-V models have smaller errors than HAR models, and also HAR-G-V models have smaller errors than HAR-G models. It implies that vaccination rates affect significantly the model identification. In Tables 2 and 3, bold numbers are the smallest errors in the given order in each country, whereas mark indicates the best one among the two smallest errors in both orders in each country. All eight countries have the smallest values of RMSE in HAR-G-V model of order 3. Consequently, HAR-G-V model of order is the most suitable model in the sense of RMSE. As for MAE in Tables 2 and 3, Brazil, India and UK have the smallest errors in HAR-G-V model of order 3. In other countries, we see that the models with vaccination rates, HAR-V or HAR-G-V, are better than the models without them, HAR or HAR-G, respectively, with respect to MAE. Table 4 presents empirical results of ARIMA() models for the COVID-19 data. RMSE, MAE, , AIC and BIC by our proposed HAR(-G-V) models are compared with those in Table 4. Orders of ARIMA() models are selected through a unit-root test, i.e., ADF test, as well as ACF and PACF plots, and through consideration of smallest RMSE and MAE. Bold numbers in Table 4 of ARIMA models indicate smaller errors than HAR(-G-V) model errors in each country. Most cases demonstrate that HAR(-G-V) models fit with the COVID-19 data better than the ARIMA models. To see the model-fitting of HAR-G-V models, Fig. 5 depicts the actual daily COVID-19 confirmed cases and HAR-G-V models along with their residuals. For all the eight countries, HAR-G-V models are well-matched with the actual data. It is concluded that HAR-G-V model of order 3 is the best one with reduced errors, compared to other models. Therefore, in the next subsection of prediction, we adopt the HAR-G-V model of order 3 and discuss prediction intervals by approximating the residual distribution.
Prediction analysis
Now we investigate the forecasting performance by the HAR-G-V model of order 3, applied to the daily COVID-19 confirmed cases. For the first prediction result, one-, two-, three-step ahead forecasts are computed and their forecasting performance accuracy is reported. In order to provide the forecasting analysis, rolling windows are used. Out of the given observations of size , we take two subsamples; in-sample and out-of-sample. To assess the -step ahead forecast, (), let , and let with relabelled observations, (). is the in-sample used to estimate parameters. The -step ahead predicted value is computed from , (), and accuracy measures of RMSE, MAE, MAPE and RRSE are evaluated along with observation in the out-of-sample as follows:Root Mean Square Error (RMSE):Mean Absolute Error (MAE):Mean Absolute Percentage Error (MAPE):Root Relative Square Error (RRSE):where .In Table 5
, RMSE, MAE, MAPE and RRSE for the -step ahead forecasts, (), are evaluated with size . The values in the parentheses on the columns of RMSE and MAE indicate the errors multiplied by the standard deviations of in-sample observations. As expected, it is shown in Table 5 that all the errors are reasonably small, and the farther step ahead, the larger errors.
Table 5
Forecasting performance of one-, two-, three-step ahead ( forecasts in HAR-G-V model with size 30.
k
RMSE
MAE
MAPE
RRSE
Brazil
1
0.20620(2824.94)
0.11714(1604.88)
4.35666
0.45181
2
0.31809(4357.75)
0.22947(3143.68)
8.57290
0.68883
3
0.37590(5149.75)
0.29034(3977.58)
11.25078
0.85273
France
1
0.05194(556.22)
0.04302(460.71)
2.19886
0.20366
2
0.07343(786.35)
0.06043(647.13)
2.99611
0.33470
3
0.09759(1045.04)
0.07635(817.59)
3.71851
0.52219
Germany
1
0.02556(161.13)
0.02103(132.58)
5.51255
0.12122
2
0.02581(162.73)
0.02124(133.93)
4.82855
0.11271
3
0.05738(361.72)
0.04367(275.31)
11.58729
0.23259
India
1
0.01886(2220.95)
0.01117(1315.23)
3.22558
0.71244
2
0.03623(4265.88)
0.02432(2864.33)
7.00961
1.28780
3
0.05698(6709.39)
0.03789(4462.14)
10.89532
1.88765
Italy
1
0.01446(99.64)
0.01134(78.14)
1.5562
0.12081
2
0.03627(249.95)
0.02876(198.19)
3.92301
0.33632
3
0.06402(441.16)
0.05005(344.91)
6.74034
0.65573
Russia
1
0.01622(104.42)
0.01343(86.45)
0.39060
0.11079
2
0.02954(190.16)
0.02418(155.64)
0.71247
0.20343
3
0.04626(297.73)
0.03795(244.27)
1.13091
0.31719
UK
1
0.06500(871.42)
0.04350(583.18)
1.78073
0.16298
2
0.15677(2101.66)
0.10299(1380.77)
4.32968
0.47361
3
0.27381(3670.67)
0.18412(2468.32)
8.04029
1.02424
USA
1
0.08251(3316.91)
0.06854(2755.29)
3.23027
0.10518
2
0.11920(4791.86)
0.09111(3662.80)
4.45301
0.15395
3
0.14519(5836.68)
0.11731(4715.98)
5.42386
0.19048
Forecasting performance of one-, two-, three-step ahead ( forecasts in HAR-G-V model with size 30.For the second prediction result, we construct % prediction intervals using the residual distributions. In the HAR-G-V model with estimated parameter vector , the residuals are approximated by three approaches. The distributions of the residuals are plotted with histograms in Fig. 6
, where the normal and Laplace distributions by estimated mean and standard deviation of the residuals are shown as well. The % Box–Jenkins prediction interval by the normal approximation is constructed as , with for , respectively, where and are the estimated mean and standard deviation of the residuals. However, as shown in Fig. 6, the normal density is not well fitted with distribution of the residuals. Hence we adopt the Laplace distribution for better fitting of the density. The Laplace distribution, which is well-known as two-sided exponential distribution, has the following probability density function with location parameter and scale parameter : . Its mean and variance are and , respectively. In Fig. 6, red curves indicate Laplace density functions with estimated parameters from the residuals. Using the Laplace distribution, % prediction intervals are constructed as with the estimated mean and scale parameter, and . By rescaling these intervals, the one-step prediction intervals of the daily COVID-19 confirmed cases are constructed.
Fig. 6
Distribution of residuals by HAR-G-V models.
Distribution of residuals by HAR-G-V models.In Table 6
, for the last days, , empirical coverage probability, average length and mean interval score of 80% prediction intervals are displayed. Moreover, those of 95% prediction intervals are given in Table 7
. The empirical coverage probability is the ratio of observations that lie in the prediction interval. The interval score is a scoring rule for prediction interval, which gives a penalty if observation misses the interval and rewards for a narrow interval, (see Chowell and Luo [32], Gneiting and Raftery [46]). The mean interval score (MIS) is defined bywhere and are lower and upper bounds of the % prediction interval for level , respectively, is the indicator function and is observation. If coverage probability is one, then the mean interval score is the same as the average length. Lower coverage probability yields more penalty as much as the last two terms in (3). For the case of Brazil in Tables 6 and 7, the normal and Laplace approximations address small coverage probabilities, not close to, but lower than the nominal coverage, and thus they make much larger (or worse) mean interval scores with high penalty, compared to the average lengths.
Table 6
Coverage probability (CP), average length (AL) and mean interval score (MIS) of 80% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.
Normal
Laplace
Bootstrap
ℓ
CP
AL
MIS
CP
AL
MIS
CP
AL
MIS
Brazil
20
0.7
2447
3790
0.65
2176
3924
0.85
3526
3749
50
0.6
2176
8631
0.54
1935
8887
0.76
3509
7780
100
0.53
2096
9145
0.46
1863
9475
0.72
3422
7926
France
20
1.0
2089
2089
1.0
1858
1858
1.0
2636
2636
50
0.9
2250
2588
0.88
2000
2472
0.9
2806
2985
100
0.66
2231
4910
0.62
1984
5105
0.81
3429
4244
Germany
20
1.0
1192
1192
1.0
1060
1060
1.0
1218
1218
50
1.0
1234
1234
0.94
1097
1126
1.0
1251
1251
100
0.77
1203
2022
0.66
1070
2073
0.84
1413
1861
India
20
1.0
24234
24234
1.0
21547
21547
0.95
2703
2706
50
1.0
26141
26141
1.0
23242
23242
0.78
2792
6798
100
1.0
31015
31015
1.0
27575
27575
0.76
3063
6000
Italy
20
1.0
1313
1313
1.0
1167
1167
1.0
359
359
50
1.0
1389
1389
1.0
1235
1235
0.82
368
486
100
1.0
1361
1361
1.0
1210
1210
0.82
390
599
Russia
20
1.0
1357
1357
1.0
1206
1206
0.7
253
335
50
1.0
1272
1272
1.0
1131
1131
0.74
251
366
100
0.78
787
1006
0.75
699
948
0.6
213
573
UK
20
1.0
2434
2434
1.0
2164
2164
0.8
688
979
50
0.86
2373
3235
0.82
2110
3190
0.56
569
4030
100
0.93
2413
2844
0.91
2146
2686
0.62
441
2495
USA
20
0.75
6952
9895
0.65
6181
10339
0.52
3245
14417
50
0.82
7186
9892
0.78
6389
9894
0.56
2871
13218
100
0.91
7597
8950
0.89
6755
8507
0.7
2663
8044
Table 7
Coverage probability (CP), average length (AL) and mean interval score (MIS) of 95% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.
Normal
Laplace
Bootstrap
ℓ
CP
AL
MIS
CP
AL
MIS
CP
AL
MIS
Brazil
20
0.9
3747
4584
0.95
4050
4465
1.0
8011
8011
50
0.74
3332
21773
0.78
3602
20856
0.96
8095
17968
100
0.69
3209
208
0.74
3469
22065
0.95
8137
16182
France
20
1.0
3200
3200
1.0
3458
3458
1.0
9449
9449
50
1.0
3445
3445
1.0
3723
3723
1.0
034
10034
100
0.82
3416
7901
0.87
3692
7342
1.0
10361
10361
Germany
20
1.0
1825
1825
1.0
1973
1973
1.0
3227
3227
50
1.0
1890
1890
1.0
2042
2042
1.0
3439
3439
100
0.89
1842
3190
0.9
1991
3054
0.99
3563
3602
India
20
1.0
37109
37109
1.0
40107
40107
1.0
6340
6340
50
1.0
40028
40028
1.0
43261
43261
0.94
5867
13861
100
1.0
47492
47492
1.0
51328
51328
0.94
5490
11140
Italy
20
1.0
2011
2011
1.0
2173
2173
1.0
869
869
50
1.0
2127
2127
1.0
2299
2299
1.0
981
981
100
1.0
2084
2084
1.0
2253
2253
0.98
1093
1156
Russia
20
1.0
2078
2078
1.0
2254
2254
0.95
437
478
50
1.0
1948
1948
1.0
2105
2105
0.92
433
525
100
0.88
1205
1716
0.91
1302
1771
0.80
355
1154
UK
20
1.0
3745
3745
1.0
4047
4047
0.95
1996
2007
50
0.94
3653
4344
0.96
3948
4393
0.78
1703
9220
100
0.97
3716
4061
0.98
4016
4239
0.83
1366
5315
USA
20
0.9
10645
11671
0.95
11505
11596
0.7
6284
25698
50
0.94
11004
14384
0.96
11892
14556
0.72
5264
24559
100
0.97
11634
13324
0.98
12573
13905
0.85
4835
14012
Coverage probability (CP), average length (AL) and mean interval score (MIS) of 80% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.Coverage probability (CP), average length (AL) and mean interval score (MIS) of 95% prediction intervals by normal, laplace distributions and Bootstrap procedure in HAR-G-V residuals for the last days.The Laplace distribution seems to fit the residual distributions better and to improve coverage probabilities of the 95% prediction intervals in Brazil, France, Germany and Russia (Table 7). Also, the Laplace distribution makes the mean interval scores better in most countries for the 80% prediction intervals (Table 6), whereas this is not true for the 95% prediction intervals (Table 7). However, average lengths of prediction intervals by both normal and Laplace distributions are large in some countries and so coverage probability is always equal to one. Too long intervals make always coverage one. Thus we need to improve prediction intervals with proper coverage probability close to the nominal coverage, as well as with shorter length. For this purpose, we consider the bootstrap procedure in the next subsection.
Bootstrap prediction interval
In order to assess the bootstrap forecast, we first generate a bootstrap sample of residuals. Using the residuals , we define the empirical distribution byWe draw an i.i.d. sequence from the empirical distribution , which is a bootstrap sample to be used in constructing a bootstrap prediction interval. One-step ahead bootstrap forecast is given byAn % bootstrap prediction interval is constructed by the - and -bootstrap quantiles, and , respectively, of the bootstrap sample , where is the number of bootstrap replications. The % bootstrap prediction interval is given byIn Figs. 7
and 8
, the right columns depict the 95% bootstrap prediction intervals for the COVID-19 confirmed cases, where is used to get the bootstrap quantiles, while the left columns illustrate the Box–Jenkins prediction intervals by the normal approximation. Because all the lengths by the Laplace approximation in Table 7 are larger than those by the normal approximation, plots of Laplace prediction intervals are omitted. The interval plots in Figs. 7 and 8 describe the numerical results in Table 7 with 95% one-step prediction intervals for the last 15 days. The left and right columns use the same scales on the vertical axis in each country so that the Box–Jenkins and bootstrap prediction intervals are compared with each other at sight. In most countries the prediction intervals include the actual values of the COVID-19 confirmed cases. On the left columns of Figs. 7 and 8, the prediction intervals by the Box–Jenkins are so large with coverage probability one in some countries. The lengths of the bootstrap prediction intervals on the right columns are not too large or too small because bootstrap distribution mimics the distribution of the residuals.
Fig. 7
One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Brazil, France, Germany and India.
Fig. 8
One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Italy, Russia, UK and USA.
One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Brazil, France, Germany and India.One-day ahead predicted values and 95% prediction intervals by HAR-G-V models: Box–Jenkins prediction intervals (left column) and Bootstrap prediction intervals (right column); Italy, Russia, UK and USA.To add some discussion of the bootstrap prediction intervals, we go back to Tables 6 and 7, whose last columns report results of the 80%, 95% bootstrap prediction intervals. In Tables 6 and 7 we see that the bootstrap procedure not only improves coverage probability but also makes better mean interval score of prediction intervals. In Brazil, France and Germany, the coverage probabilities are improved to the values close to the norminal coverage by enlarging the average length of the intervals. In India, Italy and Russia, the bootstrap prediction intervals are shortened with the improved coverage probability as well, which are in cases of so large average length by the normal and Laplace distributions. Consequently, the bootstrap technique produces the best mean interval scores: bold numbers in columns of MIS in Tables 6 and 7 indicate the best one among three scores of the normal, Laplace and bootstrap approaches.Finally, in order to show how much the bootstrap method improves the normal and Laplace approximations, in the countries with the best mean interval scores in all three cases of Table 7, i.e., in Brazil, India, Italy and Russia, the coverage probability and mean interval score of 80%, 95% prediction intervals are depicted in Fig. 9, Fig. 10, Fig. 11, Fig. 12
. Instead of in Tables 6 and 7, are set on the horizontal axis in Fig. 9, Fig. 10, Fig. 11, Fig. 12. Using the last days, the empirical coverage probability and the mean interval score are computed and then the two measures are illustrated on the -axis of the horizon. Red graphs in the figures indicate the results by the bootstrap while green and blue ones are by the normal and Laplace, respectively. Red graphs in the coverage probability (left columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12) are closer to the nominal coverage in both 80% and 95% intervals. In the plots of coverage probability, the grey horizontal lines in the middle lie on the nominal coverage. Coverage probabilities of the 80% bootstrap prediction intervals average to 0.8470, 0.8708, 0.9191 and 0.7717 in Brazil, India, Italy and Russia, respectively, whereas those of the normal and Laplace approximations are all one except for Brazil, which has the averages 0.7073 and 0.6564. Also, the averages of coverage probabilities of the 95% bootstrap prediction intervals are 0.9761, 0.9624, 1.0 and 0.9691 in the four countries, respectively, whereas those of others are all one, but for Brazil with 0.8543 and 8894.
Fig. 9
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Brazil.
Fig. 10
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for India.
Fig. 11
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Italy.
Fig. 12
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Russia.
Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Brazil.Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for India.Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Italy.Coverage probability (left column) and Mean interval score (right column) of 80%, 95% prediction interval (PI) for Russia.As for the mean interval score (right columns of Fig. 9, Fig. 10, Fig. 11, Fig. 12), red graphs of the bootstrap are much smaller values than others except for a few values of Brazil. As shown in the figures, the Laplace approximation does not significantly improve the normal one. However, the bootstrap method considerably lowers and enhances the mean interval scores from both normal and Laplace approximations. To understand the bootstrap improvement with an evaluated measure, we suggest a bootstrap improvement ratio (BIR) of the mean interval score. The BIR is defined aswhere is average of the mean interval scores by the underlying approach for :with in (3), and stands for the bootstrap. The BIR values are displayed in Table 8
, whose results are computed from the mean interval scores (right columns) of Fig. 9, Fig. 10, Fig. 11, Fig. 12. In the case of Brazil, it is reported that the BIR values of the 95% prediction intervals are not as good as those in others because the mean interval scores of the bootstrap are higher (or worse) for , as seen in Fig. 9, than other two approaches. In India and Russia, BIR values are between 0.7 and 0.8, which means that even up to 70–80% of the scores are improved either from the normal to the bootstrap or from the Laplace to the bootstrap. In conclusion, we claim that the bootstrap technique can improve prediction intervals with having more proper empirical coverages (closer to nominal one) as well as shorter average lengths and thus better mean interval scores.
Table 8
Bootstrap improvement ratio (BIR) of prediction interval (PI) on the mean interval scores in Fig. 9, Fig. 10, Fig. 11, Fig. 12.
Normal to Bootstrap
Laplace to Bootstrap
80% PI
95% PI
80% PI
95% PI
Brazil
0.0246
0.0101
0.0434
−0.0172
India
0.8031
0.7132
0.7786
0.7346
Italy
0.7103
0.5754
0.6741
0.6071
Russia
0.7683
0.7781
0.7394
0.7947
Bootstrap improvement ratio (BIR) of prediction interval (PI) on the mean interval scores in Fig. 9, Fig. 10, Fig. 11, Fig. 12.
Conclusion and discussion
The COVID-19 pandemic might continue over the next few years in spite of the vaccination supply all around the world. Health scientists and researchers will make more efforts to reverse the virus-contaminated Earth to the uncontaminated one again. For this purpose, more efficient models as well as more accurate predictions are needed to be reflected in social policies such as lockdown and social distancing. In the statistical aspect, rather than point-estimation of predicted values, interval-estimation such as prediction intervals can give more information with high probability. To this end, this study suggests a multiple regression with a form of nonlinear HAR along with growth rates and vaccination rates as its additional regressors, and aims to provide more accurate prediction intervals for the daily COVID-19 confirmed cases.Model selections of the HAR models with/without the growth rates and vaccination rates are given by the criteria measures such as RMSE, MAE, R, AIC and BIC. Our proposed HAR-G-V model addresses the smallest values of RMSE and MAE in top eight affected countries. We conclude that the HAR-G-V model is a suitable and useful tool for obtaining accurate prediction. For the forecasting, point-estimation of predicted values as well as interval-estimation through three kinds of prediction intervals are extensively provided. RMSE, MAE, MAPE and RRSE are evaluated for the forecast performance accuracy. In order to assess accurate prediction intervals, three approaches are carried out with the residuals of the HAR-G-V model. Normal and Laplace approximations, and bootstrap method are tried to construct prediction intervals. The last one improves prediction intervals with maximum 70–80% of mean interval score, because bootstrap procedure mimics the residual distribution.This work has the following contributions: (i) A time series model collaborated with both growth rates and vaccination rates is proposed in order to forecast the daily COVID-19 confirmed cases. As a result, the proposed model yields smaller errors in RMSE and MAE. (ii) This work mainly focuses on the prediction intervals to be more informative, by establishing the 80%, 95% prediction intervals through the normal, Laplace and bootstrap approaches. All three prediction intervals produce good empirical coverage probabilities. (iii) The bootstrap technique can improve the mean interval score of prediction intervals. Thus we conclude that the HAR-G-V model combined with the bootstrap method provides accurate prediction intervals of the COVID-19 pandemic.As a next step of the forecast analysis of the COVID-19, author plans to propose and study a HAR(-G-V) model with partially periodicity as mentioned as in the last section of Hwang and Yu [35]. This current work has handled the data of 7-day smoothed COVID-19 cases as in Ekinci [34]. Empirical results in Ekinci [34] and in this work can provide useful predictive information in deciding the COVID-19 social policies such as social distancing or pandemic relief money. It is because such a social policy is executed for a rather long period, specifically, much longer than 7 days. Due to irregular PCR tests from day to day, for example, rare on Sunday or frequent on Wednesday, the 7-day smoothed data analysis is reasonable and sufficient to understand the overall COVID-19 situation. However, as an interesting and challenging topic, author is interested in modeling the partially periodic COVID-19 cases. As seen in the patterns of COVID-19 data (unsmoothed) of USA, Germany, Brazil, India etc., there is a phenomenon such as strong oscillation with 7-day periodicity, which makes 7-day periodic ACFs. The oscillation seems to appear as the magnitude of the COVID-19 data is larger. In other words, if the magnitude becomes larger, then the oscillation becomes stronger; together with no oscillation on small magnitude. In order to reflect such a feature, partial oscillations not only having some truncation but also proportional to the magnitude might be more desirable than full oscillations. Such a time series modeling for the partially periodic COVID-19 cases will be a promising and challenging study. It will need a mathematical refinement or probability distributional complement (like [18]) as well as an experimental design along with the basis of the statistical inference.
CRediT authorship contribution statement
Eunju Hwang: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Supervision, Validation, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.