
Forecasting the dynamics of cumulative COVID-19 cases (confirmed, recovered and deaths) for top-16 countries using statistical machine learning models: Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA).

K E ArunKumar1, Dinesh V Kalaga2, Ch Mohan Sai Kumar3, Govinda Chilkoor4, Masahiro Kawaji2, Timothy M Brenza1,5.   

Abstract

Most countries are reopening or considering lifting stringent prevention policies such as lockdowns; consequently, daily coronavirus disease (COVID-19) cases (confirmed, recovered and deaths) are increasing significantly. As of July 25th, 2020, there were 16.5 million global cumulative confirmed cases, 9.4 million cumulative recovered cases and 0.65 million deaths. There is a pressing need to monitor and estimate future COVID-19 cases to control the spread and help countries prepare their healthcare systems. In this study, two time-series models, Auto-Regressive Integrated Moving Average (ARIMA) and Seasonal Auto-Regressive Integrated Moving Average (SARIMA), are used to forecast the epidemiological trends of the COVID-19 pandemic for the top-16 countries, where 70%-80% of global cumulative cases are located. Initial combinations of the model parameters were selected using the auto-ARIMA model, and the optimized model parameters were then found based on the best fit between the predictions and the test data. The analytical tools Auto-Correlation Function (ACF), Partial Auto-Correlation Function (PACF), Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) were used to assess the reliability of the models. The evaluation metrics Mean Absolute Error (MAE), Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Percent Error (MAPE) were used as criteria for selecting the best model. A case study is presented in which the statistical methodology for model selection and the forecasting procedure are discussed in detail for the COVID-19 cases of the USA. The best ARIMA and SARIMA model parameters for each country were selected manually, and the optimized parameters were then used to forecast the COVID-19 cases. Forecasted trends for confirmed and recovered cases showed an exponential rise for countries such as the United States, Brazil, South Africa, Colombia, Bangladesh, India, Mexico and Pakistan. 
Similarly, trends for cumulative deaths showed an exponential rise for Brazil, South Africa, Chile, Colombia, Bangladesh, India, Mexico, Iran, Peru, and Russia. The SARIMA model predictions are more realistic than those of the ARIMA model, confirming the existence of seasonality in COVID-19 data. The results of this study not only shed light on the future trends of the COVID-19 outbreak in the top-16 countries but also guide these countries in preparing their health care policies for the ongoing pandemic. The data used in this work were obtained from the publicly available Johns Hopkins University COVID-19 database.
© 2021 Elsevier B.V. All rights reserved.


Keywords:  ARIMA; Akaike Information Criteria (AIC); Auto-Correlation Function (ACF); Bayesian Information Criterion (BIC); COVID-19; Pandemic; SARIMA; Statistical modeling; Time-series forecast

Year:  2021        PMID: 33584158      PMCID: PMC7869631          DOI: 10.1016/j.asoc.2021.107161

Source DB:  PubMed          Journal:  Appl Soft Comput        ISSN: 1568-4946            Impact factor:   6.725


Non-stationary time-series. Time-series after first order differencing. Time-series after second order differencing. Observation at time in the past, one-time step away from current time stamp . Seasonal time-series data. Observation at in the past, m time steps away from the current time-stamp . Number of time steps for a single seasonal period. Time-series data at time . Intercept or constant. Auto-regressive parameter at th or th time-stamp. Auto-regressive parameter at time-stamp. Auto-regressive parameter at time-stamp. Moving average parameter at time-stamp. Moving average parameter at time-stamp. Random error or residual term for the th day. Auto-regressive term of ARIMA model. Ordinary differencing term of ARIMA model. Moving average term of ARIMA model. Auto-regressive term of SARIMA model. Seasonal differencing term of SARIMA model. Moving average term of SARIMA model. Auto-regressive parameter of SARIMA model at th time-stamp. Moving average parameter of SARIMA model at th time-stamp. Backshift operator. Non-seasonal auto-regressive polynomial. Non-seasonal moving average polynomial. Seasonal auto-regressive polynomial. Seasonal differencing. Seasonal moving average polynomial. Likelihood of the candidate model given the data evaluated at . The set of model parameters. The number of estimated parameters in the candidate model. Sample size or number of observations.

Introduction

In the last week of December 2019, a group of patients at local hospitals in Wuhan, China presented with a novel form of viral pneumonia [1]. All the patients shared a common history of visiting a wet market in Wuhan. The patients did not respond to medication, and the causative agent was identified as Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), a strain of the coronavirus family [2]. The World Health Organization (WHO) named the disease COVID-19 and declared the outbreak a pandemic on 11 March 2020 [3]. Since then, many countries have been severely affected by the novel coronavirus disease (COVID-19) and have taken various measures to control its spread. For example, countries such as the USA, Australia and India adopted preventive measures such as facemasks, stay-at-home orders, social distancing, and lockdowns [4], [5], [6], [7]. As a result of these control measures, daily confirmed cases decreased drastically. For example, following the implementation of stay-at-home orders in the USA, daily confirmed cases fell from 32,074 on April 9th, 2020 to 19,370 on June 7th, 2020 [8]. Currently, most countries are at the breaking point in terms of health services, following the stay-at-home, mandatory face mask, and social-distancing orders. This is evident from the surge to 63,000 daily confirmed new cases (5-day average) reported on 15th July 2020, 3.2 times the number reported on 7th June 2020 [8]. Similarly, Italy’s health care system has been pushed beyond its limits. The exponential rise in confirmed cases required an exponential rise in health care supplies and in the deployment of healthcare personnel [9]. Currently, there are 16 countries where 80% of the global COVID-19 confirmed cases are concentrated. 
As there is no specific treatment for COVID-19, preparing the health care system and preventing infection are of utmost urgency [10]. The healthcare system can be prepared to control the outbreak by accurately forecasting COVID-19 dynamics using statistical modeling tools. These models can be used to make short-term and long-term forecasts of the disease spread, thereby indicating the amount of additional healthcare resources that will be needed. Various statistical models are used to predict the upcoming number of cases and forecast the spread of infectious disease in the near future [11]. Zhang et al. [12] used the SARIMA model to forecast typhoid fever. In another study, Chen et al. [13] forecasted the influenza incidence in urban and rural areas of Shenyang, China using a SARIMA model. Similarly, ARIMA models have been used to forecast infectious diseases such as tuberculosis [14], dengue fever [15] and brucellosis [16]. Recently, ARIMA models were used to predict the prevalence, growth rate, and life cycle of the COVID-19 pandemic. Ceylan [17] used ARIMA models to predict the epidemiological trend in Italy, Spain, and France. Leila et al. [18] used the ARIMA model to forecast the number of COVID-19 patients in Iran for the next 30 days. They reported that the number of daily cases would be 3,574 by April 20. Marbaniang [19] used ARIMA models to forecast the total confirmed cases for the 20 days following May 18th, 2020, and reported that the cases in India would increase to 245,000 in the first week of June 2020. Perone [20] used ARIMA models to forecast the cumulative cases in Italy for more than 40 days. The results showed that the number of COVID-19 cases in Tuscany (Italy) would reach a plateau on the 55th day of the forecast. 
Further, several researchers have reported short-term forecasts of the COVID-19 pandemic using machine learning models other than ARIMA and SARIMA. Ghosal et al. [21] used linear and multiple linear regression techniques to forecast the number of fatalities in India over a short period of six weeks. The authors reported that fatalities in India would double if the COVID-19 preventive measures were unchanged or not implemented. Parbat and Chakraborty [22] employed Support Vector Regression (SVR) to predict COVID-19 cases in India for 60 days, based on time-series data reported from 1st March 2020 to 30th April 2020. Their results indicate that the SVR model has an accuracy of 97% in predicting cumulative fatalities, cumulative recovered cases, and cumulative confirmed cases. Their model was also able to predict the daily new COVID-19 cases with an accuracy of 87%. Maleki et al. [23] used Auto-Regressive (AR) models based on two-piece scale mixture normal distributions to forecast confirmed and recovered COVID-19 cases. Their model performed well in forecasting confirmed and recovered global COVID-19 cases. Ribeiro et al. [4], [24] used Cubist Regression, Random Forest, Ridge Regression, SVR, and ARIMA models for short-term forecasting of COVID-19 confirmed cases in Brazil. Their findings reveal that the best-performing models are SVR and ARIMA. Salgotra et al. [25] used models based on genetic programming to predict the cumulative confirmed cases and cumulative fatalities in India. The authors found that their model is less sensitive to the variables and highly reliable in predicting cumulative confirmed cases and cumulative deaths. Chimmula and Zhang [26] employed a deep learning Long Short-Term Memory (LSTM) network to predict COVID-19 trends in Canada. They reported that the pandemic in Canada would end in about three months. Mehdi et al. 
[27] employed an LSTM network, SARIMA, Holt-Winters exponential smoothing and moving average methods to forecast COVID-19 cases in Iran. Their comparative study reported that the LSTM model outperformed the other models. Ardabili et al. [28] implemented a multi-layer perceptron model and an adaptive network-based fuzzy inference system to predict the COVID-19 outbreak. Their work recommended developing individual machine learning models for each country because of the fundamental differences among countries. In this study, we forecast the cumulative COVID-19 confirmed cases, recovered cases, and confirmed deaths for the top-16 countries, where 70%–80% of global COVID-19 cases are concentrated. The top-16 countries were chosen based on total cumulative confirmed cases. The percentage distribution of COVID-19 cases by country is depicted as a pie chart in Fig. 1. The present study uses COVID-19 cases reported from Jan 22, 2020 to July 24th, 2020; the data were obtained from the Johns Hopkins Coronavirus Resource Center [29]. The rest of the paper is organized as follows: Section 2 describes the statistical models and their underlying mathematics, along with the analytical tools and evaluation metrics. The computational framework of the model parameter selection procedure is discussed in Section 3. In Section 4, the model parameter selection and optimization procedure is discussed in detail, taking the time-series analysis of cumulative confirmed cases of the USA as a case study. Further, forecasted trends of the cumulative confirmed cases, recovered cases and deaths, based on the ARIMA and SARIMA models, are given in the results and discussion section (Section 5). Finally, Section 6 provides the conclusions drawn from the present work.
Fig. 1

Pie-chart showing the percentage distribution of global COVID-19 data (A) confirmed cases (B) recovered cases (C) deaths.


Statistical models and description

We used ARIMA and SARIMA statistical models to generate a 60-day forecast of cumulative COVID-19 cases for the top-16 countries. The proposed models are country-specific and were optimized by selecting the best model parameters. For each country, the date on which the first case was reported is taken as the starting day of the time-series; hence, the starting date varies from country to country. A statistically meaningful forecast of time-series data requires a minimum sample size of 30 observations [30]. The number of observations (i.e. sample size) used in the present work is much greater than this minimum, as the data were collected over a period of seven months (22nd January 2020 to 3rd August 2020). Time-series data is a sequence of numerical values, each with an associated time-stamp [31]. Time-series data can be classified into two categories: stationary and non-stationary. A stationary time-series has no systematic patterns with respect to time, whereas a non-stationary time-series has patterns such as trend or seasonality. Therefore, the mean and variance of non-stationary data are not constant over time. A non-stationary time-series can be converted into a stationary one by calculating the difference between successive observations. This technique, called differencing, removes changes in the level of the time-series, thereby eliminating trend and seasonality. There are two widely used differencing techniques: ordinary differencing and seasonal differencing. Ordinary first-order and second-order differencing are represented by Eqs. (1) and (2), respectively.

$$y'_t = y_t - y_{t-1} \tag{1}$$

$$y''_t = y'_t - y'_{t-1} = y_t - 2y_{t-1} + y_{t-2} \tag{2}$$

where $y_t$ is the non-stationary time-series, $y'_t$ is the time-series after first-order differencing, $y''_t$ is the time-series after second-order differencing, $y_{t-1}$ is the observation at time-stamp $t-1$, and $y_{t-2}$ is the observation at time-stamp $t-2$. Second-order differencing is needed when the data are not stationary after first-order differencing. In seasonal differencing, the difference is taken between an observation and the previous observation from the same season. First-order seasonal differencing can be written as follows.

$$y'_t = y_t - y_{t-m} \tag{3}$$

where $y'_t$ is the seasonal time-series after first-order differencing, $y_{t-m}$ is the observation at time-stamp $t-m$, and $m$ is the number of time steps in a single seasonal period. The time-series data were first differenced to remove the seasonality, and the resulting data were then used for forecasting. The following assumptions were made in developing the statistical models from the time-series data: (i) the time-series data do not contain anomalies or outliers; (ii) the data are univariate, meaning that the time-series comprises only one variable, since both the ARIMA and SARIMA models regress a variable on its own past values; (iii) the data are stationary, that is, the mean and variance are constant over time; (iv) the model parameters and error terms are constant with respect to time.
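The differencing operations described above can be illustrated with a minimal sketch in plain Python. The function name `difference` and the sample counts are hypothetical and are not taken from the paper's scripts or data; the sketch only shows how ordinary differencing (lag 1) and seasonal differencing (lag m) act on a cumulative series.

```python
def difference(series, lag=1):
    """Return y_t - y_{t-lag}; lag=1 is ordinary, lag=m is seasonal differencing."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Hypothetical cumulative counts (illustrative only, not the paper's data).
cumulative = [1, 3, 7, 13, 21, 31, 43, 57]

first = difference(cumulative)            # first-order: daily new cases
second = difference(first)                # second-order: change in daily new cases
seasonal = difference(cumulative, lag=7)  # seasonal differencing with m = 7

print(first)     # [2, 4, 6, 8, 10, 12, 14]
print(second)    # [2, 2, 2, 2, 2, 2]
print(seasonal)  # [56]
```

In this example the first difference still grows linearly, while the second difference is constant, which is the situation in which second-order differencing is needed.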

Auto-Regressive Integrated Moving Average (ARIMA(p,d,q))

The ARIMA(p,d,q) model was first introduced by Box and Jenkins in 1976 [32]; it can be used for forecasting non-seasonal time-series data that are stationary or can be made stationary by differencing. An ARIMA model is characterized by three terms, p, d and q, where p is the order of the Auto-Regression (AR) term, q is the order of the Moving Average (MA) term, and d is the order of differencing required to make the time-series stationary. Auto-regression is the regression of a variable against its own past values to forecast the variable of interest; it relates the pattern in one time period to that in previous time periods. MA is a regression-like model that uses the errors associated with the forecasts at previous time-steps to forecast the variable at a later time-step. The generalized equations of the pth-order AR model (Eq. (4)) and the qth-order MA model (Eq. (5)) are as follows.

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + \varepsilon_t \tag{4}$$

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} \tag{5}$$

ARIMA models combine the AR model (Eq. (4)), integration (I) and the MA model (Eq. (5)). Integration (I) is the reverse of differencing and is used to generate the forecast on the original scale. The generalized ARIMA model is represented by Eq. (6).

$$y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \tag{6}$$

where $c$ is the intercept, $\phi_1, \dots, \phi_p$ are the auto-regressive model parameters, $\theta_1, \dots, \theta_q$ are the moving average model parameters, $y_t$ is the current time-series value (after differencing d times), $y_{t-1}, \dots, y_{t-p}$ are past values, and $\varepsilon_t$ is the random error or residual term for the tth day, given by the following equation:

$$\varepsilon_t = y_t - \hat{y}_t \tag{7}$$

where $\hat{y}_t$ is the model-predicted value.
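To make the roles of the AR, MA and integration steps concrete, the following is a minimal pure-Python sketch of forecasting with an ARIMA(p, d, q) model whose parameters are already known. It is not the authors' implementation: the function name `arima_forecast` and its arguments are hypothetical, the coefficients are assumed to have been estimated elsewhere, pre-sample terms are set to zero, and future errors are replaced by their expected value of zero.

```python
def arima_forecast(series, phi, theta, c=0.0, d=0, steps=1):
    """Forecast `steps` values from an ARIMA(p, d, q) with known parameters.

    phi and theta are hypothetical, already-estimated AR and MA coefficients.
    """
    p, q = len(phi), len(theta)

    # Difference d times, remembering the last value at each level so the
    # forecasts can be integrated back to the original scale.
    levels, w = [], list(series)
    for _ in range(d):
        levels.append(w[-1])
        w = [w[t] - w[t - 1] for t in range(1, len(w))]

    # Recover in-sample residuals recursively (pre-sample terms set to zero).
    eps = []
    for t in range(len(w)):
        ar = sum(phi[i] * w[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        eps.append(w[t] - (c + ar + ma))

    forecasts = []
    for _ in range(steps):
        t = len(w)
        ar = sum(phi[i] * w[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        w.append(c + ar + ma)  # forecast of the differenced series
        eps.append(0.0)        # future errors are set to zero
        # Integration: add the forecast back through each differencing level.
        value = w[-1]
        for k in range(d - 1, -1, -1):
            levels[k] += value
            value = levels[k]
        forecasts.append(value)
    return forecasts


# ARIMA(0,2,0) on a linear series simply continues the latest trend.
print(arima_forecast([1, 3, 5, 7], phi=[], theta=[], d=2, steps=2))  # [9.0, 11.0]
```

In practice the coefficients would be estimated by a library such as statsmodels; the sketch only shows how a fitted model turns past values and past errors into a forecast.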

Seasonal Auto-Regressive Integrated Moving Average (SARIMA(p, d, q)(P, D, Q))

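The Seasonal-ARIMA (SARIMA) model includes the non-seasonal ARIMA(p, d, q) terms and additional seasonal terms (P, D, Q) to account for the seasonality of the time-series data, where m is the number of time steps in a single seasonal period. The terms P, D and Q are the orders of the seasonal AR term, the seasonal differencing term and the seasonal moving average term, respectively. The general SARIMA model is represented as follows:

$$\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\, y_t = \theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t \tag{8}$$

where $y_t$ is the non-stationary time-series, $\varepsilon_t$ is the Gaussian white noise process, $\phi_p(B)$ is the non-seasonal auto-regressive polynomial, $\theta_q(B)$ is the non-seasonal moving average polynomial, $(1-B^m)^D$ is the seasonal differencing term with D equal to 1, 2, etc. (however, a value of D = 1 is sufficient to enforce stationarity in the data), $\Phi_P(B^m)$ is the seasonal auto-regressive polynomial, and $\Theta_Q(B^m)$ is the seasonal moving average polynomial. $B$ is the backshift operator, which is expressed as follows:

$$B^k y_t = y_{t-k} \tag{9}$$

The expressions for the non-seasonal auto-regressive polynomial (Eq. (10)), the non-seasonal moving average polynomial (Eq. (11)), the seasonal AR polynomial (Eq. (12)) and the seasonal MA polynomial (Eq. (13)) are given below.

$$\phi_p(B) = 1 - \phi_1 B - \phi_2 B^2 - \dots - \phi_p B^p \tag{10}$$

$$\theta_q(B) = 1 + \theta_1 B + \theta_2 B^2 + \dots + \theta_q B^q \tag{11}$$

$$\Phi_P(B^m) = 1 - \Phi_1 B^m - \Phi_2 B^{2m} - \dots - \Phi_P B^{Pm} \tag{12}$$

$$\Theta_Q(B^m) = 1 + \Theta_1 B^m + \Theta_2 B^{2m} + \dots + \Theta_Q B^{Qm} \tag{13}$$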
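The combined ordinary and seasonal differencing in a SARIMA model can be viewed as a single polynomial in the backshift operator. The sketch below, in plain Python, expands the product of d ordinary and D seasonal difference factors into a coefficient list and applies it to a series. The helper names (`poly_mul`, `differencing_operator`, `apply_operator`) are hypothetical and chosen only for this illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists (index = power)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def differencing_operator(d, D, m):
    """Coefficients of (1 - B)^d (1 - B^m)^D as a polynomial in B."""
    ordinary = [1, -1]                      # 1 - B
    seasonal = [1] + [0] * (m - 1) + [-1]   # 1 - B^m
    poly = [1]
    for _ in range(d):
        poly = poly_mul(poly, ordinary)
    for _ in range(D):
        poly = poly_mul(poly, seasonal)
    return poly


def apply_operator(poly, series):
    """Apply a backshift polynomial: sum over k of poly[k] * y_{t-k}."""
    k_max = len(poly) - 1
    return [sum(c * series[t - k] for k, c in enumerate(poly))
            for t in range(k_max, len(series))]


# (1 - B)(1 - B^7) = 1 - B - B^7 + B^8
print(differencing_operator(d=1, D=1, m=7))  # [1, -1, 0, 0, 0, 0, 0, -1, 1]
```

Each application of the operator shortens the series by the degree of the polynomial, so with m = 7, d = 2 and D = 2 the first 16 observations are consumed by differencing.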

Analytical tools and model evaluation

The following analytical tools are used to assess the reliability of the time-series analysis: Auto-Correlation Function (ACF), Partial Auto-Correlation Function (PACF), Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). ACF and PACF indicate the relation between observations within the time-series. ACF gives the correlation of the time-series with its own lagged values, whereas PACF gives the correlation between the time-series and a lagged copy of itself after accounting for the observations at intermediate lags. AIC and BIC are both penalized-likelihood criteria; lower AIC and BIC values mean that a model is more likely to be considered the true model. The evaluation metrics used in this study are Mean Absolute Error (MAE), Mean Square Error (MSE), Root Mean Square Error (RMSE) and Mean Absolute Percent Error (MAPE).

Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF)

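The correlation between the current observation and observations from previous time-steps (lags) in a time-series is called auto-correlation. The plot of auto-correlation versus lag is called an auto-correlation plot, and the ACF describes the linear relationship between the observation at time $t$ and the observation at a previous time $t-k$. To illustrate, the ACF for a time-series $y_t$ is given by:

$$\rho_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2}$$

where $k$ is the lag, defined as the difference between the time-stamps $t$ and $t-k$, $\bar{y}$ is the mean of the series and $n$ is the number of observations. Lag-$k$ auto-correlation means the correlation between observations that are $k$ time periods apart. In partial auto-correlation, on the other hand, the intermediate observations are accounted for, i.e. their linear effect is removed, when calculating the correlation between two observations at different times. For instance, consider a time-series $y_t$. The PACF between two observations $y_t$ and $y_{t-k}$ (assuming $k \geq 2$) can be written as shown in the equation.

$$\mathrm{PACF}(y_t, y_{t-k}) = \frac{\mathrm{Cov}(y_t, y_{t-k} \mid y_{t-1}, \dots, y_{t-k+1})}{\sqrt{\mathrm{Var}(y_t \mid y_{t-1}, \dots, y_{t-k+1})\,\mathrm{Var}(y_{t-k} \mid y_{t-1}, \dots, y_{t-k+1})}}$$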
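As a rough illustration of how these two functions can be computed, the sketch below estimates the sample ACF directly and obtains the sample PACF from it with the Durbin-Levinson recursion. The recursion is one standard way to compute partial auto-correlations and is an assumption here; the paper does not state which estimator its plots use. The function names are hypothetical.

```python
def acf(series, max_lag):
    """Sample auto-correlation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(x * x for x in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def pacf(series, max_lag):
    """Sample partial auto-correlation via the Durbin-Levinson recursion."""
    rho = acf(series, max_lag)
    result = [1.0]
    prev = []  # AR coefficients of the order-(k-1) fit
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the AR coefficients for order k; the last one is the PACF.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result


# Hypothetical daily series, for illustration only.
sample = [2.0, 4.0, 3.0, 7.0, 6.0, 9.0, 8.0, 12.0]
print(acf(sample, 3))
print(pacf(sample, 3))
```

At lag 1 the two functions coincide, because there are no intermediate observations to account for; they differ from lag 2 onward.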

Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)

The generated models need to be tested for goodness of fit, that is, how well they explain the relationships between the variables. We used information criteria for this purpose. Two popular criteria are AIC and BIC; they assess the quality of a model by rewarding a good fit (low error) while penalizing models with too many parameters. AIC is represented as follows.

$$\mathrm{AIC} = -2\ln L(\hat{\theta}) + 2K$$

where $L(\hat{\theta})$ is the likelihood function evaluated at the estimated parameters $\hat{\theta}$ and $K$ is the total number of model parameters. Similarly, BIC is another model selection criterion. BIC imposes a heavier penalty on the number of parameters than AIC whenever the number of observations is 8 or more, since $\ln(n)$ then exceeds 2. For both AIC and BIC, a lower value indicates a better model, one that combines a higher likelihood with fewer parameters. These criteria thus assist time-series analysts in choosing the best model among the finite number of candidate models generated. BIC is represented as follows.

$$\mathrm{BIC} = -2\ln L(\hat{\theta}) + K\ln(n)$$

where $n$ is the number of observations.
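The two criteria are simple to evaluate once a model's maximized log-likelihood is known. The following sketch uses the standard definitions, with penalty terms 2K for AIC and K ln(n) for BIC; the function names and the log-likelihood values are hypothetical and serve only to show how the penalties compare.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2 ln L + K ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


# Two hypothetical candidates fitted to n = 150 observations:
# a small model and a larger one with a slightly better likelihood.
n = 150
small = {"loglik": -1010.0, "k": 3}
large = {"loglik": -1006.0, "k": 9}

for name, model in (("small", small), ("large", large)):
    print(name, aic(model["loglik"], model["k"]),
          round(bic(model["loglik"], model["k"], n), 1))
```

With these made-up numbers both criteria prefer the smaller model, and the gap is wider under BIC because ln(150) is about 5, compared with the factor of 2 used by AIC.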

Evaluation metrics

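MAE, MSE, RMSE and MAPE are often used to evaluate the accuracy of the proposed model; they are given by the following equations:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

where $\hat{y}_i$ is the model-predicted value, $y_i$ is the actual value, and $n$ is the number of observations.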
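A minimal sketch of the four evaluation metrics in plain Python follows. The function names and sample values are hypothetical, and MAPE is expressed here in percent, which is an assumption about the scaling used in the paper's tables.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Square Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error."""
    return math.sqrt(mse(actual, predicted))


def mape(actual, predicted):
    """Mean Absolute Percent Error, in percent; actual values must be non-zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical test-set values, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

print(mae(actual, predicted), mse(actual, predicted),
      rmse(actual, predicted), mape(actual, predicted))
```

MAE, MSE and RMSE depend on the scale of the series, so they are not comparable across countries with very different case counts, whereas MAPE is scale-free.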

Computational framework for model development

In the first step, each time-series was checked for non-stationarity using ACF and PACF plots. If the auto-correlation decreases only marginally as the number of lags increases, the time-series is non-stationary. Time-series showing evidence of non-stationarity were differenced before ARIMA or SARIMA modeling. The raw time-series was used for modeling without any differencing if the ACF and PACF plots indicated stationarity. A non-stationary time-series was subjected to first-order differencing to stabilize its mean before forecasting. However, in some cases second-order differencing was performed because the first-order differenced time-series still had a trend or seasonality. The algorithm showing the stepwise procedure for developing the ARIMA and SARIMA models is given in Fig. 2. The scripts were written in the Python programming language (Ver. 3.7) installed in the Anaconda environment. The numerical simulations were performed on a cloud computing platform (Google Colab) and on a local computer (OS: Windows 10, Processor: Intel i7). Snapshots of the Python script depicting the important steps of the data analysis can be found in Appendix A. The average computational time for each simulation on the local computer was about 3 s for the ARIMA models and 6 s for the SARIMA models.
Fig. 2

Algorithm showing the methodology for developing ARIMA and SARIMA models.

Most of the countries’ cumulative cases required second-order differencing, except for the confirmed cases of South Africa and Spain (Table 1), the recovered cases of the UK (Table 3), and the recovered cases of Spain and the UK (Table 4). The ACF plot of the stationary time-series was used to get a basic idea of whether AR terms or MA terms would fit the data better. If the ACF plot has negative auto-correlation at the first lag, it suggests using MA terms. If the PACF plot of the differenced time-series shows a sharp, positive cutoff, we consider adding AR terms to the model. Selecting the best parameters (p, d, q) manually using ACF and PACF plots for ARIMA can be time-consuming, because the number of candidate models grows combinatorially with the range of each order parameter, and it is even more expensive for SARIMA(p, d, q)(P, D, Q). To select a proper combination of model parameter values, we performed a grid search using the pmdarima (Pyramid ARIMA) library, a Python package built on statsmodels. pmdarima uses AIC as the criterion for choosing the best model among the various ARIMA and SARIMA candidates. The seasonality of the data was checked using the seasonal_decompose function available in statsmodels. The stepwise parameter selection was then performed with the seasonal option set to “True” during the grid search. Since the cumulative COVID-19 data span only a few months, the parameter that represents the seasonal period (m) was set to 3, 7 and 12. Our data analysis showed that the seasonal terms varied from country to country. The model with the best seasonal term was identified using the information criteria (AIC and BIC). Malki et al. [33] used a similar approach for identifying the seasonal period (m) for COVID-19 data, assigning it values of 3, 7 and 12.
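The search described above might be sketched as follows. This is not the authors' script (their code is in Appendix A of the paper): the helper names `chronological_split` and `select_sarima` are hypothetical, the 80/20 chronological split follows the paper's description of its training and test data, and the sketch assumes the third-party pmdarima package is installed. It calls `pmdarima.auto_arima` once per candidate seasonal period and keeps the fit with the lowest AIC.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-series into training and test parts without shuffling."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def select_sarima(train, periods=(3, 7, 12)):
    """Run a stepwise auto-ARIMA search per seasonal period; keep the lowest AIC."""
    import pmdarima as pm  # third-party package, imported only when needed

    best = None
    for m in periods:
        model = pm.auto_arima(
            train,
            seasonal=True,
            m=m,
            stepwise=True,
            information_criterion="aic",
            suppress_warnings=True,
            error_action="ignore",
        )
        if best is None or model.aic() < best.aic():
            best = model
    return best


# Hypothetical usage with a list of cumulative counts called `cases`:
#   train, test = chronological_split(cases)
#   best = select_sarima(train)
#   print(best.order, best.seasonal_order, best.aic(), best.bic())
#   predictions = best.predict(n_periods=len(test))
```

The automatically selected orders are only a starting point; as the text explains, the residual ACF and PACF plots and the test-set error metrics decide whether a higher-order model is needed.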
Table 1

Selected ARIMA models for forecasting cumulative confirmed cases.

Country | ARIMA (p,d,q) | AIC | BIC | MSE | MAE | RMSE | MAPE
South Africa | (7,2,1) | 1.93E+03 | 1.96E+03 | 1.33E+05 | 1.91E+02 | 3.64E+02 | 5.42E−02
Bangladesh | (6,2,2) | 1.72E+03 | 1.75E+03 | 3.90E+06 | 1.35E+03 | 1.97E+03 | 6.00E−01
Brazil | (6,2,1) | 2.63E+03 | 2.65E+03 | 1.29E+09 | 2.99E+04 | 3.59E+04 | 1.39E+00
Chile | (7,2,4) | 2.20E+03 | 2.24E+03 | 3.62E+05 | 4.87E+02 | 6.02E+02 | 1.50E−01
Colombia | (2,2,2) | 1.77E+03 | 1.77E+03 | 1.11E+07 | 3.30E+03 | 3.33E+03 | 1.82E+00
India | (5,2,2) | 2.48E+03 | 2.50E+03 | 2.99E+07 | 3.52E+03 | 5.47E+03 | 3.28E−01
Iran | (0,2,0) | 2.10E+03 | 2.11E+03 | 7.43E+04 | 2.22E+02 | 2.73E+02 | 8.06E−02
Italy | (7,2,4) | 2.28E+03 | 2.32E+03 | 1.97E+04 | 1.23E+02 | 1.40E+02 | 4.50E−01
Mexico | (6,2,2) | 1.96E+03 | 1.98E+03 | 8.08E+06 | 2.65E+03 | 2.84E+03 | 7.50E−01
Pakistan | (4,2,1) | 2.26E+03 | 2.28E+03 | 4.25E+05 | 4.25E+02 | 6.52E+02 | 1.63E−01
Peru | (0,2,1) | 2.14E+03 | 2.14E+03 | 9.02E+06 | 2.45E+03 | 3.00E+03 | 6.77E−01
Russia | (3,2,3) | 2.36E+03 | 2.38E+03 | 8.66E+06 | 1.88E+03 | 2.94E+03 | 2.37E−01
Saudi Arabia | (3,2,1) | 1.81E+03 | 1.83E+03 | 8.10E+05 | 6.42E+02 | 9.00E+02 | 2.57E−01
Spain | (3,2,4) | 2.75E+03 | 2.77E+03 | 3.06E+08 | 3.97E+03 | 5.53E+03 | 1.55E+00
UK | (7,2,1) | 2.26E+03 | 2.29E+03 | 8.71E+06 | 2.33E+03 | 2.95E+03 | 9.00E−02
USA | (7,2,1) | 3.03E+03 | 3.06E+03 | 2.04E+08 | 1.17E+04 | 1.43E+04 | 9.91E−02
Table 3

Selected ARIMA models for forecasting cumulative recovered cases.

Country | ARIMA (p,d,q) | AIC | BIC | MSE | MAE | RMSE | MAPE
South Africa | (3,2,1) | 1.77E+03 | 1.78E+03 | 2.15E+08 | 1.10E+04 | 1.47E+04 | 5.03E+00
Bangladesh | (6,2,1) | 2.08E+03 | 2.11E+03 | 2.33E+07 | 4.00E+03 | 4.83E+03 | 3.47E+00
Brazil | (0,2,1) | 2.58E+03 | 2.58E+03 | 9.00E+08 | 2.55E+04 | 3.00E+04 | 1.64E+00
Chile | (0,2,1) | 2.25E+03 | 2.16E+03 | 1.90E+08 | 1.18E+04 | 1.38E+04 | 3.81E+00
Colombia | (6,2,1) | 1.76E+03 | 1.79E+03 | 2.20E+08 | 1.16E+04 | 1.48E+04 | 1.14E+01
India | (10,2,2) | 2.49E+03 | 2.53E+03 | 3.25E+08 | 1.01E+04 | 1.80E+04 | 1.21E+00
Iran | (1,2,6) | 2.08E+03 | 2.10E+03 | 2.88E+07 | 4.48E+03 | 5.37E+03 | 1.84E+00
Italy | (5,2,5) | 2.22E+03 | 2.26E+03 | 2.27E+06 | 1.22E+03 | 1.51E+03 | 6.00E−01
Mexico | (1,2,1) | 2.06E+03 | 2.07E+03 | 1.66E+08 | 9.99E+03 | 1.29E+04 | 3.65E+00
Pakistan | (6,2,2) | 1.25E+03 | 2.19E+03 | 2.95E+07 | 3.97E+03 | 5.43E+03 | 1.92E+00
Peru | (3,2,3) | 1.98E+03 | 2.00E+03 | 9.42E+06 | 2.50E+03 | 3.07E+03 | 9.98E−01
Russia | (6,2,2) | 2.51E+03 | 2.54E+03 | 1.95E+08 | 1.23E+04 | 1.40E+04 | 2.19E+00
Saudi Arabia | (5,2,2) | 1.96E+03 | 1.98E+03 | 1.25E+08 | 1.05E+04 | 1.12E+04 | 5.18E+00
Spain | (2,2,2) | 2.35E+03 | 2.37E+03 | 3.00E−04 | 1.50E−02 | 1.70E−02 | 8.62E+00
UK | (2,1,2) | 1.49E+03 | 1.51E+03 | 3.85E+03 | 5.50E+01 | 6.20E+01 | 4.06E+00
USA | (2,2,1) | 3.18E+03 | 3.20E+03 | 2.54E+08 | 1.16E+04 | 1.59E+04 | 9.74E−01
Table 4

Selected SARIMA models for forecasting cumulative recovered cases.

Country | SARIMA (p,d,q)(P,D,Q,m) | AIC | BIC | MSE | MAE | RMSE | MAPE
South Africa | (4,2,2)(3,2,2,7) | 1.49E+03 | 1.52E+03 | 2.20E+08 | 1.17E+04 | 1.48E+04 | 5.46E+00
Bangladesh | (0,2,1)(1,0,0,7) | 1.96E+03 | 1.96E+03 | 1.15E+07 | 2.92E+03 | 3.40E+03 | 2.57E+00
Brazil | (2,2,1)(1,1,1,12) | 2.26E+03 | 2.28E+03 | 5.46E+07 | 2.13E+03 | 2.34E+03 | 3.12E+00
Chile | (1,2,1)(1,0,1,7) | 2.28E+03 | 2.28E+03 | 8.23E+07 | 7.80E+03 | 9.07E+03 | 2.53E+00
Colombia | (5,2,2)(4,2,2,7) | 1.12E+03 | 1.15E+03 | 5.71E+04 | 1.16E+02 | 2.38E+02 | 1.30E−01
India | (7,2,6)(3,2,6,3) | 1.98E+03 | 2.05E+03 | 1.46E+08 | 7.15E+03 | 1.21E+04 | 8.70E−01
Iran | (4,2,4)(3,1,2,3) | 1.86E+03 | 1.87E+03 | 2.16E+07 | 3.80E+03 | 4.65E+03 | 9.50E−02
Italy | (4,2,2)(1,1,1,7) | 1.96E+03 | 1.98E+03 | 3.78E+06 | 1.79E+03 | 1.94E+03 | 9.23E−01
Mexico | (1,2,1)(1,0,1,7) | 1.93E+03 | 1.93E+03 | 1.19E+08 | 8.82E+03 | 1.09E+04 | 3.24E+00
Pakistan | (4,2,2)(2,1,2,7) | 1.97E+03 | 1.98E+03 | 9.44E+04 | 1.58E+02 | 3.07E+02 | 7.80E−02
Peru | (2,2,2)(2,2,1,7) | 1.53E+03 | 1.55E+03 | 5.82E+06 | 1.99E+03 | 2.41E+03 | 7.70E−01
Russia | (5,2,0)(1,0,1,7) | 2.51E+03 | 2.54E+03 | 1.95E+08 | 1.23E+04 | 1.40E+04 | 9.70E+01
Saudi Arabia | (5,2,2)(4,0,2,7) | 1.47E+03 | 1.51E+03 | 1.32E+09 | 1.08E+04 | 1.15E+04 | 8.00E−02
Spain | (1,1,3)(2,0,1,7) | 2.35E+03 | 2.37E+03 | 5.00E−04 | 2.00E−02 | 2.00E−03 | 9.88E−01
UK | (4,1,2)(2,0,1,3) | 1.41E+03 | 1.44E+03 | 2.10E+02 | 1.25E+01 | 1.45E+01 | 8.00E−01
USA | (2,2,2)(1,0,1,7) | 2.99E+03 | 3.01E+03 | 6.03E+08 | 1.72E+04 | 2.46E+04 | 1.41E+00
The time-series data of each of the top-16 countries were split into 80% training and 20% testing/validation datasets. Model development and parameter selection were done using the training dataset, and the performance of the developed model was tested on the validation dataset. The ACF and PACF plots of the residuals were used to further assess the model’s goodness of fit. If the ACF and PACF plots of the residuals displayed correlation coefficients significantly different from zero at higher lags, we developed higher-order ARIMA or SARIMA models; otherwise, the simple models suggested by auto-ARIMA were used. The models were evaluated using the metrics MAE, MSE, RMSE and MAPE (Fig. 2). The actual and predicted values were plotted against each other to visualize the error. Once the best model was identified on the training dataset, it was used to predict the values of the test data and then to forecast the next 60 days of cumulative COVID-19 cases for the top-16 countries. In the first step, we checked the stationarity of the raw data and then of the differenced data for all the countries using ACF and PACF plots, as described in the case study. As mentioned before, both seasonal and non-seasonal ARIMA models were developed for the COVID-19 cases of all the top-16 countries. The SARIMA models capture both trend and seasonality using non-seasonal differencing (d) and seasonal differencing (D), respectively. For COVID-19 cases, we considered seasonal patterns with periods between 3 and 12 days. Various factors control and contribute to the seasonal pattern of pandemics such as influenza and COVID-19, including differences in social distancing between weekdays and weekends [34] and climatic conditions [35]. 
For example, the confirmed cases of the USA show a seasonal pattern that oscillates every 7 days, as discussed in Section 4. For the confirmed cases of Peru, a simple ARIMA(0,2,1) model was selected as the best model, with the lowest AIC and BIC values of 2,139.9 and 2,143.3, as shown in Table 1. The selected ARIMA(0,2,1) model was used to forecast the cumulative confirmed cases in Peru because the ACF and PACF plots of the residuals did not show any correlation coefficients significantly different from zero up to at least 10 lags, as shown in Figure S1B (supplementary document). For a model such as ARIMA(0,2,1), second-order differencing (I or d = 2) implies that both the forecast and the trend of the time-series adapt over time; the trend is equal to the exponentially smoothed value of the previous slopes (changes in the process). Similarly, ARIMA(0,2,0) was the best model for forecasting the cumulative confirmed cases of Iran. In the ARIMA(0,2,0) forecast process, a new observation is predicted from the most recent value, and the trend is the most recent change in the process. The predicted observation and trend determine the value of the next period in the forecast [36]. Moreover, higher-order models such as ARIMA(6,2,2) were developed to fit the data of countries such as Italy, as shown in Table 1. ARIMA(6,2,2) for Italy means that the response variable (y) is a combination of a 6th-order (p) auto-regression model and a 2nd-order (q) moving average model, with the d value of 2 representing the integrative part of the model. Similarly, SARIMA models were developed based on ACF plots of the differenced data, as described in the case study (Fig. 3). For example, SARIMA(2,2,2)(2,2,1,7) for Peru, presented in Table 4, has both second-order seasonal differencing (D) and second-order ordinary differencing (d), as indicated by the 2’s in the second place of each part of the model. 
It also has a 2nd-order auto-regressive model and a 2nd-order moving average model, along with a 2nd-order seasonal auto-regressive model and 1 seasonally lagged error term. In this study, we report the final optimized ARIMA and SARIMA models used for forecasting COVID-19 cases of the top-16 countries. The complete details of these models are presented in Tables 1–6, which list the information criteria (AIC, BIC) and the evaluation metrics (MSE, MAE, RMSE and MAPE). Table 1 and Table 2 give the details of the ARIMA and SARIMA models used to forecast cumulative confirmed cases of the top-16 countries. The models used to forecast cumulative recovered cases are described in Table 3 (ARIMA models) and Table 4 (SARIMA models). Similarly, Table 5 and Table 6 provide details of the selected ARIMA and SARIMA models, respectively, for forecasting cumulative death cases for 60 days. These selected ARIMA and SARIMA models were used to forecast the next 60 days from the most recent reported date of the COVID-19 cases. The following section presents a case study based on the USA data, and the 60-day forecast of COVID-19 cases for the top-16 countries.
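The selection rule described above (prefer the candidate with the lowest AIC and BIC) can be illustrated with a minimal sketch. It assumes the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations and L the maximized likelihood. The candidate names, log-likelihoods, parameter counts and sample size below are hypothetical placeholders, not values from this study, and statistical packages may count k slightly differently.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted candidates (log-likelihood and parameter count are made up).
candidates = {
    "ARIMA(0,2,0)": {"loglik": -1555.0, "k": 1},
    "ARIMA(7,2,1)": {"loglik": -1503.0, "k": 9},
}
n_obs = 150  # hypothetical length of the training series

# Score every candidate with both criteria.
scores = {
    name: (aic(c["loglik"], c["k"]), bic(c["loglik"], c["k"], n_obs))
    for name, c in candidates.items()
}

# Pick the candidate with the smallest value of each criterion.
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name}: AIC={a:.1f}, BIC={b:.1f}")
print("Lowest AIC:", best_by_aic, "| lowest BIC:", best_by_bic)
```

BIC penalizes each extra parameter by ln n rather than 2, so for series longer than about 8 observations it favors simpler models more strongly than AIC does.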
Fig. 3

The Auto-Correlation Function (ACF) and Partial Auto-Correlation Function (PACF) plots of the cumulative confirmed cases of the USA as a function of the lag. The shaded region represents the 95% confidence interval (CI). (A) ACF plot of the actual data, (B) PACF plot of the actual data, (C) ACF plot of the data after first differencing, (D) PACF plot of the data after first differencing, (E) ACF plot of the data after second differencing, (F) PACF plot of the data after second differencing.

Table 2

Selected SARIMA models for forecasting cumulative confirmed cases.

| Country | SARIMA (p,d,q)(P,D,Q,m) | AIC | BIC | MSE | MAE | RMSE | MAPE |
|---|---|---|---|---|---|---|---|
| South Africa | (2,1,2)(1,1,1,7) | 1.87E+03 | 1.89E+03 | 2.12E+05 | 3.29E+02 | 4.60E+02 | 9.00E−02 |
| Bangladesh | (5,2,2)(2,1,2,7) | 1.39E+03 | 1.43E+03 | 6.63E+04 | 2.13E+02 | 2.57E+02 | 1.84E−01 |
| Brazil | (3,2,2)(2,0,1,7) | 2.30E+03 | 2.32E+03 | 2.69E+09 | 4.45E+04 | 5.18E+04 | 2.04E+00 |
| Chile | (0,2,1)(1,0,1,7) | 2.04E+03 | 2.05E+03 | 3.59E+05 | 4.97E+02 | 6.00E+02 | 1.58E−01 |
| Colombia | (4,2,1)(2,1,1,7) | 1.40E+03 | 1.43E+03 | 1.42E+08 | 9.71E+03 | 1.19E+04 | 4.83E+00 |
| India | (2,2,1)(1,1,1,7) | 2.21E+03 | 2.23E+03 | 2.97E+09 | 3.87E+04 | 5.45E+04 | 3.10E+00 |
| Iran | (3,2,0)(2,0,1,7) | 1.69E+03 | 1.71E+03 | 1.33E+07 | 3.06E+03 | 3.64E+03 | 1.09E+00 |
| Italy | (1,2,2)(2,0,1,7) | 1.99E+03 | 2.02E+03 | 1.76E+06 | 1.13E+03 | 1.33E+03 | 4.55E−01 |
| Mexico | (1,2,1)(1,0,2,7) | 1.73E+03 | 1.75E+03 | 4.35E+07 | 5.78E+03 | 6.59E+03 | 1.61E+00 |
| Pakistan | (2,2,2)(0,0,1,3) | 2.50E+03 | 2.52E+03 | 1.35E+08 | 9.53E+03 | 1.16E+04 | 3.56E+00 |
| Peru | (0,2,1)(1,0,0,12) | 1.95E+03 | 1.95E+03 | 2.73E+07 | 4.37E+03 | 5.22E+03 | 1.21E+00 |
| Russia | (4,2,4)(4,1,4,3) | 2.09E+03 | 2.14E+03 | 5.73E+06 | 1.54E+03 | 2.39E+03 | 2.13E−01 |
| Saudi Arabia | (0,2,0)(1,0,0,3) | 1.77E+03 | 1.77E+03 | 1.38E+07 | 3.01E+03 | 3.72E+03 | 1.17E+00 |
| Spain | (3,1,1)(2,1,1,3) | 2.55E+03 | 2.57E+03 | 6.59E+06 | 1.80E+03 | 2.57E+03 | 6.71E−01 |
| UK | (1,2,1)(1,0,2,7) | 2.32E+03 | 2.33E+03 | 9.05E+05 | 8.11E+02 | 9.51E+02 | 2.73E−01 |
| USA | (3,2,4)(2,0,4,7) | 2.46E+03 | 2.50E+03 | 3.62E+08 | 1.60E+04 | 1.90E+04 | 4.00E+00 |

Table 5

Selected ARIMA models for forecasting cumulative death cases.

| Country | ARIMA (p,d,q) | AIC | BIC | MSE | MAE | RMSE | MAPE |
|---|---|---|---|---|---|---|---|
| South Africa | (0,2,1) | 9.27E+02 | 9.36E+02 | 1.30E+03 | 2.65E+01 | 3.60E+01 | 4.62E−01 |
| Bangladesh | (2,2,2) | 7.18E+02 | 7.30E+02 | 1.65E+03 | 3.70E+01 | 4.00E+01 | 1.37E+00 |
| Brazil | (6,2,3) | 1.45E+03 | 1.48E+03 | 2.04E+05 | 3.55E+02 | 4.52E+02 | 4.26E−01 |
| Chile | (0,2,1) | 1.27E+03 | 1.28E+03 | 2.04E+03 | 3.60E+01 | 4.52E+01 | 4.20E−01 |
| Colombia | (1,2,1) | 1.22E+03 | 1.23E+03 | 4.01E+02 | 1.40E+01 | 2.90E+01 | 2.10E−01 |
| India | (0,2,1) | 1.75E+03 | 1.75E+03 | 1.47E+04 | 8.40E+01 | 1.21E+02 | 2.95E−01 |
| Iran | (2,2,1) | 1.16E+03 | 1.17E+03 | 4.64E+04 | 1.80E+02 | 2.16E+02 | 1.22E+00 |
| Italy | (3,1,6) | 1.55E+03 | 1.55E+03 | 3.61E+03 | 4.91E+01 | 6.01E+01 | 1.40E−01 |
| Mexico | (3,2,2) | 1.44E+03 | 1.46E+03 | 2.43E+04 | 1.23E+02 | 1.56E+02 | 3.03E−01 |
| Pakistan | (3,2,3) | 1.08E+03 | 1.10E+03 | 2.84E+03 | 4.27E+01 | 5.33E+01 | 7.77E−01 |
| Peru | (0,2,1) | 1.81E+03 | 1.82E+03 | 5.95E+03 | 1.08E+03 | 1.87E+03 | 6.16E+00 |
| Russia | (5,2,1) | 1.07E+03 | 1.09E+03 | 5.29E+03 | 5.27E+01 | 7.27E+01 | 4.68E−01 |
| Saudi Arabia | (1,2,0) | 6.89E+02 | 6.97E+02 | 2.94E+02 | 1.38E+01 | 1.72E+01 | 5.46E−01 |
| Spain | (0,2,1) | 1.80E+03 | 1.81E+03 | 1.27E+01 | 3.18E+00 | 3.56E+00 | 1.00E−02 |
| UK | (6,2,2) | 1.57E+03 | 1.60E+03 | 2.39E+05 | 4.14E+02 | 4.88E+02 | 6.17E−01 |
| USA | (4,2,4) | 1.88E+03 | 1.91E+03 | 1.33E+05 | 3.16E+02 | 3.65E+02 | 2.80E−01 |

Table 6

Selected SARIMA models for forecasting cumulative death cases.

| Country | SARIMA (p,d,q)(P,D,Q,m) | AIC | BIC | MSE | MAE | RMSE | MAPE |
|---|---|---|---|---|---|---|---|
| South Africa | (2,2,2)(1,0,1,7) | 8.56E+02 | 8.74E+02 | 1.94E+05 | 3.14E+02 | 4.41E+02 | 5.10E+00 |
| Bangladesh | (0,2,1)(0,0,1,7) | 6.69E+02 | 6.77E+02 | 1.28E+02 | 9.20E+00 | 1.13E+01 | 3.50E−01 |
| Brazil | (5,2,2)(3,2,3,7) | 9.96E+02 | 1.03E+03 | 2.33E+04 | 1.36E+02 | 1.53E+02 | 2.54E−01 |
| Chile | (1,2,1)(2,0,1,7) | 1.35E+03 | 1.37E+03 | 6.25E+05 | 6.99E+02 | 7.90E+02 | 8.05E+00 |
| Colombia | (0,2,1)(2,0,2,3) | 1.12E+03 | 1.13E+03 | 7.83E+04 | 2.19E+02 | 2.80E+02 | 2.90E+00 |
| India | (1,2,1)(1,0,1,12) | 1.43E+03 | 1.44E+03 | 2.24E+06 | 1.20E+03 | 1.50E+03 | 4.02E+00 |
| Iran | (1,2,2)(1,0,1,12) | 1.17E+03 | 1.18E+03 | 1.04E+05 | 2.60E+02 | 3.21E+02 | 1.74E+00 |
| Italy | (6,2,2)(1,0,1,7) | 1.41E+03 | 1.44E+03 | 3.12E+04 | 1.47E+02 | 1.77E+02 | 4.20E−01 |
| Mexico | (0,2,1)(1,0,1,7) | 1.34E+03 | 1.35E+03 | 1.36E+05 | 3.06E+02 | 3.68E+02 | 7.50E−01 |
| Pakistan | (2,2,3)(2,0,2,7) | 1.03E+03 | 1.05E+03 | 3.65E+04 | 1.58E+03 | 1.91E+02 | 3.10E+00 |
| Peru | (1,2,1)(2,0,2,12) | 9.01E+02 | 9.18E+02 | 3.71E+06 | 1.09E+03 | 1.93E+03 | 6.10E+00 |
| Russia | (0,2,2)(1,0,2,7) | 9.09E+02 | 9.24E+02 | 2.81E+04 | 1.50E+02 | 1.67E+02 | 1.34E+00 |
| Saudi Arabia | (1,2,0)(1,0,0,3) | 6.67E+02 | 6.75E+02 | 6.47E+03 | 7.20E+01 | 8.00E+01 | 2.80E+00 |
| Spain | (1,2,2)(0,0,1,3) | 1.90E+03 | 1.92E+03 | 3.10E+00 | 1.48E+00 | 1.77E+00 | 5.00E−03 |
| UK | (6,2,4)(2,0,2,3) | 1.42E+03 | 1.47E+03 | 1.63E+05 | 3.61E+02 | 4.03E+02 | 8.00E−01 |
| USA | (5,2,1)(1,0,1,7) | 1.89E+03 | 1.92E+03 | 1.96E+05 | 4.07E+02 | 4.42E+02 | 2.00E−01 |


Forecasting cumulative confirmed cases of USA: A case study

This section describes the detailed forecasting procedure for the USA's confirmed cases. Fig. 3 displays the ACF and PACF plots of the actual time-series data and of the first-order and second-order differenced cumulative confirmed cases of the USA. Fig. 3A and 3B are the ACF and PACF plots of the actual data, respectively; the auto-correlation coefficients decrease gradually as the number of lags increases (Fig. 3A). This suggests that the data are non-stationary, so differencing is needed to make the series stationary. The ACF plot of the time-series after first-order differencing (Fig. 3C) also shows gradually decreasing correlation coefficients, indicating that non-stationarity remains. The time-series was therefore differenced a second time to achieve stationarity. Fig. 3D, the PACF plot of the time-series after first differencing, displays a sharp cutoff after lag 1. Moreover, the ACF plot of the twice-differenced time-series (Fig. 3E) oscillates, indicating a seasonal series: sharp significant peaks (greater correlation) occur at lags of 7 days, because the data on 22nd January correlate with those on 29th January, and so on. This pattern strongly supports the existence of seasonality in the time-series. This could be because of a greater number of social distancing violations on weekends than on weekdays. Fig. 3F, the PACF plot of the second-order differenced data, displays a sharp cutoff after lag 0. Because second-order differencing made the USA data stationary, an integrated order (I) of 2 must be used in developing the model. Similarly, we applied second-order differencing to the recovered and death cases to stabilize the datasets whenever required. ARIMA(0,2,0) was the best ARIMA model, with the lowest AIC and BIC values of 3,113.8 and 3,120.1, respectively.
However, while assessing the goodness of fit, the auto-correlation plots of the residuals displayed coefficients at higher lags that are significantly different from zero. So, we developed a higher-order ARIMA(7,2,1) model, with AIC and BIC values of 3,025 and 3,056, respectively (Table 1), for predicting and forecasting the cumulative confirmed cases in the USA. For ARIMA(7,2,1), the auto-correlation plots of the residuals did not display lags that are significantly different from zero, as shown in Figures S1B and S2B. To further investigate the ARIMA(7,2,1) model, we constructed the Quantile–Quantile (Q–Q) plot and the probability density plot from the residuals (Fig. 4). The residual errors are approximately normally distributed, as shown in Figs. 4B and 4D; in the Q–Q plots, the residuals follow a linear relationship with the quantiles, except for a few blue dots at the ends, while all other dots lie close to the straight line. This bell-shaped distribution of residuals suggests that the data came from a normal distribution. The higher-order model ARIMA(7,2,1) was selected and used to predict the cumulative cases and forecast the near future. However, when we considered seasonality in the model, i.e. SARIMA(3,2,4)(2,1,4,7), the Q–Q plot displayed fewer outliers at the tails than that of ARIMA(7,2,1). The Kernel Density Estimate (KDE) plots of the residuals of ARIMA(7,2,1) and SARIMA(3,2,4)(2,1,4,7) have a Gaussian-like distribution, but a sharper one, suggesting an asymmetric exponential distribution, as shown in Fig. 4(B & D). Moreover, the KDE plot of SARIMA(3,2,4)(2,1,4,7) (Fig. 4D) shows that the distribution of residuals is closer to normal/Gaussian than that of ARIMA(7,2,1) (Fig. 4B). The diagnostic plots (Q–Q plot and KDE plot) of the residuals strongly support choosing SARIMA(3,2,4)(2,1,4,7) as the better model, with no significant auto-correlation in the errors, as shown in the ACF and PACF plots of the residuals in Figures S3B & S4B (supporting document).
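The differencing and ACF inspection described in this case study can be sketched in a few lines. The example below is only an illustration: it uses a synthetic cumulative series with a built-in 7-day cycle (not the USA data), a plain sample auto-correlation estimator, and the common approximate 95% bound of 1.96/sqrt(n). The helper names `difference` and `acf` are hypothetical.

```python
import math
import random

def difference(series, order=1):
    """Apply ordinary differencing `order` times (d in ARIMA notation)."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def acf(series, max_lag):
    """Sample auto-correlation coefficients for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

# Synthetic daily new cases: rising trend + weekly cycle + noise (illustrative only).
random.seed(0)
daily = [
    1000 + 20 * t + 300 * math.sin(2 * math.pi * t / 7) + random.gauss(0, 30)
    for t in range(140)
]

# Cumulative cases are the running sum of the daily counts.
cumulative = []
total = 0.0
for x in daily:
    total += x
    cumulative.append(total)

raw_acf = acf(cumulative, 14)               # decays slowly: non-stationary
second_diff = difference(cumulative, order=2)
diff_acf = acf(second_diff, 14)             # oscillates, with a peak at lag 7
bound = 1.96 / math.sqrt(len(second_diff))  # approximate 95% significance bound

print("ACF of raw series at lag 1:", round(raw_acf[1], 3))
print("ACF of twice-differenced series at lag 7:", round(diff_acf[7], 3))
print("Approximate 95% bound:", round(bound, 3))
```

On this synthetic series the raw ACF stays close to 1 at short lags, while the twice-differenced series shows a significant coefficient at lag 7, which is the kind of pattern that motivates a seasonal period of m = 7.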
Fig. 4

Diagnostics for the models used in the case study — prediction and forecast of cumulative confirmed cases of the USA. (A) Normal Q–Q plot of residuals of ARIMA(7,2,1), (B) KDE of residuals of ARIMA(7,2,1), (C) Normal Q–Q plot of residuals of SARIMA(3,2,4)(2,0,1,7), (D) KDE of residuals of SARIMA(3,2,4)(2,0,1,7).

Fig. 5 displays the comparison between the test data (20% of the actual data) and the predictions obtained by the ARIMA and SARIMA models, along with the LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) models developed in our recent work [37]. The models were evaluated by calculating the MAE, MSE, RMSE and MAPE using Eqs. (18), (19), (20) and (21), respectively. The calculated errors are reported in Table 1, Table 2.
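For readers who want to reproduce this evaluation step, the following is a minimal sketch of a chronological 80/20 split and the four error metrics. It assumes the usual textbook definitions (with MAPE expressed in percent), which may differ in scaling from Eqs. (18)-(21); the series and predictions are hypothetical toy values.

```python
import math

def train_test_split_series(series, train_fraction=0.8):
    """Chronological split: the first 80% is used for training, the rest for testing."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def evaluation_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and MAPE (percent) for two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE assumes no actual value is zero, which holds for cumulative case counts.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

# Hypothetical cumulative series and model predictions for the held-out part.
series = [100, 120, 150, 190, 240, 300, 370, 450, 540, 640]
train, test = train_test_split_series(series)
predicted = [530, 650]  # placeholder predictions for the two test points

metrics = evaluation_metrics(test, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The split is chronological rather than random because shuffling a time-series would leak future observations into the training set.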
Fig. 5

Comparison of ARIMA and SARIMA models’ predictions with test data of cumulative confirmed cases in the USA.

From Fig. 5, it is evident that both the ARIMA and SARIMA models predicted the test data reasonably well. Further, the SARIMA model outperformed the more complex deep learning models (LSTM and GRU), confirming that simple machine learning models are sufficient to accurately predict the test data. The predictions of SARIMA(3,2,4)(2,1,4,7) matched the test data better than the ARIMA(7,2,1) predictions. The prediction for the test data and the forecast for the next 60 days of cumulative confirmed cases of the USA were done using both the ARIMA(7,2,1) and SARIMA(3,2,4)(2,1,4,7) models. Fig. 6(B) and Fig. 7(B) show the 60-day forecasts with 95% CI using ARIMA(7,2,1) and SARIMA(3,2,4)(2,1,4,7), respectively. Both models' forecasts suggest that the USA's cumulative confirmed cases might continue to increase exponentially over the 60 days. Our best ARIMA and SARIMA forecast models for the USA project that the number of cumulative confirmed cases might reach 7.5 million by the end of September. According to ARIMA(7,2,1), the cumulative confirmed cases will increase to 6,478,221 on September 1st, 2020, whereas SARIMA(3,2,4)(2,1,4,7) indicates 2,677 fewer cases than ARIMA(7,2,1) on that date. Before forecasting 60 days into the future, a similarly robust analysis was done for all three categories of cumulative COVID-19 cases (confirmed, recovered, deaths) to propose an optimized model for each of the top-16 countries.
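To make the idea of a 60-day forecast with a 95% CI concrete, the sketch below uses the simplest twice-differenced model, ARIMA(0,2,0) without a constant, rather than the ARIMA(7,2,1) or SARIMA models selected for the USA. Under that assumption the h-step point forecast extends the last observed slope, and the forecast error variance is sigma^2 * h(h+1)(2h+1)/6, where sigma^2 is estimated here from the squared second differences. The function name and the input series are hypothetical.

```python
import math

def arima_020_forecast(series, horizon, z=1.96):
    """Point forecasts and approximate 95% intervals for ARIMA(0,2,0), no constant.

    Returns a list of (point, lower, upper) tuples for steps 1..horizon.
    """
    if len(series) < 3:
        raise ValueError("need at least three observations")
    # Second differences are the model's innovations; their mean square estimates sigma^2.
    second_diffs = [
        series[t] - 2 * series[t - 1] + series[t - 2] for t in range(2, len(series))
    ]
    sigma2 = sum(d * d for d in second_diffs) / len(second_diffs)

    last = series[-1]
    slope = series[-1] - series[-2]  # most recent change in the process
    forecasts = []
    for h in range(1, horizon + 1):
        point = last + h * slope
        # Sum of squared psi-weights (1^2 + 2^2 + ... + h^2) = h(h+1)(2h+1)/6.
        variance = sigma2 * h * (h + 1) * (2 * h + 1) / 6
        half_width = z * math.sqrt(variance)
        forecasts.append((point, point - half_width, point + half_width))
    return forecasts

# Hypothetical cumulative counts (not data from this study).
history = [100, 210, 330, 455, 590, 730]
forecast = arima_020_forecast(history, horizon=60)

for h in (1, 30, 60):
    point, lower, upper = forecast[h - 1]
    print(f"day +{h}: {point:.0f} (95% CI {lower:.0f} to {upper:.0f})")
```

The interval width grows roughly as h^1.5, which is one reason the shaded regions in long-horizon forecasts of twice-differenced models become very broad.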
Fig. 6(B)

60-day ahead forecast of the cumulative confirmed cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best ARIMA model selected for each country.

Fig. 7(B)

60-day ahead forecast of the cumulative confirmed cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best SARIMA models selected for each country.


Results and discussion

The percentage distribution of cumulative COVID-19 cases (confirmed cases, recovered cases and deaths) of the top-16 countries is shown in Fig. 1. We selected the top-16 countries based on the number of cumulative confirmed cases; the top-16 countries include the USA, Brazil, India, Russia, Peru, Chile, Mexico, the UK, South Africa, Iran, Spain, Pakistan, Italy, Saudi Arabia, and Turkey as of 25th July. From Fig. 1A, it is evident that, out of the 16 countries, the USA had 26.4% of the global cumulative confirmed cases, followed by Brazil (15.1%), India (9.0%), Russia (5.1%) and South Africa (2.8%). This work accounts for 78.4% of the total confirmed cases (16.5 M), i.e. those reported in the top-16 countries, but not for those reported in the rest of the world. The country-based percentage distribution of the cumulative recovered cases is given in Fig. 1B. The order of countries from high to low recovered cases is as follows: USA (18.7%), Brazil (13.4%), India (9.5%), Russia (6.2%), South Africa (3.3%), Mexico (3.1%), Peru (2.8%), Chile (2.7%), UK (2.6%), Bangladesh (2.5%), Iran (2.3%), Pakistan (2.2%), Spain (2.1%), Saudi Arabia (2.0%), Colombia (1.6%) and Italy (1.3%). The recovered cases recorded by these countries amount to 76% of the global recovered cases (9.4 M). Similarly, the total deaths reported by the top-16 countries account for 78.7% of the global deaths (650,000). Countries such as the USA, Brazil, India, Russia and South Africa have reported a high number of deaths. The present work has accounted for approximately 80% of the global confirmed cases, recovered cases and deaths in developing a reliable statistical model. Hence, the results from these models can be used for predicting the COVID-19 trends in other countries, which are not considered in this work, as well as for forecasting the global COVID-19 cases.
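As a quick arithmetic check of the recovered-case figure quoted above, the per-country shares listed for Fig. 1B can be summed directly. The snippet uses only the percentages stated in the text; the dictionary name is an arbitrary choice for illustration.

```python
# Share of global cumulative recovered cases per country, in percent,
# as listed in the text for Fig. 1B.
recovered_share = {
    "USA": 18.7, "Brazil": 13.4, "India": 9.5, "Russia": 6.2,
    "South Africa": 3.3, "Mexico": 3.1, "Peru": 2.8, "Chile": 2.7,
    "UK": 2.6, "Bangladesh": 2.5, "Iran": 2.3, "Pakistan": 2.2,
    "Spain": 2.1, "Saudi Arabia": 2.0, "Colombia": 1.6, "Italy": 1.3,
}

# Add up the 16 shares and round to one decimal place to absorb float noise.
total_share = round(sum(recovered_share.values()), 1)
print(f"{len(recovered_share)} countries cover {total_share}% of global recovered cases")
```

The sum of the listed shares is 76.3%, consistent with the roughly 76% stated in the text.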

Cumulative confirmed cases

The 60-day forecasts of confirmed cases for the top-16 countries are shown in Figs. 6 & 7. The yellow line represents the reported data, the blue line represents the forecasted data and the shaded region is the 95% Confidence Interval (CI) of the forecasted data. Fig. 6 displays the 60-day forecast of the top-16 countries based on ARIMA models, whereas Fig. 7 shows the SARIMA-based forecasts of cumulative confirmed cases of the top-16 countries. From Fig. 6(A), it is evident that South Africa will have 1,100,000 cumulative confirmed cases by September 22nd, 2020. The forecast for Brazil has an exponential trend with a narrow 95% CI; the ARIMA model predicted that the cumulative confirmed cases will be 5,900,000 by the end of the 2nd week of September. Similarly, the 60-day forecast of Colombia reveals that the number of cumulative confirmed cases will reach 799,000 by the 3rd week of September; the upper and lower limits of the 95% CI of the forecast closely follow the exponential growth of confirmed cases, as seen in Fig. 6(A). A similar exponential trend in the number of confirmed cases is observed for the USA, South Africa, Colombia, Brazil, India, Mexico, and Bangladesh (Fig. 6(A), Fig. 6(B)). However, the forecasts of Saudi Arabia, Pakistan, Chile, Russia, Peru and Iran show a steep linear increase in the number of cumulative confirmed cases at a steady pace. The forecasts of Italy, the UK and Spain show a gradual, steady linear increase in the number of cumulative confirmed cases. The selected ARIMA models projected that the number of cases in South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will be 1,750,000, 5,800,000, 799,000, 401,000, 6,900,000, 425,000, 290,000, 820,000, 325,000, 505,000, 1,200,000, 401,000, 301,000, 330,000, 7,850,000 and 410,000, respectively, according to Fig. 6(A), Fig. 6(B).
Fig. 6(A)

60-day ahead forecast of the cumulative confirmed cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best ARIMA models selected for each country.

Fig. 7(A)

60-day ahead forecast of the cumulative confirmed cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best SARIMA models selected for each country.

When seasonality is considered through SARIMA models, the seasonal forecast of the confirmed cases captures the variance and seasonality in the time-series and projects them into the forecast. This is most evident in the forecast of Brazil, as shown in Fig. 7(A). The SARIMA(3,2,2)(2,1,1,7) model of Brazil captured the seasonality better than the non-seasonal forecast of Brazil shown in Fig. 6(A). The forecasted data reproduce the continuous seasonal patterns of the reported data. The number of cumulative confirmed cases predicted by SARIMA(3,2,2)(2,1,1,7) for Brazil is 2,000,000 greater than the ARIMA-predicted cumulative confirmed cases by the end of the 2nd week of September. For most of the countries, the SARIMA models' predicted number of cumulative confirmed cases is lower than the ARIMA models' prediction. Such countries include South Africa with 109,000, Colombia with 169,000, India with 200,000, Iran with 30,000, Mexico with 45,000, Russia with 100,000, Saudi Arabia with 20,000, UK with 35,000, USA with 350,000 and Bangladesh with 30,000 fewer cumulative confirmed cases than predicted by their respective ARIMA models. Further, countries including the USA, Peru, Pakistan, Iran, Italy and Chile have a broad 95% CI even after 3 weeks of forecast (Fig. 7(B)). The lower limit of the forecast's 95% CI for these countries indicates a decline in the number of confirmed cases, whereas the upper limit indicates a rapid exponential rise in the number of confirmed cases. Since most of the countries are reopening and loosening COVID-19 restrictions, a rapid rise in confirmed cases over the next 60 days, as suggested by the upper limit, is more likely than the significant declining trend suggested by the lower limit.
Due to the relaxation of preventive measures such as lockdowns and social distancing, and the reopening of restaurants and other local businesses, fast-rising infection rates may lead to an exponential growth of COVID-19 victims in these countries. The effect of reopening the economy is clearly visible in various countries. For example, the cumulative cases in India were low at the beginning of the pandemic, owing to the implementation of the lockdown (April–May 2020). However, the cases rose as soon as the lockdown was lifted (June to August 2020). In the USA, by contrast, the reaction to COVID-19 was relatively slow, leading to a continuous rise in COVID-19 cases. Similarly, Iran saw a significant drop in new cases after implementing stringent lockdown policies; Iran reopened in April 2020, after which the number of COVID-19 cases in Iran skyrocketed again in May 2020 [18]. The forecast for cumulative cases in Iran suggests that the cases might reach 450,000 by the second week of September. Similarly, the forecasts of confirmed cases in India suggest that the cumulative confirmed cases will reach 6,500,000 according to ARIMA(5,2,2), but when we considered seasonality, SARIMA(2,2,1)(1,1,1,7) projected 7,000,000 cumulative confirmed cases, which is 500,000 greater than the ARIMA(5,2,2) model's prediction, by the 3rd week of September 2020 (Figs. 6(A) & 7(A)). As with the ARIMA models, three different trends are observed in the forecasted profiles: exponential rise (USA, South Africa, Colombia, Brazil, India, Mexico, and Bangladesh), steep linear increase (Saudi Arabia, Pakistan, Chile, Russia, Peru, Iran) and gradual linear increase (Italy, UK and Spain). As shown in Fig. 7(A), Fig. 7(B), the selected SARIMA models projected that the number of cases in South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will be 950,000, 8,500,000, 625,000, 405,000, 6,250,000, 410,000, 220,000, 800,000, 356,000, 440,000, 1,500,000, 395,000, 300,000, 375,000, 7,600,000 and 400,000, respectively. To avoid a surge in the number of new confirmed cases, local businesses, schools etc. should follow the guidelines for organizing events and gatherings published by the Centers for Disease Control and Prevention (CDC) [38].

Cumulative recovered cases

The ARIMA-based forecasted trends of cumulative recovered cases for all the top-16 countries are given in Fig. 8(A), Fig. 8(B). It is clear from Fig. 8 that the recovered cases show three different trends: exponential rise, steep linear increase and gradual linear increase. According to Fig. 8, the selected ARIMA models projected that the number of recovered cases in South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will reach 1,600,000, 4,500,000, 425,000, 475,000, 4,000,000, 410,000, 210,000, 650,000, 610,000, 585,000, 1,150,000, 410,000, 150,000, 1,850, 2,510,000 and 325,000, respectively, by the end of the second week of September. Fig. 9(A), Fig. 9(B) report the forecasted recovered cases of the 16 countries based on the SARIMA models; it is evident that the recovered cases in South Africa, Brazil, Colombia, India, Mexico, Pakistan, the USA, and Bangladesh are increasing exponentially. Recovered cases in countries like Peru, Russia, Iran, and Saudi Arabia are increasing at a higher linear rate than those in Chile, Italy, Spain, and the UK. This observation is very similar to that made with ARIMA (Fig. 8). The predicted number of cumulative recovered cases according to the SARIMA models (Table 4) is lower than the prediction of the ARIMA models of the respective countries. The number of recovered cases in South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will reach 2,560,000, 4,100,000, 790,000, 490,000, 6,100,000, 400,000, 210,000, 585,000, 605,000, 475,000, 1,000,000, 401,000, 150,000, 1,500, 2,500,000 and 250,000, respectively (Fig. 9).
Fig. 8(A)

60-day ahead forecast of the cumulative recovered cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best ARIMA models selected for each country.

Fig. 8(B)

60-day ahead forecast of the cumulative recovered cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best ARIMA models selected for each country.

Fig. 9(A)

60-day ahead forecast of the cumulative recovered cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best SARIMA models selected for each country.

Fig. 9(B)

60-day ahead forecast of the cumulative recovered cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best SARIMA models selected for each country.

Interestingly, in Spain the number of cumulative recovered cases remained constant. The constant number of recovered cases in the forecast reflects the constant recovered data reported by Spain over the most recent couple of months: between May and July, the number of cumulative recovered cases is constant. The forecast of Spain remained the same with ARIMA(2,2,2) and SARIMA(1,1,3)(2,0,1,7), but the AIC and BIC values decreased when we used the SARIMA model, as shown in Table 3, Table 4. From Fig. 8(A), it is evident that the forecasted number of recovered COVID-19 patients in India is 4,000,000. If we compare the number of recovered cases in India with the number of cumulative confirmed cases in India (Fig. 6(A)), the percentage of recovered COVID-19 patients will be more than 65% by the end of September, whereas in the USA the percentage of recovered COVID-19 patients will be 35% by the end of September. In the ARIMA(1,2,6) forecast of Iran (Fig. 8(A)), 410,000 COVID-19 patients will have recovered by the end of the second week of September, whereas the SARIMA(4,2,4)(3,1,2,3) model of Iran (Fig. 9(A)) predicted that the number of recovered COVID-19 patients will be 400,000. When SARIMA models were used, the number of predicted cumulative recovered cases was lower than that of the ARIMA models for Brazil, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, USA, and Bangladesh (Figs. 8 and 9). However, the cumulative recovered cases of South Africa, Colombia, Chile and India increased when SARIMA models were used for the 60-day forecast. The SARIMA(1,2,1)(1,0,1,7) model of Chile predicted that a total of 490,000 patients will have recovered from COVID-19, which is 10,000 greater than the ARIMA(0,2,1) (Table 3) prediction. In the case of the USA, the cumulative recovered cases were 900 fewer when SARIMA(2,2,2)(1,0,1,7) was used.

Cumulative death cases

As of today (09/08/2020), we are 33 weeks into the COVID-19 pandemic. According to the CDC's weekly summary released on 14th August, the current percentage of deaths attributed to COVID-19 is 8.1%, which is higher than the epidemic threshold. The percentage of deaths is expected to increase in the coming weeks as more death certificates are processed [39], which is also supported by our results. The forecasted trends of the confirmed cumulative deaths were very similar to the trends observed for the confirmed cases and recovered cases. Though the USA is leading the world with a high number of confirmed deaths, it is found that the USA has a steep increase in the number of deaths, along with Pakistan and Saudi Arabia. Countries such as Spain, the UK and Italy have shown a gradual linear increase in the number of deaths, whereas South Africa, Brazil, Colombia, Chile, India, Iran, Mexico, Peru, Russia, and Bangladesh have shown an exponential increase in the number of deaths (Figs. 10 and 11). The selected ARIMA models projected that the number of deaths for South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will be 23,000, 152,000, 23,000, 24,000, 80,000, 30,000, 38,000, 95,000, 7,900, 41,000, 24,500, 4,900, 29,000, 55,000, 210,000 and 6,500, respectively (Fig. 10(A), Fig. 10(B)). Similarly, the selected SARIMA models projected that the number of deaths for South Africa, Brazil, Colombia, Chile, India, Iran, Italy, Mexico, Pakistan, Peru, Russia, Saudi Arabia, Spain, UK, USA, and Bangladesh will be 58,000, 150,000, 20,000, 16,000, 76,000, 28,000, 37,000, 85,000, 7,600, 52,000, 21,000, 5,600, 30,000, 50,000, 225,000 and 5,650, respectively (Fig. 11(A), Fig. 11(B)).
Fig. 10(A)

60-day ahead forecast of the cumulative death cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best ARIMA models selected for each country.

Fig. 11(A)

60-day ahead forecast of the cumulative death cases in the top-16 countries (1–8) generated on July 26th of 2020 based on the best SARIMA models selected for each country.

Fig. 10(B)

60-day ahead forecast of the cumulative death cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best ARIMA models selected for each country.

Fig. 11(B)

60-day ahead forecast of the cumulative death cases in the top-16 countries (9–16) generated on July 26th of 2020 based on the best SARIMA models selected for each country.

For example, the cumulative death cases in South Africa might increase exponentially to 40,000 (Fig. 10(B)) in the second week of September according to the SARIMA(2,2,2)(1,0,1,7) model listed in Table 6. The SARIMA(5,2,2)(2,1,2,7) model for Bangladesh has lower AIC and BIC values of 669 and 677, as shown in Table 6, than the ARIMA(2,2,2) model of Bangladesh, which has AIC and BIC values of 718 and 730, as shown in Table 5. Similarly, other countries' models are described in Table 5, Table 6. Seasonality has a key role in determining the number of cumulative death cases, which is evident from Figs. 11(A) & 11(B). When the SARIMA models were used, the forecasts for most countries showed fewer cumulative death cases than the ARIMA models' forecasts. For instance, in the case of Brazil, Chile, India, Iran, Mexico, Pakistan, Russia, Saudi Arabia, Spain, and Bangladesh, the forecasted cases in the 3rd week of September are 10,000, 8,000, 6,000, 1,000, 8,000, 200, 3,900, 4,000 and 350 fewer than their respective ARIMA models' predictions. However, in the case of the USA, South Africa, Colombia, Italy and Peru, the SARIMA models predicted cumulative death cases that were 20,000, 33,100, 6,000, 10,000 and 10,000 greater than the predictions of their respective ARIMA models. The 95% CI of the forecasts of the UK and Spain remains broad for both ARIMA- and SARIMA-based forecasts, as shown in Figs. 10(B) & 11(B). Moreover, the lower limit of the 95% CI of the UK's forecast, as shown in Fig. 11(B), declined to near zero deaths. This scenario is a deviation from the current dynamics of the COVID-19 pandemic. However, the upper limit of the 95% CI of the UK is a more realistic projection, as shown in Fig. 10(B): the deaths might increase to 99,000 by the end of September. A similar trend was observed for Spain (Fig. 11(B)), with 45,000 deaths by the end of September.
In the case of the USA, the ARIMA(4,2,4) model suggests that the number of deaths will increase to 200,000 in the next few weeks; the upper limit of the 95% CI implies that the number of cumulative death cases might even cross 250,000, whereas the lower limit indicates that it might remain at 150,000. On further inspection of the SARIMA(5,2,1)(1,1,1,7) forecast for the USA (Fig. 11(B)), the upper limit of the 95% CI indicates that the cumulative death cases might increase to 310,000. In contrast, the lower limit of the 95% CI shows a sharp decline in the number of deaths to less than 100,000. This decline in the trend can be achieved by enforcing strict social and physical distancing measures and implementing lockdowns at the federal level. By implementing a lockdown at the country level, India was able to control the pandemic for a while [4], [40].
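To illustrate why the 95% CI of a 60-day forecast becomes so wide, the sketch below uses the simplest twice-differenced model, ARIMA(0,2,0), as a stand-in. It is not the ARIMA(4,2,4) or SARIMA model of the paper. For ARIMA(0,2,0) the h-step forecast-error variance has the closed form σ²·h(h+1)(2h+1)/6. The function name `arima_020_interval` and all input values are hypothetical and are not taken from the paper's data.

```python
import math


def arima_020_interval(last: float, prev: float, horizon: int,
                       sigma: float, z: float = 1.96):
    """Point forecast and approximate 95% interval for ARIMA(0,2,0).

    The point forecast extends the last observed daily increment linearly.
    The forecast-error standard deviation is
    sigma * sqrt(h (h + 1) (2h + 1) / 6), assuming Gaussian innovations.
    """
    point = last + horizon * (last - prev)
    std_err = sigma * math.sqrt(horizon * (horizon + 1) * (2 * horizon + 1) / 6)
    return point - z * std_err, point, point + z * std_err


# Hypothetical inputs: last two cumulative totals and innovation std. dev.
last, prev, sigma = 150_000.0, 149_000.0, 150.0

for h in (7, 30, 60):
    low, point, high = arima_020_interval(last, prev, h, sigma)
    print(f"h={h:2d}  lower={low:,.0f}  point={point:,.0f}  upper={high:,.0f}")
```

Under these assumptions the interval is symmetric around the point forecast and widens roughly as h^1.5, so at long horizons its lower limit can fall below the latest observed total even though a cumulative count cannot decrease.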

Conclusions

In this study, we have forecasted COVID-19 cases (confirmed, recovered and deaths) for 60 days, until 21st September 2020, using ARIMA and SARIMA statistical models. Our forecasts indicate that the COVID-19 trends in the top-16 countries can be classified into three classes: exponential, steep linear increase, and gradual linear increase. Possible reasons for this observation include population density, infection rate and lifestyle. Forecasts with an exponential rise have a very narrow shaded region of the 95% CI, whereas the width of the shaded region increases for both linear-increase classes. Countries such as the United States, Brazil, South Africa, Colombia, Bangladesh, India, Mexico and Pakistan have shown exponential growth in confirmed and recovered cases for the upcoming 60 days. In the case of deaths, countries such as Brazil, South Africa, Chile, Colombia, Bangladesh, India, Mexico, Iran, Peru, and Russia have shown an exponential increase in trends. For Spain, the UK and Italy, the projections are stable, with little increase in COVID-19 cases. The forecasted values for the 60th day from the ARIMA and SARIMA models are more or less the same, but the SARIMA models outperform the ARIMA models in capturing the seasonality or trends of the data. Most of the countries, including the USA and India, have a 7-day seasonal pattern, as selecting a seasonal period of 7 in the SARIMA model generated the lowest AIC and BIC values. When seasonality was considered, the SARIMA models predicted fewer COVID-19 cases than the ARIMA models. The SARIMA forecasts are more realistic because they take into account the variations that occurred in the past few weeks (June–July 2020) of the COVID-19 time series and project them into the future. Based on our predictions and forecasts, health care strategy administrators should make timely decisions on supplying equipment to hospitals and other healthcare aids to the public.
To keep the COVID-19 pandemic under control, all countries must be prepared with their health care workers and hospital facilities. These results shed light on the approaching surge in cases, thereby emphasizing the importance of social distancing and the implementation of preventive measures against COVID-19.
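The 7-day seasonal pattern noted above was selected in this study by the lowest AIC and BIC values. A complementary check, sketched below, is to look for the lag at which the sample autocorrelation of the daily counts peaks. The helper `autocorrelation` and the daily series are hypothetical: the data are synthetic, with a made-up weekend reporting dip, and are not the paper's data.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t - lag] - mean)
                for t in range(lag, n))
    return numer / denom


# Synthetic daily new-case counts: four identical weeks with a weekend dip.
week = [100, 120, 125, 130, 128, 80, 60]
daily = week * 4

# Autocorrelation at lags 1..14; a weekly pattern should peak at lag 7.
acf = {lag: autocorrelation(daily, lag) for lag in range(1, 15)}
best_lag = max(acf, key=acf.get)
print(best_lag)
```

On real data the peak is less clean than in this synthetic series, so a check of this kind would normally be read alongside the information criteria rather than instead of them.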

CRediT authorship contribution statement

K.E. ArunKumar: Conceptualization, Methodology, Software, Validation, Formal analysis, Writing - original draft, Visualization, Data curation, Resources. Dinesh V. Kalaga: Writing - review & editing, Supervision, Data curation, Project administration. Ch. Mohan Sai Kumar: Data curation and preprocessing. Govinda Chilkoor: Data curation and preprocessing. Masahiro Kawaji: Data curation, Project administration. Timothy M. Brenza: Data curation, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.