
Machine learning and automatic ARIMA/Prophet models-based forecasting of COVID-19: methodology, evaluation, and case study in SAARC countries.

Iqra Sardar¹, Muhammad Azeem Akbar², Víctor Leiva³, Ahmed Alsanad⁴, Pradeep Mishra⁵.

Abstract

Machine learning (ML) has become a prominent field of study for solving complex real-world problems. The whole globe has suffered and continues to suffer from coronavirus disease 2019 (COVID-19), and its progression needs to be forecasted. In this article, we propose and derive an autoregressive modeling framework based on ML and statistical methods to predict confirmed cases of COVID-19 in the South Asian Association for Regional Cooperation (SAARC) countries. Automatic forecasting models based on autoregressive integrated moving average (ARIMA) and Prophet time series structures, as well as extreme gradient boosting, generalized linear model elastic net (GLMNet), and random forest ML techniques, are introduced and applied to COVID-19 data from the SAARC countries. The forecasting models are compared by means of selection criteria, and the most suitable models are selected by using evaluation metrics. The results indicate that the ARIMA model is the most suitable for forecasting confirmed cases of COVID-19 in most of these countries. For the confirmed cases in Afghanistan, Bangladesh, India, Maldives, and Sri Lanka, the ARIMA model is superior to the other models. In Bhutan, the Prophet time series model is appropriate for predicting such cases. The GLMNet model is more accurate than the time-series models for Nepal and Pakistan. The random forest model is excluded from forecasting because of its poor fit.

© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Entities: Chemical

Keywords: Artificial intelligence; Facebook Prophet algorithm; GLM; R software; SARS-CoV-2; South Asian Association for Regional Cooperation countries; Time-series models

Year: 2022 PMID: 36217358 PMCID: PMC9533996 DOI: 10.1007/s00477-022-02307-x

Source DB: PubMed Journal: Stoch Environ Res Risk Assess ISSN: 1436-3240 Impact factor: 3.821

Introduction

Machine learning (ML) has solved many complex real-world problems over the last decade in fields as diverse as business, climatology, healthcare, and robotics (Kelleher et al. 2020; Akbar et al. 2022). Unlike conventional algorithms, which follow programmed commands based on decision statements, ML algorithms are trial-and-error-based methods (Makridakis et al. 2018; Sardar et al. 2021). Forecasting is among the most significant applications of ML (Bontempi et al. 2012), and multiple ML algorithms have been used to forecast diseases, stock markets, and weather (Mahdi et al. 2021; Bustos et al. 2022; Chaouch et al. 2022; Ma et al. 2022). Regression and neural network models have been employed to predict a patient's future condition for a specific disease (Harrell et al. 1985). Several studies have used ML techniques to predict diseases such as breast cancer (Asri et al. 2016), cardiovascular disease (Anderson et al. 1991), and coronary artery disease (Lapuerta et al. 1995). The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, was discovered in December 2019 in Wuhan, China, and the World Health Organization (WHO) declared COVID-19 a pandemic in March 2020 (WHO 2020a, b). Different variants of SARS-CoV-2 have been identified (Alkadya et al. 2022). COVID-19 has upended the lives of the world's population, as well as the economy (Chahuan-Jimenez et al. 2021) and finance (Liu et al. 2021), forcing a new way of life that will leave a lasting mark on society (Mahdi et al. 2021). The spread of COVID-19 was so rapid that almost all countries imposed either partial or complete lockdowns in affected areas to curb it (Ospina et al. 2022). Governments have imposed precautionary measures that direct the public to follow standard operating procedures and control the spread of SARS-CoV-2 (Jerez-Lillo et al. 2022).
Several studies based on ML tools have tried to predict COVID-19 confirmed cases (Petropoulos and Makridakis 2020) and COVID-19 outbreaks (Grasselli et al. 2020), as well as to improve clinical diagnosis (Bustos et al. 2022), all of which helps in decision-making. The South Asian Association for Regional Cooperation (SAARC) was established in 1985 and comprises Afghanistan, Bangladesh, Bhutan, India, the Maldives, Nepal, Pakistan, and Sri Lanka. COVID-19 has disturbed the economy of SAARC countries as well (De la Fuente-Mella et al. 2021). The confirmed cases across the SAARC countries total 16,870,105, with 39,053 deaths, as of 19 July 2020 (WHO 2020a, b; Zhao et al. 2022). India is the SAARC country most affected by COVID-19, with cases rising to 1.1 million, which is second only to the US, and with a recovery rate of 63.75%. In Latin America, the deaths have been high as well (Martin-Barreiro et al. 2021), particularly in Brazil (Ospina et al. 2022) and Chile (Jerez-Lillo et al. 2022). The COVID-19 infection rate in SAARC countries is approximately 11.78% of the world population. India has developed medical facilities to fight against it. However, despite lockdowns, a spike of three million active cases was reported in India. Comparatively, Bangladesh and Pakistan have each crossed two lakh (200,000) cases of COVID-19. Active cases in Pakistan are gradually decreasing and now stand at 53,652, with fewer testing facilities. The death rate in Bangladesh is 6.61%. No casualty was found to date due to COVID-19 in Afghanistan and Bhutan. The death rate is below 1% in the Maldives and Sri Lanka. Time-series models are widely studied and applied (Brockwell and Davis 1991; Leiva et al. 2021), especially in epidemic disease prediction (Ospina et al. 2022; Jerez-Lillo et al. 2022). Several researchers are studying different dimensions of the epidemic and seeking results that can help humanity.
In Fanelli and Piazza (2020), a temporal dynamic was considered for predicting COVID-19 in China, France, and Italy. In Chimmula and Zhang (2020), the COVID-19 outbreak in Canada was forecasted by employing a long short-term memory network. In Malavika et al. (2021), COVID-19 was predicted in India based on a SIR model (Fierro et al. 2015; Rangasamy et al. 2022). In Petropoulos and Makridakis (2020) and Alzahrani et al. (2020), exponential smoothing and autoregressive integrated moving average (ARIMA) models were used for forecasting COVID-19. Currently, ML is a helpful technique for forecasting, and researchers are employing it in different fields of science. In Rustam et al. (2020), ML models were compared with varying forecasting models based on COVID-19 data, finding that the ML model performance was better than that of other models. In Sujath et al. (2020), ML techniques were utilized for predicting COVID-19 data in India. The employment of ML tools for forecasting COVID-19 has been very helpful in controlling this epidemic. To the best of our knowledge, no studies based on ML techniques and automatic ARIMA/Prophet models for forecasting COVID-19 cases in SAARC countries have been conducted. The specific contributions that this investigation aims to make are to:

(1) Propose a methodology for ML-based forecasting of COVID-19 data. The proposed methodology is represented in Fig. 1 to state the problem setting.
(2) Derive an autoregressive modeling framework for forecasting COVID-19 data in a region composed of countries with different realities.
(3) Compare different forecasting models by means of selection criteria.
(4) Choose the best forecasting model for SAARC countries based on COVID-19 data by using evaluation metrics.
(5) Forecast the COVID-19 confirmed cases in SAARC countries.
(6) Analyze the impact of forecasted COVID-19 confirmed cases on these countries.

Fig. 1

Proposed methodology

Therefore, the objectives of this study are related to each of the contributions mentioned in (1)–(6). All the calculations and graphical plots were performed with the R software (R Development Core Team 2020). The forecast plots were produced with ggplot2 in R by using a tool for interactive forecast visualization. This tool is a wrapper for plot_time_series() that generates an interactive (plotly) or static (ggplot2) plot with the forecasted data. The time-series plots were constructed with an interactive plotting function for one or more time series. This is a workhorse time-series plotting function that generates interactive plotly plots, consolidates 20+ lines of ggplot2 code, and scales well to many time series. The remainder of this article is organized as follows. Section 2 discusses the methodology, including data source, automatic and ML models, as well as the evaluation metrics used in the present investigation. In Sect. 3, we report the results of our study and provide a discussion of these results. In Sect. 4, conclusions and future work suggestions are stated.

Methodology

In this section, the data source, methods, and evaluation metrics are provided.

Data and their sources

The COVID-19 data set employed in the present investigation was sourced from “Our World in Data” (Ritchie et al. 2020) using the website https://ourworldindata.org/coronavirus, which was accessed on 14 May 2022. In this study, we have considered only the variable “number of confirmed COVID-19 cases” for SAARC countries and its corresponding time-series data with its metadata related to date and location. The data were collected from 31 January 2020 to 19 July 2020 from the repository github.com/owid/covid-19-data/tree/master/public/data, where other COVID-19 variables can be obtained as well. Specifically, the data set used in our analysis was secured from github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv, which was accessed on 14 May 2022. The original data source (raw data) on confirmed COVID-19 cases and deaths for all countries corresponds to the COVID-19 data repository of the “Center for Systems Science and Engineering (CSSE)” at Johns Hopkins University (www.jhu.edu), with link given by github.com/CSSEGISandData/COVID-19. The full COVID-19 data set is a collection of the COVID-19 data maintained by Our World in Data (ourworldindata.org), which is updated daily and includes data on confirmed cases, deaths, hospitalizations, and testing.
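The selection step described above (confirmed cases only, SAARC members only, 31 January 2020 to 19 July 2020) can be sketched as follows. This is a minimal Python illustration, not the authors' R workflow: the column names `location`, `date`, and `total_cases` are assumed from the public owid-covid-data.csv layout, the helper name `saarc_confirmed_cases` is hypothetical, and the sample rows are invented purely to exercise the filter.

```python
import csv
import io
from datetime import date

# SAARC members as listed in the article.
SAARC = {"Afghanistan", "Bangladesh", "Bhutan", "India",
         "Maldives", "Nepal", "Pakistan", "Sri Lanka"}

# Study window stated in the article.
START, END = date(2020, 1, 31), date(2020, 7, 19)


def saarc_confirmed_cases(csv_text):
    """Return {country: [(date, total_cases), ...]} for SAARC rows in the window.

    Assumes columns named location, date (ISO format), and total_cases.
    Rows with an empty total_cases field are skipped.
    """
    series = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["date"])
        if row["location"] in SAARC and START <= day <= END and row["total_cases"]:
            series.setdefault(row["location"], []).append(
                (day, float(row["total_cases"]))
            )
    for points in series.values():
        points.sort()  # keep each series in chronological order
    return series


# Invented sample rows (not real data), used only to show the filtering.
SAMPLE = """location,date,total_cases
India,2020-07-18,100.0
India,2020-07-19,120.0
India,2020-07-20,150.0
Brazil,2020-07-19,999.0
Bhutan,2020-07-19,
"""

result = saarc_confirmed_cases(SAMPLE)
print(result)
```

In this sketch the 20 July row falls outside the window, the Brazil row is not a SAARC member, and the Bhutan row has no value, so only the two in-window India rows are kept.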

Time-series models

In this study, we use automatic forecasting models based on: (i) ARIMA and Prophet time-series frameworks; as well as (ii) extreme gradient boosting (XGBoost), generalized linear model elastic net (GLMNet), and random forest ML methods. We employ these models to analyze the COVID-19 data from the SAARC countries mentioned in Sect. 2.1. Next, we describe each of these methods.

Automatic ARIMA models: A statistical analysis model known as ARIMA employs time-series data to help researchers better understand a data collection or to forecast future trends. A statistical model is said to be autoregressive if it forecasts future values using data from the past. These models (Hyndman and Khandakar 2008) combine unit root tests and minimize the Akaike information criterion, AIC in short (Ventura et al. 2019). Note that the maximum likelihood method can be used to estimate the ARIMA parameters. Indeed, the R software estimates the ARIMA parameters with this method, so we use it. When estimating the ARIMA parameters, the maximum likelihood method is similar to the least squares method. Note that the maximum likelihood estimators are more efficient than other estimators. The ARIMA acronym is formed by the terms AR, I, and MA, where the notation ARIMA(p, d, q) is often utilized. In this notation, the acronym AR(p) is related to the autoregressive part of the model, and the parameter p specifies how many lagged series should be used to forecast future periods (Hyndman et al. 2006). The acronym I(d) is associated with the integrated part of the model, and the parameter d indicates how many differences are needed to make the series stationary, that is, I(d) is the differencing component or number of differences. Then, the acronym MA(q) is the moving average part of the model, where q is the number of lag elements in the prediction equation for the forecast error. An ARIMA(p, d, q) model is stated as

$$\phi(B)(1 - B)^{d}\, Y(t) = c + \theta(B)\,\varepsilon(t), \tag{1}$$

where $\phi(z)$ and $\theta(z)$ are polynomials of orders p and q, respectively, assuming that $\phi(z)$ and $\theta(z)$ have no roots for $|z| < 1$ to ensure causality and invertibility, whereas $c$ is a constant. In the expression defined in (1), B is a backshift operator; Y(t) is the response variable; and $\varepsilon(t)$ is a white noise, both at time t, with $\varepsilon(t) \sim \mathrm{N}(0, \sigma^{2})$, that is, the white noise follows a normal (Gaussian) distribution with zero mean and variance $\sigma^{2}$. Note that there is an implied polynomial of order d in the forecasting if $c \neq 0$ (Brockwell and Davis 1991). An approach for automatic ARIMA modeling is stated in Algorithm 1.

Prophet model: This is an open-source framework of Facebook to forecast time-series data, based on an additive formulation opened to the public in 2017 (Taylor and Letham 2018; Prophet 2020). Prophet nonlinear trends are fitted with daily, weekly, and yearly seasonality, including the effects of holidays. The Prophet model not only forecasts the future, but also fills missing values and detects irregularities. Although the trend forecast appears reasonable, the uncertainty intervals appear to be too large. The Prophet model can deal with historical outliers, but only by matching them to trend changes. Then, the model predicts similar magnitude trend changes in the future. Outliers are best dealt with by removing them, and the Prophet model has no issues with missing data. If we leave the dates in the future, but we set their values to N/A in the history, the Prophet model makes a forecast for their values. In the case of the Prophet model, we also used the maximum likelihood method as well as other methods to estimate its parameters. In this investigation, we have not considered the non-periodic changes of the time series. The Prophet model can handle the time series and automatically bifurcate the future trend. A Prophet model is formulated as

$$Y(t) = g(t) + s(t) + h(t) + \varepsilon(t), \tag{2}$$

where Y(t) stated in (2) is the response variable to be predicted at time t; g(t) is a trend function employed to analyze time-series non-periodic changes; s(t) is a periodic or seasonality term reflecting the change of a week or a year; h(t) is the influence of occasional days or holidays; and $\varepsilon(t)$ is an error term which is assumed to be normally distributed such as in (1). Here, we consider the non-periodic changes of time series. As mentioned, the Prophet model handles the outliers of time series as well as the missing values. This model can automatically forecast the future trend, with this trend g(t) stated in (2) being described by saturating growth and piecewise linear models. The growth model is established by a logistic regression formulated as

$$g(t) = \frac{K}{1 + e^{-c(t - u)}}, \tag{3}$$

where “e” is the base of the exponential function (Euler's number). In the model presented in (3), the elements K, c, and u, corresponding to the curve maximum value or carrying capacity (K), logistic growth rate or steepness of the curve (c), and sigmoid point or offset parameter (u), are not constant, because they depend on time t, so that the formulation stated in (3) becomes

$$g(t) = \frac{K(t)}{1 + e^{-c(t)\,(t - u(t))}}. \tag{4}$$

When one finds the change point of the time series, the trend changes at this point, and then the model expressed in (4) is now given by

$$g(t) = \frac{K(t)}{1 + e^{-\left(c + a(t)^{\top}\gamma\right)\left(t - \left(u + a(t)^{\top}\varphi\right)\right)}}, \tag{5}$$

where $a(t)$, $\gamma$, and $\varphi$ are vectors, with “$\top$” denoting the transpose of a matrix. For a piecewise linear function, the model defined in (5) is established as

$$g(t) = \left(c + a(t)^{\top}\gamma\right) t + \left(u + a(t)^{\top}\varphi\right), \tag{6}$$

where c is the growth rate, u denotes the offset parameter, γ has the rate adjustments, and φ is set to make the function continuous; for more details, see Taylor and Letham (2018).
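Two of the building blocks described above lend themselves to a short illustration: the differencing component I(d) of an ARIMA(p, d, q) model, and the saturating (logistic) growth trend used by the Prophet model in (3). The following Python sketch is only illustrative; the function names and the parameter values (K = 1000, c = 0.3, u = 10) are assumptions chosen for the example, not values from the study.

```python
import math


def difference(series, d=1):
    """Apply the differencing operator (1 - B)^d to a list of observations."""
    out = list(series)
    for _ in range(d):
        # One pass of (1 - B): y(t) - y(t - 1); the series shortens by one.
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def logistic_trend(t, K, c, u):
    """Saturating growth trend g(t) = K / (1 + exp(-c * (t - u)))."""
    return K / (1.0 + math.exp(-c * (t - u)))


# A quadratic series needs d = 2 differences to become constant (stationary in mean).
quadratic = [t * t for t in range(1, 7)]        # [1, 4, 9, 16, 25, 36]
twice_differenced = difference(quadratic, d=2)
print(twice_differenced)                        # [2, 2, 2, 2]

# At the offset t = u the logistic trend sits at half the carrying capacity K.
midpoint = logistic_trend(10, K=1000, c=0.3, u=10)
print(midpoint)                                 # 500.0
```

The second-order differencing shown here mirrors the d = 2 that appears in every ARIMA model selected in the study, which suggests the cumulative case counts needed two differences to reach stationarity.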

Machine learning models

ML algorithms are a branch of computer science that are trained from past data. The algorithm selects a suitable model to be calibrated according to data characteristics and forecasts future values. In this ML framework, the algorithm takes a data set with input instances and a regressor to train the model. The trained model generates a prediction for an unseen test data set (Talabis et al. 2015). The ML methods use regression and classification algorithms. Several ML models, such as logistic regression, neural networks, or support vector machines, have been utilized to analyze and forecast COVID-19 (Bustos et al. 2022). We have found that some ML models predict better when compared to others of this type. We apply ML models to our COVID-19 data set for predicting the effects of SARS-CoV-2 in SAARC countries and at the global level. Three ML methods are employed in the study of COVID-19 prediction. These methods correspond to GLMNet, random forest, and XGBoost.

GLMNet: The elastic net is a particular case of the shrinkage method, which holds both ridge and least absolute shrinkage and selection operator (LASSO) regressions. The GLMNet property has the ability to handle $p \gg n$ problems (Zou and Hastie 2005). The elastic net penalty term is given by $\lambda\left[(1 - \alpha)\|\beta\|_{2}^{2}/2 + \alpha\|\beta\|_{1}\right]$, in the objective function (Friedman et al. 2005) stated as

$$\min_{\beta_{0},\,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_{i} - \beta_{0} - x_{i}^{\top}\beta\right)^{2} + \lambda\left[\frac{(1 - \alpha)}{2}\|\beta\|_{2}^{2} + \alpha\|\beta\|_{1}\right],$$

where now $\alpha$ is a parameter that controls the shrinkage type affecting the estimation process; $\lambda$ is a penalty parameter that controls the shrinkage amount; $y_{i}$ is the response variable; $x_{i}$ are the values of the covariate vector $X$; $\beta_{0}$ is the regression intercept; $\beta$ is the vector of regression coefficients associated with the values of the covariate vector, with $\beta_{j}$ being the $j$-th element of the vector $\beta$; p is the number of covariates; and n is the size of the sample data set. Note that we are assuming a linear regression structure given by $\mathrm{E}(Y \mid X = x) = \beta_{0} + x^{\top}\beta$, where “E” denotes the expected value of Y conditional to $X = x$. The glmnet package of the R software (R Development Core Team 2020) offers various kinds of regression methods for both variable selection and prediction techniques when $0 \le \alpha \le 1$, depending on the data and problem under consideration. If $\alpha = 1$, the LASSO defaulting option of the glmnet package is obtained with a penalty parameter $\lambda$ and carries out both parameter shrinkage and variable selection. If $\alpha = 0$, it gives a ridge regression with a penalty parameter $\lambda$. We choose a value of $\alpha$ to optimize the elastic net. Efficiently, this shrinks certain coefficients and makes some equal to zero for sparse selection. The model structure needs variable selection to form a subset of predictors. The elastic net uses the $p \gg n$ problem approach, which implies that the number of parameters is much greater than the size of the sample utilized in the modeling. The elastic net is suitable when the variables form groups that contain highly correlated covariates. The variable selection is combined with the model structure to help in increasing the accuracy. When a group of covariates is highly correlated, the model automatically selects those with greater predictive power.

Random forest: To build a set of decision trees, two methods are combined: bootstrap aggregating and random subspace. Random forest comprises many decision tree classifiers, and the output groups are obtained from the classifications of the individual decision trees. For each single decision tree, two random selection processes are used in the algorithm. The first one is the random selection of training samples, and the second one is a random selection of sample features. Then, the decision trees are built. Classification results are finalized by the weighted voting method. Random forest is a joint classifier that builds independent and non-matching decision trees depending on randomization. It can be defined as $\{h(x, \Theta_{k})\}$, for $k = 1, 2, \ldots$, where now $\Theta_{k}$ is the mutual independent random vector parameter, and $x$ are data input values (Provost et al. 2016). Every decision tree uses a random vector of parameters, features of samples selected randomly, and a subset of sample data chosen randomly as a training set. The structure of the random forest is stated in Algorithm 2 (Bradter et al. 2013).

XGBoost: The extreme gradient boosting model (Tianqi C, Guestrin 2016) is an ensemble ML system for tree boosting. The XGBoost model is used for problems related to regression trees and classification based on the boosting technique. At each step, the XGBoost model produces a weak learner and accumulates it into the model. The XGBoost model is based on the gradient direction of a loss function, an approach called gradient boosting machine (Jerome and Friedman 2002). The difference between random forest and gradient boosting machine is that decision trees are built independently in a random forest, while the latter method adds each new tree to complement previously built ones (Luckner et al. 2017). The XGBoost is a method where new models are built to predict prior models’ residuals and further make predictions. The XGBoost model is established as

$$\mathrm{Obj} = L + \Omega,$$

where $L$ is the loss function that controls the predictive power and $\Omega$ is the regularization part that controls simplicity and over-fitting.
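To make the roles of the two elastic net parameters concrete, the sketch below evaluates the penalty term and a penalized least-squares objective in plain Python. It assumes the standard Gaussian form used by glmnet (residual sum of squares divided by 2n, plus the penalty); the function names and the toy coefficients are illustrative assumptions, and no fitting is performed here.

```python
def elastic_net_penalty(beta, lam, alpha):
    """Penalty lam * [(1 - alpha) * ||beta||_2^2 / 2 + alpha * ||beta||_1]."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) * l2_squared / 2.0 + alpha * l1)


def elastic_net_objective(X, y, beta0, beta, lam, alpha):
    """Penalized least squares: RSS / (2n) plus the elastic net penalty."""
    n = len(y)
    rss = sum(
        (yi - beta0 - sum(xj * bj for xj, bj in zip(xi, beta))) ** 2
        for xi, yi in zip(X, y)
    )
    return rss / (2.0 * n) + elastic_net_penalty(beta, lam, alpha)


beta = [3.0, -4.0]  # toy coefficients: L1 norm 7, squared L2 norm 25

lasso = elastic_net_penalty(beta, lam=0.5, alpha=1.0)   # pure L1 (LASSO)
ridge = elastic_net_penalty(beta, lam=0.5, alpha=0.0)   # pure L2 (ridge)
mixed = elastic_net_penalty(beta, lam=0.5, alpha=0.5)   # blend of both
print(lasso, ridge, mixed)  # 3.5 6.25 4.875

# With a perfect fit the objective reduces to the penalty alone.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, -4.0]
perfect_fit = elastic_net_objective(X, y, beta0=0.0, beta=beta, lam=0.5, alpha=0.5)
print(perfect_fit)  # 4.875
```

Setting alpha to 1 or 0 recovers the LASSO and ridge penalties, respectively, while intermediate values blend the sparsity of the former with the grouping behavior of the latter.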

Model evaluation metrics

Evaluation metrics are used to measure the quality of statistical models. Evaluation metrics tell us whether our model is operating correctly. There are numerous kinds of evaluation metrics to test a model. The metrics employed in this investigation are defined next. The mean absolute error (MAE) is established as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right|,$$

which indicates the difference between the observed ($y_{i}$) and forecasted ($\hat{y}_{i}$) values for case i. The MAE is obtained from the data set by averaging the absolute difference. The mean absolute percentage error (MAPE) is stated by

$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_{i} - \hat{y}_{i}}{y_{i}}\right|,$$

which measures the size of error in percentage form. The mean absolute scaled error (MASE) (Hyndman and Khandakar 2008) is formulated as

$$\mathrm{MASE} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right|}{\frac{1}{T-1}\sum_{t=2}^{T}\left|y_{t} - y_{t-1}\right|},$$

which is used as a measure of the accuracy of forecasts, with the denominator being the mean absolute error of the one-step “naive forecast method” on the training set (here defined as $t = 1, \ldots, T$), that utilizes the true value from the prior period as the forecast: $\hat{y}_{t} = y_{t-1}$. The symmetric mean absolute percentage error (SMAPE) is expressed by

$$\mathrm{SMAPE} = \frac{100}{n}\sum_{i=1}^{n}\frac{\left|y_{i} - \hat{y}_{i}\right|}{\left(\left|y_{i}\right| + \left|\hat{y}_{i}\right|\right)/2},$$

which is based on the percentage error. The root of the mean square error (RMSE) is given by

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}},$$

which corresponds to the square root of the mean square error. A well-known indicator of adequacy for statistical models is the determination coefficient (R-square) that is defined in our case as

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2}}{\sum_{i=1}^{n}\left(y_{i} - \bar{y}\right)^{2}},$$

which tells us how well the fitted values compare to the true values, where $\bar{y}$ is the mean of true values. After the trained model is obtained, we check the model goodness of fit. The values between zero and one are interpreted as percentages. Note that 0% shows that the response variable is not explained around its mean based on the covariates, whereas 100% indicates that the model describes the response variability around its mean perfectly. As the value of R-square of the trained model increases, the model better fits the data set.
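The six evaluation metrics described above can be sketched in a few lines of Python. The definitions below follow the common textbook forms (SMAPE with the mean of the absolute values in the denominator, R-square as one minus the ratio of residual to total sum of squares); software packages sometimes use slightly different conventions, so this is an illustrative sketch rather than a reproduction of the authors' R computations. The toy numbers are invented for the example.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def mape(y, yhat):
    """Mean absolute percentage error (in percent); assumes no zero in y."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, yhat))


def smape(y, yhat):
    """Symmetric MAPE (in percent), using (|y| + |yhat|) / 2 as the scale."""
    return 100.0 / len(y) * sum(
        abs(a - b) / ((abs(a) + abs(b)) / 2.0) for a, b in zip(y, yhat)
    )


def rmse(y, yhat):
    """Root of the mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def mase(y, yhat, train):
    """MAE scaled by the in-sample MAE of the one-step naive forecast."""
    naive_errors = [abs(train[t] - train[t - 1]) for t in range(1, len(train))]
    return mae(y, yhat) / (sum(naive_errors) / len(naive_errors))


def r_squared(y, yhat):
    """Determination coefficient: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Invented toy values: a training history and a four-point test window.
train = [10.0, 20.0, 40.0, 70.0, 100.0]
actual = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]

scores = {
    "MAE": mae(actual, forecast),
    "MAPE": mape(actual, forecast),
    "MASE": mase(actual, forecast, train),
    "SMAPE": smape(actual, forecast),
    "RMSE": rmse(actual, forecast),
    "R-square": r_squared(actual, forecast),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

A MASE below one means the forecast beats the naive "repeat the previous value" rule on the scale of the training data, which is a useful sanity check alongside the percentage-based metrics.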

Results and discussion

In this section, we use the automatic and ML models to forecast the COVID-19 confirmed cases in SAARC countries. The forecasting is done by employing the best models that are suitable for this framework. Firstly, the data set is preprocessed. Then, we split the data into a set to train the models and another set (10 days) to test the models. ML models, such as GLMNet and XGBoost, and automatic models, such as ARIMA and Prophet, are used. These models are trained for confirmed cases and evaluated on the mentioned metrics. The flowchart of the proposed methodology is shown in Fig. 2.

Fig. 2

Flowchart of the proposed methodology to forecast COVID-19 cases in SAARC countries
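The hold-out scheme mentioned above, where the final 10 days are reserved for testing, can be sketched as a chronological split. The function name `split_last_days` and the toy series are hypothetical; the point is only that a time series is split by position, never by random shuffling, so that the test window lies strictly after the training window.

```python
def split_last_days(series, test_days=10):
    """Split a chronologically ordered series into (train, test).

    The last `test_days` observations form the test set; everything
    before them forms the training set.
    """
    if not 0 < test_days < len(series):
        raise ValueError("test_days must be positive and smaller than the series length")
    return series[:-test_days], series[-test_days:]


# Toy series of 30 daily values (invented for illustration).
daily = list(range(1, 31))
train, test = split_last_days(daily)
print(len(train), len(test))  # 20 10
print(test[0], test[-1])      # 21 30
```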

To compare the models, we evaluate their performance based on the metrics mentioned (MAE, MAPE, MASE, RMSE, SMAPE, and R-square). We consider the confirmed COVID-19 cases of SAARC countries. Up to 19 July 2020, confirmed cases in the world were 14,443,127, and in SAARC countries 1,605,501, of which 1,033,502 are recovered cases, 36,270 are deaths, and 535,728 are active cases. The status of COVID-19 in SAARC countries is presented in Table 1, which includes the information of confirmed cases, deaths, recovered cases, and active cases till 19 July 2020. India has the highest number of COVID-19 active cases, with 26,838 deaths. In Sri Lanka, the Maldives, and Nepal, the number of deaths is below 50, whereas 204,276 patients have recovered from COVID-19 in Pakistan. The number of active cases is 90,265 in Bangladesh. The time-series plot of global and SAARC countries is shown in Fig. 3, whereas Fig. 4 displays these plots for each country.

Table 1

COVID-19 confirmed, recovered, active cases and deaths in SAARC countries

SAARC countries	Confirmed cases	Recovered cases	Deaths	Active cases
India	1,078,782	677,856	26,838	374,088
Pakistan	263,496	204,276	5568	53,652
Bangladesh	204,525	111,642	2618	90,265
Afghanistan	35,475	23,634	1181	10,660
Nepal	17,502	11,637	40	5825
Maldives	2930	2354	15	561
Sri Lanka	2704	2023	11	670
Bhutan	87	80	0	7

Source: Worldometer: www.worldometers.info/coronavirus/#countries; accessed on 10 July 2020

Fig. 3

Time-series plot of confirmed COVID-19 cases of (a) global; and (b) SAARC countries

Fig. 4

Time-series plot of COVID-19 confirmed cases per month of (a) Afghanistan; (b) Bangladesh; (c) Bhutan; (d) India; (e) Maldives; (f) Nepal; (g) Pakistan; and (h) Sri Lanka

According to the objective of our investigation, we compare forecasting models based on selection criteria. The best models are selected based on MAE, MAPE, MASE, SMAPE, RMSE, and R-square. The performance of the ARIMA model is best in most countries; see Table 2. Figures 5, 6, 7, 8, 9, 10, 11, 12 and 13 show the forecasted plot for all models used in this study. A description of the best models and their forecast values is provided in Table 2. For SAARC countries, the ARIMA(1, 2, 0)(1, 0, 2) model with drift is suitable for forecasting. Based on this, cases reach 1,315,265 in July 2020 (Table 3). The residual values of the best prediction model were the smallest ones, which shows that the model is the best fit for the data; see Table 4. Note that the random forest model is excluded from forecasting during the analysis because of its poor fit. Model accuracy values for the individual SAARC member countries are shown in Tables 4, 5, 6, 7, 8, 9, 10 and 11.

Table 2

Calibration of confirmed COVID-19 cases in SAARC countries

Country	Model	Date	True value	Predicted value	Residual
SAARC	ARIMA(1,2,0)(1,0,2)	2020-07-01	1,301,878	1,284,324	17,553.579
		2020-07-01	133,598	1,315,265	20,715.579
India	ARIMA(0,2,0)(2,0,0)	2020-07-01	820,916	789,137.4	31,778.590
		2020-07-01	849,553	811,178.8	38,374.206
Pakistan	GLMNet	2020-07-01	24,635	248,331.	−1,980.171
		2020-07-01	248,87	250,528.9	−1,656.942
Bangladesh	ARIMA(0,2,1)(2,0,0)	2020-07-01	178,443	178,030.	412.85772
		2020-07-01	181,129	181,051.6	77.39907
Afghanistan	ARIMA(0,2,3)	2020-07-01	34,366	34,464.	−98.23814
		2020-07-01	34,45	34,759	−307.9588
Nepal	GLMNet	2020-07-01	16,649	17,998.5	−1,349.482
		2020-07-01	16,719	18,304.6	−1,585.618
Maldives	ARIMA(0,2,1)	2020-07-01	2617	2,532.938	84.062256
		2020-07-01	2664	2,550.500	113.499722
Sri Lanka	ARIMA(0,2,2)	2020-07-01	2454	2,127.718	326.281545
		2020-07-01	251	2,135.839	375.160992
Bhutan	Prophet	2020-07-01	8	81.79123	0.2
		2020-07-01	8	82.49074	−0.5

Fig. 5

Plot of forecasting for confirmed COVID-19 cases in SAARC countries

Fig. 6

Plot of forecasting for confirmed COVID-19 cases in Afghanistan

Fig. 7

Plot of forecasting for confirmed COVID-19 cases in Bangladesh

Fig. 8

Plot of forecasting for confirmed COVID-19 cases in Bhutan

Fig. 9

Plot of forecasting for confirmed COVID-19 cases in India

Fig. 10

Plot of forecasting for confirmed COVID-19 cases in the Maldives

Fig. 11

Plot of forecasting for confirmed COVID-19 cases in Nepal

Fig. 12

Plot of forecasting for confirmed COVID-19 cases in Pakistan

Fig. 13

Plot of forecasting for confirmed COVID-19 cases in Sri Lanka

Table 3

Accuracy results of confirmed COVID-19 cases in SAARC countries

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(1,2,0)(1,0,2)	9935.48	0.80	0.3	0.81	11,541.89	1.00
Prophet	13,140.0	1.05	0.4	1.05	16,953.09	1.00
GLMNet	91,372.40	7.44	2.89	7.78	100,978.10	0.99
Random forest	589,028.17	48.94	18.65	65.35	601,966.30	0.86
XGBoost	13,136.24	1.05	0.4	1.05	16,879.37	1.00

Table 4

Accuracy results of confirmed COVID-19 cases in Afghanistan

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,3)	123.36	0.37	0.46	0.37	157.59	0.98
Prophet	547.86	1.6	2.03	1.57	862.79	0.94
GLMNet	1261.75	3.76	4.68	3.84	1323.77	0.97
Random forest	10,011.0	29.9	37.09	35.46	10,239.54	0.55
XGBoost	620.17	1.8	2.30	1.79	918.94	0.93

Table 5

Accuracy results of COVID-19 data in Bangladesh

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,1)(2,0,0)	358.69	0.20	0.31	0.20	397.33	1.00
Prophet	3384.84	1.90	0.42	1.92	3772.35	0.99
GLMNet	6507.96	3.67	2.89	3.74	6622.81	1.00
Random forest	67,951.44	38.4	18.65	47.56	68,036.40	0.65
XGBoost	3445.9	1.93	0.42	1.96	3825.89	0.98

Table 6

Accuracy results of confirmed COVID-19 cases in Bhutan

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,1)(1,0,0)	1.18	1.44	0.56	1.45	1.23	0.78
Prophet	0.35	0.43	0.23	0.43	0.38	0.94
GLMNet	0.70	0.85	1.89	0.86	0.74	0.81
Random forest	27.44	33.47	17.7	40.20	27.45	0.53
XGBoost	0.54	0.65	1.23	0.66	0.57	0.69

Table 7

Accuracy results of confirmed COVID-19 cases in India

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,0)(2,0,0)	17,421.50	2.25	0.70	2.28	20,776.11	1.00
Prophet	47,200.57	6.00	1.90	6.30	61,119.96	0.94
GLMNet	78,360.15	10.2	3.15	10.90	89,010.92	0.99
Random forest	396,671.74	53.35	15.94	73.46	406,965.77	0.84
XGBoost	47,179.87	6.00	1.90	6.30	61,076.94	0.94

Table 8

Accuracy results of confirmed COVID-19 cases in the Maldives

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,1)	41.7	1.6	1.28	1.64	54.25	0.93
Prophet	198.48	7.70	6.07	8.16	241.38	0.97
GLMNet	101.43	3.95	3.10	4.05	113.98	0.90
Random forest	596.44	23.47	18.23	26.68	605.13	0.86
XGBoost	204.26	7.93	6.24	8.40	243.36	0.97

Table 9

Accuracy results of confirmed COVID-19 cases in Nepal

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(1,2,2)	1327.80	8.1	5.43	7.64	1658.72	0.93
Prophet	940.6	5.76	3.85	5.52	1151.74	0.95
GLMNet	641.05	3.9	2.62	3.79	834.10	0.94
Random forest	7357.50	46.08	30.10	59.95	7390.10	0.79
XGBoost	848.74	5.18	3.47	4.99	1062.04	0.95
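The model selection step can be illustrated with the values reported in Table 9 for Nepal. The sketch below picks the model with the smallest error for a given metric; the helper name `best_model` is hypothetical, and only the MAE and RMSE columns are copied from the table, since lower is better for both.

```python
# MAE and RMSE values copied from Table 9 (Nepal); lower is better.
nepal_scores = {
    "ARIMA(1,2,2)": {"MAE": 1327.80, "RMSE": 1658.72},
    "Prophet": {"MAE": 940.6, "RMSE": 1151.74},
    "GLMNet": {"MAE": 641.05, "RMSE": 834.10},
    "Random forest": {"MAE": 7357.50, "RMSE": 7390.10},
    "XGBoost": {"MAE": 848.74, "RMSE": 1062.04},
}


def best_model(scores, metric):
    """Return the model name with the smallest value of an error metric."""
    return min(scores, key=lambda name: scores[name][metric])


best_by_mae = best_model(nepal_scores, "MAE")
best_by_rmse = best_model(nepal_scores, "RMSE")
print(best_by_mae, best_by_rmse)  # GLMNet GLMNet
```

Both error metrics point to GLMNet for Nepal, which agrees with the choice reported for that country. When metrics disagree, a tie-breaking rule is needed, and the article does not spell one out.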

Table 10

Accuracy results of confirmed COVID-19 cases in Pakistan

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(3,2,2)	4361.95	1.80	1.46	1.78	5375.71	1.00
Prophet	21,688.37	8.93	7.24	8.3	28,595.20	0.96
GLMNet	1354.18	0.56	0.45	0.56	1676.14	1.00
Random forest	104,982.54	44.29	35.03	57.28	106,490.52	0.74
XGBoost	21,830.37	8.99	7.28	8.36	28,686.44	0.96

Table 11

Accuracy results of confirmed COVID-19 cases in Sri Lanka

Model	MAE	MAPE	MASE	SMAPE	RMSE	R-square
ARIMA(0,2,2)	101.0	4.18	2.04	4.44	173.65	0.70
Prophet	711.7	30.88	14.39	41.55	924.08	0.84
GLMNet	196.30	8.37	3.97	9.1	280.20	0.69
Random forest	426.34	19.03	8.6	21.30	462.16	0.74
XGBoost	699.0	30.30	14.14	40.66	912.89	0.85

In Afghanistan, the ARIMA model forecasts more confirmed cases than the other models. Table 4 reports that the model selection metrics are best for the ARIMA(0, 2, 3) model. The performance of the time-series models for confirmed COVID-19 cases in Bangladesh is provided in Table 5, where the ARIMA(0, 2, 1)(2, 0, 0) model with drift is detected as the best one. The predicted number of cases in Bangladesh is 181,051, which is very close to the true value of 181,129 on 12 July 2020 (Zhao et al. 2022).

The Prophet time series model is suitable for prediction in Bhutan; see Table 6. The model predicted value is almost the same as the true value for Bhutan. The values of RMSE, MAE, and MAPE were the smallest ones in Bhutan. Still, in this country, the number of confirmed cases is 87 with no deaths due to COVID-19, which shows less spread.

For India, the forecasting in terms of model accuracy indicates that the ARIMA model is better compared to Prophet, XGBoost, and GLMNet methods, as reported in Table 7. The predicted value for the confirmed cases is 811,178.8, which is very close to the true value.
Medical facilities were improved in the last four months, and the recovery rate is greater than 60%. Accuracy results of confirmed COVID-19 cases in India Table 8 reports that the COVID-19 cases in the Maldives require stationary differencing. The prediction value of 12 July 2020 is like the true value. After Bhutan, the Maldives is having less COVID-19 confirmed cases among other SAARC countries. For the Bhutan model, residual values were less than 1%. In the Nepal data set, the GLMNet ML model is more accurate as compared to time series models: see Table 9. The values of RMSE (834.10) and MAE (641) are minimum in all compared models. Accuracy results of confirmed COVID-19 cases in the Maldives Accuracy results of confirmed COVID-19 cases in Nepal Table 10 provides the model accuracy of Pakistan about the confirmed COVID-19 cases. Like Nepal, in Pakistan also, the GLMNet model performance is better as compared to time-series models. The values of the random forest method are very different from other models. Accuracy results of confirmed COVID-19 cases in Pakistan The forecasted values of the GLMNet method in Pakistan show the COVID-19 confirmed cases are 250,528 on 12 July 2020. Still, according to its spread, Pakistan needs to increase the testing facilities, as the recovery rate is 97.35%. Table 11 reports that the ARIMA model accuracy is much better in Sri Lanka. The residual values are small and show that the ARIMA model is the best fitted. Accuracy results of confirmed COVID-19 cases in Sri Lanka
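The per-country comparisons above rest on accuracy metrics such as MAE, MAPE, and RMSE. As an illustration only, the following minimal Python sketch shows how such metrics can be computed and how a model with the smallest error can be picked. It is not the authors' code: the function names, the toy numbers, and the model labels are hypothetical. MASE is scaled here by the in-sample error of a naive one-step forecast, and R-square is taken as the squared Pearson correlation; the source does not state which definitions were used, so both choices are assumptions.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no actual value is zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE, in percent (assumes |a| + |p| is never zero)."""
    terms = [abs(a - p) / ((abs(a) + abs(p)) / 2.0) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


def mase(actual, predicted, train):
    """MAE scaled by the in-sample MAE of a naive one-step forecast (assumed definition)."""
    naive_errors = [abs(train[i] - train[i - 1]) for i in range(1, len(train))]
    return mae(actual, predicted) / (sum(naive_errors) / len(naive_errors))


def r_squared(actual, predicted):
    """Squared Pearson correlation between actual and predicted values (assumed definition)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


def accuracy_table(actual, forecasts, train):
    """Compute all six metrics for each model's forecast."""
    return {
        name: {
            "mae": mae(actual, pred),
            "mape": mape(actual, pred),
            "mase": mase(actual, pred, train),
            "smape": smape(actual, pred),
            "rmse": rmse(actual, pred),
            "rsq": r_squared(actual, pred),
        }
        for name, pred in forecasts.items()
    }


def best_model(table, metric="rmse"):
    """Return the model name with the smallest value of an error metric."""
    return min(table, key=lambda name: table[name][metric])


if __name__ == "__main__":
    # Toy cumulative counts, invented for illustration; not data from the study.
    train = [100, 110, 125, 145]
    actual = [170, 200, 235]
    forecasts = {"model_a": [168, 203, 230], "model_b": [160, 185, 215]}
    table = accuracy_table(actual, forecasts, train)
    for name, metrics in table.items():
        print(name, {k: round(v, 3) for k, v in metrics.items()})
    print("best by RMSE:", best_model(table))
```

Selecting by a single error metric is a simplification; for R-square, larger values are better, so `best_model` as written should only be used with the error metrics.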

Conclusion

In this article, we have proposed and derived an autoregressive modeling framework based on machine learning and statistical methods to predict the number of COVID-19 cases in the South Asian Association for Regional Cooperation countries. As statistical methods, we have considered automatic forecasting models based on autoregressive integrated moving average and Prophet time series structures. As machine learning methods, we have employed extreme gradient boosting, generalized linear model elastic net, and random forest techniques. The dataset consisted of time series data from 31 January 2020 to 19 July 2020 and was used to forecast upcoming cases. The forecasting models were compared using several evaluation metrics: MAE, MAPE, MASE, SMAPE, RMSE, and R-square. The results indicate that the ARIMA model is suitable for forecasting COVID-19 cases in the mentioned countries. The random forest model did not fit our dataset well, so it was excluded from forecasting. Model accuracy values were studied for individual countries. In Afghanistan, Bangladesh, India, the Maldives, and Sri Lanka, the ARIMA model was superior to the other models. In Bhutan, the Prophet time series model was suitable for projection purposes. Still, in Bhutan, the total number of confirmed cases was 87, with no deaths. In Nepal and Pakistan, the generalized linear model elastic net was more accurate than the other models. This forecasting study can help authorities to take timely actions and to control the spread of COVID-19. As future research, we plan to explore the projection methodology further and to update the dataset, using more accurate and suitable machine learning methods as well as statistical models for prediction.

18 in total

1. Use of neural networks in predicting the risk of coronary artery disease.

Authors: P Lapuerta; S P Azen; L LaBree
Journal: Comput Biomed Res Date: 1995-02

2. Regularization Paths for Generalized Linear Models via Coordinate Descent.

Authors: Jerome Friedman; Trevor Hastie; Rob Tibshirani
Journal: J Stat Softw Date: 2010 Impact factor: 6.440

3. Asymmetric autoregressive models: statistical aspects and a financial application under COVID-19 pandemic.

Authors: Yonghui Liu; Chaoxuan Mao; Víctor Leiva; Shuangzhe Liu; Waldemiro A Silva Neto
Journal: J Appl Stat Date: 2021-04-24 Impact factor: 1.404

4. Autoregressive count data modeling on mobility patterns to predict cases of COVID-19 infection.

Authors: Jing Zhao; Mengjie Han; Zhenwu Wang; Benting Wan
Journal: Stoch Environ Res Risk Assess Date: 2022-06-23 Impact factor: 3.821

5. A machine learning forecasting model for COVID-19 pandemic in India.

Authors: R Sujath; Jyotir Moy Chatterjee; Aboul Ella Hassanien
Journal: Stoch Environ Res Risk Assess Date: 2020-05-30 Impact factor: 3.379

6. Regression models for prognostic prediction: advantages, problems, and suggested solutions.

Authors: F E Harrell; K L Lee; D B Matchar; T A Reichert
Journal: Cancer Treat Rep Date: 1985-10

7. Cardiovascular disease risk profiles.

Authors: K M Anderson; P M Odell; P W Wilson; W B Kannel
Journal: Am Heart J Date: 1991-01 Impact factor: 4.749

8. Disjoint and Functional Principal Component Analysis for Infected Cases and Deaths Due to COVID-19 in South American Countries with Sensor-Related Data.

Authors: Carlos Martin-Barreiro; John A Ramirez-Figueroa; Xavier Cabezas; Víctor Leiva; M Purificación Galindo-Villardón
Journal: Sensors (Basel) Date: 2021-06-14 Impact factor: 3.576

9. Forecasting COVID-19 epidemic in India and high incidence states using SIR and logistic growth models.

Authors: B Malavika; S Marimuthu; Melvin Joy; Ambily Nadaraj; Edwin Sam Asirvatham; L Jeyaseelan
Journal: Clin Epidemiol Glob Health Date: 2020-06-27

10. Forecasting the spread of the COVID-19 pandemic in Saudi Arabia using ARIMA prediction model under current public health interventions.

Authors: Saleh I Alzahrani; Ibrahim A Aljamaan; Ebrahim A Al-Fakih
Journal: J Infect Public Health Date: 2020-06-08 Impact factor: 3.718