Iqra Sardar1, Muhammad Azeem Akbar2, Víctor Leiva3, Ahmed Alsanad4, Pradeep Mishra5. 1. Department of Mathematics and Statistics, International Islamic University Islamabad, Islamabad, Pakistan. 2. Department of Software Engineering, LUT University, Lappeenranta, Finland. 3. School of Industrial Engineering, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile. 4. STC's Artificial Intelligence Chair, Department of Information Systems, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia. 5. Department of Statistics, College of Agriculture, Powarkheda, India.
Abstract
Machine learning (ML) has become a prominent field of study for solving complex real-world problems. The world has suffered, and continues to suffer, from coronavirus disease 2019 (COVID-19), and its future course needs to be forecasted. In this article, we propose and derive an autoregressive modeling framework based on ML and statistical methods to predict confirmed cases of COVID-19 in the South Asian Association for Regional Cooperation (SAARC) countries. Automatic forecasting models based on autoregressive integrated moving average (ARIMA) and Prophet time-series structures, as well as extreme gradient boosting (XGBoost), generalized linear model elastic net (GLMNet), and random forest ML techniques, are introduced and applied to COVID-19 data from the SAARC countries. The forecasting models are compared by means of selection criteria, and the most suitable models are selected by using evaluation metrics. The results indicate that the ARIMA model is suitable for forecasting confirmed cases of COVID-19 in most of these countries. For the confirmed cases in Afghanistan, Bangladesh, India, Maldives, and Sri Lanka, the ARIMA model is superior to the other models. In Bhutan, the Prophet time-series model is appropriate for predicting such cases. The GLMNet model is more accurate than the other models for Nepal and Pakistan. The random forest model is excluded from forecasting because of its poor fit.
Keywords:
Artificial intelligence; Facebook Prophet algorithm; GLM; R software; SARS-CoV-2; South Asian Association for Regional Cooperation countries; Time-series models
Machine learning (ML) has solved many complex real-world problems over the last decade in fields as diverse as business, climatology, healthcare, and robotics (Kelleher et al. 2020; Akbar et al. 2022). ML algorithms are trial-and-error-based methods rather than conventional algorithms that follow programming commands based on decision statements (Makridakis et al. 2018; Sardar et al. 2021). Forecasting is one of the most significant applications of ML (Bontempi et al. 2012), and multiple ML algorithms have been used to forecast diseases, stock markets, and weather (Mahdi et al. 2021; Bustos et al. 2022; Chaouch et al. 2022; Ma et al. 2022). Regression and neural network models have been employed to predict a patient’s future condition for a specific disease (Harrell et al. 1985). Different studies have used ML techniques to predict diseases such as breast cancer (Asri et al. 2016), cardiovascular disease (Anderson et al. 1991), and coronary artery disease (Lapuerta et al. 1995).
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes COVID-19, was discovered in December 2019 in Wuhan, China, and the World Health Organization (WHO) declared the outbreak a pandemic in March 2020 (WHO 2020a, b). Different types of SARS-CoV-2 have been identified (Alkadya et al. 2022). COVID-19 has disrupted the world’s population, as well as the economy (Chahuan-Jimenez et al. 2021) and finance (Liu et al. 2021), forcing us into a new way of life that will leave its mark on society for a long time (Mahdi et al. 2021).
The spread of COVID-19 was so rapid that almost all countries imposed either partial or complete lockdowns in affected areas to curb it (Ospina et al. 2022). Governments have imposed precautionary measures that direct their populations to follow standard operating procedures and control the spread of SARS-CoV-2 (Jerez-Lillo et al. 2022). 
Several studies based on ML tools have tried to predict COVID-19 confirmed cases (Petropoulos and Makridakis 2020) and COVID-19 outbreaks (Grasselli et al. 2020), as well as to improve clinical diagnosis (Bustos et al. 2022), all of which help in decision-making.
The South Asian Association for Regional Cooperation (SAARC) was established in 1985 and is composed of Afghanistan, Bangladesh, Bhutan, India, the Maldives, Nepal, Pakistan, and Sri Lanka. COVID-19 has disturbed the economy of SAARC countries as well (De la Fuente-Mella et al. 2021). Up to 19 July 2020, the confirmed cases across the SAARC countries were 16,870,105, with 39,053 deaths (WHO 2020a, b; Zhao et al. 2022). India is the SAARC country most affected by COVID-19, with cases rising to 1.1 million, which is second only to the US, and with a recovery rate of 63.75%. In Latin America, the deaths have been high as well (Martin-Barreiro et al. 2021), particularly in Brazil (Ospina et al. 2022) and Chile (Jerez-Lillo et al. 2022).
The COVID-19 infection rate in SAARC countries is approximately 11.78% of the world population. India has developed medical facilities to fight against it. However, despite lockdowns, three million spiked active cases were reported in India. Comparatively, Bangladesh and Pakistan have each crossed two lakh (200,000) cases of COVID-19. Active cases in Pakistan are gradually decreasing and now stand at 53,652, with fewer testing facilities. The death rate in Bangladesh is 6.61%. No casualty was found to date due to COVID-19 in Afghanistan and Bhutan. The death rate is below 1% in the Maldives and Sri Lanka.
Time-series models are widely studied and applied (Brockwell and Davis 1991; Leiva et al. 2021), especially in epidemic disease prediction (Ospina et al. 2022; Jerez-Lillo et al. 2022). Several researchers are studying different dimensions of the epidemic to obtain results that help humanity. 
In Fanelli and Piazza (2020), temporal dynamics were considered for predicting COVID-19 in China, France, and Italy. In Chimmula and Zhang (2020), the COVID-19 outbreak in Canada was forecasted by employing a long short-term memory network. In Malavika et al. (2021), COVID-19 was predicted in India based on a susceptible-infected-recovered (SIR) model (Fierro et al. 2015; Rangasamy et al. 2022). In Petropoulos and Makridakis (2020) and Alzahrani et al. (2020), exponential smoothing and autoregressive integrated moving average (ARIMA) models were used for forecasting COVID-19.
Currently, ML is a helpful technique for forecasting, and researchers are employing it in different fields of science. In Rustam et al. (2020), ML models were compared with other forecasting models based on COVID-19 data, finding that the ML models performed better. For example, in Sujath et al. (2020), ML techniques were utilized for forecasting COVID-19 in India. The use of ML tools for forecasting COVID-19 has been very helpful in controlling this epidemic. To the best of our knowledge, no studies based on ML techniques and automatic ARIMA/Prophet models for forecasting COVID-19 cases in SAARC countries have been conducted.
The specific contributions that this investigation aims to make are to:
(1) Propose a methodology for ML-based forecasting of COVID-19 data. The proposed methodology is represented in Fig. 1 to state the problem setting.
Fig. 1
Proposed methodology
(2) Derive an autoregressive modeling framework for forecasting COVID-19 data in a region composed of countries with different realities.
(3) Compare different forecasting models by means of selection criteria.
(4) Choose the best forecasting model for SAARC countries based on COVID-19 data by using evaluation metrics.
(5) Forecast the COVID-19 confirmed cases in SAARC countries.
(6) Analyze the impact of forecasted COVID-19 confirmed cases on these countries.
Therefore, the objectives of this study are related to each of the contributions mentioned in (1)–(6). All the calculations and graphical plots were performed with the R software (R Development Core Team 2020). The forecast plots were produced with an interactive forecast visualization tool, which is a wrapper for plot_time_series() that generates an interactive (plotly) or static (ggplot2) plot of the forecasted data. The time-series plots were constructed with plot_time_series(), a workhorse time-series plotting function that generates interactive plotly plots for one or more time series, consolidates 20+ lines of ggplot2 code, and scales well to many time series.
The remainder of this article is organized as follows. Section 2 discusses the methodology, including data source, automatic and ML models, as well as the evaluation metrics used in the present investigation. In Sect. 3, we report the results of our study and provide a discussion of these results. In Sect. 4, conclusions and future work suggestions are stated.
Methodology
In this section, the data source, methods, and evaluation metrics are provided.
Data and their sources
The COVID-19 data set employed in the present investigation was sourced from “Our World in Data” (Ritchie et al. 2020) using the website https://ourworldindata.org/coronavirus, which was accessed on 14 May 2022. In this study, we have considered only the variable “number of confirmed COVID-19 cases” for SAARC countries and its corresponding time-series data, with metadata related to date and location. The data were collected from 31 January 2020 to 19 July 2020 from the repository github.com/owid/covid-19-data/tree/master/public/data, where other COVID-19 variables can be obtained as well. Specifically, the data set used in our analysis was secured from github.com/owid/covid-19-data/blob/master/public/data/owid-covid-data.csv, which was accessed on 14 May 2022. The original data source (raw data) on confirmed COVID-19 cases and deaths for all countries corresponds to the COVID-19 data repository of the “Center for Systems Science and Engineering (CSSE)” at Johns Hopkins University (www.jhu.edu), with link given by github.com/CSSEGISandData/COVID-19. The full COVID-19 data set is a collection of the COVID-19 data maintained by Our World in Data (ourworldindata.org), which is updated daily and includes data on confirmed cases, deaths, hospitalizations, and testing.
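The selection of the study data described above can be sketched in a few lines of Python. This is only an illustration and not the R workflow used in the study: the column names location, date, and total_cases are assumptions about the layout of the CSV file, the function name load_saarc_cases is hypothetical, and the sample rows are made up for the example.

```python
import csv
import io
from datetime import date

# Country names as assumed to appear in the "location" column.
SAARC = {"Afghanistan", "Bangladesh", "Bhutan", "India",
         "Maldives", "Nepal", "Pakistan", "Sri Lanka"}
START, END = date(2020, 1, 31), date(2020, 7, 19)  # study window from the text


def load_saarc_cases(handle):
    """Return {country: [(date, total_cases), ...]} restricted to the study window."""
    series = {}
    for row in csv.DictReader(handle):
        # Skip other countries and rows with a missing case count.
        if row["location"] not in SAARC or not row["total_cases"]:
            continue
        day = date.fromisoformat(row["date"])
        if START <= day <= END:
            series.setdefault(row["location"], []).append(
                (day, float(row["total_cases"])))
    for points in series.values():
        points.sort()  # keep each series in time order
    return series


# Tiny made-up sample in the assumed layout, for illustration only.
sample = io.StringIO(
    "location,date,total_cases\n"
    "India,2020-07-18,100\n"
    "India,2020-07-19,120\n"
    "India,2020-07-20,150\n"
    "Chile,2020-07-19,999\n"
    "Bhutan,2020-07-19,\n"
)
cases = load_saarc_cases(sample)
```

With a real download, the same function could be applied to an open file handle instead of the in-memory sample.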
Time-series models
In this study, we use automatic forecasting models based on: (i) ARIMA and Prophet time-series frameworks; as well as (ii) extreme gradient boosting (XGBoost), generalized linear model elastic net (GLMNet), and random forest ML methods. We employ these models to analyze the COVID-19 data from the SAARC countries mentioned in Sect. 2.1. Next, we describe each of these methods.
Automatic ARIMA models: A statistical analysis model known as ARIMA employs time-series data to help researchers better understand a data set or to forecast future trends. A statistical model is said to be autoregressive if it forecasts future values using data from the past. Automatic ARIMA models (Hyndman and Khandakar 2008) combine unit root tests with the minimization of the Akaike information criterion, AIC in short (Ventura et al. 2019). The maximum likelihood method can be used to estimate the ARIMA parameters. Indeed, the R software estimates the ARIMA parameters with this method, so we use it. When estimating the ARIMA parameters, the maximum likelihood method is similar to the least squares method. Under regularity conditions, maximum likelihood estimators are asymptotically efficient.
The ARIMA acronym is formed by the terms AR, I, and MA, and the notation ARIMA(p, d, q) is often utilized. In this notation, AR(p) is the autoregressive part of the model, and the parameter p specifies how many lagged values of the series are used to forecast future periods (Hyndman et al. 2006). I(d) is the integrated part of the model, and the parameter d indicates how many differences are needed to make the series stationary, that is, I(d) is the differencing component. Then, MA(q) is the moving average part of the model, where q is the number of lagged forecast errors in the prediction equation. An ARIMA(p, d, q) model is stated as
\[ \phi(B)(1 - B)^{d}\, Y(t) = \mu + \theta(B)\,\varepsilon(t), \tag{1} \]
where $\phi(z)$ and $\theta(z)$ are polynomials of orders p and q, respectively, assuming that $\phi(z)$ and $\theta(z)$ have no roots for $|z| \le 1$ to ensure causality and invertibility, whereas $\mu$ is a constant. In the expression defined in (1), B is the backshift operator; Y(t) is the response variable at time t; and $\varepsilon(t)$ is a white noise at time t, with $\varepsilon(t) \sim \mathrm{N}(0, \sigma^{2})$, that is, the white noise follows a normal (Gaussian) distribution with zero mean and variance $\sigma^{2}$. Note that there is an implied polynomial of order d in the forecast function if $\mu \neq 0$ (Brockwell and Davis 1991). An approach for automatic ARIMA modeling is stated in Algorithm 1.
Prophet model: This is an open-source framework from Facebook for forecasting time-series data, based on an additive formulation and released to the public in 2017 (Taylor and Letham 2018; Prophet 2020). Prophet fits nonlinear trends with daily, weekly, and yearly seasonality, including the effects of holidays. The Prophet model not only forecasts the future, but also fills missing values and detects irregularities.
The Prophet model can deal with historical outliers, but only by fitting them with trend changes. Then, the model projects trend changes of similar magnitude into the future so that, although the trend forecast appears reasonable, the uncertainty intervals appear to be too large. Outliers are best dealt with by removing them, because the Prophet model has no issues with missing data: if we set their values to NA in the history but keep their dates in the future data, the Prophet model makes a forecast for their values. In the case of the Prophet model, we also used the maximum likelihood method, as well as other methods, to estimate its parameters.
A Prophet model is formulated as
\[ Y(t) = g(t) + s(t) + h(t) + \varepsilon(t), \tag{2} \]
where Y(t) stated in (2) is the response variable to be predicted at time t; g(t) is a trend function employed to analyze non-periodic changes of the time series; s(t) is a periodic or seasonality term reflecting weekly or yearly changes; h(t) is the influence of occasional days or holidays; and $\varepsilon(t)$ is an error term which is assumed to be normally distributed as in (1). Here, we consider the non-periodic changes of the time series. The Prophet model can automatically forecast the future trend, with this trend g(t) stated in (2) being described by either a saturating growth model or a piecewise linear model. The growth model is established by a logistic function formulated as
\[ g(t) = \frac{K}{1 + \exp(-c\,(t - u))}, \tag{3} \]
where exp denotes the exponential function. In the model presented in (3), the elements K, c, and u, corresponding to the curve maximum value or carrying capacity (K), logistic growth rate or steepness of the curve (c), and sigmoid point or offset parameter (u), need not be constant. When they depend on time t, the formulation stated in (3) becomes
\[ g(t) = \frac{K(t)}{1 + \exp(-c(t)\,(t - u(t)))}. \tag{4} \]
When change points are found in the time series, the trend changes at these points, and then the model expressed in (4) is given by
\[ g(t) = \frac{K(t)}{1 + \exp\!\big(-(c + a(t)^{\top}\boldsymbol{\gamma})\,(t - (u + a(t)^{\top}\boldsymbol{\phi}))\big)}, \tag{5} \]
where $a(t) = (a_{1}(t), \ldots, a_{S}(t))^{\top}$, $\boldsymbol{\gamma} = (\gamma_{1}, \ldots, \gamma_{S})^{\top}$, and $\boldsymbol{\phi} = (\phi_{1}, \ldots, \phi_{S})^{\top}$ for S change points, with “$\top$” denoting the transpose of a matrix. For a piecewise linear function, the trend corresponding to the model defined in (5) is established as
\[ g(t) = (c + a(t)^{\top}\boldsymbol{\gamma})\,t + (u + a(t)^{\top}\boldsymbol{\phi}), \tag{6} \]
where c is the growth rate, u denotes the offset parameter, $\boldsymbol{\gamma}$ has the rate adjustments, and $\boldsymbol{\phi}$ is set to make the function continuous; for more details, see Taylor and Letham (2018).
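To make the two trend forms of the Prophet model concrete, the following minimal Python sketch evaluates a logistic growth trend and a piecewise linear trend with change points. It is an illustration under stated assumptions rather than the Prophet implementation: the function names are hypothetical, a(t) is taken as the indicator of the change points already passed at time t, and the continuity adjustment is taken as minus the change-point time multiplied by the corresponding rate adjustment, following the construction in Taylor and Letham (2018).

```python
import math


def indicator(t, changepoints):
    """a(t): 1.0 for every change point already passed at time t, else 0.0."""
    return [1.0 if t >= s else 0.0 for s in changepoints]


def piecewise_linear_trend(t, c, u, changepoints, gamma):
    """Piecewise linear trend with base rate c, offset u, and rate adjustments gamma."""
    a = indicator(t, changepoints)
    # Assumed continuity adjustment: phi_j = -s_j * gamma_j at change point s_j.
    phi = [-s * g for s, g in zip(changepoints, gamma)]
    rate = c + sum(ai * gi for ai, gi in zip(a, gamma))
    offset = u + sum(ai * pi for ai, pi in zip(a, phi))
    return rate * t + offset


def logistic_trend(t, K, c, u):
    """Saturating growth with carrying capacity K, growth rate c, and offset u."""
    return K / (1.0 + math.exp(-c * (t - u)))


# The slope rises from 1.0 to 1.5 at t = 10 while the trend stays continuous.
before = piecewise_linear_trend(10.0 - 1e-9, 1.0, 0.0, [10.0], [0.5])
at_change = piecewise_linear_trend(10.0, 1.0, 0.0, [10.0], [0.5])
later = piecewise_linear_trend(12.0, 1.0, 0.0, [10.0], [0.5])
half_capacity = logistic_trend(5.0, 100.0, 0.3, 5.0)
```

At the offset time t = u, the logistic trend equals half of the carrying capacity, which is a quick sanity check on the parameter roles.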
Machine learning models
ML is a branch of computer science in which algorithms are trained from past data. The algorithm selects a suitable model to be calibrated according to the data characteristics and forecasts future values. In this ML framework, the algorithm takes a data set with input instances and a regressor to train the model. The trained model then generates predictions for an unseen test data set (Talabis et al. 2015). The ML methods use regression and classification algorithms.
Several ML models, such as logistic regression, neural networks, and support vector machines, have been utilized to analyze and forecast COVID-19 (Bustos et al. 2022). We have found that some ML models predict better than others of this type. We apply ML models to our COVID-19 data set for predicting the effects of SARS-CoV-2 in SAARC countries and at the global level. Three ML methods are employed in the study of COVID-19 prediction: GLMNet, random forest, and XGBoost.
GLMNet: The elastic net is a particular case of the shrinkage method, which combines both ridge and least absolute shrinkage and selection operator (LASSO) regressions. GLMNet has the ability to handle $p \gg n$ problems (Zou and Hastie 2005). The elastic net penalty term is given by $\lambda \sum_{j=1}^{p} \big[ (1 - \alpha)\beta_{j}^{2}/2 + \alpha|\beta_{j}| \big]$ in the objective function (Friedman et al. 2005) stated as
\[ \min_{\beta_{0},\, \boldsymbol{\beta}} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \big( y_{i} - \beta_{0} - \boldsymbol{x}_{i}^{\top}\boldsymbol{\beta} \big)^{2} + \lambda \sum_{j=1}^{p} \left[ \frac{1 - \alpha}{2}\,\beta_{j}^{2} + \alpha\,|\beta_{j}| \right] \right\}, \]
where now $\alpha$ is a parameter that controls the shrinkage type affecting the estimation process; $\lambda$ is a penalty parameter that controls the shrinkage amount; $y_{i}$ is the response variable; $\boldsymbol{x}_{i}$ are the values of the covariate vector $\boldsymbol{X}$; $\beta_{0}$ is the regression intercept; $\boldsymbol{\beta}$ is the vector of regression coefficients associated with the values of the covariate vector, with $\beta_{j}$ being the $j$-th element of the vector $\boldsymbol{\beta}$; p is the number of covariates; and n is the size of the sample data set.
Note that we are assuming a linear regression structure given by $\mathrm{E}(Y \mid \boldsymbol{X} = \boldsymbol{x}) = \beta_{0} + \boldsymbol{x}^{\top}\boldsymbol{\beta}$, where “E” denotes the expected value of Y conditional on $\boldsymbol{X} = \boldsymbol{x}$. The glmnet package of the R software (R Development Core Team 2020) offers various kinds of regression methods for both variable selection and prediction when $0 \le \alpha \le 1$, depending on the data and problem under consideration. If $\alpha = 1$, the LASSO default option of the glmnet package is obtained with a penalty parameter $\lambda$ and carries out both parameter shrinkage and variable selection. If $\alpha = 0$, it gives a ridge regression with a penalty parameter $\lambda$. We choose $\alpha$ to optimize the elastic net. In effect, this shrinks certain coefficients and sets some equal to zero for sparse selection. The model structure needs variable selection to form a subset of predictors. The elastic net addresses the $p \gg n$ problem, in which the number of parameters is much greater than the size of the sample utilized in the modeling. The elastic net is suitable when the variables form groups that contain highly correlated covariates. The variable selection is combined with the model structure to help in increasing the accuracy. When a group of covariates is highly correlated, the model automatically selects those with greater predictive power.
Random forest: To build a set of decision trees, two methods are combined: bootstrap aggregating and random subspace, with the resulting set of decision trees being used for classification. Random forest comprises many decision tree classifiers, and its output groups are obtained from the classifications of the individual trees. In building each single decision tree, two random selection processes are used in the algorithm. The first one is the random selection of training samples, and the second one is a random selection of sample features. Then, the decision trees are built. Classification results are finalized by the weighted voting method. Random forest is a joint classifier that builds independent and non-matching decision trees depending on randomization. It can be defined as $\{h(\boldsymbol{x}, \Theta_{k})\}$, for $k = 1, 2, \ldots$, where now $\Theta_{k}$ are mutually independent random parameter vectors, and $\boldsymbol{x}$ are the data input values (Provost et al. 2016). Every decision tree uses a random vector of parameters, features of samples selected randomly, and a subset of sample data chosen randomly as a training set. The structure of the random forest is stated in Algorithm 2 (Bradter et al. 2013).
XGBoost: The extreme gradient boosting model (Chen and Guestrin 2016) is an ensemble ML system for tree boosting. The XGBoost model is used for regression and classification problems based on the boosting technique. At each step, the XGBoost model produces a weak learner and accumulates it into the model. The XGBoost model builds on the gradient boosting machine, which follows the gradient direction of a loss function (Friedman 2002). The difference between random forest and the gradient boosting machine is that decision trees are built independently in a random forest, whereas gradient boosting adds each new tree to complement the previously built ones (Luckner et al. 2017). XGBoost is a method where new models are built to predict the residuals of prior models, and the models are then added together to make the final prediction. The XGBoost model is established as
\[ \mathrm{Obj} = L + \Omega, \]
where $L$ is the loss function that controls the predictive power and $\Omega$ is the regularization part that controls simplicity and over-fitting.
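The idea that each new tree is fitted to the residuals of the trees built before it can be sketched with one-split regression trees (stumps) and a squared-error loss. This is plain gradient boosting for illustration only, not the XGBoost algorithm, which additionally uses second-order information and a regularization term; the function names, the learning rate, and the toy data are assumptions.

```python
def fit_stump(x, r):
    """Least-squares regression stump for residuals r; needs at least two distinct x values."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= threshold]
        right = [ri for xi, ri in zip(x, r) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lm) ** 2 for ri in left)
               + sum((ri - rm) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda v: lm if v <= threshold else rm


def boost(x, y, rounds=50, learning_rate=0.3):
    """Additive model: start from the mean, then repeatedly fit stumps to residuals."""
    base = sum(y) / len(y)
    stumps = []

    def predict(v):
        return base + learning_rate * sum(s(v) for s in stumps)

    for _ in range(rounds):
        # Residuals of the current ensemble are the targets of the next weak learner.
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict


# Toy step-shaped data, made up for the example.
x_toy = [1, 2, 3, 4, 5, 6]
y_toy = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
model = boost(x_toy, y_toy)
```

In a random forest, by contrast, each tree would be fitted to a bootstrap sample of the original responses rather than to the residuals of earlier trees.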
Model evaluation metrics
Evaluation metrics are used to measure the quality of statistical models and indicate whether a model is performing adequately. There are numerous kinds of evaluation metrics to test a model. The metrics employed in this investigation are defined next. The mean absolute error (MAE) is established as
\[ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|, \]
which indicates the difference between the observed ($y_{i}$) and forecasted ($\hat{y}_{i}$) values for case i. The MAE is obtained from the data set by averaging the absolute differences.
The mean absolute percentage error (MAPE) is stated by
\[ \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|, \]
which measures the size of the error in percentage form. The mean absolute scaled error (MASE) (Hyndman and Khandakar 2008) is formulated as
\[ \mathrm{MASE} = \frac{\mathrm{MAE}}{\dfrac{1}{T - 1} \sum_{t=2}^{T} |y_{t} - y_{t-1}|}, \]
which is used as a measure of the accuracy of forecasts, with the denominator being the mean absolute error of the one-step “naive forecast method” on the training set (here defined as $t = 1, \ldots, T$), that utilizes the true value from the prior period as the forecast: $\hat{y}_{t} = y_{t-1}$. The symmetric mean absolute percentage error (SMAPE) is expressed by
\[ \mathrm{SMAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{|\hat{y}_{i} - y_{i}|}{(|y_{i}| + |\hat{y}_{i}|)/2}, \]
which is based on the percentage error. The root of the mean square error (RMSE) is given by
\[ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}, \]
which corresponds to the square root of the mean square error. A well-known indicator of adequacy for statistical models is the determination coefficient (R-square) that is defined in our case as
\[ R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i=1}^{n} (y_{i} - \bar{y})^{2}}, \]
which tells us how well the fitted values compare to the true values, where $\bar{y}$ is the mean of the true values. After the trained model is obtained, we check the model goodness of fit. The values between zero and one are interpreted as percentages. Note that 0% shows that the response variable is not explained around its mean based on the covariates, whereas 100% indicates that the model describes the response variability around its mean perfectly. As the value of R-square of the trained model increases, the model better fits the data set.
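The evaluation metrics described above can be computed with a few short functions. The sketch below is a plain-Python illustration under stated assumptions, not the R routines used in the study: the percentage metrics are scaled by 100, SMAPE uses the average of the absolute observed and forecasted values in its denominator, and the numbers in the example are made up.

```python
import math


def mae(y, f):
    return sum(abs(a - b) for a, b in zip(y, f)) / len(y)


def mape(y, f):
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, f)) / len(y)


def smape(y, f):
    return 100.0 * sum(abs(b - a) / ((abs(a) + abs(b)) / 2)
                       for a, b in zip(y, f)) / len(y)


def rmse(y, f):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, f)) / len(y))


def mase(y, f, train):
    # Scale by the in-sample MAE of the one-step naive forecast on the training set.
    naive = sum(abs(train[t] - train[t - 1])
                for t in range(1, len(train))) / (len(train) - 1)
    return mae(y, f) / naive


def r_square(y, f):
    mean = sum(y) / len(y)
    return 1.0 - (sum((a - b) ** 2 for a, b in zip(y, f))
                  / sum((a - mean) ** 2 for a in y))


# Made-up numbers: four training observations, two test observations and forecasts.
train = [100.0, 110.0, 120.0, 130.0]
observed = [140.0, 150.0]
forecast = [138.0, 153.0]
scores = {
    "MAE": mae(observed, forecast),
    "MAPE": mape(observed, forecast),
    "MASE": mase(observed, forecast, train),
    "SMAPE": smape(observed, forecast),
    "RMSE": rmse(observed, forecast),
    "R-square": r_square(observed, forecast),
}
```

Because MASE is scaled by the naive forecast error, values below one indicate a forecast that beats the naive method on average.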
Results and discussion
In this section, we use the automatic and ML models to forecast the COVID-19 confirmed cases in SAARC countries. The forecasting is done by employing the models that are most suitable for this framework. First, the data set is preprocessed. Then, we split the data into a set to train the models and another set (10 days) to test the models. ML models, such as GLMNet and XGBoost, and automatic models, such as ARIMA and Prophet, are used. These models are trained for confirmed cases and evaluated on the mentioned metrics. The flowchart of the proposed methodology is shown in Fig. 2.
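The split into training and test sets can be sketched as follows, assuming that the 10 test days are the last 10 days of the study period so that time order is preserved; this assumption, and the function name split_last_days, are not stated in the text.

```python
from datetime import date, timedelta


def split_last_days(series, test_days=10):
    """Hold out the final `test_days` observations, keeping time order."""
    if not 0 < test_days < len(series):
        raise ValueError("test_days must be positive and smaller than the series length")
    return series[:-test_days], series[-test_days:]


# Daily dates of the study period, from 31 January 2020 to 19 July 2020.
start, end = date(2020, 1, 31), date(2020, 7, 19)
days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
train_days, test_days = split_last_days(days)
```

A random split would leak future information into training, which is why the hold-out block is taken from the end of the series.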
Fig. 2
Flowchart of the proposed methodology to forecast COVID-19 cases in SAARC countries
To compare the models, we evaluate their performance based on the metrics mentioned (MAE, MAPE, MASE, RMSE, SMAPE, and R-square). We consider the confirmed COVID-19 cases of SAARC countries. Up to 19 July 2020, confirmed cases in the world were 14,443,127, and in SAARC countries 1,605,501, of which 1,033,502 are recovered cases, 36,270 are deaths, and 535,728 are active cases. The status of COVID-19 in SAARC countries is presented in Table 1, which includes the information of confirmed cases, deaths, recovered cases, and active cases till 19 July 2020. India has the highest number of COVID-19 active cases with 26,348 deaths. In Sri Lanka, the Maldives, and Nepal, the number of deaths is below 50, whereas 204,276 patients were recovered from COVID-19 in Pakistan. The number of active cases is 90,265 in Bangladesh. The time-series plot of global and SAARC countries is shown in Fig. 3, whereas Fig. 4 displays these plots for each country.
Table 1
COVID-19 confirmed, recovered, active cases and deaths in SAARC countries
| SAARC countries | Confirmed cases | Recovered cases | Deaths | Active cases |
|---|---|---|---|---|
| India | 107,878 | 677,856 | 26,838 | 374,088 |
| Pakistan | 263,496 | 204,276 | 5568 | 53,652 |
| Bangladesh | 204,525 | 11,164 | 2618 | 90,265 |
| Afghanistan | 35,475 | 23,634 | 118 | 10,660 |
| Nepal | 1750 | 11,637 | 40 | 5825 |
| Maldives | 2930 | 2354 | 15 | 561 |
| Sri Lanka | 2704 | 2023 | 1 | 670 |
| Bhutan | 87 | 80 | 0 | 7 |
Source: Worldometer: www.worldometers.info/coronavirus/#countries; accessed on 10 July 2020
Fig. 3
Time-series plot of confirmed COVID-19 cases of (a) global; and (b) SAARC countries
Fig. 4
Time-series plot of COVID-19 confirmed cases per month of (a) Afghanistan; (b) Bangladesh; (c) Bhutan; (d) India; (e) Maldives; (f) Nepal; (g) Pakistan; and (h) Sri Lanka
According to the objective of our investigation, we compare forecasting models based on selection criteria. The best models are selected based on MAE, MAPE, MASE, SMAPE, RMSE, and R-square. The performance of the ARIMA model is best in most countries; see Table 2. Figures 5–13 show the forecasted plots for all models used in this study. The description of the best models and their forecast values are provided in Table 2. For SAARC countries, the ARIMA(1, 2, 0)(1, 0, 2) model with drift is suitable for forecasting (Table 3). Based on this model, the predicted cases reach 1,315,265 in July 2020 (Table 2). The residual values of the best prediction model were the smallest ones, which shows that this model is the best fit for the data; see Table 2. Note that the random forest model is excluded from forecasting during the analysis because of its poor fit. Model accuracy values for the individual SAARC member countries are shown in Tables 4, 5, 6, 7, 8, 9, 10 and 11.
Table 2
Calibration of confirmed COVID-19 cases in SAARC countries
| Country | Model | Date | True value | Predicted value | Residual |
|---|---|---|---|---|---|
| SAARC | ARIMA(1,2,0)(1,0,2) | 2020-07-01 | 1,301,878 | 1,284,324 | 17,553,579 |
| | | 2020-07-01 | 133,598 | 1,315,265 | 20,715,579 |
| India | ARIMA(0,2,0)(2,0,0) | 2020-07-01 | 820,916 | 789,137.4 | 31,778,590 |
| | | 2020-07-01 | 849,553 | 811,178.8 | 38,374,206 |
| Pakistan | GLMNet | 2020-07-01 | 24,635 | 248,331. | − 1,980,171 |
| | | 2020-07-01 | 248,87 | 250,528.9 | − 1,656,942 |
| Bangladesh | ARIMA(0,2,1)(2,0,0) | 2020-07-01 | 178,443 | 178,030. | 41,285,772 |
| | | 2020-07-01 | 181,129 | 181,051.6 | 7,739,907 |
| Afghanistan | ARIMA(0,2,3) | 2020-07-01 | 34,366 | 34,464. | − 9,823,814 |
| | | 2020-07-01 | 34,45 | 34,759 | − 3,079,588 |
| Nepal | GLMNet | 2020-07-01 | 16,649 | 17,998.5 | − 1,349,482 |
| | | 2020-07-01 | 16,719 | 18,304.6 | − 1,585,618 |
| Maldives | ARIMA(0,2,1) | 2020-07-01 | 2617 | 2,532,938 | 84,062,256 |
| | | 2020-07-01 | 2664 | 2,550,500 | 113,499,722 |
| Sri Lanka | ARIMA(0,2,2) | 2020-07-01 | 2454 | 2,127,718 | 326,281,545 |
| | | 2020-07-01 | 251 | 2,135,839 | 375,160,992 |
| Bhutan | Prophet | 2020-07-01 | 8 | 8,179,123 | 0.2 |
| | | 2020-07-01 | 8 | 8,249,074 | − 0.5 |
Fig. 5
Plot of forecasting for confirmed COVID-19 cases in SAARC countries
Fig. 6
Plot of forecasting for confirmed COVID-19 cases in Afghanistan
Fig. 7
Plot of forecasting for confirmed COVID-19 cases in Bangladesh
Fig. 8
Plot of forecasting for confirmed COVID-19 cases in Bhutan
Fig. 9
Plot of forecasting for confirmed COVID-19 cases in India
Fig. 10
Plot of forecasting for confirmed COVID-19 cases in the Maldives
Fig. 11
Plot of forecasting for confirmed COVID-19 cases in Nepal
Fig. 12
Plot of forecasting for confirmed COVID-19 cases in Pakistan
Fig. 13
Plot of forecasting for confirmed COVID-19 cases in Sri Lanka
Table 3
Accuracy results of confirmed COVID-19 cases in SAARC countries
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(1,2,0)(1,0,2) | 9935.48 | 0.80 | 0.3 | 0.81 | 11,541.89 | 1.00 |
| Prophet | 13,140.0 | 1.05 | 0.4 | 1.05 | 16,953.09 | 1.00 |
| GLMNet | 91,372.40 | 7.44 | 2.89 | 7.78 | 100,978.10 | 0.99 |
| Random forest | 589,028.17 | 48.94 | 18.65 | 65.35 | 601,966.30 | 0.86 |
| XGBoost | 13,136.24 | 1.05 | 0.4 | 1.05 | 16,879.37 | 1.00 |
Table 4
Accuracy results of confirmed COVID-19 cases in Afghanistan
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(0,2,3) | 123.36 | 0.37 | 0.46 | 0.37 | 157.59 | 0.98 |
| Prophet | 547.86 | 1.6 | 2.03 | 1.57 | 862.79 | 0.94 |
| GLMNet | 1261.75 | 3.76 | 4.68 | 3.84 | 1323.77 | 0.97 |
| Random forest | 10,011.0 | 29.9 | 37.09 | 35.46 | 10,239.54 | 0.55 |
| XGBoost | 620.17 | 1.8 | 2.30 | 1.79 | 918.94 | 0.93 |
Table 5
Accuracy results of COVID-19 data in Bangladesh
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(0,2,1)(2,0,0) | 358.69 | 0.20 | 0.31 | 0.20 | 397.33 | 1.00 |
| Prophet | 3384.84 | 1.90 | 0.42 | 1.92 | 3772.35 | 0.99 |
| GLMNet | 6507.96 | 3.67 | 2.89 | 3.74 | 6622.81 | 1.00 |
| Random forest | 67,951.44 | 38.4 | 18.65 | 47.56 | 68,036.40 | 0.65 |
| XGBoost | 3445.9 | 1.93 | 0.42 | 1.96 | 3825.89 | 0.98 |
Table 6
Accuracy results of confirmed COVID-19 cases in Bhutan
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(0,2,1)(1,0,0) | 1.18 | 1.44 | 0.56 | 1.45 | 1.23 | 0.78 |
| Prophet | 0.35 | 0.43 | 0.23 | 0.43 | 0.38 | 0.94 |
| GLMNet | 0.70 | 0.85 | 1.89 | 0.86 | 0.74 | 0.81 |
| Random forest | 27.44 | 33.47 | 17.7 | 40.20 | 27.45 | 0.53 |
| XGBoost | 0.54 | 0.65 | 1.23 | 0.66 | 0.57 | 0.69 |
Table 7
Accuracy results of confirmed COVID-19 cases in India
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(0,2,0)(2,0,0) | 17,421.50 | 2.25 | 0.70 | 2.28 | 20,776.11 | 1.00 |
| Prophet | 47,200.57 | 6.00 | 1.90 | 6.30 | 61,119.96 | 0.94 |
| GLMNet | 78,360.15 | 10.2 | 3.15 | 10.90 | 89,010.92 | 0.99 |
| Random forest | 396,671.74 | 53.35 | 15.94 | 73.46 | 406,965.77 | 0.84 |
| XGBoost | 47,179.87 | 6.00 | 1.90 | 6.30 | 61,076.94 | 0.94 |
Table 8
Accuracy results of confirmed COVID-19 cases in the Maldives
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(0,2,1) | 41.7 | 1.6 | 1.28 | 1.64 | 54.25 | 0.93 |
| Prophet | 198.48 | 7.70 | 6.07 | 8.16 | 241.38 | 0.97 |
| GLMNet | 101.43 | 3.95 | 3.10 | 4.05 | 113.98 | 0.90 |
| Random forest | 596.44 | 23.47 | 18.23 | 26.68 | 605.13 | 0.86 |
| XGBoost | 204.26 | 7.93 | 6.24 | 8.40 | 243.36 | 0.97 |
Table 9
Accuracy results of confirmed COVID-19 cases in Nepal
| Model | MAE | MAPE | MASE | SMAPE | RMSE | R-square |
|---|---|---|---|---|---|---|
| ARIMA(1,2,2) | 1327.80 | 8.1 | 5.43 | 7.64 | 1658.72 | 0.93 |
| Prophet | 940.6 | 5.76 | 3.85 | 5.52 | 1151.74 | 0.95 |
| GLMNet | 641.05 | 3.9 | 2.62 | 3.79 | 834.10 | 0.94 |
| Random forest | 7357.50 | 46.08 | 30.10 | 59.95 | 7390.10 | 0.79 |
| XGBoost | 848.74 | 5.18 | 3.47 | 4.99 | 1062.04 | 0.95 |
Table 10
Accuracy results of confirmed COVID-19 cases in Pakistan
Model          MAE         MAPE   MASE   SMAPE  RMSE        R-square
ARIMA(3,2,2)   4361.95     1.80   1.46   1.78   5375.71     1.00
Prophet        21,688.37   8.93   7.24   8.3    28,595.20   0.96
GLMNet         1354.18     0.56   0.45   0.56   1676.14     1.00
Random forest  104,982.54  44.29  35.03  57.28  106,490.52  0.74
XGBoost        21,830.37   8.99   7.28   8.36   28,686.44   0.96
Table 11
Accuracy results of confirmed COVID-19 cases in Sri Lanka
Model          MAE     MAPE   MASE   SMAPE  RMSE    R-square
ARIMA(0,2,2)   101.0   4.18   2.04   4.44   173.65  0.70
Prophet        711.7   30.88  14.39  41.55  924.08  0.84
GLMNet         196.30  8.37   3.97   9.1    280.20  0.69
Random forest  426.34  19.03  8.6    21.30  462.16  0.74
XGBoost        699.0   30.30  14.14  40.66  912.89  0.85
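The six accuracy measures reported in Tables 4 to 11 can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `forecast_metrics`, the toy series, and two definitional choices are assumptions made here for illustration. MASE is assumed to be scaled by the one-step naive forecast error on a training series, and R-square is assumed to be the squared Pearson correlation between actual and predicted values (the coefficient of determination is another common definition, and the text does not state which one was used).

```python
import math


def forecast_metrics(actual, predicted, train):
    """Return MAE, MAPE, MASE, SMAPE, RMSE and R-square for one forecast.

    Assumptions (not taken from the paper): MAPE and SMAPE are in percent,
    MASE is scaled by the one-step naive error on `train`, and R-square is
    the squared Pearson correlation. `actual` must not contain zeros.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    smape = 100 * sum(
        abs(e) / ((abs(a) + abs(p)) / 2)
        for e, a, p in zip(errors, actual, predicted)
    ) / n

    # Mean absolute one-step change in the training series (naive forecast error).
    naive_mae = sum(
        abs(train[i] - train[i - 1]) for i in range(1, len(train))
    ) / (len(train) - 1)
    mase = mae / naive_mae

    # Squared Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r_square = cov ** 2 / (var_a * var_p)

    return {"MAE": mae, "MAPE": mape, "MASE": mase,
            "SMAPE": smape, "RMSE": rmse, "R-square": r_square}


# Toy cumulative-case series, invented purely for illustration.
train = [100, 110, 125, 145]
actual = [170, 200, 230]
predicted = [165, 205, 240]
metrics = forecast_metrics(actual, predicted, train)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under the squared-correlation definition, R-square can be close to 1 even when MAE and RMSE are large, because it measures how closely the forecast tracks the shape of the series rather than its level.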
In Afghanistan, the ARIMA model forecasts more confirmed cases than the other models. Table 4 reports that the model selection criteria are best for the ARIMA(0, 2, 3) model.
The performance of the time-series models for confirmed COVID-19 cases in Bangladesh is provided in Table 5, where the ARIMA(0, 2, 1)(2, 0, 0) model with drift is identified as the best one. The predicted number of cases in Bangladesh is 181,051, which is very close to the true value of 181,129 on 12 July 2020 (Zhao et al. 2022).
[Figures: plots of forecasting for confirmed COVID-19 cases in the SAARC countries and, individually, in Afghanistan, Bangladesh, Bhutan, India, the Maldives, Nepal, Pakistan, and Sri Lanka.]
The Prophet time series model is suitable for prediction in Bhutan; see Table 6. The predicted value is almost the same as the true value for Bhutan. With the Prophet model, the values of RMSE, MAE, and MAPE are the smallest among the compared models for Bhutan. Still, in this country, the number of confirmed cases is 87 with no deaths due to COVID-19, which indicates limited spread.
For India, the model accuracy reported in Table 7 indicates that the ARIMA model outperforms the Prophet, XGBoost, and GLMNet methods. The predicted number of confirmed cases is 811,178.8, which is very close to the true value. Medical facilities were improved in the last four months, and the recovery rate is greater than 60%.
Table 8 indicates that the COVID-19 case series in the Maldives requires differencing to achieve stationarity, as the selected model is ARIMA(0, 2, 1). The predicted value for 12 July 2020 is close to the true value. After Bhutan, the Maldives has the fewest confirmed COVID-19 cases among the SAARC countries. For the Bhutan model, residual values were less than 1%.
For the Nepal data set, the GLMNet ML model is more accurate than the time series models; see Table 9. Its values of RMSE (834.10) and MAE (641.05) are the lowest among all compared models.
Table 10 provides the model accuracy for confirmed COVID-19 cases in Pakistan. As in Nepal, the GLMNet model performs better than the time-series models in Pakistan. The values for the random forest method differ greatly from those of the other models. The GLMNet forecast for Pakistan gives 250,528 confirmed COVID-19 cases on 12 July 2020. Still, given its spread, Pakistan needs to increase the testing facilities, as the recovery rate is 97.35%.
Table 11 reports that the ARIMA model accuracy is much better in Sri Lanka. The residual values are small and show that the ARIMA model is the best fitted.
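The country-by-country comparison can be retraced with a short Python sketch that picks, for each country, the model with the lowest RMSE. The RMSE values are copied from Tables 4 to 11. Ranking on RMSE alone is an assumption made here for brevity, since the article compares several metrics, and the names `rmse_by_country` and `best_model` are illustrative.

```python
# RMSE per country and model, copied from Tables 4 to 11.
rmse_by_country = {
    "Afghanistan": {"ARIMA": 157.59, "Prophet": 862.79, "GLMNet": 1323.77,
                    "Random forest": 10239.54, "XGBoost": 918.94},
    "Bangladesh": {"ARIMA": 397.33, "Prophet": 3772.35, "GLMNet": 6622.81,
                   "Random forest": 68036.40, "XGBoost": 3825.89},
    "Bhutan": {"ARIMA": 1.23, "Prophet": 0.38, "GLMNet": 0.74,
               "Random forest": 27.45, "XGBoost": 0.57},
    "India": {"ARIMA": 20776.11, "Prophet": 61119.96, "GLMNet": 89010.92,
              "Random forest": 406965.77, "XGBoost": 61076.94},
    "Maldives": {"ARIMA": 54.25, "Prophet": 241.38, "GLMNet": 113.98,
                 "Random forest": 605.13, "XGBoost": 243.36},
    "Nepal": {"ARIMA": 1658.72, "Prophet": 1151.74, "GLMNet": 834.10,
              "Random forest": 7390.10, "XGBoost": 1062.04},
    "Pakistan": {"ARIMA": 5375.71, "Prophet": 28595.20, "GLMNet": 1676.14,
                 "Random forest": 106490.52, "XGBoost": 28686.44},
    "Sri Lanka": {"ARIMA": 173.65, "Prophet": 924.08, "GLMNet": 280.20,
                  "Random forest": 462.16, "XGBoost": 912.89},
}

# Assumed selection rule: the model with the smallest RMSE wins.
best_model = {
    country: min(scores, key=scores.get)
    for country, scores in rmse_by_country.items()
}

for country, model in best_model.items():
    print(f"{country}: {model} (RMSE {rmse_by_country[country][model]})")
```

With this rule, ARIMA is selected for Afghanistan, Bangladesh, India, the Maldives, and Sri Lanka, Prophet for Bhutan, and GLMNet for Nepal and Pakistan, which agrees with the selections discussed above.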
Conclusion
In this article, we have proposed and derived an autoregressive modeling framework based on machine learning and statistical methods to predict the number of COVID-19 cases in the South Asian Association for Regional Cooperation countries. As statistical methods, we have considered automatic forecasting models based on autoregressive integrated moving average and Prophet time series structures. As machine learning methods, we have employed extreme gradient boosting, generalized linear model elastic net, and random forest techniques.
The dataset consisted of time series data from 31 January 2020 to 19 July 2020 and was used for forecasting. The forecasting models were compared using several evaluation metrics: MAE, MAPE, MASE, SMAPE, RMSE, and R-square.
The results indicate that the ARIMA model is suitable for forecasting COVID-19 cases in the mentioned countries. The random forest model did not fit our dataset well, so it was excluded from forecasting. Model accuracy values were studied for individual countries. In Afghanistan, Bangladesh, India, the Maldives, and Sri Lanka, the ARIMA model was superior to the other models. In Bhutan, the Prophet time series model was suitable for projection purposes. Still, in Bhutan, the total number of confirmed cases was 87, with no deaths. In Nepal and Pakistan, the generalized linear model elastic net model was more accurate than the other models. This forecasting study can help authorities to take timely actions and to control the spread of COVID-19.
In future research, we plan to explore the projection methodology and update the dataset, using more accurate and suitable machine learning methods as well as statistical models for prediction.