Application of a data-driven XGBoost model for the prediction of COVID-19 in the USA: a time-series study.

Zheng-Gang Fang¹, Shu-Qin Yang¹, Cai-Xia Lv¹, Shu-Yi An², Wei Wu³.

Abstract

OBJECTIVE: The COVID-19 outbreak was first reported in Wuhan, China, and has been acknowledged as a pandemic due to its rapid spread worldwide. Predicting the trend of COVID-19 is of great significance for its prevention. A comparison between the autoregressive integrated moving average (ARIMA) model and the eXtreme Gradient Boosting (XGBoost) model was conducted to determine which was more accurate for anticipating the occurrence of COVID-19 in the USA.
DESIGN: Time-series study.
SETTING: The USA was the setting for this study.
MAIN OUTCOME MEASURES: Three accuracy metrics, mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE), were applied to evaluate the performance of the two models.
RESULTS: In our study, for the training set and the validation set, the MAE, RMSE and MAPE of the XGBoost model were less than those of the ARIMA model.
CONCLUSIONS: The XGBoost model can help improve prediction of COVID-19 cases in the USA over the ARIMA model.

© Author(s) (or their employer(s)) 2022. Re-use permitted under CC BY-NC. No commercial re-use. See rights and permissions. Published by BMJ.

Keywords: COVID-19; epidemiology

Year: 2022 PMID: 35777884 PMCID: PMC9251895 DOI: 10.1136/bmjopen-2021-056685

Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 3.006

This study used the autoregressive integrated moving average and eXtreme Gradient Boosting (XGBoost) models to predict cases of COVID-19 in the USA. Data on vaccination in the USA were introduced into the XGBoost model. The seasonality of the data was considered in both models. The study period was relatively short and should be extended to better reflect the future development of COVID-19 in the USA. The XGBoost model was built based on prevaccination-induced herd immunity. Therefore, as the cases of more transmissible variants increase, the accuracy of prediction may decline.

Introduction

First detected in Wuhan, China, and subsequently spread all over the world, COVID-19 (http://COVID-19.who.int/) promises to be a defining global health event of the 21st century and has posed a severe and growing threat to public health.1 2 After the first case in the USA was identified on 20 January 2020, COVID-19 cases increased exponentially; by 11 July 2021, there had been 33 595 701 cases and 598 442 deaths.3 The majority of cases experience mild-to-moderate respiratory illness, but some have resulted in death.4 The common symptoms of COVID-19 infection are wide-ranging and include fever, cough, fatigue and sore throat.5 6 Most patients present with fever, and some have dyspnoea and extensive pneumonia infiltrates on chest CT.7 8 Given the uncertainty about when the disease will emerge and subside, short-term forecasting has become an increasingly important area of study for creating better plans and more appropriate responses. Time-series analysis is useful for understanding the association between variables by using different models and for obtaining more accurate predictions. The autoregressive integrated moving average (ARIMA) model of Box and Jenkins is among the most common analytical methods in data science. It can process both stationary and non-stationary time series and is also applicable to seasonal time series.9 However, infectious diseases are affected by many factors, and their time series usually do not conform to a linear function. Therefore, the Box-Jenkins ARIMA model handles non-linear situations poorly. In contrast, the eXtreme Gradient Boosting (XGBoost) model is a flexible machine learning method capable of dealing with the non-linearity of time series through its strong self-learning ability.
The incidence of COVID-19 has varied greatly among countries,10 and it has been noted that vaccination may play a key role in the containment of the COVID-19 pandemic.11 12 Vaccines against COVID-19 now used in the USA have demonstrated high effectiveness.13 Therefore, effective vaccines against COVID-19 will be essential to lowering morbidity and mortality. Nevertheless, to date, no researchers have included vaccination data in an XGBoost model to forecast the incidence of COVID-19. In this study, ARIMA and XGBoost models were developed to fit and forecast COVID-19 cases in the USA. In addition, we determined which of the two models is the better predictor of COVID-19 in the USA by comparing their fit and forecast accuracies.

Methods

Data sources

Data on COVID-19 cases3 and vaccination13 in the USA were collected from the website of the Centers for Disease Control and Prevention of the USA (https://COVID-19.cdc.gov). The daily data on COVID-19 in the USA from 13 December 2020 to 30 June 2021 were split into training (13 December 2020 to 16 June 2021) and validation sets (17 June 2021 to 30 June 2021). The models were established on training data and tested on the validation set.

Seasonal ARIMA model

ARIMA models have often been used for the prediction of infectious diseases, such as dengue,14 haemorrhagic fever with renal syndrome (HFRS)15 and malaria.16 Because it accounts for time trends, periodic changes and random fluctuations, ARIMA has become a common model in data science. ARIMA is optimal for data containing trend, cyclicity and seasonality.17 In our study, an ARIMA (p, d, q) (P, D, Q) [S] model was built, in which p represents the autoregression (AR) order, d the difference order and q the moving average (MA) order. S denotes the period of the seasonal trend, and P, D and Q are the corresponding seasonal terms of the seasonal ARIMA. Parameters (P, D, Q) and (p, d, q) are determined according to the partial autocorrelation function (PACF) and autocorrelation function (ACF). Parameter S is chosen according to the periodic length of the seasonality. The seasonal model can be presented as follows:

$$x_t = T_t + S_t + R_t$$

where $T_t$, $S_t$ and $R_t$ denote the trend, seasonal effect and random effect at time $t$, respectively. We stabilised the time series by differencing, and an augmented Dickey-Fuller (ADF) test was used to confirm this stabilisation. The corrected Akaike's information criterion (AICc) measures the goodness of fit of the ARIMA model, and the model with the minimum AICc value was regarded as optimal. Finally, the Ljung-Box test was used to examine whether the residual sequences were white noise.
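The differencing step described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the helper `difference` and the toy series are hypothetical, and the weekly period of 7 is taken from the seasonal differencing reported in the Results. The sketch shows why one regular difference plus one lag-7 difference removes a linear trend and a fixed weekly pattern, and how many observations are lost in the process.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy series (not the study data): linear trend plus a fixed weekly pattern.
weekly_pattern = [5, 3, 0, -2, -4, -1, 4]
toy = [100 + 2 * t + weekly_pattern[t % 7] for t in range(28)]

regular = difference(toy, lag=1)       # first-order difference (d = 1)
seasonal = difference(regular, lag=7)  # seasonal difference with S = 7 (D = 1)

# Each difference shortens the series by its lag: 1 + 7 observations in total.
lost = len(toy) - len(seasonal)
print(lost)
print(set(seasonal))  # the deterministic toy series collapses to a constant
```

On real data the differenced series is of course not constant; stationarity would still need to be checked with a test such as the ADF test mentioned above.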

XGBoost model

The XGBoost model is a decision tree-based machine learning algorithm that is widely used in data science. By combining the results from multiple individual trees, it can yield accurate predictions.18 The model also ranks the input features by importance. Moreover, XGBoost builds a stronger learner from weaker ones and has other benefits, such as avoiding overfitting, effectively dealing with missing values and reducing running time through parallel and distributed calculation.19 The objective function of the XGBoost model is as follows:

$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$$

where $n$ denotes the number of training data, $x_i$ and $y_i$ are the feature vector and its label at the $i$th instance, $\hat{y}_i^{(t-1)}$ represents the prediction of the $i$th instance at the $(t-1)$th iteration, $l$ is a loss function that calculates the difference between the label and the final forecast plus the new tree output, $f_t$ denotes a new tree that classifies the $i$th instance with $x_i$, and $\Omega(f_t)$ denotes the regularisation term that penalises the complexity of the new tree.20

In the process of building the XGBoost model, the lag terms of the data are the input items used for prediction. Given the existence of a seasonal trend, we built seven lag terms (1-day to 7-day lag) as input items. To transform the week variable into a common format, a one-hot encoding technique was used, which converts categorical variables into numerical values in machine learning preprocessing. The week variable is one-hot encoded into a matrix whose columns correspond to the presence of Monday, Wednesday, Thursday, Friday, Saturday and Sunday. In this matrix, each row marks the day of the week of one observation with a 1 in the corresponding column and 0 in the other columns. We also built a numerical variable running from 1 to the number of observations to analyse the effect of the time trend. The hyperparameters, including SubsampRate, ColsampRate, Depth, MinChild and eta, should be adjusted to optimise the XGBoost model.
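The input construction described above (seven lag terms, a one-hot day-of-week variable and a running time index) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' R code: `build_features`, the toy counts and the start date are hypothetical, and the sketch uses all seven weekday columns, whereas the text above lists six.

```python
from datetime import date, timedelta

def build_features(start, cases, n_lags=7):
    """Build one feature row per day that has n_lags earlier observations."""
    rows, targets = [], []
    for t in range(n_lags, len(cases)):
        lags = [cases[t - k] for k in range(1, n_lags + 1)]  # lag 1 .. lag n_lags
        weekday = (start + timedelta(days=t)).weekday()      # Monday = 0
        one_hot = [1 if weekday == d else 0 for d in range(7)]
        trend = t + 1                                        # 1 .. number of observations
        rows.append(lags + one_hot + [trend])
        targets.append(cases[t])
    return rows, targets

# Toy counts (not the study data); 4 January 2021 was a Monday.
toy_cases = [120, 90, 80, 150, 160, 170, 140, 125, 95, 85]
X, y = build_features(date(2021, 1, 4), toy_cases)

print(len(X), len(X[0]))  # rows kept, features per row
print(X[0])
```

The first seven days have no complete set of lags, so they drop out of the design matrix; this is the same mechanism by which lagged inputs reduce the number of usable training observations.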

Model selection

In our study, three accuracy metrics were applied to evaluate the performance of the models: mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE), as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$$

In these equations, $n$, $\hat{y}_i$ and $y_i$ are the number of observations, the forecasted value and the actual value, respectively. MAE is the mean of the absolute prediction errors, that is, the average absolute difference between the actual and predicted values. RMSE is the square root of the average squared error and is frequently used to evaluate the difference between the predicted and actual values. MAPE is the average absolute difference between the actual and predicted values expressed as a percentage of the actual value. The closer MAPE, RMSE and MAE are to zero, the more accurate the prediction.
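The three metrics defined above are straightforward to compute. The following Python sketch uses the standard definitions with hypothetical function names and toy numbers; it is not the code used in the study.

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent; undefined if any actual value is 0."""
    return 100.0 * sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)

# Toy values (not the study data).
actual = [100, 200, 400]
predicted = [110, 180, 400]

print(mae(actual, predicted))
print(rmse(actual, predicted))
print(mape(actual, predicted))
```

MAE and RMSE are expressed in the units of the data (here, daily cases), so they are not comparable between the training and validation periods when case levels differ; MAPE is scale-free, which is why it is the natural metric for comparing across periods.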

Data analysis

In our study, all data were processed in R V.4.1.0 software. We used the xts, TSstudio and tseries packages to analyse the data and the ggplot2 and dygraphs packages to draw diagrams. The proposed models were established via the forecast and xgboost packages (see R codes in online supplemental material 1).

Patient and public involvement

No patients were involved.

Results

Characteristics of COVID-19 cases

As of 11 July 2021, the total number of COVID-19 cases had reached 33 595 701 in the USA. Based on the plot of daily cases, a study period from 13 December 2020 to 30 June 2021 was chosen. First, the series had to be made stationary. The time-series graph in figure 1A shows that the data have a downward trend and fluctuate greatly, and the ADF test also confirmed non-stationarity. After Box-Cox transformation, the original data became more stable with less fluctuation (figure 1B),21 and we then decomposed the transformed series. The Box-Cox-transformed data, seasonal trend, time trend and remainder are shown in figure 1C. The diagrams show that there is a seasonal pattern and a trend. Moreover, we plotted the relationship between the transformed series and its lagged series (figure 2). To stabilise the time series, regular and seasonal differencing were applied: we conducted first-order differencing and lag-7 (seasonal) differencing to address the instability caused by the time trend and seasonal factors.
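For readers unfamiliar with the Box-Cox transformation mentioned above, the following Python sketch shows the standard one-parameter form and its inverse. The value of the transformation parameter used in the study is not reported in this text, so `lam` below is purely illustrative, as are the function names and toy values.

```python
import math

def box_cox(values, lam):
    """Standard one-parameter Box-Cox transform; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def inverse_box_cox(values, lam):
    """Map transformed values back to the original scale."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]

# Toy daily counts (not the study data) with an illustrative lambda.
toy = [4.0, 9.0, 16.0, 25.0]
transformed = box_cox(toy, lam=0.5)
restored = inverse_box_cox(transformed, lam=0.5)

print(transformed)
print(restored)
```

Forecasts made on the transformed scale have to be passed through the inverse transform before error metrics are computed against the observed case counts.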

Figure 1

(A) Daily cases of COVID-19 in the USA from 13 December 2020 to 30 June 2021. (B) Contrast of the primary and transformed series of COVID-19. (C) Decomposition of the transformed series of COVID-19.

Figure 2

Difference correlations in the first seven lags.

Forecasting the cases of COVID-19 by the seasonal ARIMA model

After first-order and seasonal differencing, the Box-Cox-transformed COVID-19 data became stationary (figure 3), and the ADF test also supported stationarity (t=−5.6143, p<0.01). This result indicated that the parameters d and D are both 1 in the seasonal ARIMA model.

Figure 3

Cases of COVID-19 in the USA after transformation and differences.

The plots of ACF (figure 4A) and PACF (figure 4B) showed the temporal dependence of COVID-19 cases, and thus we built a seasonal ARIMA model with non-seasonal (p, d, q) and seasonal (P, D, Q) parameters. After differencing, the peaks at lags 1, 4, 7 and 14 in figure 4A indicated that the maximum q and Q values should be set to 4 and 2, respectively. Similarly, significant peaks at lags 1, 2 and 4 and at lags 7, 14, 21 and 28 in figure 4B indicated that the maximum p and P values should both be 4. We then found the model with the lowest AICc value via the auto.arima function. The optimal model was ARIMA (0,1,1) (0,1,1)7 (table 1), and the Ljung-Box test indicated that the residual series was white noise (p=0.6325). The time plot of the residuals, the corresponding ACF and the histogram also confirmed that the residuals from the model were white noise (figure 5). The ARIMA model performed well in the fit and forecasting of COVID-19 cases. The details are given in figure 6A.

Figure 4

(A) Autocorrelation function (ACF) and (B) partial autocorrelation function (PACF) diagrams for cases of COVID-19 in the USA after transformation and differences.

Table 1

Parameters of the ARIMA (0,1,1) (0,1,1)7 model

Series: train; model: ARIMA (0,1,1) (0,1,1)₇
	ma1	sma1
Coefficient	−0.391	−0.917
SE	0.070	0.067
2.5% confidence limit	−0.528	−1.048
97.5% confidence limit	−0.253	−0.785
AICc	128.920

AICc, Akaike’s information criterion; ARIMA, autoregressive integrated moving average.

Figure 5

The combination of residuals, the corresponding autocorrelation function (ACF) diagram, and the histogram for the autoregressive integrated moving average (ARIMA) (0,1,1) (0,1,1)7 model.

Figure 6

Fit and forecast results of (A) autoregressive integrated moving average (ARIMA) (0,1,1)(0,1,1)7 and (B) eXtreme Gradient Boosting (XGBoost) models.

Forecasting the cases of COVID-19 by the XGBoost model

In the application of the XGBoost model, the choice of hyperparameter values is crucial. We repeatedly built models with hyperparameters drawn from preset bounds and obtained the best one in the final training with 168 rounds. The hyperparameters of the optimal model were: SubSampRate=0.5, ColSampRate=0.2, Depth=4, MinChild=2 and eta=0.07. The fit and forecast results of the optimal model are shown in figure 6B.
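To relate the reported hyperparameters to the xgboost library, the sketch below collects them into a parameter dictionary. The names `subsample`, `colsample_bytree`, `max_depth`, `min_child_weight` and `eta` are genuine xgboost parameters, but the mapping from the labels used above (for example, ColSampRate to `colsample_bytree` rather than another column-sampling parameter) is an assumption, as is the squared-error objective; neither is stated in the text.

```python
# Values as reported for the optimal model.
reported = {
    "SubSampRate": 0.5,
    "ColSampRate": 0.2,
    "Depth": 4,
    "MinChild": 2,
    "eta": 0.07,
}

# Assumed correspondence between the reported labels and xgboost parameter names.
assumed_names = {
    "SubSampRate": "subsample",        # fraction of rows sampled per tree
    "ColSampRate": "colsample_bytree", # fraction of columns sampled per tree (assumption)
    "Depth": "max_depth",              # maximum tree depth
    "MinChild": "min_child_weight",    # minimum sum of instance weight in a child
    "eta": "eta",                      # learning rate
}

params = {assumed_names[label]: value for label, value in reported.items()}
params["objective"] = "reg:squarederror"  # assumption: regression on daily case counts
n_rounds = 168                            # boosting rounds reported for the final training

print(params)
```

With a low `eta` such as 0.07, each tree contributes only a small correction, which is consistent with needing on the order of a hundred or more boosting rounds.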

Models comparison

For the ARIMA model, we lost 8 observations in the training set after differencing, and only 162 observations were used for analysis. For the XGBoost model, we built seven lag terms (1-day to 7-day lag) as input terms because of the existence of seasonal trends. Accordingly, only 163 observations remained for analysis. The fit and forecast results of the two models are shown in table 2. In both the training set and the validation set, the XGBoost model had smaller values of MAE, RMSE and MAPE than the seasonal ARIMA model. It should be noted that the XGBoost model's performance on the validation set surpassed that of the seasonal ARIMA model on the same set. For the XGBoost model, the MAPE values of the training and validation sets (4.046% and 7.892%) were excellent.

Table 2

Performance of the ARIMA (0,1,1) (0,1,1)7 and XGBoost model

Model	Training set MAE	Training set RMSE	Training set MAPE (%)	Validation set MAE	Validation set RMSE	Validation set MAPE (%)
ARIMA (0,1,1) (0,1,1)₇	7061.536	13 517.664	7.996	2083.571	2633.424	15.884
XGBoost	2331.134	3500.331	4.046	962.357	1209.984	7.892

ARIMA, autoregressive integrated moving average; MAE, mean absolute error; MAPE, mean absolute percentage error; RMSE, root mean square error; XGBoost, eXtreme Gradient Boosting.

Discussion

In this paper, we developed two models (seasonal ARIMA and XGBoost) and used past data on daily COVID-19 cases in the USA to predict 14 days ahead. The fit and prediction accuracies of the proposed models were assessed by three criteria. The results show that the XGBoost model fits and forecasts COVID-19 cases in the USA better than the seasonal ARIMA model. The prediction of COVID-19 cases can help the government and the public take precautionary measures to control the further spread of COVID-19. The ARIMA model is commonly used for the prediction of time-series data, and it can capture autocorrelations in the data. The XGBoost model is a decision tree-based machine learning model that can uncover the non-linearity in the time series of COVID-19 cases. Accordingly, our models not only retain the irregular trend of the COVID-19 data but also capture the incidental fluctuation. The ARIMA model combines AR with MA terms, which is beneficial for capturing the inherent characteristics of the data and making a more exact forecast. The seasonal ARIMA model has been among the most important tools for seasonal forecasting of time series.15 22 23 Normally, more observations are lost as more differencing is applied. In our study, we used only the daily number of COVID-19 cases to build the ARIMA model. We first applied a first-order difference but found that the data did not become stationary. We then applied a seasonal difference, after which the series was stationary. Finally, the ARIMA (0,1,1) (0,1,1)7 model was selected as the optimal model with the minimum AICc. From the results of the ARIMA model, we can conclude that the model reflects the seasonality in the data on COVID-19 cases well. Nevertheless, owing to the non-linearity of the data, the MAE, RMSE and MAPE in the validation set were not good. Having started the experimental evaluation with the seasonal ARIMA model, we then applied the XGBoost model to further analyse the time series in the USA.
In current COVID-19 research, the effectiveness of vaccines against COVID-19 has been confirmed. Once vaccines have been approved for use, a sufficient supply of effective vaccines will help build herd immunity in the population.24–26 The variable importance graph (figure 7) for the XGBoost model also shows that the importance scores of the vaccine variables (fully vaccinated and at least one dose) rank second and fifth. This suggests that vaccination has played an important role in the spread of COVID-19 in the USA. Vaccination began on different dates in different countries. As of 11 July 2021, more than 158 million people in the USA were fully vaccinated and 183 million had received at least one dose against COVID-19. Based on the aforementioned evidence, in addition to the data on the daily number of COVID-19 cases, we also collected vaccination data to build the XGBoost model. The vaccination data included the daily cumulative numbers of people fully vaccinated and of people with at least one dose. The XGBoost model has already been applied in studies to predict the trend in COVID-19.18 19 27–34 Luo et al 19 used the long short-term memory and XGBoost models in the prediction of COVID-19 in the USA and assessed the ranking of features via the XGBoost model. Khan et al 31 aimed to predict the mortality rate in confirmed COVID-19 patients from 146 countries employing the XGBoost model. Ahamad et al 34 developed several machine learning algorithms and discovered that the XGBoost model could precisely predict COVID-19 trends and simultaneously select features associated with them for all ages. In this paper, the XGBoost model performed better than the seasonal ARIMA model based on the fit and forecast results, probably because vaccine variables were considered. The forecasting results showed that the MAEs of the seasonal ARIMA and XGBoost models were 2083.571 and 962.357, respectively. 
The RMSE values were 2633.424 and 1209.984, respectively. The MAPE (%) values were 15.884 and 7.892, respectively. Additionally, as shown in table 2, the accuracy metric values of the XGBoost model for the training data (2331.134, 3500.331, 4.046) and the validation data (962.357, 1209.984, 7.892) are quite small. This finding also suggests the high accuracy of the XGBoost model in the fit and forecast of COVID-19. However, new variants spreading in the USA are raising concerns about the effectiveness of currently administered vaccines.35 36 The XGBoost model is built based on prevaccination-induced herd immunity in the USA. Therefore, as the cases of more transmissible variants increase, the accuracy of prediction may decrease.

Figure 7

Feature importance for COVID-19 cases in the USA.

The time series of epidemics are always characterised by instability and volatility. Therefore, differencing and transformation are required to render them stationary. The ARIMA model cannot be applied to data that cannot be converted into a stationary series, whereas the XGBoost model does not require stationarity. Hence, compared with the traditional ARIMA model, the XGBoost model can be applied more broadly in practice. We first developed a seasonal ARIMA model and, following the principle of this model, used the past data on daily cases of COVID-19 to predict 14 days ahead with the forecast function in the forecast package. For the XGBoost model, the one-step ahead prediction method was used. One-step ahead prediction uses actual past data to obtain a 1-day prediction. For example, actual data up to and including time t are used as the model inputs to forecast the daily cases at time t+1, and actual data up to and including time t+1 are used as the model inputs to forecast the daily cases at time t+2. By repeating the one-step prediction, we obtained the 14-day forecast values. To a certain extent, the ARIMA model is more useful in real-world applications because it can forecast over a longer period. The XGBoost model can only use one-step ahead prediction, especially when impact factors are used as inputs of the model. New data are needed to rebuild the model to better reflect the future development of COVID-19 in the USA. The prediction of COVID-19 cases by these models can help the government formulate effective measures and policies to deal with COVID-19.
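The difference between one-step ahead prediction and forecasting over a longer horizon can be made concrete with a small Python sketch. The predictor `mean_of_last_week` is a hypothetical stand-in for a fitted model, and `recursive_multi_step` is shown only for contrast; it is one generic way to forecast several days ahead without new observations, not a description of how the forecast package computes ARIMA forecasts.

```python
def one_step_ahead(history, future_actuals, predict_next):
    """Forecast each future day from actual data up to the previous day."""
    known = list(history)
    forecasts = []
    for actual in future_actuals:
        forecasts.append(predict_next(known))
        known.append(actual)  # the observed value is available before the next step
    return forecasts

def recursive_multi_step(history, horizon, predict_next):
    """Forecast `horizon` days ahead, feeding each forecast back in as an input."""
    known = list(history)
    forecasts = []
    for _ in range(horizon):
        value = predict_next(known)
        forecasts.append(value)
        known.append(value)   # no new observations are used
    return forecasts

def mean_of_last_week(series):
    """Hypothetical stand-in model: average of the last seven values."""
    window = series[-7:]
    return sum(window) / len(window)

# Toy data (not the study data).
history = [10] * 7
future = [17, 10]

print(one_step_ahead(history, future, mean_of_last_week))
print(recursive_multi_step(history, 2, mean_of_last_week))
```

In the one-step scheme the second forecast already reflects the observed jump to 17, whereas the recursive scheme never sees it. This is why one-step ahead errors are generally not directly comparable with errors from a forecast made 14 days in advance.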

Conclusions

Based on data on COVID-19 cases in the USA, we developed the XGBoost and seasonal ARIMA models, with which we conducted a 14-day, out-of-sample prediction. We obtained the fit and forecast results and compared the performance of the two models using the MAE, RMSE and MAPE values. We concluded that the XGBoost model leads to a notable improvement in the fit and prediction accuracy.

33 in total

1. Time series analysis of human brucellosis in mainland China by using Elman and Jordan recurrent neural networks.

Authors: Wei Wu; Shu-Yi An; Peng Guan; De-Sheng Huang; Bao-Sen Zhou
Journal: BMC Infect Dis Date: 2019-05-14 Impact factor: 3.090

2. Estimation of COVID-19 prevalence in Italy, Spain, and France.

Authors: Zeynep Ceylan
Journal: Sci Total Environ Date: 2020-04-22 Impact factor: 7.963

3. Clinical and Laboratory Predictors of In-hospital Mortality in Patients With Coronavirus Disease-2019: A Cohort Study in Wuhan, China.

Authors: Kun Wang; Peiyuan Zuo; Yuwei Liu; Meng Zhang; Xiaofang Zhao; Songpu Xie; Hao Zhang; Xinglin Chen; Chengyun Liu
Journal: Clin Infect Dis Date: 2020-11-19 Impact factor: 9.079

4. A novel coronavirus outbreak of global health concern.

Authors: Chen Wang; Peter W Horby; Frederick G Hayden; George F Gao
Journal: Lancet Date: 2020-01-24 Impact factor: 79.321

5. Clinical Characteristics of Coronavirus Disease 2019 in China.

Authors: Wei-Jie Guan; Zheng-Yi Ni; Yu Hu; Wen-Hua Liang; Chun-Quan Ou; Jian-Xing He; Lei Liu; Hong Shan; Chun-Liang Lei; David S C Hui; Bin Du; Lan-Juan Li; Guang Zeng; Kwok-Yung Yuen; Ru-Chong Chen; Chun-Li Tang; Tao Wang; Ping-Yan Chen; Jie Xiang; Shi-Yue Li; Jin-Lin Wang; Zi-Jing Liang; Yi-Xiang Peng; Li Wei; Yong Liu; Ya-Hua Hu; Peng Peng; Jian-Ming Wang; Ji-Yang Liu; Zhong Chen; Gang Li; Zhi-Jian Zheng; Shao-Qin Qiu; Jie Luo; Chang-Jiang Ye; Shao-Yong Zhu; Nan-Shan Zhong
Journal: N Engl J Med Date: 2020-02-28 Impact factor: 91.245

6. COVID-19 mortality risk assessment: An international multi-center study.

Authors: Dimitris Bertsimas; Galit Lukin; Luca Mingardi; Omid Nohadani; Agni Orfanoudaki; Bartolomeo Stellato; Holly Wiberg; Sara Gonzalez-Garcia; Carlos Luis Parra-Calderón; Kenneth Robinson; Michelle Schneider; Barry Stein; Alberto Estirado; Lia A Beccara; Rosario Canino; Martina Dal Bello; Federica Pezzetti; Angelo Pan
Journal: PLoS One Date: 2020-12-09 Impact factor: 3.240

7. Computational Intelligence-Based Model for Mortality Rate Prediction in COVID-19 Patients.

Authors: Irfan Ullah Khan; Nida Aslam; Malak Aljabri; Sumayh S Aljameel; Mariam Moataz Aly Kamaleldin; Fatima M Alshamrani; Sara Mhd Bachar Chrouf
Journal: Int J Environ Res Public Health Date: 2021-06-14 Impact factor: 3.390

8. Development of new hybrid model of discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models in application to one month forecast the casualties cases of COVID-19.

Authors: Sarbjit Singh; Kulwinder Singh Parmar; Jatinder Kumar; Sidhu Jitendra Singh Makkhan
Journal: Chaos Solitons Fractals Date: 2020-05-11 Impact factor: 5.944

9. Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study.

Authors: Fei Zhou; Ting Yu; Ronghui Du; Guohui Fan; Ying Liu; Zhibo Liu; Jie Xiang; Yeming Wang; Bin Song; Xiaoying Gu; Lulu Guan; Yuan Wei; Hui Li; Xudong Wu; Jiuyang Xu; Shengjin Tu; Yi Zhang; Hua Chen; Bin Cao
Journal: Lancet Date: 2020-03-11 Impact factor: 79.321

10. Geographic Differences in COVID-19 Cases, Deaths, and Incidence - United States, February 12-April 7, 2020.

Authors:
Journal: MMWR Morb Mortal Wkly Rep Date: 2020-04-17 Impact factor: 17.586