
Data Analysis of Covid-19 Pandemic and Short-Term Cumulative Case Forecasting Using Machine Learning Time Series Models.

Abstract

The Covid-19 pandemic is the most significant health disaster the world has faced in the past eight months, and there is still no clear indication of when it will end. More than 31 million people have been infected worldwide so far. Predicting the Covid-19 trend has become a challenging issue. In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany and the world were obtained from the World Health Organization. The dataset consists of weekly confirmed cases and weekly cumulative confirmed cases for 35 weeks. The distribution of the data was then examined using the most up-to-date weekly Covid-19 case data, and its parameters were estimated for each candidate statistical distribution. Furthermore, a time series prediction model using machine learning was proposed to obtain the disease curve and forecast the epidemic tendency. Linear regression, multi-layer perceptron, random forest and support vector machine (SVM) methods were used. The performances of the methods were compared according to the RMSE, APE and MAPE metrics, and SVM was found to capture the trend best. According to the estimates, the global pandemic will peak at the end of January 2021, with approximately 80 million people cumulatively infected.


Keywords: Covid-19; Machine Learning; Multi-layer perceptron; Statistical distribution; Support vector machine

Year: 2020 PMID: 33281306 PMCID: PMC7698672 DOI: 10.1016/j.chaos.2020.110512

Source DB: PubMed Journal: Chaos Solitons Fractals ISSN: 0960-0779 Impact factor: 5.944

Introduction

COVID-19, which emerged in December 2019, spread all over the world after February 2020. The virus passed from animals to humans and is transmitted from person to person via airborne droplets [8]. In a short time, it became the biggest epidemic the world has seen in the last century. According to data from the World Health Organization (WHO), the number of cases worldwide is increasing rapidly. Despite the measures taken, the virus has not yet been stopped because it is highly infectious. Since COVID-19 first emerged, various trend analysis studies have been conducted. Fanelli and Piazza [6] analyzed the temporal dynamics of the coronavirus disease 2019 outbreak in China, Italy and France. Yesilkanat [20] estimated near-future case numbers for 190 countries using the random forest algorithm. Sahin and Sahin [12] estimated the cumulative cases of COVID-19 using a fractional nonlinear grey Bernoulli model. Yadav et al. [18] analyzed the spread of COVID-19 using machine learning methods. Kaxiras et al. [9] used a Susceptible-Infectious-Removed (SIR) population model to describe the COVID-19 pandemic. Wang et al. [15] studied the prediction of Covid-19 with a logistic model and machine learning techniques. Wieczorek et al. [17] presented a neural-network-powered COVID-19 spread forecasting model. Das [4] estimated incidences of COVID-19 using the Box-Jenkins method for the period July 12-September 11, 2020. Shastri et al. [13] performed time series forecasting of Covid-19 using deep learning models for India and the USA. Feroze [7] forecasted the patterns of COVID-19 using Bayesian structural time series models. In this study, unlike previous works, the distribution of the weekly Covid-19 case increase was examined; the largest extreme value distribution was found to fit the global and Germany data, and the smallest extreme value distribution was found to fit the USA data.
Afterwards, the disease curves were obtained and predictions were made for the weekly cumulative cases of the world, Germany and the USA with linear regression, multi-layer perceptron, random forest and support vector machine (SVM) time series methods. The performances of the methods were compared according to the RMSE, APE and MAPE metrics, and SVM was found to be the best-fitting method for forecasting Covid-19 data. Short-term cumulative case forecasting was then applied using all methods for the world, Germany and the USA. The paper is organized as follows. Section two introduces the machine learning time series methods. Section three gives a detailed analysis of the dataset. Section four presents the evaluation metrics, results and discussion. Finally, section five summarizes the conclusions of the study.

Machine learning for time series forecasting

Numerous approaches are used in the literature to model time series, such as Auto-Regressive Integrated Moving Average (ARIMA) and Fourier transforms. These are univariate because of the nature of time series data. Using a single variable can be ineffective for understanding the time series, so it may be necessary to convert the data to a multivariate form. Machine learning can be used for this purpose [1]. Machine learning time series methods take the time parameter into account and evaluate other inputs based on time. The time feature is divided into sub-components such as daily, weekly, monthly and quarterly periods, days of the week, weekends, weekdays, N-period lagged dates, minimum, maximum, average, powers of time, products of time and lagged variables. Hidden patterns in the time series can be captured with these components. As in machine learning methods in general, the time series data are divided into two groups: training data and test data. The behavior of the data is learned from the training data and a general model is created. This model is then tested using the test data. Machine learning time series methods can yield successful results on nonlinear data. The machine learning methods used in this study are discussed in the following subsections.

Random forest

Random Forest (RF) is a popular supervised learning technique employed for regression and classification [3]. It is an ensemble learning method in which each base learner is a decision tree [19]. With N decision trees, N outputs are obtained, and the final estimate is determined by voting over these outputs. RF is simple to use and easy to run in parallel [2].
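The aggregation step can be illustrated with a short sketch. It assumes the N per-tree outputs are already available and only shows how they are combined: a majority vote for class labels and, as is usual for regression forests, an average for numeric outputs. The function names and the sample outputs are hypothetical and are not taken from the study.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_outputs):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_outputs).most_common(1)[0][0]

def aggregate_regression(tree_outputs):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_outputs)

# Made-up outputs of N = 5 hypothetical trees (not data from the study).
class_votes = ["rise", "rise", "fall", "rise", "fall"]
value_votes = [10.0, 12.0, 11.0, 9.0, 13.0]

majority_label = aggregate_classification(class_votes)
averaged_value = aggregate_regression(value_votes)
```

Because each tree produces its output independently, the per-tree work can be distributed across processors before this final combination step.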

Linear regression

Linear regression is the most basic and simple approach used to find the relationship between variables consisting of numerical data. In this method, the trend of the data is found and estimation is performed accordingly. However, all independent variables must be defined [5].
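As a minimal sketch of trend estimation with a single independent variable, the following fits a straight line by ordinary least squares and extrapolates one step ahead. The data are synthetic and `fit_line` is a hypothetical helper; the study itself uses many more input variables.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Synthetic series that follows y = 2x + 1 exactly (illustration only).
weeks = [1, 2, 3, 4, 5]
values = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(weeks, values)
next_week_estimate = slope * 6 + intercept  # extrapolate the trend to week 6
```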

Multilayer perceptron

Artificial neural networks (ANN) work by imitating the learning ability of the human brain. They give better results than statistical methods for longer-term predictions and can also model nonlinear data. However, because of their black-box nature, it is not known how they model the data [5]. A multilayer perceptron (MLP) is a feed-forward ANN model consisting of three layers: input, hidden and output. MLP is trained with the back-propagation (supervised) learning technique and can discriminate data that are not linearly separable [11]. The output of an MLP is calculated as follows:

$$ y = f_o\left( b_o + \sum_{j} w_j \, f_H\left( b_j + \sum_{i} h_{ij} X_i \right) \right) $$

where $y$ is the output, $X$ is the input vector, $h_{ij}$ is the weight matrix, $b_j$ is the bias vector and $f_H$ is the activation function of the hidden layer; $w_j$, $b_o$ and $f_o$ are the weight vector, the bias scalar and the activation function of the output layer, respectively [10].
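Using the symbols defined above (X, hij, bj, fH, wj, bo, fo), a forward pass through a single-hidden-layer MLP can be sketched as follows. The tanh hidden activation, the identity output activation and all numeric weights are assumptions chosen for illustration; the source does not state which activation functions or weights were used.

```python
import math

def mlp_forward(x, h, b_hidden, w, b_out,
                f_hidden=math.tanh, f_out=lambda z: z):
    """Forward pass of a one-hidden-layer MLP.

    x        : input vector X
    h        : weight matrix, h[i][j] connects input i to hidden unit j
    b_hidden : hidden-layer bias vector (b_j)
    w        : output-layer weight vector (w_j)
    b_out    : output-layer bias scalar (b_o)
    """
    hidden = []
    for j in range(len(w)):
        z_j = b_hidden[j] + sum(h[i][j] * x[i] for i in range(len(x)))
        hidden.append(f_hidden(z_j))
    return f_out(b_out + sum(w[j] * hidden[j] for j in range(len(w))))

# Illustrative values only (assumed, not from the study).
x = [1.0, 2.0]
h = [[0.5, -0.5],
     [0.25, 0.25]]
b_hidden = [0.0, 0.0]
w = [1.0, 1.0]
b_out = 0.5

y = mlp_forward(x, h, b_hidden, w, b_out)
```

Training by back propagation would adjust h, b_hidden, w and b_out to reduce the prediction error; that step is omitted here.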

Support vector machines

The support vector machine (SVM) is a machine learning technique employed for classification and regression. Instead of fitting a nonlinear function directly, it tries to perform the regression with a linear function in a high-dimensional space [14]. In SVM, the prediction is calculated by the following formula:

$$ f(x) = w^{T} x + b $$

where $x$ is the input vector, $b$ is the bias and $w$ is the weight vector [1].

Covid-19 dataset and distribution analysis

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany and the world were obtained from the World Health Organization website [16]. The dataset consists of weekly confirmed cases and weekly cumulative confirmed cases for 35 weeks. Descriptive statistics of weekly confirmed cases are given in Table 1. Over the eight-month period, a global average of 881.504 new cases was seen weekly. The global standard deviation is close to the mean, and the same holds for the USA and Germany. The positive skewness for Germany means that its distribution has a longer tail on the right.

Table 1

Descriptive statistics of weekly confirmed cases

	Mean	Std.Dev.	Minimum	Maximum	Skewness	Kurtosis
Global	881.504	714.238	1.928	2.177.544	0,30	-1,35
USA	29.274	21.516	0	66.963	-0,01	-1,05
Germany	988	1.172	0	4.615	2,03	3,98

Germany has positive kurtosis, whereas the global and USA data have negative kurtosis. As shown in Fig. 1, the distribution for Germany, with its large kurtosis, has tails heavier than those of the normal distribution.
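The quantities in Table 1 can be computed along the following lines. This sketch uses simple moment-based definitions of skewness and excess kurtosis; the statistical package used in the study may apply small-sample corrections, so its values need not match these formulas exactly. The sample series are synthetic and the function names are hypothetical.

```python
from statistics import mean, stdev

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (zero for a normal distribution)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

# Synthetic weekly counts (illustration only, not the WHO data).
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 1.0, 10.0]

summary = {
    "mean": mean(symmetric),
    "std_dev": stdev(symmetric),
    "minimum": min(symmetric),
    "maximum": max(symmetric),
    "skewness": skewness(symmetric),
    "kurtosis": excess_kurtosis(symmetric),
}
```

A positive skewness, as for `right_tailed`, corresponds to a longer right tail, while a negative excess kurtosis, as for `symmetric`, indicates tails lighter than those of the normal distribution.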

Fig. 1

Histogram of weekly cases for global, Germany and USA

In addition, distribution analysis was performed for the weekly case data. The results of the goodness-of-fit test for weekly global cases are given in Table 2. The fit of the data to the Lognormal, Normal, Exponential, 2-Parameter Exponential, Weibull, 3-Parameter Weibull, Largest Extreme Value, Smallest Extreme Value, Logistic and Gamma distributions was investigated.

Table 2

Goodness of fit test for weekly global cases

Distribution	AD	P
Normal	1,259	<0,005
Lognormal	2,860	<0,005
Exponential	2,594	<0,003
2-Parameter Exponential	1,665	0,012
Weibull	2,039	<0,010
3-Parameter Weibull	1,667	<0,005
Smallest Extreme Value	1,564	<0,010
Largest Extreme Value	1,153	<0,010
Gamma	1,798	<0,005
Logistic	1,232	<0,005

In Table 2, the AD value is the Anderson-Darling test statistic. It measures the deviations between the fitted line of the distribution and the data points. The p-value indicates how plausible it is that the data follow the distribution. To choose the best distribution, the AD value should be low and the p-value high. The probability plots of the four distributions with the lowest AD values are given in Fig. 2. As seen in Fig. 2, Largest Extreme Value is the distribution that best fits the global weekly data.
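To make the AD value concrete, the sketch below computes the Anderson-Darling statistic for one candidate distribution only, the normal, with its mean and standard deviation estimated from the sample. The study evaluates several distributions and also reports p-values; neither is reproduced here. The two sample series are synthetic and the function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(values):
    """Anderson-Darling statistic A^2 against a normal distribution
    whose mean and standard deviation are estimated from the sample."""
    n = len(values)
    dist = NormalDist(mean(values), stdev(values))
    xs = sorted(values)
    total = 0.0
    for i in range(1, n + 1):
        f_low = dist.cdf(xs[i - 1])   # F(x_(i))
        f_high = dist.cdf(xs[n - i])  # F(x_(n+1-i))
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n

# Synthetic samples: one close to normal, one strongly right-skewed.
near_normal = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
skewed = [math.exp(v) for v in near_normal]

ad_near = anderson_darling_normal(near_normal)
ad_skewed = anderson_darling_normal(skewed)
```

A lower statistic means the sorted data lie closer to the fitted distribution, which is why the distribution with the lowest AD value is preferred.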

Fig. 2

Probability plot for weekly global cases

Estimated parameters of the distributions for the global weekly data are given in Table 3. Using these parameters, similar data can be generated from the distributions or used for estimation.

Table 3

Estimates of distribution parameters for weekly global cases

Distribution	Location	Shape	Scale	Threshold
Normal	881.504	-	714.238	-
Lognormal	12,80563	-	1,93553	-
Exponential	-	-	881.504	-
2-Parameter Exponential	-	-	905.445	-23.941,9
Weibull	-	0,83125	820.779	-
3-Parameter Weibull	-	1,01121	910.043	-24.993
Smallest Extreme Value	1.241.170	-	668.737	-
Largest Extreme Value	542.319	-	580.244	-
Gamma	-	0,68580	1.285.360	-
Logistic	844.683	-	430.713	-
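As a sketch of how similar data could be generated from the fitted parameters, the following draws samples from a largest extreme value (Gumbel) distribution by inverting its cumulative distribution function, F(x) = exp(-exp(-(x - location) / scale)). The location and scale are read from the Largest Extreme Value row of Table 3, interpreting the dots as thousands separators; the function name and the random seed are arbitrary choices for illustration.

```python
import math
import random

LOCATION = 542_319.0  # Table 3, Largest Extreme Value, location
SCALE = 580_244.0     # Table 3, Largest Extreme Value, scale

def sample_largest_extreme_value(location, scale, rng):
    """Draw one sample via the inverse CDF: x = location - scale * ln(-ln(u))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return location - scale * math.log(-math.log(u))

rng = random.Random(0)
samples = [sample_largest_extreme_value(LOCATION, SCALE, rng)
           for _ in range(10_000)]

sample_mean = sum(samples) / len(samples)
# Mean of a Gumbel distribution: location + Euler-Mascheroni constant * scale.
theoretical_mean = LOCATION + 0.5772156649 * SCALE
```

The theoretical mean implied by these parameters is close to the weekly global average reported in Table 1. Note that this distribution can also produce negative values, which have no meaning as case counts and would need to be handled before any practical use.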

Similarly, the goodness-of-fit test was performed for the weekly case data of Germany and the USA, and the probability plots are given in Figs. 3 and 4. Accordingly, the best-fitting distribution was the largest extreme value for Germany and the smallest extreme value for the USA. The smallest extreme value distribution is skewed to the left and the largest extreme value distribution is skewed to the right. This skewness can also be seen in Fig. 1 for the Germany, USA and global data.

Fig. 3

Probability plot for weekly cases in Germany

Fig. 4

Probability plot for weekly cases in USA


Short-term cumulative case forecasting

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany and the world were obtained from the World Health Organization website [16]. A time series prediction model using machine learning methods is proposed to obtain the disease curve and forecast the epidemic trend. Linear regression, multi-layer perceptron, random forest and support vector machine methods were used for forecasting. The evaluation metrics described in the subsection below are used to compare these methods.

Evaluation metrics

To compare the estimation methods used in this study, the root mean square error (RMSE), mean absolute percentage error (MAPE) and absolute percentage error (APE) metrics were used. APE measures the consistency between the original value and the predicted value. Lower values of these metrics indicate better performance. The APE, MAPE and RMSE are calculated as follows:

$$ APE_i = \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100 $$

$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} APE_i $$

$$ RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } $$

where $n$ is the number of observations, $y_i$ is the i-th observed value and $\hat{y}_i$ is the i-th estimated value.
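The three metrics can be sketched in a few lines. The version below uses the standard definitions and expresses APE and MAPE in percent, which is an assumption about the scaling; the observed and predicted series are made up for illustration.

```python
import math

def ape(observed, predicted):
    """Absolute percentage error of each observation, in percent."""
    return [abs(y - y_hat) / abs(y) * 100.0
            for y, y_hat in zip(observed, predicted)]

def mape(observed, predicted):
    """Mean of the absolute percentage errors."""
    errors = ape(observed, predicted)
    return sum(errors) / len(errors)

def rmse(observed, predicted):
    """Root of the mean squared difference between observed and predicted."""
    squared = [(y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up values (illustration only).
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]

ape_values = ape(observed, predicted)
mape_value = mape(observed, predicted)
rmse_value = rmse(observed, predicted)
```

APE and MAPE are relative measures and therefore comparable across series of different size, whereas RMSE is in the units of the data and penalizes large individual errors more heavily.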

Results and discussion

Machine learning time series methods take the time parameter into account and evaluate other inputs based on time. In this study, the time feature is divided into the following sub-components: the time index, weekly cases, 17 lagged variables of weekly cases, the square of the time index, the cube of the time index and the products of the 17 lagged variables with the time index. Thus, 38 different variables were extracted. The dataset consists of weekly cumulative confirmed cases for 35 weeks. As in machine learning methods in general, the time series data are divided into training and test sets; here, 18 weeks were used for training and 17 weeks for testing. After training and testing, the error values were computed for the linear regression, multi-layer perceptron, random forest and SVM methods and are given in Table 4. Table 4 shows that SVM provides the best overall performance for the global, Germany and USA data: it has the lowest MAPE for all three datasets and the lowest values of all three metrics for the USA, while linear regression has a lower RMSE for the global and Germany data. Overall, SVM captured the trend best across the datasets.
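The construction of the 38 variables and the 18/17 split can be sketched as follows. The source does not say how lags that reach back before the first week are handled, so filling them with 0.0 is an assumption made here, as are the function name and the synthetic input series. The prediction target (weekly cumulative cases) and the model fitting are left out.

```python
def build_features(weekly_cases, n_lags=17):
    """Build one feature row per week: time index, weekly cases,
    n_lags lagged values, t^2, t^3 and the products lag * t."""
    rows = []
    for t, cases in enumerate(weekly_cases, start=1):
        # Lag k is the value k weeks earlier; missing history is filled
        # with 0.0 (an assumption, not stated in the source).
        lags = [weekly_cases[t - 1 - k] if t - 1 - k >= 0 else 0.0
                for k in range(1, n_lags + 1)]
        row = ([float(t), float(cases)]
               + lags
               + [float(t ** 2), float(t ** 3)]
               + [lag * t for lag in lags])
        rows.append(row)
    return rows

# Synthetic weekly case counts for 35 weeks (illustration only).
weekly_cases = [float(100 * week) for week in range(1, 36)]

features = build_features(weekly_cases)

# Chronological split: first 18 weeks for training, last 17 for testing.
train_rows, test_rows = features[:18], features[18:]
```

The split is chronological rather than random, so the model is always tested on weeks that come after the ones it was trained on.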

Table 4

Comparison of the methods.

Methods	Metric	Global	Germany	USA
Random Forest	MAE	269.274,0518	955,1736	42.496,719
	MAPE	2,0726	0,4477	1,5608
	RMSE	340.926,4251	1.387,0147	53.864,6539
Linear Regression	MAE	17.081,1363	224,0337	11.745,2812
	MAPE	0,1853	0,1125	0,3331
	RMSE	21.816,6988	324,0253	16.508,4263
MLP	MAE	139.330,4846	752,3291	19.819,1675
	MAPE	0,8179	0,381	0,6497
	RMSE	223.638,9972	832.688,26	26.713,8094
SVM	MAE	19.771,7317	191,0731	5.852,0147
	MAPE	0,1247	0,0918	0,1406
	RMSE	25.825,8366	329,196	9.531,6776

When Table 4 is examined in detail, the SVM and linear regression methods have very close MAPE and RMSE values. Next comes the MLP method. The method with the worst performance is Random Forest, which also failed in predicting the future. Accordingly, estimations were made with the best three methods for the global, Germany and USA data for the 17 weeks after 18/09/2020. These estimates are shown in Figs. 5, 6 and 7.

Fig. 5

Prediction of weekly cumulative global cases

Fig. 6

Prediction of weekly cumulative cases for Germany

Fig. 7

Prediction of weekly cumulative cases for USA

Fig. 5 shows the future trend for the global cumulative data. According to the SVM forecast in Fig. 5, the global pandemic will peak at the end of January 2021, with approximately 80 million people cumulatively infected. According to the linear regression method, approximately 98 million people will be infected, and according to the MLP method, approximately 39 million. The prediction of SVM, which is the best method according to the performance metrics, seems more robust and realistic. Fig. 6 shows the future trend for the cumulative case data of Germany. According to the SVM forecast in Fig. 6, Germany will peak at the end of January 2021, with approximately 580.000 people cumulatively infected. According to the linear regression method, approximately 1 million people will be infected, and according to the MLP method, 330.000. The performance metrics indicate that the SVM estimate is more accurate. According to the SVM forecast in Fig. 7, the USA will peak at the end of January 2021, with approximately 11 million people cumulatively infected. The linear regression forecast enters a downward trend and approaches zero, which is not a realistic estimate. According to the MLP method, 6 million people will be infected. Once again, the prediction of SVM seems more realistic.

Conclusion

In this study, COVID-19 data between 20/01/2020 and 18/09/2020 for the USA, Germany and the world were analyzed. The distribution of the data was found to be the largest extreme value for the global and Germany data and the smallest extreme value for the USA data. A time series prediction model was then proposed to obtain the disease curve and forecast the epidemic trend using machine learning methods. Linear regression, multi-layer perceptron, random forest and SVM methods were used for this purpose. The performances of the methods were compared according to the RMSE, APE and MAPE criteria. The results showed that the SVM method outperformed the linear regression, multi-layer perceptron and random forest methods in modeling the Covid-19 data and could be used successfully to describe the behavior of cumulative Covid-19 data over time. With the practical application of such machine learning time series methods, further research is expected to provide healthcare professionals with the most appropriate method to control and prevent future epidemics.

CRediT authorship contribution statement

Serkan Ballı: Investigation, Conceptualization, Methodology, Formal analysis, Validation, Writing - review & editing.

Declaration of Competing Interest

The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

14 in total

1. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches.

Authors: Cole Brokamp; Roman Jandarov; M B Rao; Grace LeMasters; Patrick Ryan
Journal: Atmos Environ (1994) Date: 2016-12-01 Impact factor: 4.798

2. A comparison of three data mining time series models in prediction of monthly brucellosis surveillance data.

Authors: Nasrin Shirmohammadi-Khorram; Leili Tapak; Omid Hamidi; Zohreh Maryanaji
Journal: Zoonoses Public Health Date: 2019-07-15 Impact factor: 2.702

3. Forecasting the patterns of COVID-19 and causal impacts of lockdown in top five affected countries using Bayesian Structural Time Series Models.

Authors: Navid Feroze
Journal: Chaos Solitons Fractals Date: 2020-08-12 Impact factor: 5.944

4. Analysis on novel coronavirus (COVID-19) using machine learning methods.

Authors: Milind Yadav; Murukessan Perumal; M Srinivas
Journal: Chaos Solitons Fractals Date: 2020-06-30 Impact factor: 5.944

5. Clinical Characteristics of Coronavirus Disease 2019 in China.

Authors: Wei-Jie Guan; Zheng-Yi Ni; Yu Hu; Wen-Hua Liang; Chun-Quan Ou; Jian-Xing He; Lei Liu; Hong Shan; Chun-Liang Lei; David S C Hui; Bin Du; Lan-Juan Li; Guang Zeng; Kwok-Yung Yuen; Ru-Chong Chen; Chun-Li Tang; Tao Wang; Ping-Yan Chen; Jie Xiang; Shi-Yue Li; Jin-Lin Wang; Zi-Jing Liang; Yi-Xiang Peng; Li Wei; Yong Liu; Ya-Hua Hu; Peng Peng; Jian-Ming Wang; Ji-Yang Liu; Zhong Chen; Gang Li; Zhi-Jian Zheng; Shao-Qin Qiu; Jie Luo; Chang-Jiang Ye; Shao-Yong Zhu; Nan-Shan Zhong
Journal: N Engl J Med Date: 2020-02-28 Impact factor: 91.245

6. Analysis and forecast of COVID-19 spreading in China, Italy and France.

Authors: Duccio Fanelli; Francesco Piazza
Journal: Chaos Solitons Fractals Date: 2020-03-21 Impact factor: 5.944
