
Prediction of new active cases of coronavirus disease (COVID-19) pandemic using multiple linear regression model.

Smita Rath¹, Alakananda Tripathy², Alok Ranjan Tripathy³.

Abstract

INTRODUCTION AND AIMS: The COVID-19 pandemic, which originated in the city of Wuhan, China, has severely affected the health, socio-economic and financial conditions of countries around the world. India is one of the affected countries, with thousands of people becoming infected every day. In this paper, daily statistics of people affected by the disease are analysed to predict the next days' trend in active cases in Odisha as well as in India.
MATERIAL AND METHODS: A valid global data set is collected from the WHO daily statistics, and the correlations among the total confirmed, active, deceased and positive cases are stated in this paper. Regression models, namely Linear Regression and Multiple Linear Regression, are applied to the data set to visualize the trend of the affected cases.
RESULTS: A comparison of the Linear Regression and Multiple Linear Regression models is performed; the model scores (R²) tend to be 0.99 and 1.0, which indicates a strong prediction model for forecasting active cases in the coming days. Using the Multiple Linear Regression model on the data available in July, 52,290 active cases are forecast for 15 August in India and 9,358 active cases in Odisha, if the situation continues in the same way.
CONCLUSION: These models achieved remarkable accuracy in forecasting COVID-19 cases. A strong correlation factor determines the relationship between the dependent variable (active) and the independent variables (positive, deceased, recovered).


Keywords: Coronavirus; Correlation coefficient; India; Linear regression; Multiple linear regression; Odisha


Year: 2020 PMID： 32771920 PMCID： PMC7395225 DOI： 10.1016/j.dsx.2020.07.045

Source DB: PubMed Journal: Diabetes Metab Syndr ISSN： 1871-4021

Introduction

The first case of the COVID-19 pandemic in India was reported on January 30, 2020. COVID-19 is the coronavirus disease caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2). The novel coronavirus first originated in a wet market in Wuhan, a city in China [1]. It has wreaked havoc on humanity, claiming 523,011 lives worldwide according to the World Health Organization [2]. The virus spreads among people most often through small droplets released by coughing, sneezing and talking in close contact [3]. Rather than travelling long distances through the air, the droplets fall onto the ground or onto surfaces. The basic symptoms of COVID-19 are fever, cough, shortness of breath, loss of smell or taste, and fatigue [4]. Other symptoms include breathing difficulty and chest pain. To prevent the spread of the virus, a number of measures are being taken, such as personal hygiene, frequent hand washing with soap and water, wearing face masks and social distancing. To prevent transmission of the virus, many countries have imposed shutdowns and lockdowns. COVID-19 has had a huge impact in India. According to the World Health Organization report [2], India had 793,802 confirmed cases of the virus as on 11 July, and the total number of samples tested so far is more than one crore. The number of fatalities due to the COVID-19 pandemic is 21,604 to date. As per the Government of India report [5], the worst affected states and union territories are Maharashtra with 2,30,599 cases and 9,667 deaths, Tamil Nadu with 1,26,581 cases, followed by Delhi with 1,07,051 cases. India declared a nationwide lockdown to stop the exponential growth of infection seen in other countries such as Italy [6]. The nationwide lockdown was imposed in order to flatten the infection curve in India.
The focus of the paper lies in estimating the daily active cases, or new confirmed COVID-19 cases, using a regression model that will be helpful in forecasting the next day's scenario for the country. The objective was to identify the relations among the data collected each day, thereby contributing to reliable data, and to project a forecasting model for India as well as Odisha. We felt this was of utmost importance because it would help to clarify and prepare a potential plan of action.

Materials and methods used

The whole data set was collected from the WHO site, https://covid19india.org and https://covidindia.org/odisha/, for the daily new positive cases, active cases, deceased and recovered cases, in a CSV file covering March 22, 2020 to July 4, 2020. The daily COVID-19 data for Odisha and India form continuous series in which the active cases depend on the other variables, as confirmed by the correlation values in Table 1 and Table 2. Correlation analysis aims at calculating and understanding the strength of a linear or nonlinear relationship between two continuous variables. Correlation coefficients take values ranging from −1 (negative correlation) through 0 (uncorrelated) to +1 (positive correlation). The sign of the coefficient (positive or negative) determines the direction of the relationship. The absolute value shows the strength of the linear relationship, which here is very close to +1 (Table 1, Table 2).
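As a rough illustration of the correlation coefficient described above, the following sketch computes the Pearson coefficient using only the Python standard library. The function name `pearson` and the daily counts are hypothetical and are not taken from the paper's data set or code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily counts, for illustration only (not the paper's data).
positive = [10, 20, 35, 50, 80]
active = [8, 15, 27, 40, 61]
falling = [61, 40, 27, 15, 8]

r_pos = pearson(positive, active)    # near +1: the two series rise together
r_neg = pearson(positive, falling)   # negative: one series falls as the other rises
```

A value near +1, as reported for the paper's tables, means the two series move together almost linearly; the sign alone gives the direction.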

Table 1

Correlation table of Odisha daily Covid-19 cases.

Table 2

Correlation table of India daily Covid-19 cases.

Initially, data cleaning is performed on the two data sets to remove any missing values. A correlation analysis is then performed on the data sets using Python in the Spyder IDE, launched from the Anaconda Navigator app. Next, a linear regression model is used to evaluate the relative effect of daily positive cases on active cases, for Odisha as well as for India. The key goal of linear regression is to fit a straight line to the data that forecasts Y from X, where Y is the total number of daily active cases and X is the total number of positive cases. The least squares method is commonly used to estimate the intercept and slope parameters that define the line. Fig. 1 and Fig. 2 below show the average peak values of active cases in India and Odisha.
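The least squares estimates of intercept and slope mentioned above have a simple closed form. The sketch below is a minimal standard-library version; `fit_line` and the sample points are hypothetical, chosen to lie exactly on a known line so the result is easy to check by hand.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (intercept, slope) for y ~ intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx                      # covariance over variance of x
    intercept = mean_y - slope * mean_x    # the line passes through the means
    return intercept, slope

# Hypothetical points lying exactly on y = 3 + 0.5 * x.
x = [0, 2, 4, 6, 8]
y = [3, 4, 5, 6, 7]
intercept, slope = fit_line(x, y)
```

With real, noisy data the fitted line no longer passes through every point; it is the line that minimises the sum of squared vertical distances.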

Fig. 1

Daily Active Cases of India showing the average values in curves.

Fig. 2

Daily Active Cases of Odisha showing the peak with average value.

The model can be expressed as in Eq. (1):

$Y = \beta + \beta_0 X + \varepsilon$ (1)

where $Y$ and $X$ are the dependent and independent variables, $\beta$ is the intercept, $\beta_0$ is the regression parameter (slope) and $\varepsilon$ is the random error. A limitation of linear regression is that it only explores the relation between the mean of the output variable and the input variables. Just as the mean is not a full description of a single variable, linear regression is not a full description of the relationships among variables. Therefore, the various factors are analysed using a Multiple Linear Regression (MLR) model, in which the dependent (target) variable depends on several independent variables. A regression equation involving multiple variables can be written as:

$Y = \beta + \beta_0 X_1 + \beta_1 X_2 + \beta_2 X_3 + \varepsilon$

where $Y$ is the target variable and $X_1, X_2, X_3$ are the independent variables; $\beta$ is the y-intercept, $\beta_0, \beta_1, \beta_2$ are the coefficients and $\varepsilon$ is the error term.
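To make the multiple regression idea concrete, here is a minimal sketch that fits an intercept and three coefficients by solving the normal equations with the standard library only. The helpers `solve` and `fit_mlr`, the input rows and the generating rule are all hypothetical. In practice a numerical library would normally be used, and the paper does not state which routine the authors called.

```python
def solve(a, b):
    """Solve the square system a.w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_mlr(rows, y):
    """Least squares fit of y on several predictors; returns [intercept, coef_1, ...]."""
    design = [[1.0] + list(r) for r in rows]         # leading 1 carries the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical (positive, recovered, deceased) rows; the target follows a known rule.
rows = [(10, 2, 0), (20, 5, 1), (35, 12, 1), (50, 20, 3), (80, 41, 4), (90, 60, 7)]
target = [5 + 2 * p - r + 3 * d for p, r, d in rows]
weights = fit_mlr(rows, target)   # recovers roughly [5, 2, -1, 3]
```

Because the hypothetical target is an exact linear function of the inputs, the fit recovers the generating coefficients up to floating-point error; real data would leave a nonzero residual.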

Results

The two data sets are first split into an 80% training set and a 20% testing set, and Linear Regression is then trained on the 80% set. The number of daily active cases is predicted from the daily positive cases. The model generates the coefficients used to predict the number of active cases on the test set, as shown in Table 3. As explained above, the regression fit selects the intercept and slope that give the line best fitting the data.
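The 80/20 split described above can be sketched as follows. The paper does not say whether the split was random or chronological, so this version simply assumes time order is kept. The helper `split_80_20` is hypothetical, and the record count of 105 is only the number of calendar days from March 22 to July 4, 2020 inclusive.

```python
def split_80_20(records, train_fraction=0.8):
    """Split an ordered list into a leading training part and a trailing test part."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

# Hypothetical stand-in for 105 daily records (day indices only).
days = list(range(1, 106))
train, test = split_80_20(days)
```

Keeping the most recent days for testing mimics forecasting, where the model is judged on dates it has not seen; a random split would instead mix earlier and later days in both parts.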

Table 3

Values obtained after training with linear regression prediction model.

Data set	Intercept	Coefficient	Score (R²)	Mean Absolute Error (MAE) ×10⁻⁶	Mean Squared Error (MSE) ×10⁻⁶	Root Mean Squared Error (RMSE) ×10⁻⁶
Odisha	−31.97450729	0.6876260	0.995588	73.8606	11320.1564	106.3962238
India	202497.14638	0.5714703	0.974855	245838.76	74851765386.71	273590.5067

Bold indicates R-squared or Coefficient of Multiple Determination.

So, it can be said that for every one-unit increase in positive cases, active cases increase by about 0.69 units in Odisha and by about 0.57 units in India. The R² scores of 0.995588 (Odisha) and 0.974855 (India) indicate a strong predictor model. A comparison of actual and predicted values in both data sets, using 25 records from the data, is visualized as bar graphs in Fig. 3 and Fig. 4 below.
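The error measures and the R² score reported for the models follow standard definitions, sketched below with the standard library. The function `regression_metrics` and the four actual and predicted values are hypothetical and serve only to show the arithmetic.

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, MSE, RMSE and R^2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_a = sum(actual) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)   # total variation in the data
    ss_res = sum(e * e for e in errors)               # variation left unexplained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": 1.0 - ss_res / ss_tot}

# Hypothetical values, for illustration only.
actual = [100, 120, 150, 170]
predicted = [102, 118, 149, 173]
metrics = regression_metrics(actual, predicted)
```

RMSE is always the square root of MSE, which gives a quick consistency check on any reported pair of values.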

Fig. 3

Visualization of Actual vs predicted values using Linear Regression Model in Odisha COVID-19 Cases.

Fig. 4

Visualization of Actual vs predicted values using Linear Regression Model in India COVID-19 Cases.

Linear regression involving several independent variables is called multiple linear regression. The steps for multiple linear regression are nearly the same as those for simple linear regression; the distinction lies in the estimation. It can be used to find out which factor has the greatest impact on the forecast output and how the independent factors are interrelated. Here again the whole data set is split into training and test sets to perform multiple linear regression. The inputs are the daily positive, recovered and deceased cases, which are used to predict the daily number of active cases. A relation between these variables can be derived using the correlation factors in Table 1 and Table 2. Table 4 gives the MAE, MSE, RMSE, intercept, score and coefficients for the predictor model.

Table 4

Values obtained after training with multiple linear regression prediction model.

Data set	Intercept	Coefficient (β_o, β_1, β_2)	Score (R²)	Mean Absolute Error (MAE)	Mean Squared Error (MSE)	Root Mean Squared Error (RMSE)
Odisha	−21.972463	[0.78035157, −0.45427124, 13.57489689]	0.9985	45.5734759e-06	3826.5455326712213e-06	61.85907801342679e-06
India	−3.259629011154175e-09	[-1,1,1]	1.0	2.6540334374658414e-09	7.735857208018085e-18	2.7813409010795647e-09

Bold indicates R-squared or Coefficient of Multiple Determination.

The R² value shows the Multiple Linear Regression model to be more accurate than the Linear Regression model. From Table 4, an equation for predicting the next day's active cases is established for each data set using the listed intercept and coefficients. A visualization of 25 records in terms of actual vs predicted values is shown in the bar graphs of Fig. 5 and Fig. 6 below, which show the closeness between them.
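As a hedged illustration of how such a prediction equation is evaluated, the sketch below plugs the Odisha intercept and coefficients printed in Table 4 into a linear formula. The coefficient order (positive, recovered, deceased) is an assumption based on the order in which the inputs are named in the text, since the table does not state it. The function `predict_active` and the input counts are hypothetical.

```python
# Intercept and coefficients as printed for Odisha in Table 4.
INTERCEPT = -21.972463
COEFFICIENTS = (0.78035157, -0.45427124, 13.57489689)

def predict_active(positive, recovered, deceased):
    """Evaluate the fitted linear equation for one day's inputs."""
    # ASSUMPTION: coefficients are ordered (positive, recovered, deceased).
    b_pos, b_rec, b_dec = COEFFICIENTS
    return INTERCEPT + b_pos * positive + b_rec * recovered + b_dec * deceased

# Hypothetical input counts, not taken from the data set.
estimate = predict_active(positive=500, recovered=300, deceased=2)
```

If the true coefficient order differs from the assumed one, the same function applies after reordering `COEFFICIENTS`.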

Fig. 5

Visualization of Actual vs predicted values using Multiple Linear Regression Model in Odisha COVID-19 Cases.

Fig. 6

Visualization of Actual vs predicted values using Multiple Linear Regression Model in India COVID-19 Cases.

The forecast values for the next few days, obtained with the above prediction model, are shown in Fig. 7 and Fig. 8.

Fig. 7

Forecast of next days of Odisha COVID-19 cases.

Fig. 8

Forecast of next days of India COVID-19 cases.


Discussion

India, and Odisha as one of its states, is now at a critical point, as shown in Fig. 7 and Fig. 8 above. As in the case of polio, surveillance plays a key role in the battle against COVID-19. Accordingly, following the government's decision, support has been raised for enhancing the effectiveness of control and response at the federal, district and block levels, for cluster containment operations, and for reinforcing real-time data gathering. India has already taken effective steps through full-scale initiatives. Statistical models are important techniques for analysing infectious disease data in real time. In this paper we used the Linear and Multiple Linear Regression models to evaluate the epidemic data of a region of India (Odisha) and of India as a country, after a detailed investigation into earlier applications of these models. Syazali et al. [7] examined the influence of volume, quality of goods and brand name on consumers' buying value using Multiple Linear Regression analysis. Salleh et al. [8] discuss Multiple Linear Regression (MLR) for inferring gene regulatory networks (GRN) from gene expression data while avoiding inferring an indirect interaction as a direct one, and show the effectiveness of MLR in dealing with cascade errors. Uyanık and Güler [9] discuss Multiple Linear Regression and examine the values under assumptions such as normality, linearity and the absence of extreme values. Three models, namely an artificial neural network, an adaptive neuro-fuzzy inference system and multiple linear regression, were used by Khademi et al. [10] to predict the overall strength of recycled aggregate concrete. Hosseinzadeh et al. [11] use an Artificial Neural Network and Multiple Linear Regression to forecast the recovery of nutrients from solid waste under different vermicompost treatments. Luu, von Meding, and Mojtahedi [12] study Multiple Linear Regression-TOPSIS for predicting disasters from data.
Similarly, Du, Hu, and Buttar [13] studied the relationship between the mechanical properties of the tea stem and their impact factors, using the MLR technique, to improve the picking efficiency of the tea plucking machine. Kadam et al. [14] use an Artificial Neural Network and Multiple Linear Regression to predict the fitness of ground water for drinking in the Shivganga river basin, located on the eastern slope of the Western Ghats, India. Xu and Yan [15] suggest Quasi-Monte Carlo combined with multiple linear regression (QMC-MLR) to solve the calculation of probabilistic load flow (PLF). To reduce the number of accidents, Jomnonkwao, Uttra, and Ratanavaraha [16] provide a plan that requires future forecast data obtained from regression models. Yuchi et al. [17] use Multiple Linear Regression and a random forest model to predict the particulate matter that increases death and disease. The present prediction model anticipates the situation in the coming days, so that effective measures can be strengthened to flatten the curve. The forecast of 9,358 active cases in Odisha (lower bound 8,582, upper bound 10,134) and 52,290 active cases in India (lower bound 48,711, upper bound 55,868), shown in Figs. 7 and 8, indicates upward growth in COVID-19 cases. The similarity with the studies above lies in using the MLR technique for future forecasting and in examining how it is influenced by the different factors of the daily analysis: positive, recovered and deceased cases. Strength of the Model: The strength of the model is that its R² value came to 1.0, which indicates a strong predictor model that takes all the factors into consideration, as shown in the statistical ANOVA measure in Table 5. Analysis of variance (ANOVA) comprises calculations that provide information on the levels of variability within a regression model and form the basis for tests of significance.
The Significance F value is 0.00000015; it gives the P value used to test the null hypothesis that all group data are drawn from groups with the same mean. A P value greater than 0.05 indicates a chance that the null hypothesis is true (see Table 6).

Table 5

Statistical ANOVA measure of Multiple Linear Regression model.

	Df	SS	MS	F	Significance F
Regression	3	5.19E+14	1.73E+14	18.64554641	0.00000015
Residual	156	1.45E-15	9.31E-18
Total	159	5.19E+14
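For readers checking the table, the mean squares in an ANOVA summary are each sum of squares divided by its degrees of freedom, and the F ratio is conventionally the regression mean square divided by the residual mean square. The sketch below applies those textbook definitions to the SS and Df values printed in Table 5. The helper name `mean_square` is hypothetical, and this is not the authors' computation.

```python
def mean_square(sum_of_squares, degrees_of_freedom):
    """Mean square = sum of squares / degrees of freedom."""
    return sum_of_squares / degrees_of_freedom

# SS and Df values as printed in Table 5.
ms_regression = mean_square(5.19e14, 3)      # about 1.73e14, matching the MS column
ms_residual = mean_square(1.45e-15, 156)     # about 9.3e-18, matching up to rounding
f_ratio = ms_regression / ms_residual        # textbook F = MS_regression / MS_residual
```

The two mean squares agree with the MS column to within rounding. The ratio computed this way is many orders of magnitude larger than the F value printed in the table, so the printed figure may have been obtained or typeset differently; the sketch does not attempt to resolve that.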

Table 6

Summary Output: ANOVA showing the Significance of p-value to validate the model for prediction of daily Active cases.

	Coefficients	Standard Error	T Stat	P-value	Lower 95%	Upper 95%	Lower 95.0%	Upper 95.0%
Intercept	2E-09	4.12E-10	4.164241	0.6734	9.02616E-10	2.53E-09	9.02616E-10	2.53164E-09
Positive	1	2.52E-15	3.97E+14	0.6633	1	1	1	1
Recovered	−1	2.56E-15	−3.9E+14	0.6533	−1	−1	−1	−1
Deceased	1	2.41E-14	4.16E+13	0.6753	1	1	1	1

Limitations of the Model: The model could be improved by gathering more independent variables or information, such as ways to determine the number of contact-tracing cases. If the number of contact-tracing cases is reduced, the number of daily active cases will be indirectly reduced.

Conclusion

From the training and testing of the prediction models above, the approach was found to be an effective way to forecast the number of daily active cases during the second week of August. The forecast figures show that the number of active cases will lie between a lower confidence value of 8,582 and an upper confidence value of 10,134 in Odisha, and between a lower confidence value of 48,711 and an upper confidence value of 55,868 in India. These models achieved remarkable accuracy in forecasting COVID-19 cases. Bearing in mind these projected active cases, the current measures for COVID-19 containment need to be reinforced or updated. Our framework could assist healthcare professionals and government officials in making appropriate plans to cope with the influx of future COVID-19 patients.

Funding

No Funding

Declaration of competing interest

None to declare.

5 in total

1. Evaluation of random forest regression and multiple linear regression for predicting indoor fine particulate matter concentrations in a highly polluted city.

Authors: Weiran Yuchi; Enkhjargal Gombojav; Buyantushig Boldbaatar; Jargalsaikhan Galsuren; Sarangerel Enkhmaa; Bolor Beejin; Gerel Naidan; Chimedsuren Ochir; Bayarkhuu Legtseg; Tsogtbaatar Byambaa; Prabjit Barn; Sarah B Henderson; Craig R Janes; Bruce P Lanphear; Lawrence C McCandless; Tim K Takaro; Scott A Venners; Glenys M Webster; Ryan W Allen
Journal: Environ Pollut Date: 2018-11-16 Impact factor: 8.071

2. Application of artificial neural network and multiple linear regression in modeling nutrient recovery in vermicompost under different conditions.

Authors: Ahmad Hosseinzadeh; Mansour Baziar; Hossein Alidadi; John L Zhou; Ali Altaee; Ali Asghar Najafpoor; Salman Jafarpour
Journal: Bioresour Technol Date: 2020-01-29 Impact factor: 9.642

3. Multiple Linear Regression for Reconstruction of Gene Regulatory Networks in Solving Cascade Error Problems.

Authors: Faridah Hani Mohamed Salleh; Suhaila Zainudin; Shereena M Arif
Journal: Adv Bioinformatics Date: 2017-01-29

4. Sentiment analysis of nationwide lockdown due to COVID 19 outbreak: Evidence from India.

Authors: Gopalkrishna Barkur; Giridhar B Kamath
Journal: Asian J Psychiatr Date: 2020-04-12

5. Updated understanding of the outbreak of 2019 novel coronavirus (2019-nCoV) in Wuhan, China.

Authors: Weier Wang; Jianming Tang; Fangqiang Wei
Journal: J Med Virol Date: 2020-02-12 Impact factor: 2.327
