
Linear Regression Analysis to predict the number of deaths in India due to SARS-CoV-2 at 6 weeks from day 0 (100 cases - March 14th 2020).

Samit Ghosal, Sumit Sengupta, Milan Majumder, Binayak Sinha.

Abstract

INTRODUCTION AND AIMS: No valid treatment or preventive strategy has yet evolved to counter the SARS-CoV-2 (novel coronavirus) epidemic, which originated in China in late 2019 and has since wrought havoc on millions across the world through illness, socioeconomic recession and death. This analysis was aimed at tracing the trend in death counts expected at the 5th and 6th weeks of the COVID-19 epidemic in India.
MATERIAL AND METHODS: A validated database was used to procure global and Indian data on coronavirus and related outcomes. Multiple regression and linear regression analyses were used interchangeably. Since the week 6 death count was not significantly correlated with any of the chosen inputs, an auto-regression technique was employed to improve the predictive ability of the regression model.
RESULTS: A linear regression analysis predicted the average week 5 death count to be 211 (95% CI of the regression coefficient: 1.31-2.60). Similarly, the week 6 death count, in spite of a strong correlation with the input variables, did not pass the test of statistical significance. Using the auto-regression technique with the week 5 death count as input, the linear regression model predicted the week 6 death count in India to be 467, bearing in mind the risk of over-estimation inherent in most risk-based models.
CONCLUSION: According to our analysis, if the situation continues in its present state, the projected number of deaths (n) is 211 and 467 at the end of the 5th and 6th weeks from now, respectively.
Copyright © 2020 Diabetes India. Published by Elsevier Ltd. All rights reserved.

Keywords:  Coronavirus; Correlation; Death rates; India; Regression

Year:  2020        PMID: 32298982      PMCID: PMC7128942          DOI: 10.1016/j.dsx.2020.03.017

Source DB:  PubMed          Journal:  Diabetes Metab Syndr        ISSN: 1871-4021


Introduction

The pandemic of COVID-19 (coronavirus disease 2019), caused by SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2), has wrought havoc on human civilization. Since its appearance in the city of Wuhan (Hubei province) in China, there has been a relentless march of new cases and deaths [1]. What makes it more frightening is the novelty of the viral strain and the unknowns associated with it [2]. The present strategy has been to prevent its spread by social isolation, alongside a scientific overdrive to manufacture newer rapid diagnostic kits as well as medications [3,4,5]. Coronaviruses are RNA viruses of the family Coronaviridae, order Nidovirales [6]. They are divided into three groups depending on the antigenic determinants carried by different structural proteins of the virus (spike, membrane and nucleocapsid) [7]. The SARS coronavirus falls under group 2. The ability of this family of viruses to readily undergo genetic recombination, not only within the same group but also between groups, makes them readily susceptible to natural selection and to changes in virulence [8]. Their most striking feature, however, is the ability to cross freely from one species to another. HCoV-229E belongs to group 1 of the coronavirus family and is thought to be responsible for epidemics of the common cold [7]. Transmission from bats to humans, thought to be the initial route for HCoV-229E, happened within the last two centuries. However, two dramatic events, SARS-CoV in 2003 (which originated in bats and was transmitted via civet cats) and MERS-CoV in 2012 (which originated in bats and was transmitted via camels), brought our focus back to the coronavirus family [8,9]. The present coronavirus pandemic is the result of changes in the receptor-binding domain of the spike protein through natural selection (whether in the human host or in an animal vector remains uncertain), resulting in its increased affinity for the ACE2 receptor site [10].
At present, more than 450,000 individuals have been affected by SARS-CoV-2, resulting in more than 12,000 deaths worldwide [11]. In India, as present-day statistics stand, there are around 718 confirmed cases with 13 deaths [12]. Several countries, including India, have gone into a state of lockdown in order to prevent the spread of this deadly virus. With new rapid diagnostic kits coming in and trials of potentially helpful drugs underway, we need a better understanding of the disease process and of what it holds for the near future. With all the SARS-CoV-2-related data available through reliable sources, we chose to assimilate the available data on total infection rates, total deaths, case fatality rates (CFR) and recovery numbers from across the globe and to create a predictive analysis of what we can expect in India in the coming weeks. The aim was to identify the top 15 countries, i.e. those most heavily affected and hence able to contribute a substantial quantity of robust data, and to compute a predictive model for India. We thought this was of paramount importance, since it would help in understanding as well as planning the future course of action. India has entered week 4. This analysis was aimed at tracing the trend in death counts expected at the 5th and 6th weeks of COVID-19 in India.

Materials and methods

Global data were collected from the WHO COVID-19 situation reports, and the Indian data were updated from the website covid19india.org. Data were collected in a CSV file, uploaded to a Jupyter notebook and analysed with Python 3.8.2. As a re-validation step, and for simplicity of understanding, the data were also analysed in Excel with the XL-STAT statistical software. Inputs: total number of infected cases, active cases and recovery numbers. Outputs: total deaths and case fatality rates (CFR). To obtain good predictive value, data were analysed for the top 15 infected countries, with India as the 16th country.

Pre-analysis phase

There was one missing value (NA) in the dataset, namely the recovery number for the US. In view of the heterogeneity of the data and the significant outliers, imputation with the mean was ruled out. As a recovery strategy, a correlation analysis was conducted in Python (leaving out the US data), and a strong correlation, r = 0.99 (P < 0.001), was found between the total number of infected cases and recoveries. Using this robust association and the formula generated from linear regression (Y [Recovery cases | USA] = b0 + b1 ∗ [Total cases | USA], with b0 = -781.05 and b1 = 0.869), the missing value (1117) was derived. The analysis was conducted thereafter (Table 1).
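The imputation step described above can be sketched in a few lines of Python. This is only an illustration that assumes the rounded coefficients quoted in the text; the function name impute_recoveries is hypothetical and not taken from the original analysis. With the rounded coefficients the result is about 1116, close to the reported 1117; the small gap is presumably due to rounding of b0 and b1.

```python
# Coefficients as quoted in the text (rounded by the authors).
B0 = -781.05  # intercept
B1 = 0.869    # slope: recoveries per confirmed case


def impute_recoveries(total_cases: float) -> float:
    """Estimate recovery cases from total cases using the reported linear fit."""
    return B0 + B1 * total_cases


usa_total_cases = 2183  # total cases for the USA, from Table 1
usa_recoveries = impute_recoveries(usa_total_cases)
print(round(usa_recoveries, 1))
```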
Table 1

Raw data including all coronavirus-related variables for week 1 and the total death outputs for week 5 through 9, including the imputed value.

Countries      Total cases   Active cases   Recovery cases   Week 4 deaths   CFR      Week 5 deaths
China          74185         57805          65112            2004            2.701    2715
Italy          21157         17750          12207            1441            6.811    4825
Spain          5232          4906           3097             133             2.542    1093
Iran           11364         7321           9919             514             4.523    1433
France         3661          3570           482              79              2.158    450
UK             798           769            495              11              1.378    177
Netherlands    804           792            134              10              1.244    106
Germany        3675          3621           3130             8               0.218    68
Belgium        559           555            139              3               0.537    37
Switzerland    1139          1124           303              11              0.966    56
South Korea    7979          7198           7294.42          67              0.840    94
Austria        504           497            431              1               0.198    6
Brazil         151           150            151              0               0.000    11
Indonesia      69            60             38               4               5.797    32
USA            2183          2126           1117             48              2.199    255
India          606           554            42               10              1.650    -
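The CFR column in Table 1 is consistent with week 4 deaths divided by total cases, expressed as a percentage. The short sketch below checks this for three rows; the helper name case_fatality_rate is illustrative, and the formula is inferred from the tabulated values rather than stated in the text.

```python
def case_fatality_rate(deaths: int, total_cases: int) -> float:
    """Return the case fatality rate as a percentage of confirmed cases."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return 100.0 * deaths / total_cases


# (week 4 deaths, total cases) for three rows of Table 1
rows = {"China": (2004, 74185), "Italy": (1441, 21157), "India": (10, 606)}

cfr = {
    country: round(case_fatality_rate(deaths, cases), 3)
    for country, (deaths, cases) in rows.items()
}
print(cfr)
```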

Results

Analysis for week 5 death number prediction: Step 1: A correlation analysis was performed to ascertain the presence, and thereafter the strength, of association between the output (week 5 death count) and the inputs from week 4. Week 5 deaths correlated with all the input variables, most strongly with week 4 deaths (r = 0.88) (Table 2).
Table 2

Correlation analysis determining the relationship between week 5 deaths and all the input variables.

                 Total cases    Active cases   Recovery cases   Week 4 deaths   CFR           Week 5 deaths
Total cases      1
Active cases     0.99904861     1
Recovery cases   0.994753954    0.991471532    1
Week 4 deaths    0.922623423    0.924523558    0.883996909      1
CFR              0.268208625    0.266050424    0.209369645      0.511501668     1
Week 5 deaths    0.635636081    0.644536597    0.561097633      0.876211223     0.696402315   1
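As an illustration of how one entry of this matrix is obtained, the sketch below computes the Pearson correlation between week 4 and week 5 deaths for the 15 countries in Table 1. It uses only the standard library; pearson_r is a hypothetical helper, not the authors' code. The result agrees with the tabulated value of about 0.876.

```python
from math import sqrt

# Week 4 and week 5 deaths for the 15 countries in Table 1 (India excluded).
week4_deaths = [2004, 1441, 133, 514, 79, 11, 10, 8, 3, 11, 67, 1, 0, 4, 48]
week5_deaths = [2715, 4825, 1093, 1433, 450, 177, 106, 68, 37, 56, 94, 6, 11, 32, 255]


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(week4_deaths, week5_deaths)
print(round(r, 3))
```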
Step 2: A multivariate regression analysis ascertained the most important input parameters which would be used to build the model for the 5th week death prediction in India. The model came out to have a very strong predictive capacity (r = 0.99, R2 = 0.98, adjusted R2 = 0.97). However, the P-value was significant only for the 4th week death input parameter (Table 3).
Table 3

Results from the multiple regression analysis conducted with the 5th week death count as output and all the 4th week parameters as inputs. ∗ Goodness of fit (adjusted R square) shows the high predictive power of this multiple linear regression model. However, no predictor other than week 4 deaths contributes significantly to the model.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.990327848
R Square            0.980749246
Adjusted R Square   0.970054383
Standard Error      234.1358914
Observations        15

ANOVA
             df   SS            MS            F             Significance F
Regression   5    25135569.86   5027113.972   91.70283143   1.92537E-07
Residual     9    493376.5407   54819.61564
Total        14   25628946.4

                 Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept        84.42512812    115.0074695      0.734083868    0.481583332   −175.7398428   344.590099
Total cases      −0.069994422   0.218157313      −0.320843804   0.755653571   −0.56350055    0.423511705
Active cases     0.121557776    0.155384517      0.782303013    0.454125958   −0.229946423   0.473061974
Recovery cases   −0.095715087   0.109664289      −0.872800868   0.405455473   −0.343792945   0.152362771
Week 4 deaths    3.49748606     0.70391798       4.968598845    0.000771382   1.905112961    5.08985916
CFR              33.51344079    46.33770995      0.723243355    0.487899902   −71.30974168   138.3366233
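The summary statistics in Table 3 are linked by standard identities, which the sketch below uses as a consistency check: adjusted R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), and F is the ratio of the regression and residual mean squares. The function names are illustrative; the inputs are the values reported in the table (n = 15 observations, p = 5 predictors).

```python
def adjusted_r_squared(r2: float, n_obs: int, n_predictors: int) -> float:
    """Adjusted R^2 for a model with an intercept and n_predictors inputs."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def f_statistic(ss_reg: float, df_reg: int, ss_res: float, df_res: int) -> float:
    """F statistic as the ratio of regression to residual mean squares."""
    return (ss_reg / df_reg) / (ss_res / df_res)


# Values taken from Table 3.
adj_r2 = adjusted_r_squared(0.980749246, n_obs=15, n_predictors=5)
f_stat = f_statistic(25135569.86, 5, 493376.5407, 9)

print(round(adj_r2, 6), round(f_stat, 3))
```

Both values reproduce the reported adjusted R square and F to within rounding.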
Step 3: A simple regression analysis was subsequently done to predict the death counts from the strongest input variable, week 4 deaths (Table 4). The model was robust, with r = 0.87, R2 = 0.77 and adjusted R2 = 0.75, P < 0.001, 95% CI of the regression coefficient: 1.31–2.60. Based on the upper (maximum) and lower (minimum) limits of the confidence intervals, the minimum, maximum and average death counts for week 5 were computed; the average was 211 (Table 4). The week 5 death count for India was thus predicted from the available data from the top 15 infected countries.
Table 4

The maximum, minimum and average predicted death counts for week 5 based on the equation of the linear regression model.

In 95% Confidence Interval   Intercept and Co-efficient   5th Week predicted death
Mean point of estimation     b0 = 191.644                 211
                             b1 = 1.957
Lower point of estimation    b0 = −229.314                −216
                             b1 = 1.312
Upper point of estimation    b0 = 612.602                 639
                             b1 = 2.602
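The figures in Table 4 can be reproduced with a short sketch, given here as an illustration rather than the authors' code. It fits an ordinary least-squares line of week 5 deaths on week 4 deaths using the 15 countries in Table 1, then applies the mean, lower and upper coefficient pairs reported in Table 4 to India's week 4 death count of 10. The helper fit_simple_ols is hypothetical.

```python
# Week 4 and week 5 deaths for the 15 countries in Table 1 (India excluded).
week4_deaths = [2004, 1441, 133, 514, 79, 11, 10, 8, 3, 11, 67, 1, 0, 4, 48]
week5_deaths = [2715, 4825, 1093, 1433, 450, 177, 106, 68, 37, 56, 94, 6, 11, 32, 255]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


b0, b1 = fit_simple_ols(week4_deaths, week5_deaths)
india_week4_deaths = 10  # from Table 1
prediction = b0 + b1 * india_week4_deaths

# Coefficient pairs (b0, b1) exactly as reported in Table 4.
reported = {
    "mean": (191.644, 1.957),
    "lower": (-229.314, 1.312),
    "upper": (612.602, 2.602),
}
week5_range = {
    name: round(c0 + c1 * india_week4_deaths) for name, (c0, c1) in reported.items()
}
print(round(b0, 3), round(b1, 3), round(prediction), week5_range)
```

The fitted intercept and slope agree with the mean point of estimation in Table 4, and the three coefficient pairs yield the tabulated predictions of 211, −216 and 639.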

Death number prediction for week 6

Step 1: A correlation study was conducted to ascertain the relationship between the output (week 6 death counts) and the input variables from week 4. A good correlation was observed with all the input variables. Step 2: Multiple regression analysis was done to ascertain the strength of association between the input variables and the output, including ruling out issues related to multi-collinearity. Once again the model created was very robust, with r = 0.95, R2 = 0.91 and adjusted R2 = 0.86 (Table 5). However, none of the input variables reached statistical significance.
Table 5

Multiple regression analysis with week 6 death counts as output and all the 4th week variables as inputs. ∗ Goodness of fit (adjusted R square) shows the high predictive power of this multiple linear regression model. However, none of the predictors contributes significantly to the model.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.955366444
R Square            0.912725042
Adjusted R Square   0.864238954
Standard Error      687.4807679
Observations        15

ANOVA
             df   SS            MS            F             Significance F
Regression   5    44485034.68   8897006.936   18.82447281   0.000158714
Residual     9    4253668.256   472629.8062
Total        14   48738702.93

                 Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%
Intercept        115.6866622    337.6903173      0.342582112    0.739777934   −648.2219079   879.5952322
Total cases      0.174235894    0.640563717      0.272004002    0.791755917   −1.274819906   1.623291695
Active cases     0.159410883    0.456247295      0.349395788    0.734828115   −0.872692204   1.19151397
Recovery cases   −0.392375137   0.322001422      −1.218550947   0.253991509   −1.12079296    0.336042685
Week 4 deaths    3.061382182    2.066876933      1.481163263    0.172701393   −1.614218276   7.736982641
CFR              101.8408025    136.0589538      0.748504966    0.4732618     −205.9459343   409.6275393
Step 3: Auto-Regression technique- The 5th week death count data was incorporated as the input variable, in view of the fact that this end-point was significantly associated with the week 4 death count. A separate correlation analysis was performed between week 5 and week 6 death counts and a very robust association (r = 0.97) was found justifying its inclusion as the input variable. Step 4: Using week 5 death count as input a simple linear regression was performed to create a model predicting the week 6 outcomes. The model was robust (r = 0.96, R2 = 0.94, adjusted R2 = 0.94) and statistically significant (P-value <0.001, 95% CI: 1.13–1.54). Step 5: From the regression model formula the minimum, maximum, and average death count was estimated (Table 6). The average predicted death count for India was estimated to be 467.
Table 6

Prediction for 6th week death count in India based on the auto-regression analysis technique.

In 95% Confidence Interval   Intercept and Co-efficient   6th Week predicted death
Mean point of estimation     b0 = 184.33                  467
                             b1 = 1.34
Lower point of estimation    b0 = −119.77                 120
                             b1 = 1.14
Upper point of estimation    b0 = 488.44                  813
                             b1 = 1.54
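The chained prediction behind Table 6 can be sketched as follows, assuming the rounded coefficients reported in the table and the mean week 5 prediction of 211 as the input for all three rows. The function name predict_week6 is illustrative. With these rounded coefficients the lower estimate comes out near 120.8, against the reported 120; the difference presumably reflects rounding of b0 and b1.

```python
# Coefficient pairs (b0, b1) as reported in Table 6 (rounded by the authors).
COEFFICIENTS = {
    "mean": (184.33, 1.34),
    "lower": (-119.77, 1.14),
    "upper": (488.44, 1.54),
}


def predict_week6(week5_deaths: float, intercept: float, slope: float) -> float:
    """Week 6 deaths from week 5 deaths via the simple linear model."""
    return intercept + slope * week5_deaths


week5_prediction = 211  # mean week 5 prediction for India (Table 4)
week6 = {
    name: predict_week6(week5_prediction, b0, b1)
    for name, (b0, b1) in COEFFICIENTS.items()
}
print({name: round(value, 2) for name, value in week6.items()})
```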

Discussion

India is in the 4th week of the coronavirus pandemic. What lies ahead is the crucial stage, weeks 5–6, in which effective preventive measures can avert the kind of catastrophe that countries like China, Italy and the United States of America are experiencing, with exponential growth in both infections and deaths. In those countries, exponential progression in the number of infected cases has occurred from the 4th week onwards [11]. At the time of going to press, there were approximately 489,853 confirmed cases of coronavirus infection worldwide, including 22,152 deaths [11]. Fortunately, at the 4th week India has escaped the brunt of the disease, with figures hovering around 693 confirmed infections and 13 deaths. The next couple of weeks hold the key to the direction the virus takes, or does not take if adequate preventive steps are put in place. India has already taken strong measures, including complete lockdown of both its internal and external borders as well as social isolation.

How does this analysis help?

By assessing the trends in the top 15 most infected countries, a predictive model was created for India on the assumption that the same trend would follow. In other words, can we justify the drastic measures being taken? What can we expect if we allow the present trend to continue and mimic the exponential growth experienced by China and our western counterparts? Our analysis predicts a jump from approximately 10 deaths at week 4 to 211 at week 5 and then 467 by week 6. We therefore see a need for urgent interventions (which are being taken as of now) to prevent this sharp rise in deaths, which indirectly also indicates an increase in the infection rate.

Limitations of this analysis

The main limitation of this analysis is that it takes most of the input data into consideration without accounting for the logistic actions being taken or not taken during the process. However, the end-of-week results are highly indicative of both the natural trajectory of the virus and the local government’s reactions. Secondly, limiting our analysis to the top 15 most infected countries could lead to an over-estimation of the outcomes. However, faced with a catastrophe of such magnitude, it is better to over-estimate than to under-estimate.

Strength of the study

In spite of all the limitations, the biggest strength of this study was the very high adjusted R2 found in all the predictive models. In addition, the results were cross-validated with two different software packages, practically ruling out any error creeping in from one mode of analysis.

Conclusion

According to our analysis, if the situation continues in its present state, the projected number of deaths (n) is 211 and 467 at the end of the 5th and 6th weeks from now, respectively. Keeping these projected mortality data in mind, current measures for containment of COVID-19 must be strengthened or supplemented.

Funding

None.

Declaration of competing interest

None to declare.
References

1. D A Brian; R S Baric. Coronavirus genome structure and replication. Curr Top Microbiol Immunol. 2005.

2. The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nat Microbiol. 2020-03-02.

3. Richard Armitage; Laura B Nellums. COVID-19 and the consequences of isolating the elderly. Lancet Public Health. 2020-03-20.

4. Andrea Remuzzi; Giuseppe Remuzzi. COVID-19 and Italy: what next? Lancet. 2020-03-13.
