Ayan Chatterjee, Martin W Gerdes, Santiago G Martinez.
Abstract
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the novel coronavirus, is responsible for the ongoing worldwide pandemic. The World Health Organization (WHO) assigned an International Classification of Diseases (ICD) code, COVID-19, as the name of the new disease. Coronaviruses circulate among people and many diverse species of animals, including birds and mammals such as cattle, camels, cats, and bats. Infrequently, a coronavirus can be transferred from animals to humans and then propagate among people, as with Middle East Respiratory Syndrome (MERS-CoV), Severe Acute Respiratory Syndrome (SARS-CoV), and now this new virus, SARS-CoV-2, or human coronavirus. Its rapid spread has sent billions of people into lockdown as health services struggle to cope. The COVID-19 outbreak comes with an exponential growth of new infections and a growing death count. A major goal in limiting further exponential spread is to slow down the transmission rate, which is denoted by a spread factor (f), and we propose an algorithm in this study for analyzing it. This paper addresses the potential of data science to assess the risk factors correlated with COVID-19, after analyzing existing datasets available at ourworldindata.org (Oxford University database) and newly simulated datasets, followed by an analysis of different univariate Long Short-Term Memory (LSTM) models for forecasting new cases and resulting deaths. The results show that vanilla, stacked, and bidirectional LSTM models outperformed multilayer LSTM models. We also discuss the findings of the statistical analysis on simulated datasets. For correlation analysis, we included features such as external temperature, rainfall, sunshine, infected cases, deaths, country, population, area, and population density for the past three months (January, February, and March 2020).
For univariate timeseries forecasting using LSTM, we used datasets from 1 January 2020 to 22 April 2020.
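Univariate LSTM forecasting is usually framed as supervised learning by slicing the series into fixed-length input windows, each paired with the value that follows it. The sketch below illustrates that framing only; the function name `split_sequence`, the window length of 3, and the sample numbers are assumptions for illustration and are not taken from the study.

```python
from typing import List, Sequence, Tuple


def split_sequence(
    series: Sequence[float], n_steps: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a univariate series into (input window, next value) pairs."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    windows: List[List[float]] = []
    targets: List[float] = []
    # Each window covers n_steps consecutive values; the target is the next one.
    for start in range(len(series) - n_steps):
        end = start + n_steps
        windows.append(list(series[start:end]))
        targets.append(series[end])
    return windows, targets


# Hypothetical cumulative counts, for illustration only (not data from the study).
example_series = [10, 20, 30, 40, 50, 60]
X, y = split_sequence(example_series, n_steps=3)
for window, target in zip(X, y):
    print(window, "->", target)
```

Each row of `X` would then be reshaped to `(samples, timesteps, features)` before being passed to an LSTM layer.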
Keywords: COVID-19; ICD; LSTM; RNN; algorithm; artificial intelligence; community disease; correlation; deep learning; hypothesis test; keras; machine learning; measurable sensor data; population; public health; python; regression; spread factor; statistics; transmission rate
Year: 2020 PMID: 32486055 PMCID: PMC7308840 DOI: 10.3390/s20113089
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. (a) Daily total new corona cases; (b) daily total corona deaths.
Propagation of human coronavirus through air [19,47,48,49,50,51,52].
| No | Size | Transmission Distance |
|---|---|---|
| 1 | Larger respiratory droplets (>5–10 μm diameter) | Travel only short distances, generally < 1 m, but in extraordinary cases up to 4 m |
| 2 | Virus-laden small (<5 μm diameter) aerosolized droplets (droplet nuclei) | Travel long distances, >1 m |
| 3 | Combinations of an individual patient’s physiology and environmental conditions, such as humidity and temperature, the gas cloud, and its payload of pathogen-bearing droplets of all sizes | Travel 7–8 m |
| 4 | Strong airflow from the air conditioner | Distance above 1 m |
Figure 2. Consolidated case(s) growth of the “world”.
Figure 3. Top 17 countries according to the total cases reported till 22 April 2020.
Figure 4. The death ratio of top 17 countries according to the total cases reported till 22 April 2020.
Feature description of “simulated_data_1”.
| No. | Features | Description |
|---|---|---|
| 1 | Temp-Jan | Average temperature of the country in January 2020 [ |
| 2 | Temp-Feb | Average temperature of the country in February 2020 [ |
| 3 | Temp-Mar | Average temperature of the country in March 2020 [ |
| 4 | Rainfall-Jan | Average rainfall of the country in January 2020 [ |
| 5 | Rainfall-Feb | Average rainfall of the country in February 2020 [ |
| 6 | Rainfall-Mar | Average rainfall of the country in March 2020 [ |
| 7 | Sunshine-Jan | Average sunshine of the country in January 2020 [ |
| 8 | Sunshine-Feb | Average sunshine of the country in February 2020 [ |
| 9 | Sunshine-Mar | Average sunshine of the country in March 2020 [ |
| 10 | Population | Total population of the country [ |
| 11 | Area | Total area of the country [ |
| 12 | Population Density | Population density of the country [ |
| 13 | Case-Jan | Total infected cases of the country in January 2020 [ |
| 14 | Case-Feb | Total infected cases of the country in February 2020 [ |
| 15 | Case-Mar | Total infected cases of the country in March 2020 [ |
| 16 | Death-Jan | Total deceased of the country in January 2020 [ |
| 17 | Death-Feb | Total deceased of the country in February 2020 [ |
| 18 | Death-Mar | Total deceased of the country in March 2020 [ |
| 19 | Country | Name of the country selected for analysis |
Description of selected datasets.
| No | Name | External Source | Purpose | Description |
|---|---|---|---|---|
| 1 | COVID-19 datasets | | Univariate LSTM forecasting | Contains worldwide and country-specific data, such as total cases, deaths, and recoveries. |
| 2 | Simulated_data_1 | | For correlation analysis | Contains features such as external temperature, rainfall, sunshine, infected cases, deaths, country, population, area, and population density for the past three months (January, February, and March) |
| 3 | Simulated_data_2 | Not available | For analyzing our proposed algorithm | Key variables used in the algorithm are as follows: |
Python libraries for data processing [61].
| No. | Libraries | Purpose |
|---|---|---|
| 1 | Pandas | Data importing, structuring and analysis |
| 2 | NumPy | Computing with multidimensional array object |
| 3 | Matplotlib | Python 2-D plotting |
| 4 | SciPy | Statistical analysis |
| 5 | Seaborn, plotly | Plotting of high-level statistical graphs |
| 6 | Keras with TensorFlow | LSTM model development, training, and testing |
Hypothesis testing method [62].
| Method | Description |
|---|---|
| Augmented Dickey-Fuller test | To test if a timeseries is stationary or non-stationary |
Significance of regression coefficient (r).
| \|r\| Value | Meaning |
|---|---|
| 0.00–0.2 | Very weak |
| 0.2–0.4 | Weak to moderate |
| 0.4–0.6 | Medium to substantial |
| 0.6–0.8 | Very strong |
| 0.8–1 | Extremely strong |
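The bands in this table can be applied mechanically once r has been computed. The sketch below assumes r is the Pearson correlation coefficient and that a value on a shared boundary (for example 0.2) belongs to the higher band; the table does not settle either point, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def strength_label(r: float) -> str:
    """Map |r| to the verbal bands of the table above.

    Assumption: a boundary value falls into the higher band.
    """
    magnitude = abs(r)
    if magnitude > 1 + 1e-9:
        raise ValueError("|r| cannot exceed 1")
    bands = [
        (0.2, "Very weak"),
        (0.4, "Weak to moderate"),
        (0.6, "Medium to substantial"),
        (0.8, "Very strong"),
    ]
    for upper, label in bands:
        if magnitude < upper:
            return label
    return "Extremely strong"


# Hypothetical paired observations, for illustration only.
r = pearson_r([1, 2, 3, 4], [2, 4, 5, 9])
print(round(r, 3), strength_label(r))
```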
Statistical analysis methods on the selected datasets.
| No. | Methods | Purpose |
|---|---|---|
| 1 | Mean, standard deviation | Distribution test |
| 2 | Covariance, correlation | Association test |
| 3 | Histogram, line, bar, Scatter | Distribution plot |
| 4 | Quantile analysis | Outlier detection |
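As an illustration of quantile-based outlier detection, the sketch below flags values outside the interquartile-range fences. The 1.5 × IQR multiplier is a common convention assumed here, not a threshold stated in the source, and the sample values are hypothetical.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one extreme value, for illustration only.
sample = [1, 2, 3, 4, 5, 100]
print(iqr_outliers(sample))
```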
Figure 5. (a) A vanilla LSTM cell; (b) equations of a vanilla LSTM cell.
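For readers without access to the figure, the following sketch steps a single-unit LSTM cell through a short sequence using the standard formulation (forget, input, and output gates plus a candidate state, without peephole connections). The notation may differ from Figure 5(b), and the scalar weights are hypothetical values chosen for illustration, not trained parameters.

```python
import math
from typing import Dict, Tuple

# Each gate has (input weight, recurrent weight, bias); scalars because units = 1.
GateParams = Dict[str, Tuple[float, float, float]]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def lstm_step(
    x: float, h_prev: float, c_prev: float, params: GateParams
) -> Tuple[float, float]:
    """One time step of a single-unit LSTM cell; returns (h, c)."""

    def pre_activation(name: str) -> float:
        w, u, b = params[name]
        return w * x + u * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate
    i = sigmoid(pre_activation("i"))    # input gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate cell state
    c = f * c_prev + i * g              # new cell state
    h = o * math.tanh(c)                # new hidden state
    return h, c


# Hypothetical weights, for illustration only.
params: GateParams = {
    "f": (0.5, 0.1, 0.0),
    "i": (0.6, 0.2, 0.0),
    "o": (0.4, 0.3, 0.0),
    "g": (0.9, 0.5, 0.0),
}
h, c = 0.0, 0.0
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:.1f} h={h:.4f} c={c:.4f}")
```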
LSTM model store [61,64].
| Method | Implementation |
|---|---|
| Pickle string | Import pickle library |
| Pickled model | Import joblib from sklearn.externals library |
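The pickle-string approach in the first row can be sketched with the standard library alone. The dictionary below is a hypothetical stand-in for a trained model's state, not an object from the study; a real Keras model is more commonly persisted with Keras's own save and load utilities. Note also that recent scikit-learn releases no longer ship `sklearn.externals.joblib`, so `joblib` is imported directly as its own package there.

```python
import pickle

# Hypothetical stand-in for a trained model's state, for illustration only.
model_state = {"name": "vanilla_lstm", "n_steps": 3, "weights": [0.1, -0.2, 0.3]}

# Serialize to an in-memory byte string (the "pickle string").
pickled = pickle.dumps(model_state)

# Restore the object from the byte string.
restored = pickle.loads(pickled)
print(restored == model_state)
```

Unpickling executes code embedded in the payload, so only byte strings from a trusted source should be loaded this way.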
Figure 6. Correlation heatmap of simulated data (“simulated_data_1”) to check feature correlation.
Figure 7. Exponential regression plot showing that deaths increase with the number of cases.
Figure 8. Flattening the distribution graphs of active cases over days by reducing human coronavirus spreading with different “f” values: (a) f = 0.25; (b) f = 0.50; (c) f = 0.75; (d) f = 1.00; (e) f = 2.00; (f) f = 3.00; (g) f = 4.00; and (h) f = 5.00.
Effect of spreading factor (“f”) to flatten the curve of active cases.
| Spread Factor (f) | Peak Active Cases | Span of Active Cases (Days) | Treatment Duration (Days) | Maximum Load (Week) | Avg Load (Patient/day) |
|---|---|---|---|---|---|
| 0.25 | 70,000–80,000 | 1–100 | 100 | 7–10 | Moderate |
| 0.50 | 140,000–160,000 | 1–50 | 50 | 4–5 | Medium |
| 0.75 | 175,000–190,000 | 1–40 | 40 | 3–4 | High |
| 1.00 | 175,000–200,000 | 1–36 | 36 | 2–4 | High |
| 2.00 | 200,000 | 1–23 | 23 | 2–3 | Very High |
| 3.00 | 200,000 | 1–19 | 19 | 2–3 | Very High |
| 4.00 | 200,000 | 1–17 | 17 | 1–2 | Very High |
| 5.00 | 200,000 | 1–18 | 18 | 1–2 | Very High |
Figure 9. Trend analysis of total reported cases in four Asian countries.
Result of hypothesis testing of timeseries data.
| Timeseries Data | Test Result | Nature of Data |
|---|---|---|
| Total_deaths | ADF Statistic: −4.763824 | Rejecting null hypothesis; no unit root and timeseries is stationary |
| New_deaths | ADF Statistic: −2.814703 | Fail to reject null hypothesis; the data has a unit root and data is non-stationary |
| Total_cases | ADF Statistic: 5.989246 | Fail to reject null hypothesis; the data has a unit root and data is non-stationary |
| New_cases | ADF Statistic: 2.771519 | Fail to reject null hypothesis; the data has a unit root and data is non-stationary |
Average performance analysis of LSTM models to forecast total cases of the “World”.
| LSTM | MAE | MSE | RMSE | Forecast | \|R²\| | Compilation Time (ms) |
|---|---|---|---|---|---|---|
| Vanilla | 8,968.244 | 98,168,777.193 | 9,908.016 | 121.883 | 1.0 | 110.0 |
| Stacked | 6,597.784 | 82,779,520.484 | 9,098.325 | 1,120.341 | 1.0 | 192.0 |
| Bidirectional | 7,130.149 | 74,807,857.322 | 8,649.154 | 1,454.284 | 1.0 | 194.0 |
| Multi-Layer 1 | 37,438.048 | 2,338,577,178.93 | 48,358.838 | −37,075.648 | 0.995 | 520.0 |
| Multi-Layer 2 | 45,038.733 | 4,110,861,091.40 | 64,115.997 | 15,340.520 | 0.992 | 762.0 |
| Multi-Layer 3 | 51,890.187 | 10,545,625,824.0 | 102,691.898 | −45,213.395 | 0.970 | 680.0 |
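The error columns in this table follow the usual definitions, which the sketch below computes on a small hypothetical series. It assumes R² is the coefficient of determination (1 minus the ratio of residual to total sum of squares); the source does not spell out its formula, and the "Forecast" column is not reproduced here because its definition is not given.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Compute MAE, MSE, RMSE, and R^2 for paired actual/predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    # R^2 is undefined when the actual values are constant.
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "R2": r2}


# Hypothetical values, for illustration only (not data from the study).
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
print(metrics)
```

As a consistency check on the table itself, each RMSE entry is the square root of the corresponding MSE entry; for the vanilla model, the square root of 98,168,777.193 is approximately 9,908.016.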
Average performance analysis of LSTM models to forecast total deaths of the “World”.
| LSTM | MAE | MSE | RMSE | Forecast | \|R²\| | Compilation Time (ms) |
|---|---|---|---|---|---|---|
| Vanilla | 735.039 | 2,300,815.114 | 1,516.844 | −120.177 | 0.99 | 104.0 |
| Stacked | 738.703 | 4,637,553.996 | 2,153.498 | 341.605 | 0.98 | 190.0 |
| Bidirectional | 660.818 | 1,114,423.658 | 1,055.663 | 394.884 | 0.99 | 191.0 |
| Multi-Layer 1 | 3,573.872 | 30,177,345.174 | 5,493.391 | −3,573.872 | 0.983 | 400.0 |
| Multi-Layer 2 | 1,290.960 | 4,069,047.834 | 2,017.188 | −708.400 | 0.998 | 407.0 |
| Multi-Layer 3 | 3,108.016 | 52,959,914.784 | 7,277.356 | −3,033.915 | 0.966 | 400.0 |
Figure 10. Comparing the calibration of the LSTM models to forecast total cases of the “World”.
Figure 11. Comparing the calibration of the LSTM models to forecast total deaths of the “World”.