Nan Jing1, Zijing Shi1, Yi Hu1, Ji Yuan2,3.
Abstract
Coronavirus disease 2019 (COVID-19) is rapidly becoming one of the leading causes of mortality worldwide. Various models have been built in previous work to study the spread characteristics and trends of the COVID-19 pandemic. Nevertheless, because information and data sources are limited, the understanding of the spread and impact of the pandemic remains restricted. This paper therefore takes into account not only daily historical time-series data of COVID-19 but also regional attributes, e.g., geographic and local factors, which may have played an important role in the number of confirmed COVID-19 cases in certain regions. On this basis, the study conducts a comprehensive cross-sectional analysis and data-driven forecasting of the pandemic. The critical features, which have a significant influence on the infection rate of COVID-19, are determined by employing the XGB (eXtreme Gradient Boosting) algorithm and SHAP (SHapley Additive exPlanation), and a comparison is carried out using the RF (Random Forest) and LGB (Light Gradient Boosting) models. To forecast the number of confirmed COVID-19 cases more accurately, a Dual-Stage Attention-Based Recurrent Neural Network (DA-RNN) is applied. This model performs better than SVR (Support Vector Regression) and the encoder-decoder network on the experimental dataset. Model performance is evaluated using three statistical metrics: MAE, RMSE and R². This study is expected to serve as a meaningful reference for the control and prevention of the COVID-19 pandemic.
Keywords: Coronavirus disease 2019 (COVID-19); Dual-stage attention-based recurrent neural network (DA-RNN); SHapley additive exPlanation (SHAP); eXtreme gradient boosting (XGB)
Year: 2021 PMID: 34764608 PMCID: PMC8256957 DOI: 10.1007/s10489-021-02616-8
Source DB: PubMed Journal: Appl Intell (Dordr) ISSN: 0924-669X Impact factor: 5.019
Fig. 1 The proposed framework
Static features used in this study with their definitions
| Theme | Feature Name | Description |
|---|---|---|
| Socioeconomic | Income | Income per capita ($) |
| | GDP | GDP per capita |
| | Unemployment | Unemployment as a percentage of the state labor force |
| | Health spending | Spending for all health services ($) |
| Behavioral | Smoking rate | Percentage of smokers |
| Environmental | Temperature | Average temperature in 2019 |
| | Pollution | Measurement of the public’s exposure to particulate matter |
| Demographic | Urban | Percentage of the population living in an urban environment |
| | Pop density | Density of people per meter squared |
| | Sex ratio | Males / Females |
| | Flu deaths | Influenza and Pneumonia death rate per 100,000 people |
| | Respiratory deaths | Chronic lower respiratory disease rate per 100,000 people |
| | Physicians | Number of physicians and surgeons per 1000 people |
| | Hospital beds | Number of hospital beds per 1000 people |
| | Age 65+ | Percent of 65 years and over |
| | Major airports | Number of medium and large airports |
| | Public Transportation | The proportion of people who use public transportation when they go to work |
| Target variable | Infected rate | Total number of confirmed COVID-19 cases per 1000 people as of September 30, 2020 |
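
The target variable in the table above is a simple normalization of cumulative cases by population. A minimal sketch of that calculation follows; the function name and the numbers are hypothetical and are not taken from the paper's dataset.

```python
def infected_rate_per_1000(confirmed_cases: int, population: int) -> float:
    """Cumulative confirmed cases per 1000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 1000.0 * confirmed_cases / population


# Hypothetical state: 25,000 cumulative cases among 1,000,000 residents.
rate = infected_rate_per_1000(confirmed_cases=25_000, population=1_000_000)
print(rate)
```
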
Fig. 2 The heatmap of the covariance matrix of static features
The hyperparameter values of XGB, RF, and LGB utilized in this study
| Parameters of XGB | Values | Parameters of RF | Values | Parameters of LGB | Values |
|---|---|---|---|---|---|
| number of iterations | 150 | number of iterations | 150 | number of iterations | 150 |
| max depth | 3 | max features | default | max depth | 3 |
| subsample | 0.7 | max depth | 3 | subsample | 0.7 |
| colsample bytree | 0.7 | min_samples_split | 2 | colsample bytree | 0.7 |
| lambda | 2 | min_samples_leaf | 1 | min_child_weight | 1 |
| alpha | 1 | min_weight_fraction_leaf | 0 | num_leaves | 40 |
| learning rate | 0.05 | max_leaf_nodes | None | learning_rate | 0.05 |
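
The hyperparameters above could be written down as keyword-argument dictionaries, as in the sketch below. The mapping from the table's labels to library parameter names is an assumption: "number of iterations" is read as `n_estimators`, and "lambda" and "alpha" as `reg_lambda` and `reg_alpha`. The record does not state which API the authors used.

```python
# Assumed mapping of the table's labels to scikit-learn-style parameter names.
XGB_PARAMS = {
    "n_estimators": 150,      # "number of iterations" (assumed mapping)
    "max_depth": 3,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "reg_lambda": 2,          # "lambda" (assumed to be the L2 penalty)
    "reg_alpha": 1,           # "alpha" (assumed to be the L1 penalty)
    "learning_rate": 0.05,
}

RF_PARAMS = {
    "n_estimators": 150,      # "number of iterations" (assumed mapping)
    "max_depth": 3,
    "min_samples_split": 2,
    "min_samples_leaf": 1,
    "min_weight_fraction_leaf": 0.0,
    "max_leaf_nodes": None,
    # "max features" is listed as default, so it is left unset here.
}

LGB_PARAMS = {
    "n_estimators": 150,      # "number of iterations" (assumed mapping)
    "max_depth": 3,
    "subsample": 0.7,
    "colsample_bytree": 0.7,
    "min_child_weight": 1,
    "num_leaves": 40,
    "learning_rate": 0.05,
}
```

If the `xgboost`, `scikit-learn` and `lightgbm` packages are installed, these dictionaries could be unpacked into `XGBRegressor(**XGB_PARAMS)`, `RandomForestRegressor(**RF_PARAMS)` and `LGBMRegressor(**LGB_PARAMS)` respectively.
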
The performance results of XGB, RF and LGB
| Model | RMSE | MAE | MSE |
|---|---|---|---|
| XGB | 5.0172 | 6.3516 | 25.1722 |
| RF | 6.3216 | 7.5264 | 39.9626 |
| LGB | 6.8234 | 6.6538 | 77.8523 |
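
For reference, the three error metrics in the table above can be computed as in the following sketch. It uses only the standard library, and the sample values are hypothetical rather than taken from the study.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical infected rates (per 1000 people) and model predictions.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(mse(observed, predicted), rmse(observed, predicted), mae(observed, predicted))
```

By construction, RMSE is the square root of MSE, and MAE never exceeds RMSE on the same set of predictions.
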
The feature importance calculated by using data from March 1 to June 30
| No. | Features | Gain of XGB | Mean magnitude of SHAP values |
|---|---|---|---|
| 1 | Pop Density | 110.4355 | 1.4059 |
| 2 | Public Transportation | 39.9490 | 0.9363 |
| 3 | Pollution | 20.2100 | 0.6166 |
| 4 | Sex Ratio | 3.8168 | 0.5133 |
| 5 | Urban | 4.4805 | 0.4145 |
| 6 | Income | 26.9567 | 0.3677 |
| 7 | Hospital Beds | 2.4753 | 0.3319 |
| 8 | Respiratory Deaths | 2.3760 | 0.3230 |
| 9 | Physicians | 3.1678 | 0.3216 |
| 10 | Flu Deaths | 1.8414 | 0.2705 |
| 11 | Age 65+ | 13.0877 | 0.2506 |
| 12 | Major Airports | 11.6312 | 0.1672 |
| 13 | Unemployment | 5.6695 | 0.1666 |
| 14 | GDP | 6.8395 | 0.1338 |
| 15 | Health Spending | 2.6873 | 0.1099 |
| 16 | Temperature | 0.6306 | 0.0667 |
| 17 | Smoking Rate | 0.4426 | 0.0527 |
The feature importance calculated by using data from March 1 to September 30
| No. | Features | Gain of XGB | Mean magnitude of SHAP values |
|---|---|---|---|
| 1 | Temperature | 146.9716 | 1.3928 |
| 2 | Age 65+ | 82.3948 | 1.0826 |
| 3 | Pollution | 117.6348 | 0.7934 |
| 4 | Sex Ratio | 71.3326 | 0.7878 |
| 5 | Hospital Beds | 86.3513 | 0.5321 |
| 6 | Physicians | 59.3828 | 0.3353 |
| 7 | Flu Deaths | 70.7815 | 0.3350 |
| 8 | Pop Density | 24.6166 | 0.3278 |
| 9 | Smoking Rate | 38.8826 | 0.2811 |
| 10 | Income | 54.6326 | 0.2498 |
| 11 | Urban | 57.2029 | 0.1965 |
| 12 | Health Spending | 30.7574 | 0.1560 |
| 13 | Major Airports | 21.3473 | 0.1254 |
| 14 | Unemployment | 34.6735 | 0.1231 |
| 15 | GDP | 42.9217 | 0.0892 |
| 16 | Respiratory Deaths | 51.7551 | 0.0883 |
| 17 | Public Transportation | 76.4803 | 0.0460 |
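
The feature rankings above can be illustrated with a short sketch, assuming the "mean magnitude" column reports the mean absolute SHAP value of each feature across samples. The function name and the per-sample SHAP values below are hypothetical; in practice the values would come from a SHAP explainer applied to the trained XGB model.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value over all samples.

    shap_values: list of rows, one per sample, one value per feature.
    """
    n_samples = len(shap_values)
    scores = {
        name: sum(abs(row[j]) for row in shap_values) / n_samples
        for j, name in enumerate(feature_names)
    }
    # Largest mean magnitude first, matching the ordering of the tables.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical SHAP values for two samples and two features.
features = ["Pollution", "Temperature"]
values = [
    [-0.1, 0.5],
    [0.3, -1.5],
]
ranking = rank_by_mean_abs_shap(values, features)
print(ranking)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so a feature that pushes predictions strongly in both directions still ranks as important.
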
Fig. 3 A summary plot of the impact of each static feature on the model output
Fig. 4 Three examples illustrating the relative contributions of static features to the predicted confirmed cases of COVID-19 per 1000 people. (a) New Jersey (b) New York (c) District of Columbia
Fig. 5 State samples sorted by the explanation similarity
Temporal variables utilized in this study with their definitions
| Variable Name | Type | Description |
|---|---|---|
| Total confirmed cases | Integer | Cumulative sum of cases confirmed after positive test to date |
| New confirmed cases | Integer | Daily cases confirmed after positive test |
| mobility_retail_and_recreation | Double | Percentage change in visits to retail and recreation locations compared to baseline |
| mobility_grocery_and_pharmacy | Double | Percentage change in visits to grocery and pharmacy locations compared to baseline |
| mobility_parks | Double | Percentage change in visits to park locations compared to baseline |
| mobility_transit_stations | Double | Percentage change in visits to transit station locations compared to baseline |
| mobility_workplaces | Double | Percentage change in visits to workplace locations compared to baseline |
| mobility_residential | Double | Percentage change in visits to residential locations compared to baseline |
| Average mobility | Double | The average value of mobility_retail_and_recreation, mobility_grocery_and_pharmacy, mobility_parks, mobility_transit_stations, mobility_workplaces and mobility_residential |
| Average temperature | Double | Recorded hourly average temperature |
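
The derived "Average mobility" variable is defined above as the mean of the six mobility series. A minimal sketch of that calculation for a single day follows; the daily values are hypothetical, and how the authors handled missing values is not stated, so this version simply requires all six fields.

```python
MOBILITY_COLUMNS = [
    "mobility_retail_and_recreation",
    "mobility_grocery_and_pharmacy",
    "mobility_parks",
    "mobility_transit_stations",
    "mobility_workplaces",
    "mobility_residential",
]


def average_mobility(day: dict) -> float:
    """Mean of the six mobility percentage changes for one day."""
    return sum(day[col] for col in MOBILITY_COLUMNS) / len(MOBILITY_COLUMNS)


# Hypothetical percentage changes relative to baseline for one day.
sample_day = {
    "mobility_retail_and_recreation": -30.0,
    "mobility_grocery_and_pharmacy": -10.0,
    "mobility_parks": 20.0,
    "mobility_transit_stations": -40.0,
    "mobility_workplaces": -35.0,
    "mobility_residential": 15.0,
}
print(average_mobility(sample_day))
```
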
Fig. 6 Graphical illustration of the dual-stage attention-based recurrent neural network
The prediction results using the data in Washington
| Model | Training MAE | Training RMSE | Training R² | Testing MAE | Testing RMSE | Testing R² |
|---|---|---|---|---|---|---|
| LSTM-based DA-RNN | 305.2435 | 397.7334 | 0.9824 | 496.2958 | 669.4930 | 0.9342 |
| GRU-based DA-RNN | 253.4356 | 372.8175 | 0.9997 | 463.6574 | 654.5355 | 0.9474 |
| Encoder-Decoder | 595.3059 | 723.4670 | 0.8428 | 635.4309 | 853.9572 | 0.8134 |
| SVR | 1352.5391 | 1503.4896 | 0.6309 | 1662.2470 | 1751.1622 | 0.5932 |
The prediction results using the data in Ohio
| Model | Training MAE | Training RMSE | Training R² | Testing MAE | Testing RMSE | Testing R² |
|---|---|---|---|---|---|---|
| LSTM-based DA-RNN | 779.2359 | 1015.8341 | 0.9993 | 954.3849 | 1342.0370 | 0.9156 |
| GRU-based DA-RNN | 699.2269 | 932.7992 | 0.9994 | 909.3970 | 1134.3039 | 0.9130 |
| Encoder-Decoder | 939.3892 | 1237.9832 | 0.8128 | 1049.2895 | 1689.3498 | 0.7496 |
| SVR | 1437.604 | 2038.3557 | 0.6781 | 1704.4902 | 2368.9732 | 0.5810 |
The prediction results using the data in Los Angeles
| Model | Training MAE | Training RMSE | Training R² | Testing MAE | Testing RMSE | Testing R² |
|---|---|---|---|---|---|---|
| LSTM-based DA-RNN | 935.5827 | 1148.7093 | 0.9994 | 1148.3849 | 1347.3209 | 0.9837 |
| GRU-based DA-RNN | 857.4780 | 1058.3201 | 0.9994 | 1049.1987 | 1230.2390 | 0.9810 |
| Encoder-Decoder | 1829.1779 | 2270.9733 | 0.7176 | 2029.1921 | 2346.2892 | 0.6744 |
| SVR | 2159.8801 | 2454.7972 | 0.6767 | 2126.2587 | 2587.6133 | 0.6089 |
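
The third metric named in the abstract, R², measures the share of variance in the true series that the forecast explains. The sketch below shows the standard definition; the function name and the case counts are hypothetical and unrelated to the results reported above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical daily confirmed cases and forecasts.
true_cases = [100.0, 200.0, 300.0, 400.0]
forecast = [110.0, 190.0, 310.0, 390.0]
print(r_squared(true_cases, forecast))
```

A value of 1 corresponds to a perfect forecast, while a value near 0 means the model does no better than predicting the mean of the true series.
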
Fig. 7 The iterative process of loss function during the training process of GRU-based DA-RNN on three different states. (a) Washington (b) Ohio (c) Los Angeles
Fig. 8 The comparison of the predicted and true values employing GRU-based DA-RNN on three different states. (a) Washington (b) Ohio (c) Los Angeles
Sources of data used in this study
| Data | Source |
|---|---|
| Pop Density | https://worldpopulationreview.com/states/ |
| Sex Ratio | https://www.kff.org/other/state-indicator/distribution-by-gender/ |
| Smoking Rate | https://worldpopulationreview.com/states/smoking-rates-by-state/ |
| Age 65+ | |
| Income | |
| Urban | |
| GDP | |
| Unemployment | |
| Physicians | |
| Hospital Beds | |
| Flu Deaths | |
| Respiratory Deaths | |
| Health Spending | |
| Major Airports | |
| Public Transportation | |
| Temperature | |
| Pollution | |
| Confirmed cases | |
| Google Mobility | |