| Literature DB >> 32839733 |
Abstract
In light of the COVID-19 pandemic that has struck the world since the end of 2019, many endeavors have been carried out to overcome this crisis. Taking uncertainty into consideration as an inherent feature of forecasting, this data article introduces long-term time-series predictions of the virus's daily infections in Brazil, produced by training forecasting models on limited raw data (30 time-step and 40 time-step alternatives). The primary reuse potential of this forecasting data is to enable decision-makers to develop action plans against the pandemic, and to help researchers working in infection prevention and control to: (1) explore the use of limited data in predicting infections; and (2) develop a reinforcement learning model on top of this data lake, which can perform an online game between the trained models to generate a new model capable of predicting future true data. The prediction data was generated by training 4200 recurrent neural networks (54 to 84 day validation periods) on raw data from Johns Hopkins University's online repository, to pave the way for generating reliable extended long-term predictions.
Keywords: Brazil; COVID-19; Deep learning; Forecasting; Infectious disease; Prediction; Recurrent neural network; Time-series
Year: 2020 PMID: 32839733 PMCID: PMC7437445 DOI: 10.1016/j.dib.2020.106175
Source DB: PubMed Journal: Data Brief ISSN: 2352-3409
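The raw input is the JHU CSSE time-series CSV of cumulative confirmed cases, from which daily infections are obtained as first differences. A minimal sketch of that extraction, using a made-up two-row sample in the CSSE column layout (the real file lives in the CSSE GitHub repository; the function name and sample values are illustrative, not from the article's code):

```python
import csv
import io

# The JHU CSSE time-series CSV stores cumulative confirmed cases, with one
# column per date after the four location columns. The numbers below are
# made up for illustration.
SAMPLE = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20
,Brazil,-14.235,-51.9253,0,0,5,12
"""

def daily_infections(csv_text, country):
    """Return daily new cases for one country: first differences of the
    cumulative series, clipped at zero to absorb reporting corrections."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip header row
    for row in reader:
        if row[1] == country:
            cumulative = [int(x) for x in row[4:]]
            return [max(b - a, 0) for a, b in zip([0] + cumulative, cumulative)]
    raise KeyError(country)

print(daily_infections(SAMPLE, "Brazil"))  # -> [0, 0, 5, 7]
```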
Metadata. DM1: Deterministic mode 1; DM2: Deterministic mode 2; NDM: Non-deterministic mode. Source: Author.
| ID | Item | DM1 | DM2 | NDM | Technical validation | Technical validation (control group) |
|---|---|---|---|---|---|---|
| – | Country | Brazil | Brazil | Brazil | Brazil | India |
| 0 | Start date for training data | 07-04-2020 | 07-04-2020 | 07-04-2020 | 08-03-2020 | 07-04-2020 |
| 1 | End date for training data | 06-05-2020 | 06-05-2020 | 06-05-2020 | 06-04-2020 | 06-05-2020 |
| 2 | Start date for evaluation data | 07-05-2020 | 07-05-2020 | 07-05-2020 | 07-04-2020 | 07-05-2020 |
| 3 | End date for evaluation data | 29-06-2020 | 29-06-2020 | 29-06-2020 | 13-06-2020 | 11-07-2020 |
| 4 | Duration of evaluation data | 54 days | 54 days | 54 days | 68 days | 66 days |
| 5 | Start date for training process | 28-06-2020 | 28-06-2020 | 03-07-2020 | 13-06-2020 | 12-07-2020 |
| 6 | End date for training process | 30-06-2020 | 01-07-2020 | 03-07-2020 | 13-06-2020 | 12-07-2020 |
| 7 | Number of models | 1197 | 1976 | 20 | 1001 | 1 |
| 8 | Number of predictions | 2835 | 7301 | 53 | 3619 | 1 |
| 9 | Number of graphs | 2835 | 7301 | 53 | 3619 | 1 |
| 10 | Number of time-steps | 30 | 40 | 30 | 30 | 30 |
| 11 | Processor | CPU | CPU | GPU | GPU | CPU |
| 12 | Crop-point of input data (days since 22-01-2020; data before this day is removed) | day 110 | day 110 | day 110 | day 80 | days 110, 115, 125, 135 |
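The crop-point in row 12 counts days from the start of the raw series. A small sketch of that preprocessing step, assuming day 0 corresponds to 22-01-2020 (the repository's first date) and a plain Python list for the series; the `crop` helper is hypothetical, not a function from the article's code:

```python
from datetime import date, timedelta

# Day 0 of the raw series is assumed to be 22 January 2020; a crop-point of
# 110 drops all observations before day 110.
SERIES_START = date(2020, 1, 22)

def crop(series, crop_point):
    """Drop observations before `crop_point` days after the series start."""
    return series[crop_point:]

# Calendar date that day 110 falls on, under the day-0 assumption above.
print(SERIES_START + timedelta(days=110))  # -> 2020-05-11
```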
Best 10 performing models in the deterministic mode 1. Source: Author.
| ID | Model | r2_all_duration |
|---|---|---|
| 140 | model_trained_d46818b3-b5e8-4aed-b379-d9bb56e5d6a4.h5 | 0.665547468 |
| 2052 | model_trained_3664a884-649f-4c54-9b9f-a7450346ba1d.h5 | 0.589318674 |
| 2739 | model_trained_0a220f4e-0b3a-49a5-afe6-349f3e35c3c2.h5 | 0.587505583 |
| 590 | model_trained_de93114c-7638-4df8-b35d-6b6928edea9f.h5 | 0.586324501 |
| 2827 | model_trained_2ddfbf07-8602-4357-9b5f-d41d366dfb61.h5 | 0.585366432 |
| 872 | model_trained_45ff443b-8380-46c5-b824-d470d5bf5935.h5 | 0.583010877 |
| 1636 | model_trained_bd30a0c6-a93e-441d-b5db-a2a77a18ba90.h5 | 0.57803346 |
| 1374 | model_trained_ba60d0a0-64b2-45c8-bf1d-b158686deef1.h5 | 0.576606486 |
| 2166 | model_trained_a556979b-a212-49bb-b2be-d24f8ca8c48a.h5 | 0.57569909 |
| 1909 | model_trained_cf40d4ca-6cd7-44dc-a1e9-bcb190a5d466.h5 | 0.574641935 |
Fig. 1. Graph of the third-best model in the deterministic mode 1. Source: Author.
Settings of third-best model in the deterministic mode 1. Source: Author.
| Settings/ID=2739 | Value |
|---|---|
| model | model_trained_0a220f4e-0b3a-49a5-afe6-349f3e35c3c2.h5 |
| r2_all_duration | 0.587505583 |
| time-steps | 15 |
| epochs | 1000 |
| batch-size | 1024 |
| validation-split | 0.3 |
| rnn | recurrent_v2.GRU |
| layers | [64, 32, 64] |
| dropout | 0 |
| conv-rnn | TRUE |
| seed-python | 47 |
| seed-tf | 8 |
| functional-api | TRUE |
| t1 | 6 |
| r2_time_steps | 0.460565672 |
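The `time-steps` setting (15 here) is the length of the input window: the model sees the last 15 daily values and predicts the next one. A sketch of the standard supervised framing this implies; the exact preprocessing in the article's code may differ, and `make_windows` is a hypothetical helper:

```python
def make_windows(series, time_steps):
    """Frame a univariate series as (window, next value) training pairs:
    each input is `time_steps` consecutive values, and the target is the
    value immediately following the window."""
    X, y = [], []
    for i in range(len(series) - time_steps):
        X.append(series[i:i + time_steps])
        y.append(series[i + time_steps])
    return X, y

X, y = make_windows(list(range(20)), time_steps=15)
print(len(X), X[0][-1], y[0])  # -> 5 14 15
```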
Fig. 2. Graphs of other models with different trends in the deterministic mode 1. Source: Author.
Best 10 performing models in the deterministic mode 2. Source: Author.
| ID | Model | r2_all_duration |
|---|---|---|
| 7221 | model_trained_3822084a-e528-4571-9796-43acbd33c9c3.h5 | 0.602340286 |
| 3379 | model_trained_8ea50721-243a-4efe-a7a9-cbe0b408cdea.h5 | 0.602340286 |
| 4726 | model_trained_a1c16bba-9203-4a61-89a1-baed9958b947.h5 | 0.60177885 |
| 5491 | model_trained_b5aa1f4e-281d-4413-a9ed-3842957d936c.h5 | 0.597394405 |
| 4624 | model_trained_1671766c-327d-4e18-b4e2-a2619b36bab0.h5 | 0.59109153 |
| 3304 | model_trained_c370645c-37d8-429f-b5bb-de572a6297fb.h5 | 0.590435975 |
| 3663 | model_trained_019628ca-e614-460b-8445-aefaa667f04f.h5 | 0.589312596 |
| 7095 | model_trained_8b5b8f8f-e463-456c-a73e-9b8fc4248cd4.h5 | 0.58849072 |
| 2976 | model_trained_d678e1c1-1452-4d5f-9c72-95c97e93ac57.h5 | 0.587584874 |
| 2796 | model_trained_5cb9ae23-2f1a-4cbf-9bd1-e429a2250f5e.h5 | 0.587506132 |
Fig. 3. Graph of the second-best model in the deterministic mode 2. Source: Author.
Settings of second-best model in the deterministic mode 2. Source: Author.
| Settings/ID=4726 | Value |
|---|---|
| model | model_trained_a1c16bba-9203-4a61-89a1-baed9958b947.h5 |
| r2_all_duration | 0.60177885 |
| time-steps | 20 |
| epochs | 1200 |
| batch-size | 1024 |
| validation-split | 0.3 |
| rnn | recurrent_v2.GRU |
| layers | [16, 32, 16] |
| dropout | 0 |
| conv-rnn | TRUE |
| seed-python | 46 |
| seed-tf | 43 |
| functional-api | TRUE |
| t1 | 10 |
| r2_time_steps | 0.368462128 |
| r2_sum | 0.970240978 |
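The `r2_sum` entry appears to be the simple sum of the two reported R² scores (R² over the whole evaluation duration plus R² over the first `time-steps` days); this reading is inferred from the tabled values in both settings tables, not stated explicitly in the record. Checking the arithmetic for model ID 4726:

```python
# Inferred relationship: r2_sum = r2_all_duration + r2_time_steps.
# Values taken from the settings table for model ID 4726 (deterministic mode 2).
r2_all_duration = 0.60177885
r2_time_steps = 0.368462128

r2_sum = r2_all_duration + r2_time_steps
print(round(r2_sum, 9))  # -> 0.970240978, matching the tabled r2_sum
```

The same relationship holds for the non-deterministic-mode model ID 26 (0.544462338 + 0.129786299 = 0.674248637).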
Fig. 4. Graphs of other models with different trends in the deterministic mode 2. Source: Author.
Best 10 performing models in the non-deterministic mode. Source: Author.
| ID | Model | r2_all_duration |
|---|---|---|
| 42 | model_trained_fe30df8d-1726-4692-a025-3276b035d22d.h5 | 0.558035054 |
| 26 | model_trained_f7762637-038a-4dec-b1ae-fbc05556a5e4.h5 | 0.544462339 |
| 11 | model_trained_29d990e5-ff1f-485b-9a71-10728826e4ab.h5 | 0.496746963 |
| 12 | model_trained_f05990fe-93ab-4634-bb21-eb633bbe8dce.h5 | 0.493894826 |
| 43 | model_trained_fe30df8d-1726-4692-a025-3276b035d22d.h5 | 0.482351598 |
| 9 | model_trained_c6ef4b9c-b473-493d-a388-2e2223183d19.h5 | 0.467796218 |
| 27 | model_trained_f7762637-038a-4dec-b1ae-fbc05556a5e4.h5 | 0.465112991 |
| 5 | model_trained_f7e3ec50-0f24-4ab6-9286-fa4e0c3de866.h5 | 0.458621676 |
| 7 | model_trained_c6ef4b9c-b473-493d-a388-2e2223183d19.h5 | 0.437672728 |
| 30 | model_trained_f3ac1c52-96b3-4e1d-ae21-16b54a0e69e8.h5 | 0.423220917 |
Fig. 5. Graph of the second-best model in the non-deterministic mode. Source: Author.
Settings of second-best model in the non-deterministic mode. Source: Author.
| Settings/ID=26 | Value |
|---|---|
| model | model_trained_f7762637-038a-4dec-b1ae-fbc05556a5e4.h5 |
| r2_all_duration | 0.544462338 |
| time-steps | 15 |
| epochs | 600 |
| batch-size | 1024 |
| validation-split | 0.3 |
| rnn | recurrent_v2.GRU |
| layers | [128, 256, 128] |
| dropout | 0 |
| conv-rnn | TRUE |
| seed-python | 13 |
| seed-tf | 4 |
| functional-api | TRUE |
| t1 | 7 |
| r2_time_steps | 0.129786299 |
| r2_sum | 0.674248637 |
Fig. 6. Graphs of other models with different trends in the non-deterministic mode. Source: Author.
Fig. 7. Graph for a validation model. Source: Author.
Analysis of the validation dataset against the other datasets. DM1: Deterministic mode 1; DM2: Deterministic mode 2; NDM: Non-deterministic mode. Source: Author.
| Item | Value | % of models/ sample size | DM1 | DM2 | NDM |
|---|---|---|---|---|---|
| Confidence level | 95% | – | – | – | – |
| Margin of error | 4.15% | – | – | – | – |
| Number of models | 1001 | – | 1197 | 1976 | 20 |
| Sample size | 358 | – | – | – | – |
| Number of predictions | 5208 | – | 2835 | 7301 | 53 |
| Mean accuracy | 48.25% | – | 26.51% | 30.60% | 24.49% |
| Standard deviation | 0.2070 | – | 0.1506 | 0.15621 | 0.1539 |
| Models with accuracy < 0.4 | 329 | 92% | 91.73% | 91.55% | 100.00% |
| Models with accuracy between 0.4 and 0.5 | 260 | 73% | 30.83% | 51.32% | 35.00% |
| Models with accuracy between 0.5 and 0.6 | 275 | 77% | 12.61% | 26.87% | 10.00% |
| Models with accuracy between 0.6 and 0.7 | 289 | 81% | 0.08% | 0.15% | 0% |
| Models with accuracy between 0.7 and 0.8 | 229 | 64% | 0% | 0% | 0% |
| Models with accuracy > 0.8 | 3 | 1% | 0% | 0% | 0% |
| Crop-point | 80 | – | 110 | 110 | 110 |
Evaluating performance till 2020-07-11. Source: Author.
| Mode | Accuracy till 2020-06-29 | Accuracy till 2020-07-11 |
|---|---|---|
| Deterministic (30 time-steps) | 0.5875 | 0.6374 |
| Deterministic (40 time-steps) | 0.6018 | 0.5898 |
| Non-deterministic (30 time-steps) | 0.5444 | 0.5401 |
Evaluating performance of India as a control group till 2020-07-11. Source: Author.
| Crop-point | Accuracy till 2020-07-11 | Duration of evaluation | Trendline |
|---|---|---|---|
| 115 | 0.9557 | 61 | Exponential |
| 110 | 0.9404 | 66 | Exponential |
| 115 | 0.9286 | 61 | Exponential |
| 110 | 0.9107 | 66 | Exponential |
| 125 | 0.906 | 51 | Exponential |
| 115 | 0.8914 | 61 | Exponential |
| 135 | 0.7992 | 41 | Polynomial |
| 115 | 0.7251 | 61 | Polynomial |
Comparing performance of different time-steps, till 2020-06-29. Source: Author.
| Time-steps | R² accuracy |
|---|---|
| 50 | −2.770762177 |
| 40 | 0.531547951 |
| 30 | 0.599121699 |
| 20 | −2.157868562 |
| 10 | −2.616230335 |
Fig. 8. The recursive predict function. Source: Author.
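Fig. 8 shows the recursive predict function, which turns a one-step-ahead model into a long-term forecaster: the model predicts the next day from the last `time_steps` observations, the prediction is appended to the window, and the process repeats. A minimal sketch of that idea (the `model_predict` callable stands in for the trained RNN; the article's actual implementation may differ):

```python
def recursive_predict(model_predict, history, time_steps, horizon):
    """Roll a one-step-ahead model forward `horizon` days by feeding its
    own predictions back in as inputs."""
    window = list(history[-time_steps:])
    forecasts = []
    for _ in range(horizon):
        next_value = model_predict(window)   # one-step-ahead prediction
        forecasts.append(next_value)
        window = window[1:] + [next_value]   # slide the window forward
    return forecasts

# Toy stand-in model: predicts the mean of the current window.
mean_model = lambda w: sum(w) / len(w)
print(recursive_predict(mean_model, [1.0, 2.0, 3.0], time_steps=3, horizon=2))
```

Because each forecast becomes an input for the next step, errors compound with the horizon, which is why the article evaluates accuracy over the whole duration as well as over the first `time-steps` days.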
Number of training sessions. Source: Author.
| Mode | Population | Confidence level | Margin of error | Required sample size per population | Number of training sessions | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DM1 | 1197 | 95% | 4% | 400 | 2 | 404 | 793 | – | – | – | – |
| DM2 | 1976 | 95% | 4% | 461 | 6 | 4 | 355 | 153 | 334 | 600 | 530 |
| NDM | 20 | 95% | 4% | 20 | 1 | 20 | – | – | – | – | – |
| TVM | 1001 | 95% | 4% | 535 | 1 | 1001 | – | – | – | – | – |
For the non-deterministic modes (NDM and TVM), these values are informative only, as the training sessions are non-deterministic by nature.
The 20 models in the non-deterministic mode were all trained in one session. Google Drive File Stream, the service that automatically syncs files from the Google Colab training session to Google Drive, created the designated folder on 3 July. It recorded the correct creation date (3 July) for 9 of the 20 files but an incorrect modification date (25 June) for those same 9 files, which even appears as 24 June when they are downloaded; this error can be noticed in the compressed zip file. However, the code already included a second layer of protection against such errors: an internal collective settings dictionary file, created on 3 July, records exactly the settings used at the beginning of the training process for each model, before the individual settings files are generated. This dictionary clarifies that the settings used to initialize the training of the 9 models were actually created on 3 July, as the dictionary-creation code is responsible for generating the universally unique identifiers used in the models' naming convention.
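The required sample sizes in the table are consistent with Cochran's sample-size formula at maximum variance (p = 0.5) with a finite population correction; the sketch below reproduces the DM1, DM2, and NDM values under that assumption (the authors' exact calculator is not specified, and `required_sample_size` is a hypothetical helper):

```python
import math

def required_sample_size(population, margin_of_error, z=1.96, p=0.5):
    """Cochran's formula with finite population correction, at 95%
    confidence (z = 1.96) and maximum variance (p = 0.5)."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    n = n0 / (1 + (n0 - 1) / population)
    return min(population, math.ceil(n))

print(required_sample_size(1197, 0.04))  # DM1 -> 400
print(required_sample_size(1976, 0.04))  # DM2 -> 461
print(required_sample_size(20, 0.04))    # NDM -> 20
```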
Infectious Diseases
Forecasting COVID-19's daily infections in Brazil by using deep recurrent neural network models with limited pandemic data.
- Tables (.csv)
- The data was generated by training 4200 deep recurrent neural networks (RNNs) on limited raw data (30 time-steps and 40 time-steps) from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University at:
- Raw: the generated prediction data
- The metadata and training/inference settings for the deterministic setup are located in the settings folder in the data repository.
The csv file of daily COVID-19 infection numbers from January 22, 2020, to several dates (indicated in the metadata) was downloaded from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
- City: Cairo
Repository name: Mendeley
Fast growth pattern in the validation model. Source: Author.
| Values / Date | 5/16/2020 | 5/31/2020 | 6/14/2020 | 6/29/2020 | 10/1/2020 |
|---|---|---|---|---|---|
| Predicted value | 7987 | 16,651 | 27,513 | 44,050 | 1,565,159 |
| True value | 13,220 | 16,409 | 17,110 | 24,052 | NA |
| Interval (days) | −15 | 0 | 15 | 30 | 110 |
| Increase (predictions) | 140.86% | 108.48% | 65.23% | 60.11% | 3453.14% |
| Increase (true values) | 169.91% | 24.12% | 4.27% | 40.57% | NA |