Application of probabilistic models for extreme values to the COVID-2019 epidemic daily dataset.

Daniel Canton Enriquez, Jose A Niembro-Ceceña, Martin Muñoz Mandujano, Daniel Alarcon, Jorge Arcadia Guerrero, Ivan Gonzalez Garcia, Agueda Areli Montes Gutierrez, Alfonso Gutierrez-Lopez.

Abstract

Worldwide, COVID-19 coronavirus disease is spreading rapidly in second and third waves of infections. In this context of increasing infections, it is critical to know the probability of a specific number of cases being reported. We collated data on new daily confirmed cases of COVID-19 outbreaks in Argentina, Brazil, China, Colombia, France, Germany, India, Indonesia, Iran, Italy, Mexico, Poland, Russia, Spain, the U.K., and the United States, from the 20th of January, 2020 to the 28th of August, 2021. A selected sample of almost ten thousand data points is used to validate the proposed models. Generalized Extreme-Value Distribution Type 1 (Gumbel) and Exponential (1 and 2 parameters) models were introduced to analyze the probability of new daily confirmed cases. The data presented in this document provide, for each country, the daily probability of the incidence rate. In addition, the frequencies of historical events, expressed as a return period in days, are provided for the complete data set.
© 2022 The Authors. Published by Elsevier Inc.

Keywords:  Coronavirus; Daily new cases statistical analysis; Exponential distribution; Gumbel distribution; Probabilistic analysis

Year:  2022        PMID: 35005154      PMCID: PMC8719919          DOI: 10.1016/j.dib.2021.107783

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the Data

Data on daily COVID-19 cases are now easy to obtain. Authorities are compiling, cross-checking and releasing these data for examination and analysis, so they are widely available in most countries. However, it is not easy to associate a probability of occurrence with each daily case report. These data can be updated through official reports and specialized websites.

The database presented here is easy to update as the epidemic progresses (including the third wave in some countries). In the data set, new daily cases are associated with their probability of occurrence. They can be used to determine the probability of recent infections at specific sites.

The likelihood of a new outbreak of COVID-19 in any of the countries above can be estimated using the best-fitting extreme value probability distribution.

This dataset also supports a better understanding of the differences in geographic scale when forecasting COVID-19 case counts. Ref. [2] shows that statistically significant differences exist, based on percentage error metrics, when the same forecasting method is used at different levels of geographic resolution.

The probability distributions presented complement a forecasting model. This dataset provides the daily probability of the incidence rate, which could be explored alongside forecasting data to gain further insight into the validity of different forecasts at varied geographic scales as a result of population size differences across countries.

To provide health institutions, research centers and authorities with probabilistic tools to respond to changes in the epidemic, the Matlab code that systematizes the frequency calculations is included.

Data Description

Worldwide, COVID-19 coronavirus disease is spreading rapidly in second and third waves of infections. In this context of increasing infections, it is critical to know the probability of a specific number of cases being reported [3]. Daily data on new confirmed cases of COVID-19 outbreaks in the 16 most affected countries (Argentina, Brazil, China, Colombia, Italy, Spain, France [4], Germany, India [5], Indonesia, Iran, Mexico, Poland, Russia, the U.K., and the United States) from the 20th of January, 2020 to the 28th of August, 2021 were collected from the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) (https://coronavirus.jhu.edu/). A sample of more than ten thousand daily data points is used to validate the proposed models. Figs. 1 to 4 show examples of the frequency analysis fit, comparing the fitted Exp-1P, Exp-2P and Gumbel models with the daily data on confirmed cases of COVID-19. Figs. 5 to 7 show the probabilistic characterization of daily new confirmed cases for individual countries.
Fig. 1

Comparison between fit proposed models Exp-1P, Exp-2P and Gumbel with daily data on confirmed cases of COVID-19 in Germany.

Fig. 2

Comparison between fit proposed models Exp-1P, Exp-2P and Gumbel with daily data on confirmed cases of COVID-19 in Iran.

Fig. 3

Comparison between fit proposed models Exp-1P, Exp-2P and Gumbel with daily data on confirmed cases of COVID-19 in Italy.

Fig. 4

Comparison between fit proposed models Exp-1P, Exp-2P and Gumbel with daily data on confirmed cases of COVID-19 in Mexico.

Fig. 5

Daily new confirmed cases in Italy, probabilistic characterization with Exp-2P.

Fig. 6

Daily new confirmed cases in Argentina, probabilistic characterization with Exp-1P.

Fig. 7

Daily new confirmed cases in Mexico, probabilistic characterization with Gumbel.

Very specific studies on COVID-19 forecasting are currently available. It is common to use autoregressive models of the ARMA(p,q) type. For example, [6] uses hidden Markov chain models on Moroccan data and [7] uses recurrent neural networks; these studies are "forecasting" models. However, there are few studies on the probability of a specific number of infections occurring in a day. This is one of the highlights of this dataset. We propose using frequency analysis to assign a probability of occurrence (infection) to a particular day in a specific country. A theoretical frequency analysis fits a data series to a probability distribution function, which represents the probability of occurrence of a random variable. This procedure must be applied when one wants to know the event associated with a return period greater than the length of the data record; it is called theoretical because such an event cannot be estimated from an empirical frequency table. There are several probability distribution functions. Those most successfully used are: normal, log-normal, exponential, gamma, Pearson type III (or three-parameter gamma), log-Pearson type III and the extreme value distributions of types I, II and III (Gumbel, Frechet and Weibull, respectively). Mixed probability functions are also used, i.e. functions that can take into account two or three data sets. For daily COVID-19 data we propose to use the extreme value distributions shown below.

Gumbel distribution

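The Gumbel cumulative distribution function is

$$F(x) = \exp\left[-\exp\left(-\frac{x-\beta}{\alpha}\right)\right]$$

where the parameters are estimated by the method of moments as

$$\alpha = \frac{\sqrt{6}}{\pi}\,S, \qquad \beta = \bar{x} - 0.5772\,\alpha$$

Here $S$ is the standard deviation and $\bar{x}$ is the mean; $\alpha$ is the scale parameter and $\beta$ is the shape parameter. Equating the exceedance probability given by the return period $T$ with the distribution function gives

$$\frac{1}{T} = 1 - F(x)$$

and solving for $x$:

$$x_T = \beta - \alpha \ln\left[-\ln\left(1 - \frac{1}{T}\right)\right]$$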
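As a rough illustration of the Gumbel return-period calculation, the sketch below assumes the standard method-of-moments estimators (scale = sqrt(6)·S/π, second parameter = mean − 0.5772·scale) and the usual relation T = 1/(1 − F(x)). The function names are hypothetical and are not taken from the authors' Matlab code; the use of the sample standard deviation is also an assumption.

```python
import math
from statistics import mean, stdev

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant (0.5772 in the text)


def gumbel_moments(data):
    """Method-of-moments estimates (alpha, beta) for a Gumbel fit.

    Assumption: the sample standard deviation is used for S.
    """
    s = stdev(data)
    alpha = math.sqrt(6.0) * s / math.pi
    beta = mean(data) - EULER_GAMMA * alpha
    return alpha, beta


def gumbel_cdf(x, alpha, beta):
    """Non-exceedance probability F(x) of the Gumbel distribution."""
    return math.exp(-math.exp(-(x - beta) / alpha))


def gumbel_return_level(T, alpha, beta):
    """Daily case count associated with a return period of T days (T > 1)."""
    return beta - alpha * math.log(-math.log(1.0 - 1.0 / T))


def gumbel_return_period(x, alpha, beta):
    """Return period in days of a daily count x, T = 1 / (1 - F(x))."""
    return 1.0 / (1.0 - gumbel_cdf(x, alpha, beta))
```

With the Gumbel parameters listed for Italy in Table 1 (7049.8 and 3789.4), `gumbel_return_level(100, 7049.8, 3789.4)` gives the daily case count that the fitted model expects to be equalled or exceeded once every 100 days on average.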

Exponential distribution

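The one-parameter exponential distribution function is

$$F(x) = 1 - e^{-\lambda x}, \qquad \lambda = \frac{1}{\bar{x}}$$

where $\bar{x}$ is the mean and $\lambda$ is the location parameter. In terms of the return period $T$:

$$\frac{1}{T} = e^{-\lambda x}$$

and solving for $x$:

$$x_T = \frac{\ln T}{\lambda}$$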

Exponential II distribution

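The two-parameter exponential distribution function is

$$F(x) = 1 - \exp\left(-\frac{x-\beta}{\alpha}\right)$$

where the parameters are estimated by the method of moments as

$$\alpha = S, \qquad \beta = \bar{x} - S$$

Here $S$ is the standard deviation and $\bar{x}$ is the mean; $\alpha$ is the scale parameter and $\beta$ is the shape parameter. In terms of the return period $T$:

$$\frac{1}{T} = \exp\left(-\frac{x-\beta}{\alpha}\right)$$

and solving for $x$:

$$x_T = \beta + \alpha \ln T$$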
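The two exponential models can be sketched in the same way. The code below assumes the standard moment estimators (rate = 1/mean for the one-parameter model; scale = S and second parameter = mean − S for the two-parameter model) and the relation 1/T = 1 − F(x). All function names are hypothetical, and the sample standard deviation is an assumption.

```python
import math
from statistics import mean, stdev


def exp1_rate(data):
    """One-parameter exponential: lambda = 1 / mean."""
    return 1.0 / mean(data)


def exp1_return_level(T, lam):
    """Daily count with return period T days under the 1P model."""
    return math.log(T) / lam


def exp2_moments(data):
    """Two-parameter exponential: alpha = S, beta = mean - S.

    Assumption: the sample standard deviation is used for S.
    """
    s = stdev(data)
    return s, mean(data) - s


def exp2_return_level(T, alpha, beta):
    """Daily count with return period T days under the 2P model."""
    return beta + alpha * math.log(T)


def exp2_return_period(x, alpha, beta):
    """Return period in days of a daily count x under the 2P model.

    Counts at or below beta are exceeded with probability 1, so T = 1.
    """
    if x <= beta:
        return 1.0
    return math.exp((x - beta) / alpha)
```

Note that the second parameter of the two-parameter model is negative whenever the standard deviation exceeds the mean, which is consistent with the negative values that appear in Table 1 for several countries.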

Experimental design, materials and methods

Generalized Extreme-Value Distribution Type 1 (Gumbel) [8] and Exponential models were introduced to analyze the probability of new daily confirmed cases. The data presented in this document provide, for each country, the daily probability of the incidence rate [9]. In addition, the frequencies of historical events, expressed as a return period in days, are provided for the complete data set. Table 1 shows the estimated parameters of the distributions used. This probabilistic analysis comes from the frequency analysis in each of the countries. Only some countries are shown here as examples; the complete probabilistic analysis can be obtained from the database of this paper. If a series of extreme values is used, the maximum value recorded on each day must be used. This series is used when the design must be based on the most adverse conditions. The empirical return period of this data series is obtained with the expression proposed by Hosking et al. [10], where T is the empirical return period in days, n is the total number of data in each country, and m is the order number in a list sorted from the highest to the lowest value.

Table 1

Parameters of the fitted Exponential (1 and 2 parameters) and Gumbel models.

| Country | Total analyzed data | Gumbel scale parameter | Gumbel shape parameter | Exponential 2p scale parameter | Exponential 2p shape parameter | Exponential 1p location parameter |
|---|---|---|---|---|---|---|
| Argentina | 540 | 7448.0 | 4556.5 | 8284.5 | 1267.9 | 0.0001046 |
| Brazil | 546 | 300,052.0 | 13,869.0 | 25,216.4 | 13,326.7 | 0.0000259 |
| China | 581 | 126.8 | 460.6 | 837.5 | -674.9 | 0.0061500 |
| Colombia | 537 | 7106.9 | 4143.5 | 9114.9 | 1581.3 | 0.0001090 |
| France | 579 | 11,938.3 | 5691.2 | 15,305.6 | -2732.5 | 0.0000790 |
| Germany | 576 | 6768.8 | 2873.0 | 8771.6 | -1593.2 | 0.0001440 |
| India | 573 | 44,240.5 | 45,262.6 | 82,295.6 | -25,555.0 | 0.0000176 |
| Indonesia | 541 | 7890.7 | 2856.5 | 10,116.2 | -2707.4 | 0.0001350 |
| Iran | 553 | 6942.9 | 4595.5 | 8901.2 | -300.1 | 0.0001160 |
| Italy | 573 | 7049.8 | 3789.4 | 9041.7 | 1183.6 | 0.0001273 |
| Mexico | 544 | 2498.0 | 4529.4 | 5174.9 | 796.3 | 1867.00 |
| Poland | 539 | 6089.8 | 1841.6 | 7810.4 | 2454.2 | 0.0001867 |
| Russia | 572 | 6366.8 | 8022.4 | 8165.7 | 3531.2 | 0.0000855 |
| Spain | 571 | 9843.1 | 2891.5 | 12,619.4 | -4049.1 | 8570.26 |
| Turkey | 532 | 10,352.3 | 4252.0 | 13,277.3 | -3050.5 | 0.0000978 |
| U. Kingdom | 572 | 10,913.0 | 5271.3 | 13,991.1 | -2423.8 | 0.0000860 |
| United States | 580 | 50,807.4 | 36,334.4 | 65,137.7 | 508.7 | 65,646.35 |

When historical records of a phenomenon, defined as daily data, are used, each value should be assigned a return period according to its observed cumulative frequency (frequency table). The calculation assumes that the frequency or recurrence interval of each observed event allows a return period to be assigned to each value. This is known as the observed (empirical) return period. Since the return period has a purely probabilistic definition, the return period T of a daily event x should be defined as the inverse of the probability P(x) of that event occurring. This means that the probability of x being equalled or exceeded by another event must be expressed as:

$$P(X \ge x) = \frac{1}{T}$$
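The empirical return period described above ranks the record from the highest to the lowest value. The expression from [10] is not reproduced in this text, so the sketch below assumes the Weibull plotting position T = (n + 1)/m purely for illustration; the authors' formula may differ. The function name is hypothetical.

```python
def empirical_return_periods(values):
    """Pair each observation with an empirical return period in days.

    Assumption: Weibull plotting position, T = (n + 1) / m, where n is the
    number of observations and m is the rank in descending order (m = 1
    for the largest value). Tied values receive consecutive ranks.
    """
    n = len(values)
    ranked = sorted(values, reverse=True)
    return [(x, (n + 1) / m) for m, x in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Made-up daily counts, used only to show the call.
    for count, period in empirical_return_periods([120, 45, 300, 80]):
        # The empirical exceedance probability is the inverse of T.
        print(count, period, 1.0 / period)
```

Under this assumption the largest observation in a record of n days is assigned a return period of n + 1 days, and the theoretical return periods from the fitted distributions can be compared against these empirical values.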

Ethics Statements

The authors paid attention to the ethical rules in the study. There is no violation of ethics. The authors declare that this work does not involve the use of human subjects or experimentation with animals.

CRediT Author Statement

Daniel Canton Enriquez and Alfonso Gutierrez-Lopez: designed the model and the computational framework. All authors: analyzed the data, carried out the implementation and performed the calculations. Alfonso Gutierrez-Lopez and Martin Muñoz Mandujano: wrote the manuscript with input from all authors. Ivan Gonzalez Garcia, Jose A. Niembro-Ceceña and Jorge Arcadia Guerrero: were in charge of overall direction and planning.

Funding

This work was financially supported by Consejo Nacional de Ciencia y Tecnología, CONACYT, Mexico.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Specifications Table

Subject: Data Mining and Statistical Analysis. Infectious Diseases

Specific subject area: Generalized Extreme-Value Distribution Type 1 (Gumbel) and Exponential (1 and 2 parameters) models applied to characterize probabilistically COVID-19 daily cases

Type of data: Table, Graph, Figure

How the data were acquired: The data on daily recent confirmed cases of COVID-19 were carefully collected from the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). The data were built as a time-series database in Excel, and probabilistic models for extreme values were satisfactorily established for analysis using Matlab.

Data format: Analyzed

Parameters for data collection: Under the framework of frequency analysis and the method of moments for parameter estimation, a probabilistic fit was carried out on the daily new confirmed COVID-19 cases. Raw data from Argentina, Brazil, China, Colombia, France, Germany, India, Indonesia, Iran, Italy, Mexico, Poland, Russia, Spain, the U.K., and the United States were used.

Description of data collection: Daily data on new confirmed cases of COVID-19 outbreaks in Argentina, Brazil, China, Colombia, France, Germany, India, Indonesia, Iran, Italy, Mexico, Poland, Russia, Spain, the U.K., and the United States from the 20th of January, 2020 to the 28th of August, 2021 are available in the database of the COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU) (https://coronavirus.jhu.edu/). There are no missing values, and the Excel file of the daily data is presented in the Supplementary Data. This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL) [1]. https://coronavirus.jhu.edu/map.html https://github.com/CSSEGISandData/COVID-19/blob/master/README.md

Data source location: Argentina, Brazil, China, Colombia, France, Germany, India, Indonesia, Iran, Italy, Mexico, Poland, Russia, Spain, the U.K., and the United States.

Data accessibility: The analyzed data are publicly hosted in a Mendeley repository. Repository name: Frequency analysis of new Covid-19 infections. Matlab code: https://github.com/dCantonE/FrequencyAnalysis. Supplementary material associated with this article: https://data.mendeley.com/datasets/kvnsn8nyhg/3

1.  A hierarchical spatio-temporal model to analyze relative risk variations of COVID-19: a focus on Spain, Italy and Germany.

Authors:  Abdollah Jalilian; Jorge Mateu
Journal:  Stoch Environ Res Risk Assess       Date:  2021-03-23       Impact factor: 3.379

2.  Estimation of COVID-19 prevalence in Italy, Spain, and France.

Authors:  Zeynep Ceylan
Journal:  Sci Total Environ       Date:  2020-04-22       Impact factor: 7.963

3.  The hidden Markov chain modelling of the COVID-19 spreading using Moroccan dataset.

Authors:  Abdelghafour Marfak; Doha Achak; Asmaa Azizi; Chakib Nejjari; Khalid Aboudi; Elmadani Saad; Abderraouf Hilali; Ibtissam Youlyouz-Marfak
Journal:  Data Brief       Date:  2020-07-24

4.  An interactive web-based dashboard to track COVID-19 in real time.

Authors:  Ensheng Dong; Hongru Du; Lauren Gardner
Journal:  Lancet Infect Dis       Date:  2020-02-19       Impact factor: 25.071

5.  Generated time-series prediction data of COVID-19's daily infections in Brazil by using recurrent neural networks.

Authors:  Mohamed Hawas
Journal:  Data Brief       Date:  2020-08-19

6.  Short-Range Forecasting of COVID-19 During Early Onset at County, Health District, and State Geographic Levels Using Seven Methods: Comparative Forecasting Study.

Authors:  Christopher J Lynch; Ross Gore
Journal:  J Med Internet Res       Date:  2021-03-23       Impact factor: 5.428
