
COVID-19 contagion forecasting framework based on curve decomposition and evolutionary artificial neural networks: A case study in Andalusia, Spain.

Miguel Díaz-Lozano1, David Guijo-Rubio2, Pedro Antonio Gutiérrez2, Antonio Manuel Gómez-Orellana2, Isaac Túñez3, Luis Ortigosa-Moreno1, Armando Romanos-Rodríguez1, Javier Padillo-Ruiz4, César Hervás-Martínez2.   

Abstract

Extensive research has been carried out to combat the COVID-19 pandemic since the first outbreak was detected in Wuhan, China. Anticipating the evolution of an outbreak helps to devise suitable economic, social and health care strategies to mitigate the effects of the virus. For this reason, predicting the SARS-CoV-2 transmission rate has become one of the most important and challenging problems of recent months. In this paper, we apply a two-stage mid- and long-term forecasting framework to the epidemic situation in eight districts of Andalusia, Spain. First, an analytical procedure is performed iteratively to fit polynomial curves to the cumulative curve of contagions. Then, the extracted information is used to estimate the parameters and structure of an evolutionary artificial neural network with hybrid architectures (i.e., with different basis functions for the hidden nodes), considering both single and simultaneous time horizon estimations. The results demonstrate that including the polynomial information extracted during the training stage significantly improves the mid- and long-term estimations in seven of the eight considered districts. The increase in average accuracy (for the joint mid- and long-term horizon forecasts) is 37.61% and 35.53% when considering the single and simultaneous forecast approaches, respectively.
© 2022 Elsevier Ltd. All rights reserved.


Keywords:  COVID-19 contagion forecasting; Curve decomposition; Evolutionary artificial neural networks; Time series

Year:  2022        PMID: 35784094      PMCID: PMC9235375          DOI: 10.1016/j.eswa.2022.117977

Source DB:  PubMed          Journal:  Expert Syst Appl        ISSN: 0957-4174            Impact factor:   8.665


Introduction

Since the first outbreak detected in Wuhan, China, the new coronavirus has spread widely and rapidly around the world due to its powerful human-to-human transmission capacity (Sanche et al., 2020), leading to exponential growth in the number of infected people in all countries. This new coronavirus, which produces the disease known as coronavirus disease 2019 (COVID-19), has put human health at risk by provoking fever, cough, and myalgia as common symptoms, and potentially leading to complications, such as acute respiratory distress syndrome, in a significant percentage of cases (Chen et al., 2020, Huang et al., 2020). On the 11th of March 2020, when many countries were in an emergency situation, the World Health Organization (WHO) declared the outbreak a global pandemic, thereby forcing countries to adopt prevention measures such as nationwide lockdowns, mandatory use of face masks and human mobility controls to suppress virus transmission. Many studies have shown that reducing social interaction by applying control measures helps to mitigate the spread of contagions (Kharroubi and Saleh, 2020, Kraemer et al., 2020). In each country, the number of outbreaks, their intensity and their duration depend on a wide variety of parameters, such as the size of the population, the control measures applied and even specific climatological features (Livadiotis, 2020, Malki et al., 2020). Nevertheless, the evolution of the outbreaks is similar regardless of the location where they take place. The dynamics of the pandemic are as follows: at the beginning, a small number of people become infected. Then, due to the high contagion capacity, if social interaction is not limited, the number of infections increases linearly in a short period of time. After that, when a sufficient number of people contract the virus, infections begin to rise exponentially, and the outbreak quickly spirals out of control.
At this time, control measures are usually applied to bring the outbreak under control. Hence, it is of great importance to predict the contagion rate in order to avoid the inflexion point at which the spread velocity starts to increase exponentially. Forecasting allows uncontrolled situations to be anticipated and thus can help in adopting health system preparedness measures to avoid hospital overcrowding and in devising social restrictions to minimize the number of infections and the economic impact (Maital & Barzani, 2020). To devise suitable economic, social and health care strategies to mitigate the pandemic effects, anticipating the evolution of the outbreaks has thus become a crucial task.
Multiple forecasting techniques have been proposed since data about the impact of COVID-19 (regarding infected, deceased and recovered people) started to be available. In this context, epidemiological models (Kermack & McKendrick, 1927) are the most popular approach to estimating the evolution of infectious diseases. These models use mutually exclusive compartments or states and assign them to a population of individuals to describe the dynamics of the population. Individuals flow through the compartments according to the parameters of the model. Compartmental models were mainly applied at the beginning of the pandemic period in different locations. For example, in Anastassopoulou, Russo, Tsakris, and Siettos (2020), a four-compartmental SIDR (Susceptible–Infectious–Recovered–Dead) model was used to estimate the most important epidemiological parameters, such as the basic reproduction number, $R_0$, and the rates of infection, mortality and recovery, using the COVID-19 incidence data of Hubei (China) from January 11 to February 10, 2020. Moreover, in Hauser et al. (2020), also using data from Hubei, an age-stratified SEIR (Susceptible–Exposed–Infected–Removed) model was fitted to estimate the symptomatic case-fatality ratio (sCFR) and infection-fatality ratio (IFR) over six regions of Europe. An extension of this model, including predictions of the number of cases using different time series forecasting techniques, was presented in Katris (2021).
Alternatively, machine learning (ML) techniques have also been applied to model some threatening aspects of the COVID-19 pandemic. ML approaches have a long history of solving real-world problems in different fields, including health care, economy and natural language processing. Concerning the health care area, and particularly the new SARS-CoV-2 virus, several ML and artificial intelligence expert systems have been proposed for different purposes. On the one hand, with the aim of supporting the screening and management of SARS-CoV-2 positive diagnoses, methods have been developed to augment traditional identification tools, using radio imaging technology to detect abnormalities associated with COVID-19 infections (Huang et al., 2020, Ng et al., 2020) as an alternative to conventional tests, or even as standalone methods when viral testing is not an option. In Song et al. (2021), a convolutional neural network (CNN) was applied to distinguish between COVID-19 infected patients, pneumonia infected patients and healthy patients using computed tomography (CT) images. A deep neural network called COVID-Net is presented in Wang, Lin and Wong (2020) for identifying positive SARS-CoV-2 diagnoses using chest X-ray (CXR) images. In Abbas, Abdelsamea, and Gaber (2021), a class decomposition (Abbas, Abdelsamea, & Gaber, 2020) technique is applied in combination with a pretrained CNN, improving the performance of the classifiers when the class decomposition layer is included as a preprocessing step.
A combination of deep learning and classical ML algorithms is proposed in Sethy and Behera (2020), where deep features of CXR images are extracted from the fully connected layer of a CNN and used to feed a support vector machine for distinguishing between healthy, COVID positive and pneumonia images. In Tamal et al. (2021), a set of radiomics features from CXR images were selected and used to train three classical ML classifiers, providing an accurate, fast and automatic method that can be integrated with standard X-ray reporting systems. Most recently, in Garg, Salehi, Rocca, Garner, and Duncan (2022), the performances of 20 different CNNs trained for classifying patients into three and two classes using chest CT images achieved an accurate and very efficient classification model. Apart from the new imaging diagnostic assistance mechanisms, multiple algorithms and statistical techniques have been applied to obtain an accurate prognosis of the pandemic rates, the peak of outbreaks in different countries or even estimations of specific pandemic wave scopes and to create analytical models that could act as decision support systems. In this sense, in Benvenuto, Giovanetti, Vassallo, Angeletti, and Ciccozzi (2020), a classical autoregressive integrated moving average (ARIMA) is applied to analyze the trend of COVID-19 prevalence and incidence and to perform short-term prediction about the spread of the virus. In Wang, Zheng, Li and Zhu (2020), three critical points of infections and recovered cases are estimated in different countries by considering a hybrid forecast model based on a logistic curve fit and Prophet (Taylor & Letham, 2018) application. A comparison of six time series forecasting techniques applied to active COVID-19 cases is performed in Papastefanopoulos, Linardatos, and Kotsiantis (2020). 
Following a similar methodology, in Ribeiro, da Silva, Mariani, and dos Santos Coelho (2020), the predictive capacity of different machine learning regression models is measured using data from 10 Brazilian states. More complex models are used in Verma, Mandal, and Gupta (2022) for forecasting purposes, where long short-term memory (LSTM) recurrent neural networks are designed for predicting the contagion rate in 4 Indian states. Multiple regressors were applied in An et al. (2020) to predict the risk of mortality using different characteristics of an infected person, concluding that variables such as advanced age or taking metformin are important predictors that influence the output probability. An assisted fuzzy case-based reasoning (FCBR) algorithm for determining patient attention priorities based on eight factors is presented in Geetha, Narayanamoorthy, Manirathinam, and Kang (2021) with the aim of improving medical assistance and reducing the mortality of COVID-19 patients. In Fidan and Yuksel (2022), an unsupervised study is carried out to analyze the importance of city-related parameters, such as population and environmental variables, in addition to the number of cases to establish restrictions to contain the spread of the virus. In Desai (2021), the effects of news sentiments are included in the forecast task, concluding that negative news sentiments could help to reduce the contagion rate. Finally, in Khan et al. (2021), a very comprehensive review of research regarding the use of diverse ML algorithms to combat the COVID-19 pandemic is presented, where applications such as diagnosis, detection or forecasting are carried out using different types of data in several countries. Among all the state-of-the-art techniques in ML, there is one option that stands out: feed-forward neural networks (Bishop et al., 1995). 
This is a highly adaptive technique that can be used to model most nonlinear problems, mainly due to its universal approximation capability (Hornik, Stinchcombe, & White, 1989). These models are layered structures, where each layer is composed of computational units that are connected to the nodes of the next layer by weighted connections. Artificial neural networks (ANNs) have been trained to solve multiple classification and regression problems (Abiodun et al., 2018). In this sense, backpropagation is the most popular algorithm for adjusting the network parameters in the training stage. This method computes the gradient of the loss function with respect to the weights of the network connections. However, even though this approach can efficiently adjust the weight magnitudes, a globally optimal solution is not guaranteed. The reason is the large extent of the multidimensional error surface and the presence of local optima where backpropagation can get stuck. To perform a more exhaustive search over this surface, evolutionary algorithms have been presented as an alternative to discover and explore different regions (trying to escape from local optima) where more accurate solutions could be located. For forecasting purposes, evolutionary artificial neural networks (EANNs) have been applied in different fields such as renewable energy (Gómez-Orellana et al., 2022, Mason et al., 2018), economy (Au et al., 2008, Chiroma et al., 2015, Yi-Hui, 2007) or environment (Lu, Fan, & Lo, 2003). When modeling a future value as a target variable, it is important to consider the existing trade-off between accuracy and anticipation. Short-term events are usually easier to model but do not offer enough anticipation time, while long-term horizon forecasts are very useful for anticipating important events, but their accuracy decreases as the forecasting horizon increases.
It is worth mentioning that the concepts of short and long horizons are relative and depend on the field of study where the prediction is carried out. Thus, in Gómez-Orellana et al. (2022), a 48-h forecast is considered a long-term horizon, while in Au et al. (2008) they refer to weekly events as short-term events. In the field of epidemiology, since the parameters associated with an outbreak evolve rapidly and the information is reported daily, 1, 3, and 5 days can be considered to be short-, mid- and long-term horizons, respectively. In this study, we make the following contributions. First, we propose a novel methodology for the analysis of the contagion curve by extracting evolution characteristics. The extraction procedure is performed through an iterative process of polynomial model fitting. The coefficients of the polynomial model fitting the cumulative contagion curve describe the shape and evolution of the curve, where points represent days since the beginning of each wave. This procedure allows the collection of inherent features of a process that can be used as related information, which is especially useful when process exogenous related information is scarce. Second, we apply this methodology to real data involving the number of infected people in different locations of Andalusia (Spain) to build a transfer function per zone that can predict the evolution of the contagion rate on the different stages of the pandemic. By means of this methodology, we build a multivariate time series dataset per sanitary district by considering the real cumulative contagion curve data as the dependent series and the polynomial coefficient series as independent correlated series. Third, we estimate the parameters of these transfer functions using EANNs, with data from different periods of the pandemic waves that took place in Andalusia. 
Given that every district presents different contagion rates in the considered outbreaks, different architectures of EANNs are considered using distinct basis functions in the nodes belonging to the hidden layer with the goal of finding the best network scheme for each district. By using distinct periods for the estimation of the parameters, we build a forecast model per sanitary district using a different order of lags that serves to obtain an accurate prediction of the contagion rate in different stages of an outbreak. This results in models that can identify the beginning and ending of new outbreaks and predict the evolution of the contagion rate during the most critical phase. The parameter estimation is carried out by considering mid- and long-term forecast horizons (3 and 5 days, respectively). The reason behind the choice of these time horizons is that they offer an acceptable trade-off between anticipation and accuracy. To serve as decision support for adopting prevention measures, 3- and 5-day predictions of the number of contagions are enough to devise suitable containment schemes. Larger horizons lead to less accurate results, and it is not appropriate to base sensitive decisions such as the application of restriction measures on such results. Additionally, we consider a combined estimation where both forecast horizons are involved in a single model using a multitask evolutionary artificial neural networks (MuEANNs) approach. The resulting models can act as a decision support system in new emergency situations caused by new viruses or strains with similar contagion rate behavior. The results obtained with the models that are trained with the extracted polynomial information significantly outperform those obtained with models trained with pure autoregressive data in almost all districts. In particular, the accuracy is increased by up to 73.55% in the case of the Almería district.
The remainder of this paper is organized as follows: in Section 2, we describe the data source and perform an analysis of the data used in this study. The proposed methodology is explained in Section 3 where the polynomial curve fit, the dataset building process, the autoregressive forecast models and the EANNs are presented. The experimental design and results obtained for each architecture, dataset and forecasting approach are presented in Section 4. Finally, we conclude the paper in Section 5.

Data description

The data used in this study and some considerations are specified in this section. This article focuses on eight areas of Andalusia, the country's second largest and most populated autonomous region, located in southern Spain. The data were obtained from the official website of the Andalusia government, where the numbers of COVID-19 diagnosed, cured and deceased people were reported daily and segregated into sanitary districts. Sanitary districts are administrative divisions that have local health management competencies in specific zones of Andalusia. These divisions are represented in Fig. 1, where the eight districts that include the provincial capitals are highlighted in green and their populations are specified. For this study, the daily reported information in these eight sanitary districts from July , , to August , , is used, resulting in a total of observations, each representing one day of an outbreak. During this period of time, four different waves took place in Andalusia. Although the data reported by the Andalusia government about positive diagnosed contagions are available since March , , the data until July , correspond to the first wave of the pandemic. During this wave, the countrywide lockdown resulted in a low-intensity and short wave. Hence, we decided to exclude this period from the analysis.
Fig. 1

Geography of the sanitary districts of Andalusia. The districts including provincial capitals are highlighted in green.

The reported information presents a weekly pattern of fewer diagnosed contagions during weekends, making the time series of positive diagnoses highly noisy. This effect is common to most countries (Ricon-Becker, Tarrasch, Blinder, & Ben-Eliyahu, 2020) and is due to the lower number of patients tested on Saturdays and Sundays, differences in testing timings and reporting delays. Missing data on weekends are usually included in the first days of the weekly reports, thereby preserving the real number of positive diagnoses. Fig. 2(a) and Fig. 2(b) show, respectively, the daily and cumulative positive COVID-19 diagnosis time series belonging to the district of Córdoba, one of the districts considered in this study, from July , to August , .
Fig. 2

Daily (a) and cumulative (b) reported positive COVID-19 diagnoses in the district of Córdoba from July , , to August , .

In Fig. 2(a), the noise produced by the weekly cycles is observable, resulting in a visible sawtooth effect in the contagion time series. The cumulative case time series is shown in Fig. 2(b), where the noise is considerably mitigated, leading to a smooth and monotonically nondecreasing time series. The temporal trend of this district is repeated in the other districts, varying in magnitude according to district-specific parameters, such as the geographic location or the total population. The time series of positive diagnoses exhibits four different waves, each one defined by an increment followed by a drop in the contagion rate. As mentioned above, the first wave has not been considered due to its low intensity and short duration. Therefore, for this analysis, the following waves have been used. The second wave spans from July , , to December , , and its long-term evolution is characterized by a high peak of positive contagions and a gradual increment and drop in the contagion rate. During this wave, the average age of the people needing hospital care decreased compared to the first wave and the mean hospitalization time was also reduced (Iftimie et al., 2021). The third wave spans from December , , to March , . The beginning of this wave is highly accelerated because of the removal of control measures due to the Christmas holidays. During this period, mobility and social interaction increased, resulting in a quick increment in the number of infected people, and the maximum number of COVID-19 cases was reached. At the end of the holiday period, when control measures were reinstated, the contagion rate slowed down.
The fourth wave, lasting from March , , to June , , is defined by a soft initial diagnoses increment owing to the beginning of the vaccination campaign. This wave presents two peaks provoked by the different changes in the containment measures policies. Finally, the fifth wave took place in summer, from June , , to August , , a period where control policies were loose. However, the high vaccination rate produced a wave with weak magnitude and short time period. Considering these partitions, the second, third, fourth and fifth waves consist of , , and observations, respectively. These dates have been chosen based on expert knowledge and the behavior observed in the daily contagion curves of the considered districts. The nature of these four waves is very different in terms of evolution because of the distinct social policies, control measures and vaccination campaigns that took place along their respective developments, resulting in a heterogeneous set of rates for each wave. For this reason, with the main objective of modeling the complete period of the pandemic, including the most diverse information possible about all stages is crucial. In this study, different periods for each wave are considered: beginning, growth and stabilization. Including information about these three stages in the parameter optimization phase results in models that can detect imminent increases in contagion rates that can potentially lead to local outbreaks, accurately predict the evolution of an initiated outbreak and recognize in advance the end of virus transmission.

Proposed methodology

In this section, the methodology proposed in this paper and the process to generate the datasets used in the experimentation are detailed. To provide an overview and facilitate understanding, Fig. 3 presents a flowchart containing all the steps detailed in this section.
Fig. 3

Overview flowchart describing the applied methodology.


Curve fitting

Many empirical data obtained from natural processes describe patterns that may be defined using algebraic expressions. Curve fitting techniques help to understand the inherent behavior of a process and may be useful in description, clustering or forecasting tasks (Motulsky & Ransnas, 1987). If the nature of the underlying process is known, an accurate parametric function working as an approximator results in a smooth curve whose parameters characterize the evolution of the empirical data. The information extracted from the fitted curve can be used in ML tasks as important features, both in supervised (Hamidi, Ghassemian, & Imani, 2018) and unsupervised (Abraham et al., 2003, Martin et al., 1998) contexts. In this paper, we approximate the cumulative curve of contagions of each district involved in this study by a polynomial function fitted by the least squares method. The shape of the curve described in each wave begins with a smooth initial growth. Then, when a sufficiently high number of infected people is reached, an exponential increment occurs due to the high contagion rate. The waves finish with a linear evolution when the outbreak is under control but residual positive cases are still present. According to this shape, a third-order polynomial is the most interpretable and lowest-variance model that can accurately fit the cumulative curve of each wave individually. Polynomial regression is a type of fit where the dependent variable depends linearly on the powers of a single independent variable, in this case, the number of days from the beginning of the wave. An $n$-degree polynomial model, which is composed of $n+1$ parameters, is defined as:

$$y_d = \alpha_1 + \alpha_2 d + \alpha_3 d^2 + \dots + \alpha_{n+1} d^n + \epsilon_d, \qquad (1)$$

where $y_d$ is the cumulative number of contagions after $d$ days from the beginning of an outbreak, $\alpha_j$, $j = 1, \dots, n+1$, are the parameters to be estimated and $\epsilon_d$ is the error term.
For this study, the chosen 3-degree polynomial model is composed of 4 parameters to be estimated during the fit and has a single inflexion point in its domain, where the function changes its concavity. Following an iterative procedure, for each time period between the start of the outbreak and a day $d$, a 3-degree polynomial model is fitted. Once the polynomial model has been fitted, the estimated parameters are used to describe a point of the curve, which is represented as $\mathbf{z}_{d,i} = (\boldsymbol{\alpha}_{d,i}, y_{d,i})$, where $\boldsymbol{\alpha}_{d,i} = (\alpha_{1,d,i}, \alpha_{2,d,i}, \alpha_{3,d,i}, \alpha_{4,d,i})$ includes the four parameters of the third-order polynomial. Note that each point of the curve represents a single day belonging to the $i$th outbreak, where $i \in \{2, 3, 4, 5\}$. Moreover, it is worth mentioning that the sanitary districts are considered independently. A more detailed description of the variables representing each point is given in Table 1.
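The iterative fitting procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `curve_descriptors`, the argument `min_days`, the ordering of the coefficients (constant term first) and the synthetic curve are all assumptions made for illustration, and NumPy's least squares routine `np.polyfit` stands in for whichever solver was actually used.

```python
import numpy as np

def curve_descriptors(cumulative, min_days, degree=3):
    """For every day d >= min_days, fit a polynomial to days 1..d and
    return one row per day: the fitted coefficients followed by y_d."""
    y = np.asarray(cumulative, dtype=float)
    days = np.arange(1, len(y) + 1)
    rows = []
    for d in range(min_days, len(y) + 1):
        # np.polyfit returns the highest power first; reverse the result so
        # that the constant term comes first (an assumed ordering).
        coeffs = np.polyfit(days[:d], y[:d], degree)[::-1]
        rows.append(np.append(coeffs, y[d - 1]))
    return np.vstack(rows)

# Synthetic cubic curve used only to exercise the code (not real data).
t = np.arange(1, 61)
synthetic = 2.0 + 0.5 * t + 0.1 * t**2 + 0.01 * t**3
descriptors = curve_descriptors(synthetic, min_days=10)
print(descriptors.shape)  # one row per fitted day, five columns
```

Each row of `descriptors` plays the role of one curve point (four polynomial parameters plus the cumulative count). The value of `min_days` must exceed the polynomial degree for the fit to be well determined.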
Table 1

Descriptors of a point of the cumulative curve of contagions.

Variable      Description
α_{1,d,i}     Estimated α_1 after d days of the ith wave
α_{2,d,i}     Estimated α_2 after d days of the ith wave
α_{3,d,i}     Estimated α_3 after d days of the ith wave
α_{4,d,i}     Estimated α_4 after d days of the ith wave
y_{d,i}       Cumulative contagions after d days of the ith wave
Given that a minimum number of days is needed to build accurate models, $d$ can only take values in the range $[m_i, D_i]$. In this way, $m_i$ is the minimum number of observations needed to build accurate models, and $D_i$ is the total duration, in days, of the $i$th outbreak. Therefore, the number of polynomial models to be fitted for outbreak $i$ is $D_i - m_i + 1$. For the estimation of $m_i$, the evolution of the quality of the fitted polynomial models has been measured for the sanitary districts under study, using the coefficient of determination $R^2$ as an evaluation metric, defined in Eq. (2) for the $i$th wave. The $R^2$ statistic represents the proportion of the variance of the dependent variable explained by the independent variables:

$$R_i^2 = 1 - \frac{\sum_{d} (y_{d,i} - \hat{y}_{d,i})^2}{\sum_{d} (y_{d,i} - \bar{y}_i)^2}, \qquad (2)$$

where $y_{d,i}$, a real number, is the cumulative contagions after $d$ days, $\hat{y}_{d,i}$ is the number of cumulative contagions predicted by the model, and $\bar{y}_i$ is the mean of the observed cumulative contagions. This metric ranges from 0 to 1, reaching its maximum value when the model is able to entirely explain the variance present in the real data, i.e., fitting perfectly to the empirical nature of the observed contagion rate. Analyzing the evolution of the mean $R^2$ of the fitted polynomial models of the eight districts, the minimum feasible choice of $m_i$ for each outbreak is based on the instant from which $R^2$ begins to stabilize. At the beginning of the waves, the cumulative number of contagions is reset to 0, considering the waves individually, i.e., ignoring the cumulative number of infected people of past waves. Fig. 4 shows the average, maximum and minimum evolution of the coefficient of determination for the eight sanitary districts, and the chosen $m_i$ is indicated with a vertical line. The minimum number of days where the determination coefficient variance begins to stabilize corresponds to , , and for the second, third, fourth and fifth waves, respectively.
Considering these initial thresholds for fitting polynomial curves, from the observations of each district, a total of days are used as the minimum number of observations to build accurate regressions, resulting in a total of polynomial fits on the entire pandemic period for each district. In Table 2, the mean and standard deviation (SD) of the $R^2$ of all approximated models for each district are presented. As shown in Table 2, the selected polynomial models fit analytically well for all districts and waves.
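As a hedged illustration of how the quality of the successive fits could be tracked, the sketch below computes the coefficient of determination for every prefix of a cumulative curve. The helper names `r_squared` and `r2_evolution`, the `start_day` argument and the synthetic logistic curve are assumptions introduced for this example; the thresholds chosen in the study are not reproduced here.

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_evolution(cumulative, start_day=5, degree=3):
    """R^2 of the polynomial fitted to days 1..d, for d = start_day..end."""
    y = np.asarray(cumulative, dtype=float)
    days = np.arange(1, len(y) + 1)
    scores = []
    for d in range(start_day, len(y) + 1):
        coeffs = np.polyfit(days[:d], y[:d], degree)
        fitted = np.polyval(coeffs, days[:d])
        scores.append(r_squared(y[:d], fitted))
    return np.array(scores)

# Synthetic S-shaped cumulative curve (illustrative only).
t = np.arange(1, 61)
curve = 1000.0 / (1.0 + np.exp(-(t - 30) / 5.0))
scores = r2_evolution(curve)
```

Plotting `scores` against the day index gives a curve of the kind summarized in Fig. 4, from which a stabilization point could be read off by inspection.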
Fig. 4

Mean, maximum and minimum $R^2$ evolution of the eight sanitary districts for each wave.

Table 2

Mean and standard deviation (SD) of the $R^2$ statistics for the fitted polynomial models.

District        R² Mean    R² SD
Almería         0.9954     0.0043
Cádiz Bay       0.9961     0.0023
Córdoba         0.9944     0.0075
Granada         0.9947     0.0047
Huelva Coast    0.9932     0.0044
Jaén            0.9929     0.0166
Málaga          0.9959     0.0029
Sevilla         0.9967     0.0028

Autoregressive models

Formally, a time series of length $T$ is an ordered sequence of values $Y = \{y_1, y_2, \dots, y_T\}$. The timestamps form a sequence of positive and ascending discrete values, which, in most cases, are equally spaced. Time series are produced by any kind of sequenced phenomenon, whether natural, such as the climate in a particular location, or artificial, such as stock prices of a given enterprise over time. Due to their intrinsic temporal nature, time series have been the object of investigation in multiple fields of data mining, such as forecasting, classification, clustering, and outlier analysis (Fu, 2011). A time series may be composed of more than one time-dependent variable measured at the same time instant. In this case, the time series is called multivariate, and its analysis involves modeling the inherent relationship between the dimensions that compose it. Mathematically, a multivariate time series of length $T$ is composed of $k$ dimensions; thus, each value $\mathbf{y}_t$ is a vector of length $k$. In this sense, the point descriptors, obtained from the iterative polynomial fitting process described in Section 3.1 and detailed in Table 1, are used to build a multivariate time series dataset per district. Within the scope of time series forecasting, autoregressive (AR) models are a useful technique for modeling future values by using the information of past instants. A $p$-order AR model, denoted as AR($p$), can be applied to the target variable as follows:

$$y_t = c + \sum_{j=1}^{p} \phi_j y_{t-j} + \epsilon_t, \qquad (3)$$

where $y_t$ is the target value obtained from $p$ past lagged values ($y_{t-j}$ for $j = 1, \dots, p$), $c$ is a constant that serves as the intercept of the model, $\phi_j$ is the $j$th coefficient of the model, and $\epsilon_t$ is the error term. AR models represent any value of the series as a combination of its past values. Specifically, in this study, AR models will be used for the prediction of the number of cumulative contagions of the $i$th outbreak; hence, $y_t = y_{d,i}$ (note that $d$ is the number of days since the beginning of the $i$th outbreak).
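To make the AR formulation concrete, the following sketch estimates an AR model of order p by ordinary least squares. The choice of estimator, the function names `fit_ar` and `predict_next`, and the toy series are assumptions for illustration; the text does not state how the AR coefficients are estimated in the study.

```python
import numpy as np

def fit_ar(series, p):
    """Estimate [c, phi_1, ..., phi_p] of an AR(p) model by least squares."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    # Each design row is [1, y_{t-1}, ..., y_{t-p}] for the target y_t.
    columns = [np.ones(n - p)] + [y[p - j:n - j] for j in range(1, p + 1)]
    design = np.column_stack(columns)
    params, *_ = np.linalg.lstsq(design, y[p:], rcond=None)
    return params

def predict_next(series, params):
    """One-step-ahead prediction from the last p observed values."""
    p = len(params) - 1
    y = np.asarray(series, dtype=float)
    lags = y[-1:-p - 1:-1]  # most recent value first
    return params[0] + lags @ params[1:]

# Toy series generated by a known AR(2) recursion (illustrative only).
series = [0.0, 1.0]
for _ in range(28):
    series.append(1.0 + 0.5 * series[-1] + 0.3 * series[-2])
params = fit_ar(series, p=2)
next_value = predict_next(series, params)
```

Because the toy series follows the recursion exactly, the estimated `params` should be close to the generating intercept and coefficients, which gives a simple sanity check of the design matrix construction.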
Furthermore, when the forecasting of the number of cumulative contagions is carried out by using multivariate time series, a vector autoregressive (VAR) model (Zivot & Wang, 2003) is used. VAR models generalize the AR model by including information from the different dimensions of the time series (independent terms) to model the target time series (dependent term). VAR($p$) models include $p$ lags of all the dimensions of a given time series, in our case, $p$ lags of the point descriptors presented in Table 1. Consequently, the VAR($p$) model composed of $k$ dimensions is defined as follows:

$$y_{d,i} = c + \sum_{j=1}^{p} \boldsymbol{\phi}_j^{\top} \mathbf{z}_{d-j,i} + \epsilon_d, \qquad (4)$$

where $\boldsymbol{\phi}_j$ is the vector of model coefficients and $\mathbf{z}_{d-j,i}$ is the vector containing the $j$th lagged representation of the curve point belonging to the $i$th outbreak, i.e., the $j$th lagged values of the variables described in Table 1. Note that the vector $\mathbf{z}_{d-j,i}$ has length $k$. As with AR models, VAR models will be used for the prediction of the number of cumulative contagions of the $i$th outbreak; hence, the target is $y_{d,i}$.
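One possible way to arrange the lagged point descriptors into a supervised dataset for a given forecast horizon is sketched below. The function name `build_lagged_dataset`, the flattening order of the lags and the convention that the target is the last column are assumptions for illustration rather than details confirmed by the text.

```python
import numpy as np

def build_lagged_dataset(points, p, horizon, target_col=-1):
    """Turn a (days x k) descriptor matrix into inputs and targets.

    Each input stacks the p most recent rows (most recent first); the
    target is the chosen column `horizon` days after the most recent row.
    """
    points = np.asarray(points, dtype=float)
    n_days = points.shape[0]
    inputs, targets = [], []
    for t in range(p - 1, n_days - horizon):
        window = points[t - p + 1:t + 1][::-1]  # rows t, t-1, ..., t-p+1
        inputs.append(window.ravel())
        targets.append(points[t + horizon, target_col])
    return np.array(inputs), np.array(targets)

# Dummy 10-day, 5-variable matrix used only to show the resulting shapes.
points = np.arange(50).reshape(10, 5)
X, y = build_lagged_dataset(points, p=2, horizon=3)
print(X.shape, y.shape)
```

Calling the function once with `horizon=3` and once with `horizon=5` would yield the two target vectors needed for the mid- and long-term tasks; stacking them column-wise gives a two-output target for a simultaneous model.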

Artificial neural networks

Artificial neural network (ANN) models (Bishop et al., 1995) have been applied to a large number of regression and classification tasks, performing well owing to their capability as universal approximators (Hornik et al., 1989). Multiple ANN models have been proposed over the years. One of the first approaches was the multilayer perceptron (MLP) (Bishop et al., 1995), with sigmoid units (SUs) as basis functions for the nodes in the hidden layers. Other alternatives include radial basis function (RBF) neural networks (Broomhead and Lowe, 1988, Poggio and Girosi, 1989), which usually use Gaussian transfer functions in the nodes of the hidden layers, and multiplicative networks, which use product units (PUs) (Durbin & Rumelhart, 1989) that compute a weighted product instead of a weighted sum of the node inputs.

In this paper, given that mid- and long-term forecasts are considered, both single and simultaneous forecasts are proposed, and two different kinds of ANN models are developed accordingly. Monotask ANNs perform each forecast task separately, whereas multitask ANNs perform both forecast tasks simultaneously. Monotask models therefore have only one output, whereas multitask models have two. Since the aim of this work is to forecast the cumulative number of contagions, the two outputs correspond to two forecast horizons, one mid-term and one long-term.

Therefore, the ANN models applied to perform the forecast tasks compute each output as a linear combination of the basis functions of the hidden nodes. The parameters of the model comprise the weights from the hidden layer to each output node and, for each hidden node, the weights of the connections from the input nodes to that node. Each hidden neuron applies its basis function to the set of inputs of the ANN model (Table 3 gives a complete description of the variables used).
Table 3

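Inputs (x_{d,i}) included in the EANNs for day d belonging to the ith outbreak. I is the total number of inputs for the generated datasets. Note that the different datasets are named according to the AR models used (either AR or VAR).

| Lags | Dataset name | x_{d,i} | I |
|---|---|---|---|
| p=1 | AR(1)_y | {y_{d-1,i}, d_i} | 2 |
| p=1 | VAR(1)_{y,α4} | {y_{d-1,i}, α_{4,d-1,i}, d_i} | 3 |
| p=1 | VAR(1)_{y,α3,α4} | {y_{d-1,i}, α_{3,d-1,i}, α_{4,d-1,i}, d_i} | 4 |
| p=2 | AR(2)_y | {y_{d-1,i}, y_{d-2,i}, d_i} | 3 |
| p=2 | VAR(2)_{y,α4} | {y_{d-1,i}, α_{4,d-1,i}, y_{d-2,i}, α_{4,d-2,i}, d_i} | 5 |
| p=2 | VAR(2)_{y,α3,α4} | {y_{d-1,i}, α_{4,d-1,i}, α_{3,d-1,i}, y_{d-2,i}, α_{4,d-2,i}, α_{3,d-2,i}, d_i} | 7 |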
With respect to the type of neurons in the hidden layer, two different basis functions (BFs) are considered in this study. The first is the Radial Basis Function (RBF), a kernel BF with a Gaussian Transfer Function (GTF) that is parameterized by the centroid and the radius of the GTF. The second is the Product Unit (PU), a projection BF that is parameterized by the weights from the input layer to the hidden layer.

Furthermore, combining BFs in the hidden layer has some advantages, such as providing flexibility to the decision rules. Any continuous function can be decomposed into two different types of functions: one belonging to the projection group and the other belonging to the kernel group (Donoho & Johnstone, 1989). For this reason, hybrid ANNs (both monotask and multitask) have also been considered. Such hybrid models combine PU (projection) and RBF (kernel) BFs in the hidden layer, i.e., they perform a linear combination of both types of BFs, each type with its own number of hidden neurons. The coefficients of the hybrid model comprise the coefficients between the hidden layer and each output node and the weights connecting the input layer to each hidden neuron of the first and the second type. The basis functions of each type are those defined in Eqs. (6), (7).

The general forecast task proposed in this study treats the real number of cumulative contagions as the continuous variable to be forecast. Consequently, the ANNs are evaluated with the mean square error (MSE) of the network output layer with respect to the real values, which is used as the loss function. For the monotask models, the MSE compares the real and forecasted cumulative numbers of contagions, where the single output of the monotask model is the forecast of the cumulative number of contagions for a given day of the ith outbreak.
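The two basis functions and their hybrid combination can be sketched as follows. This is only an illustration under stated assumptions: the PU follows the standard product-unit form of Durbin & Rumelhart (each input raised to its weight), the Gaussian is normalized by the squared radius, and the output includes a bias term `beta0`; the exact normalization and the presence of a bias in the paper's equations are not recoverable from the text, and all function and parameter names are hypothetical.

```python
import numpy as np

def product_unit(x, w):
    # Product unit: prod_k x_k ** w_k (inputs are assumed strictly positive)
    return float(np.prod(np.power(x, w)))

def gaussian_rbf(x, c, r):
    # Gaussian transfer function with centroid c and radius r
    # (dividing by r**2 rather than 2 * r**2 is an assumption)
    return float(np.exp(-np.sum((x - c) ** 2) / r ** 2))

def hybrid_output(x, beta0, beta_pu, W_pu, beta_rbf, C, R):
    # Linear combination of PU and RBF hidden nodes (bias beta0 is an assumption)
    pu = [product_unit(x, w) for w in W_pu]
    rbf = [gaussian_rbf(x, c, r) for c, r in zip(C, R)]
    return float(beta0 + np.dot(beta_pu, pu) + np.dot(beta_rbf, rbf))

x = np.array([0.5, 0.25])
pu_value = product_unit(x, np.array([2.0, 0.5]))  # 0.5**2 * 0.25**0.5
rbf_value = gaussian_rbf(x, x, 1.0)               # input at the centroid
out = hybrid_output(
    x,
    beta0=1.0,
    beta_pu=[2.0], W_pu=[np.array([2.0, 0.5])],
    beta_rbf=[3.0], C=[x], R=[1.0],
)
```

A pure PU or pure RBF network is the special case in which the list of nodes of the other type is empty.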
Moreover, as mentioned above, two forecast horizons are considered in this study. Accordingly, the loss function used to evaluate the multitask models extends the MSE to the two outputs of the network, one per horizon.
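The monotask and multitask losses can be illustrated with the sketch below. The monotask MSE is the usual mean of squared differences; for the multitask case, averaging the per-output MSEs is an assumption, since the exact form of the paper's multitask equation is not available here. The function names are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    # Monotask loss: mean of squared differences over all patterns
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def multitask_mse(Y_true, Y_pred):
    # Multitask loss: one column per horizon; the per-output MSEs are
    # averaged (assumption, a sum would differ only by a constant factor)
    Y_true = np.asarray(Y_true, dtype=float)
    Y_pred = np.asarray(Y_pred, dtype=float)
    per_output = np.mean((Y_true - Y_pred) ** 2, axis=0)
    return float(np.mean(per_output))

mono = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
multi = multitask_mse([[1.0, 10.0], [2.0, 20.0]], [[1.0, 12.0], [4.0, 20.0]])
```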

Evolutionary artificial neural networks

Even though ANNs are considered to be universal approximators, the required training time and the final complexity of the networks may vary significantly across BFs. The speed of convergence may become a problem when the training set contains a large number of patterns or when the complexity of the network increases, and it is then the main challenge to be addressed (Livni, Shalev-Shwartz, & Shamir, 2014). In most studies, gradient-based methods have been applied to optimize ANN parameters. However, due to the nonlinear nature of ANNs, the optimization process may converge to one of the many local optima on the error surface of these models (Sutton, 1986). Evolutionary algorithms (EAs), whose application to the optimization of ANN parameters was thoroughly reviewed by Yao (1993), explore the error surface more exhaustively. EAs use a set of candidate solutions (population) to (1) simulate the evolutionary process and natural selection and (2) provide the algorithm with a global search capability aimed at discovering regions of the error surface where well-performing solutions are located. Therefore, EAs are an efficient method for optimizing both the structure and the connection weights of ANNs.

In this work, the EA proposed in Martínez-Estudillo, Martínez-Estudillo, Hervás-Martínez, and García-Pedrajas (2006) is applied to optimize the ANN models described in Section 3.3. The pseudocode of the EA is shown in Algorithm 1. As the goal of the EA is to minimize the MSE of the ANNs, the fitness function guiding the evolutionary process is defined as a strictly decreasing transformation of the MSE. Consequently, the fitness of the monotask or multitask ANNs is calculated using the MSE defined in Eq. (9) or Eq. (10), respectively.

The EA begins by creating an initial population of ANNs. Each ANN (individual) of the population is randomly generated, i.e., the number of neurons in the hidden layer, the number of connections of each hidden neuron (linking both the input layer to the hidden layer and the hidden layer to the output layer), and the weights of those connections are randomly generated considering the parameters described in Table 4 of Section 4.2. Next, the ANNs are evaluated and ranked according to their fitness.
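The evaluation and ranking step can be sketched as follows. The text only states that the fitness is a strictly decreasing transformation of the MSE; the specific choice 1 / (1 + MSE) used here is an assumption (it is a common choice in the evolutionary ANN literature), and the function names are hypothetical.

```python
def fitness(mse_value):
    # One strictly decreasing transformation of the MSE (assumed form)
    return 1.0 / (1.0 + mse_value)

def rank_population(individuals, mse_of):
    # Sort individuals from best (highest fitness) to worst
    return sorted(individuals, key=lambda ind: fitness(mse_of(ind)), reverse=True)

# Toy population: three named individuals with hypothetical MSE values
toy_mse = {"a": 4.0, "b": 0.0, "c": 1.0}
ranking = rank_population(["a", "b", "c"], toy_mse.get)
```

Because the transformation is strictly decreasing, ranking by fitness is equivalent to ranking by increasing MSE, whichever transformation is used.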
Table 4

Parameter values that have been used in the EA for all models (PU, RBF and RBFPU for both MoEANNs and MuEANNs).

| Parameter | Value |
|---|---|
| Independent runs | 40 |
| Stopping criterion (1): maximum number of generations | 1500 |
| Stopping criterion (2): consecutive generations without improving individuals | 10 |
| Population size | 1000 |
| Number of hidden layers of each individual | 1 |
| Minimum number of hidden neurons (initialisation) | 2 |
| Maximum number of hidden neurons (initialisation) | 3 |
| Maximum number of hidden neurons (whole process) | 4 |
| Range of hidden neurons to be added or deleted | [1, 3] |
| Range of links to be added or deleted | [1, 5] |
| Range for weights between input and hidden layer | [0.1, 0.9] |
| Range for weights between hidden and output layer | [-10, 10] |
After that, the ANN optimization (evolutionary process) is performed as long as the stopping criteria are not met. Specifically, two stopping criteria are considered, as shown in Table 4 of Section 4.2, and the EA stops as soon as either of them is reached. In each generation of the evolutionary process, the worst 10% of ANNs are replaced with a copy of the best 10% of ANNs. This represents an elitist pressure, since the best 10% of individuals (which, after being cloned, constitute 20% of the population) are optimized in a different way, as described below. Next, the individuals are evolved by applying two types of mutations: parametric and structural. Parametric mutation is applied to the best 10% of ANNs and updates their weights by adding Gaussian noise whose variance decreases throughout the evolutionary process, i.e., the strength of the changes decreases as the individuals become better. In this way, the adaptive variance (particular to each individual) dynamically modifies the intensity of the exploration of the error surface, favoring exploitation as the performance of the individual increases. On the other hand, structural mutation (applied to the remaining 90% of ANNs) alters the structure of individuals. More specifically, this type of mutation modifies the number of hidden neurons and the number of their connections to the input and output layers, i.e., it explores a wider area of the search space while trying to maintain the diversity of individuals. In particular, the EA applies five types of structural mutations: Add Neuron, Delete Neuron, Add Link, Delete Link and Neuron Fusion (Gutiérrez, Hervás, Carbonero, & Fernández, 2009), which are applied sequentially to each individual. Thus, the performance of individuals improves throughout the evolutionary process, generation after generation, maximizing their fitness (and hence minimizing their MSE).

Finally, when the EA stops, the ANN with the best fitness is selected as the final solution. Since the population is randomly initialized, the EA has a stochastic component. Therefore, different runs lead to distinct populations and, consequently, to different final solutions. Regardless of the BFs used in the hidden layer, the developed evolutionary artificial neural networks (EANNs) are referred to as monotask EANNs (MoEANNs) or multitask EANNs (MuEANNs), depending on whether they perform the forecast tasks separately or simultaneously, respectively.
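One generation of the evolutionary process described above can be sketched as follows. This is a simplified reading of the text, not the authors' Algorithm 1: individuals are treated as opaque objects, the two mutation operators are passed in as hypothetical callables, and the cloned copies are handled by the structural operator together with the rest of the population (the text leaves the exact treatment of the clones open).

```python
import copy

def next_generation(population, fitness, parametric_mutation, structural_mutation):
    # Rank individuals from best to worst
    ranked = sorted(population, key=fitness, reverse=True)
    k = max(1, len(ranked) // 10)  # size of the best / worst 10%
    best = ranked[:k]
    # The worst 10% are replaced with copies of the best 10%
    ranked[-k:] = [copy.deepcopy(ind) for ind in best]
    # Parametric mutation on the best 10%, structural mutation on the other 90%
    mutated_best = [parametric_mutation(ind) for ind in ranked[:k]]
    mutated_rest = [structural_mutation(ind) for ind in ranked[k:]]
    return mutated_best + mutated_rest

# Toy run: individuals are integers and fitness is the integer itself;
# the "mutations" only tag each individual so the flow can be inspected.
toy_population = list(range(20))
new_population = next_generation(
    toy_population,
    fitness=lambda ind: ind,
    parametric_mutation=lambda ind: ("p", ind),
    structural_mutation=lambda ind: ("s", ind),
)
```

In the toy run the two worst individuals (0 and 1) disappear and are replaced by copies of the two best (19 and 18), so the population size stays constant.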

Experiment settings and results

The settings used in the experimentation and the results obtained using EANN models are presented in this section. In addition, a statistical analysis is performed to evaluate significant differences between the models and datasets used, with respect to the distinct sanitary districts considered.

Datasets

To analyze the effect of the polynomial information on the forecasting task, different multivariate time series datasets are constructed for each district by combining different sets of input variables and autoregressive orders. Experiments showed that using the four polynomial coefficients simultaneously produces overfitted models that perform poorly on the generalization set. For this reason, only the coefficients defining the inflection point of the third-degree polynomial model are considered in the dataset building process. The inflection point occurs where the function changes its concavity, which is related to the instant at which the outbreak gets out of control and, therefore, the number of positive cases begins to rise exponentially. This moment is reached when the second derivative of the polynomial is equal to zero. Consequently, only these two coefficients, α3 and α4, are considered as polynomial information in the datasets because they characterize the inflection moment, with α4 having more importance.

Table 3 shows the inputs of the EANN models. Regarding the variables extracted from the polynomial approximation, y is always included in the AR/VAR models, whereas the two coefficients α4 and α3 may or may not be included. Moreover, apart from these variables, the number of days from the beginning of the ith wave, d_i, is always included as an input variable in order to locate the fitted evolution in time. Notably, the AR(p) and VAR(p) models are applied with p = 1 and p = 2. Therefore, the simplest dataset is built with only two input variables (y lagged one time and d_i) and is identified as AR(1)_y. On the other hand, the most complex dataset is built with seven input variables (y, α3 and α4 lagged two times, and d_i) and is identified as VAR(2)_{y,α3,α4}. Note that the total number of input variables is specified in column I of Table 3.
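As a worked illustration of the zero-second-derivative condition, assume that the fitted cubic is written with ascending powers, so that α3 and α4 multiply the quadratic and cubic terms. This indexing is an assumption; it is not stated in this section, although it is the only ordering under which α3 and α4 alone determine the inflection point.

```latex
f(x) = \alpha_1 + \alpha_2 x + \alpha_3 x^2 + \alpha_4 x^3, \qquad
f''(x) = 2\alpha_3 + 6\alpha_4 x = 0
\;\Longrightarrow\;
x^{*} = -\frac{\alpha_3}{3\,\alpha_4}.
```

Under that assumption, the location of the inflection point depends only on α3 and α4, which is consistent with keeping just these two coefficients as polynomial information in the datasets of Table 3.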
Regarding the target or output variables, as mentioned in Section 3.3, the cumulative number of contagions at two different time horizons is employed. For the MoEANN approaches, two different datasets are built, one per forecasting horizon. For the simultaneous forecasting of both horizons carried out by the MuEANN approach, a single dataset containing both targets is built. In every case, the inputs correspond to one of the six combinations presented in Table 3.

The train and test partitions have been generated taking into account the three stages mentioned in Section 2. From each wave, a specific period has been included in the generalization (test) set, aiming to cover as many different cases as possible in order to build robust models. The remaining stages are included in the training set. Specifically, the testing partition is composed of the beginning stage of the second wave, the growth stage of the fourth wave, and the stabilization stage of the third and fifth waves. This splitting methodology is shown in Fig. 5, where the training and testing periods are represented over the cumulative contagion curve and the evolution of the polynomial coefficients (α3 and α4) for the district of Sevilla. This hold-out methodology responds to the need to include as much information as possible about the different stages of all the waves in the training set. With these nonrandom partitions, we aim to create heterogeneous and representative sets of observations for training and testing, given that cross-validation would have made the number of required runs of the algorithm unaffordable.

The final datasets used in the experimentation therefore differ in size between the two autoregressive orders considered: for p = 1, the first observation of each wave is needed to build the lagged inputs, whereas for p = 2, the first two observations are needed. From these final datasets, the training and testing sets represent 80% and 20% of the total data, respectively.
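The difference between the monotask and multitask targets can be sketched as follows. This is an illustration only: the function name `make_targets` is hypothetical, and the horizons used in the example (2 and 4 days) are arbitrary placeholders, not the horizons used in the paper.

```python
import numpy as np

def make_targets(y, first_day, horizons):
    """Return the usable days and, for each, the value of y at every horizon.

    Days whose longest horizon would fall outside the series are dropped.
    """
    y = np.asarray(y, dtype=float)
    last_day = len(y) - max(horizons)
    days = np.arange(first_day, last_day)
    targets = np.column_stack([y[days + h] for h in horizons])
    return days, targets

# Toy cumulative curve (hypothetical values)
y = np.arange(10) * 10.0
days, multitask_targets = make_targets(y, first_day=1, horizons=(2, 4))

# A MuEANN dataset keeps both columns; each MoEANN dataset keeps one of them
monotask_targets_h1 = multitask_targets[:, 0]
monotask_targets_h2 = multitask_targets[:, 1]
```

Pairing these targets with one of the six input combinations of Table 3 gives two single-output datasets for the MoEANN models and one two-output dataset for the MuEANN model.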
Fig. 5

Train and test partitions of the district of Sevilla.


Experimental settings

The contagion curve is specific to every district, varying in magnitude and transmission rate according to geographical and population parameters. To find the best model and dataset for each district, the experimentation carried out in this paper uses all the options detailed in Table 3 for all districts. The BFs considered for the ANNs, detailed in Section 3.3, are thus combined with every dataset, resulting in 18 different combinations per district: three ANNs (each using a different BF combination: PU, RBF or the hybrid RBFPU) applied to the six datasets generated from the AR/VAR models. In addition, depending on whether the two forecast tasks are approached simultaneously or not, the two approaches proposed in this work are compared: (1) MoEANN, which treats the two horizons as separate tasks, and (2) MuEANN, which groups both tasks and carries them out simultaneously. Consequently, three different models (two MoEANNs and one MuEANN) were considered for each of the 18 combinations in every sanitary district.

Regardless of the district, dataset and computing error perspective, the evolutionary algorithm used to optimize the EANN models is applied with the same configuration. The parameters of the EA and their values are shown in Table 4. The weights of the connections from the hidden layer to the output layer are randomly generated regardless of the type of BF of the model. The weights of the connections from the input layer to the hidden layer are also randomly generated for PU units, whereas for RBF units they are initialized using K-means (Lloyd, 1982) as a clustering method to determine the centroids of the Gaussian functions. For hybrid EANN models combining RBF and PU units, both types of BFs have the same probability of being selected when the individuals are initialized, and the final number of units of each type depends on the evolutionary process.

Input features for all datasets are scaled to a fixed interval. The output layer is composed of one linear unit for each of the two MoEANN approaches and two linear units for the MuEANN approach, and the output values are likewise scaled to a fixed interval. The described configuration for optimizing the models follows the guidelines published in Gutiérrez et al. (2009) and Martínez-Estudillo et al. (2006). The EANN models containing SU basis functions in the hidden layer, either pure or combined with PU or RBF, suffered from severe overfitting and, consequently, performed poorly on the test set. Thus, this basis function has been excluded from this study.
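The input scaling and the K-means initialization of the RBF centroids can be sketched as follows. Both functions are minimal illustrations with hypothetical names: the scaling interval is left as a parameter because the interval used in the paper is not given in this text, and the clustering is a plain NumPy version of Lloyd's algorithm rather than the authors' implementation.

```python
import numpy as np

def min_max_scale(X, lo, hi):
    # Scale every column of X linearly into [lo, hi]
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return lo + (hi - lo) * (X - x_min) / span

def kmeans_centroids(X, k, iters=50, seed=0):
    # Plain Lloyd's algorithm; the returned centroids can seed the RBF nodes
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids

# Hypothetical interval [0.1, 0.9], chosen only for the example
scaled = min_max_scale([[0.0], [5.0], [10.0]], lo=0.1, hi=0.9)

# Two well-separated toy clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = kmeans_centroids(points, k=2)
centroids = centroids[np.argsort(centroids[:, 0])]  # order for inspection
```

Scaling the inputs to a strictly positive interval also keeps the product units well defined, since they raise each input to a real-valued weight.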

Results

The results presented in this section aim to compare the differences in model performance produced when polynomial coefficients are included as training features, and the extent to which multitask models are able to take advantage of using the same structure and parameters to tackle several tasks simultaneously. The results are expressed in terms of root mean square error (RMSE, obtained by applying the square root to Eqs. (9), (10)). As explained in Section 3.4, the EANN models are stochastic. Thus, 40 independent runs are performed (Table 4), and the results are expressed as the mean and standard deviation of the RMSE obtained for the generalization sets, i.e., patterns unseen during the training stage. Given the interest in knowing which BF is better and which value of p is best in the AR/VAR models, the results are broken down by these options as well as by district. In addition, the performances of the models for the two forecasting horizons are also compared. Note that for single horizon forecasts, only the error of the corresponding output neuron of the MuEANN model is considered. Regarding individual forecasting horizons, Table 5 and Table 6 show the results obtained for the first (mid-term) and second (long-term) horizons, respectively, in the eight districts considered. It is also worth mentioning that, when the simultaneous time horizon forecasting task is approached, the results reported for the MoEANN approach are the average of the errors of the two single MoEANN models, one tackling each horizon. Table 7 shows the results obtained for all zones and datasets considering these two horizons simultaneously. In Table 5, Table 6, Table 7, the best results per forecasting approach and district are highlighted in bold, whereas the second best results are highlighted in italics.
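The aggregation of the results over the independent runs can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator is reported.

```python
import numpy as np

def summarize_runs(rmse_per_run):
    # Mean and standard deviation of the test RMSE over the independent runs
    r = np.asarray(rmse_per_run, dtype=float)
    return float(r.mean()), float(r.std(ddof=1))  # ddof=1 is an assumption

def joint_moeann_rmse(rmse_first_horizon, rmse_second_horizon):
    # Joint error of the MoEANN approach: average of the errors of the
    # two single-horizon models, computed run by run
    a = np.asarray(rmse_first_horizon, dtype=float)
    b = np.asarray(rmse_second_horizon, dtype=float)
    return (a + b) / 2.0

mean_rmse, sd_rmse = summarize_runs([1.0, 2.0, 3.0])
joint = joint_moeann_rmse([100.0, 200.0], [300.0, 400.0])
```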
Table 5

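Performances of the MoEANN and MuEANN models trained with the different combinations of input features and autoregressive orders, p, evaluating the errors of the first (mid-term) forecast horizon. The results are expressed as the RMSE of the generalization set; each cell shows the mean with the standard deviation (SD) in parentheses.

| District | BF | p | MoEANN AR(p)_y | MoEANN VAR(p)_{y,α4} | MoEANN VAR(p)_{y,α4,α3} | MuEANN AR(p)_y | MuEANN VAR(p)_{y,α4} | MuEANN VAR(p)_{y,α4,α3} |
|---|---|---|---|---|---|---|---|---|
| Almería | PU | 1 | 312.61 (13.72) | 95.62 (13.24) | 85.42 (9.52) | 275.97 (18.92) | 119.67 (23.75) | 108.78 (16.77) |
| Almería | PU | 2 | 300.24 (12.58) | 103.39 (15.85) | 90.72 (12.90) | 275.81 (22.80) | 136.72 (30.20) | 126.75 (26.41) |
| Almería | RBF | 1 | 312.11 (13.72) | 176.52 (29.38) | 181.87 (49.91) | 332.09 (18.92) | 257.04 (67.98) | 304.01 (80.76) |
| Almería | RBF | 2 | 293.53 (12.58) | 184.22 (46.05) | 203.34 (61.35) | 353.01 (22.80) | 255.39 (84.67) | 301.07 (72.12) |
| Almería | RBF+PU | 1 | 131.90 (34.75) | 107.12 (26.05) | 98.69 (28.41) | 147.53 (24.31) | 142.82 (34.69) | 146.51 (42.91) |
| Almería | RBF+PU | 2 | 136.14 (54.19) | 97.71 (25.18) | 99.61 (46.13) | 152.75 (24.91) | 164.12 (45.26) | 171.78 (45.95) |
| Cádiz Bay | PU | 1 | 358.68 (4.41) | 223.44 (7.76) | 222.20 (10.90) | 397.65 (20.54) | 237.84 (32.91) | 247.07 (24.69) |
| Cádiz Bay | PU | 2 | 345.91 (12.52) | 234.57 (15.14) | 235.53 (22.00) | 354.00 (33.69) | 228.06 (26.10) | 229.83 (41.18) |
| Cádiz Bay | RBF | 1 | 292.56 (4.41) | 168.07 (35.62) | 173.03 (33.94) | 295.57 (20.54) | 268.97 (91.29) | 281.15 (68.42) |
| Cádiz Bay | RBF | 2 | 299.97 (12.52) | 153.89 (37.15) | 195.45 (63.40) | 303.96 (33.69) | 333.86 (382.05) | 271.46 (78.10) |
| Cádiz Bay | RBF+PU | 1 | 306.97 (42.29) | 201.65 (27.82) | 194.85 (39.42) | 226.17 (38.12) | 185.51 (57.84) | 174.02 (59.10) |
| Cádiz Bay | RBF+PU | 2 | 295.82 (58.44) | 210.14 (38.15) | 191.68 (40.64) | 217.75 (49.77) | 184.24 (67.41) | 175.12 (65.75) |
| Córdoba | PU | 1 | 195.73 (4.50) | 82.58 (5.87) | 76.00 (10.42) | 194.04 (8.90) | 95.40 (9.57) | 93.70 (8.34) |
| Córdoba | PU | 2 | 182.71 (13.97) | 80.43 (7.57) | 74.41 (12.31) | 186.74 (9.00) | 89.40 (9.87) | 86.68 (11.13) |
| Córdoba | RBF | 1 | 189.78 (4.50) | 155.67 (32.39) | 166.90 (33.23) | 226.67 (8.90) | 196.20 (33.28) | 209.91 (40.78) |
| Córdoba | RBF | 2 | 186.41 (13.97) | 161.03 (29.05) | 176.94 (30.49) | 235.62 (9.00) | 197.97 (36.07) | 220.06 (30.83) |
| Córdoba | RBF+PU | 1 | 116.65 (20.27) | 82.28 (28.21) | 71.96 (16.16) | 120.36 (16.08) | 74.58 (10.96) | 77.15 (16.61) |
| Córdoba | RBF+PU | 2 | 117.15 (15.00) | 77.14 (13.39) | 70.52 (13.79) | 120.75 (15.61) | 78.10 (12.98) | 78.60 (15.63) |
| Granada | PU | 1 | 198.09 (5.62) | 112.25 (9.39) | 118.49 (13.14) | 213.20 (14.01) | 121.39 (11.73) | 127.45 (12.34) |
| Granada | PU | 2 | 157.57 (40.79) | 114.70 (19.86) | 113.14 (12.95) | 210.26 (17.62) | 113.15 (12.98) | 128.26 (17.07) |
| Granada | RBF | 1 | 203.49 (5.62) | 182.22 (45.85) | 169.50 (42.52) | 318.22 (14.01) | 227.12 (43.82) | 229.42 (46.94) |
| Granada | RBF | 2 | 186.01 (40.79) | 168.01 (31.91) | 181.62 (41.68) | 316.98 (17.62) | 246.96 (46.30) | 270.13 (51.13) |
| Granada | RBF+PU | 1 | 148.56 (45.33) | 130.61 (104.41) | 108.39 (69.97) | 173.50 (36.95) | 105.69 (21.87) | 109.03 (26.16) |
| Granada | RBF+PU | 2 | 152.28 (59.65) | 112.08 (52.14) | 115.38 (103.97) | 165.85 (35.86) | 104.71 (22.91) | 123.64 (64.24) |
| Huelva Cost | PU | 1 | 147.36 (18.14) | 61.30 (10.94) | 68.92 (10.70) | 139.29 (9.39) | 62.53 (4.05) | 67.97 (8.01) |
| Huelva Cost | PU | 2 | 147.93 (18.95) | 69.72 (9.51) | 76.85 (14.70) | 133.15 (7.50) | 60.94 (4.69) | 70.00 (11.91) |
| Huelva Cost | RBF | 1 | 135.65 (18.14) | 87.27 (18.56) | 109.96 (46.24) | 158.09 (9.39) | 114.14 (38.05) | 141.25 (41.63) |
| Huelva Cost | RBF | 2 | 133.62 (18.95) | 83.45 (19.20) | 121.65 (45.67) | 154.99 (7.50) | 108.66 (35.66) | 142.12 (57.78) |
| Huelva Cost | RBF+PU | 1 | 89.77 (33.70) | 63.54 (17.01) | 71.23 (29.77) | 77.04 (15.55) | 69.55 (20.88) | 79.07 (20.38) |
| Huelva Cost | RBF+PU | 2 | 75.00 (23.14) | 67.65 (21.77) | 74.14 (23.85) | 75.46 (17.85) | 68.14 (19.17) | 69.24 (16.99) |
| Jaén | PU | 1 | 100.27 (12.54) | 78.89 (6.54) | 81.39 (6.44) | 117.33 (4.59) | 98.80 (12.75) | 105.99 (8.80) |
| Jaén | PU | 2 | 86.57 (16.14) | 78.30 (6.98) | 77.30 (8.38) | 113.97 (12.73) | 96.97 (10.78) | 103.66 (11.09) |
| Jaén | RBF | 1 | 109.06 (12.54) | 96.61 (12.49) | 88.38 (13.97) | 112.81 (4.59) | 99.44 (13.78) | 112.94 (32.96) |
| Jaén | RBF | 2 | 108.08 (16.14) | 106.57 (12.88) | 96.79 (19.70) | 114.19 (12.73) | 115.94 (25.38) | 119.28 (29.18) |
| Jaén | RBF+PU | 1 | 74.03 (35.94) | 70.64 (61.72) | 94.16 (49.77) | 70.83 (18.46) | 84.69 (43.50) | 78.41 (21.37) |
| Jaén | RBF+PU | 2 | 80.66 (47.86) | 74.21 (34.16) | 74.11 (20.98) | 70.64 (23.42) | 85.71 (54.24) | 77.53 (19.70) |
| Málaga | PU | 1 | 428.79 (14.34) | 312.39 (20.39) | 305.80 (23.28) | 391.10 (28.60) | 296.15 (27.29) | 293.25 (33.05) |
| Málaga | PU | 2 | 423.88 (43.40) | 300.56 (29.19) | 293.06 (39.57) | 370.87 (41.41) | 281.45 (40.26) | 276.09 (43.11) |
| Málaga | RBF | 1 | 365.13 (14.34) | 467.07 (85.02) | 499.11 (84.02) | 451.47 (28.60) | 559.18 (150.48) | 559.78 (105.88) |
| Málaga | RBF | 2 | 395.36 (43.40) | 499.41 (68.21) | 528.72 (68.64) | 448.60 (41.41) | 564.19 (107.76) | 531.75 (107.13) |
| Málaga | RBF+PU | 1 | 358.95 (33.49) | 316.25 (38.65) | 332.67 (148.19) | 288.31 (60.17) | 255.68 (56.73) | 259.13 (46.60) |
| Málaga | RBF+PU | 2 | 352.63 (26.08) | 312.10 (48.37) | 298.15 (72.09) | 274.45 (65.26) | 243.80 (65.70) | 267.21 (127.55) |
| Sevilla | PU | 1 | 268.46 (18.32) | 155.93 (19.72) | 156.78 (27.77) | 292.42 (14.57) | 147.59 (19.40) | 162.60 (16.60) |
| Sevilla | PU | 2 | 226.27 (22.92) | 174.82 (23.72) | 187.60 (37.87) | 270.86 (16.85) | 141.32 (22.04) | 158.98 (19.26) |
| Sevilla | RBF | 1 | 388.28 (18.32) | 320.38 (86.43) | 348.93 (123.80) | 473.52 (14.57) | 446.58 (102.65) | 459.12 (96.61) |
| Sevilla | RBF | 2 | 365.89 (22.92) | 331.61 (91.40) | 371.61 (82.34) | 483.75 (16.85) | 442.58 (112.32) | 518.32 (71.35) |
| Sevilla | RBF+PU | 1 | 263.98 (47.13) | 171.73 (39.08) | 146.11 (34.11) | 273.38 (58.92) | 150.53 (23.62) | 152.44 (21.21) |
| Sevilla | RBF+PU | 2 | 256.68 (35.30) | 163.81 (34.53) | 150.71 (20.47) | 274.94 (52.97) | 159.38 (30.53) | 156.38 (31.06) |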
Table 6

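Performances of the MoEANN and MuEANN models trained with the different combinations of input features and autoregressive orders, p, evaluating the errors of the second (long-term) forecast horizon. The results are expressed as the RMSE of the generalization set; each cell shows the mean with the standard deviation (SD) in parentheses.

| District | BF | p | MoEANN AR(p)_y | MoEANN VAR(p)_{y,α4} | MoEANN VAR(p)_{y,α4,α3} | MuEANN AR(p)_y | MuEANN VAR(p)_{y,α4} | MuEANN VAR(p)_{y,α4,α3} |
|---|---|---|---|---|---|---|---|---|
| Almería | PU | 1 | 426.60 (20.90) | 137.24 (15.40) | 124.94 (12.66) | 322.39 (15.91) | 140.40 (22.01) | 131.47 (15.30) |
| Almería | PU | 2 | 415.68 (27.30) | 147.07 (21.20) | 138.45 (18.57) | 326.93 (33.21) | 169.14 (35.54) | 146.16 (25.96) |
| Almería | RBF | 1 | 421.46 (20.90) | 254.38 (57.33) | 261.43 (76.61) | 463.63 (15.91) | 272.98 (60.76) | 310.54 (84.80) |
| Almería | RBF | 2 | 426.25 (27.30) | 248.08 (70.96) | 247.53 (71.40) | 466.01 (33.21) | 288.35 (75.45) | 307.70 (77.09) |
| Almería | RBF+PU | 1 | 182.97 (59.55) | 147.01 (31.56) | 152.29 (43.97) | 215.80 (37.53) | 156.12 (32.20) | 154.41 (27.30) |
| Almería | RBF+PU | 2 | 180.70 (55.17) | 148.54 (34.73) | 137.52 (42.93) | 228.23 (36.45) | 174.99 (42.88) | 176.23 (39.83) |
| Cádiz Bay | PU | 1 | 501.23 (25.16) | 325.03 (11.60) | 322.42 (24.88) | 409.40 (18.17) | 276.86 (29.40) | 269.10 (22.70) |
| Cádiz Bay | PU | 2 | 498.95 (19.77) | 312.92 (13.03) | 312.07 (26.27) | 429.54 (34.05) | 278.18 (25.54) | 257.23 (43.86) |
| Cádiz Bay | RBF | 1 | 460.89 (25.16) | 219.52 (40.75) | 242.40 (67.64) | 396.46 (18.17) | 324.42 (93.69) | 353.20 (86.84) |
| Cádiz Bay | RBF | 2 | 454.45 (19.77) | 217.76 (48.41) | 258.11 (57.20) | 410.00 (34.05) | 381.32 (401.25) | 326.98 (77.12) |
| Cádiz Bay | RBF+PU | 1 | 502.89 (147.86) | 285.99 (47.07) | 275.16 (50.38) | 293.54 (37.96) | 230.03 (55.89) | 216.91 (53.80) |
| Cádiz Bay | RBF+PU | 2 | 442.87 (107.47) | 286.29 (52.97) | 267.10 (67.05) | 311.51 (112.80) | 220.24 (58.14) | 212.68 (60.66) |
| Córdoba | PU | 1 | 264.52 (6.69) | 132.21 (5.69) | 110.73 (18.68) | 252.26 (4.47) | 135.56 (11.81) | 124.95 (15.13) |
| Córdoba | PU | 2 | 251.62 (11.44) | 128.85 (8.86) | 115.51 (18.41) | 255.80 (6.53) | 133.14 (10.32) | 128.61 (12.48) |
| Córdoba | RBF | 1 | 263.82 (6.69) | 207.74 (37.88) | 216.63 (46.81) | 328.34 (4.47) | 249.11 (38.39) | 260.34 (43.14) |
| Córdoba | RBF | 2 | 258.87 (11.44) | 223.00 (26.99) | 228.65 (28.77) | 338.04 (6.53) | 258.78 (41.17) | 265.93 (30.29) |
| Córdoba | RBF+PU | 1 | 174.55 (27.13) | 111.63 (17.62) | 111.06 (18.58) | 184.47 (15.81) | 125.03 (13.78) | 128.51 (20.28) |
| Córdoba | RBF+PU | 2 | 172.91 (30.25) | 121.88 (19.65) | 110.95 (16.27) | 190.21 (18.62) | 135.81 (20.29) | 131.58 (20.06) |
| Granada | PU | 1 | 279.01 (21.82) | 152.27 (9.72) | 164.82 (18.47) | 301.14 (12.14) | 139.00 (7.85) | 158.66 (12.52) |
| Granada | PU | 2 | 192.38 (45.54) | 152.24 (14.59) | 154.67 (23.55) | 311.78 (21.74) | 143.17 (12.28) | 165.91 (15.85) |
| Granada | RBF | 1 | 294.82 (21.82) | 220.14 (37.12) | 212.94 (52.10) | 412.32 (12.14) | 284.42 (41.23) | 300.08 (64.77) |
| Granada | RBF | 2 | 281.19 (45.54) | 242.33 (38.38) | 237.80 (46.39) | 407.60 (21.74) | 314.26 (34.37) | 324.77 (54.32) |
| Granada | RBF+PU | 1 | 216.08 (44.87) | 176.55 (118.66) | 126.36 (31.26) | 240.34 (61.51) | 168.62 (24.40) | 170.27 (36.28) |
| Granada | RBF+PU | 2 | 207.33 (44.49) | 196.73 (123.74) | 158.56 (75.74) | 241.67 (53.77) | 169.84 (27.97) | 196.56 (69.05) |
| Huelva Cost | PU | 1 | 202.60 (19.02) | 86.37 (6.78) | 93.89 (10.47) | 198.09 (10.09) | 90.97 (9.85) | 92.13 (9.49) |
| Huelva Cost | PU | 2 | 203.57 (15.97) | 79.06 (7.22) | 89.88 (8.48) | 196.18 (9.77) | 89.03 (9.64) | 96.41 (10.44) |
| Huelva Cost | RBF | 1 | 204.09 (19.02) | 114.35 (23.67) | 135.86 (40.22) | 226.78 (10.09) | 140.19 (33.66) | 164.75 (36.37) |
| Huelva Cost | RBF | 2 | 189.28 (15.97) | 105.75 (22.14) | 137.26 (39.83) | 227.02 (9.77) | 134.19 (30.46) | 169.98 (67.43) |
| Huelva Cost | RBF+PU | 1 | 127.05 (66.98) | 84.60 (25.29) | 95.66 (39.06) | 116.44 (22.26) | 97.03 (21.84) | 101.24 (21.02) |
| Huelva Cost | RBF+PU | 2 | 108.34 (39.04) | 79.92 (20.69) | 96.17 (32.21) | 114.94 (25.31) | 96.38 (22.83) | 94.87 (28.81) |
| Jaén | PU | 1 | 155.56 (8.00) | 117.36 (10.04) | 122.96 (7.68) | 149.90 (5.03) | 129.34 (14.86) | 117.03 (10.98) |
| Jaén | PU | 2 | 128.23 (22.05) | 111.72 (8.35) | 115.08 (12.26) | 148.61 (15.77) | 127.90 (13.89) | 123.06 (15.93) |
| Jaén | RBF | 1 | 140.76 (8.00) | 121.14 (13.64) | 121.52 (14.38) | 181.86 (5.03) | 160.27 (20.03) | 166.23 (27.05) |
| Jaén | RBF | 2 | 138.02 (22.05) | 128.71 (16.94) | 124.17 (24.20) | 188.39 (15.77) | 169.52 (24.90) | 173.19 (33.89) |
| Jaén | RBF+PU | 1 | 106.77 (54.71) | 102.83 (34.25) | 109.76 (36.07) | 108.39 (41.12) | 124.17 (91.61) | 113.36 (36.17) |
| Jaén | RBF+PU | 2 | 98.82 (31.06) | 115.55 (103.98) | 116.29 (33.37) | 102.82 (30.77) | 115.75 (45.87) | 108.93 (24.21) |
| Málaga | PU | 1 | 511.47 (33.48) | 410.57 (22.16) | 395.97 (34.33) | 308.50 (26.07) | 326.36 (27.25) | 325.70 (30.92) |
| Málaga | PU | 2 | 528.86 (76.06) | 385.95 (36.31) | 365.23 (46.20) | 351.93 (96.41) | 306.17 (37.17) | 308.01 (44.44) |
| Málaga | RBF | 1 | 542.03 (33.48) | 608.60 (67.13) | 639.00 (70.48) | 528.34 (26.07) | 572.49 (69.65) | 581.85 (87.50) |
| Málaga | RBF | 2 | 570.64 (76.06) | 628.99 (46.10) | 687.12 (61.91) | 552.15 (96.41) | 594.26 (80.60) | 586.21 (92.53) |
| Málaga | RBF+PU | 1 | 492.83 (55.81) | 401.03 (37.86) | 407.21 (57.16) | 342.61 (82.43) | 288.63 (58.38) | 289.45 (45.87) |
| Málaga | RBF+PU | 2 | 472.38 (41.33) | 419.93 (73.59) | 385.02 (62.96) | 352.97 (78.50) | 268.53 (62.05) | 303.64 (111.44) |
| Sevilla | PU | 1 | 378.64 (22.45) | 229.26 (25.04) | 209.98 (38.14) | 374.89 (11.51) | 206.52 (23.67) | 218.54 (18.82) |
| Sevilla | PU | 2 | 318.66 (27.24) | 260.88 (34.39) | 233.25 (28.07) | 385.27 (28.82) | 216.98 (26.45) | 226.48 (25.73) |
| Sevilla | RBF | 1 | 553.19 (22.45) | 436.69 (104.15) | 427.74 (91.88) | 654.71 (11.51) | 556.51 (109.41) | 563.96 (121.52) |
| Sevilla | RBF | 2 | 365.89 (27.24) | 470.94 (89.03) | 494.17 (73.37) | 665.62 (28.82) | 550.15 (126.87) | 613.40 (78.47) |
| Sevilla | RBF+PU | 1 | 394.14 (61.60) | 227.19 (66.71) | 196.44 (56.24) | 449.79 (92.50) | 252.38 (41.93) | 251.01 (30.25) |
| Sevilla | RBF+PU | 2 | 394.38 (79.74) | 251.60 (74.07) | 226.51 (44.38) | 435.62 (88.04) | 271.02 (39.57) | 258.87 (42.70) |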
Table 7

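Performances of the MoEANN and MuEANN models trained with the different combinations of input features and autoregressive orders, p, simultaneously evaluating the errors of both forecast horizons. The results are expressed as the RMSE of the generalization set; each cell shows the mean with the standard deviation (SD) in parentheses.

| District | BF | p | MoEANN AR(p)_y | MoEANN VAR(p)_{y,α4} | MoEANN VAR(p)_{y,α4,α3} | MuEANN AR(p)_y | MuEANN VAR(p)_{y,α4} | MuEANN VAR(p)_{y,α4,α3} |
|---|---|---|---|---|---|---|---|---|
| Almería | PU | 1 | 369.61 (57.15) | 116.71 (21.15) | 104.92 (11.46) | 299.18 (29.05) | 130.04 (25.13) | 120.13 (11.46) |
| Almería | PU | 2 | 357.96 (57.89) | 125.23 (22.26) | 114.58 (24.19) | 301.37 (38.27) | 152.93 (36.74) | 136.45 (27.93) |
| Almería | RBF | 1 | 366.09 (57.15) | 214.94 (39.49) | 221.15 (40.57) | 397.86 (29.05) | 265.01 (64.96) | 307.28 (82.87) |
| Almería | RBF | 2 | 359.89 (57.89) | 216.15 (32.84) | 225.44 (23.55) | 409.51 (38.27) | 271.87 (81.87) | 304.38 (74.72) |
| Almería | RBF+PU | 1 | 157.43 (26.44) | 127.65 (20.66) | 125.89 (27.47) | 181.67 (46.53) | 149.47 (34.12) | 150.46 (36.18) |
| Almería | RBF+PU | 2 | 158.42 (23.48) | 123.13 (25.99) | 118.56 (20.09) | 190.49 (48.97) | 169.55 (44.42) | 174.01 (43.06) |
| Cádiz Bay | PU | 1 | 441.32 (72.28) | 274.24 (50.89) | 271.64 (50.29) | 403.52 (20.26) | 257.35 (36.81) | 258.09 (26.15) |
| Cádiz Bay | PU | 2 | 422.43 (76.62) | 273.75 (39.36) | 273.80 (38.58) | 391.77 (50.73) | 253.12 (35.98) | 243.53 (44.69) |
| Cádiz Bay | RBF | 1 | 375.60 (72.28) | 193.79 (26.46) | 207.71 (35.41) | 346.01 (20.26) | 296.70 (96.56) | 317.17 (86.08) |
| Cádiz Bay | RBF | 2 | 378.18 (76.62) | 185.42 (11.46) | 226.38 (32.28) | 356.98 (50.73) | 357.59 (392.49) | 299.22 (82.43) |
| Cádiz Bay | RBF+PU | 1 | 412.70 (98.75) | 245.51 (42.65) | 237.12 (40.77) | 259.86 (50.81) | 207.77 (61.08) | 195.47 (60.45) |
| Cádiz Bay | RBF+PU | 2 | 369.35 (74.09) | 248.22 (38.67) | 229.87 (38.42) | 264.63 (98.98) | 202.24 (65.47) | 193.90 (11.46) |
| Córdoba | PU | 1 | 230.13 (34.48) | 108.05 (24.94) | 93.58 (17.78) | 223.15 (29.95) | 115.48 (22.78) | 109.32 (19.84) |
| Córdoba | PU | 2 | 217.16 (34.64) | 104.64 (24.38) | 95.22 (20.92) | 221.27 (35.41) | 111.27 (24.09) | 107.65 (24.07) |
| Córdoba | RBF | 1 | 226.80 (34.48) | 182.39 (26.71) | 192.40 (25.66) | 277.50 (29.95) | 222.66 (44.62) | 235.13 (48.97) |
| Córdoba | RBF | 2 | 222.64 (34.64) | 192.41 (31.43) | 202.79 (26.42) | 286.83 (35.41) | 228.37 (49.22) | 243.00 (38.21) |
| Córdoba | RBF+PU | 1 | 145.60 (29.36) | 96.96 (15.44) | 91.51 (19.99) | 152.42 (35.80) | 99.80 (11.46) | 102.83 (31.67) |
| Córdoba | RBF+PU | 2 | 145.03 (28.29) | 99.79 (22.74) | 90.74 (11.46) | 155.48 (38.75) | 106.96 (33.51) | 105.09 (32.01) |
| Granada | PU | 1 | 238.55 (40.63) | 132.52 (20.25) | 141.66 (23.50) | 257.17 (45.88) | 130.19 (13.31) | 143.06 (19.95) |
| Granada | PU | 2 | 174.97 (18.60) | 133.47 (19.22) | 133.91 (21.20) | 261.02 (54.48) | 128.16 (11.46) | 147.09 (25.01) |
| Granada | RBF | 1 | 249.73 (40.63) | 201.18 (20.02) | 191.22 (22.79) | 365.27 (45.88) | 255.77 (51.29) | 264.75 (66.69) |
| Granada | RBF | 2 | 233.60 (18.60) | 205.17 (37.63) | 209.71 (28.87) | 362.29 (54.48) | 280.61 (52.86) | 297.45 (59.41) |
| Granada | RBF+PU | 1 | 182.75 (34.42) | 153.58 (25.28) | 117.49 (11.46) | 206.92 (60.76) | 137.16 (39.07) | 139.65 (44.02) |
| Granada | RBF+PU | 2 | 179.81 (28.46) | 155.49 (43.36) | 136.97 (23.58) | 203.76 (59.38) | 137.27 (41.40) | 160.10 (76.00) |
| Huelva Cost | PU | 1 | 174.98 (27.96) | 73.83 (12.88) | 81.40 (12.90) | 168.69 (30.97) | 76.75 (16.09) | 80.05 (14.94) |
| Huelva Cost | PU | 2 | 175.75 (28.13) | 74.39 (5.49) | 83.36 (7.35) | 164.67 (32.70) | 74.99 (11.46) | 83.21 (17.32) |
| Huelva Cost | RBF | 1 | 169.87 (27.96) | 100.81 (14.29) | 122.91 (14.53) | 192.44 (30.97) | 127.17 (38.21) | 153.00 (40.81) |
| Huelva Cost | RBF | 2 | 161.45 (28.13) | 94.74 (12.04) | 129.46 (10.18) | 191.00 (32.70) | 121.43 (35.53) | 156.05 (64.32) |
| Huelva Cost | RBF+PU | 1 | 108.41 (19.94) | 74.07 (11.49) | 83.44 (13.55) | 96.74 (27.51) | 83.29 (25.40) | 90.15 (23.49) |
| Huelva Cost | RBF+PU | 2 | 91.88 (17.58) | 73.79 (11.46) | 85.15 (12.22) | 95.20 (29.48) | 82.26 (25.37) | 82.05 (26.90) |
| Jaén | PU | 1 | 127.92 (27.83) | 98.13 (19.45) | 102.44 (20.96) | 133.62 (16.98) | 114.07 (20.61) | 111.51 (11.38) |
| Jaén | PU | 2 | 107.40 (21.28) | 95.01 (16.94) | 96.43 (19.16) | 131.29 (22.48) | 112.43 (19.84) | 113.36 (16.81) |
| Jaén | RBF | 1 | 124.91 (27.83) | 109.04 (12.78) | 104.95 (16.99) | 147.34 (16.98) | 129.86 (34.94) | 139.58 (40.24) |
| Jaén | RBF | 2 | 123.05 (21.28) | 117.36 (11.73) | 110.48 (14.47) | 151.29 (22.48) | 142.73 (36.74) | 146.24 (41.56) |
| Jaén | RBF+PU | 1 | 90.61 (17.70) | 86.73 (11.46) | 101.96 (10.18) | 89.61 (36.99) | 104.43 (74.37) | 95.89 (34.46) |
| Jaén | RBF+PU | 2 | 89.74 (11.04) | 94.88 (22.28) | 95.20 (21.72) | 86.73 (11.46) | 100.73 (52.43) | 93.23 (27.09) |
| Málaga | PU | 1 | 472.89 (41.72) | 361.48 (49.30) | 351.46 (45.40) | 349.80 (49.54) | 311.25 (31.17) | 309.48 (35.88) |
| Málaga | PU | 2 | 476.37 (53.05) | 343.80 (43.08) | 329.15 (11.46) | 361.40 (74.79) | 293.81 (40.67) | 292.05 (46.60) |
| Málaga | RBF | 1 | 453.58 (41.72) | 537.83 (71.30) | 569.05 (70.50) | 489.90 (49.54) | 565.83 (117.44) | 570.81 (97.75) |
| Málaga | RBF | 2 | 483.00 (53.05) | 564.20 (65.23) | 607.92 (79.61) | 500.37 (74.79) | 579.22 (96.33) | 558.98 (103.73) |
| Málaga | RBF+PU | 1 | 431.63 (67.52) | 359.18 (42.84) | 370.41 (38.63) | 315.46 (77.10) | 272.16 (59.87) | 274.29 (48.66) |
| Málaga | RBF+PU | 2 | 412.51 (60.16) | 366.02 (54.48) | 342.14 (44.21) | 313.71 (82.17) | 256.17 (11.46) | 285.42 (121.15) |
| Sevilla | PU | 1 | 323.55 (55.28) | 193.53 (36.98) | 183.04 (27.22) | 333.65 (43.27) | 177.05 (11.46) | 190.57 (33.13) |
| Sevilla | PU | 2 | 273.05 (46.47) | 217.85 (43.37) | 210.43 (23.54) | 328.06 (61.88) | 179.15 (44.99) | 192.73 (40.69) |
| Sevilla | RBF | 1 | 470.73 (55.28) | 378.54 (58.97) | 387.83 (40.75) | 564.12 (43.27) | 501.54 (119.48) | 511.54 (121.64) |
| Sevilla | RBF | 2 | 365.89 (46.47) | 401.27 (70.31) | 432.89 (61.91) | 574.68 (61.88) | 496.37 (131.34) | 565.86 (88.79) |
| Sevilla | RBF+PU | 1 | 332.48 (65.59) | 199.81 (28.67) | 171.59 (11.46) | 361.59 (117.44) | 201.45 (61.25) | 201.72 (55.78) |
| Sevilla | RBF+PU | 2 | 326.40 (69.27) | 208.26 (44.51) | 188.61 (38.33) | 355.28 (108.32) | 215.20 (66.06) | 207.62 (63.40) |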
Apart from the model performances, the complexities of the two MoEANN models and the MuEANN model are an important aspect to compare. Since almost all experiments resulted in the maximum allowed number of nodes in the hidden layer (4 neurons, see Table 4), complexities are measured in terms of the total number of connections. Table 8 shows the average number of connections for each model, forecasting approach and dataset. For the sake of a fair comparison, model complexities are measured from a simultaneous forecast horizon perspective. For MoEANN models, the average number of connections is computed for each input feature combination and autoregressive order, and the resulting averages for the two horizons are then summed. Bold and italics represent the least complex and the second least complex architectures by forecasting approach, basis function and dataset, respectively. As expected, models become more complex when the number of autoregressive input features increases. Considering the average number of connections of the models, the use of MuEANN models is justified: even though the results obtained with this approach are slightly worse, the models are much simpler in all cases while their performances remain competitive.
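The complexity comparison can be illustrated with the following sketch, which counts the links of a network from boolean connectivity masks. The mask representation, the function name `count_links` and the fully connected toy networks are assumptions made for illustration, and bias links are ignored.

```python
import numpy as np

def count_links(input_hidden_mask, hidden_output_mask):
    # A True entry marks an existing connection (bias links are not counted)
    return int(np.sum(input_hidden_mask)) + int(np.sum(hidden_output_mask))

# Toy fully connected networks with 3 inputs and 4 hidden neurons
input_hidden = np.ones((3, 4), dtype=bool)

# Monotask: one output per model, and two models are needed (one per horizon)
monotask_links = count_links(input_hidden, np.ones((4, 1), dtype=bool))
moeann_total = 2 * monotask_links

# Multitask: a single model with two outputs shares the input-hidden links
mueann_total = count_links(input_hidden, np.ones((4, 2), dtype=bool))
```

In this toy case the two monotask models need 32 links in total, whereas the multitask model needs 20, which mirrors the pattern reported in Table 8: sharing the hidden layer between both horizons makes the MuEANN approach simpler.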
Table 8

Comparison of MoEANN and MuEANN model complexities considering double and simultaneous forecasting, $y_{d+3,d+5}$. The results are expressed in terms of the average number of links involved in the EANNs of the eight considered districts.

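| Model | BF | p | Input dataset AR(p): $y$ | Input dataset VAR(p): $y, \alpha_4$ | Input dataset VAR(p): $y, \alpha_4, \alpha_3$ |
|---|---|---|---|---|---|
| MoEANN | PU | 1 | 18.70 | 23.26 | 24.80 |
| MoEANN | PU | 2 | 20.76 | 26.74 | 29.42 |
| MoEANN | RBF | 1 | 19.46 | 20.45 | 21.84 |
| MoEANN | RBF | 2 | 21.14 | 25.50 | 27.74 |
| MoEANN | RBF+PU | 1 | 20.81 | 24.49 | 26.80 |
| MoEANN | RBF+PU | 2 | 22.13 | 29.02 | 33.09 |
| MuEANN | PU | 1 | 13.04 | 16.20 | 17.03 |
| MuEANN | PU | 2 | 13.73 | 18.59 | 19.53 |
| MuEANN | RBF | 1 | 14.33 | 14.91 | 15.48 |
| MuEANN | RBF | 2 | 15.26 | 16.92 | 18.16 |
| MuEANN | RBF+PU | 1 | 15.41 | 16.85 | 17.67 |
| MuEANN | RBF+PU | 2 | 15.82 | 19.17 | 20.59 |
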
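
As a rough check of the simplification reported in Table 8, the relative reduction in the average number of connections can be computed directly from its values. The sketch below is illustrative only: the dictionary layout and the helper name `reduction_pct` are assumptions made for this example, while the figures are copied from Table 8.

```python
# Average number of connections copied from Table 8.
# Keys: (basis function, autoregressive order p).
# Values: one entry per input dataset, in the order
#   y | y, alpha4 | y, alpha4, alpha3
MOEANN = {
    ("PU", 1): (18.70, 23.26, 24.80),
    ("PU", 2): (20.76, 26.74, 29.42),
    ("RBF", 1): (19.46, 20.45, 21.84),
    ("RBF", 2): (21.14, 25.50, 27.74),
    ("RBF+PU", 1): (20.81, 24.49, 26.80),
    ("RBF+PU", 2): (22.13, 29.02, 33.09),
}
MUEANN = {
    ("PU", 1): (13.04, 16.20, 17.03),
    ("PU", 2): (13.73, 18.59, 19.53),
    ("RBF", 1): (14.33, 14.91, 15.48),
    ("RBF", 2): (15.26, 16.92, 18.16),
    ("RBF+PU", 1): (15.41, 16.85, 17.67),
    ("RBF+PU", 2): (15.82, 19.17, 20.59),
}


def reduction_pct(mo_links, mu_links):
    """Relative reduction (in %) of MuEANN links with respect to MoEANN links."""
    return 100.0 * (mo_links - mu_links) / mo_links


# Hypothetical summary structure: same keys, one reduction per dataset.
reductions = {
    key: tuple(reduction_pct(mo, mu) for mo, mu in zip(MOEANN[key], MUEANN[key]))
    for key in MOEANN
}

for (bf, p), values in reductions.items():
    print(bf, p, " ".join(f"{v:5.1f}%" for v in values))
```

With these figures, every MuEANN configuration uses fewer connections than its MoEANN counterpart, which is consistent with the statement that the multitask models are simpler in all cases.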
In most cases, the models with a single output neuron (MoEANN) perform slightly better than the MuEANN models, as shown in Table 5, Table 6. This is to be expected, since MoEANN training is guided by the error of a single output variable and therefore focuses on a single objective. Nevertheless, the district of Málaga is a special case: there, both the single and the simultaneous forecasts improve when the models are trained considering the errors of the two horizons in the training set. This behavior is attributed to the capability of the MuEANN model to infer relationships between the related tasks it is optimized on.

Considering the results achieved, it can be concluded that there is no single combination of BF and dataset that performs accurately for all districts. However, some highlights can be extracted from the experimental results. Pure RBF models always perform poorly from the MuEANN perspective and, except for Cádiz Bay, they also underperform with the MoEANN approach. In the particular case of simultaneous forecasting (see MuEANN columns of Table 7), the PU architecture is the best choice when modeling the cumulative curve of contagions of the districts of Almería, Granada, Huelva and Sevilla, while an RBF+PU architecture is needed for the rest of the locations. In this respect, the number of input variable lags needed to achieve the best results depends on the district being modeled. Overall, considering Table 5, Table 6, Table 7, hybridization outperforms pure model architectures in the districts of Málaga, Jaén and Cádiz when modeling from a MuEANN approach. On the other hand, MoEANN approaches achieve better results in Sevilla, Jaén and Granada when modeling the considered horizons, either individually or jointly.

Moreover, concerning the complexities presented in Table 8, the average number of links for the eight considered districts is higher for the hybrid architectures, regardless of the autoregressive order and the number of model outputs. Pure PU models trained with the AR(p) dataset ($y$) are less complex than RBF models trained with the same dataset. Nevertheless, it is noteworthy that the opposite effect occurs when pure models are trained including polynomial information, i.e., when they are trained with the VAR(p) datasets ($y, \alpha_4$ and $y, \alpha_4, \alpha_3$).

The results show that the performance of both approaches (the MoEANN and MuEANN models) improves when autoregressive polynomial information derived from curve fitting is included in the model training set, except for the district of Jaén. This demonstrates that adding lags of polynomial coefficients has a positive effect on the forecasting task, not only in terms of accuracy but also, in some cases, by reducing the variance of the results. The results obtained in Jaén may be because it is the least populated district, and the fitness of the polynomial models along its outbreaks has the highest standard deviation of the eight districts, as shown in Table 2.

Fig. 6 shows a boxplot diagram where the RMSE achieved with the MuEANN model and the best dataset is compared with the results of the analogous model trained with no polynomial data, considering the dual and simultaneous horizon forecast. The improvement in the results when autoregressive polynomial coefficients are included in the training data is graphically noticeable in all districts except Jaén, especially in Córdoba, Huelva Coast, Almería, Sevilla and Granada, where the median obtained with the best combination is distinctly lower than that of its analogous model. There is no marked difference in the dispersion of the results obtained for the Cádiz and Málaga districts. However, in these districts, the worst result obtained with the model trained with polynomial information is always better than the worst result obtained with the model trained without it. Additionally, to quantitatively summarize the aforementioned improvement, Table 9 presents the percentage accuracy gain of the double forecast results obtained with models trained with the best VAR(p) datasets with respect to training with the best AR(p) datasets. In terms of percentage, the results for all the districts except Jaén are notably improved when modeled with polynomial information, with Almería being the district that benefits the most from this additional information, increasing the accuracy by 73.55% and 59.46% for the MoEANN and MuEANN approaches, respectively.
Fig. 6

Comparison boxplots of simultaneous forecast results using MuEANN models including (VAR(p)) and not including (AR(p)) polynomial information in the training set. The results are expressed as the RMSE obtained for the test set of the executions for each district.

Table 9

Percentage accuracy gain obtained in simultaneous double-horizon forecasting, $y_{d+3}$ and $y_{d+5}$, training the models with the best VAR(p) datasets with respect to the best AR(p) datasets in all districts.

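| District | $y_{d+3,d+5}$: MoEANN | $y_{d+3,d+5}$: MuEANN |
|---|---|---|
| Córdoba | 37.43% | 34.52% |
| Huelva Coast | 19.68% | 54.46% |
| Almería | 73.55% | 59.84% |
| Cádiz Bay | 50.97% | 26.72% |
| Sevilla | 48.39% | 46.93% |
| Granada | 35.71% | 50.90% |
| Jaén | 4.28% | 7.49% |
| Málaga | 30.90% | 18.34% |
| Mean | 37.61% | 35.53% |
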
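
The text does not restate how the accuracy gain is computed. One reading that is consistent with the Sevilla rows of Table 7 and Table 9 is the relative RMSE reduction of the best model trained with polynomial information with respect to the same architecture trained with the autoregressive series only. The sketch below assumes that definition; `accuracy_gain` is a hypothetical helper, and the RMSE values are the Sevilla means listed in Table 7.

```python
def accuracy_gain(rmse_ar, rmse_var):
    """Relative RMSE reduction (in %) when polynomial features are added.

    Assumed definition (not stated explicitly in the text):
        100 * (RMSE_AR - RMSE_VAR) / RMSE_AR
    """
    if rmse_ar <= 0:
        raise ValueError("rmse_ar must be positive")
    return 100.0 * (rmse_ar - rmse_var) / rmse_ar


# Sevilla, mean RMSE values from Table 7 (simultaneous forecast).
# MoEANN best model: RBF+PU, p = 1
#   trained with y only: 332.48; trained with y, alpha4, alpha3: 171.59
# MuEANN best model: PU, p = 1
#   trained with y only: 333.65; trained with y, alpha4: 177.05
gain_moeann = accuracy_gain(332.48, 171.59)
gain_mueann = accuracy_gain(333.65, 177.05)

print(f"MoEANN gain: {gain_moeann:.2f}%")
print(f"MuEANN gain: {gain_mueann:.2f}%")
```

Under this assumed definition, both values agree with the Sevilla row of Table 9 (48.39% and 46.93%) to within rounding of the last digit.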
To graphically illustrate the improvement produced by including the polynomial curve point descriptors in the training phase of the EANNs, Fig. 7, Fig. 8 compare the best performances obtained for the district of Sevilla with MuEANN models trained with the AR(p) and VAR(p) datasets. More specifically, in Fig. 7(a), Fig. 8(a), the model test predictions are represented over time together with the real curve of cumulative contagions of the four waves. For both forecast horizons, the model trained with the best VAR(p) dataset fits better than the one trained with the best AR(p) dataset. This improvement is especially noteworthy in the second and fourth waves for both horizons and in the third wave for the mid-term forecast. In Fig. 7(b), Fig. 8(b), where the real test values for the mid- and long-term horizons are scattered against the values predicted by both models, the improvements are easier to appreciate. Points located on the gray equality line represent a perfect prediction. For the two considered forecast horizons, the predictions of the models trained with the VAR(p) datasets are closer to this line, representing a more accurate prognosis of the cumulative number of contagions. Note that in Fig. 7, Fig. 8 the plotted data correspond, from left to right, to the second, fourth, fifth and third waves.
Fig. 7

test predictions of the cumulative number of Sevilla over time (a) and scattered with real values (b) using the best and datasets.

Fig. 8

test predictions of the cumulative number of Sevilla over time (a) and scattered with real values (b) using the best MuEANN model trained with and datasets.


Statistical analysis

Once the results of all model and dataset combinations for each district have been descriptively analyzed, a statistical analysis is carried out to obtain robust conclusions on several aspects of the experimentation. For simplicity, this analysis focuses mainly on the results obtained in simultaneous horizon forecasting with the MoEANN and MuEANN approaches (see Table 7). As mentioned in Section 4.3, MuEANN models are less complex than using both MoEANNs (with one forecasting horizon each), while their performance remains competitive.

For reasons of clarity, several comparison-of-means tests have been performed to identify significant differences within several aspects of the results obtained. These tests assume normality in the data being compared. In this respect, a Kolmogorov–Smirnov test (Massey, 1951) was applied to the results from the different seeds before performing the comparison tests. In all cases, the p-values obtained from the Kolmogorov–Smirnov test show that the null hypothesis of normality cannot be rejected for the different sets of results used.

First, the results of the best MuEANN model and training data combination for all districts are contrasted with the best results obtained from the MoEANN model, which performs better in most cases. Table 10 shows the p-values obtained in a paired t-test for the mean results of the best models in both approaches. Considering a level of significance of $\alpha = 0.05$, significant differences can only be found in the results of the districts of Córdoba, Almería and Málaga (i.e., 3 out of 8 sanitary districts). In the latter case (Málaga), the MuEANN method is more accurate than the monotask method. For this specific sanitary district, in addition to being less complex in terms of total links, models that simultaneously predict the mid- and long-term forecasting horizons perform significantly better. Notwithstanding the additional difficulty of modeling two different output time horizons, this approach takes advantage of the inherent relationship between the two tasks addressed, achieving similar results to the models with a single time horizon target. For the rest of the districts, the mean results obtained with both approaches are not significantly different, although the MuEANN approach is simpler in terms of the number of connections and, thus, easier to implement.
Table 10

Statistical differences between the average mean RMSE test results of the MoEANN and MuEANN best models.

| District | t | p-value |
|---|---|---|
| Córdoba | −3.26 | 2.31E−03 (a) |
| Huelva Coast | −0.42 | 6.78E−01 |
| Almería | −6.21 | 2.63E−07 (a) |
| Cádiz Bay | −0.85 | 3.98E−01 |
| Sevilla | −0.92 | 3.61E−01 |
| Granada | −1.69 | 9.82E−02 |
| Jaén | −1.18 | 2.46E−01 |
| Málaga | 6.75 | 4.69E−08 (b) |

(a) Statistically significant differences favoring MoEANN method.

(b) Statistically significant differences favoring MuEANN method.
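
To make the sign convention of Table 10 concrete, the sketch below computes a paired t statistic with the Python standard library and then applies a significance threshold to the tabulated p-values. It is a minimal illustration under stated assumptions: the per-seed RMSE lists are hypothetical and not taken from the paper, the differences are assumed to be computed as MoEANN minus MuEANN (which matches the footnotes, where negative t values favor MoEANN), and `ALPHA = 0.05` is an assumed significance level.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for two paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("paired samples must have equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical per-seed RMSE values, NOT taken from the paper.
moeann_rmse = [171.0, 180.5, 165.2, 190.1, 175.4, 168.9]
mueann_rmse = [178.3, 185.0, 172.8, 188.7, 183.9, 176.1]
t_stat, dof = paired_t_statistic(moeann_rmse, mueann_rmse)
print(f"t = {t_stat:.2f} with {dof} degrees of freedom")

# (t, p-value) pairs copied from Table 10.
TABLE_10 = {
    "Córdoba": (-3.26, 2.31e-03),
    "Huelva Coast": (-0.42, 6.78e-01),
    "Almería": (-6.21, 2.63e-07),
    "Cádiz Bay": (-0.85, 3.98e-01),
    "Sevilla": (-0.92, 3.61e-01),
    "Granada": (-1.69, 9.82e-02),
    "Jaén": (-1.18, 2.46e-01),
    "Málaga": (6.75, 4.69e-08),
}
ALPHA = 0.05  # assumed significance level

# A negative t means a lower MoEANN RMSE under the assumed sign convention.
significant = {
    district: ("MoEANN" if t < 0 else "MuEANN")
    for district, (t, p) in TABLE_10.items()
    if p < ALPHA
}
print(significant)
```

With the assumed threshold, only Córdoba, Almería and Málaga are flagged, and Málaga is the only district where the difference favors the MuEANN method.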

To analyze the effect of the polynomial information on the performance of the models, a paired t-test was applied to the results obtained with the best AR(p) and VAR(p) datasets for every horizon and model approach. The results, expressed as the p-value obtained in the test, are presented in Table 11 for all districts, forecast horizon approaches and methodologies (MoEANN or MuEANN). With the exception of Jaén, where no significant differences can be found between the best results obtained using the AR(p) or VAR(p) datasets, all models benefit significantly from the additional data derived from the polynomial curve fitting.
Table 11

Statistical differences between the best results obtained with the different models trained with the AR(p) and VAR(p) datasets in the eight districts. The results are expressed as the resulting p-value of a paired t-test.

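| District | $y_{d+3,i}$: MoEANN | $y_{d+3,i}$: MuEANN | $y_{d+5,i}$: MoEANN | $y_{d+5,i}$: MuEANN | $y_{d+3,d+5,i}$: MoEANN | $y_{d+3,d+5,i}$: MuEANN |
|---|---|---|---|---|---|---|
| Córdoba | 1.63E−15 (a) | 2.11E−18 (a) | 2.27E−12 (a) | 2.58E−21 (a) | 4.17E−15 (a) | 3.63E−21 (a) |
| Huelva Coast | 1.79E−03 (a) | 3.84E−06 (a) | 5.95E−05 (a) | 5.86E−07 (a) | 1.45E−05 (a) | 4.36E−07 (a) |
| Almería | 5.84E−10 (a) | 1.52E−09 (a) | 2.57E−07 (a) | 4.75E−16 (a) | 1.64E−09 (a) | 5.89E−14 (a) |
| Cádiz Bay | 1.91E−21 (a) | 6.66E−04 (a) | 3.59E−15 (a) | 2.54E−08 (a) | 1.56E−14 (a) | 3.67E−07 (a) |
| Sevilla | 6.57E−15 (a) | 3.64E−28 (a) | 5.26E−16 (a) | 2.01E−33 (a) | 6.29E−24 (a) | 1.39E−32 (a) |
| Granada | 4.90E−03 (a) | 8.82E−12 (a) | 4.35E−08 (a) | 7.73E−15 (a) | 1.84E−07 (a) | 1.26E−13 (a) |
| Jaén | 7.51E−01 | 1.56E−01 | 5.89E−01 | 3.22E−01 | 9.45E−01 | 2.23E−01 |
| Málaga | 2.00E−09 (a) | 3.93E−02 (a) | 8.02E−12 (a) | 3.06E−04 (a) | 3.76E−14 (a) | 1.98E−04 (a) |

(a) Statistically significant differences favoring models trained with VAR(p) datasets.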


Conclusions

In this paper, a novel approach for extracting inherent information from the cumulative curve of contagions produced by the COVID-19 transmission rate has been proposed. By iteratively fitting a -degree polynomial, a coefficient representation of the curve is generated for every day of a specific outbreak. The experimentation carried out, using the information of eight specific locations in Andalusia, Spain, demonstrates that using the coefficients that define the polynomial curve inflexion instant improves model performance when forecasting the mid- and long-term values of the original curve.

In the experimentation carried out in this paper, evolutionary artificial neural networks (EANNs) are used to model the cumulative contagion curve of different districts using two different basis functions and considering the hybridization of both. For each location, different datasets are used to estimate the transfer function parameters using combinations of autoregressive information of the original curve and extracted polynomial features. Due to the unique and specific characteristics of the sanitary districts considered, there is no single combination of network architecture and data that performs well in all cases. On the whole, however, the methodology applied achieves excellent results. Moreover, from the results obtained, it can be concluded that including lags of the polynomial features in the forecaster significantly improves the results, except for the district of Jaén. In most cases, the best performances obtained in Jaén are achieved using only autoregressive data of the cumulative contagion time series. In this district, the methods that slightly benefit from the additional information of the polynomial coefficients (MoEANNs approaches modeling and ) return results that are not significantly different from the same approaches using only autoregressive information, as seen in the p-values presented in Table 11.

Moreover, two different EANN approaches have been applied. On the one hand, monotask EANNs (MoEANN models) have been considered for modeling one target value at a time, i.e., two different models are needed to forecast two different horizons. On the other hand, a multitask EANN (MuEANN model) has been used for forecasting both horizons simultaneously, resulting in a much simpler model. Models performing only one task are expected to be more accurate, as their optimization procedure is less complex. Nevertheless, there are no significant differences between the performances achieved by the MoEANN approaches and the MuEANN one for the sanitary districts of Huelva, Cádiz, Sevilla, Granada and Jaén, although the complexity of MuEANN models is lower in terms of the average number of links. Furthermore, in the district of Málaga, the MuEANN models perform significantly better than the average results of the two best MoEANN models. In this case, the MuEANN approach benefits from the intrinsic relationship between the two output targets.

Finally, it has been demonstrated that considering features extracted from the inherent behavior of a process in the modeling stage can significantly improve the performance of the tasks carried out. A similar methodology may be applied to other kinds of problems: if the intrinsic nature of the problem is known, adapting a low-variance model as a description step may provide high-quality descriptors that are able to support several supervised and unsupervised tasks.
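
The idea of describing the cumulative curve by polynomial coefficients that are refitted every day can be sketched as follows. This is only an illustration under several assumptions that are not stated in this section: the polynomial degree (2 here), the expanding window that starts at the first day of the outbreak, the rescaling of time to [0, 1], and the helper names `fit_polynomial` and `daily_coefficients` are all choices made for the example, and the cumulative series is synthetic.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (stdlib only).

    Returns [c0, c1, ..., c_degree] for c0 + c1 * x + ... + c_degree * x**degree.
    """
    n = degree + 1
    if len(xs) < n:
        raise ValueError("need at least degree + 1 points")
    # Normal equations (V^T V) c = V^T y, with V[i][j] = xs[i] ** j.
    ata = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, n):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, n):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(ata[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (aty[i] - acc) / ata[i][i]
    return coeffs


def daily_coefficients(cumulative, degree, min_points):
    """Refit on an expanding window: one coefficient vector per day."""
    scale = len(cumulative) - 1  # rescale time to [0, 1] for conditioning
    result = []
    for day in range(min_points, len(cumulative) + 1):
        xs = [i / scale for i in range(day)]
        result.append(fit_polynomial(xs, cumulative[:day], degree))
    return result


# Synthetic cumulative curve (hypothetical data): 2 + 3 t + 0.5 t^2.
cumulative = [2 + 3 * t + 0.5 * t * t for t in range(12)]
coeffs_per_day = daily_coefficients(cumulative, degree=2, min_points=5)
print(len(coeffs_per_day), [round(c, 3) for c in coeffs_per_day[-1]])
```

Each coefficient vector could then be lagged and appended to the autoregressive inputs, in the spirit of the VAR(p) datasets. Solving the normal equations directly is adequate for low degrees on rescaled inputs; a QR-based least-squares routine would be a more robust choice for higher degrees.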

CRediT authorship contribution statement

Miguel Díaz-Lozano: Formal analysis, Visualization, Methodology, Writing – original draft. David Guijo-Rubio: Validation, Writing – original draft, Writing – review & editing. Pedro Antonio Gutiérrez: Software, Methodology, Writing – review & editing. Antonio Manuel Gómez-Orellana: Software. Isaac Túñez: Project administration, Resources. Luis Ortigosa-Moreno: Project administration, Resources. Armando Romanos-Rodríguez: Project administration, Resources. Javier Padillo-Ruiz: Project administration, Resources. César Hervás-Martínez: Funding acquisition, Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.