
Meteorological and human mobility data on predicting COVID-19 cases by a novel hybrid decomposition method with anomaly detection analysis: a case study in the capitals of Brazil.

Tiago Tiburcio da Silva¹, Rodrigo Francisquini¹, Maric C V Nascimento¹.

Abstract

In 2020, Brazil was the leading country in COVID-19 cases in Latin America, and capital cities were the most severely affected by the outbreak. Climates vary in Brazil due to the territorial extension of the country, its relief, geography, and other factors. Since the most common COVID-19 symptoms are related to the respiratory system, many researchers have studied the correlation between the number of COVID-19 cases and meteorological variables such as temperature, humidity, and rainfall. Also, because of the high transmission rate of the disease, some researchers have analyzed the impact of human mobility on the dynamics of COVID-19 transmission. There is a dearth of literature that considers these two kinds of variables together when predicting the spread of COVID-19 cases. In this paper, we analyzed the correlation between the number of COVID-19 cases and both human mobility and meteorological data in Brazilian capitals. We found that the correlation between such variables depends on the regions where the cities are located. We employed the variables with a significant correlation with COVID-19 cases to predict the number of COVID-19 infections in all Brazilian capitals and proposed a prediction method combining the Ensemble Empirical Mode Decomposition (EEMD) method with the Autoregressive Integrated Moving Average Exogenous inputs (ARIMAX) method, which we called EEMD-ARIMAX. After analyzing the results, we further investigated poor predictions using a signal processing-based anomaly detection method. Computational tests showed that EEMD-ARIMAX achieved a forecast 26.73% better than ARIMAX. Moreover, an improvement of 30.69% in the average root mean squared error (RMSE) was noticed when applying the EEMD-ARIMAX method to the data normalized after the anomaly detection.


Keywords: ARIMAX; COVID-19; EEMD; anomaly; human mobility data; meteorological data

Year: 2021 PMID: 34025047 PMCID: PMC8130621 DOI: 10.1016/j.eswa.2021.115190

Source DB: PubMed Journal: Expert Syst Appl ISSN: 0957-4174 Impact factor: 6.954

Introduction

According to the Centers for Disease Control and Prevention, a pandemic “refers to an increase, often sudden, in the number of cases of a disease above what is normally expected” “over several countries or continents, usually affecting a large number of people” (Dicker, Coronado, Koo, & Parrish, 2006). Several pandemic outbreaks have befallen humanity over the centuries. One of the first recorded pandemics occurred between 165 A.D. and 180 A.D. in the reign of Marcus Aurelius, when the Antonine Plague wiped out a third of the population in some areas of the Roman Empire and decimated the Roman army (Ligon, 2006). Almost four centuries later, during the mid-sixth century, the Justinian plague hit the Byzantine Empire. During this epidemic, 40% of Constantinople’s population was wiped out. One of the greatest pandemics in human history, the Black Death, or Bubonic plague, occurred between 1347 and 1352, and killed between 75 and 200 million people. Several other pandemics have occurred, such as New World Smallpox (1520–unknown), the Third Plague (1855), and the 1918 Flu (1918–1920). In 2002, Severe Acute Respiratory Syndrome (SARS), caused by SARS Coronavirus (SARS-CoV), emerged in the province of Guangdong, southern China, infecting thousands of people and causing the death of approximately one thousand people (Zhong et al., 2003). Cheng, Lau, Woo, and Yuen (2007) stated that “the presence of a large reservoir of SARS-CoV viruses in horseshoe bats, together with the culture of eating exotic mammals in southern China” was “a time bomb”. The authors warned about the possibility of a resurgence of SARS-CoV and other new viruses in animals or laboratories and that everyone should be prepared for a new pandemic. In 2012, a new coronavirus was discovered in the Middle East, the Middle East Respiratory Syndrome Coronavirus (MERS-CoV), which is still a reality. Seven years later, another coronavirus emerged in Wuhan, China (December 2019).
Because of its similarity to SARS-CoV, this new coronavirus was called SARS-CoV-2, and the disease, COVID-19 (Coronavirus Disease 2019) (Huang et al., 2020). COVID-19 is a highly contagious disease that spread rapidly around the world, causing worldwide travel restrictions as well as mandatory lockdowns in many cities. On April 29, 2021, the World Health Organization (WHO) reported that the virus was present in 223 countries, with 148,999,876 confirmed cases and 3,140,115 deaths (WHO, 2021). In Brazil, 14,441,563 COVID-19 cases and 395,022 deaths caused by the virus had been confirmed up to April 29, 2021 (WHO, 2021). For this reason, scientists around the world from the most diverse areas have focused their studies on understanding COVID-19 transmission dynamics (Fang, Nie, & Penny, 2020), prevention (Ali and Ghonimy, 2020, Li et al., 2020, Voysey et al., 2021), detection (Ismael and Sengür, 2021, Vidal et al., 2021), control measures (Meo et al., 2020), and prediction analysis (Hernandez-Matamoros et al., 2020, Katris, 2021, Petropoulos et al., 2020). Historically, viral respiratory tract infections, such as the ones caused by the coronaviruses from past epidemics, H1N1 influenza, and respiratory syncytial virus, were related to meteorological factors which possibly influenced the transmission and stability of the virus (Baker et al., 2019, Barreca and Shimshack, 2012, Chan et al., 2011, Lowen and Steel, 2014, Paynter, 2015). Several authors studied the correlation between climatic variables and the number of COVID-19 cases in the world: absolute humidity and temperature in the USA (Gupta, Raghuwanshi, & Chanda, 2020); UV index, wind speed, absolute humidity, among others, in 206 countries/regions (Islam et al., 2020); average air humidity and temperature in Brazil (Neto & Melo, 2020); temperature, absolute humidity, dew point, among others, in Singapore (Pani, Lin, & RavindraBabu, 2020); and wind speed and temperature in Turkey (Sahin, 2020), for example.
However, only a few authors addressed the prediction of COVID-19 cases using models that consider climatic variables, as can be observed in Silva et al. (2020), Makade et al. (2020), and Mousavi et al. (2020). Some studies also investigate the impact of human mobility on COVID-19 transmission. In these cases, mobility can be measured by passenger traffic in airports (Oztig & Askin, 2020), for example, or by changes in commuting patterns (Badr et al., 2020, Shao et al., 2021, Wang et al., 2020, Zhu et al., 2020). All these studies show that there is a strong correlation between human mobility and the number of people infected by COVID-19. To our knowledge, no study has yet investigated the impact of both human mobility and meteorological variables on COVID-19 transmission rates. Both factors should be considered together in such studies, as there is clear evidence that climate affects human mobility (Brum-Bastos, Long, & Demšar, 2018). This statement likely holds since, on warmer days, people lean toward performing outdoor activities and attending open-air events; on colder days, the opposite is true. The daily number of COVID-19 cases can be modeled and studied through time series theory. In a general way, a time series can be thought of as a combination of other time series, each explaining the original data at different frequencies (Büyüksahin et al., 2018). Each component then covers a narrower frequency range and has a more linear structure, which makes the prediction of the original time series more accurate.
Several techniques can be used to obtain the decomposition, such as Principal Component Analysis (PCA) (Jolliffe, 2002), Variational Mode Decomposition (VMD) (Dragomiretskiy & Zosso, 2014), Fourier Transform (FT) (Graps, 1995), Empirical Mode Decomposition (EMD) (Huang et al., 1998), Ensemble Empirical Mode Decomposition (EEMD) (Huang & Wu, 2008), and Singular Spectrum Analysis (SSA) (Golyandina, Nekrutkin, & Zhigljavsky, 2001). Since PCA is limited to linear time series; FT is limited to linear, periodic, or stationary time series (Huang et al., 1998); SSA is an application of PCA in the time domain (Hsieh & Aiming, 2002); the VMD application has to solve a variational optimization problem which requires predetermining an appropriate number of variational modes; and since EMD presents a mode-mixing problem; EEMD has been considered one of the most useful tools to decompose time series, either because of its simplicity or because it is not limited to linear nor stationary time series. According to Dong, Dai, Tang, and Yu (2019), EMD-based methods, like EEMD, substantially enhance prediction accuracy and have been successfully used in several types of datasets, such as IoT systems (Yu, Ding, Guo, & Wang, 2019), bitcoin (Khaldi, El Afia, Chiheb, & Faizi, 2018), geology (Liu, Zhan, Yang, & Wang, 2019), economy (Wu, Wu, & Zhu, 2019), finance (Lin, Lin, & Cao, 2021), medicine (Liu et al., 2019, Zha et al., 2018), machine fault diagnosis (Amirat, Benbouzid, Wang, Bacha, & Feld, 2018), and water resource management (Niu et al., 2019). It has also been applied for meteorological data, such as temperature (Liu et al., 2019), precipitation (Alizadeh, Roushangar, & Adamowski, 2019), and wind speed (Santhosh, Venkaiah, & Kumar, 2019). In this paper, we propose an adaptation of the EEMD method to decompose several time series, and to use these new decomposed time series in the forecast of another time series also decomposed by the same adaptation. 
The time series that will be predicted corresponds to the number of daily cases of COVID-19, and the other series, used as independent variables, correspond to the meteorological and human mobility time series. Only the time series that showed a reasonable correlation with the daily cases of COVID-19, in each city, are considered in the prediction. The prediction method employed in this paper is the Autoregressive Integrated Moving Average Exogenous inputs (ARIMAX) method (Box & Jenkins, 1990), a well-established method in the literature. In this paper, we also aim at understanding how, together, meteorological conditions and human mobility affect the transmission of COVID-19. To this end, we analyze both meteorological variables (rainfall, maximum temperature, minimum temperature, and humidity) and human mobility variables (movement trends over time by geography, across different categories of places: retail and recreation areas, grocery stores and pharmacies, parks, transit stations, workplaces, and residential areas). The main contributions of this paper can be summarized as follows: (i) it provides a thorough analysis of the correlation of meteorological and human mobility variables with COVID-19 cases in Brazilian capitals; (ii) it uses meteorological variables and human mobility in the prediction of daily cases of COVID-19 in Brazilian capitals; (iii) it adapts EEMD to decompose time series with independent variables; (iv) it proposes a novel method, called EEMD-ARIMAX, that combines the introduced EEMD-based method with ARIMAX to predict time series with independent variables; (v) it develops a case-oriented anomaly detection algorithm to better investigate the significant errors in prediction and thus adjust the prediction; (vi) it improves the ARIMAX forecast by 26.73% using the new EEMD-ARIMAX method; (vii) it refines the method by using the introduced anomaly detection strategy, thus improving the prediction by 30.69%. The rest of the paper is organized as follows.
Section 2 presents a general literature review on the prediction of COVID-19 cases using human mobility and meteorological data. It also gives a brief discussion of prediction methods, with special attention to decomposition-based methods introduced to predict COVID-19 cases. Section 3 shows the main features of the data used in the case study and introduces the proposed EEMD-ARIMAX method. Section 4 presents the results obtained by the proposed EEMD-ARIMAX strategy after a thorough correlation analysis of the data. Section 5 shows the data anomaly detection that was performed and the results of EEMD-ARIMAX and ARIMAX on normalized data. Section 6 wraps up the paper, drawing some conclusions and giving directions for future work. A list of symbols covering all the notation used throughout the paper is presented in Appendix A.

Related work

This section presents a brief literature review on predicting COVID-19 cases considering either meteorological or mobility variables. It also presents a short overview of methods for COVID-19 prediction, in particular, methods more closely related to the performed study.

Human mobility in the prediction of COVID-19 cases

According to Nayak et al. (2021), one of the primary impacts on predicting the COVID-19 cases consists of the variations in engagement, i.e. how committed people are to taking measures to reduce the number of COVID-19 cases. These measures include washing hands, wearing face masks, and maintaining social distancing. Concerning social distancing, one way to estimate the level of commitment is by analyzing the rates of human mobility. In line with this, Oztig and Askin (2020) considered the flow of people at airports as a human mobility measure, and observed that the greater the number of airports in a country, the more likely it is for the country to have a higher number of COVID-19 cases. This conclusion was drawn by the use of negative binomial regression analysis. Badr et al. (2020) studied the correlation between social distancing and COVID-19 cases, where social distancing was quantified by mobility patterns. To model the mobility data, the authors considered changes in commuting patterns between and within counties in the USA. The data to model the mobility patterns were obtained by Teralytics (Zürich, Switzerland). Wang et al. (2020) coupled the data of confirmed COVID-19 cases with the Google mobility data in Australia. The authors concluded that the social restriction policies imposed in the country at the emergence of the first COVID-19 case were effective in curbing the spread of the virus. Moreover, they observed that the correlation between human mobility and the spread of COVID-19 varies according to the type of mobility. Shao et al. (2021) used human mobility data from 47 countries in 6 continents collected from Mobility Trends Reports (from Apple Inc.), and showed that human mobility is strongly related to the COVID-19 transmission rate. Zhu et al. (2020) demonstrated a positive link between human mobility and the number of people infected by COVID-19, considering data from 120 cities in China. 
These studies show a clear influence of human mobility on the spread of COVID-19. However, cities present different human mobility patterns depending on factors such as how technological the cities are, the conditions of public and private transportation systems, among others. Therefore, the relationship/correlation between human mobility and dissemination of COVID-19 must be evaluated considering the cities’ particularities. Bearing this in mind, in this study we focus on the relationship between human mobility and the spread of COVID-19 cases in each of the 27 Brazilian capitals.

Meteorological variables in the prediction of COVID-19 cases

Sahin (2020) and Sharma and Gupta (2021) state that meteorological features should be used to improve the accuracy of COVID-19 predictions. Such variables are crucial factors affecting infectious diseases, whether in terms of changes in the transmission dynamics, host susceptibility, or the survival of the virus in the environment (McClymont & Hu, 2021). In line with this, Gupta et al. (2020) studied the relationship among new COVID-19 cases, absolute humidity, and temperatures in the USA. The authors observed that the spread of COVID-19 was majorly influenced by the absolute humidity in a narrow range of 4 to 6 g/m³. Islam et al. (2020) investigated the link between some environmental factors and COVID-19 cases in 206 countries/regions (until April 20, 2020). The relationships between the spread of COVID-19 and humidity, and between the spread of COVID-19 and the UV index, were inconclusive. Their investigation suggested a negative relationship between wind speed and COVID-19 cases. Moreover, a higher rate of COVID-19 cases was observed in environments with an absolute humidity between 5 and 10 g/m³. In Singapore, Pani et al. (2020) revealed that temperature, absolute humidity, and dew point have a positive correlation with the number of daily COVID-19 cases. Wind speed, atmospheric boundary layer height, and ventilation coefficient, on the other hand, showed a negative correlation with the number of COVID-19 cases. In Turkey, Sahin (2020) showed that wind speed has a positive correlation with COVID-19 cases, and that temperature and COVID-19 cases are negatively correlated. In Brazil, Neto and Melo (2020) concluded that only the average air humidity was significantly correlated with the number of COVID-19 cases (considering data from Brazilian capitals, and data available from April 2020 to May 2020). The study revealed a positive correlation, in contrast with the results obtained by other studies performed in cities in China, Spain, and the United States.
The authors also demonstrated that population density presented a strong positive correlation with the number of COVID-19 cases in the Brazilian capitals. They emphasize that population density, which is linked with higher human mobility, and poorer socioeconomic environments with deficient sanitary conditions contribute to the spread of the virus. Although several studies have addressed the relationship between meteorological variables and COVID-19, some results appear to be inconsistent. On the one hand, temperature and humidity, for example, were reported as having a significant impact in the majority of the studies. On the other hand, the correlation was positive in some cases and negative in others. These observations suggest that the link between meteorological features and the number of COVID-19 cases is complex and hard to generalize. There is evidence that meteorological variables contribute to the increase in the transmission of COVID-19, but the effect of these relationships should be studied locally, since other factors such as human mobility and public health measures (lockdown, for example) also have a strong influence on the number of COVID-19 cases.

COVID-19 and forecasting

From a methodological point of view, several studies attempt to understand the spread of COVID-19 using artificial intelligence. Albahri et al. (2020) provided an exhaustive overview of integrated artificial intelligence based on data mining and machine learning algorithms. The authors pointed to a need for integrated sensor technologies for outdoor scenarios to control the spread of the coronavirus. This process is only possible when there is an interconnection with IoT technologies. Nayak et al. (2021) present an overview of the applicability of intelligent systems such as machine learning and deep learning to solve COVID-19 outbreak-related issues. Sharma and Gupta (2021) reported and summarized the research performed on COVID-19 with machine learning and big data. The literature presents few studies that address the problem of predicting new COVID-19 cases through decomposition methods. To our knowledge, only Silva et al. (2020) and Mousavi et al. (2020) used decomposition methods in their predictions considering independent variables. Both proposed strategies were based on the variational mode decomposition (VMD). Silva et al. (2020) used VMD and some prediction techniques, including deep learning and machine learning, to predict COVID-19 cumulative confirmed cases in five Brazilian states and five American states with high daily incidences. The authors used temperature and precipitation as exogenous variables. They pointed out that VMD coupled with cubist regression achieved the best results among the tested techniques. Mousavi et al. (2020) proposed a model based on the combination of VMD with Long Short Term Memory considering the daily temperature, humidity, and transmission rates in the prediction of new COVID-19 daily cases in Maharashtra, Tamil Nadu, and Gujarat, India. Among these works, only Mousavi et al. (2020) addressed the prediction of daily COVID-19 cases, which is more difficult than the prediction of accumulated cases.
In this study, we address the prediction of new daily cases of COVID-19.

Case study: predictive analysis of Brazilian data

In 2020, Brazil had an estimated population of 212,622,578 inhabitants (IBGE, 2020). Brazil was the country with the greatest number of COVID-19 cases in 2020 in Latin America, ranking third in the world. Capitals were the most affected cities, and some experienced health system collapse, such as Manaus-AM (Ferrante et al., 2020). Since 23.86% of the Brazilian population lives in capital cities, the spatial units of analysis in this study were the 27 capitals in Brazil (IBGE, 2020). Brazil is a country with continental dimensions and the 5th largest country in the world in territorial extension, occupying an area of 8,510,295.91 km². The Brazilian climate has great variations, with 3 climate zones and 12 climate types (Alvares, Stape, Sentelhas, de Moraes Gonçalves, & Sparovek, 2013). We want to analyze the correlation between COVID-19 cases and meteorological and human mobility parameters, and whether these correlations differ within the same country.

Data

COVID-19 data were obtained from Brasil.io (Justen et al., 2020), which compiles newsletters from the State Health Secretariats of Brazil. Meteorological data were obtained from the Centro de Previsão de Tempo e Estudos Climáticos located at the Instituto Nacional de Pesquisas Espaciais (CPTEC, 2020). The meteorological data considered in this study are: Minimum Temperature (Min Temp): refers to the daily minimum temperature in degrees Celsius; Maximum Temperature (Max Temp): refers to the daily maximum temperature in degrees Celsius; Humidity (Hum): refers to the daily air humidity in percentage; Rainfall (Rain): refers to the daily total precipitation in millimeters. Human mobility data were obtained from the COVID-19 Community Mobility Reports (GOOGLE, 2020) prepared by Google. These reports point to geographical movement trends over time, across different categories of places. The place categories are: Retail and recreation (RR): refers to mobility trends to places like restaurants, shopping centers, theme parks, etc.; Grocery and pharmacy (GP): refers to mobility trends to places like grocery markets, farmers markets, pharmacies, etc.; Parks (PA): refers to mobility trends to places like local parks, public beaches, public gardens, etc.; Transit stations (TS): refers to mobility trends to places like subway, bus, train stations, etc.; Workplaces (WO): refers to mobility to places of work; Residential (RE): refers to mobility to places of residence. The Residential category shows a change in the permanence of people in their homes, while the other categories measure changes in the total number of visitors. Changes in mobility patterns each day were compared with a baseline corresponding to the same day of the week. This baseline corresponds to the median of the corresponding day of the week, during the five weeks from January 3 to February 6, 2020.
The number of observations in the human mobility and meteorological data varies from city to city, since it matches the number of observations available for daily COVID-19 cases in that city. In each city, all reported data start on the day the first COVID-19 case was confirmed in the city (column “first case” in Table 1 of the Supplementary Material) and end on the final compiled day: November 6, 2020. A more descriptive analysis of the data is presented in Section 1 of the Supplementary Material.
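
The baseline comparison described above can be illustrated with a minimal Python sketch. It assumes raw daily visit counts are available, which is not the case for the published reports (they only provide the resulting percentage changes); the function names `weekday_baseline` and `percent_change` and the toy counts are hypothetical.

```python
from datetime import date, timedelta
from statistics import median

def weekday_baseline(visits, start=date(2020, 1, 3), end=date(2020, 2, 6)):
    """Median visit count per weekday (0 = Monday) over the baseline window."""
    by_weekday = {}
    for day, count in visits.items():
        if start <= day <= end:
            by_weekday.setdefault(day.weekday(), []).append(count)
    return {wd: median(vals) for wd, vals in by_weekday.items()}

def percent_change(day, count, baseline):
    """Percent change of one day's count relative to its weekday baseline."""
    base = baseline[day.weekday()]
    return 100.0 * (count - base) / base

# Toy data (hypothetical): 100 visits on every baseline day, 200 on Fridays.
visits = {}
d = date(2020, 1, 3)
while d <= date(2020, 2, 6):
    visits[d] = 200 if d.weekday() == 4 else 100
    d += timedelta(days=1)

baseline = weekday_baseline(visits)
# March 20, 2020 was a Friday, so it is compared with the Friday median.
change = percent_change(date(2020, 3, 20), 50, baseline)
print(change)  # -75.0
```

Comparing each day with the median of the same weekday keeps the weekly rhythm (weekend versus working day) from showing up as a spurious change in mobility.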

Ensemble Empirical Mode Decomposition

Time series decomposition techniques have the goal of extracting simple periodic signals from the original time series, which can be used as inputs to machine learning approaches or other statistical models. Our study focuses on the use of the Ensemble Empirical Mode Decomposition (EEMD) technique (Huang & Wu, 2008), an adaptive data analysis method based on local characteristics of the data. EEMD captures nonlinear, non-stationary oscillations effectively. EEMD has been successfully used in several types of datasets (Lin et al., 2021, Niu et al., 2019), mainly in meteorological data, such as temperature (Liu et al., 2019), precipitation (Alizadeh et al., 2019), and wind speed (Santhosh et al., 2019). EEMD is an improvement of the empirical mode decomposition (EMD) method (Huang et al., 1998, Huang and Wu, 2008). It aims at decomposing the original data into a finite series of modes, called intrinsic mode functions (IMFs), and a residual, identifying the oscillatory modes that coexist. EEMD overcomes the so-called mode-mixing problem found in EMD. Mode mixing occurs when different oscillation components coexist in a single IMF or very similar oscillations reside in different IMFs (Huang & Wu, 2008). EEMD uses an ensemble of IMFs obtained by applying EMD to several noisy copies of the original time series, each obtained by adding white Gaussian noise. Adding white Gaussian noise reduces the mode-mixing problem by occupying the whole time–frequency space (Huang & Wu, 2008). In summary, EEMD has the following steps.
1. Let $x(t)$, $m$, and $s$ be the input data corresponding to, respectively, the original time series that will be decomposed, the number of ensembles, and the number of IMFs to be extracted from $x(t)$.
2. Make $k = 1$, a control variable that indicates the ensemble to be generated in the iteration.
3. Generate a new time series $x_k(t)$, obtained from $x(t)$ for the ensemble $k$, by adding to it a white noise with a standard deviation proportional to the standard deviation of $x(t)$, called $w_k(t)$. Therefore, $x_k(t) = x(t) + w_k(t)$, with the standard deviation of $w_k(t)$ equal to $\varepsilon$ times the standard deviation of $x(t)$, where $\varepsilon$ is a relatively small number which must be empirically determined.
4. Make $j = 1$, a control variable related to the index of the IMF of the $k$-th ensemble to be defined in the following steps, referred to as $IMF_{k,j}(t)$.
5. Identify all the local extreme values of $x_k(t)$, that is, its local maxima and minima. After that, interpolate these values by cubic spline interpolation to obtain the upper envelope (from the maxima) and the lower envelope (from the minima), respectively $u_k(t)$ and $l_k(t)$.
6. Calculate the point-to-point arithmetic mean between the envelopes, $a_k(t) = (u_k(t) + l_k(t))/2$, and subtract this “average time series” from the time series $x_k(t)$, obtaining the time series $IMF_{k,j}(t) = x_k(t) - a_k(t)$.
7. If $j < s$, then make $x_k(t) = x_k(t) - IMF_{k,j}(t)$ and $j = j + 1$, and repeat steps 5 and 6. If $j = s$, assign $x_k(t) - IMF_{k,j}(t)$ to the residual time series, called $Res_k(t)$.
8. Make $k = k + 1$ and repeat steps 3 to 7 until $k > m$, i.e., until the method obtains the $m$ ensembles.
The values of $\varepsilon$ and $m$ were empirically chosen after several computational tests, which identified the ensemble number and the value of $\varepsilon$ that presented the better outcomes. Furthermore, because of the proposed EEMD-ARIMAX method in Section 3.4, the number of IMFs into which the time series is decomposed was fixed in advance and, after tests, we found that a decomposition into 5 IMFs plus a residual was the most appropriate, i.e., $s = 5$. Fig. 1 shows the IMFs extracted from the data of São Paulo COVID-19 cases, by applying the EEMD algorithm. The IMFs were plotted from the first to the last component extracted from the series, where the last plot corresponds to the residual. The x-axis indicates the days, whereas the y-axis represents the values of the decomposed time series.

Fig. 1

Decomposed IMFs and residual obtained by EEMD considering the number of COVID-19 cases in São Paulo.
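
The decomposition loop can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the study: the envelopes are piecewise linear instead of cubic splines, each component is obtained with a single sifting pass, and the components are averaged over the ensemble at the end, as in standard EEMD. The names `eemd`, `decompose`, `envelope`, and `local_extrema`, the toy signal, and the values `m=20` and `eps=0.2` are all hypothetical.

```python
import math
import random
from statistics import pstdev

def local_extrema(x):
    """Indices of strict interior local maxima and minima."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, held flat at both ends
    (a simplification: the text above uses cubic splines)."""
    knots = [(0, x[idx[0]])] + [(i, x[i]) for i in idx] + [(len(x) - 1, x[idx[-1]])]
    env, k = [], 0
    for t in range(len(x)):
        while knots[k + 1][0] < t:
            k += 1
        (t0, v0), (t1, v1) = knots[k], knots[k + 1]
        env.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return env

def decompose(x, s):
    """Extract s components (one sifting pass each) and the residual."""
    imfs, current = [], list(x)
    for _ in range(s):
        maxima, minima = local_extrema(current)
        if maxima and minima:
            upper, lower = envelope(current, maxima), envelope(current, minima)
            mean = [(u + l) / 2 for u, l in zip(upper, lower)]
        else:
            mean = current  # no oscillation left: this component is zero
        imf = [c - a for c, a in zip(current, mean)]
        imfs.append(imf)
        current = [c - h for c, h in zip(current, imf)]
    return imfs, current

def eemd(x, m, s, eps, seed=0):
    """Average the components of m noisy copies of x."""
    rng, sd, n = random.Random(seed), pstdev(x), len(x)
    imf_sum = [[0.0] * n for _ in range(s)]
    res_sum = [0.0] * n
    for _ in range(m):
        noisy = [v + eps * sd * rng.gauss(0.0, 1.0) for v in x]
        imfs, res = decompose(noisy, s)
        for j in range(s):
            imf_sum[j] = [a + b for a, b in zip(imf_sum[j], imfs[j])]
        res_sum = [a + b for a, b in zip(res_sum, res)]
    return [[v / m for v in row] for row in imf_sum], [v / m for v in res_sum]

# Toy signal: two oscillations plus a slow trend.
signal = [math.sin(0.3 * t) + 0.5 * math.sin(2.0 * t) + 0.01 * t for t in range(60)]
imfs, residual = eemd(signal, m=20, s=5, eps=0.2)
```

By construction, the components and the residual of each noisy copy add back up to that copy, so the averaged components sum to the original series plus the averaged noise, which shrinks as the ensemble grows.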

Autoregressive Integrated Moving Average Exogenous inputs (ARIMAX)

The Autoregressive Integrated Moving Average (ARIMA) model proposed by Box and Jenkins (1990) is one of the most general classes of models for forecasting time series, valued for its simplicity of application and its capability of handling non-stationary data. The AR part of ARIMA indicates that the variable of interest is regressed on its own lagged values. The MA part indicates that the regression error is a linear combination of error values that occurred in the past. Finally, the I (for “integrated”) part represents the order of differencing needed to turn the time series into a stationary series (if necessary). Differencing means replacing the original series by the difference between its values and the previous values (Box & Jenkins, 1990). The ARIMA model that includes other time series as input variables (exogenous variables) is referred to as the Autoregressive Integrated Moving Average Exogenous inputs (ARIMAX) model. The parameters of the ARIMAX($p, d, q$) model are: $p$, the number of autoregressive terms; $d$, the number of nonseasonal differences needed for stationarity; $q$, the number of lagged forecast errors in the prediction equation; $n$, the number of exogenous variables; $\mu$, a constant; and $\phi_i$, for $i = 1, \ldots, p$, $\theta_j$, for $j = 1, \ldots, q$, and $\beta_l$, for $l = 1, \ldots, n$, the model parameters. Mathematically, this model can be formulated as in Eq. (1).
$$\hat{y}_t = \mu + \sum_{i=1}^{p} \phi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, e_{t-j} + \sum_{l=1}^{n} \beta_l\, x_{l,t} \qquad (1)$$
where $\hat{y}_t$ and $y_{t-i}$, for $i = 1, \ldots, p$, are, respectively, the predicted value and the lagged values of the time series (after $d$ differences); $x_{l,t}$, for $l = 1, \ldots, n$, are the exogenous variables; and $e_{t-j}$, for $j = 1, \ldots, q$, represent the error terms.
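
As a worked illustration of the prediction equation, the sketch below computes a single one-step prediction by hand. It assumes the convention in which the moving-average terms enter with a plus sign (some texts use a minus sign), and every coefficient and data value is made up for the example; the function names `difference` and `arimax_one_step` are hypothetical.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def arimax_one_step(mu, phi, theta, beta, past_y, past_errors, exog_now):
    """Constant + AR part + MA part + exogenous part.
    past_y and past_errors are ordered most recent first."""
    ar = sum(p * y for p, y in zip(phi, past_y))
    ma = sum(q * e for q, e in zip(theta, past_errors))
    ex = sum(b * x for b, x in zip(beta, exog_now))
    return mu + ar + ma + ex

cases = [10, 12, 15, 19, 24]      # made-up daily counts
diffed = difference(cases, d=1)   # [2, 3, 4, 5]

# p = 2, q = 1, n = 2, with arbitrary coefficients.
pred_diff = arimax_one_step(
    mu=0.5,
    phi=[0.6, 0.2],
    theta=[0.3],
    beta=[0.1, -0.05],
    past_y=[diffed[-1], diffed[-2]],  # 5 and 4
    past_errors=[0.4],
    exog_now=[20.0, 10.0],
)
# 0.5 + (3.0 + 0.8) + 0.12 + (2.0 - 0.5), i.e. about 5.92
pred_level = cases[-1] + pred_diff    # undo the differencing
```

Because the model is fitted on the differenced series when $d = 1$, the predicted difference has to be added back to the last observed value to obtain a prediction on the original scale.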

EEMD-ARIMAX

To our knowledge, EEMD has not yet been used to predict time series with independent variables. The main idea behind EEMD-ARIMAX is to predict the time series of a dependent variable from the time series of independent variables. For this, we first decompose each time series of the independent variables ($X_1, \ldots, X_n$) and of the dependent variable by applying the EEMD method, creating $s$ levels of decomposition for each variable. Then, in each level of the decomposition, we use the ARIMAX method to predict the IMF related to the dependent variable, considering the IMFs of the independent variables at that level as the exogenous variables. We employ the same procedure to predict the time series of the residual values. Finally, by summing the predicted time series, we obtain the prediction for the original time series of the daily number of COVID-19 cases. The algorithm of the proposed EEMD-ARIMAX method can be described by steps 1–5:
1. Let $Y$ be the dependent variable under study, $X_1, \ldots, X_n$ the independent/predictor variables, $m$ the number of ensembles, and $s$ the number of IMFs that will be extracted from each time series.
2. Apply EEMD to decompose the time series of the dependent and independent variables individually, to obtain a set of $s$ IMFs and a time series Res in each decomposition.
3. Fit IMFs of the same “levels” using ARIMAX, meaning that the $j$-th IMF of the time series represented by the “Daily number of COVID-19 cases”, denoted here by $IMF_j^{Y}$, will be fitted by the $j$-th IMFs of the time series related to the meteorological/mobility variables $X_i$, denoted by $IMF_j^{X_i}$, for all $i = 1, \ldots, n$. The estimated $j$-th IMF is denoted by $\widehat{IMF}_j^{Y}$.
4. Denote the residual values obtained by applying EEMD to $X_1, \ldots, X_n$ by $Res^{X_1}, \ldots, Res^{X_n}$, respectively. Denote the residual value found by applying EEMD to $Y$ by $Res^{Y}$. Let $\widehat{Res}^{Y}$ be the estimated time series of $Res^{Y}$ through ARIMAX using $Res^{X_1}, \ldots, Res^{X_n}$ as exogenous variables.
5. Denote the fitted values of variable $Y$ by $\hat{Y}$. Thereby, $\hat{Y} = \sum_{j=1}^{s} \widehat{IMF}_j^{Y} + \widehat{Res}^{Y}$.
A flowchart of the proposed EEMD-ARIMAX method is presented in Fig. 2.

Fig. 2

Flowchart of the proposed EEMD-ARIMAX method.
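
The level-by-level structure of the method can be sketched as a small Python function. To keep the example self-contained, the decomposition and the per-level model are passed in as callables, and the demonstration uses deliberately crude stand-ins: a moving-average split instead of EEMD and a one-variable least-squares fit instead of ARIMAX. All names (`eemd_arimax_predict`, `toy_decompose`, `toy_fit_predict`) and the data are hypothetical.

```python
def eemd_arimax_predict(y, xs, decompose, fit_predict):
    """Decompose every series, predict each level of y from the same level
    of every predictor, then sum the level predictions."""
    y_levels = decompose(y)                  # IMFs followed by the residual
    x_levels = [decompose(x) for x in xs]    # same layout for each predictor
    total = [0.0] * len(y)
    for j, target in enumerate(y_levels):
        exog = [levels[j] for levels in x_levels]
        fitted = fit_predict(target, exog)
        total = [a + b for a, b in zip(total, fitted)]
    return total

def toy_decompose(series):
    """Stand-in for EEMD: [detail, trend] from a 3-point moving average."""
    n = len(series)
    trend = []
    for t in range(n):
        window = series[max(0, t - 1):min(n, t + 2)]
        trend.append(sum(window) / len(window))
    detail = [v - a for v, a in zip(series, trend)]
    return [detail, trend]

def toy_fit_predict(target, exog):
    """Stand-in for ARIMAX: least-squares slope on the first exogenous series."""
    x = exog[0]
    denom = sum(v * v for v in x)
    slope = sum(a * b for a, b in zip(x, target)) / denom if denom else 0.0
    return [slope * v for v in x]

mobility = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0]   # made-up predictor
cases = [2.0 * v for v in mobility]              # made-up target
fitted = eemd_arimax_predict(cases, [mobility], toy_decompose, toy_fit_predict)
```

In this toy setting the target is an exact multiple of the predictor at every level, so the summed level predictions reproduce the target; with real data each level would carry its own fitting error.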

Results and discussion

We apply a lag of 5 days to the number of new confirmed daily COVID-19 cases, since symptoms start about five days after someone is infected, at which point patients seek medical advice (He, Yi, & Zhu, 2020). All studies were performed considering the database with this lag. We used R statistical software (R Core Team, 2020) in all tests carried out for this paper.
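
One way to build such a lagged dataset is sketched below in Python. The direction of the shift is an assumption made for illustration: the covariate observed on day t is paired with the cases confirmed on day t + 5. The function name `align_with_lag` and the toy values are hypothetical.

```python
def align_with_lag(cases, covariate, lag=5):
    """Pair covariate[t] with cases[t + lag]; days without a partner are dropped."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = min(len(covariate), len(cases) - lag)
    return [(covariate[t], cases[t + lag]) for t in range(max(n, 0))]

cases = list(range(100, 110))      # 10 made-up days of confirmed cases
max_temp = list(range(20, 30))     # 10 made-up days of a covariate
pairs = align_with_lag(cases, max_temp)
print(len(pairs), pairs[0])  # 5 (20, 105)
```

Each lagged pair loses the last five covariate days and the first five case days, so a series of 10 days yields only 5 usable observations.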

Correlation analysis

We evaluate the pairwise correlation between the number of COVID-19 cases and meteorological/mobility variables using Spearman correlation. For more details about this measure, see Section 2.1 of the Supplementary Material. Tables 1 and 2 show the correlation values (columns “$\rho$”) between the number of COVID-19 cases and the meteorological and human mobility variables, respectively, for all Brazilian capitals, in the period considered. In addition, these tables present the p-value regarding the statistical significance of the corresponding correlations at a significance level $\alpha$. Therefore, if the p-value of the indicated correlation is less than or equal to $\alpha$, the correlation is said to be statistically significant.

Table 1

Spearman correlation between the number of COVID-19 cases and the meteorological data.

Table 2

Spearman correlation between the number of COVID-19 cases and the human mobility data.

We consider that two variables are correlated if $\rho > 0.3$ or $\rho < -0.3$. As stated before, if $\rho$ is positive, the variables are directly proportional; otherwise, they are inversely proportional. Therefore, on the one hand, we say that there is a positive correlation between a pair of variables when $\rho > 0.3$, meaning that there is evidence that the variables grow together. On the other hand, when the correlation is negative, i.e., $\rho < -0.3$, the analyzed pair of variables has an opposite behavior: the greater the values of one variable, the smaller the values of the other. For better visualization, we highlighted the positive correlations in dark gray and the negative correlations in light gray. According to the results, the number of COVID-19 cases and meteorological variables were correlated in 16 cities. In 11 of them, the correlated meteorological variable was the minimum temperature. The number of COVID-19 cases and meteorological variables were not correlated in any of the cities in the South region. The maximum temperature and the number of COVID-19 cases were correlated in all cities in the Midwest region. The correlations between the number of COVID-19 cases and minimum temperature were negative, indicating that the number of cases increases when the minimum temperature decreases. The same behavior was observed between the number of COVID-19 cases and maximum temperature, except in Teresina-PI and Palmas-TO. In these two cities, the relationship between the number of COVID-19 cases and maximum temperature was inversely proportional. Humidity and the number of daily COVID-19 cases are correlated in the following cities: Palmas-TO, Fortaleza-CE, Recife-PE, Teresina-PI, Brasília-DF, and Goiânia-GO. Particularly in Palmas-TO and Teresina-PI, the humidity and number of COVID-19 cases showed a strong correlation.
The average humidity of Palmas-TO was the lowest among the capitals of the North region. The average humidity of Teresina-PI was the second lowest among the capitals of the Northeast region. Since humidity is directly linked to temperature, these facts could explain the inversely proportional correlations between the number of COVID-19 cases and the maximum temperature in both cities. Among the 6 capitals that showed a correlation between the number of COVID-19 cases and humidity, Fortaleza-CE and Recife-PE presented correlations of 0.329 and 0.316, respectively. The other four capitals showed negative correlations. One can observe that the rainfall variable and the number of COVID-19 cases are not correlated in Fortaleza-CE and Recife-PE. In Palmas-TO, Teresina-PI, Brasília-DF, and Goiânia-GO, on the other hand, they were negatively correlated. It is known that meteorological data regarding temperature, humidity, and rainfall are related and, therefore, influence one another. In this study, however, we only consider the meteorological variables of each capital whose correlation with the number of COVID-19 cases was greater than 0.3 in absolute value. These variables are summarized in Table 3.

Table 3

Variables per Brazilian capital which showed some level of correlation with the number of COVID-19 cases and were considered in the proposed models.

Region	City-Federative unit	Meteorological variables	Mobility variables
North	Belém-PA	-	-
	Boa Vista-RR	-	RR, GP, TS, WO
	Macapá-AP	-	-
	Manaus-AM	-	-
	Palmas-TO	Rain, Max Temp, Min Temp, Hum	RR, GP, PA, TS, WO, RE
	Porto Velho-RO	Min Temp	-
	Rio Branco-AC	-	-
Northeast	Aracaju-SE	Max Temp, Min Temp	-
	Fortaleza-CE	Max Temp, Hum	RR, PA, TS, RE
	João Pessoa-PB	Max Temp, Min Temp	-
	Maceió-AL	Rain, Max Temp, Min Temp	RE
	Natal-RN	Max Temp	-
	Recife-PE	Max Temp, Hum	PA, RE
	Salvador-BA	Max Temp, Min Temp	-
	São Luis-MA	Rain, Max Temp	TS, RE
	Teresina-PI	Rain, Max Temp, Min Temp, Hum	RR, PA, TS, WO, RE
Midwest	Brasilia-DF	Rain, Min Temp, Hum	GP
	Campo Grande-MS	-	RR, GP, PA, TS, WO
	Cuiabá-MT	Max Temp, Min Temp	RR, GP, PA, TS, WO, RE
	Goiânia-GO	Rain, Hum	RR, GP, PA, TS, WO, RE
Southeast	Belo Horizonte-MG	-	WO
	Rio de Janeiro-RJ	Min Temp	-
	São Paulo-SP	Min Temp	-
	Vitória-ES	-	GP, WO
South	Curitiba-PR	-	-
	Florianópolis-SC	-	RR, GP, PA, TS, WO, RE
	Porto Alegre-RS	-	RR, GP, TS, WO, RE
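
The selection rule behind Table 3 (keep a variable when its Spearman coefficient with the case counts exceeds 0.3 in absolute value) can be sketched in plain Python as follows. The function names are hypothetical, ties are handled with average ranks, and the p-value test reported in Tables 1 and 2 is not reproduced here.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; returns 0.0 when either series is constant."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va and vb else 0.0


def spearman(a, b):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def select_variables(cases, candidates, threshold=0.3):
    """Keep the candidate series with |rho| above the threshold."""
    selected = {}
    for name, series in candidates.items():
        rho = spearman(cases, series)
        if abs(rho) > threshold:
            selected[name] = rho
    return selected
```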

As mentioned before, Table 2 shows the correlation between the mobility variables and the number of COVID-19 cases. The mobility variables and the corresponding cities in which they have a positive correlation with the number of COVID-19 cases are:

Retail and recreation: Boa Vista-RR, Palmas-TO, Teresina-PI, Campo Grande-MS, Goiânia-GO, Florianópolis-SC, Porto Alegre-RS;
Grocery and pharmacy: Boa Vista-RR, Palmas-TO, Teresina-PI, Brasília-DF, Campo Grande-MS, Goiânia-GO, Vitória-ES, Florianópolis-SC, Porto Alegre-RS;
Parks: Palmas-TO, Teresina-PI, Campo Grande-MS, Goiânia-GO, Florianópolis-SC;
Transit stations: Boa Vista-RR, Palmas-TO, Teresina-PI, Campo Grande-MS, Goiânia-GO, Florianópolis-SC, Porto Alegre-RS;
Workplaces: Boa Vista-RR, Palmas-TO, Teresina-PI, Campo Grande-MS, Goiânia-GO, Belo Horizonte-MG, Vitória-ES, Florianópolis-SC, Porto Alegre-RS;
Residential: Fortaleza-CE, Maceió-AL, Recife-PE, São Luis-MA, Cuiabá-MT.

On the one hand, the positive correlation between the mobility parameters and the number of COVID-19 cases, except for the Residential variable, shows that the increase in the number of COVID-19 cases is directly proportional to the rise in the populations’ mobility trends in traffic, pharmacies, work, parks, and retail. This means that the higher the mobility rate, the greater the number of cases. On the other hand, some cities showed a negative correlation between mobility variables and the number of COVID-19 cases. They are:

Retail and recreation: Fortaleza-CE, Cuiabá-MT;
Grocery and pharmacy: Cuiabá-MT;
Parks: Fortaleza-CE, Recife-PE, Cuiabá-MT;
Transit stations: Fortaleza-CE, São Luis-MA, Cuiabá-MT;
Workplaces: Cuiabá-MT;
Residential: Palmas-TO, Teresina-PI, Goiânia-GO, Florianópolis-SC, Porto Alegre-RS.

Therefore, for example, the negative correlation between Residential and daily COVID-19 cases means that the fewer people stayed at home, the greater the number of COVID-19 cases. The negative correlations between COVID-19 cases and meteorological parameters in Cuiabá-MT are due to a sequence of null values at the end of the series describing the number of COVID-19 cases. If these values were excluded, the correlation coefficient between these variables would be positive.

Analysis of the number of predicted cases

The EEMD-ARIMAX method was implemented in R using the “Rlibeemd” and “forecast” packages. We generated an ensemble of 125 noise-added time series for each variable, with the standard deviation of the Gaussian noise set to 1% of the standard deviation of the corresponding original time series. Table 4 shows the results of the EEMD-ARIMAX method for all Brazilian capitals. We compared EEMD-ARIMAX with the ARIMAX method in order to analyze the effect of EEMD on the prediction. For both methods and for each city, we present the widely employed mean error (ME), root mean squared error (RMSE), and mean absolute error (MAE) measures to describe the results of the predictions. For details about these measures, see Section 2.1 of the Supplementary Material. Column “City-Federative unit” shows the city-federative unit pair and, in parentheses, the (p, d, q) parameters used by ARIMAX to forecast the number of COVID-19 cases in that capital. These parameters were calibrated for each city using the auto.arima() function in R. The independent variables considered to predict the number of COVID-19 cases in each city are shown in Table 3 and follow the Spearman correlation coefficients shown in Tables 1 and 2.
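
For reference, the three error measures can be computed as in the short Python sketch below. It assumes the common convention error = observed − predicted, so a positive ME means the model under-predicts on average; the function name `forecast_errors` is hypothetical.

```python
from math import sqrt


def forecast_errors(observed, predicted):
    """Return ME, RMSE and MAE for two equally long series."""
    errors = [o - p for o, p in zip(observed, predicted)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,                        # signed bias
        "RMSE": sqrt(sum(e * e for e in errors) / n),  # penalizes large misses
        "MAE": sum(abs(e) for e in errors) / n,        # average absolute miss
    }
```

Because ME is signed, positive and negative errors cancel out, which is why the models are compared mainly through RMSE.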

Table 4

Results achieved by ARIMAX and EEMD-ARIMAX methods.

Region	City-Federative unit	ARIMAX			EEMD-ARIMAX
		ME	RMSE	MAE	ME	RMSE	MAE
North	Belém-PA (1,0,1)	3.213	139.121	100.688	−5.891	89.189	61.905
	Boa Vista-RR (0,1,1)	5.092	235.997	123.145	−2.248	159.335	97.518
	Macapá-AP (3,0,2)	0.229	185.697	79.845	−7.852	148.587	69.708
	Manaus-AM (2,1,3)	9.106	213.732	152.327	−12.994	142.140	102.373
	Palmas-TO (2,0,2)	−0.223	62.519	37.374	−0.168	53.240	34.604
	Porto Velho-RO (2,1,3)	6.074	186.198	104.206	0.266	162.095	98.116
	Rio Branco-AC (0,1,2)	1.088	46.482	28.736	−1.487	30.006	18.009
Northeast	Aracaju-SE (0,1,1)	1.413	163.284	91.061	−0.183	107.398	58.659
	Fortaleza-CE (1,0,1)	7.275	255.173	139.918	0.023	150.085	98.026
	João Pessoa-PB (3,0,2)	6.512	104.106	73.007	−0.629	61.758	42.498
	Maceió-AL (2,1,2)	0.646	89.629	54.596	0.010	61.337	41.793
	Natal-RN (1,0,3)	4.485	209.895	104.098	−12.078	183.497	99.912
	Recife-PE (3,0,2)	3.028	144.336	82.521	−6.969	114.020	69.529
	Salvador-BA (0,1,3)	3.456	329.488	197.839	−2.474	214.520	134.791
	São Luis-MA (2,0,3)	2.091	54.378	32.419	−2.704	32.342	19.566
	Teresina-PI (2,0,3)	0.703	78.607	59.522	−4.655	56.288	43.008
Midwest	Brasilia-DF (0,1,4)	5.992	272.017	173.067	3.756	176.154	111.602
	Campo Grande-MS (0,1,4)	4.305	130.909	65.215	−7.399	87.561	50.282
	Cuiabá-MT (1,0,4)	1.487	58.115	28.884	1.521	35.695	19.566
	Goiânia-GO (4,1,1)	7.194	262.247	165.356	−11.852	209.587	150.245
Southeast	Belo Horizonte-MG (2,1,3)	6.596	273.975	183.085	−1.762	186.489	123.697
	Rio de Janeiro-RJ (2,1,3)	12.532	444.872	294.399	−15.907	318.717	227.619
	São Paulo-SP (0,1,5)	18.479	988.765	660.780	4.203	775.817	494.870
	Vitória-ES (1,0,2)	3.598	62.274	42.943	1.440	41.234	27.729
South	Curitiba-PR (5,1,0)	0.843	127.775	76.223	−9.406	103.601	61.407
	Florianópolis-SC (0,1,3)	26.907	230.129	80.612	15.483	210.716	86.626
	Porto Alegre-RS (0,1,1)	25.704	373.925	140.315	−23.187	282.510	140.701

In all cities, the proposed decomposition method improved the predictions of the time series in terms of RMSE. The average RMSE of the predictions using only the ARIMAX method was 211.987, with a standard deviation of 186.335. Using the EEMD-ARIMAX method, the average RMSE was 155.330, with a standard deviation of 145.645. In terms of average RMSE, EEMD-ARIMAX therefore showed an improvement of 26.73% over ARIMAX. Section 2.2 of the Supplementary Material presents plots comparing the original time series with the values predicted by EEMD-ARIMAX in all Brazilian regions.
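
The reported percentage is the relative reduction of the average RMSE, which can be checked with a one-line Python function (the name `relative_improvement` is chosen here for illustration):

```python
def relative_improvement(baseline, proposed):
    """Percentage reduction of an error measure relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# Average RMSE of ARIMAX versus EEMD-ARIMAX, as reported for Table 4.
print(round(relative_improvement(211.987, 155.330), 2))  # 26.73
```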

Anomaly analysis

The data used in the case study contain several registration errors that may affect the accuracy of the prediction model. We used an anomaly detection strategy to identify whether there is a relationship between data errors and significant errors in the values predicted by EEMD-ARIMAX. The employed anomaly detection method uses the graph Fourier transform as a tool to analyze the daily variation in the number of COVID-19 cases in each region. Thus, it identifies days with potentially anomalous numbers of COVID-19 cases. Section 5.1 discusses the strategy adopted to define and quantify the model errors. Section 5.2 introduces the concept of anomaly adopted and a tool to highlight anomalies. Section 5.3 addresses the methodology employed to compare the errors of the model with the detected anomalies. Section 5.4 presents the strategy adopted to correct the anomalies and run the model again. In summary, the anomaly analysis shows that there is a direct relationship between the days when EEMD-ARIMAX significantly missed the prediction and the days when the anomaly detection strategy pointed to an abnormality. This indicates that the data errors affected the models’ effectiveness. After the data were normalized and corrected, the accuracy of EEMD-ARIMAX increased significantly.

Analyzing model errors

We analyzed the days for which the model significantly missed the number of cases in each city. The error made by the model was quantified by the difference between the observed and predicted number of cases, as shown in Eq. (2):

$$e_{it} = c_{it} - \hat{c}_{it} \qquad (2)$$

where $c_{it}$ and $\hat{c}_{it}$ are, respectively, the observed and predicted number of COVID-19 cases on day t in city i. An error $e_{it}$ is considered significant when it exceeds the threshold TD computed from the vector formed by the elements $e_{it}$ of city i; TD is defined by Eq. (3) in terms of the arithmetic mean and the standard deviation of that vector. Fig. 3 illustrates the values of $e_{it}$ for the city of Goiânia-GO. The threshold value is highlighted in red. Therefore, every day t whose $e_{it}$ is above the red line corresponds to a day significantly mispredicted by the model.

Fig. 3

Model errors of Goiânia - GO and the significance threshold.
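
A thresholding rule of this kind can be sketched in Python as follows. The exact form of Eq. (3) is not reproduced in this text, so the sketch assumes a threshold of the form mean + k × standard deviation; the multiplier `k`, its default value, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def significant_days(values, k=2.0):
    """Indices whose value exceeds mean + k * std of the whole vector.

    Assumption: Eq. (3) has the form mean + k * std; k = 2.0 is a
    placeholder, not a value taken from the paper.
    """
    threshold = mean(values) + k * pstdev(values)
    return [t for t, v in enumerate(values) if v > threshold]
```

For a mostly flat error vector with a single large miss, for example `[1, 1, 1, 1, 1, 1, 1, 1, 1, 10]`, only the last day is flagged.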

Analyzing data anomalies

A spectral anomaly detection strategy was adopted to detect days when the recorded number of daily COVID-19 cases was potentially anomalous. While the model errors are identified by comparing the predicted values with the observed values, the anomaly detection strategy analyzes the daily variation in the number of cases, taking the distance between cities into account, to identify potentially anomalous variations. For example, if a city has a slight variation over two days in the number of COVID-19 cases, we expect nearby cities to have a similar variation. Similarly, if the number of cases in a city increases significantly from one day to the next, we expect nearby cities also to show a relative increase in the number of cases. To perform this analysis, we model a complete and weighted network where a node represents a city, and the weight of the edge $(i, j)$ is the Euclidean distance between cities i and j. Each node i carries the daily variation $s_{it}$ in the number of cases in city i, with the daily variation defined by Eq. (4):

$$s_{it} = \left| 1 - \frac{c_{it}}{c_{i(t-1)}} \right| \qquad (4)$$

A signal $S_t$ contains the values $s_{it}$ for every city i in the dataset. We calculate the spectrum of the signal, $\hat{S}_t$, using the Fourier transform for graphs (Sandryhaila & Moura, 2013). In graph Fourier analysis, the graph Laplacian eigenvectors associated with small eigenvalues vary slowly across the graph, whereas eigenvectors associated with larger eigenvalues oscillate more rapidly (Ortega, Frossard, Kovacevic, Moura, & Vandergheynst, 2018). This means that if two vertices are connected by an edge with a large weight, the values of the eigenvector at those locations are likely to be similar. This concept is then used to define low and high frequencies for signals indexed by graphs. According to this definition, abrupt oscillations are concentrated at the high frequencies of the signal spectrum.
To highlight abrupt variations and expose anomalies, we accentuate the magnitude of the high frequencies of the spectrum $\hat{S}_t$ to make anomalous variations more evident, generating a new spectrum $\hat{R}_t$. We apply the inverse Fourier transform to $\hat{R}_t$ to obtain a new signal, $R_t$, that contains the accentuated variation in the number of cases in each city. The intuition behind this operation is that, if the variation in the number of cases in a city i is normal, the accentuated value $r_{it}$, the i-th element of vector $R_t$, is expected to remain close to the observed variation $s_{it}$. On the other hand, if the city has an anomalous variation, $r_{it}$ is expected to depart markedly from $s_{it}$. Fig. 4 shows a graphic visualization of the normal variation and the accentuated variation.

Fig. 4

Observed variation versus accentuated variation in all 27 capitals of Brazil on August 17, 2020.
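
To make the accentuation step concrete, the sketch below builds a small complete graph, computes a graph Fourier transform, amplifies the high-frequency coefficients, and transforms back. Several choices here are assumptions made for illustration and are not taken from the paper: the combinatorial Laplacian L = D − W is used as the Fourier basis, the gain 1 + boost × λ/λ_max is a hypothetical accentuation rule, and a plain Jacobi eigen-solver is included only to keep the example free of external dependencies.

```python
import math


def jacobi_eigh(a, sweeps=50, tol=1e-18):
    """Eigen-decomposition of a symmetric matrix by cyclic Jacobi rotations.
    Returns (eigenvalues, matrix whose columns are eigenvectors), ascending."""
    n = len(a)
    a = [row[:] for row in a]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # right-multiply A by the rotation
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # left-multiply by its transpose
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(n):  # accumulate eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p], v[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    order = sorted(range(n), key=lambda i: a[i][i])
    return [a[i][i] for i in order], [[v[k][i] for i in order] for k in range(n)]


def laplacian(coords):
    """Laplacian of the complete graph weighted by Euclidean distance."""
    n = len(coords)
    w = [[math.dist(coords[i], coords[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    return [[sum(w[i]) if i == j else -w[i][j] for j in range(n)] for i in range(n)]


def accentuate(signal, lap, boost=1.0):
    """Amplify high graph frequencies of `signal` (hypothetical gain rule)."""
    vals, vecs = jacobi_eigh(lap)
    n = len(signal)
    spectrum = [sum(vecs[k][i] * signal[k] for k in range(n)) for i in range(n)]
    lam_max = max(vals) or 1.0
    boosted = [(1.0 + boost * max(lam, 0.0) / lam_max) * c
               for lam, c in zip(vals, spectrum)]
    return [sum(vecs[k][i] * boosted[i] for i in range(n)) for k in range(n)]
```

With `boost=0.0` the function reduces to a forward and inverse transform and returns the input signal, which is a convenient sanity check before experimenting with other gains.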

The threshold used to determine whether a variation is anomalous is calculated in the same way as for the errors, as defined in Eq. (3), now applied to the accentuated variations. Fig. 5 illustrates the accentuated variations $r_{it}$ of a given city i, together with the corresponding threshold. It is worth noting the similarity between Figs. 3 and 5, which suggests a direct relation between the days on which EEMD-ARIMAX significantly missed the prediction and the days on which the accentuator pointed out potentially anomalous variations.

Fig. 5

Accentuated daily variations in the number of COVID-19 cases in Goiânia - GO and the anomaly threshold.

Comparing errors and anomalies

As presented in Section 5.2, Figs. 3 and 5 indicate that there is a direct relationship between the days when EEMD-ARIMAX made a significant error and the days whose variation in the number of cases was interpreted as potentially anomalous. We compared the model’s errors with the anomalous variations to establish a quantifier that indicates whether there is, in fact, a direct relationship between them. For each city, two sets were defined: set CE, containing the days on which the model made a significant error; and set CA, containing the days whose variation was detected as potentially anomalous. To quantify the relationship between CE and CA, we adopted the following criterion: if a day $t$ belongs to CE and CA contains a day adjacent to $t$, then the error made by the model on day t and the anomalous variation that occurred on the days adjacent to t are directly related. Fig. 6 shows the percentage of days on which the model made significant mispredictions that are directly related to a day with a potentially anomalous variation. On average, more than 60% of the days when the model was wrong were detected as anomalous, as indicated by the red line, which represents the average.

Fig. 6

Percentage of days which EEMD-ARIMAX significantly mispredicted and corresponded to an anomaly.
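
The matching criterion between CE and CA can be sketched as a small Python function. The window of one day on either side of t, and the choice to count an anomaly on day t itself as a match, are assumptions made for illustration; the function name is hypothetical.

```python
def related_error_share(ce, ca, window=1):
    """Percentage of days in CE with an anomalous day in CA within `window` days.

    Assumption: a day t matches when some day a in CA satisfies |t - a| <= window,
    which also counts an anomaly on day t itself.
    """
    if not ce:
        return 0.0
    hits = sum(1 for t in ce if any(abs(t - a) <= window for a in ca))
    return 100.0 * hits / len(ce)
```

For example, with CE = {3, 10, 20} and CA = {4, 11, 30}, two of the three error days have an anomaly on an adjacent day, giving about 66.7%.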

This result indicates that EEMD-ARIMAX was affected by errors in the data and that the model’s errors are directly related to anomalous variations. To overcome this problem and correct the anomalies, we adopted a spectral strategy for removing anomalies, also based on the Fourier transform. Section 5.4 presents the methodology employed to correct the data anomalies.

Normalizing Data

As discussed before, abrupt variations are concentrated in the high frequencies. To detect the anomalies, the anomaly detection strategy accentuated the magnitude of the high frequencies. To correct the anomalies, we adopted a strategy that does the opposite: a low-pass filter that decreases the magnitude of the high frequencies, attenuating abrupt variations. While the accentuator was applied to the signal that carried the daily variation in the number of cases, the low-pass filter was applied to the signal formed by the number of cases on each day, that is, the signal C whose elements $c_{it}$ are the number of cases in city i on day t, to generate a filtered signal. Fig. 7 compares an original signal and a filtered signal. In general, the oscillation of the signal is mitigated, which yields a more reliable signal.

Fig. 7

Observed Daily Cases versus Normalized Daily Cases.

By applying both the ARIMAX and the EEMD-ARIMAX methods to the normalized data, we obtained the results shown in Table 5, which are presented in the same format as Table 4. The average RMSE of the forecasts using only the ARIMAX method was 142.981, with a standard deviation of 122.703. The average RMSE of the EEMD-ARIMAX method was 107.664, with a standard deviation of 99.917. Therefore, in terms of average RMSE, EEMD-ARIMAX was 24.70% better than ARIMAX.

Table 5

Results achieved by ARIMAX and EEMD-ARIMAX using normalized data.

Region	City-Federative unit	ARIMAX			EEMD-ARIMAX
		ME	RMSE	MAE	ME	RMSE	MAE
North	Belém-PA (4,1,1)	3.496	88.611	64.383	0.823	55.300	40.659
	Boa Vista-RR (1,0,1)	7.276	228.034	115.729	−15.266	165.307	100.733
	Macapá-AP (2,1,3)	2.783	129.141	56.598	−0.702	79.429	39.340
	Manaus-AM (4,1,1)	0.796	18.994	12.861	0.005	12.406	8.771
	Palmas-TO (3,1,2)	1.301	49.561	35.493	1.416	35.471	26.719
	Porto Velho-RO (2,1,3)	−0.193	151.779	87.690	−0.024	131.221	19.299
	Rio Branco-AC (2,0,2)	0.287	44.530	29.891	−1.546	28.245	17.304
Northeast	Aracaju-SE (0,1,1)	2.672	103.176	65.999	−2.111	75.438	48.003
	Fortaleza-CE (1,0,1)	7.836	179.518	103.165	0.758	122.469	76.730
	João Pessoa-PB (0,1,4)	2.929	83.694	57.921	0.827	45.635	32.355
	Maceió-AL (0,1,5)	1.645	65.674	43.626	−1.429	42.589	29.490
	Natal-RN (0,1,3)	2.701	155.989	83.457	−12.897	130.216	72.329
	Recife-PE (0,1,5)	2.483	105.509	63.195	−4.679	80.054	50.157
	Salvador-BA (4,1,1)	3.053	195.621	121.726	−7.470	155.119	102.551
	São Luis-MA (0,1,4)	1.747	42.258	31.661	1.173	25.417	18.514
	Teresina-PI (3,1,2)	2.676	59.647	42.921	2.135	47.579	35.588
Midwest	Brasilia-DF (3,1,2)	4.041	135.460	87.941	−1.909	91.309	57.297
	Campo Grande-MS (3,1,2)	3.944	90.317	55.276	−0.974	62.449	40.751
	Cuiabá-MT (2,1,3)	1.410	54.245	37.052	−0.109	35.409	24.553
	Goiânia-GO (3,1,2)	3.249	136.126	88.814	−11.139	100.058	73.351
Southeast	Belo Horizonte-MG (3,1,2)	4.806	156.051	109.193	1.246	114.420	81.809
	Rio de Janeiro-RJ (0,1,5)	8.852	292.669	192.758	−22.889	230.824	159.742
	São Paulo-SP (0,1,5)	12.976	628.305	420.853	−57.823	497.835	314.477
	Vitória-ES (3,1,2)	1.569	56.394	41.890	3.439	44.391	32.097
South	Curitiba-PR (2,1,3)	2.342	99.194	61.001	−7.094	80.935	51.362
	Florianópolis-SC (0,1,3)	18.719	184.337	76.301	−8.676	144.855	68.046
	Porto Alegre-RS (0,1,1)	25.068	325.657	129.257	−2.233	272.542	150.659

There was an improvement of 30.69% in the average RMSE of the predictions by EEMD-ARIMAX when normalized data were used. Fig. 8 shows all the RMSEs obtained by the EEMD-ARIMAX method using non-normalized (black) and normalized (red) data.

Fig. 8

RMSE of original data versus RMSE of normalized data.
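
As a consistency check, the average RMSE of EEMD-ARIMAX on the normalized data can be recomputed from the EEMD-ARIMAX RMSE column of Table 5 and compared with the average of 155.330 reported for the original data. The short Python sketch below does this; the variable names are chosen here for illustration.

```python
# EEMD-ARIMAX RMSE column of Table 5, in table order (27 capitals).
rmse_normalized = [
    55.300, 165.307, 79.429, 12.406, 35.471, 131.221, 28.245,    # North
    75.438, 122.469, 45.635, 42.589, 130.216, 80.054, 155.119,
    25.417, 47.579,                                              # Northeast
    91.309, 62.449, 35.409, 100.058,                             # Midwest
    114.420, 230.824, 497.835, 44.391,                           # Southeast
    80.935, 144.855, 272.542,                                    # South
]

mean_rmse = sum(rmse_normalized) / len(rmse_normalized)
rmse_original = 155.330  # average EEMD-ARIMAX RMSE reported for Table 4
improvement = 100.0 * (rmse_original - mean_rmse) / rmse_original

print(round(mean_rmse, 3))    # 107.664
print(round(improvement, 2))  # 30.69
```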

Final remarks and future works

As stated by Fildes, Nikolopoulos, Crone, and Syntetos (2008), contributions to forecasting are normally achieved by developing new methods that establish a connection between their effectiveness and the context they are applied to. The contributions offered by this paper meet this purpose, since it is case-oriented and we examine the system as a whole, identifying patterns in the time series as well as anomalies to draw conclusions about the correlation between meteorological and mobility variables. The novel method is an EEMD-ARIMAX hybrid, which uses an intelligent strategy to detect anomalies in data after the method has provided a forecast. The analysis of the original data indicated that the correlation between the number of COVID-19 cases and the meteorological/human mobility variables depends on the region in which the Brazilian city under study is located. The prediction methods ARIMAX and EEMD-ARIMAX achieved an average RMSE of 211.99 and 155.33, respectively. These results indicate that the decomposition method improved the prediction of COVID-19 cases. Because some data anomalies were observed, e.g., very high peaks and negative numbers of cases, we proceeded with an anomaly study to normalize the data. When the ARIMAX and EEMD-ARIMAX methods were applied to the normalized data, an average RMSE of 142.98 and 107.66 was found, respectively, confirming the positive effect of the data decomposition on the prediction of COVID-19 cases. Therefore, anomaly detection played a key role in effectively fitting the COVID-19 curve, as it repaired data deficiencies of the kind found in the vast majority of real-world applications. Future studies may involve the use of other prediction methods, including deep learning strategies.
We also suggest the use of optimization algorithms, such as nature-inspired metaheuristics (Kar, 2016; Abualigah, 2021) and the sine and cosine algorithm (Mirjalili, 2016), either to identify approximations of the local maxima and minima or to optimize the decision on which points to interpolate in EEMD-ARIMAX. Optimization algorithms can also be used to find the best values of p, d, and q in the ARIMAX model for each extracted IMF, or to determine a linear regression model as described by Makade et al. (2020), who used particle swarm optimization for this task. Another direction for future work would be to determine anomalies in the extracted IMFs instead of in the original time series.

CRediT authorship contribution statement

Tiago Tiburcio da Silva: Conceptualization, Methodology, Software, Data curation, Writing - original draft. Rodrigo Francisquini: Methodology, Software, Data curation, Writing - original draft. Mariá C.V. Nascimento: Supervision, Writing - Review & Editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Table A.6

Part 1 of the list of symbols and notations used in this paper.

Symbol	Description
$X_t$, $W_t$	time series
$m$	number of ensembles in EEMD
$s$	number of IMFs to be extracted from $X_t$ or $W_t$
$k$	variable that specifies an ensemble in a given iteration of EEMD
$n$	number of meteorological and human mobility variables
$Y_1, \ldots, Y_n$	meteorological and human mobility variables
$Z_t$	time series obtained from $W_t$
$\sigma_{original}$	standard deviation of $X_t$
$\sigma_{noise}$	standard deviation of $Z_t$
$\mu$	a relatively small number which relates $\sigma_{noise}$ and $\sigma_{original}$
$e_{\max}^k$	upper envelope of $X_t$
$e_{\min}^k$	lower envelope of $X_t$
$m_t^k$	time series corresponding to the average of $e_{\max}^k$ and $e_{\min}^k$
$d_t^k$	time series obtained by the operation $d_t^k = Z_t - m_t^k$
$\mathrm{IMF}_j^k$	j-th IMF of ensemble k
$\mathrm{IMF}^j_{X_t}$	j-th IMF obtained from the time series $X_t$
$\mathrm{IMF}^j_{Y_t}$	j-th IMF obtained from the time series $Y_t$
$\mathrm{RES}_{Y_j}$	residual values found by applying EEMD to $Y_j$
$\widehat{\mathrm{IMF}}^j$	j-th IMF of the estimated time series
$\widehat{\mathrm{Res}}$	estimated residual values
$\hat{X}_t$	time series $\hat{X}_t = \widehat{\mathrm{IMF}}^1 + \cdots + \widehat{\mathrm{IMF}}^s + \widehat{\mathrm{Res}}$
$\rho$	Spearman correlation coefficient

Table A.7

Part 2 of the list of symbols and notations used in this paper.

Symbol	Description
$p$	number of autoregressive terms in ARIMAX
$d$	number of nonseasonal differences needed for stationarity in ARIMAX
$q$	number of lagged forecast errors in ARIMAX
$\eta$	constant of the ARIMAX
$\phi_i$	i-th element of parameter $\phi$ in ARIMAX, for $i = 1, \ldots, p$
$\theta_j$	j-th element of parameter $\theta$ in ARIMAX, for $j = 1, \ldots, q$
$\zeta_l$	l-th element of parameter $\zeta$ in ARIMAX, for $l = 1, \ldots, n$
$e_{t-j}$	error terms of the ARIMAX, for $j = 1, \ldots, q$
$n_c$	number of days in the dataset
$c_{it}$	observed number of COVID-19 cases on day t in city i
$\hat{c}_{it}$	predicted number of COVID-19 cases on day t in city i
$e_{it}$	error in the prediction of the number of COVID-19 cases on day t in city i
$s_{it}$	absolute value of the difference between 1 and $c_{it} / c_{i(t-1)}$
$S_t$	set $S_t = \{s_{it}, \forall i\}$
$\hat{S}_t$	spectrum of $S_t$
$\lambda_t$	eigenvalues of graph Laplacian
$R_t$	time series obtained by applying the inverse Fourier transform to $\hat{R}_t$
$r_{it}$	i-th element of vector $R_t$
