
Air quality assessment and pollution forecasting using artificial neural networks in Metropolitan Lima-Peru.

Chardin Hoyos Cordova¹, Manuel Niño Lopez Portocarrero¹, Rodrigo Salas², Romina Torres³, Paulo Canas Rodrigues⁴, Javier Linkolk López-Gonzales^5,6.

Abstract

The prediction of air pollution is of great importance in highly populated areas because it directly impacts both the management of the city's economic activity and the health of its inhabitants. This work evaluates and predicts the spatio-temporal behavior of air quality in Metropolitan Lima, Peru, using artificial neural networks. The conventional feedforward backpropagation network known as the Multilayer Perceptron (MLP) and the recurrent artificial neural network known as the Long Short-Term Memory network (LSTM) were implemented for the hourly prediction of PM10 based on the past values of this pollutant and three meteorological variables obtained from five monitoring stations. The models were validated using two schemes: the Hold-Out and the Blocked Nested Cross-Validation (BNCV). The simulation results show that periods of moderate PM10 concentration are predicted with high precision. For periods of high contamination, however, the performance of both the MLP and the LSTM diminished. On the other hand, the prediction performance improved slightly when the models were trained and validated with the BNCV scheme. The simulation results showed that the models performed well for the CDM, CRB, and SMP monitoring stations, which are characterized by a moderate to low level of contamination. However, the results show the difficulty of predicting this contaminant at the stations that present critical contamination episodes, such as ATE and HCH. In conclusion, the LSTM recurrent artificial neural networks with BNCV adapt more precisely to critical pollution episodes and have better predictive performance for this type of environmental data.


Year: 2021 PMID: 34930975 PMCID: PMC8688545 DOI: 10.1038/s41598-021-03650-9

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

Introduction

The World Health Organization (WHO) reported that air pollution causes 4.2 million premature deaths per year in cities and rural areas around the world[1]. The US Environmental Protection Agency[2] mentions that one of the pollutants with the most significant negative impact on public health is particulate matter with a diameter of less than ten micrometers (PM10), because it can easily enter the respiratory tract and cause severe damage to health. Valdivia and Pacsi[3] report that Metropolitan Lima (LIM) is vulnerable to high concentrations of PM10 due to its accelerated industrial and economic growth and its large population, as it is home to 29% of the total Peruvian population[4]. To mitigate the damage caused by PM10 to public health, the WHO established concentration thresholds intended to keep adverse health effects to a minimum[5]. Several countries have issued laws to regulate PM10 concentrations and air quality in general[6], for example Peru through the Ministry of the Environment[7] and the United States through the Environmental Protection Agency (EPA)[8]. In recent years, various forecasting methodologies have been adapted and developed to understand how pollutants behave in the air at the molecular level, simulating diffusion and dispersion patterns based on the size and type of the molecule. However, the predictions of these models tend to have rather low precision[9,10]. Examples of such models are the Community Multiscale Air Quality model and the Weather Research and Forecasting model coupled with Chemistry developed in Chen et al.[11] and Saide et al.[12], respectively, which are used to forecast air quality in urban areas. On the other hand, methods based on statistical modeling techniques, such as Artificial Neural Networks (ANNs), tend to be more appropriate for modeling and forecasting air quality. 
These models have been widely used to forecast time series and have been applied to environmental data such as particulate matter in different countries[13,14]. Several studies have focused on applying recurrent neural networks to forecast air quality in large cities. For instance, Guarnaccia et al.[15] reported that predicting air quality with high accuracy can be problematic. Accurate forecasting is becoming increasingly important because it provides information that helps to prevent critical pollution episodes and reduce human exposure to these contaminants[13,16,17]. However, there is a limited number of studies in the context of Lima, Peru, which is one of the cities with the highest pollution levels in South America[18-20]. For instance, Herrera and Trinidad[21] used neural networks to predict pollutant concentrations in the Carabayllo district of Lima, with good forecasting performance. Salas et al.[22] developed a NARX model using artificial neural networks to predict the pollutant in Santiago, Chile. Athira et al.[23] aimed at forecasting three days ahead and at comparing the performance of the standard LSTM, GRU, and RNN models, concluding that all three models showed good performance for out-of-sample forecasting. Lima is considered to be one of the most polluted cities in Latin America in terms of particulate matter. Hence the need arises for sophisticated environmental management instruments that make more precise predictions using cutting-edge methodologies, such as deep learning algorithms, and that support decision-making to establish mitigation and prevention policies. In addition, such predictions allow the population to avoid being exposed to high concentrations of PM10. For this reason, this study aims to assess the air quality of Lima in order to understand its behavior and the possible causes and factors that favor pollution. 
Subsequently, we applied the Multilayer Perceptron (MLP) and the Long Short-Term Memory (LSTM) models to forecast PM10 concentrations, where the models were evaluated under two validation schemes: the Hold-Out (HO) and the Blocked Nested Cross-Validation (BNCV). Our contributions are summarized below:

In this study, we have implemented artificial neural networks to model time series data collected from five meteorological and air quality monitoring stations in Lima, Peru. The monitoring stations are ATE, Campo de Marte (CDM), Carabayllo (CRB), Huachipa (HCH) and San Martin de Porres (SMP). We have investigated the geographical and meteorological divergence of the forecast results from the five air quality monitoring areas in LIM using data collected over two years. The proposed time series forecasting model based on the MLP and LSTM neural networks efficiently predicted one-hour-ahead PM10 concentrations. The prediction performances of the five stations were compared. According to the literature review, this study is the first to use deep learning algorithms to predict air quality (PM10) in LIM. We have focused the study on LIM because its air pollution has worsened in recent years. The main reasons for this change are unsustainable population growth, high industrial activity, and the accelerated growth of the automobile fleet. These factors make it challenging to predict pollution concentrations.

The remainder of the paper is structured as follows: Section “Materials and methods” presents the developed methodology based on an exploratory study described in two phases. In Section 3, we present the main results and their discussion. Finally, in Section 4, we provide the main conclusions and outline some future work.

Materials and methods

In this work, we follow the Knowledge Discovery from Databases (KDD) methodology to obtain relevant information for air quality management decision-making. The main goal of the KDD is to extract implicit, previously unknown, and potentially helpful information[24] from raw data stored in databases. Therefore, the resulting models can predict, e.g., one-hour ahead, the air quality and support the city’s management decision-making (see Fig. 1).

Figure 1

Knowledge Discovery from Databases (KDD) methodology used for Air Quality Assessment and Pollution Forecasting.

The KDD methodology has the following stages: (a) Phenomena Understanding; (b) Data Understanding; (c) Data Preparation; (d) Modeling; (e) Evaluation; and (f) Selection/Interpretation. In the following subsections, we explain each stage of the process.

Phenomena Understanding

In this first stage, we contextualize the contamination phenomenon concerning the PM10 concentrations at the five Lima monitoring stations. The main focus is to predict air pollution to support decision-making related to establishing pollution mitigation policies. For this, we use both the MLP and the LSTM as computational statistical methods for prediction. Lima is the capital of the Republic of Peru. It is located in the center of the western side of the South American continent in the W and S and, together with its neighbor, the constitutional province of Callao, forms a populated and extensive metropolis with 10,628,470 inhabitants and an area of [25,26]. The average relative humidity (temperature) in the summer (December–March) ranges from 65–68% (24 °C–26 °C) in the mornings, while at night the values fluctuate between 87–90% (18 °C–20 °C). In the winter (June–September), the average daytime relative humidity (temperature) ranges between 85–87% (18 °C–19 °C) and at night it fluctuates between 90–92% (18 °C–19 °C). The average annual precipitation is 10 mm. On the other hand, the average altitudes reached by the thermal inversion in summer and winter are approximately 500 and 1500 m above sea level, respectively[27,28].

Data understanding

Lima has ten air quality monitoring stations located in the constitutional province of Callao and the north, south, east, and center of Lima. The data used comprise hourly observations from January 1st, 2017, to December 31st, 2018, and include three meteorological variables and the concentration of particulate matter PM10. The latter is considered to be an agent that, when released into the environment, causes damage to ecosystems and living beings[29,30]. For this study, we considered the hourly data recorded at five air quality monitoring stations (see Fig. 2), which are managed by the National Service of Meteorology and Hydrology of Peru (SENAMHI). Table 1 shows the considered variables and their units of measurement.

Figure 2

Map with the study area and the locations of the Lima air quality monitoring stations: ATE, Campo de Marte (CDM), Carabayllo (CRB), Huachipa (HCH) and San Martin de Porres (SMP).

Table 1

Pollutant and weather variables used in this study, and their units of measurement.

Variable	Unit of measurement
PM10	μg/m³
Temperature	°C
Relative humidity	%
Wind speed	m/s
Wind direction	Degrees (°)

When considering environmental data, such as PM10 concentrations, from different locations, preliminary spatio-temporal visualization studies are of great use to better understand the behavior of the meteorological variables, the topography of the area, and the pollutants[31].

Data preparation

This stage is very relevant because it precedes the modeling stage. The preparation of the data had several steps. First, we addressed the problem of missing data. The treatment was performed with the MICE library. This library performs multiple imputation using the Fully Conditional Specification[32] and requires a separate univariate imputation method to be specified for each incomplete variable. In this context, we used predictive mean matching, a versatile semiparametric method focused on continuous data, in which each imputed value matches one of the observed values of the variable. The data imputation was performed for each of the five stations, with a percentage of missing data below 25%.

The data from the monitoring stations consist of a sequence of observed values $x_t$ recorded at specific times t. In this case, the time series is collected at hourly intervals. After the data imputation, we normalize all the observations to the range [0,1] as follows:

$$x'_t = \frac{x_t - x_{\min}}{x_{\max} - x_{\min}} \qquad (1)$$

where $x_{\min}$ and $x_{\max}$ are the minimum and maximum observed values. Moreover, the time series is decomposed into the trend, seasonal, and irregular components following an additive model (the cyclic component is omitted in this work):

$$x_t = T_t + S_t + I_t \qquad (2)$$

The trend component $T_t$ at time t reflects the long-term progression of the series, which could be linear or non-linear. The seasonal component $S_t$ at time t reflects the seasonal variation. The irregular component (or “noise”) $I_t$ at time t describes the random and irregular influences. In some cases, the time series has a cyclic component that reflects repeated but non-periodic fluctuations. The main idea of applying this decomposition is to separate the deterministic and the random components, so that a forecasting model can be obtained using the deterministic part[33,34]. In this article, we have used the method implemented in Statsmodels for Python[35], where a centered moving average filter is applied to the time series.
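The two preparation steps described above, rescaling to [0,1] and an additive decomposition based on a centered moving average, can be sketched in plain Python as follows. This is only an illustration and not the Statsmodels routine used in the study: the function names, the half-weight treatment of the end points for even periods, and the synthetic series are assumptions.

```python
from statistics import mean


def min_max_normalize(values):
    """Rescale a sequence linearly so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return [(v - lo) / (hi - lo) for v in values]


def centered_moving_average(values, period):
    """Trend estimate; entries are None where the window does not fit."""
    n, half = len(values), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = values[t - half : t + half + 1]
        if period % 2:  # odd period: plain average of `period` points
            trend[t] = sum(window) / period
        else:  # even period: end points get half weight
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
            trend[t] = total / period
    return trend


def additive_decompose(values, period):
    """Split a series into trend, seasonal and irregular parts (x = T + S + I)."""
    n = len(values)
    trend = centered_moving_average(values, period)
    # Average the detrended values at each position within the period.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(values[t] - trend[t])
    raw = [mean(b) for b in buckets]
    offset = mean(raw)  # center the seasonal pattern around zero
    pattern = [s - offset for s in raw]
    seasonal = [pattern[t % period] for t in range(n)]
    irregular = [
        None if trend[t] is None else values[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, irregular


# Synthetic example: linear trend plus a period-4 pattern that sums to zero.
pattern4 = [2.0, -1.0, -3.0, 2.0]
series = [10.0 + 0.5 * t + pattern4[t % 4] for t in range(12)]
scaled = min_max_normalize(series)
trend, seasonal, irregular = additive_decompose(series, period=4)
```

For hourly data, a period of 24 would be the natural choice for a daily cycle, although the period used in the study is not stated in the text.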

Modeling using artificial neural networks

Artificial Neural Networks have received a great deal of attention in engineering and science. Inspired by the study of brain architecture, ANNs represent a class of non-linear models capable of learning from data[36]. The essential features of an ANN are the basic processing elements referred to as neurons or nodes, the network architecture describing the connections between nodes, and the training algorithm used to estimate the values of the network parameters. Researchers see ANNs as either highly parameterized models or semiparametric structures[36]. ANNs can be considered as hypotheses of the parametric form $h(\cdot; \theta)$, where the hypothesis h is indexed by the vector of parameters $\theta$. The learning process consists of estimating the value of the vector of parameters $\theta$ to adapt the learner h to perform a particular task.

Machine Learning and Deep Learning methods have been successfully applied to time series forecasting[37-42]. For instance, recurrent artificial neural networks (RNNs) are dynamic models frequently used for processing sequences of real data step by step, predicting what comes next. They are applied in many domains, such as the prediction of pollutants[43]. It is known that RNNs are challenging to train when there are long-term dependencies in the data, which led to the development of models such as the LSTM that have been successfully applied in time series forecasting[44].

The Multilayer Perceptron model consists of a set of elementary processing elements called neurons[36,45-48]. These units are organized in an architecture with three layers: input, hidden, and output. The neurons of one layer are linked to the neurons of the subsequent layer. Figure 3 illustrates the architecture of this artificial neural network with one hidden layer. The non-linear function $g(\mathbf{x}; \theta)$ represents the output of the model, where $\mathbf{x}$ is the input signal and $\theta$ is its parameter vector. For a three-layer feedforward artificial neural network (FANN) with one hidden layer, the k-th output is given by

$$g_k(\mathbf{x}; \theta) = f_2\left( \sum_{j=1}^{\lambda} w^{[2]}_{kj}\, f_1\left( \sum_{i=1}^{d} w^{[1]}_{ji} x_i + w^{[1]}_{j0} \right) + w^{[2]}_{k0} \right) \qquad (3)$$

where $\lambda$ is the number of hidden neurons, d is the number of inputs, $f_1$ and $f_2$ are the activation (transfer) functions of the hidden and output neurons, and the weights $w^{[1]}$ and $w^{[2]}$ (with $w^{[1]}_{j0}$ and $w^{[2]}_{k0}$ acting as biases) are the components of $\theta$. An important factor in the specification of neural models is the choice of the activation function. These can be any non-linear functions as long as they are continuous, bounded, and differentiable. The transfer function of the hidden neurons should be non-linear, while for the output neurons it can be either linear or non-linear. One of the most used functions is the sigmoid:

$$f(z) = \frac{1}{1 + e^{-z}} \qquad (4)$$
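As an illustration of the one-hidden-layer computation described above, the following sketch evaluates a forward pass with sigmoid hidden neurons and a linear output layer. The function and parameter names are hypothetical, and the choice of a linear output is an assumption that the text permits but does not confirm for the models in the study.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_biases):
    """Forward pass of a three-layer perceptron (input, one hidden, output).

    hidden_weights: one row of input weights per hidden neuron
    output_weights: one row of hidden-neuron weights per output neuron
    """
    # Hidden layer: weighted sum of the inputs plus bias, then sigmoid.
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: weighted sum of hidden activations plus bias (linear here).
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(output_weights, output_biases)
    ]


# Hypothetical example: three normalized inputs, two hidden neurons, one output.
x = [0.2, 0.4, 0.6]
prediction = mlp_forward(
    x,
    hidden_weights=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    hidden_biases=[0.0, 0.0],
    output_weights=[[1.0, 1.0]],
    output_biases=[0.25],
)
```

Training, that is, estimating the weights by backpropagation, is deliberately left out of this sketch.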

Figure 3

Schematic of the architecture of the Multilayer Perceptron. The figure shows three layers of neurons: input, hidden and output layers.

The MLP operates as follows. The input layer neurons receive the input signal and propagate it to the first hidden layer without processing it. The first hidden layer processes the signal and transfers it to the subsequent layer; the second hidden layer propagates the signal to the third, and so on. When the signal is received and processed by the output layer, it generates the response.

The Long Short-Term Memory network is a type of RNN whose primary strength is the ability to learn long-term dependencies, which makes it a solution for long time series intervals[20,49]. In such a model, memory blocks replace the neurons in the hidden layer of the standard RNN[50]. The memory block consists of three gates that control the system’s state: the input, forget, and output gates. First, the input gate determines how much information will be added to the cell. Second, the forget gate controls the information lost in the cell. Lastly, the output gate determines the final output value based on the input and the memory of the cell[51,52]. Figure 4 shows the LSTM model block, with the output and input blocks, which consists of three gates. At each step, an LSTM maintains a hidden vector h and a memory vector c (the cell state) responsible for controlling state updates and outputs.

Figure 4

Model of one block of the LSTM. The block is composed of the input gate, forget gate and output gate.

The first step is to decide what information will be discarded from the cell state. This decision is made by the forget gate, which uses a sigmoid gate activation function (GAF). The output of the forget gate, $f_t$, can be calculated using equation (5):

$$f_t = \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right) \qquad (5)$$

This gate considers the concatenation of the vectors $h_{t-1}$ and $x_t$. It generates a number between 0 and 1 for each number in the cell state $C_{t-1}$, where $W_f$ and $b_f$ are the weight matrix and the bias vector, respectively. Both must be learned during training and are stored in the vector $\theta$. If a value of $f_t$ is equal to or close to zero, the LSTM discards that information. On the other hand, if it is equal to or close to 1, the information is maintained and reaches the cell state.

The next step is to decide what new information to store in the cell state. This is done by the input gate, which uses the sigmoid GAF and has output $i_t$, together with the input block, which uses the hyperbolic tangent input activation function (IAF); both are calculated by equations (6) and (7):

$$i_t = \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) \qquad (6)$$

$$\tilde{C}_t = \tanh\left(W_C \, [h_{t-1}, x_t] + b_C\right) \qquad (7)$$

First, the vectors $h_{t-1}$ and $x_t$ are concatenated. The weight matrix $W_i$ and the bias vector $b_i$ must be learned during training; the resulting vector $i_t$, called the input gate, decides which values to update. Then a hyperbolic tangent function creates a vector of new candidate values, $\tilde{C}_t$, from the vectors $h_{t-1}$ and $x_t$. In the next step, the candidate values are filtered by multiplying both vectors point by point to create a cell state update. The previous cell state $C_{t-1}$ is updated to the new cell state $C_t$ (equation 8):

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \qquad (8)$$

In addition, the output gate, which also uses the GAF, has output $o_t$ given by equation (9). Finally, $h_t$ expresses the new output of the model (equation 10):

$$o_t = \sigma\left(W_o \, [h_{t-1}, x_t] + b_o\right) \qquad (9)$$

$$h_t = o_t \odot \tanh(C_t) \qquad (10)$$

The current cell state is represented by $C_t$, while W denotes the weight parameters of the model and b its biases.
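The gate computations described above can be sketched as a single LSTM time step. This is a minimal illustration following the standard LSTM formulation, not the implementation used in the study; the parameter dictionary and its key names (`W_f`, `b_f`, and so on) are hypothetical, and vectors are plain Python lists.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(weights, vector, bias):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden vector h_t and cell state c_t."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]  # forget gate
    i_t = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]  # input gate
    c_hat = [math.tanh(a) for a in affine(params["W_c"], z, params["b_c"])]  # candidates
    # Keep part of the old state and add the gated candidate values.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]  # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Hypothetical example: one hidden unit, one input, all parameters zero,
# so every gate outputs 0.5 and the candidate value is 0.
zero_params = {
    "W_f": [[0.0, 0.0]], "b_f": [0.0],
    "W_i": [[0.0, 0.0]], "b_i": [0.0],
    "W_c": [[0.0, 0.0]], "b_c": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
}
h_t, c_t = lstm_step(x_t=[0.3], h_prev=[0.0], c_prev=[2.0], params=zero_params)
```

With all gates at 0.5, half of the previous cell state is retained, which shows how the forget gate scales the memory between steps.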

Model evaluation

To evaluate the forecast ability of the models, the performance metrics given below were used (see[53,54]). In what follows, $y_i$, $i = 1, \ldots, n$, are the target values; $\hat{y}_i$, $i = 1, \ldots, n$, are the model’s predictions; $\bar{y}$ is the mean of the target values; and n is the number of samples.

Mean Absolute Error (MAE): The average absolute difference between the target and the predicted values.

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

Root Mean Squared Error (RMSE): The square root of the average of the squared errors.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

Symmetric Mean Absolute Percentage Error (sMAPE): A measure of accuracy based on a percentage of relative errors.

Spearman’s rank correlation coefficient: A nonparametric correlation measure between the target and the prediction. Spearman’s correlation assesses monotonic relationships by using the ranks of the variables.

$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n \left( n^2 - 1 \right)}$$

where $d_i$ is the difference between the ranks of the targets $y_i$ and the predictions $\hat{y}_i$.
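The four metrics can be sketched in plain Python as follows. The sMAPE has several definitions in the literature; the variant below (absolute error divided by the mean of the absolute target and prediction, reported in percent) is an assumption, since the exact formula used in the study is not recoverable from the text. The Spearman function uses the rank-difference formula, which is exact only when there are no tied values.

```python
import math


def mae(y, p):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))


def smape(y, p):
    """Symmetric MAPE in percent (assumed variant, see the note above)."""
    terms = []
    for a, b in zip(y, p):
        denom = (abs(a) + abs(b)) / 2.0
        terms.append(0.0 if denom == 0 else abs(a - b) / denom)
    return 100.0 * sum(terms) / len(terms)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(y, p):
    """Spearman's rho via the rank-difference formula (no tie correction)."""
    n = len(y)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(y), ranks(p)))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))


# Hypothetical targets and predictions, for illustration only.
targets = [10.0, 20.0, 30.0, 40.0]
predictions = [12.0, 18.0, 33.0, 41.0]
scores = {
    "MAE": mae(targets, predictions),
    "RMSE": rmse(targets, predictions),
    "sMAPE": smape(targets, predictions),
    "Spearman": spearman(targets, predictions),
}
```

MAE and RMSE are expressed in the units of the target, whereas sMAPE and Spearman's coefficient are scale-free, which makes the latter two easier to compare across stations with very different concentration levels.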

Model selection and interpretation

The model selection and interpretation is the final step in the KDD process and requires that the knowledge extracted from the previous step be applied to the specific domain of the prediction in a visualized format. At this stage, in addition to selecting the model with the best precision in the prediction, it also drives the decision-making process based on the air quality assessment in Lima. We have used two schemes for the validation: Hold-Out (HO) and Blocked Nested Cross-Validation (BNCV). On the one hand, HO has the conventional separation of the dataset in training, validation, and testing subsets (see Fig. 5). On the other hand, the BNCV is a fixed-size window that slides, and the model is retrained with all the data up to the current day (see Fig. 6).

Figure 5

Hold-Out Scheme used for the validation of the models. The dataset is split into three sets: training, validation, and testing. The train set is the basis for training the model, and the test set is used to see how well the model performs in untrained concentrations.

Figure 6

Blocked Nested Cross-Validation Scheme used for the validation of the models. The dataset is separated into three sets using a time-window of fixed size: training, validation, and testing. The last day is used for testing.
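The two validation schemes can be sketched as index generators over an hourly series. This is a simplified illustration under stated assumptions: the split fractions, window length, validation size, and step are hypothetical values that the text does not give, and the sketch keeps the window at a fixed size while it slides, with the last day of each window used for testing.

```python
def hold_out_split(n, train_frac, val_frac):
    """Chronological split of n samples into train, validation and test ranges."""
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return range(0, train_end), range(train_end, val_end), range(val_end, n)


def blocked_windows(n, window, val_size, test_size, step):
    """Yield (train, validation, test) index ranges for a sliding fixed-size window.

    Inside each window the order is train, then validation, then test, so the
    model never sees observations that come after the ones it is evaluated on.
    """
    start = 0
    while start + window <= n:
        end = start + window
        test = range(end - test_size, end)
        val = range(end - test_size - val_size, end - test_size)
        train = range(start, end - test_size - val_size)
        yield train, val, test
        start += step


# Hypothetical setup: 10 days of hourly data, 7-day window,
# 1 day for validation, 1 day for testing, sliding one day at a time.
HOURS_PER_DAY = 24
n_samples = 10 * HOURS_PER_DAY
folds = list(
    blocked_windows(
        n_samples,
        window=7 * HOURS_PER_DAY,
        val_size=HOURS_PER_DAY,
        test_size=HOURS_PER_DAY,
        step=HOURS_PER_DAY,
    )
)
ho_train, ho_val, ho_test = hold_out_split(1000, train_frac=0.5, val_frac=0.25)
```

Averaging the test metrics over all folds gives one score per model for the blocked scheme, whereas the Hold-Out scheme yields a single test score.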

Results and discussion

Air quality assessment in Metropolitan Lima-Peru

In this section, we report the results of the statistical analysis of air pollution in LIM.

Statistical analysis of the concentration of PM10

Table 2 shows the descriptive analysis of the PM10 data from the five monitoring stations between 01-01-2017 and 31-12-2018. Additionally, the histograms (see Fig. 7) show the behavior of the pollutant at every station. The distributions are skewed to the right, which indicates the existence of critical episodes of contamination, with the HCH station having the highest incidence, with an average of 130.03 μg/m³. This value exceeds the standard set by the Peruvian norm[7], and the station shows relevant fluctuations and a high dispersion of the pollutant (variance of 8404.34 (μg/m³)²), reflected in a high standard deviation. The stations HCH and ATE register the highest concentration levels. The order of the stations from the lowest to the highest mean PM10 level is as follows: CRB; CDM; SMP; ATE; HCH. Similar behaviour was found in other studies[31,55]. Encalada et al.[31] carried out a visualization study of PM10 concentrations in Lima using the same data, in which similar behavior patterns are shown at the five stations. In addition, all the stations surpass the limits established by the WHO. Moreover, four of the five stations (all except CRB) exceed the maximum annual arithmetic mean of PM10 set in the Environmental Quality Standards (ECA) of Peru.

Table 2

Descriptive statistics for the five monitoring stations.

Station	Minimum	Maximum	1st Qu.	3rd Qu.	Median	Mean ± SD	Variance	Skewness	Kurtosis
CRB	5.44	488.02	31.49	58.45	198.31	48.69 ± 28.39	806.03	3.24	22.27
SMP	7.77	426.80	61.95	105.10	142.50	86.05 ± 35.73	1276.41	1.00	2.86
CDM	6.08	463.60	35.84	63.45	145.50	52.30 ± 24.61	605.54	2.30	18.25
ATE	6.41	931.00	82.90	148.00	421.90	121.56 ± 60.30	3635.75	2.08	11.07
HCH	5.21	974.00	62.10	176.50	138.40	130.03 ± 91.68	8404.34	1.53	4.89

Figure 7

Histograms for each of the five monitoring stations, respectively CRB, SMP, CDM, ATE, and HCH.
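To make the shape statistics of Table 2 concrete, the sketch below computes moment-based skewness and excess kurtosis in plain Python. The estimator used in the study is not stated, so the population-moment definitions here are an assumption, and the sample values are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment using the population (divide-by-n) convention."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical hourly concentrations: mostly moderate, one extreme episode.
symmetric_sample = [1.0, 2.0, 3.0, 4.0, 5.0]
episode_sample = [1.0, 1.0, 1.0, 1.0, 10.0]
skew_symmetric = skewness(symmetric_sample)
skew_episode = skewness(episode_sample)
```

A single extreme value is enough to make the skewness clearly positive, which is the pattern the right-skewed histograms indicate for stations with critical episodes. As a quick consistency check on Table 2, squaring a reported standard deviation reproduces the reported variance up to rounding (for CRB, 28.39² ≈ 806.0).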


Analysis of the correlations with the meteorological variables

A significant correlation between PM10 and the meteorological variables was observed at the HCH station, which is the area with the highest concentration. Factors such as dust, the population/area ratio, and weather conditions have a predominant effect on PM10 concentration[56]. Figure 8 shows that there is a moderate positive correlation (0.39) between temperature and PM10 and a moderate negative correlation (-0.38) between relative humidity and PM10. This is due to the meteorological patterns that occur in the study area. According to Silva et al.[57], between 1992 and 2014 the base of the thermal inversions in Lima ranged between 0.6 and 0.9 kilometers from June to November and between 0.1 and 0.6 kilometers from December to May, with a minimum average of 0.13 kilometers in March, which coincides with the season that presents critical episodes of PM10 concentrations.

Figure 8

Correlation matrices between the meteorological variables and PM10 for each monitoring station.

The thermal inversion in the summer months reduces the dispersion of atmospheric pollutants because the density of the stratiform clouds decreases. Consequently, solar radiation leads to an increase in temperature and a reduction in relative humidity. The latter results in a turbulent process that causes the resuspension of coarse particles such as PM10[25]. High temperatures increase the photochemical activity that causes the decomposition of matter and, consequently, the increase of PM10[58-60]. On the other hand, stratiform cloudiness increases in winter, as does relative humidity, which, accompanied by the drizzles of that season, helps to significantly decrease the temperature and the PM10 concentrations through the wet deposition typical of the season[28]. The above explains the high negative correlation observed between temperature and relative humidity at the five monitoring stations (see Fig. 8), which is a normal phenomenon because relative humidity depends directly on temperature and pressure, which determine the capacity of the air to hold water vapor[61]. For this reason, the higher the temperature, the lower the relative humidity, as shown in Fig. 9.
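A correlation coefficient of the kind reported in the correlation matrices can be computed as in the sketch below. The text does not say which coefficient was used, so Pearson's is an assumption here, and the temperature and humidity values are hypothetical numbers chosen only to show a strongly negative relationship.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical readings: humidity falls linearly as temperature rises.
temperature_c = [18.0, 19.0, 20.0, 22.0, 24.0, 26.0]
relative_humidity = [92.0 - 2.0 * (t - 18.0) for t in temperature_c]
r_temp_humidity = pearson(temperature_c, relative_humidity)
```

Applying the same function to every pair of variables at a station would fill one of the correlation matrices.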

Figure 9

Time series of all variables, PM10, temperature, relative humidity and wind speed, in each monitoring station, ATE, CDM, CRB, HCH and SMP, respectively.


Influence of wind direction and speed on PM10 concentrations

The stations located in the highest area (eastern part) of the city have the highest concentrations of PM10, whereas the stations located in the lowest area have lower concentrations. This trend is due to the entry direction of the persistent local winds from the coast to the south-southwest, which transport pollutants such as PM10 to the northeast and east of the city, turning these areas into critical places of contamination by particulate matter[28,31]. Wind speed influences the dispersion, resuspension, and horizontal transport of pollutants, provided that there are strong air currents[61-63]. This is not the case in the present study, where the most frequent wind speeds are between 0 and 3.10 m/s[31], and accordingly there is no significant correlation between wind speed and PM10.

Critical episodes of contamination at the HCH station

The station with the highest average concentration between 2017 and 2018 is HCH (see Table 2). This area is characterized by high vehicular traffic compared to the rest of the stations considered. The Ramiro Prialé highway, which crosses HCH, is the most used route to access the central road that connects the center and the east of the Peruvian territory, which results in heavy traffic congestion. Moreover, 2,462,321 vehicles were circulating in Lima[64] in 2017, and according to the National Institute of Statistics and Informatics (INEI), the vehicle fleet in Peru grew by 4.4% between 2017 and 2018[65]. This explains the influence of heavy vehicular traffic on the critical pollution episodes in HCH; according to Srishti et al.[66], vehicular traffic contributes about 21% of the pollution. In addition, it is associated with the wear of tires and brakes[64]. Another particular feature of HCH compared to the other stations is its dilapidated, unpaved roads and the frequent inadequate disposal of land-clearing debris on public roads by the population. These conditions generate a significant increase in dust, the main component of particulate matter, contributing to 54% of air pollution. Soil dust has a more significant impact in seasons or areas with little rainfall[66-68]. Furthermore, Lima is considered a city where it seldom rains and where there are only slight drizzles or wet haze from nimbostratus-type clouds[69]. In the surrounding area of HCH, there is also high industrial activity. Industrialization is directly associated with the increased generation of PM10[69]. Concepción and Rodríguez[70] note that both industrial activity and the vehicle fleet are the leading causes of high concentrations of PM10 in Lima, where the primary industries are brick kilns and non-metallic ore extraction. Moreover, it was evidenced that the HCH brick industries do not have the appropriate technology to mitigate air pollution and that all their processes emit large amounts of particulate matter, from the movement of land to the burning of tires, plastics, or firewood in the ovens[71]. Added to all this is the lack of green areas in HCH, which facilitates the resuspension of PM10.

Exploratory analysis on a daily and monthly scale

The PM₁₀ concentration shows two predominant daily episodes (see Fig. 10): the first between 07:00 and 11:00 in the morning, and the second between 17:00 and 22:00 at night. Similar results were found by Sánchez et al.[27], who evaluated the air quality of Lima in 2015. From the above, it can be inferred that PM₁₀ levels reach their highest peaks in the evening (153.9991 and ), while the lowest values occur between 03:00 and 04:00 a.m. each day, which coincides with the results reported for the HCH station. As mentioned by Valdivia et al.[3], this is related to the reduction in emissions from mobile sources that is typical of the early morning hours.

Figure 10

Bar plot per day and month for each monitoring station: ATE, CDM, CRB, HCH, and SMP. The average hourly pollution per day of the week and month of the year is reported for all monitoring stations.

The contamination levels vary depending on the month. At each monitoring station, we can see two main peaks (see Fig. 10). The first corresponds to February, March, and April, which report the highest contamination in the first semester of the year. This period coincides with the beginning of classes for schoolchildren, which intensifies vehicle activity. The end of the summer and the beginning of the autumn are also associated with the occurrence of thermal inversion, which favors high peaks of contamination[57]. The second peak covers the winter season and the beginning of spring, with October standing out in the second semester of the year. Similar results were found by Encalada et al.[31]. In these time windows, the stations with the most critical episodes were HCH and ATE, while CRB had the lowest concentrations. In addition to the emissions from heavy vehicular traffic and fixed sources of pollution, the meteorological and topographic conditions of the study area lead to high concentrations of PM₁₀ in the air, exceeding the standards proposed by the WHO in all cases.
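The averages summarized in Fig. 10 amount to grouping hourly readings by hour of day, day of the week, or month. The following is a minimal sketch of that aggregation, not the authors' code: the helper `mean_by` and the synthetic readings are assumptions made only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_by(records, key):
    """Average the values of (timestamp, value) records grouped by key(timestamp)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in records:
        if value is None:  # skip missing readings
            continue
        group = key(timestamp)
        sums[group] += value
        counts[group] += 1
    return {group: sums[group] / counts[group] for group in sums}


# Synthetic hourly readings (made-up values, not station data):
# one week in which the 08:00 reading is elevated.
start = datetime(2018, 1, 1)  # a Monday
records = [
    (start + timedelta(hours=h), 120.0 if h % 24 == 8 else 60.0)
    for h in range(7 * 24)
]

by_hour = mean_by(records, lambda t: t.hour)          # diurnal profile
by_weekday = mean_by(records, lambda t: t.weekday())  # 0 = Monday
by_month = mean_by(records, lambda t: t.month)

# Hour of the day with the highest average concentration.
peak_hour = max(by_hour, key=by_hour.get)
```

With real station data, the same grouping applied per station would give the bars plotted for each day and month.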

Air pollution forecasting results

In this study, we focus on the one-hour-ahead prediction of the PM₁₀ concentration based on both the past values of the pollutant concentration and the weather variables. For this, the MLP and LSTM were used with a particular architecture. The relevant lags used in the model were detected from the autocorrelation function (ACF) and the partial autocorrelation function (PACF), which identify five lags of the time series as relevant. In addition, temperature, relative humidity, and wind speed are used with a lag of 4 hours. In summary, the model identified is a non-linear autoregressive model with exogenous variables: the value of the time series at each hour is a non-linear function of its five relevant lagged values and of the three exogenous weather variables (temperature, humidity, and wind speed), plus random noise. The non-linear function stands for either the MLP or the LSTM neural network. The purpose of incorporating exogenous variables is to improve the precision of the forecast by including the important meteorological covariates that affect PM₁₀, such as temperature, relative humidity, and wind speed[72]. We implemented a three-layer MLP with 8 input nodes, 16 hidden nodes, and 1 output node. The activation function for the hidden and output nodes is the sigmoid function. The LSTM, in turn, was implemented with 16 parallel blocks, and the outputs of the blocks are aggregated by a single neuron with a sigmoid activation function. To train both ANN models, we selected the mean absolute error as the loss function because it is robust to outliers. The Nadam optimizer was used with the backpropagation algorithm. A 25% dropout strategy, together with 10% of the data reserved for validation, was applied to avoid over-fitting. A maximum of 500 epochs and a batch size of 1024 were used to fit the models' weights.

Two alternatives were considered to obtain out-of-sample forecasts (see Fig. 11). In the Hold-Out scheme, the ANN models were fitted to the training set only once, and the resulting model was used to forecast one hour ahead over the last 60 days of data. In the Blocked Nested Cross-Validation scheme, the ANN models were trained several times with a fixed sliding window: the model was updated for each subsequent day belonging to the test set, and the following day (24 samples) was used as the test set.
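As a minimal sketch of how the eight network inputs could be assembled, the following Python function builds one input row per hour from five lagged values of the series plus the three weather variables taken 4 hours earlier. The function name `build_narx_rows`, the placeholder lags in `HYPOTHETICAL_LAGS`, and the synthetic data are assumptions for illustration; the lags actually selected from the ACF and PACF are not reproduced here.

```python
def build_narx_rows(y, temp, hum, wind, y_lags, exog_lag=4):
    """Return (inputs, targets) for a one-hour-ahead NARX-style model.

    Each input row holds the lagged pollutant values followed by
    temperature, humidity and wind speed taken exog_lag hours earlier.
    """
    max_lag = max(max(y_lags), exog_lag)
    inputs, targets = [], []
    for t in range(max_lag, len(y)):
        row = [y[t - lag] for lag in y_lags]
        row += [temp[t - exog_lag], hum[t - exog_lag], wind[t - exog_lag]]
        inputs.append(row)
        targets.append(y[t])
    return inputs, targets


# Placeholder lags: purely illustrative, NOT the lags used in the study.
HYPOTHETICAL_LAGS = (1, 2, 3, 24, 25)

# Synthetic series (made-up numbers) just to exercise the function.
n = 48
y = [float(i) for i in range(n)]
temp = [20.0 + i for i in range(n)]
hum = [80.0 - i for i in range(n)]
wind = [1.0] * n

X, targets = build_narx_rows(y, temp, hum, wind, HYPOTHETICAL_LAGS)
```

Each row of `X` has 5 + 3 = 8 entries, matching the 8 input nodes of the MLP; the rows and `targets` could then be passed to whichever neural network library is used for training.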

Figure 11

Plots for one-hour ahead predictions for the last 15 days of the PM₁₀ concentration level using LSTM with the BNCV scheme. Predictions for the following monitoring stations: (a) ATE, (b) CDM, (c) CRB, (d) HCH, (e) SMP.

Table 3 shows the performance results obtained by the MLP and LSTM models evaluated in the test set using the Hold-Out and the Blocked Nested Cross-Validation schemes. Figure 11 shows the predictions of the LSTM neural network for the five monitoring stations. The artificial neural networks show good prediction performance according to the Spearman score (over 0.60) for all the stations except ATE, which reaches a score near 0.52. The ATE and HCH monitoring stations are located in industrial areas with heavy traffic. They have the highest levels of contamination and a more significant presence of outliers, which is reflected in error metrics more than twice those of the other stations. Note that the RMSE is higher because of the extreme values in the PM₁₀ levels, whereas the MAE is less affected by this type of value. The models evaluated with the BNCV scheme show slightly better performance than their HO counterparts at several stations (for example, ATE and SMP in terms of MAE), although not at HCH. In addition, the BNCV scheme keeps the models updated with the latest records through an incremental training process with the new data.
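The difference between the two validation schemes can be made concrete with index ranges. The sketch below is a simplified illustration, not the authors' implementation: the function names are hypothetical, and the training-window length (`window_days=365`) is an assumed value, since the window length is not stated in the text. Only the 60 test days and the 24 samples per day come from the description of the schemes.

```python
def hold_out_split(n_samples, test_days=60, samples_per_day=24):
    """Train on everything before the last test_days days; test on the rest."""
    split = n_samples - test_days * samples_per_day
    return range(0, split), range(split, n_samples)


def sliding_window_splits(n_samples, window_days, test_days=60, samples_per_day=24):
    """Yield (train, test) index ranges, one pair per test day.

    The training window has a fixed length and slides forward one day at
    a time; the day right after the window is the test block.
    """
    window = window_days * samples_per_day
    first_test = n_samples - test_days * samples_per_day
    if first_test < window:
        raise ValueError("not enough data for the requested window")
    for day in range(test_days):
        test_start = first_test + day * samples_per_day
        train = range(test_start - window, test_start)
        test = range(test_start, test_start + samples_per_day)
        yield train, test


n_hours = 730 * 24  # two years of hourly records
ho_train, ho_test = hold_out_split(n_hours)
splits = list(sliding_window_splits(n_hours, window_days=365))  # assumed window
```

In the Hold-Out case the model is fitted once on `ho_train`; in the sliding-window case it would be refitted (or updated) on each `train` range before predicting the 24 hours in the matching `test` range.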

Table 3

Metrics	ATE		CDM		CRB		HCH		SMP
Metrics	MLP	LSTM	MLP	LSTM	MLP	LSTM	MLP	LSTM	MLP	LSTM
Hold-Out scheme
MAE	27.458	27.637	9.639	9.609	6.577	6.548	42.740	41.514	10.441	10.105
RMSE	45.752	46.509	13.771	13.743	10.573	10.682	64.297	62.903	15.959	15.520
sMAPE	24.059	24.071	19.344	19.328	17.283	17.208	33.846	32.829	14.331	13.935
Spearman r	0.517	0.514	0.658	0.660	0.756	0.755	0.649	0.663	0.815	0.823
Blocked Nested Cross-Validation scheme
MAE	26.845	27.066	9.689	9.562	6.644	6.339	44.586	43.191	10.155	9.696
RMSE	44.718	45.923	13.885	13.808	10.840	10.722	64.785	63.690	16.162	15.752
sMAPE	23.590	23.607	19.499	19.240	17.280	16.639	35.54	34.569	14.012	13.467
Spearman r	0.523	0.520	0.654	0.657	0.756	0.766	0.632	0.648	0.815	0.817

Performance results for the MLP and LSTM models evaluated using the Hold-Out and the Blocked Nested Cross-Validation schemes. The summary of the results corresponds to one-hour ahead predictions of the PM₁₀ concentration levels evaluated over the last 60 days of the data set.

The models’ performances were strongly affected by a period of excessive contamination, with critical episodes between December 3rd, 2018, and December 21st, 2018 (just before the Christmas festivities). To assess this effect, the time series of the pollutant was decomposed into trend, seasonality, and irregular components using the decomposition method described in equation 2. The irregular component was subtracted from the original time series to obtain a filtered time series. Table 4 shows the performance results obtained by the MLP and LSTM models evaluated in the test set using the Hold-Out and the Blocked Nested Cross-Validation schemes applied to the filtered time series. Under this condition, both the MLP and the LSTM performed very well in predicting the regular component of the contamination levels at all monitoring stations. The marked improvement of the artificial neural network models on the filtered series shows that the irregular component is what is hard to predict. Figure 12 shows the predictions of the LSTM neural network for the five monitoring stations.
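To illustrate the filtering step, the sketch below removes the irregular component from an hourly series. It assumes a classical additive decomposition with a 24-hour period (centered moving-average trend plus a mean seasonal profile); this is an assumption made for illustration and may differ from the decomposition method of equation 2. The function names and the synthetic series are hypothetical.

```python
def centered_trend(y, period):
    """Centered moving average (2 x period) for an even period."""
    half = period // 2
    trend = [None] * len(y)
    for t in range(half, len(y) - half):
        window = y[t - half:t + half + 1]
        # End points get half weight so that every phase counts equally.
        total = sum(window) - 0.5 * (window[0] + window[-1])
        trend[t] = total / period
    return trend


def filter_irregular(y, period=24):
    """Return trend + seasonal, i.e. the series minus its irregular part.

    Needs at least 2 * period samples; the first and last period // 2
    positions have no trend estimate and are returned as None.
    """
    trend = centered_trend(y, period)
    sums = [0.0] * period
    counts = [0] * period
    for t, (value, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            sums[t % period] += value - tr
            counts[t % period] += 1
    seasonal = [s / c for s, c in zip(sums, counts)]
    offset = sum(seasonal) / period
    seasonal = [s - offset for s in seasonal]  # centre the profile on zero
    return [
        None if tr is None else tr + seasonal[t % period]
        for t, tr in enumerate(trend)
    ]


# Synthetic data (made-up): linear trend plus a daily peak at 08:00.
clean = [50.0 + 0.1 * t + (30.0 if t % 24 == 8 else 0.0) for t in range(5 * 24)]
spiky = list(clean)
spiky[60] += 200.0  # one isolated extreme reading

filtered_clean = filter_irregular(clean)
filtered_spiky = filter_irregular(spiky)
```

On the noise-free series the filter returns the input unchanged at interior points, while the isolated spike in `spiky` is strongly attenuated, which is the behaviour that makes the filtered series much easier to predict.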

Table 4

Metrics	ATE		CDM		CRB		HCH		SMP
Metrics	MLP	LSTM	MLP	LSTM	MLP	LSTM	MLP	LSTM	MLP	LSTM
Hold-Out scheme
MAE	4.203	2.659	1.737	1.336	1.628	1.423	6.370	4.255	2.830	1.941
RMSE	5.724	3.706	2.235	1.732	2.192	1.844	8.324	5.837	3.602	2.299
sMAPE	3.646	2.411	3.581	2.830	4.269	3.927	4.636	3.224	4.063	2.867
Spearman r	0.986	0.991	0.973	0.982	0.967	0.974	0.981	0.988	0.982	0.990
Blocked Nested Cross-Validation scheme
MAE	4.217	2.720	1.829	1.325	1.645	1.333	6.621	4.561	2.749	1.841
RMSE	5.738	3.731	2.350	1.712	2.297	1.835	8.622	6.101	3.558	2.194
sMAPE	3.619	2.468	3.743	2.810	4.330	3.575	4.905	3.454	3.856	2.709
Spearman r	0.984	0.991	0.973	0.982	0.963	0.973	0.977	0.987	0.980	0.991

Figure 12

Plots for one-hour ahead predictions for the last 15 days of the regular component of the concentration level using LSTM with the BNCV scheme. Predictions for the following monitoring stations: (a) ATE, (b) CDM, (c) CRB, (d) HCH, (e) SMP.

Performance results for the MLP and LSTM models evaluated using the Hold-Out and the Blocked Nested Cross-Validation schemes. The summary of the results corresponds to one-hour ahead predictions of the filtered time series of the PM₁₀ concentration levels evaluated over the last 60 days of the data set.
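The four metrics reported in Tables 3 and 4 can be sketched in plain Python as follows. The sMAPE formula used here is one common definition and is an assumption, since its exact form is not given in this section; the sample values are made up and only show why the RMSE reacts more strongly than the MAE to a single extreme error.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def smape(actual, predicted):
    """Symmetric MAPE in percent (assumed definition; zero-sum pairs skipped)."""
    terms = [
        2.0 * abs(a - p) / (abs(a) + abs(p))
        for a, p in zip(actual, predicted)
        if abs(a) + abs(p) > 0
    ]
    return 100.0 * sum(terms) / len(terms)


def _ranks(values):
    """Ranks starting at 1, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def spearman(actual, predicted):
    """Spearman correlation: Pearson correlation of the ranks."""
    ra, rp = _ranks(actual), _ranks(predicted)
    mean_a, mean_p = sum(ra) / len(ra), sum(rp) / len(rp)
    cov = sum((x - mean_a) * (y - mean_p) for x, y in zip(ra, rp))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_p = sum((y - mean_p) ** 2 for y in rp)
    return cov / math.sqrt(var_a * var_p)


# Made-up values: three ordinary hours and one extreme episode.
actual = [50.0, 60.0, 55.0, 300.0]
predicted = [52.0, 58.0, 57.0, 200.0]

scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "sMAPE": smape(actual, predicted),
    "Spearman r": spearman(actual, predicted),
}
```

In this toy case the single large miss dominates the RMSE, while the rank-based Spearman score stays high because the ordering of the hours is still predicted correctly.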

Comparison of the present study with past studies

This section compares the present study with previous studies on the evaluation and prediction of particulate matter in Lima, showing the duration of each study and its main findings. Our results agree with the other studies in that vehicular traffic is the main activity that causes critical episodes of PM₁₀, and this is exacerbated in the summer months.

Silva et al.[28] show that the highest concentrations of particulate matter were observed in the eastern part of the city. The main sources of particulate material are the large open areas, vehicular traffic, and the commercialization of rubble, bricks, and cement. The highest concentrations are observed in summer. Pollutant types: , . Duration of study: 6 years (2010-2015).

Reátegui-Romero et al.[73] show that, for the monitoring stations in the eastern zone, the highest concentrations of particulate matter are observed in the northern area of Lima, that relative humidity is inversely proportional to the concentrations of particulate matter, and that higher peaks are observed in the summer month. Pollutant types: , . Duration of study: 2 months (February and July 2016).

Sanchez et al.[10] show that there is a higher concentration of particulate matter in the areas with the greatest impact of vehicular traffic, reaching maximum concentrations of 476.8 µg/m³ at the Santa Anita station. They used the WRF-Chem model to predict concentrations, obtaining low-precision results. Pollutant types: . Duration of study: 33 days (2016).

In our study, the major sources of the pollutant were identified as the vehicle fleet, the industrial park, and overcrowding, with maximum peaks of 974 µg/m³ at the HCH station. The highest concentrations were observed in the summer months. Artificial neural networks were used, specifically the LSTM model under two validation schemes, to predict concentrations. The results showed good prediction performance for both low concentrations and critical episodes. Pollutant types: PM₁₀. Duration of study: 2 years (2017-2018).

Limitations

This study has some limitations. First, the data cover a relatively short period (two years). A longer period of hourly data might have allowed a more rigorous statistical analysis and more conclusive results. It is worth mentioning that PM₁₀ data in Lima require greater attention, since many stations do not keep an adequate record of this pollutant and existing research on the topic is scarce. Second, the collection of data on other meteorological variables was also restricted, since most monitoring stations do not record them correctly. Third, the study does not consider data related to vehicular traffic or hospital care; both variables might have enriched the research. Nevertheless, our findings are consistent with and complementary to a recent study showing the visual and exploratory aspects of the pollutant[31]. In addition, the MLP and LSTM architectures, analyzed under two validation schemes, set a precedent for future work with a predictive approach, this being the first study in Lima that addresses the prediction of PM₁₀ using artificial neural networks. Likewise, it will support preventive actions against critical environmental episodes.

Conclusions

This study addressed the problem of forecasting PM₁₀ concentration on an hourly scale based on air quality indicators from five monitoring stations in Lima, Peru. A comparative study was carried out between the MLP and LSTM neural networks evaluated with the Hold-Out and Blocked Nested Cross-Validation schemes. The MLP and LSTM can use the data from the previous period to accurately forecast the concentration a short time ahead, and they can learn the concentration trends accurately. However, the performance is diminished when a station is subject to unpredictable external sources of pollution or to short-term changes in climate and landforms (ATE and HCH). In this sense, the LSTM with BNCV could better adapt to data from the monitoring stations that present episodes of extreme values. The results show that periods of moderate PM₁₀ concentration are predicted with very high precision, while for periods of high contamination the models' accuracy is diminished, although it retains a reasonable degree of predictability. Using a high-performance model for air quality forecasting in large cities, such as Lima, can help develop critical health protection and prevention tools. Deep learning neural networks such as the LSTM are crucial in helping design public policies that prioritize improving air quality conditions to develop more sustainable cities. The different configurations of the LSTM respond to the forecast of events by selecting the relevant meteorological variables. The essential property of the LSTM is that, through its memory units, it can remember patterns over time, which is beneficial when forecasting PM₁₀. The results show that the concentration prediction achieves better results with artificial intelligence methods, since they are suitable for this type of approach.

However, further work is required with other cross-validation methods and with hybrid and ensemble methods to increase the accuracy of the predictions. In fact, a recent study[74] showed that genetic programming had higher prediction accuracy than artificial neural networks and was equally competent for peak predictions. This study will help in decision-making regarding air pollution mitigation and strategies, not only in Lima but also in other cities in the country and abroad. In this sense, this study of PM₁₀ could be extended to other pollutants, both at a national and international level. As future work, we expect to apply other variants of deep learning models that include incremental learning[75], as well as to introduce self-identification techniques for model identification[41,76].

13 in total

1. Mapping real-time air pollution health risk for environmental management: Combining mobile and stationary air pollution monitoring with neural network models.

Authors: Matthew D Adams; Pavlos S Kanaroglou
Journal: J Environ Manage Date: 2015-12-17 Impact factor: 6.789

2. Multi-criteria decision-making using GIS-AHP for air pollution problem in Igdir Province/Turkey.

Authors: Fatma Sahin; Mehmet Kazım Kara; Ahmet Koc; Gökhan Sahin
Journal: Environ Sci Pollut Res Int Date: 2020-06-18 Impact factor: 4.223

3. Monsoonal differences and probability distribution of PM(10) concentration.

Authors: Noor Faizah Fitri Md Yusof; Nor Azam Ramli; Ahmad Shukri Yahaya; Nurulilyana Sansuddin; Nurul Adyani Ghazali; Wesam Al Madhoun
Journal: Environ Monit Assess Date: 2009-04-14 Impact factor: 2.513

4. Characterization of indoor settled dust and investigation of indoor air quality in different micro-environments.

Authors: Veerendra Sahu; Suresh Pandian Elumalai; Sneha Gautam; Nitin Kumar Singh; Pradyumn Singh
Journal: Int J Environ Health Res Date: 2018-06-11 Impact factor: 3.411

5. Deep learning architecture for air quality predictions.

Authors: Xiang Li; Ling Peng; Yuan Hu; Jing Shao; Tianhe Chi
Journal: Environ Sci Pollut Res Int Date: 2016-10-13 Impact factor: 4.223

6. Particulate matter levels in a South American megacity: the metropolitan area of Lima-Callao, Peru.

Authors: Jose Silva; Jhojan Rojas; Magdalena Norabuena; Carolina Molina; Richard A Toro; Manuel A Leiva-Guzmán
Journal: Environ Monit Assess Date: 2017-11-13 Impact factor: 2.513

7. Source apportionment of fine particles in Kuwait City.

Authors: Mohammad A Alolayan; Kathleen W Brown; John S Evans; Walid S Bouhamra; Petros Koutrakis
Journal: Sci Total Environ Date: 2012-12-25 Impact factor: 7.963

8. Seasonal characteristics of aerosols (PM₂.₅ and PM₁₀) and their source apportionment using PMF: A four year study over Delhi, India.

Authors: Srishti Jain; S K Sharma; N Vijayan; T K Mandal
Journal: Environ Pollut Date: 2020-03-10 Impact factor: 8.071

9. Modeling Study of the Particulate Matter in Lima with the WRF-Chem Model: Case Study of April 2016.

Authors: Odón R Sánchez-Ccoyllo; Carol G Ordoñez-Aquino; Ángel G Muñoz; Alan Llacza; María Fátima Andrade; Yang Liu; Warren Reátegui-Romero; Guy Brasseur
Journal: Int J Appl Eng Res Date: 2018

10. A Systematic Review of Statistical and Machine Learning Methods for Electrical Power Forecasting with Reported MAPE Score.

Authors: Eliana Vivas; Héctor Allende-Cid; Rodrigo Salas
Journal: Entropy (Basel) Date: 2020-12-15 Impact factor: 2.524
