
RAQ-A Random Forest Approach for Predicting Air Quality in Urban Sensing Systems.

Ruiyun Yu1, Yu Yang2, Leyou Yang3, Guangjie Han4, Oguti Ann Move5.   

Abstract

Air quality information such as the concentration of PM2.5 is of great significance for human health and city management. It affects the way of traveling, urban planning, government policies and so on. However, in major cities there is typically only a limited number of air quality monitoring stations. In the meantime, air quality varies in the urban areas and there can be large differences, even between closely neighboring regions. In this paper, a random forest approach for predicting air quality (RAQ) is proposed for urban sensing systems. The data generated by urban sensing includes meteorology data, road information, real-time traffic status and point of interest (POI) distribution. The random forest algorithm is exploited for data training and prediction. The performance of RAQ is evaluated with real city data. Compared with three other algorithms, this approach achieves better prediction precision. Exciting results are observed from the experiments that the air quality can be inferred with amazingly high accuracy from the data which are obtained from urban sensing.

Keywords:  air quality prediction; point of interest; random forest; traffic

Year:  2016        PMID: 26761008      PMCID: PMC4732119          DOI: 10.3390/s16010086

Source DB:  PubMed          Journal:  Sensors (Basel)        ISSN: 1424-8220            Impact factor:   3.576


1. Introduction

As urbanization leads to urban community growth, the transportation infrastructure dependent on fossil fuels expands as well [1]. The growing popularity of vehicle use gives rise to an increase in traffic-related pollutant emissions. Urban air pollution is a major problem in both developed and developing countries, as atmospheric pollutants have a great effect on human health. Numerous illnesses such as lung cancer may be caused by various atmospheric pollutants [2]. In addition, other serious environmental problems can also result from air pollution, such as acid rain and the greenhouse effect. For example, SO2 and NO2 are the main causes of acid rain [3], while CO2 and N2O are the main causes of the greenhouse effect [3]. Recently, especially in China, environmental problems have become a major concern in big cities such as Beijing and Shanghai. In Beijing, the primary sources of pollutants include exhaust emissions from the city's more than five million motor vehicles, coal burning in neighboring regions, dust storms from the north and local construction dust [4]. A particularly severe smog engulfed Beijing for weeks in early 2013, elevating public awareness to unprecedented levels and prompting the government to roll out emergency measures [4]. Air pollution monitoring is thus becoming more and more significant. Real-time air quality information, such as the concentrations of PM2.5, PM10 and NO2, is important for pollution management and for protecting people from the damage caused by air pollutants. Considering the significance of air quality, governments monitor it by establishing air quality monitoring stations. However, because of the high expense of setting up and maintaining these facilities, there are not enough stations in cities. For example, Figure 1 shows the Google Map of Shenyang City. The red pins represent the 11 air quality monitoring stations.
Among them, S1 is located in a college; S2, S3, S4, S6, S8 are located on the roofs of buildings; S5, S9 are located along roads; S10 is located in a park; S11 is located near factories. Only these 11 stations cover more than three thousand square kilometers of downtown area in Shenyang. Another example is the comparison between London and Beijing: the area of Beijing is 10 times larger than that of London, but its number of monitoring stations is less than one fourth of London's [5]. One station can only monitor an area of limited size, so precise air quality reports cannot be generated for many areas.
Figure 1

Monitoring station locations in Shenyang city (China).

Figure 2a shows samples of the AQI data of 10 stations in different locations. The x-axis denotes the stations and the y-axis denotes the AQI. The three colored bars denote the AQI at different times. As demonstrated in Figure 2a, stations at different locations can differ considerably at the same time, such as S7 and S8 on 6 May 2015 [6]. Air quality on two consecutive days can also show big jumps, such as the AQI at S3, which rose from 55 to 408 between the mornings of 6 May 2015 and 7 May 2015 [6]. Figure 2b shows that air quality changes follow different patterns in different locations. For example, whether stations are a short distance apart like S5 and S6 or a long distance apart like S5 and S10, they show different changes between points in time.
Figure 2

(a) AQI Samples in Shenyang; (b) AQI Trend on 12 May 2015 in Shenyang.

It is hard to capture these changes in a general function that applies to all locations, so we cannot come up with a general formula to predict the air quality in a certain time slot. How to infer the air quality in the blank areas, i.e., the areas without stations, is therefore a challenging and meaningful topic. In this paper, we propose an algorithm to infer air quality throughout the city. In an urban sensing system, an algorithm (RAQ) based on the random forest concept is proposed to predict urban air quality from historical air quality data, meteorology data, historical traffic and road status, and POI distribution information. These data are collected from all kinds of urban sensors such as weather monitoring stations. This method replaces the factors that traditional mathematical models require but that are hard to obtain. In practical applications, we cannot take all factors such as vehicle emissions and factory emissions into account, as it is hard to get accurate data about them. This replacement benefits not only the computation but also the prediction accuracy. At the same time, all the features used in this paper are much cheaper to obtain than accurately measured data from monitoring stations, and no equipment cost is required. As for accuracy, this algorithm performs better than some classical ones, and the overall results can provide meaningful references to citizens. Regarding scalability and extensibility, more potentially related features such as human mobility can be fed into this algorithm without significant changes. The algorithm itself is also robust enough for even higher dimensions. The remainder of this paper is organized as follows: Section 2 presents related work. The problem description and formulation are presented in Section 3.
In Section 4, the system framework and the RAQ algorithm are proposed. Extensive experiments are presented in Section 5. We conclude and outline the directions for future work in Section 6.

2. Related Work

In the past decades, many studies on air quality inference have been done using approaches such as dispersion models, satellite remote sensing and wireless sensor networks. Air pollution dispersion models are tools that use a mathematical model, such as the Box model [7], Gaussian model [8], Lagrangian model [9], Eulerian model [10], SLAB model [11] or some mixed models, to simulate how air pollution disperses in the atmosphere. The classical dispersion models are mainly functions of meteorology, traffic volumes, building distributions and so on. These models depend mainly on experience and the parameters above to simulate pollution dispersion, but other potential factors, such as human mobility and concentrations, are not taken into consideration. In the meantime, dispersion models depend on access to relatively accurate data, such as the strength of pollutant sources, wind speed and traffic emissions, whose accuracy cannot be guaranteed in certain conditions. For example, wind speed may vary considerably between regions because buildings obstruct the wind and modify its circulation between and over structures. Accurate traffic emissions are also hard to obtain; we can only estimate the value according to fuel consumption and distances travelled. Satellite remote sensing technology is another possible way to monitor air quality. Research on using satellites to monitor air conditions has developed quickly in the past decades. For example, Liu et al. proposed an approach using satellite remote sensing technology to estimate the thickness of PM2.5 on the ground [12]. Similarly, Martin et al. proposed a way of using satellite remote sensing technology to measure some ground-level air pollutants, including CO, NO, SO2 and so on [13]. Pawan et al. used this technology to evaluate the air conditions of every city [14].
These methods mainly use satellite remote sensing technology to estimate the concentrations of certain air pollutants by analyzing the images obtained by the satellites. However, many air quality managers are not yet taking full advantage of satellite data for their applications because of the challenges associated with accessing, processing, and properly interpreting observational data. That is, a certain degree of technical skill is required on the part of the data end-user, which is often problematic for organizations with limited resources [15]. Sensor networks have also been studied extensively because of their broad applicability and enormous application potential in areas such as environmental monitoring. A Wireless Sensor Network Air Pollution Monitor System (WAPMS) was deployed on the island of Mauritius for monitoring air quality [16]; distributed infrastructure-based wireless sensor networks and grid computing have also been used for monitoring the air quality of London [17]. Rajasegarar et al. also used wireless sensor networks to monitor air pollutants [18]. However, sensor networks require a large number of sensor devices and can only be deployed over a small range, such as indoors and in small areas. For a city and other large areas, cheap single-function sensors cannot provide information about all kinds of air pollutants. With multi-function sensors such as monitoring stations, infrastructure construction and maintenance costs make it difficult to deploy wireless sensor networks widely. The same reason limits the number of stations in cities of China. Besides the methods above, participatory sensing is also an important approach for air quality prediction. With the popularity of smart devices, participatory sensing and crowdsourcing have been hot topics of discussion in recent years. People see unlimited possibilities in smart devices.
A personalized mobile sensing system (MAQS) was proposed for indoor air quality monitoring [19]; a system based on smartphones and monitoring sensors has also been used to monitor outdoor air quality [20]; noise pollution has also been monitored using mobile phones [21]. Sivaraman et al. used a participatory sensor system to monitor air pollutants in Sydney (Australia) [22]. However, most current smartphones do not carry air pollutant sensors, so the system needs external sensing modules, which leads to extra costs. Besides the high expense, user participation and the accuracy of the data are problems that remain to be solved. Recently, urban computing has been one of the ways to solve problems in cities. Yuan Jing et al. proposed an algorithm to infer the functional areas of cities by using trajectories [23]; Zheng et al. made use of daily city data to infer urban air quality [24,25]. However, urban computing similarly requires pre-installed urban sensors such as GPS devices. For instance, when inferring the air quality, Zheng made use of months of data collected from the GPS devices installed in taxis in Beijing. This is an important limitation that prevents the wider adoption of this approach, because in most cities we cannot access the GPS information of taxis. Spatiotemporal data analysis is also an important aspect of air quality prediction. Chen et al. established a spatiotemporal data framework named BigSmog to provide smog analysis for China [26]. Zhu et al. proposed Granger-causality-based air quality estimation with heterogeneous spatiotemporal data [27]. Some other studies [28,29] also analyzed spatiotemporal data to generate air pollutant distributions.

3. Problem Description and Definition

3.1. Definition

3.1.1. Air Quality Index

An air quality index (AQI) is a number used by government agencies to communicate to the public how polluted the air is currently or how polluted it is forecasted to become [30]. As the AQI increases, an increasingly large percentage of the population is likely to be exposed, and people might experience increasingly severe health effects. Different countries have their own air quality indices, corresponding to different national air quality standards. In this paper, we use the standard of China, where the AQI is based on the levels of six atmospheric pollutants, namely sulfur dioxide (SO2), nitrogen dioxide (NO2), suspended particulates smaller than 10 μm in aerodynamic diameter (PM10), suspended particulates smaller than 2.5 μm in aerodynamic diameter (PM2.5), carbon monoxide (CO), and ozone (O3), measured at the monitoring stations throughout each city [31]. The AQI value is calculated per hour according to a formula published by China’s Ministry of Environmental Protection [31]. AQI is the maximum value of $I_p$, which is the reference value of one air pollutant $p$:

$$I_p = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \left( C_p - C_{low} \right) + I_{low} \qquad (1)$$

where $C_p$ is the mass concentration value of the air pollutant $p$, $C_{high}$ is the high value of the concentration limit, which can be looked up in the reference table from [31], $C_{low}$ is the low value of the concentration limit, which can be looked up in the same reference table, $I_{high}$ is the value corresponding to $C_{high}$ in that table, and $I_{low}$ is the value corresponding to $C_{low}$. Table 1 shows the relationship between AQI values and air pollution levels, which are marked by different colors. In this way, air quality prediction can be treated as a classification problem, so that we only need to match the air quality index to one of the six AQI levels in Table 1.
Table 1

AQI classification.

AQI        Air Pollution Level
0–50       Excellent
51–100     Good
101–150    Lightly Polluted
151–200    Moderately Polluted
201–300    Heavily Polluted
300+       Severely Polluted
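The index calculation and the lookup in Table 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the PM2.5 breakpoints are an assumption recalled from the Chinese standard and should be checked against the reference table in [31], rounding the result up to an integer is likewise an assumption, and all function and variable names are hypothetical.

```python
import math

# Breakpoints for PM2.5 (ug/m3) and the matching index values. These numbers
# are an assumption; verify them against the reference table in [31].
PM25_CONC = [0, 35, 75, 115, 150, 250, 350, 500]
INDEX_VALS = [0, 50, 100, 150, 200, 300, 400, 500]

# Upper bounds of the AQI ranges in Table 1 and their level names.
AQI_LEVELS = [
    (50, "Excellent"),
    (100, "Good"),
    (150, "Lightly Polluted"),
    (200, "Moderately Polluted"),
    (300, "Heavily Polluted"),
]


def sub_index(conc, conc_points=PM25_CONC, index_points=INDEX_VALS):
    """Linear interpolation between the two breakpoints that bracket conc."""
    for k in range(len(conc_points) - 1):
        c_low, c_high = conc_points[k], conc_points[k + 1]
        if c_low <= conc <= c_high:
            i_low, i_high = index_points[k], index_points[k + 1]
            return (i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low
    raise ValueError("concentration outside the breakpoint table")


def aqi(sub_indices):
    """AQI is the largest per-pollutant index (rounded up: an assumption)."""
    return math.ceil(max(sub_indices))


def aqi_level(value):
    """Map an AQI value to the level name in Table 1."""
    for upper, name in AQI_LEVELS:
        if value <= upper:
            return name
    return "Severely Polluted"
```

For instance, under these assumed breakpoints a PM2.5 concentration of 106 μg/m3 interpolates to 138.75, which falls in the "Lightly Polluted" range of Table 1.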

3.1.2. Traffic Congestion Status

Traffic Congestion Status (TCS) describes the traffic conditions on a certain road. Different colors denote different levels of congestion. For example, Figure 3 shows an example of a TCS graph.
Figure 3

A TCS graph.


3.1.3. Point of Interest

A point of interest, or POI, is a specific location that someone may be interested in. For example, restaurants and shopping malls surrounding us are POI. Figure 4 presents the restaurant locations around Sanhao Street of Shenyang on Google Maps.
Figure 4

POI near the Sanhao street of Shenyang city.


3.2. Problem Formulation

This paper uses urban sensing data to solve the problem of air quality inference, that is, to infer the unknown air quality of areas from all kinds of data. These data relate either to the sources of air pollution, such as traffic emissions and point of interest distribution, or to its results, such as the air quality index, so establishing the relationship between these data and air quality is the key to this kind of approach. The RAQ algorithm collects several kinds of related data, including air monitoring station data (AQI), meteorology data (MD), traffic (TCS), road information (RI) and POI data. All these data are fetched at intervals of one hour. We divide the city into grids (G) and each grid is regarded as one unit. The grids with air quality monitoring stations (G1) generate data labeled with an AQI, while the grids without stations (G2) generate the data used for prediction. Data from G1 are used for training our learning model and data from G2 are input into the model to generate the prediction value. The only difference between data from G1 and G2 is that data from G1 are labeled with an AQI value. The results are given as AQI levels. If the actual value from a monitoring station belongs to the predicted AQI level, the prediction is right; otherwise it is wrong. This problem can be formulated as follows: given a collection of grids G = G1 ∪ G2 (|G1| ≪ |G2|), where g1·AQI (g1 ∈ G1) is known and g2·AQI (g2 ∈ G2) is unknown, and g·MD, g·TCS, g·RI and g·POI are known (g ∈ G), RAQ aims to predict g2·AQI at intervals of one hour.
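The split of G into labeled and unlabeled grids can be sketched as below. The record layout and all names (`Grid`, `split_grids`, the field names) are hypothetical and chosen only to mirror the symbols in the formulation; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Grid:
    """One grid cell for one hourly time slot (hypothetical record layout)."""
    grid_id: int
    md: Tuple[float, ...]             # meteorology features (g.MD)
    tcs: float                        # weighted traffic congestion status (g.TCS)
    ri: float                         # road information (g.RI)
    poi: int                          # POI count (g.POI)
    aqi_level: Optional[int] = None   # known only where a station exists


def split_grids(grids: List[Grid]):
    """Return (G1, G2): labeled grids for training, unlabeled for prediction."""
    g1 = [g for g in grids if g.aqi_level is not None]
    g2 = [g for g in grids if g.aqi_level is None]
    return g1, g2
```

Because |G1| ≪ |G2|, the training set built from G1 is small relative to the number of grids that need a prediction in each hourly slot.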

4. RAQ Algorithm

In the RAQ algorithm, all data are collected from the urban sensing system, including air monitoring station data, meteorology data, traffic data, road information and POI data, and the necessary features are extracted from these heterogeneous data. These features are among the most common data in city life. Traffic-related sources like vehicle emissions and POI like factories are the main sources of air pollutants [3], and meteorological conditions are the main driver of their dispersion [3]. These data can therefore represent the air quality situation well. The training dataset includes all the necessary features and is divided into subsets using bootstrap sampling. Figure 5 shows the structure of the dataset. A decision tree is constructed on each subset, and the classification is done by aggregating the results generated from all decision trees. Figure 6 shows the procedure of the RAQ algorithm.
Figure 5

Dataset structure.

Figure 6

The procedure of RAQ.


4.1. Data Collection and Feature Extraction

4.1.1. Meteorology Data

Meteorology data such as temperature and humidity are very important factors that strongly affect the concentration and spread of air pollutants. Understanding the behavior of meteorological parameters in the planetary boundary layer is important because the atmosphere is the medium in which air pollutants are transported away from the source, and this transport is governed by meteorological parameters such as atmospheric wind speed, wind direction, and temperature [32]. In this paper, we use weather monitoring stations as one part of the urban sensing system. Considering the accessibility of the data, we use the following meteorology features: temperature (Fmt, °C), humidity (Fmh, %), barometric pressure (Fmp, mmHg), wind speed (Fmw, m/s) and visibility (Fmv, m).

4.1.2. Traffic and Road Data

Traffic is one of the most important factors that affect air quality. Figure 3 is a sample of the original data available from map service providers. In this paper, we rely on two important characteristics of traffic: road length (Fri) and traffic congestion status (Ftcs). If a road is very long and traffic congestion is relatively light, exhaust gas emissions can still be at a high level because of the total number of vehicles on the road. Similarly, emissions can be high if a road is short but traffic congestion is heavy. However, we do not have a method or accurate data to quantify these two characteristics directly. Most map service providers offer online maps and real-time traffic status. They do not publish application programming interfaces (APIs) for third-party developers to access these data, but we can still get some useful hints by analyzing the HTTP requests issued by the map web page. Essentially, these data are collected from GPS equipment installed in cars or from speed measurement sensors, and they form another important part of the urban sensing system. Figure 7 shows the HTTP request records of a typical Baidu map when we invoke the traffic widget.
Figure 7

HTTP request analysis by Chrome developer tool.

As we know, a picture is composed of many pixels, so a picture can be digitized into a matrix. We use the colored pixel distribution to represent the information of road length and congestion status. For each tile grid, we count the quantity of pixels to represent the road. The larger the quantity of pixels, the greater the length of the road is in one tile grid. As shown in Figure 3, traffic congestion status is denoted by different colors (green, orange and red) in the pixels which represent roads. According to the traffic volume of different congestion levels, different weights are assigned to the numbers of pixels in different colors (1, 2 and 5). In Figure 8, the weighted tcs value is calculated by formula a + 2b + 5c, where a is the number of pixels in green, b is the number of pixels in orange and c is the number of pixels in red.
Figure 8

Traffic congestion status.
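The pixel-counting step described above can be sketched as follows: the number of road pixels stands in for road length, and the weighted sum a + 2b + 5c stands in for congestion. The RGB values of the three colors are placeholders, not the colors actually used by any map provider, and the function name is hypothetical.

```python
from collections import Counter

# RGB values for the three congestion colors are placeholders (assumption);
# a real tile would need the colors used by the map provider.
GREEN, ORANGE, RED = (0, 255, 0), (255, 165, 0), (255, 0, 0)
WEIGHTS = {GREEN: 1, ORANGE: 2, RED: 5}


def tile_features(pixels):
    """Return (road_pixels, weighted_tcs) for an iterable of RGB tuples."""
    counts = Counter(p for p in pixels if p in WEIGHTS)
    road_pixels = sum(counts.values())          # proxy for road length
    weighted_tcs = sum(WEIGHTS[color] * n for color, n in counts.items())
    return road_pixels, weighted_tcs
```

A tile with three green, two orange and one red road pixel would give a road length proxy of 6 and a weighted TCS of 3 + 2·2 + 5·1 = 12; non-road pixels are ignored.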


4.1.3. POI Data

The category of POIs and their density in a region indicate the land use and the function of the region as well as the traffic patterns in the region, therefore contributing to the air quality inference of the region [24]. For example, shopping streets are likely to attract more people than parks, so there will be more human-related air pollution sources like vehicles. Schools usually have more green areas than factories, so there are more plants to absorb air pollutants. Therefore, POI distribution has a strong effect on air quality. These data also reflect the significance of human activities in urban sensing systems. In this paper, the number of POI is counted in each tile grid. According to the search results of Baidu Maps and Google Maps, the majority of POI are divided into ten categories. Table 2 shows the categories and Figure 9 presents the number of POI (Fpn) in each category.
Table 2

POI categories.

Code    POI Category
P1      Transportation
P2      Entertainment
P3      Restaurant
P4      Education
P5      Residential District
P6      Park
P7      Company
P8      Factory
P9      Shopping mall
P10     Gas station
Figure 9

Numbers of POI in Shenyang city.


4.2. Random Forest Classification

The Random Forest is a general term for ensemble methods using tree-type classifiers $\{h(x, \Theta_k), k = 1, \ldots\}$, where the $\{\Theta_k\}$ are independent identically distributed random vectors, $x$ is an input pattern and $h(x, \Theta_k)$ is a generated classifier [33]. It uses recursive partitioning to generate many trees and then aggregates the results. Each tree is independently constructed using a bootstrap sample of the training data: the tree first subdivides the parameter set into several parts depending on one of the parameters, and subsequently repeats the process for each part.

4.2.1. Bootstrap Aggregating (Bagging)

There is usually a single data sample in each class for training. A simple method is to divide the dataset into non-overlapping subsets and construct the trees independently. However, this requires a huge amount of data, which cannot always be guaranteed. A better way is to sample the original dataset with replacement a certain number of times to produce a bootstrap sample. This method ensures that the samples’ distributions are statistically identical to that of the original data sample [34]. There are n records in the original dataset, so the probability of selecting any given record in one draw is 1/n. The probability of not selecting a certain record in one draw is $(1 - 1/n)$, which becomes $(1 - 1/n)^n$ when the draw is repeated n times. As the sample size tends to infinity, this probability can be expressed as $\lim_{n \to \infty} (1 - 1/n)^n = e^{-1}$. Therefore, the probability that a record is selected at least once is $(1 - e^{-1}) \approx 0.632$, i.e., roughly ⅔. Thus, each bootstrap sample contains about ⅔ of the original records for training.
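The ⅔ figure can be checked numerically with a short sketch. The sample size and the fixed seed are arbitrary choices made for the illustration, and the function names are hypothetical.

```python
import math
import random


def miss_probability(n):
    """Probability that one given record is never drawn in n draws."""
    return (1 - 1 / n) ** n


def bootstrap_sample(n, rng):
    """Draw n record indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]


n = 10_000
rng = random.Random(0)                      # fixed seed for repeatability
sample = bootstrap_sample(n, rng)
in_bag_fraction = len(set(sample)) / n      # share of distinct records drawn
limit = 1 - math.exp(-1)                    # theoretical value, about 0.632
```

The records that never appear in a given bootstrap sample are the out-of-bag records for the tree built on that sample.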

4.2.2. Tree Growing and Splitting

As we know, a decision tree starts with one root node. In the following process, the samples are split into different spaces using one of the features, including monitoring station data (AQI), meteorology data (MD), traffic (TCS), road information (RI) and POI data. Therefore, how to select the feature in each split is of great significance for the performance of a decision tree. Information gain [35] is usually used as the criterion for classifiers. The feature selection for each bootstrap sample is randomized. According to bagging theory, a random forest is a strong classifier based on multiple weak classifiers. Therefore, both the number of records and the number of features of each subset are smaller than those of the original dataset. We need T subsets with m features. According to Breiman’s suggestions [33], m is much less than the total number of features M. Breiman suggests three possible values for m: $\frac{1}{2}\sqrt{M}$, $\sqrt{M}$ and $2\sqrt{M}$. In the evaluation section, we show that four features and 400 subsets are best for our model and dataset. When splitting the dataset, for each feature candidate, entropy is calculated as in Equation (3):

$$Entropy(f) = -\sum_{i=1}^{k} p(c_i) \log_2 p(c_i) \qquad (3)$$

where $c_i$ is the AQI level i specified in Table 1, and the probability $p(c_i)$ is calculated through Equation (2):

$$p(c_i) = \frac{N_i}{\sum_{j=1}^{k} N_j} \qquad (2)$$

where $N_i$ is the quantity of records in AQI level i and k is the number of AQI levels. Therefore, the information gain is defined as shown in Equation (5):

$$Gain = Entropy(f_i) - \sum_{j=1}^{w} \frac{|f_{ij}|}{|f_i|} Entropy(f_{ij}) \qquad (5)$$

where $f_i$ represents the records of the ith level of the tree, $f_{ij}$ are the records in the jth node of the ith level of the tree, and w is the number of nodes in this level. The process of splitting stops when: (a) the number of records in one node falls below a user-defined threshold; or (b) the node is pure, which means all the records fall into one class. For a terminal node that still holds records of several classes, the percentage of each class is calculated, and the predicted class is defined as in Equation (6):

$$c^{*} = \arg\max_{c_i} p(c_i) \qquad (6)$$
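The entropy, information gain and leaf-label computations can be sketched in a few lines. This is the textbook formulation for a single split of a parent node into child nodes, which is an assumption about how the gain is applied here; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children if c)
    return entropy(parent) - weighted


def majority_class(labels):
    """Predicted class of a terminal node: the most frequent label."""
    return Counter(labels).most_common(1)[0][0]
```

A split that separates two equally frequent AQI levels perfectly has a gain of 1 bit, while a split that leaves the class mix unchanged in every child has a gain of 0.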

4.3. Prediction

After all the trees are constructed, the unlabeled data are input into all decision trees. For each tree t, $p_t(c_i)$ is the estimated probability of the AQI level i. The final probability $p'(c_i)$ of the AQI level i in the random forest is defined in Equation (7), where T is the number of decision trees as mentioned before:

$$p'(c_i) = \frac{1}{T} \sum_{t=1}^{T} p_t(c_i) \qquad (7)$$

The final result is determined by Equation (8):

$$\hat{c} = \arg\max_{c_i} p'(c_i) \qquad (8)$$

The pseudocode of the RAQ algorithm is described in Algorithm 1.
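The aggregation step can be sketched as follows, assuming each tree returns one probability per AQI level. The function names and the list-of-lists input format are hypothetical.

```python
def forest_probabilities(tree_probs):
    """Average the per-tree class probabilities.

    tree_probs: list of T lists, each holding one probability per AQI level.
    """
    t = len(tree_probs)
    k = len(tree_probs[0])
    return [sum(p[i] for p in tree_probs) / t for i in range(k)]


def predict_level(tree_probs):
    """Index of the AQI level with the highest averaged probability."""
    avg = forest_probabilities(tree_probs)
    return max(range(len(avg)), key=avg.__getitem__)
```

Averaging probabilities rather than counting hard votes lets a tree that is unsure about a record contribute less to the final decision.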

5. Evaluation

5.1. Dataset

In the experiments, one month of data, from 4 May 2015 to 5 June 2015, was collected, and the following four datasets of Shenyang, all of which are publicly available, are used. In the testing period, we use a total of 2701 records to test the algorithm, and Shenyang is divided into 1258 grids corresponding to 34 rows and 37 columns. Because all the grids belong to the main city area, all data, including meteorology data, traffic data, road information and POI data, are accessible for these grids from our data sources. Air quality data are accessible in the areas covered by air monitoring stations.

5.1.1. Monitoring Station Data

The air quality information from the Shenyang monitoring stations includes the AQI, the concentrations of CO, NO2, SO2, O3, PM10 and PM2.5 (column PM25), and a timestamp. Table 3 shows the format of the monitoring station data. Table 4 shows the locations of all the monitoring stations. All the data are collected from the public website [36], whose data are produced by the National Department of Environmental Protection. We use the Java programming language to access the API hourly and store all the data in a MySQL database.
Table 3

Data samples of monitoring stations.

Station_id  Aqi  CO (μg/m3)  NO2 (μg/m3)  SO2 (μg/m3)  O3 (μg/m3)  PM10 (μg/m3)  PM25 (μg/m3)  Time
747         77   1.802       70           69           63          104           52            2015-05-24 03:00
750         139  2.233       62           70           57          125           106           2015-05-24 03:00
751         82   1.706       73           58           69          100           60            2015-05-24 03:00
741         85   1.942       80           64           43          94            63            2015-05-24 03:00
748         63   1.024       61           62           68          76            37            2015-05-24 04:00
749         67   1.358       60           29           62          81            48            2015-05-24 04:00
742         88   1.646       97           82           12          125           14            2015-05-24 04:00
743         84   0.8086816745                                      117           52            2015-05-24 04:00
744         98   1.718       66           56           43          92            73            2015-05-24 04:00
745         86   1.33378729                                        121           37            2015-05-24 04:00
746         66   1.229       66           24           48          82            45            2015-05-24 04:00
747         63   1.175       58           48           70          75            36            2015-05-24 04:00
Table 4

Locations of monitoring stations.

Station_id  Latitude     Longitude
741         41.841445    123.65436
742         41.758166    123.533761
743         41.71694     123.451378
744         41.788094    123.288852
745         41.838551    123.549754
746         41.855605    123.442396
747         41.773208    123.421573
748         41.785295    123.489395
749         41.79609169  123.4084114
750         41.789429    123.373275
751         41.83933982  123.4126515

5.1.2. Meteorological Data

We collect meteorological data including temperature, humidity, barometric pressure, wind speed and visibility from the public website [37]. As Table 5 illustrates, each record contains temperature (Fmt), barometric pressure (Fmp), humidity (Fmh), wind speed (Fmw) and visibility (Fmv).
Table 5

Meteorological samples.

Temperature (Fmt, °C)   Barometric Pressure (Fmp, mmHg)   Humidity (Fmh, %)   Wind Speed (Fmw, m/s)   Visibility (Fmv, m)   Time
18.8                    748.6                             56                  2                       16.0                  2015-05-14 11:00:00
18.3                    746.4                             50                  7                       26.0                  2015-05-14 08:00:00
17.0                    744.6                             63                  3                       12.0                  2015-05-14 05:00:00
18.4                    743.0                             58                  1                       16.0                  2015-05-14 02:00:00
19.7                    743.9                             63                  1                       18.0                  2015-05-13 23:00:00
18.0                    742.6                             72                  0                       7.0                   2015-05-13 21:00:00

5.1.3. Road and Traffic Data

There are no public websites that offer statistical road and traffic data, so we cannot directly obtain formatted data. However, most map service providers offer online maps and real-time traffic status. They do not publish APIs for third-party developers to access these data, but we can still get some useful hints by analyzing the HTTP requests issued by the map web page. From the map service providers [38,39], we collect the traffic map tiles every hour.

5.1.4. POI

Thanks to the Baidu map and Google map services, we can easily get these data from a public interface. Each POI record contains a name, latitude, longitude, tag and the tile grid in which it is located. Figure 10 shows a sample of the roughly 28,000 records in the MySQL database.
Figure 10

POI Samples in Shenyang.


5.2. Evaluation Method

The most accurate criterion for measuring air quality is the air quality information from monitoring stations. In this experiment, we use the AQI data from monitoring stations as the reference standard. To construct a random forest, we need to determine two parameters: the number of trees and the number of features used to construct each tree. To choose the best parameters, we use the OOB (Out-of-Bag) [33] error to compare RAQ accuracy for different parameter pairs <#features, #trees>, i.e., the number of features used to construct each tree and the number of trees constructed in the random forest. In random forests, the error is estimated internally during the construction of the trees. Each tree is constructed using a different bootstrap sample from the original data, of which about one-third are left out of the bootstrap sample. This left-out third is used as test cases, which are input into the tree to obtain a classification for each case. At the end of the run, take j to be the class that got most of the votes every time case n was oob [40]. The proportion of times that j is not equal to the true class of n, averaged over all cases, is the oob error estimate [40]. The smaller the OOB error, the higher the accuracy of the model. For the number of features, we increase by one each time from 2 to 8 (the total number of features is 8, as specified in the algorithm). For the number of trees, we increase by 100 from 100 to 1000. Because the time consumption grows with the number of trees, we ignore forests with more than 1000 trees, and a step of 100 is suitable for balancing performance and accuracy. To compare this algorithm with others, we use cross-validation to judge the performance.
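The OOB estimate described above can be sketched in plain Python, assuming the per-tree predictions and in-bag index sets are already available. The input format and all names are hypothetical; this is an illustration of the voting rule, not the implementation used in the experiments.

```python
from collections import Counter


def oob_error(tree_predictions, in_bag, labels):
    """Out-of-bag error estimate.

    tree_predictions[t][n]: class predicted by tree t for record n.
    in_bag[t]: set of record indices used to train tree t.
    labels[n]: true class of record n.
    Records that were in-bag for every tree are skipped.
    """
    wrong = counted = 0
    for n, truth in enumerate(labels):
        # Only trees that did not train on record n may vote on it.
        votes = Counter(
            preds[n]
            for preds, bag in zip(tree_predictions, in_bag)
            if n not in bag
        )
        if not votes:
            continue
        counted += 1
        if votes.most_common(1)[0][0] != truth:
            wrong += 1
    return wrong / counted if counted else 0.0


# Candidate <#features, #trees> pairs scanned in the text: 2..8 features,
# 100..1000 trees in steps of 100.
PARAM_GRID = [(m, t) for m in range(2, 9) for t in range(100, 1001, 100)]
```

Scanning the 70 pairs in `PARAM_GRID` and keeping the pair with the lowest OOB error corresponds to the parameter search described in this section.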

5.3. Results

5.3.1. Effects of Parameters on Prediction Error Rate

There are two important factors that affect the performance of a random forest, which are the number of trees and features. Figure 11 shows how the OOB error changes along with the number of features and trees. X-axis is the number of features and Y-axis is the number of trees.
Figure 11

OOB error result distribution.

In our experiment, the number of features takes integer values and the number of trees takes multiples of 100, so only discrete coordinates such as (2,100) and (3,200) are meaningful in this graph. Different colors represent different OOB error values: the darker the color, the smaller the OOB error. As the graph shows, the OOB error is lowest for the parameter pairs <4, 400> and <6, 1000>. Because fewer trees take less time to train, we choose <4, 400> as the best parameter pair.
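The selection rule used here (lowest OOB error, with ties broken in favor of fewer trees) can be written as a short sketch. The function name `pick_parameters` and the dictionary layout are assumptions, and the error values in `example_grid` are placeholders chosen only to exercise the tie-break; they are not the measured values behind Figure 11.

```python
def pick_parameters(oob_by_pair, tolerance=0.0):
    """Return the <#features, #trees> pair with the lowest OOB error.

    Pairs whose error is within `tolerance` of the minimum count as ties;
    ties are broken by fewer trees first, then fewer features.
    """
    best = min(oob_by_pair.values())
    candidates = [pair for pair, err in oob_by_pair.items()
                  if err - best <= tolerance]
    return min(candidates, key=lambda pair: (pair[1], pair[0]))


# Placeholder OOB errors (NOT from the paper), keyed by (features, trees).
example_grid = {
    (2, 100): 0.30,
    (4, 400): 0.20,
    (6, 1000): 0.20,
}
chosen = pick_parameters(example_grid)
```

With these placeholder values the two lowest-error pairs tie, and the rule returns the one with 400 trees.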

5.3.2. Comparison

For the comparison tests, Naïve Bayes, Logistic Regression, Single Decision Tree and ANN are chosen. We use Weka [41] as the tool to conduct all the comparison tests. For Naïve Bayes, there are eight features, which are Fmt, Fmh, Fmp, Fmw, Fmv, Fri, Ftcs and Fpn, and six classification categories (C), which are specified in Table 1. In Weka, this algorithm is denoted as weka.classifiers.bayes.NaiveBayesMultinomial. For Logistic Regression, we choose Multinomial Logistic Regression because there are multiple AQI levels. In Weka, this algorithm is denoted as weka.classifiers.functions.Logistic. For Single Decision Tree, we use all the features to construct one single tree for classification. In Weka, this algorithm is denoted as weka.classifiers.trees.REPTree. For ANN, we choose a back-propagation neural network with one hidden layer for its simplicity and generality. In Weka, this algorithm is denoted as weka.classifiers.functions.MultilayerPerceptron. After implementing the different algorithms, the tests are carried out. Table 6 shows the results of the test cases, in which Y is the number of correct predictions and N is the number of incorrect predictions. The precision is calculated as Y/(Y + N). Figure 12 illustrates how the prediction precision changes as the data size changes. This figure shows that RAQ performs steadily, even when the data size is relatively small. The other algorithms are less accurate at every data size.
Table 6

Precision table of different algorithms.

| Algorithm | Precision | Y | N |
|---|---|---|---|
| Naïve Bayes | 52.1% | 1408 | 1293 |
| Logistic | 66.2% | 1790 | 911 |
| Decision Tree | 77.4% | 2092 | 609 |
| ANN | 71.8% | 1940 | 761 |
| RAQ | 81.5% | 2203 | 498 |
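The precision column can be recomputed from the Y and N counts in Table 6 with the formula Y/(Y + N). The sketch below is only a consistency check; the dictionary name `table6_counts` and the helper `precision` are introduced here for illustration.

```python
# (Y, N) counts copied from Table 6: correct and incorrect predictions.
table6_counts = {
    "Naive Bayes": (1408, 1293),
    "Logistic": (1790, 911),
    "Decision Tree": (2092, 609),
    "ANN": (1940, 761),
    "RAQ": (2203, 498),
}


def precision(correct, incorrect):
    """Precision as defined in the text: Y / (Y + N)."""
    return correct / (correct + incorrect)


recomputed = {name: precision(y, n) for name, (y, n) in table6_counts.items()}
totals = {name: y + n for name, (y, n) in table6_counts.items()}

for name, value in recomputed.items():
    print(f"{name}: {value:.4f} over {totals[name]} test cases")
```

Every row sums to the same number of test cases (2701), and the recomputed values agree with the reported percentages when these are read as truncated to one decimal place.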
Figure 12

Precision changes according to data size.

Besides precision, we also report other measurements, including Recall, F-score, Relative Absolute Error (RAE) and Receiver Operating Characteristic (ROC). Recall (also known as sensitivity) is the number of instances correctly classified as a given class divided by the actual total in that class, that is, the fraction of relevant instances that are retrieved. F-score is a combined measure of precision and recall, calculated as 2∗Precision∗Recall/(Precision + Recall). Relative absolute error is calculated by the following formula:

$$\mathrm{RAE} = \frac{\sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|}{\sum_{i=1}^{N} \left| y_i - \bar{y} \right|}$$

where $\hat{y}_i$ is the estimated value, $y_i$ is the real value, $\bar{y}$ is the average value, and N is the number of test cases. ROC shows how the number of correctly classified positive examples varies with the number of incorrectly classified negative examples [42]. On our dataset, these measurements also show that RAQ performs better than the other algorithms on this specific problem. Table 7 shows the original data of the experimental results and Figure 13 illustrates these data in chart form.
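A minimal sketch of the F-score and RAE calculations follows, assuming the standard definition of relative absolute error (total absolute error divided by the total absolute deviation of the real values from their mean). The function names and the toy inputs are hypothetical and are not taken from the experiments; Weka may aggregate these measures per class, so the sketch is not expected to reproduce the reported figures exactly.

```python
def f_score(precision, recall):
    """Harmonic-mean style combination: 2 * P * R / (P + R)."""
    return 2 * precision * recall / (precision + recall)


def relative_absolute_error(estimated, real):
    """Sum of |estimated - real| divided by sum of |real - mean(real)|."""
    if len(estimated) != len(real):
        raise ValueError("estimated and real must have the same length")
    mean_real = sum(real) / len(real)
    numerator = sum(abs(e - r) for e, r in zip(estimated, real))
    denominator = sum(abs(r - mean_real) for r in real)
    return numerator / denominator


# Toy values for illustration only.
toy_real = [1, 2, 3]
toy_estimated = [1, 2, 4]
toy_rae = relative_absolute_error(toy_estimated, toy_real)  # 1 / 2
toy_f = f_score(0.5, 0.5)
```

An RAE below 100% means the predictions are closer to the real values than a predictor that always outputs the mean.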
Table 7

Indexes of different algorithms.

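| Algorithm | Recall | F-Score | ROC | RAE |
|---|---|---|---|---|
| Naïve Bayes | 0.521 | 0.529 | 0.7 | 84.9% |
| Logistic | 0.663 | 0.649 | 0.785 | 75.8% |
| Decision Tree | 0.775 | 0.769 | 0.888 | 47.4% |
| ANN | 0.718 | 0.707 | 0.829 | 60.9% |
| RAQ | 0.816 | 0.814 | 0.928 | 36.9% |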
Figure 13

Indexes chart of different algorithms.

6. Conclusions

In this paper, using public data from the urban sensing system, our model predicts the AQI of all the regions in Shenyang based on the AQI published by 11 air quality monitoring stations, meteorology data reported by weather stations, road information and real-time traffic status collected from Baidu Map and Google Maps, and the POI distributions provided by Baidu Map and Google Maps. We use a random forest algorithm to predict the AQI of all the regions in the downtown area that are not covered by monitoring stations. In Shenyang, this algorithm achieves an overall precision of 81% for AQI prediction, which outperforms Naïve Bayes, Logistic Regression, single decision tree and ANN. All of these data are directly or indirectly available on the Internet, which suggests that the algorithm could be applied to other cities with little effort. RAQ makes use of historical data for model training but ignores real-time data. Our work will be extended to support online learning so that daily data can be used to improve the performance of the air quality prediction algorithm.