Xin Ma¹, Tengfei Chen¹, Rubing Ge², Caocao Cui¹, Fan Xu¹, Qi Lv¹.
Abstract
Globally, countries encounter air pollution problems along their development paths. As a significant indicator of air quality, PM2.5 concentration has long been shown to affect population mortality rates. Machine learning algorithms, which have been shown to outperform traditional statistical approaches, are widely used in air pollution prediction. However, research on model selection and on the environmental interpretation of model predictions is still scarce, and is urgently needed to guide policy making on air pollution control. Our research compared four machine learning algorithms (LinearSVR, K-Nearest Neighbor, Lasso regression, and Gradient Boosting) by examining their performance in predicting PM2.5 concentrations across different cities and seasons. The results show that the machine learning models can forecast the next day's PM2.5 concentration from the previous five days' data with good accuracy. The comparative experiments show that, at the city level, the Gradient Boosting model has the best prediction performance, with a mean absolute error (MAE) of 9 μg/m³ and a root mean square error (RMSE) of 10.25-16.76 μg/m³, both lower than those of the other three models; at the season level, all four models perform best in winter and worst in summer. More importantly, the demonstration of the models' different performances in each city and each season carries significant environmental policy implications.
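The four regressors compared in the abstract are all available in scikit-learn. A minimal sketch of such a comparison (synthetic arrays stand in for the paper's real air-quality features; model hyperparameters are illustrative, not the authors'):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # stand-in for the features in the table below
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

models = {
    "Lasso": Lasso(alpha=0.1),
    "Gradient Boosting": GradientBoostingRegressor(),
    "LinearSVR": LinearSVR(max_iter=10000),
    "KNeighbors": KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE, as reported in the paper
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

The same loop structure extends naturally to per-city or per-season splits of the data.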
Keywords: Gradient boosting; Jing-Jin-Ji city group; K-Nearest Neighbor; Lasso regression; Linear SVR; PM2.5 prediction
Year: 2022 PMID: 36185154 PMCID: PMC9519508 DOI: 10.1016/j.heliyon.2022.e10691
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. The diagram of the research's processes.
Figure 2. Geographic location of Beijing-Tianjin-Hebei in China.
Figure 3. Sample cities selection criteria.
Feature selection of the dataset.
| Indicator type | Indicators |
|---|---|
| Air quality features | PM10, SO2, NO2, CO, O3, AQI_L5, PM10_L5, SO2_L5, NO2_L5, CO_L5, O3_L5, AQI ranking_L5 |
| Meteorological features | Lowest temperature, highest temperature, wind speed, Lowest temperature_L5, highest temperature_L5, wind speed_L5 |
| Time features | month, year, season, year_L5, month_L5, season_L5 |
| Historical features | PM2.5_L1, PM2.5_L2, PM2.5_L3, PM2.5_L4, PM2.5_L5 |
The suffix _L1 denotes data from one day before; _L5 denotes data from five days before (and similarly for _L2 to _L4).
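Lagged features like PM2.5_L1 through PM2.5_L5 can be built from a daily time series with pandas' `shift`; a sketch on a small hypothetical series (column names follow the table's convention):

```python
import pandas as pd

# hypothetical daily PM2.5 record for one city; the real data span 2013-2019
df = pd.DataFrame({"PM2.5": [50, 60, 55, 70, 65, 80, 75, 90]})

# build PM2.5_L1 ... PM2.5_L5: the value from 1 to 5 days before
for lag in range(1, 6):
    df[f"PM2.5_L{lag}"] = df["PM2.5"].shift(lag)

# rows without a full five-day history are dropped before training
df = df.dropna().reset_index(drop=True)
print(df)
```

The same loop applies to the AQI, pollutant, meteorological, and time features carrying the _L5 suffix.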
Statistical summary of the collected data (2013–2019).
| | Unit | Mean | Variance | Minimum | 25% quantile | Median | 75% quantile | Maximum |
|---|---|---|---|---|---|---|---|---|
| AQI | — | 116.41 | 5156.15 | 16 | 69 | 96 | 138 | 500 |
| PM2.5 | μg/m³ | 78.66 | 4374.42 | 0 | 36 | 59 | 98 | 796 |
| PM10 | μg/m³ | 134.76 | 8786.33 | 0 | 72 | 110 | 168 | 937 |
| SO2 | μg/m³ | 32.60 | 1220.53 | 0 | 11 | 21 | 40 | 437 |
| NO2 | μg/m³ | 48.32 | 559.85 | 0 | 31 | 44 | 61 | 235 |
| CO | mg/m³ | 1.40 | 1.11 | 0 | 0.75 | 1.09 | 1.7 | 18.92 |
| O3 | μg/m³ | 57.98 | 1491.04 | 0 | 26 | 51 | 83 | 234 |
| Lowest temperature | °C | 8.72 | 117.92 | -20 | -2 | 9 | 19 | 29 |
| Highest temperature | °C | 19.05 | 123.20 | -12 | 9 | 20 | 29 | 40 |
| Wind speed | Force (Beaufort scale) | 2.41 | 1.00 | 0 | 2 | 3 | 3 | 8 |
Figure 4. Violin plot of the distribution of the feature values.
Figure 5. Geographical distribution of the training dataset (a) and the testing dataset (b) (2013–2019).
Figure 6. Geographical distribution of the training dataset (a) and testing dataset (b) for each season (2013–2019).
Figure 7. Prediction performance evaluation for the four models at the city level.
Distribution of the prediction errors of each city.
| City | Lasso 90% | Lasso 75% | Gradient Boosting 90% | Gradient Boosting 75% | LinearSVR 90% | LinearSVR 75% | KNeighbors 90% | KNeighbors 75% |
|---|---|---|---|---|---|---|---|---|
| Beijing | -26 to 99 | -22 to 13 | -19 to 27 | -14 to 8 | -55 to 61 | -52 to -20 | -24 to 81 | -18 to 15 |
| Tianjin | -26 to 100 | -18 to 16 | -21 to 36 | -16 to 10 | -54 to 70 | -45 to -13 | -28 to 67 | -20 to 14 |
| Baoding | -20 to 89 | -15 to 15 | -13 to 32 | -11 to 9 | -52 to 47 | -48 to -21 | -20 to 60 | -14 to 15 |
| Cangzhou | -17 to 118 | -13 to 14 | -15 to 32 | -10 to 9 | -48 to 78 | -45 to -19 | -18 to 100 | -14 to 16 |
| Handan | -17 to 77 | -13 to 22 | -13 to 50 | -8 to 17 | -46 to 42 | -41 to -8 | -18 to 94 | -13 to 23 |
| Hengshui | -23 to 52 | -17 to 14 | -16 to 33 | -13 to 7 | -52 to 12 | -48 to -19 | -24 to 61 | -17 to 14 |
| Langfang | -13 to 97 | -11 to 15 | -9 to 40 | -7 to 14 | -46 to 63 | -43 to -17 | -14 to 72 | -11 to 15 |
| Shijiazhuang | -15 to 77 | -12 to 22 | -10 to 44 | -7 to 13 | -44 to 47 | -39 to -9 | -17 to 90 | -13 to 20 |
| Tangshan | -11 to 119 | -9 to 21 | -8 to 74 | -6 to 17 | -45 to 87 | -43 to -11 | -13 to 64 | -9 to 20 |
| Xingtai | -18 to 81 | -13 to 24 | -15 to 46 | -10 to 14 | -45 to 47 | -41 to -9 | -21 to 103 | -14 to 20 |

Columns give the intervals (μg/m³) containing 90% and 75% of each model's prediction errors.
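Central intervals like those in the table (the ranges covering 90% and 75% of the prediction errors) can be read off the error distribution's quantiles; a sketch on a synthetic `errors` array standing in for prediction-minus-observation values:

```python
import numpy as np

rng = np.random.default_rng(1)
errors = rng.normal(loc=2.0, scale=15.0, size=2000)  # stand-in for prediction - observation

def central_interval(err, coverage):
    """Interval containing the central `coverage` fraction of the errors."""
    tail = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(err, [tail, 1.0 - tail])
    return lo, hi

lo90, hi90 = central_interval(errors, 0.90)
lo75, hi75 = central_interval(errors, 0.75)
print(f"90% of errors in [{lo90:.1f}, {hi90:.1f}]")
print(f"75% of errors in [{lo75:.1f}, {hi75:.1f}]")
```

Applied per city and per model, this reproduces the structure of the table above.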
Figure 8. Probability distribution of PM2.5 prediction errors for each city.
Figure 9. Scatter plot of predictions and observations for each season.
Evaluation of the prediction results of the four models for each season.

| Season | Metric | Lasso | Gradient Boosting | LinearSVR | KNeighbors |
|---|---|---|---|---|---|
| Spring | MAE | 12.82 | 8.34 | 33.02 | 13.04 |
| | RMSE | 18.62 | 11.92 | 35.98 | 18.77 |
| | IA | 0.90 | 0.95 | 0.72 | 0.87 |
| | R2 | 0.60 | 0.84 | -0.49 | 0.59 |
| Summer | MAE | 9.02 | 5.47 | 33.04 | 7.98 |
| | RMSE | 11.80 | 7.31 | 34.55 | 10.20 |
| | IA | 0.84 | 0.93 | 0.51 | 0.84 |
| | R2 | 0.40 | 0.77 | -4.18 | 0.55 |
| Autumn | MAE | 12.92 | 8.86 | 29.58 | 12.23 |
| | RMSE | 17.53 | 12.33 | 32.78 | 17.13 |
| | IA | 0.91 | 0.96 | 0.76 | 0.91 |
| | R2 | 0.64 | 0.82 | -0.25 | 0.66 |
| Winter | MAE | 14.09 | 11.27 | 33.74 | 16.48 |
| | RMSE | 19.79 | 16.61 | 38.01 | 23.64 |
| | IA | 0.97 | 0.98 | 0.91 | 0.96 |
| | R2 | 0.90 | 0.93 | 0.65 | 0.86 |
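MAE, RMSE, and R² have standard definitions; the index of agreement (IA) is, assuming the usual convention in air-quality studies, Willmott's d. A sketch computing all four on observed/predicted arrays (the IA formula here is the standard Willmott definition, which the paper is assumed to follow):

```python
import numpy as np

def evaluate(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mae = np.mean(np.abs(pred - obs))
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    # Willmott's index of agreement (assumed definition of IA)
    denom = np.sum((np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    ia = 1.0 - np.sum((pred - obs) ** 2) / denom
    # coefficient of determination; can be negative for poor fits
    # (cf. the LinearSVR rows in the table above)
    r2 = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return mae, rmse, ia, r2

mae, rmse, ia, r2 = evaluate([40, 60, 80, 100], [42, 58, 85, 97])
```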
Model construction time and occupied memory size.
| Model | Model construction time (s) | Occupied memory size (KB) |
|---|---|---|
| Lasso | 1.02 | 8.78 |
| Gradient Boosting | 2.68 | 378 |
| LinearSVR | 8.07 | 3.81 |
| KNeighbors | 5.04 | 6010.88 |
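Figures like those in the table above can be obtained by timing `fit` and using the pickled model size as a memory proxy; a rough sketch (the dataset is synthetic and the numbers will differ by machine and data):

```python
import pickle
import time

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)

model = Lasso(alpha=0.1)
start = time.perf_counter()
model.fit(X, y)
elapsed = time.perf_counter() - start        # model construction time in seconds

size_kb = len(pickle.dumps(model)) / 1024    # serialized model size in KB
print(f"fit time: {elapsed:.3f} s, size: {size_kb:.2f} KB")
```

The large KNeighbors footprint in the table is consistent with KNN storing the training data itself, whereas Lasso and LinearSVR keep only a coefficient vector.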