Xin Ma¹, Tengfei Chen¹, Rubing Ge², Caocao Cui¹, Fan Xu¹, Qi Lv¹.
Abstract
Globally, countries encounter air pollution problems along their development paths. As a significant indicator of air quality, PM2.5 concentration has long been shown to affect population mortality rates. Machine learning algorithms, which have been shown to outperform traditional statistical approaches, are widely used in air pollution prediction. However, research on model selection and on the environmental interpretation of model predictions is still scarce, and is urgently needed to guide policy making on air pollution control. Our research compared four machine learning algorithms (LinearSVR, K-Nearest Neighbor, Lasso regression, and Gradient Boosting) by examining their performance in predicting PM2.5 concentrations across different cities and seasons. The results show that the machine learning models can forecast the next day's PM2.5 concentration from the previous five days' data with good accuracy. The comparative experiments show that, at the city level, the Gradient Boosting model has the best prediction performance, with a mean absolute error (MAE) of 9 μg/m³ and a root mean square error (RMSE) of 10.25-16.76 μg/m³, both lower than those of the other three models; at the season level, all four models perform best in winter and worst in summer. More importantly, the demonstration of the models' different performances in each city and each season carries significant environmental policy implications.
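The four regressors compared in the abstract are all available in scikit-learn. A minimal sketch of such a comparison (synthetic arrays stand in for the paper's real air-quality features; model hyperparameters are illustrative, not the authors'):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                      # stand-in for the features in the table below
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)
X_train, X_test, y_train, y_test = X[:400], X[400:], y[:400], y[400:]

models = {
    "Lasso": Lasso(alpha=0.1),
    "Gradient Boosting": GradientBoostingRegressor(),
    "LinearSVR": LinearSVR(max_iter=10000),
    "KNeighbors": KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE, as reported in the paper
    print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

The same loop structure extends naturally to per-city or per-season splits of the data.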
Keywords: Gradient boosting; Jing-Jin-Ji city group; K-Nearest Neighbor; Lasso regression; Linear SVR; PM2.5 prediction
Year: 2022 PMID: 36185154 PMCID: PMC9519508 DOI: 10.1016/j.heliyon.2022.e10691
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Figure 1. The diagram of the research's processes.
Figure 2. Geographic location of Beijing-Tianjin-Hebei in China.
Figure 3. Sample cities selection criteria.
Feature selection of the dataset.
| Indicator type | Indicators |
|---|---|
| Air quality features | PM10, SO2, NO2, CO, O3, AQI_L5, PM10_L5, SO2_L5, NO2_L5, CO_L5, O3_L5, AQI ranking_L5 |
| Meteorological features | Lowest temperature, highest temperature, wind speed, Lowest temperature_L5, highest temperature_L5, wind speed_L5 |
| Time features | month, year, season, year_L5, month_L5, season_L5 |
| Historical features | PM2.5_L1, PM2.5_L2, PM2.5_L3, PM2.5_L4, PM2.5_L5 |
The suffix _L1 denotes data from one day before; _L5 denotes data from five days before (and similarly for _L2 to _L4).
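Lagged features like PM2.5_L1 through PM2.5_L5 can be built from a daily time series with pandas' `shift`; a sketch on a small hypothetical series (column names follow the table's convention):

```python
import pandas as pd

# hypothetical daily PM2.5 record for one city; the real data span 2013-2019
df = pd.DataFrame({"PM2.5": [50, 60, 55, 70, 65, 80, 75, 90]})

# build PM2.5_L1 ... PM2.5_L5: the value from 1 to 5 days before
for lag in range(1, 6):
    df[f"PM2.5_L{lag}"] = df["PM2.5"].shift(lag)

# rows without a full five-day history are dropped before training
df = df.dropna().reset_index(drop=True)
print(df)
```

The same loop applies to the AQI, pollutant, meteorological, and time features carrying the _L5 suffix.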
Statistical summary of the collected data (2013–2019).
| | Unit | Mean | Variance | Minimum | 25% quantile | Median | 75% quantile | Maximum |
|---|---|---|---|---|---|---|---|---|
| AQI | — | 116.41 | 5156.15 | 16 | 69 | 96 | 138 | 500 |
| PM2.5 | μg/m³ | 78.66 | 4374.42 | 0 | 36 | 59 | 98 | 796 |
| PM10 | μg/m³ | 134.76 | 8786.33 | 0 | 72 | 110 | 168 | 937 |
| SO2 | μg/m³ | 32.60 | 1220.53 | 0 | 11 | 21 | 40 | 437 |
| NO2 | μg/m³ | 48.32 | 559.85 | 0 | 31 | 44 | 61 | 235 |
| CO | mg/m³ | 1.40 | 1.11 | 0 | 0.75 | 1.09 | 1.7 | 18.92 |
| O3 | μg/m³ | 57.98 | 1491.04 | 0 | 26 | 51 | 83 | 234 |
| Lowest temperature | °C | 8.72 | 117.92 | -20 | -2 | 9 | 19 | 29 |
| Highest temperature | °C | 19.05 | 123.20 | -12 | 9 | 20 | 29 | 40 |
| Wind speed | Force (Beaufort scale) | 2.41 | 1.00 | 0 | 2 | 3 | 3 | 8 |
Figure 4. Violin plot of the distribution of the feature values.
Figure 5. Geographical distribution of the training dataset (a) and the testing dataset (b) (2013–2019).
Figure 6. Geographical distribution of the training dataset (a) and testing dataset (b) for each season (2013–2019).
Figure 7. Prediction performance evaluation for the four models at the city level.
Distribution of the prediction errors of each city.
| City | Lasso 90% | Lasso 75% | Gradient Boosting 90% | Gradient Boosting 75% | LinearSVR 90% | LinearSVR 75% | KNeighbors 90% | KNeighbors 75% |
|---|---|---|---|---|---|---|---|---|
| Beijing | -26 to 99 | -22 to 13 | -19 to 27 | -14 to 8 | -55 to 61 | -52 to -20 | -24 to 81 | -18 to 15 |
| Tianjin | -26 to 100 | -18 to 16 | -21 to 36 | -16 to 10 | -54 to 70 | -45 to -13 | -28 to 67 | -20 to 14 |
| Baoding | -20 to 89 | -15 to 15 | -13 to 32 | -11 to 9 | -52 to 47 | -48 to -21 | -20 to 60 | -14 to 15 |
| Cangzhou | -17 to 118 | -13 to 14 | -15 to 32 | -10 to 9 | -48 to 78 | -45 to -19 | -18 to 100 | -14 to 16 |
| Handan | -17 to 77 | -13 to 22 | -13 to 50 | -8 to 17 | -46 to 42 | -41 to -8 | -18 to 94 | -13 to 23 |
| Hengshui | -23 to 52 | -17 to 14 | -16 to 33 | -13 to 7 | -52 to 12 | -48 to -19 | -24 to 61 | -17 to 14 |
| Langfang | -13 to 97 | -11 to 15 | -9 to 40 | -7 to 14 | -46 to 63 | -43 to -17 | -14 to 72 | -11 to 15 |
| Shijiazhuang | -15 to 77 | -12 to 22 | -10 to 44 | -7 to 13 | -44 to 47 | -39 to -9 | -17 to 90 | -13 to 20 |
| Tangshan | -11 to 119 | -9 to 21 | -8 to 74 | -6 to 17 | -45 to 87 | -43 to -11 | -13 to 64 | -9 to 20 |
| Xingtai | -18 to 81 | -13 to 24 | -15 to 46 | -10 to 14 | -45 to 47 | -41 to -9 | -21 to 103 | -14 to 20 |

Columns give the intervals (μg/m³) containing 90% and 75% of each model's prediction errors.
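Central intervals like those in the table (the ranges covering 90% and 75% of the prediction errors) can be read off the error distribution's quantiles; a sketch on a synthetic `errors` array standing in for prediction-minus-observation values:

```python
import numpy as np

rng = np.random.default_rng(1)
errors = rng.normal(loc=2.0, scale=15.0, size=2000)  # stand-in for prediction - observation

def central_interval(err, coverage):
    """Interval containing the central `coverage` fraction of the errors."""
    tail = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(err, [tail, 1.0 - tail])
    return lo, hi

lo90, hi90 = central_interval(errors, 0.90)
lo75, hi75 = central_interval(errors, 0.75)
print(f"90% of errors in [{lo90:.1f}, {hi90:.1f}]")
print(f"75% of errors in [{lo75:.1f}, {hi75:.1f}]")
```

Applied per city and per model, this reproduces the structure of the table above.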
Figure 8. Probability distribution of PM2.5 prediction errors for each city.
Figure 9. Scatter plot of predictions and observations for each season.
Evaluation of the prediction results of the four models for each season.

| Season | Metric | Lasso | Gradient Boosting | LinearSVR | KNeighbors |
|---|---|---|---|---|---|
| Spring | MAE | 12.82 | 8.34 | 33.02 | 13.04 |
| | RMSE | 18.62 | 11.92 | 35.98 | 18.77 |
| | IA | 0.90 | 0.95 | 0.72 | 0.87 |
| | R2 | 0.60 | 0.84 | -0.49 | 0.59 |
| Summer | MAE | 9.02 | 5.47 | 33.04 | 7.98 |
| | RMSE | 11.80 | 7.31 | 34.55 | 10.20 |
| | IA | 0.84 | 0.93 | 0.51 | 0.84 |
| | R2 | 0.40 | 0.77 | -4.18 | 0.55 |
| Autumn | MAE | 12.92 | 8.86 | 29.58 | 12.23 |
| | RMSE | 17.53 | 12.33 | 32.78 | 17.13 |
| | IA | 0.91 | 0.96 | 0.76 | 0.91 |
| | R2 | 0.64 | 0.82 | -0.25 | 0.66 |
| Winter | MAE | 14.09 | 11.27 | 33.74 | 16.48 |
| | RMSE | 19.79 | 16.61 | 38.01 | 23.64 |
| | IA | 0.97 | 0.98 | 0.91 | 0.96 |
| | R2 | 0.90 | 0.93 | 0.65 | 0.86 |
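MAE, RMSE, and R² have standard definitions; the index of agreement (IA) is, assuming the usual convention in air-quality studies, Willmott's d. A sketch computing all four on observed/predicted arrays (the IA formula here is the standard Willmott definition, which the paper is assumed to follow):

```python
import numpy as np

def evaluate(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mae = np.mean(np.abs(pred - obs))
    rmse = np.sqrt(np.mean((pred - obs) ** 2))
    # Willmott's index of agreement (assumed definition of IA)
    denom = np.sum((np.abs(pred - obs.mean()) + np.abs(obs - obs.mean())) ** 2)
    ia = 1.0 - np.sum((pred - obs) ** 2) / denom
    # coefficient of determination; can be negative for poor fits
    # (cf. the LinearSVR rows in the table above)
    r2 = 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return mae, rmse, ia, r2

mae, rmse, ia, r2 = evaluate([40, 60, 80, 100], [42, 58, 85, 97])
```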
Model construction time and occupied memory size.
| Model | Model construction time (s) | Occupied memory size (KB) |
|---|---|---|
| Lasso | 1.02 | 8.78 |
| Gradient Boosting | 2.68 | 378 |
| LinearSVR | 8.07 | 3.81 |
| KNeighbors | 5.04 | 6010.88 |
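Figures like those in the table above can be obtained by timing `fit` and using the pickled model size as a memory proxy; a rough sketch (the dataset is synthetic and the numbers will differ by machine and data):

```python
import pickle
import time

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.normal(size=1000)

model = Lasso(alpha=0.1)
start = time.perf_counter()
model.fit(X, y)
elapsed = time.perf_counter() - start        # model construction time in seconds

size_kb = len(pickle.dumps(model)) / 1024    # serialized model size in KB
print(f"fit time: {elapsed:.3f} s, size: {size_kb:.2f} KB")
```

The large KNeighbors footprint in the table is consistent with KNN storing the training data itself, whereas Lasso and LinearSVR keep only a coefficient vector.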