| Literature DB >> 35655087 |
Biao Zhang, Ying Zhang, Xuchu Jiang.
Ozone is one of the most important air pollutants, with significant impacts on human health, regional air quality and ecosystems. In this study, we use geographic information and environmental information of the monitoring site of 5577 regions in the world from 2010 to 2014 as feature input to predict the long-term average ozone concentration of the site. A Bayesian optimization-based XGBoost-RFE feature selection model BO-XGBoost-RFE is proposed, and a variety of machine learning algorithms are used to predict ozone concentration based on the optimal feature subset. Since the selection of the underlying model hyperparameters is involved in the recursive feature selection process, different hyperparameter combinations will lead to differences in the feature subsets selected by the model, so that the feature subsets obtained by the model may not be optimal solutions. We combine the Bayesian optimization algorithm to adjust the parameters of recursive feature elimination based on XGBoost to obtain the optimal parameter combination and the optimal feature subset under the parameter combination. Experiments on long-term ozone concentration prediction on a global scale show that the prediction accuracy of the model after Bayesian optimized XGBoost-RFE feature selection is higher than that based on all features and on feature selection with Pearson correlation. Among the four prediction models, random forest obtained the highest prediction accuracy. The XGBoost prediction model achieved the greatest improvement in accuracy.Entities:
Year: 2022 PMID: 35655087 PMCID: PMC9163069 DOI: 10.1038/s41598-022-13498-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
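The pipeline the abstract describes — recursive feature elimination with a boosted-tree base model, whose hyperparameters are tuned by an outer optimizer, with the candidate subset scored by cross-validated MAE — can be sketched with scikit-learn alone. This is a minimal illustration, not the authors' code: `RandomForestRegressor` stands in for XGBoost, a tiny grid stands in for the Bayesian optimizer, and the data are synthetic.

```python
# Hedged sketch of the XGBoost-RFE idea: an outer search over base-model
# hyperparameters and subset size, with each candidate scored by CV MAE.
# RandomForestRegressor and the small grid are stand-ins, not the paper's setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=20, n_informative=8,
                       noise=5.0, random_state=0)

best = (float("inf"), None, None)  # (CV MAE, params, selected-feature mask)
for n_estimators in (30, 60):          # stand-in for the BO search space
    for n_features in (5, 10):         # candidate subset sizes for RFE
        base = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
        rfe = RFE(base, n_features_to_select=n_features, step=5).fit(X, y)
        mae = -cross_val_score(base, X[:, rfe.support_], y, cv=3,
                               scoring="neg_mean_absolute_error").mean()
        if mae < best[0]:
            best = (mae, {"n_estimators": n_estimators,
                          "n_features": n_features}, rfe.support_)

print(f"best CV MAE = {best[0]:.2f} with {best[1]}")
```

In the paper, the outer loop is a Bayesian optimizer proposing hyperparameter combinations rather than an exhaustive grid, which matters when each evaluation (a full RFE plus cross-validation) is expensive.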
Figure 1 Workflow of XGBoost-RFE.
Figure 2 BO-XGBoost-RFE.
Figure 3 Global distribution of monitoring sites.
Figure 4 Iterative process of Bayesian optimization.
Main hyperparameter ranges and optimized values.
| Hyperparameter | Range | Optimized value |
|---|---|---|
| learning_rate | (0.001, 0.3) | 0.0798 |
| n_estimators | (50, 250) | 134 |
| max_depth | (3, 15) | 8 |
| min_child_weight | (1, 7) | 4 |
| gamma | (0, 1) | 0.676 |
| reg_alpha | (0, 1) | 0.4873 |
| reg_lambda | (0, 1) | 0.2451 |
| colsample_bytree | (0.1, 1) | 0.7144 |
| subsample | (0.1, 1) | 0.823 |
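The table's search space can be expressed directly as bounds per hyperparameter. The sketch below assumes uniform sampling within each range (integer-valued for the discrete parameters) to show what one candidate proposal looks like; the paper does not describe the optimizer's internals, so the sampler here is illustrative only.

```python
# The table's search ranges, with a uniform draw standing in for one
# Bayesian-optimization proposal step (illustrative, not the paper's sampler).
import random
random.seed(0)

space = {
    "learning_rate":    (0.001, 0.3),
    "n_estimators":     (50, 250),
    "max_depth":        (3, 15),
    "min_child_weight": (1, 7),
    "gamma":            (0.0, 1.0),
    "reg_alpha":        (0.0, 1.0),
    "reg_lambda":       (0.0, 1.0),
    "colsample_bytree": (0.1, 1.0),
    "subsample":        (0.1, 1.0),
}
INT_PARAMS = {"n_estimators", "max_depth", "min_child_weight"}

def sample(space):
    """Draw one candidate hyperparameter combination from the bounds."""
    return {k: (random.randint(lo, hi) if k in INT_PARAMS
                else random.uniform(lo, hi))
            for k, (lo, hi) in space.items()}

candidate = sample(space)
print(candidate)
```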
Figure 5 XGBoost-RFE feature selection results: cross-validation MAE under the optimal hyperparameter combination.
Comparison of MAE and feature number before and after Bayesian optimization.
| Model | MAE | Feature number |
|---|---|---|
| BO-XGBoost-RFE | 2.410 | 22 |
| XGBoost-RFE | 2.516 | 29 |
MAE, RMSE and R² of each prediction model.
| Model | MAE (after FS) | RMSE (after FS) | R² (after FS) | MAE (Pearson) | RMSE (Pearson) | R² (Pearson) | MAE (all features) | RMSE (all features) | R² (all features) |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 2.386 | 3.281 | 0.718 | 2.590 | 3.462 | 0.675 | 2.478 | 3.368 | 0.698 |
| RF | 2.374 | 3.206 | 0.720 | 2.500 | 3.380 | 0.690 | 2.407 | 3.266 | 0.710 |
| SVR | 2.676 | 3.631 | 0.659 | 2.912 | 3.871 | 0.583 | 2.677 | 3.620 | 0.636 |
| KNN | 2.801 | 3.808 | 0.606 | 2.873 | 3.837 | 0.601 | 2.846 | 3.834 | 0.601 |
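The three scores in the table are standard regression metrics. A minimal sketch of how they are typically computed with scikit-learn, on toy values (the paper does not list its implementation, and the numbers below are illustrative, not the paper's data):

```python
# Computing MAE, RMSE and R-squared for a regression model's predictions
# with standard scikit-learn metrics (toy values, not the paper's data).
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([30.0, 28.5, 35.2, 31.1, 27.8])   # toy observed ozone values
y_pred = np.array([29.1, 29.3, 33.8, 31.9, 28.4])   # toy model predictions

mae  = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt of MSE
r2   = r2_score(y_true, y_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

Lower MAE and RMSE and higher R² are better, which is the reading applied to the table above.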
Figure 6 Feature importance in random forest.