Wenhua Yu1, Shanshan Li1, Tingting Ye1, Rongbin Xu1, Jiangning Song2, Yuming Guo1.
Abstract
BACKGROUND: Accurate estimation of historical PM2.5 (particulate matter with an aerodynamic diameter of less than 2.5 μm) is essential for environmental health risk assessment.
Year: 2022 PMID: 35254864 PMCID: PMC8901043 DOI: 10.1289/EHP9752
Source DB: PubMed Journal: Environ Health Perspect ISSN: 0091-6765 Impact factor: 11.035
Figure 1. The framework of the DEML algorithm. The predictions of the base models are combined into a matrix with one row per record and one column per base model; l represents the original features; h denotes the number of meta-models. The predictions of the meta-models are combined in the same way into a matrix with one row per record and h columns. This second matrix is the input to the NNLS algorithm, which yields the weights of the meta-models and the final prediction. k is the number of folds for CV, and the same validation rows are selected for the base and meta-models. Note: CV, cross-validation analysis; DEML, the three-stage stacked deep ensemble machine learning method; GBM, gradient boosting machine; GLM, generalized linear model; NNLS, nonnegative least squares algorithm; RF, random forest; SVM, support vector machine; XGBoost, extreme gradient boosting.
The descriptive statistics for the daily average PM2.5 (micrograms per cubic meter) from 2015 to 2019 based on 113 air quality stations in Italy.
| Year | Mean | SD | P2.5 | P25 | P50 | P75 | P97.5 |
|---|---|---|---|---|---|---|---|
| 2015 | 24.38 | 28.40 | 4.08 | 11.76 | 18.76 | 31.12 | 70.99 |
| 2016 | 20.34 | 18.68 | 3.12 | 10.08 | 15.90 | 24.94 | 63.25 |
| 2017 | 22.61 | 20.11 | 3.12 | 10.08 | 16.38 | 28.27 | 74.87 |
| 2018 | 20.78 | 15.18 | 4.08 | 11.04 | 16.86 | 26.84 | 55.50 |
| 2019 | 17.25 | 18.22 | 3.12 | 8.16 | 13.05 | 20.18 | 55.50 |
| Total | 21.11 | 20.81 | 3.36 | 10.08 | 15.90 | 25.89 | 65.18 |
Note: P2.5, P25, P50, P75, and P97.5 are the 2.5th, 25th, 50th, 75th, and 97.5th percentiles of PM2.5 concentrations in the study period, respectively. PM2.5, particulate matter with aerodynamic diameter ≤2.5 μm; SD, standard deviation.
PM2.5 prediction performance of the DEML model and five benchmark models from 2015 to 2019 in Italy.
| Year | Measurement | GBM | SVM | RF | XGBoost | SL | DEML |
|---|---|---|---|---|---|---|---|
| 2015 | R² | 0.69 | 0.79 | 0.85 | 0.81 | 0.85 | 0.89 |
| | RMSE (μg/m³) | 9.25 | 6.42 | 6.49 | 7.23 | 6.47 | 5.54 |
| 2016 | R² | 0.72 | 0.80 | 0.84 | 0.81 | 0.84 | 0.87 |
| | RMSE (μg/m³) | 7.74 | 6.51 | 5.84 | 6.33 | 5.82 | 5.18 |
| 2017 | R² | 0.74 | 0.81 | 0.85 | 0.81 | 0.85 | 0.89 |
| | RMSE (μg/m³) | 8.20 | 7.19 | 6.41 | 7.09 | 6.38 | 5.37 |
| 2018 | R² | 0.70 | 0.78 | 0.86 | 0.82 | 0.86 | 0.89 |
| | RMSE (μg/m³) | 7.44 | 6.22 | 5.18 | 5.69 | 5.13 | 4.43 |
| 2019 | R² | 0.68 | 0.76 | 0.84 | 0.79 | 0.84 | 0.87 |
| | RMSE (μg/m³) | 7.34 | 6.42 | 5.13 | 5.78 | 5.12 | 4.55 |
| Total | R² | 0.51 | 0.76 | 0.83 | 0.70 | 0.83 | 0.87 |
| | RMSE (μg/m³) | 10.4 | 7.42 | 6.23 | 8.20 | 6.23 | 5.38 |
Note: DEML, the three-stage stacked deep ensemble machine learning method; GBM, gradient boosting machine; PM2.5, particulate matter with aerodynamic diameter ≤2.5 μm; R², coefficient of determination for unseen independent data; RF, random forest; RMSE, root mean square error; SL, super learner algorithm; SVM, support vector machine; XGBoost, extreme gradient boosting.
SL was constructed from four machine learning models (GBM, SVM, RF, and XGBoost), using a nonnegative least squares (NNLS) approach to obtain the optimal weights.
DEML was a three-stage stacked ensemble model constructed from four base models (GBM, SVM, RF, and XGBoost), three second-level models (RF, XGBoost, and GLM), and an NNLS algorithm.
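The NNLS stage shared by SL and DEML combines model predictions using weights constrained to be nonnegative. The sketch below illustrates that step alone, under stated assumptions: the prediction columns are made-up numbers, the function names are hypothetical, and the solver is a plain projected-gradient loop used as a dependency-free stand-in for a proper NNLS routine (for example `scipy.optimize.nnls`). It is not the authors' implementation.

```python
def nnls_weights(columns, y, iters=5000):
    """Approximate argmin_w ||sum_j w_j * columns[j] - y||^2 subject to w_j >= 0.

    Projected gradient descent: take a gradient step, then clip negatives to 0.
    Assumes at least one column is nonzero.
    """
    m = len(columns)
    # Gram matrix G = A^T A and vector b = A^T y, where A has the given columns.
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    b = [sum(p * t for p, t in zip(ci, y)) for ci in columns]
    # trace(G) bounds the largest eigenvalue of G, so 1/trace is a safe step size.
    step = 1.0 / sum(gram[j][j] for j in range(m))
    w = [0.0] * m
    for _ in range(iters):
        grad = [sum(gram[j][k] * w[k] for k in range(m)) - b[j] for j in range(m)]
        w = [max(0.0, w[j] - step * grad[j]) for j in range(m)]
    return w


def combine(columns, weights):
    """Weighted sum of the prediction columns, one value per record."""
    return [sum(wj * col[i] for wj, col in zip(weights, columns))
            for i in range(len(columns[0]))]


# Hypothetical out-of-fold predictions from three second-level models.
meta_rf = [10.0, 20.0, 15.0, 30.0, 25.0]
meta_xgb = [12.0, 18.0, 17.0, 27.0, 26.0]
meta_glm = [14.0, 16.0, 16.0, 22.0, 21.0]
observed = [11.0, 19.0, 16.0, 29.0, 25.0]

weights = nnls_weights([meta_rf, meta_xgb, meta_glm], observed)
final_prediction = combine([meta_rf, meta_xgb, meta_glm], weights)
print([round(w, 3) for w in weights])
```

The nonnegativity constraint means a model whose predictions would only help with a negative coefficient is dropped (weight 0) rather than subtracted.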
Figure 2. The prediction performance of the DEML model in different seasons of 2015–2019 in Italy. The x-axis indicates the observed daily PM2.5 at the monitoring stations; the y-axis indicates the PM2.5 estimated by the DEML model; the points represent the corresponding PM2.5 for both observed and predicted values. The solid line is the regression line of the observed and predicted PM2.5, fitted by simple linear regression. R² is the coefficient of determination for the unseen independent data. (A) Overall performance. (B) Spring (March to May). (C) Summer (June to August). (D) Autumn (September to November). (E) Winter (December to February). Note: DEML, the three-stage stacked deep ensemble machine learning method; PM2.5, particulate matter with an aerodynamic diameter ≤2.5 μm; RMSE, the root mean square error (micrograms per cubic meter).
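The agreement statistics named in this caption can be sketched as follows. This is a minimal illustration with made-up numbers, not the authors' code. The function names are hypothetical, and because the caption does not say which definition of the coefficient of determination was used, the sketch assumes 1 − SS_res/SS_tot.

```python
import math


def rmse(observed, predicted):
    """Root mean square error, in the same unit as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def fit_line(x, y):
    """Least-squares slope and intercept of y regressed on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical daily values in micrograms per cubic meter.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

slope, intercept = fit_line(observed, predicted)  # predicted regressed on observed
print(round(r_squared(observed, predicted), 3), round(rmse(observed, predicted), 3))
print(round(slope, 3), round(intercept, 3))
```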
The performance of the DEML model and five benchmark models in spatial and temporal cross-validation from 2015 to 2019 in Italy.
| Type | Measurement | GBM | SVM | RF | XGBoost | SL | DEML |
|---|---|---|---|---|---|---|---|
| Spatial CV | R² | 0.54 | 0.61 | 0.89 | 0.73 | 0.89 | 0.90 |
| | RMSE (μg/m³) | 10.26 | 9.70 | 5.33 | 8.02 | 5.33 | 4.84 |
| Cluster spatial CV | R² | 0.37 | 0.49 | 0.50 | 0.43 | 0.50 | 0.79 |
| | RMSE (μg/m³) | 11.64 | 12.69 | 10.45 | 11.23 | 10.45 | 7.46 |
| Temporal CV | R² | 0.24 | 0.49 | 0.96 | 0.36 | 0.96 | 0.96 |
| | RMSE (μg/m³) | 12.28 | 10.61 | 3.30 | 10.83 | 3.30 | 2.84 |
Notes: CV, cross-validation; DEML, the three-stage stacked deep ensemble machine learning method; GBM, gradient boosting machine; PM2.5, particulate matter with aerodynamic diameter ≤2.5 μm; R², coefficient of determination for the spatial and temporal cross-validation; RF, random forest; RMSE, the root mean square error; SL, super learner algorithm; SVM, support vector machine; XGBoost, extreme gradient boosting.
The spatial and temporal CV were conducted for both the base models and the meta-models, using the same data splits.
Spatial CV: 5% of the monitors were randomly selected; the observations from these monitors were used as testing data and all others as training data. The process was repeated 20 times.
Cluster spatial CV: the observations from all ground monitors in the same region were selected together as testing data, with all others used as training data. The process was repeated seven times because seven regions were involved.
Temporal CV: for each year, the last 7 days were used as testing data and the remaining days as training data. The process was repeated five times.
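The three hold-out schemes described above can be sketched as split generators. This is a minimal sketch, not the authors' code. The function names, station identifiers, and region assignment are hypothetical; rounding 5% of the monitors to a whole number is an assumption; and the temporal split assumes the last 7 calendar days of each year form the test period, with the rest of that year used for training.

```python
import random
from datetime import date, timedelta


def spatial_cv_splits(monitors, fraction=0.05, repeats=20, seed=0):
    """Repeatedly hold out a random fraction of monitors as the test set."""
    rng = random.Random(seed)
    n_test = max(1, round(len(monitors) * fraction))  # rounding rule is assumed
    for _ in range(repeats):
        test = set(rng.sample(monitors, n_test))
        train = [m for m in monitors if m not in test]
        yield train, sorted(test)


def cluster_cv_splits(region_by_monitor):
    """Hold out all monitors of one region at a time."""
    for region in sorted(set(region_by_monitor.values())):
        test = [m for m, r in region_by_monitor.items() if r == region]
        train = [m for m, r in region_by_monitor.items() if r != region]
        yield region, train, test


def temporal_cv_splits(years, holdout_days=7):
    """For each year, yield (year, train date range, test date range)."""
    for year in years:
        last_day = date(year, 12, 31)
        test_start = last_day - timedelta(days=holdout_days - 1)
        train_range = (date(year, 1, 1), test_start - timedelta(days=1))
        yield year, train_range, (test_start, last_day)


# Hypothetical identifiers: 113 stations spread over seven regions.
monitors = [f"station_{i:03d}" for i in range(113)]
region_by_monitor = {m: f"region_{i % 7}" for i, m in enumerate(monitors)}

spatial = list(spatial_cv_splits(monitors))
cluster = list(cluster_cv_splits(region_by_monitor))
temporal = list(temporal_cv_splits(range(2015, 2020)))
print(len(spatial), len(cluster), len(temporal))
```

Holding out whole monitors or whole regions, rather than random rows, keeps observations from a test location out of the training data, which is why the cluster spatial CV is the hardest of the three settings in the table.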
Figure 3. The estimated annual average concentrations of particulate matter with an aerodynamic diameter ≤2.5 μm (PM2.5) (micrograms per cubic meter) from 2015 to 2019 in Italy at spatial resolution.
Figure 4. The prediction performance of the DEML models with and without AOD as a predictor from 2015 to 2019 in Italy. The x-axis indicates the observed daily PM2.5 at the monitoring stations; the y-axis indicates the PM2.5 estimated by the DEML model. The points represent the corresponding PM2.5 for both observed and predicted values. The solid line is the regression line of the observed and predicted PM2.5, fitted by simple linear regression. R² is the coefficient of determination for the unseen independent data. (A) The DEML prediction model including AOD. (B) The DEML prediction model without AOD. Note: DEML, the three-stage stacked deep ensemble machine learning method; PM2.5, particulate matter with aerodynamic diameter ≤2.5 μm; RMSE, the root mean square error (micrograms per cubic meter).