Abstract
Many epidemiological studies have evaluated the accuracy of machine learning models in predicting particulate number (PN) and black carbon (BC) concentrations. However, few studies have investigated whether machine learning can predict pollutant concentrations from unrefined mobile measurement data, or examined the reliability of the resulting prediction models. Additionally, researchers are moving away from fixed-site data in favor of mobile monitoring data collected at a variety of locations to develop hourly empirical models of particulate air pollution. This study compared long-term (daily average) and short-term (hourly average and 1 s unrefined data) model performance under three classes of cross validation: random, spatial, and spatial-temporal. The study used secondary data describing BC and PN levels in the rural location of Blacksburg (VA). Our results show that the model based on unrefined data detected pollutant hot spots with accuracy similar to that of the aggregated models. Moreover, performance improved when temporal data were added to the model: the 10-fold MAE values for BC and PN were 0.44 μg/m3 and 3391 pt/cm3, respectively, for the unrefined (1 s) data model. These findings add to the literature on the relationship between data (pre)processing and the efficacy of machine learning models in predicting pollution levels, and they improve our understanding of more reliable validation strategies.
Keywords: air pollution; black carbon; land use regression; machine learning; particulate number; spatial and temporal variation
Year: 2022 PMID: 36011733 PMCID: PMC9408314 DOI: 10.3390/ijerph191610098
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. Sampling bicycle routes [28].
Summary of the variables and cross-validation (CV) schemes used to develop the daily average, one second, and hourly average empirical models.
| Pollutant | LUR Models | Input Variables | Temporal Resolution | Number of Observations | Input: Land Use, Transportation, Natural Environment | Input: Weather | Input: Hour of the Day | CV: Random_Holdout | CV: Spatial_Holdout | CV: Spatial_Temporal_Holdout |
|---|---|---|---|---|---|---|---|---|---|---|
| BC: µg/m3, PN: particles/cm3 | Daily Average | Lu | 12 h | 423 | ☑ | | | ☑ | ☑ | |
| | Hourly Average | Lu | 1 h | 5076 | ☑ | | | ☑ | ☑ | |
| | Hourly Average | Lu_W_Hr | 1 h | 5076 | ☑ | ☑ | ☑ | ☑ | ☑ | |
| | One second | Lu | 1 s | BC: 319,490; PN: 354,717 | ☑ | | | ☑ | ☑ | ☑ |
| | One second | Lu_W_Hr | 1 s | BC: 319,490; PN: 354,717 | ☑ | ☑ | ☑ | ☑ | ☑ | ☑ |
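
The three hold-out schemes in the table above differ only in how observations are grouped into folds. The following is a minimal sketch, not the authors' implementation: it assumes that spatial hold-out withholds one spatial cluster at a time and that spatial-temporal hold-out withholds one (cluster, hour) combination at a time. The function names, the toy records, and that reading of "spatial-temporal" are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def random_folds(n_obs, k=10, seed=0):
    """Random hold-out: shuffle observation indices and deal them into k folds."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def group_folds(group_keys):
    """Group hold-out: observations sharing a key are always held out together."""
    groups = defaultdict(list)
    for i, key in enumerate(group_keys):
        groups[key].append(i)
    return list(groups.values())

# Toy records (hypothetical): (spatial cluster id, hour of day).
records = [(c, h) for c in range(3) for h in (8, 17) for _ in range(4)]

random_cv = random_folds(len(records), k=4)
spatial_cv = group_folds([c for c, _ in records])   # one fold per cluster
spatial_temporal_cv = group_folds(records)          # one fold per (cluster, hour)
```

With grouped folds, a model is never tested on a cluster (or cluster-hour) it has seen during training, which is why such schemes tend to give more conservative error estimates than random splitting.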
Figure 2. The 17 identified spatial clusters.
Summary of Pollutant Descriptive Statistics.
| Pollutant | Models | Count | Min | Q1 | Q3 | Max | Mean | Median | S.D. |
|---|---|---|---|---|---|---|---|---|---|
| Black Carbon (μg/m3) | Daily average | 423 | 0.17 | 0.47 | 0.81 | 1.63 | 0.67 | 0.62 | 0.27 |
| Black Carbon (μg/m3) | Hourly average | 5074 | −0.64 | 0.41 | 0.95 | 16.76 | 0.74 | 0.63 | 0.59 |
| Black Carbon (μg/m3) | One second | 319,489 | −6.56 | 0.34 | 1.21 | 108.01 | 1.08 | 0.69 | 2.49 |
| Particle Number (pt/cm3) | Daily average | 423 | 4305 | 5430 | 6925 | 14,382 | 6388 | 5924 | 1491 |
| Particle Number (pt/cm3) | Hourly average | 5074 | 1104 | 4996 | 7679 | 41,281 | 6834 | 6044 | 3205 |
| Particle Number (pt/cm3) | One second | 354,715 | 4 | 3420 | 11,338 | 4,447,494 | 10,293 | 5950 | 30,139 |
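
The narrowing range from the one second rows to the hourly and daily rows is what time averaging does to spiky data. As a minimal sketch, assuming simple binning by clock hour (the study's actual aggregation procedure is not described in this record, and the sample values below are made up):

```python
from collections import defaultdict
from statistics import mean

def hourly_means(samples):
    """Average (timestamp_seconds, value) pairs within each whole hour."""
    buckets = defaultdict(list)
    for t_sec, value in samples:
        buckets[t_sec // 3600].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}

# Hypothetical 1 s readings: (seconds since start, BC in μg/m3), with one spike.
samples = [(0, 0.5), (1, 0.7), (2, 6.0), (3600, 0.4), (3601, 0.6)]
averaged = hourly_means(samples)
```

The 6.0 spike dominates the raw maximum but is diluted in the first hourly mean, mirroring how the maxima in the table shrink as the averaging window grows.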
Summary of model performance for each pollutant and model type in the random CV.
| Models | Input Variables | BC MAE (μg/m3) | BC RMSE (μg/m3) | PN MAE (pt/cm3) | PN RMSE (pt/cm3) |
|---|---|---|---|---|---|
| Daily average | Lu | 0.11 | 0.15 | 515.21 | 765.48 |
| Hourly average | Lu | 0.30 | 0.42 | 1698.33 | 2613.62 |
| Hourly average | Lu_W_Hr | 0.23 | 0.34 | 1204.00 | 2038.31 |
| One second | Lu | 0.78 | 2.35 | 6590.91 | 31,280.53 |
| One second | Lu_W_Hr | 0.44 | 1.18 | 3391.27 | 28,854.74 |
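
MAE and RMSE are reported side by side because they respond differently to large errors. The sketch below uses the standard textbook definitions (not code from the study) and made-up numbers to show how a single large miss inflates RMSE far more than MAE, a pattern consistent with the wide MAE/RMSE gap in the one second PN rows.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: squaring weights large residuals more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Illustrative values only: three perfect predictions and one large miss.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 12.0]

error_mae = mae(observed, predicted)    # 8 / 4 = 2.0
error_rmse = rmse(observed, predicted)  # sqrt(64 / 4) = 4.0
```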
Figure 3. Model-estimated standardized concentrations from the three time-averaging models for each pollutant.
Figure 4. Model-estimated standardized concentrations from the hourly average and one second models for select hours of day.
Comparison between Machine Learning and Stepwise Regression Performance for Each Pollutant.
| Pollutant | Models | Stepwise R2 | ML R2 |
|---|---|---|---|
| BC: µg/m3 | Daily_average (Lu) | 0.58 | 0.734 |
| BC: µg/m3 | Hourly_average (Lu + W + Hr) | 0.27 | 0.54 |
| PN: particles/cm3 | Daily_average (Lu) | 0.7 | 0.78 |
| PN: particles/cm3 | Hourly_average (Lu + W + Hr) | 0.42 | 0.54 |
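
To make the comparison above concrete, the absolute R2 gain of machine learning over stepwise regression can be computed directly from the table. This is a small arithmetic sketch using only the values listed; the dictionary layout and variable names are illustrative.

```python
# (stepwise R2, ML R2), copied from the comparison table in this record.
r2 = {
    ("BC", "Daily_average (Lu)"): (0.58, 0.734),
    ("BC", "Hourly_average (Lu + W + Hr)"): (0.27, 0.54),
    ("PN", "Daily_average (Lu)"): (0.7, 0.78),
    ("PN", "Hourly_average (Lu + W + Hr)"): (0.42, 0.54),
}

# Absolute gain in R2, rounded to three decimals to absorb float noise.
gain = {key: round(ml - stepwise, 3) for key, (stepwise, ml) in r2.items()}
largest = max(gain, key=gain.get)
```

On these figures the largest gain is for the hourly BC model, where the ML R2 is twice the stepwise value.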
Summary of model performance for each pollutant and model type in the spatial and spatial-temporal CV.
| Models | Input Variables | BC Spatial CV MAE (μg/m3) | BC Spatial CV RMSE (μg/m3) | BC Spatial-Temporal MAE (μg/m3) | BC Spatial-Temporal RMSE (μg/m3) | PN Spatial CV MAE (pt/cm3) | PN Spatial CV RMSE (pt/cm3) | PN Spatial-Temporal MAE (pt/cm3) | PN Spatial-Temporal RMSE (pt/cm3) |
|---|---|---|---|---|---|---|---|---|---|
| Daily average | Lu | 0.20 | 0.27 | | | 886.91 | 1338.33 | | |
| Hourly average | Lu | 0.36 | 0.59 | | | 2002.01 | 3204.07 | | |
| Hourly average | Lu_W_Hr | 0.35 | 0.58 | | | 1862.72 | 3022.66 | | |
| One second | Lu | 0.84 | 2.50 | 0.71 | 1.35 | 6831.35 | 30,366.65 | 10,145.75 | 34,931.82 |
| One second | Lu_W_Hr | 0.81 | 2.48 | 0.67 | 1.33 | 5611.13 | 29,929.87 | 8890.51 | 34,371.46 |
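
One way to read the spatial CV results against the random CV results is as a ratio of errors for the same model. The sketch below does this for BC MAE only, using values copied from the random CV and spatial CV tables in this record; the dictionary layout and names are illustrative.

```python
# BC MAE in μg/m3 as (random CV, spatial CV), copied from the tables in this record.
bc_mae = {
    "Daily average, Lu": (0.11, 0.20),
    "Hourly average, Lu": (0.30, 0.36),
    "Hourly average, Lu_W_Hr": (0.23, 0.35),
    "One second, Lu": (0.78, 0.84),
    "One second, Lu_W_Hr": (0.44, 0.81),
}

# Ratio > 1 means the spatial hold-out error exceeds the random hold-out error.
ratio = {name: round(spatial / rand, 2) for name, (rand, spatial) in bc_mae.items()}
```

Every ratio is above 1, so random hold-out gives the more optimistic error estimate for each of these models, and the gap is widest for the daily average model and the one second Lu_W_Hr model.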