Sun-Young Kim, Matthew Bechle, Steve Hankey, Lianne Sheppard, Adam A Szpiro, Julian D Marshall.
Abstract
National-scale empirical models for air pollution can include hundreds of geographic variables. The impact of model parsimony (i.e., how model performance differs for a large versus small number of covariates) has not been systematically explored. We aim to (1) build annual-average integrated empirical geographic (IEG) regression models for the contiguous U.S. for six criteria pollutants during 1979–2015; (2) explore systematically the impact on model performance of the number of variables selected for inclusion in a model; and (3) provide publicly available model predictions. We compute annual-average concentrations from regulatory monitoring data for PM10, PM2.5, NO2, SO2, CO, and ozone at all monitoring sites for 1979–2015. We also use ~350 geographic characteristics at each location, including measures of traffic, land use, land cover, and satellite-based estimates of air pollution. We then develop IEG models, employing universal kriging and summary factors estimated by partial least squares (PLS) of geographic variables. For all pollutants and years, we compare three approaches for choosing variables to include in the PLS model: (1) no variables, (2) a limited number of variables selected from the full set by forward selection, and (3) all variables. We evaluate model performance by 10-fold cross-validation (CV) with conventional and spatially clustered test data. Models using 3 to 30 variables selected from the full set generally have the best performance across all pollutants and years (median R2 conventional [clustered] CV: 0.66 [0.47]) compared to models with no (0.37 [0]) or all variables (0.64 [0.27]). Concentration estimates for all Census Blocks reveal generally decreasing concentrations over several decades with local heterogeneity. Our findings suggest that national prediction models can be built by empirically selecting only a small number of important variables to provide robust concentration estimates. 
Model estimates are freely available online.
Year: 2020 PMID: 32069301 PMCID: PMC7028280 DOI: 10.1371/journal.pone.0228535
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
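The abstract evaluates models with 10-fold cross-validation on both conventional and spatially clustered test data. The sketch below illustrates how the two fold assignments could differ. It is a minimal illustration, not the authors' procedure: the abstract does not say how clusters are formed, so the longitude-band rule in `clustered_folds` is a stand-in assumption, and both function names are hypothetical.

```python
import random


def conventional_folds(n_sites, k=10, seed=0):
    """Randomly assign each site index to one of k near-equal folds."""
    order = list(range(n_sites))
    random.Random(seed).shuffle(order)
    folds = [0] * n_sites
    for rank, site in enumerate(order):
        folds[site] = rank % k
    return folds


def clustered_folds(coords, k=10):
    """Assign sites to k folds so that each fold is one spatial group.

    Assumption (not from the source): sites are sorted by longitude and
    the sorted list is cut into k consecutive bands, one band per fold.
    coords is a list of (longitude, latitude) pairs.
    """
    n = len(coords)
    order = sorted(range(n), key=lambda i: coords[i][0])
    folds = [0] * n
    for rank, site in enumerate(order):
        folds[site] = rank * k // n
    return folds
```

With conventional folds, held-out sites usually have training monitors nearby; with clustered folds, a whole region is held out at once, which is a harder test of the geographic covariates and is consistent with the lower clustered-CV R2 values reported in the abstract.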
Fig 1. Quantile-based plots of annual average concentrations of six criteria air pollutants across all regulatory monitoring sites for 1979–2015 in the contiguous U.S.
Table 1. Cross-validation (CV) statistics for the Integrated Empirical Geographic (IEG) regression models by pollutant, year, and number of geographic variables (zero variables / between 3 and 30 variables / all variables).
Column headers give the CV type (Conv. = conventional, Clust. = clustered), the statistic (RMSE = standardized RMSE), and the number of variables (0, 3–30, or All).
| Pollutant (unit) | Year | Conv. RMSE, 0 | Conv. RMSE, 3–30 | Conv. RMSE, All | Conv. R2, 0 | Conv. R2, 3–30 | Conv. R2, All | Clust. RMSE, 0 | Clust. RMSE, 3–30 | Clust. RMSE, All | Clust. R2, 0 | Clust. R2, 3–30 | Clust. R2, All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NO2 (ppb) | 2000 | 0.33 | 0.19 | 0.20 | 0.61 | 0.87 | 0.85 | 0.60 | 0.23 | 0.29 | 0.00 | 0.82 | 0.70 |
| NO2 (ppb) | 2010 | 0.39 | 0.23 | 0.25 | 0.56 | 0.84 | 0.81 | 0.64 | 0.33 | 0.33 | 0.00 | 0.68 | 0.68 |
| SO2 (ppb) | 2000 | 0.39 | 0.38 | 0.39 | 0.60 | 0.63 | 0.61 | 0.62 | 0.47 | 0.51 | 0.00 | 0.44 | 0.32 |
| SO2 (ppb) | 2010 | 0.64 | 0.63 | 0.65 | 0.29 | 0.31 | 0.28 | 0.79 | 0.65 | 0.72 | 0.00 | 0.26 | 0.10 |
| O3 (ppb) | 2000 | 0.07 | 0.07 | 0.07 | 0.76 | 0.78 | 0.78 | 0.11 | 0.10 | 0.11 | 0.45 | 0.55 | 0.51 |
| O3 (ppb) | 2010 | 0.06 | 0.06 | 0.06 | 0.81 | 0.82 | 0.81 | 0.11 | 0.10 | 0.11 | 0.44 | 0.51 | 0.44 |
| CO (ppm) | 2000 | 0.37 | 0.32 | 0.34 | 0.33 | 0.50 | 0.43 | 0.47 | 0.35 | 0.43 | 0.00 | 0.42 | 0.12 |
| CO (ppm) | 2010 | 0.25 | 0.23 | 0.25 | 0.17 | 0.28 | 0.20 | 0.28 | 0.24 | 0.28 | 0.00 | 0.23 | 0.00 |
| PM10 (μg/m3) | 2000 | 0.31 | 0.27 | 0.28 | 0.50 | 0.60 | 0.59 | 0.45 | 0.37 | 0.39 | 0.00 | 0.27 | 0.20 |
| PM10 (μg/m3) | 2010 | 0.34 | 0.29 | 0.30 | 0.41 | 0.57 | 0.56 | 0.47 | 0.37 | 0.39 | 0.00 | 0.33 | 0.26 |
| PM2.5 (μg/m3) | 2000 | 0.16 | 0.12 | 0.13 | 0.77 | 0.86 | 0.85 | 0.30 | 0.21 | 0.22 | 0.15 | 0.59 | 0.53 |
| PM2.5 (μg/m3) | 2010 | 0.17 | 0.13 | 0.13 | 0.73 | 0.85 | 0.84 | 0.31 | 0.19 | 0.20 | 0.14 | 0.70 | 0.64 |
a. Standardized RMSE is the root mean square error (RMSE) divided by average concentration.
b. All values of CV statistics are shown by the three levels of selected numbers of variables: for models with zero variables (i.e., ordinary kriging), denoted “0”; the median among the models with between 3 and 30 variables (3, 5, 7, 10, 13, 16, 20, 24, and 30), denoted “3–30”; and for full models with all variables, denoted “All”.
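The table notes define standardized RMSE as RMSE divided by average concentration and report the "3–30" column as a median across nine model sizes. A minimal Python sketch of those calculations follows. The function names are hypothetical, and the R2 formula (1 minus MSE over the variance of the observations, floored at zero) is an assumption: the source does not state how R2 is computed.

```python
from statistics import mean, median


def standardized_rmse(observed, predicted):
    """RMSE divided by the mean observed concentration."""
    mse = mean([(o - p) ** 2 for o, p in zip(observed, predicted)])
    return mse ** 0.5 / mean(observed)


def mse_r2(observed, predicted):
    """Assumed R2 definition: 1 - MSE / variance, floored at zero."""
    mse = mean([(o - p) ** 2 for o, p in zip(observed, predicted)])
    obs_mean = mean(observed)
    variance = mean([(o - obs_mean) ** 2 for o in observed])
    return max(0.0, 1.0 - mse / variance)


def median_over_sizes(stat_by_size):
    """Median of a CV statistic across model sizes (e.g. 3, 5, ..., 30).

    stat_by_size maps the number of selected variables to the statistic.
    """
    return median(stat_by_size.values())
```

Because standardized RMSE is scaled by the mean concentration, it can be compared across pollutants measured in different units (ppb, ppm, μg/m3).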
Fig 2. Standardized RMSEs and R2s of the national Integrated Empirical Geographic (IEG) models including no, some, and all variables from conventional and clustered cross-validation (CV) during 1979–2015 for the contiguous U.S., for NO2, SO2, ozone, and PM2.5 (triangle: all variables; circle: some variables (3–30); cross: no variables; terminology is the same as in Table 1).
Fig 3. Standardized RMSEs and R2s from the “best” Integrated Empirical Geographic (IEG) models for the contiguous U.S. in 2000, for conventional and clustered cross-validation (CV), by pollutant.
Fig 4. Categories (nine out of the eleven in S1 Table) of geographic variables chosen by forward selection for the national Integrated Empirical Geographic (IEG) models by year, pollutant (NO2, SO2, ozone, and PM2.5), and number of variables (5, 10, and 30) during 1979–2015 for the contiguous U.S.
Fig 5. Maps of Census Block Group population-weighted mean predicted annual average concentrations for PM2.5, NO2, and ozone from the “best” national Integrated Empirical Geographic (IEG) models, mostly including 3–30 variables, for 2000 and 2010 in the contiguous U.S.
Table 2. Summary statistics of population-weighted annual average concentrations across 215,491 Census Block Groups for the contiguous U.S., by pollutant and decadal year, based on predictions at Census Block centroids from the “best” Integrated Empirical Geographic (IEG) models, mostly using 3–30 geographic variables.
| Pollutant (unit) | Year | 10th percentile | 25th percentile | 50th percentile | 75th percentile | 90th percentile | Mean | SD |
|---|---|---|---|---|---|---|---|---|
| NO2 (ppb) | 1980 | 7.4 | 12.1 | 19.9 | 27.9 | 36.8 | 21.3 | 11.7 |
| NO2 (ppb) | 1990 | 6.1 | 8.4 | 12.9 | 19.0 | 26.9 | 15.2 | 9.2 |
| NO2 (ppb) | 2000 | 5.6 | 7.8 | 11.8 | 16.7 | 23.2 | 13.3 | 7.5 |
| NO2 (ppb) | 2010 | 3.3 | 4.7 | 7.2 | 10.8 | 15.8 | 8.5 | 5.1 |
| SO2 (ppb) | 1980 | 3.4 | 5.8 | 8.9 | 12.5 | 16.6 | 9.6 | 5.3 |
| SO2 (ppb) | 1990 | 2.0 | 3.0 | 4.6 | 7.0 | 9.2 | 5.3 | 3.0 |
| SO2 (ppb) | 2000 | 1.8 | 2.2 | 3.1 | 4.4 | 6.1 | 3.6 | 1.8 |
| SO2 (ppb) | 2010 | 0.9 | 1.2 | 1.5 | 2.0 | 2.5 | 1.6 | 0.7 |
| Ozone (ppb) | 1980 | 39.0 | 45.4 | 51.3 | 57.3 | 63.6 | 51.1 | 9.6 |
| Ozone (ppb) | 1990 | 39.6 | 44.8 | 48.6 | 52.4 | 56.8 | 48.5 | 6.5 |
| Ozone (ppb) | 2000 | 40.2 | 43.9 | 49.0 | 53.6 | 57.1 | 48.5 | 6.7 |
| Ozone (ppb) | 2010 | 37.7 | 43.1 | 46.6 | 49.6 | 52.2 | 45.6 | 6.0 |
| CO (ppm) | 1990 | 0.33 | 0.43 | 0.61 | 0.86 | 1.19 | 0.69 | 0.35 |
| CO (ppm) | 2000 | 0.29 | 0.35 | 0.43 | 0.55 | 0.74 | 0.48 | 0.20 |
| CO (ppm) | 2010 | 0.23 | 0.28 | 0.31 | 0.35 | 0.39 | 0.31 | 0.07 |
| PM10 (μg/m3) | 1990 | 19.8 | 22.7 | 25.9 | 30.2 | 36.8 | 27.5 | 7.9 |
| PM10 (μg/m3) | 2000 | 15.7 | 18.8 | 22.0 | 25.4 | 30.8 | 22.9 | 6.8 |
| PM10 (μg/m3) | 2010 | 12.8 | 15.2 | 18.3 | 21.5 | 24.1 | 18.4 | 4.6 |
| PM2.5 (μg/m3) | 2000 | 8.6 | 10.7 | 12.9 | 15.2 | 16.7 | 12.9 | 3.4 |
| PM2.5 (μg/m3) | 2010 | 6.3 | 7.9 | 9.6 | 10.8 | 12.1 | 9.4 | 2.2 |
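
The table above summarizes Census Block Group concentrations that are population-weighted means of predictions at Census Block centroids. A minimal sketch of that aggregation for a single Block Group is shown below. The function name and input layout are hypothetical, and the fallback to an unweighted mean when a Block Group has no residents is an assumption, since the source does not say how such cases are handled.

```python
def population_weighted_mean(blocks):
    """Population-weighted mean concentration for one Census Block Group.

    blocks is a list of (population, predicted_concentration) pairs,
    one pair per Census Block centroid in the Block Group.
    """
    total_population = sum(pop for pop, _ in blocks)
    if total_population == 0:
        # Assumption: with no residents, fall back to the unweighted mean.
        return sum(conc for _, conc in blocks) / len(blocks)
    return sum(pop * conc for pop, conc in blocks) / total_population
```

For example, two blocks with 100 and 300 residents and predictions of 10.0 and 14.0 give (100 × 10.0 + 300 × 14.0) / 400 = 13.0, so the more populous block dominates the Block Group value.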