Literature DB >> 25398188

A unified spatiotemporal modeling approach for predicting concentrations of multiple air pollutants in the multi-ethnic study of atherosclerosis and air pollution.

Joshua P Keller¹, Casey Olives, Sun-Young Kim, Lianne Sheppard, Paul D Sampson, Adam A Szpiro, Assaf P Oron, Johan Lindström, Sverre Vedal, Joel D Kaufman.

Abstract

BACKGROUND: Cohort studies of the relationship between air pollution exposure and chronic health effects require predictions of exposure over long periods of time.
OBJECTIVES: We developed a unified modeling approach for predicting fine particulate matter, nitrogen dioxide, oxides of nitrogen, and black carbon (as measured by light absorption coefficient) in six U.S. metropolitan regions from 1999 through early 2012 as part of the Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air).
METHODS: We obtained monitoring data from regulatory networks and supplemented those data with study-specific measurements collected from MESA Air community locations and participants' homes. In each region, we applied a spatiotemporal model that included a long-term spatial mean, time trends with spatially varying coefficients, and a spatiotemporal residual. The mean structure was derived from a large set of geographic covariates that was reduced using partial least-squares regression. We estimated time trends from observed time series and used spatial smoothing methods to borrow strength between observations.
RESULTS: Prediction accuracy was high for most models, with cross-validation R2 (R2CV) > 0.80 at regulatory and fixed sites for most regions and pollutants. At home sites, overall R2CV ranged from 0.45 to 0.92, and temporally adjusted R2CV ranged from 0.23 to 0.92.
CONCLUSIONS: This novel spatiotemporal modeling approach provides accurate fine-scale predictions in multiple regions for four pollutants. We have generated participant-specific predictions for MESA Air to investigate health effects of long-term air pollution exposures. These successes highlight modeling advances that can be adopted more widely in modern cohort studies.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Substances：

Year: 2014 PMID： 25398188 PMCID： PMC4384200 DOI： 10.1289/ehp.1408145

Source DB: PubMed Journal: Environ Health Perspect ISSN： 0091-6765 Impact factor: 9.031

Introduction

Cohort studies have shown an association between long-term exposure to air pollution and cardiovascular morbidity and mortality (Brook et al. 2010; Miller et al. 2007; Pope and Dockery 2006). To estimate these associations, epidemiologic studies develop exposure prediction models to predict pollutant concentrations over long periods of time at cohort home addresses based on monitoring data from regulatory networks or study-specific monitoring campaigns. Although early models were based on region-wide averages or nearest-monitor approaches, more current methods include land-use regression (LUR) (Hoek et al. 2008; Jerrett et al. 2007), the use of satellite and remote sensing data (Kloog et al. 2011), geostatistical methods such as kriging (Beelen et al. 2009), generalized additive models (Hart et al. 2009), or a combination of these techniques (Beckerman et al. 2013; Bergen et al. 2013; Mercer et al. 2011). The Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air) is investigating the association between long-term air pollution exposure and measures of cardiovascular health (Kaufman et al. 2012). MESA Air is following a cohort of > 6,000 individuals in six metropolitan regions: Baltimore, Maryland; Chicago, Illinois; Los Angeles, California; New York, New York; St. Paul, Minnesota; and Winston-Salem, North Carolina. The primary exposures of interest in MESA Air are fine particulate matter (≤ 2.5 μm; PM2.5), nitrogen dioxide (NO2), oxides of nitrogen (NOx), and black carbon, as measured by light absorption coefficient (LAC). One goal of MESA Air is the development of advanced statistical methods that incorporate extensive supplemental monitoring to improve the prediction of intra-urban pollutant variability. MESA Air health effect analyses require spatiotemporal predictions of ambient outdoor concentrations for all four pollutants in all six metropolitan regions for times ranging from 1999 through 2012. In general, exposure prediction models developed in one city do not transfer well to another city (Allen et al. 2011), so prediction models are often study- and city-specific (e.g., Franklin et al. 2012). A challenge for multi-city studies such as MESA Air that combine data from subcohorts and include several pollutant measures is generating predictions that are of comparable quality across pollutants and cities. Here we present a unified and flexible spatiotemporal modeling framework for the four MESA Air pollutants. We apply a standardized approach to model selection for all pollutants and regions, allowing the intrinsic flexibility of our modeling framework to account for differences in the way pollution processes behave in different regions.

Methods

To predict outdoor concentration of pollutants at MESA Air participant residences, we fit a separate spatiotemporal exposure prediction model for each pollutant (PM2.5, NO2, NOx, and LAC) in each metropolitan region. Briefly, our model decomposes the space–time field of concentrations into spatially varying long-term (i.e., duration of study period) averages, spatially varying seasonal and long-term trends, and spatially correlated but temporally independent residuals, and accommodates data from the complex monitoring design described below. We modeled on a 2-week time scale because of the 2-week sampling of MESA Air supplementary monitoring instruments; this allows for flexible aggregation of predictions to time scales of interest for health effects analyses (e.g., 12 months before clinic visit). Monitoring data. We used three sources of outdoor air monitoring data. We obtained hourly NO2 and NOx and daily PM2.5 concentration measurements in each metropolitan region from 1 January 1999 through 31 March 2012 from the U.S. Environmental Protection Agency (EPA) Air Quality System (AQS) (http://www.epa.gov/ttn/airs/airsaqs/detaildata/downloadaqsdata.htm), including data from the Interagency Monitoring of Protected Visual Environments (IMPROVE) network (http://vista.cira.colostate.edu/IMPROVE/). No AQS data were used for black carbon because of differences in collection methods from the MESA Air LAC measurement methods described below. We aggregated hourly data into daily averages and subsequently averaged daily values to the 2-week scale. AQS monitors that had < 2 years of data or had irregular temporal coverage (e.g., operated only in the summer) were not used. In each metropolitan region, we defined the modeling area to be locations within approximately 75 km of each metropolitan center (Figure 1). AQS monitors within the modeling regions were considered for inclusion in the model, and predictions at participant residences were restricted to locations within these modeling regions. In New York, MESA Air participants were recruited from both New York City and Rockland County, so the modeling region included locations near both areas. In Winston-Salem, only one AQS monitoring location for NO2 and NOx met inclusion criteria. To have a complete time series for the 14-year modeling period, an AQS monitor in Charlotte, North Carolina, was included for estimating time trends. In Chicago, the modeling region was further restricted to locations west of –87.5°W longitude because some covariates were unavailable east of that meridian. In Los Angeles, only locations south and/or west of the San Gabriel Mountains were included.

Figure 1

Maps of the modeling areas (denoted by dashed black line) in the six metropolitan regions, including monitor and subject locations. Abbreviations: Fixed, MESA Air fixed monitoring sites; Home, MESA Air home monitoring sites; Snapshot, MESA Air snapshot monitoring sites; Participant, MESA Air participant residence location (moved slightly to protect confidentiality). To better capture the within-city variability of pollutant concentrations, MESA Air conducted a supplementary monitoring campaign targeting the study cohort from July 2005 through August 2009. The MESA Air measurements were 2-week cumulative measurements that began and ended on Wednesdays. Measurements of NO2 and NOx were made using Ogawa passive samplers, and PM2.5 mass was measured on Harvard Personal Environmental Monitor impactors using Teflon filters. LAC was computed from the Teflon filters via reflectance. A detailed description of the data collection and site selection procedures has been previously published (Cohen et al. 2009). The MESA Air monitoring campaign included three types of monitoring sites: fixed, home, and snapshot. Fixed sites were operated for the duration of the 4-year MESA Air sampling period to provide long time series of measurements, with one fixed site collocated with an AQS monitor in each region. Samples of participant residences in each metropolitan region were selected for monitoring as home outdoor sites on a rotating basis, with most locations monitored one to three times in different seasons. Snapshot sites, which measured only NO2 and NOx, were located in clusters to capture gradients near sources (e.g., primary roadways) and monitored for three 2-week periods, one each in winter, summer, and either spring or fall. In New York City, data from the New York City Community Air Survey (NYCCAS) were used to supplement the AQS and MESA Air data (Matte et al. 2013; NYC Department of Health 2014). The NYCCAS data consist of 2-week measurements of PM2.5, NO2, NOx, and LAC collected during December 2008–December 2010 in a manner consistent with MESA Air sampling protocols. Five NYCCAS reference sites (one in each borough) collected measurements throughout the sampling period, and 150 NYCCAS distributed sites were monitored once per season during this time. Because of the similarity in monitoring scheme, we treated NYCCAS reference sites in the same manner as MESA Air fixed sites, and NYCCAS distributed sites in the same manner as MESA Air home sites, in our models. The NYCCAS data and a small subset of the MESA Air 2-week data were centered on different weeks than most of the MESA Air measurements. To align these measurements with the rest of the MESA Air data, we treated these measurements as if they were made 1 week earlier or later, as appropriate. Between 0.4% (LAC) and 1.6% (NO2) of the pollutant measurements were below the limit of detection (LOD) and were replaced with the value LOD/2. The number of each type of monitoring site by region and pollutant is provided in Table 1. The range of the number of PM2.5 observations at each monitoring site during the study period is provided in Table 2, along with summary statistics for the site means. Corresponding statistics for NO2, NOx, and LAC observations are provided in Supplemental Material, Tables S1–S3.

Table 1

Number of monitors by site type, region, and pollutant.

Site type	PM_2.5	NO₂	NO_x	LAC
Baltimore, MD
AQS	29	11	8
MESA fixed	5	5	5	5
MESA home	86	87	87	86
MESA snapshot		104	104
Chicago, IL
AQS	20	7	6
MESA fixed	6	6	6	6
MESA home	136	113	113	136
MESA snapshot		129	129
Los Angeles, CA
AQS	23	29	30
MESA fixed	7	7	7	7
MESA home	113	120	120	113
MESA snapshot		252	250
New York, NY
AQS	45	17	11
MESA fixed	3	3	3	3
MESA home	107	119	118	107
MESA snapshot		157	157
NYCCAS reference	5	5	5	5
NYCCAS distributed	150	150	150	150
St. Paul, MN
AQS	13	5	5
MESA fixed	3	4	4	3
MESA home	126	132	132	129
MESA snapshot		107	107
Winston-Salem, NC
AQS	16	2	2
MESA fixed	4	4	4	4
MESA home	114	117	117	114
MESA snapshot		121	121

Table 2

Summary of PM2.5 monitoring data.

Site type	No. of observations per site		Site means (μg/m³)
	Minimum	Maximum	Minimum	Maximum	Mean ± SD
	Baltimore, MD
AQS	64	345	10.9	16.9	13.4 ± 1.4
MESA fixed	18	92	12.1	15.4	13.7 ± 1.25
MESA home	1	3	7.3	22.7	14.3 ± 3.1
Chicago, IL
AQS	71	320	11.7	16.4	14.0 ± 1.3
MESA fixed	6	87	12.2	14.0	13.1 ± 0.75
MESA home	1	4	5.2	19.5	11.5 ± 3.2
Los Angeles, CA
AQS	82	345	10.7	22.8	16.2 ± 3.5
MESA fixed	76	85	13.7	19.3	16.2 ± 2.0
MESA home	1	2	0.7	42.6	16.9 ± 6.1
New York, NY
AQS	51	342	9.3	17.1	12.5 ± 1.8
MESA fixed	49	83	11.5	15.7	13.7 ± 2.1
MESA home	1	3	3.5	41.6	15.1 ± 4.9
NYCCAS reference	51	52	8.8	9.9	9.4 ± 0.42
NYCCAS distributed	6	8	6.8	19.8	11.0 ± 11.0
St. Paul, MN
AQS	55	305	7.9	11.6	10.0 ± 0.91
MESA fixed	81	89	9.6	10.5	10.0 ± 0.46
MESA home	1	5	5.0	27.4	10.3 ± 3.8
Winston-Salem, NC
AQS	86	346	10.3	15.9	13.4 ± 1.5
MESA fixed	80	93	13.0	13.8	13.4 ± 0.35
MESA home	1	4	9.0	22.8	14.3 ± 2.6

Number of monitors by site type, region, and pollutant. Summary of PM2.5 monitoring data. Geographic covariates. We compiled > 300 geographic covariates for use in the model (see Supplemental Material, Table S4). These covariates included proximity measures (distance to nearest major road, intersection, truck route, railway, railyard, coastline, airport, and port) and buffer measures (major road length, truck route length, land-use category, long-term vegetation index, population density, and emission sources). We included a long-term average of the dispersion model output from a modified implementation of the Caline3QHCR line-source model (Eckhoff and Braverman 1995). The Caline3QHCR model incorporates distance, traffic volume, meteorology, and diurnal traffic patterns in each region. Geographic covariates with minimal variation or potentially highly influential values were excluded from the modeling process. Specifically, variables were removed if a) > 80% of monitoring sites had the same value, b) > 2% of observations were more than 5 SDs away from mean, c) the standard deviation of the distribution of values at participant residences was more than five times the standard deviation of the distribution of values at monitoring locations, or d) the maximum value was 10% among all monitoring sites (for land-use variables only). These filters were applied separately for each pollutant and region. Spatiotemporal model. The monitoring data were highly unbalanced, with a small number of locations providing long time series of several years’ duration and a larger number of locations providing broader spatial coverage, but at a relatively small number of time points. A hierarchical spatiotemporal model had been previously developed to accommodate the unbalanced nature of the MESA Air data (Lindström et al. 2014; Sampson et al. 2011; Szpiro et al. 2010). This model can be written as C(s,t) = μ(s,t) + v(s,t), [1] where C(s,t) represents the log-transformed 2-week average pollutant concentration at location s and time t. The μ(s,t) term is the spatiotemporal mean surface, and the v(s,t) term represents spatiotemporal residual variation. We break down the spatiotemporal mean into components where β0(s) is the long-term mean at location s, f(t) are smooth time trends, and β(s) are spatially varying coefficients for the time trends. The time trends are estimated from AQS and MESA Air fixed sites (and NYCCAS reference sites in New York) using a procedure developed by Fuentes et al. (2007) and previously described in detail by Sampson et al. (2011). In brief, we applied an expectation-maximization procedure to fill in missing values in the time series and derived the trends from a singular value decomposition. We smoothed the trends using splines, controlling the smoothness with the degrees of freedom (df) parameter. The model assumes the time trends account for enough of the temporal structure that the residuals are uncorrelated in time. The long-term averages β0(s) and time trend coefficients β(s) are modeled as spatial random fields with a spatial mean, distributed as β[(s)α Σ(φ σ τ)], i = 0, 1, …, m. [3] Here, (s) are reduced-dimension summaries of the geographic covariates (described in detail below) at location s, and α are vectors of coefficients to be estimated. The covariance structure for β, denoted by Σ, is either an independence model with variance τ or a spatial smoothing model with exponential covariance function parameterized by range φ partial sill σ and nugget τ (Cressie 1993). The zero-mean spatiotemporal residual term v(s,t) in Equation 1 has a spatial correlation structure and is assumed independent at each time point. It includes a random effect for each time point to model short-term variations that affect an entire region, such as large-scale meteorological events. Partial least squares (PLS) scores. Rather than include each of the hundreds of geographic covariates directly in the model or use variable selection methods, we reduced the dimensionality of the covariates using PLS. In a manner similar to principal components analysis (PCA), PLS computes linear combinations, called scores, of the columns of a data matrix. Unlike PCA, the PLS procedure constructs scores that maximize the covariance between the scores and an outcome rather than the variance between the scores. A technical explanation of the PLS algorithm is provided by Abdi (2010). Sampson et al. (2013) described the application of PLS for spatial models, and here we describe how we applied the method to spatiotemporal data. PLS regression requires a single outcome value for each location. Because the MESA Air data are unbalanced time series, we first derived values that could be used as outcomes in PLS regression. For each AQS, fixed, and NYCCAS reference site s, we regressed the time series of observations C(s,t) on the smoothed time trends using ordinary least squares regression with mean function E[C(s,t)] = γs0 f0(t) + … + γsm fm(t) to get estimates (^γs0,…, ^γsm) for each location. For each time trend, PLS regression was performed separately with the ^γsi as the outcomes and the matrix of geographic covariates as the predictors. This gave a set of PLS scores for each location that was different for each time trend. PLS scores at home and snapshot monitoring sites were predicted using the geographic variables at those locations and the score definitions defined from the regression at fixed sites. PLS regression was performed using the pls package (Mevik et al. 2011), in R (R Core Team; http:r-project.org). Scores were included in the model as the (s) in Equation 3. Parameter estimation and model selection. Once the PLS scores (s) and time trends f(t) were computed, the remaining parameters were calculated via maximum likelihood using the SpatioTemporal package, version 1.1.7 (Lindström et al. 2012), in R. We varied several model parameters and used cross-validation to find the best-fitting model in each metropolitan region, as described below. We considered different values for the number of time trends (either 1 or 2), the df for smoothing time trends (either 4 or 8 per year), the number of PLS scores per time trend (2 or 3), and the covariance structure of the β fields (spatial smoothing or no spatial smoothing). Cross-validation procedure. The primary interest of MESA Air is in long-term average exposures, so we assessed model performance using cross-validation of long-term averages (LTAs). Because the highly unbalanced structure of the monitoring data means that LTAs at home sites are computed from a handful of observations of a few weeks’ duration, whereas LTAs at fixed sites are computed from long time series, we performed cross-validation separately for each site type. For home sites and NYCCAS distributed sites, we used 10-fold cross-validation, which leaves out one-tenth of the data in turn. For AQS, fixed, and NYCCAS references sites, we used leave-one-out cross-validation because the total number of monitors was relatively small. For snapshot sites, we used 10-fold cross-validation, with monitors in the same cluster left out together. For all three schemes, the covariance parameters (but not the time trends or PLS scores) were re-estimated using all but the left-out sites. Pollutant concentrations at the left-out sites were predicted using the parameters estimated from the remaining data. We assessed cross-validation performance using two measures: root mean-squared error (RMSE) and cross-validation R2 (denoted by R2). Letting y denote the mean observed value and ^y the mean of the predicted values for the observed time points at monitoring site j, RMSE and R2 were computed on the original scale of the data according to the formulas and R2 = max(0,1 – RMSE2/MSE), [5] where MSE = 1/n Σ = 1 (y – –y)2 is the mean-squared error of the observed values. R2 provides a measure of fit to the 1-1 line, in contrast to the typical regression-based R2 (R2), which measures fit to the regression line and is computed as the square of the correlation coefficient between the cross-validation predictions and the observed values. R2 reflects the contrast of interest because our goal is accurate prediction at unmeasured locations, and it is typically lower than R2. Although R2 was the primary metric for our model evaluation, we present R2 for comparison with published results from other authors. Because we are most interested in spatial contrasts between individual exposures within each region, we prioritized home-site over fixed-site R2 and RMSE in the model selection process. In the context of the hierarchical model (Equation 1), it is challenging to separate the spatial and temporal contributions to R2 for cross-validation of temporally sparse data sets such as the home sites and NYCCAS distributed sites. Lindström et al. (2014) proposed three temporally adjusted adaptations of R2 that use data from the AQS and fixed sites as the reference MSE instead of MSE in Equation 5 in order to focus on spatial prediction accuracy. R2 uses the average values at AQS and fixed sites within that region. R2 uses the closest (in absolute distance) AQS or fixed site. R2 uses the smoothed time trend at the closest AQS or fixed site. Prediction at participant locations. Using the best models from each metropolitan region, predictions of pollutant log-concentrations at participant residences were made on a 2-week scale from January 1999 through March 2012. We back-transformed these predictions using exponentiation to return them to the original scale of concentration measurements and computed averages of 2-week predictions over the study period.

Results

Model structure. Table 3 provides an overview of the model structure selected for each metropolitan region and pollutant. Most models have only one time trend, although all the New York models have two. The two smoothed trends for the NO2 model in Los Angeles are shown in Figure 2. Figure 2 also includes plots of the fitted trends for a selected AQS site and fixed site.

Table 3

Model structure for the best model (selected by cross-validation) for each pollutant and metropolitan region.

Model	No. of time trends^a	No. of PLS scores^b	df/year in time trend^c	Spatial smoothing^d
				Long-term average (β₀)	Time trend coefficients (β_i)
				Baltimore, MD
PM_2.5	1	3	4	Yes	No
NO₂	1	2	8	Yes	No
NO_x	1	2	8	Yes	Yes
LAC	1	3	8	No	No
Chicago, IL
PM_2.5	1	3	8	Yes	No
NO₂	2	2	4	Yes	Yes
NO_x	2	2	8	Yes	No
LAC	1	2	8	Yes	Yes
Los Angeles, CA
PM_2.5	2	3	8	Yes	No
NO₂	2	3	8	Yes	Yes
NO_x	1	3	4	Yes	Yes
LAC	1	2	4	Yes	No
New York, NY
PM_2.5	2	3	8	No	No
NO₂	2	3	4	Yes	Yes
NO_x	2	2	8	No	No
LAC	2	2	4	Yes	No
St. Paul, MN
PM_2.5	1	3	4	Yes	No
NO₂	1	3	4	Yes	No
NO_x	1	3	4	Yes	No
LAC	1	2	8	Yes	No
Winston‑Salem, NC
PM_2.5	2	2	4	No	No
NO₂	1	3	8	Yes	Yes
NO_x	1	2	8	Yes	Yes
LAC	1	2	8	Yes	No
^aSelected from either 1 or 2 time trends. ^bSelected from either 2 or 3 PLS scores; scores were covariates in the mean component of the long-term average (β₀) and time trend (β_i) fields [denoted by X_i(s) in Equation 3]. ^cSelected from either 4 or 8 degrees of freedom (df) per year; controls smoothness of estimated time trends. ^dYes, exponential covariance structure. No, independent covariance structure.

Figure 2

Time trends for the NO2 model in Los Angeles. The top panel shows the smoothed time trends calculated from AQS and fixed sites. The middle and bottom panels show the observed data and fitted trends at an AQS site and fixed site, respectively.

Model structure for the best model (selected by cross-validation) for each pollutant and metropolitan region. Time trends for the NO2 model in Los Angeles. The top panel shows the smoothed time trends calculated from AQS and fixed sites. The middle and bottom panels show the observed data and fitted trends at an AQS site and fixed site, respectively. For PM2.5, there was noticeable heterogeneity of the best models across metropolitan regions (Table 3). New York and Winston-Salem had no spatial smoothing in the long-term PM2.5 average [β0(s)], and no model had spatial smoothing in the PM2.5 time trend coefficients [β(s)]. Half of the regions had two time trends, whereas the other half had only a single time trend for PM2.5. For NO2, all of the models had spatial smoothing for the long-term average, and the same was true for NOx except in New York. The relative contribution of geographic covariates to the PLS scores varied by region and pollutant. In the Supplemental Material, Figure S1 shows the correlations between covariates and PLS scores for the NO2 model in Chicago, which are representative of the general patterns in the other regions (data not shown). Overall, the distance-to-feature covariates and vegetation measures [Normalized Difference Vegetation Index (NDVI) and low development, open development, forest, and wetland land use] tended to have the opposite correlation from emissions and traffic measures (A1, A2/A3, and truck route lengths and intersection counts) within buffers. Model results. Table 4 shows the cross-validation metrics for all models, broken down by pollutant and region. These metrics assess how well the site means are modeled, incorporating both the spatial and temporal components of the predictions. Scatter plots of predictions and observed values are provided in Figure 3 for AQS and fixed sites and in Supplemental Material, Figure S2, for home sites. Metrics for the snapshot sites (for NO2 and NOx) and for AQS and fixed sites (all four pollutants) on the 2-week scale are reported in the Supplemental Material, Tables S5 and S6.

Table 4

Cross-validation measures of predictive accuracy for site means at monitoring locations.

Region	AQS and MESA fixed sites			MESA home sites
	RMSE^a	R²_CV	R²_CVreg	RMSE^a	R²_CV	R²_CVreg
	PM_2.5
Baltimore	0.42	0.90	0.90	1.24	0.84	0.86
Chicago	0.59	0.78	0.82	1.43	0.80	0.80
Los Angeles	1.28	0.83	0.84	2.92	0.77	0.78
New York^b	0.59	0.91	0.91	2.80	0.54	0.56
St. Paul	0.60	0.45	0.84	1.78	0.78	0.79
Winston-Salem	0.44	0.89	0.90	1.00	0.85	0.86
NO₂
Baltimore	0.76	0.96	0.97	1.47	0.90	0.90
Chicago	1.51	0.87	0.97	3.31	0.45	0.48
Los Angeles	2.23	0.88	0.89	3.13	0.77	0.78
New York^b	1.86	0.92	0.93	3.82	0.78	0.78
St. Paul	1.27	0.93	0.94	1.24	0.87	0.87
Winston-Salem	0.95	0.85	0.98	1.41	0.74	0.75
NO_x
Baltimore	3.32	0.92	0.96	3.98	0.92	0.92
Chicago	3.88	0.87	0.91	6.08	0.59	0.59
Los Angeles	6.74	0.87	0.87	5.69	0.91	0.92
New York^b	8.85	0.61	0.89	16.66	0.50	0.50
St. Paul	1.69	0.98	0.98	3.58	0.83	0.84
Winston-Salem	5.46	0.00	0.94	3.74	0.60	0.63
LAC
Baltimore	0.096	0.87	0.91	0.127	0.78	0.79
Chicago	0.045	0.86	0.92	0.108	0.61	0.62
Los Angeles	0.114	0.70	0.93	0.266	0.69	0.71
New York^b	0.147	0.75	0.79	0.329	0.51	0.52
St. Paul	0.043	0.91	0.92	0.074	0.69	0.69
Winston-Salem	0.020	0.94	0.99	0.088	0.64	0.65
^aUnits for RMSE are μg/m³ (PM_2.5), ppb (NO₂ and NO_x), and 10^–5/m (LAC). ^bNew York models include NYCCAS reference sites with AQS and fixed sites, and NYCCAS distributed sites with home sites.

Figure 3

Long-term averages of cross-validated predictions and observations for AQS and fixed monitoring locations for each pollutant.

Cross-validation measures of predictive accuracy for site means at monitoring locations. Long-term averages of cross-validated predictions and observations for AQS and fixed monitoring locations for each pollutant. Predictive accuracy was generally good (R2 > 0.6) to excellent (R2 > 0.8) in almost all regions for each pollutant. NOx models in Baltimore and Los Angeles had the best performance at MESA home sites (R2 of 0.92 and 0.91, respectively) (Table 4). The lowest R2 for home sites was in the Chicago NO2 model (0.45), but its RMSE (3.31 ppb) was comparable with those in New York and Los Angeles (3.82 and 3.13 ppb, respectively). At AQS and fixed sites, R2 was very good (0.70 for LAC in Los Angeles) to excellent (0.98 for NOx in St. Paul), with two notable exceptions: Winston-Salem NOx (0.00) and St. Paul PM2.5 (0.45). However, in both cases the RMSE was comparable with the corresponding RMSE for models in other regions. The small range of observed data (9.8–22.4 ppb) (see Supplemental Material, Table S2) and the small number of monitors in Winston-Salem (Table 1) explain the low R2 for NOx in that city. Table 5 provides three versions of temporally adjusted R2 at home sites (R2, R2, and R2). For NO2 and NOx, these temporally adjusted R2 are fairly similar to the unadjusted R2 reported in Table 4, suggesting that we are predicting spatial differences well. For PM2.5, however, the temporally adjusted R2 are consistently lower than the unadjusted R2.

Table 5

Temporally adjusted cross-validation measures of predictive accuracy for home site means.

Region	R²_Avg	R²_Close	R²_Smooth
PM_2.5
Baltimore	0.45	0.52	0.72
Chicago	0.23	0.33	0.64
Los Angeles	0.40	0.23	0.43
New York^a	0.48	0.36	0.38
St. Paul	0.23	0.29	0.62
Winston-Salem	0.29	0.60	0.77
NO₂
Baltimore	0.92	0.79	0.74
Chicago	0.73	0.64	0.78
Los Angeles	0.63	0.66	0.66
New York^a	0.89	0.78	0.64
St. Paul	0.77	0.89	0.90
Winston-Salem	0.73	0.79	0.81
NO_x
Baltimore	0.86	0.70	0.65
Chicago	0.76	0.73	0.69
Los Angeles	0.81	0.85	0.88
New York^a	0.72	0.64	0.52
St. Paul	0.79	0.88	0.85
Winston-Salem	0.43	0.62	0.64
LAC
Baltimore	0.78	0.67	0.32
Chicago	0.56	0.45	0.36
Los Angeles	0.28	0.34	0.48
New York^a	0.59	0.65	0.53
St. Paul	0.67	0.80	0.84
Winston-Salem	0.37	0.56	0.59
General formula for R² measures is R² = max (0,1 – RMSE²/MSE_obs). R²_Avg uses the mean-squared error of the average observed values at AQS and fixed sites within the region as MSE_obs. R²_Close uses the mean-squared error of the observed values at the closest AQS or Fixed site as MSE_obs. R²_Smooth uses the mean-squared error of the values of the smoothed time trend at the nearest AQS or Fixed site as MSE_obs. ^aIncludes NYCCAS distributed sites.

Temporally adjusted cross-validation measures of predictive accuracy for home site means. Box plots of the long-term averages of predictions at participant residences are provided in Figure 4. On average, predicted concentrations tended to be higher in New York and Los Angeles, consistent with the higher observed monitoring values in those regions. Variability in predictions is also greatest in these two cities, especially in the tails of the distributions.

Figure 4

Pollutant- and region-specific box plots of long-term averages of predictions from 1999 through early 2012 at participant residence locations. Metropolitan region abbreviations: Bal, Baltimore; Chi, Chicago; LA, Los Angeles; NY, New York; SP, St. Paul; W-S, Winston-Salem. Boxes extend from the 25th to the 75th percentile, horizontal bars represent the median, whiskers extend 1.5 times the length of the interquartile range above and below the 75th and 25th percentiles, respectively, and outliers are presented as points. Supplemental Material, Table S7, provides performance metrics for the New York models when the NYCCAS data were excluded from the modeling process. Without the NYCCAS data, R2 was noticeably lower for PM2.5 (0.79 vs. 0.91 at AQS and fixed sites, and 0.36 vs. 0.54 at home sites) and LAC (0.55 vs. 0.75 at AQS and fixed sites, and 0.43 vs. 0.51 at home sites).

Discussion

We present here a complex and successful approach to predicting long-term air pollution concentrations for application in a cohort study. Although this approach was tailored to this particular well-characterized cohort study—taking advantage of cohort-specific monitoring, for example—the success of the approach demonstrates modeling improvements that can be adopted for application in future population-based research on spatially varying pollutants. We believe that this approach to capturing variation in within-region pollution highlights advances that should be adopted in the next generation of air pollution cohort studies, both for understanding contrasts at relatively low concentrations in the United States and at the higher concentrations experienced globally. We describe a unified framework for implementing exposure prediction models of four air pollutants in six metropolitan regions that easily incorporates spatially and temporally unbalanced monitoring data. The application of a consistent modeling framework to all regions and pollutants is important for studies such as MESA Air that use exposure estimates from multiple subcohorts together in health analyses. Although we applied the same approach in all regions, we varied the model structure to best fit the data for each region and pollutant. This unified modeling approach was shown to have very good model performance (R2 > 0.70) for almost all of the pollutants and regions. The architecture for this modeling approach is publicly available through the SpatioTemporal R package (Lindström et al. 2012). As a result of the success of our spatiotemporal modeling approaches, we are confident in using these approaches to model outdoor pollutant concentrations in epidemiological analyses in this cohort and in other populations residing in these same communities. We have also found that implementation of portions of this approach, such as PLS regression of a large set of geographic covariates combined with spatial smoothing via universal kriging, can be used with good success in other regions to predict pollutant concentrations without the same level of small-area monitoring (Sampson et al. 2013). The NYCCAS data increased the spatial density of the monitoring data in New York, which was likely one reason for the improved model performance when the data were included. For LAC, the NYCCAS data provided particular benefit because they allowed the model to be extended through 2010, which would not have been possible with only the MESA Air data. A majority of models included spatial smoothing in the long-term average. This suggests that although PLS scores derived from geographic covariates can predict much of the spatial variation in the data, benefit is gained from borrowing strength across observations nearby in space. Differences in the underlying pollutant variability likely caused some of the differences seen in temporally adjusted R2. PM2.5 tended to exhibit less small-scale spatial variation and greater temporal variability, leading to temporally adjusted R2 that are noticeably lower than the unadjusted measures. The NO2 and NOx data tended to exhibit greater spatial variability, and the similarity of the unadjusted and temporally adjusted R2 values suggests that the unadjusted R2 are not overly inflated by well-predicted temporal variation. The modeling approach presented here does have several limitations. First, we used geographic covariates that were constant in time [although the modeling framework readily extends to spatiotemporal covariates (Lindström et al. 2014)]. Changes in these variables likely occurred during the study decade, but we nonetheless believe that the time-constant geographic variables still provided a useful means to predict long-term pollutant concentrations. Second, the calculation of PLS scores was limited to AQS and fixed sites because they had long time series. For LAC in particular, this means that the scores were based on a very small number of locations because the LAC model relied only on MESA Air data (plus NYCCAS data for New York). Third, the cross-validation model selection procedure conditioned on the time trends and PLS scores. Overfitting may have occurred in the cross-validation of the AQS and fixed sites, because the left-out observations were used to estimate the time trends and PLS scores. However, because the home sites were not used in estimating time trend or in defining the PLS scores, any overfitting was restricted to the AQS and fixed-site cross-validation. This provides further motivation for prioritization of cross-validation metrics from home sites when selecting the best models.

Conclusions

Our unified spatiotemporal modeling method successfully characterized outdoor concentrations of multiple air pollutants at the homes of cohort members in multiple metropolitan regions. This flexible and powerful modeling approach can incorporate an unbalanced monitoring data structure, leveraging data from supplemental monitoring campaigns that increase the spatial coverage of monitoring data. The method was easily transferred between regions and pollutants, allowing for straightforward comparison between model fits across regions. Although aspects of our techniques are particularly tailored to the unique data and resources of MESA Air, lessons learned here can be applied to understand the spatial and temporal variation of pollutants in future cohort studies. Advances in fine-scale modeling resolved in both space and time are important for the next generation of cohort studies assessing health effects of environmental agents. Click here for additional data file.

16 in total

Review 1. Health effects of fine particulate air pollution: lines that connect.

Authors: C Arden Pope; Douglas W Dockery
Journal: J Air Waste Manag Assoc Date: 2006-06 Impact factor: 2.235

2. Mapping of background air pollution at a fine spatial scale across the European Union.

Authors: Rob Beelen; Gerard Hoek; Edzer Pebesma; Danielle Vienneau; Kees de Hoogh; David J Briggs
Journal: Sci Total Environ Date: 2009-01-18 Impact factor: 7.963

3. A Flexible Spatio-Temporal Model for Air Pollution with Spatial and Spatio-Temporal Covariates.

Authors: Johan Lindström; Adam A Szpiro; Paul D Sampson; Assaf P Oron; Mark Richards; Tim V Larson; Lianne Sheppard
Journal: Environ Ecol Stat Date: 2014-09 Impact factor: 1.119

4. Monitoring intraurban spatial patterns of multiple combustion air pollutants in New York City: design and implementation.

Authors: Thomas D Matte; Zev Ross; Iyad Kheirbek; Holger Eisl; Sarah Johnson; John E Gorczynski; Daniel Kass; Steven Markowitz; Grant Pezeshki; Jane E Clougherty
Journal: J Expo Sci Environ Epidemiol Date: 2013-01-16 Impact factor: 5.563

5. A regionalized national universal kriging model using Partial Least Squares regression for estimating annual PM_2.5 concentrations in epidemiology.

Authors: Paul D Sampson; Mark Richards; Adam A Szpiro; Silas Bergen; Lianne Sheppard; Timothy V Larson; Joel D Kaufman
Journal: Atmos Environ (1994) Date: 2013-08-01 Impact factor: 4.798

6. Predicting Intra-Urban Variation in Air Pollution Concentrations with Complex Spatio-Temporal Dependencies.

Authors: Adam A Szpiro; Paul D Sampson; Lianne Sheppard; Thomas Lumley; Sara D Adar; Joel Kaufman
Journal: Environmetrics Date: 2009-09-01 Impact factor: 1.900

7. Predictors of intra-community variation in air quality.

Authors: Meredith Franklin; Hita Vora; Edward Avol; Rob McConnell; Fred Lurmann; Feifei Liu; Bryan Penfold; Kiros Berhane; Frank Gilliland; W James Gauderman
Journal: J Expo Sci Environ Epidemiol Date: 2012-01-18 Impact factor: 5.563

8. Prospective study of particulate air pollution exposures, subclinical atherosclerosis, and clinical cardiovascular disease: The Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air).

Authors: Joel D Kaufman; Sara D Adar; Ryan W Allen; R Graham Barr; Matthew J Budoff; Gregory L Burke; Adrian M Casillas; Martin A Cohen; Cynthia L Curl; Martha L Daviglus; Ana V Diez Roux; David R Jacobs; Richard A Kronmal; Timothy V Larson; Sally Lee-Jane Liu; Thomas Lumley; Ana Navas-Acien; Daniel H O'Leary; Jerome I Rotter; Paul D Sampson; Lianne Sheppard; David S Siscovick; James H Stein; Adam A Szpiro; Russell P Tracy
Journal: Am J Epidemiol Date: 2012-10-04 Impact factor: 4.897

9. Modeling the intraurban variability of ambient traffic pollution in Toronto, Canada.

Authors: M Jerrett; M A Arain; P Kanaroglou; B Beckerman; D Crouse; N L Gilbert; J R Brook; N Finkelstein; M M Finkelstein
Journal: J Toxicol Environ Health A Date: 2007-02-01

10. Approach to estimating participant pollutant exposures in the Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air).

Authors: Martin A Cohen; Sara D Adar; Ryan W Allen; Edward Avol; Cynthia L Curl; Timothy Gould; David Hardie; Anne Ho; Patrick Kinney; Timothy V Larson; Paul Sampson; Lianne Sheppard; Karen D Stukovsky; Susan S Swan; L J Sally Liu; Joel D Kaufman
Journal: Environ Sci Technol Date: 2009-07-01 Impact factor: 9.028

69 in total

1. Association of Air Pollution Exposures With High-Density Lipoprotein Cholesterol and Particle Number: The Multi-Ethnic Study of Atherosclerosis.

Authors: Griffith Bell; Samia Mora; Philip Greenland; Michael Tsai; Ed Gill; Joel D Kaufman
Journal: Arterioscler Thromb Vasc Biol Date: 2017-04-13 Impact factor: 8.311

2. Estimating PM_2.5 Concentrations in the Conterminous United States Using the Random Forest Approach.

Authors: Xuefei Hu; Jessica H Belle; Xia Meng; Avani Wildani; Lance A Waller; Matthew J Strickland; Yang Liu
Journal: Environ Sci Technol Date: 2017-06-01 Impact factor: 9.028

3. Measurement Error Correction for Predicted Spatiotemporal Air Pollution Exposures.

Authors: Joshua P Keller; Howard H Chang; Matthew J Strickland; Adam A Szpiro
Journal: Epidemiology Date: 2017-05 Impact factor: 4.822

Review 4. Advances in Understanding Air Pollution and CVD.

Authors: Joel D Kaufman; Elizabeth W Spalt; Cynthia L Curl; Anjum Hajat; Miranda R Jones; Sun-Young Kim; Sverre Vedal; Adam A Szpiro; Amanda Gassett; Lianne Sheppard; Martha L Daviglus; Sara D Adar
Journal: Glob Heart Date: 2016-09

Review 5. Incorporating Measurement Error from Modeled Air Pollution Exposures into Epidemiological Analyses.

Authors: Evangelia Samoli; Barbara K Butland
Journal: Curr Environ Health Rep Date: 2017-12

6. Ambient air pollution and racial/ethnic differences in carotid intima-media thickness in the Multi-Ethnic Study of Atherosclerosis (MESA).

Authors: Miranda R Jones; Ana V Diez-Roux; Marie S O'Neill; Eliseo Guallar; A Richey Sharrett; Wendy Post; Joel D Kaufman; Ana Navas-Acien
Journal: J Epidemiol Community Health Date: 2015-07-03 Impact factor: 3.710

7. A System for Developing and Projecting PM_2.5 Spatial Fields to Correspond to Just Meeting National Ambient Air Quality Standards.

Authors: James T Kelly; Carey J Jang; Brian Timin; Brett Gantt; Adam Reff; Yun Zhu; Shicheng Long; Adel Hanna
Journal: Atmos Environ X Date: 2019-02-12

8. Air pollution and subclinical interstitial lung disease: the Multi-Ethnic Study of Atherosclerosis (MESA) air-lung study.

Authors: Coralynn Sack; Sverre Vedal; Lianne Sheppard; Ganesh Raghu; R Graham Barr; Anna Podolanczuk; Brent Doney; Eric A Hoffman; Amanda Gassett; Karen Hinckley-Stukovsky; Kayleen Williams; Steve Kawut; David J Lederer; Joel D Kaufman
Journal: Eur Respir J Date: 2017-12-07 Impact factor: 16.671

9. A comparison of statistical and machine learning methods for creating national daily maps of ambient PM_2.5 concentration.

Authors: Veronica J Berrocal; Yawen Guan; Amanda Muyskens; Haoyu Wang; Brian J Reich; James A Mulholland; Howard H Chang
Journal: Atmos Environ (1994) Date: 2019-11-14 Impact factor: 4.798

10. Air Pollution, Cardiovascular Outcomes, and Social Disadvantage: The Multi-ethnic Study of Atherosclerosis.

Authors: Margaret T Hicken; Sara D Adar; Anjum Hajat; Kiarri N Kershaw; D Phuong Do; R Graham Barr; Joel D Kaufman; Ana V Diez Roux
Journal: Epidemiology Date: 2016-01 Impact factor: 4.822