Lixin Li1, Xiaolu Zhou2, Marc Kalo3, Reinhard Piltner4. 1. Department of Computer Sciences, Georgia Southern University, Statesboro, GA 30460, USA. lli@georgiasouthern.edu. 2. Department of Geology and Geography, Georgia Southern University, Statesboro, GA 30460, USA. xzhou@georgiasouthern.edu. 3. Department of Computer Sciences, Georgia Southern University, Statesboro, GA 30460, USA. marc.kalo@gmail.com. 4. Department of Mathematical Sciences, Georgia Southern University, Statesboro, GA 30460, USA. rpiltner@georgiasouthern.edu.
Abstract
Appropriate spatiotemporal interpolation is critical to the assessment of relationships between environmental exposures and health outcomes. A powerful assessment of human exposure to environmental agents would incorporate spatial and temporal dimensions simultaneously. This paper compares shape function (SF)-based and inverse distance weighting (IDW)-based spatiotemporal interpolation methods on a data set of PM2.5 measurements in the contiguous U.S. Particle pollution, also known as particulate matter (PM), is composed of microscopic solids or liquid droplets that are so small that they can get deep into the lungs and cause serious health problems. PM2.5 refers to particles with a mean aerodynamic diameter less than or equal to 2.5 micrometers. Based on the error statistics of k-fold cross validation, the SF-based method performed better overall than the IDW-based method. The interpolation results generated by the SF-based method are combined with population data to estimate the population exposure to PM2.5 in the contiguous U.S. We investigated the seasonal variations, identified areas where annual and daily PM2.5 were above the standards, and calculated the population size in these areas. Finally, a web application is developed to interpolate and visualize in real time the spatiotemporal variation of ambient air pollution across the contiguous U.S. using air pollution data from the U.S. Environmental Protection Agency (EPA)'s AirNow program.
Keywords:
Inverse Distance Weighting (IDW); cross validation; fine particulate matter (PM2.5); population exposure; real-time air pollution; shape function; spatiotemporal interpolation; visualization; web application
1. Introduction
Particulate matter (PM) is the generic term for a broad class of chemically and physically diverse substances that exist as discrete particles (liquid droplets or solids) over a wide range of sizes [1]. Some particulates occur naturally, originating from volcanoes, dust storms, forest and grassland fires, living vegetation, and sea spray. Other particulates result from human activities, such as the burning of fossil fuels in vehicles, power plants, and various industrial processes [2]. The United States Environmental Protection Agency (EPA) first established national ambient air quality standards for PM in 1971. The published evidence supports an association between PM and an increased risk of mortality. It has been shown that those with cardiovascular or respiratory conditions, as well as the young and the elderly, are the most susceptible to the adverse effects of PM. The pollutant class studied in this paper is specifically fine particulate matter, or PM2.5, which refers to particles with a mean aerodynamic diameter less than or equal to 2.5 micrometers. PM2.5 is considered one of the most unhealthy particulate air pollutants because it is more likely to be toxic and can be breathed more deeply into the lungs. PM2.5 has been associated with visibility reduction [3,4], acute stroke mortality [5], and daily mortality in many U.S. cities [6].
In order to find the association between air pollutants such as PM2.5 and health effects, researchers need to estimate pollutant concentrations in the continuous space-time domain. Since concentration values are typically measured only at discrete monitoring sites and at certain time instances, estimation of pollutant concentrations at unmeasured locations and times is needed.
Implementing an appropriate interpolation method is critical to the assessment of relationships between air pollution exposure and health outcomes.
Spatial interpolation has been well developed and widely used in Geographic Information Systems (GIS). It is used to estimate values at unknown locations based upon values that are spatially sampled. Traditional spatial interpolation models have been extensively investigated over the years. Popular spatial interpolation methods are Inverse Distance Weighting (IDW) [7,8], shape functions [9,10], radial basis functions [11], spline [12], natural neighbor [13], trend surfaces [14], Kriging [15], model-data fusion (sometimes called analysis) [16,17], and optimal interpolation [18]. IDW, shape functions, radial basis functions, splines, natural neighbor, and trend surfaces are deterministic methods. They provide no indication of the extent of possible errors. Their output is fully determined by the parameter values and the inputs. There are no strict assumptions about the variability or randomness of a feature. These methods are relatively simple to implement. On the other hand, Kriging, model-data fusion, and optimal interpolation are stochastic methods that possess some inherent randomness. The same set of parameter values and inputs will lead to an ensemble of different outputs. Stochastic methods provide probabilistic estimates. One of the advantages of stochastic methods is that they treat clusters more like single points and assign individual points within a cluster less weight than isolated data points, which helps to compensate for the effect of data clustering.
In the field of atmospheric data analysis, model-data fusion and optimal interpolation methods are developed to include the physics and chemistry of an air quality model in the interpolation mechanism and thus achieve better prediction and representation of air quality.
Nowadays, modern sensors are able to monitor different variables (such as particulate matter, sulfur dioxide, and ozone) at an increasing temporal resolution, resulting in rich spatiotemporal data sets. This calls for appropriate theories and methods to deal with these data sets to gain a better understanding of the observed spatiotemporal processes. Traditionally, many GIS researchers treat space and time separately [19]. They simply reduce spatiotemporal interpolation problems to spatial interpolation problems by assuming that time can be incorporated by conducting a sequence of snapshots of spatial interpolations. Since spatiotemporal interpolation considers the additional time attribute, it can provide more accurate predictions than pure spatial interpolation. However, adding the temporal domain implies that variability in space and time must be modeled, which is more complicated than modeling purely spatial or purely temporal variability. A review of some air pollution exposure assessment methods utilized in epidemiological studies and the use of GIS for resolving problems with spatiotemporal attributes can be found in [20]. Other work on spatiotemporal interpolation is presented in the literature [9,21,22,23,24,25,26,27,28].
The main challenge presented by spatiotemporal interpolation relates to the spatiotemporal dependence structure, i.e., the relative importance of time with reference to space. A powerful assessment of human exposure to air pollution would incorporate spatial and temporal dimensions.
The temporal dimension of environmental exposure analysis is often ignored, underemphasized, or isolated from the spatial domain, mainly due to the few efficient and effective tools available to interpolate complex spatiotemporal datasets. The popular ArcGIS software (version 10.3, ESRI, Redlands, CA, USA) cannot handle spatiotemporal interpolation and is computationally inefficient with large datasets.
This paper has three goals. First, it investigates and compares two different spatiotemporal interpolation methods for an actual set of PM2.5 data measured by U.S. EPA monitoring sites in the contiguous United States: shape function (SF)-based vs. Inverse Distance Weighting (IDW)-based methods using the so-called extension approach. The extension approach has been proposed in [9] to integrate space and time simultaneously by extending spatiotemporal interpolation problems into higher dimensional spatial interpolation problems. SF and IDW are originally deterministic spatial interpolation methods. Since they can be extended to higher dimensions, they are both suitable for the extension approach. Furthermore, IDW is one of the most commonly used interpolation methods [7,23,29,30,31] for GIS applications. Although SF originated in engineering, it has shown great interpolation performance on various GIS application data such as real estate data [9] and air pollution data [23,24,32]. Second, after obtaining the comparison results of the SF-based and IDW-based spatiotemporal interpolation methods, we apply the better method to estimate population exposure to PM2.5 in the contiguous United States using interpolated daily PM2.5 concentration values at the centroids of census block groups. Third, we aim to develop a web application to interpolate and visualize in real time the spatiotemporal variation of ambient air pollution (including but not limited to PM2.5) across the contiguous U.S. using air pollution data from the U.S. EPA's AirNow program.
2. Methods
2.1. Shape Function-Based Spatiotemporal Interpolation Using the Extension Approach
Shape functions (SF) have been popular and utilized in engineering applications such as finite element algorithms [10,33]. Just like other traditional spatial interpolation methods used in GIS such as IDW [8] and Kriging [15], SF-based methods assume a stronger correlation among points that are closer than those farther apart. Therefore, SF-based methods can serve as spatial interpolation methods for GIS applications [9,25,34,35,36,37,38]. In addition, because the computational complexity of SF-based methods is linear, they can be efficient interpolation methods for large data sets.
2.1.1. General Formula of the SF-Based 3D Spatial Interpolation Method
In order to apply SF-based interpolation methods, a mesh that divides the total domain into a finite number of simple sub-domains or elements must be generated. For a 3D spatial problem, a mesh composed of tetrahedral elements should be generated if one wants to use shape functions for tetrahedra to interpolate unknown values in the 3D (x, y, z) coordinate system. Considering the tetrahedral element in Figure 1, the SF-based interpolation result w at an unknown point (x, y, z) located inside the tetrahedron can be obtained by using the measurement values $w_1$, $w_2$, $w_3$, and $w_4$ at the four known locations, which serve as the corner vertices of the tetrahedron, as in [9]:
$$w(x, y, z) = N_1 w_1 + N_2 w_2 + N_3 w_3 + N_4 w_4, \qquad (1)$$
where $N_1$, $N_2$, $N_3$, and $N_4$ are the following linear shape functions:
$$N_i = \frac{V_i}{V}, \quad i = 1, 2, 3, 4; \qquad (2)$$
$V_1$, $V_2$, $V_3$, and $V_4$ are the volumes of the four sub-tetrahedra, respectively, and $V$ is the volume of the bounding tetrahedron as shown in Figure 1.
Figure 1
A tetrahedral element. Computing 3D shape functions by tetrahedral volume divisions. $w_1$, $w_2$, $w_3$, and $w_4$ are measured values, while the value w at the location (x, y, z) is unknown and needs to be interpolated.
It can be seen from Figure 1 that $V_1$ is the volume of the sub-tetrahedron whose four corner vertices are the unknown point (x, y, z) and the three known points 2–4. Suppose the unknown point moves closer to the known point 1. Then $V_1$ increases while $V_2$, $V_3$, and $V_4$ decrease, which leads to the increment of $N_1$ and the decrement of $N_2$, $N_3$, and $N_4$. In the extreme case, when the unknown point moves to the exact location 1, the weight $N_1$ becomes 1 and the other three weights $N_2$, $N_3$, and $N_4$ become 0. Similar observations can be made for the other three known points 2–4: each contributes a heavier weight in interpolating the value at the unknown point as the unknown point gets closer to it.
In finite element methods, shape functions of different orders (linear, quadratic, cubic, etc.) are used. In engineering, finite elements are used to approximate processes governed by differential equations, such as deformations and stresses in a car. Whereas in engineering the nodal values at the corners of finite elements are all unknown and have to be computed from a large system of equations, in GIS applications the nodal values at the element corner points come from measured data collections. The common point between finite element and GIS applications is that, with nodal values, the interpolation function can be evaluated over the complete domain. The size of the finite elements depends on the gradients and the changes in the function. For high gradients and oscillating functions, more elements of smaller size are needed. For data interpolation, the situation is similar: if we expect high gradients and a lot of changes in a relatively small area, then we would ideally need a sufficiently high number of discrete data values, resulting automatically in a larger number of smaller elements.
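As an illustration of Equations (1) and (2), the volume-ratio weights for a single tetrahedral element can be computed directly. This is a minimal sketch in Python with NumPy; the function names are ours, not from the paper.

```python
import numpy as np

def tet_volume(a, b, c, d):
    """Signed volume of the tetrahedron with vertices a, b, c, d."""
    return np.dot(np.cross(b - a, c - a), d - a) / 6.0

def sf_interpolate(p, verts, w):
    """Shape-function interpolation at point p inside a tetrahedron.

    verts: (4, 3) array of corner coordinates; w: the 4 measured values.
    Each weight N_i = V_i / V, where V_i is the volume of the sub-tetrahedron
    obtained by replacing corner i with the query point p.
    """
    V = tet_volume(*verts)
    N = np.empty(4)
    for i in range(4):
        sub = verts.copy()
        sub[i] = p                      # replace vertex i by the query point
        N[i] = tet_volume(*sub) / V     # volume-ratio shape function
    return float(np.dot(N, w))
```

At an element corner the corresponding weight becomes 1 and the others 0, matching the behavior described above; at the centroid all four weights are 1/4.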
2.1.2. Extension Approach of the SF-Based Spatiotemporal Interpolation Method
Although spatial interpolation methods are well developed and widely adopted in various GIS applications [39,40,41,42], the traditional spatial interpolation methods face many challenges when handling spatiotemporal data because of the additional time attribute of the data set. One of the major challenges is that traditional GIS researchers tend to treat space and time separately when interpolation needs to be conducted in the continuous space-time domain. The primary strategy identified from the literature is to reduce spatiotemporal interpolation problems to a sequence of snapshots of spatial interpolations [19]. However, integrating space and time simultaneously is anticipated to yield better interpolation results than treating them separately for certain typical GIS applications [43].
In order to integrate space and time simultaneously for a spatiotemporal interpolation, the extension approach has been proposed in [9] and reviewed in [37,38]. This approach treats time as another dimension in space, thereby extending the spatiotemporal interpolation problem into a higher-dimensional spatial interpolation problem. Applications using the extension approach can be found in [9,32,44,45]. To develop the extension approach for SF-based interpolation methods, we substitute the z variable in Equations (1) and (2) by $ct$, where t is the time variable and c is a factor of [spatial distance unit/time unit]. Equations (3) and (4) define our SF-based spatiotemporal interpolation method for 2D space and 1D time problems:
$$w(x, y, t) = N_1 w_1 + N_2 w_2 + N_3 w_3 + N_4 w_4, \qquad (3)$$
where $N_1$, $N_2$, $N_3$, and $N_4$ are the following linear shape functions:
$$N_i = \frac{V_i}{V}, \quad i = 1, 2, 3, 4, \qquad (4)$$
with the volumes $V_i$ and $V$ now computed in the extended $(x, y, ct)$ coordinate system.
Please note that there are some assumptions and resulting limitations for this approach. We assume that there are sufficient data measurements in space and time so that simple functions can be used to describe what is happening between two measurements. If there were a relatively large time interval, or data were scarcely sampled in space, and the type of data under consideration had the potential of strong oscillations between the points in space and time, we would not be able to use a simple linear function to interpolate from one space-time point to the next. Therefore, before using this simple spatiotemporal approach, we have to make sure that the process we analyze cannot show strong oscillations, and that we have sufficient measurements in space and time. We only use the method to evaluate events that have already happened; we are not trying to predict the future with this method.
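Under the extension approach, one off-the-shelf route (an illustrative sketch, not the authors' implementation) is to scale time by the factor c and hand the 3D points (x, y, ct) to SciPy's Delaunay-based linear interpolator, which applies exactly the tetrahedral volume-ratio weights of Equations (3) and (4):

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def sf_spacetime_interpolator(xy, t, w, c):
    """Build an SF-style interpolant over the extended (x, y, c*t) space.

    LinearNDInterpolator triangulates the 3D points into tetrahedra and
    interpolates with barycentric (volume-ratio) weights; the caller then
    queries it at extended coordinates (x, y, c*t).
    """
    pts = np.column_stack([np.asarray(xy, float),
                           c * np.asarray(t, float)])
    return LinearNDInterpolator(pts, np.asarray(w, float))
```

The factor c controls how strongly temporal separation counts relative to spatial separation when the mesh of tetrahedra is formed.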
2.2. IDW-Based Spatiotemporal Interpolation Using the Extension Approach
Inverse Distance Weighting (IDW) is also known as Shepard's method [7,8]. Similar to SF-based interpolation methods, IDW is based on Tobler's First Law of Geography [46], which states: "Everything is related to everything else, but near things are more related than distant things" (p. 236). IDW is generally considered a spatial interpolation method, but this paper applies IDW to spatiotemporal interpolation by using the extension approach and treating time as a third dimension [9,37].
2.2.1. General Formula of the IDW-Based Spatial Interpolation Method
According to [47], the general formula of the IDW-based interpolation method in 2D space is:
$$w = \sum_{i=1}^{N} \lambda_i w_i, \qquad \lambda_i = \frac{1/d_i^{\,p}}{\sum_{j=1}^{N} 1/d_j^{\,p}}, \qquad (5)$$
where $w$ is the interpolated value at the unknown (or unsampled) location $u = (x, y)$, N is the number of nearest known points surrounding $u$, $w_i$ are the measurement values at the nearest known points $u_i = (x_i, y_i)$ of $u$ (with $i = 1, \dots, N$), $\lambda_i$ are the weights assigned to $w_i$, $d_i$ are the Euclidean distances between $u$ and each $u_i$, and p is the exponent that influences the weighting of $w_i$ on $w$.
2.2.2. Extension Approach of the IDW-Based Spatiotemporal Interpolation Method
The formula of the extension approach of IDW used in this paper is
$$w = \sum_{i=1}^{N} \lambda_i w_i, \qquad \lambda_i = \frac{1/d_i^{\,p}}{\sum_{j=1}^{N} 1/d_j^{\,p}}, \qquad (6)$$
where
$$d_i = \sqrt{(x - x_i)^2 + (y - y_i)^2 + (ct - ct_i)^2}$$
and c is a factor defined as [spatial distance unit/time unit]. Compared with Equation (5), Equation (6) replaces the 2D distance with the 3D Euclidean distance between $(x, y, ct)$ and $(x_i, y_i, ct_i)$.
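Equation (6) can be sketched as follows; this is an illustrative implementation, not the authors' code, with the scaled time ct simply treated as a third coordinate:

```python
import numpy as np

def idw_spacetime(query, samples, values, c, N=5, p=2.0):
    """IDW with the extension approach.

    query: (x, y, t); samples: (n, 3) array of (x_i, y_i, t_i);
    c scales time into a spatial dimension; N nearest neighbors; exponent p.
    """
    q = np.array([query[0], query[1], c * query[2]])
    pts = np.asarray(samples, float).copy()
    pts[:, 2] *= c                          # extend time into space
    d = np.linalg.norm(pts - q, axis=1)     # 3D Euclidean distances
    if np.any(d == 0):                      # query coincides with a sample
        return float(values[np.argmin(d)])
    idx = np.argsort(d)[:N]                 # N nearest known points
    wgt = 1.0 / d[idx] ** p
    return float(np.sum(wgt * np.asarray(values, float)[idx]) / np.sum(wgt))
```

With larger c, temporally distant measurements are downweighted more strongly relative to spatially distant ones.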
2.3. Cross Validation
The first goal of this paper is to compare whether SF-based or IDW-based spatiotemporal interpolation using the extension approach is more accurate in interpolating an actual set of daily fine particulate matter (PM2.5) data in the contiguous United States. K-fold cross validation [48] is used in this paper for this purpose.
2.3.1. K-Fold Cross Validation
Classic validation divides the full data set into two data sets: a training data set and a validation data set. The validation data set is used for estimating the performance of the interpolation method based on the training data set. The interpolation method with the smallest error is selected as the best method. However, a potential flaw is that we may miss some characteristics in the full data set and make an inaccurate estimate of our model's interpolation ability. Thus, k-fold cross validation is used to avoid this limitation. In this framework, the full data set is randomly split into k equal-sized groups, with one group as the validation set and the remaining k − 1 groups together forming the training set. This is repeated k times. In practice, 10-fold (k = 10) cross validation is accepted as providing a highly accurate estimate of a model's prediction errors. For large data sets, this approach may be computationally expensive.
Using 10-fold cross validation, the PM2.5 data set in our experiment is randomly partitioned into ten nearly equally sized folds. Ten iterations of training and validation are performed such that, within each iteration, a different fold of the data is held out for validation while the remaining nine folds are used for learning. More specifically, within each iteration, the following two actions are taken. First, the points in one fold (test data) of the PM2.5 data set are interpolated using the remaining nine folds (training data); therefore, each point in the test data will have both the original PM2.5 concentration measurement and an interpolated PM2.5 concentration value.
Second, error statistics are calculated to compare the original and interpolated PM2.5 values in the test data.
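The two actions above can be sketched as follows; this is a simplified outline in which `interpolate` stands in for either the SF- or IDW-based method:

```python
import numpy as np

def k_fold_pairs(data, values, interpolate, k=10, seed=0):
    """Collect (observed, interpolated) pairs over k cross-validation folds.

    data: (n, d) array of space-time coordinates; values: the n measurements;
    interpolate(point, train_points, train_values) returns one estimate.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))          # random partition of the points
    folds = np.array_split(idx, k)
    observed, predicted = [], []
    for i in range(k):
        test = folds[i]                        # hold out one fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        for ti in test:                        # interpolate each held-out point
            observed.append(values[ti])
            predicted.append(interpolate(data[ti], data[train], values[train]))
    return np.array(observed), np.array(predicted)
```

Each point appears in exactly one test fold, so every measurement ends up with one interpolated counterpart for the error statistics below.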
2.3.2. Error Statistics
The error statistics used in this paper are: MAE (Mean Absolute Error), MSE (Mean Squared Error), RMSE (Root Mean Squared Error) and MARE (Mean Absolute Relative Error). They are defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |I_i - O_i|, \quad \mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} (I_i - O_i)^2, \quad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \quad \mathrm{MARE} = \frac{1}{N}\sum_{i=1}^{N} \frac{|I_i - O_i|}{O_i}, \qquad (8)$$
where N is the number of observations, $I_i$ is the interpolated value, and $O_i$ is the original measurement value. Each iteration of the 10-fold cross validation holds out a different fold of true measurements, so from the mathematical point of view it is reasonable to calculate averages of the 10 sets of error statistics. We use MAE¯, MSE¯, RMSE¯, and MARE¯ to denote the average error statistics results in this paper.
In addition, we use an $R^2_{CV}$ error statistic, which is also known as the coefficient of determination. The regular $R^2$ error statistic measures how close the data are to the fitted regression line, whereas the $R^2_{CV}$ in [49] measures how close the data are to the 1:1 line. In this paper, we use the $R^2_{CV}$ error statistic defined in [49]:
$$R^2_{CV} = 1 - \frac{\sum_{i=1}^{N} (I_i - O_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}, \qquad (9)$$
where $\bar{O}$ is the mean of the original values. As for the other error statistics in Equation (8), the average of the ten results needs to be calculated. We use RCV2¯ to denote the average $R^2_{CV}$ result in this paper.
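The five statistics can be computed directly from the interpolated values I and original measurements O of a validation fold. This sketch assumes strictly positive measurements for MARE:

```python
import numpy as np

def error_stats(I, O):
    """Return (MAE, MSE, RMSE, MARE, R2_CV) per Equations (8) and (9)."""
    I, O = np.asarray(I, float), np.asarray(O, float)
    mae = np.mean(np.abs(I - O))
    mse = np.mean((I - O) ** 2)
    rmse = np.sqrt(mse)
    mare = np.mean(np.abs(I - O) / O)              # assumes O > 0
    # closeness to the 1:1 line rather than to a fitted regression line
    r2cv = 1.0 - np.sum((I - O) ** 2) / np.sum((O - O.mean()) ** 2)
    return mae, mse, rmse, mare, r2cv
```

For the 10-fold procedure, this function is applied once per fold and the ten results are averaged to obtain MAE¯, MSE¯, RMSE¯, MARE¯, and RCV2¯.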
2.4. Linking PM2.5 to Census Population
We collected population data at the census block group level. To map PM2.5 and population spatial distribution, we created choropleth maps based on the interpolated values as well as population data. Because PM2.5 may exhibit different spatial patterns in different seasons, we also investigated the seasonal variations. In order to associate the PM2.5 values with the national standard, the revised U.S. EPA National Ambient Air Quality Standards for PM2.5 in 2006 were adopted in this paper. We conducted spatial queries to identify areas where annual and daily PM2.5 are above the standard and calculated the population size in these areas.
3. Experimental Data
The data used in this study are daily PM2.5 concentrations measured in 2009 by U.S. EPA monitoring sites in the contiguous United States.
3.1. PM2.5 Data Set with Measurements
The data coverage contains locations of the monitoring sites, the daily concentration measurements of PM2.5, and the days of the measurements. We obtained a number of data sets from the U.S. EPA website [50] and reorganized them into a data set with the schema (id, x, y, [time], w), where id identifies the monitoring site, x and y are the longitude and latitude coordinates of the monitoring sites, [time] is (year, month, day) when a PM2.5 measurement is taken, and w is the measured PM2.5 value. The reorganized data set has some entries with zero PM2.5 values, which means no measurements were available at a particular site and on a particular day. After all the zero entries are deleted, there are 146,125 daily measurements at 955 monitoring sites. The monitoring sites are illustrated as stars (*) in Figure 2.
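The reorganization and zero-entry filtering can be sketched with pandas; the column names here are illustrative assumptions, not the actual EPA file headers:

```python
import pandas as pd

def clean_pm25(df):
    """Keep the (id, x, y, year, month, day, w) schema and drop zero entries.

    Zero PM2.5 values indicate that no measurement was available at a site
    on a given day, so they are removed rather than treated as data.
    """
    df = df[["id", "x", "y", "year", "month", "day", "w"]]
    return df[df["w"] != 0].reset_index(drop=True)
```

The same idea applies regardless of how the raw files are named or combined beforehand.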
Figure 2
U.S. Environmental Protection Agency (EPA) monitoring sites. These monitoring sites have PM2.5 (fine particulate matter) measurements across the contiguous United States in 2009.
3.2. Census Block Group Data Set to Interpolate
In our experiment, we want to interpolate daily PM2.5 concentration values in 2009 at the centroids of all the 207,630 census block groups in the contiguous United States. Census block groups are statistical divisions of census tracts and are generally defined to contain between 600 and 3000 people. They are the smallest geographical unit for which the United States Census Bureau publishes sample data. Our experimental data set with locations to compute interpolation has the format (id, x, y), with id as the identification number of a census block group and (x, y) as the longitude and latitude coordinates of its centroid. Since PM2.5 concentration values at the centroid of each census block group and on each day in 2009 are not measured, there are 207,630 × 365 = 75,784,950 PM2.5 values to be interpolated.
The motivation for interpolating at the small geographic level of the census block group is that we aim to link the interpolation results with the census block group population data in the same year for the second goal of this paper. As discussed in the Results section of the paper, we analyze population exposure to PM2.5 and estimate the U.S. population with unhealthy PM2.5 exposure. Such estimates are important, and in future work we plan to link them to a variety of health outcomes to evaluate PM2.5's adverse impact on human health.
4. Results
4.1. Cross Validation Results of the SF-Based Method
4.1.1. Choice of Time Scale
In order to decide on an appropriate time scale for the SF-based method using the extension approach, we tested four time scales as shown in Table 1. The factor c in the table is from Equations (3) and (4).
Table 1
Four time scales tested for the PM2.5 (fine particulate matter) data set.

Time         Scale A    Scale B      Scale C     Scale D
             (c = 1)    (c = 1/10)   (c = 1/5)   (c = 1/15)
01/01/2009   1          0.1          0.2         0.067
01/02/2009   2          0.2          0.4         0.133
01/03/2009   3          0.3          0.6         0.2
01/04/2009   4          0.4          0.8         0.267
…            …          …            …           …
12/31/2009   365        36.5         73          24.333
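The scaled time values in Table 1 follow from numbering the days of 2009 consecutively and multiplying by the factor c from Equations (3) and (4), as a short sketch shows:

```python
from datetime import date

def scaled_time(d, c):
    """Map a date in 2009 to its scaled time value: day-of-year * c."""
    day_of_year = (d - date(2008, 12, 31)).days   # 01/01/2009 -> 1
    return day_of_year * c

# e.g. Scale C uses c = 1/5, so 01/02/2009 maps to 2 * (1/5) = 0.4
```

Rounded to three decimals, these values reproduce the entries of Table 1 for all four scales.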
A challenge of using the extension approach for spatiotemporal interpolation is the correlation between space and time, i.e., which choice of the factor c is optimal for a particular data set. This is an open question and a rarely studied research topic in GIS. In this paper, we tested only the four possible time scales in Table 1. More research is needed to address this challenge in the future.
4.1.2. Cross Validation and Error Statistics
Ten-fold cross validation was implemented to test the four time scales in Table 1. Since there are ten iterations in 10-fold cross validation and a different fold of the data is held out for validation during each iteration, the average of the ten error statistics has been calculated for each error statistic in Equation (8).
Table 2 shows the results for the average error statistics (MAE¯, MSE¯, RMSE¯, MARE¯, and RCV2¯) using the SF-based extension method for the PM2.5 data set. All five error statistics are based on the interpolated and original values, $I_i$ and $O_i$ in Equations (8) and (9), but they have different sensitivity to error patterns. The ideal situation is that MAE¯, MSE¯, RMSE¯, and MARE¯ are lowest while RCV2¯ is highest for the same time scale choice. If not, we need to make a choice according to the characteristics of the five error measures. MSE, RMSE, and $R^2_{CV}$ are sensitive to individual outliers. MAE is less sensitive to outliers but does not reflect relative prediction errors. MARE is less sensitive to outliers and also incorporates the predicted mean to measure the error from a model prediction: the same size of error may be unacceptable for a small predicted mean but acceptable for a large predicted mean. MARE is therefore a better choice to evaluate overall model performance. However, if outliers are a major concern, RMSE or RCV2¯ would be better choices.
Table 2
Error statistics for the PM2.5 data set using the shape function-based extension method and 10-fold cross validation before removing outliers.
Error Statistics   Scale A    Scale B      Scale C     Scale D
                   (c = 1)    (c = 1/10)   (c = 1/5)   (c = 1/15)
MAE¯               3.1512     3.5576       3.2463      3.7307
MSE¯               85.8621    78.5322      78.4890     77.1072
RMSE¯              8.8832     8.6045       8.6067      8.5023
MARE¯              3.2162     0.4158       0.3745      0.4365
RCV2¯              0.3079     0.3226       0.3138      0.3382
We produced a scatter plot to compare observed daily PM2.5 values with interpolated daily PM2.5 values across monitoring sites (see Figure 3). Descriptive statistics show that the original PM2.5 values contain 16 outliers with PM2.5 values far above the normal range. According to the National Ambient Air Quality Standards (NAAQS) established by the U.S. EPA under authority of the Clean Air Act, the 24 h standard for PM2.5 is met if the three-year average of the annual 98th percentile of values at designated monitoring sites in an area is less than or equal to the standard level [51]. The outlying PM2.5 values might have been wrongly recorded, or some short and extreme conditions may have occurred. Since these conditions are unusual, we removed these 16 outliers from the original 146,124 values. The new error statistics after removing the outliers are recorded in Table 3.
Figure 3
Scatter plots. Comparing observed daily PM2.5 values with interpolated daily PM2.5 values at monitoring sites across the contiguous United States in 2009.
Table 3
Error statistics for the PM2.5 data set using the shape function-based extension method and 10-fold cross validation after removing outliers.
Error Statistics   Scale A    Scale B      Scale C     Scale D
                   (c = 1)    (c = 1/10)   (c = 1/5)   (c = 1/15)
MAE¯               3.0941     3.4976       3.1812      3.6751
MSE¯               42.2910    37.7745      35.6601     39.2077
RMSE¯              6.5032     6.1461       5.9716      6.2616
MARE¯              3.2135     0.4128       0.3708      0.4349
RCV2¯              0.4817     0.5371       0.5630      0.5195
Compared with Table 2, Table 3 shows better error statistics for all measures. Scale C outperformed the other three scales on all error statistics except MAE¯, on which Scale A was best. However, Scale A performed significantly worse on MSE¯, RMSE¯, MARE¯, and RCV2¯. Thus, Scale C is selected as the best time scale for daily PM2.5 interpolation using the SF-based extension method.
4.2. Cross Validation Results of the IDW-Based Method
4.2.1. Choice of Time Scale, Number of Neighbors, and Exponents
In order to choose an appropriate time scale for the IDW-based method using the extension approach and compare it with the SF-based method, the same four time scales in Table 1 were tested for the IDW-based method.
We evaluated 45 IDW configurations with five choices for the number of nearest neighbors N (3, 4, 5, 6 and 7) and nine choices for the exponent p (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5 and 5.0).
4.2.2. Cross Validation and Error Statistics
Similar to evaluating the SF-based method, 10-fold cross validation was implemented to test the time scales, as well as the choices for the number of nearest neighbors N and the exponent p. The optimal average error statistics among the forty-five combinations of N and p are summarized in Table 4 for each chosen time scale, along with the values of N and p at which the optimal averages were obtained. Based on Table 4, we choose Scale B as the best of the four time scales for the IDW-based method since it provides the lowest MSE¯, RMSE¯, and MARE¯, as well as the second highest RCV2¯.
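The sweep over the forty-five (N, p) combinations can be outlined as follows; this is a sketch in which `cv_error` stands in for a full 10-fold cross-validation run returning the error statistic to be minimized:

```python
import itertools

def best_idw_config(cv_error,
                    Ns=(3, 4, 5, 6, 7),
                    ps=(1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)):
    """Return the (N, p) pair with the lowest cross-validation error.

    cv_error(N, p) is assumed to run the 10-fold procedure for one
    configuration and return e.g. the average RMSE over the folds.
    """
    return min(itertools.product(Ns, ps), key=lambda cfg: cv_error(*cfg))
```

In practice a separate sweep is run per error statistic, which is why Table 4 reports a possibly different optimal (N, p) for each measure.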
Table 4
Error statistics for the PM2.5 data set using the IDW-based extension method and 10-fold cross validation. MAE¯ (Mean Absolute Error), MSE¯ (Mean Squared Error), RMSE¯ (Root Mean Squared Error), MARE¯ (Mean Absolute Relative Error) and RCV2¯ are the optimal averages of the error statistics; the (N, p) values at which each optimum was obtained are given in parentheses.
Error Statistics | Scale A (c = 1) | Scale B (c = 1/10) | Scale C (c = 1/5) | Scale D (c = 1/15)
MAE¯  | 3.1586 (N = 4, p = 1.0) | 3.2856 (N = 3, p = 2.0) | 3.1070 (N = 3, p = 2.0) | 3.4207 (N = 5, p = 2.5)
MSE¯  | 75.3792 (N = 7, p = 1.0) | 67.8379 (N = 7, p = 1.5) | 68.0293 (N = 6, p = 1.0) | 68.2309 (N = 7, p = 1.5)
RMSE¯ | 8.3258 (N = 7, p = 1.0) | 7.8888 (N = 7, p = 1.5) | 7.8967 (N = 7, p = 1.0) | 7.9143 (N = 7, p = 1.5)
MARE¯ | 2.7005 (N = 7, p = 1.0) | 0.3803 (N = 3, p = 5.0) | 0.9717 (N = 3, p = 5.0) | 0.3963 (N = 3, p = 2.5)
R²CV¯ | 0.3789 (N = 7, p = 1.0) | 0.4413 (N = 4, p = 1.0) | 0.4416 (N = 7, p = 1.0) | 0.4374 (N = 7, p = 1.0)
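The five error measures reported above can be computed per cross-validation fold and then averaged. The helper below is our own illustration using the standard definitions of these statistics (the exact formulas are assumed, not quoted from the paper):

```python
import math

def error_stats(observed, predicted):
    """Compute MAE, MSE, RMSE, MARE, and cross-validation R^2
    for one fold; averaging over folds yields the barred values."""
    n = len(observed)
    resid = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(r) for r in resid) / n
    mse = sum(r * r for r in resid) / n
    rmse = math.sqrt(mse)
    # relative error is undefined for zero observations; assumed nonzero here
    mare = sum(abs(r) / abs(o) for r, o in zip(resid, observed)) / n
    mean_o = sum(observed) / n
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    r2 = 1.0 - sum(r * r for r in resid) / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MARE": mare, "R2": r2}

stats = error_stats([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
```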
The decision of what values of N and p to use in order to achieve the best IDW interpolations possibly depends on the error statistic deemed most important to optimize. It should be noted that [31] also discussed the character of the exponent and suggested that the exponent should be deduced from the form of pollution encountered. For air pollution, [31] suggests that elementary reasoning shows that the exponent should be 2 or 3, but more sophisticated considerations could show that the exponent may vary between 1 and 3. For our study, the best exponent could depend on the specific outcome or measure we wanted to model. Hence, we experimented with different exponents p (1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5 and 5.0) in order to select the one with the best performance via error analysis. If it were only possible to run an interpolation for one choice of the number of nearest neighbors N and the exponent p (because of time constraints, lack of computational resources, etc.), then the configurations of (N = 7, p = 1.0) and (N = 3, p = 5.0) seem better than the other configurations that were tested. The configuration of (N = 7, p = 1.0) yields the second highest R²CV¯ among all time scales with a very close result to the highest R²CV¯, whereas the configuration of (N = 3, p = 5.0) yields the least MARE¯ among all time scales. In order to further investigate the difference between the configurations of (N = 7, p = 1.0) and (N = 3, p = 5.0) under Scale B, we conducted a further experiment to compare just these two configurations. The comparison results are shown in the first two columns of Table 5. We consider the configuration of (N = 3, p = 5.0) better than (N = 7, p = 1.0) under Scale B because (N = 3, p = 5.0) yields a smaller MARE¯. Similar to the SF-based interpolation method, we removed the same outliers, recomputed the error statistics for the configuration of (N = 3, p = 5.0) under Scale B, and recorded them in the third column of Table 5. All of the error statistics improved after removing outliers.
Table 5
Error statistics comparison of two configurations under Scale B using the IDW-based extension method and 10-fold cross validation.
Error Statistics | N = 7, p = 1.0, Scale B (c = 1/10), before Removing Outliers | N = 3, p = 5.0, Scale B (c = 1/10), before Removing Outliers | N = 3, p = 5.0, Scale B (c = 1/10), after Removing Outliers
MAE¯  | 3.4519 | 3.3378 | 3.2765
MSE¯  | 68.0348 | 79.5497 | 37.5608
RMSE¯ | 7.8909 | 8.6320 | 6.1287
MARE¯ | 1.2594 | 0.3803 | 0.3773
R²CV¯ | 0.4413 | 0.3359 | 0.5399
4.3. Comparison of SF-Based and IDW-Based Extension Methods
The first goal of this paper is to compare the performance of the SF-based and IDW-based spatiotemporal interpolation methods in order to find the most suitable method for the PM2.5 data. It is evident from the error statistics, as shown in Table 2 and Table 3 for the SF-based method and Table 4 and Table 5 for the IDW-based method, that Scale C using the SF-based method is the best interpolation method among all the methods that we have tested for the PM2.5 data set. Both the SF-based and IDW-based methods see improvements in the accuracy of all error statistics, except MAE¯, when choosing a different time scale than Scale A, with significant improvement of MARE¯. The SF-based method outperforms the IDW-based method even in the IDW-based method's best scenarios, i.e., the combinations of the number of nearest neighbors and exponents that minimize the relevant error statistics. Therefore, we choose the SF-based extension method using Scale C to interpolate the PM2.5 data set for population exposure analysis.

In addition to the accuracy comparison based on cross validation, the SF-based spatiotemporal interpolation method using the extension approach is computationally efficient because the algorithm is linear according to Equations (3) and (4). On the other hand, the IDW-based method is non-linear according to Equation (6). Therefore, the IDW-based spatiotemporal interpolation method is not as computationally efficient as the SF-based method.
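The linearity of the SF-based method can be illustrated with a minimal sketch (our own, not the paper's Equations (3) and (4) verbatim): inside each simplex of the triangulation, the interpolated value is a fixed linear combination of the vertex measurements, weighted by barycentric shape functions. Shown here for a 2D triangle; the extension approach applies the same idea in (x, y, c·t). Function names are illustrative.

```python
def shape_functions(tri, x, y):
    """Barycentric shape functions N1, N2, N3 of point (x, y)
    inside triangle tri = ((x1, y1), (x2, y2), (x3, y3))."""
    (x1, y1), (x2, y2), (x3, y3) = tri
    det = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1)  # 2 * signed area
    n1 = ((x2 - x) * (y3 - y) - (x3 - x) * (y2 - y)) / det
    n2 = ((x3 - x) * (y1 - y) - (x1 - x) * (y3 - y)) / det
    n3 = 1.0 - n1 - n2                                   # partition of unity
    return n1, n2, n3

def sf_interpolate(tri, values, x, y):
    """Linear combination of vertex values: u = N1*u1 + N2*u2 + N3*u3."""
    return sum(n * v for n, v in zip(shape_functions(tri, x, y), values))
```

Because the shape functions depend only on the geometry of the triangulation, each interpolated value is one dot product with the vertex measurements, which is the source of the method's computational efficiency.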
4.4. Population Exposure Analysis
The second goal of this paper is to evaluate the population exposure to fine particulate matter PM2.5 in the contiguous United States. Annually updated population data are only available from the five-year American Community Survey at the census block group level. Therefore, we used census block groups in our analysis. The SF-based spatiotemporal interpolation using Scale C and the extension approach was implemented to compute PM2.5 values at the centroids of census block groups in the contiguous U.S. on each day in 2009, a total of more than 75 million interpolated values. The interpolated census block group-level PM2.5 was then linked to 2009 census block group population data.

To analyze the spatial relationship between the PM2.5 concentration and the population distribution at the census block group level, we first plot the population distribution in Figure 4a. Second, we plot the annual PM2.5 average values in Figure 4b. Several hotspots of high PM2.5 values, such as central south California, the Idaho–Montana border, and some regions in Pennsylvania, are distinctively shown in Figure 4b.
Figure 4
Spatial relationship between the PM2.5 concentration and the population distribution across the contiguous United States in 2009. (a) population distribution; (b) annual average PM2.5. We used natural breaks to define the color ramps. A lighter color represents a smaller value, while a darker color represents a higher value.
To investigate whether this pattern varies in different seasons, we break the annual average into seasonal averages. Because we used 2009 data, January, February, and December are combined as the winter season. Spring starts from March and ends in May. Summer starts from June and ends in August. The rest of the time is the fall season. Figure 5 shows the seasonal differences. We find that, in spring, the average PM2.5 values were high in the west and mountainous areas. The values substantially decreased in summer. In fall, the values increased in the southeast region. Some areas such as central south California had high PM2.5 values almost all year round.
Figure 5
Seasonal differences. The average PM2.5 in different seasons across the contiguous United States in 2009. In order to make the color scheme consistent in four seasons, we manually defined the classification scheme. The legend shows the class ranges.
In addition, we observe from Figure 5 that a region near the Idaho–Montana border shows higher PM2.5 values during spring and winter than during summer and fall of the year 2009. To verify this pattern, we used the PM2.5 Federal Reference Method (FRM)/Federal Equivalent Method (FEM) Mass (88101) daily data (arithmetic mean value) from AirNow to plot the PM2.5 values in 2009 at two monitoring stations in Idaho and Montana, as shown in Figure 6. The trends at these two stations are consistent with what we observed in Figure 5. The reason for this pattern remains unclear, despite efforts to elucidate its cause. More investigation of the cause of the high PM2.5 values in this region in 2009 is needed in the future.
Figure 6
Verification of a spatial pattern in a region near the Idaho–Montana border in 2009. Plots of PM2.5 measurements at two monitoring stations in Idaho and Montana using PM2.5 daily data from AirNow in 2009.
Finally, in order to associate the PM2.5 values with the national standard, the revised U.S. EPA National Ambient Air Quality Standards for PM2.5 in 2006 [51] were adopted in this paper:
35 micrograms per cubic meter (μg/m³) for 24 h: We identify block groups that have PM2.5 values greater than 35 μg/m³ for at least one day.
15 micrograms per cubic meter (μg/m³) for the annual mean: We identify block groups that have annual PM2.5 values greater than 15 μg/m³.
Figure 7 shows the geographic distribution of such census block groups with the annual and/or 24 h PM2.5 exceeding the U.S. EPA National Ambient Air Quality PM2.5 standards. The results suggest:
Figure 7
Geographic distribution of census block groups in the contiguous United States that exceeded the PM2.5 air quality standards in 2009.
there is a population of 27.8 million residing in census block groups in the contiguous United States with an annual PM2.5 exceeding the national standard of 15 μg/m³;
more than one-third of the U.S. population resided in census block groups where PM2.5 exceeded 35 μg/m³ for at least one day in 2009.
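The exceedance queries behind these two figures can be sketched as follows (our own illustration; the three mock block groups and field names are invented, and the real analysis ran over census block group records rather than Python dicts):

```python
# Each block group carries its population and the 365 interpolated
# daily PM2.5 values (ug/m3) at its centroid for 2009.
block_groups = [
    {"id": "bg1", "population": 1200, "daily_pm25": [10.0] * 364 + [40.0]},
    {"id": "bg2", "population": 800,  "daily_pm25": [16.0] * 365},
    {"id": "bg3", "population": 500,  "daily_pm25": [8.0] * 365},
]

ANNUAL_STD = 15.0   # annual-mean standard, ug/m3
DAILY_STD = 35.0    # 24 h standard, ug/m3

# Population where the annual mean exceeds the annual standard.
pop_annual = sum(
    bg["population"] for bg in block_groups
    if sum(bg["daily_pm25"]) / len(bg["daily_pm25"]) > ANNUAL_STD
)

# Population where at least one day exceeds the 24 h standard.
pop_daily = sum(
    bg["population"] for bg in block_groups
    if max(bg["daily_pm25"]) > DAILY_STD
)
```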
4.5. Web Application
The third goal of this paper is to develop a web application to interpolate and visualize in real time the spatiotemporal variation of ambient air pollution (including but not limited to PM2.5) across the contiguous U.S. The web application is based on the MEAN framework [52]. This framework relies on the MongoDB database [53] to store the application's data, the Express framework [54] to facilitate HTTP routing, AngularJS [55] to construct an MVC (Model View Controller) architecture that simplifies the building of responsive web pages, and NodeJS [56] to support the application. The use of MongoDB, Node.js, Express, and AngularJS provides a unified development approach: each of the technologies is based on JavaScript, which allows for more code reuse and less context switching for developers as they move between server-side and client-side application development.

In addition, a REST (REpresentational State Transfer) [57] Application Program Interface (API) is utilized to handle requests from clients, including user sign up and authentication, requests for interpolated pollution data, and requests for triangulations of measurement sites. A REST call is used to initiate the downloading of pollution data from the AirNow [58] File Transfer Protocol (FTP) server and initiate the triangulation and interpolation of the data using the SF-based method. AirNow is a U.S. EPA program that provides real-time observed air quality information across the U.S., Canada, and Mexico. It receives real-time air quality observations from over 2,000 monitoring stations and collects forecasts for more than 300 cities. The AirNow program includes a web services API for accessing current and historical pollution data [59]. However, queries to this service are generally rate limited to 500 per hour.
Therefore, the web application presented in this paper uses an alternative FTP server method to access the AirNow data. This web application uses an SF-based interpolation to compute and update any hour/parameter combination whose data has not yet been updated. Using this method, the system can always include the data for the latest downloaded hour and may include data for previous hours if a time-based interpolation has been calculated. Triangulations are stored in the MongoDB database in a "triangles" collection. When a query is received, the web application can use a geospatial query supported by MongoDB to locate the containing triangle in the triangulation and interpolate the PM2.5 concentration.

In order to use the web application, the user needs to sign up by filling out a simple form, or log in if they already have an account at the website [60]. After a successful log in, the user will see the screen in Figure 8. The screen includes an options menu on the left and an embedded Google Maps application on the right. The Google Maps application is the main panel used for visualization of pollution data, developed using the Google Maps API. When the user changes visualization options in the options menu, such as selecting the pollution parameter type, date, time, or visualization rendering parameters, the data in the Google Maps application will be updated automatically and responsively rendered.
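The point-in-triangle lookup can be sketched with MongoDB's standard $geoIntersects operator, assuming each document in the "triangles" collection stores its triangle as a GeoJSON Polygon under a "geometry" field with a 2dsphere index (the field name and schema are our assumptions, not taken from the paper):

```python
def containing_triangle_query(lon, lat):
    """Build the MongoDB filter that finds the triangle containing
    the query point; run it with a driver, e.g.
    db.triangles.find_one(containing_triangle_query(lon, lat))."""
    return {
        "geometry": {
            "$geoIntersects": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]}
            }
        }
    }

# GeoJSON uses [longitude, latitude] ordering.
query = containing_triangle_query(-81.78, 32.45)
```

Once the containing triangle is found, its three vertex measurements feed directly into the SF-based interpolation described in Section 4.3.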
Figure 8
Web application. Map overview screen after logging in.
Visualization of the pollution data is rendered on the client side by embedding a Google Maps application within the AngularJS application. Figure 9 shows the interpolated PM2.5 concentrations across the contiguous U.S. on 22 March 2016 at 18:00 GMT. This web application allows a user to visualize six air pollution parameters: O3 (ppb), PM2.5 (μg/m³), PM10 (μg/m³), CO (ppm), SO2 (ppb), and NO2 (ppb).
Figure 9
Web application. Rendering of PM2.5 concentrations across the contiguous U.S. on 22 March 2016 at 18:00 GMT, including intensities at known measurement sites and the resultant triangulations used in the shape function (SF)-based interpolation method.
5. Discussion
Due to technological advances and the societal need for analysis of physical phenomena that continuously change in space and time, such as weather and air quality variables, the collection and processing of spatiotemporal data become more and more important. There are significant spatial and temporal dependencies among these data, which are usually ignored or underemphasized by a purely spatial interpolation approach. Exploiting the additional temporal information has the potential to improve the interpolation result. Therefore, developing appropriate spatiotemporal interpolations is critical to estimating missing values at points from neighboring observations by looking deep into the spatial and temporal correlations.

This study compares the performance of the SF-based and IDW-based spatiotemporal interpolation methods in order to find an interpolation method suitable for an actual set of daily PM2.5 values in the contiguous U.S. This paper also explored population exposure to PM2.5 in the contiguous U.S. by linking interpolated PM2.5 at the centroids of census block groups to census population. Finally, we implemented a web application to interpolate and visualize in real time the spatiotemporal variation of ambient air pollution (including but not limited to PM2.5) across the contiguous U.S. using air pollution data from the U.S. EPA's AirNow program. There are some limitations and future work with our study:

This study is limited to investigating only four choices for time scales, five choices for the number of nearest neighbors, and nine choices for the exponents. In future work, we plan to apply machine learning methods to efficiently learn the best possible configurations in the model, using the lightning-fast cluster computing framework Apache Spark [61].
The SF-based and IDW-based methods are deterministic methods. In this paper, we did not compare our methods with geostatistical interpolation methods such as Kriging, neural networks, and land use regression. In future work, we plan to develop multidimensional and stochastic spatiotemporal interpolation methods suitable for ambient air pollution data (NO2, O3, PM2.5, and PM10) by incorporating factors associated with the environmental exposure of interest, and then make comparisons with other commonly applied geostatistical interpolation methods.

Finally, there is a limitation in the currently implemented SF-based algorithm with respect to missing data close to some boundaries of the contiguous United States. For example, along the west coast in Oregon and Washington, there are monitoring stations relatively far away from the coastal border. Because of the missing data, an unrealistic stripe next to the coast is visible in our map presentations of the interpolated results. In order to avoid this type of problem, we will need additional measurements along the coast, or use meshless interpolation methods such as IDW with a limited number of neighboring measurements in future work.

Additionally, in future work, we plan to link interpolated air pollution concentration values to a variety of health outcomes to evaluate air pollution's adverse impact on human health, as well as link the interpolated pollution values with individual GPS trajectories to better estimate personal-based air pollution exposure.
6. Conclusions
In conclusion, this study has made three contributions to the ambient air pollution and spatiotemporal interpolation research community.

First, using an actual set of daily PM2.5 values measured by U.S. EPA monitoring sites in the contiguous United States, the performance of the SF-based and IDW-based spatiotemporal interpolation methods was compared in order to find an interpolation method suitable for the PM2.5 data. The SF-based interpolation method performed better overall than the IDW-based method for the daily PM2.5 data.

Second, more than 75 million PM2.5 spatiotemporal interpolation results were calculated using the SF-based spatiotemporal method in the contiguous U.S. at the fine geographic level of census block groups. The interpolation results were linked to 2009 census block group population data so that the population with unhealthy PM2.5 exposure in the contiguous U.S. could be estimated. To map the PM2.5 and population spatial distributions, we generated choropleth maps based on the interpolated values as well as the population data. Because PM2.5 may exhibit different spatial patterns in different seasons, we also investigated the seasonal variations. We conducted spatial queries to identify areas where annual and daily PM2.5 were above the standards and calculated the population size in these areas.

Third, this study implemented a web application to interpolate and visualize in real time the spatiotemporal variation of ambient air pollution (including but not limited to PM2.5) across the contiguous U.S. using air pollution data from the U.S. EPA's AirNow program.