Spatial identification of potential health hazards: a systematic areal search approach.

Alina Svechkina¹, Marina Zusman¹, Natalya Rybnikova¹, Boris A Portnov².

Abstract

BACKGROUND AND AIMS: Large metropolitan areas often exhibit multiple morbidity hotspots. However, the identification of specific health hazards, associated with the observed morbidity patterns, is not always straightforward. In this study, we suggest an empirical approach to the identification of specific health hazards, which have the highest probability of association with the observed morbidity patterns.
METHODS: The morbidity effect of a particular health hazard is expected to weaken with distance. To account for this effect, we estimate distance decay gradients for alternative locations and then rank these locations based on the strength of association between the observed morbidity and wind-direction weighted proximities to these locations. To validate this approach, we use both theoretical examples and a case study of the Greater Haifa Metropolitan Area (GHMA) in Israel, which is characterized by multiple health hazards.
RESULTS: In our theoretical examples, the proposed approach helped to identify correctly the predefined locations of health hazards, while in the real-world case study, the main health hazard was identified as a spot in the industrial zone, which hosts several petrochemical facilities.
CONCLUSION: The proposed approach does not require extensive input information and can be used as a preliminary risk assessment tool in a wide range of environmental settings, helping to identify potential environmental risk factors behind the observed population morbidity patterns.

Keywords: Disease hotspots; Multivariate regression analysis; Receptor-oriented models; Source-oriented models; Systematic search approach; Wind adjustment

Substances: Particulate Matter

Year: 2017 PMID: 28173815 PMCID: PMC5297159 DOI: 10.1186/s12942-017-0078-8

Source DB: PubMed Journal: Int J Health Geogr ISSN: 1476-072X Impact factor: 3.918

Background

Air pollution from motor traffic and industrial facilities is known to be linked to respiratory, cardiovascular and cancer morbidity [1–9]. However, since urban areas are often characterized by multiple sources of air pollution, the identification of specific environmental hazards associated with the observed morbidity patterns is not always straightforward [10–13]. Traditional methods used to identify specific sources of air pollution include residence time analysis (RTA) and the chemical mass balance (CMB) method [14–22]. The former method is based on measurements of different air pollutants at the receptor sites [15, 18, 20, 23], while the CMB method investigates the chemical composition of air particles by comparing them with particles emitted from different emission sources [14, 16, 22]. However, the empirical implementation of these methods requires a considerable amount of information on the concentration of specific particles, detailed wind regime assessments and topographic attributes, which are not always available to researchers [14, 24–26].

In this study, we suggest an empirical approach to the identification of the specific health hazards that have the highest probability of being associated with the observed morbidity patterns. The proposed approach does not require extensive input information and can be implemented at a preliminary risk assessment stage, using basic geo-statistical tools. The proposed method is based on the expectation that the morbidity effect of a particular health hazard weakens with distance [9, 27–29]. As a result, people living in close proximity to a morbidity source are expected to exhibit, ceteris paribus, a higher rate of morbidity than those living at a distance from that source [11, 30]. To account for this effect, we estimate distance decay gradients of morbidity for alternative potential “source” locations and then rank these locations based on the strength of association between the observed morbidity patterns and wind-direction weighted proximities to these locations.

Spatial identification of pollution sources and morbidity hotspots

Empirical implementations of morbidity source assessments can be classified into two groups: source-oriented approaches and receptor-oriented methods [14, 15, 20, 23, 24, 26, 31–33]. The first group of methodologies uses data from different pollution sources and then computes the concentrations of different air pollutants at a given point of space, taking into account local meteorological conditions and topography [32, 34]. By contrast, the second group of methods uses data on air pollution measured at the pollution receptors’ sites and then estimates probable pollution sources, taking into account the backward wind trajectory and other relevant meteorological conditions (see inter alia [14, 20, 35]). In an early study [15], a method for identifying potential emission sources of sulphur dioxide (SO2) was developed. The method uses SO2 concentrations measured at the receptor site and then calculates a backward trajectory leading to the potential emission source. A separate study [36] discusses the results of chemical transport modeling of particulate matter (PM2.5), using data available for Northern Italy. In that method, ambient air pollution is partitioned between road transport, industries and domestic heating.

In several health geography studies, distances from residential locations to pre-identified environmental hazards are commonly used as proxies for unknown (or unidentified) exposures [37–40]. Potential health hazards to which this exposure assessment method was applied include highways, industrial sites, nuclear power plants and gas wells. For example, McKenzie et al. [6] estimated the health risk associated with areal proximity to natural gas wells in Garfield County, Colorado. In a separate study, Sermage-Faure et al. [38] investigated the risk of childhood leukemia around nuclear power plants. 
A total of 32,753 study subjects were subdivided into groups according to their residential proximity to the existing power plants, and the cancer incidence rates observed across different proximity bands were compared. The results suggested an excess of leukemia in close proximity to nuclear power plants. Zusman et al. [11] used proximity to an oil storage site as a proxy for residential exposure to unknown levels of emissions of volatile and semi-volatile organic compounds from the site. As the study revealed, the rates of lung and non-Hodgkin lymphoma (NHL) cancers declined with distance from the storage site, especially among the elderly (P < 0.01). A similar methodological approach was used in [30], which investigated the link between NHL morbidity and residence near heavy roads. In that study, the geographic distribution of NHL patients was adjusted by the overall density of population residing in the study area. The analysis indicated a steady decline in the density of NHL patients as a function of distance from main thoroughfare roads.

Although in the above-mentioned and other studies (see inter alia [6, 41–43]) areal proximities were used for assessing the adverse effects of different health hazards on human morbidity, this method was mostly applied to pre-identified health risk sources, that is, health hazards found at known locations, such as roads and industrial sites. In the past decades, several geo-statistical tests have also been developed to assess disease clusters around predefined sources of environmental hazards. These tests include Stone’s Maximum Likelihood Ratio Test [44], Tango’s Focused Test [45], Bithell’s Linear Risk Score Test [46], and the Lawson-Waller Score Test [47], collectively known as the “focused tests”. Although these tests can be used to identify clusters of events around one or several pre-specified locations, they cannot be used effectively if the source (or sources) of exposure is unknown. Handling this case is the task that the proposed identification method, based on a systematic areal comparison of alternative risk-source locations controlled for confounders, is designed to achieve.

Methods

Identification methodology

Assuming that the rate of morbidity observed at the ith point of space ($\mathit{morb}_i$) depends on the distance from the potential source of exposure, j, the relationship between $\mathit{morb}_i$ and $\mathit{dist}_{ij}$ can be expressed by the following linear function, reflecting a monotonic decline in $\mathit{morb}_i$ as a function of $\mathit{dist}_{ij}$:

$$ \mathit{morb}_i = b_0 + b_1 \cdot \mathit{dist}_{ij} + \varepsilon_i \qquad (1) $$

where $b_0$, $b_1$ are coefficients and $\varepsilon_i$ is a random error term. As long as the relationship between $\mathit{morb}_i$ and $\mathit{dist}_{ij}$ follows (1) and the locations of specific sources of exposure (e.g., roads, industrial facilities, etc.) are known a priori, the calculation of the strength of association between $\mathit{morb}_i$ and $\mathit{dist}_{ij}$ is technically simple. However, if the actual sources of exposure for $\mathit{morb}_i$ are unknown, alternative locations, j, can be assessed, one by one, as potential exposure sources. Such alternative locations can then be ranked by their “probability” of being the exposure source ($P_j$) for $\mathit{morb}_i$ using the coefficient of determination, $R^2$, between $\mathit{morb}_i$ and $\mathit{dist}_{ij}$ (see footnote 1):

$$ P_j = R^2(\mathit{morb}_i, \mathit{dist}_{ij}) \qquad (2) $$

The interpretation of (2) is relatively simple: values of $R^2$ close to 1 (when $b_1$ is negative) would indicate a high “probability” that exposure originating from point j is associated with morbidity observed in i, while values of $R^2$ close to 1 when $b_1$ is positive would indicate a “protective” effect, and values of $R^2$ close to zero would point to the absence of any significant association between the two variables.

Since the dispersion of air pollutants from a potential risk source is likely to be affected by the frequency of wind blowing from j to i [48, 49], the pairwise Euclidean distances ($\mathit{dist}_{ij}$) can be adjusted:

$$ \mathit{dist}^{w}_{ij} = f(\mathit{dist}_{ij}, W_{ij}) \qquad (3) $$

where $\mathit{dist}^{w}_{ij}$ is the distance between i and j adjusted by wind frequency ($W_{ij}$) between the points (measured as, e.g., annual or seasonal averages of directional wind frequencies), and $f$ is a distance transformation function (e.g., linear, quadratic and exponential transformations can be used; see “Appendix 2”). To account for this wind-adjustment effect, (1) can be rewritten as follows:

$$ \mathit{morb}_i = b_0 + b_1 \cdot \mathit{dist}^{w}_{ij} + \varepsilon_i \qquad (4) $$

The association between the observed health effect and proximity to a given health hazard can be confounded by other factors (such as the socio-economic status of the local population, residential densities, ethnicity, etc. [5, 13, 50–52]). The relationship between the rate of morbidity observed in i and $\mathit{dist}^{w}_{ij}$ can therefore be adjusted for confounders as follows:

$$ \mathit{morb}_i = b_0 + b_1 \cdot \mathit{dist}^{w}_{ij} + b_2 \cdot \mathit{GEO}_i + b_3 \cdot \mathit{SES}_i + b_4 \cdot \mathit{POL}_i + \varepsilon_i \qquad (5) $$

where $b_0, \ldots, b_4$ are regression coefficients; $\mathit{GEO}_i$ is a vector of geographical attributes of i (e.g., distance to main roads, elevation above the sea level, etc.); $\mathit{SES}_i$ is a vector of socio-economic attributes of i, including, e.g., the socio-economic status and ethnic makeup of the local population; $\mathit{POL}_i$ is a vector of air pollution levels measured at the ith point; and $\varepsilon_i$ is a random error term. As with (1), the coefficient of determination obtained for (5) can be considered as a measure of the “probability” that the morbidity observed in i originated from j:

$$ P_j = R^2(\mathit{morb}_i, \mathit{dist}^{w}_{ij}, \mathit{GEO}_i, \mathit{SES}_i, \mathit{POL}_i) \qquad (6) $$

The interpretation of (6) is similar to that of (2): values of $R^2$ close to 1 (when $b_1$ is negative) indicate a high “probability” that exposure originating from point j is associated with morbidity observed in i, while values of $R^2$ close to 1 (when $b_1$ is positive) would indicate a “protective” effect, and values of $R^2$ close to zero would point to the absence of any significant association between the variables. The essential difference between (2) and (6) is that the former is not controlled for potential confounders, while Eq. (6) takes such confounders into account.
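
The ranking logic of Eqs. (1)–(6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_r2` and `rank_candidates` are hypothetical, distances are plain Euclidean (the wind adjustment of “Appendix 2” is not reproduced), the toy data are invented, and ordinary least squares is fitted with NumPy.

```python
import numpy as np

def fit_r2(y, X):
    """Fit OLS of y on X (intercept added here); return (R^2, coefficients)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return float(r2), coef

def rank_candidates(obs_xy, morb, cand_xy, controls=None):
    """For each candidate source j, regress morbidity on the distance to j.

    Returns a list of (j, R^2, distance slope b1), sorted by R^2, highest first.
    `controls` is an optional (n, k) array of confounders held constant across j.
    """
    rows = []
    for j, src in enumerate(cand_xy):
        dist = np.linalg.norm(obs_xy - src, axis=1)
        X = dist[:, None] if controls is None else np.column_stack([dist, controls])
        r2, coef = fit_r2(morb, X)
        rows.append((j, r2, float(coef[1])))  # coef[0] is the intercept
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Toy data (hypothetical): ten observation points on a line, with morbidity
# falling linearly with distance from the point (0, 0).
obs_xy = np.column_stack([np.arange(10.0), np.zeros(10)])
morb = 10.0 - obs_xy[:, 0]
cand_xy = np.array([[0.0, 0.0], [4.5, 0.0], [9.0, 0.0]])

ranking = rank_candidates(obs_xy, morb, cand_xy)
# Only candidates with a negative slope are read as potential hazards;
# a high R^2 with a positive slope corresponds to the "protective" case.
hazards = [row for row in ranking if row[2] < 0]
```

In this toy case the candidates at both ends of the line fit equally well, and only the sign of the distance slope separates the hazard from the “protective” location, which is why the sign check accompanies the coefficient of determination.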

Empirical validation

We tested the proposed identification approach in two stages. During the first stage, we designed several theoretical examples in which loci of morbidity rates were positioned around pre-defined sources of exposure. In particular, we generated two identical, regularly spaced arrays of 100 “reference” points each, surrounding two predefined sources of exposure: either a point or a line (see Fig. 1; left panel). These arrays of “reference” points served in our tests both as disease observations and as points from which potential exposure could have been generated. The rates of morbidity were arbitrarily assigned to each reference point using one simple rule: in line with the expected distance decay relationship, reference points with higher morbidity rates were positioned closer to the pre-defined sources of exposure, while places with lower morbidity rates were positioned farther away from these sources (see Fig. 1; left panel). Then, we estimated bivariate regressions to assess the strength of association between $\mathit{morb}_i$ and $\mathit{dist}_{ij}$ for each “reference” point (a total of 100 equations, one for each reference point).
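
The point-source test described above can be imitated with synthetic data. The sketch below is an illustration under stated assumptions rather than a reproduction of Fig. 1: the 10 × 10 unit grid, the source position (3.2, 5.8) and the linear decay rule are all hypothetical choices, and the helper `r2_and_slope` is not part of the original study.

```python
import numpy as np

def r2_and_slope(y, x):
    """R^2 and slope of a bivariate OLS regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return float(r2), float(slope)

# 100 regularly spaced "reference" points (hypothetical 10 x 10 unit grid).
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
points = np.column_stack([gx.ravel(), gy.ravel()])

# Predefined point source (hypothetical location, deliberately off-grid).
source = np.array([3.2, 5.8])

# Morbidity assigned by a simple distance-decay rule (assumed linear here).
morb = 100.0 - 8.0 * np.linalg.norm(points - source, axis=1)

# Each reference point doubles as a candidate source: 100 regressions.
results = []
for cand in points:
    dist = np.linalg.norm(points - cand, axis=1)
    results.append(r2_and_slope(morb, dist))
r2 = np.array([r for r, _ in results])
slopes = np.array([s for _, s in results])

# Keep candidates with a negative distance slope and rank them by R^2.
score = np.where(slopes < 0, r2, -np.inf)
best = points[np.argmax(score)]
print("most probable source location:", best)
```

The array `r2` plays the role of the values that are later interpolated into a continuous “probability” surface; the interpolation step itself is not sketched here.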

Fig. 1

Input morbidity patterns (left panel) and risk source estimates (right panel). Notes: Green dots in the left panel diagrams indicate morbidity observations, with the size of each dot set proportional to the morbidity prevalence rate observed in a given location; the triangle and the solid line in the left panel diagrams indicate predefined sources of environmental pollution (see text for explanations).

We also incorporated a stochastic element into our analysis. In particular, in order to test the sensitivity of the models under varying levels of inputs, we used a random number generator to generate stochastic noise around the input morbidity rates in our “point” and “line” examples (see Fig. 1). Next, we ran 100 regressions for each of the simulated samples. The test did not change the regression results substantially. In particular, in the case of the “point” source (see Fig. 1b), the estimates for the distance variable were as follows: B = −11.17 (95% CI −11.71, −10.63) and t-stat = −40.84 (95% CI −41.163, −40.76), and for the “line” source (Fig. 1d): B = −5.90 (95% CI −7.98, −3.82) and t-stat = −5.47 (95% CI −5.63, −5.36). This confirms that our estimates are essentially robust.

Lastly, we interpolated the R² values to create continuous “probability” surfaces, differentiating between areas with high and low values of the coefficients of determination (see Fig. 1; right panel). To this end, we used the Empirical Bayesian Kriging (EBK) method, a kriging interpolation technique that differs from classical kriging methods by accounting for the error introduced by estimating the semivariogram model [17, 53]. The EBK parameters were set to the default values used by the ArcGIS™ 10.x software [54].

At the next step, we applied the proposed identification method to the real-world case of the Greater Haifa Metropolitan Area (GHMA) in Israel (Fig. 2), characterized by multiple health hazards. Background information on the study area, its location and geographic attributes is reported in Additional file 1.
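
The noise-sensitivity check applied to the theoretical examples can be sketched as follows. This is a simplified illustration, not the authors' procedure: the grid, the source location, the true slope of −8, the Gaussian form of the noise and its standard deviation are all assumptions, and each simulated sample is regressed only on the distance to the known source.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical setup: 10 x 10 grid, point source at (3.2, 5.8),
# true distance-decay slope of -8 morbidity units per distance unit.
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
points = np.column_stack([gx.ravel(), gy.ravel()])
dist = np.linalg.norm(points - np.array([3.2, 5.8]), axis=1)
true_slope, noise_sd, n_sims = -8.0, 2.0, 100

slopes = np.empty(n_sims)
for s in range(n_sims):
    # Perturb the input morbidity rates with Gaussian noise (an assumption).
    morb = 100.0 + true_slope * dist + rng.normal(0.0, noise_sd, dist.size)
    slopes[s] = np.polyfit(dist, morb, 1)[0]  # [0] is the slope

mean_slope = slopes.mean()
lo, hi = np.percentile(slopes, [2.5, 97.5])
print(f"slope = {mean_slope:.2f} (95% simulation interval {lo:.2f}, {hi:.2f})")
```

A narrow simulation interval around the noise-free slope is the kind of evidence that supports calling the estimates robust.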

Fig. 2

Map of the GHMA study area, showing residential buildings, main industrial facilities (1–5) and thoroughfare roads

We started our analysis of morbidity patterns in GHMA by geocoding residential addresses of lung and NHL cancer patients, obtained from the Israel National Cancer registry for the year 2012 [55], which were the latest annual records available in the database at the time of the study initiation (see footnote 2). Next, we calculated cancer rates in different areas of the GHMA, using the Double Kernel Density (DKD) tools (see Additional file 2). In order to convert the obtained continuous DKD surfaces of cancer density into discrete observations suitable for a multivariate analysis, we generated 1000 randomly distributed “reference” points covering the entire study area (i points). Following the analysis procedure suggested in [11], these reference points were “spatially joined” with the DKD surfaces of both types of cancer under study, enabling us to estimate the cancer morbidity rate for each “reference” point.

Using the “spatial join” tool in the ArcGIS™ 10.x software [54], we next assigned to each reference point the values of several variables, either drawn from small census area (SCA) data (such as socio-economic status, percent of residents employed in manufacturing, the share of total population over 65 years of age and neighborhood-level smoking rates) or generated from NOx and PM2.5 air pollution surfaces. The air pollution surfaces were interpolated by kriging, using annual averages of air pollution obtained from air quality monitoring stations. According to previous studies, the cancer latency period can vary substantially, ranging from several years to several decades [57]. To account for this effect, annual averages of NOx and PM2.5, obtained from 20 Air Quality Monitoring Stations (AQMSs) [58], were lagged by 10 years, a temporal lag commonly used in epidemiological studies of cancer [59–61]. 
That is, cancer DKD rates estimated for the year 2013 were compared with air pollution data for the year 2003 (see “Appendix 1”). We considered the above-mentioned variables as potential confounders for the observed cancer rates, as is commonly done in epidemiological studies of cancer morbidity [5, 13, 51, 52]. Descriptive statistics of the variables used in the analysis are reported in “Appendix 1”.

At the next step, we generated a map (layer) of 1000 evenly distributed points, representing locations of potential environmental hazards (j points). For the arrays of i and j, we next calculated Euclidean distances ($\mathit{dist}_{ij}$) from each morbidity point (i) to each source point (j). After these distance pairs were calculated, we introduced them into regression models as potential explanatory variables, in addition to the above-mentioned socio-demographic and geographic attributes, considered as controls. To address the issue of multicollinearity, individual $\mathit{dist}_{ij}$ variables were introduced into the models separately, one by one, in addition to the constant set of controls, and changes in the coefficients of determination were traced. The models were estimated separately for two dependent variables: NHL and lung cancer DKD rates.

Because simple Euclidean distances may not measure proximity accurately once wind frequency and direction are considered, we adjusted these distances by applying the wind frequency transformation discussed in “Appendix 2”. By way of this transformation, we calculated wind-weighted distances between each pair of i and j and then used these wind-adjusted distances ($\mathit{dist}^{w}_{ij}$) in the regression analysis as alternatives to the simple Euclidean distances used during the initial phase of the analysis. Next, using the morbidity reference points (i) as observations, we ran multivariate regressions for both types of cancer under study (that is, lung and NHL cancer separately), using the constant set of the above-mentioned socio-demographic explanatory variables and adding the wind-adjusted distance to one candidate source location (j) at a time. For the 1000 multivariate regressions obtained for each type of cancer (that is, one regression equation for each j point), we used the coefficient of determination to generate the “probability” surface, covering the entire study area and estimating how well the constant set of socio-demographic variables and the wind-weighted distance from each potential source point j to the disease observation points i explain the cancer rates observed at the i points.

In the initial stage of the analysis, the $\mathit{dist}^{w}_{ij}$ variables were introduced by their linear terms. However, as our analysis revealed, the relationship between the observed cancer morbidity and industrial proximities was best captured by a non-linear (parabolic) function (see Fig. 3), apparently because plumes of air pollution from tall industrial smokestacks land at some distance from the emission sources. To take this non-linear effect into account, we introduced a quadratic term of $\mathit{dist}^{w}_{ij}$ into the models, in addition to its linear term, and repeated the analysis. To estimate the parameters of Eq. (5), multivariate regression models incorporating linear and non-linear terms were used. In the following discussion, only non-linear models, providing better fits and generality compared to ordinary linear models, are reported.
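
The effect of adding a quadratic distance term can be illustrated on synthetic data. The sketch below is hedged throughout: the data, the coefficient values, the single stand-in control variable and the function names `ols_r2` and `f_change` are hypothetical, and the F statistic is the standard test for the R² gain between nested OLS models, which may differ in specification from the “F change” reported in Table 2. Distances are expressed in kilometres here only to keep the regression well conditioned.

```python
import numpy as np

def ols_r2(y, X):
    """OLS with intercept; returns (coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return coef, float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

def f_change(r2_reduced, r2_full, n, k_full, q):
    """F statistic for the R^2 gain from adding q predictors (nested OLS models).

    n is the number of observations, k_full the number of predictors
    (excluding the intercept) in the larger model.
    """
    return ((r2_full - r2_reduced) / q) / ((1.0 - r2_full) / (n - k_full - 1))

rng = np.random.default_rng(1)
n = 1000
d = rng.uniform(0.0, 10.0, n)       # distance in km (hypothetical)
ctrl = rng.normal(0.0, 1.0, n)      # one stand-in control variable
# Synthetic morbidity peaking 6 km from the source (assumed coefficients).
y = 2.0 + 3.0 * d - 0.25 * d**2 + 0.5 * ctrl + rng.normal(0.0, 1.0, n)

_, r2_lin = ols_r2(y, np.column_stack([d, ctrl]))
coef, r2_quad = ols_r2(y, np.column_stack([d, d**2, ctrl]))

b1, b2 = coef[1], coef[2]           # linear and quadratic distance terms
peak_km = -b1 / (2.0 * b2)          # vertex of the fitted parabola
F = f_change(r2_lin, r2_quad, n, k_full=3, q=1)
print(f"R2 linear={r2_lin:.3f}, quadratic={r2_quad:.3f}, "
      f"peak={peak_km:.2f} km, F change={F:.1f}")
```

A positive linear coefficient combined with a negative quadratic coefficient, the sign pattern that also appears in the quadratic models of Tables 1 and 2, implies a morbidity peak at the distance −b1/(2·b2) rather than at the source itself.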

Fig. 3

Changes in NHL and lung cancer incidence rates (per 100,000) as a function of distance from industrial facility 5 (see Fig. 2)

The probability surfaces were generated using the EBK interpolation technique in the ArcGIS™ 10.x software [54], while the multivariate regression analysis was performed using the SPSS™ v.23 software [62]. A probability level of less than 0.01 (<1%) was set as the threshold of statistical significance.

Results

Theoretical examples

Figure 1 features morbidity rates, marked by dots surrounding two pre-defined sources of exposure: a triangle (Fig. 1a) and a line (Fig. 1c). As mentioned previously, the dots marking morbidity observations in these diagrams are sized proportionally to the predefined morbidity rates: the higher the morbidity rate, the bigger the dot that marks it. In line with the expected distance decay relationship, larger dots are positioned closer to the pre-defined sources of exposure, while smaller dots are placed farther away from these sources (see Fig. 1a, c). Concurrently, the maps in the right panel (Fig. 1b, d) feature morbidity source estimates, calculated using the estimation approach described in the “Empirical validation” section. As Fig. 1b, d show, the spots with a high probability of being the source of exposure, marked by orange and red colours in the right panel, correspond fairly accurately to the actual locations of the pre-defined health hazards (Fig. 1a, c).

GHMA study

Figure 4a, b shows raster surfaces based on the determination coefficients (R²), obtained from bivariate regression models, estimated separately for lung (Fig. 4a) and NHL cancers (Fig. 4b). Concurrently, Fig. 4c, d shows source identification surfaces based on the determination coefficients obtained from multivariate regression models. The best performing regression models (both controlled and uncontrolled) are reported in Tables 1 and 2 (see footnote 3).

Fig. 4

Table 1

The association between double kernel density (DKD) of lung and NHL morbidity rates (cases per 100,000 residents) and distance to the revealed exposure sources (Method: bivariate regression; distance variables: linear and quadratic wind-adjusted distance terms)^c

Variables	Model 1	Model 2
Variables	B^a and (t^b)	B^a and (t^b)
A. Lung cancer
(Constant)	13.935 (58.947*)	1.131 (7.965*)
Distance	−5.500E−04 (−19.350*)	0.002 (4.364*)
Distance²	–	−1.115E−07 (−4.152*)
No. of reference points	1000	1000
R²	0.286	0.301
R² adjusted	0.285	0.299
F	374.419*	133.819*
B. NHL cancer
(Constant)	4.656 (17.237*)	−3.697 (−5.219*)
Distance	3.380E−04 (8.409*)	0.003 (13.916*)
Distance²	–	−2.189E−07 (−12.791*)
No. of reference points	1000	1000
R²	0.070	0.205
R² adjusted	0.069	0.204
F	70.714*	120.722*

Model 1: Bivariate linear model

Model 2: Bivariate quadratic model

* indicates a 0.01 two-tailed significance level

^a Regression coefficient

^b t-statistics in parentheses

^c The models reported in the table are estimated for the distances to the “best performing” source locations, marked by small triangles in Fig. 4, that is, the source locations whose distances improve the models’ fits the most (see text for explanations)

Table 2

The association between double kernel density (DKD) of lung and NHL cancer morbidity rates (cases per 100,000) and distance to the revealed exposure sources (Method: multivariate regression; distance variables: linear and quadratic wind-adjusted distance terms)^c

Variables	Model 3^d	Model 4^d
Variables	B^a and (t^b)	B^a and (t^b)
A. Lung cancer
(Constant)	6.661 (2.591*)	−12.629 (−3.959*)
Distance	−5.159E−04 (−7.470*)	0.003 (8.235*)
Distance²	–	−2.620E−07 (−8.159*)
N of reference points	1000	1000
R²	0.393	0.458
R² adjusted	0.386	0.450
ΔR²	–	0.065
F change^e	–	36.658*
B. NHL cancer
(Constant)	9.119 (5.231*)	−9.144 (−4.388*)
Distance	−2.862E−04 (−5.991*)	0.003 (13.359*)
Distance²	–	−2.415E−07 (−12.791*)
N of reference points	1000	1000
R²	0.242	0.369
R² adjusted	0.234	0.361
ΔR²	–	0.127
F change^e	–	92.855*

Model 3: Multivariate linear model

Model 4: Multivariate quadratic model

^a Regression coefficient

^b t-statistics in parentheses

^c The models reported in the table are estimated for the distances to the “best performing” source locations, marked by small triangles in Fig. 4, that is, the source locations whose distances improve the models’ fits the most (see text for explanations)

^d The models are controlled for distance to the nearest main road (m), elevation above the sea level (m), percent of Jewish population in the SCA, SCA socio-economic status, distance to the sea (m), manufacturing employment (% of total population of SCA), NOx (ppb), PM2.5 (ppb), total population over 65 (%) and smoking rate in the SCA (%)

^e F-test of the R² change compared with the model without hazard source distances (i.e., Models 3A or 3B, respectively)

Risk source assessment for lung cancer (left panel) and NHL cancer (right panel) by uncontrolled (a, b) and controlled regressions (c, d). Note: Black triangles mark the points, distances to which are used in the regression models reported in Tables 1 and 2.

Figure 4 uses a colour scheme similar to that of the theoretical examples discussed in the “Theoretical examples” section and shown in Fig. 1. In particular, warm-coloured pixels in these diagrams correspond to the highest improvements in the models’ determination coefficients, observed when wind-adjusted distances from these pixels are added to the models containing the constant “pre-set” of socio-demographic variables discussed in the “Empirical validation” section. Concurrently, blue and green colours in these maps mark pixels for which adding proximity results in relatively small changes in the models’ determination coefficients.

As Fig. 4 shows, there are two most probable loci associated with the observed morbidity: the central business district, saturated with traffic routes and located in the north-eastern part of the study area (for lung cancer cases), and a spot located in the central part of the study area (for both cancer types under study) (see Figs. 1, 4a, b). Adding proximities to these spots results in increases in the models’ determination coefficients by up to 14–29% in bivariate models and by up to 7–13% in multivariate models, depending on the cancer type under analysis (see Tables 1, 2).

Several interaction effects were also tested. Among them, two effects (i.e., the side of the Carmel mountain vs. elevation above the sea level and the side of the Carmel mountain vs. distance to the identified hotspot) were found to be statistically significant. Regression models incorporating these interaction effects are reported in Table 3.

Table 3

Variables	Model 5	Model 6	Model 7
Variables	B^a and (t^b)	B^a and (t^b)	B^a and (t^b)
A. Lung cancer
(Constant)	−15.663 (−7.937*)	−15.125 (−4.832*)	−15.791 (−4.948*)
Distance	0.004 (8.591*)	0.004 (9.314*)	0.004 (8.109*)
Distance²	−2.689E−07 (−8.258*)	−2.945E−07 (−9.067*)	−2.715E−07 (−7.587*)
No. of reference points	1000	1000	1000
R²	0.478	0.480	0.480
R² adjusted	0.470	0.471	0.472
F	56.308*	56.582*	56.790*
B. NHL cancer
(Constant)	−9.890 (−4.709*)	−9.233 (−4.402*)	−10.001 (−4.736*)
Distance	0.003 (13.563*)	0.003 (13.119*)	0.003 (12.436*)
Distance²	−2.438E−07 (−12.930*)	−2.457E−07 (−11.995*)	−2.486E−07 (−11.079*)
No. of reference points	1000	1000	1000
R²	0.373	0.374	0.374
Adjusted R²	0.364	0.364	0.364
F	39.311*	39.288*	36.704*

See comments to Table 2

Model 5: Multivariate quadratic model with the side of Mount Carmel vs. elevation above the sea level interaction term

Model 6: Multivariate quadratic model with the side of Mount Carmel vs. distance to the identified hotspot interaction term

Model 7: Multivariate quadratic model with both interaction terms added

The association between double kernel density (DKD) of lung and NHL morbidity rates (cases per 100,000 residents) and distance to the revealed exposure sources (Method: multivariate regression; distance variables: quadratic wind-adjusted distance terms; interaction terms added)^c

Discussion

Empirical studies use several methods for the spatial identification of potential health hazards. Such methods are mostly based on measurements of air pollutants at receptor sites, followed by a comparison of these measurements with the chemical composition of particles emitted from different emission sources [14, 15, 20, 23, 24, 26, 31, 32, 35]. However, the empirical implementation of these methods requires a considerable amount of information on the concentration of specific particles, detailed wind regime assessments and topographic attributes, which are not always available to researchers [14, 20, 26]. As an alternative approach, proximities to various health hazards, such as roadways, industrial sites, nuclear power plants and gas wells, are commonly used in epidemiological and health geography studies as proxies for unknown exposures (see inter alia [11, 27, 28, 30]).

In the present study, we extend this distance gradient method to the spatial identification of a priori unidentified hazards. The underlying assumption behind the proposed identification approach is that people living in close proximity to a morbidity source tend to exhibit, ceteris paribus, a higher rate of morbidity than those living at a distance from that source [11, 30]. To account for this effect, we estimated distance decay gradients of morbidity for alternative potential "risk source" locations and then ranked these locations based on the strength of association between the observed morbidity patterns and wind-direction weighted proximities to these locations.

In empirical studies, several measures are commonly used to estimate the improvement of regression models attributed to changes in the predictors’ set. Such measures include the log-likelihood criterion, the Akaike information criterion (AIC), the Bayesian information criterion (BIC), also known as the Schwarz criterion (SBC), Mallows’ Cp statistic, and several others. These criteria monitor changes in the regression residuals and thus help to select the combination of explanatory variables and the functional form of the model best fitted to the data under analysis [63]. In this study, we used R², a commonly used measure of model fit, also known as the coefficient of determination. Our preference for this measure was motivated by the fact that it does not depend on the order of variables, has a specific interval of change (0; 1), and does not depend on the functional form of the regression equation used [64]. Using this measure and applying it to the constant set of control variables, we monitored changes in the regression fit attributed to changes in wind-adjusted distances to alternative hazard locations, which were introduced into the models one by one. Since the set of control variables used in the study included the main factors known to affect cancer incidence rates in urban areas [51, 61, 65, 66], we did not consider it feasible to alter this predetermined set of controls. In other words, according to the proposed identification approach, the coefficient of determination, R², was treated as a likelihood criterion, by which we compared several combinations of input parameters. These combinations included the constant set of confounders and a number of vectors of wind-weighted distances between alternative potential health risk sources and morbidity observations.

In the theoretical examples we designed, the proposed approach helped to identify correctly the predefined locations of health hazards, while in the real-world case study, the main health hazards were identified as a spot in the industrial zone, which hosts petrochemical facilities, and a major transportation hub in the central business district of the city. According to previous studies (see inter alia [11, 38, 67]), petrochemical industries are known to be associated with elevated cancer morbidity in surrounding residential areas. In a separate study, [67] investigated morbidity near nuclear power plants and found it to be linked to childhood cancer. The results of the present study also correspond to the findings of other studies which revealed geographic concentrations of cancer morbidity near heavily trafficked roads [30, 40, 41, 68] and in proximity to industrial areas [11, 38]. Thus, [69] identified the link between traffic-related pollution and respiratory morbidity, measured by lung function impairment.

Several limitations of our study need to be mentioned. First and foremost, the present study is an ecological analysis, in which explanatory variables are measured at the group level or as distance gradients, and not estimated for individuals. Therefore, we cannot attribute causality to the relationships we observed. However, the strength of population-level studies is that they represent large population groups and reflect varying levels of exposure. The purpose of such studies is not to prove the relationships but rather to generate hypotheses which can be further examined using individual-level data [70].
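The ranking step discussed above (adding distance terms for one candidate location at a time to a fixed set of controls and comparing the resulting gains in R²) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `r_squared` and `rank_candidates`, the synthetic data, and the grid of candidate locations are all assumptions made for illustration, and plain Euclidean distance stands in for the wind-adjusted distance used in the study.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least squares fit of y on X (intercept added here)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def rank_candidates(points, morbidity, controls, candidates):
    """Rank candidate hazard locations by the R^2 gain over a controls-only model."""
    base = r_squared(controls, morbidity)
    gains = []
    for c in candidates:
        # Assumption: Euclidean distance; the study uses wind-adjusted distances.
        d = np.linalg.norm(points - c, axis=1)
        X = np.column_stack([controls, d, d ** 2])  # quadratic distance terms
        gains.append(r_squared(X, morbidity) - base)
    gains = np.array(gains)
    order = np.argsort(-gains)  # largest gain first
    return order, gains

# Synthetic demonstration (all values hypothetical).
rng = np.random.default_rng(0)
points = rng.uniform(0, 10_000, size=(500, 2))   # reference points, metres
controls = rng.normal(size=(500, 2))             # stand-in for the fixed control set
true_source = np.array([3000.0, 7000.0])         # predefined hazard location
dist = np.linalg.norm(points - true_source, axis=1)
morbidity = 20.0 - 0.002 * dist + 0.5 * controls[:, 0] + rng.normal(scale=0.5, size=500)

grid = np.arange(1000.0, 10_000.0, 2000.0)       # 1000, 3000, ..., 9000
candidates = np.array([[x, y] for x in grid for y in grid])
order, gains = rank_candidates(points, morbidity, controls, candidates)
best = candidates[order[0]]
print("Highest-ranked candidate:", best)
```

Because each candidate model nests the controls-only model, the gain in R² cannot be negative, so the ranking depends only on how much each candidate's distance terms add. Mapping `gains` back onto the grid would give a surface comparable in spirit to the coloured pixel maps described for Fig. 4.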

Conclusions

This paper contributes to the existing body of literature by extending the traditional distance gradient method (DGM) to the identification of potential health hazards whose geographic location is a priori unknown. The results of the study demonstrate the utility of the proposed method for epidemiological studies whose goal is to identify potential sources of exposure to which the observed morbidity is related. We also consider it important that the proposed approach does not require extensive input information and can be used as a preliminary risk assessment tool, helping to identify potential environmental risk factors behind the observed population morbidity patterns. The proposed approach can be used by researchers worldwide in cases in which specific sources of locally elevated morbidity are unclear or cannot be identified by traditional methods. For instance, the proposed method can be used in empirical studies in which available epidemiological data can help to map the existing morbidity patterns, and then to identify potential sources of exposure to which the observed morbidity patterns are related. However, future studies will be needed to extend the theoretical justification of the proposed approach, and to determine its applicability to other urban areas and to other health outcomes.

Table 4

Descriptive statistics of the variables used in the multivariate regressions

Variables	Minimum	Maximum	Mean	SD
DKD of NHL cancer cases (per 100,000)	0.00	18.54	6.78	2.49
DKD of Lung cancer cases (per 100,000)	0.00	27.22	10.08	4.04
Average distance to main industrial facilities (m)	755.57	9996.96	5402.55	2095.05
Distance to the nearest main road (m)	0.54	1217.84	163.71	182.13
Distance to the seashore (m)	2.38	14,302.92	4397.48	3872.51
Manufacturing employment (% of total population of the SCA)	0.00	29.30	14.68	6.20
Percent of Jewish population in the SCA	0.00	100.00	91.05	20.54
SCA socio-economic status (Index)	−1.62	2.88	0.44	1.08
NO_x in 2003 (IDW interpolation, ppb)	7.68	133.12	27.97	15.00
PM_2.5 in 2003 (IDW interpolation, ppb)	17.20	27.80	20.11	1.25
Total population over 65 (%)	0.00	0.39	0.17	0.06
Smoking rate in the SCA in 2003 (%)	15.07	41.78	18.87	3.49
Elevation above the sea level (m)	0.00	440.00	110.49	124.83
