
A spatial copula interpolation in a random field with application in air pollution data.

Debjoy Thakur¹, Ishapathik Das¹, Shubhashree Chakravarty².

Abstract

Interpolating a skewed conditional spatial random field with missing data is cumbersome in the absence of Gaussianity assumptions. Copulas can capture different types of joint tail characteristics beyond the Gaussian paradigm. Maintaining spatial homogeneity and continuity around the observed random spatial point is also challenging. Especially when interpolating along a spatial surface, the boundary points also demand focus in forming a neighborhood. As a result, importing the concept of hierarchical clustering on the spatial random field is necessary for developing the copula model with the interface of the Expectation-Maximization algorithm and concurrently utilizing the idea of the Bayesian framework. This article introduces a spatial cluster-based C-vine copula and a modified Gaussian distance kernel to derive a novel spatial probability distribution. To make spatial copula interpolation compatible and efficient, we estimate the parameter by employing different techniques. We apply the proposed spatial interpolation approach to the air pollution of Delhi as a crucial circumstantial study to demonstrate this newly developed novel spatial estimation technique.

© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2022, Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


Keywords: Bayesian Spatial Copula Interpolation; Expectation-Maximization algorithm; Hierarchical Spatial Clustering; Spatial Copula Interpolation; Von-Mises distribution

Year: 2022 PMID: 35996594 PMCID: PMC9385445 DOI: 10.1007/s40808-022-01484-6

Source DB: PubMed Journal: Model Earth Syst Environ

Introduction

The upward trend of Particulate Matter (PM) concentrations in the atmosphere has made air pollution one of the greatest threats to human civilization. Every year, nearly 0.8 million people die due to the direct and indirect effects of air pollution, and approximately 4.6 million people suffer from serious conditions such as chronic obstructive pulmonary disease (COPD), respiratory hazards, premature death, and so on (Auerbach and Hernandez 2012; Lim et al. 2012). Controlling air pollution therefore requires estimating air-pollutant concentrations with high accuracy. Spatial and spatio-temporal geostatistics is crucial for such prediction. A central task in geostatistics is interpolating a target variable at a particular time stamp, at an unobserved location, from its surroundings. In this scenario, researchers commonly use Inverse Distance Weighting (IDW), Ordinary Kriging (OK), Universal Kriging (UK), Disjunctive Kriging (DK), etc. (Isaaks and Srivastava 1989; Cressie 1990). Because of significant advances in data science, many scientists prefer neural network-based spatial and spatio-temporal interpolation techniques such as Geo-Long Short-Term Memory (Geo-LSTM), Random Forest Regression Kriging (RFK), and others (Ma et al. 2019; Shao et al. 2020). The traditional algorithms mentioned above use the variance-covariance function as a measure of dependence. Their main drawback is the Gaussianity assumption, which is rarely met. The neural network-based algorithms often perform better, but their mathematical justification is difficult, so applying such models in other cases can be challenging. These limitations motivate the copula-based spatial and spatio-temporal interpolation approach, which is flexible both theoretically and practically. A spatial copula can readily capture geo-spatial variability.
Spatial lag-based Gaussian and non-Gaussian bivariate copulas were used to interpolate four groundwater quality parameters in Baden-Württemberg (Bárdossy 2006). Many advances in spatial copulas followed, for example, the use of asymmetric copulas to measure spatial independence (Bárdossy and Pegram 2009), the use of Gaussian and non-Gaussian vine copulas to derive the conditional distribution at an unobserved location (Bárdossy 2011), and the use of a convex combination of Archimedean copulas in kriging (Sohrabian 2021). In addition, the Gaussian Copula (GC) has been applied in a Bayesian framework to predict maximum temperatures in the Extremadura region of southwestern Spain for the period 1980-2015 (García et al. 2021), and a bivariate copula has been used to measure the association between air pollution severity and its duration (Masseran 2021). As a copula-based bias-correction method, Alidoost et al. (2021) develop three multivariate copula-based quantile regression models to map daily air temperature data. Khan et al. (2020) apply regular (C- and D-) vine copulas and the Student t-copula to explore the spatial dependence structure of climate variables such as precipitation and air temperature. Masseran and Hussain (2020) combine extreme value models such as the Generalized Pareto Distribution (GPD) with copulas to illustrate the dependence between and a set of four major pollutant variables, namely, CO, NO, SO, and O. Carreau and Toulemonde (2020) introduce a spatial copula model based on an extra-parameterized multivariate extreme-value copula. A D-vine copula in quantile regression (D’Urso et al. 2022) has been used to explain the spatial and spatio-temporal behavior of COVID-19 in Italy. After capturing the seasonality and temporal dependency of the daily mean temperature, Erhardt et al. (2015) introduce a new spatial distance-based R-vine copula.
Using spatial lag as an important dependence parameter of a vine copula, Gräler (2014) and Bostan et al. (2021) introduce a spatial vine copula. Kazianka and Pilz (2011) improve the spatial copula in a Bayesian framework, using the Metropolis-Hastings Algorithm (MHA) to approximate the posterior predictive density; however, this method is limited to the GC family. Musafer and Thompson (2017) use a spatial vine copula and a dimension-reduction transformation to create a non-linear optimal multivariate spatial design that mitigates the prediction uncertainty of more than one variable. Quessy et al. (2015) introduce a copula-based semi-parametric algorithm to model intrinsically stationary and isotropic spatial random fields. A pairwise composite likelihood is defined with the help of a pair copula, using the generalized method of moments (Bai et al. 2014). The spatial factor copula model of Krupskii et al. (2018) combines the flexibility of a copula, the accountability of factor models, and the tractability of the GC in higher dimensions to fit spatial data at different temporal replicates. An extension of the spatial Gaussian copula interpolation method (Gnann et al. 2018) predicts the primary variable, groundwater quality, with the categorical information of the primary variable as a secondary variable. A mixture copula explains the spatial dependency of the air temperature at a location on its geo-spatial neighborhood (Alidoost et al. 2018). A translation process (TP) is proposed (Richardson 2021) for non-Gaussian spatial copula interpolation and is effective for modeling in the absence of a link function. A spatio-temporal heterogeneous copula-based kriging (HSTCAK) (Wang et al. 2021) measures the space-time dependency through the copula function and mitigates the heterogeneity problem by fuzzy clustering.
Crucial advancements in the spatial copula approach, such as tail dependency, asymmetric dependency, and an extension of the linear model of coregionalization, specifically model multivariate spatial data (Krupskii and Genton 2019), using the cross-covariance function as the measure of spatial dependence. The literature has applied copulas to spatial interpolation extensively, but previous authors have overlooked some constraints. (i) To estimate parameters, they use the Maximum Likelihood estimate, which does not provide a good estimate in the presence of missing data. (ii) After creating the spatial copula interpolation, they fix one point and calculate the probability distribution at different lags from that point. As a result, the copula is limited to a fixed reference frame, whereas the reference frame is random in reality. Therefore, we consider random points to form a cluster, prioritizing relative distance. (iii) At the time of spatial clustering, they disregard the significance of disjoint geo-spatial regions. Thus the intersection is the most affected area, where the effects of different clusters become confounded. (iv) They use the conditional expectation for interpolation, but it is invalid for an extreme-valued Probability Density Function (PDF). In this study we develop a novel spatial cluster-based copula model in different frameworks. We divide the entire spatial domain into k spatial clusters to get m number of spatial regions i.e. which is the class of all possible set of points in a spatial region. We create a conditional spatial random field . Here, is an induced probability space created using Carathéodory’s extension theorem and and Y is measurable random field and . Our objective in this research is to predict Y at an unobserved location on based upon the n distinct observed locations using the spatial copula interpolation algorithm in the classical and Bayesian frameworks.
The outline of this study is as follows: the details of the algorithm are introduced in “Method” section, the study area and the behavior of data are described in “Study area and data” section, the results and discussion regarding the case study are summarized in “Results and discussion” section and the conclusion is made in “Conclusion” Section.

Method

Fitting marginal distribution

In this section, we illustrate how to fit a suitable univariate parametric distribution to the empirical distribution. We divide the choice of PDF into three steps: (i) choosing a family of distributions, (ii) choosing a suitable marginal PDF from that family, and (iii) estimating the parameters of that marginal PDF. For step (i), we use the Cullen and Frey skewness-kurtosis graph. For step (ii), we use Kernel Density Estimation (KDE) and prioritize the values of the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Log-Likelihood (LogLik), and the Kolmogorov-Smirnov (KS) statistic. Step (iii) is the real challenge because there is missing data, so Maximum Likelihood Estimation (MLE) of the parameters is not advisable. For the Parametric Exponential Family (PEF) of distributions we use the Expectation-Maximization (EM) algorithm (McLachlan and Krishnan 2007). We use the Uniformly Minimum Variance Unbiased Estimator (UMVUE) technique for the circular probability distributions. For PEF we consider the fitted distribution is Log-Normal (LN) probability distribution then, . Let, are the observed data points and are the un-observed data points. The likelihood function of based upon the observed data: The complete, observed and missing data vectors are respectively, and reveals . The complete data log-likelihood function is: Let’s consider the E-step on the iteration of the EM algorithm where is the value after the iteration of EM. Using the Eq. (2) we compute the conditional expectation of Log-Likelihood of the Complete data (CElikC) based upon the updated value at the iteration, defined as in the following: Using the Eqs. (1), (2) we simplify the Eq. (3) and then in M-step we maximize . Therefore, the updated values are which is defined in the following: We estimate the parameter from Eq. (4).
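The E-step and M-step described above can be illustrated with a minimal sketch. It is not a reconstruction of Eqs. (1) to (4), whose symbols are not recoverable here. It assumes that values are missing completely at random and that the data are Log-Normal, so the EM iteration runs on the log scale with the usual normal sufficient statistics. The function name `em_lognormal` and its arguments are hypothetical.

```python
import math
import random

def em_lognormal(observed, n_missing, tol=1e-10, max_iter=10_000):
    """EM for LN(mu, sigma^2) with n_missing values missing completely at random.

    Assumption for illustration only: the missing values carry no information
    beyond the model, so EM converges to the observed-data MLE.
    """
    z = [math.log(x) for x in observed]          # log scale: normal model
    n = len(z) + n_missing                       # complete-data sample size
    s1, s2 = sum(z), sum(v * v for v in z)       # observed sufficient statistics
    mu, var = 0.0, 1.0                           # arbitrary starting values
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # E-step: expected sufficient statistics of the missing log-values
        e1 = n_missing * mu
        e2 = n_missing * (mu * mu + var)
        # M-step: complete-data MLE with the expected statistics plugged in
        mu_new = (s1 + e1) / n
        var_new = (s2 + e2) / n - mu_new ** 2
        converged = abs(mu_new - mu) + abs(var_new - var) < tol
        mu, var = mu_new, var_new
        if converged:
            break
    return mu, math.sqrt(var), iterations
```

Under these assumptions the iteration converges linearly at a rate equal to the fraction of missing values, which is one way to read a slowly flattening log-likelihood curve.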
To estimate the parameters of a circular probability distribution, for example the Von-Mises (VM) distribution, we avoid the computational complexity of the EM algorithm, which arises from the absence of a closed form. The presence of the Bessel function () prompts us to introduce a new theorem on the completeness and sufficiency of an estimator, from which we deduce a UMVUE of the parameters of the VM distribution:

Theorem 1

If VM then and are the UMVUE of and respectively and their corresponding variances are

Proof

See Appendix (6). Using Theorem (1) and the inverse trigonometric functions, we obtain the initial values and update the parameter values of the VM distribution as for LN (see Appendix (A.1)).
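The estimators of Theorem 1 are not recoverable from the text, so the following sketch substitutes a different, standard technique for obtaining initial values of the VM parameters: the mean direction from the inverse tangent of the averaged sines and cosines, and the concentration from the widely used piecewise approximation to the inverse of the Bessel-function ratio. It is not the UMVUE of the paper, and the function name `fit_von_mises` is hypothetical.

```python
import math
import random

def fit_von_mises(angles):
    """Return (mu_hat, kappa_hat) for angles given in radians.

    Swapped-in technique: moment-type estimates, not the UMVUE of Theorem 1.
    """
    n = len(angles)
    c = sum(math.cos(a) for a in angles) / n
    s = sum(math.sin(a) for a in angles) / n
    mu_hat = math.atan2(s, c)                  # mean direction in (-pi, pi]
    r = math.hypot(c, s)                       # mean resultant length in [0, 1]
    # Piecewise approximation to the inverse of I1(kappa) / I0(kappa);
    # it is undefined for r == 1 (all angles identical).
    if r < 0.53:
        kappa_hat = 2 * r + r ** 3 + 5 * r ** 5 / 6
    elif r < 0.85:
        kappa_hat = -0.4 + 1.39 * r + 0.43 / (1 - r)
    else:
        kappa_hat = 1 / (r ** 3 - 4 * r ** 2 + 3 * r)
    return mu_hat, kappa_hat
```

Wind directions recorded in degrees would need converting with `math.radians` before calling the function.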

Copula

A copula models the dependence between two or more random variables and is used to formulate the joint multivariate distribution from the marginal Cumulative Distribution Functions (CDFs). Let, be n Random Variables (RV) with corresponding marginal CDF s are respectively, . The joint distribution function can be defined as, which is the product of marginal and conditional distributions; however, the complexity of this approach grows with the number of random variables, so it is not applicable to a large number of random variables. Therefore, the copula function is defined to create a multivariate distribution from the n marginal distributions (Nelsen 2007; Sklar 1973) and to model the dependence between the multidimensional variables in the following way: There are different copula families, for example, Gaussian, Archimedean, Product, etc. They behave differently in the tails of the distribution. Compared with other traditional multivariate copulas, such as elliptical and Archimedean copulas, Vine Copulas (VC) are more flexible in capturing the inherent dependency. Under certain regularity conditions it is possible to express the dimensional multivariate copula mentioned in Eq. (6) as a product of pair-copulas (Aas et al. 2009) through the following iterative approach. For the bi-variate probability density function (BPDF) is: In the Eq. (7) is the applicable Pair-Copula Density Function (PCDF) for and . For the Tri-Variate Probability Density Function (TPDF) is: In Eq. (8) are the applicable PCDF and Conditional PCDF (CPCDF) respectively. Likewise, for the four-variate probability density function (FPDF) is: VC is a graphical approach representing an n-dimensional Multivariate PDF (MPDF) using suitable PCDF in a hierarchical manner where the dependence structures of have unconditional PCDF and that of the remaining has Conditional PCDF (CPCDF). In this paper, we focus on the C-Vine Copula (C-VC) because of its better flexibility.
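For orientation, the standard forms of Sklar's theorem and of the pair-copula decomposition for two and three variables are restated below. The notation is the common textbook one (marginal CDFs F_i, marginal densities f_i, copula C, pair-copula densities c) and is an assumption here; it may differ from the symbols used in Eqs. (6) to (8).

```latex
% Sklar's theorem: joint CDF from the marginal CDFs F_i and a copula C
F(x_1,\dots,x_n) = C\bigl(F_1(x_1),\dots,F_n(x_n)\bigr)

% Bivariate density: one pair-copula density times the marginal densities
f(x_1,x_2) = c_{12}\bigl(F_1(x_1),F_2(x_2)\bigr)\, f_1(x_1)\, f_2(x_2)

% Trivariate density: two unconditional pair copulas and one conditional pair copula
f(x_1,x_2,x_3) = f_1(x_1)\, f_2(x_2)\, f_3(x_3)\,
  c_{12}\bigl(F_1(x_1),F_2(x_2)\bigr)\,
  c_{23}\bigl(F_2(x_2),F_3(x_3)\bigr)\,
  c_{13|2}\bigl(F_{1|2}(x_1 \mid x_2),\,F_{3|2}(x_3 \mid x_2)\bigr)
```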
A C-VC with 4 variables has 3 trees, and each tree, has nodes and edges where like Fig. 1. In tree each edge between two nodes represents the PCDF. From Fig. 1 in tree the edges between each node is CPCDF where denotes the CPCDF of the first and third variable given the fourth variable and represents the CPCDF of the second and third variable given the fourth variable. In the tree each node is connected by an edge representing CPCDF (Fig. 1) of the first and second variable given third and fourth variable where

Fig. 1

Detail of the tree of the Vine Copula
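The tree structure described above can be enumerated with a short sketch. The root order `[4, 3, 1, 2]` is an assumption chosen so that the conditional pairs match the description (first and third given fourth, second and third given fourth, first and second given third and fourth); the function name `c_vine_trees` is hypothetical.

```python
def c_vine_trees(order):
    """List the pair-copula edges of a C-vine for a given root order.

    Each edge is (root, other, conditioning_set). In tree j the root is
    order[j] and the conditioning set holds the roots of all earlier trees.
    """
    trees = []
    for level in range(len(order) - 1):
        root = order[level]
        given = tuple(order[:level])           # grows by one variable per tree
        edges = [(root, other, given) for other in order[level + 1:]]
        trees.append(edges)
    return trees

# Assumed root order for the 4-variable example of Fig. 1
for j, edges in enumerate(c_vine_trees([4, 3, 1, 2]), start=1):
    print(f"tree {j}:", edges)
```

For n variables the sketch yields n - 1 trees and n(n - 1)/2 pair copulas in total, which is the count any regular vine must reach.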

Spatial interpolation

Here, we propose two novel spatial interpolation approaches that combine spatial clustering, copula theory, and the C-VC, assuming directional stationarity of the data. They are defined as follows:

Spatial copula estimation

Let be the spatial domain of interest for the spatial interpolation purpose. The salient features of this spatial clustering technique are the distance and the degree of similarity between two spatial points. Hierarchical Spatial Clustering (HSC) is defined using the complete linkage method (Hubert 1974). The threshold criterion of inclusion is a cutoff value of the Haversine Distance (HD) (Gade 2010) and of the correlogram between each pair of points. Here, the number of HSCs and the HSC radius are the two most important parameters. According to the principle of HSC, the Sum of Squares Within a Cluster (SSW) is smaller than the Sum of Squares Between Clusters (SSB). We take the optimal number of HSCs to be the point at which SSW reaches a plateau, according to the Elbow method. To determine the HSC radius, we arrange the HSC heights in ascending order and take a significant height as the radius. Therefore, in this context, we can think of the HSC as a spatial field defined in the following Eq. (10) Let be the k clusters, be the unobserved point, and be the observed point of the HSC where, ; where be the number of the unobserved points in HSC and , the centre of cluster. A threshold HD () is chosen that is surrounding the centre of each HSC, a circle is constructed with radius of units and refers the spatial auto-correlation cutoff to maintain the spatial continuity of HSC. For k clusters maximum number of distinct Spatial Regions (SR) is . Let be the presence vector of all unobserved points , where lon and lat stand for longitude and latitude of an unobserved point respectively. Now, we create a linear map in the following: In the Eq. (11) is a binary vector of length k, where The maximum number of presence vectors for k clusters is . As a result, we obtain at most distinct SR in our entire study area. Let be the m SR where . Let and that denotes is inside only. In Fig. 2 we divide the entire spatial domain into HSCs based on the HD between monitoring stations and the degree of homogeneity, where each circle denotes an HSC and the dots represent the observed monitoring stations contained in that HSC. As a result, the whole area is split into disjoint SRs; the dots in Fig. 3 represent the monitoring stations included in each SR, which are the salient points for interpolating along the surface of that SR.

Fig. 2

The entire spatial domain, Delhi is divided into four clusters and the corresponding monitoring stations

Fig. 3

The entire geographical area is split up into some disjoint regions which are shaded by different colour and the corresponding observed points belonging into each region
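The two building blocks of the regions shown in Figs. 2 and 3, the Haversine distance and the binary presence vector of a point with respect to k cluster circles, can be sketched as follows. The Earth radius constant, the two cluster centres, and the function names are assumptions for illustration; only the cutoff radius of 18026 m is taken from the text.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def presence_vector(lat, lon, centres, radius_m):
    """Binary vector of length k: entry j is 1 if the point lies inside circle j."""
    return tuple(int(haversine_m(lat, lon, c_lat, c_lon) <= radius_m)
                 for c_lat, c_lon in centres)

# Hypothetical cluster centres; points sharing a presence vector share a region
centres = [(28.60, 77.10), (28.70, 77.30)]
print(presence_vector(28.65, 77.20, centres, 18026))  # inside both circles
```

Grouping grid points by their presence vector gives the disjoint spatial regions, including the overlaps between circles.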

Utilizing the concept of the copula, we transform an HSC into a Spatial Random Field (SRF) to predict the values at the unobserved locations. Therefore, we concentrate on the inclusion probability of the latitude () and longitude () of an observed location in using univariate Marginal PDF (MDF). The corresponding MDFs are respectively and . Then, we evaluate bivariate PDF (BDF) of inclusion of latitude and longitude in the SR, and Kendall’s as the measure of association between two RVs. So, the joint BDF is as follows: To evaluate the CDF of that spatial random process (SRP), a composite function of two RVs i.e., we implement the KDE to get the MDF of Y i.e., F(y) and making use of copula we deduce the joint Tri-variate probability distribution (TDF) as follows: where . After getting, the BDF () and TDF () using copula we find out the conditional PDF (CPDF) defined in the following two equations: Here, we assume two RVs, latitude () and longitude () follow Uniform [] and Uniform [] respectively. Now we generate random points along, to measure the CDF of SRP i.e., having different CDF for each geographical position. We employ the EM algorithm to estimate the parameters when fitting the MDF. Applying copula we get the joint TDF of is defined as . Next, we introduce the HSC-based spatial interpolation (SI), namely Spatial Copula interpolation (SC). We split in m number of regions making use of the HSC algorithm, discussed earlier. Let’s consider, has three observed locations . In that region we can generate a number of gridded points not necessarily of uniform size, out of those unobserved points we consider one unobserved point, defined as which is the point in . Applying the conditional copula (from Eq. (13)) we establish the Conditional Copula-based Probability Distribution Function (CCDF) for each un-observed point in a SR i.e. . From the Eq. (14) we get the CCDF of un-observed point included in the first SR. That lets us calculate the CCDF of SRP, Y at the unobserved centroid of a cluster, and making the use of CCDF we can calculate the conditional copula-based probability density function (CCPDF). The mathematical formulation is described in the following: In the Eq. (15) the weights are defined as . These weights are proportional to the spatial auto correlation function (ACF) but inversely proportional to the degree of separation. So the required is defined in the following assuming the fact that, In the Eq. (16), is the degree of separation established upon the HD and the degree of departure of two PDFs defined on the SRF in the following Eq. (17) In the Eq. (17) is included for the computational adjustment (Machuca-Mory and Deutsch 2013) along the boundary points of each SR. specifying modified Gaussian distance kernel and here, as a distance we apply the degree of separation between two Conditional Copula-based Spatial PDF (CCSPDF) to capture the probabilistic spatial dissimilarity, and the last part is HD. For in the Eq. (16) choice of suitable covariance function is necessary. Therefore, we choose a suitable covariance function among well-defined variogram models, for example, Exponential, Gaussian, and Spherical (Cressie 1990). Applying this concept we adopt Algorithm (1) to interpolate over the entire spatial surface:
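The three candidate covariance models named above can be written in their common textbook form, as a sketch. The parameterization (a partial sill `sill` and a range parameter `a`, with no nugget and no practical-range scaling) is an assumption; the text does not state which parameterization enters Eq. (16).

```python
import math

def exponential_cov(h, sill, a):
    """Exponential model: C(h) = sill * exp(-h / a)."""
    return sill * math.exp(-h / a)

def gaussian_cov(h, sill, a):
    """Gaussian model: C(h) = sill * exp(-(h / a)^2)."""
    return sill * math.exp(-((h / a) ** 2))

def spherical_cov(h, sill, a):
    """Spherical model: compact support, exactly zero for h >= a."""
    if h >= a:
        return 0.0
    r = h / a
    return sill * (1 - 1.5 * r + 0.5 * r ** 3)

# Compare the three models at a few separation distances (units are arbitrary)
for h in (0.0, 0.5, 1.0, 2.0):
    print(h, exponential_cov(h, 1.0, 1.0), gaussian_cov(h, 1.0, 1.0),
          spherical_cov(h, 1.0, 1.0))
```

All three equal the sill at zero distance and decrease with distance; only the spherical model reaches zero at a finite range.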

Spatial Bayesian Vine-Copula estimation

We introduce spatial vine copula estimation based upon the Bayesian statistical approach (SBVC). Under the squared error loss function, we obtain the posterior estimate of the parameter using MHA in the following way: In Eq. (18) defines the Conditional Copula-based PDF (CCPDF) applying the inherited concept of Fig. 1. denotes the prior PDF of , and defines the Posterior PDF (PPDF) of . Using the PPDF we calculate the posterior estimate of under the absolute error loss function. After obtaining the updated values using MHA, we find the conditional Bayesian prediction of two variables in the following: Using Eq. (19) we interpolate the target variables at the target locations. Here, we assume two SRPs, and and apply the concept of tail-dependency of a bivariate copula to measure their hidden dependence. Utilizing VC we find the CCPDF i.e., . Regarding parameter estimation, we use UMVUE, EM, etc. when fitting the MDF, but for the parameters of the copula family we consider only the posterior estimate. Then, using the conditional expectation technique, we estimate the values at all the randomly generated gridded points and interpolate the values.
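A minimal sketch of posterior estimation with the MHA is given below. For simplicity it uses a single Gaussian-copula correlation parameter with a Uniform(-1, 1) prior as a stand-in for the vine-copula parameters; this choice, the random-walk proposal, and all function names are assumptions and do not reproduce Eq. (18). The inputs are sufficient statistics of the normal scores of the two margins.

```python
import math
import random

def log_post(rho, n, s2, sxy):
    """Log posterior of a Gaussian-copula correlation under a Uniform(-1, 1) prior.

    s2 = sum(x^2 + y^2) and sxy = sum(x * y) over the n pairs of normal scores.
    """
    if not -1.0 < rho < 1.0:
        return -math.inf                       # outside the prior support
    q = 1.0 - rho * rho
    return -0.5 * n * math.log(q) - (rho * rho * s2 - 2.0 * rho * sxy) / (2.0 * q)

def metropolis_hastings(n, s2, sxy, n_iter=20_000, burn_in=5_000, step=0.1, seed=0):
    """Random-walk MH; returns the posterior mean (Bayes estimate under squared error loss)."""
    rng = random.Random(seed)
    rho = 0.0
    lp = log_post(rho, n, s2, sxy)
    draws = []
    for i in range(n_iter):
        prop = rho + rng.gauss(0.0, step)      # symmetric proposal
        lp_prop = log_post(prop, n, s2, sxy)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            rho, lp = prop, lp_prop
        if i >= burn_in:
            draws.append(rho)
    return sum(draws) / len(draws)
```

Replacing the mean of the retained draws by their median would give the Bayes estimate under absolute error loss instead.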

Model validation

The accuracy of the models is validated by the following three methods, where is the observed data points and, is the predicted value: (i) Mean Absolute Error (MAE), (ii) Root Mean Square Error (RMSE), and (iii) K-fold cross-validation (CV). Using Eqs. (20) and (21) we measure the accuracy of the proposed model. The K-fold CV takes K as 10: the data set is divided into 10 random subsets, of which 9 serve as the training data set and the remaining 1 as the test data set; this is termed one-leave-one-out CV (OLOCV). It helps compare the MAE of the proposed and existing models.
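The three validation tools can be sketched in a few lines. The definitions of MAE and RMSE below are the standard ones and are assumed to coincide with Eqs. (20) and (21); the fold construction (a seeded shuffle followed by striding) and the function names are illustrative choices.

```python
import math
import random

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_idx, test_idx) for k random, disjoint folds of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # fold sizes differ by at most one
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

# Example: 10 folds over 38 stations; each station is held out exactly once
for train, test in k_fold_indices(38, k=10):
    pass  # fit the interpolator on `train`, score mae/rmse on `test`
```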

Study area and data

To demonstrate SC and SBVC and to compare them with OK, we take air pollution in Delhi as a case study. Delhi, the capital of India, is heavily polluted due to rapid urbanization, growing traffic, an increasing population, and energy consumption at an alarming level. Sometimes, the level of concentration in the air has reached up to 999 (Mukherjee et al. 2018) and, among all other air pollutants, it affects public health (Zheng et al. 2015) badly. Growing numbers of automobiles cause higher pollutant concentrations in the air (Samal et al. 2013). We use the air pollution data collected by the monitoring stations maintained by the Central Pollution Control Board (CPCB), the Delhi Pollution Control Committee (DPCC), and the Indian Institute of Tropical Meteorology (IITM). For this research, we collected data on several air pollutants, such as , , NO, NO, NO and wind direction (WD), from the CPCB websites. These data play an important role in mapping the spatio-temporal distribution of air quality and in deducing the effect of WD on air pollution in Delhi. The data were collected over 24 hours, and the period was taken from February 2017 to December 2021. Fig. 5 depicts the temporal variability of daily emission, which is cyclic with a fixed period. A higher concentration is always observed from almost the end of November to the end of December (Fig. 5) around and sometimes it grows up to which is very much alarming for the human life, primarily during winter due to burning of firecrackers, agricultural crop burning, etc.

Fig. 5

The time series plot of daily emission from February, 2017 to December, 2021 in the study area, Delhi

There are 38 monitoring stations in this data set, as shown in Fig. 3. Fig. 4 shows that the Northern part of Delhi is highly sensitive to pollution, whereas the Central, East, and West parts of Delhi are less sensitive with respect to pollutant concentrations in the air. According to Fig. 4, the concentration in the Northern part of Delhi can reach up to 220 whereas in the Central, East, and West parts of Delhi it is limited to 190 to 200 . As a result, the spatio-temporal variability in air pollutant concentrations is visible. However, there are two difficulties in applying spatial interpolation techniques here: (i) the Delhi NCR region is far away from the other monitoring stations in Delhi, and (ii) the data contain missing values.

Fig. 4

The interpolated values of November in the year, 2019

We can conclude from Fig. 6 that the observed frequency distribution of daily emission during this period is positively skewed, which suggests fitting a positively skewed distribution such as Log-Normal, Gamma, Exponential, or Weibull, depending upon the tail behaviour.

Fig. 6

The Box-Plot of daily emission from February, 2017 to December, 2021 in the study area, Delhi

Precisely there is a higher concentration in the interval from . Figure 7 provides a brief overview of the variability and a five-point summary of the pollutants and WD, which establishes that and have higher variability compared to other pollutants and that the variance of WD is also sensitive.

Fig. 7

The Box-Plot of daily , , NO, NO emission and WD from February, 2017 to December, 2021 in the study area, Delhi

Results and discussion

This section compares, step by step, the two new models, SC (Algorithm (1)) and SBVC (“Spatial Bayesian Vine-Copula estimation” section), with other well-known spatial models. Following that, we attempt to provide a brief overview of future pollutant concentrations and discuss mathematically how an important meteorological parameter can affect pollutant concentrations. We fit the parametric marginal CDF and PDF to the empirical CDF and PDF of an RV based on the AIC, BIC, and KS test statistic values in Fig. 8. Because the empirical PDF is positively skewed, we only consider well-known positively skewed distributions such as the Weibull, Log-normal (LN), Gamma, and Exponential distributions, and Table 1 shows that the LN distribution is the most suitable fit based on the lowest AIC, BIC, and KS test statistic values. Similarly, we fit the circular family of distributions to WD and find that the VM distribution is the best PDF to fit.

Fig. 8

The empirical marginal CDF is fitted by the marginal positively skewed parametric CDF and the fitting of marginal PDF

Table 1

The value of KS statistic, AIC and BIC to determine the feasible marginal parametric PDF

Test Criteria	Weibull	Log-normal	Gamma	Exponential
KS	0.06884229	0.02849193	0.06276518	0.1478233
AIC	365.360	322.296	346.187	413.547
BIC	373.098	330.035	353.926	417.416
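The selection rule behind Table 1 can be checked with a short sketch using the standard definitions AIC = 2k - 2 LogLik and BIC = k ln(n) - 2 LogLik. The parameter counts k (2 for Weibull, Log-normal, and Gamma; 1 for Exponential) are the usual ones and are assumed here. The sample size n = 354 is not stated in the text; it is inferred from the tabulated differences, since BIC - AIC = k (ln n - 2).

```python
import math

def loglik_from_aic(aic, k):
    """Invert AIC = 2k - 2 LogLik."""
    return (2 * k - aic) / 2

def bic(loglik, k, n):
    """BIC = k ln(n) - 2 LogLik."""
    return k * math.log(n) - 2 * loglik

# (AIC, BIC, assumed number of parameters k) copied from Table 1
table1 = {
    "Weibull": (365.360, 373.098, 2),
    "Log-normal": (322.296, 330.035, 2),
    "Gamma": (346.187, 353.926, 2),
    "Exponential": (413.547, 417.416, 1),
}
N_ASSUMED = 354  # inferred from BIC - AIC = k (ln n - 2); an assumption

best = min(table1, key=lambda name: table1[name][0])   # lowest AIC wins
recomputed = {name: bic(loglik_from_aic(a, k), k, N_ASSUMED)
              for name, (a, _, k) in table1.items()}
print(best, recomputed)
```

With these assumptions the recomputed BIC values agree with the tabulated ones to within rounding, and the Log-normal distribution has the lowest AIC.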

The next step is to estimate the parameters of the MDF. We already discussed the disadvantages of using MLE to estimate the parameters in the “Fitting marginal distribution” section. We therefore use the EM algorithm to obtain the updated shape and scale parameters of the LN distribution. Fig. 14 shows how the LogLik value converges to a fixed value after a certain number of iterations. The required number of iterations for the EM algorithm in this case study is 223, after which the difference between two successive LogLik values is negligible. Fig. 9 depicts how the value varies with respect to latitude and longitude. While the Longitude (Lon) ranges from 77.0 to 77.1 and the Latitude (Lat) varies from 28.5 to 28.6, the concentration is generally within but if Lon varies from 77.15 to 77.3, the concentration becomes high and it ranges from . Similarly, while Lat is varying from then the most spatial variability of is detected in every interval of Lon and sometimes reaches up to while the Lat varies from the spatial variation is identified and the variation of is almost lying between . However, when fitting the VM distribution we use the UMVUE of Theorem (1) in the “Fitting marginal distribution” section to obtain the shape and scale parameters of the VM distribution with better accuracy. Table 2 reports the shape and scale parameters of the LN and VM PDFs and the corresponding final LogLik values.

Fig. 14

The convergence of the Log-Likelihood value after updating the value of the parameters in each iteration using EM algorithm

Fig. 9

The nature of spatial variation of a RV

Table 2

Details and updated values of shape and scale parameter and the corresponding Log-likelihood values

PDF	Shape	Scale	LogLik
LN	4.3764856	0.7701984	-1959.331124
VM	3.583	1.908	-32.41559

Our goal now is to run the two novel spatial interpolation algorithms described in the “Spatial copula estimation” and “Spatial Bayesian Vine-Copula estimation” sections and compare them with other spatial interpolation approaches. Using the threshold criteria of the “Spatial copula estimation” section, we divide the entire spatial domain into 4 HSCs (Fig. 18) and take the cutoff radius to be 18026 m, as shown in Fig. 17. In the cluster dendrogram, the height represents the HD, and in the optimal-number-of-clusters plot, SSW is plotted along the Y-axis and the number of HSCs along the X-axis. The following part focuses on the tail dependence of two RVs, as shown in Fig. 10. These two RVs in this case study are and WD.

Fig. 18

Optimal number of HSC using Elbow Method

Fig. 17

Optimal HSC size and its cutoff HD
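As a rough illustration of cutting a hierarchical clustering at a distance threshold, the sketch below groups stations by great-circle distance and stops merging once the linkage distance exceeds the 18026 m cutoff quoted in the text. The complete-linkage rule, the haversine distance, the Earth radius, and the station coordinates are all assumptions of this sketch; the threshold criteria actually used are those of the “Spatial copula estimation” section.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumption of this sketch)

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def cluster_by_cutoff(points, cutoff_m):
    """Complete-linkage agglomerative clustering, cut at cutoff_m.

    The two closest clusters are merged repeatedly; merging stops when
    the smallest linkage distance exceeds the cutoff.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: the largest pairwise distance between members.
        return max(haversine_m(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff_m:
            break
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

# Hypothetical station coordinates (lat, lon), not the stations of the study.
stations = [(28.50, 77.00), (28.51, 77.01), (28.70, 77.30), (28.71, 77.31)]
groups = cluster_by_cutoff(stations, cutoff_m=18026)
```

With these toy coordinates the two nearby pairs end up in separate clusters, because the pairs are far more than 18026 m apart.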

Fig. 10

Left: the lower-tail and upper-tail dependence of the two RVs. Right: the joint CDF of the two RVs

The upper tail dependence and lower tail dependence are defined in Eq. (22). They describe how one RV behaves when the other takes extreme values (Czado and Nagler 2022). We can conclude from Fig. 10 that above 0.8 the upper tails of their distributions are independent, and below 0.1 the lower tails of their distributions are independent. As a result, we can say that higher values of concentration are unaffected by WD because there is a very low concentration at that point but where the marginal PDF of is moderate, there is a significant tail dependence on WD. The joint CDF of and WD is plotted on the right of Fig. 10 applying function in R, where the fitted copula is the Rotated Tawn Type-2 copula with estimated Kendall’s and the LogLik value is which is the highest of any copula family, including GC, t-Copula, Frank, Clayton, Joe, and so on. We apply our novel copula-based spatial interpolation algorithm (SC) in a Bayesian framework after fitting the copula using function in R. The posterior distribution and posterior estimates of the parameters are critical in this context, and MHA is used to obtain the posterior estimates. As Fig. 15 shows, we use Bayesian inference to obtain the posterior estimates, assuming that the prior distributions of the parameters are uniform and truncated normal. We then use MHA to obtain the posterior estimates under the MSE loss function, which are 0.04898261 and , respectively. The rate of convergence of the two parameters is plotted in Fig. 15, which shows that parameter 1 converges faster than parameter 2. We now look at how WD and spatial clustering affect the variance of in Table 3 and Fig. 16.
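The tail-dependence reading of Fig. 10 can be mimicked with a simple empirical estimator. The sketch below uses the standard conditional-probability definitions, P(V > q | U > q) for the upper tail and P(V <= q | U <= q) for the lower tail, evaluated on rank-based pseudo-observations. Eq. (22) is not reproduced in this section, so treating these as its definitions is an assumption, and the function names and data are hypothetical.

```python
def pseudo_obs(xs):
    """Map a sample to (0, 1) through its ranks: rank / (n + 1)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    u = [0.0] * len(xs)
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (len(xs) + 1)
    return u

def upper_tail(u, v, q):
    """Empirical P(V > q | U > q); approximates upper tail dependence as q -> 1."""
    exceed = [b for a, b in zip(u, v) if a > q]
    return sum(b > q for b in exceed) / len(exceed) if exceed else float("nan")

def lower_tail(u, v, q):
    """Empirical P(V <= q | U <= q); approximates lower tail dependence as q -> 0."""
    below = [b for a, b in zip(u, v) if a <= q]
    return sum(b <= q for b in below) / len(below) if below else float("nan")

# Hypothetical data: one perfectly increasing and one perfectly decreasing pair.
x = list(range(1, 101))
u = pseudo_obs(x)
v_same = pseudo_obs(x)
v_reversed = pseudo_obs([-t for t in x])
```

Evaluating `upper_tail` at q = 0.8 and `lower_tail` at q = 0.1 corresponds to the two thresholds discussed for Fig. 10; values near zero indicate tail independence at that threshold.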

Fig. 15

The rate of convergence of the two parameters of Rotated Tawn Type-2 copula family. In (left) the first parameter of the copula family and in (right) the second parameter of the copula family are estimated
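To make the MHA step concrete, here is a minimal random-walk Metropolis-Hastings sketch for a single parameter with a uniform prior. The target is a toy binomial likelihood chosen only because its posterior mean is known in closed form; it is not the Rotated Tawn Type-2 copula likelihood of the study, and the step size, chain length, and burn-in are assumptions. Under squared-error (MSE) loss the Bayes estimate is the posterior mean, which is what the last line computes.

```python
import math
import random

def metropolis_hastings(log_post, start, step, n_iter, seed=0):
    """Random-walk Metropolis-Hastings for a scalar parameter."""
    rng = random.Random(seed)
    theta, lp = start, log_post(start)
    draws = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        # Accept with probability min(1, posterior ratio); the Gaussian
        # proposal is symmetric, so no proposal correction is needed.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            theta, lp = prop, lp_prop
        draws.append(theta)
    return draws

def log_posterior(theta, successes=7, trials=10):
    """Toy target: Uniform(0, 1) prior times a binomial likelihood."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # outside the support of the uniform prior
    return (successes * math.log(theta)
            + (trials - successes) * math.log(1.0 - theta))

draws = metropolis_hastings(log_posterior, start=0.5, step=0.2, n_iter=20000)
kept = draws[2000:]  # discard burn-in (length is an assumption)
posterior_mean = sum(kept) / len(kept)  # Bayes estimate under squared-error loss
```

A trace plot of `draws` against the iteration index is the kind of diagnostic shown in Fig. 15. For this toy target the exact posterior is Beta(8, 4), so the estimate should be close to 8/12.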

Table 3

Two-way ANOVA to explain the dependence of emission on WD and SC

Treatment	Df	SS	MS	F Ratio	P-Value
WD	21	20599.2864	980.9184	3.696	0.02404*
Cluster	3	3623	1207.8	4.550	0.0334*
Cluster·WD	3	1283	427.7	1.611	0.2543
Residuals	9	2389	265.4

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
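The MS and F-ratio columns of Table 3 follow from the Df and SS columns by MS = SS / Df and F = MS(effect) / MS(residuals). The short check below recomputes them from the tabulated values; the results agree with the table up to the rounding of the SS entries. The dictionary layout and function names are illustrative only.

```python
# (Df, SS) pairs copied from Table 3.
table = {
    "WD": (21, 20599.2864),
    "Cluster": (3, 3623.0),
    "Cluster x WD": (3, 1283.0),
    "Residuals": (9, 2389.0),
}

def mean_squares(rows):
    """MS = SS / Df for every row."""
    return {name: ss / df for name, (df, ss) in rows.items()}

def f_ratios(rows, error="Residuals"):
    """F = MS_effect / MS_residual for every non-residual row."""
    ms = mean_squares(rows)
    return {name: ms[name] / ms[error] for name in rows if name != error}

ms = mean_squares(table)
f = f_ratios(table)
for name, value in f.items():
    print(f"{name}: MS = {ms[name]:.4f}, F = {value:.3f}")
```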

Fig. 16

How WD and Spatial cluster make an impact on in this case study

In the two-way analysis of variance (two-way ANOVA) model, we consider as the dependent variable and WD and clusters as independent variables. The columns of Table 3 give the Treatment, Degrees of freedom (Df), Sum of squares (SS), Mean square (MS), F-ratio, and P-value; the rows give WD, Cluster, their interaction effect, and the residuals. Table 3 shows that WD and clustering each have a significant impact on emission at the 0.05 level of significance, whereas the interaction of WD and clusters has no significant impact on emission. To aid comprehension, we present a graphical representation of this ANOVA table in Fig. 16 in the “Appendix C two-way ANOVA” section, where WD is represented along the X-axis, is represented along the Y-axis, and each spatial cluster is used as a panel. In the SC interpolation method, we investigate another factor, the spatial ACF, which is employed as an important weight to counteract spatial variability across all lags. Fig. 11 depicts the variation of the ACF with respect to the spatial lag. In Fig. 11, we notice that the ACF is comparatively higher for nearby stations than for stations far away. The blue shaded region indicates the interval of variation of the ACF values. In this case study the fitted variogram model is the Matérn variogram model with nugget 0, sill 617, range 0.02, and kappa 0.09. Using this value and the other distance weights in Eq. (15), we calculate the CCPDF at every unobserved location. We assume is 0.4224 and in the Eq. (15).

Fig. 11

Spatial ACF at each spatial lag. The lag distance is plotted along the X-axis and the ACF along the Y-axis

The entire framework is now ready to execute the new spatial copula interpolation (SC) described in the “Spatial copula estimation” section and the Bayesian Spatial-Vine Copula (SBVC) described in the “Spatial Bayesian Vine-Copula estimation” section. We create an SRF within each HSC and focus on the spatial region between them. For SC, we assume that Lat and Lon have a bivariate uniform distribution and that has an LN distribution. Based on AIC, BIC, and LogLik values, the suitable copula is the Clayton copula among other copula families like Gaussian, t-copula, and Archimedean copulas; we find their joint CDF using function in R, with a parameter of 0.01697 and a dimension of 3. Then, using Eq. (23), we obtain the required CCDF. Using Eqs. (15), (16), and (17) we get the CCPDF of each unobserved location, and using Algorithm (1) we get the interpolated values. In Fig. 12 we plot the monthly emission during November for the three years 2019, 2020, and 2021. From Fig. 12 we detect that using SC in November 2019, the emission varies from . Using SBVC it ranges from (Fig. 12). A similar pattern continues in 2020 and 2021. We detect from Fig. 12 that the variation of SBVC is greater than that of SC. The northern and southeastern parts of Delhi are highly sensitive. In the western part of Delhi, SBVC is ineffective at interpolating; as a result the emission is random (Fig. 12). We illustrate the relationship between the observed and predicted values of three methods, SC, SBVC, and OK, in Fig. 13. The relationship between the observed and predicted values is strongest for SC, followed by SBVC, and lastly OK. Thus we conclude that the explained variation of SC is greater than that of SBVC, which in turn is greater than that of OK. Fig. 19 shows that the MAE and RMSE are lowest for SC, followed by SBVC, and lastly OK.
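The conditional-CDF step for a 3-dimensional Clayton copula mentioned above can be sketched in closed form. Writing the Clayton CDF as C(u1, u2, u3) = (u1^-θ + u2^-θ + u3^-θ - 2)^(-1/θ), the conditional CDF of the third margin given the first two is the ratio of the mixed partial derivative of C to the bivariate Clayton density of (u1, u2), which simplifies to the expression in `clayton_ccdf`. Eq. (23) is not reproduced in this section, so this is a sketch of the standard result and not necessarily the exact form used in the article; only the parameter value 0.01697 is taken from the text, and the evaluation points are hypothetical.

```python
def clayton_cdf(u, theta):
    """d-dimensional Clayton copula CDF for theta > 0."""
    s = sum(x ** -theta for x in u) - (len(u) - 1)
    return s ** (-1.0 / theta)

def clayton_ccdf(u3, u1, u2, theta):
    """P(U3 <= u3 | U1 = u1, U2 = u2) under a 3-dimensional Clayton copula."""
    s12 = u1 ** -theta + u2 ** -theta - 1.0
    s123 = s12 + u3 ** -theta - 1.0
    return (s123 / s12) ** (-1.0 / theta - 2.0)

THETA = 0.01697  # Clayton parameter reported in the text

# Hypothetical uniform scores for the two coordinates and the target margin.
value = clayton_ccdf(0.5, u1=0.3, u2=0.6, theta=THETA)
```

Because the reported parameter is close to zero, the copula is close to independence, so the conditional CDF evaluated at 0.5 stays near 0.5 in this example.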

Fig. 12

Spatial interpolation of during November 2019, 2020, and 2021. Top: interpolation with the SC technique. Bottom: interpolation with the SBVC algorithm. Longitude is plotted along the X-axis, latitude along the Y-axis, and along the whole surface we plot the

Fig. 13

Relationship between the observed and predicted values of three methods: SC, SBVC, and OK

Fig. 19

Comparison of the performance of three methods: SC, SBVC, and OK. Along Y-axis we plot the RMSE and MAE of the four years from January, 2018 to December, 2021
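The accuracy measures compared across SC, SBVC, and OK can be computed as follows. MAE and RMSE are the measures named in Fig. 19; the coefficient of determination is included as one common way to quantify explained variation and is an assumption of this sketch, as are the function names and the toy numbers.

```python
import math

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    """Share of the variation in obs that is explained by pred."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed and interpolated values at three held-out stations.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
scores = (mae(observed, predicted),
          rmse(observed, predicted),
          r_squared(observed, predicted))
```

Lower MAE and RMSE and a higher explained-variation score indicate a better interpolator, which is the ordering criterion applied to the three methods.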

Although the SC method outperforms the other two, there are some areas where improvements are possible: (i) We assume that the rate of inclusion of geo-spatial points in a cluster is constant during clustering, but this can vary in practice. (ii) We ignore the effect of extreme values during interpolation. (iii) The degree of departure of characteristics between observed and unobserved points sometimes contradicts the concept of spatial continuity, but we ignore that. (iv) We do not pay enough attention to temporal stationarity. SBVC shares the same drawbacks and is additionally incapable of exploring spatial trends.

Conclusion

The proposed models, SC and SBVC, are extensions of previous spatial copula-based models and mainly address issues such as bin selection, the use of MLE to estimate parameters in data sets with missing values, and so on. When compared to other geostatistical models, the proposed SC and SBVC are very effective and provide nearly accurate results (see Fig. 19). The SC model produces better results for skewed spatial random fields and provides a mathematical argument for selecting essential covariates. This study provides an idea of alternative distance weights and distance functions that are very effective in capturing spatial variation. A temporal extension of this algorithm is possible, which motivates further research. The model is explained in this study using a real-world data set of PM concentrations in the air. Still, the algorithm can be used in other scenarios such as mining, temperature modeling, meteorological modeling, and so on. This algorithm may be more advantageous than other spatial estimation models because it makes no assumptions about Gaussian distribution, intrinsic stationarity, dynamic behavior, or skewed data sets.
