Literature DB >> 35866039

An Improved Estimation for Heterogeneous Datasets with Lower Detection Limits regarding Environmental Health.

Navid Feroze¹, Ali Akgul², Taghreed M Jawa³, Neveen Sayed-Ahmed³, Rashid Ali⁴.

Abstract

Analysis of environmental data with lower detection limits (LDL) using mixture models has recently gained importance. However, only a particular type of mixture models under classical estimation methods have been used in the literature. We have proposed the Bayesian analysis for the said data using mixture models. In addition, an optimal mixture distribution to model such data has been explored. The sensitivity of the proposed estimators with respect to LDL, model parameters, hyperparameters, mixing weights, loss functions, sample size, and Bayesian estimation methods has also been proposed. The optimal number of components for the mixture has also been explored. As a practical example, we analyzed two environmental datasets involving LDL. We also compared the proposed estimators with existing estimators, based on different goodness of fit criteria. The results under the proposed estimators were more convincing as compared to those using existing estimators.

Entities: Chemical

Mesh：

Year: 2022 PMID： 35866039 PMCID： PMC9296318 DOI： 10.1155/2022/4414582

Source DB: PubMed Journal: Comput Math Methods Med ISSN： 1748-670X Impact factor: 2.809

1. Introduction

The environmental studies often encounter the exposure measurements falling below the LDL. These nondetectable observations are considered left censored observations [1, 2]. Hughes [3] discussed that the difficulty in modeling the environmental concentration datasets arises when some of the measurements are below the LDL. As the proportion of censored observations may not be trivial, failure to adjust the censoring in the analysis can produce seriously biased results with inflated variances. The most convenient method to adjust the censoring is to replace the censored observation by the detection limit. However, statistical properties of such methods are obscured. As an improvement, Paxton et al. [4] proposed the iterative imputation technique to settle the censoring issue, but this method did not consider the correlated structure of the data and parametric estimates. Some authors have proposed the standard statistical distributions to model the left censored datasets. For example, Mitra and Kundu [5] used generalized exponential distribution, Bhaumik et al. [6] employed normal distribution, Leith et al. [7] and Jin et al. [8] used log-normal distribution, and Asgharzadeh et al. [9] considered Weibull model to model the left censored data from different situations. However, varying modeling capabilities of these models to model the left censored data has been reported by Vizard et al. [10]. Further, Moulton and Halsey [11] raised several concerns over using a standard statistical model to deal with such data. They argued that these datasets may be highly skewed to make most of the standard models inappropriate for analysis. Further, there may exist an additional subpopulation of the observations falling below the detection limit making the data heterogeneous or multimodal. Following these arguments, Moulton and Halsey [11] suggested the use of gamma mixture model (in their case) to model the left censored HIV RNA dataset. Taylor et al. [12] also emphasized that in case of larger proportion of nondetectable observations, a suitable model for analysis should be explored. They proposed log-normal mixture distribution to model the datasets with LDL. Some other authors including Moulton et al. [13] and Chen et al. [14] have suggested the use of mixture models for analyzing left censored data, especially when the proportion of censoring and skewness is high in the data. The use of mixture models also allows additional variability in the distributional shape as compared to some standard models. A recent article by Feroze and Aslam [15] considered the analysis of left censored mixture of the Weibull model. Furthermore, most of the authors employing mixture models for analyzing left censored datasets have proposed maximum likely estimation (MLE) to estimate the model parameters. But, as the mixing cause missing data, MLE is often not suitable in estimating mixture models [16]. Following this argument, Hughes [3] proposed EM algorithm assuming normal mixture model to adjust the censoring issue in MLE. But the EM algorithm can also fail when the censoring rate is high [17]. These issues in the estimation of mixture models can be handled using a Bayesian approach. With informative priors, Bayesian estimation (BE) can produce significant gains in efficiencies of the estimates as compared to classical methods [18, 19]. The Bayesian estimators (BEs) can be biased but often have smaller mean squared errors (as compared to MLEs) and include additional prior information [20]. These methods are also applicable when the parameter estimates are correlated [21]. However, the Bayesian inference has not yet replaced the classical methods to model the heterogeneous environmental concentration datasets with LDL [22]. The generalized exponential (GE) distribution is very important distribution to model censored datasets. Recently, Gupta and Kundu [23] and Mitra and Kundu [5] identified that this model can be an appealing alternative to many standard models, such as gamma, Weibull, lognormal, log-gamma, exponential, and Rayleigh, to analyze the censored datasets. The mixture of GE models for analyzing the censored datasets has been more recently introduced. Teng and Zhang [24] proposed the estimation of model parameters from two-component mixture of GE models (2CMGEM) via EM algorithm. Ali et al. [25] discussed the application of 2CMGEM in modeling right censored data. Mahmoud et al. [26] considered 2CMGEM for obtaining predictions from the censored data. Wang et al. [27] discussed the applicability of 2CMGEM in medical sciences. Similarly, Kazmi and Aslam [28] advocated the employment of GE mixture model to fit various right censored datasets. More recent contributions regarding the analysis of 2CMGEM can be seen from the following works. Aslam and Feroze [29] considered the analysis of 2CMGEM under progressive censoring. Kluppelberg and Seifert [30] reported some important results from the conditional distributions of 2CMGEM. Yang et al. [31] discussed important inferences about 2CMGEM. Various properties of 2CMGEM were investigated. Hence, 2CMGEM is a very relevant model to analyze the censored dataset with mixture or heterogeneous behavior. We have explored the advantages of Bayesian methods in modeling the heterogeneous environmental concentration data with LDL. First, we have employed the Bayesian mixture model approach using simulated datasets. As discussed by Momoli et al. [22], these datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Further, to investigate the advantages of the proposed methodology in real world, two real datasets regarding environmental concentrations with LDL have been used for the analysis. The 2CMGEM has been used as candidate model to analyze the mixture behavior of these datasets. The comparison of the proposed model with other mixture models, available in the literature to analyze the datasets with LDL, has been made. The comparison among different estimation methods has been discussed. Some advantages of using the proposed methodologies have been observed. The remaining paper has been organized as follows. Section 2 contains the estimation of model parameters using different estimation methods. In Section 3, the simulated datasets are considered for comparison of proposed estimators. In Section 4, two real examples have been used to discuss the applicability and suitability of the proposed estimators. The findings of the study have been reported in Section 5.

2. Material and Methods

This section contains the analytical estimation of model parameters under expectation-maximization (EM) algorithm, Lindley's approximation (LA), and Markov chain Monte Carlo (MCMC) method. In case of BE, the squared error loss function (SELF) and entropy loss function (ELF) along with an informative prior have been assumed for the estimation.

2.1. The Likelihood Function for 2CMGEM under Heterogeneous Data with LDL

The 2CMGEM has following probability density function: where Ω = (λ1, θ1, λ2, θ2, π1) is a parametric set and λ, θ > 0, 0 < π1 < 1u = 1, 2. The cumulative distribution function (CDF) for the 2CMGEM is Suppose, we generate a sample of size ‘n' from 2CMGEM. Further, suppose n1 = nπ1 and n2 = n(1 − π1) observations are generated from first and second components of the mixture, respectively. Let X, ⋯, X are the largest n − k observations about which we have complete information and remaining smallest ‘k' observations fall below the predetermined LDL. Since we have incomplete information regarding smallest ‘k' observations, they as assumed as censored. Hence, x1(, ⋯, x1 and x2(, ⋯, x2 are the observations with complete information from first and second component of the mixture, respectively. So, k1 and k2 are the starting observations censored from each component, respectively. Further, m1 = n1 − k1 − 1 and m2 = n2 − k2 − 1 are observed items (with complete information) from the first and second components of mixture, respectively, where x = min(x1(, x2(), n = n1 + n2, k = k1 + k2, and n − k = m1 + m2. The likelihood function for such heterogeneous dataset x = {(x1(, ⋯, x1), (x2(, ⋯, x2)} with LDL can be written as where Ω = (λ1, θ1, λ2, θ2, π1). Putting the corresponding entries, we have the following. The likelihood function for the heterogeneous dataset with LDL using 2CMGEM is

2.2. Expectation-Maximization (EM) Algorithm

In case of missing observations, the EM algorithm can be used obtain the MLEs for the model parameters. We have used the EM algorithm similar to that proposed by Ruhi et al. [32]. LetΘ = (λ1, λ2, θ1, θ2) be the model parameters and π1 the mixing parameter for the 2CMGEM f(x) where r = 1, 2. Let us denote the observed random sample from 2CMGEM by x = (x1, ⋯,x)′. Further, we assume the missing data vectors as x = (y1′, ⋯,y′)′, where y is a 2-dimensional indicator vector containing zero-one observation and y is one if the x is from rth component of the mixture and zero elsewhere (j = 1, 2, ⋯, n; r = 1, 2). The complete dataset can be defined as x = (x′, x′). The iterations in any EM algorithm contain two steps. The conditional expectation of the complete data log-likelihood for the model parameters is computed in the first step called E-step. The E-step at (z + 1)th iteration is where R(x|Θ() is a reliability function for the rth component of the mixture. As (5) is linear iny, the E-step is implemented by taking the conditional expectation of Y given the observed data x, where y denotes the observations of Y. Now, Ε(Y|x) = y(. Using Bayes theorem, the posterior probabilities y( are presented as Next, the M-step is carried out. In this step, (5) is maximized with respect parameters of the model to estimate Ω(. Since expressions containing π and Θ are not related, they can be maximized independently. We introduce the Lagrange multiplier Δ, with condition ∑2π = 1., to obtain the expression for π. The differentiation of (5) with respect to π is placed equal to zero as The sum of (7) over r and with∑y( = 1, and Δ = −n gives The MLES for Θ = (λ, θ), r = 1, 2 are obtained using iterative procedures. Both E-step and M-step are iterated for achieving the desired accuracy. The steps to estimate the parameters of 2CMGEM using EM algorithm are as follows:. Step-1:determine the initial guess for parameters as π1(0), θ(0), and λ(0) at r = 1, 2 Step-2:using Step-1 compute the conditional expectation y( for zth iteration from (6) Step-3: for zth iteration, derive MLE of π1( from (8) and MLEs of θ( and λ( using numerical procedures Step-4: replicate Step-2 and Step-3 to reach the desired convergence

2.3. Bayesian Estimation

The beta prior for parameter π1 and independent gamma priors for parameters λ1, λ2, θ1, θ2 have been assumed as where μ1, τ1, μ2, τ2, μ3, τ3 > 0, u = 1, 2 are the hyperparameters. Under the assumption of independence, the joint prior density from (9)–(11), for the parametric set Ω = (λ1, λ2, θ1, θ2, π1) can be given as The general form of posterior distribution for the model parameters, combining (4) and (11), is From (13), the posterior distribution for the parameters of 2CMGEM is We have used SELF and ELF for derivation of BEs. The BE under SELF is . On the other hand, the BE under ELF is . Since the closed form expressions for BEs under the said loss functions are not available, the approximate BE methods have been used to compute the numerical results.

2.3.1. Lindley's Approximation (LA)

In this section, the LA has been proposed for the approximate BE for parameters of 2CMGEM. For sufficiently large sample size, Lindley [33] proposed that a ratio of integrals have the form where W(Ω) is a function of λ1, λ2, θ1, θ2and π1, H(Ω) is the logarithmic of joint prior for the parametric set Ω, and K(Ω|x) is the log-likelihood function, can be evaluated as where the MLE of Ω is denoted by, and ω is the (i, j)th element of the inverse of the matrix {L}. Now, consider the log-likelihood function from (4). Differentiating (18) with respect to models parameters and equating the results to zero, we have Here, the numerical solutions of (19) to (21) will provide MLE of Ω. Since, the higher-order derivative form (18) are straightforward, they have not been reported. The inverse of matrix {L}based on second order derivatives is Now, the BEs for parameters of 2CMGEM using SELF are The BEs for the model parameters using LA under ELF are

2.3.2. MCMC Method

This sub-subsection discusses the BE for parameters λ1, λ2, θ1, θ2, and π1 by employing MCMC technique. To implement MCMC procedure, we reformatted (9) as where γ1 = k1 + m1 + μ1, γ2 = k2 + m2 + τ1, γ3 = m + μ2, γ4 = m + μ3, κ1(x) = τ2 − ∑log(1 − exp(−θx)), and κ2(x) = τ3 + ∑x1. From (25), the posterior densities can be written as f 3(λ|θ, x) ∝ Gamma(γ3, κ1(x)), where Beta(.) and Gamma(.) represent bBeta and gamma distributions, respectively. The samples from the posterior density can be obtained using the following algorithm. Step-1:draw π1(, θ(, and λ( from f1(π1|x), f2(θ|x), and f3(λ|θ, x), respectively. Step-2:replicate Step-2 m times to obtain (π1(1), λ1(1), λ2(1), θ1(1), θ2(1)) to (π1(, λ1(, λ2(, θ1(, θ2() for z = 1, 2, ⋯, m. The starting m0 number of samples has been discarded to minimize the impact of initial guess. The BEs for model parameters are computed using the rest of m1 = m − m0 number of samples. The BEs under SELF using MCMC are The BEs under ELF using MCMC are

3. Results and Discussions

In this section, we have generated the simulated datasets from 2CMGEM using different sample sizes, different censoring rates, different true parametric values, different hyperparameters and different mixing weights. As all these factors can have the impact on the estimation, we have considered different situations for each of these factors. As discussed by Momoli et al. [22], these datasets represent the real data from environmental studies and provide convenience by avoiding issues with empirical comparisons. Since censoring is major interest in this study, we have focused on the comparison of complete datasets with censored ones. Different estimation methods and two loss functions have been used in order to explore the best combination of estimation method and loss function in our case. The data from the 2CMGEM with LDL have been generated using the following steps. Step-1: generate a sample of size ‘n' from 2CMGEM Step-2: draw a uniformly distributed random number (u) for each observation generated in Step-1 Step-3: when u ≤ π1, take observation from 1st component, otherwise from 2nd component of 2CMGEM Step-4: let LDL is x Step-5: the observations falling below LDL are assumed to be censored from each component Step-6: the observations greater than or equal to LDL have been used for analysis The corresponding results are given in Tables 1–5 and in Figures 1 and 2 given in the followings. The estimates under simulated samples were replicated 10,000 times, and average of the estimates has been reported. In addition, in case of the MCMC method, the estimates were replicated 11,000 times. The starting 1000 replications were not included in the estimation to remove the impact of choice of starting values for the estimation. The Mathematica and R software has been used for numerical computations.

Table 1

The amounts of AREs and MSEs for the model parameters using different estimation methods.

n	CR	Average relative estimates (AREs)					Mean square error (MSEs)
n	CR	λ ₁ = 0.5	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45
20	MLE	1.3192	1.1499	1.2417	1.2040	1.1795	0.0407	0.0559	0.0324	0.0608	0.0216
	EM	1.3149	1.1461	1.2377	1.2001	1.1757	0.0397	0.0538	0.0314	0.0588	0.0210
	SELF(LA)	1.3172	1.1395	1.2355	1.1957	1.1725	0.0400	0.0508	0.0304	0.0571	0.0212
	ELF(LA)	1.2852	1.1064	1.2028	1.1625	1.1408	0.0341	0.0378	0.0241	0.0453	0.0181
	SELF(MCMC)	1.3078	1.1208	1.2213	1.1791	1.1577	0.0305	0.0505	0.0272	0.0510	0.0162
	ELF(MCMC)	1.2674	1.1006	1.1909	1.1536	1.1307	0.0241	0.0371	0.0202	0.0378	0.0128

50	MLE	1.1898	1.0882	1.1456	1.1246	1.0949	0.0198	0.0279	0.0165	0.0345	0.0103
	EM	1.1846	1.0728	1.1353	1.1116	1.0837	0.0191	0.0240	0.0149	0.0311	0.0099
	SELF(LA)	1.1848	1.0718	1.1348	1.1109	1.0831	0.0190	0.0237	0.0148	0.0309	0.0099
	ELF(LA)	1.1602	1.0519	1.1125	1.0897	1.0621	0.0161	0.0190	0.0121	0.0254	0.0084
	SELF(MCMC)	1.1708	1.0545	1.1191	1.0943	1.0675	0.0121	0.0202	0.0111	0.0234	0.0063
	ELF(MCMC)	1.1438	1.0381	1.0973	1.0750	1.0477	0.0096	0.0161	0.0089	0.0186	0.0050

100	MLE	1.1035	1.0498	1.0829	1.0737	1.0401	0.0114	0.0177	0.0104	0.0253	0.0058
	EM	1.1026	1.0490	1.0821	1.0729	1.0393	0.0112	0.0173	0.0101	0.0248	0.0057
	SELF(LA)	1.1014	1.0462	1.0800	1.0704	1.0371	0.0111	0.0169	0.0099	0.0243	0.0056
	ELF(LA)	1.0764	1.0257	1.0572	1.0486	1.0156	0.0094	0.0138	0.0082	0.0201	0.0048
	SELF(MCMC)	1.0958	1.0376	1.0729	1.0625	1.0299	0.0064	0.0168	0.0082	0.0201	0.0033
	ELF(MCMC)	1.0728	1.0244	1.0547	1.0467	1.0135	0.0051	0.0135	0.0066	0.0161	0.0026

200	MLE	1.0387	1.0235	1.0371	1.0374	1.0305	0.0058	0.0087	0.0053	0.0140	0.0028
	EM	1.0363	1.0211	1.0347	1.0350	1.0281	0.0056	0.0084	0.0052	0.0135	0.0028
	SELF(LA)	1.0294	1.0178	1.0295	1.0307	1.0234	0.0055	0.0080	0.0050	0.0131	0.0027
	ELF(LA)	1.0206	1.0102	1.0213	1.0227	1.0154	0.0050	0.0072	0.0045	0.0118	0.0025
	SELF(MCMC)	1.0266	1.0123	1.0254	1.0258	1.0189	0.0026	0.0080	0.0039	0.0102	0.0013
	ELF(MCMC)	1.0130	1.0023	1.0135	1.0148	1.0076	0.0022	0.0070	0.0032	0.0084	0.0011

Table 2

Effect of different censoring rates on the estimation using the MCMC method and ELF.

CR	n = 20					n = 200
CR	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45
0%	0.0190	0.0271	0.0159	0.0298	0.0100	0.0018	0.0057	0.0026	0.0073	0.0009
5%	0.0205	0.0304	0.0172	0.0320	0.0108	0.0019	0.0059	0.0027	0.0076	0.0009
10%	0.0223	0.0332	0.0186	0.0350	0.0119	0.0021	0.0063	0.0030	0.0081	0.0010
11%	0.0228	0.0340	0.0190	0.0358	0.0120	0.0021	0.0063	0.0030	0.0081	0.0010
20%	0.0241	0.0361	0.0202	0.0378	0.0128	0.0022	0.0065	0.0032	0.0084	0.0011
30%	0.0253	0.0380	0.0213	0.0398	0.0134	0.0023	0.0068	0.0033	0.0087	0.0011

Table 3

Effect of mixing parameter on the estimation of model parameters using MCMC method and ELF.

π ₁	CR	n = 20					n = 200
π ₁	CR	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁
0.25	0%	0.0363	0.0303	0.0239	0.0227	0.0253	0.0042	0.0056	0.0038	0.0054	0.0022
0.45	0%	0.0340	0.0281	0.0245	0.0230	0.0258	0.0041	0.0053	0.0039	0.0055	0.0022
0.75	0%	0.0324	0.0268	0.0256	0.0244	0.0260	0.0040	0.0051	0.0040	0.0057	0.0023
0.25	20%	0.0413	0.0336	0.0259	0.0247	0.0276	0.0047	0.0062	0.0041	0.0059	0.0024
0.45	20%	0.0382	0.0309	0.0269	0.0252	0.0284	0.0045	0.0058	0.0043	0.0060	0.0024
0.75	20%	0.0360	0.0291	0.0283	0.0270	0.0295	0.0043	0.0056	0.0044	0.0063	0.0025

Table 4

Effect of different true parametric values on the estimation using MCMC method and ELF.

PS	CR	n = 20					n = 200
PS	CR	λ ₁	θ ₁	λ ₂	θ ₂	π ₁ = 0.45	λ ₁	θ ₁	λ ₂	θ ₂	π ₁ = .45
S1	0%	0.0340	0.0281	0.0245	0.0230	0.0258	0.0041	0.0053	0.0039	0.0055	0.0022
S2	0%	0.0342	0.0276	0.0272	0.0253	0.0237	0.0041	0.0053	0.0043	0.0057	0.0021
S3	0%	0.0354	0.0310	0.0250	0.0227	0.0250	0.0042	0.0054	0.0040	0.0054	0.0021
S1	20%	0.0382	0.0309	0.0269	0.0252	0.0284	0.0045	0.0055	0.0043	0.0060	0.0024
S2	20%	0.0396	0.0310	0.0305	0.0284	0.0268	0.0046	0.0057	0.0049	0.0062	0.0023
S3	20%	0.0407	0.0348	0.0280	0.0254	0.0281	0.0047	0.0059	0.0045	0.0061	0.0024

Table 5

Effect of different hyper-parameters on the estimation using MCMC method and ELF.

PS	CR	n = 20					n = 200
PS	CR	λ ₁ = 0.50	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45	λ ₁ = 0.5	θ ₁ = 1.2	λ ₂ = 0.75	θ ₂ = 1.5	π ₁ = 0.45
P1	0%	0.03395	0.02814	0.02455	0.02303	0.02578	0.0041	0.0053	0.0039	0.0055	0.0022
P2	0%	0.03415	0.02827	0.02463	0.02311	0.02590	0.0041	0.0053	0.0039	0.0055	0.0022
P3	0%	0.03421	0.02833	0.02470	0.02313	0.02594	0.0041	0.0053	0.0039	0.0055	0.0022
P1	20%	0.03815	0.03088	0.02688	0.02523	0.02836	0.0045	0.0058	0.0043	0.0060	0.0024
P2	20%	0.03848	0.03105	0.02705	0.02537	0.02854	0.0045	0.0059	0.0043	0.0060	0.0024
P3	20%	0.03949	0.03171	0.02760	0.02592	0.02912	0.0046	0.0060	0.0044	0.0062	0.0025

Figure 1

Parametric estimates using different sample sizes and censoring rates.

Figure 2

Sensitivity of the estimates with respect to priors, mixing parameter, true parametric values, and censoring rates.

Figure 1 shows the impact of change in sample size and censoring rates on the performance of the estimates. Panel (a1) reports the average of relative estimates (AREs) for parameter (λ1) using different sample sizes and under various estimation methods. Panel (a2) presents the amounts of mean square errors (MSEs) for the estimates of parameter (λ1). Similarly, AREs and MSEs for the parameter (θ1) can be seen in Panels (a1) and (a2), respectively. These figures elucidate that the estimates are converging to the true parametric values with increase in sample size. It is also evident from these panels (a1)–(a4) that amounts of MSEs tend to decrease with increase in the sample size. The results also advocated that the performance of the estimates using the MCMC method assuming ELF is superior as compared to their competitors. The results for the estimates of other model parameters also exhibited the similar trend, please see Table 1. The detailed analysis of the impacts imposed by the increased censoring rate (on the estimates) is fundamental in environmental studies. For that purpose, the simulated data have been used to compare the results under different censoring rates with the case when no censoring occurs. It will provide an idea about how well our proposed estimators are capable of tackling the censoring scenarios. The right panel (c1 and c2) in Figure 1 shows these results. Panel (c1) elucidates the impact of censoring rates when sample size is 20, whereas panel (c2) presents the impacts of censoring rates when sample size is 200. The results in these panels (c1) and (c2) have been reported under MCMC method using ELF, as performance of these estimators was observed to be better than their counterparts. From these panels (c1 and c2), it can be assessed that when the sample is small (n = 20), the increased censoring rates tend to slightly inflate the MSEs for the estimates of the model parameters. However, for the sufficiently large sample size (n = 200), the MSEs for the estimates under complete data and under 30% censored data are very much comparative. Hence, the proposed estimators are efficient enough to deal with left censored environmental datasets. The sensitivity of the proposed estimators with respect different values of hyperparameters, different mixing weights, and different true parametric values has been investigated in Figure 2. All the reported results are based on the MCMC method assuming ELF due to their supreme efficiency as compared to their competitors. The results have been compared on the basis of values of scaled MSEs (SMEs). The SMEs have been obtained by the ratio of MSEs to the corresponding true parametric values. The comparison of the estimates for complete data and 20% censored data has been considered. The left panel (a1 and a2) of the figure investigates the sensitivity of the estimates with respect to change in the hyperparameters. We have used three sets of the hyperparameters resulting in three different priors. The prior number one (P1) is the main prior which has been assumed for the overall analysis. This prior has been defined on the basis of prior means approach. Using prior means approach the hyperparameters in the prior are chosen in such a manner that the mean of resulting prior is approximately equal to the corresponding true parametric value. The other two priors (P2 and P3) have been specified so that the respective prior means have been changed (increased and decreased) by 20% from that of P1. In panels (a1) and (a2), the x-axis represents the combination of prior and censoring rate. For example, P1 (0%) means the results under prior one (P1) with no censoring and P1 (20%) denotes results under prior one (P1) with 20% censored samples. Panel (a1) elucidates that in case of complete data, the change in prior parameters has a very little affect on the performance of the estimates. On the other hand, for 20% censored samples, the change in prior parameters has resulted in slightly inflated SMSEs. However, when sample size is increased to 200, the estimates under all three priors are quite similar and censoring has very little impact on the efficiency of the estimates. The corresponding numerical results can be seen in Tables 2 and 3. The central panel (B1 and B2) represents the affect of choice of various true parametric values on the estimates. Three different sets of true parametric values have been used to assess the impact of variant true parametric values on proposed estimators. The parametric sets used are as follows: S1: (λ1 = 0.5, θ1 = 1.2, λ2 = 0.75, θ2 = 1.5, π1 = 0.45); S2: (λ1 = 0.5, θ1 = 1.2, λ2 = 1.5, θ2 = 3, π1 = 0.45); and S3: (λ1 = 1, θ1 = 2.4, λ2 = 0.75, θ2 = 1.5, π1 = 0.45). The x-axis represents the combination of true parametric values and censoring rates. For example, S1 (0%) indicates the estimates for S1 with no censoring and S1 (20%) represents the results for S1 with 20% censored samples. Panels (b1) and (b2) show the results for sample size 20 and 200, respectively. For n = 20, some fluctuations in the amounts of SMSEs have been observed for different parametric sets. However, for n = 200, there is very little effect of various true parametric values can be seen. The corresponding numerical results have been placed in Table 4. Keeping the values for the model parameters fixed, the impact of changing the mixing parameter has been presented in panels (c1) and (c2), respectively. The x-axis line denotes Pi1: π1 = 0.25 with no censoring; Pi2: π1 = 0.45 with no censoring; Pi3: π1 = 0.75 with no censoring; Pi4: π1 = 0.25 with 20% censoring; Pi5: π1 = 0.45 with 20% censoring; and Pi6: π1 = 0.75 with 20% censoring. For n = 20, the increase in value of the mixing parameter imposes a positive impact on the estimation of the model parameters from the first component of the mixture, without seriously affecting the estimation for the other component of the mixture. Interestingly, for n = 200, the choice of larger true values for the mixing parameter still produce the improved estimation for the first component of the mixture.

4. Real Examples

We have used two real environmental datasets with LDL for illustrating the applicability of the methods proposed. The first dataset contains ammonium (NH-4) concentration (mg/L) in precipitation observed at Olympic National Park, Hoh Ranger Station (WA14). This dataset has been reported by Aboueissa [34] and was collected during January, 2009 to December, 2011 on a weekly basis. We named it as dataset-1. The dataset contains high proportion (46 out of 102) of observations which are below LDL. The observations of the dataset are <0.006, <0.006, 0.006, 0.016, <0.006, 0.015, 0.023, 0.034, 0.022, 0.007, 0.021, 0.012, <0.006, 0.021, 0.015, 0.088, 0.058, <0.006, <0.006, <0.006, <0.006, 0.074, 0.011, 0.121, <0.006, 0.007, 0.007, 0.015, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.006, <0.01, <0.01, <0.01, 0.042, <0.01, 0.013, <0.01, 0.028, <0.01, <0.01, <0.01, 0.049, 0.036, <0.01, 0.011, 0.019, 0.253, <0.018, <0.01, 0.014, 0.08, <0.01, 0.065, <0.01, <0.01, <0.01, <0.01, <0.01, <0.01, <0.008, <0.008, <0.008, <0.008, 0.009, <0.008, 0.017, 0.012, 0.02, 0.009, 0.014, 0.03, 0.06, 0.031, 0.024, 0.013, 0.059, 0.009, 0.017, 0.033, 0.052, 0.015, 0.019, <0.008, <0.008, <0.008, <0.008, 0.009, 0.013, 0.023, 0.036, <0.008, 0.012, 0.03, 0.022, and 0.008. Approximately, 45% of the observations have been below LDL. The average for the observations is 0.0316 with variance as 0.0015. The second dataset (dataset-2) has also been reported by Aboueissa [34]. This dataset is about the manganese concentrations (ppb) in groundwater collected from five wells. Since two LDL were used to obtain the dataset, 2-component mixture (2CM) can be used to model it. This dataset consist of twenty-five values with six observations falling below LDL. Hence, the data contains approximately 24% observation below the lower detection limit. The values of the dataset are <5, 12.1, 16.9, 21.6, <2, <5, 7.7, 53.6, 9.5, 45.9, <5, 5.3, 12.6, 106.3, 34.5, 6.3, 11.9, 10, <2, 77.2, 17.9, 22.7, 3.3, 8.4, and <2. The mean for the observations is 25.43 and variance is 752.96. The numerical results using dataset-1 and dataste-2 have been given in Tables 6–11 and in Figures 3 and 4. We have compared the proposed model (2CMGEM) with different mixture models available in literature to model the environmental datasets with LDL. These models include 2CM of normal models (2CMNM), 2CM of log-normal models (2CMLNM), 2CM of gamma models (2CMGM), and 2CM of log-gamma models (2CMLGM). By comparing fits of 2CMGEM with the models avaiable in literature, we have found the possibility and suitability of proposed model in modeling the left censored environmental datasets. The proposed MCMC methods have also been compared with the classical methods (MLE and EM algorithm) already used for estimating such datasets. The optimal number of components for the proposed mixture model have also been explored using different goodness-of-fit criteria and test statistics.

Table 6

Comparison of different models based on various goodness-of-fit criteria using real dataset-1.

Model	AIC	BIC	KS statistic	KS P.Value
2CMGEM	-291.7208	-281.5940	0.0792	0.8740
2CMNM	-270.5970	-260.4703	0.0947	0.6971
2CMLNM	-107.2395	-97.1128	0.6972	2.20E-16
2CMGM	-269.6410	-259.5142	0.1077	0.5825
2CMLGM	-251.4048	-241.2780	0.1877	0.0387

Table 7

Selecting optimal components for generalized exponential mixture model using dataset-1.

Model	AIC	BIC	KS value	KS P.Value	LRT value	LRT P.Value
GEM	-277.535	-273.485	0.149	0.168	236.206	0.001
2CMGEM	-291.082	-280.955	0.079	0.874	Reference	Reference
3CMGEM	-285.688	-269.485	0.136	0.255	3.916	0.367
4CMGEM	-279.720	-257.441	0.168	0.084	4.302	0.467
5CMGEM	-273.680	-245.325	0.181	0.050	11.793	0.533

Table 8

Estimates of model parameters and goodness-of-fit statistics for 2CMGEM using dataset-1.

Estimation method	π ₁	λ ₁	λ ₂	θ ₁	θ ₂	AIC	BIC	KS value	KS P.Value
MLE	0.3140	2.381	8.286	27.551	179.155	-291.082	-280.955	0.079	0.874
EM	0.3157	2.392	8.310	27.496	181.599	-291.211	-281.084	0.074	0.916
SELF(LA)	0.3179	2.404	8.343	27.441	182.849	-291.271	-281.144	0.075	0.913
ELF(LA)	0.3264	2.416	8.385	26.728	182.947	-291.508	-281.381	0.071	0.942
SELF(MCMC)	0.3281	2.421	8.409	27.002	184.844	-291.521	-281.394	0.070	0.944
ELF(MCMC)	0.3370	2.432	8.450	27.276	192.047	-291.879	-281.752	0.060	0.988

Table 9

Comparison of different models based on various goodness-of-fit criteria using real dataset-2.

Model	AIC	BIC	KS statistic	KS P.Value
2CMGEM	163.5945	168.3167	0.0763	0.9295
2CMNM	181.5014	186.2236	0.2775	0.0474
2CMLNM	165.3600	170.0822	0.1410	0.7951
2CMGM	166.2773	170.9995	0.1825	0.4952
2CMLGM	169.6096	174.3318	0.2062	0.3863

Table 10

Selecting optimal components for generalized exponential mixture model using dataset-2.

Model	AIC	BIC	KS value	KS P.Value	LRT value	LRT P.Value
GEM	164.085	165.974	0.187	0.463	17.569	0.001
2CMGEM	163.595	168.317	0.076	0.930	Reference	Reference
3CMGEM	169.690	177.245	0.147	0.750	8.428	0.133
4CMGEM	175.832	186.221	0.197	0.400	2.488	0.833
5CMGEM	181.960	195.183	0.220	0.276	3.299	0.984

Table 11

Estimates of model parameters and goodness-of-fit statistics for 2CMGEM using dataset-2.

Estimation method	π ₁	λ ₁	λ ₂	θ ₁	θ ₂	AIC	BIC	KS value	KS P.Value
MLE	0.2554	8.8372	4.5844	0.0440	0.1796	163.595	168.317	0.123	0.905
EM	0.2466	8.2961	4.5118	0.0427	0.1648	163.826	168.548	0.114	0.941
Self(la)	0.2429	8.2081	4.4452	0.0423	0.1639	163.816	168.538	0.112	0.950
Elf(la)	0.2413	8.2045	4.4565	0.0431	0.1641	163.830	168.553	0.111	0.954
SELF(MCMC)	0.2412	8.1208	4.3904	0.0421	0.1644	163.771	168.493	0.106	0.968
ELF(MCMC)	0.2408	8.0702	4.4586	0.0421	0.1701	163.676	168.398	0.091	0.993

Figure 3

Comparison of fits from different mixture models.

Figure 4

Comparison of fits for using different estimators.

The graphs showing goodness-of-fit for different mixture models have been presented in Figure 3. From these graphs, it can easily be assessed that 2CMGEM is more representative as compared to other models for both datasets. In case of dataset-1, 2CMGEM seems to be the best model followed by 2CMNM and 2CMGM, while 2CMLNM and 2CMLGM did not provide good fits. On the other hand for dataset-2, again 2CMGEM provides the best fits followed by 2CMLNM and 2CMGM. Conversely, 2CMLGM and 2CMNM did not provided reasonable. The results have been given in Table 7 and Table 10, respectively. The LRT assumes model with smaller number of components under null hypothesis. Hence, in the comparison of GEM and 2CMGEM, the GEM has been assumed under null hypothesis. While, for comparison on 2CMGEM with high-order mixtures, the model under null hypothesis is 2CMGEM. From the results, it can be seen that 2CMGEM is the best model for modeling each dataset. The graphical comparison of fits under different estimation methods has been presented in Figure 4, while the numerical comparison has been given in Tables 8 and 11. Figure 4 reveals that the estimates under the MCMC method using ELF exhibited better discription data. However, the slight departure for some points can be due to the particular sample. The departure for the estimates under traditionally used MLE and EM methods are pronounced. The numerical results in Tables 8 and 11 further justified these findings. Hence, the proposed estimators are significantly better than the traditionally used estimators.

5. Conclusion

The study is aimed at exploring the improved estimation for the heterogeneous environmental datasets with LDL. The earlier studies considered classical analysis (MLEs and EM algorithm) of such datasets with some specific mixture models. We have proposed Bayesian analysis for modeling such data. The employment of 2CMGEM has also been proposed, and the optimality of the proposed model has been discussed. Our results revealed significant gains in efficiencies for the proposed estimators. In addition, the proposed estimators are quite insensitive with respect to LDL, true parametric values, hyperparametric values, and mixing weights. Hence, the proposed estimators are quite convincing alternative to those existing in the literature. The current methodology can be extended to incorporate the variance analysis, invariant analysis, and covariance analysis. The future work can also be for the cases where the data is right or doubly censored.

12 in total

1. A mixture model for occupational exposure mean testing with a limit of detection.

Authors: D J Taylor; L L Kupper; S M Rappaport; R H Lyles
Journal: Biometrics Date: 2001-09 Impact factor: 2.571

2. Mixed effects models with censored data with application to HIV RNA levels.

Authors: J P Hughes
Journal: Biometrics Date: 1999-06 Impact factor: 2.571

3. Mixture models for quantitative HIV RNA data.

Authors: Lawrence H Moulton; Frank C Curriero; Paulo F Barroso
Journal: Stat Methods Med Res Date: 2002-08 Impact factor: 3.021

4. Analysis of lognormally distributed exposure data with repeated measures and values below the limit of detection using SAS.

Authors: Yan Jin; Misty J Hein; James A Deddens; Cynthia J Hines
Journal: Ann Occup Hyg Date: 2010-12-20

5. Longitudinal analysis of quantitative virologic measures in human immunodeficiency virus-infected subjects with > or = 400 CD4 lymphocytes: implications for applying measurements to individual patients. National Institute of Allergy and Infectious Diseases AIDS Vaccine Evaluation Group.

Authors: W B Paxton; R W Coombs; M J McElrath; M C Keefer; J Hughes; F Sinangil; D Chernoff; L Demeter; B Williams; L Corey
Journal: J Infect Dis Date: 1997-02 Impact factor: 5.226

6. Evaluation of statistical treatments of left-censored environmental data using coincident uncensored data sets: I. Summary statistics.

Authors: Ronald C Antweiler; Howard E Taylor
Journal: Environ Sci Technol Date: 2008-05-15 Impact factor: 9.028

7. Analysis of multiple exposures: an empirical comparison of results from conventional and semi-bayes modeling strategies.

Authors: Franco Momoli; Michal Abrahamowicz; Marie-Elise Parent; Dan Krewski; Jack Siemiatycki
Journal: Epidemiology Date: 2010-01 Impact factor: 4.822

8. A mixed gamma model for regression analyses of quantitative assay data.

Authors: L H Moulton; N A Halsey
Journal: Vaccine Date: 1996-08 Impact factor: 3.641

9. Bayesian mixture modeling of gene-environment and gene-gene interactions.

Authors: Jon Wakefield; Frank De Vocht; Rayjean J Hung
Journal: Genet Epidemiol Date: 2010-01 Impact factor: 2.135

10. Statistical methods for modeling repeated measures of maternal environmental exposure biomarkers during pregnancy in association with preterm birth.

Authors: Yin-Hsiu Chen; Kelly K Ferguson; John D Meeker; Thomas F McElrath; Bhramar Mukherjee
Journal: Environ Health Date: 2015-01-26 Impact factor: 5.984