
Dirichlet composition distribution for compositional data with zero components: An application to fluorescence in situ hybridization (FISH) detection of chromosome.

Man-Lai Tang¹, Qin Wu², Sheng Yang³, Guo-Liang Tian⁴.

Abstract

Zeros in compositional data are very common and can be classified into rounded and essential zeros. The rounded zero refers to a small proportion or below detection limit value, while the essential zero refers to the complete absence of the component in the composition. In this article, we propose a new framework for analyzing compositional data with zero entries by introducing a stochastic representation. In particular, a new distribution, namely the Dirichlet composition distribution, is developed to accommodate the possible essential-zero feature in compositional data. We derive its distributional properties (e.g., its moments). The calculation of maximum likelihood estimates via the Expectation-Maximization (EM) algorithm will be proposed. The regression model based on the new Dirichlet composition distribution will be considered. Simulation studies are conducted to evaluate the performance of the proposed methodologies. Finally, our method is employed to analyze a dataset of fluorescence in situ hybridization (FISH) for chromosome detection.


Keywords: Dirichlet distribution; EM algorithm; compositional data; essential zero; gamma distribution; rounded zeros; stochastic representation


Year: 2021 PMID: 34914842 PMCID: PMC9300144 DOI: 10.1002/bimj.202000334

Source DB: PubMed Journal: Biom J ISSN: 0323-3847 Impact factor: 1.715

INTRODUCTION

Compositional data, which consist of vectors of positive components subject to a constant‐sum constraint (e.g., equal to 1 for proportions and 100 for percentages), record the relative frequencies associated with the different components of a system (Ferrers, 1886). They commonly arise in many disciplines, such as the components of rocks in geology, the budget share patterns of household expenditures in economics, and the proportion of normal cells in medical research. Compositional data are subject to two intrinsic constraints: (i) (bounded support constraint) each element of the component vector must lie between 0 and 1, inclusive; and (ii) (summation constraint) all the elements of the component vector must sum to 1.
Mosimann (1962) proposed the Dirichlet‐multinomial (DM) distribution, a family of discrete multivariate probability distributions on a finite support of nonnegative integers. The DM is a compound probability distribution: it models counts from a multinomial distribution whose probability vector is drawn from a Dirichlet distribution. As a result, DM‐type models (e.g., the zero‐inflated generalized DM [ZIGDM] of Tang & Chen, 2019) are designed for count data and cannot handle the compositional data considered in this article, which are proportions or percentages. Applying statistical methods designed for unconstrained data to such compositional data may lead to invalid inference. For instance, Pearson (1897) discussed the spurious correlation issue in compositional data analysis and concluded that the unit‐sum constraint is often ignored and unconstrained statistical methods are misused, which may eventually lead to disastrous results. Aitchison (1982) first proposed statistical methodology for compositional data.
Aitchison (1982, 1986) introduced the logistic normal (LN) distribution as a framework for compositional data analysis. In particular, his technique assumes multivariate normality of the additive log‐ratio transformed data. Since then, various researchers have extended Aitchison's approach in both theoretical and practical respects. For example, Zhang (2000) discussed various distributions for compositional data on the simplex (e.g., the generalized Dirichlet, additive logistic, and spherical distributions), and Egozcue et al. (2003) introduced the isometric log‐ratio transformation.
In compositional data analysis, the presence of zero components may obstruct the application of the aforementioned distributional approaches (e.g., zero cannot be the denominator when applying the additive log‐ratio transformation). It is not uncommon for compositional data to contain zero components, owing either to the complete absence of certain component(s) or to a small proportion or a value below the detection limit. Aitchison (1986) classified these zeros into essential (or true) zeros and rounded (or trace element) zeros, respectively. Aitchison (1982) pointed out that the log‐ratio transformation fails when these zeros are denominators. To deal with rounded zeros, the most popular method is to replace them by a small value (i.e., zero replacement). For example, Palarea‐Albaladejo and Martín‐Fernández (2008) proposed a modified EM algorithm to replace the rounded zeros in compositional data, Hijazi (2011) developed an EM algorithm–based method to deal with rounded zeros, and Martín‐Fernández et al. (2003) proposed a nonparametric imputation approach. For essential zeros, three well‐known approaches have been developed.
The data amalgamation, which was proposed by Aitchison (1990), is to eliminate the components with zero elements by combining them with some other nonzero components. The second approach models the zeros separately (e.g., Aitchison, 1986; Bear & Billheimer, 2016; Zadora et al., 2010). For instance, Bear and Billheimer (2016) projected compositions with zeros onto smaller dimensional subspaces. As a result, they developed a mixture of logistic normals which successfully addresses the issues of division by zero and the log of zero. The third approach is to transform compositions into directional data on the hypersphere and develop a regression model using the Kent distribution (e.g., Kent, 1982; Scealy & Welsh, 2011), which tolerates zeros. Other methods are also investigated, such as the mixture models to eliminate the essential zeros (Stewart & Field, 2011), the latent Gaussian model (Butler & Glasbey, 2008), and the Dirichlet regression model (Tsagris & Stewart, 2018). In clinical research, compositional data with essential zeros are not uncommon. For instance, chromosome abnormalities are considered to be the most common cause of spontaneous abortion. Fluorescence in situ hybridization (FISH) is a cytogenetic technique developed in the early 1980s (see, e.g., Langer‐Safer et al., 1982). It uses fluorescent DNA probes to target specific chromosomal locations within the nucleus, resulting in colored signals that can be detected using a fluorescent microscope. For spontaneous abortion, a damaged embryo is taken out from the gravida and the FISH technique is then employed to detect the cells which are selected randomly from the damaged embryo. Finally, the respective proportions of diploidy, triploidy, and polyploidy at chromosome 22 for those randomly chosen and tested cells are recorded for each embryo. Obviously, the observations are compositional data (i.e., total sum is equal to one). 
For example, an observation of (0.2, 0.3, 0.5) means that 20%, 30%, and 50% of the selected cells are chromosome diploidy, chromosome triploidy, and chromosome polyploidy, respectively. The FISH data reported in the Supporting Information are the compositional observations of 51 embryos from curettage operations in Zhongshan People's Hospital in Mainland China. The age of each gravida is also reported. It is noteworthy that nearly 80% (i.e., 40 out of 51) of the embryos demonstrate purely normal chromosomes (i.e., the compositional observation being (1, 0, 0)). Most importantly, none of the aforementioned approaches is suitable for our FISH data, which motivates the present article.
The rest of this paper is organized as follows. In Section 2, we introduce a new stochastic representation (SR) for compositional data with zero components and define the new Dirichlet composition distribution (DCD). Likelihood‐based methods for parameter estimation and confidence interval construction without covariates are provided in Section 3. Regression analysis based on the DCD is considered in Section 4. Simulation studies examining the performance of our proposed methods are conducted in Section 5. We revisit and analyze the FISH dataset in Section 6. A brief discussion is presented in Section 7. Some technical details are included in the Appendix.

NEW DEFINITION OF COMPOSITIONAL RANDOM VECTOR AND THE DCD

In this section, we introduce a new definition of a compositional random vector which can be adopted for modeling the compositional data. The definition is proposed based on SR. We then introduce the DCD by assuming the base vector following independent Gamma distributions.

Definition of a compositional random vector

To model the zero elements in the compositional data, we employ the SR to establish the definition of a compositional random vector.
(Compositional random vector). A random vector $X=(X_1,\ldots,X_m)^\top$ is said to be an $m$‐dimensional compositional random vector if
$$X_j \overset{d}{=} \frac{Z_jY_j}{\sum_{k=1}^{m}Z_kY_k},\qquad j=1,\ldots,m, \tag{1}$$
with $Z=(Z_1,\ldots,Z_m)^\top$ being the indicator vector (i.e., $Z_j=0$ or $1$ for $j=1,\ldots,m$) such that $Z\neq 0$, and $Y=(Y_1,\ldots,Y_m)^\top$ being the base vector with each element being a positive random variable (i.e., $Y_j>0$ for $j=1,\ldots,m$), "$\overset{d}{=}$" meaning that the random variables on both sides have the same distribution, and $Z$ being independent of $Y$.
The indicator vector $Z$ provides the possibility of zero entries in the distribution, with $Z_j=0$ meaning that the $j$th component in $X$ is zero. The base vector $Y$ carries the quantitative information. It can be any positive vector and determines the nonzero components in $X$.
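The stochastic representation can be illustrated with a few lines of code. The sketch below assumes the SR takes the form $X_j = Z_jY_j/\sum_k Z_kY_k$, which is our reading of the definition above; the function name `compose` and the numerical inputs are hypothetical and chosen only for illustration.

```python
def compose(z, y):
    """Map an indicator vector z (0/1 entries) and a positive base vector y
    to a composition, assuming X_j = Z_j * Y_j / sum_k Z_k * Y_k."""
    if len(z) != len(y):
        raise ValueError("z and y must have the same length")
    if any(zj not in (0, 1) for zj in z):
        raise ValueError("indicator entries must be 0 or 1")
    if not any(z):
        raise ValueError("the indicator vector must not be all zeros")
    if any(yj <= 0 for yj in y):
        raise ValueError("base vector entries must be positive")
    masked = [zj * yj for zj, yj in zip(z, y)]
    total = sum(masked)  # nonzero because some z_j = 1 and all y_j > 0
    return [v / total for v in masked]


# The second component is switched off by z, so it becomes an exact zero.
x = compose([1, 0, 1], [2.0, 5.0, 6.0])

# Rescaling the base vector leaves the composition unchanged.
x_scaled = compose([1, 0, 1], [20.0, 50.0, 60.0])
```

The rescaling example shows why a common scale of the base vector cannot be recovered from the composition alone.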

Definition of DCD

In the compositional random vector, if we let $Z=(Z_1,\ldots,Z_m)^\top$ be $m$‐dimensional independent Bernoulli random variables excluding the point $(0,\ldots,0)^\top$, and $Y_1,\ldots,Y_m$ be independent Gamma random variables with different shape parameters $\alpha_1,\ldots,\alpha_m$ but an identical rate parameter, we can define a new distribution called the DCD. Since the rate parameter is eliminated in the SR of the compositional random vector, it is unidentifiable in the distribution. Without loss of generality, we assume that it equals 1. That is, for each $j$, we have $Y_j\sim\text{Gamma}(\alpha_j,1)$.
(DCD). A compositional random vector $X$ is said to follow the DCD, denoted by $X\sim\text{DCD}(p,\alpha)$, if $Z$ follows the zero‐truncated multivariate Bernoulli distribution with parameter $p$, $Y_j\sim\text{Gamma}(\alpha_j,1)$ independently for $j=1,\ldots,m$, and $Z$ is independent of $Y$, where $p=(p_1,\ldots,p_m)^\top$ contains the parameters of the indicator vector and $\alpha=(\alpha_1,\ldots,\alpha_m)^\top$ contains the parameters of the base vector with $\alpha_j>0$. The probability density function of $X$ is expressed through $J$, the subset of the index set $\{1,\ldots,m\}$ with $x_j$ being positive (i.e., $x_j>0$ for any $j\in J$ and $x_j=0$ for $j\notin J$). (For more details of the probability density function, refer to Appendix A.1.)
As $p_j=1$ implies that the element in the $j$th column must be 0, we can simply delete the $j$th column and the remaining columns still form a compositional random vector. If $p_j=0$ for $j=1,\ldots,m$, then $Z=(1,\ldots,1)^\top$ and $\text{DCD}(0,\alpha)=\text{Dirichlet}(\alpha)$. That is, the well‐known Dirichlet distribution is a special case of the DCD.
We here suppose that $Z$ follows the zero‐truncated multivariate Bernoulli distribution. Due to the SR in (1), the denominator must be nonzero; therefore, $Z_1,\ldots,Z_m$ are not independent, as they cannot all be 0 at the same time. That is, $Z_1,\ldots,Z_m$ follow independent Bernoulli distributions but exclude the point $(0,\ldots,0)^\top$.
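To make the definition concrete, here is a minimal sampling sketch in Python. It rests on two assumptions that are our reading of the text rather than the authors' code: `p[j]` is the probability that the $j$th indicator equals 0 before truncation, and the zero‐truncation is implemented by redrawing the indicator vector whenever all entries are 0. The function name `sample_dcd` is hypothetical.

```python
import random


def sample_dcd(p, alpha, rng):
    """Draw one composition under the assumed DCD construction.

    p[j]     : assumed probability that indicator j is 0 (before truncation)
    alpha[j] : shape of the Gamma(alpha[j], rate 1) base variable
    rng      : a random.Random instance
    """
    if len(p) != len(alpha):
        raise ValueError("p and alpha must have the same length")
    if all(pj >= 1.0 for pj in p):
        raise ValueError("at least one component must be allowed to be nonzero")
    # Zero-truncated independent Bernoulli draws: reject the all-zero vector.
    while True:
        z = [0 if rng.random() < pj else 1 for pj in p]
        if any(z):
            break
    # Independent gamma base variables; with scale 1.0 the rate is also 1.
    y = [rng.gammavariate(aj, 1.0) for aj in alpha]
    masked = [zj * yj for zj, yj in zip(z, y)]
    total = sum(masked)
    return [v / total for v in masked]


rng = random.Random(2021)
# Component 1 is never zero (p = 0), component 2 is always zero (p = 1).
x = sample_dcd([0.0, 1.0, 0.5], [3.0, 2.0, 4.0], rng)
# With p = (0, 0, 0) every component is positive, i.e., a Dirichlet draw.
d = sample_dcd([0.0, 0.0, 0.0], [3.0, 2.0, 4.0], rng)
```

Rejection sampling is chosen here only for brevity; it becomes slow when all entries of `p` are close to 1.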

Mixed moments and moment generating function

If , then the following results can be easily shown: where is the subset with being 0 (i.e., ), = , and

STATISTICAL INFERENCE WITHOUT COVARIATES

In this section, we present statistical inferences based on data without covariates, which include the maximum likelihood estimates (MLEs) calculation in Section 3.1 and the confidence interval construction in Section 3.2.

Maximum likelihood estimates for target parameters

Suppose the observations are $x_1,\ldots,x_n$, where $x_i=(x_{i1},\ldots,x_{im})^\top$ and $m$ is the number of dimensions. Without loss of generality, we assume that the observations containing zero entries are listed first. The observed likelihood function is written in terms of $J_i$, the index set of the positive elements in each $x_i$ (i.e., $j\in J_i$ if $x_{ij}>0$). Here, the indicator variable $z_{ij}$ can be observed via the observation $x_{ij}$: observing $x_{ij}=0$ is equivalent to $z_{ij}=0$, and observing $x_{ij}>0$ is equivalent to $z_{ij}=1$.
Instead of obtaining the MLEs of the parameters $p$ and $\alpha$ by directly solving the score equations of the observed log‐likelihood, we consider the EM algorithm. Motivated by the SR, we introduce as missing data the base vectors together with the number of unobserved all‐zero indicator vectors needed to make the components of $Z$ independent. In fact, $Z_1,\ldots,Z_m$ are independent Bernoulli variables which exclude the outcome $(0,\ldots,0)^\top$. The observed and missing data together form the complete observations, from which the complete‐data likelihood and log‐likelihood functions follow.
The M‐step is to solve the score equations of the complete‐data log‐likelihood for $j=1,\ldots,m$; the equations for the $\alpha_j$'s involve the digamma function $\psi(x)=\mathrm{d}\log\Gamma(x)/\mathrm{d}x$. The $p_j$'s have closed‐form solutions. However, there are no closed‐form solutions for the $\alpha_j$'s, and we use a Newton–Raphson iterative algorithm, which involves the trigamma function $\psi'(x)$, to find their MLEs.
For the E‐step, we have a theorem, whose proof is presented in Appendix A.2, giving the conditional expectations of the missing data given the observed data. The E‐step is to replace the missing data by these conditional expectations. Initial values of the parameters are required to start the EM algorithm. The above steps (i.e., E‐ and M‐steps) are repeated until a convergence condition is achieved. For instance, when the stopping rule is that the difference between two successive log‐likelihood values be less than the prespecified value 0.001, the algorithm typically stops after 100–150 iterations.
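The Newton–Raphson update for a shape parameter can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: for a Gamma(α, 1) variable the score equation in α has the form ψ(α) = c, where c stands for an average of (conditionally expected) log base values; the exact equations of the paper are not reproduced here. The helper names `digamma`, `trigamma`, and `solve_shape` are hypothetical, and the two special functions are approximated with standard recurrence and asymptotic-series formulas because the Python standard library does not provide them.

```python
import math


def digamma(x):
    """psi(x) for x > 0: shift x up with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 * inv - series


def trigamma(x):
    """psi'(x) for x > 0: shift with psi'(x) = psi'(x + 1) + 1/x**2,
    then apply the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + inv + 0.5 * inv2 + series


def solve_shape(c, alpha0=1.0, tol=1e-10, max_iter=100):
    """Solve psi(alpha) = c for alpha > 0 by Newton-Raphson."""
    alpha = alpha0
    for _ in range(max_iter):
        new_alpha = alpha - (digamma(alpha) - c) / trigamma(alpha)
        if new_alpha <= 0.0:
            # Newton overshot below zero; halve the current value instead.
            new_alpha = alpha / 2.0
        if abs(new_alpha - alpha) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha


# Recover a known shape: the root of psi(alpha) = psi(3) is alpha = 3.
alpha_hat = solve_shape(digamma(3.0))
```

Because ψ is increasing and concave, the iteration approaches the root from below once it is below it; the halving guard only matters when the starting value is far above the root.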

Confidence interval construction

In this section, we consider the construction of confidence intervals for the target parameters using the bootstrap method. Note that the value of $p_j$ must lie within the interval [0, 1]. However, Wald‐type confidence intervals may produce an upper limit larger than 1 or a lower limit less than 0, whereas the MLE of $p_j$ obtained via our proposed EM algorithm always lies between 0 and 1. As a result, we apply the bootstrap method to construct the bootstrap confidence interval (CI) for an arbitrary function of the parameters $(p,\alpha)$. Briefly, based on the observations $x_1,\ldots,x_n$, we independently generate a bootstrap sample $x_1^*,\ldots,x_n^*$, with each $x_i^*$ randomly selected from the observations with replacement. From the bootstrap sample, we calculate the parameter estimates and obtain one bootstrap replication of the function of interest. Independently repeating this process $G$ times, we obtain $G$ bootstrap replications. The $100(1-\alpha)\%$ bootstrap CI is then formed by the $100(\alpha/2)$ and $100(1-\alpha/2)$ percentiles of these replications, where $\alpha$ here denotes the significance level rather than the shape parameter.
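The percentile bootstrap described above can be sketched generically. In the sketch below, the function names, the linear-interpolation percentile rule, and the toy data are all assumptions made for illustration; in practice the `estimator` argument would be the EM-based estimate of the quantity of interest, which is not reproduced here.

```python
import random


def percentile(sorted_values, q):
    """Percentile of a sorted list at probability q, by linear interpolation."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def bootstrap_ci(data, estimator, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    replications = []
    for _ in range(n_boot):
        # Resample n observations with replacement and re-estimate.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replications.append(estimator(resample))
    replications.sort()
    tail = (1.0 - level) / 2.0
    return percentile(replications, tail), percentile(replications, 1.0 - tail)


def share_of_zeros(values):
    """Toy estimator: proportion of exact zeros in one component."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical toy component with 40 zeros among 51 observations.
toy_component = [0] * 40 + [1] * 11
low, high = bootstrap_ci(toy_component, share_of_zeros)
```

Since every replication of a proportion lies in [0, 1], the resulting interval cannot leave that range, which is the property the text contrasts with Wald‐type intervals.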

STATISTICAL INFERENCE WITH COVARIATES

In this section, we show how to formulate the regression model for the target parameters and how to obtain the MLEs of the coefficients in the regression model. Covariates are recorded for each observation $i=1,\ldots,n$, and we consider regression models that link the DCD parameters of each observation to its covariates through the coefficient vectors $\beta$ and $\gamma$.
As in the case without covariates, we introduce the number of supplementary all‐zero indicator vectors needed to make the elements of $Z$ independent and treat it, together with the base vectors, as missing data. The complete‐data likelihood and log‐likelihood functions then follow, and the MLEs of the regression coefficients are the solution to the corresponding score equations. There is no closed‐form solution to (21). Here, we use the Newton–Raphson algorithm to calculate the MLEs, with iterations built from the first and negative second partial derivatives of the complete‐data log‐likelihood function. The latter are actually the complete‐data Fisher information matrices associated with the parameter vectors $\beta$ and $\gamma$, respectively, which depend on neither the observed data nor the latent/missing data.
To obtain the MLEs of the parameter vectors $\beta$ and $\gamma$ in the presence of missing data, we introduce the EM algorithm. Briefly, the M‐step is to separately calculate the MLEs of $\beta$ and $\gamma$ via Newton–Raphson algorithms, and the E‐step is to replace the missing data in (25) by their conditional expectations.
The calculation of the coefficients usually works well when the dimension is not large. However, the Newton–Raphson algorithm may fail when the dimension is high because the Jacobian tends to 0 in some iterations. Therefore, studies with a large number of covariates should be handled carefully in order to get reliable estimates. This will be an interesting and practical topic for future research.

HYPOTHESIS TEST

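We are usually interested in whether some of the coefficients/parameters are equal to zero. In this section, we consider the likelihood ratio test (LRT) of the null hypothesis $H_0$ that specified subsets of the regression coefficients in $\beta$ and $\gamma$ are equal to zero against the alternative $H_1$ that at least one of them is nonzero. The LRT statistic is twice the difference between the log‐likelihood evaluated at the unconstrained MLEs and the log‐likelihood evaluated at the constrained MLEs under $H_0$. Under the null hypothesis $H_0$, the $p$‐value is the probability that a chi‐square random variable exceeds the observed value of the LRT statistic, where the degrees of freedom equal the sum of the two numbers of covariates being tested (one for each set of regression coefficients).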
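The $p$‐value computation for an LRT can be sketched with the standard library alone. The sketch assumes the usual form of the statistic (twice the log‐likelihood difference) and evaluates the chi‐square tail probability through a series for the regularized incomplete gamma function; the function names and the numerical log‐likelihood values in the usage lines are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(chi-square with df degrees of freedom > x).

    Uses the series P(a, h) = h**a * exp(-h) / Gamma(a + 1)
    * sum_{n >= 0} h**n / ((a + 1) * ... * (a + n)), with a = df/2, h = x/2.
    """
    if x <= 0.0:
        return 1.0
    a = df / 2.0
    h = x / 2.0
    term = 1.0
    total = 1.0
    n = 0
    while term > 1e-16 * total and n < 10000:
        n += 1
        term *= h / (a + n)
        total += term
    log_prefactor = a * math.log(h) - h - math.lgamma(a + 1.0)
    lower = math.exp(log_prefactor) * total
    # Subtraction limits relative accuracy for extremely small p-values.
    return max(0.0, 1.0 - lower)


def lrt_pvalue(loglik_null, loglik_full, df):
    """Return the LRT statistic and its asymptotic chi-square p-value."""
    statistic = max(0.0, 2.0 * (loglik_full - loglik_null))
    return statistic, chi2_sf(statistic, df)


# Hypothetical log-likelihoods: constrained fit -105.0, unconstrained -100.0,
# with two coefficients set to zero under the null hypothesis.
statistic, p_value = lrt_pvalue(-105.0, -100.0, 2)
```

With two degrees of freedom the tail probability reduces to exp(-x/2), which gives a convenient check on the series.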

SIMULATION STUDIES

To evaluate the performance of the proposed statistical methods of DCD, we first investigate the accuracy of point estimates and confidence interval estimates for different parameter settings via simulation studies. We then conduct simulation studies for the regression model. The MLEs of parameters, standard deviation, and confidence intervals are presented. We will compare the ZIGDM model proposed by Tang and Chen (2019) with our proposed DCD model. Finally, simulation results for hypothesis testing are presented. In this section, all statistical computations are implemented in R.

Accuracy of point and interval estimates

For the $m$‐dimensional compositional data, there are $2m$ parameters in the DCD (i.e., the $m$‐dimensional parameter $p$ and the $m$‐dimensional parameter $\alpha$). We consider two cases, $m=2$ and $m=3$, to evaluate the accuracy of the point and confidence interval estimates. When $m=2$, we set $(p_1,p_2,\alpha_1,\alpha_2)=(0.1,0.2,3,4)$ or $(0.1,0.4,2,1)$. When $m=3$, we set $(p_1,p_2,p_3,\alpha_1,\alpha_2,\alpha_3)=(0.1,0.3,0.2,3,2,4)$ or $(0.2,0.2,0.3,2,1,3)$. For each parameter configuration, we generate samples of size $n=100$, $300$, and $500$ from the DCD, and calculate the MLEs via the EM algorithm and the 95% bootstrap CIs. The MLEs of the parameters and the width and coverage probability of the bootstrap confidence intervals are presented in Tables 1 and 2 for $m=2$ and $m=3$, respectively.

TABLE 1

MLEs and bootstrap confidence intervals of parameters when m = 2

	n=100			n=300			n=500
True value	MLE	Width	CP	MLE	Width	CP	MLE	Width	CP
p1 = 0.1	0.101	0.128	0.940	0.101	0.075	0.950	0.100	0.058	0.955
p2 = 0.2	0.199	0.163	0.939	0.199	0.094	0.955	0.199	0.073	0.958
α1 = 3	3.131	2.072	0.916	3.049	1.119	0.937	3.024	0.852	0.952
α2 = 4	4.191	2.823	0.910	4.068	1.516	0.930	4.032	1.157	0.946
p1 = 0.1	0.099	0.145	0.935	0.100	0.085	0.935	0.100	0.067	0.950
p2 = 0.4	0.398	0.198	0.960	0.399	0.115	0.941	0.399	0.089	0.945
α1 = 2	2.136	1.711	0.890	2.060	0.889	0.918	2.040	0.682	0.937
α2 = 1	1.053	0.748	0.896	1.025	0.399	0.919	1.017	0.304	0.938

Note: MLE is the mean of the 1000 point estimates via the EM algorithm; width and CP are the average width and coverage proportion of 1000 bootstrap confidence intervals.

TABLE 2

MLEs and bootstrap confidence intervals of parameters when m = 3

	n=100			n=300			n=500
True value	MLE	Width	CP	MLE	Width	CP	MLE	Width	CP
p1 = 0.1	0.100	0.118	0.936	0.100	0.069	0.940	0.100	0.054	0.953
p2= 0.3	0.302	0.180	0.950	0.300	0.104	0.943	0.300	0.081	0.950
p3= 0.2	0.198	0.157	0.943	0.199	0.091	0.938	0.199	0.071	0.938
α1= 3	3.091	1.528	0.926	3.019	0.840	0.952	3.011	0.646	0.948
α2 = 2	2.046	1.033	0.915	2.010	0.572	0.950	2.007	0.440	0.940
α3 = 4	4.115	2.069	0.924	4.028	1.143	0.943	4.012	0.879	0.939
p1 = 0.2	0.201	0.160	0.953	0.200	0.093	0.948	0.200	0.072	0.955
p2= 0.2	0.198	0.158	0.953	0.199	0.092	0.959	0.199	0.072	0.949
p3= 0.3	0.299	0.180	0.954	0.300	0.105	0.948	0.300	0.081	0.952
α1= 2	2.067	1.088	0.921	2.020	0.595	0.941	2.012	0.456	0.957
α2 = 1	1.031	0.514	0.912	1.012	0.284	0.930	1.007	0.218	0.934
α3 = 3	3.100	1.696	0.925	3.032	0.930	0.939	3.019	0.710	0.945

Note: MLE is the mean of the 1000 point estimates via the EM algorithm; width and CP are the average width and coverage proportion of 1000 bootstrap confidence intervals.

From Tables 1 and 2, the performance of the MLEs is satisfactory in the sense that (i) the bias of the estimates is negligible; (ii) the confidence width is acceptable; and (iii) the coverage proportion ranges from 0.890 to 0.960, which is not far from the prespecified value 0.95. Although 0.890 is somewhat below 0.95, the coverage proportion improves as the sample size increases.

Numerical results for the regression model

In this subsection, we conduct simulations to evaluate the performance of the proposed regression model for the target parameters. The regression coefficient vectors are $\beta_j$ and $\gamma_j$ ($j=1,2,3$), whose true values are reported in Table 3. For each generated dataset, we calculate the MLEs, and this process is repeated 1000 times. The mean of the MLEs and the width and coverage proportion of the bootstrap confidence intervals are presented in Table 3. According to the results, the MLEs and bootstrap confidence intervals perform well.

TABLE 3

MLEs for the regression coefficients in the DCD regression model

Parameter	True	MLE	Width	CP	True	MLE	Width	CP
	0.2	0.211	0.587	0.946	−1	−1.019	0.579	0.915
β1	−2	−2.022	0.788	0.950	2	2.030	0.709	0.901
	1	1.013	0.594	0.928	−1	−1.012	0.600	0.972
	1	1.019	0.581	0.967	3	3.063	1.244	0.967
β2	−1	−1.016	0.603	0.939	1	1.029	0.713	0.958
	2	2.031	0.700	0.953	−2	−2.037	0.971	0.944
	−1	−1.019	0.721	0.960	−1	−1.016	0.773	0.923
β3	−3	−3.059	1.014	0.912	−2	−2.035	1.019	0.918
	3	3.059	1.090	0.936	3	3.047	1.313	0.951
	−1	−0.925	0.357	0.878	−1	−0.986	0.364	0.928
γ1	−1	−1.012	0.335	0.924	1	0.971	0.338	0.895
	2	1.975	0.291	0.886	−2	−1.963	0.385	0.919
	−2.5	−2.391	0.312	0.862	−2	−1.992	0.345	0.943
γ2	2	1.894	0.316	0.946	0.5	0.478	0.358	0.922
	−2	−1.888	0.386	0.872	−1.5	−1.472	0.358	0.947
	−1	−0.892	0.377	0.939	−1	−0.993	0.382	0.947
γ3	1	0.895	0.373	0.959	2	1.974	0.356	0.936
	−2	−1.887	0.341	0.941	−1	−0.973	0.358	0.956

Note: MLE is the mean of the 1000 point estimates via the EM algorithm; width is the mean of the width of the 1000 CIs and CP is the coverage proportion of the confidence intervals.


The sensitivity and robustness of the model

In this subsection, we investigate the sensitivity and robustness of our model and compare the performance of the DCD and ZIGDM models. Under the ZIGDM model, the MLEs of the parameters based on an observation and on a nonzero constant multiple of that observation are different. However, the MLEs of the parameters under our proposed DCD model are invariant to constant multiplication. From this perspective, our model is more robust than the ZIGDM model.
Next, we consider the sensitivity of our model. For this purpose, we generate data from the ZIGDM model and compare the distances between the observations and the predictions obtained from the DCD and ZIGDM models. We consider sample sizes $n=100$, $300$, and $500$, four settings of the parameter $\pi$ ($\pi_1$ to $\pi_4$), and three settings of the parameters $(a,b)$ ($(a_1,b_1)$ to $(a_3,b_3)$), as indexed in Table 4. For each generated dataset, we obtain the MLEs of the ZIGDM parameters and then calculate the predictions of the observations by applying the ZIGDM model. Similarly, predictions are obtained based on the DCD model. The distances between observations and predictions (denoted as $d_1$ for the ZIGDM model and $d_2$ for the DCD model) are presented in Table 4.

TABLE 4

The distances between observations and predictions: d1 and d2

		π1		π2		π3		π4
Parameters		d1	d2	d1	d2	d1	d2	d1	d2
(a1,b1)	n=100	20.26	20.41	21.50	22.06	20.50	20.40	25.57	27.01
	n=300	61.41	61.44	65.31	66.60	62.84	61.61	77.08	81.43
	n=500	102.76	102.34	108.60	110.96	104.80	103.00	128.87	135.42
(a2,b2)	n=100	16.60	16.11	26.66	26.37	36.53	34.79	31.68	32.87
	n=300	49.94	48.44	81.30	79.34	110.94	105.00	96.05	99.18
	n=500	83.07	80.69	135.23	131.94	184.80	175.47	159.84	165.67
(a3,b3)	n=100	14.18	13.79	26.80	26.78	39.48	38.82	32.54	34.27
	n=300	42.66	41.56	81.20	81.19	119.85	117.52	98.16	103.76
	n=500	71.06	69.04	134.98	135.03	200.03	196.08	163.54	172.35

As the data are generated from the ZIGDM model, $d_1$ is expected to be less than $d_2$. From Table 4, $d_1$ is less than $d_2$ in a number of settings (e.g., throughout the $\pi_4$ columns), but the two distances are close overall, and the DCD model sometimes fits better than the ZIGDM model. This suggests that our proposed DCD model is robust.
Next, we generate data from the DCD with $n=100$, $300$, and $500$, four settings of $p$ ($p_1$ to $p_4$), and three settings of $\alpha$ ($\alpha_1$ to $\alpha_3$), as indexed in Table 5. We obtain the MLEs based on the ZIGDM model and then calculate the distance between the predictions and the observed data (denoted as $d_3$). Since the ZIGDM model cannot be applied to DCD data directly, an additional assumption is needed to fit it. Similarly, we obtain the distance (denoted as $d_4$) using the DCD model. We report $d_3$ and $d_4$ in Table 5. As expected, $d_3$ is generally larger than $d_4$. The results in Table 5 support that the performance of the DCD model is generally better than that of the ZIGDM model.

TABLE 5

The distances between observations and predictions: d3 and d4

		p1		p2		p3		p4
Parameters		d3	d4	d3	d4	d3	d4	d3	d4
α1	n=100	40.86	38.72	26.08	21.35	28.90	26.17	30.54	29.37
	n=300	122.35	116.52	78.82	64.76	87.35	79.20	92.61	89.08
	n=500	204.86	195.15	131.57	107.93	145.43	131.91	154.36	148.83
α2	n=100	56.19	52.03	23.50	21.80	36.75	34.89	50.54	50.60
	n=300	169.76	156.88	70.18	65.42	110.42	104.67	152.10	152.17
	n=500	283.58	261.86	117.68	109.50	184.64	174.48	254.46	254.46
α3	n=100	52.21	50.98	37.92	31.52	44.45	41.45	34.01	31.12
	n=300	157.61	153.92	113.95	94.95	132.90	124.44	103.05	94.36
	n=500	263.51	257.16	190.19	158.34	223.38	207.44	172.41	157.27


Hypothesis testing

In this subsection, we evaluate the performance of our proposed LRT. According to Equation (8), we consider three null hypotheses, labeled Cases I, II, and III. For Case I, we generate the data 1000 times; data are likewise generated for Cases II and III under their own parameter settings. Applying the LRT in Section 5, we record the proportions of rejection for the three cases with the sample size being $n=50$, $100$, $200$, $500$, and $1000$. The simulated type I error rates are reported in Table 6. The performance of our LRT is acceptable even when the sample size is small, although the rates are somewhat inflated at $n=50$ in Cases II and III (0.100 and 0.077). When the sample size is 200 or larger, the simulated type I error rate is close to the prespecified significance level (i.e., 0.05).

TABLE 6

The type I error rates for the LRT

Type I error	n=50	n=100	n=200	n=500	n=1000
Case I	0.057	0.073	0.057	0.045	0.056
Case II	0.100	0.058	0.040	0.053	0.050
Case III	0.077	0.056	0.047	0.047	0.044

Next, we investigate the power of the LRT. We generate the data from the DCD with parameters chosen under the alternative hypothesis. The proportions of rejection for the above three cases are recorded in Table 7. As expected, the simulated power increases with the sample size.

TABLE 7

The simulated powers for the LRT

Power	n=50	n=100	n=200	n=500	n=1000
Case I	0.692	0.955	1.000	1.000	1.000
Case II	1.000	1.000	1.000	1.000	1.000
Case III	1.000	1.000	1.000	1.000	1.000


THE PERCENTAGE OF CHROMOSOME DATA BY THE FISH TEST

In this section, we revisit the FISH test dataset described in Section 1. We here apply the DCD to analyze the FISH data of chromosome 16. First, we analyze the dataset with no covariate, and the results are reported in Table 8.

TABLE 8

MLEs and 95% bootstrap CIs of the parameters for the FISH data

Parameter	MLE	Mean	Median	95% Bootstrap CI
p1	0.000	0.000	0.000	[ 0.000, 0.000]
p2	0.784	0.774	0.765	[0.667, 0.882]
p3	0.980	0.969	0.980	[0.922, 0.980]
α1	12.170	10.860	10.435	[6.661, 17.288]
α2	38.516	33.009	32.074	[17.123, 54.360]
α3	6.117	5.359	5.202	[3.146, 8.399]

As we can see from Table 8, the first component in the composition dataset (i.e., the normal cell) is always nonzero, which means that the estimated probability of a zero observation for the normal cell is 0. The probabilities of a zero observation for the triploid and tetraploid cells are estimated to be 0.784 and 0.980, respectively. That is, for chromosome 16, triploid cells are found in about 22% of the embryos, while tetraploid cells are found in only about 2%. For the base part of the compositional data, the estimate corresponding to the triploid cell is larger than those of the normal and tetraploid cells; that is, $\hat\alpha_1=12.17$, $\hat\alpha_2=38.52$, and $\hat\alpha_3=6.12$. In other words, once there is an abnormal chromosome in the cell, the triploid is more frequent than the other two.
Because the regression model assumes normally distributed covariates, we standardize the covariate age in the FISH dataset. Furthermore, we apply the regression model to investigate the relationship between the parameters and the age of the gravida. The MLEs of the coefficients are reported in Table 9. We apply the LRT to the hypotheses listed in Section 5, and the null hypothesis is rejected at the 0.05 significance level. Therefore, we have reason to believe that age is a significant variable.

TABLE 9

MLEs and 95% bootstrap CIs of the coefficients in the regression model of the FISH data

Parameter	Coefficients	MLE	Standard deviation	95% Bootstrap CI
β1	Intercept	−12.985	2.912	[−19.302, −8.034]
	Age	1.019	1.840	[−2.419, 4.221]
β2	Intercept	1.293	0.371	[0.663, 2.076]
	Age	0.080	0.388	[−0.680, 0.942]
β3	Intercept	4.192	0.459	[2.779, 4.448]
	Age	0.796	0.316	[0.458, 1.753]
γ1	Intercept	3.116	0.559	[1.238, 3.798]
	Age	1.003	0.992	[−2.461, 2.082]
γ2	Intercept	4.270	0.652	[1.876, 4.904]
	Age	1.178	1.036	[−2.534, 2.234]
γ3	Intercept	1.234	0.329	[ 0.729, 1.803]
	Age	− 0.615	0.150	[−0.925, −0.366]


DISCUSSION

In this article, we consider a new framework for analyzing compositional data with zero entries based on the SR. In particular, a new distribution, namely the DCD, is developed to accommodate the possible essential‐zero feature in compositional data (i.e., some components are completely absent). In our proposed distribution, the elements in the base vector are assumed to follow independent gamma distributions. More generally, any positive random variable can be adopted as an element of the base vector (e.g., the inverse Gaussian and chi‐square random variables), and different base vectors correspond to different relationships among the elements. It is of research and practical interest to relax the assumption of independence among the components of the base vector. Besides, regression modeling with high‐dimensional covariates is also an interesting and necessary topic, as the Jacobian tends to zero when the dimension of the covariates is high.

CONFLICT OF INTEREST

The authors have declared no conflict of interest.

OPEN RESEARCH BADGES

This article has earned an Open Data badge for making publicly available the digitally‐shareable data necessary to reproduce the reported results. The data are available in the Supporting Information section. This article has earned an open data badge "Reproducible Research" for making publicly available the code necessary to reproduce the reported results. The results reported in this article were reproduced partially due to their computational complexity.


1. A two-level model for evidence evaluation in the presence of zeros.

Authors: Grzegorz Zadora; Tereza Neocleous; Colin Aitken
Journal: J Forensic Sci Date: 2010-02-11 Impact factor: 1.832

2. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis.

Authors: Zheng-Zheng Tang; Guanhua Chen
Journal: Biostatistics Date: 2019-10-01 Impact factor: 5.899

3. Immunological method for mapping genes on Drosophila polytene chromosomes.

Authors: P R Langer-Safer; M Levine; D C Ward
Journal: Proc Natl Acad Sci U S A Date: 1982-07 Impact factor: 11.205

4. Dirichlet composition distribution for compositional data with zero components: An application to fluorescence in situ hybridization (FISH) detection of chromosome.

Authors: Man-Lai Tang; Qin Wu; Sheng Yang; Guo-Liang Tian
Journal: Biom J Date: 2021-12-16 Impact factor: 1.715
