MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data.

Quran Wu¹, James O'Malley², Susmita Datta¹, Raad Z Gharaibeh³, Christian Jobin³, Margaret R Karagas⁴, Modupe O Coker⁴, Anne G Hoen⁴, Brock C Christensen⁴, Juliette C Madan⁴, Zhigang Li¹.

Abstract

BACKGROUND: The human microbiome can contribute to pathogeneses of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis methods are not adequate to analyze the microbiome as a mediator because the data contain an excessive number of zero-valued sequencing reads and the relative abundances have to sum to one. The two main challenges raised by the zero-inflated data structure are: (a) disentangling the mediation effect induced by the point mass at zero; and (b) identifying the observed zero-valued data points that are not truly zero (i.e., false zeros).
METHODS: We develop a novel marginal mediation analysis method under the potential-outcomes framework to address the issues. We also show that the marginal model can account for the compositional structure of microbiome data.
RESULTS: The mediation effect can be decomposed into two components that are inherent to the two-part nature of zero-inflated distributions. With probabilistic models to account for observing zeros, we also address the challenge with false zeros. A comprehensive simulation study and the application in a real microbiome study showcase our approach in comparison with existing approaches.
CONCLUSIONS: When analyzing zero-inflated microbiome composition as the mediator, the MarZIC approach performs better than standard causal mediation analysis approaches and an existing competing approach.

Keywords: mediation; microbiome; relative abundance; sparse data; zero-inflated composition

Year: 2022 PMID: 35741811 PMCID: PMC9223163 DOI: 10.3390/genes13061049

Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.141

1. Introduction

Emerging evidence suggests that the human microbiome and the immune system are constantly shaping each other [1]. The human microbiome can contribute to disease pathogeneses by mediating disease-leading causal pathways in complex diseases such as Alzheimer’s disease [2] and cancer [3,4]. To study the human microbiome, 16S ribosomal RNA gene sequencing and metagenomic shotgun sequencing have been popular methods to quantify microbiome composition. A challenging feature of microbiome sequencing data is that they contain an excessive number of zeros [5]. Many microbiome data sets have more than 50% of the sequencing reads equal to 0, and the proportion can be even higher. These zeros are likely to be a mixture of structural zeros (i.e., true zeros) that represent true absence of microbial taxa and undersampling zeros (i.e., false zeros) that result from failure of detection. The zero-inflated data feature, compounded by a compositional structure, poses a challenge that needs to be addressed specifically in mediation analyses. Although there have been some exciting efforts to model the microbiome as a high-dimensional mediator [6,7,8], it remains a daunting task to address the zero-inflated data structure.

Mediation analysis is an important tool to investigate the role of intermediate variables (i.e., mediators) in a causal pathway where the causal effect partially or completely relies on the mediators. For example, people with higher socioeconomic status tend to have longer life expectancy, but this causal pathway may be explained by many possible mediators including access to better health care, fewer stressors, better living environment and so forth. In a mediation analysis, the indirect effect (i.e., mediation effect) through one or more mediators can be estimated and tested along with the direct effect.
This technique was first popularized in psychology and social sciences and it has become a common analysis tool in many research areas such as epidemiology, environmental health sciences, medicine, randomized trials and psychiatry. There are two general types of mediation analysis approaches: potential-outcomes (PO) or counterfactual-outcomes methods [9,10,11] and traditional linear mediation analysis methods [12,13]. The latter approach can be considered as a special case of the former approach that can allow for nonlinear associations and interactions between independent variables and mediators. PO approaches are more flexible because they can allow interaction effects of the independent variable with mediators as well as nonlinear effects. Reviews of mediation analysis approaches and their assumptions can be found in the literature [14,15,16]. Although mediation modeling frameworks have been well established, to the best of our knowledge, there have been few studies to address zero-inflated compositional mediators. In a typical mediation analysis, the total effect of an independent variable can be decomposed into a mediation effect and a direct effect where the mediation effect measures the amount of the total causal effect attributable to change in the mediator caused by the independent variable and the direct effect measures the causal effect due to change in the independent variable while keeping the mediator variable constant. When the mediator has a marginal zero-inflated distribution such as a zero-inflated Beta (ZIB) distribution, we show that its mediation effect can be further decomposed into two parts with one part being the mediation effect attributable to the amount of numeric change in the mediator and the other part being the mediation effect attributable to the binary change of the mediator from zero to a non-zero state. This phenomenon can be explained by the two-part nature of a zero-inflated distribution. 
For example, a ZIB distribution is essentially a two-component mixture distribution [17]: one component is a degenerate distribution with probability mass of one at zero, and the other component is a Beta distribution. The mediator changing from zero to a positive value results in the discrete jump from zero to a non-zero state as well as the change in the numerical metric of the mediator and thus the mediation effect can be decomposed accordingly. Both changes have important interpretations in microbiome research. What makes it more complicated is that the observed zero-valued data points could be false zeros meaning that the true values are non-zero but observed as zero due to failure of detection. This is similar to a missing data problem and will be addressed here as well. To fill the research gap in mediation modeling development, we propose a novel marginal mediation analysis approach under the PO framework to deal with zero-inflated compositional mediators. This approach can allow a mixture of truly zero-valued data points and false zeros. Our method is able to decompose the mediation effect into two components that are inherent to zero-inflated mediators: one component is the mediation effect attributable to the numeric change of the mediator on its continuum scale and the other component is the mediation effect attributable to the binary change of the mediator from zero to a non-zero state. So the mediation effect is actually the total mediation effect of the two components each of which can be estimated and tested. An extensive simulation study is conducted to evaluate our approach MarZIC in comparison with a standard PO mediation analysis approach [10] and another approach [6] that can analyze microbiome composition as a mediator. We introduce the model and its associated notations in Section 2. Estimation and inference procedures are provided in Section 3. 
A simulation study to assess the performance of our model in comparison with existing approaches is presented in Section 4, followed by an application of our model in Section 5, and a discussion in Section 6. Additional details and derivations can be found in Appendix A, Appendix B and Appendix C.

2. Model and Notation

For simplicity, we suppress the subject index in all notation in this section. Let Y, M and X denote the continuous outcome variable, the compositional mediator variable and the independent variable respectively. For example, M could be the vector of relative abundances (RA) of microbial taxa. Before constructing the model for zero-inflated data, we first describe the model for the special case where the mediator M has no zeros, which could happen if investigators choose to replace zeros with a pseudocount or a small positive number. The model for zero-inflated data will be provided after that.

2.1. Model for Data without Zeros

We first consider cases where there are no zeros for the mediator M in the data, which is very rare but could happen if zeros are replaced by a pseudocount or a small positive number. We will move to cases with M containing zeros in the next section. As Dirichlet distributions have been widely used for modeling the RA of microbiome taxa [18,19,20,21,22,23], let $M = (M_1, \ldots, M_{K+1})$ follow a $(K+1)$-dimensional Dirichlet distribution indexed by its mean parameters, which sum to one, and a dispersion parameter. We assume the outcome Y depends on M and X through the regression Equation (1), in which the random error follows a normal distribution with mean of 0 and a constant variance, the remaining parameters are regression coefficients, and the product of X and $M_k$ is the interaction term between the independent variable X and the mediator $M_k$. An advantage of using the RA on its original scale instead of its log-transformed value in the model is that it does not require imputing zeros with a positive number. All taxa and their interactions with X are included and thus the compositional structure is accounted for in this model. Later, we will show that a marginal model can also account for the compositional structure. Equation (1) implies that the marginal association between Y and any taxon $M_j$, $j = 1, \ldots, K+1$, has the form given in Equation (2) (derivation can be found in Appendix A), which expresses the mean of Y conditional on $M_j$ given X. It is straightforward to see that the full model (1) uniquely determines the marginal association for each taxon. Therefore, without violating model (1), we can construct the marginal regression model (3) for the association between Y and $M_j$ and X such that it is equivalent to model (1), where the random error has a normal distribution with mean of 0. An advantage of the marginal model (3) over model (1) is that it is straightforward to interpret the regression coefficient of $M_j$ as a typical regression coefficient, whereas the corresponding regression coefficient in Equation (1) does not have such a straightforward interpretation.
That is because at least one $M_k$, $k \neq j$, has to change when $M_j$ changes due to the compositional structure, and thus it is not possible to hold all $M_k$'s, $k \neq j$, constant while changing $M_j$ to interpret its coefficient in Equation (1) as a typical regression coefficient. Another nice feature of the marginal model (3) is that the true values of its regression parameters are functions of the parameters of the Dirichlet distribution of M as shown in Equation (2); therefore, the marginal model accounts for the compositional structure. It is also much more convenient to work on the marginal model (3) due to its simpler form. With that and the above advantages, we propose to use the marginal model (3) for constructing the mediation model. When the vector M has the Dirichlet distribution as previously assumed in this section, $M_j$ has a Beta distribution with a mean parameter $\mu_j$ and a scale parameter. Equation (4) can be used to model the association between $M_j$ and X. Equations (3) and (4) together form our marginal mediation model for the scenario without zeros for M.

2.2. Model for Data with Zeros

Now we consider scenarios where the data for M contain zeros. Given the advantages of a marginal model as demonstrated in the above subsection, we will again use a marginal model for the association between Y and any taxon $M_j$ to form a mediation model. For any taxon $M_j$, we construct the marginal model as follows:

$$Y = \beta_0 + \beta_1 M_j + \beta_2 1_{(M_j>0)} + \beta_3 X + \beta_4 X 1_{(M_j>0)} + \beta_5 X M_j + \epsilon, \qquad (5)$$

where $1_{(M_j>0)}$ is an indicator function indicating whether $M_j$ is greater than 0, the random error $\epsilon$ follows a normal distribution $N(0, \delta^2)$, and $\beta_0, \beta_1, \ldots, \beta_5$ are regression coefficients. This model is fully compatible with allowing interactions between the independent variable and mediators, as the two interaction terms $X 1_{(M_j>0)}$ and $X M_j$ are included in Equation (5). In practice, investigators can also include only one or no interaction term depending on the hypothesis of interest. For the marginal distribution of $M_j$, it is natural to use a zero-inflated Beta (ZIB) distribution because the marginal of a Dirichlet distribution is a Beta distribution [18,19]. Its two-part density function is given as follows:

$$f(m_j) = \begin{cases} \Delta_j, & m_j = 0, \\ (1-\Delta_j)\,\dfrac{m_j^{\mu_j\xi-1}(1-m_j)^{(1-\mu_j)\xi-1}}{B\left(\mu_j\xi, (1-\mu_j)\xi\right)}, & m_j > 0, \end{cases}$$

where $\Delta_j$ is the probability of $M_j$ being 0, $B(\cdot,\cdot)$ is the Beta function and $\mu_j$ and $\xi$ are the mean and dispersion parameters respectively of the Beta distribution for the non-zero part [24,25]. To model the association of the mediator $M_j$ with X, we use the following equations:

$$\log\left(\frac{\mu_j}{1-\mu_j}\right) = \alpha_0 + \alpha_1 X, \qquad (6)$$

$$\log\left(\frac{\Delta_j}{1-\Delta_j}\right) = \gamma_0 + \gamma_1 X. \qquad (7)$$

Equations (5)–(7) together form our mediation model. The parameter $\alpha_1$ in Equation (6) measures the association between X and the RA level of the mediator, and $\gamma_1$ in Equation (7) measures the association between X and the binary presence of the mediator. Notice that X is a scalar here, but other covariates such as potential confounders can be included in the model equations.
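To make the two-part structure concrete, the following is a minimal sketch of drawing a single mediator value from a zero-inflated Beta distribution with logit links for the mean and the zero probability. The function names `expit` and `sample_zib` are illustrative and not part of any package, and the Beta part is assumed to be parameterized as Beta(mean × dispersion, (1 − mean) × dispersion). The parameter values are the "True" values listed for the high-RA scenario in Table 1.

```python
import math
import random


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def sample_zib(x, alpha0, alpha1, gamma0, gamma1, xi, rng):
    """Draw one relative abundance from a zero-inflated Beta distribution.

    Assumed parameterization (a sketch, not the authors' code):
      logit(mean of non-zero part) = alpha0 + alpha1 * x
      logit(probability of zero)   = gamma0 + gamma1 * x
      non-zero part ~ Beta(mu * xi, (1 - mu) * xi)
    """
    mu = expit(alpha0 + alpha1 * x)
    zero_prob = expit(gamma0 + gamma1 * x)
    if rng.random() < zero_prob:
        return 0.0  # structural (true) zero
    return rng.betavariate(mu * xi, (1.0 - mu) * xi)


# High-RA scenario of Table 1, evaluated at X = 0.
rng = random.Random(2022)
draws = [sample_zib(0, -1.0, 0.4, -1.16, -0.5, 50.0, rng) for _ in range(20000)]
zero_fraction = sum(d == 0.0 for d in draws) / len(draws)
positive = [d for d in draws if d > 0.0]
positive_mean = sum(positive) / len(positive)
```

With these values the share of zeros should be close to `expit(-1.16)` and the mean of the positive draws close to `expit(-1.0)`, which is one way to check that a simulated mediator matches the intended two-part distribution.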

2.3. Mechanism for Observing Zeros of the Mediator

For microbiome abundance data, observations that cannot be detected are set to be zero. Consequently, there are two types of zeros in the observed abundance data: true abundance of zero (i.e., absence) and abundance that is reported as zero as a consequence of measurement failure. Let $M_j^*$ denote the observed value of $M_j$. When the observed value is positive (i.e., $M_j^* > 0$), we assume that $M_j^* = M_j$. But when $M_j^* = 0$, we don’t know whether $M_j$ is truly zero or is positive but observed as zero. We consider the following mechanism for the probability of observing a zero of the microbial taxon abundance:

$$P(M_j^* = 0 \mid M_j, L) = 1_{(L M_j < 1)}, \qquad (8)$$

where L is the library size (i.e., sequencing depth) and the product $L M_j$ can be interpreted as the sample absolute abundance (SAA) of the jth taxon in a sample. Under this mechanism, all SAA below 1 have an observed value of zero. Here 1 can be considered as the Limit of Detection (LOD). We refer to this mechanism as “LOD mechanism” hereafter. Since SAA depends on both L and $M_j$, the LOD mechanism is not deterministic conditional on the library size. The probability of observing a zero conditional on L, the library size, is equal to $P(M_j < 1/L \mid L)$.
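The LOD mechanism can be sketched in a few lines, assuming the observed value equals the true RA whenever the sample absolute abundance reaches 1 and is zero otherwise. The function name `observed_relative_abundance` is hypothetical; the two library sizes are the minimum and maximum reported for the real study data in Section 4.

```python
def observed_relative_abundance(true_ra, library_size):
    """Apply a limit of detection of one read to a true relative abundance.

    The sample absolute abundance (SAA) is library_size * true_ra.
    An SAA below 1 is reported as zero; otherwise the true RA is observed.
    """
    saa = library_size * true_ra
    return true_ra if saa >= 1.0 else 0.0


rare_taxon = 1e-5
shallow = observed_relative_abundance(rare_taxon, 31_607)   # SAA about 0.32
deep = observed_relative_abundance(rare_taxon, 911_652)     # SAA about 9.1
```

The same true RA is therefore a false zero in the shallow sample and is observed in the deep one, which is why the probability of a false zero depends on the library size.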

2.4. Marginal Mediation Effect and Direct Effect

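Under the potential-outcomes (PO) framework [15], we can define the natural indirect effect (NIE), natural direct effect (NDE) and controlled direct effect (CDE), where NIE is the mediation effect. We refer to NIE as the marginal mediation effect because the proposed mediation models are based on marginal models as shown in Section 2. The total effect of X is equal to the sum of NIE and NDE. For any $x$, let $M_{jx}$ denote the value of $M_j$ if X equals x. Let $Y_{xm}$ denote the value of Y if $(X, M_j) = (x, m)$. The average NIE, NDE and CDE for X changing from $x_1$ to $x_2$ are defined as:

$$\mathrm{NIE} = E\left(Y_{x_2 M_{jx_2}} - Y_{x_2 M_{jx_1}}\right), \quad \mathrm{NDE} = E\left(Y_{x_2 M_{jx_1}} - Y_{x_1 M_{jx_1}}\right), \quad \mathrm{CDE} = E\left(Y_{x_2 m} - Y_{x_1 m}\right),$$

where $Y_{x_2 M_{jx_1}}$ is a counterfactual outcome. By plugging Equations (5)–(7) into the above definitions and using Riemann–Stieltjes integration [26], we can obtain the following formulas:

$$\mathrm{NIE} = \mathrm{NIE}_1 + \mathrm{NIE}_2,$$

$$\mathrm{NIE}_1 = (\beta_1 + \beta_5 x_2)\left\{E(M_{jx_2}) - E(M_{jx_1})\right\},$$

$$\mathrm{NIE}_2 = (\beta_2 + \beta_4 x_2)\left\{\mathrm{expit}(\gamma_0 + \gamma_1 x_1) - \mathrm{expit}(\gamma_0 + \gamma_1 x_2)\right\},$$

$$\mathrm{NDE} = (x_2 - x_1)\left[\beta_3 + \beta_4\left\{1 - \mathrm{expit}(\gamma_0 + \gamma_1 x_1)\right\} + \beta_5 E(M_{jx_1})\right],$$

$$\mathrm{CDE} = (x_2 - x_1)\left(\beta_3 + \beta_4 1_{(m>0)} + \beta_5 m\right),$$

with $E(M_{jx}) = \int m \, dF_{M_{jx}}(m) = \left\{1 - \mathrm{expit}(\gamma_0 + \gamma_1 x)\right\}\mathrm{expit}(\alpha_0 + \alpha_1 x)$, where $\mathrm{expit}(\cdot)$ is the inverse function of $\mathrm{logit}(\cdot)$, $F_{M_{jx}}$ denotes the CDF of $M_{jx}$ and $\int \cdot \, dF_{M_{jx}}$ denotes the Stieltjes integration [26] with respect to $F_{M_{jx}}$. So NIE, $\mathrm{NIE}_1$, $\mathrm{NIE}_2$, NDE and CDE can be estimated by plugging the parameter estimates into the formulas. Confidence intervals (CI) are obtained using the multivariate delta method as outlined in Appendix B. An alternative approach for finding standard errors to construct CI is bootstrapping [27]. $\mathrm{NIE}_1$ can be interpreted as the marginal mediation effect due to the change of the mediator on its numeric scale and $\mathrm{NIE}_2$ can be interpreted as the marginal mediation effect due to the discrete binary change of the mediator from zero to a non-zero status. This decomposition can also be seen in Figure 1, where there are two possible indirect causal pathways from X to Y through the mediator $M_j$.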

Figure 1

Potential causal mediation pathways of a zero-inflated mediator.
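As a numerical illustration of the decomposition, the sketch below computes the two mediation components from model parameters. It assumes an outcome model that is linear in the mediator, its presence indicator, X, and the two interactions, with coefficients ordered β0 to β5, and logit links for the Beta mean (α) and the zero probability (γ). The function names and the `x_from`/`x_to` arguments are illustrative. The inputs are the "True" parameter values of Table 1 with the mediator-by-X interaction set to 0, for X changing from 0 to 1.

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def presence_prob(x, gamma):
    """P(mediator > 0 | X = x): one minus the zero-inflation probability."""
    g0, g1 = gamma
    return 1.0 - expit(g0 + g1 * x)


def mean_ra(x, alpha, gamma):
    """E(mediator | X = x): presence probability times the Beta mean."""
    a0, a1 = alpha
    return presence_prob(x, gamma) * expit(a0 + a1 * x)


def mediation_effects(beta, alpha, gamma, x_from, x_to):
    """Return NIE1, NIE2, NIE and NDE for X changing from x_from to x_to."""
    _, b1, b2, b3, b4, b5 = beta
    d_mean = mean_ra(x_to, alpha, gamma) - mean_ra(x_from, alpha, gamma)
    d_presence = presence_prob(x_to, gamma) - presence_prob(x_from, gamma)
    nie1 = (b1 + b5 * x_to) * d_mean        # numeric-change component
    nie2 = (b2 + b4 * x_to) * d_presence    # zero to non-zero component
    nde = (x_to - x_from) * (
        b3 + b4 * presence_prob(x_from, gamma) + b5 * mean_ra(x_from, alpha, gamma)
    )
    return {"NIE1": nie1, "NIE2": nie2, "NIE": nie1 + nie2, "NDE": nde}


beta = (-2.0, 100.0, 4.0, 5.0, 3.0, 0.0)
gamma = (-1.16, -0.5)
low = mediation_effects(beta, (-6.2, 0.4), gamma, 0, 1)
high = mediation_effects(beta, (-1.0, 0.4), gamma, 0, 1)
```

Under these assumptions the computed effects round to the "True" values of NIE₁, NIE₂ and NIE listed in Table 1 for both the low-RA and high-RA scenarios, which supports this reading of the decomposition.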

2.5. Sequential Ignorability Assumption

Mediation analyses require assumptions to make causal inference, and different forms of assumptions have been proposed in the literature [9,28,29,30,31,32]. The key of the assumptions is to identify the terms involving counterfactual outcomes so that they can be estimated with the observed data. One of the popular assumptions is the sequential ignorability assumption proposed in [28]. In the definition of NIE and NDE, the variable $Y_{x_2 M_{jx_1}}$ is a counterfactual outcome because $M_{jx_1}$ cannot be observed when X takes the value of $x_2$. The sequential ignorability assumption [28] for identifying $E(Y_{x_2 M_{jx_1}})$ can be written as follows in our setting:

$$\left\{Y_{x'm}, M_{jx}\right\} \perp X \mid Z, \qquad (9)$$

$$Y_{x'm} \perp M_{jx} \mid X = x, Z, \qquad (10)$$

where $x'$ and x are any values in the support of X, m is any value in the support of $M_j$, and Z is a vector of confounders (if any). The first assumption in Equation (9) says the outcome Y at any given value of the vector $(X, M_j)$ and the mediator $M_j$ at any given value of X should all be independent of X conditional on confounders in Z. A randomized trial where X is the random assignment typically makes this assumption automatically satisfied. The second assumption in Equation (10) says the outcome Y at any given value of the vector $(X, M_j)$ is independent of the mediator at $X = x$ conditional on $X = x$ and confounders in Z. The second assumption is essentially saying that the mediator is effectively randomly assigned given X and Z. A straightforward interpretation for the first assumption is that there are no unmeasured confounders for the X–Y association and the X–$M_j$ association. A straightforward interpretation for the second assumption is that there are no unmeasured confounders for the $M_j$–Y association. In our setting, the indicator variable $1_{(M_j>0)}$ is also considered as a mediator. Because it is completely determined by $M_j$, the above assumptions are enough to ensure the identifiability of $E(Y_{x_2 M_{jx_1}})$ such that it can be estimated by the observed data.

3. Parameter Estimation

Maximum likelihood estimation (MLE) will be used to estimate the parameters. The data needed to estimate the marginal mediation effects for the jth taxon consist of the outcome, the independent variable, the library size and the observed value of the mediator. The estimation challenge is that $M_j$ is not always observable due to false zeros. The log-likelihood contribution from those subjects with false zeros cannot be directly calculated. However, given that we know the probability of observing a zero in Equation (8), we can still obtain their log-likelihood contributions by integrating the joint density function over all possible values of $M_j$ using Riemann–Stieltjes integration [26]. Let $M_{ij}^*$ denote the observed value of $M_j$ for the ith subject in a study and $M_{ij}$ denote the true value of the mediator for the ith subject. We use i for subject index hereafter throughout the paper. The subjects can be divided into two groups by whether $M_{ij}^*$ is non-zero, and we derive the log-likelihood contribution for each group. The first group consists of subjects whose observed value of the mediator is non-zero (i.e., $M_{ij}^* > 0$). Based on the assumptions in Equations (5)–(7), where $\epsilon$ is assumed to have a normal distribution, the log-likelihood contribution from the ith subject (if it is in group 1) is the sum of the logarithms of the (conditional) densities (or probability mass functions) for Y, R and $M_{ij}$ evaluated at the observed data. The second group consists of subjects with $M_{ij}^* = 0$. The log-likelihood contribution from the ith subject (if it is in group 2) is obtained by integrating the same joint density over the values of $M_{ij}$ that are observed as zero, which involves the (conditional) cumulative distribution function for $M_{ij}$. Taken together, the complete log-likelihood function is the sum of these contributions over all subjects in the two groups. The MLE of the parameters can be obtained by maximizing the complete log-likelihood function. With the parameter estimates and the observed Fisher information matrix, we will be able to calculate NIE, $\mathrm{NIE}_1$, $\mathrm{NIE}_2$, NDE and CDE and their CIs.
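For orientation, one way to write the two kinds of log-likelihood contribution is sketched below. The notation is an assumption made for illustration: $\phi_\delta$ is the $N(0, \delta^2)$ density, $\eta_i(m)$ is the linear predictor of the outcome model at mediator value $m$, $f_B$ is the Beta density of the non-zero part, $\Delta_i$ is the zero probability for subject $i$, and $L_i$ is the library size. The sketch assumes a detection limit of one read, so a positive mediator is observed as zero exactly when it is below $1/L_i$. The authors' own expressions may group or label the terms differently.

```latex
% Linear predictor of the outcome model (assumed form):
\eta_i(m) = \beta_0 + \beta_1 m + \beta_2 1_{(m>0)} + \beta_3 x_i
          + \beta_4 x_i 1_{(m>0)} + \beta_5 x_i m

% Group 1: observed mediator m_{ij} > 0, so the true value is known.
\ell_i^{(1)} = \log \phi_\delta\bigl(y_i - \eta_i(m_{ij})\bigr)
             + \log(1 - \Delta_i) + \log f_B(m_{ij}; \mu_i, \xi)

% Group 2: observed zero, a mixture of a true zero and a value below 1/L_i.
\ell_i^{(2)} = \log\Bigl\{ \Delta_i\, \phi_\delta\bigl(y_i - \eta_i(0)\bigr)
             + (1 - \Delta_i) \int_0^{1/L_i}
               \phi_\delta\bigl(y_i - \eta_i(m)\bigr) f_B(m; \mu_i, \xi)\, dm \Bigr\}

% Complete log-likelihood: sum over both groups.
\ell = \sum_{i:\, m_{ij}^* > 0} \ell_i^{(1)} + \sum_{i:\, m_{ij}^* = 0} \ell_i^{(2)}
```

In group 1 the detection factor equals 1 because a positive observed value implies the sample absolute abundance is at least 1, so only the outcome and mediator densities remain.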

4. Simulation

Extensive simulations were carried out to demonstrate the performance of our approach MarZIC in comparison with two existing approaches under two settings. In setting 1, where the mediator was generated by univariate ZIB distributions, which are the univariate version of Dirichlet distributions, we compared MarZIC with a current standard practice in causal mediation analyses developed by Imai, Keele and Tingley [10] (IKT approach hereafter), which is a PO approach and can be implemented in R using the package “mediation” [33]. The marginal structural model approach [9] is also a standard PO approach with a very similar definition of indirect effect. These causal mediation analysis approaches were not developed to analyze microbiome data, and thus could have poor performance when applied to microbiome data. In setting 2, where the mediator was generated by multivariate zero-inflated Dirichlet-multinomial distributions, MarZIC was compared with IKT and CCMM [6], which was developed specifically to model microbiome composition as a mediator. In all simulation settings, the independent variable X was binary and generated using the Bernoulli distribution Ber(0.5) such that the number of subjects was balanced between the two groups. To mimic the real study data, the library size was generated by randomly sampling with replacement from the library sizes in the real study data in Section 5, which range from 31,607 to 911,652. The RA data were generated in a way that mimicked the distribution of RA in the real data. The multivariate delta method was used to derive confidence intervals in all settings.

4.1. Simulation Setting 1: Univariate ZIB Distribution

In this setting, the outcome Y was assumed to be a continuous variable and generated using Equation (5), where $\beta_5$ is set to be 0 and the other true parameter values can be found in Table 1. Similar to simulation studies in the literature [18,19] where RA were generated individually, we generated individual taxon RA with ZIB distributions (i.e., the univariate version of Dirichlet distributions) based on Equations (6) and (7). The LOD mechanism in Equation (8) for observing zero-valued data points of the mediator was used to generate false zeros for the mediator $M_j$. Two scenarios were considered for the taxon RA: low RA (Scenario 1: mean of positive RA is equal to 0.0025) and high RA (Scenario 2: mean of positive RA is equal to 0.5). We generated 100 random data sets for each scenario and the sample size was 200 for each data set. About 20% of all sequencing reads were generated as true zeros (i.e., structural zeros) in both scenarios. Under the LOD mechanism in Equation (8), about 30% of sequencing reads were false zeros in Scenario 1, and there were no false zeros in Scenario 2 because the RA in Scenario 2 was high and thus SAA were greater than 1 for all truly non-zero RA. Model performance was evaluated by estimation bias, standard error, and coverage probability (CP) of the 95% CI of the estimators for the parameters and the mediation effects in this comparison. For Scenario 1, the simulation results (Table 1) showed good performance for MarZIC in terms of bias and CP of the mediation effects and the parameter estimates. All the biases were small and the CP were around the desired level of 95%. The IKT approach, however, performed poorly, with a large bias (84.81%) and a small CP (9%). This poor performance was likely due to the false zeros not being appropriately accounted for by the IKT approach. Another disadvantage of IKT is that it cannot decompose the mediation effect into NIE1 and NIE2.
For Scenario 2 with high RA where there were no false zeros, MarZIC showed good performance again in terms of the performance measures. IKT also showed satisfactory performance for the estimation of the NIE because there were no false zeros in the data under this scenario, but IKT cannot decompose the mediation effect according to the zero-inflated distribution of mediator.
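The performance measures can be sketched as follows for one estimator across simulated data sets. The function name `summarize_estimator` and the toy inputs are hypothetical; the percentage bias is divided by the absolute true value, an assumption that matches the signs visible in Table 1 (for example, a negative bias on a negative true value is reported as a negative percentage), and the CI is taken to be the estimate plus or minus 1.96 estimated standard errors.

```python
import math


def summarize_estimator(estimates, std_errors, truth, z=1.96):
    """Summarize replicate estimates of one parameter.

    Returns bias, percentage bias (relative to |truth|), empirical SE,
    mean of the estimated SEs, and coverage (%) of the Wald-type CI.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - truth
    empirical_se = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (n - 1))
    covered = sum(
        1 for e, s in zip(estimates, std_errors) if e - z * s <= truth <= e + z * s
    )
    return {
        "bias": bias,
        "bias_pct": 100.0 * bias / abs(truth),
        "se": empirical_se,
        "mean_se": sum(std_errors) / n,
        "cp_pct": 100.0 * covered / n,
    }


# Toy replicate results (not from the paper), true value 1.0.
summary = summarize_estimator([0.9, 1.1, 1.0, 1.2], [0.1, 0.1, 0.1, 0.05], truth=1.0)
```

In the toy input the last interval, 1.2 ± 0.098, misses the true value, so three of four intervals cover it.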

Table 1

Simulation results for comparison between MarZIC and IKT with sample size of 200. Bias, percentage of the bias, the empirical standard errors, the mean of estimated standard errors and the empirical coverage probability of the 95% CI for each estimator are respectively reported under the columns Bias, Bias %, SE, Mean SE and CP (%). Mediation effects from the IKT approach are provided at the bottom part of the table.

	Low Relative Abundance (Mean = 0.0025)							High Relative Abundance (Mean = 0.5)
Parameter/Effect	True	Mean Estimate	Bias	Bias %	SE	Mean SE	CP (%)	True	Mean Estimate	Bias	Bias %	SE	Mean SE	CP (%)
MarZIC
NIE₁	0.10	0.11	0.01	10.0	0.08	0.07	91	9.30	9.11	−0.18	−1.98	2.68	2.70	96
NIE₂	0.55	0.52	−0.03	−5.67	0.55	0.56	97	0.55	0.50	−0.06	−10.15	0.62	0.56	94
NIE	0.65	0.63	−0.02	−3.31	0.58	0.58	96	9.85	9.61	−0.24	−2.44	3.25	3.20	95
β0	−2.00	−2.05	−0.05	−2.45	0.32	0.33	96	−2.00	−1.92	0.07	3.82	0.32	0.29	94
β1	100.00	101.89	1.89	1.89	18.04	19.04	97	100.00	99.96	−0.04	−0.04	1.89	1.74	91
β2	4.00	4.05	0.05	1.37	0.38	0.36	94	4.00	3.93	−0.07	−1.73	0.58	0.57	91
β3	5.00	5.08	0.08	1.53	0.53	0.51	94	5.00	4.97	−0.03	−0.62	0.46	0.46	99
β4	3.00	2.93	−0.07	−2.40	0.58	0.55	92	3.00	3.02	0.02	0.55	0.53	0.54	99
δ	1.00	0.99	−0.01	−1.00	0.07	0.07	90	1.00	0.97	−0.03	−2.99	0.07	0.07	89
α0	−6.20	−6.24	−0.04	−0.69	0.36	0.36	94	−1.00	−1.01	−0.01	−0.93	0.05	0.05	90
α1	0.40	0.42	0.02	5.52	0.33	0.29	92	0.40	0.41	0.01	1.69	0.06	0.07	95
ξ	50.00	56.42	6.42	12.83	24.21	19.35	97	50.00	53.37	3.37	6.74	8.22	8.40	96
γ0	−1.16	−1.23	−0.07	−5.75	0.35	0.36	99	−1.16	−1.20	−0.04	−3.18	0.37	0.34	95
γ1	−0.50	−0.53	−0.03	−5.10	0.55	0.55	97	−0.50	−0.47	0.03	6.91	0.58	0.53	91
IKT
NIE	0.65	0.10	−0.55	−84.81	-	-	9	9.85	9.20	−0.65	−6.62	-	-	94

4.2. Simulation Setting 2: Multivariate Zero-Inflated Dirichlet-Multinomial Distribution

The subject index i is suppressed in this subsection for simplicity. The microbiome data were generated using a zero-inflated Dirichlet-multinomial model that can account for variability from both the Dirichlet distribution and the multinomial distribution. The microbiome data generation process can be found in Appendix C. As shown in Table 2, six different scenarios were considered, of which some had the number of taxa smaller than the sample size and the others had the number of taxa larger than the sample size. We generated 100 random data sets for each scenario and the sample size was 200 for each data set. The outcome Y was generated using Equation (12), in which $M_1$ and $M_2$ denote the RA of the first taxon and the second taxon respectively, and the random error follows the standard normal distribution.

Table 2

Simulation results for the comparison of MarZIC with CCMM and IKT. Here n denotes the sample size and K+1 denotes the number of taxa.

		Recall (%)				Precision (%)				F1 (%)
K+1	n	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT
10	200	99.00	100.00	100.00	58.00	97.70	98.00	38.80	99.70	97.90	98.60	55.30	68.10
25	200	99.50	100.00	96.00	39.50	98.20	99.50	52.40	100.00	98.50	99.60	66.10	48.30
50	200	97.50	100.00	97.00	44.00	100.00	100.00	46.40	100.00	98.30	100.00	60.60	54.70
100	200	96.00	98.90	100.00	32.50	95.50	100.00	42.80	100.00	94.50	98.90	58.00	41.30
300	200	86.00	97.80	-	25.00	90.80	99.50	-	100.00	85.80	97.50	-	31.30
500	200	77.50	94.70	-	23.50	97.80	87.20	-	99.00	83.00	87.30	-	30.00

Notice that the data generation models are different from the analysis models in a few aspects: (a) the data generation model (12) involves both $M_1$ and $M_2$, which is different from the marginal model (5) where only one $M_j$ is in the model; (b) the relationships between X and $M_1$ and $M_2$ in the data generation in Appendix C are different from the data analysis model (6); (c) the zero mechanism for generating false zeros in the data generation as outlined in Appendix C is also different from the proposed mechanism in Section 2.3. Thus, to some extent, this simulation also demonstrated the robustness of our approach with respect to mis-specification of the model and the zero mechanism. Under the data generation model (12), Y has marginal associations with all taxa, but only the first two taxa marginally mediate the effect of X on Y because only their marginal mean values depend on X conditional on their presence according to the data generation in Appendix C. The indicator variable for the second taxon also has a mediation effect because it has an impact on Y as shown in Equation (12) and the probability of presence of the second taxon depends on X. In summary, NIE1 should be significant for $M_1$ and $M_2$, and NIE2 should be significant for $M_2$ in the analysis results of this simulation. This setting also mimicked the real study case where there were only two OTUs with significant NIE1. Three indices were used to evaluate the model performance: Recall, Precision and F1, which were calculated as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \quad \mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}},$$

where $TP$, $FP$, $TN$ and $FN$ denote true positive, false positive, true negative and false negative respectively. Recall is a measure of statistical power, the higher the better. Precision has an inverse relationship with false discovery rate (FDR), which is equal to (1 − Precision), and thus the higher the Precision, the lower the FDR. When $TP + FP = 0$, Precision was set to be 1. F1 is the harmonic mean [34] of Recall and Precision and measures the overall performance in terms of both.
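The three indices can be sketched directly from the counts of true positives, false positives and false negatives. The function name `recall_precision_f1` is illustrative; the sketch assumes at least one truly mediating taxon exists (so Recall is defined), sets Precision to 1 when nothing is declared significant, and returns an F1 of 0 when both Recall and Precision are 0.

```python
def recall_precision_f1(tp, fp, fn):
    """Compute Recall, Precision and F1 from detection counts.

    Assumes tp + fn > 0, i.e. there is at least one true mediator.
    """
    recall = tp / (tp + fn)
    # Convention from the text: no discoveries means Precision is set to 1.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0
    if recall + precision == 0:
        f1 = 0.0
    else:
        f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


perfect = recall_precision_f1(tp=2, fp=0, fn=0)
no_discovery = recall_precision_f1(tp=0, fp=0, fn=2)
mixed = recall_precision_f1(tp=1, fp=1, fn=1)
```

The `no_discovery` case shows why F1 is the more informative summary: Precision is 1 by convention, but Recall and therefore F1 are 0.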
In the data analysis step of the simulation, MarZIC analyzed each taxon as a mediator one by one whereas CCMM employed regularization to handle high dimensionality. Multiple testing was adjusted using the Benjamini-Hochberg procedure [35] such that the targeted FDR is 20% for all approaches in this comparison, which means that the targeted Precision should be around 80%. The simulation results (see Table 2) showed that MarZIC had a very good overall performance for identifying NIE1 and NIE2 in terms of Recall (>77.5%), Precision (>87.2%) and F1 (>87.3%). MarZIC achieved the targeted Precision of 80% across all cases. CCMM had good performance in terms of Recall, but its Precision rates (38.8–52.4%) were much lower than the targeted Precision rate (80%), which resulted in low F1 values (55.3–66.1%). This suboptimal performance is likely because: (a) CCMM was proposed to model the RA on log-scale whereas Equation (12) is on the original scale of RA; (b) CCMM was not developed to incorporate the mediation effect of the binary variable $1_{(M_j>0)}$; and (c) CCMM could not handle interactions between the independent variable and mediators such as those in model (12). In addition, CCMM could not generate any results for those scenarios with the number of taxa greater than or equal to 300 (see Table 2) due to computational issues, whereas MarZIC can handle all cases very well. This is likely because the regularization algorithm of CCMM is too computationally demanding to handle such high dimensionality. IKT had good Precision rates (>99.7%), but low Recall rates (23.5–58.0%) compared to MarZIC, and thus also low F1 values. In addition, we also considered cases with 5 taxa having significant NIE1 and one taxon having significant NIE2, and cases with 10 taxa having significant NIE1 and one taxon having significant NIE2. The simulation results (see Table 3) also showed that MarZIC outperformed the other approaches.
It had good recall rates for NIE1 (≥85.3%) and NIE2 (≥91.0%), and also achieved the target precision rate (80%) for both NIE1 and NIE2 except that it was 77.10%, slightly lower than 80%, for the case with 300 taxa of which 10 taxa had significant NIE1. Its F1 values were also good for both NIE1 (≥79.6%) and NIE2 (≥86.2%). CCMM had fair recall (66.0–89.4%), but much lower precision rate (19.0–66.2%) and therefore low F1 values (31.2–43.9%). IKT, on the other hand, achieved target precision rate for all cases (≥99.1%), but low recall rate (29.3–66.2%), and thus low F1 values (44.3–78.2%).
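The multiple-testing adjustment used across taxa can be sketched with the standard Benjamini-Hochberg step-up rule at a target FDR of 20%. This is a generic illustration of the procedure, not the code used in the study; the function name `benjamini_hochberg` and the p-values are made up.

```python
def benjamini_hochberg(p_values, fdr=0.20):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k * fdr / m and
    reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# One p-value per taxon (hypothetical values).
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5, 0.9], fdr=0.20)
# Step-up behaviour: all four are rejected because the largest p-value
# passes its own threshold, even though the smallest fails its threshold.
step_up = benjamini_hochberg([0.16, 0.17, 0.18, 0.19], fdr=0.20)
```

Because each taxon is tested as a mediator one at a time, a rule of this kind is what links the per-taxon p-values to the targeted Precision of about 80%.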

Table 3

Simulation results for the comparison of MarZIC with CCMM and IKT.

		Recall (%)				Precision (%)				F1 (%)
K+1	Number of Taxa with Non-Zero NIE₁	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT	MarZIC (NIE₁)	MarZIC (NIE₂)	CCMM	IKT
50	5	95.00	100.00	89.00	66.20	99.00	98.50	27.90	99.60	96.60	99.00	42.20	78.20
50	10	95.70	92.00	66.00	62.40	98.80	91.80	33.20	99.60	97.10	86.20	43.90	75.70
100	5	96.60	99.00	89.40	60.60	92.70	98.30	19.00	99.10	94.10	97.80	31.20	73.30
100	10	92.10	91.00	80.10	46.00	93.70	97.80	27.20	100.00	92.50	89.50	40.40	61.20
300	5	94.20	96.00	-	56.10	80.50	97.00	-	99.70	85.20	94.00	-	69.90
300	10	85.30	93.00	-	29.30	77.10	91.00	-	99.60	79.60	86.60	-	43.40
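The Benjamini-Hochberg adjustment and the Recall, Precision and F1 metrics used in these comparisons can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names, the toy p-values and the set of "true" mediators are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.20) -> List[bool]:
    """Return one reject flag per hypothesis using the BH step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that the k-th smallest p-value is <= (k / m) * fdr.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


def recall_precision_f1(selected: Set[int], truth: Set[int]) -> Tuple[float, float, float]:
    """Compare the selected taxa with the taxa that truly mediate."""
    true_positives = len(selected & truth)
    recall = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(selected) if selected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return recall, precision, f1


# Hypothetical toy data: one p-value per taxon, taxa 0, 1 and 3 truly mediate.
p_values = [0.001, 0.004, 0.030, 0.200, 0.450, 0.600, 0.700, 0.800, 0.900, 0.950]
true_mediators = {0, 1, 3}

flags = benjamini_hochberg(p_values, fdr=0.20)
selected = {i for i, flag in enumerate(flags) if flag}
recall, precision, f1 = recall_precision_f1(selected, true_mediators)
```

With a targeted FDR of 20%, the expected Precision is about 80%, which is why the tables compare the observed Precision against that value.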

Per the suggestion of a referee, we also conducted a simulation study with only 5 taxa (i.e., K = 4) in the data. The sample size was still 200 and the mean RAs of the five taxa were approximately 0.196, 0.250, 0.220, 0.146 and 0.188, respectively. There were no false zeros because the five RAs were large. The first two taxa had non-zero NIE1 and the second taxon had non-zero NIE2. The simulation results from 100 random data sets showed good performance for both NIE1 (Recall = 0.95, Precision = 0.96 and F1 = 0.94) and NIE2 (Recall = 1, Precision = 0.97 and F1 = 0.98).

5. Real Study Application

VSL#3 is a commercially available probiotic cocktail (Sigma-Tau Pharmaceuticals, Inc., Gaithersburg, MD, USA) of eight strains of lactic acid-producing bacteria: Lactobacillus plantarum, Lactobacillus delbrueckii subsp. bulgaricus, Lactobacillus paracasei, Lactobacillus acidophilus, Bifidobacterium breve, Bifidobacterium longum, Bifidobacterium infantis, and Streptococcus salivarius subsp. Orally administered VSL#3 has shown success in ameliorating symptoms and reducing inflammation in human pouchitis [36] and ulcerative colitis [37]. Preventive VSL#3 administration can also attenuate colitis in Il10−/− mice [38] and ileitis in SAMP1/YitFc mice [39]. When used as a preventative strategy, it has the potential to prevent inflammation and carcinogenesis. In a mouse model, Arthur et al. [40] studied the ability of the probiotic cocktail VSL#3 to alter the colonic microbiota and decrease inflammation-associated colorectal cancer when administered as interventional therapy after the onset of inflammation. The study duration was 24 weeks. There were 24 mice in the study, of which 10 were treated with VSL#3 and 14 served as controls. Gut microbiome data were collected from stools at the end of the study with 16S rRNA sequencing. We obtained sequence data from Arthur et al. [40] and generated open-reference OTUs using Quantitative Insights into Microbial Ecology (QIIME) [41] version 1.9.1 at the 97% similarity level with the Greengenes 97% reference dataset (release 13_8). Chimeric sequences were detected and removed using QIIME. OTUs that accounted for less than 0.005% of the total number of sequences were excluded, following Bokulich and colleagues [42]. Taxonomic assignment was done using the RDP (Ribosomal Database Project) classifier [43] through QIIME with confidence set to 50%. There were 362 OTUs in total in the data set after quality control and data cleaning, and 40% of the OTU RA data points were zero.
The relative abundance (RA) of each OTU was analyzed as a mediator variable using a ZIB distribution. The outcome variable in our analysis was the dysplasia score (the higher the worse), an ordinal categorical variable measuring the abnormality of cell growth; it was treated as a continuous variable in the analysis because of its ordinal nature and its roughly bell-shaped density curve. The treatment variable was coded as 1/0, indicating VSL#3/control. Again, the FDR approach was used to adjust for multiple testing with a targeted FDR of 20%; the 95% CIs were calculated before adjustment. The NIE1 of two OTUs was found to be statistically significant. The first OTU was assigned to the family S24-7 under order Bacteroidales and the second was assigned to class Bacilli. The estimates of NIE1 were 0.27 (95% CI: 0.1, 0.42) and −1.28 (95% CI: −2.06, −0.49), respectively. The interpretation of these mediation effects is that the treatment had a marginal positive effect of 0.27 on the dysplasia score through changing the RA of the first OTU, and a marginal negative effect of −1.28 on the dysplasia score through changing the RA of the second OTU. The family S24-7 and class Bacilli found by our approach have also been reported to be associated with colorectal cancer in the literature [44,45]. To give a full picture of the mediation effects in this data set, a heatmap based on p-values was constructed (see Figure 2) to illustrate the NIE1 of all OTUs. CCMM and IKT did not find any significant mediation effects of the OTUs.
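The abundance filter and the conversion of read counts to RA described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the toy counts are hypothetical, the filter is assumed to drop OTUs contributing less than 0.005% of all reads, and RA is assumed to be re-normalized within each sample after filtering. The actual processing was done in QIIME.

```python
from typing import Dict, List


def filter_and_normalize(
    counts: Dict[str, List[int]], min_fraction: float = 0.00005
) -> Dict[str, List[float]]:
    """counts maps an OTU id to its read counts, one entry per sample."""
    total_reads = sum(sum(reads) for reads in counts.values())
    # Keep OTUs that contribute at least min_fraction (0.005%) of all reads.
    kept = {
        otu: reads
        for otu, reads in counts.items()
        if sum(reads) / total_reads >= min_fraction
    }
    n_samples = len(next(iter(kept.values())))
    # Sequencing depth of each sample after filtering.
    depth = [sum(reads[j] for reads in kept.values()) for j in range(n_samples)]
    # Relative abundance: each sample's RAs sum to one.
    return {
        otu: [reads[j] / depth[j] if depth[j] else 0.0 for j in range(n_samples)]
        for otu, reads in kept.items()
    }


def zero_fraction(ra: Dict[str, List[float]]) -> float:
    """Share of OTU-by-sample RA data points that equal zero."""
    values = [x for row in ra.values() for x in row]
    return sum(1 for x in values if x == 0.0) / len(values)


# Hypothetical toy count table with three samples.
toy_counts = {
    "otu_a": [60000, 0, 40000],
    "otu_b": [40000, 100000, 59999],
    "otu_rare": [0, 0, 1],  # 1 of 300000 reads, below the 0.005% cutoff
}
toy_ra = filter_and_normalize(toy_counts)
toy_zero_fraction = zero_fraction(toy_ra)
```

In the toy table, one of the six retained RA data points is zero; in the real data the corresponding share was 40%.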

Figure 2

Heatmap of mediation strength based on NIE1 in VSL#3 study. The mediation strength is measured by (1-p) where p is the unadjusted p-value. Negative sign indicates negative NIE1. Taxonomic assignment is labeled on the vertical axis. Samples are labeled on the horizontal axis. Absence of an OTU in a sample is left blank in the heatmap.
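The signed mediation strength plotted in the heatmap can be computed as in the short sketch below. The function name is hypothetical, and the p-values in the usage lines are made up for illustration; only the two NIE1 estimates come from the analysis above.

```python
def mediation_strength(nie1_estimate: float, p_value: float) -> float:
    """Return (1 - p), with a negative sign when the NIE1 estimate is negative."""
    strength = 1.0 - p_value
    return -strength if nie1_estimate < 0 else strength


# NIE1 estimates from the text; the p-values are hypothetical placeholders.
strength_s24_7 = mediation_strength(0.27, 0.01)
strength_bacilli = mediation_strength(-1.28, 0.002)
```

Values close to 1 or −1 therefore correspond to small unadjusted p-values, with the sign showing the direction of NIE1.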

6. Discussion

We developed an innovative marginal mediation modeling approach under the PO framework to analyze zero-inflated compositional mediators such as the microbiome. We showed that the marginal mediation effect for zero-inflated mediators can be decomposed into two components, of which the first is due to the change in the mediator over its positive domain and the second is due to the discrete binary change from zero to a non-zero status. These two components have different interpretations and are equally important for investigating causal mechanisms. The marginal model approach can also account for the compositional structure. When the point mass at zero (i.e., ) is equal to zero for the mediator (i.e., the distribution is not zero-inflated), the model reduces to a marginal mediation model for data without zeros as described in Section 2.1. Therefore, this approach can also be used for data sets after zero-valued data points are imputed with a positive number such as a pseudocount (or after other normalization techniques are applied). R scripts for implementing the method are available upon request.
This paper considered X as a univariate variable and did not include covariates as potential confounders in the models. It is straightforward to adjust for a set of covariates using our approach. Let C denote a vector of covariates or potential confounders. Then the NIE and NDE can be calculated at a specific value, c, of C as , and . The value of c can be taken as the mean value of the covariates, similar to how a least squares mean is calculated in regression models [46]. CIs can be obtained using the delta method or resampling methods. Decomposition of the NIE follows the same procedure as shown in Section 2.4.
Misspecification of the mechanisms for observing zero-valued data points could have an impact on the model performance. This is similar to missing data problems in which partial information is available on the missing data. It can be considered as missing not at random (MNAR) [47] because the probability of a data point being observed as zero depends on its true value. Besides the LOD mechanism in Equation (8), another possible mechanism could be where . Model selection criteria such as BIC or AIC can be used to choose the optimal mechanism among different mechanisms. Although these mechanisms may not account for MNAR perfectly, they can, to a large extent, alleviate the problem of not accounting for false zeros in the data at all. A future project has been planned to study the robustness of our model with respect to the mechanism for observing zeros using sensitivity analysis techniques.
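Choosing among candidate mechanisms for observing zeros with AIC or BIC can be sketched as follows. This is a hedged illustration only: the mechanism names, maximized log-likelihoods and parameter counts are hypothetical and are not results from the paper. The sample size of 200 simply mirrors the simulation setting.

```python
import math
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: k log(n) - 2 log L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: mechanism name -> (maximized log-likelihood, number of parameters).
candidates: Dict[str, Tuple[float, int]] = {
    "lod": (-512.4, 6),
    "alternative": (-510.9, 8),
}
n_obs = 200

# The mechanism with the smallest criterion value is preferred.
best_by_aic = min(candidates, key=lambda name: aic(*candidates[name]))
best_by_bic = min(candidates, key=lambda name: bic(*candidates[name], n_obs))
```

In this toy comparison both criteria prefer the simpler mechanism, because the small gain in log-likelihood does not offset the two extra parameters.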

31 in total

1. Oral bacteriotherapy as maintenance treatment in patients with chronic pouchitis: a double-blind, placebo-controlled trial.

Authors: P Gionchetti; F Rizzello; A Venturi; P Brigidi; D Matteuzzi; G Bazzocchi; G Poggioli; M Miglioli; M Campieri
Journal: Gastroenterology Date: 2000-08 Impact factor: 22.682

2. Probiotics promote gut health through stimulation of epithelial innate immunity.

Authors: Cristiano Pagnini; Rubina Saeed; Giorgos Bamias; Kristen O Arseneau; Theresa T Pizarro; Fabio Cominelli
Journal: Proc Natl Acad Sci U S A Date: 2009-12-14 Impact factor: 11.205

3. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis.

Authors: Zheng-Zheng Tang; Guanhua Chen
Journal: Biostatistics Date: 2019-10-01 Impact factor: 5.899

4. Commensal Microbiota Promote Lung Cancer Development via γδ T Cells.

Authors: Chengcheng Jin; Georgia K Lagoudas; Chen Zhao; Susan Bullman; Arjun Bhutkar; Bo Hu; Samuel Ameh; Demi Sandel; Xu Sue Liang; Sarah Mazzilli; Mark T Whary; Matthew Meyerson; Ronald Germain; Paul C Blainey; James G Fox; Tyler Jacks
Journal: Cell Date: 2019-01-31 Impact factor: 41.582

Review 5. Role of the microbiota in immunity and inflammation.

Authors: Yasmine Belkaid; Timothy W Hand
Journal: Cell Date: 2014-03-27 Impact factor: 41.582

6. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data.

Authors: Eric Z Chen; Hongzhe Li
Journal: Bioinformatics Date: 2016-05-14 Impact factor: 6.937

7. MODELING MICROBIAL ABUNDANCES AND DYSBIOSIS WITH BETA-BINOMIAL REGRESSION.

Authors: Bryan D Martin; Daniela Witten; Amy D Willis
Journal: Ann Appl Stat Date: 2020-04-16 Impact factor: 2.083

Review 8. What Does It "Mean"? A Review of Interpreting and Calculating Different Types of Means and Standard Deviations.

Authors: Marilyn N Martinez; Mary J Bartholomew
Journal: Pharmaceutics Date: 2017-04-13 Impact factor: 6.321

9. Sodium oligomannate therapeutically remodels gut microbiota and suppresses gut bacterial amino acids-shaped neuroinflammation to inhibit Alzheimer's disease progression.

Authors: Xinyi Wang; Guangqiang Sun; Teng Feng; Jing Zhang; Xun Huang; Tao Wang; Zuoquan Xie; Xingkun Chu; Jun Yang; Huan Wang; Shuaishuai Chang; Yanxue Gong; Lingfei Ruan; Guanqun Zhang; Siyuan Yan; Wen Lian; Chen Du; Dabing Yang; Qingli Zhang; Feifei Lin; Jia Liu; Haiyan Zhang; Changrong Ge; Shifu Xiao; Jian Ding; Meiyu Geng
Journal: Cell Res Date: 2019-09-06 Impact factor: 25.617

10. VSL#3 probiotic modifies mucosal microbial composition but does not reduce colitis-associated colorectal cancer.

Authors: Janelle C Arthur; Raad Z Gharaibeh; Joshua M Uronis; Ernesto Perez-Chanona; Wei Sha; Sarah Tomkovich; Marcus Mühlbauer; Anthony A Fodor; Christian Jobin
Journal: Sci Rep Date: 2013-10-08 Impact factor: 4.379