Literature DB >> 33166356

Zero-Inflated gaussian mixed models for analyzing longitudinal microbiome data.

Abstract

MOTIVATION: The human microbiome is variable and dynamic in nature. Longitudinal studies could explain the mechanisms in maintaining the microbiome in health or causing dysbiosis in disease. However, it remains challenging to properly analyze the longitudinal microbiome data from either 16S rRNA or metagenome shotgun sequencing studies, output as proportions or counts. Most microbiome data are sparse, requiring statistical models to handle zero-inflation. Moreover, longitudinal design induces correlation among the samples and thus further complicates the analysis and interpretation of the microbiome data.
RESULTS: In this article, we propose zero-inflated Gaussian mixed models (ZIGMMs) to analyze longitudinal microbiome data. ZIGMMs is a robust and flexible method which can be applicable for longitudinal microbiome proportion data or count data generated with either 16S rRNA or shotgun sequencing technologies. It can include various types of fixed effects and random effects and account for various within-subject correlation structures, and can effectively handle zero-inflation. We developed an efficient Expectation-Maximization (EM) algorithm to fit the ZIGMMs by taking advantage of the standard procedure for fitting linear mixed models. We demonstrate the computational efficiency of our EM algorithm by comparing with two other zero-inflated methods. We show that ZIGMMs outperform the previously used linear mixed models (LMMs), negative binomial mixed models (NBMMs) and zero-inflated Beta regression mixed model (ZIBR) in detecting associated effects in longitudinal microbiome data through extensive simulations. We also apply our method to two public longitudinal microbiome datasets and compare with LMMs and NBMMs in detecting dynamic effects of associated taxa.

Entities: Chemical Disease Gene Species

Year: 2020 PMID： 33166356 PMCID： PMC7652264 DOI： 10.1371/journal.pone.0242073

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

1. Introduction

Since birth, the human body becomes host to millions of microbiota that influence health across whole lives and potentially over generations [1]. The combination of microbiota and the associated genomes (metagenome) interact with the host environment to form the human microbiome [2]. Recent studies have investigated static associations between the human microbiome and many human diseases such as obesity, diabetes, inflammatory bowel disease, irritable bowel syndrome, vaginosis and even cancers [2-7]. However, the microbes could interact with the host and the environment over time [8]. Thus the human microbiome is variable and dynamic in nature, and the infant microbiome could possibly have subsequent implications in future health through the human host’s early life and even adulthood [9]. Longitudinal studies could explain the mechanisms in maintaining the microbiome in health or causing dysbiosis in disease [10]. Recent microbiome studies have employed the longitudinal study design to investigate the dynamic changes of microbial abundance over time and the associations between the microbiome and host environmental/clinical factors [11-15]. As a result of the research interests and the development of high-throughput metagenomics, a large amount of longitudinal 16S rRNA data or metagenome shotgun sequencing data has been generated [16]. It is known that 16S rRNA data or metagenome shotgun sequencing data are both processed and output as number of fragments or reads (in terms of raw or relative abundance) in operational taxonomic units (OTUs) or functional units with various bioinformatics pipelines, such as QIIME and mothur for 16S rRNA data and MetaPhlAn, PhyloSift, and Kraken for shotgun libraries [16]. Although some of the pipelines output the microbiome data in raw counts, others, such as MetaPhlAn, output the relative abundance from shotgun data in proportions. However, it remains challenging to properly analyze and interpret the longitudinal microbiome data, especially in terms of proportion. Due to both biological and technical reasons, microbiome sequencing data is sparse [17]. Moreover, longitudinal microbiome data possesses special features, for example, time-dependent effects and correlations among the samples within the subjects, for which tailored statistical methods are required [10]. La Rosa, Warner [12], as several previous studies, used linear mixed models (LMMs) to account for correlations in longitudinal microbiome studies [12,18-21]. However, using LMMs is not capable to correct for excess zeros in microbiome data. Recently, we have developed negative binomial mixed models (NBMMs) for analyzing longitudinal microbiome count data, but have not explicitly modeled zero-inflation [22,23]. Romero, Hassan [24] used zero-inflated negative binomial mixed-effects models to analyze longitudinal count data. Neither NBMMs nor the zero-inflated negative binomial mixed-effects models is applicable in analyzing longitudinal microbiome proportion data. Alternatively, Chen and Li [25] proposed a zero-inflated Beta regression model with random effects (ZIBR) for analyzing longitudinal microbiome proportions. However, according to the manual of R package ZIBR [26], ZIBR cannot handle missing data, which means each subject must have the same number of time points. Moreover, these two zero-inflated methods have not been developed to account for within-subject correlations and may be computationally sub-optimal for analyzing many OTUs. Thus, statistical models are needed to account for sample correlations over time as well as zero-inflation and other properties of microbiome data [25,27,28]. We here propose zero-inflated Gaussian mixed models (ZIGMMs) and an efficient algorithm to address the previous limitations. Our method is robust and flexible and can analyze longitudinal microbiome proportion data and count data generated with either 16S rRNA or shotgun sequencing technologies. The proposed model can effectively deal with zero-inflation and can include various types of fixed and random effects and within-subject correlation structures. We develop an efficient Expectation-Maximization (EM) algorithm to fit the ZIGMMs by taking advantage of the standard procedure for fitting LMMs. We show computational efficiency of ZIGMMs compared with the other two zero-inflated methods, ZIBR and zero-inflated negative binomial mixed models implemented in the R package glmmTMB. Extensive simulations demonstrate that our ZIGMMs outperform the various previously used methods in detecting associated effects in longitudinal microbiome data. We also apply our method to a shotgun longitudinal microbiome proportion data and a 16S rRNA microbiome count data in detecting dynamic effects of associated taxa. We have implemented the ZIGMMs in the R package NBZIMM, which is freely available from the public GitHub repository http://github.com//nyiuab//NBZIMM.

2. Methods

2.1 Zero-Inflated Gaussian Mixed Models (ZIGMMs)

In a longitudinal microbiome study, we collect n subjects and measure each subject at multiple time points t, j = 1, ···, n; i = 1, ···, n. For the j-th sample of the i-th subject, we denote c the observed count for the h-th taxon at certain taxonomic levels (OTU, e.g. species, genus, classes, etc.). As many previous methods, we analyze one taxon at a time. We first illustrate our model in analyzing the longitudinal microbiome proportion data. We transform the proportions of relative abundance with , where T denotes the total sequence read. For notational simplification, we denote for any given taxon h. For taxa with excessive zeros, it can be assumed that transformed values y may come from either a degenerate distribution having the point mass at zero (zero state) or a Gaussian (i.e., normal) distribution [17]. Thus, the transformed values y can be modeled with the zero-inflated Gaussian distribution: where μ and σ are the mean and standard deviation parameters in normal distribution, respectively, and p is the unknown probability that y is from the zero state. The means μ are expressed as: where X is the vector of covariates for the j-th sample of the i-th subject; β is the vector of fixed effects (i.e. population-level effects), representing the average effects of the covariates over the subjects; b is the vector of subject-specific effects, or called random effects, and G is the vector of group-level covariates, which is a subset of the population-level covariates X. For longitudinal studies, X could be (1, X), (1, X, t), or , where is the variable of interest in X, for example, an indicator variable for the case group and the control group. G could be 1, i.e. only including the subject-specific intercept, or (1, t), i.e. including the subject-specific intercept and time effect. The random effects are assumed to follow a multivariate normal distribution: where Ψ is the variance-covariance matrix which can be defined as a general positive-definite matrix accounting for the correlation among the random covariates. In most applications we restrict Ψ to be a diagonal matrix for simplicity. The zero-inflation probabilities p are assumed to relate some covariates through the logit link function: where Z includes some covariates that are potentially associated with the zero state. The simplest zero-inflation model includes only the intercept in Z, resulting in the same probability of belonging to the zero state for all zeros. We can also add the random-effect terms into the above model: where the random effects a are assumed to follow a multivariate normal distribution: As an alternative, for longitudinal microbiome count data, we transform the observed count data with y = log2(c+1), which equals zero if c = 0. We assume the y can be modeled with the zero-inflated Gaussian distribution, with the means μ being expressed as:

2.2 The EM algorithm for fitting the ZIGMMs

We propose an EM algorithm to fit the ZIGMMs. We introduce latent indicator variables to distinguish the zero state and the Gaussian state, where ξ = 1 when y is from the zero state and ξ = 0 when y is from the normal distribution. The log-likelihood with the complete data (y, ξ) is given by: where Φ represents all the parameters (including random effects) in the ZIGMMs. The EM algorithm replaces the indicator variables ξ by their conditional expectations (E-step), and then updates the parameters by maximizing (M-step). The conditional expectation of ξ can be calculated as: If y≠0, we have p(y|μ,σ2,ξ = 1) = 0, and thus . If y = 0, we have The parameters in the Gaussian distribution can be updated by fitting a weighted linear mixed model with (1 - ) as weights: If the zero-inflation part does not include the random-effect term, the parameters can be updated by running a binomial logistic regression with as response: Otherwise, we can fit the binomial logistic mixed model: The EM algorithm starts from plausible values for the parameters and then updates the parameters as described above until convergence. We use the criterion to assess convergence, where , , and ε is a small value (say 10−5). At convergence, we obtain the maximum likelihood estimates of the Gaussian-state fixed effects and the associated standard deviations from the final weighted LMM. We then can test H0: β = 0 according to the LMM framework. We also obtain the estimates of the zero-state fixed effects and the associated standard deviations from the final binomial logistic (or mixed) model. Thus, we can test H0: α = 0 following the GLM or GLMM framework.

2.3 Accounting for within-subject correlations

The weighted linear mixed model (9) restricts the within-subject errors to be independent. We can relax the assumption of independent within-subject errors to account for special within-subject correlation structures: where R is a correlation matrix. Pinheiro and Bates [29] described several ways to specify the correlation matrix R, for example, autoregressive of order 1, AR(1), or continuous-time AR(1), all of which can be incorporated into our ZIGMMs.

2.4 Software implementation

The proposed method has been implemented in the function lme.zig, which is part of the R package NBZIMM. In the E-step of the EM algorithm, the conditional expectation of ξ can be calculated as in Eq (9). In the M-step, the parameters in the Gaussian distribution can be updated by repeated calls to the function lme in the R package nlme to fit the weighted linear mixed model with (1 - ) as weights. The other parameters can be updated by repeated calls to the functions glm or glmPQL in the package MASS to fit the binomial logistic or mixed logistic model. The function lme is the recommended tool for analyzing linear mixed models. The function lme.zig incorporates the nice features of lme, such as dealing with any types of random effects and within-subject correlation structures. Thus, it provides an efficient and flexible tool for analyzing zero-inflated longitudinal microbiome data. The package NBZIMM is freely available from the public GitHub repository http://github.com//nyiuab//NBZIMM.

3. Results

3.1 Simulation studies

3.1.1 Assess the ZIGMMs in analyzing microbiome proportion data. 3.1.1.1 Simulation design

To evaluate the proposed ZIGMMs, we performed extensive simulations. We first evaluated the ZIGMMs in analyzing microbiome proportion data. We compared ZIGMMs with ZIBR proposed by Chen and Li [25]. We used the function simulate_zero_inflated_beta_random_effect_data in the R package ZIBR [25] to simulate longitudinal microbiome proportion data from zero-inflated beta distribution: with the link functions logit(p) = Zα+Ga and logit(u) = Xβ+Gb. We employed a case-control longitudinal design with the following settings: 5 time points for each subject, fixed effects in both parts, random intercepts in both parts (i.e. G = 1)). We also considered three numbers of subjects: n = 50, 100 and 150, half of which were designated to be cases. We set the regression coefficients as α = (α0, α1) = (-0.5, 0), β = (β0, β1) = (-0.5, 0) to test for false positive rate; while α = (α0, α1) = (-0.5, 0.3), β = (β0, β1) = (-0.5, 0.3) to test for power at a low effect setting and α = (α0, α1) = (-0.5, 0.5), β = (β0, β1) = (-0.5, 0.5) to test for power at a high effect setting. The variance of the random effects to control a and b were set to be 1. The dispersion parameter ϕ was set to be 5. Each simulation was repeated 10000 times. We tested for the hypothesis of β1 = 0. Empirical power and false positive rate were summarized at the significance level of 0.05. We compared zero-inflated Beta regression mixed model, denoted by ZIBR, and the proposed ZIGMMs with the arcsine square root transformation for proportion data, , denoted by ZIGMMs(arcsine), the transformed data was standardized by its standard deviation before model fitting. 3.1.1.2 Simulation results. Table 1 shows the comparison of empirical power and false positive rates between ZIGMMs and ZIBR in analyzing the longitudinal microbiome proportion data. ZIGMMs and ZIBR controlled the false positive rates similarly close to the significance level under all three different sample sizes. Although the proportion data were simulated under the zero-inflated beta distribution, ZIGMMs lead to a higher empirical power to detect the group effect than ZIBR.

Table 1

False positive rate and power for testing H0: β1 = 0 based on ZIGMMs and ZIBR for significance level at 0.05 for various sample sizes.

	False Positive Rate		Power (Low Effect Setting)		Power (High Effect Setting)
Sample Size	ZIGMMs (arcsine)^†	ZIBR^‡	ZIGMMs (arcsine)	ZIBR	ZIGMMs (arcsine)	ZIBR
n = 50	0.0681	0.0577	0.1937	0.1438	0.4100	0.3022
n = 100	0.0554	0.0578	0.3025	0.2218	0.6592	0.5135
n = 150	0.0563	0.0533	0.4308	0.3031	0.8296	0.6906

ZIBR‡: Zero-inflated beta mixed model.

ZIGMMs(arcsine)†: Zero-inflated Gaussian mixed models with arcsine transformation.

ZIBR‡: Zero-inflated beta mixed model. ZIGMMs(arcsine)†: Zero-inflated Gaussian mixed models with arcsine transformation.

3.1.2 Assess the ZIGMMs in analyzing microbiome count data. 3.1.2.1 Simulation design

We then assessed the ZIGMMs in analyzing microbiome count data. We employed the function sim in NBZIMM to simulate zero-inflated longitudinal microbiome count data c as follows. We used the latent-data formulation of the logistic regression to simulate zero-state indicators; the logistic model, p(ξ = 1) = logit−1(μ+Zα+Ga), is approximately equivalent to the model, u~N(Zα+Ga, 1.62), u>h⇔ξ = 1 [30], where h is a constant determined by the preset overall zero-inflation proportion p. Thus, we first simulated latent normal variables u and then set samples with the 100p% largest u as from zero state. This method can easily control the overall zero-inflation proportion and also allow for the sample-specific zero-inflation probabilities p. For the samples from nonzero state, we simulated counts c from the negative binomial distribution NB(c|μ,θ), where μ = log(T)+Xβ+Gb. We adopted a longitudinal design and utilized four different simulation settings. In all the settings, we generated subjects from two groups (i.e. case or control) and simulated samples at multiple time points for each subject. We considered three numbers of subjects: n = 50, 100 and 150, half of which were designated to be cases. Each subject was measured at 5 time points. The random effects, and within-subject correlation structures were set as follows: Setting A: a group variable (β1) is included as fixed effect in the count part, no fixed effect in the zero-inflation part (i.e. Z = 1), random intercepts in both count and zero-inflation parts (i.e. G = 1)), and no within-subject correlation; Setting B: a group variable is included as fixed effects in both parts, random intercept in the count part only, and no within-subject correlation; Setting C: a group variable is included as fixed effects in both parts, random intercepts in both parts (i.e. G = 1)), and no within-subject correlation; Setting D: a group variable is included as fixed effect in the count part, no fixed effect in the zero-inflation part, random intercept in the count part only, and the within-subject correlation was autoregressive of order 1, AR(1), in the count part; Setting E: a group variable (β1), a time variable (β2), and a time by main effect interaction term (β3) are included as fixed effects in both parts, random intercept in the count part only, and no within-subject correlation; We randomly generated the parameters in the models from reasonable ranges. The parameters to simulate the counts from negative binomial distribution were set by following the work of [31]. This can largely reduce the combinations of parameter values and minimize possible bias from setting inappropriate values for parameters. The ranges were described as follows: To simulate counts similar to real microbiome data, we controlled the means of simulated counts through log(T) + β0, where β0 is the fixed intercept. We set β0 = -7 and randomly sampled log(T) from the range [7.1, 10.5]; For settings A-D, the dispersion parameter θ was uniformly sampled from the range [0.1, 5], which yielded highly or moderate over-dispersed counts; for setting E, the dispersion parameter θ was set to be 5. To evaluate false positive rates, the fixed effects β1 was set to be zero. To evaluate empirical powers, we considered two scenarios: a) low effect scenario: β1 was sampled from [0.2, 0.3]; b) high effect scenario: β1 was sampled from [0.3, 0.4]; fixed effects in the zero-inflation part were considered in setting B and C, where α1 was set to be the same as β1; for setting E, β1 was set to be equal to β3. And β2 was set to be 0 in all scenarios. The random effects b and a were generated from N(0, τ2), for settings A-D, where τ was randomly drawn from the range [0.5, 1]; for setting E, τ was set to be 0.5. For settings A-D, the overall zero-inflation proportion was set to be chosen from three levels, that is [0, 0.2], [0.2, 0.4] and [0.4, 0.6]; for setting E, the proportion was set to be chosen from [0, 0.5]. The correlation coefficient ρ and the standard deviation σ for AR(1) correlation were both sampled from [0.1, 0.5], and the AR(1) correlation was generated by the function arima.sim from R package stats; The ranges of all the parameters used in the simulation are summarized in Table 2.

Table 2

Parameter ranges in simulation studies.

Parameter	Range
log(T_ij) + β₀	Unif(0.1, 3.5)
dispersion parameter θ	Unif(0.1, 5)
Fixed effects β₁ (false positive rate)	0
Fixed effects β₁ (power)	Unif(0.2, 0.3)
Fixed effects β₁ (power)	Unif(0.3, 0.4)
Fixed effect α₁ (Setting B and C only)	Unif(0.2, 0.3)
Fixed effect α₁ (Setting B and C only)	Unif(0.3, 0.4)
standard deviation τ	Unif(0.5, 1)
correlation ρ	Unif(0.1, 0.5)
standard deviation σ	Unif(0.1, 0.5)
Overall zero-inflation proportion	Unif(0.0, 0.2)
	Unif(0.2, 0.4)
	Unif(0.4, 0.6)

We repeated the procedure 10000 times for each combination of the parameters. The hypothesis of interest is the fixed effect H0: β1 = 0. Empirical power and false positive rate for testing the hypothesis were calculated at the significance level of 0.05. We compared the proposed ZIGMMs, denoted by ZIGMMs(log), with a previously developed negative binomial mixed model, denoted by NBMMs, and the linear mixed model with the arcsine square root transformed response, , denoted by LMMs. 3.1.2.2 Simulation results. Fig 1 showed empirical power to detect the group effect for settings A, B, C and D at the low effect scenario. It can be clearly seen that the proposed method performed consistently better than NBMMs and LMMs in all the scenarios. Under setting B and C, we simulated fixed effects in the zero-inflation part. ZIGMMs performed extremely remarkable than NBMMs and LMMs in those two settings, inferring ignoring the association between zero-inflation and any covariate could lead to a significant decrease in power. The power was largely affected by the sample size and the zero-inflation probability. The difference in power among ZIGMMs and NBMMs and LMMs increased significantly as the zero-inflation probability increased. With the zero-inflation proportion less than 20%, ZIGMMs performed similarly as NBMMs but still better than LMMs. ZIGMMs had a more noteworthy higher power than NBMMs and LMMs to detect the fixed effect especially when the data was highly zero-inflated. We also summarized the empirical power to detect the binary group effect for the settings A, B, C and D with the high effect scenario in S1 Fig. In the high effect scenario, ZIGMMs outperformed NBMMs and LMMs more significantly when the zero-inflation probability was higher and the sample size was smaller. Fig 2 displays false positive rates for detecting the group effect. For all the four settings, ZIGMMs controlled the false positive rates close to the significance level under all the combinations of parameters. As expected, the increase in sample size n led to the decrease in false positive rates in ZIGMMs.

Fig 1

Empirical powers in four simulation settings under low effect scenario.

Fig 2

False positive rates in all four simulation settings.

Table 3 summarized empirical power and false positive rates for setting E comparing LMMs, NBMMs and ZIGMMs. In this setting, we included group variable, time variable and a time by group interaction term in the simulation and reported empirical power and false positive rates for group variable and time by group interaction term. ZIGMMs had a higher power than LMMs and NBMMs for both group effect and interaction term under various sample sizes however ZIGMMs had inflated the false positive rates compared to LMMs and NBMMs especially for the interaction term.

Table 3

False positive rate and power for testing H0: β1 = 0 and H0: β3 = 0 from setting E for significance level at 0.05 for various sample sizes.

	False Positive Rate
	Test of β₁			Test of β₃
Sample Size	LMMs^§	NBMMs^¶	ZIGMMs(log)^!	LMMs^§	NBMMs^¶	ZIGMMs(log)^!
n = 50	0.045	0.053	0.065	0.045	0.064	0.084
n = 100	0.050	0.061	0.067	0.054	0.072	0.082
n = 150	0.047	0.061	0.071	0.050	0.068	0.082
	Power (Low Effect Setting)
	Test of β₁			Test of β₃
Sample Size	LMMs^§	NBMMs^¶	ZIGMMs(log)^!	LMMs^§	NBMMs^¶	ZIGMMs(log)^!
n = 50	0.082	0.158	0.187	0.172	0.251	0.334
n = 100	0.148	0.265	0.325	0.295	0.425	0.563
n = 150	0.204	0.360	0.439	0.405	0.562	0.720
	Power (High Effect Setting)
	Test of β₁			Test of β₃
Sample Size	LMMs^§	NBMMs^¶	ZIGMMs(log)^!	LMMs^§	NBMMs^¶	ZIGMMs(log)^!
n = 50	0.121	0.252	0.304	0.303	0.418	0.558
n = 100	0.224	0.439	0.522	0.507	0.654	0.815
n = 150	0.340	0.602	0.699	0.628	0.769	0.920

LMMs§: Linear mixed models.

NBMMs¶: Negative Binomial mixed models.

ZIGMMs(log)!: Zero-inflated Gaussian mixed models with log transformation.

LMMs§: Linear mixed models. NBMMs¶: Negative Binomial mixed models. ZIGMMs(log)!: Zero-inflated Gaussian mixed models with log transformation.

3.1.3 Assess the computational efficiency of ZIGMMs

To evaluate the computational efficiency of ZIGMMs, we recorded the computation time for ZIGMMs and two other zero-inflated methods in one simulation when sample size is set to be 100. First, we compared ZIGMMs and ZIBR in analyzing the longitudinal microbiome proportion data. We found that the computation time for ZIGMMs and ZIBR in one simulation was 0.011 and 0.023 minutes, respectively. Besides, we compared ZIGMMs and a zero-inflated negative binomial mixed model which was implemented in the R package glmmTMB in analyzing the longitudinal microbiome count data, and found that the computation time for ZIGMMs and the zero-inflated negative binomial mixed model in one simulation was 0.009 and 0.041 minutes, respectively. ZIGMMs remarkably outperformed in computational efficiency than the other two zero-inflated methods.

3.2 Application to 16S rRNA and shotgun sequencing microbiome data

In our real data analysis, there are two major purposes, one is to evaluate the performances of ZIGMMs in analyzing 16S rRNA data in raw counts, the other is to evaluate the performances of ZIGMMs in analyzing shotgun sequencing data in proportions. So that, we applied our ZIGMMs in two publicly available datasets from Romero, Hassan [24] and Vincent, Miller [32]. Romero, Hassan [24] employed a retrospective case-control longitudinal study to investigate the difference of composition and stability of vaginal microbiota between pregnant and non-pregnant women. They conducted a 16S rRNA gene sequence-based survey among 22 normal pregnant women who delivered at term (38–40 weeks) and 32 non-pregnant women. Vaginal fluid samples were collected every two to four weeks apart for the pregnant group and twice per week for 16 weeks in the non-pregnant group. We analyzed the 16S rRNA sequencing data from Romero, Hassan [24] in terms of counts to evaluate the performances of ZIGMMs(log). Vincent, Miller [32] used metagenome shotgun sequencing to examine the diversity and composition of the fecal microbiota from 98 hospitalized patients. The prospective cohort study was carried out among 8 patients who were either Clostridium difficile infected or colonized and other 90 patients. Clinical data included gender, age, and days from first collection of the fecal samples. The clinical data and shotgun sequencing microbiome relative abundance data were downloaded by R package curatedMetagenomicData [33]. The shotgun sequencing data is normally output as proportion data. So, here, we illustrated our ZIGMMs(arcsine) to analyze this shogun sequencing microbiome data from Vincent, Miller [32] in proportions. According to the manual of R package ZIBR [26], ZIBR cannot handle missing data. Therefore, we could not compare with ZIBR in our real data example. We used the following eight different models to compare the performances of LMMs, NBMMs, and ZIGMMs in detecting the dynamic association between host factor and microbiota composition. Models A-D were used in all LMMs, NBMMs and ZIGMMs while models E-G were only used in ZIGMMs: Model A: host factor and time as fixed effects in Gaussian part, random intercept in Gaussian part; Model B: host factor, time, host factor and time interaction term as fixed effects in Gaussian part, random intercept in Gaussian part; Model C: host factor, time, host factor and time interaction term as fixed effects in Gaussian part, random intercept and the within-subject correlation was autoregressive of order 1, AR(1) in Gaussian part; Model D: host factor, time, host factor and time interaction term as fixed effects in Gaussian part, two random effects (i.e., random intercept and time effect) in Gaussian part; Model E: host factor and time as fixed effects only in both zero-inflation part and Gaussian part, random intercept in Gaussian part; Model F: host factor, time, host factor and time interaction term as fixed effects in both zero-inflation part and Gaussian part, random intercept in Gaussian part; Model G: host factor, time, host factor and time interaction term as fixed effects in both zero inflation part and Gaussian part, random intercept and the within-subject correlation was autoregressive of order 1, AR(1) in Gaussian part; Model H: host factor, time, host factor and time interaction term as fixed effects in both zero-inflation part and Gaussian part, two random effects (i.e., random intercept and time effect) in Gaussian part; The real data and the R code for our analysis are available from the GitHub page: https://abbyyan3.github.io//NBZIMM-tutorial/ZIGMMs-longitudinal.html.

3.2.1 App lication in 16S rRNA longitudinal pregnancy data

We first applied our ZIGMMs to the data of Romero, Hassan [24]. We explored the abilities of ZIGMMs in detecting the dynamic associations between vaginal bacteria taxa composition and two groups (pregnancy vs non-pregnancy) controlled by possible confounding effects of the covariates. We analyzed 16S rRNA sequencing microbiome count data with log transformation (ZIGMMs(log)). In all the eight models, the binary case-control indicator for pregnancy vs non-pregnancy was the host factor of interest (β), and the collection time (GA_days) was the time variable. An interaction term between host factor and time variable (β) was included in model B, C, D, F, G and H. We also included age and race as confounding covariates. The sample size was 897 in the final analysis. We included 59 taxa which has a proportion of zeros greater than 0.3 but smaller than 0.9 in our analysis. Table 4 shows the proportions of significant taxa detected by LMMs, NBMMs and ZIGMMs(log) at the alpha level at 0.05, respectively. The significance of the taxa was evaluated at the alpha level of 0.05 (p-value <0.05) for Models A-H. Test of β in Table 4 summarized the proportions of taxa which is significantly differentiated presented between pregnancy group vs non-pregnancy group. Test of β in Table 4 summarized the proportions of taxa which is significantly differentiated presented between pregnancy group vs non-pregnancy group over the collection time. The proportions of detected significant taxa in model B, C, D, F, G and H were substantially less than the rates from models A and E. It inferred that the majority of taxa existing in the vaginal microbiome did not possess a time-dependent association between the pregnant and non-pregnant groups. Moreover, it showed that ZIGMMs(log) detected more associated taxa than NBMMs and LMMs. We also found ZIGMMs with fixed effects in zero-inflation and Gaussian part in models E-H decrease slightly in the number of significant taxa detected than ZIGMMs with fixed effects in Gaussian part from models A-D. It implied that those taxa did not possess a strong association between the host factors and the zero-inflation.

Table 4

Proportions of significant taxa detected in four models with LMMs, NBMMs and ZIGMMs.

	Model A	Model B		Model C		Model D
	Test of β₁	Test of β₁	Test of β₃	Test of β₁	Test of β₃	Test of β₁	Test of β₃
LMMs^§	0.29	0.03	0.15	0.03	0.12	0.07	0.10
NBMMs^¶	0.49	0.12	0.25	0.12	0.25	0.12	0.25
ZIGMMs(log)^!	0.63	0.34	0.24	0.39	0.27	0.36	0.24
	Model E	Model F		Model G		Model H
	Test of β₁	Test of β₁	Test of β₃	Test of β₁	Test of β₃	Test of β₁	Test of β₃
ZIGMMs(log)	0.54	0.19	0.31	0.20	0.24	0.20	0.20

LMMs§: Linear mixed models.

NBMMs¶: Negative Binomial mixed models.

ZIGMMs(log)!: Zero-inflated Gaussian mixed models with log transformation.

LMMs§: Linear mixed models. NBMMs¶: Negative Binomial mixed models. ZIGMMs(log)!: Zero-inflated Gaussian mixed models with log transformation. To compare the differences in detecting significant taxa for both host factor and interaction term between LMMs, NBMMs, and ZIGMMs(log), we presented model C in Fig 3 and S2 Fig. Fig 3 shows significant taxa in model C at the 5% significance threshold and minus log transformed p-values for LMMs, NBMMs, and ZIGMMs(log). S2 Fig presents three heatmaps of p-values between the taxa and each variable from model C using LMMs, NBMMs, and ZIGMMs(log). We found that ZIGMMs(log) discovered more taxa than NBMMs and LMMs consistently, and yielded smaller p-values. In model C, we were interested in both the host factor and the interaction effect between time and host factor. ZIGMMs(log) identified not only the same taxa which were detected by LMMs and NBMMs but also more taxa for both effects. For the host factor, several taxa were only identified with ZIGMMs(log), including Clostridiales, Streptococcus, Proteobacteria, BVAB1 and Lactobacillales. For the interaction effect between time and host factor, Prevotella genogroup 3, Gemella, Lactobacillus gasseri, Megasphaera sp type 1 and Firmicutes were identified both by NBMMs and ZIGMMs(log). BVAB1, and Sneathia Sanguinegens were only identified by ZIGMMs(log). Among them, bacterial vaginosis associated bacteria 1 (BVAB1) has been previously reported as a highly specific novel bacteria for bacterial vaginosis in the Clostridiales order [34]. Also, the abundance of Gemella, BVAB1, and Sneathia sanguinegens have been reported to change within the duration of pregnancy from another study by Romero, Hassan [35].

Fig 3

The analyses of ZIGMMs(log), NBMMs and LMMs: minus log transformed p-values for the significant differentially abundant taxa at the 5% significance threshold between pregnancy and non-pregnancy groups for host factor effect (left panel) and interaction effect (right panel) from Model C.

3.2.2 Application in shotgun sequencing longitudinal intestinal microbiome data

We then applied our ZIGMMs to the shotgun sequencing microbiome proportion data from Vincent, Miller [32]. In this case, we only compared our ZIGMMs with LMMs. We explored the abilities of ZIGMMs in detecting the dynamic associations between fecal microbiome composition and Clostridium difficile colonization or infection. We adapted ZIGMMs in analyzing microbiome proportion data with arcsine transformation (ZIGMMs(arcsine)). In all the eight models, the binary case-control indicator for Clostridium difficile colonization or infection vs control was the host factor of interest (β), and the collection time (days from the first collection) was the time variable. An interaction term between host factor and time variable (β) was included in models B, C, D, F, G and H. We also included age and gender as confounding covariates. The sample size was 229 in the final analysis. We included 357 taxa which has a proportion of zeros greater than 0.3 but smaller than 0.9 in our analysis. Table 5 shows the proportions of significant taxa detected by LMMs and ZIGMMs(arcsine) at the alpha level at 0.05, respectively. The significance of the taxa was evaluated at the alpha level of 0.05 (p-value <0.05) for Models A-H. Test of β in Table 5 summarized the proportions of taxa which is significantly differentiated presented between Clostridium difficile colonization or infection group vs control group. Test of β in Table 5 summarized the proportions of taxa which is significantly differentiated presented between Clostridium difficile colonization or infection group vs control group over the collection time. We found that our ZIGMMs(arcsine) detected more associated taxa than LMMs in most scenarios. We also found ZIGMMs(arcsine) with fixed effects in zero-inflation and Gaussian part in models E-H increase slightly in the number of significant taxa detected than ZIGMMs(arcsine) with fixed effects in Gaussian part from models A-D. It implied that there is a significant association between the host factors and the zero-inflation in those taxa.

Table 5

Proportions of significant taxa detected in four models with LMMs and ZIGMMs.

	Model A	Model B		Model C		Model D
	Test of β₁	Test of β₁	Test of β₃	Test of β₁	Test of β₃	Test of β₁	Test of β₃
LMMs^§	0.11	0.13	0.12	0.11	0.11	0.10	0.06
ZIGMMs (arcsine)^†	0.12	0.12	0.19	0.17	0.18	0.11	0.10
	Model E	Model F		Model G		Model H
	Test of β₁	Test of β₁	Test of β₃	Test of β₁	Test of β₃	Test of β₁	Test of β₃
ZIGMMs (arcsine)	0.15	0.14	0.21	0.14	0.23	0.14	0.10

ZIGMMs(arcsine)†: Zero-inflated Gaussian mixed models with arcsine transformation.

LMMs§: Linear mixed models.

ZIGMMs(arcsine)†: Zero-inflated Gaussian mixed models with arcsine transformation. LMMs§: Linear mixed models.

4. Discussion

With the emergence of longitudinal microbiome studies, more understandings about the dynamic shifts of the microbiota have been unraveled [8]. It is of interest in studying the dynamic associations between the microbiota and various host factors [8,36]. To realize these research interests, powerful analytic methods are necessary to account for sources of heterogeneity and dependence in microbiome measurements. However, previous methods have not fully addressed the properties of longitudinal microbiome data and are not computationally feasible for analyzing many taxa. Here, we propose ZIGMMs to model longitudinal microbiome proportion and count data. The method is robust in performance when applied to both 16S rRNA gene sequencing and genome shotgun sequencing data, in terms of proportion or count data. The proportions data, mostly from genome shotgun sequencing data, should be transformed with arcsine square root transformation. For count data, mostly from 16S rRNA platforms, log transformation is more appropriate because if converting those count data to proportion data will lead to very small proportions. The proposed ZIGMMs can effectively handle excessive zeros observed in microbiome data, and can incorporate various types of random effects and within-subject correlation structures [29,37]. We have developed an EM algorithm to fit the proposed ZIGMMs by extending a commonly used procedure for fitting LMMs [37-40]. This allows us to integrate the well-established procedures for analyzing longitudinal data into our ZIGMMs. Our analyses show that our algorithm is efficient and stable for most of the scenarios. We showed the computational efficiency of our EM algorithm by comparing with the other two zero-inflated methods. In the simulations, ZIGMMs outperform LMMs, NBMMs and ZIBR consistently. We have also shown that ZIGMMs can efficiently deal with various fixed and random effects in both normal distribution and zero-inflation models, moreover, and account for the auto-regressive correlation among samples. However, we found ZIGMMs had inflated false positive rates especially in detecting interaction terms, suggesting potential fitting issues. According to Weiss, Xu [41] and Hawinkel, Mattiello [42], most of the parametric methods, such as edgeR, limma–voom and metagenomeSeq, fail to control the false positive rate at the nominal level. A possible reason could be the p-value distributions tend to be smaller than uniform distribution especially when taxa is highly inflated [42]. Thus, in current analysis of a real microbial data, researchers normally focus on the top abundant taxa with less zero-inflation rates. Moreover, we applied our method to two previously published datasets and compared the performances of LMMs, NBMMs and ZIGMMs in detecting the dynamic association between host factor and taxa composition. We could not apply the ZIBR in the real data since according to the manual of R package ZIBR, it could only deal with subjects measured at the same number of time points [26]. We found that our ZIGMMs was capable to detect more significant taxa than LMMs and NBMMs. The differences between our ZIGMMs and the other two methods were more substantial when analyzing the taxa with high zero rates. Notably, we found that several taxa from Romero, Hassan [24], which have only been identified by ZIGMMs, have been previously reported for the associations between pregnancy and vaginal bacterial composition by Romero, Hassan [35]. However, we still encounter the fitting issues similarly as other parametric methods to control false positive rates under nominal level, especially when analyzing complex microbiome/metagenomics data. A future plan is to develop analyzing methods under Bayesian framework using MCMC algorithm to possibly address the current fitting issues.

Empirical power of hypothesis in four simulation settings under high effect scenario.

(PDF) Click here for additional data file. Heat map for p-values between the taxa and each variable from Model C using LMMs (left panel), NBMMs (middle panel) and ZIGMMs (right panel). The sign “+” indicates the positive effect. (PDF) Click here for additional data file.

34 in total

1. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors.

Authors: Andrew K Benson; Scott A Kelly; Ryan Legge; Fangrui Ma; Soo Jen Low; Jaehyoung Kim; Min Zhang; Phaik Lyn Oh; Derrick Nehrenberg; Kunjie Hua; Stephen D Kachman; Etsuko N Moriyama; Jens Walter; Daniel A Peterson; Daniel Pomp
Journal: Proc Natl Acad Sci U S A Date: 2010-10-11 Impact factor: 11.205

2. Accessible, curated metagenomic data through ExperimentHub.

Authors: Edoardo Pasolli; Lucas Schiffer; Paolo Manghi; Audrey Renson; Valerie Obenchain; Duy Tin Truong; Francesco Beghini; Faizan Malik; Marcel Ramos; Jennifer B Dowd; Curtis Huttenhower; Martin Morgan; Nicola Segata; Levi Waldron
Journal: Nat Methods Date: 2017-10-31 Impact factor: 28.547

Review 3. Microbiome and malignancy.

Authors: Claudia S Plottel; Martin J Blaser
Journal: Cell Host Microbe Date: 2011-10-20 Impact factor: 21.023

Review 4. Understanding the role of gut microbiome-host metabolic signal disruption in health and disease.

Authors: Elaine Holmes; Jia V Li; Thanos Athanasiou; Hutan Ashrafian; Jeremy K Nicholson
Journal: Trends Microbiol Date: 2011-06-22 Impact factor: 17.079

5. A robust approach for identifying differentially abundant features in metagenomic samples.

Authors: Michael B Sohn; Ruofei Du; Lingling An
Journal: Bioinformatics Date: 2015-03-19 Impact factor: 6.937

6. A broken promise: microbiome differential abundance methods do not control the false discovery rate.

Authors: Stijn Hawinkel; Federico Mattiello; Luc Bijnens; Olivier Thas
Journal: Brief Bioinform Date: 2019-01-18 Impact factor: 11.622

7. Through ageing, and beyond: gut microbiota and inflammatory status in seniors and centenarians.

Authors: Elena Biagi; Lotta Nylund; Marco Candela; Rita Ostan; Laura Bucci; Elisa Pini; Janne Nikkïla; Daniela Monti; Reetta Satokari; Claudio Franceschi; Patrizia Brigidi; Willem De Vos
Journal: PLoS One Date: 2010-05-17 Impact factor: 3.240

8. The vaginal microbiota of pregnant women who subsequently have spontaneous preterm labor and delivery and those with a normal delivery at term.

Authors: Roberto Romero; Sonia S Hassan; Pawel Gajer; Adi L Tarca; Douglas W Fadrosh; Janine Bieda; Piya Chaemsaithong; Jezid Miranda; Tinnakorn Chaiworapongsa; Jacques Ravel
Journal: Microbiome Date: 2014-05-27 Impact factor: 14.650

Zero-Inflated gaussian mixed models for analyzing longitudinal microbiome data.

1. Introduction

2. Methods

2.1 Zero-Inflated Gaussian Mixed Models (ZIGMMs)

2.2 The EM algorithm for fitting the ZIGMMs

2.3 Accounting for within-subject correlations

2.4 Software implementation

3. Results

3.1 Simulation studies

3.1.1 Assess the ZIGMMs in analyzing microbiome proportion data. 3.1.1.1 Simulation design

3.1.2 Assess the ZIGMMs in analyzing microbiome count data. 3.1.2.1 Simulation design

3.1.3 Assess the computational efficiency of ZIGMMs

3.2 Application to 16S rRNA and shotgun sequencing microbiome data

3.2.1 App lication in 16S rRNA longitudinal pregnancy data

3.2.2 Application in shotgun sequencing longitudinal intestinal microbiome data

4. Discussion

Empirical power of hypothesis in four simulation settings under high effect scenario.

1. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors.

2. Accessible, curated metagenomic data through ExperimentHub.

Review 3. Microbiome and malignancy.

Review 4. Understanding the role of gut microbiome-host metabolic signal disruption in health and disease.

5. A robust approach for identifying differentially abundant features in metagenomic samples.

6. A broken promise: microbiome differential abundance methods do not control the false discovery rate.

7. Through ageing, and beyond: gut microbiota and inflammatory status in seniors and centenarians.

8. The vaginal microbiota of pregnant women who subsequently have spontaneous preterm labor and delivery and those with a normal delivery at term.

9. Longitudinal analysis of the lung microbiota of cynomolgous macaques during long-term SHIV infection.

10. Genome-wide mapping of gene-microbiota interactions in susceptibility to autoimmune skin blistering.

Review 1. Statistical challenges in longitudinal microbiome data analysis.