Literature DB >> 36123651

A robust and transformation-free joint model with matching and regularization for metagenomic trajectory and disease onset.

Qian Li¹, Kendra Vehik², Cai Li³, Eric Triplett⁴, Luiz Roesch⁴, Yi-Juan Hu⁵, Jeffrey Krischer².

Abstract

BACKGROUND: To identify operational taxonomy units (OTUs) signaling disease onset in an observational study, a powerful strategy was selecting participants by matched sets and profiling temporal metagenomes, followed by trajectory analysis. Existing trajectory analyses modeled individual OTU or microbial community without adjusting for the within-community correlation and matched-set-specific latent factors.
RESULTS: We proposed a joint model with matching and regularization (JMR) to detect OTU-specific trajectory predictive of host disease status. The between- and within-matched-sets heterogeneity in OTU relative abundance and disease risk were modeled by nested random effects. The inherent negative correlation in microbiota composition was adjusted by incorporating and regularizing the top-correlated taxa as longitudinal covariate, pre-selected by Bray-Curtis distance and elastic net regression. We designed a simulation pipeline to generate true biomarkers for disease onset and the pseudo biomarkers caused by compositionality. We demonstrated that JMR effectively controlled the false discovery and pseudo biomarkers in a simulation study generating temporal high-dimensional metagenomic counts with random intercept or slope. Application of the competing methods in the simulated data and the TEDDY cohort showed that JMR outperformed the other methods and identified important taxa in infants' fecal samples with dynamics preceding host disease status.
CONCLUSION: Our method JMR is a robust framework that models taxon-specific trajectory and host disease status for matched participants without transformation of relative abundance, improving the power of detecting disease-associated microbial features in certain scenarios. JMR is available in R package mtradeR at https://github.com/qianli10000/mtradeR.

Entities: Chemical

Keywords: Compositional data; Joint model; Metagenomic trajectory; Pseudo biomarker; Zero-inflated

Mesh：

Year: 2022 PMID： 36123651 PMCID： PMC9484160 DOI： 10.1186/s12864-022-08890-1

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 4.547

Background

Gut microbiota profiled by 16s rRNA gene sequencing or metagenomic (i.e., whole-genome shotgun) sequencing has been frequently used in observational studies of environmental exposures, immune biomarkers, and disease onset [1-5]. One of the challenges in analyzing microbiota in an observational study is to incorporate the matching between participants based on certain confounding risk factors (e.g. gender, clinical site, etc.) and/or disease status (case-control), such as the DIABIMMUNE and TEDDY cohorts [1, 2, 5]. A matching design effectively eliminates the noise effect of sample collection, storage, shipment, sequencing batch, and environmental exposures confounding with disease outcomes, as well as reduces the sequencing costs. Statistical analyses of microbiota in matched sets included, but are not limited to, conditional logistic regression [1], non-parametric comparison PERMANOVA [6] and LDM [7] with extension to compare cases and controls within a matched set [8], which aimed to model and analyze microbiome data at independent time points. Longitudinal profiling is a powerful strategy for the microbiome studies that aim to identify differential microbial trajectories between exposure groups or phenotypes [9, 10] or detect the time intervals of differential abundance [11]. However, most of these studies failed to test if the compositional trajectory of an operational taxonomic unit (OTU) signaled host disease status. To detect microbial trajectories predictive of disease outcome in matched sets, an intuitive method is the generalized linear mixed effect model with or without the zero-inflation component [9, 10, 12], in which a taxon’s abundance and/or presence is the outcome variable and the disease status is the covariate of interest. The Zero-Inflated Beta Regression (ZIBR) model [9] tests the association between OTU and a covariate factor using a two-part model for the non-zero relative abundance and presence of each OTU, assuming the non-zero relative abundance and presence being independent. A similar framework [10] was proposed to analyze the longitudinal zero-inflated counts per OTU using a Negative Binomial distribution, without converting the raw counts to relative abundance. A semi-parametric approach for longitudinal taxon-specific relative abundance is the linear mixed effect model (LMM) with asin-square-root transformation, which has been implemented in MaAsLin 2 [12]. One concern about using generalized linear mixed model to test the association between 16S rRNA or metagenomic trajectory and disease onset is that the covariates in this model may contribute to disease risk. For example, the HLA haplogenotypes and early use of probiotics may affect infants’ gut microbiota and should be included as covariates. These factors were also found associated with islet autoimmunity among children enrolled in TEDDY [13]. One usually added interaction terms between each covariate and the disease outcome [3, 12] to adjust for the association. However, a linear model with many interaction terms may lead to overfitting and reduce the detection power [14]. A sensible choice is the joint modeling of longitudinal biomarker and survival outcomes [15, 16], but there are limitations in applying this model to microbiome data in observational studies. First, the cost of metagenomic sequencing and the availability of fecal samples in a multi-center study may restrict the metagenome profiling to a subgroup of participants selected by certain criteria [1-3], whose survival outcome may deviate from common statistical assumptions. Second, the classic joint modeling approach aims to address repeated measurements of biomarkers in a time-to-event analysis rather than test if a biomarker’s intercept or slope is predictive of host health condition. Third, in an observational study that selects and matches participants by certain factors, their risk of developing disease is also matched. Thus, a survival submodel may not be capable of characterizing the disease risk between matched participants. Many of the existing methods for microbiome data are built on the transformed relative abundance, such as centered log-ratio or inter-quartile log-ratio. In our new method, transformation of compositional data is not considered, since transformation strategy may have profound impact on analysis result and interpretation [17]. The compositional change in true biomarkers (e.g., causal OTUs contributing to disease onset) always leads to simultaneous change in some other OTUs’ composition because of sum-to-one constraint. In an observational study with matching design, it is common to collect and profile microbiota at many time points. The sum-to-one constraint and latent noise effect may yield pseudo biomarkers with relative abundance associated with host disease status but not contributing to disease development. Hence, a taxon-level model is built for relative abundance trajectory that adjusts for the dynamic interdependence between taxa and reduces pseudo biomarker rate. In addition, we illustrate the performance of our method by a simulation pipeline that mimics the negative correlation in microbial community. The latent technical noise in microbiome was removed by converting raw counts to relative abundance, and Zero-Inflated Beta density [9] was adopted to model an OTU’s non-zero relative abundance and presence, respectively. We employed a subject-level random effect to link the logistic regression model of disease to a two-part longitudinal submodel. The latent effect of exposures related to matched set indicator was modeled by another random effect nested with the subject-level random effect. The OTU-disease association was assessed by jointly testing the scaling parameters for the subject-level random effect in the two-part submodel. We benchmarked the robustness and power of our method by a comprehensive simulation study and an application in the TEDDY cohort. The results illustrated that our method controlled the rates of false discovery and pseudo biomarkers, as well as improved the efficacy of detecting microbial trajectories signaling disease outcome.

Results

For simplicity, the aim of present research is to link the matched longitudinal microbiome samples to hosts’ matched disease risk and incorporate the unknown dependence between taxa in an univariate trajectory framework, without modeling the compositionality. Briefly, we develop a Joint model with Matching and Regularization (JMR) to detect taxon-specific compositional trajectory associated with disease onset, adjusting for the linear correlation with other taxa and matched-set-specific latent noises. According to the characteristics of disease risk and infant-age gut microbiota in the TEDDY cohort, we designed a simulation pipeline similar to [8], generated the observed counts of temporal microbiota and compared our method to LMM and ZIBR using the simulated data. We also applied these methods to the shotgun metagenomic sequencing data profiled from the 4-9 months stool samples of infants enrolled in TEDDY cohort.

Overview of TEDDY microbiome study

TEDDY is an observational prospective study of children at increased genetic risk of type 1 diabetes (T1D) conducted in six clinical centers in the U.S. and Europe (Finland, Germany, and Sweden). A total of 8,676 children were enrolled from birth and followed every 3 months for blood sample collection and islet autoantibody measurement up to 4 years of age, then every 3-6 months based on autoantibody status until the age of 15 years or diabetes onset [18]. A primary disease endpoint in TEDDY is islet autoimmunity (IA), defined as persistently positive for insulin autoantibodies (IAA), glutamic acid decarboxylase autoantibodies (GADA), or insulinoma-associated-2 autoantibodies (IA-2A) at two consecutive visits confirmed by the two TEDDY laboratories [18]. The participants’ monthly stool samples were collected from 3-month age until the onset of IA or censoring with random missing samples [1, 2]. Based on the sample availability and metagenomic sequencing cost, the microbiome study in TEDDY selected all the participants (cases) who developed IA by the design cutoff date May 31, 2012 and the controls at 1:1 case-control ratio matched by clinical center, gender, family history of T1D to profile the temporal gut microbiota, resulting in S = 418 matched sets (or pairs [19]). These matching factors are known risk factors for type 1 diabetes. Some of the matched sets are at higher risk of IA than the others due to higher risk human leukocyte antigen (HLA) genotypes, geography or having family history of T1D. Hence, the matched participants have comparable risk of IA, but heterogeneity still exists between them according to the case-control status by the design freeze date. The observed metagenomic counts table in TEDDY was generated by the standard procedure of DNA extraction, PCR amplification, shotgun metagenomic sequencing, assembly, annotation and quantification, as described in [1]. We visualized the top abundant species in the metagenomes of TEDDY participants who had matched IA endpoint no later than 2 years of age (Fig. 1).

Fig. 1

Mean compositional trajectory of top abundant species in infant-age metagenomes grouped by host islet autoimmunity status at 2 years of age

Simulation

Disease outcome for the matched participants are simulated by the procedure below. The observed relative abundance per taxon were simulated by different scenarios. We first generated raw counts for a single OTU by Beta-Binomial distribution to assess the robustness and power of our method JMR without covariate taxa. We also designed a shifting procedure to mimic inherent negative correlation in the true composition of microbiota and generated the temporal high-dimensional raw counts table to evaluate the performance of compared methods.

Generate disease outcome in matched sets

We defined matched sets and subjects as ‘high-risk’ and ‘low-risk’ to generate the temporal OTU counts prior to disease onset. Subjects are matched at 1:1 ratio. For participant in matched set s , we first generated subject-level and set-level random effects from a standard Normal distribution , . Each random effect was converted to a binary variable by the median value. That is , , where (or ) represents a ‘high-risk’ subject (or set). Next, we simulated a host genotype as disease risk factor, and the host disease status by a Bernoulli distribution , where . We fixed , which is the JMR estimate from real data, and set to generate different datasets.

Scenario A: single OTU counts.

We first simulated the observed counts of a single OTU by Beta-Binomial [14] distribution to compare the univariate trajectory methods without adjusting for covariate taxa. The true relative abundance of an OTU at the earliest time point was drawn from a Beta distribution , where parameters were estimated by applying Beta-Binomial MLE to the metagenomic raw counts of an OTU selected at a given relative abundance level in the TEDDY data. To simplify the age-dependent effect, the relative abundance of this OTU at later time points was generated by linearly increasing to . The baseline relative abundance at time t in a matched set s was generated by , and was increased or decreased by if the set was labeled as ‘high-risk’. The true relative abundance of this OTU for subject j in set s at time point t was simulated by , and was increased or decreased by if the subject was ‘high-risk’. The total counts per sample, i.e., library size was drawn from a Poisson distribution , and the counts for this OTU is generated from a Binomial (BN) distribution .

Scenario B1: counts table with random intercept and pseudo biomarkers

We also generated a high-dimensional counts table with OTUs to demonstrate the performance of each method, so that the covariate taxa can be used in JMR. The true composition of each microbiome sample was simulated by a shifting procedure combined with Dirichlet distribution to account for the negative correlation within microbial community. The sample-wise library size was generated by a Poisson distribution, and the observed raw counts were sampled from a Multinomial distribution. Details of data generation process for this scenario is available in Methods, with a visualization for dimension of in Fig. 2.

Fig. 2

Shifting procedure in simulation scenario B1. From time T1 to T4, taxa A-C decrease and taxon D increases in relative abundance. Compared to a low-risk set, taxa A, D are more abundant and taxa B, C are less abundant in a high-risk set. Taxon A is less abundant (i.e., ) and taxon B is more abundant (i.e., ) in a high-risk subject compared to the matched low-risk subject, both being true biomarkers at each time point. Taxon C is a pseudo biomarker (randomly selected as at time point T1) with relative abundance automatically changed due to sum-to-one constraint, while taxon D is unchanged at T1 For a subject labeled as ‘high-risk’, we increased OTUs (denoted by ) in by , and reduced another OTUs (denoted by ) by (). The subsets and are the true biomarkers for disease status. We randomly selected a third subset (denoted by ) from the remaining OTUs in and reduced the composition of by a total of . There may exist OTUs never selected in , , or , which are the ‘null’ OTUs. The OTUs selected in are the pseudo biomarkers due to random shift in frequency. We set the total shift at distinct effect size , where is the maximum shift restricted by sum-to-one.

Scenario B2: counts table with random slope and pseudo biomarkers

Data generation process for this scenario is similar to Scenario B1, except that the shift () in microbiota true composition between ‘low-risk’ and ‘high-risk’ subjects varies by time points. It’s worth to note that we cannot distinguish ‘false positive’ from ‘pseudo positive’ in scenarios B1 and B2. Hence, we use the sum of false positive rate and pseudo positive rate, i.e., false or pseudo positive rate (FPPR) as a performance metric for scenarios B1,B2. That is .

Scenario C: counts table without pseudo biomarkers

In this scenario we considered random intercept signaling the disease onset and fixed half OTUs in as ‘null’ in order to evaluate the FPR and FDR of each method, although this scenario is not applicable to real data. Among the other half OTUs, we selected OTUs in as and OTUs as without a subset of pseudo biomarkers ().

Performance of competing methods

In scenario A, we compared JMR not adjusting for correlated taxa (JMR-NC) with the following methods: a) a joint model with regularization but without matching indicator and correlated taxa (JR-NC); b) the ZIBR model with a Wald statistic jointly testing OTU-specific abundance or presence using either a single random effect (ZIBR-S) or nested random effects (ZIBR-N); c) LMM with arcsin-square-root transformation using either a single random effect (LMM-S) or nested random effects (LMM-N). For LMM and ZIBR methods, we used R package gamlss and set the sample age, genotype, disease status, and genotype-disease interaction term as the fixed effect covariates. It’s worth to note that the nested random effects used in LMM and ZIBR are independent of host disease risk, different from those in JMR. We randomly selected 6 OTUs with different relative abundance from TEDDY data and estimated the baseline parameters for each. These OTUs are Acinetobacter sp. NIPH 236, Brachyspira murdochii, Streptococcus phage YMC-2011, Erysipelatoclostridium ramosum, Ruminococcus gnavus. Then we generated replicates for each OTU with . The type I error rate and power of each method was calculated at statistical significance level , shown in Table 1. The results showed that JMR-NC persistently controlled the type I error and provided higher detection power at distinct abundance levels except for the OTUs with and (5, 6]. Type I error of the reduced model JR-NC was severely inflated in some datasets and its power was lower than JMR-NC. LMM consistently controlled type I error, with power lower than JMR-NC in most simulated OTUs. The ZIBR method yielded inflated type I error rate and low efficacy regardless of sample size in this single-OTU scenario.

Table 1

The type I error and power based on 10000 simulated replicates for a taxon at different levels of mean relative abundance ()

\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$-\log _{10}(\varvec{y})$$\end{document}-log10(y)		(0, 1]	(1, 2]	(2, 3]	(3, 4]	(4, 5]	(5, 6]
Type I error
	JMR-NC	0.01	0.003	0.008	0.0004	0	0
	JR-NC	0.061	0.03	0.001	0.002	0.0003	0.0005
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N=100$$\end{document}N=100	LMM-N	0.045	0.031	0.026	0.033	0.068	0.052
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S=50)$$\end{document}(S=50)	LMM-S	0.041	0.036	0.023	0.036	0.066	0.049
	ZIBR-N	0.057	0.066	0.080	0.053	0.164	0.369
	ZIBR-S	0.049	0.079	0.076	0.063	0.163	0.366
	JMR-NC	0.018	0.004	0.022	0.051	0.001	0.0003
	JR-NC	0.162	0.091	0.049	0.066	0.035	0.002
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N=200$$\end{document}N=200	LMM-N	0.043	0.039	0.068	0.057	0.079	0.058
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S=100)$$\end{document}(S=100)	LMM-S	0.046	0.051	0.066	0.060	0.079	0.056
	ZIBR-N	0.056	0.067	0.099	0.078	0.170	0.324
	ZIBR-S	0.059	0.088	0.099	0.088	0.174	0.324
Power
	JMR-NC	0.399	0.5	0.372	0.833	0.798	0.136
	JR-NC	0.32	0.65	0.051	0.057	0.548	0.040
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N=100$$\end{document}N=100	LMM-N	0.040	0.084	0.474	0.107	0.552	0.512
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S=50)$$\end{document}(S=50)	LMM-S	0.043	0.090	0.468	0.108	0.347	0.468
	ZIBR-N	0.06	0.088	0.412	0.123	0.743	0.666
	ZIBR-S	0.057	0.095	0.405	0.123	0.686	0.664
	JMR-NC	0.452	0.723	0.619	0.944	0.917	0.388
	JR-NC	0.677	0.699	0.194	0.836	0.737	0.324
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$N=200$$\end{document}N=200	LMM-N	0.06	0.115	0.429	0.478	0.287	0.253
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(S=100)$$\end{document}(S=100)	LMM-S	0.068	0.134	0.423	0.321	0.273	0.206
	ZIBR-N	0.057	0.332	0.433	0.667	0.280	0.379
	ZIBR-S	0.063	0.344	0.430	0.648	0.265	0.377

The type I error and power based on 10000 simulated replicates for a taxon at different levels of mean relative abundance () In scenarios B1, B2 and C, we generated 10 replicates for each OTU table to assess the performance of competing methods. We evaluated the performance of each method at different size of set-level random effect . The taxa associated with disease onset in each OTU table are detected by FDR cutoff . The FPPR in scenario B1 (Fig. 3) showed that adjusting for the top-correlated taxa in JMR successfully controlled the rate of pseudo biomarkers across different scenarios and was more powerful than LMM at larger sample size (), although JMR showed lower detection power compared to JMR-NC. The results in Fig. 4 also demonstrated the outperformance of JMR-NC, JMR, and ZIBR in the sensitivity of detecting taxon-specific trajectory heralding disease outcome. The power of ZIBR in either intercept or slope analysis was higher than the competing methods regardless of sample size in the high-dimensional scenarios, but this model yielded inflated FPPR (Fig. 3). The LMM methods were powerful in the test of intercept, but this model occasionally produced inflated FPPR (Fig. 3) regardless of the set-level random effect size. Furthermore, the power of LMM was unstable in intercept analysis, while its power in slope analysis was nearly zero. To confirm the impact of on performance, we also compared the metrics in Figs. 3 and 5 between different values of , using Kruskal-Wallis test. A larger set-level random effect led to significant change in FPPR, FPR, FDR for LMM and ZIBR methods (), but this impact was trivial in JMR or JMR-NC (). JMR showed the best performance in slope analysis, with higher sensitivity (Fig. 4) and the lowest FPPR (Fig. 3). Results of scenario C in Fig. 5 showed that JMR and LMM effectively controlled the FPR, while LMM produced higher FPR at larger matched-set-specific random effect (). The FDR of JMR at was relatively higher than that of LMM due to lower sensitivity. The inflated FPR and FDR of ZIBR in scenario C (Fig. 5) is consistent with the FPPR in scenario B (Fig. 3).

Fig. 3

FPPR of each method for scenario B1 at different size of matched-set-specific random effect

Fig. 4

Sensitivity of each method for scenarios B1 and B2 at different sample size (N) and different effect size ()

Fig. 5

FDR and FPR for each method in scenario C at different size of matched-set-specific random effect

FPPR of each method for scenario B1 at different size of matched-set-specific random effect Sensitivity of each method for scenarios B1 and B2 at different sample size (N) and different effect size () FDR and FPR for each method in scenario C at different size of matched-set-specific random effect For each raw counts table in scenarios B and C, more than half of simulated OTUs have the observed zero-inflation probability (i.e., prevalence) between 2% and 90%, although there are a few OTUs with the observed prevalence at . The overall prevalence of each OTU table in scenarios B and C is similar between different datasets, which cannot be specified in the Dirichlet-Multinomial distribution or the shifting procedure. Hence, we assess the impact of zero-inflation on performance only in scenario A, whereas the prevalence is related to taxon-specific relative abundance. We also visualized the prevalence of six OTUs generated in scenario A (Fig. 6). The power of JMR-NC (Table 1) was better for the prevalence at a medium level, i.e., replicates with between (3, 4] or (4, 5] (Fig. 6). The OTUs with higher abundance and relatively lower prevalence (i.e., replicates with in Fig. 6) showed better efficacy in Table 1. In general, OTU-specific prevalence being too high or too low may reduce the power of JMR.

Fig. 6

Prevalence distribution for each OTU in scenario A

Application in TEDDY

We applied the competing methods to the longitudinal metagenomes profiled from TEDDY children’s monthly stool samples collected at the age of 4-9 months [1]. We included the cases developing IA between 9-month and 24-month age and their matched controls who remained IA-negative by the cases’ diagnosis age. For each matched pair included in the present analysis, one participant was IA positive and the other one was negative at the age of 24 months. We excluded the participant(s) matched to multiple pairs, yielding subjects ( pairs) and metagenome samples. The cases who experienced IA onset after 24 months and their matched controls were not included in this analysis. We first filtered OTUs at genus and species level by relative abundance and prevalence , selecting 125 out of 265 genera and 365 out of 750 species in downstream analysis. It’s worth to note that there are 1797 species in total profiled and quantified in TEDDY cohort, with 750 species detected between 3- and 9-month age. The sample age and the hosts’ breastfeeding status per time point were used as longitudinal covariates, while HLA DR3 &4 haplotype was included as time-invariant covariates. For the LMM and ZIBR methods, we used the interaction term between IA status and the binary HLA category (DR3 &4 vs. others) as a covariate to adjust for the association. We tested each OTU’s association with IA by FDR cutoff or , individually. The HLA DR3 &4 genotype was confirmed positively and significantly ( by Wald test) associated with IA in JMR. The results in Table 2 showed that JMR identified more OTUs than LMM in both intercept and slope analysis. The LMM methods only found a small subgroup of taxa associated with IA at either genus or species level. We also visualized the overlap and difference between JMR, JMR-NC, LMM-N selected by in Fig. 7 with OTU names listed in Supplementary Table S1, and then compared Akaike Information Criterion (AIC) of JMR and JMR-NC for the 76 species detected by both methods. Adjusting for the correlated taxa in JMR did improve model fitting with lower mean AIC (-2631.847) compared to JMR-NC (-2615.571). LMM-N is not comparable to JMR or JMR-NC in terms of information criteria, since the taxon-specific relative abundance was transformed by asin-square-root.

Table 2

The number of genera and species associated with IA detected by each method in a subgroup of TEDDY participants

		Intercept		Slope
FDR		\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q<0.05$$\end{document}q<0.05	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q<0.1$$\end{document}q<0.1	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q<0.05$$\end{document}q<0.05	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$q<0.1$$\end{document}q<0.1
	JMR	27	31	27	34
	JMR-NC	44	52	36	44
	LMM-N	11	16	3	4
Genera	LMM-S	49	82	10	19
	ZIBR-N	26	30	31	36
	ZIBR-S	26	32	31	36
	JMR	75	94	37	46
	JMR-NC	147	166	112	125
	LMM-N	43	60	13	21
Species	LMM-S	40	63	7	16
	ZIBR-N	89	106	119	138
	ZIBR-S	83	105	120	140

Fig. 7

Venn diagram for the intercept analysis in TEDDY data by JMR, JMR-NC, LMM-N

The number of genera and species associated with IA detected by each method in a subgroup of TEDDY participants Venn diagram for the intercept analysis in TEDDY data by JMR, JMR-NC, LMM-N The taxa with mean abundance (intercept) associated with IA onset exclusively detected by both JMR and JMR-NC at include Bifidobacterium breve, Bacteroides fragilis, Lactobacillus ruminis, Veillonella ratti. B.breve, as one of the three species dominating infant-age gut microbiota in TEDDY, was less abundant in intercept (i.e. at 4- and 9-month) during infancy among IA cases, with density shown in Fig. 8. The species B.fragilis as part of the normal microbiota in human colon was found more abundant among IA cases compared to their matched controls (Fig. 1). This Bacteroides species was also found differential between T1D cases and controls at only one time point in a small-size Finnish cohort [20].

Fig. 8

Distribution of relative abundance for B.breve and E.coli per time point grouped by 2-year IA status

Distribution of relative abundance for B.breve and E.coli per time point grouped by 2-year IA status There are two more abundant species Faecalibacterium prausnitzii and Escherichia coli visualized in Fig. 1 associated with IA in slope and exclusively detected by JMR. F.prausnitzii, as one of the most abundant and important commensal bacteria of human gut microbiota that produces butyrate and short-chain fatty acids from the fermentation of dietary fiber, increased faster in IA cases after 6-month of age. This rapid change and abnormally higher level of F.prausnitzii prior to IA seroconversion may be a result of the sudden change of dietary pattern during infancy. Our method successfully detected the case-control difference in the slope of E.coli, which was found as an amyloid-producing bacteria with temperal dynamics heralding IA onset in a subset analysis in DIABIMMUNE cohort [21]. The relative abundance of E.coli in TEDDY smoothly decreased from 4-month to 9-month for both cases and controls (Fig. 1), and it was relatively more abundant in controls between 7- and 9-month with stratified densities shown in Fig. 8. The temporal change of E.coli prior to IA seroconversion in TEDDY detected by JMR was consistent with the decrease of E.coli reported in DIABIMMUNE cohort [21], which was possibly due to prophage activation according to the E.coli phage/E.coli ratio prior to E.coli depletion in that research.

Discussion

We developed a joint model with nested random effects to test the association between taxa and disease risk, and adjusted for the correlated taxa screened by a pre-selection procedure in abundance and prevalence, individually. We implemented our method in an R package mtradeR (metagenomic trajectory analysis with disease endpoint and risk factors) with illustration examples at https://github.com/qianli10000/mtradeR. The JMR function implemented the framework in equation (1) by parallel computing. We also provided simulation functions StatSim and TaxaSim to generate (binary) disease status and temporal high-dimensional metagenomic counts of matched sets. The runtime of each method for different sample size and different number of OTUs were compared on an 8-core computer, with mean and standard deviation shown in Table 3. The nested random effects were utilized in each method. For the univariate models without covariate taxa, LMM-N is the fastest algorithm and ZIBR-N is the slowest, both implemented in gamlss R package. Although the adjustment of correlated taxa in JMR requires additional computation, the runtime of JMR is still shorter than ZIBR-N in gamlss.

Table 3

		Mean Runtime (SD)
Dataset scale		JMR-NC	ZIBR-N	LMM-N	JMR
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=153$$\end{document}n=153	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{P}=359$$\end{document}P~=359	11.62 (0.95)	44.11 (4.72)	3.01 (0.36)	35.95 (3.6)
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=153$$\end{document}n=153	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{P}=234$$\end{document}P~=234	8.02 (0.26)	20.79 (0.31)	1.93 (0.04)	24.58 (0.52)
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=307$$\end{document}n=307	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{P}=370$$\end{document}P~=370	14.16 (0.65)	56.01 (4.78)	5 (0.45)	43.67 (3.09)
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n=307$$\end{document}n=307	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\tilde{P}=247$$\end{document}P~=247	9.69 (0.63)	37.89 (2.31)	3.9 (0.24)	26.97 (0.68)

The mean and standard deviation (SD) of runtime in minutes for 30 repeated runs by each method at different number of longitudinal samples (n) and filtered OTUs () in TEDDY data. The OTUs in each dataset are filtered by either relative abundance , prevalence or relative abundance , prevalence The simulation of single OTU demonstrated the performance of each method at different relative abundance levels, implying that LMM with either single or nested random effect is still a robust method. The simulation of high-dimensional OTU tables also illustrated LMM’s overall performance in the test of intercept, but the unstableness of LMM is a concern in real data analysis. JMR yielded lower false or pseudo positive rate in the simulated datasets and higher detection power in slope analysis by adjusting for the top-correlated taxa. The pre-selection of top-correlated taxa in JMR was performed in relative abundance and presence, individually, being consistent with the two-part model strategy. According to the simulation study, a disadvantage of JMR is the limited power at small sample size and the dependence on tuning parameter. The prescreening procedure in JMR may occasionally select a true biomarker as covariate taxon, which is possibly confounding with the subject-level random effect. Hence, the adjustment of related taxa in JMR reduced the detection power compared to JMR-NC, although this strategy controlled the pseudo biomarker rate. Adding nodes in the GH approximation may improve the power of JMR, but more nodes will also lead to additional computation costs. Hence, future work should focus on improvement of JMR in both detection power and computation efficiency. Furthermore, the simulation results in Fig. 3 also suggested the minimum number of participants or matched pairs required based on set-level or subject-level random effect size. In an observational study with strong set-level noises in the microbiota (e.g., multi-center effect), a minimum sample size of participants (i.e., pairs) coupled with JMR can improve the detection power and control FPPR at each level of disease-associated random effect. Another limitation of our method is the potential bias in scaling parameter () estimation, possibly caused by the regularization. Our current work only focused on the unsigned association between a taxon and host disease status by using a Wald statistic. An improvement in the estimate of scaling parameter and statistical inference should be considered in future work, such as the algorithm in ZINQ [14]. We did not use quantile regression in current research, since the performance of ZINQ required tuning of grid. But ZINQ provided an alternative approach for modeling zero-inflation in microbiota composition with fewer statistical assumptions. The right-censoring of longitudinal biomarker measurements or a binary disease outcome always occurs in observational studies. Our model allows random missingness or censoring of microbiome samples at any time point. In an observational study like TEDDY, the controls’ disease outcome was censored at or later than the matched cases’ endpoint, because the case-control matching was based on the participants’ disease status. Thus, right-censoring is not applicable to the disease status at matched endpoint. For a study matching participants solely based on confounding risk factors (e.g., DIABIMMUNE), the right-censoring of disease outcome should be addressed prior to the usage of JMR, such as multiple imputation. There are other important topics to be considered in the modeling of longitudinal microbiome data. One potential direction is high dimensional modeling framework, such as tensor singular value decomposition [22]. A promising extension of the current work in JMR is to exploit functional data analysis for multiple microbial trajectories. By employing a non-parametric joint modeling, we may be able to capture nonlinear trends and heterogeneous patterns of longitudinal biomarkers in microbiota, as well as negative correlations among taxa [23].

Conclusions

The proposed framework JMR successfully controlled the false or pseudo biomarkers in taxon-specific trajectory analysis with improved detection power by incorporating the matching of participants and adjusting for the dependence between taxa.

Methods

Joint model with matching and regularization

The probability for participant j in matched set s developing the disease of interest is , where is the binary disease status. There are J participants in each matched set. Let be the relative abundance of an OTU for participant j in matched set s at time point t . We denote the expected non-zero abundance by , and the probability of presence (or zero-inflation) by , similar to [9]. For a microbiome study matching participants by the disease-associated factors and/or disease status (e.g., DIABIMMUNE, TEDDY), the matched participants are assumed to have comparable but distinct disease risk. Hence, we model the disease status by a logistic mixed effect model with nested random effects. A joint model for the host disease status and microbial trajectory in matched set isThe host disease status is determined by a vector of fixed effect covariates and the independent nested random effects , . The non-zero relative abundance and presence per OTU are predicted by the same random effects rescaled by parameters and a vector of clinical or bioinformatics technical covariates . To model the unknown correlation between taxa, this OTU’s non-zero abundance and presence per time point also depend on the other taxa with relative abundance and presence-absence measured at the same time point, pre-selected by a procedure described below. The two-part submodel of characterizes how the trajectory is affected by subject- and set-level latent factors contributing to disease risk, and how the OTU trajectory interacts with correlated taxa over time. If an OTU is a pseudo biomarker, then its relative abundance () should be driven by the top-correlated taxa per time point instead of the disease-associated random effect . On the other hand, the abundance of a true biomarker OTU at each time point is mainly determined by the latent risk of disease onset () and possibly associated with the top-correlated taxa. We set in equation (1) to test intercept, and to test slope. The nested random effects and parameters provide flexibility in the modeling of between-subjects and between-sets heterogeneity, as well as model the abundance-presence correlation in each taxon by shared nested random effects instead of assuming independence between the two processes as in [9].

Parameter estimation and hypothesis testing

To account for the sum-to-one restriction on non-zero relative abundance () and the binarized measurement of an OTU, we intuitively employ the Zero-Inflated Beta (ZIB) density function [9] to define the match-set-specific marginal likelihood for parameter estimation. That is , where are the Gaussian density functions with mean 0 and variance , individually, and is the Beta density function with mean and overdispersion . In the simulation study, we demonstrated that the robustness and performance of this model does not require the observed relative abundance being generated from ZIB distribution. The estimate of overdispersion without regularization is severely inflated and also leads to bias in the estimate of other parameters. Hence, we use (ridge) regularization to control the overdispersion and type I error in hypothesis testing. All the parameters are estimated by maximizing a penalized marginal likelihood function , whereand is selected by a cross-validation described below. There is no closed form of the multivariate integral in equation (2) because of the Beta density in . Hence, can be approximated by Gauss-Hermite (GH) quadrature, with details explained in Appendix. We test the association between OTU trajectory and host disease status with null hypothesis and a Wald statistic , which follows a Chi-Square distribution . The false discovery rate (FDR) for multiple testing is corrected by the Benjamini-Hochberg (BH) procedure.

Pre-selection of correlated taxa and tuning parameter selection

For each OTU () in equation (1), using all the other taxa as covariates is computationally inefficient. Hence, we use a data-driven procedure to pre-select and , and then perform a post-selection hypothesis testing. The first step screens the taxa correlated with in abundance and presence, individually, using the Bray-Curtis distance less than 0.1 quantile. Our current method uses relative abundance in both pre-selection and modeling, since this method is developed for large-scale microbiome studies and the multi-center technical batch effect can be simply normalized by relative abundance. According to the comparison of dissimilarity metrics on microbiome compositional data [24], we choose Bray-Curtis dissimilarity to pre-select the related taxa. This step may still result in many covariate taxa at species level in metageonmic data due to high dimensionality. Thus, we employ elastic net regression to further select the taxa with relative abundance associated with or the taxa with presence associated with , individually. In this pre-selection procedure, we model all the longitudinal metagenomes as independent samples regardless of time points (or age). One may restrict this procedure to a sub-community such as the species or subspecies of certain genera. To reduce the computational burden of cross-validation for a high-dimensional OTU table, we randomly select OTUs from distinct relative abundance levels to represent the complexity of microbiota composition. The matched sets are divided into 5 folds, each being a validation fold for the model built on the other four (training) folds. The penalized log likelihood in equation (4) is the negative objective function in cross-validation. For each validation fold f and the selected OTU i, the loss function is , where . The optimal is selected by the ‘elbow point’ minimizing .

Data generation process for simulation scenario B1

Step 1: Estimate the baseline mean composition (or frequency) of microbiota () and the overdispersion () at the starting time point in TEDDY data by Dirichlet-Multinomial (DM) maximum likelihood estimate (MLE) of the observed counts. Generate the mean frequency of microbiota at the first time point by Dirichlet (DL) distribution: . Step 2: The mean frequency at a later time point is generated by the following shifting procedure: increase the frequency of some OTUs in (denoted by ) with a sum of and simultaneously reduce that of other OTUs in (denoted by ) by . The absolute shift size represented the age effect on microbiota. This shifting strategy characterized the inherent correlation between and because of the simultaneous compositional change in these OTUs. All the OTUs in are assigned to either or to account for the impact of latent exposures across time points. Step 3: At each time point, the heterogeneity between matched sets is the overdispersion estimated by DM MLE based on the samples per time point in TEDDY, denoted by . The overdispersion at the first time point is and linearly decreases over time, which mimics the time-dependent overdispersion observed in the infant-age metagenome in TEDDY. We generated a mean frequency for each matched set s at time point t by . If a set is labeled as ‘high-risk’, we shifted all the OTUs in using the procedure in Step 2 with shift size , which is a proportion of the maximum shift size, i.e., . Step 4: The between-subject heterogeneity within each matched set was the median DM MLE of overdispersion per matched set based on the real data, that is . Hence, we generated the true microbiota composition for a sample collected from a ‘low-risk’ subject j in set s at time t by . The shift in between ‘low-risk’ and ‘high-risk’ subjects were described in Results. Step 5: The library size for each sample is simulated by a Poisson distribution , truncated by a minimum of 10000. The raw counts per sample is generated by Multinomial (MN) distribution . Additional file 1: Appendix. Gauss-Hermite quadrature approximation for marginal likelihood. Additional file 2: Supplementary Table S1. The list of OTUs detected by JMR, JMR-NC, LMM-N, individually, as shown in Figure 7.

22 in total

1. A two-part mixed-effects model for analyzing longitudinal microbiome compositional data.

Authors: Eric Z Chen; Hongzhe Li
Journal: Bioinformatics Date: 2016-05-14 Impact factor: 6.937

2. Testing hypotheses about the microbiome using the linear decomposition model (LDM).

Authors: Yi-Juan Hu; Glen A Satten
Journal: Bioinformatics Date: 2020-08-15 Impact factor: 6.937

3. Toward defining the autoimmune microbiome for type 1 diabetes.

Authors: Adriana Giongo; Kelsey A Gano; David B Crabb; Nabanita Mukherjee; Luis L Novelo; George Casella; Jennifer C Drew; Jorma Ilonen; Mikael Knip; Heikki Hyöty; Riitta Veijola; Tuula Simell; Olli Simell; Josef Neu; Clive H Wasserfall; Desmond Schatz; Mark A Atkinson; Eric W Triplett
Journal: ISME J Date: 2010-07-08 Impact factor: 10.302

4. A field guide for the compositional analysis of any-omics data.

Authors: Thomas P Quinn; Ionas Erb; Greg Gloor; Cedric Notredame; Mark F Richardson; Tamsyn M Crowley
Journal: Gigascience Date: 2019-09-01 Impact factor: 6.524

5. Variation in Microbiome LPS Immunogenicity Contributes to Autoimmunity in Humans.

Authors: Tommi Vatanen; Aleksandar D Kostic; Eva d'Hennezel; Heli Siljander; Eric A Franzosa; Moran Yassour; Raivo Kolde; Hera Vlamakis; Timothy D Arthur; Anu-Maaria Hämäläinen; Aleksandr Peet; Vallo Tillmann; Raivo Uibo; Sergei Mokurov; Natalya Dorshakova; Jorma Ilonen; Suvi M Virtanen; Susanne J Szabo; Jeffrey A Porter; Harri Lähdesmäki; Curtis Huttenhower; Dirk Gevers; Thomas W Cullen; Mikael Knip; Ramnik J Xavier
Journal: Cell Date: 2016-04-28 Impact factor: 41.582

6. Linking the Human Gut Microbiome to Inflammatory Cytokine Production Capacity.

Authors: Melanie Schirmer; Sanne P Smeekens; Hera Vlamakis; Martin Jaeger; Marije Oosting; Eric A Franzosa; Rob Ter Horst; Trees Jansen; Liesbeth Jacobs; Marc Jan Bonder; Alexander Kurilshikov; Jingyuan Fu; Leo A B Joosten; Alexandra Zhernakova; Curtis Huttenhower; Cisca Wijmenga; Mihai G Netea; Ramnik J Xavier
Journal: Cell Date: 2016-11-03 Impact factor: 41.582

7. Joint modeling of zero-inflated longitudinal proportions and time-to-event data with application to a gut microbiome study.

Authors: Jiyuan Hu; Chan Wang; Martin J Blaser; Huilin Li
Journal: Biometrics Date: 2021-07-02 Impact factor: 2.571

8. Multivariable association discovery in population-scale meta-omics studies.

Authors: Himel Mallick; Ali Rahnavard; Lauren J McIver; Siyuan Ma; Yancong Zhang; Long H Nguyen; Timothy L Tickle; George Weingart; Boyu Ren; Emma H Schwager; Suvo Chatterjee; Kelsey N Thompson; Jeremy E Wilkinson; Ayshwarya Subramanian; Yiren Lu; Levi Waldron; Joseph N Paulson; Eric A Franzosa; Hector Corrada Bravo; Curtis Huttenhower
Journal: PLoS Comput Biol Date: 2021-11-16 Impact factor: 4.779

9. The Environmental Determinants of Diabetes in the Young (TEDDY) Study.

Authors:
Journal: Ann N Y Acad Sci Date: 2008-12 Impact factor: 6.499