Literature DB >> 28178947

An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data.

W Duncan Wadsworth¹, Raffaele Argiento², Michele Guindani³, Jessica Galloway-Pena⁴, Samuel A Shelburne⁵, Marina Vannucci⁶.

Abstract

BACKGROUND: The Human Microbiome has been variously associated with the immune-regulatory mechanisms involved in the prevention or development of many non-infectious human diseases such as autoimmunity, allergy and cancer. Integrative approaches which aim at associating the composition of the human microbiome with other available information, such as clinical covariates and environmental predictors, are paramount to develop a more complete understanding of the role of microbiome in disease development.
RESULTS: In this manuscript, we propose a Bayesian Dirichlet-Multinomial regression model which uses spike-and-slab priors for the selection of significant associations between a set of available covariates and taxa from a microbiome abundance table. The approach allows straightforward incorporation of the covariates through a log-linear regression parametrization of the parameters of the Dirichlet-Multinomial likelihood. Inference is conducted through a Markov Chain Monte Carlo algorithm, and selection of the significant covariates is based upon the assessment of posterior probabilities of inclusions and the thresholding of the Bayesian false discovery rate. We design a simulation study to evaluate the performance of the proposed method, and then apply our model on a publicly available dataset obtained from the Human Microbiome Project which associates taxa abundances with KEGG orthology pathways. The method is implemented in specifically developed R code, which has been made publicly available.
CONCLUSIONS: Our method compares favorably in simulations to several recently proposed approaches for similarly structured data, in terms of increased accuracy and reduced false positive as well as false negative rates. In the application to the data from the Human Microbiome Project, a close evaluation of the biological significance of our findings confirms existing associations in the literature.

Entities: Chemical Disease Gene Species

Keywords: Bayesian hierarchical model; Data integration; Dirichlet-multinomial; Microbiome data; Variable selection

Mesh：

Year: 2017 PMID： 28178947 PMCID： PMC5299727 DOI： 10.1186/s12859-017-1516-0

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

The human microbiome is defined as the collection of microorganisms, including bacteria, viruses, and some unicellular eukaryotes, that live in and on our bodies [1]. Research on the microbiome has grown exponentially in the past few years and it has been argued that the microbiota can be regarded as a “second genome”[2, 3]. Indeed, just the human gut microbiome is estimated to be composed of approximately 1014 bacterial cells, i.e. ten times more than the total number of human cells in the body [4]. The contribution of the human microbiome on several health outcomes has been frequently reported in the literature. For example, microbial dysbiosis in the gut has been linked to irritable bowel syndrome and Crohn’s disease [5], type 2 diabetes [6], cardiovascular disease [7], and psychological conditions via the so-called “gut-brain axis” [8]. The composition of microbiota at other body sites have also been associated with conditions such as eczema [9] and pre-term labor [10]. This stream of research holds great potential for a better understanding of many mechanistic processes in the development of human diseases, especially with respect to immune regulation and barrier defense [11, 12]. Microbiome data is most commonly obtained by sequencing variable regions of the 16S rRNA gene, then grouping the transcripts into Operational Taxonomic Units (OTUs), based on their similarity to one another. The OTUs are then defined as a cluster of reads based on a similarity threshold (typically, 97%) set by the researcher. The membership count of each cluster is then used as a proxy for taxa abundances in the sample [13]. See [14] for a discussion of how the selection of the cutoff might impact the resulting OTUs, in particular for rare species. Many studies summarize the taxa abundances by constructing several indicators of community composition (e.g. alpha and beta diversity indexes, see, [15]). Alternatively, the full OTUs abundance table can be used to obtain more detailed information about existing associations between environment or phenotypes and microbes. Well-established statistical models for the analysis of count data (e.g., Poisson or Negative Binomial distribution) can be efficaciously employed for the analysis of taxonomic count data [16]. Although less common, other distributions (e.g., a two-parameter Weibull distribution) have also been shown to provide a good fit to the data for some communities (see, e.g. [17]). One distinctive characteristic of the microbiome data is their overdispersion: while some taxa (e.g., Bacteroides and Lactobacillus species) are common among samples, many other taxa are present at much lower abundances, and often never recorded in a sample, leading to zero-inflated distributions. Many of the existing tools for microbial community analysis (e.g., the QIIME platform, [18]) bypass those characteristics and rely on nonparametric tests to compare species across different conditions [19, 20]. Other approaches use ordination, e.g. multidimensional scaling, to summarize abundances, and are sometimes employed to link the microbiome data with available clinical covariates and phylogenetic information [21, 22]. In those approaches, the choice of the distance metric is often crucial. The interpretation of biological phenomena can also be challenging in low dimensional projections. Most importantly, distance-based methods do not explicitly quantify the relative importance of significant associations between taxa and covariates, and therefore are of limited use for clinical decisions. In this manuscript, we consider an integrative Bayesian approach based on the use of Dirichlet-Multinomial (DM) distributions [23] for studying the association between taxa abundance data and available measurements on clinical, genetic and environmental covariates. Recently, La Rosa et al. [24] proposed the use of a DM model for hypothesis testing and power calculations in microbiome experiments. Holmes et. al [25] used a finite mixture of DM distributions to directly model the taxa counts. Neither method incorporate predictors to study the influence of external factors on the microbiome’s abundance. A penalized likelihood approach based on a DM regression model has been proposed instead by [26] to determine significant associations between the microbiome composition and a set of covariates which describe the individual dietary nutrients’ intakes. Similarly, [27] develop a structure constrained version of sparse canonical correlation analysis that integrates compositionalized microbiome data, phylogenetic information, and nutrient information. Furthermore, [28] propose penalized regression models to associate the multivariate compositionalized microbiome data with some univariate phenotype of interest, e.g. body mass index, as a response. However, the use of a constrained optimization approach does not allow to fully characterize the uncertainty in the selection of the significant associations, which is of particular importance, especially when dealing with high-dimensional and highly-correlated data. Here, we propose a probabilistic modeling approach which both flexibly takes into account the typical features of microbiome count data and also allows for straightforward incorporation of available covariate information within a DM log-linear regression framework. With respect to modeling approaches as in [28], our framework allows the study of associations between multivariate microbiome data and multivariable predictors. By imposing sparsity inducing spike-and-slab priors on the regression coefficients, our model obtains a parsimonious summary of the effects of the associations and also allows an assessment of the uncertainty of the selection process. We evaluate the performance of our model first on simulated data, where we provide comparisons with methods developed for microbiome or similar type of data. We also illustrate our method on data obtained from the Human Microbiome Project [29], to investigate the association between taxonomic abundances and metabolic pathways inferred from whole genome shotgun sequencing reads. It is known that the combination of environmental and host genetic factors shape the composition of the gut microbiota, and these interactions appear to have a significant effect on several biological mechanisms, which may be related, for example, to the individual immunity and barrier defense, as well as metabolism and diet [30, 31]. The approach has been implemented in a user-friendly R code, which has been made publicly available (see the Licensing Section).

Methods

We describe our Bayesian variable selection approach for the analysis of microbiome data and their association with a set of available covariates in the context of DM log-linear regression models.

Dirichlet-multinomial regression with variable selection

Let y =(y ,…,y ) indicate the vector of counts representing the taxonomic abundance table obtained from the ith patient, with y denoting the frequency of the jth microbial taxon, for j=1,…,J and i=1,…,n. Furthermore, let X=(x 1,…,x ) indicate a n×P matrix of measurements on P covariates. We start by modeling the taxonomic count data with a Multinomial distribution with the summation of all counts in the vector, and where the parameters ’s are defined on the J dimensional simplex We further impose a conjugate Dirichlet prior on , that is ∼Dirichlet(), where =(γ 1,…,γ ) indicates a J-dimensional vector of strictly positive parameters. An advantage of our hierarchical formulation is that conjugacy can be exploited to integrate out, obtaining the Dirichlet–Multinomial model, ∼DM(), with probability mass function where . First described in [23] as the compound multinomial, the DM() allows more flexibility than the Multinomial when encountering overdispersion in multivariate count data, as it induces an increase in the variance by a factor of (y ++γ +)/(1+γ +). Next, we incorporate the covariates into the modeling via a log-linear regression framework where the DM parameters depend on the available covariates X’s. More specifically, we define ζ = log(γ ) and assume i=1,…,n; j=1,…,J. In this formulation, the intercept term α corresponds to the log baseline parameter for the taxon j, whereas the regression parameter β captures the effect of the pth covariate on the abundance for that taxon. Identifying the significant associations between taxa and covariates in model (1)–(2) is equivalent to determining the non-zero β parameters. One way to address this issue is through variable selection and the use of spike-and-slab mixture priors [32, 33]. First, we introduce latent binary indicator vectors =(ξ 1,ξ 2,…,ξ ), such that ξ =1 if the pth covariate influences the abundance of the jth taxa and ξ =0 otherwise. Then, we write the prior on the β ’s as where δ 0 denotes a Dirac-delta at 0 and is some suitably large value [34, 35]. It is common to choose relatively large values for . Such a choice suggest a flat prior distribution on the location of the coefficients {β ∣ξ =1} and therefore encourages the selection of relatively large effects. In the Results Section, we discuss the results of a sensitivity analysis to assist with the choice of this parameter. We place Bernoulli priors on the selection indicators ξ , that is We also specify Beta hyperpriors on the hyperparameters p , i.e., p ∼Beta(ab), as this has been shown to provide an automatic adjustment for multiplicity [36]. This is equivalent to placing a Beta mixed Binomial distribution on ξ , with λ=(a,b). As a practical suggestion, the hyper-parameters a and b should be chosen so to induce a relatively weakly specification of the prior as a “flat” Beta distribution. This can be obtained by setting a and b so that a+b=2, and the prior expected mean value m=a/(a+b). For most cases, a value of m=0.01, which corresponds to assuming a priori that 1% of the P covariates will be selected, provides an adequate balance between false positives and false negative counts, as we further illustrate in a sensitivity analysis in the Results Section. Finally, we assume normal priors on the α ’s, i.e. . Large values for encode a diffuse prior, to describe non-informative or objective prior beliefs. However, results are typically quite robust to prior choices on the intercept parameters, and is usually assumed as a default specification in Bayesian regression when dealing with standardized variables. Figure 1 provides an overview of the proposed integrative modeling approach, with reference to the application to the Human Microbiome Project data we describe later.

Fig. 1

Schematic overview of the proposed integrative Bayesian approach for the application to data from the Human Microbiome Project. The observed data counts (right) are regressed on the available covariates (left), through a variable selection approach, which informs the (unknown) population abundance of each taxon

MCMC algorithm

We implement a stochastic search Markov Chain Monte Carlo (MCMC) algorithm for posterior inference that employs a Gibbs scan to sample the non-zero regression coefficients [37]. We encourage an efficient sampling by employing a component-wise adaptive Metropolis algorithm [38] as described below. A generic iteration of the MCMC algorithm comprises the following steps: Update of : This is a Metropolis-Hastings step with a symmetric random walk proposal , for j=1,…,J. Joint update of ( , ): We sample these parameters jointly via a Gibbs scan that employs a Metropolis acceptance step. For each j=1,…,J and p=1,…,P: if ξ =1: propose and . if ξ =0: propose and then propose following an adaptive Metropolis-Hasting scheme where is the current estimate of the variance of the target distribution. The value of is updated using a recursive formula as in [39] on all the previous draws for β . Accept () with probability where , and . For posterior inference, we are interested in identifying the relevant associations between taxa and covariates as captured by the selection indicators ξ ’s and the corresponding regression coefficients β ’s. Estimates of the marginal posterior probabilities of inclusion (PPIs) of the latent indicators ξ can be calculated by counting the number of times that each taxa/covariate association is included across the MCMC iterations. A selection of the significant associations can then be made by choosing those elements that have marginal PPIs greater than a specific value, for example greater than 0.5 for the median probability model of [40]. Another choice for the threshold which controls for multiplicity [41] relies on an estimated pre-specified Bayesian false discovery rate α calculated as where . An optimal threshold c ′ can be found for error rate α by choosing c ′ such that . Estimates of the non-zero regression coefficients β can also be calculated by averaging over the sampled MCMC values. In order to compare selection performance of different methods, we calculate accuracy, false positive rate (FPR), false negative rate (FNR) and Matthews correlation coefficient (MCC), across 30 replicated datasets. We define accuracy as ACC = (TP + TN)/(P + N), with TP the number of true positives out of P selected and TN the number of true negatives out of N not selected. The false negative rate is calculated as FNR = FN/(FN + TP), the false positive rate as FPR = FP/(FP + TN), and the Matthews correlation coefficient as with N=TN+TP+FN+FP, and [42]. Since the MCC balances TP and FP counts, and can be used even if the classes are of very different sizes, it is generally regarded as one of the most appropriate measures of classification accuracy. We further computed receiving operating curves (ROC) to compare the performance of the selection procedure across the different methods.

Comparison study on simulated data

We carry out a simulation study to assess the performance of our model and compare results to alternative methods. More specifically, we consider two methods which have been specifically employed for the integrative analysis of microbiome data: the penalized approach of Chen and Li [26], and the false discovery rate-corrected pair-wise correlation tests considered in [19]. In addition, we consider the factorized maximum a posteriori (MAP) Bayesian lasso of [43], a recently proposed general statistical method for conducting variable selection in multivariate count-response regression. When fitting the Bayesian Gamma Lasso method of [43], model selection was done using the minimum AIC, while for Chen and Li’s approach the minimum BIC was calculated with the group penalty set to 20%. We also fit the method of Chen and Li to the untransformed data. The false discovery rate threshold for the Spearman’s correlation tests was set to 0.05. In simulating data, we set n=100, P=50 and J=50, and chose P =9 and J =5 to obtain a total number of relevant taxa/covariate associations equal to 25. We simulated the covariate matrix X according to a Multivariate-Normal (0,Σ) with Σ =ρ | and ρ=0.4. We drew each vector y of counts from a Dirichlet-Multinomial distribution as follows. For i=1,…,n, , with the row sum N ∼DiscreteUnif[1,0000;2,000], and . For , we set , j=1,…,J, with γ = exp{α +X }, and ψ∈ [ 0,1] an overdispersion parameter. When ψ→0, the simulated values approximate a Multinomial () distribution, while for large ψ, the sampled values are more disperse. Here, we set ψ=0.01. We sampled the non-zero β ’s from the intervals ±[ 0.5,1.0] and the intercept parameters from a Uniform(–2.3, 2.3). Below we report performance results as averages over 30 replicated simulated datasets. When running the MCMC, we used a vague prior for the intercept by setting the variance parameter to . Similarly, we set , to provide sufficiently vague prior information on the non-zero log-linear regression coefficients. Finally, we set m=0.01 (or a=0.02 and b=1.98), resulting in a sparse prior mean on selected associations of 1% of the total. We provide comments on the sensitivity of the selection results to the choice of these hyperparameters in the Section below. We ran the MCMC algorithm for 10,000 iterations and thinned to every fifth iteration. On a single dataset, the C code took approximately 31.5 min to run on an Intel Xeon E5-2630 2.30 GHz processor. We assessed convergence visually and via the Geweke diagnostic [44] as implemented in the R package coda. Convergence was checked for a) the number of active variables in each iteration and b) the samples from each of the selected β . The five number summary of the 25 Geweke z-scores was (–3.43, –1.06, –0.63, 0.71, 1.98).

Inferring associations between taxonomic abundances and metabolic pathways

We demonstrate our approach on publicly available data obtained from the Human Microbiome Project (HMP) website [29] from which we use 79 samples from healthy individuals. The Y matrix in our model contains 16S rRNA microbial counts from stool samples at the genus taxonomic level. As common in microbiome studies, the genera abundances (Bacteroides, Prevotella, etc.) were filtered by requiring each genus to be present in at least 5% of the samples. This procedure removes extremely low-abundance genera leaving 80 genera for the analysis. From the same 79 individuals, we obtained KEGG orthology group abundances which are used as the matrix of covariates X of our model. The KEGG orthology groups were reconstructed from metagenomic shotgun sequencing (WGS) using the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline [45] and were also provided on the HMP website. These values represent inferred abundances of biochemical functional groups and metabolic pathways present due to the shotgun sequenced reads of bacterial and non-bacterial genes in the sample. To reduce correlation among the covariates we used average linkage clustering on the correlation matrix of the KEGG groups and chose one representative from each cluster, according to its relevance to microbiome research, leaving 76 columns in X. Finally, the columns in X were mean centered and scaled to unit variance. Though the HMP sampled 300 individuals for several timepoints and over many sites, there were relatively few samples that included the WGS used to obtain the KEGG orthology data. Thus, when joining the samples from the 16S rRNA data and the KEGG orthology data, a total of 79 matched samples remained. We used the same hyperparameter settings as in the simulation study, that is and and set m=0.01, resulting in a sparse mean selection prior of 1% of the total 6,080 possible associations. The MCMC algorithm described in “MCMC algorithm” section above was run for 500,000 iterations and thinned to every 100th draw. We assessed convergence visually and via the Geweke diagnostic [44] as implemented in the R package coda. The five number summary of the Geweke z-scores for the 26 β ’s was (–3.83, –1.19, 0.15, 1.46, 3.38).

Results

Simulation study

In Fig. 2 we show the plot of the marginal PPIs of the P×J elements ξ , obtained by computing the proportion of times that ξ =1 across all iterations, after burn-in. The selected median model, corresponding to a threshold of 0.5 on the PPIs, results in a false positive rate of 0.0004 and a false negative rate of 0.04. The value of the AUC for this replicate was 0.99.

Fig. 2

Simulated data: Marginal posterior probabilities of inclusion (PPI) for each coefficient β jp, j=1,…,50, p=1,…,50, describing the association between each taxa and each covariate. Each PPI is obtained by averaging the number of times that each taxa/covariate association is included across the MCMC iterations, after burn-in. The true associations are indicated as red dots Figure 3 illustrates the selection performance of the proposed method, by plotting the average ROC curves over the 30 replicated datasets (ψ=0.01) for each of the methods included in the comparison. The Figure shows that our proposed model outperforms the competing methods in terms of achieved average true and false positive rates.

Fig. 3

Simulated data: Comparison results of selection performances (ROC curves). DMBVS: Dirichlet–Multinomial Bayesian Variable Selection (our method), MAPBL : Maximum A Posteriori Bayesian Lasso, CORTEST: Multiplicity Corrected Correlation Tests as in Wu et al. (2010), C&L: composite penalty from Chen and Li (2013) As an additional comparison, together with the total number of correctly identified regression parameters, which we term “overall recovery”, we also looked at the “taxa-wise recovery”, which we defined as the correct recovery of any element from one of the J taxa. Thus, recovery for overall selection occurs for P×J elements while taxa-wise selection occurs for J elements. Table 1 reports average values for accuracy, FPR, FNR and MCC, averaged across the 30 replicated datasets, for both overall and taxa-wise recovery. These results show that our method in particular outperforms competing methods for taxa-wise recovery. In the same Table we report results for a more challenging simulated scenario, obtained with a higher value of the overdispersion parameter (ψ=0.1). As expected, the increase in overdispersion makes the selection task more difficult for all methods. However, our method still outperforms or is commensurate with the competing methods, even in the presence of considerable overdispersion.

Table 1

Simulated data: performance assessment for two different scenarios, characterized by different values of the dispersion parameter ψ

	Overall				Taxa
	DMBVS	MAPBL	C&L	CORTEST	DMBVS	MAPBL	C&L	CORTEST
ψ=0.01
MCC	0.93	0.64	0.67	0.73	0.89	0.66	0.50	0.85
FNR	0.05	0.10	0.12	0.31	0.00	0.46	0.43	0.02
FPR	0.00	0.01	0.01	0.00	0.05	0.01	0.09	0.06
Accuracy	1.00	0.99	0.99	1.00	0.96	0.91	0.85	0.95
ψ=0.1
MCC	0.72	0.42	0.54	0.56	0.73	0.40	0.38	0.70
FNR	0.39	0.58	0.28	0.63	0.24	0.73	0.52	0.37
FPR	0.00	0.01	0.01	0.00	0.05	0.01	0.12	0.02
Accuracy	1.00	0.99	0.99	1.00	0.92	0.86	0.81	0.92

Values are rounded averages over thirty replicates. Results for Matthews’ Correlation Coefficient, Falso Positive Rate, False Negative Rate, and Accuracy, are based on the median probability model. DMBVS: Dirichlet–Multinomial Bayesian Variable Selection (our method), MAPBL: Maximum A Posteriori Bayesian Lasso, C&L: composite penalty from Chen and Li (2013), CORTEST: Multiplicity Corrected Correlation Tests as in Wu et al. (2010)

Simulated data: performance assessment for two different scenarios, characterized by different values of the dispersion parameter ψ Values are rounded averages over thirty replicates. Results for Matthews’ Correlation Coefficient, Falso Positive Rate, False Negative Rate, and Accuracy, are based on the median probability model. DMBVS: Dirichlet–Multinomial Bayesian Variable Selection (our method), MAPBL: Maximum A Posteriori Bayesian Lasso, C&L: composite penalty from Chen and Li (2013), CORTEST: Multiplicity Corrected Correlation Tests as in Wu et al. (2010)

Sensitivity analysis

Since our proposal requires the choice of a number of hyperparameters, it is important to investigate how sensitive the results are to varying parameter sets. Therefore, we conclude our simulation study by briefly discussing the sensitivity of the results to the prior specifications. In general, we found that results were robust to the prior choices on the intercept parameters, α , while, as expected, some sensitivity was observed with respect to the variance hyperparameters of the spike-and-slab prior (3) on the regression coefficients, β , and the hyperparameters of the Beta priors on p . In Table 2 we report results obtained by considering a full grid of values for the prior expected value of p , i.e. m∈{0.005,0.01,0.05}, and the slab variance, . In the Additional file 1, we further report the corresponding ROC curves. With only 25 truly non-zero β ’s, out of 2,500 parameters, small increases in false positive rates can drastically decrease the Matthews correlation coefficient. Thus imposing some sparsity by using a smaller value for m improves overall performance while larger values of m allow for more false positives. The results appear to suggest that assuming moderate sparsity a priori (e.g., m=0.01) generally leads to good operating characteristics. Similarly, when the slab variance is small, e.g. , there is more prior density close to zero, which allows small but insignificant variables to be selected. Conversely, when the slab variance is large, e.g. , false positives are less likely but false negatives increase, since the prior density is spread more evenly over the support. Therefore, an intermediate value, e.g. , provides a reasonable compromise, which favors relatively large effect sizes and a small number of false positives. In the Additional file 1, we also report the performance of our method for varying values of the over-dispersion parameter ψ and the sample size n. As expected, the results show that the performances improve for larger sample sizes and decreasing overdispersion.

Table 2

Simulated data: sensitivity analysis for varying values of the prior expected value of p , m, and the slab variance , and for two different scenarios, characterized by different values of the dispersion parameter ψ

	m=0.005			m=0.01			m=0.05
	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 1$\end{document}rpj2=1	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 10$\end{document}rpj2=10	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 100$\end{document}rpj2=100	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 1$\end{document}rpj2=1	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 10$\end{document}rpj2=10	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 100$\end{document}rpj2=100	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 1$\end{document}rpj2=1	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 10$\end{document}rpj2=10	\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$r_{pj}^{2} = 100$\end{document}rpj2=100
ψ=0.01
MCC	0.69	0.93	0.96	0.69	0.93	0.95	0.69	0.93	0.95
FPR	0.01	0.00	0.00	0.01	0.00	0.00	0.01	0.00	0.00
FNR	0.02	0.04	0.08	0.02	0.05	0.09	0.02	0.05	0.08
Accuracy	0.99	1.00	1.00	0.99	1.00	1.00	0.99	1.00	1.00
AUC	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
ψ=0.1
MCC	0.53	0.73	0.72	0.53	0.73	0.71	0.52	0.73	0.72
FPR	0.01	0.00	0.00	0.01	0.00	0.00	0.01	0.00	0.00
FNR	0.26	0.37	0.46	0.26	0.37	0.47	0.26	0.37	0.47
Accuracy	0.99	1.00	1.00	0.99	1.00	1.00	0.99	1.00	1.00
AUC	0.96	0.96	0.93	0.96	0.96	0.93	0.95	0.95	0.93

Values are averages over 30 replicates

Data analysis

Figure 4 shows the traceplot of the number of included taxa/covariate associations and the plot of the marginal PPIs of the P×J elements ξ , obtained by computing the proportion of times that ξ =1 across all iterations, after burn-in. Here the median model, corresponding to a threshold of 0.5 on the PPIs, selects 92 associations. Among those, 26 have a marginal PPI greater than 0.98, which corresponds to a Bayesian FDR of 0.1. These 26 associations are listed in Table 3, together with the corresponding estimated regression coefficients, and depicted in Figs. 5 and 6, for positive and negative associations, respectively. In these Figures, the magnitude of the association, as captured by the estimated β ’s, is proportional to the width of the edges, with red lines indicating negative associations and blue lines positive associations. As a comparison, the method by Chen and Li [26] identified 120 associations, whereas the Bayesian Lasso of [43] and the correlation test-based method of [19] identified, respectively, 220 and 711 associations. Those results appear to confirm the sparser selection achieved by our method, consistently with the results of the simulation study.

Fig. 4

Table 3

Real data: selection results using a BFDR of 0.1

KEGG ID	Pathway	Taxa	MPPI	β _pj
ko04660	T cell receptor signaling pathway	g.Subdoligranulum	1.00	0.40
ko04512	ECM-receptor interaction	g.Subdoligranulum	1.00	0.44
ko00680	Methane metabolism	g.Sutterella	1.00	-1.47
ko05200	Pathways in cancer	o.Bacteroidales	1.00	-1.04
ko04540	Gap junction	o.Bacteroidales	1.00	-0.92
ko00534	Glycosaminoglycan biosynthesis	g.Ruminococcus	1.00	-0.56
ko00053	Ascorbate and aldarate metabolism	g.Ruminococcus	1.00	-0.46
ko00650	Butanoate metabolism	g.Ruminococcus	1.00	-0.55
ko00513	Various types of N-glycan biosynthesis	g.Parabacteroides	1.00	0.54
ko00500	Starch and sucrose metabolism	g.Parabacteroides	1.00	-0.61
ko00904	Diterpenoid biosynthesis	g.Dialister	1.00	-1.21
ko00360	Phenylalanine metabolism	g.Dialister	1.00	-2.18
ko00626	Naphthalene degradation	g.Bacteroides	1.00	0.39
ko00010	Glycolysis / Gluconeogenesis	g.Bacteroides	1.00	-0.57
ko00513	Various types of N-glycan biosynthesis	g.Prevotella	1.00	0.77
ko00512	Mucin type O-Glycan biosynthesis	g.Prevotella	1.00	-1.06
ko00620	Pyruvate metabolism	g.Prevotella	1.00	-1.76
ko04340	Hedgehog signaling pathway	g.Ruminococcus	1.00	0.45
ko04660	T cell receptor signaling pathway	g.Roseburia	1.00	0.46
ko00983	Drug metabolism	g.Sutterella	1.00	-1.20
ko00260	Glycine, serine and threonine metabolism	o.Bacteroidales	1.00	0.88
ko00330	Arginine and proline metabolism	g.Ruminococcus	0.99	0.39
ko01057	Biosynthesis of type II polyketide products	o.Bacteroidales	0.99	0.76
ko00620	Pyruvate metabolism	g.Ruminococcus	0.99	0.56
ko00920	Sulfur metabolism	f.Prevotellaceae	0.99	1.36
ko00010	Glycolysis/Gluconeogenesis	o.Clostridiales	0.98	0.54

The text in the KEGG column is hyperlinked to the KEGG orthology database for a more complete description of the selected pathways. Taxa names start with “g.”, “f.” or “o.” which stand for genus, family, or order, respectively, and correspond to the lowest taxonomic classification available

Fig. 5

Real data: Selected positive taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β , is proportional to the width of the edges

Fig. 6

Real data: Selected negative taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β , is proportional to the width of the edges

Real data: Marginal posterior probabilities of inclusion (PPI) for each coefficient β jp, in Eq. (2), describing the association between each taxa and each covariate. Each PPI is obtained by averaging the number of times that each taxa/covariate association is included across the MCMC iterations, after burn-in. Here, the median model, corresponding to a threshold of 0.5 on the PPIs, selects 92 associations. Among those, 26 have a marginal PPI greater than 0.98, which corresponds to a Bayesian FDR of 0.1. These 26 associations are indicated as red dots Real data: Selected positive taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β , is proportional to the width of the edges Real data: Selected negative taxa-by-covariate associations. The magnitude of the association, as captured by the median of the MCMC draws for each β , is proportional to the width of the edges Real data: selection results using a BFDR of 0.1 The text in the KEGG column is hyperlinked to the KEGG orthology database for a more complete description of the selected pathways. Taxa names start with “g.”, “f.” or “o.” which stand for genus, family, or order, respectively, and correspond to the lowest taxonomic classification available

Discussion

A close investigation of the biological significance of the associations identified by our model reveals several interesting characteristics and affirms the relevance of these associations. Commensal microbiota that inhabit the human gut are proficient at scavenging glycans and polysaccharides, including those in plants, such as starches or cellulose, animal-derived tissues (glycosaminoglycans and N-linked glycans), and glycans from host mucus (O-linked glycans) [46]. Ruminococcus spp. are known to participate in both resistant starch and glycosaminoglycan degradation [46, 47]. It has been reported that long-term consumption of diets rich in protein and animal fat were associated with an enterotype primarily containing increased Bacteroides and Ruminococcus species [19]. Additionally, Ruminococcus torques and Ruminococcus gnavus have been shown to degrade mucins [48]. Thus, it is logical that Ruminococcus, which is one of the noteworthy genera involved in glycosaminoglycan degradation, would be negatively associated to glycosaminoglycan biosynthesis (ko00534) (Table 3). Similarly, Parabacteroides which is also negatively associated with N-glycan biosynthesis (ko00513), is involved in deglycosylation and utilization of N-glycans [49]. Also, among the associations identified for the glycan pathways, Prevotella was negatively associated with mucin type O-glycan biosynthesis (ko00512). In the literature, Prevotella has implications for mucosal homeostasis, as some Prevotella spp. express a unique mucin-desulfating glycosidase that can hydrolyze GlcNAc residues on mucin-type O-glycans, and thus is important for mucin degradation [50]. Other associations affirmed through the literature included that of Bacteroides with naphthalene degradation (ko00626). It has been reported that Bacteroidetes possess the capability to degrade polycyclic aromatic hydrocarbons such as naphthalene [51]. Associations of Ruminococcus with pyruvate metabolism (ko00620) are also supported, as phosphoenolpyruvate carboxykinase was previously reported to be associated with Ruminococcus flavefaciens in the rumen [52]. Another supported association is that of Prevotellaceae with sulfur metabolism. L-cysteine desulfhydrase enzymes have been characterized in Prevotella intermedia [53]. Additionally, glycosulfatase enzymes have been described in Prevotella [54]. Equally interesting is the selection of pathways that are expected to be omnipresent among many bacteria, such as glycolysis/gluconeogenesis (ko00010), as glycolysis occurs, with variations, in nearly all organisms, both aerobic and anaerobic. Thus, it is not surprising that taxa like Clostridiales are positively associated with glycolysis/gluconeogenesis as they are abundant taxa within the gut microbiome. Given the complexity of metabolic pathways and the process of mapping specific genes to pathways, some of the selected associations are unexpected, and might be due to the 16S abundances that were made available at the HMP site and the mapping of metagenomic sequences to specific KEGG orthology groups by HUMAnN. For example, several species of Ruminococcus are known to participate in butanoate (butyrate) metabolism [55], Dialister spp. have phenylalanine arylamidase activities [56], and Prevotella spp. are known to participate in pyruvate metabolism [57, 58]. Since those associations should be driven exclusively by bacterial genes, it is interesting that we find significant associations between the abundance of certain bacterial taxa and KEGG pathways that are primarily reported among eukaryotic species (i.e., T-cell receptor signaling, hedgehog signaling, pathways in cancer, etc.). Indeed, although precautionary steps are performed, the HMP consortium reported that human contaminants are found in 50–90% of the sequences [15]. This might also explain the negative association exhibited by Bacteriodes and glycolysis/gluconeogenesis. These unexpected findings suggest the need for further investigations and validation.

Conclusion

Herein, we have developed a Bayesian approach to the Dirichlet-Multinomial regression models that allows for the selection of significant associations between covariates and taxa from a microbiome abundance table by imposing spike-and-slab priors on the log-linear regression coefficients of the model. We have applied our model to simulated data and compared performances with methods developed for similar applications. We have illustrated the performance of our method using publicly available data on taxonomic abundances and metabolic pathways inferred from whole genome shotgun sequencing reads, which we obtained from the Human Microbiome Project website. Our results have revealed interesting links between specific taxa (i.e. genera) and particular metabolic pathways, which we have validated via existing literature. Several extensions of our model are possible. Because some habitats, e.g. the gut, are thought to have highly variable dynamics, longitudinal sampling may be preferred to cross-sectional sampling since it may give a better sense of long-term trends [59]. Thus, incorporating repeated samples with specified correlation structures in the linear predictor could produce additional insights. Another important aspect of microbiome data, which is receiving attention from researchers, is the heterogeneity in community structure across samples, as this can be an indication of the existence of “enterotypes” [60, 61]. This can be addressed within our modeling framework by employing Bayesian nonparametric models that would allow to cluster selected associations across partitions of the samples. These extensions are currently under investigation.

49 in total

1. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors.

Authors: Andrew K Benson; Scott A Kelly; Ryan Legge; Fangrui Ma; Soo Jen Low; Jaehyoung Kim; Min Zhang; Phaik Lyn Oh; Derrick Nehrenberg; Kunjie Hua; Stephen D Kachman; Etsuko N Moriyama; Jens Walter; Daniel A Peterson; Daniel Pomp
Journal: Proc Natl Acad Sci U S A Date: 2010-10-11 Impact factor: 11.205

2. A novel mechanism for desulfation of mucin: identification and cloning of a mucin-desulfating glycosidase (sulfoglycosidase) from Prevotella strain RS2.

Authors: Jung-hyun Rho; Damian P Wright; David L Christie; Keith Clinch; Richard H Furneaux; Anthony M Roberton
Journal: J Bacteriol Date: 2005-03 Impact factor: 3.490

3. Structure-constrained sparse canonical correlation analysis with an application to microbiome data analysis.

Authors: Jun Chen; Frederic D Bushman; James D Lewis; Gary D Wu; Hongzhe Li
Journal: Biostatistics Date: 2012-10-15 Impact factor: 5.899

4. Dominant and diet-responsive groups of bacteria within the human colonic microbiota.

Authors: Alan W Walker; Jennifer Ince; Sylvia H Duncan; Lucy M Webster; Grietje Holtrop; Xiaolei Ze; David Brown; Mark D Stares; Paul Scott; Aurore Bergerat; Petra Louis; Freda McIntosh; Alexandra M Johnstone; Gerald E Lobley; Julian Parkhill; Harry J Flint
Journal: ISME J Date: 2010-08-05 Impact factor: 10.302

5. The microbiome-gut-brain axis: from bowel to behavior.

Authors: J F Cryan; S M O'Mahony
Journal: Neurogastroenterol Motil Date: 2011-03 Impact factor: 3.598

6. Comparisons of distance methods for combining covariates and abundances in microbiome studies.

Authors: Julia Fukuyama; Paul J McMurdie; Les Dethlefsen; David A Relman; Susan Holmes
Journal: Pac Symp Biocomput Date: 2012

7. Linking long-term dietary patterns with gut microbial enterotypes.

Authors: Gary D Wu; Jun Chen; Christian Hoffmann; Kyle Bittinger; Ying-Yu Chen; Sue A Keilbaugh; Meenakshi Bewtra; Dan Knights; William A Walters; Rob Knight; Rohini Sinha; Erin Gilroy; Kernika Gupta; Robert Baldassano; Lisa Nessel; Hongzhe Li; Frederic D Bushman; James D Lewis
Journal: Science Date: 2011-09-01 Impact factor: 47.728

8. Hypothesis testing and power calculations for taxonomic-based human microbiome data.

Authors: Patricio S La Rosa; J Paul Brooks; Elena Deych; Edward L Boone; David J Edwards; Qin Wang; Erica Sodergren; George Weinstock; William D Shannon
Journal: PLoS One Date: 2012-12-20 Impact factor: 3.240

9. Chapter 12: Human microbiome analysis.

Authors: Xochitl C Morgan; Curtis Huttenhower
Journal: PLoS Comput Biol Date: 2012-12-27 Impact factor: 4.475

10. AmpliconDuo: A Split-Sample Filtering Protocol for High-Throughput Amplicon Sequencing of Microbial Communities.

Authors: Anja Lange; Steffen Jost; Dominik Heider; Christina Bock; Bettina Budeus; Elmar Schilling; Axel Strittmatter; Jens Boenigk; Daniel Hoffmann
Journal: PLoS One Date: 2015-11-02 Impact factor: 3.240

9 in total

1. Bayesian variable selection for multivariate zero-inflated models: Application to microbiome count data.

Authors: Kyu Ha Lee; Brent A Coull; Anna-Barbara Moscicki; Bruce J Paster; Jacqueline R Starr
Journal: Biostatistics Date: 2020-07-01 Impact factor: 5.899

2. EDClust: an EM-MM hybrid method for cell clustering in multiple-subject single-cell RNA sequencing.

Authors: Xin Wei; Ziyi Li; Hongkai Ji; Hao Wu
Journal: Bioinformatics Date: 2022-05-13 Impact factor: 6.931

3. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data.

Authors: Johannes Ostner; Salomé Carcy; Christian L Müller
Journal: Front Genet Date: 2021-12-07 Impact factor: 4.599

4. Bayesian biclustering for microbial metagenomic sequencing data via multinomial matrix factorization.

Authors: Fangting Zhou; Kejun He; Qiwei Li; Robert S Chapkin; Yang Ni
Journal: Biostatistics Date: 2022-07-18 Impact factor: 5.279

5. Erratum to: An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data.

Authors: W Duncan Wadsworth; Raffaele Argiento; Michele Guindani; Jessica Galloway-Pena; Samuel A Shelburne; Marina Vannucci
Journal: BMC Bioinformatics Date: 2017-03-23 Impact factor: 3.169

6. An Exercise Intervention to Unravel the Mechanisms Underlying Insulin Resistance in a Cohort of Black South African Women: Protocol for a Randomized Controlled Trial and Baseline Characteristics of Participants.

Authors: Julia H Goedecke; Amy E Mendham; Louise Clamp; Pamela A Nono Nankam; Melony C Fortuin-de Smidt; Lindokuhle Phiri; Lisa K Micklesfield; Dheshnie Keswell; Nicholas J Woudberg; Sandrine Lecour; Ali Alhamud; Mamadou Kaba; Faith M Lutomia; Paul J van Jaarsveld; Anniza de Villiers; Steven E Kahn; Elin Chorell; Jon Hauksson; Tommy Olsson
Journal: JMIR Res Protoc Date: 2018-04-18

7. scCODA is a Bayesian model for compositional single-cell data analysis.

Authors: M Büttner; J Ostner; C L Müller; F J Theis; B Schubert
Journal: Nat Commun Date: 2021-11-25 Impact factor: 14.919

8. Negative binomial factor regression with application to microbiome data analysis.

Authors: Aditya K Mishra; Christian L Müller
Journal: Stat Med Date: 2022-04-24 Impact factor: 2.497

9. Proteomics data analysis using multiple statistical approaches identified proteins and metabolic networks associated with sucrose accumulation in sugarcane.

Authors: Ao-Mei Li; Zhong-Liang Chen; Cui-Xian Qin; Zi-Tong Li; Fen Liao; Ming-Qiao Wang; Prakash Lakshmanan; Yang-Rui Li; Miao Wang; You-Qiang Pan; Dong-Liang Huang
Journal: BMC Genomics Date: 2022-07-22 Impact factor: 4.547

9 in total