Literature DB >> 35466418

Negative binomial factor regression with application to microbiome data analysis.

Aditya K Mishra¹, Christian L Müller^1,2,3.

Abstract

The human microbiome provides essential physiological functions and helps maintain host homeostasis via the formation of intricate ecological host-microbiome relationships. While it is well established that the lifestyle of the host, dietary preferences, demographic background, and health status can influence microbial community composition and dynamics, robust generalizable associations between specific host-associated factors and specific microbial taxa have remained largely elusive. Here, we propose factor regression models that allow the estimation of structured parsimonious associations between host-related features and amplicon-derived microbial taxa. To account for the overdispersed nature of the amplicon sequencing count data, we propose negative binomial reduced rank regression (NB-RRR) and negative binomial co-sparse factor regression (NB-FAR). While NB-RRR encodes the underlying dependency among the microbial abundances as outcomes and the host-associated features as predictors through a rank-constrained coefficient matrix, NB-FAR uses a sparse singular value decomposition of the coefficient matrix. The latter approach avoids the notoriously difficult joint parameter estimation by extracting sparse unit-rank components of the coefficient matrix sequentially, effectively delivering interpretable bi-clusters of taxa and host-associated factors. To solve the nonconvex optimization problems associated with these factor regression models, we present a novel iterative block-wise majorization procedure. Extensive simulation studies and an application to the microbial abundance data from the American Gut Project (AGP) demonstrate the efficacy of the proposed procedure. In the AGP data, we identify several factors that strongly link dietary habits and host life style to specific microbial families.

Entities: Chemical

Keywords: American Gut Project; microbiome; multivariate analysis; overdispersed count data; reduced rank regression; sparse singular value decomposition

Mesh：

Year: 2022 PMID： 35466418 PMCID： PMC9325477 DOI： 10.1002/sim.9384

Source DB: PubMed Journal: Stat Med ISSN： 0277-6715 Impact factor: 2.497

INTRODUCTION

The human microbiome, the collection of microbes that reside on or within human tissues and fluids, has formed intricate ecological relationships with the host over the course of co‐evolution. Advances in next‐generation amplicon sequencing technology and analysis techniques have enabled the direct identification of microbial species compositions and abundances in their natural habitat. These approaches have revealed considerable variability in both composition and diversity across different body sites and allowed the estimation of potential associations between the microbiome and the underlying health condition of the subject. For instance, differential abundances of the gut microbiome have been linked to medical conditions such as inflammatory bowel disease (IBD), irritable bowel syndrome (IBS), type 2 diabetes, obesity, and neurological disorders. The gut microbiome also makes considerable contribution to metabolic functioning, for example, by breaking down specific food components. Dietary changes can thus induce considerable shifts in gut microbial compositions. Similar intricate relationships between the environment and the microbiome are also known in other ecosystems. For instance, the soil microbiome plays a significant role in the cycle of carbon and nitrogen fixation, thus having direct implications for plant growth. The soil microbiome also shows large variability with respect to soil conditions such as pH, temperature, moisture, and spatial location. Likewise, cyanobacteria in the marine ecosystems contribute to a large extent to the ocean's primary productivity, yet exhibit considerable abundance variability across location, season, and water conditions. , Amplicon‐based microbiome survey data are derived from samples of the habitat of interest, for example, the human gut, where variable regions of the bacterial and archaeal 16S ribosomal RNA are experimentally extracted and sequenced. These marker gene sequences serve as a proxy to the underlying bacterial taxon abundances and are summarized in operational taxonomic units (OTUs) or amplicon sequence variants (ASVs). , Reference databases are used to identify the (approximate) taxonomy of the representative microbial sequences. Bioinformatic workflows and databases, such as, for example, QIIME‐2 or the Qiita framework, allow standardized processing of and access to these OTU/species counts. In addition, large‐scale microbiome survey studies such as the American Gut Project (AGP), the Human Microbiome Project (HMP), and the Earth Microbiome Project also collect large/high‐dimensional host‐ or environment‐associated covariate data. These survey data reach a level of scale and completeness that, in principle, allows to make quantitative predictions about the relationship between host‐associated factors and microbial abundance patterns. For example, the AGP data comprises hundreds of host‐associated features, including variables indicating dietary intake, medical conditions, medication use, participants' demography, and life style. Using the AGP data as representative microbiome data resource, we here introduce a statistical factor regression framework that allows the identification of key associations between host‐related features and microbial taxa. While recent work has already identified individual host factors that confound microbial abundance patterns in relation to specific disease phenotypes, we propose a general factor model that simultaneously takes into account all relevant host‐associated covariates and links them to the observed microbial abundances, independent of a specific downstream task. Since the observed microbial abundances across all levels of taxonomic aggregation come in form of overdispersed count data, we base our model on the classical negative binomial (NB) distribution (see Figure 1). NB models are common place in genomics and microbiome data analysis. For example, the popular DESeq2 package, used extensively in differential expression testing in bulk RNA‐Seq data, uses the NB model as underlying model for total mRNA transcript abundance. Due to technical and experimental limitations, microbial count data, however, carry only relative information and show varied sequencing depth across samples. To mitigate these limitations, microbiome data require transformation/normalization approaches prior to statistical modeling. Important examples include rarefying samples to a common sequencing depth or scaling using factors such as a cumulative sum, median, upper quartile, or the total sum, the latter of which leading to compositional or relative abundance data. A particularly popular approach for NB modeling of microbial count data is common sum scaling, as put forward by McMurdie and Holmes. When modeling microbial count data on the OTU/ASV level, zero‐inflated extensions of the NB model have been proposed to account for the excess number of zeros in the data (see, eg, Reference 21 for a critical assessment).

FIGURE 1

Regression model for the overdispersed microbial abundance data of count types in terms of the covariates and the control variables . The upper panel presents the true generative model with parameters and the lower panel presents the three approaches for estimating the parameters. (A) The generalized linear model of negative binomial regression (NB‐GLM) estimates each of the columns of separately; (B) negative binomial reduced rank regression (NB‐RRR) jointly estimates a low‐rank coefficient matrix as ; (C) negative binomial co‐sparse factor regression (NB‐FAR) jointly estimates a low‐rank coefficient matrix as with sparse singular vectors Alternative generative statistical modeling approaches include the Dirichlet multinomial (mixture) framework, latent Dirichlet allocation, , and Poisson distribution models (including their respective zero‐inflated extensions). Several models also allow the inclusion of host or environmental covariate data in generative modeling, including Poisson factor models, , , latent Dirichlet allocation, and Bayesian Dirichlet multinomial models. , , Due to the abundance of excess zeros at the amplicon or species level, some of these models also include a zero‐inflation modeling component. This and the high dimensionality of the data at the OTU/ASV level make both estimation and biological interpretability challenging. Our modeling framework follows a different objective and focuses on accurate and parsimonious estimation of key host‐associated factors that influence microbial taxa at higher taxonomic ranks. This approach facilitates crisp, biologically interpretable statements about the underlying potential role of host features on broad microbial abundance patterns at the expense of taxonomic resolution. When summarizing microbiome survey data on a higher taxonomic level, the data remains overdispersed, yet contains no excess zeros. This makes the addition of zero‐inflated components in the model superfluous (see Figure S5 of the supplementary materials). Prior work already established that NB regression (NB‐GLM) is capable of relating individual taxa to the host‐associated features (see Figure 1A). This marginal model, however, ignores the fact that some of the taxa are likely influenced by a set of common factors, for example, age, diet, or life style. Here, we alleviate this shortcoming and introduce a NB factor regression framework that models microbial abundance data as outcome and covariates as predictors jointly while encoding the underlying dependencies in a parsimonious fashion. In the high‐dimensional multivariate linear regression setting, it is common practice to model the underlying dependency for dependent outcomes via a structured coefficient matrix (see Figure 1B,C). The model parameters are estimated by solving a regularized optimization problem. For example, reduced‐rank regression , , promotes information‐sharing among response and predictors through a low‐rank coefficient matrix. When covariates are high‐dimensional, sparsity is known to facilitate identifiability and better model interpretation. In the multivariate setting, this has been achieved via a sparse factorization of the model coefficient matrix. , , , When the outcome matrix comprises non‐Gaussian or mixed type variables, for example, Bernoulli‐type for binary outcomes and Poisson‐type for counts, Luo et al proposed mixed‐outcome reduced‐rank regression. Mishra et al proposed generalized co‐sparse factor regression (GO‐FAR) to model the outcome jointly under sparsity constraints. As we will show in the remainder of the article, these existing models are inappropriate for microbial taxon data due to the overdispersed nature of the counts. Instead, we propose negative binomial reduced rank regression (NB‐RRR) and negative binomial co‐sparse factor regression (NB‐FAR) to jointly model the microbial abundance data using the host‐associated features as covariates (see Figure 1B,C for details). NB‐RRR follows previous reduced rank regression frameworks by capturing the underlying dependencies among response and predictors via a low‐rank coefficient matrix. NB‐FAR extends the GO‐FAR framework and encodes the underlying dependency via a sparse singular value decomposition (SSVD) of the coefficient matrix. Following the estimation strategy of GO‐FAR, we extract unit‐rank components of the coefficient matrix sequentially, thus alleviating the challenging problem of joint estimation. There, each sequential step solves a co‐sparse unit rank estimation problem with a suitably adjusted offset term that accounts for the effects of previous steps. NB‐FAR thus models the associations of microbial abundance and host‐associated features via a few latent factors that comprise only a subset of predictors. Both NB‐RRR and NB‐FAR estimation procedures are implemented, tested, validated, and made publicly available in the R package nbfar, available on GitHub at https://github.com/amishra‐stats/nbfar. The remainder of the article is organized as follows. Section 2 provides the details of the NB‐RRR and NB‐FAR framework. Section 3 provides the details of the parameter estimation procedure. In Section 4, we present simulation studies to demonstrate the efficacy of the estimation procedures. Section 5 provides a detailed analysis of the AGP taxon data on the family level and an extensive set of host‐associated covariates using our NB factor regression methods. Section 6 discusses the findings and provides future research directions. Additional data analysis plots and the details of the estimation procedures are provided in the supplementary materials.

FACTOR MODELS FOR MICROBIOME DATA

As motivating example we consider the data from the AGP where samples from thousands of participants have been collected, sequenced, and processed to obtain microbial abundances. Each sample is associated with participant‐specific covariates that are related to diet, heath, and lifestyle. Our overall goal is to understand the associations of these covariates with the observed microbial abundance patterns. Let us denote the abundance/count data of taxa from samples as , the associated predictors/covariates as , and the control variables as . comprises variables, such as age and gender, that are held constant in an experiment and are thus fully adjusted for in the model. The observed taxon abundance data are overdispersed, that is, the variance of the taxa tends to be considerably larger than their mean. This fact motivates the use of a parametric framework based on the NB distribution to model the underlying associations between multivariate count outcome and factors . Using the alternative parameterization of the NB distribution, the generative model for the abundance of the th taxon in the th sample is given by where is the entry‐specific mean parameter and is the taxon‐specific shape parameter. Let us jointly represent the shape parameters of taxa by and the entry‐specific mean parameters by . For the generative model (1), and , that is, , making the model suitable for overdispersed count data. Then, the joint negative log‐likelihood is given by where . To associate the participant‐specific covariates to the microbial abundance, we link entry‐specific mean parameters to the linear predictors as where is a suitable link function that satisfies , is a fixed offset term, is the coefficient matrix corresponding to the predictors , and is the coefficient matrix corresponding to the control variables . The formulation includes an intercept in the model by setting the first column of to be , the vector of ones. Following the work of Zeileis et al and Anders and Huber, we choose as the link function so that any . Depending on the problem, one may choose another link function that satisfies . Unless otherwise stated, we write as . With the shape parameter fixed, the NB distribution (1) belongs to the exponential dispersion family. To utilize the form of this family, we define the corresponding natural parameter matrix as where . Again, unless otherwise stated, we will conveniently express as . We assume the outcomes to be conditionally independent given and . In terms of the natural parameter , we rewrite the negative log‐likelihood function (2) as where , is the trace of a square matrix and such that . For fixed shape parameter , it is straightforward to show that . This results in linking to the linear predictor via . To obtain an estimate of the model parameters, we minimize the objective function with respect to , where and . Minimizing with respect to is equivalent to separately fitting a NB regression model for each outcome, ignoring potential dependencies among covariates and responses. In microbiome data analysis, however, we often observe correlated microbial taxa abundances. Moreover, it is biologically plausible that certain groups of host covariates (eg, diet components, participants' life style) only influence specific subsets of bacterial clades. To capture such response‐predictors dependencies, we follow recent advances in multivariate mixed outcomes modeling , and introduce multivariate models for correlated predictors and interrelated responses via structured low‐rank and sparse coefficient matrices. The microbial abundance data model (1) with the rank constraint is referred to as negative binomial reduced rank regression, denoted by NB‐RRR. The low‐rank coefficient matrix implies a significantly lower number of effective model parameters, thus enabling, under certain assumptions, better estimation in high‐dimensional data problems. , , Any low‐rank coefficient matrix can be expressed as the product of any two low‐rank matrices. Based on the formulation of the linear predictor (3), we associate the responses to latent factors that are constructed as linear combinations of predictors . To ensure identifiability in the large‐dimensional setting, we may additionally assume that only a subset of predictors are relevant, and that the latent factors may be associated with only a subset of responses. This can be achieved by expressing as a product of two unique and identifiable low‐rank matrices that are entry‐wise sparse. Motivated by the recent work of Mishra et al, , we assume that the singular value decomposition (SVD) of the coefficient matrix in (1) is co‐sparse and decompose as where both the left singular vector matrix and the right singular vector matrix are assumed to be sparse, and is the diagonal matrix with the nonzero singular values on its diagonal. In the high‐dimensional setting, this formulation facilitates better interpretation. Additional orthogonality constraints in the formulation ensure that the SVD of is identifiable and the latent factors for are uncorrelated. Each of the right singular vectors in associates the corresponding latent factor to the microbial abundance outcome , and the singular values denote the strength of the association. We term the proposed model negative binomial co‐sparse factor regression, denoted by NB‐FAR. Since the majority of the available microbial count data only provides relative abundance information, it is common practice to scale or transform the data prior to statistical analysis. Finding a suitable normalization/transformation approach remains an active area of research in microbiome data analysis. , , Since both NB‐RRR and NB‐FAR are tailored toward modeling microbial count data , we facilitate scaling/normalization of the data via specifying the offset matrix in (3). Let denotes sample 's and taxon 's entry of . Similar to DESeq2, the R package nbfar offers several normalization schemes: where and , related common sum scaling. , , related to total sum scaling. where (similar to DESeq's size factors ). , that is, sample‐wise geometric mean scaling. We emphasize that each normalization approach can significantly influence the outcome of the analysis. Following McMurdie and Holmes 19, we use normalization as default setting throughout this study. For convenience, the package also enables the inclusion of a fully user‐defined offset matrix (eg, when data‐specific prior knowledge is available).

ESTIMATION PROCEDURES

Parameter estimation in both models requires minimizing a constrained, nonconvex negative log‐likelihood function with respect to . Joint estimation of the parameters satisfying the rank constraint (6) in the case of NB‐RRR and the orthogonality constraints (7) in the case of NB‐FAR is a notoriously difficult problem. In both cases, we solve the optimization problem using the majorization‐minimization (MM) approach, an alternating procedure that updates the blocks of parameters in cyclic order until convergence. In an update step, we minimize a convex surrogate that majorizes the objective function of the optimization problem.

Negative binomial reduced rank regression

The optimization problem to estimate the parameters of NB‐RRR is given by where . Let us denote the problem as . Unless otherwise stated, we will write as and as . Using the framework of MM, we minimize the objective function using an iterative procedure that cycles between ‐step, ‐step, and ‐step to update , , and , respectively, until convergence. In the ‐step, for fixed and , let us denote the natural parameter and as and , respectively. Suppose differentiable is L‐Lipschitz continuous gradient function for some constant that is, for any conformable . The statement holds for any such that ; see Section 1.1 of the supplementary material for details. Using the result, we majorize by a convex surrogate at a given and update the parameter as , where extracts SVD components of matrix . Similarly, in the ‐step, for fixed and , we denote as . Following the ‐step procedure, is also L‐Lipschitz continuous gradient function for some constant ; see Section 1.1 of the supplementary material for details. At any , we majorize and then update the parameter as . Finally, in the ‐step, we follow Zeileis et al to update each shape parameter using the Newton‐Raphson method for fixed and ; see Section 1.6 of the supplementary material for details. We compute the parameter estimates for where is the user‐specified conservative maximum rank, and select a rank using K‐fold cross‐validation. The iterative procedure is summarized in Algorithm 1.

Monotonically decreasing property of NB‐RRR

Using Algorithm 1, we estimate the parameters of NB‐RRR. The iterative procedure consists of minimizing several convex surrogates of the objective function with fixed Lipschitz constants . Let us jointly denote the updated parameters from ‐step, ‐step and ‐step after th iteration by . The sequence of parameter estimates obtained using Algorithm 1 satisfies for the constants and . We have relegated the proof of Theorem 1 to Section 1.2 of the supplementary material. In extensive simulation studies, we have found that the sequence always converges in practice.

Negative binomial co‐sparse factor regression

Joint estimation of the model parameters requires solving an optimization problem that minimizes in the presence of a sparsity‐inducing penalty on and the orthogonality constraint (7). The optimization problem requires the rank to be specified. Since existing optimization tools are computationally inefficient for the task, we extend the sequential extraction procedure, proposed by Mishra et al, , for NB‐FAR and estimate the unit‐rank components of , that is, , for . Let for be the estimate of the unit‐rank components. Then, to extract the th unit‐rank component in the th step of the sequential procedure, we solve the optimization problem where is a sparsity‐inducing penalty function with tuning parameter and with is the offset term. This problem is referred to as negative binomial co‐sparse unit‐rank estimation (NB‐CURE) with input parameters and penalty function , in short, NB‐CURE. Following Mishra et al, we use the elastic net penalty and its adaptive version for the th step as where the operator “” stands for the Hadamard product, is a prespecified weighting matrix, is a tuning parameter controlling the overall amount of regularization and controls the relative weights between the two penalty terms. In the th step of NB‐FAR, we let , where and is an initial estimate of . Here, we solve to obtain this initial estimate and extract . Assuming that the NB‐CURE problem can be solved for a suitable tuning parameter (see Section 3.2.1), NB‐FAR's estimation procedure is summarized in Algorithm 2.

Computation of negative binomial constrained unit‐rank regression (NB‐CURE)

The general form of the optimization problem for the th step of the sequential procedure, that is, , is given by where . In practice, we fix and denote as . Similar to NB‐RRR model estimation, we use the MM framework and solve the optimization problem using an iterative procedure that cycles between ‐step, ‐step, ‐step, and ‐step to update the parameters in blocks of , , , and , respectively, until convergence. Let us represent and , which are functions of , as and , respectively. In the ‐step, for fixed , has L‐Lipschitz continuous gradients for some where . Again, using this fact and following the NB‐RRR estimation procedure, we majorize by a convex surrogate and then minimize it to update the block variable as where is the element‐wise soft‐thresholding operator on any ; see Section 1.3 of the supplementary material for details. Similarly, in the ‐step, for fixed with , we show that is a L‐Lipschitz continuous gradient function for some and update the block variable as We apply the equality constraints in (9) to recover the estimate of from the estimate of the block variables . In the ‐step and the ‐step, we follow the corresponding step of NB‐RRR parameter estimation (see Algorithm 1) to update and , respectively. We have relegated the details of the convex surrogate function that majorizes the objective function, the computation of the constants (, , ), and the update steps to Section 1.3 of the supplementary material. We compute the parameter estimates for several values (the default is 50) in the range of to that are equi‐spaced on a log scale, where and . We apply K‐fold cross‐validation to select a tuning parameter . The iterative procedure is summarized in Algorithm 3.

Monotonically decreasing property of NB‐CURE

Using Algorithm 3, we solve the optimization problem of NB‐CURE to estimate the parameters . In the iterative procedure, the objective function (11) is majorized by a convex surrogate in each of the ‐step, ‐step, and ‐step, and then minimized. Let us jointly denote the updated parameters from ‐step, ‐step, ‐step, and ‐step after th iteration by . The sequence of parameters estimate obtained from Algorithm 3 satisfies for , and . We have relegated the proof of Theorem 2 to Section 1.4 of the supplementary material. Similar to NB‐RRR, we have found in extensive simulation studies that the sequence always converges in practice.

Handling missing outcome values in NB‐FAR

Besides microbiome data, the NB factor models may also prove useful for multivariate count data in other domains, including genomics, sports, image analysis, and text mining. A common scenario in these domains is the presence of missing entries in the outcome matrix . To highlight NB‐FAR's ability to account for missing entries, we can extend the framework of (5) by calculating the negative log‐likelihood as follows. Let us define an index set of the observed entries in as and denote the projection of onto by , where for any and otherwise. Following Mishra et al, we write the negative log‐likelihood function with incomplete data as where and . In case of missing entries in the outcome matrix , one should replace with and with and apply our proposed procedure to estimate the parameters. The same approach is applicable in the NB‐RRR model.

SIMULATION STUDIES

Setup

We compare the performance of NB‐RRR and NB‐FAR with GO‐FAR and NB‐GLM to showcase the efficacy of the proposed procedures in modeling multivariate overdispersed count data in the high/large‐dimensional settings. We evaluate the performance of the methods in terms of estimation error, prediction accuracy, sparsity recovery, rank identification, and shape error. GO‐FAR (implemented in the R package gofar) assumes that the underlying distribution of the count outcomes is Poisson. The specific comparison with GO‐FAR allows us to probe the effect of overdispersion in the data on model quality. The comparison with NB‐GLM highlights the potential limitations of marginal approaches in modeling dependent variables. We followed Mishra et al to simulate the predictor matrix and sparse SVD components of the coefficient matrix , that is, . Our setup considers the true rank with , , and such that , , . The notations denote a vector of length whose entries are uniformly distributed on the set and denote the vector of length with all entries equal to . We generate as , where , , and . Similarly, we generate as , where , , and , . Here we set and . An intercept is included in the model by setting with . We have considered simulation settings with and to demonstrate the efficacy of the proposed procedure in large/high‐dimensional examples. We simulate the predictor matrix from a multivariate normal distribution with some rotations such that the latent factors satisfy the orthogonality constraint (7); we refer to the simulation study of Mishra et al for details on the formulation. At the OTU/ASV level in the taxonomy, typical microbial abundance observations are excessively sparse. Since our factor models are tailored toward modeling taxa aggregated on a higher taxonomic rank, for example, the family level, we first estimate the level of sparsity in the observed AGP data. We found that, on the family level, 20% of the entries AGP data are zeros. In the simulation setting, we thus set the shape parameters to for , resulting in 20% zero entries in the simulated outcome matrix . Based on the model suggested in (1), we simulate such that . Finally, we also include a simulation scenario where 20% of entries in the response matrix are missing at random. The latter scenario showcases the ability of NB‐FAR and NB‐RRR to handle missing values in . We evaluate model performance in terms of (a) the estimation error , (b) the prediction error , (c) sparsity recovery using the false positive rate (FPR) and the false negative rate (FNR), and (d) rank estimation . FNR is computed by comparing the support (nonzero entries) of with corresponding entries in its estimate for . The FPR, on the other hand, compares zero entries in with corresponding entries in its estimate for . In case of overestimated rank, we report the relative residual signal in the excessive components as . The error in the shape parameter estimate is reported as Er() = .

Results

Since we observed similar model performances in the and scenarios, we report the model evaluation statistics only for the latter case in Table 1. The results are average performances over 100 replicates. The model comparison for the case is available in Table S1 of the supplementary material. The boxplot in Figure 2 compares the models in terms of prediction error (see Figure S1 of the supplementary material for comparison on the basis of estimation error Er()). Compared to the standard approach of modeling overdispersed count outcome using the marginal negative binomial regression model (NB‐GLM) or the Poisson counterpart of NB‐FAR, that is, GO‐FAR, both NB factor models show superior performance in terms of parameter estimation, prediction, support identification, and rank estimation. In the present model scenario, NB‐FAR outperforms NB‐RRR at the expense of a higher computational cost. Since the true parameters of the underlying model are sparse, we expect and confirm this superior behavior of NB‐FAR. The performance decrease of the GO‐FAR model highlights the effect of the misspecification in the model with respect to the overdispersed data. In particular, the performance of GO‐FAR considerably deteriorates in terms of false negative rate. Based on the results in the simulation examples where 20% of entries in are missing ( rows in Table 1), we observe that NB‐FAR can efficiently estimate the model parameters with slight deterioration compared to the full data model.

TABLE 1

Simulation: Model evaluation based on 100 replications using various performance measures (standard deviations are shown in parentheses) in case of with negative binomial responses

	M%	Er(C)	Er(Θ)	FPR	FNR	R%	r	Er(Φ)	Time (seconds)
NB‐FAR	0	2.68 (1.10)	20.70 (3.32)	5.10 (1.60)	1.42 (1.44)	0.00 (0.00)	3.00 (0.00)	0.10 (0.07)	263.20 (28.01)
NB‐RRR	0	14.49 (1.14)	40.99 (1.71)	100.00 (0.00)	0.00 (0.00)	0.00 (0.00)	3.00 (0.00)	0.63 (0.14)	95.63 (8.43)
GO‐FAR	0	10.79 (2.17)	47.58 (5.77)	4.56 (3.58)	49.80 (30.21)	0.00 (0.00)	2.51 (0.98)	1.37 (0.00)	60.49 (27.45)
NB‐GLM	0	276.31 (10.27)	163.75 (2.15)	100.00 (0.00)	0.00 (0.00)	69.66 (1.29)	29.58 (0.56)	3741.83 (149.34)	1376.36 (35.38)
NB‐FAR	20	3.36 (1.35)	22.98 (3.58)	5.63 (2.13)	2.51 (2.29)	0.00 (0.00)	3.00 (0.00)	0.12 (0.08)	273.12 (26.96)
NB‐RRR	20	15.15 (1.35)	43.77 (2.07)	100.00 (0.00)	0.00 (0.00)	0.00 (0.00)	3.00 (0.00)	0.74 (0.17)	92.31 (8.80)
GO‐FAR	20	11.78 (2.59)	50.37 (7.45)	4.89 (3.79)	54.17 (28.03)	0.00 (0.00)	2.54 (0.99)	1.37 (0.00)	79.46 (20.03)
NB‐GLM	20	228.27 (7.35)	172.59 (2.53)	100.00 (0.00)	0.00 (0.00)	71.53 (1.24)	29.68 (0.47)	3773.07 (135.35)	1774.59 (41.64)

FIGURE 2

Notched boxplots of the prediction error on the simulated count data under Setup I () and II (), respectively. The results are based on 100 replications

Notched boxplots of the prediction error on the simulated count data under Setup I () and II (), respectively. The results are based on 100 replications Simulation: Model evaluation based on 100 replications using various performance measures (standard deviations are shown in parentheses) in case of with negative binomial responses

APPLICATION

We now illustrate the performance of the NB factor models using the microbial abundance data from the AGP. AGP comprises around 30 000 fecal, oral, hand, skin, and other body site samples. We used Qiita, an open‐source platform for microbial study, to access the raw data. We selected and downloaded the microbial abundance data that were obtained after processing 150‐nucleotides‐long trimmed sequences from the 16S V4 region and used the provided Greengenes reference database for taxonomic annotation. The AGP study also records metadata/covariates related to each participant's health condition, diet, demography, nutrient intake, and habit. Here, we focused on a subset of participants where both fecal amplicon data (with sufficient sequencing depth 2000) and VioScreen variables were available. The VioScreen variables provide a detailed account of the dietary habits of the participants. We aggregated the microbial count data to the family level using the available taxonomic annotation and performed sample‐wise geometric mean scaling , at the minimum sequencing depth. After dropping taxa observed in less than 10% of samples, we arrive at microbial families as multivariate outcome . We curated the metadata as follows. First, we dropped several descriptive variables, such as, for example, sample name and sample identifier. We then removed all variables that were missing in more than 50 out of samples. The final covariate matrix comprises predictors. Each of the continuous predictors in are both centered and scaled. We manually assigned each of the different predictors to high‐level categories, such as, for example, diet, habit, health‐related, nutrient‐related, etc. The curated AGP dataset with the assigned categories is available on the GitHub page of the project (see also agAnalysis.R in the supplementary material). Finally, we considered gender, body mass index (BMI), and age as control variables and included an intercept, leading to , and sample‐wise geometric mean scaling in the offset specification. We used NB‐FAR and NB‐RRR to learn about the underlying association between the set of covariates and the observed microbial family abundances with a special focus on the patterns in the low‐rank and sparse coefficient matrix estimates . For comparison, we also ran NB‐GLM and GO‐FAR and assessed model quality based on AIC, BIC, and log‐likelihood. We also report rank estimates and model size. Table 2 summarizes model performance results from 100 replications with 80% of data used for training and 20% for testing. Compared to the marginal approach of NB‐GLM and the GO‐FAR model, NB‐FAR and NB‐RRR achieve considerably lower AIC, BIC, and log‐likelihood. Both NB‐FAR and NB‐RRR have comparable prediction errors on the test data and choose approximately rank‐3 models. Interestingly, GO‐FAR estimates a rank‐1 model, though at the expense of decreased predictive performance.

TABLE 2

Summary of average model performances on the AGP data in terms of AIC, BIC, and log‐likelihood (second to fourth column) on the test data

Model	AIC	BIC	Log‐likelihood	r	Supp(U)	Supp(V)
NB‐FAR	1.37	1.71	−9.44	3.15	7.97	16.80
NB‐RRR	1.86	3.69	−9.23	3.15	100.00	100.00
GO‐FAR	34.90	35.20	−19.00	1.00	17.90	3.59
NB‐GLM	36.40	37.10	−19.80	27.20	55.10	86.00

Note: The other columns summarize the average rank estimate and support of the singular vectors of as {Supp(), Supp()}.

Summary of average model performances on the AGP data in terms of AIC, BIC, and log‐likelihood (second to fourth column) on the test data Note: The other columns summarize the average rank estimate and support of the singular vectors of as {Supp(), Supp()}. Since NB‐FAR achieved comparable performance to NB‐RRR in terms of predictive log‐likelihood with considerably reduced model complexity, we re‐estimated the model parameters of NB‐FAR on the full data set using Algorithm 2. NB‐FAR again identified a rank() = 3 solution, enabling a parsimonious and interpretable description of host covariates—microbial family associations with only three sparse latent factors, given by sparse left and right singular vectors and . The support size (% of nonzero entries) of the estimates of the singular vectors are and . Using the union of the support of the estimated and , we can visualize all associations with the block‐sparse coefficient matrix , shown in Figure 3(left panel). The other panels in Figure 3 display the individual unit‐rank components , , and . Note that each of the unit‐rank components is orthogonal to one another. We highlight the high‐level categories of the covariates, including diet, habit, and health, by vertical lines. Horizontal lines delineate the different microbial phyla to which the families belong to. We observed that each of the singular vectors induced a different sparse pattern of positive (red) and negative (blue) blocks of associations between covariates and taxa. Overall, we found that 30 out of the 39 microbial families were found to be associated with host‐associated covariates. The families Pseudomonadaceae, Barnesiellaceae, Paraprevotellaceae, Christensenellaceae, Coriobacteriaceae, ML615J‐28, Mogibacteriaceae, Ys2, and Oxalobacteraceae were not associated with any of the covariates.

FIGURE 3

Application—AGP: The sparse estimate of the selected rows and columns of the coefficient matrix with its corresponding unit‐rank components using NB‐FAR. Based on the phylum of the taxon, horizontal lines separate the response into seven ranks: Actinobacteria, Bacteroidetes, Euryarchaeota, Firmicutes, Proteobacteria, Tenericutes, and Verrucomicrobia (top to bottom). Based on the type of the covariates, vertical lines (left to right) separate the selected predictors into five categories: diet, habit, health, nutrients, and subject features As an example for how to read and interpret the coefficient matrices, consider the top‐left sub‐matrix of which relates the two Actinobacteria (red row label) Bifidobacteriaceae and Corynebacteriaceae to diet covariates (green column label). Here, we observe an almost disjoint association pattern of positive and negative factors with diet covariates for these two families. This pattern arises from the first two latent factor and (Figure 3, middle panels). There, we observe a unique nonzero association pattern of diet variables with Corynebacteriaceae (second row in ) and a unique nonzero association pattern of diet variables with Bifidobacteriaceae (first row in ). The third latent component does not contribute any additional associations. We next focused on the analysis of the overall most important covariates‐taxa associations. The left panel of Figure 4 shows the top 25 covariates based on the row sum of the absolute values of the estimated coefficient matrix; the right plot shows the union of the top 10 covariates selected by each of the three unit‐rank components of the estimated . The color intensity in the plot reflects the effect of covariates on the abundance (red/blue for positive/negative effect). For instance, our analysis suggests that latitude (an indicator of geography) or cat ownership significantly impact the abundance of Bifidobacteriaceae, Pasteurellaceae, and Streptococcaceae. This finding is supported by several studies that also report a strong role of geography , , and pet ownership , on microbial abundance patterns. Likewise, the consumption of olive oil (unsaturated fatty acids) negatively impacts the abundance of Corynebacteriaceae and of several families in the Firmicutes phylum while showing no influence on Bacteroidetes families. Several studies such as De Wit et al, Zhao et al, and Farràs et al have reported associations between olive oil intake and a reduction in Firmicutes/Bacteroidetes ratio, consistent with the observations here. The NB‐FAR model also suggests that Bifidobacteriaceae abundances are positively associated with grain intake (: Healthy Eating Index [HEI] score of total grain), as previously observed. Finally, we also observed several associations between an underlying medical condition and microbial taxa. For instance, attention deficit hyperactivity disorder (ADHD) (last column in left panel, ) negatively impacts the abundance of Streptococcaceae, Bifidobacteriaceae, and Pasteurellaceae. A similar observation about Streptococcaceae has been reported in a separate study on children with Autism Spectrum Disorder (showing signs of ADHD).

FIGURE 4

Application—AGP: Plots show the selected rows and columns of the coefficient matrix based on the estimated effect size. The left plot selects the top 25 covariates based on the row sum of the absolute values of the estimated coefficient matrix, . The right plot shows the union of the top 10 covariates selected by each of the three unit‐rank components of the estimated The orthogonality property of the three latent factors also allows a unique factor‐by‐factor analysis of the estimated associations. Since each unit‐rank component estimate divides response‐predictor pairs into two groups, that is, positive and negative, we can cluster each component into exactly four quadrants, essentially providing disentangled bi‐clustering of microbial families and host‐associated covariates. Figure 5 illustrates these biclusters for each of the three unit‐rank components . For instance, in the estimate plot, the upper left quadrant shows the negative associations between subsets of covariates and subsets of families. The covariate olive oil intake (column A1) is significantly negatively associated with seven out of the eight taxa, including Corynebacteriaceae and Enterococcaceae. On the other hand, the covariate exercise location (column A3) is positively associated with these (and other) taxa. Similarly, from the estimate, we observed that the taxa Bifidobacteriaceae, Pasteurellaceae and Streptococcaceae are positively associated with latitude/geography, Erythritol intake, and grain (columns B1‐B3), and negatively associated with inositol level (nutrients) and ADHD (columns B4 and B5). Finally, from the estimate, we observe that the covariates sets {starchy vegetable, acne medication, salted snack frequency, drinking water source} (columns C1‐C3 and C6) and {cesarean birth, lung diseases, juice serving} (columns C4, C5, and C7) have significantly opposite effects on the ‐associated microbial families.

FIGURE 5

Application—AGP: Plots show the selected rows and columns of the estimated unit‐rank components, that is, {}, of the coefficient matrix . Covariates selected in the right plot of Figure 4 are marked in each of the three components. Each quadrant in the subplots shows the underlying associations Overall, these results highlight a strong influence of specific host‐associated features on specific family abundance pattern which should be taken into account in downstream analysis whenever these features are available in a microbiome study.

DISCUSSION

In this contribution, we have presented two novel NB factor regression models, NB‐FAR and NB‐RRR, for the analysis of microbial abundance data. The models have been tailored toward modeling overdispersed count outcome data in the context of amplicon‐derived microbiome data. However, we posit that the models may prove useful in other application areas, including clinical trials, sports, and single‐cell genomics. The key novelty of the models is to express underlying dependencies between responses (eg, microbial counts) and predictors (eg, host or environmental covariates) by assuming either a dense low‐rank or a co‐sparse low‐rank representation of the coefficient matrix. These structural assumptions appear realistic in the context of microbiome data where certain bacterial taxa are likely specialized in metabolizing specific food ingredients and hence show a concerted diet dependence. Compared to marginal approaches where each of the families is separately modeled using a Poisson or NB regression, we have shown that our models are both computationally more efficient and simultaneously achieve better estimation performance on simulated data and the large‐scale AGP data compendium. These results, in turn, challenge recent efforts in establishing host‐microbiome relationships using marginal approaches. , In particular, the study by Manor et al attempted to estimate large‐scale association patterns between microbial genera and host features across thousands of participants using marginal logistic or Poisson regression. There, the authors used microbial abundances as predictors and reported the most significant association patterns with individual host covariates as outcome. In particular, they identified the genera Ys2, Ml615j‐28, Coriobacteriaceae, Christensenellaceae, Mogibacteriaceae, and Oxalobacter to be significantly associated with host covariates all of which were among the few families that were not associated with host covariates in our analysis. These discrepancies highlight the fact that, despite reasonably large sample sizes, there still remain considerable inconsistencies across microbiome studies that require further statistical (meta‐)analysis. As we have shown in our application on the AGP data, the ability of the NB‐FAR model to deliver crisp bi‐clustering of the underlying host‐microbiome associations makes them an ideal tool to perform such future analysis and generate testable biological hypotheses. On the statistical side, potentially fruitful extensions of the NB factor models include the handling of excess zeros in the outcome data using, for example, zero‐inflated components or hurdle models. Since there are two types of zeros in the microbial abundance data, true (structural) zeros and experimental (measurement) zeros, one potential path forward is to identify structural zeros a priori and treat the remaining zeros as missing values the latter of which can already be efficiently handled by NB‐FAR and NB‐RRR. Furthermore, our current approach is restricted to using the log link function associating the mean to the linear predictor. Introducing alternative link functions that satisfy the positive mean constraint would add to the flexibility and generality of the current modeling framework. Finally, in our current framework we select the tuning parameter via K‐fold cross‐validation, making the parameter estimation procedure computationally intensive. This can potentially be alleviated by developing a stage‐wise algorithm for parameter estimation since such a strategy has been proven to be computationally efficient in the multivariate linear regression setting with normally distributed response matrix . Going forward, we also posit that NB‐FAR and NB‐RRR may serve as useful sub‐routines in more complex statistical analysis workflows, including causal inference. For example, consider a typical randomized clinical trials experiment that aims at understanding the causal effect of a treatment on a phenotype of interest. In a diet intervention study, for instance, it is not unlikely that the intended direct effect on host health is mediated or confounded by the presence of certain microbes in the microbiome. While the instrumental variable (IV) approach provides a powerful framework to uncover causal effects (see also Ailer et al in the context of microbiome data), it requires that the instruments are strong and not confounded. A standard IV approach for continuous data estimates the parameters using two‐stage least square. For the high‐dimensional data problem, Lin et al proposed a regularized two‐stage framework that solves a penalized multivariate linear regression in the first stage and a Lasso problem in the second stage. This framework can be likely extended by using the NB‐FAR/NB‐RRR methodology in the first stage when overdispersed count data serve as the independent variables. Taken together, we believe that the introduced NB factor regression models and their efficient implementation in the R package nbfar provide a useful statistical framework for analyzing overdispersed count data in medicine, biology, and other scientific disciplines. Data S1 Supplementary material Click here for additional data file.

52 in total

1. Application of negative binomial modeling for discrete outcomes: a case study in aging research.

Authors: Amy L Byers; Heather Allore; Thomas M Gill; Peter N Peduzzi
Journal: J Clin Epidemiol Date: 2003-06 Impact factor: 6.437

2. Shrinkage improves estimation of microbial associations under different normalization methods.

Authors: Michelle Badri; Zachary D Kurtz; Richard Bonneau; Christian L Müller
Journal: NAR Genom Bioinform Date: 2020-12-17

3. Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics.

Authors: Wei Lin; Rui Feng; Hongzhe Li
Journal: J Am Stat Assoc Date: 2015 Impact factor: 5.033

4. Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2.

Authors: Evan Bolyen; Jai Ram Rideout; Matthew R Dillon; Nicholas A Bokulich; Christian C Abnet; Gabriel A Al-Ghalith; Harriet Alexander; Eric J Alm; Manimozhiyan Arumugam; Francesco Asnicar; Yang Bai; Jordan E Bisanz; Kyle Bittinger; Asker Brejnrod; Colin J Brislawn; C Titus Brown; Benjamin J Callahan; Andrés Mauricio Caraballo-Rodríguez; John Chase; Emily K Cope; Ricardo Da Silva; Christian Diener; Pieter C Dorrestein; Gavin M Douglas; Daniel M Durall; Claire Duvallet; Christian F Edwardson; Madeleine Ernst; Mehrbod Estaki; Jennifer Fouquier; Julia M Gauglitz; Sean M Gibbons; Deanna L Gibson; Antonio Gonzalez; Kestrel Gorlick; Jiarong Guo; Benjamin Hillmann; Susan Holmes; Hannes Holste; Curtis Huttenhower; Gavin A Huttley; Stefan Janssen; Alan K Jarmusch; Lingjing Jiang; Benjamin D Kaehler; Kyo Bin Kang; Christopher R Keefe; Paul Keim; Scott T Kelley; Dan Knights; Irina Koester; Tomasz Kosciolek; Jorden Kreps; Morgan G I Langille; Joslynn Lee; Ruth Ley; Yong-Xin Liu; Erikka Loftfield; Catherine Lozupone; Massoud Maher; Clarisse Marotz; Bryan D Martin; Daniel McDonald; Lauren J McIver; Alexey V Melnik; Jessica L Metcalf; Sydney C Morgan; Jamie T Morton; Ahmad Turan Naimey; Jose A Navas-Molina; Louis Felix Nothias; Stephanie B Orchanian; Talima Pearson; Samuel L Peoples; Daniel Petras; Mary Lai Preuss; Elmar Pruesse; Lasse Buur Rasmussen; Adam Rivers; Michael S Robeson; Patrick Rosenthal; Nicola Segata; Michael Shaffer; Arron Shiffer; Rashmi Sinha; Se Jin Song; John R Spear; Austin D Swafford; Luke R Thompson; Pedro J Torres; Pauline Trinh; Anupriya Tripathi; Peter J Turnbaugh; Sabah Ul-Hasan; Justin J J van der Hooft; Fernando Vargas; Yoshiki Vázquez-Baeza; Emily Vogtmann; Max von Hippel; William Walters; Yunhu Wan; Mingxun Wang; Jonathan Warren; Kyle C Weber; Charles H D Williamson; Amy D Willis; Zhenjiang Zech Xu; Jesse R Zaneveld; Yilong Zhang; Qiyun Zhu; Rob Knight; J Gregory Caporaso
Journal: Nat Biotechnol Date: 2019-08 Impact factor: 54.908

5. Sparse and compositionally robust inference of microbial ecological networks.

Authors: Zachary D Kurtz; Christian L Müller; Emily R Miraldi; Dan R Littman; Martin J Blaser; Richard A Bonneau
Journal: PLoS Comput Biol Date: 2015-05-07 Impact factor: 4.475

6. Dynamics and associations of microbial community types across the human body.

Authors: Tao Ding; Patrick D Schloss
Journal: Nature Date: 2014-04-16 Impact factor: 49.962

7. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data.

Authors: Johannes Ostner; Salomé Carcy; Christian L Müller
Journal: Front Genet Date: 2021-12-07 Impact factor: 4.599

8. Perinatal pet exposure, faecal microbiota, and wheezy bronchitis: is there a connection?

Authors: Merja Nermes; Katri Niinivirta; Lotta Nylund; Kirsi Laitinen; Jaakko Matomäki; Seppo Salminen; Erika Isolauri
Journal: ISRN Allergy Date: 2013-01-09

9. Cohabiting family members share microbiota with one another and with their dogs.

Authors: Se Jin Song; Christian Lauber; Elizabeth K Costello; Catherine A Lozupone; Gregory Humphrey; Donna Berg-Lyons; J Gregory Caporaso; Dan Knights; Jose C Clemente; Sara Nakielny; Jeffrey I Gordon; Noah Fierer; Rob Knight
Journal: Elife Date: 2013-04-16 Impact factor: 8.140

Review 10. Worlds within worlds: evolution of the vertebrate gut microbiota.

Authors: Ruth E Ley; Catherine A Lozupone; Micah Hamady; Rob Knight; Jeffrey I Gordon
Journal: Nat Rev Microbiol Date: 2008-10 Impact factor: 60.633

2 in total

1. A randomization-based causal inference framework for uncovering environmental exposure effects on human gut microbiota.

Authors: Alice J Sommer; Annette Peters; Martina Rommel; Josef Cyrys; Harald Grallert; Dirk Haller; Christian L Müller; Marie-Abèle C Bind
Journal: PLoS Comput Biol Date: 2022-05-09 Impact factor: 4.779

2. Negative binomial factor regression with application to microbiome data analysis.

Authors: Aditya K Mishra; Christian L Müller
Journal: Stat Med Date: 2022-04-24 Impact factor: 2.497

2 in total