Georgia Papadogeorgou1, Zhengwu Zhang2, David B Dunson3. 1. Department of Statistics, University of Florida, Gainesville, FL 32611-8545, USA. 2. Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-3260, USA. 3. Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA.
Abstract
Statistical methods relating tensor predictors to scalar outcomes in a regression model generally vectorize the tensor predictor and estimate the coefficients of its entries employing some form of regularization, use summaries of the tensor covariate, or use a low dimensional approximation of the coefficient tensor. However, low rank approximations of the coefficient tensor can suffer if the true rank is not small. We propose a tensor regression framework which assumes a soft version of the parallel factors (PARAFAC) approximation. In contrast to classic PARAFAC where each entry of the coefficient tensor is the sum of products of row-specific contributions across the tensor modes, the soft tensor regression (Softer) framework allows the row-specific contributions to vary around an overall mean. We follow a Bayesian approach to inference, and show that softening the PARAFAC increases model flexibility, leads to improved estimation of coefficient tensors, more accurate identification of important predictor entries, and more precise predictions, even for a low approximation rank. From a theoretical perspective, we show that employing Softer leads to a weakly consistent posterior distribution of the coefficient tensor, irrespective of the true or approximation tensor rank, a result that is not true when employing the classic PARAFAC for tensor regression. In the context of our motivating application, we adapt Softer to symmetric and semi-symmetric tensor predictors and analyze the relationship between brain network characteristics and human traits.
Introduction
In many applications, data naturally have an array or tensor structure. When the tensor includes the same variable across two of its modes, it is often referred to as a network. Network dependence is often summarized via an adjacency matrix or tensor. For example, data might correspond to an R × R × p array containing p features measuring the strength of connections between an individual's R brain regions. In tensor data analysis, interest often lies in characterizing the relationship between a tensor predictor and a scalar outcome within a regression framework. Estimation of such regression models most often requires some type of parameter regularization or dimensionality reduction, since the number of entries of the tensor predictor is larger than the sample size.
In this paper, we propose a soft tensor regression (Softer) framework for estimating a high-dimensional linear regression model with a tensor predictor and scalar outcome. Softer accommodates the predictor's tensor structure by basing the coefficient tensor estimation on the parallel factors approximation, similarly to other approaches in the literature. However, in contrast to previously developed methodology, Softer adaptively expands away from its low-rank mean to adequately capture and flexibly estimate more complex coefficient tensors. Softer's deviations from the underlying low-rank, tensor-based structure are interpretable as variability in the tensor's row-specific contributions.
Tensor Regression in the Literature
Generally, statistical approaches to tensor regression fall under the following categories: they estimate the coefficients corresponding to each tensor entry with entry-specific penalization, regress the scalar outcome on low-dimensional summaries of the tensor predictor, or estimate a coefficient tensor assuming a low-rank approximation.
A simple approach to tensor regression vectorizes the tensor predictor and fits a regression model of the outcome on the tensor's entries while performing some form of variable selection or regularization. Examples include Cox and Savoy (2003) and Craddock et al. (2009), who employed support vector classifiers to predict categorical outcomes based on the participants' brain activation or connectivity patterns. Other examples in neuroscience include Mitchell et al. (2004); Haynes and Rees (2005); O'Toole et al. (2005); Polyn et al. (2005) and Richiardi et al. (2011) (see Norman et al. (2006) for a review). However, this approach to handling tensor predictors is, at the least, unattractive, since it fails to account for the intrinsic array structure of the predictor, effectively flattening it prior to the analysis.
Alternatively, dimensionality reduction can be performed directly on the tensor predictor, reducing it to low-dimensional summaries. In such approaches, the expectation is that these summaries capture all essential information while decreasing the number of parameters to be estimated. For example, Zhang et al. (2019) and Zhai and Li (2019) use principal component analysis to extract information on the participants' structural and functional brain connectivity, and use these principal components to study the relationship between brain network connections and outcomes within a classic regression framework. However, this approach could suffer due to its unsupervised nature, which selects principal components without examining their relationship to the outcome. Moreover, the performance of the low-dimensional summaries is highly dependent on the number and choice of those summaries, and the interpretation of the estimated coefficients might not be straightforward.
Ginestet et al. (2017) and Durante and Dunson (2018) developed hypothesis tests for differences in the brain connectivity distribution among subgroups of individuals, and employed them to study the relationship between categorical outcomes and binary network measurements. Even though related, such approaches do not address our interest in building regression models with tensor predictors.
An attractive approach to tensor regression performs dimension reduction on the coefficient tensor. Generally, these approaches exploit a tensor's Tucker decomposition (Tucker, 1966) and its restriction known as the parallel factors (PARAFAC) or canonical decomposition. According to the PARAFAC, a tensor is the sum of D rank-1 tensors, and each entry can be written as the sum of D products of row-specific elements. The minimum value of D for which this holds is referred to as the tensor's rank. Note that the word "row" along a tensor mode is used here to represent rows in the classic matrix sense (slices of the tensor along the first mode), columns (slices along the second mode), or slices along higher modes.
Within the frequentist paradigm, Hung and Wang (2013) suggested a bi-linear logistic regression model in the presence of a matrix predictor. For a tensor predictor, Zhou et al. (2013) and Li et al. (2018) exploited the PARAFAC and Tucker decompositions, respectively, and proposed low-rank approximations to the coefficient tensor. Guhaniyogi et al. (2017) proposed a related Bayesian tensor regression approach for estimating the coefficient tensor. Even though these approaches perform well for prediction in high-dimensional tensor settings, they are bound by the approximation rank, in the sense that they cannot capture true coefficient tensors of rank higher than the one used. Moreover, these approaches are not directly applicable for identifying important connections. In this direction, Wang et al. (2014) imposed sparsity on the coefficient tensor of a multi-linear logistic regression model by penalizing the PARAFAC contributions. Wang et al. (2018) proposed a related approach to identify small brain subgraphs that are predictive of an individual's cognitive abilities. Further, Guha and Rodriguez (2021) assumed a PARAFAC decomposition of the mean of the coefficient tensor and used a spike-and-slab prior distribution to identify brain regions whose connections are predictive of an individual's creativity index.
Low-rank approximations to the coefficient tensor of a linear tensor regression model provide a supervised approach to estimating the relationship between a tensor predictor and a scalar outcome. However, such approximations can lead to a poorly estimated coefficient tensor, misidentification of important connections, and inaccurate predictions if the true rank of the coefficient tensor is not small. As we will illustrate in Section 3, this performance issue arises due to the inflexibility of the PARAFAC approximation, which specifies that each row has a fixed contribution to all coefficient entries that involve it, leading to an overly rectangular or block structure of the estimated coefficient tensor. If the true coefficient tensor does not exhibit such a block structure, a large number of components D might be necessary in order to adequately approximate it. Due to this rigid structure, we refer to the PARAFAC approximation to estimating the coefficient tensor as the hard PARAFAC.
Recently, there have been efforts to relax the hard PARAFAC structure employing non-parametric methodology using kernels, Gaussian processes, or neural networks (Signoretto et al., 2013; Suzuki et al., 2016; Kanagawa et al., 2016; Imaizumi and Hayashi, 2016; Maruhashi et al., 2018). Even though these methods are very promising, we focus on regression models which include the tensor predictor linearly, and on flexibly estimating the linear functional. The ideas of PARAFAC softening presented here could be extended to non-linear settings and situations outside the scope of tensor regression.
Our Contribution
With our main focus being improved estimation and inference over model coefficients, we address the inflexibility of the hard PARAFAC in the linear tensor regression setting. Towards this goal, we propose a hierarchical modeling approach to estimate the coefficient tensor. Similarly to the hard PARAFAC, each entry of the coefficient tensor is the sum of products of row-specific contributions. However, our model specification allows a row's contribution to the coefficients that involve it to be entry-specific. Then, the mean of the entry-specific contributions along a tensor row can be thought of as a row's overall importance, and the entry-specific deviations as small variations in the row's importance when interacting with the rows of the other tensor modes. Allowing the row contributions to vary by entry leads to a softening of the hard structure in the PARAFAC approximation, and for this reason, we refer to it as the soft PARAFAC. We refer to the tensor regression model that uses the soft PARAFAC for estimation of the coefficient tensor as Soft Tensor Regression (Softer).
We follow a fully Bayesian approach to inference, which allows for straightforward uncertainty quantification in the coefficient estimates and predictions. By studying the induced prior distribution on the coefficient tensor, we choose sensible values for the hyperparameters. Importantly, the flexible structure of the soft PARAFAC allows Softer to capture any true coefficient tensor without increasing the base rank, in contrast to existing models that use strictly low-rank approximations. We explicitly show this by proving that the imposed prior structure on the coefficient tensor has full support on a large class including true tensors of any rank, and that its posterior distribution is consistent for any approximation rank used. We also illustrate this in simulations, where we show that the performance of Softer is robust to the choice of the approximation rank. Due to its increased flexibility, Softer performs better than strictly low-rank models in identifying important entries of the tensor predictor. We extend Softer to generalized linear models and symmetric tensors, widening its applicability.
We use the soft tensor regression framework in a study of the relationship between brain structural connectomics and human traits for participants in the Human Connectome Project (HCP; Van Essen et al. (2013)). In the last few years, the HCP has played a very important role in expanding our understanding of the human brain by providing a database of anatomical and functional connections and individual demographics and traits on over a thousand healthy subjects. Data availability and increased sample sizes have allowed researchers across various fields to develop and implement new tools in order to analyze these complex and rich data (see Cole et al. (2014); McDonough and Nashiro (2014); Smith et al. (2015); Riccelli et al. (2017); Croxson et al. (2018) among many others). Using data from the HCP, exploiting state-of-the-art connectomics processing pipelines (Zhang et al., 2018), and within an adaptation of the supervised Softer framework for symmetric tensor predictors, we investigate the relationship between structural brain connection characteristics and a collection of continuous and binary human traits.
Tensor Regression
In this section, we introduce some useful notation and the regression framework relating a scalar outcome to a tensor predictor.
Useful Notation
Let $a \in \mathbb{R}^{p_1}$ and $b \in \mathbb{R}^{p_2}$. Then $a \otimes b$ is used to represent the outer product of $a$ and $b$, with dimension $p_1 \times p_2$ and entries $[a \otimes b]_{j_1 j_2} = a_{j_1} b_{j_2}$. Similarly, for vectors $a_k \in \mathbb{R}^{p_k}$, $k = 1, 2, \ldots, K$, the outer product $a_1 \otimes a_2 \otimes \cdots \otimes a_K$ is a $K$-mode tensor of dimensions $p_1, p_2, \ldots, p_K$ and entries $[a_1 \otimes a_2 \otimes \cdots \otimes a_K]_{j_1 j_2 \ldots j_K} = \prod_{k=1}^K a_{k, j_k}$. For two tensors $\mathcal{B}_1, \mathcal{B}_2$ of the same dimensions, we use $\mathcal{B}_1 \circ \mathcal{B}_2$ to represent the Hadamard product, defined as the element-wise product of the two tensors. Further, we use $\langle \mathcal{B}_1, \mathcal{B}_2 \rangle$ to represent the Frobenius inner product, which is the sum of the elements of $\mathcal{B}_1 \circ \mathcal{B}_2$. When the tensors are vectors (1-mode), the Frobenius inner product is the classic dot product.
For a $K$-mode tensor $\mathcal{B}$ of dimensions $p_1, p_2, \ldots, p_K$, we use "$j_k^{th}$ slice of $\mathcal{B}$ along mode $k$" to refer to the $(K-1)$-mode tensor with dimensions $p_1, \ldots, p_{k-1}, p_{k+1}, \ldots, p_K$ and entries $\beta_{j_1 \ldots j_{k-1} j_k j_{k+1} \ldots j_K}$ with the index $j_k$ held fixed. For example, the $j_1^{th}$ slice of a $p_1 \times p_2$ matrix along mode 1 is the matrix's $j_1^{th}$ row. As a result, we refer to "slice-specific" quantities as "row-specific" even when that slice is not along mode 1. For example, for a $p_1 \times p_2$ matrix, the mean entry of the $j_2^{th}$ row along mode 2 is the mean of the $j_2^{th}$ column. Remembering that we use "row" to refer to slices (and not necessarily to rows in the classic matrix sense) will be useful when discussing the hard PARAFAC in Section 3 and when introducing the soft PARAFAC in Section 4.
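As a toy numerical illustration of this notation (our own example, not from the original analysis), for $a = (1, 2)^T$ and $b = (3, 4, 5)^T$,

\[ a \otimes b = \begin{pmatrix} 3 & 4 & 5 \\ 6 & 8 & 10 \end{pmatrix}, \qquad \langle a \otimes b, \, a \otimes b \rangle = 9 + 16 + 25 + 36 + 64 + 100 = 250. \]

The first slice of $a \otimes b$ along mode 1 is the row $(3, 4, 5)$, and the first slice along mode 2 is the column $(3, 6)^T$.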
Regression of Scalar Outcome on Tensor Predictor
Let $Y_i$ be a continuous outcome, $\boldsymbol{C}_i = (C_{i1}, C_{i2}, \ldots, C_{iq})^T$ scalar covariates, and $\mathcal{X}_i$ a $K$-mode tensor of dimensions $p_1, p_2, \ldots, p_K$ with entries $X_{i, j_1 j_2 \ldots j_K}$, for units $i = 1, 2, \ldots, N$. Even though our model is presented here for continuous outcomes, the relationship between tensor predictors and binary or categorical outcomes can be similarly evaluated by considering an appropriate link function, as we do in Section 6. We study the relationship between the outcome and the scalar and tensor predictors by assuming a linear model

\[ Y_i = \mu + \boldsymbol{C}_i^T \boldsymbol{\delta} + \sum_{j_1 = 1}^{p_1} \sum_{j_2 = 1}^{p_2} \cdots \sum_{j_K = 1}^{p_K} \beta_{j_1 j_2 \ldots j_K} X_{i, j_1 j_2 \ldots j_K} + \epsilon_i, \tag{1} \]

where $\epsilon_i \sim N(0, \tau^2)$ and $\boldsymbol{\delta} \in \mathbb{R}^q$. Alternatively, organizing all coefficients in a tensor $\mathcal{B}$ of equal dimensions to $\mathcal{X}_i$ and with $j_1 j_2 \ldots j_K$ entry equal to $\beta_{j_1 j_2 \ldots j_K}$, the same model can be written as

\[ Y_i = \mu + \boldsymbol{C}_i^T \boldsymbol{\delta} + \langle \mathcal{X}_i, \mathcal{B} \rangle + \epsilon_i. \tag{2} \]

Since the coefficient tensor $\mathcal{B}$ includes $\prod_{k=1}^K p_k$ coefficients, it is infeasible to estimate it without some form of regularization or additional structure. Penalization or variable selection approaches based on the vectorization of the tensor predictor are implemented directly on the model of Equation 1, ignoring the predictor's tensor structure. Alternatively, an approach to account for the predictor's inherent structure is to assume a low-rank approximation to $\mathcal{B}$ based on the PARAFAC decomposition, discussed in Section 3.
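As a minimal computational illustration of the inner product in Equation 2, the following R sketch (our own toy example with made-up dimensions, not the authors' code) shows that the Frobenius inner product coincides with the dot product of the vectorized predictor and coefficients, which is what vectorization-based approaches built on Equation 1 operate on:

    # toy 4 x 5 predictor and coefficient matrix
    set.seed(1)
    X <- matrix(rnorm(20), nrow = 4, ncol = 5)
    B <- matrix(rnorm(20), nrow = 4, ncol = 5)
    # Frobenius inner product <X, B>: sum of the entries of the Hadamard product
    frob <- sum(X * B)
    # equivalently, the dot product of the vectorized predictor and coefficients
    all.equal(frob, sum(as.vector(X) * as.vector(B)))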
Tensor Regression Using the Hard PARAFAC Approximation
Under the PARAFAC decomposition, a tensor $\mathcal{B}$ can be written as

\[ \mathcal{B} = \sum_{d=1}^{D} \beta_1^{(d)} \otimes \beta_2^{(d)} \otimes \cdots \otimes \beta_K^{(d)}, \tag{3} \]

for some integer $D$ and vectors $\beta_k^{(d)} \in \mathbb{R}^{p_k}$. The minimum value of $D$ for which $\mathcal{B}$ equals its representation in Equation 3 is referred to as its rank. For matrices (2-mode tensors), this decomposition is equivalent to the singular value decomposition, and $D$ is the matrix rank. We refer to the $d^{th}$ term in the sum as the PARAFAC's $d^{th}$ component.
The tensor PARAFAC decomposition leads to a natural approximation of the coefficient tensor $\mathcal{B}$ in Equation 2 by assuming that it is in the form of Equation 3 for some small value of $D$, potentially much smaller than its true rank. Then, the $\prod_{k=1}^K p_k$ coefficients in $\mathcal{B}$ are approximated using $D \sum_{k=1}^K p_k$ parameters, leading to a large reduction in the number of quantities to be estimated.
However, this reduction in the number of parameters might come at a substantial price if the rank $D$ used in the approximation is smaller than the tensor's true rank. According to Equation 3, the $(j_1 j_2 \ldots j_K)$ entry of $\mathcal{B}$ is equal to

\[ \beta_{j_1 j_2 \ldots j_K} = \sum_{d=1}^{D} \prod_{k=1}^{K} \beta_{k, j_k}^{(d)}, \tag{4} \]

where $\beta_{k, j_k}^{(d)}$ is the $j_k^{th}$ entry of $\beta_k^{(d)}$.
In turn, Equation 4 implies that row $j_k$ along mode $k$ has fixed importance, expressed as fixed row contributions $\beta_{k, j_k}^{(d)}$, to all coefficient entries that include $j_k$, irrespective of the remaining indices. We refer to $\beta_{k, j_k}^{(d)}$ as the $d^{th}$ $j_k$-row contribution along mode $k$. This is best illustrated by considering a rank-1 2-mode tensor (matrix) $\mathcal{B} = \beta_1 \otimes \beta_2$ for vectors $\beta_1 \in \mathbb{R}^{p_1}$ and $\beta_2 \in \mathbb{R}^{p_2}$. Then $\beta_{j_1 j_2} = \beta_{1, j_1} \beta_{2, j_2}$, and the same entry $\beta_{1, j_1}$ is used in $\beta_{j_1 j_2}$ irrespective of $j_2$. This gives rise to a rectangular structure in $\mathcal{B}$, in which a row's importance, $\beta_{1, j_1}$, is fixed across all columns (and similarly for a column's importance, $\beta_{2, j_2}$).
We further illustrate this in Figure 1a, where we plot $\beta_1 \otimes \beta_2$ for randomly generated vectors $\beta_1, \beta_2 \in \{0, 1\}^{100}$. It is evident from Figure 1a that rank-1 matrices are organized in a rectangular structure where rows and columns are either uniformly excluded or not. Even though the generated vectors are binary for ease of illustration, the rectangular structure persists even when $\beta_1, \beta_2$ include non-binary entries. The rectangular structure observed in rank-1 tensors indicates that a rank-1 ($D = 1$) approximation to the coefficient tensor could be quite limiting. Generally, a rank-$D$ approximation for $D > 1$ is employed to estimate the coefficient tensor. Figure 1b shows a matrix of rank $D = 3$, summing three rank-1 tensors like the one in Figure 1a. The rank-3 tensor alleviates, but does not eliminate, the rectangular structure observed previously. This is most obvious in Figure 1c, where the rows and columns of Figure 1b are re-ordered according to their mean entry. In Appendix B we further demonstrate the inflexibility of the hard PARAFAC's block structure.
Figure 1:
Illustration of the Hard PARAFAC Inflexibility. Panel (a) shows a rank-1 tensor of the form $\mathcal{B} = \beta_1 \otimes \beta_2$, for vectors $\beta_1, \beta_2 \in \{0, 1\}^{100}$. Panel (b) shows a rank-3 matrix that is the sum of three rank-1 tensors like the one in panel (a). Panel (c) shows the same rank-3 matrix with rows and columns reordered according to their mean entry.
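The rectangular structure of Figure 1 is easy to reproduce; the following R sketch (our own illustrative script under random draws, not the authors' exact code) generates analogues of the three panels:

    set.seed(42)
    p <- 100
    # panel (a): rank-1 binary matrix as an outer product of two random 0/1 vectors
    B1 <- outer(rbinom(p, 1, 0.5), rbinom(p, 1, 0.5))
    # panel (b): rank-3 matrix as the sum of three such rank-1 matrices
    B3 <- Reduce(`+`, lapply(1:3, function(d) outer(rbinom(p, 1, 0.5), rbinom(p, 1, 0.5))))
    # panel (c): reordering rows and columns by their mean entry exposes the block structure
    image(B3[order(rowMeans(B3)), order(colMeans(B3))])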
This block structure is also evident in the work of Zhou et al. (2013); Guhaniyogi et al. (2017) and Li et al. (2018), where data were simulated based on binary coefficient matrices. When these matrices represent combinations of rectangles (such as squares or crosses), the approximation performed well in estimating the true coefficient tensor. However, in situations where the true coefficient tensor was irregular, an increase in the rank was necessary in order to even roughly approximate the truth.
Soft Tensor Regression
Our approach proceeds by increasing the number of parameters in the regression model in Equation 2 and subsequently imposing sufficient structure to ensure model regularization and adaptability, simultaneously. It borrows from the low-rank structure of the hard PARAFAC, but it is more flexible than the rigid form illustrated in Equation 4 and Figure 1, and can adaptively expand away from low rank. We introduce tensors $\mathcal{B}_k^{(d)}$, $k = 1, 2, \ldots, K$ and $d = 1, 2, \ldots, D$, of equal dimensions to $\mathcal{B}$, and write

\[ \mathcal{B} = \sum_{d=1}^{D} \mathcal{B}_1^{(d)} \circ \mathcal{B}_2^{(d)} \circ \cdots \circ \mathcal{B}_K^{(d)}. \tag{5} \]

From Equation 5, the coefficient with indices $(j_1, j_2, \ldots, j_K)$ is written as the sum of $D$ products of $K$ parameters,

\[ \beta_{j_1 j_2 \ldots j_K} = \sum_{d=1}^{D} \prod_{k=1}^{K} \beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)}, \]

where $\beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)}$ is the $(j_1, j_2, \ldots, j_K)$ entry of the tensor $\mathcal{B}_k^{(d)}$. For reasons that will become apparent later, the parameters $\beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)}$ are referred to as the row-specific contributions along mode $k$. Note that these row-specific contributions are allowed to depend on all indices $(j_1, j_2, \ldots, j_K)$. For unrestricted $\mathcal{B}_k^{(d)}$, Equation 5 does not impose any restrictions, and any tensor $\mathcal{B}$ can be written in this form (take $D = 1$, $\mathcal{B}_1^{(1)} = \mathcal{B}$, and $\mathcal{B}_k^{(1)}$ a tensor of ones for all $k > 1$).
Writing the coefficient tensor as in Equation 5 might appear counter-productive at first, since the already high-dimensional problem of estimating the $\prod_{k=1}^K p_k$ parameters in $\mathcal{B}$ is translated to an even higher-dimensional problem with $D K \prod_{k=1}^K p_k$ parameters on the right-hand side of Equation 5. However, as we will see, this is not a problem if adequate structure is imposed on the tensors $\mathcal{B}_k^{(d)}$. In fact, in Section 4.1 we show that the hard PARAFAC can be written in this form for carefully designed tensors $\mathcal{B}_k^{(d)}$, which shows that sufficient structure on these tensors can lead to drastic dimensionality reduction. In Section 4.2, we present our approach, which imposes structure on the tensors $\mathcal{B}_k^{(d)}$ in a careful manner that simultaneously allows for low-dimensional and flexible estimation of $\mathcal{B}$. We refer to the resulting characterization as the soft PARAFAC of $\mathcal{B}$.
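To make the dimensionality comparison concrete, consider our own worked arithmetic for an example size: a 32 × 32 matrix predictor ($K = 2$, $p_1 = p_2 = 32$) with $D = 3$ gives

\[ \prod_k p_k = 1{,}024 \ \text{entries in } \mathcal{B}, \qquad D \sum_k p_k = 3 \times 64 = 192 \ \text{hard PARAFAC parameters}, \qquad D K \prod_k p_k = 3 \times 2 \times 1{,}024 = 6{,}144 \ \text{unrestricted parameters in the } \mathcal{B}_k^{(d)}, \]

which is why the structure imposed on the $\mathcal{B}_k^{(d)}$ in the next two subsections is essential.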
Representation of the Hard PARAFAC Motivating the Soft PARAFAC
As shown in Equation 4, the hard PARAFAC row-specific contributions to each entry of the coefficient tensor are fixed across the remaining indices. Hence, the hard PARAFAC can be written in the form of Equation 5 by specifying tensors $\mathcal{B}_k^{(d)}$ of which any two entries with the same $j_k$ index are equal:

\[ \beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)} = \beta_{k, j_k}^{(d)} \quad \text{for all } (j_1, \ldots, j_{k-1}, j_{k+1}, \ldots, j_K). \]

This structure on the tensors $\mathcal{B}_k^{(d)}$ can be visualized as $p_k$ constant slices along mode $k$, representing the fixed row-specific contributions to all coefficient entries that involve each row. This structure is illustrated in Figure 2a for a 4-by-3 coefficient matrix. As an example, the contribution of row 2 along mode 1 is constant ($\beta_{1,(2,1)} = \beta_{1,(2,2)} = \beta_{1,(2,3)}$), and the same is true for the contribution of row 1 along mode 2 ($\beta_{2,(1,1)} = \beta_{2,(2,1)} = \beta_{2,(3,1)} = \beta_{2,(4,1)}$).
Figure 2:
Row-specific Contributions for the Hard and Soft PARAFAC. Left: For the hard PARAFAC, the contributions are fixed across remaining indices. Right: For the soft PARAFAC, the contributions of a row and column are entry specific and centered around an overall mean.
This demonstrates that the hard PARAFAC is one example of structure that can be imposed on the $\mathcal{B}_k^{(d)}$ in order to approximate $\mathcal{B}$. The connection between Equation 5 and the hard PARAFAC is the reason why we refer to the $\beta_{k, (j_1, \ldots, j_K)}^{(d)}$ as row-specific contributions along mode $k$. However, the hard PARAFAC structure is quite strict, as it imposes equalities across the $p_k$ slices of $\mathcal{B}_k^{(d)}$ along mode $k$. Since the hard PARAFAC can only capture coefficient tensors of rank up to $D$, this rigid structure on the tensors $\mathcal{B}_k^{(d)}$ can severely limit the flexibility of the model to capture a true coefficient tensor of higher rank.
The Soft PARAFAC
The soft PARAFAC builds upon the hard PARAFAC's low-rank structure, while providing additional flexibility by introducing entry-specific variability in the row contributions. Instead of forcing all entries of $\mathcal{B}_k^{(d)}$ with the same index $j_k$ to be equal to each other, the soft PARAFAC assumes that the entries of $\mathcal{B}_k^{(d)}$ are centered around a $j_k$-specific value, $\gamma_{k, j_k}^{(d)}$. Specifically, the soft PARAFAC specifies that, for all $k = 1, 2, \ldots, K$, $j_k = 1, 2, \ldots, p_k$, and $d = 1, 2, \ldots, D$,

\[ \beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)} \sim N\!\left( \gamma_{k, j_k}^{(d)}, \; \sigma_k^2 \, \zeta_{(d)} \right), \text{ independently}, \tag{6} \]

for some $\gamma_{k, j_k}^{(d)} \in \mathbb{R}$, $\sigma_k^2 > 0$, $\zeta_{(d)} > 0$, where $\beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)}$ is the $(j_1 j_2 \ldots j_k \ldots j_K)$ entry of the tensor $\mathcal{B}_k^{(d)}$. The connection between the soft and hard PARAFAC is evident by noticing that the mean values $\gamma_{k, j_k}^{(d)}$ depend only on the index $j_k$, and not on the remaining indices in $(j_1, j_2, \ldots, j_K)$. In fact, the $\gamma$ parameters resemble the $j_k$-specific entries of the hard PARAFAC, and they specify that Softer is based on an underlying $\gamma$-based rank-$D$ PARAFAC:

\[ E(\mathcal{B} \mid \Gamma, S, Z) = \sum_{d=1}^{D} \gamma_1^{(d)} \otimes \gamma_2^{(d)} \otimes \cdots \otimes \gamma_K^{(d)}, \]

where $\gamma_k^{(d)} = (\gamma_{k,1}^{(d)}, \ldots, \gamma_{k,p_k}^{(d)})^T$, and $\Gamma$, $S$, $Z$ are the collections of the $\gamma$, $\sigma^2$, and $\zeta$ parameters, respectively. At the same time, Equation 6 allows variation within the mode-$k$ slices of $\mathcal{B}_k^{(d)}$ by considering them as random effects centered around an overall mean. This implies that row $j_k$'s importance is allowed to be entry-specific, leading to a softening of the hard PARAFAC structure. The soft PARAFAC is illustrated in Figure 2b. Here, the row contributions are centered around a common value (a value resembling the row contribution according to the hard PARAFAC) but are entry-specific. For example, $\beta_{1,(2,1)}$ is similar but not equal to $\beta_{1,(2,2)}$ and $\beta_{1,(2,3)}$.
The entry-specific contributions deviate from the baseline according to a mode-specific parameter, $\sigma_k^2$, and a component-specific parameter, $\zeta_{(d)}$. As we will discuss later, the inclusion of $\zeta_{(d)}$ in the variance forces a larger amount of shrinkage on the entry-specific importances for components $d$ that have limited overall importance. For $\sigma_k^2 = 0$, the soft PARAFAC reverts back to the hard PARAFAC, with row-specific contributions fixed at $\gamma_{k, j_k}^{(d)}$. However, larger values of $\sigma_k^2$ allow for a PARAFAC-based approximation that deviates from its hard underlying structure and can be used to represent any true tensor $\mathcal{B}$. This is further illustrated in Figure 3, where $\gamma_1, \gamma_2 \in \{0, 1\}^{64}$, and entry-specific contributions are generated according to Equation 6 with variance equal to 0.05, 0.1, or 0.2. The soft PARAFAC resembles a structured matrix, though higher values of the conditional variance lead to further deviations from a low-rank structure.
Figure 3:
Soft PARAFAC Matrices. The plot on the left shows a rank-1 binary matrix $\gamma_1 \otimes \gamma_2$ for $\gamma_1, \gamma_2 \in \{0, 1\}^{64}$. The remaining matrices are generated according to Equation 6 with variance equal to 0.05, 0.1, and 0.2.
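Matrices like those in Figure 3 can be generated with a few lines of R, following Equation 6 with $K = 2$ and $D = 1$ (a sketch with our own seed and variance choice, not the authors' script):

    set.seed(7)
    p <- 64
    gamma1 <- rbinom(p, 1, 0.5)  # baseline row contributions along mode 1
    gamma2 <- rbinom(p, 1, 0.5)  # baseline row contributions along mode 2
    v <- 0.1                     # softening variance sigma_k^2 * zeta_(d)
    # entry-specific contributions vary around their row baselines (Equation 6)
    B1 <- outer(gamma1, rep(1, p)) + matrix(rnorm(p^2, sd = sqrt(v)), p, p)
    B2 <- outer(rep(1, p), gamma2) + matrix(rnorm(p^2, sd = sqrt(v)), p, p)
    # soft rank-1 PARAFAC: Hadamard product of the two contribution tensors (Equation 5)
    B_soft <- B1 * B2
    image(B_soft)

Setting v <- 0 recovers the hard rank-1 matrix $\gamma_1 \otimes \gamma_2$.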
The structure imposed by the soft PARAFAC has an interesting interpretation. The parameters $\gamma_{k, j_k}^{(d)}$ represent the baseline importance of row $j_k$ along the tensor's $k^{th}$ mode. In contrast to the hard PARAFAC, the soft PARAFAC allows for structured, tensor-based deviations of a row's importance, by allowing row $j_k$'s contribution to manifest differently based on the rows of the other modes that participate with it in a coefficient entry $\beta_{j_1 j_2 \ldots j_K}$, through $\beta_{k, (j_1, j_2, \ldots, j_K)}^{(d)}$. This interpretation of the soft PARAFAC structure is coherent in network settings like the one in our brain connectomics study, where we expect a brain region to have some baseline importance for its connections, but the magnitude of this importance might vary slightly depending on the other region with which a connection is made. In this sense, defining deviations from the hard PARAFAC through deviations in the row-specific contributions, as specified in Equation 6, represents a tensor-based relaxation of the hard PARAFAC structure, which is itself tensor-based.
Bayesian Inference in the Soft Tensor Regression Framework
Softer is placed within the Bayesian paradigm, which allows for straightforward uncertainty quantification. We consider the structure on $\mathcal{B}_k^{(d)}$ expressed in Equation 6 as part of the prior specification on the model parameters of Equation 2. Since the $\gamma_{k, j_k}^{(d)}$ are the key building blocks for the mean of $\mathcal{B}$, representing the underlying hard PARAFAC, we borrow from Guhaniyogi et al. (2017) and specify

\[ \gamma_{k, j_k}^{(d)} \sim N\!\left( 0, \; \tau_\gamma \, \zeta_{(d)} \, w_{k, j_k}^{(d)} \right), \]

with exponential priors on the row-specific scales $w_{k, j_k}^{(d)}$ and $\boldsymbol{\zeta} \sim \text{Dirichlet}(\alpha, \ldots, \alpha)$, where $\boldsymbol{\zeta} = (\zeta_{(1)}, \zeta_{(2)}, \ldots, \zeta_{(D)})$. Therefore, the parameters $\gamma_{k, j_k}^{(d)}$ vary around 0 with a variance that depends on an overall parameter $\tau_\gamma$, and component- and row-specific parameters $\zeta_{(d)}$ and $w_{k, j_k}^{(d)}$. As discussed in Guhaniyogi et al. (2017), the row-specific components lead to an adaptive Lasso type penalty on the $\gamma$ parameters (Armagan et al., 2013), and, marginally over the $w_{k, j_k}^{(d)}$, each $\gamma_{k, j_k}^{(d)}$ given $\zeta_{(d)}$ and $\tau_\gamma$ follows a double exponential (Laplace) distribution centered at 0 (Park and Casella, 2008).
The component-specific variance parameter $\zeta_{(d)}$ in the prior of $\gamma_{k, j_k}^{(d)}$ encourages only a subset of the $D$ components to contribute substantially to the tensor's PARAFAC approximation, since parameters for $d$ with small $\zeta_{(d)}$ are shrunk towards zero. We include $\zeta_{(d)}$ in the conditional variance of $\beta_{k, (j_1, \ldots, j_K)}^{(d)}$ in Equation 6 to ensure that shrinkage of the baseline row contributions is accompanied by shrinkage of the corresponding entry-specific contributions towards zero, and that a reduction in the variance of $\gamma_{k, j_k}^{(d)}$ is not overcompensated by an increase in the conditional variance of $\beta_{k, (j_1, \ldots, j_K)}^{(d)}$.
We assume normal prior distributions on the intercept and the scalar covariates' coefficients, $(\mu, \boldsymbol{\delta}) \sim N(0, \Sigma_0)$, and inverse gamma priors on the residual variance, $\tau^2 \sim IG(a_{\tau^2}, b_{\tau^2})$, and the mode-specific variance components, $\sigma_k^2 \sim IG(a_\sigma, b_\sigma)$. Specific choices for the hyperparameter values are discussed in Section 4.4.
Choosing Hyperparameters to Achieve Desirable Characteristics of the Induced Prior
The prior distribution on the coefficient tensor, $\mathcal{B} \sim \pi_{\mathcal{B}}$, is induced by our prior specification on the remaining parameters. The choice of hyperparameters can have a large effect on model performance, and the use of diffuse, non-informative priors can perform poorly in some situations (Gelman et al., 2008). For that reason, and in order to assist default choices of hyperparameters leading to weakly informative prior distributions with interpretable flexibility, we study the properties of the induced prior on $\mathcal{B}$.
We do so in the following way. First, in Proposition 1 we provide expressions for the induced prior expectation, variance, and covariance of the entries of $\mathcal{B}$. These expressions illustrate the importance of certain hyperparameters in how the soft PARAFAC transitions away from its underlying, low-rank, hard version. Then, in Proposition 2 we provide default values for hyperparameters for a standardized 2-mode tensor predictor such that, a priori, the variance of each coefficient entry equals a target value $V^*$ and the proportion of that variance arising from the proposed PARAFAC softening is equal to $AV^*$. Finally, in Section 4.5 we study statistical properties of the induced prior on $\mathcal{B}$, and we show full support over a large class of coefficient tensors irrespective of the base rank $D$ used, which leads to consistent posterior distributions. All proofs are in Appendix A.
Proposition 1
For hyperparameter values such that the relevant prior moments exist, the induced prior satisfies $E(\beta_{j_1 j_2 \ldots j_K}) = 0$ for every entry, and, for $a_\sigma > 2$, the prior variance of each entry and the prior covariance between entries are available in closed form, as functions of the hyperparameters and of the quantities $\rho_l$, where $\rho_0 = 1$ and $\rho_l = a(a + 1) \cdots (a + l - 1)$ for $l \geq 1$, with $a$ the corresponding inverse gamma shape parameter.
Remark 1
The hyperparameters of the softening variance, $a_\sigma$ and $b_\sigma$. Remembering that $\sigma_k^2 \sim IG(a_\sigma, b_\sigma)$, the prior for $\sigma_k^2$ concentrates on larger values for larger $b_\sigma$ relative to $a_\sigma$, corresponding to a higher prior expectation of softening. Prior elicitation for $a_\sigma, b_\sigma$ could therefore be based on this ratio.
Remark 2
Variance of coefficient entries for the hard PARAFAC. For $\sigma_k^2 = 0$, the prior variance of the coefficient tensor entries is equal to the prior variance under the hard PARAFAC, denoted $\text{Var}_{\text{hard}}(\beta_{j_1 \ldots j_K})$. Comparing the variance of $\beta_{j_1 \ldots j_K}$ based on the soft and hard PARAFAC quantifies the amount of additional flexibility that is provided by the PARAFAC softening, expressed as

\[ AV = \frac{\text{Var}(\beta_{j_1 \ldots j_K}) - \text{Var}_{\text{hard}}(\beta_{j_1 \ldots j_K})}{\text{Var}(\beta_{j_1 \ldots j_K})}. \]

We refer to this quantity as the additional variance. Motivated by Figure 3, hyperparameters should be chosen such that the induced prior assigns more weight to coefficient tensors that resemble low-rank factorizations, and achieves sufficiently but not overly large variance on the regression coefficients. Proposition 2 provides such conditions on the hyperparameters for matrix predictors (K = 2). For a tensor predictor with K > 2, similar conditions on the hyperparameters can be derived by following analogous steps.
Proposition 2
For a matrix predictor, target variance $V^* \in (0, \infty)$, and target additional variance $AV^* \in [0, 1)$, if the hyperparameters satisfy $a_\sigma > 2$ together with the conditions in Equations 7 and 8, where $C = (\alpha/D + 1)/(\alpha + 1)$, then we have that, a priori, $\text{Var}(\beta_{j_1 j_2}) = V^*$ and $AV = AV^*$.
Proposition 2 is used in our simulations and in the study of brain connectomics to choose hyperparameters such that, a priori, $\text{Var}(\beta_{j_1 j_2}) = 1$ and $AV = 10\%$, assuming a tensor predictor with standardized entries. Specifically, we fix the shape hyperparameters of the two variance components at 3 and 0.5, and calculate the corresponding scale values for which $V^* = 1$ and $AV^* = 10\%$. We specify the Dirichlet hyperparameter $\alpha = 0.5 < 1$ to encourage, a priori, smaller values of $\zeta_{(d)}$ for a subset of the components. Throughout, we use $D = 3$, unless stated otherwise. For the hyperparameters controlling the underlying hard PARAFAC, we specify a shape parameter equal to 3. Lastly, assuming a centered and scaled outcome and scalar covariates, we specify a weakly informative normal prior on $(\mu, \boldsymbol{\delta})$, and residual variance $\tau^2 \sim IG(2, 0.35)$, which specifies $P(\tau^2 < 1) \approx 0.99$.
Remark 3
Interplay between variance hyperparameters. From Equation 8, we see that the prior mean of $\sigma_k^2$ is driven by the targets $V^*$ and $AV^*$, and does not depend on the remaining hyperparameters (considering that $a_\sigma/(a_\sigma + 1) \approx 1$). Furthermore, Equation 7 only includes hyperparameters which drive the underlying hard PARAFAC, and it depends on $V^*$ and $AV^*$ only through $V^*(1 - AV^*)$, which expresses the prior variability in $\mathcal{B}$ that is not attributed to the PARAFAC softening. Therefore, in Softer there is a clear and desirable separation between the hard and soft PARAFAC variance hyperparameters. Lastly, Equation 7 illustrates the interplay between the two components of the variance $V^*(1 - AV^*)$ of the underlying hard PARAFAC structure: when the prior mean of $\tau_\gamma$ increases, the prior mean of the remaining variance components must decrease for their contribution to remain equal to $V^*(1 - AV^*)$.
Posterior Consistency and Softer’s Robustness to the Underlying Rank
In this section, we focus on the choice of the rank $D$ for Softer. We argue that results based on Softer are likely robust to small changes in the choice of the underlying rank. To support this claim, we first provide two intuition-based arguments, and then a theoretical one. Finally, Softer's robustness to the choice of $D$ is empirically illustrated in simulated examples in Section 5.
When employing the hard PARAFAC, Guhaniyogi et al. (2017) recommended using $D = 10$ for a predictor of dimension 64 × 64. The reason is that the prior on $\boldsymbol{\zeta}$ permits some amount of flexibility in the number of components that contribute to the coefficient matrix approximation, in that (a) if the matrix can be well-approximated by a rank lower than $D$, the prior leads to a reduction in the approximation's effective rank, and (b) if all $D$ components are useful in estimation, all of them acquire sufficient weight. Since Softer employs the same prior on $\boldsymbol{\zeta}$, it also allows for sparsity in the effective components of the underlying hard PARAFAC, in that if the true coefficient tensor is of rank lower than $D$, higher-order components will be heavily shrunk towards zero, and softening will be minimal. This claim will be further supported in the simulations of Section 5.1, where we illustrate that Softer reverts back to the hard PARAFAC when the underlying low-rank structure is true.
However, if the true coefficient tensor is not of low rank, Softer can expand away from a low-rank structure in two ways: it can use additional components in the underlying hard PARAFAC, or it can soften away from the rigid underlying low-rank structure. The deviations based on the PARAFAC softening can effectively capture components corresponding to singular values of any magnitude. In Figure 4, we illustrate the range of singular values that would be accommodated when expanding away from a rank-$D_1$ hard PARAFAC approximation by (1) increasing the hard PARAFAC rank, and (2) softening the PARAFAC. Increasing the hard PARAFAC rank would include components corresponding to some small singular values, but softening the PARAFAC accommodates deviations from the underlying rank-$D_1$ structure across all singular values. Since Softer based on an underlying rank $D$ can capture these arbitrary deviations from a rank-$D$ structure, increasing the rank $D$ in Softer's underlying PARAFAC is not expected to drastically alter the results, making Softer robust to the choice of rank $D$. This intuition is empirically evaluated in Section 5.2.
Figure 4:
Hypothetical histogram of the singular values of a matrix. A rank-$D_1$ PARAFAC approximation incorporates the components corresponding to the $D_1$ largest singular values. Increasing the rank to $D_2$ allows for $D_2 - D_1$ additional components, whereas Softer allows for deviations corresponding to any singular value.
Softer's robustness to the choice of the underlying rank $D$ and its ability to capture any true coefficient tensor are also evident in the following theoretical results. In Proposition 3, we show that the prior on $\mathcal{B}$, $\pi_{\mathcal{B}}$, has full $L_\infty$ prior support, in that it assigns positive prior weight to any $\epsilon$-sized neighborhood of any true coefficient tensor $\mathcal{B}_0$. This indicates that, even if the value of $D$ is lower than the coefficient tensor's true rank, Softer assigns positive prior weight to a neighborhood of the true tensor. Full prior support is a key element for establishing sufficient flexibility of a Bayesian procedure. In turn, the posterior distribution of the coefficient tensor is consistent, irrespective of the true coefficient tensor's rank, or the rank used in Softer's underlying PARAFAC.
Proposition 3 (Full prior support)
Let $\epsilon > 0$. Then $\pi_{\mathcal{B}}\!\left( \{ \mathcal{B} : \| \mathcal{B} - \mathcal{B}_0 \|_\infty < \epsilon \} \right) > 0$, where $\| \mathcal{B} \|_\infty = \max_{j_1, \ldots, j_K} | \beta_{j_1 \ldots j_K} |$.
We assume that the true data generating model is Equation 2 with true coefficient tensor $\mathcal{B}_0$, and that the tensor predictor has bounded entries. Since our interest is in estimating $\mathcal{B}_0$, we assume that $\tau^2 = 1$, $\mu = 0$ and $\boldsymbol{\delta} = 0$ are known. Then, we have the following.
Proposition 4
The posterior distribution of $\mathcal{B}$ is weakly consistent for $\mathcal{B}_0$.
These results indicate that softening the PARAFAC allows us to capture any true coefficient tensor. Since they hold irrespective of the choice of the underlying rank, they also indicate that Softer is robust to the choice of rank $D$, at least asymptotically.
Approximating the Posterior Distribution of the Coefficient Tensor
We approximate the posterior distribution of the model parameters using Markov chain Monte Carlo (MCMC). An MCMC scheme where most parameters are updated using Gibbs sampling is shown in Appendix C. We found this approach to be sufficiently efficient when the sample size is larger than or of similar order to the number of parameters. However, in very high-dimensional settings where p ≫ n, mixing and convergence were slow under reasonable time constraints. For that reason, and in order to provide a sampling approach that performs well across n, p situations, we instead rely on Hamiltonian Monte Carlo (HMC) implemented in Stan (Carpenter et al., 2017) and its R interface (Stan Development Team, 2018) to acquire samples from the posterior distribution. HMC is designed to improve mixing relative to Gibbs sampling by employing simultaneous updates and relying on gradients, calculated with automatic differentiation, to obtain efficient proposals. Using Stan, we employ the No-U-Turn Sampler (NUTS) algorithm (Hoffman and Gelman, 2014), which automatically tunes the HMC parameters to achieve a target acceptance rate (the default step size adaptation parameters were used). If MCMC convergence is slow, one could increase the NUTS target acceptance parameter in RStan from its default value of 0.8 to a value closer to 1.
MCMC convergence was assessed based on visual inspection of traceplots across chains with different starting values, and on the potential scale reduction factor (Gelman and Rubin, 1992) for the regression coefficients $\mu$, $\boldsymbol{\delta}$, $\mathcal{B}$, and the residual variance $\tau^2$. Note that the remaining parameters are not individually identifiable, as the underlying PARAFAC parameters are themselves non-identifiable, and therefore the corresponding softening parameters are also non-identifiable. In simulations, we considered a model that restricts the underlying PARAFAC parameters to make them identifiable (using the constraints discussed in Section 2.3 of Guhaniyogi et al. (2017)). We found that the original and constrained models had identical estimation performance, but that the MCMC for the unconstrained model, which includes non-identifiable parameters, required a smaller number of iterations to converge than the constrained model using the identifiable underlying PARAFAC.
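For reference, in RStan the target acceptance rate is raised through the control argument of the fitting call; a generic sketch follows (the Stan file name and data list below are placeholders, not files distributed with the paper):

    library(rstan)
    # stan_data: a list holding the outcome, tensor predictor entries,
    # scalar covariates, dimensions, and hyperparameter values
    fit <- stan(file = "softer.stan", data = stan_data,
                chains = 4, iter = 15000,
                control = list(adapt_delta = 0.95))  # raised from the 0.8 default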
Simulations
To illustrate the performance of Softer and compare it against alternatives, we simulated data under various scenarios. In one set of simulations (Section 5.1), we considered a tensor predictor of dimension 32 × 32, corresponding coefficient tensors ranging from close to low-rank to full-rank and with different degrees of sparsity, and a sample size equal to 400. In another set of simulations (Section 5.2), we considered a tensor predictor of dimension 20 × 20 and corresponding coefficient tensors of rank 3, 5, 7, 10 and 20, in order to investigate the performance of Softer relative to the hard PARAFAC for a true coefficient tensor that increasingly deviates from low-rank form, and for various choices of the algorithmic rank D. The sample size in this case was 200. In all scenarios, the predictor's entries were drawn independently from a N(0, 1) distribution, and the outcome was generated from a model of the form of Equation 2 with true residual variance τ² = 0.5.
We considered the following methods: (a) Softer, (b) the Bayesian hard PARAFAC approach of Guhaniyogi et al. (2017), and (c) estimating the coefficient tensor by vectorizing the predictor and performing Lasso. We considered these two competing approaches because they represent the two extremes of how much priority is given to the predictor's array structure (the hard PARAFAC directly depends on it; the Lasso completely ignores it), whereas Softer is designed to exploit the predictor's structure while allowing deviations from it. We also considered (d) the Bayesian tensor regression approach of Spencer (2020), which is based on the Tucker decomposition, a generalization of the PARAFAC decomposition.
Methods were evaluated in terms of how well they estimated the true coefficient tensor by calculating (1) the entry-specific bias and mean squared error of the posterior mean (for the Bayesian approaches) and of the penalized likelihood estimate (for the Lasso), and (2) the frequentist coverage of the 95% credible intervals. In order to evaluate the methods' performance in accurately identifying important entries (entries with non-zero coefficients), we calculated each method's (3a) sensitivity (percentage of important entries that were identified), (3b) specificity (percentage of non-important entries that were correctly deemed non-important), (3c) false positive rate (percentage of identified entries that are truly not important), and (3d) false negative rate (percentage of non-identified entries that are important); a minimal implementation of these definitions is sketched below. For the Bayesian methods, an entry was flagged as important if its corresponding 95% credible interval did not overlap with zero. Hierarchical Bayesian models have been shown to automatically perform adjustment for multiple testing error (Scott and Berger, 2010; Müller et al., 2006). Confidence intervals and entry selection for the Lasso were not considered. We also evaluated the models' predictive performance using (4) the predictive mean squared error, defined as the mean of the squared difference between the true outcome and the predictive mean over 1,000 new data points. Lastly, (5) we compared the hard PARAFAC and Softer in terms of their computational time, for various ranks D.
Additional simulations are shown in Appendix D and are summarized below, where applicable.
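For concreteness, the selection metrics (3a)-(3d) can be computed as follows (a minimal R sketch of the stated definitions; truth and flagged are our own hypothetical logical vectors over the tensor entries):

    selection_metrics <- function(truth, flagged) {
      c(sensitivity = 100 * mean(flagged[truth]),    # % of important entries identified
        specificity = 100 * mean(!flagged[!truth]),  # % of non-important entries not flagged
        FPR         = 100 * mean(!truth[flagged]),   # % of flagged entries that are not important
        FNR         = 100 * mean(truth[!flagged]))   # % of non-flagged entries that are important
    }

Note that the FPR is undefined when no entry is flagged, which is why Table 2 averages it only over data sets with at least one identified entry.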
Simulation Results for Tensor Predictor of Dimension 32×32
The scenarios we considered represent settings of varying complexity and sparsity. The first column of Figure 5 shows the true coefficient tensors (squares, feet, dog, diagonal). The squares coefficient matrix represents a scenario where the true coefficient matrix is rectangular, but neither low-rank nor sparse. The next two scenarios represent situations where the underlying structure is not of low-rank form, but could potentially be approximated by a low-rank matrix up to a certain degree, hence representing scenarios somewhat favorable to the hard PARAFAC. In the last case, the diagonal coefficient matrix represents a sparse coefficient matrix of full rank without a network structure, a scenario that is favorable for the Lasso but is expected to be difficult for the hard PARAFAC. Therefore, the scenarios considered here represent a wide range of rank and sparsity specifications, and we investigate Softer's ability to use a low-rank structure when such structure is useful (feet, dog), and to expand away from it when it is not (squares, diagonal). While none of these coefficient tensors is low-rank, a scenario with a low-rank coefficient matrix is considered in the Appendix and discussed briefly below.
Figure 5:
True and average estimated coefficient matrix based on Softer, hard PARAFAC, Tucker regression, and Lasso.
The remaining columns of Figure 5 show the average posterior mean or penalized estimate across simulated data sets. Results from Softer and the hard PARAFAC correspond to D = 3, though the methods are also considered with rank D = 7 (these results are shown in the Appendix and discussed below). In the squares, feet and dog scenarios, the hard PARAFAC performs decently in providing a low-rank approximation to the true coefficient matrix. However, certain coefficient entries are estimated poorly in order to fit its rectangular structure. This is most evident in the squares scenario, where the non-zero coefficients are obscured in order to fit a low-rank form. Results for the approach based on the Tucker decomposition are worse, with estimated coefficient matrices that deviate further from the truth. In the diagonal scenario, the hard PARAFAC and Tucker approaches completely miss the diagonal structure and estimate (on average) a coefficient matrix that is very close to zero. In contrast, the Lasso performs best in the sparse, diagonal scenario and estimates, on average, the correct coefficient matrix structure. However, in the squares, dog and feet settings, it underestimates the coefficient matrix, since it relies on a sparsity assumption that does not hold and does not borrow information across coefficient matrix entries. In all situations, Softer closely identifies the structure of the underlying coefficient matrix, providing a compromise between tensor-based and unstructured estimation and exhibiting small biases across all simulated scenarios (average bias also reported in Table 1). In the squares, feet and dog scenarios, Softer bases estimation on the low-rank structure of the underlying hard PARAFAC, but expands from it to better describe the true coefficient matrix's details. At the same time, Softer also performs well in the diagonal scenario, where the true coefficient tensor is full-rank. Therefore, the strength of Softer lies in its ability to use the low-rank structure of the PARAFAC when helpful, and to diverge from it when needed.
Table 1:
Simulation Results. Average bias, root mean squared error, frequentist coverage of 95% credible intervals among truly zero and truly non-zero coefficient entries, and predictive mean squared error for Softer (with D = 3), the hard PARAFAC (with D = 3), Tucker regression, and Lasso for the simulation scenario with tensor predictor of dimensions 32 × 32 and sample size n = 400. Bold text is used for the approach performing best in each scenario and for each metric. If no entry is bold, no conclusive argument can be made.
                                     Softer   PARAFAC   Tucker   Lasso
squares    Truly zero      bias      0.003    0.005     0.013    0.012
                           rMSE      0.033    0.05      0.05     0.143
                           coverage  99.7%    98.3%     99.8%    –
           Truly non-zero  bias      0.084    0.106     0.136    0.501
                           rMSE      0.11     0.148     0.166    0.601
                           coverage  80.1%    68.4%     90.7%    –
           Prediction      MSE       5.05     8.99      12.16    111.8
feet       Truly zero      bias      0.037    0.046     0.057    0.016
                           rMSE      0.092    0.109     0.102    0.198
                           coverage  96.9%    94.2%     95.5%    –
           Truly non-zero  bias      0.116    0.138     0.175    0.43
                           rMSE      0.184    0.21      0.226    0.558
                           coverage  89.2%    80%       84.8%    –
           Prediction      MSE       31.9     41.8      50.6     264.6
dog        Truly zero      bias      0.047    0.059     0.083    0.029
                           rMSE      0.111    0.129     0.129    0.151
                           coverage  98.0%    92.8%     95.5%    –
           Truly non-zero  bias      0.129    0.159     0.214    0.350
                           rMSE      0.205    0.230     0.261    0.435
                           coverage  88.2%    76.9%     78.5%    –
           Prediction      MSE       35.0     45.4      61.7     138.4
diagonal   Truly zero      bias      0.002    0.004     0.002    <0.001
                           rMSE      0.02     0.051     0.024    0.009
                           coverage  100%     100%      100%     –
           Truly non-zero  bias      0.111    0.899     0.954    0.07
                           rMSE      0.126    0.906     0.955    0.084
                           coverage  94.7%    3%        0.8%     –
           Prediction      MSE       1.41     29.7      30.6     0.81
In Table 1, we report the methods' bias and root mean squared error (rMSE), predictive mean squared error, and frequentist coverage of the 95% credible intervals. The conclusions remain unchanged: Softer performs similarly to the hard PARAFAC when the latter's underlying structure is close to true, and it has the ability to diverge from that structure and estimate a coefficient tensor that is not low-rank in the other scenarios. This is evident from the frequentist coverage of the 95% posterior credible intervals, which averages 94.7% for the truly non-zero entries in the diagonal scenario. In terms of predictive ability, the pattern observed for the bias and mean squared error persists, with Softer having the smallest predictive mean squared error in the squares, feet and dog scenarios, and the Lasso in the diagonal scenario, followed closely by Softer. Across all scenarios, the approach based on the Tucker decomposition performs worst among the Bayesian methods in terms of both estimation and predictive accuracy.
Table 2 shows the performance of the Softer, hard PARAFAC, and Tucker regression approaches in identifying the important entries of the tensor predictor (entries with non-zero coefficients). Perfect performance would imply specificity and sensitivity equal to 100, and false positive and negative rates (FPR, FNR) equal to 0. The methods perform comparably in terms of specificity, sensitivity, and FNR, except for the diagonal scenario, where the sensitivity of the hard PARAFAC and Tucker approaches is dramatically lower. However, the big difference is in the FPR. Even though Softer's FPR is at times higher than 5%, it remains at much lower levels than those of the hard PARAFAC and Tucker approaches, for which the FPR is on average over 10% in the dog scenario and approximately 30% in the diagonal scenario. These results indicate that identifying important entries based on the hard PARAFAC could lead to overly high false discovery rates, whereas Softer alleviates this issue. In Appendix D.1, we investigate the cases where Softer and the hard PARAFAC return contradicting results regarding an entry's importance. We illustrate that, when the hard PARAFAC identifies entries as important and Softer does not, these are most often entries with small coefficient values, flagged in an effort to fit the coefficient matrix into a low-rank form.
Table 2:
Methods’ performance in identifying important entries. For sensitivity, specificity and false negative rate (FNR), results are shown as average across simulated data sets (×100), and for false positive rate* (FPR) as average (10, 90 percentile) (×100)
                       Sensitivity   Specificity   FPR               FNR
squares    Softer      100           99.7          0.9 (0, 2.5)      0
           PARAFAC     100           98.3          4.7 (1.2, 8.7)    0
           Tucker      100           99.9          0.3 (0, 0.4)      0
feet       Softer      64.5          96.9          2.9 (1.7, 4.1)    36.3
           PARAFAC     68.4          94.1          5.2 (3.3, 7.1)    34.3
           Tucker      63.6          95.5          4.4 (3.3, 5.5)    37.2
dog**      Softer      52.9          96.7          5.2 (2.7, 8.1)    34.7
           PARAFAC     63.1          90.1          12.4 (8.4, 15.9)  30.9
           Tucker      46.3          93.1          11.5 (5.2, 17.6)  38.6
diagonal   Softer      100           100           0 (0, 0)          0
           PARAFAC     3             100           28.8 (0, 70)      3
           Tucker      < 1           100           33.3 (10, 50)     3.1
* The average FPR is taken over the simulated data sets for which at least one entry was identified as important.
** Most coefficients in the dog simulation were non-zero. Results are presented considering coefficients smaller than 0.05 as effectively zero.
Appendix D shows additional simulation results. Appendix D.2 shows results for alternative coefficient matrices, including a coefficient matrix of rank 3, a scenario favorable to the hard PARAFAC. There, we see that Softer collapses to the underlying low-rank structure when such a structure is true. These simulation results are in agreement with our first intuition-based argument of Section 4.5 regarding the choice of rank D for Softer. Appendix D.3 shows results for a subset of the coefficient matrices and for sample size n = 200. These simulations, with a smaller n to p ratio, show that Softer performs comparably to the hard PARAFAC in the dog and feet scenarios and has substantially smaller bias and rMSE for the truly non-zero coefficients in the diagonal scenario. The most notable conclusion is that the Softer results are closer to the hard PARAFAC results when the sample size is small. This indicates that the data inform the amount of PARAFAC softening, which depends on the sample size. Finally, Appendix D.4 shows results for Softer and the hard PARAFAC when D = 7. Even though the hard PARAFAC shows substantial improvements using the higher rank, Softer performs almost identically for rank 3 or 7, illustrating its robustness to the choice of the underlying rank. This last conclusion is further investigated in Section 5.2.
Simulation Results for Coefficient Tensor of Increasing Rank
We evaluated the performance of hard and soft PARAFAC tensor regression for various values of the algorithmic rank D and of the coefficient tensor's true rank. We considered a tensor predictor of dimension 20 × 20, true coefficient matrices of rank 3, 5, 7, 10 and 20, and we evaluated the hard PARAFAC and Softer with D ∈ {1, 3, 5, 7, 10}. For every value of the true rank, we generated 100 data sets.
In Figure 6, we show the average across entries of the coefficient matrix of the absolute bias and mean squared error, as well as the predictive mean squared error. For illustrative simplicity, we include a subset of the results: Softer with D = 3, and the hard PARAFAC with D = 3 and D = 5; the complete results are included in the Appendix and discussed below. When the true rank of the coefficient matrix is 3, the three approaches perform similarly. This indicates that both the hard PARAFAC with D = 5 and Softer are able to revert back to low ranks when a low-rank structure is true. For true rank equal to 5 or 7, the hard PARAFAC with D = 5 slightly outperforms Softer. However, for true rank greater than 7, Softer based on a rank-3 underlying structure performs best, both in estimation and in prediction. These results indicate that, in realistic situations where the coefficient tensor is not of low-rank form, Softer with a low rank has the ability to capture the coefficient tensor's complex structure more accurately than the hard PARAFAC.
Figure 6:
Average absolute bias, estimation mean squared error and predictive mean squared error (y-axis) for tensor predictor of dimensions 20 × 20 and true coefficient matrix of increasing rank (x-axis). Results are shown for the hard PARAFAC with D = 3 and D = 5, and Softer with D = 3.
In Appendix D.5, we show the performance of Softer and the hard PARAFAC across all values of D considered, D ∈ {1, 3, 5, 7, 10}. The conclusions about the methods' relative performance remain unchanged: even though Softer and the hard PARAFAC perform similarly for coefficient tensors with small true rank, Softer with D > 1 outperforms the hard PARAFAC in terms of both estimation and prediction when the coefficient tensor is of high rank. These results illustrate that the performance of the hard PARAFAC depends heavily on the choice of rank D. In contrast, Softer's performance is strikingly similar across all values of the algorithmic rank, supporting the second intuition-based argument on Softer's robustness in Section 4.5.
Finally, Table 3 shows the average time across simulated data sets for 15,000 MCMC iterations for the two methods. As expected, the computational time for both approaches increases as the value of D increases, whereas the true rank of the coefficient tensor seems to not play a role. Softer generally requires two to three times as much time as the hard PARAFAC for the same value of D, which might seem computationally prohibitive at first. However, our results in Figure 6 and Appendix D.5 show that Softer requires a smaller value of D in order to perform similarly to or better than the hard PARAFAC in terms of both estimation and prediction. Therefore, Softer's computational burden can be drastically alleviated by fitting the method with a value of D that is substantially lower than the value of D required by the hard PARAFAC. In Section 7, we also discuss frequentist counterparts to our model that might further reduce the computational load.
Table 3:
Computational time (in minutes) for Softer and the hard PARAFAC.
                         True rank of coefficient tensor
                         3     5     7     10    20
Softer         D = 1     28    29    30    37    39
               D = 3     166   110   101   104   117
               D = 5     181   149   158   159   152
               D = 7     212   189   205   216   220
               D = 10    275   236   255   288   259
Hard PARAFAC   D = 1     16    16    15    15    15
               D = 3     30    39    46    35    36
               D = 5     69    54    57    53    56
               D = 7     100   90    82    72    76
               D = 10    104   117   108   104   102
Estimating the Relationship Between Brain Connectomics and Human Traits
Data from the Human Connectome Project (HCP) contain information on about 1,200 healthy young adults, including age, gender, various brain imaging data, and a collection of measures assessing cognition, personality, substance intake and so on (referred to as traits). We are interested in studying the brain structural connectome, referring to anatomical connections of brain regions via white matter fiber tracts. The white matter fiber tracts can be indirectly inferred from diffusion MRI data. Two brain regions are considered connected if there is at least one fiber tract running between them. However, there can be thousands of fiber tracts connecting a pair of regions. Properties of the white matter tracts in a connection, such as the number of tracts and their patterns of entering the regions, might be informative about an individual's traits. Using data from the HCP and based on the soft tensor regression framework, we investigate the relationship between different connectome descriptors and human traits.

Structural connectivity data were extracted using state-of-the-art pipelines in Zhang et al. (2018). In total, about 20 connectome descriptors (adjacency matrices) describing different aspects of white matter fiber tract connections were generated (see Zhang et al. (2018) for more information on the extracted descriptors). Each adjacency matrix has dimension 68 × 68, representing the connection pattern of R = 68 regions. The 68 regions were defined using the Desikan-Killiany atlas (Desikan et al., 2006). Of the 20 extracted connectome features, we consider two in this analysis: (a) count, describing the number of streamlines, and (b) connected surface area (CSA), describing the area covered by small circles at the intersections of fiber tracts and brain regions, since they are the most predictive features according to results in Zhang et al. (2019).

We examine the relationship between these descriptors of structural brain connections and 15 traits, covering domains such as cognition, motor, substance intake, psychiatric and life function, emotion, personality and health. The full list of outcomes we analyze is presented in Table E.1 and includes both binary and continuous traits. For binary traits, a logistic link function is assumed.
Table E.1:
List of outcomes with description and descriptive statistics.
Category      Name                   Description                                             #obs   Type  Mean (SD)
Cognition     ReadEng AgeAdj         Age adjusted reading score                              1,065  C     107.1 (14.8)
              PicVocab AgeAdj        Age adjusted vocabulary comprehension                   1,065  C     109.4 (15.2)
              VSPLOT TC              Spatial Orientation (Variable Short Penn Line
                                     Orientation Test)
Psychiatric   Depressive Ep          Has the participant experienced a diagnosed DSM-IV
                                     major depressive episode over his/her lifetime          1,035  B     9.2%
Emotion       AngHostil Unadj        NIH Toolbox Anger and Affect Survey (Attitudes of
                                     Hostility)                                              1,064  C     50.5 (8.58)
Personality   NEOFAC Agreeableness   "Big Five" trait: Agreeableness                         1,063  C     32 (4.93)
Health        BMI                    Body Mass Index                                         1,064  C     26.4 (5.1)
Adapting Softer for (Semi-)Symmetric Brain Connectomics Analysis
The nature of the brain connectivity data implies that the R × R-dimensional tensor predictor including a specific connectivity feature among R regions of interest (ROIs) is symmetric, and the diagonal elements can be ignored since self-loops are not considered. Further, considering p features simultaneously would lead to an R × R × p tensor predictor which is semi-symmetric (symmetric along its first two modes). The (semi-)symmetry encountered in the predictor allows us to slightly modify Softer and reduce the number of parameters by imposing that the estimated coefficient tensor is also (semi-)symmetric. We provide the technical details for the (semi-)symmetric Softer in Appendix F.
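As a rough illustration of the parameter savings from symmetry, the sketch below builds a symmetric rank-D coefficient matrix from a single set of factors, so that only the R(R − 1)/2 off-diagonal connections enter the linear predictor. This schematic shows a symmetric mean structure only, omits Softer's row-specific deviations, and its particular factor parameterization is our own assumption rather than the exact specification in Appendix F.

import numpy as np

R, D = 68, 6
rng = np.random.default_rng(1)
b = rng.normal(size=(D, R))   # one factor vector per rank component
lam = rng.normal(size=D)      # component weights

# B = sum_d lam_d * b_d b_d^T is symmetric by construction, so only
# R(R - 1)/2 off-diagonal entries are distinct rather than R(R - 1)
B = np.einsum("d,di,dj->ij", lam, b, b)

# With self-loops ignored, each connection contributes once; the factor of 2
# accounts for the two symmetric entries of the predictor
iu = np.triu_indices(R, k=1)
def linear_predictor(X):      # X: a symmetric R x R connectome
    return 2.0 * np.sum(X[iu] * B[iu])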
Analyses of the Brain Connectomics Data
For the purpose of this paper, we investigate the relationship between features of brain connections and human traits by regressing each outcome on each of the two predictors (count and CSA) separately. Even though analyzing the relationship between the traits and multiple features simultaneously is possible, we avoid doing so here for simplicity. We analyze the data employing the following methods: (1) symmetric Softer with D = 6, (2) the hard PARAFAC approach of Guhaniyogi et al. (2017) with D = 10, which does not impose symmetry of the coefficient tensor, and (3) Lasso on the vectorized lower triangular part of the tensor predictor. Since publicly available code for the hard PARAFAC approach does not cover non-continuous outcomes, we only consider it when predicting continuous outcomes.

We compare methods based on their predictive performance. For each approach, we estimate the out-of-sample prediction error by performing 15-fold cross validation, fitting the method on 90% of the data and predicting the outcome on the remaining 10%. In the case of Softer and the hard PARAFAC, we also investigate the presence of specific brain connections that are important in predicting any of the outcomes by checking whether their coefficients' 95% posterior credible intervals include zero. Additional results based on Softer for a different choice of baseline rank, or when symmetry is ignored, are included in Appendix E and are summarized below.
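As a hedged example of the comparison pipeline, the sketch below vectorizes the strictly lower triangular part of each connectome and estimates the Lasso baseline's out-of-sample error with scikit-learn. The arrays A (of shape n × 68 × 68) and y, as well as the inner LassoCV tuning, are our own assumptions rather than the exact analysis code.

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

def vectorize_lower(A):
    il = np.tril_indices(A.shape[1], k=-1)  # strictly lower triangular entries
    return A[:, il[0], il[1]]

def lasso_cv_mse(A, y, n_splits=15, seed=0):
    X = vectorize_lower(A)
    errors = []
    for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        fit = LassoCV(cv=5).fit(X[train], y[train])  # tune the penalty within the fold
        errors.append(np.mean((y[test] - fit.predict(X[test])) ** 2))
    return np.mean(errors)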
Using Features of Brain Connections for Predicting Human Traits
For continuous outcomes, the methods' predictive performance was evaluated by calculating the percentage of the marginal variance explained by the model, defined as 1 − (CV MSE)/(marginal variance). For binary outcomes, we used the model's estimated linear predictor to estimate the optimal classification cutoff based on Youden's index (Youden, 1950), and calculated the average percentage of correctly classified observations in the held-out data.

Figure 7 shows the results for the three approaches considered, and for each feature separately. For most outcomes, one of the two features appeared to be most predictive across approaches. For example, the count of streamlines was more predictive than CSA of an individual's anger level (AngHostil), independent of the modeling approach used. By examining the methods' predictive performance, it is evident that features of brain connectomics are, in some cases, highly predictive of outcomes. Specifically, over 30% of the variance in an individual's strength level, and over 10% of the variance in endurance, reading comprehension, and picture vocabulary ability, can be explained by the count or CSA of streamlines of their brain connections.
Figure 7:
Top: Percentage of outcome variance explained by the tensor predictor for continuous outcomes, calculated as [1 − MSE/(marginal variance)] × 100. Bottom: Average percentage of units correctly classified for binary outcomes. Results are presented using a different color for each method and a different line type for each feature of brain connections.
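The two performance summaries just described can be computed as in the sketch below, where y_test, y_pred, and the held-out linear predictor eta are hypothetical quantities; scikit-learn's roc_curve is used only to scan candidate cutoffs for Youden's index.

import numpy as np
from sklearn.metrics import roc_curve

def pct_variance_explained(y_test, y_pred, marginal_variance):
    cv_mse = np.mean((y_test - y_pred) ** 2)
    return (1.0 - cv_mse / marginal_variance) * 100

def youden_accuracy(y_binary, eta):
    fpr, tpr, thresholds = roc_curve(y_binary, eta)
    cutoff = thresholds[np.argmax(tpr - fpr)]  # Youden's J = sensitivity + specificity - 1
    return np.mean((eta >= cutoff) == (y_binary == 1)) * 100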
No single approach outperformed the others in prediction across all features and outcomes. However, approaches that accommodate the network structure performed better than Lasso in most situations. One example is Softer's performance relative to Lasso when predicting individuals' previous depressive episode (Depressive Ep). Here, Lasso performs worse than the random classifier, whereas Softer has over 90% accuracy. Even when the number of observations is less than 300 (difficulty quitting tobacco, TB DSM Difficulty Quitting), Softer performs only slightly worse than Lasso. For continuous outcomes, Softer and the hard PARAFAC perform comparably. As we saw in the simulations in Section 5 and in Appendix D.3, the similar predictive performance of Softer and the hard PARAFAC could be due to the limited sample size, which forces Softer to rely heavily on the underlying low-rank structure for estimation, essentially reverting back to the hard PARAFAC.

The low signal in predicting some outcomes implies low power in identifying pairs of brain regions whose connection's features are important. In fact, 95% credible intervals for all coefficients using the hard PARAFAC overlapped with zero. In contrast, Softer identified seven important connections: five of them were for predicting whether an individual has had a depressive episode (three using the count of streamlines as the predictor, and two using CSA), one in predicting an individual's strength, and one in predicting the Variable Short Penn Line Orientation Test score (VSPLOT). The identified connections are listed in Table 4 and agree with the neuroscience literature.

All identified connections in predicting a depressive episode involve the parahippocampal gyrus, which is the posterior limit of the amygdala and hippocampus and is located in the temporal lobe, and ROIs located in the frontal lobe (paracentral, lateral orbitofrontal, pars orbitalis). Dysfunction of the parahippocampal gyrus (as well as the amygdala and hippocampus) has been identified in various studies as an important factor in major depression and in the emotion-related memory observed in depression (Mayberg, 2003; Seminowicz et al., 2004; LaBar and Cabeza, 2006; Zeng et al., 2012). Further, dysregulation of the pathways between the frontal and temporal lobes has been identified as predictive of depression (Mayberg, 1994; Steingard et al., 2002), even when explicitly focusing on the cortical regions Softer identified as important (Liao et al., 2013). Two of the three connections were identified as important irrespective of the tensor predictor used (count or CSA). Even though these predictors are related, they describe different aspects of brain connectivity; therefore, Softer identified the same connections as important based on two separate measures of brain connectivity. The identified connection in predicting strength involves the precuneus and superior parietal regions in the parietal lobe. Precuneus connectivity has been associated with a variety of human functions, including motor-related traits (Cavanna and Trimble, 2006; Wenderoth et al., 2005; Simon et al., 2002), and the parietal lobe in general is believed to control the human motor system (Fogassi and Luppino, 2005).
Table 4:
Brain connections with important features in predicting human traits.
Outcome              Feature   ROI 1                                     ROI 2
Depressive Episode   Count     (lh) Parahippocampal                      (rh) Lateral Orbitofrontal
                                                                         (lh) Paracentral
                                                                         (lh) Pars Orbitalis
                     CSA                                                 (rh) Lateral Orbitofrontal
                                                                         (lh) Paracentral
Strength             Count     (rh) Precuneus                            (rh) Superior Parietal
VSPLOT               CSA       (lh) Banks of Superior Temporal Sulcus    (lh) Rostral Anterior Cingulate
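For concreteness, the credible-interval selection rule used above can be implemented as in this sketch, assuming posterior draws of the coefficient matrix are stored in an array samples of shape (number of draws, R, R); the name and storage format are hypothetical.

import numpy as np

def important_connections(samples, level=0.95):
    # pointwise credible interval for every coefficient entry
    lo = np.quantile(samples, (1 - level) / 2, axis=0)
    hi = np.quantile(samples, 1 - (1 - level) / 2, axis=0)
    flagged = (lo > 0) | (hi < 0)  # interval excludes zero
    iu = np.triu_indices(samples.shape[1], k=1)
    return [(i, j) for i, j in zip(*iu) if flagged[i, j]]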
Appendix E.2 includes results from the symmetric Softer using a smaller rank (D = 3), and from Softer using the same rank as the results in this section (D = 6) but ignoring the known symmetry of the predictor. All three versions of Softer perform similarly in terms of prediction, with potentially slightly lower predictive power for the symmetric Softer with rank 3. The entries identified as important by Softer with D = 3 are similar, though not identical, to the ones in Table 4. When symmetry is not accounted for, Softer does not identify any important connections. Finally, in Appendix E.3 we investigate the reproducibility of identified connections across random subsamples including 90% of the observations. We find that for strength, when connections are identified in a subsample, they generally include the ones in Table 4. Moreover, when predicting whether an individual has had a depressive episode, all of the connections identified across subsamples, for either predictor, involved at least one of the Parahippocampal (in the left hemisphere) or Lateral Orbitofrontal (in the right hemisphere) ROIs, implying that the importance of these two regions is strikingly reproducible.
Discussion
In this paper, we considered modeling a scalar outcome as a function of a tensor predictor within a regression framework. Estimation of regression models in high dimensions is generally based on some type of assumed sparsity of the true underlying model: sparsity directly on covariates, or "latent sparsity" via an assumed low-dimensional structure. When the assumed sparsity does not hold, the model's predictive ability and estimation can suffer. Our approach is positioned within the class of latent sparsity, since it exploits a low-dimensional underlying structure. We focused on adequately relaxing the assumed structure by softening the low-dimensional PARAFAC approximation and allowing for interpretable deviations of row-specific contributions. We showed that softening the PARAFAC leads to improved estimation of coefficient tensors, better performance in identifying important entries, consistent estimation irrespective of the underlying rank used for estimation, and more accurate predictions. The approach is applicable to both continuous and binary outcomes, and was adapted to (semi-)symmetric tensor predictors, which are common in settings where the predictor is measured on a network of nodes. Softer was used to study the relationship between brain connectomics and human traits, and identified several important connections in predicting depression.

Combining the two types of assumed sparsity for low-rank and sparse matrix estimation has received some attention in the literature, especially in machine learning with matrix data. Candès et al. (2011), Waters et al. (2011) and Zhou and Tao (2011) decomposed matrices as the sum of a low-rank and a sparse component. Zhang et al. (2016) employed such a decomposition in anomaly detection by studying the Mahalanobis distance between the observed data and the low-rank component. Richard et al. (2012) developed a penalization-based approach to estimating matrices that are simultaneously sparse and low-rank by adopting one type of penalty for sparsity and one for rank. All these approaches have been formulated as optimization problems, and the algorithms are generally based on iterative procedures.

Within a Bayesian regression framework, Guha and Rodriguez (2021) combined the two types of sparsity and proposed a network-based spike and slab prior on the nodes' importance. Under that model, a node is either active or inactive, and active nodes are expected to contribute to the outcome based on a low-rank coefficient tensor. In that sense, this approach has important commonalities with estimating a coefficient tensor that is simultaneously sparse and low-rank. Even though we consider this approach promising, node selection in itself can be too restrictive in some settings, and future work could incorporate hierarchical or parallel node and entry selection.

Softer has similarities with, but also differences from, the methods discussed above. On one hand, Softer provides a relaxation of an assumed low-rank form. However, this relaxation (or softening) is not sparse in any way, and every element of the tensor is allowed to deviate from the low-rank structure. An exciting line of research would combine low-rank and sparse approaches while allowing for sufficient flexibility to deviate from both of them. Important questions remain on the interplay between entry selection and the assumed structure on the coefficient tensor. We illustrated in simulations that tensor regression based on the hard PARAFAC can perform poorly for entry selection.
Future work could focus on studying the properties of multiplicity control in structured settings, and on forming principled variable selection approaches with desirable properties within the context of structured data.

Even though we showed that the posterior distribution of the coefficient tensor based on Softer is consistent irrespective of the tensor's true rank or the algorithmic rank used, additional theoretical results would help illuminate Softer's performance. For example, it would be interesting to understand how Softer performs under the asymptotic regime where the tensor predictor is allowed to grow with the sample size. Furthermore, even though our simulation results indicate that Softer reverts back to the hard PARAFAC when the underlying low-rank structure is true, future approaches to "soft" regression could investigate whether the soft and hard approaches asymptotically converge if the hard structure is true. If this result were derived, it would imply a new form of robustness: robustness in terms of the method chosen, soft or hard.

Finally, a criticism of Bayesian methods is often their computational burden. One way forward in alleviating this issue for Softer is to reformulate our model within the frequentist paradigm. For example, one could impose an L2 penalty on the magnitude of the overall mean parameters, substitute the softening structure in Equation 6 with a corresponding L2 penalty on the row-specific deviations from the mean, and maximize the penalized likelihood with appropriate tuning parameters. One might need to link the tuning parameter of the mean penalty with that of the deviation penalty to ensure that penalization of one does not lead to overcompensation by the other (similarly to our discussion in Section 4.3 on the inclusion of ζ in the distribution of the row-specific contributions). Such L2-norm penalties with a continuous outcome would imply an optimization problem that is convex in each block of parameters, so optimization procedures can be developed relatively easily. Even though this reformulation would not reduce the number of parameters, it is expected to require far fewer iterations than an MCMC procedure. Alternative norms could also be considered to impose more sparsity, such as a group-Lasso penalty on the deviations.
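A minimal sketch of this frequentist reformulation for a continuous outcome follows. The additive deviation matrix delta and the exact penalty forms are our own simplification of the softening structure, not the authors' formulation; the objective is quadratic, hence convex, in each parameter block with the others held fixed, so block-coordinate descent applies.

import numpy as np

def penalized_loss(params, X, y, R, D, lam_mean, lam_dev):
    # unpack two factor matrices for the low-rank mean and a deviation matrix
    g1 = params[:R * D].reshape(R, D)
    g2 = params[R * D:2 * R * D].reshape(R, D)
    delta = params[2 * R * D:].reshape(R, R)
    B = g1 @ g2.T + delta  # "soft" structure: low-rank mean plus deviations
    resid = y - np.einsum("ijk,jk->i", X, B)
    # the two tuning parameters should be balanced so that penalizing one set
    # of parameters does not lead to overcompensation by the other
    return (resid @ resid
            + lam_mean * (np.sum(g1 ** 2) + np.sum(g2 ** 2))
            + lam_dev * np.sum(delta ** 2))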
Table D.1:
Average bias, root mean squared error, frequentist coverage of 95% credible intervals among truly zero and truly non-zero coefficient entries, and predictive mean squared error for Softer, hard PARAFAC and Lasso for the simulation scenario with tensor predictor of dimensions 32 × 32 and sample size n = 400. Bold text is used for the approach performing best in each scenario and for each metric.
                              Softer    PARAFAC   Lasso
constant squares
  Truly zero      bias        0.0016    0.0014    0.0013
                  rMSE        0.016     0.015     0.016
                  coverage    99.2%     99.7%     –
  Truly non-zero  bias        0.0211    0.017     0.073
                  rMSE        0.06      0.076     0.093
                  coverage    90.9%     79.6%     –
  Prediction      MSE         0.79      1.25      1.32
varying feet
  Truly zero      bias        0.06      0.076     0.014
                  rMSE        0.123     0.145     0.169
                  coverage    96.4%     90.7%     –
  Truly non-zero  bias        0.165     0.205     0.569
                  rMSE        0.244     0.279     0.661
                  coverage    87.7%     73.8%     –
  Prediction      MSE         40.79     55.19     194
varying dog
  Truly zero      bias        0.071     0.091     0.013
                  rMSE        0.159     0.176     0.152
                  coverage    98.3%     92.7%     –
  Truly non-zero  bias        0.263     0.321     0.554
                  rMSE        0.358     0.398     0.644
                  coverage    81.2%     63.2%     –
  Prediction      MSE         67.6      85.3      159.2
Table D.2:
Mean bias and rMSE among truly zero and truly non-zero coefficient entries (presented as bias (rMSE)), and average and IQR of the predictive mean squared error (presented as average (IQR)) for tensor predictor of dimensions 32 × 32 and n = 200. Bold text is used for the approach performing best in each scenario.
                   Softer                 PARAFAC               Lasso
feet
  Truly zero       0.06 (0.14)            0.065 (0.15)          0.01 (0.118)
  Truly non-zero   0.216 (0.332)          0.229 (0.328)         0.535 (0.585)
  Prediction       95.4 (76.6, 111.1)     94 (78.8, 107.5)      312 (299, 323)
dog
  Truly zero       0.042 (0.155)          0.087 (0.164)         0.017 (0.155)
  Truly non-zero   0.171 (0.264)          0.174 (0.254)         0.248 (0.305)
  Prediction       110.6 (99.6, 121.1)    101.2 (94, 107.8)     177.3 (170.9, 183.4)
diagonal
  Truly zero       0.005 (0.054)          0.003 (0.038)         0.002 (0.02)
  Truly non-zero   0.753 (0.773)          0.945 (0.948)         0.149 (0.181)
  Prediction       22.76 (21.04, 24.29)   31.08 (30.1, 31.78)   2.00 (1.57, 2.32)
Table D.3:
Simulation results for Softer and hard PARAFAC with D = 7.
                              Softer    PARAFAC
squares
  Truly zero      bias        0.003     0.005
                  rMSE        0.034     0.049
                  coverage    99.5%     98.6%
  Truly non-zero  bias        0.085     0.104
                  rMSE        0.111     0.146
                  coverage    79.6%     70.5%
  Prediction      MSE         5.17      8.76
feet
  Truly zero      bias        0.035     0.041
                  rMSE        0.092     0.104
                  coverage    97.3%     96.7%
  Truly non-zero  bias        0.112     0.128
                  rMSE        0.181     0.199
                  coverage    89.2%     85.2%
  Prediction      MSE         30.69     37.15
dog
  Truly zero      bias        0.079     0.078
                  rMSE        0.138     0.138
                  coverage    92.8%     94.7%
  Truly non-zero  bias        0.084     0.095
                  rMSE        0.157     0.166
                  coverage    92.4%     90.3%
  Prediction      MSE         32.49     36.68
diagonal
  Truly zero      bias        0.002     0.005
                  rMSE        0.02      0.06
                  coverage    100%      99.9%
  Truly non-zero  bias        0.113     0.852
                  rMSE        0.128     0.861
                  coverage    93.3%     4.1%
  Prediction      MSE         1.43      28.09
Table E.2:
Identified connections based on Softer with D = 3.
Outcome     Feature   ROI 1                                     ROI 2
Strength    Count     (rh) Precuneus                            (rh) Superior Parietal **
                                                                (lh) Superior Frontal *
                                                                (rh) Caudal Middle Frontal *
                      (rh) Isthmus of cingulate gyrus           (rh) Pericalcarine
            CSA       (rh) Precuneus                            (rh) Superior Parietal **
                                                                (lh) Superior Frontal *
                                                                (rh) Caudal Middle Frontal *
VSPLOT      CSA       (rh) Banks of Superior Temporal Sulcus    (lh) Superior Frontal
Endurance   CSA       (lh) Cuneus                               (lh) Pericalcarine *
                      (lh) Cuneus                               (rh) Pericalcarine *
                      (lh) Insula                               (rh) Inferior Parietal

** Connections that are related to those identified based on Softer with D = 6.
*  Connections that are also identified based on an alternative feature, outcome, or hemisphere.