Wei Cheng1,2,3, Sohini Ramachandran1,2,3, Lorin Crawford3,4,5. 1. Department of Computer Science, Brown University, Providence, RI, USA. 2. Department of Ecology and Evolutionary Biology, Brown University, Providence, RI, USA. 3. Center for Computational Molecular Biology, Brown University, Providence, RI, USA. 4. Department of Biostatistics, Brown University, Providence, RI, USA. 5. Microsoft Research New England, Cambridge, MA, USA.
Abstract
In this paper, we propose a new approach for variable selection using a collection of Bayesian neural networks with a focus on quantifying uncertainty over which variables are selected. Motivated by fine-mapping applications in statistical genetics, we refer to our framework as an "ensemble of single-effect neural networks" (ESNN) which generalizes the "sum of single effects" regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes (e.g., case-control studies). Through extensive simulations, we demonstrate our method's ability to produce calibrated posterior summaries such as credible sets and posterior inclusion probabilities, particularly for traits with genetic architectures that have significant proportions of non-additive variation driven by correlated variants. Lastly, we use real data to demonstrate that the ESNN framework improves upon the state of the art for identifying true effect variables underlying various complex traits.
In this paper, we propose a new approach for variable selection using a collection of Bayesian neural networks with a focus on quantifying uncertainty over which variables are selected. Motivated by fine-mapping applications in statistical genetics, we refer to our framework as an "ensemble of single-effect neural networks" (ESNN) which generalizes the "sum of single effects" regression framework by both accounting for nonlinear structure in genotypic data (e.g., dominance effects) and having the capability to model discrete phenotypes (e.g., case-control studies). Through extensive simulations, we demonstrate our method's ability to produce calibrated posterior summaries such as credible sets and posterior inclusion probabilities, particularly for traits with genetic architectures that have significant proportions of non-additive variation driven by correlated variants. Lastly, we use real data to demonstrate that the ESNN framework improves upon the state of the art for identifying true effect variables underlying various complex traits.
Variable selection is a fundamental problem in high-dimensional statistical learning that arises in a wide range of application domains (George and McCulloch, 1993; Fan and Lv, 2010; Carbonetto and Stephens, 2012;
Yamada et al., 2020). An important benefit of incorporating sparsity when building a predictive model is that it provides interpretations on which input variables are most important in explaining variation across the output variables. Such a property is particularly desirable when the end goal of an application also includes scientific discovery. For example, the goal of many genome-wide association (GWA) studies is not just to predict the disease status or phenotypic risk of a patient but also to identify the (subsets of) single-nucleotide polymorphisms (SNPs) that are statistically associated with the genetic architecture of the disease (Manolio, 2010; Maller et al., 2012). This can further help with downstream clinical applications such as drug development.Although many methods for variable selection have been developed in the literature (George and McCulloch, 1993; Fan and Lv, 2010; Carbonetto and Stephens, 2012; Yamada et al., 2020; Zou and Hastie, 2005; Tibshirani, 1996), some significant challenges still remain. One important challenge is assessing the uncertainty in which variables should be selected when they are highly correlated (Wang et al., 2020; Carbonetto and Stephens, 2012). As an extreme case, imagine there are two variables that are completely collinear. In this context, it becomes statistically impossible to distinguish them, and many traditional regularization and shrinkage methods will arbitrarily select one SNP as being associated with the trait of the interest and disregard the other (Wang et al., 2020). While such a strategy suffices if the goal is to build a predictive model, it becomes limiting for scientific discovery because the conclusions rely on selecting the correct subset of genetic variants for downstream investigation. Recently, Wang et al. (2020) introduced the “sum of single effects” model called SuSiE to address these issues. More specifically, SuSiE assesses the uncertainty of variables by providing “credible sets” which, in the case of our extreme example, effectively summarize that “either SNP 1 or 2 is relevant, but we are unsure as to which one.” SuSiE uses an iterative Bayesian stepwise selection (IBSS) procedure where it iteratively regresses out effect variables and feeds the corresponding residuals to the next iteration for training.The main limitation of SuSiE is that it is a linear model and therefore does not capture nonlinear effects in data. In GWA studies, it is well known that the genetic architecture of complex traits can be driven by phenomena such as dominance and epistasis (Minamikawa et al., 2017; Crawford et al., 2017; Ramstein et al., 2020; Li et al., 2013). Indeed, machine learning models are most powered in settings when large sets of training data are available and often exhibit greater predictive accuracy than linear models in applications driven by non-additive variation. In this paper, we introduce the “ensemble of single-effect neural networks” (ESNN) framework which overcomes the limitations of SuSiE while preserving the ability to assess uncertainty for variable selection (Figure 1). We demonstrate our approach in a simulation study and on two real GWA datasets.
Figure 1
An example of a single-effect neural network (SNN) with only the first input variable having an effect on the outcome
An example of a single-effect neural network (SNN) with only the first input variable having an effect on the outcome
Results
In this section, we first examine the utility of the ESNN model in simulations motivated by fine-mapping applications for continuous and binary traits in GWA studies. We also apply our method to real-world GWA datasets from the Wellcome Trust Case Control Consortium (WTCCC) and the Wellcome Trust Centre for Human Genetics.
Simulations with continuous phenotypes
In order to evaluate the performance of our model on continuous traits, we simulate data using real genotypes from chromosome 1 of 5,000 randomly sampled individuals of self-identified European ancestry in the UK Biobank (Bycroft et al., 2018). After quality control (Demetci et al., 2021), this dataset had 36,518 SNPs (see STAR Methods). To simulate fine-mapping applications, we used the NCBI’s Reference Sequence (RefSeq) database in the UCSC Genome Browser (Pruitt et al., 2005) to annotate SNPs to genes. Here, we randomly sampled 200 genes on this chromosome where the annotations included both SNPs located within the gene boundary and SNPs that fall within a kb window of the boundary to also include regulatory elements (see STAR Methods).In this study, each gene is considered to be its own dataset with its own complex correlation structure (see Figure S1) and unique number of SNPs (ranging from 50 to 417 variants) encoded as copies of a reference allele where 0 and 2 represent “homozygotes” and 1 represents “heterozygotes.” For each dataset, we assign 5 effect SNPs and use the following generative modelwhere represents the set of causal SNPs, and is an indicator function. Here, β and ω are different effect sizes for heterozygotes and homozygotes, respectively. Both variables are randomly sampled from a standard normal distribution and rescaled according to their frequencies. The error term is also assumed to be normally distributed and is rescaled during the simulation such that the causal SNPs explain a certain proportion of the variance in the synthetic trait (i.e., the narrow-sense heritability, ). We consider different scenarios where .We compare our method with other fine-mapping approaches: SuSiE (Wang et al., 2020), DAP-G (Wen et al., 2016), CAVIAR (Hormozdiari et al., 2014), and FINEMAP (Benner et al., 2016). We run all competing methods under their default parameter settings. We set for both SuSiE and our approach. For ESNN, we used a simple sparse architecture with 5 hidden neurons and tanh activation functions (see STAR Methods). Here, we set the maximum number of epochs to be 30; the hyper-parameter for the indicator variable is chosen from a uniform distribution; we fixed for all L models, and during training, we take 100 Monte Carlo samples to evaluate the log likelihood (see STAR Methods). Finally, we used an Adam optimizer with a learning rate of 0.005 and a decay rate of 0.995 after every epoch, and we used an early stopping rule if the likelihood on validation data stopped increasing (based on 85/15 training/validation splits).To assess performance, we consider three different metrics. The first two metrics focus on evaluating the credible sets. To our knowledge, since only SuSiE and DAP-G generate credible sets, we only compare ESNN with these two methods for these metrics (DAP-G produces “signal clusters,” which follows a definition similar to credible sets in Definition 1; see Wang et al. (2020) for relevant discussion on this distinction). We begin by assessing the probability that each method creates a credible set containing at least 1 effect SNP (first row Figure 2). Ideally, a 95% level credible set should have at least 95% coverage. When heritability is high (e.g., 0.4), signals are easier to detect, and both ESNN and SuSiE achieve the appropriate coverage. However, for lowly heritable traits (e.g., 0.1 and 0.05), the coverage of SuSiE and DAP-G drops, while the coverage of ESNN remains the same. The second metric we check is the average number of effect variables included in all credible sets (Figure S3). In practice, each method can report multiple credible sets. Therefore, this metric essentially helps evaluate the total number of effect variables discovered by each method. Overall, DAP-G and ESNN consistently outperform SuSiE, with DAP-G having the advantage. However, because the coverage of DAP-G is poor (Figure 2), this result effectively means that DAP-G generates a large number of credible sets with false-positive signals. For the final metric, we assess the ability of ESNN and the competing approaches to accurately prioritize causal variants according to the posterior inclusion probabilities (PIPs) that each method provides (see STAR Methods). Here, we use receiver operating characteristic (ROC) and precision-recall curves to compare their ability to rank true positives over false positives (Figures 3 and S4). As decreases, accuracy of the PIPs for all methods decreases, while our method is relatively better powered for all scenarios. Since SuSiE is the most comparable method to the ESNN model, we also highlight the scenarios (denoted by an asterisk) where the distribution of the area under the curve for our method is significantly larger than that for SuSiE (satisfying ). Importantly, the PIPs from ESNN and SuSiE are calibrated similarly (Figure S2).
Figure 2
Comparisons of coverage for ESNN, SuSiE, and DAP-G in simulation studies under different levels of heritability
Results are based on 200 data replicates with standard errors represented by the grey bars.
Figure 3
Receiver operating characteristic (ROC) curves for simulation studies of different scenarios
Listed in each panel are p values indicating the level of significant difference between results for ESNN and SuSiE according to their respective areas under the curve (AUCs) across the simulations. Asterisks (∗) denote scenarios where ESNN is significantly better powered than SuSiE (i.e., satisfying ). Results are based on 200 data replicates.
Comparisons of coverage for ESNN, SuSiE, and DAP-G in simulation studies under different levels of heritabilityResults are based on 200 data replicates with standard errors represented by the grey bars.
Simulations with binary phenotypes
We now assess the performance of ESNN on binary traits (e.g., case-control studies). We consider two generative models for the class labels: (1) logistic regression and (2) a liability threshold (LT) model (Lee et al., 2011; Golan et al., 2014; Falconer, 1965). In the former, we simply use the genotypes from chromosome 1 of the 5,000 randomly sampled individuals from the UK Biobank to assume thatwhere, in addition to the previous notation, the binary traits follow a Bernoulli distribution with probability . In the latter simulation model, we take into account disease prevalence and ascertainment bias which can occur in case-control studies. Here, we adopt the LT model which assumes a latent liability for each observation. With some known prevalence k, one can determine a threshold using the quantile function of normal distribution such that an individual is a case if . To simulate data under the LT model, we first generate 1 million individuals each with 200 SNPs (with minor allele frequency uniformly sampled between 0.05 and 0.5). Next, we select 5 causal SNPs and generate continuous liabilities with a controlled heritability using a model similar to Equation (1). Then we consider a prevalence and define case-control labels for each of the million individuals. Finally, we subsample 2,500 cases and 2,500 controls for the analysis.Once again, we compare ESNN to SuSiE (Wang et al., 2020), DAP-G (Wen et al., 2016), CAVIAR (Hormozdiari et al., 2014), and FINEMAP (Benner et al., 2016) using coverage (Figure 2), the number of effect variables included in all credible sets per dataset (Figure S3), ROC curves (Figure 3), and precision-recall curves (Figure S4). The SuSiE framework was originally designed for continuous traits, so we consider two adaptations of the model for the binary data. In the first, we simply treat the class labels as continuous and run the model as is. In the second, which we refer to as LT-SuSiE, we use a Markov Chain Monte Carlo to estimate continuous liability scores as phenotypes (Felsenstein, 2005; Falconer, 1965; Curnow and Smith, 1975). Here, we use all the same parameter settings as in the regression simulation study, except that we set the learning rate for ESNN to be 0.01. Overall, performances follow a similar trend to the regression simulations such that ESNN consistently outperforms other methods. When disease prevalence is very low (e.g., 1%), cases are assumed to come from the “tail” of the distribution. In this scenario, statistical models are generally better powered (Weissbrod et al., 2015). As the prevalence k becomes greater, such that the LT moves from the tails to the center of the distribution, it will become harder for a classifier to distinguish cases from controls. This also results in lower power for variable selection. Notably, even in these cases, our method remains robust.Receiver operating characteristic (ROC) curves for simulation studies of different scenariosListed in each panel are p values indicating the level of significant difference between results for ESNN and SuSiE according to their respective areas under the curve (AUCs) across the simulations. Asterisks (∗) denote scenarios where ESNN is significantly better powered than SuSiE (i.e., satisfying ). Results are based on 200 data replicates.
Fine-mapping in heterogeneous stock of mice
We applied ESNN and SuSiE to two continuous traits: high-density lipoprotein (HDL) and low-density lipoprotein (LDL) in a heterogeneous stock of mice dataset from the Wellcome Trust Centre for Human Genetics (Valdar et al., 2006). This dataset contains 10,346 SNPs with samples for HDL and samples for LDL (see STAR Methods). To run both methods, we simply partition the whole genome into 21 windows where each window contains 500 SNPs. By doing so, we fine-map SNPs in annotated genes as well as SNPs in intergenic regions. We used the same hyper-parameter settings as in the regression simulations for both methods.For HDL and LDL, ESNN finds 41 and 19 credible sets while SuSiE finds 62 and 26 credible sets, respectively. Our method finding less credible sets can potentially be due to the criterion that we only include an SNN model into the ensemble if it increases the likelihood. This criterion demonstrated to ensure that a credible set generated by ESNN would have high coverage in simulations (Figure 2). There were 12 SNPs that were included in the credible sets of both methods for HDL and 5 for LDL. This potentially means that these SNPs contributed additive effects to the phenotypic variation. SNPs that are only identified by ESNN probably contribute nonlinear effects (e.g., dominance). We highlighted one region for each trait in Figure S5. One SNP found by both methods, rs3090325 in LDL (Figure S5A), can be mapped to the Smarca2 gene, which has been found to be associated with cholesterol regulation (Meaney, 2014). In HDL (Figure S5B), SNP gnf04.147.942 can be mapped to the Panc1 gene, which regulates pancreatic activity and has been shown to be linked with HDL (Mancuso et al., 2020). Furthermore, SNPs such as rs13483562 (which is only found by ESNN in the LDL analysis), can be mapped to the Aldh1a7 gene, which also has been demonstrated to affect related traits such as lipid, cholesterol level, and obesity in mice (Yoo and Desiderio, 2003; Lee et al., 2006).
Fine-mapping in the WTCCC 1 study
We next apply ESNN and SuSiE to two binary traits: type 1 diabetes (T1D) and type 2 diabetes (T2D) from the WTCCC 1 study (Wellcome Trust Case Control Consortium, 2007). This dataset has cases and controls for T1D, cases and controls for T2D, along with 458,868 genotyped SNPs for each individual (see STAR Methods). Similarly, we run ESNN and SuSiE with a window size of 500 SNPs and use the same model settings as in the binary simulations.ESNN identifies 32 and 19 credible sets for T1D and T2D, respectively, whereas SuSiE finds 67 and 30 sets for each trait, respectively. There are 5 SNPs that are found by both methods for T1D but none for T2D. This is likely due to the fact that SuSiE was not originally developed for binary traits and also due to the potential role of nonlinear genetic architecture. We highlight two interesting results in Figure 4 where we plot the PIPs of SNPs computed by ESNN and SuSiE. In panel (a), we show a window near the human leukocyte antigen (HLA) region on chromosome 6, which has been well studied in the literature and found to be associated with T1D (Hu et al., 2015; Nejentsev et al., 2007; Erlich et al., 2008; Noble and Valdes, 2011). One of the two SNPs found only by ESNN, rs3129051, is located upstream (within 50kb) of the HLA-G gene, which is a well-known gene that is related to T1D. The other SNP, rs16894900, is located between MAS1L (within 50kb downstream) and UBD (within 50kb upstream), both of which have been shown to be related to T1D (Noble and Valdes, 2011). In panel (b), we highlight the region around NOS1AP on chromosome 1. This gene has been found to be linked with T2D in several studies (Hu et al., 2010; Chu et al., 2010; Qin et al., 2010). Our method identified 2 SNPs in this region, but SuSiE reports none. It has been suggested that this region may not play a dominant role in susceptibility to T2D, but a minor effect may exist (Hu et al., 2010). Similar to SuSiE, these conclusions were previously made using linear models. We hypothesize that this region may contribute to T2D nonlinearly, and thus, the traditional hypothesis-testing methods would have missed this signal.
Figure 4
Posterior inclusion probabilities (PIP) of ESNN and SuSiE in the WTCCC analysis
(A) Highlighted region for type 1 diabetes (T1D). Significant SNPs found only by ESNN (included in the credible sets), only by SuSiE, and by both methods are color coded in red, black, and blue, respectively.
(B) Highlighted region for type 2 diabetes (T2D).
Posterior inclusion probabilities (PIP) of ESNN and SuSiE in the WTCCC analysis(A) Highlighted region for type 1 diabetes (T1D). Significant SNPs found only by ESNN (included in the credible sets), only by SuSiE, and by both methods are color coded in red, black, and blue, respectively.(B) Highlighted region for type 2 diabetes (T2D).
Discussion
In this paper, we present the ESNN which generalizes the sum of single-effects regression framework by accounting for nonlinear genetic architecture and extending to non-continuous phenotypes. The ESNN approach provides PIPs and credible sets that can guide variable selection (see STAR Methods). While we focus on genetic fine-mapping, this method is also applicable to other fields especially when data are correlated and sparse. We provide a variational algorithm with several relaxation techniques that enables scalable inference (see STAR Methods). We show that ESNN can effectively increase power for variable selection using simulations. We applied ESNN to two real-world genetic datasets and demonstrated its ability to make discoveries that are biologically meaningful.
Limitations of the study
There are a few limitations to the current ESNN framework. Similar to most deep learning models, our method requires large sample sizes for training and requires hyper-parameter fine-tuning. For high-dimensional settings, we currently run the method by splitting the whole dataset into small windows so that the training algorithm can quickly converge. However, this may ignore some long-range interactions. Therefore, a focus for future work will be extending the current model with more complex network architectures.
STAR★Methods
Key resources table
Resource availability
Lead contact
Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Lorin Crawford (lcrawford@microsoft.com).
Materials availability
This study did not generate new unique reagents.
Method details
The sum of single-effects regression model
In this section, we provide background on single-effects regression (SER) and state a rigorous definition of credible sets for variable selection. The original SER model (Servin and Stephens, 2007; Pickrell, 2014) assumes that exactly one of J input variables has a non-zero coefficient. More specifically,where is an N-dimensional response vector (e.g., continuous phenotypes); is an design matrix (e.g., genotypes); is an N-dimensional error term; is a J-dimensional vector of regression coefficients; is a binary indicator that determines which regression coefficient is to be non-zero; and denotes the multinomial distribution with m samples drawn with class probability distribution . For simplicity, we will consider a uniform prior such that . Note that m is set to equal to one so that the coefficient vector has exactly one non-zero entry for modeling the single-effect. To estimate the statistical association of each variable, one would fit J-univariate models corresponding to regressing each j-th column of onto the response and computing posterior inclusion probabilities defined as .In the context of statistical genetics, the original SER model only assumes one causal SNP. However, we know that in many real-world applications, it is desired to have a method that flexibly allows for many variants to have an effect on trait architecture (Carbonetto and Stephens, 2012; Demetci et al., 2021). The SuSiE framework is based on an extension of summing over L-multiple SER models (Wang et al., 2020). Here, the main idea is to construct an overall effect vector from multiple single-effect coefficients via the followingIn practice, SuSiE uses an iterative Bayesian stepwise selection (IBSS) algorithm (i.e., coordinate ascent variational inference) to estimate the model parameters. More specifically, at each iteration, it fits the SER model for using the residuals from the model . At the end of training, the SuSiE model provides L estimated coefficient vectors and L corresponding PIP vectors . Computation of a final inclusion probability assumes that effects are independent across the L different models and is computed asA key component of SuSiE is that it uses these PIPs to naturally construct credible sets. Effectively, a level ρ credible set can be estimated by simply sorting variables in descending order and then including variables into the set until their cumulative probability exceeds ρ (Wang et al., 2020). Below we give the rigorous definition for credible sets.
Definition 1 (Wang et al., (2020))
In the context of a multiple-regression model, a level ρ credible set is defined to be a subset of variables that has probability ρ or greater of containing at least one effect variable (i.e., a variable with non-zero regression coefficient). Equivalently, the probability that all variables in the credible set have zero regression coefficients isor less.The definition above yields a metric for assessing the uncertainty when conducting variable selection. A credible set will determine if a subset of collinear variables have effects on the response even when we are unclear as to which specific ones. This differs from the results produced by the conventional regularization and shrinkage methods (Carbonetto and Stephens, 2012; Tibshirani, 1996; Zou and Hastie, 2005) where the effect sizes for an arbitrarily selected subset of correlated variables will be penalized while the others are retained.
The ensemble of single-effect neural networks
In this section, we detail the full specification of our proposed nonlinear framework for variable selection. While there exist many nonlinear models, neural networks are well known to have the ability to approximate complex systems (Leshno et al., 1993; Barron, 1993). For simplicity, we will focus on multi-layer perceptrons throughout this paper; however, we also want to emphasize that the theoretical concepts we describe can also be applied broadly to other architectures (e.g., convolutional neural networks). Formally, we specify a K-layer probabilistic neural network as a generalized nonlinear modelwhere, in expectation, the response variable is related to the input data by ; is an N-dimensional latent vector to be learned; denotes a general cumulative link function which, for example, is set to be the identity if is continuous or the logit if is binary; denotes the matrix of nonlinear neurons from the k-th hidden layer with corresponding weight matrix ; are deterministic biases that are produced during the network training phase for the k-th hidden layer; is a nonlinear activation function (e.g., ReLU or tanh); and is a matrix of weights for the input layer.Similar to the SER model, the key design that leads to our ability to model single-effect is through the prior we place on the input layer weights in . Let represent the number of neurons in the k-th hidden layer such that is dimensions (i.e., the number of input variables by the number of neurons in the first hidden layer). Next, let denote the j-th row of the weight matrix . We place a grouped “single-effect” shrinkage prior on the input weightswhere is a matrix that is copies of the binary vector , is a -dimensional row-vector of continuous weights in , and denotes the Hadamard product between two parameters. Note that this shrinkage prior mimics the sparse assumption of previous neural network architectures in the literature (Chen et al., 2021; Ghosh et al., 2019), except that the binary indicator variable is assumed to be multinomial with one trial. Hence, since the j-th row of contains the weights connected to the j-th column in , when only , the rest of the input variables are excluded from the model (see proof-of-concept example in Figure 1). Together, we refer to the model above as a “single-effect neural network” (SNN). The SNN resembles the SER model in that it assumes that only one input variable has an effect on the response and, thus, posterior summaries of can be similarly used to compute credible sets.We now extend the SNN to incorporate multiple effect variables. Analogous to the SuSiE framework, we now consider training on the response variable to be based on an ensemble of single-effect neural networks (ESNN). Probabilistically, ESNN maybe specified as a summation of L-latent nonlinear models of the formwhere, in expectation, the response variable is now related to the input data as and the sparse prior for the weights of the network are now specified as the followingNotice that at the end of training, each l-th neural network will also yield an estimated set of input layer weights and a corresponding set of inclusion probabilities which each assess whether all weights connected to the j-th input node are equal to zero. Then, given these L posterior summaries, we can compute credible sets in the same way as SuSiE by defining the overall posterior inclusion probabilities aswhich we use to determine variable significance. A motivating example of the benefits of the ESNN model can be found in Figure S6.
Posterior inference via variational bayes
As the size of many high-throughput genome-wide sequencing studies continue to grow, both in the number of individuals and the number of genetic variants, it has become less feasible to implement traditional Markov Chain Monte Carlo (MCMC) algorithms for inference. To this end, we use variational inference to approximate the posterior distribution of the weights and hyper-parameters within the ESNN framework. We take the hierarchical model specified in (Equations 8 and 9) and replace the intractable true posterior distribution over the parameters with an approximating family of distributions —where we use shorthand to represent the L models in the ensemble, represent the collection of free parameters in the approximations, and is used to denote the observed data and all relevant hyper-parameters. The basic idea behind the variational inference is to iteratively adjust the free parameters such that they minimize the the difference between the two distributions, which amounts to maximizing the so-called evidence lower bound (ELBO)Here, the first term is the expectation of the log likelihood taken with respect to the variational distribution, and the second term is the Kullback-Leibler divergence which measures the similarity between two distributions. We then use a stochastic gradient descent based method to train models under the ESNN framework. In this work, we choose the variational distributions to factorize across L models and for each model we have the following proposalsBased on these choices, the gradients of the KL term are available in closed form, while the expectation of the log likelihood is evaluated using Monte Carlo samples and the local re-parameterization trick (see below for theoretical details and corresponding pseudocode in Algorithm 1). In a regression task with continuous responses, the log likelihood term is chosen to be Gaussian and maximizing the lower bound corresponds to minimizing mean square error. In classification tasks for case-control studies, the log likelihood term is taken to be a binomial distribution which corresponds to minimizing the cross-entropy loss. Since we use gradient descent based method for optimization, ESNN can be applied for both types of data analyses.Input genotype data and phenotypic vector .Choose the number of models L, number of maximum iterations T, and credible set level ρ.Randomly initialize variational parameters for the L models.Initialize the models and iterations .while and
doFix hyper-parameters .Sample using re-parameterization trick.Compute the approximate log likelihood .Compute the gradients for only using the approximate log likelihood.Update using the gradients with optimizers.Compute PIPs and credible sets for the l-th model.ifthen “Purity” Checkif is continuous thenIBSS Procedureend ifend ifend whileCompute (marginal) posterior inclusion probabilities (PIP) for each variable.Determine credible sets .Return.
Details of the variational algorithm
To find the expectation of the log likelihood during posterior inference, we use Monte Carlo samples and a local re-parameterization trick to compute gradients. More specifically, when assuming Gaussian distributions for the variational approximating familiesThis technique has been shown to successfully reduce the variance of gradients (Kingma and Welling, 2014) and stabilizes the training process. Next, we assume that the indicator variables are sampled from a categorical distribution. We adopt a continuous relaxation technique for re-parameterizing these variables by sampling them from a Gumbel-Softmax distribution which is specified as the following (Jang et al., 2017; Maddison et al., 2017),where are the approximate samples for , τ is a temperature parameter, and uniformly sampled random variable. As , samples will become closer to the desired vector where only one entry is one and the rest are zeros. In our experiments, we choose for numerical stability.The convergence of the inclusion probabilities is also important for the ESNN model as it directly influences the performance of variable selection. Importantly, appears very early in the computational pipeline since they are defined for the weights in first hidden layer. As a result, the gradients for can be very small and hinder convergence during training. This problem is commonly known as “vanishing gradients” (Hochreiter, 1998). For our work, we found that simply scaling up the learning rate when updating works well in practice. Note that the Kullback-Leibler (KL) divergence term in the approximate likelihood can be decomposed as the followingwhere the KL divergence for the weights is between two normal distributions with and being an -dimensional row-vector; while the KL divergence for the indicator variables, where is a matrix that is copies of the J-dimensional binary vector , is taken between two discrete multinomial distributions. Importantly, these terms have closed-form solutions with which gradients can be computed.
Iterative bayesian stepwise selection
Similar to the SuSiE framework, the ESNN model also uses an iterative Bayesian stepwise selection (IBSS) procedure where it trains L models by first fitting one model with a coordinate ascent algorithm and then regressing out that model to compute residuals for training next model. By doing so, we can generate credible sets (Wang et al., 2020). It is worth noting that, when the model is uncertain about which variables to choose (e.g., when there are no significant effect variables), will become diffuse such that will contain many variables that are not correlated. Under these scenarios, it makes sense to ignore those sets. Previous work have outlined the concept of “purity” as the smallest absolute correlation between all pairs of variables within a credible set which can be used as a criteria for filtering out nonsensical results (Wang et al., 2020). This same strategy is not particularly useful on its own for the ESNN framework. An intuitive explanation for this is because since the optimizing objective for neural networks is non-convex, training algorithms can get stuck in local optima where the estimated variational parameters are not optimal. In the scenario where the model is unable to find correct effect variable, regressing out will only introduce noise during training. Therefore, we take an extra approach where we also check to ensure that a trained model is informative before computing the residuals. One simple way to do this is by monitoring whether the likelihood is larger with the l-th model trained versus it be excluded from consideration. More specifically, the criteria to include the l-th model can be expressed via the (approximate) likelihood ratiowhere we keep models that satisfy . Note that we only regress out variables on continuous data as this is the scenario where it is meaningful to compute the residuals. For the binary classification case, we simply fix the trained models and add up the logits if the criteria is satisfied.
Quantification and statistical analysis
Our study made use of three real datasets. The simulation results made use of imputed data released from the UK Biobank (Bycroft et al., 2018). Quality control for these data were carried out using the following procedure. First, we only studied individuals who self-identified as being of European ancestry. From this cohort, we further excluded individuals identified by the UK Biobank to have high heterozygosity, excessive relatedness, or aneuploidy (1,550 individuals removed). We also removed individuals whose kinship coefficient was greater than 0.0442 (i.e., close relatives). Next, we removed (i) monomorphic SNPs, (ii) SNPs with minor allele frequency less than 2.5 , (iii) SNPs not in Hardy-Weinberg Equilibrium (Fisher exact test ), (iv) SNPs with missingness greater than 1 , and (v) SNPs in high linkage disequilibrium (using the flag --indep-pairwise 50 5 0.9 with PLINK 1.9 (Purcell et al., 2007)). After all QC steps, we had a final dataset of 349,414 individuals from which we could downsample and 36,518 SNPs on the first chromosome. Next, we used the NCBI’s Reference Sequence (RefSeq) database in the UCSC Genome Browser (Pruitt et al., 2005) to annotate SNPs with appropriate genes. We defined genes using the UCSC gene boundary and augmenting those boundaries by adding SNPs within a 500 kilobase (kb) buffer to account for possible regulatory elements. Genes with only one SNP within their boundary were excluded.One part of the analysis results in this work made use of GWA data from the Wellcome Trust Centre for Human Genetics. This study contains a total of 1,814 heterogeneous stock of mice from 85 families (all descending from eight inbred progenitor strains) (Valdar et al., 2006), and 131 quantitative traits that are classified into six broad categories including behavior, diabetes, asthma, immunology, haematology, and biochemistry. Here, we focused on two specific phenotypes from these categories including: high-density lipoprotein content (Biochem.HDL) and low-density lipoprotein content (Biochem.LDL). Both phenotypes were corrected for sex, age, body weight, season, year, and cage effects. For individuals with missing genotypes, we imputed values by the mean genotype of that SNP in their corresponding mouse family. Only polymorphic SNPs with minor allele frequency above 5% were kept for the analyses. This left a total of 10,346 autosomal SNPs that were available for all mice.The second part of the data analysis used data from the Wellcome Trust Case Control Consortium (WTCCC) one study (Wellcome Trust Case Control Consortium, 2007) which consists of about 14,000 cases of seven common diseases, including 1,963 cases of type 1 diabetes (T1D) and 1,924 cases of type 2 diabetes (T2D), as well as 2,938 shared controls. We selected a total of 458,868 shared single nucleotide polymorphisms (SNPs) following a previous study (Zhou et al., 2013).
Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025
Authors: Henry Erlich; Ana Maria Valdes; Janelle Noble; Joyce A Carlson; Mike Varney; Pat Concannon; Josyf C Mychaleckyj; John A Todd; Persia Bonella; Anna Lisa Fear; Eva Lavant; Anthony Louey; Priscilla Moonsamy Journal: Diabetes Date: 2008-02-05 Impact factor: 9.461
Authors: Christian Benner; Chris C A Spencer; Aki S Havulinna; Veikko Salomaa; Samuli Ripatti; Matti Pirinen Journal: Bioinformatics Date: 2016-01-14 Impact factor: 6.937