Literature DB >> 25519334

Evaluation of logistic Bayesian LASSO for identifying association with rare haplotypes.

Swati Biswas1, Charalampos Papachristou2.   

Abstract

It has been hypothesized that rare variants may hold the key to unraveling the genetic transmission mechanism of many common complex traits. Currently, there is a dearth of statistical methods that are powerful enough to detect association with rare haplotypes. One of the recently proposed methods is logistic Bayesian LASSO for case-control data. By penalizing the regression coefficients through appropriate priors, logistic Bayesian LASSO weeds out the unassociated haplotypes, making it possible for the associated rare haplotypes to be detected with higher powers. We used the Genetic Analysis Workshop 18 simulated data to evaluate the behavior of logistic Bayesian LASSO in terms of its power and type I error under a complex disease model. We obtained knowledge of the simulation model, including the locations of the functional variants, and we chose to focus on two genomic regions in the MAP4 gene on chromosome 3. The sample size was 142 individuals and there were 200 replicates. Despite the small sample size, logistic Bayesian LASSO showed high power to detect two haplotypes containing functional variants in these regions while maintaining low type I errors. At the same time, a commonly used approach for haplotype association implemented in the software hapassoc failed to converge because of the presence of rare haplotypes. Thus, we conclude that logistic Bayesian LASSO can play an important role in the search for rare haplotypes.

Entities:  

Year:  2014        PMID: 25519334      PMCID: PMC4144467          DOI: 10.1186/1753-6561-8-S1-S54

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

It is now widely acknowledged that rare variants play a critical role in complex diseases. Although many approaches have been proposed for detecting association with rare single-nucleotide variants (eg, Refs. [1-6] to name just a few), there are relatively fewer approaches for rare haplotype variants [7-10]. Once a particular genomic region is implicated to be potentially harboring a functional variant from single nucleotide polymorphism (SNP) analysis, typically it is followed up by haplotype analysis to zoom further into the region. In such analysis, rare haplotypes frequently surface because rare haplotypes can result from even combinations of common single variants. The presence of rare haplotypes poses a challenge for commonly used haplotype association approaches such as those based on generalized linear models using the expectation-maximization (EM) algorithm (implemented in the software hapassoc [11] among others). With rare haplotypes, EM estimates can be unstable and the algorithm may fail to converge. To circumvent this problem, rare haplotypes are usually pooled together. However, pooling can result in washing out of association signal if haplotypes of risk and protective types are pooled together [10]. Thus, in recent years, newer approaches for detecting rare haplotypes have been proposed. One of them is logistic Bayesian LASSO (LBL) [10], a Bayesian version of penalized regression. LBL applies penalty to regression coefficients through appropriate choice of their prior distributions. This helps reduce signal noise by weeding out unassociated (especially common) haplotypes, thereby enabling signals contained in the associated rare haplotypes to be more easily detected. For example, the application of LBL to age-related macular degeneration data led to identification of a specific rare haplotype for the first time in the literature [10]. Our goal is to further evaluate the performance of LBL for data generated under complicated and realistic scenarios. Data from Genetic Analysis Workshop 18 (GAW18) provide such an opportunity, and with this aim, we apply LBL to 200 replicates of the simulated GAW18 data. In particular, we mimic a candidate variant search approach. That is, we assume that prior studies, most likely single-SNP studies, have pointed to a genomic region that potentially harbors variants involved in the genetic mechanism of a trait. Following up on that, we zoom into the region with sequence data provided in GAW18. So that we could evaluate power and type I error, we obtained access to the simulating model ("Answers"). We focused on two genomic regions in the MAP4 gene on chromosome 3 that harbored several functional variants and analyzed them using LBL and hapassoc.

Methods

Here we briefly describe LBL; more details can be found in Biswas and Lin [10]. Suppose we have a case-control sample consisting of n1 cases and n2 controls with n1 + n2 = n. Let Y1/0 denote the case-control status of the ith individual, i = 1. . .n, and = (Y1,...,Y). Suppose L SNPs are considered to form a haplotype block. We further let Zdenote the missing (phased) haplotype pair of ith individual and (Z1. . . Z). Note that Zs are unobservable because phase information is usually not deductible from the genotype data. LBL is based on retrospective model for case-control data, which has been well studied in general statistics (including Bayesian), as well as in haplotype-association literature (see Refs. [12-16] and the references therein). The complete data likelihood is written as: where Ψ=(,) denotes the collection of regression coefficients and parameters associated with haplotype frequencies, which will be specified more explicitly as our formulation unfolds. Let a=P(Z|Y = 0) and b=P(Z|Y = 1) denote the frequencies of a haplotype pair Z in the control and the case population, respectively. We first note that we can express bin terms of aand the odds of disease for a given Z, θ: where θ=P(Y=1|Z)/P(Y=0|Z), and H is the set of all possible haplotype pairs. Therefore, the likelihood in equation (1) can be expressed in terms of the as and the θs. Let us next consider a and θ, and specify their models.

Modeling of a

Suppose there are a total of m haplotypes (ie, haplotype diversity is m) and let = (f1,...,f) denote their frequencies with the constraint that f0 and . Then, for a haplotype pair Z= z, we can model aas follows: where δ= 1(0) if z=z (z), ={,d} and d∈(−1,1) is the within-population inbreeding coefficient that can be used to capture excess and/or reduction of homozygosity. By modeling the frequency in this way, we do not need to make the assumption of Hardy-Weinberg equilibrium.

Modeling of θ

We use logistic regression for modeling log odds. Specifically, log θ+ , where is a (row) design vector, α is the intercept, and is a vector of coefficients representing the haplotype effects.

Priors

To cast the problem in the Bayesian setting, we need to assign priors to the parameters Ψ = (, = {, d}). The prior for β plays the important role of regularization of regression coefficients. In particular, a double-exponential distribution with mean 0 and appropriately chosen variance to control the degree of penalty has been shown to give the Bayesian version of LASSO when normal likelihood is used. Specifically, we set the prior for βto be Here λ controls the degree of penalty as the variance of this distribution is 2/λ2. We let the hyper-parameter λ follow a gamma distribution. For and d, note that they are not independent, as a() in equation (2) must be nonnegative. This imposes the constraint that d >{−f(1 − f)} for all k, and because d <1, we have max {−f(1 − f)}

Inference

Markov chain Monte Carlo (MCMC) methods are used to estimate posterior distributions of parameters. At each iteration, we update the missing haplotypes and the parameters , λ, , and d. The Markov chain is run for a total of 50,000 iterations with 20,000 burn-in. We draw an inference regarding association by testing for significance of each β coefficient. We carry out a hypothesis test of H0 : |β| ≤ ∈ versus H: |β| >∈, where ∈ is set to a small number, using Bayes factor (BF), which is defined as the ratio of posterior odds to prior odds of H. If the BF exceeds a certain threshold, we conclude that the corresponding β is significant;that is, that haplotype is associated. In our applications, we use a threshold of 2 and ∈= 0.1 following Biswas and Lin [10]. This method has been implemented as an R package (with dynamic loading of C program) LBL, which is available at http://www.utdallas.edu/~swati.biswas. LBL requires as an input a list of possible haplotypes that are compatible with each person's genotype. We obtain this from the hapassoc package's pre-processing command "pre.hapassoc". Note that LBL does not assign any specific haplotype to persons whose haplotypes cannot be inferred unambiguously; rather, their haplotypes are treated as missing and updated at every MCMC iteration. Thus, the uncertainty in haplotypes for each person is incorporated into the model. By the same token, the uncertainty in haplotype frequencies f is taken into account by treating them as unknown parameters with noninformative prior distribution and updating them at each iteration (note that although pre.hapassoc outputs frequency estimates also, those are not needed by LBL). So, even if in the data some haplotypes are only compatible with the cases but not with any control, by having a prior distribution on fs, the model allows the possibility of the control population having those haplotypes as well. However, if those haplotypes have sufficient contribution to the disease under study and the sample size is reasonable, we will expect those haplotypes to be inferred as associated.

Results

As mentioned in the background section, we propose to use haplotype analysis as a follow-up analysis to zoom into a genomic region that has been implicated in earlier studies. With this premise, we focus on gene MAP4 on chromosome 3. In particular, we consider 2 separate regions around 2 functional variants: (a) the most frequent functional variant located at 47,956,424 basepairs (bp) with minor allele frequency (MAF) =0.378, and (b) variant with strongest effect size at 48,040,283 bp with MAF = 0.032. The rationale for such a choice is that a functional variant that is either fairly common or is rare but has a strong effect could be reasonably expected to have been implicated in prior (single-SNP genome scan) studies. To zoom into these regions, we selected SNPs that are within 4000 bp of these locations and have MAF>0.01, and formed haplotypes with those SNPs in each region. This choice of MAF ensured that in a sample of size 142, at least two copies of each SNP were present and thereby excluded almost monomorphic SNPs. Common SNPs can combine to form rare haplotypes (frequency <0.05), and, indeed, this was the case for these regions. In both regions, we had 9 SNPs each. Specifically, for region 1, 9 SNPs resulted in 9 haplotypes. This is unusually low and it could be just random chance or there may be a biological reason for this. However, as we analyze the simulated phenotypes, which were generated based on single SNPs, we do not expect this to have any significant effect on our analysis. As LBL has been proposed for the case-control data, we used the sample of unrelated people. We classified a person as affected (case) if the individual was diagnosed with high blood pressure or was taking medication for high blood pressure at the last exam. The total sample size was 142 individuals. The number of cases varied from replicate to replicate with mean = 72 and SD = 5, while the genotypes were identical for all replicates. We analyzed all 200 replicates for the two regions described above twice. First, we analyzed with the provided phenotypes to examine power. Second, for each replicate, we randomly permuted the affection status of individuals and reanalyzed the data. This nullified any association present between phenotype and genotypes, and thereby allowed us to gauge the type I error. Note that the total numbers of cases and controls remain the same in both the original and its corresponding "null" versions. We analyzed each replicate using hapassoc and LBL. Hapassoc did not converge, presumably because of the presence of rare haplotypes in both regions studied. Tables 1 and 2 show SNPs and haplotypes in these two genomic regions and the proportions of replicates showing association with each haplotype (BF>2) in both original and null versions. In Table 1 LBL's power to detect the haplotype with two functional variants (the third functional variant has almost a negligible effect) is exceptionally high for a sample size of only 142 individuals. However, there is no power at all to identify the haplotypes with 1 functional variant at 47,956,424 bp. This indicates that the other (rare) functional variant at 47,957,996 bp almost solely drives the power for detecting the haplotype with two variants. The type I errors in the null version are all very low (≤2%). Thus, this illustrates the power of LBL for identifying rare haplotypes. In region 2, the powers are 78% and 30% for the two haplotypes; both are rare haplotypes and each contains one variant. The variant in the former haplotype accounts for higher percentage of the variability in the diastolic blood pressure (0.02 vs. 0.01) and systolic blood pressure (0.03 vs. 0.01). Although the two variants are almost on top of each other (1 bp apart), there was no haplotype containing both variants. The type I errors are, again, very small (≤2%).
Table 1

Results for the region surrounding SNP at 47,956,424 bp using phenotypes as provided (Original) and after randomly redistributing them to individuals (Null).

SNP(bp)MAFHap1Hap2Hap3Hap4Hap5Hap6Hap7Hap8Hap9
47,952,8430.018000000011
47,953,4050.370011111111
47,953,7330.320000011100
47,956,4240.359001111111
47,956,5060.317000001100
47,957,9960.021000000100
47,958,0370.317000001100
47,959,7700.367011111101
47,959,9770.011000100000
Hap Freq0.630.0110.0110.0110.0030.2960.0210.0070.011
OriginalNA0.000.000.000.000.000.890.000.00
NullNA0.000.000.000.000.000.020.000.00

The two rows, Original and Null, show the proportion of replicates with BF >2. There are 9 possible haplotypes (Hap) with minor allele of each SNP denoted by 1. The minor alleles in bold typeface are functional variants. The most frequent haplotype (Hap1) is the baseline. Freq represents haplotype frequency.

∗The % variability explained by this variant is almost negligible (<0.0001).

Table 2

Results for the region surrounding SNP at 48,040,283 bp using phenotypes as provided (Original) and after randomly redistributing them to individuals (Null).

SNP(bp)MAFHap1Hap2Hap3Hap4Hap5Hap6Hap7Hap8Hap9Hap10Hap11Hap12
48,036,8890.011000000000011
48,037,0780.363000011111100
48,038,7140.373000011111111
48,039,9080.246011100000000
48,040,2830.025000000000100
48,040,2840.021000100000000
48,041,4710.025000000111001
48,042,1920.018000000011001
48,043,0580.331001001001100

Hap Freq0.3800.2220.0040.0210.0180.2990.0070.0100.0040.0250.0070.003

OriginalNA0.020.000.300.020.000.000.000.000.780.000.00
NullNA0.010.000.000.020.010.000.000.000.000.000.00

The two rows, Original and Null, show the proportion of replicates with BF >2. There are 12 possible haplotypes (Hap) with minor allele of each SNP denoted by 1. The minor alleles in red and boldface are functional variants. The most frequent haplotype (Hap1) is the baseline. Freq represents haplotype frequency.

Results for the region surrounding SNP at 47,956,424 bp using phenotypes as provided (Original) and after randomly redistributing them to individuals (Null). The two rows, Original and Null, show the proportion of replicates with BF >2. There are 9 possible haplotypes (Hap) with minor allele of each SNP denoted by 1. The minor alleles in bold typeface are functional variants. The most frequent haplotype (Hap1) is the baseline. Freq represents haplotype frequency. ∗The % variability explained by this variant is almost negligible (<0.0001). Results for the region surrounding SNP at 48,040,283 bp using phenotypes as provided (Original) and after randomly redistributing them to individuals (Null). The two rows, Original and Null, show the proportion of replicates with BF >2. There are 12 possible haplotypes (Hap) with minor allele of each SNP denoted by 1. The minor alleles in red and boldface are functional variants. The most frequent haplotype (Hap1) is the baseline. Freq represents haplotype frequency.

Discussion

Here we evaluated the power and type I error of LBL for detecting rare haplotypes. Remarkably, even with a sample size of 142 individuals, LBL seems to be able to achieve very high power and at the same time hold the type I error at very low levels. Even though this high power may be partially explained by the relatively large contribution of the variants to the trait variance (in the range of 0.01% to 0.03%), these results further strengthen the earlier findings on the usefulness of LBL [10]. LBL is relatively computationally efficient. To run one replicate for region 1 (with nine haplotypes), LBL took approximately eight seconds on a 2.8GHz Xeon processor under a Linux operating system with 23.5 GB of RAM. The computational intensity depends on the number of haplotypes rather than the number of SNPs. In our experience, LBL can easily handle 15 to 20 haplotypes. Also, haplotype frequencies of 0.005 were handled well by LBL in Ref. [10]. We also expect LBL to handle rarer haplotypes with larger sample sizes. The nature of the simulating model and the resulting data, although complex, lead to some limitations of the study. First, the simulation model was based on individual SNPs and not haplotypes, whereas from a biological point of view, the role of haplotypes is more than just a combination of SNPs [17]. Our limited exploration of the data (including other regions not presented here) indicated that there may be only few haplotypes that consisted of more than one functional variant. Our first region may be one of the few examples of that type. Even for that haplotype, as we saw earlier, the power to detect association with it seemed to be solely driven by one of the variants. Furthermore, in our second region, from a biological standpoint, it would seem highly unlikely that two independent mutations would have arisen so close to each other (there was no haplotype with both mutations) [17]. A second limitation is the sample size of only 142. With such a small sample, detecting association with a rare variant is highly unlikely. Thus, we chose variants with relatively stronger effects. Nevertheless, it is noteworthy that LBL showed reasonable powers when hapassoc failed to converge. It will be of interest to compare the results with some newer methods for rare haplotype association. Yet another shortcoming of the current study is that the simulated genotypes in all replicates were the same and only phenotypes varied across replicates. For this reason, we did not combine replicates to increase the sample size and explore power for detecting haplotypes with more modest effects. Nonetheless, it has been shown that LBL has good power for detecting such type of effects with larger sample sizes [10]. All together, the results establish LBL as a powerful tool for identifying association with rare haplotypes. Moreover, the flexible framework of LBL allows many useful extensions. For example, covariates and their interactions with haplotypes can be incorporated by modeling log θ= α + + τ + XZ XEwith denoting the design vector based on covariates, the corresponding coefficients, and the coefficients corresponding to interactions. This extension has been recently proposed in Ref [18]. For this, the likelihood in equation (1) is extended to model the joint distribution of haplotypes and covariates. Another future work is to adapt LBL for other data types.

Conclusions

LBL is a powerful approach for detecting rare haplotype association.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SB and CP designed the overall study and conducted statistical analyses. SB drafted the manuscript and CP helped in its revision. All authors read and approved the final manuscript.
  14 in total

1.  Inference on haplotype effects in case-control studies using unphased genotype data.

Authors:  Michael P Epstein; Glen A Satten
Journal:  Am J Hum Genet       Date:  2003-11-20       Impact factor: 11.025

Review 2.  The role of haplotypes in candidate gene studies.

Authors:  Andrew G Clark
Journal:  Genet Epidemiol       Date:  2004-12       Impact factor: 2.135

3.  Comparison of prospective and retrospective methods for haplotype inference in case-control studies.

Authors:  Glen A Satten; Michael P Epstein
Journal:  Genet Epidemiol       Date:  2004-11       Impact factor: 2.135

4.  Logistic Bayesian LASSO for identifying association with rare haplotypes and application to age-related macular degeneration.

Authors:  Swati Biswas; Shili Lin
Journal:  Biometrics       Date:  2011-09-28       Impact factor: 2.571

5.  Generalized linear modeling with regularization for detecting common disease rare haplotype association.

Authors:  Wei Guo; Shili Lin
Journal:  Genet Epidemiol       Date:  2009-05       Impact factor: 2.135

6.  A Bayesian hierarchical model for detecting haplotype-haplotype and haplotype-environment interactions in genetic association studies.

Authors:  Jun Li; Kui Zhang; Nengjun Yi
Journal:  Hum Hered       Date:  2011-07-20       Impact factor: 0.444

7.  Detecting rare haplotype-environment interaction with logistic Bayesian LASSO.

Authors:  Swati Biswas; Shuang Xia; Shili Lin
Journal:  Genet Epidemiol       Date:  2013-11-23       Impact factor: 2.135

8.  Comprehensive approach to analyzing rare genetic variants.

Authors:  Thomas J Hoffmann; Nicholas J Marini; John S Witte
Journal:  PLoS One       Date:  2010-11-03       Impact factor: 3.240

9.  Testing for an unusual distribution of rare variants.

Authors:  Benjamin M Neale; Manuel A Rivas; Benjamin F Voight; David Altshuler; Bernie Devlin; Marju Orho-Melander; Sekar Kathiresan; Shaun M Purcell; Kathryn Roeder; Mark J Daly
Journal:  PLoS Genet       Date:  2011-03-03       Impact factor: 5.917

10.  Gene-based partial least-squares approaches for detecting rare variant associations with complex traits.

Authors:  Asuman S Turkmen; Shili Lin
Journal:  BMC Proc       Date:  2011-11-29
View more
  8 in total

1.  Comparison of haplotype-based statistical tests for disease association with rare and common variants.

Authors:  Ananda S Datta; Swati Biswas
Journal:  Brief Bioinform       Date:  2015-09-02       Impact factor: 11.622

2.  A Family-Based Rare Haplotype Association Method for Quantitative Traits.

Authors:  Ananda S Datta; Shili Lin; Swati Biswas
Journal:  Hum Hered       Date:  2019-02-21       Impact factor: 0.444

3.  ordinalbayes: Fitting Ordinal Bayesian Regression Models to High-Dimensional Data Using R.

Authors:  Kellie J Archer; Anna Eames Seffernick; Shuai Sun; Yiran Zhang
Journal:  Stats (Basel)       Date:  2022-04-15

4.  Bivariate logistic Bayesian LASSO for detecting rare haplotype association with two correlated phenotypes.

Authors:  Xiaochen Yuan; Swati Biswas
Journal:  Genet Epidemiol       Date:  2019-09-23       Impact factor: 2.135

5.  Detecting rare haplotype-environment interaction with logistic Bayesian LASSO.

Authors:  Swati Biswas; Shuang Xia; Shili Lin
Journal:  Genet Epidemiol       Date:  2013-11-23       Impact factor: 2.135

6.  Association of rare haplotypes on ULK4 and MAP4 genes with hypertension.

Authors:  Ananda S Datta; Yuan Zhang; Lei Zhang; Swati Biswas
Journal:  BMC Proc       Date:  2016-11-15

7.  Logistic Bayesian LASSO for genetic association analysis of data from complex sampling designs.

Authors:  Yuan Zhang; Jonathan N Hofmann; Mark P Purdue; Shili Lin; Swati Biswas
Journal:  J Hum Genet       Date:  2017-04-20       Impact factor: 3.172

8.  Impairments of Photoreceptor Outer Segments Renewal and Phototransduction Due to a Peripherin Rare Haplotype Variant: Insights from Molecular Modeling.

Authors:  Luigi Donato; Ebtesam Mohamed Abdalla; Concetta Scimone; Simona Alibrandi; Carmela Rinaldi; Karim Mahmoud Nabil; Rosalia D'Angelo; Antonina Sidoti
Journal:  Int J Mol Sci       Date:  2021-03-27       Impact factor: 5.923

  8 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.