
The Covariate's Dilemma.


Year: 2012 PMID: 23162385 PMCID: PMC3497901 DOI: 10.1371/journal.pgen.1003096

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917


An important step in analyzing genetic association study data is deciding whether to adjust for covariates—those variables ancillary to the variants of interest. In particular, when testing for novel associations, should the statistical model also include known genetic or nongenetic covariates that are predictors of the trait (e.g., body mass index when studying type 2 diabetes)? Yes, if the covariates are also correlated with the primary variants but do not mediate their effects, because they may confound the genetic associations. Including them helps control bias and prevent false discoveries (Figure 1a). But the answer is less clear-cut if the covariates are not confounders.
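As a rough illustration of the confounding case, the sketch below computes the apparent effect of a variant on a quantitative trait when a correlated covariate is left out of the comparison. Every name and number here (B_G, B_C, P_C_GIVEN_G) is a hypothetical chosen for illustration, not a value from the article; the variant is given no true effect.

```python
# Hypothetical setup: binary variant G, binary covariate C, and a quantitative
# trait whose mean is B_G * G + B_C * C. G has no true effect; C does.
B_G = 0.0
B_C = 2.0

# Assumed confounding: C is more common among carriers of the variant.
P_C_GIVEN_G = {0: 0.3, 1: 0.6}


def expected_trait(g, c):
    """Mean trait value for variant status g and covariate value c."""
    return B_G * g + B_C * c


def mean_trait_ignoring_c(g):
    """Mean trait among subjects with variant status g, averaged over C."""
    p_c = P_C_GIVEN_G[g]
    return p_c * expected_trait(g, 1) + (1 - p_c) * expected_trait(g, 0)


# Unadjusted comparison: carriers vs. non-carriers with C omitted.
unadjusted_effect = mean_trait_ignoring_c(1) - mean_trait_ignoring_c(0)

# Adjusted comparison: carriers vs. non-carriers at the same level of C.
adjusted_effect = expected_trait(1, 1) - expected_trait(0, 1)

print(f"unadjusted G effect: {unadjusted_effect:.2f}")
print(f"adjusted G effect:   {adjusted_effect:.2f}")
```

Under these assumed values the unadjusted contrast equals B_C times the difference in covariate frequency between carriers and non-carriers, even though the variant itself does nothing; holding C fixed removes it.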

Figure 1

Impact of—and approaches to—including covariates in the analysis of gene–trait associations.


(a) The covariate C is a confounder associated with both the trait D and the gene G but is not an intermediate on the causal path of interest between G and D. The G–D association should be assessed while controlling C. Omitting C from the analysis of the G–D association can lead to misattribution of a C–D effect to G and false discovery or biased estimates of a G–D effect.

(b) The covariate C is independently associated with the trait D but not with gene G (so C is not a confounder). If the trait is quantitative or the study subjects are randomly ascertained, including C in a linear or logistic regression model will increase power to detect the G–D association.

(c) If the trait is binary and the subjects are ascertained based on case-control status, the probability of selection (S) depends on G and C and induces a correlation between them. Then including C in a logistic regression model can inflate the G–D association's standard error, reducing power. Omitting C provides the most potential gain in power when C has a strong effect on D, and when D is less common [1].

(d) In Zaitlen et al.'s new approach [6] for evaluating G–D associations with case-control data, a risk model for D is developed from external information about the C–D association and observed C and D levels. Residuals from this model, R, distinguish high- and low-risk cases and controls. Then testing for G–R associations assesses genetic effects unexplained by C in a potentially more powerful manner than conventional logistic regression.

When the trait of interest is quantitative, including a nonconfounding covariate associated with the trait is often beneficial because it can explain some of the variability in the outcome, thus reducing noise and increasing power to detect novel genetic associations. On the other hand, when the trait is binary, including the covariate can actually reduce power for case-control association studies; this is shown in a recent paper by Pirinen et al. [1] and previous work [2]–[5]. Fortunately, all is not lost. In this issue of PLOS Genetics, Zaitlen et al. [6] present a new approach that addresses this problem by leveraging information on covariates to increase power in association studies of binary traits.
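The noise-reduction argument for quantitative traits can be made concrete with a back-of-the-envelope power calculation. The sketch below assumes a linear model with a variant G and an independent covariate C, and uses the usual large-sample approximation in which the test's noncentrality is N times beta squared times Var(G), divided by the residual variance. All parameter values are hypothetical and do not come from the article.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(ncp, alpha=0.05):
    """Approximate power of a two-sided 1-df test with noncentrality ncp."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = sqrt(ncp)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def noncentrality(n, beta_g, var_g, residual_var):
    """Large-sample noncentrality for the variant's regression coefficient."""
    return n * beta_g ** 2 * var_g / residual_var


# Hypothetical values, chosen only for illustration.
N = 2000
BETA_G = 0.1
VAR_G = 2 * 0.3 * 0.7        # additive genotype coding, allele frequency 0.3
BETA_C, VAR_C = 1.0, 1.0     # covariate effect and variance
NOISE_VAR = 1.0              # trait variance left after G and C

# Including C leaves only the noise; omitting C pushes its contribution
# (BETA_C^2 * VAR_C) into the residual variance.
ncp_with_c = noncentrality(N, BETA_G, VAR_G, NOISE_VAR)
ncp_without_c = noncentrality(N, BETA_G, VAR_G, NOISE_VAR + BETA_C ** 2 * VAR_C)

power_with_c = power_two_sided(ncp_with_c)
power_without_c = power_two_sided(ncp_without_c)

print(f"power with C:    {power_with_c:.3f}")
print(f"power without C: {power_without_c:.3f}")
```

With these assumed values the covariate accounts for half of the residual variance, so adjusting for it doubles the noncentrality. This approximation applies to the quantitative-trait setting only; it does not carry over to case-control sampling of a binary trait.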

Ignorance Is Bliss…

How can ignoring covariate information increase power? Assume that we are studying the potential association between a genetic variant and a binary trait. Moreover, assume we have measured a genetic or environmental covariate associated with the trait but independent of the variant of interest in the source population, so it is not a confounder (Figure 1b). If we ascertain a random sample of study subjects, then the variant of interest and covariate will remain independent. Here, the most powerful model for assessing association includes the covariate (e.g., in a logistic regression model) [1]. While adding the covariate may increase the standard error of the variant association, omitting it can bias the association towards the null hypothesis of no effect and ultimately reduce power [1]–[5], [7].

However, most association studies do not select a random sample of study subjects, but rather ascertain cases and controls from the source population. This ascertainment process can create a correlation between the genetic variant and covariate in the sample, because cases will be enriched for both risk genotypes and high-risk covariate levels. Since these are independent in the source population, they will remain conditionally independent among cases or controls; but the variant and covariate will be correlated in the overall case-control sample (dashed line in Figure 1c).

In the presence of this induced correlation, omitting the covariate from a logistic regression model may be the most powerful approach. Indeed, including the covariate could substantially increase the standard error of the genetic variant association (i.e., due to the induced correlation), resulting in a larger power loss than might arise from omitting the covariate and biasing the association towards the null hypothesis. Pirinen et al. [1] investigate this phenomenon in detail and show that the increase in power from omitting covariates is a function of disease prevalence and effect sizes. In particular, omitting a covariate can often improve power to detect genetic effects for diseases with prevalence below 2%, or as high as 10% when the covariate is a particularly strong risk factor.
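The induced correlation can be checked with exact arithmetic rather than simulation. The sketch below assumes a binary variant G and a binary covariate C that are independent in the source population, a logistic risk model with no interaction, and a sample of equal numbers of cases and controls. The frequencies and effect sizes are hypothetical, chosen only to make the pattern visible.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def phi(joint):
    """Correlation of two binary variables from weights joint[(g, c)]."""
    total = sum(joint.values())
    p = {key: weight / total for key, weight in joint.items()}
    p_g = p[(1, 0)] + p[(1, 1)]
    p_c = p[(0, 1)] + p[(1, 1)]
    cov = p[(1, 1)] - p_g * p_c
    return cov / math.sqrt(p_g * (1 - p_g) * p_c * (1 - p_c))


# Hypothetical source population: G and C independent, rare disease.
FREQ_G, FREQ_C = 0.3, 0.3
B0, B_G, B_C = -6.0, math.log(2), math.log(5)

population, cases, controls = {}, {}, {}
for g in (0, 1):
    for c in (0, 1):
        p_gc = (FREQ_G if g else 1 - FREQ_G) * (FREQ_C if c else 1 - FREQ_C)
        risk = expit(B0 + B_G * g + B_C * c)
        population[(g, c)] = p_gc
        cases[(g, c)] = p_gc * risk            # joint weight of (g, c, D=1)
        controls[(g, c)] = p_gc * (1 - risk)   # joint weight of (g, c, D=0)

prevalence = sum(cases.values())

# Case-control ascertainment: half the sample are cases, half are controls.
sample = {
    key: 0.5 * cases[key] / prevalence + 0.5 * controls[key] / (1 - prevalence)
    for key in cases
}

phi_population = phi(population)
phi_cases = phi(cases)
phi_controls = phi(controls)
phi_sample = phi(sample)

print(f"population:          {phi_population:+.4f}")
print(f"cases only:          {phi_cases:+.4f}")
print(f"controls only:       {phi_controls:+.4f}")
print(f"case-control sample: {phi_sample:+.4f}")
```

Under these assumptions the correlation is zero in the population and close to zero within cases and within controls, but clearly positive in the pooled case-control sample, because cases are enriched for both the risk genotype and the high-risk covariate level. The within-group values are only approximately zero under a logistic model; they shrink as the disease becomes rarer.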

Knowledge Is Power!

Improving analyses by ignoring covariates seems counterintuitive, as they should provide some information. To extract value from covariates, Zaitlen et al. [6] developed a new method that uses existing evidence of covariate associations with the trait of interest, and trait prevalence, to increase power. This approach first builds a liability model using estimates of a covariate's independent effect in the form of trait prevalences at various levels of the covariate (e.g., type 2 diabetes prevalences by age). Then it evaluates the association between the genetic variant of interest and the liability model residuals (Figure 1d). In effect, the external information about covariate effects is used to distinguish high- and low-risk cases and controls. Tests of genetic variant associations with these quantitative residuals have more power than tests of genetic associations with the original binary trait.

The value of Zaitlen et al.'s approach is demonstrated in several data sets with case-control and case-control-covariate ascertainment, where the selection probability for an individual to join the study depends on covariate levels, such as in matched studies or those with overrepresentation of low-risk cases. While covariate-based ascertainment of cases and controls can induce selection bias that must be addressed by including the covariate in a conventional regression model [8], the new method provides a potentially powerful alternative. The authors show by application and simulation that the liability model approach increases association test statistics by 18% and 16% in comparison with logistic regression with or without covariates, respectively. Of course, this improvement hinges on having accurate external covariate information; one could envision scenarios where the external covariate data are so poor that using this approach would actually decrease power. One could also use covariate information discerned from a given dataset, but external information may be even better.

A framework to propagate uncertainties through the multistage analysis of Zaitlen et al. would be useful to assess sensitivity to the quality of published or assumed trait prevalences and covariate effects, and to the estimation errors in the formation of the liability model and in the calculation of residuals. A starting point might be to repeat the analyses for a range of covariate-specific trait prevalences that bracket the actual published or assumed values. Zaitlen and colleagues have also developed a version of the liability model approach for when the covariates are genetic markers with known trait associations [9]. Future work might compare these novel liability methods to alternative approaches for inclusion of external information, such as Bayesian models with informative priors for the covariate effects. Moreover, schemes for weighted analyses [10] suggest other ways to potentially increase association study power.

In summary, if one undertakes a case-control association study and has information on covariates that are independent risk factors for a trait (and are not confounders), simply including them in a logistic regression model is not always the optimal approach for discovering genetic variants. Instead, more power may be gained by excluding them, by using the liability model approach of Zaitlen et al. [6], [9], or by applying other novel techniques to leverage information from such covariates.
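To give a feel for how external prevalences can separate high- and low-risk cases and controls, the sketch below computes a residual under a simple liability threshold model. It is a textbook truncated-normal calculation, not the implementation of Zaitlen et al.: it assumes liability is a covariate effect plus a standard normal residual, with disease whenever liability crosses a threshold. The function name, age bands, and prevalences are all hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def liability_residual(is_case, prevalence):
    """Posterior mean of the residual liability given status and prevalence.

    `prevalence` is the trait prevalence at the subject's covariate level.
    The threshold on the residual scale is the point that a standard normal
    exceeds with probability `prevalence`.
    """
    threshold = STD_NORMAL.inv_cdf(1 - prevalence)
    density = STD_NORMAL.pdf(threshold)
    if is_case:
        return density / prevalence         # mean of residual above threshold
    return -density / (1 - prevalence)      # mean of residual below threshold


# Hypothetical prevalences by age band (illustrative, not published values).
PREVALENCE_BY_AGE = {"under 40": 0.02, "40 to 59": 0.10, "60 and over": 0.25}

residuals = {
    (band, status): liability_residual(status == "case", prevalence)
    for band, prevalence in PREVALENCE_BY_AGE.items()
    for status in ("case", "control")
}

# Crude sensitivity check in the spirit of bracketing the assumed values:
# halve and double the prevalence used for a young case.
sensitivity = {
    scale: liability_residual(True, PREVALENCE_BY_AGE["under 40"] * scale)
    for scale in (0.5, 1.0, 2.0)
}

for key, value in residuals.items():
    print(key, round(value, 3))
```

Under these assumptions a case in a low-prevalence band receives a larger residual than a case in a high-prevalence band, and a control in a high-prevalence band receives a more negative residual than a control in a low-prevalence band. A subsequent step, not shown, would test the variant for association with these residuals; the sensitivity dictionary shows how much a single residual moves when the assumed prevalence is off by a factor of two.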

6 in total

1. Adjusting for covariates in logistic regression models.

Authors: Guan Xing; Chao Xing
Journal: Genet Epidemiol Date: 2010-11 Impact factor: 2.135

2. Analysis of case-control association studies with known risk variants.

Authors: Noah Zaitlen; Bogdan Pasaniuc; Nick Patterson; Samuela Pollack; Benjamin Voight; Leif Groop; David Altshuler; Brian E Henderson; Laurence N Kolonel; Loic Le Marchand; Kevin Waters; Christopher A Haiman; Barbara E Stranger; Emmanouil T Dermitzakis; Peter Kraft; Alkes L Price
Journal: Bioinformatics Date: 2012-05-03 Impact factor: 6.937

3. What's the best statistic for a simple test of genetic association in a case-control study?

Authors: Chia-Ling Kuo; Eleanor Feingold
Journal: Genet Epidemiol Date: 2010-04 Impact factor: 2.135

4. Including known covariates can reduce power to detect genetic effects in case-control studies.

Authors: Matti Pirinen; Peter Donnelly; Chris C A Spencer
Journal: Nat Genet Date: 2012-07-22 Impact factor: 38.330

5. Link functions in multi-locus genetic models: implications for testing, prediction, and interpretation.

Authors: David Clayton
Journal: Genet Epidemiol Date: 2012-04-16 Impact factor: 2.135

6. Informed conditioning on clinical covariates increases power in case-control association studies.

Authors: Noah Zaitlen; Sara Lindström; Bogdan Pasaniuc; Marilyn Cornelis; Giulio Genovese; Samuela Pollack; Anne Barton; Heike Bickeböller; Donald W Bowden; Steve Eyre; Barry I Freedman; David J Friedman; John K Field; Leif Groop; Aage Haugen; Joachim Heinrich; Brian E Henderson; Pamela J Hicks; Lynne J Hocking; Laurence N Kolonel; Maria Teresa Landi; Carl D Langefeld; Loic Le Marchand; Michael Meister; Ann W Morgan; Olaide Y Raji; Angela Risch; Albert Rosenberger; David Scherf; Sophia Steer; Martin Walshaw; Kevin M Waters; Anthony G Wilson; Paul Wordsworth; Shanbeh Zienolddiny; Eric Tchetgen Tchetgen; Christopher Haiman; David J Hunter; Robert M Plenge; Jane Worthington; David C Christiani; Debra A Schaumberg; Daniel I Chasman; David Altshuler; Benjamin Voight; Peter Kraft; Nick Patterson; Alkes L Price
Journal: PLoS Genet Date: 2012-11-08 Impact factor: 5.917

