Literature DB >> 19810025

An evaluation of statistical approaches to rare variant analysis in genetic association studies.

Abstract

Genome-wide association (GWA) studies have proved to be extremely successful in identifying novel common polymorphisms contributing effects to the genetic component underlying complex traits. Nevertheless, one source of, as yet, undiscovered genetic determinants of complex traits are those mediated through the effects of rare variants. With the increasing availability of large-scale re-sequencing data for rare variant discovery, we have developed a novel statistical method for the detection of complex trait associations with these loci, based on searching for accumulations of minor alleles within the same functional unit. We have undertaken simulations to evaluate strategies for the identification of rare variant associations in population-based genetic studies when data are available from re-sequencing discovery efforts or from commercially available GWA chips. Our results demonstrate that methods based on accumulations of rare variants discovered through re-sequencing offer substantially greater power than conventional analysis of GWA data, and thus provide an exciting opportunity for future discovery of genetic determinants of complex traits. 2009 Wiley-Liss, Inc.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2010 PMID： 19810025 PMCID： PMC2962811 DOI： 10.1002/gepi.20450

Source DB: PubMed Journal: Genet Epidemiol ISSN： 0741-0395 Impact factor: 2.135

INTRODUCTION

Recent advances in whole genome genotyping technologies, the availability of large, well-defined population-based disease cohorts, and a better understanding of common human sequence variation, coupled with the development of appropriate quality control and analysis pipelines, have led to the identification of many novel common genetic determinants of complex traits [The Wellcome Trust Case Control Consortium, 2007; Zeggini et al., 2008; Barrett et al., 2008; Raychaudhuri et al., 2008; Cooper et al., 2008; Willer et al., 2009; Aulchenko et al., 2009; Prokopenko et al., 2009]. Nevertheless, despite these successes, much of the genetic component of these traits remains unaccounted for. Although there may be many undiscovered common polymorphisms associated with complex traits, it seems unlikely that the “common-disease common-variant” hypothesis is all encompassing. One unexplored paradigm which may contribute to this unexplained genetic component is a model of multiple rare causal variants, defined here to have a minor allele frequency (MAF) of less than 1%, each of modest effect, but residing within the same functional unit, for example, a gene. Joint analysis of rare variants within a gene, searching for accumulations of minor alleles within the same individual, may thus provide signals of association with complex phenotypes that could not have been identified through traditional association analysis of single nucleotide polymorphisms (SNPs), typically defined to have MAF of at least 1%. For example, minor alleles at multiple rare variants in ABCA1, APOA1 and LCAT have been demonstrated to contribute collectively to low plasma levels of high-density lipoprotein cholesterol [Cohen et al., 2004]. Currently, most studies of rare variants utilise data from commercially available GWA chips, which are far from ideal since they are designed for capturing common human genetic variation. However, the availability of data more appropriate for rare variant association analysis is just around the corner, with whole genome re-sequencing efforts, such as the 1,000 Genomes project (http://www.1000 genomes.org) soon reaching completion. Furthermore, large-scale deep re-sequencing technologies are becoming increasingly efficient and cost effective, and thus may soon be realistic for rare variant discovery in specific genes in large disease or population-based cohorts. We have developed a novel test of association with rare variants discovered through such re-sequencing efforts, based on the accumulation of minor alleles within the same functional unit, for example a gene-coding region extended up and downstream to incorporate additional functional elements and the regulatory region. We have then undertaken a simulation study to focus on two distinct, but timely, scenarios with the aim of addressing specific, as yet unanswered, methodological questions in each. First, when deep re-sequencing data are available to discover rare variants, do methods based on accumulations of minor alleles within the same functional unit offer greater power to detect association with complex traits than traditional analysis of SNPs on GWA chips? Second, when only GWA chip data are available, what is the most powerful strategy for identifying rare variant associations with complex traits?

METHODS

We consider two specific tests of quantitative trait association with accumulations of minor alleles across rare variants within the same functional unit. In the first of these tests, the phenotype is modelled in a linear regression framework as a function of the proportion of rare variants at which an individual carries a minor allele. In the second, the phenotype is modelled in the same regression framework, but this time as a function of the presence/absence of a minor allele at any rare variant within an individual. This collapsing approach has been previously proposed in the context of a binary trait [Li and Leal, 2008], and has been demonstrated to be powerful for detecting association with rare variants discovered through re-sequencing. Consider a sample of unrelated individuals, phenotyped for a normally distributed trait, and typed for rare variants in a gene or small genomic region. Let n denote the number of rare variants for which the ith individual has been successfully genotyped, and let r denote the number of these variants at which they carry at least one copy of the minor allele. We can model the phenotype, y, of the ith individual in a linear regression framework, given by y = E[y]+ɛ, where ɛ,σ), and In this expression, x denotes a vector of covariate measurements for the ith individual, with corresponding regression coefficients β. The parameter λ is the expected increase in the phenotype for an individual carrying a full complement of minor alleles at rare variants compared to an individual carrying none. As an alternative, we can model the expected phenotype of the ith individual as where I(r is an indicator variable taking the value 1 if r>0, and 0 otherwise, in other words, the presence of at least one minor allele at any rare variant. Here, the parameter λ is the expected increase in the phenotype for an individual carrying at least one minor allele at any rare variant compared an individual carrying none. For either model, the likelihood contribution of the ith individual is given by We thus construct likelihood ratio tests of association of an accumulation of rare variants with disease by comparing the maximised likelihoods of two models via analysis of deviance: (i) the null model where λ = 0; and (ii) the alternative model for which λ is unconstrained. The contribution of the ith individual to the likelihood, f(y | α,λ,β,r,n,x), is weighted by n to allow for differential call rates between samples. We denote the likelihood ratio test based on the proportion of rare variants at which an individual carries minor alleles by RVT1, and that based on the presence/absence of at least one minor allele at any rare variant by RVT2. Both RVT1 and RVT2 can be generalised to tests of association with a binary trait within a logistic regression-modelling framework.

SIMULATION STUDY

In order to evaluate the relative merits of different analytical approaches to identify rare variant associations with a quantitative trait, we have performed simulations using simple models of population genetics to generate high-density haplotype data in a 50-kb genomic region used to represent a functional unit of interest. We considered a range of models for association of the trait with multiple causal variants in the same region, under two different assumptions: (i) the mean trait value is determined by the presence or absence of a minor allele at any causal variant; and (ii) the mean trait value determined by the proportion of causal variants at which a minor allele is present. Trait association models were then parame-terised in terms of: (i) the maximum MAF of any individual causal variant; (ii) the total MAF of all causal variants; and (iii) their joint contribution to the phenotypic variance. Full details of the simulation process are described in the Appendix. We began by simulating a population of 40,000 haplotypes, and selected causal variants according to our chosen model of association. We selected 10,000 haplotypes, paired together at random to form 5,000 individuals in our “analysis cohort”, and generated their phenotypes according to their genotypes at the causal variants. We then selected a further 2,000 haplotypes in our “discovery panel”, used here to represent the deep re-sequencing data we expect from the 1,000 Genomes project. Over all simulations, the mean number of rare variants with at least two copies of the minor allele in the discovery panel was 52.2. We assumed that each of these rare variants was taken forward for genotyping in the analysis cohort, and tested for association using both RVT1 and RVT2. Our next step was to select variants in the 50 kb region to have similar properties to the Affymetrix Human SNP Array 6.0 in terms of mean density and MAF profile in the population of 40,000 haplotypes. All SNPs on the GWA chip were analysed independently, using conventional trend tests of association, with Bonferroni correction for multiple testing, and jointly, using standard haplotype-based techniques. Over all simulations, the mean number of GWA chip SNPs in the region was 14.8, while the mean number of SNP haplotypes with population frequency greater than 1% was 18.0. We attempted to apply RVT1 and RVT2 to the GWA chip data, but the mean number of rare variants in the region was just 0.2, and provided minimal power to detect accumulations of minor alleles at these loci. As a result, we extended our analysis to include “low-frequency” SNPs (1% Figure 1 shows the power of each of the tests of association as a function of the percentage of phenotypic variation explained by causal variants in the 50 kb region, assuming the trait mean is determined by the presence or absence of minor alleles at any of the causal variants. Results for two models are presented here, each assuming a total MAF of 5% for all causal variants in the region: (a) the maximum MAF of any individual causal variant is 0.5%; and (b) the maximum MAF of any individual causal variant is 2%. Model (b) incorporates fewer and, on average, more common causal variants than does (a), and thus represents a lower degree of allelic heterogeneity. Supplementary Figures 1 and 2 present power for a wider range of association models encompassing intermediate levels of allelic heterogeneity, where trait means are determined by the presence or absence of minor alleles at any causal variant, and by the proportion of causal variants at which a minor allele is present, respectively.

Fig. 1

Power of six tests of rare variant association with a quantitative trait as a function of the percentage of phenotypic variation explained by causal variants in a 50 kb region, assuming the trait mean is determined by the presence or absence of minor alleles at any of the causal variants. Results for two models are presented, both assuming a total MAF of 5% for all causal variants in the region: (a) the maximum MAF of any individual causal variant is 0.5% and (b) the maximum MAF of any individual causal variant is 2%. Power is estimated at a 5% significance level over 10,000 replicates of data. Re-sequencing RVT1: test of phenotype association with the proportion of rare variants, discovered through re-sequencing, at which individuals carry minor alleles. Re-sequencing RVT2: test of phenotype association with the presence/absence of minor alleles in individuals at any rare variant discovered through re-sequencing. GWA <5% RVT1: test of phenotype association with the proportion of low-frequency variants on the GWA chip at which individuals carry minor alleles. GWA <5% RVT2: test of phenotype association with the presence/absence of minor alleles at any low-frequency variant on the GWA chip. GWA single SNP: standard trend test of quantitative trait association with each SNP on the GWA chip, with Bonferroni correction for multiple testing. GWA SNP haplotypes: haplotype trend test of association with the quantitative trait across all SNPs on the GWA chip. Our results highlight a number of general conclusions. First, when rare variants are discovered through re-sequencing, RVT1, based on the proportion of rare variants at which an individual carries minor alleles, is always at least as powerful as RVT2, based on the presence/absence of minor alleles. The difference in power between the two tests is most noticeable when the trait mean is determined by the proportion of causal variants at which a minor allele is present, which is not surprising, since this model is assumed by RVT1 (Supplementary Fig. 2). However, even when the trait mean is determined by the presence or absence of a minor allele at any causal variant, RVT1 is generally more powerful than RVT2. This would suggest that RVT2 is less robust to the presence of minor alleles at non-causal rare variants than is RVT1. Next, there is a clear gain in power for tests based on rare variants identified through re-sequencing over analyses of SNPs or low-frequency variants present on the GWA chip. The greatest gains are observed in the presence of substantial allelic heterogeneity (Fig. 1a), where rare causal loci are less likely to be captured by SNPs as a result of linkage disequilibrium [The International HapMap Consortium, 2005; Zeggini et al., 2005]. However, the differences in power between the tests are less noticeable when there is less allelic heterogeneity (Fig. 1b). Our results also confirm previous findings that haplotype-based analyses of SNPs have greater power to detect rare variant associations than single-locus tests, unless there is substantial allelic heterogeneity [Morris and Kaplan, 2002]. Finally, low-frequency variants (MAF<5%) on GWA chips are too scarce to detect accumulations of minor alleles, and thus RVT1 and RVT2 have minimal power to identify rare variant associations with this type of data.

DISCUSSION

Our simulations clearly indicate that tests based on the accumulation of minor alleles at rare variants identified through re-sequencing are always more powerful than conventional tests applied to SNPs present on GWA chips, particularly in the presence of substantial allelic heterogeneity. We have assumed a discovery panel of 1,000 individuals from the same population from which the analysis cohort has been ascertained which may not always be the case. With the expense of re-sequencing efforts, focussed studies of samples from the analysis cohort are likely to be much smaller, and thus less powerful for rare variant discovery, although pooling may provide a more efficient initial screening step. Publicly available re-sequencing panels, such as those that will be released through the 1,000 Genomes project may not be matched for ancestry with the analysis cohort. These panels will miss rare variants specific to the population from which the analysis cohort has been ascertained, and may lead to genotyping of variants which are, in fact, monomorphic. Our simulations also assume that all rare variants identified through re-sequencing of the discovery samples will subsequently be genotyped in the analysis cohort. However, genotyping these rare variants on a genome-wide scale will be a considerably more expensive endeavour than utilising GWA platforms. This approach may currently be financially infeasible with the large samples required to detect the modest genetic effects we expect for complex traits, particularly for rare variant associations. One possible approach to reduce genotyping costs is to focus on potentially functional rare variants (for example those leading to non-synonymous changes or located within exons and regulatory regions). However, at present, there is no unbiased evidence to suggest that causal variants are more likely to aggregate in such regions and current annotation of the genome is incomplete, making identification of potentially functional loci difficult. Experience with current genotyping technologies would suggest that rare variants are more difficult to type than SNPs, and thus stringent quality control procedures are required to avoid increased false-positive error rates as a result of genotype misspecification. To increase power, an obvious step would be to combine results of rare variant studies through meta-analysis. Although each study may genotype different loci, we can combine results on the level of the functional unit. This may reduce, or even eliminate, the need for imputation, which may be potentially prone to bias because the spectrum of rare variants is more diverse than that of SNPs, even between populations sharing relatively recent common ancestry, and therefore more difficult to predict [Anderson et al., 2008]. This highlights the need for replication in large samples from closely related populations to confirm rare variant association signals. The field of complex trait genetics is moving rapidly towards an understanding that deep re-sequencing technologies will provide the necessary data to unearth novel-associated loci. It is anticipated that researchers will soon be faced with the challenge of selecting the appropriate analytical strategy for these data sets, which will be of unprecedented scale and depth. In this study, we have developed and evaluated targeted rare variant analysis methods and have provided insights into their relative merits. The methodology we have developed here for detecting rare variant associations is extremely simplistic, and as technologies probing human genome sequence variation move rapidly forward, the development and testing of analytical strategies that maximise output from these investments will continue to be of critical importance.

17 in total

1. Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors: Richard R Hudson
Journal: Bioinformatics Date: 2002-02 Impact factor: 6.937

2. Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals.

Authors: Dmitri V Zaykin; Peter H Westfall; S Stanley Young; Maha A Karnoub; Michael J Wagner; Margaret G Ehm
Journal: Hum Hered Date: 2002 Impact factor: 0.444

3. On the advantage of haplotype analysis in the presence of multiple disease susceptibility alleles.

Authors: Richard W Morris; Norman L Kaplan
Journal: Genet Epidemiol Date: 2002-10 Impact factor: 2.135

4. Multiple rare alleles contribute to low plasma levels of HDL cholesterol.

Authors: Jonathan C Cohen; Robert S Kiss; Alexander Pertsemlidis; Yves L Marcel; Ruth McPherson; Helen H Hobbs
Journal: Science Date: 2004-08-06 Impact factor: 47.728

5. A haplotype map of the human genome.

Authors:
Journal: Nature Date: 2005-10-27 Impact factor: 49.962

6. An evaluation of HapMap sample size and tagging SNP performance in large-scale empirical and simulated data sets.

Authors: Eleftheria Zeggini; William Rayner; Andrew P Morris; Andrew T Hattersley; Mark Walker; Graham A Hitman; Panos Deloukas; Lon R Cardon; Mark I McCarthy
Journal: Nat Genet Date: 2005-10-30 Impact factor: 38.330

7. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population.

Authors: L Excoffier; M Slatkin
Journal: Mol Biol Evol Date: 1995-09 Impact factor: 16.240

8. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

9. Variants in MTNR1B influence fasting glucose levels.

Authors: Inga Prokopenko; Claudia Langenberg; Jose C Florez; Richa Saxena; Nicole Soranzo; Gudmar Thorleifsson; Ruth J F Loos; Alisa K Manning; Anne U Jackson; Yurii Aulchenko; Simon C Potter; Michael R Erdos; Serena Sanna; Jouke-Jan Hottenga; Eleanor Wheeler; Marika Kaakinen; Valeriya Lyssenko; Wei-Min Chen; Kourosh Ahmadi; Jacques S Beckmann; Richard N Bergman; Murielle Bochud; Lori L Bonnycastle; Thomas A Buchanan; Antonio Cao; Alessandra Cervino; Lachlan Coin; Francis S Collins; Laura Crisponi; Eco J C de Geus; Abbas Dehghan; Panos Deloukas; Alex S F Doney; Paul Elliott; Nelson Freimer; Vesela Gateva; Christian Herder; Albert Hofman; Thomas E Hughes; Sarah Hunt; Thomas Illig; Michael Inouye; Bo Isomaa; Toby Johnson; Augustine Kong; Maria Krestyaninova; Johanna Kuusisto; Markku Laakso; Noha Lim; Ulf Lindblad; Cecilia M Lindgren; Owen T McCann; Karen L Mohlke; Andrew D Morris; Silvia Naitza; Marco Orrù; Colin N A Palmer; Anneli Pouta; Joshua Randall; Wolfgang Rathmann; Jouko Saramies; Paul Scheet; Laura J Scott; Angelo Scuteri; Stephen Sharp; Eric Sijbrands; Jan H Smit; Kijoung Song; Valgerdur Steinthorsdottir; Heather M Stringham; Tiinamaija Tuomi; Jaakko Tuomilehto; André G Uitterlinden; Benjamin F Voight; Dawn Waterworth; H-Erich Wichmann; Gonneke Willemsen; Jacqueline C M Witteman; Xin Yuan; Jing Hua Zhao; Eleftheria Zeggini; David Schlessinger; Manjinder Sandhu; Dorret I Boomsma; Manuela Uda; Tim D Spector; Brenda Wjh Penninx; David Altshuler; Peter Vollenweider; Marjo Riitta Jarvelin; Edward Lakatta; Gerard Waeber; Caroline S Fox; Leena Peltonen; Leif C Groop; Vincent Mooser; L Adrienne Cupples; Unnur Thorsteinsdottir; Michael Boehnke; Inês Barroso; Cornelia Van Duijn; Josée Dupuis; Richard M Watanabe; Kari Stefansson; Mark I McCarthy; Nicholas J Wareham; James B Meigs; Gonçalo R Abecasis
Journal: Nat Genet Date: 2008-12-07 Impact factor: 38.330

10. Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts.

Authors: Yurii S Aulchenko; Samuli Ripatti; Ida Lindqvist; Dorret Boomsma; Iris M Heid; Peter P Pramstaller; Brenda W J H Penninx; A Cecile J W Janssens; James F Wilson; Tim Spector; Nicholas G Martin; Nancy L Pedersen; Kirsten Ohm Kyvik; Jaakko Kaprio; Albert Hofman; Nelson B Freimer; Marjo-Riitta Jarvelin; Ulf Gyllensten; Harry Campbell; Igor Rudan; Asa Johansson; Fabio Marroni; Caroline Hayward; Veronique Vitart; Inger Jonasson; Cristian Pattaro; Alan Wright; Nick Hastie; Irene Pichler; Andrew A Hicks; Mario Falchi; Gonneke Willemsen; Jouke-Jan Hottenga; Eco J C de Geus; Grant W Montgomery; John Whitfield; Patrik Magnusson; Juha Saharinen; Markus Perola; Kaisa Silander; Aaron Isaacs; Eric J G Sijbrands; Andre G Uitterlinden; Jacqueline C M Witteman; Ben A Oostra; Paul Elliott; Aimo Ruokonen; Chiara Sabatti; Christian Gieger; Thomas Meitinger; Florian Kronenberg; Angela Döring; H-Erich Wichmann; Johannes H Smit; Mark I McCarthy; Cornelia M van Duijn; Leena Peltonen
Journal: Nat Genet Date: 2008-12-07 Impact factor: 38.330

303 in total

1. A flexible likelihood framework for detecting associations with secondary phenotypes in genetic studies using selected samples: application to sequence data.

Authors: Dajiang J Liu; Suzanne M Leal
Journal: Eur J Hum Genet Date: 2011-12-14 Impact factor: 4.246

2. Rapid and robust resampling-based multiple-testing correction with application in a genome-wide expression quantitative trait loci study.

Authors: Xiang Zhang; Shunping Huang; Wei Sun; Wei Wang
Journal: Genetics Date: 2012-01-31 Impact factor: 4.562

Review 3. Assessing gene-gene interactions in pharmacogenomics.

Authors: Hsien-Yuan Lane; Guochuan E Tsai; Eugene Lin
Journal: Mol Diagn Ther Date: 2012-02-01 Impact factor: 4.074

4. Assessing the impact of non-differential genotyping errors on rare variant tests of association.

Authors: Scott Powers; Shyam Gopalakrishnan; Nathan Tintle
Journal: Hum Hered Date: 2011-10-15 Impact factor: 0.444

5. Optimal tests for rare variant effects in sequencing association studies.

Authors: Seunggeun Lee; Michael C Wu; Xihong Lin
Journal: Biostatistics Date: 2012-06-14 Impact factor: 5.899

6. A novel genome-information content-based statistic for genome-wide association analysis designed for next-generation sequencing data.

Authors: Li Luo; Yun Zhu; Momiao Xiong
Journal: J Comput Biol Date: 2012-05-31 Impact factor: 1.479