
Differential confounding of rare and common variants in spatially structured populations.

Abstract

Well-powered genome-wide association studies, now made possible through advances in technology and large-scale collaborative projects, promise to characterize the contribution of rare variants to complex traits and disease. However, while population structure is a known confounder of association studies, it remains unknown whether methods developed to control stratification are equally effective for rare variants. Here, we demonstrate that rare variants can show a stratification that is systematically different from, and typically stronger than, common variants, and this is not necessarily corrected by existing methods. We show that the same process leads to inflation for load-based tests and can obscure signals at truly associated variants. Furthermore, we show that populations can display spatial structure in rare variants, even when Wright's fixation index F(ST) is low, but that allele frequency-dependent metrics of allele sharing can reveal localized stratification. These results underscore the importance of collecting and integrating spatial information in the genetic analysis of complex traits.

Year: 2012 PMID: 22306651 PMCID: PMC3303124 DOI: 10.1038/ng.1074

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330

Introduction

Quantifying the contribution of rare variants to the heritability of different traits is an important and open question in complex trait genetics[1]. While there is no universally accepted definition of what constitutes a ‘rare’ variant, a minor allele frequency (MAF) of 1% is the conventional definition of polymorphism[2]. At this frequency, the power of the current generation of genome-wide association studies (GWAS) is negligible for modest effect sizes[3]. Therefore, although a small number of associations with rare variants have been reported, for example with type 1 diabetes[4] and cholesterol levels[5,6], it has not been possible to test the hypothesis that rare variants account for a significant proportion of the missing heritability for most complex traits. Recently, however, four factors have combined to make the direct investigation of rare variants possible. First, the increasing size of GWAS samples and meta-samples, now approaching cohort sizes of 100,000 through large-scale international collaborations, boosts power. Second, the ascertainment of many rare variants through the 1000 Genomes project[7] has enabled imputation of millions of rare and low-frequency variants and led to the development of a new generation of low-cost genotyping platforms that interrogate rare variants directly. Third, the decline in the cost of sequencing technology has enabled large-scale sequencing studies, which in principle allow the detection of all variants in a sample. Finally, the recent development of new statistical tests for association aimed at rare variants[8-13] (reviewed in ref. 14) potentially provides power to detect genes or pathways harbouring multiple rare variants for which there would be individually low power to detect association. The large sample sizes required for such studies typically require combining information across multiple geographic locations, within and across countries.
Population structure, which can lead to spurious correlations between allele frequencies and non-genetic risk factors, has long been known to be a major potential confounding factor for association studies[15-17]. The effects of stratification have been studied extensively[18-20] and testing and correcting for structure is now standard practice in GWAS through methods such as genomic control (GC)[21,22], principal component analysis (PCA)[23] and mixed models[24]. However, analyses of these methods have typically concentrated on common variants and there has been little investigation of the effect that structure might have specifically on rare variants. Informally, rare variants, through being typically recent, may tend to have different geographic distributions than more common and typically older variants. We therefore set out to investigate (a) under what conditions population structure will lead to differential test-statistic inflation for variants of different frequencies, (b) whether methods effective for controlling stratification of common variants are also appropriate for rare variants, (c) whether different ways of analyzing rare variants (single-marker versus aggregating) are equally affected by structure and (d) how best to measure population structure in empirical data in a manner that is informative about differential stratification. We used a simple lattice model to approximate population structure across a geographical region and investigated the interaction between the spatial distribution of non-genetic risk and inflation of standard association tests under the null model of no genetic risk (Online methods). We contrasted the situation where non-genetic risk is smoothly distributed (for example, a latitudinal effect) with the situation where the same overall risk is concentrated into one or more small, sharply defined regions (for example, localized environmental pollution).

Results

As is well documented, population structure leads to inflation of association test statistics under the null and hence systematic underestimation of P-values. When the risk has a wide and smooth distribution, rare variants show less inflation than common variants (Fig. 1a, c). In contrast, when the risk has a small, sharp spatial distribution, rare variants show more inflation than common variants, particularly for small P-values (Fig. 1b, d). The magnitude of inflation increases as the P-value decreases in both scenarios and the greatest inflation is for variants with frequency approximately equal to the fraction of the area with elevated risk (Fig. 1c, d). As the size or smoothness of the area of risk increases, the inflation is spread over a wider range of P-values (Supplementary Fig. 1 and 2).

Figure 1

Differential inflation of rare and common variants

(a-b) QQ plots of association test P-values, broken down by allele frequency for (a) a broad, smoothly (Gaussian) varying non-genetic risk factor and (b) a small, sharply defined region of constant non-genetic risk; (c-d) Inflation plots showing the amount by which the observed −log10 P-value exceeds the expected value across allele frequencies. Different lines represent different levels of significance, with −log10 P-value equal to 1, 2, 3 or 4. The grids in the top left of the panels represent the spatial distribution of risk and the scale indicates by how many standard deviations the phenotypic mean is shifted in each grid square. The populations simulated here are uniformly distributed over the grid, with two individuals in each square, and a migration rate of 0.01.
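
The inflation measure plotted in panels c-d can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes that "inflation at level k" means the observed −log10 P-value at the rank where the expected −log10 P-value equals k, minus k. The function name `inflation` and the two toy P-value lists are hypothetical.

```python
import math

def inflation(pvalues, level):
    """Observed minus expected -log10 P at the rank where the
    expected -log10 P equals `level` (assumed reading of Fig. 1c-d)."""
    ranked = sorted(pvalues)
    n = len(ranked)
    # Under the null, a fraction 10**-level of tests falls below 10**-level.
    rank = max(1, round(n * 10 ** (-level)))
    observed = -math.log10(ranked[rank - 1])
    return observed - level

n_tests = 10_000
# Toy null: evenly spaced P-values, so observed and expected quantiles agree.
null_p = [i / n_tests for i in range(1, n_tests + 1)]
# Toy stratified set: the smallest 1% of P-values are ten times too small.
inflated_p = [p / 10 if i < 100 else p for i, p in enumerate(null_p)]

null_inflation = inflation(null_p, 2)
tail_inflation = inflation(inflated_p, 2)
```

With these toy inputs the null set shows no inflation at −log10 P = 2, while the second set is inflated by one unit at the same level. Applying the function separately to variants in each allele-frequency bin would give one point per bin on a plot of this kind.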

Such differential behaviour can be understood as a result of the interaction between the spatial distribution of risk and the spatial distribution of variants. Small P-values occur when a variant shows strong correlation with the non-genetic risk. Rare variants, through being recent, tend to show greater geographic clustering than common ones (Fig. 2a-c). When non-genetic risk varies on a large scale, rare variants cannot be highly correlated with it (Fig. 2d). In contrast, when non-genetic risk varies on a small scale, although most variants are uncorrelated with the risk, rare variants have a tail of highly correlated variants (Fig. 2e), and it is this tail that drives the inflation.
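
The geometric argument above can be checked on a toy grid. The sketch below is an illustration under stated assumptions, not a reproduction of the simulations: the 10×10 grid, the linear north-south gradient standing in for smooth risk, the 2×2 patch standing in for sharp risk, and the hand-placed "rare" and "common" alleles are all hypothetical. The rare allele is deliberately placed on the risk patch, so it represents the highly correlated tail rather than a typical rare variant.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

SIZE = 10
cells = [(row, col) for row in range(SIZE) for col in range(SIZE)]

# Hypothetical risk surfaces: a smooth gradient and a small sharp patch.
smooth_risk = [float(row) for row, col in cells]
sharp_risk = [1.0 if row < 2 and col < 2 else 0.0 for row, col in cells]

# Hypothetical alleles: a clustered rare one (4% of cells) and a common one (50%).
rare = [1 if row < 2 and col < 2 else 0 for row, col in cells]
common = [1 if row < 5 else 0 for row, col in cells]

corr = {
    ("rare", "sharp"): pearson(rare, sharp_risk),
    ("common", "sharp"): pearson(common, sharp_risk),
    ("rare", "smooth"): pearson(rare, smooth_risk),
    ("common", "smooth"): pearson(common, smooth_risk),
}
```

In this toy setting the clustered rare allele is perfectly correlated with the sharp patch but only weakly correlated with the gradient, while the common allele shows the opposite pattern.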

Figure 2

Spatial distribution of rare and common variants

(a-c) Examples from simulations of the spatial distribution of (a) rare, (b) low frequency and (c) common variants. In each case, grid squares where the allele is present are in colour; (d-e) The distribution of the correlation coefficient between genotypes and non-genetic risk for rare, low frequency and common variants. These are kernel density estimates of the distribution of the correlation between genotypic value (0/1) and associated environmental risk for individuals from the simulations described in Figure 1; (d) Gaussian risk; (e) Small, sharply defined risk. The inset panels in e show successive enlargements of the boxed areas in the tail of the distribution. All parameters are the same as in Figure 1. Abbreviations: MAF: minor allele frequency.

Several methods for correcting for population stratification in GWAS have been developed. The most popular are genomic control (GC)[21,22], principal component analysis (PCA)[23] and linear mixed models[24]. These corrections are known to be effective in the standard GWAS setting and we find they are all effective when non-genetic risk has a large and smooth distribution (Fig. 3a). However, none of them is effective for the small, sharp distribution of risk (Fig. 3b). GC fails in this case because most variants, even rare ones, have correlation with the non-genetic risk of close to 0 (Fig. 2e). PCA and mixed models fail because they effectively try to correct based on linear functions of relatedness. In the simulations, the first few principal components always include the axes of the grid, so can correct for any non-genetic risk which can be expressed as a linear function of these axes. However, the small, sharp region of risk would require a highly non-linear function to be expressed in these terms, which cannot be achieved simply by including the top components. Ultimately, including a large enough number of principal components will remove virtually all stratification (here, between 20 and 100 PCs is sufficient; Supplementary Fig. 3), but it is not possible to know how many are required and inclusion of many components will lead to substantial reduction in power to detect true associations.
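
The reason genomic control cannot help in the sharp-risk case can be seen from how the correction is computed. The sketch below assumes the common median-based form of GC (the inflation factor is the median test statistic divided by the median of a 1-d.f. chi-square distribution, about 0.455); it is not taken from the paper's methods, and the statistics in `bulk` are made-up values chosen so that their median matches the null median.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.

def genomic_control(chi2_stats):
    """Return (lambda, corrected statistics) using the median-based GC factor."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    return lam, [stat / lam for stat in chi2_stats]

# Hypothetical bulk of statistics whose median equals the null median.
bulk = [0.1, 0.2, CHI2_1DF_MEDIAN, 1.0, 2.0] * 200

# Case 1: the bulk is well behaved, but one variant sits in the inflated tail.
lam_tail, corrected_tail = genomic_control(bulk + [30.0])

# Case 2: every statistic is inflated by the same factor of two.
lam_uniform, corrected_uniform = genomic_control([2 * s for s in bulk])
```

In case 1 the inflation factor stays at 1 and the extreme statistic is left untouched, whereas in case 2 the factor is 2 and the uniform inflation is removed. A correction based on the median only sees inflation that affects most variants.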

Figure 3

Comparison of methods for correcting for population structure

(a-b) QQ plots of −log10 P-values showing the uncorrected values and the values under different corrections; (c-d) Simulated rare variant load tests (Online methods). All parameters are the same as in Figure 1, except that the non-genetic risk is doubled, so for the Gaussian risk in a and c the phenotypic mean is shifted by at most 0.8 standard deviations, while for the small, sharp risk in b and d it is shifted by at most 2 standard deviations. Results are averaged over multiple simulations in order to show the average effect; individual experiments may vary due to the sampling variance of the trait. (a-b) are averaged over 100 simulations, each testing one trait at 10,000 loci in total (10 loci on each of 1000 genealogies, representing independent genomic regions). (c-d) are averaged over 10 simulations, each one testing 10,000 genealogies with either 1, 3 or 10 variants in each. Abbreviations: GC, genomic control; PCA, principal component analysis, using the first 10 principal components; Rare PCA, as PCA but using only variants with MAF < 4%.

Where variants are sufficiently rare that they are unlikely to be observed in more than a few samples, adequate power to detect true association can only be obtained by combining information across multiple variants within a gene, though this can be approached in many ways[8-14]. To assess the effects of stratification on such aggregating tests, we considered one of the simplest ‘load-based’ tests[11], which tests association with the number of rare variants carried in a region, typically a gene. For smoothly-varying Gaussian risk, test-statistic inflation is largely independent of the number of variants considered (Fig. 3c). For sharply-defined risk, test-statistic inflation is reduced as more variants are considered, but still increases sharply for low P-values (Fig. 3d). Given that some versions of these tests cannot easily accommodate relatedness information and that the problems of spatial structure will increase as allele frequency decreases, these results suggest that similar care needs to be taken when interpreting enrichment of either single or multiple variants within cases or controls. The results discussed so far relate to inflation under the null. However, another implication of differential structure is that causal rare variants may be geographically localized. Thus even when there is no spatial structure to non-genetic risk, test-statistic inflation will be observed. When there are many loci with rare variants contributing to the background genetic effect, inflation is typically stronger for common variants and will be corrected for by standard approaches. However, when there are only a few loci driving risk, inflation is greater for rare variants (Supplementary Fig. 4). Consequently, if genetic risk is driven by small numbers of rare variants, then true signals are more likely to be obscured by rare variants that show association even though they are not physically linked to the causal variants.
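
A load-based test of the kind described above can be sketched as a regression of the phenotype on each individual's count of rare alleles in a region. This is a minimal illustration assuming a quantitative trait and ordinary least squares; the exact test of ref. 11 may differ, and the function name `load_test` and the six-individual data set are hypothetical.

```python
import math

def load_test(genotypes, phenotypes):
    """Regress phenotype on rare-allele load; return (slope, t statistic).

    genotypes: one list per individual of 0/1 rare-allele indicators
    across the variants in a region. Assumes the load is not constant
    and the fit is not exact (otherwise a division by zero occurs).
    """
    load = [sum(variants) for variants in genotypes]
    n = len(load)
    mean_x, mean_y = sum(load) / n, sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in load)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(load, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(load, phenotypes))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

# Hypothetical region with three rare variants typed in six individuals.
genotypes = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
phenotypes = [0.1, -0.1, 1.1, 0.9, 2.1, 1.9]

slope, t_stat = load_test(genotypes, phenotypes)
```

Because the load is a sum over variants, any geographic clustering shared by the rare variants in a region carries over to the load itself, which is why such a test can inherit the stratification of its component variants.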

Discussion

We have demonstrated that under certain conditions rare and common variants exhibit differential patterns of stratification. However, these results are qualitative and we must also ask whether these conditions are likely to be met in practice. While the data that would be required to investigate this effect directly are not yet available, we can nonetheless consider metrics that could be used to relate our simulations to real populations. Historically, approaches to summarizing population structure in genetic data have focused on simple statistics, such as Wright’s fixation index F(ST), which measures the proportion of overall genetic variation that results from between-population variation. Among human populations, F(ST) is typically estimated to be <0.1 (for example, 0.071 between the 1000 Genomes CEU and YRI populations[7] and typically <0.02 within Europe[25]). Dividing the simulated grid into two equal sub-populations, for the migration parameter used for Figure 1 (M=0.01) F(ST) is approximately 0.1, which is comparable to a worldwide sample. Within a European sample, a more appropriate migration parameter might be M=10, which gives F(ST)<0.01, a value which would be considered negligible. However, F(ST) estimates are driven by common variants, and also depend on the relative sizes and number of the sub-populations (Supplementary Fig. 5). Analysis of allele sharing as a function of distance shows that while common variants show effectively no excess allele sharing at short ranges, even with M=10 rare variants still show excess clustering (Fig. 4) and although stratification is much reduced compared to a low migration rate, it is still greatest for rare variants (Supplementary Fig. 6). These results are consistent with empirical observations that show very low rare-allele sharing even between very closely related human populations[26].
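
A small calculation illustrates why a fixation index can be low even when a rare variant is completely localized. The sketch assumes the heterozygosity-based definition for one biallelic locus and two equal-sized sub-populations, (H_T − H_S)/H_T; the estimator used in the paper is not specified in this text, and the allele frequencies below are hypothetical.

```python
def fst_two_pops(p1, p2):
    """Heterozygosity-based fixation index for one biallelic locus and
    two equal-sized sub-populations with allele frequencies p1 and p2."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-group
    return (h_total - h_within) / h_total

# Hypothetical rare allele found in only one sub-population (2% versus 0%).
fst_rare_private = fst_two_pops(0.02, 0.0)

# Hypothetical common allele with a moderate frequency difference (60% versus 40%).
fst_common = fst_two_pops(0.6, 0.4)
```

Under these assumptions the private rare allele contributes a per-locus value of about 0.01, lower than the 0.04 of the common allele, even though every carrier of the rare allele lives in the same half of the population.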
The fact that excess allele sharing increases as frequency decreases implies that even for relatively unstructured populations, this effect will be observed below some sufficiently low variant frequency. These results highlight the need for methods for explicitly showing spatial structure, such as the allele-sharing plot (Fig. 4) or other spatial correlation measures such as Moran’s I statistic[27], which provide a much richer and more informative representation than any single statistic.
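
Moran's I, mentioned above as one possible spatial correlation measure, can be computed directly for a value observed on a grid. The sketch below uses the standard formula with binary weights between squares that share an edge (rook adjacency); the choice of weights and the two 4×4 example patterns are assumptions made for illustration.

```python
def morans_i(grid):
    """Moran's I for a rectangular grid with rook (edge-sharing) adjacency."""
    rows, cols = len(grid), len(grid[0])
    n = rows * cols
    mean = sum(sum(row) for row in grid) / n
    cross_products = 0.0
    weight_total = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    cross_products += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_total += 1
    variance_sum = sum((v - mean) ** 2 for row in grid for v in row)
    return (n / weight_total) * cross_products / variance_sum

# Allele present only in the left half of the grid: strong clustering.
clustered = [[1, 1, 0, 0] for _ in range(4)]
# Allele present on alternating squares: maximal dispersion.
checkerboard = [[(r + c) % 2 for c in range(4)] for r in range(4)]

i_clustered = morans_i(clustered)
i_checkerboard = morans_i(checkerboard)
```

The clustered pattern gives a clearly positive value and the checkerboard gives −1. Computing the statistic separately for variants in different frequency classes would be one way to see whether rare variants are more clustered than common ones.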

Figure 4

Excess allele sharing

A ratio measuring how much more likely two individuals at a given spatial distance are to share a derived allele, compared to what would be expected in a homogeneous population (Methods). The parameters are the same as those used in Figure 1, apart from migration rate, which is (a) M=0.01, (b) M=10. In a, F(ST)=0.1 and in b, F(ST)<0.01. Abbreviations: DAF, derived allele frequency.
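
An allele-sharing ratio of this kind can be sketched as follows. The exact definition is given in the paper's Methods, which are not reproduced here, so this is only one plausible reading: the fraction of pairs at a given distance in which both individuals carry the allele, divided by the same fraction for a randomly chosen pair. The one-dimensional layout, the function name `excess_sharing` and the carrier pattern are hypothetical.

```python
def excess_sharing(carriers, distance):
    """Ratio of observed to expected allele sharing at a given distance.

    carriers: 0/1 indicators for individuals placed along a line at unit
    spacing (a one-dimensional stand-in for the grid). Assumes at least
    two carriers, so that the expected sharing is non-zero.
    """
    n = len(carriers)
    k = sum(carriers)
    pairs = [(i, i + distance) for i in range(n - distance)]
    shared = sum(1 for i, j in pairs if carriers[i] and carriers[j])
    observed = shared / len(pairs)
    # Probability that two distinct, randomly chosen individuals both carry it.
    expected = k * (k - 1) / (n * (n - 1))
    return observed / expected

# Hypothetical allele carried by three neighbouring individuals out of ten.
carriers = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

ratio_near = excess_sharing(carriers, 1)
ratio_far = excess_sharing(carriers, 5)
```

For this clustered allele the ratio is above 1 at distance 1 and falls to 0 at distance 5. An allele scattered at random would give ratios close to 1 at all distances.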

There are three ways in which non-genetic risk might show sharply-defined boundaries of the type for which we have shown differential inflation. First, localized environmental exposure may be highly patchy, for example associated with urban areas. Second, there may be systematic measurement bias at a single recruitment centre. Third, and more subtly, there may be local variation in recruitment policy or rates of misclassification (the effect of which can be thought of as changing the background disease risk). Although we have simulated quantitative trait data, case-control studies are subject to the same issues of population structure and a case-control study that randomly misclassifies cases and controls will bias effect size estimates[28]. When this misclassification is restricted to a particular spatial area, for example a single recruitment center in a large study, it will produce the effects described here. In fact, if we add additional disconnected small areas of risk of the same size as the first, the inflation in P-value has the same distribution with respect to frequency (Supplementary Fig. 7) so this observation would extend to the case where multiple collection centers were making biased measurements or random misclassifications. Because the extent and clustering of non-genetic risk will differ between phenotypes and study designs, it is not possible to predict any general influence of differential stratification. The principal problem with trying to account for known non-genetic risk (i.e. to include these as covariates within the analysis) is that while information about broad-scale risk factors may be available, typically, the more localized a risk factor is, the less we are likely to know about it and the greater effect this will have on rare variants. Given that existing methods can fail to correct for rare variant stratification, what approaches can be taken to guard against its effects? 
One approach is to use methods that are robust to stratification (though at a cost to power and ease of experimental design), such as family-based association, perhaps only for replication. Another is to adapt existing methods to work better with rare variants. For example, although PCA with rare variants does not effectively control inflation if we linearly correct using the top components (Fig. 3b), in principle, more sophisticated methods for selecting non-linear functions of components could correct appropriately. Alternatively, we might look to the development of new measures of relatedness more sensitive to recent ancestry and fine-scale structure. Whichever approach is taken, it is likely to require fine-grained information about the geographic origins and recruitment path of each sample. The collection of such information must be an important consideration in the design of future studies.

28 in total

1. The effects of human population structure on large genetic association studies.

Authors: Jonathan Marchini; Lon R Cardon; Michael S Phillips; Peter Donnelly
Journal: Nat Genet Date: 2004-03-28 Impact factor: 38.330

2. Population structure, differential bias and genomic control in a large-scale, case-control association study.

Authors: David G Clayton; Neil M Walker; Deborah J Smyth; Rebecca Pask; Jason D Cooper; Lisa M Maier; Luc J Smink; Alex C Lam; Nigel R Ovington; Helen E Stevens; Sarah Nutland; Joanna M M Howson; Malek Faham; Martin Moorhead; Hywel B Jones; Matthew Falkowski; Paul Hardenbol; Thomas D Willis; John A Todd
Journal: Nat Genet Date: 2005-10-09 Impact factor: 38.330

3. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

4. Genomics for the world.

Authors: Carlos D Bustamante; Esteban González Burchard; Francisco M De la Vega
Journal: Nature Date: 2011-07-13 Impact factor: 49.962

5. Notes on continuous stochastic phenomena.

Authors: P A P MORAN
Journal: Biometrika Date: 1950-06 Impact factor: 2.445

6. Efficient control of population structure in model organism association mapping.

Authors: Hyun Min Kang; Noah A Zaitlen; Claire M Wade; Andrew Kirby; David Heckerman; Mark J Daly; Eleazar Eskin
Journal: Genetics Date: 2008-03 Impact factor: 4.562

7. Bias due to misclassification in the estimation of relative risk.

Authors: K T Copeland; H Checkoway; A J McMichael; R H Holbrook
Journal: Am J Epidemiol Date: 1977-05 Impact factor: 4.897

Review 8. Genetic dissection of complex traits.

Authors: E S Lander; N J Schork
Journal: Science Date: 1994-09-30 Impact factor: 47.728

9. Gm3;5,13,14 and type 2 diabetes mellitus: an association in American Indians with genetic admixture.

Authors: W C Knowler; R C Williams; D J Pettitt; A G Steinberg
Journal: Am J Hum Genet Date: 1988-10 Impact factor: 11.025

Review 10. Common and rare variants in multifactorial susceptibility to common diseases.

Authors: Walter Bodmer; Carolina Bonilla
Journal: Nat Genet Date: 2008-06 Impact factor: 38.330
