
Capability of common SNPs to tag rare variants.

Xiangqing Sun, Junghyun Namkung, Xiaofeng Zhu, Robert C Elston.

Abstract

Genome-wide association studies are based on the linkage disequilibrium pattern between common tagging single-nucleotide polymorphisms (SNPs) (i.e., SNPs having only common alleles) and true causal variants, and association studies with rare SNP alleles aim to detect rare causal variants. To better understand and explain the findings from both types of studies and to provide clues to improve the power of an association study with only common SNPs genotyped, we study the correlation between common SNPs and the presence of rare alleles within a region in the genome and look at the capability of common SNPs in strong linkage disequilibrium with each other to capture single rare alleles. Our results indicate that common SNPs can, to some extent, tag the presence of rare alleles and that including SNPs in strong linkage disequilibrium with each other among the tagging SNPs helps to detect rare alleles.


Year: 2011 PMID: 22373521 PMCID: PMC3287929 DOI: 10.1186/1753-6561-5-S9-S88

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

In recent years, genome-wide association studies have identified hundreds of genetic variants that may be associated with many common diseases [1-3]. It is believed that the associated single-nucleotide polymorphisms (SNPs) detected from current association studies may represent linkage disequilibrium (LD) between a common tagging SNP and true causal variants. Under the common disease/rare variants hypothesis, which suggests that many rare variants can contribute to the phenotypic variation [4,5], association studies to detect rare alleles have become more and more important. In this study, we try to answer two questions: (1) Within a region in the genome, how well do common SNPs tag the presence of rare alleles? (2) When selecting common tagging SNPs for association studies to detect rare alleles, should we exclude SNPs in strong LD with each other (r2 > 0.95), or does it help to capture more information on the rare alleles if we include tagging SNPs in strong LD (r2 > 0.95) with each other? To answer the first question, we analyzed the correlation between common SNPs and the number of rare alleles in samples of rare SNPs (i.e., SNPs containing rare alleles) in each region of the chromosomes. Then, for the second question, we studied the change in correlation between a single rare SNP and common tagging SNPs that is achieved by including SNPs in strong LD with each other when selecting common tagging SNPs.

Methods

Sample

We use the Genetic Analysis Workshop 17 (GAW17) data set, which comprises 697 individuals. The data include 24,487 SNPs, 74% (18,131) of which are considered rare SNPs with a minor allele frequency (MAF) less than 0.01 and only 12.8% of which are common SNPs with MAF > 0.05. Because the numbers of rare and common SNPs in the data are so unbalanced, we add genotype data from the International HapMap Project, release 28 (http://hapmap.ncbi.nlm.nih.gov/), in order to study the capability of the common SNPs to tag rare variants. The final data set includes 627 individuals from 7 populations: European (88), Chinese (91), Chinese in Denver (90), Japanese (92), Luhya (98), Tuscan (61), and Yoruba (107). After removal of SNPs in perfect LD, we are left with 13,777 rare SNPs (MAF < 0.01) and 116,944 common SNPs (MAF > 0.05).
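The MAF thresholds above can be illustrated with a minimal sketch. It assumes each SNP is stored as per-individual counts (0, 1, or 2) of one designated allele. The function names and the "intermediate" label for SNPs with 0.01 ≤ MAF ≤ 0.05 are illustrative choices, not taken from the paper.

```python
from typing import List


def minor_allele_frequency(genotypes: List[int]) -> float:
    """Return the MAF for one SNP.

    genotypes holds, for each individual, the count (0, 1, or 2) of one
    designated allele (an assumed coding). The rarer allele's frequency is returned.
    """
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)


def classify_snp(genotypes: List[int],
                 rare_below: float = 0.01,
                 common_above: float = 0.05) -> str:
    """Label a SNP with the thresholds used in the text (rare < 0.01, common > 0.05)."""
    maf = minor_allele_frequency(genotypes)
    if maf < rare_below:
        return "rare"
    if maf > common_above:
        return "common"
    # Hypothetical label for SNPs that fall between the two thresholds.
    return "intermediate"
```

For example, a SNP with a single heterozygous carrier among 200 individuals has MAF 1/400 = 0.0025 and would be labeled rare under these thresholds.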

Correlation between common SNPs and the presence of rare alleles

We divide the genome into nonoverlapping 1-Mb bins. For each bin, we separate the rare SNPs from the common SNPs. The common SNP value for each individual is the number of minor alleles. The correlation between the set of common SNPs and the numbers of rare alleles is calculated in each bin as follows. For n randomly selected rare SNPs (here we studied n = 5) in a bin, we quantify the number of rare alleles as the total number of rare alleles, $y_i$, that individual i (i = 1, 2, …, N) carries. The correlation between the variable $y_i$ and the common SNPs in the bin is calculated over the N individuals in two ways. In the first way we calculate the Pearson correlation r between $y_i$ and each of the common SNPs, taking the maximum r2. In the second way we calculate the multiple correlation R2 [6] between $y_i$ and the common SNPs, using a multiple regression model. These two correlations are calculated for each consecutive region across the whole genome. We repeat the random sampling of the rare SNPs and the calculation of the correlation $n_T/n$ times (i.e., the closest integer to $n_T/n$) if $n_T > n$, where $n_T$ is defined as the total number of rare SNPs in a bin.

We calculate the correlations between the common SNPs and the number of rare alleles in rare SNPs separately in each of the seven subpopulations, to test whether the tagging capability is different in different populations. We also calculate the correlation between common SNPs and the number of each of two types of rare alleles (synonymous and nonsynonymous) to test whether common SNPs have a different capability to tag these two types of rare alleles.

To examine whether the correlations between common SNPs and rare alleles are due to statistical noise, we perform a permutation test. We permute each of the common SNPs within the bin across individuals and calculate the correlations between the variable $y_i$ and the permuted common SNPs. Then the observed and permutation correlation distributions are compared using a Kolmogorov-Smirnov test. We also compare the means of the two distributions using a t test.
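The two correlation measures and the permutation step described above can be sketched as follows. This is a minimal illustration assuming NumPy and genotype matrices with individuals in rows and SNPs in columns. The function names are hypothetical, and the authors' actual implementation is not given in the text.

```python
import numpy as np


def rare_allele_count(rare_genotypes: np.ndarray) -> np.ndarray:
    """y_i: total rare alleles carried by each individual over the sampled rare SNPs."""
    return rare_genotypes.sum(axis=1)


def max_pairwise_r2(y: np.ndarray, common: np.ndarray) -> float:
    """Largest squared Pearson correlation between y and any single common SNP."""
    yc = y - y.mean()
    xc = common - common.mean(axis=0)
    denom = np.sqrt((yc @ yc) * (xc * xc).sum(axis=0))
    valid = denom > 0  # skip monomorphic SNPs (and a constant y)
    if not valid.any():
        return 0.0
    r = (xc[:, valid].T @ yc) / denom[valid]
    return float(np.max(r ** 2))


def multiple_r2(y: np.ndarray, common: np.ndarray) -> float:
    """R^2 from a least-squares regression of y on all common SNPs plus an intercept."""
    yc = y - y.mean()
    total = yc @ yc
    if total == 0:
        return 0.0
    design = np.column_stack([np.ones(len(y)), common])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(1.0 - (resid @ resid) / total)


def permuted_r2(y: np.ndarray, common: np.ndarray, rng: np.random.Generator) -> float:
    """One null value: permute each common SNP independently across individuals."""
    shuffled = np.column_stack(
        [rng.permutation(common[:, j]) for j in range(common.shape[1])]
    )
    return multiple_r2(y, shuffled)
```

Collecting `multiple_r2` over bins gives the observed distribution, and collecting `permuted_r2` gives the permutation distribution. The two can then be compared with a Kolmogorov-Smirnov test and a t test, as described in the text.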

Capability of common SNPs in strong LD to capture rare alleles

We hypothesize that incorporating common SNPs in strong LD will capture significantly more variation resulting from rare alleles than using only the common SNPs in less strong LD with each other. We select the common SNPs within a 1-Mb region of each rare SNP and divide them into two sets. The first set is composed of the common tagging SNPs with LD of r2 ≤ 0.80 between each pair; we call this set A. The second set is composed of the common SNPs with LD of r2 ≤ 0.95 between each pair, which we call set B. So set B has two parts: all the SNPs in set A (r2 ≤ 0.80) and those SNPs in set (B − A) that are in higher LD with the SNPs in set A or between themselves (0.80 < r2 ≤ 0.95).

We calculate the multiple correlation R2 [6] between each rare SNP and the set of common SNPs (set A and set B, respectively). Because R2 never decreases when the number of independent variables in the model increases, $R_B^2$ is always greater than or equal to $R_A^2$ [6], where the subscripts A and B represent set A and set B, respectively. An F statistic,

$$F = \frac{(R_B^2 - R_A^2)/(n_B - n_A)}{(1 - R_B^2)/(N - n_B - 1)},$$

where $n_A$ and $n_B$ are the numbers of SNPs in set A and set B, respectively, and N is the number of individuals, is calculated to test whether the increase in $R_B^2$ over $R_A^2$ to predict the rare alleles is significant. Because R2 increases with the number of explanatory terms in a model, we use the adjusted $\bar{R}^2$, which adjusts for the number of explanatory common SNPs in the multiple regression model [6], to evaluate the multiple correlation:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{N - 1}{N - n - 1},$$

where n is $n_A$ or $n_B$.

In order to test whether the increase in R2 is due to the stronger LD among the SNPs in set B, which comes from the SNPs in set (B − A), or due to the larger number of SNPs from set (B − A), we evaluate the significance of the F statistic by comparison to a sample of 1,000 replicates of its permutation distribution, obtained by permuting across individuals the set of SNPs in set B but not in set A (i.e., the SNPs in set (B − A)), which breaks any LD structure between sets A and (B − A) but keeps the structure within the set (B − A). For each rare SNP, we also compare its multiple correlation with the common SNP set A having LD given by r2 ≤ 0.95 and with set B having LD given by r2 ≤ 0.99.
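The nested-model comparison and its permutation test can be sketched as below. The sketch assumes NumPy, the standard F statistic for comparing nested regression models, and the standard adjusted R2. The function names and the (count + 1)/(replicates + 1) P-value convention are assumptions, not details confirmed by the text. Rows of set (B − A) are permuted jointly, so LD within (B − A) is kept while LD with set A is broken.

```python
import numpy as np


def r2(y: np.ndarray, snps: np.ndarray) -> float:
    """R^2 from least-squares regression of y on the SNP columns plus an intercept."""
    yc = y - y.mean()
    total = yc @ yc
    if total == 0:
        return 0.0
    design = np.column_stack([np.ones(len(y)), snps])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(1.0 - (resid @ resid) / total)


def adjusted_r2(r2_value: float, n_individuals: int, n_snps: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2)(N - 1)/(N - n - 1)."""
    return 1.0 - (1.0 - r2_value) * (n_individuals - 1) / (n_individuals - n_snps - 1)


def f_statistic(r2_a: float, r2_b: float, n_a: int, n_b: int, n_individuals: int) -> float:
    """F for the gain of set B over nested set A; assumes R^2_B < 1 and N > n_B + 1."""
    gain = (r2_b - r2_a) / (n_b - n_a)
    residual = (1.0 - r2_b) / (n_individuals - n_b - 1)
    return gain / residual


def permutation_p_value(y, set_a, extra, n_perm=1000, seed=0):
    """Permute the (B - A) SNPs jointly across individuals and recompute F each time."""
    rng = np.random.default_rng(seed)
    n_ind = len(y)
    n_a = set_a.shape[1]
    n_b = n_a + extra.shape[1]
    r2_a = r2(y, set_a)
    observed = f_statistic(r2_a, r2(y, np.column_stack([set_a, extra])), n_a, n_b, n_ind)
    exceed = 0
    for _ in range(n_perm):
        order = rng.permutation(n_ind)  # one row order for all (B - A) columns
        r2_perm = r2(y, np.column_stack([set_a, extra[order]]))
        if f_statistic(r2_a, r2_perm, n_a, n_b, n_ind) >= observed:
            exceed += 1
    # Assumed convention: add one to numerator and denominator to avoid P = 0.
    return observed, (exceed + 1) / (n_perm + 1)
```

Here `y` is the genotype vector of one rare SNP, `set_a` holds the set A SNPs, and `extra` holds the SNPs in set (B − A), each with individuals in rows.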

Results

Correlation between the number of rare alleles and common SNPs within a region

Using all 627 samples, the correlation between the number of rare alleles in any randomly selected five rare SNPs and a set of common SNPs within a 1-Mb region is less than 0.1 for both correlation measures. The correlation between the number of rare alleles and a set of common SNPs within subpopulations was larger than that of the samples overall (Table 1; Figure 1). The mean adjusted multiple correlation for European, Chinese, Denver Chinese, Japanese, Luhya, Tuscan, and Yoruba ranged from 0.06 to 0.24 (Table 1). Compared with random correlations, which are given by correlations between the number of rare alleles and a set of randomly permuted common SNPs, there was no significant difference in the total sample. In the subpopulations, however, the correlations between the number of rare alleles and the set of common SNPs were significantly different from random correlations (P < 0.001) (Table 1), but the difference was quite small.

Table 1

Population	(1) Common vs. rare SNPs	(2) Random correlation	t test P	Kolmogorov-Smirnov test P	(3) Common vs. synonymous rare SNPs	(4) Common vs. nonsynonymous rare SNPs	t test P
European	0.078	−0.022	2.84 × 10⁻⁹	9.66 × 10⁻¹⁵	0.078	0.077	0.952
Chinese	0.067	0.003	6.80 × 10⁻⁶	1.28 × 10⁻⁵	0.090	0.041	0.024
Denver Chinese	0.063	−0.002	2.30 × 10⁻⁶	3.052 × 10⁻¹⁰	0.085	0.064	0.350
Japanese	0.089	0.004	1.50 × 10⁻⁸	6.17 × 10⁻¹²	0.091	0.081	0.668
Luhya	0.238	−0.0006	<2.2 × 10⁻¹⁶	<2.2 × 10⁻¹⁶	0.241	0.233	0.678
Tuscan	0.063	−0.002	0.001	4.60 × 10⁻⁶	0.088	0.045	0.100
Yoruba	0.120	−0.007	<2.2 × 10⁻¹⁶	<2.2 × 10⁻¹⁶	0.142	0.099	0.008
All samples	0.053	0.054	0.580	0.118	0.057	0.048	7.74 × 10⁻⁴

Figure 1

Distribution of the correlation. The correlation is between the common SNPs and the number of rare alleles present in five random rare SNPs within a 1-Mb region. X-axes are the correlation r2; y-axes are the probability densities.

Table 1 reports the mean multiple correlation between (1) the set of common SNPs and the number of rare alleles, (2) permuted common SNPs and the number of rare alleles, (3) the set of common SNPs and the number of synonymous rare alleles, and (4) the set of common SNPs and the number of nonsynonymous rare alleles.

In the total sample, the set of common SNPs has a mean correlation of 0.057 with the number of rare synonymous alleles and 0.048 with the number of rare nonsynonymous alleles (Table 1); the difference, although small, is significant (P = 7.74 × 10⁻⁴). In the subpopulations, the set of common SNPs also showed higher correlations with the number of rare synonymous alleles than with the number of rare nonsynonymous alleles, and the difference was most significant in Yoruba (P = 0.008). Note that in Yoruba, although the average correlation between common SNPs and the number of rare alleles is not high (0.120), it is significantly different from a random correlation, which suggests that common SNPs are able to capture some information on the number of rare alleles. In Yoruba, the set of common SNPs has a significantly larger correlation with the number of rare synonymous alleles than with the number of rare nonsynonymous alleles (0.142 vs. 0.099, P = 0.008), which may indicate that the common SNPs are more prone to detecting nonfunctional SNPs than functional SNPs in this population. The correlation between common SNPs and the number of rare alleles is highest in Luhya (0.238), but common SNPs show no significant difference in capturing synonymous and nonsynonymous SNPs.

Capability of common SNPs in strong LD to capture rare variants within a region

By comparing two correlations, namely the adjusted multiple correlation between a rare SNP and the set of common SNPs in set A (LD of r2 ≤ 0.80) and the adjusted multiple correlation between that rare SNP and the set of common SNPs in set B (composed of both the SNPs in set A with LD r2 ≤ 0.80 and the SNPs in stronger LD, 0.80 < r2 ≤ 0.95) […] > 0.08, permutation P > 0.11).

Figure 2

Distribution of the multiple correlation. Each point represents a rare SNP. The x-axis is the adjusted R2 between the rare SNP and the common SNPs in set A, and the y-axis is the adjusted R2 between the rare SNP and the common SNPs in set B. SNPs in set B have stronger LD than SNPs in set A; thus set B contains all the SNPs in set A and the SNPs that have stronger LD with those in set A or between themselves. In the left-hand panel, SNPs in set A have LD r2 ≤ 0.8 and SNPs in set B have LD r2 ≤ 0.95. In the right-hand panel, SNPs in set A have LD r2 ≤ 0.95 and SNPs in set B have LD r2 ≤ 0.99.

Discussion

In this study, we found that within a region in the genome, overall the common SNPs are not highly correlated with the number of rare alleles, so they are not powerful for tagging the presence of rare alleles. But in subpopulations, the common SNPs can capture some information on the presence of rare variants; the increase in correlation over random is statistically significant but often small (Table 1). We also found that including tagging SNPs in strong LD with each other is helpful in detecting rare alleles. Common SNPs have higher correlations with the presence of rare SNPs in the subpopulations, which indicates that population structure influences the tagging power. The common SNPs have lower correlations with the presence of nonsynonymous SNPs, especially in the Yoruba population, which may indicate difficulty in capturing rare functional variants in that population. In addition to the number of rare alleles, we also analyzed the correlation between common SNPs and another variable, a collapsing statistic for rare SNPs [7-9], which has the value 1 if a rare allele is present and the value 0 if no rare alleles are present among several randomly selected SNPs within a genome region. We obtained similar results with the collapsing variable (data not shown). Our study suggests that we should not exclude SNPs in strong LD (e.g., r2 > 0.95) from tagging SNPs in an association study, because they can help to detect rare SNPs. Rare SNPs are less helpful for predicting disease risk, however, because their attributable risk is so small; but the significant associations detected by them could be important for detecting new metabolic pathways. The adjusted multiple correlation R2 could be overadjusted because the adjustment assumes independence of the common SNPs, which is not the case for our study. But we nevertheless get an increased adjusted R2 for tagging rare SNPs by including SNPs in strong LD with each other among the tagging SNPs, which indicates their importance in an association study to detect causal variants.
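The collapsing statistic mentioned in this discussion (1 if any rare allele is present among the selected rare SNPs, 0 otherwise) can be sketched in a few lines. The function names and the 0/1/2 rare-allele coding are illustrative assumptions.

```python
from typing import List, Sequence


def collapsing_indicator(rare_counts: Sequence[int]) -> int:
    """Return 1 if the individual carries at least one rare allele, else 0.

    rare_counts holds the individual's rare-allele count (0, 1, or 2, an
    assumed coding) at each of the selected rare SNPs.
    """
    return 1 if any(count > 0 for count in rare_counts) else 0


def collapse_individuals(rare_matrix: Sequence[Sequence[int]]) -> List[int]:
    """Apply the indicator to every individual (one row per individual)."""
    return [collapsing_indicator(row) for row in rare_matrix]
```

The resulting 0/1 vector can replace the rare-allele count when computing correlations with the common SNPs.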

Conclusions

In this study, we found that, overall, common SNPs are not good at capturing the presence of rare alleles within a region of the genome, but they can capture some information on their presence in subpopulations. The common SNPs are more prone to capturing nonfunctional rare SNPs, especially in some populations. We also found that including tagging SNPs in strong LD with each other can be helpful in detecting rare variants.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XZ and RCE conceived of the study and participated in its design and coordination. XS performed the statistical analysis and drafted the manuscript. JN prepared data from the HapMap Project. RCE, XZ and JN helped to draft and modify the manuscript. All authors read and approved the final manuscript.

8 in total

1. Pooled association tests for rare variants in exon-resequencing studies.

Authors: Alkes L Price; Gregory V Kryukov; Paul I W de Bakker; Shaun M Purcell; Jeff Staples; Lee-Jen Wei; Shamil R Sunyaev
Journal: Am J Hum Genet Date: 2010-05-13 Impact factor: 11.025

2. A genome-wide association study identifies IL23R as an inflammatory bowel disease gene.

Authors: Richard H Duerr; Kent D Taylor; Steven R Brant; John D Rioux; Mark S Silverberg; Mark J Daly; A Hillary Steinhart; Clara Abraham; Miguel Regueiro; Anne Griffiths; Themistocles Dassopoulos; Alain Bitton; Huiying Yang; Stephan Targan; Lisa Wu Datta; Emily O Kistner; L Philip Schumm; Annette T Lee; Peter K Gregersen; M Michael Barmada; Jerome I Rotter; Dan L Nicolae; Judy H Cho
Journal: Science Date: 2006-10-26 Impact factor: 47.728

3. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.

Authors: Bingshan Li; Suzanne M Leal
Journal: Am J Hum Genet Date: 2008-08-07 Impact factor: 11.025

Review 4. Common and rare variants in multifactorial susceptibility to common diseases.

Authors: Walter Bodmer; Carolina Bonilla
Journal: Nat Genet Date: 2008-06 Impact factor: 38.330

Review 5. Common vs. rare allele hypotheses for complex diseases.

Authors: Nicholas J Schork; Sarah S Murray; Kelly A Frazer; Eric J Topol
Journal: Curr Opin Genet Dev Date: 2009-05-28 Impact factor: 5.578

6. A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants.

Authors: Laura J Scott; Karen L Mohlke; Lori L Bonnycastle; Cristen J Willer; Yun Li; William L Duren; Michael R Erdos; Heather M Stringham; Peter S Chines; Anne U Jackson; Ludmila Prokunina-Olsson; Chia-Jen Ding; Amy J Swift; Narisu Narisu; Tianle Hu; Randall Pruim; Rui Xiao; Xiao-Yi Li; Karen N Conneely; Nancy L Riebow; Andrew G Sprau; Maurine Tong; Peggy P White; Kurt N Hetrick; Michael W Barnhart; Craig W Bark; Janet L Goldstein; Lee Watkins; Fang Xiang; Jouko Saramies; Thomas A Buchanan; Richard M Watanabe; Timo T Valle; Leena Kinnunen; Gonçalo R Abecasis; Elizabeth W Pugh; Kimberly F Doheny; Richard N Bergman; Jaakko Tuomilehto; Francis S Collins; Michael Boehnke
Journal: Science Date: 2007-04-26 Impact factor: 47.728

7. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917

8. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:
Journal: Nature Date: 2007-06-07 Impact factor: 49.962

