Literature DB >> 22267498

Detecting rare variant associations by identity-by-descent mapping in case-control studies.

Sharon R Browning¹, Elizabeth A Thompson.

Abstract

Identity-by-descent (IBD) mapping tests whether cases share more segments of IBD around a putative causal variant than do controls. These segments of IBD can be accurately detected from genome-wide SNP data. We investigate the power of IBD mapping relative to that of SNP association testing for genome-wide case-control SNP data. Our focus is particularly on rare variants, as these tend to be more recent and hence more likely to have recent shared ancestry. We simulate data from both large and small populations and find that the relative performance of IBD mapping and SNP association testing depends on population demographic history and the strength of selection against causal variants. We also present an IBD mapping analysis of a type 1 diabetes data set. In those data we find that we can detect association only with the HLA region using IBD mapping. Overall, our results suggest that IBD mapping may have higher power than association analysis of SNP data when multiple rare causal variants are clustered within a gene. However, for outbred populations, very large sample sizes may be required for genome-wide significance unless the causal variants have strong effects.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
HLA Antigens

Year: 2012 PMID： 22267498 PMCID： PMC3316661 DOI： 10.1534/genetics.111.136937

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

THE idea of using identity-by-descent (IBD) haplotype sharing to detect signals of disease-causing variants in population samples is not new (Houwen ; Te Meerman ); however, the greatly increased density of SNP markers now makes it possible to detect much smaller segments of IBD. New statistical methods for detecting such IBD have been proposed (Purcell ; Kong ; Thomas ; Leibon ; Gusev ; Albrechtsen ; Thompson 2009; Bercovici ; Browning and Browning, 2010, 2011; Han and Abney 2011; Brown ), and it is possible to determine pairwise IBD sharing in a large sample over the whole genome to a resolution of approximately 2 cM (Browning and Browning 2011). In this article we investigate the power of IBD mapping to detect associations for complex diseases and compare this with the power of SNP association mapping. Throughout this article, IBD is genomic-location specific. That is, we have information from genetic data on whether two individuals share alleles identical by descent at a certain genomic position. We focus on segments of IBD that are due to recent shared ancestry. For example, IBD segments from shared ancestry 25 generations ago have average length of 2 cM. Two classes of statistics have been proposed for IBD mapping. The first, which we call “pairwise” statistics, use IBD detected between pairs of individuals. The rate of IBD in case/case pairs is compared to the rate of IBD in either control/control pairs or non-case/case pairs (control/control and control/case pairs) (Purcell ). The pairwise approach is somewhat similar to affected relative pair linkage analysis, with important differences. Affected relative pair linkage analysis involves pairwise IBD, but only within relative pairs, not across relative pairs. Also, control individuals are not needed in affected relative pair linkage because the background rate of IBD (the rate of IBD in pairs of unaffected relatives of the same type) is assumed to be known. In IBD mapping, the exact degree of relationship is unknown, and the rate of detected IBD tends to vary along the genome due to differences in informativeness and stochastic differences in haplotype genealogies. The use of control individuals allows the analysis to account for these differences in background rates of IBD. The second class of statistics, which we call “clustering” statistics, cluster haplotypes into IBD classes at a locus. All haplotypes within a class are IBD with each other at the locus. An individual is a member of two IBD classes at a locus because the individual has two haplotypes, although the two classes will be the same if the individual is homozygous by descent. Clusters are tested for association with case-control status (Gusev ). The clustering approach is difficult because IBD is not usually estimated with 100% certainty or 100% power. Thus one must determine how to resolve inconsistencies, such as when haplotypes A and B are estimated to be IBD, and B and C are estimated to be IBD, but A and C are not estimated to be IBD. Also, for IBD due to recent ancestry the IBD clusters will be very small in an outbred population, so that testing individual clusters will tend to have low power. Whichever statistic is used, IBD mapping focuses on signals from rare variants. That is because coalescence times will be shorter for haplotypes carrying the same recent (hence rare) variant, resulting in larger segments of IBD that are more detectable. In small populations, all coalescence times are relatively short, so the variants need not be as rare. Various methods are available for detecting IBD segments. Resolution of the methods differs, but generally power is low for detecting segments shorter than 2 cM with current genome-wide SNP panels in Europeans (Browning and Browning 2011). IBD mapping is performed on data that are also suitable for standard association analysis. If the data are sequence data, with all rare variants genotyped, then the data contain perfect information about IBD for the purposes of determining association. Variants that are identical can be assumed to be IBD; even if there is recurrent mutation and the identical variants are not IBD, it is only the identity of the variant that matters when directly testing a putative causal variant. Hence detected IBD segments cannot add further information. Thus, for sequence data, standard single-variant association testing should be more powerful than IBD mapping. From the point of view of merely comparing testing procedures, standard single-variant association testing pays the cost of higher multiple testing correction, but IBD mapping suffers from incomplete information (not all IBD is detected) and added noise. For the pairwise IBD statistics, the background rate of IBD provides noise; for the clustering statistics, IBD clusters that do not represent actual variants provide noise. The pairwise statistics allow aggregation from several causal variants located proximally (e.g., within the same gene), which can add power. With sequence data, specialized association tests can aggregate over multiple rare variants and have the advantage that functional information, such as whether a variant is a missense mutation, can be utilized (Li and Leal 2008; Madsen and Browning 2009; Wu ). In genome-wide SNP array data, many rare variants are not typed, allowing the possibility of a power advantage for IBD mapping. For non-IBD-based testing, both single-SNP and haplotypic tests are possible. Haplotypic tests can be useful because haplotypes can tag rare variants that have not been genotyped. The line between IBD mapping and haplotypic testing with long haplotypes is somewhat blurred. The IBD clustering tests are similar to haplotype clustering tests (Browning 2006), while the IBD pairwise tests are similar to haplotype sharing tests (Van Der Meulen and Te Meerman 1997). Although haplotypic association tests have theoretical advantages over SNP association tests for detecting associations with rare variants, several practical issues have limited their usefulness. First, haplotypic tests are very susceptible to differential genotype error between cases and controls that can create apparently unusual haplotypes primarily carried by either cases or controls (Browning and Browning 2008). IBD mapping is less susceptible to this problem because genotype errors generally can make it more difficult to detect IBD but would not usually induce false-positive IBD. Thus a small rate of genotype error at a locus, even if differentially distributed between cases and controls, will not be likely to create false-positive IBD mapping signals. Second, haplotypic tests (and IBD mapping tests) are focused on the effects of rare variants. However, power is low to detect association with a rare variant unless it has a very strong effect. Although aggregation over multiple rare variants is possible for haplotypic tests (Zhu ), and may add power, the effectiveness of the aggregation is reduced because of the lack of functional information about the rare variants being tagged by the haplotypes. Pairwise IBD tests also aggregate over variants, and one advantage of this approach is that one does not need to decide which variants to aggregate, as the aggregation is done automatically by the test. Some causal rare variants may have large effect sizes relative to the typically small effect sizes seen for common variants in complex traits (Manolio ), as rare variants tend to be recent and thus have had less time to be removed from the population by natural selection. If the rare variant effects are very large (such as in fully penetrant Mendelian diseases), these effects would have been found through previous family-based linkage studies. However, there may be a middle ground in which multiple rare variants of moderate effect size play a key role in the etiology of some diseases. Such situations may be ideal for IBD mapping. Few published studies have attempted IBD mapping in unrelated samples. Albrechtsen and Moltke analyzed small numbers (five to seven) of unrelated, mostly Danish, breast and ovarian cancer patients who were known to carry the same BRCA1 mutation and showed that these individuals were detectably IBD around the gene. Gusev applied DASH, a haplotype-testing/IBD mapping program, to data from samples from the island of Kosrae (a small founder population) and to case-control data from the UK, and found known associations and some novel associations, including one in the Kosrae data that replicated in a European cohort. Francks used PLINK’s IBD mapping method on two cohorts of 400–500 schizophrenia cases and similar numbers of controls and obtained a P-value of 1.6 × 10−5 on chromosome 19q. By adding approximately 1000 familial bipolar cases and a similar number of controls they were able to strengthen the signal to P = 2.6 × 10−6. It remains to be seen whether this signal will replicate in other data sets.

Methods

Pairwise IBD test

Our IBD test compares rates of IBD in case/case pairs of individuals and non-case/case pairs of individuals. At each position we calculate the rate for each of the two groups (the group of case/case pairs and the group of non-case/case pairs) and subtract off the genomic average for each group; we then take the difference between the two groups. We perform a one-sided test: that is, we test for case/case pairs having a higher adjusted rate of IBD than non-case/case pairs. Because IBD segments are not independent between pairs, we use permutation of case-control labels to assess significance. This testing procedure is the same as in Purcell . Python programs implementing the IBD test that we used for the WTCCC type 1 diabetes data can be downloaded from http://faculty.washington.edu/sguy/ibdmapping.html.

SNP association test

For comparison with the pairwise IBD test, we consider allelic association tests. The power of these tests is reduced if there is allelic heterogeneity (multiple causal mutations in a gene), because each single-marker test is likely to tag at most one of the causal variants. Power also tends to be low for rare causal variants, because these are not well tagged by genome-wide SNP panels. Genome-wide association analyses involve hundreds of thousands of association tests. Thus, multiple-testing correction is required to avoid obtaining large numbers of false-positive results. The multiple-testing correction depends on the panel of SNPs used for testing as well as on the population from which the sample is derived. Gao find that a nominal P-value of 1.4 × 10−7 corresponds to an adjusted P-value of 0.05 for Illumina 1 M data in European American data while the corresponding nominal P-value threshold for Affymetrix 500K data are 2.5 × 10−7. Dudbridge and Gusnanto (2008) extrapolate from WTCCC (UK) data to a P-value threshold of 7.2 × 10−8 for infinitely dense SNP data with allele ascertainment matching that of the Affymetrix 500K data. Pe’er estimate a multiple testing threshold for all common SNPs (frequency >5%) of around 10−7 in Europeans and 5 × 10−8 in Africans. Recently, a threshold of 5 × 10−8 has become a de facto standard regardless of the populations being studied (e.g., Lowe ). The effective population size to consider for SNP association testing is different than that for IBD testing, as the relevant time interval differs. IBD testing concerns mostly the effective population size in the past G generations, where G = 25, for example (the recent effective population size), whereas SNP association testing is concerned with much larger timescales (the long-term effective population size). For a population of changing size, the relevant effective population size is close to the minimum (effective) population size over the time period of interest (Wright 1931). Thus population bottlenecks, for example during the out-of-Africa migration(s), can have a large impact on long-term effective population size, and hence on the multiple testing correction for SNP association tests.

Results

Simulation study

We performed a simulation study to compare standard SNP association testing with pairwise IBD testing. We used SFS_CODE (Hernandez 2008) to generate sequence data from a base population of constant size N = 10,000. The population size influences the scaled rates of mutation, 4Nμ, and selection, 2Ns. When we doubled the population size to N = 20,000 and halved the selection coefficient s we observed essentially identical results (data not shown). The mutation coefficient we used was μ = 10−8 (rate of mutations per base pair, per meiosis). The recombination rate was 10−8 per base pair per meiosis. Figure 1 shows the simulation scheme. The output of the simulations was 100 common SNPs spread evenly over a 200-kb region (one SNP per 2 kb) and a variable number of causal SNPs (see Table 1) located within the central 9 kb of the 200-kb region.

Figure 1

Table 1

Properties of simulated causal variants

s	No. of variants	Variant frequencies	Haplotype carrier frequencies	Max R²
0.0005	11–16	0.00015–0.0060	0.045–0.13	0.91–1.00
0.001	9–14	0.00010–0.0031	0.019–0.050	0.28–1.00
0.002	8–13	0.00010–0.0020	0.0097–0.031	0.06–0.52
0.005	7–10	0.000088–0.0011	0.0045–0.011	0.03–0.16

Interquartile ranges (IQR; lower quartile to upper quartile) from the 100 simulations with selection coefficient s are shown for several quantities of interest. The second column gives the number of causal variants per simulation. The third column gives the frequencies of the causal variants. The fourth column gives the proportion of haplotypes that carry a causal variant. The final column gives the maximum squared correlation coefficient between any one of the 100 common variants tested in the association test with any one of the causal variants. All results are from the base simulation population of 10,000 individuals.

Simulation scheme. Each simulated region is made up of 100 simulated segments of length 1 kb with gaps of length 1 kb between them. The central five segments can contain causal SNPs. Causal SNPs are those that the simulation program designates as protein-changing mutations. These SNPs have been subject to negative selection at a specified rate. Only the causal SNPs and one SNP per segment with highest minor allele frequency (MAF) are retained. The causal SNPs are used to determine disease status, while the high MAF SNPs are tested in the association analysis. IBD status is determined through further simulation, as described in the main text. Interquartile ranges (IQR; lower quartile to upper quartile) from the 100 simulations with selection coefficient s are shown for several quantities of interest. The second column gives the number of causal variants per simulation. The third column gives the frequencies of the causal variants. The fourth column gives the proportion of haplotypes that carry a causal variant. The final column gives the maximum squared correlation coefficient between any one of the 100 common variants tested in the association test with any one of the causal variants. All results are from the base simulation population of 10,000 individuals. The causal SNPs were subject to negative selection with selection coefficient s during the generation of the base population. Individuals carrying one or more of these variants have the disease with probability 0.1 (the penetrance). Individuals not carrying any of these variants have the disease with probability 0.01 (the sporadic rate). These parameters were chosen to obtain good power with reasonable sample sizes while being somewhat realistic. The ratio of penetrance to sporadic rate determines power. Thus increased power could be obtained either by increasing penetrance or decreasing the sporadic rate. If the penetrance is extremely high (such as 0.5) then strong familial aggregration will be seen and a family-based sampling approach would be more appropriate than a population-based sample. On the other hand, very low sporadic rates are unrealistic for complex diseases because the sporadic rate includes both nongenetic causes and genetic causes at other loci. We consider a range of selection coefficients. If the selection coefficient is too low, most individuals will carry a causal variant, so the model considered here is not suitable. With N = 10,000 and s = 0.0002, a majority of individuals carry a causal variant. Thus, the smallest selection coefficient we consider is s = 0.0005, for which approximately 10–20% of individuals carry a causal variant, resulting in a prevalence of approximately 2–3% for the disease. If the selection coefficient is very high, the causal variants will tend to be very rare, and very large sample sizes are required to detect association, particularly with the SNP association test. The largest selection coefficient that we consider is 0.005. For each selection coefficient we generated 100 independent data sets, which were used as base populations for all the forward simulations described below. Table 1 shows the distributions of frequencies of causal variants and numbers of causal variants from the simulations. After generating a base population with SFS_CODE, we ran our own forward simulator for 25 or 100 generations without selection or mutation or recombination, and with a different population size corresponding to a recently expanded population (Nrecent = 100,000) or a population that has gone through a recent bottleneck (Nrecent = 1000). Omission of selection, mutation, and recombination during this 25 or 100 generation time period is not significant because of the short timescale. Our forward simulator generated each new generation by sampling with replacement from the existing pool of haplotypes. When there were no causal variants segregating in the population after the final generation, the sampling process was rerun. The purpose of the final generations of simulation using our own forward simulator was to determine IBD status. Haplotypes deriving from the same ancestral haplotype G generations ago (where G = 25 or G = 100) are considered to be detectably IBD (see Appendix). Current methods for IBD detection from SNP data can detect a majority of IBD segments arising from a shared ancestor within the past 25 generations. With improving SNP data and further developments in IBD detection methodology, it may soon be possible to detect IBD arising from shared ancestry 50 or even 100 generations ago. In real data, ability to detect IBD depends on the number of generations to the common ancestor only through the length of the IBD segment. More recent common ancestry tends to result in longer IBD segments, which are easier to detect. However, the distribution of IBD lengths given the number of generations to the common ancestor is highly variable as it approximately follows an exponential distribution. Using a cut-off in terms of number of generations to common ancestor in place of a cut-off in terms of length of region simplifies the simulation procedure and gives some sense as to the properties of analysis of real data. When comparing a threshold of G = 25 generations in simulated data with a threshold of 2 cM in real data, say, on the one hand some IBD segments due to common ancestry 25 generations ago will be too short to be detectable in real data, while on the other hand some segments due to ancestry more than 25 generations ago will be long enough to be detected in real data, with these two effects cancelling each other to some extent. Cases and controls were generated by sampling with replacement from the final generation until sufficient numbers were obtained. The numbers of cases and controls were chosen to achieve at least moderate power for both SNP association and IBD tests. In many instances the number of cases is much larger than would seem reasonable given the effective population size. However, in real life the actual population size is typically larger than the effective population size. Also, many human populations have undergone rapid expansion in the past few generations (due to improved medical care, etc), which again allows for realistically increased sample sizes beyond the typical population size in the past 25–100 generations. For the SNP association test, the retained common SNPs were tested for association with disease status using Fisher’s exact allelic test. The minimum P-value over the 100 common SNPs was compared to a genome-wide significance threshold of 5 × 10−8. The tested SNPs are common, while the causal SNPs tend to be rare. Table 1 shows that for low-selection coefficients many simulations contain at least one tested common variant that is very highly correlated with at least one causal variant, whereas for high-selection coefficients (very rare causal variants) the correlations tend to be low. For the IBD test, the IBD status was determined by ancestry at the beginning of the final G generations of simulation, as described above. The difference between case/case and non-case/case IBD proportions was calculated, and 5 million permutations of case-control status were performed to obtain a P-value, which was compared to a genome-wide significance threshold of 2 × 10−6 for G = 25 or 5 × 10−7 for G = 100 (see the Appendix for derivation of these thresholds). Note that G = 25 corresponds approximately to the current resolution of IBD detection in SNP data, whereas G = 100 corresponds to improved resolution that may be achievable with denser data or improved methods. In the simulations the IBD status is the same at each of the simulated segments; thus a single IBD test covers the whole region. For each parameter setting, 100 replicate data sets were generated using the 100 base populations generated with SFS_CODE. SNP association and IBD mapping tests were performed on the same data sets. The final generations of simulation were performed independently for the G = 25 and G = 100 generation simulations. Table 2 shows results for a large recent effective population size (Nrecent = 100,000). It can be seen that the IBD test with G = 25 is more powerful than the SNP association test when the selection against causal variants is sufficiently strong; however, once selection becomes too strong the variants are so rare that very large sample sizes are required to obtain reasonable power. The IBD test with G = 100 is at least as powerful as the SNP association test for all selection coefficients considered here. Within the model framework considered here, we cannot consider weaker selection and hence more common causal variants. SNP association testing would have superior power to IBD testing if one common causal variant dominated the effect of the gene on the disease.

Table 2

Simulated power results: Large population size

s	No. of cases	No. of controls	Power assoc.	Power IBD25	Power IBD100	Assoc. vs. IBD25	Assoc. vs. IBD100
0.0005	500	500	0.87	0.57	0.81	assoc.	NS
0.001	500	500	0.65	0.53	0.81	NS	IBD
0.002	1000	1000	0.53	0.87	0.93	IBD	IBD
0.005	3000	3000	0.47	0.90	0.84	IBD	IBD

From an equilibrium population size of N = 10,000, the population was expanded to a recent effective size of Nrecent = 100,000. The selection coefficient, s, used in simulating the equilibrium population is given in the first column. The second and third columns give the sample sizes. The fourth column gives the estimated power of the SNP association test with G = 25 generations at the recent effective population size; the power of the SNP association test with G = 100 generations was not significantly different (data not shown). The fifth and sixth columns give the estimated power of pairwise IBD tests with IBD determined from G = 25 and G = 100 generations at the recent effective population size, respectively. All power estimates are from 100 replicates; the standard errors are 0.03–0.05. The seventh column states whether the SNP association test or IBD test with G = 25 is more powerful if the difference is significant (two-sided paired t-test P < 0.05) or NS (nonsignificant) otherwise. Similarly the eighth column compares the SNP association test and IBD test with G = 100. By comparing results with G = 25 and G = 100 generations as the IBD detection cut-off we can see the effects of increased marker density on the power of IBD mapping. The G = 25 results correspond roughly to IBD detection with 500,000 common SNPs genome-wide. We do not yet have good data on the extent to which IBD detection power will improve with increased marker density and inclusion of lower frequency variants. The G = 100 results may correspond to very-high-density SNP data. The results in Table 2 show that improved IBD detection due to increased marker density has the potential to significantly expand the range of scenarios in which IBD mapping has more power than standard single-marker association testing. However, as noted in the introduction, IBD mapping cannot improve upon appropriate association testing (which may consist of rare-variant aggregation testing) in high-quality sequence data. Table 3 shows results for a bottleneck that occurred 25 generations ago, when we can detect IBD arising from shared ancestry within the past G = 25 generations. The IBD test is more powerful than the SNP association test for all values of selection considered in this recent bottleneck scenario. Unless a causal variant is very common, it will tend to be represented by only a very small number of initial haplotypes and thus the IBD test has good power. On the other hand, during the time that the population is small, some haplotypes are lost and thus linkage disequilibrium increases between common SNPs and remaining haplotypes. This increases the power of the SNP association test, although not to a large enough extent to match the IBD test.

Table 3

Simulated power results: Small population size, very recent bottleneck

s	No. of cases	No. of controls	Power assoc.	Power IBD25	Assoc. vs. IBD25
0.0005	200	200	0.53	0.64	IBD
0.001	400	400	0.60	0.73	IBD
0.002	400	600	0.51	0.60	NS
0.005	400	1000	0.33	0.46	IBD

From an equilibrium population size of N = 10,000, the population was contracted 25 generations ago to a recent effective size of Nrecent = 1000. The selection coefficient, s, used in simulating the equilibrium population is given in the first column. The second and third columns give the sample sizes. The fourth column gives the estimated power of the SNP association test, while the fifth column gives the estimated power of pairwise IBD test with IBD determined from the final G = 25 generations. All power estimates are from 100 replicates; the standard errors are 0.04–0.05. The sixth column states whether the SNP association test or IBD test is more powerful if the difference is significant (two-sided paired t-test P < 0.05) or NS (nonsignificant) otherwise. The situation in which the change in population size occurs at exactly the point in time at which the IBD reckoning begins is unrealistic. For the large population size, increasing the number of generations for which the population had the larger size to say 125 has little effect on either test because this is not enough time to add new haplotypes, and haplotypes do not tend to get lost in large populations over such relatively short timescales. On the other hand, for the small recent population size, increasing the amount of time over which the population is small has little effect on the power of the IBD test but significantly increases the power of the SNP association test because linkage disequilibrium is increased. Table 4 shows results with 100 additional generations (125 generations total) at size Nrecent = 1000. Selection was applied during the forward simulation for these simulations. The power of the SNP association test is higher than that with the recent bottleneck, while the change in the power of the IBD test is not significant. As a result, the SNP association test is more powerful than the IBD test, in contrast to the recent bottleneck.

Table 4

Simulated power results: Small population size, older bottleneck

s	No. of cases	No. of controls	Power assoc.	Power IBD25	Assoc. v. IBD25
0.0005	200	200	0.71	0.55	Assoc.
0.001	400	400	0.76	0.67	NS
0.002	400	600	0.76	0.57	Assoc.
0.005	400	1000	0.73	0.51	Assoc.

From an equilibrium population size of N = 10,000, the population was contracted 125 generations ago to a recent effective size of Nrecent = 1000. The selection coefficient, s, used in simulating the equilibrium population is given in the first column. The second and third columns give the sample sizes. The fourth column gives the estimated power of the SNP association test, while the fifth column gives the estimated power of pairwise IBD tests with IBD determined from the final G = 25 generations. All power estimates are from 100 replicates; the standard errors are 0.04–0.05. The sixth column states whether the SNP association or IBD test is more powerful if the difference is significant (two-sided paired t-test P < 0.05) or NS (nonsignificant) otherwise.

Analysis of Wellcome Trust Case Control Consortium (WTCCC) type 1 diabetes data

We analyzed the Wellcome Trust Case Control Consortium (WTCCC) type 1 diabetes data (Wellcome Trust Case Control Consortium 2007) to determine whether we could detect signals with IBD mapping that could not be detected with SNP association testing in these data. This data set is fairly large and homogeneous, with approximately 3000 controls and 2000 cases, all of whom are of European ancestry from the United Kingdom. Type 1 diabetes is a relatively good candidate for IBD mapping: it has relatively low prevalence (≈0.4%; Mehers and Gillespie 2008), and high heritability (≈88% heritability on the liability scale; Hyttinen ). The major genetic contribution comes from the HLA (human leukocyte antigen) region, but other genes have also been implicated through association studies (Barrett ). Multiple rare variants in one gene, IFIH1 (interferon induced with helicase C domain 1), are associated with the disease (Nejentsev ). However, the IFIH1 variants are protective, which will tend to increase IBD in control/control pairs, while our IBD test looks for increased IBD in case/case pairs. Before performing the analysis, we called the genotypes from the signal intensity data using BeagleCall, as described previously (Browning and Yu 2009). Highly accurate genotypes are critical for detecting IBD, and BeagleCall utilizes linkage disequilibrium information to significantly improve the genotype accuracy. Stringent quality control filters were also applied to the markers during the calling process. In total, 458,204 autosomal SNPs were analyzed in 1963 cases and 2938 controls. The fastIBD module of Beagle 3.3.1 (Browning and Browning 2011) was used to detect IBD segments. We combined results from 10 runs of the program and used a threshold on the fastIBD score of 10−10. In total, 4.1 million segments were found, or 0.42 per pair of individuals (0.1 per pair of haplotypes) over the autosome. The average number of SNPs covered by a segment was 464, while the average segment length was 3.2 cM (genetic distances obtained by interpolation from HapMap estimates; McVean ). Figure 2 shows the distribution of lengths of detected segments. The genome average proportion of pairs IBD at a given location was 0.0003441 for non-case/case pairs and 0.0003447 for case/case pairs.

Figure 2

Distribution of lengths of detected IBD segments in the WTCCC type 1 diabetes data. IBD segments were detected using BEAGLE fastIBD. Lengths greater than 8 cM are not shown.

Distribution of lengths of detected IBD segments in the WTCCC type 1 diabetes data. IBD segments were detected using BEAGLE fastIBD. Lengths greater than 8 cM are not shown. To perform the IBD test, we calculated the difference in IBD proportions between case/case pairs and non-case/case pairs and compared this difference to that obtained from 5 million permutations of case-control status. We calculated this difference and the corresponding permutation P-value at every tenth SNP along the autosomes. In addition, we calculated permutation P-values genome-wide for 1000 permutations of case-control status, which allows us to determine the correct multiple-testing adjustment. The fifth percentile of the distribution of minimum P-value over the autosomes for the permuted data were 5.8 × 10−6. This may be compared to the approximate multiple-testing adjusted thresholds calculated in the Appendix: 2 × 10−6 for G = 25 (corresponding to the approximate resolution of detection of 2 cM), which equates to a 2.2% genome-wide significance level in these data; or 4 × 10−6 for G = 15 (corresponding to the mean detected IBD segment length of 3.2 cM), which equates to a 3.4% genome-wide significance level in these data. Thus, use of the approximate genome-wide P-value thresholds calculated in the Appendix would be somewhat conservative in these data. Figure 3 shows the unadjusted P-values. Because of the limited number of permutations, the smallest achievable P-value is 2 × 10−7 = 1/(5 × 106). The HLA region is clearly significant; however, this is not surprising given the extremely strong signal in this region. A region on chromosome 2 is almost significant (genome-wide adjusted P-value 0.20). A recent review (Baker and Steck 2011) lists two known associations with type 1 diabetes on chromosome 2. These are IFIH1 at 2q24.2 and CTLA4 (cytotoxic T lymphocyte associated antigen 4) at 2q33.2. The closer of these is IFIH1, which is 103 Mb away from IBD signal. The closest gene to the IBD signal is BCL11A (B-cell CCL/lymphoma 11A), which is 1.0 Mb away from the location of the smallest P-value on chromosome 2. BCL11A has been suggestively associated with type 2 diabetes (Zeggini ) and affects pancreatic β-cell function (Simonis-Bik ).

Figure 3

Permutation P-values for the IBD test in the WTCCC type 1 diabetes data. P-values were calculated at every tenth marker along the autosomes. The smallest possible P-value from the 5,000,000 permutations (2 × 10−7) is shown by the black horizontal line. The genome-wide significance level determined by 1000 permutations (6 × 10−6) is shown by the blue horizontal line. In the original SNP association analysis of these data (Wellcome Trust Case Control Consortium 2007), four loci were significant at a genome-wide significance threshold of 5 × 10−8. These included the HLA region, with a P-value of 2 × 10−134. In our IBD mapping analysis, the HLA region achieved the smallest P-value possible with the limited number of permutations performed (P = 2 × 10−7). Computational constraints preclude performing further permutations to determine how small a P-value can be obtained with IBD mapping in this region. Nevertheless, it is the achievement of genome-wide significance, rather than the actual P-value, that is important. Both IBD mapping and SNP association testing achieved genome-wide significance in the HLA region. However, SNP association testing was able to find genome-wide significant association at further loci, while IBD testing did not. Thus, SNP association testing found more significant results than the IBD testing in these data.

Discussion

IBD mapping has potential advantages over standard SNP association testing for detecting associations with rare variants using SNP array data, particularly when multiple rare variants are clustered within a gene. We showed through simulations that there are scenarios in which IBD mapping has higher power than SNP association testing. In large outbred populations, IBD mapping can have higher power when the level of selection against causal variants is relatively high, so that most causal variants remaining in the population are of recent origin. Because such variants are rare, either they must have strong effect sizes or there must be multiple causal variants clustered within a gene to have power to detect the association with a reasonably sized sample. Our simulations also showed that IBD mapping has higher power than SNP association testing in founder populations provided that the founding event was very recent. If the number of generations since the founding event is approximately the same as the number of generations G for which shared ancestry can be detected, IBD mapping has higher power. On the other hand, if the founding event was further back, the SNP association test tends to have higher power because the additional generations of low population size significantly increase the level of linkage disequilibrium in the population. The multiple testing correction for IBD mapping is lower than that for SNP association testing, and we derived genome-wide significance levels for IBD mapping which we used in our simulations and which could be used in analysis of real data if permutation-based adjustment is not feasible. The precise multiple testing adjustment depends on the sensitivity of IBD detection, which depends on the IBD detection method and on the characteristics of the data set, such as the SNP density. In the type I diabetes WTCCC data, the correct genome-wide significance threshold based on permutation was higher (less stringent) than the theoretical value that we had derived. Inference of IBD in real data are influenced primarily by the length of the IBD segment, while in our theoretical work IBD detection is based on depth of coancestry. A significant proportion of IBD segments due to common ancestry G = 25 generations ago will be shorter than 2 cM, and in real data many such segments will not be detectable. The presence of very short IBD segments in the theoretical data leads to a faster rate of transitions in the IBD process, and hence to the need for a greater multiple testing correction, which may explain the difference between the theoretical and real data results. In the WTCCC type 1 diabetes data we were able to detect an association with the HLA region using IBD mapping. However, given the huge effect of the HLA region, which is easily detected through SNP association tests, this was not an outstanding achievement. Overall, while it is possible that IBD mapping will be useful in some circumstances, it seems doubtful that it will be worthwhile to routinely apply this approach. The best scenario for IBD mapping is a disease that has allelic heterogeneity (multiple causal variants within a gene, to provide an advantage relative to single-marker association testing), with low frequency causal variants (due to negative selection, so that shared causal variants are likely to be of recent origin and hence detectably IBD) and high heritability (high effect sizes, for reasonable power with moderate sample sizes). Our simulation results showed that IBD mapping can have higher power than association mapping in the case of moderately strong negative selection and allelic heterogeneity. Diseases with very high heritability are best suited to large-family studies, because ascertainment of large families increases allelic homogeneity (within families) and reduces the incidence of sporadic cases, thus increasing power (Wijsman and Amos 1997). However, when heritability is only moderately high, or when collection of family data are not practical, IBD mapping captures some of the advantages of the family-based approach. It may be that the advantages of IBD mapping would be realized for very high sample sizes. IBD mapping is targeted at rare variants, and, particularly for very rare variants, large sample sizes are needed to see each such variant more than once in the sample. Rare variants that occur only once in a sample do not contribute to the IBD mapping statistic. Unfortunately, one needs to have a fairly homogeneous sample for IBD mapping, which is likely to preclude increasing sample sizes to high levels. The existence of population structure within a sample can cause false-positive IBD detection, and can also result in large differences genome-wide in IBD rates between case/case pairs and non-case/case pairs. Such structure need not be continental-level differences but can be, for example, Wales vs. England (Browning and Browning 2011). Although genome-wide differences can be subtracted out of the analysis, localized differences due to differences in informativeness of the haplotypes present in the populations may remain and cause false-positive IBD mapping signals. As SNP array data become more dense, with arrays of several million SNPs, IBD segment detection will improve, and it will be possible to detect IBD due to more distant ancestry. This increases the range of scenarios over which IBD mapping can have good power, although such data will also have improved power for standard association testing. With high-quality sequence data, IBD mapping becomes irrelevant. This is because in the context of association testing (mapping), the usefulness of inferred IBD segments is to provide information about untyped variants in the vicinity of genotyped SNPs.

43 in total

1. Genomic control for association studies.

Authors: B Devlin; K Roeder
Journal: Biometrics Date: 1999-12 Impact factor: 2.571

2. Inferring coancestry in population samples in the presence of linkage disequilibrium.

Authors: M D Brown; C G Glazner; C Zheng; E A Thompson
Journal: Genetics Date: 2012-01-31 Impact factor: 4.562

3. Multilocus association mapping using variable-length Markov chains.

Authors: Sharon R Browning
Journal: Am J Hum Genet Date: 2006-04-07 Impact factor: 11.025

4. DASH: a method for identical-by-descent haplotype mapping uncovers association with recent variation.

Authors: Alexander Gusev; Eimear E Kenny; Jennifer K Lowe; Jaqueline Salit; Richa Saxena; Sekar Kathiresan; David M Altshuler; Jeffrey M Friedman; Jan L Breslow; Itsik Pe'er
Journal: Am J Hum Genet Date: 2011-05-27 Impact factor: 11.025

5. Genome screening by searching for shared segments: mapping a gene for benign recurrent intrahepatic cholestasis.

Authors: R H Houwen; S Baharloo; K Blankenship; P Raeymaekers; J Juyn; L A Sandkuijl; N B Freimer
Journal: Nat Genet Date: 1994-12 Impact factor: 38.330

6. The fine-scale structure of recombination rate variation in the human genome.

Authors: Gilean A T McVean; Simon R Myers; Sarah Hunt; Panos Deloukas; David R Bentley; Peter Donnelly
Journal: Science Date: 2004-04-23 Impact factor: 47.728

7. Avoiding the high Bonferroni penalty in genome-wide association studies.

Authors: Xiaoyi Gao; Lewis C Becker; Diane M Becker; Joshua D Starmer; Michael A Province
Journal: Genet Epidemiol Date: 2010-01 Impact factor: 2.135

8. Detection of sharing by descent, long-range phasing and haplotype imputation.

Authors: Augustine Kong; Gisli Masson; Michael L Frigge; Arnaldur Gylfason; Pasha Zusmanovich; Gudmar Thorleifsson; Pall I Olason; Andres Ingason; Stacy Steinberg; Thorunn Rafnar; Patrick Sulem; Magali Mouy; Frosti Jonsson; Unnur Thorsteinsdottir; Daniel F Gudbjartsson; Hreinn Stefansson; Kari Stefansson
Journal: Nat Genet Date: 2008-09 Impact factor: 38.330

9. Genome-wide association studies in an isolated founder population from the Pacific Island of Kosrae.

Authors: Jennifer K Lowe; Julian B Maller; Itsik Pe'er; Benjamin M Neale; Jacqueline Salit; Eimear E Kenny; Jessica L Shea; Ralph Burkhardt; J Gustav Smith; Weizhen Ji; Martha Noel; Jia Nee Foo; Maude L Blundell; Vita Skilling; Laura Garcia; Marcia L Sullivan; Heather E Lee; Anna Labek; Hope Ferdowsian; Steven B Auerbach; Richard P Lifton; Christopher Newton-Cheh; Jan L Breslow; Markus Stoffel; Mark J Daly; David M Altshuler; Jeffrey M Friedman
Journal: PLoS Genet Date: 2009-02-06 Impact factor: 5.917

10. Gene variants in the novel type 2 diabetes loci CDC123/CAMK1D, THADA, ADAMTS9, BCL11A, and MTNR1B affect different aspects of pancreatic beta-cell function.

Authors: Annemarie M Simonis-Bik; Giel Nijpels; Timon W van Haeften; Jeanine J Houwing-Duistermaat; Dorret I Boomsma; Erwin Reiling; Els C van Hove; Michaela Diamant; Mark H H Kramer; Robert J Heine; J Antonie Maassen; P Eline Slagboom; Gonneke Willemsen; Jacqueline M Dekker; Elisabeth M Eekhoff; Eco J de Geus; Leen M 't Hart
Journal: Diabetes Date: 2009-10-15 Impact factor: 9.461

53 in total

1. Efficient clustering of identity-by-descent between multiple individuals.

Authors: Yu Qian; Brian L Browning; Sharon R Browning
Journal: Bioinformatics Date: 2013-12-19 Impact factor: 6.937

Review 2. Identity by descent: variation in meiosis, across genomes, and in populations.

Authors: Elizabeth A Thompson
Journal: Genetics Date: 2013-06 Impact factor: 4.562

3. Fast pairwise IBD association testing in genome-wide association studies.

Authors: Buhm Han; Eun Yong Kang; Soumya Raychaudhuri; Paul I W de Bakker; Eleazar Eskin
Journal: Bioinformatics Date: 2013-10-24 Impact factor: 6.937

4. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data.

Authors: Ying Zhou; Sharon R Browning; Brian L Browning
Journal: Am J Hum Genet Date: 2020-03-12 Impact factor: 11.025

5. The variance of identity-by-descent sharing in the Wright-Fisher model.

Authors: Shai Carmi; Pier Francesco Palamara; Vladimir Vacic; Todd Lencz; Ariel Darvasi; Itsik Pe'er
Journal: Genetics Date: 2012-12-24 Impact factor: 4.562

6. Inference of identity by descent in population isolates and optimal sequencing studies.

Authors: Dominik Glodzik; Pau Navarro; Veronique Vitart; Caroline Hayward; Ruth McQuillan; Sarah H Wild; Malcolm G Dunlop; Igor Rudan; Harry Campbell; Chris Haley; Alan F Wright; James F Wilson; Paul McKeigue
Journal: Eur J Hum Genet Date: 2013-01-30 Impact factor: 4.246