Mary D Fortune1, Hui Guo2, Oliver Burren1, Ellen Schofield1, Neil M Walker1, Maria Ban3, Stephen J Sawcer3, John Bowes4, Jane Worthington4, Anne Barton4, Steve Eyre4, John A Todd1, Chris Wallace5. 1. JDRF/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, National Institute for Health Research Cambridge Biomedical Research Centre, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK. 2. 1] JDRF/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, National Institute for Health Research Cambridge Biomedical Research Centre, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK. [2] Centre for Biostatistics, Institute of Population Health, University of Manchester, Manchester, UK. 3. University Neurology Unit, Addenbrooke's Hospital, Cambridge, UK. 4. 1] Arthritis Research UK Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Institute of Inflammation and Repair, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK. [2] National Institute for Health Research Manchester Musculoskeletal Biomedical Research Unit, Central Manchester Foundation Trust, Manchester Academic Health Science Centre, Manchester, UK. 5. 1] JDRF/Wellcome Trust Diabetes and Inflammation Laboratory, Department of Medical Genetics, National Institute for Health Research Cambridge Biomedical Research Centre, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK. [2] Medical Research Council Biostatistics Unit, Cambridge Institute of Public Health, Cambridge, UK.
Abstract
Determining whether potential causal variants for related diseases are shared can identify overlapping etiologies of multifactorial disorders. Colocalization methods disentangle shared and distinct causal variants. However, existing approaches require independent data sets. Here we extend two colocalization methods to allow for the shared-control design commonly used in comparison of genome-wide association study results across diseases. Our analysis of four autoimmune diseases--type 1 diabetes (T1D), rheumatoid arthritis, celiac disease and multiple sclerosis--identified 90 regions that were associated with at least one disease, 33 (37%) of which were associated with 2 or more disorders. Nevertheless, for 14 of these 33 shared regions, there was evidence that the causal variants differed. We identified new disease associations in 11 regions previously associated with one or more of the other 3 disorders. Four of eight T1D-specific regions contained known type 2 diabetes (T2D) candidate genes (COBL, GLIS3, RNLS and BCAR1), suggesting a shared cellular etiology.
Determining whether potential causal variants for related diseases are shared can identify overlapping etiologies of multifactorial disorders. Colocalization methods disentangle shared and distinct causal variants. However, existing approaches require independent data sets. Here we extend two colocalization methods to allow for the shared-control design commonly used in comparison of genome-wide association study results across diseases. Our analysis of four autoimmune diseases--type 1 diabetes (T1D), rheumatoid arthritis, celiac disease and multiple sclerosis--identified 90 regions that were associated with at least one disease, 33 (37%) of which were associated with 2 or more disorders. Nevertheless, for 14 of these 33 shared regions, there was evidence that the causal variants differed. We identified new disease associations in 11 regions previously associated with one or more of the other 3 disorders. Four of eight T1D-specific regions contained known type 2 diabetes (T2D) candidate genes (COBL, GLIS3, RNLS and BCAR1), suggesting a shared cellular etiology.
Overlaps of genetic association to different diseases have been widely observed, and are thought to reflect shared etiology between diseases.[1] However, showing that a variant is associated with two traits does not demonstrate that it is causal for both: this may be due to distinct variants in linkage disequilibrium.[2] Colocalization analyses are used to study whether potential causal variants are shared by combining information across multiple single nucleotide polymorphisms (SNPs) in a region. The proportional approach[3] tests a null hypothesis of proportionality under which, if causal variants are shared, we expect to see that the effects of any set of SNPs on the two diseases are proportional to each other. A weakness of this approach is interpretation. Failure to reject the null hypothesis does not only imply colocalization, but could also be caused by either disease being not associated, or by insufficient power owing to too few samples analysed and/or an incomplete genetic map[4] (Supplementary Fig. 1). We have no way of measuring how likely colocalization is. A strength is that no assumptions are made about the number of causal variants: the null hypothesis corresponds to complete sharing across all causal variants. An alternative is to use a Bayesian framework,[5] to generate posterior probabilities for colocalization and distinct causal variants, as competing hypotheses. However, a weakness of this approach, as currently developed, is that it assumes only a single causal variant for each trait within any region.Existing colocalization methods require that genetic association with the two traits of interest has been tested in distinct samples. However, this requirement restricts the applicability of the approach to related diseases since each set of case samples must have a corresponding distinct set of control samples, enabling a logistic binomial model to be used independently upon each disease. In contrast, many studies use a common set of controls for different diseases to increase efficiency. Here, we extend both colocalization methods to allow for the use of multinomial logistic regression, the natural model for shared controls.Previous studies have identified many regions associated with multiple autoimmune or autoinflammatory diseases, including type 1 diabetes (T1D) and celiac disease (CEL).[6,1] Such multi–disease association led to the development of the ImmunoChip,[7] a custom genotyping chip with 196,000 SNPs designed to densely cover 186 regions known to associate with at least one immune disease on the basis of GWAS p–value < 10−8. The ImmunoChip consortium genotyped a common control set, to which some disease groups added their own controls. We applied our extended methods to ImmunoChip raw genotyping data for a total of 36,030 samples, including one set of controls and four disease cohorts, in order to better understand the extent of shared genetic etiology in these diseases.
Results
Overview of Method
The Bayesian method derives the posterior support for each of five hypotheses describing the possible association of the region with both diseases. Of greatest interest are:: Both diseases are associated with the region, with different causal variants.: Both diseases are associated with the region, and share a single causal variant.Association with both traits corresponds to or ; colocalization corresponds to . This method requires specification of prior probabilities for each hypothesis. We calibrated priors to match our expectations that about 50% of regions associated with two immune–mediated diseases correspond to a shared causal variant (Supplementary Fig. 2), which is close to the proportion found in a manually curated summary of association to six immune–mediated diseases[8] (58%). For rheumatoid arthritis (RA)[9] and multiple sclerosis (MS),[10] for which only UK subsets of international cohorts were analyzed, we modified priors in regions with published associations to reflect this additional information from the published papers. Where a region was annotated in ImmunoBase as associated with RA or MS, we shrunk our priors for hypotheses corresponding to no association for the disease close towards 0, and increased our priors for the remaining hypotheses (Supplementary Methods).One hundred and twenty six ImmunoChip regions assigned to at least one of the diseases (based upon knowledge when the chip was designed or identified in subsequent papers and curated in ImmunoBase, accessed 12/11/13) were analyzed using both approaches for all six pairwise comparisons of the four diseases. Sample and SNP QC is described in the Methods; we excluded low frequency variants (MAF< 1%) to reduce the number of models to be considered and because genotyping errors are more common amongst this group of SNPs, and we did not have cluster plots available for all diseases. Although GWAS studies typically have sufficient power to detect association only with more common SNPs, some rarer variants (for example, in TYK2[11]) have been reported with these diseases which will be missed in our analysis.
Overview of Results
The Bayesian approach assumes a single causal variant per trait in any region. To allow for multiple causal variants, we used a stepwise method. In the overwhelming majority of cases (740 out of 756 pairwise comparisons, or 98%) data were consistent with at most one causal variant per trait in a region. Ninety of the 126 regions (71%) showed association with at least one disease; in 33 regions, the association was shared between at least two diseases (Fig 1). Complete results are given in Supplementary Table 1, Supplementary Table 2 and Supplementary Table 3. For fifty–seven regions, the greatest support was for association with precisely one of the four diseases; in 21 cases, we know of no other immune–mediated diseases that have reported association to these regions and therefore hypothesize these may be disease specific among autoimmune diseases (Table 1).
Figure 1
A Venn diagram showing summary of disease assignments to 90 regions which showed association to at least one disease, based upon the results of the Bayesian analysis. In cases where assignment was uncertain, the assignment most supported by the posterior probabilities was used. The numbers in brackets correspond to how many of these regions show evidence of distinct causal variants. Thirty–six regions analyzed did not demonstrate association to any disease within our available data, and so are not included in this figure.
Table 1
Twenty–one regions which are most likely disease specific under our analysis and for which we know of no other immune–mediated diseases (from the 15 diseases curated in ImmunoBase) that have reported association to these regions (as curated in ImmunoBase, accessed July 9th 2014, and NIHR GWAS catalog, accessed 07/10/2014). Regions required posterior probability of single disease association > 0.5 in at least one pairwise analysis (SNP coverage varies between analyses) and posterior probability of association to any other disorder < 0.2. Candidate causal genes are given. In the case where no candidate causal genes are known, we have given, in brackets, the genes in and around the region.
Chromosome
Position
Disease Association
Posterior Probability of Single Association
Candidate Causal Gene(s)/ Genes in Region
1p22.1
92023171–93311800
MS
1.00
EVI5
1p21.2
100982239–101455699
MS
0.57
EXTL2 VCAM1 SLC30A7
1p13.1
116831830–116911865
MS
1.00
CD58
3p24.1
28015774–28105476
MS
0.99
(CMC1)
3q13.33
122818149–123329522
MS
1.00
IQCB1 SLC15A2 CD86
5q21.1
102062861–102777130
RA
0.58
C5orf30
6q23.3
137348296–137587799
MS
1.00
IL22RA2
7p12.2
50337180–50662811
T1D
0.97
3′ IKZF1* region
7p12.2
50866661–51640000
T1D
1.00
COBL
8q21.12
79575897–79914680
MS
1.00
ZC2HC1A
8q24.21
129187117–129368419
MS
0.51
PVT1 MIR1208
9p24.2
4218549–4311558
T1D
1.00
GLIS3
10q23.31
89998026–90268360
T1D
0.87
RNLS
11p15.5
2024999–2264880
T1D
1.00
INS
12q24.31
121926103–122574026
MS
0.59
PITPNM2
14q32.2
100357783–100398492
T1D
0.98
DLK1
16q23.1
73760230–74086012
T1D
1.00
BCAR1
19p13.3
6564831–6636304
MS
1.00
TNFSF14
19p13.11
16300497–16612240
MS
1.00
(EPS15L1, CALR3, MED26, C19orf44, CHERP, SLC35E1)
19p13.11
17905598–18272802
MS
0.87
MPV17L2 IFI30
20p13
1444472–1707590
T1D
0.99
(SIRPD, SIRPB1, SIRPG)
There are two ImmunoChip regions which overlap IKZF1 and are separated by a recombination hotspot. The region towards the 5′ end has colocalizing associations with MS and T1D while the region towards the 3′ end appears specific to T1D, as shown in Supplementary Figure 7. Note we provide coordinates of the region, and not an index SNP as is conventional in gwas studies because the method synthesises information across the whole region and does not, in most cases, highlight a single SNP responsible for the association.
In the Bayesian approach, when the posterior probability of a hypothesis is close to 0.5, assignment cannot be made with confidence to any single hypothesis. However, in the 30 instances in which both diseases showed very strong evidence of association , the Bayesian and proportional approaches produced consistent results. For these 30 cases, the proportional null was rejected only in cases in which the Bayesian analysis favored , and not rejected in cases where was favored. Focusing on these, the data strongly supported that the same causal variants underlie all diseases in ten cases, while seven showed strong evidence for distinct variants, suggesting that just under half, 42%, of overlapping association signals reflect distinct causal variants. In total, fourteen regions showed evidence of separate SNP effects , see Table 2).
Table 2
Fourteen regions showing evidence of separate SNP effects . D corresponds to T1D, R to RA, C to CEL and M to MS. Candidate causal genes are as associated across all curated diseases by ImmunoBase. Distinct signals are indicated by ‘— ’. Many of these regions are associated with other diseases (see ImmunoBase). For instance, the 2q32.2 region is additionally associated with Ulcerative Colitis, Crohn’s Disease, Primary Biliary Cirrhosis, Systemic Lupus Erythematosus and Juvenile Idiopathic Arthritis. The 6q23.3 region is additionally associated with Ulcerative Colitis, Systemic Lupus Erythematosus and Psoriasis. Note that in some regions, such as 10p15.1, the conditional analysis supports the existence of multiple associated variants: if none of these overlap, then we consider the region to have separate SNP effects. Note we provide coordinates of the region, and not an index SNP as is conventional in gwas studies because the method synthesises information across the whole region and does not, in most cases, highlight a single SNP responsible for the association.
Chromosome
Position
Associations
Evidence
Candidate Causal Gene
2p16.1
60722116–61952276
C—M
CM:H3 ~0.65
REL
2q32.2
191412527–191739472
RC—M
RM:H3 ~0.51
STAT1 STAT4
2q33.1
202920548–204528303
D—C—R
DR:H3 ~0.98 RC:H3 ~0.91
CD28 CTLA4 ICOS
3p21.31
45812888–46633741
D—C
DC:H3 ~0.92
CCR3 CCR1 CCR5
3q25.33
160950948–161389020
C—M
CM:H3 ~0.96
IL12A
4q27
123121079–124497235
D—C
DC:H3 ~1.00
IL2 IL21
6q23.3
137914792–138345363
DRC—M
RM:H3 ~0.75 CM:H3 ~0.85
TNFAIP3
10p15.1
6068495–6237542
D—M
DM:H3 ~1.00
IL2RA
11q23.3
117805448–118403529
C—M
CM:H3 ~0.82
CXCR5
13q32.3
98723872–99034738
D—C
DC:H3 ~0.67
GPR183
16p13.13
10831557–11408130
DM—C
DC:H3 ~0.51
DEXI SOCS1
18p11.21
12407903–12919721
D—C
DC:H3 ~0.58
PTPN2
19p13.2
10081000–11019034
DRM—C
DC:H3 ~0.53
ICAM1 ICAM3 TYK2
21q22.3
42681877–42771181
D—R—C
DR:H3 ~0.77 DC:H3 ~0.99 RC:H3 ~0.69
UBASH3A
Disentangling Patterns of Association
For colocalized disease regions, the two diseases generally have consistent directions of effect (Fig 2) with the exception of the 6q25.3 region containing candidate gene TAGAP, which is associated in our analysis with CEL and MS only: the risk allele for CEL is protective for MS and vice versa (Supplementary Fig. 3). This opposing effect of TAGAP alleles has been previously described for T1D and CEL,[6] although the region did not provide sufficient evidence for association with T1D in the data available to us. A similar effect for the 2q12.1 region containing candidate gene IL18RAP has also been reported.[6] However, later data[12] have not offered support for T1D association to 2q12.1, and, in our analysis, the posterior support is concentrated on CEL association alone.
Figure 2
The distribution of , the estimated proportionality coefficient, together with its 95% confidence interval. In the case of colocalization, η is the ratio of the effects the region exert upon the two traits. ∣η∣ > 1 corresponds to a stronger effect in Trait 2 than Trait 1. We estimate η by . Labels on the x–axis give the traits and regions analyzed; D for T1D, R for RA, C for CEL and M for MS. Note that in some regions, the conditional analysis supports the existence of multiple associated variants: if none of these overlap, then we consider the region to have separate SNP effects.
(a) Regions with novel evidence of disease association, in which we believe there to be colocalization present between the novel association and at least one of the existing associations. Regions have been ordered such that estimates the effect size for the novel trait divided by the effect size for the known association. Labels give the novel association being given first. It can be seen that the effect size tends to be smaller in the new disease.
(b) Regions with strong evidence of colocalization . As we would expect, is distributed about 1, which corresponds to the regions having equal effects on each trait. Note that 6q25.3, containing the candidate causal gene TAGAP, has , indicating opposite effects on the two diseases. Trait 1 is listed first, and trait 2 second.
Patterns of association with multiple diseases can be complex. In the 2q33 region containing established candidate gene CTLA4, as well as the equally strong functional candidate genes, CD28 and ICOS, three potential causal variants appear to be partially shared between T1D, RA and CEL. The strongest association with T1D is at rs3087243 (which has previously been called CT60), while the strongest association with CEL is with rs231775 (which alters the amino acid at position 17 of CTLA4, Ala17Thr, and has previously been called CT42). The two SNPs have r2 = 0.5, and haplotype analysis has previously suggested CT60 and not CT42 is causal for Graves’ disease.[13] For RA, the strongest single SNP signal is at rs1980422, which is not in LD with either CT42 or CT60 (r2 < 0.1). We fitted the 512 possible standard multinomial models involving these three SNPs for the three diseases, and computed approximate Bayes factors for each. Assuming each model to be equally likely a priori, the model with highest posterior probability has rs1980422/rs3087243 (CT60) signals for CEL and rs231775 (CT42)/rs1980422 for both T1D and RA, although whilst rs231775 (CT42) is the strongest effect for T1D, rs1980422 is strongest for RA (Fig. 3). We note that our analysis is based on SNPs selected through a stepwise process and that without fine mapping analysis we cannot claim that any one of these models correctly reflects the causal variants for any disease. These results do, however, clearly illustrate the different patterns of association with the three disorders and emphasize the potential complexity that can arise in regions of multiple association signals. They motivate the future extension of the colocalization approach developed here to allow model search strategies that do not require stepwise assumptions.
Figure 3
The 2q33.1 region containing the candidate gene CTLA4. Three potential causal variants are partially shared between T1D, RA and CEL.
(a) A Manhattan plot of the region. The blue signal corresponds to the tag rs231775, the green to rs1980422 and the magenta to rs3087243. All other SNPs are colored according to their linkage disequilibrium with these three SNPs. SNPs rs231775 and rs3087243 have r2 = 0.50; all other pairwise r2 < 0.1.
(b) Each possible model involving these three SNPs was tested; the four models with highest posterior probabilities, which together encompass over 90% of the total posterior probability, are shown.
(c) Effect size estimates (including 95% confidence intervals) of each SNP for each disease for the most likely model.
(d) Effect size estimates (including 95% confidence intervals) of each SNP for each disease for the second most likely model.
Discovery of Novel Associations
Two regions were associated with all four diseases (Fig. 1). One was the 6q23.3 region containing candidate gene TNFAIP3, known to be associated with RA and CEL. There has been some published evidence that T1D is associated with this region,[14] although not at genome–wide significant levels. Our results identify a T1D signal, colocalized with that for RA and CEL, suggesting a single shared causal variant affecting the three diseases. There is also evidence of MS association, driven by a distinct causal variant (in the CEL–MS analysis, , Fig. 4).
Figure 4
The 6q23.3 region containing candidate causal gene TNFAIP3. Our results show that T1D, RA and CEL all colocalize, suggesting a single shared causal variant affecting the three diseases; rs6933404 being the most likely SNP. There is also evidence of MS association, driven by a distinct causal variant. Note that this region was associated with MS at genome–wide significant levels in the analysis of the international MS dataset[10].
The second region was 19p13.2, known to be associated with T1D, RA and MS, containing the strong functional candidate gene TYK2, although immune adhesion genes ICAM1 and ICAM3 are also good candidate genes. Our analysis supports these associations, with a posterior probability of colocalization approaching 1. We also find evidence for a novel CEL association. In each of the pairwise analyses involving CEL, the probability of both diseases being associated ≃ 0.88, although this could be a distinct signal: we have (Supplementary Fig. 4). In total, 11 regions showed strong evidence of novel association with (Table 3).
Table 3
Eleven regions showing strong evidence of novel association for an analysis involving a previously non–associated trait. D corresponds to T1D, R to RA, C to CEL and M to MS. Novel associations are underlined and denoted by bold font. Candidate causal genes are as associated across all curated diseases by ImmunoBase. Note that in the case of TNFAIP3, there is strong evidence that MS is caused by a distinct causal variant compared to the other traits. Distinct signals are separated by a ‘— ’. Since we only have a subset of the genotype data, not all of the prior (previously published) associations are seen.
Chromosome
Position
Prior Associations
Associations Found
Posterior probability both diseases are associated P(H3orH4)
Posterior probability shared causal variant given joint association P(H4∣H3orH4)
Candidate Causal Genes/Genes in Region
1q24.3
170882016–171208336
C
DC
DC:0.75
DC:0.95
FASLG
2p14
65246601–65570598
R
RM
RM:0.86
RM:0.72
SPRED2
2q11.2
99883120–100415547
DR
DRC
DC:0.98 RC:1.00
DC:0.57 RC:0.90
AFF3
2q37.1
230758228–230962304
M
CM
CM:0.94
CM:0.90
SP140
5q11.2
55450712–55492884
RM
DRM
DR:0.71 DM:0.71
DR:1.00 DM:1.00
ANKRD55
6q23.3
137914792–138345363
RCM
DRC—M
DR:0.80 DC:0.77
DR:0.94 DC:0.93
TNFAIP3
7p14.2
37323488–37406978
CM
RCM
RC:0.80 RM:0.77
RC:0.84 RM:0.83
ELMO1
7p12.2
50222360–50335957
M
DM
DM:0.73
DM:0.70
5′ IKZF1* region
13q32.3
98723872–99034738
DM
D—C
DC:0.67
DC:0.00
GPR183
15q25.1
76773859–77050416
DM
DC
DC:0.82
DC:0.99
CTSH
19p13.2
10081000–11019034
DRM
DRM—C
DC:0.87 RC:0.87 CM:0.88
DC:0.40 RC:0.46 CM:0.57
ICAM1 ICAM3 TYK2
An association of T1D in a region 3′ of IKZF1, for which it is hypothesised that IKZF1 is the candidate causal gene is already known[33] (see Table 1). The novel association we report here is in a region 5′ of IKZF1, and independent of the established association. Note we provide coordinates of the region, and not an index SNP as is conventional in gwas studies because the method synthesises information across the whole region and does not, in most cases, highlight a single SNP responsible for the association.
In regions with colocalizing novel associations, effect sizes tended to be smaller in the new disease (Fig. 2). This could indicate that the stronger effect is in the previously known association, or it could be due to Winner’s Curse,[15] with the previously known associations displaying inflated effect size estimates. In general for colocalized signals, the coefficient of proportionality is centered about 1.One novel association found was in the chromosome 1q24.3 region, known to be associated with CEL and containing candidate gene FASLG. Pathway analysis also produced evidence for a T1D–associated variant here,[16] although no SNP has reached the genome–wide significance threshold. Our results support a shared causal variant for T1D and CEL (posterior probability 0.71). Our Bayesian approach also enables fine–mapping when dense genotyping data are available, as is the case here. We identified a single likely causal variant lying in a region with strong evidence of predicted regulatory activity, rs78037977 (Supplementary Fig. 5), with a posterior probability of being causal amongst all genotyped variants, given the colocalization hypothesis, of 0.99. Note that rs78037977 was removed from the CEL data in the original analysis[17] owing to failing a missingness check (the call rate of 99.942% was just below the 99.95% cut–off). Plots of the signal clouds for our samples at this SNP are given in Supplementary Figure 6. The clustering shown here is of good quality, implying that the rs78037977 genotype can be considered reliable.
Prior Sensitivity
We tested prior sensitivity by varying p12 (the probability that an arbitrary SNP is associated with both diseases) from p12 = 10−5 to 10−7, while keeping p1 and p2 (the probability that this SNP is associated with only trait 1, or only trait 2) constant at 10−4 (Supplementary Table 4). Whether a region is disease specific is largely unaffected by choice of p12 and, for the five regions discussed in detail in this paper (1q24.3/FASLG; 2q33.1/CTLA4; 6q23.3/TNFAIP3; 6q25.3/TAGAP and 19p13.2/TYK2), the prior does not change which diseases are associated. However, the posterior odds for does vary with p12. Under p12 = 10−7, neither 1q24.3/FASLG nor 6q23.3/TNFAIP3 had strong posterior support as a novel T1D region since the evidence for novel association in these regions comes about due to colocalization with the stronger previously known association. This dependence on prior belief is a strength of Bayesian methods, but they require that priors be carefully calibrated. Whilst our prior belief is that about 50% of regions associated with two immune–mediated diseases are likely to correspond to a shared causal variant, others may disagree. The results given in Supplementary Table 2 can be used to calculate the posterior under any alternative p12 using the formula given in Supplementary Material.
Discussion
Colocalization methods so far have allowed for the simultaneous analysis of only two traits: a potential weakness when considering more than two diseases, as investigated here. The Bayesian approach could be extended to arbitrarily many traits, at the cost of increased computational complexity and spreading the posterior over an exponentially increasing hypothesis space, potentially making it difficult to draw firm conclusions. Wen et al, in their description of an alternative method for partitioning the association of a single SNP amongst multiple related quantitative traits,[18] suggest dealing with this complexity by considering only the extremes – a SNP is associated to all traits, exactly one, or none. Such reduction is impractical when analyzing regions, since it does not allow for overlapping but distinct signals. Although we have extended our software to consider three diseases simultaneously, we have chosen for practical reasons to focus on pairwise analyses with manual curation of the 11 cases (9%) for which more than two diseases showed association.Giambartolomei et al[5] showed that inference is consistent when the causal variant is directly genotyped or well imputed. The decision was taken when the ImmunoChip was designed not to thin by LD, but instead target all SNPs and small indels known at that time in 1000 Genomes European samples and it has since been shown that common variants may be very accurately imputed using ImmunoChip.[19] Therefore we are likely to be very close to the situation where causal variants are directly genotyped. The application of our method to the less complete coverage provided by genome–wide SNP arrays would require an imputation step to allow consistent inference to be made. The Bayesian colocalization analysis assumes a single causal variant per region, which could be restrictive, and we addressed this using a stepwise approach, attempting to colocalize the individual signals for each disease where there was evidence for more than one. The agreement between our results with this approach and using the proportional colocalization approach which does not make this assumption confirms the appropriateness of the stepwise approach in the cases we consider.We identified 21 regions that appeared associated to only one autoimmune disease. One challenge in interpretation when defining disease unique signals is exemplified by a region on chromosome 7p12.2 which contains the candidate causal gene IKZF1. This gene overlaps two ImmunoChip regions separated by a recombination hotspot, one 5′ of IKZF1 and one 3′ of IKZF1. The 5′ region contains a colocalized signal for MS and T1D, whilst the 3′ region contains only a T1D signal (Supplementary Fig. 7). Our analysis has been based on regions, as defined in the design of the ImmunoChip and based on recombination hot spots. However, whilst the T1D signals in these regions are independent and the 3′ region of IKZF1 appears unique to T1D, it is plausible that the causal variants in both regions act through the same gene, IKZF1. Another challenge is to deal with the effects of power, given the established influence of sample size on power to detect associations.[20] Many of the regions in Table 1 contain genes linked to immune function, and we expect a number of apparent disease–specific results to associate with other diseases as sample sizes for each disease continue to increase. Indeed, the chromosome 19p13.11 region, associated only with MS in our analysis, has previously been associated with lymphocyte count,[21] with high LD between the peak MS SNP (rs1870071) and the lymphocyte count SNP (rs11878602, r2 = 0.99), suggesting an immune mechanism for the association.However, in the case of T1D, three disease–unique regions overlap known type 2 diabetes (T2D) regions. Chromosome 9p24.2, containing the candidate gene GLIS3, has been associated with T2D[22] and fasting glucose[23] with high LD between the peak SNP for T1D (rs10814914) and these other traits (rs7041847, r2 > 0.9). GLIS3 and its causal allele alter disease risk by altering pancreatic beta–cell function, probably by increasing beta–cell apoptosis.[24] Chromosome 16q23.1, containing the candidate gene BCAR1, is associated with T1D in our analysis and T2D,[22] and the T2D alleles in this region have been associated with reduced beta cell function,[25] again with high LD between the peak SNPs for T1D (rs8056814) and T2D (rs7202877, r2 = 0.81). Inspecting the distribution of T2D GWAS p values at the peak SNPs in our T1D associated regions (Supplementary Fig. 8), we note that the peak SNP in the T1D associated region 6q22.32, rs17754780, also shows association to T2D (p = 7.9 × 10−5) and is in tight LD with peak T2D SNP in the region (rs9385400, r2 = 0.97). This region has been reported as associated with T2D at genome–wide significance in a larger study.[26] Chromosome 6q22.3 is not uniquely associated to T1D in our analysis because it overlaps an established Crohn’s disease region,[27] but the lead Crohn’s SNP (rs9491697) is not in LD with the T1D SNP (r2 = 0.03). This is then likely to be a third shared signal between T1D and T2D. The nearest genes are MIR588 about which little appears to be known and CENPW (centromere protein W) which has no obvious functional candidacy. This genetic overlap between T1D and T2D (Supplementary Table 5) emphasizes that T1D results from an interaction between the immune system and beta cells, and it is probable that some of our other apparent disease unique regions will also prove to be specific to the target of autoimmune destruction in MS and RA.By analyzing regions known to associate with one disease, we were able to link 11 to additional disorders: in most cases (8/11) the novel disease association was clearly colocalized with a previously known signal, whilst in one case, GPR183, the evidence supported a distinct causal variant for the novel association. In others (3/11) the evidence for colocalization was more equivocal, even with evidence for pairwise association.In a standard GWAS analysis, a p–value significance threshold of 5 × 10−8 is used in absence of replication data, due to a desire to minimise reporting of false positive results, although a relaxation of this threshold has been suggested.[29] However, since autoimmune diseases are known to share etiology, conditioning upon association for one autoimmune disease, we should require a less stringent threshold to believe it significant for another. Indeed, whilst the question of whether the ImmunoChip significance threshold should be somewhat relaxed remains,[8] examination of p–values in the regions in which we observe novel associations (Supplementary Fig. 9) suggests that a threshold between 10−5 and 10−6 for SNPs that are confirmed index SNPs for another disease might be more appropriate. We estimated that 42% of overlapping and genome–wide significant immune–mediated disease signals relate to distinct causal variants. In these regions, therefore, there appear to be distinct causal variants for two or more autoimmune diseases which are physically proximal but in low LD. We suggest that physical proximity to a known associated variant in a related disease, and not only LD with it, may prove to be an appropriate criterion with which to alter interpretation of a small but not genome–wide significance threshold. Variants meeting such thresholds might be prioritised for genotyping in replication samples. We note, also, that the four diseases we studied are all characterized by the presence of autoantibodies. Had we included autoantibody negative diseases we might have found a higher proportion of discordant associations as reported in a previous manual curation of ImmunoChip studies,[8] given there remains considerable overlap in location of association signals. Although a careful and detailed manual curation of several studies has been conducted,[8] the ability of colocalization methods to distinguish shared from distinct causal variants allows clearer interpretation of genetic results.In summary, we have developed a methodology for examining shared genetic etiology between diseases in the case of common control datasets, extending previous work.[2,3] This enables the discovery of new disease associations and the exploration of complex association patterns. Although these methods have been presented in this paper to analyze autoimmune diseases, the prior is user defined, and could be used to analyze any pair of related diseases.
Online Methods
Samples
All samples included in this analysis were gathered in the United Kingdom, and have reported or self declared European ancestry. Informed consent was obtained from all subjects. Detailed summaries of the sample cohorts are given in the ImmunoChip papers for CEL,[17] RA,[9] MS[10] and T1D.[30] For the RA and MS cases, we used the subset of cases from the UK. Sample exclusions were applied as described in each paper, and in total, 6,691 T1D, 3,870 RA, 7,987 CEL, 5,112 MS and 12,370 control samples were analyzed. SNPs were filtered according to the following criteria: call rate > 0.99; minor allele frequency > 0.01; Hardy–Weinberg ∣Z∣ < 5. SNPs which passed these thresholds in controls and any specific pair of cases were used for that pairwise analysis.Using only UK cases and controls means that we expect any effects of population stratification to be very limited, as evidenced by the low genomic inflation factors in published UK ImmunoChip analyses[31] and we did not take any further specific actions to limit effects of population stratification.
Selection of Regions for Analysis
We considered all regions annotated in ImmunoBase (accessed on 12/11/13) as associated with at least one of our diseases. Where regions overlapped, we formed the union. Regions containing fewer than ten SNPs or with a SNP density < 1 SNP/kb were excluded. The MHC (chr6:29797978–33606563 hg18) was removed from the analysis, since this region is known to have complex multi–SNP effects. A full list of the 126 regions analyzed, together with our resulting associations, can be found in Supplementary Table 1.
Colocalization Analysis
Two colocalization methods were applied to each of the 126 regions (see Supplementary Fig. 1).
Bayesian Approach
The first approach is based upon a Bayesian approach proposed by Giambartolomei et al.[5] All models in which each trait is caused by at most one variant are considered, and approximate Bayes factors computed for each. Our extension follows the same framework, but, in order to extend this method to the case of a common control, a multinomial model was used. Bayes factors were computed using a Laplace approximation[32] as implemented in the R package mlogitBMA. Each of these models is contained within precisely one of the following sets:: No SNP is associated with either trait: There is a SNP associated with trait 2, but no SNP is associated with trait 1.: There is a SNP associated with trait 1, but no SNP is associated with trait 2.: Both the traits are associated with the region, with different causal variants.: Both the traits are associated with the region, and share a single causal variant.By summing the Bayes factors generated for all models in the set, a posterior possibility can be computed for each of the hypotheses, and hence for colocalization (). Similarly, the posterior probability of any given model, given a specific hypothesis and equal prior probability of each model, is proportional to the Bayes factor for that model. Since a Bayes factor is assigned to each model independently, it is straightforward to calculate the conditional probability of each SNP being causal, given association, as proportional to the Bayes factor for the relevant model.This approach assumes a single causal variant at any region. In regions with strong evidence of association we performed conditional analysis. Firstly, all plausibly important SNPs were discovered by iteratively conditioning on the most likely set of SNPs to cause the associations seen, until there was no longer strong evidence of additional association. In those cases where multiple SNPs were considered relevant, all but a pair (one potentially causal for the first trait, and one for the second) were conditioned upon, in order to discover the colocalization (or not) of the effects at this pair alone.
Proportional Approach
A second method based upon the proportional approach[2,3] was also used. Phenotypes are modeled using multinomial logistic regression, producing maximum likelihood estimates b1 and b2 of regression coefficients β1 and β2. Since the samples sizes can be large, the asymptotic normality of maximum likelihood estimators is used to approximate:
for some variance–covariance matrix V.The proportional method[3] assumes that b1, b2 are independent (i.e. V
12 = V
21 = O). However in the extension to a common control dataset, we cannot assume this, and proceed with a fully unknown V.The null hypothesis corresponds to the existence of a constant η such that:
Under this hypothesis, and given η,
This is used as our test statistic. However, since the value of η was unknown, a posterior predictive p–value is generated instead, by integrating the p–values associated with the test statistic over the posterior distribution of η. To avoid bias in regression coefficients due to selection of SNPs on the basis of their strength of association, Bayesian model averaging was used to average inference over all plausible two SNP models.Further details of the colocalization methods can be found in the Supplementary Methods section.
Identification of disease specific regions
To examine evidence for GWAS association with other traits, we took the index SNP with smallest p values in a region, and then identified proxy SNPs based on LD (r2 ≥ 0.9) using 1000 genomes EUR data. We used this as a query SNP set to examine associations annotated in the NIHR GWAS catalog (accessed 07/10/2014)We identified disease specific regions for which: the posterior probability for single SNP association was >0.5; posterior probability of association with any other disease was <0.2; the region was not annotated as associated with any other autoimmune disease in ImmunoBase; and no proxies for the index SNP were associated with any other autoimmune disease in the NIHR GWAS catalog.
Type 2 diabetes data
Summary from a T2D GWAS meta–analysis[22] was downloaded from the DIAGRAM website (accessed 20/10/14).
Code availability
The code used is given in the colocCommonControl R package, which can be found at https://github.com/mdfortune/colocCommonControl
Authors: Hironori Ueda; Joanna M M Howson; Laura Esposito; Joanne Heward; Hywel Snook; Giselle Chamberlain; Daniel B Rainbow; Kara M D Hunter; Annabel N Smith; Gianfranco Di Genova; Mathias H Herr; Ingrid Dahlman; Felicity Payne; Deborah Smyth; Christopher Lowe; Rebecca C J Twells; Sarah Howlett; Barry Healy; Sarah Nutland; Helen E Rance; Vin Everett; Luc J Smink; Alex C Lam; Heather J Cordell; Neil M Walker; Cristina Bordin; John Hulme; Costantino Motzo; Francesco Cucca; J Fred Hess; Michael L Metzker; Jane Rogers; Simon Gregory; Amit Allahabadia; Ratnasingam Nithiyananthan; Eva Tuomilehto-Wolf; Jaakko Tuomilehto; Polly Bingley; Kathleen M Gillespie; Dag E Undlien; Kjersti S Rønningen; Cristian Guja; Constantin Ionescu-Tîrgovişte; David A Savage; A Peter Maxwell; Dennis J Carson; Chris C Patterson; Jayne A Franklyn; David G Clayton; Laurence B Peterson; Linda S Wicker; John A Todd; Stephen C L Gough Journal: Nature Date: 2003-04-30 Impact factor: 49.962
Authors: Chris Wallace; Maxime Rotival; Jason D Cooper; Catherine M Rice; Jennie H M Yang; Mhairi McNeill; Deborah J Smyth; David Niblett; François Cambien; Laurence Tiret; John A Todd; David G Clayton; Stefan Blankenberg Journal: Hum Mol Genet Date: 2012-03-08 Impact factor: 6.150
Authors: Gosia Trynka; Karen A Hunt; Nicholas A Bockett; Jihane Romanos; Vanisha Mistry; Agata Szperl; Sjoerd F Bakker; Maria Teresa Bardella; Leena Bhaw-Rosun; Gemma Castillejo; Emilio G de la Concha; Rodrigo Coutinho de Almeida; Kerith-Rae M Dias; Cleo C van Diemen; Patrick C A Dubois; Richard H Duerr; Sarah Edkins; Lude Franke; Karin Fransen; Javier Gutierrez; Graham A R Heap; Barbara Hrdlickova; Sarah Hunt; Leticia Plaza Izurieta; Valentina Izzo; Leo A B Joosten; Cordelia Langford; Maria Cristina Mazzilli; Charles A Mein; Vandana Midah; Mitja Mitrovic; Barbara Mora; Marinita Morelli; Sarah Nutland; Concepción Núñez; Suna Onengut-Gumuscu; Kerra Pearce; Mathieu Platteel; Isabel Polanco; Simon Potter; Carmen Ribes-Koninckx; Isis Ricaño-Ponce; Stephen S Rich; Anna Rybak; José Luis Santiago; Sabyasachi Senapati; Ajit Sood; Hania Szajewska; Riccardo Troncone; Jezabel Varadé; Chris Wallace; Victorien M Wolters; Alexandra Zhernakova; B K Thelma; Bozena Cukrowska; Elena Urcelay; Jose Ramon Bilbao; M Luisa Mearin; Donatella Barisani; Jeffrey C Barrett; Vincent Plagnol; Panos Deloukas; Cisca Wijmenga; David A van Heel Journal: Nat Genet Date: 2011-11-06 Impact factor: 38.330
Authors: Luke Jostins; Stephan Ripke; Rinse K Weersma; Richard H Duerr; Dermot P McGovern; Ken Y Hui; James C Lee; L Philip Schumm; Yashoda Sharma; Carl A Anderson; Jonah Essers; Mitja Mitrovic; Kaida Ning; Isabelle Cleynen; Emilie Theatre; Sarah L Spain; Soumya Raychaudhuri; Philippe Goyette; Zhi Wei; Clara Abraham; Jean-Paul Achkar; Tariq Ahmad; Leila Amininejad; Ashwin N Ananthakrishnan; Vibeke Andersen; Jane M Andrews; Leonard Baidoo; Tobias Balschun; Peter A Bampton; Alain Bitton; Gabrielle Boucher; Stephan Brand; Carsten Büning; Ariella Cohain; Sven Cichon; Mauro D'Amato; Dirk De Jong; Kathy L Devaney; Marla Dubinsky; Cathryn Edwards; David Ellinghaus; Lynnette R Ferguson; Denis Franchimont; Karin Fransen; Richard Gearry; Michel Georges; Christian Gieger; Jürgen Glas; Talin Haritunians; Ailsa Hart; Chris Hawkey; Matija Hedl; Xinli Hu; Tom H Karlsen; Limas Kupcinskas; Subra Kugathasan; Anna Latiano; Debby Laukens; Ian C Lawrance; Charlie W Lees; Edouard Louis; Gillian Mahy; John Mansfield; Angharad R Morgan; Craig Mowat; William Newman; Orazio Palmieri; Cyriel Y Ponsioen; Uros Potocnik; Natalie J Prescott; Miguel Regueiro; Jerome I Rotter; Richard K Russell; Jeremy D Sanderson; Miquel Sans; Jack Satsangi; Stefan Schreiber; Lisa A Simms; Jurgita Sventoraityte; Stephan R Targan; Kent D Taylor; Mark Tremelling; Hein W Verspaget; Martine De Vos; Cisca Wijmenga; David C Wilson; Juliane Winkelmann; Ramnik J Xavier; Sebastian Zeissig; Bin Zhang; Clarence K Zhang; Hongyu Zhao; Mark S Silverberg; Vito Annese; Hakon Hakonarson; Steven R Brant; Graham Radford-Smith; Christopher G Mathew; John D Rioux; Eric E Schadt; Mark J Daly; Andre Franke; Miles Parkes; Severine Vermeire; Jeffrey C Barrett; Judy H Cho Journal: Nature Date: 2012-11-01 Impact factor: 49.962
Authors: Jason D Cooper; Matthew J Simmonds; Neil M Walker; Oliver Burren; Oliver J Brand; Hui Guo; Chris Wallace; Helen Stevens; Gillian Coleman; Jayne A Franklyn; John A Todd; Stephen C L Gough Journal: Hum Mol Genet Date: 2012-08-24 Impact factor: 6.150
Authors: Deborah J Smyth; Vincent Plagnol; Neil M Walker; Jason D Cooper; Kate Downes; Jennie H M Yang; Joanna M M Howson; Helen Stevens; Ross McManus; Cisca Wijmenga; Graham A Heap; Patrick C Dubois; David G Clayton; Karen A Hunt; David A van Heel; John A Todd Journal: N Engl J Med Date: 2008-12-10 Impact factor: 91.245
Authors: Andrew P Morris; Benjamin F Voight; Tanya M Teslovich; Teresa Ferreira; Ayellet V Segrè; Valgerdur Steinthorsdottir; Rona J Strawbridge; Hassan Khan; Harald Grallert; Anubha Mahajan; Inga Prokopenko; Hyun Min Kang; Christian Dina; Tonu Esko; Ross M Fraser; Stavroula Kanoni; Ashish Kumar; Vasiliki Lagou; Claudia Langenberg; Jian'an Luan; Cecilia M Lindgren; Martina Müller-Nurasyid; Sonali Pechlivanis; N William Rayner; Laura J Scott; Steven Wiltshire; Loic Yengo; Leena Kinnunen; Elizabeth J Rossin; Soumya Raychaudhuri; Andrew D Johnson; Antigone S Dimas; Ruth J F Loos; Sailaja Vedantam; Han Chen; Jose C Florez; Caroline Fox; Ching-Ti Liu; Denis Rybin; David J Couper; Wen Hong L Kao; Man Li; Marilyn C Cornelis; Peter Kraft; Qi Sun; Rob M van Dam; Heather M Stringham; Peter S Chines; Krista Fischer; Pierre Fontanillas; Oddgeir L Holmen; Sarah E Hunt; Anne U Jackson; Augustine Kong; Robert Lawrence; Julia Meyer; John R B Perry; Carl G P Platou; Simon Potter; Emil Rehnberg; Neil Robertson; Suthesh Sivapalaratnam; Alena Stančáková; Kathleen Stirrups; Gudmar Thorleifsson; Emmi Tikkanen; Andrew R Wood; Peter Almgren; Mustafa Atalay; Rafn Benediktsson; Lori L Bonnycastle; Noël Burtt; Jason Carey; Guillaume Charpentier; Andrew T Crenshaw; Alex S F Doney; Mozhgan Dorkhan; Sarah Edkins; Valur Emilsson; Elodie Eury; Tom Forsen; Karl Gertow; Bruna Gigante; George B Grant; Christopher J Groves; Candace Guiducci; Christian Herder; Astradur B Hreidarsson; Jennie Hui; Alan James; Anna Jonsson; Wolfgang Rathmann; Norman Klopp; Jasmina Kravic; Kaarel Krjutškov; Cordelia Langford; Karin Leander; Eero Lindholm; Stéphane Lobbens; Satu Männistö; Ghazala Mirza; Thomas W Mühleisen; Bill Musk; Melissa Parkin; Loukianos Rallidis; Jouko Saramies; Bengt Sennblad; Sonia Shah; Gunnar Sigurðsson; Angela Silveira; Gerald Steinbach; Barbara Thorand; Joseph Trakalo; Fabrizio Veglia; Roman Wennauer; Wendy Winckler; Delilah Zabaneh; Harry Campbell; Cornelia van Duijn; Andre G Uitterlinden; Albert Hofman; Eric Sijbrands; Goncalo R Abecasis; Katharine R Owen; Eleftheria Zeggini; Mieke D Trip; Nita G Forouhi; Ann-Christine Syvänen; Johan G Eriksson; Leena Peltonen; Markus M Nöthen; Beverley Balkau; Colin N A Palmer; Valeriya Lyssenko; Tiinamaija Tuomi; Bo Isomaa; David J Hunter; Lu Qi; Alan R Shuldiner; Michael Roden; Ines Barroso; Tom Wilsgaard; John Beilby; Kees Hovingh; Jackie F Price; James F Wilson; Rainer Rauramaa; Timo A Lakka; Lars Lind; George Dedoussis; Inger Njølstad; Nancy L Pedersen; Kay-Tee Khaw; Nicholas J Wareham; Sirkka M Keinanen-Kiukaanniemi; Timo E Saaristo; Eeva Korpi-Hyövälti; Juha Saltevo; Markku Laakso; Johanna Kuusisto; Andres Metspalu; Francis S Collins; Karen L Mohlke; Richard N Bergman; Jaakko Tuomilehto; Bernhard O Boehm; Christian Gieger; Kristian Hveem; Stephane Cauchi; Philippe Froguel; Damiano Baldassarre; Elena Tremoli; Steve E Humphries; Danish Saleheen; John Danesh; Erik Ingelsson; Samuli Ripatti; Veikko Salomaa; Raimund Erbel; Karl-Heinz Jöckel; Susanne Moebus; Annette Peters; Thomas Illig; Ulf de Faire; Anders Hamsten; Andrew D Morris; Peter J Donnelly; Timothy M Frayling; Andrew T Hattersley; Eric Boerwinkle; Olle Melander; Sekar Kathiresan; Peter M Nilsson; Panos Deloukas; Unnur Thorsteinsdottir; Leif C Groop; Kari Stefansson; Frank Hu; James S Pankow; Josée Dupuis; James B Meigs; David Altshuler; Michael Boehnke; Mark I McCarthy Journal: Nat Genet Date: 2012-08-12 Impact factor: 38.330
Authors: Mary D Fortune; Hui Guo; Oliver Burren; Ellen Schofield; Neil M Walker; Maria Ban; Stephen J Sawcer; John Bowes; Jane Worthington; Anne Barton; Steve Eyre; John A Todd; Chris Wallace Journal: Nat Genet Date: 2015-08 Impact factor: 38.330
Authors: Daniel A Skelly; Narayanan Raghupathy; Raymond F Robledo; Joel H Graber; Elissa J Chesler Journal: Genetics Date: 2019-05-21 Impact factor: 4.562