Genomic and phenotypic insights from an atlas of genetic effects on DNA methylation.

Josine L Min^1,2, Gibran Hemani^3,4, Eilis Hannon⁵, Koen F Dekkers⁶, Juan Castillo-Fernandez⁷, René Luijk⁶, Elena Carnero-Montoro^7,8, Daniel J Lawson^3,4, Kimberley Burrows^3,4, Matthew Suderman^3,4, Andrew D Bretherick⁹, Tom G Richardson^3,4, Johanna Klughammer¹⁰, Valentina Iotchkova¹¹, Gemma Sharp^3,4, Ahmad Al Khleifat¹², Aleksey Shatunov¹², Alfredo Iacoangeli^12,13, Wendy L McArdle⁴, Karen M Ho⁴, Ashish Kumar^14,15,16, Cilla Söderhäll¹⁷, Carolina Soriano-Tárraga¹⁸, Eva Giralt-Steinhauer¹⁸, Nabila Kazmi^3,4, Dan Mason¹⁹, Allan F McRae²⁰, David L Corcoran²¹, Karen Sugden^21,22, Silva Kasela²³, Alexia Cardona^24,25, Felix R Day²⁴, Giovanni Cugliari^26,27, Clara Viberti^26,27, Simonetta Guarrera^26,27, Michael Lerro²⁸, Richa Gupta^29,30, Sailalitha Bollepalli^29,30, Pooja Mandaviya³¹, Yanni Zeng^9,32,33, Toni-Kim Clarke³⁴, Rosie M Walker^35,36, Vanessa Schmoll³⁷, Darina Czamara³⁷, Carlos Ruiz-Arenas^38,39,40, Faisal I Rezwan⁴¹, Riccardo E Marioni^35,36, Tian Lin²⁰, Yvonne Awaloff³⁷, Marine Germain⁴², Dylan Aïssi⁴³, Ramona Zwamborn⁴⁴, Kristel van Eijk⁴⁴, Annelot Dekker⁴⁴, Jenny van Dongen⁴⁵, Jouke-Jan Hottenga⁴⁵, Gonneke Willemsen⁴⁵, Cheng-Jian Xu^46,47, Guillermo Barturen⁸, Francesc Català-Moll⁴⁸, Martin Kerick⁴⁹, Carol Wang⁵⁰, Phillip Melton^51,52,53, Hannah R Elliott^3,4, Jean Shin⁵⁴, Manon Bernard⁵⁴, Idil Yet^7,55, Melissa Smart⁵⁶, Tyler Gorrie-Stone⁵⁷, Chris Shaw^12,58, Ammar Al Chalabi^12,58,59, Susan M Ring^3,4, Göran Pershagen¹⁴, Erik Melén^14,60, Jordi Jiménez-Conde¹⁸, Jaume Roquer¹⁸, Deborah A Lawlor^3,4, John Wright¹⁹, Nicholas G Martin⁶¹, Grant W Montgomery²⁰, Terrie E Moffitt^21,22,62,63, Richie Poulton⁶⁴, Tõnu Esko^23,65, Lili Milani²³, Andres Metspalu²³, John R B Perry²⁴, Ken K Ong²⁴, Nicholas J Wareham²⁴, Giuseppe Matullo^26,27, Carlotta Sacerdote^27,66, Salvatore Panico⁶⁷, Avshalom Caspi^21,22,62,63, Louise Arseneault⁶³, France Gagnon²⁸, Miina Ollikainen^29,30, Jaakko Kaprio^29,30, Janine F Felix^68,69, Fernando Rivadeneira³¹, Henning 
Tiemeier^70,71, Marinus H van IJzendoorn^72,73, André G Uitterlinden³¹, Vincent W V Jaddoe^68,69, Chris Haley⁹, Andrew M McIntosh^34,36, Kathryn L Evans^35,36, Alison Murray⁷⁴, Katri Räikkönen⁷⁵, Jari Lahti⁷⁵, Ellen A Nohr^76,77, Thorkild I A Sørensen^3,4,78,79, Torben Hansen⁷⁸, Camilla S Morgen^78,80, Elisabeth B Binder^37,81, Susanne Lucae³⁷, Juan Ramon Gonzalez^38,39,40, Mariona Bustamante^38,39,40,82, Jordi Sunyer^38,39,40,83, John W Holloway^84,85, Wilfried Karmaus⁸⁶, Hongmei Zhang⁸⁶, Ian J Deary³⁶, Naomi R Wray^20,87, John M Starr^36,88, Marian Beekman⁶, Diana van Heemst⁸⁹, P Eline Slagboom⁶, Pierre-Emmanuel Morange⁹⁰, David-Alexandre Trégouët⁴², Jan H Veldink⁴⁴, Gareth E Davies⁹¹, Eco J C de Geus⁴⁵, Dorret I Boomsma⁴⁵, Judith M Vonk⁹², Bert Brunekreef^93,94, Gerard H Koppelman⁴⁶, Marta E Alarcón-Riquelme^8,14, Rae-Chi Huang⁹⁵, Craig E Pennell⁵⁰, Joyce van Meurs³¹, M Arfan Ikram⁹⁶, Alun D Hughes⁹⁷, Therese Tillin⁹⁷, Nish Chaturvedi⁹⁷, Zdenka Pausova⁵⁴, Tomas Paus⁹⁸, Timothy D Spector⁷, Meena Kumari⁵⁶, Leonard C Schalkwyk⁵⁷, Peter M Visscher^20,87, George Davey Smith^3,4, Christoph Bock^10,99, Tom R Gaunt^3,4, Jordana T Bell⁷, Bastiaan T Heijmans⁶, Jonathan Mill⁵, Caroline L Relton^3,4.

Abstract

Characterizing genetic influences on DNA methylation (DNAm) provides an opportunity to understand mechanisms underpinning gene regulation and disease. In the present study, we describe results of DNAm quantitative trait locus (mQTL) analyses on 32,851 participants, identifying genetic variants associated with DNAm at 420,509 DNAm sites in blood. We present a database of >270,000 independent mQTLs, of which 8.5% comprise long-range (trans) associations. Identified mQTL associations explain 15-17% of the additive genetic variance of DNAm. We show that the genetic architecture of DNAm levels is highly polygenic. Using shared genetic control between distal DNAm sites, we constructed networks, identifying 405 discrete genomic communities enriched for genomic annotations and complex traits. Shared genetic variants are associated with both DNAm levels and complex diseases, but only in a minority of cases do these associations reflect causal relationships from DNAm to trait or vice versa, indicating a more complex genotype-phenotype map than previously anticipated.

Year: 2021 PMID: 34493871 PMCID: PMC7612069 DOI: 10.1038/s41588-021-00923-x

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 41.307

The role of common inter-individual variation in DNAm on disease mechanisms is not yet well characterized. It has, however, been hypothesized that DNAm serves as a viable biomarker for risk stratification, early disease detection and the prediction of disease prognosis and progression.[1] Because genetic influences on DNAm in blood are widespread[2-4], a powerful avenue for studying the functional consequences of differences in DNAm levels is to map genetic differences associated with population-level variation, identifying mQTLs that include both local (cis-mQTL) and distal (trans-mQTL) effects. We can harness mQTLs as natural experiments, allowing us to observe randomly perturbed DNAm levels in a manner that is not confounded with environmental factors[5,6]. In this regard, mapping even very small genetic effects on DNAm is valuable for gaining power to evaluate whether its variation has a substantial causal role in disease and other biological processes. To date, only a small fraction of the total genetic variation estimated to influence DNAm across the genome has been identified[7], and the proportion of trans heritability explained by trans-mQTLs (defined as variants >1Mb from the DNAm site) is much smaller than the proportion of cis heritability explained by cis-mQTLs. Therefore, the majority of genetic effects are likely to act in trans, with small effect sizes[5,7-9], while being potentially biologically informative.[8,10] Much larger sample sizes are required to map associations involving small genetic effects in order to permit greater understanding of the genetic architecture and the biological processes underlying DNAm[7]. To this end, we established the Genetics of DNA Methylation Consortium (GoDMC), an international collaboration of human epidemiological studies that comprises >30,000 study participants with genetic and DNAm data. 
We use this resource to develop a comprehensive catalogue of cis- and trans-mQTLs, which enables us to examine the genetic architecture of DNAm. By constructing networks of multiple cis- and trans-mQTLs, we learn about their collective impact on pathways and complex traits. Finally, we interrogate the potential role of DNAm in disease mechanisms by mapping the causal relationships between variable DNAm and 116 complex traits and diseases in a bi-directional manner. A database of our results is available as a resource to the community at http://mqtldb.godmc.org.uk/.

Results

Genetic variants influence 45% of tested DNAm sites

In order to map genetic influences on DNAm, we established an analysis workflow that enabled standardized meta-analysis and data integration across 36 population-based and disease datasets. Using a two-phase discovery study design, we analyzed ~10 million genotypes imputed to the 1000 Genomes Project reference panel[11] and quantified DNAm at 420,509 sites using Illumina HumanMethylation BeadChips in whole blood derived from 27,750 European participants (Figure 1a, Supplementary Figures 1-4, Extended Data Fig. 1, Supplementary Tables 1-2, Supplementary Note). The microarray technology used in the majority of cohorts limited us to the analysis of only 1.5% of the ~28M DNAm sites across the genome[12], including 96% of CpG islands and CpG shores and 99% of RefSeq genes[13] and all inferences relate only to these sites.

Figure 1

Discovery and replication of mQTLs

a. Study Design. In the first phase, 22 cohorts performed a complete mQTL analysis of up to 480,000 sites against up to 12 million variants, retaining their results for p<1e-5. In the second phase, 120 million SNP-DNAm site pairs selected from the first phase, and GWA catalog SNPs against 345k DNAm sites, were tested in 36 studies (including 20 phase 1 studies) and meta-analyzed. QC, quality control. b. Distributions of the weighted mean of DNAm across 36 cohorts for cis only, cis+trans and trans only sites. The weighted mean DNAm level across 36 studies was defined as low (<20%), intermediate (20%-80%) or high (>80%). Plots are colored with respect to the genomic annotation. Cis only sites showed a bimodal distribution of DNAm. Cis+trans sites showed intermediate levels of DNAm. Trans only sites showed low levels of DNAm. c. Discovery and replication effect size estimates between GoDMC (n=27,750) and Generation Scotland (n=5,101) for 169,656 mQTL associations. The regression coefficient is 1.13 (se=0.0007). d. Relationship between DNAm site heritability estimates and DNAm variance explained in Generation Scotland. The center line of a boxplot corresponds to the median value. The lower and upper box limits indicate the first and third quartiles (the 25th and 75th percentiles). The length of the whiskers corresponds to values up to 1.5 times the IQR in either direction. The regression coefficient for the twin family study was 3.16 (se=0.008) and for the twin study 2.91 (se=0.008) across 403,353 DNAm sites. The variance explained for DNAm sites with missing r2 (n=277,428) and/or h2=0 (twin family: n=80,726; twins: n=34,537) was set to 0. GS, Generation Scotland.
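
The low, intermediate and high DNAm categories in panel b can be illustrated with a minimal sketch. The caption does not state how cohorts were weighted, so weighting by cohort sample size is an assumption here, and the per-cohort values are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean of per-cohort DNAm levels.

    Assumption: weights are cohort sample sizes; the caption does not say.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def dnam_category(mean_level):
    """Classify a mean DNAm proportion with the caption's cut-offs:
    low (<20%), intermediate (20%-80%), high (>80%)."""
    if mean_level < 0.20:
        return "low"
    if mean_level > 0.80:
        return "high"
    return "intermediate"


# Hypothetical per-cohort mean DNAm levels for one site, and hypothetical sizes.
levels = [0.10, 0.15, 0.12]
sizes = [1000, 500, 2000]

site_mean = weighted_mean(levels, sizes)
print(round(site_mean, 4), dnam_category(site_mean))
```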

Extended Data Fig. 1

Quality control of 36 studies.

We used 337 independent SNPs on chromosome 20 with a p-value<1e-14. The number of SNPs used for each study is indicated in the bottom plot. a. M statistic for each of the 36 cohorts. b. Boxplot of mQTL effect sizes for each of the 36 studies. The center line of a boxplot corresponds to the median value. The lower and upper box limits indicate the first and third quartiles (the 25th and 75th percentiles). The length of the whiskers corresponds to values up to 1.5 times the IQR in either direction.

Using linkage disequilibrium (LD) clumping, we identified 248,607 independent cis-mQTL associations (p<1e-8, <1Mb from the DNAm site, Supplementary Figure 3) with a median distance between single nucleotide polymorphisms (SNPs) and DNAm sites of 36kb (interquartile range (IQR)=118 kb, Extended Data Fig. 2). We found 23,117 independent trans-mQTL associations (using a conservative threshold of p<1e-14[7], Supplementary Figure 3, Supplementary Note). These mQTLs involved 190,102 DNAm sites, representing 45.2% of all those tested (Figure 1b), which is a 1.9x increase in sites with a cis association (p<1e-8) and a 10x increase in sites with a trans association (p<1e-14) over a previous study whose sample size was 7x smaller[8]. As expected, mQTL effect sizes for each DNAm site (the maximum absolute additive change in DNAm level measured in standard deviation (SD) per allele) were lower for sites with a trans association than for sites with a cis association (per allele SD change = -0.02, s.e.=0.002, p=2.1e-14; Extended Data Fig. 3). The differential improvement in yield between cis and trans associations is revealing in terms of the genetic architecture: relatively small sample sizes are sufficient to uncover the majority of large cis effects, whereas much larger sample sizes are required to identify the polygenic trans component.
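
The cis and trans definitions and thresholds above can be written down as a small sketch. The function name and inputs are hypothetical, and treating a distance of exactly 1 Mb as trans is an assumption, since the text only specifies <1 Mb and >1 Mb. The two ratios at the end are recomputed from the counts reported above.

```python
CIS_WINDOW_BP = 1_000_000  # SNP within 1 Mb of the DNAm site counts as cis
CIS_P = 1e-8               # significance threshold for cis associations
TRANS_P = 1e-14            # conservative threshold for trans associations


def classify_mqtl(snp_chr, snp_pos, site_chr, site_pos, p_value):
    """Return "cis", "trans", or None if the pair misses its threshold.

    Assumption: a distance of exactly 1 Mb is treated as trans.
    """
    same_chr = snp_chr == site_chr
    is_cis = same_chr and abs(snp_pos - site_pos) < CIS_WINDOW_BP
    if is_cis:
        return "cis" if p_value < CIS_P else None
    return "trans" if p_value < TRANS_P else None


# Counts reported in the text.
n_cis, n_trans = 248_607, 23_117
n_sites_with_mqtl, n_sites_tested = 190_102, 420_509

trans_share = n_trans / (n_cis + n_trans)          # share of trans associations
site_share = n_sites_with_mqtl / n_sites_tested    # share of sites with an mQTL

print(f"trans share: {100 * trans_share:.1f}%")
print(f"sites with an mQTL: {100 * site_share:.1f}%")
```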

Extended Data Fig. 2

Distance of SNP from DNAm site.

a. Density plot of the distance of SNP from DNAm site against the -log10 p-value of 4,533 intra-chromosomal trans-mQTL associations (>1Mb). b. Density plot of the distance of SNP from DNAm site against the -log10 p-value of 248,607 cis-mQTL associations (<1Mb).

Extended Data Fig. 3

Effect sizes and weighted standard deviation (SD) for each mQTL category.

a. For each DNAm site, the strongest absolute effect size (the maximum absolute additive change in DNAm level measured in SD per allele) was selected. The kernel density estimations of the effect sizes were shown for all sites with a mQTL (n=190,102), sites with cis only effects (n=170,986), cis effects for sites with cis and trans effects (n=11,902), trans effects for sites with cis and trans effects (n=11,902) and sites with trans only effects (n=7,214). Comparing the strongest effect size for each site in a two-sided linear regression model showed that cis+trans sites had larger cis effect sizes (per allele SD change = 0.05 (s.e.= 0.002), p<2e-16) as compared to cis only sites and weaker trans effect sizes (per allele SD change = -0.06 (s.e.= 0.002), p<2e-16) as compared to trans only sites. To detect these small trans effect sizes at sites with both a cis and a trans association, it is crucial to regress out the cis effect to decrease the residual variance and improve power to detect a trans effect. b. The violin plots represent kernel density estimates of the weighted SD across 36 cohorts for each DNAm site. The center line of the boxplot in the violin plots corresponds to the median value. The lower and upper box limits indicate the first and third quartiles (the 25th and 75th percentiles). The length of the whiskers corresponds to values up to 1.5 times the IQR in either direction.

The majority of trans associations (80%) were inter-chromosomal. Of the intra-chromosomal trans associations, 34% were >5 Mb from the DNAm site (Extended Data Fig. 2a). We found a substantially lower number of inter-chromosomal trans associations per 5 Mb region (1.59) than intra-chromosomal associations (>1 Mb: 7.95; >6 Mb: 4.81, excluding chromosome 6). Next, using conditional analysis[14] we explored the potential for multiple independent SNPs operating within the locus of each mQTL, identifying 758,130 putative independent variants. Each DNAm site for which an mQTL in cis had been detected had a median of 2 independent variants (IQR=4 variants, Supplementary Figure 5). For all subsequent analyses, we used index SNPs from clumping procedures to be conservative and to avoid bias arising from the non-independence of genetic variants. We sought to replicate the mQTLs in the Generation Scotland cohort (n=5,101) using an independent analysis pipeline. Replication data were available for 188,017 of our discovery mQTLs (137,709 sites). We found a strong correlation of effect sizes for both cis and trans effects (Pearson r=0.97, n=155,191 and 0.96, n=14,465 at p<1e-3, respectively; Figure 1c); 99.6% of the associations had a consistent direction of effect (Supplementary Note). At a Bonferroni corrected threshold of 0.05/188,017, 142,727 of the discovery mQTLs replicated in the Generation Scotland cohort (76%); the replication rates for cis- and trans-mQTLs were 76% and 79%, respectively. To evaluate whether our replication rate was in line with expectations given the smaller replication sample size, we estimated that under the assumption that the discovery mQTLs are true positives, 171,824 mQTLs would be expected to replicate at a nominal threshold of p<1e-3; we found that the actual number of mQTLs replicating at this level was 169,656, indicating that the majority of our discovery mQTLs are likely to be true positives (Supplementary Data 1, Supplementary Note). 
Our findings indicate that there is little between-study heterogeneity in our analysis and that genetic effects on DNAm are relatively stable across samples of European ancestry (Extended Data Fig. 1, Supplementary Table 2). Overall, the variance explained by replicated genetic effects on DNAm was small. For 99% of the associations in cis and trans, mQTLs explained less than 21% and 16% of the variation in DNAm respectively (Supplementary Figure 6). Aggregating across all 420,509 tested DNAm sites, our replicated mQTL associations explain 1.3% of the total assayed variation in DNAm, 8% of this being due to trans associations. Restricting to sites that have at least one cis effect or trans effect, however, we explain 4.2% and 2.5% of the DNAm variance, respectively. We then investigated how much of the heritability of variable DNAm can be explained by our mQTL associations using family-based heritability studies of DNAm[2,15]. We found a strong positive relationship between variance explained by replication mQTL estimates (127,680 sites in Generation Scotland) and heritability for both studies (family: Pearson r=0.41 across 121,582 available sites; twin: Pearson r=0.37 across 118,955 available sites) (Figure 1d, Supplementary Data 2). The mQTLs that we identified explain 15%-17% of the additive genetic variance of DNAm (Supplementary Figure 7). Finally, there were strong positive relationships between the heritability of DNAm levels at a DNAm site and the number of independent mQTLs (Supplementary Figure 8), heritability and effect size (Supplementary Figure 9), variance explained and the number of independent mQTLs (Supplementary Figure 10) and variance explained and distribution of DNAm levels (Supplementary Figure 11). Overall, our results support a mixed genetic architecture of polygenic genome-wide effects and larger cis effects. Our mQTL coverage was limited by the computational necessity of a multiple-stage study design (Extended Data Fig. 4a). 
The discovered mQTLs with r2<1% are likely a small fraction of all the mQTLs in this category expected to exist. Across these DNAm sites, and within the range of mQTLs detected in our study (r2>0.22%), we estimate that there are twice as many cis-mQTLs and 22.5 times more trans-mQTLs yet to discover (Extended Data Fig. 4b). This would likely not explain all estimated heritability, indicating that a substantial portion of the heritability is due to causal variants with smaller effects or to rare variants.
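
The replication figures quoted in this section can be reproduced with simple arithmetic. The sketch below uses only counts reported in the text; the variable names are illustrative.

```python
# Counts reported in the text for the Generation Scotland replication.
n_tested = 188_017           # discovery mQTLs with replication data
n_replicated = 142_727       # replicated at the Bonferroni threshold
expected_nominal = 171_824   # expected to replicate at p<1e-3 if all are true positives
observed_nominal = 169_656   # actually replicated at p<1e-3
n_cis_nominal, n_trans_nominal = 155_191, 14_465

bonferroni = 0.05 / n_tested                       # per-test threshold
replication_rate = n_replicated / n_tested         # overall replication rate
observed_vs_expected = observed_nominal / expected_nominal

print(f"Bonferroni threshold: {bonferroni:.2e}")
print(f"replication rate: {100 * replication_rate:.0f}%")
print(f"observed / expected at p<1e-3: {observed_vs_expected:.3f}")
```

The cis and trans counts at p<1e-3 sum to the 169,656 associations plotted in Figure 1c, and the observed count is close to the number expected under the true-positive assumption.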

Extended Data Fig. 4

Impact of the two-stage design on mQTL coverage.

a. Loss in power in two-stage design. We calculated the power of detecting a cis association in at least one of the 22 studies at p<1e-5 or a trans association in at least two of 22 studies at p<1e-5. b. Expected number of mQTLs. Using the number of mQTLs with a particular r2 value, and the power of detecting mQTLs with that r2 value, we calculated how many mQTLs would be expected to exist with that value.
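
A rough sketch of the two calculations in this caption follows. It assumes every study has the same per-study power and that studies are independent, which is a simplification (the real cohorts differ in size); the function names and example values are hypothetical.

```python
from math import comb


def power_at_least_k(per_study_power, n_studies, k):
    """Probability that at least k of n independent studies pass the filter.

    Assumption: identical per-study power across studies.
    """
    p = per_study_power
    # Probability of fewer than k detections under a binomial model.
    fewer = sum(
        comb(n_studies, j) * p**j * (1 - p) ** (n_studies - j) for j in range(k)
    )
    return 1.0 - fewer


def expected_total(n_observed, power):
    """Number of mQTLs expected to exist, given how many were observed
    and the power to observe each of them."""
    if not 0 < power <= 1:
        raise ValueError("power must be in (0, 1]")
    return n_observed / power


# Hypothetical per-study power of 10% for a small-effect association.
cis_filter = power_at_least_k(0.10, 22, 1)    # at least one of 22 studies
trans_filter = power_at_least_k(0.10, 22, 2)  # at least two of 22 studies
print(round(cis_filter, 3), round(trans_filter, 3))
print(expected_total(100, 0.5))
```

Requiring two studies rather than one lowers the chance of passing the first stage, which is one reason small trans effects are harder to carry forward.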

Cis- and trans-mQTLs operate through distinct mechanisms

To infer biological properties of trans features that were independent of any incidental cis effects[7,8,16-18], we categorized mQTLs into those only associated with DNAm in cis (n=157,095, 69.9%), those only associated with DNAm in trans (n=794, 0.35%), and those associated with DNAm in both cis and trans (n=66,759, 29.7%). Similarly, of the 190,102 DNAm sites influenced by a SNP, 170,986 DNAm sites (89.9%) were cis only, 11,902 DNAm sites (6.3%) were cis+trans, and 7,214 DNAm sites (3.8%) were trans only. We first compared the distributions of DNAm levels (weighted mean DNAm level across 36 studies; Figure 1b). We then performed enrichment analyses on the mQTL SNPs and DNAm sites using 25 combinatorial chromatin states from 127 cell types[19] and gene annotations (Figure 2a, Supplementary Figures 12-15, Supplementary Tables 3-6). Consistent with previous studies[7,8,18], we found that cis only sites are represented in high (32%), low (28%) and intermediate (40%) DNAm levels, and these sites are mainly enriched for enhancer chromatin states (mean odds ratio (OR) =1.37), CpG islands (OR=1.25) and shores (OR=1.26). For cis+trans sites, we found that the majority of these sites (66%) have intermediate DNAm levels. By replicating this finding in two isolated white-blood-cell subsets (Supplementary Figure 16), we showed that this is due to cell-to-cell variability[19,20] or subcell-type differences. In line with the observation that intermediate levels of DNAm are found at distal regulatory sequences[21,22], these cis+trans sites were enriched for enhancer (mean OR=1.65) and promoter states (mean OR=1.41). However, for trans only sites, we found a pattern of low DNAm (for 55% of sites) and enrichments for promoter states (mean OR=1.39), especially the TssA (active transcription start site) promoter state (mean OR=2.03). These enrichment patterns were consistent when we restricted to only inter-chromosomal associations (Supplementary Note, Supplementary Figure 17).
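
The odds ratios reported above summarize how much more often mQTL sites fall in an annotation than background sites do. The sketch below computes a plain 2x2 odds ratio with a Woolf-type confidence interval; it is a generic illustration, not the enrichment software used in the study, and the counts are hypothetical.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio and approximate 95% CI for a 2x2 table.

    a: mQTL sites inside the annotation   b: mQTL sites outside it
    c: background sites inside            d: background sites outside
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive for this simple estimator")
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR)
    lower = math.exp(math.log(estimate) - 1.96 * se_log)
    upper = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, (lower, upper)


# Hypothetical counts: 30% of mQTL sites vs about 22% of background sites
# overlap an enhancer state.
or_enh, ci_enh = odds_ratio(300, 700, 2000, 7000)
print(round(or_enh, 2), tuple(round(x, 2) for x in ci_enh))
```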

Figure 2

Cis- and trans-mQTLs operate through distinct mechanisms

a. Distributions of enrichments for chromatin states and gene annotations among mQTL sites and SNPs. Enrichment analyses were performed using 25 combinatorial chromatin states from 127 cell types (including 27 blood cell types) and gene annotations. The heatmap represents the distribution of ORs for cis only, trans only, or cis+trans sites and SNPs. For the enrichment of chromatin states, ORs were averaged across cell types. The following chromatin states were analyzed: TssA, Active TSS; PromU, Promoter Upstream TSS; PromD1, Promoter Downstream TSS with DNase; PromD2, Promoter Downstream TSS; Tx5', Transcription 5'; Tx, Transcription; Tx3', Transcription 3'; TxWk, Weak transcription; TxReg, Transcription Regulatory; TxEnh5', Transcription 5' Enhancer; TxEnh3', Transcription 3' Enhancer; TxEnhW, Transcription Weak Enhancer; EnhA1, Active Enhancer 1; EnhA2, Active Enhancer 2; EnhAF, Active Enhancer Flank; EnhW1, Weak Enhancer 1; EnhW2, Weak Enhancer 2; EnhAc, Enhancer Acetylation Only; DNase, DNase only; ZNF/Rpts, ZNF genes & repeats; Het, Heterochromatin; PromP, Poised Promoter; PromBiv, Bivalent Promoter; ReprPC, Repressed PolyComb; Quies, Quiescent/Low. The significance was categorized as: *=FDR<0.001; **=FDR<1e-10; ***=FDR<1e-50. b. Distributions of enrichment for occupancy of TFBSs among mQTL sites and SNPs. Each density curve represents the distribution of ORs for cis only, trans only, or cis+trans sites (left) and SNPs (right). c. Distributions of enrichment of mQTLs among 41 complex traits and diseases. Each density curve represents the distribution of ORs for cis only, trans only, or cis+trans SNPs.

Analyzing the differences in properties for the SNP categories, we found that cis only and cis+trans SNPs were enriched for active chromatin states and genic regions whereas trans only SNPs were enriched for intergenic regions and the heterochromatin state (Figure 2a, Supplementary Figures 14-15, Supplementary Tables 5-6). Overall, these results highlight that a complex relationship between molecular features is underlying the mQTL categories and the biological contexts are substantially different between cis and trans features. We found that these inferences were often shared across other tissues. DNAm sites with low or intermediate DNAm levels have similar DNAm distributions in 12 tissues (Supplementary Figures 18-20) with stronger enrichments in blood datasets for the enhancer states indicating some level of tissue specificity for mQTLs in these regions (Supplementary Figures 12, 14, 21). To investigate whether mQTLs are tissue specific, we compared the correlation of effect estimates of cis- and trans-mQTLs in blood against adipose tissue (n=603)[23] and brain (n=170)[9] (Supplementary Note, Extended Data Fig. 5). We found a larger extent of QTL sharing of blood and adipose tissue as compared to blood and brain which might be explained by shared cell types, in line with cis expression QTL findings[24]. Generally, the between tissue effect correlations were high, consistent with a recent comparison of cis-mQTL effects between brain and blood[25]. However, we found that the highest correlations were for associations involving trans-only sites (Adipose rb=0.92 (se =0.004); Brain rb=0.88 (se=0.009)) despite having on average smaller effect sizes than cis only associations, implying that they are less tissue specific than cis effects (Adipose rb=0.73 (se =0.002); Brain rb=0.59 (se=0.004)), which is in line with the notion that DNAm of promoters is less tissue specific. 
Stratifying the mQTL categories to low, intermediate and high DNAm, showed that the brain-blood correlations are the lowest for intermediate DNAm categories and adipose-blood correlations are lowest for high DNAm categories, which may suggest cellular heterogeneity for high DNAm levels (Extended Data Fig. 5). These results show the value of large sample sizes in blood to detect trans-mQTLs regardless of the tissue.

Extended Data Fig. 5

Correlation of mQTL effects (p<1e-14) between blood and other tissues.

For each mQTL category, the correlation of genetic effects between tissues (rb) was estimated using the rb method[25], with the blood mQTLs as reference. DNAm levels are categorized as low (<0.2), intermediate (0.2-0.8) or high (>0.8).

Trans-mQTL SNPs and DNAm exhibit patterned TF binding

Recent studies have uncovered multiple types of transcription factor (TF)-DNA interactions influenced by DNAm, including the binding of DNAm-sensitive TFs[26-28] and cooperativity between TFs[27,29]. To gain insights into how SNPs induce long-range DNAm changes, we mapped enrichments for DNAm sites and SNPs across binding sites for 171 TFs in 27 cell types[30,31]. We found strong enrichments for the majority of TFs and cell types amongst DNAm sites with a trans association (cis+trans: 55%; trans only: 80%; cis only: 18%) and amongst cis-acting SNPs (cis only: 96%, cis+trans: 91%, trans only: 1%) (Figure 2b, Supplementary Tables 7-8, Supplementary Figures 22-23). Consistent with the observation that trans only DNAm sites are enriched for CpG islands (Supplementary Figure 13), DNAm sites that overlap transcription factor binding sites (TFBSs) were relatively hypomethylated (weighted mean DNAm levels = 21% vs 52%, p<2.2e-16) (Supplementary Figure 24). Next we hypothesized that if a trans-mQTL is driven by TF activity[8,10] then particular TF-TF pairs may exhibit preferential enrichment[32]. An mQTL has a pair of TFBS annotations[31], one for the SNP and one for the DNAm site. We evaluated whether the annotation pairs amongst 18,584 inter-chromosomal trans-mQTLs were associated with TF binding in a non-random pattern (Supplementary Note, Extended Data Fig. 6a-b). We found that 6.1% (22,962 of 378,225) of possible pairwise combinations of SNP-DNAm site annotations were more over- or under-represented than expected by chance after strict multiple testing correction (Supplementary Note, Supplementary Table 9, Extended Data Fig. 6c).

Extended Data Fig. 6

Two-dimensional enrichment of SNP and DNAm site TFBS annotation.

a. To test if the annotations of the SNPs involved in trans-mQTLs were specific to the annotations of the DNAm sites that they influence, we compared the real SNP-DNAm site pairs against permuted SNP-DNAm site pairs, where the biological link between SNP and site is severed whilst maintaining the distribution of annotations for the SNPs and sites. We constructed 100 such permuted datasets. b. SNP and site positions were annotated against genomic features, and we quantified how frequently mQTLs were found for each pair of SNP-DNAm site annotations. This enabled the construction of two-dimensional annotation matrices for both the real trans-mQTL list and the permuted trans-mQTL lists. c. Distribution of two-dimensional enrichment values of trans-mQTLs. There was substantial departure from the null in the real dataset for all tissues indicating that the TFBS of a site depended on the TFBS of the SNP that influenced it. d. A bipartite graph of the two-dimensional enrichment for trans-mQTLs, SNPs annotations (blue) with pemp< 0.01 after multiple testing correction co-occur with particular site annotations (red).
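
The permutation scheme in panels a and b can be sketched as follows. Shuffling the site column keeps each SNP's and each site's annotations unchanged while breaking the SNP-site link, and counting annotation pairs gives the two-dimensional matrix. The identifiers and annotations are hypothetical toy data, and this sketch is only an illustration of the idea, not the authors' pipeline.

```python
import random
from collections import Counter


def pair_counts(pairs, snp_annot, site_annot):
    """Count mQTLs for every (SNP annotation, site annotation) combination."""
    counts = Counter()
    for snp, site in pairs:
        for a in snp_annot.get(snp, ()):
            for b in site_annot.get(site, ()):
                counts[(a, b)] += 1
    return counts


def permute_pairs(pairs, rng):
    """Shuffle sites across SNPs, preserving both marginal annotation
    distributions while severing the SNP-site link."""
    snps = [snp for snp, _ in pairs]
    sites = [site for _, site in pairs]
    rng.shuffle(sites)
    return list(zip(snps, sites))


# Hypothetical trans-mQTL pairs and TFBS annotations.
pairs = [("rs_a", "cg_1"), ("rs_b", "cg_2"), ("rs_c", "cg_3")]
snp_annot = {"rs_a": ["CTCF"], "rs_b": ["CTCF"], "rs_c": ["MAX"]}
site_annot = {"cg_1": ["RAD21"], "cg_2": ["RAD21"], "cg_3": ["IRF1"]}

real_counts = pair_counts(pairs, snp_annot, site_annot)

rng = random.Random(0)
null_counts = [
    pair_counts(permute_pairs(pairs, rng), snp_annot, site_annot)
    for _ in range(100)  # the caption describes 100 permuted datasets
]

key = ("CTCF", "RAD21")
null_mean = sum(c[key] for c in null_counts) / len(null_counts)
print(real_counts[key], null_mean)
```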

After accounting for abundance and other characteristics, the strongest pairwise enrichments involved sites close to TFBSs for proteins in the cohesin complex, for example CTCF, SMC3 and RAD21, as well as TFs such as GATA2 related to cohesin[33]. Bipartite analysis showed that these clustered due to being related to similar sets of SNP annotations (Extended Data Fig. 6d). Other clusters were also found; for example, sites close to TFBSs for interferon regulatory factor 1 (IRF1), a gene for which trans-acting regulatory networks[34] and enrichment amongst causally interacting chromatin accessibility QTLs[35] have been previously reported, were more likely to be influenced by SNPs near TFBSs for EZH2, SMC3, ATF3, BCL3, TR4 and MAX. Next, we compared the locations of inter-chromosomal trans-mQTLs (n=18,584) to known regions of chromatin interactions[36] as an alternative mechanism for trans coordination[8,37]. We found 1,175 overlaps for 637 SNP-DNAm site pairs (3.4%) where the LD region of the mQTL SNP and the corresponding site overlapped with any interacting regions (525 SNPs, 602 sites) as compared to a mean of 473 SNP-DNAm site pairs in 1,000 permuted datasets (OR=1.36, pFisher=6.5e-7, pempirical<1e-3) (Supplementary Figure 25). To summarize, our results show that trans-mQTLs are in part driven by long-range cooperative TF interactions and that, for a small proportion of inter-chromosomal trans-mQTLs, the spatial distance in vivo is likely to be small.
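
An empirical p-value of the kind quoted above compares the observed overlap count with counts from permuted datasets. The sketch below uses the common add-one estimator, which is an assumption about the exact formula; the permuted counts are made-up placeholders centred near the reported mean of 473, not the study's data.

```python
def empirical_p(observed, permuted):
    """One-sided empirical p-value with the add-one correction:
    (number of permuted values >= observed + 1) / (number of permutations + 1)."""
    n_at_least = sum(1 for value in permuted if value >= observed)
    return (n_at_least + 1) / (len(permuted) + 1)


observed_pairs = 637        # overlapping SNP-DNAm site pairs reported in the text
n_interchrom = 18_584       # inter-chromosomal trans-mQTLs reported in the text

# Placeholder permuted counts (hypothetical), all well below the observed value.
permuted_pairs = [473 + (i % 41) - 20 for i in range(1000)]

p_emp = empirical_p(observed_pairs, permuted_pairs)
overlap_share = observed_pairs / n_interchrom

print(f"overlap share: {100 * overlap_share:.1f}%")
print(f"empirical p: {p_emp:.6f}")
```

With 1,000 permutations and no permuted count reaching the observed value, the estimator cannot go below 1/1001, which is why such results are reported as an upper bound.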

Trans-mQTL effects form DNAm communities

Genetic variation can perturb chromatin activity[32,35,37], DNAm[8] or gene expression[38] across multiple sites in cis and trans revealing coordinated activity between regulatory elements and genes. We observed that there were 1,728,873 instances where a SNP acting in trans also associated with a cis DNAm site (before LD pruning). Genetic colocalization analysis indicated that 278,051 of these instances were due to the cis and trans sites sharing a genetic factor, representing 3,573 independent cis-trans genomic region pairs, of which 3,270 were inter-chromosomal (Supplementary Table 10, see Supplementary Note for sensitivity analysis for the colocalization method used in the context of the two-stage mQTL discovery design). These pairs consisted of 1,755 independent SNPs and 5,109 independent DNAm sites across the genome, indicating that some sites with cis associations shared genetic factors with multiple sites with trans associations, revealing distal coordination between mQTLs. From the cis-trans pairs we constructed a network linking these genomic regions which elucidated 405 “communities” of genomic regions that were substantially connected (Supplementary Note). Fifty-six of these communities comprised 10 or more sites, and the largest community comprised 253 sites (Figure 3a).
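
Grouping cis-trans region pairs into connected sets can be sketched with a union-find structure. This is a deliberately simpler stand-in: connected components are coarser than communities found by a dedicated community-detection algorithm, which can split one connected component into several communities. The region labels are hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Group nodes linked by any chain of edges, largest group first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical cis-trans region pairs that share a genetic factor.
edges = [
    ("cis_A", "trans_1"),
    ("cis_A", "trans_2"),
    ("cis_B", "trans_2"),
    ("cis_C", "trans_3"),
]

components = connected_components(edges)
for group in components:
    print(len(group), sorted(group))
```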

Figure 3

Communities constructed from trans-mQTLs.

a. A network depicting all communities in which there were twenty or more sites. Random walks were used to generate communities (colors), so occasionally a DNAm site connects different communities. b. The relationship between genomic annotations, mQTLs and communities. Communities 9 and 22 comprised DNAm sites that are related through shared genetic factors. The sankey plots show the genomic annotations for the genetic variants (left) and for the DNAm sites (right). The DNAm sites comprising these communities are enriched for TFBSs related to the cohesin complex and NFkB, respectively. c. Enrichment of GWA traits among community SNPs. The genomic loci for each of the 56 largest communities were tested for enrichment of low p-values in 133 complex trait GWASs (y-axis) against a null background of community SNPs. The x-axis depicts the two-sided -log10 p-value for enrichment, with the 5% FDR shown by the vertical dotted line. Colors represent log odds ratios. Enrichments were particularly strong for blood-related phenotypes (including circulating metal levels).

We hypothesized that cis sites were causally influencing multiple trans sites within their communities. We evaluated whether the estimated causal effect (obtained from the trans-mQTL effect divided by the cis-mQTL effect, i.e. the Wald ratio) of the cis site on the trans site was consistent with the observational correlation between the cis site and the trans site. While there was an association, the relationship was weak (Pearson r=0.096, p=1.73e-6, Supplementary Figure 26), indicating that changes in cis sites causing changes in trans sites are probably not the predominant mechanism. We did observe that the cis-trans DNAm levels were more strongly correlated than we would expect by chance (Supplementary Figure 27), suggesting that they are jointly regulated without generally being causally related. Next, we evaluated if DNAm sites within each community were enriched for regulatory annotations and/or gene ontologies (Supplementary Tables 11-14, Supplementary Figures 28-29). Multiple communities showed enrichments (false discovery rate (FDR) <0.001); community 9 DNAm sites were strongly enriched for TFBS annotations relating to the cohesin complex in multiple cell types, community 22 DNAm sites were enriched for NFKB and EBF1 in B lymphocytes and community 76 DNAm sites were enriched for EZH2 and SUZ12 as well as bivalent promoter and repressed polycomb states (Figure 3b). Community 2 (comprising 253 sites) was enriched for the active enhancer state in three cell types and for lymphocyte activation (GO:0046649: FDR=0.016) and multiple KEGG pathways including the JAK-STAT signalling pathway (I04630: FDR=8.53e-7) (Supplementary Tables 13, 14). Regulatory features within a network may share a set of biological features that are related to complex traits. 
We performed enrichment analysis to evaluate if the loci tagged by DNAm sites in a community were related to each of 133 complex traits (Supplementary Table 15), accounting for non-random genomic properties of the selected loci. Restricting the analysis to only the 56 communities with ten or more sites, we found eleven communities that tagged genomic loci that were enriched for small p-values with 22 complex traits (FDR<0.05) (Figure 3c, Supplementary Table 16). Blood-related phenotypes were overrepresented (11 out of 23 enrichments being related to metal levels or hematological measures, binomial test p-value = 4.2e-5). Amongst the communities enriched for genome-wide association study (GWAS) signals, community 16 was highly associated with iron and hemoglobin traits. Community 9 was associated with plasma cortisol (p=8.27e-5). Finally, we performed enrichment analysis on 36 blood cell count traits[39]. We found that community 16 was enriched for hematocrit (p=4.34e-10) and hemoglobin concentration (p=1.99e-8) and community 5 was enriched for reticulocyte traits (p=1.67e-6) (Supplementary Figure 30). The enrichments found for these DNAm communities indicate that a potentially valuable utility of mapping trans-mQTLs is to indicate how distal regions of the genome are functionally related.
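
The Wald ratio used above for the cis-to-trans causal estimate can be sketched as follows. This is a minimal illustration and not the study's code: the function name `wald_ratio`, the example effect sizes and the first-order standard error (which ignores uncertainty in the cis-mQTL effect) are all assumptions made for the example.

```python
def wald_ratio(beta_trans, se_trans, beta_cis):
    """Wald ratio: SNP effect on the trans site divided by SNP effect on the cis site."""
    if beta_cis == 0:
        raise ValueError("cis-mQTL effect must be non-zero")
    estimate = beta_trans / beta_cis
    # First-order standard error; ignores uncertainty in beta_cis (an assumption).
    se = se_trans / abs(beta_cis)
    return estimate, se


# Hypothetical effect sizes, for illustration only.
estimate, se = wald_ratio(beta_trans=0.06, se_trans=0.01, beta_cis=0.30)
print(round(estimate, 3), round(se, 4))
```

Comparing such ratios with the observed cis-trans correlation across many site pairs is what the weak Pearson correlation reported above summarizes.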

DNAm and complex traits share genetic factors

The majority of GWA loci map to non-coding regions[40] and cis-mQTLs are enriched amongst GWA signals[17,41,42]. Here we investigated the value of the large number of mQTLs, especially trans-mQTLs, to annotate the functional consequences of GWA loci. We first compared distributions of enrichment of cis and trans-mQTL categories among 41 complex traits. After accounting for non-random genomic distribution of mQTLs[43] and multiple testing, we identified enrichments for 35% of the complex traits, particularly for studies with a larger number of GWA signals (Supplementary Figure 31, Supplementary Table 17, Supplementary Note). The distribution of enrichment effect estimates (ORs) of trans-mQTLs was substantially closer to the null or in depletion when compared to mQTLs that included cis effects (Figure 2c). These enrichments correspond to the results reported earlier, in which trans-SNPs were typically depleted for enhancer and promoter regions, whereas complex trait loci are enriched for coding and regulatory regions[44]. Though the mQTL discovery pipeline adjusted for predicted cell types[45,46] and non-genetic DNAm principal components (PCs), there is a possibility that residual cell type heterogeneity remains. We performed another set of GWAS enrichment analyses, this time using 36 blood cell traits[39], and found enrichments. These were strongest amongst cis+trans mQTLs, as seen in the previous enrichments (Supplementary Figure 32). For 98.9-100% of the mQTLs, mQTL SNPs explained more variation in DNAm than they explained in blood cell counts, suggesting a causal chain of mQTL to blood trait[47]. Alternatively, a systematic measurement error difference could explain these observations, where DNAm captures blood cell counts more accurately than conventional measures. 
We next searched for instances of specific DNAm sites sharing the same genetic factors against each of 116 complex traits and diseases, and initially found 23,139 instances of an mQTL strongly associating with a complex trait (Figure 4). To evaluate the extent to which these were due to shared genetic factors (and not, for example, LD between independent causal variants), we performed genetic colocalization analysis[48] (Supplementary Tables 15, 18). Excluding genetic variants in the MHC region, we found 1,373 potential examples in which at least one DNAm site putatively shared a genetic factor with at least one of 71 traits (including 19 diseases). Those DNAm sites that had a shared genetic factor with a trait were 6.9 times more likely to be present in a community compared to any other DNAm site with a known mQTL (Fisher’s exact test 95% CI 4.8-9.7, p=9.2e-19). Next, we evaluated how often the DNAm site that colocalized with a known GWAS hit was the closest DNAm site to the lead GWAS variant by physical distance. Notably, in only 18.1% of the cases where a GWAS signal and an assayed 450k DNAm site colocalized, was that DNAm site the closest DNAm site to the signal. This finding is similar to results found for gene expression[49], but the converse has been found for protein levels[50].

Figure 4

Identifying putative causal relationships between sites and traits using bi-directional MR.

Aggregated results from a systematic bi-directional MR analysis between DNAm sites and 116 complex traits. The y-axis represents the two-sided p-value from MR analysis. The top plot depicts results from tests of DNAm sites colocalizing with complex traits. The light grey points represent MR estimates that either did not surpass multiple testing, or shared small p-values at both the DNAm site and complex trait but had weak evidence of colocalization. Bold, colored points are those that showed strong evidence for colocalization (posterior probability >0.8 for H4, i.e. one shared SNP for DNAm and trait). The bottom plot shows the two-sided -log10 p-values from MR analysis of risk factor or genetic liability of disease on DNAm levels. Extensive follow up was performed on DNAm site-trait pairs with putative associations, and those that pass filters are plotted in bold and colored according to the trait category. A substantial number of MR results in both directions exhibited very strong effects but failed to withstand sensitivity analyses.

It has previously been difficult to conclude whether genetic colocalization between DNAm and complex traits indicates a) a causal relationship where the DNAm level is on the pathway from genetic variant to trait (vertical pleiotropy) or b) a non-causal relationship where the variant influences the trait and DNAm independently through different pathways (horizontal pleiotropy)[51]. In Mendelian randomization (MR) it is reasoned that under a causal model, multiple independent genetic variants influencing DNAm should exhibit consistent causal effects on the complex trait[52]. Amongst the putative colocalizing signals, 440 (32%) involved a DNAm site that had at least one other independent mQTL. We cannot determine with certainty the causal relationship of any specific site with a trait. To test if there was a general trend of DNAm sites causally influencing a trait we evaluated if the MR effect estimates based on the colocalizing signals were consistent with those obtained based on the secondary signals. There were substantially more large genetic effects of the secondary mQTLs on respective traits than expected by chance (70 with p<0.05, binomial test p=2.4e-16). However, only 41 (59%) of these had effect estimates in the same direction as the primary colocalizing variant, which is not substantially better than chance (binomial test p=0.19). Of the 41 mQTLs, twelve were located in the HLA region. Of the remaining mQTLs, 27 were associated with anthropometric (ESR1 and birth weight), immune response (IRF5 and systemic lupus erythematosus) and lipid traits (TBL2 and triglycerides). We then performed systematic colocalization analysis of all mQTLs against 36 blood cell traits[39]. Here we discovered 94,738 instances of a DNAm site and a blood cell trait sharing a causal variant. In 28,138 instances the colocalizing DNAm site had an independent secondary mQTL, and with these associations we again tested for a general trend of DNAm sites causally influencing the blood trait. 
The association between independent signals was very weak (R2 = 0.008). Together, across the sites that were analyzable in this manner, these results indicate that those blood-measured DNAm sites that have shared genetic factors with traits cannot be typically thought of as mediating the genetic association to the trait (Extended Data Fig. 7, Supplementary Table 19). Instead, if DNAm is a co-regulatory phenomenon then the colocalizing signals between DNAm sites and complex traits may be due to a common cause, for example genetic variants primarily acting on TF binding[8,10].
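
The two binomial tests quoted above (an excess of secondary mQTLs with p<0.05, and direction concordance with the primary signal) can be sketched with the standard library alone. The sketch assumes the first test is one-sided with 440 trials and a null rate of 0.05, and that the second is an exact two-sided test of 41 successes in 70 trials against 0.5. The text does not state these details, so the function names and test settings are assumptions.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k, n, p):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def two_sided(k, n, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    pmfs = (binom_pmf(i, n, p) for i in range(n + 1))
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmfs if q <= observed * (1 + 1e-7)))


# Assumed settings: 70 of 440 secondary signals at p<0.05; 41 of 70 concordant in sign.
p_excess = upper_tail(70, 440, 0.05)
p_direction = two_sided(41, 70, 0.5)
print(f"{p_excess:.2e} {p_direction:.2f}")
```

The contrast between a very small first p-value and an unremarkable second one is the pattern the paragraph relies on: secondary signals are associated with traits, but not in a direction that consistently supports mediation by DNAm.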

Extended Data Fig. 7

Correspondence of MR estimates amongst multiple independent instruments.

a. To evaluate if a site having a shared causal variant with a trait was potentially due to the site being on the causal pathway to the trait, we reasoned that independent instruments for the site should exhibit effects on the outcome consistent with the original colocalizing variant. b. Amongst the putative colocalizing signals, 440 involved a DNAm site that had at least one other independent mQTL. The plot shows the causal effect estimate derived from the original colocalizing signal against the causal effect estimates obtained from the independent variants (n=440). Grey regions represent the 95% confidence interval of the slope. c. Correspondence of MR estimates amongst multiple independent instruments on 36 blood traits. To evaluate if a site having a shared causal variant with a blood trait was potentially due to the site being on the causal pathway to the trait, we reasoned that independent instruments for the site should exhibit effects on the outcome consistent with the original colocalizing variant. Amongst the putative colocalizing signals, 30% involved a DNAm site that had at least one other independent mQTL. The plot shows the causal effect estimate derived from the original colocalizing signal against the causal effect estimates obtained from the independent variants. The HLA region has been removed and betas are plotted.

The influence of traits on DNAm variation

Previous studies have not been adequately powered to estimate the causal influences of complex traits on DNAm variation through MR, as the sample size of the outcome variable (DNAm) is a predominant factor in statistical power[48,53]. We systematically analyzed 109 traits for causal effects on DNAm using two-sample MR[54,55], where each trait was instrumented using SNPs obtained from their respective previously published GWAS (Supplementary Note, Supplementary Table 15). Included amongst the traits were 35 disease traits, which when used as exposure variables in MR must be interpreted in terms of the influence of liability rather than presence/absence of disease. The sample size used to estimate SNP effects on DNAm was up to 27,750 (Figure 4). We initially identified 4,785 associations where risk factors or genetic liability to disease influences DNAm levels (multiple testing threshold p<1.4e-7). However, causal inference on omic variables can lead to false positives due to violations in the MR assumptions. We developed a filtering process involving a novel causal inference method to help protect against these invalid associations (Supplementary Note, Supplementary Figure 33). This left 85 associations (involving 84 DNAm sites) in which DNAm sites were putatively influenced by 13 traits (nine risk factors and four diseases) (Supplementary Table 20). Further filtering that would exclude traits that were predominantly instrumented by variants in the HLA region or driven by one SNP would reduce the total number of associations substantially from 84 to 19. We replicated five associations for triglycerides influencing DNAm sites near CPT1A and ABCG1[56] and found associations for transferrin saturation/iron influencing DNAm sites near HFE. 
We next evaluated if there was evidence for small, widespread changes in DNAm levels in response to complex trait variation, by calculating the genomic control inflation factor (GCin) for the p-values obtained from the MR analyses of each trait against all DNAm sites. Five traits (fasting glucose, age at menarche, cigarettes smoked per day, immunoglobulin G index levels, serum creatinine), showed GCin values above 1.05 (Extended Data Fig. 8). GCin calculations were performed at each chromosome singly for each trait (Supplementary Figure 34) and in a leave-one-chromosome-out analysis (Supplementary Figure 35). The GCin remained consistent (except for immunoglobulin G index levels), indicating that the traits have small but widespread influences on DNAm levels across the genome.
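
A genomic control inflation factor of this kind is conventionally the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.455). The following is a minimal sketch of that convention; the function name and the assumption that every p-value maps to a 1-df chi-square statistic are illustrative, and this is not the study's implementation.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(pvalues):
    """Median observed chi-square (1 df) over its expectation under the null."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), and chi-square = z**2.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in pvalues]
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spread p-values mimic the null, so the factor should be close to 1.
null_like = [(i + 0.5) / 1001 for i in range(1001)]
lam = genomic_inflation(null_like)
print(round(lam, 3))
```

Applying the same function per chromosome, or with one chromosome left out, gives the sensitivity checks described above.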

Extended Data Fig. 8

Genomic inflation factors for genome-wide scans of causal effects of traits on DNAm sites.

Each trait (x-axis) was tested for causal effects against (on average) 317,659 DNAm sites, excluding sites in the MHC region. The p-values from IVW MR analysis were used to estimate the genomic inflation for each trait (y-axis). Traits are ordered by genomic inflation factor.

While most of the traits (n=105, 96%) tested did not appear to induce genome-wide enrichment this does not rule out the possibility that they have many localized small effects. For example, the smallest MR p-value for the analysis of body mass index on DNAm levels was 2.27e-6, which did not withstand genome-wide multiple testing correction, and GCin was 0.95. However, restricting GCin to 187 sites known to associate with body mass index from a previous epigenome-wide association study (EWAS)[20] indicated a strong enrichment of low p-values (median GCin = 3.95). A similar pattern was found for triglycerides, in which genome-wide median GCin = 0.94 but the 10 sites known to associate with triglycerides from a previous EWAS[57] had an MR p-value of 8.3e-70 (Fisher’s combined probability test). These results indicate that traits causally influencing DNAm levels in blood is the most likely mechanism that gives rise to these EWAS hits. It also indicates that the general finding that there were very few filtered putative causal effects of risk factors or genetic liability to disease on DNAm could be due to true positives being generally very small, even to the extent that our sample size of up to 27,750 individuals was insufficient to find them.
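
Fisher's combined probability test, used above to pool the MR p-values of the triglyceride-associated sites, sums -2 ln(p) across k tests and compares the total with a chi-square distribution with 2k degrees of freedom. Because the degrees of freedom are even, the tail probability has a closed form and needs no statistics library. The sketch below assumes independent tests; the function name and example p-values are hypothetical.

```python
from math import exp, log


def fisher_combined(pvalues):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k df under the null."""
    k = len(pvalues)
    half_stat = -sum(log(p) for p in pvalues)  # X / 2
    # Survival function of chi-square with 2k df: exp(-X/2) * sum_{j<k} (X/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return exp(-half_stat) * total


# Hypothetical p-values, for illustration only.
combined = fisher_combined([0.01, 0.02, 0.03])
print(f"{combined:.2e}")
```

Many individually modest p-values can combine into a very small pooled p-value, which is how a set of EWAS sites can show strong joint evidence without any single site passing genome-wide correction.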

Discussion

A map of hundreds of thousands of genetic associations has enabled novel biological insights related to DNAm variation. Using a rigorous analytical framework enabled us to minimize heterogeneity and expand sample sizes for large omic data. This revealed a genetic architecture of DNAm that is polygenic. Given the diverse ranges of age, sex proportions and geographical origins between the cohorts in this analysis, the minimal extent of heterogeneity across datasets indicates that genetic effects on DNAm are relatively stable across contexts, at least when restricted to European ancestries. We show that cis- and trans-mQTLs operate through distinct mechanisms, as their genomic properties are distinct. Long-range associations may be driven by co-regulation through TF binding and nuclear organization. Though we found substantial sharing of genetic signals between DNAm sites and complex traits, we were able to demonstrate that this was not predominantly due to DNAm variation being on the causal path from genotype to phenotype. While our results were restricted to 1.5% of the DNAm sites in the genome and are limited by the two-phase design, these findings have several implications, especially in the context of EWASs that are often based on the same tissue and DNAm array. First, we anticipate that some previously reported EWAS associations are likely due to confounding or to reverse causation, in which the risk factor or genetic liability to the disease state itself alters DNAm and not vice versa. Second, the genetic effects on DNAm that overlap with complex traits likely primarily influence other regulatory factors which in turn influence complex traits and DNAm through diverging pathways. Third, DNAm might be on the causal pathway in a disease-relevant cell type or context. 
Fourth, if the path from genotype to complex trait is non-linear, for example involving the statistical interactions between different regulatory features[16], then our results indicate that large individual-level multi-omic datasets will be required to dissect such mechanisms. Higher density DNAm microarrays[12] or low-cost sequencing technologies[58] will expedite detailed interrogations of enhancer and other regulatory regions. Given our projection of mQTL yields expected for future studies, pleiotropy involving mQTLs is likely to be increasingly important to model when interpreting genotype-trait pathways. Overall, our data and results present a comprehensive atlas of genetic effects on DNA methylation. We expect that this atlas will be of use to the scientific community for studies of genome regulation and for causality analysis, and that it will contribute to the control of confounding in EWASs.

Online Methods

Study design overview

Initially, 38 independent studies were recruited to contribute data towards an mQTL meta-analysis, of which 36 studies (Supplementary Table 1, Supplementary Note) passed our stringent quality criteria described below. Conventional GWAS meta-analyses involve performing complete GWAS in each study, sharing the summary data and meta-analyzing every tested SNP. As an mQTL analysis involves ~450,000 GWAS analyses, it is difficult to store and share the complete summary data from 38 studies. To circumvent this problem, each study could perform GWAS analyses but provide only the associations that surpass a relaxed significance threshold (p<1e-5) in that study. Due to sampling variation the exact mQTL associations reported would differ between studies, meaning that the number of studies contributing to the meta-analysis would be highly variable and could be as low as two studies. This would introduce two problems. First, publication bias arises if it is in fact a null association because the studies demonstrating null effects would not contribute to counteract the inflated effects from those that do happen to surpass the threshold. Second, the precision of the effect estimate is limited by the number of studies that happen to contribute data on that association. To mitigate both problems the analysis in this study has been performed in two phases. In phase 1 of our study we performed mQTL analyses of 420,509 high quality DNAm sites[59] using data from 22 independent European studies to identify putative associations (Supplementary Table 1, Figure 1a) at a threshold of p<1e-5. We used two approaches to exclude DNAm sites from our analyses. First, we excluded 50,186 DNAm sites that were masked by Zhou et al.[59], which include probes with potential cross-reaction and probes that could not be mapped to the genome. 
Secondly, we removed an additional 14,882 probes including multi-mapping probes (bisulfite converted sequences allowing two mismatches at any position mapped to the hg19 primary assembly) and probes with variants (minor allele frequency (MAF) >5%, UK10K) at the CpG dinucleotide or the extension base (for type I probes). All candidate mQTL associations at p<1e-5 were combined to create a unique ‘candidate list’ of mQTL associations. In total we identified 102,965,711 candidate mQTL associations in cis (p<1e-5, +/-1 Mb from DNAm site) and 710,638,230 candidate mQTL associations in trans (>1Mb from DNAm site) in at least one dataset. 59% of the candidate mQTL associations in cis (n=61,103,065) and 2.4% of the associations in trans (n=17,246,702) were found in at least two datasets (Supplementary Figure 1). To reduce the computational burden, we included cis associations found in at least one dataset and trans associations in at least two datasets. The candidate list (n=120,212,413) was then sent back to all studies, and the association estimates were obtained for every mQTL association on the candidate list. In phase 2 of our study, we performed association tests for each of the candidate mQTL associations in 20 studies from phase 1 and 16 additional studies with European ancestry (total n=27,750) (Supplementary Table 1). The estimates for the candidate list were meta-analyzed to obtain the final results (Figure 1a). This two-phase approach had a single objective: to minimize the computational burdens of storing summary data from the complete analysis from every study. However, we have effectively performed a complete search of all candidate mQTL associations, though with likely loss of coverage. The significant results obtained from the meta-analysis are identical to what would have been identified had we performed a meta-analysis on every candidate mQTL association. 
The only difference between a complete scan and our scan was that we would have missed some associations that were not at p<1e-5 in any study but when combined across all studies would have surpassed an experiment wide multiple testing correction.
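
The candidate list rule described above (cis associations seen in at least one phase 1 dataset, trans associations seen in at least two) can be sketched with a counter over per-study hit sets. The data layout, function name and identifiers below are hypothetical; the sketch only illustrates the selection logic.

```python
from collections import Counter


def build_candidate_list(study_hits):
    """Keep cis hits seen in >= 1 study and trans hits seen in >= 2 studies.

    study_hits: one set per study of (snp, dnam_site, kind) tuples that passed
    p<1e-5 in that study, where kind is "cis" or "trans" (hypothetical layout).
    """
    counts = Counter(hit for hits in study_hits for hit in hits)
    required = {"cis": 1, "trans": 2}
    return {hit for hit, n in counts.items() if n >= required[hit[2]]}


# Toy example with made-up identifiers.
study_a = {("snp1", "cgA", "cis"), ("snp2", "cgB", "trans"), ("snp3", "cgC", "trans")}
study_b = {("snp2", "cgB", "trans")}
candidates = build_candidate_list([study_a, study_b])
print(sorted(candidates))
```

The reported candidate list size follows from the same rule: 102,965,711 cis associations in at least one dataset plus 17,246,702 trans associations in at least two datasets gives 120,212,413.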

Data preparation

Participants

To study the relationship between common genetic variation and DNAm, we focused on studies of European ancestry with genotype data imputed to the 1000 Genomes reference panel[11] and DNAm profiles quantified from bisulfite-converted genomic whole blood DNA using the Infinium HumanMethylation BeadChip (HumanMethylation450 or EPIC arrays). Details of the studies for discovery and replication are provided in Supplementary Table 1 and Supplementary Note.

The GoDMC pipeline

To facilitate the harmonization of the large volume of data we developed a GoDMC pipeline that was split into several modules, each focusing on the separate tasks of data checking, genotype preparation, phenotype and covariate preparation, DNAm data preparation, and subsequent analyses. In the first module the data format of the genotype data, DNAm and covariate data was checked. In addition, the number of individuals with DNAm and genotype data (requirement of n>100), the number of SNPs, the number of sites, covariates including cell counts, genotype build and strand, and the number of DNAm outliers were recorded. We also generated matrices with mean and SD by DNAm site and study descriptives. The entire pipeline can be viewed at https://github.com/MRCIEU/godmc, and the following text describes the procedures that were used.

Genotype data

Each study performed quality control on genotype data for all autosomes and chromosome X (if available) and imputed to 1000 Genomes Project phase 1 or above using hg19/build37. Dosages were converted to best-guess genotypes without a probability cut-off. SNPs that failed Hardy-Weinberg equilibrium (p<1e-6), had a MAF<0.01, an info score <0.8 or missingness in more than 5% of the participants were removed. We recoded SNPs to CHR:POS[11] format and removed duplicate SNPs. We then harmonized the recoded SNPs to the 1000 Genomes Project reference using easyQC_v9.2[60]. This harmonization script removed SNPs with mismatched alleles and recoded INDEL alleles to I and D. We performed a sex check and removed participants discordant with the covariate file. We extracted a set of common HapMap3 SNPs (MAF>0.2), excluded long-range LD regions and LD-pruned the remaining SNPs before calculating the first 20 genetic PCs. We used PLINK 2.0[61] for unrelated participants and GENESIS[62] for related participants to identify ancestry outliers. Ancestry outliers that deviated 7 SDs from the mean were removed. After outlier removal we recalculated genetic PCs for use in subsequent analyses. To identify relatedness in unrelated datasets, we pruned the genotype data to a set of independent HapMap 3 SNPs with MAF>0.01 and calculated genome-wide average identity by state (IBS) using PLINK 2.0. Participants with IBS>0.125 were removed.
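
The SNP-level exclusion thresholds above can be expressed as a single predicate. This is a sketch under the assumption that per-SNP summary statistics are already available; the `SnpStats` record and its field names are hypothetical and do not come from the GoDMC pipeline.

```python
from dataclasses import dataclass


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record."""
    snp_id: str
    hwe_p: float         # Hardy-Weinberg equilibrium p-value
    maf: float           # minor allele frequency
    info: float          # imputation info score
    missing_rate: float  # fraction of participants with a missing genotype


def passes_qc(s):
    """Keep a SNP unless HWE p<1e-6, MAF<0.01, info<0.8 or missingness >5%."""
    return (s.hwe_p >= 1e-6 and s.maf >= 0.01
            and s.info >= 0.8 and s.missing_rate <= 0.05)


# Made-up records, for illustration only.
snps = [
    SnpStats("1:1000", hwe_p=0.4, maf=0.25, info=0.97, missing_rate=0.01),
    SnpStats("1:2000", hwe_p=1e-9, maf=0.30, info=0.99, missing_rate=0.00),
    SnpStats("1:3000", hwe_p=0.7, maf=0.005, info=0.95, missing_rate=0.02),
]
kept = [s.snp_id for s in snps if passes_qc(s)]
print(kept)
```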

DNAm data normalization and quality control

DNAm was measured in whole blood or cord blood using HumanMethylation450 or EPIC arrays in at least 100 European individuals. Each study performed normalization and quality control on the DNAm data independently, with most studies using functional normalization through the R package meffil v0.1.0[63] (Supplementary Table 1). Briefly, meffil has been designed to preprocess raw idat files to a normalization matrix for large sample sizes without large computational memory requirements and to perform quality control in an automated way where the analyst can adjust default parameters easily. Sample quality control included removal of participants where more than 10% of the DNAm sites failed the detection p-value of 0.1 and/or threshold of 3 beads. In addition, mismatched samples were identified by comparing the 65 SNPs on the DNAm array to the genotype array and a sex check. Additional DNAm quality was checked by the methylated versus unmethylated ratio, dye bias using the normalization control probes and bisulfite control probes. Protocols can be found here: https://github.com/perishky/meffil/wiki. For each DNAm site, we replaced outliers that were 10 SDs from the mean (3 iterations) with the DNAm site mean.
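
The iterative outlier replacement in the last step can be sketched as below. The sketch assumes that the mean and SD are recomputed from all current values at each of the three iterations and that "10 SDs from the mean" means an absolute deviation greater than 10 SDs. The text does not spell out these details, so they are assumptions, as is the function name.

```python
from statistics import mean, stdev


def replace_outliers(values, n_sd=10.0, iterations=3):
    """Replace values more than n_sd SDs from the site mean with that mean."""
    cleaned = list(values)
    for _ in range(iterations):
        m, s = mean(cleaned), stdev(cleaned)
        if s == 0:
            break  # no variation left, nothing can be an outlier
        cleaned = [m if abs(v - m) > n_sd * s else v for v in cleaned]
    return cleaned


# Simulated site: 1,000 tightly clustered values plus one extreme value.
site = [0.10 + 0.001 * ((i % 11) - 5) for i in range(1000)] + [0.95]
cleaned = replace_outliers(site)
print(max(site), round(max(cleaned), 3))
```

Note that with a 10 SD threshold an outlier can only be flagged when the sample is large, because a single value cannot sit that many SDs from a mean it strongly influences in a small sample.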

Covariates

We used sex, age at measurement, batch variables (slide, plate, row if available), smoking (if available) and recorded cell counts to adjust for possible confounding and to reduce residual variation. Additional confounders (genetic PCs, non-genetic DNAm PCs, and where necessary predicted smoking and cell counts) were calculated using the GoDMC pipeline. After quality control and normalization of the DNAm data, we predicted smoking status by using previously reported DNAm associations with smoking[64]. In addition, we predicted cell counts using the Houseman algorithm[46] implemented in meffil v0.1.0[63]. We performed a principal component analysis on the 20,000 most variable autosomal DNAm sites and kept all PCs that cumulatively explained 80% of the variance. We performed GWASs on the DNAm PCs and retained the PCs that were not associated with a genotype (p>1e-7). We kept a maximum of 20 non-genetic PCs for subsequent adjustment.
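
The selection of non-genetic DNAm PCs can be sketched as a three-step filter: take leading PCs until they cumulatively explain 80% of the variance, drop any PC associated with a genotype at p<=1e-7, and cap the result at 20. The inputs (variance explained per PC and the smallest GWAS p-value per PC) and the function name are hypothetical, and the order in which the steps are applied is an assumption.

```python
def select_non_genetic_pcs(var_explained, min_gwas_p,
                           cum_threshold=0.80, p_threshold=1e-7, max_pcs=20):
    """Return indices of PCs kept for adjustment.

    var_explained: fraction of variance explained by each PC, in PC order.
    min_gwas_p: smallest SNP association p-value observed for each PC.
    """
    selected, cumulative = [], 0.0
    for idx, (ve, p) in enumerate(zip(var_explained, min_gwas_p)):
        if cumulative >= cum_threshold:
            break  # leading PCs already explain the target share of variance
        cumulative += ve
        if p > p_threshold:  # retain only PCs without a genetic association
            selected.append(idx)
    return selected[:max_pcs]


# Made-up inputs: the second PC has a strong genetic association and is dropped.
pcs = select_non_genetic_pcs([0.5, 0.2, 0.15, 0.05], [0.5, 1e-9, 0.2, 0.3])
print(pcs)
```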

DNAm data adjustment

We attempted to minimize non-genetic variation in the DNAm data to improve power for mQTL detection. We adjusted datasets with predominant family structures (pedigrees, twin studies) and population-based studies in slightly different ways. For unrelated participants we regressed out age, sex, predicted cell counts, predicted smoking and genetic PCs (adjustment 1). For related participants we did the same except also fitting the genetic kinship matrix using the method described in GRAMMAR[65]. We took the residuals from the first adjustment forward and regressed out the non-genetic DNAm PCs from the adjusted DNAm beta values (adjustment 2). The residuals from these analyses were rank transformed and centered to have mean 0 and variance 1.
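
The final rank transformation can be illustrated with a rank-based inverse normal transform followed by standardization. The text states only that residuals were rank transformed to mean 0 and variance 1, so the specific choices below (average ranks for ties, the (rank - 0.5)/n offset and the function name) are assumptions.

```python
from statistics import NormalDist, mean, pstdev


def rank_inverse_normal(values):
    """Map values to normal quantiles by rank, then scale to mean 0 and variance 1."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values so they share an average rank.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    z = [nd.inv_cdf((r - 0.5) / n) for r in ranks]
    m, s = mean(z), pstdev(z)  # s is zero if all inputs are identical
    return [(v - m) / s for v in z]


# Made-up residuals, for illustration only.
residuals = [0.2, 0.9, 0.5, 0.1, 0.7]
transformed = rank_inverse_normal(residuals)
print([round(v, 2) for v in transformed])
```

A transform of this kind removes the influence of skewed or heavy-tailed residual distributions on the regression, at the cost of expressing effects in SD units rather than on the original beta-value scale.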

Positive and negative controls

Before we performed the meta-analysis, we checked the number of SNPs and INDELs, the number of sites and individuals analyzed and the average mean and SD for each DNAm site to identify possible inconsistencies. Each of the 38 studies conducted a GWAS of cg07959070. We chose this DNAm site as a positive control as it showed a strong cis-mQTL on chr22 in several datasets and has not been proposed for exclusion from the analyses by probe annotation efforts[59,66-68]. To identify possible errors, we checked the cis association on chromosome 22 (p<0.001) for this DNAm site. In addition, we checked quantile-quantile and Manhattan plots for this DNAm site. We also used this control to identify studies with deflated or inflated lambdas (lambda >1.1 or lambda <0.9). We noticed deflation of the genomic lambda after adjustment for the index cis SNP in datasets with relatedness. However, lambdas were around 1 when not adjusted. After inspection one study was removed from the analysis due to deflation and one study was removed due to a lack of the positive control association signal, leaving 36 studies for the final meta-analysis.

Association analyses

Phase 1: creating the candidate list of associations

We performed a fast, comprehensive analysis of all cis- and trans-associations on 420,509 reliable[59] residualized DNAm sites separately in 22 studies (n=16,907) using the R package Matrix eQTL v2.1.0[69]. For each DNAm site j the residual value $y_j$ was regressed against each SNP k, $y_j = \alpha_{jk} + \beta_{jk} x_k + e_{jk}$, where genotype values $x_k$ were coded as allele counts {0,1,2}, $\alpha_{jk}$ was the intercept term, $\beta_{jk}$ was the effect estimate of each SNP k on each residualized DNAm site j, and $e_{jk}$ was the residual error.
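
For a single SNP-site pair, the regression described above reduces to ordinary least squares with one predictor. The sketch below shows that calculation in plain Python with made-up data. It is not how Matrix eQTL is implemented (that package uses large matrix operations for speed), and the function name is hypothetical.

```python
from math import sqrt


def snp_regression(x, y):
    """Least-squares fit of y = alpha + beta * x; returns (alpha, beta, se_beta)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    # Residual variance with n - 2 degrees of freedom.
    rss = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    se_beta = sqrt(rss / (n - 2) / sxx)
    return alpha, beta, se_beta


# Made-up allele counts and residualized DNAm values.
genotypes = [0, 1, 2, 1, 0, 2, 1, 0]
dnam = [-0.9, 0.1, 1.2, -0.1, -1.1, 0.8, 0.2, -0.7]
alpha, beta, se_beta = snp_regression(genotypes, dnam)
print(round(beta, 3), round(se_beta, 3))
```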

Phase 2: obtaining summary data from all studies for meta-analysis

This candidate list was sent to 36 studies (n=27,750) where effect sizes for all putative associations were recalculated by fitting linear models. For putative cis-mQTLs we performed linear regression as in phase 1. To improve statistical power to estimate the trans-mQTL effects we recorded the top cis SNP $x_c$ for each DNAm site (based on lowest p-value within that study) and fit this as a covariate in the trans-mQTL regressions, $y_j = \alpha_{jk} + \beta_{jk} x_k + \gamma_{jc} x_c + e_{jk}$, where $\gamma_{jc}$ is the effect of the top cis SNP on DNAm site j.

Evaluation of DNAm data adjustment

As adjustment for non-genetic DNAm PCs might have substantial benefits on power or an adverse effect by inducing collider bias[70], we explored the impact by comparing mQTLs not adjusted for non-genetic PCs with mQTLs adjusted for non-genetic PCs in ARIES. Specifically, we found 80,890 clumped mQTL associations in the PC-adjusted dataset and 74,402 clumped mQTL associations in the PC-unadjusted dataset. The Pearson correlation between effect sizes of the PC-unadjusted clumped mQTLs vs PC-adjusted mQTLs (cis r=0.998; trans r=0.998) and PC-adjusted clumped mQTLs (cis r=0.997; trans r=0.997) versus PC-unadjusted mQTLs was very high (Supplementary Figure 36). These results suggest that if collider bias is impacting the results it is extremely small. The simplest explanation for the minimal difference in effect sizes and slightly higher mQTL yield amongst the PC-adjusted mQTLs is that reduced residual variance has improved power.

Impact of two-stage design on power of study

Though the multi-stage study design was performed out of practical necessity, we evaluated the impact it had on statistical power in comparison to the hypothetical situation of analysing all the data together in a standard one-stage mQTL design. For cis-mQTL associations we calculated the power of detecting an association in at least one of 22 studies at p<1e-5. To do this we calculated the probability of missing an association as being the product of the probability of missing it in study 1 AND in study 2 AND in study 3 etc., $P(\text{miss}) = \prod_{i=1}^{22} \int_0^{19.5} f(x; k, \lambda_i)\,dx$, where $f(x; k, \lambda_i)$ is the probability density function for the non-central chi-square distribution with k degrees of freedom and $\lambda_i$ the non-centrality parameter based on the postulated variance explained by an mQTL (r2) and the study sample size $n_i$, and 19.5 denotes the chi-square threshold at p=1e-5 with one degree of freedom. For trans-mQTL associations we calculated the power to detect an association in at least two of 22 studies at p<1e-5. We calculated the probability of missing an association as being the product of the probability of missing it in both study 1 and study 2 AND in study 1 and study 3 AND in study 1 and study 4 etc., where $f(x; k, \lambda)$ is the probability density function for the non-central chi-square distribution with k degrees of freedom and $\lambda$ the non-centrality parameter based on the postulated variance explained by an mQTL (r2) and the study sample sizes $n_i$ and $n_j$, and 19.5 denotes the chi-square threshold at p=1e-5 with one degree of freedom. We found that we have no loss of power (<1%) for loci that explain more than 1.2% or less than 0.1% of the variance. Within these bounds >80% of power is lost for cis-mQTLs with r2 0.16% to 0.38%. For trans-mQTLs, power suffers slightly more because of requiring detection by at least two studies in the first stage (r2 0.27% to 0.64%) (Extended Data Fig. 4a).
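
The cis-mQTL part of this power calculation can be sketched without a statistics library, because the non-central chi-square distribution with one degree of freedom has a closed-form CDF in terms of the normal CDF. The non-centrality parameter used here, n * r2 / (1 - r2), is a common choice but is an assumption, as are the function names and the equal study sizes in the example.

```python
from math import sqrt
from statistics import NormalDist

CHI2_THRESHOLD = 19.5  # chi-square (1 df) threshold at p = 1e-5, as given in the text


def study_power(n, r2, threshold=CHI2_THRESHOLD):
    """Power of one study of size n to detect an mQTL explaining r2 of the variance."""
    ncp = n * r2 / (1 - r2)  # assumed form of the non-centrality parameter
    nd = NormalDist()
    a, b = sqrt(threshold), sqrt(ncp)
    # For 1 df: P(X <= t) = Phi(sqrt(t) - sqrt(ncp)) - Phi(-sqrt(t) - sqrt(ncp))
    p_miss = nd.cdf(a - b) - nd.cdf(-a - b)
    return 1 - p_miss


def power_any_study(sample_sizes, r2):
    """Probability that at least one study detects the association."""
    p_miss_all = 1.0
    for n in sample_sizes:
        p_miss_all *= 1 - study_power(n, r2)
    return 1 - p_miss_all


# Hypothetical design: 22 studies of 500 participants each.
sizes = [500] * 22
for r2 in (0.001, 0.003, 0.012):
    print(r2, round(power_any_study(sizes, r2), 3))
```

Comparing this quantity with the power of a single pooled analysis across a grid of r2 values gives the kind of power-loss curve summarized in Extended Data Fig. 4a.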

Meta-analyses

We used the SNP effect estimates and standard errors for each SNP-DNAm site pair in the candidate list in the meta-analyses. Inverse-variance fixed-effects (FE) meta-analyses of the 36 studies were performed using METAL[71]. We modified METAL (https://github.com/explodecomputer/random-metal) to incorporate the DerSimonian and Laird random-effects (RE) models[72] and multiplicative random-effects (MRE) models[73]. These results are available at: http://mqtldb.godmc.org.uk/. We also inspected the meta-analysis and conditional analysis (see below) logfiles and removed any SNPs that had inconsistent allele codes between studies, which in almost all cases were multi-allelic SNPs. We inspected our results by counting the number of associations for each pattern of effect size direction (+ or −) across studies. The number of associations was highest when the direction of the effect sizes agreed across studies (Supplementary Figure 2a). In addition, the average I² heterogeneity estimate across the effect size direction categories was 44% (min=0%, max=100%). For categories with more than 100 associations, the average I² was 49% (min=36%, max=61%) (Supplementary Figure 2b). We also explored whether the number of phase 1 studies was correlated with I² and τ². We found nonsignificant correlations (r=0.002, p=0.23 and r=−0.001, p=0.32, respectively), indicating that mQTL associations found in a low number of phase 1 studies did not show more heterogeneity than mQTL associations found in a high number of phase 1 studies. To explore heterogeneity further, we meta-analyzed our SNP-DNAm pairs using FE, RE and MRE models and found that associations that were dropped in MRE analyses showed higher I² and τ² and smaller effect sizes and DNAm site SDs (Supplementary Figures 3-4). Further inspection showed that trans only sites had higher I² heterogeneity statistics than associations from cis only or cis+trans sites (mean I² values of 53%, 46% and 39%, respectively).
However, as I² and τ² were positively correlated with effect sizes (Supplementary Figure 2c), we deemed the use of FE meta-analysis to be appropriate for reducing false negative rates. Further downstream analyses are described in the Supplementary Note.
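For readers unfamiliar with the estimators named above, the following sketch shows textbook versions of the inverse-variance FE estimate, Cochran's Q with the derived I² and DerSimonian-Laird τ², the resulting RE estimate, and one common formulation of a multiplicative model in which the FE variance is scaled by max(1, Q/(k − 1)). It is an illustration only: it is not the code of METAL or of the modified METAL, the function names are hypothetical, and the multiplicative formulation in particular is an assumption about what reference [73] describes.

```python
from math import sqrt


def fixed_effects(betas, ses):
    """Inverse-variance weighted estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, sqrt(1.0 / total)


def heterogeneity(betas, ses):
    """Return Cochran's Q, the DerSimonian-Laird tau^2, and I^2 in percent."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta_fe, _ = fixed_effects(betas, ses)
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    c = total - sum(w * w for w in weights) / total
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, tau2, i2


def random_effects_dl(betas, ses):
    """DerSimonian-Laird estimate: add tau^2 to each study's variance."""
    _, tau2, _ = heterogeneity(betas, ses)
    weights = [1.0 / (se ** 2 + tau2) for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    return beta, sqrt(1.0 / total)


def multiplicative_random_effects(betas, ses):
    """ASSUMED multiplicative model: FE point estimate, variance scaled by max(1, Q/df)."""
    beta, se = fixed_effects(betas, ses)
    q, _, _ = heterogeneity(betas, ses)
    df = len(betas) - 1
    phi = max(1.0, q / df) if df > 0 else 1.0
    return beta, se * sqrt(phi)


if __name__ == "__main__":
    # Made-up effect estimates and standard errors for one SNP-DNAm site pair.
    betas = [0.21, 0.18, 0.30, 0.12]
    ses = [0.05, 0.04, 0.08, 0.06]
    print(fixed_effects(betas, ses))
    print(heterogeneity(betas, ses))
    print(random_effects_dl(betas, ses))
    print(multiplicative_random_effects(betas, ses))
```

Under the multiplicative formulation the point estimate equals the FE estimate and only the standard error grows with heterogeneity, whereas the DerSimonian-Laird model also reweights studies, so the two can drop different sets of associations at a fixed significance threshold.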

Extended Data Fig. 1: Quality control of 36 studies.

Extended Data Fig. 2: Distance of SNP from DNAm site.

Extended Data Fig. 3: Effect sizes and weighted standard deviation (SD) for each mQTL category.

Extended Data Fig. 4: Impact of the two-stage design on mQTL coverage.

Extended Data Fig. 5: Correlation of mQTL effects (p<1e-14) between blood and other tissues.

Extended Data Fig. 6: Two-dimensional enrichment of SNP and DNAm site TFBS annotation.

Extended Data Fig. 7: Correspondence of MR estimates amongst multiple independent instruments.

Extended Data Fig. 8: Genomic inflation factors for genome-wide scans of causal effects of traits on DNAm sites.
