Literature DB >> 27480116

Gene, Environment and Methylation (GEM): a tool suite to efficiently navigate large scale epigenome wide association studies and integrate genotype and interaction between genotype and environment.

Hong Pan^1,2, Joanna D Holbrook¹, Neerja Karnani^1,3, Chee Keong Kwoh⁴.

Abstract

BACKGROUND: The interplay among genetic, environment and epigenetic variation is not fully understood. Advances in high-throughput genotyping methods, high-density DNA methylation detection and well-characterized sample collections, enable epigenetic association studies at the genomic and population levels (EWAS). The field has extended to interrogate the interaction of environmental and genetic (GxE) influences on epigenetic variation. Also, the detection of methylation quantitative trait loci (methQTLs) and their association with health status has enhanced our knowledge of epigenetic mechanisms in disease trajectory. However analysis of this type of data brings computational challenges and there are few practical solutions to enable large scale studies in standard computational environments.
RESULTS: GEM is a highly efficient R tool suite for performing epigenome wide association studies (EWAS). GEM provides three major functions named GEM_Emodel, GEM_Gmodel and GEM_GxEmodel to study the interplay of Gene, Environment and Methylation (GEM). Within GEM, the pre-existing "Matrix eQTL" package is utilized and extended to study methylation quantitative trait loci (methQTL) and the interaction of genotype and environment (GxE) to determine DNA methylation variation, using matrix based iterative correlation and memory-efficient data analysis. Benchmarking presented here on a publicly available dataset, demonstrated that GEM can facilitate reliable genome-wide methQTL and GxE analysis on a standard laptop computer within minutes.
CONCLUSIONS: The GEM package facilitates efficient EWAS study in large cohorts. It is written in R code and can be freely downloaded from Bioconductor at https://www.bioconductor.org/packages/GEM/ .

Entities: Chemical Disease Gene Species

Keywords: EWAS; GxE; Matrix operation; methQTL

Mesh：

Year: 2016 PMID： 27480116 PMCID： PMC4970299 DOI： 10.1186/s12859-016-1161-z

Source DB: PubMed Journal: BMC Bioinformatics ISSN： 1471-2105 Impact factor: 3.169

Background

Understanding DNA methylation biomarkers of environmental exposures and developmental trajectories to disease is highly desirable [1] and their discovery is the aim of many epigenome wide association studies (EWAS) [2, 3]. The computational burden in analyzing the genomics data from this type of studies is considerable due to the high number of variables returned from epigenetic screens, for instance >483,000 individual measures from the widely used Illumina Infinium HumanMethylation450 Array (Infinium450K) [4] or the millions of loci covered by RRBS [5] or methyl-capture technologies [6, 7]. Hundreds or thousands of subjects are required to provide the statistical power to draw inference in EWAS studies [8]. The need to include covariates pertaining to the subjects, such as gender, ethnicity and social economic status [9], and to the samples, such as cellular heterogeneity [10-12], increase the computational time needed to run statistical models. Some of these problems are familiar from the genome wide association studies (GWAS) field, although DNA methylation profile is surrogated by continuous percentage values and distributed very differently from genotype calls. However, what has really pushed EWAS studies to the brink of what is computationally possible, is the realization that DNA methylation levels are not just specified by extrinsic factors but also are influenced by genotype. Polymorphisms close to CpGs in the same chromosome (cis-) often form methylation quantitative trait loci (methQTLs) with nearby CpGs [13-15], or blocks of cis- polymorphisms associated with a cluster of methylation quantitative trait loci, named GeMES (15, 19). MethQTLs can be discovered by correlating single nucleotide polymorphism (SNP) data with CpG methylation from the same samples. Creating a genome wide methQTL map requires assessing the correlation of genotype at millions of SNPs with thousands to millions of CpG methylation states, by millions multiplied with millions linear iterative regressions. Sun 2014 [16] surveyed methQTL studies between year 2010–2014 and found that most of methQTL studies were restricted to screen cis- SNP-CpG pairs, while some were even restricted to the 50,000 bp to 1,000,000 bp regions flanking to each SNP. However SNPs far from the CpG or in different chromosome (trans-) were also reported to be associated with CpG. Trans- methQTLs have been detected to be relevant to normal or disease states in many studies [17]. Furthermore, it is now apparent that genotype can work in interaction with environment (GxE) to influence specific DNA methylation levels [18, 19] and these can be linked to phenotypes [20, 21]. This type of correlated methylation structure has implications for statistical models whereby genotype and environment, or genotype and methylation interact to predict methylation levels or phenotype. This has exponentially increased the computational burden for the proper analysis of EWAS data. Large-scale genomic research benefits from high-performance computing (HPC) environments together with parallel computing techniques. However, the operation and integration of results needs domain expertise [22] and HPC is not always easily accessed by biology lab researchers. Therefore, we were motivated to develop computational solutions that allow biological researchers to explore EWAS, methQTLs and GxE using standard desktop computers within realistic computational times. A R package called MatrixEQTL [23] was developed for expression quantitative train loci (eQTL) analysis. Based on matrix operation, iterative correlation was implemented to achieve computational efficiency, and data was sliced into blocks to achieve memory efficiency. A function in MatrixEQTL that allows inclusion of interaction terms in correlative statistical models, gained our attention, though the author did not highlight it when the package was reported. We deployed the fast and efficient MatrixEQTL software and created a tool suite to explore the associations of Gene, Environment and Methylation. We named the tool suite “GEM”. It provides three fast linear regression models denoted Emodel, Gmodel and GxEmodel to facilitate analyses in EWAS. The GEM_Emodel tests the association of methylome marks and environmental factors; the GEM_Gmodel creates a methQTL genome-wide map; finally, the GEM_GxE model tests the ability of gene and environmental interaction models to predict DNA methylation levels. We benchmarked the performance of the GEM operations on a publicly available EWAS dataset generated on the Infinium450K array with concurrent genotyping on the OmniExpress Array and simulated environment and phenotype information on 237 neonates. Our results demonstrated that the GEM package can facilitate reliable EWAS analyses within minutes, in a standard computational setting (processor = 2.2GHz, RAM = 8G, system = window7 64bit).

GEM implementation

Simplifying the data input into a methylation matrix as , genetic variants matrix as , and the environment vector as , and the matrix for covariates as , and using a pseudo coding language like R script, we can denote Emodel (detecting methylation markers associated with environment) function as , Gmodel (detecting methylation markers associated with genotype i.e. methQTLs) as and GxE model (interaction of genotype and environment to specify methylation marks) as The genome wide studies for Emodel, Gmodel and GxEmodel can be accomplished by calling R function iteratively by millions of times, which were denoted as LM_Emodel (Table 1), LM_Gmodel (Table 2) and LM_GxEmodel.

Table 1

Pseudo R code for Emodel

Table 2

Pseudo R script to explore methQTLs by Gmodel

Pseudo R code for Emodel Pseudo R script to explore methQTLs by Gmodel Shabalin [23] introduced matrix standardization and projection and successfully made an ultra-fast software for expression quantitative trait loci (eQTL). Basically, to quantify the strength of the relationship between and controlled by covariates (), a practical regression is, where , , and are coefficients, is to estimated. A standardization method (22) was applied to vector , then the projections of and to are, where denotes inner product of two matrix. After these operations, the linear regression between and with covariates can be simplified into the calculation of inner product of the projects of x and y as and estimation of the test statistics. Shabalin [23] also demonstrated the strategy to slice the large matrix into a small “blocks” in the correlation calculation for memory efficiency, which make the software able to handle data matrix with millions of rows and columns feasible in normal computational setting. GEM tools called MatrixEQTL [23] library and implemented the below models which were used in [18], which was implemented by calling matrixEQTL with “modelLINEAR”, replacing gene expression with methylation, and SNP with environmental data. which was implemented by calling matrixEQTL with “modelLINEAR”, replacing gene expression with methylation. which was implemented by calling matrixEQTL with “modelLINEAR_CROSS”, replacing gene expression with methylation. Emodel finds the association between methylation and environment genome-wide by performing millions of linear regression (N = number_of_CpGs). The output of Emodel for particular phenotype, environmental factor or disease trait is a list of CpGs that are potential epigenetic biomarkers, as in Table 1. Table 2 demonstrates the pseudo code that used function by iterative loops for Gmodel, we denoted it as LM_Gmodel. The best fit is chosen by the largest R squared value. Replacing the linear regression equation (line 6) in Table 2 by “fit < − summary(lm(M(i) ~ G(j) * E + cvrt))”, produces the pseudo code for the implementation of LM_GxE model. The output of GxEmodel is a list of CpG-SNP-Env triplets, indicating the CpG-Env association segregated by genotype. The significant association of each triplet implies the methylation change is determined by the interplay between genotyping and environment. Both implementations indicate the number of linear regression as N = number_of_SNPs x number_of_CpGs. N could be billions of linear regressions engendering a very substantial computational task. However, using GEM tools, calculation of methQTLs and GxE interactions can be accomplished with much improved computational efficiency.

Results

To benchmark GEM suite, we used the dataset from Teh et al. [18]. The standard laptop used for time comparisons had a 2.2GHz processor, 8G RAM, a windows 7 operating system and was 64 bit, which is typical in an academic setting. The HPC structure had eight parallel processes of each with eight core CPUs. In [18], we studied the 1423 variably methylated regions from the methylomes of 237 neonates, and their association with 708,365 genetic variants and nineteen environmental factors made up of maternal conditions and birth outcomes. The methylome and genotype data are publically available at the NCBI Gene Expression Omnibus (GEO: http://www.ncbi.nlm.nih.gov/geo/ under accession numbers GSE53816 and GSE54445.) Environmental factors were simulated. A schematic of the analyses performed is shown in Fig. 1. When the original analyses were conducted, multivariate regression models were applied sequentially in a HPC environment. For the same dataset, we compared the time taken to implement LM_Gmodel, LM_GxEmodel and LM_Emodel in a standard and HPC computational environment with the time taken to implement GEM_Gmodel, GEM_GxEmodel and GEM_Emodel on a standard laptop (Table 3).

Fig. 1

Table 3

Benchmarking time consumption of GEM implementations on Emodel, Gmodel and GxEmodel by comparing normal R script in a public available dataset in standard laptop and HPC settings

	Dataset: Teh et al., 1423 CpGs, 708,365 SNPs and 19 environments in a standard laptop
Method	Time cost on standard laptop	Time cost in HPC	Method	Time cost
LM_Emodel	95.1 s		GEM_Emodel	18.9 s
LM_Gmodel	> = 60 days (^a)	3 h	GEM_Gmodel	5.2 min
LM_GxEmodel	> = 60 days (^a)	21 h	GEM_GxEmodel	1.5 h

aThe time for LM_Gmodel and LM_GxEmodel in standard laptop was computed based on the time cost on 10 CpGs

A schematic of the analyses performed within Teh et al. [17]. Gmodel sequentially tests the association of 1423 methylation marks with 708,365 genotypes (and covariates), Emodel sequentially tests the association of 1423 methylation marks with 19 environmental factors and GxEmodel sequentially tests that association of 1423 methylation marks with the interaction of every combination of the 708,365 genotypes and 19 environmental factors. Benchmarking time consumption of GEM implementations on Emodel, Gmodel and GxEmodel by comparing normal R script in a public available dataset in standard laptop and HPC settings aThe time for LM_Gmodel and LM_GxEmodel in standard laptop was computed based on the time cost on 10 CpGs

Benchmarking GEM_Emodel

A substantial time saving was achieved using GEM_Emodel compared to standard sequential regression as LM_Emodel as in Table 1. (19 s compared to 95 s for 19 Emodels on 1423 CpGs). Results achieved were identical. In addition, GEM_Emodel has the option to create Q-Q plot for theoretical distribution and observed distribution on p values for every environment e.g. Fig. 2.

Fig. 2

QQplot for pvalues from iterations of ~250,000 CpGs with an environment factor. It was produced by GEM_Emodel

QQplot for pvalues from iterations of ~250,000 CpGs with an environment factor. It was produced by GEM_Emodel As subject numbers increase, computational time to run sequential models increases exponentially, whilst computational time in GEM_Emodel increases linearly. Figure 3 shows the computational time required for one Emodel on 100–1000 subjects for ~250,000 CpGs.

Fig. 3

The operation time consumption benchmark on the associations for ~250,000 CpGs and one environmental factor and one covariate for the sequential number of samples from 100 to 1000. X-axis is the number of subjects, and y-axis is the time consumption in seconds. The benchmarking is done in a personal computer (processor = 2.2GHz, RAM = 8G, system = window7 64bit)

Benchmarking GEM_Gmodel

In the original analysis [18], the regression equation (Eq. 1 and Table 1) used built-in lm function in R script, which we denoted as LM_Gmodel, was applied to each of the 1423 VMRs, cycling through the 708,365 SNPs, adjusted by sex as the covariate, resulting in 1008 million regression models. We compared the LM_Gmodel with GEM_Gmodel by the result and computational efficiency in a standard laptop (processor = 2.2GHz, RAM = 8G, system = window7, 64bit). We also used a HPC structure with eight parallel processes of each with eight core CPUs (denoted as HPC) to benchmark LM_Gmodel as a reference. The computational time on HPC was 3 h, in a standard computational environment, computational time was estimated to be 61 days. The same data was processed by the GEM Gmodel. It took 5.2 min to accomplish the task on a standard laptop. The results were identical to those reported by Teh et al. [18] i.e. 12 disrupting pairs, 828 in cis- pairs and 583 in trans- pairs.

Benchmarking GEM_GxEmodel

The same scale of improvement in performance was achieved for the GEM_GxEmodel where each CpG was tested against the interaction of genotype at each of 708,365 SNPs with each of 19 environmental factors. This analysis originally took 21 h in the HPC environment and an estimated > =60 days on a standard laptop by using normal linear regression in R script, denoted as LM_GxEmodel. In the GEM_GxEmodel, it was accomplished in only 1.5 h. The results were identical between analyses with identical p-values for models containing all winning pairs of SNPs and environments (data not shown). In addition GEM also has the option to produce a “segregation scatter plot” for methylation corresponding to environment in different genotype groups, for example, Fig. 4.

Fig. 4

The scatter plot to display an example of methylation corresponding to the environment in different genotype groups. AA, AB and BB are pseudo codes for major allele homozygote, heterozygote and minor allele homozygote. Phenotypic values are shown on the x-axis, and methylation value in percentage on the y-axis. The straight lines fit for associations in each group

Conclusion and discussion

The advancements in genome-wide genotyping and DNA methylation assessment methods, coupled with well-characterized biological samples enable epigenetic association studies. GEM is designed for very fast testing of millions of hypotheses in epigenetics by using multiple linear regression models. It is suitable to the standard computing resources available to nearly all researchers. GEM includes a graphic user interface for the convenience of researchers and does not require specialist computational knowledge, outside of the widely used R environment. It should be noted that missing data requires careful handling in matrix-based operations. GEM uses the mean value to impute missing values if the data matrices supplied are incomplete. Figure 5 showed the p-values for GEM_Emodel and LM_Emodel are slightly different when the methylation matrix contains missing values. Researchers should assess the suitability of this imputation in the context of the individual study.

Fig. 5

Emodel benchmarking for methylation matrix containing missing values. Pvalue was transformed as –log10. A-axis is pvalues from LM_Emodel, y-axis is from GEM_Emodel. Among ~250,000 CpGs that were tested, 18 % of them contained at least one missing values. Our results showed pvalues for CpGs without missing values are perfectly matched, while there were slightly differences between the two implementations when CpG contains missing values

Abbreviations

cis-, SNP and CpG locate in the same chromosome; EWAS, Epigenome wide association studies; GxE, the interaction of environmental and genetic influences; methQTLs, methylation quantitative trait loci; trans-, SNP and CpG locate in different chromosomes

23 in total

Review 1. An epigenetic escape route.

Authors: Joanna D Holbrook
Journal: Trends Genet Date: 2015-01 Impact factor: 11.639

Review 2. Epigenome-wide association studies for common human diseases.

Authors: Vardhman K Rakyan; Thomas A Down; David J Balding; Stephan Beck
Journal: Nat Rev Genet Date: 2011-07-12 Impact factor: 53.242

3. The Influences of Genetic and Environmental Factors on Methylome-wide Association Studies for Human Diseases.

Authors: Yan V Sun
Journal: Curr Genet Med Rep Date: 2014-12-01

4. Abundant quantitative trait loci exist for DNA methylation and gene expression in human brain.

Authors: J Raphael Gibbs; Marcel P van der Brug; Dena G Hernandez; Bryan J Traynor; Michael A Nalls; Shiao-Lin Lai; Sampath Arepalli; Allissa Dillman; Ian P Rafferty; Juan Troncoso; Robert Johnson; H Ronald Zielke; Luigi Ferrucci; Dan L Longo; Mark R Cookson; Andrew B Singleton
Journal: PLoS Genet Date: 2010-05-13 Impact factor: 5.917

5. Characterization of functional methylomes by next-generation capture sequencing identifies novel disease-associated variants.

Authors: Fiona Allum; Xiaojian Shao; Frédéric Guénard; Marie-Michelle Simon; Stephan Busche; Maxime Caron; John Lambourne; Julie Lessard; Karolina Tandre; Åsa K Hedman; Tony Kwan; Bing Ge; Lars Rönnblom; Mark I McCarthy; Panos Deloukas; Todd Richmond; Daniel Burgess; Timothy D Spector; André Tchernof; Simon Marceau; Mark Lathrop; Marie-Claude Vohl; Tomi Pastinen; Elin Grundberg
Journal: Nat Commun Date: 2015-05-29 Impact factor: 17.694

6. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines.

Authors: Jordana T Bell; Athma A Pai; Joseph K Pickrell; Daniel J Gaffney; Roger Pique-Regi; Jacob F Degner; Yoav Gilad; Jonathan K Pritchard
Journal: Genome Biol Date: 2011-01-20 Impact factor: 13.583

7. The effect of genotype and in utero environment on interindividual variation in neonate DNA methylomes.

Authors: Ai Ling Teh; Hong Pan; Li Chen; Mei-Lyn Ong; Shaillay Dogra; Johnny Wong; Julia L MacIsaac; Sarah M Mah; Lisa M McEwen; Seang-Mei Saw; Keith M Godfrey; Yap-Seng Chong; Kenneth Kwek; Chee-Keong Kwoh; Shu-E Soh; Mary F F Chong; Sheila Barton; Neerja Karnani; Clara Y Cheong; Jan Paul Buschdorf; Walter Stünkel; Michael S Kobor; Michael J Meaney; Peter D Gluckman; Joanna D Holbrook
Journal: Genome Res Date: 2014-04-07 Impact factor: 9.043

8. Long-range epigenetic regulation is conferred by genetic variation located at thousands of independent loci.

Authors: Mathieu Lemire; Syed H E Zaidi; Maria Ban; Bing Ge; Dylan Aïssi; Marine Germain; Irfahan Kassam; Mike Wang; Brent W Zanke; France Gagnon; Pierre-Emmanuel Morange; David-Alexandre Trégouët; Philip S Wells; Stephen Sawcer; Steven Gallinger; Tomi Pastinen; Thomas J Hudson
Journal: Nat Commun Date: 2015-02-26 Impact factor: 14.919

9. The role of longitudinal cohort studies in epigenetic epidemiology: challenges and opportunities.

Authors: Jane W Y Ng; Laura M Barrett; Andrew Wong; Diana Kuh; George Davey Smith; Caroline L Relton
Journal: Genome Biol Date: 2012-06-29 Impact factor: 13.583

Review 10. Parallel computing in genomic research: advances and applications.

Authors: Kary Ocaña; Daniel de Oliveira
Journal: Adv Appl Bioinform Chem Date: 2015-11-13

21 in total

1. Environmentally sensitive hotspots in the methylome of the early human embryo.

Authors: Matt J Silver; Ayden Saffari; Noah J Kessler; Gririraj R Chandak; Caroline H D Fall; Prachand Issarapu; Akshay Dedaniya; Modupeh Betts; Sophie E Moore; Michael N Routledge; Zdenko Herceg; Cyrille Cuenin; Maria Derakhshan; Philip T James; David Monk; Andrew M Prentice
Journal: Elife Date: 2022-02-21 Impact factor: 8.713

2. A transdisciplinary approach to understand the epigenetic basis of race/ethnicity health disparities.

Authors: Lucas A Salas; Lauren C Peres; Zaneta M Thayer; Rick Wa Smith; Yichen Guo; Wonil Chung; Jiahui Si; Liming Liang
Journal: Epigenomics Date: 2021-03-10 Impact factor: 4.778

3. Association of Maternal DNA Methylation and Offspring Birthweight.

Authors: Parnian Kheirkhah Rahimabad; Syed Hasan Arshad; John W Holloway; Nandini Mukherjee; Anna Hedman; Olena Gruzieva; Ellika Andolf; Juha Kere; Goran Pershagen; Catarina Almqvist; Yu Jiang; Su Chen; Wilfried Karmaus
Journal: Reprod Sci Date: 2020-08-04 Impact factor: 3.060

4. Adiposity associated DNA methylation signatures in adolescents are related to leptin and perinatal factors.

Authors: R C Huang; P E Melton; M A Burton; L J Beilin; R Clarke-Harris; E Cook; K M Godfrey; G C Burdge; T A Mori; D Anderson; S Rauschert; J M Craig; M S Kobor; J L MacIsaac; A M Morin; W H Oddy; C E Pennell; J D Holbrook; K A Lillycrop
Journal: Epigenetics Date: 2021-02-08 Impact factor: 4.861

5. DNA-Methylation and Body Composition in Preschool Children: Epigenome-Wide-Analysis in the European Childhood Obesity Project (CHOP)-Study.

Authors: Peter Rzehak; Marcela Covic; Richard Saffery; Eva Reischl; Simone Wahl; Veit Grote; Martina Weber; Annick Xhonneux; Jean-Paul Langhendries; Natalia Ferre; Ricardo Closa-Monasterolo; Joaquin Escribano; Elvira Verduci; Enrica Riva; Piotr Socha; Dariusz Gruszfeld; Berthold Koletzko
Journal: Sci Rep Date: 2017-10-30 Impact factor: 4.379

Review 6. Genetic-epigenetic interactions in cis: a major focus in the post-GWAS era.

Authors: Catherine Do; Alyssa Shearer; Masako Suzuki; Mary Beth Terry; Joel Gelernter; John M Greally; Benjamin Tycko
Journal: Genome Biol Date: 2017-06-19 Impact factor: 13.583

7. Epigenome-Wide Association Study Reveals Duration of Breastfeeding Is Associated with Epigenetic Differences in Children.

Authors: William B Sherwood; Dilini M Kothalawala; Latha Kadalayil; Susan Ewart; Hongmei Zhang; Wilfried Karmaus; S Hasan Arshad; John W Holloway; Faisal I Rezwan
Journal: Int J Environ Res Public Health Date: 2020-05-20 Impact factor: 3.390

8. Gene × Environment Interaction in Developmental Disorders: Where Do We Stand and What's Next?

Authors: Gianluca Esposito; Atiqah Azhari; Jessica L Borelli
Journal: Front Psychol Date: 2018-10-26

9. Epigenomics and metabolomics reveal the mechanism of the APOA2-saturated fat intake interaction affecting obesity.

Authors: Chao-Qiang Lai; Caren E Smith; Laurence D Parnell; Yu-Chi Lee; Dolores Corella; Paul Hopkins; Bertha A Hidalgo; Stella Aslibekyan; Michael A Province; Devin Absher; Donna K Arnett; Katherine L Tucker; Jose M Ordovas
Journal: Am J Clin Nutr Date: 2018-07-01 Impact factor: 7.045

10. Maternal Smoking During Pregnancy Induces Persistent Epigenetic Changes Into Adolescence, Independent of Postnatal Smoke Exposure and Is Associated With Cardiometabolic Risk.

Authors: Sebastian Rauschert; Phillip E Melton; Graham Burdge; Jeffrey M Craig; Keith M Godfrey; Joanna D Holbrook; Karen Lillycrop; Trevor A Mori; Lawrence J Beilin; Wendy H Oddy; Craig Pennell; Rae-Chi Huang
Journal: Front Genet Date: 2019-09-05 Impact factor: 4.599