
Ultra-high dimensional variable selection with application to normative aging study: DNA methylation and metabolic syndrome.

Grace Yoon1, Yinan Zheng2, Zhou Zhang2, Haixiang Zhang3, Tao Gao2, Brian Joyce2, Wei Zhang2, Weihua Guan4, Andrea A Baccarelli5, Wenxin Jiang1, Joel Schwartz5, Pantel S Vokonas6, Lifang Hou2, Lei Liu7.   

Abstract

BACKGROUND: Metabolic syndrome has become a major public health challenge worldwide. The association between metabolic syndrome and DNA methylation is of great research interest.
RESULTS: We constructed a binomial model to investigate the association between a metabolic syndrome index and DNA methylation in the Normative Aging Study. We applied the Iterative Sure Independence Screening (ISIS) method with elastic net penalty to DNA methylation levels at 484,548 CpG markers from 659 human subjects, and demonstrated that the screening step in ISIS can significantly improve the performance of the elastic net.
CONCLUSION: The proposed method identifies four CpGs which can be mapped to two biologically relevant and functional genes. Identification of significant CpG markers may potentially have practical implications for disease prevention and treatment.

Keywords:  Bootstrap; ISIS; Metabolic syndrome; Ultra-high dimensional variable selection; elastic net; methylation

Year:  2017        PMID: 28264653      PMCID: PMC5340011          DOI: 10.1186/s12859-017-1568-1

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

DNA methylation is an epigenetic mechanism for regulating gene expression. Chemically, it involves the modification of a cytosine (C) base by adding a methyl group. In adult cells, DNA methylation typically occurs at CpG sites, i.e., regions of DNA where cytosine (C) and guanine (G) bases are linked by a phosphate. It can suppress the expression of neighboring genes without changing the underlying genetic sequence. Methylation has been the most commonly studied epigenetic marker because of its transmissibility during cell division as well as its stability in stored and processed blood samples. Deciphering the DNA methylation code will help us predict and prevent diseases [1, 2].
One of the major public health challenges worldwide is the steadily increasing prevalence of metabolic syndrome, which follows in the wake of society-wide changes such as urbanization, surplus energy intake, increasing obesity and sedentary lifestyle. The International Diabetes Federation estimates that one-quarter of the world’s adult population has metabolic syndrome [3]. Metabolic syndrome is significantly associated with risks of developing cardiovascular disease and diabetes [4].
Our goal is to explore the associations between metabolic syndrome and ultra-high dimensional DNA methylation markers. Our motivating example is the Normative Aging Study (NAS), where methylation levels from 484,548 CpG sites were measured in 659 subjects. This paper describes our application of an Iterative Sure Independence Screening (ISIS) method [5, 6] with elastic net penalty [7] to address the ultra-high dimensionality and correlation structure of these methylation markers.
The structure of the paper is as follows. In the “Results” section, we use simulations to evaluate the performance of our method and apply it to the NAS data. Then, we give the clinical interpretation of our findings in the “Discussion” section. Finally, in the “Conclusions” section, we conclude with a summary discussion and possible directions for future research.

Methods

Data

The Normative Aging Study (NAS) is a longitudinal cohort study established in 1963 by the Department of Veterans Affairs [8]. With an initial cohort of 2280 healthy men, NAS is an ongoing project to study the effects of aging on various health issues. Eligibility criteria at enrollment included veteran status; residence in the Boston area; ages 21-80; and no history of hypertension, heart disease, cancer, diabetes, or other chronic health conditions. From 1963 to 1999, 981 participants died and 470 were lost to follow up. Participants were recalled for clinical examinations every 3-5 years. Between March 1999 and December 2013, 802 (96.7%) of the remaining 829 active participants agreed to donate blood, 686 of whom were randomly selected and profiled using the Illumina 450K BeadChip array at up to three follow-up visits separated by a median time interval of 3.5 years (IQR 3.1-5.7). We excluded participants who 1) were non-white or had missing information on race to minimize potential confounding effects of genetic ancestry, or 2) had leukemia diagnosed prior to or during the year of their blood draw as their blood methylation profiles could have been affected. A total of 664 individuals and samples collected at their first blood draw remained for analysis. DNA samples were extracted from buffy coat using the QIAamp DNA Blood Kit (QIAGEN, Valencia, CA, USA). A total of 500 ng of DNA was used to perform bisulfite conversion using the EZ-96 DNA Methylation Kit (Zymo Research, Orange, CA, USA). To limit chip and plate effects, a two-stage age-stratified algorithm was used to randomize samples and ensure similar age distributions across chips and plates. We randomized 12 samples (sampled across all age quartiles) to each chip, then randomized chips to plates (eight chips per plate). Quality control analysis was performed to remove samples and probes where >1% had a detection p-value > 0.05. 
The remaining samples were preprocessed using the Illumina-type background correction [9] and normalized with the dye-bias [10] and BMIQ [11] adjustments. Beta values for DNA methylation level were calculated as the ratio of the methylated probe intensity to the overall intensity, which can be interpreted as the approximate percentage of methylation. Beta values range from 0 to 1 but are severely compressed at the extremes. Consequently, Beta values were converted to M-values through a logit transformation, which gives a view of the distribution of methylation across the genome that is difficult to obtain from the raw values [12]. M-values were then used in our analysis. The K-nearest neighbors algorithm was applied in the space of CpG sites to impute missing methylation values [13]. Batch effects and potential confounding effects of white blood cell subtypes, as estimated by Houseman’s method [14], were corrected for using ComBat [15]. Metabolic syndrome is defined as present (y=1) if at least three of the following five conditions are satisfied and absent (y=0) otherwise: 1) abdominal obesity (waist circumference > 102 cm for men); 2) high fasting blood sugar (≥ 100 mg/dl) or currently taking diabetes medication; 3) reduced HDL cholesterol (< 40 mg/dl for men) or currently taking cholesterol medication; 4) hypertension (systolic blood pressure > 130 mmHg or diastolic blood pressure > 85 mmHg) or currently taking antihypertensive medication; 5) hypertriglyceridemia (≥ 150 mg/dl) or currently taking medication for hypertriglyceridemia. To increase power, in this paper we created a metabolic syndrome index as the number of the above conditions that are satisfied. Five subjects with missing data for these conditions were excluded. The final working dataset includes methylation levels of 659 subjects measured at 484,548 CpG sites.
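The two data-preparation steps described above, the Beta-to-M-value conversion and the construction of the metabolic syndrome index, can be sketched in a few lines. This is an illustrative Python sketch rather than the authors' R code. The base-2 logit is the usual M-value convention and is assumed here; the clamping constant `eps` and all function and argument names are hypothetical.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation Beta value in [0, 1] to an M-value (base-2 logit)."""
    # Clamp away from 0 and 1 so the logarithm stays finite (eps is an assumption).
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def metabolic_syndrome_index(waist_cm, glucose_mg_dl, hdl_mg_dl,
                             sbp_mmhg, dbp_mmhg, triglycerides_mg_dl,
                             on_diabetes_med=False, on_cholesterol_med=False,
                             on_bp_med=False, on_tg_med=False):
    """Count how many of the five conditions (thresholds for men) are satisfied."""
    conditions = [
        waist_cm > 102,                                   # abdominal obesity
        glucose_mg_dl >= 100 or on_diabetes_med,          # high fasting blood sugar
        hdl_mg_dl < 40 or on_cholesterol_med,             # reduced HDL cholesterol
        sbp_mmhg > 130 or dbp_mmhg > 85 or on_bp_med,     # hypertension
        triglycerides_mg_dl >= 150 or on_tg_med,          # hypertriglyceridemia
    ]
    return sum(conditions)

# Made-up subject: three conditions hold (waist, HDL, triglycerides).
index = metabolic_syndrome_index(105, 98, 38, 128, 80, 160)
has_syndrome = int(index >= 3)   # standard binary definition
```

The index keeps the information that the binary definition discards: a subject with two conditions and a subject with none both have y=0 under the standard definition but differ in the index.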

Analytical method

Two issues complicate the analysis of DNA methylation data. First, the DNA methylation markers are ultra-high dimensional, i.e., p≫n. Second, DNA methylation levels measured from probes in close proximity are correlated [16]. For example, in the NAS data, the co-methylation correlation could be as high as 0.98 as the samples were free of cell culture-induced epigenetic changes. It is thus imperative to account for ultra-high dimensionality and high correlation simultaneously. In this paper, we adopt the ISIS approach, an iterative two-step procedure combining the screening and variable selection steps. Fan and Lv [5] proposed the sure independence screening (SIS) and Iterative SIS (ISIS) methods. Later, Fan et al. [6] extended ISIS to the general pseudo-likelihood framework. In SIS, all predictor variables are first ranked based on their Pearson correlations with the response variable. Then, model selection is conducted using a predefined number of the most highly correlated variables. The goal of ISIS is to iteratively recover variables missed by the initial screening, by ranking the marginal correlations of the remaining variables with the residuals. It can detect important predictors which are marginally uncorrelated with the response by themselves but jointly correlated with it. Least absolute shrinkage and selection operator (LASSO), smoothly clipped absolute deviation (SCAD), Dantzig selector, and other methods are used for model selection in [5, 6]. For the analysis in this paper, the elastic net penalty is considered to account for correlated methylation markers. As a compromise between the ridge and LASSO methods, elastic net enjoys a sparsity similar to that of LASSO but, like ridge, shrinks together the coefficients of correlated predictors. It also offers considerable computational advantages over the $L_q$ penalties, where $q \in (0,1)$ [7, 17, 18]. The elastic net penalty has been used widely to conduct model selection in epigenetic studies.
For example, [19] built a predictive model of aging using elastic net combined with a bootstrap approach. [20] also used the elastic net regression model to predict epigenetic age across a broad spectrum of human tissues and cell types. The screening step in ISIS could reduce the ultra-high dimensional covariates to a manageable number by identifying markers which are marginally correlated with the outcome. As a result, in the variable selection step we can tackle the correlation issue in a much smaller covariate space, in which elastic net is expected to perform well. The iterative procedure can recover variables missed at the screening step. Hereafter we choose a weight coefficient of w=0.5, i.e., half LASSO and half ridge penalties. We will use a binomial model for the ordinal metabolic syndrome index {0,1,…,5} as a response variable (y) and methylation levels as predictor variables (x): $y_i \sim \mathrm{Binomial}(5, \pi_i)$ with $\mathrm{logit}(\pi_i) = \beta_0 + x_i^{T}\beta$, $i=1,\ldots,n$. Here n is the sample size and $\pi_i$ is the probability of having any one of the above metabolic syndrome conditions for the ith subject. All methods were implemented in the R programming language. See https://github.com/GraceYoon/ISIS_EN for the R source code and a simple example.
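The screening step and the penalty described in this section can be illustrated with a small, self-contained sketch. It is written in Python for illustration only and is not the authors' R implementation. It ranks predictors by absolute marginal Pearson correlation, as the text describes for SIS, and evaluates an elastic net penalty with weight w. The penalty parameterization (with a factor 1/2 on the ridge term, as in the R package glmnet) is an assumption, and the names `sis_screen` and `elastic_net_penalty` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def sis_screen(columns, y, d):
    """Return the indices of the d predictors with the largest |correlation| with y."""
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:d]

def elastic_net_penalty(beta, lam, w=0.5):
    """lam * (w * L1 + (1 - w) / 2 * squared L2); glmnet-style form, assumed here."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (w * l1 + (1.0 - w) / 2.0 * l2_squared)

# Toy data: four predictors (stored as columns) and five subjects.
columns = [
    [1, 2, 3, 4, 5],   # perfectly correlated with y
    [2, 1, 2, 1, 2],   # uncorrelated with y
    [5, 4, 3, 2, 1],   # perfectly anti-correlated with y
    [1, 3, 2, 5, 4],   # moderately correlated with y
]
y = [0, 1, 2, 3, 4]
kept = sis_screen(columns, y, d=2)
```

In the full ISIS procedure this screening would be repeated on the residuals after each variable selection step, so that predictors missed in the first pass get another chance to enter the model.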

Results

Simulation

We will illustrate our method by simulation. R cannot generate a correlation matrix of this ultra-high dimension (484,548 by 484,548). Therefore, in a similar fashion to [21], the real NAS methylation data set is used as an n×p design matrix ($X=(x_1,x_2,\ldots,x_n)^{T}=(X_1,X_2,\ldots,X_p)$) to take the correlation structure among covariates into account. We randomly generate y from a binomial distribution with parameters m=5 and π(x). Then, each element of $y=(y_1,\ldots,y_n)$ can take an integer value ∈{0,1,2,3,4,5} for the metabolic syndrome index. This yields simulation data the same size as the NAS dataset: n=659 and p=484,548. As true parameters $\beta=(\beta_1,\beta_2,\ldots,\beta_p)$ in the simulation setting we used the estimated coefficients from the actual data analysis (Table 4): $\beta_{474287}=6.64901$, $\beta_{126564}=2.20140$, $\beta_{38487}=3.49123$, $\beta_{325547}=-6.31737$, and $\beta_j=0$ otherwise.
For ISIS, we need to choose a proper submodel size (d) in the screening step, which should be large enough to include the true significant coefficients with a probability approaching 1. According to [5], $d=\lfloor n/(4\log n)\rfloor$ is recommended for a binary outcome, $d=\lfloor n/(2\log n)\rfloor$ for count, and $d=\lfloor n/\log n\rfloor$ for a continuous outcome, where n is the sample size. Since y takes integer values from 0 to 5, we choose two values of d here: $d=\lfloor n/(2\log n)\rfloor=50$ and $d=\lfloor n/(4\log n)\rfloor=25$.
The study by Hannum [19] implemented the elastic net penalty on bootstrap samples, and selected CpG markers which were present in more than half of all bootstraps. Before that, [22] and [23] proposed Bolasso (bootstrap-enhanced lasso): use LASSO for bootstrapped replications of a given sample, and intersect the supports of the LASSO bootstrap estimates. A softer version of Bolasso selects those variables which are present in a high proportion of bootstrap replications. These papers showed that Bolasso leads to consistent model selection. Along these lines, we generated 100 bootstrap samples of the same size (n=659), and used ISIS with elastic net penalty to select the significant methylation markers in each bootstrap sample.
Here we show the results from ISIS with elastic net, using two different choices of d on 100 bootstrap samples. For comparison, we also list the results estimated by elastic net only (without the screening step) in Table 1.
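The simulation set-up can be sketched as follows. This is an illustrative Python sketch, not the authors' R code. It assumes that the submodel size follows a rule of the form ⌊n/(c·log n)⌋, which reproduces d=50 (c=2) and d=25 (c=4) for n=659, and that π(x) is obtained from a logistic link. The function names, the toy design matrix and the intercept value are hypothetical.

```python
import math
import random

def submodel_size(n, divisor):
    """Screening size floor(n / (divisor * log n)); the rule's form is assumed here."""
    return int(math.floor(n / (divisor * math.log(n))))

def simulate_index(x_rows, beta, intercept, m=5, seed=0):
    """Draw y_i ~ Binomial(m, pi_i) with logit(pi_i) = intercept + x_i . beta."""
    rng = random.Random(seed)
    y = []
    for row in x_rows:
        eta = intercept + sum(b * v for b, v in zip(beta, row))
        pi = 1.0 / (1.0 + math.exp(-eta))
        # A Binomial(m, pi) draw is the number of successes in m Bernoulli(pi) trials.
        y.append(sum(1 for _ in range(m) if rng.random() < pi))
    return y

n = 659
d_count = submodel_size(n, 2)    # size used for a count outcome
d_binary = submodel_size(n, 4)   # size used for a binary outcome

# Toy design matrix with three subjects and two markers (made-up values).
x_rows = [[0.1, -0.3], [1.2, 0.4], [-0.7, 0.9]]
y_sim = simulate_index(x_rows, beta=[2.0, -1.0], intercept=-0.5)
```

In the study itself the rows of the design matrix are the observed NAS methylation profiles, so the simulated data inherit the real correlation structure among CpG sites.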
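Table 1

Simulation results

| Elastic net only: j | freq | ISIS with elastic net (d=50): j | freq | ISIS with elastic net (d=25): j | freq |
|---|---|---|---|---|---|
| 474287 | 100 | 474287 | 100 | 474287 | 100 |
| 325547 | 96 | 126564 | 84 | 126564 | 87 |
| 126564 | 84 | 38487 | 68 | 38487 | 66 |
| 38487 | 76 | 325547 | 65 | 325547 | 54 |
| 70756 | 72 | 359976 | 26 | 359976 | 18 |
| 320060 | 64 | 384921 | 21 | 384921 | 18 |
| 270466 | 56 | 258231 | 19 | 425056 | 14 |
| 88446 | 55 | 425056 | 18 | 258231 | 12 |
| 278727 | 55 | 86441 | 15 | 292919 | 11 |
| 56822 | 53 | 329494 | 15 | 324153 | 11 |
| 35978 | 49 | 320139 | 14 | 329494 | 11 |
| 30499 | 48 | 90855 | 13 | 264984 | 9 |
| 322963 | 45 | 233845 | 13 | 320139 | 9 |
| 350509 | 45 | 324153 | 13 | 430210 | 8 |
| 61038 | 42 | 16510 | 12 | 358567 | 7 |
| 213264 | 42 | 361987 | 12 | 16510 | 6 |
| 223673 | 42 | 46090 | 11 | 44790 | 6 |
| 381178 | 42 | 264984 | 11 | 46090 | 6 |
| 452998 | 41 | 349498 | 11 | 233845 | 6 |
| 36696 | 40 | 86525 | 9 | 86525 | 5 |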
In all cases, the four variables with nonzero coefficients are correctly selected most often. However, the elastic net only method (without screening) identified 6 additional false markers (70756, 320060, 270466, 88446, 278727, 56822) in at least half of all bootstrap samples, indicating poor performance against false positives. In contrast, ISIS with the elastic net has a much wider gap in selection frequencies between true and redundant variables, and none of the redundant markers is selected in more than 1/3 of the 100 bootstrap samples. The results from the two different sizes of d are consistent with one another. We repeated this process 5 times (5 datasets with 100 bootstrap samples for each dataset) with consistent results (available upon request). Moreover, we have conducted simulations with varying weights w=0.25, 0.75 and 1 (LASSO), under the same simulation setting (available upon request). The results show that a larger w results in a sparser model when there is no screening step. However, there is virtually no change for different w values in ISIS, demonstrating the robustness of ISIS with respect to the weight chosen.
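The bootstrap selection-frequency rule used to read Table 1 can be sketched as below. This is an illustrative Python sketch under stated assumptions, not the authors' code. The `select` argument is a hypothetical callback standing in for one ISIS-with-elastic-net fit on a resampled data set, and `toy_select` is a made-up selector used only to exercise the function. The frequencies in `table1_d50` are the top five rows of the d=50 column of Table 1.

```python
import random
from collections import Counter

def bootstrap_selection_frequency(n, n_boot, select, seed=0):
    """Fraction of bootstrap resamples in which each marker is selected."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample n subject indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        counts.update(set(select(idx)))
    return {j: c / n_boot for j, c in counts.items()}

def stable_markers(freq, threshold=0.5):
    """Markers selected in more than `threshold` of the bootstrap resamples."""
    return sorted(j for j, f in freq.items() if f > threshold)

def toy_select(idx):
    # Hypothetical selector: marker 7 is always chosen, marker 3 only when
    # subject 0 happens to be in the resample.
    return [7, 3] if 0 in idx else [7]

freq = bootstrap_selection_frequency(n=20, n_boot=100, select=toy_select, seed=1)

# Top five rows of the d=50 column of Table 1, as proportions of 100 resamples.
table1_d50 = {474287: 1.00, 126564: 0.84, 38487: 0.68, 325547: 0.65, 359976: 0.26}
selected = stable_markers(table1_d50)
```

Applied to the d=50 column, the 50% rule keeps exactly the four markers with nonzero true coefficients and drops marker 359976, whose frequency of 26% is the highest among the redundant markers.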

Application to NAS data

Similar to the Simulation Section, we generated 100 bootstrap samples from the original data. Table 2 shows the selected markers and frequencies of their selection in the model out of 100 bootstrap samples. Among 484,548 CpGs, our method identifies four CpG sites as being strongly associated with metabolic syndrome.
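Table 2

NAS data results

| Elastic net only: j | freq | ISIS with elastic net (d=50): j | freq | ISIS with elastic net (d=25): j | freq |
|---|---|---|---|---|---|
| 474287 | 98 | 474287 | 91 | 474287 | 91 |
| 325547 | 96 | 126564 | 67 | 126564 | 57 |
| 126564 | 91 | 38487 | 46 | 38487 | 54 |
| 219492 | 84 | 325547 | 46 | 325547 | 53 |
| 38487 | 80 | 12205 | 20 | 12205 | 30 |
| 36730 | 71 | 141722 | 17 | 467369 | 24 |
| 131967 | 66 | 467369 | 16 | 141722 | 22 |
| 12205 | 65 | 351200 | 15 | 213684 | 17 |
| 248438 | 64 | 147068 | 13 | 402549 | 16 |
| 95930 | 62 | 402549 | 12 | 433494 | 16 |
| 402549 | 60 | 19347 | 11 | 155087 | 15 |
| 207644 | 59 | 213684 | 11 | 147068 | 15 |
| 400141 | 59 | 268623 | 11 | 33489 | 14 |
| 256046 | 54 | 55087 | 10 | 343324 | 14 |
| 79189 | 53 | 104428 | 10 | 206869 | 13 |
| 467369 | 53 | 433494 | 10 | 95930 | 12 |
| 408183 | 52 | 95930 | 9 | 193471 | 12 |
| 416044 | 52 | 219492 | 9 | 219492 | 12 |
| 317479 | 51 | 248438 | 9 | 248438 | 12 |
| 478992 | 49 | 206869 | 8 | 317479 | 12 |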
We also compare our results to those from the elastic net only method [19]. As shown in the left column of Table 2, this method identifies 19 CpGs that appear in more than half of the bootstrap samples, many more than the 4 identified by ISIS. For example, the 4th most-selected CpG by the elastic net only method, $X_{219492}$, is selected with very low frequency in both columns representing our method. The iterative screening step in ISIS can therefore improve the performance of elastic net by reducing the chance of false positives in ultra-high dimensional data. To compare the performance of the resulting models, we used 5-fold cross validation to calculate the AUC (area under the ROC curve). Four folds were taken as training data, which we used to build our model. The remaining fold was used as test data to calculate the AUC. Since we used the metabolic syndrome index as a count variable y, we measured the multiclass AUC proposed by [24], and the average value over the 5 folds is reported. We also present the mean AUC value for the binary outcome under the standard definition of metabolic syndrome, i.e., whether at least three of the five conditions are satisfied (y=1) or not (y=0). These results are shown in Table 3. We note that even though our model selects far fewer variables (due to the reduced sample size in the training data), its AUC is higher than that of the elastic net only method, which is subject to false positives.
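The evaluation described above can be illustrated with a short sketch of a fold assignment, a binary AUC, and a pairwise-averaged multiclass AUC. The multiclass measure below averages pairwise AUCs over all class pairs, in the spirit of Hand and Till's M measure; whether this matches the measure cited as [24] in every detail is an assumption. All function names and the toy probabilities are hypothetical, and the sketch is in Python rather than the authors' R.

```python
import random
from itertools import combinations

def k_fold_indices(n, k=5, seed=0):
    """Shuffle subject indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def binary_auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of AUC: P(pos > neg) + 0.5 * P(tie)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def multiclass_auc(labels, probs):
    """Average of pairwise AUCs; probs[i][c] is the predicted P(class c) for subject i."""
    classes = sorted(set(labels))
    pair_aucs = []
    for i, j in combinations(classes, 2):
        # AUC for separating class i from class j using P(class i), and vice versa.
        a_ij = binary_auc([p[i] for y, p in zip(labels, probs) if y == i],
                          [p[i] for y, p in zip(labels, probs) if y == j])
        a_ji = binary_auc([p[j] for y, p in zip(labels, probs) if y == j],
                          [p[j] for y, p in zip(labels, probs) if y == i])
        pair_aucs.append((a_ij + a_ji) / 2.0)
    return sum(pair_aucs) / len(pair_aucs)

folds = k_fold_indices(659, k=5)

# Toy predictions for three classes that are perfectly separated.
labels = [0, 0, 1, 1, 2, 2]
probs = [
    {0: 0.80, 1: 0.15, 2: 0.05}, {0: 0.80, 1: 0.15, 2: 0.05},
    {0: 0.20, 1: 0.60, 2: 0.20}, {0: 0.20, 1: 0.60, 2: 0.20},
    {0: 0.05, 1: 0.15, 2: 0.80}, {0: 0.05, 1: 0.15, 2: 0.80},
]
m_auc = multiclass_auc(labels, probs)
```

For a binomial model, the class probabilities fed to `multiclass_auc` could be the binomial probabilities of each index value 0 to 5 implied by the fitted π for a test subject; that choice is also an assumption of this sketch.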
Table 3

5-fold cross validation for AUC

| | Elastic net | ISIS, d=50 | ISIS, d=25 |
|---|---|---|---|
| The number of selected variables | 9.6 (2.51) | 1.6 (0.55) | 1.6 (0.55) |
| Average of AUC | 0.6011 | 0.6219 | 0.6197 |
| Average of multiclass AUC | 0.6249 | 0.6358 | 0.6441 |

Discussion

Associated gene information for the four CpG markers selected by ISIS with the elastic net method is shown in Table 4. The first three CpGs (cg27243685, cg06500161 and cg01881899) are located in close proximity to one another in the same gene: ABCG1. Two, cg06500161 and cg01881899, are at the South Shore and North Shelf of the same CpG Island, respectively. Pfeiffer et al. [25] identified that higher methylation at cg06500161 was associated with lower high-density lipoprotein (HDL) cholesterol and higher triglycerides. The coefficient estimates ($\hat{\beta}_j$) in Table 4 are consistent with this association. Moreover, methylation levels in cg06500161 and cg27243685 were found to be negatively associated with ABCG1 transcripts. Hidalgo et al. [26] showed associations between the methylation status of cg06500161 and fasting insulin as well as with HOMA-IR (homeostasis model assessment of insulin resistance), a surrogate marker of insulin resistance. Ding et al. [27] reported that it is the most strongly correlated CpG site with BMI among expression-associated methylation sites within one megabase of any cholesterol metabolism network. Our results are also consistent with functional studies of ABCG1 expression. Kennedy et al. [28] and Frisdal et al. [29] identified that higher expression of ABCG1 is associated with increased fat mass, and that deficiency of ABCG1 reduces triglyceride storage. Together, these findings suggest that ABCG1 expression plays a key role in metabolic syndrome, and that DNA methylation may be substantially involved in this pathway.
Table 4

Genes associated with the most frequently selected CpG markers

| j | Name | Chr | RefGene | RefGene group | $\hat{\beta}_j$ |
|---|---|---|---|---|---|
| 474287 | cg27243685 | 21 | ABCG1 | Body; 5’UTR | 6.64901 |
| 126564 | cg06500161 | 21 | ABCG1 | Body | 2.20140 |
| 38487 | cg01881899 | 21 | ABCG1 | Body | 3.49123 |
| 325547 | cg17901584 | 1 | DHCR24 | TSS1500 | −6.31737 |
cg17901584 is located in the TSS1500 region (from -200 to -1500 nucleotides upstream of the transcription start site) in the promoter of the gene DHCR24. Drzewinska et al. [30] showed that methylation of the DHCR24 promoter region affects transcriptional efficiency. DNA methylation mediates transcriptional repression via binding of the methylated DNA-binding protein or by preventing the binding of transcription factors to their motifs. In the Bloch (Cholesterol Biosynthesis) pathway, desmosterol is converted into cholesterol by DHCR24 in the final step. Zerenturk et al. [31] and Luu et al. [32] found that modulating DHCR24 activity alters levels of desmosterol, which further reduces cellular cholesterol status. Thus, DNA methylation may also affect metabolic syndrome via pathways related to DHCR24.

Conclusions

Using the ISIS method with the elastic net penalty, our study found four important CpGs associated with the metabolic syndrome index from ultra-high dimensional DNA methylation markers. They are located in two biologically relevant and functional genes. Adding the screening step iteratively to the variable selection method is shown to improve its performance against false positives. In conclusion, the two criteria we used, a selection frequency above 50% and a gap in the selection frequencies across the bootstrap samples, yield satisfactory selection results against false positives. In a practical application, one may set $d=\lfloor cn/\log n\rfloor$ in the screening step and select the value of c from a grid using the cross-validated prediction error. The adaptive choice of tuning parameter c may lead to improved performance when the sample size is not too large. In NAS, methylation levels were measured up to three times with a median interval of 3.5 years. Our method could be extended to longitudinal data analysis along the lines of [33]. Moreover, we are interested in mediation analysis to determine whether methylation mediates the path from intervention (e.g. diet, physical exercise) to health outcomes, thereby helping to understand the underlying biological mechanisms of interventions [34]. This analysis is limited to white male subjects in the NAS study. In the future we will validate our results using other cohorts, e.g., the Coronary Artery Risk Development in Young Adults Study (CARDIA), and further examine the relation between DNA methylation and metabolic syndrome in young and middle-aged, mixed-gender, and multi-racial populations.
  26 in total

1.  Genome-wide variation of cytosine modifications between European and African populations and the implications for complex traits.

Authors:  Erika L Moen; Xu Zhang; Wenbo Mu; Shannon M Delaney; Claudia Wing; Jennifer McQuade; Jamie Myers; Lucy A Godley; M Eileen Dolan; Wei Zhang
Journal:  Genetics       Date:  2013-06-21       Impact factor: 4.562

2.  Ultrahigh dimensional feature selection: beyond the linear model.

Authors:  Jianqing Fan; Richard Samworth; Yichao Wu
Journal:  J Mach Learn Res       Date:  2009       Impact factor: 3.654

3.  Epigenetics at the Crossroads of Genes and the Environment.

Authors:  Andrew P Feinberg; M Daniele Fallin
Journal:  JAMA       Date:  2015-09-15       Impact factor: 56.272

4.  COORDINATE DESCENT ALGORITHMS FOR NONCONVEX PENALIZED REGRESSION, WITH APPLICATIONS TO BIOLOGICAL FEATURE SELECTION.

Authors:  Patrick Breheny; Jian Huang
Journal:  Ann Appl Stat       Date:  2011-01-01       Impact factor: 2.083

Review 5.  Desmosterol and DHCR24: unexpected new directions for a terminal step in cholesterol synthesis.

Authors:  Eser J Zerenturk; Laura J Sharpe; Elina Ikonen; Andrew J Brown
Journal:  Prog Lipid Res       Date:  2013-10-02       Impact factor: 16.195

6.  Alterations of a Cellular Cholesterol Metabolism Network Are a Molecular Feature of Obesity-Related Type 2 Diabetes and Cardiovascular Disease.

Authors:  Jingzhong Ding; Lindsay M Reynolds; Tanja Zeller; Christian Müller; Kurt Lohman; Barbara J Nicklas; Stephen B Kritchevsky; Zhiqing Huang; Alberto de la Fuente; Nicola Soranzo; Robert E Settlage; Chia-Chi Chuang; Timothy Howard; Ning Xu; Mark O Goodarzi; Y-D Ida Chen; Jerome I Rotter; David S Siscovick; John S Parks; Susan Murphy; David R Jacobs; Wendy Post; Russell P Tracy; Philipp S Wild; Stefan Blankenberg; Ina Hoeschele; David Herrington; Charles E McCall; Yongmei Liu
Journal:  Diabetes       Date:  2015-07-07       Impact factor: 9.461

7.  Signaling regulates activity of DHCR24, the final enzyme in cholesterol synthesis.

Authors:  Winnie Luu; Eser J Zerenturk; Ika Kristiana; Martin P Bucknall; Laura J Sharpe; Andrew J Brown
Journal:  J Lipid Res       Date:  2013-12-20       Impact factor: 5.922

8.  Adipocyte ATP-binding cassette G1 promotes triglyceride storage, fat mass growth, and human obesity.

Authors:  Eric Frisdal; Soazig Le Lay; Henri Hooton; Lucie Poupel; Maryline Olivier; Rohia Alili; Wanee Plengpanich; Elise F Villard; Sophie Gilibert; Marie Lhomme; Alexandre Superville; Lobna Miftah-Alkhair; M John Chapman; Geesje M Dallinga-Thie; Nicolas Venteclef; Christine Poitou; Joan Tordjman; Philippe Lesnik; Anatol Kontush; Thierry Huby; Isabelle Dugail; Karine Clement; Maryse Guerin; Wilfried Le Goff
Journal:  Diabetes       Date:  2014-09-23       Impact factor: 9.461

9.  Epigenome-wide association study of fasting measures of glucose, insulin, and HOMA-IR in the Genetics of Lipid Lowering Drugs and Diet Network study.

Authors:  Bertha Hidalgo; M Ryan Irvin; Jin Sha; Degui Zhi; Stella Aslibekyan; Devin Absher; Hemant K Tiwari; Edmond K Kabagambe; Jose M Ordovas; Donna K Arnett
Journal:  Diabetes       Date:  2013-10-29       Impact factor: 9.461

10.  DNA methylation age of human tissues and cell types.

Authors:  Steve Horvath
Journal:  Genome Biol       Date:  2013       Impact factor: 13.583
