Literature DB >> 30545403

Machine learning selected smoking-associated DNA methylation signatures that predict HIV prognosis and mortality.

Xinyu Zhang1,2, Ying Hu3, Bradley E Aouizerat4, Gang Peng5, Vincent C Marconi6, Michael J Corley7, Todd Hulgan8, Kendall J Bryant9, Hongyu Zhao5, John H Krystal1,2, Amy C Justice2,10, Ke Xu11,12.   

Abstract

BACKGROUND: The effects of tobacco smoking on epigenome-wide methylation signatures in white blood cells (WBCs) collected from persons living with HIV may have important implications for their immune-related outcomes, including frailty and mortality. The application of a machine learning approach to the analysis of CpG methylation in the epigenome enables the selection of phenotypically relevant features from high-dimensional data. Using this approach, we now report that a set of smoking-associated DNA-methylated CpGs predicts HIV prognosis and mortality in an HIV-positive veteran population.
RESULTS: We first identified 137 epigenome-wide significant CpGs for smoking in WBCs from 1137 HIV-positive individuals (p < 1.70E-07). To examine whether smoking-associated CpGs were predictive of HIV frailty and mortality, we applied ensemble-based machine learning to build a model in a training sample employing 408,583 CpGs. A set of 698 CpGs was selected and predictive of high HIV frailty in a testing sample [(area under curve (AUC) = 0.73, 95%CI 0.63~0.83)] and was replicated in an independent sample [(AUC = 0.78, 95%CI 0.73~0.83)]. We further found an association of a DNA methylation index constructed from the 698 CpGs that were associated with a 5-year survival rate [HR = 1.46; 95%CI 1.06~2.02, p = 0.02]. Interestingly, the 698 CpGs located on 445 genes were enriched on the integrin signaling pathway (p = 9.55E-05, false discovery rate = 0.036), which is responsible for the regulation of the cell cycle, differentiation, and adhesion.
CONCLUSION: We demonstrated that smoking-associated DNA methylation features in white blood cells predict HIV infection-related clinical outcomes in a population living with HIV.

Entities:  

Keywords:  DNA methylation; Ensemble machine learning; HIV frailty; Mortality; Tobacco smoking

Mesh:

Year:  2018        PMID: 30545403      PMCID: PMC6293604          DOI: 10.1186/s13148-018-0591-z

Source DB:  PubMed          Journal:  Clin Epigenetics        ISSN: 1868-7075            Impact factor:   6.551


Background

Smoking is a common and underappreciated contributor to poor outcomes in HIV-infected individuals. The prevalence of smoking among HIV-infected people exceeds 60% [1], and it is an independent risk factor for mortality in treated HIV-infected individuals [2]. Smoking increases the mortality risk among HIV-infected individuals with an odds ratio between 2 and 3 [2, 3]. However, we have little insight into the mechanisms through which smoking contributes to poorer HIV outcomes. Smoking-associated effects on DNA methylation in white blood cells (WBCs) have been demonstrated through epigenome-wide association studies (EWAS). DNA methylation is an epigenetic mechanism regulating gene expression independent of variation in the DNA sequence. To date, hundreds of CpG sites (i.e., cytosine-guanine dinucleotides), where cytosines can be methylated to form 5-methylcytosine, in WBCs have been associated with smoking status [4], quantity [5], smoking cessation [6], and smoking-related traits or diseases (e.g., oxidative stress level [7], lung cancer [8], chronic inflammatory disease [9]) in the HIV-uninfected population. Indices of DNA methylation constructed from smoking-associated CpG sites have predicted smoking-related lung cancer incidence [10] and oral cancer incidence [11]. A recent study using a smoking DNA methylation index derived from six CpG sites was associated with frailty in aging populations [12]. And finally, smoking-associated CpGs in the blood were reported to predict all-cause mortality [13, 14] and cardiovascular-related mortality [15]. However, smoking-related DNA methylation associations have not been described in HIV-infected populations to date. The host epigenome is also impacted by HIV infection. We and others recently showed that DNA methylation is associated with HIV infection and HIV-related aging [16-19]. We reported that CpG sites in the promoter of NLRC5, a transcriptional activator of major histocompatibility complex class I, were less methylated in samples from HIV-infected persons as compared to samples from HIV-uninfected persons [19]. Epigenetic marks were also associated with cognitive impairment in the HIV-infected population, and the epigenetic clock relates to biological aging in HIV-infected individuals [20]. Taken together, it is reasonable to hypothesize that both smoking and HIV infection have effects on the epigenome that contribute to poor HIV outcomes and an increased risk of mortality. To select high-dimensional epigenetic data for predicting clinical outcomes is challenging. For this purpose, machine learning has emerged as a powerful tool that enables the discovery of unknown features in the epigenome to predict phenotypes of interest [21]. Machine learning has been successfully applied to select DNA methylation features to identify biomarkers for complex diseases and to predict treatment outcomes [16, 21, 22]. Recently, a kernel machine learning method improved the prediction of cancer prognosis by integrating molecular profiles and clinical predictors [23]. A panel of DNA methylation markers was able to diagnose common cancers with 95% accuracy and identified 29 out of 30 colorectal cancer metastases [24]. In another study, DNA methylation-based learning selected immune response features improved the prediction of better treatment outcomes of chemotherapy and survival for breast cancer patients [25]. Such an approach can be useful to identify biological signatures of HIV-related outcomes influenced by smoking. In this study, using an ensemble-based machine learning approach, our goal was to select smoking-associated DNA methylation CpGs in the HIV-infected host epigenome and link the selected CpGs to the HIV disease outcomes. The motivation to use ensemble-based learning is that an ensemble approach has advantages to reduce the bias from individual machine learning methods and to improve the stability of prediction performance in an imbalanced sample [26, 27]. We were also interested in understanding the biological significance of the selected features. This study demonstrates that the application of advanced machine learning on methylation features provides evidence of a link between the mechanisms of smoking and smoking-associated adverse HIV outcomes.

Results

The study design and the framework are presented in Fig. 1. Briefly, all DNA samples were extracted from WBCs collected from people who live with HIV from the Veteran Aging Cohort Study (VACS) (N = 1137). All samples were randomly divided into a discovery (cohort 1) sample and a replication (cohort 2) sample. Demographic and clinical variables are presented in Table 1. We first conducted a meta-analysis of the EWAS for smoking in two separate HIV-infected samples. We then selected smoking-associated CpGs that predicted HIV outcomes by using an ensemble-based learning approach.
Fig. 1

Study design for the epigenome-wide association study for smoking and machine learning prediction

Table 1

Demographics and clinical variables in the study population

Cohort 1 (Illumina 450K)Cohort 2 (Illumina Epic)
Smokers (N = 361)Non-smokers (N = 247) p Smokers (N = 309)Non-smokers (N = 220) p
HIV-positive (%)100100N/A100100N/A
Age49.2 ± 6.7449.7 ± 8.700.4448.0 ± 6.5548.3 ± 9.200.62
Sex (male, %)100100N/A100100N/A
Race (AA, %)87.583.40.017982.70.32
AUDIT-C4.00 ± 3.323.72 ± − 3.480.313.67 ± 3.052.99 ± 2.890.01
ART (%)76.576.10.8769.683.20.0004
WBC5.35 ± 2.095.34 ± 1.810.955.27 ± 1.965.27 ± 1.550.99
CD4408 ± 290457 ± 2720.04415 ± 200493 ± 2800.002
VL (log10)2.75 ± 1.242.52 ± 1.150.032.82 ± 1.262.55 ± 1.200.02
CD8*0.18 ± 0.080.17 ± 0.080.400.16 ± 0.080.17 ± 0.080.38
CD4*0.05 ± 0.050.05 ± 0.050.330.07 ± 0.050.08 ± 0.060.19
Nature killer cells*0.06 ± 0.050.09 ± 0.07< 0.00050.08 ± 0.050.09 ± 0.06< 0.005
B cell*0.09 ± 0.050.08 ± 0.050.470.11 ± 0.050.11 ± 0.040.44
Monocytes*0.12 ± 0.040.12 ± 0.040.830.11 ± 0.040.10 ± 0.040.002
Granulocytes*0.53 ± 0.130.51 ± 0.120.070.51 ± 0.110.49 ± 0.110.02

AA African-American, AUDIT Alcohol Use Disorder Identification Test, ART antiretroviral therapy, VL viral load

*Methylation-based cell type deconvolution by Housman et al. algorithm

Study design for the epigenome-wide association study for smoking and machine learning prediction Demographics and clinical variables in the study population AA African-American, AUDIT Alcohol Use Disorder Identification Test, ART antiretroviral therapy, VL viral load *Methylation-based cell type deconvolution by Housman et al. algorithm

DNA methylation in WBCs associated with tobacco smoking

Discovery

We profiled CpGs using the Illumina Infinium HumanMethylation 450 Beadchip (450K) (San Diego, CA, USA) in HIV-infected samples (cohort 1, N = 608; current smokers = 361; non-smokers = 247) from the VACS. After adjustment for potential confounders (i.e., age, immune cell types, adherence of antiretroviral therapy, the top principal components to limit global confounding effects), we identified 41 CpGs differentially methylated (i.e., 33 hypomethylated CpGs, 8 hypermethylated CpGs) between smokers and non-smokers (Fig. 2a, pnominal < 1.0E−7) (Additional file 1: Table S1). Of note, 40 out of 41 CpG sites were previously reported to be associated with smoking [4, 9, 10, 28–35]. The most significant sites included the established smoking biomarkers on AHRR (cg05575921, cg23576855, cg26703534, cg21161138) and on F2RL3 (cg03636183). One CpG site, cg15212292 located in the body of PRKCA, was previously reported significant association for smoking in a large meta-analysis from combined European-American (EA) and African-American (AA) populations but showed no association with smoking in AA [35]. We found this CpG site highly significant in our sample of predominantly AA (t = − 8.911; p = 5.074E−19). Consistent with previous reports, the majority of smoking-associated CpGs were hypomethylated in smokers as compared to non-smokers.
Fig. 2

Epigenome-wide association analysis in blood identified multiple CpG sites for tobacco smoking. a Discovery sample. b Replication sample

Epigenome-wide association analysis in blood identified multiple CpG sites for tobacco smoking. a Discovery sample. b Replication sample

Replication

We conducted a second EWAS for smoking in a sample that was independent of the discovery sample (VACS cohort 2, N = 529; current smokers = 309; non-smokers = 220). DNA methylation in the replication sample was profiled using the Illumina Methylation EPIC platform (San Diego, CA, USA) that included 870 K CpGs, with 408,583 CpGs shared between the Illumina 450K and EPIC arrays. To ensure consistency in comparisons across the samples, only CpGs shared across both arrays were assessed. The methylation state probes common to both platforms were highly correlated (r ~ 0.91 to 0.99). Applying the same analytical protocol, we adjusted for the same confounders in the discovery and replication samples. A total of 49 CpG sites reached epigenome-wide significance in the replication sample including the 41 CpGs identified in the discovery EWAS and 8 significant CpGs that were only seen in the replication sample (Fig. 2b) (Additional file 1: Table S2). The 8 additional CpGs were all hypomethylated in smokers compared to non-smokers. The high concordance in findings between the two samples suggests that smoking-associated CpG sites are highly reproducible.

Meta-analysis

Combining the discovery and replication samples, a meta-EWAS revealed a total of 137 CpGs that were significantly associated with smoking (p < 1.0E−7) (Table 2, Additional file 2: Figure S1). A test for heterogeneity across the two samples for these 137 CpG sites was not significant after Bonferroni correction(padjusted > 0.05) for any of the sites, suggesting that their association with smoking is not due to the confound of sample heterogeneity. Of the 137 CpG sites, 122 sites were hypomethylated, and only 15 CpG sites were hypermethylated in smokers compared to non-smokers. As expected, the most significant CpG site was cg05575921 at AHRR. An additional 15 CpG sites on AHRR were also significantly associated with smoking status. Consistent with the findings from more than 30 previous studies in HIV-uninfected samples, these results demonstrate that alteration of DNA methylation is associated with smoking exposure regardless of HIV status.
Table 2

Epigenome-wide significant CpG sites for tobacco smoking in a veteran HIV-positive population: a meta-analysis

ProbeChrPositionNearest gene p dis Effect (SE)dis p rep Effect (SE)repZ score p meta DirectionHat p
cg055759215373,378 AHRR 3.64E−40− 0.134 (0.009)1.36E−40− 0.164 (0.011)− 18.8007.56E−790.480
cg215666422233,284,6611.35E−27− 0.078 (0.007)2.64E−36− 0.096 (0.007)− 16.5431.81E−610.076
cg019402732233,284,9341.53E−26− 0.056 (0.005)6.92E−33− 0.068 (0.005)− 15.9443.12E−570.144
cg235768555373,299 AHRR 9.77E−31− 0.117 (0.009)3.67E−27− 0.127 (0.011)− 15.7913.58E−560.975
cg267035345377,358 AHRR 4.18E−21− 0.034 (0.003)2.50E−17− 0.038 (0.004)− 12.6708.65E−370.811
cg211611385399,360 AHRR 5.82E−18− 0.034 (0.004)8.62E−16− 0.039 (0.005)− 11.8033.78E−320.994
cg09935388192,947,588 GFI1 1.47E−13− 0.059 (0.008)1.61E−18− 0.090 (0.010)− 11.3944.48E−300.167
cg036361831917,000,585 F2RL3 2.42E−17− 0.067 (0.008)7.86E−12− 0.056 (0.008)− 10.8611.76E−270.438
cg213224367145,812,842 CNTNAP2 9.60E−14− 0.024 (0.003)6.16E−15− 0.028 (0.003)− 10.7664.98E−270.532
cg033295392233,283,3291.29E−15− 0.029 (0.003)3.22E−12− 0.030 (0.004)− 10.6003.00E−260.720
cg256482035395,444 AHRR 7.14E−13− 0.025 (0.003)2.29E−13− 0.033 (0.004)− 10.2481.20E−240.642
cg116600181186,510,915 PRSS23 8.89E−11− 0.024 (0.004)1.02E−15− 0.036 (0.004)− 10.2151.69E−240.148
cg237713661186,510,998 PRSS23 4.21E−07− 0.020 (0.004)4.97E−20− 0.042 (0.004)− 9.9512.50E−230.001
cg052847421493,552,128 ITPK1 1.24E−13− 0.018 (0.002)3.22E−11− 0.020 (0.003)− 9.9472.60E−230.839
cg195724871738,476,024 RARA 1.82E−08− 0.024 (0.004)3.79E−17− 0.040 (0.005)− 9.8586.30E−230.020
cg019013321175,031,054 ARRB1 1.16E−14− 0.031 (0.004)8.50E−10− 0.030 (0.005)− 9.8308.32E−230.436
cg24859433630,720,2034.20E−11− 0.024 (0.004)3.64E−13− 0.032 (0.004)− 9.7821.35E−220.415
cg15342087630,720,2094.85E−11− 0.023 (0.003)6.14E−13− 0.027 (0.004)− 9.7182.54E−220.437
cg034508421080,834,947 ZMIZ1 6.08E−11− 0.018 (0.003)5.02E−13− 0.026 (0.003)− 9.7122.69E−220.412
cg045517765393,366 AHRR 7.00E−10− 0.017 (0.003)1.34E−13− 0.029 (0.004)− 9.5581.20E−210.227
cg12803068745,002,919 MYO1G 8.63E−100.047 (0.008)5.40E−120.064 (0.009)9.1884.02E−20++0.391
cg152122951764,710,687 PRKCA 6.36E−10− 0.014 (0.002)1.22E−10− 0.021 (0.003)− 8.9115.07E−190.624
cg14753356630,720,1084.55E−09− 0.018 (0.003)6.42E−11− 0.028 (0.004)− 8.7442.25E−180.436
cg25189904168,299,493 GNG12 3.14E−06− 0.035 (0.008)7.95E−15− 0.066 (0.008)− 8.7083.11E−180.012
cg02657160398,311,063 CPOX 8.20E−10− 0.014 (0.002)7.79E−10− 0.015 (0.002)− 8.6853.79E−180.758
cg131938402233,285,2891.62E−08− 0.013 (0.002)4.55E−11− 0.021 (0.003)− 8.6226.58E−180.336
cg079863781211,898,284 ETV6 1.54E−06− 0.024 (0.005)5.41E−13− 0.040 (0.005)− 8.4353.30E−170.046
cg228515611474,214,183 C14orf43 5.28E−10− 0.019 (0.003)1.17E−08− 0.026 (0.004)− 8.4323.39E−170.948
cg27537125125,349,6811.45E−09− 0.009 (0.002)2.48E−08− 0.010 (0.002)− 8.2261.93E−160.960
cg066444282233,284,1122.07E−06− 0.016 (0.003)8.58E−12− 0.029 (0.004)− 8.1294.34E−160.079
cg146242071168,142,198 LRP5 4.95E−09− 0.015 (0.003)2.13E−08− 0.019 (0.003)− 8.0985.61E−160.915
cg262715912178,125,956 NFE2L2 5.49E−06− 0.025 (0.005)3.90E−12− 0.052 (0.007)− 8.0587.76E−160.048
cg119027775368,843 AHRR 3.87E−10− 0.012 (0.002)5.52E−07− 0.008 (0.002)− 7.9931.32E−150.543
cg19859270398,251,294 GPR15 2.46E−10− 0.013 (0.002)1.21E−06− 0.007 (0.001)− 7.9392.03E−150.443
cg087096721206,224,334 AVPR1B 9.81E−09− 0.021 (0.004)8.12E−08− 0.019 (0.004)− 7.8524.09E−150.991
cg239168965368,804 AHRR 1.14E−11− 0.040 (0.006)2.89E−05− 0.019 (0.004)− 7.8165.45E−150.116
cg19089201745,002,287 MYO1G 7.39E−080.027 (0.005)2.09E−080.042 (0.007)7.7588.63E−15++0.669
cg025834841254,677,008 HNRNPA1 2.94E−06− 0.013 (0.003)3.50E−10− 0.023 (0.004)− 7.6991.37E−140.162
cg0902223075,457,225 TNRC18 1.45E−07− 0.019 (0.004)2.52E−08− 0.027 (0.005)− 7.6462.07E−140.626
cg07178945124,488,800 FGF23 1.87E−070.020 (0.004)2.08E−080.028 (0.005)7.6352.27E−14++0.587
cg04885881111,123,1181.46E−06− 0.025 (0.005)1.79E−09− 0.034 (0.006)− 7.6262.43E−140.265
cg259495507145,814,306 CNTNAP2 1.21E−08− 0.011 (0.002)4.80E−07− 0.013 (0.002)− 7.6012.94E−140.837
cg00073090191,265,8791.20E−05− 0.011 (0.002)1.97E−10− 0.017 (0.003)− 7.5424.63E−140.095
cg272418452233,250,3702.53E−06− 0.019 (0.004)1.98E−08− 0.034 (0.006)− 7.2703.59E−130.371
cg18900812636,646,127 CDKN1A 4.74E−06− 0.017 (0.004)9.22E−09− 0.021 (0.004)− 7.2653.74E−130.280
cg151599871917,003,890 CPAMD8 8.33E−10− 0.019 (0.003)1.11E−04− 0.014 (0.004)− 7.1251.04E−120.174
cg192541631160,623,782 GPR44 6.38E−07− 0.012 (0.002)1.02E−06− 0.019 (0.004)− 6.9753.06E−120.859
cg08149865821,914,600 EPB49 1.23E−06− 0.014 (0.003)9.45E−07− 0.016 (0.003)− 6.8915.53E−120.782
cg202952141206,226,794 AVPR1B 1.47E−06− 0.014 (0.003)9.65E−07− 0.011 (0.002)− 6.8636.76E−120.766
cg23219570124,488,893 FGF23 1.30E−050.012 (0.003)8.52E−080.019 (0.004)6.8427.83E−12++0.346
cg265296555424,371 AHRR 2.39E−06− 0.011 (0.002)7.37E−07− 0.012 (0.002)− 6.8278.68E−120.687
cg115543915321,320 AHRR 6.57E−06− 0.012 (0.003)2.79E−07− 0.015 (0.003)− 6.8001.05E−110.495
cg253057038128,378,2181.26E−04− 0.020 (0.005)6.37E−09− 0.037 (0.006)− 6.7651.34E−110.103
cg090998301630,485,485 ITGAL 1.42E−06− 0.015 (0.003)2.15E−06− 0.017 (0.003)− 6.7591.39E−110.860
cg107501821073,497,514 CDH23 1.23E−07− 0.012 (0.002)2.98E−05− 0.011 (0.003)− 6.7151.88E−110.579
cg145802115150,161,299 C5orf62 1.05E−05− 0.020 (0.004)3.19E−07− 0.030 (0.006)− 6.7091.96E−110.464
cg04180046745,002,736 MYO1G 2.39E−050.027 (0.006)1.35E−070.045 (0.008)6.6862.30E−11++0.330
cg172871555393,347 AHRR 1.43E−05− 0.010 (0.002)3.74E−07− 0.011 (0.002)− 6.6393.15E−110.450
cg14316231841,895,100 MYST3 2.32E−05− 0.012 (0.003)2.87E−07− 0.026 (0.005)− 6.5954.26E−110.386
cg036040115400,201 AHRR 1.38E−070.012 (0.002)7.92E−050.005 (0.001)6.5445.98E−11++0.479
cg09662411192,946,132 GFI1 3.82E−05− 0.025 (0.006)3.56E−07− 0.038 (0.007)− 6.4848.93E−110.361
cg249969791474,223,355 C14orf43 1.17E−04− 0.009 (0.002)1.20E−07− 0.015 (0.003)− 6.4281.29E−100.214
cg1375111311118,085,214 AMICA1 8.66E−06− 0.011 (0.003)5.27E−06− 0.015 (0.003)− 6.3592.03E−100.767
cg003104121574,724,918 SEMA7A 3.66E−04− 0.011 (0.003)4.49E−08− 0.020 (0.004)− 6.3372.34E−100.117
cg18642234349,394,622 GPX1 1.24E−06− 0.013 (0.003)4.44E−05− 0.012 (0.003)− 6.3312.43E−100.748
cg263615358144,576,604 ZC3H3 9.58E−04− 0.014 (0.004)1.10E−08− 0.029 (0.005)− 6.3132.74E−100.054
cg002954852106,755,721 UXS1 5.86E−05− 0.017 (0.004)7.63E−07− 0.029 (0.006)− 6.3112.77E−100.382
cg03440944745,023,329 C7orf40 8.68E−08− 0.012 (0.002)4.66E−04− 0.013 (0.004)− 6.3012.96E−100.275
cg120759288141,801,307 PTK2 4.90E−06− 0.021 (0.005)1.60E−05− 0.026 (0.006)− 6.2843.30E−100.969
cg014812511132,912,7195.67E−05− 0.016 (0.004)1.05E−06− 0.012 (0.002)− 6.2743.51E−100.410
cg037071681949,379,127 PPP1R15A 1.16E−05− 0.024 (0.006)6.89E−06− 0.027 (0.006)− 6.2743.52E−100.766
cg114361132019,191,1451.29E−05− 0.014 (0.003)6.14E−06− 0.018 (0.004)− 6.2743.53E−100.741
cg007066832233,251,030 ECEL1P2 7.54E−080.020 (0.004)6.22E−040.016 (0.005)6.2673.69E−10++0.244
cg100629191738,503,802 RARA 3.21E−05− 0.008 (0.002)2.37E−06− 0.009 (0.002)− 6.2593.86E−100.539
cg063944601328,130,393 LNX2 8.32E−07− 0.028 (0.006)1.36E−04− 0.019 (0.005)− 6.2065.44E−100.568
cg04517079641,546,161 FOXP4 1.28E−04− 0.010 (0.002)6.03E−07− 0.014 (0.003)− 6.2055.48E−100.300
cg16547579204,954,333 SLC23A2 3.74E−04− 0.011 (0.003)1.39E−07− 0.019 (0.004)− 6.1945.87E−100.154
cg214461721223,745,234 CAPN8 1.76E−04− 0.010 (0.003)4.68E−07− 0.020 (0.004)− 6.1806.40E−100.260
cg072518871773,641,809 RECQL5 1.03E−07− 0.016 (0.003)8.91E−04− 0.013 (0.004)− 6.1587.38E−100.230
cg15474579636,645,812 CDKN1A 1.01E−05− 0.013 (0.003)2.32E−05− 0.018 (0.004)− 6.1159.66E−100.934
cg163987611474,220,238 C14orf43 1.58E−05− 0.010 (0.002)2.13E−05− 0.013 (0.003)− 6.0561.39E−090.870
cg13039251532,018,601 PDZD2 4.03E−030.014 (0.005)8.21E−090.040 (0.007)6.0351.59E−09++0.024
cg231104222140,182,073 ETS2 5.94E−05− 0.015 (0.004)5.59E−06− 0.020 (0.004)− 6.0341.60E−090.560
cg02275418315,372,726 SH3BP5 5.16E−05− 0.009 (0.002)7.61E−06− 0.008 (0.002)− 6.0131.82E−090.609
cg031883822233,245,886 ALPP 3.42E−08− 0.021 (0.004)3.83E−03− 0.012 (0.004)− 6.0081.88E−090.099
cg147120581916,988,083 SIN3B 1.93E−07− 0.018 (0.003)1.37E−03− 0.012 (0.004)− 5.9902.09E−090.226
cg03109660437,684,505 RELL1 4.91E−06− 0.013 (0.003)1.05E−04− 0.018 (0.005)− 5.9872.14E−090.780
cg0803532329,843,5251.19E−030.017 (0.005)1.31E−070.036 (0.007)5.9702.37E−09++0.099
cg130386181477,467,3911.08E−03− 0.011 (0.003)1.54E−07− 0.020 (0.004)− 5.9702.38E−090.108
cg0323477711118,095,544 AMICA1 4.97E−05− 0.009 (0.002)1.34E−05− 0.011 (0.002)− 5.9362.92E−090.677
cg13422817124,550,927 FGF6 1.02E−050.011 (0.003)7.28E−050.013 (0.003)5.9332.98E−09++0.913
cg156576411147,939,7691.73E−060.008 (0.002)3.77E−040.007 (0.002)5.9233.17E−09++0.508
cg24851513352,099,522 C3orf74 1.44E−040.007 (0.002)4.37E−060.007 (0.002)5.9133.37E−09++0.444
cg17485141242,566,5568.57E−040.005 (0.002)4.17E−070.014 (0.003)5.8903.86E−09++0.154
cg00385142398,235,918 CLDND1 2.87E−07− 0.009 (0.002)1.81E−03− 0.004 (0.001)− 5.8814.09E−090.223
cg26764244168,299,511 GNG12 1.68E−03− 0.014 (0.004)1.52E−07− 0.026 (0.005)− 5.8784.14E−090.090
cg018990895369,969 AHRR 4.09E−08− 0.016 (0.003)6.94E−03− 0.011 (0.004)− 5.8544.80E−090.077
cg250130952231,809,6723.75E−08− 0.008 (0.001)7.93E−03− 0.002 (0.001)− 5.8355.39E−090.070
cg252928821539,431,4671.72E−04− 0.011 (0.003)6.24E−06− 0.009 (0.002)− 5.8295.57E−090.459
cg14179389192,947,961 GFI1 6.93E−03− 0.013 (0.005)1.63E−08− 0.033 (0.006)− 5.8275.66E−090.022
cg098379777110,731,201 LRRN3 8.75E−06− 0.008 (0.002)1.92E−04− 0.006 (0.002)− 5.7956.84E−090.760
cg16503724317,130,667 PLCL2 2.95E−030.007 (0.002)1.19E−070.020 (0.004)5.7867.22E−09++0.065
cg2045451812133,135,463 FBRSL1 2.88E−040.021 (0.006)6.45E−060.028 (0.006)5.7281.01E−08++0.409
cg163820472231,790,037 GPR55 6.89E−06− 0.018 (0.004)3.70E−04− 0.023 (0.006)− 5.7171.08E−080.643
cg069729081630,488,321 ITGAL 1.26E−05− 0.009 (0.002)2.17E−04− 0.013 (0.003)− 5.7161.09E−080.784
cg00605777297,533,635 SEMA4C 1.29E−05− 0.009 (0.002)2.47E−04− 0.009 (0.003)− 5.6901.27E−080.768
cg003006375319,433 AHRR 9.94E−040.012 (0.004)1.84E−060.021 (0.004)5.6611.50E−08++0.214
cg0188299166,677,7565.00E−05− 0.010 (0.002)8.16E−05− 0.013 (0.003)− 5.6531.58E−080.909
cg050493351166,103,889 RIN1 4.95E−07− 0.008 (0.002)4.26E−03− 0.007 (0.002)− 5.6271.84E−080.180
cg107883711176,381,040 LRRC32 6.06E−02− 0.006 (0.003)5.69E−10− 0.024 (0.004)− 5.6002.14E−080.001
cg047165301630,485,684 ITGAL 3.32E−05− 0.006 (0.002)1.78E−04− 0.009 (0.002)− 5.5922.25E−080.928
cg09006487372,424,982 RYBP 2.93E−09− 0.023 (0.004)6.83E−02− 0.009 (0.005)− 5.5842.35E−080.007
cg108411245433,274 AHRR 2.42E−050.007 (0.002)2.63E−040.013 (0.004)5.5772.45E−08++0.833
cg019561541494,423,399 ASB2 3.59E−05− 0.009 (0.002)1.81E−04− 0.009 (0.002)− 5.5762.46E−080.936
cg268008931167,184,596 ATPGD1 5.39E−050.008 (0.002)1.21E−040.008 (0.002)5.5752.48E−08++0.955
cg05460226178,804,279 PIK3R5 2.57E−−04− 0.015 (0.004)2.13E−05− 0.029 (0.007)− 5.5722.51E−080.538
cg18146737192,946,700 GFI1 1.99E−04− 0.036 (0.010)3.50E−05− 0.037 (0.009)− 5.5432.97E−080.625
cg162011462019,191,5264.06E−07− 0.012 (0.002)7.10E−03− 0.009 (0.003)− 5.5413.01E−080.137
cg168147198134,114,834 TG;SLA 4.48E−06− 0.005 (0.001)1.41E−03− 0.005 (0.002)− 5.5333.16E−080.427
cg00501876339,193,251 CSRNP1 5.22E−04− 0.009 (0.003)1.15E−05− 0.014 (0.003)− 5.5293.22E−080.400
cg0876310243,079,751 HTT 1.73E−03− 0.007 (0.002)2.13E−06− 0.015 (0.003)− 5.5253.30E−080.184
cg07381806192,094,327 MOBKL2A 1.68E−03− 0.015 (0.005)2.62E−06− 0.035 (0.007)− 5.5023.75E−080.196
cg017317831474,211,788 C14orf43 8.41E−07− 0.008 (0.002)5.40E−03− 0.006 (0.002)− 5.5003.81E−080.185
cg1971777372,847,554 GNA12 1.86E−04− 0.033 (0.009)5.57E−05− 0.043 (0.011)− 5.4824.21E−080.691
cg17791651138,513,489 POU3F1 4.05E−05− 0.011 (0.003)2.99E−04− 0.012 (0.003)− 5.4684.55E−080.876
cg231614921590,357,202 ANPEP 6.62E−04− 0.016 (0.005)1.30E−05− 0.026 (0.006)− 5.4644.66E−080.387
cg074656271753,167,407 STXBP4 3.03E−06− 0.011 (0.002)2.99E−03− 0.013 (0.004)− 5.4395.36E−080.311
cg167023131474,251,926 C14orf43 4.78E−04− 0.007 (0.002)2.64E−05− 0.008 (0.002)− 5.4215.94E−080.490
cg073392362050,312,490 ATP9A 5.74E−03− 0.007 (0.003)6.65E−07− 0.014 (0.003)− 5.4116.27E−080.080
cg15187398192,093,896 MOBKL2A 2.94E−03− 0.011 (0.004)2.10E−06− 0.019 (0.004)− 5.4106.29E−080.150
cg027430701080,834,309 ZMIZ1 7.44E−05− 0.007 (0.002)2.35E−04− 0.009 (0.002)− 5.4066.45E−080.990
cg200599281540,361,4854.01E−05− 0.025 (0.006)5.01E−04− 0.021 (0.006)− 5.3777.57E−080.798
cg141207039139,416,102 NOTCH1 4.33E−05− 0.008 (0.002)4.80E−04− 0.008 (0.002)− 5.3727.79E−080.814
cg118168383150,484,0931.33E−02− 0.006 (0.002)1.81E−07− 0.009 (0.002)− 5.3697.90E−080.033
cg260577541183,774,231 RGL1 6.85E−06− 0.004 (0.001)2.32E−03− 0.003 (0.001)− 5.3678.01E−080.400
cg197138512233,246,594 ALPP 5.33E−05− 0.031 (0.008)4.12E−04− 0.032 (0.009)− 5.3648.13E−080.863
cg035198791474,227,499 C14orf43 8.89E−07− 0.008 (0.002)9.74E−03− 0.006 (0.002)− 5.3578.46E−080.144
Epigenome-wide significant CpG sites for tobacco smoking in a veteran HIV-positive population: a meta-analysis

Ensemble-based feature selection of DNA methylation for HIV frailty

The VACS index was used as an indicator of HIV outcome [36]. High HIV frailty and poor prognosis was defined as a VACS index of greater than 50. Ensemble learning was applied to classify the samples with a VACS index score of greater than 50 as having a poor prognosis, and samples with a VACS index of less than 50 as having a good prognosis. All samples were divided into a training set (80% of the samples in cohort 1), a testing set (20% of the samples in cohort 1), and a validating set (cohort 2). We first filtered CpGs based on p values (false discovery rate, FDR < 0.5) from the EWAS analysis. A total of 997 candidate CpGs from the discovery EWAS were used for feature selection. The goal of the feature selection was to eliminate redundant and irrelevant CpGs without losing informative loci that were associated with high frailty and poor prognosis. In our sample, the numbers of high and low VACS index samples were unequal (high VACS index = 237, low VACS index = 900). Individual machine learning approaches favor the classification of samples into the larger class (e.g., low VACS index samples). To reduce this potential bias without decreasing the sample employed in the training set, we applied a greedy ensemble-based feature selection to build a classifier less likely to be biased towards the larger class from the four machine learning methods(i.e., lasso and elastic-net regularized generalized linear model (GLMNET), support vector method (SVM), random forest (RF), and XGBoot). In the training sample from cohort 1, we applied a bootstrap aggregating (Bagging) approach, in which GLMNET was used with 100 bootstraps using 70% of the training sample, to weigh the importance of each CpG (Fig. 3a). The CpGs were subsequently clustered into 21 CpG groups from 2 to 997 CpGs based on the importance rank with an incrementation of 50 CpGs. Four machine learning methods, GLMNET, SVM, RF, and XGBoost, were applied to build prediction models using each CpG group separately. Then, a set of classifiers was determined and used to classify new data points by taking a weighted average of the prediction from each of the four machine learning methods. The performance of tenfold cross-validation for each CpG group showed high sensitivity (> 0.9) but relatively low specificity (< 0.5) for each of the 4 machine learning methods. The models from ensemble learning and 4 individual machine learnings were evaluated in the test sample separately.
Fig. 3

a Importance rank in 997 CpG sites using GLMNET method following 100 bootstraps. Only top-ranked 20 CpG sites are displayed. b Performance of the selected features predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in a test sample set measured by area under curve (AUC) in receiver operating characteristic analysis. Ensemble-based machine learning from GLMNET, RF, SVM, and XGBoost was applied. c Performance of the selected features predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in a test sample set measured by balanced accuracy. Ensemble-based machine learning from four base machine learning methods, GLMNET, RF, SVM, and XGBoost, was applied. d Venn plot showing the overlapping CpG sites between the selected 698 features and epigenome-wide significant CpG sites for tobacco smoking

a Importance rank in 997 CpG sites using GLMNET method following 100 bootstraps. Only top-ranked 20 CpG sites are displayed. b Performance of the selected features predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in a test sample set measured by area under curve (AUC) in receiver operating characteristic analysis. Ensemble-based machine learning from GLMNET, RF, SVM, and XGBoost was applied. c Performance of the selected features predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in a test sample set measured by balanced accuracy. Ensemble-based machine learning from four base machine learning methods, GLMNET, RF, SVM, and XGBoost, was applied. d Venn plot showing the overlapping CpG sites between the selected 698 features and epigenome-wide significant CpG sites for tobacco smoking In the testing set, the ensemble method selected a set of 689 CpGs that discriminated poor and good prognosis with the best performance (Fig. 3b). The prediction efficiency was estimated using receiver operator characteristic curves; the 698 CpG set displayed an area under curve (AUC) of 0.73 (95%CI 0.63~0.83) for high HIV frailty. The AUCs from RF and XGBoost at the 698 CpGs were also high (0.76). Although RF and XGBoost had high AUCs across all CpG sets, their balanced accuracy was not as good as ensemble method (Fig. 3c). Therefore, the set of 698 CpGs was selected to test the prediction efficiency. Importantly, the majority of EWAS-significant CpGs (121 out of 137 EWAS-significant CpGs) were included in the 698 CpGs (Fig. 3d), suggesting that ensemble learning enables the selection of biologically informative CpGs to predict HIV frailty.

Validation of prediction for HIV frailty using the selected 698 CpGs

To further validate the prediction results of the 698 CpGs from the discovery sample, we tested the prediction efficiency in the replication sample (cohort 2). Using the same VACS index score cut point, we found that the AUC was 0.78 (95%CI 0.73~0.83) (Fig. 4). The balanced accuracy of prediction was improved to 0.76. The results suggest that the model built in the training set had minimal overfitting features and can be applied to differentiate good and poor HIV prognosis in independent samples.
Fig. 4

Validation of performance of the selected 698 CpG sites predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in an independent sample set. a Area under curve (AUC) from receiver operating characteristic analysis. b Sensitivity and specificity of the 698 CpGs predicting the high score of VACS index. FN, false negative; FP, false positive; TN, true negative; TP, true positive

Validation of performance of the selected 698 CpG sites predicting high HIV frailty (Veteran Aging Cohort index, VACS index) in an independent sample set. a Area under curve (AUC) from receiver operating characteristic analysis. b Sensitivity and specificity of the 698 CpGs predicting the high score of VACS index. FN, false negative; FP, false positive; TN, true negative; TP, true positive Of note, to test whether an individual machine learning method alone, a penalized regression model, can select a smaller number of CpG sites than ensemble learning from genome-wide CpG sites to predict HIV frailty, we conducted a feature selection from 408,583 CpG sites using GLMNET to predict the same high and low VACS index. We found that GLMNET selected 1852 CpG sites that predicted the VACS index with AUC of 0.76 (Additional file 3: Figure S2). Although the performance of GLMNET was comparable to the ensemble-based approach, the latter was able to select a smaller number of features and linked smoking-DNA methylation to HIV outcomes. We tested whether ensemble learning can predict resilient persons that are HIV-positive. Using cutoff of the VACS index < 16 as an excellent prognosis, we found that ensemble learning showed poor performance prediction (AUC < 0.7 and balanced accuracy < 0.5). The poor prediction is likely due to an insufficient number of samples with excellent prognosis (i.e., the sample was underpowered). We were also interested in understanding whether the prediction of the high and low VACS index using the 698 CpG sites performed better than smoking status alone. We found the AUC of smoking status predicting VACS index was 0.55 (Additional file 4: Figure S3), suggesting that smoking-associated DNA methylation is a better predictor for HIV frailty compared to smoking status alone.

Prediction of the selected 698 CpGs for all-cause mortality in HIV infection

To support the value of the 698-CpG set in predicting HIV outcomes, the ability of the set to predict mortality in HIV-infected individuals was evaluated. Using the same ensemble model, we first tested the prediction performance of the 698 CpGs with mortality in cohort 2, in which 84 subjects died within 5 years after the blood draw used to profile the DNA methylome. The AUC was 0.66 (95%CI 0.60~0.73) (Additional file 5: Figure S4), which was not as good as the prediction of HIV frailty. We then constructed a DNA methylation index score based on the coefficient of each CpG site from the 698 CpGs in cohort 1. After adjusting for confounding factors such as age, CD4 count, viral load, and antiretroviral therapy, we found a significant association between the methylation index and the 5-year survival rate in cohort 2 (HR = 1.46; 95%CI 1.06~2.02, p = 0.02) (Fig. 5). As expected, the significant association was driven by hypomethylated CpG sites for smoking (HR = 1.39, p = 0.02) but not by hypermethylated CpGs for smoking (HR = 1.21, p = 0.21). The results provide further evidence that DNA methylation-based prediction of mortality can be applied in the HIV-infected population.
Fig. 5

Association of a methylation index constructed from smoking—698 CpG sites with a 5-year survival rate in HIV-infected samples

Association of a methylation index constructed from smoking—698 CpG sites with a 5-year survival rate in HIV-infected samples

Biological significance of the selected 698 CpG sites

The selected 698 CpGs were located among 445 genes (Additional file 1: Table S3). Pathway analysis showed a significant enrichment on the canonical integrin signaling pathway (p = 9.55E−05, FDR = 0.036). Fourteen out of 445 genes were in this pathway: MAP2K4, ITGA2B, ARHGAP26, PIK3R5, ITGAL, PTK2, NCK2, CAPN8, RHOG, GAB1, LIMS1, ITGA11, CTTN, and ACTN1. Integrin signaling determines cellular responses such as migration, survival, differentiation, and motility and provides a context for responding to other inputs. The function of integrin signaling is critical for cell adhesion, tissue maintenance and repair, host defense, and hemostasis. Among non-canonical pathways, cancer, organismal injury, and abnormalities were the most significant (FDR = 1.87E−17). Other top disease-related pathways were in the categories of gastrointestinal disease, liver hyperproliferation, and dermatological diseases. These results suggest that ensemble learning selected biologically relevant features underlying pathological changes in smoking-related diseases.

Discussion

Applying a DNA methylation-based machine learning approach, we report a set of smoking-associated DNA methylation sites predicting HIV prognosis and mortality in people living with HIV. The prediction of HIV frailty by the selected features showed an ability to accurately differentiate good and poor HIV-related clinical outcomes in an independent sample. The DNA methylation index constructed from the selected CpGs was also associated with mortality in the HIV-infected population. Interestingly, the selected smoking-associated methylation features were enriched in the integrin signaling pathway and related to multiple cancers and organismal injuries, which supports the hypothesis that the contributions of smoking to poor disease outcome are in part due to the changes in DNA methylation in the HIV-infected host epigenome. The study has demonstrated that the application of methylation-based machine learning can be useful for linking molecular information to clinical outcomes. One of the major challenges to building a successful model using high dimensional data to predict disease outcomes is how to select informative features among redundant or irrelevant data, background noise, and biased features [21]. We applied several approaches to guide the machine learning process. First, epigenome CpGs were filtered based on association analysis of DNA methylation sites with smoking, which considerably reduced the number of features for model building. We rationalized using smoking-associated features because smoking alters DNA methylation, and smokers have higher mortality rates in the population when living with HIV. Second, we applied ensemble learning based on the results of multiple machine learning methods to optimize the selected features and to limit the bias of each method. This data processing method typically improves the accuracy of the model when employing an unbalanced sample. Our results showed that the performance of the ensemble-based model is highly reproducible and better than individual machine learning method such as GLMNET. The advantage of the greedy ensemble machine learning approach can also reduce overfitting and improve model stability [37]. Overfitting is another major challenge in building a predictive model. To address this concern, we split the sample into two cohorts: cohort 1, which was sub-divided into a training set and a testing set, and cohort 2, which was used to replicate the predictive model performance. Thus, the features selected from cohort 1 could be independently tested in cohort 2. Therefore, the steps we took ensured we selected features with high accuracy to predict HIV outcomes. Our results showed that the selected features were predictive for HIV frailty with moderate to high sensitivity and specificity. Methylation marks for smoking were previously applied to predict frailty in an elderly population. Gao et al. reported that 9 smoking-associated CpG sites were significantly associated with higher frailty. We found that our selected 698 features showed better performance (AUC 0.78 versus AUC 0.55), which may be due to the inclusion of significantly more CpG sites and different populations in our sample compared to the Gao et al. study. The prediction of HIV frailty using the selected 698 features also outperformed the use of tobacco smoking alone. We found that the prediction of 698 sites for mortality was not as good as the prediction for the VACS index. This result is not unexpected as the model was built for the VACS index, not for mortality. Second, the number of deaths by cohort was unbalanced. In cohort 2, only 87 individuals had died at the time of this analysis, which may have reduced the power for accurate prediction. However, the methylation index with 698 CpGs was significantly predictive for 5-year survival rate. Individuals with a greater methylation index were more likely to have shorter life expectancy than individuals with lower methylation index. Importantly, the selected DNA methylation features were not only computationally effective for classifying good and poor outcomes and for predicting mortality but were also biologically relevant to HIV frailty and mortality. The selected 698 CpGs included loci in the genes involving immune activation and inflammatory processes, which is highly associated with HIV frailty and mortality. For example, the most significant smoking-associated gene, AHRR, not only involves the metabolism of endogenous toxins from smoking that result in pathological processes but also represses other signaling pathways, including NF-κappaB, and is capable of regulating inflammatory responses [38]. TNFRSF4 has been shown to activate NF-kappa B and plays a role in apoptosis. In addition, a number of CpGs in the 698 CpG sites were previously reported to involve acceleration of aging, frailty, cancer pathogenesis, and all-cause mortality. Although the majority of DNA methylation differences at a single CpG site between smokers and non-smokers are modest, the 698 features were enriched in pathways highly relevant to disease prognosis, frailty, and mortality. While a model of 698 CpG sites may seem to be a large number of features for the prediction of frailty, emerging evidence has demonstrated that DNA methylation at individual CpG sites on a complex trait is small (less than 10%) [39]. In our EWAS analysis, the effect size of single CpG sites on smoking was in a range of 1 to 13%. To predict a complex outcome such as frailty with a small number of CpG sites is highly unlikely. A recently published paper showed a panel of 200 to 1100 CpG sites predicting multiple complex traits including alcohol, smoking, HDL cholesterol, education, and death [40]. Thus, a panel of hundreds of CpG sites predicting complex traits is expected. However, methods to select more informative features and to potentially reduce the number of features in future studies are warranted. We acknowledge several limitations of this study. A recent study suggests that mRNA and miRNA profiles showed the best prediction for cancer prognosis [23]. Integrating DNA methylation with other omic and clinical data may improve the predictive value and clinical utility of the predicting model. Due to methodological limitations, we were unable to build a model to predict the VACS index as a continuous variable, which may have better clinical utility. The study was conducted in a retrospective cohort and smoking was defined from self-report, which may introduce bias. Applying our predictive model using the 698 selected features in a prospective cohort is warranted to confirm the results. The mechanisms that underlie the selected CpG features on HIV progression remain to be defined. Future studies of smoking’s effects on DNA methylation in HIV-infected specific cell types are warranted to better understand how the selected features involve smoking-related HIV prognosis. Our results demonstrate a machine learning approach to establish methylation signatures for disease outcomes. The identified methylation sites may be a biological surrogate for the VACS index to measure clinical outcomes and to predict mortality. This first-ever methylation-based machine learning-based study sheds light on the impact of smoking on risk for complicated clinical outcomes, estimated using a molecular profile, in the setting of HIV infection.

Conclusion

Applying DNA methylation-based ensemble learning, we identified a set of 698 smoking-associated DNA methylation CpG sites that predict HIV frailty and mortality. Building on more than 30 previous studies in HIV-uninfected persons, our findings suggest that smoking exposure changes DNA methylation in the HIV-infected host genome that is linked to HIV disease prognosis. Our results demonstrate that DNA methylation-based machine learning is a robust approach for the prediction of HIV prognosis.

Methods

Study population and phenotype assessment

The VACS, a nation-wide multicenter collaborative project designed to understand the role of co-morbid medical and psychiatric diseases in determining clinical outcomes in HIV infection, was the source of specimen and data (https://medicine.yale.edu/intmed/vacs/). The VACS biobank cohort is comprised of 2470 participants who were recruited for genetic studies from 2006 to 2007. Participants of the VACS biobank cohort provided written informed consent for the genetic study and provided blood samples. Clinical and demographic data were collected within 90 days of the blood sample collection. A total of 1137 samples were selected and randomly divided into two subsets (labeled cohort 1 and cohort 2), and DNA methylation was processed separately using different methylation arrays. Self-report was used to collect information on smoking status. Current smokers were defined as smoking cigarettes daily during the past week; non-smokers reported never smoking cigarettes. The VACS created an index score to estimate overall frailty of HIV-infected individuals by summing pre-assigned points for age, routinely monitored indicators of HIV disease (CD4 count and HIV-1 RNA), and general indicators of organ system injury including hemoglobin, platelets, aspartate and alanine transaminase (AST and ALT), creatinine, and viral hepatitis C infection (HCV) (https://medicine.yale.edu/intmed/vacs/welcome/vacsindexinfo.aspx). The VACS index has been associated with important changes in health condition and behavior [41]. The VACS index has been shown to predict all-cause mortality among those undergoing treatment for HIV infection [42]. A higher VACS index score indicated greater frailty. Mortality rate 5 years after blood draw was 16%.

Profiling DNA methylation using Illumina DNA methylation Beadchips

Genomic DNA was extracted from whole blood samples. DNA methylation profiling was conducted at the Yale Center for Genomic Analysis using the Illumina (San Diego, CA, USA) Infinium HumanMethylation450 BeadChip (HM450K) for cohort 1 and Illumina Infinium MethylationEPIC (EPIC) for cohort 2. Two sample sets were processed at different times but were processed by the same scientist at the Yale Center for Genomic Analysis who was blinded to the phenotypic information collected. All samples were randomly placed on each array and batch-corrected using the removeBatchEffect function in limma. Probe normalization and batch correction were performed as previously described by Lehne et al. [43].

Data quality control and normalization

In cohort 1, we removed 11,648 probes on sex chromosomes and 36,142 probes within 10 base pairs of single nucleotide polymorphisms. A total of 437,722 probes remained for analysis. As described by Lehne et al. [43], 24,416 probes on Y chromosomes were applied to evaluate the detection p value. A p < 1E−12 was set as a detection p value threshold to improve the quantification of methylation intensities. The intensity values with detection p > 1E−12 were labeled missing, and samples with a call rate < 98% were excluded. We also compared the predicted sex with self-reported sex. All samples matched as male. In cohort 2, we applied the same criteria for quality control. We removed 11 samples due to mismatched sex or low call rate. Only the 408,583 probes that were identical with HM450 array were extracted for replication analysis. Quantile normalization of intensity values was performed following the recommendations of Lehne et al. Six cell types (CD4+ T cells, CD8+ T cells, NK T cells, B cells, monocytes, and granulocytes) in the blood were estimated in each sample using the method described by Houseman et al. [44].

Data analysis

The study design and analytical approaches are summarized in Fig. 1.

Epigenome-wide association analysis

Analyses of discovery and replication stages were performed using the same pipeline [43]. To adjust for significant global confounding factors, we conducted two serial regression analyses to determine the associations between methylome-wide CpGs and smoking. The following steps were performed to correct for global co-variations that may confound specific DNA methylation in smoking. The first principal component analysis (PCA) was performed to evaluate the intensity values of positive control probes designed in HM450. Then, the first GLM was performed as follows: The residuals for each probe and the top 30 PCs of the first PCA were used to adjust for technical biases, particularly batch effects. The second PCA was performed on the resulting regression residuals from the first model. The top 5 PCs of the second PCA were used to control for global biological confounders that cannot be directly captured in the model. Final GLM model The significance threshold was set at p < 1.0E−07, which is equivalent to the Bonferroni correction. We conducted an EWAS meta-analysis by combining the data from the discovery (cohort 1) and replication (cohort 2) samples. Effect size and p values for each probe were obtained from analyses in cohort 1 and cohort 2 samples, respectively. We performed fixed-effects, inverse-variance meta-analysis, with scheme parameters of sample size and standard error by implementing the METAL (ver: 2010-02-08) program, combining summary statistics in two sample sets. We investigated heterogeneity in two sample sets using the I2 statistic.

Machine learning prediction HIV prognosis

Considering the samples were processed at different times and platforms, batch effects were removed using the removeBatchEffect function in limma using R (ver. 3.32.10) before performing the machine learning prediction. To reduce redundant DNA methylation signals and noise for improving the prediction accuracy of HIV frailty, CpG sites with FDR < 0.5 from EWAS in cohort 1 were selected for machine learning. The samples in cohort 1 were randomly divided into a training set and a test set with a ratio of 8:2. We first built a model using the training set, in which each sample was labeled poor (VACS index > 50) or good prognosis (VACS index ≤ 50). We then tested the model by performing 10-fold cross-validation in the testing set, and the best-performed model was tested in an independent replication set.

Prediction model and validation

Machine learning GLMNET was used to build a prediction model. A total of 997 CpGs from EWAS (FDR < 0.1) were ranked based on an importance value for each CpG from GLMNET. The CpG sites were clustered as 21 groups from 2 to 997 sites using 50 CpG increments. Tenfold cross-validation was performed in the training set to identify the best performing model. Additional machine learning methods were used to predict the best outcomes. GLMENT, SVM, RF, and XGBoost were performed separately. The parameters were fine-tuned by using R package caret (ver: 6.0-78) (https://libraries.io/cran/caret/6.0-78) for each algorithm. To avoid bias of each method, we used the ensemble method with R package caretEnsemble (ver: 2.0.0) (https://cran.r-project.org/web/packages/caretEnsemble/index.html) that constructed a new model by weighing the vote of each CpG from four machine learning methods. The testing set was employed to evaluate the model by ROC analysis. The best pre-formed features were used to further validate the model in the independent testing set (cohort 2) using an ensemble-based method. Sensitivity, specificity, and AUC were used to assess model performance.

Association of DNA methylation index with mortality

To examine whether the selected CpG site methylation was associated with mortality, we constructed a methylation index from the 698 CpG sites following the previous formula [45]. A separate index was constructed for hypomethylated and hypermethylated CpG sites, respectively. The association of the DNA methylation risk index with all-cause mortality was examined by Kaplan-Meier plots and log-rank tests in all samples. Cox regression model was then used to adjust for age, antiretroviral therapy adherence, HIV-1 load, and CD4 count. In the Cox regression model, the DNA methylation index score was a categorical variable (using the highest quartiles as the reference category) or a continuous variable (calculating HR for a decrease in DNA methylation by one standard deviation). Indexhypo and indexhyper were evaluated for the prediction of mortality separately.

Gene enrichment analysis

Pathway and network analysis was conducted for the selected 698 CpG sites on 455 genes by employing the Ingenuity Pathway Analysis (IPA). For genes with multiple CpG sites, the lowest p value at the CpG site within a gene was used to represent the gene level significance. Significant pathways were defined at a FDR < 0.05. Table S1. Epigenome-wide significant CpG sites associated with tobacco smoking in a discovery sample. Table S2. Epigenome-wide significant CpG sites associated with tobacco smoking in a replication sample. Table S3. Machine learning selected 698 CpGs for the prediction of HIV frailty. (XLSX 77 kb) Figure S1. Meta-analysis of epigenome-wide association of smoking in HIV-infected samples. A. Manhattan plot of meta-analysis in two sample sets. Red line indicates Bonferroni-corrected epigenome-wide significance; B. Hypo- and hyper-CpG sites associated with tobacco smoking. (PDF 1562 kb) Figure S2. Prediction of 408,583 CpG sites on HIV frailty by using GLMNET model. HIV frailty is represented by Veteran Aging Cohort Study index (VACS index). AUC: area under curve from receiver operating characteristic analysis. (PDF 54 kb) Figure S3. Prediction of smoking status on HIV frailty indicated by Veteran Aging Cohort Study (VACS) index. AUC: area under curve from receiver operating characteristic analysis. (PDF 8 kb) Figure S4. A prediction of the smoking-associated 698 CpG sites for mortality in a HIV-positive population. AUC: area under curve. (PDF 708 kb)
  45 in total

1.  F2RL3 methylation, lung cancer incidence and mortality.

Authors:  Yan Zhang; Ben Schöttker; José Ordóñez-Mena; Bernd Holleczek; Rongxi Yang; Barbara Burwinkel; Katja Butterbach; Hermann Brenner
Journal:  Int J Cancer       Date:  2015-04-17       Impact factor: 7.396

2.  An Ensemble Framework Coping with Instability in the Gene Selection Process.

Authors:  José A Castellanos-Garzón; Juan Ramos; Daniel López-Sánchez; Juan F de Paz; Juan M Corchado
Journal:  Interdiscip Sci       Date:  2018-01-08       Impact factor: 2.233

3.  Associations of self-reported smoking, cotinine levels and epigenetic smoking indicators with oxidative stress among older adults: a population-based study.

Authors:  Xu Gao; Xīn Gào; Yan Zhang; Lutz Philipp Breitling; Ben Schöttker; Hermann Brenner
Journal:  Eur J Epidemiol       Date:  2017-04-22       Impact factor: 8.082

4.  DNA methylation markers for diagnosis and prognosis of common cancers.

Authors:  Xiaoke Hao; Huiyan Luo; Michal Krawczyk; Wei Wei; Wenqiu Wang; Juan Wang; Ken Flagg; Jiayi Hou; Heng Zhang; Shaohua Yi; Maryam Jafari; Danni Lin; Christopher Chung; Bennett A Caughey; Gen Li; Debanjan Dhar; William Shi; Lianghong Zheng; Rui Hou; Jie Zhu; Liang Zhao; Xin Fu; Edward Zhang; Charlotte Zhang; Jian-Kang Zhu; Michael Karin; Rui-Hua Xu; Kang Zhang
Journal:  Proc Natl Acad Sci U S A       Date:  2017-06-26       Impact factor: 11.205

5.  Smoking-Associated DNA Methylation Biomarkers and Their Predictive Value for All-Cause and Cardiovascular Mortality.

Authors:  Yan Zhang; Ben Schöttker; Ines Florath; Christian Stock; Katja Butterbach; Bernd Holleczek; Ute Mons; Hermann Brenner
Journal:  Environ Health Perspect       Date:  2015-05-27       Impact factor: 9.031

6.  HIV-1 Infection Accelerates Age According to the Epigenetic Clock.

Authors:  Steve Horvath; Andrew J Levine
Journal:  J Infect Dis       Date:  2015-05-12       Impact factor: 5.226

7.  Reversion of AHRR Demethylation Is a Quantitative Biomarker of Smoking Cessation.

Authors:  Robert Philibert; Nancy Hollenbeck; Eleanor Andersen; Shyheme McElroy; Scott Wilson; Kyra Vercande; Steven R H Beach; Terry Osborn; Meg Gerrard; Frederick X Gibbons; Kai Wang
Journal:  Front Psychiatry       Date:  2016-04-06       Impact factor: 4.157

8.  Comparative DNA Methylation Profiling Reveals an Immunoepigenetic Signature of HIV-related Cognitive Impairment.

Authors:  Michael J Corley; Christian Dye; Michelle L D'Antoni; Mary Margaret Byron; Kaahukane Leite-Ah Yo; Annette Lum-Jones; Beau Nakamoto; Victor Valcour; Ivo SahBandar; Cecilia M Shikuma; Lishomwa C Ndhlovu; Alika K Maunakea
Journal:  Sci Rep       Date:  2016-09-15       Impact factor: 4.379

9.  Smoking induces DNA methylation changes in Multiple Sclerosis patients with exposure-response relationship.

Authors:  Francesco Marabita; Malin Almgren; Louise K Sjöholm; Lara Kular; Yun Liu; Tojo James; Nimrod B Kiss; Andrew P Feinberg; Tomas Olsson; Ingrid Kockum; Lars Alfredsson; Tomas J Ekström; Maja Jagodic
Journal:  Sci Rep       Date:  2017-11-06       Impact factor: 4.379

10.  Epigenetic prediction of complex traits and death.

Authors:  Daniel L McCartney; Robert F Hillary; Anna J Stevenson; Stuart J Ritchie; Rosie M Walker; Qian Zhang; Stewart W Morris; Mairead L Bermingham; Archie Campbell; Alison D Murray; Heather C Whalley; Catharine R Gale; David J Porteous; Chris S Haley; Allan F McRae; Naomi R Wray; Peter M Visscher; Andrew M McIntosh; Kathryn L Evans; Ian J Deary; Riccardo E Marioni
Journal:  Genome Biol       Date:  2018-09-27       Impact factor: 13.583

View more
  10 in total

1.  Mining the Selective Remodeling of DNA Methylation in Promoter Regions to Identify Robust Gene-Level Associations With Phenotype.

Authors:  Yuan Quan; Fengji Liang; Si-Min Deng; Yuexing Zhu; Ying Chen; Jianghui Xiong
Journal:  Front Mol Biosci       Date:  2021-03-26

2.  Low variability in the underlying cellular landscape adversely affects the performance of interaction-based approaches for conducting cell-specific analyses of DNA methylation in bulk samples.

Authors:  Richard Meier; Emily Nissen; Devin C Koestler
Journal:  Stat Appl Genet Mol Biol       Date:  2021-08-10

3.  High-Dimensional DNA Methylation Mediates the Effect of Smoking on Crohn's Disease.

Authors:  Tingting Wang; Pingtian Xia; Ping Su
Journal:  Front Genet       Date:  2022-04-05       Impact factor: 4.772

4.  Prediction of Microvascular Invasion in Hepatocellular Carcinoma With a Multi-Disciplinary Team-Like Radiomics Fusion Model on Dynamic Contrast-Enhanced Computed Tomography.

Authors:  Wanli Zhang; Ruimeng Yang; Fangrong Liang; Guoshun Liu; Amei Chen; Hongzhen Wu; Shengsheng Lai; Wenshuang Ding; Xinhua Wei; Xin Zhen; Xinqing Jiang
Journal:  Front Oncol       Date:  2021-03-16       Impact factor: 6.244

Review 5.  Epigenome-wide epidemiologic studies of human immunodeficiency virus infection, treatment, and disease progression.

Authors:  Boghuma K Titanji; Marta Gwinn; Vincent C Marconi; Yan V Sun
Journal:  Clin Epigenetics       Date:  2022-01-11       Impact factor: 6.551

6.  Genetic Control of Human Infection with SARS-CoV-2.

Authors:  A N Kucher; N P Babushkina; A A Sleptcov; M S Nazarenko
Journal:  Russ J Genet       Date:  2021-07-03       Impact factor: 0.581

7.  DNA methylation mediates the effect of cocaine use on HIV severity.

Authors:  Chang Shu; Amy C Justice; Xinyu Zhang; Zuoheng Wang; Dana B Hancock; Eric O Johnson; Ke Xu
Journal:  Clin Epigenetics       Date:  2020-09-14       Impact factor: 6.551

8.  Abrupt and altered cell-type specific DNA methylation profiles in blood during acute HIV infection persists despite prompt initiation of ART.

Authors:  Michael J Corley; Carlo Sacdalan; Alina P S Pang; Nitiya Chomchey; Nisakorn Ratnaratorn; Victor Valcour; Eugene Kroon; Kyu S Cho; Andrew C Belden; Donn Colby; Merlin Robb; Denise Hsu; Serena Spudich; Robert Paul; Sandhya Vasan; Lishomwa C Ndhlovu
Journal:  PLoS Pathog       Date:  2021-08-13       Impact factor: 6.823

9.  DNA methylation signature on phosphatidylethanol, not on self-reported alcohol consumption, predicts hazardous alcohol consumption in two distinct populations.

Authors:  Xiaoyu Liang; Amy C Justice; Kaku So-Armah; John H Krystal; Rajita Sinha; Ke Xu
Journal:  Mol Psychiatry       Date:  2020-02-07       Impact factor: 13.437

10.  DNA methylation biomarker selected by an ensemble machine learning approach predicts mortality risk in an HIV-positive veteran population.

Authors:  Chang Shu; Amy C Justice; Xinyu Zhang; Vincent C Marconi; Dana B Hancock; Eric O Johnson; Ke Xu
Journal:  Epigenetics       Date:  2020-10-22       Impact factor: 4.528

  10 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.