Meeshanthini V Dogan, Isabella M Grumbach, Jacob J Michaelson, Robert A Philibert.
Abstract
An improved method for detecting coronary heart disease (CHD) could have substantial clinical impact. Building on the idea that systemic effects of CHD risk factors are a conglomeration of genetic and environmental factors, we use machine learning techniques and integrate genetic, epigenetic and phenotype data from the Framingham Heart Study to build and test a Random Forest classification model for symptomatic CHD. Our classifier was trained on n = 1,545 individuals and consisted of four DNA methylation sites, two SNPs, age and gender. The methylation sites and SNPs were selected during the training phase. The final trained model was then tested on n = 142 individuals. The test data comprised individuals who had been removed on the basis of their relatedness to those in the training dataset. This integrated classifier classified symptomatic CHD status in the test set with an accuracy, sensitivity and specificity of 78%, 0.75 and 0.80, respectively. In contrast, a model using only conventional CHD risk factors as predictors had an accuracy and sensitivity of only 65% and 0.42, respectively, but a specificity of 0.89 in the test set. Regression analyses of the methylation signatures illustrate our ability to map these signatures to known risk factors in CHD pathogenesis. These results demonstrate the capability of an integrated approach to effectively model symptomatic CHD status. These results also suggest that future studies of biomaterial collected from longitudinally informative cohorts that are specifically characterized for cardiac disease at follow-up could lead to the introduction of sensitive, readily employable integrated genetic-epigenetic algorithms for predicting onset of future symptomatic CHD.
Year: 2018 PMID: 29293675 PMCID: PMC5749823 DOI: 10.1371/journal.pone.0190549
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Filtering and modeling.
Flowchart summarizing steps performed for Random Forest model training and testing.
Summary of variables.
The demographics and CHD risk factors of the 1,545 and 142 individuals included in the training and testing sets, respectively.
| | Training: CHD | Training: No CHD | Testing: CHD | Testing: No CHD |
|---|---|---|---|---|
| Male | 115 | 579 | 49 | 39 |
| Female | 58 | 793 | 22 | 32 |
| Male | 71.1±7.4 | 66.4±8.5 | 67.5±8.4 | 59.6±9.2 |
| Female | 73.0±8.7 | 66.4±8.6 | 72.5±9.0 | 64.6±10.8 |
| Male | 154±33 | 176±33 | 141±25 | 191±32 |
| Female | 172±35 | 199±36 | 180±41 | 187±35 |
| Male | 45±12 | 50±14 | 46±11 | 51±15 |
| Female | 59±17 | 65±19 | 62±18 | 61±18 |
| Male | 6.0±0.9 | 5.7±0.8 | 5.9±0.9 | 5.9±1.4 |
| Female | 6.0±0.9 | 5.7±0.5 | 6.3±1.0 | 6.0±1.0 |
| Male | 128±19 | 130±17 | 124±19 | 127±17 |
| Female | 135±18 | 129±18 | 136±17 | 129±15 |
| Male | 94 | 228 | 1 | 3 |
| Female | 44 | 254 | 0 | 0 |
| Male | -0.15±1.19 | -0.07±1.05 | -0.46±1.43 | -0.26±1.12 |
| Female | -0.12±1.11 | 0.08±0.92 | 0.10±1.12 | -0.13±0.93 |
| Male | 12 | 39 | 6 | 7 |
| Female | 2 | 64 | 1 | 4 |
DNA methylation and CHD status.
Top 30 significant methylation sites associated with symptomatic CHD after Bonferroni correction for multiple comparisons using training set data.
| CpG | Beta | Gene | Chr | Position | Island Status | Corrected p-value |
|---|---|---|---|---|---|---|
| cg26910465 | 6.48E-01 | ADAL | 15 | TSS200 | Island | 8.01E-18 |
| cg13567813 | 6.60E-01 | NR1H2 | 19 | TSS200 | Island | 2.05E-17 |
| cg09238957 | 5.98E-01 | ORC6L | 16 | TSS200 | Island | 7.97E-17 |
| cg04099813 | 6.12E-01 | TSSC4 | 11 | TSS1500 | S_Shore | 1.45E-16 |
| cg07546106 | 6.29E-01 | TAP2 | 6 | 5’UTR | N_Shore | 2.40E-16 |
| cg20808462 | 6.01E-01 | HAUS3 | 4 | 5’UTR | Island | 5.42E-16 |
| cg16968115 | 5.92E-01 | WDTC1 | 1 | TSS200 | Island | 1.25E-15 |
| cg24475210 | 5.84E-01 | MRFAP1 | 4 | TSS200 | Island | 1.26E-15 |
| cg03031660 | 5.84E-01 | MRPS7 | 17 | 1stExon | Island | 1.45E-15 |
| cg22605179 | 5.97E-01 | EWSR1 | 22 | 5’UTR | Island | 3.81E-15 |
| cg02357877 | 5.71E-01 | GBAS | 7 | TSS1500 | Island | 4.04E-15 |
| cg22111723 | 5.65E-01 | | 13 | | Island | 4.57E-15 |
| cg06117184 | 5.67E-01 | CKAP2L | 2 | 1stExon | Island | 4.87E-15 |
| cg07478100 | 5.85E-01 | MIS12 | 17 | TSS1500 | Island | 5.36E-15 |
| cg15318396 | 5.83E-01 | | 21 | | Island | 5.52E-15 |
| cg00544901 | 5.76E-01 | RPS11 | 19 | TSS1500 | Island | 5.62E-15 |
| cg24478630 | 5.88E-01 | MOGS | 2 | TSS200 | S_Shore | 5.65E-15 |
| cg04022019 | 5.90E-01 | DCAF13 | 8 | 1stExon | Island | 5.86E-15 |
| cg12124516 | 5.81E-01 | MCM6 | 2 | TSS200 | Island | 6.41E-15 |
| cg20935862 | 5.96E-01 | C9orf41 | 9 | TSS1500 | Island | 6.62E-15 |
| cg07377675 | 6.00E-01 | USP1 | 1 | TSS200 | Island | 7.79E-15 |
| cg07734253 | 5.83E-01 | CORO1A | 16 | TSS1500 | N_Shore | 8.16E-15 |
| cg03699307 | 5.94E-01 | GABARAPL2 | 16 | TSS1500 | Island | 8.44E-15 |
| cg17360140 | 5.79E-01 | C4orf29 | 4 | TSS1500 | Island | 9.10E-15 |
| cg25632648 | 6.06E-01 | KCTD21 | 11 | TSS200 | Island | 9.83E-15 |
| cg06339248 | 5.83E-01 | ZDHHC5 | 11 | 5’UTR | Island | 1.10E-14 |
| cg24275354 | 6.24E-01 | NDUFA10 | 2 | Body | N_Shore | 1.52E-14 |
| cg25261764 | 5.93E-01 | NARS | 18 | 1stExon | Island | 1.69E-14 |
| cg14172283 | 5.83E-01 | TOMM5 | 9 | 1stExon | Island | 2.00E-14 |
| cg01089095 | 5.69E-01 | CHCHD1 | 10 | TSS200 | Island | 2.02E-14 |
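As a minimal sketch of the Bonferroni correction named in the caption above: each raw p-value is multiplied by the number of tests performed and capped at 1. The raw p-values and the number of tests below are hypothetical placeholders, since this record reports neither; the function name `bonferroni` is likewise illustrative and is not taken from the study.

```python
def bonferroni(p_values, n_tests=None):
    """Return Bonferroni-corrected p-values, each capped at 1.0.

    n_tests defaults to the number of p-values supplied; pass it explicitly
    when only a subset of the tested sites is being corrected.
    """
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values, for illustration only (not values from the study).
raw = [1e-20, 3e-7, 0.02]

# Hypothetical number of methylation sites tested (an assumption).
corrected = bonferroni(raw, n_tests=400_000)

# A site is called significant if its corrected p-value stays below 0.05.
significant = [p < 0.05 for p in corrected]
print(corrected, significant)
```

Because the correction scales with the number of tests, only sites with very small raw p-values survive it, which is consistent with the magnitude of the corrected values listed in the table.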
Integrated genetic-epigenetic training metrics.
The 10-fold cross-validation performance metrics of the eight integrated genetic-epigenetic models within the ensemble on the training set.
| Model | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| 1 | 0.78±0.09 | 0.82±0.09 | 0.79±0.12 | 0.77±0.08 |
| 2 | 0.75±0.05 | 0.83±0.06 | 0.78±0.10 | 0.72±0.08 |
| 3 | 0.79±0.05 | 0.85±0.07 | 0.83±0.07 | 0.76±0.08 |
| 4 | 0.78±0.07 | 0.84±0.07 | 0.79±0.12 | 0.76±0.07 |
| 5 | 0.75±0.06 | 0.78±0.06 | 0.70±0.09 | 0.79±0.09 |
| 6 | 0.70±0.05 | 0.77±0.05 | 0.70±0.12 | 0.70±0.10 |
| 7 | 0.80±0.06 | 0.87±0.04 | 0.82±0.08 | 0.77±0.07 |
| 8 | 0.78±0.06 | 0.85±0.05 | 0.82±0.07 | 0.74±0.08 |
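The mean±SD entries above summarize a metric over the ten cross-validation folds. The following is a rough sketch of that bookkeeping only, not of the study's pipeline: the shuffling scheme, the seed, the use of the sample standard deviation and the per-fold scores are all assumptions, and the full training-set size of 1,545 is used merely as a stand-in for whatever subset each model actually saw.

```python
import random
import statistics


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def summarize(fold_scores):
    """Format per-fold scores as mean±SD (sample standard deviation, assumed)."""
    mean = statistics.mean(fold_scores)
    sd = statistics.stdev(fold_scores)
    return f"{mean:.2f}±{sd:.2f}"


folds = k_fold_indices(1545, k=10)

# Each fold is held out once; the remaining nine folds are used for fitting.
for held_out in folds:
    fit_on = [i for fold in folds if fold is not held_out for i in fold]
    # A classifier would be fitted on fit_on and scored on held_out here.

# Hypothetical per-fold accuracies, for illustration only.
example_scores = [0.70, 0.85, 0.78, 0.90, 0.66, 0.81, 0.74, 0.79, 0.88, 0.69]
print(summarize(example_scores))
```

Whether the reported spread is a sample or a population standard deviation is not stated here; swapping `statistics.stdev` for `statistics.pstdev` covers the other case.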
Fig 2. Integrated genetic-epigenetic model ROC curve.
The Receiver Operating Characteristic curves of the integrated genetic-epigenetic model with the largest average 10-fold cross-validation area under the curve value.
Testing of integrated genetic-epigenetic ensemble.
The confusion matrix of the integrated genetic-epigenetic ensemble of eight models on the test dataset consisting of 142 individuals.
| True \ Predicted | CHD absent | CHD present |
|---|---|---|
| CHD absent | 57 | 14 |
| CHD present | 18 | 53 |
DNA methylation and statin use.
All significant methylation sites associated with statin use after Bonferroni correction for multiple comparisons using training set data.
| CpG | Beta | Gene | Chr | Position | Island Status | Corrected p-value |
|---|---|---|---|---|---|---|
| cg17901584 | -4.59E-01 | DHCR24 | 1 | TSS1500 | S_Shore | 3.53E-12 |
| cg06500161 | 4.43E-01 | ABCG1 | 21 | Body | S_Shore | 9.30E-11 |
| cg05119988 | -3.30E-01 | SC4MOL | 4 | 5’UTR | S_Shelf | 5.22E-04 |
| cg19751789 | -3.23E-01 | LDLR | 19 | TSS200 | N_Shore | 1.54E-03 |
| cg27243685 | 3.01E-01 | ABCG1 | 21 | Body | S_Shelf | 1.87E-02 |
| cg01185530 | -2.58E-01 | DNAJC3 | 13 | 5’UTR | Island | 2.81E-02 |
| cg11072882 | -2.74E-01 | FAM47E | 4 | Body | Island | 4.97E-02 |
Conventional risk factor training metrics.
The 10-fold cross-validation performance metrics of the eight conventional risk factor models within the ensemble on the training set.
| Model | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|
| 1 | 0.73±0.03 | 0.77±0.05 | 0.71±0.07 | 0.75±0.10 |
| 2 | 0.73±0.07 | 0.75±0.08 | 0.74±0.08 | 0.72±0.09 |
| 3 | 0.75±0.07 | 0.79±0.06 | 0.73±0.12 | 0.77±0.10 |
| 4 | 0.70±0.06 | 0.75±0.08 | 0.68±0.10 | 0.72±0.07 |
| 5 | 0.70±0.06 | 0.72±0.08 | 0.67±0.09 | 0.73±0.10 |
| 6 | 0.71±0.10 | 0.75±0.10 | 0.68±0.14 | 0.75±0.10 |
| 7 | 0.76±0.04 | 0.79±0.05 | 0.73±0.11 | 0.79±0.09 |
| 8 | 0.71±0.10 | 0.76±0.12 | 0.68±0.15 | 0.75±0.11 |
Fig 3. Conventional risk factors model ROC curve.
The Receiver Operating Characteristic curves of the conventional risk factors model with the largest average 10-fold cross-validation area under the curve value.
Testing of conventional risk factors ensemble.
The confusion matrix of the conventional risk factors ensemble of eight models on the test dataset consisting of 142 individuals.
| True \ Predicted | CHD absent | CHD present |
|---|---|---|
| CHD absent | 63 | 8 |
| CHD present | 41 | 30 |
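The accuracy, sensitivity and specificity quoted in the abstract can be recomputed from the two test-set confusion matrices. The sketch below assumes that rows are the true status and columns the predicted status, with "CHD absent" listed first; this orientation is an inference, chosen because it reproduces the sensitivity and specificity reported in the abstract for both ensembles. The helper name `metrics` is illustrative.

```python
def metrics(tn, fp, fn, tp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate among CHD cases
        "specificity": tn / (tn + fp),  # true negative rate among non-cases
    }


# Counts read from the test-set tables (n = 142), under the assumed orientation.
integrated = metrics(tn=57, fp=14, fn=18, tp=53)
conventional = metrics(tn=63, fp=8, fn=41, tp=30)

for name, result in [("integrated", integrated), ("conventional", conventional)]:
    summary = ", ".join(f"{key}={value:.3f}" for key, value in result.items())
    print(f"{name}: {summary}")
```

Read this way, the conventional risk factor ensemble misses 41 of the 71 individuals with CHD in the test set, which is what drives its low sensitivity despite its higher specificity.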