Quincy A Hathaway1,2, Skyler M Roth3, Mark V Pinti4, Daniel C Sprando5, Amina Kunovac1,2, Andrya J Durr1,2, Chris C Cook6, Garrett K Fink1, Tristen B Cheuvront3, Jasmine H Grossman3, Ghadah A Aljahli3, Andrew D Taylor1,2, Andrew P Giromini5, Jessica L Allen3, John M Hollander7,8.
Abstract
BACKGROUND: Diabetes mellitus is a chronic disease that affects an increasing percentage of people each year. Among its comorbidities, cardiovascular diseases are prominent: diabetics are two to four times more likely to develop them. While HbA1c remains the primary diagnostic for diabetics, its ability to predict long-term health outcomes across diverse demographics and ethnic groups, and at a personalized level, is limited. The purpose of this study was to provide a model for precision medicine through the implementation of machine-learning algorithms using multiple cardiac biomarkers as a means for predicting diabetes mellitus development.
Keywords: CART; Epigenetics; Heart; Machine-learning; Mitochondria; SHAP
Year: 2019 PMID: 31185988 PMCID: PMC6560734 DOI: 10.1186/s12933-019-0879-0
Source DB: PubMed Journal: Cardiovasc Diabetol ISSN: 1475-2840 Impact factor: 9.951
Patient characteristics and demographic information
| Parameter | Non-diabetic | Type 2 diabetic |
|---|---|---|
| Age | 61.97 ± 2.449 | 61.16 ± 3.047 |
| Sex | Male = 26, female = 4 | Male = 15, female = 5 |
| BMI (kg/m2) | 29.13 ± 1.08 | 29.14 ± 1.448 |
| Coronary artery disease | 73.33% ± 8.212% | 100% ± 0.0%* |
| Hypertension | 86.67% ± 6.312% | 94.74% ± 5.263% |
| Valvular disease | 26.67% ± 8.212% | 15.79% ± 8.595% |
| HbA1c | 5.567 ± 0.07898 | 8.016 ± 0.4024* |
Groups are considered significantly different if P ≤ 0.05 = * compared to non-diabetic. All data are presented as the mean ± standard error of the mean (SEM)
HbA1c: glycated hemoglobin
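The mean ± SEM convention used in the table can be illustrated with a short sketch. It assumes the SEM is the sample standard deviation (n − 1 denominator) divided by the square root of n, and that percentage rows are computed from a 0/1 indicator per patient. The group size of 30 follows from the Sex row (26 + 4); the count of 22 patients with coronary artery disease is an inferred value consistent with 73.33%, not a number stated in the source.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, SEM), where SEM = sample standard deviation / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    # statistics.stdev uses the n - 1 (sample) denominator.
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Assumed counts (inferred, not given in the source):
# 22 of 30 non-diabetic patients with coronary artery disease.
cad_non_diabetic = [1] * 22 + [0] * 8

mean, sem = mean_and_sem(cad_non_diabetic)
print(f"{100 * mean:.2f}% ± {100 * sem:.3f}%")
```

Under these assumptions the result agrees with the coronary artery disease entry for the non-diabetic group, which suggests the percentage rows were summarized the same way as the continuous ones.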
Fig. 1 Overview of machine-learning using Classification and Regression Trees (CART) and SHapley Additive exPlanations (SHAP). a Classification trees begin with a specific parameter that most successfully partitions the samples, such as CpG24 methylation, and determine the probability of correctly delineating a population into classifications, such as non-diabetic and diabetic, through a discrete value of the parameter (e.g. 0.275). The delineation is then given a probability score (i.e. 0.475, or a 47.5% chance of classifying the sample incorrectly), assigned a label, and further passed on to other parameters in the tree (e.g. CpG11 methylation and CpG28 methylation). As the samples progress through the tiers of the tree, the Gini impurity gets smaller, more accurately delineating samples that make it to that particular “truth” statement. b An example of how SHAP illustrates sample distribution. The “SHAP Value” delineates between a condition being true (value > 0.0, T2DM) and it being false (value < 0.0, ND). The more a specific value of a sample influences the composition of the model, the farther the point will migrate away from zero on the y-axis. If the value of a sample does not influence the model, it will reside near or at zero on the y-axis. In the example, a larger value of “X” and lower value of “Z” are highly predictive of the patient being ND, with these values strongly influencing the model “Y”. CpG: cytosine nucleotide followed by a guanine nucleotide; ND: non-diabetic; T2DM: type 2 diabetic
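The Gini impurity mentioned in panel a can be sketched as follows. This is a minimal illustration, not the authors' code: the methylation values and labels are hypothetical, the function names are invented for this example, and only the threshold 0.275 is borrowed from the caption. Gini impurity here is one minus the sum of squared class proportions, which is the chance of mislabeling a randomly drawn sample when labels are assigned according to the class proportions in the node.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini_after_split(values, labels, threshold):
    """Size-weighted Gini impurity of the two child nodes of a threshold split."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical methylation fractions and diagnoses, for illustration only.
methylation = [0.10, 0.20, 0.25, 0.30, 0.35, 0.40]
status = ["ND", "ND", "ND", "T2DM", "T2DM", "ND"]

print("parent impurity:", gini_impurity(status))
print("after split at 0.275:", weighted_gini_after_split(methylation, status, 0.275))
```

A tree-building procedure would evaluate many candidate thresholds and features and keep the split with the lowest weighted impurity, which is why impurity shrinks toward the leaves.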
Overview of the analysis of 6 machine-learning models on all 345 features in binary classification
| Model | Training | Training (StDev) | Testing | Testing (StDev) | F1 score | Important features | Important feature bias | AUC |
|---|---|---|---|---|---|---|---|---|
| LR | 0.608 | 0.301 | 0.667 | 0.0 | 0.640 | Complex III, Complex I, CpG31, CpG28, CpG30, Complex IV, CpG8, CpG4, CpG12, Age | (− 2.688), (− 1.688), (1.648), (− 1.163), (− 1.016), (0.982), (0.945), (0.887), (0.882), (0.848) | NA |
| LDA | 0.567 | 0.203 | 0.556 | 0.0 | 0.400 | SNP16245, SNP16344, SNP151, SNP5463, SNP4295, SNP13722, SNP94, SNP15884, SNP9055, SNP477 | (− 3.896E+15), (− 3.896E+15), (− 3.896E+15), (− 3.896E+15), (− 2.719E+15), (− 2.719E+15), (3.398E+14), (3.398E+14), (3.398E+14), (0.266) | 0.700 |
| KNN | 0.642 | 0.239 | 0.444 | 0.0 | 0.430 | NA | NA | 0.600 |
| NB | 0.725 | 0.227 | 0.778 | 0.0 | 0.780 | Mito 5hmC, Methyltransferase | (1.000), (0.000) | 0.775 |
| SVM | 0.583 | 0.337 | 0.667 | 0.0 | 0.640 | Complex III, CpG31, Complex I, CpG28, CpG8, CpG22, CpG12, CpG29, CpG4, CpG35 | (− 0.732), (0.488), (− 0.443), (− 0.372), (0.350), (− 0.349), (0.322), (− 0.260), (0.259), (0.257) | NA |
| CART | 0.790 | 0.209 | 0.711 | 0.1 | 0.714 | CpG24, CpG28, Nuc 5mC, CpG11, CpG23, CpG1, CpG4 | (0.587%), (0.213%), (0.040%), (0.040%), (0.040%), (0.040%), (0.040%) | 0.715 |
Model analysis was conducted five times and averages are reported for the resulting training accuracy, training standard deviation, testing accuracy, testing standard deviation, F1 score, and area under the curve (AUC). Important biomarker features associated with each trained model are provided along with the associated influence value for each feature. Important features are listed in order of influence within the model. LR, LDA, SVM feature bias exists as an influence parameter where magnitude dictates feature influence. A positive influence value indicates the biomarker favors classification towards one label while a negative value indicates favorable classification of the opposite label. The larger the magnitude, the more strongly that feature shifts classification. NB feature influence indicates the most important biomarker per class in binary (0,1) classification schemes. CART feature bias percentages indicate feature influence on the created classification tree. Larger percentages indicate a feature that arises near the beginning of a tree before subsequent branching. Influence is not provided for KNN due to model restrictions
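The reporting scheme described in this note (accuracy and F1 score averaged over repeated runs) can be sketched in plain Python. The labels and predictions below are hypothetical, and the F1 score is computed for a single positive class; the source does not say which F1 averaging variant was used, so that choice is an assumption.

```python
from statistics import fmean


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (true labels, predicted labels) pairs from repeated runs.
runs = [
    ([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1]),
    ([1, 1, 1, 0, 0, 0], [1, 1, 1, 0, 1, 0]),
    ([1, 1, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0]),
]

mean_accuracy = fmean(accuracy(t, p) for t, p in runs)
mean_f1 = fmean(f1_score(t, p) for t, p in runs)
print(f"mean accuracy = {mean_accuracy:.3f}, mean F1 = {mean_f1:.3f}")
```

Reporting both numbers is useful because accuracy alone can look acceptable on imbalanced groups while F1 exposes poor recall or precision on the positive class.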
Overview of the analysis of 6 machine-learning models on all 345 features in multiple classification
| Model | Training | Training (StDev) | Testing | Testing (StDev) | F1 score | Important features | Important feature bias |
|---|---|---|---|---|---|---|---|
| LR | 0.333 | 0.207 | 0.444 | 0.0 | 0.430 | Complex V, CpG35, BMI, CpG38, CpG18, CpG40, CpG19, CpG23, Complex IV, CpG25 | (− 2.417), (− 2.214), (1.942), (− 1.541), (− 1.313), (− 0.994), (− 0.881), (− 0.824), (− 0.812), (0.8071) |
| LDA | 0.433 | 0.178 | 0.333 | 0.0 | 0.170 | SNP11167, SNP10506, SNP16309, SNP16343, SNP2294, SNP14139, SNP16162, SNP3672, SNP8642, SNP143 | (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (− 4.623E+14), (5.779E+13) |
| KNN | 0.358 | 0.239 | 0.444 | 0.0 | 0.450 | NA | NA |
| NB | 0.425 | 0.243 | 0.778 | 0.0 | 0.780 | Methyltransferase, Mito 5hmC, Nuc 5hmC | (0.000), (1.000), (2.000) |
| SVM | 0.442 | 0.163 | 0.556 | 0.0 | 0.520 | Complex V, BMI, Complex III, Complex I, Complex IV, CpG31, Age, CpG19, CpG22, CpG6 | (− 0.943), (0.754), (0.561), (− 0.383), (− 0.344), (0.307), (− 0.287), (− 0.268), (− 0.210), (0.198) |
| CART | 0.660 | 0.257 | 0.556 | 0.0 | 0.558 | CpG24, TFAM CpG, TFAM Non-CpG, BMI, SNP94, Complex IV, SNP8557, CpG7, SNP242, SNP13722, Complex III, Mito 5mC | (0.328%), (0.206%), (0.176%), (0.137%), (0.016%), (0.045%), (0.016%), (0.016%), (0.016%), (0.016%), (0.016%), (0.016%) |
Model analysis was conducted five times and averages are reported for the resulting training accuracy, training standard deviation, testing accuracy, testing standard deviation, and F1 score. Important biomarker features associated with each trained model are provided along with the associated influence value for each feature. Important features are listed in order of influence within the model. LR, LDA, SVM feature bias exists as an influence parameter where magnitude dictates feature influence. A positive influence value indicates the biomarker favors classification towards one label while a negative value indicates favorable classification of the opposite label. The larger the magnitude, the more strongly that feature shifts classification. NB feature influence indicates the most important biomarker per class in multiple (0,1,2) classification schemes. CART feature bias percentages indicate feature influence on the created classification tree. Larger percentages indicate a feature that arises near the beginning of a tree before subsequent branching. Influence is not provided for KNN due to model restrictions
Fig. 2 Feature importance of physiological and biochemical characteristics from patients. a Using HbA1c for binary classification representing the factors positively (red) and negatively (blue) impacting the construction of the model, with size of the bars depicting importance. The b total nuclear methylation and c total nuclear hydroxymethylation of patients. SHAP binary depiction of the interaction between d total nuclear methylation and e total nuclear hydroxymethylation and HbA1c levels. f Not including HbA1c for binary classification representing the factors positively (red) and negatively (blue) impacting the construction of the model, with size of the bars depicting importance. SHAP binary depiction without HbA1c of the interaction between g total nuclear methylation and methyltransferase activity and h electron transport chain complex III and BMI. Examining the multiple classification effects of prediabetes, i A modified T-Plot where the main effects of biomarkers on the prediction output are shown along the diagonal axis whereas interaction effects are shown off the diagonal. SHAP depiction of patient separation with the individual and correlated effects of HbA1c and total nuclear methylation. SHAP multiple classification depiction of the interaction between j total nuclear methylation and HbA1c. SHAP values > 0.0 are diabetic (T2DM), SHAP values < 0.0 are non-diabetic (ND), SHAP values = 0 are either ND or T2DM without influence on the model. Groups are considered significantly different if P ≤ 0.05 = * compared to non-diabetic. All data are presented as the mean ± standard error of the mean (SEM). ND: non-diabetic; T2DM: type 2 diabetic; Nuc: nuclear; Mito: mitochondrial; 5mC: 5-methylcytosine; 5hmC: 5-hydroxymethylcytosine; HbA1c: glycated hemoglobin; binary: no diabetes and diabetes; multiple: no diabetes, prediabetes, and type 2 diabetes
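The sign convention for SHAP values in these captions (positive pushes toward T2DM, negative toward ND, zero means no influence) follows from the Shapley value underlying SHAP. The sketch below computes exact Shapley values by brute force for a toy two-feature score. Everything in it is hypothetical: the scoring function, its coefficients, the feature names, and the sample and baseline values are invented for illustration, and the SHAP library itself uses faster tree-specific algorithms rather than this enumeration.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, sample, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    names = list(sample)
    totals = {name: 0.0 for name in names}
    for order in permutations(names):
        current = dict(baseline)  # start every ordering from the baseline input
        previous_output = model(current)
        for name in order:
            current[name] = sample[name]  # switch this feature to the sample's value
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    n_orders = factorial(len(names))
    return {name: total / n_orders for name, total in totals.items()}


def toy_risk_score(features):
    """Hypothetical score with an interaction term; not a model from the study."""
    return (0.5 * features["hba1c"]
            - 2.0 * features["nuc_5mc"]
            + 0.1 * features["hba1c"] * features["nuc_5mc"])


sample = {"hba1c": 8.0, "nuc_5mc": 0.2}    # hypothetical patient
baseline = {"hba1c": 5.5, "nuc_5mc": 0.4}  # hypothetical reference values

phi = shapley_values(toy_risk_score, sample, baseline)
for name, value in phi.items():
    direction = "toward T2DM" if value > 0 else "toward ND" if value < 0 else "no influence"
    print(f"{name}: {value:+.3f} ({direction})")
```

The contributions sum to the difference between the score for the sample and the score for the baseline, which is the additivity property that lets each point in a SHAP plot be read as a signed share of the prediction.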
Fig. 3 Feature importance of mitochondrial DNA SNPs from patients. a The most important predictive parameters using binary classification with HbA1c, the absolute value of a feature being high (red) or low (blue) depicting diabetic (right-side) or non-diabetic (left-side) status. b The most important predictive parameters using binary classification without HbA1c, the absolute value of a feature being high (red) or low (blue) depicting diabetic (right-side) or non-diabetic (left-side) status. c Frequency of mitochondrial DNA SNPs by nucleotide converted in ND and T2DM patients; increasing frequency of SNPs occurring in the patient population are depicted by movement closer to the mitochondrial DNA strand. d SHAP binary depiction with HbA1c of the interaction between SNP16126 and HbA1c. e SHAP binary depiction without HbA1c of the interaction between SNP7028 and SNP73. SHAP values > 0.0 are diabetic (T2DM), SHAP values < 0.0 are non-diabetic (ND), SHAP values = 0 are either ND or T2DM without influence on the model. ND: non-diabetic; T2DM: type 2 diabetic; HbA1c: glycated hemoglobin; binary: no diabetes and diabetes; multiple: no diabetes, prediabetes, and type 2 diabetes
Fig. 4 Feature importance of CpG island methylation of TFAM from patients. a Methylation across the promoter CpG region of the TFAM gene was determined using overhang bisulfite sequencing. b Experimental paradigm for amplification of the bisulfite-converted DNA for 23 CpG sites proximal (Amplicon 1) and 19 CpG sites distal (Amplicon 2) to the TFAM start site. SHAP binary depiction with HbA1c of the interaction between c CpG24 methylation and HbA1c and d CpG29 methylation and HbA1c. e Not including HbA1c for binary classification representing the factors positively (red) and negatively (blue) impacting the construction of the model, with size of the bars depicting importance. f A modified T-Plot where the main effects of biomarkers on the prediction output are shown along the diagonal axis whereas interaction effects are shown off the diagonal. SHAP binary depiction without HbA1c of patient separation with the individual and correlated effects of CpG24 methylation and CpG29 methylation. g Using HbA1c for multiple classification representing the factors positively (red) and negatively (blue) impacting the construction of the model, with size of the bars depicting importance. h SHAP multiple classification depiction with HbA1c of the interaction between TFAM gene total methylation and HbA1c. SHAP values > 0.0 are diabetic (T2DM), SHAP values < 0.0 are non-diabetic (ND), SHAP values = 0 are either ND or T2DM without influence on the model. Groups are considered significantly different if P ≤ 0.05 = * compared to non-diabetic. All data are presented as the mean ± standard error of the mean (SEM). ND: non-diabetic; T2DM: type 2 diabetic; HbA1c: glycated hemoglobin; CpG: cytosine nucleotide followed by a guanine nucleotide; TFAM: transcription factor A, mitochondrial; binary: no diabetes and diabetes; multiple: no diabetes, prediabetes, and type 2 diabetes
Fig. 5 Feature importance of best factors combined from patients. The most important predictive parameters using a binary and b multiple classification with HbA1c, the absolute value of a feature being high (red) or low (blue) depicting diabetic (right-side) or non-diabetic (left-side) status. The most important predictive parameters using c binary and d multiple classification without HbA1c, the absolute value of a feature being high (red) or low (blue) depicting diabetic (right-side) or non-diabetic (left-side) status. SHAP e binary and f multiple classification depiction without HbA1c of the interaction between total nuclear methylation and CpG24 methylation. SHAP values > 0.0 are diabetic (T2DM), SHAP values < 0.0 are non-diabetic (ND), SHAP values = 0 are either ND or T2DM without influence on the model. ND: non-diabetic; T2DM: type 2 diabetic; HbA1c: glycated hemoglobin; CpG: cytosine nucleotide followed by a guanine nucleotide; Nuc: nuclear; 5mC: 5-methylcytosine; binary: no diabetes and diabetes; multiple: no diabetes, prediabetes, and type 2 diabetes
Fig. 6 Overview of machine-learning pipeline implementing biological variables across a spectrum of gathered information. From the patient population undergoing coronary artery bypass graft surgery (CABG), physiological parameters (demographics, health reports, etc.) and atrial tissue were used for subsequent analyses. From cardiac tissue, genomic (mitochondrial DNA), epigenomic (TFAM promoter CpG methylation), and biochemical (nuclear and mitochondrial function) data were assessed. Cumulatively, the biological data were processed through tree ensembles in SHAP and validated through CART analysis with tenfold cross validation. Using these machine-learning algorithms, graphical depictions and biomarker feature importance can be derived, allowing for prediction of the onset and progression of diabetes. Ultimately, using biological data at the genomic and epigenomic level allows for precision medicine approaches and more personalized diagnostics and prognostics. TFAM: transcription factor A, mitochondrial; mtDNA: mitochondrial DNA; CpG: cytosine nucleotide followed by a guanine nucleotide; CART: Classification and Regression Trees; SHAP: SHapley Additive exPlanations
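The tenfold cross validation step named in this caption can be sketched as an index-splitting routine. This is a generic illustration under stated assumptions, not the study's pipeline: the function name, the shuffling seed, and the sample count of 50 (taken from the 30 + 20 patients implied by the demographics table) are choices made for this example, and no stratification by diagnosis is attempted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


# Assumed cohort size: 30 non-diabetic + 20 type 2 diabetic patients.
for fold_number, (train_idx, test_idx) in enumerate(k_fold_indices(50), start=1):
    print(f"fold {fold_number}: {len(train_idx)} train, {len(test_idx)} test")
```

In a full workflow, a model would be fitted on each training split and scored on the matching held-out split, with the ten scores averaged to estimate performance on unseen patients.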