An Dinh, Stacey Miertschin, Amber Young, Somya D Mohanty.
Abstract
BACKGROUND: Diabetes and cardiovascular disease are two of the main causes of death in the United States. Identifying and predicting these diseases in patients is the first step towards stopping their progression. We evaluate the capabilities of machine learning models in detecting at-risk patients using survey data (and laboratory results), and identify key variables within the data contributing to these diseases among the patients.
Keywords: Ensemble learning; Feature learning; Health analytics; Machine learning
Year: 2019 PMID: 31694707 PMCID: PMC6836338 DOI: 10.1186/s12911-019-0918-5
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Model Development and Evaluation Pipeline. A flow chart visualizing the data processing and model development process.
Diabetes classification criteria
| Criteria | | Classification |
|---|---|---|
| Answered “yes” to “Have you been told by a doctor that you have diabetes” | ⇒ | Diabetic |
| Answered “no”, but had a Plasma Glucose ≥126 mg/dl | ⇒ | Undiagnosed diabetic |
| Had a Plasma Glucose between 100−125 mg/dl | ⇒ | Prediabetic |
| Had a Plasma Glucose ≤100 mg/dl | ⇒ | Not diabetic |
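The criteria above can be read as a top-to-bottom decision rule. The following is a minimal sketch of that reading, not the authors' code: the function name and argument names are hypothetical, and because the table's ranges overlap at exactly 100 mg/dl, the sketch assumes that a value of 100 counts as prediabetic.

```python
from typing import Optional


def classify_diabetes(told_by_doctor: bool,
                      plasma_glucose: Optional[float]) -> Optional[str]:
    """Apply the table's rules in order; plasma_glucose is in mg/dl."""
    if told_by_doctor:
        # A "yes" to the doctor question decides the class on its own.
        return "Diabetic"
    if plasma_glucose is None:
        # Without a glucose value none of the remaining rules can be applied.
        return None
    if plasma_glucose >= 126:
        return "Undiagnosed diabetic"
    if plasma_glucose >= 100:
        # Assumption: the boundary value 100 is treated as prediabetic.
        return "Prediabetic"
    return "Not diabetic"


print(classify_diabetes(False, 131.0))
```

Returning `None` for a missing glucose value is a design choice of this sketch; how such records were handled in the study is not stated in this section.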
NHANES survey questionnaire
NHANES laboratory results
Label assignments for Case I and Case II
| Classification | Case I | Case II |
|---|---|---|
| Diabetic | 1 | Excluded |
| Undiagnosed diabetic | 1 | 1 |
| Prediabetic | 0 | 1 |
| Not diabetic | 0 | 0 |
Case I - Records containing diabetic, pre / undiagnosed and non diabetic patients. Case II - Records containing pre / undiagnosed and non diabetic patients only. 1 - Positive record for the case; 0 - Negative record for the case (non diabetic patient)
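The label table amounts to a lookup from classification to binary target, with one exclusion. A small sketch of that lookup follows; the dictionary and function names are hypothetical, and `None` is used here (as an assumption) to represent an excluded record.

```python
from typing import Optional

# Rows copied from the label assignment table; None marks "Excluded".
LABELS = {
    "Diabetic":             {"I": 1, "II": None},
    "Undiagnosed diabetic": {"I": 1, "II": 1},
    "Prediabetic":          {"I": 0, "II": 1},
    "Not diabetic":         {"I": 0, "II": 0},
}


def assign_label(classification: str, case: str) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None (record excluded)."""
    return LABELS[classification][case]


# Prediabetic records are negatives in Case I but positives in Case II.
print(assign_label("Prediabetic", "I"), assign_label("Prediabetic", "II"))
```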
Cardiovascular disease classification criteria and label assignments
| Criteria | | Classification | Label Assignment |
|---|---|---|---|
| Answered “yes” to having had one of the following | ⇒ | Having heart diseases | 1 |
| If they answered “no” to all conditions | ⇒ | Not having heart diseases | 0 |
γ - On the NHANES survey questionnaire. 1 - Positive record for CVD; 0 - Negative record for CVD
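The cardiovascular rule is an "any yes" test over a set of questionnaire answers. The sketch below illustrates it under stated assumptions: the function name is hypothetical, answers are assumed to be plain strings, and the specific conditions asked about are not listed in this section, so the input is just a sequence of answers.

```python
from typing import Iterable, Optional


def cvd_label(answers: Iterable[str]) -> Optional[int]:
    """Return 1 if any answer is "yes", 0 if all are "no", else None."""
    normalized = [a.strip().lower() for a in answers]
    if "yes" in normalized:
        return 1
    if normalized and all(a == "no" for a in normalized):
        return 0
    # Neither rule in the table applies (e.g. a missing or refused answer).
    return None


print(cvd_label(["no", "yes", "no"]))
```

Returning `None` when neither rule applies is a choice made for this sketch; the table itself only defines the two cases shown.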
The structure of the datasets used for diabetes and cardiovascular classification
| Year | Case | Observations | Variables | No. of 0s | No. of 1s |
|---|---|---|---|---|---|
| 1999-2014 | Case I | 21,131 | 123 | 15,599 | 5,532 |
| 1999-2014 | Case II | 16,426 | 123 | 9,944 | 6,482 |
| 2003-2014 | Case I | 16,443 | 168 | 11,977 | 4,466 |
| 2003-2014 | Case II | 12,636 | 168 | 7,503 | 5,133 |
| 2007-2014 | Cardio | 8,459 | 131 | 7,012 | 1,447 |
Case I and II datasets are for diabetes classification, Cardio dataset is for CVD classification. 1 - Positive records for the disease; 0 - Negative records for the disease
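The counts in the table can be checked for internal consistency and turned into a positive-class rate, which indicates how imbalanced each dataset is. This is a small arithmetic sketch using only the numbers in the table; the names `DATASETS`, `positive_rate`, and `summarize` are hypothetical.

```python
# (years, dataset, observations, negatives, positives), copied from the table.
DATASETS = [
    ("1999-2014", "Case I", 21131, 15599, 5532),
    ("1999-2014", "Case II", 16426, 9944, 6482),
    ("2003-2014", "Case I", 16443, 11977, 4466),
    ("2003-2014", "Case II", 12636, 7503, 5133),
    ("2007-2014", "Cardio", 8459, 7012, 1447),
]


def positive_rate(negatives: int, positives: int) -> float:
    """Fraction of records labelled 1."""
    return positives / (negatives + positives)


def summarize(rows=DATASETS):
    """Verify that 0s + 1s equals the observation count; return rates in %."""
    summary = []
    for years, name, observations, negatives, positives in rows:
        if negatives + positives != observations:
            raise ValueError(f"counts do not add up for {years} {name}")
        rate = round(100 * positive_rate(negatives, positives), 1)
        summary.append((years, name, rate))
    return summary


for years, name, rate in summarize():
    print(f"{years} {name}: {rate}% positive")
```

All five rows add up. The Cardio dataset is the most imbalanced, at roughly 17% positive records, while the Case II datasets are the closest to balanced.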
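Results using 10-fold cross-validation for diabetes classification
| Lab | Year & Case | Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| No lab | 1999-2014, Diab. Case I | Logistic Reg. | 0.827 | 0.75 | 0.75 | 0.75 |
| No lab | 1999-2014, Diab. Case I | SVM | 0.849 | 0.77 | 0.77 | 0.77 |
| No lab | 1999-2014, Diab. Case I | Random Forest | 0.855 | 0.78 | 0.78 | 0.78 |
| No lab | 1999-2014, Diab. Case I | | | | | |
| No lab | 1999-2014, Diab. Case I | Ensemble | 0.859 | 0.78 | 0.78 | 0.78 |
| No lab | 1999-2014, Diab. Case II | Logistic Reg. | 0.732 | 0.67 | 0.67 | 0.67 |
| No lab | 1999-2014, Diab. Case II | SVM | 0.734 | 0.68 | 0.68 | 0.68 |
| No lab | 1999-2014, Diab. Case II | Random Forest | 0.731 | 0.67 | 0.67 | 0.67 |
| No lab | 1999-2014, Diab. Case II | XGBoost | 0.734 | 0.67 | 0.67 | 0.67 |
| No lab | 1999-2014, Diab. Case II | | | | | |
| No lab | 2003-2014, Diab. Case I | Logistic Reg. | 0.800 | 0.72 | 0.72 | 0.72 |
| No lab | 2003-2014, Diab. Case I | SVM | 0.822 | 0.75 | 0.75 | 0.75 |
| No lab | 2003-2014, Diab. Case I | | | | | |
| No lab | 2003-2014, Diab. Case I | XGBoost | 0.837 | 0.75 | 0.75 | 0.75 |
| No lab | 2003-2014, Diab. Case I | Ensemble | 0.834 | 0.75 | 0.75 | 0.75 |
| No lab | 2003-2014, Diab. Case II | Logistic Reg. | 0.718 | 0.66 | 0.66 | 0.66 |
| No lab | 2003-2014, Diab. Case II | SVM | 0.716 | 0.66 | 0.66 | 0.66 |
| No lab | 2003-2014, Diab. Case II | Random Forest | 0.719 | 0.67 | 0.67 | 0.66 |
| No lab | 2003-2014, Diab. Case II | | | | | |
| No lab | 2003-2014, Diab. Case II | Ensemble | 0.725 | 0.66 | 0.66 | 0.66 |
| With lab | 1999-2014, Diab. Case I | Logistic Reg. | 0.866 | 0.79 | 0.79 | 0.79 |
| With lab | 1999-2014, Diab. Case I | SVM | 0.887 | 0.81 | 0.81 | 0.81 |
| With lab | 1999-2014, Diab. Case I | Random Forest | 0.937 | 0.86 | 0.86 | 0.86 |
| With lab | 1999-2014, Diab. Case I | | | | | |
| With lab | 1999-2014, Diab. Case I | Ensemble | 0.944 | 0.87 | 0.87 | 0.87 |
| With lab | 1999-2014, Diab. Case II | Logistic Reg. | 0.724 | 0.67 | 0.67 | 0.67 |
| With lab | 1999-2014, Diab. Case II | SVM | 0.737 | 0.68 | 0.68 | 0.68 |
| With lab | 1999-2014, Diab. Case II | Random Forest | 0.738 | 0.68 | 0.68 | 0.68 |
| With lab | 1999-2014, Diab. Case II | | | | | |
| With lab | 1999-2014, Diab. Case II | Ensemble | 0.783 | 0.71 | 0.71 | 0.71 |
| With lab | 2003-2014, Diab. Case I | Logistic Reg. | 0.877 | 0.80 | 0.80 | 0.80 |
| With lab | 2003-2014, Diab. Case I | SVM | 0.882 | 0.81 | 0.80 | 0.80 |
| With lab | 2003-2014, Diab. Case I | Random Forest | 0.939 | 0.86 | 0.86 | 0.86 |
| With lab | 2003-2014, Diab. Case I | | | | | |
| With lab | 2003-2014, Diab. Case I | Ensemble | 0.948 | 0.88 | 0.88 | 0.88 |
| With lab | 2003-2014, Diab. Case II | Logistic Reg. | 0.738 | 0.68 | 0.68 | 0.68 |
| With lab | 2003-2014, Diab. Case II | SVM | 0.737 | 0.68 | 0.68 | 0.68 |
| With lab | 2003-2014, Diab. Case II | Random Forest | 0.740 | 0.68 | 0.68 | 0.67 |
| With lab | 2003-2014, Diab. Case II | | | | | |
| With lab | 2003-2014, Diab. Case II | Ensemble | 0.798 | 0.72 | 0.72 | 0.72 |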
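AUC - Area Under the Curve; Precision = $\frac{TP}{TP + FP}$ and Recall = $\frac{TP}{TP + FN}$ (where TP - True Positive, FP - False Positive, FN - False Negative); F1 (score) = $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Bold face font signifies best performing model result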
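The footnote defines its metrics in terms of TP, FP, and FN. As a hedged illustration, the standard definitions of precision, recall, and F1 from those counts can be sketched as follows. The function names are hypothetical, and how the reported values were averaged across classes or folds is not stated in this section.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN); defined as 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only, not taken from the study.
p = precision(tp=6, fp=2)
r = recall(tp=6, fn=4)
print(p, r, f1_score(p, r))
```

Returning 0.0 for an empty denominator is a convention chosen for this sketch; some libraries warn or return NaN instead.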
Fig. 2 ROC curves from the 1999-2014 Diabetes Case I models. This graph shows the ROC curves generated from different models applied to the 1999-2014 Diabetes Case I datasets without lab results.
Fig. 3 ROC curves from the 1999-2014 Diabetes Case II models. This graph shows the ROC curves generated from different models applied to the 1999-2014 Diabetes Case II datasets without lab results.
Fig. 4 ROC curves from the cardiovascular models. This graph shows the ROC curves generated from different models applied to the 2007-2014 cardiovascular disease datasets without lab results.
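The area under a ROC curve equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes AUC directly from that pairwise definition. It is illustrative only: the function name is hypothetical, the example scores are made up, and the quadratic loop is meant for clarity rather than for datasets of the size used in the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(pos) * len(neg))


# Made-up scores for four records (two negatives, two positives).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```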
Fig. 5 Average feature importance for diabetes classifiers without lab results. This graph shows the most important features, not including lab results, for predicting diabetes.
Results using 10-fold cross-validation for cardiovascular disease classification
| Lab | Year | Model | AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| No lab | 2007-2014 | Logistic Reg. | 0.822 | 0.74 | 0.74 | 0.74 |
| No lab | 2007-2014 | SVM | 0.816 | 0.74 | 0.74 | 0.74 |
| No lab | 2007-2014 | Random Forest | 0.829 | 0.75 | 0.74 | 0.74 |
| No lab | 2007-2014 | XGBoost | 0.830 | 0.74 | 0.74 | 0.74 |
| No lab | 2007-2014 | | | | | |
| With lab | 2007-2014 | Logistic Reg. | 0.827 | 0.75 | 0.75 | 0.75 |
| With lab | 2007-2014 | SVM | 0.825 | 0.75 | 0.75 | 0.75 |
| With lab | 2007-2014 | Random Forest | 0.836 | 0.76 | 0.76 | 0.76 |
| With lab | 2007-2014 | XGBoost | 0.838 | 0.76 | 0.76 | 0.76 |
| With lab | 2007-2014 | | | | | |
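Lab - Laboratory results; AUC - Area Under the Curve; Precision = $\frac{TP}{TP + FP}$ and Recall = $\frac{TP}{TP + FN}$ (where TP - True Positive, FP - False Positive, FN - False Negative); F1 (score) = $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$. Bold face font signifies best performing model result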
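Both results tables report 10-fold cross-validation: the records are split into ten folds, and each fold serves once as the held-out test set while the other nine are used for training. A minimal sketch of generating such splits is shown below. It is not the authors' procedure: the function name is hypothetical, and the shuffling and fixed seed are assumptions, since this section does not say whether the folds were shuffled or stratified by class.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: shuffle before splitting
    # Striding gives k folds whose sizes differ by at most one record.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Example with the size of the Cardio dataset (8,459 observations).
for train, test in k_fold_indices(8459, k=10):
    print(len(train), len(test))
```

Per-fold metrics would then be averaged to obtain one figure per model, which is the usual way cross-validated scores are summarized.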
Fig. 6 Average feature importance for diabetes classifiers with lab results. This graph shows the most important features, including lab results, for predicting diabetes.
Fig. 7 Feature importance for cardiovascular disease classifier without lab results. This graph shows the most important features, not including lab results, for predicting cardiovascular disease.
Fig. 8 Feature importance for cardiovascular disease classifier with lab results. This graph shows the most important features, including lab results, for predicting cardiovascular disease.