Jorge A Morgan-Benita1, Carlos E Galván-Tejada1, Miguel Cruz2, Jorge I Galván-Tejada1, Hamurabi Gamboa-Rosales1, Jose G Arceo-Olague1, Huizilopoztli Luna-García1, José M Celaya-Padilla1.
Abstract
Type 2 diabetes mellitus (T2DM) is one of the biggest health problems in Mexico, and detecting the disease and its complications early is extremely important. For noninvasive detection of T2DM, a machine learning (ML) approach based on ensemble classification models with dichotomous output offers a fast and effective way to detect and predict the disease early. In this article, a hard-voting ensemble is designed and implemented using generalized linear regression (GLM), support vector machines (SVM) and artificial neural networks (ANN) to classify T2DM patients. As a first step, the data are balanced, standardized and imputed, and then fed to the three models, each of which classifies patients into one of two classes. Features are selected with LASSO using 10-fold cross-validation, and the Area Under the Curve (AUC) is used for the final validation. LASSO retained 12 features, which are used in all three models and in the resulting ensemble. Of the three individual models, SVM performed best, with an AUC of 92% ± 3%. The ensemble built from GLM, SVM and ANN obtained an AUC of 90% ± 3%.
Keywords: ensemble model; logistic regression; machine learning; neural networks; support vector machine; type 2 diabetes mellitus detection
Year: 2022 PMID: 35893185 PMCID: PMC9331873 DOI: 10.3390/healthcare10081362
Source DB: PubMed Journal: Healthcare (Basel) ISSN: 2227-9032
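The hard-voting step described in the abstract can be illustrated with a minimal sketch. It assumes each base classifier (GLM, SVM, ANN) has already produced a 0/1 label per patient. The function name `hard_vote` and the toy predictions are hypothetical and are not taken from the authors' implementation.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return the majority label for each sample.

    predictions_per_model: list of equal-length label lists, one per model.
    With three models and two classes a tie cannot occur.
    """
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same number of samples")
    ensemble = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        ensemble.append(label)
    return ensemble


# Toy labels (1 = positive for T2DM, 0 = negative); illustrative only.
glm_pred = [1, 0, 1, 0, 1]
svm_pred = [1, 1, 0, 0, 1]
ann_pred = [0, 1, 1, 0, 1]
ensemble_pred = hard_vote([glm_pred, svm_pred, ann_pred])
```

Using an odd number of base models for a two-class problem avoids tie-breaking rules, which is one practical reason to combine exactly three classifiers.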
Figure 1. Flowchart of the proposed methodology.
Figure 2. Feature Correlation Heat Map.
Features discarded.
| Feature | Description | Possible Values |
|---|---|---|
| Age DX | Diagnosis age of T2DM | Numeric Integer |
| Glucose | Blood glucose levels | Numeric |
| HbA1c | Glycated Hemoglobin | Numeric |
| GFR | Glomerular Filtration Rate (blood test that checks how well the kidneys are working) | Numeric Integer |
| Glibenclamide | Drug Treatment | 0—No |
| Metformin | Drug Treatment | 0—No |
| Pioglitazone | Drug Treatment | 0—No |
| Rosiglitazone | Drug Treatment | 0—No |
| Acarbose | Drug Treatment | 0—No |
| Insulin | Drug Treatment | 0—No |
| Complications T2DM | Complications associated with T2DM | NEUROPATHY—Have neuropathy |
All features in this table were excluded from the analysis at the data imputation step.
Features description and possible values.
| Feature | Description | Possible Values | |
|---|---|---|---|
| Education | Studies concluded by the patient | 1—Elementary School | 0.00118 |
| Salary | Monthly income | 1—Less than $2000.00 | |
| Sex | Patient's sex | 0—Male | |
| Age | Age of the patient in years | Numeric Integer | |
| WHR | Waist Hip Ratio | Numeric | |
| BMI | Body Mass Index | Numeric | |
| Urea | Waste product resulting from the breakdown of protein in the patient's body | Numeric Integer | |
| Creatinine | Waste product produced by muscles as part of regular daily activity | Numeric | 0.000456 |
| Lipids treatment | Lipid levels in treatment | 1—Lipid levels in treatment | 0.956 |
| Cholesterol | Fat-like substance that is found in all cells of the patient's body | Numeric | |
| HDL | High Density Lipoprotein (corrected for medication) | Numeric | |
| LDL | Low Density Lipoprotein (corrected for medication) | Numeric | |
| Triglycerides | Type of fat found in the patient's body | Numeric | |
| TCHOLU | Total Cholesterol (uncorrected) | Numeric Integer | 0.258 |
| HDLU | High Density Lipoprotein (uncorrected) | Numeric Integer | |
| LDLU | Low Density Lipoprotein (uncorrected) | Numeric Integer | 0.240 |
| TGU | Triglycerides (uncorrected) | Numeric Integer | |
| SBP | Systolic Blood Pressure (corrected by medication) | Numeric Integer | |
| DBP | Diastolic Blood Pressure (corrected by medication) | Numeric Integer | |
| SBPU | Systolic Blood Pressure (uncorrected) | Numeric Integer | |
| DBPU | Diastolic Blood Pressure (uncorrected) | Numeric Integer | |
| HA-TX | Hypertension Treatment | 0—Not in hypertension treatment | 0.959 |
| Output | Classifier of patients | 0—Patient negative for T2DM | - |
All features in this table were retained in the analysis after the data imputation step.
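The imputation and standardization steps named in the abstract can be sketched for a single numeric feature. The source does not say which imputation method was used, so mean imputation is an assumption here, as are the helper names `impute_mean` and `standardize` and the example values.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace None entries with the mean of the observed values (assumed method)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column has no spread to scale
    return [(v - mu) / sigma for v in column]


bmi = [22.0, None, 30.0, 26.0]  # hypothetical BMI values, one missing
bmi_imputed = impute_mean(bmi)
bmi_scaled = standardize(bmi_imputed)
```

Standardizing after imputation keeps the filled-in value at the center of the scaled distribution, which matters for scale-sensitive models such as SVM and ANN.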
LASSO result Features.
| Feature | Description | Possible Values | |
|---|---|---|---|
| Salary | Monthly income | 1—Less than $2000.00 | |
| Sex | Patient's sex | 0—Male | 0.00538 |
| Age | Age of the patient in years | Numeric Integer | |
| WHR | Waist Hip Ratio | Numeric | 0.07312 |
| BMI | Body Mass Index | Numeric | 0.00760 |
| Urea | Waste product resulting from the breakdown of protein in the patient's body | Numeric Integer | |
| Lipids treatment | Lipid levels in treatment | 1—Lipid levels in treatment | 0.97047 |
| HDL | High Density Lipoprotein (corrected by medication) | Numeric | |
| Triglycerides | Type of fat found in the patient's body | Numeric | |
| DBP | Diastolic Blood Pressure (corrected by medication) | Numeric Integer | |
| SBPU | Systolic Blood Pressure (uncorrected) | Numeric Integer | |
| HA-TX | Hypertension Treatment | 0—No | 0.96440 |
All features in this table were included in all models as part of the final ensemble.
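LASSO selects features by shrinking some coefficients exactly to zero, and only the features with nonzero coefficients are kept. The sketch below shows that mechanism with cyclic coordinate descent on toy data. It assumes a squared-error loss, centered inputs and no intercept. The loss, software and penalty value used by the authors are not given here, and the 10-fold cross-validation that would choose `penalty` is omitted. All names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + penalty*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    weights = [0.0] * p
    residual = list(y)  # equals y - Xw because w starts at zero
    mean_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if mean_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * weights[j]) for i in range(n)
            ) / n
            new_weight = soft_threshold(rho, penalty) / mean_sq[j]
            delta = new_weight - weights[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                weights[j] = new_weight
    return weights


# Toy data: the response depends only on the first column.
X = [[1.0, 1.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
y = [2.0, -2.0, 4.0, -4.0]
weights = lasso_coordinate_descent(X, y, penalty=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
```

In this toy case the second column receives a coefficient of exactly zero and is dropped, the same mechanism by which 12 of the 22 candidate features were retained.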
Metrics.
| Metric | Description |
|---|---|
| Sensitivity | Proportion of patients with T2DM who are correctly identified (true positive rate) |
| Specificity | Proportion of patients without T2DM who are correctly identified (true negative rate) |
| Precision | Proportion of patients predicted positive for T2DM who are actually positive |
| Negative Predictive Value | Proportion of patients predicted negative for T2DM who are actually negative |
| False Positive Rate | Proportion of patients without T2DM who are incorrectly predicted positive |
| False Negative Rate | Proportion of patients with T2DM who are incorrectly predicted negative |
| Accuracy | Proportion of all cases that the model classifies correctly |
| F1 Score | Harmonic mean of precision and sensitivity |
All metrics in this table were computed for each model and for the final ensemble.
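The eight metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. The function name is hypothetical. As a check, it uses the counts from the SVM confusion matrix reported later in this document. The reported SVM values are reproduced when the first row of that matrix is read as TP = 203 and FP = 16 and the second row as FN = 29 and TN = 194. That reading is inferred from the numbers and is not stated in the source.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the standard binary-classification metrics from raw counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "negative_predictive_value": tn / (tn + fn),
        "false_positive_rate": fp / (fp + tn),   # 1 - specificity
        "false_negative_rate": fn / (fn + tp),   # 1 - sensitivity
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "f1_score": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Counts read from the SVM confusion matrix (orientation inferred, see above).
svm = confusion_metrics(tp=203, fp=16, fn=29, tn=194)
```

Rounded to four decimals, `svm` matches the SVM measure table (sensitivity 0.8750, specificity 0.9238, precision 0.9269, accuracy 0.8982, F1 score 0.9002).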
Confusion Matrix structure.
| True Values | Predicted (True) | Predicted (False) |
|---|---|---|
| True | | |
| False | | |
SVM Confusion Matrix Measure Values.
| Measure | Value |
|---|---|
| Sensitivity | 0.8750 |
| Specificity | 0.9238 |
| Precision | 0.9269 |
| Negative Predictive Value | 0.8700 |
| False Positive Rate | 0.0762 |
| False Negative Rate | 0.1250 |
| Accuracy | 0.8982 |
| F1 Score | 0.9002 |
SVM Confusion Matrix.
| True Values | Predicted (True) | Predicted (False) |
|---|---|---|
| True | 203 | 16 |
| False | 29 | 194 |
ANN Confusion Matrix Measure Values.
| Measure | Value |
|---|---|
| Sensitivity | 0.8559 |
| Specificity | 0.9175 |
| Precision | 0.9224 |
| Negative Predictive Value | 0.8475 |
| False Positive Rate | 0.0825 |
| False Negative Rate | 0.1441 |
| Accuracy | 0.8846 |
| F1 Score | 0.8879 |
ANN Confusion Matrix.
| True Values | Predicted (True) | Predicted (False) |
|---|---|---|
| True | 202 | 17 |
| False | 34 | 189 |
GLM Confusion Matrix Measure Values.
| Measure | Value |
|---|---|
| Sensitivity | 0.8487 |
| Specificity | 0.9167 |
| Precision | 0.9224 |
| Negative Predictive Value | 0.8386 |
| False Positive Rate | 0.0833 |
| False Negative Rate | 0.1513 |
| Accuracy | 0.8801 |
| F1 Score | 0.8840 |
GLM Confusion Matrix.
| True Values | Predicted (True) | Predicted (False) |
|---|---|---|
| True | 202 | 17 |
| False | 36 | 187 |
Maxvoting Ensemble Confusion Matrix Measure Values.
| Measure | Value |
|---|---|
| Sensitivity | 0.8788 |
| Specificity | 0.9242 |
| Precision | 0.9269 |
| Negative Predictive Value | 0.8744 |
| False Positive Rate | 0.0758 |
| False Negative Rate | 0.1212 |
| Accuracy | 0.9005 |
| F1 Score | 0.9022 |
Ensemble Model Confusion Matrix.
| True Values | Predicted (True) | Predicted (False) |
|---|---|---|
| True | 203 | 16 |
| False | 28 | 195 |
Figure 3. AUC in SVM, ANN, GLM and Ensemble Model.
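The AUC used for validation can be read as the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy data. The function name, labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one positive (score 0.4) is ranked below one negative (score 0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine positive/negative pairs are ordered correctly, so `auc` is 8/9.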
Related work comparison.
| Author | ML Model | Dataset | Metrics |
|---|---|---|---|
| Shaker E. et al. (2019) | Ensemble of: k-nearest neighbors, naïve Bayes, decision tree, support vector machine, fuzzy decision tree, artificial neural network, and logistic regression | Electronic health records of Mansura University Hospitals (Mansura, Egypt) | 90% accuracy, 90.2% recall, and 94.9% precision |
| Kumari et al. (2021) | Ensemble of: random forest, logistic regression, and naïve Bayes | PIMA diabetes dataset | 79.04% accuracy, 73.48% precision, 71.45% recall, and 80.6% of |
| Singh N. et al. (2020) | Stacking-based evolutionary ensemble learning system “NSGA-II-Stacking” | PIMA diabetes dataset | Accuracy of 83.8%, sensitivity of 96.1%, specificity of 79.9%, F-measure of 88.5% and area under the ROC curve of 85.9% |
| Liu Y. et al. (2019) | Majority-voting ensemble of: support vector machine, tree-based methods and neural networks | REACTION study (Risk Evaluation of Cancers in Chinese Diabetic Individuals: A Longitudinal Study) | Majority voting with model selection: AUC of 0.802 (80.2%), sensitivity of 0.662 (66.2%), specificity of 0.702 (70.2%) |
| This work | Hard-voting ensemble of: generalized linear regression, support vector machines and artificial neural networks | Centro Médico Nacional Siglo XXI dataset | Sensitivity of 0.8788 (87.88%), specificity of 0.9242 (92.42%), precision of 0.9269 (92.69%), area under the ROC curve of 90.5% |