Hansel Hu, Tin Lai, Farnaz Farid.
Abstract
Prediabetes and diabetes have become alarmingly prevalent among adolescents over the past decade. However, effective screening tools that can readily assess diabetes risk are still in their infancy. To help close this gap, this research proposes a machine learning-based predictive model to detect adolescent diabetes. The model applies supervised machine learning and a novel feature selection method to the National Health and Nutrition Examination Survey (NHANES) datasets after an exhaustive search to select reliable and accurate data. The best model achieved an area under the curve (AUC) score of 71%. This research demonstrates that a screening tool based on supervised machine learning models can assist in the automated detection of youth diabetes. It also identifies some critical predictors for such detection using the Lasso Regression, Random Forest Importance and Gradient Boosted Tree Importance feature selection methods. The features contributing most to youth diabetes detection are physical characteristics (e.g., waist, leg length, gender), dietary information (e.g., water, protein, sodium) and demographics. These predictors can be further utilised in other areas of medical research, such as electronic medical history.
Keywords: adolescent diabetes prediction; diabetes detection; medical machine learning
Year: 2022 PMID: 36015915 PMCID: PMC9416136 DOI: 10.3390/s22166155
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.847
Figure 1. Workflow of the research.
The clinical guideline used to define Diabetes/Non-diabetes (preDM/DM) status.
| Criteria | Classification |
|---|---|
| plasma glucose level after an overnight fast (FPG) ≥ 100 mg/dL | Diabetes/1 |
| plasma glucose level two hours after an oral glucose load (2hrPG) ≥ 140 mg/dL | Diabetes/1 |
| hemoglobin A1c (HbA1C) ≥ 5.7% | Diabetes/1 |
| None above | Non-diabetes/0 |
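The labelling rule in the table above can be sketched as a small function. This is a minimal illustration, not the authors' code: the function name, the parameter names and the handling of missing measurements are assumptions made here, while the three thresholds come from the table.

```python
from typing import Optional

# Thresholds taken from the guideline table above.
FPG_THRESHOLD_MG_DL = 100.0        # fasting plasma glucose
TWO_HR_PG_THRESHOLD_MG_DL = 140.0  # plasma glucose two hours after an oral glucose load
HBA1C_THRESHOLD_PERCENT = 5.7      # hemoglobin A1c


def diabetes_label(fpg: Optional[float] = None,
                   two_hr_pg: Optional[float] = None,
                   hba1c: Optional[float] = None) -> int:
    """Return 1 (Diabetes) if any available measurement meets its threshold, else 0."""
    checks = [
        (fpg, FPG_THRESHOLD_MG_DL),
        (two_hr_pg, TWO_HR_PG_THRESHOLD_MG_DL),
        (hba1c, HBA1C_THRESHOLD_PERCENT),
    ]
    # Assumption (not stated in the source): a missing measurement (None)
    # is treated as not satisfying its criterion.
    return int(any(value is not None and value >= limit for value, limit in checks))
```

For example, `diabetes_label(fpg=92, hba1c=5.8)` returns 1 because the HbA1c criterion is met, while `diabetes_label(fpg=92, two_hr_pg=120, hba1c=5.4)` returns 0.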
Figure 2. Top 10 important features of the (left) Random Forest Feature Importance Method and the (right) Gradient Boosted Tree Feature Importance Method.
Figure 3. Support Vector Machine.
Figure 4. Decision Tree.
Figure 5. Random Forest.
Figure 6. Extreme Gradient Boosted Tree.
Confusion Matrix.
| True Label | Model Prediction: Non-Diabetic | Model Prediction: Diabetic |
|---|---|---|
| Non-Diabetic | True Negative | False Positive |
| Diabetic | False Negative | True Positive |
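The four cells of the confusion matrix above are what precision, recall, F1 score and accuracy are computed from. The sketch below uses the standard definitions with the Diabetic class as the positive class. The function name and the example counts are hypothetical and are not taken from the study.

```python
from typing import Dict


def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Compute standard metrics for the positive (Diabetic) class from confusion-matrix counts."""
    total = tn + fp + fn + tp
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = classification_metrics(tn=60, fp=20, fn=5, tp=15)
```

With these illustrative counts, recall is 15/20 = 0.75 and accuracy is 75/100 = 0.75, while precision is only 15/35, which shows how a screening model can catch most positive cases and still raise many false alarms.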
Figure 7. Performance of machine learning models in diabetes detection for the Non-diabetic and Diabetic class labels, in terms of precision, recall and F1 score.
Evaluation metrics of the models from previous research and this research.
| Model | AUC | Precision | Recall | F1 Score | Accuracy |
|---|---|---|---|---|---|
| Best model from previous research | N/A | 0.35 | 0.36 | 0.35 | N/A |
| Logistic Regression | 0.70 | 0.42 | 0.68 | 0.52 | 0.64 |
| Support Vector Machine | 0.66 | 0.37 | 0.69 | 0.48 | 0.57 |
| Random Forest | 0.69 | 0.40 | 0.71 | 0.51 | 0.61 |
| Extreme Gradient Boosted Tree | 0.70 | 0.41 | 0.76 | 0.53 | 0.61 |
| Weighted Voting Classifier | 0.71 | 0.43 | 0.70 | 0.53 | 0.64 |
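As a quick sanity check on the table above, the F1 score is the harmonic mean of precision and recall, so each reported F1 value can be recomputed from the other two columns. The sketch below does this with the rounded values from the table. The variable and function names are illustrative, and because the inputs are already rounded to two decimals, agreement in the last digit is not guaranteed in general.

```python
# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "Best model from previous research": (0.35, 0.36, 0.35),
    "Logistic Regression": (0.42, 0.68, 0.52),
    "Support Vector Machine": (0.37, 0.69, 0.48),
    "Random Forest": (0.40, 0.71, 0.51),
    "Extreme Gradient Boosted Tree": (0.41, 0.76, 0.53),
    "Weighted Voting Classifier": (0.43, 0.70, 0.53),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Collect any row whose recomputed F1 (rounded to two decimals) differs from the table.
mismatches = [
    name
    for name, (precision, recall, reported_f1) in REPORTED.items()
    if round(f1_from(precision, recall), 2) != reported_f1
]
```

Here `mismatches` comes out empty: every reported F1 score agrees with its precision and recall to two decimal places.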
5x2cv paired t-test.
| Classifier A | Classifier B | Result |
|---|---|---|
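| Weighted Voting Classifier | Logistic Regression | |
| Weighted Voting Classifier | SVM | 🗸 |
| Weighted Voting Classifier | Random Forest | 🗸 |
| Weighted Voting Classifier | Extreme Gradient Boosted Tree | |
| Extreme Gradient Boosted Tree | Logistic Regression | |
| Extreme Gradient Boosted Tree | SVM | 🗸 |
| Extreme Gradient Boosted Tree | Random Forest | |
| Random Forest | Logistic Regression | 🗸 |
| Random Forest | SVM | 🗸 |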
| SVM | Logistic Regress |
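For readers unfamiliar with the 5x2cv paired t-test named in the table above, the sketch below computes its test statistic in the standard formulation by Dietterich: five replications of 2-fold cross-validation, with the statistic compared against a t distribution with 5 degrees of freedom. Whether the study used exactly this formulation, which score was compared and which significance level was applied are assumptions here. The function name and the example differences are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Two-sided critical value of the t distribution with 5 degrees of freedom at alpha = 0.05.
T_CRITICAL_5_DF = 2.571


def five_by_two_cv_t(diffs: Sequence[Tuple[float, float]]) -> float:
    """Return the 5x2cv paired t statistic.

    diffs[i] holds (score_A - score_B) on fold 1 and fold 2 of replication i.
    """
    if len(diffs) != 5:
        raise ValueError("expected five replications of 2-fold cross-validation")
    variance_sum = 0.0
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        # Variance estimate for one replication.
        variance_sum += (d1 - mean) ** 2 + (d2 - mean) ** 2
    if variance_sum == 0.0:
        raise ValueError("all fold differences are identical; t is undefined")
    # Numerator: the difference from the first fold of the first replication.
    return diffs[0][0] / math.sqrt(variance_sum / 5.0)


# Hypothetical score differences between two classifiers (not from the study).
example_diffs = [(0.02, 0.01), (0.03, 0.01), (0.02, 0.02), (0.01, 0.03), (0.02, 0.02)]
t_stat = five_by_two_cv_t(example_diffs)
significant = abs(t_stat) > T_CRITICAL_5_DF
```

With these illustrative differences the statistic is about 2.11, which is below 2.571, so the two hypothetical classifiers would not be judged significantly different at the 5% level.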
Figure 8. ROC curves of different models.
Figure 9. Most Important Features for Weighted Voting Classifier.
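The record does not describe how the Weighted Voting Classifier combines its base models. One common scheme is weighted soft voting, in which each base model's predicted probability for the Diabetic class is averaged using per-model weights. The sketch below illustrates that general idea only. The function name, the weights, the probabilities and the 0.5 decision threshold are all hypothetical.

```python
from typing import Sequence, Tuple


def weighted_soft_vote(probabilities: Sequence[float],
                       weights: Sequence[float],
                       threshold: float = 0.5) -> Tuple[float, int]:
    """Combine per-model positive-class probabilities into one score and a 0/1 label."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average of the base models' probabilities.
    score = sum(p * w for p, w in zip(probabilities, weights)) / total_weight
    return score, int(score >= threshold)


# Hypothetical probabilities from three base models, with the first weighted double.
score, label = weighted_soft_vote([0.6, 0.4, 0.7], [2.0, 1.0, 1.0])
```

In this example the weighted average is (1.2 + 0.4 + 0.7) / 4 = 0.575, so the combined prediction is the Diabetic class.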