L J Muhammad, Ebrahem A Algehyne, Sani Sharif Usman.
Abstract
Diabetes mellitus (DM) is one of the deadliest diseases in the world, especially in developed nations. In recent years, it has become rampant in developing nations such as Nigeria, posing a greater threat to individuals there than to those in developed nations. More than 415 million people were reported to suffer from DM worldwide as of 2015, with type 2 accounting for approximately 90% of cases. The number of people with DM is expected to rise to 592 million by 2035. DM is therefore a growing public health concern in Nigeria. In this study, a diagnostic dataset of type 2 DM was collected from the Murtala Mohammed Specialist Hospital, Kano, and used to develop predictive supervised machine learning models based on logistic regression, support vector machine, K-nearest neighbor, random forest, naive Bayes and gradient boosting algorithms. The random forest model was the most accurate of the developed models, at 88.76% accuracy; in terms of the receiver operating characteristic (ROC) curve, the random forest and gradient boosting models were jointly the best, each with 86.28% predictive ability. © Springer Nature Singapore Pte Ltd 2020.
Keywords: Diabetes mellitus; Diabetes mellitus type 2; Machine learning; Predictive model; Random forest
Year: 2020 PMID: 33063051 PMCID: PMC7372976 DOI: 10.1007/s42979-020-00250-8
Source DB: PubMed Journal: SN Comput Sci ISSN: 2661-8907
Fig. 1 Flowchart for the training process of machine learning tasks [10]
Description of units and ranges of the dataset attributes
| SN | Attribute | Unit | Range |
|---|---|---|---|
| 1 | Age | Year | 1–150 |
| 2 | Family history | Yes (1), No (0) | 0, 1 |
| 3 | Glucose | mg/dL | 37–295 |
| 4 | Cholesterol (CHOL) | mg/dL | 128–575 |
| 5 | Blood pressure (BP) | mmHg | 90–190 |
| 6 | HDL | mg/dL | 10.6–73 |
| 7 | Triglyceride | mg/dL | 40–690 |
| 8 | BMI | kg/m² | 20.28–40.25 |
| 9 | Diagnosis result | Positive (1), Negative (0) | 0, 1 |
Fig. 2 Principle of random forest filling. Bootstrap resampling is first used to randomly draw multiple samples from the original training dataset x, each forming a new training dataset [32]. A decision tree is then built on each sample to form the random forest, which averages the outputs of the individual trees to determine the final filling result y [33].
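The procedure described in the Fig. 2 caption (bootstrap resampling, one tree per resample, averaging of the tree outputs) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the one-split decision stump used as the base learner, the function names, and the toy glucose values are all assumptions made for the example.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split decision stump on a single feature.

    Returns (threshold, left_mean, right_mean), chosen to minimise the
    squared error of predicting the mean label on each side of the split.
    """
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must leave samples on both sides
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # every x in this resample is identical
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=25, seed=0):
    """Build n_trees stumps, each on its own bootstrap resample."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement from the training set.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the outputs of all trees to obtain the final result."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical toy data: glucose (mg/dL) and a 0/1 label, not from the study.
glucose = [90, 100, 110, 120, 180, 200, 220, 250]
label = [0, 0, 0, 0, 1, 1, 1, 1]

forest = fit_forest(glucose, label)
low = predict_forest(forest, 90)    # averaged output for a low glucose value
high = predict_forest(forest, 250)  # averaged output for a high glucose value
print(low, high)
```

A full random forest would also grow deeper trees and sample a random subset of attributes at each split; both are omitted here to keep the averaging step easy to follow.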
Fig. 3 Typical ROC curve
Sample of the dataset
| Age (years) | Family history | Glucose (mg/dL) | CHOL (mg/dL) | BP (mmHg) | HDL (mg/dL) | Triglyceride (mg/dL) | BMI (kg/m²) | Diagnosis result |
|---|---|---|---|---|---|---|---|---|
| 62 | 1 | 281 | 135 | 312 | 56 | 234 | 56 | 1 |
| 42 | 1 | 201 | 171 | 391 | 71 | 98 | 43 | 1 |
| 39 | 0 | 281 | 140 | 309 | 45 | 62 | 45 | 0 |
| 62 | 1 | 136 | 140 | 129 | 32 | 201 | 32 | 0 |
| 60 | 1 | 149 | 120 | 134 | 60 | 119 | 44 | 1 |
| 57 | 1 | 120 | 135 | 178 | 11 | 300 | 14 | 0 |
| 59 | 0 | 130 | 180 | 341 | 67 | 65 | 15 | 1 |
| 63 | 1 | 199 | 130 | 198 | 15 | 123 | 15 | 1 |
| 74 | 0 | 178 | 118 | 169 | 32 | 56 | 32 | 1 |
| 61 | 1 | 201 | 176 | 190 | 21 | 319 | 32 | 1 |
| 34 | 0 | 123 | 130 | 231 | 17 | 21 | 17 | 0 |
Fig. 4 Workflow of the predictive models
Data type of the dataset attributes
| SN | Attribute | Data type |
|---|---|---|
| 1 | Age | int64 |
| 2 | Family history | int64 |
| 3 | Glucose | int64 |
| 4 | Cholesterol | int64 |
| 5 | Blood pressure | int64 |
| 6 | High density lipoprotein | int64 |
| 7 | Triglyceride | int64 |
| 8 | Body mass index | int64 |
| 9 | Diagnosis result | int64 |
Fig. 5 Description of the values of the dataset attributes
Fig. 6 Scatterplot correlation coefficient of the dataset attributes
Fig. 7 The correlation matrix of the dataset attributes
r value and correlation coefficient
| SN | Independent variable | Dependent variable | r value | Correlation coefficient relationship |
|---|---|---|---|---|
| 1 | Age | Diagnosis result | 0.041 | Weak positive correlation |
| 2 | Family history | Diagnosis result | 0.37 | Moderate positive correlation |
| 3 | Glucose | Diagnosis result | 0.72 | Strong positive correlation |
| 4 | Cholesterol | Diagnosis result | 0.43 | Moderate positive correlation |
| 5 | Blood pressure | Diagnosis result | −0.25 | Weak negative correlation |
| 6 | High density lipoprotein | Diagnosis result | −0.19 | Weak negative correlation |
| 7 | Triglyceride | Diagnosis result | 0.21 | Weak positive correlation |
| 8 | Body mass index | Diagnosis result | 0.079 | Weak positive correlation |
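An r value of this kind can be computed as follows, assuming the reported values are Pearson correlation coefficients (the source does not name the coefficient). The sketch uses the glucose and diagnosis columns from the sample of the dataset shown earlier, so its result will not match the 0.72 reported for the full dataset. The cut-offs of 0.3 and 0.7 used to label strength are an assumption that happens to agree with the labels in the table; the source does not state them.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def describe(r, weak=0.3, strong=0.7):
    """Label an r value; the weak/strong cut-offs are assumed, not from the source."""
    magnitude = abs(r)
    if magnitude < weak:
        strength = "weak"
    elif magnitude < strong:
        strength = "moderate"
    else:
        strength = "strong"
    sign = "positive" if r >= 0 else "negative"
    return f"{strength} {sign}"


# Glucose (mg/dL) and diagnosis result from the 11 sample rows of the dataset.
glucose = [281, 201, 281, 136, 149, 120, 130, 199, 178, 201, 123]
diagnosis = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

r = pearson_r(glucose, diagnosis)
print(round(r, 3), describe(r))
```

Applied to the r values in the table, `describe(0.72)` yields "strong positive" and `describe(-0.25)` yields "weak negative".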
Performance evaluation results of the models
| SN | Supervised machine learning model | Accuracy (%) | ROC (%) |
|---|---|---|---|
| 1 | Logistic regression | 80.88 | 80.73 |
| 2 | Support vector machine | 85.29 | 84.74 |
| 3 | K-nearest neighbor | 82.35 | 81.94 |
| 4 | Random forest | 88.76 | 86.28 |
| 5 | Naive Bayes | 77.94 | 77.43 |
| 6 | Gradient boosting | 86.76 | 86.28 |
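The two metrics in the table can be illustrated with a short plain-Python sketch. It assumes that the "ROC (%)" column is the area under the ROC curve, which the source does not state explicitly, and it uses made-up labels and scores rather than the study's predictions. The 0.5 decision threshold and the function names are likewise assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives
    a higher score than a randomly chosen negative case; ties count half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, not taken from the study.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.2, 0.7, 0.5]

# Turn scores into class predictions with an assumed 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)  # 0.625
auc = roc_auc(y_true, scores)   # 0.875
print(f"accuracy = {acc * 100:.2f}%, ROC AUC = {auc * 100:.2f}%")
```

Accuracy depends on the chosen threshold, whereas the area under the ROC curve summarises ranking quality across all thresholds, which is why two models can differ on one metric and tie on the other.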
Fig. 8 Performance evaluation result of the predictive models
Fig. 9 Visualization of the random tree predictive model