| Literature DB >> 35178244 |
Koushik Chandra Howlader, Md Shahriare Satu, Md Abdul Awal, Md Rabiul Islam, Sheikh Mohammed Shariful Islam, Julian M W Quinn, Mohammad Ali Moni.
Abstract
Type 2 Diabetes (T2D) is a chronic disease characterized by abnormally high blood glucose levels due to insulin resistance and reduced pancreatic insulin production. The aim of this work was to identify T2D-associated features that can distinguish T2D sub-types for prognosis and treatment purposes. We therefore employed machine learning (ML) techniques to categorize T2D patients, using data from the Pima Indian Diabetes Dataset from the Kaggle ML repository. After data preprocessing, several feature selection techniques were used to extract feature subsets, which were then analyzed with a range of classification techniques. We compared the resulting classifications to identify the best classifiers by considering accuracy, kappa statistic, area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and logarithmic loss (logloss), and evaluated classifier performance using summary statistics over a resampling distribution. Generalized Boosted Regression modeling achieved the highest accuracy (90.91%), together with the highest kappa statistic (78.77%) and specificity (85.19%). Sparse Distance Weighted Discrimination, the Generalized Additive Model using LOESS, and Boosted Generalized Additive Models gave the maximum sensitivity (100%), highest AUROC (95.26%), and lowest logarithmic loss (30.98%), respectively. Notably, the Generalized Additive Model using LOESS was the top-ranked algorithm according to non-parametric Friedman testing. Among the features identified by these models, glucose level, body mass index, diabetes pedigree function, and age were consistently the best and most frequent predictors of outcome. These results demonstrate the utility of ML methods in constructing improved prediction models for T2D and identify outcome predictors for this Pima Indian population.
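The evaluation metrics named in the abstract (accuracy, Cohen's kappa, sensitivity, specificity, and log loss) can all be computed directly from a binary classifier's predicted labels and class probabilities. A minimal stdlib-only sketch, using made-up predictions rather than the paper's data:

```python
import math

def confusion(y_true, y_pred):
    """Counts for the 2x2 confusion matrix (positive class = 1, i.e. diabetic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(y_true, y_pred, y_prob):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    n = tp + tn + fp + fn
    acc = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement
    pe = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (acc - pe) / (1 - pe)
    sens = tp / (tp + fn)   # recall on the diabetic class
    spec = tn / (tn + fp)
    eps = 1e-15             # clip probabilities so log() stays finite
    logloss = -sum(t * math.log(max(p, eps)) + (1 - t) * math.log(max(1 - p, eps))
                   for t, p in zip(y_true, y_prob)) / n
    return acc, kappa, sens, spec, logloss
```

The abstract reports log loss as a percentage (30.98%), i.e. the value above scaled by 100.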
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s13755-021-00168-2.
Keywords: Classifiers; Diabetes; Feature selection sets; Machine learning models; Prediction model
Year: 2022 PMID: 35178244 PMCID: PMC8828812 DOI: 10.1007/s13755-021-00168-2
Source DB: PubMed Journal: Health Inf Sci Syst ISSN: 2047-2501
Fig. 1Proposed methodology
Demographic details of the Pima Indian diabetes dataset
| Statistic | Pregnancies | Glucose | BloodPressure | Thickness | Insulin | BMI | DPF | Age |
|---|---|---|---|---|---|---|---|---|
| Feature type | Integer | Real | Real | Real | Real | Real | Real | Integer |
| Unit | Number of times | mg/dL | mm Hg | mm | μU/mL | kg/m² | — | years |
| Distinct count | 17 | 136 | 47 | 51 | 186 | 248 | 517 | 52 |
| Unique (%) | 2.20% | 17.70% | 6.10% | 6.60% | 24.20% | 32.30% | 67.30% | 6.80% |
| Mean | 3.8451 | 120.89 | 69.105 | 20.536 | 79.799 | 31.993 | 0.47188 | 33.241 |
| Range | 0–17 | 0–199 | 0–122 | 0–99 | 0–846 | 0–67.1 | 0.078–2.42 | 21–81 |
| Zeros (%) | 14.50% | 0.70% | 4.60% | 29.60% | 48.70% | 1.40% | 0.00% | 0.00% |
| 5-th percentile | 0 | 79 | 38.7 | 0 | 0 | 21.8 | 0.14035 | 21 |
| Q1 | 1 | 99 | 62 | 0 | 0 | 27.3 | 0.24375 | 24 |
| Median | 3 | 117 | 72 | 23 | 30.5 | 32 | 0.3725 | 29 |
| Q3 | 6 | 140.25 | 80 | 32 | 127.25 | 36.6 | 0.62625 | 41 |
| 95-th percentile | 10 | 181 | 90 | 44 | 293 | 44.395 | 1.1328 | 58 |
| Range (max − min) | 17 | 199 | 122 | 99 | 846 | 67.1 | 2.342 | 60 |
| IQR | 5 | 41.25 | 18 | 32 | 127.25 | 9.3 | 0.3825 | 17 |
| Standard deviation | 3.370 | 31.973 | 19.356 | 15.952 | 115.240 | 7.884 | 0.331 | 11.760 |
| Coef of variation | 0.876 | 0.264 | 0.280 | 0.777 | 1.444 | 0.246 | 0.702 | 0.354 |
| Kurtosis | 0.159 | 0.641 | 5.180 | -0.520 | 7.214 | 3.290 | 5.595 | 0.643 |
| MAD | 2.772 | 25.182 | 12.639 | 13.660 | 84.505 | 5.842 | 0.247 | 9.586 |
| Skewness | 0.902 | 0.174 | -1.844 | 0.109 | 2.272 | -0.429 | 1.920 | 1.130 |
| Sum | 2953 | 92847 | 53073 | 15772 | 61286 | 24570 | 362.4 | 25529 |
| Variance | 11.354 | 1022.2 | 374.65 | 254.47 | 13281 | 62.16 | 0.10978 | 138.3 |
| Memory size | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB | 6.1 KB |
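The descriptive statistics in this table can be reproduced from the raw columns. A stdlib-only sketch with an illustrative sample (not the real Pima data); quartiles use linear interpolation, and MAD is assumed to mean the mean absolute deviation about the mean, which is consistent with the reported values (e.g. MAD 2.772 vs. standard deviation 3.370 for Pregnancies):

```python
import statistics

def quantile(sorted_x, q):
    """Linear-interpolation quantile over an already-sorted list."""
    pos = q * (len(sorted_x) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_x) - 1)
    return sorted_x[lo] + (pos - lo) * (sorted_x[hi] - sorted_x[lo])

def summarize(x):
    xs = sorted(x)
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)  # sample standard deviation, as in the table
    q1, med, q3 = (quantile(xs, q) for q in (0.25, 0.5, 0.75))
    return {
        "mean": mean,
        "median": med,
        "IQR": q3 - q1,
        "coef_of_variation": sd / mean,
        "MAD": statistics.fmean(abs(v - mean) for v in x),
        "zeros_%": 100 * sum(v == 0 for v in x) / len(x),
    }
```

Note that the high zero percentages (e.g. 48.70% for Insulin) flag physiologically impossible values that the preprocessing step must address.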
Formulation of Various Feature Subsets
| FS | FST | Tool | SM/TS | Features |
|---|---|---|---|---|
| FS1 | IGAE, GRAE | Orange | Top 5 | Glucose, Age, BMI, Insulin, and Pregnancies |
| FS2 | GIAE, ANOVA, χ² test | Orange | Top 5 | Glucose, BMI, Age, DPF, and Pregnancies |
| FS3 | RFAE | Weka | Ranker, Top 5 | Glucose, Age, Pregnancies, Thickness, and BMI |
| FS4 | FCFS | Orange | Top 5 | Glucose, Age, BMI, DPF, and Insulin |
| FS5 | CFS | Weka | BFS, ES, RS, SS | Glucose, BMI, DPF, and Age |
| FS6 | FSE | Weka | BFS | Glucose, BMI, and Age |
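Several of the ranking techniques in this table are entropy-based; information gain attribute evaluation (IGAE), for instance, scores each feature by how much splitting on it reduces class entropy. An illustrative stdlib-only sketch, assuming binary labels and already-discretized feature values (Orange and Weka handle discretization internally):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """H(labels) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        cond += (len(subset) / n) * entropy(subset)
    return entropy(labels) - cond

# Features would then be ranked by gain (higher = more informative), e.g.:
# ranked = sorted(features, key=lambda f: info_gain(columns[f], labels), reverse=True)
```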
Fig. 2Average (Minimum, Median, Mean, and Maximum) Results of Different Classifiers
Fig. 3Wireframe Contour of Average Best Classification Results for Individual Datasets
Classifier rankings and adjusted p-values from post hoc methods (Friedman test), based on average findings
| i | Classifier | Ranking | z | Unadjusted p | p (Bonferroni) | p (Holm) | p (Hochberg) |
|---|---|---|---|---|---|---|---|
| 1 | GAMLOESS | 3.00 | — | — | — | — | — |
| 2 | GAMBoost | 3.17 | 0.10 | 0.9240 | 8.3164 | 0.9240 | 0.9240 |
| 3 | GBM | 5.00 | 1.14 | 0.2526 | 2.2730 | 0.5458 | 0.5051 |
| 4 | SDWD | 5.33 | 1.33 | 0.1819 | 1.6373 | 0.5458 | 0.5051 |
| 5 | BGLM | 5.67 | 1.53 | 0.1271 | 1.1441 | 0.5167 | 0.5051 |
| 6 | GLM | 5.92 | 1.67 | 0.0952 | 0.8568 | 0.5167 | 0.4760 |
| 7 | NB | 6.00 | 1.72 | 0.0861 | 0.7751 | 0.5167 | 0.4760 |
| 8 | PLR | 6.67 | 2.10 | 0.0359 | 0.3235 | 0.2516 | 0.2516 |
| 9 | PMR | 6.92 | 2.24 | 0.0251 | 0.2254 | 0.2004 | 0.2004 |
| 10 | RLR | 7.33 | 2.48 | 0.0132 | 0.1186 | 0.1186 | 0.1186 |
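The z statistics and unadjusted p-values in this table follow the standard Friedman post-hoc comparison against the best-ranked classifier. A stdlib-only sketch, assuming the ranks are averaged over the six feature subsets FS1–FS6 with k = 10 classifiers (consistent with the table, e.g. a rank difference of 2.00 giving z = 1.14); the Bonferroni column reports the uncapped m·p, as the table does (9 × 0.9240 = 8.3164):

```python
import math

def p_from_z(z):
    """Two-sided p-value for a standard normal z score."""
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def posthoc_vs_best(avg_ranks, k, n_datasets):
    """avg_ranks: {classifier: average Friedman rank} over n_datasets datasets;
    k is the total number of classifiers compared."""
    se = math.sqrt(k * (k + 1) / (6 * n_datasets))
    best = min(avg_ranks, key=avg_ranks.get)
    rows = [(name, (r - avg_ranks[best]) / se)
            for name, r in avg_ranks.items() if name != best]
    m = len(rows)  # number of pairwise comparisons against the best
    return best, [(name, round(z, 2), round(p_from_z(z), 4),
                   round(m * p_from_z(z), 4)) for name, z in rows]
```

Holm and Hochberg corrections additionally order the unadjusted p-values and apply step-down/step-up multipliers, which is why they are uniformly less conservative than Bonferroni in the table.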