Xing-Wei Wu1,2, Heng-Bo Yang3, Rong Yuan4, En-Wu Long5,2, Rong-Sheng Tong5,2.
Abstract
OBJECTIVE: Medication adherence plays a key role in type 2 diabetes (T2D) care. Identifying patients at high risk of non-compliance supports individualized management, especially in China, where medical resources are relatively insufficient. However, models with good predictive capabilities have not been studied. This study aims to assess multiple machine learning algorithms and screen out a model that can be used to predict patients' non-adherence risks.
Keywords: adherence; personality; prediction and prevention; type 2 diabetes
Year: 2020 PMID: 32156739 PMCID: PMC7064141 DOI: 10.1136/bmjdrc-2019-001055
Source DB: PubMed Journal: BMJ Open Diabetes Res Care ISSN: 2052-4897
Demographic and clinical data of participants
| Parameter | Value (n=401) |
| Age (years), n | 401 |
| Mean±SD | 58.9±11.89 |
| Median | 58 |
| Minimum, maximum | 27, 85 |
| Gender, n | 401 |
| Male | 244 (60.8%) |
| Female | 157 (39.1%) |
| Weight (kg), n | 397 |
| Mean±SD | 65.0±10.28 |
| Median | 65 |
| Minimum, maximum | 42, 110 |
| Marital status, n | 396 |
| Married/living as married/civil partnership | 393 (99.2%) |
| Single/never married | 2 (0.4%) |
| Divorced or separated | 2 (0.4%) |
| Employment status, n | 399 |
| Unemployed | 58 (14.5%) |
| Employed | 149 (37.3%) |
| Retired | 191 (47.9%) |
| Other | 1 (0.2%) |
| Highest level of education, n | 399 |
| Illiterate | 41 (10.3%) |
| Junior middle school | 128 (32.1%) |
| High school or special secondary school | 130 (32.6%) |
| College and above | 100 (25.1%) |
| Family history of diabetes, n | 391 |
| Yes | 122 (31.2%) |
| No | 269 (68.8%) |
| Previous HbA1c value, n | 264 |
| <7% | 97 (36.7%) |
| 7%–9% | 126 (47.7%) |
| >9% | 41 (15.5%) |
| Interval between the last HbA1c measurement and the present (days), n | 267 |
| Mean±SD | 227.9±271.52 |
| Median | 150 |
| Minimum, maximum | 2, 2920 |
| Course of diabetes (months), n | 401 |
| Mean±SD | 89.7±76.44 |
| Median | 72 |
| Minimum, maximum | 1, 480 |
| Regular monitoring of fasting blood glucose frequency, n | 401 |
| Irregular monitoring | 71 (17.7%) |
| Two or three times a week | 156 (38.9%) |
| Three or four times a month | 129 (32.1%) |
| Two or three times in 3 months | 45 (11.2%) |
| Fasting blood glucose value (mmol/L), n | 325 |
| 3.8–6.1 | 23 (7.1%) |
| 6.1–7 | 96 (29.5%) |
| ≥7 | 206 (63.3%) |
| Complications, n | 401 |
| Yes | 42 (10.5%) |
| No | 359 (89.5%) |
| Exercise intensity, n | 401 |
| No exercise | 44 (10.9%) |
| Low-intensity exercise (eg, walking) | 269 (67.0%) |
| Medium-intensity exercise (eg, fast walking, jogging) | 57 (14.2%) |
| High-intensity exercise (eg, fitness, cycling, dancing) | 31 (7.7%) |
| Exercise time (min), n | 401 |
| Mean±SD | 63.3±71.66 |
| Median | 60 |
| Minimum, maximum | 0, 600 |
| Eat reasonably, n | 401 |
| Yes | 294 (73.3%) |
| No | 107 (26.7%) |
| Sleep status, n | 401 |
| Good | 214 (53.3%) |
| Ordinary | 120 (29.9%) |
| Lose sleep | 67 (16.7%) |
| Psychological status, n | 401 |
| Optimistic | 247 (61.6%) |
| Ordinary | 144 (35.9%) |
| Depressed | 10 (2.5%) |
| Compliance, n | 401 |
| Good | 316 (78.8%) |
| Poor | 85 (21.2%) |
n, number of respondents.
Figure 1. After feature selection, the data flowed into the ‘Partition’ node. The ‘Auto Data Prep’ node filled missing data, the ‘Balance’ node performed balanced sampling, and the ‘Binning’ node binned the data. The ‘Partition’ node divided set 1 into a training set and a testing set; the ‘Auto Classifier’ node built the various classification models; and the ‘Analysis’ and ‘Evaluation’ nodes output the AUC value and curve of each model. The ‘Select’ node was used to select the set 2 data set, which was then passed through the models established above and their ensemble, and the AUC values and graphs of all models were output using the ‘Analysis’ and ‘Evaluation’ nodes. AUC, area under the receiver operating characteristic curve.
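The node pipeline in Figure 1 was built in IBM SPSS Modeler; as a rough illustration only, the same data-governance steps (imputation, balanced sampling, binning, partition) can be sketched in plain NumPy. All data and parameters below (mean imputation, 5 equal-width bins, a 70/30 split) are hypothetical stand-ins, not the study's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data standing in for the study's predictors: rows are patients,
# columns are features; NaNs mark missing entries.
X = rng.normal(size=(200, 4))
X[rng.random(X.shape) < 0.1] = np.nan          # ~10% missing at random
y = (rng.random(200) < 0.2).astype(int)        # ~20% "poor adherence"

# 1) Imputation ('Auto Data Prep'): fill each column's NaNs with its mean.
col_means = np.nanmean(X, axis=0)
X_imp = np.where(np.isnan(X), col_means, X)

# 2) Balancing ('Balance'): random oversampling of the minority class.
minority = np.flatnonzero(y == 1)
majority = np.flatnonzero(y == 0)
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
idx = np.concatenate([majority, minority, extra])
X_bal, y_bal = X_imp[idx], y[idx]

# 3) Binning ('Binning'): 5 equal-width bins per feature.
edges = [np.linspace(c.min(), c.max(), 6) for c in X_bal.T]
X_bin = np.column_stack([np.clip(np.digitize(c, e[1:-1]), 0, 4)
                         for c, e in zip(X_bal.T, edges)])

# 4) Partition ('Partition'): 70% training / 30% testing split.
perm = rng.permutation(len(y_bal))
cut = int(0.7 * len(y_bal))
train, test = perm[:cut], perm[cut:]
print(len(train), len(test), np.bincount(y_bal))
```

After balancing, the two classes are equal in size, which is why the sample counts in the table below change with the sampling method.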
AUC and overfitting values for 30 machine learning models
| Imputing or not | Sampling | Binning or not | Screening methods | Model methods | Num of Variables | Num of Samples | AUC_TR | AUC_TE | AUC_Set 2 | OF1 | OF2 |
| Not | Not | Not | Not | $D | 16 | 167 | 0.693±0.025 | 0.621±0.054 | 0.737±0.062 | 1.125±0.122 | 0.849±0.111 |
| Not | Not | Not | Forward and stepwise | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.557±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Not | Not | Backward | $L | 5 | 167 | 0.659±0.019 | 0.658±0.039 | 0.687±0.052 | 1.005±0.072 | 0.964±0.102 |
| Not | Not | Yes | Forward and stepwise | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.577±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Not | Yes | Backward | $S | 3 | 167 | 0.551±0.033 | 0.577±0.078 | 0.577±0.051 | 0.974±0.162 | 1.039±0.131 |
| Not | Undersampling | Not | Not | $XF | 16 | 98 | 0.778±0.025 | 0.827±0.080 | 0.744±0.085 | 0.949±0.102 | 1.130±0.198 |
| Not | Undersampling | Not | Forward and stepwise | $L | 3 | 98 | 0.679±0.032 | 0.660±0.075 | 0.664±0.067 | 1.044±0.142 | 1.008±0.192 |
| Not | Undersampling | Not | Backward | $L | 3 | 98 | 0.679±0.032 | 0.660±0.075 | 0.664±0.067 | 1.044±0.142 | 1.008±0.192 |
| Not | Undersampling | Yes | Forward and stepwise | $KNN | 4 | 98 | 0.725±0.028 | 0.674±0.074 | 0.715±0.070 | 1.086±0.122 | 0.955±0.169 |
| Not | Undersampling | Yes | Backward | $XF | 5 | 98 | 0.755±0.047 | 0.753±0.074 | 0.725±0.058 | 1.013±0.136 | 1.044±0.125 |
| Not | Oversampling | Not | Not | $R | 16 | 263 | 0.781±0.021 | 0.770±0.040 | 0.758±0.070 | 1.017±0.060 | 1.028±0.150 |
| Not | Oversampling | Not | Backward | $XF | 5 | 263 | 0.814±0.031 | 0.799±0.037 | 0.761±0.094 | 1.021±0.073 | 1.066±0.155 |
| Not | Oversampling | Not | Forward and stepwise | $XF | 4 | 263 | 0.716±0.020 | 0.726±0.026 | 0.690±0.068 | 0.987±0.048 | 1.064±0.133 |
| Not | Oversampling | Yes | Forward and stepwise | $XF | 4 | 263 | 0.834±0.030 | 0.821±0.031 | 0.782±0.080 | 1.018±0.068 | 1.060±0.112 |
| Not | Oversampling | Yes | Backward | $XF | 7 | 263 | 0.864±0.028 | 0.856±0.022 | 0.813±0.127 | 1.010±0.040 | 1.088±0.261 |
| Yes | Not | Not | Not | $D | 16 | 315 | 0.725±0.012 | 0.678±0.047 | 0.703±0.051 | 1.074±0.087 | 0.973±0.131 |
| Yes | Not | Yes | Forward and stepwise | $XF | 5 | 315 | 0.812±0.024 | 0.760±0.048 | 0.757±0.056 | 1.073±0.089 | 1.008±0.091 |
| Yes | Not | Not | Forward and stepwise | $XF | 4 | 315 | 0.752±0.020 | 0.701±0.070 | 0.711±0.055 | 1.084±0.131 | 0.994±0.144 |
| Yes | Not | Not | Backward | $XF | 6 | 315 | 0.742±0.019 | 0.734±0.063 | 0.734±0.066 | 1.017±0.094 | 1.012±0.160 |
| Yes | Not | Yes | Backward | $B | 6 | 315 | 0.729±0.019 | 0.718±0.099 | 0.714±0.100 | 1.034±0.151 | 1.034±0.266 |
| Yes | Undersampling | Not | Not | $B | 16 | 199 | 0.785±0.032 | 0.811±0.087 | 0.778±0.063 | 0.980±0.126 | 1.052±0.170 |
| Yes | Undersampling | Yes | Forward and stepwise | $XF | 4 | 199 | 0.701±0.027 | 0.665±0.074 | 0.722±0.050 | 1.067±0.135 | 0.927±0.130 |
| Yes | Undersampling | Not | Forward and stepwise | $S | 4 | 199 | 0.685±0.022 | 0.658±0.069 | 0.702±0.053 | 1.053±0.137 | 0.946±0.146 |
| Yes | Undersampling | Yes | Backward | $S | 5 | 199 | 0.699±0.015 | 0.754±0.083 | 0.733±0.052 | 0.938±0.113 | 1.034±0.143 |
| Yes | Undersampling | Not | Backward | $KNN | 5 | 199 | 0.740±0.029 | 0.738±0.082 | 0.736±0.065 | 1.017±0.143 | 1.013±0.165 |
| Yes | Oversampling | Not | Not | $XF | 16 | 513 | 0.916±0.030 | 0.869±0.041 | 0.862±0.123 | 1.056±0.052 | 1.039±0.243 |
| Yes | Oversampling | Yes | Forward and stepwise | $B | 7 | 513 | 0.857±0.023 | 0.824±0.039 | 0.849±0.072 | 1.042±0.052 | 0.978±0.118 |
| Yes | Oversampling | Not | Forward and stepwise | $XF | 8 | 513 | 0.907±0.031 | 0.861±0.039 | 0.843±0.115 | 1.054±0.049 | 1.049±0.230 |
| Yes | Oversampling | Not | Backward | $XF | 9 | 513 | 0.907±0.024 | 0.871±0.030 | | 1.041±0.036 | 1.017±0.134 |
| Yes | Oversampling | Yes | Backward | $B | 9 | 513 | 0.865±0.032 | 0.823±0.050 | 0.839±0.107 | 1.054±0.070 | 1.003±0.191 |
OF1 was calculated as AUC_TR/AUC_TE, and OF2 as AUC_TE/AUC_Set 2.
The bold value is the maximum AUC_Set 2 of the 30 algorithms.
AUC, area under the receiver operating characteristic curve; AUC_Set 2, AUC of set 2; AUC_TE, AUC of set 1 testing set; AUC_TR, AUC of set 1 training set; $B, Bayesian network; $D, discriminant model; $KNN, KNN algorithm; $L, logistic regression model; $R, CHAID; $S, SVM; $XF, the ensemble model.
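As a quick check of the footnote's formulas, the two overfitting indexes can be computed directly from the three AUC columns. Note the table reports means ± SD over repetitions, so a ratio of the printed means only approximates the printed OF1/OF2:

```python
def overfit_indexes(auc_tr: float, auc_te: float, auc_set2: float) -> tuple[float, float]:
    """OF1 = training/testing AUC ratio; OF2 = testing/set-2 AUC ratio.
    Values near 1 suggest little overfitting; values above 1 mean the model
    scored higher on the earlier split than on the later one."""
    return auc_tr / auc_te, auc_te / auc_set2

# First row of the table ($D, no imputing/sampling/binning):
of1, of2 = overfit_indexes(0.693, 0.621, 0.737)
print(round(of1, 3), round(of2, 3))  # close to the reported 1.125 and 0.849
```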
Assessment of models by different algorithms of the selected data governance
| Models | AUC | Precision | Recall | F1 score |
| Bayesian network | 0.764±0.029 | 0.729±0.055 | 0.717±0.042 | 0.721±0.025 |
| KNN | 0.838±0.018 | 0.813±0.046 | 0.636±0.062 | 0.712±0.039 |
| SVM | 0.765±0.044 | 0.728±0.059 | 0.589±0.078 | 0.647±0.046 |
| C&R Tree | 0.755±0.030 | 0.739±0.026 | 0.669±0.066 | 0.700±0.033 |
| CHAID | 0.770±0.041 | 0.792±0.035 | 0.655±0.049 | 0.716±0.031 |
| Ensemble | | | | |
| P value | p<0.0001 | p<0.0001 | p<0.0001 | p<0.0001 |
The bold values indicate the performance parameters of the best algorithm.
AUC, area under the receiver operating characteristic curve.
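The F1 scores above follow the usual harmonic-mean definition. Since each cell is a mean ± SD over repetitions, F1 computed from the printed mean precision and recall only approximates the printed mean F1:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# KNN row: precision 0.813, recall 0.636 -> printed F1 0.712±0.039
print(round(f1_score(0.813, 0.636), 3))  # close to the reported 0.712
```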
The impact of modeling approaches on predictive indicators
| Approaches | AUC_TR | | | | AUC_TE | | | | AUC_Set 2 | | | | OF1 | | | | OF2 | | | |
| | Univariate analysis | | Multivariate analysis* | | Univariate analysis | | Multivariate analysis | | Univariate analysis | | Multivariate analysis | | Univariate analysis | | Multivariate analysis | | Univariate analysis | | Multivariate analysis | |
| | P value | MMD/R | P value | SE | P value | MMD/R | P value | SE | P value | MMD/R | P value | SE | P value | MMD/R | P value | SE | P value | MMD/R | P value | SE |
| Imputing or not | − | 0.0813 | −0.2082 | 0.1719 | −0.1687 | 0.6812 | −0.0635 | 0.1084† | 0.0203 | 0.6829 | −0.0634 | |||||||||
| Sampling methods | − | 0.2980 | −0.1436 | 0.0840 | −0.2471 | 0.7884† | 0.0145 | 0.3667 | −0.1615 | 0.5764† | 0.0440 | 0.6617 | 0.0786 | |||||||
| Binning or not | 0.6441† | 0.0053 | 0.7135 | −0.0149 | 0.7837† | 0.0009 | 0.8834 | −0.0073 | 0.8258† | 0.0028 | 0.7578 | 0.0160 | 0.9188† | 0.0066 | 0.9212 | −0.0064 | 0.6672† | 0.0036 | 0.7668 | −0.0193 |
| Screening methods | 0.2277‡ | 0.0242 | 0.1343 | −0.1242 | 0.6512† | 0.0213 | 0.4484 | 0.0631 | ||||||||||||
| Model methods | 0.5617 | −0.0386 | 0.1616 | 0.0936 | ||||||||||||||||
| Num of Variables | 0.2593§ | 0.0653 | 0.4035 | −0.0739 | 0.2219§ | −0.0707 | 0.7425 | 0.0292 | ||||||||||||
| Num of Samples¶ | 0.0722§ | 0.1039 | 0.1972 | 0.3024 | 0.5946§ | −0.0308 | 0.8326 | −0.0497 | ||||||||||||
Oversampling led to an abnormal distribution in all five indexes.
The bold values indicate the parameters of approaches that significantly affected the predictive indicators.
*Multiple linear regression was used for multivariate analysis.
†Kruskal-Wallis test.
‡General linear models for analysis of variance (ANOVA).
§Spearman correlation analysis.
¶The variance inflation factor (VIF) of the variable ‘Num of Samples’ in the multiple regression model is 16.4146 (greater than 10), indicating possible multicollinearity that may make the model unstable; this variable may be strongly collinear with imputing, sampling, and binning, so the multiple linear regression (MLR) model was re-established after these three variables were eliminated.
AUC, area under the receiver operating characteristic curve; AUC_Set 2, AUC of set 2; AUC_TE, AUC of set 1 testing set; AUC_TR, AUC of set 1 training set; MMD, maximum mean difference among levels; R, correlation coefficient; SE, standardized estimate.
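Footnote ¶ flags multicollinearity via the variance inflation factor. A minimal NumPy sketch of how a VIF is obtained (regress one predictor on the rest and take 1/(1−R²)), using synthetic data rather than the study's variables:

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress X[:, j] on the other columns (plus an
    intercept) by least squares and return 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b + 0.1 * rng.normal(size=100)   # nearly a linear combination of a and b
X = np.column_stack([a, b, c])
print(round(vif(X, 2), 1))   # large (> 10), the situation the footnote describes
```

A VIF near 1 means a predictor is nearly independent of the others; values above 10 (as with ‘Num of Samples’ here) are a common rule-of-thumb threshold for dropping or merging variables.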