Kuang-Ming Kuo, Paul Talley, YuHsi Kao, Chi Hsien Huang.
Abstract
BACKGROUND: Numerous studies have used machine-learning techniques to predict the early onset of type 2 diabetes mellitus, but fewer have tried to predict an appropriate diagnosis code for the condition, and ensemble techniques such as bagging and boosting have been applied even less often. The present study aims to identify appropriate diagnosis codes for type 2 diabetes mellitus patients by building a multi-class prediction model that is parsimonious and uses a minimal set of features. In addition, the importance of each feature for predicting the diagnosis code is reported.
Keywords: Diagnosis; Machine-learning techniques; Predictive models; Type 2 diabetes mellitus
Year: 2020 PMID: 32974105 PMCID: PMC7487151 DOI: 10.7717/peerj.9920
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Artificial neural network architecture.
Figure 2. Decision trees.
Figure 3. Margin hyperplane, support vector, and convex hull.
Type 2 diabetes mellitus diagnosis-related studies: adopted machine learning algorithms.
| Study | Instance-based | Decision trees | Neural network | Ensemble | Bayesian | Statistical model | Others |
|---|---|---|---|---|---|---|---|
| Support vector machine | J48 | Adaboostm1 | Naïve Bayes, Bayes net | ||||
| Logistic regression | K-means | ||||||
| Support vector machine | Decision | Neural network | Logistic regression | Clustering | |||
| First-order logic rules | |||||||
| Self-organizing map, support vector machine | Neural network | Principal component analysis | |||||
| Naïve Bayes | Linear discriminant analysis, Quadratic discriminant analysis | Gaussian process classification | |||||
| Support vector machine | Rule-based | ||||||
| J48, Decision tree, Logistic model tree | Random forest | Naïve Bayes | Logistic regression | ||||
| Support vector machine, KNN | C5.0 | Neural networks | Random forest, Gradient boosting machine, Extreme gradient boosting | Bayesian model | Linear model, Discriminant analysis, Partial least squares, Multinomial logistic regression | Rule-based, Elastic net, Nearest shrunken centroid | |
| Bayesian inference |
Note:
Denotes the best-performing algorithm.
Type 2 diabetes mellitus diagnosis-related studies: Included features for machine learning.
| Study | Demographic data | Laboratory test results | Vital signs | Life style | History | Others |
|---|---|---|---|---|---|---|
| Age, gender, BMI | Physical activity, work stress, salty food preference | History of cardiovascular disease or stroke, family history of diabetes, hypertension | ||||
| Age, BMI | 2-h plasma glucose, 2-h | Diastolic | Number of times pregnant | Triceps skin fold thickness | ||
| Age, sex, BMI | High density lipoprotein cholesterol, triglycerides, | Systolic | Shortness of breath, frequent urination at night, excessive thirst | History of high blood glucose, parental history of diabetes | Waist/hip ratio, waist circumference | |
| Hemoglobin A1c | A prescription for metformin, DM-related medication | ICD-9-CM code | ||||
| Age, sex, BMI | High density lipoprotein cholesterol, triglycerides, | Systolic | Shortness of breath, frequent urination at night, excessive thirst | History of high blood glucose, parental history of diabetes | Waist/hip ratio, waist circumference | |
| Age, sex, BMI | High density lipoprotein cholesterol, triglycerides, fasting plasma glucose, Hemoglobin A1c | Systolic | Shortness of breath, frequent urination at night, excessive thirst | History of high blood glucose, parental history of diabetes | Waist/hip ratio, waist circumference | |
| Random glucose, glycol-albumin, HbA1c, GAD antibody, IA2 antibody, C-peptide | T2DM medication | ICD-10 code, 125I-insulin binding ratio | ||||
| Age, black, obesity | Metabolic equivalent | Resting heart rate, resting systolic blood pressure, resting diastolic blood pressure | Sedentary lifestyle | Family history of premature coronary artery disease, hypertension, aspirin | ||
| Fasting glycemia, HbA1c | Diabetes Mellitus related prescriptions filled | Diabetes Mellitus related codes | ||||
| Age, gender, BMI, race, region, insurance status, average annual household income, education status | Hemoglobin A1c, fasting glucose, 2h oral glucose tolerance, random glucose, triglycerides, total bilirubin, alanine aminotransferase, creatinine, low-density lipoprotein, high-density lipoprotein | Heart rate, blood pressure, body temperature |
Type 2 diabetes mellitus diagnosis-related studies: samples and classification type.
| Study | Country | Sample size | Classification type | Results |
|---|---|---|---|---|
| | China | 4,205 | Binary | J48 has the best performance (accuracy = 0.9503, precision = 0.950, recall = 0.950, |
| | USA | 768 | Binary | The proposed model attained a 3.04% higher prediction accuracy than those of other studies |
| | Australia | 10,911 | Binary | The performance of different learners depends on both the period and the purpose of prediction |
| | USA | 4,208 | Binary | The proposed algorithm performed well, with 99.70% sensitivity and 99.97% specificity |
| | USA | 768 | Binary | The proposed method markedly improves prediction accuracy relative to prior methods |
| | USA | 768 | Binary | Gaussian process classification performs better than the other methods, with accuracy = 81.97%, sensitivity = 91.79%, positive predictive value = 84.91%, and negative predictive value = 62.50% |
| | Japan | 104,522 | Binary | The proposed phenotyping algorithms show better performance than baseline algorithms |
| | USA | 32,555 | Binary | The proposed ensemble approach achieved high prediction accuracy (AUC = 0.920) |
| | Argentina | 2,463 | Multi-class | The stacked generalization strategy and the feed-forward neural network performed best on the validation set |
| | USA | 24,331 | Binary | The proposed ensemble model accurately predicted progression to T2DM (AUC = 0.76), and was validated out of sample (AUC = 0.78) |
Operational definition of features.
| | Features/Target class | Measurement | Definition | References |
|---|---|---|---|---|
| Target class | Diagnosis of T2DM | Discrete | The probability of four kinds of T2DM diagnosis: E1121, E1143, and E1165 | NA |
| Features | Gender | Discrete | Gender of the patients, Male or Female. | |
| | Age | Continuous | Age ( | |
| | Smoking status | Discrete | Yes, quit, or no | |
| | BMI | Continuous | Body mass index | |
| | Total Cholesterol | Continuous | The level of total cholesterol during out-patient services | |
| | Triglyceride | Continuous | The level of triglyceride during out-patient services | |
| | Glucose (AC) | Continuous | The level of glucose (AC) during out-patient services | |
| | Hemoglobin A1c | Continuous | The level of Hemoglobin A1c during out-patient services | |
| | High density lipoprotein cholesterol | Continuous | The level of high-density lipoprotein cholesterol during out-patient services | |
| | Low density lipoprotein cholesterol | Continuous | The level of low-density lipoprotein cholesterol during out-patient services | |
R packages used and the optimal model parameters given.
| Method | Parameters | Best parameter setting | R packages |
|---|---|---|---|
| Support vector machine | sigma | 0.664667494 | kernlab 0.9-29 |
| | C | 11.07262251 | |
| C5.0 | winnow | FALSE | C50 0.1.3 |
| | trials | 43 | |
| Deep neural network | hidden | 200 | h2o 3.30.0.1 |
| | input_dropout_ratio | 0 | |
| | activation | Maxout | |
| eXtreme gradient boosting | nrounds | 154 | xgboost 1.0.0.2 |
| | max_depth | 10 | |
| | eta | 0.745922343 | |
| | gamma | 3.194824195 | |
| | colsample_bytree | 0.945590117 | |
| | min_child_weight | 3.35705624 | |
| | subsample | 0.802348509 | |
| Random Forest | mtry | 2 | randomForest 4.6-14 |
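
To make the support vector machine parameters more concrete, the following minimal sketch shows how `sigma` enters a Gaussian radial basis function kernel in kernlab's parameterization, where `sigma` is the inverse kernel width. The table does not state which kernel was used, so the RBF kernel is an assumption here, and the two feature vectors are hypothetical values, not study data. `C` is the cost of constraint violations and does not appear in the kernel itself.

```python
import math

# Tuned values copied from the table above (kernlab parameter names).
SIGMA = 0.664667494
COST = 11.07262251  # penalty for margin violations; not used by the kernel itself


def rbf_kernel(x, y, sigma=SIGMA):
    """Gaussian RBF kernel in kernlab's form: exp(-sigma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Hypothetical, already-scaled feature vectors (not taken from the study).
patient_a = [0.2, -1.0, 0.5]
patient_b = [0.1, -0.8, 0.9]
similarity = rbf_kernel(patient_a, patient_b)
```

A larger `sigma` makes the similarity fall off faster with distance, which gives a more flexible and more easily overfitted decision boundary.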
Confusion matrix.
| | | Predicted: Positive | Predicted: Negative |
|---|---|---|---|
| Actual class | Positive | True positive (TP) | False negative (FN) |
| Actual class | Negative | False positive (FP) | True negative (TN) |
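
The binary layout above extends to a multi-class problem by treating each class in turn as "positive" and all other classes as "negative" (one-vs-rest). The sketch below illustrates this under that assumption; the diagnosis-code labels come from the feature table, but the counts are hypothetical and are not results from the study.

```python
def one_vs_rest_counts(matrix, class_index):
    """Return (TP, FN, FP, TN) for one class of a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp                  # actual k, predicted other
    fp = sum(row[class_index] for row in matrix) - tp   # predicted k, actual other
    tn = total - tp - fn - fp
    return tp, fn, fp, tn


# Hypothetical counts for three diagnosis codes (illustrative only).
labels = ["E1121", "E1143", "E1165"]
confusion = [
    [20, 2, 1],
    [3, 15, 2],
    [0, 1, 30],
]
counts = {label: one_vs_rest_counts(confusion, i) for i, label in enumerate(labels)}
```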
Formulae for performance metrics.
| Metric | Formula |
|---|---|
| Average accuracy | |
| Matthews correlation coefficient | |
| Precision | |
| Recall | |
| F1 score |
Note:
l denotes the number of class levels; M denotes macro-averaged metrics; TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively.
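
As an illustration of the metrics named above, the following sketch computes them from a multi-class confusion matrix. It assumes the common macro-averaging convention (each metric is computed per class and then averaged over the l classes), takes macro F1 as the mean of the per-class F1 scores, and uses the standard multi-class generalization of the Matthews correlation coefficient. These conventions are assumptions and may differ from the exact formulae used in the paper; the example counts are hypothetical.

```python
import math


def macro_metrics(matrix):
    """Macro-averaged metrics for a square confusion matrix.

    Rows are actual classes, columns are predicted classes.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    actual = [sum(row) for row in matrix]
    predicted = [sum(row[k] for row in matrix) for k in range(n_classes)]

    accuracy = precision = recall = f1 = 0.0
    for k in range(n_classes):
        tp = matrix[k][k]
        fn = actual[k] - tp
        fp = predicted[k] - tp
        tn = total - tp - fn - fp
        p_k = tp / (tp + fp) if tp + fp else 0.0
        r_k = tp / (tp + fn) if tp + fn else 0.0
        accuracy += (tp + tn) / total
        precision += p_k
        recall += r_k
        f1 += 2 * p_k * r_k / (p_k + r_k) if p_k + r_k else 0.0

    # Multi-class MCC from the trace, the total, and the marginal counts.
    correct = sum(matrix[k][k] for k in range(n_classes))
    numerator = correct * total - sum(p * t for p, t in zip(predicted, actual))
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in predicted))
        * (total ** 2 - sum(t * t for t in actual))
    )
    return {
        "average_accuracy": accuracy / n_classes,
        "precision": precision / n_classes,
        "recall": recall / n_classes,
        "f1": f1 / n_classes,
        "mcc": numerator / denominator if denominator else 0.0,
    }


# Hypothetical three-class confusion matrix (not study results).
example = macro_metrics([[20, 2, 1], [3, 15, 2], [0, 1, 30]])
```

Some authors instead define macro F1 as the harmonic mean of macro precision and macro recall; the two definitions generally give different values when per-class performance is uneven.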
Data summary results.
| Feature | Range | Summary statistics |
|---|---|---|
| Gender | Male/Female | Male: 86, Female: 63 |
| Age | 21~91 | |
| Smoking status | No/Quit/Yes | No = 123, Quit = 10, Yes = 16 |
| BMI | 15.49~44.05 | |
| Total cholesterol | 77~311 | |
| Triglyceride | 37~546 | |
| Glucose (AC) | 68~346 | |
| Hemoglobin A1c | 5.1~11.6 | |
| High density lipoprotein cholesterol | 16~98 | |
| Low density lipoprotein cholesterol | 29~152 |
Note:
M denotes the mean and SD denotes the standard deviation.
Model performance: 10-fold cross-validation.
| Sample | Learner | Accuracy (SD) | AUC (SD) | MCC (SD) | Macro F1 (SD) | Macro precision (SD) | Macro recall (SD) | Process time (s) | Stuart–Maxwell test |
|---|---|---|---|---|---|---|---|---|---|
| Train | SVM | 0.998 (0.006) | 1.000 (0.000) | 0.995 (0.011) | 0.994 (0.012) | 0.997 (0.008) | 0.991 (0.015) | 2.22 | |
| | C5.0 | 0.984 (0.015) | 0.999 (0.001) | 0.969 (0.031) | 0.981 (0.020) | 0.987 (0.015) | 0.975 (0.026) | 6.74 | |
| | DNN | 0.947 (0.019) | 0.985 (0.016) | 0.896 (0.033) | 0.935 (0.027) | 0.956 (0.031) | 0.922 (0.028) | 13.56 | |
| | XGB | 0.943 (0.021) | 0.992 (0.008) | 0.885 (0.044) | 0.918 (0.050) | 0.946 (0.036) | 0.894 (0.058) | 7.86 | |
| | RF | 0.986 (0.010) | 1.000 (0.000) | 0.972 (0.017) | 0.985 (0.011) | 0.992 (0.006) | 0.978 (0.016) | 4.59 | |
| Test | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | C5.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | DNN | 0.855 | 0.985 | 0.730 | 0.678 | 0.876 | 0.684 | | χ2(3) = 253.20, |
| | XGB | 0.989 | 1.000 | 0.979 | 0.985 | 0.992 | 0.978 | | χ2(2) = 13.00, |
| | RF | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
Note:
AUC, area under the receiver operating characteristic curve; SD, standard deviation; MCC, Matthews correlation coefficient; SVM, support vector machine; DNN, deep neural network; XGB, eXtreme gradient boosting; RF, random forest. Process time is measured in seconds.
Figure 4. Model performance on the test dataset: 10-fold cross-validation.
AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient.
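
The 10-fold procedure behind these results can be sketched as follows: the records are shuffled, divided into ten folds, and each fold serves once as the evaluation set while the other nine are used for fitting. This is a minimal illustration with plain shuffling; whether the original folds were stratified by diagnosis code is not stated here, and the sample size of 149 is used only as an illustrative number.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Illustrative sample size; each record lands in exactly one test fold.
splits = list(k_fold_indices(149, k=10))
```

The reported mean and SD for each metric would then be taken over the ten per-fold scores.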
Model performance: leave-one-subject-out cross-validation.
| Sample | Learner | Accuracy (SD) | AUC (SD) | MCC (SD) | Macro F1 (SD) | Macro precision (SD) | Macro recall (SD) | Process time (s) | Stuart–Maxwell test |
|---|---|---|---|---|---|---|---|---|---|
| Train | SVM | 0.999 (0.000) | 1.000 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 280.67 | |
| | C5.0 | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 879.37 | |
| | DNN | 0.984 (0.004) | 0.998 (0.001) | 0.968 (0.008) | 0.981 (0.005) | 0.983 (0.004) | 0.979 (0.005) | 2145.94 | |
| | XGB | 0.992 (0.002) | 0.999 (0.000) | 0.985 (0.005) | 0.990 (0.004) | 0.994 (0.003) | 0.986 (0.005) | 1028.34 | |
| | RF | 0.999 (0.000) | 1.000 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 0.999 (0.000) | 639.22 | |
| Test | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | C5.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | DNN | 0.893 | 0.996 | 0.802 | 0.797 | 0.902 | 0.781 | | χ2(3) = 87.45, |
| | XGB | 0.993 | 0.999 | 0.985 | 0.989 | 0.994 | 0.985 | | χ2(2) = 5.67, |
| | RF | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
Note:
AUC, area under the receiver operating characteristic curve; SD, standard deviation; MCC, Matthews correlation coefficient; SVM, support vector machine; DNN, deep neural network; XGB, eXtreme gradient boosting; RF, random forest. Process time is measured in seconds.
Figure 5. Model performance on the test dataset: leave-one-subject-out cross-validation.
AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient.
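
Leave-one-subject-out cross-validation differs from k-fold in that every record belonging to one subject is held out together, so a model is never evaluated on a patient it was fitted on. The sketch below assumes each record carries a subject identifier; the identifiers shown are hypothetical.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_rows, test_rows), one fold per subject."""
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subject_ids):
        rows_by_subject[subject].append(row)
    for held_out, test_rows in rows_by_subject.items():
        train_rows = [
            row
            for subject, rows in rows_by_subject.items()
            if subject != held_out
            for row in rows
        ]
        yield held_out, train_rows, test_rows


# Hypothetical subject identifiers, one per record.
subject_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_subject_out(subject_ids))
```

Because one model is fitted per subject, the number of fits grows with the number of subjects, which is consistent with the much longer process times in the table above.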
Model performance: holdout cross-validation.
| Sample | Method | Accuracy | AUC | MCC | Macro F1 | Macro precision | Macro recall | Process time (s) | Stuart–Maxwell test |
|---|---|---|---|---|---|---|---|---|---|
| Train | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.23 | |
| | C5.0 | 0.950 | 0.996 | 0.903 | 0.933 | 0.948 | 0.920 | 0.59 | |
| | DNN | 0.970 | 0.997 | 0.940 | 0.954 | 0.970 | 0.939 | 1.59 | |
| | XGB | 0.886 | 0.974 | 0.775 | 0.809 | 0.869 | 0.770 | 0.75 | |
| | RF | 0.978 | 1.000 | 0.957 | 0.980 | 0.989 | 0.972 | 0.39 | |
| Test | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | C5.0 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
| | DNN | 0.814 | 0.989 | 0.623 | 0.739 | 0.913 | 0.676 | | χ2(3) = 205.04, |
| | XGB | 0.993 | 1.000 | 0.987 | 0.993 | 0.996 | 0.989 | | χ2(2) = 8.00, |
| | RF | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | | |
Note:
AUC, area under the receiver operating characteristic curve; SD, standard deviation; MCC, Matthews correlation coefficient; SVM, support vector machine; DNN, deep neural network; XGB, eXtreme gradient boosting; RF, random forest. Process time is measured in seconds.
Figure 6. Model performance on the test dataset: holdout cross-validation.
AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient.
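
The performance tables report a Stuart–Maxwell test for some learners. The sketch below shows how that statistic can be computed from a square table of actual versus predicted classes, assuming the test is used to check marginal homogeneity (whether the predicted class distribution matches the actual one). The statistic is d' S⁻¹ d with k − 1 degrees of freedom for k classes; the small linear solver and the example counts are illustrative and not taken from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular covariance matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        partial = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - partial) / aug[i][i]
    return x


def stuart_maxwell(table):
    """Return (chi-square statistic, degrees of freedom) for a k x k table."""
    k = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(k)]
    m = k - 1  # the last category is dropped because the differences sum to zero
    d = [float(row_totals[i] - col_totals[i]) for i in range(m)]
    s = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            if i == j:
                s[i][j] = row_totals[i] + col_totals[i] - 2.0 * table[i][i]
            else:
                s[i][j] = -float(table[i][j] + table[j][i])
    z = solve_linear_system(s, d)
    return sum(di * zi for di, zi in zip(d, z)), m


# Hypothetical actual-versus-predicted counts (not study results).
chi_square, dof = stuart_maxwell([[20, 2, 1], [3, 15, 2], [0, 1, 30]])
```

For a 2 x 2 table the statistic reduces to McNemar's test; a large value relative to the chi-square distribution with the given degrees of freedom indicates that the predicted and actual class proportions differ.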
Figure 7. Importance of features.
Comparison of our study with state-of-the-art works.
| Algorithms | Study | Accuracy | AUC | MCC | Precision | Recall | F1 score |
|---|---|---|---|---|---|---|---|
| Support vector machine | This study | 1 | 1 | 1 | 1 | 1 | 1 |
| | | 0.908 | 0.763 | NA | 0.903 | 0.908 | 0.905 |
| | | NA | 0.831 | 0.922 | NA | 0.683 | NA |
| | | NA | NA | NA | 0.8 | 0.909 | NA |
| Neural network | This study | 0.788 | 0.986 | 0.566 | 0.910 | 0.620 | 0.684 |
| | | NA | 0.663 | 0.007 | NA | 0.41 | NA |
| | | 0.923 | NA | NA | NA | NA | NA |
| | | NA | NA | NA | 0.930 | 0.960 | 0.940 |
| Random forest | This study | 1 | 1 | 1 | 1 | 1 | 1 |
| | | 0.840 | NA | NA | 0.844 | 0.994 | 0.913 |
Note:
AUC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient; NA, not available.