Azra Ramezankhani (1), Omid Pournik (2), Jamal Shahrabi (3), Fereidoun Azizi (4), Farzad Hadaegh (1), Davood Khalili (1, 5)
1. Prevention of Metabolic Disorders Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (AR, FH, DK)
2. Department of Community Medicine, School of Medicine, Iran University of Medical Sciences, Tehran, Iran (OP)
3. Industrial Engineering Department, Amirkabir University of Technology, Tehran, Iran (JS)
4. Endocrine Research Center, Research Institute for Endocrine Sciences, Shahid Beheshti University of Medical Sciences, Tehran, Iran (FA)
5. Department of Epidemiology, School of Public Health, Shahid Beheshti University of Medical Sciences, Tehran, Iran (DK)
Abstract
OBJECTIVE: To evaluate the impact of the synthetic minority oversampling technique (SMOTE) on the performance of probabilistic neural network (PNN), naïve Bayes (NB), and decision tree (DT) classifiers for predicting diabetes in a prospective cohort of the Tehran Lipid and Glucose Study (TLGS). METHODS: Data from 6647 nondiabetic participants, aged 20 years or older with more than 10 years of follow-up, were used to develop prediction models based on 21 common risk factors. The minority class in the training dataset was oversampled using SMOTE at 100%, 200%, 300%, 400%, 500%, 600%, and 700% of its original size. The original and the oversampled training datasets were used to build the classification models. Accuracy, sensitivity, specificity, precision, F-measure, and Youden's index were used to evaluate the performance of the classifiers on the test dataset. To compare the performance of the 3 classification models, we used the ROC convex hull (ROCCH). RESULTS: Oversampling the minority class at 700% (completely balanced) increased the sensitivity of the PNN, DT, and NB by 64%, 51%, and 5%, respectively, but decreased the accuracy and specificity of all 3 classification methods. NB had the best Youden's index before and after oversampling. The ROCCH showed that PNN is suboptimal for any class and cost conditions. CONCLUSIONS: When building a classifier with a machine learning algorithm such as the PNN or DT, class skew in the data should be considered. The NB and DT were optimal classifiers for a prediction task in an imbalanced medical database.
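To make the oversampling rates and evaluation measures named in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the usual SMOTE scheme (each synthetic sample is a random interpolation between a minority sample and one of its k nearest minority neighbours, with a rate of N% producing N/100 synthetic samples per original) and assumes that "F-measure" means the F1 score. The function names `smote` and `classification_metrics`, the default `k=5`, and the `seed` parameter are illustrative choices, not taken from the study.

```python
import math
import random


def smote(minority, percent, k=5, seed=0):
    """Return synthetic minority samples (assumed SMOTE scheme).

    percent=200 yields 2 synthetic samples per original minority sample.
    Needs at least 2 minority samples so every sample has a neighbour.
    """
    if percent <= 0 or percent % 100 != 0:
        raise ValueError("percent must be a positive multiple of 100")
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority neighbours of x (Euclidean), excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position on the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


def classification_metrics(tp, fp, tn, fn):
    """Compute the abstract's six measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        # F1 score: harmonic mean of precision and sensitivity (assumption)
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # Youden's index J = sensitivity + specificity - 1
        "youden": sensitivity + specificity - 1,
    }


if __name__ == "__main__":
    # Toy data and counts, purely hypothetical (not TLGS values)
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(len(smote(toy_minority, 200, k=2)), "synthetic samples")
    print(classification_metrics(tp=40, fp=10, tn=90, fn=60))
```

Under this scheme, a 700% rate adds seven synthetic cases per original minority case, so the minority class grows to eight times its original size, which the abstract describes as completely balanced. Youden's index weights sensitivity and specificity equally, which makes it less sensitive to class skew than accuracy.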
Authors: Rafael V Veiga; Helio J C Barbosa; Heder S Bernardino; João M Freitas; Caroline A Feitosa; Sheila M A Matos; Neuza M Alcântara-Neves; Maurício L Barreto Journal: BMC Bioinformatics Date: 2018-06-26 Impact factor: 3.169
Authors: Jinying Chen; John Lalor; Weisong Liu; Emily Druhl; Edgard Granillo; Varsha G Vimalananda; Hong Yu Journal: J Med Internet Res Date: 2019-03-11 Impact factor: 5.428
Authors: Davide Barbieri; Nitesh Chawla; Luciana Zaccagni; Tonći Grgurinović; Jelena Šarac; Miran Čoklo; Saša Missoni Journal: Int J Environ Res Public Health Date: 2020-10-28 Impact factor: 3.390