John Minou (1), John Mantas (1), Flora Malamateniou (2), Daphne Kaitelidou (3)
(1) Health Informatics Laboratory, Faculty of Nursing, National and Kapodistrian University of Athens, Greece.
(2) Department of Digital Systems, University of Piraeus, Greece.
(3) Department of Health Sciences, Faculty of Nursing, National and Kapodistrian University of Athens, Greece.
Abstract
INTRODUCTION: The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart disease. Half of the deaths in developed countries are due to cardiovascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients. AIM: The aim of this paper is to build and compare classification techniques for cardiovascular diseases. METHODS: The dataset contains 4270 patients and 14 attributes and is available on the UCI data repository. The prediction is a binary outcome (event or no event). Each attribute is a potential risk factor. There are demographic, behavioral and medical risk factors. The classification goal is to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). RESULTS: Different classifiers were tested. The SMOTE technique was used to address the class imbalance. Cross-validation was used to estimate how accurately the predictive models will perform. The classifiers were evaluated using the following metrics: precision, recall, F1-score, accuracy and AUC (Area Under the Curve). CONCLUSIONS: Based on the results, the Random Forest and Decision Tree classifiers achieved the best scores.
INTRODUCTION
The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart disease (1). Half of the deaths in developed countries are due to cardiovascular diseases (2). The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients and, in turn, reduce complications. Data mining offers methods for turning large volumes of information into data that support decision making. With data mining systems, disease can be predicted with less investment and with greater accuracy and precision (3).
AIM
The aim of this paper is to build and compare classification techniques for cardiovascular diseases.
METHODS
The research aim of this paper is to apply and evaluate classification techniques. The classification goal is to predict whether a patient is at risk of coronary heart disease (CHD) within the next 10 years. A labeled dataset was used for the supervised classification. The dataset is publicly available as a CSV file on the UCI website and comes from an ongoing cardiovascular study. It contains 4270 patients and 14 attributes. Each attribute is a potential risk factor. There are demographic, behavioral and medical risk factors. The endpoint is defined as a binary outcome: a patient either has or does not have a 10-year risk of coronary heart disease.

Demographics:
- Sex: male or female.
- Age: age of the patient.

Behavioral:
- Current Smoker: whether or not the patient is a current smoker.
- Cigs Per Day: the number of cigarettes that the person smoked on average in one day.

Information on medical history:
- BP Meds: whether or not the patient was on blood pressure medication.
- Previous Stroke: whether or not the patient had previously had a stroke.
- Previous Hyp: whether or not the patient was hypertensive.
- Diabetes: whether or not the patient had diabetes.

Information on current medical condition:
- Tot Cholesterol: total cholesterol level.
- Systolic BP: systolic blood pressure.
- Diastolic BP: diastolic blood pressure.
- BMI: Body Mass Index.
- Heart Rate: heart rate.
- Glucose: glucose level.

Target variable to predict:
- 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes", "0" means "No").

First, the records with missing values were removed (4). Afterwards, we examined the dataset for class imbalance.
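The two preprocessing steps described above (removing records with missing values, then checking the class distribution) can be sketched as follows. This is a minimal standard-library illustration and not the authors' code: the rows and the column names (`age`, `cigs_per_day`, `glucose`, `ten_year_chd`) are hypothetical placeholders rather than the dataset's actual headers.

```python
from collections import Counter

# Hypothetical records; None marks a missing value. The column names are
# illustrative assumptions, not the real headers of the UCI file.
rows = [
    {"age": 39, "cigs_per_day": 0, "glucose": 77, "ten_year_chd": 0},
    {"age": 46, "cigs_per_day": None, "glucose": 76, "ten_year_chd": 0},
    {"age": 48, "cigs_per_day": 20, "glucose": 70, "ten_year_chd": 0},
    {"age": 61, "cigs_per_day": 30, "glucose": 103, "ten_year_chd": 1},
    {"age": 43, "cigs_per_day": 0, "glucose": None, "ten_year_chd": 1},
    {"age": 52, "cigs_per_day": 5, "glucose": 85, "ten_year_chd": 0},
]


def drop_missing(records):
    """Keep only the records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


def class_ratio(records, target="ten_year_chd"):
    """Return the share of each class label in the target column."""
    counts = Counter(r[target] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


clean = drop_missing(rows)
ratio = class_ratio(clean)
print(len(clean), ratio)
```

On the real dataset, the same ratio check is what would reveal an imbalance between patients with and without the target event.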
From the data exploration we noticed that the classes were imbalanced: the ratio of patients without cardiovascular disease to patients with cardiovascular disease was 85:15. Imbalanced data need to be preprocessed before they are fed into a classifier because classifiers are typically more sensitive to the majority class and less sensitive to the minority class (5). In order to avoid overfitting and data loss, the SMOTE oversampling method was used (6). This method generates synthetic data based on feature-space similarities between existing minority instances (7). To create a synthetic instance, it finds the K nearest neighbors of each minority instance, randomly selects one of them, and then uses linear interpolation to produce a new minority instance in the neighborhood (8). After applying SMOTE, the classes were balanced at a ratio of 50:50.

The following classifiers were applied: Logistic Regression, Naive Bayes, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest. Precision, recall, F1-score, accuracy and AUC (Area Under the Curve) were used to evaluate the performance of these classifiers (9). Ten-fold cross-validation was used to assess and improve the accuracy of our classifiers (10). The implementation was done in Python.
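The SMOTE interpolation step and the ten-fold split described above can be illustrated with the following sketch. It is a simplified, from-scratch version written for clarity and is not the implementation used in the study. The function names `smote` and `k_fold_indices`, the toy minority points and the choice of `k` are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (simplified SMOTE)."""
    if len(minority) < 2:
        raise ValueError("at least two minority points are required")
    rng = random.Random(seed)
    synthetic = []
    for i in range(n_new):
        base_idx = i % len(minority)  # cycle through the minority instances
        base = minority[base_idx]
        # k nearest neighbours of the base point, excluding the point itself
        others = [p for j, p in enumerate(minority) if j != base_idx]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Toy example: 4 minority points are topped up to match 10 majority points.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(minority, n_new=10 - len(minority), k=3)
splits = list(k_fold_indices(20, k=10))
print(len(new_points), len(splits))
```

Because each synthetic point is a convex combination of two existing minority points, it always lies on the segment between them, which is why the new instances stay in the neighborhood of the original minority class.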
RESULTS
According to Table 1, the Decision Tree has the highest Precision (0.79), while the SVM has the lowest (0). The Decision Tree also has the highest Recall and F1-score (0.83 and 0.81, respectively), with an Accuracy of 0.84. The Random Forest has the highest Accuracy (0.91) and the highest AUC (1.00). Logistic Regression has the second-highest Recall and F1-score. Finally, the classifier with the lowest metrics overall is the SVM.
Table 1. Classification Results

| Classifier | Precision | Recall | F1-Score | Accuracy | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.69139 | 0.68447 | 0.68791 | 0.68857 | 0.69 |
| Naïve Bayes | 0.71929 | 0.41068 | 0.52284 | 0.71929 | 0.62 |
| Decision Tree | 0.79454 | 0.82637 | 0.81014 | 0.8436 | 0.8 |
| KNN | 0.29787 | 0.1 | 0.14973 | 0.82843 | 0.51 |
| SVM | 0 | 0 | 0 | 0.8301 | 0.5 |
| Random Forest | 0.64285 | 0.06428 | 0.11688 | 0.91102 | 1 |
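The metrics in Table 1 can be computed from true labels and predictions as sketched below. The example uses a hypothetical test set of 83 negatives and 17 positives and a degenerate model that always predicts "no event". It is shown only to illustrate how a high accuracy can coexist with zero precision and recall. Whether any classifier in the study actually behaved this way is not stated in the paper, and the function names are assumptions for the example.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, F1-score and accuracy for binary labels (1 = event)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # By convention here, a metric with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    tied scores count as half a correct ranking."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set and a model that always predicts "no event".
y_true = [0] * 83 + [1] * 17
always_negative = [0] * 100
metrics = classification_metrics(y_true, always_negative)
constant_auc = auc(y_true, [0.0] * 100)
print(metrics, constant_auc)
```

For this hypothetical model the accuracy is 0.83 while precision, recall and F1-score are all 0 and the AUC is 0.5, which is why accuracy alone can be misleading when the evaluated classes are imbalanced.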
DISCUSSION
Most of the applied classifiers achieved a reasonable performance, except Naive Bayes, KNN and SVM. In general, there is no single answer as to which classifier performs best. A threshold-based classifier may work well in many applications, but a more complicated system may perform better in others; it depends on the problem at hand (11-17). In addition, the classifiers were applied using only the SMOTE oversampling method, which is a limitation of this research. Future work includes testing the classifiers with different oversampling and undersampling methods and comparing the results.
CONCLUSION
The cross-validation method was used to estimate how accurately our predictive models will perform. We evaluated our classifiers using the following metrics: precision, recall, F1-score, accuracy and AUC (Area Under the Curve). Based on the results, the Random Forest and Decision Tree classifiers achieved the best scores.