Classification Techniques for Cardio-Vascular Diseases Using Supervised Machine Learning.

John Minou¹, John Mantas¹, Flora Malamateniou², Daphne Kaitelidou³.

Abstract

INTRODUCTION: The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart diseases. Half of the deaths in developed countries are due to cardiovascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients. AIM: The aim of this paper is to build and compare classification techniques for cardiovascular diseases.
METHODS: The dataset contains 4270 patients and 14 attributes and is available in the UCI data repository. The outcome to predict is binary (event or no event). Each attribute is a potential risk factor. The risk factors are demographic, behavioral, and medical. The classification goal is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).
RESULTS: Different classifiers were tested. The SMOTE technique was used to address the class imbalance. Cross-validation was used to estimate how accurately the predictive models will perform. The classifiers were evaluated using the following metrics: precision, recall, F1-score, Accuracy, and AUC (Area Under Curve).
CONCLUSIONS: Based on the results, the Random Forest and Decision Tree classifiers achieved the best scores.

Keywords: Cardio vascular diseases; Classification; Cross Validation; SMOTE

Year: 2020 PMID: 32317833 PMCID: PMC7164736 DOI: 10.5455/medarh.2020.74.39-41

Source DB: PubMed Journal: Med Arch ISSN: 0350-199X

INTRODUCTION

The World Health Organization has estimated that 12 million deaths occur worldwide every year due to heart diseases (1). Half of the deaths in developed countries are due to cardiovascular diseases (2). The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high-risk patients and, in turn, reduce complications. Data mining offers a way to convert voluminous information into useful data for reaching a decision. With data mining systems, a disease can be predicted with less investment and with greater accuracy and precision (3).

AIM

The aim of this paper is to build and compare classification techniques for cardiovascular diseases.

METHODS

The research aim of this paper is to apply and evaluate classification techniques. The classification goal is to predict whether the patient runs a risk of future coronary heart disease (CHD) in the next 10 years. A dataset was used for the supervised classification. The dataset is publicly available, as a CSV file, on the UCI website, and it comes from an ongoing cardiovascular study. It contains 4270 patients and 14 attributes. Each attribute is a potential risk factor. The risk factors are demographic, behavioral, and medical. The endpoint is defined as a binary outcome: a patient either has or does not have a 10-year risk of coronary heart disease.

Demographics:
Sex: male or female.
Age: age of the patient.

Behavioral:
Current Smoker: whether or not the patient is a current smoker.
Cigs Per Day: the number of cigarettes that the person smoked on average in one day.

Information on medical history:
BP Meds: whether or not the patient was on blood pressure medication.
Previous Stroke: whether or not the patient had previously had a stroke.
Previous Hyp: whether or not the patient was hypertensive.
Diabetes: whether or not the patient had diabetes.

Information on current medical condition:
Tot Cholesterol: total cholesterol level.
Systolic BP: systolic blood pressure.
Diastolic BP: diastolic blood pressure.
BMI: Body Mass Index.
Heart Rate: heart rate.
Glucose: glucose level.

Target variable to predict: 10-year risk of coronary heart disease (CHD) (binary: "1" means "Yes", "0" means "No").

First, the missing values were removed (4). Afterwards, we examined the dataset for class imbalance. The data exploration showed that the classes were imbalanced: the ratio of patients without cardiovascular disease to patients with cardiovascular disease was 85:15.
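As an illustration of the two preprocessing checks described above, the following minimal Python sketch drops incomplete records and reports the class ratio. It assumes that removing missing values means discarding any record with a missing field. The toy records and the column name TenYearCHD are hypothetical and are not taken from the study's dataset.

```python
from collections import Counter

# Hypothetical toy records (not the study data); None marks a missing value.
# "TenYearCHD" is an assumed name for the binary target column.
records = (
    [{"age": 40 + i, "glucose": 80 + i, "TenYearCHD": 0} for i in range(17)]
    + [{"age": 60 + i, "glucose": 100 + i, "TenYearCHD": 1} for i in range(3)]
    + [
        {"age": 50, "glucose": None, "TenYearCHD": 0},
        {"age": None, "glucose": 90, "TenYearCHD": 1},
    ]
)


def drop_incomplete(rows):
    """Keep only the rows in which no field is missing (None)."""
    return [row for row in rows if all(v is not None for v in row.values())]


def class_ratio(rows, target="TenYearCHD"):
    """Return the share of each class label, as a percentage of all rows."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


complete = drop_incomplete(records)
ratio = class_ratio(complete)
print(len(records), "records,", len(complete), "complete")
print("class shares (%):", ratio)
```

With these toy records, the two incomplete rows are discarded and the remaining 20 rows split 85:15 between the "0" and "1" classes, mirroring the imbalance reported for the real data.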
Imbalanced data need to be preprocessed before they are fed into a classifier because classifiers are typically more sensitive to the majority class and less sensitive to the minority class (5). In order to avoid overfitting and data loss, the SMOTE oversampling method was used (6). This method generates synthetic data based on feature-space similarities between existing minority instances (7). To create a synthetic instance, it finds the K nearest neighbors of a minority instance, randomly selects one of them, and then interpolates linearly between the two to produce a new minority instance in the neighborhood (8). After applying SMOTE, the classes were balanced at a ratio of 50:50. Classifiers such as Logistic Regression, Naive Bayes, Decision Tree, K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Random Forest were applied. Metrics such as precision, recall, F1-score, Accuracy, and AUC (Area Under Curve) were used to evaluate the performance of the aforementioned classifiers (9). Ten-fold cross-validation was used to assess and improve the accuracy of our classifiers (10). The implementation was done in Python.
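The interpolation step described above can be sketched in plain Python as follows. This is a simplified illustration and not the implementation used in the study: the function names and the toy points are hypothetical, and base instances are picked at random rather than by looping over every minority instance.

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda q: math.dist(point, q))[:k]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by linear interpolation.

    Simplification (assumption): each synthetic point starts from a randomly
    chosen minority instance instead of iterating over all of them in turn.
    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbor = rng.choice(k_nearest(base, others, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy data: 4 minority points against 12 majority points.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
n_majority = 12

new_points = smote_sketch(minority, n_new=n_majority - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority), "minority vs", n_majority, "majority")
```

Because every synthetic point lies on a segment between two existing minority points, the new instances stay inside the region already occupied by the minority class. In this toy case, eight synthetic points bring the two classes to 12 instances each, that is, a 50:50 ratio.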

RESULTS

According to Table 1, Decision Tree has the highest Precision (0.79) and SVM the lowest (0). Decision Tree also has the highest Recall and F1-score (0.82 and 0.81, respectively), with an Accuracy of 0.84. Random Forest has the highest Accuracy (0.91) and the highest AUC (1). Logistic Regression has the second-highest Recall and F1-score. Finally, SVM has the lowest Precision, Recall, F1-score and AUC.

Table 1. Classification Results

Classifier	Precision	Recall	F1-Score	Accuracy	AUC
Logistic Regression	0.69139	0.68447	0.68791	0.68857	0.69
Naïve Bayes	0.71929	0.41068	0.52284	0.71929	0.62
Decision Tree	0.79454	0.82637	0.81014	0.8436	0.8
KNN	0.29787	0.1	0.14973	0.82843	0.51
SVM	0	0	0	0.8301	0.5
Random Forest	0.64285	0.06428	0.11688	0.91102	1
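To make the metrics in Table 1 concrete, the following Python sketch computes precision, recall, F1-score and accuracy from confusion-matrix counts. The function names and the example counts are hypothetical. Returning 0 when a denominator is zero is a convention assumed here, not something stated in the paper. The last part recomputes F1 from the precision and recall values reported in the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, fn, tn):
    """Return (precision, recall, F1, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1_score(precision, recall), accuracy


# Hypothetical case: 100 patients at an 85:15 ratio, and a classifier that
# predicts "no event" for everyone (so there are no positive predictions).
always_negative = metrics(tp=0, fp=0, fn=15, tn=85)
print(always_negative)  # (0.0, 0.0, 0.0, 0.85)

# Precision, recall and F1 exactly as reported in Table 1.
reported = {
    "Logistic Regression": (0.69139, 0.68447, 0.68791),
    "Naive Bayes": (0.71929, 0.41068, 0.52284),
    "Decision Tree": (0.79454, 0.82637, 0.81014),
}
for name, (p, r, f1) in reported.items():
    print(name, "recomputed F1:", round(f1_score(p, r), 5), "reported:", f1)
```

The hypothetical always-negative case shows how zero precision and recall can coexist with high accuracy on imbalanced data, a pattern that resembles the SVM row. The recomputed F1 values agree with the reported ones to within rounding.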

DISCUSSION

Most of the applied classifiers achieved a reasonable performance, except Naive Bayes, KNN and SVM. In general, no single classifier is best for every problem. A threshold-based classifier may work well in many applications, but a more complicated system may perform better; it depends on the problem at hand (11-17). In addition, the classifiers were applied using only the SMOTE oversampling method, which is a limitation of this research. Future work includes testing the classifiers with different oversampling and undersampling methods and comparing the results.

CONCLUSION

The cross-validation method was used to estimate how accurately our predictive models will perform. We evaluated our classifiers using the following metrics: precision, recall, F1-score, Accuracy, and AUC (Area Under Curve). Based on the results, the Random Forest and Decision Tree classifiers achieved the best scores.

7 in total

Review 1. Value and limitations of existing scores for the assessment of cardiovascular risk: a review for clinicians.

Authors: Marie Therese Cooney; Alexandra L Dudina; Ian M Graham
Journal: J Am Coll Cardiol Date: 2009-09-29 Impact factor: 24.094

Review 2. The CrowdHEALTH project and the Hollistic Health Records: Collective Wisdom Driving Public Health Policies.

Authors: Dimosthenis Kyriazis; Serge Autexier; Michael Boniface; Vegard Engen; Ricardo Jimenez-Peris; Blanca Jordan; Gregor Jurak; Athanasios Kiourtis; Thanos Kosmidis; Mitja Lustrek; Ilias Maglogiannis; John Mantas; Antonio Martinez; Argyro Mavrogiorgou; Andreas Menychtas; Lydia Montandon; Cosmin-Septimiu Nechifor; Sokratis Nifakos; Alexandra Papageorgiou; Marta Patino-Martinez; Manuel Perez; Vassilis Plagianakos; Dalibor Stanimirovic; Gregor Starc; Tanja Tomson; Francesco Torelli; Vicente Traver-Salcedo; George Vassilacopoulos; Andriana Magdalinou; Usman Wajid
Journal: Acta Inform Med Date: 2019-12

3. Health Professionals' Perception about Big Data Technology in Greece.

4. Data Sources and Gateways: Design and Open Specification.

5. Disseminating Research Outputs: The CrowdHEALTH Project.

6. The Integrated Holistic Security and Privacy Framework Deployed in CrowdHEALTH Project.

7. Generating and Knowledge Framework: Design and Open Specification.