
A novel approach for heart disease prediction using strength scores with significant predictors.

Armin Yazdani, Kasturi Dewi Varathan, Yin Kia Chiam, Asad Waqar Malik, Wan Azman Wan Ahmad.

Abstract

BACKGROUND: Cardiovascular disease is the leading cause of death in many countries. Physicians often diagnose cardiovascular disease based on current clinical tests and previous experience of diagnosing patients with similar symptoms. Patients who suffer from heart disease require quick diagnosis, early treatment and constant observation. To address their needs, many data mining approaches have been used to diagnose and predict heart disease. Previous research has also focused on identifying the features that contribute significantly to heart disease prediction; however, less importance has been given to identifying the strength of these features.
METHOD: Motivated by this gap in the literature, this paper proposes an algorithm that measures the strength of the significant features that contribute to heart disease prediction. The study aims to predict heart disease based on the scores of significant features using Weighted Associative Rule Mining.
RESULTS: A set of important feature scores and rules was identified for diagnosing heart disease, and cardiologists were consulted to confirm the validity of these rules. The experiments, performed on the UCI open dataset that is widely used for heart disease research, yielded a highest confidence score of 98% in predicting heart disease.
CONCLUSION: This study contributes a method for computing the strength scores of significant predictors in heart disease prediction. From the evaluation results, we obtained important rules and achieved the highest confidence score by applying the computed strength scores of significant predictors in Weighted Associative Rule Mining to predict heart disease.


Keywords:  Cardiovascular disease; Heart disease prediction; Weighted associative rule mining; Weighted scores


Year:  2021        PMID: 34154576      PMCID: PMC8215833          DOI: 10.1186/s12911-021-01527-5

Source DB:  PubMed          Journal:  BMC Med Inform Decis Mak        ISSN: 1472-6947            Impact factor:   2.796


Introduction

Cardiovascular disease (CVD) is one of the most life-threatening diseases in the world. The World Health Organization (WHO) as well as the Global Burden of Disease (GBD) study reported cardiovascular disease as the main cause of death around the globe annually [40, 56]. WHO revealed that CVD is expected to affect almost 23.6 million people by the year 2030. In some industrialized countries such as the United States of America, the rate is about 1 in 4 deaths [34]. The Middle East and North Africa (MENA) region has an even higher percentage, at 39.2% of the mortality rate [20]. Hence, early and accurate diagnosis and the provision of appropriate treatments are key to reducing the number of deaths caused by cardiovascular diseases. The availability of such services is essential for those who have a high risk of developing heart disease [29]. There are many features that contribute to heart disease prediction. Researchers in the past focused more on identifying significant features to be used in their heart disease prediction models [8]. Less importance was given to determining the relationships between these features and to identifying their level of priority [32] within the prediction model. To address the issues that hinder early and accurate diagnosis, many data mining studies have been conducted [9, 16, 28]. Weighted Association Rule Mining (WARM) is one of the data mining techniques used to discover the relationships between features and to determine mining rules that lead to certain predictions [22]. The weight used in this mining technique gives users a convenient way to indicate the importance of the features that contribute to heart disease and helps obtain more accurate rules [4]. In many prediction models, different features have different importance. Hence, different weights are assigned to different features based on their predicting capabilities [48].
The failure in determining the weight indicates the failure in determining the importance of the features. Past research had used Weighted Associative Rule Mining (WARM) in heart disease prediction [18, 31, 46, 48, 50]. However, the prediction model reported in these studies still demands further exploration in terms of the number of features used, the strength of these features and the evaluation of scores obtained. In this research, we proposed an algorithm to compute the weight of each feature that contributes to heart disease prediction. We have experimented on all features as well as selected significant features using WARM. The results obtained showed that the significant features outperformed all features with the highest confidence score of 98% in predicting heart disease. To the best of our knowledge, this study is the first that used strength scores of significant predictors in WARM. The rest of the paper is organized as follows: Sect. 2 presents the background of the study followed by Sect. 3 on research objectives. Section 4 presents the methodology and Sect. 5 displays the results obtained by this research. Section 6 includes the discussions and Sect. 7 benchmarks this research against previous studies. Finally, Sect. 8 concludes the research with a summary of the findings and future work.

Related works

CVDs are disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease and other conditions. Heart attacks and strokes are the main causes of mortality in cardiovascular disease, in which the rate nears one out of three [6]. With such a high rate of mortality, diagnosis and prevention measures need to be performed effectively and efficiently. Many data mining techniques have been used to help address these issues (Amin et al. [8]). Most past research looked into identifying features that contribute to better heart disease prediction accuracy [9]. However, very few studies looked into the relationships that exist between these features. The relationships between the features that contribute to heart disease prediction can be obtained by using the Associative Rule Mining (ARM) technique [11]. The ARM technique is popular for transactional and relational datasets. The hidden knowledge in large datasets such as business transactions has led many business owners to seek patterns that can help them improve their business decisions (Agarwal and Mithal [1]). An example is discovering the items frequently bought together by customers in market basket analysis. This analysis looks at the various items found in customers’ shopping carts and identifies the associations between them. For instance, customers who purchase milk are likely to purchase bread on the same trip to the supermarket. This approach is also widely used in the healthcare industry, specifically in privacy preservation of healthcare data [15], predicting cancer-associated protein interactions [12], predicting obstructive sleep apnea [43] and predicting co-diseases in thyroid patients [23]. ARM is also used in heart disease prediction. Table 1 shows the studies that used ARM in heart disease prediction. Akbaş et al. [3], Shuriyaa and Rajendranb [42], Srinivas et al. [49], Khare and Gupta [24] and Lakshmi and Reddy [27] used ARM on the UCI dataset. Some of the studies listed in Table 1 used private datasets from hospitals and heart centres. Although the scores obtained from these datasets are high (99% by Sonet et al. [45] and 100% by Thanigaivel and Kumar [52]), the studies are limited in terms of reproducibility, as the datasets are not open for access. Akbaş et al. [3], on the other hand, obtained a confidence score of 97.8% using the UCI dataset. However, that confidence score was obtained for predicting people with no risk of heart disease.
Table 1

Studies on Heart Disease Prediction using ARM

Authors | Technique | No of Features Used | Evaluation Metric | Score | Dataset
Akbaş et al. [3] | Associative Rule Mining | 13 | Confidence | 97.8 (Predicting no heart disease) | UCI
Vasanthanageswari and Vanitha [54] | Associative Rule Mining | 16 | NA | NA | Congenital Heart Defect Dataset
Shuriyaa and Rajendranb [42] | Associative Rule Mining + ANFIS | 13 | Accuracy | 93.2 | UCI
Sonet et al. [45] | Associative Rule Mining | 13 | Confidence | 99 | National Institute of Cardiovascular Disease, Dhaka, Bangladesh
Thanigaivel and Kumar [52] | Associative Rule Mining | 25 | Confidence | 100 | Hospital (name of the hospital not mentioned)
Srinivas et al. [49] | Associative Rule Mining and MLP | 13 | Accuracy | 84.9 | UCI
Khare and Gupta [24] | Associative Rule Mining | 13 | Confidence | 94 | UCI
Lakshmi and Reddy [27] | Associative Rule Mining | 13 | Accuracy | 96.6 | UCI
Said et al. [41] | Associative Rule Mining | 13 | Confidence | 91 | UCI
Nahar et al. [36] | Associative Rule Mining | 13 | Confidence | 96 | UCI
Weighted Associative Rule Mining (WARM) is an extension of ARM in which weights are assigned to differentiate the importance of the features mined. Let T be the training dataset, T = {r1, r2, r3, …, ri}, with a set of weights associated with each {attribute, attribute value} pair. Every ith record ri is a set of values, with a weight wi attached to each feature of the record. In a weighted framework, each record is a set of triples {ai, vi, wi}, where feature ai has a value vi and a weight wi with 0 < wi ≤ 1. Assigning a correct weight to each feature is a hard task. In various fields of study, there are different ways of calculating the weights of features. For instance, according to Malarvizhi and Sathiyabhama [30], in web mining the time a visitor dwells on a page is one way of calculating weightage. WARM is widely used in research on shopping basket scenarios and in predicting customers’ behaviour. Chengis et al. [10] investigated assigning weights before and after ARM. WARM was also used in predicting disease comorbidities using clinical as well as molecular data (Lakshmi and Vadivu [26]). This technique is also used in predicting breast cancer [5]. Recent research by Park and Lim [39] used this technique to reduce design failures of pre-alarming systems in the shipbuilding industry. However, not many researchers have focused on applying WARM to cardiovascular disease. Table 2 shows studies on heart disease prediction using WARM. However, the weight of features was not precisely calculated (Jabbar et al. [21], Sundar et al. [50], Soni and Vyas [48]). Soni et al. [47] proposed a new framework, an associative classifier that used WARM. Different weights were assigned to different attributes based on their predicting capability. Their theoretical model yielded a confidence score of 79.5%. Soni and Vyas [48] also applied WARM and achieved a confidence level of 79.5%. Their research assigned weights based on age range, smoking habits, hypertension and BMI range. On the other hand, Soni et al. [46] assigned weights to each of the attributes based on advice obtained from medical experts. They presented an intelligent and effective heart attack prediction system using a weighted associative classifier, achieving a maximum confidence score of 80%. Meanwhile, Sundar et al. [50] developed a system using two data mining techniques, Naïve Bayes and WARM. Their experiments showed that WARM achieved a confidence score of 84%, outperforming Naïve Bayes, which obtained only 78%. Chauhan et al. [11] also used WARM in predicting heart disease. They obtained an accuracy score of 60.4%. Kharya et al. [25] applied the Weighted Bayesian Association Rule Mining Algorithm, which builds on WARM, to a heart disease dataset. However, they did not report the results obtained in their study. Ibrahim and Sivabalakrishnan [19] used Random Walker Memetic algorithm-based WARM for predicting coronary disease. They obtained an accuracy of 95% using the UCI heart disease dataset.
Table 2

Studies on Heart Disease Prediction using WARM

Authors | Technique | No of Features Used | Evaluation Metric | Score | Dataset
Ibrahim and Sivabalakrishnan [19] | Random Walk Memetic Algo with WARM | 13 | Precision | 92% | UCI
Ibrahim and Sivabalakrishnan [18] | WARM | 13 | Confidence | 67% | UCI
Kharya et al. [25] | WARM with Bayesian Belief Network | 4 | NA | NA | NA
Chauhan et al. [11] | WARM | 13 | Accuracy | 60.4% | UCI
Sundar et al. [50] | WARM | 13 | Confidence | 84% | UCI
Soni et al. [46] | WARM | 13 | Confidence | 80% | UCI
Soni and Vyas [48] | WARM | 13 | Confidence | 79.5% | UCI
Although there is research based on WARM in predicting heart disease, none of it focused on identifying the important features to be used in heart disease prediction, which would contribute to better prediction performance. The weight of each feature plays an equally important role in deciding which feature has the highest impact (strength) in predicting heart disease. The right weight for the significant features identified will yield an effective prediction model. Thus, this research focuses on identifying the weight of significant features and utilizing the generated score in predicting heart disease.

Research objectives

The main objectives of this research are as follows: (1) to compute the weight of significant features in heart disease prediction; (2) to predict heart disease using the computed weight of significant features (using WARM); and (3) to evaluate the performance of WARM in predicting heart disease.

Proposed methodology

This section describes in detail the methodology used, as shown in Fig. 1. It contains 5 main stages: data pre-processing, feature selection, feature weight computation, application of WARM and model evaluation.
Fig. 1

Methodology


Dataset

This research uses the heart disease dataset obtained from the UCI Machine Learning Repository [13]. The UCI Machine Learning Repository is one of the largest available dataset collections, with over 417 datasets. The Cleveland dataset from the UCI Machine Learning Repository is one of the datasets on heart disease and is widely used by researchers to date (Amin et al. [8]). This research also uses this dataset, which contains 303 rows. The dataset contains 76 features, of which 14 attributes, including the class label, are used. The 14 features together with their descriptions and data types are shown in Table 3.
Table 3

Features description

No | Features | Description | Data Type
1 | Age | Age in years | Numeric
2 | Sex | Gender | Nominal
3 | CP | Chest pain type | Nominal
4 | Trestbps | Resting blood pressure | Numeric
5 | Chol | Serum cholesterol | Numeric
6 | Fbs | Fasting blood sugar | Nominal
7 | Restecg | Resting electrocardiographic results | Nominal
8 | Thalach | Maximum heart rate achieved | Numeric
9 | Exang | Exercise-induced angina | Nominal
10 | Oldpeak | ST depression induced by exercise relative to rest | Numeric
11 | Slope | The slope of the peak exercise ST segment | Nominal
12 | CA | Number of major vessels coloured by fluoroscopy | Numeric
13 | Thal | Thallium heart scan | Nominal
14 | Goal | Diagnosis of heart disease | Nominal

Experimental Setup

In this research, Weka 3.8 was used to conduct the experiments. The retrieved Cleveland dataset went through a pre-processing phase. The significant features were retrieved from a total of 14 factors from the Cleveland dataset (Amin [7]). Further, the weight of each significant feature was computed and assigned back to it accordingly. WARM was applied to the heart disease dataset to generate rules. Finally, evaluation was performed to obtain the confidence score of the best rules generated using WARM based on significant features. Each process is explained in detail in the following sections.

Data Pre-Processing

In the data pre-processing phase, the 6 instances with missing values were deleted from the dataset. Based on Table 3, there are 13 normal attributes (‘age’, ‘sex’, ‘cp’, ‘trestbps’, ‘chol’, ‘fbs’, ‘restecg’, ‘thalach’, ‘exang’, ‘oldpeak’, ‘slope’, ‘ca’, ‘thal’) and 1 class label (‘goal’), which refers to the criticality level of heart disease in patients. It ranges from 0 to 4, in which 0 refers to ‘No Heart Disease’ and the other values indicate the presence of heart disease at different criticality levels. Since this research aims at predicting the presence of heart disease and not its criticality level, the values from 1 to 4 are collapsed to 1, which indicates the presence of heart disease, while 0 represents its absence. As part of the data transformation process, the numeric data are also converted into nominal data. This is required because WARM utilizes nominal data only. The ranges formed for each feature are shown in Table 4.
Table 4

Ranges formed for features

Feature | Ranges
Age | ≤ 40: lessThanForty; 41–64: betweenAge; ≥ 65: greaterThanSixtyFour
Sex | 1: Male; 0: Female
CP | 1: typicalAngina; 2: atypicalAngina; 3: nonAnginalPain; 4: asymptomatic
Trestbps | 90–120: normal; 120–140: unusual; 140–160: high; > 160: very high
Cholesterol (chol) | 110–200: normal; 200–240: borderline_high; 240–250: high; > 250: very high
Fbs | True; False
Restecg | 0: normal; 1: STTWaveAbnormality; 2: showingProbable
Thalach | 60–100: Normal; > 100: Tachycardia
Exang | Yes; No
Oldpeak | Zero; greaterThanZero
Slope | 1: Upsloping; 2: Flat; 3: Downsloping
CA | Zero; One; Two; Three
Thal | 3: Normal; 6: Fixed; 7: Reversible
Output | 0: No Heart Disease; 1: Heart Disease

Source: Khare et al. [24]
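The numeric-to-nominal mapping of Table 4 can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' Weka workflow: the function names and the input dictionary keys are hypothetical, and because adjacent ranges in Table 4 share their endpoints (for example 120 appears in both "normal" and "unusual"), assigning a boundary value to the lower range is an assumption. Values below the lowest listed bound are also assumed to fall in the first range.

```python
def bin_age(age):
    """Age ranges from Table 4."""
    if age <= 40:
        return "lessThanForty"
    if age <= 64:
        return "betweenAge"
    return "greaterThanSixtyFour"


def bin_by_upper_bounds(value, bounds, labels):
    """Return the label of the first range whose upper bound is >= value.

    Assumption: a value equal to a shared endpoint goes to the lower range.
    """
    for upper, label in zip(bounds, labels):
        if value <= upper:
            return label
    return labels[-1]  # above the last bound


def discretise(record):
    """Map one raw record (hypothetical dict layout) to nominal values."""
    return {
        "Age": bin_age(record["age"]),
        "Trestbps": bin_by_upper_bounds(
            record["trestbps"], [120, 140, 160],
            ["normal", "unusual", "high", "very high"]),
        "Chol": bin_by_upper_bounds(
            record["chol"], [200, 240, 250],
            ["normal", "borderline_high", "high", "very high"]),
        "Thalach": "Normal" if record["thalach"] <= 100 else "Tachycardia",
        "Oldpeak": "Zero" if record["oldpeak"] == 0 else "greaterThanZero",
        # Criticality levels 1 to 4 collapse into a single positive class.
        "class_HD": "No Heart Disease" if record["goal"] == 0 else "Heart Disease",
    }


example = {"age": 63, "trestbps": 145, "chol": 233,
           "thalach": 150, "oldpeak": 2.3, "goal": 2}
print(discretise(example))
```

Features that are already nominal in Table 4 (Sex, CP, Fbs, Restecg, Exang, Slope, CA, Thal) only need a code-to-label lookup and are omitted here.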


Feature Selection

Features were selected based on experiments conducted by Amin et al. [8], since they used the same dataset (UCI). They performed a set of experiments that dealt with 8100 combinations of features with 7 different classification models (K-NN, Decision Tree, Naïve Bayes, Logistic Regression, Support Vector Machine, Neural Network and Vote) to identify significant features. Table 5 shows how often each feature occurred in the highest-performing results of each classification model. Features with a total occurrence of more than 10 were selected as significant features. The selected 8 features are Sex, CP, Fbs, Exang, Oldpeak, Slope, CA, and Thal.
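Table 5

Selecting significant features from the result of the highest performance

Measure | Age | Sex | CP | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | CA | Thal
Occurrence in Highest Accuracy | 2 | 7 | 7 | 1 | 2 | 5 | 4 | 3 | 4 | 6 | 4 | 7 | 5
Occurrence in Highest F-Measure | 2 | 7 | 7 | 1 | 2 | 5 | 4 | 3 | 4 | 6 | 4 | 7 | 5
Occurrence in Highest Precision | 0 | 6 | 4 | 2 | 1 | 2 | 2 | 2 | 4 | 2 | 4 | 5 | 4
Total Occurrence | 4 | 20 | 18 | 4 | 5 | 12 | 10 | 8 | 12 | 14 | 12 | 19 | 14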

Source: Amin et al. [8]
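The selection rule described above (keep a feature when its total occurrence exceeds 10) can be illustrated with the totals from Table 5. The variable names are hypothetical and the snippet is only a sketch of the thresholding step, not of the underlying experiments by Amin et al. [8].

```python
# Total occurrence per feature, taken from the last row of Table 5.
total_occurrence = {
    "Age": 4, "Sex": 20, "CP": 18, "Trestbps": 4, "Chol": 5,
    "Fbs": 12, "Restecg": 10, "Thalach": 8, "Exang": 12,
    "Oldpeak": 14, "Slope": 12, "CA": 19, "Thal": 14,
}

THRESHOLD = 10  # a feature must appear MORE than 10 times to be kept

selected = [name for name, count in total_occurrence.items() if count > THRESHOLD]
print(selected)
```

Restecg, with exactly 10 occurrences, falls just below the cut because the comparison is strict.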


Feature weight computation

This section explains how the weight of the features was calculated. The fundamental premise of WARM is that different features in a dataset have different importance in predicting heart disease. The weight of each feature ranges from 0 to 1. A weight closer to 1 indicates a more significant feature, whereas a weight closer to 0 indicates a less significant feature in heart disease prediction.

Calculate feature weight

The first step was to calculate the individual feature weights. Let R be the set of features, R = {n0, n1, n2, …, ni}, with n > 0. In this experiment, the total number of features is 13 and, after feature selection, it is reduced to 8 (Sex, CP, Fbs, Exang, Oldpeak, Slope, CA, and Thal). W(n) is the weight to be calculated for feature n. For example, the value of Sex as displayed in Table 5 is 20, and the total value of the significant features (Sex, CP, Fbs, Exang, Oldpeak, Slope, CA, and Thal) is 20 + 18 + 12 + 12 + 14 + 12 + 19 + 14 = 121. Thus, the weight of ‘sex’ (weight of features, WOF) is calculated as $W(\text{sex}) = 20 / 121 \approx 0.17$. Table 6 displays the calculated weights for each of the significant features. From the distribution of the weights, CA has the greatest strength, followed by Sex, CP, Oldpeak and Thal, while Fbs, Exang and Slope have the same weight of 0.09 each.
Table 6

Weight of the significant features

Feature | Weight
Sex | 0.17
CP | 0.15
Fbs | 0.09
Exang | 0.09
Oldpeak | 0.12
Slope | 0.09
CA | 0.18
Thal | 0.11
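As a worked illustration of the normalisation step, the sketch below divides each selected feature's total occurrence (Table 5) by the sum over the selected features. It reproduces the worked value for Sex (20 out of 121); shares computed this way need not match every published entry in Table 6, so it should be read as an interpretation of the description rather than a reproduction of the authors' exact computation. The variable names are hypothetical.

```python
# Total occurrence of the 8 selected features (Table 5).
selected_totals = {
    "Sex": 20, "CP": 18, "Fbs": 12, "Exang": 12,
    "Oldpeak": 14, "Slope": 12, "CA": 19, "Thal": 14,
}

grand_total = sum(selected_totals.values())  # 20 + 18 + ... + 14

# Assumed reading of W(n): a feature's share of the grand total.
feature_weight = {name: count / grand_total
                  for name, count in selected_totals.items()}

for name, weight in feature_weight.items():
    print(f"{name}: {weight:.3f}")
```

Because the weights are shares of one total, they sum to 1 and each lies between 0 and 1, as the text requires.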

Calculate feature value weight

This section explains how the weights of feature values are computed. Feature values are all the values that a feature can take. For instance, the feature values for sex are male and female. Let A be the number of records with a given feature value in the dataset and (A ∪ B) be the total number of records. Table 7 shows the total sub value of each feature based on the UCI dataset. The value Male is represented by 203 records and Female by 94 records, which gives a total of 297 records from the UCI dataset. To calculate the weight of each feature value, let A be the selected value and B be the remaining values of that feature.
Table 7

Identify total sub value of each feature

Total | Male | Female
297 | 203 | 94

Figure 2 shows the comparison of the percentage of males and females in the Cleveland heart disease dataset.
Fig. 2

Comparison on the percentage of male and female in Cleveland heart disease dataset
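One reading of the definitions of A and (A ∪ B), assumed here because the source formula is not reproduced in this text, is that the weight of a feature value is its share of the records: W(value) = A / (A ∪ B). The sketch below applies that assumed ratio to the counts in Table 7; the names are hypothetical.

```python
# Record counts per value of the feature Sex (Table 7).
value_counts = {"Male": 203, "Female": 94}

total_records = sum(value_counts.values())  # A ∪ B for this feature

# Assumed formula: share of records carrying each value.
value_weight = {value: count / total_records
                for value, count in value_counts.items()}

for value, weight in value_weight.items():
    print(f"{value}: {weight:.2%}")
```

These shares correspond to the male and female percentages compared in Fig. 2.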


Calculate total weight for feature

This section explains how the total weight for features is computed. The feature weight (W (n)) and the feature value weight (W (value)) together give the total weight (W (t)) for the feature. The computation is shown below. Example of calculating the total weight of feature W (t):

Algorithm

This section details the algorithm used to obtain the weighted score of each feature in predicting heart disease. The algorithm is stated as follows:

Apply WARM

Not all features in the heart disease dataset have the same level of significance in predicting the risk of heart disease. Thus, different weights based on their prediction capability are assigned. These values are then imported into Weka 3.8 to experiment with WARM using Apriori Algorithm.

Apriori algorithm

The Apriori algorithm, first proposed by Agrawal and Srikant [2], is a well-known approach for mining association rules and is the one used here for WARM. The algorithm takes a dataset of transactions and constructs the frequent item sets, that is, those whose support meets at least a user-specified threshold. Apriori relies on the property that an item set X of length k can be frequent only if every subset of X of length k − 1 is also frequent. This property substantially reduces the search space and allows rule discovery in a computationally feasible time. From a frequent item set f and a subset s of f, Apriori generates a rule of the form s ⇒ (f − s) if and only if the confidence of the rule is above the user-defined threshold. Confidence is essentially the accuracy of the rule and is used in Apriori to rank the rules (Agrawal & Srikant [2]; Mutter et al. [51]).
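The level-wise search and the subset-based pruning described above can be sketched as follows. This is a plain, unweighted Apriori written for illustration, not the Weka implementation used in the study; the function names, the toy transactions and the support threshold are all assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: sup for s, sup in current.items() if sup >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        current = {c: support(c) for c in candidates}
        current = {c: sup for c, sup in current.items() if sup >= min_support}
        frequent.update(current)
        k += 1
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent => consequent from stored supports."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Toy records (hypothetical), each a set of "feature=value" items.
records = [
    {"CP=asymptomatic", "Exang=Yes", "HD=Yes"},
    {"CP=asymptomatic", "Exang=Yes", "HD=Yes"},
    {"CP=asymptomatic", "Exang=No", "HD=No"},
    {"CP=typicalAngina", "Exang=No", "HD=No"},
]
frequent = apriori(records, min_support=0.5)
print(confidence(frequent, {"Exang=Yes"}, {"HD=Yes"}))
```

Rules would then be kept or discarded by comparing this confidence with the user-defined threshold.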

Weighted confidence

The confidence level is used to show how often a rule is found to be true. Let Y be the ‘goal’; the weighted confidence of a rule X → Y is then calculated as the ratio of the weighted support of (X ∪ Y) over the weighted support of (X). For instance, the rule {sex = Male, CA = 3} → {heart disease} has a confidence of 0.2/0.2 = 1.0. This means that a male patient with 3 major vessels coloured by fluoroscopy (CA) has a 100% chance of having heart disease.
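The ratio above can be made concrete with a small sketch. The text does not spell out how weighted support is computed, so the definition used here (the summed weight of the records containing an item set, divided by the summed weight of all records) is an assumption, as are the function names, the toy records and their weights. The weights are chosen so that both supports equal 0.2, mirroring the worked example.

```python
def weighted_support(itemset, records):
    """Assumed definition: weight of matching records over total weight.

    `records` is a list of (items, weight) pairs; the weights are hypothetical.
    """
    total = sum(weight for _, weight in records)
    matched = sum(weight for items, weight in records if itemset <= items)
    return matched / total


def weighted_confidence(antecedent, consequent, records):
    """Weighted support of (X ∪ Y) divided by weighted support of X."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    return weighted_support(xy, records) / weighted_support(x, records)


records = [
    (frozenset({"Sex=Male", "CA=three", "HD=Yes"}), 0.2),
    (frozenset({"Sex=Male", "CA=zero", "HD=No"}), 0.5),
    (frozenset({"Sex=Female", "CA=zero", "HD=No"}), 0.3),
]

print(weighted_confidence({"Sex=Male", "CA=three"}, {"HD=Yes"}, records))
```

Every record that matches the antecedent also carries the consequent, so the two weighted supports coincide and the confidence is 1.0.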

Evaluation

This phase generates rules based on the Apriori algorithm in Weighted Associative Rule Mining. Two sets of rules and confidence scores were generated: one for all features (all 13 features) and one for the selected significant features (8 features). The following section provides a detailed explanation of the results obtained, namely the rules and confidence scores.

Results (rules and confidence level generated)

The rules and confidence level generated for all the (13) features and the selected significant features (8) are shown in this section.

All features

Table 8 shows the top 20 rules and confidence scores obtained for all the features using WARM. The rules were sorted by the highest confidence scores.
Table 8

Rules generated from all the features using WARM

No | Rules | Confidence
1 | Trestbps = unusual Thalach = Tachycardia Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
2 | Trestbps = unusual Fbs = FALSE Thalach = Tachycardia Exang = No CA = zero Thal = normal 52 ==> class_HD = No Heart Disease | 0.96
3 | Sex = Female Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
4 | Sex = Female Thalach = Tachycardia Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
5 | Sex = Female Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
6 | Age = betweenAge Trestbps = unusual Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
7 | CP = asymptomatic Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.96
8 | Sex = Female Fbs = FALSE Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
9 | Sex = Female Fbs = FALSE Thalach = Tachycardia Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
10 | Sex = Female Thalach = Tachycardia Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
11 | Age = betweenAge Trestbps = unusual Thalach = Tachycardia Exang = No CA = zero Thal = normal 48 ==> class_HD = No Heart Disease | 0.96
12 | Trestbps = unusual Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.95
13 | Trestbps = unusual Fbs = FALSE Thalach = Tachycardia CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.95
14 | Age = betweenAge CP = asymptomatic Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.95
15 | Restecg = normal Thalach = Tachycardia Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.95
16 | CP = asymptomatic Fbs = FALSE Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.94
17 | Trestbps = unusual Fbs = FALSE Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.94
18 | CP = asymptomatic Exang = Yes Thal = reversable ==> class_HD = Heart Disease | 0.94
19 | Sex = Male CP = asymptomatic Exang = Yes Oldpeak = greaterThanZero ==> class_HD = Heart Disease | 0.94
20 | Age = betweenAge CP = asymptomatic Thalach = Tachycardia Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.94
The highest confidence level achieved for predicting the risk of having heart disease is 96%, and the rule that achieves it uses 3 features (CP, Slope and Thal). This can be seen in Table 8 (Rule 7). The rule states that if the value of Chest Pain (CP) is asymptomatic, the slope is flat and the value of Thallium (Thal) is reversible, then the patient has a very high tendency (confidence level = 96%) of having the risk of heart disease. The rules in Table 8 that conclude class_HD = Heart Disease (Rules 7, 14, 16, 18, 19 and 20) are the ones that contributed to the prediction of the risk of having heart disease. Table 9 summarizes the frequency of each feature used in these rules; it shows the rule number and the features used in each rule. Of the top 20 rules, only 6 predict heart disease; the others are non-sick rules that predict no heart disease.
Table 9

Summary of frequency of each feature contained in the rules that predict heart disease (all features)

Rule numbers: 7, 14, 16, 18, 19, 20

Total rules | CP | Slope | Thal | Age | OldPeak | Fbs | Exang | Sex | Thalach
6 | 6 | 1 | 4 | 2 | 4 | 1 | 2 | 1 | 1
Although all 13 features were used for rule and confidence score generation as shown in Table 8, only 9 features appear in the rules that predict heart disease among the top 20 rules. The most significant feature in predicting heart disease is CP, which exists in all 6 rules that predict heart disease. Thal and Oldpeak each exist in 4 of the 6 rules.

Selected significant features

This section focuses on the rules and confidence scores obtained with the selected significant features. Table 10 shows the top 20 rules generated from the significant features using WARM. The highest confidence score obtained in predicting the risk of having heart disease using the 8 selected significant features is comparatively high, at 98%. It is achieved by Rule 3 in Table 10.
Table 10

Rules generated from 8 significant features using weighted associative rule mining

No | Rules | Confidence
1 | Sex = Female CP = nonAnginalPain Thal = normal ==> class_HD = No Heart Disease | 1
2 | Sex = Female Exang = No Oldpeak = greaterThanZero CA = zero ==> class_HD = No Heart Disease | 1
3 | CP = asymptomatic Exang = Yes Oldpeak = greaterThanZero Thal = reversible ==> class_HD = Heart Disease | 0.98
4 | Sex = Male CP = asymptomatic Exang = Yes Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.97
5 | CP = asymptomatic Fbs = FALSE Exang = Yes Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.97
6 | Sex = Female CP = nonAnginalPain ==> class_HD = No Heart Disease | 0.97
7 | Sex = Female Fbs = FALSE Exang = No Oldpeak = greaterThanZero Thal = normal ==> class_HD = No Heart Disease | 0.97
8 | Sex = Male CP = asymptomatic CA = one ==> class_HD = Heart Disease | 0.97
9 | Sex = Female CP = nonAnginalPain Exang = No ==> class_HD = No Heart Disease | 0.97
10 | CP = asymptomatic Exang = Yes Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.97
11 | CP = asymptomatic Exang = Yes Oldpeak = greaterThanZero Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.97
12 | Sex = Male CP = asymptomatic Fbs = FALSE Exang = Yes Oldpeak = greaterThanZero Thal = reversable ==> class_HD = Heart Disease | 0.97
13 | Sex = Female Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
14 | Sex = Female Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
15 | CP = asymptomatic Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.96
16 | Sex = Female Fbs = FALSE Exang = No CA = zero ==> class_HD = No Heart Disease | 0.96
17 | CP = asymptomatic Oldpeak = greaterThanZero Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.96
18 | Sex = Female Fbs = FALSE Exang = No CA = zero Thal = normal ==> class_HD = No Heart Disease | 0.96
19 | CP = asymptomatic Fbs = FALSE Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.95
20 | CP = asymptomatic Fbs = FALSE Oldpeak = greaterThanZero Slope = flat Thal = reversable ==> class_HD = Heart Disease | 0.95
The rule with the top confidence score for predicting heart disease is CP = asymptomatic, Exang = Yes, Oldpeak = greaterThanZero, Thal = reversible ==> class_HD = Heart Disease. This means that if Chest Pain (CP) is asymptomatic, exercise-induced angina (Exang) is present, Oldpeak (ST depression induced by exercise relative to rest) is greater than zero and the Thallium heart scan (Thal) is reversible, then the patient is predicted as having heart disease. Of the top 20 rules generated using the significant features, 11 predict the presence of heart disease (Rules 3, 4, 5, 8, 10, 11, 12, 15, 17, 19 and 20 in Table 10). Table 11 summarizes the frequency of each feature contained in these rules. The most significant feature, which exists in all the rules that predict heart disease, is Chest pain (CP). Thallium heart scan (Thal) is seen in 9 out of 11 rules and Oldpeak (ST depression induced by exercise relative to rest) is seen in 7 rules.
Table 11

Summary of the frequency of each feature contained in the rules that predict heart disease (8 selected features)

Rule numbers (rules that predict heart disease): 3, 4, 5, 8, 10, 11, 12, 15, 17, 19, 20

Features | CP | Slope | Thal | OldPeak | Fbs | Exang | Sex | CA
Total rules (11) | 11 | 6 | 9 | 7 | 4 | 6 | 3 | 1
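The frequency count summarized in Table 11 can be illustrated with a minimal Python sketch. It tallies, for each feature, the number of rules whose antecedent contains that feature. The sketch uses only the heart disease rules numbered 11 to 20 as transcribed from the rows listed above, so its counts cover 6 of the 11 rules and will not equal the totals in Table 11. The names `sick_rules` and `feature_frequency` are hypothetical and are not taken from the study.

```python
from collections import Counter

# Antecedent features of the heart disease rules numbered 11-20,
# transcribed from the rule rows above. Rules 3, 4, 5, 8 and 10 are
# not included, so this is a partial illustration only.
sick_rules = {
    11: ["CP", "Exang", "Oldpeak", "Slope", "Thal"],
    12: ["Sex", "CP", "Fbs", "Exang", "Oldpeak", "Thal"],
    15: ["CP", "Slope", "Thal"],
    17: ["CP", "Oldpeak", "Slope", "Thal"],
    19: ["CP", "Fbs", "Slope", "Thal"],
    20: ["CP", "Fbs", "Oldpeak", "Slope", "Thal"],
}


def feature_frequency(rules):
    """Count in how many rules each feature appears (hypothetical helper)."""
    counts = Counter()
    for features in rules.values():
        # Use a set so a feature is counted at most once per rule.
        counts.update(set(features))
    return counts


freq = feature_frequency(sick_rules)
for feature, n in freq.most_common():
    print(f"{feature}: {n} of {len(sick_rules)} rules")
```

Within this subset, CP and Thal appear in every rule, which is in line with the observation that CP appears in all rules that predict heart disease.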

Discussions

Applying WARM to the selected significant features achieved the highest confidence score in predicting heart disease, 98%, compared with 96% obtained from all features. This indicates that WARM predicts the risk of having heart disease well. From the top 20 rules generated, only 6 rules were based on all features, whereas 11 of the top 20 rules were based on the selected 8 features. Studying the top 20 rules generated revealed the following findings, which were validated by a cardiologist:
- Asymptomatic chest pain, positive exercise-induced angina, Oldpeak > 0 and a reversible thallium heart scan imply the presence of heart disease: CP = asymptomatic, Exang = Yes, Oldpeak = greaterThanZero, Thal = reversible ==> class_HD = Heart Disease.
- Asymptomatic chest pain is one of the most important features, as it appears in all the rules generated for detecting heart disease.
- Reversible thallium heart scan and Oldpeak greater than zero are positively correlated with heart disease.
- Males are more prone to heart disease than females, as the sick rules that include sex state it as male, whereas the healthy rules state it as female.
- There is a strong negative correlation between CA and Thal for heart disease prediction.
- The most common features in healthy rules are Sex = Female, Exang (exercise-induced angina) = No and CA (number of major vessels coloured by fluoroscopy) = zero. A patient is predicted not to have heart disease if the patient is female, angina is not induced by exercise and no major vessels are coloured by fluoroscopy.
- Slope is not featured in any of the healthy rules.

This study determined the processes involved in obtaining significant features and devised a scoring mechanism to obtain the strength of each feature. This enables the correct weight to be imposed on each of the significant features used in WARM for predicting heart disease.
The confidence score obtained in this study is the highest obtained in heart disease prediction using WARM based on the UCI dataset. This study can be used as a guide for computing the strength scores of significant features found in other heart disease datasets.
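To make the reading of such a rule concrete, the following minimal Python sketch checks whether a patient record satisfies the antecedent of the highest-confidence rule and computes the standard, unweighted confidence of the rule, that is, the fraction of records matching the antecedent that also match the consequent. This is a sketch under stated assumptions: the toy records, the value label `zero` for Oldpeak, and the helper names `matches` and `confidence` are hypothetical, and the sketch does not apply the feature weights used in WARM.

```python
# Highest-confidence rule reported for the 8 selected features.
RULE_ANTECEDENT = {
    "CP": "asymptomatic",
    "Exang": "Yes",
    "Oldpeak": "greaterThanZero",
    "Thal": "reversible",
}
RULE_CONSEQUENT = ("class_HD", "Heart Disease")


def matches(record, antecedent):
    """Return True if the record has every attribute value in the antecedent."""
    return all(record.get(key) == value for key, value in antecedent.items())


def confidence(records, antecedent, consequent):
    """Standard (unweighted) confidence: P(consequent | antecedent)."""
    key, value = consequent
    covered = [r for r in records if matches(r, antecedent)]
    if not covered:
        return 0.0  # the rule covers no record
    hits = sum(1 for r in covered if r.get(key) == value)
    return hits / len(covered)


# Hypothetical toy records, not taken from the UCI dataset.
sick_profile = dict(RULE_ANTECEDENT)
records = [
    {**sick_profile, "class_HD": "Heart Disease"},
    {**sick_profile, "class_HD": "Heart Disease"},
    {**sick_profile, "class_HD": "Heart Disease"},
    {**sick_profile, "class_HD": "No Heart Disease"},
    {"CP": "nonAnginalPain", "Exang": "No", "Oldpeak": "zero",
     "Thal": "normal", "class_HD": "No Heart Disease"},
]

rule_confidence = confidence(records, RULE_ANTECEDENT, RULE_CONSEQUENT)
print(f"confidence = {rule_confidence:.2f}")
```

In this toy data, four records satisfy the antecedent and three of them have heart disease, so the unweighted confidence is 3/4.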

Comparative analysis with existing work

This section compares the proposed work with existing works that use WARM. The results show that imposing weighted scores on WARM for the 8 significant features gives the highest confidence score, 98%, among the existing studies. Figure 3 shows the confidence scores of the existing studies on WARM that used the UCI Cleveland heart disease dataset in comparison with the proposed work. Both experiments, with all features and with the significant features, achieved considerably higher confidence scores than previous studies. Using the significant feature scores in WARM provides the highest confidence, 98%, in predicting heart disease.
Fig. 3

Result comparison on WARM using UCI Cleveland heart disease dataset

Table 12 presents a comparative analysis of WARM using significant features versus existing results of ARM in heart disease prediction. The rules with the highest confidence scores were retrieved and compared in this table. Said et al. [41] and Khare and Gupta [24] reported lower confidence scores than this research. Although Sonet et al. [45] obtained a confidence score of 99%, the rule generated for this score is questionable. The rule states that if a patient has diabetes, then the patient will have heart disease. Although the risk of heart disease is known to be higher in diabetic patients, this rule cannot be generalized to all diabetic patients. It may be the result of bias in their dataset, which was collected from 4 different medical institutions, contains a total of 131 records and is not an open dataset. In addition, their dataset contains different features from the dataset used in this study.
Table 12

Comparative analysis of weighted associative rule mining and associative rule mining in predicting heart disease

Research | Confidence Score (%) | Rules | No. of attributes in highest confidence rule | Technique | Dataset
Nahar et al. [36] | 96 | Chest_Pain_Type = asympt, Slope = flat, Thal = rev | 3 | ARM | UCI
Said et al. [41] | 91 | Chest Pain Type = asymptomic and Thal = reversible defect | 2 | ARM | UCI
Khare and Gupta [24] | 94 | Thal = reversible_defect, CP = asymptomatic, Exercise_Induced_Angina = yes | 3 | ARM | UCI
Sonet et al. [45] | 97 | Lack-of-Exercise = yes, Stress = yes, BP = high, Smoking = yes, Diabetes = yes | 5 | ARM | Data collected from 4 medical institutions (131 records)
Sonet et al. [45] | 99 | Diabetes | 1 | ARM | Data collected from 4 medical institutions (131 records)
Soni and Vyas [48] | 79.5 | NA | NA | WARM | UCI
Soni et al. [46] | 80 | NA | NA | WARM | UCI
Sundar et al. [50] | 84 | NA | NA | WARM | UCI
Ibrahim & Sivabalakrishnan [18] | 67 | 70..79 -> yes | 1 | WARM | UCI
Our Experiment (all features) | 96 | CP = asymptomatic, Slope = flat, Thal = reversable | 3 | WARM | UCI
Our Experiment (8 significant features) | 98 | CP = asymptomatic, Exang = Yes, Oldpeak = greaterThanZero, Thal = reversible | 4 | WARM | UCI
This study also benchmarked the rules generated from the UCI dataset in past research against the rules generated in our study. The extracted healthy rules are shown in Table 13 and the sick rules in Table 14. Table 13 shows that our experiment with 8 significant features obtained the highest confidence score, 100%, for healthy rules. This rule states that if the patient is female, chest pain is non-anginal and the thallium heart scan is normal, the person is predicted not to have heart disease.
Table 13

Healthy rules extractions

Research | Rules | Confidence Score (%)
Nahar et al. [36] | Sex = female, Exercise_induced_angina = fal, Number_of_vessels_colored = 0, Thal = nom | 98
Said et al. [41] | Sex = female and Exercise_induced_angina = No and Thal = normal | 89
Khare et al. [24] | Ca = 0, Thal = normal, Exercise_induced_angina = no | 90
Proposed work (with all features) | Trestbps = unusual, Thalach = Tachycardia, Exang = No, CA = zero, Thal = normal | 96
Proposed work (with significant features) | Sex = Female, CP = nonAnginalPain, Thal = normal | 100
Table 14

Sick rules extractions

Research | Rules | Confidence Score (%)
Nahar et al. [36] | Chest_pain_type = asympt, Slope = flat, Thal = rev | 96
Said et al. [41] | Chest pain type = asymptomic and Thal = reversible defect | 91
Khare et al. [24] | Thal = reversible_defect, CP = asymptomatic, Exercise_induced_angina = yes | 94
Ibrahim and Sivabalakrishnan [19] | 70..79 -> yes | 67
Proposed work (all features) | CP = asymptomatic, Slope = flat, Thal = reversable | 96
Proposed work (8 significant features) | CP = asymptomatic, Exang = Yes, Oldpeak = greaterThanZero, Thal = reversible | 98
Table 14 shows the sick rules with the highest confidence scores from this research in comparison with other research on ARM and WARM for heart disease prediction. This study achieved a confidence score of 98%, which is higher than that of all the other sick rules. To the best of our knowledge, the weighted scores of the significant features in our study outperform the scores obtained by all other research using ARM and WARM to predict heart disease.

Conclusion

This research obtained the highest confidence score for heart disease prediction by using significant features in WARM. Assigning appropriate weight scores was shown to improve the confidence level of the prediction. A set of significant features, with different weights representing the strength of each feature, was used in heart disease prediction. To the best of our knowledge, this is the first study that made use of significant features in executing WARM. This research has also contributed a list of the top rules for predicting heart disease based on the UCI dataset, and it is the first to benchmark the healthy rules and sick rules with the highest confidence scores. Future research may look into predicting the risk levels of heart disease, as this will help medical practitioners and patients to gauge heart disease severity. The algorithm used in this study for measuring weight can be further explored with other datasets and other prediction models that use a weighted approach. The machine learning techniques used in the feature selection phase of this research are limited to the most popular techniques used in heart disease prediction research. Future researchers should explore other machine learning techniques for selecting the significant features.