BACKGROUND: Large volumes of data on many diseases are now collected in the medical sciences. By exploring these data, physicians can gain new insights into diseases and how to manage them. This study was performed to predict stroke incidence. METHODS: This study was carried out in Esfahan Al-Zahra and Mashhad Ghaem hospitals during 2010-2011. Information on 807 healthy and sick subjects was collected using a standard checklist containing 50 risk factors for stroke, such as history of cardiovascular disease, diabetes, hyperlipidemia, smoking and alcohol consumption. The data were analyzed with two data mining techniques, K-nearest neighbor and the C4.5 decision tree, using WEKA. RESULTS: The accuracy of the C4.5 decision tree algorithm and K-nearest neighbor in predicting stroke was 95.42% and 94.18%, respectively. CONCLUSIONS: Both algorithms, the C4.5 decision tree and K-nearest neighbor, can be used to predict stroke in high-risk groups.
Keywords:
Data mining; K-nearest neighbor; decision tree; prediction; stroke
Studies of more than 56 million deaths in 2001 found that 7.1 million were due to heart disease and 5.4 million were due to stroke.[1] This indicates that stroke, after heart disease, is the second major cause of death in the world, accounting for nearly 10% of all reported deaths. Stroke is the third leading cause of death in the United States, and about 137,000 Americans die of it each year. In 2006, 6 out of every 10 deaths from stroke occurred in women.[2] In the United States, someone suffers a stroke every 40 seconds, and every 3–4 minutes someone dies from stroke. The cost of this disease in America alone has been estimated at about 73.7 million dollars.[3] Stroke is one of the major causes of disability in the world. According to reports published in 2005, close to 1.1 million people have survived a stroke but live with some limitations in their daily activities.[4] In Iran, research conducted on a population of 450,229 people in the city of Mashhad showed that stroke occurred nearly a decade earlier than in Western countries and that the incidence rate was also higher than in most of them.[5]
Most studies on automated diagnosis of stroke and its subtypes have relied on image processing techniques applied to computerized tomographic scans and magnetic resonance imaging.[6-8] For example, computerized tomographic scan images have been used for diagnosis of stroke and its subtypes. After image enhancement and noise reduction, the skull line of symmetry is determined and a histogram is then created for each brain hemisphere. Hemorrhagic and chronic stroke are distinguished using the histograms. Wavelet features were used to distinguish acute stroke images from normal images.[6] The Precision and Recall obtained were 90% and 100%, respectively.[6] In another study, a classification rule mining algorithm was used to analyze data from stroke patients. 
The NGTS (New General-To-Specific) algorithm, which is a sequential covering algorithm for extracting classification rules, was applied to 162 samples. The total number of extracted rules was 84%, and a classification accuracy of 84.8% was achieved.[9] The T3 algorithm, which builds a decision tree with a maximum depth of 3, has been investigated for constructing a decision tree from the data of stroke patients. A comparison with the C4.5 algorithm showed that the T3 algorithm was more accurate on both the training and test data sets and displayed better performance overall. The data set contained 795 records and 37 attributes per record. The best classification error for the T3 algorithm was 0.4%, whereas the best value for the C4.5 algorithm was 33.6%.[10]
Several factors play a role in stroke incidence, including heredity, age, gender and race, as well as certain medical conditions such as high blood pressure, hypercholesterolemia, heart disease and diabetes. Being overweight and having a past history of stroke can also increase the risk of stroke. Not smoking, abstaining from alcohol and daily activity can help reduce the risk of stroke. Using these risk factors and data mining techniques, a decision support system can be designed that, alongside the knowledge and experience of a physician, can be used to predict stroke. Given the human need for knowledge and the increasing volume of data, the development of techniques for automated extraction of knowledge from these data is inevitable. 
Data mining is the extraction of knowledge and interesting patterns from a large volume of data.[11] Based on the kind of knowledge that can be extracted, data mining techniques are divided into three major groups: pattern classification, data clustering and association rule mining.
In view of these findings, and with an emphasis on predicting stroke incidence in order to reduce complications, disabilities and healthcare costs, this study aimed to investigate 50 risk factors for stroke. After data collection, pre-processing and cleaning, the data mining software WEKA 3.6 was used to analyze the data with the C4.5 algorithm (version 8) and the K-nearest neighbor algorithm.
METHODS
In pattern classification, which is the approach used in this article, a class label is assigned to each data sample based on a set of attributes.[12] Classification is a two-step process. In the first phase, called the Learning Phase, a classification algorithm builds a model by analyzing a training data set that describes a set of predefined class labels and concepts. In the second phase, called the Test Phase, the classification accuracy of the model is measured using a test data set.
This investigation lasted from August 1387 to March 1389 (Solar Hijri calendar). First, sources and texts on data mining were studied in order to extract the relevant concepts, structures and algorithms. Then 50 risk factors that affect stroke incidence were recorded for a healthy population and a population with stroke. A total of 807 checklists were collected, and the samples were entered into Excel files.
Some records contained unspecified values, so the following techniques were used.[11]
Use of average attribute values: the mean value of a feature is used to fill the unspecified values of that feature. This method is commonly used for continuous data such as height or weight.
Use of class-specific average values: an unspecified value is filled with the mean of that feature computed over the records that belong to the same class as the record with the unspecified value. This method is also used for continuous features.
After that, the data mining software WEKA 3.6 was used for data analysis. The J48 and IBk classifiers, which are WEKA's Java implementations of version 8 of the C4.5 algorithm and of the K-nearest neighbor algorithm, respectively, were applied to the data.[13]
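The class-specific mean imputation described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually run in the study: the function name `impute_class_mean`, the toy records and the use of `None` to mark unspecified values are all assumptions made for the example.

```python
from collections import defaultdict

def impute_class_mean(rows, labels, column):
    """Return copies of `rows` in which None entries of `column` are
    replaced by the mean of that column over rows with the same label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate per-class sums and counts of known values.
    for row, label in zip(rows, labels):
        value = row[column]
        if value is not None:
            sums[label] += value
            counts[label] += 1
    # Second pass: fill the gaps, leaving the input rows untouched.
    filled = []
    for row, label in zip(rows, labels):
        new_row = list(row)
        if new_row[column] is None and counts[label] > 0:
            new_row[column] = sums[label] / counts[label]
        filled.append(new_row)
    return filled

# Hypothetical records: [age, body mass index]; None marks an unspecified value.
records = [[60, 30.0], [70, None], [40, 22.0], [50, 24.0], [45, None]]
classes = ["stroke", "stroke", "healthy", "healthy", "healthy"]
completed = impute_class_mean(records, classes, column=1)
```

In this toy data the missing value in the "stroke" record is filled from the single known "stroke" value, and the missing value in the "healthy" record is filled with the average of the two known "healthy" values. Replacing the per-class sums with a single overall sum would give the first technique (the plain attribute mean).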
RESULTS
The Sensitivity, Specificity, Precision and Accuracy of the two algorithms, K-nearest neighbor and C4.5, are shown for different values in Table 1.
Table 1
Results of the classification algorithms using data sets on stroke
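For readers unfamiliar with these four criteria, the sketch below computes them from the counts of a two-class confusion matrix using their standard definitions. The counts are hypothetical and are not taken from Table 1; the function name `classification_metrics` is likewise an assumption made for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute the four criteria from confusion-matrix counts.

    tp: stroke cases predicted as stroke   fn: stroke cases predicted as healthy
    fp: healthy cases predicted as stroke  tn: healthy cases predicted as healthy
    """
    return {
        "sensitivity": tp / (tp + fn),            # share of true stroke cases found
        "specificity": tn / (tn + fp),            # share of healthy cases recognized
        "precision": tp / (tp + fp),              # share of stroke predictions that are right
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=90, fn=10, fp=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that Sensitivity and Precision share the same numerator but different denominators, which is why extra false positives lower Precision without affecting Sensitivity.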
In addition, the Accuracy, Sensitivity and Precision criteria of the two methods are compared in Figure 1.
Figure 1
Comparison of the C4.5 and KNN algorithms based on the Accuracy, Sensitivity and Precision criteria
The Specificity criterion is shown in Figure 2.
Figure 2
Comparison of the C4.5 and KNN algorithms based on the Specificity criterion
The figures show that for the K-nearest neighbor algorithm, Specificity and Precision decreased as K increased, and Accuracy decreased slightly. In addition, as K increased, Sensitivity increased slightly: the number of correctly diagnosed patients rose, but the number of false-positive patients rose as well, which reduced Precision. The best results were obtained with the C4.5 algorithm, which outperformed the K-nearest neighbor algorithm in the Accuracy, Precision and Specificity criteria by a small margin.
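To make the role of K concrete, the following is a minimal K-nearest neighbor sketch in plain Python. It is not WEKA's IBk implementation, which was the one used in this study; the function name `knn_predict`, the Euclidean distance and the two-feature toy data are assumptions chosen only to show that the predicted class for the same query can change when K changes.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    # Rank training points by Euclidean distance to the query.
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples (values are illustrative, not study data).
train_x = [(1.0, 1.0), (1.2, 0.8), (2.0, 2.1), (3.0, 3.0), (3.2, 2.9), (2.9, 3.3)]
train_y = ["stroke", "stroke", "stroke", "healthy", "healthy", "healthy"]
query = (2.1, 2.0)

label_k1 = knn_predict(train_x, train_y, query, k=1)
label_k3 = knn_predict(train_x, train_y, query, k=3)
print(label_k1, label_k3)
```

With K = 1 the query takes the label of its single closest neighbor, while with K = 3 two slightly more distant neighbors of the other class outvote it. Shifts of this kind in individual predictions are what move Sensitivity, Specificity and Precision as K grows.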
DISCUSSION
Most studies on stroke diagnosis and the differentiation of its subtypes have focused on image-processing techniques, and no studies have been conducted using the C4.5 decision tree and K-nearest neighbor methods. In this study, 50 risk factors were extracted, such as gender, age, hours of activity, sleep duration, body mass index, hypertension, diabetes, hyperlipidemia, smoking, alcohol, narcotics, stimulants and other risk factors that have not been considered previously. The C4.5 and K-nearest neighbor algorithms in the data mining software WEKA 3.6 were then used to analyze the stroke data.[14]
After applying the algorithms and comparing the accuracy of the decision tree and K-nearest neighbor methods, the best results were obtained with the C4.5 algorithm, which outperformed the K-nearest neighbor algorithm in the Accuracy, Precision and Specificity criteria by a small margin. In medical diagnosis systems, even a small difference in classification matters, since correct prediction of illness is vital. The decision tree was therefore selected as the stroke-predicting algorithm because of its higher accuracy.
The performance of the algorithms used, especially in Sensitivity and Accuracy of classification, was over 90%, showing that data mining techniques can complement the knowledge and experience of clinicians in the diagnosis and management of patients with stroke.
Using feature selection techniques to reduce the dimension of the data, and determining the effect of each feature on stroke incidence, are very important from both the computational and the medical points of view. The effect of dimension reduction on classification accuracy and other performance criteria should also be examined. A decision support system that not only diagnoses the type of stroke (hemorrhagic or ischemic) but also predicts the disease would be very valuable. 
In general, if decision support systems can be designed that accurately determine the type and age of a stroke, in addition to predicting whether an individual will suffer a stroke in the future, many medical expenses will be saved. Because data mining is unfortunately still little known in our country, few people have done good research in this area. Most investigations are theoretical, and only one data mining conference is held in Iran. This knowledge and its applications, which can be used in different fields, will gradually find a place among professionals, but the field is still unfamiliar to medical specialists.
In Iran, one of the biggest problems with medical data mining is mistrust in the medical community, caused by a lack of knowledge of this science. For many people, the results of data mining are hard to believe. Obtaining interesting results from data mining requires data that are reliable and large in volume, yet many medical centers do not provide their patients' data to data mining teams. Security concerns, lack of confidence in the results of data mining and the desire of centers to keep their data exclusively for their own future studies were the major problems in our study. Furthermore, if the participants in this project were followed over several years to determine who goes on to suffer a stroke, more interesting results could be obtained.
CONCLUSIONS
Both algorithms, the C4.5 decision tree and K-nearest neighbor, can be used to predict stroke in high-risk groups.
Authors: Melonie Heron; Donna L Hoyert; Sherry L Murphy; Jiaquan Xu; Kenneth D Kochanek; Betzaida Tejada-Vera Journal: Natl Vital Stat Rep Date: 2009-04-17
Authors: A Przelaskowski; K Sklinda; P Bargieł; J Walecki; M Biesiadko-Matuszewska; M Kazubek Journal: Comput Biol Med Date: 2006-09-25 Impact factor: 4.589