Kedir Eyasu¹, Worku Jimma², Takele Tadesse²
¹ Department of Information Technology, Faculty of Engineering and Technology, Mettu University, Ethiopia
² Department of Information Science, Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma University, Ethiopia
Abstract
BACKGROUND: Diabetes is a disease that affects the body's ability to produce or use insulin. A total of 425 million people worldwide are living with diabetes. Of these, more than 16 million live in the Africa Region, a figure estimated to reach around 41 million by 2045. The main objective of this study was to design and develop a prototype knowledge-based system using data mining techniques for the diagnosis and treatment of diabetes.
METHODS: An experimental research design was employed. The researchers used domain expert knowledge to supplement data mining techniques, applying three classification algorithms in WEKA, namely J48, PART and JRip, and finally decided to use the results of the J48 classification algorithm. Microsoft Visual Studio Ultimate 2013 (VB.NET) was used to store the knowledge and to build the front end of the prototype. Common Lisp Prolog (Clisp) was used for back-end coding of the acquired knowledge.
RESULTS: Using the J48 decision tree algorithm, 2512 (95.1515%) of the instances were classified correctly and 128 (4.8485%) were classified incorrectly. The second-best model was generated by the JRip classifier. It scored 94.7348% accuracy in classifying the status of the diabetic patient datasets, classifying 2501 instances correctly.
CONCLUSION: The J48 model was the best-performing model, with the highest accuracy.
Introduction
Diabetes is one of the most widespread chronic diseases. It occurs when the body does not properly produce or respond to insulin, which is needed to maintain the proper level of sugar in the human body (1). According to International Diabetes Federation (IDF) estimates, about 366 million people were affected by type II diabetes in 2011, and the number may increase to 552 million worldwide by 2030. Almost 80% of people with diabetes live in middle- and low-income countries. According to the American Diabetes Association (ADA), diabetes imposes a significant economic burden on countries through healthcare expenditure: it accounted for 11% ($465 billion) of total healthcare expenses in the world in 2011, and this number is projected to exceed $595 billion by 2030 (2). This chronic disease requires oral medication, a controlled diet and physical exercise, and no comprehensive cure is available yet. Moreover, existing medical practice requires patients to see specialists for diagnosis and treatment.
A knowledge-based system (KBS) is a computer program that contains the knowledge and analytical skills of one or more human experts in a specific problem domain. It can be a promising way to reduce reliance on scarce human expertise and to reduce medical error because, as one of the specialized branches of AI, a KBS functions in a specific domain to offer sound decisions with reasoning (3). Studies indicate that shortages of skilled manpower and laboratory facilities contribute to poor diagnosis and treatment of people living with diabetes in Ethiopia (4). Automatic processing and exploration technology is therefore important for addressing such problems with a KBS. As part of the complex process of knowledge discovery in databases, data mining tries to find useful patterns in large amounts of data that have no obvious relationship between them. Such patterns can be the key knowledge used for developing an effective and efficient KBS. Thus, this study was initiated with the main objective of designing and developing a prototype knowledge-based system using data mining techniques for the diagnosis and treatment of diabetes.
Methods
Research design: Planning an experiment properly is very important in order to ensure that the right type of data and a sufficient sample size and power are available to answer the research questions of interest as clearly and efficiently as possible (5). In this study, an experimental research design was used for model building, investigation, prototype development and testing. The primary goal of an experimental design is to establish a causal connection between the variables, whereas the secondary goal is to extract the maximum amount of information with the minimum expenditure of resources. It is the process of planning a study to meet specified objectives.
Data mining was one of the methods used, i.e., to extract interesting and previously unknown knowledge or patterns from data sources. Data mining was the central point of the knowledge discovery in databases (KDD) process; it corresponds to the modeling step in both the KDD process and the CRISP-DM model, and it served as the automatic knowledge acquisition method. Manually gathered knowledge is fuzzy in nature; as a result, data mining techniques and algorithms were used to discover unknown knowledge from the preprocessed medical datasets of diabetic patients.
Study site and population: The study sites for this research were Jimma Medical Center, Saint Paul Millennium Medical Hospital and Adama Medical College Hospital. Twelve people were interviewed, and the data used for the research were collected from patients' records in the form of text. A purposive sampling technique was used to select the hospitals and the domain experts for knowledge acquisition.
Techniques and algorithms: The data mining process takes cleaned and transformed data as input, searches the data using different techniques and algorithms, and then outputs patterns and relationships to the interpretation/evaluation step of knowledge discovery from data (6).
Classification technique: Classification is a classic data mining technique, based on machine learning, that maps data into predefined groups or classes. It is a form of supervised learning, in which the classes are determined before the data are examined. Classification algorithms require that the classes be predefined based on data attribute values, and they often describe these classes by looking at the characteristics of data already known to belong to them (7). Rule-based classification extracts a set of rules that show relationships between the attributes of the dataset and the class label. For this study, three algorithms were used, namely pruned J48, PART and JRip. J48 decision trees are mainly used for classification and prediction and are a simple and powerful way of representing knowledge. PART is a rule-based classifier that uses a set of IF-THEN rules for classification, where an IF-THEN rule is an expression of the form IF condition THEN conclusion. JRip is also a rule-based classifier that uses a set of IF-THEN rules. Each experiment was conducted with the default parameters of WEKA, and each algorithm generated a model with rules.
KBS development tool: A KBS shell with ready-made utilities for self-learning, explanation and inference, such as Quincy Prolog, Visual Prolog, the rule-based CLIPS, Java Expert System Shell (JESS), GURU and Vidwan, is more specific and can be useful for developing a KBS. A KBS can also be developed using programming languages such as LISP and Prolog (8). For this study, Clisp.net was used to develop the prototype KBS.
Domain experts knowledge acquisition: In this work, interviews and document analysis were used primarily to understand the basic concepts related to the diagnosis, treatment and prognosis of diabetes, to acquire general and domain-specific knowledge, and to obtain comprehensive example sets. The data mining approach, on the other hand, is particularly fruitful for automating the knowledge acquisition task of a rule-based system. Developing an efficient and effective knowledge-based system requires a knowledge base that is complete, coherent and non-redundant. Knowledge acquisition is the most difficult and error-prone task in KBS development because it involves communication between people with completely different backgrounds, human experts and knowledge engineers, who must formulate the concepts, relations and control mechanisms needed for the knowledge-based system (7). Knowledge acquisition using data mining eliminates or reduces the difficulty caused by the knowledge acquisition bottleneck of rule-based systems: it automates knowledge acquisition at low cost, adds value to the traditional method, and produces a high-quality knowledge base.
Results
Data mining and model selection: To build the analytical model for diagnosis and treatment of diabetes, three classification models were built, using J48, JRip and PART. J48 is a tree-based classifier in WEKA, whereas JRip and PART are rule-based classifiers. In supervised learning, the training data are accompanied by class labels indicating the class of the observations.
Experiment J48 pruned tree: Decision trees are a data mining methodology applied in many real-world applications as a powerful solution to classification problems. In the decision tree experiment, the performance of the J48 classifier in predicting the diabetes status of patients was evaluated. The experiment was conducted with the default parameters of WEKA. From the total dataset of 2640 records, 2512 were correctly classified and the remaining 128 instances were incorrectly classified. The result is displayed as a tree, hence the name of the technique.
The models obtained from the decision tree are represented as a tree structure. Instances are classified by sorting them down the tree from the root node to some leaf node, and the nodes branch on if-then conditions. This experiment was conducted under the percentage split test option, with 90% of the dataset for training and the remainder for testing, using the default parameters of WEKA. Accordingly, 2512 (95.1515%) of the instances were classified correctly and 128 (4.8485%) were classified incorrectly. The rules generated using the J48 decision tree algorithm are depicted in Figure 1.
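The idea of sorting an instance down the tree from the root to a leaf can be sketched in a few lines of Python. This is an illustration only, not the WEKA model: the node tests are borrowed from the branch quoted as Rule-3 under Knowledge representation, the nesting is an assumed reading of that rule, and the names `Node`, `Leaf`, `classify` and `fragment` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    label: str  # class assigned when an instance reaches this leaf


@dataclass
class Node:
    description: str              # human-readable form of the test
    test: Callable[[dict], bool]  # if-then condition evaluated at this node
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], patient: dict) -> str:
    """Sort one instance down the tree from the root to a leaf."""
    node = tree
    while isinstance(node, Node):
        node = node.if_true if node.test(patient) else node.if_false
    return node.label


# Assumed reading of the Rule-3 branch (tests on BP and age only).
fragment = Node(
    "BP <= 165", lambda p: p["bp"] <= 165,
    if_true=Node(
        "Age <= 29", lambda p: p["age"] <= 29,
        if_true=Leaf("AtRiskTypeIDm"),
        if_false=Leaf("PreDm"),
    ),
    if_false=Leaf("AtRiskTypeIDm"),
)

print(classify(fragment, {"age": 32, "bp": 120}))
```

Each internal node holds exactly one test, so the class of an instance is determined by the single path its attribute values select.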
Figure 1
Sample rules generated by the J48 decision tree classifier
Model evaluation: All the selected algorithms generated rules from the dataset. The results of the algorithms were evaluated based on prediction accuracy in classifying the instances of the dataset into Healthy, Diabetes, PreDm, AtRiskPreDm, TypeIDm, AtriskTypeIDm, TypeIIDm, AtRiskTypeIIDm, GestationalDm and AtRiskgestationDm. The performance of the classifier algorithms was compared, and the one that performed best was selected as the prime choice for the knowledge acquisition step. The accuracy, precision, recall and F-measure obtained for each classifier in the experiment are shown in Table 1.
Table 1
Performance of Classifiers
| Classifier | Correctly classified instances | Correctly classified (%) | Incorrectly classified instances | Incorrectly classified (%) | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|---|
| J48 | 2512 | 95.1515 | 128 | 4.8485 | 0.949 | 0.952 | 0.95 |
| PART | 2495 | 94.5076 | 145 | 5.4924 | 0.944 | 0.945 | 0.944 |
| JRip | 2501 | 94.7348 | 139 | 5.2652 | 0.95 | 0.947 | 0.947 |
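The percentage columns of Table 1 follow directly from the instance counts and the dataset size of 2640 records reported in the text. The short check below recomputes them; the function and variable names are illustrative and not part of the original study.

```python
TOTAL = 2640  # records in the dataset, as reported in the text

# Correctly classified instance counts taken from Table 1.
correct = {"J48": 2512, "PART": 2495, "JRip": 2501}


def accuracy_percent(n_correct: int, total: int = TOTAL) -> float:
    """Share of correctly classified instances, in percent (4 decimals)."""
    return round(100 * n_correct / total, 4)


for name, n in correct.items():
    acc = accuracy_percent(n)
    print(f"{name}: {acc:.4f}% correct, "
          f"{TOTAL - n} incorrect ({100 - acc:.4f}%)")
```

The recomputed values agree with the table: J48 has the highest accuracy and PART the largest number of misclassified instances.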
As shown in the table, three experiments were carried out using a decision tree classifier (J48 pruned tree) and two rule-based classifiers (PART and JRip). The J48 classifier achieved the best accuracy, classifying 2512 of the 2640 instances correctly. JRip and PART produced nearly equal numbers of incorrectly classified instances (139 and 145, respectively), with the highest number of incorrect classifications registered by PART.
The second-best model was generated by the JRip classifier. It scored 94.7348% accuracy in classifying the status of the diabetic patient datasets, classifying 2501 instances correctly with a precision of 0.95 and a recall of 0.947.
The third-best model was the pruned PART model, whose performance was very close to that of the JRip classifier. It scored 94.5076% accuracy, classifying 2495 instances correctly with a precision of 0.944 and a recall of 0.945.
Architecture of the prototype system: An architecture is a blueprint showing how the components of the prototype self-learning knowledge-based system interact and interrelate. The architecture of the developed prototype KBS is depicted in Figure 2.
Figure 2
Architecture of the prototype KBS
Software requirement for the developed system: Clisp.net integrates Common Lisp Prolog with the .NET framework through a library assembly plug-in for the Visual Studio development environment. For this study, the Clisp.net integration tool was used to construct the prototype system with Microsoft Visual Studio Ultimate 2013. This programming language was preferred for its object-oriented nature and its capabilities for interacting with users.
Knowledge base: The knowledge base is the set of rules, or encoded knowledge, about the diagnosis and treatment of diabetes in the prototype system. The validated knowledge is represented in the form of rules using the rule-based representation technique, and the rules are codified into the knowledge base of the prototype system using the integrated Clisp.net Prolog programming language. The rules used to develop the prototype were generated using data mining techniques, by a model that scored more than 95% accuracy on the test instances.
Knowledge representation: The most commonly used methods of knowledge representation are production rules, frames and networks. Knowledge captured from experts and other sources must be organized in such a way that a computer inference program can access it whenever needed and draw conclusions. In this prototype KBS, production rules are used, since they allow the relationships that make up the knowledge base to be broken down into manageable units.
Some of the rules extracted from the J48 decision tree are as follows.
Rule-1: If (Diabetes Symptoms = Healthy or Diagnosis = Healthy): Free From Diabetes
Rule-2: If (Diabetes Symptoms = Diabetes or Diagnosis = Diabetes): Got Diabetes
Rule: If ((patients Age <= 40) and (Diabetes Risk factor = Healthy) and (patients Age <= 34) and (patients Age <= 28) and (BMI = Healthywieght) and (patients Age > 13) and (BP > 130) and (FBS1 <= 130) and (patients Age <= 23) and (FBS1 > 150) and (FBS1 <= 166)): PreDm (17.04/8.16).
Rule-3: If ((patients Age <= 40) and (Diabetes Risk factor = Healthy) and (patients Age <= 34) and (patients Age > 28) and (patients Status = Male) and (BP <= 165) and (patients Age <= 29)): AtRiskTypeIDm (4.01/0.01) or if (patients Age > 29): PreDm (68.3/4.07) or if (BP > 165): AtRiskTypeIDm (12.2/1.09)
Rule-4: If ((patients Age <= 40) and (Diabetes Risk factor = Healthy) and (patients Age > 34) and (BP <= 125) and (patients Status = Male)): TypeIDm (47.81/2.89) or if (BP > 25): AtRiskTypeIIDm (12.94/1.57).
In the case of Rule-1, the system requests identifying information about the patient through a pre-filled form and submits it to determine whether the patient has diabetes or is healthy.
Diagnosis and treatment: Diagnosis of diabetes is based on several factors, such as the patient's physical examination, the presence or absence of symptoms, medical history, risk factors and blood test reports. Blood tests can be used to confirm a diagnosis of diabetes based on the amount of glucose found. A urine test can also be used to check for protein in the urine, which may help diagnose diabetes. These tests can also be used to monitor the disease once the patient is on a standardized diet, physical exercise, oral medication or insulin therapy. The system can provide the necessary information about indications, diagnosis and primary treatment advice to people with diabetes. The prototype KBS, developed using both automatically extracted and expert-based knowledge, can be used for diagnosis and treatment of diabetes, as shown using a Mockler Chart. The Mockler Chart of diagnosis and treatment was drawn to show the relations among the component tests, the patient's situation, the patient's age, Body Mass Index (BMI), symptoms and risk factors. The Mockler Chart of symptoms covers the questions and choices used to determine the patient's symptoms, which lead either to a conclusion of diabetes or to further analysis of the patient (Figure 3).
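As an illustration of how production rules such as Rule-1 and Rule-2 could be held as data and matched against patient facts, the following Python sketch uses a first-match inference loop. It is not the Clisp.net implementation described in this article: the fact keys `symptoms` and `diagnosis`, the `Rule` class and the `infer` function are hypothetical, and the `why` field only imitates the kind of text a "why" explanation could show.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    name: str
    condition: Callable[[dict], bool]  # IF part, evaluated on patient facts
    conclusion: str                    # THEN part
    why: str                           # text for a "why" style explanation


# Rule-1 and Rule-2 as quoted in the text; the fact keys are assumptions.
RULES = [
    Rule(
        "Rule-1",
        lambda f: f.get("symptoms") == "Healthy" or f.get("diagnosis") == "Healthy",
        "Free From Diabetes",
        "Diabetes Symptoms = Healthy or Diagnosis = Healthy",
    ),
    Rule(
        "Rule-2",
        lambda f: f.get("symptoms") == "Diabetes" or f.get("diagnosis") == "Diabetes",
        "Got Diabetes",
        "Diabetes Symptoms = Diabetes or Diagnosis = Diabetes",
    ),
]


def infer(facts: dict) -> Optional[Rule]:
    """Return the first rule whose condition holds, or None if none fires."""
    for rule in RULES:
        if rule.condition(facts):
            return rule
    return None


fired = infer({"symptoms": "Diabetes"})
if fired is not None:
    print(f"{fired.conclusion} (because {fired.why})")
```

Keeping each rule as a separate unit is what makes the knowledge base easy to extend: a new rule is one more entry in the list, and the inference loop stays unchanged.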
Figure 3
Mockler Chart showing how the KBS is used for diagnosis and treatment of diabetes mellitus.
User interface: The user interface is the means by which the user and a computer system interact, in particular through input devices and software. The acceptability of a knowledge-based system depends on the quality of its user interface. In this study, the user interface facilitates diagnosis and treatment based on a predefined knowledge base containing the rules.
Explanation facility: The prototype system can answer “what”, a request to repeat a question for clarification, before it reaches its conclusions. This ability is important because the kinds of problems to which a knowledge-based system is applied usually require an explanation of the results delivered to end-users. The system can also justify “why” a certain question is being asked, in order to explain what it means and how it benefits the user. The developers of the system included “what” and “why” explanation facilities in problem solving and consider these facilities easily understandable by end-users. Together with the graphical user interface, the reasoning and explanation mechanism makes the developed prototype KBS more user friendly.
To choose from the available options, the user checks the checkbox and clicks the button next to it. Then, based on the user's choice, the system displays the next step. An example in which the user wants to conduct a clinical diagnosis is depicted in Figure 4.
Figure 4
User interface used to identify a patient's diabetes using questions
The system displays questions sequentially, covering the symptoms associated with a diabetic patient. The form has been kept simple in terms of the description of symptoms, and the response to each symptom is given in “Yes/No” format. The system provides more information if the user selects “Yes”. As shown in Figure 4, the developed prototype system functions by asking questions of a new patient who comes for diagnosis and treatment of diabetes.
As shown in Figure 5, the system first asks questions in order to identify whether the patient has diabetes or not. It then advises sending the patient for a laboratory blood test. To identify the type of diabetes, the system orders further diagnosis by displaying a further-test message before it restarts and becomes ready for a new automatic diagnosis (Figure 5).
Figure 5
User interface detecting that a patient has diabetes using questions
The prototype system is used to decide whether the patient has pre-diabetes, Type I Diabetes Mellitus, Type II Diabetes Mellitus, Gestational Diabetes Mellitus or another type of diabetes. The system automatically presents the patient's symptoms in the user interface, reminding the user of what was previously diagnosed from the questions asked, in order to make the decision clearer.
For the laboratory test, the prototype KBS considers the patient's status, age, Body Mass Index, risk factors, and criteria of diabetes such as Fasting Blood Sugar, the 2-hour oral glucose tolerance test, HbA1c, family history of diabetes and/or obesity, and ketones and/or autoantibodies.
Besides, if a patient's previous test result showed Type I diabetes and the patient is diagnosed again, the test result shows that the patient has diabetes. As shown in the figure below, the result is displayed only after all the required information has been provided; otherwise the result is incomplete or erroneous. After the “Test Result” is displayed, the system displays “How?”, which recommends a proper treatment based on the type of diabetes diagnosed.
After identifying the diabetes type (e.g., a diagnosis of Type I diabetes), the system recommends treatment measures such as diet information, medication therapy, exercise, and foods to avoid or limit. Moreover, the system advises the patient to monitor his/her glucose level using the FPG, 2h OGTT or hemoglobin A1C test methods. The FPG test enables the patient to monitor his/her glucose level daily, and hemoglobin A1C measures the average sugar level accumulated in the patient's blood over the last 2–3 months by looking at the level of sugar in the patient's hemoglobin. As a result, if diabetic patients get proper treatment, they can control their sugar level, which in turn helps them prevent further complications due to diabetes.
Discussion
The use of data mining techniques to build the knowledge base of the KBS can be taken as a strong feature of the developed system. The main aim of data mining here is to classify the class attribute based on the given attributes. Although three algorithms were selected for this purpose, this was achieved with the J48 decision tree because it performed better. Classification algorithms usually require that the classes be defined based on the data attribute values, and they often describe these classes by looking at the characteristics of data already known to belong to them. Pattern recognition is a type of classification in which an input pattern is classified into one of several classes based on its similarity to these predefined classes (9). The rules mined by the J48 classifier are therefore a result of this study and were used in the prototype KBS for diagnosis and treatment of diabetes.
In this study, different activities were conducted for the purpose of developing a prototype KBS for diagnosis and treatment of diabetes. Moreover, the rules, model and algorithm that performed best were identified.
In general, three algorithms, namely J48, PART and JRip, were selected and tested on the diabetic datasets in order to generate rules. These algorithms were analyzed one by one by looking at their performance during the experiment. Accordingly, the J48 algorithm was the best-performing model compared to the other two algorithms in terms of performance, labeling and specificity. It classified 2512 instances correctly, an accuracy of 95.1515%. The area under the ROC curve (AUC) of this model was 0.997. Its recall (0.952) was higher than that of the remaining models, and its precision (0.949) was comparable to the 0.95 obtained by JRip (Table 1). The rules generated by the J48 algorithm were therefore judged the best rules for developing the prototype KBS for diagnosis and treatment.
JRip rule induction was the second-best model after J48, whereas PART was the lowest-performing classifier. Because J48 performed better than JRip and PART, the rules generated by the J48 model were used in the prototype KBS for diagnosis and treatment of diabetes.
While the represented knowledge about diabetes was being coded in Clisp.net, the fact base of the prototype system was able to update its knowledge automatically through get and set methods. However, the researchers encountered a challenge in updating the rules of the knowledge base dynamically. The reason was that the rules were generated both automatically (by data mining) and manually (i.e., from domain experts and by referring to diabetic patients' medical records).
In developing countries, where laboratory facilities and experienced physicians are limited, diabetic patients do not get the right diagnosis and treatment at the right time. Hence, in this study, an effort was made to design and develop a prototype of an effective and efficient knowledge-based system that can provide advice to experts and patients to facilitate the diagnosis and treatment of diabetes.
Knowledge-based systems can help a great deal in decision-making through a display of intelligent behavior that may include learning and reasoning. In developing the prototype system, knowledge was acquired through both structured and unstructured interviews with domain experts and from relevant documents using document analysis. In this research, a KBS that supports the diagnosis and treatment of diabetes was developed using data mining techniques as a knowledge acquisition step. The generated knowledge was represented using the rule-based representation technique and codified with the Clisp.net editor tool to build the knowledge-based system.
Testing showed that the overall performance of the developed prototype system was 92%, which was considered good enough. The use of data mining techniques to build the knowledge base of the KBS and the graphical user interface are strong features of the system.
Data mining techniques were applied to diabetic patients' baseline datasets in order to generate the rules used for developing the prototype KBS for diagnosis and treatment. However, diabetic datasets are stored manually, which made preprocessing them difficult. If this challenge is solved in a future study, rules could be generated automatically from a database and integrated into the knowledge base, making the application of data mining techniques easier. Moreover, a method for integrating the prototype system with existing health information systems must be investigated. This would lead to the development of standards applicable to all, enabling suitable information exchange and planning for additional improvement of functionality.