Kindie Biredagn Nahato, Khanna Nehemiah Harichandran, Kannan Arputharaj.
Abstract
The availability of clinical datasets and knowledge mining methodologies encourages researchers to pursue research in extracting knowledge from clinical datasets. Different data mining techniques have been used for mining rules, and mathematical models have been developed to assist the clinician in decision making. The objective of this research is to build a classifier that predicts the presence or absence of a disease by learning from a minimal set of attributes extracted from the clinical dataset. In this work, the rough set indiscernibility relation method is combined with a backpropagation neural network (RS-BPNN). The work has two stages. The first stage handles missing values to obtain a smooth dataset and selects appropriate attributes from the clinical dataset using the indiscernibility relation method. The second stage performs classification with a backpropagation neural network on the selected reducts of the dataset. The classifier has been tested with the hepatitis, Wisconsin breast cancer, and Statlog heart disease datasets obtained from the University of California at Irvine (UCI) machine learning repository. The accuracy obtained with the proposed method is 97.3%, 98.6%, and 90.4% for hepatitis, breast cancer, and heart disease, respectively. The proposed system provides an effective classification model for clinical datasets.
Year: 2015 PMID: 25821508 PMCID: PMC4364360 DOI: 10.1155/2015/460189
Source DB: PubMed Journal: Comput Math Methods Med ISSN: 1748-670X Impact factor: 2.238
Figure 1: System architecture.
Description of hepatitis dataset.
| Number | Attribute name | Domain values | Number of missing values |
|---|---|---|---|
| 1 | Age | 10, 20, 30, 40, 50, 60, 70, 80 | 0 |
| 2 | Sex | Male, female | 0 |
| 3 | Steroid | No, yes | 1 |
| 4 | Antivirals | No, yes | 0 |
| 5 | Fatigue | No, yes | 1 |
| 6 | Malaise | No, yes | 1 |
| 7 | Anorexia | No, yes | 1 |
| 8 | Liver big | No, yes | 10 |
| 9 | Liver firm | No, yes | 11 |
| 10 | Spleen palpable | No, yes | 5 |
| 11 | Spiders | No, yes | 5 |
| 12 | Ascites | No, yes | 5 |
| 13 | Varices | No, yes | 5 |
| 14 | Bilirubin | 0.39, 0.80, 1.20, 2.00, 3.00, 4.00 | 6 |
| 15 | Alk phosphate | 33, 80, 120, 160, 200, 250 | 29 |
| 16 | Sgot | 13, 100, 200, 300, 400, 500 | 4 |
| 17 | Albumin | 2.1, 3.0, 3.8, 4.5, 5.0, 6.0 | 16 |
| 18 | Protime | 10, 20, 30, 40, 50, 60, 70, 80, 90 | 67 |
| 19 | Histology | No, yes | 0 |
| 20 | Class | Die, live | 0 |
Description of breast cancer dataset.
| Number | Attribute name | Domain | Missing value |
|---|---|---|---|
| 1 | Clump thickness | 1–10 | 0 |
| 2 | Uniformity of cell size | 1–10 | 0 |
| 3 | Uniformity of cell shape | 1–10 | 0 |
| 4 | Marginal adhesion | 1–10 | 0 |
| 5 | Single epithelial cell size | 1–10 | 16 |
| 6 | Bare nucleoli | 1–10 | 0 |
| 7 | Bland chromatin | 1–10 | 0 |
| 8 | Normal nucleoli | 1–10 | 0 |
| 9 | Mitosis | 1–10 | 0 |
| 10 | Class | 2 for benign, | 0 |
Attribute information of Statlog heart disease dataset.
| Number | Attribute | Description | Data type | Domain |
|---|---|---|---|---|
| 1 | Age | Patient age in year | Numerical | 29 to 77 |
| 2 | Sex | Gender | Binary | 0 = female |
| 3 | Chp | Chest pain type | Nominal | 1 = typical angina, 2 = atypical angina |
| 4 | Bp | Resting blood pressure | Numerical | 94 to 200 |
| 5 | Sch | Serum cholesterol | Numerical | 126 to 564 |
| 6 | Fbs | Fasting blood sugar >120 mg/dL | Binary | 0 = false |
| 7 | Ecg | Resting electrocardiographic result | Nominal | 0 = normal |
| 8 | Mhrt | Maximum heart rate | Numerical | 71 to 200 |
| 9 | Exian | Exercise induced angina | Binary | 0 = no |
| 10 | Opk | Old peak | Numerical | Continuous (0 to 6.2) |
| 11 | Slope | Slope of peak exercise ST segment | Nominal | 1 = upsloping |
| 12 | Vessel | Number of major vessels | Nominal | 0 to 3 |
| 13 | Thal | Defect type | Nominal | 3 = normal, 6 = fixed defect, |
| 14 | Class | Heart disease | Binary | 0 = absence, 1 = presence |
Sample of information system table of heart disease.
| Chp | ECG | Vessel | Class |
|---|---|---|---|
| 2 | 0 | 3 | Yes |
| 1 | 2 | 1 | No |
| 3 | 2 | 3 | Yes |
| 2 | 0 | 0 | No |
| 3 | 0 | 0 | No |
| 3 | 0 | 0 | Yes |
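
The indiscernibility relation named in the abstract groups records that share the same values on a chosen set of attributes. The following is a minimal sketch on the six sample rows above, not the authors' implementation. The function names and the 0-based row indices are illustrative assumptions.

```python
from collections import defaultdict

# Sample information system: condition values (Chp, ECG, Vessel) and class label,
# copied from the six rows of the sample table.
ATTRIBUTES = ("Chp", "ECG", "Vessel")
ROWS = [
    ((2, 0, 3), "Yes"),
    ((1, 2, 1), "No"),
    ((3, 2, 3), "Yes"),
    ((2, 0, 0), "No"),
    ((3, 0, 0), "No"),
    ((3, 0, 0), "Yes"),
]


def indiscernibility_classes(rows, attribute_indices):
    """Group row indices whose values agree on every selected attribute."""
    groups = defaultdict(list)
    for i, (values, _label) in enumerate(rows):
        key = tuple(values[j] for j in attribute_indices)
        groups[key].append(i)
    return sorted(groups.values())


def inconsistent_classes(rows, attribute_indices):
    """Return the classes whose members carry more than one class label."""
    return [
        block
        for block in indiscernibility_classes(rows, attribute_indices)
        if len({rows[i][1] for i in block}) > 1
    ]


all_attrs = range(len(ATTRIBUTES))
classes = indiscernibility_classes(ROWS, all_attrs)
conflicts = inconsistent_classes(ROWS, all_attrs)
print(classes)    # [[0], [1], [2], [3], [4, 5]]
print(conflicts)  # [[4, 5]]
```

The last two sample rows agree on all three attributes but carry different class labels, so they fall into one indiscernibility class that no attribute subset can separate.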
Algorithm 1: Steps to extract reduct from dataset.
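
The steps of Algorithm 1 are not reproduced in this record, so the following is only a generic brute-force sketch of reduct extraction, not the authors' procedure. It assumes a reduct is a minimal attribute subset that induces the same indiscernibility partition as the full attribute set. The function names are hypothetical, and the data are the condition values of the sample information system table.

```python
from itertools import combinations

ATTRIBUTES = ("Chp", "ECG", "Vessel")
# Condition attribute values of the six sample rows (class column omitted).
ROWS = [(2, 0, 3), (1, 2, 1), (3, 2, 3), (2, 0, 0), (3, 0, 0), (3, 0, 0)]


def partition(rows, attrs):
    """Partition row indices by their values on the given attribute indices."""
    groups = {}
    for i, row in enumerate(rows):
        groups.setdefault(tuple(row[j] for j in attrs), []).append(i)
    return sorted(groups.values())


def find_reducts(rows, n_attrs):
    """Enumerate subsets by increasing size; keep the minimal ones that
    preserve the partition of the full attribute set (assumed definition)."""
    full = partition(rows, range(n_attrs))
    reducts = []
    for size in range(1, n_attrs + 1):
        for subset in combinations(range(n_attrs), size):
            if any(set(r) <= set(subset) for r in reducts):
                continue  # contains a smaller reduct, so it is not minimal
            if partition(rows, subset) == full:
                reducts.append(subset)
    return reducts


reducts = find_reducts(ROWS, len(ATTRIBUTES))
named = [tuple(ATTRIBUTES[j] for j in r) for r in reducts]
print(named)  # [('Chp', 'Vessel')]
```

Exhaustive enumeration is exponential in the number of attributes, so it is only practical for small attribute sets such as the 9 to 18 attributes of the datasets described here.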
Figure 2: Architecture of BPNN for reduct R1 of Statlog heart disease dataset.
Parameters of BPNN.
| Parameter | Setting |
|---|---|
| Number of layers | Input layer: 1 with 9–13 features for hepatitis |
| | Hidden layer: 1 with 25 nodes ( |
| | Output layer: 1 with class label (0 or 1) |
| Activation function | Hidden layer: tangent sigmoid |
| | Output layer: linear |
| Learning algorithm | Backpropagation |
| Dataset division | Random division |
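
The parameter table can be illustrated with a small pure-Python network: one hidden layer of 25 tanh nodes, a linear output node, and plain backpropagation. This is a sketch under assumptions, not the authors' code. The squared-error loss, learning rate, weight initialisation, epoch count, and toy samples are all hypothetical choices made for the example.

```python
import math
import random


def init_network(n_inputs, n_hidden=25, seed=0):
    """Small random weights; the last entry of each weight list is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, x):
    """Tanh hidden layer followed by a single linear output node."""
    hidden, output = network
    h = [math.tanh(sum(w * xi for w, xi in zip(node, x)) + node[-1])
         for node in hidden]
    y = sum(w * hj for w, hj in zip(output, h)) + output[-1]
    return h, y


def train_step(network, x, target, lr=0.02):
    """One stochastic gradient step on the loss 0.5 * (y - target)^2."""
    hidden, output = network
    h, y = forward(network, x)
    err = y - target  # gradient of the loss w.r.t. the linear output
    for j, node in enumerate(hidden):
        # Backpropagate through tanh: d tanh(a)/da = 1 - tanh(a)^2
        delta = err * output[j] * (1.0 - h[j] ** 2)
        for i, xi in enumerate(x):
            node[i] -= lr * delta * xi
        node[-1] -= lr * delta
    for j, hj in enumerate(h):
        output[j] -= lr * err * hj
    output[-1] -= lr * err


def mean_loss(network, samples):
    return sum(0.5 * (forward(network, x)[1] - t) ** 2
               for x, t in samples) / len(samples)


# Hypothetical toy samples: three binary features, class label 0 or 1.
SAMPLES = [((0, 0, 1), 0), ((0, 1, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1)]

network = init_network(n_inputs=3)
loss_before = mean_loss(network, SAMPLES)
for _ in range(500):
    for x, t in SAMPLES:
        train_step(network, x, t)
loss_after = mean_loss(network, SAMPLES)
print(loss_before, loss_after)
```

Because the output node is linear, a 0/1 class label would have to be read off by thresholding the output, for example at 0.5. The record does not state the threshold, so that value is an assumption.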
Confusion matrix.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
Selected reducts for hepatitis dataset.
| Reduct | Number of attributes | Attribute set |
|---|---|---|
| | 9 | [1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 0] |
| | 10 | [1 1 1 0 1 0 0 1 1 0 0 0 0 1 1 1 1 0] |
| | 10 | [1 0 1 0 1 0 0 1 0 1 1 0 0 1 1 1 1 0] |
| | 11 | [1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 0] |
| | 11 | [1 0 1 0 1 1 0 1 1 1 0 0 0 1 1 1 1 0] |
| | 12 | [1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 1 1 0] |
| | 12 | [1 0 1 0 1 0 1 1 0 1 1 0 1 1 1 1 1 0] |
| | 13 | [1 1 1 1 1 0 0 0 0 1 1 1 0 1 1 1 1 1] |
| | 13 | [1 0 1 1 1 0 0 1 1 0 1 0 1 1 1 1 1 1] |
| All | 18 | [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1] |
Example of reducts of breast cancer dataset.
| Reduct | Attribute size | Attribute members |
|---|---|---|
| | 7 | [1 1 1 1 1 1 1 0 0] |
| | 7 | [1 1 1 1 1 0 1 1 0] |
| | 7 | [1 1 1 0 1 1 1 1 0] |
| | 7 | [1 0 1 1 1 1 1 1 0] |
| | 8 | [1 1 1 1 1 1 1 1 0] |
| | 8 | [1 1 1 1 1 1 1 0 1] |
| | 8 | [1 1 1 1 1 0 1 1 1] |
| | 8 | [1 1 1 0 1 1 1 1 1] |
| | 8 | [1 0 1 1 1 1 1 1 1] |
| All | 9 | [1 1 1 1 1 1 1 1 1] |
Selected reducts for heart disease dataset.
| Reduct | Attribute size | Attribute set |
|---|---|---|
| | 4 | [1 0 1 0 1 0 0 0 0 1 0 0 0] |
| | 4 | [1 0 1 0 0 0 0 1 0 0 0 1 0] |
| | 5 | [1 0 1 0 0 0 0 1 0 1 0 1 0] |
| | 5 | [1 0 1 1 0 0 0 1 0 0 0 1 0] |
| | 6 | [1 0 1 1 0 0 0 0 1 0 0 1 1] |
| | 6 | [0 0 1 0 1 0 1 0 0 1 0 1 1] |
| | 6 | [0 0 1 0 1 0 0 1 0 1 0 1 1] |
| | 7 | [0 1 0 1 1 0 1 0 0 0 1 1 1] |
| | 7 | [0 0 0 0 1 1 0 1 1 1 0 1 1] |
| All | 13 | [1 1 1 1 1 1 1 1 1 1 1 1 1] |
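
Each attribute set in the reduct tables is a 0/1 mask over the condition attributes. The sketch below decodes a heart disease mask into attribute names. It assumes, without confirmation from this record, that bit positions follow the order of the Statlog heart disease attribute table (Age first, Thal last). The helper name `decode_mask` is hypothetical.

```python
# Condition attributes in the order of the Statlog heart disease attribute table
# (assumed to match the bit order of the masks).
HEART_ATTRIBUTES = [
    "Age", "Sex", "Chp", "Bp", "Sch", "Fbs", "Ecg",
    "Mhrt", "Exian", "Opk", "Slope", "Vessel", "Thal",
]


def decode_mask(mask_text, names):
    """Turn a mask such as '[1 0 1]' into the list of selected attribute names."""
    bits = [int(token) for token in mask_text.strip("[] ").split()]
    if len(bits) != len(names):
        raise ValueError("mask length does not match the number of attributes")
    return [name for name, bit in zip(names, bits) if bit == 1]


selected = decode_mask("[0 0 1 0 1 0 1 0 0 1 0 1 1]", HEART_ATTRIBUTES)
print(selected)  # ['Chp', 'Sch', 'Ecg', 'Opk', 'Vessel', 'Thal']
```

The decoded list has six names, which agrees with the attribute size reported for that row.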
Figure 3: ROC graph for (a) Hepatitis Reduct-R9, (b) Breast cancer Reduct-R1, (c) Heart disease Reduct-R6.
Figure 4: Result of Hepatitis Reduct-R9.
Figure 5: Result of Breast cancer Reduct-R1.
Figure 6: Result of Heart disease Reduct-R6.
Performance measure of best reducts of each dataset.
| Dataset | Reduct | TP | FN | TN | FP | Accuracy (%) | Sensitivity (%) | Specificity (%) | TPR | FPR | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Hepatitis | Reduct- | 26 | 2 | 117 | 2 | 97.30 | 98.32 | 97.28 | 0.93 | 0.017 | 0.9492 |
| Breast cancer | Reduct- | 238 | 3 | 451 | 7 | 98.60 | 98.76 | 98.57 | 0.99 | 0.015 | 0.9952 |
| Heart disease | Reduct- | 102 | 18 | 142 | 8 | 90.40 | 94.67 | 90.37 | 0.85 | 0.053 | 0.9204 |
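
The accuracy, TPR, and FPR columns follow from the confusion-matrix counts in the same rows. The sketch below recomputes them with the standard definitions. The function and variable names are illustrative, and the sensitivity, specificity, and AUC columns are left as published rather than recomputed.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Standard accuracy (%), true positive rate, and false positive rate."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# (TP, FN, TN, FP) counts copied from the performance table.
COUNTS = {
    "Hepatitis": (26, 2, 117, 2),
    "Breast cancer": (238, 3, 451, 7),
    "Heart disease": (102, 18, 142, 8),
}

results = {name: confusion_metrics(*counts) for name, counts in COUNTS.items()}
for name, m in results.items():
    print(name, round(m["accuracy"], 1), round(m["tpr"], 2), round(m["fpr"], 3))
# Hepatitis 97.3 0.93 0.017
# Breast cancer 98.6 0.99 0.015
# Heart disease 90.4 0.85 0.053
```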
Comparison of proposed system with recent works.
| Author | Technique | Hepatitis accuracy (%) | Breast cancer accuracy (%) | Heart disease accuracy (%) |
|---|---|---|---|---|
| Sartakhti et al. (2012) [ | SVM-SA | 96.25 | — | — |
| Chen et al. (2011) [ | LFDA-SVM | 96.77 | — | — |
| Çalişir and Dogantekin (2011) [ | PCA-LSSVM | 96.12 | — | — |
| Bascil and Temurtas (2011) [ | MLNN | 91.87 | — | — |
| Dogantekin et al. (2009) [ | LDA-ANFIS | 94.16 | — | — |
| Polat and Güneş (2006) [ | FS_AIRS | 94.12 | | |
| Zheng et al. (2014) [ | K-SVM | — | 97.38 | — |
| Karabatak and Ince (2009) [ | AR_NN | — | 95.60 | — |
| Shao et al. (2014) [ | MARS-LR | — | — | 83.93 |
| Anooj (2012) [ | Weighted fuzzy | — | — | 62 |
| Vijaya et al. (2010) [ | Fuzzy neurogenetic | — | — | 80 |
| Kahramanli and Allahverdi (2008) [ | ANN-FNN | — | — | 87 |
| Proposed method | RS-BPNN | 97.3 | 98.6 | 90.4 |
Comparison of proposed system with conventional methods.
| Technique | Hepatitis accuracy (%) | Breast cancer accuracy (%) | Heart disease accuracy (%) |
|---|---|---|---|
| CHAID | 80.8 | 92.7 | 76.6 |
| CRT | 79.4 | 92.4 | 76.6 |
| MLP | 89.0 | 96.1 | 83.3 |
| RBFN | 84.6 | 94.0 | 84.6 |
| RS-BPNN | 97.3 | 98.6 | 90.4 |
(a) Reduct for hepatitis dataset
| Number of attributes | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 |
|---|---|---|---|---|---|---|---|---|---|
| Number of subsets | 48620 | 43758 | 31824 | 18564 | 8568 | 3060 | 816 | 153 | 18 |
| Number of reducts | 2 | 19 | 71 | 137 | 142 | 104 | 43 | 43 | 1 |

(b) Reduct for breast cancer dataset
| Number of attributes | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Number of subsets | 9 | 36 | 84 | 126 | 126 | 84 | 36 | 9 |
| Selected reducts | — | — | — | — | — | — | 5 | 6 |

(c) Reduct for heart disease dataset
| Number of attributes | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of subsets | 286 | 715 | 1287 | 1716 | 1716 | 1287 | 715 | 286 | 78 | 13 |
| Number of reducts | 9 | 90 | 365 | 811 | 1127 | 1043 | 657 | 280 | 78 | 13 |
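
The "Number of subsets" rows match the binomial coefficient C(n, k), where n is the total number of condition attributes (18, 9, and 13 in the "All" rows of the reduct tables) and k is the subset size. The short check below is a sketch. The helper name `subset_counts` is hypothetical.

```python
from math import comb


def subset_counts(n_attributes, sizes):
    """Number of attribute subsets of each size k, i.e. C(n_attributes, k)."""
    return [comb(n_attributes, k) for k in sizes]


hepatitis = subset_counts(18, range(9, 18))     # sizes 9..17
breast_cancer = subset_counts(9, range(1, 9))   # sizes 1..8
heart = subset_counts(13, range(3, 13))         # sizes 3..12
print(hepatitis)
# [48620, 43758, 31824, 18564, 8568, 3060, 816, 153, 18]
print(breast_cancer)
# [9, 36, 84, 126, 126, 84, 36, 9]
print(heart)
# [286, 715, 1287, 1716, 1716, 1287, 715, 286, 78, 13]
```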
(a) Hepatitis accuracy (%)
| Reduct | (80–20) | (70–30) | (60–40) |
|---|---|---|---|
| | 95.2 | 93.9 | 91.2 |
| | 96.6 | 92.5 | 93.2 |
| | 93.9 | 91.2 | 92.5 |
| | 95.2 | 91.8 | 91.2 |
| | 93.9 | 91.8 | 94.6 |
| | 95.9 | 94.6 | 91.8 |
| | 95.2 | 93.2 | 91.2 |
| | 95.2 | 92.5 | 91.8 |
| | 97.3 | 95.9 | 93.9 |
| All features | 87.8 | 87.1 | 86.4 |
(b) Breast cancer accuracy (%)
| Reduct | (80–20) | (70–30) | (60–40) |
|---|---|---|---|
| | 98.6 | 98.3 | 97.6 |
| | 97.6 | 97.4 | 97.4 |
| | 98.4 | 98.0 | 97.6 |
| | 98.4 | 98.1 | 97.4 |
| | 98.0 | 97.9 | 97.6 |
| | 98.0 | 98.0 | 97.9 |
| | 97.9 | 97.7 | 97.1 |
| | 98.4 | 98.0 | 97.3 |
| | 98.0 | 97.7 | 97.6 |
| All features | 97.7 | 97.7 | 97.4 |
(c) Heart disease accuracy (%)
| Reduct | (80–20) | (70–30) | (60–40) |
|---|---|---|---|
| | 79.6 | 78.1 | 77.4 |
| | 81.1 | 81.1 | 80.4 |
| | 84.8 | 84.8 | 84.4 |
| | 85.9 | 83.0 | 83.3 |
| | 88.9 | 84.8 | 84.1 |
| | 90.4 | 88.1 | 87.8 |
| | 88.1 | 87.0 | 85.9 |
| | 86.3 | 85.2 | 83.3 |
| | 88.5 | 85.6 | 84.8 |
| All features | 84.4 | 84.1 | 84.8 |
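
The column headers (80–20), (70–30), and (60–40) appear to denote training and testing percentages under the random division listed among the BPNN parameters. This record does not say so explicitly, so that reading is an assumption. Under it, a random split could be sketched as follows. The function name, the seed, and the 270 placeholder records (the heart disease confusion-matrix total) are illustrative.

```python
import random


def random_split(records, train_fraction, seed=0):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(270))  # placeholder record identifiers
for fraction in (0.8, 0.7, 0.6):
    train, test = random_split(records, fraction)
    print(fraction, len(train), len(test))
# 0.8 216 54
# 0.7 189 81
# 0.6 162 108
```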