Suganthi Jeyasingh, Malathi Veluchamy.
Abstract
Early diagnosis of breast cancer is essential to saving patients' lives. Medical datasets usually contain a wide variety of data, which can cause confusion during diagnosis. The Knowledge Discovery in Databases (KDD) process helps to improve efficiency, but it requires that irrelevant and redundant data be eliminated from the dataset before the final diagnosis. This can be done with any of the feature selection algorithms available in data mining, and feature selection is considered a vital step in increasing classification accuracy. This paper proposes a Modified Bat Algorithm (MBA) for feature selection that eliminates irrelevant features from the original dataset. The Bat algorithm was modified to use simple random sampling to select random instances from the dataset. Ranking was performed with the global best features to identify the predominant features in the dataset. The selected features are used to train a Random Forest (RF) classification algorithm. The MBA feature selection algorithm enhanced the classification accuracy of RF in identifying the occurrence of breast cancer. The Wisconsin Diagnosis Breast Cancer (WDBC) dataset was used to evaluate the performance of the proposed MBA feature selection algorithm. The proposed algorithm achieved better performance in terms of Kappa statistic, Matthews Correlation Coefficient (MCC), Precision, F-measure, Recall, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Relative Absolute Error (RAE) and Root Relative Squared Error (RRSE).
Creative Commons Attribution License
Keywords: Breast cancer; Wisconsin Diagnosis Breast Cancer (WDBC) dataset; modified bat algorithm
Year: 2017 PMID: 28610411 PMCID: PMC5555532 DOI: 10.22034/APJCP.2017.18.5.1257
Source DB: PubMed Journal: Asian Pac J Cancer Prev ISSN: 1513-7368
Figure 1. Overall Flow Diagram of the Proposed MBA Feature Selection with RF Classification
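The flow described in the abstract (simple random sampling of instances, a bat-style search over feature subsets, ranking by the global best, then classification) can be illustrated with a minimal Python sketch. This is a generic binary bat search, not the authors' exact MBA: the function names `binary_bat_search`, `separation_fitness` and `rank_features`, all parameter values, and the filter-style fitness function are assumptions made for illustration, since the record does not specify them.

```python
import math
import random


def binary_bat_search(fitness, n_features, n_bats=10, n_iter=30,
                      f_min=0.0, f_max=2.0, loudness=0.9, pulse_rate=0.5,
                      rng=None):
    """Generic binary bat search (an illustration, not the authors' MBA).

    `fitness` maps a 0/1 feature mask to a score to maximise.
    Returns the best mask found and its score.
    """
    rng = rng or random.Random()
    bats = [[rng.randint(0, 1) for _ in range(n_features)]
            for _ in range(n_bats)]
    vel = [[0.0] * n_features for _ in range(n_bats)]
    scores = [fitness(b) for b in bats]
    top = max(range(n_bats), key=lambda i: scores[i])
    best, best_score = bats[top][:], scores[top]

    for _ in range(n_iter):
        for i in range(n_bats):
            freq = f_min + (f_max - f_min) * rng.random()
            cand = []
            for j in range(n_features):
                # Pull the velocity toward the global best bit; clamp it so
                # exp() below cannot overflow.
                v = vel[i][j] + (best[j] - bats[i][j]) * freq
                vel[i][j] = max(-6.0, min(6.0, v))
                # Sigmoid transfer: velocity -> probability of setting the bit.
                prob = 1.0 / (1.0 + math.exp(-vel[i][j]))
                cand.append(1 if rng.random() < prob else 0)
            if rng.random() > pulse_rate:
                # Local search: flip one random bit of the global best.
                cand = best[:]
                k = rng.randrange(n_features)
                cand[k] = 1 - cand[k]
            cand_score = fitness(cand)
            # Accept with probability `loudness` if the bat does not get worse.
            if cand_score >= scores[i] and rng.random() < loudness:
                bats[i], scores[i] = cand, cand_score
            if cand_score > best_score:
                best, best_score = cand[:], cand_score
    return best, best_score


def separation_fitness(mask, rows, labels, penalty=0.05):
    """Hypothetical stand-in fitness: class-mean gap of the chosen features."""
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    if not pos or not neg:
        return 0.0
    score = 0.0
    for j, bit in enumerate(mask):
        if bit:
            gap = abs(sum(r[j] for r in pos) / len(pos)
                      - sum(r[j] for r in neg) / len(neg))
            score += gap - penalty
    return score


def rank_features(rows, labels, n_runs=5, sample_frac=0.7, seed=0):
    """Rank features by how often they appear in the global best mask."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    votes = [0] * n_features
    for _ in range(n_runs):
        # Simple random sampling of instances, without replacement.
        idx = rng.sample(range(len(rows)), max(1, int(sample_frac * len(rows))))
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        best, _ = binary_bat_search(
            lambda m: separation_fitness(m, sub_rows, sub_labels),
            n_features, rng=rng)
        for j, bit in enumerate(best):
            votes[j] += bit
    return sorted(range(n_features), key=lambda j: -votes[j])


rows = [[1.0, 0.2, 0.5], [0.9, 0.8, 0.5], [0.1, 0.3, 0.5], [0.0, 0.7, 0.5]]
labels = [1, 1, 0, 0]
print(rank_features(rows, labels))  # feature indices, most selected first
```

In a fuller pipeline, the top-ranked columns would then be passed to a random forest classifier (for example scikit-learn's `RandomForestClassifier`), and a wrapper-style fitness based on classifier accuracy could replace the stand-in used here.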
Attribute Information of Wisconsin Breast Cancer Dataset
| Attribute number | Attribute description | Attribute value |
|---|---|---|
| 1 | Class | No-recurrence and recurrence events |
| 2 | Age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99 |
| 3 | Menopause | lt40, ge40, premeno |
| 4 | Tumor size | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59 |
| 5 | inv-nodes | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 |
| 6 | Node-caps | Yes and no |
| 7 | deg-malig | 1, 2, 3 |
| 8 | Breast | Left and Right |
| 9 | Breast-quad | Left-up, left-low, right-up, right-low, central |
| 10 | Irradiat | Yes and no |
Comparative Analysis of CFS, GR and Proposed Approach
| Measures | CFS | GR | MBA-FS |
|---|---|---|---|
| Kappa statistic | 0.8 | 0.8 | 0.9 |
| MAE | 0.05 | 0.05 | 0.07 |
| RMSE | 0.14 | 0.14 | 0.15 |
| RAE | 36.23% | 35.46% | 20.51% |
| RRSE | 50.98% | 50.51% | 36.07% |
| Precision | 0.89 | 0.9 | 0.96 |
| Recall | 0.89 | 0.9 | 0.96 |
| F-measure | 0.89 | 0.9 | 0.96 |
| MCC | 0.85 | 0.86 | 0.94 |
| Correctly Classified Instances | 256 | 258 | 277 |
| Incorrectly Classified Instances | 30 | 28 | 9 |
| Correctly Classified Rate | 89.51% | 90.21% | 96.85% |
| Incorrectly Classified Rate | 10.49% | 9.79% | 3.15% |
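For readers unfamiliar with the four error measures in the table, the sketch below computes them using their common textbook definitions, where RAE and RRSE normalise the error by that of a baseline that always predicts the mean of the actual values. The function name `error_metrics` and the sample values are illustrative assumptions, and the software used by the authors may compute the baseline differently.

```python
import math


def error_metrics(actual, predicted):
    """Return MAE, RMSE, RAE and RRSE (textbook definitions)."""
    n = len(actual)
    mean_a = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline errors: always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": abs_err / base_abs,
        "RRSE": math.sqrt(sq_err / base_sq),
    }


# Made-up class labels and predicted probabilities, for illustration only.
metrics = error_metrics([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
print({name: round(value, 4) for name, value in metrics.items()})
```

RAE and RRSE are usually reported as percentages, as in the table, by multiplying the ratios by 100.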
Figure 4. Accuracy Analysis of the Proposed MBA-FS and Existing CFS and GR
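The classification rates in the comparison table follow directly from the correctly and incorrectly classified instance counts, which sum to 286 for every method. A short check, with `classification_rates` as an illustrative helper name:

```python
def classification_rates(correct, incorrect):
    """Return (correct %, incorrect %) rounded to two decimals."""
    total = correct + incorrect
    return round(100 * correct / total, 2), round(100 * incorrect / total, 2)


# Counts taken from the comparison table.
counts = {"CFS": (256, 30), "GR": (258, 28), "MBA-FS": (277, 9)}
for method, (correct, incorrect) in counts.items():
    print(method, classification_rates(correct, incorrect))
# CFS (89.51, 10.49)
# GR (90.21, 9.79)
# MBA-FS (96.85, 3.15)
```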