
Feature selection using genetic algorithm for breast cancer diagnosis: experiment on three different datasets.

Shokoufeh Aalaei, Hadi Shahraki, Alireza Rowhanimanesh, Saeid Eslami.

Abstract

OBJECTIVES: This study addresses feature selection for breast cancer diagnosis. The proposed process uses a wrapper approach that combines GA-based feature selection with a PS-classifier. The experimental results show that the proposed model is comparable to other models on the Wisconsin breast cancer datasets.
MATERIALS AND METHODS: To evaluate the effectiveness of the proposed feature selection method, we employed three different classifiers, an artificial neural network (ANN), the PS-classifier and a genetic algorithm based classifier (GA-classifier), on the Wisconsin breast cancer datasets: Wisconsin breast cancer (WBC), Wisconsin diagnosis breast cancer (WDBC), and Wisconsin prognosis breast cancer (WPBC).
RESULTS: For the WBC dataset, feature selection improved the accuracy of all classifiers except the ANN, and the best accuracy with feature selection was achieved by the PS-classifier. For WDBC and WPBC, feature selection improved the accuracy of all three classifiers, and the best accuracy with feature selection was achieved by the ANN. Specificity and sensitivity also improved after feature selection.
CONCLUSION: The results show that feature selection can improve the accuracy, specificity and sensitivity of classifiers. The results of this study are comparable with those of other studies on the Wisconsin breast cancer datasets.

Keywords:  Breast cancer; Classification; Feature selection; Data mining

Year:  2016        PMID: 27403253      PMCID: PMC4923467     

Source DB:  PubMed          Journal:  Iran J Basic Med Sci        ISSN: 2008-3866            Impact factor:   2.699


Introduction

A major class of problems in medical science involves the diagnosis of disease based on a number of tests performed on patients. Because of the sheer volume of data, the final diagnosis may be difficult to reach, even for a medical expert. Improved facilities have made it possible to collect very large medical databases, which creates a need to discover the relationships buried in the data. Data mining approaches are used intensively in the medical domain for these purposes (1, 2). One application of database analysis is automated diagnostic systems, which can help doctors in their decision making. Another is finding ways to improve patient outcomes, reduce costs and enhance clinical studies. The need for automated diagnosis is most acute for deadly diseases such as cancer, where early detection can greatly improve the chances of long-term survival and reduce costs. Breast cancer is considered the most common invasive cancer in women. In the USA, it is considered the second leading cause of mortality among women and the most common cause of mortality in women aged 40 to 55 years (3). Early detection has been shown to substantially reduce mortality among patients with breast cancer (4). There are three classical methods for detecting breast cancer: physical examination, mammography and biopsy, the latter including fine needle aspiration biopsy (FNAB or FNAC), core needle biopsy, surgical biopsy and lymph node biopsy (5). Mammography is one of the most widely used methods for detecting breast cancer. The literature reports considerable variation among radiologists in interpreting mammograms (6). The accuracy of mammography varies from 68% to 79% (7). When mammography detects a tumour, a biopsy is required to determine whether it is malignant. The accuracy of surgical biopsy is nearly 100%, but it is costly, invasive, time consuming and painful. FNAC is also widely adopted in the diagnosis of breast cancer.
The accuracy of FNAC with visual interpretation varies from 35% to 95%, depending on the experience of the doctor (8). It is therefore necessary to develop better methods for identifying breast cancer. Such methods can help assign patients either to a ‘benign’ group that does not have breast cancer or to a ‘malignant’ group with strong evidence of breast cancer. Malignant tumours are generally more serious than benign tumours. As mentioned, early detection of breast cancer leads to much higher chances of successful treatment. To reach this goal, diagnostic systems are needed that have high accuracy and reliability and that help doctors distinguish benign breast tumours from malignant ones. One of the problems in diagnostic systems is the multiplicity of features. Irrelevant and redundant features confuse the classification algorithm and decrease learning precision (9, 10). Feature selection is one of the methods that can cope with this problem, and it plays an important role in classification. It is a pre-processing technique in data mining that is used extensively in statistics, pattern recognition and the medical domain. There are three approaches to feature selection: wrapper, filter and embedded (11). In the wrapper approach, the goodness of a selected subset of features is determined by training and evaluating a classifier using only the variables in the proposed subset. The filter approach scores the selected subset with techniques that ignore the classification algorithm; in other words, the goodness of the subset is determined using only intrinsic properties of the data (12). In the embedded approach, the best subset of features is selected during the model construction process.
A good amount of research applying feature selection methods to breast cancer datasets can be found in the literature, including an ant colony algorithm (13), a discrete particle swarm optimization method (14), a wrapper approach with a genetic algorithm (15), support vector-based feature selection using Fisher’s linear discriminant and a support vector machine (16), fast correlation based feature selection (FCBF), multi thread based FCBF feature selection and decision dependent-decision independent correlation (DDC-DIC) (17), rough set K-means clustering (18), and modification correlation rough set feature selection (MCRSFS) (19). In this study, a wrapper feature selection method based on a genetic algorithm is proposed. The model employs a particle swarm optimization based classifier (PS-classifier) as the fitness function and was evaluated on the Wisconsin breast cancer databases.

Materials and Methods

Dataset Description (Wisconsin breast cancer databases)

In this study, the Wisconsin breast cancer datasets from the UCI Machine Learning Repository are used (20). They were collected by Dr. William H. Wolberg (1989–1991) at the University of Wisconsin–Madison Hospitals. The details of these datasets are shown in Table 1.
Table 1

Wisconsin breast cancer datasets (18)

| Dataset | No. of attributes | No. of instances | No. of classes |
|---|---|---|---|
| Wisconsin breast cancer (WBC) | 11 | 699 | 2 |
| Wisconsin diagnosis breast cancer (WDBC) | 32 | 569 | 2 |
| Wisconsin prognosis breast cancer (WPBC) | 34 | 198 | 2 |

In the WBC dataset there are 699 records, each with nine attributes in addition to the ID number and class. These nine attributes are graded on an interval scale from 1 to 10, with 1 being the normal state and 10 the most abnormal state (Table 2). In this database, 241 (34.5%) records are malignant and 458 (65.5%) records are benign.
Table 2

Wisconsin breast cancer (WBC) attributes (20)

| # | Attribute | Domain |
|---|---|---|
| 1 | Sample code number | Id number |
| 2 | Clump thickness | 1 – 10 |
| 3 | Uniformity of cell size | 1 – 10 |
| 4 | Uniformity of cell shape | 1 – 10 |
| 5 | Marginal adhesion | 1 – 10 |
| 6 | Single epithelial cell size | 1 – 10 |
| 7 | Bare nuclei | 1 – 10 |
| 8 | Bland chromatin | 1 – 10 |
| 9 | Normal nucleoli | 1 – 10 |
| 10 | Mitoses | 1 – 10 |
| 11 | Class | 2 for benign, 4 for malignant |

In WDBC there are 569 records, each with thirty attributes in addition to the ID number and class. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. Ten real-valued features are computed for each cell nucleus: “radius (mean of distances from center to points on the perimeter), texture (standard deviation of gray-scale values), perimeter, area, smoothness (local variation in radius lengths), compactness (perimeter^2 / area - 1.0), concavity (severity of concave portions of the contour), concave points (number of concave portions of the contour), symmetry, fractal dimension (“coastline approximation” - 1)” (20). The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE and field 23 is Worst Radius. The WPBC and WDBC datasets share the same features, but WPBC has two additional features: tumour size, the diameter of the excised tumour in centimeters, and lymph node status, the number of positive axillary lymph nodes observed at the time of surgery.
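The field numbering described above can be made concrete with a small sketch. It assumes the layout implied by the text (field 1 is the ID, field 2 is the class, followed by the ten means, the ten standard errors and the ten "worst" values); the function name `wdbc_field_names` is hypothetical and introduced only for illustration.

```python
# The ten per-nucleus measurements, in the order the quoted description lists them.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics per measurement, giving 10 * 3 = 30 features.
STATISTICS = ["mean", "SE", "worst"]


def wdbc_field_names():
    """Map 1-based field numbers to names (assumed layout: ID, class, 30 features)."""
    fields = {1: "ID number", 2: "diagnosis"}
    field_no = 3
    for stat in STATISTICS:
        for base in BASE_FEATURES:
            fields[field_no] = f"{stat} {base}"
            field_no += 1
    return fields


FIELDS = wdbc_field_names()
print(FIELDS[3], "|", FIELDS[13], "|", FIELDS[23])
```

Under this assumed layout the mapping has 32 entries, which agrees with the attribute count given for WDBC in Table 1, and fields 3, 13 and 23 fall on the radius statistics as the text states.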

Feature selection

Feature selection is a process that reduces the number of attributes by selecting a subset of the original features. It is often used in data pre-processing to identify relevant features, which are often not known in advance, and to remove irrelevant or redundant features that have no significance for the classification task. Feature selection aims to improve classification accuracy (9).

Genetic algorithm

The genetic algorithm (GA), originally developed by Holland, is a computational optimization paradigm modelled on the concept of biological evolution (21). The GA is an optimization procedure that operates in binary search spaces and manipulates a population of potential solutions. A point in the search space is represented by a finite sequence of 0s and 1s, called a chromosome. The quality of possible solutions is evaluated by a fitness function. The probability of survival is proportional to the chromosome’s fitness value. In a GA, the initial population is generated randomly and then evolved by three operators: selection, crossover, and mutation. The selection operator selects elites to transfer directly to the next generation. The crossover operator randomly swaps a portion of the chromosomes between two chosen parents to produce offspring chromosomes. The mutation operator randomly alters a bit in a chromosome. In this work the GA is used to eliminate insignificant features. For this purpose, we defined chromosomes as masks over the features; in other words, each chromosome represents a subset of features. The size of a chromosome (number of genes) is equal to the number of features that describe a cancer patient. As mentioned, a chromosome is represented as a binary string of 0s and 1s: 1 means the corresponding feature is selected and 0 means it is not selected (Figure 1).
Figure 1

Generating initial population
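The chromosome-as-mask encoding and the crossover and mutation operators can be sketched as follows. This is only an illustration: the text says crossover swaps "a portion" of two parents without specifying how, so the one-point cut used here is an assumption, and all function names are hypothetical.

```python
import random


def random_chromosome(n_features, rng):
    """One random bit per feature: 1 keeps the feature, 0 drops it."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two parents at a random cut point (assumed variant)."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, rng):
    """Return a copy with one randomly chosen bit flipped."""
    child = list(chromosome)
    i = rng.randrange(len(child))
    child[i] = 1 - child[i]
    return child


def apply_mask(sample, chromosome):
    """Keep only the feature values whose gene is 1."""
    return [x for x, gene in zip(sample, chromosome) if gene == 1]


rng = random.Random(0)
# A small initial population for the nine WBC features.
population = [random_chromosome(9, rng) for _ in range(4)]
```

For example, `apply_mask([5, 1, 1, 1, 2, 1, 3, 1, 1], [1, 0, 0, 0, 0, 0, 0, 0, 1])` keeps only the first and last values of the record.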

Evaluation function

The goal of the proposed model is to select the subset of features that produces the highest classification accuracy for the diagnosis and prognosis of breast cancer. Selecting the best subset requires a function that evaluates the result of choosing each subset of features (chromosome). In this work we used a classifier based on the particle swarm optimization algorithm (PS-classifier), a novel classifier proposed by Zahiri and Seyedin (22). Particle swarm optimization (PSO) was developed by Kennedy and Eberhart (23). This optimization method is based on the behaviour of a swarm of bees or a flock of birds searching for food. In PSO, the particles fly through the problem space by following the current optimal particles. Each particle remembers the best position that it has visited (Pbest) and the best position found by any particle in the population (Gbest). The position of each particle changes according to Pbest and Gbest. In the PS-classifier, the PSO algorithm is used to find the decision hyper planes between the different classes. Decision hyper planes divide the feature space into individual regions, and each region is assigned to a specific class. A general hyper plane has the form $d(X) = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_{n+1}$, where X=(x1, x2, …, xn) and W=(w1, w2, …, wn+1) are called the augmented feature and weight vector, respectively, and n is the dimension of the feature space. In the general case, a number of hyper planes separate the feature space into different regions, each of which identifies an individual class (Figure 2).
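The Pbest/Gbest movement rule can be illustrated with the textbook PSO update. This is a minimal sketch rather than the PS-classifier's exact rule: the inertia weight of 0.7 is the initial value reported for this study, while the acceleration coefficients `c1` and `c2` and the function name `pso_step` are assumptions made for illustration.

```python
import random


def pso_step(position, velocity, pbest, gbest, rng,
             inertia=0.7, c1=2.0, c2=2.0):
    """Move one particle one step towards its own best and the swarm's best."""
    new_position, new_velocity = [], []
    for x, v, p, g in zip(position, velocity, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # Keep part of the old velocity, then pull towards Pbest and Gbest.
        v_next = inertia * v + c1 * r1 * (p - x) + c2 * r2 * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity


rng = random.Random(0)
# A particle already sitting on both Pbest and Gbest only coasts on its inertia.
pos, vel = pso_step([0.0, 0.0], [1.0, -1.0], [0.0, 0.0], [0.0, 0.0], rng)
```

In the PS-classifier setting, each particle position would be a candidate weight vector W for a decision hyper plane.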
Figure 2

Separating two classes with one hyper plane

The PS-classifier must find Wj (j=1, 2, …, H) in the solution space, where H is the necessary number of decision hyper planes. The fitness of a candidate W is defined in terms of Miss, the number of data points misclassified by W.
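As a hedged illustration of how Miss could be counted, the sketch below handles the simplest case of two classes separated by a single hyper plane (H = 1). The sign convention (label 1 on the non-negative side) and the function names are assumptions, and the sketch does not reproduce the PS-classifier's exact fitness formula.

```python
def decision_value(weights, features):
    """d(X) = w_1 x_1 + ... + w_n x_n + w_{n+1}; the last weight is the offset."""
    return sum(w * x for w, x in zip(weights, features)) + weights[-1]


def predict(weights, features):
    """Label 1 on the non-negative side of the hyper plane, 0 otherwise."""
    return 1 if decision_value(weights, features) >= 0 else 0


def miss(weights, samples, labels):
    """Number of samples the hyper plane places in the wrong region."""
    return sum(predict(weights, x) != y for x, y in zip(samples, labels))


# Toy two-feature data: low values are class 0, high values are class 1.
samples = [[1, 1], [2, 1], [8, 9], [9, 8]]
labels = [0, 0, 1, 1]
print(miss([1, 1, -10], samples, labels))  # separating plane x1 + x2 = 10
print(miss([1, 1, 0], samples, labels))    # plane that puts every point on one side
```

A swarm would then search for the weight vector with the fewest misclassified points.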

Feature selection process

The feature selection process is shown in Figure 3. The GA encodes subsets of features as chromosomes, and each chromosome is sent to the PS-classifier to calculate its fitness value. The PS-classifier uses each chromosome as a mask over the features, so that each gene on the chromosome determines whether the corresponding feature is used in the PS-classifier. The PS-classifier returns a fitness value for each chromosome, and the GA uses these fitness values to drive the evolution of the chromosomes. Finally, the GA finds an optimal subset of features.
Figure 3

Proposed feature selection flowchart

In the proposed model, the number of chromosomes in each population (population size) is 150 and the maximum number of iterations is 300. The mutation rate is 0.4, the crossover rate is 0.5 and the elite rate is 0.1. For the PS-classifier, a swarm size of 150 was selected and the initial inertia weight was set to 0.7.
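The wrapper loop can be sketched as below. Several points are assumptions rather than details taken from the study: the three rates (0.1, 0.5, 0.4) are read as the shares of each new generation filled by elites, crossover offspring and mutants; parents are drawn from the better half of the population; and a toy scoring function stands in for the PS-classifier so that the example runs on its own.

```python
import random


def evolve(fitness, n_features, rng, pop_size=150, generations=300,
           elite_rate=0.1, crossover_rate=0.5):
    """Wrapper GA: `fitness` scores a 0/1 feature mask (higher is better)."""
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    n_elite = max(1, int(elite_rate * pop_size))
    n_cross = int(crossover_rate * pop_size)
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        next_gen = [list(c) for c in ranked[:n_elite]]   # elites survive unchanged
        parents = ranked[:max(2, pop_size // 2)]         # assumed: breed from better half
        while len(next_gen) < n_elite + n_cross:         # crossover offspring
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_features - 1)
            next_gen.append(a[:cut] + b[cut:])
        while len(next_gen) < pop_size:                  # remaining slots: mutants
            child = list(rng.choice(parents))
            i = rng.randrange(n_features)
            child[i] = 1 - child[i]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), history


RELEVANT = {2, 5, 7, 8}  # hypothetical "useful" feature positions for the demo


def toy_fitness(mask):
    """Stand-in for the classifier score: reward relevant features, penalise others."""
    return sum(1.0 if i in RELEVANT else -0.5 for i, g in enumerate(mask) if g)


best, history = evolve(toy_fitness, 9, random.Random(1), pop_size=20, generations=30)
```

Because the elites are copied unchanged, the best fitness recorded in `history` never decreases from one generation to the next. In the study's setting, `toy_fitness` would be replaced by a function that trains the PS-classifier on the masked features and returns its score.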

Prediction models

In this study we used three classifier algorithms, namely an artificial neural network (ANN), the PS-classifier and the GA-classifier, as subset evaluation mechanisms on the Wisconsin breast cancer datasets (WBCD). We built three 3-layer neural networks using nprtool in Matlab. Artificial neural networks are a computational tool based on the properties of biological neural systems. The GA-classifier, presented by Bandyopadhyay et al (24), is another classifier used to evaluate the proposed method. Its population size (number of chromosomes in each population) is 150 and its maximum number of iterations is 300; the mutation rate is 0.4, the crossover rate is 0.5 and the elite rate is 0.1. The third classifier is the PS-classifier described above. To evaluate classification performance, three main metrics, accuracy, sensitivity and specificity, were computed for the classifiers. These metrics are calculated as $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, $\text{Sensitivity} = \frac{TP}{TP + FN}$ and $\text{Specificity} = \frac{TN}{TN + FP}$, where TN is the number of True Negatives, TP is the number of True Positives, FN is the number of False Negatives and FP is the number of False Positives. Training and testing were repeated 30 times for each classifier, and the average of the results is reported as the final result. 80% of the data is allocated to the training set and the remaining 20% to the test set (in the case of the ANN, 20% of the data is allocated to a validation set). Note that the classifier parameters were the same before and after feature selection.
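The three metrics follow the standard confusion-matrix definitions, which can be sketched in a few lines. Treating the malignant class as "positive" (label 1) is an assumption made here for illustration, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of positives that were caught
        "specificity": tn / (tn + fp),   # share of negatives that were cleared
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
result = metrics(*confusion_counts(y_true, y_pred))
```

In this toy example there are 2 true positives, 4 true negatives, 1 false positive and 1 false negative, so the accuracy is 6/8, the sensitivity 2/3 and the specificity 4/5.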

Results

The proposed feature selection method was applied to the Wisconsin breast cancer databases; Table 3 shows the selected relevant features.
Table 3

Selected features after applying feature selection method

| Dataset | Selected features |
|---|---|
| WBC | 3, 6, 8, 9 |
| WDBC | 1, 2, 6, 8, 12, 14, 18, 19, 21, 22, 25, 26, 27, 29 |
| WPBC | 1, 4, 5, 6, 7, 10, 11, 13, 15, 16, 18, 23, 24, 25, 28, 29 |

Without feature selection, the neural networks have an input layer of 9, 30 and 33 input variables for the WBC, WDBC and WPBC datasets, respectively. After feature selection, the input layers have 4, 14 and 16 input variables. In all networks we used a hidden layer with 5 nodes and an output layer with 2 nodes.
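To show what the stated layer sizes imply, the sketch below builds a fully connected 3-layer network with 5 hidden nodes and 2 outputs in plain Python. The tanh hidden activation, softmax output and random initial weights are assumptions for illustration; the study used Matlab's nprtool, whose internals are not reproduced here.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Weights (n_out rows of n_in values) and biases for one dense layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_inputs, rng, n_hidden=5, n_outputs=2):
    """Input layer -> 5 hidden nodes -> 2 output nodes."""
    return [init_layer(n_inputs, n_hidden, rng), init_layer(n_hidden, n_outputs, rng)]


def dense(weights, biases, inputs):
    """Weighted sum plus bias for every node in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(network, inputs):
    """Class probabilities for one record (assumed tanh hidden, softmax output)."""
    (w1, b1), (w2, b2) = network
    hidden = [math.tanh(z) for z in dense(w1, b1, inputs)]
    logits = dense(w2, b2, hidden)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def parameter_count(network):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


rng = random.Random(0)
full_net = build_network(9, rng)      # WBC without feature selection
reduced_net = build_network(4, rng)   # WBC with the four selected features
probs = forward(reduced_net, [0.1, 0.2, 0.3, 0.4])
```

Under these assumptions, reducing the WBC input layer from 9 to 4 variables shrinks the network from 62 to 37 trainable parameters.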

Wisconsin breast cancer dataset (WBC)

We applied the classifiers to the WBC dataset with and without feature selection. The results are summarized in Table 4.
Table 4

The sensitivity, specificity and accuracy of 3 classifiers with and without feature selection (FS) using WBC dataset

| Classifier | Accuracy (%), without FS | Accuracy (%), with FS | Specificity (%), without FS | Specificity (%), with FS | Sensitivity (%), without FS | Sensitivity (%), with FS |
|---|---|---|---|---|---|---|
| PSO | 96.2 | 96.9 | 96.4 | 97.5 | 96.5 | 97.7 |
| GA | 96 | 96.6 | 96.5 | 96.6 | 96.5 | 97.1 |
| ANN | 96.8 | 96.7 | 95.2 | 97.2 | 94.9 | 97.2 |

Wisconsin diagnosis breast cancer (WDBC)

We applied the described classifiers to WDBC. The average sensitivity, specificity and accuracy of the three classifiers (ANN, PS-classifier, GA-classifier) with and without feature selection are compared in Table 5.
Table 5

The sensitivity, specificity and accuracy of 3 classifiers with and without feature selection (FS) using WDBC dataset

| Classifier | Accuracy (%), without FS | Accuracy (%), with FS | Specificity (%), without FS | Specificity (%), with FS | Sensitivity (%), without FS | Sensitivity (%), with FS |
|---|---|---|---|---|---|---|
| PSO | 96.4 | 97.2 | 93.1 | 95.6 | 98.6 | 98 |
| GA | 96.1 | 96.6 | 92.9 | 93.7 | 97.8 | 97.5 |
| ANN | 96.5 | 97.3 | 96 | 95.1 | 98.2 | 98.4 |

Wisconsin prognosis breast cancer (WPBC)

The results of applying the three described classifiers to WPBC are summarized in Table 6.
Table 6

The sensitivity, specificity and accuracy of 3 classifiers with and without feature selection (FS) using WPBC dataset

| Classifier | Accuracy (%), without FS | Accuracy (%), with FS | Specificity (%), without FS | Specificity (%), with FS | Sensitivity (%), without FS | Sensitivity (%), with FS |
|---|---|---|---|---|---|---|
| PSO | 77.8 | 78.2 | 88.5 | 92.9 | 32.0 | 33.3 |
| GA | 76.3 | 78.1 | 90.2 | 92.8 | 26.9 | 31.0 |
| ANN | 77.4 | 79.2 | 94.4 | 96.3 | 28.3 | 33 |
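The accuracy columns of Tables 4 to 6 can be compared directly with a few lines of arithmetic. The values below are copied from those tables; the helper names are hypothetical and the sketch adds no new results.

```python
# Accuracy (%) without and with feature selection, copied from Tables 4-6.
ACCURACY = {
    "WBC":  {"PSO": (96.2, 96.9), "GA": (96.0, 96.6), "ANN": (96.8, 96.7)},
    "WDBC": {"PSO": (96.4, 97.2), "GA": (96.1, 96.6), "ANN": (96.5, 97.3)},
    "WPBC": {"PSO": (77.8, 78.2), "GA": (76.3, 78.1), "ANN": (77.4, 79.2)},
}


def accuracy_gains(table):
    """Percentage-point change in accuracy after feature selection."""
    return {dataset: {clf: round(after - before, 1)
                      for clf, (before, after) in rows.items()}
            for dataset, rows in table.items()}


def best_with_fs(table):
    """Classifier with the highest accuracy after feature selection, per dataset."""
    return {dataset: max(rows, key=lambda clf: rows[clf][1])
            for dataset, rows in table.items()}


GAINS = accuracy_gains(ACCURACY)
BEST = best_with_fs(ACCURACY)
```

The only negative entry in `GAINS` is the ANN on WBC (a drop of 0.1 percentage points), and `BEST` names the PS-classifier (labelled PSO in the tables) for WBC and the ANN for WDBC and WPBC.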

Discussion

In this study, a GA-based feature selection model was designed to identify relevant features. Compared with other feature selection algorithms, the GA is a more recent development. A GA can be useful for feature selection when the problem has an exponential search space. Many advantages of GAs for feature selection have been reported in the literature (25, 26). The comparison of average accuracies for the three classifiers (ANN, PS-classifier, GA-classifier) with and without feature selection on the WBC dataset showed that, without feature selection, the accuracy of the ANN (96.8%) is the best and the accuracy obtained by the PS-classifier is better than that of the GA-classifier (96.2% vs. 96.08%). Feature selection improved the accuracy of all classifiers except the ANN, and the best accuracy with feature selection was achieved by the PS-classifier (96.9%). The results also show that specificity and sensitivity were generally improved by feature selection. Table 7 compares the classification accuracies of other published studies that used different feature selection methods with the accuracies obtained by the ANN, PS-classifier and GA-classifier in this work on the WBC dataset.
Table 7

Comparison of experimental results of proposed method and other papers in WBC

| Classifier (reference) | CART (27) | AR+NN (28) | RS-SVM (29) | SVM (30) | Graph-based (31) | ANN (this study) | PS-classifier (this study) | GA-classifier (this study) |
|---|---|---|---|---|---|---|---|---|
| Classification accuracy | 96.9 | 97.4 | 96.8 | 96.5 | 96.4 | 96.7 | 96.9 | 96.6 |

For the WDBC dataset, the ANN classifier shows the best accuracy without feature selection (96.5%). Table 5 shows that the ANN accuracy on WDBC is higher than the PS-classifier and GA-classifier accuracies (96.4% and 96.1%, respectively). Feature selection improved the accuracy of all three classifiers, and the best accuracy with feature selection was achieved by the ANN (97.3%). Table 5 also shows that specificity and sensitivity can improve after feature selection. Table 8 compares the classification accuracies of other published studies that used different feature selection methods with the accuracies obtained in this work on the WDBC dataset.
Table 8

Comparison of experimental results of proposed method and other papers in WDBC

| Classifier (reference) | CART (27) | RBF_FS (32) | FRNN_FS (32) | FS_SFS (33) | ANN (this study) | PS-classifier (this study) | GA-classifier (this study) |
|---|---|---|---|---|---|---|---|
| Classification accuracy | 94.7 | 96.05 | 95.88 | 93.0 | 97.3 | 97.2 | 96.6 |

The comparison of average accuracies for the described classifiers with and without feature selection on WPBC showed that, without feature selection, the accuracy of the PS-classifier (77.8%) is the best and the accuracy obtained by the ANN is better than that of the GA-classifier (77.4% vs. 76.3%). Feature selection improved the accuracy of all three classifiers, and the best accuracy with feature selection was achieved by the ANN (79.2%). As can be seen from Table 6, specificity and sensitivity also improved after feature selection. The results on this dataset are comparable with those of other studies (35). Table 9 compares the classification accuracies of other published studies that used different feature selection methods with the accuracies obtained by the three classifiers in this work on the WPBC dataset.
Table 9

Comparison of experimental results of proposed method and other papers in WPBC

| Classifier (reference) | CART (27) | Naïve Bayes-ReliefF (34) | Naïve Bayes-Fisher Filtering (34) | ANN (this study) | PS-classifier (this study) | GA-classifier (this study) |
|---|---|---|---|---|---|---|
| Classification accuracy | 73.3 | 77.74 | 75.25 | 79.2 | 78.2 | 78.1 |

It should be noted that, while data mining can facilitate the analysis of large databases and help medical staff in decision making, its limitations should be kept in mind. Data mining techniques can discover patterns buried in data, but they cannot replace a physician’s insight (36). In addition, an increase in the number of features can reduce the speed of the algorithm, so identifying patterns may be time consuming.

Conclusion

In this paper, we proposed a GA-based feature selection method for selecting the best subset of features for a breast cancer diagnosis system. The ANN, PS-classifier and GA-classifier were used to evaluate the proposed feature selection method on the Wisconsin breast cancer datasets. On WBC, classification with the PS-classifier was superior to the other classifiers. On WDBC and WPBC, the ANN achieved the best accuracy. The results show that feature selection can improve the accuracy of classifiers. The results of this study are comparable with those of other studies on the Wisconsin breast cancer datasets.