| Literature DB >> 34373776 |
Sunil Kumar Prabhakar, Harikumar Rajaguru, Dong-Ok Won.
Abstract
In the field of bioinformatics, feature selection in classification of cancer is a primary area of research and utilized to select the most informative genes from thousands of genes in the microarray. Microarray data is generally noisy, is highly redundant, and has an extremely asymmetric dimensionality, as the majority of the genes present here are believed to be uninformative. The paper adopts a methodology of classification of high dimensional lung cancer microarray data utilizing feature selection and optimization techniques. The methodology is divided into two stages; firstly, the ranking of each gene is done based on the standard gene selection techniques like Information Gain, Relief-F test, Chi-square statistic, and T-statistic test. As a result, the gathering of top scored genes is assimilated, and a new feature subset is obtained. In the second stage, the new feature subset is further optimized by using swarm intelligence techniques like Grasshopper Optimization (GO), Moth Flame Optimization (MFO), Bacterial Foraging Optimization (BFO), Krill Herd Optimization (KHO), and Artificial Fish Swarm Optimization (AFSO), and finally, an optimized subset is utilized. The selected genes are used for classification, and the classifiers used here are Naïve Bayesian Classifier (NBC), Decision Trees (DT), Support Vector Machines (SVM), and K-Nearest Neighbour (KNN). The best results are shown when Relief-F test is computed with AFSO and classified with Decision Trees classifier for hundred genes, and the highest classification accuracy of 99.10% is obtained.Entities:
Mesh:
Year: 2021 PMID: 34373776 PMCID: PMC8349254 DOI: 10.1155/2021/6680424
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
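As a rough illustration of the stage-1 filter ranking described in the abstract (not the authors' code; the data shapes and the use of scikit-learn are assumptions), the top-k genes can be selected with a filter statistic such as Chi-square:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Illustrative stage-1 filter ranking: score every gene with a filter
# statistic and keep the top-k. Chi-square requires non-negative inputs,
# so microarray intensities are typically min-max scaled first.
rng = np.random.default_rng(0)
X = rng.random((181, 500))          # 181 samples, 500 hypothetical genes
y = rng.integers(0, 2, size=181)    # binary labels (e.g., ADCA vs. MPM)

k = 100
selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
X_top = selector.transform(X)       # new feature subset of the top-k genes
print(X_top.shape)                  # (181, 100)
```

In the paper's pipeline this ranking is repeated for each of the four filter statistics before the union of top-scored genes is passed to the stage-2 optimizers.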
Dataset details.
| Dataset | Number of genes | Class 1 (ADCA) | Class 2 (MPM) | Total samples |
|---|---|---|---|---|
| Lung Harvard 2 | 12533 | 150 | 31 | 181 |
Figure 1: Pictorial representation of the work.
Algorithm 1: D function execution and termination.
Algorithm 2: BFO.
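The paper's Algorithm 2 presents BFO. A heavily simplified sketch of the bacterial foraging idea (chemotaxis and swimming only; the full algorithm adds reproduction and elimination-dispersal loops, and the paper applies it to gene subsets rather than a toy continuous function) might look like:

```python
import numpy as np

# Simplified Bacterial Foraging Optimization sketch (assumed parameters,
# chemotaxis + swim only). Each bacterium tumbles in a random direction
# and keeps swimming along it while the objective improves.
def bfo_minimize(f, dim, n_bacteria=20, n_chem=50, n_swim=4, step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, size=(n_bacteria, dim))
    cost = np.array([f(p) for p in pos])
    for _ in range(n_chem):                  # chemotaxis iterations
        for i in range(n_bacteria):
            d = rng.standard_normal(dim)
            d /= np.linalg.norm(d)           # random tumble direction
            for _ in range(n_swim):          # swim while improving
                trial = pos[i] + step * d
                c = f(trial)
                if c < cost[i]:
                    pos[i], cost[i] = trial, c
                else:
                    break
    best = int(cost.argmin())
    return pos[best], cost[best]

sphere = lambda x: float(np.sum(x ** 2))     # toy objective, minimum at 0
best_x, best_c = bfo_minimize(sphere, dim=5)
```

For gene selection, the continuous positions would instead encode candidate gene subsets and `f` would reward classification accuracy, as in the tables below.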
Performance analysis of classifiers in terms of classification accuracies (%) with Grasshopper optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 82.29 | 98.96 | 76 | 78.12 | 83.59 | 83.59 | 76 | 83.59 | 77.08 | 76 | 76 | 77.08 |
| Relief–F test | 85.93 | 91.67 | 83.59 | 83.59 | 78.12 | 89.6 | 78.12 | 98.96 | 95.83 | 87.5 | 78.12 | 77.08 |
| Chi-square test | 97.91 | 77.08 | 83.59 | 95.83 | 93.75 | 77.08 | 82.29 | 83.59 | 91.67 | 76 | 77.08 | 83.59 |
| T-statistic test | 77.08 | 82.29 | 82.29 | 97.91 | 95.83 | 78.12 | 83.59 | 93.75 | 95.83 | 77.08 | 76 | 85.93 |
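The accuracies in these tables come from evaluating a classifier on each optimized gene subset. An assumed (not stated in the paper) but common form of the stage-2 wrapper fitness is the cross-validated accuracy of the downstream classifier on the candidate genes, sketched here on synthetic data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative wrapper fitness: score a candidate gene subset by the
# cross-validated accuracy of the classifier restricted to those genes.
def subset_fitness(X, y, gene_idx, clf=None, cv=5):
    clf = clf or DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, gene_idx], y, cv=cv).mean()

rng = np.random.default_rng(1)
X = rng.random((181, 300))                 # synthetic stand-in for the data
y = rng.integers(0, 2, size=181)
acc = subset_fitness(X, y, gene_idx=np.arange(100))  # score first 100 genes
```

A swarm optimizer such as BFO or AFSO would call this fitness for each candidate subset and keep the best-scoring one.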
Performance analysis of classifiers in terms of classification accuracies (%) with Moth flame optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 96.34 | 89.6 | 93.75 | 85.93 | 83.59 | 93.26 | 91.67 | 97.91 | 89.6 | 89.6 | 83.59 | 95.83 |
| Relief–F test | 93.75 | 82.29 | 91.67 | 98.01 | 97.91 | 86.19 | 85.93 | 78.12 | 95.83 | 82.29 | 97.91 | 89.6 |
| Chi-square test | 93.75 | 93.75 | 90.74 | 91.67 | 85.93 | 84.10 | 97.91 | 86.40 | 85.93 | 85.93 | 97.91 | 91.67 |
| T-statistic test | 85.93 | 91.67 | 95.83 | 85.93 | 93.75 | 84.75 | 96.15 | 85.93 | 91.67 | 85.93 | 85.93 | 97.91 |
Performance analysis of classifiers in terms of classification accuracies (%) with Bacterial foraging optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 97.91 | 96.85 | 91.64 | 95.83 | 85.93 | 85.93 | 85.02 | 82.24 | 89.92 | 85.93 | 87.12 | 93.75 |
| Relief–F test | 97.91 | 93.75 | 89.6 | 89.6 | 91.67 | 83.59 | 97.91 | 83.81 | 87.30 | 95.83 | 97.91 | 97.91 |
| Chi-square test | 93.75 | 89.6 | 91.67 | 98.56 | 90.53 | 97.33 | 97.72 | 85.93 | 97.91 | 97.91 | 95.83 | 95.83 |
| T-statistic test | 91.67 | 93.75 | 85.93 | 84.41 | 95.83 | 98.04 | 86.71 | 97.91 | 84.95 | 95.83 | 93.75 | 89.6 |
Performance analysis of classifiers in terms of classification accuracies (%) with Krill Herd optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 97.91 | 91.67 | 81.57 | 86.58 | 89.6 | 96.76 | 91.89 | 93.75 | 97.91 | 92.05 | 91.40 | 96.92 |
| Relief–F test | 89.6 | 97.91 | 83.59 | 82.29 | 85.93 | 87.37 | 98.38 | 79.21 | 77.47 | 83.46 | 90.97 | 84.25 |
| Chi-square test | 93.75 | 93.27 | 81.72 | 97.12 | 82.59 | 92.12 | 88.55 | 78.99 | 98.69 | 87.88 | 78.61 | 88.91 |
| T-statistic test | 95.83 | 97.91 | 91.67 | 95.83 | 96.93 | 95.83 | 94.72 | 85.06 | 96.67 | 86.73 | 88.03 | 98.96 |
Performance analysis of classifiers in terms of classification accuracies (%) with Artificial fish swarm optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 90.24 | 91.67 | 89.05 | 88.41 | 86.82 | 81.03 | 94.90 | 93.75 | 93.75 | 93.75 | 89.6 | 93.75 |
| Relief–F test | 95.96 | 85.93 | 77.08 | 89.79 | 99.10 | 86.42 | 95.83 | 97.91 | 97.91 | 89.6 | 78.59 | 80.56 |
| Chi-square test | 82.59 | 83.48 | 78.12 | 84.10 | 95.34 | 94.61 | 95.83 | 98.63 | 93.75 | 94.49 | 97.91 | 84.39 |
| T-statistic test | 93.75 | 89.75 | 87.5 | 86.78 | 92.29 | 89.33 | 89.6 | 97.91 | 97.91 | 95.83 | 94.79 | 97.91 |
Performance analysis of classifiers in terms of PI (%) with Grasshopper optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 45.16 | 97.87 | 7.69 | 22.22 | 51.16 | 51.16 | 7.69 | 51.16 | 15.36 | 7.69 | 7.69 | 15.36 |
| Relief-F test | 60.87 | 80.01 | 51.16 | 51.16 | 22.22 | 78.93 | 22.22 | 97.87 | 91.58 | 66.66 | 22.22 | 15.36 |
| Chi-square test | 95.65 | 15.36 | 51.16 | 91.58 | 85.7 | 15.36 | 45.16 | 51.16 | 80.01 | 7.69 | 15.36 | 51.16 |
| T-statistic test | 15.36 | 45.16 | 45.16 | 95.65 | 91.58 | 22.22 | 51.16 | 85.7 | 91.58 | 15.36 | 7.69 | 60.87 |
Performance analysis of classifiers in terms of PI (%) with Moth flame optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 92.45 | 78.93 | 85.7 | 60.87 | 51.16 | 84.31 | 80.01 | 95.65 | 78.93 | 78.93 | 51.16 | 91.58 |
| Relief-F test | 85.7 | 45.16 | 80.01 | 95.85 | 95.65 | 61.14 | 60.87 | 22.22 | 91.58 | 45.16 | 95.65 | 78.93 |
| Chi-square test | 85.7 | 85.7 | 77.62 | 80.01 | 60.87 | 53.35 | 95.65 | 62.50 | 60.87 | 60.87 | 95.65 | 80.01 |
| T-statistic test | 60.87 | 80.01 | 91.58 | 60.87 | 85.7 | 56.08 | 92.17 | 60.87 | 80.01 | 60.87 | 60.87 | 95.65 |
Performance analysis of classifiers in terms of PI (%) with Bacterial foraging optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 95.65 | 93.17 | 79.93 | 91.58 | 60.87 | 60.87 | 57.22 | 45.09 | 78.31 | 60.87 | 64.03 | 85.7 |
| Relief-F test | 95.65 | 85.7 | 78.93 | 78.93 | 80.01 | 51.16 | 95.65 | 52.11 | 63.85 | 91.58 | 95.65 | 95.65 |
| Chi-square test | 85.7 | 78.93 | 80.01 | 97.00 | 77.11 | 94.45 | 94.96 | 60.87 | 95.65 | 95.65 | 91.58 | 91.58 |
| T-statistic test | 80.01 | 85.7 | 60.87 | 52.98 | 91.58 | 95.92 | 63.76 | 95.65 | 56.00 | 91.58 | 85.7 | 78.93 |
Performance analysis of classifiers in terms of PI (%) with Krill Herd optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 95.65 | 80.01 | 39.87 | 62.47 | 78.93 | 93.44 | 80.36 | 85.7 | 95.65 | 81.07 | 79.21 | 93.28 |
| Relief-F test | 78.93 | 95.65 | 51.16 | 45.16 | 60.87 | 64.68 | 96.68 | 28.94 | 17.93 | 50.06 | 78.04 | 53.72 |
| Chi-square test | 85.7 | 84.41 | 42.31 | 93.60 | 46.56 | 81.00 | 72.79 | 27.54 | 97.28 | 68.96 | 24.01 | 74.91 |
| T-statistic test | 91.58 | 95.65 | 80.01 | 91.58 | 93.22 | 91.58 | 88.45 | 57.42 | 92.69 | 63.08 | 62.83 | 97.87 |
Performance analysis of classifiers in terms of PI (%) with Artificial fish swarm optimization for different gene selection techniques using 50–200 selected genes.
| Method | NBC (50) | NBC (100) | NBC (200) | DT (50) | DT (100) | DT (200) | SVM (50) | SVM (100) | SVM (200) | KNN (50) | KNN (100) | KNN (200) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Information gain | 77.67 | 80.01 | 75.76 | 72.02 | 64.12 | 37.59 | 88.01 | 85.7 | 85.7 | 85.7 | 78.93 | 85.7 |
| Relief-F test | 91.26 | 60.87 | 15.36 | 78.55 | 98.16 | 62.67 | 91.58 | 95.65 | 95.65 | 78.93 | 25.11 | 35.42 |
| Chi-square test | 46.56 | 50.67 | 22.22 | 53.37 | 90.08 | 88.09 | 91.58 | 97.14 | 85.7 | 87.6 | 95.65 | 54.61 |
| T-statistic test | 85.7 | 78.67 | 66.66 | 65.91 | 82.65 | 77.2 | 78.93 | 95.65 | 95.65 | 91.58 | 88.64 | 95.65 |