Babak Nouri-Moghaddam, Mehdi Ghazanfari, Mohammad Fathian.
Abstract
Microarray technology is one of the most important tools for collecting DNA expression data. It allows researchers to investigate diseases and their origins. However, microarray data typically combine a small sample size, a large number of genes, and imbalanced data, among other issues, which makes classification models inefficient. This paper therefore presents a new hybrid solution, based on a multi-filter and an adaptive chaotic multi-objective forest optimization algorithm (AC-MOFOA), to solve the gene selection problem and construct an ensemble classifier. In the proposed solution, a multi-filter model (i.e., an ensemble filter) serves as a preprocessing step that reduces the dataset's dimensions: five filter methods are used to remove redundant and irrelevant genes, and their results are combined using a voting-based function. The results indicate that the proposed multi-filter is effective at reducing the gene subset size and selecting relevant genes. Then, AC-MOFOA, based on the concepts of non-dominated sorting, crowding distance, chaos theory, and adaptive operators, is presented. As a wrapper method, AC-MOFOA aims to simultaneously reduce the dataset's dimensions, optimize the kernel extreme learning machine (KELM), and increase classification accuracy. Next, an ensemble classifier model is built from the AC-MOFOA results to classify microarray data. The performance of the proposed algorithm was evaluated on nine public microarray datasets, and its results were compared with five hybrid multi-objective methods and three hybrid single-objective methods in terms of the number of selected genes, classification efficiency, execution time, time complexity, hypervolume indicator, and spacing metric.
According to the results, the proposed hybrid method increased the accuracy of KELM on most datasets by reducing the dataset's dimensions and achieved similar or superior performance compared to other multi-objective methods. Furthermore, the proposed ensemble classifier provided better classification accuracy and generalizability than conventional ensemble methods on seven of the nine microarray datasets. Moreover, a comparison with three state-of-the-art ensemble generation methods indicates competitive performance: the proposed ensemble model achieved better results on five of the nine datasets. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s00521-021-06459-9.
Keywords: DNA microarray data; Ensemble classification; Forest optimization algorithm; Gene selection; Hybrid method; Multi-filter; Multi-objective wrapper
Year: 2021 PMID: 34539088 PMCID: PMC8435304 DOI: 10.1007/s00521-021-06459-9
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.606
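The voting-based combination of filter results mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so the majority threshold (`min_votes`), the helper names, and the toy scores below are all assumptions made for illustration only.

```python
from collections import Counter

def top_fraction(scores, fraction):
    """Return the indices of the genes in the top `fraction` by score (higher is better)."""
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return set(ranked[:k])

def vote_select(score_lists, fraction=0.30, min_votes=3):
    """Keep genes that at least `min_votes` filters place in their top fraction.

    The majority rule is an assumption; the source only says "voting-based".
    """
    votes = Counter()
    for scores in score_lists:
        votes.update(top_fraction(scores, fraction))
    return sorted(g for g, v in votes.items() if v >= min_votes)

# Toy relevance scores for 6 genes from 5 hypothetical filter methods.
filters = [
    [0.9, 0.1, 0.8, 0.2, 0.3, 0.7],
    [0.8, 0.2, 0.9, 0.1, 0.4, 0.3],
    [0.7, 0.3, 0.6, 0.2, 0.9, 0.1],
    [0.9, 0.8, 0.1, 0.2, 0.3, 0.4],
    [0.6, 0.1, 0.9, 0.2, 0.3, 0.8],
]
selected = vote_select(filters, fraction=0.5, min_votes=3)
print(selected)
```

In this toy run, a gene survives only if at least three of the five filters rank it in their top half; a stricter or looser threshold trades subset size against agreement between filters.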
Fig. 1 The flowchart of the proposed method
Selected datasets summary
| Datasets | # of genes | # of Classes | # of Samples |
|---|---|---|---|
| SRBCT | 2308 | 4 | 83 |
| Tumors_9 | 5726 | 9 | 60 |
| Leukaemia3 | 7129 | 3 | 72 |
| Colon_Prostate | 10,937 | 2 | 355 |
| Lung | 12,601 | 5 | 203 |
| GCM | 16,064 | 14 | 190 |
| Breast | 24,482 | 2 | 97 |
| Rsctc_5 | 54,614 | 4 | 89 |
| Rsctc_6 | 59,005 | 5 | 92 |
Fig. 2 Tree structure in AC-MOFOA
Fig. 3 Example of a local seeding operator with
Fig. 4 Example of global seeding operator performance in AC-MOFOA with
Parameters and settings of selected algorithms
| Algorithms | Representation | Operators | Parameters |
|---|---|---|---|
| GA [ | Binary | 2-point Cross-over, uniform mutation | |
| Adaptive GA [ | Binary | 2-point Cross-over, Conditional mutation | |
| TLBOGSA [ | Binary | Presented operators | |
| MOBBBO [ | Binary | Basic Habitat Migration and Mutation Strategy | |
| C-HMOSHSSA [ | Continuous | Standard Spotted Hyena and salp swarm operators | |
| MOCEPO [ | Continuous | Standard Emperor Penguin operators | |
| MOSSO [ | Continuous | Standard SSO operators | |
| NSPSO [ | Continuous | Standard PSO operators | |
| AC-MOFOA | Continuous | Adaptive Chaotic Local and Global seeding | |
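The record does not say which chaotic map drives the "adaptive chaotic" seeding operators. As a hedged illustration of how a chaotic sequence can replace uniform random numbers when picking positions to perturb, the sketch below uses the logistic map, which is one common choice and is an assumption here; `chaotic_index` is a hypothetical helper.

```python
def logistic_map(x0, n, r=4.0):
    """Generate n values of the logistic map x_{k+1} = r * x_k * (1 - x_k)."""
    values = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        values.append(x)
    return values

def chaotic_index(x, size):
    """Map a chaotic value in [0, 1] to an index in range(size)."""
    return min(int(x * size), size - 1)

# Pick 5 positions in a 10-dimensional solution vector from a chaotic sequence.
seq = logistic_map(0.7, 5)
positions = [chaotic_index(x, 10) for x in seq]
print(positions)
```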
The classification accuracy results of different filter methods based on the top genes selection threshold (i.e., 25%, 30%, and 35%)
| Filter method | Type | Selection threshold | Tumors_9 | Leukaemia3 | Colon_Prostate | Breast | Rsctc_6 |
|---|---|---|---|---|---|---|---|
| Information gain (IG) | Uni | 35% | 77.89 | 96.03 | 75.11 | 76.71 | |
| 30% | 96.58 | 77.46 | |||||
| 25% | 79.73 | 95.91 | 95.76 | 76.51 | 77.27 | ||
| Gain ratio | Uni | 35% | 75.45 | 93.91 | 95.29 | 73.15 | |
| 30% | 77.24 | 94.47 | 95.56 | 75.54 | 74.31 | ||
| 25% | 76.11 | 94.26 | 94.93 | 77.82 | 73.89 | ||
| Symmetrical Uncertainty | Uni | 35% | 73.28 | 95.38 | 94.85 | 68.29 | 74.43 |
| 30% | 76.97 | 94.67 | 94.28 | 71.68 | 76.2 | ||
| 25% | 75.92 | 94.23 | 94.44 | 72.28 | |||
| Fisher Score | Uni | 35% | 78.59 | 96.09 | 96.37 | 76.7 | 75.28 |
| 30% | 77.46 | 75.83 | 76.37 | ||||
| 25% | 96.4 | 96.2 | 75.11 | ||||
| Chi-square | Uni | 35% | 74.51 | 94.43 | 96.61 | 69.39 | 75.64 |
| 30% | 78.37 | 95.65 | 96.03 | 72.47 | 74.56 | ||
| 25% | 75.88 | 95.33 | 95.72 | 71.56 | 76.04 | ||
| ReliefF | Uni | 35% | 77.85 | 95.82 | 96.67 | 76.23 | 76.73 |
| 30% | 96.14 | ||||||
| 25% | 77.85 | 94.83 | 96.37 | 77.74 | 77.16 | ||
| Correlation | Uni | 35% | 78.37 | 93.13 | 95.93 | 70.82 | 73.92 |
| 30% | 77.72 | 94.59 | 96.33 | 73.96 | 74.73 | ||
| 25% | 75.93 | 93.94 | 96.65 | 72.09 | 75.28 | ||
| Correlation-based feature selection (CFS) | Multi | 35% | 96.37 | 97.18 | 77.15 | 75.66 | |
| 30% | 78.98 | ||||||
| 25% | 75.93 | 96.49 | 96.94 | 78.12 | 77.92 | ||
| Fast Correlation-based feature selection (FCBF) | Multi | 35% | 76.73 | 96.37 | 96.65 | 74.33 | 74.86 |
| 30% | 76.59 | 96.51 | 77.23 | 76.52 | |||
| 25% | 75.93 | 96.13 | 96.94 | 76.64 | 78.21 | ||
| Minimum-Redundancy-Maximum-Relevance (mRMR) | Multi | 35% | 79.31 | 96.26 | 97.42 | 77.29 | 76.97 |
| 30% | 78.67 | ||||||
| 25% | 79.02 | 96.47 | 97.29 | 78.48 |
Result of applying multi-filter step
| Datasets | # of genes | KELM acc | # of selected genes by multi-filter | % gene reduction ratio | KELM acc. on selected genes |
|---|---|---|---|---|---|
| SRBCT | 2308 | 88.23 | 472 | 79.5% | 90.72 |
| Tumors_9 | 5726 | 78.98 | 1209 | 78.8% | 84.41 |
| Leukaemia3 | 7129 | 97.22 | 1836 | 74.2% | 97.66 |
| Colon_Prostate | 10,937 | 95.74 | 2081 | 80.9% | 97.89 |
| Lung | 12,601 | 90.24 | 2745 | 78.21% | 93.97 |
| GCM | 16,064 | 70.53 | 3653 | 77.25% | 78.03 |
| Breast | 24,482 | 68.04 | 5992 | 75.52% | 86.53 |
| Rsctc_5 | 54,614 | 69.66 | 14,507 | 73.43% | 72.38 |
| Rsctc_6 | 59,005 | 81.523 | 15,511 | 73.71% | 82.69 |
Comparison of the proposed multi-filter with other ensemble filters based on the KELM accuracy
| Datasets | Multi-filter method [ | Multi-filter method [ | Multi-filter method [ | Proposed multi-filter |
|---|---|---|---|---|
| SRBCT | 89.93 | 86.06 | 88.69 | |
| Tumors_9 | 82.37 | 81.61 | 79.54 | |
| Leukaemia3 | 96.85 | 96.24 | 95.77 | |
| Colon_Prostate | 95.49 | 96.23 | ||
| Lung | 92.54 | 91.62 | 89.51 | |
| GCM | 77.83 | 76.86 | 78.03 | |
| Breast | 85.41 | 82.92 | ||
| Rsctc_5 | 72.38 | 69.19 | 70.18 | |
| Rsctc_6 | 82.08 | 80.57 | 81.73 |
Fig. 5 Comparing AC-MOFOA with other hybrid single-objective algorithms based on non-dominated solutions and average Pareto front on the test set. AC-MOFOA-ND represents the non-dominated solutions, and AC-MOFOA-Ave is the average Pareto front
Fig. 6 Comparing AC-MOFOA with other multi-objective algorithms based on non-dominated solutions on the test set
Fig. 7 Comparison of the classification accuracy distribution of AC-MOFOA solutions on the test set
Fig. 8 Comparison of the number of selected genes distributions of AC-MOFOA solutions on the test set
Fig. 9 Comparison of AC-MOFOA with other multi-objective algorithms based on non-dominated solutions on the train set
Fig. 10 Comparison of the classification accuracy distribution of AC-MOFOA solutions on the train set
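The figures above compare non-dominated solutions, and the abstract names non-dominated sorting and crowding distance as building blocks of AC-MOFOA. The sketch below shows the generic NSGA-II-style versions of both ideas for two objectives (fewer genes, higher accuracy). It is not the authors' implementation, and the sample solutions are invented for illustration.

```python
def dominates(a, b):
    """a and b are (n_genes, accuracy); fewer genes and higher accuracy are better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    better = a[0] < b[0] or a[1] > b[1]
    return no_worse and better

def non_dominated(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def crowding_distance(front):
    """NSGA-II style crowding distance for each point of a single front."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary points are always kept, so they get infinite distance.
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist

# Invented (n_genes, accuracy) pairs.
solutions = [(10, 0.90), (20, 0.95), (15, 0.88), (30, 0.96), (25, 0.94)]
front = non_dominated(solutions)
distances = crowding_distance(front)
print(front, distances)
```

A larger crowding distance marks a solution in a sparser region of the front, which is why it is commonly used to preserve diversity when truncating a population.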
SCC measure comparison on the test set
| Datasets | AC-MOFOA | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|---|
| SRBCT | 4 | 0 | 2 | 1 | 0 | 0 |
| Tumors_9 | 3 | 0 | 0 | 0 | 1 | 0 |
| Leukaemia3 | 3 | 1 | 0 | 0 | 0 | 0 |
| Colon_Prostate | 4 | 0 | 0 | 0 | 0 | 0 |
| Lung | 4 | 2 | 0 | 0 | 0 | 0 |
| GCM | 6 | 0 | 0 | 3 | 0 | 0 |
| Breast | 3 | 0 | 0 | 2 | 0 | 0 |
| Rsctc_5 | 6 | 0 | 0 | 4 | 0 | 0 |
| Rsctc_6 | 6 | 1 | 0 | 4 | 0 | 0 |
SCC measure comparison on the train set
| Datasets | AC-MOFOA | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|---|
| SRBCT | 2 | 1 | 0 | 0 | 0 | 0 |
| Tumors_9 | 3 | 0 | 0 | 1 | 0 | 0 |
| Leukaemia3 | 2 | 2 | 0 | 0 | 0 | 0 |
| Colon_Prostate | 4 | 1 | 0 | 0 | 0 | 0 |
| Lung | 3 | 2 | 0 | 0 | 0 | 0 |
| GCM | 6 | 0 | 0 | 2 | 0 | 0 |
| Breast | 2 | 0 | 0 | 3 | 0 | 0 |
| Rsctc_5 | 6 | 0 | 0 | 4 | 0 | 0 |
| Rsctc_6 | 9 | 1 | 0 | 3 | 0 | 0 |
T-test of hypervolume ratios in the data
| Datasets | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|
| SRBCT | + | + | + | + | + |
| Tumors_9 | + | + | + | + | + |
| Leukaemia3 | + | + | + | + | + |
| Colon_Prostate | + | + | + | + | + |
| Lung | = | + | + | + | + |
| GCM | + | + | = | + | + |
| Breast | + | + | = | + | + |
| Rsctc_5 | + | + | = | + | + |
| Rsctc_6 | + | + | = | + | + |
T-test of hypervolume ratios in the data
| Datasets | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|
| SRBCT | = | + | + | + | + |
| Tumors_9 | + | + | + | + | + |
| Leukaemia3 | = | + | + | + | + |
| Colon_Prostate | + | + | + | + | + |
| Lung | = | + | + | + | + |
| GCM | + | + | = | + | + |
| Breast | + | + | = | + | + |
| Rsctc_5 | + | + | = | + | + |
| Rsctc_6 | + | + | + | + | + |
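For readers unfamiliar with the hypervolume indicator behind these t-tests, the sketch below computes it for a two-objective front. It assumes both objectives are expressed as quantities to minimize (for example, fraction of genes kept and error rate) and uses a reference point of (1.0, 1.0); the objective encoding, the reference point, and the sample front are assumptions, not values from the source.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref`; both objectives are minimized."""
    area = 0.0
    best_f2 = ref[1]
    for f1, f2 in sorted(points):
        if f1 >= ref[0] or f2 >= best_f2:
            continue  # outside the reference box, or dominated by an earlier point
        area += (ref[0] - f1) * (best_f2 - f2)
        best_f2 = f2
    return area

# Invented front: (fraction of genes kept, classification error rate).
front = [(0.1, 0.10), (0.2, 0.05), (0.3, 0.04)]
hv = hypervolume_2d(front, ref=(1.0, 1.0))
print(round(hv, 3))
```

A larger hypervolume means the front covers more of the objective space relative to the reference point, so it rewards both convergence and spread.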
T-test of spacing metric in the data
| Datasets | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|
| SRBCT | = | + | + | + | + |
| Tumors_9 | = | = | = | = | + |
| Leukaemia3 | = | = | = | + | + |
| Colon_Prostate | = | + | = | + | + |
| Lung | + | + | + | + | + |
| GCM | = | + | = | = | + |
| Breast | = | = | = | + | + |
| Rsctc_5 | = | + | = | = | = |
| Rsctc_6 | + | + | = | + | + |
T-test of spacing metric in the data
| Datasets | MOCEPO | MOBBBO | C-HMOSHSSA | MOSSO | NSPSO |
|---|---|---|---|---|---|
| SRBCT | = | = | = | + | + |
| Tumors_9 | = | = | = | = | + |
| Leukaemia3 | = | + | = | = | + |
| Colon_Prostate | = | + | = | + | + |
| Lung | + | + | + | + | + |
| GCM | = | + | = | - | + |
| Breast | + | + | = | + | + |
| Rsctc_5 | = | + | = | = | + |
| Rsctc_6 | + | + | = | = | + |
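The spacing metric compared in these tables measures how evenly solutions are distributed along a front. The sketch below follows Schott's commonly used definition (standard deviation of nearest-neighbour L1 distances); whether the source uses exactly this variant is an assumption, and the two sample fronts are invented.

```python
import math

def spacing(front):
    """Schott's spacing metric: spread of nearest-neighbour L1 distances."""
    n = len(front)
    if n < 2:
        return 0.0
    # Distance from each point to its nearest neighbour on the front.
    d = [
        min(
            sum(abs(a - b) for a, b in zip(p, q))
            for j, q in enumerate(front)
            if j != i
        )
        for i, p in enumerate(front)
    ]
    mean = sum(d) / n
    return math.sqrt(sum((mean - di) ** 2 for di in d) / (n - 1))

even = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
uneven = [(0.0, 1.0), (0.1, 0.9), (1.0, 0.0)]
print(spacing(even), spacing(uneven))
```

A value of zero means all neighbouring solutions are equally spaced, so lower values indicate a more uniform front.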
Fig. 11 Comparison of the mean execution time of AC-MOFOA with other multi-objective algorithms (in seconds)
Comparison results of proposed Ensemble Classifier with conventional ensemble methods
| Datasets | KELM | Proposed ensemble classifier | Random forest | Adaboost | Bagging |
|---|---|---|---|---|---|
| SRBCT | 88.23 | 91.97 | 95.18 | 94.59 | |
| Tumors_9 | 78.98 | 92.67 | 93.15 | 90.6 | |
| Leukaemia3 | 97.22 | 97.86 | 96.74 | 94.13 | |
| Colon_Prostate | 95.74 | 96.08 | 95.27 | 95.78 | |
| Lung | 90.24 | 97.6 | 92.14 | 91.61 | |
| GCM | 70.53 | 74.24 | 72.05 | 76.19 | |
| Breast | 68.04 | 82.51 | 88.36 | 91.31 | |
| Rsctc_5 | 69.66 | 70.48 | 71.66 | 71.98 | |
| Rsctc_6 | 81.523 | 79.73 | 80.73 | 82.64 |
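This record does not describe how the proposed ensemble combines its member classifiers. Purely as an illustration of how several classifiers (for example, models trained on different non-dominated gene subsets) can be merged and scored, the sketch below uses plain majority voting, which is an assumption and not necessarily the authors' rule. The labels and predictions are invented.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample."""
    combined = []
    for sample_labels in zip(*predictions):
        counts = Counter(sample_labels)
        top = max(counts.values())
        # Ties go to the earliest classifier that voted for a tied label.
        winner = next(lbl for lbl in sample_labels if counts[lbl] == top)
        combined.append(winner)
    return combined

def accuracy(y_true, y_pred):
    """Classification accuracy in percent."""
    return 100.0 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = ["A", "B", "A", "C"]
member_predictions = [
    ["A", "B", "B", "C"],
    ["A", "C", "A", "C"],
    ["B", "B", "A", "A"],
]
ensemble = majority_vote(member_predictions)
print(ensemble, accuracy(y_true, ensemble))
```

In this toy case each member makes at least one mistake, but the mistakes fall on different samples, so the vote recovers every label; that complementarity is the usual motivation for ensembles.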
Fig. 12 Comparison of the classification accuracy of the proposed ensemble approach with other recent ensemble approaches