Hanaa Fathi, Hussain AlSalman, Abdu Gumaei, Ibrahim I. M. Manhrawy, Abdelazim G. Hussien, Passent El-Kafrawy.
Abstract
Cancer is one of the leading causes of death worldwide. Gene expression profiling based on microarrays is one of the most effective tools for cancer diagnosis, prognosis, and treatment. For each data point (sample), a microarray typically measures the expression of tens of thousands of genes, so the resulting data are large-scale, high-dimensional, and highly redundant, and classifying gene expression profiles is considered an NP-hard problem. Feature (gene) selection is one of the most effective ways to handle it. This paper presents a hybrid cancer classification approach that combines several machine learning techniques: Pearson's correlation coefficient as a correlation-based feature selector and reducer, a Decision Tree classifier that is easy to interpret and requires little parameter tuning, and grid search with cross-validation (Grid Search CV) to optimize the maximum-depth hyperparameter. Seven standard microarray cancer datasets are used to evaluate the model, and various performance measures are employed to identify the most informative features, including classification accuracy, specificity, sensitivity, F1-score, and AUC. According to the results, the proposed approach greatly reduces the number of genes required for classification, selects the most informative features, and increases classification accuracy.
Year: 2021 PMID: 35003246 PMCID: PMC8731276 DOI: 10.1155/2021/7231126
Source DB: PubMed Journal: Comput Intell Neurosci
Review of previous studies on feature selection, optimization, and classification methods.
| Author | Datasets | Method | Remark |
|---|---|---|---|
| [ | DLBCL | Joint mutual information (JMI), minimum redundancy maximum relevance (mRMR), information gain (IG) | This research introduced a modern filter-based gene selection technique for detecting biomarkers from microarray data. |
| [ | Ovarian, leukemia, and central nervous system (CNS) | Relief-F, support vector machines (SVM), coevolutionary neural networks (CNN) | This research introduced a hybrid approach based on Relief-F and CNN for cancer diagnosis and classification. |
| [ | Colon tumor, ALL, AML, CNS, MLL | Binary black hole algorithm (BBHA) and random forest ranking (RFR) | The authors introduced gene selection and classification techniques for microarray data based on RFR and BBHA. |
| [ | Breast cancer | Binary particle swarm optimization (BPSO) and C4.5 Decision Tree | This research introduced BPSO and a Decision Tree for cancer detection based on microarray data classification. |
| [ | UCI | J48 Decision Trees | This research investigated hyperparameter tuning of the J48 Decision Tree induction algorithm. |
| [ | Breast cancer | Whale optimization algorithm (WOA) with an extremely randomized tree (BCD-WERT) | This research introduced a novel breast cancer detection model combining WOA-based feature selection with an extremely randomized tree classifier. |
| Srivastava, G. (2020) [ | UCI heart disease | Adaptive genetic fuzzy logic: a hybrid genetic algorithm (GA) with a fuzzy logic classifier | This research presented a hybrid GA and fuzzy logic classifier for heart disease diagnosis. |
| [ | Colon cancer, breast cancer, prostate cancer | Elastic Net with PSO | This research introduced parameter optimization of Elastic Net using the PSO algorithm for high-dimensional data. |
| [ | De novo acute myeloid leukemia | Recursive feature elimination (RFE) and tree-based feature selection (TBFS) | This research introduced multifeature selection with machine learning for de novo acute myeloid leukemia in Egypt. |
| [ | Breast cancer | AdaBoost, Gradient Boosting, random forest, logistic regression | This research introduced classification of microarray breast cancer data using machine learning methods. |
Figure 1. Microarray experiment steps.
Figure 2. Microarray data matrix structure.
Figure 3. The relation between Gini impurity, entropy, and misclassification error [22].
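The three split criteria compared in Figure 3 can be computed directly from a node's class probabilities. A minimal sketch (not the paper's code):

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), with 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def misclassification_error(p):
    """Misclassification error: 1 - max(p_i)."""
    return float(1.0 - np.max(np.asarray(p, dtype=float)))

# A pure node has zero impurity under all three measures; a maximally
# mixed binary node (p = [0.5, 0.5]) maximizes all three.
print(gini([0.5, 0.5]))                     # 0.5
print(entropy([0.5, 0.5]))                  # 1.0
print(misclassification_error([0.5, 0.5]))  # 0.5
```

All three peak at the uniform distribution, which is the relation Figure 3 illustrates; entropy and Gini are the criteria a Decision Tree implementation typically optimizes.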
Figure 4. Working of Grid Search cross-validation [26].
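The grid search of Figure 4 can be sketched with scikit-learn's GridSearchCV over the maximum-depth hyperparameter; the synthetic data, depth range, and fold count below are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # synthetic features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

# Exhaustive search over max_depth, scored by k-fold cross-validation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Each candidate depth is evaluated on every fold, and `best_params_` holds the depth with the highest mean cross-validated accuracy.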
Figure 5. Proposed model (PCC-DTCV).
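A minimal sketch of the PCC-DTCV idea (Figure 5) on synthetic data: filter genes by absolute Pearson correlation with the label, then cross-validate a Decision Tree on the retained genes. The data shapes, seed, and helper name `pcc_filter` are illustrative assumptions, not the authors' code:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def pcc_filter(X, y, threshold=0.4):
    """Keep columns whose |Pearson correlation| with the label meets the threshold."""
    X_c = X - X.mean(axis=0)
    y_c = y - y.mean()
    r = (X_c * y_c[:, None]).sum(axis=0) / (
        np.sqrt((X_c ** 2).sum(axis=0) * (y_c ** 2).sum()) + 1e-12
    )
    return np.abs(r) >= threshold

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 500))        # synthetic stand-in for expression data
y = (X[:, 0] > 0).astype(float)       # labels driven by the first "gene"

mask = pcc_filter(X, y, threshold=0.4)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X[:, mask], y, cv=5)
print(mask.sum(), round(scores.mean(), 3))
```

In the paper's experiments the threshold is varied over 0.4, 0.5, and 0.6; raising it shrinks the retained gene set, as the "Selected features" columns below show.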
Characterization of the datasets.
| Disease | Dataset | No. of samples | No. of features |
|---|---|---|---|
| Prostate cancer | D1, Singh [ | 102 | 12600 |
| Lung cancer | D2, Gordon [ | 181 | 12533 |
| Breast cancer | D3, Chowdary [ | 104 | 22283 |
| Colon cancer | D4, Alon [ | 62 | 2000 |
| Breast cancer | D5, Chin [ | 118 | 22215 |
| Breast cancer | D6, West [ | 49 | 7129 |
| Leukemia | D7, Golub [ | 72 | 7129 |
PCC-DTCV model with DT and PCC ≥ 0.4.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 299 | 0.77 ± 0.16 | 0.78 ± 0.16 | 0.78 ± 0.19 | 0.77 ± 0.20 | 0.80 | 0.79 | 0.79 | 0.80 |
| Gordon | 12533 | 743 | 0.93 ± 0.06 | 0.88 ± 0.11 | 0.97 ± 0.06 | 0.79 ± 0.24 | 0.95 | 0.71 | 0.95 | 0.73 |
| Chowdary | 22283 | 410 | 0.90 ± 0.09 | 0.89 ± 0.08 | 0.90 ± 0.15 | 0.88 ± 0.12 | 0.94 | 0.88 | 0.93 | 0.89 |
| Alon | 2000 | 61 | 0.79 ± 0.20 | 0.75 ± 0.22 | 0.62 ± 0.37 | 0.88 ± 0.20 | 0.64 | 0.82 | 0.65 | 0.81 |
| Chin | 22215 | 1211 | 0.81 ± 0.12 | 0.80 ± 0.11 | 0.78 ± 0.13 | 0.82 ± 0.17 | 0.79 | 0.85 | 0.77 | 0.86 |
| West | 7129 | 28 | 0.83 ± 0.15 | 0.84 ± 0.16 | 0.82 ± 0.23 | 0.87 ± 0.16 | 0.84 | 0.83 | 0.84 | 0.83 |
| Golub | 7129 | 465 | 0.89 ± 0.11 | 0.87 ± 0.13 | 0.93 ± 0.10 | 0.80 ± 0.21 | 0.89 | 0.76 | 0.88 | 0.78 |
PCC-DTCV model with optimized DT and PCC ≥ 0.4.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 299 | 0.76 ± 0.12 | 0.76 ± 0.12 | 0.74 ± 0.16 | 0.78 ± 0.21 | 0.80 | 0.81 | 0.80 | 0.81 |
| Gordon | 12533 | 743 | 0.93 ± 0.05 | 0.87 ± 0.11 | 0.97 ± 0.04 | 0.76 ± 0.23 | 0.97 | 0.84 | 0.97 | 0.84 |
| Chowdary | 22283 | 410 | 0.92 ± 0.09 | 0.92 ± 0.09 | 0.94 ± 0.12 | 0.90 ± 0.12 | 0.94 | 0.88 | 0.93 | 0.83 |
| Alon | 2000 | 61 | 0.82 ± 0.13 | 0.80 ± 0.14 | 0.73 ± 0.23 | 0.88 ± 0.17 | 0.73 | 0.82 | 0.71 | 0.84 |
| Chin | 22215 | 1211 | 0.79 ± 0.09 | 0.78 ± 0.10 | 0.74 ± 0.17 | 0.81 ± 0.12 | 0.72 | 0.85 | 0.73 | 0.85 |
| West | 7129 | 28 | 0.86 ± 0.13 | 0.87 ± 0.13 | 0.87 ± 0.21 | 0.87 ± 0.16 | 0.84 | 0.83 | 0.84 | 0.83 |
| Golub | 7129 | 465 | 0.86 ± 0.09 | 0.83 ± 0.12 | 0.92 ± 0.10 | 0.75 ± 0.21 | 0.94 | 0.80 | 0.92 | 0.83 |
PCC-DTCV model with DT and PCC ≥ 0.5.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 58 | 0.82 ± 0.11 | 0.82 ± 0.10 | 0.88 ± 0.13 | 0.77 ± 0.20 | 0.88 | 0.77 | 0.83 | 0.82 |
| Gordon | 12533 | 274 | 0.94 ± 0.05 | 0.88 ± 0.11 | 0.97 ± 0.05 | 0.79 ± 0.24 | 0.97 | 0.81 | 0.97 | 0.83 |
| Chowdary | 22283 | 39 | 0.91 ± 0.15 | 0.90 ± 0.15 | 0.94 ± 0.12 | 0.86 ± 0.19 | 0.95 | 0.90 | 0.94 | 0.92 |
| Alon | 2000 | 9 | 0.73 ± 0.18 | 0.70 ± 0.17 | 0.63 ± 0.24 | 0.78 ± 0.21 | 0.59 | 0.80 | 0.60 | 0.79 |
| Chin | 22215 | 305 | 0.81 ± 0.09 | 0.80 ± 0.08 | 0.75 ± 0.10 | 0.85 ± 0.14 | 0.72 | 0.85 | 0.73 | 0.85 |
| West | 7129 | 5 | 0.90 ± 0.13 | 0.91 ± 0.13 | 0.92 ± 0.17 | 0.90 ± 0.15 | 0.92 | 0.88 | 0.83 | 0.82 |
| Golub | 7129 | 133 | 0.87 ± 0.08 | 0.85 ± 0.10 | 0.94 ± 0.09 | 0.77 ± 0.20 | 0.91 | 0.80 | 0.91 | 0.82 |
PCC-DTCV model with optimized DT and PCC ≥ 0.5.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 58 | 0.78 ± 0.15 | 0.78 ± 0.15 | 0.78 ± 0.19 | 0.78 ± 0.23 | 0.78 | 0.83 | 0.80 | 0.81 |
| Gordon | 12533 | 274 | 0.94 ± 0.04 | 0.90 ± 0.08 | 0.97 ± 0.04 | 0.84 ± 0.16 | 0.95 | 0.77 | 0.95 | 0.77 |
| Chowdary | 22283 | 39 | 0.94 ± 0.08 | 0.94 ± 0.08 | 0.95 ± 0.09 | 0.93 ± 0.11 | 0.95 | 0.90 | 0.94 | 0.92 |
| Alon | 2000 | 9 | 0.78 ± 0.19 | 0.75 ± 0.19 | 0.68 ± 0.26 | 0.82 ± 0.23 | 0.68 | 0.80 | 0.67 | 0.81 |
| Chin | 22215 | 305 | 0.82 ± 0.09 | 0.80 ± 0.09 | 0.77 ± 0.15 | 0.84 ± 0.12 | 0.74 | 0.85 | 0.74 | 0.85 |
| West | 7129 | 5 | 0.83 ± 0.18 | 0.83 ± 0.18 | 0.80 ± 0.24 | 0.87 ± 0.16 | 0.84 | 0.83 | 0.84 | 0.83 |
| Golub | 7129 | 133 | 0.89 ± 0.06 | 0.89 ± 0.07 | 0.92 ± 0.10 | 0.85 ± 0.19 | 0.87 | 0.76 | 0.87 | 0.76 |
PCC-DTCV model with DT and PCC ≥ 0.6.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 10 | 0.90 ± 0.08 | 0.90 ± 0.08 | 0.90 ± 0.10 | 0.90 ± 0.10 | 0.90 | 0.88 | 0.89 | 0.89 |
| Gordon | 12533 | 98 | 0.95 ± 0.05 | 0.89 ± 0.11 | 0.98 ± 0.04 | 0.81 ± 0.22 | 0.99 | 0.74 | 0.97 | 0.84 |
| Chowdary | 22283 | 5 | 0.96 ± 0.06 | 0.96 ± 0.06 | 0.97 ± 0.06 | 0.96 ± 0.12 | 0.95 | 0.93 | 0.95 | 0.93 |
| Alon | 2000 | 1 | 0.72 ± 0.18 | 0.71 ± 0.19 | 0.67 ± 0.32 | 0.75 ± 0.22 | 0.68 | 0.75 | 0.64 | 0.78 |
| Chin | 22215 | 54 | 0.85 ± 0.07 | 0.84 ± 0.08 | 0.82 ± 0.17 | 0.87 ± 0.11 | 0.72 | 0.88 | 0.75 | 0.86 |
| West | 7129 | 2 | 0.82 ± 0.19 | 0.82 ± 0.21 | 0.87 ± 0.32 | 0.85 ± 0.19 | 0.80 | 0.83 | 0.82 | 0.82 |
| Golub | 7129 | 36 | 0.85 ± 0.10 | 0.82 ± 0.12 | 0.89 ± 0.11 | 0.75 ± 0.21 | 0.94 | 0.80 | 0.92 | 0.83 |
PCC-DTCV model with optimized DT and PCC ≥ 0.6.
| Dataset | Features | Selected features | Accuracy | AUC | Sensitivity | Specificity | Precision (0) | Precision (1) | Recall (0) | Recall (1) |
|---|---|---|---|---|---|---|---|---|---|---|
| Singh | 12600 | 10 | 0.89 ± 0.05 | 0.89 ± 0.05 | 0.88 ± 0.10 | 0.90 ± 0.13 | 0.88 | 0.94 | 0.91 | 0.92 |
| Gordon | 12533 | 98 | 0.95 ± 0.04 | 0.87 ± 0.11 | 0.99 ± 0.03 | 0.75 ± 0.23 | 0.99 | 0.77 | 0.97 | 0.84 |
| Chowdary | 22283 | 5 | 0.93 ± 0.06 | 0.93 ± 0.07 | 0.95 ± 0.07 | 0.91 ± 0.14 | 0.97 | 0.90 | 0.95 | 0.93 |
| Alon | 2000 | 1 | 0.79 ± 0.18 | 0.76 ± 0.17 | 0.67 ± 0.22 | 0.85 ± 0.25 | 0.64 | 0.88 | 0.68 | 0.84 |
| Chin | 22215 | 54 | 0.84 ± 0.11 | 0.83 ± 0.12 | 0.78 ± 0.20 | 0.88 ± 0.09 | 0.79 | 0.83 | 0.76 | 0.85 |
| West | 7129 | 2 | 0.83 ± 0.15 | 0.83 ± 0.17 | 0.78 ± 0.32 | 0.88 ± 0.18 | 0.80 | 0.83 | 0.82 | 0.82 |
| Golub | 7129 | 36 | 0.90 ± 0.09 | 0.88 ± 0.12 | 0.96 ± 0.08 | 0.80 ± 0.21 | 0.91 | 0.80 | 0.91 | 0.80 |
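The metrics reported in the tables above (accuracy, sensitivity, specificity, and the per-class values) can all be derived from a confusion matrix. A minimal sketch with hypothetical labels and predictions standing in for one cross-validation fold:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions for one fold -- not the paper's data.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall of the positive class (1)
specificity = tn / (tn + fp)   # recall of the negative class (0)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 2), round(sensitivity, 2), round(specificity, 2))  # 0.8 0.83 0.75

# Per-class precision and recall, matching the class-0/class-1 columns.
print(precision_score(y_true, y_pred, average=None))
print(recall_score(y_true, y_pred, average=None))
```

Note that sensitivity is simply the recall of class 1 and specificity the recall of class 0, which is why the per-class recall columns track those two metrics closely.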
Figure 6. Accuracy obtained for the PCC-DTCV model using the DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 7. AUC obtained for the PCC-DTCV model using the DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 8. Sensitivity obtained for the PCC-DTCV model using the DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 9. Specificity obtained for the PCC-DTCV model using the DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Tests of normality.
| Metric | PCC threshold | Kolmogorov–Smirnov statistic (a) | Kolmogorov–Smirnov sig. (a) | Shapiro–Wilk statistic | Shapiro–Wilk sig. |
|---|---|---|---|---|---|
| Spec. DT OPT | ≥0.4 | 0.158 | 0.200 | 0.983 | 0.971 |
| Spec. DT OPT | ≥0.5 | 0.166 | 0.200 | 0.943 | 0.668 |
| Spec. DT OPT | ≥0.6 | 0.257 | 0.178 | 0.940 | 0.639 |
| AUC DT OPT | ≥0.4 | 0.161 | 0.200 | 0.981 | 0.965 |
| AUC DT OPT | ≥0.5 | 0.190 | 0.200 | 0.956 | 0.190 |
| AUC DT OPT | ≥0.6 | 0.205 | 0.200 | 0.892 | 0.285 |
| Sen. DT OPT | ≥0.4 | 0.137 | 0.200 | 0.962 | 0.832 |
| Sen. DT OPT | ≥0.5 | 0.163 | 0.200 | 0.926 | 0.517 |
| Sen. DT OPT | ≥0.6 | 0.245 | 0.200 | 0.884 | 0.245 |
| Sen. DT | ≥0.4 | 0.205 | 0.200 | 0.871 | 0.189 |
| Sen. DT | ≥0.5 | 0.241 | 0.200 | 0.873 | 0.197 |
| Sen. DT | ≥0.6 | 0.275 | 0.117 | 0.905 | 0.364 |
| Spec. DT | ≥0.4 | 0.284 | 0.093 | 0.836 | 0.090 |
| Spec. DT | ≥0.5 | 0.169 | 0.200 | 0.956 | 0.785 |
| Spec. DT | ≥0.6 | 0.188 | 0.200 | 0.911 | 0.403 |
| AUC DT | ≥0.4 | 0.204 | 0.200 | 0.948 | 0.714 |
| AUC DT | ≥0.5 | 0.243 | 0.200 | 0.900 | 0.334 |
| AUC DT | ≥0.6 | 0.233 | 0.200 | 0.933 | 0.580 |
Sig. values of 0.200 are a lower bound of the true significance. (a) Lilliefors significance correction.
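The Shapiro–Wilk test reported above can be reproduced with SciPy; the sketch below uses hypothetical AUC values (one per dataset) as stand-ins, since the underlying values are not listed here. The Lilliefors-corrected Kolmogorov–Smirnov variant is available separately as `statsmodels.stats.diagnostic.lilliefors`.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset AUC values standing in for one column of the
# results tables -- not the paper's actual numbers.
auc = np.array([0.78, 0.88, 0.89, 0.75, 0.80, 0.84, 0.87])

# Shapiro-Wilk test of normality: a small p-value rejects normality.
w_stat, p_value = stats.shapiro(auc)
print(round(w_stat, 3), round(p_value, 3))
```

A p-value above the usual 0.05 level, as in every row of the table, means normality is not rejected, which justifies the use of parametric comparisons between threshold settings.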
Figure 10. Accuracy obtained for the PCC-DTCV model using the optimized DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 11. AUC obtained for the PCC-DTCV model using the optimized DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 12. Sensitivity obtained for the PCC-DTCV model using the optimized DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.
Figure 13. Specificity obtained for the PCC-DTCV model using the optimized DT classifier with PCC ≥ 0.4, 0.5, and 0.6 for all datasets.