Tarneem Elemam, Mohamed Elshrkawey.
Abstract
Cancer is a deadly disease that occurs due to rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancer diseases from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker is initiated to combine the results of three filter-based feature evaluation methods, namely, chi-squared, F-statistic, and mutual information (MI). The features are then ordered according to this combination. In the second stage, the modified wrapper-based sequential forward selection is utilized to discover the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. To examine the proposed algorithm, many tests have been carried out on four cancerous microarray datasets, employing in the process 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by calculating the diagnostic accuracy. The results indicate that for the leukemia dataset, both SVM and KNN models register the highest accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy at 100% using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also achieves the highest accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model also achieves the highest accuracy at 99.57% using 19 features. By comparing with other algorithms, the results obtained from the proposed algorithm are superior in terms of the number of selected features and diagnostic accuracy.
Year: 2022 PMID: 35983572 PMCID: PMC9381276 DOI: 10.1155/2022/1056490
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1. Proposed cancer prediction system.
Figure 2. Feature ranking process.
Algorithm 1. Overall ranking.
Figure 3. Feature selection and classification process.
Algorithm 2. Feature selection and classification process.
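The second stage walks the ranked features and keeps those that help a classifier. The sketch below is one plausible reading of a sequential forward selection over a pre-ranked list. The "modified" variant of Algorithm 2 is not reproduced in this record, so the acceptance rule (keep a feature only if the score strictly improves), the function name `forward_select`, the `max_features` cap, and the toy scoring function are all assumptions. In practice `score_fn` would return something like the mean 10-fold cross-validation accuracy of an SVM, DT, RF, or KNN model trained on the candidate subset.

```python
from typing import Callable, List, Sequence, Tuple


def forward_select(
    ranked_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    max_features: int = 20,
) -> Tuple[List[str], float]:
    """Greedy forward pass over features already sorted best-first.

    Assumption: a candidate feature is kept only if adding it strictly
    improves the score returned by score_fn.
    """
    selected: List[str] = []
    best_score = float("-inf")
    for feature in ranked_features:
        if len(selected) >= max_features:
            break
        candidate = selected + [feature]
        score = score_fn(candidate)
        if score > best_score:
            selected, best_score = candidate, score
    return selected, best_score


# Toy stand-in for a cross-validated classifier: hypothetical per-feature gains.
_toy_gain = {"f1": 0.80, "f2": 0.00, "f3": 0.15, "f4": -0.05}


def toy_score(features: List[str]) -> float:
    return sum(_toy_gain[f] for f in features)


subset, score = forward_select(["f1", "f2", "f3", "f4"], toy_score)
print(subset, round(score, 2))
```

With the toy gains, `f2` and `f4` are rejected because they do not raise the score, which mirrors how a wrapper stage can end with far fewer features than the ranked list it starts from.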
Dataset description.
| Dataset | Number of samples | Number of features | Number of classes | Notes |
|---|---|---|---|---|
| Leukemia | 72 | 7129 | 2 (binary class) | ALL: 47, AML: 25 |
| Ovarian cancer | 253 | 15154 | 2 (binary class) | Cancer: 162, normal: 91 |
| SRBCT | 83 | 2308 | 4 (multi-class) | EWS: 29, BL: 11, NB: 18, RMS: 25 |
| Lung cancer | 203 | 12600 | 5 (multi-class) | AD: 139, NL: 17, SMCL: 6, SQ: 21, COID: 20 |
ALL—acute lymphocytic leukemia; AML—acute myeloid leukemia; EWS—Ewing's sarcoma; BL—Burkitt's lymphoma; NB—neuroblastoma; RMS—rhabdomyosarcoma; AD—adenocarcinoma; NL—normal lung; SMCL—small cell lung cancer; SQ—squamous cell carcinoma; COID—carcinoid.
Feature score table (FST) for the leukemia dataset.
| Feature | Chi-squared | F-statistic | MI |
|---|---|---|---|
| D88422_at | 10.9959 | 36.676 | 0.442189 |
| M11722_at | 6.34719 | 34.1799 | 0.411207 |
| M16038_at | 8.98639 | 68.645 | 0.342855 |
| M19507_at | 10.7386 | 45.239 | 0.323609 |
| M22960_at | 6.23369 | 65.0347 | 0.304583 |
| M23197_at | 9.65571 | 80.6443 | 0.524171 |
| M27891_at | 14.4819 | 69.3333 | 0.490489 |
| M63138_at | 7.87681 | 64.6046 | 0.35717 |
| M84526_at | 9.52866 | 73.2956 | 0.429966 |
| M92287_at | 5.66233 | 43.1572 | 0.385647 |
| M96326_rna1_at | 7.75604 | 42.5043 | 0.364215 |
| U05259_rna1_at | 5.78902 | 38.5285 | 0.335938 |
| U46499_at | 11.0199 | 69.8495 | 0.418786 |
| X17042_at | 10.1589 | 81.3535 | 0.352082 |
| X59417_at | 4.90528 | 45.0031 | 0.344774 |
| X61587_at | 4.87439 | 55.0604 | 0.330945 |
| X62654_rna1_at | 4.60257 | 43.7079 | 0.394887 |
| X95735_at | 8.57247 | 119.315 | 0.497266 |
| L09209_s_at | 8.18643 | 71.1058 | 0.465179 |
| M31523_at | 4.76099 | 41.6929 | 0.478039 |
Feature score table (FST) for the ovarian cancer dataset.
| Feature | Chi-squared | F-statistic | MI |
|---|---|---|---|
| MZ244.36855 | 31.2944 | 358.634 | 0.40996 |
| MZ244.66041 | 31.4802 | 642.088 | 0.517489 |
| MZ244.95245 | 35.4655 | 857.449 | 0.5695 |
| MZ245.24466 | 35.8013 | 905.675 | 0.552878 |
| MZ245.53704 | 33.7166 | 833.195 | 0.541194 |
| MZ245.8296 | 30.9072 | 713.093 | 0.53736 |
| MZ246.12233 | 28.6963 | 597.697 | 0.532594 |
| MZ246.41524 | 26.7876 | 507.936 | 0.490768 |
| MZ246.70832 | 25.0276 | 444.715 | 0.482529 |
| MZ247.00158 | 23.932 | 399.491 | 0.470867 |
| MZ247.295 | 24.0877 | 361.301 | 0.453271 |
| MZ247.58861 | 23.5078 | 302.861 | 0.432823 |
| MZ247.88239 | 22.6694 | 289.701 | 0.425722 |
| MZ261.88643 | 13.5449 | 418.438 | 0.460344 |
| MZ417.73207 | 14.4542 | 411.356 | 0.46087 |
| MZ434.68588 | 11.8603 | 384.315 | 0.475778 |
| MZ435.07512 | 12.2497 | 405.504 | 0.521955 |
| MZ435.46452 | 12.3376 | 381.56 | 0.526363 |
| MZ463.95962 | 18.2664 | 314.626 | 0.393138 |
| MZ464.36174 | 18.1024 | 320.451 | 0.422442 |
Feature rank table (RT) for the leukemia dataset. Evaluation scores are converted to ranks, with 1 being the highest rank.
| Feature | Chi-squared | F-statistic | MI |
|---|---|---|---|
| D88422_at | 4 | 41 | 6 |
| M11722_at | 21 | 51 | 9 |
| M16038_at | 9 | 8 | 23 |
| M19507_at | 5 | 20 | 29 |
| M22960_at | 23 | 9 | 37 |
| M23197_at | 7 | 3 | 1 |
| M27891_at | 1 | 7 | 3 |
| M63138_at | 12 | 10 | 17 |
| M84526_at | 8 | 4 | 7 |
| M92287_at | 30 | 26 | 12 |
| M96326_rna1_at | 14 | 28 | 16 |
| U05259_rna1_at | 28 | 37 | 24 |
| U46499_at | 3 | 6 | 8 |
| X17042_at | 6 | 2 | 19 |
| X59417_at | 43 | 21 | 22 |
| X61587_at | 44 | 14 | 25 |
| X62654_rna1_at | 55 | 24 | 10 |
| X95735_at | 10 | 1 | 2 |
| L09209_s_at | 11 | 5 | 5 |
| M31523_at | 49 | 32 | 4 |
Feature rank table (RT) for the ovarian cancer dataset. Evaluation scores are converted to ranks, with 1 being the highest rank.
| Feature | Chi-squared | F-statistic | MI |
|---|---|---|---|
| MZ244.36855 | 5 | 18 | 25 |
| MZ244.66041 | 4 | 5 | 8 |
| MZ244.95245 | 2 | 2 | 1 |
| MZ245.24466 | 1 | 1 | 2 |
| MZ245.53704 | 3 | 3 | 3 |
| MZ245.8296 | 6 | 4 | 4 |
| MZ246.12233 | 7 | 6 | 5 |
| MZ246.41524 | 8 | 7 | 10 |
| MZ246.70832 | 9 | 8 | 11 |
| MZ247.00158 | 11 | 12 | 13 |
| MZ247.295 | 10 | 17 | 16 |
| MZ247.58861 | 12 | 25 | 18 |
| MZ247.88239 | 13 | 29 | 20 |
| MZ261.88643 | 58 | 9 | 15 |
| MZ417.73207 | 38 | 10 | 14 |
| MZ434.68588 | 110 | 13 | 12 |
| MZ435.07512 | 91 | 11 | 7 |
| MZ435.46452 | 89 | 14 | 6 |
| MZ463.95962 | 18 | 24 | 30 |
| MZ464.36174 | 19 | 21 | 22 |
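The rank tables follow from the score tables by sorting each filter's scores in descending order, so that the highest score receives rank 1. The sketch below shows that conversion on three MI scores copied from the leukemia score table above. The helper name `scores_to_ranks` is hypothetical, and breaking ties by input order is an assumption. The published ranks were computed over the full feature set, so ranking a subset will generally not reproduce them; these three features happen to hold the top three MI ranks in the leukemia rank table, so here the results coincide.

```python
from typing import Dict


def scores_to_ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each feature to its rank, where rank 1 is the highest score."""
    # sorted() is stable, so equal scores keep their input order.
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feature: rank for rank, feature in enumerate(ordered, start=1)}


# MI scores copied from the leukemia feature score table.
mi_scores = {
    "X95735_at": 0.497266,
    "M23197_at": 0.524171,
    "M27891_at": 0.490489,
}

mi_ranks = scores_to_ranks(mi_scores)
print(mi_ranks)
```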
Overall rank table (ORT) for the leukemia dataset. Features are arranged from the most important, at the top, to the least important at the bottom according to their assigned overall rank (OR), which is calculated after moderating the outliers (shown in bold) as per Algorithm 1. The smaller the overall rank, the more significant the feature.
| Feature | Chi-squared | F-statistic | MI | Overall rank (OR) |
|---|---|---|---|---|
| X95735_at | **10** | 1 | 2 | 9 |
| M23197_at | 7 | 3 | 1 | 11 |
| M27891_at | 1 | 7 | 3 | 11 |
| U46499_at | 3 | 6 | 8 | 17 |
| M84526_at | 8 | 4 | 7 | 19 |
| L09209_s_at | 11 | 5 | 5 | 21 |
| X17042_at | 6 | 2 | **19** | 24 |
| D88422_at | 4 | **41** | 6 | 30 |
| M63138_at | 12 | 10 | 17 | 39 |
| M16038_at | 9 | 8 | 23 | 40 |
| M19507_at | 5 | 20 | 29 | 54 |
| M96326_rna1_at | 14 | 28 | 16 | 58 |
| M92287_at | 30 | 26 | 12 | 68 |
| M22960_at | 23 | 9 | 37 | 69 |
| M11722_at | 21 | 51 | 9 | 81 |
| X61587_at | 44 | 14 | 25 | 83 |
| M31523_at | 49 | 32 | 4 | 85 |
| X59417_at | 43 | 21 | 22 | 86 |
| U05259_rna1_at | 28 | 37 | 24 | 89 |
| X62654_rna1_at | 55 | 24 | 10 | 89 |
Overall rank table (ORT) for the ovarian cancer dataset. Features are arranged from the most important, at the top, to the least important at the bottom according to their assigned overall rank (OR), which is calculated after moderating the outliers (shown in bold) as per Algorithm 1. The smaller the overall rank, the more significant the feature.
| Feature | Chi-squared | F-statistic | MI | Overall rank (OR) |
|---|---|---|---|---|
| MZ245.24466 | 1 | 1 | 2 | 4 |
| MZ244.95245 | 2 | 2 | 1 | 5 |
| MZ245.53704 | 3 | 3 | 3 | 9 |
| MZ245.8296 | 6 | 4 | 4 | 14 |
| MZ244.66041 | 4 | 5 | 8 | 17 |
| MZ246.12233 | 7 | 6 | 5 | 18 |
| MZ246.41524 | 8 | 7 | 10 | 25 |
| MZ246.70832 | 9 | 8 | 11 | 28 |
| MZ247.00158 | 11 | 12 | 13 | 36 |
| MZ247.295 | 10 | 17 | 16 | 43 |
| MZ244.36855 | 5 | 18 | 25 | 48 |
| MZ435.07512 | **91** | 11 | 7 | 54 |
| MZ247.58861 | 12 | 25 | 18 | 55 |
| MZ435.46452 | **89** | 14 | 6 | 60 |
| MZ464.36174 | 19 | 21 | 22 | 62 |
| MZ417.73207 | 38 | 10 | 14 | 62 |
| MZ247.88239 | 13 | 29 | 20 | 62 |
| MZ463.95962 | 18 | 24 | 30 | 72 |
| MZ261.88643 | **58** | 9 | 15 | 72 |
| MZ434.68588 | **110** | 13 | 12 | 75 |
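Algorithm 1 itself is not reproduced in this record, but the OR column can be checked against the rank tables. For most rows, OR is the plain sum of the three ranks. The sketch below adds one capping rule that reproduces the OR values listed in both ORT tables: when the largest rank exceeds twice the sum of the other two, it is replaced by that bound before summing. This rule is inferred from the tabulated numbers and is not taken from the paper, so it may differ from the authors' actual outlier test. The names `overall_rank` and `cap_factor` are hypothetical.

```python
from typing import Sequence


def overall_rank(ranks: Sequence[int], cap_factor: int = 2) -> int:
    """Sum the filter ranks after capping the largest one.

    Assumption (inferred from the tables, not from Algorithm 1): the largest
    rank counts as an outlier when it exceeds cap_factor times the sum of
    the remaining ranks, and is then replaced by that bound.
    """
    ordered = sorted(ranks)
    others, largest = ordered[:-1], ordered[-1]
    bound = cap_factor * sum(others)
    return sum(others) + min(largest, bound)


# (chi-squared, F-statistic, MI) ranks copied from the rank tables.
examples = {
    "X95735_at": (10, 1, 2),
    "X17042_at": (6, 2, 19),
    "M23197_at": (7, 3, 1),
    "MZ434.68588": (110, 13, 12),
}

for feature, ranks in examples.items():
    print(feature, overall_rank(ranks))
```

Capping rather than discarding the outlying rank keeps all three filters in the vote while limiting how far a single dissenting filter can push a feature down the list.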
Average classification accuracy using four classifiers on four biological datasets. The number of features is shown in parentheses. The best results are shown in bold font.
| Dataset | Classifier | Accuracy using full features (%) | Accuracy using selected features (%) |
|---|---|---|---|
| Leukemia (7129) | SVM | | **100 (5)** |
| | DT | 85.89 | 98.57 (3) |
| | RF | | 98.57 (3) |
| | KNN | 92.14 | **100 (5)** |
| Ovarian cancer (15154) | SVM | | **100 (6)** |
| | DT | 97.60 | 98.80 (4) |
| | RF | 99.60 | 99.60 (10) |
| | KNN | 95.28 | 100 (10) |
| SRBCT (2308) | SVM | | **100 (8)** |
| | DT | 83.19 | 96.67 (8) |
| | RF | | 98.75 (8) |
| | KNN | 88.19 | 100 (10) |
| Lung cancer (12600) | SVM | 99.42 | **99.57 (19)** |
| | DT | 97.85 | 99.14 (18) |
| | RF | | 99.43 (20) |
| | KNN | 94.99 | 99.57 (22) |
Listing of best subset of features that achieves the best accuracy.
| Dataset | Selected features |
|---|---|
| Leukemia | HG1612-HT1612_at, M23197_at, M27891_at, X17042_at, X95735_at |
| Ovarian cancer | MZ221.86191, MZ244.36855, MZ244.95245, MZ245.24466, MZ435.07512, MZ464.76404 |
| SRBCT | gene123, gene153, gene187, gene509, gene742, gene1389, gene1601, gene1955 |
| Lung cancer | 38138_at, 38239_at, 35622_at, 36894_at, 37545_at, 40093_at, 34842_at, 36119_at, 36160_s_at, 37302_at, 37305_at, 38032_at, 38065_at, 40193_at, 40619_at, 41289_at, 32542_at, 1814_at, 893_at |
Comparison of the proposed method with existing research. Each cell gives the accuracy (%), with the number of selected features in parentheses. The symbol “—” indicates that no information is available. The best results are shown in bold font.
| Author | Algorithm | Leukemia | Ovarian cancer | SRBCT | Lung cancer |
|---|---|---|---|---|---|
| [ | Customized similarity measure using a fuzzy rough quick reduct algorithm | 97.22 (7) | 99.60 (9) | — | — |
| [ | MI-GA | — | 99.21 (20) | — | 81.37 (10) |
| [ | DFS strategy | 98.61 (85) | — | 100 (133) | — |
| [ | ReliefF-WCGWO-mrPNN | 89.33 (150) | 99.21 (200) | — | — |
| | Fisher score-WCGWO-mrPNN | 99.21 (40) | 100 (150) | — | — |
| [ | FSJaya | 96.74 (3531) | — | — | — |
| [ | G-Forest | 100 (1282) | — | — | — |
| [ | ESA-DL | — | 99.21 (384) | 83.14 (306) | 94.10 (4545) |
| | FFS-DL | — | 97.24 (35) | 93.98 (768) | 93.11 (5304) |
| [ | SARA | 97.65 (7) | 99.15 (6) | 99.81 (5) | 90.22 (5) |
| [ | Cuckoo search | 100 (650) | — | — | — |
| [ | FSAEFA | 96.14 (3530) | — | — | — |
| Proposed method (ours) | | **100 (5)** | **100 (6)** | **100 (8)** | **99.57 (19)** |