Shilpi Bose1, Chandra Das1, Abhik Banerjee1, Kuntal Ghosh2, Matangini Chattopadhyay3, Samiran Chattopadhyay4, Aishwarya Barik1.
Abstract
BACKGROUND: Machine learning is a machine intelligence technique that learns from data and detects inherent patterns in large, complex datasets. Because of this capability, machine learning techniques are widely used in medical applications, especially those involving large-scale genomic and proteomic data. Cancer classification based on bio-molecular profiling data is an important topic for medical applications since it improves the diagnostic accuracy of cancer and supports successful cancer treatment. Hence, machine learning techniques are widely used in cancer detection and prognosis.
Keywords: Attribute clustering; DNA Microarray; Ensemble classifier; Filter; Gene expression data; Machine learning
Year: 2021 PMID: 34616883 PMCID: PMC8459790 DOI: 10.7717/peerj-cs.671
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. Cluster representative refinement procedure.
Each row of the table represents a gene with its class relevance value, measured as the Pearson correlation coefficient with respect to the sample class row. TR+ and TR− represent the augmented genes with their class relevance scores, measured the same way.
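The class relevance score described in this caption can be illustrated with a small sketch. It assumes a two-class label row encoded as 0/1 and uses the signed Pearson coefficient. The `augment` helper is hypothetical: it assumes TR+ and TR− are formed by averaging the current representative with a candidate gene and with its sign-flipped version, a formula this record does not spell out.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant vector: correlation is undefined, treated as 0 here
    return cov / (sx * sy)

def class_relevance(gene, labels):
    """Class relevance of one gene: its correlation with the sample class row."""
    return pearson(gene, labels)

def augment(representative, gene):
    """Hypothetical TR+ / TR-: average the representative with the gene and with its negation."""
    tr_plus = [(r + g) / 2 for r, g in zip(representative, gene)]
    tr_minus = [(r - g) / 2 for r, g in zip(representative, gene)]
    return tr_plus, tr_minus

# Toy expression values, invented for illustration only.
labels = [0, 0, 0, 1, 1, 1]
representative = [1.0, 1.2, 0.9, 3.1, 2.8, 3.0]
candidate = [2.0, 2.1, 1.9, 0.5, 0.4, 0.6]  # anti-correlated with the class row

tr_plus, tr_minus = augment(representative, candidate)
scores = {
    "R": class_relevance(representative, labels),
    "TR+": class_relevance(tr_plus, labels),
    "TR-": class_relevance(tr_minus, labels),
}
```

In this toy case the sign-flipped augmentation (TR−) scores higher than TR+, which is the kind of comparison a refinement step could use to decide which augmented gene to keep.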
Figure 2. Block diagram of the proposed MFSAC-EC model.
Here BK1, BK2, …, BKD are the D bootstrapped datasets. RSD11…RSD17 are the reduced sub-datasets of bootstrapped dataset BK1 after applying the MFSAC method. IC11 to IC17 are the individual classifiers applied to RSD11…RSD17, respectively.
Figure 3. Block diagram of MFSAC method.
BKl is the lth bootstrapped dataset. FT1…FT7 are the seven filter score functions listed in Table S1. SD11…SD17 are the sub-datasets created after applying the filter score functions. SAC is the supervised attribute clustering method applied to generate the reduced sub-datasets RSD11…RSD17.
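The first two stages of the block diagrams (bootstrapped dataset creation, then one sub-dataset per filter) can be sketched as below. This is a minimal illustration under stated assumptions: the data layout (one row per gene), the function names, and in particular `mean_difference_score` are hypothetical. That score is a stand-in, not one of the seven filter score functions of Table S1, and the supervised attribute clustering step is omitted.

```python
import random

def bootstrap_indices(n_samples, n_datasets, seed=0):
    """Draw n_datasets index lists by sampling n_samples indices with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n_samples) for _ in range(n_samples)]
            for _ in range(n_datasets)]

def mean_difference_score(gene, labels):
    """Stand-in filter score (assumption): |mean(class 1) - mean(class 0)|."""
    pos = [v for v, y in zip(gene, labels) if y == 1]
    neg = [v for v, y in zip(gene, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # a bootstrap sample may contain only one class
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def top_p_genes(data, labels, score_fn, p):
    """Return the indices of the p best-scoring genes (data holds one row per gene)."""
    ranked = sorted(range(len(data)),
                    key=lambda i: score_fn(data[i], labels), reverse=True)
    return ranked[:p]

def make_sub_datasets(data, labels, filters, p, n_datasets, seed=0):
    """For each bootstrapped dataset, select the top-p gene indices once per filter."""
    subs = []
    for idx in bootstrap_indices(len(labels), n_datasets, seed):
        boot_labels = [labels[j] for j in idx]
        boot_data = [[gene[j] for j in idx] for gene in data]
        subs.append([top_p_genes(boot_data, boot_labels, f, p) for f in filters])
    return subs

# Toy data, invented for illustration: 4 genes x 6 samples.
labels = [0, 0, 0, 1, 1, 1]
data = [
    [1.0, 1.1, 0.9, 1.0, 1.2, 0.8],  # uninformative
    [0.5, 0.4, 0.6, 0.5, 0.5, 0.4],  # uninformative
    [0.1, 0.2, 0.1, 2.0, 2.1, 1.9],  # discriminative
    [3.0, 3.1, 2.9, 3.0, 3.2, 3.1],  # weakly informative
]
best_gene = top_p_genes(data, labels, mean_difference_score, 1)
subs = make_sub_datasets(data, labels, [mean_difference_score], p=2, n_datasets=3)
```

With seven filters in the `filters` list, each bootstrapped dataset would yield seven gene subsets, matching the SD11…SD17 layout of the figure.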
Description of cancer gene expression datasets.
| Dataset | Data Dimension Gene × Sample (Original) | Data Dimension Gene × Sample (Used) | Sample Class Labels | Dataset | Data Dimension Gene × Sample (Original) | Data Dimension Gene × Sample (Used) | Sample Class Labels |
|---|---|---|---|---|---|---|---|
| Leukemia | 7,129 × 72 | 7,070 × 72 | 2 | Breast | 7,129 × 49 | 7,129 × 49 | 2 |
| Colon | 2,000 × 62 | 2,000 × 62 | 2 | MLL | 12,582 × 72 | 12,582 × 72 | 3 |
| Prostate | 12,600 × 136 | 12,600 × 136 | 2 | SRBCT | 2,308 × 63 | 2,308 × 63 | 4 |
| Lung | 12,533 × 181 | 12,533 × 181 | 2 | RAHC | 41,057 × 50 | 41,057 × 50 | 2 |
| Rbreast | 24,481 × 97 | 24,188 × 97 | 2 | RAOA | 18,433 × 30 | 18,433 × 30 | 2 |
Classification accuracy of MFSAC-EC depending on varying number of genes selected by each filter.
This table shows the impact of parameter P on sample classification accuracy (%) under both LOOCV and tenfold cross validation. P is the number of top-ranked genes selected by each filter method.
| Dataset | Evaluation Metric | MFSAC-EC | |||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NB | KNN | DT | SVM | NB | KNN | DT | SVM | NB | KNN | DT | SVM | NB | KNN | DT | SVM | NB | KNN | DT | SVM | NB | KNN | DT | SVM | ||
| Leukemia | LOOCV | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 98.6 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| RAHC | LOOCV | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| MLL | LOOCV | 98.6 | 100 | 100 | 97.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 97.2 | 100 | 100 | 97.2 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| RAOA | LOOCV | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| SRBCT | LOOCV | 98.4 | 98.4 | 100 | 98.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 100 | 98.4 | 98.4 | 98.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| Breast | LOOCV | 98 | 95.9 | 93.9 | 95.9 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 100 | 95.9 | 95.9 | 98 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| Lung | LOOCV | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| Rbreast | LOOCV | 92.6 | 93.7 | 93.7 | 95.8 | 99 | 97.9 | 100 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| 10 Fold | 91.6 | 96.8 | 95.8 | 97.9 | 97.9 | 99 | 99 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |
| COLON | LOOCV | 91.9 | 91.9 | 93.6 | 91.9 | 91.9 | 91.9 | 96.8 | 91.9 | 98.4 | 98.4 | 96.8 | 98.4 | 98.4 | 98.4 | 98.4 | 98.4 | 98.4 | 98.4 | 98.4 | 100 | 100 | 100 | 98.4 | 100 |
| 10 Fold | 91.9 | 91.9 | 93.6 | 91.9 | 91.9 | 91.9 | 95.2 | 91.9 | 98.4 | 96.8 | 96.8 | 96.8 | 98.4 | 100 | 98.4 | 98.4 | 100 | 98.4 | 98.4 | 100 | 100 | 100 | 98.4 | 100 | |
| Prostate | LOOCV | 83.8 | 90.4 | 92.7 | 86.8 | 88.2 | 92.7 | 92.7 | 88.2 | 91.2 | 95.6 | 97.1 | 91.9 | 94.9 | 97.1 | 97.8 | 94.1 | 98.5 | 98.5 | 97.8 | 98.5 | 98.5 | 99.3 | 98.5 | 99.3 |
| 10 Fold | 85.3 | 88.2 | 91.9 | 86.8 | 88.2 | 91.9 | 93.4 | 89.7 | 91.9 | 95.6 | 97.8 | 91.2 | 95.6 | 97.8 | 97.8 | 94.1 | 99.3 | 98.5 | 97.8 | 98.5 | 99.3 | 99.3 | 97.8 | 99.3 | |
Total execution time in a single run of MFSAC-EC on different datasets.
The "Total Time Taken" row gives the total execution time of a single run of MFSAC-EC, including bootstrapped dataset creation, feature selection by the filter methods and the supervised attribute clustering approach, training, and testing using LOOCV, fivefold and tenfold cross validation, and random splitting. The "Time Taken for only 10 fold" row gives the execution time using only tenfold cross validation. Times are shown for the best P value.
| Leukemia | RAHC | MLL | RAOA | SRBCT | Breast | Lung | Rbreast | COLON | Prostate | |
|---|---|---|---|---|---|---|---|---|---|---|
| No. of features selected for best result | 500 | 100 | 200 | 100 | 200 | 200 | 100 | 500 | 1,200 | 3,000 |
| Total Time Taken | 8 min 23 s | 7 min 32 s | 7 min 54 s | 4 min 43 s | 5 min 17 s | 4 min 2 s | 11 min 14 s | 10 min 22 s | 17 min 40 s | 1 h 18 min 41 s |
| Time Taken for only 10 fold | 35 s | 30 s | 41 s | 36 s | 30 s | 32 s | 36 s | 33 s | 30 s | 36 s |
Classification accuracy of the proposed MFSAC-EC model with respect to LOOCV.
Classification accuracy (%) of the MFSAC-EC model is shown for LOOCV with four ensemble classifiers: MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using LOOCV for every dataset, and the accuracy obtained the greatest number of times is reported.
| Dataset | Proposed model | Cluster representatives | Dataset | Proposed model | Cluster representatives | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 2 | 3 | ||||||
| COLON | MFSAC-EC | NB | 100 | 98.39 | 98.39 | MLL | MFSAC-EC | NB | 100 | 100 | 100 |
| KNN | 98.39 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 98.39 | 98.39 | 98.39 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 98.4 | 98.4 | SVM | 100 | 100 | 100 | ||||
| Prostate | NB | 97.06 | 97.79 | 98.53 | SRBCT | NB | 96.83 | 100 | 100 | ||
| KNN | 97.79 | 97.79 | 98.53 | KNN | 96.83 | 100 | 100 | ||||
| DT | 97.79 | 98.53 | 97.79 | DT | 96.83 | 98.41 | 100 | ||||
| SVM | 98.53 | 99.26 | 99.26 | SVM | 82.54 | 98.41 | 100 | ||||
| Leukemia | NB | 100 | 100 | 100 | Lung | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| RAOA | NB | 100 | 100 | 100 | RAHC | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| Breast | NB | 100 | 100 | 100 | RBreast | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
Classification accuracy of the proposed MFSAC-EC model with respect to fivefold cross validation.
Classification accuracy (%) of the MFSAC-EC model is shown for fivefold cross validation with four ensemble classifiers: MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using fivefold cross validation for every dataset, and the accuracy obtained the greatest number of times is reported.
| Dataset | Proposed model | Cluster representatives | Dataset | Proposed model | Cluster representatives | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 2 | 3 | ||||||
| COLON | MFSAC-EC | NB | 96.77 | 96.77 | 96.77 | MLL | MFSAC-EC | NB | 100 | 100 | 100 |
| KNN | 98.39 | 96.77 | 96.77 | KNN | 98.61 | 100 | 100 | ||||
| DT | 98.39 | 96.77 | 98.39 | DT | 98.61 | 100 | 100 | ||||
| SVM | 98.39 | 96.77 | 96.77 | SVM | 100 | 100 | 100 | ||||
| Prostate | NB | 97.06 | 97.79 | 98.53 | SRBCT | NB | 98.41 | 100 | 100 | ||
| KNN | 97.79 | 97.79 | 99.26 | KNN | 96.83 | 100 | 100 | ||||
| DT | 97.06 | 97.79 | 94.85 | DT | 96.83 | 98.41 | 100 | ||||
| SVM | 97.79 | 98.53 | 99.26 | SVM | 96.83 | 100 | 100 | ||||
| Leukemia | NB | 100 | 100 | 100 | Lung | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 99.44 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| RAOA | NB | 100 | 100 | 100 | RAHC | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| Breast | NB | 100 | 100 | 100 | RBreast | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
Classification accuracy of the proposed MFSAC-EC model with respect to tenfold cross validation.
Classification accuracy (%) of the MFSAC-EC model is shown for tenfold cross validation with four ensemble classifiers: MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using tenfold cross validation for every dataset, and the accuracy obtained the greatest number of times is reported.
| Dataset | Proposed model | Cluster representatives | Dataset | Proposed model | Cluster representatives | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 2 | 3 | ||||||
| COLON | MFSAC-EC | NB | 98.39 | 98.39 | 98.39 | MLL | MFSAC-EC | NB | 100 | 100 | 100 |
| KNN | 98.39 | 98.39 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 98.39 | 98.39 | 98.39 | DT | 100 | 100 | 100 | ||||
| SVM | 98.39 | 98.39 | 98.39 | SVM | 100 | 100 | 100 | ||||
| Prostate | NB | 97.06 | 97.79 | 98.53 | SRBCT | NB | 96.83 | 96.83 | 100 | ||
| KNN | 97.79 | 97.79 | 99.26 | KNN | 92.06 | 100 | 100 | ||||
| DT | 97.06 | 97.79 | 94.85 | DT | 95.24 | 96.83 | 100 | ||||
| SVM | 97.79 | 98.53 | 99.26 | SVM | 80.95 | 92.06 | 100 | ||||
| Leukemia | NB | 100 | 100 | 100 | Lung | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| Breast | NB | 100 | 100 | 100 | RBreast | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| RAOA | NB | 100 | 100 | 100 | RAHC | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
Classification accuracy of the proposed MFSAC-EC model with respect to random splitting of the datasets.
Classification accuracy (%) of the MFSAC-EC model is shown for random splitting with four ensemble classifiers: MFSAC-EC+NB, MFSAC-EC+KNN, MFSAC-EC+DT, and MFSAC-EC+SVM. Every ensemble classifier is run 50 times using random splitting for every dataset, and the accuracy obtained the greatest number of times is reported. For random splitting, the dataset is randomly divided into training (2/3) and testing (1/3) parts 50 times.
| Dataset | Proposed model | Cluster representatives | Dataset | Proposed model | Cluster representatives | ||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1 | 2 | 3 | ||||||
| COLON | MFSAC-EC | NB | 98.39 | 98.39 | 98.39 | MLL | MFSAC-EC | NB | 100 | 100 | 100 |
| KNN | 98.39 | 98.39 | 98.39 | KNN | 100 | 100 | 100 | ||||
| DT | 98.39 | 98.39 | 98.39 | DT | 98.61 | 100 | 98.61 | ||||
| SVM | 98.39 | 100 | 98.39 | SVM | 100 | 100 | 100 | ||||
| Prostate | NB | 94.68 | 95.74 | 93.62 | SRBCT | NB | 95 | 85 | 95 | ||
| KNN | 97.87 | 96.81 | 92.55 | KNN | 95 | 100 | 90 | ||||
| DT | 94.68 | 94.68 | 94.68 | DT | 80 | 90 | 95 | ||||
| SVM | 94.68 | 96.81 | 94.68 | SVM | 65 | 75 | 95 | ||||
| Leukemia | NB | 100 | 100 | 100 | Lung | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 100 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 100 | ||||
| RAOA | NB | 100 | 100 | 100 | RAHC | NB | 100 | 100 | 100 | ||
| KNN | 100 | 100 | 100 | KNN | 100 | 100 | 100 | ||||
| DT | 100 | 100 | 100 | DT | 100 | 100 | 81.25 | ||||
| SVM | 100 | 100 | 100 | SVM | 100 | 100 | 81.25 | ||||
| Breast | NB | 100 | 100 | 100 | RBreast | NB | 91.94 | 91.94 | 91.94 | ||
| KNN | 100 | 100 | 100 | KNN | 85.48 | 87.10 | 83.87 | ||||
| DT | 100 | 100 | 100 | DT | 83.87 | 79.03 | 80.65 | ||||
| SVM | 100 | 100 | 100 | SVM | 93.55 | 91.94 | 91.94 | ||||
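
The random-splitting protocol behind this table (50 random 2/3 : 1/3 splits, reporting the accuracy obtained most often) can be sketched as follows. This is a minimal illustration, not the authors' code: the 1-nearest-neighbour classifier is a stand-in for the MFSAC-EC ensemble, the splits are not stratified, and all names and toy data are hypothetical.

```python
import random
from collections import Counter

def one_nn_predict(train_x, train_y, test_x):
    """Stand-in base classifier: 1-nearest neighbour with squared Euclidean distance."""
    preds = []
    for x in test_x:
        dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_x]
        preds.append(train_y[dists.index(min(dists))])
    return preds

def most_frequent_accuracy(x, y, predict, runs=50, train_fraction=2 / 3, seed=0):
    """Repeat a random train/test split and return the most frequent accuracy (%)."""
    rng = random.Random(seed)
    n = len(y)
    n_train = round(n * train_fraction)
    accuracies = []
    for _ in range(runs):
        order = list(range(n))
        rng.shuffle(order)
        train, test = order[:n_train], order[n_train:]
        preds = predict([x[i] for i in train], [y[i] for i in train],
                        [x[i] for i in test])
        correct = sum(p == y[i] for p, i in zip(preds, test))
        accuracies.append(round(100.0 * correct / len(test), 2))
    value, _ = Counter(accuracies).most_common(1)[0]
    return value, accuracies

# Toy data, invented for illustration: two well-separated groups of six samples.
x = [[0.1 * i, 0.0] for i in range(6)] + [[10.0 + 0.1 * i, 0.0] for i in range(6)]
y = [0] * 6 + [1] * 6
best, accs = most_frequent_accuracy(x, y, one_nn_predict)
```

Reporting the mode rather than the mean hides the spread across runs, so the full `accs` list is returned alongside it.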
Evaluation of MFSAC-EC classifier based on SN, SP, PPV, NPV, FPR for two class data sets with respect to LOOCV.
The performance of the MFSAC-EC model for two-class datasets is represented using Receiver Operating Characteristic (ROC) analysis. SN represents Sensitivity, SP represents Specificity, PPV represents Positive Predictive Value, NPV represents Negative Predictive Value, and FPR represents False Positive Rate.
| Dataset | Proposed model | SN | SP | PPV | NPV | FPR | Dataset | Proposed model | SN | SP | PPV | NPV | FPR | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Leukemia | MFSAC-EC | NB | 100 | 100 | 100 | 100 | 0 | Breast | MFSAC-EC | NB | 100 | 100 | 100 | 100 | 0 |
| KNN | 100 | 100 | 100 | 100 | 0 | KNN | 100 | 100 | 100 | 100 | 0 | ||||
| DT | 100 | 100 | 100 | 100 | 0 | DT | 100 | 100 | 100 | 100 | 0 | ||||
| SVM | 100 | 100 | 100 | 100 | 0 | SVM | 100 | 100 | 100 | 100 | 0 | ||||
| Prostate | NB | 98.7 | 98.3 | 98.7 | 98.3 | 1.7 | Rbreast | NB | 100 | 100 | 100 | 100 | 0 | ||
| KNN | 98.7 | 98.3 | 98.7 | 98.3 | 1.7 | KNN | 100 | 100 | 100 | 100 | 0 | ||||
| DT | 100 | 96.61 | 97.46 | 100 | 3.4 | DT | 100 | 100 | 100 | 100 | 0 | ||||
| SVM | 100 | 98.3 | 98.7 | 100 | 1.7 | SVM | 100 | 100 | 100 | 100 | 0 | ||||
| Colon | NB | 100 | 100 | 100 | 100 | 0 | Lung | NB | 100 | 100 | 100 | 100 | 0 | ||
| KNN | 100 | 100 | 100 | 100 | 0 | KNN | 100 | 100 | 100 | 100 | 0 | ||||
| DT | 100 | 100 | 100 | 100 | 0 | DT | 100 | 100 | 100 | 100 | 0 | ||||
| SVM | 100 | 100 | 100 | 100 | 0 | SVM | 100 | 100 | 100 | 100 | 0 | ||||
| RAHC | NB | 100 | 100 | 100 | 100 | 0 | RAOA | NB | 100 | 100 | 100 | 100 | 0 | ||
| KNN | 100 | 100 | 100 | 100 | 0 | KNN | 100 | 100 | 100 | 100 | 0 | ||||
| DT | 100 | 100 | 100 | 100 | 0 | DT | 100 | 100 | 100 | 100 | 0 | ||||
| SVM | 100 | 100 | 100 | 100 | 0 | SVM | 100 | 100 | 100 | 100 | 0 | ||||
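
The five measures in this table follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions. The counts in the example are illustrative and not taken from the source, although they sum to 136 samples and round to the values in the Prostate NB row.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return SN, SP, PPV, NPV and FPR in percent from confusion-matrix counts."""
    def pct(num, den):
        # Guard against empty denominators (e.g. no positive predictions).
        return 100.0 * num / den if den else float("nan")
    return {
        "SN": pct(tp, tp + fn),   # sensitivity: true positive rate
        "SP": pct(tn, tn + fp),   # specificity: true negative rate
        "PPV": pct(tp, tp + fp),  # positive predictive value
        "NPV": pct(tn, tn + fn),  # negative predictive value
        "FPR": pct(fp, fp + tn),  # false positive rate = 100 - SP
    }

# Illustrative counts (assumption): 77 positives and 59 negatives, one error per class.
m = binary_metrics(tp=76, fp=1, tn=58, fn=1)
rounded = {name: round(value, 1) for name, value in m.items()}
```

Since FPR is the complement of SP, the two columns of the table always sum to 100 for a given row.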
Figure 4. AUC for three datasets using MFSAC-EC+KNN, MFSAC-EC+NB, MFSAC-EC+DT and MFSAC-EC+SVM classifiers.
(A) For the breast cancer dataset using LOOCV. (B) For the colon cancer dataset using fivefold cross validation. (C) For RAHC dataset using tenfold cross validation.
Figure 5. Heatmap of MFSAC-EC with base classifiers NB, KNN, DT and SVM, respectively, for multiclass datasets.
(A) For the SRBCT dataset using fivefold cross validation. (B) For MLL dataset using tenfold cross validation.
Figure 6. Comparison of MFSAC-EC with other well-known supervised gene selection methods and full gene set in terms of fivefold cross validation for all datasets.
In each panel, the classification accuracy (%) of the MFSAC-EC model and of other supervised gene selection methods is shown for all datasets as differently colored bars, using (A) NB, (B) KNN, (C) DT and (D) SVM as the base classifier.
Figure 7. Comparison of MFSAC-EC with other well-known unsupervised gene selection methods in terms of random splitting for different datasets.
Classification accuracy (%) of the MFSAC-EC model and of other unsupervised gene selection methods is shown for four datasets as differently colored bars, using (A) NB, (B) DT and (C) SVM as the base classifier.
Comparison of MFSAC-EC + DT with different existing Ensemble classifiers using DT in terms of tenfold cross validation.
Here MFSAC-EC + DT model is compared with existing ensemble classifiers where DT is used as base classifier. C4.5 algorithm is used as DT.
| MFSAC-EC | PCA-based RotBoost | ICA-based RotBoost | AdaBoost | Bagging | Arcing | Rotation Forest | EN-NEW1 | EN-NEW2 | |
|---|---|---|---|---|---|---|---|---|---|
| Colon | 98.39 | 95.48 | 96.1 | 94.97 | 94.92 | 69.35 | 95.21 | 79.03 | 83.87 |
| Leukemia | 100 | 98.75 | 98.77 | 98.22 | 97.47 | Not Found | 97.97 | Not Found | Not Found |
| Breast | 100 | 94.39 | 97.88 | 98.89 | 92.74 | 80.41 | 98.6 | 94.85 | 95.88 |
| Lung | 100 | 98.11 | 99.54 | 96.3 | 97.08 | 97.24 | 97.56 | 98.34 | 99.45 |
| Prostate | 97.79 | Not Found | Not Found | 90.44 | 94.12 | 87.5 | Not Found | 94.85 | 97.06 |
| MLL | 100 | 98.86 | 99.31 | 97.63 | 97.11 | 91.67 | 97.61 | 93.06 | 98.61 |
| SRBCT | 100 | 99.5 | 99.59 | 98.16 | 96.46 | Not Found | 97.44 | Not Found | Not Found |
Comparison of MFSAC-EC using DT, KNN, NB, SVM with different existing Ensemble classifiers using DT, KNN, NB, SVM in terms of tenfold cross validation.
Here classification accuracy (%) of four ensemble classifiers MFSAC-EC + NB, MFSAC-EC + KNN, MFSAC-EC+DT, and MFSAC-EC+SVM are shown with respect to results of other existing ensemble classifiers with the same base learners. The best accuracy (%) for every dataset is shown in bold.
| Dataset | MFSAC-EC | Bagging | Boosting | Stacking | HBSA | SD_Ens | Meta_Ens | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DT | NB | KNN | SVM | DT | NB | KNN | DT | NB | KNN | DT | NB | KNN | KNN | SVM | |||
| Leukemia | | | | | 94.12 | 88.23 | 73.53 | 91.18 | 88.24 | 75.53 | 91.18 | 91.18 | 91.18 | 88.46 | 88.46 | 92.45 | 94.12 |
| Colon | 98.39 | 98.39 | | 98.39 | 95.16 | 66.13 | 90.32 | 98.39 | 87.1 | 91.94 | 98.39 | 93.59 | 93.59 | 75 | 85 | 94.4 | 99.21 |
| Prostate | 97.79 | | | | 26.47 | 26.47 | 38.24 | 26.47 | 26.47 | 52.94 | 26.47 | 26.47 | 52.94 | 85.29 | 97.06 | 52.94 | 52.94 |
| Lung | | | | | 91.28 | 96.64 | 97.32 | 81.88 | 95.3 | 97.99 | 97.99 | 97.99 | 96.64 | Not Found | Not Found | 81.88 | 97.99 |
| Breast | | | | | 78.95 | 36.84 | 68.42 | 68.42 | 36.84 | 68.42 | 68.42 | 68.42 | 68.42 | Not Found | Not Found | 73.49 | 79.87 |
Comparison of MFSAC-EC using SVM and KNN with respect to different existing deep learning classifiers using random splitting.
Here the classification accuracy (%) of two ensemble classifiers, MFSAC-EC+KNN and MFSAC-EC+SVM, is shown alongside the results of existing deep learning classifiers (Folded Autoencoder and Autoencoder) with the same base learners. The best accuracy (%) for every dataset is shown in bold.
| Dataset | SVM | KNN | ||||
|---|---|---|---|---|---|---|
| MFSAC-EC | Folded Autoencoder | Autoencoder | MFSAC-EC | Folded Autoencoder | Autoencoder | |
| Colon | | 90.15 | 73.11 | | 81.09 | 56.97 |
| Prostate | | 84.16 | 64.3 | | 76.48 | 52.1 |
| Leukemia | | 93.62 | 84.12 | | 85.24 | 77.13 |
List of genes selected by MFSAC-EC model for the colon and leukemia cancer datasets.
| Dataset | Gene name | Accession number | Description | Validation of genes |
|---|---|---|---|---|
| Colon | TPM1 | Hsa.1130 | Human tropomyosin isoform mRNA, complete cds. | |
| IGFBP4 | Hsa.1532 | Human insulin-like growth factor binding protein-4 (IGFBP4) gene, promoter and complete cds. | ||
| MYL9 | Hsa.1832 | Myosin Regulatory Light Chain 2, Smooth Muscle Isoform (Human); contains element TAR1 repetitive element | ||
| ALDH1L1 | Hsa.10224 | Aldehyde Dehydrogenase, Mitochondrial X Precursor ( | ||
| KLF9 | Hsa.41338 | Human mRNA for GC box binding protein/ Kruppel Like Factor 9, complete cds | ||
| MEF2C | Hsa.5226 | Myocyte-Specific Enhancer Factor 2, Isoform MEF2 (Homosapiens) | ||
| GAPDH | Hsa.1447 | Glyceraldehyde 3-Phosphate Dehydrogenase | ||
| TIMP3 | Hsa.11582 | Metalloproteinase Inhibitor 3 Precursor | ||
| Leukemia | TXN | X77584_at | TXN Thioredoxin | |
| CSF3R | M59820_at | CSF3R Colony stimulating factor 3 receptor (granulocyte) | ||
| MPO | M19508_xpt3_s_at | MPO from Human myeloperoxidase gene | ||
| LYZ | M21119_s_at | LYZ Lysozyme | ||
| CST3 | M27891_at | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) | ||
| ZYX | X95735_at | Zyxin | ||
| CTSD | M63138_at | CTSD Cathepsin D (lysosomal aspartyl protease) | ||
| CD79A/MB-1 gene | U05259_rna1_at | MB-1 membrane glycoprotein |
Note:
Here the second column gives the gene names, the third column the gene accession numbers, the fourth column the gene descriptions, and the fifth column the literature in which each gene has been reported as a cancer biomarker.
The gene names and their corresponding accession numbers for both of the datasets COLON and LEUKEMIA can be found in the following links: COLON: http://genomics-pubs.princeton.edu/oncology/affydata/index.html, http://genomics-pubs.princeton.edu/oncology/affydata/names.html. LEUKEMIA: https://www.kaggle.com/crawford/gene-expression.
Figure 8. Original gene (different class label with different color) and corresponding augmented gene with respect to different filter methods for the Breast Cancer dataset.
Seven figures, one for each of the seven filter score functions, are shown here. In each figure the original gene and the augmented gene are plotted with respect to the sample class label. The X-axis represents the class label and the Y-axis represents the expression value. The two class labels are shown in blue and red. The difference between the expression values of the two classes in the augmented gene shows the class discrimination ability of that gene. Gene number is the column number in the original dataset.
MFSAC-EC
|
|
| TR+, TR−, R are vectors similar to a gene vector. |
| 1. Create |
| 2. For Every bootstrapped dataset |
| 3. Repeat for |
| A. Repeat for |
| a) Calculate class relevance score |
| B. Select |
| C. Set |
| D. Repeat until |
| a) Set |
| b) Set |
| c) Select the gene (let |
| d) Add |
| e) Set count =1 |
| f) Repeat for |
| I. Compute first augmented representatives |
| II. Compute second augmented representatives |
| III. Compute class relevance value |
| IV. If |
| If |
| • Set • count = count + 1 |
| V. If |
| If |
| • Set • count = count + 1 |
| g) Set |
| h) Set |
| E. Select |
| F. Construct a classifier |
| 4. Apply a test sample over all the classifiers of all bootstrapped dataset and calculate the prediction accuracy of each classifier |
| 5. Apply simple voting over all predictions to form an ensemble classifier |
| 6. Calculate number of occurrences for every gene for all |
| 7. Select a number of top-ranked genes as informative genes. |
| 8. End |