Murad Al-Rajab, Joan Lu, Qiang Xu.
Abstract
Gene expression profiles can be utilized in the diagnosis of critical diseases such as cancer. The selection of biomarker genes from these profiles is crucial for cancer detection. This paper presents a framework proposing a two-stage multifilter hybrid model of feature selection for colon cancer classification. Colon cancer is now extremely common among cancer types. There is a need for a fast and accurate method to detect cancerous tissues and to enhance the diagnostic process and drug discovery. The objective of this study is to improve the diagnosis of colon cancer through a two-stage, multifilter model of feature selection. The first stage selects features using a combination of Information Gain and a Genetic Algorithm. The next stage filters and ranks the genes identified through this method using the minimum Redundancy Maximum Relevance (mRMR) technique. The final phase further analyzes the data using correlated machine learning algorithms. This two-stage approach, which selects genes before classification techniques are applied, improves success rates for the identification of cancer cells. Decision Tree, K-Nearest Neighbor, and Naïve Bayes classifiers showed promising accuracy using the developed hybrid framework model. The proposed method achieved higher accuracy than the existing methods reported in the literature. This study can be used as a clue to enhance treatment and drug discovery for colon cancer.
Year: 2021 PMID: 33861766 PMCID: PMC8691854 DOI: 10.1371/journal.pone.0249094
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Colon cancer hybrid methods literature review for classification accuracy.
| No. | Reference | Feature Selection | Classifier | Accuracy [%] |
|---|---|---|---|---|
| 1. | [ | PSO+GA | SVM | 91.90 |
| 2. | [ | mRMR + PSO | SVM | 90.32 |
| 3. | [ | Genetic Algorithm (GA) | SVM | 90.32 |
| 4. | [ | CFS + Wrapper (J48) | SVM | 89.03 |
| 5. | [ | Filter (F-Score+IG) + Wrapper (SBE) | SVM | 87.50 |
| 6. | [ | CFS + Wrapper (Random Forest) | SVM | 87.10 |
| 7. | [ | CFS + Wrapper (Random Trees) | SVM | 85.48 |
| 8. | [ | mRMR | SVM | 85.48 |
| 9. | [ | mRMR+GA-SVM | SVM | 85.48 |
| 10. | [ | mRMR+GA | SVM | 85.48 |
| 11. | [ | FSBRR + MI | KNN | 91.91 |
| 12. | [ | CFS + Wrapper (Random Forest) | KNN | 87.10 |
| 13. | [ | CFS + Wrapper (J48) | KNN | 85.48 |
| 14. | [ | CFS + Wrapper (Random Trees) | KNN | 82.26 |
| 15. | [ | Genetic Algorithm (GA) | DT | 88.8 |
| 16. | [ | PSO+GA | DT | 83.9 |
| 17. | [ | GE Hybrid | DT | 83.41 |
| 18. | [ | IG | DT | 77.26 |
| 19. | [ | MF-GE | DT | 76.64 |
| 20. | [ | PSO+GA | Naïve Bayes | 85.50 |
| 21. | [ | GE Hybrid | Naïve Bayes | 84.96 |
| 22. | [ | MF-GE | Naïve Bayes | 75.07 |
| 23. | [ | mRMR | Naïve Bayes | 66.13 |
| 24. | [ | MIM+AGA | Extreme Learning Machine (ELM) | 89.09 |
| 25. | [ | Information Gain (IG) & Standard Genetic Algorithm (SGA) | Genetic Programming | 85.48 |
| 26. | [ | GE Hybrid | 7-Nearest Neighbor | 85.34 |
| 27. | [ | MF-GE | 7-Nearest Neighbor | 68.78 |
| 28. | [ | GE Hybrid | 3-Nearest Neighbor | 84.93 |
| 29. | [ | MF-GE | 3-Nearest Neighbor | 77.01 |
| 30. | [ | GE Hybrid | Random Forests | 81.67 |
| 31. | [ | MF-GE | Random Forests | 74.35 |
| 32. | [ | PCA | GA + ANN | 83.33 |
Description of the gene expression datasets used in the study.
| Type of dataset | No. of genes across the samples | Classification type | Total no. of samples | No. of samples per class |
|---|---|---|---|---|
| Alon et al. [ | 2000 | Tumour | 62 | 40 |
| | | Normal | | 22 |
| Notterman [ | 7457 | Tumour | 36 | 18 |
| | | Normal | | 18 |
Fig 1. Proposed computational model framework.
Fig 2. Pseudocode of the proposed model.
Fig 3. The summary of the developed multifilter 2-stage framework method.
Number of selected features by the proposed method on each dataset.
| Colon Features | Dataset 1 | Dataset 2 |
|---|---|---|
| Full Data Set | 2000 | 6597* |
| Phase 1 (IG+GA) | 68 | 475 |
| Phase 2 (Phase 1 + mRMR) | 22 | 35 |

\* This number is the count after eliminating duplicates.
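The Information Gain filter used in Phase 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the median split used to discretize each gene, the function names, and the toy expression values are all assumptions made for illustration, and the Genetic Algorithm part of Phase 1 is not shown.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction from splitting samples at the gene's median value.

    The median split is an assumption; the source does not state how
    expression values are discretized.
    """
    threshold = median(values)
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (low, high) if part
    )
    return entropy(labels) - remainder


def rank_genes(expression, labels):
    """Return (gene, information gain) pairs sorted from most to least informative."""
    scores = {gene: information_gain(vals, labels) for gene, vals in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: two genes measured on four samples.
expression = {
    "gene_a": [1.0, 1.2, 5.0, 5.5],  # separates the two classes cleanly
    "gene_b": [2.0, 4.0, 2.1, 4.2],  # carries no class information
}
labels = ["normal", "normal", "tumour", "tumour"]

ranking = rank_genes(expression, labels)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

In this toy case `gene_a` receives the maximum gain of 1 bit and `gene_b` receives 0, so a filter that keeps the top-ranked genes would retain only `gene_a`.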
Top genes ranked and selected according to the proposed framework model.
| Dataset 1 | Dataset 2 |
|---|---|
| M26383 gene 1 "Human monocyte-derived neutrophil-activating protein (MONAP) mRNA, complete cds." | R36977 yf53h07.s1 Homo sapiens cDNA clone 26045 3’ similar to SP:TF3A_XENLA P03001 TRANSCRIPTION FACTOR IIIA; |
| M63391 gene 1 "Human desmin gene, complete cds. " | M77836 "Human pyrroline 5-carboxylate reductase mRNA, complete cds" |
| M76378 gene 1 "Human cysteine-rich protein (CRP) gene, exons 5 and 6. " | T96548 "ye49f12.s1 Homo sapiens cDNA clone 121103 3’ similar to gb:X16940 ACTIN, GAMMA-ENTERIC SMOOTH MUSCLE (HUMAN);" |
| J02854 gene 1 "MYOSIN REGULATORY LIGHT CHAIN 2, SMOOTH MUSCLE ISOFORM (HUMAN);contains element TAR1 repetitive element;." | T64297 "yc48a10.s1 Homo sapiens cDNA clone 83898 3’ similar to gb:M10050 FATTY ACID-BINDING PROTEIN, LIVER (HUMAN);" |
| T96873 3’ UTR 2a 121343 HYPOTHETICAL PROTEIN IN TRPE 3’REGION (Spirochaeta aurantia) | M97496 "Homo sapiens guanylin mRNA, complete cds" |
| U21090 gene 1 "Human DNA polymerase delta small subunit mRNA, complete cds. " | X64559 H.sapiens mRNA for tetranectin |
| H40560 3’ UTR 1 175410 THIOREDOXIN (HUMAN);. | Z50753 H.sapiens mRNA for GCAP-II/uroguanylin precursor |
| M36634 gene 1 "Human vasoactive intestinal peptide (VIP) mRNA, complete cds." | M83670 "Human carbonic anhydrase IV mRNA, complete cds" |
| T51571 3’ UTR 1 72250 P24480 CALGIZZARIN. | T52362 yb23g02.s1 Homo sapiens cDNA clone 72050 3’ |
| M91463 gene 1 "Human glucose transporter (GLUT4) gene, complete cds." | H57136 yr08c08.s1 Homo sapiens cDNA clone 204686 3’ similar to SP:A40533 A40533 CAMP-DEPENDENT PROTEIN KINASE MAJOR MEMBRANE SUBSTRATE PRECURSOR—; |
| T62947 3’ UTR 2a 79366 60S RIBOSOMAL PROTEIN L24 (Arabidopsis thaliana) | U17077 "Human BENE mRNA, partial cds" |
| R97912 3’ UTR 2a 200181 SERINE/THREONINE-PROTEIN KINASE IPL1 (Saccharomyces cerevisiae) | T67077 ya52f06.s1 Homo sapiens cDNA clone 66563 3’ similar to SP:A40533 A40533 CAMP-DEPENDENT PROTEIN KINASE MAJOR MEMBRANE SUBSTRATE PRECURSOR—; |
| L41559 gene 1 "Homo sapiens pterin-4a-carbinolamine dehydratase (PCBD) mRNA, complete cds." | T55741 yb40d07.s1 Homo sapiens cDNA clone 73645 3’ similar to SP:TELO_RABIT P29294 |
| R39209 3’ UTR 2a 23464 HUMAN IMMUNODEFICIENCY VIRUS TYPE I ENHANCER-BINDING PROTEIN 2 (Homo sapiens) | M12272 "Homo sapiens alcohol dehydrogenase class I gamma subunit (ADH3) mRNA, complete cds" |
| T90350 3’ UTR 2a 110964 MYOBLAST CELL SURFACE ANTIGEN 24.1D5 (Homo sapiens) | D63874 "Human mRNA for HMG-1, complete cds" |
| T54276 3’ UTR 1 69195 PROTEASOME COMPONENT C13 (HUMAN). | R71676 yj85e03.s1 Homo sapiens cDNA clone 155548 3’ |
| R49459 3’ UTR 2a 38253 TRANSFERRIN RECEPTOR PROTEIN (Homo sapiens) | M26697 "Human nucleolar protein (B23) mRNA, complete cds" |
| Z24727 gene 1 "H.sapiens tropomyosin isoform mRNA, complete CDS." | M80244 "Human E16 mRNA, complete cds" |
| T51849 3’ UTR 2a 75009 TYROSINE-PROTEIN KINASE RECEPTOR ELK PRECURSOR (Rattus norvegicus) | L11708 "Human 17 beta hydroxysteroid dehydrogenase type 2 mRNA, complete cds" |
| K03460 gene 1 "Human alpha-tubulin isotype H2-alpha gene, last exon." | T46924 yb11b02.s1 Homo sapiens cDNA clone 70827 3’ similar to gb:U11863 AMILORIDE-SENSITIVE AMINE OXIDASE (HUMAN) |
| X61118 gene 1 Human TTG-2 mRNA for a cysteine rich protein with LIM motif. | U17899 "Human chloride channel regulatory protein mRNA, complete cds" |
| R06601 3’ UTR 2a 126458 METALLOTHIONEIN-II (Homo sapiens) | X73502 H. Sapiens mRNA for cytokeratin 20 |
| | H09351 yl95g07.s1 Homo sapiens cDNA clone 46019 3’ similar to gb:D28480 MCM3 HOMOLOG (HUMAN); |
| | H06524 "yl78h01.s1 Homo sapiens cDNA clone 44386 3’ similar to gb:X04412 GELSOLIN PRECURSOR, PLASMA (HUMAN);" |
| | H77597 ys08a06.s1 Homo sapiens cDNA clone 214162 3’ similar to gb:X64177 H.sapiens mRNA for metallothionein (HUMAN); |
| | X15183 Human mRNA for 90-kDa heat-shock protein |
| | R50129 yj54h10.s1 Homo sapiens cDNA clone 152611 3’ similar to gb:J02939 4F2 CELL-SURFACE ANTIGEN HEAVY CHAIN (HUMAN); |
| | L03840 "Human fibroblast growth factor receptor 4 (FGFR4) mRNA, complete cds" |
| | T51261 yb03h03.s1 Homo sapiens cDNA clone 70133 3’ |
| | H14506 ym18f10.s1 Homo sapiens cDNA clone 48421 3’ |
| | H08393 yl92a10.s1 Homo sapiens cDNA clone 45395 3’ |
| | T55200 yb43f08.s1 Homo sapiens cDNA clone 73959 3’ similar to gb:M10942_cds1 Human metallothionein-Ie gene (HUMAN) |
| | Z17227 H.sapiens mRNA for transmenbrane receptor protein |
| | H65066 yr69f12.s1 Homo sapiens cDNA clone 210575 3’ similar to SP:VIS1_RAT P28677 VISININ-LIKE PROTEIN 1; contains MER6 repetitive element; |
| | H17127 ym42e05.s1 Homo sapiens cDNA clone 50869 3’ |
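The mRMR ranking behind gene lists like the one above can be sketched as a greedy selection. The following is a minimal illustration under stated assumptions, not the authors' code: it uses the difference form of mRMR (relevance minus mean redundancy), assumes the features are already discretized, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k):
    """Greedily pick k features maximising relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, labels) for name, col in features.items()}
    selected = []

    def score(name):
        # Relevance to the class, penalised by average overlap with chosen features.
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical discretized expression levels for eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a": [0, 0, 0, 1, 1, 1, 1, 1],       # strongly related to the class
    "gene_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of gene_a
    "gene_b": [0, 0, 1, 0, 1, 1, 0, 1],       # weaker, but adds new information
}

chosen = mrmr(features, labels, k=2)
print(chosen)
```

The duplicate gene is as relevant as `gene_a` but fully redundant with it, so the second pick is `gene_b`. This is the behaviour that lets an mRMR pass shrink a candidate set without keeping near-copies of the same signal.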
Comparison summary between stage 1 and stage 2 accuracy results.
| | | | |
|---|---|---|---|
| 79.0% | 97.2% | 82.3% | 97.2% |
| 80.7% | 97.2% | 83.9% | |
| 90.3% | 94.4% | 90.3% | 94.4% |
| 77.4% | 97.2% | 79.0% | 97.2% |
| | | | |
| 83.9% | 97.2% | 82.3% | 97.2% |
| 82.3% | 97.2% | 88.7% | 97.2% |
| 83.9% | 91.7% | 83.9% | 91.7% |
| 80.7% | 97.2% | 85.5% | 97.2% |
| | | | |
| 81.3% | 100% | 81.3% | 100% |
| 75.0% | 100% | 87.5% | 100% |
| 81.3% | 72.2% | | 72.7% |
| 81.3% | 100% | | 100% |
Fig 4. Evaluation of the proposed procedure’s classification accuracy using different testing models.
Fig 5. Evaluation of the proposed procedure’s classification accuracy using the best testing model with low error rates.
Confusion matrix of the present study for the evaluation of dataset 1.
| Confusion Matrix | SVM | | NB | | DT | | K-NN | |
|---|---|---|---|---|---|---|---|---|
| | 3 | 1 | 3 | 1 | 4 | 0 | 4 | 0 |
| | 2 | 10 | 1 | 11 | 1 | 11 | 1 | 11 |
| Accuracy | 0.813 | | 0.875 | | 0.938 | | 0.938 | |
| | 0.229 | | 0.208 | | 0.021 | | 0.021 | |
Confusion matrix of the present study for the evaluation of dataset 2.
| Confusion Matrix | SVM | | NB | | DT | | K-NN | |
|---|---|---|---|---|---|---|---|---|
| | 18 | 0 | 18 | 0 | 18 | 0 | 18 | 0 |
| | 1 | 17 | 0 | 18 | 2 | 16 | 1 | 17 |
| Accuracy | 0.972 | | 1 | | 0.944 | | 0.972 | |
| | 0.028 | | 0 | | 0.056 | | 0.028 | |
Fig 6. ROC curve of the present study for dataset 1.
Fig 7. ROC curve of the present study for dataset 2.
Performance evaluation assessment for the experiment results.
| Classifier | Dataset 1 | | | | Dataset 2 | | | |
|---|---|---|---|---|---|---|---|---|
| | Accuracy (%) | Sensitivity | Specificity | MCC | Accuracy (%) | Sensitivity | Specificity | MCC |
| SVM | 81.25 | 0.75 | 0.833 | 0.545 | 97.2 | 1 | 0.944 | 0.945 |
| NB | 87.50 | 0.75 | 0.917 | 0.667 | 100 | 1 | 1 | 1 |
| DT | 93.75 | 1 | 0.917 | 0.856 | 94.4 | 1 | 0.889 | 0.894 |
| K-NN | 93.75 | 1 | 0.917 | 0.856 | 97.2 | 1 | 0.944 | 0.945 |
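The metrics in this table follow from the confusion-matrix counts by the standard definitions of accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC). The sketch below is illustrative only. It assumes that the first row of each confusion matrix holds (true positives, false negatives) and the second row holds (false positives, true negatives); the row labels did not survive in this record, but that reading reproduces the SVM row for Dataset 1. The function name is hypothetical.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; use 0.0 then.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# SVM on Dataset 1, reading the confusion matrix as [[TP, FN], [FP, TN]] (assumed).
svm_dataset1 = binary_metrics(tp=3, fn=1, fp=2, tn=10)
for name, value in svm_dataset1.items():
    print(f"{name}: {value:.3f}")
```

With these counts the function gives an accuracy of 0.8125, sensitivity 0.75, specificity 0.833, and MCC 0.545, which agrees with the SVM entries reported for Dataset 1.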
Comparison of the proposed method with others reported in the literature using each dataset.
| Method | Accuracy (%) |
|---|---|
| (FSBRR+MI)–K-NN [ | 91.90 |
| (mRMR+PSO)–SVM [ | 90.32 (10) |
| (PSO+GA)–SVM [ | 91.90 (18) |
| (mRMR+GA)–SVM [ | 85.48 (40) |
| Filter (F-Score+IG)–Wrapper (SBE) + SVM [ | 87.50 |
| (mRMR+GA)–SVM [ | 85.48 |
| (PSO+GA)–DT [ | 85.50 |
| (IG+GA)–GP [ | 85.48 |
| (PCA) + GA–ANN [ | 83.33 |
| | |
| F-Score–Majority Voting [ | 97.22 (95) |
| Gain Ratio/Chi-square + ensemble DT [ | 97.22 |