Lingyun Gao, Mingquan Ye, Xiaojie Lu, Daobin Huang.
Abstract
It remains a great challenge to achieve sufficient cancer classification accuracy with the entire set of genes, owing to the high dimensionality, small sample size, and high noise of gene expression data. We therefore proposed a hybrid gene selection method, Information Gain-Support Vector Machine (IG-SVM), in this study. IG was first employed to filter out irrelevant and redundant genes. SVM was then used to remove further redundant genes and thereby eliminate noise in the datasets more effectively. Finally, the informative genes selected by IG-SVM served as the input for the LIBSVM classifier. Evaluated on five cancer gene expression datasets using only a few selected genes, IG-SVM showed the highest classification accuracy among the related algorithms compared. For example, IG-SVM achieved a classification accuracy of 90.32% for colon cancer, which is difficult to classify accurately, based on only three genes: CSRP1, MYL9, and GUCA2B.
Keywords: Cancer classification; Gene selection; Information gain; Small sample size with high dimension; Support vector machine
Year: 2017 PMID: 29246519 PMCID: PMC5828665 DOI: 10.1016/j.gpb.2017.08.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. Workflow of proposed approach
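The two-stage idea in the workflow (a filter that ranks genes, followed by a classifier-driven reduction to a handful of genes) can be sketched as below. This is a minimal illustration, not the authors' implementation: this record does not say how the SVM step removes redundant genes, so the greedy forward search is an assumption, and `filter_score`, `subset_accuracy`, `keep`, and `target` are hypothetical names. In practice `filter_score` would return a gene's IG value and `subset_accuracy` would return the cross-validated accuracy of an SVM trained on the given gene subset.

```python
def filter_then_wrap(genes, filter_score, subset_accuracy, keep=50, target=3):
    """Two-stage gene selection: filter ranking, then greedy wrapper search.

    genes           -- iterable of gene identifiers
    filter_score    -- callable(gene) -> relevance score (e.g. an IG value)
    subset_accuracy -- callable(list_of_genes) -> classifier accuracy
    keep            -- how many top-ranked genes survive the filter stage
    target          -- how many genes to return (3 in the tables here)
    """
    # Stage 1 (filter): keep only the `keep` highest-scoring genes.
    candidates = sorted(genes, key=filter_score, reverse=True)[:keep]

    # Stage 2 (wrapper, assumed strategy): repeatedly add the candidate
    # that gives the best accuracy together with the genes chosen so far.
    selected = []
    while len(selected) < target and candidates:
        best = max(candidates, key=lambda g: subset_accuracy(selected + [g]))
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy usage with made-up scores (not data from the study).
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
toy_accuracy = lambda subset: len(set(subset) & {"a", "d"}) / 2
print(filter_then_wrap(toy_scores, toy_scores.get, toy_accuracy, keep=3, target=2))
# prints ['a', 'd']
```

In the toy run, gene `b` passes the filter with a high score but is skipped by the wrapper stage because it adds nothing to accuracy, which is the kind of redundancy a second stage is meant to catch.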
Details of gene expression datasets examined
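| Dataset | No. of classes | No. of genes | No. of samples | Class 1 | Class 2 |
|---|---|---|---|---|---|
| Lung cancer | 2 | 7129 | 96 | 86 primary lung adenocarcinoma samples | 10 non-neoplastic lung samples |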
| DLBCL | 2 | 4026 | 47 | 24 GCB subgroup cases | 23 ABC subgroup cases |
| Colon cancer | 2 | 2000 | 62 | 40 tumor biopsy samples | 22 normal biopsy samples |
| Prostate cancer | 2 | 12,600 | 102 | 52 prostate tumor samples | 50 non-tumor prostate samples |
| Leukemia | 2 | 7129 | 72 | 25 AML bone marrow samples | 47 ALL bone marrow samples |
Note: DLBCL, diffuse large B-cell lymphoma; GCB, germinal center B-like; ABC, activated B-cell like; AML, acute myelocytic leukemia; ALL, acute lymphoblastic leukemia.
Figure 2. Cancer classification performance using different filters. Classification accuracies, plotted against the number of genes selected using different filters (information gain, gain ratio, reliefF, and correlation), are shown for lung cancer (A), DLBCL (B), colon cancer (C), prostate cancer (D), and leukemia (E). DLBCL, diffuse large B-cell lymphoma.
Cancer classification accuracies (%) obtained based on the top 3 genes selected using hybrid methods
| Lung cancer | 98.96 | 98.96 | 98.96 | |
| DLBCL | 97.87 | 95.74 | | |
| Colon cancer | 85.48 | 87.10 | 87.10 | |
| Prostate cancer | 93.14 | 91.18 | 93.14 | |
| Leukemia | 94.44 | 97.22 | 97.22 |
Note: Numbers in bold represent the highest accuracies achieved for the hybrid gene selection methods tested. DLBCL, diffuse large B-cell lymphoma; IG, information gain; GR, gain ratio; Cor, correlation; SVM, support vector machine.
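To make the percentages easier to read, the sketch below converts a reported accuracy into a count of correctly classified samples. It assumes each accuracy was computed over all samples of a dataset (for example, through a cross-validation scheme that tests every sample once); this record does not state the evaluation protocol, so treat that as an assumption. The sample counts come from the dataset table, and the helper names are illustrative.

```python
# Total samples per dataset, copied from the dataset table.
SAMPLES = {
    "Lung cancer": 96,
    "DLBCL": 47,
    "Colon cancer": 62,
    "Prostate cancer": 102,
    "Leukemia": 72,
}


def correct_count(accuracy_pct, n_samples):
    """Nearest whole number of correctly classified samples."""
    return round(accuracy_pct / 100 * n_samples)


def accuracy_pct(n_correct, n_samples):
    """Accuracy in percent, rounded to two decimals as in the table."""
    return round(100 * n_correct / n_samples, 2)


# Colon cancer, IG-SVM accuracy quoted in the abstract.
n = SAMPLES["Colon cancer"]
print(correct_count(90.32, n), "of", n)   # prints: 56 of 62
print(accuracy_pct(56, n))                # prints: 90.32
```

Under this assumption, one additional misclassified colon sample changes the accuracy by about 1.6 percentage points, so differences such as 85.48% versus 87.10% correspond to a single sample.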
Informative genes selected using IG-SVM
| Dataset | Gene index | Probe ID | IG value | Gene description |
|---|---|---|---|---|
| Lung cancer | F2968 | M61906_at | 0.377 | Phosphoinositide-3-kinase regulatory subunit 1 |
| | F4530 | U45973_at | 0.322 | Inositol polyphosphate-5-phosphatase K |
| | F5983 | X61118_rna1_at | 0.292 | LIM domain only 2 |
| Colon cancer | F765 | M76378_at | 0.356 | Cysteine and glycine rich protein 1 |
| | F1423 | – | 0.315 | Myosin light chain 9 |
| | F377 | – | 0.229 | Guanylate cyclase activator 2B |
| Prostate cancer | F6185 | 37639_at | 0.675 | Hepsin |
| | F7067 | 40436_g_at | 0.366 | Solute carrier family 25 member 6 |
| | F10234 | 41504_s_at | 0.238 | MAF bZIP transcription factor |
Note: The IG value of each gene in a dataset was calculated as described in the Methods section. All genes were ranked by IG value, and the three informative genes were then selected using SVM.
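The Methods section is not part of this record, so the exact IG computation is not shown here. As a rough illustration of how an IG value can be obtained for one gene and used for ranking, the sketch below takes the class entropy minus the conditional entropy after the best binary threshold split of the gene's expression values. The discretization scheme is an assumption, the function names are hypothetical, and the numbers it produces are not expected to match the IG values in the table.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Largest IG over all binary threshold splits of one gene's values."""
    base = entropy(labels)
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # equal values cannot be separated by a threshold
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        best = max(best, base - conditional)
    return best


def rank_genes(expression, labels):
    """Gene indices sorted by decreasing IG; expression[g] holds gene g's values."""
    scores = [information_gain(row, labels) for row in expression]
    order = sorted(range(len(expression)), key=lambda g: scores[g], reverse=True)
    return order, scores


# Toy data: gene 0 separates the two classes perfectly, gene 1 does not.
labels = [0, 0, 1, 1]
expression = [[1.0, 2.0, 8.0, 9.0], [1.0, 8.0, 2.0, 9.0]]
order, scores = rank_genes(expression, labels)
print(order)  # prints [0, 1]
```

With two balanced classes the maximum IG is 1 bit, which the perfectly separating toy gene reaches; a ranking like `order` is what a filter stage would truncate before any classifier-based selection.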