Fei Han1,2, Di Tang3,4, Yu-Wen-Tian Sun3,4, Zhun Cheng5, Jing Jiang3,4, Qiu-Wei Li3,4.
Abstract
BACKGROUND: Gene selection is a critical step in the classification of microarray data. Because particle swarm optimization (PSO) has no complicated evolutionary operators and few parameters to adjust, it has been used increasingly as an effective technique for gene selection. However, PSO is apt to converge to local minima, which leads to premature convergence, so some PSO-based gene selection methods may select non-optimal genes with high probability. Selecting predictive genes with low redundancy without filtering out key genes remains a challenge.
Keywords: Gene scoring; Gene selection; Microarray data; Particle swarm optimization
Year: 2019 PMID: 31182017 PMCID: PMC6557739 DOI: 10.1186/s12859-019-2773-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
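The abstract describes PSO-based gene selection only in general terms. As a rough illustration of the idea, the sketch below runs a generic binary PSO over gene masks on synthetic data. It is not the authors' SC-IPSO-ELM method: the toy data, the nearest-centroid fitness, the per-gene penalty of 0.001, and all function names and parameter values are assumptions made for this example.

```python
import numpy as np

def make_toy_data(n_samples=40, n_genes=50, seed=0):
    """Synthetic two-class data; only genes 0 and 1 carry signal (assumption)."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_samples // 2)
    X = rng.normal(size=(n_samples, n_genes))
    X[y == 1, :2] += 3.0
    return X, y

def fitness(mask, X, y):
    """Training accuracy of a nearest-centroid classifier on the selected
    genes, minus a small penalty per selected gene (hypothetical fitness)."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    classes = np.unique(y)
    centroids = np.stack([Xs[y == c].mean(axis=0) for c in classes])
    dists = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y).mean() - 0.001 * mask.sum())

def binary_pso(X, y, n_particles=20, n_iter=30, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Generic binary PSO: positions are 0/1 gene masks, and velocities are
    mapped to selection probabilities with a sigmoid."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    pos = (rng.random((n_particles, n_genes)) < 0.1).astype(float)
    vel = rng.normal(size=(n_particles, n_genes))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
    g = pbest_fit.argmax()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1 = rng.random(vel.shape)
        r2 = rng.random(vel.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -4.0, 4.0)            # keep the sigmoid away from 0 and 1
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random(vel.shape) < prob).astype(float)
        fit = np.array([fitness(p.astype(bool), X, y) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = pbest_fit.argmax()
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest.astype(bool), float(gbest_fit)

X, y = make_toy_data()
mask, score = binary_pso(X, y)
print("selected genes:", np.flatnonzero(mask), "fitness:", round(score, 3))
```

Because every particle is pulled toward the same global best, a swarm like this can stall on a suboptimal gene subset, which is the premature-convergence problem the abstract points to.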
Fig. 1 The framework of the proposed hybrid gene selection method
Six microarray datasets
| Data | Total Samples | Training samples | Testing samples | Number of classes | Number of genes |
|---|---|---|---|---|---|
| Leukemia | 72 | 38 | 34 | 2 | 7129 |
| Brain Cancer | 60 | 30 | 30 | 2 | 7129 |
| Colon | 62 | 40 | 22 | 2 | 2000 |
| SRBCT | 83 | 63 | 20 | 4 | 2308 |
| LUNG | 203 | 103 | 100 | 5 | 3312 |
| Lymphoma | 58 | 29 | 29 | 2 | 7129 |
The classification accuracy obtained by ELM with different gene subsets selected by the SC-IPSO-ELM method on the six microarray datasets
| Data | Selected gene subsets | 5-fold CV accuracy, mean (%) ± std | Test accuracy, mean (%) ± std |
|---|---|---|---|
| Leukemia | 4050,2642,2121 | 100 ±0.00 | 100 ±0.00 |
| | 4050,2642,1882 | 100 ±0.00 | 100 ±0.00 |
| | 4050,2642,3258 | 100 ±0.00 | 100 ±0.00 |
| | 42335,2642,1843,4050 | 100 ±0.00 | 100 ±0.00 |
| Brain cancer | 1091,798,337 | 90.14 ±0.036 | 89.62 ±0.025 |
| | 3052,973,3041,3692,4796 | 92.00 ±0.023 | 91.78 ±0.046 |
| | 4628,7129,7045,4413,798 | 92.29 ±0.020 | 90.22 ±0.022 |
| | 7129,2881,3052,865,1970,2935,4871 | 92.78 ±0.012 | 91.88 ±0.019 |
| Colon | 14,1976,1325,1993,1870,1892,653,1917,187,22,1209,1060 | 93.63 ±0.025 | 97.27 ±0.013 |
| | 377,792,14,1976,765,187,251,1110,175,53,1293,1740,200 | 93.00 ±0.035 | 98.06 ±0.013 |
| | 792,1423,14,1976,1909,1110,1589,102,107,1916,175,1151 | 93.73 ±0.031 | 98.71 ±0.013 |
| | 792,14,1976,765,1909,1524,1110,175,43,53,1293,1740,251 | 96.86 ±0.033 | 99.05 ±0.011 |
| SRBCT | 742,1003,1954,430,2050,123 | 100 ±0.00 | 100 ±0.00 |
| | 545,1955,1434,509,971,255 | 100 ±0.00 | 100 ±0.00 |
| | 1003,545,1911,153,123,1489,2161 | 100 ±0.00 | 100 ±0.00 |
| | 1955,2050,545,2144,2045,123,1489 | 100 ±0.00 | 100 ±0.00 |
| LUNG | 1765,2779,2841,1474,2045,3191,2763,2817,525,1630 | 98.27 ±0.014 | 93.33 ±0.011 |
| | 525,1493,607,2763,792,580,867,368,3279,2158,1225 | 98.39 ±0.023 | 93.47 ±0.012 |
| | 1765,883,2763,792,580,867,985,3279,2988,2045,814 | 98.67 ±0.021 | 93.60 ±0.019 |
| | 1765,525,2763,2841,1474,2583,867,985,2045,814,918 | 98.67 ±0.019 | 94.01 ±0.024 |
| Lymphoma | 152,2347,2650,5679,438,1855,5863 | 90.60 ±0.023 | 85.11 ±0.020 |
| | 1855,2828,152,2437,806,530,1102 | 92.36 ±0.027 | 89.33 ±0.019 |
| | 5279,4687,4940,5449,1133,1855,4519 | 93.51 ±0.022 | 90.47 ±0.029 |
| | 152,2437,4829,2828,6441,806,2508 | 93.79 ±0.020 | 90.45 ±0.023 |
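The table above reports 5-fold cross-validation accuracy as a mean and standard deviation for each gene subset. The sketch below shows one way such a figure could be computed with a basic extreme learning machine (random sigmoid hidden layer, output weights from a pseudoinverse). It is illustrative only: the synthetic data, the hidden-layer size of 20, the unstratified fold split, and every function name are assumptions, and the record does not state the authors' exact protocol.

```python
import numpy as np

def elm_fit(X, y, n_hidden=20, rng=None):
    """Basic ELM: random input weights, least-squares output weights."""
    rng = np.random.default_rng() if rng is None else rng
    classes = np.unique(y)
    T = (y[:, None] == classes[None, :]).astype(float)  # one-hot targets
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))               # hidden-layer outputs
    beta = np.linalg.pinv(H) @ T                         # output weights
    return W, b, beta, classes

def elm_predict(model, X):
    W, b, beta, classes = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return classes[(H @ beta).argmax(axis=1)]

def cv_accuracy(X, y, genes, k=5, seed=0):
    """Mean and std (in %) of accuracy over k unstratified folds,
    using only the columns listed in `genes`."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = elm_fit(X[train][:, genes], y[train], rng=rng)
        pred = elm_predict(model, X[test][:, genes])
        accs.append((pred == y[test]).mean())
    accs = np.array(accs)
    return float(100 * accs.mean()), float(100 * accs.std())

# Toy data (assumption): 60 samples, 100 genes, genes 0 and 1 informative.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 100))
X[y == 1, :2] += 3.0

mean_acc, std_acc = cv_accuracy(X, y, genes=[0, 1])
print(f"5-fold CV accuracy: {mean_acc:.2f} ± {std_acc:.2f} %")
```

Note that this sketch expresses the standard deviation in percentage points. The table's std values (for example ±0.036 next to 90.14) appear to be on a fractional scale, which the record does not clarify.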
The top ten frequently selected genes with the SC-IPSO-ELM method on the leukemia data
| Gene No. | Gene Name | Description |
|---|---|---|
| 2354 | M92287 | CCND3 Cyclin D3 ∗∘ |
| 6855 | M31523 | CF3 Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47) |
| 2642 | U05259 | MB-1 gene ∗∘ |
| 4050 | X03934 | GB DEF = T-cell antigen receptor gene T3-delta ∗⋆ |
| 1834 | M23197 | CD33 CD33 antigen (differentiation antigen) ∗∘ |
| 1882 | M27891 | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) ∗∘ |
| 4377 | X62654 | ME491 gene extracted from H.sapiens gene for Me491/CD63 antigen |
| 2121 | M63138 | CTSD Cathepsin D (lysosomal aspartyl protease) ∗∘ |
| 2288 | M84526 | DF D component of complement (adipsin) |
| 6271 | M33493 | Tryptase-III mRNA, 3’ end |
∗ also selected in [15]; ∘ also selected in [26]; ⊲ also selected in [22]; ⋆ also selected in [16]; ∙ also selected in [27]
The top ten frequently selected genes with the SC-IPSO-ELM method on the brain cancer data
| Gene No. | Gene Name | Description |
|---|---|---|
| 798 | D86961 | Lipoma HMGIC fusion partner-like 2 |
| 865 | D87454 | KIAA0265 protein |
| 2648 | M28879 | Granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1) |
| 2881 | M57506 | Chemokine (C-C motif) ligand 1 |
| 3041 | M64934 | Kell blood group ∗ |
| 3052 | M65254 | Protein phosphatase 2 (formerly 2A), regulatory subunit A (PR 65), beta isoform |
| 3692 | U03644 | CBF1 interacting corepressor |
| 4628 | U50079 | Histone deacetylase 1 |
| 6571 | X93036 | FXYD domain containing ion transport regulator 3 |
| 7129 | Z97074 | Rab9 effector protein with kelch motifs |
∗ also selected in [15]
The top ten frequently selected genes with the SC-IPSO-ELM method on the colon data
| Gene No. | Gene Name | Description |
|---|---|---|
| 14 | H20709 | MYOSIN LIGHT CHAIN ALKALI, SMOOTH-MUSCLE ISOFORM (HUMAN) ∗∘ |
| 1772 | H08393 | COLLAGEN ALPHA 2(XI) CHAIN (Homo sapiens) |
| 1935 | X62048 | H.sapiens Wee1 hu gene |
| 286 | H64489 | LEUKOCYTE ANTIGEN CD37 (Homo sapiens) |
| 792 | R88740 | ATP SYNTHASE COUPLING FACTOR 6, MITOCHONDRIAL PRECURSOR (HUMAN) ∘⋆ |
| 187 | T51023 | HEAT SHOCK PROTEIN HSP 90-BETA (HUMAN) |
| 1976 | K03474 | Human Mullerian inhibiting substance gene, complete cds |
| 493 | R87126 | MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus) |
| 1635 | M36634 | Human vasoactive intestinal peptide (VIP) mRNA, complete cds |
| 698 | T51261 | GLIA DERIVED NEXIN PRECURSOR (Mus musculus) |
∗ also selected in [28]; ∘ also selected in [29]; ⊲ also selected in [15]; ⋆ also selected in [16]
The top ten frequently selected genes with the SC-IPSO-ELM method on the SRBCT data
| Gene No. | Gene Name | Description |
|---|---|---|
| 742 | 812105 | Transmembrane protein ∗∘ |
| 1003 | 796258 | Sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) ∗⋆∙ |
| 255 | 325182 | Cadherin 2, N-cadherin (neuronal) ∘ |
| 123 | 236282 | Wiskott-Aldrich syndrome (eczema-thrombocytopenia) |
| 545 | 1435862 | Antigen identified by monoclonal antibodies 12E7, F21 and O13 ∗⋆ |
| 1319 | 866702 | Protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) |
| 1606 | 624360 | Proteasome (prosome, macropain) subunit, beta type, 8 (large multifunctional protease 7) |
| 2046 | 244618 | ESTs |
| 246 | 377461 | Caveolin 1, caveolae protein, 22kD |
| 509 | 207274 | Human DNA for insulin-like growth factor II (IGF-2); exon 7 and additional ORF |
∗ also selected in [23]; ∘ also selected in [30]; ⊲ also selected in [15]; ⋆ also selected in [16]; ∙ also selected in [31]
The top ten frequently selected genes with the SC-IPSO-ELM method on the lung data
| Gene No. | Gene Name | Description |
|---|---|---|
| 2763 | 185_at | Neuro-oncological ventral antigen 1 |
| 580 | 39333_at | Collagen, type IV, alpha 1 ∘ |
| 792 | 38704_at | Cadherin 2, N-cadherin (neuronal) ∗∘ |
| 2841 | 32696_at | Pre-B-cell leukemia transcription factor 3 |
| 2045 | 35276_at | Claudin 4 |
| 2657 | 32648_at | Delta-like homolog (Drosophila) |
| 1765 | 39722_at | Nuclear receptor co-repressor 1 ∗∘ |
| 1493 | 38967_at | Chromosome 14 open reading frame 2 |
| 3191 | 39383_at | Adenylate cyclase 6 |
| 2338 | 1315_at | Ornithine decarboxylase antizyme 1 |
∗ also selected in [16]; ∘ also selected in [15]
The top ten frequently selected genes with the SC-IPSO-ELM method on the lymphoma data
| Gene No. | Gene Name | Description |
|---|---|---|
| 152 | M97935_5_at | Signal transducer and activator of transcription 1, 91kDa |
| 1855 | L17328_at | Fasciculation and elongation protein zeta 2 (zygin II) |
| 2437 | M18185_at | Gastric inhibitory polypeptide |
| 2347 | M14091_at | Serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 7 |
| 2828 | M37763_at | Neurotrophin 3 |
| 5279 | U83843_at | Chaperonin containing TCP1, subunit 7 (eta) |
| 806 | D86968_at | Mitogen-activated protein kinase kinase kinase 4 ∗ |
| 4092 | U22178_s_at | Microseminoprotein, beta- |
| 4940 | U66559_at | Anaplastic lymphoma kinase (Ki-1) |
| 4194 | U28150_at | ATP-binding cassette, sub-family D (ALD), member 2 |
∗ also selected in [16]
Fig. 2 The heatmap of expression levels based on the top ten frequently selected genes on the six datasets
The 5-fold CV classification accuracies of ELM based on the three gene selection methods on the six microarray datasets
| Data | KMeans-GCSI-MBPSO-ELM: 5-fold CV accuracy (%) ± std | KMeans-GCSI-MBPSO-ELM: genes | BPSO-GCSI-ELM: 5-fold CV accuracy (%) ± std | BPSO-GCSI-ELM: genes | SC-IPSO-ELM: 5-fold CV accuracy (%) ± std | SC-IPSO-ELM: genes |
|---|---|---|---|---|---|---|
| Leukemia | 100.00 ±0.00 | 3 | 100.00 ±0.00 | 3 | 100.00 ±0.00 | 3 |
| Brain cancer | 88.63 ±0.0216 | 6 | 89.88 ±0.0223 | 7 | 91.88 ±0.019 | 7 |
| Colon | 97.61 ±0.0137 | 6 | 97.82 ±0.0132 | 9 | 99.05 ±0.011 | 13 |
| SRBCT | 100.00 ±0.00 | 6 | 100.00 ±0.00 | 6 | 100.00 ±0.00 | 6 |
| LUNG | 97.10 ±0.063 | 11 | 96.28 ±0.072 | 12 | 98.67 ±0.019 | 11 |
| Lymphoma | 86.97 ±0.024 | 8 | 84.50 ±0.023 | 8 | 93.79 ±0.020 | 7 |
The classification accuracies of SVM based on the three gene selection methods on the six microarray datasets
| Data | KMeans-GCSI-MBPSO-ELM: 5-fold CV accuracy (%) ± std | KMeans-GCSI-MBPSO-ELM: genes | BPSO-GCSI-ELM: 5-fold CV accuracy (%) ± std | BPSO-GCSI-ELM: genes | SC-IPSO-ELM: 5-fold CV accuracy (%) ± std | SC-IPSO-ELM: genes |
|---|---|---|---|---|---|---|
| Leukemia | 99.99 ±0.0014 | 3 | 99.99 ±0.0014 | 3 | 99.99 ±0.0014 | 3 |
| Brain cancer | 84.05 ±0.0301 | 6 | 82.70 ±0.0319 | 7 | 86.55 ±0.0299 | 7 |
| Colon | 90.69 ±0.0226 | 6 | 92.02 ±0.0275 | 9 | 93.35 ±0.0310 | 13 |
| SRBCT | 99.24 ±0.0119 | 6 | 98.34 ±0.0100 | 6 | 99.39 ±0.0074 | 6 |
| LUNG | 94.63 ±0.054 | 11 | 96.65 ±0.058 | 11 | 95.38 ±0.047 | 11 |
| Lymphoma | 77.59 ±0.032 | 8 | 72.41 ±0.034 | 8 | 81.03 ±0.025 | 7 |
Fig. 3 The parameter θac versus the classification accuracy on the training dataset obtained by ELM
Fig. 4 The number of selected genes versus the classification accuracy on the training dataset obtained by ELM