Fei Han, Wei Sun, Qing-Hua Ling.
Abstract
To obtain predictive genes with lower redundancy and better interpretability, this paper proposes a hybrid gene selection method that encodes prior information. First, the prior information, referred to as the gene-to-class sensitivity (GCS) of each gene in the microarray data, is extracted with a single-hidden-layer feedforward neural network (SLFN). Then, to select more representative and less redundant genes, all genes are grouped into clusters by the K-means method, and genes with low sensitivity are filtered out according to their GCS values. Finally, a modified binary particle swarm optimization (BPSO) that encodes the GCS information performs further gene selection on the remaining genes. Because it takes the GCS information into account, the proposed method selects genes that are highly correlated with the sample classes. The resulting low-redundancy gene subsets therefore also help improve classification accuracy on microarray data. Experimental results on several public microarray datasets verify the effectiveness and efficiency of the proposed approach.
Year: 2014 PMID: 24844313 PMCID: PMC4028211 DOI: 10.1371/journal.pone.0097530
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The framework of the proposed hybrid gene selection method.
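The last two stages described in the abstract (GCS-based filtering within clusters, then binary PSO over the remaining genes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the record does not give the filtering rule or the BPSO modification, so the sketch assumes a hypothetical "keep the top-k GCS genes per cluster" rule and a textbook binary PSO in which GCS only biases the initial swarm. The names `filter_by_gcs`, `bpso_select`, and all parameter values are assumptions, and cluster labels and GCS values are taken as given inputs.

```python
import math
import random


def filter_by_gcs(labels, gcs, keep_per_cluster):
    """Keep the highest-GCS genes in each cluster (assumed rule, not from the paper)."""
    clusters = {}
    for gene, label in enumerate(labels):
        clusters.setdefault(label, []).append(gene)
    kept = []
    for genes in clusters.values():
        genes.sort(key=lambda g: gcs[g], reverse=True)
        kept.extend(genes[:keep_per_cluster])
    return sorted(kept)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bpso_select(fitness, gcs, n_particles=20, n_iter=50,
                w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Textbook binary PSO over 0/1 gene masks, maximising `fitness`.

    Assumption: GCS (non-negative) enters only through the initial swarm,
    where a gene starts switched on with probability gcs / max(gcs).
    """
    rng = random.Random(seed)
    n = len(gcs)
    top = max(gcs) or 1.0
    pos = [[1 if rng.random() < g / top else 0 for g in gcs]
           for _ in range(n_particles)]
    vel = [[0.0] * n for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    best = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[best][:], pbest_fit[best]

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n):
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-v_max, min(v_max, v))  # clamp velocity
                # The sigmoid of the velocity is the probability of bit = 1.
                pos[i][d] = 1 if rng.random() < sigmoid(vel[i][d]) else 0
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


if __name__ == "__main__":
    # Toy data: 5 genes in 2 clusters with made-up GCS values.
    toy_gcs = [0.9, 0.1, 0.2, 0.8, 0.5]
    print(filter_by_gcs([0, 0, 1, 1, 1], toy_gcs, keep_per_cluster=1))

    # Toy fitness: reward genes 0 and 3, penalise subset size.
    def toy_fitness(mask):
        return mask[0] + mask[3] - 0.1 * sum(mask)

    print(bpso_select(toy_fitness, toy_gcs))
```

In practice the fitness would be a cross-validated classifier accuracy on the training samples restricted to the masked genes; the toy fitness here only keeps the example self-contained.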
The six microarray datasets.
| Data | Total | Training | Testing | Classes | Genes |
| Leukemia | 72 | 38 | 34 | 2 | 7129 |
| Colon | 62 | 40 | 22 | 2 | 2000 |
| SRBCT | 83 | 63 | 20 | 4 | 2308 |
| LUNG | 203 | 103 | 100 | 5 | 3312 |
| Brain cancer | 60 | 30 | 30 | 2 | 7129 |
| Lymphoma | 58 | 29 | 29 | 2 | 7129 |
Figure 2. The GCS values of genes in the remaining clusters ('C' along the x-axis abbreviates 'Cluster'). (A) Leukemia, (B) Colon, (C) SRBCT, (D) LUNG, (E) Brain cancer, (F) Lymphoma.
Classification accuracies obtained by ELM with different gene subsets on the six datasets.
| Data | Selected genes | 5-fold CV Accuracy Mean(%)±std | Testing Accuracy Mean(%)±std |
| Leukemia | 2642,4050,2121 | 100±0.00 | 100±0.00 |
| | 2642,4050,3258 | 100±0.00 | 100±0.00 |
| | 2642,4050,1882 | 99.74±0.55 | 100±0.00 |
| | 4050,1685,1078,2121 | 99.64±0.61 | 98.21±1.61 |
| Colon | 141,792,251,1679,1976,14 | 97.61±1.37 | 93.68±2.58 |
| | 141,1110,792,251,1976,286,23,14 | 98.03±1.46 | 94.36±2.59 |
| | 127,652,1110,43,251,1976,795,1071,286,14 | 97.50±1.63 | 95.09±3.27 |
| | 304,360,377,1110,792,312,251,36,1763,1867,1976,795,14 | 98.05±1.38 | 96.14±2.53 |
| SRBCT | 742,1003,1055,2050,846,1772 | 100±0.00 | 100±0.00 |
| | 742,1003,603,971,846,1389 | 100±0.00 | 100±0.00 |
| | 236,976,1003,123,819,545 | 100±0.00 | 100±0.00 |
| | 742,1003,2050,235,1634,1120,545 | 100±0.00 | 100±0.00 |
| LUNG | 498,614,567,2750,1209,1765,2763,867,2659,2670 | 96.88±0.61 | 94.80±0.79 |
| | 641,777,1288,614,567,320,3178,792,3295,2558,997 | 97.10±0.63 | 93.02±1.17 |
| | 580,103,2750,1559,1765,2763,2583,997,1014 | 96.17±0.65 | 94.44±0.82 |
| Brain cancer | 3362,1970,3123,5931 | 86.07±1.99 | 77.20±1.40 |
| | 6571,4413,4917,5931 | 85.70±3.16 | 79.53±3.96 |
| | 5721,4069,1970,3123,5931 | 87.87±1.73 | 78.30±1.74 |
| | 6571,4409,4413,4628,1970,5931 | 88.63±2.16 | 80.40±3.36 |
| Lymphoma | 4092,6171,412,5843,806,4037 | 85.05±2.44 | 78.62±1.70 |
| | 5660,4092,364,152,956,806,4037 | 84.60±2.75 | 74.52±2.40 |
| | 4092,6171,5357,3646,5909,152,806,2650 | 86.97±2.44 | 73.38±3.10 |
| | 5660,4092,6171,510,6219,2374,1568,2650 | 86.95±2.33 | 78.79±1.24 |
Figure 3. The GCS values for all reserved genes. (A) Leukemia, (B) Colon, (C) SRBCT, (D) LUNG, (E) Brain cancer, (F) Lymphoma.
The top ten frequently selected genes with the proposed method on the Leukemia data.
| Gene no. | Gene name | Description |
| 2642 | U05259 | MB-1 gene |
| 4050 | X03934 | GB DEF = T-cell antigen receptor gene T3-delta |
| 2121 | M63138 | CTSD Cathepsin D (lysosomal aspartyl protease) |
| 3320 | U50136 | Leukotriene C4 synthase (LTC4S) gene |
| 6539 | X85116 | Epb72 gene exon 1 |
| 1882 | M27891 | CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage) |
| 5191 | Z69881 | Adenosine triphosphatase, calcium# |
| 1779 | M19507 | MPO Myeloperoxidase |
| 4847 | X95735 | Zyxin |
| 1078 | J03473 | ADPRT ADP-ribosyltransferase (NAD+; poly (ADP-ribose) polymerase) |
*Also selected by the method in [40].
#Also selected by the method in [37].
The top ten frequently selected genes with the proposed method on the Lymphoma data.
| Gene no. | Gene name | Description |
| 5660 | X14046 | CD37 antigen |
| 1116 | HG2815-HT4023_s | / |
| 2748 | M33478 | phosducin |
| 5357 | U90543 | butyrophilin, subfamily 2, member A1 |
| 806 | D86969 | PHD finger protein 16 |
| 3823 | U09770 | cysteine-rich protein 1 (intestinal) |
| 4269 | U32324 | interleukin 11 receptor, alpha |
| 5381 | U90914 | carboxypeptidase D |
| 2073 | L36033 | chemokine (C-X-C motif) ligand 12 (stromal cell-derived factor 1) |
| 6105 | X67098 | enolase superfamily member 1 |
The top ten frequently selected genes with the proposed method on the Colon data.
| Gene no. | Gene name | Description |
| 14 | H20709 | Myosin light chain alkali, smooth-muscle isoform (Human) ♀♂ |
| 237 | T50334 | 14-3-3-like protein GF14 omega (Arabidopsis thaliana) |
| 1482 | T64012 | Acetylcholine receptor protein, delta chain precursor (Xenopus laevis) |
| 175 | T94579 | Human chitotriosidase precursor mRNA, complete cds. ♂ |
| 286 | H64489 | Leukocyte antigen CD37 (Homo sapiens) |
| 141 | D21261 | Sm22-alpha homolog (Human) |
| 792 | R88740 | Atp synthase coupling factor 6, mitochondrial precursor (Human) ♂ |
| 3 | R39465 | Eukaryotic initiation factor 4A (Oryctolagus cuniculus) |
| 251 | U37012 | Human cleavage and polyadenylation specificity factor mRNA, complete cds |
| 23 | R22197 | 60S ribosomal protein L32 (Human) ♀ |
Also selected by the method in [41].
Also selected by the method in [42].
The top ten frequently selected genes with the proposed method on the SRBCT data.
| Gene no. | Gene name | Description |
| 1003 | 796258 | sarcoglycan, alpha (50kD dystrophin-associated glycoprotein) |
| 742 | 812105 | transmembrane protein |
| 1601 | 629896 | microtubule-associated protein 1B |
| 603 | 42558 | glycine amidinotransferase (L-arginine:glycine amidinotransferase) |
| 1055 | 1409509 | troponin T1, skeletal, slow |
| 545 | 1435862 | antigen identified by monoclonal antibodies 12E7, F21 and O13 |
| 1955 | 784224 | fibroblast growth factor receptor 4 |
| 1 | 21652 | catenin (cadherin-associated protein), alpha 1 (102 kD) |
| 1389 | 770394 | Fc fragment of IgG, receptor, transporter, alpha |
| 976 | 786084 | chromobox homolog 1 (Drosophila HP1 beta) |
Also selected by the method in [33].
Also selected by the method in [43].
The top ten frequently selected genes with the proposed method on the LUNG data.
| Gene no. | Gene name | Description |
| 498 | 39755 | Cluster Incl Z93930:Human DNA sequence from clone 292E10 on chromosome 22q11–12. Contains |
| 1559 | 1011_s | tyrosine 3-monooxygenase/tryptophan 5-monooxygenase activation protein, epsilon polypeptide |
| 792 | 38704 | actin binding protein; macrophin (microfilament and actin filament cross-linker protein) |
| 3178 | 38799 | Cluster Incl AF068706:Homo sapiens gamma2-adaptin (G2AD) mRNA, complete cds/cds = (763,3018) |
| 1765 | 39722 | nuclear receptor co-repressor 1 |
| 1243 | 39012_g | endosulfine alpha |
| 614 | 1147 | V-Erba Related Ear-3 Protein |
| 2750 | 38484 | synaptosomal-associated protein, 25 kD |
| 1014 | 588 | protein tyrosine phosphatase, non-receptor type 1 |
| 567 | 33412 | Cluster Incl AI535946:vicpro2.D07.r Homo sapiens cDNA, 5 end/clone_end = 5″/gb = AI535946 |
Figure 4. Heatmap of expression levels based on the top ten frequently selected genes on the six datasets. (A) Leukemia, (B) Colon, (C) SRBCT, (D) LUNG, (E) Brain cancer, (F) Lymphoma.
Figure 5. Plot of the first three principal components using the top 30 frequently selected genes. (A) Leukemia, (B) Colon, (C) SRBCT, (D) LUNG, (E) Brain cancer, (F) Lymphoma.
Figure 6. The correlation between the ranks of genes from two independent runs on the six datasets, used to assess the reproducibility of the proposed approach. (A) Leukemia, (B) Colon, (C) SRBCT, (D) LUNG, (E) Brain cancer, (F) Lymphoma.
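A reproducibility check of the kind Figure 6 describes can be sketched as a rank correlation between two runs. The record does not say which correlation coefficient the authors used or how genes were ranked, so this sketch assumes Spearman's coefficient computed on per-gene selection frequencies; `ranks`, `pearson`, and `spearman` are illustrative helper names, and the sample numbers are made up.

```python
import math


def ranks(scores):
    """Rank 1 = highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(run_a, run_b):
    """Spearman correlation = Pearson correlation of the two rank vectors."""
    return pearson(ranks(run_a), ranks(run_b))


if __name__ == "__main__":
    # Made-up selection frequencies of five genes in two independent runs.
    run_1 = [40, 35, 10, 5, 1]
    run_2 = [38, 30, 12, 2, 3]
    print(spearman(run_1, run_2))
```

A value near 1 would indicate that the two runs order the genes almost identically; a value near 0 would indicate that the ranking is not reproducible.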
Figure 7. 5-fold CV accuracy on the training data versus the iteration number of the MBPSO.
The 5-fold CV classification accuracies of the ELM classifier based on four PSO-based gene selection methods on the six data.
| Method | Mean accuracy (%) ± std (gene number) | | | | | |
| | Leukemia | Colon | SRBCT | LUNG | Brain cancer | Lymphoma |
| KMeans-GCSI-MBPSO-ELM | 100±0.00 (3) | 97.61±1.37(6) | 100±0.00 (6) | 97.10±0.63 (11) | 88.63±2.16(6) | 86.97±2.44(8) |
| KMeans-BPSO-ELM | 99.17±1.04(4) | 93.50±2.02(9) | 99.27±0.82(7) | 95.64±0.56(12) | 87.23±2.34(8) | 85.14±2.87(6) |
| BPSO-ELM | 98.56±0.27(5) | 93.34±1.99(9) | 99.82±0.60(10) | 94.80±0.57(11) | 85.45±2.33(7) | 83.50±2.72(8) |
| Method in | 99.64±0.67(3) | 93.94±1.17(5) | 99.39±0.88(6) | 95.67±0.72(11) | 86.55±2.35(5) | 83.72±2.33(6) |
The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the Leukemia data.
| Method | KNN | | | SVM | | |
| | 30 | 60 | 100 | 30 | 60 | 100 |
| New method | 96.53±1.16 | 95.97±1.40 | / | 98.42±0.48 | 98.40±0.54 | / |
| Method in | 94.04±1.79 | / | / | 95.83±1.84 | / | / |
| GS2 | 96.10±4.80 | 96.80±4.40 | / | 95.80±5.20 | 96.70±4.70 | / |
| GS1 | 96.50±4.80 | 97.30±4.00 | / | 96.50±5.00 | 97.00±4.30 | / |
| Cho’s | 95.80±4.90 | 96.30±4.60 | / | 95.30±5.40 | 96.20±5.30 | / |
| F-test | 96.00±4.90 | 96.60±4.50 | / | 95.70±5.50 | 96.80±4.90 | / |
The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on two gene selection methods on the Colon data.
| Method | KNN | | | SVM | | |
| | 30 | 60 | 100 | 30 | 60 | 100 |
| New method | 83.77±2.37 | 84.95±1.59 | 84.97±2.09 | 84.95±3.21 | 87.97±2.76 | 86.32±2.46 |
| Method in | 75.95±2.01 | 80.90±2.01 | 81.03±2.01 | 84.05±3.43 | 80.18±3.46 | 79.56±3.73 |
The LOOCV classification accuracies of five methods on the Brain cancer and Lymphoma data.
| Data | LOOCV classification accuracy (%) (gene number) | | | | |
| | KMeans-GCSI-MBPSO-ELM | Method in | MIDClass | SGC-t | SGC-W |
| Brain cancer | 90.93 (6) | 88.38 (5) | 83 (239) | 80 (1) | 77 (1) |
| Lymphoma | 93.79 (8) | 91 (6) | 69 (3) | 76 (1) | 71 (1) |
The LOOCV and 5-fold CV classification accuracies of ELM based on two gene selection methods on the six data.
| Data | KMeans-GCSI-MBPSO-ELM | | | KMeans+Elbow-GCSI-MBPSO-ELM | | |
| | Classification accuracy±std | | | Classification accuracy±std | | |
| | LOOCV | 5-fold CV | | LOOCV | 5-fold CV | |
| Leukemia | 100.00±0.00 | 100.00±0.00 | 5 | 100.00±0.00 | 100.00±0.00 | 5 |
| Colon | 99.35±0.92 | 97.61±1.37 | 8 | 99.35±0.92 | 97.61±1.37 | 8 |
| SRBCT | 100.00±0.00 | 100.00±0.00 | 8 | 99.90±0.33 | 99.45±0.77 | 6 |
| LUNG | 98.14±0.33 | 97.10±0.63 | 10 | 97.33±0.66 | 95.67±0.74 | 6 |
| Brain cancer | 90.93±1.65 | 88.63±2.16 | 6 | 90.93±1.65 | 88.63±2.16 | 6 |
| Lymphoma | 93.79±2.07 | 86.97±2.44 | 7 | 93.79±2.07 | 86.97±2.44 | 7 |
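The two evaluation protocols compared in this table differ only in how samples are held out: 5-fold CV splits the samples into five folds, while LOOCV is the special case where the number of folds equals the number of samples. The sketch below shows how a "mean ± std" accuracy could be produced by repeating k-fold CV with different shuffles. It is an assumption-laden illustration: a 1-nearest-neighbour classifier stands in for the ELM used in the paper, the number of repeats is arbitrary, and all function names are hypothetical.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def one_nn_predict(train_x, train_y, x):
    """Label of the nearest training sample (squared Euclidean distance)."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    return train_y[nearest]


def cv_accuracy(xs, ys, k, seed=0):
    """Percentage of held-out samples classified correctly over k folds."""
    correct = 0
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(one_nn_predict(train_x, train_y, xs[i]) == ys[i]
                       for i in fold)
    return 100.0 * correct / len(xs)


def repeated_cv(xs, ys, k, repeats=10):
    """Mean and sample standard deviation of accuracy over reshuffled runs."""
    accuracies = [cv_accuracy(xs, ys, k, seed=r) for r in range(repeats)]
    return statistics.mean(accuracies), statistics.stdev(accuracies)


if __name__ == "__main__":
    # Toy one-gene expression values for two well-separated classes.
    xs = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
    ys = [0, 0, 0, 1, 1, 1]
    print(repeated_cv(xs, ys, k=3))       # 3-fold CV
    print(cv_accuracy(xs, ys, k=len(xs)))  # LOOCV: one sample per fold
```

With a deterministic classifier such as 1-NN, LOOCV gives the same result on every repeat, so any spread reported for LOOCV must come from randomness in the classifier itself rather than from the split.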
Figure 8. The number of clusters (N) versus the classification accuracy obtained by ELM with the genes selected by the KMeans-GCSI-MBPSO-ELM method on the six datasets.
The top ten frequently selected genes with the proposed method on the Brain cancer data.
| Gene no. | Gene name | Description |
| 5931 | X58987 | dopamine receptor D1 |
| 4413 | U39817 | Bloom syndrome |
| 130 | AFFX-BioDn-5_st | / |
| 1745 | L08895 | MADS box transcription enhancer factor 2, polypeptide C (myocyte enhancer factor 2C) |
| 6732 | Y00317 | UDP glucuronosyltransferase 2 family, polypeptide B4 |
| 4843 | U61262 | neogenin homolog 1 (chicken) |
| 2935 | M60459 | erythropoietin receptor |
| 3502 | S74683 | ADP-ribosyltransferase 1 |
| 1970 | L25270 | Smcy homolog, X-linked (mouse) |
| 18 | AB000895 | dachsous 1 (Drosophila) |
The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the SRBCT data.
| Method | KNN | | | SVM | | |
| | 30 | 60 | 100 | 30 | 60 | 100 |
| New method | 98.27±0.91 | 98.87±0.38 | 99.06±0.53 | 97.71±0.81 | 99.73±0.50 | 99.82±0.43 |
| Method in | 97.46±1.47 | 98.86±1.07 | 99.04±0.92 | 99.41±1.03 | 99.25±1.09 | 99.86±0.52 |
| GS2 | 95.30±4.80 | 97.10±4.10 | 98.00±3.80 | 94.90±4.70 | 97.60±4.00 | 99.00±2.60 |
| GS1 | 94.10±4.70 | 96.10±4.50 | 97.70±4.10 | 95.90±5.40 | 97.80±4.00 | 98.80±3.00 |
| Cho’s | 82.00±9.60 | 86.40±9.30 | 89.60±8.70 | 83.50±8.80 | 91.80±6.90 | 94.30±6.20 |
| F-test | 96.30±5.00 | 97.30±4.60 | 97.80±4.00 | 97.00±4.20 | 98.00±3.90 | 99.20±2.10 |
The 5-fold CV classification accuracies of the KNN-classifier and SVM-classifier based on six gene selection methods on the LUNG data.
| Method | KNN | | | SVM | | |
| | 30 | 60 | 100 | 30 | 60 | 100 |
| New method | 95.17±0.42 | 95.76±0.57 | 96.40±0.47 | 94.59±0.77 | 96.01±0.69 | 94.67±0.73 |
| Method in | 92.13±0.57 | 94.82±0.50 | 94.91±0.57 | 90.82±1.00 | 97.66±0.47 | 96.47±0.71 |
| GS2 | 88.40±5.30 | 91.60±4.10 | 92.80±3.70 | 85.80±6.10 | 91.30±3.50 | 93.10±3.30 |
| GS1 | 89.00±4.60 | 91.90±4.10 | 93.70±3.40 | 87.10±5.10 | 92.20±3.80 | 93.80±3.10 |
| Cho’s | 84.30±5.30 | 89.70±4.40 | 92.40±3.80 | 80.30±6.50 | 89.40±4.40 | 92.40±3.50 |
| F-test | 87.30±4.90 | 88.20±4.40 | 91.80±4.40 | 85.20±5.50 | 90.10±4.20 | 93.00±3.60 |