Argiris Sakellariou, Despina Sanoudou, George Spyrou.
Abstract
BACKGROUND: A feature selection method for microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping the N top-ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.
Year: 2012 PMID: 23075381 PMCID: PMC3542193 DOI: 10.1186/1471-2105-13-270
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
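The hypothesis in the abstract (keep one exemplar per cluster of related genes rather than the N top-ranked genes) can be illustrated with a minimal sketch. This is not the authors' procedure: the greedy correlation-based grouping, the function names `pearson` and `select_exemplars`, the `min_corr` threshold and the toy expression values are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def select_exemplars(expr, scores, top_n, min_corr=0.8):
    """Greedy stand-in for exemplar selection (hypothetical helper).

    expr:   dict gene -> expression values across samples
    scores: dict gene -> ranking statistic (higher = more significant)
    Returns a dict mapping each exemplar to the genes grouped with it.
    """
    # Step 1: keep only the top_n genes of the ranked list.
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    clusters = {}
    # Step 2: walk down the ranking; a gene joins the first exemplar it
    # correlates with, otherwise it becomes the exemplar of a new cluster.
    for gene in ranked:
        for exemplar in clusters:
            if pearson(expr[gene], expr[exemplar]) >= min_corr:
                clusters[exemplar].append(gene)
                break
        else:
            clusters[gene] = [gene]
    return clusters


# Toy data (illustrative values, not from the paper).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],  # co-expressed with g1
    "g3": [4.0, 1.0, 3.5, 0.5],
    "g4": [3.9, 1.2, 3.4, 0.7],  # co-expressed with g3
    "g5": [1.0, 1.0, 2.0, 2.0],  # ranked below the top_n cut-off
}
scores = {"g1": 9.0, "g2": 8.0, "g3": 7.0, "g4": 6.0, "g5": 1.0}

clusters = select_exemplars(expr, scores, top_n=4)
exemplars = list(clusters)
print(exemplars)
```

In this toy case the four top-ranked genes collapse into two clusters, so the selected subset holds two exemplars instead of four partly redundant genes.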
Figure 1. The mAP-KL methodology flowchart.
The real microarray data divided into training and test sets
| Amyotrophic lateral sclerosis | |||
| Duchenne muscular dystrophy | |||
| Juvenile dermatomyositis | |||
| Limb-girdle muscular dystrophy type 2A | |||
| Limb-girdle muscular dystrophy type 2B | |||
| Nemaline myopathy | |||
The FS methods sorted by the AUC metric achieved in the validation test for each neuromuscular disease using the RF classifier
| | |||||||
|---|---|---|---|---|---|---|---|
| 1.00 (0.00) | 1.00 (0.00) | 0.98 (0.14) | 1.00 | 1.00 | |||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.91 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.73 | 1.00 | ||
| | 1.00 (0.00) | 0.98 (0.10) | 1.00 (0.00) | 0.73 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.64 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.64 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.64 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.36 | 1.00 | ||
| | HykGene | 1.00 (0.00) | 0.97 (0.12) | 0.98 (0.10) | 0.94 | 0.45 | 1.00 |
| | maxT | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.94 | 0.45 | 1.00 |
| | SNR | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.94 | 0.73 | 1.00 |
| | Rnd | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.89 | 0.70 | 0.93 |
| | PCA | 0.83 (0.30) | 0.61 (0.43) | 0.77 (0.39) | 0.58 | 0.27 | 1.00 |
| LGMD2B | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.73 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.64 | 1.00 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.55 | 1.00 | ||
| | BGA-COA | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.98 | 0.73 | 1.00 |
| | maxT | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.91 | 0.64 | 1.00 |
| | Rnd | 1.00 (0.00) | 1.00 (0.00) | 0.93 (0.25) | 0.90 | 0.56 | 1.00 |
| | SNR | 1.00 (0.00) | 1.00 (0.01) | 1.00 (0.00) | 0.88 | 0.73 | 1.00 |
| | HykGene | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.82 | 0.64 | 1.00 |
| | t-test | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.82 | 0.73 | 0.67 |
| | ODP | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.73 | 0.45 | 1.00 |
| | mAP-KL | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.70 | 0.36 | 0.67 |
| | SAM | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.52 | 0.27 | 1.00 |
| | eBayes | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.48 | 0.27 | 0.67 |
| | cat | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.36 | 0.09 | 1.00 |
| | PCA | 0.89 (0.25) | 0.74 (0.38) | 0.61 (0.44) | 0.21 | 0.09 | 1.00 |
| NM | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.77 | 1.00 | ||
| | t-test | 1.00 (0.00) | 0.98 (0.10) | 1.00 (0.00) | 0.89 | 0.77 | 0.80 |
| | HykGene | 1.00 (0.00) | 1.00 (0.00) | 0.99 (0.07) | 0.88 | 0.69 | 0.80 |
| | maxT (200) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.80 | 0.69 | 0.80 |
| | cat | 1.00 (0.00) | 1.00 (0.00) | 0.99 (0.07) | 0.78 | 0.46 | 1.00 |
| | mAP-KL | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.74 | 0.69 | 0.60 |
| | Rnd | 0.98 (0.03) | 0.87 (0.09) | 0.96 (0.06) | 0.67 | 0.49 | 0.76 |
| | SAM | 1.00 (0.00) | 0.87 (0.28) | 0.98 (0.10) | 0.65 | 0.15 | 1.00 |
| | PCA | 0.82 (0.30) | 0.77 (0.35) | 0.73 (0.39) | 0.55 | 0.92 | 0.40 |
| | BGA-COA | 0.96 (0.14) | 0.87 (0.28) | 0.91 (0.19) | 0.47 | 0.23 | 0.60 |
| | PLS-CV | 0.97 (0.12) | 0.87 (0.28) | 0.99 (0.07) | 0.42 | 0.08 | 1.00 |
| | maxT | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.37 | 0.38 | 0.40 |
| | ODP | 1.00 (0.00) | 0.92 (0.23) | 1.00 (0.00) | 0.25 | 0.38 | 0.20 |
| | RF-MDA | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.22 | 0.15 | 0.60 |
| | eBayes | - | - | - | - | - | - |
The FS methods sorted by the AUC metric achieved in the validation test for each cancer disease using the RF classifier
| | |||||||
|---|---|---|---|---|---|---|---|
| 0.80 (0.11) | 0.79 (0.16) | 0.73 (0.18) | 1.00 | 0.50 | |||
| | maxT(200) | 0.85 (0.11) | 0.83 (0.13) | 0.69 (0.17) | 0.83 | 0.71 | 0.83 |
| | PLS-CV | 0.91 (0.08) | 0.85 (0.13) | 0.77 (0.15) | 0.82 | 0.86 | 0.42 |
| | RF-MDA | 0.91 (0.07) | 0.91 (0.11) | 0.70 (0.16) | 0.82 | 0.86 | 0.75 |
| | maxT | 0.87 (0.10) | 0.84 (0.13) | 0.74 (0.18) | 0.77 | 0.71 | 0.58 |
| | SAM | 0.82 (0.11) | 0.79 (0.15) | 0.69 (0.19) | 0.77 | 0.71 | 0.75 |
| | SNR | 0.86 (0.10) | 0.85 (0.14) | 0.72 (0.20) | 0.77 | 0.71 | 0.67 |
| | BGA-COA | 0.83 (0.10) | 0.79 (0.15) | 0.67 (0.15) | 0.76 | 0.57 | 0.58 |
| | HykGene | 0.91 (0.06) | 0.86 (0.12) | 0.76 (0.17) | 0.76 | 0.71 | 0.75 |
| | Rnd | 0.79 (0.01) | 0.76 (0.03) | 0.65 (0.03) | 0.76 | 0.70 | 0.78 |
| | cat | 0.91 (0.07) | 0.86 (0.12) | 0.78 (0.16) | 0.75 | 0.71 | 0.50 |
| | PCA | 0.72 (0.14) | 0.66 (0.18) | 0.56 (0.19) | 0.75 | 0.43 | 0.67 |
| | ODP | 0.83 (0.10) | 0.80 (0.14) | 0.69 (0.18) | 0.74 | 0.71 | 0.58 |
| | t-test | 0.82 (0.10) | 0.81 (0.14) | 0.69 (0.19) | 0.73 | 0.71 | 0.58 |
| | eBayes | - | - | - | - | - | - |
| 0.99 (0.03) | 0.95 (0.12) | 0.97 (0.09) | 0.71 | 0.84 | |||
| | BGA-COA | 0.98 (0.06) | 0.89 (0.22) | 0.87 (0.19) | 0.87 | 0.71 | 0.80 |
| | Rnd | 0.98 (0.02) | 0.90 (0.06) | 0.90 (0.03) | 0.84 | 0.73 | 0.82 |
| | maxT(200) | 1.00 (0.00) | 0.94 (0.13) | 0.94 (0.13) | 0.83 | 0.71 | 0.88 |
| | PCA | 0.79 (0.19) | 0.80 (0.23) | 0.72 (0.26) | 0.83 | 0.43 | 0.84 |
| | ODP | 0.99 (0.03) | 0.97 (0.13) | 0.93 (0.13) | 0.82 | 0.71 | 0.80 |
| | HykGene | 0.98 (0.06) | 0.93 (0.14) | 0.95 (0.12) | 0.81 | 0.71 | 0.88 |
| | RF-MDA | 0.99 (0.03) | 0.96 (0.11) | 0.93 (0.13) | 0.81 | 0.71 | 0.80 |
| | eBayes | 0.99 (0.03) | 0.97 (0.11) | 0.93 (0.13) | 0.80 | 0.71 | 0.80 |
| | SAM | 1.00 (0.02) | 0.99 (0.09) | 0.93 (0.13) | 0.80 | 0.71 | 0.80 |
| | cat | 0.99 (0.04) | 0.97 (0.14) | 0.93 (0.13) | 0.80 | 0.57 | 0.80 |
| | maxT | 1.00 (0.02) | 0.97 (0.10) | 0.94 (0.13) | 0.79 | 0.71 | 0.80 |
| | PLS-CV | 1.00 (0.02) | 0.94 (0.16) | 0.94 (0.13) | 0.79 | 0.71 | 0.80 |
| | SNR | 0.99 (0.03) | 1.00 (0.00) | 0.93 (0.13) | 0.79 | 0.71 | 0.80 |
| | t-test | 0.99 (0.03) | 0.99 (0.05) | 0.93 (0.13) | 0.79 | 0.71 | 0.80 |
| 0.99 (0.04) | 1.00 (0.00) | 0.81 (0.27) | 1.00 | 0.86 | |||
| | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 | 0.86 | ||
| | 1.00 (0.00) | 1.00 (0.00) | 0.91 (0.19) | 0.95 | 0.93 | ||
| | RF-MDA | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.99 | 1.00 | 0.86 |
| | PLS-CV | 1.00 (0.00) | 1.00 (0.00) | 0.89 (0.25) | 0.99 | 0.95 | 0.93 |
| | SAM | 1.00 (0.00) | 1.00 (0.00) | 0.91 (0.19) | 0.99 | 0.95 | 0.93 |
| | cat | 1.00 (0.00) | 1.00 (0.00) | 0.95 (0.14) | 0.99 | 0.95 | 0.93 |
| | HykGene | 1.00 (0.00) | 1.00 (0.00) | 0.90 (0.20) | 0.97 | 0.90 | 0.93 |
| | Rnd | 0.99 (0.01) | 0.98 (0.02) | 0.86 (0.06) | 0.97 | 0.99 | 0.75 |
| | maxT | 1.00 (0.02) | 0.98 (0.07) | 0.85 (0.27) | 0.96 | 1.00 | 0.64 |
| | m | 1.00 (0.00) | 1.00 (0.00) | 0.97 (0.17) | 0.71 | 0.90 | 0.43 |
| | PCA | 0.56 (0.16) | 1.00 (0.00) | 0.00 (0.00) | 0.64 | 0.95 | 0.14 |
| | SNR | 0.50 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 0.50 | 1.00 | 0.00 |
| | t-test | 0.50 (0.00) | 1.00 (0.00) | 0.00 (0.00) | 0.50 | 1.00 | 0.00 |
| | ODP | - | - | - | - | - | - |
| 0.96 (0.04) | 0.97 (0.05) | 0.88 (0.10) | 0.00 | 1.00 | |||
| | maxT(200) | 0.95 (0.10) | 0.95 (0.10) | 0.89 (0.10) | 0.88 | 0.00 | 1.00 |
| | PLS-CV | 0.97 (0.03) | 0.95 (0.08) | 0.92 (0.07) | 0.87 | 0.33 | 1.00 |
| | eBayes | 0.96 (0.04) | 0.98 (0.04) | 0.89 (0.10) | 0.86 | 0.00 | 1.00 |
| | RF-MDA | 0.97 (0.04) | 0.97 (0.06) | 0.90 (0.09) | 0.83 | 0.11 | 1.00 |
| | m | 0.93 (0.06) | 0.90 (0.09) | 0.85 (0.11) | 0.80 | 1.00 | 0.36 |
| | BGA-COA | 0.95 (0.05) | 0.91 (0.09) | 0.89 (0.10) | 0.73 | 0.22 | 0.88 |
| | Rnd | 0.93 (0.02) | 0.89 (0.04) | 0.86 (0.03) | 0.70 | 0.18 | 0.94 |
| | HykGene | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 0.69 | 0.89 | 0.24 |
| | maxT | 0.89 (0.07) | 0.88 (0.09) | 0.79 (0.13) | 0.50 | 0.00 | 1.00 |
| | PCA | 0.84 (0.09) | 0.77 (0.15) | 0.75 (0.15) | 0.50 | 0.00 | 1.00 |
| | SNR | 0.50 (0.00) | 0.08 (0.27) | 0.92 (0.27) | 0.50 | 0.00 | 1.00 |
| | t-test | 0.50 (0.00) | 0.08 (0.27) | 0.92 (0.27) | 0.50 | 0.00 | 1.00 |
| | ODP | - | - | - | - | - | - |
| | cat | - | - | - | - | - | - |
An overview of the published classification results in van ’t Veer et al. breast cancer data
| [ | 65/78 | 83.3 | 17/19 | 89.5 | 70 |
| [ | - | 92.13 | - | 91.67 | 3 |
| [ | 60/78 | 76.90 | 15/19 | 78.9 | 231 |
| [ | - | 76.20 | 15/19 | 78.9 | 231 |
| [ | 62/78 | 81.40 | 17/19 | 89.5 | 44 |
| [ | 88/97 | 90.7 | - | - | 50 |
| [ | 49/78 | 62.9 | - | - | - |
| [ | - | - | 17/19 | 89.47 | 834 |
| [ | 66/97 | 68.04 | - | - | 8 |
| m | - | 75.93 | 13/19 | 68.42 | 6 |
| m | - | 56.35 | 5/19 | 26.32 | 6 |
| m | - | 71.47 | 11/19 | 57.89 | 6 |
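The count and percentage columns in these overview tables appear to be related by simple arithmetic. The sketch below assumes that a cell such as `65/78` means correctly classified samples over total samples and that the adjacent column is the corresponding percent accuracy; the column headers are missing here, so this reading is an assumption, and `percent_accuracy` is a hypothetical helper name.

```python
def percent_accuracy(cell):
    """Convert a 'correct/total' cell, e.g. '65/78' or '33/34 (DLDA)', to a percentage."""
    # Ignore any trailing annotation such as a classifier name in parentheses.
    fraction = cell.split()[0]
    correct, total = (int(part) for part in fraction.split("/"))
    return 100.0 * correct / total


# Cells copied from the overview tables in this record.
for cell in ("65/78", "17/19", "66/97", "33/34 (DLDA)"):
    print(cell, "->", round(percent_accuracy(cell), 2))
```

Recomputing the percentage this way is a quick consistency check when comparing results reported by different studies.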
An overview of the published classification results in Alon et al. colon cancer data
| [ | 57/62 | 91.94 | - | | - |
| [ | 53/62 | 85.48 | - | | - |
| [ | - | - | - | 90.3 | - |
| [ | - | - | - | 94.1~ | - |
| [ | - | - | - | 80.6 | - |
| [ | - | - | - | 74.2 | - |
| [ | - | - | - | 72.6 | - |
| [ | - | - | - | 87.1 | - |
| [ | - | - | - | 87.1 | - |
| [ | - | - | - | 93.5 | 50 |
| [ | - | - | - | 91.9 | 1000 |
| [ | 52/62 (MAVE-LD) | 83.87 | - | - | 50 |
| [ | 56/62 | 90.3 | - | - | 50 |
| [ | 59/62 | 95.16 | - | - | 135 |
| m | - | 96.00 | 26/32 | 81.25 | 20 |
| m | - | 96.00 | 26/32 | 81.25 | 20 |
| m | - | 94.00 | 28/32 | 87.50 | 20 |
An overview of the published classification results in Golub et al. ALL/AML leukemia data
| [ | 36/38 | 94.73 | 29/34 | 85.29 | 50 |
| [ | 38/38 | 100.00 | 34/34 | 100 | - |
| [ | - | - | 33/34 | 97.06 | - |
| [ | - | - | - | 94.1 | - |
| [ | - | - | - | 94.1 | - |
| [ | - | - | - | 91.6 | - |
| [ | - | - | - | 94.4 | - |
| [ | - | - | - | 95.8 | - |
| [ | - | - | - | 94.17 | 50 |
| [ | - | - | - | 95.44 | 50 |
| [ | - | - | - | 95.94 | 50 |
| [ | - | - | - | 96.44 | 50 |
| [ | 38/38 | 100 | 31/34 | 91.17 | 7129 |
| [ | 38/38 | 100 | 34/34 | 100 | 999 |
| [ | 38/38 | 100 | 32/34 | 94.11 | 99 |
| [ | 38/38 | 100 | 30/34 | 88.23 | 49 |
| [ | - | - | 34/34 | 100 | 40 |
| [ | - | - | 32/34 | 94.11 | 5 |
| [ | - | - | - | 95.0~ | - |
| [ | - | - | - | 95.0~ | - |
| [ | - | - | - | 95.0~ | - |
| [ | 37/38 | 98 | 34/34 | 100 | 185 |
| [ | 38/38 | 100 | 34/34 | 100 | 3800 |
| [ | 37/38 | 98 | 32/34 | 94.11 | 21 |
| [ | 71/72 | 98.6 | - | - | - |
| [ | 71/72 | 98.61 | - | - | 2 |
| [ | 38/38 (DLDA) | 100 | 33/34 (DLDA) | 97.06 | 50 |
| [ | 38/38 | 100 | - | - | 50 |
| [ | - | - | 31/34 | 91.18 | 1038 |
| m | - | 98.93 | 24/34 | 70.59 | 5 |
| m | - | 93.61 | 24/34 | 70.59 | 5 |
| m | - | 97.36 | 27/34 | 79.41 | 5 |
An overview of the published classification results in Singh et al. prostate cancer data
| [ | 98/102 | 96.08 | 33/34 | 97.06 | - |
| [ | - | - | 25/34 | 73.53 | 3071 |
| [ | 124/136 | 91.18 | - | - | 6 |
| m | - | 87.33 | 18/34 | 52.94 | 12 |
| m | - | 82.22 | 29/34 | 85.29 | 12 |
| m | - | 87.82 | 33/34 | 97.06 | 12 |
The number of clusters identified by mAP-KL for several numbers of top-ranked genes, compared to three other FS methods (the number of genes per subset is in parentheses)
| | | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 5 (5) | 6 (6) | 4 (4) | 3 (3) | 3 (3) | 3 (3) | 2 (2) | 2 (2) | 2 (2) | 2 (2) | 2 (20) | 2 (20) | 5 (20) |
| 100 | 3 (3) | 5 (5) | 6 (6) | 6 (14) | 5 (5) | 4 (4) | 4 (4) | 4 (4) | 3 (3) | 3 (3) | 1 (20) | 2 (20) | 5 (20) |
| 200 | 3 (3) | 6 (6) | 8 (8) | 10 (10) | 11 (11) | 11 (11) | 8 (8) | 5 (5) | 5 (5) | 5 (5) | 1 (20) | 2 (20) | 10 (20) |
| 300 | 3 (3) | 6 (6) | 8 (8) | 10 (10) | 13 (13) | 15 (15) | 11 (11) | 7 (7) | 7 (7) | 6 (6) | 2 (20) | 4 (20) | 10 (20) |
| 400 | 4 (4) | 6 (6) | 8 (8) | 11 (11) | 13 (13) | 15 (15) | 18 (18) | 20 (20) | 21 (23) | 10 (10) | 3 (30) | 4 (30) | 16 (30) |
| 500 | 4 (4) | 7 (7) | 9 (9) | 11 (11) | 13 (13) | 16 (16) | 18 (18) | 20 (20) | 23 (23) | 25 (25) | 3 (30) | 4 (30) | 19 (30) |
The subsets of genes selected from the ‘choedata’ according to mAP-KL
| 17 | 7983 | ||
| 21 | CG14254 | 8561 | |
| 53 | 9874 | ||
| 66 | 10011 | ||
| 92 | 593 | ||
| 114 | 11006 | ||
| CG8300 | 120 | 3545 | |
| 123 | 11303 | ||
| 162 | kek3 | 2322 | |
| RhoGEF2 | 163 | 10244 | |
| Imp | 188 | 9612 | |
| Dip2 | 209 | 11063 | |
| Spred | 219 | 1148 | |
| NA | 269 | CG18125 | 2424 |
| NA | 333 | 9585 | |
| 9432 | |||
Figure 2. The overall performance of the FS methods according to the AUC metric. The methods are sorted by the mean of their AUC values, except for Rnd, which is not actually a method. The standard deviation across all diseases quantifies the robustness of each method. The mean value per disease across all feature selection methods serves as an index of how difficult the discrimination is. NM among the myopathies and prostate cancer among the cancers were the most difficult phenotypes to discriminate.
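The three summary quantities named in the Figure 2 caption can be sketched as follows. The AUC values, the method and disease labels and the `summarize` helper are hypothetical and are not taken from the figure. Two further assumptions are made: the sample standard deviation is used, and the Rnd entry is left out of the per-disease mean because the caption says it is not actually a method.

```python
from statistics import mean, stdev

# Hypothetical validation AUCs per method and disease (illustrative only).
auc = {
    "method_A": {"disease_1": 0.90, "disease_2": 0.80, "disease_3": 0.70},
    "method_B": {"disease_1": 1.00, "disease_2": 0.60, "disease_3": 0.50},
    "Rnd":      {"disease_1": 0.80, "disease_2": 0.70, "disease_3": 0.60},
}


def summarize(auc, baseline="Rnd"):
    """Return per-method (mean, std), the ranking by mean AUC, and per-disease means."""
    # Mean AUC and its spread across diseases: performance and robustness.
    per_method = {
        m: (mean(list(v.values())), stdev(list(v.values()))) for m, v in auc.items()
    }
    # Rank the real methods by mean AUC; the baseline is kept out of the ranking.
    methods = [m for m in auc if m != baseline]
    ranked = sorted(methods, key=lambda m: per_method[m][0], reverse=True)
    # Mean AUC per disease across methods: a lower value means a harder disease.
    diseases = list(next(iter(auc.values())))
    difficulty = {d: mean([auc[m][d] for m in methods]) for d in diseases}
    return per_method, ranked, difficulty


per_method, ranked, difficulty = summarize(auc)
hardest = min(difficulty, key=difficulty.get)
print(ranked, hardest)
```

A method with a high mean and a small standard deviation performs well consistently across diseases, while the disease with the lowest mean is the hardest to discriminate.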