Shengqiao Li, E James Harner, Donald A Adjeroh.
Abstract
BACKGROUND: Successfully modeling high-dimensional data involving thousands of variables is challenging. This is especially true for gene expression profiling experiments, given the large number of genes involved and the small number of samples available. Random Forests (RF) is a popular and widely used approach to feature selection for such "small n, large p problems." However, Random Forests suffers from instability, especially in the presence of noisy and/or unbalanced inputs.
Year: 2011 PMID: 22093447 PMCID: PMC3281073 DOI: 10.1186/1471-2105-12-450
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Computing feature supports using Random KNN bidirectional voting
| /* Generate |
| /* Return support for each feature */ |
| Perform query from base data sets using each KNN; |
| Compare predicted values with observed values; |
| Calculate accuracy, |
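The steps that survive in the listing above (query the base data with each KNN, compare predictions with observed labels, compute an accuracy, return a support per feature) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameters `r` (number of KNNs), `m` (features per KNN) and `k`, the use of leave-one-out queries, and the definition of support as the mean accuracy of the KNNs a feature took part in are all assumptions made for this sketch.

```python
import random
from collections import Counter, defaultdict


def knn_predict(train_X, train_y, query, features, k):
    """Majority vote among the k nearest training rows, using only `features`."""
    def dist(row):
        return sum((row[j] - query[j]) ** 2 for j in features)

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def loo_accuracy(X, y, features, k):
    """Leave-one-out accuracy of one KNN restricted to `features`.

    Assumption: the source only says each KNN queries the base data;
    leave-one-out is used here so a sample never matches itself.
    """
    hits = 0
    for i in range(len(X)):
        pred = knn_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], features, k)
        hits += pred == y[i]
    return hits / len(X)


def feature_supports(X, y, r=100, m=2, k=1, seed=0):
    """Build r KNNs on m random features each and score every feature.

    Assumed definition: a feature's support is the mean accuracy of the
    KNNs in which it participated.
    """
    rng = random.Random(seed)
    accuracies = defaultdict(list)
    for _ in range(r):
        features = rng.sample(range(len(X[0])), m)
        acc = loo_accuracy(X, y, features, k)
        for j in features:
            accuracies[j].append(acc)  # each KNN votes for all its features
    return {j: sum(a) / len(a) for j, a in accuracies.items()}
```

Each feature is scored by several KNNs and each KNN scores several features, which is one way to read the "bidirectional voting" in the title. Ranking the returned dictionary by value gives a relevance ordering of the features.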
Figure 1. Bidirectional voting using Random KNN.
Figure 2. Supports for the 30 most relevant genes using the Golub leukemia data (left panel: dynamic partition; right panel: fixed partition of the data for testing and training).
Two-stage variable backward elimination procedure for Random KNN
| Stage 1: Geometric Elimination |
| initialize |
| initialize |
| if |
| Stage 2: Linear Reduction |
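Only the two stage names of the procedure survive above, so the following Python sketch is one plausible reading rather than the published algorithm. It assumes that Stage 1 keeps a fixed fraction of the best-supported features per iteration, that Stage 2 restarts one step before the Stage 1 accuracy peak and drops a fixed number of features per iteration, and that the final feature set is the one with peak accuracy in Stage 2. The `evaluate` callback and the parameters `keep_ratio`, `min_size` and `drop` are hypothetical.

```python
def rank_by_support(features, supports):
    """Order features from highest to lowest support (stable for ties)."""
    return sorted(features, key=lambda f: supports[f], reverse=True)


def two_stage_elimination(features, evaluate, keep_ratio=0.5, min_size=2, drop=1):
    """Backward elimination; evaluate(features) -> (mean_accuracy, {feature: support})."""
    # Stage 1: geometric elimination, keep a fixed fraction each round.
    stage1 = []
    current = list(features)
    while True:
        acc, supports = evaluate(current)
        stage1.append((acc, current))
        n_keep = int(len(current) * keep_ratio)
        if n_keep < min_size:
            break
        current = rank_by_support(current, supports)[:n_keep]

    # Restart from the feature set one step before the stage-1 peak (assumed rule).
    best = max(range(len(stage1)), key=lambda i: stage1[i][0])
    current = stage1[max(best - 1, 0)][1]

    # Stage 2: linear reduction, drop a fixed number of features each round.
    stage2 = []
    while True:
        acc, supports = evaluate(current)
        stage2.append((acc, current))
        if len(current) - drop < min_size:
            break
        current = rank_by_support(current, supports)[: len(current) - drop]

    # Select the feature set with peak accuracy in stage 2.
    return max(stage2, key=lambda item: item[0])[1]
```

The geometric stage discards most irrelevant features in a logarithmic number of rounds, and the linear stage then searches the neighborhood of the peak at a finer step.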
Figure 3. Mean accuracy change with the number of features for the Golub leukemia data in the first stage.
Figure 4. Mean accuracy change with the number of features for the Golub leukemia data in the second stage (the feature set with the peak value is selected).
Figure 5. The effect of .
Figure 6. The effect of .
Microarray gene expression datasets, Group I
| Dataset | Sample Size, n | No. of Genes, p | No. of Classes, c | p/n | p·c/n |
|---|---|---|---|---|---|
| Ramaswamy | 308 | 15009 | 26 | 49 | 1267 |
| Staunton | 60 | 5726 | 9 | 95 | 859 |
| Nutt | 50 | 10367 | 4 | 207 | 829 |
| Su | 174 | 12533 | 11 | 72 | 792 |
| NCI60 | 61 | 5244 | 8 | 86 | 688 |
| Brain | 42 | 5597 | 5 | 133 | 666 |
| Armstrong | 72 | 11225 | 3 | 156 | 468 |
| Pomeroy | 90 | 5920 | 5 | 66 | 329 |
| Bhattacharjee | 203 | 12600 | 5 | 62 | 310 |
| Adenocarcinoma | 76 | 9868 | 2 | 130 | 260 |
| Golub | 72 | 5327 | 3 | 74 | 222 |
| Singh | 102 | 10509 | 2 | 103 | 206 |
Microarray gene expression datasets, Group II
| Dataset | Sample Size, n | No. of Genes, p | No. of Classes, c | p/n | p·c/n |
|---|---|---|---|---|---|
| Lymphoma | 62 | 4026 | 3 | 65 | 195 |
| Leukemia | 38 | 3051 | 2 | 80 | 161 |
| Breast.3.Classes | 95 | 4869 | 3 | 51 | 154 |
| SRBCT | 63 | 2308 | 4 | 37 | 147 |
| Shipp | 77 | 5469 | 2 | 71 | 142 |
| Breast.2.Classes | 77 | 4869 | 2 | 63 | 126 |
| Prostate | 102 | 6033 | 2 | 59 | 118 |
| Khan | 83 | 2308 | 4 | 28 | 111 |
| Colon | 62 | 2000 | 2 | 32 | 65 |
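The last two columns of the two dataset tables appear to be the ratios p/n and p·c/n, rounded to the nearest integer. This is an inference from the numbers, since the column labels did not survive extraction, and the symbol `c` for the number of classes is an assumption. A short check on three rows copied from the tables:

```python
def ratios(n, p, c):
    """Return (p/n, p*c/n), each rounded to the nearest integer."""
    return round(p / n), round(p * c / n)


# (sample size n, number of genes p, number of classes c) from the tables above
datasets = {
    "Ramaswamy": (308, 15009, 26),
    "Golub": (72, 5327, 3),
    "Colon": (62, 2000, 2),
}

for name, (n, p, c) in datasets.items():
    print(name, *ratios(n, p, c))
```

The datasets are ordered by the second ratio, and Group I contains the rows where it is larger.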
Comparative performance with gene selection, Group I
| Dataset | p·c/n | Mean Accuracy | | | Standard Deviation | | | Coefficient of Variation (%) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF | R1NN | R3NN | RF | R1NN | R3NN |
| Ramaswamy | 1267 | 0.577 | 0.726 | 0.704 | 0.019 | 0.013 | 0.013 | 3.231 | 1.775 | 1.796 |
| Staunton | 859 | 0.561 | 0.692 | 0.663 | 0.042 | 0.026 | 0.031 | 7.485 | 3.802 | 4.669 |
| Nutt | 829 | 0.671 | 0.903 | 0.834 | 0.051 | 0.030 | 0.031 | 7.619 | 3.268 | 3.674 |
| Su | 792 | 0.862 | 0.901 | 0.888 | 0.016 | 0.015 | 0.014 | 1.884 | 1.624 | 1.567 |
| NCI | 688 | 0.813 | 0.854 | 0.836 | 0.033 | 0.027 | 0.023 | 4.083 | 3.135 | 2.796 |
| Brain | 666 | 0.969 | 0.958 | 0.940 | 0.025 | 0.013 | 0.018 | 2.574 | 1.323 | 1.875 |
| Armstrong | 468 | 0.936 | 0.993 | 0.980 | 0.020 | 0.009 | 0.013 | 2.166 | 0.938 | 1.345 |
| Pomeroy | 329 | 0.858 | 0.933 | 0.863 | 0.025 | 0.016 | 0.017 | 2.892 | 1.762 | 1.991 |
| Bhattacharjee | 310 | 0.934 | 0.956 | 0.954 | 0.015 | 0.006 | 0.006 | 1.572 | 0.620 | 0.618 |
| Adenocarcinoma | 260 | 0.942 | 0.939 | 0.859 | 0.018 | 0.017 | 0.032 | 1.948 | 1.808 | 3.675 |
| Golub | 222 | 0.943 | 0.986 | 0.986 | 0.022 | 0.003 | 0.004 | 2.328 | 0.289 | 0.369 |
| Singh | 206 | 0.889 | 0.952 | 0.931 | 0.024 | 0.014 | 0.018 | 2.718 | 1.427 | 1.920 |
| Average | | 0.830 | 0.899 | 0.870 | 0.026 | 0.016 | 0.018 | 3.375 | 1.814 | 2.191 |
Comparative performance with gene selection, Group II
| Dataset | p·c/n | Mean Accuracy | | | Standard Deviation | | | Coefficient of Variation (%) | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF | R1NN | R3NN | RF | R1NN | R3NN |
| Lymphoma | 195 | 0.993 | 1.000 | 1.000 | 0.012 | 0.000 | 0.000 | 1.162 | 0.000 | 0.000 |
| Leukemia | 161 | 1.000 | 0.999 | 0.999 | 0.000 | 0.006 | 0.004 | 0.000 | 0.596 | 0.427 |
| Breast.3.class | 154 | 0.778 | 0.793 | 0.761 | 0.024 | 0.037 | 0.035 | 3.023 | 4.665 | 4.639 |
| SRBCT | 147 | 0.982 | 0.998 | 0.996 | 0.010 | 0.005 | 0.007 | 0.967 | 0.470 | 0.684 |
| Shipp | 142 | 0.865 | 0.997 | 0.991 | 0.033 | 0.008 | 0.011 | 3.757 | 0.800 | 1.077 |
| Breast.2.class | 126 | 0.838 | 0.841 | 0.822 | 0.024 | 0.052 | 0.042 | 2.894 | 6.206 | 5.049 |
| Prostate | 118 | 0.947 | 0.941 | 0.917 | 0.007 | 0.011 | 0.016 | 0.703 | 1.154 | 1.701 |
| Khan | 111 | 0.985 | 0.994 | 0.994 | 0.006 | 0.006 | 0.008 | 0.643 | 0.608 | 0.809 |
| Colon | 65 | 0.894 | 0.944 | 0.910 | 0.010 | 0.013 | 0.025 | 1.163 | 1.337 | 2.733 |
| Average | | 0.920 | 0.945 | 0.932 | 0.014 | 0.015 | 0.016 | 1.590 | 1.760 | 1.902 |
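The coefficient of variation columns are consistent with the standard deviation expressed as a percentage of the mean accuracy. The sketch below assumes that definition and recomputes it for the Ramaswamy row of the Group I table. Because the tabulated means and standard deviations are already rounded to three decimals, the recomputed values only approximate the published ones.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean


# (method, mean accuracy, standard deviation, tabulated CV) for Ramaswamy
rows = [
    ("RF", 0.577, 0.019, 3.231),
    ("R1NN", 0.726, 0.013, 1.775),
    ("R3NN", 0.704, 0.013, 1.796),
]

for method, mean, sd, reported in rows:
    cv = coefficient_of_variation(mean, sd)
    print(f"{method}: recomputed {cv:.2f}, reported {reported}")
```

A lower value means the accuracy varies less across runs relative to its level, which is how the tables express stability.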
Average gene set size and standard deviation, Group I
| Dataset | p·c/n | Mean Feature Set Size | | | Standard Deviation | | |
|---|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF | R1NN | R3NN |
| Ramaswamy | 1267 | 907 | 336 | 275 | 666 | 34 | 52 |
| Staunton | 859 | 185 | 74 | 60 | 112 | 12 | 11 |
| Nutt | 829 | 146 | 49 | 49 | 85 | 6 | 4 |
| Su | 792 | 858 | 225 | 216 | 421 | 9 | 26 |
| NCI | 688 | 126 | 187 | 163 | 118 | 41 | 33 |
| Brain | 666 | 18 | 137 | 120 | 13 | 42 | 42 |
| Armstrong | 468 | 249 | 76 | 73 | 1011 | 16 | 12 |
| Pomeroy | 329 | 69 | 89 | 82 | 70 | 15 | 13 |
| Bhattacharjee | 310 | 33 | 148 | 146 | 29 | 15 | 10 |
| Adenocarcinoma | 260 | 8 | 38 | 11 | 4 | 20 | 11 |
| Golub | 222 | 12 | 27 | 21 | 8 | 5 | 5 |
| Singh | 206 | 26 | 25 | 13 | 32 | 6 | 6 |
| Average | | 220 | 118 | 102 | 214 | 18 | 19 |
Average gene set size and standard deviation, Group II
| Dataset | p·c/n | Mean Feature Set Size | | | Standard Deviation | | |
|---|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF | R1NN | R3NN |
| Lymphoma | 195 | 75 | 114 | 103 | 30 | 49 | 44 |
| Leukemia | 161 | 2 | 28 | 36 | 0 | 22 | 18 |
| Breast.3.Class | 154 | 47 | 43 | 36 | 35 | 23 | 8 |
| SRBCT | 147 | 49 | 65 | 64 | 50 | 8 | 9 |
| Shipp | 142 | 13 | 46 | 48 | 23 | 9 | 6 |
| Breast.2.Class | 126 | 32 | 23 | 15 | 29 | 16 | 10 |
| Prostate | 118 | 16 | 32 | 15 | 10 | 10 | 11 |
| Khan | 111 | 17 | 67 | 36 | 5 | 11 | 14 |
| Colon | 65 | 21 | 37 | 36 | 18 | 5 | 5 |
| Average | | 30 | 51 | 43 | 22 | 17 | 14 |
Execution time comparison, Group I
| Dataset | p·c/n | Time (min) | | | Ratio | |
|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF/R1NN | RF/R3NN |
| Ramaswamy | 1267 | 22335 | 4262 | 4324 | 5.2 | 5.2 |
| Staunton | 859 | 3310 | 744 | 753 | 4.4 | 4.4 |
| Nutt | 829 | 176 | 195 | 195 | 0.9 | 0.9 |
| Su | 792 | 3592 | 1284 | 1279 | 2.8 | 2.8 |
| NCI | 688 | 142 | 177 | 178 | 0.8 | 0.8 |
| Brain | 666 | 92 | 124 | 125 | 0.7 | 0.7 |
| Armstrong | 468 | 327 | 301 | 297 | 1.1 | 1.1 |
| Pomeroy | 329 | 296 | 319 | 320 | 0.9 | 0.9 |
| Bhattacharjee | 310 | 4544 | 1725 | 1733 | 2.6 | 2.6 |
| Adenocarcinoma | 260 | 274 | 272 | 273 | 1.0 | 1.0 |
| Golub | 222 | 160 | 224 | 224 | 0.7 | 0.7 |
| Singh | 206 | 646 | 503 | 498 | 1.3 | 1.3 |
| Total | | 35894 | 10130 | 10199 | 3.54 | 3.52 |
Execution time comparison, Group II
| Dataset | p·c/n | Time (min) | | | Ratio | |
|---|---|---|---|---|---|---|
| | | RF | R1NN | R3NN | RF/R1NN | RF/R3NN |
| Lymphoma | 195 | 57 | 146 | 147 | 0.4 | 0.4 |
| Leukemia | 161 | 18 | 74 | 74 | 0.3 | 0.2 |
| Breast.3.Class | 154 | 310 | 332 | 334 | 0.9 | 0.9 |
| SRBCT | 147 | 97 | 177 | 178 | 0.5 | 0.5 |
| Shipp | 142 | 238 | 293 | 286 | 0.8 | 0.8 |
| Breast.2.Class | 126 | 167 | 221 | 222 | 0.8 | 0.8 |
| Prostate | 118 | 370 | 389 | 391 | 1.0 | 0.9 |
| Khan | 111 | 745 | 452 | 451 | 1.6 | 1.7 |
| Colon | 65 | 75 | 156 | 157 | 0.5 | 0.5 |
| Total | | 2077 | 2240 | 2240 | 0.93 | 0.93 |
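The ratios in the Total rows are ratios of the summed times, not averages of the per-dataset ratios. A small Python check using the Group II times from the table above (the helper name `total_ratio` is made up for this illustration):

```python
def total_ratio(rf_minutes, rknn_minutes):
    """Ratio of total RF time to total Random KNN time."""
    return sum(rf_minutes) / sum(rknn_minutes)


# Group II execution times in minutes, in table order (Lymphoma ... Colon)
rf = [57, 18, 310, 97, 238, 167, 370, 745, 75]
r1nn = [146, 74, 332, 177, 293, 221, 389, 452, 156]
r3nn = [147, 74, 334, 178, 286, 222, 391, 451, 157]

print(sum(rf), sum(r1nn), sum(r3nn))
print(round(total_ratio(rf, r1nn), 2), round(total_ratio(rf, r3nn), 2))
```

Because the totals are dominated by the slowest datasets, a single large dataset such as Ramaswamy in Group I can move the overall ratio far more than the many small ones.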
Figure 7. Comparison of execution time between RKNN-FS and RF-FS.