Abstract
BACKGROUND: Gene selection is an important part of microarray data analysis because it provides information that can lead to a better mechanistic understanding of an investigated phenomenon. At the same time, gene selection is very difficult because of the noisy nature of microarray data. As a consequence, gene selection is often performed with machine learning methods. The Random Forest method is particularly well suited for this purpose. In this work, four state-of-the-art Random Forest-based feature selection methods were compared in a gene selection context. The analysis focused on the stability of selection because, although it is necessary for determining the significance of results, it is often ignored in similar studies.
Year: 2014 PMID: 24410865 PMCID: PMC3897925 DOI: 10.1186/1471-2105-15-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Post-selection error rates. Post-selection errors of a Random Forest classifier over bootstrap iterations, presented directly and as boxplots. Colour is used for clarity.
Selection consistency analysis
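| Method | Colon c | Colon f | Colon c/f | Leukemia c | Leukemia f | Leukemia c/f | SRBCT c | SRBCT f | SRBCT c/f | Prostate c | Prostate f | Prostate c/f |
|---|---|---|---|---|---|---|---|---|---|---|---|---|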
| RF-ACE | 0.0 | 1354.2 | 0% | 398.0 | 1946.1 | 20% | 0.0 | 1569.1 | 0% | 1356.0 | 7778.6 | 17% |
| Bor. Ferns 1 | 91.8 | 176.6 | 52% | 228.9 | 391.9 | 58% | 336.8 | 567.8 | 59% | 480.3 | 757.3 | 63% |
| Bor. Ferns 2 | 93.0 | 182.8 | 51% | 249.0 | 423.3 | 59% | 354.5 | 652.2 | 54% | 520.7 | 840.4 | 62% |
| Bor. Ferns 3 | 104.9 | 192.2 | 55% | 247.5 | 439.6 | 56% | 375.0 | 720.2 | 52% | 582.1 | 916.6 | 64% |
| Bor. Ferns 4 | 118.8 | 210.8 | 56% | 252.9 | 453.0 | 56% | 383.6 | 786.7 | 49% | 621.9 | 986.5 | 63% |
| Bor. Ferns 5 | 120.6 | 227.2 | 53% | 270.9 | 482.7 | 56% | 396.2 | 864.2 | 46% | 670.3 | 1046.3 | 64% |
| Bor. Ferns 6 | 135.9 | 246.8 | 55% | 275.3 | 513.2 | 54% | 395.8 | 959.4 | 41% | 692.1 | 1077.3 | 64% |
| Bor. Ferns 7 | 145.0 | 277.9 | 52% | 296.4 | 550.1 | 54% | 357.0 | 1058.3 | 34% | 705.8 | 1104.5 | 64% |
| Bor. RF Gini | 77.2 | 137.8 | 56% | 230.2 | 407.6 | 56% | 358.4 | 626.7 | 57% | 267.2 | 462.1 | 58% |
| Bor. RF Raw | 116.9 | 214.7 | 54% | 256.9 | 446.2 | 58% | 403.9 | 807.6 | 50% | 422.7 | 728.0 | 58% |
| Bor. RF Norm. | 103.3 | 199.1 | 52% | 237.5 | 403.3 | 59% | 400.8 | 839.2 | 48% | 301.5 | 529.9 | 57% |
| RFE Ferns 1 | 23.2 | 95.5 | 24% | 4.4 | 8.5 | 51% | 39.0 | 72.8 | 54% | 28.9 | 503.9 | 6% |
| RFE Ferns 2 | 18.6 | 55.2 | 34% | 4.4 | 8.0 | 55% | 36.6 | 75.2 | 49% | 73.8 | 854.0 | 9% |
| RFE Ferns 3 | 23.1 | 88.5 | 26% | 4.3 | 8.3 | 52% | 30.6 | 78.1 | 39% | 47.2 | 125.9 | 38% |
| RFE Ferns 4 | 18.0 | 77.3 | 23% | 3.9 | 8.5 | 46% | 38.6 | 70.9 | 54% | 34.9 | 402.9 | 9% |
| RFE Ferns 5 | 18.6 | 52.5 | 35% | 4.9 | 9.1 | 54% | 38.0 | 104.3 | 36% | 99.5 | 321.1 | 31% |
| RFE Ferns 6 | 18.5 | 58.7 | 32% | 5.1 | 9.6 | 53% | 33.1 | 52.5 | 63% | 75.6 | 280.8 | 27% |
| RFE Ferns 7 | 13.8 | 70.9 | 19% | 5.0 | 9.6 | 52% | 32.8 | 49.1 | 67% | 36.6 | 81.3 | 45% |
| RFE RF Gini | 17.7 | 110.1 | 16% | 4.8 | 8.5 | 57% | 26.5 | 38.9 | 68% | 71.7 | 163.2 | 44% |
| RFE RF Raw | 18.6 | 51.2 | 36% | 4.8 | 8.3 | 58% | 31.3 | 46.9 | 67% | 43.6 | 274.9 | 16% |
| RFE RF Norm. | 11.9 | 32.5 | 37% | 4.3 | 8.0 | 53% | 28.1 | 43.7 | 64% | 34.6 | 60.0 | 58% |
| RRF | 1.4 | 15.9 | 9% | 0.0 | 3.8 | 0% | 1.9 | 8.3 | 22% | 1.1 | 19.2 | 6% |
| No. features | 2000 | | | 3051 | | | 1586 | | | 12533 | | |
The average number of significantly self-consistent genes (c) and of all genes selected (f) by a given method in one bootstrap iteration. For each dataset, the percentage that follows c and f is the ratio c/f.
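The percentage reported for each dataset can be reproduced from the c and f values. The following is a minimal sketch, assuming the percentage is simply c/f rounded to the nearest whole percent; the function name `consistency_percent` is hypothetical, and the sample rows are taken from the first dataset's columns of the table above.

```python
def consistency_percent(c: float, f: float) -> int:
    """Share of selected genes that are self-consistent, as a rounded percentage.

    Assumption: the tabulated percentage is 100 * c / f rounded to an integer.
    """
    if f == 0:
        # No genes selected at all: report 0% rather than dividing by zero.
        return 0
    return round(100 * c / f)


# (c, f) pairs copied from the first dataset's columns of the table.
rows = {
    "RF-ACE": (0.0, 1354.2),
    "Bor. Ferns 1": (91.8, 176.6),
    "RFE RF Norm.": (11.9, 32.5),
    "RRF": (1.4, 15.9),
}

for method, (c, f) in rows.items():
    print(f"{method}: {consistency_percent(c, f)}%")
```

Because c and f are themselves averages rounded to one decimal place, a recomputed ratio may occasionally differ from a tabulated value by one percentage point.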
Execution time
| Method | Colon | Leukemia | SRBCT | Prostate |
|---|---|---|---|---|
| RF-ACE | 40’ | 24’ | 57’ | 2 h 47’ |
| Boruta Ferns depth 1 | 01’ | 01’ | 01’ | 03’ |
| Boruta Ferns depth 7 | 05’ | 05’ | 11’ | 09’ |
| Boruta RF Gini | 2 h 27’ | 2 h 19’ | 10 h 52’ | 30 h 48’ |
| Boruta RF Raw | 3 h 30’ | 2 h 43’ | 14 h 35’ | 40 h 23’ |
| Boruta RF Norm. | 3 h 28’ | 2 h 34’ | 16 h 04’ | 35 h 27’ |
| RFE Ferns depth 1 | 10’ | 08’ | 15’ | 6 h 43’ |
| RFE Ferns depth 7 | 10’ | 08’ | 16’ | 7 h 24’ |
| RFE RF Gini | 21’ | 16’ | 31’ | 13 h 34’ |
| RFE RF Raw | 21’ | 16’ | 33’ | 13 h 49’ |
| RFE RF Norm. | 22’ | 17’ | 32’ | 13 h 17’ |
| RRF | 03’ | 02’ | 04’ | 1 h 04’ |
| No. features | 2000 | 3051 | 1586 | 12533 |
| No. objects | 62 | 38 | 83 | 102 |
The execution time of selected algorithms, represented as the mean over 30 bootstrap iterations. All algorithms investigated in this study were run single-threaded.
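To compare the timings numerically, the table entries can be converted to minutes. The sketch below assumes every entry has the form `H h MM’` or `MM’`, with `’` marking minutes; the helper `to_minutes` is hypothetical, and the sample values are from the last dataset column (12533 features).

```python
import re

# Optional "<hours> h" part followed by "<minutes>’" (typographic or ASCII apostrophe).
_DURATION = re.compile(r"^\s*(?:(\d+)\s*h\s*)?(\d+)\s*[’']\s*$")


def to_minutes(text: str) -> int:
    """Convert a table entry such as '2 h 47’' or '03’' to whole minutes."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2))
    return 60 * hours + minutes


# Entries copied from the last dataset column of the table.
largest = {
    "Boruta Ferns depth 1": "03’",
    "Boruta RF Gini": "30 h 48’",
    "RRF": "1 h 04’",
}
minutes = {name: to_minutes(value) for name, value in largest.items()}

# Ratio of two mean run times; approximate, since the inputs are rounded to minutes.
slowdown = minutes["Boruta RF Gini"] / minutes["Boruta Ferns depth 1"]
print(minutes, slowdown)
```

Since the tabulated values are means rounded to whole minutes, ratios involving very short run times (such as 01’ or 03’) are only rough indications.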
Datasets
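| Dataset | Reference | No. features | No. objects | Classes | Class sizes |
|---|---|---|---|---|---|
| Colon | Alon | 2000 | 62 | Normal/tumor colon tissue | 40:22 |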
| Leukemia | Golub | 3051 | 38 | ALL/AML leukemia type | 27:11 |
| SRBCT | Khan | 1586 | 83 | 4 SRBCT types | 11:29:18:25 |
| Prostate | Singh | 12533 | 102 | Normal/tumor prostate tissue | 50:52 |
The microarray datasets used in this study.
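As a quick consistency check on the table, the class sizes for each dataset should add up to its number of objects. The sketch below is illustrative only: the helper names are hypothetical, and the majority-class share is included as a simple measure of class imbalance rather than a figure reported in the study.

```python
# (number of objects, class sizes) copied from the datasets table.
datasets = {
    "Colon": (62, "40:22"),
    "Leukemia": (38, "27:11"),
    "SRBCT": (83, "11:29:18:25"),
    "Prostate": (102, "50:52"),
}


def class_sizes(text: str) -> list[int]:
    """Split a 'a:b:c' class-size string into integers."""
    return [int(part) for part in text.split(":")]


def majority_share(text: str) -> float:
    """Fraction of objects that belong to the largest class."""
    sizes = class_sizes(text)
    return max(sizes) / sum(sizes)


for name, (n_objects, sizes_text) in datasets.items():
    total = sum(class_sizes(sizes_text))
    # The class sizes should account for every object in the dataset.
    status = "ok" if total == n_objects else "MISMATCH"
    print(f"{name}: {total} objects ({status}), "
          f"majority share {majority_share(sizes_text):.2f}")
```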