Ramón Díaz-Uriarte, Sara Alvarez de Andrés.
Abstract
BACKGROUND: Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.
Year: 2006 PMID: 16398926 PMCID: PMC1363357 DOI: 10.1186/1471-2105-7-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Main characteristics of the microarray data sets used
| Dataset | Original ref. | Genes | Patients | Classes |
| Leukemia | [44] | 3051 | 38 | 2 |
| Breast | [9] | 4869 | 78 | 2 |
| Breast | [9] | 4869 | 96 | 3 |
| NCI 60 | [61] | 5244 | 61 | 8 |
| Adenocarcinoma | [62] | 9868 | 76 | 2 |
| Brain | [63] | 5597 | 42 | 5 |
| Colon | [64] | 2000 | 62 | 2 |
| Lymphoma | [65] | 4026 | 62 | 3 |
| Prostate | [66] | 6033 | 102 | 2 |
| Srbct | [67] | 2308 | 63 | 4 |
Error rates (estimated using the 0.632+ bootstrap method with 200 bootstrap samples) for the microarray data sets using different methods. The results shown for variable selection with random forest used ntree = 2000, fraction.dropped = 0.2, mtryFactor = 1. Note that the OOB error used for variable selection is not the error reported in this table; the error rate reported is obtained using bootstrap on the complete variable selection process. The column "no info" denotes the minimal error we can make if we use no information from the genes (i.e., we always bet on the most frequent class).
| Data set | no info | SVM | KNN | DLDA | SC.l | SC.s | NN.vs | random forest | random forest var.sel. (s.e. 0) | random forest var.sel. (s.e. 1) |
| Leukemia | 0.289 | 0.014 | 0.029 | 0.020 | 0.025 | 0.062 | 0.056 | 0.051 | 0.087 | 0.075 |
| Breast 2 cl. | 0.429 | 0.325 | 0.337 | 0.331 | 0.324 | 0.326 | 0.337 | 0.342 | 0.337 | 0.332 |
| Breast 3 cl. | 0.537 | 0.380 | 0.449 | 0.370 | 0.396 | 0.401 | 0.424 | 0.351 | 0.346 | 0.364 |
| NCI 60 | 0.852 | 0.256 | 0.317 | 0.286 | 0.256 | 0.246 | 0.237 | 0.252 | 0.327 | 0.353 |
| Adenocar. | 0.158 | 0.203 | 0.174 | 0.194 | 0.177 | 0.179 | 0.181 | 0.125 | 0.185 | 0.207 |
| Brain | 0.762 | 0.138 | 0.174 | 0.183 | 0.163 | 0.159 | 0.194 | 0.154 | 0.216 | 0.216 |
| Colon | 0.355 | 0.147 | 0.152 | 0.137 | 0.123 | 0.122 | 0.158 | 0.127 | 0.159 | 0.177 |
| Lymphoma | 0.323 | 0.010 | 0.008 | 0.021 | 0.028 | 0.033 | 0.04 | 0.009 | 0.047 | 0.042 |
| Prostate | 0.490 | 0.064 | 0.100 | 0.149 | 0.088 | 0.089 | 0.081 | 0.077 | 0.061 | 0.064 |
| Srbct | 0.635 | 0.017 | 0.023 | 0.011 | 0.012 | 0.025 | 0.031 | 0.021 | 0.039 | 0.038 |
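The two error notions named in the caption of this table can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code. `majority_class_error` computes the "no info" baseline as the caption describes it (always betting on the most frequent class), and `err632plus` combines the three components of the .632+ bootstrap estimator as defined by Efron and Tibshirani (1997). The function names, the example class split, and the example error values are hypothetical. Note that the `gamma` argument is the no-information error rate used inside the .632+ rule, which is a different quantity from the "no info" column.

```python
from collections import Counter


def majority_class_error(labels):
    """Error rate of always predicting the most frequent class."""
    counts = Counter(labels)
    return 1.0 - max(counts.values()) / len(labels)


def err632plus(resub_err, loo_boot_err, gamma):
    """Combine error components with the .632+ rule.

    resub_err    : error of the classifier on its own training data
    loo_boot_err : mean error on samples left out of each bootstrap sample
    gamma        : no-information error rate (expected error if labels and
                   predictions were independent)
    """
    # The left-out error is capped at the no-information rate.
    err1 = min(loo_boot_err, gamma)
    # Relative overfitting rate, set to 0 when there is no overfitting.
    if err1 > resub_err and gamma > resub_err:
        r = (err1 - resub_err) / (gamma - resub_err)
    else:
        r = 0.0
    # The weight moves from 0.632 (no overfitting) to 1 (maximal overfitting).
    w = 0.632 / (1.0 - 0.368 * r)
    return (1.0 - w) * resub_err + w * err1


# Assumed class split for a 38-sample, two-class data set (not given in the table).
labels = ["ALL"] * 27 + ["AML"] * 11
print(round(majority_class_error(labels), 3))  # 0.289

# Hypothetical error components, only to show the call.
print(err632plus(resub_err=0.0, loo_boot_err=0.1, gamma=0.5))
```

In the table, the whole variable selection procedure would be rerun inside every bootstrap sample before these components are computed, which is why the reported error differs from the OOB error used during selection.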
Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected from the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: mtryFactor = 1, s.e. = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2.
| Data set | Error | # Genes | # Genes boot. | Freq. genes |
| Backwards elimination with random forest, s.e. = 0 | ||||
| Leukemia | 0.087 | 2 | 2 (2, 2) | 0.38 (0.29, 0.48)¹ |
| Breast 2 cl. | 0.337 | 14 | 9 (5, 23) | 0.15 (0.1, 0.28) |
| Breast 3 cl. | 0.346 | 110 | 14 (9, 31) | 0.08 (0.04, 0.13) |
| NCI 60 | 0.327 | 230 | 60 (30, 94) | 0.1 (0.06, 0.19) |
| Adenocar. | 0.185 | 6 | 3 (2, 8) | 0.14 (0.12, 0.15) |
| Brain | 0.216 | 22 | 14 (7, 22) | 0.18 (0.09, 0.25) |
| Colon | 0.159 | 14 | 5 (3, 12) | 0.29 (0.19, 0.42) |
| Lymphoma | 0.047 | 73 | 14 (4, 58) | 0.26 (0.18, 0.38) |
| Prostate | 0.061 | 18 | 5 (3, 14) | 0.22 (0.17, 0.43) |
| Srbct | 0.039 | 101 | 18 (11, 27) | 0.1 (0.04, 0.29) |
| Backwards elimination with random forest, s.e. = 1 | ||||
| Leukemia | 0.075 | 2 | 2 (2, 2) | 0.4 (0.32, 0.5)¹ |
| Breast 2 cl. | 0.332 | 14 | 4 (2, 7) | 0.12 (0.07, 0.17) |
| Breast 3 cl. | 0.364 | 6 | 7 (4, 14) | 0.27 (0.22, 0.31) |
| NCI 60 | 0.353 | 24 | 30 (19, 60) | 0.26 (0.17, 0.38) |
| Adenocar. | 0.207 | 8 | 3 (2, 5) | 0.06 (0.03, 0.12) |
| Brain | 0.216 | 9 | 14 (7, 22) | 0.26 (0.14, 0.46) |
| Colon | 0.177 | 3 | 3 (2, 6) | 0.36 (0.32, 0.36) |
| Lymphoma | 0.042 | 58 | 12 (5, 73) | 0.32 (0.24, 0.42) |
| Prostate | 0.064 | 2 | 3 (2, 5) | 0.9 (0.82, 0.99)¹ |
| Srbct | 0.038 | 22 | 18 (11, 34) | 0.57 (0.4, 0.88) |
| SC.s | ||||
| Leukemia | 0.062 | 82² | 46 (14, 504) | 0.48 (0.45, 0.59) |
| Breast 2 cl. | 0.326 | 31 | 55 (24, 296) | 0.54 (0.51, 0.66) |
| Breast 3 cl. | 0.401 | 2166 | 4341 (2379, 4804) | 0.84 (0.78, 0.88) |
| NCI 60 | 0.246 | 5118³ | 4919 (3711, 5243) | 0.84 (0.74, 0.92) |
| Adenocar. | 0.179 | 0 | 9 (0, 18) | NA (NA, NA) |
| Brain | 0.159 | 4177 | 1257 (295, 3483) | 0.38 (0.3, 0.5) |
| Colon | 0.122 | 15 | 22 (15, 34) | 0.8 (0.66, 0.87) |
| Lymphoma | 0.033 | 2796 | 2718 (2030, 3269) | 0.82 (0.68, 0.86) |
| Prostate | 0.089 | 4 | 3 (2, 4) | 0.72 (0.49, 0.92) |
| Srbct | 0.025 | 37⁴ | 18 (12, 40) | 0.45 (0.34, 0.61) |
| NN.vs | ||||
| Leukemia | 0.056 | 512 | 23 (4, 134) | 0.17 (0.14, 0.24) |
| Breast 2 cl. | 0.337 | 88 | 23 (4, 110) | 0.24 (0.2, 0.31) |
| Breast 3 cl. | 0.424 | 9 | 45 (6, 214) | 0.66 (0.61, 0.72) |
| NCI 60 | 0.237 | 1718 | 880 (360, 1718) | 0.44 (0.34, 0.57) |
| Adenocar. | 0.181 | 9868 | 73 (8, 1324) | 0.13 (0.1, 0.18) |
| Brain | 0.194 | 1834 | 158 (52, 601) | 0.16 (0.12, 0.25) |
| Colon | 0.158 | 8 | 9 (4, 45) | 0.57 (0.45, 0.72) |
| Lymphoma | 0.04 | 15 | 15 (5, 39) | 0.5 (0.4, 0.6) |
| Prostate | 0.081 | 7 | 6 (3, 18) | 0.46 (0.39, 0.78) |
| Srbct | 0.031 | 11 | 17 (11, 33) | 0.7 (0.66, 0.85) |
¹ Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.
² [33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate, and thus it is very difficult to obtain estimates of the error rate of their procedure.
³ [31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.
⁴ [33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. Repeating the gene selection process 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.
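The stability summaries in this table ("# Genes boot." and "Freq. genes") can be illustrated with a short Python sketch. It assumes the gene selection has already been run on the original data and on each bootstrap sample, so the inputs are just collections of gene identifiers. The function names, the toy gene names, and the quartile convention (`method="inclusive"`) are assumptions for illustration; the source does not state which quartile definition was used.

```python
from statistics import median, quantiles


def selection_frequency(original_genes, bootstrap_selections):
    """For each gene selected on the original data, the fraction of
    bootstrap samples in which that gene was selected again."""
    n_boot = len(bootstrap_selections)
    return {
        gene: sum(gene in selected for selected in bootstrap_selections) / n_boot
        for gene in original_genes
    }


def median_and_quartiles(values):
    """Return (median, 1st quartile, 3rd quartile) of a list of numbers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


# Toy data: genes selected on the full data set and on four bootstrap samples.
original = ["g1", "g2", "g3"]
boot_sets = [{"g1", "g2"}, {"g1"}, {"g1", "g3", "g4"}, {"g2", "g5"}]

freq = selection_frequency(original, boot_sets)
print(freq)  # {'g1': 0.75, 'g2': 0.5, 'g3': 0.25}

# "# Genes boot.": summary of how many genes each bootstrap run selected.
print(median_and_quartiles([len(s) for s in boot_sets]))

# "Freq. genes": summary of the per-gene selection frequencies.
print(median_and_quartiles(list(freq.values())))
```

A high "Freq. genes" value means the genes chosen on the full data tend to be chosen again under resampling. A small selected set can therefore be either very stable or very unstable, which is why both summaries are reported.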
Figure 1. Out-of-Bag (OOB) error rate vs. mtryFactor. mtryFactor is the multiplicative factor of the default mtry ($\sqrt{\text{number of genes}}$); thus, an mtryFactor of 3 means the number of genes tried at each split is $3 \cdot \sqrt{\text{number of genes}}$, and an mtryFactor of 0 means the number of genes tried was 1. The mtryFactors examined were {0, 0.05, 0.1, 0.17, 0.25, 0.33, 0.5, 0.75, 0.8, 1, 1.15, 1.33, 1.5, 2, 3, 4, 5, 6, 8, 10, 13}. Results are shown for six different values of ntree = {1000, 2000, 5000, 10000, 20000, 40000}, with nodesize = 1.
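The relation between mtryFactor and the number of genes tried at each split can be made concrete with a small Python sketch. It assumes the default mtry for classification is the square root of the number of genes, that the product is rounded down, and that the result is clamped to the range from 1 to the number of genes. The function name, the rounding, and the clamping are assumptions for illustration; the caption only states that a factor of 0 corresponds to trying 1 gene.

```python
import math


def mtry_from_factor(n_genes, mtry_factor):
    """Number of genes tried at each split for a given mtryFactor.

    Assumes mtry = floor(mtry_factor * sqrt(n_genes)), kept within
    [1, n_genes] so that a factor of 0 still tries one gene.
    """
    raw = math.floor(mtry_factor * math.sqrt(n_genes))
    return max(1, min(n_genes, raw))


# A data set with 2000 genes, as in the Colon row of the data set table.
for factor in (0, 0.05, 1, 3, 13):
    print(factor, mtry_from_factor(2000, factor))
```

With 2000 genes, the default (mtryFactor = 1) tries 44 genes per split, so the grid of factors in the figure spans from a single gene up to several hundred genes per split.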