Anne-Laure Boulesteix, Carolin Strobl.
Abstract
BACKGROUND: In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias.
Year: 2009 PMID: 20025773 PMCID: PMC2813849 DOI: 10.1186/1471-2288-9-85
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Summary of the considered candidate classifiers
| Method | Type | Number of genes p* | Function | Fixed parameters | Parameters tuned via CV |
|---|---|---|---|---|---|
| KNN | 1 | 20, 50, 100, 200, 500 | knnCMA | k = 1, 3, 5 | |
| LDA | 2 | 10, 20 | ldaCMA | | |
| FDA | 2 | 10, 20 | fdaCMA | | |
| DLDA | 3 | 20, 50, 100, 200, 500 | dldaCMA | | |
| PLSLDA | 3 | 20, 50, 100, 200, 500 | plsldaCMA | ncomp = 2, 3 | |
| NNET | 3 | 20, 50, 100, 200, 500 | nnetCMA | | |
| RF | 4 | | rfCMA | mtry | |
| linear SVM | 4 | | svmCMA | | cost |
| PAM | 4 | | pamCMA | | shrinkage parameter |
| Penalized logistic regression | 4 | | plrCMA | | penalty |
Column 1: Acronym of the method. Column 2: Type of the method regarding preliminary variable selection. Column 3: Number of selected genes p* (if preliminary variable selection is performed). Column 4: Name of the function in the CMA package. Column 5: Name and values of the fixed parameters. Column 6: Name of the parameters tuned using internal 3-fold cross-validation.
30] and the absolute value of the normalized two-sample Wilcoxon statistic. Note that further methods could have been considered, such as the "traditional" Golub criterion or more sophisticated multivariate approaches, e.g. based on random forests [24]. In our experiment, however, we focus on the most standard approaches, because we consider it realistic that a statistician who wants to try a large number of procedures would prefer those that are freely available or easy to implement, computationally efficient and conceptually simple.
Formula yielding a total of 124 classifiers
| Method | Number of classifiers | |
|---|---|---|
| KNN | 5 values of p* × 3 gene selection criteria × 3 values of k | + |
| LDA | 2 values of p* × 3 gene selection criteria | + |
| FDA | 2 values of p* × 3 gene selection criteria | + |
| DLDA | 5 values of p* × 3 gene selection criteria | + |
| PLSLDA | 5 values of p* × 3 gene selection criteria × 2 values of ncomp | + |
| NNET | 5 values of p* × 3 gene selection criteria | + |
| RF | 4 values of mtry | + |
| SVM | 1 | + |
| PAM | 1 | + |
| Penalized logistic regression | 1 | = |
| Total | 124 classifiers | |
This table explains how we obtain a total of 124 classifiers.
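The count can be checked with a few lines of Python. This is only a sketch of the arithmetic: the factor labels (`p_star`, `selection`, `k`, `ncomp`, `mtry`) are descriptive names chosen here, and the tenth method is assumed to be the one listed under the CMA function `plrCMA` in the summary of candidate classifiers.

```python
from math import prod

# Number of settings per factor for each method, read off the table.
# The factor labels are illustrative names, not CMA arguments.
GRID = {
    "KNN":    {"p_star": 5, "selection": 3, "k": 3},
    "LDA":    {"p_star": 2, "selection": 3},
    "FDA":    {"p_star": 2, "selection": 3},
    "DLDA":   {"p_star": 5, "selection": 3},
    "PLSLDA": {"p_star": 5, "selection": 3, "ncomp": 2},
    "NNET":   {"p_star": 5, "selection": 3},
    "RF":     {"mtry": 4},
    "SVM":    {},   # no grid: a single classifier
    "PAM":    {},   # no grid: a single classifier
    "plrCMA": {},   # assumed tenth method, a single classifier
}


def count_classifiers(grid):
    """Multiply the factor sizes per method; an empty grid counts as 1."""
    return {method: prod(factors.values()) for method, factors in grid.items()}


counts = count_classifiers(GRID)
total = sum(counts.values())
print(counts)
print(total)
```

Methods with a preliminary gene selection step dominate the total, because each of them is multiplied by both the number of values of p* and the three selection criteria.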
Results of the real data study, colon data
| Data set | Method | Parameter | p* = 10 | p* = 20 | p* = 50 | p* = 100 | p* = 200 | p* = 500 | p* = 2000 |
|---|---|---|---|---|---|---|---|---|---|
| Colon | KNN | | - | 0.19 | 0.21 | 0.24 | 0.18 | 0.23 | - |
| Colon | KNN | | - | 0.18 | 0.15 | 0.16 | 0.16 | 0.16 | - |
| Colon | KNN | | - | 0.18 | 0.16 | 0.19 | 0.15 | 0.13 | - |
| Colon | LDA | | 0.19 | 0.21 | - | - | - | - | - |
| Colon | FDA | | 0.18 | 0.21 | - | - | - | - | - |
| Colon | DLDA | | - | 0.15 | 0.16 | 0.13 | 0.18 | 0.24 | - |
| Colon | PLSLDA | | - | 0.16 | 0.16 | 0.16 | 0.18 | 0.16 | - |
| Colon | PLSLDA | | - | 0.18 | 0.16 | 0.13 | 0.16 | 0.18 | - |
| Colon | NNET | | - | 0.35 | 0.35 | 0.34 | 0.37 | 0.34 | - |
| Colon | RF | | - | - | - | - | - | - | 0.18 |
| Colon | RF | | - | - | - | - | - | - | 0.18 |
| Colon | RF | | - | - | - | - | - | - | 0.18 |
| Colon | RF | | - | - | - | - | - | - | 0.18 |
| Colon | SVM | | - | - | - | - | - | - | 0.13 |
| Colon | PAM | | - | - | - | - | - | - | 0.11 |
| Colon | Penalized logistic regression | | - | - | - | - | - | - | 0.18 |
| Prostate | KNN | | - | 0.12 | 0.15 | 0.14 | 0.10 | 0.12 | - |
| Prostate | KNN | | - | 0.08 | 0.08 | 0.07 | 0.07 | 0.09 | - |
| Prostate | KNN | | - | 0.07 | 0.09 | 0.08 | 0.09 | 0.10 | - |
| Prostate | LDA | | 0.08 | 0.08 | - | - | - | - | - |
| Prostate | FDA | | 0.08 | 0.08 | - | - | - | - | - |
| Prostate | DLDA | | - | 0.10 | 0.13 | 0.13 | 0.19 | 0.24 | - |
| Prostate | PLSLDA | | - | 0.06 | 0.08 | 0.06 | 0.08 | 0.08 | - |
| Prostate | PLSLDA | | - | 0.08 | 0.06 | 0.06 | 0.08 | 0.07 | - |
| Prostate | NNET | | - | 0.10 | 0.12 | 0.10 | 0.15 | 0.20 | - |
| Prostate | RF | | - | - | - | - | - | - | 0.08 |
| Prostate | RF | | - | - | - | - | - | - | 0.08 |
| Prostate | RF | | - | - | - | - | - | - | 0.08 |
| Prostate | RF | | - | - | - | - | - | - | 0.08 |
| Prostate | SVM | | - | - | - | - | - | - | 0.10 |
| Prostate | PAM | | - | - | - | - | - | - | 0.21 |
| Prostate | Penalized logistic regression | | - | - | - | - | - | - | 0.09 |
CV error rates obtained with different methods for the colon data set (top) [17] and the prostate data (bottom) [18], with variable selection (if any) based on the t-statistic.
Results of the permutation study
| Data set | Method | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|
| Colon | KNN | 0.33 | 0.36 | 0.37 | 0.38 | 0.41 | 0.45 |
| Colon | LDA | 0.40 | - | 0.43 | 0.43 | - | 0.46 |
| Colon | FDA | 0.42 | - | 0.44 | 0.47 | - | 0.48 |
| Colon | DLDA | 0.36 | - | 0.41 | 0.42 | - | 0.44 |
| Colon | PLSLDA | 0.34 | 0.35 | 0.37 | 0.37 | 0.42 | 0.43 |
| Colon | NNET | 0.34 | - | 0.35 | 0.35 | - | 0.36 |
| Colon | RF | 0.40 | 0.40 | - | - | - | 0.42 |
| Colon | SVM | 0.37 | - | - | - | - | 0.37 |
| Colon | PAM | 0.36 | - | - | - | - | 0.36 |
| Colon | Penalized logistic regression | 0.44 | - | - | - | - | 0.44 |
| Colon | All classifiers | 0.31 | 0.32 | 0.33 | 0.33 | 0.34 | 0.43 |
| Prostate | KNN | 0.43 | 0.45 | 0.45 | 0.47 | 0.50 | 0.52 |
| Prostate | LDA | 0.46 | - | 0.47 | 0.50 | - | 0.51 |
| Prostate | FDA | 0.45 | - | 0.47 | 0.49 | - | 0.49 |
| Prostate | DLDA | 0.46 | - | 0.49 | 0.49 | - | 0.51 |
| Prostate | PLSLDA | 0.44 | 0.46 | 0.47 | 0.49 | 0.51 | 0.52 |
| Prostate | NNET | 0.46 | - | 0.49 | 0.47 | - | 0.52 |
| Prostate | RF | 0.52 | 0.54 | - | - | - | 0.54 |
| Prostate | SVM | 0.57 | - | - | - | - | 0.57 |
| Prostate | PAM | 0.54 | - | - | - | - | 0.54 |
| Prostate | Penalized logistic regression | 0.52 | - | - | - | - | 0.52 |
| Prostate | All classifiers | 0.41 | 0.42 | 0.43 | 0.44 | 0.46 | 0.52 |
Colon data set [17] and prostate data set [18], with variable selection (if any) based on the t-statistic. The tuning parameter values are k = 1, 3, 5 for KNN, ncomp = 2, 3 for PLSLDA, and four values of mtry for RF. For approaches A to E, the reported value is the median over the 20 runs.
Approach A: Minimal error rate over the different tuning parameter values, different numbers of genes and different gene selection methods.
Approach B: Minimal error rate over the different numbers of genes and different gene selection methods.
Approach C: Minimal error rate over the different tuning parameter values and different gene selection methods.
Approach D: Minimal error rate over the different tuning parameter values and different numbers of genes.
Approach E: Minimal error rate over the different tuning parameter values.
Approach F: Median of all 124 × 20 calculated error rates.
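The gap between the minimum-based summaries and approach F can be illustrated with a small simulation. The sketch below does not use the colon or prostate data: it assumes 124 classifiers that all guess at random and independently of each other on 62 samples (both numbers are illustrative defaults, and real classifiers fitted to the same data would be correlated), so every true error rate is 50%.

```python
import random
import statistics


def simulate_error_rates(n_classifiers=124, n_runs=20, n_samples=62, seed=0):
    """Error rates of pure-guessing classifiers: one list per run.

    Assumption: each classifier misclassifies each sample with
    probability 0.5, independently of all other classifiers.
    """
    rng = random.Random(seed)
    return [
        [
            sum(rng.random() < 0.5 for _ in range(n_samples)) / n_samples
            for _ in range(n_classifiers)
        ]
        for _ in range(n_runs)
    ]


def min_then_median(errors):
    """Minimum over classifiers within each run, then median over runs."""
    return statistics.median(min(run) for run in errors)


def overall_median(errors):
    """Median of all calculated error rates (all classifiers, all runs)."""
    return statistics.median(e for run in errors for e in run)


errors = simulate_error_rates()
selected = min_then_median(errors)    # reporting only the best classifier
unselected = overall_median(errors)   # no selection at all
print(f"median of per-run minima: {selected:.3f}")
print(f"median of all error rates: {unselected:.3f}")
```

Under these assumptions the median of all error rates stays close to 0.5, while the median of the per-run minima falls clearly below it, even though no classifier is better than chance.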
Figure 1. Permutation-based analyses. Alon's colon cancer data (left) and Singh's prostate cancer data (right). Boxplots of the minimal error rates for the 20 permutations, and of all the error rates obtained with the 124 classifiers for the 20 permutations, i.e. 124 × 20 points (right). The three horizontal lines represent the three baseline error rates defined as follows: the error rate obtained by assigning all observations to the majority class (plain), the error rate obtained by randomly assigning N0 observations to class 0 and N1 observations to class 1 (dotted), and 50% (dashed). Main conclusion: The minimal error rate is much lower than all three baseline error rates, and a large part of this bias is due to the optimal selection of the classification method.
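The three baselines in the caption can be written as simple functions of the class sizes. In the sketch below, the second baseline is computed as the expected error rate 2·N0·N1/N², which follows if the N0 and N1 labels are handed out by a uniformly random permutation (an assumption about how the random assignment is done). The class sizes in the usage line are illustrative, not taken from the text.

```python
def baseline_error_rates(n0, n1):
    """Three reference error rates for a two-class problem.

    n0, n1: number of observations in class 0 and class 1.
    """
    n = n0 + n1
    # Assign everything to the larger class: the smaller class is misclassified.
    majority = min(n0, n1) / n
    # Random labels with fixed class sizes (assumed: uniform random permutation):
    # P(true 0, assigned 1) + P(true 1, assigned 0) = 2 * (n0/n) * (n1/n).
    random_fixed_sizes = 2 * n0 * n1 / n**2
    return {
        "majority": majority,
        "random_fixed_sizes": random_fixed_sizes,
        "coin_flip": 0.5,
    }


# Illustrative class sizes only.
rates = baseline_error_rates(n0=22, n1=40)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

For any class sizes the three values are ordered majority ≤ random assignment ≤ 50%, with equality only for balanced classes, so a minimal error rate below the majority baseline is below all three.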
Figure 2. Subsample analyses. Alon's colon cancer data (left) and Singh's prostate cancer data (right). Boxplots of the minimal error rate over the 124 classifiers for each subsample size (each boxplot corresponds to 20 error rate estimates). Main conclusion: The median minimal error rate does not seem to increase with decreasing sample size.