Roman Hornung, Christoph Bernau, Caroline Truntzer, Rory Wilson, Thomas Stadler, Anne-Laure Boulesteix.
Abstract
BACKGROUND: In applications of supervised statistical learning in the biomedical field, it is necessary to assess the prediction error of the respective prediction rules. Often, data preparation steps are performed on the dataset in its entirety before training/test-set-based prediction error estimation by cross-validation (CV), an approach referred to as "incomplete CV". Whether incomplete CV can result in an optimistically biased error estimate depends on the data preparation step under consideration. Several empirical studies have investigated the extent of bias induced by performing preliminary supervised variable selection before CV. To our knowledge, however, the potential bias induced by other data preparation steps has not yet been examined in the literature. In this paper we investigate this bias for two common data preparation steps: normalization and principal component analysis for dimension reduction of the covariate space (PCA). Furthermore, we obtain preliminary results for the following steps: optimization of tuning parameters, variable filtering by variance, and imputation of missing values.
Year: 2015 PMID: 26537575 PMCID: PMC4634762 DOI: 10.1186/s12874-015-0088-9
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
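The distinction the abstract draws between incomplete and full CV can be illustrated with a minimal sketch. It uses supervised variable selection, the step for which the abstract says bias has already been studied, rather than normalization or PCA. Everything in it is an assumption made for illustration and not the authors' code: the pure-noise data, the nearest-centroid classifier, the mean-difference ranking, and the function names `make_noise_data`, `select_variables`, `nearest_centroid_error` and `cv_error`.

```python
import random
import statistics


def make_noise_data(n=40, p=200, seed=0):
    """Pure-noise data (illustrative): covariates carry no class information."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # two balanced classes
    return X, y


def select_variables(X, y, rows, p_sel):
    """Rank variables by absolute class-mean difference, using only `rows`."""
    scores = []
    for j in range(len(X[0])):
        g0 = [X[i][j] for i in rows if y[i] == 0]
        g1 = [X[i][j] for i in rows if y[i] == 1]
        scores.append((abs(statistics.fmean(g0) - statistics.fmean(g1)), j))
    return [j for _, j in sorted(scores, reverse=True)[:p_sel]]


def nearest_centroid_error(X, y, train, test, variables):
    """Fit class centroids on `train`; return the error rate on `test`."""
    centroids = {}
    for c in (0, 1):
        rows = [i for i in train if y[i] == c]
        centroids[c] = [statistics.fmean([X[i][j] for i in rows]) for j in variables]
    wrong = 0
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(variables, centroids[c]))
            for c in (0, 1)
        }
        wrong += min(dist, key=dist.get) != y[i]
    return wrong / len(test)


def cv_error(X, y, K=5, p_sel=5, full=True, seed=1):
    """K-fold CV error; `full` controls where the selection step is run."""
    all_rows = list(range(len(X)))
    order = all_rows[:]
    random.Random(seed).shuffle(order)
    folds = [order[k::K] for k in range(K)]
    errors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # Full CV: selection sees the training rows only.
        # Incomplete CV: selection sees every row, including the test fold.
        selection_rows = train if full else all_rows
        variables = select_variables(X, y, selection_rows, p_sel)
        errors.append(nearest_centroid_error(X, y, train, test, variables))
    return statistics.fmean(errors)


X, y = make_noise_data()
print("full CV error:      ", cv_error(X, y, full=True))
print("incomplete CV error:", cv_error(X, y, full=False))
```

Because the covariates are noise, no rule can do better than chance on new data. Any gap between the two printed estimates comes only from letting the selection step see the test folds.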
Overview of the datasets used in the studies on normalization and PCA. The following information is given: accession number, number of observations, number of variables, proportion of observations in the smaller class, data type
| Study | Label/acc. number | Num. of observ. | Num. of variables | Prop. smaller class | Data type | ID |
|---|---|---|---|---|---|---|
| Normalization | E-GEOD-10320 | 100 | 22283 | 0.42 | transcription | 1 |
| Normalization | E-GEOD-47552 | 74 | 32321 | 0.45 | transcription | 2 |
| Normalization | E-GEOD-25639 | 57 | 54675 | 0.46 | transcription | 3 |
| Normalization | E-GEOD-29044 | 54 | 54675 | 0.41 | transcription | 4 |
| Normalization | E-MTAB-57 | 47 | 22283 | 0.47 | transcription | 5 |
| Normalization | E-GEOD-19722 | 46 | 54675 | 0.39 | transcription | 6 |
| Normalization | E-MEXP-3756 | 40 | 54675 | 0.50 | transcription | 7 |
| Normalization | E-GEOD-34465 | 26 | 32321 | 0.35 | transcription | 8 |
| Normalization | E-GEOD-30174 | 20 | 54675 | 0.50 | transcription | 9 |
| Normalization | E-GEOD-39683 | 20 | 32321 | 0.40 | transcription | 10 |
| Normalization | E-GEOD-40744 | 20 | 20706 | 0.50 | transcription | 11 |
| Normalization | E-GEOD-46053 | 20 | 54675 | 0.40 | transcription | 12 |
| PCA | E-GEOD-37582 | 121 | 48766 | 0.39 | transcription | 13 |
| PCA | ProstatecTranscr | 102 | 12625 | 0.49 | transcription | 14 |
| PCA | GSE20189 | 100 | 22277 | 0.49 | transcription | 15 |
| PCA | E-GEOD-57285 | 77 | 27578 | 0.45 | DNA methyl. | 16 |
| PCA | E-GEOD-48153 | 71 | 23232 | 0.48 | proteomic | 17 |
| PCA | E-GEOD-42826 | 68 | 47323 | 0.24 | transcription | 18 |
| PCA | E-GEOD-31629 | 62 | 13737 | 0.35 | transcription | 19 |
| PCA | E-GEOD-33615 | 60 | 45015 | 0.35 | transcription | 20 |
| PCA | E-GEOD-39046 | 57 | 392 | 0.47 | transcription | 21 |
| PCA | E-GEOD-32393 | 56 | 27578 | 0.41 | DNA methyl. | 22 |
| PCA | E-GEOD-42830 | 55 | 47323 | 0.31 | transcription | 23 |
| PCA | E-GEOD-39345 | 52 | 22184 | 0.38 | transcription | 24 |
| PCA | GSE33205 | 50 | 22011 | 0.50 | transcription | 25 |
| PCA | E-GEOD-36769 | 50 | 54675 | 0.28 | transcription | 26 |
| PCA | E-GEOD-43329 | 48 | 887 | 0.40 | transcription | 27 |
| PCA | E-GEOD-42042 | 47 | 27578 | 0.49 | DNA methyl. | 28 |
| PCA | E-GEOD-25609 | 41 | 1145 | 0.49 | transcription | 29 |
| PCA | GSE37356 | 36 | 47231 | 0.44 | transcription | 30 |
| PCA | E-GEOD-49641 | 36 | 33297 | 0.50 | transcription | 31 |
| PCA | E-GEOD-37965 | 30 | 485563 | 0.50 | DNA methyl. | 32 |
ArrayExpress accession numbers have the prefix E-GEOD-, NCBI GEO accession numbers have the prefix GSE
Fig. 1 $\mathrm{CVIIM}_{s,n,K}$-values from variable selection study. The numbers distinguish the datasets. $p_{sel}$ denotes the number of selected variables
Estimates of global CVIIM from the variable selection study
| Number of sel. variables | | | |
|---|---|---|---|
| 5 | 0.5777 | 0.5927 | 0.6126 |
| 10 | 0.5557 | 0.5617 | 0.5505 |
| 20 | 0.3971 | 0.4706 | 0.4511 |
| p/2 | 0.2720 | 0.2702 | 0.2824 |
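The tables in this excerpt report CVIIM estimates without restating how the measure is computed. The sketch below assumes the definition is a zero-truncated relative reduction of the full CV error, 1 - e_incompl / e_full, and that the global estimate pools the errors over datasets before taking the ratio. Both assumptions should be checked against the full paper. The function names `cviim` and `global_cviim` and the input values are hypothetical.

```python
def cviim(e_incompl, e_full):
    """Assumed plug-in CVIIM: relative error reduction, truncated at zero."""
    if e_full > 0 and e_incompl < e_full:
        return 1.0 - e_incompl / e_full
    return 0.0


def global_cviim(errors_incompl, errors_full):
    """Assumed global estimate: pool errors over datasets, then take the ratio."""
    return cviim(sum(errors_incompl), sum(errors_full))


# Made-up error estimates for two datasets, for illustration only.
print(cviim(0.20, 0.25))
print(global_cviim([0.10, 0.20], [0.20, 0.20]))
```

Under this assumed definition, a value of 0 means incomplete CV gave no lower error than full CV, and larger values mean a larger share of the full CV error disappears when the preparation step is run on the whole dataset.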
Fig. 2 $\mathrm{CVIIM}_{s,n,K}$-values from normalization study. The grey lines connect the values corresponding to the same datasets. The diamonds depict the estimates of global CVIIM
Estimates of global CVIIM from the normalization study
| Normalization method | Classification method | | | |
|---|---|---|---|---|
| RMA | NSC | 0.0000 | 0.0000 | 0.0000 |
| RMA | PLS-LDA | 0.0030 | 0.0064 | 0.0000 |
| RMAglobalVSN | NSC | 0.0000 | < 0.0001 | 0.0000 |
| RMAglobalVSN | PLS-LDA | 0.0000 | 0.0030 | 0.0000 |
Fig. 3 $\mathrm{CVIIM}_{s,n,K}$-values from PCA study. The grey lines connect the values corresponding to the same datasets. The diamonds depict the estimates of global CVIIM
Estimates of global CVIIM from the PCA study
| Classification method | Number of components | | | |
|---|---|---|---|---|
| LDA | 2 | 0.0974 | 0.0805 | 0.0582 |
| LDA | 5 | 0.0397 | 0.0371 | 0.0354 |
| LDA | 10 | 0.0000 | 0.0000 | 0.0000 |
| LDA | 15 | 0.0000 | 0.0000 | 0.0000 |
| RF | 2 | 0.0855 | 0.0747 | 0.0659 |
| RF | 5 | 0.0686 | 0.0558 | 0.0516 |
| RF | 10 | 0.0907 | 0.0613 | 0.0368 |
| RF | 15 | 0.1117 | 0.0988 | 0.0794 |
Fig. 4 Dependency on CV errors in PCA study. Upper panel: $\mathrm{CVIIM}_{s,n,K}$-values versus e ()-values for all settings; Lower panel: Zero-truncated differences of e ()- and e ()-values versus e ()-values for all settings. The colors and numbers distinguish the different datasets. The filled black circles depict the respective means over the results of all settings obtained on the specific datasets
Fig. 5 Errors in PCA study. e ()- and e ()-values for all datasets and settings from the PCA study