The C1C2: A framework for simultaneous model selection and assessment
Martin Eklund, Ola Spjuth, Jarl E. S. Wikberg.
Abstract
BACKGROUND: There has been recent concern regarding the inability of predictive modeling approaches to generalize to new data. Some of the problems can be attributed to improper methods for model selection and assessment. Here, we have addressed this issue by introducing a novel and general framework, the C1C2, for simultaneous model selection and assessment. The framework relies on a partitioning of the data in order to separate model choice from model assessment in terms of the data used. Since the number of conceivable models is in general vast, it was also of interest to investigate the employment of two automatic search methods, a genetic algorithm and a brute-force method, for model choice. As a demonstration, the C1C2 was applied to simulated and real-world datasets. A penalized linear model was assumed to reasonably approximate the true relation between the dependent and independent variables, thus reducing the model choice problem to a matter of variable selection and choice of penalizing parameter. We also studied the impact of assuming prior knowledge about the number of relevant variables on model choice and generalization error estimates. The results obtained with the C1C2 were compared to those obtained by employing repeated K-fold cross-validation for choosing and assessing a model.
Year: 2008 PMID: 18761753 PMCID: PMC2556350 DOI: 10.1186/1471-2105-9-360
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The C1C2. The data partitioning in step (a) of the pseudocode separates the model choice from its assessment, which is highlighted in purple in the figure. The left side of the figure relates to steps (b) to (d) in the pseudocode, and the right side to step (e); that is, the left side relates to choosing the model and saving the parameter estimates, and the right side to assessing the model and saving the assessment results.
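To make the partitioning concrete, below is a minimal sketch of the scheme described in the caption, assuming scikit-learn ridge regression and a brute-force search over variable subsets; the function names (`choose_model`, `assess_model`) and data are illustrative, not the authors' implementation.

```python
# A minimal sketch of the Figure 1 scheme: the assessment data never enter
# the model choice in steps (b)-(d), so the error computed in step (e) is
# untouched by selection bias. All names and data here are illustrative.
import numpy as np
from itertools import combinations
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score

def choose_model(X_sel, y_sel, n_vars, lambdas):
    """Steps (b)-(d): search candidate models (variable subset, lambda)
    on the selection data only and return the best pair."""
    best, best_score = None, -np.inf
    for subset in combinations(range(X_sel.shape[1]), n_vars):
        for lam in lambdas:
            score = cross_val_score(Ridge(alpha=lam),
                                    X_sel[:, list(subset)], y_sel, cv=5).mean()
            if score > best_score:
                best, best_score = (subset, lam), score
    return best

def assess_model(X_sel, y_sel, X_ass, y_ass, subset, lam):
    """Step (e): refit the chosen model on the selection data and estimate
    its generalization error on the untouched assessment data."""
    model = Ridge(alpha=lam).fit(X_sel[:, list(subset)], y_sel)
    return np.mean((y_ass - model.predict(X_ass[:, list(subset)])) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Step (a): partition the data so model choice never sees the assessment set.
X_sel, X_ass, y_sel, y_ass = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
subset, lam = choose_model(X_sel, y_sel, n_vars=3, lambdas=[0.01, 0.1, 1.0])
print(subset, lam, assess_model(X_sel, y_sel, X_ass, y_ass, subset, lam))
```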
Datasets were simulated with varying numbers of observations n and variables p, and with observations sampled either from an orthogonal multivariate normal distribution or not, according to the following schema:
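A minimal sketch of this sampling setup follows; the correlation level rho and the dimensions are illustrative assumptions, not values taken from the paper.

```python
# A sketch of the simulated-data schema: orthogonal (identity covariance)
# versus correlated multivariate normal observations, for small and large n.
# The correlation rho = 0.5 is an assumption for illustration.
import numpy as np

def simulate(n, p, correlated, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    if correlated:
        cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)  # unit variances
    else:
        cov = np.eye(p)  # orthogonal case: identity covariance
    return rng.multivariate_normal(np.zeros(p), cov, size=n)

X_small = simulate(n=15, p=10, correlated=False)   # n = 15, uncorrelated
X_large = simulate(n=200, p=10, correlated=True)   # n = 200, correlated
```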
Coefficient estimates of model (6) with w_i, i = 1, ..., 128, as the response variable.
| Term | Estimate | Std. Error | t value | Pr(>|t|) |
| intercept | -0.01992 | 0.04606 | -0.433 | 0.6661 |
| c1c2 | -0.04337 | 0.03761 | -1.153 | 0.2511 |
| ga | 0.15683 | 0.03761 | 4.170 | 5.72e-05 |
| cor | 0.07211 | 0.03761 | 1.918 | 0.0575 |
| 15 | 0.21324 | 0.03761 | 5.670 | 9.75e-08 |
| all | 0.32754 | 0.03761 | 8.710 | 1.78e-14 |
c1c2 – the C1C2 was used (as opposed to repeated K-fold cross-validation); ga – the GA search method was used (as opposed to the brute-force search method); cor – correlated independent variables in the dataset (as opposed to uncorrelated); 15 – n = 15 observations in the dataset (as opposed to n = 200); all – no assumption regarding the number of nonzero δ (as opposed to the assumption of three δ = 1).
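Tables of this form are standard linear model summaries over the factorial design (2^5 factor combinations, four replicates each, giving 128 rows). A sketch of how such a coefficient table can be produced follows; the column names mirror the notation above ("15" and "all" are renamed n15 and all_ to be valid identifiers), and the response is a placeholder, not the paper's results.

```python
# Ordinary least squares fit of a response w on dummy-coded binary factors,
# producing an Estimate / Std. Error / t / p table like the ones above.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
# 2^5 factor combinations, each replicated 4 times -> 128 rows
levels = np.array(np.meshgrid(*[[0, 1]] * 5)).reshape(5, -1).T
design = pd.DataFrame(np.repeat(levels, 4, axis=0),
                      columns=["c1c2", "ga", "cor", "n15", "all_"])
design["w"] = rng.normal(size=len(design))  # placeholder response
fit = smf.ols("w ~ c1c2 + ga + cor + n15 + all_", data=design).fit()
print(fit.summary2().tables[1])  # Coef., Std.Err., t, P>|t|
```

With a balanced design like this, all factor coefficients share the same standard error, which matches the identical Std. Error values within each table above.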
Coefficient estimates of model (6) with w_i, i = 1, ..., 128, as the response variable.
| Term | Estimate | Std. Error | t value | Pr(>|t|) |
| intercept | 0.02864 | 0.04732 | 0.605 | 0.546181 |
| c1c2 | -0.04804 | 0.03863 | -1.244 | 0.216065 |
| ga | -0.07329 | 0.03863 | -1.897 | 0.060193 |
| cor | 0.56058 | 0.03863 | 14.510 | < 2e-16 |
| 15 | 0.14307 | 0.03863 | 3.703 | 0.000321 |
| all | 0.11504 | 0.03863 | 2.977 | 0.003506 |
See Table 1 for notation explanation.
Coefficient estimates of model (6) with w_i, i = 1, ..., 128, as the response variable.
| Term | Estimate | Std. Error | t value | Pr(>|t|) |
| intercept | 0.034642 | 0.004841 | 7.157 | 6.73e-11 |
| c1c2 | -0.024003 | 0.003952 | -6.073 | 1.47e-08 |
| ga | -0.001149 | 0.003952 | -0.291 | 0.771710 |
| cor | 0.006198 | 0.003952 | 1.568 | 0.119423 |
| 15 | 0.013469 | 0.003952 | 3.408 | 0.000888 |
| all | -0.012089 | 0.003952 | -3.059 | 0.002732 |
See Table 1 for notation explanation.
Figure 2. Generalization errors obtained with the C1C2 and repeated K-fold cross-validation. The figure shows the generalization error estimates produced using the C1C2 (blue) and repeated K-fold cross-validation (red) for all other factor combinations in model (6). The plot is based on results pooled over the four replicates for each method. The bars show the 95% confidence interval, calculated from the pooled results (the confidence intervals are shown in one direction only to avoid cluttering). The factor combinations in model (6) are coded as: ga – the GA search method was used; bf – the brute-force search method was used; uncor – orthogonal independent variables in the dataset; cor – correlated independent variables in the dataset; 15 – n = 15 observations in the dataset; 200 – n = 200 observations in the dataset; all – no assumption regarding the number of nonzero δ; 3 – three δ = 1 were assumed.
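The pooled confidence intervals described in the caption can be computed with a normal approximation; the sketch below assumes mean ± 1.96·s/√m over the pooled results, which may differ from the paper's exact construction.

```python
# A sketch of a pooled 95% confidence interval as in the Figure 2 caption.
# `errors` stands in for generalization error estimates pooled over the
# four replicates of one factor combination; the values are illustrative.
import numpy as np

def pooled_ci95(errors):
    errors = np.asarray(errors, dtype=float)
    m = errors.size
    half = 1.96 * errors.std(ddof=1) / np.sqrt(m)
    return errors.mean() - half, errors.mean() + half

print(pooled_ci95([0.81, 0.77, 0.85, 0.79]))  # illustrative values
```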
Figure 3. Cluster dendrogram of the 14 variables selected from the Selwood dataset using repeated K-fold cross-validation. Three distinct clusters can be noted (shown in red, green, and yellow rectangles). One sub-cluster can be seen within the red cluster (shown in a blue rectangle). The red and green numbers are p-values for a given cluster; they indicate how well the cluster is supported by the data (see [31] for details). + marks additional variables selected by repeated K-fold cross-validation compared to the C1C2.
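A dendrogram of this kind can be sketched with SciPy's hierarchical clustering on a correlation-based distance between variables; the cluster p-values in the figure come from a multiscale bootstrap procedure (see [31]) that is not reproduced here, and the data below are stand-ins.

```python
# A sketch of a variable-clustering dendrogram like Figure 3: average-linkage
# hierarchical clustering of 14 variables under a correlation distance.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(31, 14))                      # stand-in: 14 variables
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))  # correlation distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
dendrogram(Z, labels=[f"v{i + 1}" for i in range(14)])
plt.show()
```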
Summary of the demonstrations of the C1C2.
| Both the |
| The |
| Prior information about the number of important independent variables improves model choice but can reduce the accuracy of generalization error estimates. |
| Correlated independent variables and using the genetic algorithm worsened the model choice significantly, but not the generalization error estimates. |
| The |
n denotes the number of observations in a dataset, p the number of variables, Δ represents a given subset of the p variables, and λ the ridge regression parameter.
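In this notation, a candidate model corresponds to a pair (Δ, λ). For concreteness, a standard ridge regression formulation consistent with the abstract's penalized linear model (an assumption here, since the paper's equations are not reproduced) is

$$\hat{\beta}_{\Delta,\lambda} = \arg\min_{\beta} \left\{ \lVert y - X_{\Delta}\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2 \right\} = \left( X_{\Delta}^{\top} X_{\Delta} + \lambda I \right)^{-1} X_{\Delta}^{\top} y,$$

where $X_{\Delta}$ is the n × |Δ| design matrix restricted to the variables in Δ. Model choice then amounts to searching over pairs (Δ, λ), which is what the genetic algorithm and the brute-force method explore.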