Daniela Dunkler, Max Plischke, Karen Leffondré, Georg Heinze.
Abstract
Statistical models are simple mathematical rules derived from empirical data describing the association between an outcome and several explanatory variables. In a typical modeling situation statistical analysis often involves a large number of potential explanatory variables and frequently only partial subject-matter knowledge is available. Therefore, selecting the most suitable variables for a model in an objective and practical manner is usually a non-trivial task. We briefly revisit the purposeful variable selection procedure suggested by Hosmer and Lemeshow which combines significance and change-in-estimate criteria for variable selection and critically discuss the change-in-estimate criterion. We show that using a significance-based threshold for the change-in-estimate criterion reduces to a simple significance-based selection of variables, as if the change-in-estimate criterion is not considered at all. Various extensions to the purposeful variable selection procedure are suggested. We propose to use backward elimination augmented with a standardized change-in-estimate criterion on the quantity of interest usually reported and interpreted in a model for variable selection. Augmented backward elimination has been implemented in a SAS macro for linear, logistic and Cox proportional hazards regression. The algorithm and its implementation were evaluated by means of a simulation study. Augmented backward elimination tends to select larger models than backward elimination and approximates the unselected model up to negligible differences in point estimates of the regression coefficients. On average, regression coefficients obtained after applying augmented backward elimination were less biased relative to the coefficients of correctly specified models than after backward elimination. 
In summary, we propose augmented backward elimination as a reproducible variable selection algorithm that gives the analyst more flexibility in adapting model selection to a specific statistical modeling situation.
Year: 2014 PMID: 25415265 PMCID: PMC4240713 DOI: 10.1371/journal.pone.0113677
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Brief outline of the augmented backward elimination procedure.
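The procedure can be illustrated with a minimal sketch for linear regression. It is not the SAS macro described in the abstract, and it rests on several assumptions made only for illustration: the change-in-estimate is obtained by refitting the reduced model, it is monitored for a single exposure coefficient, it is standardized by SD(exposure)/SD(outcome), p-values use a normal approximation, and the defaults `alpha=0.2` and `tau=0.05` as well as all function and variable names are hypothetical.

```python
import math
import random
import statistics


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [v / pv for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def ols(X, y):
    """Ordinary least squares; returns coefficients and their standard errors."""
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(X, y))
    sigma2 = rss / (n - p)
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(p)]
    return beta, se


def abe_linear(columns, y, exposure, alpha=0.2, tau=0.05):
    """Simplified ABE for linear regression; the exposure is never dropped."""
    def fit(variables):
        X = [[1.0] + [columns[v][i] for v in variables] for i in range(len(y))]
        beta, se = ols(X, y)
        return dict(zip(variables, beta[1:])), dict(zip(variables, se[1:]))

    # Assumed standardization: change in exposure coefficient * SD(exposure) / SD(y).
    scale = statistics.stdev(columns[exposure]) / statistics.stdev(y)
    current = list(columns)
    while True:
        beta, se = fit(current)
        # Two-sided Wald p-values (normal approximation to the t distribution).
        pvals = {v: math.erfc(abs(beta[v] / se[v]) / math.sqrt(2))
                 for v in current if v != exposure}
        # Candidates for removal, least significant first.
        candidates = sorted((v for v in pvals if pvals[v] > alpha),
                            key=lambda v: pvals[v], reverse=True)
        for v in candidates:
            reduced = [u for u in current if u != v]
            beta_reduced, _ = fit(reduced)
            change = abs(beta_reduced[exposure] - beta[exposure]) * scale
            if change < tau:  # dropping v barely moves the exposure estimate
                current = reduced
                break
        else:
            return current  # no candidate could be dropped


# Simulated toy data: x is the exposure, z a confounder, w pure noise.
random.seed(1)
n = 200
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.5 * xi + random.gauss(0, 1) for xi in x]
w = [random.gauss(0, 1) for _ in range(n)]
y = [xi + 2 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]
data = {"x": x, "z": z, "w": w}
print(abe_linear(data, y, "x"))                # augmented backward elimination
print(abe_linear(data, y, "x", tau=math.inf))  # plain backward elimination
```

In this sketch, setting `tau=math.inf` switches off the change-in-estimate check, so the function reduces to plain backward elimination, while `tau=0` or `alpha=1` keeps every candidate variable.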
Simulation study: bias and root mean squared error (RMSE) of regression coefficients of a continuous exposure variable in unselected models, models selected by backward elimination (BE) and models selected by augmented backward elimination (ABE) for linear, logistic and Cox regression.
| VIF | | Variable selection | Bias (×100) | RMSE (×100) | Selected models (%) | | |
| | | | | | Biased | Correct | Inflated |
| Linear regression | | | | | | | |
| 2 | 0 | No selection | 1 | 24 | | | 100 |
| 2 | 0 | BE | 3 | 21 | 33 | 35 | 32 |
| 2 | 0 | ABE | 2 | 22 | 28 | 34 | 38 |
| 2 | 1 | No selection | 1 | 24 | | | 100 |
| 2 | 1 | BE | 3 | 21 | 33 | 35 | 32 |
| 2 | 1 | ABE | 2 | 21 | 29 | 35 | 36 |
| 4 | 0 | No selection | 1 | 57 | | | 100 |
| 4 | 0 | BE | 6 | 50 | 39 | 34 | 27 |
| 4 | 0 | ABE | 2 | 56 | 25 | 14 | 61 |
| 4 | 1 | No selection | 1 | 57 | | | 100 |
| 4 | 1 | BE | 6 | 50 | 40 | 34 | 27 |
| 4 | 1 | ABE | 2 | 56 | 28 | 17 | 56 |
| Logistic regression | | | | | | | |
| 2 | 0 | No selection | 1 | 20 | | | 100 |
| 2 | 0 | BE | 1 | 17 | 7 | 48 | 45 |
| 2 | 0 | ABE | 1 | 20 | 1 | 4 | 95 |
| 2 | 1 | No selection | 6 | 25 | | | 100 |
| 2 | 1 | BE | 5 | 21 | 11 | 43 | 46 |
| 2 | 1 | ABE | 6 | 25 | 2 | 3 | 95 |
| 4 | 0 | No selection | 2 | 47 | | | 100 |
| 4 | 0 | BE | 4 | 38 | 12 | 46 | 42 |
| 4 | 0 | ABE | 2 | 47 | 1 | 2 | 97 |
| 4 | 1 | No selection | 6 | 55 | | | 100 |
| 4 | 1 | BE | 7 | 46 | 17 | 41 | 43 |
| 4 | 1 | ABE | 6 | 55 | 1 | 1 | 98 |
| Cox regression | | | | | | | |
| 2 | 0 | No selection | −1 | 22 | | | 100 |
| 2 | 0 | BE | 1 | 19 | 20 | 40 | 40 |
| 2 | 0 | ABE | −1 | 22 | 7 | 13 | 80 |
| 2 | 1 | No selection | 2 | 23 | | | 100 |
| 2 | 1 | BE | 3 | 20 | 19 | 40 | 41 |
| 2 | 1 | ABE | 2 | 23 | 7 | 13 | 80 |
| 4 | 0 | No selection | −2 | 52 | | | 100 |
| 4 | 0 | BE | 4 | 47 | 26 | 36 | 38 |
| 4 | 0 | ABE | −2 | 52 | 6 | 5 | 89 |
| 4 | 1 | No selection | 1 | 52 | | | 100 |
| 4 | 1 | BE | 6 | 46 | 28 | 36 | 36 |
| 4 | 1 | ABE | 1 | 52 | 7 | 4 | 89 |
Abbreviations and symbols: , significance threshold; ABE, augmented backward elimination; BE, backward elimination; RMSE, root mean squared error; , change-in-estimate threshold; VIF, variance inflation factor of conditional on . Sample size, subjects; Number of simulations, ; Variable selection based on six continuous candidate adjustment variables, among which three are truly associated with the outcome and five are correlated with the exposure; ‘Biased’, at least one variable from the true model was not selected; ‘Correct’, selected set of variables matches the true model; ‘Inflated’, selected set of variables contains all variables of the true model and at least one further variable. Full details on the simulation setup are contained in the Methods section.
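The 'Biased', 'Correct' and 'Inflated' categories and the bias and RMSE summaries used in this table can be written down compactly. The following is a minimal sketch; the function names and the variable names `z1` to `z6` are hypothetical and are not taken from the simulation code of the study.

```python
import math


def classify_selection(selected, true_vars):
    """Classify a selected variable set relative to the true model."""
    selected, true_vars = set(selected), set(true_vars)
    if not true_vars <= selected:
        return "Biased"    # at least one true variable was not selected
    if selected == true_vars:
        return "Correct"   # selected set matches the true model
    return "Inflated"      # all true variables plus at least one further variable


def bias_and_rmse(estimates, true_value):
    """Bias and root mean squared error of simulated coefficient estimates."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse


true_model = {"z1", "z2", "z3"}
print(classify_selection({"z1", "z2"}, true_model))
print(classify_selection({"z1", "z2", "z3"}, true_model))
print(classify_selection({"z1", "z2", "z3", "z4", "z5", "z6"}, true_model))

bias, rmse = bias_and_rmse([0.9, 1.1, 1.3], true_value=1.0)
print(round(100 * bias), round(100 * rmse))  # the table reports values multiplied by 100
```

Under these definitions a model fitted without selection always contains the true variables together with the remaining candidates, so it necessarily falls into the 'Inflated' category.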
Urine osmolarity example: demographic and clinical characteristics of all 245 patients at baseline.
| Characteristic | Median (1st, 3rd quartile) or Mean (SD) or n (%) |
| UOSM (mosm/L) | 510.1 (417.2, 620.6) |
| Creatinine clearance (ml/min) | 46.4 (29.9, 78.8) |
| Proteinuria (g/L) | 1.0 (0.4, 2.5) |
| Mean arterial pressure (mmHg) | 97.7 (7.8) |
| Age (years) | 54.6 (15.3) |
| Male gender | 139 (56.7%) |
| Polycystic kidney disease | 16 (6.5%) |
| Beta-blockers | 116 (47.4%) |
| Diuretics | 115 (46.9%) |
| ACEI/ARBs | 206 (84.1%) |
| Log2 of UOSM | 9.0 (0.5) |
| Log2 of creatinine clearance | 5.6 (0.9) |
| Log2 of proteinuria | 0.0 (1.6) |
Depending on the scale of the characteristic and its distribution either the median (1st, 3rd quartile), mean (standard deviation SD), or absolute number n (percentage) is given.
Abbreviations: ACEI/ARBs, use of angiotensin-converting enzyme inhibitors and Angiotensin II type 1 receptor blockers; SD, standard deviation; UOSM, urine osmolarity.
Urine osmolarity example: final models selected by backward elimination (BE) with a significance threshold , augmented backward elimination (ABE) with and a change-in-estimate threshold , and unselected model (No selection).
| Parameter | BE HR (95% CI), p | Bootstrap inclusion frequencies | ABE HR (95% CI), p | Bootstrap inclusion frequencies | No selection HR (95% CI), p |
| Log2 of UOSM | 2.03 (1.11, 3.71), 0.021 | | 2.05 (1.13, 3.72), 0.019 | | 1.95 (1.03, 3.72), 0.042 |
| Log2 of creatinine clearance | 0.14 (0.09, 0.21), <0.001 | 100.0% | 0.14 (0.09, 0.21), <0.001 | 100.0% | 0.13 (0.08, 0.21), <0.001 |
| Log2 of proteinuria | 1.88 (1.61, 2.19), <0.001 | 100.0% | 1.94 (1.64, 2.29), <0.001 | 100.0% | 1.90 (1.60, 2.25), <0.001 |
| Polycystic kidney disease | 2.94 (1.50, 5.80), 0.002 | 93.1% | 2.98 (1.51, 5.88), 0.002 | 94.3% | 2.95 (1.48, 5.87), 0.002 |
| Beta-blockers | 1.57 (1.02, 2.44), 0.042 | 74.2% | 1.58 (1.02, 2.446), 0.040 | 77.1% | 1.55 (0.99, 2.42), 0.057 |
| Diuretics | 1.41 (0.91, 2.16), 0.122 | 60.6% | 1.45 (0.94, 2.24), 0.096 | 66.3% | 1.49 (0.94, 2.38), 0.092 |
| ACEI/ARBs | | 35.2% | 0.71 (0.35, 1.45), 0.344 | 47.7% | 0.69 (0.33, 1.42), 0.310 |
| Age (in decades) | | 36.9% | | 45.9% | 0.96 (0.83, 1.11), 0.593 |
| Male gender | | 33.0% | | 40.4% | 1.14 (0.72, 1.83), 0.577 |
| Mean arterial pressure | | 30.1% | | 36.8% | 1.01 (0.97, 1.04), 0.730 |
Urine osmolarity UOSM, the exposure of main interest, is included in all models. The initial set of adjustment variables for these models was selected by the disjunctive cause criterion. Hazard ratios (HR), confidence limits (CI) and p-values are given. Model stability was evaluated by bootstrap inclusion frequencies (based on bootstrap resamples). UOSM, creatinine clearance, and proteinuria were log2-transformed and therefore, corresponding hazard ratios are per doubling of each variable.
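Because the hazard ratios are per doubling, they can be rescaled to other fold changes, and the reported confidence limits carry enough information to recover an approximate standard error and p-value. The sketch below assumes Wald-type 95% confidence intervals on the log hazard scale; the helper names are hypothetical.

```python
import math

Z_975 = 1.959964  # 97.5th percentile of the standard normal distribution


def hr_for_fold_change(hr_per_doubling, fold):
    """Hazard ratio for a `fold`-fold increase of a log2-transformed covariate."""
    return hr_per_doubling ** math.log2(fold)


def wald_summary(hr, lower, upper):
    """Recover the log-HR, its standard error and a two-sided Wald p-value."""
    beta = math.log(hr)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_975)
    p = math.erfc(abs(beta / se) / math.sqrt(2))
    return beta, se, p


# Log2 of UOSM under backward elimination, as reported: 2.03 (1.11, 3.71)
beta, se, p = wald_summary(2.03, 1.11, 3.71)
print(round(beta, 3), round(se, 3), round(p, 3))

# Two doublings (a four-fold increase) correspond to the squared hazard ratio.
print(round(hr_for_fold_change(2.03, 4), 2))
```

Applied to the backward elimination estimate for log2 of UOSM, the recovered p-value agrees with the reported 0.021 up to rounding of the published confidence limits.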
Abbreviations and symbols: , significance threshold; ABE, augmented backward elimination; ACEI/ARBs, use of angiotensin-converting enzyme inhibitors and Angiotensin II type 1 receptor blockers; BE, backward elimination; CI, confidence interval; HR, hazard ratio; , change-in-estimate threshold; UOSM, urine osmolarity (mosm/L).
Figure 2. Urine osmolarity example: selection path (left column) of standardized regression coefficients and model stability (inclusion frequencies) in bootstrap resamples (right column) for backward elimination (BE) and augmented backward elimination (ABE).
First row: BE with ; second row: ABE with and ; third row: ABE with and . Abbreviations: ABE, augmented backward elimination; BE, backward elimination; log2UOsm, log2 of urine osmolarity; log2CCL, log2 of creatinine clearance; log2Prot, log2 of proteinuria; BBlock, use of beta-blockers; PKD, presence of polycystic kidney disease; Diur, use of diuretics; Age, age in decades; ACEI, use of angiotensin-converting enzyme inhibitors and Angiotensin II type 1 receptor blockers; MAP, mean arterial pressure.
Figure 3. Urine osmolarity example: number of selected variables in the final models of bootstrap resamples for backward elimination (BE) with and augmented backward elimination (ABE) with and .
The highlighted bars indicate the number of selected variables in the original sample. Abbreviations and symbols: , significance threshold; ABE, augmented backward elimination; BE, backward elimination; , change-in-estimate threshold.
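Bootstrap inclusion frequencies and the distribution of model sizes can both be obtained from one resampling loop that repeats the whole selection procedure on every resample. The following is a minimal sketch under stated assumptions: `select` stands for any selection routine (for example BE or ABE), and `toy_select` is a made-up selector used only to make the example runnable.

```python
import random
from collections import Counter


def bootstrap_selection_summary(rows, select, n_boot=1000, seed=0):
    """Repeat a selection procedure on bootstrap resamples of `rows`.

    `select` is a hypothetical callable that takes a list of rows and returns
    the names of the selected variables.
    """
    rng = random.Random(seed)
    inclusion = Counter()    # how often each variable is selected
    model_sizes = Counter()  # how often each model size occurs
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]  # n rows drawn with replacement
        chosen = set(select(resample))
        inclusion.update(chosen)
        model_sizes[len(chosen)] += 1
    frequencies = {v: 100.0 * k / n_boot for v, k in inclusion.items()}
    return frequencies, dict(model_sizes)


def toy_select(rows):
    """Illustration only: always keep "a"; keep "b" if its resampled mean is positive."""
    mean_b = sum(r["b"] for r in rows) / len(rows)
    return ["a", "b"] if mean_b > 0 else ["a"]


rows = [{"a": 1.0, "b": b} for b in (-2.0, -1.0, 0.5, 1.0, 2.0)]
freq, sizes = bootstrap_selection_summary(rows, toy_select, n_boot=500)
print(freq)   # inclusion frequency (%) per variable
print(sizes)  # number of resamples per model size
```

A variable that is forced into every model, such as the exposure, would have an inclusion frequency of 100% by construction.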
Urine osmolarity example: incorporating model uncertainty into standard error (SE) estimates of urine osmolarity UOSM.
| Selection algorithm | Model-based SE | Robust SE | Bootstrap SE |
| BE with | 0.307 | 0.346 | 0.400 |
| ABE with | 0.305 | 0.340 | 0.400 |
Model-based standard error, robust standard error and standard error based on bootstrap resamples for models selected with backward elimination (BE) and augmented backward elimination (ABE).
Abbreviations and symbols: , significance threshold; ABE, augmented backward elimination; BE, backward elimination; , change-in-estimate threshold; SE, standard error; UOSM, urine osmolarity (mosm/L).
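A bootstrap standard error of this kind is the standard deviation of the estimate across resamples, with the complete analysis rerun on each resample so that selection uncertainty is reflected. The sketch below is a generic illustration under that assumption; `estimate` is a hypothetical callable that would wrap variable selection plus model fitting, and the sample mean is used here only as a stand-in.

```python
import random
import statistics


def bootstrap_se(rows, estimate, n_boot=1000, seed=0):
    """Standard deviation of `estimate` over bootstrap resamples of `rows`."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(rows) for _ in rows]  # n rows drawn with replacement
        # In a real analysis, selection and fitting would both be repeated here.
        estimates.append(estimate(resample))
    return statistics.stdev(estimates)


values = [float(v) for v in range(100)]
se_boot = bootstrap_se(values, statistics.mean)
se_model = statistics.stdev(values) / len(values) ** 0.5  # textbook SE of a mean
print(round(se_boot, 2), round(se_model, 2))
```

For a simple estimator such as the mean the two values are close; a larger bootstrap SE, as in the table above, would indicate variability that the model-based SE does not capture.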