Mariella Gregorich, Susanne Strohmaier, Daniela Dunkler, Georg Heinze.
Abstract
Regression models have been in use for decades to explore and quantify the association between a dependent response and several independent variables in environmental sciences, epidemiology and public health. However, researchers often encounter situations in which some independent variables exhibit high bivariate correlation, or may even be collinear. Improper statistical handling of this situation will most certainly generate models of little or no practical use and misleading interpretations. By means of two example studies, we demonstrate how diagnostic tools for collinearity or near-collinearity may fail in guiding the analyst. Instead, the most appropriate way of handling collinearity should be driven by the research question at hand and, in particular, by the distinction between predictive or explanatory aims.
Keywords: collinearity; correlated predictors; exposure-response association; multivariable modelling; nonlinear effects
Year: 2021 PMID: 33920501 PMCID: PMC8073086 DOI: 10.3390/ijerph18084259
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Illustrative extract of the data of the study of Ratzinger et al. [11]. White blood cells consist of the five subtypes neutrophils, eosinophils, basophils, lymphocytes, and monocytes and, hence, their sum equals the white blood cell count. See Supplementary Materials for the full table and the analysis.
| Observation | Neutrophils (G/L) | Eosinophils (G/L) | Basophils (G/L) | Lymphocytes (G/L) | Monocytes (G/L) | White Blood Cell Count (G/L) | C-Reactive Protein (mg/dL) |
|---|---|---|---|---|---|---|---|
| 1 | 11.5 | 0.0 | 0.1 | 0.6 | 1.1 | 13.3 | 15.99 |
| 2 | 13.9 | 0.0 | 0.0 | 3.0 | 3.3 | 20.2 | 13.27 |
| 3 | 13.0 | 0.2 | 0.0 | 0.2 | 1.1 | 14.5 | 14.99 |
| 4 | 11.0 | 0.1 | 0.0 | 0.6 | 0.8 | 12.5 | 9.93 |
| 5 | 10.1 | 0.0 | 0.0 | 0.6 | 0.8 | 11.5 | 16.70 |
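The exact collinearity described in the caption can be checked directly on the five rows shown above. The following is a minimal sketch using only this illustrative extract, not the full study data; the variable names (`subtypes`, `wbc`, `v`) are chosen here for illustration. It verifies that the subtype counts sum to the white blood cell count, and that the design matrix containing all six columns therefore has a nonzero vector mapped to zero, which is what makes a regression on all six variables non-identifiable.

```python
import numpy as np

# Rows 1-5 of the extract above (G/L); columns are
# neutrophils, eosinophils, basophils, lymphocytes, monocytes.
subtypes = np.array([
    [11.5, 0.0, 0.1, 0.6, 1.1],
    [13.9, 0.0, 0.0, 3.0, 3.3],
    [13.0, 0.2, 0.0, 0.2, 1.1],
    [11.0, 0.1, 0.0, 0.6, 0.8],
    [10.1, 0.0, 0.0, 0.6, 0.8],
])
wbc = np.array([13.3, 20.2, 14.5, 12.5, 11.5])

# The five subtypes add up to the white blood cell count in every row.
sums_match = np.allclose(subtypes.sum(axis=1), wbc)

# Design matrix with all six variables: the combination
# (sum of subtypes) - WBC is zero, so the columns are linearly dependent.
X = np.column_stack([subtypes, wbc])
v = np.array([1.0, 1.0, 1.0, 1.0, 1.0, -1.0])
columns_dependent = np.allclose(X @ v, 0.0)

print(sums_match, columns_dependent)
```

Because any multiple of `v` can be added to the coefficient vector without changing the fitted values, at least one of the six variables has to be dropped or re-expressed before the model can be estimated.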
Figure 1. COVID-19 study: scatterplot of the square roots of GGO and consolidation by severity of COVID-19 disease progression.
Odds ratios with 95% confidence intervals (CI), Akaike information criteria (AIC), and the C-statistics for the two fitted univariable logistic regression models and the multivariable model including both independent variables.
| Model for Disease Severity | Independent Variable(s) | Odds Ratio (Estimate) | Odds Ratio (95% CI) | AIC | C-Index |
|---|---|---|---|---|---|
| Univariable | GGO | 1.82 | (1.55, 2.22) | 88.6 | 0.96 |
| Univariable | Consolidation | 1.94 | (1.59, 2.47) | 142.8 | 0.89 |
| Multivariable model | GGO | 1.83 | (1.48, 2.38) | 90.6 | 0.96 |
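The AIC column of the table above can be read as follows. This is a rough reading that assumes the usual definition of AIC and assumes that the parameter count k includes the intercept (the table does not state how parameters were counted); the reported values are rounded to one decimal.

```latex
\begin{aligned}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\text{GGO only } (k = 2):\quad -2\ln\hat{L} &= 88.6 - 2 \cdot 2 = 84.6 \\
\text{GGO and consolidation } (k = 3):\quad -2\ln\hat{L} &= 90.6 - 2 \cdot 3 = 84.6
\end{aligned}
```

Under these assumptions, the two models have the same deviance to one decimal place, so adding consolidation to a model that already contains GGO hardly improves the fit. This is consistent with the unchanged C-index of 0.96.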
Figure 2. (a) Annual global temperature anomalies with respect to the 20th century mean and (b) annual global carbon emissions, 1880–2014.
Figure 3. Partial effect plots of the independent variables year (left) and annual global CO2 emission (right). Shaded area illustrates the 95% confidence interval of the partial effect curve.
Some options to deal with collinearity by research aim. With ‘symptoms’, we mean typical consequences of collinearity such as inflated standard errors and unstable parameter estimates.
| Method | Explanation | Remark |
|---|---|---|
| Variable omission | Omit one of the variables involved in the collinearity | Removes the symptoms, but leads to different interpretation of the model |
| Summary score | Combine several nearly collinear variables into a summary score and include only the summary score in the regression model | Removes the symptoms, retains most of the predictive value of the model, but leads to different interpretation of the model |
| Use information criteria | Information criteria such as Akaike’s can be used to guide model building | Information criteria guide the analyst in a search for the most predictive model |
| Use causal reasoning | Specification of variables (exposure of interest, confounders) is necessitated by causal reasoning | Neither exposure nor confounders should be omitted as this violates assumptions needed to identify the causal estimand of interest |
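The 'symptoms' and the summary-score option from the table can be illustrated with the variance inflation factor (VIF), a common collinearity diagnostic. The sketch below uses simulated data, not data from either example study. The helper `vif`, the variables `x1`, `x2`, `x3`, and the choice of a simple average as the summary score are assumptions made here for illustration only.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """Variance inflation factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Simulated (hypothetical) independent variables.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)             # unrelated to x1 and x2

# Both nearly collinear variables in the model: VIFs of x1 and x2 are large.
X_full = np.column_stack([x1, x2, x3])
vifs_full = [vif(X_full, j) for j in range(X_full.shape[1])]

# Summary score replaces x1 and x2: the inflation disappears.
score = (x1 + x2) / 2.0
X_score = np.column_stack([score, x3])
vifs_score = [vif(X_score, j) for j in range(X_score.shape[1])]

print([round(v, 1) for v in vifs_full])
print([round(v, 1) for v in vifs_score])
```

The low VIFs after averaging only show that the symptoms are gone. As the table notes, the coefficient of the score answers a different question than the coefficients of `x1` and `x2` would, so whether this option is acceptable depends on the research aim.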