Sam Doerken, Marta Avalos, Emmanuel Lagarde, Martin Schumacher.
Abstract
Estimating and selecting risk factors with extremely low prevalences of exposure for a binary outcome is a challenge because classical standard techniques, notably logistic regression, often fail to provide meaningful results in such settings. While penalized regression methods are widely used in high-dimensional settings, we show that they are useful in low-dimensional settings as well. Specifically, we demonstrate that Firth correction, ridge, the lasso and boosting all improve the estimation for low-prevalence risk factors. While the methods themselves are well established, comparison studies are needed to assess their potential benefits in this context. We carry out such a comparison using the dataset of a large unmatched case-control study from France (2005-2008) on the relationship between prescription medicines and road traffic accidents, together with an accompanying simulation study. Results show that the estimation of risk factors with prevalences below 0.1% can be drastically improved by using Firth correction and boosting in particular, especially for ultra-low prevalences. When a moderate number of low-prevalence exposures is available, we recommend the use of penalized techniques.
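The failure of plain logistic regression at very low prevalences can be illustrated with a single binary exposure. The sketch below is not the authors' code and does not reproduce their multivariable models; it only uses the known fact that, for a model with one binary exposure, Firth's penalized estimate of the log odds ratio equals the estimate obtained after adding 0.5 to every cell of the 2x2 table. The function name `log_odds_ratio` and all counts are hypothetical.

```python
import math


def log_odds_ratio(a, b, c, d, correction=0.0):
    """Log odds ratio from a 2x2 table.

    a: exposed cases,   b: exposed controls,
    c: unexposed cases, d: unexposed controls.
    `correction` is added to every cell. With correction=0.5 the result
    matches Firth's penalized estimate for this single-exposure model
    (an assumption of this sketch: one binary exposure, no covariates).
    """
    a, b, c, d = (x + correction for x in (a, b, c, d))
    num, den = a * d, b * c
    if num == 0 and den == 0:
        return math.nan          # estimate is undefined
    if den == 0:
        return math.inf          # maximum likelihood estimate diverges
    if num == 0:
        return -math.inf
    return math.log(num / den)


# Hypothetical subsample of 10,000 drivers with 5 exposed drivers (0.05%),
# all of whom happen to be cases.
ml = log_odds_ratio(5, 0, 4000, 5995)                     # inf
firth = log_odds_ratio(5, 0, 4000, 5995, correction=0.5)  # finite
print(ml, firth)
```

With an empty cell the uncorrected estimate is infinite, whereas the corrected estimate stays finite, which is the behaviour that makes penalization attractive when exposed subjects are this rare.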
Year: 2019 PMID: 31107924 PMCID: PMC6527211 DOI: 10.1371/journal.pone.0217057
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Case-control data.
We fit a reference model by logistic regression on the whole cohort of 50,728 drivers. We then take 100 subsamples of 10,000 drivers, drawn without replacement, and compute logistic, Firth, ridge and lasso estimates on each. The bias (1a) and mean squared error (1b) shown for 213 risk factors are measured relative to the estimates of the reference model. Prevalence is plotted on a log scale.
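The subsampling scheme in this caption can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the cohort is synthetic, there is a single exposure instead of 213 risk factors, and a 0.5-corrected log odds ratio stands in for both the reference model and the four regression methods. The names `corrected_log_or` and `subsample_bias_mse` are hypothetical.

```python
import math
import random


def corrected_log_or(rows):
    """Stand-in estimator: log odds ratio with 0.5 added to each cell.

    `rows` is a sequence of (exposed, case) pairs with 0/1 entries.
    """
    a = b = c = d = 0.5
    for exposed, case in rows:
        if exposed and case:
            a += 1
        elif exposed:
            b += 1
        elif case:
            c += 1
        else:
            d += 1
    return math.log(a * d / (b * c))


def subsample_bias_mse(cohort, estimator, reference, n_sub, n_rep, seed=0):
    """Bias and mean squared error of `estimator` relative to `reference`,
    over `n_rep` subsamples of size `n_sub` drawn without replacement."""
    rng = random.Random(seed)
    errors = [estimator(rng.sample(cohort, n_sub)) - reference
              for _ in range(n_rep)]
    bias = sum(errors) / n_rep
    mse = sum(e * e for e in errors) / n_rep
    return bias, mse


# Synthetic cohort (assumed values): 1% exposure prevalence,
# case probability 0.6 if exposed and 0.4 otherwise.
rng = random.Random(1)
cohort = []
for _ in range(50_728):
    exposed = rng.random() < 0.01
    case = rng.random() < (0.6 if exposed else 0.4)
    cohort.append((int(exposed), int(case)))

reference = corrected_log_or(cohort)  # estimate on the whole cohort
bias, mse = subsample_bias_mse(cohort, corrected_log_or, reference,
                               n_sub=10_000, n_rep=100)
print(f"reference={reference:.3f} bias={bias:.3f} mse={mse:.3f}")
```

Passing the reference value separately keeps the evaluation loop independent of how the reference is obtained, so a different estimator could be compared against the same full-cohort fit.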
CESIR simulation: Selection rates (%) for each of the 5 relevant variables.
| Prevalence of relevant variables | | | | |
|---|---|---|---|---|
| 0.005% | 0.02% | 0.1% | 0.6% | 3% |
| 5 | 29 | 54 | 100 | 100 |
| 0 | 0 | 36 | 99 | 100 |
| 0 | 4 | 37 | 97 | 100 |
| 17 | 32 | 52 | 86 | 100 |
| 0 | 8 | 60 | 100 | 100 |
CESIR simulation: Selection rates (%) for the 45 irrelevant variables, grouped together by prevalence (9 variables per interval).
| Prevalence of irrelevant variables | | | | |
|---|---|---|---|---|
| 0.006%–0.016% | 0.02%–0.06% | 0.07%–0.22% | 0.25%–0.8% | 0.9%–3% |
| 10 | 21 | 18 | 15 | 15 |
| 0 | 1 | 7 | 11 | 11 |
| 0 | 5 | 8 | 8 | 7 |
| 24 | 35 | 36 | 33 | 36 |
| 1 | 11 | 28 | 36 | 37 |
Fig 2. Simulation data.
Each plot shows the bias for 1 relevant variable (rel.) and the 9 irrelevant variables (irr.) within the prevalence range. The boxplots are thus based on 100 values (one for each simulation study) for the relevant variables and 900 values (100 × 9) for the irrelevant variables.
Fig 3. Simulation data.
(a) Mean squared errors across the 5 relevant variables. (b) Mean squared errors across 5 groups of irrelevant variables (9 variables per group). All axes are on a logarithmic scale.