Elizabeth Handorf, Yinuo Yin, Michael Slifker, Shannon Lynch.
Abstract
BACKGROUND: Social-environmental data obtained from the US Census are an important resource for understanding health disparities, but the full dataset is rarely used for analysis. One barrier to incorporating the full data is the lack of solid recommendations for variable selection; researchers often hand-select a few variables. We therefore evaluated the ability of empirical machine learning approaches to identify social-environmental factors that have a true association with a health outcome.
Keywords: Social environment; Variable selection
Year: 2020 PMID: 33302880 PMCID: PMC7727197 DOI: 10.1186/s12874-020-01183-9
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 Simulation model
Candidate methods
| Abbreviation | Description | Selection rule | R packages |
|---|---|---|---|
| UNIV-BFN | Univariable models with Bonferroni-adjusted p-value | | base R |
| LASSO-MIN | Lasso with λ chosen at the minimum prediction error | β ≠ 0 | glmnet |
| LASSO-1SE | Lasso with λ chosen at 1 SE above the minimum error | β ≠ 0 | glmnet |
| ELNET-MIN | Elastic net, grid search for α (0.05–0.95 by 0.05), λ at min | β ≠ 0 | glmnet |
| ELNET-1SE | Elastic net, grid search α (0.05–0.95 by 0.05), λ at 1 SE | β ≠ 0 | glmnet |
| HCLST-CORR-SGL | Hierarchical clustering, groups with corr > 0.8, sparse group lasso | β ≠ 0 | SGL |
| HCLST-BOOT-SGL | Hierarchical clustering, groups from bootstrap, sparse group lasso | β ≠ 0 | SGL, pvclust |
| RF | Random Forests algorithm with bootstrap-based confidence intervals for the variable importance scores | 99.995% CI > 0 | randomForestSRC |
| BAGGING | Similar to Random Forests, but with all variables considered candidates for splitting at each node | 99.995% CI > 0 | randomForestSRC |
| BART-LOCAL | Bayesian Additive Regression Trees, local criteria for Inclusion Proportion (IP) | IP > 0.95 quantile of local distribution | bartMachine |
| BART-GLOBALSE | Bayesian Additive Regression Trees, global SE criteria for IP | IP > threshold from local distribution with global multiplier | bartMachine |
| BART-GLOBALMAX | Bayesian Additive Regression Trees, global Max criteria for IP | IP > 0.95 quantile of global max distribution | bartMachine |
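
The two λ rules in the lasso and elastic net rows can be illustrated with a minimal Python sketch. It assumes the convention used in glmnet's cross-validation output, where the 1 SE rule picks the largest λ whose mean cross-validated error lies within one standard error of the minimum. The function name `choose_lambdas` and the example numbers are hypothetical and are not taken from the study.

```python
def choose_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from per-lambda cross-validation summaries.

    lambdas : candidate penalty values
    cv_mean : mean cross-validated error for each lambda
    cv_se   : standard error of that mean for each lambda
    """
    if not lambdas or not (len(lambdas) == len(cv_mean) == len(cv_se)):
        raise ValueError("inputs must be non-empty and of equal length")

    # "MIN" rule: the lambda with the smallest mean CV error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])

    # "1SE" rule (assumed convention): the largest lambda whose mean error
    # is no more than one standard error above the minimum error.
    threshold = cv_mean[i_min] + cv_se[i_min]
    lambda_1se = max(
        lam for lam, err in zip(lambdas, cv_mean) if err <= threshold
    )
    return lambdas[i_min], lambda_1se


# Hypothetical cross-validation summary, for illustration only.
example_lambdas = [1.0, 0.5, 0.25, 0.1, 0.05]
example_mean = [1.40, 1.10, 1.02, 1.00, 1.03]
example_se = [0.05, 0.05, 0.04, 0.04, 0.04]

lam_min, lam_1se = choose_lambdas(example_lambdas, example_mean, example_se)
print("lambda at minimum error:", lam_min)
print("lambda under 1 SE rule:", lam_1se)
```

Because the 1 SE rule never selects a smaller penalty than the minimum-error rule, it tends to keep fewer nonzero coefficients, which is consistent with the lower false positive counts reported for the 1SE variants in the results tables.
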
Simulation study results
A. Binary outcome

| Method | Strict: TP (N/10) | Strict: FP (N/990) | Strict: F2 | Relaxed (a): TP (N/10) | Relaxed (a): FP (N/953) | Relaxed (a): F2 |
|---|---|---|---|---|---|---|
| UNIV-BFN | 4.09 | 32.49 | 0.267 | 5.13 | 12.70 | 0.443 |
| LASSO-1SE | 3.84 | 6.05 | 0.383 | 5.53 | 3.71 | 0.559 |
| LASSO-MIN | 4.25 | 9.01 | 0.399 | 5.98 | 6.53 | 0.569 |
| ELNET-1SE | 5.26 | 20.51 | 0.405 | 6.21 | 9.33 | 0.560 |
| ELNET-MIN | 5.53 | 26.11 | 0.393 | 6.61 | 14.40 | 0.548 |
| HCLST-CORR-SGL | | | | | | |
| HCLST-BOOT-SGL | 5.20 | 16.66 | 0.420 | 6.35 | 7.07 | 0.594 |
| RF | 3.53 | 18.41 | 0.281 | 4.91 | 7.68 | 0.462 |
| BAGGING | 3.56 | 13.94 | 0.308 | 4.73 | 6.70 | 0.456 |
| BART-LOCAL | 4.68 | 15.66 | 0.387 | 6.32 | 7.13 | 0.591 |
| BART-GLOBALSE | 1.96 | 0.53 | 0.228 | 2.24 | 0.22 | 0.261 |
| BART-GLOBALMAX | 0.01 | 0.00 | 0.001 | 0.01 | 0.00 | 0.001 |

B. Continuous outcome

| Method | Strict: TP (N/10) | Strict: FP (N/990) | Strict: F2 | Relaxed (a): TP (N/10) | Relaxed (a): FP (N/953) | Relaxed (a): F2 |
|---|---|---|---|---|---|---|
| UNIV-BFN | 4.83 | 39.57 | 0.286 | 5.90 | 17.47 | 0.468 |
| LASSO-1SE | 2.88 | 4.49 | 0.298 | 4.33 | 2.42 | 0.454 |
| LASSO-MIN | 4.61 | 10.52 | 0.419 | 6.47 | 7.60 | 0.599 |
| ELNET-1SE | 3.88 | 8.27 | 0.366 | 4.87 | 3.29 | 0.500 |
| ELNET-MIN | 5.18 | 14.79 | 0.433 | 6.61 | 8.88 | 0.598 |
| HCLST-CORR-SGL | | | | | | |
| HCLST-BOOT-SGL | 5.52 | 17.03 | 0.441 | 6.72 | 8.45 | 0.610 |
| RF | 4.63 | 28.46 | 0.316 | 5.93 | 14.26 | 0.493 |
| BAGGING | 4.40 | 25.73 | 0.314 | 5.81 | 12.96 | 0.494 |
| BART-LOCAL | 5.14 | 18.35 | 0.404 | | | |
| BART-GLOBALSE | 2.40 | 0.93 | 0.274 | 2.85 | 0.41 | 0.326 |
| BART-GLOBALMAX | 0.01 | 0.00 | 0.002 | 0.01 | 0.00 | 0.002 |

TP: true positive; FP: false positive. Boldface denotes the best-performing method by F2 statistic.
(a) Under the relaxed definition, a true variable was considered identified if either it or one of its surrogates was selected. Surrogates are therefore removed from the pool of potential false positives, but the maximum number of true positive variables remains 10.
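
The strict and relaxed scoring rules, and the F2 statistic built on them, can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: F2 is taken to be the standard F-beta score with β = 2, which this section does not define explicitly, and the function names, variable names, and surrogate mapping are hypothetical.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """Standard F-beta score; beta = 2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def score_selection(selected, true_vars, surrogates=None):
    """Count (TP, FP, FN) for one set of selected variables.

    surrogates maps a true variable to the names of its surrogates.
    Passing None applies the strict definition; passing a mapping applies
    the relaxed definition, in which selecting a surrogate counts as
    identifying its true variable and surrogates cannot be false positives.
    """
    selected = set(selected)
    surrogates = surrogates or {}

    tp = 0
    for var in true_vars:
        if var in selected or selected & set(surrogates.get(var, ())):
            tp += 1  # counted once per true variable, so TP never exceeds len(true_vars)

    all_surrogates = set()
    for names in surrogates.values():
        all_surrogates |= set(names)

    fp = len(selected - set(true_vars) - all_surrogates)
    fn = len(true_vars) - tp
    return tp, fp, fn


# Hypothetical example: three true variables, two of which have surrogates.
true_vars = ["X1", "X2", "X3"]
surrogates = {"X1": {"S1"}, "X2": {"S2a", "S2b"}}
selected = ["X2", "S1", "S2a", "N7"]

strict = score_selection(selected, true_vars)
relaxed = score_selection(selected, true_vars, surrogates)
print("strict  (TP, FP, FN):", strict, "F2:", round(f_beta(*strict), 3))
print("relaxed (TP, FP, FN):", relaxed, "F2:", round(f_beta(*relaxed), 3))
```

Plugging the mean counts from a table row into `f_beta` (for example TP = 4.09, FP = 32.49, FN = 10 − 4.09 for UNIV-BFN with a binary outcome) gives roughly 0.267. Such a calculation only approximates the reported values in general, since an F2 averaged across simulation replicates need not equal the F2 of the averaged counts.
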
Effect of confounding on detection rate (binary outcome)
| Variable | Mean β (true = 0.22) | Detection rate (UNIV-BFN) | Detection rate (HCLST-CORR-SGL) | Detection rate |
|---|---|---|---|---|
| X1 | 0.223 | 0.25 | 0.586 | 0.244 |
| X2 | 0.343 | 0.972 | 0.778 | 0.126 |
| X3 | 0.252 | 0.546 | 0.372 | 0.132 |
| X4 | 0.152 | 0.022 | 0.014 | 0.028 |
| X5 | 0.085 | 0 | 0 | 0.01 |
| X6 | 0.233 | 0.316 | 0.800 | 0.662 |
| X7 | 0.281 | 0.812 | 0.928 | 0.71 |
| X8 | 0.228 | 0.388 | 0.768 | 0.382 |
| X9 | 0.280 | 0.786 | 0.928 | 0.728 |
| X10 | 0.032 | 0 | 0.024 | 0.006 |
Results from full data: Variables identified as associated with PCa aggressiveness by both HCLST-CORR-SGL and NWAS
| SGL variable(s) | Domain: Description | NWAS variable | Correlation(s) |
|---|---|---|---|
| PCT_SF3_PCT050102 | Poverty: % Ratio of Income to Poverty level for persons aged 45–54 under 0.50 | PCT_SF3_PCT050102 | 1.0 |
| PCT_SF3_HCT005092 | Housing/Income: % Renter-occupied housing units built 1939–1949 with householder aged 25–34 | PCT_SF3_HCT015042 | 0.912 |
| PCT_SF3_PCT065I007 | Employment/Transportation: % White Only (non-Hispanic) Worker taking public transportation (trolley or streetcar) to work | PCT_SF3_P030007 | 0.935 |
| PCT_SF3_PCT065A007 | % White Only Worker taking public transportation (trolley or streetcar) to work | PCT_SF3_P030007 | 0.935 |