Joshua J Levy, A James O'Malley.
Abstract
BACKGROUND: Machine learning approaches have become increasingly popular modeling techniques, relying on data-driven heuristics to arrive at their solutions. Recent comparisons between these algorithms and traditional statistical modeling techniques have largely ignored the advantage that machine learning approaches gain from their built-in model-building search algorithms. As a result, statistical and machine learning approaches have been aligned with different types of problems, and procedures that combine their attributes remain under-developed. In this context, we sought to understand the domains of applicability of each approach and to identify areas where a marriage of the two is warranted. We then sought to develop a hybrid statistical-machine learning procedure with the best attributes of each.
Keywords: Interactions; Logistic regression; Machine learning; Model Explanations; Random Forest
Year: 2020 PMID: 32600277 PMCID: PMC7325087 DOI: 10.1186/s12874-020-01046-3
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Fig. 1 Characteristics of the generalized linear modeling approach (blue), decision trees (orange), random forest (green), and the true model (black): (a-b) predictor X1 versus binary response Y for (a) a linear continuous predictor and (b) a non-linear transformed predictor; (c) binary prediction performance as a function of interaction strength, smoothed using a Savitzky-Golay filter to best illustrate trends
Fig. 2 (a-b) Boxenplots of prediction performance of Logistic Regression versus Random Forest for (a) the original 556 datasets and (b) a subset of 277 datasets; (c) linear plot of the performance gains of the hybrid approach versus the random forest approach
95% confidence intervals for predictive performance (C-statistic) of Logistic Regression, Hybrid Approach, and Random Forest on held-out test set for the tasks of Epistasis and Diabetes prediction; confidence intervals obtained using 1000 sample non-parametric bootstrap.
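The bootstrap procedure named in this caption can be illustrated with a minimal sketch. It is not the authors' code: the function names `c_statistic` and `bootstrap_ci`, the percentile method for the interval, the choice to skip single-class resamples, and the toy labels and scores are all assumptions made for illustration.

```python
import random

def c_statistic(y_true, scores):
    """C-statistic (AUC): fraction of positive/negative pairs in which the
    positive case receives the higher score; ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the C-statistic (an assumed variant;
    the source only says '1000 sample non-parametric bootstrap')."""
    if len(set(y_true)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        # Resample cases with replacement from the held-out test set.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # assumption: skip resamples containing one class only
            stats.append(c_statistic(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper

# Hypothetical held-out labels and predicted probabilities (not study data).
y_test = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
p_test = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9, 0.6, 0.3]
point = c_statistic(y_test, p_test)
ci_low, ci_high = bootstrap_ci(y_test, p_test, n_boot=1000)
print(f"C-statistic {point:.3f}, 95% CI ({ci_low:.3f}, {ci_high:.3f})")
```

The pairwise C-statistic is quadratic in the sample size, which is fine for a sketch but would normally be replaced by a rank-based computation on larger test sets.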
Fig. 3 (a-c) Epistasis SHAP summary plots for (a) Logistic Regression, (b) Random Forest, and (c) Hybrid Approach; (d) odds ratios for coefficients of predictors derived from hybrid logistic regression augmented by RF-suggested interactions; (e) SHAP scores (measures of predictor variable importance) for the top 10 ranked interactions
Fig. 4 (a-c) Diabetes SHAP summary plots for (a) Logistic Regression, (b) Random Forest, and (c) Hybrid Approach; (d) odds ratios of predictors derived using logistic regression prior to application of the hybrid procedure; (e) odds ratios of predictors derived from hybrid approach logistic regression coefficients; (f) SHAP interaction scores for the top 10 ranked interactions