Catherine Kreatsoulas, S V Subramanian.
Entities:
Year: 2018 PMID: 29854919 PMCID: PMC5976835 DOI: 10.1016/j.ssmph.2018.03.007
Source DB: PubMed Journal: SSM Popul Health ISSN: 2352-8273
Abstract
An overview of the strengths and limitations of the machine learning approaches outlined by Seligman et al. (2018). Brief illustrative code sketches for each approach follow the table.
| Approach | Description | Strengths | Limitations | Other considerations |
| --- | --- | --- | --- | --- |
| Regression | Attempts to fit a straight hyperplane to the data | Excellent for prediction among linear relationships; simple to interpret and understand because attributes have an additive effect on the model; can be regularized to deal with overfitting | Does not handle non-linear relationships in the data well; learning algorithms make a set of assumptions about the data, so an inductive bias is embedded within each algorithm | Selecting the best model is more challenging than optimizing its parameters once the model is fixed; assumes that changes in the attributes and the output occur with some regularity and smoothness, which is required for generalization |
| LASSO penalized regression | Additional variables that do not substantially improve prediction are penalized | Useful in OLS when many variables are highly correlated (as variance increases in OLS, the beta estimates become increasingly inaccurate) | The penalty weight, lambda, is estimated and tuned by a variety of methods, each with pros and cons | The goal is to reduce and select among redundant predictors in a generalized linear model to improve prediction |
| Random forests | Repeatedly splits the dataset into random sets of decision trees, with if-then rules at the branches and interpolation at the leaves | Learning is non-parametric; variables do not need to be transformed; handles outliers well; handles missing values well; ensemble methods that include random forests often perform well | Individual trees are highly prone to overfitting (a tree can keep branching until the data are memorized); black-box predictions are difficult to interpret | Larger forests typically give better predictions (being mindful of overfitting and correlated trees) |
| Neural networks | Based on the neuron/synapse activation structure of the human brain, using synaptic weights that form ‘hidden layers’ between inputs and outputs | Learning is non-linear; handles outliers well; can learn complex patterns from high-dimensional data; hidden layers alleviate feature engineering; often the best-performing algorithm | Difficult to set up: many decisions are required on the architecture and hyperparameters of the network; easy to overfit; often very difficult to interpret; requires large sample sizes; computationally very intensive to train | Generalization is difficult without large samples of data |
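
For concreteness, a minimal sketch of the regression row, using scikit-learn on synthetic data (the library and data are assumptions, not from the paper): the fitted coefficients are the additive, directly interpretable effects the table describes.

```python
# Linear regression: a straight hyperplane with additive, interpretable coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # three predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)                   # additive effect per attribute
```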
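A sketch of the LASSO row under the same assumptions: scikit-learn names the penalty weight `alpha` rather than lambda, and `LassoCV` estimates it by cross-validation, one of the tuning methods the table alludes to. Redundant predictors are shrunk to exactly zero.

```python
# LASSO: penalize variables that do not substantially improve prediction.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
# Only the first two predictors matter; the other 18 are redundant noise.
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

lasso = LassoCV(cv=5).fit(X, y)                        # lambda chosen by cross-validation
print("chosen penalty (alpha):", lasso.alpha_)
print("surviving predictors:", np.flatnonzero(lasso.coef_))
```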
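A sketch of the random-forest row, again assuming scikit-learn and synthetic data: the target is non-linear and needs no variable transformation, and the out-of-bag score gives a quick check against the overfitting caveat in the table.

```python
# Random forest: an ensemble of decision trees; larger forests usually predict better.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)  # non-linear

forest = RandomForestRegressor(n_estimators=300,       # forest size
                               oob_score=True,         # out-of-bag generalization check
                               random_state=0).fit(X, y)
print("out-of-bag R^2:", forest.oob_score_)
```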
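Finally, a sketch of the neural-network row using scikit-learn's small multilayer perceptron (an assumption; the paper names no implementation): `hidden_layer_sizes` and `max_iter` are exactly the architecture and hyperparameter decisions the table flags, and comparing training fit against held-out data would expose the overfitting risk.

```python
# Neural network: hidden layers of weighted units between inputs and outputs.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 2))                 # larger sample, as NNs require
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + rng.normal(scale=0.05, size=2000)

net = MLPRegressor(hidden_layer_sizes=(32, 32),        # two hidden layers: an architecture choice
                   max_iter=2000,                      # a training hyperparameter
                   random_state=0).fit(X, y)
print("training R^2:", net.score(X, y))                # high fit here alone may signal overfitting
```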