Mark P Little1,2, Philip S Rosenberg3, Aryana Arsham4.
Abstract
Random forests are a popular type of machine learning model. Unlike some other machine learning models, they are relatively robust to overfitting, and they adequately capture non-linear relationships between an outcome of interest and multiple independent variables. There are relatively few adjustable hyperparameters in the standard random forest models, among them the minimum size of the terminal nodes on each tree. The usual stopping rule, as proposed by Breiman, stops tree expansion by limiting the size of the parent nodes, so that a node cannot be split if it has fewer than a specified number of observations. Recently an alternative stopping criterion has been proposed, stopping tree expansion so that all terminal nodes have at least a minimum number of observations. The present paper proposes three generalisations of this idea, limiting the growth in regression random forests based on the variance, range, or inter-centile range. The new approaches are applied to diabetes data obtained from the National Health and Nutrition Examination Survey (NHANES) and four other datasets (Tasmanian Abalone data, Boston Housing crime rate data, Los Angeles ozone concentration data, MIT servo data). The empirical analysis presented herein demonstrates that the new stopping rules yield mean square prediction error competitive with that of standard random forest models. In general, use of the intercentile range statistic to control tree expansion yields much less variation in mean square prediction error, and the mean square prediction error is also closer to the optimal. The Fortran code developed is provided in the Supplementary Material.
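The stopping rules described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' Fortran code. The function names, the rule labels, and in particular the choice to compare a node's spread with the spread of the outcome in the whole training sample are assumptions made for illustration only; the abstract does not state the exact comparison used.

```python
import numpy as np


def spread(y, rule):
    """Spread statistic of the outcomes y under a (hypothetical) rule label."""
    y = np.asarray(y, dtype=float)
    if rule == "variance":
        return float(np.var(y))
    if rule == "range":
        return float(y.max() - y.min())
    if rule == "icr_10_90":
        lo, hi = np.percentile(y, [10, 90])
        return float(hi - lo)
    if rule == "icr_25_75":
        lo, hi = np.percentile(y, [25, 75])
        return float(hi - lo)
    raise ValueError(f"unknown rule: {rule}")


def parent_size_allows_split(n_parent, min_parent):
    """Parent-node rule: a node with too few observations is not split."""
    return n_parent >= min_parent


def leaf_size_allows_split(n_left, n_right, min_leaf):
    """Leaf-node rule: both children must keep at least min_leaf observations."""
    return min(n_left, n_right) >= min_leaf


def spread_allows_split(y_node, y_all, rule, proportion):
    """Assumed generalisation: keep splitting only while the node's spread
    exceeds a given proportion of the spread in the whole sample."""
    total = spread(y_all, rule)
    if total == 0.0:
        return False  # constant outcome: nothing left to explain
    return spread(y_node, rule) / total > proportion


# Toy data: 100 evenly spaced outcomes, and a node holding the first 10.
y_all = np.arange(100.0)
y_node = y_all[:10]
```

With these toy data the node's variance is about 1% of the total variance, while its range and 10–90% intercentile range are each about 9% of the total, so the same numerical threshold can stop growth under one statistic and allow it under another.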
Year: 2022 PMID: 36068261 PMCID: PMC9448733 DOI: 10.1038/s41598-022-19281-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Description of five datasets fitted.
| Dataset | NHANES | Tasmanian abalone | Boston housing | Los Angeles ozone | MIT servo |
|---|---|---|---|---|---|
| Number of datapoints | 8343 | 4177 | 506 | 330 | 167 |
| Dependent variable | Glycohemoglobin (mg/dL) | Rings ( = age) | Crime rate per capita by town | Upland CA maximum ozone | Rise time of servo |
| Explanatory variables | Age (years) | Sex (M/F) | Proportion of residential land zoned for lots over 25,000 sq ft | Vandenberg 500 mb height | Type of motor linkage (A,B,C,D,E) |
| | Weight (kg) | Length (mm) | Proportion of non-retail business acres per town | Wind speed (mph) | Type of screw linkage (A,B,C,D,E) |
| | Systolic blood pressure (mm Hg) | Diameter (mm) | Charles River variable ( = 1 if tract bounds river, 0 otherwise) | Humidity (%) | Gain setting 1 |
| | Diastolic blood pressure (mm Hg) | Height (mm) | Nitric oxides concentration (parts per 10 million) | Sandburg AFB temperature | Gain setting 2 |
| | Glucose (mg/dL) | Whole weight (g) | Average number of rooms per dwelling | Inversion base height | |
| | Cholesterol (mg/dL) | Shuck weight (g) | Proportion of owner-occupied units built prior to 1940 | Daggot pressure gradient | |
| | Triglycerides (mg/dL) | Viscera weight (g) | Weighted distances to five Boston employment centres | Inversion base temperature | |
| | Urination (minutes between last urination) | Shell weight (g) | Index of accessibility to radial highways | Visibility (miles) | |
| | Sedentary activity (minutes of sedentary activity in typical day) | | Full-value property-tax rate per $10,000 | Day of year | |
| | Gender | | Pupil-teacher ratio by town | | |
| | Race | | 1000 × (proportion blacks (by town) − 0.63)² | | |
| | Risk for diabetes (ever been told you have health risk for diabetes) | | % lower status of the population | | |
| | Kidneys (ever been told you had weak/failing kidneys) | | Median value of owner occupied homes in $1000s | | |
| | Stroke (ever been told you had a stroke) | | | | |
| | Weight loss (doctor told you to control/lose weight) | | | | |
| | Salt (doctor told you to reduce salt in diet) | | | | |
| | Cigarette smoking (used any tobacco product in last 5 days) | | | | |
| | Income | | | | |
| | Night urination (how many times urinate in night) | | | | |
| | Year (2016, 2018) | | | | |
Measures of goodness of fit (mean square cross-validated test error) for glycohemoglobin percentage, estimated on the hold-out test set (2017–2018 NHANES data) for random forest models fitted to the 2015–2016 NHANES data, together with the corresponding measures of goodness of fit for the Tasmanian Abalone, Boston Housing, Los Angeles Ozone and MIT Servo data.
| Method of limiting tree growth | NHANES | Tasmanian abalone | Boston Housing | Los Angeles ozone | MIT servo |
|---|---|---|---|---|---|
| Parent node size limiting | | | 32.2552 | | 0.2729 |
| Leaf node size limit | 0.1398 | 4.5119 | | 15.8862 | 0.2774 |
| Proportion of variance limit | 0.1398 | 4.5475 | 33.7808 | 15.9173 | 0.2601 |
| Proportion of range limit | 0.1398 | 4.5475 | 32.6826 | 15.8043 | 0.2676 |
| Proportion of 10–90% intercentile range limit | 0.1399 | 4.5497 | 33.7754 | 15.8223 | 0.2241 |
| Proportion of 25–75% intercentile range limit | 0.1397 | 4.5398 | 32.4181 | 15.9343 | |
The optimal model for each method of tree-growth limitation is shown in boldface.
Figure 1. Percentage increase in mean square predictive error (MSPE) for each stopping rule over the tree expansion rule yielding lowest MSPE, for each dataset.
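The quantity plotted in the figure can be sketched as follows for a single dataset. The function name and the MSPE values are hypothetical placeholders chosen for easy arithmetic; they are not results from the paper.

```python
def pct_increase_over_best(mspe_by_rule):
    """Percentage increase in MSPE of each rule over the lowest-MSPE rule."""
    best = min(mspe_by_rule.values())
    return {
        rule: 100.0 * (mspe - best) / best
        for rule, mspe in mspe_by_rule.items()
    }


# Hypothetical MSPE values for one dataset (illustrative only).
example_mspe = {"rule_a": 2.0, "rule_b": 2.5, "rule_c": 3.0}
increase = pct_increase_over_best(example_mspe)
```

The rule with the lowest MSPE has an increase of zero by construction, so a rule whose increases stay small across all datasets is one whose error remains close to the optimal.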