Anne E Goodenough, Adam G Hart, Richard Stafford.
Abstract
Despite recent papers on problems associated with full-model and stepwise regression, their use is still common throughout ecological and environmental disciplines. Alternative approaches, including generating multiple models and comparing them post-hoc using techniques such as Akaike's Information Criterion (AIC), are becoming more popular. However, these are problematic when there are numerous independent variables, and interpretation is often difficult when competing models contain many different variables and combinations of variables. Here, we detail a new approach, REVS (Regression with Empirical Variable Selection), which uses all-subsets regression to quantify empirical support for every independent variable. A series of models is created: the first contains the variable with most empirical support, the second contains the first variable and the next most-supported, and so on. The small number of resultant models (n = the number of predictor variables) means that post-hoc comparison is comparatively quick and easy. When tested on a real dataset (habitat and offspring quality in the great tit, Parus major), the optimal REVS model explained more variance (higher R²), was more parsimonious (lower AIC), and had greater significance (lower P values) than full, stepwise or all-subsets models; it also had higher predictive accuracy based on split-sample validation. Testing REVS on ten further datasets suggested that this is typical, with R² values being higher than full or stepwise models (mean improvement = 31% and 7%, respectively). Results are ecologically intuitive because, even when there are several competing models, they share a set of "core" variables and differ only in presence/absence of one or two additional variables. We conclude that REVS is useful for analysing complex datasets, including those in ecology and environmental disciplines.
Year: 2012 PMID: 22479605 PMCID: PMC3316704 DOI: 10.1371/journal.pone.0034338
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The REVS procedure outlined (MLR = Multiple Linear Regression).
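The procedure summarised in the abstract can be sketched in code. The following is a minimal, illustrative Python sketch and not the authors' implementation. It makes two assumptions that the abstract does not confirm: a variable's empirical support is taken to be the number of best-of-size subsets (one per subset size) in which it appears, and AIC is computed in its least-squares form, n ln(RSS/n) + 2k. The function names (`ols_rss`, `aic`, `revs`) and the synthetic data are hypothetical.

```python
import itertools
import math
import random


def ols_rss(columns, y):
    """Residual sum of squares for an OLS fit (with intercept) of y on columns."""
    n = len(y)
    cols = [[1.0] * n] + list(columns)
    p = len(cols)
    # Augmented normal equations [X'X | X'y].
    a = [[sum(u * v for u, v in zip(cols[i], cols[j])) for j in range(p)]
         + [sum(u * v for u, v in zip(cols[i], y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(p):
            if r != c:
                f = a[r][c] / a[c][c]
                a[r] = [u - f * v for u, v in zip(a[r], a[c])]
    beta = [a[i][p] / a[i][i] for i in range(p)]
    fitted = [sum(b * col[k] for b, col in zip(beta, cols)) for k in range(n)]
    return sum((yk - fk) ** 2 for yk, fk in zip(y, fitted))


def aic(rss, n, k):
    """Least-squares AIC (up to an additive constant); k counts fitted coefficients."""
    return n * math.log(rss / n) + 2 * k


def revs(predictors, y):
    """Rank variables by support, then fit the nested series of models."""
    names = list(predictors)
    support = {name: 0 for name in names}
    # Step 1 (assumed support measure): for each subset size, find the
    # lowest-RSS subset and credit every variable it contains.
    for size in range(1, len(names) + 1):
        best = min(itertools.combinations(names, size),
                   key=lambda s: ols_rss([predictors[v] for v in s], y))
        for v in best:
            support[v] += 1
    # Step 2: order variables from most to least supported (ties keep input order).
    ranked = sorted(names, key=lambda v: -support[v])
    # Step 3: one model per variable added, each scored by AIC.
    models = []
    for i in range(1, len(ranked) + 1):
        subset = ranked[:i]
        rss = ols_rss([predictors[v] for v in subset], y)
        models.append((subset, aic(rss, len(y), i + 1)))
    return ranked, models


# Synthetic data: only "a" and "b" influence the response.
random.seed(0)
n_obs = 60
data = {name: [random.gauss(0, 1) for _ in range(n_obs)] for name in "abcd"}
response = [2.0 * a - 1.5 * b + random.gauss(0, 0.5)
            for a, b in zip(data["a"], data["b"])]

ranked, models = revs(data, response)
best_subset, best_aic = min(models, key=lambda m: m[1])
print("Support order:", ranked)
print("Lowest-AIC model:", best_subset)
```

The exhaustive enumeration used here is only practical for a handful of predictors; the number of subsets doubles with each added variable, so larger problems need a more efficient all-subsets search.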
Vegetation variables in the case study dataset.
| Generic habitat variables | Specific habitat variables |
| Number of trees | Number of pedunculate oak |
| Distance to the nearest tree | Number of silver birch |
| Number of saplings | Number of beech |
| Number of shrubs | Number of rowan |
| Percentage of ground cover | Number of sycamore |
| Diversity of trees | Number of white/downy birch |
| Diversity of saplings | Percentage coverage by holly |
| Diversity of field-layer species | Percentage coverage by hawthorn |
| Total plot diversity | Percentage coverage by bramble |
| Grazing regime (grazed or not) | Percentage coverage by bracken |
Details of the 10 additional datasets (the top five datasets are on species-habitat interactions; the second five datasets are wider biological datasets).
| Dataset details | Dependent variable | Independent variables | Cases | Source |
| Blue tit nest site selection (Nagshead, Gloucestershire) | Frequency of nestbox occupation (15 years) | 20 nestbox variables (e.g. size, height, location) | 295 | A Goodenough; unpublished data |
| Great tit nest site selection (Nagshead, Gloucestershire) | As above | As above | 295 | A Goodenough; unpublished data |
| Dormouse nest site selection (Midger Wood, Gloucestershire) | Frequency of nest tube occupation (13 years) | 25 variables describing surrounding habitat | 100 | R. Williams; unpublished data |
| Pied flycatcher clutch size (Nagshead, Gloucestershire) | Mean number of eggs per clutch per nestbox (15 years) | 31 variables describing surrounding habitat | 258 | |
| Pied flycatcher fledging success (Nagshead, Gloucestershire) | Mean number of fledglings per brood per nestbox (15 years) | As above | 254 | |
| Plant morphology (Lady Park Wood, Gwent) | Canopy coverage | 4 tree-specific variables, including height and DBH | 300 | A Goodenough; unpublished data |
| Animal behaviour | Average time spent in slow wave sleep per 24 hours | 7 life-history variables (e.g. weight, gestation, lifespan) | 62 | |
| Human biometrics | Percentage body fat (underwater weighing) | 14 measurements (e.g. weight, height, chest circumference) | 252 | Data from R. Johnson; available from: |
| Aquatic bacterial load (River Severn, Gloucestershire) | Total bacteria plate count from 100 µl water on nutrient agar | 5 chemical parameters (nitrogen, calcium, pH etc) | 12 | S. Eley; unpublished data |
| Organic pollution (Oslo, Norway) | Amount of organic particulate matter (log transformed) | 7 environment parameters (e.g. wind speed, time of day) | 500 | Data from M. Aldrin; available from |
Running the REVS procedure on these datasets took <1 min.
Comparison of REVS against full model regression, stepwise regression (P to enter = 0.05) and LEAPS all-subset regression for the case study dataset of great tit chick fitness (quantified using wing length) as the dependent variable and 25 independent habitat parameters.
| Model | Adjusted R² (complete model) | AIC (complete model) | Delta AIC (complete model) | P (complete model) | Adjusted R² (hold-out comparison) | RSS (hold-out comparison) |
| REVS (best model) | 0.374 | 156.00 | 0.00 | 0.0005 | 0.478 | 62.745 |
| LEAPS all-subsets (best model) | 0.331 | 157.40 | 1.40 | 0.0007 | 0.104 | 353.713 |
| Stepwise (best model) | 0.254 | 160.71 | 4.71 | 0.0014 | 0.439 | 80.229 |
| Full | 0.034 | 184.74 | 28.74 | 0.4899 | 0.449 | 76.986 |
For variables included in the best model, see Table 4.
Variables included: orientation category (−), percentage cover bracken (+), percentage cover holly (−), diversity of field-layer species (+), and canopy coverage (+).
Variables included: orientation category (−), percentage cover bracken (+), and overall percentage cover ground (−).
The full models are detailed and the prediction accuracy of each is calculated using a hold-out sample (see methods for more details).
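The hold-out (split-sample) comparison mentioned above can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' procedure, whose details are in the methods rather than here: the data are synthetic, a single predictor is used for brevity, the 50/50 split is arbitrary, and the helper `fit_line` is hypothetical. The model is fitted on one half of the cases and then judged by how well it predicts the other half, using RSS and R² between predicted and observed values.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
             / sum((a - mean_x) ** 2 for a in xs))
    return mean_y - slope * mean_x, slope


# Synthetic data (assumption): a linear signal plus Gaussian noise.
random.seed(1)
x = [random.uniform(0, 10) for _ in range(40)]
y = [1.0 + 0.5 * xi + random.gauss(0, 0.3) for xi in x]

# Randomly split the cases into a training half and a hold-out half.
indices = list(range(len(x)))
random.shuffle(indices)
train, holdout = indices[:20], indices[20:]

# Fit on the training half only.
intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])

# Predict the hold-out cases and measure prediction error.
observed = [y[i] for i in holdout]
predicted = [intercept + slope * x[i] for i in holdout]
rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
mean_obs = sum(observed) / len(observed)
tss = sum((o - mean_obs) ** 2 for o in observed)
r_squared = 1 - rss / tss

print("Hold-out RSS:", rss)
print("Hold-out R^2:", r_squared)
```

A lower hold-out RSS and a higher hold-out R² indicate better predictive accuracy on cases the model has not seen.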
REVS models for analysis of great tit wing length giving R² and delta AIC values.
| Model | Adjusted R² | Delta AIC |
| Orientation category (1 = S-SW; 0 = other; negative relationship) | 0.074 | 14.072 |
| +Distance to nearest path (positive relationship) | 0.085 | 12.535 |
| +Number of trees (positive relationship) | 0.165 | 9.920 |
| +Number of silver birch (positive relationship) | 0.258 | 5.011 |
| +Distance to nearest water source (negative relationship) | 0.314 | 2.066 |
| +Percentage of ground cover (positive relationship) | 0.318 | 2.596 |
| +Number of downy birch (negative relationship) | 0.342 | 1.681 |
| | | |
| +Distance to nearest road (positive relationship) | 0.369 | 1.191 |
| +Percentage holly coverage (negative relationship) | 0.355 | 2.970 |
Each row shows the latest variable to be entered into the model (in addition to those previously added) and the overall adjusted R² and delta AIC. The model in bold was the single best model when models were compared using delta AIC (or R²). All models were significant (P < 0.05).
Figure 2. Mean (± SE) results of running 10 datasets (detailed in ) through REVS compared with standard full regression and stepwise regression for (a) delta AIC values (combines model fit and parsimony; lower values are preferable); (b) R² values (higher values are preferable) and (c) significance (P values; lower values are preferable).
It should be noted that the delta AIC value for REVS was 0 in all cases.
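A delta AIC of 0 simply marks the model with the lowest AIC in the set being compared, because delta AIC is each model's AIC minus that minimum. As a small worked example, the Python snippet below recomputes the delta AIC column of the great tit case study from the AIC values reported in the comparison table; the variable names are illustrative only.

```python
# AIC values as reported for the great tit case study.
aic_values = {
    "REVS": 156.00,
    "LEAPS all-subsets": 157.40,
    "Stepwise": 160.71,
    "Full": 184.74,
}

# Delta AIC = AIC of each model minus the lowest AIC in the set.
lowest = min(aic_values.values())
delta_aic = {model: round(value - lowest, 2) for model, value in aic_values.items()}

for model, delta in delta_aic.items():
    print(f"{model}: {delta:.2f}")
```

The results (0.00, 1.40, 4.71 and 28.74) match the reported delta AIC column.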
Figure 3. Conceptual graph showing how REVS regression parameters (delta AIC, R² and significance) relate to one another and how they change at different levels for the same dataset (a new independent variable is added at each level, thus giving a more complex model).
This is based on running 10 sample datasets (Table 2) through REVS.