Karen E Lamb, Simon R White.
Abstract
BACKGROUND: In the analysis of the effect of built environment features on health, it is common for researchers to categorise built environment exposure variables based on arbitrary percentile cut-points, such as median or tertile splits. This arbitrary categorisation leads to a loss of information and a lack of comparability between studies, since the choice of cut-point is based on the sample distribution. DISCUSSION: In this paper, we highlight the various drawbacks of adopting percentile categorisation of exposure variables. Using data from the SocioEconomic Status and Activity in Women (SESAW) study from Melbourne, Australia, we highlight alternative approaches which may be used instead of percentile categorisation in order to assess built environment effects on health. We discuss these approaches using an example which examines the association between the number of accessible supermarkets and body mass index. We show that alternative approaches to percentile categorisation, such as transformations of the exposure variable or fractional polynomials, can be implemented easily using standard statistical software packages. These procedures utilise all of the information available in the data, avoiding the loss of power experienced when categorisation is adopted. We argue that researchers should retain all available information by using the continuous exposure, adopting transformations where necessary.
Year: 2015 PMID: 25889014 PMCID: PMC4335683 DOI: 10.1186/s12966-015-0181-9
Source DB: PubMed Journal: Int J Behav Nutr Phys Act ISSN: 1479-5868 Impact factor: 6.457
Figure 1. Illustrative example of the ‘trouble with tertiles’ when predicting BMI from the count of supermarkets. We split the original SESAW dataset [3] (n = 1462) into two sub-samples, A and B, each with n = 500. (a) The sub-samples are analysed separately using a tertile approach and a linear model with an intercept and a single linear predictor; both linear fits are significant and the coefficients are shown on the plot. (b) If we treat the two sub-samples as independent studies, it is of interest to consider the combined estimate of the association between supermarket density and BMI. The combined sub-sample fits are obtained using standard meta-analysis methods (in essence, a weighted mean of the estimates accounting for sample size and standard errors) and are compared with the same analysis on the complete data. Of note, the combined tertile model no longer has three groups but five, which complicates the interpretation. Conversely, the combined linear model retains the same interpretation.
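The "weighted mean of the estimates" mentioned in the caption can be sketched as follows. This is a minimal illustration that assumes fixed-effect inverse-variance weighting, which is one standard choice; the caption does not say which pooling method was actually used. The function name `pool_fixed_effect` and the slope values are hypothetical and are not taken from the SESAW analysis.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted mean of per-study estimates (fixed-effect pooling)."""
    # Each study is weighted by the reciprocal of its squared standard error,
    # so more precise (typically larger) studies contribute more.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical slopes (BMI units per additional supermarket) from two sub-samples.
slopes = [-0.12, -0.08]
slope_ses = [0.05, 0.05]
pooled_slope, pooled_slope_se = pool_fixed_effect(slopes, slope_ses)
print(pooled_slope, pooled_slope_se)
```

Pooling works directly here because both slopes describe the same quantity. Tertile coefficients from different sub-samples refer to categories with different cut-points, so they cannot be combined this simply.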
Figure 2. Comparison of approaches for estimating non-linear relationships using the SESAW study [3]. (a) Comparison of a simple linear model, a fractional polynomial (of which the best fitting was equivalent to the simple linear model), linear splines, tertiles and a non-parametric smoother (see Table 1 for the respective AICs to assess model comparison). (b) As in Figure 2(a), with the y-axis extended to show the complete range of BMI and the observed data plotted (n = 1462 points). The plot shows visually what the AIC comparison in Table 1 indicates: because of the large variance in BMI scores, there is no evidence for anything more complicated than a simple linear model. Further, there is nothing to choose statistically between the linear and tertile fits. However, the linear model has the benefit of not being data-dependent.
Table 1. Comparison of modelling approaches for predicting BMI from the count of supermarkets within 5 km

| Model | Term | Estimate | (S.E.)(a) | p-value | AIC(f) |
|---|---|---|---|---|---|
| Linear Model | Intercept | 26.43 | (0.35) | <0.001 | 9139.73 |
| | Count of supermarkets | −0.10 | (0.03) | <0.001 | |
| Fractional Polynomial(b) | Intercept | 26.53 | (0.37) | <0.001 | 9139.73 |
| | (Count of supermarkets + 1)/10 | −1.00 | (0.27) | <0.001 | |
| Spline (2 knots)(c) | Intercept | 26.53 | (0.56) | <0.001 | 9141.66 |
| | 1st segment, 0–11 supermarkets | −1.25 | (0.69) | 0.068 | |
| | 2nd segment, 11–15 supermarkets | −4.80 | (1.56) | 0.002 | |
| Spline (3 knots)(d) | Intercept | 25.39 | (0.68) | <0.001 | 9132.98 |
| | 1st segment, 0–9 supermarkets | 0.90 | (0.86) | 0.29 | |
| | 2nd segment, 9–15 supermarkets | −1.11 | (0.71) | 0.12 | |
| | 3rd segment, 15–50 supermarkets | 0.66 | (0.66) | 0.78 | |
| Tertiles(e) | 0–9 supermarkets (baseline) | 26.05 | (0.25) | <0.001 | 9137.67 |
| | 10–14 supermarkets | −1.00 | (0.34) | <0.001 | |
| | 15 or more supermarkets | −1.49 | (0.36) | <0.001 | |

(a) S.E. = standard error.
(b) The fractional polynomial with intercept and covariate was found to be the best fitting among the pre-defined set of fractional polynomials (selection is based on the AIC and is carried out automatically by the statistical algorithm). Since the logarithm is one of the possible transformations, the covariate cannot take zero values, hence the addition of 1 to the number of supermarkets in this model.
(c) Fixed 2-knot spline not shown on Figure 2.
(d) Default knots for the spline function are placed at the equivalent quantiles. Hence the knot locations coincide with the tertile boundaries. With splines, it is possible to estimate the knot locations as part of the inference or to use pre-specified knot locations. The spline was anchored to be within the range of 0 and 50 for this example.
(e) Note that the third category is unbounded. This highlights two issues: how outliers are included in the analysis, and how to interpret a ‘high’ density of supermarkets, since ‘high’ could be defined as 15, 20, 25, 30, etc. (the actual range of the data is 0–29). For closed intervals, the representative value can be thought of as the interval mid-point. However, taking the mid-point assumes values are uniformly distributed within the interval. For the lower band this is not true (low counts have a mean of 6.2 and a median of 6, compared with the mid-point of 4.5). High counts have a mean of 18.3 and a median of 17, with an undefined mid-point because percentile categorisation leaves the upper bound unspecified.
(f) We performed a Cox test for non-nested models to compare the model fits and found no significant difference between the linear and tertile model fits. The 3-knot spline has a smaller, and therefore better, AIC, but none of its coefficients are significant, which perhaps indicates over-fitting. The 2-knot spline, with the same number of parameters as the tertile model, is not statistically different from the linear model.
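The AIC-based selection described in footnote (b) can be sketched as below. This is an illustration under stated assumptions, not the authors' procedure: it assumes first-degree fractional polynomials drawn from the conventional power set (−2, −1, −0.5, 0, 0.5, 1, 2, 3), with 0 denoting the logarithm, and it uses synthetic, deliberately low-noise data rather than the SESAW data. All function names are hypothetical. The exposure is shifted and scaled as (count + 1)/10, as in Table 1.

```python
import math
import random

# Conventional first-degree fractional polynomial powers; 0 stands for log(x).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)

def fp_transform(x, power):
    """Apply one fractional polynomial transformation to a positive value."""
    return math.log(x) if power == 0 else x ** power

def fit_simple_ols(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, residual sum of squares)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, rss

def gaussian_aic(rss, n, n_params):
    """Gaussian AIC, omitting a constant shared by all models on the same data."""
    return n * math.log(rss / n) + 2 * n_params

def best_fp1(counts, outcome):
    """Return the power with the lowest AIC and the AIC of every candidate."""
    # Shift by 1 so the log and negative powers are defined at a count of zero.
    scaled = [(c + 1) / 10 for c in counts]
    aic_by_power = {}
    for power in FP_POWERS:
        xs = [fp_transform(x, power) for x in scaled]
        _, _, rss = fit_simple_ols(xs, outcome)
        # Three parameters: intercept, slope, residual variance.
        aic_by_power[power] = gaussian_aic(rss, len(outcome), 3)
    return min(aic_by_power, key=aic_by_power.get), aic_by_power

# Synthetic data with a truly linear relationship (illustrative values only).
rng = random.Random(0)
counts = [rng.randint(0, 29) for _ in range(500)]
bmi = [26.4 - 0.10 * c + rng.gauss(0, 0.3) for c in counts]
best_power, aics = best_fp1(counts, bmi)
print(best_power)
```

When the underlying relationship is linear, the selected power is expected to be 1, which corresponds to the situation in Table 1 where the best-fitting fractional polynomial coincides with the simple linear model.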
Figure 3. Illustration that within a meta-analysis the tertile approach will tend to a linear model. The SESAW data [3] were split into 20 sub-samples (A-T), each with n = 75. This plot shows four meta-analyses which combine an increasing number of the sub-samples (A, A-G, A-M, and A-T). The linear model approach is consistent and approaches the equivalent analysis using the full data. Conversely, the tertile approach becomes increasingly bumpy, as each sub-sample has data-dependent tertile cut-points. In the limit, as illustrated, the tertile combined analysis will tend towards the linear model approach.
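The data dependence of tertile cut-points noted in the caption can be illustrated with a small simulation. This sketch uses synthetic supermarket counts, not the SESAW data, and the helper name `tertile_cut_points` is hypothetical. It splits a simulated sample into 20 disjoint sub-samples of 75 and computes each sub-sample's own tertile boundaries.

```python
import random
import statistics

def tertile_cut_points(values):
    """Return the two cut-points that split the values into three equal-sized groups."""
    lower, upper = statistics.quantiles(values, n=3)
    return lower, upper

# Synthetic counts of supermarkets (illustrative values only).
rng = random.Random(1)
population = [rng.randint(0, 29) for _ in range(1500)]
rng.shuffle(population)

# Twenty disjoint sub-samples of 75 observations each.
sub_samples = [population[i * 75:(i + 1) * 75] for i in range(20)]
cuts = [tertile_cut_points(sample) for sample in sub_samples]

for label, (lower, upper) in zip("ABCDEFGHIJKLMNOPQRST", cuts):
    print(label, round(lower, 2), round(upper, 2))
```

Because every sub-sample yields slightly different boundaries, "low", "medium" and "high" mean different things in each one, and a combined analysis ends up with many narrow categories. A linear slope, by contrast, has the same meaning in every sub-sample.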