Jolynn Pek1, Octavia Wong2, Augustine C M Wong2,3.
Abstract
The linear model often serves as a starting point for applying statistics in psychology. Often, formal training beyond the linear model is limited, creating a potential pedagogical gap because of the pervasiveness of data non-normality. We reviewed 61 recently published undergraduate and graduate textbooks on introductory statistics and the linear model, focusing on their treatment of non-normality. This review identified at least eight distinct methods suggested to address non-normality, which we organize into a new taxonomy according to whether the approach: (a) remains within the linear model, (b) changes the data, and (c) treats normality as informative or as a nuisance. Because textbook coverage of these methods was often cursory, and methodological papers introducing these approaches are usually inaccessible to non-statisticians, this review is designed to be the happy medium. We provide a relatively non-technical review of advanced methods which can address non-normality (and heteroscedasticity), thereby serving as a starting point to promote best practice in the application of the linear model. We also present three empirical examples to highlight distinctions between these methods' motivations and results. The paper also reviews the current state of methodological research in addressing non-normality within the linear modeling framework. It is anticipated that our taxonomy will provide a useful overview and starting place for researchers interested in extending their knowledge of approaches developed to address non-normality from the perspective of the linear model.
Keywords: best practice; bootstrap; linear model; non-normality; robust statistics; sandwich estimators; transformation
Year: 2018 PMID: 30459683 PMCID: PMC6232275 DOI: 10.3389/fpsyg.2018.02104
Source DB: PubMed Journal: Front Psychol ISSN: 1664-1078
Frequency and counts of approaches for addressing non-normality across statistics textbooks published from 2003 to 2018.
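| Method | Graduate: count | Graduate: % | Undergraduate: count | Undergraduate: % |
| --- | --- | --- | --- | --- |
| Transform | 16 | 89 | 17 | 38 |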
| Reverse transform | 3 | 17 | 2 | 4 |
| CLT | 10 | 56 | 35 | 78 |
| Rank-based Nonparametric | 9 | 50 | 34 | 76 |
| Bootstrap | 9 | 50 | 6 | 13 |
| Trim | 6 | 33 | 9 | 20 |
| Winsorize | 5 | 28 | 4 | 9 |
| HCCM | 3 | 17 | 0 | 0 |
| Nonlinear models | 3 | 17 | 0 | 0 |
| Not covered | 1 | 6 | 3 | 7 |
Methods are rank ordered according to the most popular method mentioned in graduate textbooks. CLT, central limit theorem; HCCM, heteroscedasticity-corrected covariance matrix. Not covered implies that even the CLT was not mentioned. The percentages reported do not sum to 100% because textbooks can include more than one method for addressing non-normality.
Taxonomy of methods developed to address non-normality.
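| Method | Remains within linear model | Data unchanged | Non-normality as nuisance |
| --- | --- | --- | --- |
| CLT | ✓ | ✓ | ✗ |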
| HCCM | ✓ | ✓ | ✗ |
| Bootstrap | ✓ | ✓ | ✗ |
| Trim or Winsorize | ✓ | ✗ | ✓ |
| Transform | ✓ | ✗ | Depends |
| Rank-based Nonparametric | ✗ | ✗ | ✗ |
| Nonlinear models | ✗ | ✓ | ✗ |
CLT, central limit theorem; HCCM, heteroscedasticity-corrected covariance matrix.
Trimming and Winsorizing treat non-normality as an indication of contamination by outliers; the outliers are themselves treated as nuisance.
Rank-based nonparametric approaches tend to focus on the rank order in the data by ignoring any quantitative information; technically, instead of transforming the data, order statistics (e.g., minimum and maximum observations) are computed to take the place of usual sufficient statistics (e.g., mean and variance).
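The difference between trimming and Winsorizing noted above can be made concrete with a minimal sketch. It assumes that the stated proportion is removed (or replaced) in each tail, and the data are hypothetical values chosen for illustration, not values from the article. The function names `trimmed_mean` and `winsorized_mean` are illustrative.

```python
import math
import statistics


def trimmed_mean(values, proportion=0.2):
    """Mean after dropping `proportion` of the observations from each tail."""
    ordered = sorted(values)
    g = math.floor(proportion * len(ordered))  # observations cut per tail
    kept = ordered[g:len(ordered) - g]
    return statistics.fmean(kept)


def winsorized_mean(values, proportion=0.2):
    """Mean after replacing each tail with the nearest retained value."""
    ordered = sorted(values)
    n = len(ordered)
    g = math.floor(proportion * n)
    low, high = ordered[g], ordered[n - g - 1]  # most extreme retained values
    clipped = [min(max(v, low), high) for v in ordered]
    return statistics.fmean(clipped)


# Hypothetical data (not from the article): nine typical values and one outlier.
sample = [2, 3, 3, 4, 4, 5, 5, 6, 7, 40]

print("mean:           ", statistics.fmean(sample))
print("trimmed mean:   ", trimmed_mean(sample))
print("Winsorized mean:", winsorized_mean(sample))
```

Both estimators discount the single extreme value, which is why they sit well below the ordinary mean here; trimming discards the tails, whereas Winsorizing keeps the sample size by pulling the tails in.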
Figure 1. Histogram and de-trended QQ plot of residuals of N = 14 European countries' percentage differences in daily newspaper reading for males minus females. The solid vertical reference line in the histogram represents the mean, and the dashed vertical reference line represents the median.
Sex difference in percentages for examples 1 and 2.
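| Method | Example 1: Estimate | S.E. | t | p | 95% CI | Example 2: Estimate | S.E. | t | p | 95% CI |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLT | 9.17 | 2.57 | 3.58 | 0.0034 | [3.63, 14.71] | 1.03 | 0.36 | 2.86 | 0.0050 | [0.32, 1.75] |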
| HC0 | 9.17 | 2.47 | 3.71 | 0.0026 | [3.83, 14.51] | 1.03 | 0.36 | 2.87 | 0.0048 | [0.32, 1.75] |
| HC1 | 9.17 | 2.57 | 3.58 | 0.0034 | [3.63, 14.71] | 1.03 | 0.36 | 2.86 | 0.0050 | [0.32, 1.75] |
| HC2 | 9.17 | 2.57 | 3.58 | 0.0034 | [3.63, 14.71] | 1.03 | 0.36 | 2.86 | 0.0050 | [0.32, 1.75] |
| HC3 | 9.17 | 2.66 | 3.45 | 0.0043 | [3.42, 14.92] | 1.03 | 0.36 | 2.85 | 0.0052 | [0.32, 1.75] |
| percBS | 9.17 | – | – | – | [4.55, 13.80] | 1.03 | – | – | – | [0.35, 1.77] |
| BCa | 9.17 | – | – | – | [5.21, 15.02] | 1.03 | – | – | – | [0.37, 1.81] |
| Winsorize | 6.30 | 1.66 | 3.79 | 0.0068 | [2.37, 10.23] | 0.35 | 0.18 | 1.91 | 0.0601 | [–0.015, 0.72] |
| Trim | 5.76 | 1.61 | 3.56 | 0.0090 | [1.96, 9.59] | 0.25 | 0.18 | 1.35 | 0.1820 | [–0.12, 0.61] |
S.E., standard error; CI, confidence interval; CLT, central limit theorem; HC, heteroscedastic consistent method; percBS, percentile bootstrap; BCa, bias corrected and accelerated bootstrap. Estimates are in the direction of male percentages minus female percentages. Winsorized and trimmed means pertain to modifying about 20% of the tail distributions; 20.43% for Example 1 and 20.51% for Example 2.
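One pattern in the table above is that the HC1 and HC2 rows match the CLT row, while HC0 is slightly smaller and HC3 slightly larger. A minimal sketch of why, assuming the mean is treated as an intercept-only linear model (so every observation has leverage 1/n) and using the usual textbook definitions of HC0 to HC3. The data are hypothetical, not the article's, and `mean_standard_errors` is an illustrative name.

```python
import math


def mean_standard_errors(y):
    """Standard errors of the sample mean under the CLT and HC0-HC3.

    The mean is treated as an intercept-only linear model, so every
    observation has leverage h = 1/n and residual e_i = y_i - mean.
    The sandwich variance then reduces to sum(w_i * e_i^2) / n^2.
    """
    n = len(y)
    mean = sum(y) / n
    sse = sum((v - mean) ** 2 for v in y)  # sum of squared residuals
    h = 1.0 / n
    return {
        "CLT": math.sqrt(sse / (n - 1) / n),       # s / sqrt(n)
        "HC0": math.sqrt(sse) / n,                 # w_i = 1
        "HC1": math.sqrt(sse * n / (n - 1)) / n,   # HC0 scaled by n / (n - p), p = 1
        "HC2": math.sqrt(sse / (1 - h)) / n,       # w_i = 1 / (1 - h)
        "HC3": math.sqrt(sse / (1 - h) ** 2) / n,  # w_i = 1 / (1 - h)^2
    }


# Hypothetical differences (not the article's data), n = 14.
differences = [1.5, 2.0, 3.5, 4.0, 5.0, 6.5, 7.0, 8.0, 9.5, 11.0, 12.0, 14.0, 18.0, 30.0]

for name, se in mean_standard_errors(differences).items():
    print(f"{name}: {se:.3f}")
```

With constant leverage, HC1 and HC2 coincide algebraically with the CLT-based S.E., HC0 equals it times sqrt((n - 1)/n), and HC3 equals it times sqrt(n/(n - 1)). For n = 14 and an S.E. of about 2.57, those factors give roughly 2.47 and 2.66 (allowing for rounding), in line with the HC0 and HC3 entries of the first example.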
Figure 2. Histogram and de-trended QQ plot of residuals of N = 117 countries' percentage differences in primary school enrollment for males minus females. The solid vertical reference line in the histogram represents the mean, and the dashed vertical reference line represents the median.
Figure 3. Histograms of bootstrapped sampling distributions of the mean of sex differences in percentages. The solid vertical line represents the estimate, and the dashed vertical lines represent lower and upper bounds to the 95% percentile bootstrap CI.
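The percentile bootstrap CI described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: the number of resamples, the seed, the function name `percentile_bootstrap_ci`, and the data are all hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean.

    Resamples the data with replacement, recomputes the mean each time,
    and reads the interval bounds off the sorted bootstrap means.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    boot_means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    alpha = 1.0 - level
    lower_index = round((alpha / 2) * n_boot)
    upper_index = round((1 - alpha / 2) * n_boot) - 1
    return boot_means[lower_index], boot_means[upper_index]


# Hypothetical skewed sample (not the article's data).
sample = [1.5, 2.0, 3.5, 4.0, 5.0, 6.5, 7.0, 8.0, 9.5, 11.0, 12.0, 14.0, 18.0, 30.0]

lower, upper = percentile_bootstrap_ci(sample)
print(f"mean = {statistics.fmean(sample):.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Because the bounds are quantiles of the bootstrap distribution itself, the interval need not be symmetric about the estimate, which is what the dashed lines in the histograms convey for skewed data.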
Percentage of women incumbents and prestige on income in N = 102 Canadian occupations.
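| Method | Intercept | S.E. | 95% CI | % women | S.E. | 95% CI | Prestige | S.E. | 95% CI | R² |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLT | 6797.9 | 254.79 | [6292.3, 7303.5] | –48.4 | 8.1 | [–64.5, –32.3] | 165.9 | 15.0 | [136.1, 195.6] | 0.64 |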
| HC0 | 6797.9 | 251.02 | [6299.8, 7296.0] | –48.4 | 5.7 | [–59.8, –37.0] | 165.9 | 22.3 | [121.7, 210.1] | 0.64 |
| HC1 | 6797.9 | 254.79 | [6292.3, 7303.5] | –48.4 | 5.8 | [–60.0, –36.8] | 165.9 | 22.6 | [121.0, 210.7] | 0.64 |
| HC2 | 6797.9 | 256.05 | [6289.8, 7306.0] | –48.4 | 5.9 | [–60.0, –36.7] | 165.9 | 22.8 | [120.6, 211.2] | 0.64 |
| HC3 | 6797.9 | 261.22 | [6279.6, 7316.2] | –48.4 | 6.0 | [–60.3, –36.5] | 165.9 | 23.4 | [119.4, 212.4] | 0.64 |
| percBS | 6797.9 | – | [6331.0, 7331.0] | –48.4 | – | [–61.4, –37.3] | 165.9 | – | [123.0, 211.1] | 0.64 |
| BCa | 6797.9 | – | [6387.0, 7408.0] | –48.4 | – | [–65.3, –38.8] | 165.9 | – | [129.8, 221.3] | 0.64 |
| Huber Weights | 6517.5 | 131.34 | [6260.0, 6774.9] | –42.8 | 4.2 | [–51.0, –34.6] | 134.1 | 7.7 | [118.9, 149.2] | 0.59 |
| Biweight | 6389.7 | 120.88 | [6152.7, 6626.6] | –41.4 | 3.9 | [–49.0, –33.8] | 122.9 | 7.1 | [109.0, 136.8] | 0.59 |
| Box-Cox ( | 31.23 | 0.25 | [30.74, 31.73] | –0.07 | 0.01 | [–0.09, –0.05] | 0.21 | 0.01 | [0.18, 0.24] | 0.76 |
| log_e | 8.66 | 0.03 | [8.60, 8.72] | –0.01 | 0.001 | [–0.01, –0.01] | 0.02 | 0.002 | [0.02, 0.03] | 0.74 |
| Reverse log_e | 5770.5 | – | – | 0.99 | – | – | 1.02 | – | – | |
All estimated parameters are statistically significant at the 5% level. S.E., standard error; CI, confidence interval; CLT, central limit theorem; HC, heteroscedastic consistent method; percBS, percentile bootstrap; BCa, bias corrected and accelerated bootstrap. Results based on iterated re-weighted least squares (IRWLS) are an extension of Winsorizing and trimming in linear regression models, where Huber (1964) weights and the biweight (Beaton and Tukey, 1974) are special cases.
S.E.s and CIs are liberal because they have not been corrected for estimating γ.
We caution against reverse transforming S.E.s and CIs because they are biased and statistically inconsistent, and do not present them here.
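The reverse-transformed point estimates in the table can be reproduced by exponentiating the log-scale estimates. The sketch below assumes that the three estimates in the log-transformed row are, in order, the intercept, the slope for percentage of women, and the slope for prestige; the variable names are illustrative. In keeping with the caution above, only point estimates are back-transformed, not S.E.s or CIs.

```python
import math

# Rounded log-scale estimates copied from the log-transformed row of the table.
# Assumed column order: intercept, slope for % women, slope for prestige.
log_intercept = 8.66
log_slope_women = -0.01
log_slope_prestige = 0.02

# On the original income scale the intercept becomes a typical income at the
# centered (average) predictor values, and each slope becomes a multiplicative
# factor per one-unit increase in the predictor.
typical_income = math.exp(log_intercept)
factor_women = math.exp(log_slope_women)
factor_prestige = math.exp(log_slope_prestige)

print(f"typical income:            {typical_income:.1f}")
print(f"factor per % women:        {factor_women:.2f}")
print(f"factor per prestige point: {factor_prestige:.2f}")
```

The two factors round to 0.99 and 1.02, matching the reverse-transformed row: roughly a 1% lower income per additional percentage point of women incumbents and roughly 2% higher income per prestige point. The back-transformed intercept lands near, but not exactly on, the reported 5770.5, presumably because the table's log-scale estimate is rounded to two decimals.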
Figure 4. Residual by predicted plots for N = 102 Canadian occupations in 1971 where income, Box-Cox transformed income, and log-transformed income are regressed onto centered values of percentage of women incumbents and prestige scores. Spline curves are presented as solid lines overlaying the points.