| Literature DB >> 33963496 |
Violating the normality assumption may be the lesser of two evils
Ulrich Knief, Wolfgang Forstmeier.
Abstract
When data are not normally distributed, researchers are often uncertain whether it is legitimate to use tests that assume Gaussian errors, or whether one has to either model a more specific error structure or use randomization techniques. Here we use Monte Carlo simulations to explore the pros and cons of fitting Gaussian models to non-normal data in terms of risk of type I error, power and utility for parameter estimation. We find that Gaussian models are robust to non-normality over a wide range of conditions, meaning that p values remain fairly reliable except for data with influential outliers judged at strict alpha levels. Gaussian models also performed well in terms of power across all simulated scenarios. Parameter estimates were mostly unbiased and precise except if sample sizes were small or the distribution of the predictor was highly skewed. Transformation of data before analysis is often advisable, and visual inspection for outliers and heteroscedasticity is important for assessment. In strong contrast, some non-Gaussian models and randomization techniques bear a range of risks that are often insufficiently known. High rates of false-positive conclusions can arise, for instance, when overdispersion in count data is not controlled appropriately or when randomization procedures ignore existing non-independencies in the data. Hence, newly developed statistical methods not only bring new opportunities, but they can also pose new threats to reliability. We argue that violating the normality assumption bears risks that are limited and manageable, while several more sophisticated approaches are relatively error prone and particularly difficult to check during peer review. Scientists and reviewers who are not fully aware of the risks might benefit from preferentially trusting Gaussian mixed models in which random effects account for non-independencies in the data.
Keywords: Hypothesis testing; Linear model; Normality; Regression
Year: 2021 PMID: 33963496 PMCID: PMC8613103 DOI: 10.3758/s13428-021-01587-5
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X
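The Monte Carlo logic described in the abstract can be sketched in a few lines of R. This is a minimal illustrative re-implementation, not the authors' TrustGauss code: under a true null effect, a Gaussian lm() is fitted to strongly skewed data and the realized type I error rate is compared with the nominal α of 0.05.

```r
## Minimal sketch of the Monte Carlo idea (not the authors' TrustGauss code):
## fit a Gaussian linear model to skewed, null-effect data and record
## how often the slope's p value falls below the nominal alpha of 0.05.
set.seed(42)
n.sims <- 10000
n      <- 100
p.vals <- replicate(n.sims, {
  x <- rnorm(n)                              # Gaussian predictor
  y <- rgamma(n, shape = 0.1, scale = 100)   # strongly skewed Y (cf. D9), independent of x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
mean(p.vals < 0.05)  # realized type I error rate; robustness means this stays near 0.05
```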
Table 1 Description of the ten simulated distributions of the dependent variable Y and the predictor X
| Name | Sampling distribution | Mean | Variance | Categories | Degree of zero-inflation | Skewness† | Kurtosis† | Arguments in TrustGauss§ |
|---|---|---|---|---|---|---|---|---|
| D0 | Gaussian | 0 | 1 | - | 0 | 1.9 × 10⁻⁵ | 3.00 | DistributionY=“Gaussian”, MeanY.gauss=0, SDY.gauss=1 |
| D1 | Binomial | 0.5 | 0.25 | - | 0 | 6.5 × 10⁻⁶ | 1.00 | DistributionY=“Binomial”, zeroLevelY.zero=0.5 |
| D2 | Gaussian with categories and zero-inflation# | 0 | 1 | 5 | 0.5 | 0.64 | 2.02 | DistributionY=“GaussianZeroCategorical”, MeanY.gauss=3, SDY.gauss=1, nCategoriesY.cat=5 |
| D3 | Gaussian with zero-inflation# | 0 | 1 | - | 0.5 | 0.45 | 1.69 | DistributionY=“GaussianZero”, MeanY.gauss=3, SDY.gauss=1, zeroLevelY.zero=0.5 |
| D4 | Absolute Gaussian# | 0 | 1 | - | 0 | 1.00 | 3.87 | DistributionY=“AbsoluteGaussian”, MeanY.gauss=0, SDY.gauss=1 |
| D5 | Student's t | 0 | 2 | - | 0 | 0.01 | 20.71 | DistributionY=“StudentsT”, DFY.student=4 |
| D6 | Gamma with categories# | 10 | 100 | 3 | 0 | 3.45 | 15.09 | DistributionY=“GammaCategorical”, nCategoriesY.cat=3, ShapeY.gamma=1, ScaleY.gamma=10 |
| D7 | Negative Binomial | 10 | 110 | - | 0 | 2.00 | 9.02 | DistributionY=“NegativeBinomial”, ShapeY.gamma=1, ScaleY.gamma=10 |
| D8 | Binomial | 0.9 | 0.09 | - | 0 | -2.67 | 8.12 | DistributionY=“Binomial”, zeroLevelY.zero=0.90 |
| D9 | Gamma | 10 | 1000 | - | 0 | 6.32 | 62.84 | DistributionY=“Gamma”, ShapeY.gamma=0.1, ScaleY.gamma=100 |
#Mean and Variance refer to the distributions prior to adding categories, zero-inflation or taking the absolute values.
†Skewness and kurtosis were estimated from the simulated distributions with 50 million data points using the moments R package (v0.14, Komsta & Novomestky, 2015).
§Here we specified the arguments for the dependent variable Y only. However, the specified values are identical for the independent variable X.
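The tabulated moments can be verified directly; a sketch for distribution D9, using the moments package cited in the footnote (with a smaller sample than the authors' 50 million points):

```r
## Sketch: check the Table 1 entries for D9 (Gamma, shape = 0.1, scale = 100).
library(moments)  # Komsta & Novomestky (2015)
set.seed(1)
d9 <- rgamma(1e6, shape = 0.1, scale = 100)
mean(d9); var(d9)           # expected: 10 and 1000
skewness(d9); kurtosis(d9)  # expected: ~6.32 and ~62.84 (theoretical: 6.32 and 63)
```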
Fig. 1 p values from Gaussian linear regression models are in most cases unbiased. a Overview of the ten different distributions that we simulated. Distribution D0 is Gaussian and all remaining distributions are sorted by their tendency to produce strong outliers. Distributions D1, D2, D6, D7, and D8 are discrete. The numbers D0–D9 refer to the plots in b–e, where the distribution of the dependent variable is indicated on the Y-axis and that of the predictor on the X-axis. b Type I error rate at an α-level of 0.05 for sample sizes of N = 10, 100, and 1000. Red colors represent inflated and blue colors conservative type I error rates. c Scale shift parameter, d bias in p values at an expected p value of 10⁻³ and e bias in p values at an expected p value of 10⁻⁴
Fig. 2 Power, bias, and precision of parameter estimates from Gaussian linear regression models are in most cases unaffected by the distributions of the dependent variable Y or the predictor X. a Overview of the different distributions that we simulated, which were the same as in Fig. 1. The numbers D0–D9 refer to the plots in b–e, where the distribution of the dependent variable is indicated on the Y-axis and that of the predictor on the X-axis. b Power at a regression coefficient b = 0.2 for sample sizes of N = 10, 100, and 1000. Red colors represent increased power. c Power at regression coefficients b = 0.59, 0.19, and 0.06 for sample sizes of N = 10, 100, and 1000, respectively, where the expected power derived from a normally distributed Y and X is 0.5. Red colors represent increased and blue colors decreased power. d Bias and e precision of the regression coefficient estimates at an expected b = 0.2 for sample sizes of N = 10, 100, and 1000
Table 2 Summary of power, bias, and precision of parameter estimates and interpretability from 50,000 simulation runs across the six combinations of the dependent variable Y and the predictor X. Each combination was fitted either with a Gaussian error structure or with the error structure appropriate to the distribution of Y (i.e., either Poisson with a mean of 1 or binomial with a mean of 0.75). The predefined effect was chosen such that a power of around 0.5 was reached (see Table S2 for details). The column Effect is the mean estimated effect (intercept + slope) after back-transformation
| Distribution of Y | Distribution of X | Error distribution | Sample size | Power at α = 0.05 | Power at α = 0.001 | Mean of slope | Variance in slope | CV of slope | Mean intercept | Variance in intercept | CV of intercept | Effect | Variance in effect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Poisson | Gaussian | Gaussian | 100 | 0.522 | 0.094 | 0.200 | 9.96 × 10⁻³ | 0.498 | 1.000 | 9.70 × 10⁻³ | 0.098 | 1.201 | 0.023 |
| Poisson | Gaussian | Poisson | 100 | 0.511 | 0.090 | 1.228 | 0.015 | 0.100 | 0.976 | 9.80 × 10⁻³ | 0.101 | 1.195 | 0.022 |
| Binomial | Gaussian | Gaussian | 100 | 0.502 | 0.085 | 0.085 | 1.79 × 10⁻³ | 0.500 | 0.750 | 1.82 × 10⁻³ | 0.057 | 0.835 | 2.84 × 10⁻³ |
| Binomial | Gaussian | Binomial | 100 | 0.504 | 0.091 | 0.617 | 3.63 × 10⁻³ | 0.098 | 0.762 | 2.03 × 10⁻³ | 0.059 | 0.834 | 2.75 × 10⁻³ |
| Poisson | Gamma | Gaussian | 100 | 0.588 | 0.162 | 0.023 | 1.28 × 10⁻⁴ | 0.502 | 0.776 | 1.28 × 10⁻⁴ | 0.176 | 0.798 | 0.017 |
| Poisson | Gamma | Poisson | 100 | 0.537 | 0.095 | 1.019 | 7.67 × 10⁻⁵ | 0.009 | 0.818 | 7.67 × 10⁻⁵ | 0.142 | 0.833 | 0.013 |
| Binomial | Gamma | Gaussian | 100 | 0.459 | 0.029 | 0.008 | 1.55 × 10⁻⁵ | 0.481 | 0.669 | 4.12 × 10⁻³ | 0.096 | 0.677 | 3.75 × 10⁻³ |
| Binomial | Gamma | Binomial | 100 | 0.549 | 0.113 | 0.517 | 1.15 × 10⁻⁴ | 0.021 | 0.634 | 6.87 × 10⁻³ | 0.131 | 0.650 | 5.59 × 10⁻³ |
| Poisson | Binomial | Gaussian | 100 | 0.673 | 0.126 | 0.534 | 0.039 | 0.371 | 0.599 | 0.025 | 0.265 | 1.133 | 0.014 |
| Poisson | Binomial | Poisson | 100 | 0.699 | 0.189 | 1847.624 | 1.70 × 10¹¹ | 223.359 | 0.599 | 0.025 | 0.264 | 1.132 | 0.014 |
| Binomial | Binomial | Gaussian | 100 | 0.510 | 0.127 | 0.200 | 0.012 | 0.551 | 0.600 | 9.96 × 10⁻³ | 0.166 | 0.800 | 2.15 × 10⁻³ |
| Binomial | Binomial | Binomial | 100 | 0.491 | 0.094 | 0.717 | 0.011 | 0.146 | 0.600 | 0.010 | 0.167 | 0.800 | 2.16 × 10⁻³ |
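Reading the Effect column: given the identity link of the Gaussian fits, the log link of the Poisson fits, and the logit link of the binomial fits, the effect is the inverse link applied to intercept + slope. A sketch with hypothetical link-scale coefficients (my reading of the caption, not code from the paper):

```r
## Sketch: the "Effect" column as inverse-link(intercept + slope).
## b0 and b1 are hypothetical link-scale coefficients, for illustration only.
b0 <- -0.024; b1 <- 0.205
b0 + b1          # Gaussian model (identity link): effect on the raw scale
exp(b0 + b1)     # Poisson model (log link): effect on the count scale
plogis(b0 + b1)  # binomial model (logit link): effect on the probability scale
```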
Fig. 3 Distribution of observed p values (when the null hypothesis is true) as a function of different model specifications (columns) and different distributions of the dependent variable Y (rows a to e). Each panel was summed up across ten different distributions of the predictor X (500,000 simulations per panel with N = 100 data points per simulation). Models were fitted either as glms with a Gaussian error structure that violate the normality assumption (first column), as glms with a quasi-Poisson error structure that take overdispersion into account (second column), as glmms with a Poisson error structure and an observation-level random effect (OLRE; Harrison et al., 2018) (third column), or as glms with a Poisson error structure that violate the assumption of the Poisson distribution (fourth column). In each panel, TIER indicates the realized type I error rate (across the ten different predictor distributions), highlighted with a color scheme as in Fig. 1b (blue: below the nominal level of 0.05; red: above the nominal level; grey: closely matching the nominal level). The dependent variable Y was distributed as a distribution D1, b distribution D2, c distribution D6, d distribution D7 or e distribution D8 (see Table 1 and Fig. 1a for details)
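The four model specifications compared across the columns of Fig. 3 can be written out in R as follows (a sketch with simulated overdispersed counts resembling D7; lme4::glmer() supplies the observation-level random effect):

```r
## Sketch of the four model specifications in Fig. 3, fitted to
## simulated overdispersed counts (mean 10, variance 110; cf. D7).
library(lme4)
set.seed(7)
dat     <- data.frame(x = rnorm(100))
dat$y   <- rnbinom(100, mu = 10, size = 1)   # overdispersed counts, no true effect
dat$obs <- factor(seq_len(nrow(dat)))        # one level per row: the OLRE

m.gauss <- glm(y ~ x, family = gaussian,     data = dat)  # violates normality
m.quasi <- glm(y ~ x, family = quasipoisson, data = dat)  # absorbs overdispersion
m.olre  <- glmer(y ~ x + (1 | obs), family = poisson, data = dat)  # Poisson glmm + OLRE
m.pois  <- glm(y ~ x, family = poisson,      data = dat)  # ignores overdispersion
```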
(A) Many researchers, being concerned about fitting an “inappropriate” Gaussian model, hold the belief that binomial data always require modelling a binomial error structure, and that count data mandate modelling a Poisson-like process. Yet what they consider to be “more appropriate for the data at hand” may often fail to acknowledge the non-independence of events in count data (Forstmeier et al., …).

(B) When observational data do not comply with any distributional assumption, randomization techniques like bootstrapping seem to offer an ideal solution for working out the rate at which a certain estimate arises by chance alone (Good, …).
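A sketch of the randomization idea in (B), and of the caveat that comes with it: a simple permutation test of a regression slope is valid only if the values of Y are exchangeable, i.e., independent, under the null.

```r
## Sketch: naive permutation test of a regression slope.
set.seed(3)
n <- 100
x <- rnorm(n)
y <- rgamma(n, shape = 1, scale = 10)   # skewed Y, no true effect
b.obs  <- coef(lm(y ~ x))["x"]
b.perm <- replicate(5000, coef(lm(sample(y) ~ x))["x"])
mean(abs(b.perm) >= abs(b.obs))  # two-sided permutation p value
## If the y values were non-independent (e.g., repeated measures of the same
## subjects), this unrestricted shuffling would yield anticonservative p values.
```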
Assumptions of the linear model:
(1) …
(2) Each value of the dependent variable is independent of the other values.
(3) The dependent variable is linearly related to the predictor.
(4) The variance in the regression error is constant across the range of the predictor (homoscedasticity).
(5) The errors of the model should be normally distributed.
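Assumptions (4) and (5) are commonly checked visually, as the abstract recommends for outliers and heteroscedasticity; a minimal sketch using the standard base-R diagnostic plots (illustrative, not a procedure prescribed by the paper):

```r
## Sketch: visual checks of homoscedasticity (4) and normality of errors (5).
set.seed(11)
x   <- rnorm(100)
y   <- 1 + 0.2 * x + rnorm(100)
fit <- lm(y ~ x)
par(mfrow = c(2, 2))
plot(fit)  # residuals vs. fitted, Q-Q plot, scale-location, residuals vs. leverage
```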