Literature DB >> 29719361

Relationship between Omnibus and Post-hoc Tests: An Investigation of performance of the F test in ANOVA.

Tian Chen¹, Manfei Xu², Justin Tu³, Hongyue Wang⁴, Xiaohui Niu⁵.

Abstract

Comparison of groups is a common statistical test in many biomedical and psychosocial research studies. When there are more than two groups, one first performs an omnibus test for an overall difference across the groups. If this null is rejected, one then proceeds to the next step of post-hoc pairwise group comparisons to determine sources of difference. Otherwise, one stops and declares no group difference. A common belief is that if the omnibus test is significant, there must exist at least two groups that are significantly different and vice versa. Thus, when the omnibus test is significant, but no post-hoc between-group comparison shows significant difference, one is bewildered at what is going on and wondering how to interpret the results. At the end of the spectrum, when the omnibus test is not significant, one wonders if all post-hoc tests will be non-significant as well so that stopping after a nonsignificant omnibus test will not lead to any missed opportunity of finding group difference. In this report, we investigate this perplexing phenomenon and discuss how to interpret such results.

Entities: Disease Species

Keywords: F test; Omnibus test; Tukey’s test; post-hoc test

Year: 2018 PMID： 29719361 PMCID： PMC5925602 DOI： 10.11919/j.issn.1002-0829.218014

Source DB: PubMed Journal: Shanghai Arch Psychiatry ISSN： 1002-0829

1. Introduction

Comparison of groups is a common issue of interest in most biomedical and psychosocial research studies. In many studies, there are more than two groups, in which case the popular t-test for two (independent) groups no longer applies and models for comparing more than two groups must be used, such as the analysis of variance, ANOVA, model.[ When comparing more than two groups, one follows a hierarchical approach. Under this approach, one first performs an omnibus test, which tests the null hypothesis of no difference across groups, i.e., all groups have the same mean. If this test is not significant, there is no evidence in the data to reject the null and one then concludes that there is no evidence to suggest that the group means are different. Otherwise, post-hoc tests are performed to find sources of difference. During post-hoc analysis, one compares pairs of groups and finds all pairs that show significant difference. This hierarchical procedure is predicated upon the premise that if the omnibus test is significant, there must exist at least two groups that are significantly different and vice versa. The hierarchical procedure is taught in basic as well as advanced statistics courses and built into many popular statistical packages. For example, when performing the analysis of variance (ANOVA) model for comparing multiple groups, the omnibus test is carried out by the F-statistic.[ For post-hoc analyses, one can use a number of specialized procedures such as Tukey’s and Scheffe’s tests.[ Special statistical tests are needed for performing post-hoc analyses, because of potentially inflated type I errors when performing multiple tests to identify the groups that have different means. Tukey’s, Scheffe’s and other post-hoc tests are all adjusted for such multiple comparisons to ensure correct type I errors in the fact of multiple testing. In practice, however, it seems quite often that none of the post-hoc tests are significant, while the omnibus test is significant. The reverse seems to occur often as well; when the omnibus test is not significant, although some of the post-hoc tests are significant. To the best of our knowledge, there does not appear a general, commonly accepted approach to handle such a situation. In this report, we examine this hierarchical approach and see how well it performs using simulated data. We want to know if a significant omnibus test guarantees at least one post-hoc test and vice versa. Although the statistical problem of comparing multiple groups is relevant to all statistical models, we focus on the relatively simpler analysis of variance (ANOVA) model and start with a brief overview of this popular model for comparing more than two groups.

2. One-Way Analysis of Variance (ANOVA)

2.1 The Statistical Model

The analysis of variance (ANOVA) model is widely used in research studies for comparing multiple groups. This model extends the popular t-test for comparing two (independent) groups to the general setting of more than two groups.[ Consider a continuous outcome of interest, Y, and let I denote the number of groups. We are interested in comparing the (population) mean of Y across the I groups. The classic analysis of variance (ANOVA) model has the form: where Y is the outcome from the j th subject within the i th group, µ = E(Y) is the (population) mean of the i th group, ε is the error term, N(µ,σ[2]) denotes the normal distribution with mean µ and variance σ[2], and n is the sample size for the i th group. With the statistical model in Equation (1), the primary objective of group comparison can be stated in terms of statistical hypotheses as follows. First, we want to know if all the groups have the same mean. Under the ANOVA above, the null and alternative hypothesis for this comparison of interest is stated as: Thus, under the null H0, all groups have the same mean. If H0 is rejected in favor of the alternative H, there are at least two groups that have different means. When performing ANOVA, one first tests the hypothesis in Equation (2). If this omnibus test is not rejected, then one concludes that there is evidence to indicate different means across the groups. Otherwise, there is evidence against the null in favor of the alternative and one then proceeds to the next step to identify the groups that have different means from each other. For I groups, there are a total of I(I-1)/2 pairs of groups to examine. In the post-hoc testing phase, one performs I(I-1)/2 tests to identify the groups that have different group means µ. This number I(I-1)/2 can be large, especially where there is a large number of groups. Thus, performing all such tests can potentially increase type I errors. The popular t-test for comparing two (independent) groups is inappropriate and specially designed tests must be used to account for accumulated type I errors due to multiple testing to ensure correct type I errors. Next we review the omnibus and some post-hoc tests, which will later be used in our simulation studies.

2.1 The Omnibus F Test for No Difference Across Groups

The omnibus test for comparing all group means simultaneously within the context of ANOVA is the F-test. The F-test is defined by quantities in the so-called ANOVA table. To set up this table, let us define the following quantities: where N is the total sample size, denotes sum of all denotes sum of A over both indices i and j, i.e., The ANOVA table is defined by: In the ANOVA table above, SS(R) is called the regression sum of squares, SS(E) is called the error sum of squares, SS(Total) is called the total sum of squares, MS(R) is called the mean regression sum of squares, and SS(E)is called the mean error sum of squares. These sums of squares characterize variability of all the groups (1) when ignoring differences in the group means (SS(Total)), and (2) after accounting for such differences (SS(R)). For example, it can be shown that Thus, if the group means help explain a large amount of variability in group differences, SS(R) will be close to SS(Total), resulting in small SS(E), in which case groups means are likely to be different. Otherwise, SS(R) will be small and SS(E) will become close to SS(Total), in which case group means are unlikely to be different. By normalizing SS(R) with respect to the number of groups and SS(E) with respect to the total sample size, the mean squares MS(R) and MS(E) can be used to quantify the relative difference between SS(R) and SS(E) to help discern where the group means are different. Under the null hypothesis H0, the ratio, or F statistic, , follows the F-test: where F1, denotes the F-distribution with I – 1 (numerator) degrees of freedom and N – I (denominator) degrees of freedom. As noted earlier, a larger MS(R) relative to MS(E) indicates evidence against the null and vice versa. This is consistent with the fact that a larger value of the F-statistic leads to rejection of the null and vice versa.

2.2 Tests for Post-hoc Group Comparison

If the null of no group mean difference is rejected by the F-test, one then proceeds to the next step to identify groups that have different group means. Multiple specialized procedures are available to perform such post-hoc tests by preserving the type I error. For example, if the group size is the same for all groups, i.e., n1 = n for all 1 ≤ i ≤ I, we can use Tukey’s procedure. We first rank the sample group means Ȳ and then test if two groups with sample group means, Ȳ and Ȳ, have different (population) group means, i.e., µ, by the following criteria: where s[2]=MS(E), qα(1,N-1) is the upper-tail critical value of the Studentized range for comparing I groups and N is the common group size.

2.3 Simulation Study

When applying the ANOVA for comparing multiple groups, we first perform the omnibus F test and then follow with post-hoc pairwise group comparisons if the omnibus test is significant. Otherwise, we stop and draw the conclusion that there is no evidence of rejecting the null of no group difference. Implicit in the procedure is the assumption that a significant omnibus test implies at least one significant pairwise comparison and vice versa. If this assumption fails, this procedure will either (1) yield false positive (significant omnibus test, but no significant pairwise test) or (2) false negative (no significant omnibus test, but at least one significant post-hoc test) results. In the first scenario, it is difficult to logically reconcile such differences and report findings, while the second scenario also leads to missed opportunity to find group difference. In this section, we use simulated data to examine this assumption upon which this popular hierarchical procedure is predicated. For brevity and without loss of generality, we consider four groups and assume a common group size n. Then the ANOVA model for the simulation study setting is given by: We assume the first three groups have the same mean, i.e., µ1 =µ2 =µ3 = µ, which differs from the mean of the fourth group by , i.e., µ4 = µ + d, with d > 0. For the simulation study, we set µ = 1 and σ[2] = 1 in all the simulations, but vary the group size n and type I error level α to see how the performance of the hierarchical procedure changes when these parameters vary. To compare group means using the hierarchical procedure, we first test the null of no group mean difference across the four groups: If the above null is rejected, then we proceed to performing pairwise comparison of the four groups to identify groups that are significantly different from each other, i.e., H : µ v.s. H : µ, for all pairs (i,k), 1≤i Within the simulation study setting, there are a total of post-hoc tests. To see how well the hierarchical procedure performs for the simulated data, we use Monte Carlo replications and set the Monte Carlo sample size to M= 1,000. Thus, for a given group size n, we simulate data Y from the ANOVA in Equation (3) and then perform the F test to test the null of no group mean difference in Equation (4). If the F test is significant, we proceed to the post-hoc phase by performing six pairwise group comparisons in Equation (5). Shown in Table 1 under “F” is the percent of times the F test is significant for testing the null of no group mean difference and under “Tukey” is the percent of times at least one of the post-hoc Tukey’s tests is significant based on M = 1,000 Monte Carlo replications, as a function of sample size n, difference d, and type I error level α. The percent is actually an estimate of power, or empirical power, for rejecting the respective null when the null is false. Since power increases with sample size as well as differences between the means, the percent increased as sample size n grew from 20 to 40 and the difference between the group means d increased from 0.5 to 1 for both α = 0.05 and α = 0.001. Also, as expected, the percent became smaller as α reduced from α = 0.05 to α = 0.001.

Table 1.

p-values from the F test and Tukey’s test (at least one of the pairs is significant), rates of false positive (FP) and rates of false negative (FN) based on M=1000 Monte Carlo replications.

Sample	α = 0.05				α = 0.001
size		F	Tukey	FP	FN	F	Tukey	FP	FN
n=20	d=0.5	0.318	0.313	0.091	0.035	0.032	0.033	0.188	0.007
n=40	d=0.5	0.597	0.585	0.059	0.057	0.143	0.126	0.196	0.013
n=20	d=0.8	0.718	0.711	0.039	0.074	0.220	0.197	0.182	0.022
n=40	d=0.8	0.969	0.967	0.004	0.065	0.466	0.448	0.079	0.042

The objective of comparing multiple groups is to find the groups that have different group means. The hierarchical procedure is intended to facilitate this task by first performing a “screening” test to see if it is necessary to further delve into comparisons of all pairs of groups. In this sense, we may characterize the performance of the hierarchical procedure using the “false positive” (FP) and “false negative” (FN) rates defined by: where Pr(B | A) denotes the probability that event B occurs given event A. Thus, FP is the proportion that none of the post-hoc tests is significant given a significant F test, while FN is the proportion that at least one post-hoc test is significant given a nonsignificant F test. In the case of FP, we have a false alarm, while in the case of FN, we miss the opportunity to find group differences. Shown in Table 1 under “FP” is an estimate of FP by the percent of non-significant post-hoc test among the significant F tests. It is interesting that FP increased substantially when α changed from α = 0.05 to α = 0.001. For example, for n = 20 and d = 0.5, FP is about 10% when α = 0.05, but increased to nearly 20% when α = 0.001. In other words, we have about 10% false alarm when α = 0.05, but nearly 20% false alarm when α = 0.001. Shown in Table 1 under “FN” is an estimate of FN by the percent of significant post-hoc test among the non-significant F tests. As in the case of FP, FN also varied as a function of α. But, unlike FP, FN decreased when α changed from α = 0.05 to α = 0.001. Also, FN is smaller compared to FP. For example, for n = 20 and d = 0.5, FN is about 4% when α = 0.05 and less than 1% when α = 0.001. In other words, we have about 10% false alarm when α = 0.05, but nearly 20% false alarm when α = 0.001.

3. Discussion

In this report, we investigated performance of the omnibus test using simulated data. The hierarchical procedure is a widely used approach for comparing multiple (more than two) groups.[ The omnibus test is intended to preserve type I errors by eliminating unnecessary post-hoc analyses under the null of no group difference. However, our simulation study shows that the hierarchical approach is not guaranteed to work all the time. The omnibus and post-hoc tests are not always in agreement. As our goal of comparing multiple groups is to find groups that have different means, a significant omnibus test gives a false alarm, if none of the post-hoc tests are significant. But, most important, we may also miss opportunities to detect group differences, if we have a non-significant omnibus test, since some or all post-hoc tests may still be significant in this case. Although we focus on the classic ANOVA model in this report, the same considerations and conclusions also apply to more complex models for comparing multiple groups, such as longitudinal data models [. Since for most models, post-hoc tests with significant levels adjusted to account for multiple testing do not have exactly the same type I error as the omnibus test as in the case of ANOVA, it is more difficult to evaluate performance of the hierarchical procedure. For example, the Bonferroni correction is generally conservative. Given our findings, it seems important to always perform pairwise group comparisons, regardless of the significance status of the omnibus test and report findings based on such group comparisons.

Source	degree of freedom ¢df)	Mean Squre (MS)
Groups	I – 1	MS(R) = SS(R)/(I – 1)
Error	N – I	MS(E) = SS(E)/(N – I)
Total	N – 1

11 in total

1. High prevalence and adverse health effects of loneliness in community-dwelling adults across the lifespan: role of wisdom as a protective factor.

Authors: Ellen E Lee; Colin Depp; Barton W Palmer; Danielle Glorioso; Rebecca Daly; Jinyuan Liu; Xin M Tu; Ho-Cheol Kim; Peri Tarr; Yasunori Yamada; Dilip V Jeste
Journal: Int Psychogeriatr Date: 2019-10 Impact factor: 3.878

2. Impaired P1 Habituation and Mismatch Negativity in Children with Autism Spectrum Disorder.

Authors: Francisco J Ruiz-Martínez; Elena I Rodríguez-Martínez; C Ellie Wilson; Shu Yau; David Saldaña; Carlos M Gómez
Journal: J Autism Dev Disord Date: 2020-02

3. The association between symptom onset characteristics and prehospital delay in women and men with acute coronary syndrome.

Authors: Sahereh Mirzaei; Alana Steffen; Karen Vuckovic; Catherine Ryan; Ulf G Bronas; Jessica Zegre-Hemsey; Holli A DeVon
Journal: Eur J Cardiovasc Nurs Date: 2019-09-11 Impact factor: 3.908

4. Knowing Students' Characteristics: Opportunities to Adapt Physical Education Teaching.

Authors: Alina Kirch; Melina Schnitzius; Sarah Spengler; Simon Blaschke; Filip Mess
Journal: Front Psychol Date: 2021-02-12

5. Emerging School Readiness Profiles: Motor Skills Matter for Cognitive- and Non-cognitive First Grade School Outcomes.

Authors: Erica Kamphorst; Marja Cantell; Gerda Van Der Veer; Alexander Minnaert; Suzanne Houwen
Journal: Front Psychol Date: 2021-11-23

6. Amifostine (WR-2721) Mitigates Cognitive Injury Induced by Heavy Ion Radiation in Male Mice and Alters Behavior and Brain Connectivity.

Authors: Sydney Weber Boutros; Benjamin Zimmerman; Sydney C Nagy; Joanne S Lee; Ruby Perez; Jacob Raber
Journal: Front Physiol Date: 2021-11-16 Impact factor: 4.566

7. Participation preferences of health service users in health care decision-making regarding rehabilitative care in Germany-A cross-sectional study.

Authors: Lisa A Baumann; Anna L Brütt
Journal: Health Expect Date: 2021-09-14 Impact factor: 3.377

8. Videogame and Computer Intervention Effects on Older Adults' Mental Rotation Performance.

Authors: Brad Taylor; Anna Yam; Patricia Belchior; Michael Marsiske
Journal: Games Health J Date: 2021-06

9. How Many Proximal Screws Are Needed for a Stable Proximal Humerus Fracture Fixation?

Authors: Hyojune Kim; Myung Jin Shin; Erica Kholinne; Janghyeon Seo; Duckwoo Ahn; Ji Wan Kim; Kyoung Hwan Koh
Journal: Geriatr Orthop Surg Rehabil Date: 2021-02-09

10. Common cancer treatments targeting DNA double strand breaks affect long-term memory and relate to immediate early gene expression in a sex-dependent manner.

Authors: Sydney Weber Boutros; Destine Krenik; Sarah Holden; Vivek K Unni; Jacob Raber
Journal: Oncotarget Date: 2022-01-24