
Moving beyond traditional null hypothesis testing: evaluating expectations directly.

Rens van de Schoot, Herbert Hoijtink, Jan-Willem Romeijn.

Abstract

This mini-review illustrates that testing the traditional null hypothesis is not always the appropriate strategy. Half in jest, we discuss Aristotle's scientific investigations into the shape of the earth in the context of evaluating the traditional null hypothesis. We conclude that Aristotle was actually interested in evaluating informative hypotheses. In contemporary science the situation is not much different. That is, many researchers have no particular interest in the traditional null hypothesis. More can be learned from data by evaluating specific expectations, or so-called informative hypotheses, than by testing the traditional null hypothesis. These informative hypotheses will be introduced while providing an overview of the literature on evaluating informative hypotheses.


Keywords:  Bayesian analysis; inequality constraints; informative hypothesis; null hypothesis testing

Year:  2011        PMID: 21713172      PMCID: PMC3111216          DOI: 10.3389/fpsyg.2011.00024

Source DB:  PubMed          Journal:  Front Psychol        ISSN: 1664-1078


Introduction

The present mini-review argues that testing the traditional null hypothesis is not always the appropriate strategy. That is, many researchers have no particular interest in the hypothesis "nothing is going on" (Cohen, 1990). So why test a hypothesis one is not really interested in? The APA stresses in its publication manual that null hypothesis testing should be only a starting point for statistical analyses: "Reporting elements such as effect sizes and confidence intervals are needed to convey the most complete meaning of the results" (American Psychological Association, 2001, p. 33; see also Fidler, 2002). In the current paper we go beyond this first step of reporting effect sizes and confidence intervals, arguing that specific expectations should be evaluated directly. As Osborne (2010) stated: "The world doesn't need another journal promulgating 20th century thinking, genuflecting at the altar of p < 0.05. I challenge us to challenge tradition" (p. 3). This is exactly what we set out to do in the current paper. Statistical tools for the evaluation of informative hypotheses are becoming available and are increasingly used in applications. We provide an overview of the current state of affairs in the evaluation of informative hypotheses. But first we argue, half in jest, what is "wrong" with the traditional null hypothesis, and introduce the informative hypothesis. One important prior note has to be made. Researchers such as Wagenmakers et al. (2008) criticize t-tests for not rendering legitimate results and argue that p-values are prone to misinterpretation. Others, such as Coulson et al. (2010) and Fidler and Thompson (2001), explicitly argue against solely reporting p-values and advocate the use of confidence intervals. Along similar lines, Rosenthal et al. (2000) propose focused contrasts that can be used to evaluate expectations directly. However, in the current paper we focus on developments in statistics that move beyond confidence intervals, effect sizes, and planned contrasts.

What is “Wrong” with the Traditional Null Hypothesis?

Cohen (1994) aptly summarized the criticism of traditional null hypothesis testing in the title of his paper "The earth is round (p < 0.05)." Let us elaborate on his criticism using an example inspired by this title, originally meant to instruct and entertain. The question of the shape of the earth was a recurring issue in scientific debate during the era of Aristotle (384–322 BC; see Russell, 1997). By that time, the Greek idea that the earth was round dominated scientific thinking. The only serious opponents were the atomists Leucippus and Democritus, who still believed that the earth was a flat disk floating in the ocean, as certain ancient Mesopotamian philosophers had maintained. Now let us embark on some historical science fiction to tell the story of how Aristotle in his scientific investigations might have used different ways of evaluating hypotheses. We propose that in order to falsify the old Mesopotamian hypothesis, Aristotle might have used an approach based on testing the traditional null hypothesis: H0: The shape of the earth is a flat disk, H1: The shape of the earth is not a flat disk. Clearly, these are not statistical hypotheses and no actual statistical inference could have been carried out; they are purely designed to serve as an example. So, in the setup of our historical science fiction, Aristotle would have gathered data about the shape of the earth and found evidence against the null hypothesis, for example: stars that were seen in Egypt were not seen in countries north of Egypt, while stars that never set in northern Europe were seen to rise and set in Egypt. Such observations could not be taken as evidence of a flat earth. H0 would have been rejected, leading Aristotle to conclude that the earth cannot be represented by a flat disk. In actual fact, Aristotle agreed with Pythagoras (582 to ca. 507 BC), who believed that all astronomical objects have a spherical shape, including the earth.
So, once again embarking on an episode of imaginary history, Aristotle might also have tested: H0′: The shape of the earth is a sphere, H1′: The shape of the earth is not a sphere. Now, imagine that Aristotle continued his search for data and that he gathered data yielding evidence against (!) this null hypothesis: while standing on a mountain top, he noticed that the earth's surface has many irregularities and concluded that if enough irregularities could be observed, this might provide just enough evidence to reject the null hypothesis. And so it might have happened that Aristotle once again rejected the null hypothesis, concluding that the earth is not a sphere [Cohen: "The earth is round (p < 0.05)"]. What can be learned from these conclusions? Not much! Both hypothesis tests reject the traditional null hypotheses H0 and H0′. As a next step, following the Neyman–Pearson procedure of hypothesis testing, we could tentatively adopt the alternative hypotheses H1 and H1′. This procedure tells us that the earth is neither a flat disk nor a sphere, and consequently we remain ignorant of the earth's actual shape. This ignorance is a result of the "catch-all" alternative hypothesis as proposed by Neyman and Pearson (1967). Unfortunately, the catch-all includes all shapes that are non-flat and non-spherical, for example pear-shaped. Rather than using the hypothesis tests given above, we might argue that Aristotle was actually interested in evaluating: Hflat: The shape of the earth is a flat disk, versus Hsphere: The shape of the earth is a sphere. In such a direct comparison the conclusion will be more informative.

What does This Historical Example Teach Us?

Evaluating specific expectations directly produces more useful results than sequentially testing traditional null hypotheses against catch-all rivals. We argue that researchers are often interested in the evaluation of informative hypotheses and already know that the traditional null hypothesis is an unrealistic hypothesis. This presupposes that prior knowledge often is available; if this is not the case, testing the traditional null hypothesis is appropriate. In most applied articles, however, prior knowledge is indeed available in the form of specific expectations about the ordering of statistical parameters. Let us illustrate this using an example from Van de Schoot et al. (2010). The authors investigated the association between popularity and antisocial behavior in a large sample of young adolescents from preparatory vocational schools (VMBO) in the Netherlands. In this setting, young adolescents are at increased risk of becoming (more) antisocial. Five so-called sociometric status groups were defined in terms of a combination of social preference and social impact: a popular, rejected, neglected, controversial, and an average group of adolescents. Each sociometric status group was characterized by distinct behavioral patterns which influenced the quality of social relations. For example, peer rejection was found to be related to antisocial behavior, whereas popular adolescents tended to be considered well-known, attractive, athletic, and socially competent, although this group could also be antisocial, as was shown by Van de Schoot et al. (2010). Suppose we want to compare these five sociometric status groups on the number of committed offenses reported to the police last year (minor theft, violence, and so on), and let μ1 denote the mean number of committed offenses for the popular group, μ2 for the rejected group, μ3 for the neglected group, μ4 for the controversial group, and μ5 for the average group.
Different types of hypotheses can be formulated for use in the procedures described in the remainder of this paper. First, informative hypotheses can be formulated, denoted by Hi (i = 1, …, N) for a set of N hypotheses. These hypotheses contain information about the ordering of the parameters in a model, in our example the five means. Such expectations about the ordering of parameters can stem from previous studies, a literature review, or even academic debate. Consider an imaginary hypothesis H1: μ3 < μ1 < μ5 < … with inequalities between the five mean scores, where the neglected group is expected to commit fewer offenses than the popular group, who in turn are expected to commit fewer offenses than the average group, and so on. If no information is available about the ordering of some parameters, this is denoted by a comma. Another expectation could be the hypothesis H2: μ3 < (μ1, μ2, μ5) < μ4, where the neglected group is expected to commit fewer offenses than the popular, average, and rejected groups. There is no expected ordering among these three groups, but all three are expected to commit fewer offenses than the controversial group. The research question would be which of the two informative hypotheses receives the most support from the data. Second, there is the traditional null hypothesis (denoted by H0), which states that nothing is going on and all groups have the same mean score, H0: μ1 = μ2 = μ3 = μ4 = μ5. Third, if no constraints are imposed on any of the means and any ordering is equally likely, the hypothesis is called a "catch-all" alternative hypothesis, or an unconstrained hypothesis (denoted by Hu): Hu: μ1, μ2, μ3, μ4, μ5. In the next section we present an overview of possible alternatives to traditional null hypothesis testing for evaluating one or more informative hypotheses.
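An inequality-constrained hypothesis of this kind is, at its simplest, a set of order checks on the group means. The sketch below encodes the second expectation described above (neglected below popular, rejected, and average, all three below controversial) as a plain Python predicate; the numeric mean scores are invented for illustration only:

```python
# Invented mean numbers of committed offenses for the five sociometric
# status groups; these values are NOT from the cited study.
mu = {"popular": 1.2, "rejected": 2.1, "neglected": 0.4,
      "controversial": 2.8, "average": 1.5}

def h2_holds(m):
    """True if the means satisfy the constraints
    neglected < (popular, rejected, average) < controversial."""
    middle = (m["popular"], m["rejected"], m["average"])
    return all(m["neglected"] < x < m["controversial"] for x in middle)

print(h2_holds(mu))
```

Expressing an expectation this way makes explicit which orderings it rules out, which is exactly what the evaluation procedures discussed below exploit.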

Evaluating Informative Hypotheses

Different procedures are described in a range of sources that allow for the evaluation of informative hypotheses. We present an overview of technical papers, software, and applications for two types of approaches: (1) hypothesis testing approaches and (2) model selection approaches. Note that we limit ourselves to a discussion of papers where software is available for applied researchers.

Hypothesis testing approach

Some approaches reported in the literature render a p-value for the comparison of an informative hypothesis Hi with H0 or with the unconstrained hypothesis Hu. First, an adaptation of the traditional F-test for analysis of variance (ANOVA) was proposed by Silvapulle et al. (2002; see also Silvapulle and Sen, 2004), called the F-bar test. It is a confirmatory method that tests one single informative hypothesis in two steps, for example: H0 versus H1, and H1 versus Hu, where in the second hypothesis test H1 serves as the null hypothesis. Software for the F-bar test is described in Kuiper et al. (2010), but applications have not yet, to our knowledge, been reported in the literature. Application of the F-bar test is easy using the software, and the results are comparable with a classical F-test. The disadvantage is that only one informative hypothesis can be evaluated at a time, and only for univariate ANOVA. Testing informative hypotheses for structural equation models (SEM) is described in Stoel et al. (2006), where constraints are imposed on variance terms to obtain only positive values (see also Gonzalez and Griffin, 2001). A likelihood ratio test is used, and the software is available in the statistical package R (R Development Core Team, 2005). The procedure described in Van de Schoot et al. (2010) also makes use of a likelihood ratio test, but goes one step further than Stoel et al. (2006): a parametric bootstrap procedure is used in combination with inequality constraints imposed on regression coefficients. The methodology consists of several steps to be performed with the aid of commonly used software, Mplus (Muthén and Muthén, 2007). Van de Schoot and Strohmeier (in press) introduce the methodology to non-statisticians and show that using this method results in a power gain; that is, fewer participants are needed to obtain a significant effect compared to a default chi-square test.
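The parametric-bootstrap idea behind such procedures can be sketched in a few lines. This is not the Mplus procedure of Van de Schoot et al. (2010) itself, but a simplified two-group analog in which a one-sided, truncated z-statistic stands in for the likelihood ratio, and all data are simulated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented data for two groups; the informative hypothesis is mu1 < mu2.
g1 = rng.normal(0.0, 1.0, 50)
g2 = rng.normal(0.5, 1.0, 50)

def order_stat(a, b):
    """Statistic for H0: mu_a = mu_b against the order-constrained
    hypothesis mu_a < mu_b: a z-statistic truncated at zero, squared,
    so only evidence in the hypothesized direction counts."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = max((b.mean() - a.mean()) / se, 0.0)
    return z ** 2

obs = order_stat(g1, g2)

# Parametric bootstrap: simulate under H0 (a common mean and variance) and
# compare the observed statistic with its simulated null distribution.
pooled = np.concatenate([g1, g2])
mu0, sd0 = pooled.mean(), pooled.std(ddof=1)
boot = [order_stat(rng.normal(mu0, sd0, 50), rng.normal(mu0, sd0, 50))
        for _ in range(2000)]
p_value = float(np.mean([b >= obs for b in boot]))
print(p_value)
```

Because the statistic discards deviations in the wrong direction, the null distribution piles up mass at zero, which is the source of the power gain mentioned above.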

Model selection approach

A second way of evaluating an informative hypothesis is to use a model selection approach. This is not a test of the model in the sense of hypothesis testing; rather, it is an evaluation among statistical models using a trade-off between model fit and model complexity. Several competing statistical models are ranked according to their value on the model selection tool used, and the one with the best trade-off is the winner of the model selection competition. There is a variety of model selection procedures commonly used in practical applications, most notably Akaike's information criterion (AIC; Akaike, 1973), the Bayesian information criterion (BIC; Schwarz, 1978), and the deviance information criterion (DIC; Spiegelhalter et al., 2002). Problems with these standard model selection tools in the context of evaluating informative hypotheses arise because the tools are not equipped to deal with inequality constraints (Mulder et al., 2009a; Van de Schoot et al., under review-b). Although the model selection tools differ in their expression, the result always consists of two parts: the likelihood of the best fitting hypothesis within the model, which is a measure of model fit; and an expression containing the number of (effective) parameters of the model, which is a measure of complexity. The greater the number of parameters, the greater the penalty for model complexity becomes. So, adding a parameter should be accompanied by an increase in model fit that compensates for the increase in complexity. The problem is that the expression of complexity is based on the number of parameters in the model and cannot take inequality constraints into account. That is, a fully ordered hypothesis and a partially ordered hypothesis on the same five means would receive the same measure of complexity, which is unwanted, because the fully ordered hypothesis is more parsimonious due to the additional restrictions imposed on the means. Alternative model selection tools have been proposed in the literature.
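The problem with the penalty term can be made concrete with AIC. In the sketch below (all numbers invented), a fully ordered and a partially ordered hypothesis on the same five-group ANOVA model count the same number of free parameters, so they receive exactly the same penalty 2k; AIC can separate them only through fit, never through the parsimony of their inequality constraints:

```python
def aic(log_likelihood, k):
    """Akaike's information criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * log_likelihood + 2.0 * k

# Invented maximized log-likelihoods for two order-constrained fits of the
# same five-group ANOVA model; both count k = 6 parameters (five means plus
# a residual variance), so the complexity penalty 2k = 12 is identical.
ll_full_order, ll_partial_order = -210.3, -209.8
print(aic(ll_full_order, 6), aic(ll_partial_order, 6))
```

The two AIC values differ only through the log-likelihoods; the extra restrictions of the fully ordered hypothesis leave the penalty untouched, which is exactly the shortcoming the ORIC repairs.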
First, an alternative model selection procedure is the paired-comparison information criterion (PCIC) proposed by Dayton (1998, 2003), with an application in Taylor et al. (2007). The PCIC is an exploratory approach which computes a default model selection tool for all logically possible subsets of group orderings. Only the source code for the programming language GAUSS was available for the PCIC (Dayton, 2001), but Kuiper and Hoijtink (2010) made the PCIC available in a user-friendly interface. The disadvantage of the PCIC is that it is an exploratory approach. Second, the literature also contains one modification of the AIC that can be used in the context of inequality constrained ANOVA models. It is called the order-restricted information criterion (ORIC; Anraku, 1999; Kuiper et al., in press), with an application in Hothorn et al. (2009). It can be used for the evaluation of models differing in the order restrictions among a set of means. Inequality constraints are taken into account in the estimation of the likelihood and in the penalty term of the ORIC. Software for the ORIC is described in Kuiper et al. (2010). The ORIC is as yet only available for ANOVA models, but a generalization is under construction. Alternatives for the BIC and the DIC are also under construction: see Romeijn et al. (under review) and Van de Schoot et al. (under review-a), respectively. Finally, one other method of model selection, which is receiving more and more attention in the literature, involves the evaluation of informative hypotheses using Bayes factors. In this method each (informative) hypothesis of interest is provided with a "degree of support" which quantifies how much support there is for each of the hypotheses under investigation. This process involves collecting evidence that is meant to provide support for or against a given hypothesis; as evidence accumulates, the degree of support for a hypothesis increases or decreases.
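One way such Bayes factors can be computed (the encompassing-prior approach of Klugkist et al., 2005) is as the ratio of the posterior to the prior proportion of draws from the unconstrained model that satisfy the inequality constraints: the prior proportion measures complexity, the posterior proportion measures fit. A minimal sketch, with simulated draws standing in for real MCMC output and all numbers invented:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated draws standing in for MCMC output under the unconstrained
# (encompassing) model Hu: one matrix of prior draws, one of posterior
# draws, with columns (popular, rejected, neglected, controversial, average).
prior = rng.normal(0.0, 10.0, size=(100_000, 5))
posterior = rng.normal([1.2, 2.1, 0.4, 2.8, 1.5], 0.2, size=(100_000, 5))

def in_agreement(draws):
    """True where a draw satisfies the inequality constraints of the
    hypothesis 'neglected < (popular, rejected, average) < controversial'."""
    pop, rej, neg, con, avg = draws.T
    return ((neg < pop) & (neg < rej) & (neg < avg) &
            (pop < con) & (rej < con) & (avg < con))

complexity = in_agreement(prior).mean()   # prior mass agreeing with the hypothesis
fit = in_agreement(posterior).mean()      # posterior mass agreeing with it
bayes_factor = fit / complexity           # support relative to Hu
print(round(complexity, 3), round(fit, 3), round(bayes_factor, 1))
```

With exchangeable prior draws, the prior proportion here is 3!/5! = 0.05 by symmetry, so a hypothesis that survives in the posterior is rewarded precisely for how much of the parameter space it dared to exclude.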
The methodology of evaluating a set of inequality constrained hypotheses has proven to be a flexible tool that can deal with many types of constraints. We refer to the book by Hoijtink et al. (2008b) and the papers of Van de Schoot et al. (in press) and Van de Schoot et al. (2011) as a first step for interested readers. For a philosophical background, see Romeijn and Van de Schoot (2008), and for more information on hypothesis elicitation, see Van Wesel et al. (under review). Various papers describe comparisons between traditional null hypothesis testing and Bayesian evaluation of informative hypotheses; see Kuiper and Hoijtink (2010), Hoijtink et al. (2008b), Hoijtink and Klugkist (2007), and Van de Schoot et al. (2011). Software is available for: AN(C)OVA models (Klugkist et al., 2005; Kuiper and Hoijtink, 2010; Van Wesel et al., 2010), with an application in Van Well et al. (2009); multivariate linear models including time-varying and time-invariant covariates (Mulder et al., 2009a,b), with an application in Kammers et al. (2009); latent class analyses (Hoijtink, 2001; Laudy et al., 2005a), with applications in Laudy et al. (2005b) and Van de Schoot and Wong (in press); and order-restricted contingency tables (Laudy and Hoijtink, 2007; see also Klugkist et al., 2010), with applications in Meeus et al. (2010) and Meeus et al. (in press).

Conclusion

Statistics have come a long way since the early beginnings of testing the traditional null hypothesis of "nothing is going on." Developments in statistics, in particular in the evaluation of informative hypotheses, allow researchers to directly evaluate their expectations specified with inequality constraints. This mini-review illustrates that testing the traditional null hypothesis is not always an appropriate strategy. We argued that more can be learned from data by evaluating informative hypotheses than by testing the traditional null hypothesis. These informative hypotheses were introduced by means of an example. Finally, we presented the current state of affairs in the area of evaluating informative hypotheses.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References (10 of 15 shown)

1. Gonzalez R, Griffin D. Testing parameters in structural equation modeling: every "one" matters. Psychol Methods. 2001.
2. Dayton CM. Information criteria for pairwise comparisons. Psychol Methods. 2003.
3. Klugkist I, Laudy O, Hoijtink H. Bayesian evaluation of inequality and equality constrained hypotheses for contingency tables. Psychol Methods. 2010.
4. Kuiper RM, Hoijtink H. Comparisons of means using exploratory and confirmatory approaches. Psychol Methods. 2010.
5. Stoel RD, Galindo Garre F, Dolan C, van den Wittenboer G. On the likelihood ratio test in structural equation modeling when parameters are subject to boundary constraints. Psychol Methods. 2006.
6. Klugkist I, Laudy O, Hoijtink H. Inequality constrained analysis of variance: a Bayesian approach. Psychol Methods. 2005.
7. Laudy O, Hoijtink H. Bayesian methods for the analysis of inequality constrained contingency tables. Stat Methods Med Res. 2007.
8. Kammers MPM, Mulder J, de Vignemont F, Dijkerman HC. The weight of representing the body: addressing the potentially indefinite number of body representations in healthy individuals. Exp Brain Res. 2009.
9. Taylor S, Zvolensky MJ, Cox BJ, et al. Robust dimensions of anxiety sensitivity: development and initial validation of the Anxiety Sensitivity Index-3. Psychol Assess. 2007.
10. Hothorn LA, Vaeth M, Hothorn T. Trend tests for the evaluation of exposure-response relationships in epidemiological exposure studies. Epidemiol Perspect Innov. 2009.
