Literature DB >> 28729890

Varieties of Confidence Intervals.

Abstract

Error bars are useful to understand data and their interrelations. Here, it is shown that confidence intervals of the mean (CI M s) can be adjusted based on whether the objective is to highlight differences between measures or not and based on the experimental design (within- or between-group designs). Confidence intervals (CIs) can also be adjusted to take into account the sampling mechanisms and the population size (if not infinite). Names are proposed to distinguish the various types of CIs and the assumptions underlying them, and how to assess their validity is explained. The various CIs presented here are easily obtained from a succession of multiplicative adjustments to the basic (unadjusted) CI width. All summary results should present a measure of precision, such as CIs, as this information is complementary to effect sizes.

Entities: Chemical Disease Gene Species

Keywords: confidence intervals; methods; precision; statistics

Year: 2017 PMID： 28729890 PMCID： PMC5511611 DOI： 10.5709/acp-0214-z

Source DB: PubMed Journal: Adv Cogn Psychol ISSN： 1895-1171

Varieties of Confidence Intervals

Error bars have an important role to play in describing results and their precision and, to a lesser extent, in assessing whether the results meet the researcher’s expectations or if they are at odds with them (Cumming, 2014; Loftus, 1993, 1996; Wilkinson & the Task Force on Statistical Inference, 1999). However, error bars come in many different types, and there is some confusion in the literature as to (a) when to use error bars, (b) which one to depict, and (c) how to interpret them. The answers to these questions are straightforward. Concerning the answer to question (a): Error bars should always be present on any plot showing summary results. There should be no exception, and editors should request them prior to publication (Fidler, Thomason, Cumming, Finch, & Leeman, 2004). The answer to question (b) is: Error bars are meant to provide some representation of the magnitude of probable error around a result. Two simple statistics can be used to that end: the standard error () or a confidence interval (CI). Other, more advanced statistics can also be used (based, e.g., on Bayesian credible intervals, tolerance intervals, or likelihood regions, see, e.g., Lee, 2012; Wiens & Nilsson, 2016). Which one is chosen ultimately rests on what the authors are trying to convey as a result. However, unless there is a specific reason to prefer a different measure, error bars should preferably represent 95% CIs, as argued by, among others, Baguley (2012b), Cumming (2014), Franz and Loftus (2012), and Loftus (1996). The last answer regarding question (c), the interpretation of CIs: CIs (unlike other types of error bars) must all be interpreted in the same fashion—if a given value is within the interval of a result, the two can be informally assimilated as being comparable. This is the golden rule of confidence intervals and all CIs should obey this rule. Although the name of the rule is my proposal, this rule is found in many sources (e.g., DeGroot, 1989, p. 337; Neyman, 1937, p. 348). Keep in mind that CIs are not magical wands. They are only meant to better qualify effect sizes, facilitate the detection of patterns of results, and, to a lesser extent, to attract attention to odd results or deviations that are surprisingly large. When they are correctly used, they are powerful tools to understand the results (Cumming & Fidler, 2009). Sadly, there has been some confusion in the recent years on how to interpret them (e.g., Belia, Fidler, Williams, & Cumming, 2005; Cumming & Finch, 2005) or even if they should be interpreted at all (Hoekstra, Morey, Rouder, & Wagenmakers, 2014; Morey, Hoekstra, Rouder, Lee, & Wagenmakers, 2016; but see Miller & Ulrich, 2016) to the point that they are sometimes reported in figures but ignored in the text (Fidler et al., 2004). The truth is that CIs are reliable as long as they are (a) built from adequate assumptions, (b) given correct information (sample size, population size, experimental design, sampling mechanism), and (c) used for the purpose they were built for (estimation of a quantity or comparison with another estimate). Although CI formulas are derived using mathematical arguments, it is easy to validate a confidence interval of the mean (CI) using random number generators: Generate a dataset from a simulated population with a known mean and verify that the population mean is contained within the bounds of the γ-level CI (often, γ is 95%). Sometimes, it will not be within the bounds, but over many replications the proportion of times it is will be γ. Formally defined, a 95% CI is made in a way that in the long run, 95 out of 100 replications will return an interval which indeed contains the true population mean. Remember, however, that for a given CI, a Type I error is always a possibility. In this article, I concentrate mostly on CIs. I argue that there are different types of CIs to serve the researcher’s objective (compare a result to a fixed value or to other results), to match the experimental design (within-subject or between-groups), and to reflect the sampling mechanism used (simple randomized sampling or cluster randomized sampling). To avoid confusion, I propose specific names to distinguish the types of CIs. What is less known is that most CIs are based on assumptions. I will highlight these assumptions and indicate how or if they can be assessed from visual inspection. I will briefly discuss the difference between the formula-based CI and the bootstrap CI. CIs are not just for mean results, they exist for any summary statistics, and I will present examples along with the relevant literature. This article is not about the aesthetic of plots and error bars. There are discussions as to whether summary statistics are better represented by histograms or by dots and whether the extremities of error bars should be signaled by a crossbar or not. In the present article, I chose to use dots and no crossbars (see, e.g., Baguley, 2012a, and discussions linked to that web page), but the quality of a good plot is ultimately evaluated by how well it reveals the important effects. Hence, it may be necessary to try various layouts and various aesthetics to find out which one works best.

Computing Confidence Intervals: Two Basic Adjustments

Most researchers know the usual CI of the mean given by in which M is the mean of a set of observations, is the of that mean, and tγ is a multiplier read from a Student t distribution with degrees of freedom given by n - 1 (n being the number of observations) and coverage level γ, where γ is commonly 95% (oftentimes noted in full as t(1-γ)/2, n-1). The SE of the mean is an indication of how much a sample mean is expected to vary from the population mean. All descriptive statistics have an SE (see later) and SEs are often used as a yardstick to compute CIs—as in Equation 1. The SE of the mean is given by where s is an estimate of the population SD obtained by computing the sample SD. What is less known is that this type of CI has a very limited scope: It cannot be easily used to compare a mean to another mean, and it is useless for that purpose in repeated-measures designs. In this section, adjustments to Equation 1 are presented so that CI can be used for comparison purposes in between-group and within-subject designs.

Confidence Intervals and the Researcher’s Objective

The CI of Equation 1 is based on the assumption that the mean will be examined in isolation. If it is compared, it is compared to fixed values—to a hypothesized population mean, for example. This fixed value has no uncertainty attached to it; hence, there is just one source of error, the sampling error of the group. If one group mean is compared to another group mean, both the position of each mean and the relative position of one mean with respect to the other mean are uncertain. Consequently, the SE of a difference between two means is larger than the SE of the difference between one mean and a fixed value. Expanding the length of the CI compensates for the fact that both quantities are based on samples and, consequently, that their difference contains a larger amount of uncertainty. How much to expand the CI depends on the variances in each condition to be compared. However, if the variances are fairly homogeneous across conditions, a simple solution exists because the sum of two identical variances amounts to multiplying a common variance by two. Consequently, the CI must be √2 ≈ 1.41 times wider (i.e., increased by 41%). Thus, when the purpose of a CI is to compare a mean to other means and variances are considered homogeneous CI, the is given by Alternatively, if the variances are not homogeneous, use the SE of a difference (SED) instead of √2 × SEM, which is based on the pooled SD: where is the harmonic mean of the groups’ sample sizes; see Pfister and Janczyk (2013). If the variances are homogeneous, whether the pooled SD (sp) is used (as recommended by Loftus & Masson, 1994) or each group’s SD (s) is used (as recommended by Cousineau, 2005) is a matter of taste. If you choose both sp and , all the error bars will be of the same length. On the other hand, if you take SDs and sample sizes from each group separately, the error bars will most likely be different. Note that Equation 3 is identical to Equation 1 in all points except for the adjustment factor √2. Such a difference-adjusted CI can only be interpreted with respect to differences between sample means using the golden rule. Equation 1 is the CI when it is meant to be compared to a fixed point; Equation 3 is the CI when the researcher’s objective is to compare one mean to other means. This adjustment was used by Hollands and Jarmasz (2010) to rephrase the golden rule: “the difference between the means of two conditions is significant if it exceeds half the total length of the CI […] multiplied by a factor of √2” (p. 135; Loftus & Masson, 1994, report a similar rule). What truly differentiates the two types of CIs in Equations 1 and 3 is the objective. This distinction was also present in Goldstein and Healy (1995), Franz and Loftus (2012), and Baguley (2012b), among others. When the term √2 is omitted, the proportion of CI of future replications containing the true population difference is not 95% but only 83.4%, as the error bars are too short. This problem was first raised by Estes (1997) and explored by Cumming, Williams, and Fidler (2004). It may seem counterintuitive that the error bars for differences are longer than the error bars of each mean taken individually. If the observer was to use such bars to estimate the population true mean, it is as if precision had been lost. However, remember that difference-adjusted CIs are meant to assess differences, not single means in isolation. It is therefore important that the type of CI pictured is clearly indicated. As an example, suppose that one member of a research group is in charge of collecting the data from a treatment group, with the hope that this group’s mean score is different from 100. After collecting the data and generating a plot showing the 95% CI as per Equation 1, she finds that the mean seems different from 100. Indeed, the observed mean is 105.0; the 95% CI ranges from 100.9 to 109.1 (the raw data for this example and most of the following ones are available as supplementary material so that readers can replicate the computations). If she runs a t test with the null hypothesis H0:μ = 100, she finds that the null hypothesis is rejected at the .05 level, Hedge’s g = 0.50, t(24) = 2.5, p = .02. A colleague measures the control group with the hope that it has a mean close to 100. He finds that the control group has a mean of precisely 100.0 (not significantly different from 100, needless to say). The CI obtained from Equation 1 is [95.8, 104.2] and does not include the mean of the treatment group. If they merge the datasets, they will be surprised to find that a two-sample t test indicates no significant difference at the .05 level, g = 0.50, t(48) = 1.76, p = .085. The left panel of Figure 1 shows the plot they produced (in both groups, the SDs are approximately 10.0).

Figure 1.

Example mean plots from two independent groups. Left: The error bars show the 95% CI of the means (CIs); middle: The error bars show the difference-adjusted 95% CIs; right: The difference between groups is shown, with the 95% CI of the difference. The raw data are available in the supplementary material. Because their objective is to compare both groups, they increase the length of both CIs by a factor of 1.41. Figure 1, middle panel, shows the results using CIs based on the SED (Equation 3). Here, because one mean is included in the CI of the other mean, the difference between them can informally be assimilated to an absence of difference, congruent with the result of the t test. Alternatively, and as recommended by many, for example, Cumming (2014) and Franz and Loftus (2012), they could have made a plot of the difference in mean score, as shown in the last panel of Figure 1. This approach is explained fully in Pfister and Janczyk (2013). However, for designs with multiple groups, the number of pairwise differences increases very rapidly. For three or four groups, it is still possible to show on a single plot all the pairwise differences; one example is illustrated in Figure 2. Beyond that, the benefit of the pairwise difference plot is dubious, as seen if you compare the left panels of Figure 2 with the right panels.

Figure 2.

Example mean plots for a three-groups design (top) and a four-groups design (bottom) with error bars showing difference-adjusted 95% CI of the means (CIs) (left) and 95% CI of the difference for all pairwise differences (right). One critique that can be addressed to these adjusted CIs is that they do not provide an estimate of the population mean for a given group. This critique is relatively correct. However, in Psychology, it can be argued that we are rarely interested in estimating a population mean in isolation. As Loftus and Masson (1994) put it, “in psychological experiments, it is rare […] for one to be genuinely interested in inferring the specific value of a population mean. More typically, one is interested in inferring the pattern formed by a set of population means” (p. 480, the authors’ emphasis). Because an absolute estimate of mean performance is utopian, psychologists spend considerable time and resources measuring control groups, placebo groups, pre-treatment scores, and other forms of baseline scores, separately for any new experiment. These design requirements should be mirrored by equivalent estimates meant to highlight patterns of results. This is the purpose of the adjusted CIs.

Confidence Intervals and the Experimental Design

Experimental designs can be divided as to whether they are between-groups or within-subject (mixed designs will be discussed in a later section). In a within-subject (or repeated-measures) design, the participants’ scores are typically positively correlated. By considering such correlations in the participants’ scores, it is possible to evaluate differences between two means more precisely, a fact little known (Belia et al., 2005). In a two repeated-measures design, the correction in length is equal to when the variances are homogeneous, in which r is Pearson’s correlation, such that the CI for a repeated-measures design is An alternative way to understand the correlation adjustment is to note that in Equation 2, the square root of the sample size is replaced by to obtain Equation 5a, so that the ratio n/1 - r can be termed the effective sample size. The stronger the correlation is, the more accurate the regression slope is. Consequently, the difference between the two means is estimated as if we had measured a larger sample. With a sample size of 25 and a correlation of .8, for example, the effective sample size is five times larger than the true sample size (n/1 - r as = 25/0.2 = 125). When there are more than two measurements, there is no universally accepted way to get a CI adjusted to within-subject correlations. The difficulty owes to the fact that the variances are not perfectly identical between groups and the correlations are not perfectly identical between pairs of groups. One method (Bakeman & McArthur, 1996; Cousineau, 2005; see Morey, 2008, for the appropriate correction for bias) is to obtain a transformed dataset Z derived from the original data set, such that within-subject correlation is removed. Then, the CI is obtained as usual using the SE from the transformed data set rather than from the original data set. Note that the correlation-adjusted CI must always be difference-adjusted as well, as implicitly, the two groups are compared in getting a correlation. Thus, the SE of Z must be increased by a √2 factor as well. In general, for two or more repeated measures, the CI is given by It is thus a correlation-adjusted as well as a difference-adjusted CI of the mean. Cousineau and O’Brien (2014) give more details on how to compute the transformed data set Z. Masson and Loftus (2003; see also Loftus & Masson, 1994) provide an alternative approach. Both methods are identical when variances and correlation are truly homogeneous between measurements. Baguley (2012b) and Franz and Loftus (2012) evaluated these and other propositions. As another example, a researcher gets data from a sample of 25 participants in a repeated-measures design (for example a pre-post design). The mean of the first measurement is 105.0 and the mean of the second measurement is 100.0. Both SDs are near 15.0. The error bars obtained from Equation 1 are shown in the left panel of Figure 3. There does not seem to be any difference between the two measures, yet a paired t test indicates a strong and significant difference (Cohen’s dz = 0.58, t[24] = 2.90, p = .008). The researcher, remembering that his objective is to compare the two measures, may switch to a difference-adjusted CI (Equation 3), but things would get worse as CIs for differences are √2 times longer as seen in Figure 3, second panel. The apparent inconsistency between the CI and the statistical test owes to the fact that this is a repeated-measures design: Participants’ scores are correlated. In the present dataset, the correlation between the pairs of scores is .84, so that . Hence, in this case, the CI taking into account correlation should be 40% the length of the error bars, based on independent samples (more than halved). The third panel of Figure 3 shows the resulting, correct CI. If you prefer the paired difference CI, it is shown in the last panel of Figure 3.

Figure 3.

Example mean plots for two repeated measures. First panel: The error bars show the 95% CI of the means (CIs); second panel: The error bars show the difference-adjusted 95% CIs; third panel: the error bars show the correlation and difference-adjusted 95% CIs; fourth panel: paired difference and 95% CI of the paired difference. εHF is discussed later in the text. The raw data are available in the supplementary material. When the data are correlated, the CIs are shortened as within-subject correlation is used to better estimate the difference across means; the more positively correlated the data are, the shorter the CI becomes. In the unlikely event that the data are negatively correlated, the CI is expanded by the correlation adjustment.

Naming Convention

At this time, the three types of CI (unadjusted, difference-adjusted, and correlation- and difference-adjusted) have no distinct names. It is therefore difficult in a figure caption to figure out which type is plotted. A common statement is “the error bars are corrected for within-subject variability” followed by a reference, for example, “Loftus and Masson (1994).” I propose the following three labels: • CIs of the means (Equation 1) • Difference-adjusted CIs of the means (Equation 3) • Correlation- and difference-adjusted CIs of the means (Equation 5) Loftus and Masson (1994), contrary to Baguley (2012b), recommend the use of the pooled SE so that the label representing their approach could be correlation and difference-adjusted pooled 95% CIs of the mean. As typically, the purpose of mean plots is to compare means to other means, the difference-adjusted CIs would be used most often. If unadjusted CIs are used and there exists a conventional reference point, that point of reference could be present on the plot with a dashed line, for example, as was done in the first panels of Figures 1 and 3; no reference point should be shown when difference-adjusted CIs are depicted. Rouder and Morey (2005) suggested the expressions arelational (unadjusted, Equation 1) and relational (all the other CIs proposed here). These authors noted that “there are many advantages to arelational CIs: They provide a rough guide to variability in data, a coarse view of replicability of patterns and a quick check of the heterogeneity of variances. Arelational CIs, however, do not reflect between-group information and cannot be used for direct comparisons” (p. 77). Pfister and Janczyk (2013) also proposed a naming convention which applies when the difference between two means is plotted. They coined the expressions CI of means (unadjusted CI), CI of differences (for between-group difference in means), and CI of paired difference (for within-subject difference in means). The naming convention is important so that the type of CI shown on plots can be identified unambiguously. In addition to naming the CIs, it is useful to have a uniform way of reporting them when the authors want to write down the CI. Following Cumming (2014) and the American Psychological Association Publication Manual (2009), brackets should be used to denote 95% CIs. The notation M ± CI should not be used, as CIs are not always symmetrical. CIs are symmetrical for central tendencies (the mean, the median, the geometric and the harmonic means) and some nonparametric statistics of dispersion (median absolute deviation and interquartile range). However, in general, they are not symmetrical, as, for example, the CI of the SD and the CI of the kurtosis. Conversely, SEs are always symmetrical, so the notation M ± SE makes sense and should be used exclusively to report SEs. Finally, SE should not be used for the length of the error bars in plots. They are not easy to interpret (but see Cumming & Finch, 2001, 2005) and the fact that SEs are always symmetrical may yield a false impression. For example, suppose a group of 20 data has an SD of 12.33. The SE of the SD in that case is 2.00. The 95% CI of that SD is [9.38, 18.0]. There is no single number which added to and subtracted from 12.33 can yield this interval. Further, the asymmetry in precision would go unnoticed if SE were reported.

Confidence Intervals and Hidden Assumptions

CIs are well known (albeit not universally used). However, one thing that might be less known is that CI estimates are not assumption free. On the one hand, the use of the SE of the mean, , rests on few, quite general assumptions. CIs, on the other hand, are based on the assumption that the means are normally distributed. Indeed, to obtain a CI, the SE is multiplied by a t value which is based on this assumption. Owing to the central limit theorem, large-sample means should meet this assumption; for smaller samples, one safeguard, prior to drawing a mean plot, could be to run a test of normality (e.g., a Kolmogorov-Smirnov test) or tests for null skewness and null kurtosis and fail to reject the null (if the sample size is small) or find mild deviations (if the sample size is moderate; Rochon, Gondan, & Kieser, 2012). Likewise, and as we saw, the difference-adjusted CI is based on the homogeneity of variances assumption. This assumption can be checked visually when the groups are of the same size: As a given CI is based on the SD of that group of data only, the length of the error bars should all be of a comparable size. If there are important differences in length, then there is certainly a problem with the assumption of homogeneity of variances. Figure 4, left panel, shows an example with equal sample sizes. As the CIs are of very different lengths, it can be inferred that the variances are not homogeneous, and therefore the difference-adjusted CI should not be relied upon strongly. As a rule of thumb and for samples of moderate sizes, if the variance in one group is twice the variance in another group, Levene’s test will likely detect heterogeneity (and indeed it did in Figure 4, left panel: F = 6.72, p = .013). In terms of CIs, as they are based on SDs (square roots of variances), a CI which is 40% longer than another one suggests heterogeneity of variances.

Figure 4.

Example mean plots for two experiments with sample size 25 per condition. Left: means of two independent groups with error bars showing difference-adjusted 95% CI of the means (CIs). The two groups have different variances, as evidenced by the error bars of unequal length. Right: means from three measures with error bars showing correlation- and difference-adjusted 95% CIs. Although the measures’ variances are different, the data do not violate the sphericity assumption, as evidenced by a Huynh-Feldt ε of 1. In the left panel, we can test the difference in means using the Welch test, a t test whose degrees of freedom are corrected to handle heterogeneity of variances; the difference is borderline not significant, g = 0.40, t(35.2) = 2.01, p = .052. In the right panel, the analysis of variance (ANOVA) is significant, η2 = 0.12, F(2, 48) = 3.34, p = .04. Post hoc analyses show that the difference between Measure 1 and Measure 3 (8 points of separation) is the only significant one (p = .026). For repeated-measures designs, the correlation-adjusted CIs are based on the sphericity assumption; loosely speaking, this is similar to a homogeneity of correlation assumption (Baguley, 2004; Lane, 2016, are more precise). This is true whether a method based on separate estimates (such as Cousineau, 2005; Morey, 2008) or based on a pooled estimate (such as Loftus & Masson, 1994) is used. Sadly, this assumption cannot be verified visually with error bars. For example, the CIs may be of very different lengths and yet sphericity still holds (Baguley, 2004; Huynh, 1978). One solution is to compute epsilon (ε), a measure of sphericity whose value is between 1 / (J − 1) and 1, where J is the number of repeated measures; ε of 1 means that the data are perfectly spherical, that is, that the CIs are accurate. Some authors consider that εs above .9 indicate a mild deviation from perfect sphericity (see Field, 2013; Tabachnick & Fidell, 1996, for more on this measure). The ε measure was originally created by Greenhouse and Geisser (1959); Huynh and Feldt (1976) provided a formula corrected for bias. Figure 4, right panel, shows the means for three measurements (an example based on Baguley, 2004). Visually, the CIs are of unequal length, but this information in not relevant as this is a repeated-measures design. The Huynh-Feldt ε is 1, which indicates that the sphericity assumption holds for this data set (and indeed, a Mauchly test of sphericity does not reject the null hypothesis of sphericity, W = 0.947, χ2[2] = 1.24, p = .55). Because it is possible to have a visually educated guess with regards to homogeneity of variances in between-subjects designs, I suggest that the Huynh-Feldt ε (εHF) be always visible on a mean plot showing repeated measures (when there are three or more repeated measures; with only two measures, sphericity always holds; Lane, 2016). If there is any problem with the assumptions (normality and either homogeneity of variances for between-group designs or sphericity for within-subject designs), the assumption-based CIs might nevertheless be used as visual tools to provide rough intuitions on the results. However, if statistical inference is important, they should not be used. Alternatively, it is also possible to use bootstrap estimates of CI (Efron & Tibshirani, 1993). The basic algorithm for bootstrap estimation is simple: Given a sample of size n: 1. Subsample the sample, extracting n data with replacement from the original sample. 2. Compute on this subsample the statistic desired (e.g., the mean for a CI). 3. Repeat Steps 1 and 2 a very large number of times (e.g., 10,000 times). 4a. Finally, obtain the CI by locating the bounds within which a proportion γ of the subsample statistics are located. 4b. Alternatively, if you want SE instead, compute the SD of all the subsample statistics. Bootstrap estimates should be based on a large number of subsamples (minimally 10,000, but more if your platform can run it); as a consequence, they are slower to obtain than the formula-based intervals. Bootstrap CIs are based on fairly mild assumptions about the underlying population distribution (e.g., Shao & Tu, 1995). The sample should be reasonably large, although there is no explicit prescription as to what large means precisely. One safe rule is to at least match the sample size recommended from power computations (Mayr, Erdfelder, Buchner, & Faul, 2007). When the assumptions are met, bootstrap CI returns on average the same interval as the formula-based CI. One disadvantage of bootstrap estimates is that their exact value is different every time they are computed. This is why they must be based on a large number of subsamples. With 10,000 subsamples, the first two digits should be stable, so do not report bootstrap estimates with more than two significant digits or increase the number of subsamples. More sophisticated bootstrap algorithms have been developed (see, e.g., BCa; Efron, 1987; or ABC; DiCiccio & Efron, 1996). Figure 5 shows simulated data with three groups, in which black lines show the formula-based CI and gray lines show the bootstrap-based CI. The data were simulated from a normal distribution with means of 97, 100, and 103 and a common SD of 15. For a large sample (200 in Figure 5, right panel), the difference between the two types of approach to estimating CI is immaterial.

Figure 5.

Example mean plots from three independent groups with error bars showing difference-adjusted 95% CI of the mean (CIs) obtained from formula-based estimates (Equation 3; black bars) or from bootstrap estimates (gray bars). Left: a small sample (n of 20 per group); right: a large sample (n of 200 per group). The raw data for the left panel are available in the supplementary material.

Computing Confidence Intervals: Advanced Adjustments

All the CI and SE formulas given in the present article are valid for experimental designs examining a population of infinite size using simple randomized sampling. Yet Little (2004) strongly encouraged researchers to incorporate the sampling mechanisms in their models. Consequently, this information should also be incorporated in the CI by using sampling adjustments. Here, I illustrate how this can be done when the population is not so large as to be considered infinite, when a different sampling mechanism is used, or both.

Confidence Intervals and the Population Size

When the sample represents a sizeable proportion of the whole population, it is not possible to consider the population as infinite. Examples where the population cannot be considered infinite include: a study of employees within a given company, the LGBT community in a linguistic minority, or students’ achievements in public schools. Regarding the last example, the Austrian government aims to assess 20% of the population every year. As discussed in Cochran (1953), when the sample size exceeds 5% of the population size, a finite population correction must be applied to the sample estimates of variability (see also Thompson, 2012). In the following example, let n denote the sample size and N denote the population size. The adjustment is based on the proportion of elements not sampled from the population, 1− n/N so that the CI adjusted for population size becomes in the case where there are no other adjustments. As n tends to N, there is less and less uncertainty in the estimated variance of the population so that the adjustment factor tends to zero and the CIs shrink to null. The adjustments for finite sample size can be used jointly with the correlation adjustment and the difference adjustment.

Confidence Intervals and the Sampling Method

In simple randomized sampling, all the participants are chosen randomly from the studied population with an equal chance of being selected. Other sampling techniques exist, such as cluster randomized sampling and stratified sampling (Kish, 1965; Thompson, 2012). Cluster randomized sampling is often used in educational psychology and consists, for example, of picking whole classes from schools. The children are not selected with equal chances; the classes are. Stratified sampling is often used for survey studies and consists in selecting individuals, such that the sample is representative of the population on certain control variable(s), on age categories, for example. Regarding cluster randomized sampling, Cousineau and Laurencelle (2015) provided a cluster-adjusted CI. It requires an estimate of the intraclass correlation. The cluster adjustment can be used in conjunction with difference and correlation adjustments. Likewise, Lai, Kwok, Hsiao, and Cao (in press) argue that the correction for cluster randomized sampling can be used in conjunction with the correction for finite population size. The detailed computation of this adjustment is given in Appendix A. For stratified sampling techniques and other sampling techniques, the expression of SEs and CIs are not agreed-upon and most require numerical algorithms so that a simple adjustment does not seem possible at this time. As seen, considerations related to sampling methods are easily handled using additional adjustments that are simply multiplied to the CI length.

Various Considerations

Visualizing Confidence Intervals in Mixed Designs

The fact that CIs are different in between-groups designs and in within-subject designs is problematic for mixed designs where both types of CIs coexist. In this case, the researcher may choose to plot just one type of CI, the one which captures the results he or she wants to concentrate on. If the researcher wants to show both types of CIs, Baguley (2012b) proposed the use of two-tiered error bars. These error bars are drawn with two sets of aesthetics: The ones delimited with a cross-line (often the shortest) are correlation- and difference-adjusted CIs (and related to the within-subject results); the ones without cross-lines (often the longest) are the difference-adjusted CIs (and related to the between-subjects results). Although this solution is ingenious, the plots are often harder to interpret. The CIs are meant to synthetize results so that they are more easily apprehended. Multiplying the number of bars only achieves the opposite effect. As a general rule, the number of error bars should be kept to a minimum. If both between-groups and within-subject CIs are important, consider presenting two distinct plots.

Software for Computing Confidence Intervals for the Means and Other Statistics

Typically used summary statistics, not just means, all have SEs and CIs (see Harding, Tremblay, & Cousineau, 2014, for a review). Hence, all summary plots should be drawn with some measure of dispersion around them, the conventional measure being 95% CIs. As an example, Figure 6 illustrates 95% CIs for nine descriptive statistics, including robust and nonparametric statistics (the median, the median absolute deviation, and the Pearson skew; Daszykowski, Kaczmarek, Vander Heyden, & Walczak, 2007; Harding, Tremblay, & Cousineau, 2015; Siegel & Castellan, 1988).

Figure 6.

Plots of various statistics from fictitious data as a function of group with error bars showing 95% CIs. The CIs are asymmetrical for SD and kurtosis. The first six are difference-adjusted; the last three shows unadjusted CIs in which zero is the reference. The same data set is used in all panels and were used in Figure 5, left panel (so that the first panel shows the same results in both figures). At this time, there is no statistical package that implements the adjustments to CIs of the means. SPSS can only draw unadjusted CIs for many descriptive statistics; an extension to SPSS (O’Brien & Cousineau, 2014) implements both correlation-adjusted and difference-adjusted CIs. Likewise, R has no standard commands to draw adjusted CIs, but Baguley (2012b) programmed commands to that end and Kelley (2017) made the MBESS R library with CIs for a few statistics such as effect sizes. A standalone application, MorePower, can compute CIs for a few within-subject and between-subjects designs (Campbell & Thompson, 2012). Finally, a Mathematica package, available from the author, is briefly described in Supplementary Material. There are a number of references in which the computation of CIs are given and described. Beaulieu-Prévost (2006) reported how to compute the unadjusted CI as well as the difference-adjusted CI; also, the CI of the Pearson’s correlation r is given. Finally, CIs for a proportion p and for difference-adjusted proportions are given. Cumming and Fidler (2009) reported some of the above, and also CIs for the Hedges’ effect size g. Harding et al. (2014) reviewed SEs and CIs for an exhaustive list of descriptive statistics: (central tendency) mean, median, geometric and harmonic means, (dispersion) variance, SD, median absolute deviation and interquartile range, (shape) Fisher skew, and kurtosis. Harding et al. (2015) gave SEs and CIs for the Pearson skew. Bootstrap estimates are fairly easy to obtain in SPSS with the module BOOTSTRAP, sold separately from SPSS (version 19 and above) or the module GSD (Harding & Cousineau, 2016). Otherwise, Weaver and Koopman (2014) showed how to bootstrap estimates of CIs for Pearson’s correlation with SPSS; Hallgren (2013) showed how to perform bootstrapping in general in the R environment. Finally, Hélie (2006) provided a general introduction to the topic of model selection using bootstrap. Few commands provide the full flexibility needed to plot any summary statistics in conjunction with any type of CIs. I hope that this situation will change rapidly so that researchers are encouraged to plot adjusted CIs routinely.

General Discussion

All the CIs reviewed here are summarized in Algorithm 1. Also, the relevant formulas are provided in Appendix A. They all obey the golden rule of interpretation for CIs: If a given value is within the interval of a result, the two can be informally assimilated as being comparable. By making all CIs follow the same and unique interpretative rule, researchers might start relying on these statistics more frequently, more consistently, and more confidently. Algorithm 1 Steps to compute SEs and CIs of means 1- Are the data from a within-subject design or mixed design? Yes: decorrelate the data within each group (Equation A4). 2- Compute SEs for each group and each measure (Equation 2). Do you want to pool the SEs? Yes: use Equation A5. Do you want to pool the sample sizes? Yes: use harmonic mean. 3- Do you want to show CI instead of SE? Yes: Choose your confidence level (typically 95%) and get tγ then multiply SE by the multiplier tγ (Equation A3). 4- Purpose: Will comparisons be made to other sample means? Yes: Use difference adjustment (Equation A6). 5- Sampling mechanisms. a- Is the population of finite size? Yes: Equation A7. b- Is the sample obtained from cluster randomized sampling? Yes: use Equations A8 and A9. 6- Place the CI about the mean (Equation A1). Some have argued that Equations 1, 3, and 5 (5a or 5b) are not three different types of CIs, but just one type of CI for three different statistics (Equation 1 is the CI of a single mean, Equation 3 is akin to the CI for the difference between two independent means, and Equation 5b is the CI for the within-subject difference in means). I do not object to this point of view and if it is more intuitive to the readers, please make these the labels by which you identify the intervals in your future communications. The only thing that really matters is that anything having the name confidence interval should be interpreted in a consistent and universal fashion, that is, according to the golden rule. CIs should be part of any plots or listed in tables of results whenever a summary statistic is reported. There exists a CI for any statistic you may want to report and many can be found in the literature. Although CIs are not always clearly understood in very formal ways (see Belia et al., 2005; Cumming et al., 2004; Hoekstra et al., 2014), I believe that they are more intuitive than other kinds of statistical information. See, among others, Loftus (1996) for a similar point of view. If we can agree on the golden rule and make sure that all CIs plotted conform to it consistently and systematically, intuition regarding them should improve. Previous texts have not sought to enforce uniformity by discussing error bars based on SE or by promoting half-length intervals. Half-length CIs were suggested by Baguley (2012b), Franz and Loftus (2012), and Goldstein and Healy (1995), by which the length of the difference-adjusted CI is divided by 2. Such half-length CIs must be interpreted differently as it is the presence of overlap between error bars that signals comparable means. This is unfortunate; if we want researchers to develop the correct automatisms when facing error bars, we must devise intervals that are to be interpreted consistently (Shiffrin & Schneider, 1977). CIs are the result of solid mathematical arguments. They provide an interval which likely contains the population means. Indeed, just to take an example, 95% of the 95% CIs of the means do contain the population mean. There is no guarantee that one specific CI contains the population mean, but we may have a certain confidence that this is the case (Miller & Ulrich, 2016). Note that a CI is accurate only if the assumptions are correct, only if the experimental design and sampling methods are inscribed in it, and only if it is used for the correct objective. If any of these elements are changed, the CI length will change accordingly (as was shown in Morey et al., 2016). It is not a demonstration that CIs are fallacious; it is a demonstration that CIs must be informed as accurately and as completely as possible. The only arbitrary aspect of CIs is the coverage level γ used to compute t. The purpose of this quantity is to provide a reasonably large coverage for the interval. On the one hand, too narrow an interval could yield the impression that a study is hardly replicable (even if replications are scarce within Psychology; see Cousineau, 2014; Jasny, Chin, Chong, & Vignieri, 2011; Makel, Plucker, & Hegarty, 2012; Pashler & Wagenmakers, 2012). On the other hand, too wide an interval would bring little information with respect to the true characteristic(s) of a population. A conventional level is required; Cumming (2014; also see publication policy of the Psychological Science journal regarding statistics) argued that a 95% coverage level is a reasonable position (Marmolejo-Ramos & Cousineau, 2016). Finally, keep in mind that ultimately, good science should return short CIs. Being able to assess patterns of means is important, as argued in the Introduction. However, being able to assess results with high precision is also, if not more, important. Along this document, I made a few recommendations that I reiterate here: 1. always show or list CIs whenever results based on summary statistics are given; 2. use formula-based CIs if the assumptions are not rejected by the data of if it is conventional to do so in the area of research; use bootstrap CIs otherwise; 3. prefer difference-adjusted CIs if focus is on the pattern of results; if unadjusted CIs are given and there is a conventional reference value, provide the reference value on the plot (e.g., with a dashed line); 4. in the text, use the notation [low, high] for 95% CIs. Use the notation ± to denote SEs. 5. in plots showing means in a within-subject design, provide the Huynh-Feldt ε so that readers can assess whether the sphericity assumption holds or not. In between-subjects designs, the reader can assess the homogeneity of variances assumption visually by comparing the length of the error bars. 6. if half-length CIs are used, clearly identify this fact and give the rule for interpreting these. As mentioned by Belia et al. (2005), “better guidelines for researchers and less ambiguous graphical conventions are needed before the advantages of CIs for research communication can be realized” (p. 389). I hope that this article is one step further in that direction.

24 in total

1. Using confidence intervals for graphically based data interpretation.

Authors: Michael E J Masson
Journal: Can J Exp Psychol Date: 2003-09

2. Data replication & reproducibility. Again, and again, and again .... Introduction.

Authors: Barbara R Jasny; Gilbert Chin; Lisa Chong; Sacha Vignieri
Journal: Science Date: 2011-12-02 Impact factor: 47.728

3. The new statistics: why and how.

Authors: Geoff Cumming
Journal: Psychol Sci Date: 2013-11-12

4. Using confidence intervals in within-subject designs.

Authors: G R Loftus; M E Masson
Journal: Psychon Bull Rev Date: 1994-12

5. Calculating and graphing within-subject confidence intervals for ANOVA.

Authors: Thom Baguley
Journal: Behav Res Methods Date: 2012-03

6. Replications in Psychology Research: How Often Do They Really Occur?

Authors: Matthew C Makel; Jonathan A Plucker; Boyd Hegarty
Journal: Perspect Psychol Sci Date: 2012-11

7. Editors' Introduction to the Special Section on Replicability in Psychological Science: A Crisis of Confidence?

Authors: Harold Pashler; Eric-Jan Wagenmakers
Journal: Perspect Psychol Sci Date: 2012-11

8. Confidence intervals for two sample means: Calculation, interpretation, and a few simple rules.

Authors: Roland Pfister; Markus Janczyk
Journal: Adv Cogn Psychol Date: 2013-06-17

9. To test or not to test: Preliminary assessment of normality when comparing two independent samples.

Authors: Justine Rochon; Matthias Gondan; Meinhard Kieser
Journal: BMC Med Res Methodol Date: 2012-06-19 Impact factor: 4.615

10. The fallacy of placing confidence in confidence intervals.

Authors: Richard D Morey; Rink Hoekstra; Jeffrey N Rouder; Michael D Lee; Eric-Jan Wagenmakers
Journal: Psychon Bull Rev Date: 2016-02

4 in total

1. When combined spatial polarities activated through spatio-temporal asynchrony lead to better mathematical reasoning for addition.

Authors: Hélène Verselder; Nicolas Morgado; Sébastien Freddi; Vincent Dru
Journal: Mem Cognit Date: 2018-10