During the past decade, student health care centers across Sweden have routinely invited all students they serve to complete an electronic screening and brief intervention (eSBI) targeting harmful and hazardous alcohol consumption. Students are, on a yearly basis, invited via email to complete a 10-item questionnaire, after which they are given personal feedback alongside some advice on behavior change. The evidence for eSBIs generally indicates that they may have a small, yet positive effect on the amount of alcohol consumed in the short term (Cohen d=−0.17, 95% CI −0.27 to −0.18 [1]; Cohen d=−0.14, 95% CI −0.24 to −0.03 [2]) and the weighted mean difference of alcohol in grams (−16.59, 95% CI −23.70 to −9.48 [3]).

In 2011, the first Alcohol Email Assessment and Feedback Study Dismantling Effectiveness for University Students (AMADEUS-1) trial aimed to investigate the effect of this routine practice. An unconventional study design was used to target both treatment and non–treatment-seeking individuals as well as to mask trial participation and allow for baseline assessment effects to be measured. The trial, reported originally in 2013 [4-6], identified a small reduction in alcohol consumption and risky drinking among those who had been invited to assess their consumption compared to a no-contact control. A Bayesian reanalysis of the AMADEUS-1 trial has also been reported [7].

The unconventional trial design employed in the AMADEUS-1 trial necessitated inclusion of many individuals at follow-up who had decided not to complete the baseline assessment, as well as of nonharmful drinkers and abstainers. This prompted the AMADEUS-2 trial [8,9], which aimed to assess the effect of an eSBI on harmful and hazardous drinking among students.
AMADEUS-2
The AMADEUS-2 trial [8,9] followed a more conventional two-arm randomized controlled trial design than did its predecessor AMADEUS-1. In March 2013, students in semesters 2, 4, and 6 at 9 colleges and universities in Sweden were sent an email (n=54,507) with an invitation to answer a single screening question regarding their alcohol consumption. The third item of the Alcohol Use Disorders Identification Test [10], which asks about the frequency of heavy episodic drinking, was used to screen participants for inclusion. Students were eligible if they had consumed at least four (female) or five (male) standard drinks on a single occasion twice a month or more often in the past 3 months. In Sweden, one standard drink is defined as 12 grams of alcohol.

Eligible students who gave consent to take part in the trial were randomized into two groups: intervention and control. The intervention group was offered an eSBI immediately after randomization. They were asked to complete a 10-item questionnaire, which assessed their current consumption, after which they received feedback on their responses, including graphical representations of their current risk level, normative comparison with other students, and personal advice on how to reduce their consumption. The control group was told that they would receive the intervention in 2 months.

At follow-up, 2 months after the initial invitation, both groups were sent identical emails with an invitation to participate in the follow-up survey. The survey consisted of the same questionnaire and feedback that was offered to the intervention group at baseline.
Concerns over the (Mis)use of P values
In 2017, Benjamin et al [11] (signed by 71 authors) recommended that the conventional threshold used for determining statistical significance should be lowered from .05 to .005. This recommendation was motivated by a growing concern that scientific findings are becoming less credible. Furthermore, the authors recommended that findings with P values between .05 and .005 should be considered suggestive evidence rather than being outright rejected.

This recommendation met with criticism, as others believed that trichotomization of evidence does not solve the issue of P-hacking, selective reporting, and publication bias [12,13]. These concerns resonate with the recent clarification from the American Statistical Association on the principles underlying P value reporting [14], Nuzzo’s summary in Nature [15], and a series of articles in the Journal of the American Statistical Association [16-20].

One approach that could potentially replace the P value dichotomization is the use of Bayesian inference, where evidence is considered as a continuous entity [21-25]. For this reason, the Journal of Medical Internet Research is inviting submissions to a special issue where authors are asked to reanalyze data from previous trials using a Bayesian framework and compare the analytical results with those of the original P value.
Objective
The primary outcome in the AMADEUS-2 trial was self-reported weekly alcohol consumption at the 2-month follow-up. The main hypothesis was that the intervention group would report a lower weekly alcohol consumption than the control group at follow-up. An unplanned sensitivity analysis was also conducted, which excluded three data points considered outliers post hoc. The objective of this study is to redo the primary and sensitivity analysis using a Bayesian framework and contrast the results with those of the original analysis.
Methods
Bayesian model
In the original analysis of the AMADEUS-2 trial, negative binomial regression was used to contrast grams of alcohol consumed per week between the intervention and control groups. The primary model was adjusted for baseline variables. The same model was used in the enclosed Bayesian analysis, with uniform priors for all model parameters. The negative binomial regression with uniform priors used to contrast grams of alcohol per week is expressed by Equation 1, which presents the full specification of the model; HED represents the number of heavy episodes of drinking per week at baseline, that is, the response to the initial screening question.

The primary interest was the regression coefficient θ1 for the GROUP variable, that is, the expected difference in log count of grams of alcohol consumed between the intervention and control groups. By exponentiating this coefficient, we get the incidence rate ratio (IRR), which indicates by how much we should multiply the control group’s consumption to get the intervention group’s consumption. Thus, a value of exp(θ1) lower than 1 would suggest that the grams per week consumed by the intervention group was lower than that consumed by the control group at the time of follow-up. Informed by the original analysis, thresholds at which the marginal posterior distribution of exp(θ1) should be inspected were chosen at 1, 0.96, and 0.92. The threshold of 1 was chosen to communicate whether offering the intervention was preferable to not doing so, and the thresholds 0.96 and 0.92 were chosen to indicate the magnitude of the difference between the two groups.
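One way to write a model consistent with this description and with the Stan code in Textbox 1 is sketched below. It is not necessarily identical to Equation 1: only θ1, GROUP, and HED are named in the text, so the remaining symbols (θ0, θ2 to θ4, γ_u, μ_i, φ) and the exact coding of the covariates are assumptions made for illustration. The mean and dispersion parameterization follows Stan's `neg_binomial_2_log`, for which the variance is μ + μ²/φ.

```latex
\begin{aligned}
y_i &\sim \mathrm{NegBinomial}(\mu_i, \phi), \qquad \operatorname{Var}(y_i) = \mu_i + \mu_i^2/\phi \\
\log \mu_i &= \theta_0 + \theta_1\,\mathrm{GROUP}_i + \theta_2\,\mathrm{SEX}_i + \theta_3\,\mathrm{AGE}_i
            + \theta_4\,\mathrm{HED}_i + \sum_{u} \gamma_u\,\mathrm{UNIVERSITY}_{u,i} \\
p(\theta, \gamma, \phi) &\propto 1 \quad (\phi > 0) \\
\mathrm{IRR} &= \exp(\theta_1)
\end{aligned}
```

Under this sketch, holding the other covariates fixed, the expected weekly consumption in the intervention group equals exp(θ1) times that of the control group, which is why values of exp(θ1) below 1 favor the intervention.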
Inference
Hamiltonian Monte Carlo, a type of Markov chain Monte Carlo (MCMC) technique, was used for Bayesian inference. The model was coded using Stan (Textbox 1) and run in R with RStan version 2.16.2. The data were one-hot encoded before being passed to Stan. No transformations were made to the variables.

When using MCMC for inference, we aim to draw samples from the posterior distribution of all model parameters. These samples can then be used to calculate how probable different values of these parameters are. For each model in the enclosed analysis, 50,000 iterations were run with 25,000 warmup iterations in four chains.

Textbox 1. Stan model code.

    data {
      int<lower=0> N;        // Number of data items
      int<lower=0> K;        // Number of predictors
      matrix[N, K] X;        // Design matrix (one-hot encoded predictors)
      int<lower=0> y[N];     // Response
    }
    parameters {
      real<lower=0> phi;     // Dispersion parameter (must be positive)
      vector[K] beta;        // Regression coefficients
    }
    model {
      // No explicit priors are stated, so Stan uses flat priors on phi and beta
      y ~ neg_binomial_2_log(X * beta, phi);
    }
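The one-hot encoding step mentioned above could look roughly like the following Python sketch, which builds a design matrix matching the `X`, `N`, and `K` inputs of the Stan model. The field names, the example records, the outcome values, and the choice to drop one university as a reference level (so that the dummies are not collinear with the intercept) are all assumptions for illustration; the source does not describe how the authors prepared the matrix.

```python
def build_design_matrix(records, universities):
    """Return (column_names, rows) with an intercept and dummy-coded university.

    The first entry of `universities` is treated as the reference level and
    gets no column of its own (an assumption, not the authors' documented choice).
    """
    dummy_levels = universities[1:]
    columns = ["intercept", "group", "female", "age", "hed"]
    columns += [f"univ_{u}" for u in dummy_levels]
    rows = []
    for r in records:
        row = [1.0, float(r["group"]), float(r["female"]),
               float(r["age"]), float(r["hed"])]
        # One indicator per non-reference university
        row += [1.0 if r["university"] == u else 0.0 for u in dummy_levels]
        rows.append(row)
    return columns, rows


# Hypothetical participants; none of these values come from the trial.
records = [
    {"group": 1, "female": 1, "age": 22, "hed": 2, "university": "A"},
    {"group": 0, "female": 0, "age": 25, "hed": 3, "university": "B"},
    {"group": 1, "female": 0, "age": 21, "hed": 4, "university": "C"},
]
columns, X = build_design_matrix(records, ["A", "B", "C"])

# Made-up outcome (grams per week), shaped like the Stan data block expects.
stan_data = {"N": len(X), "K": len(columns), "X": X, "y": [96, 180, 132]}
```

With this layout, the coefficient for the `group` column plays the role of θ1, and the intercept column is needed because the Stan model has no separate intercept term.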
Ethical Approval
This study was approved by the Regional Ethical Committee in Linköping, Sweden (No. 2013/46-31).
Results
In total, 1605 eligible students agreed to take part in the trial, of whom 825 were randomized to the intervention group and 780 to the control group. Two months after the initial invitation, 58% (931/1605) of trial participants completed the follow-up questionnaire.
Original Analysis: Null Hypothesis Framework
Part of the original analysis of the AMADEUS-2 trial is presented in Table 1. Null hypothesis tests were two-tailed and assessed at the .05 threshold. No statistically significant difference was found between the intervention and control groups with respect to grams of alcohol consumed per week at follow-up (P=.13). To clarify, if we hypothesize that the population IRR is exactly 1, then the data collected in this trial are not extraordinary, that is, the probability of seeing data at least this extreme is greater than 5%. According to convention, this does not allow us to reject the hypothesis that the IRR is exactly 1. The CI identifies a span of hypotheses that cannot be rejected, given the available data. Since the span includes both hypotheses of effect and no effect, the evidence is inconclusive.
Table 1
Original analysis of grams of alcohol consumed per week at follow-up compared between the intervention and control groups. When removing three potential outliers, the difference was marginally statistically significant.

| | Intervention group (n=402), mean (SD)^a | Control group (n=529), mean (SD)^a | Incidence rate ratio^b (95% CI) | P value |
| --- | --- | --- | --- | --- |
| Weekly alcohol consumption (g/wk)^b | 113.4 (81.1) | 120.8 (86.4) | 0.937 (0.861-1.019) | .13 |
| Sensitivity analysis excluding three outliers | 107.4 (73.4) | 119.1 (81.3) | 0.921 (0.848-1.000) | .049 |

^a Mean and SD given by negative binomial regression.
^b Incidence rate ratio given by negative binomial regression (adjusted for sex, age, university, and frequency of heavy episodic drinking at baseline).
In an unplanned sensitivity analysis, data were graphically assessed for skewness (using Q-Q plots), and three potential outliers were identified (Figure 1): one in the intervention group (weekly consumption of 1044 g/week) and two in the control group (1128 g/week and 1524 g/week). These data suggest that the participants consumed over 80 standard drinks in a typical week. The difference between the groups was marginally statistically significant when these three outliers were excluded (P value=.049), with the intervention group on average reporting lower consumption than the control group.
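The claim that these entries correspond to more than 80 standard drinks per week can be checked with a short calculation, using the Swedish definition of 12 grams of alcohol per standard drink given earlier. The variable names are illustrative only.

```python
GRAMS_PER_STANDARD_DRINK = 12  # Swedish definition stated in the text

# Weekly consumption of the three potential outliers (g/week)
outliers_g_per_week = [1044, 1128, 1524]

# Convert grams per week to standard drinks per week
drinks_per_week = [g / GRAMS_PER_STANDARD_DRINK for g in outliers_g_per_week]

for grams, drinks in zip(outliers_g_per_week, drinks_per_week):
    print(f"{grams} g/week is {drinks:.0f} standard drinks/week")
```

The smallest of the three values corresponds to 87 standard drinks per week, in line with the statement above.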
Figure 1
Unplanned sensitivity analysis identifying three potential outliers with respect to weekly alcohol consumption.
Bayesian Analysis
We recall from our discussion in the Methods section that θ1 in Equation 1 represents the coefficient for the GROUP variable, that is, the difference between the intervention and control groups in terms of log count of grams of alcohol consumed per week. We get the IRR by exponentiating this coefficient. The control group’s consumption is multiplied by the IRR to get the intervention group’s consumption; thus, an IRR less than 1 implies that the intervention group consumed less than the control group.

Histograms of the samples drawn from the posterior distribution of θ1 during MCMC are shown in Figure 2 (exponentiated), and samples drawn when excluding the three potential outliers are depicted in Figure 3 (exponentiated). These histograms should be interpreted as visualizing how plausible different values of θ1 are compared to one another. For instance, note how a strong majority of samples drawn were less than 1, indicating that it is more likely than not that the IRR is less than 1 when comparing the intervention and control groups. Thus, the model suggests that it is more likely than not that the intervention group drank less than the control group.
Figure 2
Samples from the posterior distribution of θ1 (exponentiated).
Figure 3
Samples from the posterior distribution of θ1 (excluding three potential outliers, exponentiated).
For different IRR thresholds of interest, we can calculate the marginal posterior probability by simply computing the proportion of samples that fall below or above a given threshold. In Table 2, we have calculated the marginal posterior probabilities for each of the predefined thresholds for the IRR. For instance, 17,875 samples were drawn below 0.96 when including the potential outliers (Figure 2), and we drew a total of 25,000 samples, resulting in a probability of 71.5% (17,875/25,000) that the IRR was less than 0.96.
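The counting procedure described above can be sketched in a few lines of Python. The function name and the synthetic draws are assumptions for illustration: the draws are generated from an arbitrary log-normal distribution and are not the trial's posterior samples, so the resulting probabilities will not reproduce Table 2. Only the 17,875/25,000 calculation uses numbers from the text.

```python
import math
import random


def posterior_probability_below(samples, threshold):
    """Proportion of posterior draws that fall strictly below `threshold`."""
    if not samples:
        raise ValueError("samples must not be empty")
    return sum(1 for s in samples if s < threshold) / len(samples)


# Reported in the text: 17,875 of 25,000 draws of exp(theta_1) were below 0.96.
reported = 17875 / 25000
print(f"P(IRR < 0.96) = {reported:.1%}")

# Synthetic stand-in for MCMC draws of exp(theta_1) (illustrative values only).
rng = random.Random(1)
draws = [math.exp(rng.gauss(math.log(0.94), 0.04)) for _ in range(25000)]

# Marginal posterior probability at each predefined threshold
probs = {t: posterior_probability_below(draws, t) for t in (1.0, 0.96, 0.92)}
for threshold, p in probs.items():
    print(f"P(IRR < {threshold}) = {p:.3f}")
```

Because each probability is a proportion of the same set of draws, the values are necessarily nonincreasing as the threshold is lowered from 1 to 0.96 to 0.92, which is the pattern seen in Table 2.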
Table 2
Bayesian analysis of incidence rate ratios comparing the intervention and control groups at follow-up.

| | Intervention (n=402), mean (SD) | Control (n=529), mean (SD) | Probability^a (%), incidence rate ratio <1 | Probability^a (%), incidence rate ratio <0.96 | Probability^a (%), incidence rate ratio <0.92 |
| --- | --- | --- | --- | --- | --- |
| Weekly alcohol consumption (g/wk) | 113.4 (81.1) | 120.8 (86.4) | 93.6 | 71.5 | 33.9 |
| Sensitivity analysis excluding three outliers | 107.4 (73.4) | 119.1 (81.3) | 97.5 | 83.8 | 49.1 |

^a Marginal posterior probabilities for incidence rate ratios comparing intervention and control groups, given by negative binomial regression (adjusted for sex, age, university, and frequency of heavy episodic drinking at baseline; see Equation 1).
No sampling issues during MCMC were found when inspecting trace plots (Multimedia Appendix 1).
Discussion
Null Hypothesis Testing
The original analysis of the AMADEUS-2 trial did not find a statistically significant difference between the intervention and control groups at follow-up (Table 1, P=.13). The original publication summarized the main analysis as follows [9]:

"The study found no strong evidence of short-term effectiveness of the Swedish national system of proactive online alcohol intervention for university and college students. However, inspection of the confidence intervals for the primary outcome reveals that this study does not rule out an intervention effect of up to 13% reduction in total weekly alcohol consumption."

Thus, dichotomization leads us into a state of uncertainty: We cannot rule out that the intervention had no effect, yet we cannot conclude that the intervention had an effect.

The unplanned sensitivity analysis excluding outliers identified a marginally statistically significant difference; however, such unplanned analyses should be viewed with skepticism. It is generally impossible to know which data points should be considered correct, which are data entry errors, and which are malicious entries.

Although not included in the original analysis, we calculated the P value when excluding only the most extreme potential outlier (1524 g/week) and found that the difference between groups was then statistically significant with a P value of .04 (down from .13). The null hypothesis testing framework, and P values in particular, rely on point estimates of difference, that is, single values that are supposed to summarize the data. Such point estimates can be highly sensitive to single data points. Considering that policy decisions might be made based on this type of trial, we should feel uneasy knowing that statistical significance in a data set of 931 entries may rely on a single study participant or a few participants alone.

The Bayesian analysis of the AMADEUS-2 trial (Figures 2 and 3, Table 2) suggests that there is a 93.6% probability that the intervention group consumed less alcohol than the control group at follow-up in terms of the IRR. The data also suggest that the IRR was more likely than not to be less than 0.96. We may conclude that this difference is due to a positive effect of engaging in an eSBI, and this conclusion is licensed by the randomization component of the trial. However, the difference in point estimates of mean weekly alcohol consumption was approximately 7 grams between the intervention and control groups, suggesting that the eSBI had a lower mean effect than has been synthesized in meta-analyses [1-3].

When excluding the three entries with extreme levels of consumption, the probability of a difference increases. However, the change is not extreme, partially because we are not relying on dichotomization, but mainly because in a Bayesian framework we look at the entire posterior distribution of parameters, rather than point estimates. The major benefit here is that we do not feel obligated to remove the potential outliers at all. Since analyses where outliers have been removed should be viewed with high skepticism, we can keep them in our data analysis while still obtaining similar results.

The posterior probabilities in Table 2 should be the basis for policy decisions and be viewed in light of other factors, including alternative interventions and costs. The national system in Sweden used by the majority of student health care centers allows for eSBIs to reach tens of thousands of university students each year, and the costs have been kept low by sharing a common platform. Given the high reach of the intervention and its low cost, a >90% probability of a positive effect in the trial may convince policy makers that the system should see continued use; however, this reasoning could not be established within the null hypothesis framework, since the evidence was found to be inconclusive.
Limitations
The AMADEUS-2 trial was not sufficiently powered to detect the prespecified effect size considered worth investigating. Approximately one quarter of the target sample size was recruited, which limits the possibility of detecting statistically significant effects. This also limits the Bayesian analysis, as the width of the posterior distribution generally decreases as the number of participants increases.

All analyses were performed under the intention-to-treat principle with complete cases, which assumes that data are missing at random. Although attrition analyses in the original publication did not find evidence against data missing at random, there was a difference in follow-up rates between the intervention group (404/825, 49.0%) and the control group (529/780, 67.8%), which should temper any strong conclusions from the original analysis and this reanalysis.

Finally, subjective measures were used to collect data at baseline and follow-up, which required participants to recall their alcohol consumption in a typical week. Although such measurements may be subject to several sources of bias, such as recall and social desirability bias, it is the norm in brief interventions to use subjective measures, as in most cases it is infeasible to collect biomarker data.
Conclusions
The use of null hypothesis testing with P values has been the target of criticism for some time, not least due to the prevalent misinterpretation of P values and CIs [11,12,15-21]. Yet, the practice stubbornly persists.

In the original publication of the AMADEUS-2 trial, it was acknowledged that it is challenging to reliably detect small effects and that the process may be subject to chance. Digital lifestyle interventions targeting large and sometimes non–treatment-seeking populations are generally expected to have a small-to-modest effect. Basing policy decisions on P values that may be highly sensitive to single data points may not be the most reliable way of deciding which evidence-based interventions should be recommended to the public.