Keith R Lohse, Kristin L Sainani, J Andrew Taylor, Michael L Butson, Emma J Knight, Andrew J Vickers.
Abstract
Magnitude-based inference (MBI) is a controversial statistical method that has been used in hundreds of papers in sports science despite criticism from statisticians. To better understand how this method has been applied in practice, we systematically reviewed 232 papers that used MBI. We extracted data on study design, sample size, and choice of MBI settings and parameters. Median sample size was 10 per group (interquartile range, IQR: 8-15) for multi-group studies and 14 (IQR: 10-24) for single-group studies; few studies (15%) reported a priori sample size calculations. Authors predominantly applied MBI's default settings and chose "mechanistic/non-clinical" rather than "clinical" MBI even when testing clinical interventions (only 16 of 232 studies used clinical MBI). Using these data, we estimated Type I error rates for the typical MBI study. Authors frequently made dichotomous claims about effects based on the MBI criterion of a "likely" effect, and sometimes on the criterion of a "possible" effect. When the sample size is n = 8 to 15 per group, these inferences have Type I error rates of 12%-22% and 22%-45%, respectively. High Type I error rates were compounded by multiple testing: authors reported results from a median of 30 tests related to outcomes, and few studies (14%) specified a primary outcome. We conclude that MBI has promoted small studies, promulgated a "black box" approach to statistics, and led to numerous papers in which the conclusions are not supported by the data. Amidst debates over the role of p-values and significance testing in science, MBI also provides an important natural experiment: we find no evidence that moving researchers away from p-values or null hypothesis significance testing makes them less prone to dichotomization or over-interpretation of findings.
Year: 2020 PMID: 32589653 PMCID: PMC7319293 DOI: 10.1371/journal.pone.0235318
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. PRISMA flowchart.
PRISMA flowchart showing the screening of articles through the systematic review process.
Fig 2. Example MBI inferences.
Ten hypothetical results and corresponding MBI inferences, assuming: a trivial range of -0.2 to 0.2 standard deviations, maximum risk of harm of 5%, and equivalent treatment of positive and negative directions (non-clinical MBI). MBI inferences correspond to the locations of the 50% and 90% confidence intervals relative to the negative (or harmful), positive (or beneficial), and trivial ranges. The result is deemed “unclear” if the 90% confidence interval spans the trivial range. Of note, minimal effect testing with α = 0.05, two-sided, would not arrive at conclusions of negative or positive for any of the examples shown. Equivalence testing with α = 0.05 would also fail to conclude equivalent (i.e., trivial difference) for any of the examples shown.
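The decision rules described in this caption can be sketched in a few lines of code. Below is a minimal Python sketch of non-clinical MBI under a normal approximation to the sampling distribution (the actual MBI spreadsheets use t-distributions, so results differ slightly for small samples); the function name `mbi_inference` and the qualifier cutoffs (25%/75%/95%/99.5% for "possibly"/"likely"/"very likely"/"most likely") follow common MBI descriptions and are illustrative assumptions, not the paper's code.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mbi_inference(estimate, se, trivial=0.2, max_risk=0.05):
    """Non-clinical MBI inference for an effect estimate and its standard
    error. Returns (p_negative, p_trivial, p_positive, label)."""
    p_neg = norm_cdf((-trivial - estimate) / se)       # P(effect < -trivial)
    p_pos = 1.0 - norm_cdf((trivial - estimate) / se)  # P(effect > +trivial)
    p_triv = 1.0 - p_neg - p_pos
    # "Unclear" when both substantial directions remain plausible, i.e. both
    # tail probabilities exceed the maximum risk (5% <-> the 90% CI spans
    # the trivial range).
    if p_neg > max_risk and p_pos > max_risk:
        return p_neg, p_triv, p_pos, "unclear"
    direction, p = max([("negative", p_neg), ("trivial", p_triv),
                        ("positive", p_pos)], key=lambda t: t[1])
    for cutoff, word in [(0.995, "most likely"), (0.95, "very likely"),
                         (0.75, "likely"), (0.25, "possibly")]:
        if p >= cutoff:
            return p_neg, p_triv, p_pos, f"{word} {direction}"
    return p_neg, p_triv, p_pos, f"unlikely {direction}"  # unreachable: max of 3 probs >= 1/3
```

For instance, an estimated effect of 0.5 with SE 0.2 comes out "likely positive", while an estimate of 0 with SE 0.3 is "unclear" because both tails exceed 5%.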
The top five most frequent venues for MBI publications identified in our review.
| Journal Title | Journal Impact Factor | Number of MBI Publications |
|---|---|---|
| The Journal of Strength and Conditioning Research | 2.325 | 35 |
| The International Journal of Sports Physiology and Performance | 3.384 | 24 |
| Journal of Sports Sciences | 2.733 | 17 |
| PLoS One | 2.766 | 11 |
| Frontiers in Physiology | 3.394 | 10 |
Journal impact factors were extracted from the Journal Citation Reports database on 2019-03-20.
Descriptive statistics of the 232 articles identified in the systematic review, median [IQR] or N (%).
MBI settings were not discernible in all studies, as indicated.
| Measure | Median [IQR] or N(%) |
|---|---|
| N per group for studies with >1 group (n = 111)^a | 10 [8, 15] |
| Total N for single group studies (n = 121) | 14 [10, 24] |
| Number of dependent variables | 7 [5, 12] |
| Number of Statistical Tests pertaining to the main hypotheses | 30 [15, 56] |
| MBI Parameters^b | |
| Harm/negative threshold = -0.2 | 182 (79%) |
| Benefit/positive threshold = +0.2 | 181 (78%) |
| Maximum risk of harm = 5% | 183 (79%) |
| Statement of a priori sample size calculation | 34 (15%) |
| ‘Primary’ Variable Explicitly Defined | 33 (14%) |
| Attrition or Exclusions Stated | 55 (24%) |
| Described as Bayesian | 0 (0%) |
| NHST also Performed | 108 (47%) |
| Minimum MBI evidence threshold applied^c | |
| “Possible” (≥25%) | 88 (38%) |
| ≥50% | 19 (8%) |
| “Likely” (≥75%) | 100 (43%) |
| “Very likely” (≥95%) | 0 (0%) |
| Not able to be determined | 25 (11%) |
| Study Design | |
| RCT | 53 (23%) |
| Cross-Over | 58 (25%) |
| Observational | 95 (41%) |
| Other | 26 (11%) |
| Clinical or Non-Clinical MBI | |
| Clinical, explicitly stated | 8 (3.4%) |
| Non-clinical, explicitly stated | 37 (16%) |
| Both, explicitly stated | 3 (1.3%) |
| Determined to be clinical though not explicitly stated^d | 5 (2.2%) |
| Determined to be non-clinical though not explicitly stated^e | 164 (71%) |
| Not able to be determined | 15 (6.1%) |
^a Of these, 72 were two-group studies; the median [IQR] sample size for two-group studies was 10 [8, 14].
^b Our counts may underestimate the number of times the default MBI parameters were used, as some papers provided insufficient information to determine these values. We were unable to discern a value for the harm/negative threshold in 35 papers, the benefit/positive threshold in 32 papers, and the maximum risk of harm in 20 papers.
^c Some authors explicitly set a minimum evidence threshold above which effects were declared "implementable", "substantial", or "practically meaningful." Others set this threshold implicitly by highlighting and drawing conclusions only from effects that met a given evidentiary threshold, such as "likely" or "possible."
^d Clinical MBI was inferred from statements such as: "a clinically clear beneficial effect was at least possibly beneficial (>25% chance) and almost certainly not harmful (<0.5% risk)." Our count includes one paper that was explicitly labeled as non-clinical MBI but that we believe ran clinical MBI.
^e Non-clinical MBI was inferred from the statement: "When the positive and negative values were both >5%, the inference was classified as unclear" or, equivalently, "If the 90% confidence interval overlapped the thresholds for the smallest worthwhile positive and negative effects, effects were classified as unclear." In a few other cases, non-clinical MBI was determined mathematically from how "unclear" results were called. Our count includes two papers that were explicitly labeled as clinical MBI but that we believe ran non-clinical MBI.
Fig 3. MBI's Type I error rates.
A and B: Type I error rates for MBI’s “possible” (purple) and “likely” (red) thresholds, as well as standard hypothesis testing at α = 0.05 (blue) as a function of sample size. The statistical comparison is a two-group comparison of means. True effect size = 0, meaning there is no difference between the groups. (A) assumes variance of 0.364, as might arise in a pre-post study, whereas (B) assumes a variance of 1.0, as in a cross-sectional study. Shaded area shows the interquartile range of sample sizes of the reviewed studies; vertical reference line is the median sample size. Type I error rates were identical whether calculated mathematically or by simulation with 200,000 repeats (see S1 Appendix). C: MBI results from 5000 simulated trials where variance = 0.364 and n = 10 per group. D: MBI results from 5000 simulated trials where variance = 1.0 and n = 10 per group. Simulations and calculations use the MBI settings that predominate in the literature: trivial range of -0.2 to 0.2; maximum risk of harm of 5%; and equivalent treatment of positive and negative directions.
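The error rates in panels A and B can be approximated with a short Monte Carlo sketch. The version below is a simplification, not the paper's S2 Appendix code: it assumes a known standard deviation and a normal approximation (the original analysis used t-based intervals, so exact rates differ by a percentage point or two), and the helper name `mbi_error_rate` is our own.

```python
import math
import random

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mbi_error_rate(n=10, var=0.364, trivial=0.2, max_risk=0.05,
                   trials=20000, seed=1):
    """Monte Carlo Type I error rates for non-clinical MBI when the true
    between-group effect is zero (normal approximation, known SD)."""
    rng = random.Random(seed)
    sd = math.sqrt(var)
    se = sd * math.sqrt(2.0 / n)  # SE of the difference in group means
    possible = likely = 0
    for _ in range(trials):
        group_a = [rng.gauss(0.0, sd) for _ in range(n)]
        group_b = [rng.gauss(0.0, sd) for _ in range(n)]
        diff = sum(group_b) / n - sum(group_a) / n
        p_neg = norm_cdf((-trivial - diff) / se)       # P(effect < -trivial)
        p_pos = 1.0 - norm_cdf((trivial - diff) / se)  # P(effect > +trivial)
        if p_neg > max_risk and p_pos > max_risk:
            continue  # "unclear" results make no directional claim
        p_dir = max(p_neg, p_pos)
        if p_dir >= 0.25:   # declared at least "possibly" substantial
            possible += 1
        if p_dir >= 0.75:   # declared at least "likely" substantial
            likely += 1
    return possible / trials, likely / trials
```

With the defaults (n = 10 per group, variance 0.364), this sketch yields roughly a 35% false-positive rate at the "possible" threshold and 16% at the "likely" threshold, in line with the rates reported for the base case.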
Fig 4. An example of MBI inferences in practice.
Left Panel (Reproduced from Parfey et al. [50], Fig 2C): Literature example where effects deemed “likely” by MBI are associated with large p-values. Confidence intervals are 95% CIs. Starred values are effects meeting MBI’s “likely” threshold. These results were interpreted as evidence of a difference between groups; the authors concluded: “Individuals with CLBP and PR manifested altered activation patterns during the hollowing maneuver compared to healthy controls.” Right panel: Simulation that shows the MBI inferences that are expected for a study of this type (n = 10 per group, cross-sectional) when the true effect is 0. Note that in both the real example and the simulation, most observed effects larger than 0.5 are deemed “likely”.
Type I error rates for MBI vary as a function of sample size, statistical comparison, variance, maximum risk of harm, and thresholds for harm/benefit.
Row 1 represents the typical MBI study; subsequent rows demonstrate how changing specific parameters (shown in bold) alters the rates. Rates were calculated mathematically and also confirmed by simulation with 200,000 repeats (see S1 Appendix for a description and S2 Appendix for code).
| Statistical comparison | Sample size per group | Variance^a | Maximum risk of harm | Threshold for harm/benefit^d | True trivial effect size^b | Number of statistical tests run^c | MBI "possible" threshold | MBI "likely" threshold | p < .05 | p < .01 |
|---|---|---|---|---|---|---|---|---|---|---|
| Two-group pre-post | 10 | 0.36 | 0.05 | 0.2 | 0 | 1 | 35% | 16% | 5% | 1% |
| Two-group pre-post | | 0.36 | 0.05 | 0.2 | 0 | 1 | 52% | 9% | 5% | 1% |
| **Two-group cross-sectional** | 10 | **1.0** | 0.05 | 0.2 | 0 | 1 | 22% | 21% | 5% | 1% |
| | | | 0.05 | 0.2 | 0 | 1 | 48% | 6% | 5% | 1% |
| | | | 0.05 | | 0 | 1 | 29% | 19% | 5% | 1% |
| | | | 0.05 | 0.2 | 0 | 1 | 39% | 13% | 5% | 1% |
| Two-group pre-post | 10 | 0.36 | **0.005** | 0.2 | 0 | 1 | 6% | 6% | 5% | 1% |
| Two-group pre-post | 10 | 0.36 | 0.05 | **0.1** | 0 | 1 | 20% | 20% | 5% | 1% |
| Two-group pre-post | 10 | 0.36 | 0.05 | 0.2 | **0.1** | 1 | 38% | 19% | 6% | 1.5% |
| Two-group pre-post | 10 | 0.36 | 0.05 | 0.2 | 0 | **10** | 99% | 82% | 40% | 10% |
^a Calculations use standardized effect sizes, so variance = 1. But for statistical comparisons that involve change scores, the within-person variance may be lower than 1.
^b Type I error rates can also be calculated for non-zero, trivial effects.
^c When the number of statistical tests is >1, the Type I error rates represent the chance of at least one false positive, calculated assuming independent tests.
^d Some within-person studies use a threshold for harm/benefit of 0.3 of the within-subject coefficient of variation; this typically translates to a smaller trivial range than 0.2 baseline standard deviations.
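The compounding described in footnote c is easy to verify: with k independent tests, the chance of at least one false positive is 1 - (1 - r)^k, where r is the per-test rate. The snippet below applies this to the base-case rates from Row 1; the count of 10 tests is an illustrative choice (the review found a median of 30 tests pertaining to main hypotheses).

```python
def familywise_rate(per_test_rate, k):
    """Chance of at least one false positive across k independent tests."""
    return 1.0 - (1.0 - per_test_rate) ** k

# Base-case per-test Type I error rates, compounded over 10 independent tests:
for rate in (0.35, 0.16, 0.05, 0.01):
    print(f"{rate:.0%} per test -> {familywise_rate(rate, 10):.1%} familywise")
```

At a 16% per-test "likely" rate, just 10 independent tests push the familywise false-positive rate above 80%.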
Example of how MBI results can be re-interpreted by one-sided minimal effects testing and noninferiority testing with α = .05.
| Dependent variable | MBI benefit probability | 95% CI for the effect | MBI interpretation | P-value for the null hypothesis of no increase (H0: effect ≤ δ_b) | P-value for the null hypothesis of decrease (H0: effect ≤ -δ_h) | Re-interpretation |
|---|---|---|---|---|---|---|
| Vertical jump height | | | | | | |
| 24 hours | 74% | -2.14, 6.9 | Possible increase | 0.26 | < .05 | No substantial decrease |
| 48 hours | 92% | -0.22, 5.82 | Substantial increase | 0.08 | < .05 | No substantial decrease |
| 72 hours | "Unclear" | -2.82, 2.62 | Unclear | 0.62 | 0.33 | Inconclusive |
| Quadriceps passive range of motion | | | | | | |
| 24 hours | "Unclear" | -6.18, 8.78 | Unclear | 0.52 | 0.22 | Inconclusive |
| 48 hours | 90% | -0.96, 14.16 | Substantial increase | 0.10 | < .05 | No substantial decrease |
| 72 hours | 79% | -3.24, 13.14 | Substantial increase | 0.21 | < .05 | No substantial decrease |
^a This example study (MacDonald et al. [30]) was a randomized trial comparing n = 10 in the intervention group (foam rolling) to n = 10 controls. The study examined 13 outcome variables at 3 time points but did not designate a primary outcome or time point and made no corrections for multiple testing. Using data from their Tables 1 and 2 [30], we re-analyzed and re-interpreted the data for two variables: vertical jump height and quadriceps passive range of motion. Column 5 shows the p-values for the null hypothesis of no increase (H0: true effect ≤ 0.2 SD), which corresponds to a one-sided minimal-effects test. Using α = .05, we would fail to reject this null hypothesis for any outcome. Column 6 shows the p-values for the null hypothesis of decrease (H0: true effect ≤ -0.2 SD), which corresponds to a noninferiority test. Using α = .05, we would reject the null hypothesis for 4 of 6 outcomes. Though the paper concluded that foam rolling improved vertical jump height and passive range of motion, this re-analysis suggests that those conclusions were overly optimistic. At best, the study could conclude that foam rolling was not detrimental to jump height or passive range of motion at some time points. Note that this re-analysis does not account for the multiplicity of tests (39 total tests were run; only 6 are shown here).
^b The p-values for the null hypothesis of no increase are obtained by subtracting the MBI benefit/positive probabilities from 1. For example, 1 - .74 = .26.
^c Effects were only deemed "clear" if the one-sided p-value for the null hypothesis of decrease was significant at p < .05 (the study used non-clinical MBI with η1 = η2 = 5%).
^d This paper used a minimum evidence cutoff of "likely" for declaring substantial effects, specifying: "Results with a >75% likelihood were considered to be substantial." [30]
^e MBI probabilities were not given for "unclear" results, but we were able to back-calculate these p-values from the effect size estimate and 95% confidence intervals available in the paper.
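The back-calculation described in footnote e can be sketched from a reported 95% CI alone. The function below is an illustrative helper (our name, not from the paper) under a normal approximation; the actual re-analysis would use t-distributions, and the thresholds δ_b and δ_h must be supplied in the outcome's raw units, so the numbers it returns will only approximate the table's p-values.

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sided_tests(ci_low, ci_high, delta_benefit, delta_harm):
    """Back-calculate one-sided p-values from a reported 95% CI.

    Returns (p_minimal_effect, p_noninferiority):
      p_minimal_effect: H0: effect <= delta_benefit (no substantial increase)
      p_noninferiority: H0: effect <= -delta_harm   (substantial decrease)
    """
    estimate = (ci_low + ci_high) / 2.0           # midpoint of the CI
    se = (ci_high - ci_low) / (2.0 * 1.96)        # normal-approximation SE
    p_me = 1.0 - norm_cdf((estimate - delta_benefit) / se)
    p_ni = 1.0 - norm_cdf((estimate + delta_harm) / se)
    return p_me, p_ni
```

As the table illustrates, rejecting only the noninferiority null at α = .05 supports the modest claim "no substantial decrease"; a claim of a substantial increase would require rejecting the minimal-effects null as well.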