
Excess significance and power miscalculations in neurofeedback research.

Robert T Thibault, Hugo Pedder

Abstract

Keywords:  Neurofeedback; Neuroimaging; Statistical power analysis; fMRI; fNIRS

Year:  2022        PMID: 35525708      PMCID: PMC9421468          DOI: 10.1016/j.nicl.2022.103008

Source DB:  PubMed          Journal:  Neuroimage Clin        ISSN: 2213-1582            Impact factor:   4.891


Recent systematic reviews of neurofeedback with functional magnetic resonance imaging (fMRI-nf) (Tursic et al., 2020) and neurofeedback with functional near-infrared spectroscopy (fNIRS-nf) (Kohl et al., 2020) miscalculate the statistical power and statistical sensitivity of several studies they review. The fMRI-nf review overestimates the mean and median statistical power of included studies by about 3 times and the statistical sensitivity by about 2 times (see Table 1 for recalculated values and comparisons). The fNIRS-nf review, on which I (RTT) was a coauthor, overestimates power by about 2 times and sensitivity by about 1.5 times (see Table 2).
Table 1

Recalculated values for the fMRI-nf review (Tursic et al., 2020).


|  | N | Power (d = 0.2) | Power (d = 0.5) | Power (d = 0.8) | Sensitivity in d (power = 80%) | Sensitivity in d (power = 95%) |
|---|---|---|---|---|---|---|
| Recalculated |  |  |  |  |  |  |
| Mean (regulation) | 29.22 | 0.08 | 0.25 | 0.47 | 1.31 | 1.68 |
| Median (regulation) | 22.00 | 0.07 | 0.20 | 0.43 | 1.26 | 1.62 |
| Mean (clinical) | 26.73 | 0.07 | 0.21 | 0.45 | 1.31 | 1.68 |
| Median (clinical) | 27.00 | 0.07 | 0.21 | 0.48 | 1.15 | 1.47 |
| Original |  |  |  |  |  |  |
| Mean (regulation) | 29.90 | 0.24 | 0.61 | 0.76 | 0.77 | 0.99 |
| Median (regulation) | 22.50 | 0.15 | 0.67 | 0.98 | 0.58 | 0.73 |
| Mean (clinical) | 26.70 | 0.31 | 0.73 | 0.85 | 0.58 | 0.74 |
| Median (clinical) | 27.00 | 0.30 | 0.98 | 0.99 | 0.36 | 0.46 |
| Overestimation factor (original/recalculated) |  |  |  |  |  |  |
| Mean (regulation) | 1.02 | 2.91 | 2.49 | 1.61 | 1.70 | 1.69 |
| Median (regulation) | 1.02 | 2.05 | 3.34 | 2.27 | 2.17 | 2.22 |
| Mean (clinical) | 1.00 | 4.20 | 3.48 | 1.89 | 2.26 | 2.27 |
| Median (clinical) | 1.00 | 4.10 | 4.77 | 2.05 | 3.18 | 3.21 |

The first section of the table presents the values we calculated. The second section presents the values published in the original review. The third section presents an overestimation factor, calculated by dividing the original values by the recalculated values for power, and by dividing the recalculated values by the original values for sensitivity. Power and sensitivity calculations for the ability to regulate the neurofeedback signal are presented separately from those for clinical measures. The overestimation factor was calculated before rounding values to two decimal places; recalculating it from the numbers in the table will therefore produce slightly different values. The mean and median sample sizes in the review differ slightly from ours, possibly due to a calculation error. We used the data provided in the review's supplementary material for these calculations.

Table 2

Recalculated values for the fNIRS-nf review (Kohl et al., 2020).


|  | N | Power (d = 0.2) | Power (d = 0.5) | Power (d = 0.8) | Sensitivity in d (power = 80%) | Sensitivity in d (power = 95%) |
|---|---|---|---|---|---|---|
| Recalculated |  |  |  |  |  |  |
| Mean (regulation) | 19.29 | 0.14 | 0.41 | 0.67 | 0.98 | 1.29 |
| Median (regulation) | 19.00 | 0.14 | 0.43 | 0.75 | 0.85 | 1.13 |
| Mean (behavioural) | 22.10 | 0.10 | 0.31 | 0.56 | 1.11 | 1.45 |
| Median (behavioural) | 20.00 | 0.08 | 0.22 | 0.42 | 1.30 | 1.66 |
| Original |  |  |  |  |  |  |
| Mean (regulation) | 22.11 | 0.20 | 0.55 | 0.74 | 0.88 | 1.15 |
| Median (regulation) | 20.00 | 0.16 | 0.48 | 0.80 | 0.75 | 1.00 |
| Mean (behavioural) | 22.10 | 0.20 | 0.68 | 0.87 | 0.66 | 0.87 |
| Median (behavioural) | 20.00 | 0.22 | 0.76 | 0.97 | 0.53 | 0.69 |
| Overestimation factor (original/recalculated) |  |  |  |  |  |  |
| Mean (regulation) | 1.15 | 1.45 | 1.33 | 1.10 | 1.12 | 1.12 |
| Median (regulation) | 1.05 | 1.14 | 1.12 | 1.06 | 1.14 | 1.13 |
| Mean (behavioural) | 1.00 | 1.97 | 2.23 | 1.55 | 1.69 | 1.67 |
| Median (behavioural) | 1.00 | 2.86 | 3.53 | 2.30 | 2.45 | 2.41 |

The first section of the table presents the values we calculated. The second section presents the values published in the original review. The third section presents an overestimation factor, calculated by dividing the original values by the recalculated values for power, and by dividing the recalculated values by the original values for sensitivity. Power and sensitivity calculations for the ability to regulate the neurofeedback signal are presented separately from those for behavioural measures. The overestimation factor was calculated before rounding values to two decimal places; recalculating it from the numbers in the table will therefore produce slightly different values. The mean and median sample sizes in the review differ slightly from ours: whereas we calculated these values based on the sample size used in the statistical tests, Kohl et al. calculated them based on the total number of participants. We removed one study from our calculations because it only ran binomial tests within each participant but did not test for group effects. One study used a biserial correlation, for which we calculated power as for a Pearson's correlation. One study used an ANCOVA, for which we calculated power using a 2 × 2 repeated measures (mixed) ANOVA.

The miscalculations arise from an easy-to-miss default option for repeated measures (mixed) ANOVAs in the statistical software G*Power (Faul et al., 2007), which both reviews used (see Fig. 1 for a depiction). The default option defines a variable in the effect size calculation (η²) in such a way that the common benchmarks for small, medium, and large effect sizes for the interaction of repeated measures (mixed) ANOVAs (f) no longer hold. Researchers unaware of this default will produce power calculations that account for the correlation between repeated measures a second time, and in turn substantially, but erroneously, inflate power. The G*Power software itself highlights that Cohen (1988) recommended another option (as viewable in Fig. 1). Although Lakens (2013) explained this issue almost 10 years ago, researchers likely continue to use G*Power without awareness of this default option and its implications. Fortunately, the authors of both reviews published their data as supplementary material, making reanalysis possible.
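The mechanism can be sketched numerically. Assuming the noncentrality formula from the G*Power manual for the within-between interaction, λ = f² · N · m / (1 − ρ) under the default effect-size option, we can compare power with and without the extra correlation term. The design values below (f = 0.25, N = 28, two groups, two measurements, ρ = 0.5) are illustrative assumptions, not figures taken from either review:

```python
# Sketch of how the G*Power default effect-size option can double-count the
# repeated-measures correlation. All design values here are illustrative.
from scipy.stats import f as f_dist, ncf

def interaction_power(f_effect, n_total, n_groups, n_meas, rho,
                      gpower_default=True, alpha=0.05):
    """Approximate power for the group-by-time interaction of a mixed ANOVA.

    Under the G*Power default, the noncentrality parameter is
    lambda = f^2 * N * m / (1 - rho). If the entered f already reflects the
    repeated-measures correlation (Cohen's convention), the 1/(1 - rho)
    factor counts that correlation a second time and inflates power.
    """
    lam = f_effect ** 2 * n_total * n_meas
    if gpower_default:
        lam /= (1 - rho)                       # extra correlation adjustment
    df1 = (n_groups - 1) * (n_meas - 1)        # interaction df
    df2 = (n_total - n_groups) * (n_meas - 1)  # error df (sphericity assumed)
    crit = f_dist.ppf(1 - alpha, df1, df2)     # critical F under the null
    return 1 - ncf.cdf(crit, df1, df2, lam)    # power under the noncentral F

# A "medium" interaction effect (f = 0.25) in a 2-group, 2-measurement
# design with N = 28 and an assumed correlation of 0.5 between measurements:
inflated = interaction_power(0.25, 28, 2, 2, 0.5, gpower_default=True)
correct = interaction_power(0.25, 28, 2, 2, 0.5, gpower_default=False)
print(round(inflated, 2), round(correct, 2))  # the default markedly inflates power
```

With ρ = 0.5 the noncentrality parameter doubles under the default, which is enough to turn an underpowered design into one that looks adequately powered on paper.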
Fig. 1

Depiction of the default and Cohen's recommended options for conducting power calculations for repeated-measures (mixed) ANOVAs in G*Power.

We recalculated the statistical power and sensitivity of the studies from Tursic et al. (2020) and Kohl et al. (2020) using the WebPower package in R. Our recalculations show that the median study in the fMRI-nf review has only 21% power to detect clinical effects of Cohen's d = 0.5, and the median study in the fNIRS-nf review has 22% power to detect behavioural effects of the same size. The median studies in fMRI-nf and fNIRS-nf have 80% power to detect only large to very large effect sizes (d = 0.85 to 1.30; see the sensitivity columns in Table 1 and Table 2). The fMRI-nf review overestimates power to a greater degree than the fNIRS-nf review because more of the studies it reviewed used repeated measures (mixed) ANOVAs, where the consequential default option applies.

Effect sizes of this magnitude are uncommon in medicine, and when found, they rarely replicate in larger follow-up trials (Nagendran et al., 2016). One study compiled meta-analyses of the 20 most common pharmaceutical therapies and found a mean effect size of d = 0.58 (median d = 0.56) (Leucht et al., 2015). Antidepressants, for example, have an effect size of d = 0.30 compared to placebo for treating depression (Cipriani et al., 2018). For a more tangible comparison, the height difference between men and women over the age of 20 in the United States is d = 1.01 (National Center for Health Statistics, 2021). Thus, the median studies in these neurofeedback reviews have 80% power to detect a clinical or behavioural effect size about 4 times larger than that of antidepressants, or slightly larger than the height difference between men and women in the United States. Given the sample sizes used in the reviewed studies, even if neurofeedback drove "large" clinical or behavioural effects (d = 0.8), fewer than half of the studies should have statistically significant results at p < .05.
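To make these power and sensitivity figures concrete, here is a simplified version of the calculation, assuming a plain two-sample t-test with 14 participants per group (near the median total N of the reviewed studies). Because the reviews mix several test types, these values only approximate the tabled ones:

```python
# Rough illustration of power and sensitivity for a two-sample t-test at a
# sample size near the median of the reviewed studies (14 per group, N = 28).
# The reviews mix several test types, so these values only approximate theirs.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power to detect a "medium" (d = 0.5) and "large" (d = 0.8) effect:
power_medium = analysis.solve_power(effect_size=0.5, nobs1=14, alpha=0.05)
power_large = analysis.solve_power(effect_size=0.8, nobs1=14, alpha=0.05)

# Sensitivity: the smallest d detectable with 80% power at this sample size.
detectable_d = analysis.solve_power(nobs1=14, alpha=0.05, power=0.8)

print(f"power at d = 0.5: {power_medium:.2f}")             # roughly 0.2-0.3
print(f"power at d = 0.8: {power_large:.2f}")              # roughly a coin flip
print(f"d detectable at 80% power: {detectable_d:.2f}")    # well above d = 1
```

Even under a generous "large" effect assumption, a study of this size is about as likely to miss the effect as to find it, which is the substance of the excess-significance argument that follows.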
And yet, Tursic et al. found that 10/11 (91%) of the fMRI-nf studies that were not pilot studies reported clinical improvements, while another review found that 24/35 (69%) of fMRI-nf studies reported behavioural improvements compared to a control group (Thibault et al., 2018). Kohl et al. found that all studies reported improvement in at least one behavioural measure. This excess significance in the fMRI-nf and fNIRS-nf literature may stem from a combination of an absence of corrections for multiple comparisons, data-dependent analytical decisions, selective reporting, publication bias, false positives, statistical tests against baseline rather than against a control group, and other sources of bias.

Can we be sure current sample sizes are insufficient? It depends on the question. On the one hand, if the goal is to show that individuals can control their brain imaging data or improve their behaviour compared to baseline, where within-sample designs are appropriate and effect sizes may be large, then the upper end of current sample sizes would be sufficient. For example, neurofeedback has driven very large behavioural effects compared to baseline (d ≈ 1.5) when using EEG-nf to treat ADHD (Arnold et al., 2021, Schönenberg et al., 2017) or fMRI-nf to treat depression (Mehler et al., 2018, Young et al., 2017). However, these effect sizes are generally much smaller or absent when comparing the experimental group to an active control group (Trambaiolli et al., 2021). On the other hand, if the goal is to demonstrate that a target neurofeedback protocol outperforms a reasonable control condition or matches the performance of an accepted treatment, then current sample sizes remain inadequate.

Continuing to run poorly powered studies fills the literature with noise and wastes resources (Button et al., 2013). Genetics research provides a stark example of this issue.
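The excess-significance claim above lends itself to a quick sanity check in the spirit of Ioannidis and Trikalinos's excess-significance test. If average power to detect even a large effect were around 45% (an assumed illustrative figure, close to the recalculated values in Table 1), observing 10 positive results in 11 studies would be very unlikely. A minimal sketch:

```python
# If average power to detect even a "large" effect (d = 0.8) were about 45%
# (an assumed illustrative figure, close to the recalculated values), how
# surprising is it that 10 of 11 non-pilot fMRI-nf studies reported clinical
# improvements?  A one-sided binomial test quantifies the mismatch.
from scipy.stats import binomtest

result = binomtest(k=10, n=11, p=0.45, alternative="greater")
print(f"P(>= 10 of 11 positive | power = 0.45) = {result.pvalue:.4f}")  # ~0.002
```

A probability this small suggests the observed rate of positive findings cannot be explained by true effects alone, pointing to the reporting and analysis biases listed above.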
With the advent of inexpensive genome-wide testing, and the associated ability to increase sample sizes by orders of magnitude, the literature on candidate genes was found to be largely noise (Border et al., 2019, Flint and Munafò, 2013).

How should we move forward? Increasing sample size is an obvious, albeit practically challenging, solution. Without an influx of resources, we would need multi-site collaborations (e.g., as done recently for EEG-nf: Arnold et al., 2021). To detect an effect size equivalent to the median effect size of the 20 most common pharmaceuticals would require 102 participants; an effect size equivalent to antidepressants would require 351 participants. These sample sizes can be prohibitive, even for multi-site collaborations. Increasing the effect size presents another option. Neurofeedback publications sometimes identify "responders" and "non-responders" post hoc. If these groups could be identified a priori, and neurofeedback selectively applied to responders, the group effect would increase. However, repeated efforts to apply this approach in personalized medicine remain largely unsuccessful (Senn, 2018).

Unfortunately, there is no easy solution. In many cases, resources are simply too scarce to answer a research question. We are better off resisting the temptation to forge ahead with uninformative sample sizes, even when incentive structures may encourage us to do so (Higginson & Munafò, 2016). In the words of Doug Altman (1994): "We need less research, better research, and research done for the right reasons".
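As a rough check on the sample-size figures above, a standard two-sample power calculation (statsmodels is assumed here in place of the authors' WebPower code) lands within a few participants of the quoted totals; the exact numbers depend on the test assumed and on rounding:

```python
# Back-of-the-envelope check of the required sample sizes quoted above,
# using a two-sample t-test at 80% power and a two-sided alpha of .05.
# Exact totals depend on the test assumed and on rounding conventions.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# d = 0.56: median effect of the 20 most common pharmaceuticals (Leucht et al., 2015)
n_pharma = analysis.solve_power(effect_size=0.56, alpha=0.05, power=0.8)

# d = 0.30: antidepressants versus placebo (Cipriani et al., 2018)
n_antidep = analysis.solve_power(effect_size=0.30, alpha=0.05, power=0.8)

print(f"per-group n for d = 0.56: {math.ceil(n_pharma)} "
      f"(total ~{2 * math.ceil(n_pharma)})")
print(f"per-group n for d = 0.30: {math.ceil(n_antidep)} "
      f"(total ~{2 * math.ceil(n_antidep)})")
```

The quadratic cost is the key point: halving the target effect size roughly quadruples the required sample, which is why detecting antidepressant-sized effects demands several hundred participants.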

Data availability statement

All data and materials related to this study are publicly available on the Stanford Digital Repository (https://doi.org/10.25740/bn925rp5443).

Code availability statement

To facilitate reproducibility, this manuscript was written by interleaving regular prose and analysis code using R Markdown. The relevant files are available on the Stanford Digital Repository (https://doi.org/10.25740/bn925rp5443) and in a Code Ocean container (https://doi.org/10.24433/CO.7282505.v1), which recreates the software environment in which the original analyses were performed. This container allows the manuscript to be reproduced from the data and code with a single button press.

Contributions

Robert Thibault conceived the idea for this commentary and led the analyses and writing. Hugo Pedder provided statistical support, reviewed the code, and contributed to the commentary through discussions.

Funding

Robert Thibault is supported by a general support grant awarded to METRICS from the Laura and John Arnold Foundation and a postdoctoral fellowship from the Fonds de recherche du Québec – Santé. Hugo Pedder was funded by the NIHR Biomedical Research Centre at University Hospitals Bristol and Weston NHS Foundation Trust and the University of Bristol. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care. The funders had no role in the data analysis, decision to publish, or preparation of the manuscript.

Declaration of Competing Interest

Robert Thibault has received payments for consulting for neurofeedback start-up companies. Hugo Pedder declares no competing interests.
References (20 in total; 10 shown)

1.  Randomized Clinical Trial of Real-Time fMRI Amygdala Neurofeedback for Major Depressive Disorder: Effects on Symptoms and Autobiographical Memory Recall.

Authors:  Kymberly D Young; Greg J Siegle; Vadim Zotev; Raquel Phillips; Masaya Misaki; Han Yuan; Wayne C Drevets; Jerzy Bodurka
Journal:  Am J Psychiatry       Date:  2017-04-14       Impact factor: 18.112

2.  Statistical pitfalls of personalized medicine.

Authors:  Stephen Senn
Journal:  Nature       Date:  2018-11       Impact factor: 49.962

3.  Power failure: why small sample size undermines the reliability of neuroscience.

Authors:  Katherine S Button; John P A Ioannidis; Claire Mokrysz; Brian A Nosek; Jonathan Flint; Emma S J Robinson; Marcus R Munafò
Journal:  Nat Rev Neurosci       Date:  2013-04-10       Impact factor: 34.870

4.  The scandal of poor medical research.

Authors:  D G Altman
Journal:  BMJ       Date:  1994-01-29

5.  The Potential of Functional Near-Infrared Spectroscopy-Based Neurofeedback-A Systematic Review and Recommendations for Best Practice.

Authors:  Simon H Kohl; David M A Mehler; Michael Lührs; Robert T Thibault; Kerstin Konrad; Bettina Sorger
Journal:  Front Neurosci       Date:  2020-07-21       Impact factor: 5.152

6.  Candidate and non-candidate genes in behavior genetics.

Authors:  Jonathan Flint; Marcus R Munafò
Journal:  Curr Opin Neurobiol       Date:  2012-08-08       Impact factor: 6.627

7.  Current Incentives for Scientists Lead to Underpowered Studies with Erroneous Conclusions.

Authors:  Andrew D Higginson; Marcus R Munafò
Journal:  PLoS Biol       Date:  2016-11-10       Impact factor: 8.029

8.  Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis.

Authors:  Andrea Cipriani; Toshi A Furukawa; Georgia Salanti; Anna Chaimani; Lauren Z Atkinson; Yusuke Ogawa; Stefan Leucht; Henricus G Ruhe; Erick H Turner; Julian P T Higgins; Matthias Egger; Nozomi Takeshima; Yu Hayasaka; Hissei Imai; Kiyomi Shinohara; Aran Tajika; John P A Ioannidis; John R Geddes
Journal:  Lancet       Date:  2018-02-21       Impact factor: 79.321

9.  Double-Blind Placebo-Controlled Randomized Clinical Trial of Neurofeedback for Attention-Deficit/Hyperactivity Disorder With 13-Month Follow-up.

Authors: 
Journal:  J Am Acad Child Adolesc Psychiatry       Date:  2020-08-25       Impact factor: 13.113

10.  Targeting the affective brain-a randomized controlled trial of real-time fMRI neurofeedback in patients with depression.

Authors:  David M A Mehler; Moses O Sokunbi; Isabelle Habes; Kali Barawi; Leena Subramanian; Maxence Range; John Evans; Kerenza Hood; Michael Lührs; Paul Keedwell; Rainer Goebel; David E J Linden
Journal:  Neuropsychopharmacology       Date:  2018-06-23       Impact factor: 7.853

Corrigendum (1 in total)

1.  Corrigendum: The Potential of Functional Near-Infrared Spectroscopy-Based Neurofeedback-A Systematic Review and Recommendations for Best Practice.

Authors:  Simon H Kohl; David M A Mehler; Michael Lührs; Robert T Thibault; Kerstin Konrad; Bettina Sorger
Journal:  Front Neurosci       Date:  2022-08-22       Impact factor: 5.152

