Abstract
Science needs to understand the strength of its findings. This essay considers the evaluation of studies that test scientific (not statistical) hypotheses. A scientific hypothesis is a putative explanation for an observation or phenomenon; it makes (or "entails") testable predictions that must be true if the hypothesis is true and that lead to its rejection if they are false. The question is, "how should we judge the strength of a hypothesis that passes a series of experimental tests?" This question is especially relevant in view of the "reproducibility crisis" that is the cause of great unease. Reproducibility is said to be a dire problem because major neuroscience conclusions supposedly rest entirely on the outcomes of single, p valued statistical tests. To investigate this concern, I propose to (1) ask whether neuroscience typically does base major conclusions on single tests; (2) discuss the advantages of testing multiple predictions to evaluate a hypothesis; and (3) review ways in which multiple outcomes can be combined to assess the overall strength of a project that tests multiple predictions of one hypothesis. I argue that scientific hypothesis testing in general, and combining the results of several experiments in particular, may justify placing greater confidence in multiple-testing procedures than in other ways of conducting science.
Keywords: estimation statistics; hypothesis testing; meta-analysis; p value; reproducibility crisis; statistical hypothesis
Year: 2020 PMID: 32641499 PMCID: PMC7385663 DOI: 10.1523/ENEURO.0357-19.2020
Source DB: PubMed Journal: eNeuro ISSN: 2373-2822
Analysis of The Journal of Neuroscience Research Articles
| Start | End | Hyp-E | Hyp-I | Alt Hyp | # Tests | Support | Reject | Disc | Ques | Comp |
|---|---|---|---|---|---|---|---|---|---|---|
| 32 | 50 | X | 6 | X | ||||||
| 51 | 59 | X | 2 | 5 | X | X | ||||
| 60 | 72 | X | 7 | X | X | |||||
| 74 | 92 | X | 7 | X | ||||||
| 93 | 107 | X | 6 | X | ||||||
| 108 | 119 | X | ||||||||
| 120 | 136 | X | 7 | X | ||||||
| 137 | 148 | X | 3 | 8 | X | X | ||||
| 149 | 157 | X | 2 | 5 | X | X | ||||
| 158 | 172 | X | 2 | 8 | X | X | ||||
| 173 | 182 | X | 2 | 5 | X | |||||
| 183 | 199 | X | 9 | X | ||||||
| 200 | 219 | X | ||||||||
| 220 | 231 | X | ||||||||
| 232 | 244 | X | 7 | X | ||||||
| 245 | 256 | X | 1 | 7 | X | X | ||||
| 263 | 277 | X | 1 | 7 | X | X | ||||
| 278 | 290 | X | ||||||||
| 291 | 307 | X | 7 | X | ||||||
| 308 | 322 | X | 1 | 7 | X | X | ||||
| 322 | 334 | X | 7 | X | ||||||
| 335 | 346 | X | ||||||||
| 347 | 362 | X | ||||||||
| 363 | 378 | X | 6 | X | ||||||
| 379 | 397 | X | 1 | 8 | X | X | ||||
| 398 | 408 | X | ||||||||
| 409 | 422 | X | 1 | 4 | X | |||||
| 423 | 440 | X | 3 | 8 | X | X | ||||
| 441 | 451 | X | 2 | 9 | X | X | ||||
| 452 | 464 | X | 6 | X | ||||||
| 465 | 473 | X | 8 | X | ||||||
| 474 | 483 | X | 9 | X | ||||||
| 484 | 497 | X | 1 | 9 | X | X | ||||
| 498 | 502 | X | ||||||||
| 518 | 529 | X | 8 | X | ||||||
| 530 | 543 | X | ||||||||
| 548 | 554 | X | 1 | 8 | X | |||||
| 555 | 574 | X | 9 | X | ||||||
| 575 | 585 | X | 8 | X | ||||||
| 586 | 594 | X | ||||||||
| 595 | 612 | X | 1 | 3 | X | X | ||||
| 613 | 630 | X | 1 | 7 | X | X | ||||
| 631 | 647 | X | 8 | X | ||||||
| 648 | 658 | X | 6 | X | ||||||
| 659 | 678 | X | 3 | 5 | X | |||||
| 679 | 690 | X | 5 | X | ||||||
| 691 | 709 | X | ||||||||
| 710 | 722 | X | ||||||||
| 723 | 732 | X | 1 | 5 | X | X | ||||
| 733 | 744 | X | ||||||||
| 745 | 754 | X | 1 | 4 | X | |||||
| 755 | 768 | X | 5 | X |
Classification of research reports published in The Journal of Neuroscience, vol. 38, issues 1–3, 2018, identified by page range (n = 52). An X denotes that the paper was classified in this category. Categories were: Hyp-E: at least one hypothesis was fairly explicitly stated; Hyp-I: at least one hypothesis could be inferred from the logical organization of the paper and its conclusions, but was not explicitly stated; Alt Hyp: at least one alternative hypothesis in addition to the main one was tested; # Tests: an estimate of the number of experiments that critically tested the major (not subsidiary or other) hypothesis; Support: the tests were consistent with the main hypothesis; Reject: at least some tests explicitly falsified at least one hypothesis; Disc: a largely “discovery science” report, not obviously hypothesis-based; Ques: experiments attempted to answer a series of questions, not unambiguously hypothesis-based; Comp: mainly a computational modeling study in which experimental data were largely material for the model.
Figure 2. Diagram of the logical structure of Cen et al. (2018). The paper reports several distinct groups of experiments. One group tests the main hypothesis and others test subsidiary hypotheses that are complementary to the main one but are not a necessary part of it. Connections between hypotheses and predictions that are logically necessary are indicated by solid lines; dotted lines indicate complementary, but not mandatory, connections. Falsification of the logically necessary predictions would call for rejection of the hypothesis in its present form; falsification of any of the subsidiary hypotheses would not affect the truth of the main hypothesis. The figure numbers in the boxes identify the source of the major data in Cen et al. (2018) that were used to test the indicated hypothesis.
Figure 3. Meta-analysis of the effect sizes observed in the primary tests of the main hypothesis of Cen et al. (2018; n = 6; shown in Fig. 1). I obtained effect sizes by measuring the published figures and calculated Cohen’s d values with an on-line calculator: https://www.socscistatistics.com/effectsize/default3.aspx. Analysis and graphic display (screenshot) were done with ESCI (free at https://thenewstatistics.com/itns/esci/). The top panel shows individual effect sizes corrected for the tendency of small samples to overestimate true effect sizes (d_unbiased; see Cumming and Calin-Jageman, 2017, pp 176–177), the Ns and degrees of freedom (df) of the samples compared, the confidence intervals (CIs) of the effect sizes, and the relative weights (generated by ESCI and based mainly on sample size) that were assigned to each sample. The top panel also shows the mean effect size for the random effects model and the CI for the mean. The bottom panel shows the individual d_unbiased values (squares) and their CIs (square size is proportional to sample weight). The large diamond at the very bottom is centered (vertical peaks of the diamond) at the mean effect size, while the horizontal peaks of the diamond indicate the CI for the mean.
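The quantities named in the Figure 3 caption (small-sample correction of Cohen’s d, per-sample weights, and a random effects mean with its CI) can be illustrated with a minimal sketch. The caption does not say how ESCI computes these values, so the sketch assumes three common textbook choices: the approximate Hedges correction factor, a large-sample approximation for the variance of d, and the DerSimonian-Laird estimator of between-study variance. The input values are hypothetical placeholders, not the effect sizes measured from Cen et al. (2018), and all function names are illustrative.

```python
import math

def hedges_correction(df):
    """Approximate small-sample correction factor (assumed form: 1 - 3/(4*df - 1))."""
    return 1.0 - 3.0 / (4.0 * df - 1.0)

def d_unbiased(d, n1, n2):
    """Cohen's d shrunk toward zero to offset small-sample overestimation."""
    return hedges_correction(n1 + n2 - 2) * d

def variance_d_unbiased(d, n1, n2):
    """Large-sample approximation to the sampling variance of d_unbiased."""
    j = hedges_correction(n1 + n2 - 2)
    var_d = (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))
    return j * j * var_d

def random_effects(effects, variances):
    """DerSimonian-Laird random effects mean, 95% CI, tau^2, relative weights."""
    k = len(effects)
    w = [1.0 / v for v in variances]            # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi * wi for wi in w) / sw
    # Between-study variance, truncated at zero
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    sw_star = sum(w_star)
    mean = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    half_width = 1.96 * math.sqrt(1.0 / sw_star)    # normal-approximation CI
    rel_weights = [wi / sw_star for wi in w_star]
    return mean, (mean - half_width, mean + half_width), tau2, rel_weights

# Hypothetical inputs: (Cohen's d, n of group 1, n of group 2) for six tests.
studies = [(1.4, 8, 8), (0.9, 10, 10), (1.1, 6, 7),
           (1.8, 9, 9), (0.7, 12, 12), (1.3, 8, 10)]

effects = [d_unbiased(d, n1, n2) for d, n1, n2 in studies]
variances = [variance_d_unbiased(d, n1, n2) for d, n1, n2 in studies]
mean, ci, tau2, weights = random_effects(effects, variances)

print(f"mean d_unbiased = {mean:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
for (d, n1, n2), du, wt in zip(studies, effects, weights):
    print(f"d = {d:.2f}  d_unbiased = {du:.2f}  df = {n1 + n2 - 2}  weight = {wt:.1%}")
```

Under these assumptions, larger samples have smaller variances and therefore receive larger relative weights, which is consistent with the caption’s note that the weights are based mainly on sample size.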
Figure 1. Diagram of the main hypothesis and predictions of Cen et al. (2018). The solid lines connect the hypothesis and the logical predictions tested. This diagram omits experimental control tests that primarily validate techniques, include non-independent p values, or add useful but non-essential information. The main hypothesis predicts that PKD1 associates directly with N-cadherin, and that PKD1 and N-cadherin jointly affect synaptic development in a variety of structural and physiological ways. Separate groups of experiments test these predictions.