More than 800 researchers, many of them prominent biostatisticians, called to “rise up against the p value.” [1] This recent battle cry was just the climax of a growing insurrection, which prominently surfaced in 2018 when another group of biostatisticians demanded that we “redefine statistical significance” [2] and proposed to change the default p value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries. For many researchers and experts, this demand did not go far enough; in a follow-up statement, they proposed to “remove rather than redefine statistical significance.” [3] This apparent upheaval even made it into the lay press. The Financial Times, for example, reported that “Scientists strike back against statistical tyranny.” [4]

What is all the fuss about? Unless you have been living under a rock, you will be aware of a much wider discussion, which started in earnest about a decade ago in psychology but then quickly percolated through the life sciences in general. According to a survey by Nature [5], the majority of researchers feel that a replication crisis has affected their discipline, as many experimental findings cannot be replicated and are likely to be false [6]. The search for underlying causes has spawned a whole new research field, meta-research [7]. It is currently widely believed that, among other issues (which include low internal validity [8] and publication bias [9]), low statistical power [10] and flawed statistics [11] are root causes of an exceedingly high false positive rate and hence of the difficulty in reproducing results. This is where the p value, or rather its interpretation, takes center stage.

In 2012, Craig Bennett and colleagues won the Ig Nobel Prize in Neuroscience [12] with a remarkable functional neuroimaging study. They positioned a dead salmon, purchased at a local supermarket, in an MR scanner and showed it a series of photographs depicting human individuals in social situations with a specified emotional valence. In a classical fMRI block design, the salmon was asked to determine what emotion the individual in the photo must have been experiencing. BOLD imaging revealed a significant hemodynamic response, indicating neural processing in the dead salmon’s brain [13]. The reason why these authors found “neural correlates of interspecies perspective taking in the post-mortem Atlantic salmon” is, of course, that they relied on standard statistical thresholds (p < 0.001) and low minimum cluster sizes (k > 8), and did not appropriately control for multiple comparisons. This frivolous study is relevant because the authors also demonstrated that only 60–70% of published functional neuroimaging studies at that time controlled for multiple comparisons, calling into question the results of a major portion of cognitive neuroscience studies. Other fields, in particular those based on gene expression [14] and association [15] studies, are also heavily affected by the testing burden and were initially drowning in a sea of false positives [16]. As a result, techniques aimed at solving the multiple comparisons problem proliferated in functional imaging and genetics. Fortunately, it is nowadays unlikely that one can publish transcriptomic or functional imaging datasets without some form of correction for multiple testing. It is good news that some research fields appear to have cleaned up their act.
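To see how easily uncorrected voxel-wise testing manufactures “findings” from pure noise, consider the minimal simulation below. It is only a sketch with made-up numbers of voxels and scans, not code or data from the salmon study: every voxel is noise, yet roughly a hundred of them cross p < 0.001 by chance, whereas a family-wise correction such as Bonferroni leaves essentially none.

```python
# A toy simulation (hypothetical numbers, not data from the salmon study):
# every "voxel" below is pure noise, yet some still cross p < 0.001 by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_voxels = 100_000         # assumed number of voxels tested
n_scans = 20               # assumed number of scans per voxel
alpha_uncorrected = 0.001  # the liberal per-test threshold mentioned above

# Pure-noise data: there is no true signal anywhere.
noise = rng.normal(size=(n_voxels, n_scans))

# One-sample t-test per voxel against a mean of zero.
_, p_values = stats.ttest_1samp(noise, popmean=0.0, axis=1)

false_positives = np.sum(p_values < alpha_uncorrected)
print(f"Uncorrected 'active' voxels: {false_positives} "
      f"(expected by chance: about {n_voxels * alpha_uncorrected:.0f})")

# Bonferroni correction: to keep the family-wise error rate at 0.05,
# each individual test must now clear 0.05 / n_voxels.
bonferroni_hits = np.sum(p_values < 0.05 / n_voxels)
print(f"Bonferroni-corrected 'active' voxels: {bonferroni_hits}")
```

A false discovery rate procedure such as Benjamini–Hochberg would be the more common choice in genomics, but the lesson is the same: without some correction, the expected number of false positives grows with the number of tests.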
However, the bad news is that in many other fields, statistical problems, including insufficient correction for multiple testing, weak thresholds for type I errors, and low statistical power, are still rampant [17, 18]. But at the heart of the problem lie misconceptions about the p value. Many researchers believe that p is the probability that the null hypothesis is true, and that 1 − p is the probability that the alternative hypothesis is true. Or, more colloquially, p is confused with the false positive rate: “At an alpha of 5%, I am running a 5% chance that my hypothesis is a fluke, that the drug is not effective, despite obtaining statistical significance.” Other frequent misconceptions include the belief that the p value correlates with the theoretical or practical relevance of the finding, or that a small p is evidence that the results will be replicable. Another serious fallacy is the notion that failing to reject the null hypothesis (p > 0.05) is equivalent to demonstrating that the null hypothesis is true, that is, that there is no effect.
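To see why equating p with the false positive rate is misleading, here is a back-of-the-envelope sketch. The prior plausibility and power figures are illustrative assumptions, not values from this article; the point is that the probability that a “significant” result is a fluke depends on both, and can be far larger than alpha.

```python
# Back-of-the-envelope arithmetic with illustrative, assumed numbers
# (prior plausibility and power are not taken from this article).
prior_true = 0.10  # assumed fraction of tested hypotheses that are actually true
power = 0.50       # assumed statistical power when the effect is real
alpha = 0.05       # conventional type I error threshold

true_positives = prior_true * power         # real effects that reach significance
false_positives = (1 - prior_true) * alpha  # null effects that reach significance

# Among all "significant" results, the fraction that are actually flukes:
fraction_false = false_positives / (true_positives + false_positives)
print(f"P(no real effect | p < {alpha}) is about {fraction_false:.0%}")  # ~47%, not 5%
```

Under these assumptions, nearly half of all nominally significant findings are false positives, even though every single test was run at alpha = 0.05; with lower power or less plausible hypotheses, the fraction is higher still.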