In previous commentaries, I've discussed issues related to statistical significance1,2 and to measures of effect size. In this one, I will discuss how they are related and, more importantly, how they are different. So that you don't have to dig through your files and retrieve the previous papers, let me recapitulate what each of them says. Statistical significance refers to the probability that, if the null hypothesis (H0) is true (that is, there is no difference between the groups), these, or more extreme, data could have arisen by chance. This is referred to as Null Hypothesis Significance Testing (NHST).

Before we go on, though, three points are in order. First, Cohen rightly pointed out that H0 should actually be referred to as the "nil hypothesis," not the null hypothesis. The null hypothesis is the hypothesis to be nullified, which, in most cases, is in fact the nil hypothesis (nothing is going on). In some cases, though, primarily when testing for equivalence or non-inferiority, the two are not the same, and the null hypothesis (or hypotheses) is that 1 group is better or worse than the other. The second, and more important, point that Cohen makes in the same article (which should be required reading for anyone interested in statistics) is that NHST actually does not tell us what we want to know. As we pointed out in a previous commentary, he said that what we are really concerned about is, "Given these data, what is the probability that H0 is true?" or, in statistical jargon, P(H0|D). What we are told, though, is "What is the probability of these, or more extreme, data, given that H0 is true?" or P(D|H0), and the 2 questions are not the same (a numerical sketch below illustrates the distinction). However, science has muddled through for all these years answering the wrong question, and will probably continue to do so for the foreseeable future. Third, statistical significance testing tells us nothing about the magnitude of the difference between the groups.

On the other hand, effect sizes (ESs) do indicate how different 1 group is from the other, but don't tell us about statistical significance. Depending on the type of data, the most common ESs can be expressed as a standardized mean difference if the data are continuous, or as odds ratios (ORs) or relative risks (RRs) if they are dichotomous (another sketch below shows how these indices are calculated). As the previous article pointed out, there are also other indices of ESs based on analyses of variance and for showing the strength of the relationship between variables.

Now, what is the relationship between statistical significance and ESs? At the most simplistic level, if the results of a statistical test are not significant, then we can ignore the ES, because the results were likely due to the play of chance. But this is perhaps overly simplistic. It treats the probability as a dichotomy – less than 0.05 and the phenomenon exists, while greater than or equal to 0.05 and it doesn't. In fact, though, probabilities are a continuum, and a P level of 0.051 is very different from one of 0.500; to quote Rosnow and Rosenthal yet again, "Surely God loves the .06 nearly as much as the .05" (p. 1277). Because of the discomfort most statisticians have in drawing a binary conclusion from a continuum, some organizations have advocated supplementing P levels with confidence intervals (CIs) and ESs, and others have banned P levels entirely, but this is probably going too far.
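To make Cohen's point concrete, here is a minimal sketch, assuming hypothetical numbers, of why P(D|H0) and P(H0|D) can be very different. It coarsely treats "the data" as "a significant result was obtained," and the assumed prior probability that H0 is true (0.90) and the assumed study power (0.80) are illustrative values, not figures from this commentary; Bayes' theorem converts one conditional probability into the other.

```python
# Hypothetical illustration of P(D|H0) versus P(H0|D) via Bayes' theorem.
# All numbers are assumed for the sake of the example.

alpha = 0.05       # P(significant result | H0 true): the conventional cut-off
power = 0.80       # P(significant result | H0 false): assumed study power
prior_h0 = 0.90    # assumed prior probability that H0 is true

# Overall probability of obtaining a "significant" result
p_significant = alpha * prior_h0 + power * (1 - prior_h0)

# Bayes' theorem: P(H0 | significant result)
p_h0_given_sig = (alpha * prior_h0) / p_significant

print(f"P(D|H0) = {alpha:.3f}")            # what NHST reports
print(f"P(H0|D) = {p_h0_given_sig:.3f}")   # what we actually want to know
```

With these assumed values, a "significant" finding still leaves roughly a 1-in-3 chance (0.36) that H0 is true, even though P(D|H0) is 0.05; this is exactly why the two questions must not be conflated.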
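The effect-size indices mentioned above can likewise be written out in a few lines. This is only a sketch: the means, standard deviations, and 2 × 2 counts are invented, and the formulas are the standard ones (a standardized mean difference using a pooled SD, and the usual OR and RR from a 2 × 2 table).

```python
import math

# --- Continuous outcome: standardized mean difference ---
# Hypothetical means, SDs, and group sizes
m1, s1, n1 = 12.0, 5.0, 50   # group 1
m2, s2, n2 = 10.0, 5.5, 50   # group 2

sd_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
smd = (m1 - m2) / sd_pooled
print(f"Standardized mean difference = {smd:.2f}")

# --- Dichotomous outcome: OR and RR from a 2 x 2 table ---
# Hypothetical counts:   event, no event
a, b = 30, 70            # treated group
c, d = 15, 85            # control group

odds_ratio = (a / b) / (c / d)
relative_risk = (a / (a + b)) / (c / (c + d))
print(f"Odds ratio = {odds_ratio:.2f}")
print(f"Relative risk = {relative_risk:.2f}")
```

None of these numbers, on its own, says anything about statistical significance; that is precisely why the two concepts must be kept separate.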
However, this debate does highlight the fact that a P level greater than 0.05 doesn't mean that the ES is zero (or 1, in the case of ORs and RRs); just that we shouldn't break out the Champagne quite yet. If the lower end of the 95% CI is nearly touching the null value (0 for mean differences, or 1 for ORs and RRs), and the sample size was at least adequate, it may indicate that there's something going on, but that the sample size wasn't quite large enough to reach a conventional level of significance. Significance and ESs are also related in that, holding everything else equal, the larger the ES, the smaller the P level. But rarely is everything else – especially the sample size – equal among studies, so we cannot conclude that study A showed a bigger effect than study B because the first had a P level of 0.001 and the second a P level of 0.02 (the sketch below shows how sample size alone can produce this pattern). We can draw this conclusion only if everything – the inclusion and exclusion criteria, how the interventions were delivered, the sample sizes, and all other aspects of the 2 studies – was identical, and this is never the case. Consequently, in order to interpret the results of a study, we need to know both the statistical significance and the ES – and sometimes, we need even more.
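As a final sketch, again with made-up numbers, the following shows why P levels cannot be compared across studies: an identical standardized mean difference (a mean difference of 2 with an SD of 5 in both groups, i.e., d = 0.40) yields very different P levels once the sample sizes differ. It uses SciPy's two-sample t-test from summary statistics; any comparable test would make the same point.

```python
from scipy import stats

# The same hypothetical effect (mean difference = 2, SD = 5, d = 0.40)
# tested in a small study and in a large one.
for label, n in [("Small study (n = 50 per group) ", 50),
                 ("Large study (n = 200 per group)", 200)]:
    result = stats.ttest_ind_from_stats(mean1=12.0, std1=5.0, nobs1=n,
                                        mean2=10.0, std2=5.0, nobs2=n)
    print(f"{label}: d = 0.40, P = {result.pvalue:.4f}")
```

The effect is identical in both runs; only the sample size changes, yet the P levels differ by orders of magnitude (roughly 0.05 versus 0.0001). Comparing the two P levels would tell us about the sample sizes, not about which effect is larger.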