In Philosophy 101, we learned that, if Person A posits the existence of some phenomenon, it is incumbent on Person A to prove its existence rather than it being Person B’s job to disprove it. For example, if I assert that there are pink unicorns with purple polka dots hiding deep in the forest, I cannot simply say, “Well, prove that they don’t exist.” For the scientific world to accept my claim, I have to bring one back, dead or alive. Is it ever legitimate to violate this injunction and try to prove the nonexistence of something? In this paper, we’ll take a look at trying to prove that something does not exist. In particular, we will examine the steps necessary to show that 2 treatments are equivalent: that, in essence, a difference between them does not exist.

Let’s start off, though, by looking at “normal” science. In most instances, we design studies to show statistical significance. That is, we want to prove that one treatment is more effective or has fewer side effects than another, or we want to demonstrate that a relation exists between 2 variables, such as a history of sexual or physical abuse and the probability of having a psychiatric diagnosis (1). We begin by positing a null hypothesis that there is no difference between the groups or that there is no correlation between the variables, and then we do everything in our power to disprove it. If fortune deigns to smile upon us, and the statistical test has a P level less than or equal to 0.05, we conclude that we can reject the null hypothesis and therefore accept the alternative: that the treatments do differ or that the variables are correlated. The technical name for this is “null-hypothesis significance testing” (NHST).
In NHST, if P is greater than 0.05, we do not say that we can accept (or prove) the null hypothesis; rather, we use the convoluted locution that we “have failed to disprove the null.” The reason for this, as I have said, harkens back to David Hume and the philosophy of science, which asserts that we cannot prove the nonexistence of something (unless, of course, it violates one of the laws of nature, such as the notions of perpetual motion machines, travel that is faster than light, or politicians who tell the truth). To use our original example, no one has ever seen such a unicorn (at least while sober), but we cannot prove that it does not exist. Although it is highly unlikely, a unicorn may walk out of the forest tomorrow, much as the coelacanth was discovered in 1938, after it was thought to have been extinct for millions of years. Using an example closer to home, 6 randomized controlled studies failed to find that ASA had any beneficial effect in preventing reinfarctions. However, Canner’s metaanalysis demonstrated that ASA reduced mortality by 10% (2), and it is now inconceivable that post–myocardial infarction patients would not be told to take it. The 6 studies didn’t prove that the null hypothesis is true—that there is no difference between ASA and placebo—they simply failed to reject it; that is, all 6 suffered from a type II error, in failing to reject the null hypothesis when in fact it is false. There is a qualitative difference between “highly unlikely” and “impossible” that can never be breached, no matter how many studies have negative outcomes. Therefore, a negative result often means that we should just try harder next time. The only problem with this philosophical purity is that, as noted, there are times when we do want to demonstrate a lack of difference. This occurs most often in evaluating “me too” drugs—drugs that are supposedly as good as existing ones but that may be cheaper or have fewer side effects. 
Here, the first task is to show that they are no less effective; in other words, to “prove” the null hypothesis of no difference. Similar situations exist when an outpatient program is compared with an inpatient one, or time-limited therapy is compared with therapy without a limit on the number of sessions, or a lower dosage of a drug is compared with a higher dosage (3). In all these cases, it would be sufficient to show that the less expensive or less invasive therapies are not worse than the alternative; it is not necessary to prove superiority in terms of outcome for them to be accepted as replacements. As we shall see, showing superiority and showing noninferiority or equivalence are not simply opposite sides of the same coin.

The Statistical Theory

How do we reconcile the competing demands of wanting to prove equivalence on the one hand with the difficulty, if not impossibility, of proving the null hypothesis on the other? First, we have to correct a common misperception about the null hypothesis (H0). In almost all situations, the null hypothesis is written as

H0: µ1 = µ2 or H0: p1 = p2 [1]

when we are comparing means (µs) or proportions (ps); that is, the means or proportions of the 2 groups are the same, vs the alternative hypothesis (HA):

HA: µ1 ≠ µ2 or HA: p1 ≠ p2 [2]

that the 2 means or proportions are different. The mistake is to think that the null hypothesis always has to mean “no difference.” In fact, the null hypothesis is the hypothesis to be nullified, or disproven. Cohen refers to the hypothesis of no difference by the delightful name of the “nill hypothesis” (4). In most cases, the null and the nill hypotheses are the same; however, this needn’t necessarily be the case, and we will use this distinction in testing for equivalence.

A second point is that not all differences are created equal, and there are some we can safely ignore.
Because of sampling error, there will always be a difference between groups, no matter how similar they may be. Further, if we simply increase the sample size sufficiently, we will always be able to show that this difference is statistically significant. For example, let’s assume that School 1 has a mean IQ score of 100, School 2 has a mean IQ score of 103, and the standard deviation is the usual 15 points. Most people would agree that this 3-point difference is clinically trivial. However, if we draw a sample of 400 students from each school, we will probably find that this difference is statistically significant. With larger sample sizes, we can find statistical significance with even smaller differences.

These 2 points (that the null hypothesis doesn’t always mean no difference and that some differences may be statistically significant but clinically trivial) form the basis of testing for equivalence. First, rather than saying that the 2 means (or proportions, or whatever parameter we’re interested in) have to be absolutely identical, we establish an equivalence interval within which we would say that the groups are “close enough.” For instance, let’s assume that, for sociophobic patients, Treatment A results in a mean score of 10 on a scale of social comfort (that is, M1 = 10), where a higher score reflects greater comfort. How much lower can the score be with a different therapy (Treatment B) for us to say that the difference between the groups (which we call delta, or d) is clinically unimportant: 1 point lower? 2 points? 3 points? This is not a statistical question but, rather, a clinical one, based on our knowledge of the condition, the scale, and the intervention. If the new treatment is significantly faster, cheaper, or (if it’s a drug) has a better side effect profile, we may be willing to accept a lower score (that is, somewhat poorer adjustment) than if the new therapy does not offer these advantages.
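The School 1 and School 2 illustration from this passage can be checked with a few lines of Python. This is a minimal sketch, assuming a simple z-test with a known common standard deviation of 15 and equal group sizes; the function name `two_sample_z` and the extra sample sizes of 50 and 4000 are illustrative choices, not taken from the text.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_1, mean_2, sd, n_per_group):
    """Return (z, two-sided p) for a difference between two group means.

    Assumes a common, known standard deviation and equal group sizes.
    """
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = (mean_2 - mean_1) / standard_error
    p_two_sided = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_sided


# Same 3-point IQ difference, three hypothetical sample sizes per school
for n in (50, 400, 4000):
    z, p = two_sample_z(100.0, 103.0, 15.0, n)
    print(f"n per school = {n:5d}  z = {z:5.2f}  two-sided p = {p:.4f}")
```

Under these assumptions, the same 3-point difference is far from significant with 50 students per school and clearly significant with 400, which is the point about sample size made above.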
As Kendall and others (5) point out, though, there’s a trade-off in the choice of this interval. The smaller its value, the more similar the treatments must be but the harder it is to demonstrate equivalence statistically. Conversely, it is easier to show equivalence with wider intervals, but then we have to accept bigger differences between the 2 groups and still say they’re not different.

There are 2 approaches to equivalence testing. The 2-tailed approach tries to show that the 2 means or proportions are similar to each other; that is, that one is neither much larger nor much smaller than the other. The 1-tailed method is far more common and tests whether the second mean or proportion is different only in 1 direction. This is also referred to as noninferiority testing, because it is often used to see whether a new therapy isn’t any worse than usual treatment. We don’t care whether it’s better (in fact, we’d be ecstatic); we merely want to ensure that it’s not significantly worse. The 2-tailed method is certainly theoretically important. However, on a practical level, it is much more likely that we would be interested in showing that the new treatment is not worse than the standard (noninferiority testing), so we will restrict ourselves to that situation.

The first step is to use our clinical judgement to define the equivalence interval, which we designate as d. Using the previous example, assume that we’d accept a difference of 20% at most, which translates into d = 2 points. That means that the mean for the new treatment (M2) cannot be less than 8 if it is to be deemed noninferior. Now let’s bring the first point into play and redefine the null hypothesis.
Instead of the usual nill hypothesis of no difference, we say that the null hypothesis is

H0: µ1 − µ2 > d [3]

(or in English, the first mean is more than d points greater than the second), and the alternative hypothesis is

HA: µ1 − µ2 < d [4]

(that is, the difference between the means is less than d, which also covers the possibility that µ2 is larger than µ1). Note that if d = 0, these are simply the null and alternative hypotheses for a 1-sided t-test. This means that, if we can reject the null hypothesis, we are left by default with the alternative hypothesis: that the difference between the means (or proportions) is smaller than d. The test for this (a t-test if the sample sizes are small or a z-test if they are above 10 or so) looks very similar to the usual one, with the exception of d in the numerator (Equation [5]). In that equation, S M1−M2 is the standard error of the difference (Equation [6]), df = (n1 + n2 – 2), n is the sample size in each group, and s is the standard deviation. If we are dealing with proportions rather than means, then we simply replace M1 with p1 and M2 with p2 in Equation [5] (the proportions in each group) and use Equation [7] for the standard error.
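One way to write out the three formulas described here is sketched below. It is a reconstruction from the surrounding definitions, not a transcription of the article's own equations, and it rests on two assumptions: the sign of the numerator is chosen so that large positive values count as evidence against the null hypothesis in [3], and the standard error for means uses the usual pooled variance that goes with df = n1 + n2 – 2. The standard error for proportions is the unpooled form, which is the one that reproduces the value of 0.141 quoted in the worked example.

```latex
\begin{align}
t &= \frac{d - (M_1 - M_2)}{S_{M_1 - M_2}},
  \qquad df = n_1 + n_2 - 2 \tag{5} \\
S_{M_1 - M_2} &= \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
  \left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \tag{6} \\
S_{p_1 - p_2} &= \sqrt{\frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}} \tag{7}
\end{align}
```

Setting d = 0 in [5] gives a 1-sided t-test with the sign reversed, so it rejects in favour of µ1 < µ2, consistent with the alternative hypothesis in [4].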
An Example

Let’s work through an example. Assume that we specified ahead of time that we would consider 2 treatments for sociophobia equivalent if the new one worked for at least 85% as many patients as did the usual therapy. What we actually find is that, with 20 patients per group, 75% improve on the standard therapy, A (that is, pA = 0.75), and 70% improve with the new treatment, B (pB = 0.70). Since 15% of 0.75 is 0.083, we set d to be 0.083. Thus, the null hypothesis is

H0: pA – pB > 0.083

and the alternative hypothesis is

HA: pA – pB < 0.083

Spelled out, the null hypothesis is that the proportion of successful patients in the standard therapy group is more than 0.083 better than the proportion of successful patients in the new treatment group; the alternative hypothesis is that the difference in proportions is less than or equal to 0.083. This is shown in Figure 1.

Figure 1. Hypothetical results of a new and standard therapy

Using Equation [7], we find that the standard error of the difference between these 2 proportions is 0.141, and therefore the test statistic from Equation [5] is only about 0.23 in magnitude (that is, |0.050 – 0.083| / 0.141). Since this is smaller than the critical value of 1.645 that we would need to reject a 1-tailed hypothesis at the 0.05 level, we cannot reject the null hypothesis, and we cannot conclude that the 2 treatments are equivalent.

Sample Size and Power

In the example we just worked through, it would seem at first glance that the 2 treatments should have come out as equivalent. We said that we would accept a 15% difference from the effectiveness of the standard treatment or roughly 0.083 less effectiveness than that demonstrated for Treatment A, 0.75. The success rate for Treatment B, 0.70, seems to fall within this margin; in fact, the difference is only 1 subject per group (that is, 15/20 for A vs 14/20 for B). Why do these results seem to be counterintuitive? Simply examining raw differences overlooks 2 important points.

First, we cannot just look at the difference between the 2 groups. As is always the case, the means or proportions that emerge from a study are sample estimates of the true population parameters. Because of this, they deviate from the real values to some degree. The amount of this deviation is related to the variability in what is being measured (for example, the standard deviation) and the sample size, and these have to be taken into account when we test to see whether the difference is statistically significant.

The second point is that, in testing for equivalence, we reverse the usual meanings of the null and alternative hypotheses. This means that we have to alter both our interpretations of type I and type II errors and what we mean by power. In both NHST and equivalence testing, a type I error occurs when we conclude that the null hypothesis is false when in fact it is true; a type II error occurs when we erroneously conclude that the null hypothesis is true when it is not. Power is the ability to reject the null hypothesis when it is false.
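The arithmetic of the worked example (pA = 0.75, pB = 0.70, 20 patients per group) can be retraced with a short Python sketch. Several things here are assumptions rather than the article's own method: the function name `noninferiority_z`, the unpooled standard error, and the sign convention in which a large positive z counts against the null hypothesis. Note also that 0.15 × 0.75 works out to 0.1125; the sketch takes the text's d = 0.083 as given, because the reported standard error and the later sample size figure follow from that value.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_z(p_standard, p_new, n_standard, n_new, margin):
    """z statistic for H0: p_standard - p_new > margin.

    Large positive values are evidence that the new treatment is
    noninferior (sign convention assumed for this sketch).
    """
    se = sqrt(p_standard * (1 - p_standard) / n_standard
              + p_new * (1 - p_new) / n_new)
    z = (margin - (p_standard - p_new)) / se
    return z, se


z, se = noninferiority_z(0.75, 0.70, 20, 20, margin=0.083)
critical = NormalDist().inv_cdf(0.95)  # 1-tailed test at the 0.05 level

print(f"standard error = {se:.3f}")
print(f"z = {z:.3f}, critical value = {critical:.3f}")
print("reject H0 (noninferior)" if z > critical else "cannot reject H0")
```

With these inputs the standard error rounds to 0.141 and z falls well short of 1.645, in line with the conclusion reached in the example.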
In noninferiority testing, though, the null hypothesis is that the standard treatment is better than the new one. This means that

1. A type I error occurs when we say that the 2 treatments are equivalent, when in fact the standard treatment is better.
2. A type II error occurs when we conclude that the standard treatment is better, when in fact the treatments are equivalent.
3. Power is the probability of accepting that the groups are equivalent when in fact they are equivalent (6).

The issue, then, is the power of the test. As we would expect, with only 20 subjects per group, the power of the tests we just ran is low. The reality is that equivalence testing is at times less powerful than testing for a difference. That is, we would need more subjects to test whether a given difference is within the equivalence interval than when we test to see whether the 2 groups differ from each other. To determine why this is so, let’s take a look at the equations to calculate sample size (7): Equation [8] for the equivalence of 2 means, and Equation [9] for the equivalence of 2 proportions. (If we are testing whether the 2 means or proportions are identical, then the denominator becomes simply d².)

These are very similar to the usual equations for sample size determination (8) with 2 differences, one that has a small effect on the sample and one that has a potentially large effect. The difference with the small effect is that we now want to minimize the type II error rather than the type I error, as in NHST. Consequently, the values of α and β are reversed, in that we set β at 0.05 and α at 0.10 or 0.20. The difference that has a potentially large effect is the d in the denominator. If we use Equation [9] to figure out the required sample size for the example (setting α = 0.20, β = 0.05, and therefore power = 0.95), we will find it to be 2255 subjects per group! Conversely, with only 20 subjects per group, the power of the test to detect a difference of 0.083 between these proportions is less than 30%.

The sample size to test for equivalences is not always larger than that for testing for differences; again, it depends on the value of d. Table 1 gives the sample sizes needed to test for noninferiority for various combinations of proportions in the standard treatment group, ps, and the experimental group, pe, with α = 0.20 and β = 0.05. For comparison, the last column is the sample size required for the traditional NHST that ps > pe. When d = 2 (ps – pe), the sample sizes are about equal for both types of tests. When d < 2 (ps – pe), the sample size for equivalence testing is larger than for difference testing. When d > 2 (ps – pe), it is smaller. Note that, when d is larger than 2 (ps – pe), the change in sample size is relatively small. However, when d is smaller than 2 (ps – pe), the sample size increases very rapidly as d approaches ps – pe. The same relation holds for testing the noninferiority of means, with Ms and Me replacing ps and pe.
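The sample size and power figures quoted for the example can be approximated with the sketch below. The formula is an assumed form for the proportions case, chosen because its denominator reduces as the text describes and because it gives roughly 2257 subjects per group, close to the 2255 reported (the small gap may simply reflect rounding of the z values). The function names and the list of margins in the loop are hypothetical, and the power calculation assumes the test is run at the same α used for the sample size.

```python
from math import sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def n_per_group(p_s, p_e, margin, alpha=0.20, beta=0.05):
    """Subjects per group for a noninferiority test of 2 proportions.

    Assumed form: (z_alpha + z_beta)^2 * (p_s*q_s + p_e*q_e)
                  / (p_s - p_e - margin)^2
    """
    z_sum = NORMAL.inv_cdf(1 - alpha) + NORMAL.inv_cdf(1 - beta)
    variance = p_s * (1 - p_s) + p_e * (1 - p_e)
    return z_sum ** 2 * variance / (p_s - p_e - margin) ** 2


def power(p_s, p_e, margin, n, alpha=0.20):
    """Approximate power to declare noninferiority with n per group."""
    se = sqrt((p_s * (1 - p_s) + p_e * (1 - p_e)) / n)
    z_crit = NORMAL.inv_cdf(1 - alpha)
    return NORMAL.cdf((margin - (p_s - p_e)) / se - z_crit)


print(f"n per group for d = 0.083: {n_per_group(0.75, 0.70, 0.083):.0f}")
print(f"power with n = 20: {power(0.75, 0.70, 0.083, 20):.2f}")

# How the required n changes as d moves around 2 * (p_s - p_e) = 0.10
for margin in (0.06, 0.083, 0.10, 0.15, 0.20):
    n = n_per_group(0.75, 0.70, margin)
    print(f"d = {margin:.3f}  n per group = {n:8.0f}")
```

The loop illustrates the pattern described for Table 1: the required sample size falls only modestly once d exceeds 2 (ps – pe), and climbs steeply as d shrinks toward ps – pe.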
Summary

At times, despite all philosophical injunctions to the contrary, we have to prove that there are no unicorns. The solution, as we’ve seen, is to reverse the meanings of the null and alternative hypotheses and try to show that the null hypothesis of a difference can be rejected. This leaves us, by elimination, with the alternative: that there is no difference (or at least, that the difference is small enough for us to ignore). The issue is that the closer the groups must be to be considered equivalent, the larger the sample size required. This is entirely analogous to the situation for the traditional NHST: larger sample sizes are needed to detect smaller differences between groups. In both cases, sample size is like magnification with a microscope: the smaller the object that’s being observed, the more magnification we need.

Acknowledgements

I am deeply grateful to Mr Malcolm Binns and Dr Elizabeth Linn for their careful reading of an earlier draft and for their many helpful comments.

References

1. MacMillan HL, Boyle MH, Wong MY, Duku EK, Fleming JE, Walsh CA. Slapping and spanking in childhood and its association with lifetime prevalence of psychiatric disorders in a general population sample. CMAJ 1999;161:805–9.
2. Canner PL. Aspirin in coronary heart disease: comparison of six clinical trials. Isr J Med Sci 1983;19:413–23.
3. Bollini P, Pampallona S, Tibaldi G, Kupelnick B, Munizza C. Effectiveness of antidepressants. Meta-analysis of dose-effect relationships in randomised clinical trials. Br J Psychiatry 1999;174:297–303.
4. Cohen J. The earth is round (p < .05). Am Psychol 1994;49:997–1003.
5. Kendall PC, Marrs-Garcia A, Nath SR, Sheldrick RC. Normative comparisons for the evaluation of clinical significance. J Consult Clin Psychol 1999;67:285–99.
6. Hatch JP. Using statistical equivalence testing in clinical biofeedback research. Biofeedback and Self-Regulation 1996;21:105–19.
7. Rogers JL, Howard KI, Vessey JT. Using significance tests to evaluate equivalence between two experimental groups. Psychol Bull 1993;113:553–65.
8. Streiner DL. Sample size and power in psychiatric research. Can J Psychiatry 1990;35:616–20.

Author(s)

Manuscript received May 2003 and accepted June 2003.

This is the 23rd article in the series on Research Methods in Psychiatry. For previous articles, please see Can J Psychiatry 1990;35:616–20, 1991;36:357–62, 1993;38:9–13, 1993;38:140–8, 1994;39:135–40, 1994;39:191–6, 1995;40:60–6, 1995;40:439–44, 1996;41:137–43, 1996;41:491–7, 1996;41:498–502, 1997;42:388–94, 1998;43:173–9, 1998;43:411–5, 1998;43:737–41, 1998;43:837–42, 1999;44:175–9, 2000;45:833–6, 2001;46:72–6, 2002;47:68–75, 2002;47:262–6, 2002;47:552–6.

1. Director, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care; Professor, Department of Psychiatry, University of Toronto, Toronto, Ontario.

Address for correspondence: Dr DL Streiner, Kunin-Lunenfeld Applied Research Unit, Baycrest Centre for Geriatric Care, 3560 Bathurst Street, Toronto, ON M6A 2E1
e-mail: dstreiner@klaru-baycrest.on.ca