THE CONCEPTS OF STATISTICAL POWER AND EFFECT SIZE
Inferential statistics make it possible for researchers to make statements about a population based upon a sample taken from that population. For these statements to be accurate, the sample must be representative of the population and the underlying assumptions of the statistical test being used must be met. Even when randomization is used, there is always a possibility that sampling error will affect the results and, therefore, make a statement less accurate (i.e., less valid). Sometimes, however, randomization is not possible (especially in educational research) and intact groups must be matched and selected. Obviously, this lack of randomization can introduce sampling bias into a study, distort its results and, consequently, make any statement that generalizes the findings to the population less accurate (i.e., less valid). In addition, if the assumptions that underlie statistical tests are violated (e.g., parametric tests require a normal distribution of scores), then bias enters into the statistical tests being used, the results of those tests are less accurate (i.e., less valid), and generalizations from the sample to the population are more problematic. The former condition is not the fault of the researcher, while the latter condition is.
Most statistical models assume error-free measurement, particularly of the independent variables. But researchers know that there is no such thing as error-free measurement because of random, chance variation in the population being sampled. (The only case in which error-free measurement is possible is a census.) That being the case, the larger the amount of measurement error, the more likely it is that a researcher will not find a statistically significant result. To conceptualize this, think about how parametric tests of significance report a ratio in which the denominator is the measure of error. Thus, the larger the denominator (the error), the larger the numerator (the difference between groups) must be in order to attain a statistically significant finding.
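The ratio described above can be sketched in a few lines of Python. The function name and all of the numbers below are hypothetical, chosen only to illustrate how a larger error term (the denominator) shrinks the test statistic:

```python
# Illustrative sketch (values invented): how measurement error shrinks a
# t-style test statistic of the form (group difference) / (error term).
import math

def t_statistic(mean_diff, pooled_sd, n_per_group):
    """Two-sample t-style ratio: mean difference over its standard error."""
    standard_error = pooled_sd * math.sqrt(2.0 / n_per_group)  # the denominator (error)
    return mean_diff / standard_error

# The same 5-point mean difference under two levels of measurement error:
t_low_error = t_statistic(mean_diff=5.0, pooled_sd=10.0, n_per_group=30)
t_high_error = t_statistic(mean_diff=5.0, pooled_sd=20.0, n_per_group=30)

print(round(t_low_error, 2))   # larger ratio: easier to reach significance
print(round(t_high_error, 2))  # doubling the error halves the ratio
```

Because the standard error is proportional to the standard deviation, doubling the error term exactly halves the resulting statistic, so the same group difference is harder to declare significant.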
The alpha level (aka, "probability level" or "level of significance") is the probability that the researcher will make a Type I error, that is, reject a null hypothesis that is true; it is selected by the researcher prior to the research project. (Statistical power is a related but distinct concept: the probability of correctly rejecting a false null hypothesis.) Thus, the "higher" (or "bigger") the alpha level (e.g., α = .05), the more likely it is that the researcher will reject a true null hypothesis (a Type I error). That is, the researcher judges that a significant difference exists between the sample means when there isn't one. On the other hand, the "lower" (or "smaller") the alpha level (e.g., α = .001), the more likely it is that the researcher will accept a false null hypothesis (a Type II error). That is, the researcher judges that there is not a significant difference between the sample means when there is one.
Assuming the researcher can't conduct a census yet wants to ensure that it will be tougher to detect a significant difference between two sample means (that is, to reject the null hypothesis), the researcher will select a smaller alpha level. For example, when α = .05 (a "bigger" alpha: a 5-in-100 probability of error), the researcher is more likely to find a significant difference than when α = .001 (a "smaller" alpha: a 1-in-1,000 probability of error). The challenge confronting the researcher is that if α is set either too high or too low, the researcher is likely to make a wrong determination regarding the null hypothesis. Don't forget: the null hypothesis says that there is no difference between the means, that is, that they are equal, at a stated probability of error (i.e., the probability that the researcher is wrong).
This is the point where confusion can enter into the picture, especially as students begin to learn these relatively basic concepts. To avoid any confusion, re-read the statements in the preceding paragraph by applying them to the Null Hypothesis Chart:
A Type I Error occurs when the researcher rejects a true null hypothesis. That is, the researcher says there is a significant difference between the sample means when, in fact, there is none. A Type II Error occurs when the researcher accepts a false null hypothesis. That is, the researcher says there is no significant difference when, in reality, there is one. Thus, if an analysis has little statistical power, the researcher is likely to miss the outcome s/he hoped to discover, because the analysis lacked the power to detect a significant difference that a more powerful analysis would have revealed.
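A small simulation can make the two error types concrete. This is an illustrative sketch, not a procedure from the text: it uses a two-sided z-test with α = .05 (critical value 1.96), a known standard deviation of 1, and made-up population means, and simply counts how often each error occurs:

```python
# Hypothetical simulation of Type I and Type II error rates for a
# two-sided z-test of H0: mu = 0 at alpha = .05 (all values invented).
import random

random.seed(42)  # fixed seed so the run is reproducible

def z_test_rejects(true_mean, n, critical_z=1.96):
    """Draw one sample of size n (sigma = 1) and test H0: mu = 0."""
    sample = [random.gauss(true_mean, 1.0) for _ in range(n)]
    z = (sum(sample) / n) / (1.0 / n ** 0.5)  # sample mean over its standard error
    return abs(z) > critical_z

trials = 2000
# Type I error rate: H0 is true (mu = 0), yet the test rejects it.
type1 = sum(z_test_rejects(0.0, n=25) for _ in range(trials)) / trials
# Type II error rate: H0 is false (mu = 0.5), yet the test fails to reject.
type2 = sum(not z_test_rejects(0.5, n=25) for _ in range(trials)) / trials

print(round(type1, 3))  # hovers near alpha = .05
print(round(type2, 3))  # this is beta; power = 1 - beta
```

The Type I rate settles near the chosen alpha, while the Type II rate (beta) depends on the true effect and the sample size, which is exactly the trade-off the chart summarizes.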
The question then becomes: how does a researcher increase power?
There are three ways to accomplish this, all of which are interrelated, meaning that they impact one another. The obvious first choice is to increase the sample size, which decreases the amount of sampling error present in the sample. The second choice involves adjusting the significance level; note that tightening alpha a priori (e.g., changing α = .05 to α = .01) reduces power, while relaxing it increases power. The third choice is to seek a larger effect size, that is, an outcome of a statistical test that departs more from the null hypothesis. Thus, as the sample size, the significance level, and the effect size increase, so does the power of the significance test. This is logical because power increases automatically with an increase in the sample size, and virtually any difference can be made significant if the sample is large enough.
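These three levers can be seen in a back-of-the-envelope power calculation. The sketch below uses the standard normal approximation for a two-sided z-test; the function names, effect sizes, and sample sizes are all invented for illustration:

```python
# Sketch: approximate power of a two-sided z-test as a function of the
# standardized effect size d, the sample size n, and the alpha cutoff.
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(d, n, z_alpha=1.96):
    """Probability of rejecting H0 when the true standardized effect is d."""
    shift = d * math.sqrt(n)  # the effect measured in standard-error units
    return (1.0 - normal_cdf(z_alpha - shift)) + normal_cdf(-z_alpha - shift)

print(round(power(d=0.3, n=25), 2))                 # modest sample: low power
print(round(power(d=0.3, n=100), 2))                # quadrupling n raises power
print(round(power(d=0.3, n=100, z_alpha=2.576), 2)) # stricter alpha (.01) lowers it
print(round(power(d=0.8, n=25), 2))                 # a bigger effect is easier to detect
```

Each lever moves power in the direction the paragraph describes: more subjects, a looser alpha, or a larger effect all make rejection of a false null more likely.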
Effect size is a numerical way of expressing the strength or magnitude of a reported relationship, be it causal or not.
The basic formula for calculating the effect size is to subtract the mean of the control group from that of the experimental group and then divide that difference by the standard deviation of the control group's scores. Effect size is expressed as a decimal number and, while values greater than 1.00 are possible, they do not occur very often. Thus, an effect size near .00 means that, on average, the experimental and control groups performed the same; a positive effect size means that, on average, the experimental group performed better than the control group; and a negative effect size means that, on average, the control group performed better than the experimental group. For positive effect sizes, the larger the number, the more effective the experimental treatment. As a general rule of thumb, an effect size in the .20s (e.g., .27) indicates a treatment that produces a relatively small effect, whereas an effect size in the .80s (e.g., .88) indicates a powerful treatment.
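The calculation described above can be sketched as follows. The score lists and the function name are hypothetical; the formula is simply the experimental-minus-control mean difference divided by the control group's standard deviation:

```python
# Sketch of the effect-size formula from the text: (experimental mean -
# control mean) / control-group standard deviation. Scores are made up.
import statistics

def effect_size(experimental, control):
    mean_diff = statistics.mean(experimental) - statistics.mean(control)
    return mean_diff / statistics.stdev(control)  # control-group SD in the denominator

experimental_scores = [82, 88, 75, 90, 85, 79]
control_scores = [70, 75, 68, 80, 72, 74]

d = effect_size(experimental_scores, control_scores)
print(round(d, 2))  # positive: the experimental group outperformed the control group
```

A result near zero would mean the groups performed about the same; a negative result would mean the control group came out ahead.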
Thus, the greater an effect size the researcher desires, the greater the difference has to be between the experimental group and control group means.
This having been said, how might someone think about effect size?
The best way is to relate the concept to what one already knows. As a teacher, for example, one may be wondering (all research begins with a problem) about the best way to help students learn course-related material. That teacher may be asking: Should I use cooperative learning groups, simply assign homework, or assign and grade homework?
Conveniently, a meta-analysis has found cooperative learning to produce an effect size of .76; the effect size for assigned homework is .28; the effect size for graded homework is .79 (Walberg, 1984).
So, the teacher should consider a strategy that combines cooperative learning with graded homework. The worst strategy would be for the teacher to assign homework and hand it back to the students with no meaningful comments.
Now, all of that is well and good. But, at the same time, in seeking greater power (for example, by relaxing the alpha level), one opens the door to the possibility of committing a Type I error. That is, the researcher becomes more likely to reject a null hypothesis that is actually true. (Go back to the Null Hypothesis Chart and check it out.)
What all of this means, then, is that the researcher must decide a priori how much statistical significance (meaning how unlikely the results would be to occur by chance, at a predetermined level of probability) is needed before judging the practical significance of the study.
Walberg, H. J. (1984). Improving the productivity of America's schools. Educational Leadership, 41(8), 19-27.