Meta-analysis and effect size

Chong-ho Yu, Ph.D., CNE, MCSE, CCNA

Question, question, question...

The previous lesson mentioned that power is a function of effect size, alpha level and sample size. What is effect size? How is it related to power analysis? How can we determine effect size? Answers to these questions are provided below.


What is effect size?

Before discussing effect size, I would like to introduce a broader concept: comparison in terms of a standard. Many statistical formulas seem difficult, but many of them are nothing more than standardized comparisons. Take comparing wealth as a metaphor. How could we compare the net assets of the American IBM Corporation and the Japanese Sony Corporation? The simplest way is to compare them in US dollars, the standard currency for international trade. By the same token, a t-test is a mean comparison in terms of the standard deviation. Many statistics follow this thread of logic.

Effect size can be conceptualized as a standardized difference. In its simplest form, effect size, which is denoted by the symbol "d", is the mean difference between groups in standard score form, i.e., the ratio of the difference between the means to the standard deviation. This concept is derived from a school of methodology named meta-analysis, which was developed by Glass (1976).


How is it related to power analysis?

If only the null hypothesis is available and it is rejected, the most one can conclude is that "the difference is not zero." The following figure shows that the difference, indicated by red arrows, can be anything. When the President asks the five-star general to estimate war casualties, can he give "not zero" as a satisfactory answer?

When the null hypothesis is rejected: Not zero

Researchers should be concerned not only with whether a null hypothesis is false, but also with how false it is. In other words, if the difference is not zero, how large a difference should one expect? By specifying an effect size, which is the minimum difference that is worth research attention, researchers can design a study with optimal power rather than wasting resources on trivial effects. The larger the effect size (the difference between the null and alternative means), the greater the power of a test.
 

[Figure: effect size]

Ideally, power analysis employs the population effect size. However, in practice the effect size must be estimated from sample data.
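The relationship between effect size and power can be illustrated with a short Python sketch. It relies on a normal approximation to a two-sided, two-sample test with equal group sizes, which is a simplifying assumption made here (exact power calculations use the noncentral t distribution). The function name approx_power and the values of d, n, and alpha are illustrative and are not taken from this lesson.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test.

    Assumption: normal approximation with equal group sizes; this is a
    sketch, not an exact noncentral-t power calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the alternative
    shift = d * (n_per_group / 2) ** 0.5
    # Probability of falling in either rejection region
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Illustrative effect sizes with a hypothetical 50 subjects per group
for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f}: approximate power = {approx_power(d, 50):.3f}")
```

With the sample size held constant, the printed power rises as d increases, which is the pattern described above.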


How can we determine effect size?

Gene Glass's approach

There are several ways to calculate effect size. The three most popular approaches are Gene Glass's approach, Hunter-Schmidt's approach, and Cohen's d. The basic formula of Glass's approach is:

Effect size = (Mean of treatment group - Mean of control group) / Standard deviation of the control group

The control group's standard deviation is used because it is not affected by the treatment (Glass, McGraw, & Smith, 1981).
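As a minimal sketch of Glass's approach, the following Python function divides the mean difference by the control group's standard deviation. The sign convention (treatment minus control, so that a positive value means the treatment group scored higher) and the sample scores are assumptions made for illustration; the function name glass_delta is hypothetical.

```python
from statistics import mean, stdev

def glass_delta(treatment, control):
    """Mean difference standardized by the control group's sample SD."""
    return (mean(treatment) - mean(control)) / stdev(control)

# Hypothetical test scores, for illustration only
control_scores = [70, 72, 68, 75, 65]
treatment_scores = [78, 74, 80, 77, 71]

print(round(glass_delta(treatment_scores, control_scores), 2))
```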

Hunter-Schmidt's approach

Hunter and Schmidt (1990) suggested using a pooled within-group standard deviation because it has less sampling error than the control group standard deviation under the condition of equal sample size. In addition, Hunter and Schmidt corrected the effect size for measurement error by dividing the effect size by the square root of the reliability coefficient of the dependent variable:

Corrected effect size = Effect size / SQRT(r)

where r is the reliability coefficient of the dependent variable.

Concepts of measurement error and reliability coefficient will be discussed in the section "Measurement."
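The two Hunter-Schmidt steps described above (standardizing by a pooled within-group standard deviation, then correcting for measurement error) might be sketched in Python as follows. The function names, the sample scores, and the reliability value of .81 are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, variance

def pooled_sd(group1, group2):
    """Pooled within-group standard deviation of two samples."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return sqrt(pooled_var)

def d_pooled(treatment, control):
    """Mean difference standardized by the pooled within-group SD."""
    return (mean(treatment) - mean(control)) / pooled_sd(treatment, control)

def correct_for_measurement_error(d, reliability):
    """Divide d by the square root of the reliability coefficient."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must be in (0, 1]")
    return d / sqrt(reliability)

# Hypothetical scores and a hypothetical reliability coefficient of .81
control_scores = [70, 72, 68, 75, 65]
treatment_scores = [78, 74, 80, 77, 71]
d = d_pooled(treatment_scores, control_scores)
print(round(d, 3), round(correct_for_measurement_error(d, 0.81), 3))
```

Because the reliability coefficient cannot exceed 1, the corrected effect size is never smaller in magnitude than the uncorrected one.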

Cohen's d

When there are two independent groups (e.g. control and treatment), usually a two-independent-sample t-test is employed. If the t-value is reported, Cohen's d can be obtained by the following formula:

d = t*SQRT((n1+n2)/(n1*n2))

where t is the t-value, and n1 and n2 are the sample sizes of the two groups.
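The conversion from a reported t-value to Cohen's d, that is, t multiplied by the square root of (n1+n2)/(n1*n2), can be sketched in a few lines of Python. The function name d_from_t and the example numbers are hypothetical.

```python
from math import sqrt

def d_from_t(t, n1, n2):
    """Cohen's d from an independent-samples t-value and the two group sizes."""
    return t * sqrt((n1 + n2) / (n1 * n2))

# Hypothetical report: t = 2.0 with 50 subjects in each group
print(d_from_t(2.0, 50, 50))
```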

When a study reports a Chi-square test result with one degree of freedom (n=2), the following formula can be employed to approximate Cohen's d:

abs(d) = 2*SQRT(Chi-square/(N - Chi-square))

where N is the total sample size.
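A small Python sketch of this approximation follows. It computes 2 times the square root of Chi-square divided by (N minus Chi-square), and it returns only the magnitude of d because a Chi-square value carries no direction. The function name and the example values are hypothetical.

```python
from math import sqrt

def abs_d_from_chi_square(chi_square, n_total):
    """Approximate |d| from a one-degree-of-freedom Chi-square and total N."""
    if not 0 <= chi_square < n_total:
        raise ValueError("chi_square must be non-negative and smaller than n_total")
    return 2 * sqrt(chi_square / (n_total - chi_square))

# Hypothetical report: Chi-square = 4 with a total sample of 104
print(abs_d_from_chi_square(4, 104))
```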

When a study reports a hit rate (percentage of success after taking the treatment or no treatment), the following formula can be used:

d = arcsine(p1) - arcsine(p2)

where p1 and p2 are the hit rates of the two groups (e.g. control and treatment) (Poston & Hanson, 2010). Here arcsine(p) is commonly taken to mean the arcsine transformation 2*arcsin(SQRT(p)).
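The hit-rate conversion can be sketched in Python as the difference between two arcsine-transformed proportions. The transformation used here, 2*arcsin(sqrt(p)), is the commonly used form and is an assumption about what the formula intends; the function names and the example hit rates are hypothetical.

```python
from math import asin, sqrt

def arcsine_transform(p):
    """Arcsine transformation of a proportion (assumed form: 2*arcsin(sqrt(p)))."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a proportion between 0 and 1")
    return 2 * asin(sqrt(p))

def d_from_hit_rates(p1, p2):
    """Effect size as the difference between two transformed hit rates."""
    return arcsine_transform(p1) - arcsine_transform(p2)

# Hypothetical hit rates: 60% success with treatment, 40% without
print(round(d_from_hit_rates(0.60, 0.40), 3))
```

Equal hit rates give an effect size of zero, as one would expect when the treatment makes no difference.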

Conventional values

The conventional values of effect size (Cohen, 1962) are:

Small: d = .20
Medium: d = .50
Large: d = .80

Other researchers may use different values for small, medium, and large effect sizes. The magnitude of effect size depends on the subject matter. For example, in medical research d = .05 may be considered a large effect size, i.e., if the drug can save even five more lives, further research should be considered.

It is important to point out that Cohen defined .50 as the medium effect size because it was close to the average observed effect size based on his literature review using Journal of Abnormal and Social Psychology during the 1960s. The so-called small, medium, and large effect sizes are specific to a particular domain (abnormal and social psychology) and thus they should not be treated as a universal guideline (Aguinis & Harden, 2009). Because different subject matters might have different effect sizes, Welkowitz, Ewen, and Cohen (1982) explicitly stated that one should not use conventional values if one can specify the effect size that is appropriate to the specific problem. Moreover, Wilkinson and Task Force (1999) gave the following advice, "Because power computations are most meaningful when done before data are collected and examined, it is important to show how effect-size estimates have been derived from previous research and theory in order to dispel suspicions that they might have been taken from data used in the study or, even worse, constructed to justify a particular sample size."

It is a common practice for researchers to collect articles in their fields and catalog them in EndNote for future citation. It may be more beneficial to use this collection to calculate and constantly update the effect size of the subject matter to be studied.
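One simple way to maintain such a running estimate is a sample-size-weighted mean of the effect sizes in the collection. The sketch below is only one of several possible weighting schemes and is an assumption made here, not a procedure prescribed in this lesson; the function name and the study values are hypothetical.

```python
def weighted_mean_effect(effect_sizes, sample_sizes):
    """Average effect size, weighting each study by its sample size."""
    if len(effect_sizes) != len(sample_sizes) or not effect_sizes:
        raise ValueError("need one sample size per effect size")
    total_n = sum(sample_sizes)
    return sum(d * n for d, n in zip(effect_sizes, sample_sizes)) / total_n

# Hypothetical collection: (effect size, total sample size) for each study
collection = [(0.30, 40), (0.55, 120), (0.10, 60)]
effects = [d for d, _ in collection]
sizes = [n for _, n in collection]
print(round(weighted_mean_effect(effects, sizes), 3))

# Updating the estimate when a new (hypothetical) study is added
effects.append(0.40)
sizes.append(80)
print(round(weighted_mean_effect(effects, sizes), 3))
```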

In practice, it may be difficult to find past research studies related to your topic, especially when the topic is fairly new. To rectify this situation, Glass, McGraw, and Smith (1981) suggested looking at studies in similar domains. For example, if you are not able to locate enough research papers on Web-based instruction, you can use studies on hypertext and multimedia. Before the introduction of the World Wide Web, hypertext and multimedia had been widely employed in computer-based instruction programmed in HyperCard, Authorware, and Director. Concepts related to Web-based instruction such as collaboration in chat sessions and mailing lists can be found in research on collaboration in other instructional settings.

It is noteworthy that not all research studies can be included in your collection for meta-analysis. Only well-designed studies that conform to the standards established by Campbell and Stanley (1963) and Cook and Campbell (1979) should be considered. Criteria of well-designed studies will be discussed in the section "Design of experiment."


Applications of meta-analysis

As discussed in the section concerning power analysis, computing effect size is essential to sample size determination. Meta-analysis, however, can be used not only for synthesizing the results of past research, but also for new research studies. For example, Baker and Dwyer (2000) conducted eight studies regarding visualization as an instructional variable (n=2000). If all subjects were used for one analysis, the study would be over-powered. Instead, the effect size is computed in each study individually, and the findings of the eight studies are then pooled to draw inferences as to the meaning of a collective body of research.

Besides the risk of overpowering, using all data in one test may lead to Simpson's paradox. Simpson's paradox is a phenomenon in which the conclusion drawn from the aggregate data is opposite to the conclusion drawn from the separate contingency tables based upon the same data.

The following example is given by Schwarz (1998). A university conducted a study to examine whether there is a sex bias in admission. The admission data of the MBA program and the law school were analyzed. The first table shows the MBA data:

 

MBA Program

          Admit        Deny         Total
Male      480 (80%)    120 (20%)    600 (100%)
Female    180 (90%)     20 (10%)    200 (100%)

By looking at the MBA data only, it seems that females are admitted at a slightly higher rate than males in the MBA program. The same pattern can be found in the law school data.

 

Law School

          Admit        Deny         Total
Male       10 (10%)     90 (90%)    100 (100%)
Female    100 (33%)    200 (67%)    300 (100%)

Interestingly enough, when the two data sets are pooled, females seem to be admitted at a lower rate than males!

 

MBA and Law School

          Admit        Deny         Total
Male      490 (70%)    210 (30%)    700 (100%)
Female    280 (56%)    220 (44%)    500 (100%)

To avoid Simpson's paradox, Olkin (2000) recommends that researchers employ meta-analysis rather than pooling. In pooling, data sets are first combined and then the groups are compared. In meta-analysis, groups in different data sets are compared first and then the comparisons are combined.
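The difference between the two orders of operation can be checked with the admission counts from the example. The following Python sketch compares the sexes within each program first, and then repeats the comparison after pooling the programs; the dictionary layout and function names are choices made for this illustration.

```python
# (admitted, total applicants) by program and sex, from the admission example
admissions = {
    "MBA": {"male": (480, 600), "female": (180, 200)},
    "Law": {"male": (10, 100), "female": (100, 300)},
}

def admission_rate(admitted, total):
    return admitted / total

def within_program_differences(data):
    """Compare first: female minus male admission rate in each program."""
    return {
        program: admission_rate(*groups["female"]) - admission_rate(*groups["male"])
        for program, groups in data.items()
    }

def pooled_difference(data):
    """Combine first: female minus male admission rate over all programs."""
    def pooled_rate(sex):
        admitted = sum(groups[sex][0] for groups in data.values())
        total = sum(groups[sex][1] for groups in data.values())
        return admitted / total
    return pooled_rate("female") - pooled_rate("male")

for program, diff in within_program_differences(admissions).items():
    print(f"{program}: female - male = {diff:+.2f}")
print(f"Pooled: female - male = {pooled_difference(admissions):+.2f}")
```

The within-program differences are both positive while the pooled difference is negative, because most women applied to the program with the lower overall admission rate.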


Limitations of meta-analysis

Like every methodology, meta-analysis also has certain limitations and weaknesses:

Assumption of standardized effects

In some branches of meta-analysis, computation of effect size is based upon a pooled variance or an adjusted variance. In response to this practice, Berk and Freedman (2003) are skeptical of the merit of meta-analysis. In their view, the claimed merit of meta-analysis is illusory. First, many meta-analyses use studies from both randomized experiments and observational studies. In the former, it is usually the case that subjects are not drawn at random from populations with a common variance. In observational studies there is no randomization at all. Thus, it is gratuitous to assume that standardized effects are constant across studies.

Social dependence

Further, Berk and Freedman questioned the assumed independence of studies for meta-analysis. Researchers are trained in similar ways, read the same papers, talk to each other, write proposals for the same funding agencies, and publish their findings in the same pool of peer-reviewed journals. Earlier studies lead to later studies in the sense that each generation of doctoral students trains the next. They questioned whether this social dependence compromises statistical independence.

Publication bias

Another common problem of meta-analysis is publication bias, also known as the file-drawer effect: publication bias leads to the censoring of studies with non-significant results. As a remedy, Keng and Beretvas (2005) developed a methodology to quantify the effect that publication bias can have on correlation estimation.
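How censoring non-significant results can distort a synthesized effect size may be illustrated with a small simulation. The Python sketch below is not the method of Keng and Beretvas; it is a simplified illustration that assumes normally distributed scores with a known standard deviation of 1, a true effect size of 0.2, and a rule that only results significant at the .05 level are published. All names and parameter values are hypothetical.

```python
import random
from statistics import mean

def simulate_literature(true_d=0.2, n_per_group=20, n_studies=2000, seed=1):
    """Return (mean d of all studies, mean d of the 'published' studies)."""
    rng = random.Random(seed)
    # Standard error of a mean difference when both groups have SD = 1 (assumed known)
    se = (2 / n_per_group) ** 0.5
    all_d, published_d = [], []
    for _ in range(n_studies):
        treatment = [rng.gauss(true_d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        d = mean(treatment) - mean(control)
        all_d.append(d)
        # Censoring rule: only results significant at the .05 level are published
        if abs(d / se) > 1.96:
            published_d.append(d)
    return mean(all_d), mean(published_d)

mean_all, mean_published = simulate_literature()
print(f"all studies: {mean_all:.2f}, published only: {mean_published:.2f}")
```

Under these assumptions the average of the published studies overstates the true effect, because small studies reach significance only when they happen to observe a large difference.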

Logic of courtroom

Root (2003) challenged the merits of meta-analysis at the philosophical level. According to Root, standard hypothesis testing is based upon the logic of the physical sciences, in which the researcher must gamble with the unknown future, in the sense that the prediction derived from the hypothesis may not be in alignment with the proposed theory. However, meta-analysis is implicitly tied to the logic of the courtroom, in which collected evidence is used to explain past events. In a retrospective methodology such as meta-analysis, the synthesizer has the luxury of choosing which past studies to include. Using gambling as an analogy, Root pointed out that computing probabilities based on known facts is like betting money in a game after the result is known.

Subjective selection

The result of a meta-analysis is tied to the selection criteria set by the researcher. In an attempt to resolve the debate concerning whether mammography can reduce the mortality rate of breast cancer, a research team utilizing meta-analysis found that there was no reliable evidence to support the claim that mass screening for breast cancer had a positive effect for any women. In contrast, the U.S. Preventive Services Task Force, which also employed meta-analysis, concluded that the use of mammograms significantly enhanced the survival rate of women from 40 to 74 years of age. Aschengrau and Seage III (2007) asserted that the preceding contradiction is a result of different criteria for selecting the literature.

Superrealization bias

"Superrealization bias," a term coined by Cronbach et al. (1980), is germane to effect size and meta-analysis. Superrealization bias refers to the phenomenon that in a small-scale study, experimenters are able to monitor the quality of implementation or create unrealistic conditions, but these ideal conditions could never be realized in a large-scale study. Slavin (2008) and Slavin and Smith (2008) asserted that small studies are not inherently biased, but a collection of small studies tends to be biased. Thus, Slavin warned against reporting average effect sizes using a cluster of low-n studies.

Varying conditions across studies

It is important to point out that quite a few controversial conclusions in medical research arise from meta-analyses. For example, based on meta-analyses, the medical research community asserted that antidepressants are not more effective than placebos. But critics charged that not all the studies included in the meta-analyses used the same protocols, definitions, types of patients, and doses. The alleged safety of Avandia is another example. A meta-analysis of the combined trials showed that only 55 people per 10,000 had heart attacks when using Avandia whereas 59 people per 10,000 had heart attacks in comparison groups. However, after a series of statistical manipulations, this conclusion was reversed. It was argued that a meta-analysis synthesizing many small-scale studies is not a good substitute for a single trial with a large sample size (Siegfried, 2010).


Software for meta-analysis

You can use either general-purpose statistical programs or specialized programs to conduct meta-analysis. SAS is an example of a general-purpose statistical program that can perform meta-analysis (Wang & Bushman, 1999). For specialized programs, one can use a Windows-based program released by BioStat (2006) or Devilly (2005), or a DOS-based program written by Schwarzer (1989). In the following, a hypothetical example is used to illustrate the usage of Schwarzer's program. The program is DOS-based and thus the interface is strange by today's standards:

Assume that you found four previous studies concerning Web-based instruction. First, open Excel for data entry. The first column contains the ID of each experiment, the second is the mean of the experimental group in each study, and the third is the control group mean. The fourth column is the effect size of each study, which is (mean of experimental group - mean of control group)/standard deviation. The calculation can be automated by writing an Excel function (the Excel spreadsheet will have more than five columns if a function is used, because the standard deviation values must be entered, too, but you should not output the column of standard deviations). The last column is the correlation coefficient between the scores of the two groups. Again, an Excel function can be used to automate the computation. After the data are entered, output them to a text file.
 

Study ID    Experimental group mean    Control group mean    Effect size    Correlation coefficient
1           100                        100                   1.2            0.9
2           200                        200                   0.6            0.8
3           300                        300                   0.9            0.8
4           400                        400                   0.8            0.9

In "meta-analysis" open Meta-A from D-values. Type the file name, with the proper path, of the file you just saved. Output like the following will be displayed. Please consult the manual for the interpretation.
 

[Figure: output of the meta-analysis program]

The following is a screenshot of Effect size Generator written by Grant Devilly:


Further reading

To get a quick overview of effect size, I recommend reading the book chapter on effect size written by Tatsuoka (1993) in A Handbook for data analysis in the behavioral sciences (pp. 461-479), edited by Gideon Keren and Charles Lewis and published in Hillsdale, NJ, by L. Erlbaum Associates.

Also, the manual of the program "meta-analysis" gives a very concise introduction to both the theory and the computational procedures of meta-analysis. For learning the procedure of conducting meta-analysis, please look at Liao (1998) as an example.


References

  • Aguinis, H., & Harden, E. E. (2009). Sample size rules of thumb: Evaluating three common practices. In Charles E. Lance and Robert J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 267-286). New York: Routledge.

  • Aschengrau, A., & Seage III, G. (2007). Essentials of epidemiology in public health. Boston, MA: Jones and Bartlett.

  • Baker, R., & Dwyer, F. (2000, February). A meta-analytic assessment of the effects of visualized instruction. Paper presented at the 2000 AECT National Convention, Long Beach, CA.

  • Berk, R. A., & Freedman, D. (2003). Statistical assumptions as empirical commitments. In T. G. Blomberg & S. Cohen (Eds.), Law, Punishment, and Social Control: Essays in Honor of Sheldon Messinger (2nd ed.) (pp. 235-254). New York: Aldine.

  • Biostat (2006). Meta-analysis. [On-line] Available: http://www.meta-analysis.com/

  • Campbell, D. & Stanley, J. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.

  • Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153.

  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin Company.

  • Cronbach, L. J., Ambron, S. R., Dornbusch, S. M., Hess, R.O., Hornik, R. C., Phillips, D. C., Walker, D. F., & Weiner, S. S. (1980). Toward reform of program evaluation: Aims, methods, and institutional arrangements. San Francisco: Jossey-Bass.

  • Devilly, G. (2005). Effect size generator. Retrieved from http://www.swin.edu.au/victims/resources/software/effectsize/effect_size_generator.html

  • Glass, G. V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5, 3-8.

  • Glass, G. V., McGraw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills: Sage Publications.

  • Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage Publications.

  • Keng, L., & Beretvas, N. (2005, April). The effect of publication bias on correlation estimation. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.

  • Liao, Y. C. (1998). Effects of hypermedia versus traditional instruction on students' achievement: A meta-analysis. Journal of Research on Computing in Education, 30, 341-361.

  • Olkin, I. (2000, November). Reconcilable differences: Gleaning insight from independent scientific studies. ASU Phi Beta Kappa Lecturer Program, Tempe, Arizona.

  • Poston, J. M., & Hanson, W. E. (2010). Meta-analysis of psychological assessment as a therapeutic intervention. Psychological Assessment, 22, 203-212.

  • Root, D. (2003). Bacon, Boole, the EPA, and scientific standards. Risk Analysis, 23, 663-668.

  • Schwarz, C. (1998). Contingency tables - Simpson's paradox. Retrieved from http://www.math.sfu.ca/stats/Courses/Stat-301/Handouts/node49.html

  • Schwarzer, R. (1989). Statistics software for meta-analysis. Retrieved from http://www.yorku.ca/faculty/academic/schwarze/meta_e.htm

  • Siegfried, T. (2010). Odds are, it's wrong: Science fails to face the shortcomings of statistics. Science News, 177(7). Retrieved from http://www.sciencenews.org/view/feature/id/57091/

  • Slavin, R. (2008). Perspectives on evidence-based research in education. Educational Researcher, 37(1), 5-14.

  • Slavin, R., & Smith, D. (2008, March). Effects of sample size on effect size in systematic reviews in education. Paper presented at the annual meetings of the Society for Research on Effective Education, Crystal City, VA.

  • Wang, M. C., & Bushman, B. J. (1999). Integrating results through meta-analytic review using SAS software. Cary, NC: SAS Institute.

  • Welkowitz, J., Ewen, R. B., & Cohen, J. (1982). Introductory statistics for the behavioral sciences. San Diego, CA: Harcourt Brace Jovanovich, Publishers.

  • Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Last update: 2010

