Experimental design as variance control

Chong-ho Yu, Ph.Ds.

Last updated: 2025

Variability

Kerlinger (1986) conceptualized experimental design as variance control. The previous lesson pointed out that control is an indispensable element of an experiment; the aspect of variance is discussed here. First, let's spend a few minutes looking at the concept of "variance" or "variability."
The purpose of research is to maximize explained variance and minimize error variance. To use radio reception as a metaphor for data, researchers want to filter out noise in order to get a clear signal. For example, the F ratio in Analysis of Variance (ANOVA) can be viewed as a ratio of signal to noise. ANOVA is typically used to compare several group means. Why is it not called Analysis of Means instead of Analysis of Variance? It is because the means are compared on the basis of variance. The F ratio in ANOVA is:

$$F = \frac{\text{Variability between groups}}{\text{Variability within groups}}$$

To keep the illustration simple, let's use only two groups. Suppose that in the 24th century we want to find out whether Vulcans or humans are smarter. We can sample many Vulcans and humans and test their IQ. If the mean IQ of Vulcans is 200 and that of humans is 100, and there is very little variability within each group, as indicated by the two narrow curves in the following figure, then the mean difference between the two groups sends a very clear signal: even the dumbest Vulcan has a higher IQ than the smartest human.

But what if there is huge within-group variability in both distributions, as shown in the following figure? In this case, the answer is not clear-cut. Some Vulcans are more intelligent than some humans, and vice versa. Thus, within-group variability is noise that distorts the signal. The means must be compared by considering both between-group variability and within-group variability.
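The contrast between the two scenarios can be sketched numerically. The following is a minimal illustration, not part of the original lesson: the IQ scores are hypothetical, and the function name f_ratio is chosen here for convenience. Both scenarios have the same group means (200 and 100); only the within-group spread differs.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA F ratio: between-group mean square / within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Signal: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: how far each score lies from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical IQ scores (assumed for illustration only).
vulcans_narrow = [198, 199, 200, 201, 202]
humans_narrow = [98, 99, 100, 101, 102]
vulcans_wide = [140, 170, 200, 230, 260]
humans_wide = [40, 70, 100, 130, 160]

f_narrow = f_ratio([vulcans_narrow, humans_narrow])
f_wide = f_ratio([vulcans_wide, humans_wide])
print(f"Narrow distributions: F = {f_narrow:.1f}")
print(f"Wide distributions:   F = {f_wide:.1f}")
```

The between-group variability is identical in the two scenarios, so the drop in F in the second one comes entirely from the larger within-group variability.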

Variance control

Within-group variability can be caused by many things. It could reflect a real, inherent dispersion among different people's abilities, but it could also be the result of poor research design and measurement error. A well-designed experiment should accomplish the following goals: maximize experimental variance, control extraneous variance, and minimize error variance. Put in the context of ANOVA, as illustrated in the preceding example, the goals of experimental design are to maximize between-group variability (signal) and to minimize within-group variability (noise).

Maximize experimental variance
Experimental variance is the variance of the dependent variable that is influenced by the independent variable(s). To maximize experimental variance, researchers should make the experimental conditions as different as possible. For example, if you design a research study to compare Web-based instruction and conventional computer-based instruction, the features of the two treatments must be pulled apart as much as possible. The Web version must carry properties that cannot be found in other media. If a learner attribute such as aptitude is a factor in this study, the groups of learners must differ markedly on the aptitude dimension. In other words, the design should allow researchers to detect between-group variability if it is present. Quite a few research studies on Web-based instruction presented very similar versions of the treatment to different groups. Further, in those studies the classification of user groups did not reflect the variability of learner aptitude. These mistakes should be avoided.

"This is a breakthrough in engineering science. Repeated experiments confirm that a Porsche 911 can outrun a Honda Civic."

One should be cautious that the concept of "maximizing experimental variance" could be terribly misunderstood. Look at the following two experiments:

Engineer A wants to test the engine performance of Porsche 911. He compares it against a Honda Civic.

Engineer B wants to test the engine performance of Porsche 911. He compares it against a Ferrari.

By common sense, most people will laugh at the first benchmark test and approve the second one. But, look at the next pair:

Researcher A spent 100 hours developing a Web-based course as the treatment. For the control group, he simply printed out a hard copy of those Web pages in half an hour.

Researcher B spent 100 hours developing a Web-based course as the treatment. He also invested a lot of effort in developing a multimedia version for the control group.

What Researcher A did is very similar to what Engineer A did. Very often educational research is a comparison between a Porsche 911 and a Honda Civic, i.e., the treatment and the control groups are not comparable at all. While criticizing the lack of evidence-based research and decisions in education, Slavin (2008) cited a real-life example: A series of studies was conducted to evaluate Daisy Quest, a computerized program used to teach phonemic awareness in Grades K through 1. The results were said to be positive, but in these studies (e.g., Barker & Torgesen, 1995), the control groups were not taught phonemic awareness at all. Besides the design issue, there is a measurement issue associated with this problem. Slavin pointed out that developer-made tests are intended to assess the outcomes taught in the treatment program, and such tests are unfair to the control group. Not surprisingly, the test scores show the superiority of the treatment group, and the null hypothesis can be easily rejected.

There is a similar real-life example in medical science. A study suggested that patients taking the antidepressant Paxil had a higher rate of suicidal incidents than patients in the placebo group. Another study showed that patients taking another antidepressant, Prozac, had fewer suicidal incidents than the placebo group. But the rate of suicidal incidents in the Prozac group was actually higher than that in the Paxil group. If that is the case, how could Prozac appear to be safer than Paxil? The truth is that there were two different placebo groups. In the Paxil experiment, fewer patients on placebo had suicidal incidents than those on placebo in the Prozac experiment (Siegfried, 2010). This does not mean that the researchers in the Prozac experiment selected an at-risk group on purpose. However, it is obvious that how good the result looks depends on what the control group is.
The ethics of clinical research requires equipoise, which is a state of genuine uncertainty on the part of the clinical researcher regarding the treatment effectiveness of each side in a trial. In other words, it is unethical for a medical researcher to assign patients to the control group when he or she already knows that the treatment is much more effective than the control (Freedman, 1987). By the same token, educational researchers should never conduct a study in which the control/comparison group is absolutely inferior to the treatment.

There are documented instances in research history where participants were intentionally given inferior treatments or placed in "do-nothing" control groups, leading to devastating consequences for their lives. AZT research for HIV in developing countries during the 1990s is an infamous example. In trials testing the effectiveness of antiretroviral drugs in preventing mother-to-child transmission of HIV, researchers in some developing countries used placebo control groups instead of the standard treatment that already existed in wealthier nations. Women in the placebo groups received no treatment, resulting in preventable HIV transmission to their newborns (Zion, 1998).

People often misunderstand "maximizing the between-group difference" as deliberately making the control group inferior or irrelevant. Actually, "maximizing the experimental variance" means making the best of both groups. If I want to see the true difference between a Porsche and a Ferrari, I should fine-tune both engines to their peak level. In the same fashion, if I want to see the true difference between Web-based instruction and conventional instruction, I should do my best to make both versions well-written and well-organized.

Control extraneous variance

To exclude the extraneous variable: If you worry that variability of intelligence among subjects may affect the experiment, you can select only subjects whose IQ is between 90 and 100, which is considered normal. Please be aware that this approach may prevent the researcher from making a broader generalization.
For example, when an experiment is conducted to compare test performance between users of Web-based instruction and conventional instruction, a significant mean difference may be observed among average-IQ subjects, but not among high-IQ subjects. If no high-IQ subjects are recruited, the researcher cannot assert that Web-based training works better than conventional teaching for high-IQ people.

There are many dimensions on which people can differ from each other. It is not a bad idea to exclude just one or a few extraneous variables while still including several factors simultaneously in the experiment. However, a beginner in research may use a "one-at-a-time" or "holding others constant" approach to test only the variable of interest. This simple approach does not require experimental design, but it fails to examine interaction effects among different factors, and thus insights may be missed.
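To see what a one-at-a-time approach can miss, consider a minimal sketch of a 2 x 2 factorial layout (instruction method by IQ level). The cell means below are hypothetical numbers chosen to mirror the earlier scenario in which a difference appears among average-IQ subjects but not among high-IQ subjects; the variable names are assumptions made for illustration.

```python
# Hypothetical cell means (test scores) for a 2 x 2 factorial design.
cell_means = {
    ("web", "average_iq"): 80.0,
    ("conventional", "average_iq"): 70.0,
    ("web", "high_iq"): 90.0,
    ("conventional", "high_iq"): 90.0,
}

def simple_effect(iq_level):
    """Effect of Web-based vs. conventional instruction at one IQ level."""
    return cell_means[("web", iq_level)] - cell_means[("conventional", iq_level)]

effect_average = simple_effect("average_iq")
effect_high = simple_effect("high_iq")

# Main effect of method: the simple effects averaged over IQ levels.
main_effect = (effect_average + effect_high) / 2
# Interaction contrast: how much the method effect changes across IQ levels.
interaction = effect_average - effect_high

print(f"Effect among average-IQ subjects: {effect_average}")
print(f"Effect among high-IQ subjects:    {effect_high}")
print(f"Main effect of method:            {main_effect}")
print(f"Interaction contrast:             {interaction}")
```

A study that tested the method only at the average IQ level would report the first effect and never see that it vanishes at the other level. A nonzero interaction contrast is exactly the information that requires both factors in the same design.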

Randomization: Through randomization, high, medium, and low IQ subjects can be spread evenly across the control group and the treatment group. However, when the sample size is small, randomization may still leave the groups unbalanced by chance, which can lead to Simpson's paradox, as discussed earlier.
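The effect of sample size on how well randomization balances the groups can be illustrated with a small simulation. This is only a sketch under stated assumptions: IQ is simulated as normally distributed with mean 100 and standard deviation 15, the seed and the number of replications are arbitrary, and mean_iq_gap is a name invented here.

```python
import random
from statistics import mean

def mean_iq_gap(n_subjects, rng):
    """Randomly split subjects into two equal groups; return the absolute gap in mean IQ."""
    iqs = [rng.gauss(100, 15) for _ in range(n_subjects)]  # simulated IQ scores
    rng.shuffle(iqs)  # random assignment
    half = n_subjects // 2
    treatment, control = iqs[:half], iqs[half:]
    return abs(mean(treatment) - mean(control))

rng = random.Random(42)  # fixed seed so the sketch is reproducible
replications = 200

avg_gap_small = mean([mean_iq_gap(10, rng) for _ in range(replications)])
avg_gap_large = mean([mean_iq_gap(1000, rng) for _ in range(replications)])

print(f"Average IQ gap between groups, n = 10:   {avg_gap_small:.2f}")
print(f"Average IQ gap between groups, n = 1000: {avg_gap_large:.2f}")
```

With only ten subjects the two groups can differ in mean IQ by several points purely by chance, whereas with a thousand subjects the gap shrinks to a fraction of a point.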

To include the extraneous variable as a factor or a covariate: A statistical technique called Analysis of Covariance (ANCOVA) can be employed to adjust for pre-existing differences introduced by an extraneous variable such as IQ or GPA.
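The adjustment that ANCOVA performs can be sketched with the classical adjusted-means formula: each group mean on the outcome is corrected by the pooled within-group slope times the group's distance from the grand mean of the covariate. The data below are hypothetical, with the treatment group starting out with a higher mean IQ, and this hand computation is meant as an illustration rather than a substitute for a statistical package.

```python
from statistics import mean

# Hypothetical (IQ, test score) pairs; the treatment group has higher IQ on average.
treatment = [(95, 60), (100, 62), (105, 66), (110, 67), (115, 70)]
control = [(85, 50), (90, 54), (95, 55), (100, 58), (105, 61)]
groups = [treatment, control]

def within_group_sums(group):
    """Return (sum of cross products, sum of squares of the covariate) around group means."""
    mx = mean([x for x, _ in group])
    my = mean([y for _, y in group])
    sxy = sum((x - mx) * (y - my) for x, y in group)
    sxx = sum((x - mx) ** 2 for x, _ in group)
    return sxy, sxx

# Pooled within-group regression slope of score on IQ.
pooled_slope = (sum(within_group_sums(g)[0] for g in groups)
                / sum(within_group_sums(g)[1] for g in groups))

grand_mean_iq = mean([x for g in groups for x, _ in g])

def adjusted_mean(group):
    """Group mean score adjusted to the grand mean of the covariate."""
    mx = mean([x for x, _ in group])
    my = mean([y for _, y in group])
    return my - pooled_slope * (mx - grand_mean_iq)

raw_difference = mean([y for _, y in treatment]) - mean([y for _, y in control])
adjusted_difference = adjusted_mean(treatment) - adjusted_mean(control)

print(f"Raw mean difference:      {raw_difference:.2f}")
print(f"Adjusted mean difference: {adjusted_difference:.2f}")
```

In this made-up data set, part of the raw difference between the groups is attributable to the IQ gap, so the adjusted difference is smaller than the raw one.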

Matching subjects: Find pairs of subjects who have matched characteristics on the extraneous variables. This method is commonly used in case-control studies. For example, in a study that aims to identify factors of illegal drug use at schools, it is extremely difficult, if not impossible, to recruit students who admit to using illegal drugs. A viable approach is to blanket all the students in a school with anonymous surveys. Suppose it turns out that 50 out of 1,000 students report drug use. If these 50 cases are compared against 950 controls (no drug use), the group sizes are extremely unbalanced and the variances of the two groups are likely to differ, thus violating the assumptions of most parametric tests. To make a valid comparison, 50 non-drug users are selected from the sample by matching them to the 50 cases on demographic and psychological characteristics (Tse, Zhu, Yu, Wong, & Tsang, 2015).
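One simple way to select matched controls is greedy nearest-neighbor matching, sketched below. This is not the procedure used in the cited study; it is an illustrative assumption, as are the covariates (age and a hypothetical questionnaire score), the identifiers, and the unweighted distance. In practice, covariates are usually standardized before distances are computed.

```python
def distance(a, b):
    """Sum of absolute differences across covariates (assumes comparable scales)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_controls(cases, controls):
    """Greedy 1:1 matching: each case receives the closest control not yet used."""
    available = dict(controls)  # copy so the caller's dictionary is untouched
    pairs = {}
    for case_id, case_cov in cases.items():
        best_id = min(available, key=lambda cid: distance(case_cov, available[cid]))
        pairs[case_id] = best_id
        del available[best_id]  # a control can be matched only once
    return pairs

# Hypothetical covariates: (age, questionnaire score).
cases = {"case1": (15, 20), "case2": (17, 30)}
controls = {"c1": (15, 21), "c2": (13, 35), "c3": (17, 29), "c4": (16, 10)}

pairs = match_controls(cases, controls)
print(pairs)
```

Greedy matching depends on the order in which cases are processed; optimal matching algorithms avoid this but are more involved.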

Repeated measures: While matching subjects on all relevant attributes is difficult, there is an easy way to do the matching. The person who best matches anyone on all attributes is that same person. Based on this logic, the researcher can employ a repeated measures design (RMD), in which each subject serves as his or her own control. The simplest form of RMD is a pretest-posttest design, which has only two measures, but it is common for researchers to take multiple measures in a longitudinal study. There are several ways to analyze RMD data, including GLM repeated measures, Structural Equation Modeling, and Multilevel Modeling (also known as Hierarchical Linear Modeling, HLM). Among these, HLM is often considered the best. ANOVA/GLM repeated measures is limited by its assumption of a single covariance structure, namely, compound symmetry. In contrast, multilevel modeling allows different covariance structures, such as autoregressive, Toeplitz, heterogeneous AR, heterogeneous compound symmetry, and many others. In addition, multilevel modeling uses maximum likelihood estimation, which is generally more accurate than the sum-of-squares approach in GLM. Further, the goodness of a multilevel model can be evaluated by multiple fit criteria, such as the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), AICC, and so on (Shin, Espin, Deno, & McConnell, 2004).
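To make the difference between covariance structures concrete, the two simplest ones can be written out for four measurement occasions. The notation is an assumption introduced here, not taken from the lesson: \sigma^2 is the common variance at each occasion and \rho is the correlation parameter.

```latex
% Compound symmetry: every pair of occasions is equally correlated.
\Sigma_{\mathrm{CS}} = \sigma^2
\begin{pmatrix}
1 & \rho & \rho & \rho \\
\rho & 1 & \rho & \rho \\
\rho & \rho & 1 & \rho \\
\rho & \rho & \rho & 1
\end{pmatrix}

% First-order autoregressive, AR(1): correlation decays as occasions grow farther apart.
\Sigma_{\mathrm{AR}(1)} = \sigma^2
\begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 \\
\rho & 1 & \rho & \rho^2 \\
\rho^2 & \rho & 1 & \rho \\
\rho^3 & \rho^2 & \rho & 1
\end{pmatrix}
```

In longitudinal data, measures taken close together in time are often more strongly correlated than measures taken far apart, which is why a structure such as AR(1) may describe the data better than compound symmetry.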

Reduce error variance
Error variance is uncontrollable variance. The sources of error variance can be guessing, momentary inattention, bad mood, and so on.

Blocking: If all subjects are treated as one big group, the within-group variability may be very large. By dividing the experimental conditions into several "blocks," the researcher can localize error variance, i.e., within each block the within-group variability is smaller. For example, a researcher collected the data for an experiment over two days. He worried that this might introduce uncontrollable noise and thus included day of run as a blocking factor in the design (Montgomery, 1997). In a quasi-experiment studying the effect of Web-based instruction, data may be collected in three separate classes. Again, we cannot ignore the possibility that different instructors and different conditions in those classes may affect the outcome of the study. In this case, class can be used as the blocking unit.
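The way blocking localizes error variance can be shown with a small numeric sketch. The scores are hypothetical and represent one experimental condition measured in three classes with different baselines; the variable names are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical scores for one condition, collected in three classes (blocks).
scores_by_class = {
    "class_a": [70, 72, 74],
    "class_b": [80, 82, 84],
    "class_c": [90, 92, 94],
}

def sum_of_squares(values):
    """Sum of squared deviations from the mean of the given values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

all_scores = [s for scores in scores_by_class.values() for s in scores]

# Treating everyone as one big group: class differences are counted as noise.
variance_ignoring_blocks = sum_of_squares(all_scores) / (len(all_scores) - 1)

# Blocking on class: deviations are taken from each class's own mean.
ss_within_blocks = sum(sum_of_squares(s) for s in scores_by_class.values())
df_within_blocks = len(all_scores) - len(scores_by_class)
variance_within_blocks = ss_within_blocks / df_within_blocks

print(f"Variance ignoring blocks: {variance_ignoring_blocks:.1f}")
print(f"Variance within blocks:   {variance_within_blocks:.1f}")
```

Most of the spread in the pooled scores comes from differences between classes, so once class is treated as a block, the remaining within-group variability is much smaller.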

Improve the reliability of the measure: If the instrument has a high reliability coefficient, there will be less measurement error and thus error variance is reduced. Reliability will be discussed in the section "Measurement."
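The link between reliability and error variance can be stated compactly under classical test theory. That framing is an assumption introduced here, as is the notation: X is the observed score, T the true score, E the measurement error (taken to be uncorrelated with T), and \rho_{XX'} the reliability coefficient.

```latex
X = T + E, \qquad \sigma^2_X = \sigma^2_T + \sigma^2_E

\rho_{XX'} = \frac{\sigma^2_T}{\sigma^2_X}
\quad\Longrightarrow\quad
\sigma^2_E = \sigma^2_X \,\bigl(1 - \rho_{XX'}\bigr)
```

For a fixed observed-score variance, a reliability of .90 leaves 10% of that variance as error, whereas a reliability of .60 leaves 40%.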

Less is more

In light of variance control, researchers should avoid designing a complex experiment prematurely. The more complicated the design is, the more noise it carries. The American Psychological Association (1996) endorses the use of minimally sufficient designs and analytic strategies:
The wide array of quantitative techniques and the vast number of designs available to address research questions leave the researcher with the non-trivial task of matching analysis and design to the research question. Many forces (including reviewers of grants and papers, journal editors, and dissertation advisors) compel researchers to select increasingly complex ('state-of-the-art,' 'cutting edge,' etc.) analytic and design strategies. Sometimes such complex designs and analytic strategies are necessary to address research questions effectively; it is also true that simpler approaches can provide elegant answers to important questions. It is the recommendation of the (APA) task force that the principle of parsimony be applied to the selection of designs and analyses. The minimally sufficient design and analysis is typically to be preferred because:

it is often based on the fewest and least restrictive assumptions,

its use is less prone to errors of application, and errors are more easily recognized, and

its results are easier to communicate--to both the scientific and lay communities. This is not to say that new advances in both design and analysis are not needed, but simply that newer is not necessarily better and that more complex is not necessarily preferable.

Reading

The concept of "variance" is fundamental to understanding experimental design, measurement, and statistical analysis. It is not difficult to understand ANOVA, ANCOVA, and regression if one can conceptualize them in terms of variance. Kerlinger's (1986) book is a good start.
To go beyond the basics, one should contemplate the concept of "orthogonality," which is important to both ANOVA and regression. In ANOVA, when group sizes are balanced, the design is said to be orthogonal. In regression, when predictors are not inter-related, they are also said to be orthogonal. Experimental design can be conceptualized as model building. In this sense, relationships among variables are specified to form a model. "Non-orthogonal" variables are detrimental to forming a sound model. A Director program entitled "Collinearity: A question of support," which I developed with David Winograd, Sandra Andrews, Samuel DiGangi, and Angel Jannasch (1999), explains both "variance" and "orthogonality" in layman's terms. A Web version entitled "Multi-collinearity, orthogonality, variance inflation factor" (Yu, 2016), which carries more detail, is also available.
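The connection between orthogonality and the variance inflation factor (VIF) can be sketched for the simplest case of two predictors, where the VIF reduces to 1 / (1 - r^2) and r is the correlation between the predictors. The data and function names below are hypothetical and serve only as an illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)

# Orthogonal predictors, as in a balanced 2 x 2 design with -1/+1 coding.
vif_orthogonal = vif_two_predictors([-1, -1, 1, 1], [-1, 1, -1, 1])

# Inter-related predictors (hypothetical values).
vif_correlated = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])

print(f"VIF with orthogonal predictors: {vif_orthogonal:.2f}")
print(f"VIF with correlated predictors: {vif_correlated:.2f}")
```

A VIF of 1 means the variance of the coefficient estimate is not inflated at all; the more strongly the predictors are inter-related, the larger the VIF becomes.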

References

American Psychological Association. (1996). Task Force on Statistical Inference Initial Report. Retrieved from http://www.apa.org/science/tfsi.html
Barker, T. A., & Torgesen, J. K. (1995). An evaluation of computer assisted instruction in phonological awareness with below average readers. Journal of Educational Computing Research, 13, 89-103.
Freedman, B. (1987). Equipoise and the ethics of clinical research. New England Journal of Medicine, 317, 141-145.
Kerlinger, F. N. (1986). Foundations of behavioral research. New York: Holt, Rinehart and Winston.
Montgomery, D. C. (1997). Design and analysis of experiments. New York: Wiley.
Shin, J., Espin, C. A., Deno, S., & McConnell, S. (2004). Use of hierarchical linear modeling and curriculum-based measurement for assessing academic growth and instructional factors for students with learning difficulties. Asia Pacific Education Review, 5, 136-148.
Siegfried, T. (2010). Odds are, it's wrong: Science fails to face the shortcomings of statistics. Science News, 177(7). Retrieved from http://www.sciencenews.org/view/feature/id/57091/
Slavin, R. (2008). Perspectives on evidence-based research in education. Educational Researcher, 37(1), 5-14.
Tse, S., Zhu, S., Yu, C. H., Wong, P., & Tsang, S. (2015). An ecological analysis of secondary school students' drug use in Hong Kong: A case-control study. International Journal of Social Psychiatry, 10. DOI: 10.1177/0020764015589132. Retrieved from http://isp.sagepub.com/content/early/2015/06/08/0020764015589132.full.pdf?ijkey=LsS6bpT9BsYOcpY&keytype=finite
Yu, C. H. (2016). Multi-collinearity, variance inflation, and orthogonalization in regression. Retrieved from http://www.creative-wisdom.com/computer/sas/collinear.html
Yu, C. H., Winograd, D., Andrews, S., DiGangi, S., & Jannasch-Pennell, A. (1999). Visualizing collinearity, variance inflation factor, and orthogonalization in subject space. American Statistical Association 1988 Proceedings of the Section on Statistical Education, 119-124.
Zion, D. (1998). Ethical considerations of clinical trials to prevent vertical transmission of HIV in developing countries. Nature Medicine, 4(1), 11-12. doi: 10.1038/nm0198-011.
