Comparing groups to examine treatment effectiveness is a common
practice in research and evaluation. Parametric procedures such as t-tests and F-tests are widely
used for this purpose. However, on their own those procedures are not very informative
because the conclusion is nothing more than rejecting or failing to reject the
null hypothesis.
The APA Task Force on Statistical Inference (Wilkinson, 1996) endorsed the use of confidence intervals (CIs)
as a supplement to the conventional p value. By using CIs, the researcher can examine
group differences in terms of both means and variability. As the sample size
increases, the standard error of the mean decreases, and the CI gets narrower. Why should we
judge the quality of a CI by its narrowness? Take this scenario as a metaphor:
You ask me to guess your age, and I reply, "from 16 to 60." I am 95% confident that
your actual age falls within this range, but is it a useful estimate?
Probably not. If I say "from 18 to 21" instead, it is definitely a much better
answer.
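The link between sample size and CI width can be illustrated with a minimal sketch. It assumes a normal-approximation interval for a mean and a hypothetical standard deviation of 10 score points; the function name `ci_half_width` is made up for this illustration and is not part of JMP.

```python
import math
from statistics import NormalDist


def ci_half_width(sd, n, level=0.95):
    """Half-width of a normal-approximation CI for a mean."""
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / math.sqrt(n)


# Hypothetical standard deviation of 10 points; only n changes.
widths = {n: 2 * ci_half_width(10, n) for n in (10, 40, 160)}
for n, width in widths.items():
    print(f"n = {n:3d}: CI width = {width:.2f}")
```

Because the standard error is proportional to 1/sqrt(n), quadrupling the sample size halves the width of the interval.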
SAS/JMP provides a powerful tool called the diamond plot for visualizing CIs, and
the result is very easy to obtain. In JMP you don't even need to know the name of
the procedure. As long as you know what your dependent and independent variables
are, you can simply choose Fit Y by X from the Analyze menu, as
shown in the following:

JMP provides the user with a contextual menu system, so you will not be
overwhelmed by too many options. On the next screen, only the options that are
applicable to the data structure are available to you. At this stage, you can
select Quantiles to display the box plot and Means/Anova to
display the diamond plot.

The result is shown in the following figure. It condenses a lot of important information:

- Grand sample mean: It is represented by a horizontal black line.
- Group means: The horizontal line inside each diamond is the group mean.
- Confidence intervals: The diamond is the CI for each group. Because
the population parameter is unknown, there is always some uncertainty in
estimation. Thus, we need to bracket the estimate. Take photography as an
analogy. If the photographer is not sure whether the exposure is correct, he
would take at least one over-exposed photo (upper bound), one under-exposed
photo (lower bound), and one in the middle. In the JMP output, the top of the
diamond is the upper bound (best case scenario) while the bottom is the lower
bound (worst case scenario).
- Quantiles: In addition to the CI, JMP also provides the option of overlaying a box plot showing quantile information.
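The quantities summarized in the list above can be reproduced by hand, as in the following rough sketch. The scores are hypothetical and are not the data behind the figure. The sketch assumes that the interval for each group is built from a pooled within-group standard deviation, and it substitutes a normal quantile for the t quantile to stay within the standard library, so the bounds will be slightly narrower than what a t-based interval would give. The function name `diamond_summary` is invented for this illustration.

```python
import math
from statistics import NormalDist, mean

# Hypothetical exam scores; not the data shown in the figure.
scores = {
    "classroom": [62, 65, 58, 70, 66, 61],
    "online": [78, 82, 75, 80, 85, 79],
    "hybrid": [80, 77, 84, 81, 76, 83],
}


def diamond_summary(groups, level=0.95):
    """Return the grand mean and a (lower, mean, upper) triple per group."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    group_means = {name: mean(values) for name, values in groups.items()}
    # Pooled within-group variance (assumption: equal variances).
    sse = sum(
        (x - group_means[name]) ** 2
        for name, values in groups.items()
        for x in values
    )
    pooled_sd = math.sqrt(sse / (len(all_values) - len(groups)))
    # Normal quantile used as a stand-in for the t quantile (assumption).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    summary = {}
    for name, values in groups.items():
        half = z * pooled_sd / math.sqrt(len(values))
        m = group_means[name]
        summary[name] = (m - half, m, m + half)
    return grand_mean, summary


grand_mean, summary = diamond_summary(scores)
print(f"grand mean = {grand_mean:.2f}")
for name, (lower, m, upper) in summary.items():
    print(f"{name:9s} mean = {m:.2f}, CI = ({lower:.2f}, {upper:.2f})")
```

With these made-up numbers, the upper bound of the classroom group sits below the lower bounds of the other two groups, while the online and hybrid intervals overlap.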

In this hypothetical example, Professor Yu taught three classes in different
modes: conventional classroom, online class, and hybrid class. He wants to know
which method could yield better exam scores. It is obvious that the performance
gap between the classroom group and the two others is significant, because even the
upper bound of the classroom group is lower than the lower bounds of the other
two. However, it seems that the difference between the hybrid group and the
online group is not substantive at all because there is considerable overlap
between the two groups. If you need to report formal statistics, you can extract
the appropriate information below the graphic.

When I was a graduate student, I took a course in multiple
comparison procedures (MCPs) as a post hoc step after ANOVA. At most, the F test
of ANOVA can tell you whether at least one of the means differs from the
others. In order to test which pairwise difference is significant while controlling the
Type I error rate at the same time, different MCPs are needed. The course
required the learners to memorize the pros and cons of 10-15 tests, such as LSD,
Bonferroni, Ryan, Tukey, Duncan, Gabriel, etc. To tell you the truth, today I
have forgotten most of the information. The following is a screenshot of the MCPs offered by
SPSS. You can tell how confusing it is. In my opinion, the diamond plot is a
much quicker and easier way to compare groups.
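To see why the Type I error rate needs controlling in the first place, consider this minimal sketch for the three teaching modes. It assumes the pairwise tests are independent, which is only an approximation for real pairwise comparisons, and it uses the Bonferroni adjustment purely as the simplest example of a correction.

```python
from itertools import combinations

groups = ["classroom", "online", "hybrid"]
pairs = list(combinations(groups, 2))  # every pairwise comparison
alpha = 0.05
m = len(pairs)

# Chance of at least one false positive across m tests,
# assuming the tests are independent (an approximation).
familywise = 1 - (1 - alpha) ** m

# Bonferroni: run each test at alpha / m to cap the familywise rate.
bonferroni_alpha = alpha / m

print(f"{m} comparisons, familywise error rate = {familywise:.4f}")
print(f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```

With only three groups, the uncorrected familywise rate is already close to three times the nominal 0.05, and it grows quickly as more groups are added.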

However, Payton, Greenstone and Schenker (2003) warned
researchers that inferring significant mean differences from
non-overlapping CIs is a dangerous practice, because the error rate associated with this
comparison is quite large. The probability of overlap is a function of the
standard errors. As the standard errors become less homogeneous, the probability
of overlap decreases. Simulation results showed that when the standard errors
are approximately equal, using 83% or 84% intervals will give an
approximate alpha = 0.05 test, but using 95% confidence intervals, which is a
common practice, will give very conservative results: the actual Type I error
rate falls well below 0.05. Thus, researchers are
encouraged to use both CIs and hypothesis testing.
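The 83% to 84% figure can be checked with a short calculation. The sketch below is not the simulation from Payton et al.; it is a simplified derivation that assumes two independent, normally distributed estimates with exactly equal standard errors. Under that assumption, two intervals fail to overlap when the means differ by more than 2 × z × SE, whereas a z test rejects when they differ by more than z_test × sqrt(2) × SE. The function names are invented for this illustration.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def overlap_level_for_alpha(alpha=0.05):
    """Interval level at which 'no overlap' matches a two-sided z test
    of size alpha, assuming equal standard errors."""
    z_test = std_normal.inv_cdf(1 - alpha / 2)
    z_interval = z_test / math.sqrt(2)  # from 2 * z_interval = sqrt(2) * z_test
    return 2 * std_normal.cdf(z_interval) - 1


def alpha_for_overlap_level(level=0.95):
    """Type I error rate implied by declaring a difference only when
    two intervals of the given level do not overlap (equal SEs)."""
    z_interval = std_normal.inv_cdf(0.5 + level / 2)
    z_test = z_interval * math.sqrt(2)
    return 2 * (1 - std_normal.cdf(z_test))


print(f"interval level matching alpha = 0.05: {overlap_level_for_alpha():.3f}")
print(f"alpha implied by 95% intervals: {alpha_for_overlap_level():.4f}")
```

Under these assumptions the matching interval level comes out between 83% and 84%, and checking 95% intervals for overlap corresponds to a test with alpha well under 0.01, which is why that practice is described as very conservative.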

References

Payton, M. E., Greenstone, M. H., & Schenker, N. (2003).
Overlapping confidence intervals or standard error intervals: What do they mean
in terms of statistical significance? Journal of Insect Science, 3(34). Retrieved
April 21, 2008 from
http://insectscience.org/3.34

Wilkinson, L., & the Task Force on Statistical Inference. (1996).
Statistical methods in psychology journals: Guidelines and explanations.
Retrieved from
http://www.apa.org/science/leadership/bsa/statistical/tfsi-followup-report.pdf