Beyond Fisher
Chong-ho Yu, Ph.D., CNE, MCSE, CCNA
Last Updated: 2009
Bill Gates and R. A. Fisher
Due to the dominance of Microsoft Windows and Intel Pentium in the computer industry, for most users computing is synonymous with "Wintel." The same pattern can be found in the field of research methodology. Today, when most researchers talk about statistical analysis, they usually mean hypothesis testing, which is a fusion of two schools of thought: Fisher and Neyman-Pearson (Lehmann, 1993). Of the two schools, Fisher's (1932, 1935) is arguably the dominant one. The Fisherian legacy is also known as statistical significance testing, null hypothesis significance testing, hypothesis testing, traditional inferential statistics, classical procedures, and parametric tests.
In the future, Microsoft may announce: "Today I announce that Microsoft has taken over SAS Institute, SPSS Inc, Datadesk, Stata, SyStat...etc. From now on only hypothesis testing is available in all statistical software packages. Resistance is futile."
Limitations of the Fisherian approach
Just as Windows is not suitable for every application, hypothesis testing has its limitations. There are at least seven limitations of the Fisherian approach:
- Does not address treatment effectiveness: Statistical hypothesis testing does not address the issue of treatment effectiveness. Very often researchers interpret a significant p-value as evidence of treatment effectiveness. In fact, hypothesis testing does not evaluate how likely the hypothesis is to be right given the data, P(H|D). On the contrary, it assumes that the null hypothesis is right and examines how likely the data are to occur in the long run given that the null is true, P(D|H) (McClure & Suen, 1994; Cohen, 1994; Loftus, 1996).
- Easy to reject the null hypothesis: By definition, a null hypothesis denotes no difference (zero effect). With a large enough sample, almost any nonzero difference, however trivial, will lead to rejection. Loftus (1996) quipped that "rejecting a typical null hypothesis is like rejecting the proposition that the moon is made of green cheese. The appropriate response would be 'Well, yes, okay ... but so what?'"
- Mismatch of null and alternate hypotheses: Statistical hypothesis testing is a fusion of two schools: Fisher and Neyman/Pearson. In the former the null hypothesis is proposed and the alpha level is associated with the null. In the latter the alternate hypothesis is proposed; power and beta are associated with the alternate.
- Provides no information about the population: The findings that result from hypothesis testing provide no information about the form of the underlying pattern of the population. Very often researchers conduct hypothesis testing either by assuming a particular population distribution (normality) or without asking this question at all. Indeed, in the parametric test framework the population to which inferences are made is infinite and unknown. Under this framework, what researchers actually do is make inferences from the population to the sample, not from the sample to the population. This issue has been discussed in the article regarding misconceived relationships among sample, sampling distributions and population.
- Probability as a relative frequency in the long run: Hypothesis testing is based upon probability, which is defined as a relative frequency in the long run. However, it is problematic to apply this frequentist view of probability to a single event or a new event (Carver, 1978). For example, what is the probability that the universe was formed by a big bang? What is the probability that a newly invented super-Java enabled Web-based instruction is effective in teaching computer sciences?
- Theoretical distribution: Decisions in hypothesis testing rely on theoretical sampling distributions, which can never be empirically verified. This issue has been discussed in the article regarding mathematical reality.
- Yields a dichotomous answer based on an arbitrary cutoff: The conclusion yielded by a significance test is dichotomous: either the effect is significant or it is not. The decision is based on conventional alpha levels, such as .05. However, there is nothing "magical" about .05. In fact, using a fixed cutoff, as many researchers commonly do, goes against Fisher's advice. Fisher (1956) stated, "No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas" (p.41).
Unfortunately, today hypothesis testing has become mechanical rather than a matter of judgment.
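The gap between P(D|H) and P(H|D) noted in the first limitation can be made concrete with Bayes' rule. The following is a minimal Python sketch rather than a definitive treatment: the function name posterior_null and the prior and power values are assumptions chosen purely for illustration, and "D" is simplified to the event "the test came out significant."

```python
def posterior_null(prior_null, alpha, power):
    """Return P(H0 | significant result) via Bayes' rule.

    prior_null: assumed prior probability that the null is true, P(H0)
    alpha:      P(significant | H0), the Type I error rate
    power:      P(significant | H1), the assumed power of the test
    """
    # Total probability of observing a significant result.
    p_significant = alpha * prior_null + power * (1.0 - prior_null)
    return alpha * prior_null / p_significant


# Hypothetical scenarios: same alpha and power, different priors.
for prior in (0.5, 0.9):
    post = posterior_null(prior, alpha=0.05, power=0.5)
    print(f"P(H0) = {prior:.1f} -> P(H0 | significant) = {post:.3f}")
```

Under these assumed inputs, a significant result at alpha = .05 leaves the null with a posterior probability of about .09 when the prior is .5, and about .47 when the prior is .9. Neither figure equals .05, which is the point: the alpha level describes the data given the null, not the null given the data.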
Alternatives
Methodologists have long questioned over-dependence on and misuse of the Fisherian statistical approach. In the summer of 1993, the Journal of Experimental Education devoted an entire issue to the theme "Statistical significance testing in contemporary practice: Some proposed alternatives." Several options such as effect size, cross-validation and resampling were proposed (e.g. Thompson, 1993; Snyder & Lawson, 1993). In 1996 a task force formed by the Board of Scientific Affairs of the American Psychological Association (APA) also suggested that researchers should apply a wide variety of statistical techniques, such as Exploratory Data Analysis and Bayesian inference. In response to frequent inappropriate use of hypothesis testing, in 1999 Dr. Leland Wilkinson led the Task Force on Statistical Inference, formed by the same APA board, to address the controversy. In conclusion, the Task Force did not recommend abandoning hypothesis testing, but suggested using it with caution; researchers were also urged to use more options, such as confidence intervals and effect sizes.
After the release of the APA Task Force report, the use of effect size in American Speech-Language-Hearing Association journals from 1999 to 2003 was reviewed (Meline & Wang, 2004). It was found that reporting of effect size in quantitative studies increased from 5 reports in 1990-1994 to 120 reports in 1999-2003. However, effect size was reported less than 30% of the time when inferential statistics were used, and only half of those reports included an interpretation of the effect size.
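To illustrate why the Task Force urged reporting effect sizes and confidence intervals alongside p-values, here is a rough Python sketch using only the standard library. The data are simulated, the group means and sample sizes are hypothetical, and the helper names (cohens_d, mean_diff_ci, two_sided_p) are illustrative; the interval and p-value rely on a large-sample normal approximation rather than the t distribution.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def _std_error(a, b):
    """Standard error of the difference between two independent means."""
    return math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mean_diff_ci(a, b, level=0.95):
    """Normal-approximation confidence interval for the mean difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = mean(a) - mean(b)
    se = _std_error(a, b)
    return diff - z * se, diff + z * se


def two_sided_p(a, b):
    """Two-sided p-value from a z-test on the mean difference (large n only)."""
    z = (mean(a) - mean(b)) / _std_error(a, b)
    return 2 * (1 - NormalDist().cdf(abs(z)))


rng = random.Random(0)  # fixed seed so the simulated data are reproducible
# Hypothetical data: a true difference of 0.1 SD with 20,000 cases per group.
treatment = [rng.gauss(0.1, 1.0) for _ in range(20000)]
control = [rng.gauss(0.0, 1.0) for _ in range(20000)]

d = cohens_d(treatment, control)
low, high = mean_diff_ci(treatment, control)
p = two_sided_p(treatment, control)
print(f"d = {d:.3f}, 95% CI for difference = ({low:.3f}, {high:.3f}), p = {p:.3g}")
```

With samples this large, the p-value should fall far below .05 even though the standardized difference stays near 0.1, below the 0.2 that is conventionally labeled a small effect. The effect size and the interval convey what the p-value alone hides.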
Narrowness of graduate programs
Many researchers tend to follow the convention rather than experimenting with alternate strategies. This conservative behavior may be owing to the lack of comprehensive training in graduate schools. Aiken, West, Sechrest, and Reno (1990) published an article surveying the curriculum of quantitative methods in graduate psychology programs. It was found that new and important research methodologies such as structural equation modeling, confirmatory factor analysis, exploratory data analysis, and meta-analysis were not taught in the majority of those programs.
Behrens (1996) held a similar view. To rectify the situation that graduate programs overly stress hypothesis testing, Behrens suggested that graduate programs should integrate instruction in confirmatory statistics with alternative data analytic methods such as meta-analysis, Bayesian analysis, interval estimation approaches, and hybrid combinations.
Schield (1998) likewise argued that traditional statistical training "covers only half the topic needed for statistical literacy. In addition to descriptive statistics and inferential statistics, statistical literacy should include Bayesian statistics and most of all-evidential statistics" (p.2).
Aiken, West, and Millsap's (2008) study, a replication and extension of Aiken et al.'s (1990) study, examined whether innovations in quantitative methodology had diffused into the training of PhDs in psychology. On one hand, exciting advancements had occurred in the domains of statistical analysis (e.g. multi-level modeling), measurement (e.g. item response theory), and research design (e.g. propensity scores in observational studies); on the other hand, many psychology programs still maintained the traditional curriculum. For example, slightly fewer than half of the departments that responded to the survey offered a full course on structural equation modeling. Coverage of specialized statistical methods, such as multilevel modeling, was even sparser. Across Ph.D. programs in psychology, the measurement requirement occupied a median of only 4.5 weeks. Even worse, the research design curriculum had largely stagnated.
Up-hill battle
Indeed, the crisis lies not only in the lack of knowledge of alternate approaches, but also in the poverty of conventional research skills. Inappropriate use of statistics among researchers across different disciplines has been well documented (Carver, 1978; Gore et al., 1977; Gibbons & Freund, 1986; Glass, Peckham, & Sanders, 1972; Maxwell & Delaney, 1990; Morrison & Henkel, 1970; Pedhazur & Schmelkin, 1991; Thompson, 1994; Wainer, 1989). The purpose of alternate methodologies is to compensate for the limitations of the traditional approach. However, when researchers do not understand the assumptions and limitations of the Fisherian school, how can they look for other proper tools to remedy the problem?
Even if alternate methods were widely adopted, it is likely that the implementation of alternate methods would be as careless as in the conventional approach. There is no guarantee that the quality of research would be improved by introducing new methods alone. If the attitude toward research methodology does not improve, I am afraid that papers like "Common methodology mistakes in exploratory data analysis," "A critical assessment to misinterpretation of resampling," and "The case against Bayesian inference" would be as popular as papers against statistical testing today.
Many people are aware of the narrowness of statistical analysis and attempt to counter it. But it is an up-hill battle. For example, two professors at Michigan State University once reformed the graduate statistics courses by introducing alternate statistics such as confidence intervals and effect size. The reform, however, was protested by other faculty members, who worried that graduate students receiving this unorthodox training might not be able to get their research published (Schmidt, 1996). The norm set by refereed journals is one of many reasons why many researchers, faculty and graduate students blindly follow statistical testing. I have heard many other reasons:
Tool mastering
"Educational and psychological researchers are not mathematics majors. We don't have to know the detail of statistics." Medical students are not chemistry and biology majors. Do they need to learn the deep knowledge of chemistry and biology? Chemistry and biology for physicians and statistics for social scientists are the means rather than the end. But without the proper tool, it is impossible for them to do a good job. The requirement of "tool mastering " is very common in humanities. For instance, once a Sinologist told me that to study the history of Yuan dynasty of China, one must achieve a high proficiency in the Chinese, Mongolian, and Persian languages in order to read first hand documents. A philosopher specializing in Buddhism also needs to learn the Chinese, Japanese, Hindu, as well as several other languages before conducting any meaningful research. Unfortunately, the concept of "tool mastering" is absent from the mentality of many social scientists.
Driving a car and building a car: Applications and research
This argument is very similar to the preceding one: "Psychologists and educators should focus on the subject matter of their own domain rather than spending time on irrelevant matters. We are concerned with applications rather than theoretical abstraction." What does it take to drive a car? Very little: a few days of training is sufficient to make a qualified driver. But what does it take to build a car? Several years of training in mechanical and electrical engineering is the minimum. When automobile engineers tell consumers that the vehicle they designed is safe at all speeds, they are responsible for other people's lives! By the same token, when psychological researchers instruct practitioners and policy-makers to follow their theories, they are accountable, too. They must be sure that the application is firmly founded on a sound theoretical framework. When psychologists claim that children raised in non-traditional families are no more likely to develop maladjusted behaviors, or that legalizing drugs does not lead to long-term harmful effects on society, I wonder whether these conclusions result from appropriate implementation of research methodologies.
By synthesizing the views of Kerlinger and Pedhazur, Daniel (1997) debunked the myth of insignificance of statistical knowledge in research:
Because all statistical methods have certain inherent strengths and limitations and because each method implies certain assumptions about the data being analyzed, the use of these methods to some degree influences both the nature and selection of research problems (Kerlinger, 1969, 1986; Kerlinger & Pedhazur, 1973). Therefore, the claim that statistical knowledge is unnecessary to good research practice is unfounded; in fact, Kerlinger noted that "it is almost impossible to do outstanding research, though one can do acceptable research, without being something of a methodologist" (p.622). Kerlinger and Pedhazur (1973) went so far as to say that the researcher who lacks a basic knowledge of data analytic strategies is a "scientific cripple" (p.369)
Qualitative researchers may argue that the preceding view is narrow-minded. In my view, statistics is a necessary, but not a sufficient condition of good research.
What shall we do?
To stem the flood of poor research, students in this class are asked to do the following:
- To know the difference between applying research findings and producing research results.
- To develop a sense of accountability to your fellow citizens.
- To understand that proper means are essential to reach the end. There is no short-cut.
- To learn the traditional Fisherian methodology as well as alternative techniques such as resampling, Bayesian inference, and exploratory data analysis. In the future, learners should understand merits and shortcomings of each school and apply the appropriate technique according to different research problems and data structures.
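As one concrete instance of the resampling techniques mentioned above, the sketch below computes a percentile bootstrap confidence interval in Python. It is a minimal illustration under stated assumptions: the sample values are made up, the function name bootstrap_ci and its defaults (2,000 resamples, seed 0) are arbitrary choices, and the simple percentile method is only one of several bootstrap intervals.

```python
import random
from statistics import median


def bootstrap_ci(data, stat=median, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical, skewed sample (e.g. task completion times in seconds).
sample = [12, 15, 9, 22, 31, 14, 18, 95, 11, 16, 13, 27, 19, 10, 17]
low, high = bootstrap_ci(sample)
print(f"median = {median(sample)}, 95% bootstrap CI = ({low}, {high})")
```

The interval here comes from the empirical distribution of the resampled medians, so no theoretical sampling distribution has to be assumed. The price is that the result depends on how well the observed sample represents the population.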
References
- Aiken, L. S., West, S. G., Sechrest, L., & Reno, P. R. (1990). Graduate training in statistics, methodology, and measurement in psychology: A survey of Ph.D. programs in North America. American Psychologist, 45, 721-734.
- Aiken, L., West, S., & Millsap, R. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. American Psychologist, 63, 32-50.
- Behrens, J. T. (1996). Principles and procedures of exploratory data analysis. Psychological Methods, 2, 131-160.
- Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
- Cohen, J. (1994). The earth is round (P < .05). American Psychologist, 49, 997-1003.
- Fisher, R. A. (1932). Statistical methods for research workers (4th ed.). Edinburgh, Scotland: Oliver & Boyd.
- Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
- Fisher, R. A. (1956). Statistical methods and scientific inference. New York, NY: Hafner.
- Glass, G. V, Peckham, P. D., & Sanders, J. R. (1972). Consequences of failure to meet assumptions underlying the fixed analysis of variance and covariance. Review of Educational Research, 42, 237-288.
- Gore, S. M., Jones, I. G., & Rytter, E. F. (1977). Misuse of statistical methods: A critical assessment of articles in BMJ from January to March, 1976. British Medical Journal, 1, 85-87.
- Kerlinger, F. N. (1969). Research in education. In R. Ebel, V. Noll, & R. Bauer (Eds.), Encyclopedia of educational research (4th ed., pp. 1127-1134). New York: Macmillan.
- Kerlinger, F. N. (1986). Foundations of behavioral research (3rd ed.). Fort Worth, TX: Holt, Rinehart and Winston.
- Kerlinger, F. N., & Pedhazur, E. J. (1973). Multiple regression in behavioral research. New York: Holt, Rinehart and Winston.
- Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88, 1242-1249.
- Loftus, G. R. (1996). Psychology will be a much better science when we change the way we analyze data. Current Directions in Psychological Science, 5, 161-170.
- Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA: Wadsworth Publishing Company.
- Meline, T., & Wang, B. (2004). Effect-size reporting practices in AJSLP and other ASHA journals, 1999-2003. American Journal of Speech-Language Pathology, 13, 202-207.
- McClure, J., & Suen, H. K. (1994). Interpretation of statistical significance testing: A matter of perspective. Topics in Early Childhood Special Education, 14, 88-102.
- Morrison, D. E., & Henkel, R. E. (1970). The significance test controversy--A reader. Chicago: Aldine.
- Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.
- Schield, M. (1998). Statistical literacy and evidential statistics. Paper presented at the Annual Meeting of the American Statistical Association. Dallas, TX. [On-line] Available: http://www.augsburg.edu/ppages/schield/.
- Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
- Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 350-360.
- Task Force on Statistical Inference. (1996). Initial report: Task Force on Statistical Inference. Retrieved May 1, 2009, from: http://www.apa.org/science/bsaweb-tfsi.html.
- Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
- Thompson, B. (1994, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Service No. ED 368 771)
- Wainer, H. (1989). Eelworms, bullet holes, and Geraldine Ferraro: Some problems with statistical adjustment and some solutions. Journal of Educational Statistics, 14, 121-140.
- Wilkinson, L., & Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.