 |
|
Reliability and
Validity
Chong
Ho Yu, Ph.D. |
Conventional views of reliability
(AERA et al., 1985)
-
Temporal
stability: the same form of a test on two or
more separate occasions to the same group of examinees
(Test-retest). On many occasions this approach is not practical
because repeated measurements are likely to change the examinees.
For example, the examinees will adapt the test format and
thus tend to score higher in later tests. Hence, careful
implementation of the test-retest approach is strongly
recommendation (Yu, 2005).
-
Form
equivalence: two different forms of test,
based on the same content, on one occasion to the same
examinees (Alternate form). After alternate forms have been
developed, it can be used for different examinees. It is very common
in high-stake examination for pre-empting cheating. A examinee who
took Form A earlier could not share the test items with another
student who might take Form B later, because the two forms have
different items.
-
Internal
consistency: the coefficient of test scores
obtained from a single test or survey (Cronbach
Alpha, KR20, Spilt-half). For instance, let's say
respondents are asked to rate statements in an attitude
survey about computer anxiety. One statement is "I feel
very negative about computers in general." Another
statement is "I enjoy using computers." People who
strongly agree with the first statement should be
strongly disagree with the second statement, and vice
versa. If the rating of both statements is high or low
among several respondents, the responses are said to be
inconsistent and patternless. The same principle can be
applied to a test. When no pattern is found in the
students' responses, probably the test is too difficult
and students just guess the answers randomly.

-
Reliability is a necessary but not
sufficient condition for validity. For instance,
if the needle of the scale is five pounds away from zero,
I always over-report my weight by five pounds. Is the
measurement consistent? Yes, but it is consistently
wrong! Is the measurement valid? No! (But if it
under-reports my weight by five pounds, I will consider
it a valid measurement)
-
Performance, portfolio, and responsive evaluations,
where the tasks vary substantially from student to
student and where multiple tasks may be evaluated simultaneously, are attacked for lacking reliability. One of the difficulties is that there are more than one source of measurement errors in performance assessment. For example, the reliability of writing skill test score is affected by the raters, the mode of discourse, and several other factors (Parkes, 2000).
-
Replications as unification: Users may
be confused by the diversity of reliability indices. Nevertheless, different
types of reliability measures share a common thread: What constitutes a
replication of a measurement procedure? (Brennan, 2001) Take internal
consistency as an example. This measure is used because it is convenient to
compute the reliability index based upon data collected from one occasion.
However, the ultimate inference should go beyond one single testing occasion to
others (Yu, 2005). In other words, any procedures for estimating reliability
should attempt to mirror a result based upon full-length replications.
Conventional views of validity
(Cronbach, 1971)
-
Face
validity : Face validity simply means the
validity at face value. As a check on face validity,
test/survey items are sent to teachers to obtain
suggestions for modification. Because of its vagueness
and subjectivity, psychometricians have abandoned this
concept for a long time. However, outside the measurement
arena, face validity has come back in another form. While
discussing the validity of a theory, Lacity and Jansen
(1994) defines validity as making common sense, and
being persuasive and seeming right to the reader. For
Polkinghorne (1988), validity of a theory refers to
results that have the appearance of truth or reality.
The internal structure of things may not concur with
the appearance. Many times professional knowledge is
counter-common sense. The criteria of validity in
research should go beyond "face," "appearance," and
"common sense."
-
Content
validity: draw an inference from test scores
to a large domain of items similar to those on the test.
Content validity is concerned with sample-population
representativeness. i.e. the knowledge and skills
covered by the test items should be representative to the
larger domain of knowledge and skills.
For example, computer literacy includes skills in
operating system, word processing, spreadsheet, database,
graphics, internet, and many others. However, it is
difficult, if not impossible, to administer a test
covering all aspects of computing. Therefore, only
several tasks are sampled from the population of computer
skills.
Content validity is usually established by content
experts. Take computer literacy as an example again.
A test of computer literacy should be written or reviewed
by computer science professors because it is assumed that
computer scientists should know what are important in his
discipline. By the first glance, this approach looks
similar to the validation process of face validity, but
yet there is a difference. In content validity, evidence
is obtained by looking for agreement in judgments by
judges. In short, face validity can be established by one
person but content validity should be checked by a
panel.
However, this approach has some drawbacks. Usually
experts tend to take their knowledge for granted and
forget how little other people know. It is not uncommon
that some tests written by content experts are extremely
difficult.
Second, very often content experts fail to identify the learning objectives of a subject. Take the following question in a philosophy test as an example:
This type of question tests the ability of memorizing historical facts, but not philosophizing. The content expert may argue that "historical facts" are important for a student to further understand philosophy. Let's change the subject to computer science and statistics.
Look at the following two questions:
It would be hard pressed for any computer scientist or statistician to accept that the above questions fulfill content validity. As a matter of fact, the memorization approach is a common practice among instructors.
Further, sampling knowledge from a larger domain of
knowledge involves subjective values. For example, a test
regarding art history may include many questions on oil
paintings, but less questions on watercolor paintings and
photography because of the perceived importance of oil
paintings in art history.
Content validity is sample-oriented rather than
sign-oriented. A behavior is viewed as a sample
when it is a subgroup of the same kind of behaviors. On
the other hand, a behavior is considered a sign when it
is an indictor or a proxy of a construct. (Goodenough,
1949). Construct validity and criterion validity, which
will be discussed later, are sign-oriented because both
of them indicate behaviors different from those of the
test.
-
Criterion:
draw an inference from test scores to performance. A high
score of a valid test indicates that the tester has met
the performance criteria.
Regression analysis can be applied to establish
criterion validity. An independent variable could be used
as a predictor variable and a dependent variable,
the criterion variable. The correlation
coefficient between them is called validity
coefficients.
For instance, scores of the driving test by simulation
is the predictor variable while scores of the road test
is the criterion variable. It is hypothesized that if the
tester passes the simulation test, he/she should meet the
criterion of being a safe driver. In other words, if the
simulation test scores could predict the road test scores
in a regression model, the simulation test is claimed to
have a high degree of criterion validity.
In short, criterion validity is about
prediction rather than explanation.
Predication is concerned with non-casual or mathematical
dependence where as explanation is pertaining to causal
or logical dependence. For example, one can predict the
weather based on the height of mercury inside a
thermometer. Thus, the height of mercury could satisfy
the criterion validity as a predictor. However, one
cannot explain why the weather changes by the change of
mercury height. Because of this limitation of criterion
validity, an evaluator has to conduct construct
validation.
-
Construct: draw
an inference form test scores to a psychological
construct. Because it is concerned with abstract and
theoretical construct, construct validity is also known
as theoretical construct.
According to Hunter and Schmidt (1990), construct
validity is a quantitative question rather than a
qualitative distinction such as "valid" or "invalid"; it
is a matter of degree. Construct validity can be measured
by the correlation between the intended
independent variable (construct) and the proxy
independent variable (indicator, sign) that is actually
used.
For example, an evaluator wants to study the
relationship between general cognitive ability and job
performance. However, the evaluator may not be able to
administer a cognitive test to every subject. In this
case, he can use a proxy variable such as "amount of
education" as an indirect indicator of cognitive ability.
After he administered a cognitive test to a portion of
all subjects and found a strong correlation between
general cognitive ability and amount of education, the
latter can be used to the larger group because its
construct validity is established.
Other authors (e.g. Angoff,1988; Cronbach & Quirk,
1976) argue that construct validity cannot be expressed
in a single coefficient; there is no mathematical index
of construct validity. Rather the nature of construct
validity is qualitative.
There are two types of indictors:
When an indictor is expressed in terms of multiple
items of an instrument, factor analysis is used
for construct validation.
Test bias is a major threat against construct
validity, and therefore test bias analyses should be
employed to examine the test items (Osterlind, 1983).
The presence of test bias definitely affects the
measurement of the psychological construct. However, the
absence of test bias does not guarantee that the test
possesses construct validity. In other words, the absence
of test bias is a necessary, but isn't a sufficient
condition.
-
Construct validation as unification:
The criterion and the content models tends to be empirical-oriented
while the construct model is inclined to be theoretical.
Nevertheless, all models of validity requires some form of
interpretation: What is the test measuring? Can it measure what it
intends to measure? In standard scientific inquiries, it is
important to formulate an interpretative (theoretical) framework
clearly and then to subject it to empirical challenges. In this
sense, theoretical construct validation is considered functioning as
a unified framework for validity (Kane, 2001).
A modified view of reliability
(Moss, 1994)
-
There can be validity without reliability if reliability is defined as consistency
among independent measures.
-
Reliability is an aspect of construct validity. As
assessment becomes less standardized, distinctions
between reliability and validity blur.
-
In many situations such as searching faculty
candidate and conferring graduate degree, committee
members are not trained to agree on a common set of
criteria and standards
-
Inconsistency in students' performance across tasks
does not invalidate the assessment. Rather it becomes an
empirical puzzle to be solved by searching for a more
comprehensive interpretation.
-
Initial disagreement (e.g., among students, teachers,
and parents in responsive evaluation) would not
invalidate the assessment. Rather it would provide an
impetus for dialog.
Li (2003) argued that the preceding view is incorrect:
-
The definition of reliability should be defined in terms of the classical test theory: the squared correlation between observed
and true scores or the proportion of true variance in obtained test scores.
-
Reliability is a unitless measure and thus it is already model-free
or standard-free.
-
It has been a tradition that multiple factors are introduced into a test to improve validity but decrease internal-consistent reliability.
An extended view of Moss's reliability
(Mislevy, 2004)
-
Being inspired by Moss, Mislevy went further to ask whether there can be reliability without reliability (indices).
-
By blending psychometrics and Hermeneutics, in which a holistic and integrative approach to understand the whole in light of its parts is used, Mislevy demanded psychometricians to think about what they intend to make inferences about.
-
In many cases we don't present just one argument; rather problem solving involves arguments or chains of reasoning with massive evidence.
-
Off-the-shelf inferential machinary (e.g. compute reliability indices) may fail if we quantify things or tasks that we don't know much about.
-
Probability-based reasoning to more complex assessments based upon cognitive psychology is needed.
A radical view of reliability
(Thompson et al, 2003)
-
Reliability is not a property of the test; rather it is attached
to the property of the data. Thus, psychomterics is datammetrics.
-
Tests are not reliable. It is important to explore reliability
in virtually all studies.
-
Reliability generalization, which can be used in a
meta-analysis application similar to validity generalization, should
be implemented to assess variance in measurement error across
studies.
An updated perspective of reliability
(Cronbach, 2004)
In a 2004's article, Lee Cronbach, the inventor of Cronbach Alpha as
a way of measuring reliability, reviewed the historical development of
Cronbach Alpha. He asserted, "I no longer regard the formula (of
Cronbach Alpha) as the most appropriate way to examine most data. Over
the years, my associates and I developed the complex generaliability
(G) theory" (p. 403). Discussion of the G theory is beyond the scope of
this document. Nevertheless, Cronbach did not object use of Cronbach
Alpha but he recommended that researchers should take the following
into consideration while employing this approach:
-
Standard error of measurement:
It is the most important piece of information to report regarding
the instrument, not a coefficient.
-
Independence of sampling
-
Heterogeneity of content
-
How the measurement will be used:
Decide whether future uses of the instrument are likely to be
exclusively for absolute decisions, for differential decisions, or
both.
-
Number of conditions for the test
A critical view of validity
(Pedhazur & Schmelkin,1991)
-
Content validity is not a type of validity at all
because validity refers to inferences made about scores,
not to an assessment of the content of an
instrument.
-
The very definition of a construct implies a domain
of content. There is no sharp distinction between test
content and test construct.
A modified view of validity
(Messick, 1995)
The conventional view (content, criterion, construct) is
fragmented and incomplete, especially because it fails to
take into account both evidence of the value implications of
score meaning as a basis for action and the social
consequences of score use.
Validity is not a property of the test or
assessment, but rather of the meaning of the test
scores.
-
Content: evidence
of content relevance, representativeness, and technical
quality
-
Substantive: theoretical
rationale
-
Structural: the
fidelity of the scoring structure
-
Generalizability: generalization
to the population and across populations
-
External: applications
to multitrait-multimethod comparison
-
Consequential: bias,
fairness, and justice; the social consequence of the
assessment to the society
Critics argued that consequences should not be a component of validity because test developers should not be held responsible for the consequences of misuse; accountability should lie with the misuser. Messick (1998) counter-argued that social consequences of score interpretation include the value implications of the construct label, which may or may not commensurate with the construct's trait implications and need to be addressed in appraising score meaning. While test developers should not be accountable to misuse of tests, they should still pay attention to the unanticipated consequences of legitimate score interpretation.
A different view of reliability
and validity (Salvucci, Walter, Conley, Fink, & Saba
(1997)
Some scholars argue that the traditional view that
"reliability is a necessary but not a sufficient condition
of validity" is incorrect. This school of thought
conceptualizes reliability as invariance and validity
as unbiasedness. A sample statistic may have an
expected value over samples equal to the population
parameter (unbiasedness), but have very high variance from a
small sample size. Conversely, a sample statistic can have
very low sampling variance but have an expected value far
departed from the population parameter (high bias). In this
view, a measure can be unreliable (high variance) but still
valid (unbiased).
|

|

|
|
Population parameter (Red line) = Sample statistic (Yellow line) --> unbiased
High variance (Green line) Unreliable but valid |
Population parameter (Red line) <> Sample statistic (Yellow line) --> Biased low variance (Green line) Invalid but reliable |
Caution and advice
There is a common misconception that if someone adopts a
validated instrument, he/she does not need to check the
reliability and validity with his/her own data. Imagine
this: When I buy a drug that has been approved by FDA and my
friend asks me whether it heals me, I tell him, "I am taking
a drug approved by FDA and therefore I don't need to know
whether it works for me or not!" A responsible evaluator
should still check the instrument's reliability and validity
with his/her own subjects and make any modifications if
necessary.
Low reliability is less detrimental to the performance
pretest. In the pretest where subjects are not exposed to
the treatment and thus are unfamiliar with the subject
matter, a low reliability caused by random guessing is
expected. One easy way to overcome this problem is to
include "I don't know" in multiple choices. In an
experimental settings where students' responses would not
affect their final grades, the experimenter should
explicitly instruct students to choose "I don't know"
instead of making a guess if they really don't know the
answer. Low reliability is a signal of high measurement
error, which reflects a gap between what students actually
know and what scores they receive. The choice "I don't know"
can help in closing this gap.
Last Updated: 2008
References
American Educational Research Association, American
Psychological Association, & National Council on
Measurement in Education. (1985). Standards for
educational and psychological testing. Washington, DC:
Authors.
Angoff, W. H. (1988). Validity: An evolving concept. In
H. Wainer & H. I. Braun (Eds.), Test validity.
Hillsdale, NJ: Lawrence Erlbaum.
Brennan, R. (2001). An essay on the history and future of reliability from the perspective of replications.
Journal of Educational Measurement, 38, 295-317.
Cronbach, L. J. (1971). Test validation. In R. L.
Thorndike (Ed.). Educational Measurement (2nd Ed.).
Washington, D. C.: American Council on Education.
Cronbach, L. J. (2004). My current thoughts on Coefficient Alpha and
successor procedures. Educational and Psychological Measurement, 64,
391-418.
Cronbach, L. J. & Quirk, T. J. (1976). Test validity.
In International Encyclopedia of Education. New York:
McGraw-Hill.
Goodenough, F. L. (1949). Mental testing: Its history,
principles, and applications. New York: Rinehart.
Hunter, J. E.; & Schmidt, F. L. (1990). Methods of
meta-analysis: Correcting error and bias in research
findings. Newsbury Park: Sage Publications.
Kane, M. (2001). Current concerns in validity theory. Journal of educational Measurement, 38, 319-342.
Lacity, M.; & Jansen, M. A. (1994). Understanding
qualitative data: A framework of text analysis methods.
Journal of Management Information System, 11,
137-160.
Li, H. (2003). The resolution of some paradoxes related to
reliability and validity. Journal of Educational and Behavioral
Statistics, 28, 89-95.
Messick, S. (1995). Validity of psychological assessment:
Validation of inferences from persons' responses and
performance as scientific inquiry into scoring meaning.
American Psychologist, 9, 741-749.
Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45, 35-44.
Mislevy, R. (2004). Can there be reliability without reliability? Journal of Educational and Behavioral Statistics, 29, 241-244.
Moss, P. A. (1994). Can there be validity without
reliability? Educational Researcher, 23, 5-12.
Osterlind, S. J. (1983). Test item bias. Newbury
Park: Sage Publications.
Parkes, J. (2000). The relationship between the reliability and cost of performance assessments. Education Policy Analysis Archives, 8. [On-line] Available URL: http://epaa.asu.edu/epaa/v8n16/
Pedhazur, E. J.; & Schmelkin, L. P. (1991).
Measurement, design, and analysis: An integrated
approach. Hillsdale, NJ: Lawrence Erlbaum Associates,
Publishers.
Polkinghorne, D. E. (1988). Narrative knowing and the
human sciences. Albany: State University of New York
Press.
Salvucci, S.; Walter, E., Conley, V; Fink, S; & Saba,
M. (1997). Measurement error studies at the National
Center for Education Statistics. Washington D. C.: U. S.
Department of Education
Thompson, B. (Ed.) (2003). Score reliability: Contemporary
thinking on reliability issues. Thousand Oaks: Sage.
Yu, C. H. (2005). Test-retest reliability. In K. Kempf-Leonard (Ed.). Encyclopedia of Social Measurement, Vol. 3 (pp. 777-784). San
Diego, CA: Academic Press.
Questions for
discussion
Pick one of the following cases and determine whether the
test or the assessment is valid. Apply the concepts of
reliability and validity to the situation. These cases may
be remote to this cultural context. You may use your own
example.
-
In ancient China, candidates for government officials
had to take the examination regarding literature and
moral philosophy, rather than public administration.
-
Before July 1, 1997 when Hong Kong was a British
colony, Hong Kong doctors, including specialists, who
graduated from non-Common Wealth medical schools had to
take a general medical examination covering all general
areas in order to be certified.
Navigation
Contact
|