THE PURPOSE OF ITEM ANALYSIS
There must be a match between what is taught and what is
assessed. However, there must also be an effort to test for more complex levels
of understanding, with care taken to avoid over-sampling items that assess only
basic levels of knowledge. Tests that are too difficult (with an insufficient
floor) tend to frustrate students and deflate scores, whereas tests that are too
easy (with an insufficient ceiling) erode motivation and inflate scores. Tests
can be improved by
maintaining and developing a pool of valid items from which future tests can be
drawn and that cover a reasonable span of difficulty levels.
Item analysis helps improve test items and identify unfair or
biased items. Results should be used to refine test item wording. Closer
examination of items also reveals which questions were most difficult, perhaps
indicating a concept that needs to be taught more thoroughly. If a particular
distracter (that is, an incorrect answer choice) is
the most often chosen answer, and especially if that distracter positively
correlates with a high total score, the item must be examined more closely for
correctness. This situation also provides an opportunity to identify and
examine common misconceptions among students about a particular concept.
In general, once test items have been created, the value of
these items can be systematically assessed using several methods representative
of item analysis: a) a test item's level of difficulty, b) an item's capacity
to discriminate, and c) the item characteristic curve. Difficulty is assessed by
examining the proportion of test takers who answer an item correctly.
Discrimination is examined by comparing performance on a particular item with
the total test score. Finally, the item characteristic curve plots the
likelihood of answering an item correctly against the overall level of success
on the test.
ITEM DIFFICULTY
In test construction, item difficulty is determined by the
number of people who answer a particular test item correctly. For example, if
the first question on a test was answered correctly by 76% of the class, then
the difficulty level (p, or percentage passing) for that question is p = .76. If
the second question on a test was answered correctly by only 48% of the class,
then the difficulty level for that question is p = .48. The higher the percentage of
people who answer correctly, the easier the item, so that a difficulty level of
.48 indicates that question two was more difficult than question one, which had
a difficulty level of .76.
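The p values above can be computed directly from scored responses. A minimal sketch in Python, using hypothetical 0/1 item scores rather than the class data in the example:

```python
def item_difficulty(item_scores):
    """Item difficulty p: the proportion of test takers answering correctly."""
    return sum(item_scores) / len(item_scores)

# Hypothetical scored responses for two items (1 = correct, 0 = incorrect),
# from a class of ten students.
item_1 = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]   # 8 of 10 correct
item_2 = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]   # 5 of 10 correct

print(item_difficulty(item_1))  # 0.8 -> the easier item
print(item_difficulty(item_2))  # 0.5 -> the harder item
```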
Many educators find themselves wondering how difficult a good
test item should be. Several factors must be considered to determine an
appropriate difficulty level. The first task of any test maker
should be to determine the probability of answering an item correctly by chance
alone, also referred to as guessing or luck. For example, a true-false item,
because it has only two choices, could be answered correctly by chance half of
the time. Therefore, a true-false item with a demonstrated difficulty level of
only p = .50 would not be a good test item
because that level of success could be achieved through guessing alone and
would not be an actual indication of knowledge or ability level. Similarly, a
multiple-choice item with five alternatives could be answered correctly by chance
20% of the time. Therefore, an item difficulty greater than .20 would be
necessary in order to discriminate between respondents' ability to guess
correctly and respondents' level of knowledge. Desirable difficulty levels
usually can be estimated as halfway between 100 percent and the percentage of
success expected by guessing. So, the desirable difficulty level for a
true-false item, for example, should be around p = .75, which is halfway between 100%
and 50% correct.
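That rule of thumb amounts to a simple formula: desirable p is roughly chance + (1 - chance) / 2. A minimal sketch (the function names are illustrative, not taken from the sources cited):

```python
def chance_level(n_choices):
    """Probability of answering correctly by guessing alone."""
    return 1.0 / n_choices

def desirable_difficulty(n_choices):
    """Rule of thumb above: halfway between the chance level and 100%."""
    chance = chance_level(n_choices)
    return chance + (1.0 - chance) / 2

print(desirable_difficulty(2))  # true-false item            -> 0.75
print(desirable_difficulty(5))  # five-option multiple choice -> 0.6
```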
In
most instances, it is desirable for a test to contain items of various
difficulty levels in order to distinguish between students who are not prepared
at all, students who are fairly prepared, and students who are well prepared.
In other words, educators do not want the same level of success for those
students who did not study as for those who studied a fair amount, or for those
who studied a fair amount and those who studied exceptionally hard. Therefore,
it is necessary for a test to be composed of items of varying levels of difficulty.
As a general rule for norm-referenced tests, items in the difficulty range of
.30 to .70 yield important differences between individuals' level of knowledge,
ability, and preparedness. There are a few exceptions to this, however, with
regard to the purpose of the test and the characteristics of the test takers.
For instance, if the test is to help determine entrance into graduate school,
the items should be more difficult to be able to make finer distinctions
between test takers. For a criterion-referenced test, most of the item
difficulties should be clustered around the criterion cut-off score or higher.
For example, if a passing score is 70%, the vast majority of items should have
percentage passing values of p = .60 or higher, with a number of items in the
p > .90 range to enhance motivation
and test for mastery of certain essential concepts.
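That guideline can be checked mechanically against a test's item difficulties. A small illustration with hypothetical p values (the .60 and .90 thresholds are the ones given above, with a 70% cut-off assumed):

```python
# Hypothetical item difficulties for a criterion-referenced test with a 70% cut-off.
p_values = [0.55, 0.72, 0.81, 0.93, 0.66, 0.95, 0.78]

too_hard = [p for p in p_values if p < 0.60]   # below the guideline above
mastery  = [p for p in p_values if p > 0.90]   # easy items testing essential concepts

print("Possibly too hard for this test:", too_hard)  # [0.55]
print("Mastery-level items:", mastery)                # [0.93, 0.95]
```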
DISCRIMINATION INDEX
According
to Wilson (2005), item difficulty is the most essential component of item
analysis. However, it is not the only way to evaluate test items.
Discrimination goes beyond determining the proportion of people who answer
correctly and looks more specifically at who answers correctly. In other words,
item discrimination determines whether those who did well on the entire test
did well on a particular item. An item should in fact be able to discriminate
between upper and lower scoring groups. Membership in these groups is usually
determined by total test score, and it is expected that those scoring higher on
the overall test will also be more likely to endorse the correct response on a
particular item. Sometimes an item discriminates negatively; that is, a larger
proportion of the lower scoring group selects the correct response than of the
higher scoring group. Such an item should
be revised or discarded.
One way to
determine an item's power to discriminate is to compare those who have done
very well with those who have done very poorly, known as the extreme group
method. First, identify the students who scored in the top one-third as well as
those in the bottom one-third of the class. Next, calculate the proportion of
each group that answered a particular test item correctly (i.e., percentage passing
for the high and low groups on each item). Finally, subtract the p of the
bottom performing group from the p for the top performing group to yield an
item discrimination index (D). Item discriminations of D = .50 or higher are
considered excellent. D = 0 means the item has no discrimination ability, while
D = 1.00 means the item has perfect discrimination ability.
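Those three steps translate directly into code. A minimal sketch of the extreme-group method, assuming parallel lists of 0/1 item scores and total test scores (the one-third split follows the description above):

```python
def discrimination_index(item_scores, total_scores):
    """Extreme-group discrimination index D for one item.

    item_scores:  0/1 scores on the item, one per student.
    total_scores: total test scores, one per student (same order).
    D = p(top third by total score) - p(bottom third by total score).
    """
    ranked = sorted(zip(item_scores, total_scores), key=lambda pair: pair[1])
    n = len(ranked) // 3
    bottom, top = ranked[:n], ranked[-n:]
    p_bottom = sum(item for item, _ in bottom) / n
    p_top = sum(item for item, _ in top) / n
    return p_top - p_bottom
```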
In Figure 1, it can be seen that Item 1 discriminates well with
those in the top performing group obtaining the correct response far more often
(p = .92) than those in the low performing group (p = .40), thus resulting in an
index of .52 (i.e., .92 - .40 = .52). Next, Item 2 does not discriminate well,
with an index of only .04, meaning this particular item was not useful in
distinguishing between the high and low scoring individuals. Finally, Item 3 is
in need of revision or discarding, as it
discriminates negatively, meaning low performing group members actually
obtained the correct keyed answer more often than high performing group
members.
Another way to determine the discriminability of an item is to compute the
correlation between performance on the item and performance on the test as a whole,
or the tendency of students selecting the correct answer to have high overall
scores. This coefficient is reported as the item discrimination coefficient, or
the point-biserial correlation between item score (usually scored right or
wrong) and total test score. This coefficient should be positive, indicating
that students answering correctly tend to have higher overall scores or that
students answering incorrectly tend to have lower overall scores. Also, the
higher the magnitude, the better the item discriminates. The point-biserial
correlation can be computed with procedures outlined in Figure 2.
In Figure
2, the point-biserial correlation between item score and total score is
evaluated similarly to the extreme group discrimination index. If the resulting
value is negative or low, the item should be revised or discarded. The closer
the value is to 1.0, the stronger the item's discrimination power; the closer
the value is to 0, the weaker the power. Items that are very easy and answered
correctly by the majority of
respondents will have poor point-biserial correlations.
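Figure 2 is not reproduced here, but the standard point-biserial formula can be sketched directly: r_pb = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean total scores of those answering the item right and wrong, s is the standard deviation of all total scores, p is the item difficulty, and q = 1 - p. A minimal sketch, assuming parallel lists as before:

```python
from statistics import mean, pstdev

def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item score and total test score."""
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    p = len(right) / len(total_scores)       # item difficulty
    q = 1 - p
    s = pstdev(total_scores)                 # population SD of total scores
    return (mean(right) - mean(wrong)) / s * (p * q) ** 0.5
```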
ITEM CHARACTERISTIC CURVE
A third
parameter used to conduct item analysis is known as the item characteristic
curve (ICC). This is a graphical depiction of the characteristics of a
particular item or, when curves are taken collectively, of the entire test. In
the item characteristic curve, the total test score is
represented on the horizontal axis and the proportion of test takers passing
the item within that range of test scores is scaled along the vertical axis.
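The points of such a curve can be computed by binning test takers on total score and, within each bin, finding the proportion who passed the item. A minimal sketch (the five-band split is an arbitrary choice for illustration):

```python
def empirical_icc(item_scores, total_scores, n_bands=5):
    """Points of an empirical item characteristic curve.

    Groups test takers into bands of total test score and returns
    (band midpoint, proportion passing the item) pairs.
    """
    lo, hi = min(total_scores), max(total_scores)
    width = (hi - lo) / n_bands or 1
    points = []
    for b in range(n_bands):
        band_lo = lo + b * width
        band_hi = band_lo + width
        in_band = [item for item, total in zip(item_scores, total_scores)
                   if band_lo <= total < band_hi or (b == n_bands - 1 and total == hi)]
        if in_band:
            points.append((band_lo + width / 2, sum(in_band) / len(in_band)))
    return points
```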
In Figure 3, three separate item characteristic curves are shown. Line A is
considered a
flat curve and indicates that test takers at all score levels were equally
likely to get the item correct. This item was therefore not a useful
discriminating item. Line B demonstrates a troublesome item as it gradually
rises and then drops for those scoring highest on the overall test. Though this
is unusual, it can sometimes result from those who studied most having ruled
out the answer that was keyed as correct. Finally, Line C shows the item
characteristic curve for a good test item. The gradual and consistent positive
slope shows that the proportion of people passing the item gradually increases
as test scores increase. Though it is not depicted here, if an ICC was seen in
the shape of a backward S, negative item discrimination would be evident,
meaning that those who scored lowest were most likely to endorse a correct
response on the item.
See also: Item Response Theory
BIBLIOGRAPHY
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Brown, F. (1983). Principles of educational and psychological testing (3rd ed.). New York: Holt, Rinehart, & Winston.
DeVellis, R. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage.
Gronlund, N. (1993). How to make achievement tests and assessments (5th ed.). Boston: Allyn and Bacon.
Kaplan, R., & Saccuzzo, D. (2004). Psychological testing: Principles, applications, and issues (6th ed.). Pacific Grove, CA: Brooks/Cole.
Kehoe, J. (1995). Basic item analysis for multiple-choice tests. Practical Assessment, Research & Evaluation, 4(10). Retrieved April 1, 2008, from http://pareonline.net/getvn.asp?v=4&n=10
Patten, M. (2001). Questionnaire research: A practical guide (2nd ed.). Los Angeles: Pyrczak.
Wilson, M. (2005). Constructing measures: An item response modeling approach. Mahwah, NJ: Lawrence Erlbaum.