Item Analysis
Major Uses of Item Analysis
Item analysis can be a
powerful technique available to instructors for the guidance and improvement of
instruction. For this to be so, the items to be analyzed must be valid measures
of instructional objectives. Further, the items must be diagnostic; that is, knowing which incorrect options students select must provide a clue to the nature of the misunderstanding and thus point toward appropriate remediation.
In addition, instructors
who construct their own examinations may greatly improve the effectiveness of
test items and the validity of test scores if they select and rewrite their
items on the basis of item performance data. Such data is available to
instructors who have their examination answer sheets scored at the Scoring
Office.
Item Analysis Reports
As the answer sheets are
scored, records are written which contain each student's score and his or her
response to each item on the test. These records are then processed and an item
analysis report file is generated. An instructor may obtain test score
distributions and a list of students' scores, in alphabetic order, in student
number order, in percentile rank order, and/or in order of percentage of total
points. Instructors are sent their item analysis reports as e-mail attachments.
The item analysis report is contained in the file IRPT####.RPT, where the four
digits indicate the instructor's GRADER III account. A sample of an individual
long-form item analysis listing of the item response pattern is shown below.
Item 10 of 125. The correct option is 5.

Item Response Pattern

| Group      |   | 1  | 2  | 3  | 4 | 5  | Omit | Error | Total |
|------------|---|----|----|----|---|----|------|-------|-------|
| Upper 27%  | n | 2  | 8  | 0  | 1 | 19 | 0    | 0     | 30    |
|            | % | 7  | 27 | 0  | 3 | 63 | 0    | 0     | 100   |
| Middle 46% | n | 3  | 20 | 3  | 3 | 23 | 0    | 0     | 52    |
|            | % | 6  | 38 | 6  | 6 | 44 | 0    | 0     | 100   |
| Lower 27%  | n | 6  | 5  | 8  | 2 | 9  | 0    | 0     | 30    |
|            | % | 20 | 17 | 27 | 7 | 30 | 0    | 0     | 101   |
| Total      | n | 11 | 33 | 11 | 6 | 51 | 0    | 0     | 112   |
|            | % | 10 | 29 | 11 | 5 | 46 | 0    | 0     | 100   |
Item Analysis Response Patterns
Each item is identified by number and the correct option is indicated. The
group of students taking the test is divided into upper, middle and lower
groups on the basis of students' scores on the test. This division is essential
if information is to be provided concerning the operation of distracters
(incorrect options) and to compute an easily interpretable index of
discrimination. It has long been accepted that optimal item discrimination is
obtained when the upper and lower groups each contain twenty-seven percent of
the total group.
The number of students who selected each option or omitted the item is
shown for each of the upper, middle, lower and total groups. The number of
students who marked more than one option to the item is indicated under the
"error" heading. The percentage of each group who selected each of
the options, omitted the item, or erred, is also listed. Note that the total
percentage for each group may be other than 100%, since the percentages are
rounded to the nearest whole number before totaling.
The sample item listed above appears to be performing well. About
two-thirds of the upper group but only one-third of the lower group answered
the item correctly. Ideally, the students who answered the item incorrectly
should select each incorrect response in roughly equal proportions, rather than
concentrating on a single incorrect option. Option two seems to be the most
attractive incorrect option, especially to the upper and middle groups. It is
most undesirable for a greater proportion of the upper group than of the lower
group to select an incorrect option. The item writer should examine such an
option for possible ambiguity. For the sample item above, option
four was selected by only five percent of the total group. An attempt might be
made to make this option more attractive.
Item analysis provides the item writer with a record of student reaction
to items. It gives us little information about the appropriateness of an item
for a course of instruction. The appropriateness or content validity of an item
must be determined by comparing the content of the item with the instructional
objectives.
Basic Item Analysis Statistics
A number of item statistics are reported which aid in evaluating the effectiveness
of an item. The first of these is the index of difficulty which is the
proportion of the total group who got the item wrong. Thus a high index
indicates a difficult item and a low index indicates an easy item. Some item
analysts prefer an index of difficulty which is the proportion of the total
group who got an item right. This index may be obtained by marking the
PROPORTION RIGHT option on the item analysis header sheet. Whichever index is
selected is shown as the INDEX OF DIFFICULTY on the item analysis print-out.
For classroom achievement tests, most test constructors desire items with
indices of difficulty no lower than 20 nor higher than 80, with an average
index of difficulty from 30 or 40 to a maximum of 60.
The INDEX OF DISCRIMINATION is the difference between the proportion of
the upper group who got an item right and the proportion of the lower group who
got the item right. This index is dependent upon the difficulty of an item. It
may reach a maximum value of 100 for an item with an index of difficulty of 50,
that is, when 100% of the upper group and none of the lower group answer the
item correctly. For items of less than or greater than 50 difficulty, the index
of discrimination has a maximum value of less than 100. The Interpreting the Index of
Discrimination document
contains a more detailed discussion of the index of discrimination.
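To make these two definitions concrete, here is a minimal sketch in Python that computes both indices from the counts in the sample response pattern shown earlier. The function names and data layout are illustrative assumptions, not the Scoring Office's actual code.

```python
# Sketch: index of difficulty and index of discrimination, computed from the
# sample item shown earlier (correct option 5). Names here are illustrative.

def index_of_difficulty(n_correct_total, n_total, proportion_right=False):
    """Percent of the whole group answering wrong (or right, if requested)."""
    percent_right = 100.0 * n_correct_total / n_total
    return round(percent_right if proportion_right else 100.0 - percent_right)

def index_of_discrimination(correct_upper, n_upper, correct_lower, n_lower):
    """Upper-group percent correct minus lower-group percent correct."""
    return round(100.0 * correct_upper / n_upper - 100.0 * correct_lower / n_lower)

# Sample item: 51 of 112 students correct overall; 19 of 30 in the upper group
# and 9 of 30 in the lower group answered correctly.
print(index_of_difficulty(51, 112))             # 54 (proportion wrong)
print(index_of_difficulty(51, 112, True))       # 46 (PROPORTION RIGHT option)
print(index_of_discrimination(19, 30, 9, 30))   # 33
```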
Interpretation of Basic Statistics
To aid in interpreting the index of discrimination, the maximum
discrimination value and the discriminating efficiency are given for each item.
The maximum discrimination is the highest possible index of discrimination for
an item at a given level of difficulty. For example, an item answered correctly
by 60% of the group would have an index of difficulty of 40 and a maximum
discrimination of 80. This would occur when 100% of the upper group and 20% of
the lower group answered the item correctly. The discriminating efficiency is
the index of discrimination divided by the maximum discrimination. For example,
an item with an index of discrimination of 40 and a maximum discrimination of
50 would have a discriminating efficiency of 80. This may be interpreted to
mean that the item is discriminating at 80% of the potential of an item of its
difficulty. For a more detailed discussion of the maximum discrimination and
discriminating efficiency concepts, see the Interpreting the Index of
Discrimination document.
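As a rough sketch of the arithmetic, the snippet below uses the common two-group formulation in which an item answered correctly by P percent of the group can discriminate at most 2 * min(P, 100 - P). That ceiling is an assumption on my part, chosen because it reproduces the examples in the text; the report's exact algorithm may differ.

```python
# Sketch: maximum discrimination and discriminating efficiency.
# The 2 * min(P, 100 - P) ceiling is an assumption consistent with the text's
# example (60% correct -> difficulty 40 -> maximum discrimination 80).

def maximum_discrimination(percent_correct):
    return 2 * min(percent_correct, 100 - percent_correct)

def discriminating_efficiency(discrimination, percent_correct):
    return round(100.0 * discrimination / maximum_discrimination(percent_correct))

print(maximum_discrimination(60))          # 80
print(discriminating_efficiency(40, 75))   # 80: discrimination 40 against a maximum of 50
```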
Other Item Statistics
Some test analysts may
desire more complex item statistics. Two correlations which are commonly used
as indicators of item discrimination are shown on the item analysis report. The
first is the biserial correlation, which is the correlation between a student's
performance on an item (right or wrong) and his or her total score on the test.
This correlation assumes that the distribution of test scores is normal and
that there is a normal distribution underlying the right/wrong dichotomy. The
biserial correlation has the characteristic, disconcerting to some, of having
maximum values greater than unity. There is no exact test for the statistical
significance of the biserial correlation coefficient.
The point biserial
correlation is also a correlation between student performance on an item (right
or wrong) and test score. It assumes that the test score distribution is normal
and that the division on item performance is a natural dichotomy. The possible
range of values for the point biserial correlation is +1 to -1. The Student's t
test for the statistical significance of the point biserial correlation is
given on the item analysis report. Enter a table of Student's t values with N -
2 degrees of freedom at the desired percentile point. N represents the total
number of students appearing in the item analysis.
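The sketch below shows one textbook way to compute the point biserial, the biserial, and the t statistic from raw data using only the Python standard library. The formulas are the standard definitions; the variable names and the small data set are illustrative and are not drawn from the report.

```python
import math
from statistics import NormalDist, mean, pstdev

def item_correlations(scores, correct):
    """scores: total test scores; correct: parallel 1/0 flags for one item."""
    n = len(scores)
    p = sum(correct) / n                                       # proportion who got the item right
    q = 1 - p
    mean_right = mean(s for s, c in zip(scores, correct) if c)
    mean_wrong = mean(s for s, c in zip(scores, correct) if not c)
    sd = pstdev(scores)
    r_pb = (mean_right - mean_wrong) / sd * math.sqrt(p * q)   # point biserial
    y = NormalDist().pdf(NormalDist().inv_cdf(p))              # normal ordinate at p
    r_bis = (mean_right - mean_wrong) / sd * (p * q / y)       # biserial
    t = r_pb * math.sqrt((n - 2) / (1 - r_pb ** 2))            # t with N - 2 degrees of freedom
    return r_pb, r_bis, t

# Tiny illustrative data set (ten students, one item):
scores  = [95, 88, 84, 80, 76, 70, 66, 60, 55, 48]
correct = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
print(item_correlations(scores, correct))
```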
The mean scores for
students who got an item right and for those who got it wrong are also shown.
These values are used in computing the biserial and point biserial coefficients
of correlation and are not generally used as item analysis statistics.
Generally, item
statistics will be somewhat unstable for small groups of students. Perhaps
fifty students might be considered a minimum number if item statistics are to
be stable. Note that for a group of fifty students, the upper and lower groups
would contain only thirteen students each. The stability of item analysis
results will improve as the group of students is increased to one hundred or
more. An item analysis for very small groups must not be considered a stable
indication of the performance of a set of items.
The item analysis data
are summarized on the last page of the item analysis report. The distribution
of item difficulty indices is a tabulation showing the number and percentage of
items whose difficulties are in each of ten categories, ranging from a very
easy category (00-10) to a very difficult category (91-100). The distribution
of discrimination indices is tabulated in the same manner, except that a
category is included for negatively discriminating items.
The mean item difficulty
is determined by adding all of the item difficulty indices and dividing the
total by the number of items. The mean item discrimination is determined in a
similar manner.
Test reliability,
estimated by the Kuder-Richardson formula number 20, is given. If the test is
speeded, that is, if some of the students did not have time to consider each
test item, then the reliability estimate may be spuriously high.
The final test statistic
is the standard error of measurement. This statistic is a common device for
interpreting the absolute accuracy of the test scores. The size of the standard
error of measurement depends on the standard deviation of the test scores as
well as on the estimated reliability of the test.
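For readers who want to see the arithmetic behind these two summary statistics, here is a brief sketch using the standard KR-20 formula and SEM = SD * sqrt(1 - reliability). The 0/1 matrix layout and function names are assumptions for illustration only.

```python
from statistics import pstdev, pvariance

def kr20(item_matrix):
    """item_matrix: one row per student, one 0/1 entry per item."""
    n_students = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]        # each student's total score
    sum_pq = 0.0                                      # sum of item variances p * (1 - p)
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_students
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / pvariance(totals))

def standard_error_of_measurement(item_matrix):
    totals = [sum(row) for row in item_matrix]
    return pstdev(totals) * (1 - kr20(item_matrix)) ** 0.5

# Purely illustrative 4-student, 5-item matrix:
data = [[1, 1, 1, 1, 0],
        [1, 1, 0, 1, 0],
        [1, 0, 1, 0, 0],
        [0, 0, 0, 1, 0]]
print(kr20(data), standard_error_of_measurement(data))
```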
Occasionally, a test
writer may wish to omit certain items from the analysis although these items
were included in the test as it was administered. Such items may be omitted by
leaving them blank on the test key. The statistics for these items will be
omitted from the Summary Data.
A number of report
options are available for item analysis data. The long-form item analysis
report contains three items per page. A standard-form item analysis report is
available where data on each item is summarized on one line. A sample report is
shown below.
ITEM ANALYSIS: Test 4, 125 Items, 49 Students
Percentages: Upper 27% - Middle - Lower 27%

| Item | Key | 1        | 2        | 3        | 4        | 5        | Omit  | Error | Diff | Disc |
|------|-----|----------|----------|----------|----------|----------|-------|-------|------|------|
| 1    | 2   | 15-22-31 | 69-57-38 | 08-17-15 | 00-04-00 | 08-00-15 | 0-0-0 | 0-0-0 | 45   | 31   |
| 2    | 3   | 00-26-15 | 00-00-00 | 92-65-62 | 00-04-08 | 08-04-15 | 0-0-0 | 0-0-0 | 29   | 31   |
The standard form shows
the item number, key (number of the correct option), the percentage of the
upper, middle, and lower groups who selected each option, omitted the item or
erred, the index of difficulty, and the index of discrimination. For example, in
item 1 above, option 2 was the correct answer and it was selected by 69% of the
upper group, 57% of the middle group and 38% of the lower group. The index of
difficulty, based on the total group, was 45 and the index of discrimination
was 31.
Item analysis is a
completely futile process unless the results help instructors improve their
classroom practices and item writers improve their tests. Let us suggest a
number of points of departure in the application of item analysis data.
- Item analysis gives necessary
but not sufficient information concerning the appropriateness of an item
as a measure of intended outcomes of instruction. An item may perform
beautifully with respect to item analysis statistics and yet be quite
irrelevant to the instruction whose results it was intended to measure. A
most common error is to teach for behavioral objectives such as analysis
of data or situations, ability to discover trends, ability to infer
meaning, etc., and then to construct an objective test measuring mainly
recognition of facts. Clearly, the objectives of instruction must be kept
in mind when selecting test items.
- An item must be of appropriate
difficulty for the students to whom it is administered. If possible, items
should have indices of difficulty no less than 20 and no greater than 80.
It is desirable to have most items in the 30 to 50 range of difficulty.
Very hard or very easy items contribute little to the discriminating power
of a test.
- An item should discriminate
between upper and lower groups. These groups are usually based on total
test score but they could be based on some other criterion such as
grade-point average, scores on other tests, etc. Sometimes an item will
discriminate negatively, that is, a larger proportion of the lower group
than of the upper group selected the correct option. This often means that
the students in the upper group were misled by an ambiguity that the
students in the lower group, and the item writer, failed to discover. Such
an item should be revised or discarded.
- All of the incorrect options,
or distracters, should actually be distracting. Preferably, each
distracter should be selected by a greater proportion of the lower group
than of the upper group. If, in a five-option multiple-choice item, only
one distracter is effective, the item is, for all practical purposes, a
two-option item. Existence of five options does not automatically
guarantee that the item will operate as a five-choice item.
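These points translate directly into simple screening rules. The sketch below assumes per-option counts for the upper and lower groups are already available; the data structure and the specific checks are illustrative choices, not rules prescribed by the text.

```python
# Sketch: flag negatively discriminating items, distracters that attract no one,
# and distracters that draw more of the upper group than the lower group.

def screen_item(key, upper_counts, lower_counts):
    """key: correct option; *_counts: {option: number of students choosing it}."""
    warnings = []
    n_upper, n_lower = sum(upper_counts.values()), sum(lower_counts.values())
    discrimination = round(100 * (upper_counts[key] / n_upper
                                  - lower_counts[key] / n_lower))
    if discrimination < 0:
        warnings.append("negative discrimination: revise or discard")
    for option in upper_counts:
        if option == key:
            continue
        if upper_counts[option] + lower_counts[option] == 0:
            warnings.append(f"option {option} attracted no one")
        elif upper_counts[option] / n_upper > lower_counts[option] / n_lower:
            warnings.append(f"option {option} drew more of the upper group: check for ambiguity")
    return discrimination, warnings

# Sample item from the long-form report above (correct option 5):
upper = {1: 2, 2: 8, 3: 0, 4: 1, 5: 19}
lower = {1: 6, 2: 5, 3: 8, 4: 2, 5: 9}
print(screen_item(5, upper, lower))   # (33, ['option 2 drew more of the upper group: ...'])
```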
Item Analysis of Classroom Tests: Aims and Simplified Procedures
Aim:
How well did my test distinguish among students according to how well they met my learning goals?
Recall that each item on your test is intended
to sample performance on a particular learning outcome. The test as a whole is
meant to estimate performance across the full domain of learning outcomes you
have targeted.
Unless your learning goals are minimal or low
(as they might be, for instance, on a readiness test), you can expect students
to differ in how well they have met those goals. (Students are not peas in a
pod!) Your aim is not to differentiate students just for the fun of it, but to
be able to measure the differences in mastery that occur.
One way to assess how well your test is
functioning for this purpose is to look at how well the individual items do so.
The basic idea is that a good item is one that good students get correct more
often than do poor students. You might end up with a big spread in scores, but
what if the good students are no more likely than poor students to get a high
score? If we assume that you have actually given them proper instruction, then
your test has not really assessed what they have learned. That is, it is
"not working."
An item analysis gets at the question of whether
your test is working by asking the same question of all individual items—how
well does it discriminate? If you have lots of items that didn’t discriminate
much if at all, you may want to replace them with better ones. If you find ones
that worked in the wrong direction (where good students did worse) and
therefore lowered test reliability, then you will definitely want to get rid of
them.
In short, item analysis gives you a way to
exercise additional quality control over your tests. Well-specified learning
objectives and well-constructed items give you a head start in that process, but
item analyses can give you feedback on how successful you actually were.
Item analyses can also help you diagnose why some
items did not work especially well, and thus suggest ways to improve them (for
example, if you find distracters that attracted no one, try developing better
ones).
Reminder
Item analyses are intended to assess and improve
the reliability of your tests. If test reliability is low,
test validity will necessarily also be low. This is the ultimate reason you do
item analyses—to improve the validity of a test by improving its reliability.
Higher reliability will not necessarily raise validity (you can be more
consistent in hitting the wrong target), but it is a prerequisite. That is,
high reliability is necessary but not sufficient for high validity (do you
remember this point on Exam 1?).
However, when you examine the properties of each
item, you will often discover how they may or may not actually have assessed
the learning outcome you intended—which is a validity issue.
When you change items to correct these problems, it means the item analysis has
helped you to improve the likely validity of the test the next time you give
it.
The Procedure
- Identify the upper 10 scorers and lowest 10 scorers on
the test. Set aside the remainder.
- Construct an empty chart for recording their scores,
following the sample I gave you in class. This chart lists the students
down the left, by name. It arrays each item number across the top. For a
20-item test, you will have 20 columns for recording the answers for each
student. Underneath the item number, write in the correct answer (A, B,
etc.).
- Enter the student data into the chart you have just
constructed.
- Take the top 10 scorers, and write each student’s name
down the left, one row for each student. If there is a tie for 10th
place, pick one student randomly from those who are tied.
- Skip a couple rows, then write the names of the 10
lowest-scoring students, one row for each.
- Going student by student, enter each student’s answers
into the cells of the chart. However, enter only the wrong
answers (A, B, etc.). Any empty cell will therefore signal a
correct answer.
- Go back to the upper 10 students. Count how many of
them got Item 1 correct (this would be all the empty cells). Write that
number at the bottom of the column for those 10. Do the same for the
other 19 questions. We will call these sums RU, where U stands
for "upper."
- Repeat the process for the 10 lowest students. Write
those sums under their 20 columns. We will call these RL,
where L stands for "lower."
- Now you are ready to calculate the two important
indices of item functioning. These are actually only estimates of what you
would get if you had a computer program to calculate the indices for
everyone who took the test (some schools do). But they are pretty good.
- Difficulty index. This is just the proportion of people who passed the item. Calculate it for each item by adding the number correct in the top group (RU) to the number correct in the bottom group (RL) and then dividing this sum by the total number of students in the top and bottom groups (20):

  Difficulty = (RU + RL) / 20
Record these 20 numbers in a row near the bottom
of the chart.
- Discrimination index. This index is designed to highlight to what extent students in the upper group were more likely than students in the lower group to get the item correct. That is, it is designed to get at the differences between the two groups. Calculate the index by subtracting RL from RU, and then dividing by half the number of students involved (10); a code sketch after this list illustrates both calculations:

  Discrimination = (RU - RL) / 10
Record these 20 numbers in the last row of the
chart.
- You are now ready to enter these discrimination indexes
into a second chart.
- Construct the second chart, based on the model I gave
you in class. (This is the smaller chart that contains no student names.)
- Note that there are two rows of column headings in the
sample. The first row of headings contains the maximum possible discrimination
indexes for each item difficulty level (more on that in a moment). The
second row contains possible difficulty indexes. Let’s begin with that
second row of headings (labeled "p"). As your sample shows, the
entries range on the far left from "1.0" (for 100%) to
".4-0" (40%-0%) for a final catch-all column. Just copy the
numbers from the sample onto your chart.
- Now copy the numbers from the first row of headings in
the sample (labeled "Md").
- Now is the time to pick up your first chart again,
where you will find the discrimination indexes you need
to enter into your second chart.
- You will be entering its last row of
numbers into the body of the second chart.
- List each of these discrimination indexes in one and
only one of the 20 columns. But which one? List each in the column
corresponding to its difficulty level. For instance, if
item 4’s difficulty level is .85 and its discrimination index is .10, put
the .10 in the difficulty column labeled ".85." This number is
entered, of course, into the row for the fourth item.
- Study this second chart.
- How many of the items are of medium difficulty? These
are the best, because they provide the most opportunity to discriminate
(to see this, look at their maximum discrimination indexes in the first
row of headings). Items that most everybody gets right or gets wrong
simply can’t discriminate much.
- The important test for an item’s discriminability is
to compare it to the maximum possible. How well did each item
discriminate relative to the maximum possible for an
item of its particular difficulty level? Here is a rough rule of thumb.
- Discrimination index is near
the maximum possible = very discriminating item
- Discrimination index is about
half the maximum possible = moderately discriminating item
- Discrimination index is about
a quarter the maximum possible = weak item
- Discrimination index is near
zero = non-discriminating item
- Discrimination index is negative =
bad item (delete it if worse than -.10)
- Go back to the first chart and study it.
- Look at whether all the distracters attracted someone.
If some did not attract any, then the distracter may not be very useful.
Normally you might want to examine it and consider how it might be
improved or replaced.
- Look also for distractors that tended to pull your
best students and thereby lower discriminability.
Consider whether the discrimination you are asking them to make is
educationally significant (or even clear). You can’t do this kind of
examination for the sample data I have given you, but keep it in mind for
real-life item analyses.
- There is much more you can do to mine these data for
ideas about your items, but this is the core of an item analysis.
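If you would rather let a short script do the tallying, the sketch below carries out the same procedure: it takes the top 10 and bottom 10 scorers and computes (RU + RL) / 20 and (RU - RL) / 10 for every item. The data layout and names are mine, not part of the original handout, and ties at the cutoff are not broken randomly as the instructions suggest.

```python
# Sketch of the hand procedure above: difficulty = (RU + RL) / 20 and
# discrimination = (RU - RL) / 10 from the top and bottom 10 scorers.

def simple_item_analysis(answers, key, group_size=10):
    """answers: {student: [chosen option per item]}; key: list of correct options."""
    ranked = sorted(answers.items(),
                    key=lambda kv: sum(a == k for a, k in zip(kv[1], key)),
                    reverse=True)
    upper = [resp for _, resp in ranked[:group_size]]
    lower = [resp for _, resp in ranked[-group_size:]]
    results = []
    for i, correct in enumerate(key):
        r_u = sum(resp[i] == correct for resp in upper)   # RU
        r_l = sum(resp[i] == correct for resp in lower)   # RL
        difficulty = (r_u + r_l) / (2 * group_size)
        discrimination = (r_u - r_l) / group_size
        results.append((i + 1, difficulty, discrimination))
    return results

# Hypothetical 3-item test with tiny groups, just to show the call:
key = ["B", "D", "A"]
answers = {"Ann": ["B", "D", "A"], "Bo": ["B", "C", "A"],
           "Cy": ["A", "D", "B"], "Di": ["C", "C", "B"]}
print(simple_item_analysis(answers, key, group_size=2))
```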
Item Analysis
After you create your objective assessment items and give your test, how can you be sure that the items are appropriate -- not too difficult and not too easy? How will you know if the test effectively differentiates between students who do well on the overall test and those who do not? An item analysis is a valuable, yet relatively easy, procedure that teachers can use to answer both of these questions.
To determine the difficulty level
of test items, a measure called the Difficulty Index is used.
This measure asks teachers to calculate the proportion of students who
answered the test item accurately. By looking at each alternative (for
multiple choice), we can also find out if there are answer choices that
should be replaced. For example, let's say you gave a multiple choice quiz
and there were four answer choices (A, B, C, and D). The following table
illustrates how many students selected each answer choice for Question #1 and
#2.
* Denotes correct answer.
For Question #1, we can see that A
was not a very good distractor -- no one selected that answer. We can also
compute the difficulty of the item by dividing the number of students who chose the correct answer (24) by the total number of students (30). Using this formula, the difficulty of Question #1 (referred to as p) is equal to 24/30, or .80. A rough "rule-of-thumb" is that if the item difficulty is more than .75, it is an easy item; if the difficulty is below .25, it is a difficult item. Given these parameters, this item could be regarded as moderately easy -- lots (80%) of students got it correct. In
contrast, Question #2 is much more difficult (12/30 = .40). In fact, on
Question #2, more students selected an incorrect answer (B) than selected the
correct answer (A). This item should be carefully analyzed to ensure that B
is an appropriate distractor.
Another measure, the Discrimination Index, refers to how well an assessment differentiates between high and low scorers. In other words, you should be able to expect that the high-performing students would select the correct answer for each question more often than the low-performing students. If this is true, then the assessment is said to have a positive discrimination index (between 0 and 1) -- indicating that students who received a high total score chose the correct answer for a specific item more often than the students who had a lower overall score. If, however, you find that more of the low-performing students got a specific item correct, then the item has a negative discrimination index (between -1 and 0). Let's look at an example. Table 2 displays the results of ten questions on a quiz. Note that the students are arranged with the top overall scorers at the top of the table.
"1"
indicates the answer was correct; "0" indicates it was incorrect.
Follow these steps to determine
the Difficulty Index and the Discrimination Index.
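The worked table and step-by-step instructions from the original page are not reproduced here. As a hedged sketch of how such steps are commonly carried out on a 0/1 correctness table like Table 2, the code below sorts students by total score, splits them into an upper and a lower half, and computes each question's difficulty (proportion correct overall) and discrimination (upper-half minus lower-half proportion correct). The example rows are invented, and the split into halves is my assumption, though it is consistent with the values quoted below.

```python
# Sketch: Difficulty Index and Discrimination Index from a 0/1 correctness
# table (rows = students ordered from highest to lowest total score).

def difficulty_and_discrimination(rows):
    n = len(rows)
    upper, lower = rows[: n // 2], rows[n - n // 2:]     # top and bottom halves
    stats = []
    for j in range(len(rows[0])):
        difficulty = sum(row[j] for row in rows) / n
        discrimination = (sum(row[j] for row in upper) / len(upper)
                          - sum(row[j] for row in lower) / len(lower))
        stats.append((j + 1, round(difficulty, 2), round(discrimination, 2)))
    return stats

# Invented 6-student, 3-question table (best scorers first):
table = [[1, 0, 1],
         [1, 0, 1],
         [1, 0, 1],
         [0, 1, 1],
         [0, 1, 0],
         [0, 1, 0]]
for question, diff, disc in difficulty_and_discrimination(table):
    print(f"Q{question}: difficulty={diff}, discrimination={disc}")
```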
Now that we have the table filled
in, what does it mean? We can see that Question #2 had a difficulty index of
.30 (meaning it was quite difficult), and it also had a negative
discrimination index of -0.6 (meaning that the low-performing students were
more likely to get this item correct). This question should be
carefully analyzed, and probably deleted or changed. Our "best"
overall question is Question 3, which had a moderate difficulty level (.60),
and discriminated extremely well (0.8).
Another consideration for an item analysis is the cognitive level that is being assessed. For example, you might categorize the questions based on Bloom's taxonomy (perhaps grouping questions that address Level I and those that address Level II). In this manner, you would be able to determine if the difficulty index and discrimination index of those groups of questions are appropriate. For example, you might note that the majority of the questions that demand higher levels of thinking skills are too difficult or do not discriminate well. You could then concentrate on improving those questions and focus your instructional strategies on higher-level skills.