Interpreting Item Analysis
Let's look at what we have and see what we can see.
- What do we see looking at this first one? [Potential Miskey]
        Upper   Low   Difference     D    Total   Difficulty
1. *A     1      4        -3       -.20     5        .17
    B     1      3
    C    10      5
    D     3      3
    O     -      -    <-- O means omit or no answer
- on #1, more high-group students chose C than A, even though A is supposedly the correct answer
- more low-group students chose A than high-group students did, so it got a negative discrimination
- only 17% of the class got it right
- most likely you just wrote the wrong answer key down
--> this is an easy and very common mistake to make
- better you find out now, before you hand the test back, than when the kids complain
- OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
- so check it out, and rescore that question on all the papers before handing them back
- rescoring with C as the key makes it 10 and 5: Difference = 5; D = .33; Total = 15; difficulty = .50
--> a nice item (the sketch below runs both versions of the arithmetic)
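The notes don't include the arithmetic itself, so here is a minimal sketch of it in Python, assuming (as in these examples) an upper and a lower group of 15 students each and a class of 30; the function name is mine:

```python
GROUP_SIZE = 15  # top 15 and bottom 15 papers, as in the tables above

def item_stats(upper_correct, lower_correct, class_size=30):
    """Return (difference, D index, total correct, difficulty p) for one item."""
    difference = upper_correct - lower_correct
    d_index = difference / GROUP_SIZE       # D = (upper - lower) / group size
    total = upper_correct + lower_correct
    difficulty = total / class_size         # p = proportion who got it right
    return difference, round(d_index, 2), total, round(difficulty, 2)

# Item 1 as keyed (A): upper 1, lower 4 -- the classic miskey signature
print(item_stats(1, 4))    # (-3, -0.2, 5, 0.17)

# Item 1 rescored with C as the key: upper 10, lower 5 -- a nice item
print(item_stats(10, 5))   # (5, 0.33, 15, 0.5)
```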
OR:
- you check and find that you didn't miskey it --> that IS the answer you thought
Two possibilities:
- one possibility is that you made a slip of the tongue and taught them the wrong answer
- anything you say in class can be taken down and used against you on an examination....
- more likely it means even "good" students are being tricked by a common misconception
--> you're not supposed to have trick questions, so you may want to dump the item
--> give those who got it right their point, but total the rest of the marks out of 24 instead of 25 (one reading of this rule is sketched below)
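The rescoring rule above is terse, so here is a hedged sketch of one way to read it: students who got the dumped item right keep their point out of 25, and everyone else is totalled out of 24. The function name and scores are invented for illustration:

```python
def rescore_percent(raw_score, got_dumped_item, out_of=25):
    """Percentage after dumping one item, under one reading of the rule above."""
    if got_dumped_item:
        return 100 * raw_score / out_of        # keep the point, still out of 25
    return 100 * raw_score / (out_of - 1)      # item dumped, now out of 24

print(round(rescore_percent(20, True)))    # 80 -- earned the point on the item
print(round(rescore_percent(19, False)))   # 79 -- rescored out of 24 instead
```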
- if scores are high, or you want to make a point, you might let it stand, and then teach to it --> sometimes getting caught will help them remember better in future, for things such as:
- very fine distinctions
- crucial steps which are often overlooked
REVISE it for next time to weaken "C"
-- alternatives are not supposed to draw more than the keyed answer
-- that is almost always an item flaw, rather than a useful distinction
- What can we see with #2? [Can identify ambiguous items]
        Upper   Low   Difference     D    Total   Difficulty
2.  A     6      5
    B     1      2
   *C     7      5         2        .13    12        .40
    D     1      3
    O
- on #2, about equal numbers of top students went for A and for the keyed answer C
- either the students didn't know this material (in which case you can reteach it)
- or the item was defective --->
- look at their favorite alternative again, and see if you can find any reason they could be choosing it
- often items that look perfectly straightforward to adults are ambiguous to students
- if you NOW realize that A was a defensible answer, rescore before you hand the test back to give everyone credit for either A or C -- this avoids arguing with you in class
- if it's clearly a wrong answer, then you now know which error most of your students are making to get that wrong answer
- useful diagnostic information on their learning, and on your teaching
- Equally to all alternatives
        Upper   Low   Difference     D    Total   Difficulty
3.  A     4      3
    B     3      4
   *C     5      4         1        .06     9        .30
    D     3      4
    O
- on item #3, students respond about equally to all alternatives
- usually this means they are guessing
Three possibilities:
- it may be material you didn't actually get to yet
  - you designed the test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before the holidays....
  - or it's an item on a common exam that you didn't stress in your own class
- the item is so badly written that students have no idea what you're asking
- the item is so difficult that students are just completely baffled
- review the item:
- if it's badly written (by another teacher) or on material your class hasn't taken, toss it out and rescore the exam out of the lower total
- BUT give credit to those who got it, to a maximum of 100%
- if it seems well written, but too hard, then you know to (re)teach this material for the rest of the class....
- maybe the 3 who got it are your top three students
- a tough but valid item:
- OK, if the item tests a valid objective
- you want to provide the occasional challenging question for top students
- but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"
- Alternatives aren't working
        Upper   Low   Difference     D    Total   Difficulty
4.  A     1      5
   *B    14      7         7        .47    21        .70
    C     0      2
    D     0      0
    O
- on example #4 --> no one fell for D --> so it is not a plausible alternative
- the question is fine for this administration, but revise the item for next time
- toss alternative D and replace it with something more realistic
- each distracter has to attract at least 5% of the students
- in a class of 30, each should get at least two students
- or you might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
- if two alternatives don't draw any students --> you might consider redoing the item as true/false
- Distracter too attractive
        Upper   Low   Difference     D    Total   Difficulty
5.  A     7     10
    B     1      2
    C     1      1
   *D     5      2         3        .20     7        .23
    O
- on sample #5 --> too many students are going for A
--> no ONE distracter should get more than the key
--> no one distracter should pull more than about half of the students -- that doesn't leave enough for the correct answer plus five percent for each other alternative
- keep it for this time
- weaken it for next time
- Question not discriminating
        Upper   Low   Difference     D    Total   Difficulty
6. *A     7      7         0        .00    14        .47
    B     3      2
    C     2      1
    D     3      5
    O
- on sample #6, the low group gets it as often as the high group
- on norm-referenced tests, the point is to rank students from best to worst
- so an individual test item should have good students getting the question right, and poor students getting it wrong
- the test overall decides who is a good or poor student on this particular topic
- those who do well have more information and skills than those who do less well
- so if, on a particular question, those with more skills and knowledge do NOT do better, something may be wrong with the question
- the question may be VALID, but off topic
- e.g., the rest of the test tests thinking skills, but this is a memorization question, so skilled and unskilled students are equally likely to recall the answer
- you should have a homogeneous test --> don't put a math item in with social studies
- if you wanted to get really fancy, you could do a separate item analysis for each cell of your blueprint...as long as you had six items per cell
- or the question is VALID and on topic, but not RELIABLE
- it addresses the specified objective, but isn't a useful measure of individual differences
- asking Grade 10s the capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- it will give you a low D
- Negative Discrimination
        Upper   Low   Difference     D    Total   Difficulty
7. *A     7     10        -3       -.20    17        .57
    B     3      3
    C     2      1
    D     3      1
    O
- the D (discrimination) index is just the upper group minus the lower group, divided by the group size (see the worked example below)
- it varies from +1.0 to -1.0
- if all of the top group got it right and all of the lower group got it wrong, D = +1.0
- if more of the bottom group get it right than the top group, you get a negative D index
- if you have a negative D, it means that students with less skill and knowledge overall are getting it right more often than those who the test says are better overall
- in other words, the better you are, the more likely you are to get it wrong
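The formula is never written out in these notes, but it is implicit in every table above; as a worked check against item #7 (groups of 15):

D = (upper correct - lower correct) / group size = (7 - 10) / 15 = -.20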
WHAT COULD ACCOUNT FOR THAT?
Two possibilities:
- usually it means an ambiguous question
- one that is confusing the good students, while the weak students are too weak to see the problem
- look at the question again, and at the alternatives the good students are going for, to see if you've missed something
OR:
- it might be off topic
--> something weaker students are better at (like rote memorization) than good students
--> not part of the same set of skills as the rest of the test --> perhaps suggesting a design flaw in the table of specifications
((if you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two distinct skills, because by definition the low group is the high group on that bunch of questions
--> you end up treating them as two separate tests))
- if you have a large enough sample (like the provincial exams), then toss the item and either don't count it or give everyone credit for it
- with a sample of 100 students or less, it could just be random chance, so basically ignore it in terms of THIS administration
- the kids wrote it; give them the mark they got
- furthermore, if you keep dropping questions, you may find that you're starting to develop serious holes in your blueprint coverage -- a problem for sampling
- but you want to track this stuff FOR NEXT TIME
- if it's negative administration after administration, consistently, it's likely not random chance; the item is screwing up in some way
- you want to build your future tests out of those items with high positive D indices
- the higher the average D index on the test, the more RELIABLE the test as a whole will be
- revise items to increase D:
--> if good students are selecting one particular wrong alternative, make it less attractive
--> or increase the probability of their selecting the right answer by making it more attractive
- you may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification
- what this means is that there are some skills/knowledge in this unit which are unrelated to the rest of the skills/knowledge
--> but they may still be important
- e.g., the statistics part of this course may be terrible for the students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain from math, right... but it's still an important objective in this course
- this may lower the reliability of the test, but it increases content validity
- Too Easy
        Upper   Low   Difference     D    Total   Difficulty
8.  A     0      1
   *B    14     13         1        .06    27        .90
    C     0      1
    D     1      1
    O
- items that are too easy or too difficult won't discriminate well either
- difficulty (p, for proportion) varies from 1.0 (everybody got it right) to 0 (nobody did) -- see the worked example below
REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION
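Again, the formula is implicit in the tables; worked through for item #8:

p = (number who got it right) / (number who wrote the item) = 27 / 30 = .90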
- if the item is NOT miskeyed, and there's no other glaring problem, it's too late to change it after it's been administered --> everybody got it right, OK, give them the mark
TOO DIFFICULT = 30 to 35% (that used to be the rule in the Branch, but no longer...)
- if the item is too difficult, don't drop it just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there
- and unless literally EVERYONE missed it, what do you do with the students who got it right?
- give them bonus marks?
- cheat them out of a mark they earned?
- furthermore, if you drop too many questions, you lose content validity (specs)
--> if only two or three got it right, that may just be random chance, so why should they get a bonus mark?
- however, DO NOT REUSE questions with too-high or too-low difficulty (p) values in future
- if the difficulty is over 85%, you're wasting space on a limited-item test
- asking Grade 10s the capital of Canada is probably a waste of their time and yours --> unless this is a particularly vital objective
- the same applies to items which are too difficult --> there's no use asking Grade 3s to solve a quadratic equation
- but you may want to revise the question to make it easier or harder, rather than just tossing it out cold
SOME EXCEPTIONS HERE:
- you may have consciously decided to develop a "mastery"-style test
--> these will often have very easy questions
--> you expect everyone to get everything; you're trying to identify only those who are not ready to go on
--> in which case, don't use any question which has a difficulty level below 85% (or whatever your cutoff is)
- or you may want a test to identify the top people in the class, say for the Reach for the Top team, and design a whole test of really tough questions
--> these will have low difficulty values (i.e., they are very hard)
- so it depends a bit on what you intend to do with the test in question
- this is what makes the difficulty index (proportion) so handy
- you create a bank of items over the years
--> using item analysis, you get better questions all the time, until you have a whole bunch that work great
--> you can then tailor-make a test for your class (see the sketch below)
- if you want to create an easier test this year, you pick questions with higher difficulty (p) values
- if you want to make a challenging test for your gifted kids, you choose items with low difficulty (p) values
--> for most applications you will want to set the difficulty level so that it gives you average marks, a nice bell curve
- the government uses 62.5 --> for four-alternative multiple choice, that's the middle of the bell curve
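As a hedged illustration of tailoring a test from banked p and D values (the bank format, item ids, and numbers here are all invented):

```python
bank = [
    {"id": "geo-01", "p": 0.62, "D": 0.45},
    {"id": "geo-02", "p": 0.88, "D": 0.10},   # too easy to reuse
    {"id": "his-01", "p": 0.35, "D": 0.50},   # tough but discriminating
    {"id": "his-02", "p": 0.64, "D": 0.38},
]

def pick_items(bank, p_low, p_high, min_d=0.30):
    """Keep items in the target difficulty window that also discriminate well."""
    return [item for item in bank
            if p_low <= item["p"] <= p_high and item["D"] >= min_d]

# Aim near the 62.5 target for a regular class:
print([i["id"] for i in pick_items(bank, 0.50, 0.75)])   # ['geo-01', 'his-02']
# Lower the window for a challenge test for the gifted kids:
print([i["id"] for i in pick_items(bank, 0.20, 0.50)])   # ['his-01']
```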
- start tests with an easy question or two to give students a running start
- make sure that the difficulty levels are spread out over the examination blueprint
- not all hard geography questions and easy history questions
- that's unfair to kids who are better at geography and worse at history
- and it turns the class off geography if they equate it with tough questions
--> REMEMBER here that difficulty is different than complexity (Bloom)
- so you can have a difficult recall/knowledge question, and an easy synthesis question
- synthesis and evaluation items will tend to be harder than recall questions, so if you find the higher levels are more difficult, OK, but try to balance the cells as much as possible
- certainly the content cells should be roughly the same
- OMIT
        Upper   Low   Difference     D    Total   Difficulty
9.  A     2      1
    B     3      4
   *C     7      3         4        .26    10        .33
    D     1      1
    O     2      4
If it's near the end of the test:
--> they didn't find it because it was on the next page -- a format problem
OR
--> your test is too long; 6 of them (20%) didn't get to it
OR, if it's in the middle of the test:
--> it totally baffled them, either because:
- it's way too difficult for these guys
- or, since 2 from the high group omitted it too, because of ambiguous wording
- RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)
        Upper   Low   Difference     D    Total   Difficulty
10. A     0      5
   *B    15      0        15       1.00    15        .50
    C     0      5
    D     0      5
    O
---------------------------------------------------
11. A     3      2
   *B     8      7         1        .06    15        .50
    C     2      3
    D     2      3
    O
- #10 is a perfect item --> each distracter gets at least 5, and the discrimination index is +1.0
(ACTUALLY, A PERFECT ITEM WOULD HAVE A DIFFICULTY OF ABOUT 65% TO ALLOW FOR GUESSING)
- high discrimination (D) indices require optimal levels of difficulty
- but optimal levels of difficulty do not assure high levels of D
- #11 has the same difficulty level, but a very different D
- on a four-alternative multiple choice, a student answering totally by chance will get 25%
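One common rule of thumb (not spelled out in these notes, but consistent with the 62.5 target and the "about 65%" above) puts the ideal difficulty halfway between the chance floor and a perfect score:

ideal p = (chance + 1.0) / 2 = (.25 + 1.0) / 2 = .625 for a four-alternative item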