Interpreting Item Analysis
Let's look at what we have and see what we can see
- 90% of item analysis is just common sense...
- Potential Miskey
- Identifying Ambiguous Items
- Equal Distribution to all alternatives
- Alternatives are not working
- Distracter too attractive
- Question not discriminating
- Negative discrimination
- Too Easy
- Omit
- Relationship between D index and Difficulty (p)
- Item Analysis of Computer Printouts
- What do we see looking at this first one? [Potential Miskey]

              Upper   Low   Difference    D     Total   Difficulty
    1.  *A      1      4        -3      -.20      5        .17
         B      1      3
         C     10      5
         D      3      3
         O                 <---- means omit or no answer
- #1: more high-group students chose C than A, even though A is supposedly the correct answer
- more low-group students chose A than high-group students, so it got negative discrimination
- only 17% of the class got it right
- most likely you just wrote the wrong answer key down
  --> this is an easy and very common mistake to make
- better you find out now, before you hand the test back, than when the kids complain
- OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
- so check it out, and rescore that question on all the papers before handing them back
- rescoring with C as the key makes it 10-5: Difference = 5; D = .33; Total = 15; difficulty = .50
  --> nice item
OR:
- you check and find that you didn't miskey it --> that is the answer you thought
two possibilities:
- one possibility is that you made a slip of the tongue and taught them the wrong answer
  - anything you say in class can be taken down and used against you on an examination....
- more likely it means even "good" students are being tricked by a common misconception
  --> you're not supposed to have trick questions, so you may want to dump it
  --> give those who got it right their point, but total the rest of the marks out of 24 instead of 25
- if scores are high, or you want to make a point, you might let it stand, and then teach to it --> sometimes if they get caught, it will help them remember better in future
such as:
  - very fine distinctions
  - crucial steps which are often overlooked
- REVISE it for next time to weaken "C" (see the sketch below)
  -- alternatives are not supposed to draw more than the keyed answer
  -- almost always an item flaw, rather than a useful distinction
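
- if you want the arithmetic spelled out, here is a minimal sketch in Python (the function name, the dictionary layout, and the 15-per-group split are my assumptions, not anything from the printout): it computes D and difficulty from the upper/lower counts and flags a potential miskey whenever the most popular alternative isn't the key:

    GROUP_SIZE = 15               # upper/lower halves of a class of 30
    CLASS_SIZE = 2 * GROUP_SIZE

    def analyze(counts, key):
        """counts maps alternative -> (upper, lower); key is the keyed answer."""
        upper, lower = counts[key]
        d_index = round((upper - lower) / GROUP_SIZE, 2)      # discrimination
        difficulty = round((upper + lower) / CLASS_SIZE, 2)   # proportion correct
        # potential miskey: the most popular alternative isn't the key
        top = max(counts, key=lambda alt: sum(counts[alt]))
        return d_index, difficulty, (top if top != key else None)

    item1 = {"A": (1, 4), "B": (1, 3), "C": (10, 5), "D": (3, 3)}
    print(analyze(item1, "A"))   # (-0.2, 0.17, 'C') --> C looks like the real key
    print(analyze(item1, "C"))   # (0.33, 0.5, None) --> rescored: nice item

- note that rescoring is just the same call with a different key, so checking the fix costs nothing.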
- What can we see with #2? [Can identify ambiguous items]

              Upper   Low   Difference    D     Total   Difficulty
    2.   A      6      5
         B      1      2
        *C      7      5        2       .13      12       .40
         D      1      3
         O
- #2: about equal numbers of top students went for A and C.
- either students didn't know this material (in which case you can reteach it)
- or the item was defective --->
- look at their favorite alternative again, and see if you can find any reason they could be choosing it (a check is sketched below)
- often items that look perfectly straightforward to adults are ambiguous to students
- if you NOW realize that A was a defensible answer, rescore before you hand it back to give everyone credit for either A or C
  -- avoids arguing with you in class
- if it's clearly a wrong answer, then you now know which error most of your students are making to get the wrong answer
  - useful diagnostic information on their learning, your teaching
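
- in the same hedged spirit, a quick ambiguity flag might compare upper-group counts against the key (the slack of one student is purely an illustrative threshold of mine):

    def ambiguous_distracters(counts, key, slack=1):
        """Distracters pulling (nearly) as many upper-group students as the key."""
        key_upper = counts[key][0]
        return [alt for alt, (u, _) in counts.items()
                if alt != key and u >= key_upper - slack]

    item2 = {"A": (6, 5), "B": (1, 2), "C": (7, 5), "D": (1, 3)}
    print(ambiguous_distracters(item2, "C"))   # ['A'] --> look hard at A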
- Equal distribution to all alternatives

              Upper   Low   Difference    D     Total   Difficulty
    3.   A      4      3
         B      3      4
        *C      5      4        1       .06       9       .30
         D      3      4
         O
- item #3: students respond about equally to all alternatives
- usually means they are guessing (a quick uniformity check is sketched below)
Three possibilities:
- may be material you didn't actually get to yet
  - you designed the test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before the holidays....
  - or it's an item on a common exam that you didn't stress in your class
- item so badly written students have no idea what you're asking
- item so difficult students are just completely baffled
- review the item:
  - if badly written (by another teacher) or on material your class hasn't taken, toss it out, rescore the exam out of the lower total
    - BUT give credit to those that got it, to a total of 100%
  - if it seems well written, but too hard, then you know to (re)teach this material for the rest of the class....
  - maybe the 3 who got it are your top three students
    - tough but valid item:
    - OK, if the item tests a valid objective
    - want to provide the occasional challenging question for top students
    - but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"
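
- a rough way to put "about equally to all alternatives" into numbers; the tolerance here is just an illustrative guess of mine:

    def looks_like_guessing(counts, tolerance=2):
        """True if responses are spread nearly evenly over all alternatives."""
        totals = [u + l for u, l in counts.values()]
        return max(totals) - min(totals) <= tolerance

    item3 = {"A": (4, 3), "B": (3, 4), "C": (5, 4), "D": (3, 4)}
    print(looks_like_guessing(item3))   # True --> they're probably guessing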
- Alternatives aren't working

              Upper   Low   Difference    D     Total   Difficulty
    4.   A      1      5
        *B     14      7        7       .47      21       .70
         C      0      2
         D      0      0
         O
- example #4 --> no one fell for D --> so it is not a plausible alternative
- the question is fine for this administration, but revise the item for next time
- toss alternative D, replace it with something more realistic
- each distracter has to attract at least 5% of the students (see the sketch below)
  - in a class of 30, each should get at least two students
  - or you might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
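
- a sketch of that 5% rule (the function name and layout are mine, not any standard):

    def dead_distracters(counts, key, class_size=30, floor=0.05):
        """Distracters drawing under 5% of the class -- replace before reuse."""
        return [alt for alt, (u, l) in counts.items()
                if alt != key and (u + l) / class_size < floor]

    item4 = {"A": (1, 5), "B": (14, 7), "C": (0, 2), "D": (0, 0)}
    print(dead_distracters(item4, "B"))   # ['D'] --> D attracted no one

- C just squeaks by with two students (2/30 ≈ 7%), which matches the rule of thumb above.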
- Distracter too attractive

              Upper   Low   Difference    D     Total   Difficulty
    5.   A      7     10
         B      1      2
         C      1      1
        *D      5      2        3       .20       7       .23
         O
- sample #5 --> too many going for A
  --> no ONE distracter should get more than the key
  --> no one distracter should pull more than about half of the students
    -- doesn't leave enough for the correct answer and five percent for each alternative
- keep it for this time
- weaken it for next time
- Question not discriminating

              Upper   Low   Difference    D     Total   Difficulty
    6.  *A      7      7        0       .00      14       .47
         B      3      2
         C      2      1
         D      3      5
         O
- sample #6: the low group gets it as often as the high group
- on norm-referenced tests, the point is to rank students from best to worst
  - so individual test items should have good students get the question right, poor students get it wrong
  - the test overall decides who is a good or poor student on this particular topic
    - those who do well have more information and skills than those who do less well
  - so if on a particular question those with more skills and knowledge do NOT do better, something may be wrong with the question
- the question may be VALID, but off topic
  - E.G.: the rest of the test tests thinking skills, but this is a memorization question, so skilled and unskilled students are equally likely to recall the answer
  - should have a homogeneous test --> don't have a math item in with social studies
  - if you wanted to get really fancy, you'd do a separate item analysis for each cell of your blueprint... as long as you had six items per cell
- the question may be VALID, on topic, but not RELIABLE
  - addresses the specified objective, but isn't a useful measure of individual differences
  - asking Grade 10s the Capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- gives you a low D
- Negative Discrimination

              Upper   Low   Difference    D     Total   Difficulty
    7.  *A      7     10       -3      -.20      17       .57
         B      3      3
         C      2      1
         D      3      1
         O
- the D (discrimination) index is just the upper group minus the lower group, divided by the number in each group (worked below for item #7)
- varies from +1.0 to -1.0
  - if all the top group got it right, and all the lower group got it wrong = 100% = +1
  - if more of the bottom group get it right than the top group, you get a negative D index
- if you have a negative D, it means that students with less skill and knowledge overall are getting it right more often than those who the test says are better overall
  - in other words, the better you are, the more likely you are to get it wrong
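
- the arithmetic, using item #7 above (a one-line sketch; the variable names are mine):

    # D for item #7: upper correct minus lower correct, over group size
    upper_correct, lower_correct, group_size = 7, 10, 15
    print((upper_correct - lower_correct) / group_size)   # -0.2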
WHAT COULD ACCOUNT FOR THAT?
Two possibilities:
- usually means an ambiguous question
  - that is, one that's confusing the good students, while the weak students are too weak to see the problem
  - look at the question again, look at the alternatives the good students are going for, to see if you've missed something
OR:
- it might be off topic
  --> something weaker students are better at (like rote memorization) than good students
  --> not part of the same set of skills as the rest of the test --> suggests a design flaw, perhaps with the table of specifications
((- if you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two distinct skills, because by definition the low group is the high group on that bunch of questions
  --> you end up treating them as two separate tests))
- if you have a large enough sample (like the provincial exams), then we toss the item and either don't count it or give everyone credit for it
- with a sample of 100 students or less, it could just be random chance, so basically ignore it in terms of THIS administration
  - the kids wrote it; give them the mark they got
  - furthermore, if you keep dropping questions, you may find that you're starting to develop serious holes in your blueprint coverage -- a problem for sampling
- but you want to track this stuff FOR NEXT TIME
  - if it's negative administration after administration, consistently, it's likely not random chance; it's screwing up in some way
- want to build your future tests out of those items with high positive D indices
  - the higher the average D indices on the test, the more RELIABLE the test as a whole will be
- revise items to increase D
  --> if good students are selecting one particular wrong alternative, make it less attractive
  --> or increase the probability of their selecting the right answer by making it more attractive
- may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification
  - what this means is that there are some skills/knowledge in this unit which are unrelated to the rest of the skills/knowledge
    --> but they may still be important
  - e.g., the statistics part of this course may be terrible for those students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain from math, right... but it's still an important objective in this course
  - may lower the reliability of the test, but increases content validity
- Too Easy

              Upper   Low   Difference    D     Total   Difficulty
    8.   A      0      1
        *B     14     13        1       .06      27       .90
         C      0      1
         D      1      1
         O
- too easy or too difficult won't discriminate well either
- difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0 (nobody did)
REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION
- if the item is NOT miskeyed, or some other glaring problem, it's too late to change it after it's administered --> everybody got it right, OK, give them the mark
TOO DIFFICULT = 30 to 35% (used to be the rule in the Branch, now not...)
- if the item is too difficult, don't drop it just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there;
- and unless literally EVERYONE missed it, what do you do with the students who got it right?
  - give them bonus marks?
  - cheat them of a mark they got?
- furthermore, if you drop too many questions, you lose content validity (specs)
  --> if two or three got it right, it may just be random chance, so why should they get a bonus mark
- however, DO NOT REUSE questions with too high or too low difficulty (p) values in future (a quick filter is sketched below)
- if difficulty is over 85%, you're wasting space on a limited-item test
  - asking Grade 10s the Capital of Canada is probably a waste of their time and yours
    --> unless this is a particularly vital objective
- the same applies to items which are too difficult --> no use asking Grade 3s to solve a quadratic equation
- but you may want to revise the question to make it easier or harder rather than just toss it out cold
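
- a quick filter along those lines (the exact cutoffs are judgment calls, and the exceptions below would change them):

    def reuse_verdict(p, easy_cutoff=0.85, hard_cutoff=0.30):
        """Crude item-bank filter using the cutoffs from the notes above."""
        if p > easy_cutoff:
            return "too easy -- don't reuse as is"
        if p < hard_cutoff:
            return "too difficult -- revise before reusing"
        return "OK to bank"

    for item, p in {8: .90, 1: .17, 4: .70}.items():
        print(item, reuse_verdict(p))   # 8 too easy, 1 too difficult, 4 OK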
OR SOME EXCEPTIONS HERE:
- you may have consciously decided to develop a "Mastery"-style test
  --> these will often have very easy questions
  --> you expect everyone to get everything; you're trying to identify only those who are not ready to go on
  --> in which case, don't use any question which has a difficulty level below 85% or whatever
- or you may want a test to identify the top people in class, the Reach for the Top team, and design a whole test of really tough questions
  --> these have low difficulty values (i.e., very hard)
- so it depends a bit on what you intend to do with the test in question
- this is what makes the difficulty index (proportion) so handy
- you create a bank of items over the years
  --> using item analysis you get better questions all the time, until you have a whole bunch that work great
  --> you can then tailor-make a test for your class (see the sketch below)
    - you want to create an easier test this year, you pick questions with higher difficulty (p) values;
    - you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values
  --> for most applications you will want to set the difficulty level so that it gives you average marks, a nice bell curve
- the government uses 62.5 --> for a four-item multiple choice, that's the middle of the bell curve (halfway between the 25% chance score and 100%)
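
- one way that tailor-made pull from a bank could look (the bank and its p values are invented purely for illustration):

    # hypothetical bank of item difficulties built up over the years
    bank = {"q1": .90, "q2": .55, "q3": .62, "q4": .70, "q5": .35, "q6": .60}

    def pick_items(bank, n, target=0.625):
        """Take the n items whose p values sit closest to the target."""
        return sorted(bank, key=lambda q: abs(bank[q] - target))[:n]

    print(pick_items(bank, 4))   # ['q3', 'q6', 'q2', 'q4']
    # raise the target for an easier test; lower it for the gifted kids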
- start tests with an easy question or two to give students a running start
- make sure that the difficulty levels are spread out over the examination blueprint
  - not all hard geography questions and easy history questions
    - unfair to kids who are better at geography, worse at history
    - turns the class off geography if they equate it with tough questions
--> REMEMBER here that difficulty is different than complexity (Bloom)
  - so you can have a difficult recall-knowledge question and an easy synthesis question
  - synthesis and evaluation items will tend to be harder than recall questions, so if you find the higher levels are more difficult, OK, but try to balance cells as much as possible
  - certainly the content cells should be roughly the same
- OMIT

              Upper   Low   Difference    D     Total   Difficulty
    9.   A      2      1
         B      3      4
        *C      7      3        4       .26      10       .33
         D      1      1
         O      2      4
If near the end of the test:
  --> they didn't find it because it was on the next page
    -- a format problem
OR
  --> your test is too long; 6 of them (20%) didn't get to it (see the sketch below)
OR, if in the middle of the test:
  --> it totally baffled them, because:
    - it's way too difficult for these guys
    - or, since 2 from the high group skipped it too: ambiguous wording
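
- for what it's worth, the omit rate itself is one line of arithmetic (class size of 30 assumed, as throughout):

    # share of the class of 30 who skipped item #9 (2 upper + 4 lower)
    print((2 + 4) / 30)   # 0.2 --> 20% never answered it: find out why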
- RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)

              Upper   Low   Difference    D     Total   Difficulty
    10.  A      0      5
        *B     15      0       15      1.00     15        .50
         C      0      5
         D      0      5
         O
    ---------------------------------------------------
    11.  A      3      2
        *B      8      7        1       .06     15        .50
         C      2      3
         D      2      3
         O
- #10 is a perfect item --> each distracter gets at least 5, and the discrimination index is +1.0
  (ACTUALLY A PERFECT ITEM WOULD HAVE A DIFFICULTY OF 65% TO ALLOW FOR GUESSING)
- high discrimination (D) indices require optimal levels of difficulty
  - but optimal levels of difficulty do not assure high levels of D (see the sketch below)
- #11 has the same difficulty level, but a different D
- on a four-item multiple-choice, a student working totally by chance will get 25%
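
- a closing sketch of that point, using items #10 and #11 (same assumed layout as earlier; note the printout truncates 1/15 to .06 where this rounds to .07):

    def d_and_p(counts, key, group_size=15):
        """(D, difficulty) for the keyed alternative of one item."""
        u, l = counts[key]
        return round((u - l) / group_size, 2), round((u + l) / (2 * group_size), 2)

    item10 = {"A": (0, 5), "B": (15, 0), "C": (0, 5), "D": (0, 5)}
    item11 = {"A": (3, 2), "B": (8, 7), "C": (2, 3), "D": (2, 3)}
    print(d_and_p(item10, "B"))   # (1.0, 0.5)  --> the perfect item
    print(d_and_p(item11, "B"))   # (0.07, 0.5) --> same difficulty, low D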