Tuesday, September 18, 2012

make-up class (sent to your e-mails)


Dear All,

 Hi! I hope you're all doing fine after the inclement weather last Saturday. And I know that you are all excited to take your midterm examinations. :)

Anyway, I will not be around on Saturday as I have a conference to speak at in Cebu. Thus, we can't have our class on Saturday again. In this regard, I am asking you to have a 3-hour make-up class on either Wednesday (tomorrow), 7:30 AM-10:30 AM, or Thursday (September 20), 7:30 AM-10:30 AM. The 7:30 AM-9:00 AM slot will be your exam and 9:00 AM-10:30 AM will be for discussion.

I am hoping that we can have our make-up class on either of those dates. Please let me know ASAP if you decide that the class will be tomorrow. I will be posting this email on our blog as well.

Contact me on my mobile phone: 09162672875 or 09478905158

*Camille and Ahnie May, can you be my contact persons? Please keep me posted on the pulse of the class.

Thanks.
 
Sir HJ

Tuesday, September 11, 2012

all about item analysis


Purpose of Item Analysis

OK, you now know how to plan a test and build a test
  • Now you need to know how to do ITEM ANALYSIS 

      --> looks complicated at first glance, but actually quite simple

      --> even I can do this, and I'm a mathematical idiot

  • Talking about norm-referenced, objective tests
  • mostly multiple-choice, but the same principles apply for true-false, matching, and short answer
  • by analyzing results you can refine your testing


SERVES SEVERAL PURPOSES

  1. Fix marks for current class that just wrote the test
    • find flaws in the test so that you can adjust the marks before returning it to students
    • can find questions with two right answers, or that were too hard, etc., that you may want to drop from the exam

      • even had to do that occasionally on Diploma exams: even after 36 months in development, maybe 20 different reviewers, and extensive field tests, you still occasionally have a question whose problems only become apparent after you give the test
      • more common on classroom tests -- but instead of getting defensive, or making these decisions at random on the basis of which of your students can argue with you, do it scientifically
  2. More diagnostic information on students
    • another immediate payoff of item analysis
    Classroom level:

    • will tell you which questions they are all guessing on; or if you find a question which most of them found very difficult, you can reteach that concept
    • CAN do item analysis on pretests too:
      • if you find a question they all got right, don't waste more time on this area
      • find the wrong answers they are choosing to identify common misconceptions
      • can't tell this just from score on total test, or class average
    Individual level:

    • isolate specific errors this child made
    • after you've planned these tests, written perfect questions, and now analyzed the results, you're going to know more about these kids than they know themselves
  3. Build future tests, revise test items to make them better
    • REALLY pays off second time you teach the same course

      • by now you know how much work writing good questions is
      • studies have shown us that it is FIVE times faster to revise items that didn't work, using item analysis, than to try to replace them with completely new questions
      • a new item would just have new problems anyway

        --> this way you eventually get perfect items, the envy of your neighbours
    • SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means that you are responding to the needs of your students, so after a few years you build up a bank of test items from which you can custom-make tests for your class

      • know what class average will be before you even give the test because you will know approximately how difficult each item is before you use it;
      • can spread difficulty levels across your blueprint too...
  4. Part of your continuing professional development
    • doing the occasional item analysis will help teach you how to become a better test writer
    • and you're also documenting just how good your evaluation is
    • useful for dealing with parents or principals if there's ever a dispute
    • once you start bringing out all these impressive-looking stats, parents and administrators will believe that maybe you do know what you're talking about when you fail students...
    • parent says, "I think your question stinks,"

      well, "according to the item analysis, this question appears to have worked well -- it's your son that stinks"

      (just kidding! --actually, face validity takes priority over stats any day!)
    • and if the analysis shows that the question does stink, you've already dropped it before you've handed it back to the student, let alone the parent seeing it...
  5. Before and After Pictures
    • long term payoff
    • collect this data over ten years, not only get great item bank, but if you change how you teach the course, you can find out if innovation is working
    • if you have a strong class (as compared to the provincial baseline) but they do badly on the same test you used five years ago, the new textbook stinks.

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers fall down

  • they think they're doing a good job; they think they're doing good evaluation, but without doing item analysis, they can't really know
  • part of being a professional is going beyond the illusion of doing a good job to finding out whether you really are
  • but something just a lot of teachers don't know HOW to do
  • do it indirectly when kids argue with them...wait for complaints from students, students' parents, and maybe other teachers...
ON THE OTHER HAND....

  • I do realize that I am advocating here more work for you in the short term, but, it will pay off in the long term
But realistically:

*Probably only doing it for your most important tests

  • end of unit tests, final exams --> summative evaluation
  • especially if you're using common exams with other teachers
  • common exams give you bigger sample to work with, which is good
  • makes sure that questions other teachers wrote are working for YOUR class
  • maybe they taught different stuff in a different way
  • impress the daylights out of your colleagues
*Probably only doing it for test questions you are likely going to reuse next year

*Spend less time on item analysis than on revising items

  • item analysis is not an end in itself,
  • no point unless you use it to revise items,
  • and help students on basis of information you get out of it
I also find that, if you get into it, it is kind of fascinating. When stats turn out well, it's objective, external validation of your work. When stats turn out differently than you expect, it becomes a detective mystery as you figure out what went wrong.
But you'll have to take my word on this until you try it on your own stuff.


Sample Size

The larger the sample, the better and more accurate the data you get.

  • government would not accept any sample smaller than 250 students, balanced for urban/rural, public/separate, and zone: north/south/central/etc.
  • but not much you can do about small class/sample size, except

    (a) give test to more than one class (secondary teachers)
  • need to balance reusing the test three periods in a row against the chance of students in the first class tipping off the third class on the questions...

    (b) accumulate stats over a couple of years


    • IF you haven't changed the question AT ALL, you can add this year's results to last year's to increase the sample size
    • If you have changed the question, you can compare this year's stats to last year's to see if they have changed in direction you wanted
    • have to be cautious because with very small samples, some changes may just be random chance...


    (c) Mostly just do it with your 30 students and just be cautious how you interpret these results

    • procedures will give you hints at 30 students, proof positive at 400

Procedure works fastest if you have had students use a separate answer sheet

  • separate answer sheet means faster marking, so more time for interpretation
  • separate answer sheet means more accurate marking too

    • believe me, if you're marking "A, B, C, D", turning pages, sooner or later you're going to turn two pages at once and get 3/4 of the way through the booklet before you notice that you're now on the last page and have six more answers on the answer key than you have answers to mark...

  • separate answer sheet means the counting you have to do for item analysis is much faster
UNLESS you're in grades 1-3, in which case have them write on the booklet, because they get confused transferring answers to a separate page




Eight Simple Steps to Item Analysis



  1. Score each answer sheet, write score total on the corner

    • obviously have to do this anyway
  2. Sort the pile into rank order from top to bottom score

      (1 minute, 30 seconds tops)
  3. If normal class of 30 students, divide class in half

    • same number in top and bottom group:
    • toss middle paper if odd number (put aside)
  4. Take 'top' pile, count number of students who responded to each alternative

    • fast way is simply to sort piles into "A", "B", "C", "D" (or true/false, or the type of error you get for short answer and fill-in-the-blank)

      OR set up a spreadsheet if you're familiar with computers (a small script sketch follows after step 8)

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0 
      *B     4 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


    • repeat for lower group


    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


    • this is the time-consuming part --> but not that bad, can do it while watching TV, because you're just sorting piles


    THREE POSSIBLE SHORT CUTS HERE (STEP 4)
    (A) If you have a large sample of around 100 or more, you can cut down the sample you work with

    • take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all 100)
    • put middle 46 aside for the moment

      • the larger the sample the more accurate, but you have to trade off against labour; using the top 1/3 or so is probably good enough by the time you get to 100 -- 27% is the magic figure statisticians tell us to use
    • I'd use halves at 30, but you could just use a sample of top 10 and bottom 10 if you're pressed for time

      • but it means a single student changes stats by 10%
      • trading off speed for accuracy...
    • but I'd rather have you doing ten and ten than nothing
    (B) Second short cut, if you have access to a photocopier (budgets)

    • photocopy answer sheets, cut off identifying info
        (can't use if handwriting is distinctive)
    • colour code high and low groups --> dab of marker pen color
    • distribute randomly to students in your class so they don't know whose answer sheet they have
    • get them to raise their hands
      • for #6, how many have "A" on blue sheet?
        how many have "B"; how many "C"
      • for #6, how many have "A" on red sheet....
    • some reservations because they can screw you up if they don't take it seriously
    • another version of this would be to hire kid who cuts your lawn to do the counting, provided you've removed all identifying information
      • I actually did this for a bunch of teachers at one high school in Edmonton when I was in university for pocket money
    (C) Third shortcut, IF you can't use separate answer sheet, sometimes faster to type than to sort

    
          SAMPLE OF TYPING FORMAT 
             FOR ITEM ANALYSIS
    
    
    ITEM #    1  2  3  4  5  6  7  8  9  10
    
    KEY       T  F  T  F  T  A  D  C  A  B
    
    STUDENT
    
    Kay       T  T  T  F  F  A  D  D  A  C
    
    Jane      T  T  T  F  T  A  D  C  A  D
    
    John      F  F  T  F  T  A  D  C  A  B
    
    

    • type the name, then T or F, or A, B, C, D -- all left hand on the typewriter, leaving the right hand free to turn pages (from Sax)
    • IF you have a computer program -- there are some kicking around -- it will give you all the stats you need, plus bunches more you don't, automatically after this stage. OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

  5. Subtract the number of students in lower group who got question right from number of high group students who got it right
    • quite possible to get a negative number



    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2 
       C     1 
       D     1 
       O
    
    *=Keyed Answer

  6. Divide the difference by number of students in upper or lower group
    • in this case, divide by 15
    • this gives you the "discrimination index" (D)


    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13 
       C     1 
       D     1 
       O
    
    *=Keyed Answer

  7. Total number who got it right

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13    6 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


  8. Divide total by total number of students
    • difficulty = (proportion who got it right (p) )

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13    6      .20 
       C     1 
       D     1 
       O
    
    *=Keyed Answer



      You will NOTE the complete lack of complicated statistics --> counting, adding, dividing --> no tricky formulas required for this
    • not going to worry about corrected point biserials etc.
    • one of the advantages of using a fixed number of alternatives
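
If you do set this up on a computer (the spreadsheet shortcut under step 4), the whole eight-step procedure is only a few lines of code. Below is a minimal sketch in Python, not a prescribed program; the answer key and the three response strings are the ones from the typing-format sample under shortcut (C), and the rest of the class would be typed in the same way.

    # Item analysis for a teacher-constructed objective test.
    # Responses are typed as one string per student, in the same order
    # as the answer key (shortcut C); these three rows are the sample above.

    key = "TFTFTADCAB"

    students = {
        "Kay":  "TTTFFADDAC",
        "Jane": "TTTFTADCAD",
        "John": "FFTFTADCAB",
        # ... type in the rest of the class here ...
    }

    # Step 1: score each paper (number of answers matching the key).
    scores = {name: sum(a == k for a, k in zip(resp, key))
              for name, resp in students.items()}

    # Step 2: rank the papers from top score to bottom score.
    ranked = sorted(students, key=lambda name: scores[name], reverse=True)

    # Step 3: split the class in half; the middle paper is set aside if odd.
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]

    # Steps 4-8: count responses per alternative in each group for every item,
    # then compute the discrimination index D and the difficulty p.
    for i, correct in enumerate(key):
        upper_counts, lower_counts = {}, {}
        for name in upper:
            ans = students[name][i]
            upper_counts[ans] = upper_counts.get(ans, 0) + 1
        for name in lower:
            ans = students[name][i]
            lower_counts[ans] = lower_counts.get(ans, 0) + 1

        difference = upper_counts.get(correct, 0) - lower_counts.get(correct, 0)  # step 5
        d_index = difference / half                                               # step 6
        total_right = sum(students[name][i] == correct for name in students)      # step 7
        difficulty = total_right / len(students)                                  # step 8

        print(f"Item {i + 1}: D = {d_index:+.2f}, difficulty = {difficulty:.2f}, "
              f"upper {upper_counts}, lower {lower_counts}")

Run on a full class list, the printout gives you, item by item, the same numbers you would enter on the item analysis form: the counts per alternative for each group, the difference, D, the total, and the difficulty.
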
Interpreting Item Analysis


Let's look at what we have and see what we can see (a small script that automates these checks is sketched after the examples below).

  1. What do we see looking at this first one? [Potential Miskey]
    
           Upper    Low    Difference    D    Total    Difficulty
    
    1. *A    1      4        -3        -.2       5          .17 
        B    1      3
        C   10      5
        D    3      3
        O   <----means omit or no answer
    


    • #1, more high group students chose C than A, even though A is supposedly the correct answer
    • more low group students chose A than high group so got negative discrimination;
    • only 17% of the class got it right
    • most likely you just wrote the wrong answer key down 
        --> this is an easy and very common mistake for you to make
      • better you find out now, before you hand it back, than when the kids complain
      • OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
    • so check it out and rescore that question on all the papers before handing them back
    • Makes it 10-5, Difference = 5; D = .33; Total = 15; difficulty = .50
      --> nice item
    OR:

    • you check and find that you didn't miskey it --> that is the answer you thought

      two possibilities:
      1. one possibility is that you made a slip of the tongue and taught them the wrong answer
        • anything you say in class can be taken down and used against you on an examination....
      2. more likely it means even "good" students are being tricked by a common misconception --> You're not supposed to have trick questions, so may want to dump it

          --> give those who got it right their point, but total rest of the marks out of 24 instead of 25
    If scores are high, or you want to make a point, might let it stand, and then teach to it --> sometimes if they get caught, will help them to remember better in future
    such as:
    • very fine distinctions
    • crucial steps which are often overlooked
    REVISE it for next time to weaken "C"
      -- alternatives are not supposed to draw more than the keyed answer
      -- almost always an item flaw, rather than useful distinction

  2. What can we see with #2: [Can identify ambiguous items]
    
           Upper    Low   Difference   D    Total   Difficulty
    
    2.  A    6      5
        B    1      2         
       *C    7      5        2        .13     12      .40
        D    1      3
        O
    
    


    • #2, about equal numbers of top students went for A and C. Suggests they couldn't tell which was correct

      • either, students didn't know this material (in which case you can reteach it)
      • or the item was defective --->
    • look at their favorite alternative again, and see if you can find any reason they could be choosing it
    • often items that look perfectly straightforward to adults are ambiguous to students (favorite examples of ambiguous items)
    • if you NOW realize that A was a defensible answer, rescore before you hand it back to give everyone credit for either A or C -- avoids arguing with you in class
    • if it's clearly a wrong answer, then you now know which error most of your students are making to get wrong answer
    • useful diagnostic information on their learning, your teaching

  3. Equally to all alternatives
          Upper    Low    Difference    D    Total    Difficulty
    
    3.  A     4     3
        B     3     4
       *C     5     4       1          .06       9        .30         
        D     3     4
        O
    
    


    • item #3, students respond about equally to all alternatives
    • usually means they are guessing. Three possibilities:

      1. may be material you didn't actually get to yet

        • you designed test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before holidays....
        • or item on a common exam that you didn't stress in your class
      2. item so badly written students have no idea what you're asking
      3. item so difficult students just completely baffled
    • review the item:
      • if badly written (by another teacher) or on material your class hasn't taken, toss it out, rescore the exam out of the lower total

        • BUT give credit to those that got it, to a total of 100%

      • if seems well written, but too hard, then you know to (re)teach this material for rest of class....
      • maybe the 3 who got it are top three students,

        • tough but valid item:
        • OK, if item tests valid objective
        • want to provide occasional challenging question for top students
        • but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"

  4. Alternatives aren't working
    
           Upper    Low   Difference    D   Total   Difficulty
    
    4.  A     1      5
       *B    14      7       7        .47     21        .70
        C     0      2
        D     0      0 
        O   
    
    


    • example #4 --> no one fell for D --> so it is not a plausible alternative
    • question is fine for this administration, but revise item for next time
    • toss alternative D, replace it with something more realistic
    • each distracter has to attract at least 5% of the students

      • class of 30, should get at least two students

    • or might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
    if two alternatives don't draw any students --> might consider redoing as true/false

  5. Distracter too attractive
    
           Upper   Low   Difference   D   Total   Difficulty   
    
    5.  A    7     10
        B    1      2
        C    1      1
       *D    5      2       3        .20     7       .23 
        O
    
     


    • sample #5 --> too many going for A

        --> no ONE distracter should get more than key

        --> no one distracter should pull more than about half of students

        -- doesn't leave enough for correct answer and five percent for each alternative
    • keep for this time
    • weaken it for next time

  6. Question not discriminating
    
          Upper   Low   Difference   D   Total   Difficulty
    
    
    6. *A    7     7        0        .00     14      .47          
        B    3     2
        C    2     1
        D    3     5
        O
    
    


    • sample #6: low group gets it as often as high group
    • on norm-referenced tests, point is to rank students from best to worst
    • so individual test items should have good students get question right, poor students get it wrong
    • test overall decides who is a good or poor student on this particular topic

      • those who do well have more information, skills than those who do less well
      • so if on a particular question those with more skills and knowledge do NOT do better, something may be wrong with the question
    • question may be VALID, but off topic

      • E.G.: the rest of the test tests thinking skills, but this is a memorization question, so skilled and unskilled are equally likely to recall the answer
      • should have homogeneous test --> don't have a math item in with social studies
      • if wanted to get really fancy, should do separate item analysis for each cell of your blueprint...as long as you had six items per cell

    • question is VALID, on topic, but not RELIABLE

      • addresses the specified objective, but isn't a useful measure of individual differences
      • asking Grade 10s the capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- will give you a low D

  7. Negative Discrimination
    
           Upper   Low   Difference   D   Total   Difficulty
    
    
    7. *A    7    10      -3       -.20     17       .57  
        B    3     3
        C    2     1
        D    3     1
        O
    
    


    • D (discrimination) index is just upper group minus lower group
    • varies from +1.0 to -1.0
    • if all top got it right, all lower got it wrong = 100% = +1
    • if more of the bottom group get it right than the top group, you get a negative D index
    • if you have a negative D, means that students with less skills and knowledge overall, are getting it right more often than those who the test says are better overall
    • in other words, the better you are, the more likely you are to get it wrong
    WHAT COULD ACCOUNT FOR THAT?

    Two possibilities:

    • usually means an ambiguous question

      • that is confusing good students, but weak students too weak to see the problem
      • look at question again, look at alternatives good students are going for, to see if you've missed something


      OR:
    • or it might be off topic

      --> something weaker students are better at (like rote memorization) than good students

      --> not part of the same set of skills as the rest of the test --> suggests a design flaw with the table of specifications perhaps
    ((if you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two distinct skills, because by definition the low group is the high group on that bunch of questions
      --> end up treating them as two separate tests))

    • if you have a large enough sample (like the provincial exams) then we toss the item and either don't count it or give everyone credit for it
    • with sample of 100 students or less, could just be random chance, so basically ignore it in terms of THIS administration

      • kids wrote it, give them mark they got
    • furthermore, if you keep dropping questions, may find that you're starting to develop serious holes in your blueprint coverage -- problem for sampling

      • but you want to track this stuff FOR NEXT TIME
    • if it's negative on administration after administration, consistently, likely not random chance, it's screwing up in some way
    • want to build your future tests out of those items with high positive D indices
    • the higher the average D indices on the test, the more RELIABLE the test as a whole will be
    • revise items to increase D

        -->if good students are selecting one particular wrong alternative, make it less attractive

        -->or increase probability of their selecting right answer by making it more attractive
    • may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification

      • what this means is that there are some skills/knowledge in this unit which are unrelated to rest of the skills/knowledge
        --> but may still be important

    • e.g., the statistics part of this course may be terrible for those students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain from math, right... but it is still an important objective in this course

      • may lower reliability of test, but increases content validity

  8. Too Easy
    
           Upper   Low   Difference   D   Total   Difficulty
    
    8.  A     0     1
       *B    14    13       1       .06     27       .90
        C     0     1
        D     1     1
        O
    
    


    • too easy or too difficult won't discriminate well either
    • difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0 (nobody)
    REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION

    • if the item is NOT miskeyed and has no other glaring problem, it's too late to change after it's administered --> everybody got it right, OK, give them the mark
    TOO DIFFICULT = below 30 to 35% (used to be the rule in the Branch, now not...)
    • if the item is too difficult, don't drop it, just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there;
    • and unless literally EVERYONE missed it, what do you do with the students who got it right?
    • give them bonus marks?
    • cheat them of a mark they got?
    furthermore, if you drop too many questions, lose content validity (specs)

      --> if two or three got it right may just be random chance,

        so why should they get a bonus mark


    • however, DO NOT REUSE questions with too high or low difficulty (p) values in future
    if difficulty is over 85%, you're wasting space on a limited-item test

    • asking Grade 10s the capital of Canada is probably a waste of their time and yours --> unless this is a particularly vital objective
    • the same applies to items which are too difficult --> no use asking Grade 3s to solve a quadratic equation
    • but you may want to revise question to make it easier or harder rather than just toss it out cold
    OR SOME EXCEPTIONS HERE: 

    You may have consciously decided to develop a "Mastery"-style test
        --> will often have very easy questions -- you expect everyone to get everything and are trying to identify only those who are not ready to go on

        --> in which case, don't use any question which does NOT have a difficulty level of 85% or above
    Or you may want a test to identify the top people in class, the Reach for the Top team, and design a whole test of really tough questions

        --> have low difficulty values (i.e., very hard)


    • so depends a bit on what you intend to do with the test in question
    • this is what makes the difficulty index (proportion) so handy
    1. you create a bank of items over the years

        --> using item analysis you get better questions all the time, until you have a whole bunch that work great

        -->can then tailor-make a test for your class

        if you want to create an easier test this year, you pick questions with higher difficulty (p) values;

        if you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values

        --> for most applications will want to set difficulty level so that it gives you average marks, nice bell curve

        • government uses 62.5% --> four-item multiple choice, middle of the bell curve

    2. start tests with an easy question or two to give students a running start
    3. make sure that the difficulty levels are spread out over examination blueprint

      • not all hard geography questions, easy history

        • unfair to kids who are better at geography, worse at history
        • turns class off geography if they equate it with tough questions
        --> REMEMBER here that difficulty is different from complexity (Bloom)
      • so can have difficult recall knowledge question, easy synthesis
      • synthesis and evaluation items will tend to be harder than recall questions so if find higher levels are more difficult, OK, but try to balance cells as much as possible
      • certainly content cells should be roughly the same

  9. OMIT
    
           Upper   Low   Difference   D   Total   Difficulty
    
    9.  A    2     1
        B    3     4
       *C    7     3        4       .26     10       .33
        D    1     1
        O    2     4     
    
    

    If near end of the test

    1. --> they didn't find it because it was on the next page

      -- format problem
    OR

    2. --> your test is too long; 6 of them (20%) didn't get to it


    OR, if middle of the test:

    3. --> it totally baffled them because:

      • way too difficult for these guys
      • or, since 2 from the high group omitted it too, ambiguous wording

  10. & 11. RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)
    
          Upper   Low   Difference   D   Total   Difficulty
    
    10.  A     0    5
        *B    15    0        15      1.0    15       .50 
         C     0    5
         D     0    5
         O
       ---------------------------------------------------
    11.  A    3     2
        *B    8     7         1      .06     15      .50
         C    2     3
         D    2     3
         O
    
    


    • 10 is a perfect item --> each distracter gets at least 5
      discrimination index is +1.0
    (ACTUALLY PERFECT ITEM WOULD HAVE DIFFICULTY OF 65% TO ALLOW FOR GUESSING)

    • high discrimination D indices require optimal levels of difficulty
    • but optimal levels of difficulty do not assure high levels of D

    • 11 has same difficulty level, different D

      • on a four-item multiple-choice test, a student answering totally by chance will get 25%
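
As a first pass, you can even automate these checks. A rough sketch in Python, assuming you already have the counts per alternative for the upper and lower groups; the two items encoded below are items 1 and 4 from the examples above, and the thresholds (a distracter outdrawing the key, zero or negative D, distracters drawing under 5%, difficulty over .85 or under .30) are just the rules of thumb from this post, not fixed standards.

    # First-pass screening of item statistics using the rules of thumb above.
    # Each item: keyed answer plus {alternative: count} for upper and lower groups.
    # Items 1 and 4 are the worked examples from this post.

    GROUP = 15        # students per group (half of a class of 30)
    CLASS_SIZE = 30

    items = {
        1: {"key": "A",
            "upper": {"A": 1, "B": 1, "C": 10, "D": 3},
            "lower": {"A": 4, "B": 3, "C": 5,  "D": 3}},
        4: {"key": "B",
            "upper": {"A": 1, "B": 14, "C": 0, "D": 0},
            "lower": {"A": 5, "B": 7,  "C": 2, "D": 0}},
    }

    for number, item in items.items():
        key = item["key"]
        total = {alt: item["upper"].get(alt, 0) + item["lower"].get(alt, 0)
                 for alt in set(item["upper"]) | set(item["lower"])}

        d_index = (item["upper"].get(key, 0) - item["lower"].get(key, 0)) / GROUP
        difficulty = total.get(key, 0) / CLASS_SIZE
        flags = []

        # Potential miskey or ambiguous item: a distracter outdraws the key.
        if any(alt != key and count > total.get(key, 0) for alt, count in total.items()):
            flags.append("a distracter outdraws the key -- check for miskey or ambiguity")

        # Zero or negative discrimination.
        if d_index <= 0:
            flags.append("zero or negative D -- ambiguous or off topic?")

        # Distracters that aren't working (drawing under 5% of the class).
        dead = [alt for alt, count in total.items()
                if alt != key and count < 0.05 * CLASS_SIZE]
        if dead:
            flags.append(f"distracters drawing under 5%: {dead}")

        # Too easy or too hard to discriminate well.
        if difficulty > 0.85:
            flags.append("too easy (difficulty over .85)")
        elif difficulty < 0.30:
            flags.append("too hard (difficulty under .30)")

        print(f"Item {number}: D = {d_index:+.2f}, difficulty = {difficulty:.2f}")
        for flag in flags:
            print("   ", flag)

The script only flags items for a second look; as noted above, face validity still takes priority over the stats.
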
Program Evaluation

When your kids write the Diploma or Achievement Tests, the Department sends out a printout of how your class did compared to everybody else in the province.
Three types of report:

  1. ASSESSMENT HIGHLIGHTS (pamphlet)

    • how are kids doing today in terms of meeting the standards?
    • how are they doing compared to four years ago? eight years ago?
      (monitor over time)
  2. PROVINCIAL REPORT

    • format keeps changing --> some years all tests in one book to save on paper and mailing costs; other years each exam gets its own report
    • tons of technical information (gender stuff, etc.)
  3. JURISDICTION & SCHOOL REPORTS

      (up to the superintendent what happens to these after that
      --> can publish in the newspaper, keep them secret at central office only, etc.)

    • get your hands on these and interpret them
    • either you do it or someone else will do it for/to you
    • better teachers take responsibility rather than top down
    • new table formats are so easy to interpret no reason not to
    • this means you can compare their responses to the responses of 30,000 students across the province

      • will help you calibrate your expectations for this class
      • is your particular class high or low one?
      • have you set your standards too high or too low?
      • giving everyone 'C's because you think they ought to do better than this, but they all ace the provincial tests?
    Who knows where this is? OVERHEAD: SCHOOL TABLE 2 (June 92 GRADE 9 Math Achievement)

    • check table 2 for meeting standard of excellence
    • Standards set by elaborate committee structure
    This example (overhead): Your class had 17 students
    Total test out of 49 (meaning a test of 50, but one item was dropped after item analysis)

      standard setting procedures decided that 42/49 is standard of EXCELLENCE for grade 9s in Alberta
      next column shows they expect 15% to reach this standard

      standard setting procedure decided that 23 out of 49 is the Acceptable
      standard; next column says they expect 85% to reach that standard

      columns at end of table show that actually, only 8.9% made standard of excellence, and only 67.4% made acceptable standard

      (bad news!)

      but looking at YOUR class, 5.9%, almost 6%, made the standard of excellence (so fewer than the province as a whole), but on the other hand, 76.5% met the acceptable standard.
    Need the comparison -- otherwise, the fact that you only got 6% to excellence might sound bad...

      Interpretation: either you only have one excellent math student in your class, or you are teaching to the acceptable standard but not encouraging excellence.
    BUT can use tables to look deeper,

    • use tables to identify strengths and weaknesses in student learning
    • and therefore identify your own strength and weaknesses
    Problem solving & knowledge/skills broken down --> table of specs topics

    Interestingly, though, above provincial on problem solving at excellence...

    ASK: how do you explain the % meeting standard on knowledge and the % meeting standard on problems both being higher than the % meeting standard on the whole test?

    Answer: low correlation between performance on the two types of questions
    (i.e., those who met standard on the one often did not on the other)

      which means (a) can't assume that easy/hard = Bloom's taxonomy

      and (b) that you have to give students both kinds of questions on your test or you are being unfair to group who is better at the other stuff
    Don't know where this is. OVERHEAD: SCHOOL TABLE 5.1 (GRADE 9 MATH, JUNE 92)

    • check tables 5.1 to 5.6 for particular areas of strengths and weaknesses
    • look for every question where students in your school were 5% or more different on the keyed answer from the provincial results (a small sketch of this check appears at the end of this post)

      • if 5% or more higher, is a particular strength of your program
      • if 5% or more lower, is a particular weakness
  • note that the score on the question is irrelevant, only the difference from the rest of the province 

      --> e.g., if you only got 45% but the province only got 35%, then that's a significant strength

      --> the fact that less than 50% just means it was a really tough question, too hard for this grade

    • similarly, just because you got 80% doesn't make your class good if the province got 98%

      • if you find all strengths or all weaknesses, ask where the gap is smallest
      • least awful = strength; least strong = weakness to work on
    THIS EXAMPLE? all above provincial scores on these skills

    converts a decimal into a fraction: 76.5 - 60.9 = 15.6% above the provincial norm

    so decimal to fraction is a strength

    all of these are good, but find the least good --> that's the area to concentrate on

    question 10: only a 4.5% difference -- so your weak spot, the one area you aren't significantly above the rest of the province

    You can even begin to set standards in your class as province does

      --> i.e., ask yourself BEFORE the test how many of these questions should your class be able to do on this test?
    Then look at actual performance.

    How did my students do? Compared to what?
      my classroom expectations
      school's expectations
      jurisdiction's expectations
      provincial expectations
      the last/previous test administered
      community expectations
      (each jurisdiction now has its own public advisory committee)
    You can even create your own statistics to compare with provincial standard

    • lots of teachers recycle Diploma and Achievement test questions, but they only do it to prep kids for actual exam --> losing all that diagnostic info
    HOWEVER: avoid comparisons between schools

    • serves no useful purpose, has no logic since taken out of context
    • e.g., comparing a cancer clinic and a walk-in clinic --> a higher death rate in the cancer clinic doesn't mean it's worse; it may be the best cancer clinic in the world, doing a great job given the more serious nature of the problems it faces
    • invidious comparisons like this become "blaming" exercise
    • self-fulfilling prophecy: parents pull kids from that school
    Provincial authorities consider such comparisons a misuse of results

    • school report = your class if only one class;
      but if two or more classes, then we are talking about your school's program

      --> forces you to get together with other teachers to find out what they're doing

      --> pool resources, techniques, strategies to address problem areas....
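
The arithmetic behind these comparisons is just as simple as the item analysis itself. A small sketch in Python of the two checks discussed above: the percent of your class meeting the provincially set cut scores, and the 5% rule for flagging strengths and weaknesses. The cut scores (42 and 23 out of 49), the provincial 8.9% / 67.4% figures, and the question 1 percentages are from the Grade 9 Math example above; the class scores are made-up numbers chosen to reproduce the 5.9% and 76.5% figures, and the percentages for questions 2, 3, and 10 are hypothetical.

    # Comparing your class to the provincial results, as in the school reports.

    EXCELLENCE_CUT = 42   # out of 49, standard of excellence
    ACCEPTABLE_CUT = 23   # out of 49, acceptable standard

    # Hypothetical scores for a class of 17.
    class_scores = [44, 38, 35, 33, 31, 30, 29, 28, 27,
                    26, 25, 24, 23, 22, 21, 20, 18]

    pct_excellence = 100 * sum(s >= EXCELLENCE_CUT for s in class_scores) / len(class_scores)
    pct_acceptable = 100 * sum(s >= ACCEPTABLE_CUT for s in class_scores) / len(class_scores)
    print(f"Standard of excellence: {pct_excellence:.1f}% of the class (province: 8.9%)")
    print(f"Acceptable standard:    {pct_acceptable:.1f}% of the class (province: 67.4%)")

    # The 5% rule: percent choosing the keyed answer on each question,
    # your school versus the province.
    school_pct   = {1: 76.5, 2: 58.8, 3: 70.6, 10: 52.9}
    province_pct = {1: 60.9, 2: 51.0, 3: 64.1, 10: 48.4}

    for q in school_pct:
        gap = school_pct[q] - province_pct[q]
        if gap >= 5:
            verdict = "strength"
        elif gap <= -5:
            verdict = "weakness"
        else:
            verdict = "not significantly different from the province"
        print(f"Question {q}: {gap:+.1f}% vs. province -- {verdict}")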
