Tuesday, September 18, 2012

make-up class (sent to your e-mails)


Dear All,

 Hi! I hope you're all doing fine after the inclement weather last Saturday. And I know that you are all excited to take your midterm examinations. :)

Anyway, I will not be around on Saturday as I have a conference to speak at in Cebu. Thus, we can't have our class on Saturday again. In this regard, I am asking you to have a 3-hour make-up class on either Wednesday (tomorrow), 7:30 AM-10:30 AM, or Thursday (September 20), 7:30 AM-10:30 AM. The 7:30 AM-9:00 AM slot will be your exam and 9:00 AM-10:30 AM will be for discussion.

I am hoping that we can have our make-up class on either of those dates. Please let me know ASAP if you decide that the class will be tomorrow. I will be posting this email on our blog as well.

Contact me on my mobile phone: 09162672875 or 09478905158

*Camille and Ahnie May, can you be my contact persons? Please keep me posted on the pulse of the class.

Thanks.
 
Sir HJ

Tuesday, September 11, 2012

all about item analysis


Purpose of Item Analysis

OK, you now know how to plan a test and build a test
  • Now you need to know how to do ITEM ANALYSIS 

      --> looks complicated at first glance, but actually quite simple

      --> even I can do this, and I'm a mathematical idiot

  • Talking about norm-referenced, objective tests
  • mostly multiple-choice, but the same principles apply for true-false, matching, and short answer
  • by analyzing results you can refine your testing


SERVES SEVERAL PURPOSES

  1. Fix marks for current class that just wrote the test
    • find flaws in the test so that you can adjust the marks before returning it to students
    • can find questions with two right answers, or that were too hard, etc., that you may want to drop from the exam

      • even had to do that occasionally on Diploma exams: even after 36 months in development, maybe 20 different reviewers, and extensive field tests, you still occasionally have a question whose problems only become apparent after you give the test
      • more common on classroom tests -- but instead of getting defensive, or making these decisions at random on the basis of which of your students can argue with you, do it scientifically
  2. More diagnostic information on students
    • another immediate payoff of item analysis
    Classroom level:

    • will tell you which questions they are all guessing on; or if you find a question which most of them found very difficult, you can reteach that concept
    • CAN do item analysis on pretests too:
      • if you find a question they all got right, don't waste more time on this area
      • find the wrong answers they are choosing to identify common misconceptions
      • can't tell this just from score on total test, or class average
    Individual level:

    • isolate specific errors this child made
    • after you've planned these tests, written perfect questions, and now analyzed the results, you're going to know more about these kids than they know themselves
  3. Build future tests, revise test items to make them better
    • REALLY pays off second time you teach the same course

      • by now you know how much work writing good questions is
      • studies have shown us that it is FIVE times faster to revise items that didn't work, using item analysis, than to try to replace them with completely new questions
      • a new item would just have new problems anyway

        --> this way you eventually get perfect items, the envy of your neighbours
    • SHOULD NOT REUSE WHOLE TESTS --> diagnostic teaching means that you are responding to the needs of your students, so after a few years you build up a bank of test items from which you can custom-make tests for your class

      • know what class average will be before you even give the test because you will know approximately how difficult each item is before you use it;
      • can spread difficulty levels across your blueprint too...
  4. Part of your continuing professional development
    • doing the occasional item analysis will help teach you how to become a better test writer
    • and you're also documenting just how good your evaluation is
    • useful for dealing with parents or principals if there's ever a dispute
    • once you start bringing out all these impressive-looking stats, parents and administrators will believe that maybe you do know what you're talking about when you fail students...
    • parent says, "I think your question stinks,"

      well, "according to the item analysis, this question appears to have worked well -- it's your son that stinks"

      (just kidding! --actually, face validity takes priority over stats any day!)
    • and if the analysis shows that the question does stink, you've already dropped it before you've handed it back to the student, let alone the parent seeing it...
  5. Before and After Pictures
    • long term payoff
    • collect this data over ten years, not only get great item bank, but if you change how you teach the course, you can find out if innovation is working
    • if you have a strong class (as compared to the provincial baseline) but they do badly on the same test you used five years ago, the new textbook stinks.

ITEM ANALYSIS is one area where even a lot of otherwise very good classroom teachers fall down

  • they think they're doing a good job; they think they're doing good evaluation, but without doing item analysis, they can't really know
  • part of being a professional is going beyond the illusion of doing a good job to finding out whether you really are
  • but something just a lot of teachers don't know HOW to do
  • do it indirectly when kids argue with them...wait for complaints from students, students' parents, and maybe other teachers...
ON THE OTHER HAND....

  • I do realize that I am advocating here more work for you in the short term, but, it will pay off in the long term
But realistically:

*Probably only doing it for your most important tests

  • end of unit tests, final exams --> summative evaluation
  • especially if you're using common exams with other teachers
  • common exams give you bigger sample to work with, which is good
  • makes sure that questions other teachers wrote are working for YOUR class
  • maybe they taught different stuff in a different way
  • impress the daylights out of your colleagues
*Probably only doing it for test questions you are likely going to reuse next year

*Spend less time on item analysis than on revising items

  • item analysis is not an end in itself,
  • no point unless you use it to revise items,
  • and help students on basis of information you get out of it
I also find that, if you get into it, it is kind of fascinating. When stats turn out well, it's objective, external validation of your work. When stats turn out differently than you expect, it becomes a detective mystery as you figure out what went wrong.
But you'll have to take my word on this until you try it on your own stuff.


Sample Size

The larger the sample, the better and more accurate the data you get.

  • government would not accept any sample smaller than 250 students, balanced for urban/rural, public/separate, and zone: north/south/central/etc.
  • but not much you can do about small class/sample size, except

    (a) give test to more than one class (secondary teachers)
  • need to balance reusing the test three periods in a row against the chance of students in the first class tipping off the third class on the questions...

    (b) accumulate stats over a couple of years


    • IF you haven't changed the question AT ALL, you can add this year's results to last year's to increase the sample size
    • If you have changed the question, you can compare this year's stats to last year's to see if they have changed in direction you wanted
    • have to be cautious because with very small samples, some changes may just be random chance...


    (c) Mostly just do it with your 30 students and just be cautious how you interpret these results

    • procedures will give you hints at 30 students, proof positive at 400

Procedure works fastest if you have had students use a separate answer sheet

  • separate answer sheet means faster marking, so more time for interpretation
  • separate answer sheet means more accurate marking too

    • believe me, if you're marking "A, B, C, D", turning pages, sooner or later you're going to turn two pages at once and get 3/4 of the way through the booklet before you notice that you're now on the last page and have six more answers on the answer key than you have answers to mark...

  • separate answer sheet means the counting you have to do for item analysis is much faster
UNLESS you're in grades 1-3, in which case have them write on the booklet, because they get confused transferring answers to a separate page




Eight Simple Steps to Item Analysis



  1. Score each answer sheet, write score total on the corner

    • obviously have to do this anyway
  2. Sort the pile into rank order from top to bottom score

      (1 minute, 30 seconds tops)
  3. If normal class of 30 students, divide class in half

    • same number in top and bottom group:
    • toss middle paper if odd number (put aside)
  4. Take 'top' pile, count number of students who responded to each alternative

    • fast way is simply to sort piles into "A", "B", "C", "D" (or true/false, or the type of error you get for short answer and fill-in-the-blank)

      OR set up a spreadsheet if you're familiar with computers (a small script sketch follows after step 8)

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0 
      *B     4 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


    • repeat for lower group


    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


    • this is the time-consuming part --> but not that bad, can do it while watching TV, because you're just sorting piles


    THREE POSSIBLE SHORT CUTS HERE (STEP 4)
    (A) If you have a large sample of around 100 or more, you can cut down the sample you work with

    • take top 27% (27 out of 100); bottom 27% (so only dealing with 54, not all 100)
    • put middle 46 aside for the moment

      • the larger the sample the more accurate, but you have to trade off against labour; using the top 1/3 or so is probably good enough by the time you get to 100 -- 27% is the magic figure statisticians tell us to use
    • I'd use halves at 30, but you could just use a sample of top 10 and bottom 10 if you're pressed for time

      • but it means a single student changes stats by 10%
      • trading off speed for accuracy...
    • but I'd rather have you doing ten and ten than nothing
    (B) Second short cut, if you have access to a photocopier (budgets)

    • photocopy answer sheets, cut off identifying info
        (can't use if handwriting is distinctive)
    • colour code high and low groups --> dab of marker pen color
    • distribute randomly to students in your class so they don't know whose answer sheet they have
    • get them to raise their hands
      • for #6, how many have "A" on blue sheet?
        how many have "B"; how many "C"
      • for #6, how many have "A" on red sheet....
    • some reservations because they can screw you up if they don't take it seriously
    • another version of this would be to hire kid who cuts your lawn to do the counting, provided you've removed all identifying information
      • I actually did this for a bunch of teachers at one high school in Edmonton when I was in university for pocket money
    (C) Third shortcut, IF you can't use separate answer sheet, sometimes faster to type than to sort

    
          SAMPLE OF TYPING FORMAT 
             FOR ITEM ANALYSIS
    
    
    ITEM #    1  2  3  4  5  6  7  8  9  10
    
    KEY       T  F  T  F  T  A  D  C  A  B
    
    STUDENT
    
    Kay       T  T  T  F  F  A  D  D  A  C
    
    Jane      T  T  T  F  T  A  D  C  A  D
    
    John      F  F  T  F  T  A  D  C  A  B
    
    

    • type the name, then T or F, or A, B, C, D -- all left hand on the typewriter, leaving the right hand free to turn pages (from Sax)
    • IF you have a computer program -- there are some kicking around -- it will give you all the stats you need, plus bunches more you don't, automatically after this stage. OVERHEAD: SAMPLE ITEM ANALYSIS FOR CLASS OF 30 (PAGE #1) (in text)

  5. Subtract the number of students in lower group who got question right from number of high group students who got it right
    • quite possible to get a negative number



    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2 
       C     1 
       D     1 
       O
    
    *=Keyed Answer

  6. Divide the difference by number of students in upper or lower group
    • in this case, divide by 15
    • this gives you the "discrimination index" (D)


    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13 
       C     1 
       D     1 
       O
    
    *=Keyed Answer

  7. Total number who got it right

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13    6 
       C     1 
       D     1 
       O
    
    *=Keyed Answer


  8. Divide total by total number of students
    • difficulty = (proportion who got it right (p) )

    ITEM ANALYSIS FORM TEACHER CONSTRUCTED TESTS

    CLASS SIZE = 30
    
    ITEM   UPPER   LOWER   DIFFERENCE   D   TOTAL   DIFFICULTY
    
    1. A     0  
      *B     4       2         2      0.13    6      .20 
       C     1 
       D     1 
       O
    
    *=Keyed Answer



      You will NOTE the complete lack of complicated statistics --> counting, adding, dividing --> no tricky formulas required for this
    • not going to worry about corrected point biserials etc.
    • one of the advantages of using a fixed number of alternatives
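
If you do set this up on a computer (the spreadsheet shortcut under step 4), the whole eight-step procedure is only a few lines of code. Below is a minimal sketch in Python, not a prescribed program; the answer key and the three response strings are the ones from the typing-format sample under shortcut (C), and the rest of the class would be typed in the same way.

    # Item analysis for a teacher-constructed objective test.
    # Responses are typed as one string per student, in the same order
    # as the answer key (shortcut C); these three rows are the sample above.

    key = "TFTFTADCAB"

    students = {
        "Kay":  "TTTFFADDAC",
        "Jane": "TTTFTADCAD",
        "John": "FFTFTADCAB",
        # ... type in the rest of the class here ...
    }

    # Step 1: score each paper (number of answers matching the key).
    scores = {name: sum(a == k for a, k in zip(resp, key))
              for name, resp in students.items()}

    # Step 2: rank the papers from top score to bottom score.
    ranked = sorted(students, key=lambda name: scores[name], reverse=True)

    # Step 3: split the class in half; the middle paper is set aside if odd.
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]

    # Steps 4-8: count responses per alternative in each group for every item,
    # then compute the discrimination index D and the difficulty p.
    for i, correct in enumerate(key):
        upper_counts, lower_counts = {}, {}
        for name in upper:
            ans = students[name][i]
            upper_counts[ans] = upper_counts.get(ans, 0) + 1
        for name in lower:
            ans = students[name][i]
            lower_counts[ans] = lower_counts.get(ans, 0) + 1

        difference = upper_counts.get(correct, 0) - lower_counts.get(correct, 0)  # step 5
        d_index = difference / half                                               # step 6
        total_right = sum(students[name][i] == correct for name in students)      # step 7
        difficulty = total_right / len(students)                                  # step 8

        print(f"Item {i + 1}: D = {d_index:+.2f}, difficulty = {difficulty:.2f}, "
              f"upper {upper_counts}, lower {lower_counts}")

Run on a full class list, the printout gives you, item by item, the same numbers you would enter on the item analysis form: the counts per alternative for each group, the difference, D, the total, and the difficulty.
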
Interpreting Item Analysis


Let's look at what we have and see what we can see (a small script that automates these checks is sketched after the examples below).

  1. What do we see looking at this first one? [Potential Miskey]
    
           Upper    Low    Difference    D    Total    Difficulty
    
    1. *A    1      4        -3        -.2       5          .17 
        B    1      3
        C   10      5
        D    3      3
        O   <----means omit or no answer
    


    • #1, more high group students chose C than A, even though A is supposedly the correct answer
    • more low group students chose A than high group so got negative discrimination;
    • only 17% of the class got it right
    • most likely you just wrote the wrong answer key down 
        --> this is an easy and very common mistake for you to make
      • better you find out now, before you hand it back, than when the kids complain
      • OR WORSE, they don't complain, and teach themselves your miskey as the "correct" answer
    • so check it out and rescore that question on all the papers before handing them back
    • Makes it 10-5, Difference = 5; D = .33; Total = 15; difficulty = .50
      --> nice item
    OR:

    • you check and find that you didn't miskey it --> that is the answer you thought

      two possibilities:
      1. one possibility is that you made a slip of the tongue and taught them the wrong answer
        • anything you say in class can be taken down and used against you on an examination....
      2. more likely it means even "good" students are being tricked by a common misconception --> You're not supposed to have trick questions, so may want to dump it

          --> give those who got it right their point, but total rest of the marks out of 24 instead of 25
    If scores are high, or you want to make a point, might let it stand, and then teach to it --> sometimes if they get caught, will help them to remember better in future
    such as:
    • very fine distinctions
    • crucial steps which are often overlooked
    REVISE it for next time to weaken "C"
      -- alternatives are not supposed to draw more than the keyed answer
      -- almost always an item flaw, rather than useful distinction

  2. What can we see with #2: [Can identify ambiguous items]
    
           Upper    Low   Difference   D    Total   Difficulty
    
    2.  A    6      5
        B    1      2         
       *C    7      5        2        .13     12      .40
        D    1      3
        O
    
    


    • #2, about equal numbers of top students went for A and C. Suggests they couldn't tell which was correct

      • either, students didn't know this material (in which case you can reteach it)
      • or the item was defective --->
    • look at their favorite alternative again, and see if you can find any reason they could be choosing it
    • often items that look perfectly straightforward to adults are ambiguous to students (favorite examples of ambiguous items)
    • if you NOW realize that A was a defensible answer, rescore before you hand it back to give everyone credit for either A or C -- avoids arguing with you in class
    • if it's clearly a wrong answer, then you now know which error most of your students are making to get wrong answer
    • useful diagnostic information on their learning, your teaching

  3. Equally to all alternatives
          Upper    Low    Difference    D    Total    Difficulty
    
    3.  A     4     3
        B     3     4
       *C     5     4       1          .06       9        .30         
        D     3     4
        O
    
    


    • item #3, students respond about equally to all alternatives
    • usually means they are guessing. Three possibilities:

      1. may be material you didn't actually get to yet

        • you designed test in advance (because I've convinced you to plan ahead) but didn't actually get everything covered before holidays....
        • or item on a common exam that you didn't stress in your class
      2. item so badly written students have no idea what you're asking
      3. item so difficult students just completely baffled
    • review the item:
      • if badly written (by another teacher) or on material your class hasn't taken, toss it out, rescore the exam out of the lower total

        • BUT give credit to those that got it, to a total of 100%

      • if seems well written, but too hard, then you know to (re)teach this material for rest of class....
      • maybe the 3 who got it are top three students,

        • tough but valid item:
        • OK, if item tests valid objective
        • want to provide occasional challenging question for top students
        • but make sure you haven't defined "top 3 students" as "those able to figure out what the heck I'm talking about"

  4. Alternatives aren't working
    
           Upper    Low   Difference    D   Total   Difficulty
    
    4.  A     1      5
       *B    14      7       7        .47     21        .70
        C     0      2
        D     0      0 
        O   
    
    


    • example #4 --> no one fell for D --> so it is not a plausible alternative
    • question is fine for this administration, but revise item for next time
    • toss alternative D, replace it with something more realistic
    • each distracter has to attract at least 5% of the students

      • class of 30, should get at least two students

    • or might accept one if you positively can't think of another fourth alternative -- otherwise, do not reuse the item
    if two alternatives don't draw any students --> might consider redoing as true/false

  5. Distracter too attractive
    
           Upper   Low   Difference   D   Total   Difficulty   
    
    5.  A    7     10
        B    1      2
        C    1      1
       *D    5      2       3        .20     7       .23 
        O
    
     


    • sample #5 --> too many going for A

        --> no ONE distracter should get more than key

        --> no one distracter should pull more than about half of students

        -- doesn't leave enough for correct answer and five percent for each alternative
    • keep for this time
    • weaken it for next time

  6. Question not discriminating
    
          Upper   Low   Difference   D   Total   Difficulty
    
    
    6. *A    7     7        0        .00     14      .47          
        B    3     2
        C    2     1
        D    3     5
        O
    
    


    • sample #6: low group gets it as often as high group
    • on norm-referenced tests, point is to rank students from best to worst
    • so individual test items should have good students get question right, poor students get it wrong
    • test overall decides who is a good or poor student on this particular topic

      • those who do well have more information, skills than those who do less well
      • so if on a particular question those with more skills and knowledge do NOT do better, something may be wrong with the question
    • question may be VALID, but off topic

      • E.G.: the rest of the test tests thinking skills, but this is a memorization question, so skilled and unskilled are equally likely to recall the answer
      • should have homogeneous test --> don't have a math item in with social studies
      • if wanted to get really fancy, should do separate item analysis for each cell of your blueprint...as long as you had six items per cell

    • question is VALID, on topic, but not RELIABLE

      • addresses the specified objective, but isn't a useful measure of individual differences
      • asking Grade 10s the capital of Canada is on topic, but since they will all get it right, it won't show individual differences -- will give you a low D

  7. Negative Discrimination
    
           Upper   Low   Difference   D   Total   Difficulty
    
    
    7. *A    7    10      -3       -.20     17       .57  
        B    3     3
        C    2     1
        D    3     1
        O
    
    


    • D (discrimination) index is just upper group minus lower group
    • varies from +1.0 to -1.0
    • if all top got it right, all lower got it wrong = 100% = +1
    • if more of the bottom group get it right than the top group, you get a negative D index
    • if you have a negative D, means that students with less skills and knowledge overall, are getting it right more often than those who the test says are better overall
    • in other words, the better you are, the more likely you are to get it wrong
    WHAT COULD ACCOUNT FOR THAT?

    Two possibilities:

    • usually means an ambiguous question

      • that is confusing good students, but weak students too weak to see the problem
      • look at question again, look at alternatives good students are going for, to see if you've missed something


      OR:
    • or it might be off topic

      --> something weaker students are better at (like rote memorization) than good students

      --> not part of the same set of skills as the rest of the test --> suggests a design flaw with the table of specifications perhaps
    ((if you end up with a whole bunch of negative D indices on the same test, it must mean you actually have two distinct skills, because by definition the low group is the high group on that bunch of questions
      --> end up treating them as two separate tests))

    • if you have a large enough sample (like the provincial exams) then we toss the item and either don't count it or give everyone credit for it
    • with sample of 100 students or less, could just be random chance, so basically ignore it in terms of THIS administration

      • kids wrote it, give them mark they got
    • furthermore, if you keep dropping questions, may find that you're starting to develop serious holes in your blueprint coverage -- problem for sampling

      • but you want to track this stuff FOR NEXT TIME
    • if it's negative on administration after administration, consistently, likely not random chance, it's screwing up in some way
    • want to build your future tests out of those items with high positive D indices
    • the higher the average D indices on the test, the more RELIABLE the test as a whole will be
    • revise items to increase D

        -->if good students are selecting one particular wrong alternative, make it less attractive

        -->or increase probability of their selecting right answer by making it more attractive
    • may have to include some items with negative Ds if those are the only items you have for that specification, and it's an important specification

      • what this means is that there are some skills/knowledge in this unit which are unrelated to rest of the skills/knowledge
        --> but may still be important

    • e.g., the statistics part of this course may be terrible for those students who are the best item writers, since writing tends to be associated with the opposite hemisphere of the brain from math, right... but it is still an important objective in this course

      • may lower reliability of test, but increases content validity

  8. Too Easy
    
           Upper   Low   Difference   D   Total   Difficulty
    
    8.  A     0     1
       *B    14    13       1       .06     27       .90
        C     0     1
        D     1     1
        O
    
    


    • too easy or too difficult won't discriminate well either
    • difficulty (p) (for proportion) varies from +1.0 (everybody got it right) to 0 (nobody)
    REMEMBER: THE HIGHER THE DIFFICULTY INDEX, THE EASIER THE QUESTION

    • if the item is NOT miskeyed and has no other glaring problem, it's too late to change after it's administered --> everybody got it right, OK, give them the mark
    TOO DIFFICULT = below 30 to 35% (used to be the rule in the Branch, now not...)
    • if the item is too difficult, don't drop it, just because everybody missed it --> you must have thought it was an important objective or it wouldn't have been on there;
    • and unless literally EVERYONE missed it, what do you do with the students who got it right?
    • give them bonus marks?
    • cheat them of a mark they got?
    furthermore, if you drop too many questions, lose content validity (specs)

      --> if two or three got it right may just be random chance,

        so why should they get a bonus mark


    • however, DO NOT REUSE questions with too high or low difficulty (p) values in future
    if difficulty is over 85%, you're wasting space on a limited-item test

    • asking Grade 10s the capital of Canada is probably a waste of their time and yours --> unless this is a particularly vital objective
    • the same applies to items which are too difficult --> no use asking Grade 3s to solve a quadratic equation
    • but you may want to revise question to make it easier or harder rather than just toss it out cold
    OR SOME EXCEPTIONS HERE: 

    You may have consciously decided to develop a "Mastery"-style test
        --> will often have very easy questions -- you expect everyone to get everything and are trying to identify only those who are not ready to go on

        --> in which case, don't use any question which does NOT have a difficulty level of 85% or above
    Or you may want a test to identify the top people in class, the Reach for the Top team, and design a whole test of really tough questions

        --> have low difficulty values (i.e., very hard)


    • so depends a bit on what you intend to do with the test in question
    • this is what makes the difficulty index (proportion) so handy
    1. you create a bank of items over the years

        --> using item analysis you get better questions all the time, until you have a whole bunch that work great

        -->can then tailor-make a test for your class

        if you want to create an easier test this year, you pick questions with higher difficulty (p) values;

        if you want to make a challenging test for your gifted kids, choose items with low difficulty (p) values

        --> for most applications will want to set difficulty level so that it gives you average marks, nice bell curve

        • government uses 62.5% --> four-item multiple choice, middle of the bell curve

    2. start tests with an easy question or two to give students a running start
    3. make sure that the difficulty levels are spread out over examination blueprint

      • not all hard geography questions, easy history

        • unfair to kids who are better at geography, worse at history
        • turns class off geography if they equate it with tough questions
        --> REMEMBER here that difficulty is different from complexity (Bloom)
      • so can have difficult recall knowledge question, easy synthesis
      • synthesis and evaluation items will tend to be harder than recall questions so if find higher levels are more difficult, OK, but try to balance cells as much as possible
      • certainly content cells should be roughly the same

  9. OMIT
    
           Upper   Low   Difference   D   Total   Difficulty
    
    9.  A    2     1
        B    3     4
       *C    7     3        4       .26     10       .33
        D    1     1
        O    2     4     
    
    

    If near end of the test

    1. --> they didn't find it because it was on the next page

      -- format problem
    OR

    2. --> your test is too long; 6 of them (20%) didn't get to it


    OR, if middle of the test:

    3. --> it totally baffled them because:

      • way too difficult for these guys
      • or, since 2 from the high group omitted it too, ambiguous wording

  10. & 11. RELATIONSHIP BETWEEN D INDEX AND DIFFICULTY (p)
    
          Upper   Low   Difference   D   Total   Difficulty
    
    10.  A     0    5
        *B    15    0        15      1.0    15       .50 
         C     0    5
         D     0    5
         O
       ---------------------------------------------------
    11.  A    3     2
        *B    8     7         1      .06     15      .50
         C    2     3
         D    2     3
         O
    
    


    • 10 is a perfect item --> each distracter gets at least 5
      discrimination index is +1.0
    (ACTUALLY PERFECT ITEM WOULD HAVE DIFFICULTY OF 65% TO ALLOW FOR GUESSING)

    • high discrimination D indices require optimal levels of difficulty
    • but optimal levels of difficulty do not assure high levels of D

    • 11 has same difficulty level, different D

      • on a four-item multiple-choice test, a student answering totally by chance will get 25%
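
As a first pass, you can even automate these checks. A rough sketch in Python, assuming you already have the counts per alternative for the upper and lower groups; the two items encoded below are items 1 and 4 from the examples above, and the thresholds (a distracter outdrawing the key, zero or negative D, distracters drawing under 5%, difficulty over .85 or under .30) are just the rules of thumb from this post, not fixed standards.

    # First-pass screening of item statistics using the rules of thumb above.
    # Each item: keyed answer plus {alternative: count} for upper and lower groups.
    # Items 1 and 4 are the worked examples from this post.

    GROUP = 15        # students per group (half of a class of 30)
    CLASS_SIZE = 30

    items = {
        1: {"key": "A",
            "upper": {"A": 1, "B": 1, "C": 10, "D": 3},
            "lower": {"A": 4, "B": 3, "C": 5,  "D": 3}},
        4: {"key": "B",
            "upper": {"A": 1, "B": 14, "C": 0, "D": 0},
            "lower": {"A": 5, "B": 7,  "C": 2, "D": 0}},
    }

    for number, item in items.items():
        key = item["key"]
        total = {alt: item["upper"].get(alt, 0) + item["lower"].get(alt, 0)
                 for alt in set(item["upper"]) | set(item["lower"])}

        d_index = (item["upper"].get(key, 0) - item["lower"].get(key, 0)) / GROUP
        difficulty = total.get(key, 0) / CLASS_SIZE
        flags = []

        # Potential miskey or ambiguous item: a distracter outdraws the key.
        if any(alt != key and count > total.get(key, 0) for alt, count in total.items()):
            flags.append("a distracter outdraws the key -- check for miskey or ambiguity")

        # Zero or negative discrimination.
        if d_index <= 0:
            flags.append("zero or negative D -- ambiguous or off topic?")

        # Distracters that aren't working (drawing under 5% of the class).
        dead = [alt for alt, count in total.items()
                if alt != key and count < 0.05 * CLASS_SIZE]
        if dead:
            flags.append(f"distracters drawing under 5%: {dead}")

        # Too easy or too hard to discriminate well.
        if difficulty > 0.85:
            flags.append("too easy (difficulty over .85)")
        elif difficulty < 0.30:
            flags.append("too hard (difficulty under .30)")

        print(f"Item {number}: D = {d_index:+.2f}, difficulty = {difficulty:.2f}")
        for flag in flags:
            print("   ", flag)

The script only flags items for a second look; as noted above, face validity still takes priority over the stats.
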
Program Evaluation

When your kids write the Diploma or Achievement Tests, the Department sends out a printout of how your class did compared to everybody else in the province.
Three types of report:

  1. ASSESSMENT HIGHLIGHTS (pamphlet)

    • how are kids doing today in terms of meeting the standards?
    • how are they doing compared to four years ago? eight years ago?
      (monitor over time)
  2. PROVINCIAL REPORT

    • format keeps changing --> some years all tests in one book to save on paper and mailing costs; other years each exam gets its own report
    • tons of technical information (gender stuff, etc.)
  3. JURISDICTION & SCHOOL REPORTS

      (up to the superintendent what happens to these after that
      --> can publish in the newspaper, keep them secret at central office only, etc.)

    • get your hands on these and interpret them
    • either you do it or someone else will do it for/to you
    • better teachers take responsibility rather than top down
    • new table formats are so easy to interpret no reason not to
    • this means you can compare their responses to the responses of 30,000 students across the province

      • will help you calibrate your expectations for this class
      • is your particular class high or low one?
      • have you set your standards too high or too low?
      • giving everyone 'C's because you think they ought to do better than this, but they all ace the provincial tests?
    Who knows where this is? OVERHEAD: SCHOOL TABLE 2 (June 92 GRADE 9 Math Achievement)

    • check table 2 for meeting standard of excellence
    • Standards set by elaborate committee structure
    This example (overhead): Your class had 17 students
    Total test out of 49 (meaning a test of 50, but one item was dropped after item analysis)

      standard setting procedures decided that 42/49 is standard of EXCELLENCE for grade 9s in Alberta
      next column shows they expect 15% to reach this standard

      standard setting procedure decided that 23 out of 49 is the Acceptable
      standard; next column says they expect 85% to reach that standard

      columns at end of table show that actually, only 8.9% made standard of excellence, and only 67.4% made acceptable standard

      (bad news!)

      but looking at YOUR class, 5.9%, almost 6%, made the standard of excellence (so fewer than the province as a whole), but on the other hand, 76.5% met the acceptable standard.
    Need the comparison -- otherwise, the fact that you only got 6% to excellence might sound bad...

      Interpretation: either you only have one excellent math student in your class, or you are teaching to the acceptable standard but not encouraging excellence.
    BUT can use tables to look deeper,

    • use tables to identify strengths and weaknesses in student learning
    • and therefore identify your own strength and weaknesses
    Problem solving & knowledge/skills broken down --> table of specs topics

    Interestingly, though, above provincial on problem solving at excellence...

    ASK: how do you explain the % meeting standard on knowledge and the % meeting standard on problems both being higher than the % meeting standard on the whole test?

    Answer: low correlation between performance on the two types of questions
    (i.e., those who met standard on the one often did not on the other)

      which means (a) can't assume that easy/hard = Bloom's taxonomy

      and (b) that you have to give students both kinds of questions on your test or you are being unfair to group who is better at the other stuff
    Don't know where this is. OVERHEAD: SCHOOL TABLE 5.1 (GRADE 9 MATH, JUNE 92)

    • check tables 5.1 to 5.6 for particular areas of strengths and weaknesses
    • look for every question where students in your school were 5% or more different on the keyed answer from the provincial results (a small sketch of this check appears at the end of this post)

      • if 5% or more higher, is a particular strength of your program
      • if 5% or more lower, is a particular weakness
  • note that the score on the question is irrelevant, only the difference from the rest of the province 

      --> e.g., if you only got 45% but the province only got 35%, then that's a significant strength

      --> the fact that less than 50% just means it was a really tough question, too hard for this grade

    • similarly, just because you got 80% doesn't make your class good if the province got 98%

      • if you find all strengths or all weaknesses, ask where the gap is smallest
      • least awful = strength; least strong = weakness to work on
    THIS EXAMPLE? all above provincial scores on these skills

    converts a decimal into a fraction: 76.5 - 60.9 = 15.6% above the provincial norm

    so decimal to fraction is a strength

    all of these are good, but find the least good --> that's the area to concentrate on

    question 10: only a 4.5% difference -- so your weak spot, the one area you aren't significantly above the rest of the province

    You can even begin to set standards in your class as province does

      --> i.e., ask yourself BEFORE the test how many of these questions should your class be able to do on this test?
    Then look at actual performance.

    How did my students do? Compared to what?
      my classroom expectations
      school's expectations
      jurisdiction's expectations
      provincial expectations
      the last/previous test administered
      community expectations
      (each jurisdiction now has its own public advisory committee)
    You can even create your own statistics to compare with provincial standard

    • lots of teachers recycle Diploma and Achievement test questions, but they only do it to prep kids for actual exam --> losing all that diagnostic info
    HOWEVER: avoid comparisons between schools

    • serves no useful purpose, has no logic since taken out of context
    • e.g., comparing a cancer clinic and a walk-in clinic --> a higher death rate in the cancer clinic doesn't mean it's worse; it may be the best cancer clinic in the world, doing a great job given the more serious nature of the problems it faces
    • invidious comparisons like this become "blaming" exercise
    • self-fulfilling prophecy: parents pull kids from that school
    Provincial authorities consider such comparisons a misuse of results

    • school report = your class if only one class;
      but if two or more classes, then we are talking about your school's program

      --> forces you to get together with other teachers to find out what they're doing

      --> pool resources, techniques, strategies to address problem areas....
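
The arithmetic behind these comparisons is just as simple as the item analysis itself. A small sketch in Python of the two checks discussed above: the percent of your class meeting the provincially set cut scores, and the 5% rule for flagging strengths and weaknesses. The cut scores (42 and 23 out of 49), the provincial 8.9% / 67.4% figures, and the question 1 percentages are from the Grade 9 Math example above; the class scores are made-up numbers chosen to reproduce the 5.9% and 76.5% figures, and the percentages for questions 2, 3, and 10 are hypothetical.

    # Comparing your class to the provincial results, as in the school reports.

    EXCELLENCE_CUT = 42   # out of 49, standard of excellence
    ACCEPTABLE_CUT = 23   # out of 49, acceptable standard

    # Hypothetical scores for a class of 17.
    class_scores = [44, 38, 35, 33, 31, 30, 29, 28, 27,
                    26, 25, 24, 23, 22, 21, 20, 18]

    pct_excellence = 100 * sum(s >= EXCELLENCE_CUT for s in class_scores) / len(class_scores)
    pct_acceptable = 100 * sum(s >= ACCEPTABLE_CUT for s in class_scores) / len(class_scores)
    print(f"Standard of excellence: {pct_excellence:.1f}% of the class (province: 8.9%)")
    print(f"Acceptable standard:    {pct_acceptable:.1f}% of the class (province: 67.4%)")

    # The 5% rule: percent choosing the keyed answer on each question,
    # your school versus the province.
    school_pct   = {1: 76.5, 2: 58.8, 3: 70.6, 10: 52.9}
    province_pct = {1: 60.9, 2: 51.0, 3: 64.1, 10: 48.4}

    for q in school_pct:
        gap = school_pct[q] - province_pct[q]
        if gap >= 5:
            verdict = "strength"
        elif gap <= -5:
            verdict = "weakness"
        else:
            verdict = "not significantly different from the province"
        print(f"Question {q}: {gap:+.1f}% vs. province -- {verdict}")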
