Tuesday, September 4, 2012

Item Analysis and Item Discrimination


Submitted by: Camille Ann Bolos

__________________________________________________________________________


Item analysis

An item analysis involves many statistics that can provide useful information for improving the quality and accuracy of multiple-choice or true/false items (questions).  Some of these statistics are:
Item difficulty: the percentage of students that correctly answered the item.
  • Also referred to as the p-value.
  • The range is from 0% to 100%, or more typically written as a proportion of 0.0 to 1.00.
  • The higher the value, the easier the item.
  • Calculation: Divide the number of students who got an item correct by the total number of students who answered it (a short code sketch follows this list).
  • Ideal value: Slightly higher than midway between chance (1.00 divided by the number of choices) and a perfect score (1.00) for the item. For example, on a four-alternative, multiple-choice item, the random guessing level is 1.00/4 = 0.25; therefore, the optimal difficulty level is .25 + (1.00 - .25)/2 = .625. On a true-false question, the guessing level is 1.00/2 = .50 and, therefore, the optimal difficulty level is .50 + (1.00 - .50)/2 = .75.
  • P-values above 0.90 indicate very easy items and should be carefully reviewed based on the instructor’s purpose. For example, if the instructor is using easy “warm-up” questions or aiming for student mastery, then some items with p-values above .90 may be warranted. In contrast, if an instructor is mainly interested in differences among students, these items may not be worth testing.
  • P-values below 0.20 are very difficult items and should be reviewed for possible confusing language, removed from subsequent exams, and/or identified as an area for re-instruction. If almost all of the students get the item wrong, there is either a problem with the item or students were not able to learn the concept. However, if an instructor is trying to determine the top percentage of students that learned a certain concept, this highly difficult item may be necessary.
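As a rough illustration of the calculations above, here is a minimal Python sketch, assuming the responses have already been scored as a 0/1 matrix (rows = students, columns = items); the array and function names are illustrative, not part of any particular scoring package.

import numpy as np

# Hypothetical scored responses: 1 = correct, 0 = incorrect
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])

# Item difficulty (p-value): proportion of students answering each item correctly
p_values = scores.mean(axis=0)
print(p_values)                      # [0.75 0.75 0.25 1.  ]

# Optimal difficulty for an item with a given number of choices:
# halfway between the chance level and a perfect score
def optimal_difficulty(num_choices):
    chance = 1.0 / num_choices
    return chance + (1.0 - chance) / 2

print(optimal_difficulty(4))         # 0.625 for a four-alternative item
print(optimal_difficulty(2))         # 0.75 for a true/false item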
Item discrimination: the relationship between how well students did on the item and their total exam score.
  • Also referred to as the Point-Biserial correlation (PBS)
  • The range is from –1.00 to 1.00.
  • The higher the value, the more discriminating the item. A highly discriminating item indicates that the students who had high exam scores got the item correct whereas students who had low exam scores got the item incorrect.
  • Items with discrimination values near or less than zero should be removed from the exam. This indicates that students who overall did poorly on the exam did better on that item than students who overall did well. The item may be confusing for your better-scoring students in some way.
  • Acceptable range: 0.20 or higher
  • Ideal value: The closer to 1.00 the better
  • Calculation (see the sketch after this list):
    r(PBS) = ((X_C – X_T) / S.D. Total) × √(p / q), where
    X_C = the mean total score for persons who have responded correctly to the item
    X_T = the mean total score for all persons
    p = the difficulty value for the item
    q = 1 – p
    S.D. Total = the standard deviation of total exam scores
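Under the same assumptions as the earlier sketch (0/1 item scores and a vector of total exam scores; all names are illustrative), the formula above can be sketched as:

import numpy as np

def point_biserial(item, total):
    """Item discrimination (point-biserial) using the formula above.

    item  -- array of 0/1 scores on a single item
    total -- array of total exam scores for the same students
    """
    p = item.mean()                          # item difficulty
    q = 1.0 - p                              # proportion answering incorrectly
    if p == 0 or q == 0:
        raise ValueError("everyone correct or everyone incorrect; r is undefined")
    x_c = total[item == 1].mean()            # X_C: mean total score of those answering correctly
    x_t = total.mean()                       # X_T: mean total score of all students
    sd_total = total.std()                   # S.D. Total (population SD; conventions vary)
    return (x_c - x_t) / sd_total * np.sqrt(p / q)

# Tiny illustrative example
item = np.array([1, 1, 0, 1, 0, 1])
total = np.array([85, 78, 60, 90, 55, 72])
print(point_biserial(item, total))           # about 0.89 for this made-up data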
Reliability coefficient: a measure of the amount of measurement error associated with an exam score.
  • The range is from 0.0 to 1.0.
  • The higher the value, the more reliable the overall exam score.
  • Typically, the internal consistency reliability is measured. This indicates how well the items are correlated with one another.
  • High reliability indicates that the items are all measuring the same thing, or general construct (e.g. knowledge of how to calculate integrals for a Calculus course).
  • With multiple-choice items that are scored correct/incorrect, the Kuder-Richardson formula 20 (KR-20) is often used to calculate the internal consistency reliability. 

    KR-20 = (K / (K – 1)) × (1 – Σpq / σ²x), where
    K = number of items
    p = proportion of persons who responded correctly to an item (i.e., difficulty value)
    q = proportion of persons who responded incorrectly to an item (i.e., 1 – p)
    Σpq = the sum of p × q across all items
    σ²x = total score variance
    (A code sketch of this calculation follows this list.)
  • Three ways to improve the reliability of the exam are to 1) increase the number of items on the exam, 2) use items that have high discrimination values, or 3) perform an item-total statistic analysis.
  • Acceptable range: 0.60 or higher
  • Ideal value: 1.00
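A minimal sketch of the KR-20 calculation, again assuming a 0/1 score matrix (the function and variable names are illustrative; some packages use the sample rather than the population variance of total scores, so results can differ slightly):

import numpy as np

def kr20(scores):
    """Kuder-Richardson formula 20 for a 0/1 score matrix (rows = students, columns = items)."""
    k = scores.shape[1]                        # K: number of items
    p = scores.mean(axis=0)                    # p: difficulty of each item
    q = 1.0 - p                                # q: 1 - p for each item
    total_variance = scores.sum(axis=1).var()  # sigma^2_x: variance of total exam scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_variance)

# Example with the small illustrative matrix from the difficulty sketch above
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
])
print(kr20(scores))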
Item-total statistics: measure the relationship of individual exam items to the overall exam score.  
Currently, the University of Texas does not perform this analysis for faculty. However, one can calculate these statistics using SPSS or SAS statistical software; a rough sketch of the calculations follows the numbered list below.
  1. Corrected item-total correlation
    • This is the correlation between an item and the rest of the exam, without that item considered part of the exam.
    • If the correlation is low for an item, this means the item isn't really measuring the same thing the rest of the exam is trying to measure.
  2. Squared multiple correlation
    • This measures how much of the variability in the responses to this item can be predicted from the other items on the exam.
    • If an item does not predict much of the variability, then the item should be considered for deletion.
  3. Alpha if item deleted
    • The change in Cronbach's alpha if the item is deleted.
    • When the alpha value is higher than the current alpha with the item included, one should consider deleting this item to improve the overall reliability of the exam.
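For illustration only, here is a rough Python sketch of the first and third statistics (corrected item-total correlation and alpha if item deleted); the squared multiple correlation, which regresses each item on the remaining items, is omitted for brevity, and all names are illustrative rather than any package's API.

import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix (rows = students, columns = items)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def item_total_statistics(scores):
    """Corrected item-total correlation and alpha-if-deleted for each item."""
    stats = []
    for i in range(scores.shape[1]):
        rest = np.delete(scores, i, axis=1)       # the exam with this item removed
        rest_total = rest.sum(axis=1)             # total score without the item
        corrected_r = np.corrcoef(scores[:, i], rest_total)[0, 1]
        alpha_if_deleted = cronbach_alpha(rest)
        stats.append((corrected_r, alpha_if_deleted))
    return stats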
      EXAMPLE
      Item-total statistics
      Summary for scale: Mean = 46.1100   S.D. = 8.26444   Valid n = 100
      Cronbach alpha = .794313   Standardized alpha = .800491
      Average inter-item correlation = .297818

      Variable    Mean if     Var. if     S.D. if     Corrected item-      Squared multiple     Alpha if
                  deleted     deleted     deleted     total correlation    correlation          deleted
      ITEM1       41.61000    51.93790    7.206795    .656298              .507160              .752243
      ITEM2       41.37000    53.79310    7.334378    .666111              .533015              .754692
      ITEM3       41.41000    54.86190    7.406882    .549226              .363895              .766778
      ITEM4       41.63000    56.57310    7.521509    .470852              .305573              .776015
      ITEM5       41.52000    64.16961    8.010593    .054609              .057399              .824907
      ITEM6       41.56000    62.68640    7.917474    .118561              .045653              .817907
      ITEM7       41.46000    54.02840    7.350401    .587637              .443563              .762033
      ITEM8       41.33000    53.32110    7.302130    .609204              .446298              .758992
      ITEM9       41.44000    55.06640    7.420674    .502529              .328149              .772013
      ITEM10      41.66000    53.78440    7.333785    .572875              .410561              .763314
      By investigating the item-total correlation, we can see that the correlations of items 5 and 6 with the overall exam are .05 and .12, while all other items correlate at .45 or better. By investigating the squared multiple correlations, we can see that items 5 and 6 are again substantially lower than the rest of the items. Finally, by exploring the alpha if deleted, we can see that the reliability of the scale (alpha) would increase to .82 if either of these two items were deleted. Thus, we would probably delete these two items from this exam.
      Deleting item process: To delete these items, we would delete one item at a time, starting with item 5 because deleting it produces the larger gain in the exam reliability coefficient, and then re-run the item-total statistics report before deleting item 6, to ensure we do not lower the overall alpha of the exam. After deleting item 5, if item 6 still appears as an item to delete, we would repeat the process for item 6. (A small sketch of this one-at-a-time loop follows.)
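As a rough sketch of that loop, reusing the cronbach_alpha and item_total_statistics helpers from the sketch above (the function name and the stopping rule are illustrative):

def prune_items(scores):
    """Delete items one at a time while doing so raises the exam's alpha."""
    items = list(range(scores.shape[1]))             # column indices still on the exam
    while len(items) > 2:
        current = scores[:, items]
        alpha_now = cronbach_alpha(current)
        stats = item_total_statistics(current)
        # pick the single item whose removal would raise alpha the most
        best_idx, (_, best_alpha) = max(enumerate(stats), key=lambda t: t[1][1])
        if best_alpha <= alpha_now:
            break                                    # no deletion improves reliability; stop
        items.pop(best_idx)                          # delete it, then re-run the statistics
    return items                                     # indices of the items kept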
Distractor evaluation: another useful item review technique.
The distractor should be considered an important part of the item. Nearly 50 years of research shows that there is a relationship between the distractors students choose and total exam score. The quality of the distractors influences student performance on an exam item. Although the correct answer must be truly correct, it is just as important that the distractors be incorrect. Distractors should appeal to low scorers who have not mastered the material, whereas high scorers should infrequently select the distractors. Reviewing the options can reveal potential errors of judgment and inadequate performance of distractors. These poor distractors can be revised, replaced, or removed.
One way to study responses to distractors is with a frequency table. This table tells you the number and/or percent of students that selected a given distractor. Distractors that are selected by few or no students should be removed or replaced; these kinds of distractors are likely to be so implausible to students that hardly anyone selects them. (A small counting sketch follows the list below.)
  • Definition: The incorrect alternatives in a multiple-choice item.
  • Reported as: The frequency (count), or number of students, that selected each incorrect alternative
  • Acceptable Range: Each distractor should be selected by at least a few students
  • Ideal Value: Distractors should be equally popular
  • Interpretation:
    • Distractors that are selected by a few or no students should be removed or replaced
    • One distractor that is selected by as many or more students than the correct answer may indicate a confusing item and/or options
  • The number of students choosing a distractor can be lower or higher than expected because of:
    • Partial knowledge
    • Poorly constructed item
    • Distractor is outside of the area being tested
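A distractor frequency table can be built with a simple count. The sketch below assumes raw letter responses for a single five-option item; the response data, the keyed answer, and the use of '' for an omitted answer are all illustrative.

from collections import Counter

responses = ['A', 'C', 'A', 'B', 'A', 'E', 'A', 'C', '', 'A', 'B', 'A']
key = 'A'                                     # hypothetical correct answer

counts = Counter(responses)
n = len(responses)
for option in ['A', 'B', 'C', 'D', 'E']:
    label = 'correct answer' if option == key else 'distractor'
    print(f"{option}: {counts.get(option, 0):2d}  ({counts.get(option, 0) / n:.0%})  {label}")
print(f"omit: {counts.get('', 0)}")
# Option D is never chosen in this made-up data, so it would be a candidate
# for replacement with a more plausible distractor.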

________________________________________________________________

Example of Item Analysis

Summary table of exam item statistics

<exam name >

N TOTAL = 932   MEAN TOTAL = 69.4   S.D. TOTAL = 10.2   ALPHA = .84

ITEM     P      R(IT)    NC     MC      MI     OMIT     A      B      C      D      E
  1.    0.72    0.34    667   71.56   67.66     1      667    187     37     30     10
  2.    0.90    0.21    840   70.11   69.02     1      840      1     76      9      5
  3.    0.60    0.39    561   72.66   65.47     0      561    233     46     88      4
  4.    0.99   -0.06    923   69.34   69.90     0      923      3      3      3      0
  5.    0.94    0.14    876   69.76   68.23     0      876      0     12     24     20
  6.    0.77   -0.01    716   69.34   69.57     0      716     16     25     35    140
  7.    0.47    0.31    432   72.76   66.16     3      432    107     68    165    157
  8.    0.12    0.08    114   71.61   68.39     8      114    218    264    153    175
  9.    0.08    0.04     75   70.78   69.03     0       75     64    120     67    606
 10.    0.35    0.42    330   75.24   63.54     0      330     98     74    183    247
  .
  .
 40.

(Column key: P = item difficulty; R(IT) = item discrimination, the point-biserial correlation with total score; NC = number of students answering the item correctly; MC / MI = mean total score of students answering correctly / incorrectly; OMIT = number of students omitting the item; A–E = number of students selecting each alternative. For the rows shown, the keyed answer appears to be alternative A, since its count equals NC; the table continues through item 40.)

Questions

  1. Which item(s) could you remove altogether from the exam? Why?
  2. Which distractor(s) would you revise?   Why?
  3. Which items are working well?

Answers  

  1. By exploring item difficulty, items 2, 4, and 5 appear extremely easy (P-value of .90 or above), and items 8 and 9 appear extremely difficult (P-value < .20).  
    By exploring item discrimination, items 4, 6, 8, 9, and possibly 5 appear to indicate poor discriminating items (R(IT) values near or less than zero). This indicates that students who overall did poorly on the exam did better on these questions than students who overall did well.  
    Therefore, by combining the results from these two investigations, items 4, 5, 8, and 9 appear to be the best items to delete from this exam. Please review the deleting item process above before deleting items from your own exam.  
  2. Items 2, 4, and 5 appear to possess distractors selected by a few or no students, and items 8 and 9 appear to possess distractors selected by as many or more students than the correct answer. Therefore, these item distractors should either be removed or revised, especially if the distractors are the source of students’ confusion (rather than the item itself).  
  3. Items that appear to be working well (i.e., good difficulty, discrimination, and distractors) are items 1, 2, 3, and possibly 7 and 10, depending upon the purpose of the exam. (A small screening sketch that applies the difficulty and discrimination rules above follows.)
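For readers who want to automate this kind of screening, here is a small sketch that applies the difficulty and discrimination rules discussed earlier to the P and R(IT) columns of the summary table; the function name and the exact (inclusive) thresholds are illustrative choices, not a standard.

def flag_items(p_values, discriminations):
    """Flag items whose difficulty or discrimination falls outside the ranges above."""
    flags = {}
    for i, (p, r) in enumerate(zip(p_values, discriminations), start=1):
        notes = []
        if p >= 0.90:                 # inclusive here, so item 2 (p = .90) is flagged as in the answers
            notes.append('very easy item')
        if p < 0.20:
            notes.append('very difficult item')
        if r < 0.20:                  # below the acceptable discrimination range
            notes.append('low discrimination')
        if notes:
            flags[i] = notes
    return flags

# P and R(IT) values for items 1-10 from the summary table above
p_vals = [0.72, 0.90, 0.60, 0.99, 0.94, 0.77, 0.47, 0.12, 0.08, 0.35]
r_its  = [0.34, 0.21, 0.39, -0.06, 0.14, -0.01, 0.31, 0.08, 0.04, 0.42]
for item, notes in flag_items(p_vals, r_its).items():
    print(item, notes)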
