Thursday, September 6, 2012

Item Analysis


ITEM ANALYSIS

Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole.  Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction, and identifying specific areas of course content which need greater emphasis or clarity.

Item Analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also regulate and standardise existing tests.

There are three main types of Item Analysis: Item Response Theory, Rasch Measurement and Classical Test Theory. Although Classical Test Theory and Rasch Measurement will be discussed, this document will concentrate primarily on Item Response Theory.

THE MODELS
Classical Test Theory
Classical Test Theory (traditionally the main method used in the United Kingdom) utilises two main statistics - Facility and Discrimination.
  • Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. As a whole, a test should aim for an overall facility of around 0.5; however, it is acceptable for individual items to have higher or lower facility (ranging from 0.2 to 0.8).
  • Discrimination measures how performance on one item correlates with performance on the test as a whole. There should always be some correlation between item and test performance; discrimination is expected to fall in a range between 0.2 and 1.0 (a short computational sketch follows this list).
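
For illustration, here is a minimal Python sketch of these two statistics, using an invented mark matrix (rows are candidates, columns are items); all figures and variable names are hypothetical:

import numpy as np

# Hypothetical marks: rows are candidates, columns are items.
marks = np.array([
    [2, 1, 0],
    [3, 2, 1],
    [1, 2, 2],
    [3, 1, 1],
], dtype=float)
max_marks = np.array([3, 2, 2], dtype=float)   # maximum mark available per item

# Facility: mean mark obtained divided by the maximum mark available.
facility = marks.mean(axis=0) / max_marks

# Discrimination: correlation between the item mark and the total test mark.
totals = marks.sum(axis=1)
discrimination = np.array([
    np.corrcoef(marks[:, i], totals)[0, 1] for i in range(marks.shape[1])
])

print("facility:", facility.round(2))               # e.g. [0.75 0.75 0.5]
print("discrimination:", discrimination.round(2))
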
The main problem with Classical Test Theory is that the conclusions drawn depend very much on the sample used to collect the information: there is an inter-dependence of item and candidate.

Item Response Theory
Item Response Theory (IRT) assumes that there is a correlation between the score gained by a candidate for one item/test (measurable) and their overall ability on the latent trait which underlies test performance (which we want to discover). Critically, the 'characteristics' of an item are said to be independent of the ability of the candidates who were sampled.

Item Response Theory comes in three forms: IRT1, IRT2 and IRT3, reflecting the number of parameters considered in each case.
  • For IRT1, only the difficulty of an item is considered (difficulty is the level of ability required to be more likely to answer the question correctly than incorrectly).
  • For IRT2, difficulty and discrimination are considered (discrimination is how well the question separates candidates of similar abilities).
  • For IRT3, difficulty, discrimination and chance are considered (chance is the random factor which enhances a candidate's probability of success through guessing).
IRT can be used to create a unique plot for each item (the Item Characteristic Curve, or ICC). The ICC is a plot of the probability that the item will be answered correctly against ability. The shape of the ICC reflects the influence of the three factors:
  • Increasing the difficulty of an item causes the curve to shift right - as candidates need to be more able to have the same chance of passing.
  • Increasing the discrimination of an item causes the gradient of the curve to increase. Candidates below a given ability are less likely to answer correctly, whilst candidates above a given ability are more likely to answer correctly.
  • Increasing the chance raises the baseline of the curve.
Of course, when you carry out a test for the first time you don't know the ICC of the item, because you don't know the difficulty (or discrimination) of that item. Rather, you estimate the parameters (using parameter estimation techniques) to find values which fit the data you observed.
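
For illustration, the sketch below plots hypothetical ICCs using the standard three-parameter logistic form P(theta) = c + (1 - c) / (1 + exp(-a(theta - b))), where b is difficulty, a is discrimination and c is chance; the parameter values are invented simply to show the three effects described above:

import numpy as np
import matplotlib.pyplot as plt

def icc(theta, a=1.0, b=0.0, c=0.0):
    """3PL item characteristic curve: a = discrimination, b = difficulty, c = chance."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 200)                       # ability scale
plt.plot(theta, icc(theta), label="baseline item")
plt.plot(theta, icc(theta, b=1.5), label="harder: curve shifts right")
plt.plot(theta, icc(theta, a=2.5), label="more discriminating: steeper gradient")
plt.plot(theta, icc(theta, c=0.25), label="chance: raised baseline")
plt.xlabel("Ability")
plt.ylabel("Probability of a correct answer")
plt.legend()
plt.show()
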
Using IRT models allows items to be characterised and ranked by their difficulty, and this can be exploited when generating item banks of equivalent questions. It is important to remember, though, that in IRT2 and IRT3 question difficulty rankings may vary over the ability range.

Rasch Measurement
Rasch measurement is very similar to IRT1 - in that it considers only one parameter (difficulty) and the ICC is calculated in the same way. When it comes to utilising these theories to categorise items however, there is a significant difference. If you have a set of data, and analyse it with IRT1, then you arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme data (e.g. questions which are consistently well or poorly answered) is discarded and the model is fitted to the remaining data.

ITEM STATISTICS
Item statistics are used to assess the performance of individual test items on the assumption that the overall quality of a test derives from the quality of its items.

Item Number: This is the question number taken from the student answer sheet and the key sheet.

Mean and S.D:  The mean is the "average" student response to an item.  It is computed by adding up the number of points earned by all students for the item, and dividing that total by the number of students. The standard deviation, or S.D., is a measure of the dispersion of student scores on that item, that is, it indicates how "spread out" the responses were. The item standard deviation is most meaningful when comparing items which have more than one correct alternative and when scale scoring is used.  For this reason it is not typically used to evaluate classroom tests.  

Item Difficulty:  For items with one correct alternative worth a single point, the item difficulty is simply the percentage of students who answer an item correctly.  In this case, it is also equal to the item mean.  The item difficulty index ranges from 0 to 100; the higher the value, the easier the question.  When an alternative is worth other than a single point, or when there is more than one correct alternative per question, the item difficulty is the average score on that item divided by the highest number of points for any one alternative.  Item difficulty is relevant for determining whether students have learned the concept being tested.  It also plays an important role in the ability of an item to discriminate between students who know the tested material and those who do not.  The item will have low discrimination if it is so difficult that almost everyone gets it wrong or guesses, or so easy that almost everyone gets it right. To maximize item discrimination, desirable difficulty levels are slightly higher than midway between chance and perfect scores for the item.
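
For single-point items this reduces to the percentage of correct responses; a minimal sketch with an invented 0/1 response matrix (rows are students, columns are items):

import numpy as np

responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
])

# Difficulty index: percentage of students answering each item correctly
# (the item mean scaled to 0-100; higher values mean easier items).
difficulty = responses.mean(axis=0) * 100
print(difficulty)   # -> [80. 60. 20. 80.]
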

Item Discrimination:  Item discrimination refers to the ability of an item to differentiate among students on the basis of how well they know the material being tested.  Various hand calculation procedures have traditionally been used to compare item responses to total test scores using high and low scoring groups of students.  Computerized analyses provide more accurate assessment of the discrimination power of items because they take into account responses of all students rather than just high and low scoring groups. Because the discrimination index reflects the degree to which an item and the test as a whole are measuring a unitary ability or attribute, values  of the coefficient will tend to be lower for tests measuring a wide range of content areas than for more homogeneous tests.  Item discrimination indices must always be interpreted in the context of the type of test which is being analyzed. Items with low discrimination indices are often ambiguously worded and should be examined. Items with negative indices should be examined to determine why a negative value was obtained. Tests with high internal consistency consist of items with mostly positive relationships with total test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and total score distributions.
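
One common computerised approach is the corrected item-total correlation, i.e. the correlation between the item score and the total score with that item removed; a minimal sketch with invented 0/1 data:

import numpy as np

responses = np.array([   # rows are students, columns are items
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
])

totals = responses.sum(axis=1)
for i in range(responses.shape[1]):
    rest = totals - responses[:, i]                 # total score minus this item
    r = np.corrcoef(responses[:, i], rest)[0, 1]    # corrected item-total correlation
    print(f"item {i + 1}: discrimination = {r:.2f}")
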

Alternate Weight:  This column shows the number of points given for each response alternative. For most tests, there will be one correct answer which will be given one point.

Means:  The mean total test score (minus that item) is shown for students who selected each of the possible response alternatives.  This information should be looked at in conjunction with the discrimination index; higher total test scores should be obtained by students choosing the correct or most highly weighted alternative.  Incorrect alternatives with relatively high means should be examined to determine why "better" students chose that particular alternative. 
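
A minimal sketch of this statistic for a single item, using invented choices, item scores and total test scores (the item is keyed "A" and worth one point):

import numpy as np

choices = np.array(["A", "B", "A", "C", "A", "B"])   # alternative chosen by each student
points  = np.array([ 1,   0,   1,   0,   1,   0 ])   # points earned on this item
totals  = np.array([18,  11,  15,   9,  17,  12 ])   # total test scores

rest = totals - points                               # total score minus this item
for alt in ["A", "B", "C"]:
    mask = choices == alt
    if mask.any():
        print(f"{alt}: mean total (minus item) = {rest[mask].mean():.1f}")
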

Frequencies and Distribution:  The number and percentage of students who choose each alternative are reported.  The bar graph on the right shows the percentage choosing each response.  Frequently chosen wrong alternatives may indicate common misconceptions among the students. 

DIFFICULTY AND DISCRIMINATION DISTRIBUTION
At the end of the Item Analysis report, test items are listed according to their degrees of difficulty (easy, medium, and hard) and discrimination (good, fair, and poor).  These distributions provide a quick overview of the test, and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
  • The Discrimination Index distinguishes for each item between the performance of students who did well on the exam and students who did poorly.
    • How does it work?
      • For each item, subtract the number of students in the lower group who answered correctly from the number of students in the upper group who answered correctly.
      • Divide the result by the number of students in one group.
      • The Discrimination Index is listed in decimal format and ranges between -1 and 1.
  • What is a good value?
    • For exams with a normal distribution, a discrimination of 0.3 and above is good; 0.6 and above is very good.
    • Values close to 0 mean that most students performed the same on an item.
    • The index should never be negative.


    • Example:
Item No.   Correct in Upper 1/4   Correct in Lower 1/4   Item Discrimination Index
   1                90                     20                        0.7
   2                80                     70                        0.1
   3               100                      0                        1.0
   4               100                    100                        0.0
   5                50                     50                        0.0
   6                20                     60                       -0.4
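
The index values above can be reproduced directly; the sketch below assumes 100 students in each of the upper and lower groups, which is consistent with the counts shown:

# Number of correct answers per group, taken from the table above.
group_size = 100
items = {
    1: (90, 20), 2: (80, 70), 3: (100, 0),
    4: (100, 100), 5: (50, 50), 6: (20, 60),
}

for item, (upper_correct, lower_correct) in items.items():
    d = (upper_correct - lower_correct) / group_size
    print(f"item {item}: D = {d:+.1f}")
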

    • Use the following table as a guideline to determine whether an item (or its corresponding instruction) should be considered for revision.
                          Item Difficulty
Item Discrimination (D)   High      Medium    Low
D <= 0%                   review    review    review
0% < D < 30%              ok        review    ok
D >= 30%                  ok        ok        ok
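
The guideline above can also be expressed as a small lookup function; the band labels 'high', 'medium' and 'low' are hypothetical stand-ins for the report's difficulty categories:

def review_flag(d, difficulty):
    """d is the discrimination index as a proportion; difficulty is 'high', 'medium' or 'low'."""
    if d <= 0.0:
        return "review"
    if d < 0.30:
        return "review" if difficulty == "medium" else "ok"
    return "ok"

print(review_flag(0.45, "medium"))   # ok
print(review_flag(0.10, "medium"))   # review
print(review_flag(-0.20, "low"))     # review
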
  • Distracter Analysis
    • Compare the performance of the highest and lowest scoring 25% of the students on the distracter options (i.e. the incorrect answers presented on the exam).
    • Fewer of the top performers should choose each of the distracters as their answer compared to the bottom performers (a computational sketch follows the examples).
    • Examples:
Item 1                        A*    B    C    D    E   Omit
% of students in upper ¼      20    5    0    0    0    0
% of students in the middle   15   10   10   10    5    0
% of students in lower ¼       5    5    5   10    0    0

Item 2                         A    B    C   D*    E   Omit
% of students in upper ¼       0    5    5   15    0    0
% of students in the middle    0   10   15    5   20    0
% of students in lower ¼       0    5   10    0   10    0
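
A minimal sketch of a distracter analysis along these lines, using invented responses and total scores; the upper and lower quarters are the students with the highest and lowest total scores:

import numpy as np

options = ["A", "B", "C", "D", "E"]
answers = np.array(list("ABAADCABBACADAEABACB"))   # one item's responses (invented)
totals  = np.arange(len(answers))                  # invented total test scores

order = np.argsort(totals)                         # students sorted by total score
quarter = len(answers) // 4
lower, upper = order[:quarter], order[-quarter:]

for opt in options:
    hi = 100 * np.mean(answers[upper] == opt)
    lo = 100 * np.mean(answers[lower] == opt)
    print(f"{opt}: upper quarter {hi:.0f}%, lower quarter {lo:.0f}%")
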

