ITEM ANALYSIS
Item analysis is a process which examines student responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction and for identifying specific areas of course content which need greater emphasis or clarity.
Item Analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also regulate and standardise existing tests.
There are three main types of item analysis: Item Response Theory, Rasch Measurement and Classical Test Theory. Although Classical Test Theory and Rasch Measurement will be discussed, this document concentrates primarily on Item Response Theory.
THE MODELS
Classical Test Theory
Classical Test Theory (traditionally the main method used in the
United Kingdom) utilises two main statistics - Facility and Discrimination.
- Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. A test as a whole should aim for an overall facility of around 0.5; however, it is acceptable for individual items to have a higher or lower facility (ranging from 0.2 to 0.8).
- Discrimination measures how performance on an item correlates with performance on the test as a whole. There should always be some correlation between item and test performance; discrimination is normally expected to fall between 0.2 and 1.0.
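A minimal sketch of how these two statistics might be computed for dichotomously marked items, using a small invented score matrix; the function names and data are assumptions for illustration rather than part of any particular analysis package.

```python
import numpy as np

def facility(item_scores, max_mark):
    """Facility: mean mark obtained on the item divided by the maximum mark available."""
    return np.mean(item_scores) / max_mark

def discrimination(item_scores, total_scores):
    """Discrimination: correlation between scores on the item and total test scores.
    (For simplicity the item itself is left in the total here.)"""
    return np.corrcoef(item_scores, total_scores)[0, 1]

# Toy data: rows are candidates, columns are items, each item marked out of 1.
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
])
totals = scores.sum(axis=1)

for i in range(scores.shape[1]):
    print(f"Item {i + 1}: facility={facility(scores[:, i], 1):.2f}, "
          f"discrimination={discrimination(scores[:, i], totals):.2f}")
```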
The main problem with Classical Test Theory is that the conclusions drawn depend very much on the sample of candidates used to collect the information: there is an inter-dependence of item and candidate.
Item Response Theory
Item Response Theory (IRT) assumes that there is a correlation
between the score gained by a candidate for one item/test (measurable) and
their overall ability on the latent trait which underlies test performance
(which we want to discover). Critically, the 'characteristics' of an item are
said to be independent of the ability of the candidates who were sampled.
Item Response Theory comes in three forms, IRT1, IRT2, and IRT3, reflecting the number of parameters considered in each case.
- For IRT1, only the difficulty of an item is considered (difficulty is the level of ability required to be more likely to answer the question correctly than incorrectly).
- For IRT2, difficulty and discrimination are considered (discrimination is how well the question separates candidates of similar ability).
- For IRT3, difficulty, discrimination and chance are considered (chance is the random factor which raises a candidate's probability of success through guessing).
IRT can be used to create a unique plot for each item: the Item Characteristic Curve (ICC). The ICC is a plot of the probability that the item will be answered correctly against candidate ability. The shape of the ICC reflects the influence of the three factors:
- Increasing the difficulty of an item causes the curve to shift right, as candidates need to be more able to have the same chance of answering correctly.
- Increasing the discrimination of an item causes the gradient of the curve to increase: candidates below a given ability become less likely to answer correctly, whilst candidates above that ability become more likely to answer correctly.
- Increasing the chance raises the baseline of the curve.
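These three factors are commonly combined in the three-parameter logistic (3PL) form of the ICC: P(ability) = chance + (1 - chance) / (1 + exp(-discrimination × (ability - difficulty))). The sketch below is a minimal illustration of that form under invented parameter values; IRT1 and IRT2 correspond to holding chance (and discrimination) fixed.

```python
import math

def icc_3pl(ability, difficulty, discrimination=1.0, chance=0.0):
    """Probability of answering the item correctly under the 3PL model.

    difficulty     - ability at which P is midway between `chance` and 1
                     (raising it shifts the curve to the right)
    discrimination - steepness of the curve around that point
    chance         - baseline probability of success from guessing
    """
    return chance + (1.0 - chance) / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Two items of equal difficulty: the second is steeper and guessable.
for ability in (-2, -1, 0, 1, 2):
    p1 = icc_3pl(ability, difficulty=0.0)                                   # IRT1-style item
    p2 = icc_3pl(ability, difficulty=0.0, discrimination=2.0, chance=0.25)  # IRT3-style item
    print(f"ability={ability:+d}  P1={p1:.2f}  P2={p2:.2f}")
```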
Of course, when you carry out a test for the first time you don't know the ICC of an item, because you don't know its difficulty (or discrimination). Rather, you estimate the parameters (using parameter estimation techniques) to find the values which best fit the data you observed.
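As a sketch only (operational IRT calibration typically uses maximum-likelihood or Bayesian estimation on the full response data rather than a least-squares fit), the example below fits the 3PL curve defined above to invented proportions correct observed in a few ability groups, using scipy's curve_fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def icc_3pl(ability, difficulty, discrimination, chance):
    return chance + (1.0 - chance) / (1.0 + np.exp(-discrimination * (ability - difficulty)))

# Invented observations: mean ability of each candidate group and the
# proportion of that group answering the item correctly.
abilities = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
prop_correct = np.array([0.22, 0.30, 0.55, 0.80, 0.93])

# Estimate difficulty, discrimination and chance; chance is constrained to [0, 0.5].
params, _ = curve_fit(
    icc_3pl, abilities, prop_correct,
    p0=[0.0, 1.0, 0.2],
    bounds=([-4.0, 0.1, 0.0], [4.0, 4.0, 0.5]),
)
print("difficulty=%.2f  discrimination=%.2f  chance=%.2f" % tuple(params))
```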
Using IRT models allows items to be characterised and ranked by their difficulty, and this can be exploited when generating item banks of equivalent questions. It is important to remember, though, that in IRT2 and IRT3 the difficulty ranking of questions may vary over the ability range.
Rasch Measurement
Rasch measurement is very similar to IRT1, in that it considers only one parameter (difficulty) and the ICC is calculated in the same way. When it comes to using these theories to categorise items, however, there is a significant difference. If you analyse a set of data with IRT1, you arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme data (e.g. questions which are consistently well or poorly answered) is discarded and the model is fitted to the remaining data.
ITEM STATISTICS
Item
statistics are used to assess the performance of individual test items on the
assumption that the overall quality of a test derives from the quality of its
items.
Item Number:
This is the question number taken from the student answer sheet, and the key
sheet.
Mean and S.D: The mean
is the "average" student response to an item. It is computed by adding up the number of
points earned by all students for the item, and dividing that total by the number
of students. The standard deviation,
or S.D., is a measure of the dispersion of student scores on that item, that
is, it indicates how "spread out" the responses were. The item
standard deviation is most meaningful when comparing items which have more than
one correct alternative and when scale scoring is used. For this reason it is not typically used to
evaluate classroom tests.
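For example, a minimal sketch of both statistics for a single item worth one point (the scores are invented, and np.std here uses the population formula, which a given report may or may not follow):

```python
import numpy as np

# Points earned on one item by each student (invented data, item worth 1 point).
item_points = np.array([1, 0, 1, 1, 0, 1, 1, 0])

mean = item_points.sum() / len(item_points)   # total points earned / number of students
sd = np.std(item_points)                      # how "spread out" the responses are
print(f"item mean = {mean:.2f}, S.D. = {sd:.2f}")
```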
Item Difficulty: For items with one correct alternative worth
a single point, the item difficulty is simply the percentage of students who
answer an item correctly. In this case,
it is also equal to the item mean. The
item difficulty index ranges from 0 to 100; the higher the value, the easier the
question. When an alternative is worth
other than a single point, or when there is more than one correct alternative
per question, the item difficulty is the average score on that item divided by
the highest number of points for any one alternative. Item difficulty is relevant for determining whether
students have learned the concept being tested.
It also plays an important role in the ability of an item to
discriminate between students who know the tested material and those who do
not. The item will have low
discrimination if it is so difficult that almost everyone gets it wrong or
guesses, or so easy that almost everyone gets it right. To maximize item
discrimination, desirable difficulty levels are slightly higher than midway
between chance and perfect scores for the item.
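A short sketch of the difficulty index in both cases described above, together with the midway-between-chance-and-perfect rule of thumb, assuming a five-option multiple-choice item (all numbers are invented):

```python
# Single correct alternative worth one point: percentage answering correctly.
num_students, num_correct = 200, 130
difficulty = 100 * num_correct / num_students            # 65 on the 0-100 scale

# Alternatives worth more than one point: average item score divided by the
# highest number of points available for any one alternative.
average_score, max_points = 2.4, 4
difficulty_weighted = 100 * average_score / max_points   # 60

# Desirable difficulty is slightly above midway between the chance score and a
# perfect score; for a five-option item, chance is 20%, so midway is 60%.
chance = 100 / 5
midway = (chance + 100) / 2
print(difficulty, difficulty_weighted, midway)
```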
Item Discrimination: Item discrimination refers to the ability of
an item to differentiate among students on the basis of how well they know the
material being tested. Various hand
calculation procedures have traditionally been used to compare item responses
to total test scores using high and low scoring groups of students. Computerized analyses provide more accurate
assessment of the discrimination power of items because they take into account
responses of all students rather than just high and low scoring groups. Because
the discrimination index reflects the degree to which an item and the test as a
whole are measuring a unitary ability or attribute, values of the coefficient will tend to be lower for
tests measuring a wide range of content areas than for more homogeneous
tests. Item discrimination indices must
always be interpreted in the context of the type of test which is being
analyzed. Items with low discrimination indices are often ambiguously worded
and should be examined. Items with negative indices should be examined to
determine why a negative value was obtained. Tests with high internal
consistency consist of items with mostly positive relationships with total test
score. In practice, values of the discrimination index will seldom exceed .50
because of the differing shapes of item and total score distributions.
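The report being described does not specify exactly which coefficient it computes, but one common computerised index of this kind is the correlation between scores on an item and scores on the rest of the test (an item-rest, or corrected point-biserial, correlation). A minimal sketch on an invented 0/1 score matrix:

```python
import numpy as np

def item_discrimination(scores, item):
    """Correlation between one item's scores and the rest of the test
    (the total score with that item removed)."""
    item_scores = scores[:, item]
    rest_scores = scores.sum(axis=1) - item_scores
    return np.corrcoef(item_scores, rest_scores)[0, 1]

# Invented 0/1 score matrix: rows are students, columns are items.
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
])
print([round(item_discrimination(scores, i), 2) for i in range(scores.shape[1])])
```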
Alternate Weight: This column shows the number of points given
for each response alternative. For
most tests, there will be one correct answer which will be given one point.
Means: The mean total test score (minus that item)
is shown for students who selected each of the possible response
alternatives. This information should be
looked at in conjunction with the discrimination index; higher total test
scores should be obtained by students choosing the correct or most highly
weighted alternative. Incorrect
alternatives with relatively high means should be examined to determine why
"better" students chose that particular alternative.
Frequencies and
Distribution:
The number and percentage of students who choose each alternative are
reported. The bar graph on the right
shows the percentage choosing each response.
Frequently chosen wrong alternatives may indicate common misconceptions
among the students.
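A small sketch covering both of the columns just described: the mean total score (minus the item) for students selecting each alternative, and the frequency and percentage selecting it. The alternative labels, weights and scores are invented.

```python
import numpy as np
from collections import Counter

# Invented data for one item: each student's chosen alternative and total test score.
choices = ["A", "B", "A", "C", "A", "D", "B", "A", "C", "A"]
totals  = [38,  22,  35,  20,  40,  18,  30,  36,  25,  39]
weights = {"A": 1, "B": 0, "C": 0, "D": 0}   # alternate weights: A is the keyed answer

counts = Counter(choices)
for alt in sorted(counts):
    # Mean total test score, minus the points earned on this item, for the
    # students who selected this alternative.
    mean_rest = np.mean([t - weights[alt] for c, t in zip(choices, totals) if c == alt])
    pct = 100 * counts[alt] / len(choices)
    print(f"{alt}: n={counts[alt]}  ({pct:.0f}%)  mean total (minus item) = {mean_rest:.1f}")
```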
DIFFICULTY AND DISCRIMINATION DISTRIBUTION
At the end of the Item Analysis report, test items are listed according to their degree of difficulty (easy, medium, and hard) and discrimination (good, fair, and poor). These distributions provide a quick overview of the test, and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
- The Discrimination Index distinguishes for each item between the performance of students who did well on the exam and students who did poorly.
- How does it work?
- For each item, subtract the number of students in the lower group who answered correctly from the number of students in the upper group who answered correctly.
- Divide the result by the number of students in one group.
- The Discrimination Index is listed in decimal format and ranges between -1 and 1.
- What is a good value?
- For exams with a normal distribution, a discrimination of 0.3 and above is good; 0.6 and above is very good.
- Values close to 0 mean that most students performed the same on an item.
- The index should never be negative.
- Example:
| Item No. | Correct Answers in Upper 1/4 | Correct Answers in Lower 1/4 | Item Discrimination Index |
|----------|------------------------------|------------------------------|---------------------------|
| 1        | 90                           | 20                           | 0.7                       |
| 2        | 80                           | 70                           | 0.1                       |
| 3        | 100                          | 0                            | 1                         |
| 4        | 100                          | 100                          | 0                         |
| 5        | 50                           | 50                           | 0                         |
| 6        | 20                           | 60                           | -0.4                      |

  (Each quarter group here contains 100 students, so, for example, item 1 gives (90 - 20) / 100 = 0.7.)
- Use the following table as a guideline to determine whether an item (or its corresponding instruction) should be considered for revision.
| Item Discrimination (D) | High Difficulty | Medium Difficulty | Low Difficulty |
|-------------------------|-----------------|-------------------|----------------|
| D <= 0%                 | review          | review            | review         |
| 0% < D < 30%            | ok              | review            | ok             |
| D >= 30%                | ok              | ok                | ok             |
- Distracter Analysis
- Compare the performance of the highest and lowest scoring 25% of the students on the distracter options (i.e. the incorrect answers presented on the exam).
- Fewer of the top performers should choose each of the distracters as their answer compared to the bottom performers.
- Examples (a short computational sketch of the discrimination index and the distracter comparison follows these tables):
(* marks the keyed correct answer)

| Item 1                      | A* | B  | C  | D  | E  | Omit |
|-----------------------------|----|----|----|----|----|------|
| % of students in upper ¼    | 20 | 5  | 0  | 0  | 0  | 0    |
| % of students in the middle | 15 | 10 | 10 | 10 | 5  | 0    |
| % of students in lower ¼    | 5  | 5  | 5  | 10 | 0  | 0    |

| Item 2                      | A  | B  | C  | D* | E  | Omit |
|-----------------------------|----|----|----|----|----|------|
| % of students in upper ¼    | 0  | 5  | 5  | 15 | 0  | 0    |
| % of students in the middle | 0  | 10 | 15 | 5  | 20 | 0    |
| % of students in lower ¼    | 0  | 5  | 10 | 0  | 10 | 0    |
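To make the two procedures above concrete (the upper/lower Discrimination Index and the distracter comparison), here is a minimal sketch on randomly generated responses for a single item. The 25% group split, the option labels, the key and the way total scores are simulated are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def upper_lower_groups(total_scores, fraction=0.25):
    """Indices of the top and bottom `fraction` of students by total score."""
    order = np.argsort(total_scores)
    n = max(1, int(len(total_scores) * fraction))
    return order[-n:], order[:n]

def discrimination_index(correct, total_scores, fraction=0.25):
    """(correct in upper group - correct in lower group) / students per group."""
    upper, lower = upper_lower_groups(total_scores, fraction)
    return (correct[upper].sum() - correct[lower].sum()) / len(upper)

def distracter_percentages(choices, group, n_students):
    """Percentage of the whole class, within one group, choosing each option."""
    counts = Counter(choices[i] for i in group)
    return {opt: 100 * counts.get(opt, 0) / n_students for opt in "ABCDE"}

# Invented data for one item: each student's chosen option, the key, and totals.
rng = np.random.default_rng(0)
choices = rng.choice(list("ABCDE"), size=40, p=[0.4, 0.2, 0.15, 0.15, 0.1])
key = "A"
correct = (choices == key).astype(int)
totals = correct * 10 + rng.integers(0, 30, size=40)   # correct answers push totals up

upper, lower = upper_lower_groups(totals)
print("Discrimination index D =", round(discrimination_index(correct, totals), 2))
print("upper 1/4:", distracter_percentages(choices, upper, len(choices)))
print("lower 1/4:", distracter_percentages(choices, lower, len(choices)))
```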