FACT SHEET 24A
Title: Item Analysis Assumptions (Difficulty & Discrimination Indexes)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical Education. Ext. 47142
Introduction:
It is widely believed that “Assessment drives the curriculum”. Hence it can be argued that if the quality of teaching, training, and learning is to be upgraded, assessment is the obvious starting point. However, upgrading assessment is a continuous process: the cycle of planning and constructing assessment tools, followed by testing, validating, and reviewing, has to be repeated continuously.
When tests are developed for instructional purposes, to assess the effects of educational programs, or for educational research purposes, it is very important to conduct item and test analyses. These analyses evaluate the quality of the items and of the test as a whole. Such analyses can also be employed to revise and improve both items and the test as a whole.
Quantitative item analysis is a technique that enables us to assess the quality or utility of an item. It does so by identifying distractors or response options that are underperforming.
Item-analysis procedures are intended to maximize test reliability. Because maximization of test reliability is accomplished by determining the relationship between individual items and the test as a whole, it is important to ensure that the overall test is measuring what it is supposed to measure. If this is not the case, the total score will be a poor criterion for evaluating each item.
The use of a multiple-choice format for hour exams at many institutions leads to a deluge of statistical data, which are often neglected or completely ignored. This fact sheet introduces some of the terms encountered in the analysis of test results, so that these data may become more meaningful and therefore more useful.
Need for Item Analysis
1) Provision of information about how the quality of test items compares. This comparison is necessary if subsequent tests of the same material are going to be better.
2) Provision of diagnostic information about the types of items that students most often get incorrect. This information can be used as a basis for making instructional decisions.
3) Provision of a rational basis for discussing test results with students.
4) Communication to the test developer of which items need to be improved, or eliminated and replaced with better items.
What is the Output of Item Analysis?
Item analysis could yield the following outputs:
- Distribution of responses for each distractor of each item or frequencies of responses (histogram).
- Difficulty index for each item of the test
- Discrimination Index for each item of the test.
- Measure of exam internal consistency reliability
Total Score Frequencies
Issues to consider when interpreting the distribution of students’ total scores:
Distribution:
- Is this the distribution you expected?
- Was the test easier or more difficult than you anticipated?
- How does the mean score of this year’s class compare to scores from previous classes?
- Is there a ceiling effect – that is, are all scores close to the top?
- Is there a floor effect – that is, are all scores close to the lowest possible score?
Spread of Scores:
- Is the spread of scores large?
- Are there students who are scoring low marks compared to the majority of the students?
- Can you determine why they are not doing as well as most other students?
- Can you provide any extra assistance?
- Is there a group of students who are well ahead of the other students?
Difficulty Index
The difficulty index tells us how easy the item was for the students in that particular group. The higher the difficulty index, the easier the question; the lower the difficulty index, the more difficult the question. The difficulty index is, in effect, an “easiness index”.
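As an illustration only (not the Assessment Unit's actual software), the short Python sketch below computes the index for one item as the proportion of students who chose the keyed answer; the responses and the key are invented.

```python
# Minimal sketch: difficulty (easiness) index = proportion of students
# answering the item correctly. Data below are hypothetical.

def difficulty_index(item_responses, correct_option):
    """Return the proportion of responses matching the keyed answer (0.00-1.00)."""
    correct = sum(1 for r in item_responses if r == correct_option)
    return correct / len(item_responses)

# Ten hypothetical students answering one MCQ whose key is "B".
responses = ["B", "B", "A", "B", "C", "B", "B", "D", "B", "B"]
print(f"Difficulty index: {difficulty_index(responses, 'B'):.2f}")  # 0.70 -> "Easy" in the table below
```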
Issues to consider in relation to the Difficulty Index
- Are the easiest items in your test (i.e. those ranked lowest in difficulty) placed first in the test?
- If the more difficult items occur at the start of the test, students can become upset because they feel, early on, that they cannot succeed.
Literature quotes the following (generalized) interpretation of the Difficulty Index:
Index Range   |  Inference about the Question
0.85 – 1.00   |  Very Easy
0.70 – 0.84   |  Easy
0.30 – 0.69   |  Optimum
0.15 – 0.29   |  Hard
0.00 – 0.14   |  Very Hard
Item Discrimination Index (DI)
The discrimination index is calculated by subtracting the proportion of students who answered the item correctly in the lower group from the proportion correct in the upper group. It is assumed that students in the top third on total scores should have a greater proportion answering the item correctly than those in the bottom third.
The calculation of the index is an approximation of a correlation between the scores on an item and the total score. Therefore, the DI is a measure of how successfully an item discriminates between students of different abilities on the test as a whole. Any item which does not discriminate between the lower and upper groups of students would have a DI = 0. An item on which the lower group performed better than the upper group would have a negative DI.
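The sketch below (illustrative Python with invented scores, not the department's scoring software) implements this upper-third/lower-third calculation.

```python
# Minimal sketch: DI = (proportion correct in the top third on total score)
#                    - (proportion correct in the bottom third).
# All data below are hypothetical.

def discrimination_index(total_scores, item_correct):
    """total_scores: each student's total test score; item_correct: 1/0 per student for one item."""
    n = len(total_scores)
    k = max(1, n // 3)                                   # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Nine hypothetical students.
totals = [35, 42, 55, 61, 64, 70, 78, 83, 90]
flags  = [0,  0,  1,  0,  1,  1,  1,  1,  1]             # 1 = answered this item correctly
print(f"DI = {discrimination_index(totals, flags):.2f}")  # 1.00 - 0.33 = 0.67
```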
The discrimination index is affected by the difficulty of an item, because by definition, if an item is very easy everyone tends to get it right and it does not discriminate. Likewise, if it is very difficult everyone tends to get it wrong. Such items can be important to have in a test because they help define the range of difficulty of concepts assessed. Items should not be discarded just because they do not discriminate.
Issues to consider in relation to the Item Discrimination Index
- Are there any items with a negative discrimination index (DI)? That is, items where students in the lower third of the group did better than students in the upper third of the group?
- Was this a deceptively easy item?
- Was the correct answer key used?
- Are there any items that do not discriminate between the students i.e. where the DI is 0.00 or very close to 0.0?
- Are these items either very hard or very easy, and therefore items where a DI of 0 could be expected?
Literature quotes the following (generalized) interpretation of the Discrimination Index:
Index Range   |  Inference about the Question
Below 0.19    |  Poor
0.20 – 0.29   |  Dubious
0.30 – 1.00   |  Okay
FACT SHEET 24B
Title: Item Analysis Assumptions
Measure of exam internal consistency (reliability) Kuder-Richardson 20 (KR20)
Date: June 2008
Details: Mr. Imran Zafar, Database Administrator, Assessment Unit, Dept of Medical Education. Ext. 47142
Test Validity and Reliability
Test reliability measures the accuracy, stability, and consistency of the test scores. Reliability is affected by the characteristics of the students, characteristics of the test, and conditions affecting test administration and scoring.
The two factors that determine overall test quality are test validity and test reliability.
Test validity is the appropriateness of the test for the subject area and students being tested. Validity cannot be measured by a computer. It is up to the instructor to design valid test items that best measure the intended subject area. By definition, valid tests are reliable. However, a reliable test is not necessarily valid. For example, a math test composed entirely of word problems may be measuring verbal skills as much as math ability.
The reliability of a test refers to the extent to which the test is likely to produce consistent scores. The KR-20 index is the appropriate index of test reliability for multiple-choice examinations.
What does the KR-20 measure?
The KR-20 is a measure of internal consistency reliability or how well your exam measures a single cognitive factor. If you administer an Embryology exam, you hope all test items relate to this broad construct. Similarly, an Obstetrics/Gynecology test is designed to measure this medical specialty.
Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability). In practice, their approximate range is from .50 to .90 for about 95% of the classroom tests.
High reliability means that the questions of a test tended to "pull together." Students who answered a given question correctly were more likely to answer other questions correctly. If a parallel test were developed by using similar items, the relative scores of students would show little change.
Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly. The resulting test scores reflect peculiarities of the items or the testing situation more than students' knowledge of the subject matter.
The KR-20 formula includes
(1) the number of test items on the exam,
(2) student performance on every test item, and
(3) the variance (standard deviation squared) for the set of student test scores.
The index ranges from 0.00 to 1.00. A value close to 0.00 means you are measuring many unknown factors but not what you intended to measure. You are close to measuring a single factor when your KR-20 is near 1.00. Most importantly, we can be confident that an exam with a high KR-20 has yielded student scores that are reliable (i.e., reproducible or consistent; or as psychometricians say, the true score). A medical school test should have a KR-20 of 0.60 or better to be acceptable.
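Putting those three ingredients together, a minimal sketch of the KR-20 computation (in Python, with an invented 0/1 score matrix; not the actual exam-scoring software) might look like the following.

```python
# Minimal sketch of KR-20 for dichotomously scored (0/1) items:
#   KR20 = (k / (k - 1)) * (1 - sum(p_i * q_i) / variance_of_total_scores)
# where k = number of items, p_i = proportion correct on item i, q_i = 1 - p_i.
# The score matrix below is hypothetical.

def kr20(score_matrix):
    """score_matrix[s][i] = 1 if student s answered item i correctly, else 0."""
    n = len(score_matrix)
    k = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n   # population variance
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in score_matrix) / n
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Six hypothetical students sitting a five-item exam.
matrix = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(f"KR-20 = {kr20(matrix):.2f}")   # about 0.83 for this invented data
```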
How do you interpret the KR-20 value?
Reliability     |  Interpretation
.90 and above   |  Excellent reliability; at the level of the best standardized tests
.80 – .90       |  Very good for a classroom test
.70 – .80       |  Good for a classroom test; in the range of most classroom tests. There are probably a few items which could be improved.
.60 – .70       |  Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 – .60       |  Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below    |  Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
Standard Error of Measurement
The standard error of measurement is directly related to the reliability of the test. It is an index of the amount of variability in an individual student's performance due to random measurement error. If it were possible to administer an infinite number of parallel tests, a student's score would be expected to change from one administration to the next due to a number of factors. For each student, the scores would form a "normal" (bell-shaped) distribution. The mean of the distribution is assumed to be the student's "true score," and reflects what he or she "really" knows about the subject. The standard deviation of the distribution is called the standard error of measurement and reflects the amount of change in the student's score which could be expected from one test administration to another.
Whereas the reliability of a test always varies between 0.00 and 1.00, the standard error of measurement is expressed in the same scale as the test scores. For example, multiplying all test scores by a constant will multiply the standard error of measurement by that same constant, but will leave the reliability coefficient unchanged.
A general rule of thumb to predict the amount of change which can be expected in individual test scores is to multiply the standard error of measurement by 1.5. Only rarely would one expect a student's score to increase or decrease by more than that amount between two such similar tests. The smaller the standard error of measurement, the more accurate the measurement provided by the test.
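A short illustrative sketch (with invented numbers) of the standard relationship SEM = SD × √(1 − reliability), together with the ±1.5 × SEM rule of thumb mentioned above, is given below; it assumes the reliability has already been estimated (e.g. as a KR-20 value).

```python
# Minimal sketch: standard error of measurement from the score SD and the
# test reliability, plus the 1.5 * SEM rule of thumb quoted above.
# The numbers are hypothetical.

import math

def standard_error_of_measurement(score_sd, reliability):
    return score_sd * math.sqrt(1 - reliability)

sd_scores = 8.0     # hypothetical standard deviation of total test scores
reliability = 0.75  # hypothetical KR-20 for the exam

sem = standard_error_of_measurement(sd_scores, reliability)
print(f"SEM = {sem:.1f} marks")
print(f"Expected change between two similar tests: about ±{1.5 * sem:.1f} marks")
```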
A CAUTION in Interpreting Item Analysis Results
Each of the various item statistics provides information which can be used to improve individual test items and to increase the quality of the test as a whole. Such statistics must always be interpreted in the context of the type of test given and the individuals being tested. W. A. Mehrens and I. J. Lehmann provide the following set of cautions in using item analysis results (Measurement and Evaluation in Education and Psychology. New York : Holt, Rinehart and Winston, 1973, 333-334):
1. Item analysis data are not synonymous with item validity. An external criterion is required to accurately judge the validity of test items. By using the internal criterion of total test score, item analyses reflect internal consistency of items rather than validity.
2. The discrimination index is not always a measure of item quality. There is a variety of reasons an item may have low discriminating power:
a) extremely difficult or easy items will have low ability to discriminate but such items are often needed to adequately sample course content and objectives;
b) an item may show low discrimination if the test measures many different content areas and cognitive skills. For example, if the majority of the test measures "knowledge of facts," then an item assessing "ability to apply principles" may have a low correlation with total test score, yet both types of items are needed to measure attainment of course objectives.
3. Item analysis data are tentative. Such data are influenced by the type and number of students being tested, the instructional procedures employed, and chance errors. If repeated use of items is possible, statistics should be recorded for each administration of each item.
Conclusions
Developing the perfect test is the unattainable goal for anyone in an evaluative position. Even when guidelines for constructing fair and systematic tests are followed, a plethora of factors may enter into a student's perception of the test items. Looking at an item's difficulty and discrimination will assist the test developer in determining what is wrong with individual items. Item and test analysis provide empirical data about how individual items and whole tests are performing in real test situations.
One of the principal advantages of having MCQ tests scored and analyzed by computer at the College of Medicine, King Saud bin Abdulaziz University for Health Sciences is the feedback available on how well the test has performed. Careful consideration of the results of item analysis can lead to significant improvements in the quality of exams.
Item Analysis
Item Analysis allows us to observe the characteristics of a particular question (item) and can be used to ensure that questions are of an appropriate standard and to select items for inclusion in a test.
Introduction
Item Analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to create more effective test questions and also regulate and standardise existing tests.
There are three main types of Item Analysis: Item Response Theory, Rasch Measurement and Classical Test Theory. Although Classical Test Theory and Rasch Measurement will be discussed, this document will concentrate primarily on Item Response Theory.
The Models
Classical Test Theory
Classical Test Theory (traditionally the main method used in the United Kingdom) utilises two main statistics - Facility and Discrimination.
- Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. As a whole, a test should aim for an overall facility of around 0.5; however, it is acceptable for individual items to have a higher or lower facility (ranging from 0.2 to 0.8).
- Discrimination measures how performance on one item correlates with performance on the test as a whole. There should always be some correlation between item and test performance, and it is expected that discrimination will fall in a range between 0.2 and 1.0 (a small sketch of both statistics follows this list).
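The sketch below is a rough Python illustration, with invented marks, of how these two Classical Test Theory statistics can be computed; here discrimination is taken as the correlation between item mark and total test mark.

```python
# Minimal sketch of the two Classical Test Theory statistics listed above.
# Facility = mean item mark / maximum item mark.
# Discrimination = correlation between item mark and total test mark.
# All marks below are hypothetical.

import math

def facility(item_marks, max_mark):
    return sum(item_marks) / (len(item_marks) * max_mark)

def discrimination(item_marks, total_marks):
    n = len(item_marks)
    mi, mt = sum(item_marks) / n, sum(total_marks) / n
    cov = sum((x - mi) * (y - mt) for x, y in zip(item_marks, total_marks)) / n
    sd_i = math.sqrt(sum((x - mi) ** 2 for x in item_marks) / n)
    sd_t = math.sqrt(sum((y - mt) ** 2 for y in total_marks) / n)
    return cov / (sd_i * sd_t)

# One single-mark item answered by eight hypothetical candidates.
item   = [1, 0, 1, 1, 0, 1, 0, 1]
totals = [55, 40, 62, 70, 38, 66, 45, 80]
print(f"Facility = {facility(item, max_mark=1):.2f}")
print(f"Discrimination = {discrimination(item, totals):.2f}")
```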
The main problems with Classical Test Theory are that the conclusions drawn depend very much on the sample used to collect information. There is an inter-dependence of item and candidate.
Item Response Theory
Item Response Theory (IRT) assumes that there is a correlation between the score gained by a candidate for one item/test (measurable) and their overall ability on the latent trait which underlies test performance (which we want to discover). Critically, the 'characteristics' of an item are said to be independent of the ability of the candidates who were sampled.
Item Response Theory comes in three forms, IRT1, IRT2, and IRT3, reflecting the number of parameters considered in each case (a small sketch of the three models follows this list):
- For IRT1, only the difficulty of an item is considered (difficulty is the level of ability required to be more likely to answer the question correctly than incorrectly).
- For IRT2, difficulty and discrimination are considered (discrimination is how well the question separates out candidates of similar abilities).
- For IRT3, difficulty, discrimination, and chance are considered (chance is the random factor which increases a candidate's probability of success through guessing).
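For illustration, the logistic form commonly used for these models can be sketched as follows; the ability and parameter values are invented, not drawn from any real calibration.

```python
# Minimal sketch of the three logistic item response models named above:
#   IRT1: difficulty (b) only
#   IRT2: difficulty (b) and discrimination (a)
#   IRT3: difficulty (b), discrimination (a), and chance/guessing (c)
# All parameter values are hypothetical.

import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability that a candidate of ability theta answers the item correctly."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

theta = 0.5  # hypothetical candidate ability on the latent trait
print(f"IRT1: {p_correct(theta, b=0.0):.2f}")                # difficulty only
print(f"IRT2: {p_correct(theta, b=0.0, a=1.8):.2f}")         # + discrimination
print(f"IRT3: {p_correct(theta, b=0.0, a=1.8, c=0.2):.2f}")  # + guessing
```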
Rasch Measurement
Rasch measurement is very similar to IRT1 in that it considers only one parameter (difficulty) and the ICC (item characteristic curve) is calculated in the same way. When it comes to using these theories to categorise items, however, there is a significant difference. If you have a set of data and analyse it with IRT1, you arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme data (e.g. questions which are consistently answered well or poorly) are discarded and the model is fitted to the remaining data.
Reference: http://www.scrolla.hw.ac.uk/focus/ia.html
Why Is Item Analysis Important?
Item Analysis is an important (probably the most important) tool for increasing test effectiveness. Each item's contribution to the test is analyzed and assessed.
To write effective items, it is necessary to examine whether they are measuring the fact, idea, or concept for which they were intended. This is done by studying students' responses to each item. When formalized, the procedure is called “item analysis”. It is a scientific way of improving the quality of tests and test items in an item bank.
An item analysis provides three kinds of important information about the quality of test items.
- Item difficulty: A measure of whether an item was too easy or too hard.
- Item discrimination: A measure of whether an item discriminated between students who knew the material well and students who did not.
- Effectiveness of alternatives: Determination of whether distractors (incorrect but plausible answers) tend to be marked by the less able students and not by the more able students (illustrated in the sketch below).
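As a rough illustration of the third point, the Python sketch below (with made-up responses and scores) tallies how often each option was chosen by the stronger and weaker halves of a class; a well-functioning distractor attracts mainly the weaker group, while the key attracts the stronger group.

```python
# Minimal sketch of a distractor analysis: compare option choices between the
# upper and lower halves of the class (split on total score). Data are invented.

from collections import Counter

def distractor_table(responses, total_scores):
    """responses: option chosen by each student; total_scores: their total test scores."""
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    half = len(order) // 2
    lower, upper = order[:half], order[-half:]
    return {
        "lower group": Counter(responses[i] for i in lower),
        "upper group": Counter(responses[i] for i in upper),
    }

# One item keyed "C", answered by ten hypothetical students.
answers = ["C", "A", "C", "B", "C", "A", "C", "D", "C", "B"]
scores  = [78,  45,  82,  50,  70,  40,  88,  35,  65,  55]
for group, counts in distractor_table(answers, scores).items():
    print(group, dict(counts))
```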
For further articles on item analysis, please refer to the URLs below:
a. www1.carleton.ca/edc/ccms/wp-content/ccms.../Item-Analysis.pdf
b. http://www.education.com/reference/article/item-analysis/#A
c. www.washington.edu/oea/pdfs/resources/item_analysis.pdf