How to Read and Actually Use a Test Item Analysis Report
- Jacklyn DelPrete
- Feb 8
- 4 min read

If you’ve ever opened a test item analysis report and thought, “I have no idea where to start with this,” then welcome. You’re in very good company.
Item analysis reports are packed with numbers, short on explanation, and often dropped into your inbox with zero guidance. But buried in that spreadsheet is information that can make your exams fairer, clearer, and easier to defend—once you know what to look for.
Below are the five item analysis statistics that matter most, what they mean, and how to use them without spiraling.
1. Item Difficulty (How many students got this question right?)
Item difficulty is reported as a proportion or percentage (usually between 0.00 and 1.00). Despite the name, it doesn't describe how hard the question is; it only describes how students performed: the proportion who answered correctly, so higher values actually mean easier items.
0.90 = 90% of students answered correctly
0.40 = 40% answered correctly
Why this matters: Item difficulty helps you determine whether a question functioned as intended. Every exam should have a range of difficulty—easy recall, moderate application, and harder synthesis questions.
Example: A question on basic infection control principles has a difficulty of 0.38.
That’s concerning because:
This content is foundational
It’s emphasized heavily in lecture
Students should demonstrate mastery
But what if a complex prioritization question has a difficulty of 0.38? That may be completely appropriate.
How to use it:
Compare difficulty to importance and timing of content
Look for items that are unexpectedly low or high
Don’t revise questions based on difficulty alone—pair it with discrimination
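If you like to see where the number comes from, here's a minimal Python sketch. The `responses` list is hypothetical 0/1 scoring data for a single item; the difficulty index is simply the proportion of correct responses.

```python
# Minimal sketch: item difficulty is the proportion of students who answered correctly.
# 'responses' is hypothetical scoring data for one item (1 = correct, 0 = incorrect).
responses = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

difficulty = sum(responses) / len(responses)
print(f"Item difficulty: {difficulty:.2f}")  # 0.70 -> 70% of students answered correctly
```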
2. Discrimination Index (Did strong students perform better than weaker students?)
Discrimination tells you whether a question can differentiate between students who understand the material and those who don’t. This is one of the most important indicators of question quality.
High discrimination means:
High-performing students answered correctly
Lower-performing students were more likely to miss it
Low discrimination means the item barely separated the two groups; negative discrimination means the pattern flipped, with lower-performing students more likely to answer correctly than your strongest students.
Example: A question has:
Difficulty: 0.65
Discrimination: –0.12
This means students who did poorly overall were more likely to answer this question correctly than your top students.
That’s a sign of:
Ambiguous wording
A misleading stem
A “trick” question
Or a correct answer that isn’t clearly correct
How to use it:
Flag items with low or negative discrimination first
Review stem clarity and answer defensibility
Ask, “Is this question testing what I meant to test?”
This stat often identifies flawed items even when difficulty looks fine.
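Reporting systems calculate discrimination in slightly different ways; one common approach compares the top and bottom scorers. Here's a minimal sketch assuming the upper-lower 27% method, with hypothetical student data (`exam` maps made-up student IDs to a total score and an item score).

```python
# Minimal sketch of an upper-lower discrimination index, assuming the common
# "top 27% vs. bottom 27%" grouping (your software may use a different cutoff).
# 'exam' maps hypothetical student IDs to (total_score, item_correct) pairs.
exam = {
    "s01": (92, 1), "s02": (88, 1), "s03": (85, 1), "s04": (80, 1),
    "s05": (76, 0), "s06": (72, 1), "s07": (68, 0), "s08": (64, 1),
    "s09": (58, 0), "s10": (52, 0), "s11": (48, 0), "s12": (41, 0),
}

ranked = sorted(exam.values(), key=lambda pair: pair[0], reverse=True)
n_group = max(1, round(len(ranked) * 0.27))           # size of the upper and lower groups
upper, lower = ranked[:n_group], ranked[-n_group:]

p_upper = sum(item for _, item in upper) / n_group    # proportion correct in the top group
p_lower = sum(item for _, item in lower) / n_group    # proportion correct in the bottom group
discrimination = p_upper - p_lower                    # ranges from -1.00 to +1.00
print(f"Discrimination index: {discrimination:+.2f}")
```

A common rule of thumb treats values above roughly +0.30 as healthy, but the signal that matters most is anything near zero or negative.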
3. Point-Biserial Correlation (Does this question align with overall exam performance?)
The point-biserial measures how performance on one item relates to performance on the entire exam. Think of it as a consistency check.
A positive point-biserial means:
Students who did well overall tended to get this item right
A negative value suggests:
High-performing students missed this question
Lower-performing students got it right
That’s a major warning sign.
Example: A pharmacology question has acceptable difficulty but a negative point-biserial.
This often means:
More than one answer seems correct
The “best” answer isn’t clearly the best
The question rewards test-taking strategy instead of knowledge
How to use it:
Treat negative point-biserial items as high priority for review
Look for subtle wording issues or outdated content
Consider whether the question aligns with course objectives
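Under the hood, the point-biserial is just the Pearson correlation between item scores (0/1) and total exam scores, so it's easy to sketch. Many systems report a corrected version that removes the item from the total before correlating; the hypothetical example below skips that step and assumes Python 3.10+ for `statistics.correlation`.

```python
# Minimal sketch: point-biserial as the correlation between one item's 0/1 scores
# and total exam scores. All values are hypothetical.
from statistics import correlation  # available in Python 3.10+

item_scores = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]           # 1 = answered this item correctly
total_scores = [95, 90, 84, 82, 77, 73, 66, 60, 58, 50]

r_pb = correlation(item_scores, total_scores)
print(f"Point-biserial: {r_pb:+.2f}")                   # negative values flag the item for review
```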
4. Distractor Analysis (Are the wrong answers actually working?)
Distractor analysis shows how often each incorrect answer was selected. This tells you whether your distractors are plausible and meaningful.
Good distractors:
Attract students with partial understanding
Reflect common misconceptions
Are selected by some students
Bad distractors:
Are rarely or never chosen
Are obviously wrong
Make the item easier (inflating the difficulty index) without improving discrimination
Example: A four-option question:
Correct answer: 70%
Distractor A: 25%
Distractor B: 3%
Distractor C: 2%
Distractors B and C aren’t doing any work.
How to use it:
Revise or replace distractors chosen by <5% of students
Use real student errors from assignments or exams to construct distractors
Strong distractors improve discrimination without increasing difficulty
Better distractors = better questions.
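A distractor breakdown is nothing more than a frequency count of which option each student chose. Here's a minimal sketch using hypothetical responses for a four-option item keyed to "B", with a flag for distractors chosen by fewer than 5% of students.

```python
# Minimal sketch of a distractor frequency count for one item (hypothetical data).
from collections import Counter

choices = ["B"] * 14 + ["A"] * 5 + ["C"] * 1   # 20 students; no one chose "D"
key = "B"
options = ["A", "B", "C", "D"]

counts = Counter(choices)
n = len(choices)
for option in options:
    pct = 100 * counts[option] / n
    label = "correct" if option == key else "distractor"
    flag = "  <- chosen by <5%, consider revising" if option != key and pct < 5 else ""
    print(f"{option} ({label}): {pct:.0f}%{flag}")
```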
5. Reliability (KR-20 or Cronbach’s Alpha) (How consistent is the exam as a whole?)
Reliability measures whether your exam consistently assesses student knowledge across items. This is an exam-level statistic—not a judgment of individual questions or your competence as a faculty member.
Higher values indicate greater consistency, but context matters.
Example: An exam reliability of 0.70 may be perfectly acceptable for:
Short exams
New courses
Early semesters
Exams with diverse content areas
Reliability improves over time as poor-performing items are revised or removed.
How to use it:
Track reliability across semesters, not single exams
Use it to justify gradual test improvement
Pair it with item-level data to guide revisions
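You will never need to compute KR-20 by hand, but seeing the formula at work makes the statistic less mysterious: it compares item-level variance to the variance of total scores. Here's a minimal sketch with a hypothetical 0/1 score matrix; it uses the population variance of totals, and some implementations use the sample variance instead, which shifts the value slightly.

```python
# Minimal sketch of KR-20 for a small, hypothetical 0/1 score matrix
# (rows = students, columns = items).
from statistics import pvariance

scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

k = len(scores[0])                                    # number of items
totals = [sum(row) for row in scores]                 # each student's total score
p = [sum(col) / len(scores) for col in zip(*scores)]  # proportion correct per item
pq_sum = sum(pi * (1 - pi) for pi in p)               # sum of item variances

kr20 = (k / (k - 1)) * (1 - pq_sum / pvariance(totals))
print(f"KR-20: {kr20:.2f}")
```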
How Faculty Should Actually Use Item Analysis
You are not expected to fix every imperfect item immediately.
A realistic, defensible approach:
Identify 3–5 items to revise per exam
Prioritize low discrimination and negative point-biserial
Document your review and revisions
Improve exams incrementally over time
That’s how strong, defensible assessments are built.
Check out this FREE 1-page guide on test analysis!