How to Read and Actually Use a Test Item Analysis Report
- Jacklyn DelPrete
- Feb 8
- 4 min read

If you’ve ever opened a test item analysis report and thought, “I have no idea where to start with this,” then welcome. You’re in very good company.
Item analysis reports are packed with numbers, short on explanation, and often dropped into your inbox with zero guidance. But buried in that spreadsheet is information that can make your exams fairer, clearer, and easier to defend—once you know what to look for.
Below are the five item analysis statistics that matter most, what they mean, and how to use them without spiraling.
1. Item Difficulty (How many students got this question right?)
Item difficulty is reported as a proportion or percentage (usually between 0.00 and 1.00). Despite the name, it doesn't describe how hard the question is; it only describes how students performed: the proportion who answered correctly, so higher values actually mean easier items.
0.90 = 90% of students answered correctly
0.40 = 40% answered correctly
Why this matters: Item difficulty helps you determine whether a question functioned as intended. Every exam should have a range of difficulty—easy recall, moderate application, and harder synthesis questions.
Example: A question on basic infection control principles has a difficulty of 0.38.
That’s concerning because:
This content is foundational
It’s emphasized heavily in lecture
Students should demonstrate mastery
But what if a complex prioritization question has a difficulty of 0.38? That may be completely appropriate.
How to use it:
Compare difficulty to importance and timing of content
Look for items that are unexpectedly low or high
Don’t revise questions based on difficulty alone—pair it with discrimination
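If you like to see where the number comes from, here's a minimal Python sketch. The `responses` list is hypothetical 0/1 scoring data for a single item; the difficulty index is simply the proportion of correct responses.

```python
# Minimal sketch: item difficulty is the proportion of students who answered correctly.
# 'responses' is hypothetical scoring data for one item (1 = correct, 0 = incorrect).
responses = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]

difficulty = sum(responses) / len(responses)
print(f"Item difficulty: {difficulty:.2f}")  # 0.70 -> 70% of students answered correctly
```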
2. Discrimination Index (Did strong students perform better than weaker students?)
Discrimination tells you whether a question can differentiate between students who understand the material and those who don’t. This is one of the most important indicators of question quality.
High discrimination means:
High-performing students answered correctly
Lower-performing students were more likely to miss it
Low discrimination means the item barely separated the two groups; negative discrimination means the pattern flipped, with lower-performing students more likely to answer correctly than your strongest students.
Example: A question has:
Difficulty: 0.65
Discrimination: –0.12
This means students who did poorly overall were more likely to answer this question correctly than your top students.
That’s a sign of:
Ambiguous wording
A misleading stem
A “trick” question
Or a correct answer that isn’t clearly correct
How to use it:
Flag items with low or negative discrimination first
Review stem clarity and answer defensibility
Ask, “Is this question testing what I meant to test?”
This stat often identifies flawed items even when difficulty looks fine.
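Reporting systems calculate discrimination in slightly different ways; one common approach compares the top and bottom scorers. Here's a minimal sketch assuming the upper-lower 27% method, with hypothetical student data (`exam` maps made-up student IDs to a total score and an item score).

```python
# Minimal sketch of an upper-lower discrimination index, assuming the common
# "top 27% vs. bottom 27%" grouping (your software may use a different cutoff).
# 'exam' maps hypothetical student IDs to (total_score, item_correct) pairs.
exam = {
    "s01": (92, 1), "s02": (88, 1), "s03": (85, 1), "s04": (80, 1),
    "s05": (76, 0), "s06": (72, 1), "s07": (68, 0), "s08": (64, 1),
    "s09": (58, 0), "s10": (52, 0), "s11": (48, 0), "s12": (41, 0),
}

ranked = sorted(exam.values(), key=lambda pair: pair[0], reverse=True)
n_group = max(1, round(len(ranked) * 0.27))           # size of the upper and lower groups
upper, lower = ranked[:n_group], ranked[-n_group:]

p_upper = sum(item for _, item in upper) / n_group    # proportion correct in the top group
p_lower = sum(item for _, item in lower) / n_group    # proportion correct in the bottom group
discrimination = p_upper - p_lower                    # ranges from -1.00 to +1.00
print(f"Discrimination index: {discrimination:+.2f}")
```

A common rule of thumb treats values above roughly +0.30 as healthy, but the signal that matters most is anything near zero or negative.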
3. Point-Biserial Correlation (Does this question align with overall exam performance?)
The point-biserial measures how performance on one item relates to performance on the entire exam. Think of it as a consistency check.
A positive point-biserial means:
Students who did well overall tended to get this item right
A negative value suggests:
High-performing students missed this question
Lower-performing students got it right
That’s a major warning sign.
Example: A pharmacology question has acceptable difficulty but a negative point-biserial.
This often means:
More than one answer seems correct
The “best” answer isn’t clearly the best
The question rewards test-taking strategy instead of knowledge
How to use it:
Treat negative point-biserial items as high priority for review
Look for subtle wording issues or outdated content
Consider whether the question aligns with course objectives
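Under the hood, the point-biserial is just the Pearson correlation between item scores (0/1) and total exam scores, so it's easy to sketch. Many systems report a corrected version that removes the item from the total before correlating; the hypothetical example below skips that step and assumes Python 3.10+ for `statistics.correlation`.

```python
# Minimal sketch: point-biserial as the correlation between one item's 0/1 scores
# and total exam scores. All values are hypothetical.
from statistics import correlation  # available in Python 3.10+

item_scores = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]           # 1 = answered this item correctly
total_scores = [95, 90, 84, 82, 77, 73, 66, 60, 58, 50]

r_pb = correlation(item_scores, total_scores)
print(f"Point-biserial: {r_pb:+.2f}")                   # negative values flag the item for review
```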
4. Distractor Analysis (Are the wrong answers actually working?)
Distractor analysis shows how often each incorrect answer was selected. This tells you whether your distractors are plausible and meaningful.
Good distractors:
Attract students with partial understanding
Reflect common misconceptions
Are selected by some students
Bad distractors:
Are rarely or never chosen
Are obviously wrong
Make the item easier (inflating the difficulty index) without improving discrimination
Example: A four-option question:
Correct answer: 70%
Distractor A: 25%
Distractor B: 3%
Distractor C: 2%
Distractors B and C aren’t doing any work.
How to use it:
Revise or replace distractors chosen by <5% of students
Use real student errors from assignments or exams to construct distractors
Strong distractors improve discrimination without increasing difficulty
Better distractors = better questions.
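A distractor breakdown is nothing more than a frequency count of which option each student chose. Here's a minimal sketch using hypothetical responses for a four-option item keyed to "B", with a flag for distractors chosen by fewer than 5% of students.

```python
# Minimal sketch of a distractor frequency count for one item (hypothetical data).
from collections import Counter

choices = ["B"] * 14 + ["A"] * 5 + ["C"] * 1   # 20 students; no one chose "D"
key = "B"
options = ["A", "B", "C", "D"]

counts = Counter(choices)
n = len(choices)
for option in options:
    pct = 100 * counts[option] / n
    label = "correct" if option == key else "distractor"
    flag = "  <- chosen by <5%, consider revising" if option != key and pct < 5 else ""
    print(f"{option} ({label}): {pct:.0f}%{flag}")
```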
5. Reliability (KR-20 or Cronbach’s Alpha) (How consistent is the exam as a whole?)
Reliability measures whether your exam consistently assesses student knowledge across items. This is an exam-level statistic—not a judgment of individual questions or your competence as a faculty member.
Higher values indicate greater consistency, but context matters.
Example: An exam reliability of 0.70 may be perfectly acceptable for:
Short exams
New courses
Early semesters
Exams with diverse content areas
Reliability improves over time as poor-performing items are revised or removed.
How to use it:
Track reliability across semesters, not single exams
Use it to justify gradual test improvement
Pair it with item-level data to guide revisions
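You will never need to compute KR-20 by hand, but seeing the formula at work makes the statistic less mysterious: it compares item-level variance to the variance of total scores. Here's a minimal sketch with a hypothetical 0/1 score matrix; it uses the population variance of totals, and some implementations use the sample variance instead, which shifts the value slightly.

```python
# Minimal sketch of KR-20 for a small, hypothetical 0/1 score matrix
# (rows = students, columns = items).
from statistics import pvariance

scores = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

k = len(scores[0])                                    # number of items
totals = [sum(row) for row in scores]                 # each student's total score
p = [sum(col) / len(scores) for col in zip(*scores)]  # proportion correct per item
pq_sum = sum(pi * (1 - pi) for pi in p)               # sum of item variances

kr20 = (k / (k - 1)) * (1 - pq_sum / pvariance(totals))
print(f"KR-20: {kr20:.2f}")
```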
How Faculty Should Actually Use Item Analysis
You are not expected to fix every imperfect item immediately.
A realistic, defensible approach:
Identify 3–5 items to revise per exam
Prioritize low discrimination and negative point-biserial
Document your review and revisions
Improve exams incrementally over time
That’s how strong, defensible assessments are built.
Check out this FREE 1-page guide on test analysis!