1 of 27

10th International Workshop on Computer Science Education Data Mining (CSEDM 2026)

Strict Graders and Hallucinated Mastery:

A Cognitive Diagnostic Study of Language Models on Dynamic Programming

Sartaj Solaiman

sartajsolaiman@gmail.com

Md. Fahim Arefin

fahim@cse.du.ac.bd

Department of Computer Science and Engineering

University of Dhaka

2 of 27

The Diagnostic Scarcity Problem

Current code-grading benchmarks focus solely on outcomes. This creates:

Skill conflation: Two different skill deficiencies can produce the same outcome.
Lack of detailed feedback: Students who do not know exactly which skills they lack cannot improve.
Severe diagnostic need: Students need to trace the mental model behind the code.

3 of 27

Cognitive Diagnostic Modeling (CDM)

CDM is a psychometric framework that estimates mastery of specific underlying skills and knowledge states.
Instead of a single outcome-based score, a student can see which skills they possess and which skills they need to improve.
This expedites the learning process.

4 of 27

Evaluating Language Model Skill Profiles

Language Models are increasingly being used as:

Programming Tutors
Code Graders
Coding Assistants
Feedback Systems

Evaluating the latent skill profiles of Language Models allows us to use them appropriately as code graders and feedback systems, improving the learning process overall.

5 of 27

Why Dynamic Programming?

Dynamic Programming is an ideal testbed for skill evaluation:

Requires multiple reasoning skills
Generally a difficult topic to master
Well-defined cognitive structure
Complex reasoning exercise

6 of 27

Core Research Questions

Do current LLMs and SLMs reason about algorithmic problems, or only pattern-match against memorized solutions?
If they reason, can we measure how far that reasoning extends?

7 of 27

Our Contribution

DP Q-Matrix
CDM-AI Diagnostic Framework
Discovery of Reasoning-Synthesis Gap
Grader Reliability Analysis

8 of 27

Skill Framework

9 of 27

Dataset and Design

28 AtCoder DP problems.
560 human-written Python submissions (20 solutions per problem).
Balanced AC/WA samples for each problem; for optimization problems, TLE/MLE submissions replace WA.
8 problems (28.6%) come from 2024 or later to reduce contamination risk.

10 of 27

Problem Categorization

Each problem was assigned to one of the following five categories:

Linear DP
Grid/Tree DP
Subset DP
Probabilistic DP
Optimized DP

11 of 27

Experimental Group Design

12 of 27

Methodology Pipeline

13 of 27

Q-Matrix Construction

14 of 27

Raw Baseline Grading Performance

15 of 27

DINA model output

The DINA model is used as the psychometric engine.
Our expanded Q-matrix transforms each problem into skill-specific probes.
Each expanded row contains exactly one active skill.
The ideal response therefore reduces to mastery of a single skill.
The outputs are Slip (S), missing a skill that is actually present; Guess (G), crediting a skill that is absent; and mastery probabilities.
We also introduce a Grader Reliability Index (GRI) evaluating which grader is better suited for reasoning.
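
The probe expansion and the DINA response rule can be sketched as follows. This is a minimal illustration, not the study's implementation: the Q-matrix, the mastery vector, and the slip and guess values are hypothetical, and the function names `expand_q_matrix`, `ideal_response`, and `dina_prob` are invented for this example. In the standard DINA model, the ideal response is 1 only when every skill a row requires is mastered; a positive response then occurs with probability 1 − S, and with probability G otherwise.

```python
from typing import List


def expand_q_matrix(q: List[List[int]]) -> List[List[int]]:
    """Split each problem row into one probe row per required skill."""
    expanded = []
    for row in q:
        for k, required in enumerate(row):
            if required:
                probe = [0] * len(row)
                probe[k] = 1  # exactly one active skill per probe
                expanded.append(probe)
    return expanded


def ideal_response(alpha: List[int], q_row: List[int]) -> int:
    """DINA ideal response: 1 iff every skill the row requires is mastered."""
    return int(all(a >= q for a, q in zip(alpha, q_row)))


def dina_prob(alpha: List[int], q_row: List[int], slip: float, guess: float) -> float:
    """Probability of a positive response under DINA."""
    return (1.0 - slip) if ideal_response(alpha, q_row) == 1 else guess


# Hypothetical Q-matrix over five skills K1..K5 (not the study's matrix).
Q = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 1],
]
probes = expand_q_matrix(Q)

# Hypothetical mastery vector: K1, K3, K4 mastered; K2, K5 not.
alpha = [1, 0, 1, 1, 0]

for probe in probes:
    print(probe, dina_prob(alpha, probe, slip=0.1, guess=0.2))
```

Because every probe has a single active skill, `ideal_response` for a probe is simply the mastery bit of that skill, which is the reduction stated above.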

16 of 27

Model Grading Behavior

Different models exhibit different diagnostic behaviors:

Claude, Consensus, and GPT-5.5 Filtered form a conservative grading cluster.
GPT-5.5 and Phi exhibit substantially higher Guess rates.
Phi shows the most permissive behavior (G = 0.414).
GPT-5.5 filtering shifts behavior toward the conservative cluster.

17 of 27

Skill mastery profiles

K1 (Problem Decomposition) remains difficult across most graders.
K3 (Transition Logic) and K4 (Boundary Conditions) show similar diagnosis patterns.
Granite uniquely emphasizes K5 (Optimization).
Phi's uniform profile suggests a non-informative grading strategy.

18 of 27

Latent Skill Correlations

Correlation of skill mastery estimates:

K3 (Transition Logic) and K4 (Boundary Conditions) move together most strongly (r = 0.94).
K5 (Optimization) is the most distinct skill, decoupling from K3 and K4 (r = 0.64, 0.60).
Models rarely master one core skill without the neighbouring ones.
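
A correlation of this kind can be computed with a plain Pearson coefficient. The sketch below assumes the correlation is taken across graders, with one mastery estimate per grader per skill; the numbers are hypothetical placeholders, not the study's estimates, and `pearson` is an illustrative helper.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Hypothetical mastery estimates, one value per grader (not the study's data).
mastery = {
    "K3": [0.20, 0.50, 0.70, 0.90],
    "K4": [0.25, 0.45, 0.75, 0.85],
    "K5": [0.60, 0.30, 0.80, 0.50],
}

r_k3_k4 = pearson(mastery["K3"], mastery["K4"])
r_k3_k5 = pearson(mastery["K3"], mastery["K5"])
print(f"r(K3, K4) = {r_k3_k4:.2f}")
print(f"r(K3, K5) = {r_k3_k5:.2f}")
```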

19 of 27

The Reasoning-Synthesis Gap

All three SLM graders show meaningful skill profiles.
All three SLM synthesizers collapse to the EM floor.
Recognition of DP reasoning is easier than generation of DP solutions.

20 of 27

Grader Reliability Index (GRI)

We formulate GRI as follows:

GRI = (1 − S) × (1 − G)^k

Where,

S = Slip rate (missed recognition of demonstrated skills)

G = Guess rate (hallucinated skill mastery)

k = hallucination penalty factor

The exponent k reflects the fact that a guess error (hallucinated mastery) is much more costly than a slip error (a missed skill).
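
The formula translates directly into code. This is a minimal sketch; the function name `gri`, the input validation, and the default k = 5 are choices made for this example.

```python
def gri(slip: float, guess: float, k: float = 5.0) -> float:
    """Grader Reliability Index: (1 - S) * (1 - G) ** k."""
    if not (0.0 <= slip <= 1.0 and 0.0 <= guess <= 1.0):
        raise ValueError("slip and guess must lie in [0, 1]")
    return (1.0 - slip) * (1.0 - guess) ** k


# With k = 1 the two error types are weighted equally;
# larger k penalises the guess rate more heavily.
print(gri(0.2, 0.5, k=1))
print(gri(0.2, 0.5, k=5))
```

A perfect grader (S = 0, G = 0) scores 1 for any k, and for a fixed nonzero guess rate the score falls as k grows.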

21 of 27

Grader Reliability Index (Contd.)

Based on our findings:

Claude and Consensus achieved the highest GRI scores at k=5.
Granite-4.1-8B ranked above both GPT-5.5 variants despite its smaller size.
Reliability rankings change as the hallucination penalty k increases.
Slip-Guess behavior explains grader reliability better than parameter count.

22 of 27

GRI Sensitivity to Hallucination Penalty

Re-scoring GRI across penalty k:

High-Guess graders (Phi, GPT-5.5) peak at k = 1, then collapse as the penalty grows.
Strict graders (Claude, Consensus) stay stable and lead from k = 5 onward.
k = 5 penalises hallucinated mastery without over-rewarding caution.
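
The way rankings can flip as k grows is easy to reproduce with two stylised graders. In the sketch below, the guess rate 0.414 is the Phi value reported earlier; every other number (both slip rates and the strict grader's guess rate) is a hypothetical placeholder chosen only to illustrate the mechanism, so the crossover point shown here is not a result of the study.

```python
def gri(slip: float, guess: float, k: float) -> float:
    """Grader Reliability Index: (1 - S) * (1 - G) ** k."""
    return (1.0 - slip) * (1.0 - guess) ** k


# (slip, guess) per grader. Only guess = 0.414 comes from the slides;
# all other values are hypothetical.
graders = {
    "permissive": (0.02, 0.414),
    "strict": (0.45, 0.05),
}


def ranking(k: float) -> list:
    """Grader names sorted by GRI at penalty k, best first."""
    return sorted(graders, key=lambda name: gri(*graders[name], k), reverse=True)


for k in (1, 2, 5, 10):
    scores = {name: round(gri(s, g, k), 3) for name, (s, g) in graders.items()}
    print(k, ranking(k), scores)
```

Under these assumed values the permissive grader leads only at k = 1; once the penalty grows, its high guess rate dominates and the strict grader takes over.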

23 of 27

Recommendations

24 of 27

Limitations

Single-domain evaluation: Experiments were limited to 28 Dynamic Programming problems from the AtCoder DP contest.
Skill taxonomy dependence: Results depend on the validity of the proposed five-skill Q-matrix.
Probe-level interpretation: The expanded Q-matrix diagnoses skill recognition rather than full problem-solving ability.
Model-specific evaluation: Findings are based on the selected LLMs and SLMs and may not generalize to all future models.
DINA assumptions: The DINA model assumes binary mastery and may not capture partial or evolving understanding.

25 of 27

Future Work

Expand beyond Dynamic Programming to other algorithmic domains.
Validate and refine the Q-matrix through expert consensus.
Explore richer CDMs beyond DINA (e.g., DINO, GDINA, LCDM).
Expand beyond the zero-shot prompting technique.
Use a larger sample size for both the problems and the grading models.