1 of 27

10th International Workshop on Computer Science Education Data Mining (CSEDM 2026)

Strict Graders and Hallucinated Mastery:

A Cognitive Diagnostic Study of Language Models on Dynamic Programming

Sartaj Solaiman

sartajsolaiman@gmail.com

Md. Fahim Arefin

fahim@cse.du.ac.bd

Department of Computer Science and Engineering

University of Dhaka

2 of 27

The Diagnostic Scarcity Problem

Current code-grading benchmarks focus solely on outcomes. This creates:

  • Skill conflation: Two different skill deficiencies can look identical when only the outcome is scored.
  • Lack of detailed feedback: Students who do not know exactly which skills they lack cannot target their improvement.
  • Severe diagnostic need: Students need to trace the mental model behind the code.

3 of 27

Cognitive Diagnostic Modeling (CDM)

  • CDM is a psychometric framework that estimates mastery of specific underlying skills and knowledge states.
  • Instead of a single outcome-based score, a student can see which skills they possess and which they need to improve.
  • This expedites the learning process.

4 of 27

Evaluating Language Model Skill Profiles

Language models are increasingly used as:

  • Programming Tutors
  • Code Graders
  • Coding Assistants
  • Feedback Systems

Evaluating the latent skill profiles of language models allows us to use them appropriately as code graders and feedback systems, improving the learning process overall.

5 of 27

Why Dynamic Programming?

Dynamic Programming is an ideal testbed for skill evaluation:

  • It requires multiple reasoning skills.
  • It is generally a difficult topic to master.
  • It has a well-defined cognitive structure.
  • It is a complex reasoning exercise.

6 of 27

Core Research Questions

  • Do current LLMs and SLMs reason about algorithmic problems, or only pattern-match against memorized solutions?
  • If they reason, can we measure how far that reasoning extends?

7 of 27

Our Contribution

  • DP Q-Matrix
  • CDM-AI Diagnostic Framework
  • Discovery of Reasoning-Synthesis Gap
  • Grader Reliability Analysis

8 of 27

Skill Framework

9 of 27

Dataset and Design

  • 28 AtCoder DP problems.
  • 560 human-written Python submissions (20 solutions per problem).
  • Balanced accepted (AC) and wrong-answer (WA) samples for each problem; for optimization problems, the failing samples are TLE/MLE (time or memory limit exceeded) instead of WA.
  • 8 problems (28.6%) come from 2024 or later to reduce contamination risk.

10 of 27

Problem Categorization

Each problem was assigned to one of the following five categories:

  • Linear DP
  • Grid/Tree DP
  • Subset DP
  • Probabilistic DP
  • Optimized DP

11 of 27

Experimental Group Design

12 of 27

Methodology Pipeline

13 of 27

Q-Matrix Construction

14 of 27

Raw Baseline Grading Performance

15 of 27

DINA model output

  • DINA is used as the psychometric engine.
  • Our expanded Q-matrix transforms each problem into skill-specific probes.
  • Each expanded row contains exactly one active skill.
  • The ideal response therefore reduces to mastery of a single skill.
  • The outputs are Slip (S), missing a skill that is actually present; Guess (G), crediting a skill that is absent; and per-skill mastery probabilities.
  • We also introduce a Grader Reliability Index (GRI) to evaluate which grader is better suited for reasoning.
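
The probe expansion and the DINA response rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the function names, the probe-id format, and the example Q-matrix row are hypothetical, and the skills are referred to only by their labels K1 to K5. The response rule itself is the standard DINA form, where a positive response has probability 1 − S when all required skills are mastered and G otherwise.

```python
from math import prod

# Skill labels only; the ordering K1..K5 follows the slides.
SKILLS = ["K1", "K2", "K3", "K4", "K5"]


def expand_q_row(problem_id, q_row):
    """Split one multi-skill Q-matrix row into single-skill probe rows.

    Hypothetical helper: each returned probe has exactly one active skill.
    """
    probes = []
    for k, required in enumerate(q_row):
        if required:
            one_hot = [1 if j == k else 0 for j in range(len(q_row))]
            probes.append((f"{problem_id}:{SKILLS[k]}", one_hot))
    return probes


def ideal_response(alpha, q_row):
    """DINA ideal response: 1 only if every required skill is mastered."""
    # alpha and q_row are 0/1 vectors; 0 ** 0 == 1, so unrequired skills
    # never block the product.
    return prod(a ** q for a, q in zip(alpha, q_row))


def p_positive(alpha, q_row, slip, guess):
    """Probability of a positive response under standard DINA."""
    return (1 - slip) if ideal_response(alpha, q_row) == 1 else guess


if __name__ == "__main__":
    # Hypothetical problem that requires K1, K3 and K4.
    for probe_id, row in expand_q_row("p01", [1, 0, 1, 1, 0]):
        print(probe_id, row)
```

Because each expanded row is one-hot, `ideal_response` for a probe equals the mastery bit of that single skill, which is the reduction stated in the bullets above.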

16 of 27

Model Grading Behavior

Different models exhibit different diagnostic behaviors:

  • Claude, Consensus, and GPT-5.5 Filtered form a conservative grading cluster.
  • GPT-5.5 and Phi exhibit substantially higher Guess rates.
  • Phi shows the most permissive behavior (G = 0.414).
  • GPT-5.5 filtering shifts behavior toward the conservative cluster.

17 of 27

Skill mastery profiles

  • K1 (Problem Decomposition) remains difficult across most graders.
  • K3 (Transition Logic) and K4 (Boundary Conditions) show similar diagnosis patterns.
  • Granite uniquely emphasizes K5 (Optimization).
  • Phi's uniform profile suggests a non-informative grading strategy.

18 of 27

Latent Skill Correlations

Correlation of skill mastery estimates:

  • K3 (Transition Logic) and K4 (Boundary Conditions) move together most strongly (r = 0.94).
  • K5 (Optimization) is the most distinct skill, decoupling from K3 and K4 (r = 0.64, 0.60).
  • Models rarely master one core skill without the neighbouring ones.
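
A correlation of this kind can be computed with a plain Pearson coefficient. The sketch below assumes that each skill has one mastery estimate per grader and that the correlation is taken across graders; that reading, the `pearson` helper, and the numbers in `MASTERY` are illustrative assumptions, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Hypothetical mastery estimates, one entry per grader (NOT study data).
MASTERY = {
    "K3": [0.20, 0.45, 0.60, 0.80],
    "K4": [0.25, 0.40, 0.65, 0.85],
    "K5": [0.70, 0.30, 0.55, 0.60],
}

if __name__ == "__main__":
    names = list(MASTERY)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            print(a, b, round(pearson(MASTERY[a], MASTERY[b]), 2))
```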

19 of 27

The Reasoning-Synthesis Gap

  • All three SLM graders show meaningful skill profiles.
  • All three SLM synthesizers collapse to the EM floor.
  • Recognition of DP reasoning is easier than generation of DP solutions.

20 of 27

Grader Reliability Index (GRI)

We formulate GRI as follows:

GRI = (1 − S) × (1 − G)^k

where

S = Slip rate (missed recognition of demonstrated skills)

G = Guess rate (hallucinated skill mastery)

k = hallucination penalty factor

This form reflects that a guess (hallucinated mastery) is much more costly than a slip (a missed skill): for k > 1, the guess term is penalized more heavily than the slip term.
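
One way to see the asymmetry is to take logarithms of the formula above. The short derivation below is an illustrative reading of the definition for S, G < 1, not an analysis taken from the study.

```latex
\begin{align*}
\mathrm{GRI} &= (1 - S)\,(1 - G)^{k} \\
\ln \mathrm{GRI} &= \ln(1 - S) + k \ln(1 - G) \\
\frac{\partial \ln \mathrm{GRI}}{\partial S} &= -\frac{1}{1 - S},
\qquad
\frac{\partial \ln \mathrm{GRI}}{\partial G} = -\frac{k}{1 - G}
\end{align*}
```

At equal rates (S = G), a small increase in G lowers ln GRI exactly k times as much as the same increase in S, so k acts directly as the relative weight of hallucinated mastery.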

21 of 27

Grader Reliability Index (Contd.)

Based on our findings:

  • Claude and Consensus achieved the highest GRI scores at k=5.
  • Granite-4.1-8B ranked above both GPT-5.5 variants despite its smaller size.
  • Reliability rankings change as the hallucination penalty k increases.
  • Slip-Guess behavior explains grader reliability better than parameter count.

22 of 27

GRI Sensitivity to Hallucination Penalty

Re-scoring GRI across penalty k:

  • High-Guess graders (Phi, GPT-5.5) peak at k = 1, then collapse as the penalty grows.
  • Strict graders (Claude, Consensus) stay stable and lead from k = 5 onward.
  • k = 5 penalises hallucinated mastery without over-rewarding caution.

23 of 27

Recommendations

24 of 27

Limitations

  • Single-domain evaluation: Experiments were limited to 28 Dynamic Programming problems from the AtCoder DP contest.
  • Skill taxonomy dependence: Results depend on the validity of the proposed five-skill Q-matrix.
  • Probe-level interpretation: The expanded Q-matrix diagnoses skill recognition rather than full problem-solving ability.
  • Model-specific evaluation: Findings are based on the selected LLMs and SLMs and may not generalize to all future models.
  • DINA assumptions: The DINA model assumes binary mastery and may not capture partial or evolving understanding.

25 of 27

Future Work

  • Expand beyond Dynamic Programming to other algorithmic domains.
  • Validate and refine the Q-matrix through expert consensus.
  • Explore richer CDMs beyond DINA (e.g., DINO, GDINA, LCDM).
  • Expand beyond the zero-shot prompting technique.
  • Use a larger sample size for both the problems and the grading models.

26 of 27

Key Takeaways

27 of 27

Thank You
