10th International Workshop on Computer Science Education Data Mining (CSEDM 2026)
Strict Graders and Hallucinated Mastery:
A Cognitive Diagnostic Study of Language Models on Dynamic Programming
Sartaj Solaiman
sartajsolaiman@gmail.com
Md. Fahim Arefin
fahim@cse.du.ac.bd
Department of Computer Science and Engineering
University of Dhaka
The Diagnostic Scarcity Problem
Current code-grading benchmarks focus solely on outcomes. This creates:
Cognitive Diagnostic Modeling (CDM)
Evaluating Language Model Skill Profiles
Language models are increasingly being used as:
Evaluating the latent skill profiles of language models allows us to use them appropriately as code graders and feedback systems, improving the learning process overall.
Why Dynamic Programming?
Dynamic programming is an ideal testbed for skill evaluation:
Core Research Questions
Our Contribution
Skill Framework
Dataset and Design
Problem Categorization
Each problem was categorized into one of the following five categories:
Experimental Group Design
Methodology Pipeline
Q-Matrix Construction
Raw Baseline Grading Performance
DINA Model Output
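As background for reading DINA output, the standard DINA item response rule can be sketched as follows. This is a minimal illustration of the textbook model, not the fitting procedure used in this study: the skill vector, Q-matrix rows, and slip and guess values below are hypothetical placeholders, and the function name `dina_correct_prob` is introduced here only for illustration.

```python
def dina_correct_prob(alpha, q_row, slip, guess):
    """Probability of a correct response under the DINA model.

    alpha : list of 0/1 flags, the respondent's mastered skills
    q_row : list of 0/1 flags, the skills the item requires (one Q-matrix row)
    slip  : probability of failing despite mastering every required skill
    guess : probability of succeeding while lacking a required skill
    """
    # DINA is conjunctive: eta is True only if every required skill is mastered.
    eta = all(a >= q for a, q in zip(alpha, q_row))
    return (1.0 - slip) if eta else guess


# Hypothetical Q-matrix for three items over three skills (illustrative only).
q_matrix = [
    [1, 0, 0],  # item 1 needs skill A
    [1, 1, 0],  # item 2 needs skills A and B
    [0, 1, 1],  # item 3 needs skills B and C
]
alpha = [1, 1, 0]  # respondent has mastered A and B, but not C

probs = [dina_correct_prob(alpha, row, slip=0.1, guess=0.2) for row in q_matrix]
print(probs)
```

Items 1 and 2 are covered by the mastered skills, so they succeed with probability 1 − slip; item 3 requires the missing skill C, so success there can only come from guessing.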
Model Grading Behavior
Different models exhibit different diagnostic behaviors:
Skill Mastery Profiles
Latent Skill Correlations
Correlation of skill mastery estimates:
The Reasoning-Synthesis Gap
Grader Reliability Index (GRI)
We formulate GRI as follows:
GRI = (1 − S) × (1 − G)^k
where
S = slip rate (missed recognition of demonstrated skills)
G = guess rate (hallucinated skill mastery)
k = hallucination penalty factor
This form reflects the fact that a guess (hallucinated mastery) is much more costly than a slip (missed recognition): for k > 1, a given guess rate lowers GRI more than an equal slip rate.
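The GRI formula can be evaluated directly. The sketch below is illustrative only: the slip and guess rates and the choice k = 2 are hypothetical values, not results from this study, and the function name `grader_reliability_index` is our own.

```python
def grader_reliability_index(slip: float, guess: float, k: float) -> float:
    """Compute GRI = (1 - S) * (1 - G)**k for slip rate S and guess rate G."""
    if not (0.0 <= slip <= 1.0 and 0.0 <= guess <= 1.0):
        raise ValueError("slip and guess must be probabilities in [0, 1]")
    if k < 0:
        raise ValueError("penalty factor k must be non-negative")
    return (1.0 - slip) * (1.0 - guess) ** k


# Hypothetical graders with mirrored error rates (assumed values, k = 2 assumed).
strict = grader_reliability_index(slip=0.30, guess=0.05, k=2.0)
lenient = grader_reliability_index(slip=0.05, guess=0.30, k=2.0)

print(f"strict grader  GRI = {strict:.4f}")
print(f"lenient grader GRI = {lenient:.4f}")
```

With these assumed inputs, the grader that mostly slips scores higher than the grader that mostly guesses, even though their two error rates are simply swapped. That asymmetry is what the exponent k is meant to encode.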
Grader Reliability Index (Contd.)
Based on our findings:
GRI Sensitivity to Hallucination Penalty
Re-scoring GRI across values of the penalty k:
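A sensitivity check of this kind can be sketched as a sweep over k that re-ranks graders at each setting. The grader names, slip and guess rates, and the grid of k values below are all hypothetical and chosen only to show the mechanics; they are not the values reported in this study.

```python
def gri(slip: float, guess: float, k: float) -> float:
    """GRI = (1 - S) * (1 - G)**k."""
    return (1.0 - slip) * (1.0 - guess) ** k


# Hypothetical (slip, guess) pairs for two illustrative graders.
graders = {
    "strict_grader": (0.30, 0.05),
    "lenient_grader": (0.05, 0.30),
}
penalties = [0.5, 2.0, 4.0]  # assumed grid of hallucination penalties


def rank_by_gri(k: float) -> list:
    """Return grader names ordered from highest to lowest GRI at penalty k."""
    return sorted(graders, key=lambda name: gri(*graders[name], k), reverse=True)


rankings = {k: rank_by_gri(k) for k in penalties}
for k in penalties:
    scores = ", ".join(f"{name}={gri(*graders[name], k):.3f}" for name in rankings[k])
    print(f"k={k}: {scores}")
```

In this toy setup the ranking flips between k = 0.5 and k = 2: when guesses are penalized less than slips the lenient grader ranks first, and once k exceeds 1 the strict grader overtakes it. A ranking that stays stable across a plausible range of k is less dependent on the exact penalty chosen.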
Recommendations
Limitations
Future Work
Key Takeaways
Thank You