Explaining Code Examples in Introductory Programming Courses: LLM vs Humans
Arun-Balajiee Lekshmi-Narayanan, Priti Oli, Jeevan Chapagain, Mohammad Hassany, Rabin Banjade, Peter Brusilovsky, Vasile Rus
Worked Examples
Why do we need them?
Source: freepik
Worked Examples Interfaces
Source: Breanna Jury, Angela Lorusso, Juho Leinonen, Paul Denny, and Andrew Luxton-Reilly. 2024. Evaluating LLM-generated Worked Examples in an Introductory Programming Course.
Motivation
Authoring Bottleneck – Instructors need to manually generate worked examples
Human-AI collaboration in creating worked examples
Target: Explore the Feasibility
- can AI generate explanations?
- how are they different from humans?
- what prompts generate human-like explanations?
Solution: Could instructors use help from AI to generate explanations for worked examples at scale?
Source: medium
Related Work
Code Summarization (Philips et al. 2022)
Code Explanations and Comparisons with Student Explanations (MacNeil et al. 2023 and more)
Generating Worked Examples + Explanations on request for students (Jury et al. 2024)
However, instructor-motivated worked examples with explanations are still an open challenge.
We want to compare ChatGPT explanations with instructor explanations!
Source: freepik
Dataset collection
Student Explanation Sources
- Spring 2022, end of semester; students from an introductory Java programming class
- 60 subjects, 1-hour study over Zoom
- Students provided line-by-line explanations for 4 worked examples, as indicated by the interface
Expert Explanation Sources
- 2 experts provided explanations for the same 4 worked examples at the same lines of code
ChatGPT Explanation Sources
- ChatGPT generated explanations under 3 prompting strategies (Simple, Advanced, Extended) at the same lines of code for the same 4 worked examples
ChatGPT Prompting
Simple Prompting:
“Provide a line-by-line self-explanation for each line of code in the Java program above”.
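As a rough illustration of how the simple prompt might be assembled, the sketch below places a Java program first and the instruction after it, so that "the Java program above" refers to the code. The helper name `build_simple_prompt` and the sample program are hypothetical, not taken from the study, and no call to ChatGPT is shown.

```python
# Instruction text quoted from the slide above.
SIMPLE_INSTRUCTION = (
    "Provide a line-by-line self-explanation for each line of code "
    "in the Java program above"
)


def build_simple_prompt(java_source: str) -> str:
    """Put the program first so that 'above' in the instruction points at it."""
    return f"{java_source.strip()}\n\n{SIMPLE_INSTRUCTION}"


# Hypothetical worked example, used only to show the prompt layout.
sample_program = """
public class Sum {
    public static void main(String[] args) {
        int sum = 0;
        for (int i = 1; i <= 3; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}
"""

prompt = build_simple_prompt(sample_program)
print(prompt)
```

The resulting string could then be pasted into, or sent to, whichever chat interface is being used.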
Advanced + Extended (right image):
Dataset Summary
Evaluation Metrics
Lexical Metrics: lexical diversity and density
Readability Metrics:
Similarity Metrics:
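To make the lexical metrics concrete, here is a minimal sketch under stated assumptions: lexical diversity is taken as the type-token ratio (unique words / total words) and lexical density as the share of content words among all words. The small function-word list and the two example sentences are illustrative assumptions, not the study's actual tooling or data.

```python
import re

# Assumption: a tiny illustrative function-word list. A real analysis would
# more likely use a part-of-speech tagger to identify content words.
FUNCTION_WORDS = {
    "a", "an", "the", "is", "are", "of", "to", "in", "and", "or",
    "it", "this", "that", "we", "for", "on", "with", "as", "by", "be",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique tokens divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def lexical_density(text: str) -> float:
    """Share of tokens that are not in the function-word list."""
    tokens = tokenize(text)
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens) if tokens else 0.0


# Hypothetical explanations of the line `int sum = 0;`
student = "declare int variable sum"
expert = "This line declares the variable sum and sets it to the value of zero"

print(lexical_density(student))  # 1.0
print(lexical_density(expert))   # 0.5
```

In this toy case the shorter explanation has the higher density, because it contains almost no function words.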
Results
Student explanations are shorter than those by experts and ChatGPT and have higher lexical density, suggesting that students explain the code in a more “concentrated” way.
Results
Explanations produced by experts and ChatGPT are more than twice as long as explanations produced by students
Results
ChatGPT explanations are relatively less readable than those of experts, which are less readable than those of students
Results
Simple prompting (S) generated explanations aligned more closely with Expert explanations than Advanced and Extended prompting (A and E).
Results
Explanations generated by ChatGPT are more closely aligned with expert explanations than with student explanations across all metrics
Discussion & Conclusions
Our goal was to assess feasibility: generating explanations with ChatGPT is feasible, but the output requires human correction (details from the other paper if necessary).
Compared the ChatGPT explanations generated by different prompts
Observed a considerable difference between the explanations produced by students and explanations produced by experts and ChatGPT
ChatGPT is more closely aligned with experts on lexical metrics but needs more fine-tuning; experts add their magic
Questions?
Source: freepik