1 of 17

Explaining Code Examples in Introductory Programming

Courses: LLM vs Humans

Arun-Balajiee Lekshmi-Narayanan, Priti Oli, Jeevan Chapagain, Mohammad Hassany, Rabin Banjade, Peter Brusilovsky, Vasile Rus

2 of 17

Worked Examples

Why do we need them?

Lower the cognitive load by learning with worked out programming examples
Explaining Role and function of code chunks
Demonstrate Semantics of code

Source: freepik

3 of 17

Worked Examples Interfaces

4 of 17

Worked Examples Interfaces

Source: Breanna Jury, Angela Lorusso, Juho Leinonen, Paul Denny, and Andrew Luxton-Reilly. 2024. Evaluating LLM-generated Worked Examples in an Introductory Programming Course.

5 of 17

Motivation

Authoring Bottleneck – Instructors need to manually generate worked examples

Human-AI collaboration in creating worked example

Target: Explore the Feasibility

- can AI generate explanations?

- how are they different from humans?

- what prompts generate human-like explanations?

Solution: Could instructors use help from AI to generate explanations for worked examples at scale?

Source: medium

6 of 17

Related Work

Code Summarization (Philips et al. 2022)

Code Explanations and Comparisons with Student Explanations (MacNeil et al. 2023 and more)

Generating Worked Examples + Explanations at request for students (Jury et al. 2024)

However, instructor motivated worked examples with explanations are still an open challenge.

We want to compare ChatGPT explanations with instructor explanations!

Source: freepik

7 of 17

Dataset collection

Student Explanation Sources Spring 2022 End of Semester, Students from Introductory Java Programming Class; 60 subjects, 1 hr study over zoom; Students provided their explanations to 4 worked examples, line-by-line as indicated by the interface

Expert Explanation Sources 2 Experts provided their explanations to the same 4 worked examples at the same lines of code

Chatgpt Explanation Sources 3 prompting strategies (Simple, Advanced, Extended) ChatGPT for explanation generation at the same lines of code for the same 4 worked examples

8 of 17

ChatGPT Prompting

Simple Prompting:

“Provide a line-by-line self-explanation for each line of code in the Java program above”.

Advanced + Extended (right image):

role of “a professor who teaches computer programming”
requested ChatGPT to further enhance the explanations generated by the Advanced prompt

9 of 17

Dataset Summary

10 of 17

Evaluation Metrics

Lexical Metrics: lexical diversity and density

the generated explanations to assess the richness, informativeness, and conciseness

Readability Metrics:

Flesch-Kincaid Grade Level,
Gunning Fog and
Flesch Reading Ease (Denny et al., 2020, 2021)

Similarity Metrics:

Character-based metric chrF (Popovi ́2015),
Word-based metric METEOR (Banerjee and Lavie, 2005), and
Embedding-based metrics BERTScore (Zhang et al., 2019) and
Universal Sentence Encoder (USE) (Cer et al.,2018)

11 of 17

Results

Students explanations are shorter than those by experts and ChatGPT explanations have higher lexical density, suggesting that the students explain the code in a more “concentrated” way.

12 of 17

Results

explanations produced by experts and ChatGPT are more than twice as long as explanations produced by students

13 of 17

Results

ChatGPT explanations are relatively less readable than those of experts, which are less readable than those of students

14 of 17

Results

Simple prompting (S) generated explanations aligned more closely with Expert explanations than Advanced and Extended prompting (A and E).

15 of 17

Results

Explanations generated by ChatGPT are more closely aligned with expert explanations than with student explanations across all metrics

16 of 17

Discussion & Conclusions

Our goal was to assess the feasibility – so it is feasible; requires human correction; from other paper if necessary.

Compared the ChatGPT explanations generated by different prompts

Observed a considerable difference between the explanations produced by students and explanations produced by experts and ChatGPT

ChatGPT more aligned with experts in terms of lexical metrics but needs more finetuning; experts add their magic

1 of 17

2 of 17

3 of 17

4 of 17

5 of 17

6 of 17

7 of 17

8 of 17

9 of 17

10 of 17

11 of 17

12 of 17

13 of 17

14 of 17

15 of 17

16 of 17

17 of 17