1 of 180

CS50 Educator Workshop

2025

2 of 180

MALAN

3 of 180

Schedule

Teaching CS50 with AI, David J. Malan, Rongxin Liu, Julianna Zhao
CS50’s Curriculum, Kelly Ding
Birds of a Feather

4 of 180

hellos, world

5 of 180

David J. Malan, Rongxin Liu, Julianna Zhao

Teaching CS50 with AI

Leveraging Generative Artificial Intelligence in Computer Science Education

6 of 180

ChatGPT et al. are too helpful

7 of 180

Pedagogical Guardrails

8 of 180

Not Reasonable

Using AI-based software (such as ChatGPT, GitHub Copilot, Bing Chat, et al.) that suggests or completes answers to questions or lines of code.

9 of 180

Reasonable

Using CS50's own AI-based software including the CS50 Duck (ddb) in cs50.ai and cs50.dev.

10 of 180

11 of 180

rubber duck debugging

12 of 180

rubberducking

13 of 180

14 of 180

15 of 180

16 of 180

17 of 180

18 of 180

19 of 180

Provide students with virtual office hours 24/7

20 of 180

Approximate a 1:1 teacher-to-student ratio

21 of 180

22 of 180

Thank You

23 of 180

Visual Studio Code for CS50

cs50.dev

24 of 180

25 of 180

Explain highlighted lines of code

26 of 180

27 of 180

28 of 180

29 of 180

30 of 180

31 of 180

Advise students on how to improve their code's style

32 of 180

33 of 180

34 of 180

35 of 180

Advise students on how to improve their code's design

36 of 180

37 of 180

38 of 180

39 of 180

40 of 180

Answer (most of the) questions asked online by students

41 of 180

42 of 180

43 of 180

44 of 180

45 of 180

46 of 180

47 of 180

48 of 180

CS50.ai

cs50.ai

49 of 180

50 of 180

51 of 180

52 of 180

53 of 180

System Prompt

54 of 180

You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck. Answer student questions only about CS50 and the field of computer science; do not answer questions about unrelated topics… Do not provide full answers to problem sets, as this would violate academic honesty…

Answer this question:


60 of 180

User Prompt

61 of 180

April Fools

62 of 180

You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck in Rick Astley's band… Importantly, you should always cheer up the student at the end by incorporating "Never Gonna Give You Up" in your response.

Answer this question:


66 of 180

67 of 180

68 of 180

Results

69 of 180

317K users

21K prompts/day, 15M total so far

70 of 180

Usage Frequency

71 of 180

Helpfulness

72 of 180

Results

Without AI, students asked 0.89 questions each of TFs.

With AI, students asked 0.28 questions each of TFs.

73 of 180

Results

Without AI, students attended 51% of available office hours.

With AI, students attended 30% of available office hours.

74 of 180

felt like having a personal tutor… i love how AI bots will answer questions without ego and without judgment, generally entertaining even the stupidest of questions without treating them like they're stupid. it has an, as one could expect, inhuman level of patience.


78 of 180

The AI tools gave me enough hints to try on my own and also helped me decipher errors and possible errors I might encounter.

79 of 180

I also appreciated that CS50 implemented its own version of AI, because I think just directly using something like chatGPT would have definitely detracted from learning

80 of 180

Grades

81 of 180

82 of 180

83 of 180

RONGXIN

84 of 180

Implementing a Chatbot

85 of 180

86 of 180

87 of 180

OpenAI APIs

Chat API

The Chat API powers conversational models that can engage in dialogue, answering questions, providing explanations, and generating content in a conversational format.

Embeddings API

The Embeddings API generates numerical representations (vectors) of text, making it possible to measure the semantic similarity between pieces of text. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

Assistants API (beta)

The Assistants API allows one to build AI assistants within their own applications. An Assistant has instructions and can leverage models, tools, and knowledge to respond to user queries.

88 of 180

Large Language Models (LLMs)

89 of 180

Text Generation

Text generation is the task of producing new text from a given input text. These models can, for example, fill in incomplete text or paraphrase.

OpenAI's text generation models (often called generative pre-trained transformers or large language models) have been trained to understand natural language, code, and images.

The models provide text outputs in response to their inputs. The inputs to these models are also referred to as "prompts". Designing a prompt is essentially how you “program” a large language model, usually by providing instructions or some examples of how to successfully complete a task.

Using OpenAI's text generation models, you can build applications to:

Draft documents
Write computer code
Answer questions about a knowledge base
Analyze texts
Give software a natural language interface
Tutor in a range of subjects
Translate languages
Simulate characters for games

https://platform.openai.com/docs/guides/text-generation

https://huggingface.co/tasks/text-generation

90 of 180

Chatbot + Context



96 of 180

System: Provides behavior guidelines for the assistant.
User: Inputs or queries from the user to the assistant.
Assistant: Responses generated by the LLM based on user inputs.

The system message shapes the assistant's behavior, user messages prompt assistant responses, and assistant messages are influenced by both, creating an interactive dialogue flow.

Concretely, the system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation. Note, however, that the system message is optional, and the model’s behavior without one is likely to be similar to using a generic message such as "You are a helpful assistant."

The user messages provide requests or comments for the assistant to respond to. Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior.

To summarize:

A system message is a directive from the developer to the bot, providing guidelines or instructions on how to interpret or handle the conversation, potentially influencing or overriding other parts of the dialogue. Its effectiveness can vary across different models.
User messages are inputs from the user.
Assistant messages are outputs generated by the bot, in response to user messages or prompts.

These elements (user messages, assistant messages, and system messages) are compiled into an array, forming the context sent to the model for generating a response.
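As a minimal sketch of that array, the snippet below assembles the three kinds of messages into a Python list of role/content dictionaries, the shape used in the Chat Completions calls later in this deck. The helper name `build_context` and the sample history are illustrative assumptions, not CS50's implementation.

```python
def build_context(system_prompt, history, user_prompt):
    """Return role/content dicts: system first, then prior turns, then the new user message."""
    messages = [{"role": "system", "content": system_prompt}]
    for role, content in history:
        if role not in ("user", "assistant"):
            raise ValueError(f"unexpected role: {role}")
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Hypothetical conversation, used only to show the resulting structure.
context = build_context(
    "You are a friendly and supportive teaching assistant for CS50.",
    [("user", "What is a pointer?"),
     ("assistant", "A pointer is a variable that stores an address.")],
    "How do I declare one in C?",
)
for message in context:
    print(message["role"], "->", message["content"])
```

The whole list is sent on every request, which is why earlier turns remain visible to the model.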

https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/chatgpt

https://platform.openai.com/docs/guides/text-generation/chat-completions-api

97 of 180

System message


101 of 180

You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck. Answer student questions only about CS50 and the field of computer science; do not answer questions about unrelated topics… Do not provide full answers to problem sets, as this would violate academic honesty…

This system message sets specific guidelines for the LLM's behavior and role: it positions the LLM as a friendly and supportive teaching assistant specifically for CS50 (an introductory computer science course). Additionally, the message identifies the LLM as a "rubber duck," a term often used in programming for explaining code problems out loud as a way to find solutions. The message instructs the LLM to focus only on questions related to CS50 and computer science, avoiding unrelated topics. Furthermore, it emphasizes academic integrity by prohibiting complete answers to problem sets.

In the broader LLM context, a system message serves as a directive to the model, shaping its understanding of the role it needs to play, the scope of its responses, and any specific constraints or guidelines it must adhere to. This helps tailor the LLM's output to the desired context, ensuring relevance, appropriateness, and adherence to any given instructions or ethical considerations. System messages are a way to dynamically adjust the model's behavior for specific interactions or applications, guiding its generative capabilities towards the intended outcomes.

102 of 180

You are a friendly and supportive teaching assistant for CS50. You are also a rubber duck. Answer student questions only about CS50 and the field of computer science; do not answer questions about unrelated topics… Do not provide full answers to problem sets, as this would violate academic honesty…

Answer this question:

103 of 180

Prompt

Response

User

Assistant

(GPT-4o, LLaMA, etc.)

104 of 180

User

Prompt Engineering

Provide examples
Ask the model to adopt a persona
Specify the desired length of the output
Specify the steps required to complete a task
Include details in your query to get more relevant answers
Use delimiters to clearly indicate distinct parts of the input

Assistant

(GPT-4o, LLaMA, etc.)
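A rough sketch of how several of these tactics might be combined in one user prompt: added detail, explicit steps, a length limit, and delimiters around the student's code. The function name `make_prompt`, the `<code>` delimiter, and the wording of the steps are all assumptions chosen for illustration.

```python
def make_prompt(question, code, max_sentences=3):
    """Combine details, explicit steps, a length limit, and delimiters in one prompt."""
    return (
        f"{question}\n\n"
        "Follow these steps:\n"
        "1. Identify the line most likely causing the problem.\n"
        "2. Explain why, without rewriting the code.\n\n"
        f"Answer in at most {max_sentences} sentences.\n\n"
        "The student's code is delimited by <code> tags:\n"
        f"<code>\n{code}\n</code>"
    )


# Hypothetical student question and C snippet.
prompt = make_prompt(
    "Why does my loop never end?",
    "while (i < 10)\n{\n    printf(\"%i\\n\", i);\n}",
)
print(prompt)
```

A persona, by contrast, usually belongs in the system message rather than in each user prompt.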

105 of 180

Chat Completions API

106 of 180

Can you help me with my tideman problem set?

107 of 180

Can you help me with my tideman problem set?

I'd be happy to help with the CS50 Tideman problem! Could you please specify which aspect …


111 of 180

from openai import OpenAI

OpenAI().chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a friendly… a rubber duck."},
        {"role": "user", "content": "Can you help me with my filter pset?"}
    ],
    model="gpt-4o"
)

Response: Of course! I'd be happy to help. Could you please specify which part of the filter pset is giving you trouble? There are several parts to it, like the grayscale, sepia, reflect, or blur filters.
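The call above returns a response object rather than plain text. As a sketch, the hypothetical helper `ask_duck` below wraps the same call and pulls the reply out of `response.choices[0].message.content`. So that it runs offline, it is exercised here with a stub client (an assumption for demonstration); the commented lines show how the real SDK client would be passed in instead. The system prompt is a shortened version of the one on the slides.

```python
from types import SimpleNamespace

SYSTEM_PROMPT = (
    "You are a friendly and supportive teaching assistant for CS50. "
    "You are also a rubber duck."
)


def ask_duck(client, question, model="gpt-4o"):
    """Send one system + user message pair and return the assistant's text."""
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        model=model,
    )
    return response.choices[0].message.content


class StubCompletions:
    """Offline stand-in that mimics the shape of a chat completion response."""

    def create(self, messages, model):
        reply = f"(stub reply from {model} to: {messages[-1]['content']})"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


stub_client = SimpleNamespace(chat=SimpleNamespace(completions=StubCompletions()))
answer = ask_duck(stub_client, "Can you help me with my filter pset?")
print(answer)

# With the real SDK (requires the openai package and OPENAI_API_KEY):
#   from openai import OpenAI
#   print(ask_duck(OpenAI(), "Can you help me with my filter pset?"))
```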

112 of 180

Hands-on Practice

113 of 180

github.com/cs50-workshop-2025/ai-workshop

114 of 180

export OPENAI_API_KEY=sk-proj-...

115 of 180

Conversation


117 of 180

OpenAI().chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a friendly… a rubber duck."},
        {"role": "user", "content": "Can you help me with my filter pset?"}
    ],
    model="gpt-4o",
)

Response: Of course! I'd be happy to help. Could you please specify which part of the filter pset is giving you trouble? There are several parts to it, like the grayscale, sepia, reflect, or blur filters.

118 of 180

OpenAI().chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a friendly… a rubber duck."},
        {"role": "user", "content": "Can you help me with my filter pset?"},
        {"role": "assistant", "content": "Of course! I’d be happy … "}
    ],
    model="gpt-4o",
)

Response: Of course! I'd be happy to help. Could you please specify which part of the filter pset is giving you trouble? There are several parts to it, like the grayscale, sepia, reflect, or blur filters.


120 of 180

OpenAI().chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a friendly… a rubber duck."},
        {"role": "user", "content": "Can you help me with my filter pset?"},
        {"role": "assistant", "content": "Of course! I’d be happy … "},
        {"role": "user", "content": "<User Prompt>"}
    ],
    model="gpt-4o",
)


125 of 180

OpenAI().chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a friendly… a rubber duck."},
        {"role": "user", "content": "Can you help me with my filter pset?"},
        {"role": "assistant", "content": "Of course! I’d be happy … "},
        {"role": "user", "content": "<User Prompt>"},
        {"role": "assistant", "content": "<Assistant Response>"},
        {"role": "user", "content": "<User Prompt>"},
        {"role": "assistant", "content": "<Assistant Response>"},
        {"role": "user", "content": "<User Prompt>"},
        …
    ],
    model="gpt-4o",
)
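The growing `messages` list above can be maintained with a small loop. The sketch below is one possible way to do it, not the workshop's code: the names `send` and `echo_complete` are assumptions, and the model is replaced by an offline stand-in so the example runs without an API key.

```python
def send(history, user_prompt, complete):
    """Append the user's prompt, ask `complete` for a reply, append it, and return it.

    `complete` is any callable that maps the full message list to reply text.
    """
    history.append({"role": "user", "content": user_prompt})
    reply = complete(history)
    history.append({"role": "assistant", "content": reply})
    return reply


def echo_complete(messages):
    """Offline stand-in for the model: reports how many messages it was sent."""
    return f"I received {len(messages)} messages."


history = [{"role": "system", "content": "You are a friendly… a rubber duck."}]
first = send(history, "Can you help me with my filter pset?", echo_complete)
second = send(history, "I'm stuck on blur.", echo_complete)
print(first)
print(second)
print([m["role"] for m in history])

# A real completer could wrap the SDK call shown on the slides, for example:
#   from openai import OpenAI
#   client = OpenAI()
#   def openai_complete(messages):
#       response = client.chat.completions.create(messages=messages, model="gpt-4o")
#       return response.choices[0].message.content
```

Because the API call itself is stateless, the application is responsible for resending the accumulated history on every turn.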

126 of 180

Hands-on Practice

127 of 180

Hallucinations

128 of 180

Grounding

129 of 180

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.

In other words, it fills a gap in how LLMs work. Under the hood, LLMs are neural networks, typically measured by how many parameters they contain. An LLM’s parameters essentially represent the general patterns of how humans use words to form sentences.

That deep understanding, sometimes called parameterized knowledge, makes LLMs useful in responding to general prompts at light speed. However, it does not serve users who want a deeper dive into a current or more specific topic.

As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read. In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.

https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation/


133 of 180

What is flask?

RAG

Updated Prompt

134 of 180

"What is flask?"

135 of 180

"embedding":
[-0.0168844070, -0.0094333650, -0.0136059495, -0.017577527,
-0.0011228547, -0.0064980015, -0.0234829110, 0.0065499856,
-0.0023427461, -0.0181181620, 0.0070386350, 0.013203939,
…
-0.0078253270, -0.0289447000, -0.0306913610]

The embedding of plain text "What is flask?"
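A vector like this can be requested from the Embeddings API. The sketch below wraps `client.embeddings.create(...)` and reads `response.data[0].embedding`. The slides do not say which embedding model was used, so `text-embedding-3-small` is an assumption, as are the helper name `embed` and the stub client that lets the example run offline.

```python
from types import SimpleNamespace


def embed(client, text, model="text-embedding-3-small"):
    """Return the embedding (a list of floats) for `text`."""
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding


class StubEmbeddings:
    """Offline stand-in returning a tiny fake vector derived from the text's length."""

    def create(self, model, input):
        vector = [float(len(input)), 0.0, 1.0]
        return SimpleNamespace(data=[SimpleNamespace(embedding=vector)])


stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vector = embed(stub_client, "What is flask?")
print(vector)

# With the real SDK (requires the openai package and OPENAI_API_KEY):
#   from openai import OpenAI
#   vector = embed(OpenAI(), "What is flask?")
```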

136 of 180

Lecture Video

137 of 180

An excerpt of lecture captions (SRT format)

00:05:33,300 --> 00:05:36,450
means it's relatively small versus alternatives that are out there.

00:05:36,450 --> 00:05:37,800
And it's called Flask.

00:05:37,800 --> 00:05:40,350
So Flask is really a third-party library--

00:05:40,350 --> 00:05:42,420
and it's popular in the Python world-- that's

00:05:42,420 --> 00:05:46,260
just going to make it easier to implement web applications using

138 of 180

Lecture Captions Segment

means it's relatively small versus alternatives that are out there. And it's called Flask. So Flask is really a third-party library-- and it's popular in the Python world-- that's just going to make it easier to implement web applications using
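Turning captions into a segment like this amounts to discarding the timing lines and joining what remains. A minimal sketch, assuming standard SRT input in which each cue has a numeric index line, a timestamp line, and one or more text lines (the excerpt in this deck omits the index lines). The function name `captions_to_segment` is hypothetical.

```python
import re

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def captions_to_segment(srt_text):
    """Drop cue numbers, timestamps, and blank lines; join the remaining text."""
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Caveat: a caption consisting only of digits would also be dropped.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        kept.append(line)
    return " ".join(kept)


sample = """\
1
00:05:36,450 --> 00:05:37,800
And it's called Flask.

2
00:05:37,800 --> 00:05:40,350
So Flask is really a third-party library--
"""
segment = captions_to_segment(sample)
print(segment)
```

How many cues to merge into one segment is a separate design choice that the slides do not specify.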

139 of 180

"embedding":
[-0.0020580715, 0.01005940200, 0.00657967060,
-0.0138025950, 0.01654669000, 0.01074371600,
-0.0135357130, -0.02156954800, -0.00049869320,
-0.0200230010, -0.00152516280, 0.00514261300,
-0.0255248790, -0.00060818327, -0.01628665300,
…
0.0020050374, -0.00763693400, -0.02419731200,
-0.0411956500]

Lecture Captions Embedding


142 of 180

means it's relatively small versus alternatives that are out there. And it's called Flask. So Flask is really a third-party library-- and it's popular in the Python world-- that's just going to make it easier to implement web applications using

Search

Retrieved Document

"embedding": [-0.0168844070, -0.0094333650, -0.0136059495, -0.0175775270, -0.0011228547, -0.0064980015, -0.0234829110, 0.0065499856, -0.0023427461, -0.0181181620, 0.0070386350, 0.0132039390, -0.0274752840, 0.0254236480, -0.0053300940, …, -0.0078253270, -0.0289447000, -0.0306913610]

Embeddings-based search is a technique used in information retrieval and natural language processing that involves representing text documents, queries, or other data as high-dimensional vectors known as embeddings. These embeddings are generated through models that capture the semantic meaning and context of words, phrases, or entire documents, enabling the system to understand the underlying concepts rather than just matching keywords.

When a search query is made, it is also converted into an embedding, and the search system then identifies documents or items in its database whose embeddings are most similar to the query's embedding, typically using measures like cosine similarity. This approach allows for more nuanced and contextually relevant search results, as it can capture semantic relationships and subtleties in language that keyword-based searches might miss. Embeddings-based search is particularly useful in large-scale search engines, recommendation systems, and any application where understanding the deep semantic meaning of text is crucial for matching and retrieval tasks.
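To make the similarity step concrete, here is a small sketch of cosine similarity and a ranking over candidate documents. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and the function names are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vector, documents):
    """Return (score, text) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in documents]
    return sorted(scored, reverse=True)


# Toy vectors, invented for this example.
documents = [
    ("Flask is a third-party library for web applications.", [0.9, 0.1, 0.0]),
    ("A pointer stores a memory address.", [0.0, 0.2, 0.9]),
]
query_vector = [1.0, 0.0, 0.1]
results = rank(query_vector, documents)
best_score, best_text = results[0]
print(best_text)
```

Scores near 1 indicate vectors pointing in nearly the same direction, while scores near 0 indicate unrelated text.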

https://platform.openai.com/docs/guides/embeddings

https://help.openai.com/en/articles/6824809-embeddings-frequently-asked-questions

https://cookbook.openai.com/examples/question_answering_using_embeddings

https://en.wikipedia.org/wiki/Cosine_similarity


144 of 180

"What is flask?

Here is some useful information:

```

means it's relatively small versus alternatives that are out there. And it's called Flask. So Flask is really a third-party library-- and it's popular in the Python world-- that's just going to make it easier to implement web applications using

```"

LLM (GPT-4o)
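The updated prompt above is ordinary string assembly: the original question followed by the retrieved text inside a delimited block. A minimal sketch, with the hypothetical helper name `augment_prompt`:

```python
def augment_prompt(question, retrieved_text):
    """Append retrieved text to the question inside a triple-backtick block."""
    fence = "`" * 3  # built at runtime so this example can itself sit inside a fence
    return (
        f"{question}\n\n"
        "Here is some useful information:\n\n"
        f"{fence}\n{retrieved_text}\n{fence}"
    )


updated = augment_prompt(
    "What is flask?",
    "And it's called Flask. So Flask is really a third-party library--",
)
print(updated)
```

The resulting string is then sent as the content of the user message in place of the bare question.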

145 of 180

Vector Database

A vector database is a specialized type of database designed to store, index, and manage vector embeddings efficiently. These embeddings are high-dimensional vectors that represent complex data, such as text, images, or sounds, in a form that captures their semantic properties or features. Vector databases are optimized for operations involving these embeddings, particularly similarity search, where the goal is to find vectors in the database that are most similar to a given query vector.

The core functionality of a vector database revolves around its ability to quickly perform these similarity searches using various distance metrics, such as cosine similarity or Euclidean distance, even in high-dimensional spaces where traditional database indexing methods struggle. This capability is crucial for applications in machine learning, natural language processing, and recommendation systems, where understanding the semantic similarity between items is essential for tasks like information retrieval, content recommendation, and data clustering.

Vector databases often employ advanced indexing structures and algorithms, such as approximate nearest neighbor (ANN) search techniques, to handle the computational and storage challenges posed by high-dimensional vector data. This allows them to provide fast and scalable search capabilities that are vital for building responsive and intelligent applications in the era of big data and AI.
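For intuition only, the sketch below is a toy in-memory store that performs exact, brute-force search by Euclidean distance. It is not how production vector databases work internally: as noted above, they typically rely on approximate nearest neighbor indexes to avoid scanning every vector. The class name `TinyVectorStore` and the sample data are assumptions.

```python
import heapq
import math


class TinyVectorStore:
    """Exact nearest-neighbor search over an in-memory list of vectors."""

    def __init__(self):
        self.items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self.items.append((list(vector), payload))

    def search(self, query, k=1):
        """Return the k payloads whose vectors are closest to `query`."""
        nearest = heapq.nsmallest(
            k, self.items, key=lambda item: math.dist(query, item[0])
        )
        return [payload for _, payload in nearest]


store = TinyVectorStore()
store.add([0.0, 0.0], "origin")
store.add([1.0, 1.0], "diagonal")
store.add([5.0, 5.0], "far away")
closest = store.search([0.9, 1.2], k=2)
print(closest)
```

Each query here costs time proportional to the number of stored vectors, which is the cost that ANN indexes are designed to reduce.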

146 of 180

147 of 180

148 of 180

JULIANNA

149 of 180

Challenges

150 of 180

Instruction Dilution

151 of 180

Model	Messages	Code Blocks Generated	Message Level %	Conversation Level %
gpt-4	6,487,201	1,326,273	20%	44%
gpt-4o	3,203,702	817,739	25%	56%

Frequency of Code Block Generation in Student-Duck Interactions: Analysis of 10M messages reveals 22% of responses and 48% of conversations include code blocks, with higher rates observed after switching to GPT-4o.
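As a quick arithmetic check, the message-level percentages can be approximated from the counts in the table, assuming they are the ratio of code blocks generated to messages (the slide does not state the exact definition). Under that assumption the GPT-4o ratio comes out slightly above the 25% shown, so the slide's figures may be truncated or computed somewhat differently.

```python
# Counts copied from the table above.
rows = {
    "gpt-4": {"messages": 6_487_201, "code_blocks": 1_326_273},
    "gpt-4o": {"messages": 3_203_702, "code_blocks": 817_739},
}


def ratio_percent(code_blocks, messages):
    """Code blocks per 100 messages."""
    return 100 * code_blocks / messages


per_model = {
    name: ratio_percent(row["code_blocks"], row["messages"])
    for name, row in rows.items()
}
total_messages = sum(row["messages"] for row in rows.values())
total_blocks = sum(row["code_blocks"] for row in rows.values())
overall = ratio_percent(total_blocks, total_messages)

for name, value in per_model.items():
    print(f"{name}: {value:.1f}%")
print(f"overall: {overall:.1f}% of {total_messages:,} messages")
```

The conversation-level percentages cannot be reproduced this way, since they depend on per-conversation data not shown in the table.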

152 of 180

Misalignment with Pedagogical Goals

153 of 180

154 of 180

Lack of Systematic Evaluation of AI Performance


161 of 180

Proposed Solutions

Aligning with pedagogical goals:

Few-Shot Prompting
Fine-Tuning Model

Evaluating performance:

Human-in-the-Loop Approach
AI System Evaluation Framework

162 of 180

163 of 180

Results

164 of 180

Created a focused dataset of 50 student queries from past AI interactions

Distribution of queries:

15 code generation questions
15 debugging questions
10 error message questions
5 introductory coding concepts
5 conceptual CS questions

Tested two versions initially:

V0: Current CS50 Duck system prompt
V1: Modified prompt emphasizing interactive guidance

165 of 180

The distribution of choices made in the TF-graded evaluation reveals a split in preference between V0 and V1, suggesting areas for improvement in both. Teaching fellows with at least two semesters of experience preferred V1 more often.

166 of 180

Four variants compared:

V0: CS50 Duck with original system prompt
V1: Modified prompt encouraging leading questions
V2: Few-shot prompting (4 example interactions)
V3: Fine-tuned GPT-4o-mini model (50 conversations, 5 epochs)

Key differences:

V1: Added instructions to guide rather than solve
V2: Included example teaching interactions
V3: Custom-trained model for teaching behavior
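Few-shot prompting of the kind used in V2 can be sketched as example exchanges placed ahead of the real question as earlier user and assistant messages. The four interactions actually used are not shown in the slides, so the example below is hypothetical, as is the helper name `few_shot_messages`.

```python
def few_shot_messages(system_prompt, examples, question):
    """Place example (student, duck) exchanges before the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for student, duck in examples:
        messages.append({"role": "user", "content": student})
        messages.append({"role": "assistant", "content": duck})
    messages.append({"role": "user", "content": question})
    return messages


# Hypothetical example interaction that models guiding rather than solving.
examples = [
    ("My code segfaults.",
     "What did you observe just before the crash? Which line ran last?"),
]
messages = few_shot_messages(
    "You are a friendly… a rubber duck.", examples, "Why is my loop infinite?"
)
for message in messages:
    print(message["role"], "->", message["content"])
```

The model sees the examples as if they were earlier turns, which nudges it toward the same teaching style without any retraining.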

167 of 180

The win rates of the multi-turn evaluation show that teaching fellows preferred conversations generated by the models with few-shot prompting and fine-tuning over the original version (V0) 60% of the time.

168 of 180

V0 had the lowest estimated Elo score, and its 95% confidence interval did not overlap with those of V2 and V3, ranking it last.
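The slides do not describe how the Elo scores or their confidence intervals were estimated. For readers unfamiliar with Elo, the sketch below shows the textbook sequential update applied to pairwise preferences. The starting rating of 1000, the K-factor of 32, and the list of comparisons are all assumptions for illustration, not the study's data or method.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update(ratings, winner, loser, k=32):
    """Shift both ratings toward the observed outcome of one comparison."""
    gain = k * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


ratings = {"V0": 1000.0, "V2": 1000.0, "V3": 1000.0}
# Hypothetical (winner, loser) preferences.
comparisons = [("V2", "V0"), ("V3", "V0"), ("V3", "V2"), ("V2", "V0")]
for winner, loser in comparisons:
    update(ratings, winner, loser)

for name, rating in sorted(ratings.items(), key=lambda item: item[1], reverse=True):
    print(name, round(rating))
```

Sequential Elo depends on the order of comparisons, so confidence intervals are often estimated by repeating the calculation over resampled or reshuffled comparisons; whether that was done here is not stated.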

169 of 180

(Comparing with V0) "[V2] actually makes me talk to it and figure it out myself."

(Comparing with V0) "Both a bit overexplanatory, but [V2] more teacherly."

"[V3] attempts to walk the students through the code. I think it is slightly better than [V0], because [V0] reveals too much information, all at once."

"I think [V3] is better for the students to actively learn, but [V0] seems to have better feedback."

170 of 180

AI Model-Graded Evaluation

171 of 180

RONGXIN

172 of 180

OpenAI GPTs

https://chatgpt.com/gpts/editor

173 of 180

174 of 180

175 of 180

Resources

176 of 180

OpenAI Cookbooks

177 of 180

Papers

178 of 180

Lectures

Talks

179 of 180

Q&A
