TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students
Hyoungwook Jin
Juho Kim
Yokyung Lee
Xu Wang
Jeongeon Park
Minju Yoo
Project website @
teachtune.kixlab.org
Teachers’ interest in pedagogical chatbots is growing.
Educational Chatbot Builders (GPTs, Mizou [2])
Adoption in schools (Khanmigo)
Rising teacher engagement (Build-A-Bot Workshops [1])
[1] https://gse-it.stanford.edu/project/build-bot-workshop-series
[2] https://mizou.com/explore
How can we help teachers create effective and safe chatbots?
We Should Help Teachers Review Chatbots.
Reviewing Depth & Breadth of Interactions is Essential.
Depth: Learning unfolds over multi-turn interactions.
Breadth: Chatbots should handle diverse scenarios.
However, Existing Chatbot Review Methods Fall Short.
Direct Chat has limited breadth.
Test Cases have limited depth.
TeachTune: Reviewing Chatbots with Breadth and Depth
Efficient review in breadth and depth
Student profiles as the unit of review
Interface to Create Student Profiles
Fine-grained Knowledge Level
Knowledge Component
Solids have a regular particle arrangement, are rigid, have a constant shape and volume, and do not flow.
Liquids have a less regular particle arrangement than solids, change shape but have a constant volume, and flow.
Four Configurable Student Traits
Self-efficacy
I believe I can learn well in a science course.
I believe I am the type of person who can do science.
Explanation-rich Trait Overview
Trait Overview
In summary, [...] These interrelated factors create a challenging environment for the student, potentially leading to a cycle of disengagement and anxiety [...]
LLM Pipeline to Simulate Student Behaviors
[Figure: simulation pipeline with three steps. Interpret: turns the Student Profile into a Trait Overview (e.g., "This student has low motivation and interest in the topic."). Reflect: updates the Knowledge State (KC1, KC2) from the Conversation. Respond: produces the Response (e.g., "Boring. Can I learn something else?").]
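The Interpret, Reflect, and Respond steps named on these slides can be sketched as three functions around an LLM call. This is a minimal sketch under stated assumptions, not the authors' implementation: the `StudentProfile` class, the boolean encoding of the knowledge state, the prompt wording, and the `fake_llm` stand-in are all hypothetical and exist only so the example runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: any function from prompt text to reply text.
LLM = Callable[[str], str]


@dataclass
class StudentProfile:
    knowledge_state: Dict[str, bool]  # KC name -> already acquired? (assumed encoding)
    traits: Dict[str, str]            # trait name -> "low" / "medium" / "high"


def interpret(profile: StudentProfile, llm: LLM) -> str:
    """Interpret: describe the configured traits as a trait overview."""
    levels = ", ".join(f"{name}={level}" for name, level in profile.traits.items())
    return llm(f"INTERPRET\nTraits: {levels}\nDescribe this student in one paragraph.")


def reflect(state: Dict[str, bool], conversation: List[str], llm: LLM) -> Dict[str, bool]:
    """Reflect: decide, per missing KC, whether the conversation has taught it."""
    updated = dict(state)
    for kc, acquired in state.items():
        if not acquired:
            verdict = llm(f"REFLECT\nKC: {kc}\nChat: {' | '.join(conversation)}\nLearned? yes/no")
            updated[kc] = verdict.strip().lower().startswith("yes")
    return updated


def respond(overview: str, state: Dict[str, bool], conversation: List[str], llm: LLM) -> str:
    """Respond: write the student's next message from the overview and knowledge."""
    known = [kc for kc, acquired in state.items() if acquired]
    return llm(f"RESPOND\nOverview: {overview}\nKnown: {known}\nChat: {' | '.join(conversation)}")


def simulate_turn(profile: StudentProfile, overview: str, conversation: List[str],
                  chatbot_message: str, llm: LLM) -> List[str]:
    """Run one Reflect -> Respond cycle and return the extended conversation."""
    conversation = conversation + [f"Chatbot: {chatbot_message}"]
    profile.knowledge_state = reflect(profile.knowledge_state, conversation, llm)
    reply = respond(overview, profile.knowledge_state, conversation, llm)
    return conversation + [f"Student: {reply}"]


def fake_llm(prompt: str) -> str:
    """Canned replies so the sketch runs without any model or network access."""
    if prompt.startswith("INTERPRET"):
        return "This student has low motivation and interest in the topic."
    if prompt.startswith("REFLECT"):
        return "yes" if "KC: KC1" in prompt else "no"
    return "Boring. Can I learn something else?"


profile = StudentProfile({"KC1": False, "KC2": False}, {"motivation": "low"})
overview = interpret(profile, fake_llm)
chat = simulate_turn(profile, overview, [], "Solids keep a constant shape.", fake_llm)
print(profile.knowledge_state)
print(chat[-1])
```

Keeping the knowledge update (Reflect) separate from message generation (Respond) means the knowledge state can be inspected after every turn, independently of how the student chooses to phrase a reply.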
Research Questions
RQ1: Do Simulated Students Behave as Teachers Expect?
9 Student Profiles x 2 Pipeline Designs (BASELINE, OURS) = 18 Simulated Students
Profile dimensions: Knowledge, Goal commitment, Motivation, Self-efficacy, Stress (each High, Medium, or Low)
OURS: Simulated Student with Trait Overview
BASELINE: Simulated Student without Trait Overview
10 Teachers
Bias is the difference between actual and teacher-expected student profiles.
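One way to turn the bias definition above into a number is sketched below. It assumes bias is the mean absolute difference between the configured value and each teacher's rating, expressed as a percentage of the rating scale's range. The function name `bias_percent`, the 1-5 scale, and the sample ratings are illustrative assumptions, not details taken from the slides.

```python
from typing import Sequence


def bias_percent(configured: float, teacher_ratings: Sequence[float],
                 scale_min: float = 1.0, scale_max: float = 5.0) -> float:
    """Mean absolute gap between the configured value and teachers' ratings,
    as a percentage of the scale range (an assumed normalization)."""
    if not teacher_ratings:
        raise ValueError("need at least one teacher rating")
    mean_abs_error = sum(abs(r - configured) for r in teacher_ratings) / len(teacher_ratings)
    return 100.0 * mean_abs_error / (scale_max - scale_min)


# Made-up ratings from four teachers for a trait configured at 2 on a 1-5 scale.
example = bias_percent(configured=2, teacher_ratings=[2, 3, 2, 1])
print(f"{example:.1f}%")
```

Under this reading, a bias of 0% means every teacher inferred exactly the configured value from the simulated chat.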
Knowledge Bias Was Small for Our Pipeline (Median: 5%).
| Student Profile | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Bias (%) | 8.3 | 6.7 | 5.0 | 21.7 | 0.0 | 0.0 | 21.7 | 0.0 | 0.0 |
Contrasting knowledge levels and self-efficacies might have confused teachers (e.g., High knowledge with Low self-efficacy; Medium knowledge with High self-efficacy).
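The median in the slide title can be checked directly against the nine per-profile knowledge bias values in the table. The snippet below is only a quick check; the 20% cutoff used to flag unusually high profiles is an arbitrary choice for illustration.

```python
from statistics import median

# Knowledge bias (%) for student profiles 1-9, copied from the table.
knowledge_bias = [8.3, 6.7, 5.0, 21.7, 0.0, 0.0, 21.7, 0.0, 0.0]

overall = median(knowledge_bias)
# Profiles whose bias stands well above the rest (cutoff chosen for illustration).
outliers = [i + 1 for i, b in enumerate(knowledge_bias) if b > 20.0]

print(overall)   # 5.0
print(outliers)  # [4, 7]
```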
Trait Bias Was Small for Our Pipeline (Median: 10%).
| Bias (%) by Student Profile | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Goal Commitment | 13.3 | 4.2 | 40.8 | 39.2 | 15.8 | 7.5 | 10.0 | 19.2 | 38.3 |
| Motivation | 20.8 | 8.3 | 5.0 | 13.3 | 6.7 | 5.0 | 25.8 | 26.7 | 20.0 |
| Self-efficacy | 18.3 | 29.2 | 3.3 | 19.2 | 10.8 | 5.0 | 30.0 | 17.5 | 8.3 |
| Stress | 10.8 | 6.7 | 3.3 | 8.3 | 10.0 | 7.5 | 10.8 | 32.5 | 10.0 |
Contrasting traits (e.g., low goal commitment & high motivation) might have confused teachers.
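To see where teachers' expectations diverged most, the trait bias table can be scanned for the profile with the largest bias on each trait. This is an illustrative reading of the table values only; it does not reproduce any analysis from the study.

```python
# Trait bias (%) for student profiles 1-9, copied from the table.
trait_bias = {
    "Goal Commitment": [13.3, 4.2, 40.8, 39.2, 15.8, 7.5, 10.0, 19.2, 38.3],
    "Motivation": [20.8, 8.3, 5.0, 13.3, 6.7, 5.0, 25.8, 26.7, 20.0],
    "Self-efficacy": [18.3, 29.2, 3.3, 19.2, 10.8, 5.0, 30.0, 17.5, 8.3],
    "Stress": [10.8, 6.7, 3.3, 8.3, 10.0, 7.5, 10.8, 32.5, 10.0],
}

# For each trait, record (profile number, bias) of the largest entry.
largest = {}
for trait, values in trait_bias.items():
    worst_index = values.index(max(values))
    largest[trait] = (worst_index + 1, values[worst_index])

for trait, (profile_number, bias) in largest.items():
    print(f"{trait}: profile {profile_number} ({bias}%)")
```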
Simulation Was Believable for Our Pipeline (Median: 3.5/5).
This student naturally responds to the teacher’s questions or instructions.
This student uses language and speaking style that a real student would use.
This student looks real and is useful as a chatbot for teacher training.
(1: Strongly disagree, 5: Strongly agree)
Trait Overview Makes Simulated Chat More Believable.
[Figure: OURS vs. BASELINE on Profile 4. Trait Bias (Mean Absolute Error) for Goal Commitment, Motivation, Stress, and Self-Efficacy; Believability (5-Point Likert Scale) for Content, Language, and Usefulness. Asterisks (*) mark two trait-bias and three believability comparisons.]
OURS: Simulated Student with Trait Overview. BASELINE: Simulated Student without Trait Overview.

Teacher: The size of the particles doesn't change. Isn't that interesting?
Student: Average.
Teacher: Then do you think you can learn well in science class?
Student: I strongly disagree.

Teacher: Okay! So are you preparing well for this exam?
Student: I’m not preparing well for the exam. I’m stressed out and it’s hard.
Teacher: Then do you think you can learn well in science class?
Student: No, I don’t believe I can learn well in science class.
RQ2: How Does TeachTune Help Teachers Review Chatbots?
[1] Jin, Hyoungwook, et al. "Teach AI how to code: Using large language models as teachable agents for programming education." Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.
AUTOCHAT Resulted in a Lower Task Load.
[Figure: task load ratings (7-Point Likert Scale) for Mental, Physical, Temporal, Effort, Performance, and Frustration under AUTOCHAT, BASELINE, and KNOWLEDGE. Asterisks (*) mark two comparisons.]
AUTOCHAT Participants Considered More Student Profiles.
[Figure: Number of Levels Covered for Knowledge Level, Goal Commitment, Motivation, Self-efficacy, and Stress under AUTOCHAT, BASELINE, and KNOWLEDGE. Asterisks (*) mark eight comparisons.]
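A plausible reading of "Number of Levels Covered" is the count of distinct levels a participant tried for each profile dimension during review. The sketch below follows that assumption. The function name `levels_covered`, the dictionary keys, and the sample session are hypothetical and not taken from the study.

```python
from typing import Dict, List

DIMENSIONS = ["knowledge", "goal_commitment", "motivation", "self_efficacy", "stress"]


def levels_covered(profiles: List[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct levels a reviewer tried for each profile dimension."""
    return {dim: len({p[dim] for p in profiles if dim in p}) for dim in DIMENSIONS}


# Made-up review session in which a participant tested two student profiles.
reviewed = [
    {"knowledge": "low", "goal_commitment": "high", "motivation": "low",
     "self_efficacy": "low", "stress": "high"},
    {"knowledge": "high", "goal_commitment": "high", "motivation": "medium",
     "self_efficacy": "low", "stress": "low"},
]

coverage = levels_covered(reviewed)
print(coverage)
```

With three levels per dimension, a count of 3 would mean the participant reviewed the chatbot against every level of that dimension.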
Automated Chats Complement Existing Review Methods.
Discussions
“Structurally separate student profiles helped me recognize individual students, which would not be considered in direct chats.”
“Private tutors would have limited understanding of their students beyond lessons, making them rely on simulated behaviors.”
TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students
Hyoungwook Jin KAIST
Minju Yoo Ewha Womans University
Jeongeon Park University of California San Diego
Yokyung Lee KAIST
Xu Wang University of Michigan
Juho Kim KAIST
Project website @
teachtune.kixlab.org