1 of 38

TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students

Hyoungwook Jin

Juho Kim

Yokyung Lee

Xu Wang

Jeongeon Park

Minju Yoo

Project website: teachtune.kixlab.org

2 of 38

Teachers’ interest in pedagogical chatbots is growing.

Edu. Chatbot Builders (GPTs, Mizou [2])

Adoption in schools (Khanmigo)

Rising teacher engagement (Build-A-Bot Workshops [1])

[1] https://gse-it.stanford.edu/project/build-bot-workshop-series

[2] https://mizou.com/explore

3 of 38

How can we help teachers create effective and safe chatbots?

4 of 38

We Should Help Teachers Review Chatbots.

Safety: interacting with young, impressionable students
Quality: providing instructions for effective learning
Inclusivity: ensuring inclusive support to prevent inequality

5 of 38

Reviewing Depth & Breadth of Interactions is Essential.

Depth: Learning unfolds over multi-turn interactions.

6 of 38

Breadth: Chatbots should handle diverse scenarios.

7 of 38

However, Existing Chatbot Review Methods Fall Short.

Direct Chat has limited breadth.

Test Cases have limited depth.

8 of 38

TeachTune: Reviewing Chatbots with Breadth and Depth

Simulated Students: Efficient review in breadth and depth

Profile-oriented Interface: Student profiles as the unit of review

9 of 38

This is TeachTune’s interface.

On the left side, teachers can create a chatbot.

Teachers use a node-based interface to design the conversation flow for the chatbot.

This flow functions as a state machine, where each node defines how the chatbot should behave in response to different student actions.
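As a rough illustration of the state-machine idea in the note above, the sketch below models each node as an instruction plus a map from student actions to the next node. The `Node` class, the node names, and the action labels ("correct", "incorrect") are all hypothetical and are not taken from TeachTune itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state in the conversation flow (hypothetical structure)."""
    instruction: str  # how the chatbot should behave while in this state
    # Classified student action -> id of the next node.
    transitions: dict[str, str] = field(default_factory=dict)


# Hypothetical three-node flow for a lesson on states of matter.
FLOW = {
    "ask": Node("Ask the student to describe how solids behave.",
                {"correct": "praise", "incorrect": "hint"}),
    "hint": Node("Give a hint about particle arrangement.",
                 {"correct": "praise", "incorrect": "hint"}),
    "praise": Node("Praise the student and wrap up."),
}


def step(current: str, student_action: str) -> str:
    """Return the next node id; stay in place if no transition matches."""
    return FLOW[current].transitions.get(student_action, current)


def run(start: str, actions: list[str]) -> list[str]:
    """Walk the flow for a sequence of classified student actions."""
    path = [start]
    for action in actions:
        path.append(step(path[-1], action))
    return path


print(run("ask", ["incorrect", "incorrect", "correct"]))
```

In this sketch a student who answers incorrectly twice stays in the hint node until a correct answer moves the flow on, which is the kind of multi-turn path a teacher would want to check during review.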

On the right side, teachers can use automated chat to review the chatbots created on the left side.

To begin, teachers first define student profiles they want to simulate by filling out a template.

Based on the values from the template, TeachTune generates an overview description of the student profile, whose details teachers can edit.

Once the student profile is set, TeachTune generates a mock conversation between the chatbot and the simulated student.

By reviewing this conversation, teachers can check whether the chatbot responds as intended across different scenarios.

10 of 38

Interface to Create Student Profiles

11 of 38

Fine-grained Knowledge Level

Knowledge Component

Solids have a regular particle arrangement, are rigid, have a constant shape and volume, and do not flow.

Liquids have a less regular particle arrangement than solids, change shape but have a constant volume and flow.

12 of 38

Four Configurable Student Traits

Self-efficacy

I believe I can learn well in a science course.

I believe I am the type of person who can do science.
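One way to picture the profile template is as a small record holding per-component knowledge and per-trait ratings. The sketch below is an assumption-laden illustration: the `StudentProfile` class, the 1-5 agreement scale, the averaging of statement ratings, and the abbreviated knowledge-component strings are all hypothetical choices, not TeachTune's actual data model.

```python
from dataclasses import dataclass

# The four configurable traits named in the slides.
TRAITS = ("goal_commitment", "motivation", "self_efficacy", "stress")


@dataclass
class StudentProfile:
    """Hypothetical container for one filled-in profile template."""
    # Knowledge component statement -> True if the student already knows it.
    knowledge: dict[str, bool]
    # Trait name -> 1-5 agreement ratings, one per statement (assumed scale).
    trait_ratings: dict[str, list[int]]

    def validate(self) -> None:
        """Raise ValueError if a trait is missing or a rating is off-scale."""
        for trait in TRAITS:
            ratings = self.trait_ratings.get(trait)
            if not ratings:
                raise ValueError(f"missing ratings for {trait}")
            if any(r < 1 or r > 5 for r in ratings):
                raise ValueError(f"rating out of range for {trait}")

    def trait_score(self, trait: str) -> float:
        """Average the statement ratings for one trait."""
        ratings = self.trait_ratings[trait]
        return sum(ratings) / len(ratings)


profile = StudentProfile(
    knowledge={
        "Solids have a regular particle arrangement.": True,
        "Liquids change shape but have a constant volume.": False,
    },
    trait_ratings={
        "goal_commitment": [3],
        "motivation": [2],
        # Two statements, e.g. "I believe I can learn well in a science course."
        "self_efficacy": [1, 2],
        "stress": [4],
    },
)
profile.validate()
print(profile.trait_score("self_efficacy"))
```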

13 of 38

Explanation-rich Trait Overview

Trait Overview

In summary, [...] These interrelated factors create a challenging environment for the student, potentially leading to a cycle of disengagement and anxiety [...]

14 of 38

LLM Pipeline to Simulate Student Behaviors

[Figure: a Knowledge State containing knowledge components KC1 and KC2.]

18 of 38

LLM Pipeline to Simulate Student Behaviors

[Figure: the pipeline in three steps, Interpret, Reflect, and Respond. Elements shown: Student Profile; Trait Overview ("This student has low motivation and interest in the topic."); Conversation; Knowledge State (KC1, KC2), shown twice; Response ("Boring. Can I learn something else?").]
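The three step names on this slide (Interpret, Reflect, Respond) can be sketched as three functions around a text-generation call. Everything below is an assumption made for illustration: the `LLM` callable, the prompt wording, the yes/no parsing, and in particular which step consumes which input are guesses, not the authors' implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: takes a prompt, returns generated text


def interpret(traits: dict[str, str], llm: LLM) -> str:
    """Interpret: turn trait values into a prose trait overview."""
    listed = ", ".join(f"{name}: {level}" for name, level in traits.items())
    return llm(f"Describe a student with these traits in a short overview. {listed}")


def reflect(conversation: list[str], knowledge: dict[str, bool], llm: LLM) -> dict[str, bool]:
    """Reflect: mark unknown knowledge components the conversation has taught."""
    history = "\n".join(conversation)
    updated = dict(knowledge)
    for kc, known in knowledge.items():
        if not known:
            answer = llm(f"Conversation:\n{history}\nDid the tutor explain {kc}? Answer yes or no.")
            updated[kc] = answer.strip().lower().startswith("yes")
    return updated


def respond(conversation: list[str], knowledge: dict[str, bool], overview: str, llm: LLM) -> str:
    """Respond: write the student's next message from state and overview."""
    known = [kc for kc, is_known in knowledge.items() if is_known]
    history = "\n".join(conversation)
    return llm(
        f"You are a student. Overview: {overview}\n"
        f"You know only: {known}\nConversation:\n{history}\nReply as the student."
    )


def simulate_turn(traits, conversation, knowledge, llm):
    """Run one simulated student turn and return (new knowledge state, reply)."""
    overview = interpret(traits, llm)
    knowledge = reflect(conversation, knowledge, llm)
    return knowledge, respond(conversation, knowledge, overview, llm)


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs offline."""
    if "Answer yes or no" in prompt:
        return "Yes"
    if prompt.startswith("Describe"):
        return "This student has low motivation and interest in the topic."
    return "Boring. Can I learn something else?"


state, reply = simulate_turn(
    {"motivation": "low"},
    ["Tutor: Solids are rigid and do not flow."],
    {"KC1": False, "KC2": True},
    fake_llm,
)
print(state, reply)
```

Swapping `fake_llm` for a real model client would be the only change needed to try the sketch against live output. The canned version only exercises the control flow.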

19 of 38

Research Questions

Do simulated students behave as teachers expect?
How does TeachTune help teachers review chatbots?

20 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

9 Student Profiles

[Figure: the nine student profiles. Characteristics: Knowledge, Goal commitment, Motivation, Self-efficacy, Stress. Levels: High, Medium, Low.]

21 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

9 Student Profiles x 2 Pipeline Designs (BASELINE, OURS) = 18 Simulated Students

22 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

[Figure: the BASELINE pipeline design, with the labels Simulated Student and Trait Overview.]

23 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

18 Simulated Students

10 Teachers

Bias is the difference between actual and teacher-expected student profiles.
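The bias definition above can be made concrete in several ways. The sketch below shows one plausible reading: the mean absolute difference between the configured value and each teacher's estimate, expressed as a percentage of the scale range. The function name, the 1-5 scale, and the normalization are assumptions and are not confirmed by the slides.

```python
def bias_percent(actual: float, expected: list[float],
                 scale_min: float = 1.0, scale_max: float = 5.0) -> float:
    """Mean absolute gap between the configured value and teacher estimates,
    as a percentage of the scale range (assumed 1-5 here)."""
    if not expected:
        raise ValueError("need at least one teacher estimate")
    mean_abs_error = sum(abs(actual - e) for e in expected) / len(expected)
    return 100.0 * mean_abs_error / (scale_max - scale_min)


# Two hypothetical teachers rate a student configured at 5: one says 5, one says 3.
print(bias_percent(5.0, [5.0, 3.0]))
```

Under this reading, a bias of 0% means every teacher recovered the configured value exactly.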

24 of 38

Knowledge Bias Was Small for Our Pipeline (Median: 5%).

RQ1: Do Simulated Students Behave as Teachers Expect?

Knowledge bias (%) by student profile

Profile     1     2     3     4     5     6     7     8     9
Bias (%)    8.3   6.7   5.0   21.7  0.0   0.0   21.7  0.0   0.0
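The median in the slide title can be checked directly from the nine values in the table. The short sketch below does that and also lists the profiles with the largest bias. The variable names are illustrative only.

```python
from statistics import median

# Knowledge bias (%) for student profiles 1-9, copied from the table.
knowledge_bias = [8.3, 6.7, 5.0, 21.7, 0.0, 0.0, 21.7, 0.0, 0.0]

median_bias = median(knowledge_bias)
worst = max(knowledge_bias)
# Profile numbers are 1-based, so add 1 to each index.
worst_profiles = [i + 1 for i, b in enumerate(knowledge_bias) if b == worst]

print(median_bias)     # 5.0
print(worst_profiles)  # [4, 7]
```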

25 of 38

Contrasting knowledge levels and self-efficacies might have confused teachers.

High knowledge, low self-efficacy

Medium knowledge, high self-efficacy

26 of 38

Trait Bias Was Small for Our Pipeline (Median: 10%).

RQ1: Do Simulated Students Behave as Teachers Expect?

Trait bias (%) by student profile

Trait             1     2     3     4     5     6     7     8     9
Goal Commitment   13.3  4.2   40.8  39.2  15.8  7.5   10.0  19.2  38.3
Motivation        20.8  8.3   5.0   13.3  6.7   5.0   25.8  26.7  20.0
Self-efficacy     18.3  29.2  3.3   19.2  10.8  5.0   30.0  17.5  8.3
Stress            10.8  6.7   3.3   8.3   10.0  7.5   10.8  32.5  10.0

27 of 38

Contrasting traits (e.g., low goal commitment & high motivation) might have confused teachers.

28 of 38

Simulation Was Believable for Our Pipeline (Median: 3.5/5).

RQ1: Do Simulated Students Behave as Teachers Expect?

Content:

This student naturally responds to the teacher’s questions or instructions.

Language:

This student uses language and speaking style that a real student would use.

Usefulness:

This student looks real and is useful as a chatbot for teacher training.

(1: Strongly disagree, 5: Strongly agree)

29 of 38

Trait Overview Makes Simulated Chat More Believable.

RQ1: Do Simulated Students Behave as Teachers Expect?

[Figure: two panels comparing OURS and BASELINE for Profile 4. Trait Bias: Mean Absolute Error for Goal Commitment, Motivation, Stress, and Self-Efficacy. Believability: 5-point Likert scale for Content, Language, and Usefulness. Each panel carries one asterisk (*) annotation.]

30 of 38

Trait Overview Makes Simulated Chat More Believable.

RQ1: Do Simulated Students Behave as Teachers Expect?

OURS

BASELINE

Simulated Student

Trait Overview

The size of the particles doesn't change. Isn't that interesting?

Average.

Then do you think you can learn well in science class?

I strongly disagree.

Okay! So are you preparing well for this exam?

I’m not preparing well for the exam. I’m stressed out and it’s hard.

Then do you think you can learn well in science class?

No, I don’t believe I can learn well in science class.

31 of 38

RQ2: How Does TeachTune Help Teachers Review Chatbots?

Participants

30 K-12 science teachers who (1) taught for 3.3±4.7 years and (2) used ChatGPT and chatbots

Task

Designing a 1:1 tutor chatbot that helps students review phase transitions (60 min)
We instructed participants to consider diverse knowledge levels and student traits.

32 of 38

RQ2: How Does TeachTune Help Teachers Review Chatbots?

Between-subjects

BASELINE: Direct chat & Test cases
AUTOCHAT: Direct chat & Test cases & Automated chat
KNOWLEDGE: Direct chat & Test cases & Automated chat without student trait simulation [1]

[1] Jin, Hyoungwook, et al. "Teach AI how to code: Using large language models as teachable agents for programming education." Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.

33 of 38

AUTOCHAT Resulted in a Lower Task Load.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

[Figure: task load on a 7-point Likert scale for AUTOCHAT, BASELINE, and KNOWLEDGE. Dimensions: Mental, Physical, Temporal, Effort, Performance, Frustration. One asterisk (*) annotation.]

34 of 38

AUTOCHAT Participants Considered More Student Profiles.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

[Figure: Number of Levels Covered for AUTOCHAT, BASELINE, and KNOWLEDGE. Categories: Knowledge Level, Goal Commitment, Motivation, Self-efficacy, Stress. Two asterisk (*) annotations.]

35 of 38

Automated Chats Complement Existing Review Methods.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

Direct Chats for early design

testing rough versions & developing pedagogical interactions

Automated Chats for coverage test

finding corner cases & managing multiple scenarios

Test Cases for specific debugging

repeating a specific scenario quickly

36 of 38

Discussions

Profile-oriented Design and Review Workflow

TeachTune helps teachers organize a systematic overview of target users.
Profile-oriented workflow can generalize to user-centric designs.

“Structurally separate student profiles helped me recognize individual students, which would not be considered in direct chats.”

37 of 38

Discussions

Risks of Amplifying Teachers’ Stereotypes of Students

Views on stereotypes varied; some teachers used simulation to judge their bias.
Feedback loops are needed to align expectations with reality.

“Private tutors would have limited understanding of their students beyond lessons, making them rely on simulated behaviors.”

38 of 38

TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students

Hyoungwook Jin, KAIST

Minju Yoo, Ewha Womans University

Jeongeon Park, University of California San Diego

Yokyung Lee, KAIST

Xu Wang, University of Michigan

Juho Kim, KAIST

Project website: teachtune.kixlab.org