1 of 38

TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students

Hyoungwook Jin

Juho Kim

Yokyung Lee

Xu Wang

Jeongeon Park

Minju Yoo

Project website @ teachtune.kixlab.org

2 of 38

Teachers’ interest in pedagogical chatbots is growing.

  • Edu. Chatbot Builders (GPTs, Mizou [2])
  • Adoption in schools (Khanmigo)
  • Rising teacher engagement (Build-A-Bot Workshops [1])

[1] https://gse-it.stanford.edu/project/build-bot-workshop-series

[2] https://mizou.com/explore

3 of 38

How can we help teachers create effective and safe chatbots?

4 of 38

We Should Help Teachers Review Chatbots.

  • Safety: interacting with young, impressionable students
  • Quality: providing instructions for effective learning
  • Inclusivity: ensuring inclusive support to prevent inequality

6 of 38

Reviewing Depth & Breadth of Interactions is Essential.

Depth: Learning unfolds over multi-turn interactions.

Breadth: Chatbots should handle diverse scenarios.

7 of 38

However, Existing Chatbot Review Methods Fall Short.

Direct Chat has limited breadth.

Test Cases have limited depth.

8 of 38

TeachTune: Reviewing Chatbots with Breadth and Depth

  • Simulated Students: Efficient review in breadth and depth
  • Profile-oriented Interface: Student profiles as the unit of review

10 of 38

Interface to Create Student Profiles

11 of 38

Fine-grained Knowledge Level

Knowledge Component

Solids have a regular particle arrangement, are rigid, have a constant shape and volume, and do not flow.

Liquids have a less regular particle arrangement than solids, change shape but have a constant volume and flow.

12 of 38

Four Configurable Student Traits

Self-efficacy

I believe I can learn well in a science course.

I believe I am the type of person who can do science.
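As a rough illustration of what such a profile might hold, the sketch below models a student profile as a set of knowledge components plus the four traits. The class name, field names, and the three-level scale are assumptions made for this example, not TeachTune's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """Assumed three-level scale (Low / Medium / High)."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class StudentProfile:
    """Hypothetical container for one student profile."""
    # Maps each knowledge component statement to whether the student knows it.
    knowledge: dict[str, bool] = field(default_factory=dict)
    goal_commitment: Level = Level.MEDIUM
    motivation: Level = Level.MEDIUM
    self_efficacy: Level = Level.MEDIUM
    stress: Level = Level.MEDIUM

    def known_fraction(self) -> float:
        """Share of knowledge components marked as known (0.0 if none are listed)."""
        if not self.knowledge:
            return 0.0
        return sum(self.knowledge.values()) / len(self.knowledge)


profile = StudentProfile(
    knowledge={
        "Solids have a regular particle arrangement and do not flow.": True,
        "Liquids change shape but have a constant volume and flow.": False,
    },
    self_efficacy=Level.LOW,
)
```

Keeping knowledge per component, rather than as a single score, is what lets a profile say "knows solids but not liquids".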

13 of 38

Explanation-rich Trait Overview

Trait Overview

In summary, [...] These interrelated factors create a challenging environment for the student, potentially leading to a cycle of disengagement and anxiety [...]

18 of 38

LLM Pipeline to Simulate Student Behaviors

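[Diagram: three-step pipeline]

  • Interpret: turns the Student Profile into a Trait Overview (e.g., "This student has low motivation and interest in the topic.")
  • Reflect: uses the Conversation to update the Knowledge State (KC1, KC2)
  • Respond: uses the Knowledge State and the Trait Overview to produce a Response (e.g., "Boring. Can I learn something else?")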
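The three steps can be sketched as plain functions around a text-generation callable. This is a minimal sketch under stated assumptions: the `LLM` type, the function signatures, and the prompt wording are all hypothetical and are not taken from TeachTune's implementation.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical: any function mapping a prompt to text


def interpret(llm: LLM, profile: dict[str, str]) -> str:
    """Interpret: summarize trait settings as a natural-language trait overview."""
    traits = ", ".join(f"{name}: {level}" for name, level in profile.items())
    return llm(f"Describe a student with these traits in a few sentences. {traits}")


def reflect(llm: LLM, conversation: list[str], knowledge: dict[str, bool]) -> dict[str, bool]:
    """Reflect: ask which unknown knowledge components the conversation has taught."""
    updated = dict(knowledge)
    for kc, known in knowledge.items():
        if known:
            continue  # already-known components stay known
        prompt = "Conversation:\n" + "\n".join(conversation) + f"\nWas this taught? {kc} Answer yes or no."
        updated[kc] = llm(prompt).strip().lower().startswith("yes")
    return updated


def respond(llm: LLM, conversation: list[str], knowledge: dict[str, bool], overview: str) -> str:
    """Respond: generate the student's next message from knowledge state and overview."""
    known = [kc for kc, ok in knowledge.items() if ok]
    prompt = (
        f"Student description: {overview}\n"
        f"The student only knows: {known}\n"
        "Conversation:\n" + "\n".join(conversation) + "\nStudent:"
    )
    return llm(prompt)


def simulate_turn(llm, profile, conversation, knowledge):
    """One simulated-student turn: Interpret, then Reflect, then Respond."""
    overview = interpret(llm, profile)
    knowledge = reflect(llm, conversation, knowledge)
    return respond(llm, conversation, knowledge, overview), knowledge
```

Separating the knowledge update from response generation keeps what the student knows explicit and inspectable between turns. In practice the trait overview would likely be generated once per profile and reused, rather than recomputed every turn as in this sketch.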

19 of 38

Research Questions

  1. Do simulated students behave as teachers expect?
  2. How does TeachTune help teachers review chatbots?

22 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

9 Student Profiles x 2 Pipeline Designs (BASELINE, OURS) = 18 Simulated Students

Profile attributes: Knowledge, Goal commitment, Motivation, Self-efficacy, Stress (levels: High, Medium, Low)

  • OURS: Simulated Student with a Trait Overview
  • BASELINE: Simulated Student without a Trait Overview

23 of 38

RQ1: Do Simulated Students Behave as Teachers Expect?

9 Student Profiles x 2 Pipeline Designs (BASELINE, OURS) = 18 Simulated Students

Profile attributes: Knowledge, Goal commitment, Motivation, Self-efficacy, Stress (levels: High, Medium, Low)

10 Teachers

Bias is the difference between actual and teacher-expected student profiles.
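To make the bias definition concrete, here is a minimal sketch that scores the gap between a configured level and the levels teachers guessed. The numeric encoding of levels and the normalization to a percentage of the scale range are assumptions for illustration; the slide does not state the exact formula.

```python
LEVELS = {"Low": 0, "Medium": 1, "High": 2}  # assumed numeric encoding


def bias_percent(actual: str, expected: list[str]) -> float:
    """Mean absolute gap between the configured level and teachers' guesses,
    as a percentage of the scale range (an assumed normalization)."""
    scale_range = max(LEVELS.values()) - min(LEVELS.values())
    gaps = [abs(LEVELS[guess] - LEVELS[actual]) for guess in expected]
    return 100 * sum(gaps) / (len(gaps) * scale_range)


# Hypothetical guesses from four teachers for a student configured as "High".
print(bias_percent("High", ["High", "High", "Medium", "Low"]))  # 37.5
```

A bias of 0% means every teacher read the simulated student exactly as configured.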

25 of 38

Knowledge Bias Was Small for Our Pipeline (Median: 5%).

RQ1: Do Simulated Students Behave as Teachers Expect?

Student Profile   1     2     3     4      5     6     7      8     9
Bias (%)          8.3   6.7   5.0   21.7   0.0   0.0   21.7   0.0   0.0

Contrasting knowledge levels and self-efficacies might have confused teachers (e.g., high knowledge with low self-efficacy; medium knowledge with high self-efficacy).
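The headline median can be checked directly from the nine reported values. The snippet below is only a verification sketch; the 20% threshold used to pick out the two high-bias profiles is chosen for illustration and is not from the study.

```python
from statistics import median

# Knowledge bias (%) for student profiles 1-9, as reported on the slide.
knowledge_bias = [8.3, 6.7, 5.0, 21.7, 0.0, 0.0, 21.7, 0.0, 0.0]

print(median(knowledge_bias))  # 5.0

# Profiles whose bias stands out from the rest (illustrative threshold).
outliers = [i + 1 for i, b in enumerate(knowledge_bias) if b > 20]
print(outliers)  # [4, 7]
```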

27 of 38

Trait Bias Was Small for Our Pipeline (Median: 10%).

RQ1: Do Simulated Students Behave as Teachers Expect?

Bias (%) by student profile

Trait             1      2      3      4      5      6     7      8      9
Goal Commitment   13.3   4.2    40.8   39.2   15.8   7.5   10.0   19.2   38.3
Motivation        20.8   8.3    5.0    13.3   6.7    5.0   25.8   26.7   20.0
Self-efficacy     18.3   29.2   3.3    19.2   10.8   5.0   30.0   17.5   8.3
Stress            10.8   6.7    3.3    8.3    10.0   7.5   10.8   32.5   10.0

Contrasting traits (e.g., low goal commitment & high motivation) might have confused teachers.
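For a per-trait view of the same numbers, the sketch below computes each trait's median bias and the profile where it peaks. The slide does not say how its headline median was aggregated, so this only summarizes each trait separately.

```python
from statistics import median

# Trait bias (%) for student profiles 1-9, as reported on the slide.
trait_bias = {
    "Goal Commitment": [13.3, 4.2, 40.8, 39.2, 15.8, 7.5, 10.0, 19.2, 38.3],
    "Motivation": [20.8, 8.3, 5.0, 13.3, 6.7, 5.0, 25.8, 26.7, 20.0],
    "Self-efficacy": [18.3, 29.2, 3.3, 19.2, 10.8, 5.0, 30.0, 17.5, 8.3],
    "Stress": [10.8, 6.7, 3.3, 8.3, 10.0, 7.5, 10.8, 32.5, 10.0],
}

# Median bias per trait across the nine profiles.
per_trait = {trait: median(values) for trait, values in trait_bias.items()}

# Profile number (1-based) with the largest bias for each trait.
worst_profile = {trait: values.index(max(values)) + 1 for trait, values in trait_bias.items()}

print(per_trait)
print(worst_profile)
```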

28 of 38

Simulation Was Believable for Our Pipeline (Median: 3.5/5).

RQ1: Do Simulated Students Behave as Teachers Expect?

  • Content:

This student naturally responds to the teacher’s questions or instructions.

  • Language:

This student uses language and speaking style that a real student would use.

  • Usefulness:

This student looks real and is useful as a chatbot for teacher training.

(1: Strongly disagree, 5: Strongly agree)

29 of 38

Trait Overview Makes Simulated Chat More Believable.

RQ1: Do Simulated Students Behave as Teachers Expect?

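[Charts comparing OURS and BASELINE on Profile 4; * marks flagged comparisons in the original charts]

  • Trait Bias (Profile 4): Mean Absolute Error for Goal Commitment, Motivation, Stress, and Self-Efficacy
  • Believability (Profile 4): 5-Point Likert Scale ratings for Content, Language, and Usefulness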

30 of 38

Trait Overview Makes Simulated Chat More Believable.

RQ1: Do Simulated Students Behave as Teachers Expect?

OURS

BASELINE

Simulated Student

Simulated Student

Trait

Overview

The size of the particles doesn't change. Isn't that interesting?

Average.

Then do you think you can learn well in science class?

I strongly disagree.

Okay! So are you preparing well for this exam?

I’m not preparing well for the exam. I’m stressed out and it’s hard.

Then do you think you can learn well in science class?

No, I don’t believe I can learn well in science class.

32 of 38

RQ2: How Does TeachTune Help Teachers Review Chatbots?

  • Participants
    • 30 K-12 science teachers who (1) had taught for 3.3±4.7 years and (2) had used ChatGPT and chatbots
  • Task
    • Designing a 1:1 tutor chatbot that helps students review phase transitions (60 min)
    • We instructed participants to consider diverse knowledge levels and student traits.
  • Between-subjects conditions
    • BASELINE: Direct chat & Test cases
    • AUTOCHAT: Direct chat & Test cases & Automated chat
    • KNOWLEDGE: Direct chat & Test cases & Automated chat (without student trait simulation) [1]

[1] Jin, Hyoungwook, et al. "Teach AI how to code: Using large language models as teachable agents for programming education." Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024.
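The three conditions differ only in which review features are switched on, which a small lookup table makes explicit. The dictionary layout and key names below are assumptions made for this sketch; the on/off values follow the condition list above, reading AUTOCHAT as including student trait simulation.

```python
# Review features available in each study condition, per the slide.
# Key names are hypothetical labels chosen for this example.
CONDITIONS = {
    "BASELINE": {"direct_chat": True, "test_cases": True, "automated_chat": False, "trait_simulation": False},
    "AUTOCHAT": {"direct_chat": True, "test_cases": True, "automated_chat": True, "trait_simulation": True},
    "KNOWLEDGE": {"direct_chat": True, "test_cases": True, "automated_chat": True, "trait_simulation": False},
}


def differing_features(a: str, b: str) -> list[str]:
    """Features that are enabled in one condition but not the other."""
    return [name for name in CONDITIONS[a] if CONDITIONS[a][name] != CONDITIONS[b][name]]


print(differing_features("AUTOCHAT", "KNOWLEDGE"))  # ['trait_simulation']
print(differing_features("BASELINE", "KNOWLEDGE"))  # ['automated_chat']
```

Read this way, comparing AUTOCHAT with KNOWLEDGE isolates the effect of trait simulation, while comparing BASELINE with KNOWLEDGE isolates automated chat itself.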

33 of 38

AUTOCHAT Resulted in a Lower Task Load.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

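[Chart: task load ratings on a 7-Point Likert Scale for AUTOCHAT, BASELINE, and KNOWLEDGE across Mental, Physical, Temporal, Effort, Performance, and Frustration; * marks flagged comparisons in the original chart]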

34 of 38

AUTOCHAT Participants Considered More Student Profiles.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

[Chart: Number of Levels Covered by AUTOCHAT, BASELINE, and KNOWLEDGE participants for Knowledge Level, Goal Commitment, Motivation, Self-efficacy, and Stress; * marks flagged comparisons in the original chart]
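One plausible reading of "Number of Levels Covered" is the count of distinct levels per attribute across the profiles a participant reviewed. The sketch below implements that reading; the function name, the profile format, and the sample data are hypothetical, and the study's exact measure may differ.

```python
def levels_covered(profiles: list[dict[str, str]]) -> dict[str, int]:
    """Count the distinct levels a participant's profiles span per attribute."""
    seen: dict[str, set[str]] = {}
    for profile in profiles:
        for attribute, level in profile.items():
            seen.setdefault(attribute, set()).add(level)
    return {attribute: len(levels) for attribute, levels in seen.items()}


# Hypothetical profiles one participant might have reviewed.
reviewed = [
    {"Knowledge Level": "High", "Motivation": "Low"},
    {"Knowledge Level": "Low", "Motivation": "Low"},
    {"Knowledge Level": "Medium", "Motivation": "High"},
]
print(levels_covered(reviewed))  # {'Knowledge Level': 3, 'Motivation': 2}
```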

35 of 38

Automated Chats Complement Existing Review Methods.

RQ2: How Does TeachTune Help Teachers Review Chatbots?

  • Direct Chats for early design
    • testing rough versions & developing pedagogical interactions
  • Automated Chats for coverage test
    • finding corner cases & managing multiple scenarios
  • Test Cases for specific debugging
    • repeating a specific scenario quickly

36 of 38

Discussions

  • Profile-oriented Design and Review Workflow
    • TeachTune helps teachers organize a systematic overview of target users.
    • Profile-oriented workflow can generalize to user-centric designs.

Structurally separate student profiles helped me recognize individual students, which would not be considered in direct chats.

37 of 38

Discussions

  • Profile-oriented Design and Review Workflow
    • TeachTune helps teachers organize a systematic overview of target users.
    • Profile-oriented workflow can generalize to user-centric designs.
  • Risks of Amplifying Teachers’ Stereotypes of Students
    • Views on stereotypes varied; some teachers used simulation to judge their own bias.
    • Feedback loops are needed to align expectations with reality.

Private tutors would have limited understanding of their students beyond lessons, making them rely on simulated behaviors.

38 of 38

TeachTune: Reviewing Pedagogical Agents Against Diverse Student Profiles with Simulated Students

Hyoungwook Jin KAIST

Minju Yoo Ewha Womans University

Jeongeon Park University of California San Diego

Yokyung Lee KAIST

Xu Wang University of Michigan

Juho Kim KAIST

Project website @ teachtune.kixlab.org