1 of 173

Designing, Evaluating, and

Learning from Human-AI Interactions

Sherry Tongshuang Wu

CMU, @tongshuangwu

Diyi Yang

Stanford, @diyi_yang

Sebastin Santy

UW, @sebastinsanty

Dec 6, 9am-12:30pm

Leo 3 & 4 (also hybrid)

2 of 173

Human-AI Interactions

2

Credits: Hyunwoo Kim (Human) + Bing (AI)

3 of 173

Human-AI Interaction: What is it?

3

Basically, a field where humans and AIs interact.

AI-based translations

AI-based grammar correction

Recommendation systems

4 of 173

Human-AI Interaction: What is it?

4

Basically, a field where humans and AIs interact.

Humans: AI researchers, model developers, domain experts, end users.

AIs: dialog system, translator, recommender system, autonomous driving system.

Interact:

Humans collaborate with AI,

Humans get assistance from AI,

Humans analyze AI,

AI helps human,

& many other forms

5 of 173

Human-AI Collaboration

5

The cooperative and coordinated interaction between humans (mostly non-AI experts) and AI to solve complex problems or achieve certain goals.

6 of 173

Humans get assistance from AI-infused apps

6

Humans are still mostly end users and domain experts. The big difference is AI is not a partner, but a tool (and part of “AI-infused applications”)

Because we want people to get smooth assistance from AI within the larger application context (e.g., Amazon's suggestion panel is only one section of the page), the boundary between the task and the AI model is blurred.

And because these models are wrapped in mature visual interfaces, people tend to have less tolerance when they get things wrong.

7 of 173

Humans get assistance from AI-infused apps

7

Humans are still mostly end users and domain experts.

The big difference is AI is not a partner, but a tool (and part of “AI-infused applications”)

8 of 173

Humans analyze Models

8

Experts systematically analyze AI models, and go beyond aggregated scores.

https://erroranalysis.ai/ Adaptive Testing and Debugging of NLP Models (Ribeiro & Lundberg, ACL 2022)

“Understanding the broader terrain of errors is an important starting point in pursuing systems that are robust, safe, and fair…[We need to] identify cohorts with higher error rates and diagnose the root causes behind these errors.”

Eric Horvitz / Microsoft, 2021

9 of 173

How do we figure out the “Interaction”?

9

Given a human and an AI…

Design

Why should they interact? How do we make it happen?

(self-)selected

Already exist & (we think) usable

Evaluate

Have we achieved what we want to achieve?

10 of 173

How do we figure out the “Interaction”?

10

Design

Evaluate

Given a human and an AI…

Learn from

Why should they interact? How do we make it happen?

(self-)selected

Already exist but needs improvement!

Have we achieved what we want to achieve?

How do we make AIs more usable?

11 of 173

How do we figure out the “Interaction”?

11

Design

Evaluate

Given a human and an AI…

Learn from

(self-)selected

Already exist but needs improvement!

We will focus on text – It’s “straightforward” but complicated at the same time, open-ended, and part of the multi-modal world. And it’s relevant!

Sherry Wu

Diyi Yang

Sebastin Santy

12 of 173

Schedule – Fruitful morning ahead

12

09:15-10:05  Design (40 mins lecture + 10 mins Q&A)

10:05-10:30  Evaluate (25 mins lecture)

10:30-10:50  Coffee break

10:50-11:15  Evaluate (cont'd) (15 mins lecture + 10 mins Q&A)

11:15-12:05  Learn from (40 mins lecture + 10 mins Q&A)

12:05-12:15  Conclusion

12:15-12:30  Extended Q&A

13 of 173

Follow along…

You will learn…

13

tinyurl.com/emnlp2023-tutorial-hai

NLP, from an HCI perspective

Awareness: Interaction as another layer on top of models & is human-centered

Systematic & up-to-date overview: the design choices around models

Contextualization: How principles are mapped to real-world cases

14 of 173

“Human-AI Interactions”

Designing

14

15 of 173

Norman Doors

Have you come across a door that you have tripped over, bumped into, or been confused by as to how it operates?

15

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Guess whether to push or pull

🔎 Can't locate a place to push or pull

Try to push, but the door actually slides

16 of 173

You are not alone

Bad design is everywhere

From doors, to everyday objects & machines designed by people

including “AI”

16

17 of 173

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including "AI"

17

Door 🚪

AI 🤖

What the user wants to do

"How do I get to the next room?"

"How do I solve my task?"

What the user ends up doing

How does a user learn "how to use it"?

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

18 of 173

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including "AI"

18

Door 🚪

AI 🤖

What the user wants to do

"How do I get to the next room?"

"How do I solve my task?"

What the user ends up doing

"How should I operate the door to get to the next room?"

"How should I prompt the model to get it to solve my task?"

How does a user learn "how to use it"?

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

19 of 173

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including "AI"

19

Door 🚪

AI 🤖

What the user wants to do

"How do I get to the next room?"

"How do I solve my task?"

What the user ends up doing

"How should I operate the door to get to the next room?"

"How should I prompt the model to get it to solve my task?"

How does a user learn "how to use it"?

- From previous encounter

- Read labels

- Take a guess and try

- From other people

- Read prompt guidelines

- Wing it

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

20 of 173

Design that disappears

“The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”

— Mark Weiser

They don't require an instruction manual. You use them once or twice, and you will barely notice them from the next interaction onwards. E.g., pointing devices, touchscreens, "literacy".

20

“The father of ubiquitous computing”

Weiser, Mark. "The Computer for the 21st Century." Scientific american 265.3 (1991): 94-105.

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

21 of 173

How do we achieve this?

By extending humans instead of extending technology, reducing the "friction of learning"

21

Object Metaphors

Action Metaphors

Human functioning

Computer functioning

Punch Cards

Command Line

Pointing Device

GUI Icons

Personal Computing

Extending human

Extending computer

22 of 173

How do we achieve this?

In AI, we’ve been only thinking about extending technology

22

Human functioning

AI functioning

Using Code

Artificial Intelligence

Prompting

Chatting

Extending human

Extending AI

23 of 173

Why does it happen?

23

Technology-centric design

A design process for situations when a class of technologies already exists, but when user domains for co-development are not clearly established.

User-centered design (UCD) is an iterative design process in which designers focus on the users and their needs in each phase of the design process.

User-centered design

“If you have a hammer, everything looks like a nail”

Solution in search of a problem

Bly, Sara, and Elizabeth F. Churchill. "Design through matchmaking: technology in search of users." interactions 6.2 (1999): 23-31.

24 of 173

Blog Post: The Biggest Bottleneck for LLM startups is UX

These issues really only surface once someone starts trying to use the product in the context of their daily workflow. This is how you go from “cool” to “useful.”

These challenges are always present, regardless of the system's accuracy (within some bounds). It doesn't matter if the LLM accuracy is 80% or 95%; the user still needs to reason through failure modes and understand what to expect when interacting with the system. You are better off getting to a baseline accuracy that is good enough and then building a product that allows a user to know how to work around the model.

"You are good at designing things we cannot build. We are good at making things that users don't use."

24

The biggest bottleneck for large language model startups is UX

Yang, Qian, et al. “Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI 2019.

25 of 173

New models → new AI interactions

25

ChatGPT

Chatting

Pushing language as the primary interface for every task

Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).

Anthropomorphic Tendencies

The act of projecting human-like qualities or behavior onto non-human entities, in this case, AI

26 of 173

New models → new AI interactions

26

ChatGPT

Language is flexible – No specification means there is no single way of instruction. Suitable for personal tasks.

Language is imprecise – Controlling for desired outputs can be difficult. Unsuitable for tasks that require precision or are critical.

Anthropomorphic tendencies – Urge to communicate like humans do.

Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).

27 of 173

Subtler Interactions (that disappear)

27

Voice Assistants

Email Autocomplete

Recommendations

Word suggestions

Subtitles

Code Completion

28 of 173

Design Thinking

28

29 of 173

What is design thinking?

As engineers, we immediately jump to finding solutions for a problem that we come across.

29

Problem

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

30 of 173

What is design thinking?

It is not about the exact number, but this high number puts emphasis on getting to the root cause of the problem, instead of looking at the surface level.

30

Problem

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Problem 1

Problem 2

Problem 1.1

Why?

Why?

Why?

Why?

Ask “Five whys”

31 of 173

Getting to the root cause of the problem

“Within a design context, framing is often seen as the key creative step that allows an original solution to be produced.

Designers report on the need to get to ‘the problem behind the problem’ (as initially presented by the client), and about creating a ‘fresh perspective.’ ”

— Bec Paton and Kees Dorst

“If I had asked people what they wanted, they would have said faster horses.”

— Henry Ford

31

"The inventor of the production car"

Dorst, Kees. "The core of 'design thinking' and its application." Design Studies (2011). https://www.sciencedirect.com/science/article/pii/S0142694X11000603

32 of 173

What is design thinking?

32

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Problem

Problem 1

Problem 2

Problem 1.1

Why?

Why?

Why?

Why?

33 of 173

How to incorporate design thinking?

The “Double Diamond” Method

33

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

First Diamond: Find the specific problem.

Second Diamond: Find the specific solution.

Why?

Why?

Find the problem

Find the solution

34 of 173

How to incorporate design thinking?

The “Double Diamond” Method

34

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Four Steps:

  • Discover Problem
  • Define Problem
  • Develop Solution
  • Deliver Solution

Find the problem

Find the solution

35 of 173

Discover Problem

Discover: Understand the issue rather than merely assuming it. It involves researching, speaking to and spending time with people who are affected by the issues.

Market Research – Stakeholder interviews, check raised tickets, traffic and sales analysis, competitive audits.

Field Study – Site visits, ethnography to observe people doing their own tasks in their own setting.

Interview and Surveys – To collect information on their reactions to existing products and conditions.

Environmental Factors – Understand the context and its needs.

35

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

36 of 173

Define Problem

Define: The insight gathered from the discovery phase can help to define the challenge in a different way.

Affinity Diagrams – To group and explore the structure of information.

Perspective Framing – Participatory design to develop a consensus view of the overall process.

Task & Information Analysis – Learning about relationships between tasks and information; creating logical groups from the users' point of view.

36

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

37 of 173

Develop Solution

Develop: Give different answers to the clearly defined problem, seeking inspiration from elsewhere and co-designing with a range of different people.

Rapid Prototyping – Physical realizations of the research and design process in a tangible form. Can be used to get a sense of what it would be like to experience the product/service. Goes from low fidelity (paper) to high fidelity (systems).

Storytelling – Construct situations where a specific user in a specific context would go about solving the problem with different solutions.

Minimum Viable Product

37

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

38 of 173

The “Double Diamond” Method

38

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Four Steps:

  • Discover Problem
  • Define Problem
  • Develop Solution
  • Deliver Solution

But this is not once and for all

39 of 173

Iterative Design

39

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013. DreamEndState.com

Find the problem

Find the solution

It is a spiral along the time axis

40 of 173

Example Task: Make writing faster for humans

40

Yang, Qian, et al. "Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI. 2019.

41 of 173

Example Task: Make writing faster for humans

41

Problem Space → Solution Space

"Typo errors slow me down" → Autocorrect word (Use: on phones, already typed words with spelling mistakes, too much time to go back)

"I know the word, takes time to type" → Next word suggestions (Use: on phones, cannot think of new words, obvious word but takes time to write)

"I know the phrase, takes time to type" → Next phrase suggestion (Use: on desktops, breeze through obvious phrases that otherwise take time to write)

"I have writer's block starting from scratch" → Word copilot (Use: on desktops, large chunks of text together; helps when one needs to start from scratch)

42 of 173

Develop Solution: Prototype

Wizard-of-Oz: Fake features so that the user thinks the responses are computer-driven when they are actually human-controlled. Challenge for NLP: AI errors are hard to simulate.

LM Prototypes / Scaffold: Simulate users that may use the system; use LMs to build prototypes.

Mimic simple functionality: Ensemble multiple tools, LMs, simple models and expectations. Challenge for NLP: cannot simulate SOTA model capabilities.

42

43 of 173

Develop Solution: Prototype Persona

Algorithmic persona: human roles that users assign to the algorithm to explain the algorithm’s goals, behaviors, and characteristics.

43

Wu, Eva Yiwei, Emily Pedersen, and Niloufar Salehi. "Agent, gatekeeper, drug dealer: How content creators craft algorithmic personas." CSCW 2019

44 of 173

Develop Solution: LM as a Prototyper / Scaffolding

44

Petridis, Savvas, Michael Terry, and Carrie Jun Cai. "PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Models." Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.

45 of 173

NLP & Interfaces

UI for an NLP system aka “A wrapper on top”�e.g., Autocomplete, Google Translate

  • Designing an interface to make use of the underlying NLP system for achieving user tasks.

45

GUI

Natural language UI aka “the wrapper itself”�e.g., Alexa, Google Search, GPT4 Web Search, AutoGPT

  • Using language as a medium to interact with tools, applications and other systems.

NLP System

Language

System

46 of 173

Interfaces vs. Interactions

46

Interfaces – Tangible, i.e., visual, auditory, tactile inputs to the system. Designing interfaces happens at the surface & physical level, often limited to styling.

Interactions – Understanding the context of the user, their needs and requirements, and how they operate. Includes psychological aspects such as trust, goals, and user behavior.

47 of 173

Beyond Interfaces

47

48 of 173

Aspects of Interaction

Cognition, Perception and extending humans: Offload cognitively loaded tasks, make space for humans to be creative.

Trust, Reliance and other user-machine behavior: Users feel comfortable using and depending on AI systems for achieving their tasks.

Fairness, Accountability, Transparency, Ethics: Ensure equitable treatment of all individuals, regardless of their race or gender. Do not perpetuate bias or discrimination, and make it possible to understand model decisions.

Personalization, Adaptation, Feedback and Guidance: Tailor AI interactions to individual user preferences and needs, and in turn also learn and improve over time to align with human preferences.

48

49 of 173

Cognition, Perception and extending humans

49

Explains what a system might be capable of. A metaphor communicates expectations of what can and cannot be done.

Visual Metaphors

Audio Metaphors

Conceptual Metaphors

Textual Metaphors

Web, Crawling, Load, Fetch

Email, Thread, Port, Address

Camera Shutter Sound

Phone Lock Sound

AI Metaphors

Stochastic Parrots

Intelligent Agent

(Object Metaphors)

50 of 173

Cognition, Perception and extending humans

50

Khadpe, Pranav, et al. "Conceptual metaphors impact perceptions of human-AI collaboration." Proceedings of the ACM on Human-Computer Interaction 4.CSCW2 (2020): 1-26.

Referring to AI with a specific name / metaphor has an effect on how it is perceived, and even how it is used.

Bots with metaphors of high warmth, but low competence were preferred overall

Effects of chatbot naming

51 of 173

Cognition, Perception and extending humans

51

Khadpe, Pranav, et al. "Conceptual metaphors impact perceptions of human-AI collaboration." Proceedings of the ACM on Human-Computer Interaction 4.CSCW2 (2020): 1-26.

Assimilation Theory – people adapt experiences to match expectations.

Contrast Theory – people are attuned to a difference between expectations and experiences.

Warmth follows Assimilation; Competence follows Contrast.

52 of 173

Cognition, Perception and extending humans

52

Manipulate objects like you do in the real world. Real-world metaphors for objects and actions can make it easier for a user to learn and use an interface.

Drag and Drop

Direct Manipulation

Resizing Elements

(Action Metaphors)

Personal Assistant

Interactive Stories

set reminder

get weather

send message

Setting own character actions or dialogue options, creating a personalized storyteller.

53 of 173

Trust and Reliance

53

Vasconcelos, Helena, et al. "Explanations can reduce overreliance on ai systems during decision-making." Proceedings of the ACM on Human-Computer Interaction 7.CSCW1 (2023): 1-38.

Trust: Trust refers to a belief or confidence in the integrity, reliability, and honesty of a person, organization, or thing.

Reliance: Involves depending on someone or something to perform a specific function or task, irrespective of whether trust is present.

Often, Trust → Reliance

Imperfect agents prone to making errors require a trusting relationship.

54 of 173

Trust and Reliance

54

Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21.

> Showing Explanations and giving people agency

> Showing Uncertainty

> On demand

> Wait

55 of 173

Fairness

Accountability

55

Liebling, Daniel J., et al. "Unmet needs and opportunities for mobile translation AI." Proceedings of the 2020 CHI conference on human factors in computing systems. 2020.

What is the real-life cost of mistranslation?

How well does it work for diverse populations?

Sun, Jiao, et al. "Pretty princess vs. successful leader: Gender roles in greeting card messages." Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 2022.

56 of 173

Transparency

Ethics

56

Flathmann, Christopher, et al. "Modeling and guiding the creation of ethical human-AI teams." Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 2021.

How to open the black-box and understand model decisions?

Cabrera, Ángel Alexander, et al. "Zeno: An interactive framework for behavioral evaluation of machine learning." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.

57 of 173

Interaction Initiative

57

Mixed initiative systems allow users to interact with them in a collaborative way, where the user and the system both take an active role in carrying out tasks or making decisions.

Advocates elegant coupling of automated services with direct manipulation.

“Autonomous actions should be taken only when an agent believes that they will have greater expected value than inaction for the user.”

58 of 173

Reflection

Design should be seamless, without the need for an instruction manual.

How? By extending humans and their natural ways of interacting with the real world.

What is the process? By using design thinking frameworks: emphasis on finding the root cause of the problem, finding solutions, and iterating back and forth until we build applications that users want to use.

Are there other human factors to consider? Beyond the visible elements of the interfaces and the psychological and cognitive aspects, it is important to be aware of the underlying trust in and reliance on the system, and its implications in the real world.

58

59 of 173

“Human-AI Interactions”

Evaluating

59

60 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

60

61 of 173

Goals: HALIE – Consider Humans & Interactions

61

Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023

Traditional evaluation: on the outputs of the models themselves.

Criteria:

Output quality

Perspective:

third party

Target:

Output

62 of 173

Goals: HALIE – Consider Humans & Interactions

62

Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023

Interaction centric evaluation: On interactions between humans and models.

Criteria:

Output quality, human preference

Perspective:

third party, first-person experience

Target:

Output, process

63 of 173

Consider a case study: Human-LM Co-Writing

63

Lee, Mina, Percy Liang, and Qian Yang. "CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities." CHI 2022

64 of 173

Consider a case study: Human-LM Co-Writing

64

Criteria:

Quality: perplexity, etc.

Preference: Which option(s)?

Perspective:

third party…

First-person: Everyone has their own choices; thousands of divergent continuations from the same starting point

Target:

Output: Final story

Process: Snapshot of how the article is constructed

65 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to application.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

65

66 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

66

67 of 173

Going back to our Co-writing case…

67

Goal – Understand LLM capabilities in supporting human writing

➡️ Task – Vary types of writing (Creative & Argumentative), prompts, and model randomness

➡️ Data – What’s collected?

Human preferences, final articles, etc.

68 of 173

Going back to our Co-writing case…

68

Most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

<Query>

<Move>

<Move>

<Query>

<Query>

<Query>

<Accept>

He loved to play tricks on people and make them love.

<Accept>

there was a great mage who lived in a tower.

<Edit>

funny

Once upon a time,

<Write>

69 of 173

Going back to our Co-writing case…

69

Most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

Events and clickstreams

States of everything at every moment
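
To make the idea of an interaction trace concrete, here is a minimal sketch of how such events might be logged as clickstream records. The field names (event_type, payload, doc_state) and the helper function are illustrative assumptions, not the CoAuthor schema.

```python
# A hypothetical event logger for an interaction trace; not the CoAuthor format.
import json, time

def log_event(trace, event_type, payload, doc_state):
    trace.append({
        "timestamp": time.time(),
        "event_type": event_type,   # e.g. "query", "accept", "edit", "move", "write"
        "payload": payload,         # e.g. the suggestion shown or the text typed
        "doc_state": doc_state,     # snapshot of the document at this moment
    })

trace = []
log_event(trace, "query", {"n_suggestions": 5}, "Once upon a time,")
log_event(trace, "accept", {"suggestion": "there was a great mage who lived in a tower."},
          "Once upon a time, there was a great mage who lived in a tower.")
print(json.dumps(trace, indent=2))
```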

70 of 173

Going back to our Co-writing case…

70

Goal – Understand LLM capabilities in supporting human writing

➡️ Task – Vary types of writing (Creative & Argumentative), prompts, and model randomness

➡️ Data – What’s collected?

  • 1445 sessions between 63 users and GPT-3
  • Types of writing:
    1. Creative writing: 830 stories written by 58 writers
    2. Argumentative writing: 615 essays written by 49 writers
  • Stories and essays: 418 words long
  • Number of queries: 11.8 queries per writing session
  • Acceptance rate of suggestions: 72.3%
  • Percentage of text written by humans: 72.6%

71 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

71

72 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

72

73 of 173

Going back to our Co-writing case (again)

73

But most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

Things you can already see…

Multiple queries in a row mean the first several suggestions were bad;

Edits reveal the cost-efficiency threshold people have in mind, between "useful enough to edit" vs. "let me start from scratch"

<Query>

<Move>

<Move>

<Query>

<Query>

<Query>

<Accept>

He loved to play tricks on people and make them love.

<Accept>

there was a great mage who lived in a tower.

<Edit>

funny

Once upon a time,

<Write>

74 of 173

Existing metrics can help quantify effects

74

Some analyses already done by the authors!

Q: Can GPT-3 generate fluent text in response to user text?

A: Text written by user + GPT-3 had fewest errors and most diverse vocabulary

75 of 173

Define metrics for what needs to be measured

75

Some analyses already done by the authors!

Q: Can GPT-3 contribute new ideas to users’ stories?

A: Ideas generated and reused by users in the subsequent writing.

"Reused named entity" is a new (lower-bound) metric defined for ideation quantification!

76 of 173

How the dataset can be further used

76

Writers' behaviors and writing outcomes over time

Can we observe novelty effect and longitudinal change?

Do LMs homogenize writing by providing similar suggestions to all users?

Linguistic accommodation

How does the style, voice, or tone of a writer or LM influence that of the other over time? Is the influence uni-directional or bi-directional?

Edit traces

What can we learn from human edits on LM outputs?

Can we train LMs on edit traces to emulate human edits?

77 of 173

Metrics that exist for Human-LM Co-Writing

77

Shen, Hua, and Tongshuang Wu. "Parachute: Evaluating interactive human-lm co-writing systems." In2Writing 2023.

78 of 173

Metrics that exist for Human-LM Co-Writing

78

79 of 173

Metrics that exist for Human-LM Co-Writing

79

80 of 173

Metrics that exist for Human-LM Co-Writing

80

81 of 173

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

Comprehensive metrics, cover different aspects, don’t reinvent the wheel!

81

82 of 173

Evaluation – From mindset to method

82

Mixed method evaluation

Think aloud analysis

And, of course, one more case study

83 of 173

Important method: “Think Aloud”

A research method used to gain insight into a person's thought processes as they perform a task or solve a problem. The participant is asked to verbalize their thoughts as they perform the task, which allows the researcher to understand how the participant approaches the task. "Thinking aloud may be the single most valuable usability engineering method."

“I’m going to ask you to ____ and while you are doing that, can you tell me whatever you are thinking. Whatever comes into your mind while you are working on that. Okay?”

Protocol

Give participants specific tasks to accomplish (but not HOW to do it)

Have them speak aloud as they complete the tasks

Keep interruptions to a minimum

Ask open-ended questions & for clarification after the task is complete

Learning effect – if you use multiple tasks, watch out for biasing the test due to task order

Typically used to test the usability of a website, app or object

83

84 of 173

Important method: Controlled exp. + Mixed method

Mixed methods research combines elements of quantitative research and qualitative research in order to answer research questions. It can give a more complete picture than a standalone quantitative or qualitative study, as it integrates the benefits of both methods.

84

Quantitative method

Understand the “what”. Precise!

  • Did users complete task (yes/no)?
  • How long did it take?
  • How many clicks?

Qualitative method

Understand the “why”. Open-ended!

  • What did you like best about the experience?
  • Why were you frustrated by the model output?

85 of 173

Quantitative vs. Qualitative

85

Definition – Quantitative: gather numerical data to be analyzed using statistical methods. Qualitative: gather descriptive, non-numerical data to be analyzed through interpretation and contextualization.

Data source – Quantitative: surveys, questionnaires, experiments. Qualitative: interviews, observations, and document analysis.

Presentation – Quantitative: tables, graphs, and statistics. Qualitative: quotes and narratives that reflect the participants' experiences and perspectives.

Goal – Quantitative: establish cause-and-effect relationships between variables. Qualitative: gain a deeper understanding of social phenomena, meanings, and processes.

86 of 173

Case Study: Interactive Machine Translation

86

“We present Predictive Translation Memory, an interactive, mixed-initiative system for human language translation. Translators build translations incrementally by considering machine suggestions that update according to the user’s current partial translation.”

Green, Spence, et al. "Predictive translation memory: A mixed-initiative system for human language translation." UIST 2014

87 of 173

87

88 of 173

PTM: Experimental Design

88

Comparative analysis

"We compared our system to post-editing, which is a strong baseline [29, 21], and is also the most common commercial use of MT."

Clear research questions:

Time – PTM faster than post-edit?

Quality – PTM == better translation?

Usage – subjects use interactive aids?

Task: translate French→English or English→German

Source Text: ≈3,000 tokens of News/Medical/Software

Conditions: post-edit (pe) and PTM

Participants: 16 expert subjects per language pair

89 of 173

RQ1: Time – PTM faster than post-edit?

89

Quantitative analysis (find robust evidence)

Metric: log of time (more tolerant of outliers)

Compare mean (for general understanding)

Linear mixed effects models (for understanding significance, important factors)

The key independent variable: translation condition

Learning effect – people get quicker as the task proceeds

More edits mean longer time

Initial translation quality – how much editing is necessary

Unbalanced participant pool

Longer source sentences take longer to edit

Potential interaction between independent variables

+ random intercepts/slopes for subject, source, text genre.
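
As a concrete illustration of this kind of analysis, here is a minimal sketch of fitting a linear mixed-effects model in Python with statsmodels. The data file and column names (log_time, condition, order, src_len, subject) are hypothetical stand-ins for the factors listed above, not the paper's actual data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-segment data: log time, condition (PTM vs. post-edit),
# trial order, source length, and the subject who produced the translation.
df = pd.read_csv("ptm_sessions.csv")

model = smf.mixedlm(
    "log_time ~ condition + order + src_len",   # fixed effects
    data=df,
    groups=df["subject"],                       # random intercept per subject
    re_formula="~order",                        # random slope for the learning effect
)
result = model.fit()
print(result.summary())  # inspect the coefficient and significance of `condition`
```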

90 of 173

RQ1: Time – PTM faster than post-edit?

90

Qualitative analysis (find reasons behind quantitative analysis — codebook!)

Metric: log of time (more tolerant of outliers)

Think aloud (record users’ comments) — Interactive mode takes more time because…

There are more aids to operate and more information to read and analyze:

“Because you spend more time on each word, you have opportunity to see alternative translations.”

MT quality greatly affected the usefulness of the interactive aids:

“If drop-down suggestions are not of a good quality, reading may consume extra time.”

The post-edit mode was easier at first, but interaction is better in the long run.

“I am used to this [post-edit], this is how Trados [the preeminent CAT tool] works.”

Likert Scale survey (can still quantitatively compare users’ subjective judgements): “In which interface did you feel most productive?”

“I would use interactive translation features if they were integrated into a CAT product”

“I got better at using the interactive interface with practice/experience”

91 of 173

RQ1: Time – PTM faster than post-edit?

91

Qualitative and quantitative results usually provide some grounding for each other.

“Post-edit mode was easier at first, but the interactive mode was better once I got used to it. “

"If I had time to use the interactive tool and grow accustomed to its way of functioning, it would be quite useful…"

92 of 173

RQ2: Quality – PTM == better translation?

92

Quantitative analysis (find robust evidence)

Metric: BLEU (automatic eval, has issues, but easier to run)

BLEU: a measure of similarity with the gold reference.

HBLEU: measure of similarity with the initial MT suggestions.

“PTM exposes translators to many more alternatives, encouraging them to deviate further from the initial MT suggestion (lower HBLEU).”

Compare mean

(Also vs. original generated text)
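
A minimal sketch of computing the two similarity scores described above with sacrebleu; the example strings are made up, and "HBLEU" here is simply BLEU computed against the initial MT suggestions rather than the gold references.

```python
import sacrebleu

final_translations = ["the committee approved the new rules ."]   # translator output (made up)
gold_references    = ["the committee approved the new rules ."]   # independent references
initial_mt_outputs = ["the committee has approved new rules ."]   # the system's initial suggestions

# BLEU: similarity of the final translations to the gold references.
bleu = sacrebleu.corpus_bleu(final_translations, [gold_references])

# "HBLEU": the same computation against the initial MT suggestions;
# lower values mean translators deviated further from the machine output.
hbleu = sacrebleu.corpus_bleu(final_translations, [initial_mt_outputs])

print(f"BLEU = {bleu.score:.1f}   HBLEU = {hbleu.score:.1f}")
```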

93 of 173

RQ2: Quality – PTM == better translation?

93

Quantitative analysis (find robust evidence)

Metric: Human subject rating

Auto methods are sensitive and noisy, so usually paired with human judgements as well

94 of 173

RQ2: Quality – PTM == better translation?

94

Qualitative analysis (find reasons behind quantitative analysis)

Metric: BLEU & rating (automatic eval, has issues, but easier to run)

When users wanted to render more stylistic translations, PTM was less useful:

“...choosing a very different translation approach (choice of words, idioms with no equivalent in English...) would be like going against the current—but may have provided a better quality.”

“the translator is less susceptible to be creative.”

Why do many participants prefer post-edit?

“I found the machine translations (texts in gray) were of a much better quality than texts generated by Google Translate”

"The translations generally did not need too much editing, which is not always the case with machine translations.”

95 of 173

RQ3: Usage – subjects use interactive aids?

95

Quantitative analysis on 1.1 million UI events

Metric: UI clickstream events (faithful reflection of user pattern)

Usage: 52% more text entered via aids than TransType

Clickstreams are also more objective and comparable with prior work.

In the TransType system, the authors commented that their users often “[accepted] predictions in [their] entirety and then edited to ensure its correctness” and reported that 52% of target characters were typed [36]. In the “prediction+options” experiment conducted by Koehn et al. [29], the authors reported that 36% of the final translations were typed, 36% entered via a mouse click, and 27% entered via the tab key to accept machine translations. When working in our PTM system, users directly utilized machine translations to a greater degree than previously reported.

96 of 173

High-level takeaways

Mindset in evaluation: Human and interaction come first.

Desiderata: Goals > Tasks > Data > Analysis method

Common data: Clickstreams, interviews, etc.

Common evaluation method: Mixed method, think aloud

Common metrics: Find them, or invent them :)

96

97 of 173

Reflection: Where do we find interactions to evaluate?

97

Don't reinvent the wheel – task delegation has always been a thing in Human Computation!

LLM chaining: Break down complex tasks into pieces that can be done independently by LLMs, then combined.

Crowdsourcing pipeline: Break down complex tasks into pieces that can be done independently by humans, then combined.

Compare human vs. LLM, and LLM vs. LLM, performance on sub-tasks in the same overall context, by replicating crowdsourcing pipelines with LLM chains!

98 of 173

Reflection: Where do we find interactions to evaluate?

98

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Input

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Fix

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Shorten the verbose parts and put back in context

Verify

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Fix grammar issues

Find

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

Verbose parts

99 of 173

Reflection: Where do we find interactions to evaluate?

99

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Fix

Find

Verify

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Input

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Verbose parts

Shorten the verbose parts and put back in context

Fix grammar issues

Identify at least one area that can be shortened without changing the meaning of the paragraph.

Edit the highlighted section to shorten its length without changing the meaning of the paragraph.

Choose at least one rewrite that has significant style errors in it. Choose at least one rewrite that significantly changes the meaning of the sentence.

Bernstein et al. (2010): Find segments for shortening, Fix by shortening them, Verify by identifying rewrites with errors.

Find

Fix

Verify

100 of 173

Reflection: Where do we find interactions to evaluate?

100

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits


Find three segments that can be shortened from the following text. These segments need to be present in the text.

Text: {input text}

Segments: 1. {output segment}

Shorten the following text without changing its meaning.

Text: {segment}

Shortened text: {shortened segment}

P7: Find segments to shorten, Fix these phrases by shortening them, Verify by fixing grammatical errors.

Correct the grammar of the following text.

Text: {fixed text}

Corrected text: {grammatical text}
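
A minimal sketch of chaining the three prompts above into a Find-Fix-Verify pipeline. The call_llm function is a placeholder for whatever completion API is used; the prompt wording follows the slide, and everything else (parsing, segment matching) is illustrative.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def find(text: str) -> list[str]:
    prompt = (
        "Find three segments that can be shortened from the following text. "
        "These segments need to be present in the text.\n"
        f"Text: {text}\nSegments: 1."
    )
    out = call_llm(prompt)
    return [s.strip() for s in out.split("\n") if s.strip()]  # naive parsing of the list

def fix(segment: str) -> str:
    prompt = f"Shorten the following text without changing its meaning.\nText: {segment}\nShortened text:"
    return call_llm(prompt).strip()

def verify(text: str) -> str:
    prompt = f"Correct the grammar of the following text.\nText: {text}\nCorrected text:"
    return call_llm(prompt).strip()

def find_fix_verify(text: str) -> str:
    for segment in find(text):
        if segment in text:                       # keep only segments actually present
            text = text.replace(segment, fix(segment))
    return verify(text)
```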

101 of 173

Replication experiment…

101

Course assignment for Human-Centered NLP @ CMU

20 undergraduate / graduate students

Replicate 7 crowdsourcing papers (that cover diverse pipeline designs and tasks) with LLM chains

Self-reflection on replication effectiveness (vs. single LLM baseline) and possible improvements

Peer grading on replication correctness, thoroughness, and comprehensiveness

Wu, Tongshuang, et al. "Llms as workers in human-computational algorithms? replicating crowdsourcing pipelines with llms." ArXiv 2023

102 of 173

Interesting evaluation output & design implication

102

LLMs (with only textual instructions) need proper task formulation: e.g., change to a multi-choice question, prompts with templates, etc.

Find three segments that can be shortened from the following text. These segments need to be present in the text.

Text: {input text}

Segments: 1. {output segment}

Humans get more scaffolding constraints: e.g. use mouse selection to extract precise segments from the original document.

Identify at least one area that can be shortened without changing the meaning of the paragraph.

Fix grammar issues


103 of 173

Reflection: How do we build the pipeline?

103

Look into existing interactions!

Example: Human-Human pair programming, vs. human-AI pair programming

Wu, Tongshuang, and Kenneth Koedinger. "Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023

104 of 173

Reflection: How do we build the pipeline?

104

Human-human – Variance in metrics!

Time and accomplishment? twice the duration, the person-hours required

Human-AI – Too simplified metrics?

E.g. the number of lines of added code – the nature of interaction with Copilot (tab to accept suggestions) is a big factor!

105 of 173

Reflection: How do we build the pipeline?

105

More comparisons with human-human pair programming?

Get inspiration on metrics, e.g., defect density, perceptual effort measure, readability, functionability, the number of test cases passed, code complexity, scores, expert opinions, etc.?

106 of 173

Reflection: How do we build the pipeline?

106

More factors to consider – Evaluate on which population (female students, teachers, etc.)?

107 of 173

Reflection: How do we build the pipeline?

107

Wu, Tongshuang, and Kenneth Koedinger. "Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023

108 of 173

“Human-AI Interactions”

Learning

108

109 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

109

110 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

110

111 of 173

Users Interaction with LLMs

111

112 of 173

112

113 of 173

Interaction: Different Types of Human Feedback (1)

Labeled data points

Edit data points

Change data weights

Binary/scaled user feedback

Natural language feedback

Code language feedback

113

114 of 173

Interaction: Different Types of Human Feedback (2)

Define, add, remove feature spaces

Directly change the objective function

Directly change the model parameter

114

115 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

115

116 of 173

Learning from Interactions and Feedback

Transform nontechnical human “preferences” into usable model “language”

  • Allow humans to easily provide feedback
  • Build models to effectively take the feedback

116

Chen, Valerie, et al. "Perspectives on incorporating expert feedback into model updates." Patterns 2023

117 of 173

Tradeoff: Human-friendly vs. Model-friendly

Models need feedback that “they can respond to”

Humans prefer easier-to-provide feedback

Non-experts:

  • natural language feedback > labeling > model manipulation

117

118 of 173

Human Interaction and Text Classification

118

Godbole, Shantanu, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. "Document classification through interactive supervision of document and term labels." In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 185-196. Springer, Berlin, Heidelberg, 2004.

119 of 173

Human Interaction and Parsing

119

120 of 173

Human Interaction and Topic Modeling

120

Interactive Topic Modeling: start with a vanilla LDA with a symmetric prior and get the initial topics. Then repeat the following process until users are satisfied: show users the topics, get feedback from users, encode the feedback into a tree prior, and update the topics with tree-based LDA.

121 of 173

121

122 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

122

123 of 173

Incorporating Human Feedback: Taxonomy

Dataset updates: change the dataset

Loss function updates: add a constraint to the objective

Parameter space updates: change the model parameters

123

124 of 173

Learning from Interaction: Datasets Updates

Data augmentation

Weak supervision

Active learning

Model-assisted adversarial labeling

124

125 of 173

Datasets Updates: Weak Supervision

Try using noisy sources of signal, specified at higher levels of abstraction, to rapidly generate training sets.

125

Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." VLDB 2017.
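
A minimal sketch of weak supervision with labeling functions in the style of Snorkel; the spam-detection task, the keyword heuristics, and the tiny DataFrame are made-up examples.

```python
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel
import pandas as pd

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

# Toy unlabeled data, just for illustration.
df = pd.DataFrame({"text": ["check out http://spam.example", "thanks, see you tomorrow"]})

applier = PandasLFApplier([lf_contains_link, lf_short_message])
L_train = applier.apply(df)                 # noisy label matrix (one column per LF)

label_model = LabelModel(cardinality=2)     # combine the noisy votes
label_model.fit(L_train)
probs = label_model.predict_proba(L_train)  # probabilistic training labels
```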

126 of 173

Datasets Updates: Weak Supervision

126

Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." VLDB 2017.

127 of 173

Datasets Updates: Active Learning to update data

Proactively select which data points we want to use to learn from, rather than passively accepting all data points available.

127
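
A minimal sketch of pool-based active learning with least-confidence (uncertainty) sampling; the classifier is assumed to be anything exposing predict_proba, and the pool and batch size are illustrative.

```python
import numpy as np

def select_most_uncertain(model, X_pool, batch_size=10):
    """Pick unlabeled points whose top predicted class probability is lowest."""
    probs = model.predict_proba(X_pool)
    uncertainty = 1.0 - probs.max(axis=1)         # least-confident sampling
    return np.argsort(-uncertainty)[:batch_size]  # indices to send to annotators

# Typical loop: train on the labeled seed set, ask humans to label the most
# uncertain pool points, add them to the training set, and repeat.
```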

128 of 173

Datasets Updates: Data Augmentation

128

  • Token-level augmentation (a minimal sketch follows this list):
    • Synonym replacement (Yang et al. 2015, Zhang et al. 2015, Miao et al. 2020)
    • Random insertion, deletion, swapping (Xie et al. 2019, Wei and Zou 2019)
    • Word replacement via LM (Wu et al. 2019, Zhu et al. 2019)
  • Sentence-level augmentation:
    • Paraphrasing (Xie et al. 2019, Chen et al. 2020)
    • Conditional generation (Zhang and Bansal 2019, Yang et al. 2020)
  • Adversarial augmentation:
    • Whitebox methods (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2019; Chen et al., 2020d)
    • Blackbox methods (Ren et al. 2019; Garg and Ramakrishnan, 2020)
  • Hidden space augmentation:
    • Mixup (Zhang et al., 2019, Chen et al. 2020)
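
A minimal sketch of two of the token-level augmentations listed above (random swapping and random deletion, in the spirit of EDA by Wei and Zou 2019); no synonym lexicon or language model is used here, and the example sentence is made up.

```python
import random

def random_swap(tokens, n_swaps=1):
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]   # never return an empty sentence

sentence = "print publishers hope to charge for digital editions".split()
augmented = [" ".join(random_swap(sentence)), " ".join(random_deletion(sentence))]
```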

129 of 173

Datasets Updates: mixup for text data

129

130 of 173

Datasets Updates: mixup for text data

130

Chen, Jiaao, Zichao Yang, and Diyi Yang. "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification." ACL 2020.
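
A minimal sketch of the underlying mixup idea for text: interpolate the hidden representations and label distributions of two training examples. The tensors and the Beta hyperparameter are stand-ins; this is not the MixText implementation.

```python
import torch

def mixup_hidden(h_i, h_j, y_i, y_j, alpha=0.4):
    """Interpolate encoder hidden states and (soft) labels of two examples."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h_mix = lam * h_i + (1 - lam) * h_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return h_mix, y_mix

# h_i, h_j: hidden states of two training sentences (same shape);
# y_i, y_j: their label distributions. Train the classifier head on (h_mix, y_mix).
```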

131 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

131

132 of 173

Learning from Interaction: Loss function updates

Unlikelihood learning

Add regularization to specific model behavior

Infer constraints from expert feedback

132

133 of 173

Loss Function Updates: Unlikelihood Learning

Penalize undesirable generations (e.g. not following control, repeating previous context)

133

If C is previously seen text, then less repetition and more diversity

Welleck, Sean, et al. "Neural text generation with unlikelihood training." ICLR (2019).
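
A minimal sketch of a token-level unlikelihood term in the spirit of Welleck et al. (2019): on top of the usual likelihood loss, push down the probability of negative-candidate tokens C (e.g., tokens from the previously seen context). Tensor shapes and the weighting are illustrative.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, targets, neg_candidates, alpha=1.0):
    """
    logits:         (batch, seq, vocab) model outputs
    targets:        (batch, seq) gold next tokens
    neg_candidates: (batch, seq, vocab) 1.0 where a token is in C at that step
    """
    log_probs = F.log_softmax(logits, dim=-1)
    mle = F.nll_loss(log_probs.transpose(1, 2), targets)   # standard likelihood term
    probs = log_probs.exp().clamp(max=1 - 1e-6)
    # Sum of -log(1 - p(token)) over every negative candidate token.
    ul = -(torch.log1p(-probs) * neg_candidates).sum(dim=-1).mean()
    return mle + alpha * ul
```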

134 of 173

Loss Function Updates: Infer Constraints from Expert Feedback

Use counterfactual or contrasting examples to improve generalization via an auxiliary training objective

134

Teney, Damien, Ehsan Abbasnedjad, and Anton van den Hengel. "Learning what makes a difference from counterfactual examples and gradient supervision." Computer Vision–ECCV 2020:

135 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human feedback
  • Limitations of human feedback

135

136 of 173

Learning from Interaction: Parameter updates

Model editing

Concept bottleneck model

Parameter efficient fine-tuning (adapter, prefix)

Reinforcement learning from human feedback

Learning from “diff” or corrections

136

137 of 173

Model Editing uses a single desired input-output pair to make fast, local edits to a pre-trained model

137

Mitchell, Eric, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. "Fast model editing at scale." arXiv preprint arXiv:2110.11309 (2021).

Transform the gradient obtained by SFT using a low-rank decomposition of the gradient to make the parameterization of this transformation tractable.
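
A very rough sketch of the flavor of such an edit: compute the fine-tuning gradient for a single desired input-output pair, keep only a low-rank approximation of it, and apply a small local update. This only illustrates the low-rank idea; it is not the actual MEND algorithm, which learns a network to transform the gradient.

```python
# Illustrative only: a rank-truncated gradient step, not the MEND editor.
import torch

def low_rank_edit(weight, grad, rank=1, lr=1e-3):
    # Keep only the top-`rank` components of the gradient via truncated SVD.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    grad_lr = (U[:, :rank] * S[:rank]) @ Vh[:rank, :]
    return weight - lr * grad_lr   # small, local parameter update
```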

138 of 173

Parameter updates: Concept Bottleneck Model trains the model to explicitly use human-provided concepts

138

Koh, Pang Wei, et al. "Concept bottleneck models." International Conference on Machine Learning. PMLR, 2020.
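
A minimal sketch of the concept bottleneck structure: the model first predicts human-provided concepts, then predicts the label only from those concepts, so a human can inspect or intervene on the intermediate predictions. Dimensions and modules are illustrative.

```python
import torch
import torch.nn as nn

class ConceptBottleneckModel(nn.Module):
    def __init__(self, input_dim=128, n_concepts=10, n_classes=3):
        super().__init__()
        self.concept_predictor = nn.Linear(input_dim, n_concepts)  # x -> concepts
        self.label_predictor = nn.Linear(n_concepts, n_classes)    # concepts -> y

    def forward(self, x, concept_intervention=None):
        concepts = torch.sigmoid(self.concept_predictor(x))
        if concept_intervention is not None:       # a human can overwrite the concepts
            concepts = concept_intervention
        return self.label_predictor(concepts), concepts
```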

139 of 173

Parameter updates: Parameter Efficient Fine-tuning uses small amounts of interaction data to steer models toward desired behaviors

139
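
A minimal sketch of one common form of parameter-efficient fine-tuning, an adapter layer: the base model stays frozen and only the small bottleneck layers are trained on the interaction data. Dimensions are illustrative.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection: the adapter only learns a small correction.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```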

140 of 173

Parameter updates: Parameter Efficient Fine-tuning uses small amounts of interaction data to steer models toward desired behaviors

140

Unlearn What You Want to Forget: Efficient Unlearning for LLMs. Jiaao Chen, Diyi Yang. EMNLP 2023

141 of 173

Incorporating Human Feedback: Taxonomy

Dataset updates: change the dataset

Loss function updates: add a constraint to the objective

Parameter space updates: change the model parameters

141

142 of 173

Learning from Interactions Outline

  • Different type of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

142

143 of 173

Learning from “Bad” Feedback

Preparation:

  • Step 1: Train helper models
    • Satisfaction classifier (❓ → 🍋 or 🍒)
    • Reply corrector (🍋 → 🍒)
  • Step 2: Infer the missing ❓ labels
  • Step 3: Convert 🍋 → 🍒
  • Step 4: Collect the fruits and re-train 🤖

Ingredients: 20% binary 👍/👎 labels, 20% gold corrections, 100% free-form textual feedback

(A minimal code sketch of this recipe follows below.)
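Hedged pseudocode of the recipe above; every helper here (train_classifier, train_corrector, retrain, dialogue.bot_turns) is a hypothetical stand-in, not the original implementation.

def improve_bot_from_deployment(dialogues, binary_labels, gold_corrections):
    # Step 1: train helper models from the small labeled subsets.
    satisfaction_clf = train_classifier(binary_labels)   # reply -> cherry / lemon
    reply_corrector = train_corrector(gold_corrections)  # lemon reply -> corrected reply

    training_replies = []
    for dialogue in dialogues:
        for context, bot_reply in dialogue.bot_turns():
            # Step 2: infer the missing satisfaction label.
            if satisfaction_clf(context, bot_reply):      # predicted cherry
                training_replies.append((context, bot_reply))
            else:                                          # predicted lemon
                # Step 3: convert the lemon into a cherry.
                training_replies.append((context, reply_corrector(context, bot_reply)))

    # Step 4: collect the fruits and re-train the supervised dialogue model.
    return retrain(training_replies)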

144 of 173

Re-train a supervised dialogue model

[Figure: the four-step pipeline applied to an example conversation]

Example dialogue:
  Bot: Do you like any other fruits besides plums?
  Human: I also like watermelons.
  Bot: I like watermelons too! Have you heard of Harry Styles?
  Human: The bot changes the topic suddenly. (Correction: "I like watermelons too! They are refreshing.")
  Bot: I like watermelons too! They are refreshing.
  Human: I know! Any recommended recipe for watermelon slush?
  Bot: I am not sure. Can you help me understand Greek Mythology?
  Human: I’d like to talk about fruits.

1) Train models: a satisfaction classifier (positive → 🍒, negative → 🍋) and a reply corrector.
2) Infer ❓ labels: "I like watermelons too! They are refreshing." → 🍒; "I am not sure. Can you help me understand Greek Mythology?" → 🍋.
3) Convert 🍋 to 🍒 with the corrector: the Harry Styles reply becomes "I like them too! Do you like fruit juice?", and the Greek Mythology reply becomes "Sure. Here is a recipe for watermelon slush!"
4) Collect the 🍒 replies ("Do you like any other fruits besides plums?", "I like watermelons too! They are refreshing.", "I like them too! Do you like fruit juice?", "Sure. Here is a recipe for watermelon slush!") and re-train the supervised dialogue model on them.

148 of 173

  • Supervised reply corrector is better, at least for 3B models

  • Further picking out the correctable lemons helps

149 of 173

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

149

150 of 173

Incorporating Different Levels of Feedback

Incorporate different levels of human feedback via RL

✍️Local Feedback

Highlighted words or phrases

Speaker's intents

Identifiable events/topics

✍️Global Feedback

Judgments about coherence, coverage, overall quality, …

150

Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).

151 of 173

Case Study: Collecting Local Feedback

151

152 of 173

Case Study: Collecting Global Feedback

152

153 of 173

Case Study: Incorporating Different Levels of Feedback

153

(1) Collecting two levels of human feedback.

(2) Learning and designing reward models from the two levels of human feedback.

(3) Learning a summarization policy that can generate higher-quality summaries.

Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).

154 of 173

Case Study: Reward Modeling for Summarization

Learn the HITL summarization policy via RL by maximizing the combined (local + global) reward

154
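As a rough sketch of how the two levels of feedback could be combined into one scalar reward for the RL step; the reward-model interfaces and weights here are assumptions, not the paper's exact design.

def combined_reward(summary, dialogue, local_rm, global_rm, w_local=0.5, w_global=0.5):
    """Hypothetical combined reward for HITL summarization: one reward model
    is trained on local (span-level) feedback, another on global (summary-level)
    judgments; the RL policy maximizes their weighted sum."""
    r_local = local_rm(summary, dialogue)    # e.g., coverage of highlighted words, intents, events
    r_global = global_rm(summary, dialogue)  # e.g., coherence / overall quality score
    return w_local * r_local + w_global * r_global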

155 of 173

Case Study: Evaluation of HITL Summarization

155

Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).

156 of 173

Case Study: Converting Feedback into Principles

156

157 of 173

Case Study: Converting Feedback into Principles

157

Petridis, Savvas, Ben Wedin, James Wexler, Aaron Donsbach, Mahima Pushkarna, Nitesh Goyal, Carrie J. Cai, and Michael Terry. "ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles." arXiv preprint arXiv:2310.15428 (2023).

158 of 173

Reinforcement Learning from Human Feedback

158
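The pipeline details are in the figure; as a reminder of the core mechanics, here is the standard pairwise (Bradley-Terry-style) loss typically used to train the reward model in RLHF, assuming a reward_model(prompt, response) scoring function:

import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    """Pairwise objective commonly used to train the RLHF reward model:
    push r(prompt, chosen) above r(prompt, rejected)."""
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outranks rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()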

159 of 173

Constitutional AI: Harmlessness from AI Feedback

159

Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen et al. "Constitutional AI: Harmlessness from AI feedback." arXiv preprint arXiv:2212.08073 (2022).
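A hedged sketch of the supervised critique-and-revision phase described by Bai et al.; the llm callable and prompt wording are hypothetical.

# Hedged sketch of the critique-and-revision phase of Constitutional AI;
# the `llm` callable and prompt wording are hypothetical stand-ins.

CONSTITUTION = [
    "Identify specific ways the last response is harmful, unethical, or untruthful.",
    "Rewrite the response to remove the problems identified in the critique.",
]

def critique_and_revise(llm, prompt, response, n_rounds=2):
    for _ in range(n_rounds):
        critique = llm(f"{prompt}\n{response}\n\n{CONSTITUTION[0]}")
        response = llm(f"{prompt}\n{response}\nCritique: {critique}\n\n{CONSTITUTION[1]}")
    # Revised responses become supervised fine-tuning data; a later RL phase
    # replaces human preference labels with AI preference labels (RLAIF).
    return response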

160 of 173

Constitutional AI: Harmlessness from AI Feedback

160

161 of 173

Constitutional AI: Harmlessness from AI Feedback

161

162 of 173

Scaling RL from Human Feedback with AI Feedback

162

Lee, Harrison, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. "RLAIF: Scaling reinforcement learning from human feedback with AI feedback." arXiv preprint arXiv:2309.00267 (2023).

163 of 173

Scaling RL from Human Feedback with AI Feedback

163

164 of 173

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

164

165 of 173

Limitations of Human Feedback

  • Human preferences can be unreliable
  • Reward hacking is a common problem in RL

165

166 of 173

Limitations of Human Feedback

  • Human preferences can be unreliable
  • Reward hacking is a common problem in RL
  • Chatbots may be rewarded for producing responses that seem authoritative, long, and helpful, regardless of truthfulness
  • Who is providing this feedback to LLMs?
  • Whose values get aligned or represented?

166

167 of 173

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

167

168 of 173

Conclusion and Future Directions

168

169 of 173

Takeaway

Human-AI interaction is another layer on top of models, with its own set of taxonomies:

  • Goals: objective functions
  • Design choices: hyperparameters
  • Clickstreams & interviews: data logs

All more user-oriented: user-centered “optimization”!

169

170 of 173

Open Questions in “Designing Interactions”

Thinking about what users want.

Interfaces as an extension of humans, instead of an extension of technology.

Making users comfortable with novel interactions.

Focusing on building trust, ensuring appropriate reliance, and considering other human factors.

Designing interactions equitable to diverse populations.

By considering accessibility, gender, race, culture, and other demographic differences.

170

171 of 173

Open Questions in “Evaluating Interactions”

Scale up the evaluation.

Participant recruitment? Trolling removal? Qualitative analysis?

Evaluate dynamic interactions, in the wild.

General-purpose models mean fewer pre-defined interactions. How do we capture effectiveness when people self-initiate their own task formulations?

Make interaction a benchmarking task.

We get 1,000 scores on 1,000 interactions for one model; now what? Use diverse interaction evaluations to reflect a model’s practical usability!

171

172 of 173

Open Questions in “Learning from Interactions”

Going beyond labels or numbers.

Fine-grained interaction types need to be modeled.

Going beyond single-turn preference.

Interactions can unfold over multiple turns or dynamically.

Going beyond knowns and exploring unknowns in human-AI interactions.

Incorporating diverse opinions, cultures, and values when learning from interactions.

172

173 of 173

Designing, Evaluating, and

Learning from Human-AI Interactions

Sherry Tongshuang Wu

CMU, @tongshuangwu

Diyi Yang

Stanford, @diyi_yang

Sebastin Santy

UW, @SebastinSanty

tinyurl.com/

emnlp2023-tutorial-hai