1 of 175

Designing, Evaluating, and

Learning from Human-AI Interactions

Sherry Tongshuang Wu

CMU, @tongshuangwu

Diyi Yang

Stanford, @diyi_yang

Sebastin Santy

UW, @sebastinsanty

Dec 6, 9am-12:30pm

Leo 3 & 4 (also hybrid)

2 of 175

Human-AI Interactions

2

Credits: Hyunwoo Kim (Human) + Bing (AI)

3 of 175

Human-AI Interaction: What is it?

3

Basically, a field where humans and AIs interact.

AI-based translations

AI-based grammar correction

Recommendation systems

4 of 175

Human-AI Interaction: What is it?

4

Basically, a field where humans and AIs interact.

Humans: AI researchers, model developers, domain experts, end users.

AIs: dialog system, translator, recommender system, autonomous driving system.

Interact:

Humans collaborate with AI,

Humans get assistance from AI,

Humans analyze AI,

AI helps human,

& many other forms

5 of 175

Human-AI Collaboration

5

The cooperative and coordinated interaction between humans (mostly non-AI experts) and AI to solve complex problems or achieve certain goals.

6 of 175

Humans get assistance from AI-infused apps

6

Humans are still mostly end users and domain experts.

The big difference is that AI is not a partner but a tool (and part of “AI-infused applications”)

7 of 175

Humans analyze Models

7

Experts systematically analyze AI models, and go beyond aggregated scores.

https://erroranalysis.ai/ Adaptive Testing and Debugging of NLP Models (Ribeiro & Lundberg, ACL 2022)

“Understanding the broader terrain of errors is an important starting point in pursuing systems that are robust, safe, and fair…[We need to] identify cohorts with higher error rates and diagnose the root causes behind these errors.”

Eric Horvitz / Microsoft, 2021

8 of 175

How do we figure out the “Interaction”?

8

Given a human and an AI…

Design

Why should they interact? How do we make it happen?

(self-)selected

Already exist & (we think) usable

Evaluate

Have we achieved what we want to achieve?

9 of 175

How do we figure out the “Interaction”?

9

Design

Evaluate

Given a human and an AI…

Learn from

Why should they interact? How do we make it happen?

(self-)selected

Already exist but needs improvement!

Have we achieved what we want to achieve?

How do we make AIs more usable?

10 of 175

How do we figure out the “Interaction”?

10

Design

Evaluate

Given a human and an AI…

Learn from

(self-)selected

Already exist but needs improvement!

We will focus on text – It’s “straightforward” but complicated at the same time, open-ended, and part of the multi-modal world. And it’s relevant!

Sherry Wu

Diyi Yang

Sebastin Santy

11 of 175

Schedule – Fruitful morning ahead

11

Design: 09:15-10:05 (40 mins lecture + 10 mins Q&A)

Evaluate: 10:05-10:30 (25 mins lecture)

Coffee break: 10:30-10:50

Evaluate (cont’): 10:50-11:15 (15 mins lecture + 10 mins Q&A)

Learn from: 11:15-12:05 (40 mins lecture + 10 mins Q&A)

Conclusion: 12:05-12:15

Extended Q&A: 12:15-12:30

12 of 175

You will learn…

12

NLP, from an HCI perspective

Awareness: Interaction is another layer on top of models, and it is human-centered

Systematic & up-to-date overview: the design choices around models

Contextualization: How principles are mapped to real-world cases

13 of 175

“Human-AI Interactions”

Designing

13

14 of 175

Norman Doors

Have you ever come across a door that you tripped on, bumped into, or were confused about how to operate?

14

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Guess whether to push or pull

🔎 Can’t locate a place to push or pull

Try to push, but the door actually slides

15 of 175

You are not alone

Bad design is everywhere

From doors, to everyday objects & machines designed by people

including “AI”

15

16 of 175

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including “AI”

16

Door 🚪

AI 🤖

What the user wants to do

“How do I get to the next room?”

“How do I solve my task?”

What the user ends up doing

How does a user learn “how to use”?

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

17 of 175

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including “AI”

17

Door 🚪

AI 🤖

What the user wants to do

“How do I get to the next room?”

“How do I solve my task?”

What the user ends up doing

“How should I operate the door to get to the next room?”

“How should I prompt the model to get it to solve my task?”

How does a user learn “how to use”?

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

18 of 175

Norman “AI”?

Not just doors: this happens with many poorly designed machines, including “AI”

18

Door 🚪

AI 🤖

What the user wants to do

“How do I get to the next room?”

“How do I solve my task?”

What the user ends up doing

“How should I operate the door to get to the next room?”

“How should I prompt the model to get it to solve my task?”

How does a user learn “how to use”?

- From previous encounter

- Read labels

- Take a guess and try

- From other people

- Read prompt guidelines

- Wing it

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

19 of 175

Design that disappears

“The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”

— Mark Weiser

They don’t require an instruction manual. After using them once or twice, you barely notice them in subsequent interactions. E.g., pointing devices, touchscreens, “literacy”.

19

“The father of ubiquitous computing”

Weiser, Mark. "The Computer for the 21st Century." Scientific american 265.3 (1991): 94-105.

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

20 of 175

How do we achieve this?

By extending humans instead of extending technology, reducing the “friction of learning”

20

Object Metaphors

Action Metaphors

Human functioning

Computer functioning

Punch Cards

Command Line

Pointing Device

GUI Icons

Personal Computing

Extending human

Extending computer

21 of 175

How do we achieve this?

In AI, we’ve been only thinking about extending technology

21

Human functioning

AI functioning

Using Code

Artificial Intelligence

Prompting

Chatting

Extending human

Extending AI

22 of 175

Why does it happen?

22

Technology-centric design

A design process for situations when a class of technologies already exists, but when user domains for co-development are not clearly established.

User-centered design (UCD) is an iterative design process in which designers focus on the users and their needs in each phase of the design process.

User-centered design

“If you have a hammer, everything looks like a nail”

Solution in search of a problem

Bly, Sara, and Elizabeth F. Churchill. "Design through matchmaking: technology in search of users." interactions 6.2 (1999): 23-31.

23 of 175

Blog Post: The Biggest Bottleneck for LLM startups is UX

These issues really only surface once someone starts trying to use the product in the context of their daily workflow. This is how you go from “cool” to “useful.”

These challenges are always present, regardless of the system’s accuracy (within some bounds). It doesn’t matter if the LLM accuracy is 80% or 95%; the user still needs to reason through failure modes and understand what to expect when interacting with the system. You are better off getting to a baseline accuracy that is good enough and then building a product that allows a user to know how to work around the model.

“You are good at designing things we cannot build. We are good at making things that users don’t use.”

23

The biggest bottleneck for large language model startups is UX

Yang, Qian, et al. “Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI 2019.

24 of 175

New models → new AI interactions

24

ChatGPT

Chatting

Pushing language as the primary interface for every task

Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).

Anthropomorphic Tendencies

The act of projecting human-like qualities or behavior onto non-human entities, in this case, AI

25 of 175

New models → new AI interactions

25

ChatGPT

Language is flexible
No specification means there is no single way of giving instructions. Suitable for personal tasks.

Language is imprecise
Controlling for desired outputs can be difficult. Unsuitable for tasks that require precision or are critical.

Anthropomorphic Tendencies
Urge to communicate like humans do

Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).

26 of 175

Subtler Interactions (that disappear)

26

Voice Assistants

Email Autocomplete

Recommendations

Word suggestions

Subtitles

Code Completion

27 of 175

Design Thinking

27

28 of 175

What is design thinking?

As engineers, we immediately jump to finding solutions for a problem that we come across.

28

Problem

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

29 of 175

What is design thinking?

It is not about the exact number, but this high number puts emphasis on getting to the root cause of the problem, instead of looking at the surface level.

29

Problem

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Problem 1

Problem 2

Problem 1.1

Why?

Why?

Why?

Why?

Ask “Five whys”

30 of 175

Getting to the root cause of the problem

“Within a design context, framing is often seen as the key creative step that allows an original solution to be produced.

Designers report on the need to get to ‘the problem behind the problem’ (as initially presented by the client), and about creating a ‘fresh perspective.’ ”

— Bec Paton and Kees Dorst

“If I had asked people what they wanted, they would have said faster horses.”

— Henry Ford

30

“The inventor of the production car”

The core of ‘design thinking’ and its application, https://www.sciencedirect.com/science/article/pii/S0142694X11000603g

31 of 175

What is design thinking?

31

Solution 1

Solution 2

Solution 3

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Problem

Problem 1

Problem 2

Problem 1.1

Why?

Why?

Why?

Why?

32 of 175

How to incorporate design thinking?

The “Double Diamond” Method

32

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

First Diamond: Find the specific problem.

Second Diamond: Find the specific solution.

Why?

Why?

Find the problem

Find the solution

33 of 175

How to incorporate design thinking?

The “Double Diamond” Method

33

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Four Steps:

  • Discover Problem
  • Define Problem
  • Develop Solution
  • Deliver Solution

Find the problem

Find the solution

34 of 175

Discover Problem

Discover: Understand the issue rather than merely assuming it. It involves researching, speaking to and spending time with people who are affected by the issues.

Market Research

Field Study

Interview and Surveys

Environmental Factors

34

Stakeholder Interviews, check raised tickets, traffic and sales analysis, competitive audits

Site visits, Ethnography to observe people doing their own tasks in their own setting.

To collect information on their reactions to existing products and conditions

Understand the context, and its needs.

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

35 of 175

Define Problem

Define: The insight gathered from the discovery phase can help to define the challenge in a different way.

Affinity Diagrams

Perspective Framing

Task & Information Analysis

35

To group and explore the structure of information.

Participatory design to develop a consensus view of the overall process.

Learning about relationships between tasks and information; Creating logical groups from the users’ point of view.

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

36 of 175

Develop Solution

Develop: Give different answers to the clearly defined problem, seeking inspiration from elsewhere and co-designing with a range of different people.

Rapid Prototyping

Storytelling

Minimum Viable Product

36

Physical realizations of the research and design process in a tangible form. Can be used to get a sense of what it would be like to experience the product/service. Goes from low fidelity (paper) to high fidelity (systems).

Construct situations where a specific user in a specific context would go about solving the problem with different solutions.

Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).

37 of 175

The “Double Diamond” Method

37

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.

Four Steps:

  • Discover Problem
  • Define Problem
  • Develop Solution
  • Deliver Solution

But this is not a once-and-for-all process

38 of 175

Iterative Design

38

Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013. DreamEndState.com

Find the problem

Find the solution

It is a spiral

along time axis

39 of 175

Example Task: Make writing faster for humans

39

Yang, Qian, et al. "Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI. 2019.

40 of 175

Example Task: Make writing faster for humans

40

“Typo errors slow me down”

“I know the word, takes time to type”

Next word suggestions

Autocorrect word

Use: on phones, cannot think of new words, obvious word but takes time to write

Use: on phones, already typed words with spelling mistakes, too much time to go back

Next phrase suggestion

Use: on desktops, breeze through obvious phrases that otherwise take time to write.

Word copilot

Use: on desktops, large chunks of text together. Helps when one needs to start from scratch

“I know the phrase, takes time to type”

“I have writer’s block starting from scratch”

Problem Space

Solution Space

41 of 175

Develop Solution: Prototype

Wizard-of-Oz: Fake features so that the user thinks the responses are computer-driven when they are actually human-controlled. Challenge for NLP: AI errors are hard to simulate.

LM Prototypes / Scaffold: Simulate users that may use the system; use LMs to build prototypes.

Mimic simple functionality: Ensemble multiple tools, LMs, simple models and expectations. Challenge for NLP: cannot simulate SOTA model capabilities.

41

42 of 175

Develop Solution: Prototype Persona

Algorithmic persona: human roles that users assign to the algorithm to explain the algorithm’s goals, behaviors, and characteristics.

42

Wu, Eva Yiwei, Emily Pedersen, and Niloufar Salehi. "Agent, gatekeeper, drug dealer: How content creators craft algorithmic personas." CSCW 2019

43 of 175

Develop Solution: LM as a Prototyper / Scaffolding

43

Petridis, Savvas, Michael Terry, and Carrie Jun Cai. "PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Models." Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.

44 of 175

NLP & Interfaces

UI for an NLP system, aka “a wrapper on top”, e.g., Autocomplete, Google Translate

  • Designing an interface to make use of the underlying NLP system for achieving user tasks.

44

GUI

Natural language UI, aka “the wrapper itself”, e.g., Alexa, Google Search, GPT-4 Web Search, AutoGPT

  • Using language as a medium to interact with tools, applications and other systems.

NLP System

Language

System

45 of 175

Interfaces vs. Interactions

45

Interfaces
Tangible, i.e., visual, auditory, tactile inputs to the system. Designing interfaces happens at the surface and physical level, often limited to styling.

Interactions
Understanding the context of users, their needs and requirements, and how they operate. Includes psychological aspects such as trust, goals, and user behavior.

46 of 175

Beyond Interfaces

46

47 of 175

Aspects of Interaction

Cognition, Perception and extending humans: Offload cognitively demanding tasks, make space for humans to be creative.

Trust, Reliance and other user-machine behavior: Users feel comfortable using and depending on AI systems for achieving their tasks.

Fairness, Accountability, Transparency, Ethics: Ensure equitable treatment of all individuals, regardless of their race or gender. Do not perpetuate bias or discrimination, and provide the ability to understand model decisions.

Personalization, Adaptation, Feedback and Guidance: Tailoring AI interactions to individual user preferences and needs, and in turn also learning and improving over time to align with human preferences.

47

48 of 175

Cognition, Perception and extending humans

48

Explains what a system might be capable of. A metaphor communicates expectations of what can and cannot be done.

Visual Metaphors

Audio Metaphors

Conceptual Metaphors

Textual Metaphors

Web, Crawling, Load, Fetch

Email, Thread, Port, Address

Camera Shutter Sound

Phone Lock Sound

AI Metaphors

Stochastic Parrots

Intelligent Agent

(Object Metaphors)

49 of 175

Cognition, Perception and extending humans

49

Khadpe, Pranav, et al. "Conceptual metaphors impact perceptions of human-AI collaboration." Proceedings of the ACM on Human-Computer Interaction 4.CSCW2 (2020): 1-26.

Referring to AI with a specific name / metaphor has an effect on how it is perceived, and even how it is used.

Bots with metaphors of high warmth, but low competence were preferred overall

Effects of chatbot naming

50 of 175

Cognition, Perception and extending humans

50

Manipulate objects like you do in the real world. Real-world metaphors for objects and actions can make it easier for a user to learn and use an interface.

Drag and Drop

Direct Manipulation

Resizing Elements

(Action Metaphors)

Personal Assistant

Interactive Stories

set reminder

get weather

send message

Setting your own character’s actions or dialogue options, creating a personalized storyteller.

51 of 175

Trust and Reliance

51

Vasconcelos, Helena, et al. "Explanations can reduce overreliance on ai systems during decision-making." Proceedings of the ACM on Human-Computer Interaction 7.CSCW1 (2023): 1-38.

Trust: Trust refers to a belief or confidence in the integrity, reliability, and honesty of a person, organization, or thing.

Reliance: Involves depending on someone or something to perform a specific function or task, irrespective of whether trust is present.

Often, Trust → Reliance

Imperfect agents prone to making errors require a trusting relationship.

52 of 175

Trust and Reliance

52

Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21.

> Showing Explanations and giving people agency

> Showing Uncertainty

> On demand

> Wait

53 of 175

Fairness

Accountability

53

Liebling, Daniel J., et al. "Unmet needs and opportunities for mobile translation AI." Proceedings of the 2020 CHI conference on human factors in computing systems. 2020.

What is the real-life cost of mistranslation?

How well does it work for diverse populations?

Sun, Jiao, et al. "Pretty princess vs. successful leader: Gender roles in greeting card messages." Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 2022.

54 of 175

Transparency

Ethics

54

Flathmann, Christopher, et al. "Modeling and guiding the creation of ethical human-AI teams." Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 2021.

How to open the black-box and understand model decisions?

Cabrera, Ángel Alexander, et al. "Zeno: An interactive framework for behavioral evaluation of machine learning." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.

What are the morals and values encoded in the system?

55 of 175

Interaction Initiative

55

Mixed initiative systems allow users to interact with them in a collaborative way, where the user and the system both take an active role in carrying out tasks or making decisions.

Advocates elegant coupling of automated services with direct manipulation.

“Autonomous actions should be taken only when an agent believes that they will have greater expected value than inaction for the user.”
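
A minimal decision-theoretic sketch of this principle (our notation, not the slide’s or Horvitz’s exact formulation): with uncertainty over the user’s goal g given evidence E, the agent takes autonomous action A only if

\sum_{g} p(g \mid E)\, u(A, g) \; > \; \sum_{g} p(g \mid E)\, u(\text{inaction}, g)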

56 of 175

Reflection

Design should be seamless, without the need for an instruction manual.

How? By extending humans and their natural ways of interacting with the real world.

What is the process? By using design thinking frameworks. Emphasis on finding the root cause of the problem, finding solutions, and iterating back and forth until we build applications that users want to use.

Are there other human factors to consider? Beyond the visible elements on the interfaces, and psychological and cognitive aspects, it is important to be aware of the underlying trust and reliance in the system, and its implications in the real world.

56

57 of 175

“Human-AI Interactions”

Evaluating

57

58 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

58

59 of 175

Goals: HALIE – Consider Humans & Interactions

59

Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023

Traditional evaluation: on the outputs of the models themselves.

Criteria:

Output quality

Perspective:

third party

Target:

Output

60 of 175

Goals: HALIE – Consider Humans & Interactions

60

Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023

Interaction centric evaluation: On interactions between humans and models.

Criteria:

Output quality, human preference

Perspective:

third party, first-person experience

Target:

Output, process

61 of 175

Consider a case study: Human-LM Co-Writing

61

Lee, Mina, Percy Liang, and Qian Yang. "CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities." CHI 2022

62 of 175

Consider a case study: Human-LM Co-Writing

62

Criteria:

Quality: perplexity, etc.

Preference: Which option(s)?

Perspective:

third party: perplexity

First-person: Everyone has their own choice; thousands of divergent continuations from the same starting point

Target:

Output: Final story

Process: Snapshot of how the article is constructed

63 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to application.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

63

64 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Analysis: “How do I crunch the numbers to find out what I need to know?”

64

65 of 175

Going back to our Co-writing case…

65

Goal – Understand LLM capability to support human writing

➡️ Task – Vary types of writing (Creative & Argumentative), prompts, and model randomness

➡️ Data – What’s collected?

Human preferences, final articles, etc.

66 of 175

Going back to our Co-writing case…

66

Most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

<Query>

<Move>

<Move>

<Query>

<Query>

<Query>

<Accept>

He loved to play tricks on people and make them love.

<Accept>

there was a great mage who lived in a tower.

<Edit>

funny

Once upon a time,

<Write>

Credit: Mina Lee’s job talk!

67 of 175

Going back to our Co-writing case…

67

Most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

Events and clickstreams

States of everything at every moment
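
As a rough illustration (the event schema below is ours, not CoAuthor’s actual logging format), an interaction trace can be stored as a list of events and mined for simple process metrics:

# Hypothetical interaction-trace events (schema is illustrative only).
trace = [
    {"event": "write",  "text": "Once upon a time,"},
    {"event": "query"},
    {"event": "accept", "text": "there was a great mage who lived in a tower."},
    {"event": "edit",   "text": "funny"},
    {"event": "query"},
    {"event": "query"},
    {"event": "accept", "text": "He loved to play tricks on people."},
]

# Simple process metrics recoverable from the trace alone.
queries = sum(e["event"] == "query" for e in trace)
accepts = sum(e["event"] == "accept" for e in trace)
acceptance_rate = accepts / queries if queries else 0.0
print(f"{queries} queries, acceptance rate = {acceptance_rate:.0%}")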

68 of 175

Going back to our Co-writing case…

68

Goal – Understand LLM capability to support human writing

➡️ Task – Vary types of writing (Creative & Argumentative), prompts, and model randomness

➡️ Data – What’s collected?

  • 1445 sessions between 63 users and GPT-3
  • Types of writing:
    1. Creative writing: 830 stories written by 58 writers
    2. Argumentative writing: 615 essays written by 49 writers
  • Stories and essays: 418 words long
  • Number of queries: 11.8 queries per writing session
  • Acceptance rate of suggestions: 72.3%
  • Percentage of text written by humans: 72.6%

69 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

69

70 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

70

71 of 175

Going back to our Co-writing case (again)

71

But most important data: Interaction trace

Rich metadata that reflect the entire process of interaction, with each person.

Things you can already see…

Multiple queries mean the first several suggestions were bad;

Edits give us the cost-efficiency threshold people have in mind, between “useful enough to edit” vs. “let me start from scratch”

<Query>

<Move>

<Move>

<Query>

<Query>

<Query>

<Accept>

He loved to play tricks on people and make them love.

<Accept>

there was a great mage who lived in a tower.

<Edit>

funny

Once upon a time,

<Write>

72 of 175

Existing metrics can help quantify effects

72

Some analyses already done by the authors!

Q: Can GPT-3 generate fluent text in response to user text?

A: Text written by user + GPT-3 had fewest errors and most diverse vocabulary

73 of 175

Define metrics for what needs to be measured

73

Some analyses already done by the authors!

Q: Can GPT-3 contribute new ideas to users’ stories?

A: Ideas generated and reused by users in the subsequent writing.

“Reused named entity” is a new (lower-bound) metric defined for ideation quantification!
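
A hedged sketch of how such a reuse metric could be computed; entity matching is simplified here (lowercased exact match), and the spaCy model name is just one possible choice:

import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

def reused_entities(suggestion, subsequent_user_text):
    """Named entities introduced by a model suggestion that the user later reuses."""
    suggested = {ent.text.lower() for ent in nlp(suggestion).ents}
    reused = {ent.text.lower() for ent in nlp(subsequent_user_text).ents}
    return suggested & reused

# May print something like {'mira', 'eldoria'}, depending on what the NER model tags.
print(reused_entities(
    "The mage travelled to Eldoria with his apprentice Mira.",
    "Mira had never seen Eldoria before, and she was amazed.",
))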

74 of 175

How the dataset can be further used

74

Writers' behaviors and writing outcomes over time

Can we observe novelty effect and longitudinal change?

Do LMs homogenize writing by providing similar suggestions to all users?

Linguistic accommodation

How does the style, voice, or tone of a writer or LM influence that of the other over time? Is the influence uni-directional or bi-directional?

Edit traces

What can we learn from human edits on LM outputs?

Can we train LMs on edit traces to emulate human edits?

75 of 175

Metrics that exist for Human-LM Co-Writing

75

Shen, Hua, and Tongshuang Wu. "Parachute: Evaluating interactive human-lm co-writing systems." In2Writing 2023.

76 of 175

Metrics that exist for Human-LM Co-Writing

76

77 of 175

Metrics that exist for Human-LM Co-Writing

77

78 of 175

Metrics that exist for Human-LM Co-Writing

78

We tend to have more qualitative metrics…

79 of 175

Key desiderata in evaluation (also outline :D)

Objectives and goals: “What do I need to know?”

General mindset of “human and interaction first”, then map to task.

Tasks: “What should users do so I find out what I need to know?”

Data: “What data do I collect to find out what I need to know?”

Interaction trace & open-ended feedback, form a comprehensive picture.

Analysis: “How do I crunch the numbers to find out what I need to know?”

Comprehensive metrics, cover different aspects, don’t reinvent the wheel!

79

80 of 175

Evaluation – From mindset to method

80

Mixed method evaluation

Think aloud analysis

And, of course, one more case study

81 of 175

Important method: “Think Aloud”

A research method used to gain insight into a person's thought processes as they perform a task or solve a problem. The participant is asked to verbalize their thoughts as they perform the task, which allows the researcher to understand how the participant approaches the task. “Thinking aloud may be the single most valuable usability engineering method.”

“I’m going to ask you to ____ and while you are doing that, can you tell me whatever you are thinking. Whatever comes into your mind while you are working on that. Okay?”

Protocol

Give participants specific tasks to accomplish (but not HOW to do it)

Have them speak aloud as they complete the tasks

Keep interruptions to a minimum

Ask for open-ended questions & clarification after the task is complete

Learning effect – if you have multiple tasks, watch for biasing the test due to their order

Typically used to test the usability of a website, app or object

81

82 of 175

Important method: Controlled exp. + Mixed method

Mixed methods research combines elements of quantitative research and qualitative research in order to answer research questions. It can help paint a more complete picture than a standalone quantitative or qualitative study, as it integrates the benefits of both methods.

82

Quantitative method

Understand the “what”. Precise!

  • Did users complete task (yes/no)?
  • How long did it take?
  • How many clicks?

Qualitative method

Understand the “why”. Open-ended!

  • What did you like best about the experience?
  • Why were you frustrated by the model output?

83 of 175

Quantitative vs. Qualitative

83

Definition
Quantitative: gather numerical data to be analyzed using statistical methods.
Qualitative: gather descriptive, non-numerical data to be analyzed through interpretation and contextualization.

Data source
Quantitative: surveys, questionnaires, experiments.
Qualitative: interviews, observations, and document analysis.

Presentation
Quantitative: tables, graphs, and statistics.
Qualitative: quotes and narratives that reflect the participants’ experiences and perspectives.

Goal
Quantitative: establish cause-and-effect relationships between variables.
Qualitative: gain a deeper understanding of social phenomena, meanings, and processes.

84 of 175

Case Study: Interactive Machine Translation

84

“We present Predictive Translation Memory, an interactive, mixed-initiative system for human language translation. Translators build translations incrementally by considering machine suggestions that update according to the user’s current partial translation.”

Green, Spence, et al. "Predictive translation memory: A mixed-initiative system for human language translation." UIST 2014

85 of 175

85

86 of 175

PTM recap: Rationales for seemingly simple decisions

86

Design: Re-use familiar hotkeys (e.g., CTRL+Enter); typing activates interactions

Translators are fast typists: want to avoid the mouse

Design: One column, interleaved layout

Translators read a lot (20-25% of the translation session); a 2-column layout would be cumbersome

Design: Text color encoding

Ownership: AI can’t modify human text, human can accept but not modify AI text

Principle: Horvitz #10 – employing socially appropriate behaviors for agent-user interaction.

87 of 175

PTM recap: Rationales for seemingly simple decisions

87

Design: highlight translated words

Principle: Horvitz #11 – maintaining working memory of recent interactions

88 of 175

PTM recap: Rationales for seemingly simple decisions

88

Design: highlight translated words

Principle: Horvitz #11 – maintaining working memory of recent interactions

Design: allow for word-to-word query

Principle: Horvitz #6 – allowing efficient direct invocation and termination

89 of 175

Principles for Mixed-initiative user interfaces

89

Developing significant value-added automation (vs. direct manipulation)

Considering uncertainty about a user’s goals

Considering the status of user’s attention (minimize distraction, cost vs. benefit of deferring action)

Inferring ideal action in light of costs, benefits and uncertainties (expected values of actions!)

Employ dialog to resolve key uncertainties (interactions!)

Allowing efficient direct invocation and termination

Minimizing the cost of poor guesses about action and timing

Scoping precision of service to match uncertainty, variation in goals — do less if uncertain!

Providing mechanisms for efficient agent-user collaboration to refine results

Employing socially appropriate behaviors for agent-user interaction

Maintaining working memory of recent interactions

Continuing to learn by observing (e.g., about user’s goals, etc.)

90 of 175

PTM: Experimental Design

90

Comparative analysis

“We compared our system to post-editing, which is a strong baseline [29, 21], and is also the most common commercial use of MT.”

Clear research questions

Time – PTM faster than post-edit?

Quality – PTM == better translation?

Usage – subjects use interactive aids?

Task: translate French→English or English→German

Source Text: ≈3,000 tokens of News/Medical/Software

Conditions: post-edit (pe) and PTM

Participants: 16 expert subjects per language pair

91 of 175

RQ1: Time – PTM faster than post-edit?

91

Quantitative analysis (find robust evidence)

Metric: log of time (more tolerant of outliers)

Compare mean (for general understanding)

Linear mixed effects models (for understanding significance, important factors)

The key independent variable: translation condition

Learning effect – people get quicker as the task proceeds

More edits mean longer time

Initial translation quality – how much editing is necessary

They had an unbalanced participant pool

Longer source sentence takes longer to edit

Potential interaction between independent variables

+random intercepts/slopes for subject, source, text genre.
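
A minimal sketch of this kind of analysis with statsmodels, assuming a hypothetical per-sentence log with columns such as condition, time_sec, src_len, trial_order, and subject (these names are illustrative, not the paper’s actual variables):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataframe of per-sentence translation sessions.
df = pd.read_csv("ptm_sessions.csv")  # condition, time_sec, src_len, trial_order, subject, ...
df["log_time"] = np.log(df["time_sec"])  # log time is more tolerant of outliers

# Linear mixed-effects model: condition (post-edit vs. PTM) as the key fixed effect,
# source length and trial order as covariates, random intercepts per subject.
model = smf.mixedlm("log_time ~ condition + src_len + trial_order",
                    data=df, groups=df["subject"])
result = model.fit()
print(result.summary())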

92 of 175

RQ1: Time – PTM faster than post-edit?

92

Qualitative analysis (find reasons behind quantitative analysis — codebook!)

Metric: log of time (more tolerant of outliers)

Think aloud (record users’ comments) — Interactive mode takes more time because…

There are more aids to operate and more information to read and analyze:

“Because you spend more time on each word, you have opportunity to see alternative translations.”

MT quality greatly affected the usefulness of the interactive aids:

“If drop-down suggestions are not of a good quality, reading may consume extra time.”

The post-edit mode was easier at first, but interaction is better in the long run.

“I am used to this [post-edit], this is how Trados [the preeminent CAT tool] works.”

Likert Scale survey (can still quantitatively compare users’ subjective judgements): “In which interface did you feel most productive?”

“I would use interactive translation features if they were integrated into a CAT product”

“I got better at using the interactive interface with practice/experience”

93 of 175

RQ1: Time – PTM faster than post-edit?

93

Qualitative and quantitative results usually provide some grounding for each other.

“Post-edit mode was easier at first, but the interactive mode was better once I got used to it. “

"If I had time to use the interactive tool and grow accustomed to its way of functioning, it would be quite useful…"

94 of 175

RQ2: Quality – PTM == better translation?

94

Quantitative analysis (find robust evidence)

Metric: BLEU (automatic eval, has issues, but easier to run)

BLEU: a measure of similarity with the gold reference.

HBLEU: measure of similarity with the initial MT suggestions.

“PTM exposes translators to many more alternatives, encouraging them to deviate further from the initial MT suggestion (lower HBLEU).”

Compare mean

(Also vs. original generated text)
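
A small sketch of how BLEU and HBLEU could be computed with sacrebleu; HBLEU is simply BLEU scored against the initial MT suggestion instead of the gold reference (the example strings are made up):

import sacrebleu  # pip install sacrebleu

final_translations = ["the committee approved the new budget on friday ."]
gold_references    = ["the committee approved the new budget on friday ."]
initial_mt_outputs = ["the commission has approved the new budget friday ."]

# BLEU: similarity of the final human translation to the gold reference.
bleu = sacrebleu.corpus_bleu(final_translations, [gold_references])
# HBLEU: similarity to the initial MT suggestion -- lower HBLEU means the
# translator deviated further from the machine's first proposal.
hbleu = sacrebleu.corpus_bleu(final_translations, [initial_mt_outputs])
print(f"BLEU = {bleu.score:.1f}, HBLEU = {hbleu.score:.1f}")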

95 of 175

RQ2: Quality – PTM == better translation?

95

Quantitative analysis (find robust evidence)

Metric: Human subject rating

Auto methods are sensitive and noisy, so usually paired with human judgements as well

96 of 175

RQ2: Quality – PTM == better translation?

96

Qualitative analysis (find reasons behind quantitative analysis)

Metric: BLEU & rating (automatic eval, has issues, but easier to run)

When users wanted to render more stylistic translations, PTM was less useful:

“...choosing a very different translation approach (choice of words, idioms with no equivalent in English...) would be like going against the current—but may have provided a better quality.”

“the translator is less susceptible to be creative.”

Why do many participants prefer post-edit?

“I found the machine translations (texts in gray) were of a much better quality than texts generated by Google Translate”

"The translations generally did not need too much editing, which is not always the case with machine translations.”

97 of 175

RQ3: Usage – subjects use interactive aids?

97

Quantitative analysis on 1.1 million UI events

Metric: UI clickstream events (faithful reflection of user pattern)

Usage: more text was entered via interactive aids than in TransType (where 52% of target characters were typed)

Clickstreams are also more objective and comparable with prior work.

In the TransType system, the authors commented that their users often “[accepted] predictions in [their] entirety and then edited to ensure its correctness” and reported that 52% of target characters were typed [36]. In the “prediction+options” experiment conducted by Koehn et al. [29], the authors reported that 36% of the final translations were typed, 36% entered via a mouse click, and 27% entered via the tab key to accept machine translations. When working in our PTM system, users directly utilized machine translations to a greater degree than previously reported.

98 of 175

High-level takeaways

Mindset in evaluation: Human and interaction come first.

Desiderata: Goals > Tasks > Data > Analysis method

Common data: Clickstreams, interviews, etc.

Common evaluation method: Mixed method, think aloud

Common metrics: Find them, or invent them :)

98

99 of 175

Reflection: Where do we find interactions to evaluate?

99

Don’t reinvent the wheel – task delegation has always been a thing in Human Computation!

LLM chaining:

Break down complex tasks into pieces that can be done independently by LLMs, then combined

Crowdsourcing pipeline:

Break down complex tasks into pieces that can be done independently by humans, then combined

Compare human vs. LLM, and LLM vs. LLM performances on sub-tasks in the same overall context, by replicating crowdsourcing pipelines with LLM chains!

100 of 175

Reflection: Where do we find interactions to evaluate?

100

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Input

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Fix

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Shorten the verbose parts and put back in context

Verify

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Fix grammar issues

Find

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

Verbose parts

101 of 175

Reflection: Where do we find interactions to evaluate?

101

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Fix

Find

Verify

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Input

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Verbose parts

Shorten the verbose parts and put back in context

Fix grammar issues

Identify at least one area that can be shortened without changing the meaning of the paragraph.

Edit the highlighted section to shorten its length without changing the meaning of the paragraph.

Choose at least one rewrite that has significant style errors in it. Choose at least one rewrite that significantly changes the meaning of the sentence.

Bernstein et al. (2010): Find segments for shortening, Fix by shortening them, Verify by identifying rewrites with errors.

Find

Fix

Verify

102 of 175

Reflection: Where do we find interactions to evaluate?

102

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Fix

Find

Verify

Input

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Verbose parts

Shorten the verbose parts and put back in context

Fix grammar issues

Identify at least one area that can be shortened without changing the meaning of the paragraph.

Edit the highlighted section to shorten its length without changing the meaning of the paragraph.

Choose at least one rewrite that has significant style errors in it. Choose at least one rewrite that significantly changes the meaning of the sentence.

Bernstein et al. (2010): Find segments for shortening, Fix by shortening them, Verify by identifying rewrites with errors.

Find

Fix

Verify

Find three segments that can be shortened from the following text. These segments need to be present in the text.

Text: {input text}

Segments: 1. {output segment}

Shorten the following text without changing its meaning.

Text: {segment}

Shortened text: {shortened segment}

P7: Find segments to shorten, Fix these phrases by shortening them, Verify by fixing grammatical errors.

Correct the grammar of the following text.

Text: {fixed text}

Corrected text: {grammatical text}
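
A hedged sketch of how the three prompts above could be chained programmatically; call_llm is a placeholder for whatever completion API is used, and the parsing of the Find step is deliberately simplified:

def call_llm(prompt):
    """Placeholder for an LLM completion call (e.g., an API client); not a real library function."""
    raise NotImplementedError

def find_fix_verify(text):
    # Find: ask the model for segments that can be shortened.
    segments = call_llm(
        "Find three segments that can be shortened from the following text. "
        f"These segments need to be present in the text.\nText: {text}\nSegments: 1."
    ).split("\n")

    # Fix: shorten each segment independently, then patch it back into the text.
    for segment in segments:
        shortened = call_llm(
            f"Shorten the following text without changing its meaning.\nText: {segment}\nShortened text:"
        )
        text = text.replace(segment.strip(), shortened.strip())

    # Verify: a final pass that corrects grammar errors introduced by the patches.
    return call_llm(f"Correct the grammar of the following text.\nText: {text}\nCorrected text:")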

103 of 175

Replication experiment…

103

Course assignment for Human-Centered NLP @ CMU

20 undergraduate / graduate students

Replicate 7 crowdsourcing papers (that cover diverse pipeline designs and tasks) with LLM chains

Self-reflection on replication effectiveness (vs. single LLM baseline) and possible improvements

Peer grading on replication correctness, thoroughness, and comprehensiveness

Wu, Tongshuang, et al. "Llms as workers in human-computational algorithms? replicating crowdsourcing pipelines with llms." ArXiv 2023

104 of 175

Interesting evaluation output & design implication

104

LLMs (with only textual instructions) need proper task formulation: e.g., change to multiple-choice questions, prompts with templates, etc.

Find three segments that can be shortened from the following text. These segments need to be present in the text.

Text: {input text}

Segments: 1. {output segment}

Humans get more scaffolding constraints: e.g. use mouse selection to extract precise segments from the original document.

Identify at least one area that can be shortened without changing the meaning of the paragraph.

Fix grammar issues

Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits

Fix

Find

Verify

Input

hope to finally … their digital editions

in order to get people to… apps

offer something… on the open Web

hoping to charge for digital editions

to get people to pay for apps

offer something unique not available elsewhere

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.

Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.

Text to be shortened

Verbose parts

Shorten the verbose parts and put back in context

105 of 175

Reflection: Where to find metrics?

105

Look into existing interactions!

Example: Human-Human pair programming, vs. human-AI pair programming

Qianou Ma et al "Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023

106 of 175

Reflection: Where to find metrics?

106

Human-human – Variance in metrics!

Time and accomplishment? twice the duration, the person-hours required

Human-AI – Too simplified metrics?

E.g. the number of lines of added code – the nature of interaction with Copilot (tab to accept suggestions) is a big factor!

107 of 175

Reflection: Where to find metrics?

107

More comparisons with human-human pair programming?

Get inspiration on metrics, e.g., defect density, perceptual effort measures, readability, functionality, the number of test cases passed, code complexity, scores, expert opinions, etc.?

108 of 175

Reflection: Where to find metrics?

108

More factors to consider – Evaluate on which population (female students, teachers, etc.)?

109 of 175

Reflection: Where to find metrics?

109

Ma, Qianou, et al. "Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023

110 of 175

“Human-AI Interactions”

Learning

110

111 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

111

112 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

112

113 of 175

Users Interaction with LLMs

113

114 of 175

114

115 of 175

Interaction: Different Types of Human Feedback (1)

Labeled data points

Edit data points

Change data weights

Binary/scaled user feedback

Natural language feedback

Code language feedback

115

116 of 175

Interaction: Different Types of Human Feedback (2)

Define, add, remove feature spaces

Directly change the objective function

Directly change the model parameter

116

117 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

117

118 of 175

Learning from Interactions and Feedback

Transform nontechnical human “preferences” into usable model “language”

  • Allow humans to easily provide feedback
  • Build models to effectively take the feedback

118

119 of 175

Tradeoff: Human-friendly vs. Model-friendly

Models need feedback that “they can respond to”

Humans prefer easier-to-provide feedback

Non-experts:

  • natural language feedback > labeling > model manipulation

119

120 of 175

Human Interaction and Text Classification

120

Godbole, Shantanu, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. "Document classification through interactive supervision of document and term labels." In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 185-196. Springer, Berlin, Heidelberg, 2004.

121 of 175

Human Interaction and Parsing

121

122 of 175

Human Interaction and Topic Modeling

122

Interactive Topic Modeling: start with a vanilla LDA with a symmetric prior and get the initial topics. Then repeat the following process until users are satisfied: show users the topics, get feedback from users, encode the feedback into a tree prior, and update the topics with tree-based LDA.
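
A rough pseudocode sketch of that loop (every helper function here is hypothetical, standing in for an LDA / tree-LDA implementation and a feedback UI):

# Hedged sketch of the interactive topic modeling loop described above.
def interactive_topic_modeling(corpus, num_topics):
    model = fit_lda(corpus, num_topics, prior="symmetric")   # vanilla LDA to start
    topics = model.top_words()
    while True:
        feedback = show_topics_and_collect_feedback(topics)  # e.g., "these two words belong together"
        if feedback is None:                                 # user is satisfied
            return model
        tree_prior = encode_feedback_as_tree_prior(feedback) # word correlations become a tree prior
        model = fit_tree_lda(corpus, num_topics, prior=tree_prior)
        topics = model.top_words()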

123 of 175

123

124 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

124

125 of 175

Incorporating Human Feedback: Taxonomy

Dataset updates: change the dataset

Loss function updates: add a constraint to the objective

Parameter space updates: change the model parameters

125

126 of 175

Learning from Interaction: Datasets Updates

Data augmentation

Weak supervision

Active learning

Model-assisted adversarial labeling

126

127 of 175

Datasets Updates: Weak Supervision

Try using noisy sources of signal, specified at higher levels of abstraction, to rapidly generate training sets.

127

Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." VLDB 2017.
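
A minimal sketch of the idea, assuming a toy spam task: heuristic labeling functions vote on each example and a simple majority vote combines them (Snorkel instead learns a label model to weight and denoise the votes):

# Minimal sketch of weak supervision: heuristic labeling functions vote on each example.
SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_contains_link(text):  return SPAM if "http" in text else ABSTAIN
def lf_mentions_prize(text): return SPAM if "you have won" in text.lower() else ABSTAIN
def lf_short_reply(text):    return HAM if len(text.split()) < 5 else ABSTAIN

labeling_functions = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def weak_label(text):
    votes = [lf(text) for lf in labeling_functions if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # majority vote over non-abstaining LFs

print(weak_label("Congratulations, you have won! Claim at http://example.com"))  # 1 (SPAM)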

128 of 175

Datasets Updates: Weak Supervision

128

Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." VLDB 2017.

129 of 175

Datasets Updates: Active Learning to update data

Proactively select which data points we want to use to learn from, rather than passively accepting all data points available.

129
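
A minimal sketch of one common strategy, uncertainty sampling, assuming a scikit-learn-style classifier that exposes predict_proba:

import numpy as np

def uncertainty_sampling(model, unlabeled_X, batch_size=10):
    """Pick the unlabeled points the model is least sure about (highest predictive entropy)."""
    probs = model.predict_proba(unlabeled_X)              # assumes a scikit-learn-style classifier
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-batch_size:]              # indices to send to human annotators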

130 of 175

Datasets Updates: Data Augmentation

130

  • Token-level augmentation:
    • Synonym replacement (Yang et al. 2015, Zhang et al. 2015, Miao et al. 2020)
    • Random insertion, deletion, swapping (Xie et al. 2019, Wei and Zou 2019)
    • Word replacement via LM (Wu et al. 2019, Zhu et al. 2019)
  • Sentence-level augmentation:
    • Paraphrasing (Xie et al. 2019, Chen et al. 2020)
    • Conditional generation (Zhang and Bansal 2019, Yang et al. 2020)
  • Adversarial augmentation:
    • Whitebox methods (Miyato et al., 2017; Zhu et al., 2020; Jiang et al., 2019; Chen et al., 2020d)
    • Blackbox methods (Ren et al. 2019; Garg and Ramakrishnan, 2020)
  • Hidden space augmentation:
    • Mixup (Zhang et al., 2019, Chen et al. 2020)
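
As a concrete (and deliberately simple) sketch of the token-level operations in the first bullet above, EDA-style random swap and deletion can be implemented in a few lines:

import random

def random_swap(tokens, n=1):
    """Token-level augmentation: randomly swap n pairs of tokens."""
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1):
    """Token-level augmentation: drop each token with probability p (keep at least one)."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

sentence = "the movie was surprisingly good".split()
print(random_swap(sentence), random_deletion(sentence))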

131 of 175

Datasets Updates: mixup for text data

131

132 of 175

Datasets Updates: mixup for text data

132

Chen, Jiaao, Zichao Yang, and Diyi Yang. "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification." ACL 2020.
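
A hedged sketch of the core interpolation; in MixText the mixing happens at an intermediate Transformer layer, which is abstracted away here:

import torch

def mixup_hidden(h_i, h_j, y_i, y_j, alpha=0.4):
    """Interpolate two examples' hidden states and (one-hot) labels with lambda ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h_mix = lam * h_i + (1 - lam) * h_j
    y_mix = lam * y_i + (1 - lam) * y_j
    return h_mix, y_mix

h_i, h_j = torch.randn(768), torch.randn(768)          # hidden states of two examples
y_i, y_j = torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0])
h_mix, y_mix = mixup_hidden(h_i, h_j, y_i, y_j)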

133 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

133

134 of 175

Learning from Interaction: Loss function updates

Unlikelihood learning

Add regularization to specific model behavior

Infer constraints from expert feedback

134

135 of 175

Loss Function Updates: Unlikelihood Learning

Penalize undesirable generations (e.g. not following control, repeating previous context)

135

If C is the set of previously seen tokens, the result is less repetition and more diversity

Welleck, Sean, et al. "Neural text generation with unlikelihood training." ICLR (2019).
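
A reconstruction of the token-level unlikelihood objective (notation adapted from Welleck et al.; see the paper for the exact form): for negative candidate tokens C^t at step t (e.g., tokens already seen in the previous context),

\mathcal{L}^{t}_{\text{UL}}(p_\theta, \mathcal{C}^{t}, x_{<t}) = -\sum_{c \in \mathcal{C}^{t}} \log\bigl(1 - p_\theta(c \mid x_{<t})\bigr),
\qquad
\mathcal{L} = \mathcal{L}_{\text{MLE}} + \alpha\, \mathcal{L}_{\text{UL}}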

136 of 175

Loss Function Updates: Infer Constraints from Expert Feedback

Use counterfactual or contrasting examples to improve generalization via an auxiliary training objective

136

Teney, Damien, Ehsan Abbasnedjad, and Anton van den Hengel. "Learning what makes a difference from counterfactual examples and gradient supervision." Computer Vision–ECCV 2020:

137 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human feedback
  • Limitations of human feedback

137

138 of 175

Learning from Interaction: Parameter updates

Model editing

Concept bottleneck model

Parameter efficient fine-tuning (adapter, prefix)

Reinforcement learning from human feedback

Learning from “diff” or corrections

138

139 of 175

Model Editing uses a single desired input-output pair to make fast, local edits to a pre-trained model

139

Mitchell, Eric, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. "Fast model editing at scale." arXiv preprint arXiv:2110.11309 (2021).

Transform the gradient obtained by SFT using a low-rank decomposition of the gradient to make the parameterization of this transformation tractable.
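
A very rough sketch of that idea (illustrative only; MEND’s actual editing networks and training procedure are more involved): for a linear layer, the fine-tuning gradient factors into the layer input u and the backpropagated delta d, and a learned editing network transforms these factors before a rank-1 update is applied.

import torch

def apply_edit(weight, u, d, edit_net):
    """Rank-1 edit of one linear layer's weights; edit_net is a learned, hypothetical module."""
    u_t, d_t = edit_net(u, d)               # learned transformation of the gradient factors
    return weight - torch.outer(d_t, u_t)   # low-rank parameter update for this single edit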

140 of 175

Parameter updates: Concept Bottleneck Model trains model to explicitly use human-provided concepts

140

Koh, Pang Wei, et al. "Concept bottleneck models." International Conference on Machine Learning. PMLR, 2020.
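
A minimal PyTorch sketch of the structure (ours, not the paper’s code): the label is predicted only from the predicted concepts, and a human can override the concept layer at test time:

import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Input -> human-interpretable concepts -> label; humans can inspect or intervene on concepts."""
    def __init__(self, input_dim, num_concepts, num_classes):
        super().__init__()
        self.concept_predictor = nn.Linear(input_dim, num_concepts)
        self.label_predictor = nn.Linear(num_concepts, num_classes)

    def forward(self, x, concept_override=None):
        concepts = self.concept_predictor(x).sigmoid()
        if concept_override is not None:      # human intervention on the concept layer
            concepts = concept_override
        return concepts, self.label_predictor(concepts)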

141 of 175

Parameter updates

Parameter-Efficient Fine-tuning uses small amounts of interaction data to steer models towards desired behaviors

141
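
A minimal sketch of one such method, a bottleneck adapter (dimensions are illustrative): a small trainable module is inserted into an otherwise frozen Transformer layer, so only a few parameters are updated on the interaction data.

import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual connection."""
    def __init__(self, hidden_dim=768, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))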

142 of 175

Parameter updates: Parameter-Efficient Fine-tuning uses small amounts of interaction data to steer models towards desired behaviors

142

Unlearn What You Want to Forget: Efficient Unlearning for LLMs. Jiaao Chen, Diyi Yang. EMNLP 2023
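A hedged sketch of one common PEFT module, a bottleneck adapter: only these small weights are trained on the interaction data while the backbone stays frozen. Sizes are illustrative, and this is not the specific method of the cited unlearning paper.

import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_dim, bottleneck_dim=64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.ReLU()

    def forward(self, hidden_states):
        # Residual connection keeps the frozen backbone's behavior as the default.
        return hidden_states + self.up(self.act(self.down(hidden_states)))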

143 of 175

Incorporating Human Feedback: Taxonomy

Dataset updates: change the dataset

Loss function updates: add a constraint to the objective

Parameter space updates: change the model parameters

143

144 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

144

145 of 175

Learning from “Bad” Feedback

Preparation:

  • Step 1: Train models
    • Satisfaction classifier (❓→ 🍋, ❓→ 🍒)
    • Reply corrector (🍋→🍒)
  • Step 2: Infer the missing labels
  • Step 3: Convert 🍋 → 🍒
  • Step 4: Collect the fruits and re-train 🤖

Ingredients:

20% binary 👍/👎 labels, 20% gold corrections, 100% free-form textual feedback
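A hedged pseudocode rendering of the recipe above; the function, argument, and label names are illustrative, not the paper's implementation.

def relabel_and_retrain(dialogues, satisfaction_clf, reply_corrector, dialogue_model):
    """dialogues: (context, bot_reply, label) triples, where label is '🍒', '🍋', or None (❓)."""
    retraining_set = []
    for context, bot_reply, label in dialogues:
        if label is None:                             # ❓: infer 🍒 / 🍋 with the classifier
            label = satisfaction_clf(context, bot_reply)
        if label == "🍋":                             # 🍋: rewrite into a good reply
            bot_reply = reply_corrector(context, bot_reply)
        retraining_set.append((context, bot_reply))   # collect the 🍒 replies
    dialogue_model.retrain(retraining_set)            # assumed retraining interface
    return dialogue_model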

146 of 175

Re-train a supervised dialogue model

Example dialogue with human feedback:

Bot: Do you like any other fruits besides plums?
Human: I also like watermelons.
Bot: I like watermelons too! Have you heard of Harry Styles?
Human: The bot changes the topic suddenly.
Correction: “I like watermelons too! They are refreshing.”
Bot: I like watermelons too! They are refreshing.
Human: I know! Any recommended recipe for watermelon slush?
Bot: I am not sure. Can you help me understand Greek Mythology?
Human: I’d like to talk about fruits.

1) Train models: the satisfaction classifier and the reply corrector.

2) Infer ❓ labels with the classifier: “I like watermelons too! They are refreshing.” → 🍒; “I am not sure. Can you help me understand Greek Mythology?” → 🍋.

3) Convert 🍋 to 🍒 with the corrector: “I like watermelons too! Have you heard of Harry Styles?” → “I like them too! Do you like fruit juice?”; “I am not sure. Can you help me understand Greek Mythology?” → “Sure. Here is a recipe for watermelon slush!”

4) Collect the 🍒 replies (“Do you like any other fruits besides plums?”, “I like watermelons too! They are refreshing.”, “I like them too! Do you like fruit juice?”, “Sure. Here is a recipe for watermelon slush!”) and re-train the supervised dialogue model.


150 of 175

  • Supervised reply corrector is better, at least for 3B models

  • Further picking out the correctable lemons helps

151 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

151

152 of 175

Incorporating Different Levels of Feedback

Incorporate different levels of human feedback via RL

✍️Local Feedback

Highlighted words or phrases

Speaker's intents

Identifiable events/topics

✍️Global Feedback

Judgments of the coherence, coverage, and overall quality…

152

Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).

153 of 175

Case Study: Collecting Local Feedback

153

154 of 175

Case Study: Collecting Global Feedback

154

155 of 175

Case Study: Incorporating Different Levels of Feedback

155

156 of 175

Case Study: Reward Modeling for Summarization

Learn the HITL summarization policy by maximizing the combined reward

156
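A hedged sketch of how local and global feedback could be combined into a single scalar reward; the weights, the span-coverage heuristic, and the 1-5 rescaling are illustrative assumptions, not the exact reward of Chen et al.

def combined_reward(summary, highlighted_spans, global_score, w_local=0.5, w_global=0.5):
    """highlighted_spans: phrases annotators marked as salient; global_score: 1-5 quality rating."""
    covered = sum(span.lower() in summary.lower() for span in highlighted_spans)
    r_local = covered / max(len(highlighted_spans), 1)   # fraction of highlighted content covered
    r_global = (global_score - 1) / 4                    # rescale the 1-5 judgment to [0, 1]
    return w_local * r_local + w_global * r_global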

157 of 175

Case Study: Evaluation of HITL Summarization

157

Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).

158 of 175

Case Study: Converting Feedback into Principles

158

159 of 175

Case Study: Converting Feedback into Principles

159

Petridis, Savvas, Ben Wedin, James Wexler, Aaron Donsbach, Mahima Pushkarna, Nitesh Goyal, Carrie J. Cai, and Michael Terry. "ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles." arXiv preprint arXiv:2310.15428 (2023).

160 of 175

Reinforcement Learning from Human Feedback

160
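A minimal sketch of the standard pairwise (Bradley-Terry-style) reward-model loss commonly used in RLHF: the reward of the human-preferred response should exceed that of the rejected one.

import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """reward_*: (batch,) scalar rewards the reward model assigns to each response."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()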

161 of 175

Constitutional AI: Harmlessness from AI Feedback

161

Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).
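A hedged sketch of the critique-and-revision loop in Constitutional AI's supervised phase; llm is an assumed text-completion callable and the prompt wording is illustrative, not the paper's exact templates.

def constitutional_revision(llm, prompt, principles, n_rounds=1):
    response = llm(prompt)
    for _ in range(n_rounds):
        for principle in principles:
            critique = llm(f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
                           f"Critique the response according to the principle:")
            response = llm(f"Critique: {critique}\nOriginal response: {response}\n"
                           f"Rewrite the response to address the critique:")
    return response  # (prompt, revision) pairs then serve as supervised fine-tuning data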

162 of 175

Constitutional AI: Harmlessness from AI Feedback

162

163 of 175

Constitutional AI: Harmlessness from AI Feedback

163

164 of 175

Scaling RL from Human Feedback with AI Feedback

164

Lee, Harrison, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. "Rlaif: Scaling reinforcement learning from human feedback with ai feedback." arXiv preprint arXiv:2309.00267 (2023).
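A hedged sketch of the AI-feedback labeling step in RLAIF: an off-the-shelf LLM judges which of two candidate summaries is better, and these AI preferences stand in for human labels when training the reward model. llm and the prompt are assumptions, not the paper's exact setup.

def ai_preference_label(llm, document, summary_a, summary_b):
    verdict = llm(f"Document:\n{document}\n\nSummary A:\n{summary_a}\n\n"
                  f"Summary B:\n{summary_b}\n\nWhich summary is better? Answer A or B:")
    return 0 if verdict.strip().upper().startswith("A") else 1

# The resulting preference tuples feed the same pairwise reward-model loss as in RLHF,
# followed by RL fine-tuning of the policy.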

165 of 175

Scaling RL from Human Feedback with AI Feedback

165

Humans strongly prefer RLHF and RLAIF summaries over those from the supervised fine-tuned (SFT) model

166 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

166

167 of 175

Limitations of Human Feedback

  • Human preferences can be unreliable
  • Reward hacking is a common problem in RL

167

168 of 175

Limitations of Human Feedback

  • Human preferences can be unreliable
  • Reward hacking is a common problem in RL
  • Chatbots may be rewarded for producing responses that seem authoritative, long, and helpful, regardless of truthfulness
  • Who is providing this feedback to LLMs?
  • Whose values get aligned or represented?

168

169 of 175

Learning from Interactions Outline

  • Different types of human feedback
  • Learning from human feedback
    • Dataset updates (weak supervision, data augmentation)
    • Loss function updates (unlikelihood learning)
    • Parameter space updates (parameter efficient fine-tuning, model editing)
  • Learning from bad human feedback
  • Learning from multiple levels of human/AI feedback
  • Limitations of human feedback

169

170 of 175

Conclusion and Future Directions

170

171 of 175

Takeaway

Human-AI Interaction is another layer (?) alongside models, with its own set of taxonomies.

  • Goals: Objective functions
  • Design choices: Hyperparameters
  • Clickstreams & interviews: Data logs

All more user-oriented – user-centered “optimization”!

171

172 of 175

Open Questions in “Designing Interactions”

Thinking about what users want.

Interfaces as an extension of humans, instead of an extension of technology.

Making users comfortable with novel interactions.

Focusing on building trust, ensuring appropriate reliance, and considering other human factors.

Designing interactions that are equitable across diverse populations.

By considering accessibility and gender, racial, cultural, and other demographic differences.

172

173 of 175

Open Questions in “Evaluating Interactions”

Scale up the evaluation.

Participant recruitment? Trolling removal? Qualitative analysis?

Evaluate dynamic interactions, in the wild.

General-purpose models mean fewer pre-defined interactions. How do we capture effectiveness when people self-initiate task formulations?

Make interaction a benchmarking task.

Got 1000 scores on 1000 interactions for 1 model, now what? Use diverse interaction evaluations to reflect a model’s practical usability!

173

174 of 175

Open Questions in “Learning from Interactions”

Going beyond labels or numbers.

Fine-grained interaction types need to be modeled.

Going beyond single-turn preference.

Interactions can unfold over multiple turns or dynamically.

Going beyond knowns and exploring unknowns in human-AI interactions.

Diverse opinions, cultures and values in learning from interactions.

174

175 of 175

Designing, Evaluating, and

Learning from Human-AI Interactions

Sherry Tongshuang Wu

CMU, @tongshuangwu

Diyi Yang

Stanford, @diyi_yang

Sebastin Santy

UW, @sebastinsanty

https://tinyurl.com/emnlp2023-hai-tutorial