Designing, Evaluating, and
Learning from Human-AI Interactions
Sherry Tongshuang Wu
CMU, @tongshuangwu
Diyi Yang
Stanford, @diyi_yang
Sebastin Santy
UW, @sebastinsanty
Dec 6, 9am-12:30pm
Leo 3 & 4 (also hybrid)
Human-AI Interactions
2
Credits: Hyunwoo Kim (Human) + Bing (AI)
Human-AI Interaction: What is it?
3
Basically, a field where humans and AIs interact.
AI-based translations
AI-based grammar correction
Recommendation systems
Human-AI Interaction: What is it?
4
Basically, a field where humans and AIs interact.
Humans: AI researchers, model developers, domain experts, end users.
AIs: dialog systems, translators, recommender systems, autonomous driving systems.
Interact:
Humans collaborate with AI,
Humans get assistance from AI,
Humans analyze AI,
AI helps humans,
& many other forms
Human-AI Collaboration
5
The cooperative and coordinated interaction between humans (mostly non-AI experts) and AI to solve complex problems or achieve certain goals.
Humans get assistance from AI-infused apps
6
Humans are still mostly end users and domain experts.
The big difference is that AI is not a partner but a tool (and part of “AI-infused applications”)
Humans analyze Models
7
Experts systematically analyze AI models, and go beyond aggregated scores.
https://erroranalysis.ai/ Adaptive Testing and Debugging of NLP Models (Ribeiro & Lundberg, ACL 2022)
“Understanding the broader terrain of errors is an important starting point in pursuing systems that are robust, safe, and fair…[We need to] identify cohorts with higher error rates and diagnose the root causes behind these errors.”
Eric Horvitz / Microsoft, 2021
How do we figure out the “Interaction”?
8
Given a human and an AI…
Design
Why should they interact? How do we make it happen?
(self-)selected
Already exist & (we think) usable
Evaluate
Have we achieved what we want to achieve?
How do we figure out the “Interaction”?
9
Design
Evaluate
Given a human and an AI…
Learn from
Why should they interact? How do we make it happen?
(self-)selected
Already exist but needs improvement!
Have we achieved what we want to achieve?
How do we make AIs more usable?
How do we figure out the “Interaction”?
10
Design
Evaluate
Given a human and an AI…
Learn from
(self-)selected
Already exist but needs improvement!
We will focus on text – It’s “straightforward” but complicated at the same time, open-ended, and part of the multi-modal world. And it’s relevant!
Sherry Wu
Diyi Yang
Sebastin Santy
Schedule – Fruitful morning ahead
11
Design: 09:15-10:05 (40 mins lecture + 10 mins Q&A)
Evaluate: 10:05-10:30 (25 mins lecture)
Coffee break: 10:30-10:50
Evaluate (cont’): 10:50-11:15 (15 mins lecture + 10 mins Q&A)
Learn from: 11:15-12:05 (40 mins lecture + 10 mins Q&A)
Conclusion: 12:05-12:15
Extended Q&A: 12:15-12:30
You will learn…
12
NLP, from an HCI perspective
Awareness: interaction is another layer on top of models, and it is human-centered
Systematic & up-to-date overview: the design choices around models
Contextualization: how principles map to real-world cases
“Human-AI Interactions”
Designing
13
Norman Doors
Have you ever come across a door that you tripped over, bumped into, or were confused about how to operate?
14
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
❓ Guess whether to push or pull
🔎 Can’t locate a place to push or pull
➜ Try to push, but the door actually slides
You are not alone
Bad design is everywhere
From doors, to everyday objects & machines designed by people
including “AI”
15
Norman “AI”?
Not just doors: this happens with many poorly designed machines, including “AI”
18
| Door 🚪 | AI 🤖 |
What the user wants to do | “How do I get to next room?” | “How do I solve my task?” |
What the user ends up doing | “How should I operate the door to get to next room?” | “How should I prompt the model to get it to solve my task?” |
How does a user learn “how to use”? | - From previous encounter - Read labels - Take a guess and try | - From other people - Read prompt guidelines - Wing it |
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
Design that disappears
“The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it.”
— Mark Weiser
They don’t require an instruction manual. You use them once or twice, and from then on you barely notice them. E.g., pointing devices, touchscreens, “literacy”.
19
“The father of ubiquitous computing”
Weiser, Mark. "The Computer for the 21st Century." Scientific american 265.3 (1991): 94-105.
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
How do we achieve this?
By extending humans instead of extending technology, reducing “frictions of learning”
20
(Spectrum from extending the computer to extending the human: Punch Cards → Command Line → Pointing Device → GUI Icons → Personal Computing, i.e., from computer functioning toward human functioning, via object metaphors and action metaphors.)
How do we achieve this?
In AI, we’ve only been thinking about extending the technology.
21
(Spectrum from extending the AI to extending the human: Using Code → Prompting → Chatting, i.e., from AI functioning toward human functioning.)
Why does it happen?
22
Technology-centric design: a design process for situations where a class of technologies already exists, but user domains for co-development are not clearly established. “If you have a hammer, everything looks like a nail”: a solution in search of a problem.
User-centered design (UCD): an iterative design process in which designers focus on the users and their needs in each phase of the design process.
Bly, Sara, and Elizabeth F. Churchill. "Design through matchmaking: technology in search of users." interactions 6.2 (1999): 23-31.
Blog Post: The Biggest Bottleneck for LLM startups is UX
These issues really only surface once someone starts trying to use the product in the context of their daily workflow. This is how you go from “cool” to “useful.”
These challenges are always present, regardless of the system’s accuracy (within some bounds). It doesn’t matter whether the LLM’s accuracy is 80% or 95%; the user still needs to reason through failure modes and understand what to expect when interacting with the system. You are better off getting to a baseline accuracy that is good enough and then building a product that lets users know how to work around the model.
“You are good at designing things we cannot build. We are good at making things that users don’t use.”
23
The biggest bottleneck for large language model startups is UX
Yang, Qian, et al. “Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI 2019.
New models → new AI interactions
24
ChatGPT
Chatting
Pushing language as the primary interface for every task
Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).
Anthropomorphic Tendencies
The act of projecting human-like qualities or behavior onto non-human entities, in this case, AI
New models → new AI interactions
25
ChatGPT
Anthropomorphic tendencies: the urge to communicate like humans do.
Language is flexible: no specification means there is no single way of instructing; suitable for personal tasks.
Language is imprecise: controlling for desired outputs can be difficult; unsuitable for tasks that require precision or are critical.
Reeves, Byron, and Clifford Nass. "The media equation: How people treat computers, television, and new media like real people." Cambridge, UK 10.10 (1996).
Subtler Interactions (that disappear)
26
Voice Assistants
Email Autocomplete
Recommendations
Word suggestions
Subtitles
Code Completion
Design Thinking
27
What is design thinking?
As engineers, we immediately jump to finding solutions for a problem that we come across.
28
Problem
Solution 1
Solution 2
Solution 3
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
What is design thinking?
It is not about the exact number; asking “why” repeatedly puts the emphasis on getting to the root cause of the problem instead of looking at the surface level.
29
(Diagram: asking “Why?” repeatedly expands the initial Problem into Problem 1, Problem 2, Problem 1.1, and so on, before jumping to Solutions 1-3.)
Ask the “Five whys” to get to the root cause of the problem.
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
“Within a design context, framing is often seen as the key creative step that allows an original solution to be produced.
Designers report on the need to get to ‘the problem behind the problem’ (as initially presented by the client), and about creating a ‘fresh perspective.’ ”
— Bec Paton and Kees Dorst
“If I had asked people what they wanted, they would have said faster horses.”
— Henry Ford
30
“The inventor of production car”
The core of ‘design thinking and its application, https://www.sciencedirect.com/science/article/pii/S0142694X11000603g
What is design thinking?
31
How to incorporate design thinking?
The “Double Diamond” Method
32
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
First Diamond: find the specific problem.
Second Diamond: find the specific solution.
How to incorporate design thinking?
The “Double Diamond” Method
33
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
Four Steps:
Find the problem
Find the solution
Discover Problem
Discover: Understand the issue rather than merely assuming it. It involves researching, speaking to and spending time with people who are affected by the issues.
Market Research: stakeholder interviews, checking raised tickets, traffic and sales analysis, competitive audits.
Field Study: site visits, ethnography to observe people doing their own tasks in their own setting.
Interviews and Surveys: to collect information on people’s reactions to existing products and conditions.
Environmental Factors: understand the context and its needs.
34
Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).
Define Problem
Define: The insight gathered from the discovery phase can help to define the challenge in a different way.
Affinity Diagrams: to group and explore the structure of information.
Perspective Framing: participatory design to develop a consensus view of the overall process.
Task & Information Analysis: learning about relationships between tasks and information; creating logical groups from the users’ point of view.
35
Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).
Develop Solution
Develop: Give different answers to the clearly defined problem, seeking inspiration from elsewhere and co-designing with a range of different people.
Rapid Prototyping: physical realizations of the research and design process in a tangible form. Can be used to get a sense of what it would be like to experience the product/service. Goes from low fidelity (paper) to high fidelity (systems).
Storytelling: construct situations where a specific user in a specific context would go about solving the problem with different solutions.
Minimum Viable Product
36
Quesenbery, Whitney, and W. Whitney. "Choosing the right usability technique: Getting the answers you need." User Friendly 2008-Innovation for Asia (2008).
The “Double Diamond” Method
37
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013.
Four Steps:
But this is not once and for all
Iterative Design
38
Norman, Don. The design of everyday things: Revised and expanded edition. Basic books, 2013. DreamEndState.com
Find the problem
Find the solution
It is a spiral along the time axis
Example Task: Make writing faster for humans
39
Yang, Qian, et al. "Sketching nlp: A case study of exploring the right things to design with language intelligence." CHI. 2019.
Example Task: Make writing faster for humans
40
Problem Space → Solution Space:
“Typo errors slow me down” → Autocorrect word. Use: on phones, for already-typed words with spelling mistakes, when going back takes too much time.
“I know the word, takes time to type” → Next word suggestions. Use: on phones, when one cannot think of new words, or the word is obvious but takes time to write.
“I know the phrase, takes time to type” → Next phrase suggestion. Use: on desktops, to breeze through obvious phrases that otherwise take time to write.
“I have writer’s block starting from scratch” → Word copilot. Use: on desktops, for large chunks of text together; helps when one needs to start from scratch.
Develop Solution: Prototype
Wizard-of-Oz: fake features so that the user thinks the responses are computer-driven when they are actually human-controlled. Challenge for NLP: AI errors are hard to simulate.
LM prototypes / scaffolds: use LMs to build prototypes, or to simulate users that may use the system.
Mimic simple functionality: ensemble multiple tools, LMs, simple models, and expectations. Challenge for NLP: cannot simulate SOTA model capabilities.
41
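A minimal sketch of the Wizard-of-Oz setup, assuming a hypothetical `get_wizard_response` hook: the participant chats with what looks like a system, while a hidden human operator actually types the replies.

```python
import time

def get_wizard_response(user_message: str) -> str:
    """Hypothetical hook: route the message to a hidden human operator's
    console and block until they type a reply."""
    return input(f"[wizard console] user said: {user_message!r}\nreply> ")

def chat_session():
    print("Welcome! Ask the assistant anything (type 'quit' to stop).")
    while True:
        user_message = input("you> ")
        if user_message.strip().lower() == "quit":
            break
        reply = get_wizard_response(user_message)
        time.sleep(1.5)  # fake "model latency" to keep the illusion plausible
        print(f"assistant> {reply}")

if __name__ == "__main__":
    chat_session()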
Develop Solution: Prototype Persona
Algorithmic persona: human roles that users assign to the algorithm to explain the algorithm’s goals, behaviors, and characteristics.
42
Wu, Eva Yiwei, Emily Pedersen, and Niloufar Salehi. "Agent, gatekeeper, drug dealer: How content creators craft algorithmic personas." CSCW 2019
Develop Solution: LM as a Prototyper / Scaffolding
43
Petridis, Savvas, Michael Terry, and Carrie Jun Cai. "PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Models." Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.
NLP & Interfaces
GUI for an NLP system, aka “a wrapper on top”: e.g., Autocomplete, Google Translate.
Natural language UI, aka “the wrapper itself”: e.g., Alexa, Google Search, GPT4 Web Search, AutoGPT.
44
Interfaces vs. Interactions
45
Interfaces: tangible, i.e., visual, auditory, tactile inputs to the system. Designing interfaces happens at the surface and physical level, often limited to styling.
Interactions: understanding the context of users, their needs and requirements, and how they operate. Includes psychological aspects such as trust, goals, and user behavior.
Beyond Interfaces
46
Aspects of Interaction
Cognition, Perception, and Extending Humans: offload cognitively demanding tasks; make space for humans to be creative.
Trust, Reliance, and Other User-Machine Behavior: users feel comfortable using and depending on AI systems to achieve their tasks.
Fairness, Accountability, Transparency, Ethics: ensure equitable treatment of all individuals, regardless of race or gender; do not perpetuate bias or discrimination, and enable understanding of model decisions.
Personalization, Adaptation, Feedback and Guidance: tailor AI interactions to individual user preferences and needs, and in turn learn and improve over time to align with human preferences.
47
Cognition, Perception and extending humans
48
(Object Metaphors)
A metaphor explains what a system might be capable of: it communicates expectations of what can and cannot be done.
Visual, audio, conceptual, and textual metaphors, e.g.:
- Textual: Web, Crawling, Load, Fetch; Email, Thread, Port, Address
- Audio: camera shutter sound, phone lock sound
AI metaphors: Stochastic Parrots, Intelligent Agent
Cognition, Perception and extending humans
49
Khadpe, Pranav, et al. "Conceptual metaphors impact perceptions of human-AI collaboration." Proceedings of the ACM on Human-Computer Interaction 4.CSCW2 (2020): 1-26.
Referring to AI with a specific name / metaphor has an effect on how it is perceived, and even how it is used.
Bots with metaphors of high warmth but low competence were preferred overall
Effects of chatbot naming
Cognition, Perception and extending humans
50
(Action Metaphors)
Manipulate objects like you do in the real world. Real-world metaphors for objects and actions can make it easier for a user to learn and use an interface.
- Direct manipulation: drag and drop, resizing elements
- Personal assistant: “set reminder”, “get weather”, “send message”
- Interactive stories: setting your own character actions or dialogue options, creating a personalized storyteller
Trust and Reliance
51
Vasconcelos, Helena, et al. "Explanations can reduce overreliance on ai systems during decision-making." Proceedings of the ACM on Human-Computer Interaction 7.CSCW1 (2023): 1-38.
Trust: a belief or confidence in the integrity, reliability, and honesty of a person, organization, or thing.
Reliance: depending on someone or something to perform a specific function or task, irrespective of whether trust is present.
Often, Trust → Reliance
Imperfect agents prone to making errors require a trusting relationship.
Trust and Reliance
52
Buçinca, Zana, Maja Barbara Malaya, and Krzysztof Z. Gajos. "To trust or to think: cognitive forcing functions can reduce overreliance on AI in AI-assisted decision-making." Proceedings of the ACM on Human-Computer Interaction 5.CSCW1 (2021): 1-21.
> Showing Explanations and giving people agency
> Showing Uncertainty
> On demand
> Wait
Fairness
Accountability
53
Liebling, Daniel J., et al. "Unmet needs and opportunities for mobile translation AI." Proceedings of the 2020 CHI conference on human factors in computing systems. 2020.
What is the real-life cost of mistranslation?
How well does it work for diverse populations?
Sun, Jiao, et al. "Pretty princess vs. successful leader: Gender roles in greeting card messages." Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 2022.
Transparency
Ethics
54
Flathmann, Christopher, et al. "Modeling and guiding the creation of ethical human-AI teams." Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society. 2021.
How do we open the black box and understand model decisions?
Cabrera, Ángel Alexander, et al. "Zeno: An interactive framework for behavioral evaluation of machine learning." Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 2023.
What are the morals and values encoded in the system?
Interaction Initiative
55
Mixed initiative systems allow users to interact with them in a collaborative way, where the user and the system both take an active role in carrying out tasks or making decisions.
Advocates elegant coupling of automated services with direct manipulation.
“Autonomous actions should be taken only when an agent believes that they will have greater expected value than inaction for the user.”
Reflection
Design should be seamless, without the need for an instruction manual.
How? By extending humans and their natural ways of interacting with the real world.
What is the process? By using design-thinking frameworks, with an emphasis on finding the root cause of the problem, finding solutions, and iterating back and forth until we build applications that users want to use.
Are there other human factors to consider? Beyond the visible elements on the interfaces and the psychological and cognitive aspects, it is important to be aware of the underlying trust in and reliance on the system, and its implications in the real world.
56
“Human-AI Interactions”
Evaluating
57
Key desiderata in evaluation (also outline :D)
Objectives and goals: “What do I need to know?”
Tasks: “What should users do so I find out what I need to know?”
Data: “What data do I collect to find out what I need to know?”
Analysis: “How do I crunch the numbers to find out what I need to know?”
58
Goals: HALIE – Consider Humans & Interactions
59
Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023
Traditional evaluation: on the outputs of the models themselves.
Criteria:
Output quality
Perspective:
third party
Target:
Output
Goals: HALIE – Consider Humans & Interactions
60
Lee, Mina, et al. "Evaluating human-language model interaction." TMLR 2023
Interaction centric evaluation: On interactions between humans and models.
Criteria:
Output quality, human preference
Perspective:
third party, first-person experience
Target:
Output, process
Consider a case study: Human-LM Co-Writing
61
Lee, Mina, Percy Liang, and Qian Yang. "CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities." CHI 2022
Consider a case study: Human-LM Co-Writing
62
Criteria:
Quality: perplexity, etc.
Preference: Which option(s)?
Perspective:
third party: perplexity
First-person: everyone has their own choice; thousands of divergences from the same starting point
Target:
Output: Final story
Process: Snapshot of how the article is constructed
Key desiderata in evaluation (also outline :D)
Objectives and goals: “What do I need to know?”
General mindset of “human and interaction first”, then map to task.
Tasks: “What should users do so I find out what I need to know?”
Data: “What data do I collect to find out what I need to know?”
Analysis: “How do I crunch the numbers to find out what I need to know?”
64
Going back to our Co-writing case…
65
Goal – Understand LLM capabilities in supporting human writing
➡️ Task – Vary types of writing (Creative & Argumentative), prompts, and model randomness
➡️ Data – What’s collected?
Human preferences, final articles, etc.
Going back to our Co-writing case…
66
Most important data: Interaction trace
Rich metadata that reflect the entire process of interaction, with each person.
(Example trace: <Write> “Once upon a time,” → <Query> → <Accept> “there was a great mage who lived in a tower.” → <Edit> “funny” → <Query> ×3 → <Accept> “He loved to play tricks on people and make them love.” → <Move> ×2)
Credit: Mina Lee’s job talk!
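A sketch of what collecting such a trace can look like in practice: one timestamped JSON event per user action, appended to a log that can later be replayed. The event names mirror the trace above; the schema itself is an assumption.

```python
import json, time

LOG_PATH = "interaction_trace.jsonl"  # one JSON event per line (assumption)

def log_event(session_id, event_type, payload=None):
    """Append one timestamped interaction event (write/query/accept/edit...)."""
    event = {
        "session": session_id,
        "t": time.time(),
        "type": event_type,
        "payload": payload or {},  # e.g. the accepted suggestion text
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

# log_event("s01", "write", {"text": "Once upon a time,"})
# log_event("s01", "query")
# log_event("s01", "accept", {"suggestion": "there was a great mage..."})
```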
Going back to our Co-writing case…
67
Events and clickstreams
States of everything at every moment
Key desiderata in evaluation (also outline :D)
Objectives and goals: “What do I need to know?”
General mindset of “human and interaction first”, then map to task.
Tasks: “What should users do so I find out what I need to know?”
Data: “What data do I collect to find out what I need to know?”
Interaction trace & open-ended feedback form a comprehensive picture.
Analysis: “How do I crunch the numbers to find out what I need to know?”
69
Going back to our Co-writing case (again)
71
But most important data: Interaction trace
Rich metadata that reflect the entire process of interaction, with each person.
Things you can already see…
Multiple queries mean the first several suggestions were bad;
Edits give us the cost-efficiency thresholds people have in mind, between “useful enough to edit” and “let me start from scratch”
Existing metrics can help quantify effects
72
Some analyses already done by the authors!
Q: Can GPT-3 generate fluent text in response to user text?
A: Text written by user + GPT-3 had fewest errors and most diverse vocabulary
Define metrics for what needs to be measured
73
Some analyses already done by the authors!
Q: Can GPT-3 contribute new ideas to users’ stories?
A: Ideas generated and reused by users in the subsequent writing.
“Reused named entity” is a new (lower-bound) metric defined for ideation quantification!
How the dataset can be further used
74
Writers' behaviors and writing outcomes over time
Can we observe novelty effect and longitudinal change?
Do LMs homogenize writing by providing similar suggestions to all users?
Linguistic accommodation
How does the style, voice, or tone of a writer or LM influence that of the other over time? Is the influence uni-directional or bi-directional?
Edit traces
What can we learn from human edits on LM outputs?
Can we train LMs on edit traces to emulate human edits?
Metrics that exist for Human-LM Co-Writing
75
Shen, Hua, and Tongshuang Wu. "Parachute: Evaluating interactive human-lm co-writing systems." In2Writing 2023.
We tend to have more qualitative metrics…
Key desiderata in evaluation (also outline :D)
Objectives and goals: “What do I need to know?”
General mindset of “human and interaction first”, then map to task.
Tasks: “What should users do so I find out what I need to know?”
Data: “What data do I collect to find out what I need to know?”
Interaction trace & open-ended feedback form a comprehensive picture.
Analysis: “How do I crunch the numbers to find out what I need to know?”
Comprehensive metrics that cover different aspects; don’t reinvent the wheel!
79
Evaluation – From mindset to method
80
Mixed method evaluation
Think aloud analysis
And, of course, one more case study
Important method: “Think Aloud”
A research method used to gain insight into a person's thought processes as they perform a task or solve a problem. The participant is asked to verbalize their thoughts as they perform the task, which allows the researcher to understand how the participant approaches the task. “Thinking aloud may be the single most valuable usability engineering method.” (Jakob Nielsen)
“I’m going to ask you to ____ and while you are doing that, can you tell me whatever you are thinking. Whatever comes into your mind while you are working on that. Okay?”
Protocol
Give participants specific tasks to accomplish (but not HOW to do it)
Have them speak aloud as they complete the tasks
Keep interruptions to a minimum
Ask for open-ended questions & clarification after the task is complete
Learning effect: if you create multiple tasks, watch for biasing the test due to task order
Typically used to test the usability of a website, app or object
81
Important method: Controlled exp. + Mixed method
Mixed methods research combines elements of quantitative research and qualitative research in order to answer research questions. It can provide a more complete picture than a standalone quantitative or qualitative study, as it integrates the benefits of both methods.
82
Quantitative method
Understand the “what”. Precise!
Qualitative method
Understand the “why”. Open-ended!
Quantitative vs. Qualitative
83
| Quantitative | Qualitative |
Definition | Gather numerical data to be analyzed using statistical methods | Gathering descriptive, non-numerical data to be analyzed through interpretation and contextualization |
Data source | surveys, questionnaires, experiments | interviews, observations, and document analysis |
Presentation | tables, graphs, and statistics | quotes and narratives that reflect the participants' experiences and perspectives |
Goal | establish cause-and-effect relationships between variables | gain a deeper understanding of social phenomena, meanings, and processes |
Case Study: Interactive Machine Translation
84
“We present Predictive Translation Memory, an interactive, mixed-initiative system for human language translation. Translators build translations incrementally by considering machine suggestions that update according to the user’s current partial translation.”
Green, Spence, et al. "Predictive translation memory: A mixed-initiative system for human language translation." UIST 2014
85
PTM recap: Rationales for seemingly simple decisions
86
Design: Re-use familiar hotkeys (e.g., CTRL+Enter); typing activates interactions
Translators are fast typists: they want to avoid the mouse
Design: One-column, interleaved layout
Translators spend 20-25% of a translation session reading; a two-column layout would be cumbersome
Design: Text color encoding
Ownership: AI can’t modify human text; humans can accept but not modify AI text
Principle: Horvitz #10 – employing socially appropriate behaviors for agent-user interaction
PTM recap: Rationales for seemingly simple decisions
87
Design: highlight translated words
Principle: Horvitz #11 – maintaining working memory of recent interactions
PTM recap: Rationales for seemingly simple decisions
88
Design: allow for word-to-word query
Principle: Horvitz #6 – allowing efficient direct invocation and termination
Principles for Mixed-initiative user interfaces
89
Developing significant value-added automation (vs. direct manipulation)
Considering uncertainty about a user’s goals
Considering the status of user’s attention (minimize distraction, cost vs. benefit of deferring action)
Inferring ideal action in light of costs, benefits and uncertainties (expected values of actions!)
Employ dialog to resolve key uncertainties (interactions!)
Allowing efficient direct invocation and termination
Minimizing the cost of poor guesses about action and timing
Scoping precision of service to match uncertainty, variation in goals — do less if uncertain!
Providing mechanisms for efficient agent-user collaboration to refine results
Employing socially appropriate behaviors for agent-user interaction
Maintaining working memory of recent interactions
Continuing to learn by observing (e.g., about user’s goals, etc.)
PTM: Experimental Design
90
Comparative analysis
“We compared our system to post-editing, which is a strong baseline [29, 21], and is also the most common commercial use of MT.”
Clear research questions:
Time – PTM faster than post-edit?
Quality – PTM == better translation?
Usage – subjects use interactive aids?
Task: translate French→English or English→German
Source Text: ≈3,000 tokens of News/Medical/Software
Conditions: post-edit (PE) and PTM
Participants: 16 expert subjects per language pair
RQ1: Time – PTM faster than post-edit?
91
Quantitative analysis (find robust evidence)
Metric: log of time (more tolerant of outliers)
Compare mean (for general understanding)
Linear mixed effects models (for understanding significance, important factors)
The key independent variable: translation condition
Learning effect: people get quicker as the task proceeds
More edits mean longer time
Initial translation quality: how much editing is necessary
They had an unbalanced participant pool
Longer source sentences take longer to edit
Potential interactions between independent variables
+ random intercepts/slopes for subject, source text, and genre
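A sketch of such a linear mixed-effects model in statsmodels, assuming a dataframe with one row per translated sentence; the CSV path and column names are illustrative.

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per translated sentence (hypothetical file and column names).
df = pd.read_csv("ptm_trials.csv")

# log(time) ~ condition + learning effect (trial order) + number of edits
# + initial MT quality + source length, with a random intercept per subject.
model = smf.mixedlm(
    "log_time ~ condition + trial_order + n_edits + init_quality + src_len",
    data=df,
    groups=df["subject"],
)
print(model.fit().summary())
```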
RQ1: Time – PTM faster than post-edit?
92
Qualitative analysis (find reasons behind quantitative analysis — codebook!)
Metric: log of time (more tolerant of outliers)
Think aloud (record users’ comments) — Interactive mode takes more time because…
There are more aids to operate and more information to read and analyze:
“Because you spend more time on each word, you have opportunity to see alternative translations.”
MT quality greatly affected the usefulness of the interactive aids:
“If drop-down suggestions are not of a good quality, reading may consume extra time.”
The post-edit mode was easier at first, but interaction is better in the long run.
“I am used to this [post-edit], this is how Trados [the preeminent CAT tool] works.”
Likert Scale survey (can still quantitatively compare users’ subjective judgements): “In which interface did you feel most productive?”
“I would use interactive translation features if they were integrated into a CAT product”
“I got better at using the interactive interface with practice/experience”
RQ1: Time – PTM faster than post-edit?
93
Qualitative and quantitative results usually provide some grounding for each other.
“Post-edit mode was easier at first, but the interactive mode was better once I got used to it. “
"If I had time to use the interactive tool and grow accustomed to its way of functioning, it would be quite useful…"
RQ2: Quality – PTM == better translation?
94
Quantitative analysis (find robust evidence)
Metric: BLEU (automatic eval, has issues, but easier to run)
BLEU: a measure of similarity with the gold reference.
HBLEU: measure of similarity with the initial MT suggestions.
“PTM exposes translators to many more alternatives, encouraging them to deviate further from the initial MT suggestion (lower HBLEU).”
Compare mean
(Also vs. original generated text)
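Both numbers are ordinary corpus-level BLEU, just computed against different “references”: the gold translation for BLEU, and the initial machine suggestion for HBLEU. A sketch with sacrebleu, using toy sentences:

```python
import sacrebleu

final = ["the cat sat on the mat ."]      # translator's final translations
gold = ["the cat sat on the mat ."]       # reference translations
initial_mt = ["a cat sits on the mat ."]  # initial machine suggestions

bleu = sacrebleu.corpus_bleu(final, [gold]).score        # quality proxy
hbleu = sacrebleu.corpus_bleu(final, [initial_mt]).score # closeness to MT
# Lower HBLEU means the translator deviated more from the MT suggestion.
```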
RQ2: Quality – PTM == better translation?
95
Quantitative analysis (find robust evidence)
Metric: Human subject rating
Auto methods are sensitive and noisy, so usually paired with human judgements as well
RQ2: Quality – PTM == better translation?
96
Qualitative analysis (find reasons behind quantitative analysis)
Metric: BLEU & rating (automatic eval, has issues, but easier to run)
When users wanted to render more stylistic translations, PTM was less useful:
“...choosing a very different translation approach (choice of words, idioms with no equivalent in English...) would be like going against the current—but may have provided a better quality.”
“the translator is less susceptible to be creative.”
Why do many participants prefer post-edit?
“I found the machine translations (texts in gray) were of a much better quality than texts generated by Google Translate”
"The translations generally did not need too much editing, which is not always the case with machine translations.”
RQ3: Usage – subjects use interactive aids?
97
Quantitative analysis on 1.1 million UI events
Metric: UI clickstream events (faithful reflection of user pattern)
Usage: 52% more text entered via aids than TransType
Clickstreams are also more objective and comparable with prior work.
In the TransType system, the authors commented that their users often “[accepted] predictions in [their] entirety and then edited to ensure its correctness” and reported that 52% of target characters were typed [36]. In the “prediction+options” experiment conducted by Koehn et al. [29], the authors reported that 36% of the final translations were typed, 36% entered via a mouse click, and 27% entered via the tab key to accept machine translations. When working in our PTM system, users directly utilized machine translations to a greater degree than previously reported.
High-level takeaways
Mindset in evaluation: Human and interaction come first.
Desiderata: Goals > Tasks > Data > Analysis method
Common data: Clickstreams, interviews, etc.
Common evaluation method: Mixed method, think aloud
Common metrics: Find them, or invent them :)
98
Reflection: Where do we find interactions to evaluate?
99
Don’t reinvent the wheel – task delegation has always been a thing in Human Computation!
Crowdsourcing pipeline: break down complex tasks into pieces that can be done independently by humans, then combine them.
LLM chaining: break down complex tasks into pieces that can be done independently by LLMs, then combine them.
Compare human vs. LLM, and LLM vs. LLM, performance on sub-tasks in the same overall context, by replicating crowdsourcing pipelines with LLM chains!
Reflection: Where do we find interactions to evaluate?
100
Find-Fix-Verify (Bernstein et al., 2010): Find problems, Fix the identified problems, Verify these edits
Input (text to be shortened):
“Print publishers are in a tizzy over Apple’s new iPad because they hope to finally be able to charge for their digital editions. But in order to get people to pay for their magazine and newspaper apps, they are going to have to offer something different that readers cannot get at the newsstand or on the open Web.”
Find (verbose parts): “hope to finally … their digital editions”; “in order to get people to… apps”; “offer something… on the open Web”
Fix (shorten the verbose parts and put back in context): “hoping to charge for digital editions”; “to get people to pay for apps”; “offer something unique not available elsewhere” →
“Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique not available elsewhere.”
Verify (fix grammar issues):
“Print publishers are in a tizzy over Apple’s new iPad because they hoping to charge for digital editions. But to get people to pay for apps, they are going to have to offer something offer something unique that is not available elsewhere.”
Reflection: Where do we find interactions to evaluate?
101
Bernstein et al. (2010): Find segments for shortening, Fix by shortening them, Verify by identifying rewrites with errors.
Find: “Identify at least one area that can be shortened without changing the meaning of the paragraph.”
Fix: “Edit the highlighted section to shorten its length without changing the meaning of the paragraph.”
Verify: “Choose at least one rewrite that has significant style errors in it. Choose at least one rewrite that significantly changes the meaning of the sentence.”
Reflection: Where do we find interactions to evaluate?
102
P7: Find segments to shorten, Fix these phrases by shortening them, Verify by fixing grammatical errors.
Find prompt:
“Find three segments that can be shortened from the following text. These segments need to be present in the text.
Text: {input text}
Segments: 1. {output segment}”
Fix prompt:
“Shorten the following text without changing its meaning.
Text: {segment}
Shortened text: {shortened segment}”
Verify prompt:
“Correct the grammar of the following text.
Text: {fixed text}
Corrected text: {grammatical text}”
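A sketch of how these prompts can be chained programmatically; `call_llm` is a placeholder for whatever completion API is used, and only the first found segment is fixed here for brevity.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any text-completion API."""
    raise NotImplementedError

def find_fix_verify(text: str) -> str:
    # Find: ask the LLM for segments that can be shortened.
    segments = call_llm(
        "Find three segments that can be shortened from the following text. "
        "These segments need to be present in the text.\n"
        f"Text: {text}\nSegments: 1."
    )
    segment = segments.splitlines()[0].strip()  # only the first, for brevity
    # Fix: shorten that segment.
    shortened = call_llm(
        "Shorten the following text without changing its meaning.\n"
        f"Text: {segment}\nShortened text:"
    ).strip()
    fixed_text = text.replace(segment, shortened)
    # Verify: clean up grammar errors the substitution may have introduced.
    return call_llm(
        "Correct the grammar of the following text.\n"
        f"Text: {fixed_text}\nCorrected text:"
    )
```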
Replication experiment…
103
Course assignment for Human-Centered NLP @ CMU
20 undergraduate / graduate students
Replicate 7 crowdsourcing papers (that cover diverse pipeline designs and tasks) with LLM chains
Self-reflection on replication effectiveness (vs. single LLM baseline) and possible improvements
Peer grading on replication correctness, thoroughness, and comprehensiveness
Wu, Tongshuang, et al. "LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs." ArXiv 2023
Interesting evaluation output & design implication
104
LLMs (with only textual instructions) need proper task formation: e.g., changing to multiple-choice questions, prompts with templates, etc.
Find three segments that can be shortened from the following text. These segments need to be present in the text.
Text: {input text}
Segments: 1. {output segment}
Humans get more scaffolding constraints: e.g., using mouse selection to extract precise segments from the original document.
Identify at least one area that can be shortened without changing the meaning of the paragraph.
Reflection: Where to find metrics?
105
Look into existing interactions!
Example: human-human pair programming vs. human-AI pair programming
Ma, Qianou, Tongshuang Wu, and Kenneth Koedinger. "Is AI the Better Programming Partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023
Reflection: Where to find metrics?
106
Human-human – Variance in metrics!
Time and accomplishment? E.g., twice the duration, or the person-hours required
Human-AI – Overly simplified metrics?
E.g., the number of lines of added code: the nature of interaction with Copilot (tab to accept suggestions) is a big factor!
Reflection: Where to find metrics?
107
More comparisons with human-human pair programming?
Get inspiration on metrics, e.g., defect density, perceptual effort measures, readability, functionality, the number of test cases passed, code complexity, scores, expert opinions, etc.?
Reflection: Where to find metrics?
108
More factors to consider – Evaluate on which population (female students, teachers, etc.)?
Reflection: Where to find metrics?
109
Ma, Qianou, Tongshuang Wu, and Kenneth Koedinger. "Is AI the Better Programming Partner? Human-Human Pair Programming vs. Human-AI pAIr Programming." ArXiv 2023
“Human-AI Interactions”
Learning
110
Learning from Interactions Outline
111
Learning from Interactions Outline
112
Users Interaction with LLMs
113
114
Interaction: Different Types of Human Feedback (1)
Labeled data points
Edit data points
Change data weights
Binary/scaled user feedback
Natural language feedback
Code language feedback
115
Interaction: Different Types of Human Feedback (2)
Define, add, remove feature spaces
Directly change the objective function
Directly change the model parameter
…
116
Learning from Interactions Outline
117
Learning from Interactions and Feedback
Transform nontechnical human “preferences” into usable model “language”
118
Tradeoff: Human-friendly vs. Model-friendly
Models need feedback that “they can respond to”
Humans prefer easier-to-provide feedback
Non-experts:
119
Human Interaction and Text Classification
120
Godbole, Shantanu, Abhay Harpale, Sunita Sarawagi, and Soumen Chakrabarti. "Document classification through interactive supervision of document and term labels." In European Conference on Principles of Data Mining and Knowledge Discovery, pp. 185-196. Springer, Berlin, Heidelberg, 2004.
Human Interaction and Parsing
121
Human Interaction and Topic Modeling
122
Interactive Topic Modeling: start with a vanilla LDA with symmetric prior, get the initial topics. Then repeat the following process till users are satisfied: show users topics, get feedback from users, encode the feedback into a tree prior, update topics with tree-based LDA
123
Learning from Interactions Outline
124
Incorporating Human Feedback: Taxonomy
Dataset updates: change the dataset
Loss function updates: add a constraint to the objective
Parameter space updates: change the model parameters
125
Learning from Interaction: Datasets Updates
Data augmentation
Weak supervision
Active learning
Model-assisted adversarial labeling
126
Datasets Updates: Weak Supervision
Try using noisy sources of signal, specified at higher levels of abstraction, to rapidly generate training sets.
127
Ratner, Alexander, et al. "Snorkel: Rapid training data creation with weak supervision." VLDB 2017.
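A minimal sketch of the labeling-function idea (the heuristics and names are illustrative; Snorkel itself learns labeling-function accuracies rather than taking a majority vote):

```python
ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_contains_link(text):   # noisy heuristic #1
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):  # noisy heuristic #2
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):     # noisy heuristic #3
    return NOT_SPAM if len(text.split()) < 5 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def weak_label(text):
    """Aggregate the noisy votes (majority vote here, for illustration)."""
    votes = [v for lf in LABELING_FUNCTIONS if (v := lf(text)) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN
```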
Datasets Updates: Active Learning to update data
Proactively select which data points we want to use to learn from, rather than passively accepting all data points available.
129
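A minimal uncertainty-sampling loop, one common active-learning strategy; `model` follows the scikit-learn `predict_proba` convention, and the human labeling step is a placeholder.

```python
import numpy as np

def uncertainty_sampling(model, X_unlabeled, batch_size=10):
    """Pick the unlabeled points the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)    # shape: (n, n_classes)
    confidence = probs.max(axis=1)              # top-class probability
    return np.argsort(confidence)[:batch_size]  # least confident first

# The loop: train -> select -> ask a human (the interaction) -> retrain.
# for _ in range(n_rounds):
#     model.fit(X_labeled, y_labeled)
#     idx = uncertainty_sampling(model, X_unlabeled)
#     y_new = human_annotator(X_unlabeled[idx])
#     ...move those points from the unlabeled pool to the labeled set...
```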
Datasets Updates: Data Augmentation
130
Datasets Updates: mixup for text data
131
Datasets Updates: mixup for text data
132
Chen, Jiaao, Zichao Yang, and Diyi Yang. "MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification." ACL 2020.
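A sketch of the core interpolation step in the spirit of MixText: mix two examples' hidden states at an intermediate encoder layer and mix their (one-hot) labels with the same coefficient. The Beta prior and layer choice are assumptions of this toy version.

```python
import torch

def mix_hidden(h_a, h_b, y_a, y_b, alpha=0.75):
    """Interpolate hidden states and one-hot labels with the same lambda."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.max(lam, 1 - lam)        # keep the mix closer to example a
    h_mix = lam * h_a + (1 - lam) * h_b  # done at an intermediate layer
    y_mix = lam * y_a + (1 - lam) * y_b  # train with cross-entropy on y_mix
    return h_mix, y_mix
```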
Learning from Interactions Outline
133
Learning from Interaction: Loss function updates
Unlikelihood learning
Add regularization to specific model behavior
Infer constraints from expert feedback
134
Loss Function Updates: Unlikelihood Learning
Penalize undesirable generations (e.g. not following control, repeating previous context)
135
If C is previously seen text, this yields less repetition and more diversity
Welleck, Sean, et al. "Neural text generation with unlikelihood training." ICLR (2019).
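A sketch of the token-level unlikelihood term in PyTorch; choosing the negative candidate set C (e.g., tokens already seen in the previous context) is the assumption that targets repetition.

```python
import torch

def unlikelihood_term(logprobs, negative_candidates):
    """-sum_{c in C} log(1 - p(c | x_<t)) for one decoding step.

    logprobs: (vocab,) log-probabilities at the current step.
    negative_candidates: token ids to penalize, e.g. tokens that already
    appeared in the previous context (discourages repetition).
    """
    p_neg = logprobs[negative_candidates].exp()
    p_neg = p_neg.clamp(max=1 - 1e-6)  # numerical safety
    return -torch.log1p(-p_neg).sum()  # log1p(-p) == log(1 - p)

# total loss = MLE loss + alpha * unlikelihood term
```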
Loss Function Updates: Infer Constraints from Expert Feedback
Use counterfactual or contrasting examples to improve generalization via an auxiliary training objective
136
Teney, Damien, Ehsan Abbasnedjad, and Anton van den Hengel. "Learning what makes a difference from counterfactual examples and gradient supervision." Computer Vision–ECCV 2020:
Learning from Interactions Outline
137
Learning from Interaction: Parameter updates
Model editing
Concept bottleneck model
Parameter efficient fine-tuning (adapter, prefix)
Reinforcement learning from human feedback
Learning from “diff” or corrections
138
Model Editing uses a single desired input-output pair to make fast, local edits to a pre-trained model
139
Mitchell, Eric, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. "Fast model editing at scale." arXiv preprint arXiv:2110.11309 (2021).
Transform the gradient obtained by SFT using a low-rank decomposition of the gradient to make the parameterization of this transformation tractable.
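The observation that makes this tractable: for a fully connected layer, the fine-tuning gradient is the rank-1 outer product of the backpropagated delta and the layer input, so an editor network only needs to transform those two vectors. A sketch with a hypothetical `editor` network:

```python
import torch

def rank1_edit(W, u, delta, editor, lr=1e-4):
    """grad(W) for a linear layer is delta ⊗ u; transform (u, delta) with a
    small editor network and apply the resulting rank-1 update to W."""
    u_tilde, delta_tilde = editor(u, delta)          # hypothetical editor net
    pseudo_grad = torch.outer(delta_tilde, u_tilde)  # rank-1 pseudo-gradient
    return W - lr * pseudo_grad                      # edited weight matrix
```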
Parameter updates: Concept Bottleneck Model trains model to explicitly use human-provided concepts
140
Koh, Pang Wei, et al. "Concept bottleneck models." International Conference on Machine Learning. PMLR, 2020.
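A sketch of the two-stage structure: predict human-interpretable concepts first, then predict the label only from those concepts, so a human can inspect (or intervene on) the bottleneck. Dimensions are illustrative.

```python
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """x -> interpretable concepts c -> label y (dimensions illustrative)."""
    def __init__(self, n_features, n_concepts, n_classes):
        super().__init__()
        self.to_concepts = nn.Linear(n_features, n_concepts)
        self.to_label = nn.Linear(n_concepts, n_classes)

    def forward(self, x):
        c = self.to_concepts(x).sigmoid()  # concepts a human can inspect/edit
        return self.to_label(c), c         # label predicted only from c
```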
Parameter updates
Parameter Efficient Fine-tuning uses small interaction data to steer models towards desired behaviors
141
Parameter updates: Parameter Efficient Fine-tuning uses small interaction data to steer models towards desired behaviors
142
Unlearn What You Want to Forget: Efficient Unlearning for LLMs. Jiaao Chen, Diyi Yang. EMNLP 2023
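A sketch of one popular parameter-efficient scheme (a LoRA-style low-rank adapter, shown here as a stand-in for adapters/prefixes in general): freeze the pretrained weight and learn only a small low-rank update from the interaction data.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update BA."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight W stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).T  # W x + (BA) x
```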
Incorporating Human Feedback: Taxonomy
Dataset updates: change the dataset
Loss function updates: add a constraint to the objective
Parameter space updates: change the model parameters
143
Learning from Interactions Outline
144
Learning from “Bad” Feedback
Ingredients: 20% binary 👍👎 feedback, 20% gold corrections, 100% free-form textual feedback
Preparation:
1) Train models from the feedback: a satisfaction classifier (🍒 good vs. 🍋 bad responses) and a corrector.
2) Infer ❓ labels for unlabeled bot responses with the classifier:
Bot: I like watermelons too! They are refreshing. → 🍒
Bot: I am not sure. Can you help me understand Greek Mythology? → 🍋
3) Convert 🍋 to 🍒 with the corrector.
4) Collect the fruits (positive + 🍒, negative + 🍋) and re-train the supervised dialogue model, e.g., on targets such as “I like them too! Do you like fruit juice?” and “Sure. Here is a recipe for watermelon slush!”
Example dialogue:
Bot: Do you like any other fruits besides plums?
Human: I also like watermelons.
Bot: I like watermelons too! Have you heard of Harry Styles?
Human: The bot changes the topic suddenly. Correction: “I like watermelons too! They are refreshing.”
Bot: I like watermelons too! They are refreshing.
Human: I know! Any recommended recipe for watermelon slush?
Bot: I am not sure. Can you help me understand Greek Mythology?
Human: I’d like to talk about fruits
Learning from Interactions Outline
151
Incorporating Different Levels of Feedback
Incorporate different levels of human feedback via RL
✍️Local Feedback
Highlighted words or phrases
Speaker's intents
Identifiable events/topics
✍️Global Feedback
Judgement towards the coherence, coverage, overall quality…
152
Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).
Case Study: Collecting Local Feedback
153
Case Study: Collecting Global Feedback
154
Case Study: Incorporating Different Levels of Feedback
155
Case Study: Reward Modeling for Summarization
Learn HITL summarization policy by maximizing the combined reward
156
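A sketch of what “combined reward” can mean concretely: a weighted sum of a local (span-level) reward and a global (summary-level) reward, used as the scalar signal for the policy update. The weights and reward functions are illustrative.

```python
def combined_reward(summary, dialogue, local_reward, global_reward,
                    w_local=0.5, w_global=0.5):
    """Scalar reward mixing local (span-level) and global (summary-level)
    feedback signals; used as the return in the policy-gradient update."""
    return (w_local * local_reward(summary, dialogue)
            + w_global * global_reward(summary))
```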
Case Study: Evaluation of HITL Summarization
157
Chen, Jiaao, Mohan Dodda, and Diyi Yang. "Human-in-the-loop Abstractive Dialogue Summarization." arXiv preprint arXiv:2212.09750 (2022).
Case Study: Converting Feedback into Principles
158
Case Study: Converting Feedback into Principles
159
Petridis, Savvas, Ben Wedin, James Wexler, Aaron Donsbach, Mahima Pushkarna, Nitesh Goyal, Carrie J. Cai, and Michael Terry. "ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles." arXiv preprint arXiv:2310.15428 (2023).
Reinforcement Learning from Human Feedback
160
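The usual recipe has three stages: supervised fine-tuning, reward modeling on human preference pairs, then RL against the reward model. A sketch of the reward-model step, with `reward_model` as a placeholder scorer and the standard Bradley-Terry-style pairwise loss:

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt, chosen, rejected):
    """Pairwise Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    r_chosen = reward_model(prompt, chosen)  # scalar score
    r_rejected = reward_model(prompt, rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The trained reward model then scores policy samples, and the policy is
# optimized (e.g., with PPO) to maximize reward minus a KL penalty that
# keeps it close to the supervised fine-tuned model.
```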
Constitutional AI: Harmlessness from AI Feedback
161
Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen et al. "Constitutional ai: Harmlessness from ai feedback." arXiv preprint arXiv:2212.08073 (2022).
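A sketch of the critique-then-revise loop from the supervised stage of Constitutional AI; `call_llm` is a placeholder completion function and the principle text is illustrative.

```python
# Critique -> revise loop from the supervised stage (principle illustrative).
PRINCIPLE = ("Identify ways the response is harmful or unethical, "
             "then rewrite it to remove those problems.")

def critique_and_revise(prompt, response, call_llm):
    critique = call_llm(
        f"Prompt: {prompt}\nResponse: {response}\n"
        f"Critique request: {PRINCIPLE}\nCritique:"
    )
    revision = call_llm(
        f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
        "Revision request: rewrite the response to address the critique.\n"
        "Revision:"
    )
    return revision  # revisions become fine-tuning targets; an AI labeler
                     # then produces preference data for the RL stage
```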
Constitutional AI: Harmlessness from AI Feedback
162
Constitutional AI: Harmlessness from AI Feedback
163
Scaling RL from Human Feedback with AI Feedback
164
Lee, Harrison, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. "Rlaif: Scaling reinforcement learning from human feedback with ai feedback." arXiv preprint arXiv:2309.00267 (2023).
Scaling RL from Human Feedback with AI Feedback
165
Humans strongly prefer RLHF and RLAIF summaries over those from the supervised fine-tuned (SFT) models
Learning from Interactions Outline
166
Limitations of Human Feedback
167
Limitations of Human Feedback
168
Learning from Interactions Outline
169
Conclusion and Future Directions
170
Takeaway
Human-AI Interaction is another layer (?) alongside models, with its own set of taxonomies.
All more user-oriented: user-centered “optimization”!
171
Open Questions in “Designing Interactions”
Thinking about what users want.
Interfaces as an extension to humans, instead of an extension to technology.
Making users comfortable with novel interactions.
Focusing on building trust, ensuring appropriate reliance, and considering other human factors.
Designing interactions equitable to diverse populations.
By considering accessibility, gender/racial/cultural and other demographic differences.
172
Open Questions in “Evaluating Interactions”
Scale up the evaluation.
Participant recruitment? Trolling removal? Qualitative analysis?
Evaluate dynamic interactions, in the wild.
General-purpose models give us fewer pre-defined interactions. How do we capture effectiveness when people self-initiate task formations?
Make interaction a benchmarking task.
Got 1000 scores on 1000 interactions for 1 model, now what? Use diverse interaction evaluations to reflect model’s practical usability!
173
Open Questions in “Learning from Interactions”
Going beyond labels or numbers.
Fine-grained interaction types need to be modeled.
Going beyond single-turn preference.
Interactions can be expressed in multi-turns or dynamically.
Going beyond knowns and explore unknowns in human-AI interactions.
Diverse opinions, cultures and values in learning from interactions.
174
Designing, Evaluating, and
Learning from Human-AI Interactions
Sherry Tongshuang Wu
CMU, @tongshuangwu
Diyi Yang
Stanford, @diyi_yang
Sebastin Santy
UW, @sebastinsanty
https://tinyurl.com/emnlp2023-hai-tutorial