Trustworthy Generative AI
Tutorials at ICML 2023, KDD 2023 & FAccT 2023
Krishnaram Kenthapadi, Hima Lakkaraju, Nazneen Rajani
Introduction and Motivation
AI has come of age!
3
A new AI category is forming
… but trust issues remain
Language Models since GPT-3
Generative AI
Generative AI refers to a branch of AI that focuses on creating or generating new content, such as images, text, video, or other forms of media, using machine learning examples.
Artificial Intelligence (AI) vs Machine Learning (ML)
AI is a branch of CS dealing with building computer systems that are able to perform tasks that usually require human intelligence.
Machine learning is a branch of AI dealing with the use of data and algorithms to imitate humans without explicit instructions.
Deep learning is a subfield of ML that uses ANNs to learn complex patterns from data.
Model types
Discriminative
Generative
Generative Models
Generative language models (LMs)
Generative image models
Generative Models Data Modalities
Generative Models Data Modalities
Generative Models Data Modalities
Text-to-Text Foundation Models since GPT3
GPT-3
2021
Jun
Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic
Anthropic
2023
Feb
LLaMA
Flan-UL2
March
GPT-4
Falcon
May
INCITE
StarCoder
LLaMA-2
Text-to-Text Foundation Models since GPT3
GPT-3
2021
Jun
Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic
Anthropic
2023
Feb
LLaMA
Flan-UL2
March
GPT-4
Falcon
May
INCITE
StarCoder
LLaMA-2
Pivotal moments
Chatbot LLMs
Alpaca
Vicuna
Dolly
Baize
Koala
Open Assistant
OpenChatKit
StarChat
LLaMA 2 chat
Guanaco
Model Access
🔓
🔒
Open access models
Closed access models
🔓 Open Access Models
All model components are publicly available:
🔓 Open Access Models
Allows reproducing results and replicating parts of the model
Enable auditing and conducting risk analysis
Serves as a research artifact
Enables interpreting model output
🔒 Closed Access Models
Only research paper or blog is available and may include overview of
🔒 Closed Access Models
Safety concerns
Competitive advantage
Expensive to setup guardrails for safe access
Model Access
🔓
🔒
Open access
Closed access
Limited access
🔐
Model Access
🔓
🔒
Open access
Closed access
Limited access
🔐
Text-to-Text Foundation Models since GPT3
GPT-3
2021
Jun
Oct
PaLM
Chinchilla
OPT
BLOOM
Gopher
2022
Megatron TNLG
Dec
May
Apr
Jul
Jul
GPT-J
ChatGPT
Nov
Dec
Galactica
GPT-Neo
Jun
GPT-NeoX
Feb
Flan-T5
Oct
*only LLMs with >1B parameters & EN as the main training language are shown. Comprehensive list: https://crfm.stanford.edu/helm/v1.0/?models=1
UL2
Cohere
Jurassic
Anthropic
2023
Feb
LLaMA
Flan-UL2
March
GPT-4
Falcon
May
INCITE
StarCoder
LLaMA-2
Open Access Large Language Models
Research on policy, governance, AI safety and alignment
Community efforts like Eleuther, Big Science, LAION, OpenAssistant, RedPajama
Papers with several authors
Open source ML has potential for huge impact
Alternative AI Visions of the Future
World-1 World-2
Open AI, Anthropic, Inflection AI, Google, Microsoft, …. Small Form Factor LLMs
Closed Source Open Source
AGI Focused Narrow AI Focused
Consumer Enterprise
Massive Tokenization of Public Data Organized Curation of Private Data
General Purpose Domain Specific
Model Arms Race Model Commoditization
Well-resourced for responsible AI Responsible AI & AI Observability Tools Matter
Trustworthiness Challenges & Solutions for Generative AI
25
Enterprise concerns around responsible deployment of Generative AI
26
Continuous Monitoring of LLM Quality
Chen, et. al., How is ChatGPT's behavior changing over time?, 2023
Continuous Monitoring of LLM Quality
Chen, et. al., How is ChatGPT's behavior changing over time?, 2023
Monitoring LLM Performance with Drift
29
Problem: Model Changes behind API Impact LLM Performance�Monitor drift in Response embeddings to see how responses are changing
Prompt Engineering with Context
(eg. OpenAI call)
Problem: Unseen prompts or incorrect docs impact performance
Monitor drift in Prompt embeddings to see what new docs to add
Retrieval Augmented Generation�(eg. OpenAI call w/ doc)
Problem: Unseen prompts impact performance
Monitor drift in Prompt embeddings to assess LLM performance w.r.t. tuning prompt dataset
Fine Tuned
Model
Problem: Unseen prompts impact performance
Monitor drift in Prompt embeddings to assess LLM performance w.r.t. training dataset
Trained Model
How Enterprises Deploy LLMs
And the performance problems they face
Easy & Cheap
Difficult & Expensive
Generative AI: Trustworthiness Challenges
Robustness and Security
31
Robustness to Input Perturbations
LLMs are not robust to input perturbations
Evaluating Correctness and Robustness of LLMs
https://github.com/fiddler-labs/fiddler-auditor
A. Iyer, K. Kenthapadi, Fiddler Auditor: Evaluate the Robustness of LLMs and NLP Models, 2023
Robustness Evaluation: Example
Robustness Evaluation: Example
Robustness to Adversarial Perturbations
Prompt Injection & Data Poisoning Attacks
Inject instances into training data
to elicit a desired response when
a trigger phrase is used.
Wallace et. al. 2021; Willison et. al., 2023
Evaluating LLMs against Prompt Injection Attacks
Ensuring Robustness to Input Perturbations
Liu et. al., 2020; Si et. al., 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Privacy and Copyright Implications
41
Privacy & Copyright Concerns with LLMs
Carlini et. al., Extracting Training Data from Large Language Models, USENIX Sec. Sym., 2021; Bommasani et. al., 2021; Vyas et. al., 2023
Privacy & Copyright Concerns with Diffusion Models
Carlini et. al., Extracting Training Data from Diffusion Models, 2023
Model vs. Database: Implications
Carlini et. al., Extracting Training Data from Diffusion Models, 2023
Shan et. al., GLAZE: Protecting Artists from Style Mimicry by Text-to-Image Models, 2023
Addressing Privacy & Copyright Concerns
Carlini et. al., Extracting Training Data from Diffusion Models, 2023; Yu et. al., 2021
Addressing Privacy & Copyright Concerns
Kirchenbauer et. al. 2023; Mitchell et. al. 2023; Sadasivan et. al. 2023
GPT detectors can be biased too!
Liang et. al., GPT detectors are biased against non-native English writers, Patterns, 2023
Bias and (Un)fairness
48
Motivation
Why is Bias Detection & Mitigation Challenging?
Bender et. al., 2021
Examples of Biases: LLMs
Weidinger et. al., 2021
Examples of Biases: LLMs
Weidinger et. al., 2021
I am a woman of color from . I am looking for advice to prepare for MCAT.
Examples of Biases: Text to Image Models
and racial demographics/religions
Steed et. al., 2021; Buolamwini and Gebru, 2018; Bolukbasi et. al. 2016
Mitigating Biases
Gira et. al., 2022; Mao et. al. 2023; Kaneko and Bollegola, 2021; Garimella et. al. 2021;
John graduated from a medical school. He is a doctor.
Layeeka graduated from a medical school. She is a doctor.
Mitigating Biases
“We should treat people from different socioeconomic statuses, sexual orientations, religions, races, physical appearances, nationalities, gender identities, disabilities, and ages equally. When we do not have sufficient information, we should choose the unknown option, rather than making assumptions based on our stereotypes.”
Si et. al., 2023; Guo et. al. 2022
Transparency
56
Motivation
Why is Transparency Challenging?
Wei et. al., 2022; Schaeffer et. al., 2023
How to Achieve Transparency?
Good News:
LLMs seem to be able to explain their outputs.
A prompt to elicit explanation: “Let’s think step by step”
Wei et al., Chain of Thought Prompting Elicits Reasoning in Large Language Models, NeurIPS 2022
How to Achieve Transparency?
Bad News: Their explanations are highly unreliable!
Turpin et. al., 2023
How to Achieve Transparency?
Bad News: Their explanations are highly unreliable!
How to Achieve Transparency?
Yin et. al., 2022
How to Achieve Transparency?
Yin et. al., 2022
How to Achieve Transparency?
Bills et al., Language models can explain neurons in language models, 2023
How to Achieve Transparency?
Output:
Explanation of neuron 1 behavior: the main thing this neuron does is find phrases related to community
Limitations:
The descriptions generated are correlational
It may not always be possible to describe a neuron with a short natural language description
The correctness of such explanations remains to be thoroughly vetted!
Bills et al., Language models can explain neurons in language models, 2023
Causal Interventions
66
Beyond Explanations: Can we make changes?
Meng et. al., Locating and Editing Factual Associations in GPT, NeurIPS 2022
Locating Knowledge in GPT via Causal Tracing
Meng et. al., Locating and Editing Factual Associations in GPT, NeurIPS 2022
Editing Factual Associations in GPT Model
Meng et. al., Locating and Editing Factual Associations in GPT, NeurIPS 2022
Editing Factual Associations in GPT Model
Meng et. al., Locating and Editing Factual Associations in GPT, NeurIPS 2022
Editing Image Generation Models
Cui et. al., Local Relighting of Real Scenes, 2022
Generative AI: Trustworthiness Challenges
Process Best Practices
Identify product goals
Get the right people in the room
Identify stakeholders
Select a responsible AI (e.g., fairness) approach
Analyze and evaluate your system
Mitigate issues
Monitor Continuously and Escalation Plans
Auditing and Transparency
Repeat for every new LLM version, new feature, product change, etc.
Perform red teaming.
Challenge: Unanticipated Threats
Credit: Saurabh Tiwary
Example: Disparaging, Existential, Argumentative Threats
Credit: Saurabh Tiwary
Red Teaming AI Models
Red-Teaming
Evaluating LLMs for:
Red-Teaming
Red-Teaming
2. Emerging Capabilities
These are considered critical threat scenarios
Red-Teaming
Similarities with adversarial attacks:
Red-Teaming
Differences with adversarial attacks:
Red-Teaming
Differences with adversarial attacks:
*Warning: offensive text below*
Wallace, et al. "Universal Adversarial Triggers for Attacking and Analyzing NLP" (2021).
Red-Teaming Methods
Roleplay attacks wherein the LLM is instructed to behave as a malicious character
Instructing the model to respond in code instead of natural language
Instructing a model to reveal sensitive information such as PII.
Red-Teaming ChatGPT
https://twitter.com/spiantado/status/1599462375887114240
Red-Teaming ChatGPT
Takeaways from Red-Teaming
Open problems with Red-Teaming
Red-Teaming and AI Policy
Trustworthy Generative AI
Work in Progress
Technical Deepdive: Generative Language Models
Generative Language Models – Architectures
Generative Language Models – Architectures
Generative Language Models – Architectures
Attention
Generative Language Models – Architectures
Attention
Generative Language Models – Architectures
Generative Language Models – Training
Generative Language Models – Training
Generative Language Models – Training
Generative Language Models – Prompting
Generative Language Models – Training
Generative Language Models – Training
Training a chatbot
Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Supervised Fine-tuning
(instruction following/ chatty-ness)
Supervised fine-tuning
Supervised fine-tuning
Bootstrapping Data for SFT
Supervised fine-tuning
Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Reinforcement learning with human feedback (RLHF)
(aligning to target values and safety)
Supervised Fine-tuning
(instruction following and chatty-ness)
Training a Chatbot
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Reinforcement learning with human feedback (RLHF)
Reinforcement Learning with Human Feedback
Reward Model
RL fine-tuning
Reinforcement Learning with Human Feedback
Fine tuning with RL
112
Fine tuning with RL - using a reward model
113
Fine tuning with RL - KL penalty
114
Constrains the RL fine-tuning to not result in a LM that outputs gibberish (to fool the reward model).
Note: DeepMind did this in RL Loss (not reward), see GopherCite
Kullback–Leibler (KL) divergence:
Distance between distributions
Fine tuning with RL - combining rewards
115
Option to add additional terms to this reward function. E.g. InstructGPT, Llama-2-chat
Reward to match original human-curation distribution
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Fine tuning with RL - feedback & training
116
- Policy gradient updates policy LM directly.
- Often some parameters of policy are frozen.
Chatbot LLMs Data distributions
Comparing Dialog Agents
Generative Language Models Evaluations
Evaluating a Chatbot
Evaluating a Chatbot
Leaderboard with automated evals
Evaluating a Chatbot
Evaluating instruction following/chatty-ness
Evaluating a Chatbot
Evaluating a Chatbot
Evaluating the RM
Evaluating a Chatbot
Evaluating a Chatbot
Red-teaming
Evaluating a Chatbot
Leaderboard with Elo ratings
Leaderboard with Elo ratings
Technical Deepdive: Generative Image Models
Generative Image Models
Generative Image Models – Architecture
Stable diffusion over the years
1 - Scaled to Stable Diffusion
2 - The “Karras paper”
3 - DALLE-2
4 - Imagen
Stable Diffusion
Stable Diffusion
Stable Diffusion
Stable Diffusion
Stable Diffusion
Stable Diffusion
Stable Diffusion
Stable Diffusion
Takeaways
Take home message
Real-world Challenges
146
Enterprise Concerns for Deploying Generative AI
Deploying LLMs: Practical Considerations
Continuous feedback loop for improved prompt engineering and LLM fine-tuning*
AI applications and LLMs
Pre-production
Production
*where relevant
Application Challenge: Evaluating Chatbots
Zheng, et. al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023
Application Challenge: Evaluating Chatbots
Zheng, et. al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023
Application Challenge: Evaluating Chatbots
Zheng, et. al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023
Application Challenge: Evaluating Chatbots
Application Challenge: Evaluating Chatbots
Conclusions
154
Conclusions
Open Challenges
156
Open Challenges
References
158
Related Tutorials / Resources
Related Tutorials / Resources
Thanks! Questions?