SF323/CN408: AI Engineer
Lecture 11: Safety
Nutchanon Yongsatianchot
News
News
News: Gemini 3.0 soon
News
AI Risks and Safety
Broad Spectrum of AI Risks
Misuse/malicious use
Malfunction
Systemic risks
Guardrails and Safety filters
Jailbreaking and Prompt injection
Risks of Jailbreaking
Risks of Prompt injection
The risks from prompt injection are far more serious, because the attack is not against the models themselves, it’s against applications that are built on those models.
The risks of prompt injections
Jailbreak Attack
Jailbreak
Attacks
White-box Attack
Black-box Attack
Black-box Attack: Template Completion
Template Completion: Scenario Nesting - DeepInception
Template Completion: In-context Attack
Template Completion: Code Injection
Black-box Attack: Prompt Rewriting
Prompt Rewriting: Cipher
Prompt Rewriting: ASCII Art
Prompt Rewriting: Low-resource Languages
*the combined attack successful if any of the languages in the group achieves BYPASS
Low
Mid
Low-resource Languages - Past tense
Prompt Rewriting: Genetic Algorithm-based
Black-box Attack: LLM-based Generation
System Prompt Overrides
Successful attacks leveraged tags such as "<system>", "<im_start>system" or "<|start_header_id|>system<|end_header_id|>" to enclose novel system instructions. These typically took one of two forms: minimal updates (e.g. adding an exception to a single rule) or fully articulated system prompts, replacing the original rules and instructions.
Faux Reasoning
This attack involves injecting text that mimics the model’s internal reasoning, often using tags like "<think>" or similar structures. Attackers craft messages containing fabricated justifications for potentially harmful or restricted requests, aiming to make the model believe it has already evaluated and approved the action through its own (simulated) internal reasoning.
New Session / Session Data Update
Many models can be misled into believing that the context they are operating in has reset or changed significantly. Various attacks exploit this by simulating a new session or injecting altered session metadata, reframing the harmful action as permissible.
Role Playing
Jailbreaking Defense
Defense
Prompt-level Defenses: Prompt Detection
Constitutional Classifiers: Defending against universal jailbreaks
Constitutional Classifiers: Defending against universal jailbreaks
Constitutional Classifiers
The most successful jailbreaking strategies included:
Prompt-level Defenses: Prompt Perturbation
Prompt-level Defenses: Prompt Perturbation
Prompt-level Defenses: System Prompt Safeguard
OpenAI’s Prompt Hierarchy
OpenAI’s Prompt Hierarchy
GPT-4o mini System Prompt Hierarchy still doesn’t work
Defense
Model-level Defenses: SFT-based Methods
Model-level Defenses: RLHF-based Methods
Model-level Defenses: Gradient and Logit Analysis
Model-level Defenses: Refinement Methods
Model-level Defenses: Proxy Defense
Challenges of Defense
Prompt Injection
Prompt Injection
https://www.ibm.com/topics/prompt-injection
Liu et al., Formalizing and Benchmarking Prompt Injection Attacks and Defenses
The lethal trifecta
Types of prompt injections
Normal App function
Prompt Injection: Normal App function
Direct Prompt Injection
Prompt Injection in VLM
Prompt Injection Attack Methods
Types of prompt injections
Indirect Prompt Injection
Indirect Prompt Injection
Indirect Prompt Injection
Indirect Prompt Injection
Indirect Prompt Injection
Indirect Prompt Injection
General Issue: Mixing instructions and data
Prompt injection prevention and mitigation
Break
Real-World Prompt Injection Examples
Prompt Injection with Claude’s Computer Use
Prompt Injection with Claude’s Computer Use
Chatgpt Operator Prompt Injection Exploits
Chatgpt Operator: Mitigations and Defenses
Mitigation 1: User Monitoring
Chatgpt Operator: Mitigations and Defenses
Mitigation 2: Inline Confirmation Requests
Chatgpt Operator: Mitigations and Defenses
Mitigation 3: Out-of-Band Confirmation Requests
Chatgpt Operator: A Tricky Scenario
Chatgpt Operator: Sneaky Data Leakage
Key observation: just typing text hardly ever triggers any confirmations.
Chatgpt Operator: Full Prompt Injection
Chatgpt Operator: Full Prompt Injection
SQL injection-like attack on LLMs with special tokens
SQL injection-like attack on LLMs with special tokens
SQL injection-like attack on LLMs with special tokens
Data exfiltration attack against Copilot
Data exfiltration attack against Copilot
Data exfiltration attack against Copilot
Prompt injections in AI-powered browsers
ChatGPT Atlas Example
MCP Prompt Injection
MCP Prompt Injection
MCP Rug Pulls
'About The Author' injection
Many more security issues
Lessons from red teaming 100 generative AI Products
Lessons from red teaming 100 generative AI Products
Lessons from red teaming 100 generative AI Products
Lessons from red teaming 100 generative AI Products
Lesson 2: You don’t have to compute gradients to break an AI system
“real hackers don’t break in, they log in.”
“real attackers don’t compute gradients, they prompt engineer”
Google’s Approach for Secure AI Agents
Key risks of AI Agents
A fundamental tension exists: increased agent autonomy and power, which drive utility, correlate directly with increased risk.
Input, perception and personalization
A critical challenge here is reliably distinguishing trusted user commands from potentially untrusted contextual data and inputs from other sources.
System instructions
Maintaining an unambiguous distinction between trusted system instructions and potentially untrusted user data is important for mitigating prompt injection attacks
Reasoning and planning
Because LLM planning is probabilistic, it’s inherently unpredictable and prone to errors from misinterpretation. The common practice of iterative planning exacerbates the prompt injection risk: each cycle introduces opportunities for flawed logic, divergence from intent, or hijacking by malicious data, potentially compounding issues.
Orchestration and action execution (tool use)
This stage is where rogue plans translate into real-world impact. Each tool grants the agent specific powers. Uncontrolled access to powerful actions is highly risky if the planning phase is compromised.
Agent memory
Memory can become a vector for persistent attacks. If malicious data containing a prompt injection is processed and stored in memory, it could influence the agent’s behavior in future, unrelated interactions
Response rendering
If the application renders agent output without proper sanitization or escaping based on content type, vulnerabilities like Cross-Site Scripting (XSS) or data exfiltration (from maliciously crafted URLs in image tags, for example) can occur. Robust sanitization by the rendering component is crucial.
Risk 1 Rogue Actions
Risk 2 Sensitive data disclosure
Core principles for agent security
Principle 1: Agents must have well-defined human controllers
Principle 2: Agent powers must have limitations
Principle 3: Agent actions and planning must be observable
Summary of the Three Principles
Google’s approach: A hybrid defense-in-depth
Layer 1: Traditional, deterministic measures (runtime policy enforcement)
Layer 2: Reasoning-based defense strategies
To complement the deterministic guardrails and address their limitations in handling context and novel threats, the second layer use AI models themselves to evaluate inputs, outputs, or the agent’s internal reasoning for potential risks.
Validating your agent security: Assurance efforts
Supporting both layers are continuous assurance activities
Design Patterns for Securing LLM Agents
against Prompt Injections
High level principle
Once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions.
2. The Plan-Then-Execute Pattern
Allow feedback from tool outputs back to the agent, but to prevent the tool outputs from influencing the choice of actions taken by the agent.
3. The LLM Map-Reduce Pattern
4. The Dual LLM Pattern
1. Privileged LLM that receives instructions and plans actions, and can use tools
2. Quarantined LLM that can be invoked by the privileged LLM whenever untrusted data has to be processed. This LLM cannot use any tools. It can solely process text.
5. The Code-Then-Execute Pattern
An agent writes a formal computer program to solve a task. The program may call tools available to the agent, and spawn unprivileged LLMs to process untrusted text
6. The Context-Minimization pattern
To prevent certain user prompt injections, the agent system can remove unnecessary content from the context over multiple interactions
Case Studies
OS Assistant with Fuzzy Search
This LLM assistant runs in an operating system environment to help the user search for and act on files using fuzzy searches. Some examples:
Threat: The attacker can control one or more file contents, including filenames. They aim to make the agent execute insecure shell commands, or exfiltrate data.
OS Assistant with based LLM Design
Simply give our LLM access to a fully-fledged shell tool and teach it to use standard shell commands
OS Assistant with User confirmation
Ask for user confirmation before the LLM executes each command
OS Assistant with the action-selector pattern
LLM acts as a “translator” between the user’s natural language request, and a series of predefined commands
OS Assistant with the plan-then-execute pattern
LLM agent can commit to a fixed set of actions in response to a user request
OS Assistant with the dual LLM / map-reduce pattern
A better design is where the LLM assistant acts as a “controller”, and dispatches isolated LLMs to perform fuzzy searches with strict output constraints.
SQL Agent
Threat: The attacker can control the input query, or potentially the database content.
The attacker goals depend on the capabilities and include unauthorized extraction, modification or destruction of data, or remote code execution in the Python interpreter.
SQL Agent with No AI security
SQL Agent with Plan-Then-Execute
The code-then-execute pattern avoids processing any data from the databases by an LLM and only processes data with generated code. That way prompt injections inside of the database cannot influence any LLM.
SQL Agent with Action-sandboxing for the Python interpreter
Prompt injections could mislead the LLM to generate harmful Python code. Any code execution must be sandboxed into its own environment with only necessary connections allowed.
Customer service chatbot
This chatbot agent provides customer support to a consumer-facing business. It provides two kinds of services: Information from RAG and Actions (e.g., canceling order or scheduling deliver) via tool use.
Threat:
Customer service chatbot with a topic classifier
The agent relies on a separate topic classifier that will make a binary decision about whether the query is related to the store or not.
Customer service chatbot with the action-selector pattern
This design relies on an allowlist of requests that a benign user might make.
The lethal trifecta
Safety issues show the limitations of LLMs
HW
HW10
Extra
White-box Attack
Black-box Attack
White-box Attack: Gradient-based Attacks
Greedy Coordinate Gradient (GCG) (Zou et al. 2023)
AutoDAN (Zou et al. 2023)
White-box Attack: Logits-based attack
Weak-to-Strong Jailbreaking on Large Language Models
White-box Attack: Fine-tuning-based attack
SHADOW ALIGNMENT: THE EASE OF SUBVERTING
SAFELY-ALIGNED LANGUAGE MODELS (Yang et al., 2023)
Alignment Research
Sleeper Agents
Sleeper Agents
Sleeper Agents
Sleeper Agents
Sleeper Agents
Adversarial training: Use another LLM to red team, finding prompts that trigger the key word (e.g., “I hate your”), and then RL to be HHH on this data.
Sleeper Agents: Adversarial Training Results
Sleeper Agents: Summary
Alignment faking in large language models
Initial:
Train to be harmless
Additional Training:
Train to be harmful
(only if not comply)
Will it display alignment faking?
Alignment faking in large language models