1 of 20

Email: Yanmin.Sun@forces.gc.ca

Defence Research Development Canada – Centre for Operational Research and Analysis

Defense Against Adversarial Attacks

to Large Language Models

– A Technology Survey

Dr. Yanmin Sun

Centre of Operational Research and Analysis

2 of 20

Outline�

Introduction
Adversarial Attacks on LLMs
Defence against Adversarial Attacks
Conclusion

https://www.synopsys.com/glossary/what-is-devsecops.html#:~:text=Definition,in%20the%20software%20delivery%20cycle.

3 of 20

Introduction

LLMs are the state-of-the-art AI techniques for Natural Language Processing tasks
Leveraging LLM-empowered systems is an urgent innovation in the command-and-control warfare environment.
The significant use of LLMs highlights the importance of their reliability and security, particularly for military.
However, LLMs are vulnerable to malicious attack:

Jailbreaking attacks
Data poisoning attacks
Personally Identifiable Information (PII) leakage attacks
…….

Therefore, defencing against malicious attacks on LLM applications is imperative.
This survey provides a comprehensive review of techniques of

Adversarial attacks on LLM; and
Defense solutions

4 of 20

Adversarial Attacks on LLMs

Definition:

A set of techniques and strategies used to intentionally manipulate or deceive LLMs to produce malicious outputs

Attack Consequences:

Leaking personal privacy information
Generating hateful, harmful, and/or biased outputs
Generating and spreading disinformation
…….

5 of 20

Adversary Attacks on LLMs

Adapting LLMs to downstream tasks

6 of 20

Adversary Attacks on LLMs

Attacks

7 of 20

LLM Adaption

In context Learning: providing an input prompt with instructions and/or examples to elicit the output from the pre-trained model
Fine tuning: further training the pre-trained model using a smaller, task-specific, labelled dataset
Instruction tuning: fine-tuning the pre-trained model using a set of instructional prompts. The purpose is to improve the model’s responses to prompt instructions aligning with the domain knowledge or desired outputs during the in-context learning phase.

8 of 20

Adversarial Attacks on LLMs�

Data Attacks - Poisoning data for pre-training/fine-tuning LLMs

Text perturbation – adding, deleting, and substituting characters, words, phrases to introduce variations and noise in data
Backdoor attack – Injecting a pre-designed trigger to the training data such that the victim model produces adversarial outputs when the trigger presented in input

Prompt Attacks - manipulating prompts to elicit malicious outputs

Automatic Prompt Attack – Generate adversarial prompts automatically through a gradient-guided search strategy
Instruction Tuning Attacks – Inserting adversarial prompts into an instruction set for fine-tuning LLMs

Adversary Attack Approaches

9 of 20

Data Attacks

Text perturbation –introducing variations and noise in data

Different granularities: character, word, phrase, sentence
Different perturbation operations: adding, deleting, inserting, substituting, swapping…
Perturbations in both plain text and text embeddings
This approach generates invalid words or semantically wrong sentences to be perceived easily
Most reported works are for adversarial tuning LLMs for their robustness against adversarial attacks

10 of 20

Data Attacks

Backdoor Attack – Injecting a pre-designed trigger to the training data such that the victim model produces adversarial outputs when the trigger presented in input

Inject the backdoor during pre-training LLMs - a new attention-enhancing loss function is able to enforce the training on attention weights to pay full attention to the trigger tokens
BadGPT – attacking GPT-2 through fine-tuning of a reinforcement learning by injecting a backdoor into the reward model, where the trigger can receive a high reward score

11 of 20

Prompt Attacks

Automatic Prompt Attack

Automatic prompt generation

AutoPrompt - A gradient-guided learning strategy on discrete tokens to obtain an efficient prompt template for the tasks

Automatic adversary prompt generation

Training a suffix that, when attached to a wide range of prompts, is able to guide the model to produce an affirmative response
AutoDAN and LinkPrompt - Generating interpretable adversary prompts by combining both the adversary and the readability objectives in learning adversary prompts

Automatic Jailbreak Prompt generation

LLM providers have deployed a variety of mitigation measures to prevent the harmful or inappropriate content
Masterkey – Automatically generate jailbreak prompts by investigating LLMs defense mechanism through reinforcement learning from human feedback
AdvPrompter – Generate adversarial suffixes to jailbreak the target LLM

12 of 20

Prompt Attacks

Instruction Tuning Attacks – Inserting adversarial prompts into an instruction set for fine-tuning pre-trained LLMs

Introducing backdoor triggers in instruction sets

Adopting reinforcement learning from human feedback for instruction tuning to embed backdoors

A pipeline to generate an adversarial instruction set given clean instructions

An empirical investigation demonstrated instruction tuning attack can be more harmful and effective than other attack methods

13 of 20

Defense Solutions

Defense against Adversarial Attacks

Defense Against Adversarial Data
Defense Against Adversarial Prompts
Strengthening LLMs’ Robustness

14 of 20

Defense Solutions

Defense Against Adversarial Data

Anomaly detection techniques

Word frequency
Word distribution density estimation

Filtering poison examples from training process

Being wrongly labelled, they tend to have high loss in training

15 of 20

Defense Solutions

Defense Against Adversarial Prompts

Prevention-Based Defense

- tries to re-design a given instruction prompt to accomplish the target task and thwart the attack

Detection-Based Defense

- aims to detect whether a given prompt is attacked or not.

16 of 20

Defense Solutions

Defense Against Adversarial Prompts

Prevention-Based Defense

Paraphrasing – rewording existing prompts while maintaining their essential meanings

Creating variations of the existing prompts to thwart attacks, for example, through perturbating
Purifying prompts to compact the prompts to a minimal and explanatory form
Retokenization – process prompts at token-level

Prompt Optimization – Learning algorithms to regenerate existing prompts

RPO (Robust Prompt Optimization) – Learning suffixes that are robust to a wide range of prompt attacks
RAIN (Rewindable Auto-regressive Inference) – a learning method that integrating self-evaluation and rewind mechanisms

17 of 20

Defense Solutions

18 of 20

Defense Solutions

Defense Against Adversarial Prompts

Strengthening Model Robustness

Alignment Tuning – instruction tuning to encode human values and goals into LLMs to make them as helpful, safe and reliable as possible.

Reinforcement learning with Human Feedback utilizing a reward model that captures human preferences or feedback

Adversarial Tuning – including adversarial samples for tuning

Fine-tuning - Randomized Smoothing by adding noise to the data
Instruction Tuning

Integrating prefixed safety examples in the instruction-tuning set to establish a strong response to harmful prompts
Generating defensive prompts for instruction-tuning

Adversarial Tuning Optimization – Adjust learning objective to optimize adversarial tuning

Overfitting
Detecting long malicious texts

19 of 20

Conclusion Remarks��

Prompting and adaption learning (including both fine-tuning and instruction-tuning) play crucial roles in maximizing the effectiveness of LLMs for various downstream tasks.
As a result, adversaries intend to utilize these techniques to perform adversarial attacks on LLMs
Correspondingly, defense solutions are also investigated in prompting and adaption learning techniques to address adversarial attacks
However, the inadequate research efforts in developing defense solutions at data level

LLMs are trained on massive amounts of public data that might hold severe quality issues.
It is very easy for adversaries to poison data and release the data to public such that any LLM trained or fine-tuned on the poisoned data will get attacked.
Data quality assessment is a necessary step before including the data for training

1 of 20

2 of 20

3 of 20

4 of 20

5 of 20

6 of 20

7 of 20

8 of 20

9 of 20

10 of 20

11 of 20

12 of 20

13 of 20

14 of 20

15 of 20

16 of 20

17 of 20

18 of 20

19 of 20

20 of 20