1 of 20

Email: Yanmin.Sun@forces.gc.ca

Defence Research Development Canada – Centre for Operational Research and Analysis

Defense Against Adversarial Attacks

to Large Language Models

– A Technology Survey

Dr. Yanmin Sun

Centre of Operational Research and Analysis

2 of 20

Outline�

    • Introduction
    • Adversarial Attacks on LLMs
    • Defence against Adversarial Attacks
    • Conclusion

https://www.synopsys.com/glossary/what-is-devsecops.html#:~:text=Definition,in%20the%20software%20delivery%20cycle.

2

3 of 20

Introduction

  • LLMs are the state-of-the-art AI techniques for Natural Language Processing tasks
  • Leveraging LLM-empowered systems is an urgent innovation in the command-and-control warfare environment.
  • The significant use of LLMs highlights the importance of their reliability and security, particularly for military.
  • However, LLMs are vulnerable to malicious attack:
    • Jailbreaking attacks
    • Data poisoning attacks
    • Personally Identifiable Information (PII) leakage attacks
    • …….
  • Therefore, defencing against malicious attacks on LLM applications is imperative.
  • This survey provides a comprehensive review of techniques of
    • Adversarial attacks on LLM; and
    • Defense solutions

3

4 of 20

Adversarial Attacks on LLMs

  • Definition:
    • A set of techniques and strategies used to intentionally manipulate or deceive LLMs to produce malicious outputs

  • Attack Consequences:
    • Leaking personal privacy information
    • Generating hateful, harmful, and/or biased outputs
    • Generating and spreading disinformation
    • …….

4

5 of 20

Adversary Attacks on LLMs

Adapting LLMs to downstream tasks

5

6 of 20

Adversary Attacks on LLMs

Attacks

6

7 of 20

LLM Adaption

  • In context Learning: providing an input prompt with instructions and/or examples to elicit the output from the pre-trained model
  • Fine tuning: further training the pre-trained model using a smaller, task-specific, labelled dataset
  • Instruction tuning: fine-tuning the pre-trained model using a set of instructional prompts. The purpose is to improve the model’s responses to prompt instructions aligning with the domain knowledge or desired outputs during the in-context learning phase.

7

8 of 20

Adversarial Attacks on LLMs�

  • Data Attacks - Poisoning data for pre-training/fine-tuning LLMs
      • Text perturbation – adding, deleting, and substituting characters, words, phrases to introduce variations and noise in data
      • Backdoor attack – Injecting a pre-designed trigger to the training data such that the victim model produces adversarial outputs when the trigger presented in input

  • Prompt Attacks - manipulating prompts to elicit malicious outputs
      • Automatic Prompt Attack – Generate adversarial prompts automatically through a gradient-guided search strategy
      • Instruction Tuning Attacks – Inserting adversarial prompts into an instruction set for fine-tuning LLMs

Adversary Attack Approaches

8

9 of 20

Data Attacks

Text perturbation –introducing variations and noise in data

  • Different granularities: character, word, phrase, sentence
  • Different perturbation operations: adding, deleting, inserting, substituting, swapping…
  • Perturbations in both plain text and text embeddings
  • This approach generates invalid words or semantically wrong sentences to be perceived easily
  • Most reported works are for adversarial tuning LLMs for their robustness against adversarial attacks

9

10 of 20

Data Attacks

Backdoor Attack – Injecting a pre-designed trigger to the training data such that the victim model produces adversarial outputs when the trigger presented in input

  • Inject the backdoor during pre-training LLMs - a new attention-enhancing loss function is able to enforce the training on attention weights to pay full attention to the trigger tokens
  • BadGPT – attacking GPT-2 through fine-tuning of a reinforcement learning by injecting a backdoor into the reward model, where the trigger can receive a high reward score

10

11 of 20

Prompt Attacks

Automatic Prompt Attack

  • Automatic prompt generation
    • AutoPrompt - A gradient-guided learning strategy on discrete tokens to obtain an efficient prompt template for the tasks
  • Automatic adversary prompt generation
    • Training a suffix that, when attached to a wide range of prompts, is able to guide the model to produce an affirmative response
    • AutoDAN and LinkPrompt - Generating interpretable adversary prompts by combining both the adversary and the readability objectives in learning adversary prompts
  • Automatic Jailbreak Prompt generation
    • LLM providers have deployed a variety of mitigation measures to prevent the harmful or inappropriate content
    • Masterkey – Automatically generate jailbreak prompts by investigating LLMs defense mechanism through reinforcement learning from human feedback
    • AdvPrompter – Generate adversarial suffixes to jailbreak the target LLM

11

12 of 20

Prompt Attacks

Instruction Tuning Attacks – Inserting adversarial prompts into an instruction set for fine-tuning pre-trained LLMs

        • Introducing backdoor triggers in instruction sets

        • Adopting reinforcement learning from human feedback for instruction tuning to embed backdoors

        • A pipeline to generate an adversarial instruction set given clean instructions

        • An empirical investigation demonstrated instruction tuning attack can be more harmful and effective than other attack methods

12

13 of 20

Defense Solutions

Defense against Adversarial Attacks

  • Defense Against Adversarial Data
  • Defense Against Adversarial Prompts
  • Strengthening LLMs’ Robustness

13

14 of 20

Defense Solutions

Defense Against Adversarial Data

    • Anomaly detection techniques
      • Word frequency
      • Word distribution density estimation

    • Filtering poison examples from training process
      • Being wrongly labelled, they tend to have high loss in training

14

15 of 20

Defense Solutions

Defense Against Adversarial Prompts

  • Prevention-Based Defense

- tries to re-design a given instruction prompt to accomplish the target task and thwart the attack

  • Detection-Based Defense

- aims to detect whether a given prompt is attacked or not.

15

16 of 20

Defense Solutions

Defense Against Adversarial Prompts

  • Prevention-Based Defense
    • Paraphrasing – rewording existing prompts while maintaining their essential meanings
      • Creating variations of the existing prompts to thwart attacks, for example, through perturbating
      • Purifying prompts to compact the prompts to a minimal and explanatory form
      • Retokenization – process prompts at token-level
    • Prompt Optimization – Learning algorithms to regenerate existing prompts
      • RPO (Robust Prompt Optimization) – Learning suffixes that are robust to a wide range of prompt attacks
      • RAIN (Rewindable Auto-regressive Inference) – a learning method that integrating self-evaluation and rewind mechanisms

16

17 of 20

Defense Solutions

  •  

17

18 of 20

Defense Solutions

Defense Against Adversarial Prompts

  • Strengthening Model Robustness
    • Alignment Tuning – instruction tuning to encode human values and goals into LLMs to make them as helpful, safe and reliable as possible.
      • Reinforcement learning with Human Feedback utilizing a reward model that captures human preferences or feedback
    • Adversarial Tuning – including adversarial samples for tuning
      • Fine-tuning - Randomized Smoothing by adding noise to the data
      • Instruction Tuning
        • Integrating prefixed safety examples in the instruction-tuning set to establish a strong response to harmful prompts
        • Generating defensive prompts for instruction-tuning
      • Adversarial Tuning Optimization – Adjust learning objective to optimize adversarial tuning
        • Overfitting
        • Detecting long malicious texts

18

19 of 20

Conclusion Remarks��

  • Prompting and adaption learning (including both fine-tuning and instruction-tuning) play crucial roles in maximizing the effectiveness of LLMs for various downstream tasks.
  • As a result, adversaries intend to utilize these techniques to perform adversarial attacks on LLMs
  • Correspondingly, defense solutions are also investigated in prompting and adaption learning techniques to address adversarial attacks
  • However, the inadequate research efforts in developing defense solutions at data level
    • LLMs are trained on massive amounts of public data that might hold severe quality issues.
    • It is very easy for adversaries to poison data and release the data to public such that any LLM trained or fine-tuned on the poisoned data will get attacked.
    • Data quality assessment is a necessary step before including the data for training

19

20 of 20