1 of 17

Adversarial AI and Model Evasion Attacks

Anupama G

2 of 17

Agenda

  • Introduction
  • The Adversarial Threat Model
  • Evasion Attacks: The Core Concept
  • Key Evasion Attack Techniques
  • Other Types of Attacks
  • Defences Against Adversarial Attacks
  • Real-World Implications
  • The Future of Adversarial AI
  • Summary
  • Q&A

3 of 17

Introduction

  • The Rise of AI: AI systems, particularly deep learning models, are now integral to many critical applications (e.g., self-driving cars, medical diagnosis, cybersecurity).
  • The Inherent Vulnerability: Despite their power, these models are not infallible. They have a new type of security vulnerability that traditional cybersecurity measures can't address.

4 of 17

Adversarial AI Introduction

  • What is Adversarial AI? The study of how machine learning systems can be manipulated, and of how to defend them. Adversarial attacks are inputs deliberately crafted to deceive a model into making an incorrect prediction.
  • Why is this a big deal? The consequences of a fooled AI can be severe, ranging from a misclassified stop sign to a misdiagnosed patient. It's a fundamental threat to the trustworthiness of AI.

5 of 17

The Adversarial Threat Model

  • Understanding the Attacker: The type of attack depends on the attacker's knowledge and goal.
  • Knowledge of the Model:
    • White-Box Attacks: The attacker has full access to the model's architecture, parameters (weights), and even training data. Think of an attacker with access to an open-source model.
    • Black-Box Attacks: The attacker has no knowledge of the model's internals. They can only interact with it via its inputs and outputs. This is a more realistic scenario for attacking a proprietary, deployed model.

6 of 17

The Adversarial Threat Model

  • Goal of the Attack:

    • Targeted: The attacker wants to force the model to produce a specific, incorrect output (e.g., make a traffic sign model classify a "stop" sign as a "speed limit" sign).

    • Non-Targeted: The attacker simply wants to make the model misclassify the input, without a specific target class.

7 of 17

Evasion Attacks: The Core Concept

  • Definition: Evasion attacks occur at the inference phase, after the model has been trained and deployed. The attacker modifies a legitimate input just enough to evade the model's detection.
  • Adversarial Examples: These are the maliciously crafted inputs. To a human observer they look almost identical to the original, but the model interprets them very differently.
    • A classic example: Adding a tiny, almost imperceptible noise pattern to an image of a panda can make a sophisticated image classifier believe it's a gibbon.

8 of 17

Evasion Attacks: The Core Concept

How are they created?

  • Adversarial examples are commonly attributed to the locally linear behaviour of deep neural networks in high-dimensional input space (the "linearity hypothesis").
  • By computing the gradient of the loss with respect to the input, an attacker can find the direction in which a small change to the input produces the largest increase in the loss, forcing a misclassification (formalised below).
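
As a rough formalisation (the notation below is an assumption for illustration, not taken from the deck): for a model f_θ with loss L, a clean input x with true label y, and a perturbation budget ε, an evasion attack searches for

    x_{\mathrm{adv}} = x + \delta^{*}, \qquad
    \delta^{*} = \arg\max_{\|\delta\|_{\infty} \le \epsilon} \, L\big(f_{\theta}(x + \delta),\, y\big)

Gradient-based attacks such as FGSM and PGD (next slides) approximate this maximisation using the gradient of L with respect to x.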

9 of 17

Key Evasion Attack Techniques

  • Fast Gradient Sign Method (FGSM):
    • Concept: One of the earliest and simplest white-box attacks. It takes a single step in the direction of the sign of the gradient of the loss with respect to the input, adding a small perturbation that pushes the model toward higher loss (see the sketch below).
    • Strengths: Simple, fast, and effective.
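
A minimal PyTorch-style sketch of a single FGSM step, assuming a classifier model, image inputs x scaled to [0, 1], integer labels y, and a budget eps; fgsm() is an illustrative helper, not a specific library API.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps):
        """One-step attack: perturb the input in the direction of the gradient sign."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)   # loss the attacker wants to increase
        loss.backward()                       # gradient of the loss w.r.t. the input
        x_adv = x + eps * x.grad.sign()       # single signed-gradient step of size eps
        return torch.clamp(x_adv, 0.0, 1.0).detach()  # keep pixels in a valid range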

10 of 17

Key Evasion Attack Techniques

  • Projected Gradient Descent (PGD):
    • Concept: An iterative, more powerful version of FGSM. It repeatedly applies small gradient-sign steps and projects the result back into an allowed region (e.g., a small "epsilon" ball around the original input) so the changes remain imperceptible (sketched below).
    • Strengths: Considered a state-of-the-art attack, highly effective, and often used as a benchmark for defences.
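
A sketch of the iterative PGD loop under the same assumptions as the FGSM sketch; the step size alpha, the number of steps, and the random start are illustrative choices.

    import torch
    import torch.nn.functional as F

    def pgd(model, x, y, eps, alpha=0.01, steps=10):
        """Iterated gradient-sign steps, projected back into the L-infinity epsilon-ball."""
        x_orig = x.clone().detach()
        x_adv = x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)  # random start inside the ball
        for _ in range(steps):
            x_adv = x_adv.clone().detach().requires_grad_(True)
            F.cross_entropy(model(x_adv), y).backward()
            with torch.no_grad():
                x_adv = x_adv + alpha * x_adv.grad.sign()                # small FGSM-style step
                x_adv = x_orig + torch.clamp(x_adv - x_orig, -eps, eps)  # project into the eps-ball
                x_adv = torch.clamp(x_adv, 0.0, 1.0)                     # stay in valid pixel range
        return x_adv.detach()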

11 of 17

Key Evasion Attack Techniques

  • Black-Box Attacks:
    • Transferability: A key principle. Adversarial examples crafted against one model often "transfer" and fool a different, unknown model, because models trained on similar data tend to learn similar features and decision boundaries.
    • Query-based Attacks: The attacker repeatedly queries the black-box model and uses the outputs to approximate its decision boundaries or gradients. Techniques like Zeroth-Order Optimization (ZOO) take this approach (see the gradient-estimation sketch below).
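
A simplified NumPy sketch of ZOO-style gradient estimation via finite differences; query_fn is a hypothetical black-box API returning a probability vector, and the random coordinate sampling is an illustrative simplification.

    import numpy as np

    def estimate_gradient(query_fn, x, target_class, delta=1e-3, n_coords=100, seed=0):
        """Approximate the gradient of the target-class score using two queries per coordinate."""
        rng = np.random.default_rng(seed)
        grad = np.zeros(x.size)
        coords = rng.choice(x.size, size=min(n_coords, x.size), replace=False)
        for i in coords:
            e = np.zeros(x.size)
            e[i] = delta
            e = e.reshape(x.shape)
            plus = query_fn(x + e)[target_class]     # black-box query
            minus = query_fn(x - e)[target_class]    # black-box query
            grad[i] = (plus - minus) / (2 * delta)   # central finite difference
        return grad.reshape(x.shape)

The estimated gradient can then drive FGSM/PGD-style updates without any access to the model's internals.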

12 of 17

Other Types of Attacks

  • Data Poisoning Attacks:
    • Definition: The attacker contaminates the training data with malicious samples or false labels.
    • Impact: This corrupts the model's learning process, leading to degraded performance or the creation of a backdoor that the attacker can exploit later. A famous example is Microsoft's Tay chatbot, which was poisoned by malicious users.
  • Model Extraction/Stealing Attacks:
    • Definition: The attacker repeatedly queries a black-box model and uses the resulting input-output pairs to train a functional copy of it (see the sketch after this list).
    • Impact: This can be a form of intellectual property theft, as the attacker now has a similar model without the massive cost of training it.
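
A toy sketch of extraction by surrogate training; the victim's prediction API query_fn, the uniform query distribution, and the scikit-learn surrogate are all assumptions for illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def extract_model(query_fn, input_dim, n_queries=5000, seed=0):
        """Train a local copy of a black-box classifier from its own answers."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n_queries, input_dim))  # synthetic query inputs
        y = np.array([query_fn(x) for x in X])                  # victim's predicted labels
        surrogate = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
        surrogate.fit(X, y)                                     # functional copy of the victim
        return surrogate

In practice attackers tend to sample queries near the victim's real data distribution; uniform noise is used here only for brevity.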

13 of 17

Other Types of Attacks

  • Inference Attacks:
    • Definition: These attacks aim to infer private or sensitive information from the model's output or parameters.

    • Types: Membership Inference (determining whether a specific data point was used in training) and Model Inversion (reconstructing training data from the model's outputs). A simple membership-inference baseline is sketched below.
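
A deliberately simple confidence-threshold baseline for membership inference, assuming query_fn returns the model's class-probability vector; the threshold value is illustrative.

    def is_training_member(query_fn, x, true_label, threshold=0.9):
        """Guess membership from confidence: models are often more confident on data they trained on."""
        confidence = query_fn(x)[true_label]
        return confidence >= threshold   # True => guess "x was in the training set"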

14 of 17

Defences Against Adversarial Attacks

  • Proactive Defences:
    • Adversarial Training: The most widely used defence, and among the most effective. The model is trained on a mix of clean data and adversarial examples generated during training, making it more robust to similar attacks (see the sketch after this list).
    • Defensive Distillation: Trains a second, "distilled" model on the softened outputs (probabilities) of a first, "teacher" model, smoothing the second model's decision surface and making it harder to exploit (though this defence has since been bypassed by stronger attacks).
  • Reactive Defences:
    • Input Preprocessing: Uses techniques like adding random noise or applying image filters to remove the adversarial perturbations before the input reaches the model.
  • Limitations of Defences:
    • Adversarial examples are a cat-and-mouse game. As new defences are developed, new attacks emerge to bypass them. Many defences can also degrade a model's performance on legitimate inputs.
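
A minimal sketch of one adversarial-training step in PyTorch, assuming a model, an optimizer, a batch (x, y) with pixels in [0, 1], and an FGSM budget eps; the 50/50 clean/adversarial mix is an illustrative choice.

    import torch
    import torch.nn.functional as F

    def adversarial_training_step(model, optimizer, x, y, eps=0.03):
        # Craft FGSM examples on the fly (same idea as the FGSM sketch earlier in the deck).
        x_req = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_req), y).backward()
        x_adv = torch.clamp(x_req + eps * x_req.grad.sign(), 0.0, 1.0).detach()

        # Update the model on a mix of clean and adversarial batches.
        optimizer.zero_grad()
        loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

Stronger variants (e.g., PGD-based adversarial training) swap the inner FGSM step for the iterative attack.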

15 of 17

Real-World Implications

  • Autonomous Vehicles: Adversarial attacks on a car's computer vision system could lead to accidents.
  • Spam/Malware Filters: An attacker can create a variant of malware or a spam email that looks benign to the detection model.
  • Facial Recognition: Researchers have shown that adversarial accessories, such as specially printed eyeglass frames or clothing patterns, can cause face recognition and person detection systems to misidentify or fail to detect the wearer.
  • Medical Diagnosis: Manipulating a medical image could lead an AI-powered diagnostic tool to give a wrong diagnosis.

16 of 17

The Future of Adversarial AI

  • Beyond Evasion: The focus is shifting from simply "fooling" a model to more complex attacks like model extraction and data poisoning.
  • Regulatory Focus: Governments and industry bodies are starting to develop standards for the security and robustness of AI systems.
  • Towards Certified Robustness: Research is moving toward creating models with mathematical guarantees of their robustness against certain types of attacks. This is a difficult but crucial goal.

17 of 17

Summary

  • Adversarial AI is a major threat to AI security and trustworthiness.
  • Evasion attacks, which create adversarial examples to fool deployed models, are a core concern.
  • We have discussed key attack techniques (FGSM, PGD), defences (adversarial training), and the broader implications for AI systems.
