1 of 17

Adversarial AI and Model Evasion Attacks

Anupama G

2 of 17

Agenda

  • Introduction
  • The Adversarial Threat Model
  • Evasion Attacks: The Core Concept
  • Key Evasion Attack Techniques
  • Other Types of Attacks
  • Defences Against Adversarial Attacks
  • Real-World Implications
  • The Future of Adversarial AI
  • Summary
  • Q&A

3 of 17

Introduction

  • The Rise of AI: AI systems, particularly deep learning models, are now integral to many critical applications (e.g., self-driving cars, medical diagnosis, cybersecurity).
  • The Inherent Vulnerability: Despite their power, these models are not infallible. They have a new type of security vulnerability that traditional cybersecurity measures can't address.

4 of 17

Adversarial AI Introduction

  • What is Adversarial AI? The study of how machine learning systems can be manipulated, and of how to defend them. Adversarial attacks are inputs deliberately crafted to deceive a model into making an incorrect prediction.
  • Why is this a big deal? The consequences of a fooled AI can be severe, ranging from a misclassified stop sign to a misdiagnosed patient. It's a fundamental threat to the trustworthiness of AI.

5 of 17

The Adversarial Threat Model

  • Understanding the Attacker: The type of attack depends on the attacker's knowledge and goal.
  • Knowledge of the Model:
    • White-Box Attacks: The attacker has full access to the model's architecture, parameters (weights), and even training data. Think of an attacker with access to an open-source model.
    • Black-Box Attacks: The attacker has no knowledge of the model's internals. They can only interact with it via its inputs and outputs. This is a more realistic scenario for attacking a proprietary, deployed model.

6 of 17

The Adversarial Threat Model

  • Goal of the Attack:

    • Targeted: The attacker wants to force the model to produce a specific, incorrect output (e.g., make a traffic sign model classify a "stop" sign as a "speed limit" sign).

    • Non-Targeted: The attacker simply wants to make the model misclassify the input, without a specific target class.

7 of 17

Evasion Attacks: The Core Concept

  • Definition: Evasion attacks occur at the inference phase, after the model has been trained and deployed. The attacker modifies a legitimate input just enough to evade the model's detection.
  • Adversarial Examples: These are the maliciously crafted inputs. To a human observer they look almost identical to the original, but the model interprets them very differently.
    • A classic example: Adding a tiny, almost imperceptible noise pattern to an image of a panda can make a sophisticated image classifier believe it's a gibbon.

8 of 17

Evasion Attacks: The Core Concept

How are they created?

  • Adversarial examples are commonly attributed to the locally linear behaviour of deep neural networks in high-dimensional input space (the "linearity hypothesis").
  • By computing the gradient of the loss with respect to the input, an attacker can find the direction in which a small change to the input produces the largest increase in the loss, forcing a misclassification (formalised below).
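
As a rough formalisation (the notation below is an assumption for illustration, not taken from the deck): for a model f_θ with loss L, a clean input x with true label y, and a perturbation budget ε, an evasion attack searches for

    x_{\mathrm{adv}} = x + \delta^{*}, \qquad
    \delta^{*} = \arg\max_{\|\delta\|_{\infty} \le \epsilon} \, L\big(f_{\theta}(x + \delta),\, y\big)

Gradient-based attacks such as FGSM and PGD (next slides) approximate this maximisation using the gradient of L with respect to x.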

9 of 17

Key Evasion Attack Techniques

  • Fast Gradient Sign Method (FGSM):
    • Concept: One of the earliest and simplest white-box attacks. It takes a single step in the direction of the sign of the gradient of the loss with respect to the input, adding a small perturbation that pushes the model toward higher loss (see the sketch below).
    • Strengths: Simple, fast, and effective.
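
A minimal PyTorch-style sketch of a single FGSM step, assuming a classifier model, image inputs x scaled to [0, 1], integer labels y, and a budget eps; fgsm() is an illustrative helper, not a specific library API.

    import torch
    import torch.nn.functional as F

    def fgsm(model, x, y, eps):
        """One-step attack: perturb the input in the direction of the gradient sign."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)   # loss the attacker wants to increase
        loss.backward()                       # gradient of the loss w.r.t. the input
        x_adv = x + eps * x.grad.sign()       # single signed-gradient step of size eps
        return torch.clamp(x_adv, 0.0, 1.0).detach()  # keep pixels in a valid range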

10 of 17

Key Evasion Attack Techniques

  • Projected Gradient Descent (PGD):
    • Concept: An iterative, more powerful version of FGSM. It repeatedly applies small gradient-sign steps and projects the result back into an allowed region (e.g., a small "epsilon" ball around the original input) so the changes remain imperceptible (sketched below).
    • Strengths: Considered a state-of-the-art attack, highly effective, and often used as a benchmark for defences.
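
A sketch of the iterative PGD loop under the same assumptions as the FGSM sketch; the step size alpha, the number of steps, and the random start are illustrative choices.

    import torch
    import torch.nn.functional as F

    def pgd(model, x, y, eps, alpha=0.01, steps=10):
        """Iterated gradient-sign steps, projected back into the L-infinity epsilon-ball."""
        x_orig = x.clone().detach()
        x_adv = x_orig + torch.empty_like(x_orig).uniform_(-eps, eps)  # random start inside the ball
        for _ in range(steps):
            x_adv = x_adv.clone().detach().requires_grad_(True)
            F.cross_entropy(model(x_adv), y).backward()
            with torch.no_grad():
                x_adv = x_adv + alpha * x_adv.grad.sign()                # small FGSM-style step
                x_adv = x_orig + torch.clamp(x_adv - x_orig, -eps, eps)  # project into the eps-ball
                x_adv = torch.clamp(x_adv, 0.0, 1.0)                     # stay in valid pixel range
        return x_adv.detach()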

11 of 17

Key Evasion Attack Techniques

  • Black-Box Attacks:
    • Transferability: A key principle. Adversarial examples crafted against one model often "transfer" and fool a different, unknown model, because models trained on similar data tend to learn similar features and decision boundaries.
    • Query-based Attacks: The attacker repeatedly queries the black-box model and uses the outputs to approximate its decision boundaries or gradients. Techniques like Zeroth-Order Optimization (ZOO) take this approach (see the gradient-estimation sketch below).
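
A simplified NumPy sketch of ZOO-style gradient estimation via finite differences; query_fn is a hypothetical black-box API returning a probability vector, and the random coordinate sampling is an illustrative simplification.

    import numpy as np

    def estimate_gradient(query_fn, x, target_class, delta=1e-3, n_coords=100, seed=0):
        """Approximate the gradient of the target-class score using two queries per coordinate."""
        rng = np.random.default_rng(seed)
        grad = np.zeros(x.size)
        coords = rng.choice(x.size, size=min(n_coords, x.size), replace=False)
        for i in coords:
            e = np.zeros(x.size)
            e[i] = delta
            e = e.reshape(x.shape)
            plus = query_fn(x + e)[target_class]     # black-box query
            minus = query_fn(x - e)[target_class]    # black-box query
            grad[i] = (plus - minus) / (2 * delta)   # central finite difference
        return grad.reshape(x.shape)

The estimated gradient can then drive FGSM/PGD-style updates without any access to the model's internals.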

12 of 17

Other Types of Attacks

  • Data Poisoning Attacks:
    • Definition: The attacker contaminates the training data with malicious samples or false labels.
    • Impact: This corrupts the model's learning process, leading to degraded performance or the creation of a backdoor that the attacker can exploit later. A famous example is Microsoft's Tay chatbot, which was poisoned by malicious users.
  • Model Extraction/Stealing Attacks:
    • Definition: The attacker repeatedly queries a black-box model and uses the resulting input-output pairs to train a functional copy of it (see the sketch after this list).
    • Impact: This can be a form of intellectual property theft, as the attacker now has a similar model without the massive cost of training it.
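
A toy sketch of extraction by surrogate training; the victim's prediction API query_fn, the uniform query distribution, and the scikit-learn surrogate are all assumptions for illustration.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def extract_model(query_fn, input_dim, n_queries=5000, seed=0):
        """Train a local copy of a black-box classifier from its own answers."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n_queries, input_dim))  # synthetic query inputs
        y = np.array([query_fn(x) for x in X])                  # victim's predicted labels
        surrogate = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
        surrogate.fit(X, y)                                     # functional copy of the victim
        return surrogate

In practice attackers tend to sample queries near the victim's real data distribution; uniform noise is used here only for brevity.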

13 of 17

Other Types of Attacks

  • Inference Attacks:
    • Definition: These attacks aim to infer private or sensitive information from the model's output or parameters.

    • Types: Membership Inference (determining whether a specific data point was used in training) and Model Inversion (reconstructing training data from the model's outputs). A simple membership-inference baseline is sketched below.
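
A deliberately simple confidence-threshold baseline for membership inference, assuming query_fn returns the model's class-probability vector; the threshold value is illustrative.

    def is_training_member(query_fn, x, true_label, threshold=0.9):
        """Guess membership from confidence: models are often more confident on data they trained on."""
        confidence = query_fn(x)[true_label]
        return confidence >= threshold   # True => guess "x was in the training set"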

14 of 17

Defences Against Adversarial Attacks

  • Proactive Defences:
    • Adversarial Training: The most widely used defence, and among the most effective. The model is trained on a mix of clean data and adversarial examples generated during training, making it more robust to similar attacks (see the sketch after this list).
    • Defensive Distillation: Trains a second, "distilled" model on the softened outputs (probabilities) of a first, "teacher" model, smoothing the second model's decision surface and making it harder to exploit (though this defence has since been bypassed by stronger attacks).
  • Reactive Defences:
    • Input Preprocessing: Uses techniques like adding random noise or applying image filters to remove the adversarial perturbations before the input reaches the model.
  • Limitations of Defences:
    • Adversarial examples are a cat-and-mouse game. As new defences are developed, new attacks emerge to bypass them. Many defences can also degrade a model's performance on legitimate inputs.
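
A minimal sketch of one adversarial-training step in PyTorch, assuming a model, an optimizer, a batch (x, y) with pixels in [0, 1], and an FGSM budget eps; the 50/50 clean/adversarial mix is an illustrative choice.

    import torch
    import torch.nn.functional as F

    def adversarial_training_step(model, optimizer, x, y, eps=0.03):
        # Craft FGSM examples on the fly (same idea as the FGSM sketch earlier in the deck).
        x_req = x.clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_req), y).backward()
        x_adv = torch.clamp(x_req + eps * x_req.grad.sign(), 0.0, 1.0).detach()

        # Update the model on a mix of clean and adversarial batches.
        optimizer.zero_grad()
        loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
        return loss.item()

Stronger variants (e.g., PGD-based adversarial training) swap the inner FGSM step for the iterative attack.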

15 of 17

Real-World Implications

  • Autonomous Vehicles: Adversarial attacks on a car's computer vision system could lead to accidents.
  • Spam/Malware Filters: An attacker can create a variant of malware or a spam email that looks benign to the detection model.
  • Facial Recognition: Researchers have shown that adversarial accessories, such as specially printed eyeglass frames or clothing patterns, can cause face recognition and person detection systems to misidentify or fail to detect the wearer.
  • Medical Diagnosis: Manipulating a medical image could lead an AI-powered diagnostic tool to give a wrong diagnosis.

16 of 17

The Future of Adversarial AI

  • Beyond Evasion: The focus is shifting from simply "fooling" a model to more complex attacks like model extraction and data poisoning.
  • Regulatory Focus: Governments and industry bodies are starting to develop standards for the security and robustness of AI systems.
  • Towards Certified Robustness: Research is moving toward creating models with mathematical guarantees of their robustness against certain types of attacks. This is a difficult but crucial goal.

17 of 17

Summary

  • Adversarial AI is a major threat to AI security and trustworthiness.
  • Evasion attacks, which create adversarial examples to fool deployed models, are a core concern.
  • We have discussed key attack techniques (FGSM, PGD), defences (adversarial training), and the broader implications for AI systems.
