Towards Interpretable Adversarial Examples via Sparse Adversarial Attack

Fudong Lin1, Jiadong Lou1, Hao Wang2, Brian Jalaian3, Xu Yuan1
1 University of Delaware, 2 Stevens Institute of Technology, 3 University of West Florida

Code | Paper

Background

Adversarial Attacks

  • Dense Attacks: hard to interpret, because their adversarial examples are overly perturbed
  • Sparse Attacks: an NP-hard optimization problem; existing approximations yield adversarial examples with poor sparsity
  • Our Goal: develop a sparse attack that yields interpretable adversarial examples, allowing us to interpret the vulnerability of Deep Neural Networks (DNNs)

Motivation

Interpretable Adversarial Attacks

  • Explainable AI: Grad-CAM and Guided Grad-CAM localize the image regions that drive a DNN's prediction (a minimal sketch follows this list)
  • Two Types of Malicious Noises: “obscuring noise”, which prevents DNNs from identifying true labels; and “leading noise”, which leads DNNs into incorrect decisions
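A minimal Grad-CAM sketch in PyTorch; the resnet50 backbone and its layer4 target layer are illustrative assumptions (Guided Grad-CAM additionally multiplies this map, elementwise, by guided-backpropagation gradients):

    import torch
    import torch.nn.functional as F
    from torchvision.models import resnet50

    model = resnet50(weights="IMAGENET1K_V1").eval()
    feats, grads = {}, {}

    # Hooks capture the target layer's activations and their gradients.
    model.layer4.register_forward_hook(lambda m, i, o: feats.update(a=o))
    model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    def grad_cam(x, class_idx):
        # x: a normalized input batch of shape (1, 3, 224, 224)
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()
        w = grads["a"].mean(dim=(2, 3), keepdim=True)  # pooled gradients per channel
        cam = F.relu((w * feats["a"]).sum(dim=1))      # weighted activation map
        cam = cam / (cam.max() + 1e-8)                 # normalize to [0, 1]
        return F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]

Overlaying this heatmap on a clean image and on its adversarial example is one way to visualize how obscuring and leading noise shift a DNN's attention.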

Interpret the Vulnerability of DNNs

[Figure: a clean image plus sparse malicious noise yields an adversarial example]

Problem Statement

Intractable Sparse Optimization

  • Crafting the sparsest adversarial perturbation means minimizing the number of perturbed pixels (the ℓ0 norm of the perturbation), which is NP-hard (see the formulation below). Intractable!
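For reference, a common form of this sparse optimization problem, in generic notation assumed here (F is the victim DNN, x the clean image, y the true label, and δ the perturbation):

    \min_{\delta} \|\delta\|_0 \quad \text{s.t.} \quad F(x + \delta) \neq y, \quad x + \delta \in [0, 1]^n

Since \|\delta\|_0 counts the nonzero entries of δ, the objective is combinatorial, which is what makes the problem NP-hard.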

Reparameterization

  • Reformulate the loss function by using the Heaviside step function H (with the convention H(0) = 0), which counts the perturbed entries: \|\delta\|_0 = \sum_i H(|\delta_i|). Non-differentiable!

Approximation of Sparse Optimization

  • Approximate the NP-hard sparsity objective via a theoretically sound reparameterization, which makes direct optimization of sparse perturbations tractable (a sketch with a smooth surrogate follows below)
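A minimal PyTorch sketch of this idea, assuming a sigmoid surrogate 2*sigmoid(|δ_i|/τ) − 1 that approaches the Heaviside step as τ → 0; the surrogate, penalty weight lam, and optimizer are illustrative assumptions, not the poster's exact reparameterization:

    import torch
    import torch.nn.functional as F

    def soft_l0(delta, t=0.05):
        # Smooth surrogate for ||delta||_0: 2*sigmoid(|d|/t) - 1 is ~0 at d = 0
        # and ~1 for |d| >> t, approaching the Heaviside step as t -> 0.
        return (2 * torch.sigmoid(delta.abs() / t) - 1).sum()

    def sparse_attack(model, x, y, steps=200, lr=0.1, lam=1e-3):
        # Untargeted attack: push the model away from the true label y while
        # penalizing the (smoothed) count of perturbed entries.
        delta = torch.zeros_like(x, requires_grad=True)
        opt = torch.optim.Adam([delta], lr=lr)
        for _ in range(steps):
            logits = model(torch.clamp(x + delta, 0.0, 1.0))
            loss = -F.cross_entropy(logits, y) + lam * soft_l0(delta)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return torch.clamp(x + delta.detach(), 0.0, 1.0)

Annealing τ toward zero over the iterations is a common way to tighten the surrogate around the true ℓ0 count.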

Experimental Results

Comparison to Sparse Attack Counterparts

  • Our approach achieves the best attack success rates when attacking robust models trained by PGD-AT or Fast-AT.

[Figure: interpretation examples, e.g., misclassifying “candle” as “toilet tissue” and misclassifying “canoe” as “wings”]

Conclusion

  • This work proposes a sparse attack that yields interpretable adversarial examples, owing to their high sparsity.
  • Our key idea is to approximate the NP-hard sparsity optimization problem via a theoretically sound reparameterization technique, which makes direct optimization of sparse perturbations tractable.
  • Our approach outperforms sparse attack counterparts in terms of computational efficiency, transferability, and attack intensity.
  • Our approach reveals two types of adversarial perturbations, empowering us to interpret how adversarial examples mislead DNNs into incorrect decisions.