- This work has proposed a sparse attack that yields interpretable adversarial examples, owing to their high sparsity.
- Our key idea is to approximate the NP-hard sparsity optimization problem via a theoretically sound reparameterization technique, which makes direct optimization of sparse perturbations tractable.
- Our approach outperforms sparse attack counterparts in computational efficiency, transferability, and attack intensity.
- Our approach reveals two types of adversarial perturbations, empowering us to interpret how adversarial examples mislead DNNs into incorrect decisions.
Towards Interpretable Adversarial Examples
via Sparse Adversarial Attack
Fudong Lin1, Jiadong Lou1, Hao Wang2, Brian Jalaian3, Xu Yuan1
1 University of Delaware, 2 Stevens Institute of Technology, 3 University of West Florida
- Dense Attacks: Hard to interpret, since their adversarial examples are overly perturbed
- Sparse Attacks: Face an NP-hard optimization problem, and their resultant adversarial examples suffer from poor sparsity
- Our Goal: Develop a sparse attack that yields interpretable adversarial examples, allowing us to interpret the vulnerability of Deep Neural Networks (DNNs)
Interpretable Adversarial Attacks
- Explainable AI: Grad-CAM and Guided Grad-CAM (a minimal Grad-CAM sketch follows this list)
- Two Types of Malicious Noise: “obscuring noise”, which prevents DNNs from identifying the true label, and “leading noise”, which misleads DNNs into incorrect decisions
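The snippet below is a minimal Grad-CAM sketch for reference, not the poster's exact tooling; the pretrained ResNet-50 and the choice of its final convolutional block (`model.layer4`) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Assumed setup: a pretrained ResNet-50 and its last convolutional block.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layer = model.layer4

activations, gradients = {}, {}

def forward_hook(module, inputs, output):
    activations["value"] = output.detach()

def backward_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

target_layer.register_forward_hook(forward_hook)
target_layer.register_full_backward_hook(backward_hook)

def grad_cam(x, class_idx=None):
    """Return an [H, W] heatmap for a preprocessed [1, 3, H, W] input x."""
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    # Channel weights: globally averaged gradients of the class score.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # [1, C, 1, 1]
    cam = F.relu((weights * activations["value"]).sum(dim=1))    # keep positive evidence
    cam = F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam / (cam.max() + 1e-8)                              # normalize to [0, 1]
```

Overlaying such heatmaps on a clean input and its adversarial counterpart is one way to visualize which regions the “obscuring” and “leading” noise act upon.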
Interpret the Vulnerability of DNNs
Intractable Sparse Optimization
- Reformulate the loss function by using the Heaviside step function (sketched below)
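The poster's exact equation is not reproduced here; the following is a minimal sketch, assuming the standard $\ell_0$ sparse-attack objective in which the $\ell_0$ penalty on the perturbation $\delta$ is written element-wise through the Heaviside step function $H(\cdot)$:

```latex
% Hedged sketch (notation assumed, not copied from the poster): the l0 penalty
% counts nonzero perturbation entries via the Heaviside step function H.
\begin{align}
  \min_{\delta}\;\;
    \mathcal{L}\bigl(f(x+\delta),\, y\bigr) + \lambda\,\|\delta\|_{0},
  \qquad
  \|\delta\|_{0} = \sum_{i} H\bigl(|\delta_i|\bigr),
  \qquad
  H(z) =
  \begin{cases}
    1, & z > 0,\\
    0, & z \le 0.
  \end{cases}
\end{align}
```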
Approximation of Sparse Optimization
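Because the Heaviside step has zero gradient almost everywhere, the counting objective above cannot be optimized directly. The paper's exact reparameterization is not reproduced here; the sketch below swaps in a temperature-controlled sigmoid surrogate as an assumed stand-in, only to illustrate how a differentiable approximation makes gradient-based optimization of sparse perturbations possible. The function name `sparse_attack_sketch` and all hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def sparse_attack_sketch(model, x, y, steps=200, lam=0.05, tau=500.0, eps=0.01, lr=0.05):
    """Optimize a perturbation whose l0 penalty is replaced by a smooth sigmoid
    surrogate (an assumption for illustration, not the poster's exact method).
    `x` is a batch of images in [0, 1]; `y` holds the ground-truth class indices."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        adv = (x + delta).clamp(0.0, 1.0)
        logits = model(adv)
        # Untargeted objective: push the prediction away from the true label y.
        attack_loss = -F.cross_entropy(logits, y)
        # Smooth stand-in for ||delta||_0 = sum_i H(|delta_i|):
        # sigmoid(tau * (|delta_i| - eps)) approaches the Heaviside step as tau grows.
        sparsity = torch.sigmoid(tau * (delta.abs() - eps)).sum()
        loss = attack_loss + lam * sparsity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta.detach()).clamp(0.0, 1.0)
```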
[Figure: adversarial examples misclassifying “candle” as “toilet tissue” and “canoe” as “wings”]
Comparison to Sparse Attack Counterparts
- Our approach achieves the best attack success rates when attacking robust models trained with PGD-AT or Fast-AT.