DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Kexin Pei¹, Yinzhi Cao², Junfeng Yang¹, Suman Jana¹
¹Columbia University, ²Lehigh University
Deep learning (DL) has matched human performance!
Deep learning is increasingly used in safety-critical systems
Self-driving car
Medical diagnosis
Malware detection
Unreliable deep learning contributed to Tesla fatal crash
Tesla Autopilot failed to recognize a white truck against a bright sky, leading to a fatal crash
Existing DL testing methods are seriously limited
Existing DL testing methods are seriously limited (cont.)
[Figure: an image of a school bus plus carefully crafted noise is classified as an ostrich.]
Many traditional software testing techniques don’t apply to DL
[Figure: a traditional program shown as a control flow graph (x = 0; if (x == 8); x += 1; x += 2) next to a neural network.]
Quick Summary of DeepXplore
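[Figure: for an original driving image the DNN's decision causes no accident; for a darker version generated by DeepXplore the decision causes an accident.]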
Outline
Deep learning primer
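[Figure: a feed-forward network with an input layer (input X), hidden layers, and an output layer (output Y); W1, W2, and W3 label the connection weights.]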
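The primer figure can be read as a chain of weighted sums. Below is a minimal sketch of such a forward pass, assuming three small weight matrices named after the figure's W1, W2, W3, ReLU on the hidden layers, and no bias terms. The matrix values and the helper names `relu`, `matvec`, and `forward` are illustrative assumptions, not part of the slides.

```python
def relu(v):
    # Element-wise ReLU: max(a, 0) for every entry.
    return [max(a, 0.0) for a in v]

def matvec(W, v):
    # Multiply matrix W (a list of rows) by vector v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def forward(x, weights):
    # Apply each weight matrix in turn; ReLU after hidden layers only.
    h = x
    for W in weights[:-1]:
        h = relu(matvec(W, h))
    return matvec(weights[-1], h)

# Hypothetical weights, chosen only to make the arithmetic easy to follow.
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 1.0], [-1.0, 1.0]]
W3 = [[1.0, -1.0]]

y = forward([2.0, 1.0], [W1, W2, W3])
print(y)  # [2.0]
```

Here X = [2, 1] becomes [1, 1.5] after W1, [2.5, 0.5] after W2, and Y = [2.0] after W3.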
How does DeepXplore work?
Seed inputs without labels
Feed into multiple DNNs (DNN1, DNN2, DNN3), which agree on the seed (e.g., all predict Right)
Mutate using gradient ascent (DNNs are differentiable)
Objectives:
Maximize differences & neuron coverage
under realistic constraints
(e.g., lighting)
On the new input, the DNNs activate different neurons and disagree (e.g., Right vs. Left)
Testing as an optimization problem
[Figure: a seed input and its mutated version fed into DNN1, DNN2, and DNN3, with outputs Right and Left.]
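The workflow above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the DeepXplore implementation: each "DNN" is a single logistic unit with hand-picked weights, the class `ToyDNN` and function `find_difference` are hypothetical names, the neuron coverage term is left out, and the lighting constraint is modeled by replacing the gradient with its mean so every input value shifts by the same amount.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class ToyDNN:
    """Stand-in for a DNN: one logistic unit that outputs P(steer right)."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def prob_right(self, x):
        return sigmoid(sum(w * xi for w, xi in zip(self.weights, x)) + self.bias)

    def grad_prob_right(self, x):
        # Analytic gradient of the sigmoid output with respect to the input.
        p = self.prob_right(x)
        return [p * (1.0 - p) * w for w in self.weights]

    def predict(self, x):
        return "Right" if self.prob_right(x) > 0.5 else "Left"

def find_difference(seed, dnn_keep, dnn_flip, step=1.0, max_iters=50):
    """Gradient ascent on prob_keep(x) - prob_flip(x) under a uniform brightness shift."""
    x = list(seed)
    for _ in range(max_iters):
        if dnn_keep.predict(x) != dnn_flip.predict(x):
            return x  # difference-inducing input found
        grad = [gk - gf for gk, gf in
                zip(dnn_keep.grad_prob_right(x), dnn_flip.grad_prob_right(x))]
        mean_grad = sum(grad) / len(grad)  # lighting: same change for every value
        x = [xi + step * mean_grad for xi in x]
    return None  # no disagreement found within the iteration budget

dnn1 = ToyDNN([1.0, 1.0], 0.0)
dnn2 = ToyDNN([1.0, -2.0], 0.0)
seed = [1.0, 0.2]  # both toy DNNs predict "Right" on the seed
mutated = find_difference(seed, dnn1, dnn2)
print(dnn1.predict(mutated), dnn2.predict(mutated))
```

No label is needed for the seed: the DNNs cross-check each other, and a disagreement marks at least one of them as wrong.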
How to achieve multiple goals simultaneously
Objectives:
Maximize corner-case differences & neuron coverage
under realistic constraints
(e.g., lighting)
Testing DNNs is an optimization problem
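One way to pursue both goals at once is to fold them into a single weighted objective and follow its gradient. The notation below is an assumption made for illustration: F_k(x)[c] is the confidence DNN k assigns to class c, j is the DNN being pushed to disagree, f_n(x) is the output of a currently inactive neuron, λ1 and λ2 are weights balancing the terms, s is the step size, and "constrain" stands for whatever domain constraint applies (e.g., lighting).

```latex
\mathrm{obj}(x) = \Big(\sum_{k \neq j} F_k(x)[c] \;-\; \lambda_1 F_j(x)[c]\Big) + \lambda_2\, f_n(x),
\qquad
x \leftarrow x + s \cdot \mathrm{constrain}\big(\nabla_x\, \mathrm{obj}(x)\big)
```

The first term rewards disagreement between DNN j and the others, and the second rewards activating a new neuron, so one gradient step serves both objectives.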
Neuron coverage → how much of the decision logic is exercised
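[Figure: a DNN whose early neurons detect horizontal and vertical edges (hedge, vedge), whose later neurons detect parts (Nose, Eyes, Wheel), and whose outputs are Car and Face.]
Activation function f: ReLU, f(x) = max(x, 0)
Activation threshold = 0.75
Example: a neuron whose weighted input sum is 1 outputs f(1) = 1 > 0.75, so it counts as activated
Neuron coverage: 4/7 ≈ 57%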
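The coverage figure on this slide can be reproduced with a short function. This is a sketch under assumptions: the function name `neuron_coverage` and the seven neuron outputs are hypothetical, chosen so that four of them exceed the slide's threshold of 0.75.

```python
def neuron_coverage(activation_batches, threshold=0.75):
    """Fraction of neurons whose output exceeds `threshold` for at least one input."""
    num_neurons = len(activation_batches[0])
    covered = [False] * num_neurons
    for activations in activation_batches:
        for i, a in enumerate(activations):
            if a > threshold:
                covered[i] = True
    return sum(covered) / num_neurons

# Hypothetical outputs of 7 neurons for a single input; 4 exceed 0.75.
outputs_for_input = [1.0, 0.0, 0.9, 0.2, 2.0, 0.0, 0.8]
coverage = neuron_coverage([outputs_for_input])
print(f"{coverage:.0%}")  # 57%
```

Passing outputs for several inputs accumulates coverage, so a new input that activates a previously inactive neuron raises the score.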
Implementation
[Figure: software stack of CPU/GPU, Linux, TensorFlow 1.0.1, Keras 2.0.3, and DeepXplore.]
Efficient gradient computation & support for intercepting outputs of intermediate neurons for calculating neuron coverage
Evaluation setup and results summary
Dataset | Description | DNNs | Original random testing accuracy | Avg. neuron coverage improvement over random/adversarial | Avg. violations found by DeepXplore (2,000 seeds) |
MNIST | Handwritten digits | LeNet variants | 98.63% | 30.5% → 70% | 1,289 |
ImageNet | General images in 1,000 categories | VGG16, VGG19, ResNet50 | 93.91% | 1% → 69% | 1,980 |
Driving | Udacity self-driving car competition dataset | Nvidia Dave-2 variants | 99.94% | 3.2% → 59% | 1,839 |
Contagio/VirusTotal | PDF malware | Fully connected | 96.29% | 18% → 70% | 1,048 |
Drebin | Android malware | Fully connected | 96.03% | 18.5% → 40% | 2,000 |
Sample corner-case errors for images
[Figure: pairs of images with differing predictions.]
MNIST: 6 vs. 2; 5 vs. 3
ImageNet: light cardigan vs. diaper; bolete vs. buckeye
Driving: turn right vs. go straight; go straight vs. turn left
Sample corner-case errors for malware
| feature::bluetooth | permission::call_phone | prediction |
Before | 0 | 0 | Malicious |
After | 1 | 1 | Benign |
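The table shows two features going from 0 to 1. A realistic constraint for this kind of binary feature vector could be sketched as below, assuming that mutations may only add features and never remove them. The function name `add_only_mutation` and the gradient values are hypothetical.

```python
def add_only_mutation(features, gradient):
    """Set a binary feature to 1 where the gradient is positive; never clear a feature."""
    return [1 if (f == 1 or g > 0) else f for f, g in zip(features, gradient)]

names = ["feature::bluetooth", "permission::call_phone"]
before = [0, 0]
hypothetical_gradient = [0.4, 0.1]  # assumed values, for illustration only

after = add_only_mutation(before, hypothetical_gradient)
print(dict(zip(names, after)))  # {'feature::bluetooth': 1, 'permission::call_phone': 1}
```

Under this assumption, a feature that is already 1 stays 1 even when the gradient is negative.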
Sample corner-case errors for malware (cont.)
| size | count_font | author_num | prediction |
Before | 1 | 1 | 10 | Malicious |
After | 34 | 15 | 5 | Benign |
Conclusions and future work
Check the paper for more results!
Source code: https://github.com/peikexin9/deepxplore
Try the demo at: www.deepxplore.org
Thank you!
Questions?