1 of 24

DeepXplore: Automated Whitebox Testing of Deep Learning Systems

Kexin Pei¹, Yinzhi Cao², Junfeng Yang¹, Suman Jana¹

¹Columbia University, ²Lehigh University


2 of 24

Deep learning (DL) has matched human performance!

Image recognition, speech recognition, machine translation, intrusion detection...
Wide deployment in real-world systems


3 of 24

Deep learning is increasingly used in safety-critical systems

Deep learning correctness and security are crucial


Examples: self-driving cars, medical diagnosis, malware detection

4 of 24

Unreliable deep learning contributed to a fatal Tesla crash


Tesla Autopilot failed to recognize a white truck against a bright sky, leading to a fatal crash

5 of 24

Existing DL testing methods are seriously limited

Common practice: measure accuracy on a test input set of randomly chosen data
Problem 1: how good is the coverage of the test set?

DL decision logic is incredibly complex
More fundamentally, what is a testing coverage metric for DL?

Problem 2: it requires expensive labeling effort

Data in test set must be manually labelled
To enlarge the test set, we need to manually label more data


6 of 24

Existing DL testing methods are seriously limited (cont.)

Adversarial testing (Szegedy et al., ICLR’14): find corner-case inputs that are imperceptible to humans but induce errors

Problem 1: how good is the coverage of the test set?
Problem 2: it requires expensive labeling effort
Problem 3: not realistic (theoretical; assumes a very powerful adversary [Sharif et al., CCS’16])


[Figure: adversarial example. Labels: School bus, Carefully crafted noise, Ostrich]

7 of 24

Many traditional software testing techniques don’t apply to DL

DL decision logic is embedded in neurons and layers, not in code


[Figure: a traditional program drawn as a control flow graph (x=0; if (x==8); x+=1; x+=2) next to a neural network]

8 of 24

Quick Summary of DeepXplore

The first step towards systematic testing of Deep Neural Nets (DNNs)
Neuron coverage: first testing coverage metric for deep neural nets
Automated: cross-check multiple DNNs
Realistic: physically realizable transformations
Effective:

15 State-of-the-art DNNs on 5 large datasets (ImageNet, Self-driving cars, PDF/Android malware)
Numerous corner-case errors
50% more neuron coverage than existing testing


[Figure: driving image pair from DeepXplore. Original image: no accident. Darker image: accident]

9 of 24

Outline

Quick deep learning primer
Workflow of DeepXplore

Design
Detail of Neuron coverage

Implementation
Evaluation setup and results summary



11 of 24

Deep learning primer

A neural network is a function f(X) → Y

Trainable parameters (W_i) on each edge and nonlinear activation function at each neuron
DNN learns the weights during training

Inference: simply propagate X through the layers (fast)
Training: Given training set (X,Y), adjust W to minimize the prediction error (slow)
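
As a rough illustration of the inference step, the sketch below propagates an input through a small fully connected network with ReLU hidden layers. The 2-3-2-1 layout and the weight values W1, W2, W3 are hypothetical and chosen only so the arithmetic is easy to follow. Training, which would adjust these weights, is not shown.

```python
def relu(v):
    """ReLU nonlinearity applied elementwise."""
    return [max(x, 0.0) for x in v]

def dense(x, W):
    """Multiply input vector x by weight matrix W (one row per output neuron)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, weights):
    """Inference: propagate x through each layer in turn."""
    h = x
    for i, W in enumerate(weights):
        h = dense(h, W)
        if i < len(weights) - 1:
            # Hidden layers use ReLU; the output layer is left linear in this sketch.
            h = relu(h)
    return h

# Hypothetical weights for a 2-3-2-1 network (not taken from the talk).
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
W3 = [[1.0, 1.0]]

y = forward([1.0, 2.0], [W1, W2, W3])
print(y)
```

Each layer is only a matrix-vector product followed by a nonlinearity, which is why inference is fast compared with training.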


[Figure: neural network with input X, an input layer, hidden layers, an output layer, and output Y; edges carry weights W₁, W₂, W₃]


13 of 24

How does DeepXplore work?


[Figure: DeepXplore workflow diagram with example angle readings and Right/Left predictions from DNN1, DNN2, and DNN3]

Seed inputs without labels
Feed into multiple DNNs (DNN1, DNN2, DNN3)
Objectives: maximize differences and neuron coverage under realistic constraints (e.g., lighting)
Mutate using gradient descent (DNNs are differentiable)
On new input, activate different neurons

Testing as an optimization problem

14 of 24

How to achieve multiple goals simultaneously

Goal 1: systematically find corner cases

Generate inputs that maximize neuron coverage

Goal 2: find DNN errors without manual labels

Differential testing: use multiple DNNs as cross-referencing oracles

Goal 3: generate realistic test inputs

Domain-specific constraints
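
The cross-referencing idea behind Goal 2 can be sketched in a few lines: run the same unlabeled input through several models and flag it when they disagree. The three "models" below are hypothetical threshold functions standing in for real DNNs, and `find_disagreements` is an illustrative helper name, not part of DeepXplore.

```python
def find_disagreements(models, inputs):
    """Return (input, predictions) pairs on which the models do not all agree."""
    flagged = []
    for x in inputs:
        preds = [m(x) for m in models]
        if len(set(preds)) > 1:
            # At least one model differs from the rest: a candidate error, no label needed.
            flagged.append((x, preds))
    return flagged

# Hypothetical stand-ins for three steering models with slightly different decision boundaries.
def dnn1(x):
    return "right" if x > 0.0 else "left"

def dnn2(x):
    return "right" if x > 0.1 else "left"

def dnn3(x):
    return "right" if x > -0.1 else "left"

flagged = find_disagreements([dnn1, dnn2, dnn3], [-0.5, 0.05, 0.5])
print(flagged)
```

A disagreement does not say which model is wrong, only that at least one of them is, so no manual label is needed to flag the input.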


Objectives: maximize corner-case differences and neuron coverage under realistic constraints (e.g., lighting)

Testing DNNs is an optimization problem

DNNs are differentiable → use gradient descent to solve the optimization problem
Mutate the input along the gradient, subject to realistic constraints
See paper for details
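
To make the optimization view concrete, here is a minimal sketch under strong simplifying assumptions: two one-neuron sigmoid models stand in for DNNs, the objective is only the prediction difference (the neuron coverage term is left out), and the lighting constraint is modeled by changing every pixel by the same amount. Because the objective is maximized, the loop steps along the gradient. All weights, the step size, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a one-neuron sigmoid model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad_wrt_input(w, b, x):
    """Gradient of predict with respect to x: p * (1 - p) * w."""
    p = predict(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]

def lighting_constraint(grad):
    """Change every pixel by the same amount, so the mutation only brightens or darkens."""
    mean = sum(grad) / len(grad)
    return [mean] * len(grad)

def generate(seed, model_a, model_b, step=0.1, lam=1.0, max_iters=200):
    """Ascend predict_a(x) - lam * predict_b(x) until the two predicted labels differ."""
    x = list(seed)
    for _ in range(max_iters):
        if (predict(*model_a, x) > 0.5) != (predict(*model_b, x) > 0.5):
            return x  # the models disagree: a difference-inducing input
        ga = grad_wrt_input(*model_a, x)
        gb = grad_wrt_input(*model_b, x)
        g = lighting_constraint([a - lam * b for a, b in zip(ga, gb)])
        # Take one step and keep pixel values inside [0, 1].
        x = [min(1.0, max(0.0, xi + step * gi)) for xi, gi in zip(x, g)]
    return None

# Hypothetical (weights, bias) pairs; both models assign class 1 to the seed.
model_a = ([2.0, 1.0], -0.5)
model_b = ([-1.0, -1.0], 0.9)
seed = [0.2, 0.2]
found = generate(seed, model_a, model_b)
print(found)
```

The constraint is applied to the gradient rather than to the finished input, so every intermediate input stays a uniformly brightened or darkened version of the seed.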

15 of 24

Neuron coverage → how much decision logic exercised

Neuron coverage = # neurons activated / # total neurons
Intuition: layerwise feature detection (Lee et al. ICML’09)


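[Figure: layerwise feature detection. Edge detectors (hedge, vedge, ...) feed part detectors (Nose, Eyes, Wheel, ...), which feed object detectors (Car, Face). Activation function f: ReLU, max(x, 0)]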

Activation threshold=0.75

[3,1,2] [2,-11,1]^T=1

Neuron coverage: 4/7=57%
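
The ratio can be computed directly from recorded neuron outputs. The sketch below is one possible reading, not the authors' implementation: it counts only non-input neurons, compares raw ReLU outputs against the threshold without any per-layer scaling, and uses a hypothetical 2-3-2 network.

```python
def relu(v):
    return [max(x, 0.0) for x in v]

def layer_outputs(x, weights):
    """Run a forward pass and return the output of every neuron, layer by layer."""
    outputs, h = [], x
    for W in weights:
        h = relu([sum(w * xi for w, xi in zip(row, h)) for row in W])
        outputs.append(h)
    return outputs

def neuron_coverage(inputs, weights, threshold=0.75):
    """Fraction of neurons whose output exceeds the threshold for at least one input."""
    covered = set()
    total = sum(len(W) for W in weights)  # one neuron per weight-matrix row
    for x in inputs:
        for layer, outs in enumerate(layer_outputs(x, weights)):
            for idx, value in enumerate(outs):
                if value > threshold:
                    covered.add((layer, idx))
    return len(covered) / total

# Hypothetical 2-3-2 network: 3 hidden neurons and 2 output neurons.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]

one_input = neuron_coverage([[1.0, 2.0]], [W1, W2])
two_inputs = neuron_coverage([[1.0, 2.0], [2.0, 0.0]], [W1, W2])
print(one_input, two_inputs)
```

Coverage is cumulative over the test set: the second input activates neurons the first one left inactive, so the ratio rises as more diverse inputs are added.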



17 of 24

Implementation


[Figure: software stack: CPU/GPU, Linux, TensorFlow 1.0.1, Keras 2.0.3, DeepXplore]

Efficient gradient computation and support for intercepting outputs of intermediate neurons for calculating neuron coverage


19 of 24

Evaluation setup and results summary


Dataset	Description	DNNs	Original random testing accuracy	Avg. neuron coverage improvement over random/adversarial	Avg. Violations found by DeepXplore (2000 seeds)
MNIST	Handwritten digits	LeNet variants	98.63%	30.5% → 70%	1,289
ImageNet	General images in 1000 categories	VGG16, VGG19, ResNet50	93.91%	1% → 69%	1,980
Driving	Udacity self-driving car competition dataset	Nvidia Dave-2 variants	99.94%	3.2% → 59%	1,839
Contagio/VirusTotal	PDF malware	Fully connected	96.29%	18% → 70%	1,048
Drebin	Android malware	Fully connected	96.03%	18.5% → 40%	2,000

20 of 24

Sample corner-case errors for images


[Figure: image pairs whose predicted labels differ]

MNIST: 6 vs. 2; 5 vs. 3
ImageNet: light cardigan vs. diaper; bolete vs. buckeye
Driving: turn right vs. go straight; go straight vs. turn left

21 of 24

Sample corner-case errors for malware

Android malware: mutations only add features to the manifest file


	feature::bluetooth	permission::call_phone	prediction
Before	0	0	Malicious
After	1	1	Benign
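
An add-only constraint of this kind can be sketched as a filter on the gradient: only features that are currently 0 may change, and only to 1. The selection rule below (flip the single absent feature with the largest positive gradient) and the function name `add_only_mutation` are assumptions made for illustration.

```python
def add_only_mutation(features, grad):
    """Set to 1 the absent feature whose gradient most increases the objective."""
    candidates = [i for i, (f, g) in enumerate(zip(features, grad)) if f == 0 and g > 0]
    if not candidates:
        return list(features)  # no feature can be added that helps; leave the input unchanged
    best = max(candidates, key=lambda i: grad[i])
    mutated = list(features)
    mutated[best] = 1          # features are only ever added, never removed
    return mutated

# Hypothetical binary manifest features and gradient of the objective with respect to each.
features = [0, 1, 0]
grad = [0.2, 0.9, 0.5]
mutated = add_only_mutation(features, grad)
print(mutated)
```

The feature at index 1 has the largest gradient but is already present, so it is skipped. Because existing features are never removed, a mutation of this shape only ever adds entries to the manifest.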

22 of 24

Sample corner-case errors for malware (cont.)

PDF malware: Mutations that do not change functionality (Srndic & Laskov Oakland’14)


	size	count_font	author_num	prediction
Before	1	1	10	Malicious
After	34	15	5	Benign

23 of 24

Conclusions and future work

Systematically testing DL for realistic corner cases is a hard problem
DeepXplore is the first step for systematic DL testing

Neuron coverage: first testing coverage metric for deep neural nets
Automated: differential testing by cross-checking multiple DNNs
Realistic: physically realizable transformations
Effective: finds numerous unexpected corner-case errors

A lot of exciting new research problems!

Build analysis tools for testing and verification of ML
Build better debugging support for opaque ML


24 of 24

Check the paper for more results!

Source code: https://github.com/peikexin9/deepxplore

Play demo at: www.deepxplore.org

Thank you!

Questions?
