1 of 24

DeepXplore: Automated Whitebox Testing of Deep Learning Systems

Kexin Pei¹, Yinzhi Cao², Junfeng Yang¹, Suman Jana¹

¹Columbia University, ²Lehigh University


2 of 24

Deep learning (DL) has matched human performance on many tasks!

  • Image recognition, speech recognition, machine translation, intrusion detection...
  • Wide deployment in real-world systems


3 of 24

Deep learning is increasingly used in safety-critical systems

  • Deep learning correctness and security are crucial
  • Examples: self-driving cars, medical diagnosis, malware detection

4 of 24

Unreliable deep learning contributed to a fatal Tesla crash

  • Tesla Autopilot failed to recognize a white truck against a bright sky, leading to a fatal crash

5 of 24

Existing DL testing methods are seriously limited

  • Common practice: measure accuracy on a test input set of randomly chosen data
  • Problem 1: how good is the coverage of the test set?
    • DL decision logic is incredibly complex
    • More fundamentally, what is a testing coverage metric for DL?
  • Problem 2: it requires expensive labeling effort
    • Data in test set must be manually labelled
    • To enlarge the test set, we need to manually label more data


6 of 24

Existing DL testing methods are seriously limited (cont.)

  • Adversarial testing (Szegedy et al. ICLR’14): find corner-case inputs that are imperceptible to humans but induce errors
    • Problem 1: how good is the coverage of the test set?
    • Problem 2: it requires expensive labeling effort
    • Problem 3: Not realistic. (Theoretical, assumes a very powerful adversary. [Sharif et al. CCS’16])

[Figure: a school bus image plus carefully crafted noise is classified as an ostrich]

7 of 24

Many traditional software testing techniques don’t apply to DL

  • DL decision logic is embedded in neurons and layers, not in code

[Figure: a traditional program shown as a control flow graph (x=0; if (x==8); x+=1; x+=2) next to a neural network]

8 of 24

Quick Summary of DeepXplore

  • The first step towards systematic testing of Deep Neural Nets (DNNs)
  • Neuron coverage: first testing coverage metric for deep neural nets
  • Automated: cross-check multiple DNNs
  • Realistic: physically realizable transformations
  • Effective:
    • 15 state-of-the-art DNNs on 5 large datasets (ImageNet, self-driving cars, PDF/Android malware)
    • Numerous corner-case errors
    • 50% more neuron coverage than existing testing

[Figure: DeepXplore example on a driving image. Original image: no accident. Darker image: accident.]

9 of 24

Outline

  • Quick deep learning primer
  • Workflow of DeepXplore
    • Design
    • Detail of Neuron coverage
  • Implementation
  • Evaluation setup and results summary


11 of 24

Deep learning primer

  • A neural network is a function f(X) → Y
    • Trainable parameters (Wi) on each edge and nonlinear activation function at each neuron
    • DNN learns the weights during training
  • Inference: Simply propagates X through layers (fast)
  • Training: Given training set (X,Y), adjust W to minimize the prediction error (slow)
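As a rough illustration of the inference bullet, the sketch below propagates an input through three weight matrices in pure Python. The layer sizes, the weight values, and the helper names (`dense`, `forward`) are made up for illustration and do not come from any model in the talk.

```python
# Minimal sketch of inference in a small fully connected network.
# All weights are hypothetical illustration values.

def relu(x):
    """Nonlinear activation applied at each hidden neuron."""
    return max(x, 0.0)

def dense(inputs, weights, activation=None):
    """One layer: each row of `weights` holds the incoming edge weights of one neuron."""
    outputs = []
    for row in weights:
        total = sum(w * x for w, x in zip(row, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs

def forward(x, layers):
    """Propagate x through all layers: ReLU on hidden layers, identity on the last."""
    for i, weights in enumerate(layers):
        is_last = i == len(layers) - 1
        x = dense(x, weights, None if is_last else relu)
    return x

W1 = [[1.0, -1.0], [0.5, 0.5]]   # input (2 values) -> hidden layer 1 (2 neurons)
W2 = [[1.0, 1.0], [-1.0, 2.0]]   # hidden layer 1 -> hidden layer 2 (2 neurons)
W3 = [[1.0, -1.0]]               # hidden layer 2 -> output (1 neuron)

y = forward([2.0, 1.0], [W1, W2, W3])
print(y)  # [0.5]
```

Training would adjust W1, W2, and W3 to reduce the prediction error on (X, Y); that part is omitted here.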

[Figure: feed-forward network with an input layer (X), hidden layers, and an output layer (Y), connected by weights W1, W2, W3]

13 of 24

How does DeepXplore work?

[Figure: DeepXplore workflow on a driving image, with steering predictions Right/Left]

  • Seed inputs without labels
  • Feed into multiple DNNs (DNN1, DNN2, DNN3)
  • Objectives: maximize differences & neuron coverage under realistic constraints (e.g., lighting)
  • Mutate using gradient descent (DNNs are differentiable)
  • On new input, activate different neurons

Testing as an optimization problem

14 of 24

How to achieve multiple goals simultaneously

  • Goal 1: systematically find corner cases
    • Generate inputs that maximize neuron coverage
  • Goal 2: find DNN errors without manual labels
    • Differential testing: use multiple DNNs as cross-referencing oracles
  • Goal 3: generate realistic test inputs
    • Domain-specific constraints
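Goal 2 can be sketched in a few lines: run the same unlabeled inputs through several models and flag every input on which they disagree. The three "DNNs" below are hypothetical threshold rules standing in for real networks, and `find_disagreements` is an illustrative helper name, not part of DeepXplore.

```python
def find_disagreements(models, inputs):
    """Return (input, predictions) pairs where the models do not all agree.

    No ground-truth labels are needed: the models act as oracles for each other.
    """
    suspicious = []
    for x in inputs:
        predictions = [model(x) for model in models]
        if len(set(predictions)) > 1:
            suspicious.append((x, predictions))
    return suspicious

# Hypothetical steering models that map image brightness to a direction.
def dnn1(brightness):
    return "right" if brightness > 0.3 else "left"

def dnn2(brightness):
    return "right" if brightness > 0.3 else "left"

def dnn3(brightness):
    return "right" if brightness > 0.5 else "left"

report = find_disagreements([dnn1, dnn2, dnn3], [0.2, 0.4, 0.8])
print(report)  # [(0.4, ['right', 'right', 'left'])]
```

A disagreement only says that at least one model is wrong on that input; deciding which one still requires inspection.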

Objectives: maximize corner-case differences & neuron coverage under realistic constraints (e.g., lighting)

Testing DNNs is an optimization problem

  • DNNs are differentiable → use gradient descent to solve the optimization problem
  • Apply mutations w.r.t. the input under realistic constraints
  • See paper for details
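The optimization view can be made concrete with a toy example. The sketch below is an assumption-laden simplification: the two "DNNs" are hypothetical logistic models over a three-pixel image, the objective only rewards disagreement (the neuron coverage term is left out), and the lighting constraint is modeled by replacing the gradient with its mean so that every pixel changes by the same amount. It ascends the objective, which is the same as gradient descent on its negation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(model, x):
    """Probability of class 1 for a logistic model given as (weights, bias)."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def objective_grad(model_a, model_b, x):
    """Gradient w.r.t. x of predict(model_a, x) - predict(model_b, x)."""
    pa, pb = predict(model_a, x), predict(model_b, x)
    return [pa * (1 - pa) * wa - pb * (1 - pb) * wb
            for wa, wb in zip(model_a[0], model_b[0])]

def lighting_constraint(grad):
    """Shift all pixels by the same amount, i.e. only brighten or darken."""
    mean = sum(grad) / len(grad)
    return [mean] * len(grad)

def generate(model_a, model_b, seed, step=0.05, max_iters=200):
    """Mutate the seed by gradient ascent until the two models disagree."""
    x = list(seed)
    for _ in range(max_iters):
        if (predict(model_a, x) > 0.5) != (predict(model_b, x) > 0.5):
            return x  # difference-inducing input found
        grad = lighting_constraint(objective_grad(model_a, model_b, x))
        # Keep pixel values inside the valid range [0, 1].
        x = [min(1.0, max(0.0, xi + step * gi)) for xi, gi in zip(x, grad)]
    return None

# Hypothetical models and seed image (illustration values only).
model_a = ([6.0, 4.0, 2.0], -5.0)
model_b = ([1.0, 2.0, 3.0], -4.0)
seed = [0.3, 0.3, 0.3]

result = generate(model_a, model_b, seed)
print(result)
```

On the seed both models predict class 0; after a few uniform brightening steps `model_a` flips to class 1 while `model_b` does not, which is the kind of difference-inducing input the slide describes.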

15 of 24

Neuron coverage → how much decision logic is exercised

  • Neuron coverage = # neurons activated / # total neurons
  • Intuition: layerwise feature detection (Lee et al. ICML’09)

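[Figure: layerwise feature detection in a DNN. Early neurons detect edges (hedge, vedge, ...), later neurons detect parts (Nose, Eyes, Wheel, ...), and the final neurons detect objects (Car, Face). Each neuron applies the activation f = ReLU, f(x) = max(x, 0), e.g. f(1) = 1.]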

Activation threshold=0.75

[3,1,2] [2,-11,1]T=1

Neuron coverage: 4/7=57%
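The ratio can be computed directly from recorded neuron outputs. The sketch below uses made-up activation values for a 7-neuron network, chosen so that 4 neurons exceed the 0.75 threshold as in the slide; `neuron_coverage` is an illustrative helper, not DeepXplore's own function.

```python
def neuron_coverage(activations_per_input, threshold=0.75):
    """Fraction of neurons whose output exceeds `threshold` on at least one input.

    `activations_per_input` holds one list of neuron outputs per test input;
    every list covers the same neurons in the same order.
    """
    num_neurons = len(activations_per_input[0])
    covered = [False] * num_neurons
    for activations in activations_per_input:
        for i, value in enumerate(activations):
            if value > threshold:
                covered[i] = True
    return sum(covered) / num_neurons

# Hypothetical outputs of 7 neurons for one input: 4 of them exceed 0.75.
first_input = [1.0, 0.2, 0.9, 0.0, 2.3, 0.5, 0.8]
coverage = neuron_coverage([first_input])
print(round(coverage, 2))  # 0.57

# A second input that activates a previously inactive neuron raises coverage.
second_input = [0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0]
combined = neuron_coverage([first_input, second_input])
print(round(combined, 2))  # 0.71
```

Coverage is cumulative over a test set, so a new input is valuable when it activates neurons that no earlier input reached.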


17 of 24

Implementation

  • Stack: DeepXplore on top of Keras 2.0.3 and TensorFlow 1.0.1, running on Linux with CPU or GPU
  • Efficient gradient computation & support for intercepting outputs of intermediate neurons for calculating neuron coverage

19 of 24

Evaluation setup and results summary

Dataset             | Description                                  | DNNs                   | Original random testing accuracy | Avg. neuron coverage improvement over random/adversarial | Avg. violations found by DeepXplore (2000 seeds)
MNIST               | Handwritten digits                           | LeNet variants         | 98.63%                           | 30.5% → 70%                                              | 1,289
ImageNet            | General images in 1000 categories            | VGG16, VGG19, ResNet50 | 93.91%                           | 1% → 69%                                                 | 1,980
Driving             | Udacity self-driving car competition dataset | Nvidia Dave-2 variants | 99.94%                           | 3.2% → 59%                                               | 1,839
Contagio/VirusTotal | PDF malware                                  | Fully connected        | 96.29%                           | 18% → 70%                                                | 1,048
Drebin              | Android malware                              | Fully connected        | 96.03%                           | 18.5% → 40%                                              | 2,000

20 of 24

Sample corner-case errors for images

[Figure: pairs of images that receive different predictions]

  • MNIST: 6 vs. 2; 5 vs. 3
  • ImageNet: light cardigan vs. diaper; bolete vs. buckeye
  • Driving: turn right vs. go straight; go straight vs. turn left

21 of 24

Sample corner-case errors for malware

  • Android malware: mutations only add features to the manifest file

       | feature::bluetooth | permission::call_phone | prediction
Before | 0                  | 0                      | Malicious
After  | 1                  | 1                      | Benign
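The add-only constraint for manifest features can be sketched as a mutation that may switch a binary feature from 0 to 1 but never from 1 to 0. The helper name `add_only_mutation`, the gradient values, and the third feature name are hypothetical; how candidates are ranked here (largest positive gradient first) is an assumption made for the example.

```python
def add_only_mutation(features, gradient, max_changes=2):
    """Flip up to `max_changes` binary features from 0 to 1, never from 1 to 0.

    Candidates are features that are currently 0 and have a positive gradient;
    those with the largest gradient are switched on first.
    """
    candidates = [i for i, (f, g) in enumerate(zip(features, gradient))
                  if f == 0 and g > 0]
    candidates.sort(key=lambda i: gradient[i], reverse=True)
    mutated = list(features)
    for i in candidates[:max_changes]:
        mutated[i] = 1
    return mutated

# The first two feature names come from the slide; the third is hypothetical.
names = ["feature::bluetooth", "permission::call_phone", "permission::example"]
before = [0, 0, 1]
gradient = [0.8, 0.5, -0.9]  # made-up gradient of the objective w.r.t. each feature

after = add_only_mutation(before, gradient)
print(dict(zip(names, after)))
```

The third feature has a negative gradient, so an unconstrained mutation would remove it; the constraint keeps it at 1 because existing manifest entries are never removed.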

22 of 24

Sample corner-case errors for malware (cont.)

  • PDF malware: Mutations that do not change functionality (Srndic & Laskov Oakland’14)

       | size | count_font | author_num | prediction
Before | 1    | 1          | 10         | Malicious
After  | 34   | 15         | 5          | Benign

23 of 24

Conclusions and future work

  • Systematically testing DL for realistic corner cases is a hard problem
  • DeepXplore is the first step for systematic DL testing
    • Neuron coverage: first testing coverage metric for deep neural nets
    • Automated: differential testing by cross-checking multiple DNNs
    • Realistic: physically realizable transformations
    • Effective: finds numerous unexpected corner-case errors
  • A lot of exciting new research problems!
    • Build analysis tools for testing and verification of ML
    • Build better debugging support for opaque ML


24 of 24

Check the paper for more results!

Source code: https://github.com/peikexin9/deepxplore

Play demo at: www.deepxplore.org

Thank you!

Questions?
