1 of 23

Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source

Anjiang Wei, Yinlin Deng, Chenyuan Yang, Lingming Zhang

CCF-2131943

CCF-2141474

1

2 of 23

Deep-Learning Libraries

Model Definition

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = Conv2d(32, 16, 3)
        self.l2 = MaxPool2d((3, 2), 2)

    def forward(self, x):
        x = self.l1(x)
        x = self.l2(x)
        return F.relu(x)

Loading Dataset

class MyDataset(Dataset):
    def __getitem__(self, idx):
        image = read_image(…)
        image = normalize(image)
        label = read_label(…)
        return image, label

Training / Inference

net = MyNet()
for data, label in MyDataset():
    out = net(data)
    loss = criterion(out, label)
    loss.backward()

[Figure: DL library stack. User model code calls the library's Python APIs; C++ backends (Aten, CuDNN) abstract over CPU, GPU, and mobile hardware]

  • Build a DL model
    • Define a model
    • Load the dataset
    • Run training / inference

  • DL libraries
    • APIs mainly in Python
    • Different backends
    • Abstraction for hardware


3 of 23

Prior Work

  • CRADLE1
    • Detecting bugs in DL Libraries with existing models
    • Differential testing as test oracle: compare outputs of different libraries


1Pham et al. “CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries”. ICSE 2019

[Figure: CRADLE workflow. Existing models are run through a high-level library on different backends and compared via differential testing]

4 of 23

Prior Work


2Wang et al. “Deep learning library testing via effective model generation”. FSE 2020.

We also acknowledge using their slides to illustrate model-level mutation rules

Layer Switch (LS)

Layer Copy (LC)

Layer Addition (LA)

Layer Removal (LR)

Activation Function Removal (AFRm)

Activation Function Replace (AFRp)

Multi-Layers Addition (MLA)

Example LEMON model-level mutation rules

  • LEMON2
    • Further applying model-level mutation to generate more DL models
    • State of the art for DL library testing

5 of 23

Motivation

  • Limitation of prior work:
    • Model-level testing covers only a limited number of APIs
      • DL frameworks usually have thousands of APIs (e.g., ~1900 for TensorFlow)
      • Models used in prior work cover only 59 APIs!
    • Model-level mutation rules are constrained
      • E.g., LEMON’s intact-layer mutation requires that: "the output tensor shape of the API to be added/deleted should be identical to its input tensor shape"
  • Our work FreeFuzz:
    • Fully automated API-level fuzz testing
    • Challenge: how to automatically invoke a given API?


6 of 23

Challenge of Fuzzing APIs

  • Exposed in Python, a dynamically-typed language
    • Hard to determine parameter types for test input generation
  • Complex constraints among parameters

...

m = torch.nn.Conv2d(16, 33, 3)

input = torch.randn(20, 16, 50, 100)

output = m(input)

...

- in_channels (int) – Number of channels in the input image

- stride (int or tuple, optional) – Stride of convolution. Default: 1

- groups (int) – Controls the connections between inputs and outputs. Constraints: in_channels and out_channels must both be divisible by groups

Code Snippets from Doc

Documentation for torch.nn.Conv2d
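The `groups` constraint quoted above can be written as a simple validity check; this is an illustrative sketch, not PyTorch's actual validation code:

```python
def conv_groups_valid(in_channels, out_channels, groups):
    """in_channels and out_channels must both be divisible by groups."""
    return in_channels % groups == 0 and out_channels % groups == 0

print(conv_groups_valid(16, 33, 1))  # True: groups=1 is always valid
print(conv_groups_valid(16, 33, 4))  # False: 33 is not divisible by 4
```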


  • Solution:
    • Dynamic tracing when running existing code

7 of 23

FreeFuzz Overview


[Overview figure: FreeFuzz pipeline, from code collection through instrumentation and mutation to the test oracle]

Code Collection (three input sources):

Doc code:
    m = torch.nn.Conv2d(16,33,(3,5),…)
    input = torch.randn(20,16,50,100)
    output = m(input)

Lib tests:
    def test_conv():
        sizes = [(1, 256, 109, 175),
                 (1, 256, 80, 128),…]
        conv = torch.nn.Conv2d(1,256,…)
        for size in sizes:
            x = torch.randn(size,…)
            out = conv(x)

DL models:
    class MyModel(nn.Module):
        self.conv1=nn.Conv3d(32,…)
        self.conv2=nn.Conv3d(64,…)
    net = MyModel()
    for data in dataset:
        net(data)

Instrumentation:

API Value Space:
    torch.nn.Conv2d:
        entry1: in_channels=16, out_channels=33, kernel_size=(3,5), …
                input_tensor_shape=(20,16,50,100), input_tensor_dtype=float32
        entry2: in_channels=1, …
    torch.nn.Conv3d:
        entry1: in_channels=32, …

Customized Type:
    (3,5): (int, int)
    input: Tensor<4, float32>

Argument Value Space:
    in_channels, int:
        torch.nn.Conv2d: 16, 1, …
        torch.nn.Conv3d: 32, 64, …
    out_channels, int:
        torch.nn.Conv2d: 33, 16, …
        torch.nn.Conv3d: …

Mutation:

Type Mutation:
    Tensor<4, float32> → Tensor<4, float16>

Random Value Mutation:
    in_channels = random_int()
    l = torch.nn.Conv2d(in_channels,…)
    input = torch.randn(…, dtype=float16)

Database Value Mutation:
    # similar_API = torch.nn.Conv3d
    in_channels = db.sample_from(Conv3d)
    # in_channels -> 32, 64, …
    l = torch.nn.Conv2d(in_channels,…)
    input = torch.randn(…, dtype=float16)

Oracle:
    Differential testing: run the same API on CPU and GPU; are the result tensors equal? any crash?
    Metamorphic testing: on GPU, is the float16 run faster than the float32 run?
8 of 23

Instrumentation: Type Monitoring System FuzzType

  • Goal: To better guide fuzzing
  • Finer-grained
    • t=(2,1)
    • PythonType(t)=tuple
    • FuzzType(t)=(int,int)
  • Native tensor-object support
    • t=torch.randn(20,16,50,100)
    • PythonType(t)=torch.Tensor
    • FuzzType(t)=tensor<4,float32>
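The FuzzType mapping above can be sketched in a few lines. This is a minimal sketch assuming a string encoding for tensor types; `FakeTensor` is a hypothetical stand-in for `torch.Tensor` so the example runs without PyTorch:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    # hypothetical stand-in for torch.Tensor: only rank and dtype matter here
    ndim: int
    dtype: str

def fuzz_type(v):
    """Return a finer-grained type label than Python's built-in type()."""
    if isinstance(v, FakeTensor):
        # native tensor support: record rank and data type
        return f"Tensor<{v.ndim}, {v.dtype}>"
    if isinstance(v, tuple):
        # recurse into collections instead of reporting just "tuple"
        return tuple(fuzz_type(e) for e in v)
    if isinstance(v, list):
        return [fuzz_type(e) for e in v]
    return type(v).__name__  # int, bool, float, str, ...

print(fuzz_type((2, 1)))                    # ('int', 'int'), not just "tuple"
print(fuzz_type(FakeTensor(4, "float32")))  # Tensor<4, float32>
```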

 


9 of 23

Instrumentation: API Value Space

  • Each entry represents one API invocation

  • Record each argument’s value/type, and input tensor’s shape/data type

  • Serve as seeds for mutation-based fuzzing
    • E.g., mutating argument values/types

API Value Space

torch.nn.Conv2d:

Entry1:
    in_channels=16,
    out_channels=33,
    kernel_size=(3,5),
    …
    tensor_shape=(20,16,50,100)
    tensor_dtype=float32

Code Execution

m = torch.nn.Conv2d(16, 33, (3,5),…)

input = torch.randn(20, 16, 50, 100)

output = m(input)
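Recording each invocation into the API value space can be sketched as a wrapper around each public API. The names are hypothetical, and `conv2d` below is a dummy stand-in rather than the real `torch.nn.Conv2d`:

```python
api_value_space = {}  # API name -> list of recorded invocation entries

def traced(api_name, fn):
    """Wrap fn so that every call records its arguments as a seed entry."""
    def wrapper(*args, **kwargs):
        api_value_space.setdefault(api_name, []).append(
            {"args": args, "kwargs": kwargs})
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical stand-in for the real torch.nn.Conv2d
def conv2d(in_channels, out_channels, kernel_size):
    return ("conv", in_channels, out_channels, kernel_size)

conv2d = traced("torch.nn.Conv2d", conv2d)
conv2d(16, 33, (3, 5))  # records entry1: in_channels=16, out_channels=33, ...
print(api_value_space["torch.nn.Conv2d"][0]["args"])  # (16, 33, (3, 5))
```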


10 of 23

Instrumentation: Argument Value Space

  • Aggregated info from API Value Space
  • Each row is keyed by <argument name, type>
  • Each entry in a row maps an API name to its traced values
  • Designed for fuzzing similar APIs
    • To generate values for in_channels for Conv2d, we can sample values of in_channels from Conv3d!
    • Used for Database Value Mutation


API Value Space:
    torch.nn.Conv2d:
        entry1: in_channels=16, out_channels=33
        entry2: in_channels=1,  out_channels=5
    torch.nn.Conv3d:
        entry1: in_channels=32, out_channels=64

Argument Value Space:
    in_channels, int:
        torch.nn.Conv2d: 16, 1, …
        torch.nn.Conv3d: 32, …
    out_channels, int:
        torch.nn.Conv2d: 33, 5, …
        torch.nn.Conv3d: 64, …
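Aggregating the API value space into the argument value space might look like the sketch below; the dictionary layout and values are illustrative, not FreeFuzz's actual database schema:

```python
# API value space: API name -> recorded invocation entries (values illustrative)
api_value_space = {
    "torch.nn.Conv2d": [
        {"in_channels": 16, "out_channels": 33},
        {"in_channels": 1, "out_channels": 5},
    ],
    "torch.nn.Conv3d": [
        {"in_channels": 32, "out_channels": 64},
        {"in_channels": 64, "out_channels": 128},
    ],
}

def aggregate(api_space):
    """Build the argument value space: (arg name, type) -> {API name: values}."""
    arg_space = {}
    for api, entries in api_space.items():
        for entry in entries:
            for name, value in entry.items():
                key = (name, type(value).__name__)
                arg_space.setdefault(key, {}).setdefault(api, []).append(value)
    return arg_space

arg_space = aggregate(api_value_space)
# in_channels values traced for Conv3d can now seed Conv2d fuzzing:
print(arg_space[("in_channels", "int")]["torch.nn.Conv3d"])  # [32, 64]
```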

11 of 23

Mutation: Overview

  • Input:
    • API under test
    • API Value Space
    • Argument Value Space
  • Output:
    • Mutated arguments


See paper for details

Simplified Algorithm:

    Randomly sample one entry in API Value Space
    For each argument in the entry:
        if no_mutation():  # random boolean
            continue
        type = FuzzType(argument)
        if do_type_mutation():  # random boolean
            type = TypeMutation(type)
        if select_rand_over_db():  # random boolean
            argument = RandValueMutation(type, argument)
        else:
            argument = DBValueMutation(type, argument)

12 of 23

Mutation: Type Mutation

  • Type Mutation strategies are based on FuzzType
    • Mutate tensor’s dimension/shape
    • Mutate tensor’s data type
    • Mutate Python’s primitive type into another
    • Mutate types of elements in collections of heterogeneous objects


Mutation Strategies   | T1                           | T2
Tensor Dim Mutation   | Tensor<n1, DT>               | Tensor<n2, DT>, n1 ≠ n2
Tensor Dtype Mutation | Tensor<n, DT1>               | Tensor<n, DT2>, DT2 ≠ DT1
Primitive Mutation    | T1 ∈ {int, bool, float, str} | T2, T2 ≠ T1
Tuple Mutation        | (T_i)_{i ∈ 1…n}              | (type_mutate(T_i))_{i ∈ 1…n}
List Mutation         | [T_i]_{i ∈ 1…n}              | [type_mutate(T_i)]_{i ∈ 1…n}
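A sketch of type mutation over these FuzzType labels, assuming a string encoding like `Tensor<4, float32>`; the rank and dtype pools are illustrative choices, not FreeFuzz's exact ones:

```python
import random

PRIMITIVES = ["int", "bool", "float", "str"]
DTYPES = ["float16", "float32", "float64", "int32", "int64"]

def type_mutate(t):
    """Mutate a FuzzType label following the strategy table above."""
    if isinstance(t, str) and t.startswith("Tensor<"):
        ndim, dtype = t[len("Tensor<"):-1].split(", ")
        if random.random() < 0.5:
            # tensor dim mutation: pick a different rank
            ndim = str(random.choice([n for n in range(1, 6) if str(n) != ndim]))
        else:
            # tensor dtype mutation: pick a different data type
            dtype = random.choice([d for d in DTYPES if d != dtype])
        return f"Tensor<{ndim}, {dtype}>"
    if t in PRIMITIVES:
        # primitive mutation: swap for another primitive type
        return random.choice([p for p in PRIMITIVES if p != t])
    if isinstance(t, tuple):
        # tuple mutation: mutate the element types
        return tuple(type_mutate(e) for e in t)
    if isinstance(t, list):
        # list mutation: mutate the element types
        return [type_mutate(e) for e in t]
    return t

print(type_mutate("Tensor<4, float32>"))  # e.g. Tensor<4, float16>
```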

 

13 of 23

Mutation: Value Mutation

  • Random Value Mutation
    • Randomly generate a value based on its FuzzType
    • E.g., use random_int() for argument in_channels

  • Database Value Mutation
    • Use traced values from similar APIs to test the current API
      • Leverage Argument Value Space
    • How to select similar APIs?
      • Based on API definitions
      • Transform similarity score into probability for sampling APIs


in_channels = random_int()

l = torch.nn.Conv2d(in_channels,…)
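Turning similarity scores into sampling probabilities can be sketched as follows; the scores are made up for illustration, since FreeFuzz derives similarity from API definitions:

```python
import random

def pick_similar_api(similarity):
    """Normalize similarity scores into probabilities and sample one API."""
    apis = list(similarity)
    total = sum(similarity.values())
    weights = [similarity[a] / total for a in apis]
    return random.choices(apis, weights=weights, k=1)[0]

# Hypothetical similarity of other APIs to torch.nn.Conv2d
scores = {"torch.nn.Conv3d": 0.9, "torch.nn.Conv1d": 0.8, "torch.nn.Linear": 0.1}
print(pick_similar_api(scores))  # usually Conv3d or Conv1d, rarely Linear
```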

 

 

14 of 23

Test Oracle

  • Wrong computation bugs via differential testing
    • CPU
    • GPU (with CuDNN disabled)
    • GPU (with CuDNN enabled)

  • Performance bugs via metamorphic testing
    • On GPU, lower-precision tensors tend to execute faster
    • Example: on GPU, float16 tends to execute faster than float32

  • Crash bugs


m = torch.nn.Conv2d(64, 128, 1, 2).cuda()
tensor = torch.rand(1, 64, 32, 32).cuda()
torch.backends.cudnn.enabled = True
output1 = m(tensor)  # with CuDNN enabled
torch.backends.cudnn.enabled = False
output2 = m(tensor)  # with CuDNN disabled
print(output1.sum(), output2.sum())  # debugging
assert torch.allclose(output1, output2)  # fails

Buggy code #1
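The metamorphic performance oracle can be sketched with a generic timing check. The workloads below are stand-ins; FreeFuzz times the actual API call with float16 vs. float32 tensors on GPU:

```python
import time

def measure(fn, runs=3):
    """Best-of-runs wall-clock time of fn()."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def performance_anomaly(low_precision_run, high_precision_run, tolerance=2.0):
    """Flag a potential performance bug if the low-precision variant is
    much slower than the high-precision one."""
    return measure(low_precision_run) > tolerance * measure(high_precision_run)

# Stand-in workloads: the "float16" run does less work, as expected on GPU
fp16_run = lambda: sum(range(10_000))
fp32_run = lambda: sum(range(1_000_000))
print(performance_anomaly(fp16_run, fp32_run))  # False: the relation holds
```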

15 of 23

RQ1: Input Source Study

  • Metric:
    • # Covered APIs
    • Line Coverage
  • Breakdown for
    • Documentation code
    • Developer tests
    • DL models
  • Conclusion:
    • All 3 sources of inputs count


16 of 23

RQ2&3: Coverage Trend & Ablation Study

  • Cost:
    • 7.3 hours for PyTorch
      • 630 APIs * 1000 mutants
    • 9.9 hours for TensorFlow
      • 1900 APIs * 1000 mutants
    • 600 mutants per API: already cost-effective

  • Mutation strategies:
    • All 3 strategies are effective
    • Type Mutation is most effective

Coverage trend analysis for PyTorch


17 of 23

RQ4: Comparison with Prior Work

  • Input comparison
    • Covers far more APIs
    • Highest line coverage

  • Mutation comparison
    • 9× more APIs
    • 1.2× higher line coverage
    • 3.5× lower overhead


Comparison on input coverage:

              | FreeFuzz (tf1.14) | LEMON | CRADLE
    # API     | 313               | 30    | 59
    Line Cov. | 33389             | 29489 | 28967

Comparison with LEMON on mutation:

              | FreeFuzz (tf1.14) | LEMON
    # API     | 313               | 35
    Line Cov. | 35473             | 29766
    Time      | 7h                | 25h

18 of 23

RQ5: Detected Bugs

  • FreeFuzz has detected 49 bugs in total
    • 38 confirmed as previously unknown bugs
    • 21 already fixed by developers
  • Each mutation strategy can help detect certain bugs
    • Demonstrating the importance of all 3 mutation strategies


               | FreeFuzz | FreeFuzz-TypeMu | FreeFuzz-RandMu | FreeFuzz-DBMu | FreeFuzz-AllMu | Confirmed (Fixed)
    PyTorch    | 28       | 13              | 24              | 26            | 5              | 23 (7)
    TensorFlow | 21       | 20              | 5               | 20            | 2              | 15 (14)

19 of 23

More Bug Examples

  • torch.nn.Conv3d crashes if padding_mode is set to 'reflect'
  • How can FreeFuzz find it?
    • Database Value Mutation (using values from similar APIs)
    • torch.nn.Conv2d supports 'reflect', but torch.nn.Conv3d crashes unexpectedly


import torch

from torch.nn import Conv3d

x = torch.rand(2, 3, 3, 3, 3)
Conv3d(3, 4, 3, padding_mode='reflect')(x)  # Crash

Documentation

torch.nn.Conv3d

padding_mode (string, optional) – Supported values: 'zeros', 'reflect', 'replicate' or 'circular'

Buggy code #2

20 of 23

More Bug Examples

  • torch.nn.MaxUnpool2d
    • CPU: throw exception
    • GPU: pass silently

  • How does FreeFuzz find it?
    • Random Value Mutation + differential testing
    • FreeFuzz does NOT always generate valid inputs


import torch

m_gpu = torch.nn.MaxUnpool2d(2, stride=2).cuda()

m_cpu = torch.nn.MaxUnpool2d(2, stride=2)

tensor = torch.rand(1, 1, 2, 2)

indices = torch.randint(-32768, 32768, (1, 1, 2, 2))

gpu_result = m_gpu(tensor.cuda(), indices.cuda())

cpu_result = m_cpu(tensor, indices)  # Exception on CPU

Buggy code #3

GPU produces a wrong result silently without throwing any error!

21 of 23

Conclusion

  • FreeFuzz: the first general-purpose and fully automated API-level fuzzing for DL libraries
    • Mining from open source
      • Library documentation
      • Developer tests
      • DL models in the wild
    • Mutating traced inputs via type and value mutations
    • Detecting bugs via differential testing and metamorphic testing
  • Detected 49 bugs for PyTorch and TensorFlow
    • with 38 already confirmed as previously unknown
  • Covers 9× more APIs than prior work with 3.5× lower overhead
  • FreeFuzz is publicly available: https://github.com/ise-uiuc/FreeFuzz


Questions? Email: Anjiang Wei <anjiang@stanford.edu>

22 of 23

Backup Slides

  • Limitation:
    • Low line coverage percentage: 47309 / 308131 = 15%
      • Potential reason: only collected coverage for CPU, no GPU code coverage included
    • A large number of APIs still uncovered (in code collection stage)
    • Developer tests: only tests written in Python are considered


23 of 23

Code Collection & Instrumentation

  • Code collection
    • Code snippets from documentation
    • Library developer tests
    • 202 DL models from open source

  • Instrumentation in Python
    • 630 Python APIs from PyTorch
    • 1900 Python APIs from TensorFlow
    • Reason: public APIs in DL libraries are mainly exposed in Python
    • Limitation: only tests written in Python are considered
