1 of 23

Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source

Anjiang Wei, Yinlin Deng, Chenyuan Yang, Lingming Zhang

CCF-2131943

CCF-2141474

1

2 of 23

Deep-Learning Libraries

Model Definition

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1 = Conv2d(32, 16, 3)
        self.l2 = MaxPool2d((3, 2), 2)

    def forward(self, x):
        x = self.l1(x)
        x = self.l2(x)
        return F.relu(x)

Loading Dataset

class MyDataset(Dataset):
    def __getitem__(self, idx):
        image = read_image(…)
        image = normalize(image)
        label = read_label(…)
        return image, label

Training / Inference

net = MyNet()
for data, label in MyDataset():
    out = net(data)
    loss = criterion(out, label)
    loss.backward()

[Figure: DL library stack. User model code calls the library's Python APIs; C++ backends (Aten, CuDNN) abstract over CPU, GPU, and mobile hardware]

  • Build a DL model
    • Define a model
    • Load the dataset
    • Run training / inference

  • DL libraries
    • APIs mainly in Python
    • Different backends
    • Abstraction for hardware


3 of 23

Prior Work

  • CRADLE1
    • Detecting bugs in DL Libraries with existing models
    • Differential testing as test oracle: compare outputs of different libraries


1Pham et al. “CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries”. ICSE 2019

[Figure: CRADLE workflow. Existing models are run through a high-level library on different backends and compared via differential testing]

4 of 23

Prior Work


2Wang et al. “Deep learning library testing via effective model generation”. FSE 2020.

We also acknowledge using their slides to illustrate model-level mutation rules

Layer Switch (LS)

Layer Copy (LC)

Layer Addition (LA)

Layer Removal (LR)

Activation Function Removal (AFRm)

Activation Function Replace (AFRp)

Multi-Layers Addition (MLA)

Example LEMON model-level mutation rules

  • LEMON2
    • Further applying model-level mutation to generate more DL models
    • State of the art for DL library testing

5 of 23

Motivation

  • Limitation of prior work:
    • Model-level testing covers only a limited number of APIs
      • DL frameworks usually have thousands of APIs (e.g., ~1900 for TensorFlow)
      • Models used in prior work cover only 59 APIs!
    • Model-level mutation rules are constrained
      • E.g., LEMON’s intact-layer mutation requires that: "the output tensor shape of the API to be added/deleted should be identical to its input tensor shape"
  • Our work FreeFuzz:
    • Fully automated API-level fuzz testing
    • Challenge: how to automatically invoke a given API?


6 of 23

Challenge of Fuzzing APIs

  • Exposed in Python, a dynamically-typed language
    • Hard to determine parameter types for test input generation
  • Complex constraints among parameters

...

m = torch.nn.Conv2d(16, 33, 3)

input = torch.randn(20, 16, 50, 100)

output = m(input)

...

- in_channels (int) – Number of channels in the input image

- stride (int or tuple, optional) – Stride of convolution. Default: 1

- groups (int) – Controls the connections between inputs and outputs. Constraints: in_channels and out_channels must both be divisible by groups

Code Snippets from Doc

Documentation for torch.nn.Conv2d
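The `groups` constraint quoted above can be written as a simple validity check; this is an illustrative sketch, not PyTorch's actual validation code:

```python
def conv_groups_valid(in_channels, out_channels, groups):
    """in_channels and out_channels must both be divisible by groups."""
    return in_channels % groups == 0 and out_channels % groups == 0

print(conv_groups_valid(16, 33, 1))  # True: groups=1 is always valid
print(conv_groups_valid(16, 33, 4))  # False: 33 is not divisible by 4
```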


  • Solution:
    • Dynamic tracing when running existing code

7 of 23

FreeFuzz Overview


[Overview figure: FreeFuzz pipeline, from code collection through instrumentation and mutation to the test oracle]

Code Collection (three input sources):

Doc code:
    m = torch.nn.Conv2d(16,33,(3,5),…)
    input = torch.randn(20,16,50,100)
    output = m(input)

Lib tests:
    def test_conv():
        sizes = [(1, 256, 109, 175),
                 (1, 256, 80, 128),…]
        conv = torch.nn.Conv2d(1,256,…)
        for size in sizes:
            x = torch.randn(size,…)
            out = conv(x)

DL models:
    class MyModel(nn.Module):
        self.conv1=nn.Conv3d(32,…)
        self.conv2=nn.Conv3d(64,…)
    net = MyModel()
    for data in dataset:
        net(data)

Instrumentation:

API Value Space:
    torch.nn.Conv2d:
        entry1: in_channels=16, out_channels=33, kernel_size=(3,5), …
                input_tensor_shape=(20,16,50,100), input_tensor_dtype=float32
        entry2: in_channels=1, …
    torch.nn.Conv3d:
        entry1: in_channels=32, …

Customized Type:
    (3,5): (int, int)
    input: Tensor<4, float32>

Argument Value Space:
    in_channels, int:
        torch.nn.Conv2d: 16, 1, …
        torch.nn.Conv3d: 32, 64, …
    out_channels, int:
        torch.nn.Conv2d: 33, 16, …
        torch.nn.Conv3d: …

Mutation:

Type Mutation:
    Tensor<4, float32> → Tensor<4, float16>

Random Value Mutation:
    in_channels = random_int()
    l = torch.nn.Conv2d(in_channels,…)
    input = torch.randn(…, dtype=float16)

Database Value Mutation:
    # similar_API = torch.nn.Conv3d
    in_channels = db.sample_from(Conv3d)
    # in_channels -> 32, 64, …
    l = torch.nn.Conv2d(in_channels,…)
    input = torch.randn(…, dtype=float16)

Oracle:
    Differential testing: run the same API on CPU and GPU; are the result tensors equal? any crash?
    Metamorphic testing: on GPU, is the float16 run faster than the float32 run?
8 of 23

Instrumentation: Type Monitoring System FuzzType

  • Goal: To better guide fuzzing
  • Finer-grained
    • t=(2,1)
    • PythonType(t)=tuple
    • FuzzType(t)=(int,int)
  • Native tensor-object support
    • t=torch.randn(20,16,50,100)
    • PythonType(t)=torch.Tensor
    • FuzzType(t)=tensor<4,float32>
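The FuzzType mapping above can be sketched in a few lines. This is a minimal sketch assuming a string encoding for tensor types; `FakeTensor` is a hypothetical stand-in for `torch.Tensor` so the example runs without PyTorch:

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    # hypothetical stand-in for torch.Tensor: only rank and dtype matter here
    ndim: int
    dtype: str

def fuzz_type(v):
    """Return a finer-grained type label than Python's built-in type()."""
    if isinstance(v, FakeTensor):
        # native tensor support: record rank and data type
        return f"Tensor<{v.ndim}, {v.dtype}>"
    if isinstance(v, tuple):
        # recurse into collections instead of reporting just "tuple"
        return tuple(fuzz_type(e) for e in v)
    if isinstance(v, list):
        return [fuzz_type(e) for e in v]
    return type(v).__name__  # int, bool, float, str, ...

print(fuzz_type((2, 1)))                    # ('int', 'int'), not just "tuple"
print(fuzz_type(FakeTensor(4, "float32")))  # Tensor<4, float32>
```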

 


9 of 23

Instrumentation: API Value Space

  • Each entry represents one API invocation

  • Record each argument’s value/type, and input tensor’s shape/data type

  • Serve as seeds for mutation-based fuzzing
    • E.g., mutating argument values/types

API Value Space

torch.nn.Conv2d:

Entry1:
    in_channels=16,
    out_channels=33,
    kernel_size=(3,5),
    …
    tensor_shape=(20,16,50,100)
    tensor_dtype=float32

Code Execution

m = torch.nn.Conv2d(16, 33, (3,5),…)

input = torch.randn(20, 16, 50, 100)

output = m(input)
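Recording each invocation into the API value space can be sketched as a wrapper around each public API. The names are hypothetical, and `conv2d` below is a dummy stand-in rather than the real `torch.nn.Conv2d`:

```python
api_value_space = {}  # API name -> list of recorded invocation entries

def traced(api_name, fn):
    """Wrap fn so that every call records its arguments as a seed entry."""
    def wrapper(*args, **kwargs):
        api_value_space.setdefault(api_name, []).append(
            {"args": args, "kwargs": kwargs})
        return fn(*args, **kwargs)
    return wrapper

# Hypothetical stand-in for the real torch.nn.Conv2d
def conv2d(in_channels, out_channels, kernel_size):
    return ("conv", in_channels, out_channels, kernel_size)

conv2d = traced("torch.nn.Conv2d", conv2d)
conv2d(16, 33, (3, 5))  # records entry1: in_channels=16, out_channels=33, ...
print(api_value_space["torch.nn.Conv2d"][0]["args"])  # (16, 33, (3, 5))
```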


10 of 23

Instrumentation: Argument Value Space

  • Aggregated info from API Value Space
  • Each row is keyed by <argument name, type>
  • Each entry in a row maps an API name to its traced values
  • Designed for fuzzing similar APIs
    • To generate values for in_channels for Conv2d, we can sample values of in_channels from Conv3d!
    • Used for Database Value Mutation


API Value Space:
    torch.nn.Conv2d:
        entry1: in_channels=16, out_channels=33
        entry2: in_channels=1,  out_channels=5
    torch.nn.Conv3d:
        entry1: in_channels=32, out_channels=64

Argument Value Space:
    in_channels, int:
        torch.nn.Conv2d: 16, 1, …
        torch.nn.Conv3d: 32, …
    out_channels, int:
        torch.nn.Conv2d: 33, 5, …
        torch.nn.Conv3d: 64, …
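Aggregating the API value space into the argument value space might look like the sketch below; the dictionary layout and values are illustrative, not FreeFuzz's actual database schema:

```python
# API value space: API name -> recorded invocation entries (values illustrative)
api_value_space = {
    "torch.nn.Conv2d": [
        {"in_channels": 16, "out_channels": 33},
        {"in_channels": 1, "out_channels": 5},
    ],
    "torch.nn.Conv3d": [
        {"in_channels": 32, "out_channels": 64},
        {"in_channels": 64, "out_channels": 128},
    ],
}

def aggregate(api_space):
    """Build the argument value space: (arg name, type) -> {API name: values}."""
    arg_space = {}
    for api, entries in api_space.items():
        for entry in entries:
            for name, value in entry.items():
                key = (name, type(value).__name__)
                arg_space.setdefault(key, {}).setdefault(api, []).append(value)
    return arg_space

arg_space = aggregate(api_value_space)
# in_channels values traced for Conv3d can now seed Conv2d fuzzing:
print(arg_space[("in_channels", "int")]["torch.nn.Conv3d"])  # [32, 64]
```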

11 of 23

Mutation: Overview

  • Input:
    • API under test
    • API Value Space
    • Argument Value Space
  • Output:
    • Mutated arguments


See paper for details

Simplified Algorithm:

    Randomly sample one entry in API Value Space
    For each argument in the entry:
        if no_mutation():  # random boolean
            continue
        type = FuzzType(argument)
        if do_type_mutation():  # random boolean
            type = TypeMutation(type)
        if select_rand_over_db():  # random boolean
            argument = RandValueMutation(type, argument)
        else:
            argument = DBValueMutation(type, argument)

12 of 23

Mutation: Type Mutation

  • Type Mutation strategies are based on FuzzType
    • Mutate tensor’s dimension/shape
    • Mutate tensor’s data type
    • Mutate Python’s primitive type into another
    • Mutate types of elements in collections of heterogeneous objects


Mutation Strategies   | T1                           | T2
Tensor Dim Mutation   | Tensor<n1, DT>               | Tensor<n2, DT>, n1 ≠ n2
Tensor Dtype Mutation | Tensor<n, DT1>               | Tensor<n, DT2>, DT2 ≠ DT1
Primitive Mutation    | T1 ∈ {int, bool, float, str} | T2, T2 ≠ T1
Tuple Mutation        | (T_i)_{i ∈ 1…n}              | (type_mutate(T_i))_{i ∈ 1…n}
List Mutation         | [T_i]_{i ∈ 1…n}              | [type_mutate(T_i)]_{i ∈ 1…n}
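A sketch of type mutation over these FuzzType labels, assuming a string encoding like `Tensor<4, float32>`; the rank and dtype pools are illustrative choices, not FreeFuzz's exact ones:

```python
import random

PRIMITIVES = ["int", "bool", "float", "str"]
DTYPES = ["float16", "float32", "float64", "int32", "int64"]

def type_mutate(t):
    """Mutate a FuzzType label following the strategy table above."""
    if isinstance(t, str) and t.startswith("Tensor<"):
        ndim, dtype = t[len("Tensor<"):-1].split(", ")
        if random.random() < 0.5:
            # tensor dim mutation: pick a different rank
            ndim = str(random.choice([n for n in range(1, 6) if str(n) != ndim]))
        else:
            # tensor dtype mutation: pick a different data type
            dtype = random.choice([d for d in DTYPES if d != dtype])
        return f"Tensor<{ndim}, {dtype}>"
    if t in PRIMITIVES:
        # primitive mutation: swap for another primitive type
        return random.choice([p for p in PRIMITIVES if p != t])
    if isinstance(t, tuple):
        # tuple mutation: mutate the element types
        return tuple(type_mutate(e) for e in t)
    if isinstance(t, list):
        # list mutation: mutate the element types
        return [type_mutate(e) for e in t]
    return t

print(type_mutate("Tensor<4, float32>"))  # e.g. Tensor<4, float16>
```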

 

13 of 23

Mutation: Value Mutation

  • Random Value Mutation
    • Randomly generate a value based on its FuzzType
    • E.g., use random_int() for argument in_channels

  • Database Value Mutation
    • Use traced values from similar APIs to test the current API
      • Leverage Argument Value Space
    • How to select similar APIs?
      • Based on API definitions
      • Transform similarity score into probability for sampling APIs


in_channels = random_int()

l = torch.nn.Conv2d(in_channels,…)
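Turning similarity scores into sampling probabilities can be sketched as follows; the scores are made up for illustration, since FreeFuzz derives similarity from API definitions:

```python
import random

def pick_similar_api(similarity):
    """Normalize similarity scores into probabilities and sample one API."""
    apis = list(similarity)
    total = sum(similarity.values())
    weights = [similarity[a] / total for a in apis]
    return random.choices(apis, weights=weights, k=1)[0]

# Hypothetical similarity of other APIs to torch.nn.Conv2d
scores = {"torch.nn.Conv3d": 0.9, "torch.nn.Conv1d": 0.8, "torch.nn.Linear": 0.1}
print(pick_similar_api(scores))  # usually Conv3d or Conv1d, rarely Linear
```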

 

 

14 of 23

Test Oracle

  • Wrong computation bugs via differential testing
    • CPU
    • GPU (with CuDNN disabled)
    • GPU (with CuDNN enabled)

  • Performance bugs via metamorphic testing
    • On GPU, lower-precision tensors tend to execute faster
    • Example: on GPU, float16 tends to execute faster than float32

  • Crash bugs


m = torch.nn.Conv2d(64, 128, 1, 2).cuda()
tensor = torch.rand(1, 64, 32, 32).cuda()
torch.backends.cudnn.enabled = True
output1 = m(tensor)  # with CuDNN enabled
torch.backends.cudnn.enabled = False
output2 = m(tensor)  # with CuDNN disabled
print(output1.sum(), output2.sum())  # debugging
assert torch.allclose(output1, output2)  # fails

Buggy code #1
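The metamorphic performance oracle can be sketched with a generic timing check. The workloads below are stand-ins; FreeFuzz times the actual API call with float16 vs. float32 tensors on GPU:

```python
import time

def measure(fn, runs=3):
    """Best-of-runs wall-clock time of fn()."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def performance_anomaly(low_precision_run, high_precision_run, tolerance=2.0):
    """Flag a potential performance bug if the low-precision variant is
    much slower than the high-precision one."""
    return measure(low_precision_run) > tolerance * measure(high_precision_run)

# Stand-in workloads: the "float16" run does less work, as expected on GPU
fp16_run = lambda: sum(range(10_000))
fp32_run = lambda: sum(range(1_000_000))
print(performance_anomaly(fp16_run, fp32_run))  # False: the relation holds
```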

15 of 23

RQ1: Input Source Study

  • Metric:
    • # Covered APIs
    • Line Coverage
  • Breakdown for
    • Documentation code
    • Developer tests
    • DL models
  • Conclusion:
    • All 3 sources of inputs count


16 of 23

RQ2&3: Coverage Trend & Ablation Study

  • Cost:
    • 7.3 hours for PyTorch
      • 630 APIs * 1000 mutants
    • 9.9 hours for TensorFlow
      • 1900 APIs * 1000 mutants
    • 600 mutants per API: already cost-effective

  • Mutation strategies:
    • All 3 strategies are effective
    • Type Mutation is most effective

Coverage trend analysis for PyTorch


17 of 23

RQ4: Comparison with Prior Work

  • Input comparison
    • Covers far more APIs
    • Highest line coverage

  • Mutation comparison
    • 9× more APIs
    • 1.2× higher line coverage
    • 3.5× lower overhead


Comparison on input coverage:

              | FreeFuzz (tf1.14) | LEMON | CRADLE
    # API     | 313               | 30    | 59
    Line Cov. | 33389             | 29489 | 28967

Comparison with LEMON on mutation:

              | FreeFuzz (tf1.14) | LEMON
    # API     | 313               | 35
    Line Cov. | 35473             | 29766
    Time      | 7h                | 25h

18 of 23

RQ5: Detected Bugs

  • FreeFuzz has detected 49 bugs in total
    • 38 confirmed as previously unknown bugs
    • 21 already fixed by developers
  • Each mutation strategy can help detect certain bugs
    • Demonstrating the importance of all 3 mutation strategies


               | FreeFuzz | FreeFuzz-TypeMu | FreeFuzz-RandMu | FreeFuzz-DBMu | FreeFuzz-AllMu | Confirmed (Fixed)
    PyTorch    | 28       | 13              | 24              | 26            | 5              | 23 (7)
    TensorFlow | 21       | 20              | 5               | 20            | 2              | 15 (14)

19 of 23

More Bug Examples

  • torch.nn.Conv3d crashes if padding_mode is set to 'reflect'
  • How can FreeFuzz find it?
    • Database Value Mutation (using values from similar APIs)
    • torch.nn.Conv2d supports 'reflect', but torch.nn.Conv3d crashes unexpectedly


import torch

from torch.nn import Conv3d

x = torch.rand(2, 3, 3, 3, 3)
Conv3d(3, 4, 3, padding_mode='reflect')(x)  # Crash

Documentation

torch.nn.Conv3d

padding_mode (string, optional) – Supported values: 'zeros', 'reflect', 'replicate' or 'circular'

Buggy code #2

20 of 23

More Bug Examples

  • torch.nn.MaxUnpool2d
    • CPU: throw exception
    • GPU: pass silently

  • How does FreeFuzz find it?
    • Random Value Mutation + differential testing
    • FreeFuzz does NOT always generate valid inputs


import torch

m_gpu = torch.nn.MaxUnpool2d(2, stride=2).cuda()

m_cpu = torch.nn.MaxUnpool2d(2, stride=2)

tensor = torch.rand(1, 1, 2, 2)

indices = torch.randint(-32768, 32768, (1, 1, 2, 2))

gpu_result = m_gpu(tensor.cuda(), indices.cuda())

cpu_result = m_cpu(tensor, indices)  # Exception on CPU

Buggy code #3

GPU produces a wrong result silently without throwing any error!

21 of 23

Conclusion

  • FreeFuzz: the first general-purpose and fully automated API-level fuzzing for DL libraries
    • Mining from open source
      • Library documentation
      • Developer tests
      • DL models in the wild
    • Mutating traced inputs via type and value mutations
    • Detecting bugs via differential testing and metamorphic testing
  • Detected 49 bugs for PyTorch and TensorFlow
    • with 38 already confirmed as previously unknown
  • Covers 9× more APIs than prior work with 3.5× lower overhead
  • FreeFuzz is publicly available: https://github.com/ise-uiuc/FreeFuzz


Questions? Email: Anjiang Wei <anjiang@stanford.edu>

22 of 23

Backup Slides

  • Limitation:
    • Low line coverage percentage: 47309 / 308131 = 15%
      • Potential reason: only collected coverage for CPU, no GPU code coverage included
    • A large number of APIs still uncovered (in code collection stage)
    • Developer tests: only tests written in Python are considered


23 of 23

Code Collection & Instrumentation

  • Code collection
    • Code snippets from documentation
    • Library developer tests
    • 202 DL models from open source

  • Instrumentation in Python
    • 630 Python APIs from PyTorch
    • 1900 Python APIs from TensorFlow
    • Reason: public APIs in DL libraries are mainly exposed in Python
    • Limitation: only tests written in Python are considered
