1 of 102

ZNN - A CPU Implementation of Convolutional Neural Networks for Deep Learning

Aleks Zlateski

1/23/2015

2 of 102

Motivation

Why do we need ZNN?

Why do we need fast convolutional networks?

3 of 102

Why do we need ZNNs?

4 of 102

Why do we need ZNNs?

Seung Lab

Members

5 of 102

We have an EM microscope!

6 of 102

Obtain high resolution EM images

7 of 102

3D volume of neural tissue

8 of 102

So?

Ashwin Vishwanathan

9 of 102

3D reconstruct neurons...

Kim et al.

10 of 102

3D reconstruct neurons...

11 of 102

Morphology of different cell types

Rendering by

Alex Norton

12 of 102

Synapses

Rendering by

Alex Norton

13 of 102

Understand the brain a bit better

Rendering by

Alex Norton

14 of 102

Rendering by

Alex Norton

15 of 102

3D reconstruct neurons…

Kim et al.

16 of 102

3D reconstruct neurons…

How?

Kim et al.

21 of 102

Boundaries to segmentation

22 of 102

Main challenge: find the boundaries!

23 of 102

Convolutional Neural Networks

24 of 102

Convolutional Neural Networks

Graph Representation

  • Directed Acyclic Graphs (DAGs)
  • Usually layered
  • Nodes are perceptrons
  • Edges
    • Convolution (with a filter)
    • Pooling
    • Etc…
  • Can have multiple inputs/outputs (a minimal data-structure sketch follows)
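To make the graph representation concrete, here is a minimal, purely illustrative C++ sketch of such a layered DAG; every type and field name is invented for this write-up and is not ZNN's actual code.

```cpp
// Hedged sketch of one possible in-memory form of the DAG described above.
// All names (Node, Edge, Network, EdgeKind) are illustrative, not ZNN's.
#include <memory>
#include <string>
#include <vector>

struct Node;                                  // a perceptron / feature map

enum class EdgeKind { Convolution, Pooling /* etc. */ };

struct Edge {
    EdgeKind kind;
    std::vector<double> filter;               // used only by convolution edges
    Node* from = nullptr;
    Node* to   = nullptr;
};

struct Node {
    std::string name;
    double bias = 0.0;
    std::vector<Edge*> in, out;               // multiple inputs/outputs allowed
};

struct Network {                              // usually layered, always acyclic
    std::vector<std::unique_ptr<Node>> nodes;
    std::vector<std::unique_ptr<Edge>> edges;
};
```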

25 of 102

26 of 102

Convolutional Neural Networks

Graph Representation

27 of 102

Convolutional Neural Networks

Graph Representation

28 of 102

Convolutional Neural Networks

Graph Representation

29 of 102

Convolutional Neural Networks

Graph Representation

[0,1]

30 of 102

Convolutional Neural Networks

Graph Representation

[0,1]

Field of view

31 of 102

Convolutional Neural Networks

Graph Representation

32 of 102

Training Neural Networks

Prepare a training set

33 of 102

Training Neural Networks

Backpropagation using gradient descent

Phases

    • Forward pass
    • Backward pass
    • Weight update

34 of 102

Training Neural Networks

Backpropagation using gradient descent

Phases

    • Forward pass
    • Backward pass
    • Weight update

[0,1]

35 of 102

Training Neural Networks

Backpropagation using gradient descent

Phases

    • Forward pass
    • Backward pass
    • Weight update

Error

36 of 102

Training Neural Networks

Backpropagation using gradient descent

Phases

    • Forward pass
    • Backward pass
    • Weight update (a sketch of this step follows)
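As a deliberately minimal illustration of the weight-update phase, here is a plain gradient-descent update for one filter and its bias; eta, dEdW and dEdB are assumed to come out of the backward pass, and none of these names are ZNN's.

```cpp
// Hedged sketch of the weight-update phase for a single filter and bias.
// eta is the learning rate; dEdW and dEdB are gradients from the backward pass.
#include <cstddef>
#include <vector>

void update_weights(std::vector<double>& W, const std::vector<double>& dEdW,
                    double& bias, double dEdB, double eta)
{
    for (std::size_t i = 0; i < W.size(); ++i)
        W[i] -= eta * dEdW[i];   // W <- W - eta * dE/dW
    bias -= eta * dEdB;          // b <- b - eta * dE/db
}
```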

37 of 102

Convolutional Neural Networks

Graph Representation

[0,1]

Field of view

38 of 102

Field of view size matters!

39 of 102

Field of view size matters!

40 of 102

Increasing the Field of View

  • Large filters
  • Deep networks
  • Both! (a small worked example follows)
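A small worked example, assuming stride-1 convolutions and no pooling (an assumption, not something the slides state): each layer with an m×m×m filter widens the field of view by m−1 voxels per axis, so six 5×5×5 layers already give a 25×25×25 field of view.

```cpp
// Hedged sketch: field of view of stacked stride-1 convolutions, no pooling.
// Six 5x5x5 layers: 1 + 6*(5-1) = 25 voxels per axis.
#include <vector>

int field_of_view(const std::vector<int>& filter_sizes)
{
    int fov = 1;
    for (int m : filter_sizes)
        fov += m - 1;            // each layer widens the FOV by (m - 1)
    return fov;
}
```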

41 of 102

Deep Neural Networks

  • Pros
    • More non-linearity
    • Large field of view
    • More parameters
  • Cons
    • Large number of layers
    • More parameters
    • Slow or impossible training

42 of 102

Deep Neural Networks

  • Pros
    • More non-linearity
    • Large field of view
    • More parameters
  • Cons
    • Large number of layers
    • More parameters
    • Slow or impossible training

TIM the MIT beaver

43 of 102

How can we make ZNN fast?

  • Minimize computation
    • Faster convolution using FFTs
    • Reuse computation
  • Parallelize over CPUs
    • Utilize 100% of all CPUs

44 of 102

Faster convolution using FFTs

Direct Convolution

O(N³ · M³)
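The O(N³ · M³) cost is just the loop nest of a naive "valid" 3D convolution of an N³ volume with an M³ filter; the sketch below is illustrative only and nothing like ZNN's optimized code.

```cpp
// Hedged sketch: naive "valid" 3D convolution, written only to make the
// O(N^3 * M^3) operation count explicit.
#include <cstddef>
#include <vector>

// vol: N*N*N input, flt: M*M*M filter; returns the (N-M+1)^3 "valid" output.
std::vector<double> conv3d_direct(const std::vector<double>& vol, std::size_t N,
                                  const std::vector<double>& flt, std::size_t M)
{
    const std::size_t O = N - M + 1;
    std::vector<double> out(O * O * O, 0.0);
    for (std::size_t z = 0; z < O; ++z)
      for (std::size_t y = 0; y < O; ++y)
        for (std::size_t x = 0; x < O; ++x) {
            double s = 0.0;
            for (std::size_t k = 0; k < M; ++k)         // filter is flipped,
              for (std::size_t j = 0; j < M; ++j)       // as in a true
                for (std::size_t i = 0; i < M; ++i)     // convolution
                    s += vol[(z + k) * N * N + (y + j) * N + (x + i)]
                       * flt[(M - 1 - k) * M * M + (M - 1 - j) * M + (M - 1 - i)];
            out[z * O * O + y * O + x] = s;
        }
    return out;
}
```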

45 of 102

Faster convolution using FFTs

Direct Convolution

O(N³ · M³)

FFT Convolution

O(N³ log N)

Faster for large enough M

(larger filters)

46 of 102

Faster convolution using FFTs

Direct Convolution

O(N³ · M³)

FFT Convolution

O(N³ log N)

Slower for some M

Multiple convolutions still faster!

47 of 102

Faster convolution using FFTs

X*Y Convolutions

  • Naively we have 3 FFTs per convolution
    • Total of 3*X*Y FFTs
  • We can reuse FFTs!

[Diagram: layer with X input and Y output feature maps]

48 of 102

Faster convolution using FFTs

X*Y Convolutions

  • Naively we have 3 FFTs per convolution
    • Total of 3*X*Y FFTs
  • We can reuse FFTs!
    • X


49 of 102

Faster convolution using FFTs

X*Y Convolutions

  • Naively we have 3 FFTs per convolution
    • Total of 3*X*Y FFTs
  • We can reuse FFTs!
    • X+X*Y


50 of 102

Faster convolution using FFTs

X*Y Convolutions

  • Naively we have 3 FFTs per convolution
    • Total of 3*X*Y FFTs
  • We can reuse FFTs!
    • X+X*Y+Y
    • FFT(A+B) = FFT(A)+FFT(B)


51 of 102

Faster convolution using FFTs

X*Y Convolutions

  • Naively we have 3 FFTs per convolution
    • Total of 3*X*Y FFTs
  • We can reuse FFTs!
    • X+X*Y+Y
  • Faster for wider networks


52 of 102

Faster convolution using FFTs

  • Faster for wider networks
  • Backward pass can reuse the FFTs from the forward pass
    • only X+Y additional FFTs needed
  • Speed/memory tradeoff
    • Need to keep all the FFTs from the forward pass (a sketch of the reuse bookkeeping follows)

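A minimal sketch of the reuse bookkeeping for one layer with X inputs and Y outputs. The filter FFTs are assumed to be precomputed (as Type 2 tasks), fft3d/ifft3d stand in for real transforms (e.g. thin FFTW wrappers) and are only declared here, and padding and filter flipping are omitted; all names are invented for this summary.

```cpp
// Hedged sketch of FFT reuse in one layer: X forward FFTs + X*Y pointwise
// multiply-adds + Y inverse FFTs, instead of 3*X*Y transforms done naively.
#include <complex>
#include <cstddef>
#include <vector>

using Volume   = std::vector<double>;
using Spectrum = std::vector<std::complex<double>>;

Spectrum fft3d(const Volume& v);    // assumed helper (e.g. FFTW wrapper)
Volume   ifft3d(const Spectrum& s); // assumed helper (e.g. FFTW wrapper)

// in:   X input feature maps
// What: FFTs of the X*Y filters, What[x][y], precomputed as Type 2 tasks
std::vector<Volume> forward_layer(const std::vector<Volume>& in,
                                  const std::vector<std::vector<Spectrum>>& What)
{
    const std::size_t X = in.size(), Y = What.front().size();

    std::vector<Spectrum> Fhat(X);
    for (std::size_t x = 0; x < X; ++x)        // X forward FFTs, reused below
        Fhat[x] = fft3d(in[x]);

    std::vector<Volume> out(Y);
    for (std::size_t y = 0; y < Y; ++y) {
        Spectrum acc(Fhat[0].size());          // accumulate in the FFT domain:
        for (std::size_t x = 0; x < X; ++x)    // FFT(A) + FFT(B) = FFT(A + B)
            for (std::size_t i = 0; i < acc.size(); ++i)
                acc[i] += Fhat[x][i] * What[x][y][i];
        out[y] = ifft3d(acc);                  // only Y inverse FFTs
    }
    return out;
}
```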

53 of 102

ZNN Parallelization Ideas

  • Multiple convolutions (or FFTs) at the same time

54 of 102

ZNN Parallelization Ideas

  • Multiple convolutions (or FFTs) in the same layer
  • Convolutions in the next layer can start even before the whole layer has finished

55 of 102

ZNN Parallelization Model

  • Prioritized Task Model
    • not a standard model
    • implemented on top of pthreads (ZNN v1)
  • 2 task types
    • Type 1: the result immediately spawns new tasks (high priority)
    • Type 2: the result will be needed at some later point (low priority)
  • Scheduling strategy - never wait!
    • priority based on the number of tasks spawned after the completion of the current task
    • Type 2 tasks can be stolen and executed inline (a scheduler sketch follows)
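A rough sketch of what such a prioritized task queue could look like. ZNN v1 builds this on raw pthreads; the illustration below uses std::thread and invented names, and it leaves out the stealing path (running an unfinished Type 2 dependency inline).

```cpp
// Hedged sketch of a "never wait" prioritized task pool, not ZNN's scheduler.
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task {
    int priority;                  // Type 1: high (spawns work), Type 2: low
    std::function<void()> run;
    bool operator<(const Task& o) const { return priority < o.priority; }
};

class TaskManager {
public:
    explicit TaskManager(unsigned nthreads) {
        for (unsigned i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { loop(); });
    }
    void schedule(int priority, std::function<void()> f) {
        { std::lock_guard<std::mutex> g(m_); q_.push({priority, std::move(f)}); }
        cv_.notify_one();
    }
    ~TaskManager() {
        { std::lock_guard<std::mutex> g(m_); done_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
private:
    void loop() {                                // workers never idle while
        for (;;) {                               // work exists: always take
            std::unique_lock<std::mutex> lk(m_); // the highest-priority task
            cv_.wait(lk, [this] { return done_ || !q_.empty(); });
            if (q_.empty()) return;
            Task t = q_.top(); q_.pop();
            lk.unlock();
            t.run();
        }
    }
    std::priority_queue<Task> q_;
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<std::thread> workers_;
    bool done_ = false;
};
```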

56 of 102

Convolutional Neural Network

Forward Pass Tasks - Type 1

  • Direct
    • Convolve F & W
    • Nonlinearity + Bias
  • FFT
    • FFT of F
    • FFT of W
    • Inverse FFT
    • Nonlinearity + Bias

57 of 102

Convolutional Neural Network

Forward Pass Tasks - Type 2

  • FFT
    • FFT of all Ws

58 of 102

Convolutional Neural Network

Backward Pass Tasks - Type 1

  • Direct
    • Calculate G
    • Convolve G & W
  • FFT
    • Calculate FFT of G
    • FFT of W*
    • Inverse FFT

59 of 102

Convolutional Neural Network

Backward Pass Tasks - Type 2

  • FFT
    • FFT of all Ws
    • But they are all done!

60 of 102

Convolutional Neural Network

Weight Update Tasks - Type 2

  • Direct
    • Calculate dEdB and update the Biases
    • Convolve F & G’ and update W
  • FFT
    • Calculate dEdB and update the Biases
    • IFFT(FFT(F) · FFT(G)) and update W

61 of 102

ZNN Parallelization Model

  • Scheduling strategy
    • schedule type 2 tasks with low priority as soon as all their prerequisites are done
    • schedule type 1 tasks, with custom priority, as soon as all required type 1 tasks are done
    • when executing a type 1 task, if a required type 2 task is not finished
      • execute that task yourself

62 of 102

Zonvolutional Neural Network

Forward Pass - Scheduling Tasks

  • Prepare input(s)
    • Top priority?*
  • FFT of the filters**
    • No dependencies
    • Type 2 tasks

63 of 102

Zonvolutional Neural Network

Forward Pass - Scheduling Tasks

  • Input perceptron type 1
    • FFT of F with the highest priority
  • If input is not ready?
    • Steal the task!

64 of 102

Zonvolutional Neural Network

Forward Pass - Scheduling Tasks

  • When a node is done
    • Schedule outgoing edges’ multiply-adds
    • If the FFT of W is not ready - steal the task!

65 of 102

Zonvolutional Neural Network

Forward Pass - Scheduling Tasks

  • When a multiply-add (MAD) is done
    • If it was the last one, schedule the perceptron (see the sketch below)
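The "last multiply-add schedules the perceptron" rule can be expressed as an atomic countdown; this is a sketch with invented names, not ZNN's code.

```cpp
// Hedged sketch: each finished multiply-add decrements a counter; the caller
// that sees the counter hit zero schedules the perceptron's bias+nonlinearity.
#include <atomic>

struct PerceptronState {
    std::atomic<int> pending_mads;   // initialized to the node's in-degree
};

// Returns true for exactly one caller: the one that ran the last MAD and
// should therefore schedule the perceptron task.
bool on_mad_done(PerceptronState& node)
{
    return node.pending_mads.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```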

66 of 102

Zonvolutional Neural Network

Forward Pass - Scheduling Tasks

  • When the output perceptron is done
    • We are done!

67 of 102

Zonvolutional Neural Network

Backward Pass - Scheduling Tasks

  • Analogous to the Forward pass
  • Convolving G with W in the opposite direction

68 of 102

Zonvolutional Neural Network

Weight Update - Scheduling Tasks

  • Calculate gradients and update the filters & biases (convolution)
  • No dependencies; all the tasks can be done in parallel
  • Schedule all tasks ASAP
  • Don’t wait for the backward pass to finish

69 of 102

Zonvolutional Neural Network

Weight Update

  • As soon as the weight update of a filter is done
    • Schedule FFT(W)
    • Needed for the next forward pass
    • type 2 tasks

70 of 102

Zonvolutional Neural Network

Further Optimizations

  • If doing only forward pass
    • Pipeline the computation
    • Multiple forward passes in parallel
  • Always have the next sample ready!
  • Benchmark FFT vs. direct computation per layer
    • Use the faster one (see the timing sketch below)
  • Sparse convolution optimization
  • Max pooling optimizations for certain sizes
    • Statically compiled
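One way the per-layer "use the faster one" choice could look; names and the single-run timing are illustrative only, and a real benchmark would average several runs.

```cpp
// Hedged sketch: time a direct pass and an FFT pass for a layer, keep the
// faster implementation for subsequent passes.
#include <chrono>

enum class ConvImpl { Direct, FFT };

template <typename F>
double time_once(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0)
        .count();
}

template <typename DirectFn, typename FftFn>
ConvImpl pick_impl(DirectFn&& direct_pass, FftFn&& fft_pass) {
    return time_once(direct_pass) <= time_once(fft_pass) ? ConvImpl::Direct
                                                         : ConvImpl::FFT;
}
```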

71 of 102

Some Benchmarks

  • Scaling with the number of CPUs
  • Comparing to GPU NNs on machines of similar age
    • Maybe not fair
    • Hard to find a fair comparison
  • Comparing against new state-of-the-art GPU implementations (Theano)
    • A rare 3D-CNN-capable implementation
    • Comparable to other modern implementations in 2D

72 of 102

Some Benchmarks

Scalability Test - 4-layer, 16-wide network

73 of 102

Some Benchmarks

Compared to CNS

  • 3 Layers, 5 Perceptrons, 5x5x5

[Chart: updates per second, CNS vs. ZNN]

74 of 102

Some Benchmarks

Compared to CNS

  • 3 Layers, 15 Perceptrons, 5x5x5

[Chart: updates per second, CNS vs. ZNN]

75 of 102

Some Benchmarks

Compared to CNS

  • 3 Layers, 10 Perceptrons, 7x7x7

[Chart: updates per second, CNS vs. ZNN]

76 of 102

Some Benchmarks

Compared to Theano GPU

  • Comparable in both 2D and 3D
    • 2x Slower in 2D
    • 2x Faster in 3D

[Chart: seconds per update]

77 of 102

Why is ZNN so fast?

When everyone says we should use GPUs?

78 of 102

CPUs vs GPUs

  • CPU: 32-40 cores
  • GPU: >2000 cores

79 of 102

CPUs vs GPUs

  • CPU: 32-40 cores, ~3 GHz
  • GPU: >2000 cores, ~1 GHz

80 of 102

CPUs vs GPUs

  • CPU: 32-40 cores, ~3 GHz, vectorized instructions
  • GPU: >2000 cores, ~1 GHz

81 of 102

CPUs vs GPUs

  • CPU: 32-40 cores, ~3 GHz, vectorized instructions, 512 GB memory
  • GPU: >2000 cores, ~1 GHz, 2-4 GB memory

82 of 102

CPUs vs GPUs

  • CPU: 32-40 cores, ~3 GHz, vectorized instructions, 512 GB memory
  • GPU: >2000 cores, ~1 GHz, 2-4 GB memory

GPUs still faster!

83 of 102

CPUs vs GPUs

  • CPU: 32-64 cores, ~3 GHz, vectorized instructions, 512 GB memory
  • GPU: >2000 cores, ~1 GHz, 2-4 GB memory, SIMD
    • All cores have to execute the same instructions!

84 of 102

GPU limitations

  • SIMD (single instruction multiple data)
    • A single instruction decoder; all cores do the same work
    • Divergence kills performance
    • Parallelization done per convolution(s)
      • Direct convolution
        • computationally much more expensive
      • FFT
        • can’t efficiently utilize all cores
  • Memory limitations
    • Can’t cache FFT transforms for reuse

85 of 102

ZNN loves deep learning

  • ZNN shines when
    • Filter sizes are large, so that FFTs are used
    • Networks are wide and deep
    • The output patch is bigger
  • ZNN is the only (reasonable) solution for
    • Very deep networks with large filters
    • Networks whose feature-map and gradient FFTs fit in RAM but wouldn't fit in GPU memory
  • ZNN can be run out of the box on future MUUUUULTI-core machines

86 of 102

ZNN is heavily used in-house

  • Optimize network structure
    • Fast turnaround times
  • Very deep networks
    • Don't fit in GPU memory when using FFTs
    • Slower on the GPU when using direct convolution
  • Trained on large patches
    • Very fast compared to other state-of-the-art implementations

87 of 102

KisukNet™ v1 (kisuklee@mit.edu)

88 of 102

KisukNet™ v2 (kisuklee@mit.edu)

89 of 102

KisukNet™ v3 (kisuklee@mit.edu)

91 of 102

KisukNet™ v73 (kisuklee@mit.edu)

92 of 102

Ashwin Vishwanathan

2500 pxl (17.5 μm)

93 of 102

2500 pxl (17.5 μm)

Kisuk Lee

94 of 102

ZNN now

  • Open source
    • https://github.com/seung-lab/znn-release
    • 99% C++ (dep. fftw3 and boost-all)
    • Some MATLAB helper functions
    • Examples included
  • Other programming language bindings
    • Easy to make
    • Might or might not come soon…
    • You can help!

95 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box

96 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box
  • GPU code has to be re-optimized for new generations
  • GPU design priorities:

97 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box
  • GPU code has to be re-optimized for new generations
  • GPU design priorities:

1. Graphics

98 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box
  • GPU code has to be re-optimized for new generations
  • GPU design priorities:

1. Graphics

2. Graphics

99 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box
  • GPU code has to be re-optimized for new generations
  • GPU design priorities:

1. Graphics

2. Graphics

N-1. Graphics

100 of 102

ZNN in the future

  • # of cores per CPU seems to be increasing :)
  • ZNN will work out of the box
  • GPU code has to be re-optimized for new generations
  • GPU design priorities:

1. Graphics

2. Graphics

N-1. Graphics

N. GPU Compute

101 of 102

Why should you use ZNN?

  • Very easy installation
  • No special hardware required
  • Competitive speed with state-of-the-art GPU implementations
    • Sometimes it can be slower
    • Sometimes it can be faster
    • Rule of thumb is that ZNN is usually better for
      • Deeper networks
      • Larger filters

102 of 102

Questions?