1 of 23

Analyzing the Energy-Latency-Area-Accuracy Trade-off Across Contemporary Neural Networks

Vikram Jain, Linyan Mei, Marian Verhelst

MICAS, KU Leuven, Belgium

vikram.jain@kuleuven.be

2 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


3 of 23

Machine learning is ubiquitous


4 of 23

AI at the edge

  • Hardware requirements
    • Low power
    • Decent performance
    • Small footprint
    • Flexibility
  • Machine learning models
    • Large model size
    • Huge number of operations


S. Bianco et al, "Benchmark Analysis of Representative Deep Neural Network Architectures," in IEEE Access, vol. 6, pp. 64270-64277, 2018, doi: 10.1109/ACCESS.2018.2877890.

5 of 23

Design Space Exploration (DSE)

  • The design space is huge across the Algorithm, Hardware, and Algorithm-to-Hardware Mapping axes.

  • Automated design space exploration is therefore important.


[Figure: the joint co-optimization space spans three axes.
Algorithm: model topology, NN operations, ...
Hardware: precision, memory hierarchy, PE array, sparsity, ...
Mapping: dataflow, loop blocking or tiling, loop ordering, ...]
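The combinatorial blow-up is easy to see: the joint space is the Cartesian product of the per-axis options. A toy sketch with made-up option counts (the names below are illustrative, not the study's actual axes):

```python
from itertools import product

# Hypothetical option lists for each design axis (illustrative only).
algorithm_opts = ["AlexNet", "ResNet50", "MobileNetV2"]
hardware_opts = [  # (L1 size, precision) combinations
    (size, prec) for size in ("8KB", "32KB") for prec in (8, 16)
]
mapping_opts = ["weight-stationary", "output-stationary", "row-stationary"]

# The joint co-optimization space is the Cartesian product of all axes.
joint_space = list(product(algorithm_opts, hardware_opts, mapping_opts))
print(len(joint_space))  # 3 * 4 * 3 = 36 points from just a handful of options
```

Even a few options per axis multiply quickly, which is why automated exploration is needed.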

6 of 23

DSE State-of-The-Art

  • Timeloop (NVIDIA)
  • Interstellar (Stanford)
  • MAESTRO (Georgia Tech)
  • NAS (Google)
  • ZigZag (In-house)


Most DSE frameworks optimize and report a hardware architecture for an individual neural network workload, but none explore hardware architecture design across a suite of NN workloads, or vice versa.

7 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


8 of 23

Motivation

This study aims to broaden understanding of:

  • Efficiency of NN workloads across a set of hardware architectures

  • Efficiency of hardware architectures across a suite of NN workloads

  • Overall energy-latency-area-accuracy trade-offs across different architectures and NN workloads

  • Properties of hardware architectures that make them efficient across a suite of NN workloads and vice versa.


9 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


10 of 23

ZigZag

  • An architecture-mapping joint DSE framework for DNN accelerators.

  • This study uses ZigZag for a broad joint HW-NN exploration.

  • ZigZag finds the optimal design points by comparing energy, performance and area.


[ZigZag, TC2021]

https://github.com/ZigZag-Project/zigzag
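Comparing design points on energy, performance, and area amounts to keeping the Pareto-optimal set. A minimal sketch of that filtering step (not ZigZag's actual API; the point values are made up):

```python
def pareto_front(points):
    """Return the points not dominated on (energy, latency, area).

    A point dominates another if it is no worse on every metric and
    is a different point. All three metrics are minimized.
    """
    front = []
    for p in points:
        dominated = any(
            all(q[i] <= p[i] for i in range(3)) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (energy uJ, latency Mcycles, area mm2) design points.
designs = [(100, 10, 1.0), (80, 12, 1.2), (120, 9, 0.8), (150, 15, 2.0)]
print(pareto_front(designs))  # the last point is dominated by the first
```

The surviving points form the trade-off curve from which an "optimal" design is picked according to the application's priorities.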

11 of 23

Understanding ZigZag results

  • ZigZag reports energy, latency, and area for all valid architecture-mapping combinations

  • Analyzing these results provides insights into hardware and workload parameters

  • E.g., the plot shows MobileNetv2 mapped to 280 different memory hierarchies

  • The bottom plot shows energy and area versus the layers of the network


The top three subplots represent the memory hierarchy (size and levels); black dots indicate no memory sharing, colored dots otherwise

12 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


13 of 23

Experiment setup: algorithm

  • Workload: 12 popular NN workloads evaluated on the ImageNet dataset


ImageNet Top-1 accuracy ranging from 50% to 85%

Operations ranging from 0.06 to 23.74 GFLOPs

Parameter sizes ranging from 3.31 to 84.45 MB

  • Traditional convolution: AlexNet, DenseNet201
  • Depthwise separable convolution: MobileNetv1/v2/v3*, NASNet, Xception, Incep-Resv2
  • Residual networks: ResNet50, Incep-ResNet-v2
  • Grouped convolution: SEResNeXt50

*MobileNetv3 is missing in the figure.

14 of 23

Experiment setup: hardware

  • Hardware: 720 HW architectures with different memory hierarchies (levels, size, sharing, and bypass), using OX 14 | K 16 spatial unrolling

14

Arch. level        | Inner PE Reg L0     | On-chip L1                           | On-chip L2   | Off-chip
Mem size options   | 2 B, 32 B, 128 B    | 8 KB, 32 KB                          | 0.5 MB, 2 MB | DRAM
Mem bandwidth      | 16 bits/cycle (r/w) | 128 bits/cycle (r/w)                 |              |
Mem share options  | All separate        | All separate, Two shared, All shared | All shared   | All shared
Mem bypass option  | No bypass           | Can bypass                           | Can bypass   | No bypass
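The size options in the table multiply combinatorially; a sketch of that expansion (sharing, bypass, and level choices multiply it further toward the 720 architectures, though the exact expansion is not spelled out on the slide):

```python
from itertools import product

# Size options per on-chip level, taken from the table above.
l0_sizes = ["2B", "32B", "128B"]   # inner-PE register file
l1_sizes = ["8KB", "32KB"]         # on-chip L1
l2_sizes = ["0.5MB", "2MB"]        # on-chip L2

# Every combination of one size per level is a candidate hierarchy.
size_combos = list(product(l0_sizes, l1_sizes, l2_sizes))
print(len(size_combos))  # 3 * 2 * 2 = 12 size combinations
```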

15 of 23

Visualizing the trade-off space (1)

  • 1440 design points for each NN -> an immense design space

  • A wide trade-off space is available:
    • ImageNet Top-1 accuracy ranging from 50% to 85%
    • Energy from a few uJ to 10,000 uJ
    • Latency from a few million cycles to thousands of millions of cycles
    • Area from 0.1 mm2 to tens of mm2


16 of 23

Energy vs Latency

Major observations:

  1. Generally speaking, higher accuracy comes at the cost of higher energy/latency
  2. The MobileNet series provides a middle ground
  3. A large L0 for W and I is more energy efficient
  4. A small L0 for O is efficient because of the output-stationary dataflow
  5. The MobileNet series prefers shallower memory hierarchies, but is near-optimal at deeper memory hierarchies

16

17 of 23

Energy vs Area

Major observations:

  • Large NN workloads benefit in energy efficiency from a large area, i.e., a large on-chip memory

  • More on-chip memory does not always translate into energy efficiency


18 of 23

Overall comparison


NNs whose order of accuracy is larger than the order of best energy and latency (i.e. achieve relatively low accuracy with relatively high hardware cost) are suboptimal, and hence should be avoided.

NNs whose order of accuracy is equal to the order of best energy and latency (i.e. the accuracy achieved lives up to the hardware cost) are good for some applications.

NNs whose order of accuracy is smaller than the order of best energy and latency (i.e. achieve relatively high accuracy with relatively low hardware cost) are promising for embedded systems.
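The three-way rule above can be sketched by comparing each NN's rank by accuracy (descending) with its rank by hardware cost (descending); the function and the accuracy/cost numbers below are hypothetical illustrations, not the paper's data:

```python
def classify(nns):
    """Classify NNs by comparing accuracy rank against cost rank.

    `nns` maps name -> (accuracy, cost). Rank 0 is the most accurate
    NN in the accuracy ordering, and the most expensive NN in the
    cost ordering (cost = best achievable energy or latency).
    """
    by_acc = sorted(nns, key=lambda n: -nns[n][0])   # accuracy, descending
    by_cost = sorted(nns, key=lambda n: -nns[n][1])  # cost, descending
    verdicts = {}
    for n in nns:
        acc_rank, cost_rank = by_acc.index(n), by_cost.index(n)
        if acc_rank > cost_rank:      # low accuracy at high cost
            verdicts[n] = "suboptimal"
        elif acc_rank == cost_rank:   # accuracy lives up to the cost
            verdicts[n] = "good"
        else:                         # high accuracy at low cost
            verdicts[n] = "promising"
    return verdicts

# Hypothetical (accuracy, cost) pairs.
verdicts = classify({"A": (0.85, 900), "B": (0.70, 950), "C": (0.60, 100)})
print(verdicts)
```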

[Figure: algorithm metrics (Accuracy, # MAC Ops, Parameter size) and hardware metrics (Energy, Latency, Area) have relations, but not definite ones.]

19 of 23

HW arch vs optimal NN workload (energy)

  • A subset of 200 arch-mapping combinations plotted on a box-and-whisker plot
  • The MobileNet series shows a lower quartile spread in energy -> less sensitive to the HW architecture
  • Other NN workloads need careful consideration when designing the memory hierarchy
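The quartile spread can be quantified as the interquartile range of a workload's energy across architectures; the energy values below are made up for illustration:

```python
from statistics import quantiles

# Hypothetical energies (uJ) of each NN across several HW architectures.
energy_per_arch = {
    "MobileNetV2": [90, 95, 100, 105, 110, 115],
    "ResNet50": [300, 450, 600, 900, 1200, 2000],
}

# Interquartile range of energy as a proxy for sensitivity to the HW arch.
iqr = {}
for nn, es in energy_per_arch.items():
    q1, _, q3 = quantiles(es, n=4)  # quartiles (default 'exclusive' method)
    iqr[nn] = q3 - q1
print(iqr)
```

A small IQR (MobileNet-like) means the workload is forgiving of the memory hierarchy; a large one means the hierarchy must be chosen carefully.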


20 of 23

HW arch vs optimal NN workload (latency)

  • The MobileNet series provides the lowest overall latency but is more sensitive to the memory hierarchy
  • ResNet50, DenseNet201, SEResNeXt50 and IncepResv2 have higher latency, but their latency is less sensitive to the HW architecture


21 of 23

Optimal NN workload vs HW arch (energy)

  • A subset of 100 arch-mapping combinations is selected*
  • A two-level memory hierarchy (L0+L1) provides the best energy across all NNs
  • A 512 KB L1 provides near-optimal energy → no need for larger on-chip memory
  • Memory sharing benefits energy efficiency


*The energy (resp. latency) of each point is normalized to the minimum energy (resp. latency) of that NN

The top three subplots represent the memory hierarchy (size and levels); black dots indicate no memory sharing, colored dots otherwise
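The per-NN normalization from the footnote can be sketched as dividing each point by that NN's minimum; the energies below are hypothetical:

```python
# Hypothetical energies (uJ) of each NN across three architectures.
energies = {
    "MobileNetV2": [120.0, 90.0, 150.0],
    "ResNet50": [800.0, 400.0, 1000.0],
}

# Normalize each design point to the per-NN minimum, so 1.0 marks the
# best architecture for that NN and values are comparable across NNs.
normalized = {
    nn: [e / min(vals) for e in vals]
    for nn, vals in energies.items()
}
print(normalized["ResNet50"])  # [2.0, 1.0, 2.5]
```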


22 of 23

Optimal NN workload vs HW arch (latency)

  • Good latency points are achieved with either a one-level or a two-level memory hierarchy

  • Memory sharing has little to no effect on latency

  • A large L0 memory size benefits latency


23 of 23

Conclusion

  • We propose an exploration methodology to analyze the energy-latency-area trade-off space across a suite of NN workloads

  • We analyzed the impact of the HW architecture on NN workloads and vice versa

  • Main observations on the memory hierarchy:
    • A multi-level memory hierarchy provides good energy efficiency
    • Large on-chip memory does not always translate into energy efficiency
    • Memory sharing is beneficial to energy efficiency

  • Similar explorations with different spatial unrollings, PE array sizes, other NN workloads, etc., can help design versatile hardware for mapping a suite of NN workloads
