1 of 23

Analyzing the Energy-Latency-Area-Accuracy Trade-off Across Contemporary Neural Networks

Vikram Jain, Linyan Mei, Marian Verhelst

MICAS, KU Leuven, Belgium

vikram.jain@kuleuven.be

2 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


3 of 23

Machine learning is ubiquitous


4 of 23

AI at the edge

  • Hardware requirements
    • Low power
    • Decent performance
    • Small footprint
    • Flexibility
  • Machine learning models
    • Large model size
    • Huge number of operations


S. Bianco et al, "Benchmark Analysis of Representative Deep Neural Network Architectures," in IEEE Access, vol. 6, pp. 64270-64277, 2018, doi: 10.1109/ACCESS.2018.2877890.

5 of 23

Design Space Exploration (DSE)

  • The design space is huge across the Algorithm, Hardware, and Algorithm-to-Hardware Mapping axes.

  • Automated design space exploration is therefore important.


[Figure: the joint co-optimization space spans three axes.
Algorithm: model topology, NN operations, ...
Hardware: precision, memory hierarchy, PE array, sparsity, ...
Mapping: dataflow, loop blocking or tiling, loop ordering, ...]
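The combinatorial blow-up is easy to see: the joint space is the Cartesian product of the per-axis options. A toy sketch with made-up option counts (the names below are illustrative, not the study's actual axes):

```python
from itertools import product

# Hypothetical option lists for each design axis (illustrative only).
algorithm_opts = ["AlexNet", "ResNet50", "MobileNetV2"]
hardware_opts = [  # (L1 size, precision) combinations
    (size, prec) for size in ("8KB", "32KB") for prec in (8, 16)
]
mapping_opts = ["weight-stationary", "output-stationary", "row-stationary"]

# The joint co-optimization space is the Cartesian product of all axes.
joint_space = list(product(algorithm_opts, hardware_opts, mapping_opts))
print(len(joint_space))  # 3 * 4 * 3 = 36 points from just a handful of options
```

Even a few options per axis multiply quickly, which is why automated exploration is needed.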

6 of 23

DSE State-of-The-Art

  • Timeloop (NVIDIA)
  • Interstellar (Stanford)
  • MAESTRO (Georgia Tech)
  • NAS (Google)
  • ZigZag (In-house)


Most DSE frameworks optimize and report a hardware architecture for an individual neural network workload, but none explore hardware architecture design across a suite of NN workloads, or vice versa.

7 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


8 of 23

Motivation

This study aims to broaden understanding of:

  • Efficiency of NN workloads across a set of hardware architectures

  • Efficiency of hardware architectures across a suite of NN workloads

  • Overall energy-latency-area-accuracy trade-offs across different architectures and NN workloads

  • Properties of hardware architectures that make them efficient across a suite of NN workloads and vice versa.


9 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


10 of 23

ZigZag

  • An architecture-mapping joint DSE framework for DNN accelerators.

  • This study uses ZigZag for a broad joint HW-NN exploration.

  • ZigZag finds the optimal design points by comparing energy, performance and area.


[ZigZag, TC2021]

https://github.com/ZigZag-Project/zigzag
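Comparing design points on energy, performance, and area amounts to keeping the Pareto-optimal set. A minimal sketch of that filtering step (not ZigZag's actual API; the point values are made up):

```python
def pareto_front(points):
    """Return the points not dominated on (energy, latency, area).

    A point dominates another if it is no worse on every metric and
    is a different point. All three metrics are minimized.
    """
    front = []
    for p in points:
        dominated = any(
            all(q[i] <= p[i] for i in range(3)) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# Hypothetical (energy uJ, latency Mcycles, area mm2) design points.
designs = [(100, 10, 1.0), (80, 12, 1.2), (120, 9, 0.8), (150, 15, 2.0)]
print(pareto_front(designs))  # the last point is dominated by the first
```

The surviving points form the trade-off curve from which an "optimal" design is picked according to the application's priorities.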

11 of 23

Understanding ZigZag results

  • ZigZag reports energy, latency, and area for all valid architecture-mapping combinations

  • Analyzing these results provides insights into hardware and workload parameters

  • E.g., the plot shows MobileNetv2 mapped to 280 different memory hierarchies

  • The bottom plot shows energy and area versus the layers of the network


The top three subplots represent the memory hierarchy (size and levels); black dots indicate no memory sharing, colored dots otherwise

12 of 23

Overview

  • Introduction
  • Motivation
  • ZigZag
  • DNN workload comparison
    • Experiment setup
    • Visualizing the trade-off space
    • Impact of HW arch on optimal workload
    • Impact of workload on optimal HW arch
  • Conclusion


13 of 23

Experiment setup: algorithm

  • Workload: 12 popular NN workloads evaluated on the ImageNet dataset


ImageNet Top-1 accuracy ranging from 50% to 85%

Operations ranging from 0.06 to 23.74 GFLOPs

Parameter sizes ranging from 3.31 to 84.45 MB

  • Traditional convolution: AlexNet, DenseNet201
  • Depthwise separable convolution: MobileNetv1/v2/v3*, NASNet, Xception, Incep-Resv2
  • Residual networks: ResNet50, Incep-ResNet-v2
  • Grouped convolution: SEResNeXt50

*MobileNetv3 is missing in the figure.

14 of 23

Experiment setup: hardware

  • Hardware: 720 HW architectures with different memory hierarchies (levels, size, sharing, and bypass), using OX 14 | K 16 spatial unrolling

14

Arch. level        | Inner PE Reg L0     | On-chip L1                           | On-chip L2   | Off-chip
Mem size options   | 2 B, 32 B, 128 B    | 8 KB, 32 KB                          | 0.5 MB, 2 MB | DRAM
Mem bandwidth      | 16 bits/cycle (r/w) | 128 bits/cycle (r/w)                 |              |
Mem share options  | All separate        | All separate, Two shared, All shared | All shared   | All shared
Mem bypass option  | No bypass           | Can bypass                           | Can bypass   | No bypass
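The size options in the table multiply combinatorially; a sketch of that expansion (sharing, bypass, and level choices multiply it further toward the 720 architectures, though the exact expansion is not spelled out on the slide):

```python
from itertools import product

# Size options per on-chip level, taken from the table above.
l0_sizes = ["2B", "32B", "128B"]   # inner-PE register file
l1_sizes = ["8KB", "32KB"]         # on-chip L1
l2_sizes = ["0.5MB", "2MB"]        # on-chip L2

# Every combination of one size per level is a candidate hierarchy.
size_combos = list(product(l0_sizes, l1_sizes, l2_sizes))
print(len(size_combos))  # 3 * 2 * 2 = 12 size combinations
```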

15 of 23

Visualizing the trade-off space (1)

  • 1440 design points for each NN -> an immense design space

  • A wide trade-off space is available:
    • ImageNet Top-1 accuracy ranging from 50% to 85%
    • Energy from a few uJ to 10,000 uJ
    • Latency from a few million cycles to thousands of millions of cycles
    • Area from 0.1 mm2 to tens of mm2


16 of 23

Energy vs Latency

Major observations:

  1. Generally speaking, higher accuracy comes at the cost of higher energy/latency
  2. The MobileNet series provides a middle ground
  3. A large L0 for W and I is more energy efficient
  4. A small L0 for O is efficient because of the output-stationary dataflow
  5. The MobileNet series prefers shallower memory hierarchies, but is near-optimal at deeper memory hierarchies

16

17 of 23

Energy vs Area

Major observations:

  • Large NN workloads benefit in energy efficiency from a large area, i.e., a large on-chip memory

  • More on-chip memory does not always translate into energy efficiency


18 of 23

Overall comparison


NNs whose order of accuracy is larger than the order of best energy and latency (i.e. achieve relatively low accuracy with relatively high hardware cost) are suboptimal, and hence should be avoided.

NNs whose order of accuracy is equal to the order of best energy and latency (i.e. the accuracy achieved lives up to the hardware cost) are good for some applications.

NNs whose order of accuracy is smaller than the order of best energy and latency (i.e. achieve relatively high accuracy with relatively low hardware cost) are promising for embedded systems.
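The three-way rule above can be sketched by comparing each NN's rank by accuracy (descending) with its rank by hardware cost (descending); the function and the accuracy/cost numbers below are hypothetical illustrations, not the paper's data:

```python
def classify(nns):
    """Classify NNs by comparing accuracy rank against cost rank.

    `nns` maps name -> (accuracy, cost). Rank 0 is the most accurate
    NN in the accuracy ordering, and the most expensive NN in the
    cost ordering (cost = best achievable energy or latency).
    """
    by_acc = sorted(nns, key=lambda n: -nns[n][0])   # accuracy, descending
    by_cost = sorted(nns, key=lambda n: -nns[n][1])  # cost, descending
    verdicts = {}
    for n in nns:
        acc_rank, cost_rank = by_acc.index(n), by_cost.index(n)
        if acc_rank > cost_rank:      # low accuracy at high cost
            verdicts[n] = "suboptimal"
        elif acc_rank == cost_rank:   # accuracy lives up to the cost
            verdicts[n] = "good"
        else:                         # high accuracy at low cost
            verdicts[n] = "promising"
    return verdicts

# Hypothetical (accuracy, cost) pairs.
verdicts = classify({"A": (0.85, 900), "B": (0.70, 950), "C": (0.60, 100)})
print(verdicts)
```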

[Figure: algorithm metrics (Accuracy, # MAC Ops, Parameter size) and hardware metrics (Energy, Latency, Area) have relations, but not definite ones.]

19 of 23

HW arch vs optimal NN workload (energy)

  • A subset of 200 arch-mapping combinations plotted on a box-and-whisker plot
  • The MobileNet series shows a lower quartile spread in energy -> less sensitive to the HW architecture
  • Other NN workloads need careful consideration when designing the memory hierarchy
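The quartile spread can be quantified as the interquartile range of a workload's energy across architectures; the energy values below are made up for illustration:

```python
from statistics import quantiles

# Hypothetical energies (uJ) of each NN across several HW architectures.
energy_per_arch = {
    "MobileNetV2": [90, 95, 100, 105, 110, 115],
    "ResNet50": [300, 450, 600, 900, 1200, 2000],
}

# Interquartile range of energy as a proxy for sensitivity to the HW arch.
iqr = {}
for nn, es in energy_per_arch.items():
    q1, _, q3 = quantiles(es, n=4)  # quartiles (default 'exclusive' method)
    iqr[nn] = q3 - q1
print(iqr)
```

A small IQR (MobileNet-like) means the workload is forgiving of the memory hierarchy; a large one means the hierarchy must be chosen carefully.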


20 of 23

HW arch vs optimal NN workload (latency)

  • The MobileNet series provides the lowest overall latency but is more sensitive to the memory hierarchy
  • ResNet50, DenseNet201, SEResNeXt50 and IncepResv2 have higher latency, but their latency is less sensitive to the HW architecture


21 of 23

Optimal NN workload vs HW arch (energy)

  • A subset of 100 arch-mapping combinations is selected*
  • A two-level memory hierarchy (L0+L1) provides the best energy across all NNs
  • A 512 KB L1 provides near-optimal energy → no need for larger on-chip memory
  • Memory sharing benefits energy efficiency


*The energy (resp. latency) of each point is normalized to the minimum energy (resp. latency) of that NN

The top three subplots represent the memory hierarchy (size and levels); black dots indicate no memory sharing, colored dots otherwise
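The per-NN normalization from the footnote can be sketched as dividing each point by that NN's minimum; the energies below are hypothetical:

```python
# Hypothetical energies (uJ) of each NN across three architectures.
energies = {
    "MobileNetV2": [120.0, 90.0, 150.0],
    "ResNet50": [800.0, 400.0, 1000.0],
}

# Normalize each design point to the per-NN minimum, so 1.0 marks the
# best architecture for that NN and values are comparable across NNs.
normalized = {
    nn: [e / min(vals) for e in vals]
    for nn, vals in energies.items()
}
print(normalized["ResNet50"])  # [2.0, 1.0, 2.5]
```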


22 of 23

Optimal NN workload vs HW arch (latency)

  • Good latency points are achieved with either a one-level or a two-level memory hierarchy

  • Memory sharing has little to no effect on latency

  • A large L0 memory size benefits latency


23 of 23

Conclusion

  • We propose an exploration methodology to analyze the energy-latency-area trade-off space across a suite of NN workloads

  • We analyzed the impact of the HW architecture on NN workloads and vice versa

  • Main observations on the memory hierarchy:
    • A multi-level memory hierarchy provides good energy efficiency
    • Large on-chip memory does not always translate into energy efficiency
    • Memory sharing is beneficial to energy efficiency

  • Similar explorations with different spatial unrollings, PE array sizes, other NN workloads, etc., can help design versatile hardware for mapping a suite of NN workloads
