Analyzing the Energy-Latency-Area-Accuracy Trade-off Across Contemporary Neural Networks
Vikram Jain, Linyan Mei, Marian Verhelst
MICAS, KU Leuven, Belgium
vikram.jain@kuleuven.be
Overview
Machine learning is ubiquitous
AI at the edge
S. Bianco et al., "Benchmark Analysis of Representative Deep Neural Network Architectures," IEEE Access, vol. 6, pp. 64270-64277, 2018, doi: 10.1109/ACCESS.2018.2877890.
Design Space Exploration (DSE)
The DSE problem spans three axes, which together form a joint co-optimization space (see the sketch after this list):
- Algorithm: model topology, NN operations, ...
- Mapping: dataflow, loop blocking or tiling, loop ordering, ...
- Hardware: precision, memory hierarchy, PE array, sparsity, ...
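To make the size of this joint space concrete, here is a minimal Python sketch that enumerates the cross product of the three axes. All option lists and names are illustrative placeholders, not ZigZag's actual interface.

```python
from itertools import product

# Illustrative option lists for each DSE axis (hypothetical values).
algorithms = ["AlexNet", "ResNet50", "MobileNetV3"]    # NN workloads
mappings = ["weight-stationary", "output-stationary"]  # dataflows
hardware = [{"pe_array": (16, 16), "l1_kb": 8},
            {"pe_array": (32, 32), "l1_kb": 32}]       # HW configs

# The joint co-optimization space is the cross product of the three axes:
# every (algorithm, mapping, hardware) triple is one candidate design point.
design_points = list(product(algorithms, mappings, hardware))
print(f"{len(design_points)} candidate design points")  # 3 * 2 * 2 = 12
```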
DSE State-of-the-Art
Most DSE frameworks optimize and report a hardware architecture for an individual neural network workload, but none explores hardware architecture design across a suite of NN workloads, or vice versa.
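The gap can be stated in a few lines of Python. Assuming a hypothetical cost model `cost(nn, hw)` (in reality produced by a mapping-search tool such as ZigZag), per-workload optimization and suite-level optimization can pick different architectures:

```python
# Hypothetical cost model: cost(nn, hw) -> energy in mJ.
# A real DSE framework would evaluate this via mapping search.
def cost(nn: str, hw: str) -> float:
    table = {  # made-up numbers, for illustration only
        ("NN_A", "HW_1"): 1.0, ("NN_A", "HW_2"): 2.0,
        ("NN_B", "HW_1"): 3.0, ("NN_B", "HW_2"): 1.5,
    }
    return table[(nn, hw)]

suite = ["NN_A", "NN_B"]
hw_candidates = ["HW_1", "HW_2"]

# Per-workload optimization (what most frameworks report) ...
best_per_nn = {nn: min(hw_candidates, key=lambda hw: cost(nn, hw))
               for nn in suite}

# ... versus suite-level optimization: pick the HW that minimizes the
# total cost over the whole benchmark suite.
best_overall = min(hw_candidates,
                   key=lambda hw: sum(cost(nn, hw) for nn in suite))
print(best_per_nn, best_overall)
```

In this toy example the per-NN optima disagree (HW_1 for NN_A, HW_2 for NN_B) while the suite-level optimum is HW_2; this suite-level view is exactly what single-workload DSE misses.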
Overview
Motivation
This study aims to broaden understanding of the joint energy-latency-area-accuracy trade-off across contemporary neural networks and the hardware architectures that run them.
Overview
ZigZag
[ZigZag, TC2021]
https://github.com/ZigZag-Project/zigzag
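The experiments below sweep every NN workload against every hardware candidate using ZigZag. The sketch assumes a hypothetical `run_zigzag` wrapper; the framework's real entry point and signature may differ, so consult the repository above.

```python
# Hypothetical wrapper around the ZigZag framework (not its real API).
def run_zigzag(nn_name: str, hw_config: dict) -> tuple[float, float, float]:
    """Return (energy, latency, area) of the best mapping found."""
    return (0.0, 0.0, 0.0)  # dummy values; replace with the actual call

nn_suite = ["AlexNet", "ResNet50"]             # workload suite
hw_candidates = [{"l1_kb": 8}, {"l1_kb": 32}]  # HW design points

# Full cross evaluation: one (energy, latency, area) triple per (NN, HW) pair.
results = {(nn, i): run_zigzag(nn, hw)
           for nn in nn_suite
           for i, hw in enumerate(hw_candidates)}
```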
Understanding ZigZag results
In the result plots, the top three subplots show the memory hierarchy (size and number of levels); black dots mark design points without memory sharing, colored dots mark those with sharing.
Overview
Experiment setup: algorithm
Benchmark suite of contemporary NNs, including AlexNet, DenseNet201, ResNet50, and Inception-ResNet-v2:
- ImageNet Top-1 accuracy ranging from 50% to 85%
- Operation count ranging from 0.06 to 23.74 GFLOPs
- Parameter size ranging from 3.31 to 84.45 MB
(MobileNetV3 is part of the suite but missing in the figure.)
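As a sanity check of the parameter-size axis, here is a minimal sketch assuming PyTorch/torchvision is installed. Note that the MB figures above may use a different precision or counting convention, and MAC counts would additionally need a profiler.

```python
import torchvision.models as models

# Count parameters for a few of the benchmarked networks.
for name in ["alexnet", "resnet50", "densenet201"]:
    model = getattr(models, name)(weights=None)  # untrained weights suffice
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name:12s} {n_params / 1e6:6.2f} M params, "
          f"{n_params * 4 / 2**20:7.2f} MB in fp32")
```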
Experiment setup: hardware
Arch. level       | Inner-PE Reg (L0)   | On-chip L1                           | On-chip L2   | Off-chip
Mem size options  | 2 B, 32 B, 128 B    | 8 KB, 32 KB                          | 0.5 MB, 2 MB | DRAM
Mem bandwidth     | 16 bits/cycle (r/w) | 128 bits/cycle (r/w)                 |              |
Mem share options | All separate        | All separate, two shared, all shared | All shared   | All shared
Mem bypass option | No bypass           | Can bypass                           | Can bypass   | No bypass
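A minimal sketch of how this table translates into a memory-hierarchy search space. Only the free axes are enumerated (L0 is always separate and unbypassed; L2 and off-chip are always shared); the exact combination rules, especially for bypass, are an assumption here.

```python
from itertools import product

# Free axes of the memory-hierarchy search space from the table above.
l0_sizes = [2, 32, 128]                   # bytes (inner-PE register, L0)
l1_sizes = [8 * 1024, 32 * 1024]          # bytes (on-chip L1)
l2_sizes = [512 * 1024, 2 * 1024 * 1024]  # bytes (on-chip L2)
l1_sharing = ["all separate", "two shared", "all shared"]

hierarchies = list(product(l0_sizes, l1_sizes, l2_sizes, l1_sharing))
print(f"{len(hierarchies)} memory hierarchies")  # 3 * 2 * 2 * 3 = 36
# L1/L2 bypass choices would multiply this count further.
```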
Visualizing the trade-off space (1)
Energy vs Latency
Major observations:
Energy vs Area
Major observations:
Overall comparison
- NNs whose accuracy ranking is worse than their ranking on best energy and latency (i.e., they achieve relatively low accuracy at relatively high hardware cost) are suboptimal, and should hence be avoided.
- NNs whose accuracy ranking matches their ranking on best energy and latency (i.e., the accuracy achieved lives up to the hardware cost) are a good fit for some applications.
- NNs whose accuracy ranking is better than their ranking on best energy and latency (i.e., they achieve relatively high accuracy at relatively low hardware cost) are promising for embedded systems.
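This classification can be recast as a comparison of suite-relative accuracy against suite-relative hardware cost. The sketch below uses made-up numbers purely to illustrate the three categories.

```python
# Hypothetical, suite-normalized numbers (1.0 = suite average); real values
# come from the experiments in this study.
nns = {
    "NN_A": {"rel_accuracy": 1.10, "rel_cost": 0.80},  # high acc, low cost
    "NN_B": {"rel_accuracy": 1.00, "rel_cost": 1.00},  # acc matches cost
    "NN_C": {"rel_accuracy": 0.85, "rel_cost": 1.30},  # low acc, high cost
}

for name, m in nns.items():
    if m["rel_accuracy"] < 1.0 and m["rel_cost"] > 1.0:
        verdict = "suboptimal: low accuracy at high hardware cost, avoid"
    elif m["rel_accuracy"] > 1.0 and m["rel_cost"] < 1.0:
        verdict = "promising for embedded systems"
    else:
        verdict = "fair: accuracy lives up to the hardware cost"
    print(f"{name}: {verdict}")
```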
Algorithm metrics (accuracy, # MAC operations, parameter size) and hardware metrics (energy, latency, area) are related, but not in a definite way.
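One way to quantify "related, but not definite" is a rank correlation between an algorithm metric and a hardware metric. The sketch below (assuming SciPy, with made-up numbers) yields a Spearman rho of 0.90: clearly correlated, yet not a strict ordering.

```python
from scipy.stats import spearmanr

# Made-up values for five networks: an algorithm metric vs a hardware metric.
mac_ops = [0.7, 4.1, 0.06, 23.7, 7.8]  # GFLOPs per network
energy = [1.2, 9.0, 0.4, 18.0, 7.5]    # mJ per inference on some HW

rho, p = spearmanr(mac_ops, energy)
print(f"Spearman rho = {rho:.2f}")  # 0.90: correlated, but rank flips occur
```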
HW arch vs optimal NN workload (energy)
(Figure: energy comparison of workloads NN1, NN2, and NN3 across the HW architectures.)
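The per-architecture view in these plots boils down to an argmin over workloads. A sketch with hypothetical energy numbers:

```python
# For every HW candidate, find the NN workload with the lowest energy
# (analogous for latency). The cost matrix below is made up.
energy = {  # energy[hw][nn] in mJ
    "HW_1": {"NN1": 1.0, "NN2": 1.4, "NN3": 0.9},
    "HW_2": {"NN1": 2.1, "NN2": 1.2, "NN3": 1.8},
    "HW_3": {"NN1": 0.8, "NN2": 1.1, "NN3": 1.3},
}

for hw, per_nn in energy.items():
    best_nn = min(per_nn, key=per_nn.get)
    print(f"{hw}: optimal workload = {best_nn} ({per_nn[best_nn]} mJ)")
```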
HW arch vs optimal NN workload (latency)
Optimal NN workload vs HW arch (energy)
*The energy (resp. latency) of each design point is normalized to the minimum energy (resp. latency) of that NN across all HW architectures.
In the result plots, the top three subplots show the memory hierarchy (size and number of levels); black dots mark design points without memory sharing, colored dots mark those with sharing.
(Figure: normalized energy of each NN across HW architectures HW1, HW2, and HW3.)
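A sketch of the normalization described in the note above, with made-up numbers: each NN's energies (and analogously latencies) are divided by that NN's minimum over all HW architectures, so 1.0 marks each NN's best architecture.

```python
energy = {  # energy[nn][hw] in mJ (made-up values)
    "NN1": {"HW1": 1.0, "HW2": 2.1, "HW3": 0.8},
    "NN2": {"HW1": 1.4, "HW2": 1.2, "HW3": 1.1},
}

# Divide every entry by the per-NN minimum over all HW architectures.
normalized = {
    nn: {hw: e / min(per_hw.values()) for hw, e in per_hw.items()}
    for nn, per_hw in energy.items()
}
print(normalized)  # HW3 normalizes to 1.0 for both NN1 and NN2
```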
Optimal NN workload vs HW arch (latency)
Conclusion