Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling
Jiacong Sun, Pouya Houshmand, Marian Verhelst
Outline
Deep Neural Network models: nested loop representation
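The nested-loop view of a DNN layer can be sketched as plain code. This is an illustrative example, not from the slides; the loop names follow the notation used later in the deck (K/C = output/input channels, OX/OY = output spatial dimensions, FX/FY = filter dimensions).

```python
# Illustrative sketch: a convolutional layer written as the nested
# loops that design-space-exploration tools reorder and unroll.
def conv_nested_loops(I, W, K, C, OX, OY, FX, FY):
    # O[k][oy][ox] accumulates K x OY x OX outputs
    O = [[[0.0] * OX for _ in range(OY)] for _ in range(K)]
    for k in range(K):                # output channels
        for c in range(C):            # input channels
            for oy in range(OY):      # output rows
                for ox in range(OX):  # output columns
                    for fy in range(FY):      # filter rows
                        for fx in range(FX):  # filter columns
                            O[k][oy][ox] += I[c][oy + fy][ox + fx] * W[k][c][fy][fx]
    return O
```

Each permutation or spatial unrolling of these six loops corresponds to a different dataflow mapping on the accelerator.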
Digital accelerators vs. In-Memory-Computing (IMC) accelerators
PE-based accelerators
In-Memory-Computing (IMC) accelerators
Analog and Digital In-Memory-Computing architectures
Analog MAC units
Digital MAC units
AIMC
DIMC
Analog In-Memory-Computing (AIMC) architectures
Challenges in In-Memory-Computing (IMC) research
Published macros report peak macro-level performance at different precisions and in different technologies, which makes fair comparison difficult.
Research goal
A unified hardware template for IMC
Modeling for memory instances
Ref: Balasubramonian, Rajeev, et al. "CACTI 7: New tools for interconnect exploration in innovative off-chip memories." ACM Transactions on Architecture and Code Optimization (TACO) 14.2 (2017): 1-25.
Modeling for ADCs
k#: technology-dependent parameters
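An analytical ADC cost model of this kind can be sketched as follows. The functional form (a term linear in resolution plus a term exponential in resolution) is a common shape for fitted ADC energy models; the parameter values `k1` and `k2` below are hypothetical placeholders standing in for the slide's technology-dependent "k#" parameters.

```python
# Illustrative sketch of a parameterized ADC energy model.
# k1, k2 are placeholders for technology-dependent fitting parameters.
def adc_energy_fj(n_bits, k1=100.0, k2=1e-3):
    # Energy per conversion in fJ (hypothetical parameter values):
    # the k1*N term scales with the number of bit decisions,
    # the k2*4**N term captures the exponential accuracy cost.
    return k1 * n_bits + k2 * 4 ** n_bits
```

The exponential term makes high-resolution ADCs disproportionately expensive, which is one reason AIMC macros keep ADC precision low.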
Modeling for DACs
k#: technology-dependent parameters
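A corresponding DAC cost sketch, again with a hypothetical placeholder `k3` for the slide's technology-dependent parameter. DAC energy is typically modeled as growing roughly linearly with resolution and quadratically with supply voltage; this form is an assumption for illustration.

```python
# Illustrative sketch of a parameterized DAC energy model.
# k3 is a placeholder for a technology-dependent fitting parameter.
def dac_energy_fj(n_bits, vdd=0.9, k3=50.0):
    # Energy per conversion in fJ: linear in resolution,
    # scaled by supply voltage squared.
    return k3 * n_bits * vdd ** 2
```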
Modeling for digital components
Gate  | Cap | Delay | Area
NAND2 | 1   | 1     | 1
XOR2  | 1.5 | 2.4   | 2.4
DFF   | 3   | --    | 6
(costs normalized to NAND2 = 1)
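The gate-equivalence table above can be used to estimate digital-component costs. The sketch below estimates the area of a ripple-carry adder; the "2 XOR2 + 3 NAND2 per full adder" decomposition is a textbook approximation, not a number from the slides.

```python
# Illustrative sketch: estimating digital logic cost from a
# gate-equivalence table (values normalized to NAND2 = 1).
GATE_COST = {  # gate: (cap, delay, area)
    "NAND2": (1.0, 1.0, 1.0),
    "XOR2": (1.5, 2.4, 2.4),
    "DFF": (3.0, None, 6.0),  # delay not listed on the slide
}

def ripple_adder_area(n_bits):
    # Textbook approximation: one full adder ~ 2 XOR2 + 3 NAND2.
    xor_area = GATE_COST["XOR2"][2]
    nand_area = GATE_COST["NAND2"][2]
    return n_bits * (2 * xor_area + 3 * nand_area)
```

The same table-lookup approach extends to capacitance (energy) and delay estimates for multipliers, adder trees, and registers.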
Validated works @ 22/28 nm
Validation results - Delay
Model-to-reference ratio
Validation results - Area
Model-to-reference ratio
Validation results - Energy
Model-to-reference ratio
Validation results - conclusion
Goals of experiment exploration
(1) Peak macro-level performance vs. macro size
(2) Peak system-level performance vs. macro size
(3) Workload system-level performance vs. macro size
Hardware settings – peak macro level
Macro-level
Hardware settings – peak system level
System-level
Hardware settings – workload system level
System-level
Peak macro-level performance evaluation
Peak energy efficiency and computation density
(1) Peak macro-level performance vs. macro size
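The two peak metrics can be derived directly from energy per MAC, throughput, and area. The sketch below shows the arithmetic; the numbers used in testing are hypothetical, not measured values from the study.

```python
# Illustrative sketch: deriving peak macro-level metrics.
def tops_per_watt(energy_per_mac_pj):
    # 1 MAC = 2 ops, and ops/pJ is numerically equal to TOPS/W.
    return 2.0 / energy_per_mac_pj

def tops_per_mm2(macs_per_cycle, f_hz, area_mm2):
    # Peak throughput (ops/s) divided by area, scaled to TOPS/mm2.
    return 2.0 * macs_per_cycle * f_hz / area_mm2 / 1e12
```

Because both metrics depend on macro size (more columns amortize peripheral cost but stretch wordlines and ADC load), sweeping the macro dimensions is what the following experiments do.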
(2) Peak system-level performance vs. macro size
Peak energy efficiency and computation density
[Figure: peak system-level energy efficiency and computation density drop relative to the macro-level values, due to the input/output access cost]
ZigZag-IMC framework
Ref: https://github.com/KULeuven-MICAS/zigzag-imc
Four networks from the MLPerf Tiny benchmark
(3) Workload system-level performance vs. macro size
Workload performance
[Figure: workload energy breakdown, including weight loading cost]
(3) Layer system-level performance vs. macro size
Layer performance vs. peak system-level performance
Layer unrolling ratio
Layer size:
OX | 32 | output feature-map width
OY | 32 | output feature-map height
G  | 1  | groups
K  | 16 | output channels
C  | 16 | input channels
FX | 3  | filter width
FY | 3  | filter height
Energy breakdown [pJ/MAC]
Hardware under-utilized when the layer unrolling cannot fill the macro.
Hardware utilization must be maximized.
- No extra benefit if under-utilized.
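The under-utilization effect can be quantified as a spatial utilization ratio. The sketch below uses the layer dimensions from the slide and a common weight-stationary mapping (K unrolled along macro columns, C*FX*FY along rows); the mapping details are an assumption for illustration, not the tool's exact model.

```python
# Illustrative sketch: spatial utilization of an IMC macro for a
# layer with OX=32, OY=32, K=16, C=16, FX=3, FY=3, assuming K maps
# to columns and C*FX*FY maps to rows.
def spatial_utilization(rows, cols, K=16, C=16, FX=3, FY=3):
    used_rows = min(rows, C * FX * FY)   # at most C*FX*FY = 144 rows usable
    used_cols = min(cols, K)             # at most K = 16 columns usable
    return (used_rows * used_cols) / (rows * cols)
```

For this layer a 144x16 macro is fully utilized, while a 1024x64 macro sits mostly idle, so its larger peak numbers bring no benefit.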
Key takeaways for IMC architects
Hardware-oriented:
Workload-oriented:
Q & A
References:
[1] I. A. Papistas et al., "A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm2 in-memory analog matrix-vector-multiplier for DNN acceleration," in 2021 IEEE Custom Integrated Circuits Conference (CICC), 2021, pp. 1–2.
[2] J.-W. Su et al., "An 8-b-precision 6T SRAM computing-in-memory macro using segmented-bitline charge-sharing scheme for AI edge chips," IEEE Journal of Solid-State Circuits, vol. 58, no. 3, pp. 877–892, 2023.
[3] P. Chen et al., "7.8 A 22nm delta-sigma computing-in-memory (CIM) SRAM macro with near-zero-mean outputs and LSB-first ADCs achieving 21.38 TOPS/W for 8b-MAC edge AI processing," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 140–142.
[4] F. Tu et al., "A 28nm 29.2 TFLOPS/W BF16 and 36.5 TOPS/W INT8 reconfigurable digital CIM processor with unified FP/INT pipeline and bitwise in-memory Booth multiplication for cloud deep learning acceleration," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.
[5] B. Yan et al., "A 1.041-Mb/mm2 27.38-TOPS/W signed-INT8 dynamic-logic-based ADC-less SRAM compute-in-memory macro in 28nm with reconfigurable bitwise operation for AI and embedded applications," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 188–190.
[6] A. Guo et al., "A 28nm 64-kb 31.6-TFLOPS/W digital-domain floating-point-computing-unit and double-bit 6T-SRAM computing-in-memory macro for floating-point CNNs," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 128–130.
[7] J. Yue et al., "A 28nm 16.9–300 TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 1–3.
ZigZag-IMC framework
https://github.com/KULeuven-MICAS/zigzag-imc
Thank you!