
Analog or Digital In-Memory Computing? Benchmarking through Quantitative Modeling

Jiacong Sun, Pouya Houshmand, Marian Verhelst


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


Deep Neural Network models: nested loop representation

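
The loop-nest figure itself did not survive extraction. As a minimal sketch of the representation this slide refers to, in Python, with dimension names matching the layer-size table used later in the deck (K: output channels, C: input channels, OX/OY: output feature map, FX/FY: filter kernel; stride 1, no padding, groups omitted for brevity):

```python
# Nested-loop view of a convolutional layer, as used throughout this deck.
# I must be sized C x (OY+FY-1) x (OX+FX-1); W is K x C x FY x FX.
def conv_layer(I, W, K, C, OX, OY, FX, FY):
    O = [[[0.0] * OX for _ in range(OY)] for _ in range(K)]
    for k in range(K):                       # output channels
        for c in range(C):                   # input channels
            for oy in range(OY):             # output rows
                for ox in range(OX):         # output columns
                    for fy in range(FY):     # filter rows
                        for fx in range(FX): # filter columns
                            O[k][oy][ox] += W[k][c][fy][fx] * I[c][oy + fy][ox + fx]
    return O
```

DSE tools like the one presented later operate on exactly this loop-dimension vocabulary: spatial unrolling assigns some of these loops to hardware dimensions, and the temporal mapping orders the rest.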

Digital accelerators vs. In-Memory-Computing (IMC) accelerators


PE-based accelerators:

  • High mapping flexibility.

  • Limited parallelism.

  • Frequent memory access.

In-Memory-Computing (IMC) accelerators:

  • Massive parallelism.

  • High energy and area efficiency.

  • Limited mapping flexibility.


Analog and Digital In-Memory-Computing architectures

IMC accelerators split into two families: AIMC, built from analog MAC units, and DIMC, built from digital MAC units.

Analog In-Memory-Computing (AIMC) architectures


  • High peak energy efficiency.

  • DACs/ADCs add extra cost.


Digital In-Memory-Computing (DIMC) architectures


  • Higher mapping flexibility (vs. AIMC).

  • Bit-accurate.

  • Lower area/energy efficiency (vs. AIMC).


Challenges in In-Memory-Computing (IMC) research


  • For rapid design space exploration (DSE), we need:
  • A quantitative model of peak macro-level performance, covering different precisions and technologies.
  • A DSE framework for IMC.


Research goal

  • Quantitative unified model for IMC.

  • An IMC framework for system-level performance evaluation.


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


A unified hardware template for IMC


Modeling for memory instance


  • Use an established tool: CACTI.

  • CACTI: open-source memory cost estimator.

Ref: Balasubramonian, Rajeev, et al. "CACTI 7: New tools for interconnect exploration in innovative off-chip memories." ACM Transactions on Architecture and Code Optimization (TACO) 14.2 (2017): 1-25.


Modeling for ADCs


  • Existing formulas for energy/area.

  • RC model of the bitlines for latency.

(k#: technology-dependent parameters)
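
The equations on this slide were images and are lost in extraction. A hedged reconstruction of their likely form (my assumption, with B the ADC resolution and k1..k4 technology-dependent fits; the deck only states that energy/area use existing closed-form formulas and that latency follows an RC bitline model):

```latex
% Hedged sketch, NOT the authors' exact equations.
% B: ADC resolution [bit], V_DD: supply voltage, k_i: technology-dependent fits.
\begin{align}
  E_{\mathrm{ADC}} &\approx \left(k_1 B + k_2 \cdot 4^{B}\right) V_{DD}^{2}
      && \text{(energy per conversion)} \\
  A_{\mathrm{ADC}} &\approx k_3 \cdot 2^{B} + k_4 B
      && \text{(area)} \\
  T_{\mathrm{ADC}} &\propto R_{\mathrm{bitline}}\, C_{\mathrm{bitline}}
      && \text{(RC-limited bitline settling)}
\end{align}
```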


Modeling for DACs


  • Formula for energy cost.

(k#: technology-dependent parameters)
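
The DAC equation is likewise lost. One plausible first-order form (again my assumption, not necessarily the deck's exact formula) for a B-bit capacitive DAC charging a binary-weighted capacitor array:

```latex
% Hedged sketch, NOT the authors' exact equation.
% B: DAC resolution [bit], C_u: unit capacitance, k_5: technology-dependent fit.
\begin{equation}
  E_{\mathrm{DAC}} \approx k_5 \cdot 2^{B} C_{u} V_{DD}^{2}
\end{equation}
```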


Modeling for digital components


  • Parameterized model (precision, #inputs).
  • Gate-level modeling.
  • Cost is normalized to NAND2-gate equivalents:

Gate     Cap    Delay    Area
NAND2    1      1        1
XOR2     1.5    2.4      2.4
DFF      3      --       6

(all values relative to a NAND2 gate)
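
To illustrate how such NAND2-normalized unit costs compose into component costs, here is a small sketch (my own illustration, not the authors' tool) estimating the relative area of the adder tree that accumulates an IMC column's partial products. The full-adder decomposition (2 XOR2 + 3 NAND2 equivalents) is an assumed approximation:

```python
import math

# NAND2-normalized unit areas from the table above.
AREA = {"NAND2": 1.0, "XOR2": 2.4, "DFF": 6.0}

# Assumption (not from the deck): one full-adder cell ~ 2 XOR2 + 3 NAND2.
FA_AREA = 2 * AREA["XOR2"] + 3 * AREA["NAND2"]

def adder_tree_area(n_inputs: int, bits: int) -> float:
    """Area (in NAND2 equivalents) of a binary adder tree summing
    n_inputs operands of `bits` bits each; operand width grows one
    bit per level (ripple-carry adders assumed)."""
    assert n_inputs & (n_inputs - 1) == 0, "power-of-two inputs assumed"
    area = 0.0
    for level in range(1, int(math.log2(n_inputs)) + 1):
        adders = n_inputs >> level        # adders merging pairs at this level
        width = bits + level - 1          # operand width at this level
        area += adders * width * FA_AREA  # one full-adder cell per bit
    return area

# Example: tree summing 64 single-bit partial products of one column.
print(adder_tree_area(64, 1))  # -> 936.0 NAND2 equivalents
```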


Validated works @ 22/28 nm


  • Validation against 7 chips from literature [1]-[7].

  • Different precisions, macro sizes, and #macros.


Validation results - Delay


  • Mismatch: <20%.

  • Largest outliers: DIMC1 includes an extra Booth encoder; DIMC2 has a dynamic adder-tree depth.

(Figure: model-to-reference delay ratio)


Validation results - Area


  • Mismatch: <20%.

  • Largest outliers: AIMC1 spends extra area on decoupling capacitors and repeaters; DIMC2 has a dynamic adder-tree depth.

(Figure: model-to-reference area ratio)


Validation results - Energy


(Figure: model-to-reference energy ratio)

  • Mismatch: <10%.

  • 50% sparsity is assumed; AIMC2/DIMC2/DIMC4 do not report sparsity.


Validation results - conclusion


  • A unified model covers both AIMC and DIMC.

  • <20% mismatch across delay, area, and energy.


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


Goals of the experimental exploration


(1) Peak macro-level performance vs. macro size

(2) Peak system-level performance vs. macro size

(3) Workload system-level performance vs. macro size


Hardware settings – peak macro level


  • No peripheral memory cost.

  • No weight reloading.

  • INT8, #macros = 1.


Hardware settings – peak system level


  • Memory cost included.

  • No weight reloading.

  • INT8, #macros = 1.


Hardware settings – workload system level


  • Memory cost included.

  • Weight reloading included.

  • INT8, #macros = 1.


Peak macro-level performance evaluation


Peak energy efficiency and computation density

  • AIMC: 4-50 TOP/s/W.
    • Larger macro size 🡪 higher energy efficiency (see the toy model below).
  • DIMC: ~7 TOP/s/W.
    • No major impact of macro size on energy efficiency.


  • AIMC: 5-10 TOP/s/mm².
    • Larger array size 🡪 computation density drops due to longer conversion time.
  • DIMC: ~11 TOP/s/mm².
    • No major impact of macro size on computation density.

(1) Peak macro-level performance vs. macro size
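
To make the mechanism behind both trends concrete, a toy model (entirely my own construction; the constants are placeholders tuned only to land in the ranges quoted above, not the paper's fitted parameters): in AIMC, the per-column ADC energy is amortized over all rows accumulated on a bitline, so efficiency rises with macro height; in DIMC, every MAC pays its own digital-logic energy regardless of array size:

```python
# Toy trend model: peak TOP/s/W vs. macro rows. All constants are
# illustrative placeholders, NOT the paper's fitted values.
E_ADC = 16000.0  # fJ per conversion, shared by one column's rows (assumption)
E_CELL = 25.0    # fJ per analog MAC on the bitline (assumption)
E_DMAC = 285.0   # fJ per digital MAC, multiplier + adder-tree share (assumption)

def peak_tops_per_watt(rows: int, analog: bool) -> float:
    """1 MAC = 2 ops, so an energy of E fJ/MAC yields 2000/E TOP/s/W."""
    e_mac = (E_CELL + E_ADC / rows) if analog else E_DMAC
    return 2000.0 / e_mac

for rows in (32, 128, 512, 1024):
    print(f"{rows:4d} rows | AIMC {peak_tops_per_watt(rows, True):5.1f}"
          f" | DIMC {peak_tops_per_watt(rows, False):5.1f} TOP/s/W")
```

With these placeholder numbers, AIMC climbs from ~4 to ~49 TOP/s/W across the sweep while DIMC stays at 7.0, mirroring the reported 4-50 vs. ~7 TOP/s/W split.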

(2) Peak system-level performance vs. macro size

Peak energy efficiency and computation density:

  • Memory cost impacts performance: input/output access cost makes both metrics drop from the macro level to the system level (a first-order form follows below).

  • Macro-level metrics cannot reflect system-level efficiency.
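
One first-order way to state this drop (my formulation, not an equation from the deck): the per-MAC system energy adds amortized input/output access cost on top of the macro energy, so macro-level TOP/s/W is only an upper bound:

```latex
% Hedged sketch, my formulation. E_in/E_out: per-word input/output access
% energies; R_in/R_out: MACs over which each fetched/produced word is reused.
\begin{equation}
  E_{\mathrm{sys/MAC}} = E_{\mathrm{macro/MAC}}
    + \frac{E_{\mathrm{in}}}{R_{\mathrm{in}}}
    + \frac{E_{\mathrm{out}}}{R_{\mathrm{out}}}
  \qquad\Rightarrow\qquad
  \left.\mathrm{TOP/s/W}\right|_{\mathrm{sys}}
    < \left.\mathrm{TOP/s/W}\right|_{\mathrm{macro}}
\end{equation}
```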


ZigZag-IMC framework


  • Workload performance estimation.

  • Optimal mapping / architecture exploration.

Ref: https://github.com/KULeuven-MICAS/zigzag-imc

  • Benchmarked on the 4 networks of the MLPerf Tiny suite.
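
A sketch of how such a sweep might be driven from Python. The entry point mirrors the base ZigZag API (zigzag.api.get_hardware_performance_zigzag); whether zigzag-imc exposes exactly this signature is my assumption, and all file paths are hypothetical, so treat this as pseudocode and check the repository's README:

```python
# Hedged sketch: sweeping IMC macro sizes with ZigZag(-IMC).
# Entry point assumed from the base ZigZag API; paths are hypothetical.
from zigzag.api import get_hardware_performance_zigzag

WORKLOAD = "inputs/workload/resnet8.onnx"    # hypothetical MLPerf Tiny model
MAPPING = "inputs/mapping/default_imc.yaml"  # hypothetical mapping file

for macro in ("aimc_64x64", "aimc_256x256", "aimc_1024x1024"):  # hypothetical
    energy, latency, cmes = get_hardware_performance_zigzag(
        WORKLOAD, f"inputs/hardware/{macro}.yaml", MAPPING
    )
    print(f"{macro}: {energy:.3e} pJ, {latency:.3e} cycles")
```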


(3) Workload system-level performance vs. macro size

Workload performance:

  • Weight loading cost negatively affects workload efficiency (see the first-order form below).
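
In the same first-order style as before (again my formulation, not the deck's): if each weight loaded into the macro at cost E_w serves R_w MACs before being overwritten, the reload overhead per MAC is E_w/R_w, which grows when layers offer little weight reuse relative to the macro capacity:

```latex
% Hedged sketch, my formulation. E_w: energy to load one weight into the
% macro; R_w: MACs executed per loaded weight (temporal weight reuse).
\begin{equation}
  E_{\mathrm{workload/MAC}} \approx E_{\mathrm{sys/MAC}} + \frac{E_{w}}{R_{w}}
\end{equation}
```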

(3) Layer system-level performance vs. macro size

Layer performance: energy breakdown [pJ/MAC] vs. macro size, compared against the peak system-level curve.

Layer size:

Dim     OX    OY    G    K    C    FX    FY
Size    32    32    1    16   16   3     3

  • Hardware utilization must be maximized.
    • No extra benefit if under-utilized: once the macro exceeds the layer's unrolling ratio, the additional array stays idle (see the sketch below).
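
A small sketch (my own illustration) of the utilization argument for the layer above, assuming output channels K map onto macro columns and the C·FX·FY multiplies per output map onto macro rows (a common IMC mapping; the deck's exact assignment may differ):

```python
# Toy spatial-utilization check for the layer above (K=16, C=16, FX=FY=3).
# Mapping assumption (mine): K unrolls along columns, C*FX*FY along rows.
K, C, FX, FY = 16, 16, 3, 3

def utilization(rows: int, cols: int) -> float:
    used_rows = min(rows, C * FX * FY)  # input-side unrolling, capped at 144
    used_cols = min(cols, K)            # output-side unrolling, capped at 16
    return (used_rows * used_cols) / (rows * cols)

for size in (32, 64, 128, 256, 1024):
    print(f"{size}x{size}: {utilization(size, size):.1%}")
# 50.0%, 25.0%, 12.5%, 3.5%, 0.2%: beyond the layer's unrolling ratio,
# a larger macro only adds idle array.
```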


(3) Workload system-level performance vs. macro size: summary

  • Weight loading cost negatively affects workload efficiency.

  • Hardware utilization must be maximized.

  • DIMC: higher TOP/s than AIMC.


Key takeaways for IMC architects


Hardware-oriented:

  • Focus not only on the macro level but also on the system level.
  • Mapping efficiency / hardware utilization must be optimized.

Workload-oriented:

  • AIMC suits workloads with large kernel sizes and high temporal weight reuse.
  • DIMC suits workloads with diverse kernel sizes.
  • DIMC's peak TOP/s/W is less affected by macro dimensions.


Q & A

References:

[1] I. A. Papistas et al., "A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm² in-memory analog matrix-vector-multiplier for DNN acceleration," in 2021 IEEE Custom Integrated Circuits Conference (CICC), 2021, pp. 1-2.

[2] J.-W. Su et al., "An 8-b-precision 6T SRAM computing-in-memory macro using segmented-bitline charge-sharing scheme for AI edge chips," IEEE Journal of Solid-State Circuits, vol. 58, no. 3, pp. 877-892, 2023.

[3] P. Chen et al., "A 22nm delta-sigma computing-in-memory (CIM) SRAM macro with near-zero-mean outputs and LSB-first ADCs achieving 21.38 TOPS/W for 8b-MAC edge AI processing," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 140-142.

[4] F. Tu et al., "A 28nm 29.2 TFLOPS/W BF16 and 36.5 TOPS/W INT8 reconfigurable digital CIM processor with unified FP/INT pipeline and bitwise in-memory Booth multiplication for cloud deep learning acceleration," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1-3.

[5] B. Yan et al., "A 1.041-Mb/mm² 27.38-TOPS/W signed-INT8 dynamic-logic-based ADC-less SRAM compute-in-memory macro in 28nm with reconfigurable bitwise operation for AI and embedded applications," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 188-190.

[6] A. Guo et al., "A 28nm 64-kb 31.6-TFLOPS/W digital-domain floating-point-computing-unit and double-bit 6T-SRAM computing-in-memory macro for floating-point CNNs," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 128-130.

[7] J. Yue et al., "A 28nm 16.9-300 TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 1-3.


ZigZag-IMC framework

https://github.com/KULeuven-MICAS/zigzag-imc

Thank you!