
Analog or Digital In-Memory Computing? Benchmarking through Quantitative Modeling

Jiacong Sun, Pouya Houshmand, Marian Verhelst


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


Deep Neural Network models: nested loop representation

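
The loop-nest figure itself did not survive extraction. As a minimal sketch of the representation this slide refers to, in Python, with dimension names matching the layer-size table used later in the deck (K: output channels, C: input channels, OX/OY: output feature map, FX/FY: filter kernel; stride 1, no padding, groups omitted for brevity):

```python
# Nested-loop view of a convolutional layer, as used throughout this deck.
# I must be sized C x (OY+FY-1) x (OX+FX-1); W is K x C x FY x FX.
def conv_layer(I, W, K, C, OX, OY, FX, FY):
    O = [[[0.0] * OX for _ in range(OY)] for _ in range(K)]
    for k in range(K):                       # output channels
        for c in range(C):                   # input channels
            for oy in range(OY):             # output rows
                for ox in range(OX):         # output columns
                    for fy in range(FY):     # filter rows
                        for fx in range(FX): # filter columns
                            O[k][oy][ox] += W[k][c][fy][fx] * I[c][oy + fy][ox + fx]
    return O
```

DSE tools like the one presented later operate on exactly this loop-dimension vocabulary: spatial unrolling assigns some of these loops to hardware dimensions, and the temporal mapping orders the rest.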

Digital accelerators vs. In-Memory-Computing (IMC) accelerators


PE-based accelerators:

  • High mapping flexibility.

  • Limited parallelism.

  • Frequent memory access.

In-Memory-Computing (IMC) accelerators:

  • Massive parallelism.

  • High energy and area efficiency.

  • Limited mapping flexibility.


Analog and Digital In-Memory-Computing architectures

IMC accelerators split into two families: AIMC, built from analog MAC units, and DIMC, built from digital MAC units.

Analog In-Memory-Computing (AIMC) architectures


  • High peak energy efficiency.

  • DACs/ADCs add extra cost.


Digital In-Memory-Computing (DIMC) architectures


  • Higher mapping flexibility (vs. AIMC).

  • Bit-accurate.

  • Lower area/energy efficiency (vs. AIMC).


Challenges in In-Memory-Computing (IMC) research


  • For rapid design space exploration (DSE), we need:
  • A quantitative model of peak macro-level performance, covering different precisions and technologies.
  • A DSE framework for IMC.


Research goal

  • Quantitative unified model for IMC.

  • An IMC framework for system-level performance evaluation.


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


A unified hardware template for IMC


Modeling for memory instance


  • Use an established tool: CACTI.

  • CACTI: open-source memory cost estimator.

Ref: Balasubramonian, Rajeev, et al. "CACTI 7: New tools for interconnect exploration in innovative off-chip memories." ACM Transactions on Architecture and Code Optimization (TACO) 14.2 (2017): 1-25.


Modeling for ADCs


  • Existing formulas for energy/area.

  • RC model of the bitlines for latency.

(k#: technology-dependent parameters)
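
The equations on this slide were images and are lost in extraction. A hedged reconstruction of their likely form (my assumption, with B the ADC resolution and k1..k4 technology-dependent fits; the deck only states that energy/area use existing closed-form formulas and that latency follows an RC bitline model):

```latex
% Hedged sketch, NOT the authors' exact equations.
% B: ADC resolution [bit], V_DD: supply voltage, k_i: technology-dependent fits.
\begin{align}
  E_{\mathrm{ADC}} &\approx \left(k_1 B + k_2 \cdot 4^{B}\right) V_{DD}^{2}
      && \text{(energy per conversion)} \\
  A_{\mathrm{ADC}} &\approx k_3 \cdot 2^{B} + k_4 B
      && \text{(area)} \\
  T_{\mathrm{ADC}} &\propto R_{\mathrm{bitline}}\, C_{\mathrm{bitline}}
      && \text{(RC-limited bitline settling)}
\end{align}
```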


Modeling for DACs


  • Formula for energy cost.

(k#: technology-dependent parameters)
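
The DAC equation is likewise lost. One plausible first-order form (again my assumption, not necessarily the deck's exact formula) for a B-bit capacitive DAC charging a binary-weighted capacitor array:

```latex
% Hedged sketch, NOT the authors' exact equation.
% B: DAC resolution [bit], C_u: unit capacitance, k_5: technology-dependent fit.
\begin{equation}
  E_{\mathrm{DAC}} \approx k_5 \cdot 2^{B} C_{u} V_{DD}^{2}
\end{equation}
```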


Modeling for digital components


  • Parameterized model (precision, #inputs).
  • Gate-level modeling.
  • Cost is normalized to NAND2-gate equivalents:

Gate     Cap    Delay    Area
NAND2    1      1        1
XOR2     1.5    2.4      2.4
DFF      3      --       6

(all values relative to a NAND2 gate)
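
To illustrate how such NAND2-normalized unit costs compose into component costs, here is a small sketch (my own illustration, not the authors' tool) estimating the relative area of the adder tree that accumulates an IMC column's partial products. The full-adder decomposition (2 XOR2 + 3 NAND2 equivalents) is an assumed approximation:

```python
import math

# NAND2-normalized unit areas from the table above.
AREA = {"NAND2": 1.0, "XOR2": 2.4, "DFF": 6.0}

# Assumption (not from the deck): one full-adder cell ~ 2 XOR2 + 3 NAND2.
FA_AREA = 2 * AREA["XOR2"] + 3 * AREA["NAND2"]

def adder_tree_area(n_inputs: int, bits: int) -> float:
    """Area (in NAND2 equivalents) of a binary adder tree summing
    n_inputs operands of `bits` bits each; operand width grows one
    bit per level (ripple-carry adders assumed)."""
    assert n_inputs & (n_inputs - 1) == 0, "power-of-two inputs assumed"
    area = 0.0
    for level in range(1, int(math.log2(n_inputs)) + 1):
        adders = n_inputs >> level        # adders merging pairs at this level
        width = bits + level - 1          # operand width at this level
        area += adders * width * FA_AREA  # one full-adder cell per bit
    return area

# Example: tree summing 64 single-bit partial products of one column.
print(adder_tree_area(64, 1))  # -> 936.0 NAND2 equivalents
```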


Validated works @ 22/28 nm


  • Validation against 7 chips from literature [1]-[7].

  • Different precisions, macro sizes, and #macros.


Validation results - Delay


  • Mismatch: <20%.

  • Largest outliers: DIMC1 includes an extra Booth encoder; DIMC2 has a dynamic adder-tree depth.

(Figure: model-to-reference delay ratio)


Validation results - Area


  • Mismatch: <20%.

  • Largest outliers: AIMC1 spends extra area on decoupling capacitors and repeaters; DIMC2 has a dynamic adder-tree depth.

(Figure: model-to-reference area ratio)


Validation results - Energy


(Figure: model-to-reference energy ratio)

  • Mismatch: <10%.

  • 50% sparsity is assumed; AIMC2/DIMC2/DIMC4 do not report sparsity.


Validation results - conclusion


  • A unified model covers both AIMC and DIMC.

  • <20% mismatch across delay, area, and energy.


Outline

  • Background: Challenges for In-Memory Computing
  • SRAM-based In-Memory Computing modeling
  • Design space exploration and insights
  • Conclusions


Goals of the experimental exploration


(1) Peak macro-level performance vs. macro size

(2) Peak system-level performance vs. macro size

(3) Workload system-level performance vs. macro size


Hardware settings – peak macro level


  • No peripheral memory cost.

  • No weight reloading.

  • INT8, #macros = 1.


Hardware settings – peak system level


  • Memory cost included.

  • No weight reloading.

  • INT8, #macros = 1.


Hardware settings – workload system level


  • Memory cost included.

  • Weight reloading included.

  • INT8, #macros = 1.


Peak macro-level performance evaluation


Peak energy efficiency and computation density

  • AIMC: 4-50 TOP/s/W.
    • Larger macro size 🡪 higher energy efficiency (see the toy model below).
  • DIMC: ~7 TOP/s/W.
    • No major impact of macro size on energy efficiency.


  • AIMC: 5-10 TOP/s/mm².
    • Larger array size 🡪 computation density drops due to longer conversion time.
  • DIMC: ~11 TOP/s/mm².
    • No major impact of macro size on computation density.

(1) Peak macro-level performance vs. macro size
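
To make the mechanism behind both trends concrete, a toy model (entirely my own construction; the constants are placeholders tuned only to land in the ranges quoted above, not the paper's fitted parameters): in AIMC, the per-column ADC energy is amortized over all rows accumulated on a bitline, so efficiency rises with macro height; in DIMC, every MAC pays its own digital-logic energy regardless of array size:

```python
# Toy trend model: peak TOP/s/W vs. macro rows. All constants are
# illustrative placeholders, NOT the paper's fitted values.
E_ADC = 16000.0  # fJ per conversion, shared by one column's rows (assumption)
E_CELL = 25.0    # fJ per analog MAC on the bitline (assumption)
E_DMAC = 285.0   # fJ per digital MAC, multiplier + adder-tree share (assumption)

def peak_tops_per_watt(rows: int, analog: bool) -> float:
    """1 MAC = 2 ops, so an energy of E fJ/MAC yields 2000/E TOP/s/W."""
    e_mac = (E_CELL + E_ADC / rows) if analog else E_DMAC
    return 2000.0 / e_mac

for rows in (32, 128, 512, 1024):
    print(f"{rows:4d} rows | AIMC {peak_tops_per_watt(rows, True):5.1f}"
          f" | DIMC {peak_tops_per_watt(rows, False):5.1f} TOP/s/W")
```

With these placeholder numbers, AIMC climbs from ~4 to ~49 TOP/s/W across the sweep while DIMC stays at 7.0, mirroring the reported 4-50 vs. ~7 TOP/s/W split.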

(2) Peak system-level performance vs. macro size

Peak energy efficiency and computation density:

  • Memory cost impacts performance: input/output access cost makes both metrics drop from the macro level to the system level (a first-order form follows below).

  • Macro-level metrics cannot reflect system-level efficiency.
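
One first-order way to state this drop (my formulation, not an equation from the deck): the per-MAC system energy adds amortized input/output access cost on top of the macro energy, so macro-level TOP/s/W is only an upper bound:

```latex
% Hedged sketch, my formulation. E_in/E_out: per-word input/output access
% energies; R_in/R_out: MACs over which each fetched/produced word is reused.
\begin{equation}
  E_{\mathrm{sys/MAC}} = E_{\mathrm{macro/MAC}}
    + \frac{E_{\mathrm{in}}}{R_{\mathrm{in}}}
    + \frac{E_{\mathrm{out}}}{R_{\mathrm{out}}}
  \qquad\Rightarrow\qquad
  \left.\mathrm{TOP/s/W}\right|_{\mathrm{sys}}
    < \left.\mathrm{TOP/s/W}\right|_{\mathrm{macro}}
\end{equation}
```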


ZigZag-IMC framework


  • Workload performance estimation.

  • Optimal mapping / architecture exploration.

Ref: https://github.com/KULeuven-MICAS/zigzag-imc

  • Benchmarked on the 4 networks of the MLPerf Tiny suite.
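
A sketch of how such a sweep might be driven from Python. The entry point mirrors the base ZigZag API (zigzag.api.get_hardware_performance_zigzag); whether zigzag-imc exposes exactly this signature is my assumption, and all file paths are hypothetical, so treat this as pseudocode and check the repository's README:

```python
# Hedged sketch: sweeping IMC macro sizes with ZigZag(-IMC).
# Entry point assumed from the base ZigZag API; paths are hypothetical.
from zigzag.api import get_hardware_performance_zigzag

WORKLOAD = "inputs/workload/resnet8.onnx"    # hypothetical MLPerf Tiny model
MAPPING = "inputs/mapping/default_imc.yaml"  # hypothetical mapping file

for macro in ("aimc_64x64", "aimc_256x256", "aimc_1024x1024"):  # hypothetical
    energy, latency, cmes = get_hardware_performance_zigzag(
        WORKLOAD, f"inputs/hardware/{macro}.yaml", MAPPING
    )
    print(f"{macro}: {energy:.3e} pJ, {latency:.3e} cycles")
```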


(3) Workload system-level performance vs. macro size

Workload performance:

  • Weight loading cost negatively affects workload efficiency (see the first-order form below).
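
In the same first-order style as before (again my formulation, not the deck's): if each weight loaded into the macro at cost E_w serves R_w MACs before being overwritten, the reload overhead per MAC is E_w/R_w, which grows when layers offer little weight reuse relative to the macro capacity:

```latex
% Hedged sketch, my formulation. E_w: energy to load one weight into the
% macro; R_w: MACs executed per loaded weight (temporal weight reuse).
\begin{equation}
  E_{\mathrm{workload/MAC}} \approx E_{\mathrm{sys/MAC}} + \frac{E_{w}}{R_{w}}
\end{equation}
```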

(3) Layer system-level performance vs. macro size

Layer performance: energy breakdown [pJ/MAC] vs. macro size, compared against the peak system-level curve.

Layer size:

Dim     OX    OY    G    K    C    FX    FY
Size    32    32    1    16   16   3     3

  • Hardware utilization must be maximized.
    • No extra benefit if under-utilized: once the macro exceeds the layer's unrolling ratio, the additional array stays idle (see the sketch below).
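
A small sketch (my own illustration) of the utilization argument for the layer above, assuming output channels K map onto macro columns and the C·FX·FY multiplies per output map onto macro rows (a common IMC mapping; the deck's exact assignment may differ):

```python
# Toy spatial-utilization check for the layer above (K=16, C=16, FX=FY=3).
# Mapping assumption (mine): K unrolls along columns, C*FX*FY along rows.
K, C, FX, FY = 16, 16, 3, 3

def utilization(rows: int, cols: int) -> float:
    used_rows = min(rows, C * FX * FY)  # input-side unrolling, capped at 144
    used_cols = min(cols, K)            # output-side unrolling, capped at 16
    return (used_rows * used_cols) / (rows * cols)

for size in (32, 64, 128, 256, 1024):
    print(f"{size}x{size}: {utilization(size, size):.1%}")
# 50.0%, 25.0%, 12.5%, 3.5%, 0.2%: beyond the layer's unrolling ratio,
# a larger macro only adds idle array.
```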


(3) Workload system-level performance vs. macro size: summary

  • Weight loading cost negatively affects workload efficiency.

  • Hardware utilization must be maximized.

  • DIMC: higher TOP/s than AIMC.


Key takeaways for IMC architects


Hardware-oriented:

  • Focus not only on the macro level but also on the system level.
  • Mapping efficiency / hardware utilization must be optimized.

Workload-oriented:

  • AIMC suits workloads with large kernel sizes and high temporal weight reuse.
  • DIMC suits workloads with diverse kernel sizes.
  • DIMC's peak TOP/s/W is less affected by macro dimensions.


Q & A

References:

[1] I. A. Papistas et al., "A 22 nm, 1540 TOP/s/W, 12.1 TOP/s/mm² in-memory analog matrix-vector-multiplier for DNN acceleration," in 2021 IEEE Custom Integrated Circuits Conference (CICC), 2021, pp. 1-2.

[2] J.-W. Su et al., "An 8-b-precision 6T SRAM computing-in-memory macro using segmented-bitline charge-sharing scheme for AI edge chips," IEEE Journal of Solid-State Circuits, vol. 58, no. 3, pp. 877-892, 2023.

[3] P. Chen et al., "A 22nm delta-sigma computing-in-memory (CIM) SRAM macro with near-zero-mean outputs and LSB-first ADCs achieving 21.38 TOPS/W for 8b-MAC edge AI processing," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 140-142.

[4] F. Tu et al., "A 28nm 29.2 TFLOPS/W BF16 and 36.5 TOPS/W INT8 reconfigurable digital CIM processor with unified FP/INT pipeline and bitwise in-memory Booth multiplication for cloud deep learning acceleration," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1-3.

[5] B. Yan et al., "A 1.041-Mb/mm² 27.38-TOPS/W signed-INT8 dynamic-logic-based ADC-less SRAM compute-in-memory macro in 28nm with reconfigurable bitwise operation for AI and embedded applications," in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 188-190.

[6] A. Guo et al., "A 28nm 64-kb 31.6-TFLOPS/W digital-domain floating-point-computing-unit and double-bit 6T-SRAM computing-in-memory macro for floating-point CNNs," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 128-130.

[7] J. Yue et al., "A 28nm 16.9-300 TOPS/W computing-in-memory processor supporting floating-point NN inference/training with intensive-CIM sparse-digital architecture," in 2023 IEEE International Solid-State Circuits Conference (ISSCC), 2023, pp. 1-3.


ZigZag-IMC framework

https://github.com/KULeuven-MICAS/zigzag-imc

Thank you!