1 of 24

CS-773 Paper Presentation

REDUCT: Keep it Close, Keep it Cool! Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

Pranshu S Negi
ArchMages (#0)

180050076@iitb.ac.in


2 of 24

Contents

  • Motivation
  • Characterization of DNN Primitives
  • REDUCT
  • Evaluation
  • Summary


3 of 24

Motivation

  • ML and DNNs are used in a wide variety of applications and services
  • Race to build hardware for DNN execution:
    • CPUs are the most common
    • Programmable and general-purpose in nature


[Figure: hardware options for DNN execution: CPU, GPU, DSA, FPGA]

4 of 24

Motivation

  • All instructions go through the power-hungry CPU pipeline, and all compute sits behind a serially accessed multi-level cache hierarchy
  • DNN inference primitives like convolution and inner-product are:
    • heavily data-parallel
    • diverse in compute intensity (Ops/Byte; see the sketch after this list)
  • This results in sub-optimal performance, resource utilization, and power
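To make "diverse compute intensity" concrete, here is a minimal C sketch (mine, not from the paper) that computes Ops/Byte for one illustrative convolution layer and one illustrative inner-product layer; the layer shapes are assumptions chosen only to show the gap.

    #include <stdio.h>

    /* Arithmetic intensity (Ops/Byte) of two illustrative DNN primitives.
       Layer shapes are assumptions for illustration, not from the paper. */
    int main(void) {
        /* Convolution: 56x56 output, 64 in / 64 out channels, 3x3 kernel */
        double conv_ops   = 2.0 * 56 * 56 * 64 * 64 * 3 * 3;  /* 2 ops per MAC */
        double conv_bytes = (56.0 * 56 * 64        /* input activations */
                           + 64.0 * 64 * 3 * 3     /* weights           */
                           + 56.0 * 56 * 64) * 4;  /* outputs, fp32     */

        /* Inner-product: batch 1, 4096 inputs, 4096 outputs */
        double ip_ops   = 2.0 * 4096 * 4096;
        double ip_bytes = (4096.0 + 4096.0 * 4096 + 4096.0) * 4;

        printf("conv  Ops/Byte: %.1f\n", conv_ops / conv_bytes); /* ~132 */
        printf("inner Ops/Byte: %.1f\n", ip_ops / ip_bytes);     /* ~0.5 */
        return 0;
    }

The two primitives sit at opposite ends of the roofline: convolution is compute-bound while batch-1 inner-product is bandwidth-bound, which is why one fixed provisioning of compute and cache bandwidth serves both poorly.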


5 of 24

Characterization of DNN Primitives

Convolution

  • Under-utilization of L2/L3 bandwidth
  • Data-movement overhead at the L1-L2 interface
  • Bandwidth is over-provisioned relative to compute


6 of 24

Characterization of DNN Primitives

Inner-product

  • Poor hit rate at L1
  • Compute is over-provisioned relative to bandwidth
  • High data-movement overheads


7 of 24

Performance Characterization


8 of 24

REDUCT: Overview


9 of 24

REDUCT Support Extensions (rSX) ISA

A new set of rSX instructions

Kernel instructions are decoded and allocated into new TFU Code Registers


10 of 24

REDUCT Support Extensions (rSX)

  • Encode multi-level loop information concisely
  • Carry meta-data about each instruction (a hypothetical layout is sketched below):
    • The set of loops it resides within
    • Base addresses for loads and stores
    • Stride values for destination register IDs
  • Enable dispatch of work proximal to the execution units

11 of 24

Example:

    parallel for:
        Loadrsx Weight R1 <- [A1 + Δ1]
        Loadrsx Weight R2 <- [A2 + Δ1]
        Loadrsx Weight R3 <- [A3 + Δ1]
        Loadrsx Weight R4 <- [A4 + Δ1]
        Loadrsx Input  R5 <- [A5 + Δ2 + Δ3]
        MACrsx R{6 + δ + θ} += R1, R5
        MACrsx R{7 + δ + θ} += R2, R5
        MACrsx R{8 + δ + θ} += R3, R5
        MACrsx R{9 + δ + θ} += R4, R5
      Loop                     ; closes the inner loop
    Loop                       ; closes the outer loop
    end parallel for
    Store Outputs (R6, R7, …)
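For reference, a plain-C sketch of the computation the rSX sequence above expresses: a 4-way register-blocked multiply-accumulate in which one streamed input value is reused against four weights. The function name, array layout, and loop bounds are illustrative assumptions.

    /* Plain-C view of the rSX example: 4-way register-blocked MAC loop.
       Names, layout, and bounds (n_outer, n_inner) are illustrative. */
    void mac_kernel(const float *w, const float *in, float *out,
                    int n_outer, int n_inner) {
        for (int o = 0; o < n_outer; o++) {            /* outer Loop           */
            float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
            for (int i = 0; i < n_inner; i++) {        /* inner Loop           */
                float x = in[o * n_inner + i];         /* Loadrsx Input  -> R5 */
                acc0 += w[0 * n_inner + i] * x;        /* MACrsx R6 += R1, R5  */
                acc1 += w[1 * n_inner + i] * x;        /* MACrsx R7 += R2, R5  */
                acc2 += w[2 * n_inner + i] * x;        /* MACrsx R8 += R3, R5  */
                acc3 += w[3 * n_inner + i] * x;        /* MACrsx R9 += R4, R5  */
            }
            out[o * 4 + 0] = acc0;                     /* Store Outputs        */
            out[o * 4 + 1] = acc1;
            out[o * 4 + 2] = acc2;
            out[o * 4 + 3] = acc3;
        }
    }

Expressed as rSX, the loop bounds and strides live in meta-data, so the loop body above collapses to the handful of Loadrsx/MACrsx instructions shown on the slide.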


12 of 24

REDUCT Support Extensions (rSX)


13 of 24

Tensor Functional Units


14 of 24

Leveraging SMT


15 of 24

Micro-Architectural Support

  • Virtual-to-physical memory translation
  • Coherence support
    • Core-valid bit at L2 and L3 (see the sketch after this list)
  • Distributed L3 caches
    • Local cache slice for the TFU

16 of 24

Micro-Architectural Support (contd.)

  • Memory ordering
    • Strict load/store ordering is maintained
  • Context switching
    • TFU context is included in context save/restore


17 of 24

Programming Model Support

  • Generating rSX code
    • via a JITer
  • Exposing REDUCT capability
    • via cpuid
  • Optimal TFU selection
  • Distribution of work across TFUs
    • static_asymmetric policy (sketched below)
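A minimal C sketch of a static, asymmetric work split across TFUs in the spirit of static_asymmetric: each TFU receives a fixed, possibly unequal share of the output space, decided once at kernel launch. The weights, counts, and function names are illustrative assumptions.

    /* Hypothetical static_asymmetric partitioning across TFUs; the
       weights and names are illustrative assumptions. */
    #define NUM_TFUS 4

    /* Relative throughput weights per TFU (e.g., L2-side TFUs vs an
       L3-side TFU); the values here are made up for illustration. */
    static const int tfu_weight[NUM_TFUS] = {3, 3, 1, 1};

    /* Split total_rows of output into contiguous, weight-proportional
       chunks: start[t]/count[t] describe TFU t's share. */
    void partition_work(int total_rows, int start[NUM_TFUS], int count[NUM_TFUS]) {
        int wsum = 0, assigned = 0;
        for (int t = 0; t < NUM_TFUS; t++) wsum += tfu_weight[t];
        for (int t = 0; t < NUM_TFUS; t++) {
            count[t] = total_rows * tfu_weight[t] / wsum;
            if (t == NUM_TFUS - 1)
                count[t] = total_rows - assigned;   /* last TFU takes the remainder */
            start[t] = assigned;
            assigned += count[t];
        }
    }

Being static, the split adds no runtime scheduling overhead; being asymmetric, it can match shares to TFUs with different effective throughput or bandwidth.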


18 of 24

Simulation Parameters


19 of 24

REDUCT Configurations


20 of 24

Convolution


  • 2x to 3.94x performance over the baseline
  • Data-movement overheads reduced from 26% to 9%

21 of 24

Inner Product


  • 2.2x better performance
  • Data-movement overheads reduced by 2.1x

22 of 24

Performance and Power


23 of 24

DNN Inference Summary

  • CPUs: inefficient resource utilization for DNN inference
  • Thus:


Simple ISA enhancements + Near-Cache Compute -> REDUCT -> Better Performance/Watt

24 of 24

Thank you for listening!

Any Questions?
