1 of 24

CS-773 Paper Presentation

REDUCT: Keep it Close, Keep it Cool! Efficient Scaling of DNN Inference on Multi-core CPUs with Near-Cache Compute

Pranshu S Negi
ArchMages (#0)

180050076@iitb.ac.in


2 of 24

Contents

  • Motivation
  • Characterization of DNN Primitives
  • REDUCT
  • Evaluation
  • Summary


3 of 24

Motivation

  • ML and DNNs are used in a wide variety of applications and services
  • Race to build hardware for DNN execution:
    • CPUs are the most common
    • Programmable and general-purpose in nature


[Figure: hardware options for DNN execution: CPU, GPU, DSA, FPGA]

4 of 24

Motivation

  • All instructions go through the power-hungry CPU pipeline, and all compute sits behind a serially accessed multi-level cache hierarchy
  • DNN inference primitives like convolution and inner-product are:
    • heavily data-parallel
    • diverse in compute intensity (Ops/Byte; see the sketch after this list)
  • This results in sub-optimal performance, resource utilization, and power
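To make "diverse compute intensity" concrete, here is a minimal C sketch (mine, not from the paper) that computes Ops/Byte for one illustrative convolution layer and one illustrative inner-product layer; the layer shapes are assumptions chosen only to show the gap.

    #include <stdio.h>

    /* Arithmetic intensity (Ops/Byte) of two illustrative DNN primitives.
       Layer shapes are assumptions for illustration, not from the paper. */
    int main(void) {
        /* Convolution: 56x56 output, 64 in / 64 out channels, 3x3 kernel */
        double conv_ops   = 2.0 * 56 * 56 * 64 * 64 * 3 * 3;  /* 2 ops per MAC */
        double conv_bytes = (56.0 * 56 * 64        /* input activations */
                           + 64.0 * 64 * 3 * 3     /* weights           */
                           + 56.0 * 56 * 64) * 4;  /* outputs, fp32     */

        /* Inner-product: batch 1, 4096 inputs, 4096 outputs */
        double ip_ops   = 2.0 * 4096 * 4096;
        double ip_bytes = (4096.0 + 4096.0 * 4096 + 4096.0) * 4;

        printf("conv  Ops/Byte: %.1f\n", conv_ops / conv_bytes); /* ~132 */
        printf("inner Ops/Byte: %.1f\n", ip_ops / ip_bytes);     /* ~0.5 */
        return 0;
    }

The two primitives sit at opposite ends of the roofline: convolution is compute-bound while batch-1 inner-product is bandwidth-bound, which is why one fixed provisioning of compute and cache bandwidth serves both poorly.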


5 of 24

Characterization of DNN Primitives

Convolution

  • Under-utilization of L2/L3 bandwidth
  • Data-movement overhead at the L1-L2 interface
  • Bandwidth is over-provisioned relative to compute


6 of 24

Characterization of DNN Primitives

Inner-product

  • Poor hit rate at L1
  • Compute is over-provisioned relative to bandwidth
  • High data-movement overheads


7 of 24

Performance Characterization


8 of 24

REDUCT: Overview


9 of 24

REDUCT Support Extensions (rSX) ISA

A new set of rSX instructions

Kernel instructions are decoded and allocated into new TFU Code Registers


10 of 24

REDUCT Support Extensions (rSX)

  • Encode multi-level loop information concisely
  • Carry meta-data about each instruction (a hypothetical layout is sketched below):
    • The set of loops it resides within
    • Base addresses for loads and stores
    • Stride values for destination register IDs
  • Enable dispatch of work proximal to the execution units

11 of 24

Example:

    parallel for:
        Loadrsx Weight R1 <- [A1 + Δ1]
        Loadrsx Weight R2 <- [A2 + Δ1]
        Loadrsx Weight R3 <- [A3 + Δ1]
        Loadrsx Weight R4 <- [A4 + Δ1]
        Loadrsx Input  R5 <- [A5 + Δ2 + Δ3]
        MACrsx R{6 + δ + θ} += R1, R5
        MACrsx R{7 + δ + θ} += R2, R5
        MACrsx R{8 + δ + θ} += R3, R5
        MACrsx R{9 + δ + θ} += R4, R5
      Loop                     ; closes the inner loop
    Loop                       ; closes the outer loop
    end parallel for
    Store Outputs (R6, R7, …)
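For reference, a plain-C sketch of the computation the rSX sequence above expresses: a 4-way register-blocked multiply-accumulate in which one streamed input value is reused against four weights. The function name, array layout, and loop bounds are illustrative assumptions.

    /* Plain-C view of the rSX example: 4-way register-blocked MAC loop.
       Names, layout, and bounds (n_outer, n_inner) are illustrative. */
    void mac_kernel(const float *w, const float *in, float *out,
                    int n_outer, int n_inner) {
        for (int o = 0; o < n_outer; o++) {            /* outer Loop           */
            float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
            for (int i = 0; i < n_inner; i++) {        /* inner Loop           */
                float x = in[o * n_inner + i];         /* Loadrsx Input  -> R5 */
                acc0 += w[0 * n_inner + i] * x;        /* MACrsx R6 += R1, R5  */
                acc1 += w[1 * n_inner + i] * x;        /* MACrsx R7 += R2, R5  */
                acc2 += w[2 * n_inner + i] * x;        /* MACrsx R8 += R3, R5  */
                acc3 += w[3 * n_inner + i] * x;        /* MACrsx R9 += R4, R5  */
            }
            out[o * 4 + 0] = acc0;                     /* Store Outputs        */
            out[o * 4 + 1] = acc1;
            out[o * 4 + 2] = acc2;
            out[o * 4 + 3] = acc3;
        }
    }

Expressed as rSX, the loop bounds and strides live in meta-data, so the loop body above collapses to the handful of Loadrsx/MACrsx instructions shown on the slide.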


12 of 24

REDUCT Support Extensions (rSX)


13 of 24

Tensor Functional Units


14 of 24

Leveraging SMT


15 of 24

Micro-Architectural Support

  • Virtual-to-physical memory translation
  • Coherence support
    • Core-valid bit at L2 and L3 (see the sketch after this list)
  • Distributed L3 caches
    • Local cache slice for the TFU

16 of 24

Micro-Architectural Support (contd.)

  • Memory ordering
    • Strict load/store ordering is maintained
  • Context switching
    • TFU context is included in context save/restore


17 of 24

Programming Model Support

  • Generating rSX code
    • via a JITer
  • Exposing REDUCT capability
    • via cpuid
  • Optimal TFU selection
  • Distribution of work across TFUs
    • static_asymmetric policy (sketched below)
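A minimal C sketch of a static, asymmetric work split across TFUs in the spirit of static_asymmetric: each TFU receives a fixed, possibly unequal share of the output space, decided once at kernel launch. The weights, counts, and function names are illustrative assumptions.

    /* Hypothetical static_asymmetric partitioning across TFUs; the
       weights and names are illustrative assumptions. */
    #define NUM_TFUS 4

    /* Relative throughput weights per TFU (e.g., L2-side TFUs vs an
       L3-side TFU); the values here are made up for illustration. */
    static const int tfu_weight[NUM_TFUS] = {3, 3, 1, 1};

    /* Split total_rows of output into contiguous, weight-proportional
       chunks: start[t]/count[t] describe TFU t's share. */
    void partition_work(int total_rows, int start[NUM_TFUS], int count[NUM_TFUS]) {
        int wsum = 0, assigned = 0;
        for (int t = 0; t < NUM_TFUS; t++) wsum += tfu_weight[t];
        for (int t = 0; t < NUM_TFUS; t++) {
            count[t] = total_rows * tfu_weight[t] / wsum;
            if (t == NUM_TFUS - 1)
                count[t] = total_rows - assigned;   /* last TFU takes the remainder */
            start[t] = assigned;
            assigned += count[t];
        }
    }

Being static, the split adds no runtime scheduling overhead; being asymmetric, it can match shares to TFUs with different effective throughput or bandwidth.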


18 of 24

Simulation Parameters


19 of 24

REDUCT Configurations


20 of 24

Convolution


  • 2x to 3.94x performance over the baseline
  • Data-movement overheads reduced from 26% to 9%

21 of 24

Inner Product


  • 2.2x better performance
  • Data-movement overheads reduced by 2.1x

22 of 24

Performance and Power


23 of 24

DNN Inference Summary

  • CPUs: inefficient resource utilization for DNN inference
  • Thus:


Simple ISA enhancements + Near-Cache Compute -> REDUCT -> Better Performance/Watt

24 of 24

Thank you for listening!

Any Questions?
