DSP Packing

AMD DSPs

  • 18x27 Mult
  • Preadder
  • 48-bit MAC that can be cascaded to create larger MACs

NanoXplore DSPs

  • 19x24 Mult
  • Preadder
  • 56-bit MAC

Main idea

A x B, D x B -> (A + D) x B -> accumulate or separate the results
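The idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the slides: the operands are taken to be signed 8-bit values, A is assumed to be shifted left by 18 bits before the pre-adder, and the names `SHIFT`, `pack_multiply`, and `separate` are hypothetical.

```python
SHIFT = 18  # assumed offset between the two packed operands (hypothetical choice)


def pack_multiply(a: int, d: int, b: int) -> int:
    """Compute a*b and d*b with a single wide multiplication.

    The pre-adder forms (a << SHIFT) + d, and the multiplier then yields
    ((a * b) << SHIFT) + d * b in one product.
    """
    packed = (a << SHIFT) + d  # pre-adder: shifted A plus D
    return packed * b          # one multiplication instead of two


def separate(p: int) -> tuple:
    """Split a packed product (or a sum of packed products) into (a*b, d*b)."""
    low = p & ((1 << SHIFT) - 1)      # low field as an unsigned value
    if low >= 1 << (SHIFT - 1):       # reinterpret the low field as signed
        low -= 1 << SHIFT
    high = (p - low) >> SHIFT         # remove the low field, then shift down
    return high, low


# Separate a single packed product.
print(separate(pack_multiply(3, -5, 7)))  # (21, -35)

# Accumulate several packed products first, then separate once at the end.
terms = [(3, -5, 7), (-8, 2, 9), (100, -100, -50), (-1, -1, 1)]
acc = sum(pack_multiply(a, d, b) for a, d, b in terms)
print(separate(acc) == (sum(a * b for a, d, b in terms),
                        sum(d * b for a, d, b in terms)))
```

Accumulating before separating only works while the sum of the low products still fits in the assumed 18-bit signed field, so the number of terms that can be accumulated is limited by the field width.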

DSP Packing

Source of LUT overutilization - Pointwise and Depthwise Layers

8-bit quantization - Pointwise Layer

4-bit quantization - Pointwise Layer

DSP Packing

Source of LUT overutilization - Pointwise and Depthwise Layers

8-bit quantization - Depthwise Layer

4-bit quantization - Depthwise Layer

DSP Packing

x4 4-bit packing

x2 8-bit packing

DSP Packing

13/11/24

x2 8-bit packing

Infected bits from sign extension when the first result is negative
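The effect can be reproduced with a small Python model. It is a sketch under assumptions that are not stated on the slide: the "first result" is taken to be the product stored in the low field, the fields are assumed to be 18 bits wide, and the helper names (`to_signed`, `naive_split`, `corrected_split`) are hypothetical. A plain bit slice of the upper field comes out one too small whenever the low product is negative, and adding the sign bit of the low field back is one common way to correct it.

```python
SHIFT = 18  # assumed width of each packed field (hypothetical choice)


def to_signed(value: int, bits: int) -> int:
    """Interpret the lowest `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def naive_split(p: int) -> tuple:
    """Slice the packed product into two fields without any correction."""
    low = to_signed(p, SHIFT)
    high = to_signed(p >> SHIFT, SHIFT)  # infected when low is negative
    return high, low


def corrected_split(p: int) -> tuple:
    """Slice the packed product and undo the borrow caused by a negative low field."""
    high, low = naive_split(p)
    sign_bit = (p >> (SHIFT - 1)) & 1    # 1 when the low field is negative
    return to_signed(high + sign_bit, SHIFT), low


# Packed product of a*b (upper field) and d*b (lower field), with a=3, d=-5, b=7.
p = ((3 << SHIFT) + (-5)) * 7
print(naive_split(p))      # (20, -35): upper result is off by one
print(corrected_split(p))  # (21, -35)
```

When the low product is non-negative, the sign bit is 0 and both functions return the same pair.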

DSP Packing

Vivado synthesis - Pointwise layer

Current implementation

DSP Packing

Fewer but more expensive multiplications

DSP Packing

No infected bits

x2 4-bit packing

DSP Packing

x3 4-bit packing

DSP Packing

Next steps

  • Change Depthwise convolution implementation to allow weight and input sharing for the x4 4-bit packing version
  • Integrate the shifting part before addition into the weight matrix
  • Make use of the post adder

FIFO Depth Optimization Ultra

Slower clocks can be used for ReLU, Clone, and other layers that cannot be slowed down with the Reuse Factor because they perform too few computations per input

FIFO Depth Optimization Ultra

For example, for the Keras 100k model, changing the speed of Zeropadding reduced BRAM usage from 140% (2564) to 80% (1504)

PixESL

Pixel Coordinate

ToA

fToA

ToT

Raw data (txt file):

Clustered data (root file):

  • number of pixels in the cluster
  • ToA, ToT for each pixel in the cluster

Problem -> Arbitrary cluster size
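To make the variable-size problem concrete, the clustered records described above could be modelled as follows. This is a hypothetical in-memory sketch, not the actual ROOT file layout: the class name `Cluster`, its field names, and the sample values are invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    """One cluster: a ToA and a ToT value for each pixel (hypothetical layout)."""
    toa: List[float] = field(default_factory=list)
    tot: List[float] = field(default_factory=list)

    @property
    def n_pixels(self) -> int:
        # Number of pixels in the cluster, derived from the per-pixel lists.
        return len(self.toa)


# Illustrative values only, not measured data.
clusters = [
    Cluster(toa=[10.0, 10.5], tot=[3.0, 1.0]),
    Cluster(toa=[42.0], tot=[7.0]),
    Cluster(toa=[5.0, 5.1, 5.3, 5.2], tot=[2.0, 4.0, 1.0, 1.5]),
]

sizes = [c.n_pixels for c in clusters]
print(sizes)  # [2, 1, 4]
```

Because each record has a different length, the clusters cannot be stacked directly into a fixed-shape array without some extra step such as padding or truncation.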