DSP Packing
AMD DSPs
NanoXplore DSPs
Main idea
A×B, D×B → (A + D)×B, with A and D in disjoint bit fields → accumulate or separate the results
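A minimal sketch of the idea above, not the actual HDL/HLS implementation. It assumes unsigned 8-bit operands and a hypothetical 16-bit field offset (`SHIFT`), so that the two products cannot overlap. The function name `packed_multiply` is illustrative only.

```python
SHIFT = 16  # assumed field offset: must be >= the bit width of one full product


def packed_multiply(a: int, d: int, b: int, shift: int = SHIFT) -> tuple[int, int]:
    """Compute a*b and d*b with a single multiplication (unsigned operands)."""
    packed = (a << shift) + d             # A in the upper field, D in the lower field
    product = packed * b                  # one wide multiplication, as on a DSP slice
    low = product & ((1 << shift) - 1)    # lower field holds d*b
    high = product >> shift               # upper field holds a*b
    return high, low


if __name__ == "__main__":
    print(packed_multiply(200, 100, 37))  # (7400, 3700)
```

With unsigned 8-bit inputs each product is at most 255 × 255 = 65025, which fits in 16 bits, so the two fields can be separated with a mask and a shift.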
DSP Packing
Source of LUT overutilization - Pointwise and Depthwise Layers
8-bit quantization - Pointwise Layer
4-bit quantization - Pointwise Layer
DSP Packing
Source of LUT overutilization - Pointwise and Depthwise Layers
8-bit quantization - Depthwise Layer
4-bit quantization - Depthwise Layer
DSP Packing
x4 4-bit packing
x2 8-bit packing
DSP Packing
13/11/24
x2 8-bit packing
Sign extension corrupts ("infects") the bits of the second result if the first result is negative
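The effect can be reproduced in a few lines. This is a sketch under assumptions: signed 8-bit operands, a hypothetical 16-bit field offset, and a correction that adds 1 to the upper field whenever the lower product is negative. That correction is one common fix, not necessarily the one used in the current implementation.

```python
def to_signed(value: int, bits: int) -> int:
    """Interpret an unsigned field of `bits` bits as a two's complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def packed_multiply_signed(a: int, d: int, b: int, shift: int = 16) -> tuple[int, int, int]:
    """Return (raw upper field, corrected upper field, lower field) of ((a << shift) + d) * b."""
    product = ((a << shift) + d) * b
    low = to_signed(product & ((1 << shift) - 1), shift)  # d*b
    high_raw = product >> shift                           # a*b, minus 1 if d*b is negative
    high = high_raw + (1 if low < 0 else 0)               # undo the borrow from the lower field
    return high_raw, high, low


if __name__ == "__main__":
    # d*b = -21 is negative, so the raw upper field reads 34 instead of 35.
    print(packed_multiply_signed(5, -3, 7))  # (34, 35, -21)
```

When the lower product is non-negative, the raw and corrected upper fields are identical, so the correction costs only one conditional increment.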
DSP Packing
Vivado synthesis - Pointwise layer
Current implementation
DSP Packing
Fewer, but more expensive, multiplications
DSP Packing
no infected bits
x2 4-bit packing
DSP Packing
x3 4-bit packing
DSP Packing
Next steps
FIFO Depth Optimization Ultra
Slower clocks can be used for ReLU, Clone, and other layers that cannot be slowed down via the Reuse Factor because they perform few computations per input
FIFO Depth Optimization Ultra
For example, for the Keras 100k model, changing the speed of the ZeroPadding layer reduced BRAM utilization from 140% (2564) to 80% (1504)
PixESL
Pixel Coordinate
ToA
fToA
ToT
Raw data (txt file):
Clustered data (root file):
Problem -> Arbitrary cluster size