
  • Digital Signal Processors (DSPs)
    • Used to perform multiplications, because building multipliers from FPGA logic elements is too expensive.
    • Cyclone V DSPs can perform 2 multiplications per clock cycle (pipelined), so latency becomes #multiplications / (2 × #DSPs) clock cycles.
  • Estimating latency assuming DSPs are the limiting factor:
    • The number of multiplications per convolutional layer is:

    • DSPs are assigned in chunks proportional in size to the factors in equation 1. For example, a 5x5 convolution with 1 filter (nf=1) could be assigned 25 DSPs to parallelize the multiplications within the filter, leaving wi*hi*di non-parallel operations (the size of the input).
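
Equation 1 is not reproduced in this text. The sketch below assumes it has the form wi * hi * di * wf * hf * nf, which is inferred from the symbols used in the bullets above and may differ from the authors' exact expression (for example in how borders are handled). The function names and the 28x28 input size are illustrative choices, not taken from the poster.

```python
import math

def conv_multiplications(wi, hi, di, wf, hf, nf):
    """Multiplications in one convolutional layer.

    ASSUMED form of equation 1: input width * height * depth,
    times filter width * height, times number of filters.
    """
    return wi * hi * di * wf * hf * nf

def dsp_limited_cycles(n_mults, n_dsps, mults_per_dsp_per_cycle=2):
    """Clock cycles needed if DSP multiplications are the only bottleneck."""
    return math.ceil(n_mults / (mults_per_dsp_per_cycle * n_dsps))

# Hypothetical example: 28x28x1 input, one 5x5 filter, 25 DSPs.
mults = conv_multiplications(28, 28, 1, 5, 5, 1)
cycles_two_per_dsp = dsp_limited_cycles(mults, 25)      # 2 multiplications per DSP per cycle
cycles_one_per_dsp = dsp_limited_cycles(mults, 25, 1)   # 1 multiplication per DSP per cycle
print(mults, cycles_two_per_dsp, cycles_one_per_dsp)
```

With one multiplication per DSP per cycle, the 25-DSP case reduces to wi*hi*di cycles, matching the count quoted in the bullet above. With the Cyclone V rate of 2 per cycle, it halves.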

Convolutional Neural Networks for Track Reconstruction on FPGAs

Thomas Boser¹; Paolo Calafiura²; Ian Johnson²

¹University of California, Santa Cruz; ²Lawrence Berkeley National Laboratory

Contact:

TBoser@ucsc.edu

Motivation

Methods and Materials

Workflow and Implementation

Discussion

Conclusions and future work

FPGA Resources

                                    DSPs     Memory (Block RAMs)   Clock cycles
OpenCL LeNet                        35       273                   1176k
VHDL 5x5 convolution*               25       4                     npixels + 10
VHDL LeNet (resource conscious)*    112      300                   50,000
Available (Cyclone V)               112      557                   --
Available (Stratix 10)              5,760    11,721                --

  • YES. At 400 MHz, 5 μs is 2000 clock cycles.
    • The heatmap shows that with our implementation and a large FPGA (some have 5000+ DSPs) we can run inference on a reasonably sized network.
  • We propose a pipelined CNN forward pass which scales with resources available.
    • Assuming multiplications are the limiting factor, latency is inversely proportional to the number of DSPs.
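
The cycle budget quoted above is the clock frequency multiplied by the latency target. A minimal sketch of that arithmetic follows. The helper names are illustrative, and applying a 400 MHz clock to the 50,000-cycle resource-conscious VHDL LeNet figure is an assumption made only to show the conversion.

```python
def cycle_budget(clock_hz, latency_s):
    """Clock cycles available within a latency target."""
    return round(clock_hz * latency_s)

def cycles_to_microseconds(cycles, clock_hz):
    """Wall-clock time taken by a given number of cycles."""
    return cycles / clock_hz * 1e6

budget = cycle_budget(400e6, 5e-6)  # 400 MHz clock, 5 microsecond target
# ASSUMPTION: a 400 MHz clock, applied to the 50,000-cycle VHDL LeNet figure
lenet_us = cycles_to_microseconds(50_000, 400e6)
print(budget, lenet_us)
```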

How are FPGAs programmed?

  • Hardware Description Languages

HDLs are programming languages which describe electronic circuits.

  • High Level Synthesis

Generation of HDL from a higher level language, often C or C++.

  • OpenCL

Computing framework similar to CUDA which is supported by some FPGAs.

  • LHC Particle Tracking as a “Connecting the dots” problem:
    • Given a dataset with O(10^5) 3D space-points belonging to O(10^3) particle tracks, predict which space-points belong to the same track.
  • Performance requirements:
    • 100 kHz rate, with ~5 μs latency per prediction.
  • FPGAs are a good match:
    • guaranteed latency, high throughput
    • already used by LHC experiments for similar applications
  • FPGAs are reprogrammable integrated circuits.
  • Logic blocks can be used to configure low level operations such as bit masking, shifting, and addition.
  • Support highly parallel and pipelined algorithm implementations with guaranteed latency.

* Assuming a single-pixel input stream, which can be widened to parallelize.

* Cyclone V available resources used as constraints.

  • Approach:
    • Design and train model using a deep learning library.
    • Perform inference using the FPGA.
  • Implemented LeNet5 on FPGA natively (VHDL) and via OpenCL.

  • Using LeNet as an example we see that increasing DSPs by a factor of 3 can dramatically reduce latency of a network.
  • Data flow into the FPGA:
    • Use of a general purpose coprocessor for data transfer will increase latency too much.
    • Our VHDL implementation will allow data to stream directly into the FPGA input ports, reducing I/O-induced latency.
  • FPGA DNN implementation:
    • Has a predictable real-time latency
    • Implemented with a data-streaming approach
    • Data can be streamed through the FPGA DNN
    • Convolutions are a good example of the FPGA potential for low latency DNNs

  • Optimized DNN implementation to fit into FPGA resources:
    • Large FPGAs contain thousands of DSPs clocked at ~400 MHz, allowing >4,000,000 multiplications/s. For reference, LeNet5 performs ~150,000 multiplications per image in its forward pass.
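
One way to arrive at a figure of 4,000,000 is to count the multiplications that fit into a single 5 μs latency window: 1,000 DSPs × 2,000 cycles × 2 multiplications per DSP per cycle. This per-window reading, the choice of exactly 1,000 DSPs, and the reuse of the Cyclone V rate of 2 multiplications per cycle are all assumptions. The function name is illustrative.

```python
def multiplication_budget(n_dsps, clock_hz, latency_s, mults_per_dsp_per_cycle=2):
    """Multiplications that fit in one latency window if every DSP stays busy."""
    cycles = round(clock_hz * latency_s)
    return n_dsps * cycles * mults_per_dsp_per_cycle

# ASSUMED values: 1,000 DSPs, 400 MHz clock, 5 microsecond window
budget = multiplication_budget(1000, 400e6, 5e-6)
lenet_mults = 150_000  # approximate LeNet5 forward-pass count quoted above
headroom = budget / lenet_mults
print(budget, headroom)
```

The headroom is an idealized upper bound: it ignores data dependencies between layers and any cycles in which DSPs sit idle.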

  • Implementing new layers:
    • Most successful approach to the tracking problem has been through a combination of LSTMs and CNNs.

Credit: Andy Salzburger

Firmware convolution:

  • Input matrix streamed linearly into the convolution module
    • Feeds into a shift register, which sends multiple row values into FIFOs that store values from the same row, ‘iterating’ through the input.
    • The FIFOs push 5 values each (25 for a 5x5 filter) into a DSP, which multiplies the input values by the filter values and sums them, producing the output for one pixel.
  • Each instance of a filter convolving on an input matrix allocates 1 DSP (or 25 DSPs in the more parallel approach) for a 5x5 filter.
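
The streaming scheme above can be modeled in software. The following is a minimal sketch, not the authors' VHDL: a single fixed-length buffer stands in for the shift register and row FIFOs, and the multiply-and-sum loop stands in for the DSP. The function name and the small 4x4 test input are illustrative.

```python
from collections import deque

def stream_conv2d(pixels, width, kernel):
    """Valid-mode k x k filter applied to a row-major pixel stream.

    The buffer keeps the last (k-1)*width + k pixels, i.e. k-1 full rows
    plus k pixels of the current row, which is exactly one k x k window.
    """
    k = len(kernel)
    buf = deque(maxlen=(k - 1) * width + k)
    outputs = []
    for idx, pixel in enumerate(pixels):
        buf.append(pixel)  # one pixel enters per "clock cycle"
        col = idx % width
        # Emit an output only once the buffer is full and the window
        # does not wrap around the right edge of the image.
        if len(buf) == buf.maxlen and col >= k - 1:
            acc = 0
            for r in range(k):
                for c in range(k):
                    acc += buf[r * width + c] * kernel[r][c]
            outputs.append(acc)
    return outputs

# 4x4 image streamed row by row, with a 3x3 all-ones filter
image = list(range(16))
ones = [[1] * 3 for _ in range(3)]
result = stream_conv2d(image, 4, ones)
print(result)
```

Once the buffer has filled, each incoming pixel can yield one output, which is consistent with the "npixels + 10" cycle count quoted for the VHDL 5x5 convolution.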

Is our latency goal attainable?

A heatmap showing how the number of DSPs allocated to convolution layers impacts the number of clock cycles for convolutions of different sizes.

Number of cycles required to complete each layer of LeNet using the resources available in a Cyclone V vs. an FPGA with 324 DSPs, demonstrating how much DSPs can constrain latency.


  • The many convolutions and matrix multiplications can be resource-intensive for FPGAs:
    • Smaller FPGAs can quickly be resource starved.
    • Our implementation had to be shrunk in order to fit completely on an Altera Cyclone V.

Diagram showing example resource usage on an FPGA.
