
ME0 Segment Finding Firmware

Andrew Peck, Boston University

2021


ME0

  • S-bit = Sector-bit, basic trigger element of CMS GEM readout
    • 1 s-bit = OR of 2 adjacent strips
  • ME0 Chamber:
    • 6 layers
    • each layer has 8 eta partitions
    • each eta partition has 384 strips (192 s-bits)
    • 1536 s-bits / layer / bx * 6 layers = 9216 s-bits / bx
  • Segment Finding:
    • Need to identify multi-layer segments
    • Need to do some refined angular/position calculation using multi-layer hit data
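The s-bit counts above can be checked with a few lines of Python. This is only a sketch: the constant and function names are illustrative, and the pairing of strips as (0,1), (2,3), ... is an assumption, since the slide only says that one s-bit is the OR of 2 adjacent strips.

```python
# Numbers from the slide; names are illustrative, not from the firmware.
STRIPS_PER_PARTITION = 384
PARTITIONS_PER_LAYER = 8
LAYERS = 6


def strips_to_sbits(strips):
    """OR each pair of adjacent strips into one s-bit.

    Assumes the pairing (0,1), (2,3), ... which the slide does not specify.
    """
    if len(strips) % 2:
        raise ValueError("expected an even number of strips")
    return [strips[i] | strips[i + 1] for i in range(0, len(strips), 2)]


sbits_per_partition = STRIPS_PER_PARTITION // 2
sbits_per_layer = sbits_per_partition * PARTITIONS_PER_LAYER
sbits_per_chamber = sbits_per_layer * LAYERS
print(sbits_per_partition, sbits_per_layer, sbits_per_chamber)  # 192 1536 9216
```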


ME0 readout paths


[Figure: ME0 readout paths. Labels recovered from the diagram:]

  • 24x VFAT3, read out over 320 Mbps e-links (labelled x1 and x8)
  • LpGBTs: 8 per ME0 layer
    • Per layer: 8x 10.24 Gbps, x 6 layers, x 2 chambers
  • Trigger links to the EMTF Sector Processor: 2 x 16 or 1 x 25 Gbps per 40 degrees
  • DAQ/control links to DTH and Central DAQ, worst case: 6 x 16 Gbps or 4 x 25 Gbps per 40 degrees

  • Each ATCA card handles 40 degrees (2 ME0 super-chambers)
    • Each super-chamber is composed of 6 layers of GEM chambers
    • Front-end links (per ATCA card)
      • 96x 10.24 Gbps LpGBT for Trigger, DAQ, and Control
    • Back-end links (per CTP7)
      • 1 x 25 Gbps or 2 x 16 Gbps to EMTF
      • 4 x 25 Gbps or 6 x 16 Gbps link to DTH
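As a quick cross-check of the front-end link count, the sketch below multiplies out the numbers quoted on this slide. The figure of 8 lpGBTs per layer is taken from the readout diagram, the names are illustrative, and the bandwidth is the raw line rate rather than usable payload.

```python
# Numbers from the slide and its diagram; names are illustrative only.
SUPER_CHAMBERS_PER_CARD = 2
LAYERS_PER_SUPER_CHAMBER = 6
LPGBTS_PER_LAYER = 8
LPGBT_GBPS = 10.24  # raw line rate per link, not payload


def front_end_links():
    """Number of lpGBT links arriving at one ATCA card (40 degrees)."""
    return SUPER_CHAMBERS_PER_CARD * LAYERS_PER_SUPER_CHAMBER * LPGBTS_PER_LAYER


links = front_end_links()
print(links)                         # 96
print(round(links * LPGBT_GBPS, 2))  # 983.04 (Gbps, raw)
```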

[Figure labels: ATCA Backend, 40 degrees each, 18 cards total; link types: Trigger, DAQ, Trigger & DAQ]

A. Peck / N. McColl (UCLA), E. Juska (TAMU), G. De Lentdecker (ULB)


Segment Finding

  • Segment finding has three basic stages:
    • 1) Need to identify segments locally
      • Create "segment candidates" with some sortable key field
      • Done in parallel, many times
    • 2) Need to sort segment candidates into a smaller number of output candidates
      • Development of sorting machinery can be somewhat decoupled from the exact segment identification mechanism
    • 3) Perform some post-processing on the segment candidates to produce a refined measurement (some kind of fit)


Segment Finding Data Flow:

Input S-bits -> Segment Identification -> Sorting -> Fitting -> Output Segments


Naive Implementation

I am working on a very naive, "TMB-like" segment finding firmware

  • Goals:
    1. Have something that will work for cosmic ray tests w/ demonstrator
      • Have a useful algorithm for self-triggering an ME0 chamber
      • No need for fitting right now... just want a way to self-L1A
      • Initially use wide patterns => high rate + high efficiency
        • Good for cosmics, bad for LHC
    2. Initial (crude) estimates of resource usage
    3. Develop some of the common machinery needed for any implementation
      • (e.g. good, parameterized sorting + priority encoding)
    4. Get hands dirty, encounter/explore questions
    5. Initial (crude) latency estimates


Data Flow


[Figure: data flow. Stages recovered from the diagram:]

  • Pattern units: x192 per partition (1/strip), 1536 pattern units / chamber
  • Pre-sorting: choose the best 1 segment for every N (e.g. 8) neighboring strips
  • Ghost cancellation:
    • prevent retriggering on the same segment in subsequent bx
    • prevent triggering multiple times on neighboring strips
  • Partition sorting: choose the best N (e.g. 4) segments from each partition
  • Chamber sorting: choose the best N (e.g. 4) segments from the entire chamber
  • Fitting: non-existent right now

Counts along the chain, as labelled in the diagram (all per chamber): 1536 pattern units -> 1536 candidates -> 192 candidates -> 128 candidates -> 16 candidates -> 16 segments
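The pre-sorting and partition-sorting steps in this data flow can be modelled in software roughly as follows. This is a behavioural sketch only: the candidate fields (`strip`, `key`) are hypothetical, and `heapq.nlargest` stands in for whatever sorting network the firmware actually uses.

```python
from heapq import nlargest


def presort(candidates, group=8):
    """Keep the best candidate from each block of `group` neighboring strips."""
    return [
        max(candidates[i:i + group], key=lambda c: c["key"])
        for i in range(0, len(candidates), group)
    ]


def best_n(candidates, n=4):
    """Keep the n candidates with the largest sort key."""
    return nlargest(n, candidates, key=lambda c: c["key"])


# One partition: 192 candidates with an arbitrary, made-up sort key.
partition = [{"strip": s, "key": (s * 7) % 13} for s in range(192)]
presorted = presort(partition)       # 192 -> 24 candidates
selected = best_n(presorted, n=4)    # 24 -> 4 candidates
print(len(presorted), len(selected))  # 24 4
```

With 8 partitions, pre-sorting in groups of 8 reduces 1536 candidates to 192 per chamber, which matches the count shown in the diagram.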


Data Flow


Clean structural implementation...

[Figure labels: Partitions x8, First Stage Sorting, Final Sorting]


Pattern Finding

  • "TMB-like" == road-based pattern finding
  • For each strip, assign a collection of coarse roads to identify segments
    • e.g. see the example patterns in the backup (figure not recovered)

  • For each road, count the number of layers hit
  • For each strip, choose the road with the best quality
    • Choose more layers over fewer
    • Choose straight patterns over bent ones
  • Firmware tests right now use a fairly arbitrary pattern set
    • 15 patterns, up to 37 strips wide
    • Very easy to change, framework is quite flexible (see backup)
    • Exact patterns are made up to exercise the firmware... somebody else should define a proper set
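A minimal software model of the road-based selection described above might look like this. It is a sketch under stated assumptions: roads are (lo, hi) strip offsets per layer relative to the central strip, the `STRAIGHT` pattern is hypothetical, the `BENT` pattern reuses the `pat_l7` offsets from the backup, and "straight over bent" is modelled simply by listing patterns from straightest to most bent.

```python
# Roads are (lo, hi) strip offsets per layer, relative to the central strip.
STRAIGHT = {"id": 1, "roads": [(-1, 1)] * 6}              # hypothetical pattern
BENT = {"id": 2, "roads": [(-18, -10), (-14, -6), (-9, 2),
                           (2, 9), (6, 14), (10, 18)]}    # offsets of pat_l7
PATTERNS = [STRAIGHT, BENT]  # assumed order: straightest first


def layer_count(hits, center, pattern):
    """Count layers with at least one hit inside the pattern's road."""
    return sum(
        any(center + lo <= strip <= center + hi for strip in layer_hits)
        for layer_hits, (lo, hi) in zip(hits, pattern["roads"])
    )


def best_pattern(hits, center, patterns=PATTERNS):
    """Return (layers_hit, pattern_id): more layers first, then straighter."""
    _, best = max(
        enumerate(patterns),
        key=lambda ip: (layer_count(hits, center, ip[1]), -ip[0]),
    )
    return layer_count(hits, center, best), best["id"]


# hits[layer] is the set of strips fired in that layer.
straight_track = [{50}, {50}, {50}, {50}, {50}, {50}]
bent_track = [{36}, {40}, {47}, {55}, {60}, {64}]
print(best_pattern(straight_track, 50))  # (6, 1)
print(best_pattern(bent_track, 50))      # (6, 2)
```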


Implementation Status

  • Rough draft of the firmware is written and compiles:
    1. https://github.com/andrewpeck/me0sf
    2. Many outstanding tasks but this should be a good starting point

  • Implementation goals:
    • A lot of effort is being put into making things parameterized and easily scalable
    • Modular-- composed of simple, standalone modules with simple interfaces
    • Simple, pure VHDL implementation (no IP cores)
      • Still using VHDL-2008 features not supported by the Vivado simulator :(
      • Simulating with GHDL... open source, yay :)
  • Vivado crashes my computer unless I close all my other programs :(
    • I think you want >= 64 GB of RAM to compile


Resource utilization

  • Take these with a healthy safety factor...
    • Cross-partition segment finding is very simple
    • # of patterns is completely arbitrary
    • Algorithm design likely to change
    • No fitting
  • Resource usage depends on # of patterns and on multiplexing factors
    • Can use a fast clock to time-multiplex shared logic
      • 320 MHz --> reuse factor x8
      • 160 MHz --> reuse factor x4
      • ....
      • 40 MHz --> reuse factor x1
    • (tradeoff between resource usage and timing closure)
  • This design is optimized for latency and resource usage
    • Reduced to 1/3 since last meeting...
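The reuse factors quoted above are just the ratio of the processing clock to the 40 MHz bunch-crossing clock. The sketch below shows the arithmetic; `units_needed` is a hypothetical helper that assumes every logic unit can be shared perfectly, which the real design may not achieve.

```python
BX_CLOCK_MHZ = 40  # nominal LHC bunch-crossing clock


def reuse_factor(clock_mhz):
    """How many times shared logic can be reused within one bunch crossing."""
    if clock_mhz % BX_CLOCK_MHZ:
        raise ValueError("clock must be a multiple of 40 MHz")
    return clock_mhz // BX_CLOCK_MHZ


def units_needed(total_units, clock_mhz):
    """Physical units required if all of them are time-multiplexed (ceiling)."""
    return -(-total_units // reuse_factor(clock_mhz))


for f in (320, 160, 40):
    print(f, reuse_factor(f), units_needed(1536, f))
# 320 8 192
# 160 4 384
# 40 1 1536
```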


Resource utilization

  • Current latency = 4.25 bx at 320 MHz (S-bits to final segments, no fitting)
    • Likely some increase with additional pipelining...
    • Estimate 5-6 bx total?
  • Resources: Uses ~12% of KU15P, 5% of VU13P
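For reference, the latency in bunch crossings converts to nanoseconds and clock cycles as sketched below, assuming the nominal 25 ns bunch spacing (40 MHz). The function names are illustrative.

```python
BX_NS = 25.0  # nominal bunch spacing, assuming a 40 MHz bunch-crossing clock


def bx_to_ns(bx):
    """Convert a latency in bunch crossings to nanoseconds."""
    return bx * BX_NS


def bx_to_cycles(bx, clock_mhz):
    """Convert a latency in bunch crossings to cycles of the given clock."""
    return bx * clock_mhz / 40.0


print(bx_to_ns(4.25))           # 106.25
print(bx_to_cycles(4.25, 320))  # 34.0
```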

320 MHz: [figure not recovered]


Resource utilization


Implemented in ME0 APEX with GEM_AMC firmware

VU13P (standalone)


Outstanding Firmware Work

  • Outstanding firmware work:
    1. Replacing bitonic sorter to reduce external dependencies
      • Evaldas found a nice one, but it still needs to be integrated
    2. Need to simulate & test
    3. GEM_AMC firmware needs S-bit handling implemented
      • Sbits received from lpGBT need correct link mapping, timing (position) correction, need S-bit scan software, etc.
      • Also include a cluster finder for debug, s-bit scans, etc
    4. Need to integrate segment finder with GEM_AMC firmware
      • Simple interface, should not be terribly difficult for HDL integration
    5. Floorplanning in VU__P FPGA (SLR crossing)


Outstanding Software Work

  • Outstanding software work: (need physicists)
  • Pattern definitions need work
  • 1st -- Careful pencil & paper proposal
  • 2nd -- A proper study of road-based pattern finding in ME0
    • Output requirements need to be defined:
      • # of segments / bx
      • # of segments / partition
      • data formats (achievable resolution, # of bits)
    • Refined measurement on segments (fitting)
      • Probably not important for initial tests at 904
      • But somebody should start thinking about this, test in software
    • Cross-partition segment finding is completely dumb right now
    • Alignment:
      • are corrections required?
    • Explore other ideas
      • Look at Hough transform etc.


Backup


Patterns

--pat=1 span=37
ly0 ----------------------------xxxxxxxxx
ly1 ------------------------xxxxxxxxx----
ly2 ----------------xxxxxxxxxxxx---------
ly3 ---------xxxxxxxx--------------------
ly4 ----xxxxxxxxx------------------------
ly5 xxxxxxxxx----------------------------

Casting very wide pattern nets compared to CSC pattern finding

e.g. span of 37, compared to span of 11 in CSC

  • VHDL type system makes this quite easy ☺
  • automated pattern mirror for left/right symmetry ☺
  • automated printout of VHDL patterns ☺
    • type "make patterns"

specify as e.g.:

constant pat_l7 : pat_unit_t := (
  id  => 2,
  ly0 => (lo => -18, hi => -10),
  ly1 => (lo => -14, hi => -6),
  ly2 => (lo => -9, hi => 2),
  ly3 => (lo => 2, hi => 9),
  ly4 => (lo => 6, hi => 14),
  ly5 => (lo => 10, hi => 18)
);
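The mirroring and printout can be mimicked in software. The sketch below is not the repository's `make patterns` code: it assumes the `lo`/`hi` values are strip offsets relative to the central strip, and the dictionary layout, function names and `new_id` argument are all hypothetical. Under that assumption, mirroring the `pat_l7` offsets produces a printout with the same shape as the `--pat=1 span=37` example.

```python
# Offsets copied from the pat_l7 constant; the dict layout is illustrative.
PAT_L7 = {"id": 2, "roads": [(-18, -10), (-14, -6), (-9, 2),
                             (2, 9), (6, 14), (10, 18)]}


def mirror(pattern, new_id):
    """Left/right mirror: negate and swap the lo/hi offsets of every layer."""
    return {"id": new_id, "roads": [(-hi, -lo) for lo, hi in pattern["roads"]]}


def render(pattern):
    """ASCII printout in the style of the slides ('x' = strip inside road)."""
    half = max(max(abs(lo), abs(hi)) for lo, hi in pattern["roads"])
    span = 2 * half + 1
    lines = [f"--pat={pattern['id']} span={span}"]
    for ly, (lo, hi) in enumerate(pattern["roads"]):
        row = "".join(
            "x" if lo <= col - half <= hi else "-" for col in range(span)
        )
        lines.append(f"ly{ly} {row}")
    return "\n".join(lines)


print(render(mirror(PAT_L7, new_id=1)))
```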


--pat=5 span=23
ly0 -------------------xxxx
ly1 ----------------xxxxx--
ly2 -----------xxxx--------
ly3 --------xxxx-----------
ly4 --xxxxx----------------
ly5 xxxx-------------------

--pat=7 span=17
ly0 -------------xxxx
ly1 ------------xxxx-
ly2 --------xxxx-----
ly3 ------xxxxx------
ly4 -xxxx------------
ly5 xxxx-------------

Patterns are arbitrary right now, used to occupy the FPGA

Need more careful consideration and study!!


Cross Partition Segment Finding

  • Segments will span multiple partitions
    • Need to have a clear understanding of how to handle segment finding across partitions in a simple way
    • What information do we need from eta?
    • Are only certain combinations acceptable?
    • Complex (3D) segment finding and fitting will dramatically increase firmware complexity and resource usage

My dumb approach right now is to just OR together neighboring partitions (duplicates all segments)...

Cross-partition ghost cancellation trims duplicates

We need to think much more about this...

What we have now is probably good enough for a test stand

.... but really bad for LHC
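To illustrate why OR-ing neighboring partitions duplicates segments, here is a small sketch. The exact combination used in the firmware is not spelled out on the slide, so the choice of OR-ing each partition with the next one (and representing each partition as an integer bit mask) is an assumption made purely for illustration.

```python
def or_neighbors(partitions):
    """OR each partition's hit mask with that of the next partition.

    `partitions` is a list of integer bit masks, one per eta partition.
    The last partition has no next neighbor and is passed through unchanged.
    """
    merged = []
    for i, hits in enumerate(partitions):
        nxt = partitions[i + 1] if i + 1 < len(partitions) else 0
        merged.append(hits | nxt)
    return merged


# A hit on s-bit 0 in partition 0 and on s-bit 2 in partition 1.
partitions = [0b0001, 0b0100, 0, 0, 0, 0, 0, 0]
merged = or_neighbors(partitions)
print([bin(m) for m in merged[:2]])  # ['0b101', '0b100']
```

The hit in partition 1 now shows up in both merged[0] and merged[1], which is the duplication that cross-partition ghost cancellation then has to trim.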


ME0 trigger path: Opto-hybrid to Backend

  • Trigger data is an OR of two strips
  • With new "no-FPGA" design, there is no bandwidth bottleneck in the Optohybrid
    • ALL trigger bits are directly forwarded to the backend
    • No longer care about # of clusters per bunch crossing

ME0 trigger path: ATCA Backend to EMTF

  • A quick look at stub format requirements:
    • 4 bits: 16 eta positions (stubs can't cross more than 2 partitions)
    • 10 bits: 768 phi positions ("half strip" resolution)
    • 9 bits: 512 different bend angles
    • 4 bits: 16 different quality levels
    • 1 bit: 2 different chamber numbers
    • 28 bits total
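The field widths above can be sanity-checked by packing them into a single word. In this sketch the field order, field names and packing direction are assumptions; the slide only gives the widths.

```python
# (name, width) pairs from the slide; the ordering is an assumption.
FIELDS = (("eta", 4), ("phi", 10), ("bend", 9), ("quality", 4), ("chamber", 1))
STUB_BITS = sum(width for _, width in FIELDS)


def pack_stub(**values):
    """Pack the fields into one integer, first field in the top bits."""
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_stub(word):
    """Inverse of pack_stub."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


stub = {"eta": 15, "phi": 767, "bend": 511, "quality": 15, "chamber": 1}
word = pack_stub(**stub)
print(STUB_BITS, word.bit_length() <= STUB_BITS, unpack_stub(word) == stub)
# 28 True True
```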


ME0 trigger path: ATCA Backend to EMTF


  • Stub finding at ME0 CTP7
    • Stub building will be done in ATCA backend
      • Logic requirements and algorithm are still undefined
      • Mimicking the offline algorithm would be very resource intensive... need significant work to develop a minimal algorithm that can be implemented in firmware
    • 16 Gbps ~ 384 bits / bx => 14 stubs / bx / link
      • 28 stubs / 40 degrees / bx (on two links)
    • 25 Gbps ~ 608 bits / bx => 21 stubs / bx / link
      • 21 stubs / 40 degrees / bx (on one link)
    • Both link configurations ensure very low overflow rates in ME0 trigger path
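The stubs-per-link numbers follow from integer division of the per-bx payload by the stub width. The sketch below uses the payload sizes quoted above; note that with the 28-bit format, 384 bits gives 13 whole stubs, while the quoted 14 would be consistent with a 27-bit stub (for example if the chamber bit were not sent when each chamber has its own link, which is only a guess).

```python
def stubs_per_bx(payload_bits, stub_bits):
    """Whole stubs that fit in one link's payload per bunch crossing."""
    return payload_bits // stub_bits


print(stubs_per_bx(384, 28))  # 13
print(stubs_per_bx(384, 27))  # 14
print(stubs_per_bx(608, 28))  # 21
```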

Mean # of segments per 40 degrees per bx:

               PU=140   PU=200   PU=240
Neutrons x1    0.14     0.22     0.28
Neutrons x3    0.29     0.75     1.41

Compare to 21 or 28 stubs
