Corundum status updates

Alex Forencich

7/3/2023

Agenda

  • Announcements
  • Status updates

Announcements

  • Next Corundum developer meeting: July 17 at 9:00 AM PDT
  • Next switch development meeting: July 10 at 9:00 AM PDT

Status update summary

  • PCIe RX buffer management
  • Upstream git repo updated
  • Upcoming changes for the git repo
  • MAC control and PFC

PCIe RX buffer management

  • PCIe endpoints must advertise infinite completion credits
    • Prevents deadlocks, but means there is no back-pressure on RX completions
  • PCIe interface cores have some amount of RX buffering
    • No back-pressure means possibility of buffer overflow for completions
  • Must manage outstanding read requests to prevent this
  • Previously, the Corundum DMA engine did not implement this
    • Some previous “PCIe HIP bugs” are probably just CPL buffer overflows
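The failure mode is easy to see with a little arithmetic: since advertised credits are infinite, the host never throttles completions, so if the outstanding reads can return more completion data (plus TLP headers) than the core's physical RX buffer holds, completions get dropped. A minimal Python model; all sizes below are illustrative, not taken from any datasheet:

```python
# Illustrative model: infinite advertised credits vs. a finite physical buffer.
# All sizes are made up for illustration, not from any specific PCIe core.
BUF_BYTES = 16384        # physical RX completion buffer in the PCIe core
READ_SIZE = 512          # bytes per DMA read request
CPL_SIZE = 128           # host splits completions at this boundary
CPL_HDR = 20             # approx. per-completion TLP header/overhead bytes

def worst_case_buffer_use(outstanding_reads):
    """Bytes of buffer consumed if every outstanding read completes at once."""
    cpls_per_read = READ_SIZE // CPL_SIZE
    return outstanding_reads * cpls_per_read * (CPL_SIZE + CPL_HDR)

# Without tracking outstanding reads, nothing stops us exceeding the buffer:
for n in (16, 32, 64):
    use = worst_case_buffer_use(n)
    print(n, use, "OVERFLOW" if use > BUF_BYTES else "ok")
```

Note that header overhead is part of the worst case, which is why a buffer that nominally fits the data can still overflow.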

Buffer size measurement: setup

  • Modified DMA engine example design
    • Added a mechanism to block the RC interface
    • Added request and completion counters
    • With RC blocked, issue a bunch of DMA reads
    • Release block and see if a completion got dropped (in which case, the DMA engine hangs)
  • Tested on two different computers
    • One with max payload 256 (Intel Xeon 6130)
    • One with max payload 128 that also likes to produce 64B completions (old AMD Phenom CPU)
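The procedure can be summarized in code. `FakeDev` below is a hypothetical software stand-in for the modified example design (the real test runs against hardware); it "drops" a completion once the buffered completion data exceeds its capacity:

```python
# Sketch of the measurement procedure. FakeDev is a hypothetical stand-in for
# the modified verilog-pcie example design: while the RC interface is blocked,
# completions pile up, and exceeding the buffer drops one (hanging the DMA).

class FakeDev:
    def __init__(self, buf_bytes):
        self.buf_bytes = buf_bytes
        self.reset()

    def reset(self):
        self.buffered = 0
        self.hung = False

    def block_rc(self):
        pass                      # real design: stall the RC interface

    def issue_dma_read(self, length):
        self.buffered += length   # completions accumulate while RC is blocked
        if self.buffered > self.buf_bytes:
            self.hung = True      # a completion was dropped; DMA will hang

    def release_rc(self):
        self.buffered = 0         # buffered completions drain

    def dma_hung(self):
        return self.hung

def find_buffer_limit(dev, read_len, max_reads=512):
    """Grow the blocked burst until a completion drops; return last good size."""
    for n in range(1, max_reads + 1):
        dev.block_rc()
        for _ in range(n):
            dev.issue_dma_read(read_len)
        dev.release_rc()
        if dev.dma_hung():
            return n - 1
        dev.reset()
    return max_reads
```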

Buffer size measurement: results

  • XCKU040 (PCIE3) – “64 CPLH, 16384B”
    • Host with 256B cpl – 30x 512B RdReq OK (30 req, 60 cpl, 960 CPLD)
    • Host with 128B cpl – 16x 512B RdReq OK (16 req, 64 cpl, 512 CPLD)
    • Host with 128B cpl – 17x 144B unaligned RdReq OK (17 req, 68 cpl)
  • XCKU3P/XCKU15P (PCIE4) – “128 CPLH, 32768B”
    • Host with 256B cpl – 62x 512B RdReq OK (62 req, 124 cpl, 1984 CPLD)
    • Host with 128B cpl – 58x 512B RdReq OK (58 req, 232 cpl, 1856 CPLD)
    • Host with 128B cpl – 67x 144B unaligned RdReq OK (67 req, 268 cpl)
  • S10MX (H-tile) – “770 CPLH, 2500 CPLD”
    • Host with 128B cpl – 76x 512B RdReq OK (76 req, 304 cpl, 2432 CPLD)
    • Host with 128B cpl – 195x 144B unaligned RdReq OK (195 req, 780 cpl)
  • Agilex F (P-tile) – “1144 CPLH, 1444 (256-bit) CPLD”
    • Host with 128B cpl – 128x 512B RdReq OK (128 req, 512 cpl, 4096 CPLD)
    • Host with 128B cpl – 256x 144B unaligned RdReq OK (256 req, 1024 cpl)
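The per-request completion counts above follow from how the host splits read completions: one completion TLP per completion-boundary segment (one CPLH credit each), with CPLD credits counting data in 16-byte units per TLP. A sketch of the worst-case cost calculation that reproduces the arithmetic in these measurements:

```python
def cpl_cost(addr, length, cpl_size):
    """Worst-case CPLH/CPLD credits one read request can consume.

    The host splits completions at cpl_size boundaries (RCB or max payload),
    so the TLP count depends on alignment. Each completion TLP costs one CPLH
    credit; CPLD credits count that TLP's data in 16-byte units.
    """
    ncpl = 0
    cpld = 0
    end = addr + length
    while addr < end:
        seg_end = min((addr // cpl_size + 1) * cpl_size, end)
        ncpl += 1                            # one CPLH per completion TLP
        cpld += (seg_end - addr + 15) // 16  # 16B CPLD units for this TLP
        addr = seg_end
    return ncpl, cpld
```

For example, a 512B read against a host with 256B completions costs (2 CPLH, 32 CPLD), matching the 30 req / 60 cpl / 960 CPLD figures above, and a 144B read that is maximally unaligned against 64B completions costs 4 CPLH.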

Buffer size measurement: conclusions

  • UltraScale definitely has a 64 CPLH limit
    • So, documentation is correct
  • UltraScale+ supports about 256 CPLH credits
    • Docs are wrong
    • Still not enough for the buffer size, unless RCB bit is set (uncommon)
  • Data buffer size on both includes completion TLP headers
    • Docs are correct about the buffer size, but incorrect on how to manage it
  • S10 H-tile has a CPLD limit closer to 2400
    • Docs are wrong
  • P-tile has big buffers
    • Haven’t been able to hit CPLH limit, hit CPLD limit at 64 KB/4096 CPLD (?!)

PCIe RX buffer management status

  • Implemented CPLH/CPLD credit tracking in DMA engine (PACKET_FC from PG213)
  • Connected RCB status signals
  • Updated verilog-pcie example design to test completion buffer
  • P-tile: implemented as 1144 CPLH/2888 CPLD
  • H/L-tiles: implemented as 770 CPLH/2400 CPLD
  • Xilinx US: implemented as 64 CPLH/1024-64 CPLD
  • Xilinx US+: implemented as 256 CPLH/2048-256 CPLD
    • Unfortunately, there is some performance penalty
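The tracking itself amounts to a pair of credit counters that gate read request issue. A simplified Python model of the scheme (the actual implementation is Verilog, following PACKET_FC from PG213):

```python
class CplCreditTracker:
    """Simplified model of CPLH/CPLD credit tracking for DMA reads.

    Worst-case credits are reserved when a read request is issued and
    returned as its completions are consumed; a request that does not
    fit must wait until enough credits are released.
    """
    def __init__(self, cplh, cpld):
        self.cplh = cplh      # completion header credits (one per TLP)
        self.cpld = cpld      # completion data credits (16-byte units)

    def try_issue(self, ncpl, ndata16):
        """Reserve credits for a read needing ncpl TLPs / ndata16 16B units."""
        if ncpl <= self.cplh and ndata16 <= self.cpld:
            self.cplh -= ncpl
            self.cpld -= ndata16
            return True
        return False          # caller must stall until credits are released

    def release(self, ncpl, ndata16):
        """Return credits once the corresponding completions are consumed."""
        self.cplh += ncpl
        self.cpld += ndata16
```

With the UltraScale configuration above (64 CPLH), 512B reads against a host with 128B completions (4 CPLH, 32 CPLD each) run out of header credits after 16 outstanding reads, matching the measurement; further reads stall instead of overflowing the buffer.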

Upstream git repo updated

  • Driver updates
    • Dynamic queue allocation
    • ethtool APIs
    • Other clean-up and improvements
  • New board support – KR260, 520N-MX
  • Ubuntu on Zynq support
  • Virtual I2C mux
  • PCIe RX completion buffer management

Upcoming changes

  • Queue control register reorganization
    • Coupled with some driver refactoring
    • Commands for atomic manipulation of queue state
    • Double CQ/EQ control register density
  • This will bump all of the queue control register block versions
    • Out-of-tree drivers (DPDK, etc.) will need to be updated
  • Might also push CQ consolidation changes
    • Single set of CQ instead of separate RX CQ and TX CQ

MAC control and PFC

  • MAC control layer
    • Exchange MAC control messages, including pause frames
    • Inject/extract control traffic on MAC side of TX/RX FIFOs
  • PFC and pause frame support
    • Generate/receive PFC and pause frames using MAC control layer
    • Pause frames directly stop TX traffic; PFC must be handled deeper in the design
  • Modules written in a flexible manner
    • Can implement in MAC itself, or in MAC-independent L2 ingress/egress modules in Corundum
  • Status: mostly implemented, working in sim
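For reference, the PFC frames this layer generates and parses follow the standard 802.1Qbb MAC control frame layout. A sketch of the frame construction in Python (the source address is a placeholder):

```python
import struct

def build_pfc_frame(src, quanta):
    """Build a PFC (802.1Qbb) MAC control frame, without FCS.

    src: 6-byte source MAC address (placeholder here).
    quanta: 8 pause times (0..0xffff), one per priority class.
    """
    assert len(src) == 6 and len(quanta) == 8
    dst = bytes.fromhex("0180c2000001")    # MAC control multicast address
    ethertype = struct.pack("!H", 0x8808)  # MAC control EtherType
    opcode = struct.pack("!H", 0x0101)     # PFC opcode (0x0001 = classic pause)
    # Priority-enable vector: bit i set if class i's quanta field is valid
    vector = struct.pack("!H", sum(1 << i for i, q in enumerate(quanta) if q))
    times = b"".join(struct.pack("!H", q) for q in quanta)
    frame = dst + src + ethertype + opcode + vector + times
    return frame.ljust(60, b"\x00")        # pad to minimum frame size
```

A classic pause frame differs only in the opcode (0x0001) and payload (a single 2-byte quanta field), which is why both fit naturally behind one MAC control inject/extract path.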