1 of 22

Corundum status updates

Alex Forencich

1/15/2024

2 of 22

Agenda

  • Announcements
  • Status updates

3 of 22

Announcements

  • Happy new year!!!!
  • Corundum dev meeting on Jan 15 and Jan 29, rest are TBD
  • Switch dev meeting on Jan 22
  • Survey on wiki page: https://meeting.corundum.io
  • Options:
    • Every other Monday, 9 AM PDT (same as before)
    • Every other Wednesday, 9 AM PDT
    • 1st Wednesday of each month, 9 AM and 9 PM
      • Two meetings 12 hrs apart might work better across time zones
    • Every other Wednesday, alternating between 9 AM and 9 PM

4 of 22

Status update summary

  • MAC/PHY optimizations
  • Updated transmit status feedback
  • Updated queue management
  • Updated scheduler and internal flow control
  • Additional app section passthrough
  • Testbed status

5 of 22

MAC/PHY optimizations

  • Added some pipeline registers to improve timing performance
    • Reworked PRBS error counters in 10G/25G PHY, added 1 register slice
    • Added some registers to CRC logic in 10G/25G MAC
      • No change to latency
  • Reworked some RX PHY logic to reduce fan-in
    • Only need to look at 4 bits of the block type
    • The rest of the bits are only used to detect errors
    • (random thought: only 15 of 256 values are used… is it possible to tolerate some number of bit flips? See the sketch below)
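
A rough sanity check on that thought, as a Python sketch: assuming the 15 block type field values defined in IEEE 802.3 Clause 49, it computes the minimum pairwise Hamming distance, which determines how many bit flips could be detected or corrected if the full 8-bit field were checked.

  # Sketch: how much Hamming distance do the 64b/66b block type codes give us?
  # Block type values assumed to be the 15 valid codes from IEEE 802.3 Clause 49.
  from itertools import combinations

  BLOCK_TYPES = [
      0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78, 0x87,
      0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF,
  ]

  def hamming(a: int, b: int) -> int:
      """Number of differing bits between two 8-bit values."""
      return bin(a ^ b).count("1")

  min_dist = min(hamming(a, b) for a, b in combinations(BLOCK_TYPES, 2))
  print(f"minimum pairwise Hamming distance: {min_dist}")
  # A minimum distance d allows detecting up to d-1 flipped bits, or correcting
  # up to (d-1)//2 of them, at the cost of checking all 8 bits instead of 4.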

6 of 22

Updated transmit status feedback

  • Status feedback from transmit engine improved
    • Necessary for implementing internal flow control
  • Separate paths for 3 different events (sketched below)
    • Dequeue status (fails if queue is disabled or empty)
    • Transmit start (includes packet size, fails if descriptor read fails)
    • Transmit complete (also includes packet size)
  • If an operation fails, subsequent events are not reported
  • Status: Transmit engine and scheduler updated, working in HW
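
A minimal Python sketch of the three feedback paths described above (the field names are illustrative, not the actual status interface in the RTL):

  # Sketch of the three transmit status feedback events (illustrative fields only).
  from dataclasses import dataclass

  @dataclass
  class DequeueStatus:
      queue: int
      ok: bool            # False if the queue is disabled or empty

  @dataclass
  class TxStart:
      queue: int
      pkt_len: int        # packet size, needed for internal flow control
      ok: bool            # False if the descriptor read fails

  @dataclass
  class TxComplete:
      queue: int
      pkt_len: int

  # If an operation fails (ok == False), no subsequent events are reported for it,
  # so a failed event must be treated as terminal for that operation.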

7 of 22

Updated queue management

  • Resurrecting queue management code (again…)
  • Identified some bottlenecks from the previous WIP code
    • New queue state storage and split pipeline
    • Event generation improvements
    • Rework slot allocation

8 of 22

Queue state storage (current)

Field             TX/RX size (bits)   CQ/EQ size (bits)
Base addr         64                  64
Head ptr          16                  16
Tail ptr          16                  16
CQ/EQ/IRQ index   16                  16
Log queue size    4                   4
Log block size    2                   -
Arm               -                   1
Arm cont          -                   1
Enable            1                   1
Op index          8                   8
Total             127                 127

URAM is 4096 x 64, so 2 URAM = 4096 queues

9 of 22

Queue state storage (new, 1st attempt)

Field                  QP size (bits)   CQ/EQ size (bits)
Base addr (4K align)   52               52
Producer ptr           16+16            16
Consumer ptr           16+16            16
CQ/EQ/IRQ index        16+16            16
Log queue size         4+4              4
Arm                    -                1
Enable                 1                1
Active                 1+1              1
VF index               12               12
LSO offset             16               -
Total                  187              119

URAM is 4096 x 64, so: 4096 QP = 3 URAM, 4096 CQ/EQ = 2 URAM

SQ+RQ rings will share the same memory block, amortizing the large base address field

10 of 22

Queue state storage (new, 2nd attempt)

Field                  QP size (bits)   CQ/EQ size (bits)
Base addr (4K align)   52               52
Producer ptr           16+16            16
Consumer ptr           16+16            16
CQ/EQ/IRQ index        16+16            16
Log queue size         4+4              4
Arm                    -                1
Pending                -                1
Enable                 1+1              1
Active                 1+1              1
VF index               12               12
Slot                   8+8              8
Total                  188              128

URAM is 4096 x 64, so: 4096 QP = 3 URAM, 4096 CQ/EQ = 2 URAM

SQ+RQ rings will share the same memory block, amortizing the large base address field (see the sizing sketch below)
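
To make the URAM math explicit, a small Python sketch with the field widths copied from the 2nd-attempt table above (the URAM is treated as 4096 x 64 as on the slide; the dictionary keys are just labels):

  # Sketch: bits per record and URAM count for the 2nd-attempt state layout.
  import math

  QP_FIELDS = {                      # per-QP state (SQ + RQ share one record)
      "base_addr_4k": 52,
      "producer_ptr": 16 + 16,
      "consumer_ptr": 16 + 16,
      "cq_index": 16 + 16,
      "log_queue_size": 4 + 4,
      "enable": 1 + 1,
      "active": 1 + 1,
      "vf_index": 12,
      "slot": 8 + 8,
  }

  CQ_EQ_FIELDS = {                   # per-CQ/EQ state
      "base_addr_4k": 52, "producer_ptr": 16, "consumer_ptr": 16,
      "eq_irq_index": 16, "log_queue_size": 4, "arm": 1, "pending": 1,
      "enable": 1, "active": 1, "vf_index": 12, "slot": 8,
  }

  def uram_count(fields, depth=4096, word=64):
      total = sum(fields.values())
      n = math.ceil(total / word)    # one URAM per 64-bit word at depth 4096
      print(f"{total} bits/record -> {n} URAM for {depth} records")
      return n

  uram_count(QP_FIELDS)      # 188 bits -> 3 URAM
  uram_count(CQ_EQ_FIELDS)   # 128 bits -> 2 URAM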

11 of 22

Split pipeline

  • Split queue state RAM from base address RAM
  • Reduces interference between SQ and RQ operations
    • DMA operations hit shared address RAM, but DMA engine is shared
    • Register operations may need to hit multiple pipelines
  • Status: tested in HW in new completion handling logic

[Diagram: SQ and RQ queue state RAMs (prod/cons pointers, size, arm, active, enable, etc.) split from a shared base address RAM (base address, VF); all three RAMs are 64 bits wide]

12 of 22

Event generation improvements

  • Event generation from CQs has potential for backpressure problems
    • Events can be generated in response to register writes (re-arm CQ)
    • Initial rewrite of CQ logic had contention problem related to re-arming
  • Observation: when re-arming a non-empty queue, don’t need to wait for timeout, can generate event immediately
  • Move event generation to queue state handling logic
    • Add pending bit and background scrub to generate deferred events (sketched below)
    • Breaks backpressure path, events cannot be lost
    • No need to store EQN in slot state
    • TODO: use same scheme for EQs generating IRQs
  • Status: tested in HW in new completion handling logic
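
A behavioral Python sketch of the pending-bit scheme (not the RTL; the state fields and the EQ FIFO are illustrative):

  # Sketch of CQ event generation with a pending bit and background scrub.
  import queue

  class CQState:
      def __init__(self):
          self.armed = False      # host has requested an event
          self.pending = False    # event owed but deferred (EQ path was busy)
          self.prod = 0           # completions written
          self.cons = 0           # completions consumed by the host

  def emit_event(cq, eq_fifo):
      """Hand an event to the EQ path, or defer it via the pending bit."""
      if eq_fifo.full():
          cq.pending = True       # never dropped, just deferred to the scrubber
      else:
          eq_fifo.put(id(cq))
          cq.armed = False
          cq.pending = False

  def on_completion(cq, eq_fifo):
      cq.prod += 1
      if cq.armed:
          emit_event(cq, eq_fifo)

  def on_rearm(cq, eq_fifo):
      """Register write re-arming the CQ."""
      cq.armed = True
      if cq.prod != cq.cons:      # non-empty: generate the event immediately
          emit_event(cq, eq_fifo)

  def scrub(cq_table, eq_fifo):
      """Background sweep retrying deferred events; breaks the backpressure path."""
      for cq in cq_table:
          if cq.pending and not eq_fifo.full():
              emit_event(cq, eq_fifo)

  # Example: re-arming a non-empty CQ produces an event without waiting.
  eq = queue.Queue(maxsize=4)
  cq = CQState()
  on_completion(cq, eq)           # not armed yet: no event
  on_rearm(cq, eq)                # armed + non-empty: event emitted now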

13 of 22

New slot allocation

  • Previously, slot allocation was handled outside of queue state management
    • Queue state management logic wasn’t even aware of slots
  • Re-arming resulted in a CQN->slot mapping bottleneck
    • Storing slot in the queue state fixes this
  • Why not have the queue state logic manage slot allocation as well?
    • Need to resolve a few hazards, but it should be doable
    • Would remove several choke-points and should also simplify logic
  • Status: In progress

14 of 22

Status

  • Working on resurrecting batched completion write logic
    • New internal RAM layout (added slot and pending bit): done
    • Event generation from queue state module (pending bit): done
    • New split pipeline: done
    • New slot allocation: in progress
  • Next: new descriptor fetch logic, merge TXQ/RXQ into QP SQ/RQ

15 of 22

Updated scheduler and internal flow control

  • Multiple ports and multiple priorities
    • Need an internal queue per priority per port
    • Use linked lists for efficient storage
    • Multiple entries per list element to hide pipeline delays
  • Internal flow control
    • Quota for in-flight transmit operations, per-port and per-priority
    • Prevent head-of-line blocking
  • Status: in progress

16 of 22

New scheduler

  • Multiple levels of scheduling
    • Round robin across ports
    • Priority across TCs on each port
    • Round robin on scheduled queues on each TC
    • Queues can be scheduled on multiple ports, but only one TC per port
  • Need internal flow control to manage buffer space
    • Operations can only start if there is at least 1 MTU of available buffer
    • One scheduler “channel” per TC, per port
    • Flow control configured and tracked per channel (see the sketch below)
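
A behavioral Python sketch of that hierarchy (illustrative only; TC 0 is treated as the highest priority, and credits stand in for the per-channel flow control):

  # Sketch: port round robin -> strict priority across TCs -> queue round robin.
  from collections import deque

  class Channel:                       # one scheduler "channel" = (port, TC)
      def __init__(self):
          self.queues = deque()        # queues scheduled on this channel
          self.credits = 0             # per-channel FC quota (1 credit = 1 packet)

  class Port:
      def __init__(self, num_tcs=8):
          self.tcs = [Channel() for _ in range(num_tcs)]

  def pick_next(ports, state):
      """Pick (port, tc, queue) for the next transmit, or None if nothing is eligible."""
      n = len(ports)
      for i in range(n):                           # round robin across ports
          p = (state["port"] + i) % n
          for tc, ch in enumerate(ports[p].tcs):   # strict priority: lowest TC first
              if ch.queues and ch.credits > 0:     # need a queue and buffer quota
                  q = ch.queues.popleft()          # round robin within the TC
                  ch.queues.append(q)
                  ch.credits -= 1
                  state["port"] = (p + 1) % n
                  return p, tc, q
      return None

  # Example: two ports, queues 5 and 9 scheduled on port 0, TC 1, with 2 credits.
  ports = [Port(num_tcs=4) for _ in range(2)]
  ports[0].tcs[1].queues.extend([5, 9])
  ports[0].tcs[1].credits = 2
  print(pick_next(ports, {"port": 0}))             # -> (0, 1, 5)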

17 of 22

Internal flow control

  • Need to enforce byte limit and possibly packet limit
  • Don’t know the size of a packet until it’s being sent
    • Have to initially assume worst-case size (MTU)
    • But, always reserving an MTU-sized block is inefficient
  • High-level idea: it works like a credit card
    • Have a credit limit of the buffer size
    • Place a “hold” for an MTU-sized block
    • Once we know how big the packet is, release the hold and charge the actual amount
    • Pay it off when the operation completes
    • (Not particularly accurate – no late fees, no interest, no miles, …..)

18 of 22

First plan: scaling

  • Before TX starts: track packet count
  • After TX starts: track packet count and byte count
  • How to relate pre-TX count to byte count?
    • Could inc/dec by MTU
    • Could shift/scale packet count
  • Advantage: shift/scale packet count means that the MTU and buffer limit can be adjusted at any time
  • Disadvantage: complex calculation; not sure it will close timing at 250+ MHz, and pipelining it reduces packet rate

19 of 22

Second plan: buffer borrowing

  • Tracking packet counts is easy – can we do it that way?
  • An N-byte buffer can hold K = N/MTU packets, so K operations can start now
  • Once packet sizes are known, sum up (MTU-size)
    • For every MTU, can “generate” one packet credit
  • Advantage: credit check is super simple
  • Disadvantage: have to eat those extra credits at some point, complicating things

20 of 22

Current plan: split credit generation

  • Similar idea as before: use FC credits, 1 credit = 1 packet
  • Credits generated for each MTU of space in the buffer (see the sketch below)
    • Track buf_sz (bytes currently charged) against buf_lim (buffer limit)
    • If buf_sz + MTU <= buf_lim, buf_sz += MTU, generate 1 FC credit
    • On TX start, buf_sz -= MTU - pkt_sz
    • On TX complete, buf_sz -= pkt_sz
    • On failure, recycle FC credit (already paid 1 MTU)
  • Advantage: simple, can set buf size and MTU size in bytes directly, can change buf size at any time
  • Disadvantage: haven’t found an obvious problem yet…
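
A behavioral Python sketch of the scheme (names are illustrative; in hardware this is per-channel counter logic rather than a class):

  # Sketch: split credit generation for internal flow control (1 credit = 1 packet).
  class FcChannel:
      def __init__(self, buf_lim, mtu):
          self.buf_lim = buf_lim       # buffer limit in bytes (adjustable at runtime)
          self.mtu = mtu
          self.buf_sz = 0              # bytes currently charged against the buffer
          self.credits = 0

      def generate_credits(self):
          """Turn spare buffer space into packet credits, one MTU at a time."""
          while self.buf_sz + self.mtu <= self.buf_lim:
              self.buf_sz += self.mtu
              self.credits += 1

      def start_op(self):
          """Scheduler spends one credit to start a transmit operation."""
          assert self.credits > 0
          self.credits -= 1

      def on_tx_start(self, pkt_sz):
          """Packet size now known: shrink the MTU hold to the actual size."""
          self.buf_sz -= self.mtu - pkt_sz

      def on_tx_complete(self, pkt_sz):
          """Packet has left the buffer: release its bytes."""
          self.buf_sz -= pkt_sz

      def on_failure(self):
          """Nothing was sent: recycle the credit (its MTU is already charged)."""
          self.credits += 1

  # Example: 16 KB buffer, 1500 B MTU -> 10 credits up front.
  ch = FcChannel(buf_lim=16384, mtu=1500)
  ch.generate_credits()                # buf_sz = 15000, credits = 10
  ch.start_op()
  ch.on_tx_start(pkt_sz=64)            # hold shrinks from 1500 B to 64 B
  ch.on_tx_complete(pkt_sz=64)
  ch.generate_credits()                # freed space becomes a credit again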

21 of 22

App section passthrough

  • Some applications require more “stuff” to be passed through to the application section
  • MLE put together a pull request to add some macro-magic for this
    • Need to add testbenches

22 of 22

Potential shared development testbed

  • Hardware:
    • Several host machines
    • Various NICs and PCIe-form-factor FPGA boards
    • 2x HTG-9200 boards (9x QSFP28)
    • ONT-603 100G network tester, possibly other test equipment
    • Arista 7060CX 32 port 100G packet switch
    • 1x 32x32 + 2x 16x16 Polatis optical switches as scriptable patch panel
  • Software:
    • Less clear at the moment
    • Current idea: diskless hosts, users can set up their own images and boot them on the hosts (tools shared via NFS)