1 of 22

Corundum status updates

Alex Forencich

1/15/2024

2 of 22

Agenda

  • Announcements
  • Status updates

3 of 22

Announcements

  • Happy new year!!!!
  • Corundum dev meeting on Jan 15 and Jan 29, rest are TBD
  • Switch dev meeting on Jan 22
  • Survey on wiki page: https://meeting.corundum.io
  • Options:
    • Every other Monday, 9 AM PDT (same as before)
    • Every other Wednesday, 9 AM PDT
    • 1st Wednesday of each month, 9 AM and 9 PM
      • Two meetings 12 hrs apart might work better across time zones
    • Every other Wednesday, alternating between 9 AM and 9 PM

4 of 22

Status update summary

  • MAC/PHY optimizations
  • Updated transmit status feedback
  • Updated queue management
  • Updated scheduler and internal flow control
  • Additional app section passthrough
  • Testbed status

5 of 22

MAC/PHY optimizations

  • Added some pipeline registers to improve timing performance
    • Reworked PRBS error counters in 10G/25G PHY, added 1 register slice
    • Added some registers to CRC logic in 10G/25G MAC
      • No change to latency
  • Reworked some RX PHY logic to reduce fan-in
    • Only need to look at 4 bits of the block type
    • The rest of the bits are only used to detect errors
    • (random thought: only 15 of 256 values are used… is it possible to tolerate some number of bit flips? See the sketch below)
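
A rough sanity check on that thought, as a Python sketch: assuming the 15 block type field values defined in IEEE 802.3 Clause 49, it computes the minimum pairwise Hamming distance, which determines how many bit flips could be detected or corrected if the full 8-bit field were checked.

  # Sketch: how much Hamming distance do the 64b/66b block type codes give us?
  # Block type values assumed to be the 15 valid codes from IEEE 802.3 Clause 49.
  from itertools import combinations

  BLOCK_TYPES = [
      0x1E, 0x2D, 0x33, 0x4B, 0x55, 0x66, 0x78, 0x87,
      0x99, 0xAA, 0xB4, 0xCC, 0xD2, 0xE1, 0xFF,
  ]

  def hamming(a: int, b: int) -> int:
      """Number of differing bits between two 8-bit values."""
      return bin(a ^ b).count("1")

  min_dist = min(hamming(a, b) for a, b in combinations(BLOCK_TYPES, 2))
  print(f"minimum pairwise Hamming distance: {min_dist}")
  # A minimum distance d allows detecting up to d-1 flipped bits, or correcting
  # up to (d-1)//2 of them, at the cost of checking all 8 bits instead of 4.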

6 of 22

Updated transmit status feedback

  • Status feedback from transmit engine improved
    • Necessary for implementing internal flow control
  • Separate paths for 3 different events (sketched below)
    • Dequeue status (fails if queue is disabled or empty)
    • Transmit start (includes packet size, fails if descriptor read fails)
    • Transmit complete (also includes packet size)
  • If an operation fails, subsequent events are not reported
  • Status: Transmit engine and scheduler updated, working in HW
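
A minimal Python sketch of the three feedback paths described above (the field names are illustrative, not the actual status interface in the RTL):

  # Sketch of the three transmit status feedback events (illustrative fields only).
  from dataclasses import dataclass

  @dataclass
  class DequeueStatus:
      queue: int
      ok: bool            # False if the queue is disabled or empty

  @dataclass
  class TxStart:
      queue: int
      pkt_len: int        # packet size, needed for internal flow control
      ok: bool            # False if the descriptor read fails

  @dataclass
  class TxComplete:
      queue: int
      pkt_len: int

  # If an operation fails (ok == False), no subsequent events are reported for it,
  # so a failed event must be treated as terminal for that operation.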

7 of 22

Updated queue management

  • Resurrecting queue management code (again…)
  • Identified some bottlenecks from the previous WIP code
    • New queue state storage and split pipeline
    • Event generation improvements
    • Rework slot allocation

8 of 22

Queue state storage (current)

Field             TX/RX size (bits)   CQ/EQ size (bits)
Base addr         64                  64
Head ptr          16                  16
Tail ptr          16                  16
CQ/EQ/IRQ index   16                  16
Log queue size    4                   4
Log block size    2                   -
Arm               -                   1
Arm cont          -                   1
Enable            1                   1
Op index          8                   8
Total             127                 127

URAM is 4096 x 64, so 2 URAM = 4096 queues

9 of 22

Queue state storage (new, 1st attempt)

Field                  QP size (bits)   CQ/EQ size (bits)
Base addr (4K align)   52               52
Producer ptr           16+16            16
Consumer ptr           16+16            16
CQ/EQ/IRQ index        16+16            16
Log queue size         4+4              4
Arm                    -                1
Enable                 1                1
Active                 1+1              1
VF index               12               12
LSO offset             16               -
Total                  187              119

URAM is 4096 x 64, so: 4096 QP = 3 URAM, 4096 CQ/EQ = 2 URAM

SQ+RQ rings will share the same memory block, amortizing the large base address field

10 of 22

Queue state storage (new, 2nd attempt)

Field                  QP size (bits)   CQ/EQ size (bits)
Base addr (4K align)   52               52
Producer ptr           16+16            16
Consumer ptr           16+16            16
CQ/EQ/IRQ index        16+16            16
Log queue size         4+4              4
Arm                    -                1
Pending                -                1
Enable                 1+1              1
Active                 1+1              1
VF index               12               12
Slot                   8+8              8
Total                  188              128

URAM is 4096 x 64, so: 4096 QP = 3 URAM, 4096 CQ/EQ = 2 URAM

SQ+RQ rings will share the same memory block, amortizing the large base address field (see the sizing sketch below)
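
To make the URAM math explicit, a small Python sketch with the field widths copied from the 2nd-attempt table above (the URAM is treated as 4096 x 64 as on the slide; the dictionary keys are just labels):

  # Sketch: bits per record and URAM count for the 2nd-attempt state layout.
  import math

  QP_FIELDS = {                      # per-QP state (SQ + RQ share one record)
      "base_addr_4k": 52,
      "producer_ptr": 16 + 16,
      "consumer_ptr": 16 + 16,
      "cq_index": 16 + 16,
      "log_queue_size": 4 + 4,
      "enable": 1 + 1,
      "active": 1 + 1,
      "vf_index": 12,
      "slot": 8 + 8,
  }

  CQ_EQ_FIELDS = {                   # per-CQ/EQ state
      "base_addr_4k": 52, "producer_ptr": 16, "consumer_ptr": 16,
      "eq_irq_index": 16, "log_queue_size": 4, "arm": 1, "pending": 1,
      "enable": 1, "active": 1, "vf_index": 12, "slot": 8,
  }

  def uram_count(fields, depth=4096, word=64):
      total = sum(fields.values())
      n = math.ceil(total / word)    # one URAM per 64-bit word at depth 4096
      print(f"{total} bits/record -> {n} URAM for {depth} records")
      return n

  uram_count(QP_FIELDS)      # 188 bits -> 3 URAM
  uram_count(CQ_EQ_FIELDS)   # 128 bits -> 2 URAM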

11 of 22

Split pipeline

  • Split queue state RAM from base address RAM
  • Reduces interference between SQ and RQ operations
    • DMA operations hit shared address RAM, but DMA engine is shared
    • Register operations may need to hit multiple pipelines
  • Status: tested in HW in new completion handling logic

[Diagram: SQ and RQ queue state RAMs (prod/cons pointers, size, arm, active, enable, etc.) split from a shared base address RAM (base address, VF); all three RAMs are 64 bits wide]

12 of 22

Event generation improvements

  • Event generation from CQs has potential for backpressure problems
    • Events can be generated in response to register writes (re-arm CQ)
    • Initial rewrite of CQ logic had contention problem related to re-arming
  • Observation: when re-arming a non-empty queue, don’t need to wait for timeout, can generate event immediately
  • Move event generation to queue state handling logic
    • Add pending bit and background scrub to generate deferred events (sketched below)
    • Breaks backpressure path, events cannot be lost
    • No need to store EQN in slot state
    • TODO: use same scheme for EQs generating IRQs
  • Status: tested in HW in new completion handling logic
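
A behavioral Python sketch of the pending-bit scheme (not the RTL; the state fields and the EQ FIFO are illustrative):

  # Sketch of CQ event generation with a pending bit and background scrub.
  import queue

  class CQState:
      def __init__(self):
          self.armed = False      # host has requested an event
          self.pending = False    # event owed but deferred (EQ path was busy)
          self.prod = 0           # completions written
          self.cons = 0           # completions consumed by the host

  def emit_event(cq, eq_fifo):
      """Hand an event to the EQ path, or defer it via the pending bit."""
      if eq_fifo.full():
          cq.pending = True       # never dropped, just deferred to the scrubber
      else:
          eq_fifo.put(id(cq))
          cq.armed = False
          cq.pending = False

  def on_completion(cq, eq_fifo):
      cq.prod += 1
      if cq.armed:
          emit_event(cq, eq_fifo)

  def on_rearm(cq, eq_fifo):
      """Register write re-arming the CQ."""
      cq.armed = True
      if cq.prod != cq.cons:      # non-empty: generate the event immediately
          emit_event(cq, eq_fifo)

  def scrub(cq_table, eq_fifo):
      """Background sweep retrying deferred events; breaks the backpressure path."""
      for cq in cq_table:
          if cq.pending and not eq_fifo.full():
              emit_event(cq, eq_fifo)

  # Example: re-arming a non-empty CQ produces an event without waiting.
  eq = queue.Queue(maxsize=4)
  cq = CQState()
  on_completion(cq, eq)           # not armed yet: no event
  on_rearm(cq, eq)                # armed + non-empty: event emitted now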

13 of 22

New slot allocation

  • Previously, slot allocation was handled outside of queue state management
    • Queue state management logic wasn’t even aware of slots
  • Re-arming resulted in a CQN->slot mapping bottleneck
    • Storing slot in the queue state fixes this
  • Why not have the queue state logic manage slot allocation as well?
    • Need to resolve a few hazards, but it should be doable
    • Would remove several choke-points and should also simplify logic
  • Status: In progress

14 of 22

Status

  • Working on resurrecting batched completion write logic
    • New internal RAM layout (added slot and pending bit): done
    • Event generation from queue state module (pending bit): done
    • New split pipeline: done
    • New slot allocation: in progress
  • Next: new descriptor fetch logic, merge TXQ/RXQ into QP SQ/RQ

15 of 22

Updated scheduler and internal flow control

  • Multiple ports and multiple priorities
    • Need an internal queue per priority per port
    • Use linked lists for efficient storage
    • Multiple entries per list element to hide pipeline delays
  • Internal flow control
    • Quota for in-flight transmit operations, per-port and per-priority
    • Prevent head-of-line blocking
  • Status: in progress

16 of 22

New scheduler

  • Multiple levels of scheduling
    • Round robin across ports
    • Priority across TCs on each port
    • Round robin on scheduled queues on each TC
    • Queues can be scheduled on multiple ports, but only one TC per port
  • Need internal flow control to manage buffer space
    • Operations can only start if there is at least 1 MTU of available buffer
    • One scheduler “channel” per TC, per port
    • Flow control configured and tracked per channel (see the sketch below)
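
A behavioral Python sketch of that hierarchy (illustrative only; TC 0 is treated as the highest priority, and credits stand in for the per-channel flow control):

  # Sketch: port round robin -> strict priority across TCs -> queue round robin.
  from collections import deque

  class Channel:                       # one scheduler "channel" = (port, TC)
      def __init__(self):
          self.queues = deque()        # queues scheduled on this channel
          self.credits = 0             # per-channel FC quota (1 credit = 1 packet)

  class Port:
      def __init__(self, num_tcs=8):
          self.tcs = [Channel() for _ in range(num_tcs)]

  def pick_next(ports, state):
      """Pick (port, tc, queue) for the next transmit, or None if nothing is eligible."""
      n = len(ports)
      for i in range(n):                           # round robin across ports
          p = (state["port"] + i) % n
          for tc, ch in enumerate(ports[p].tcs):   # strict priority: lowest TC first
              if ch.queues and ch.credits > 0:     # need a queue and buffer quota
                  q = ch.queues.popleft()          # round robin within the TC
                  ch.queues.append(q)
                  ch.credits -= 1
                  state["port"] = (p + 1) % n
                  return p, tc, q
      return None

  # Example: two ports, queues 5 and 9 scheduled on port 0, TC 1, with 2 credits.
  ports = [Port(num_tcs=4) for _ in range(2)]
  ports[0].tcs[1].queues.extend([5, 9])
  ports[0].tcs[1].credits = 2
  print(pick_next(ports, {"port": 0}))             # -> (0, 1, 5)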

17 of 22

Internal flow control

  • Need to enforce byte limit and possibly packet limit
  • Don’t know the size of a packet until it’s being sent
    • Have to initially assume worst-case size (MTU)
    • But, always reserving an MTU-sized block is inefficient
  • High-level idea: it works like a credit card
    • Have a credit limit of the buffer size
    • Place a “hold” for an MTU-sized block
    • Once we know how big the packet is, release the hold and charge the actual amount
    • Pay it off when the operation completes
    • (Not particularly accurate – no late fees, no interest, no miles, …..)

18 of 22

First plan: scaling

  • Before TX starts: track packet count
  • After TX starts: track packet count and byte count
  • How to relate pre-TX count to byte count?
    • Could inc/dec by MTU
    • Could shift/scale packet count
  • Advantage: shift/scale packet count means that the MTU and buffer limit can be adjusted at any time
  • Disadvantage: complex calculation; not sure it will close timing at 250+ MHz, and pipelining it reduces packet rate

19 of 22

Second plan: buffer borrowing

  • Tracking packet counts is easy – can we do it that way?
  • An N-byte buffer can hold K = N/MTU packets, so K operations can start now
  • Once packet sizes are known, sum up (MTU-size)
    • For every MTU, can “generate” one packet credit
  • Advantage: credit check is super simple
  • Disadvantage: have to eat those extra credits at some point, complicating things

20 of 22

Current plan: split credit generation

  • Similar idea as before: use FC credits, 1 credit = 1 packet
  • Credits generated for each MTU of space in the buffer (see the sketch below)
    • Track buf_sz (bytes currently charged) against buf_lim (buffer limit)
    • If buf_sz + MTU <= buf_lim, buf_sz += MTU, generate 1 FC credit
    • On TX start, buf_sz -= MTU - pkt_sz
    • On TX complete, buf_sz -= pkt_sz
    • On failure, recycle FC credit (already paid 1 MTU)
  • Advantage: simple, can set buf size and MTU size in bytes directly, can change buf size at any time
  • Disadvantage: haven’t found an obvious problem yet…
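
A behavioral Python sketch of the scheme (names are illustrative; in hardware this is per-channel counter logic rather than a class):

  # Sketch: split credit generation for internal flow control (1 credit = 1 packet).
  class FcChannel:
      def __init__(self, buf_lim, mtu):
          self.buf_lim = buf_lim       # buffer limit in bytes (adjustable at runtime)
          self.mtu = mtu
          self.buf_sz = 0              # bytes currently charged against the buffer
          self.credits = 0

      def generate_credits(self):
          """Turn spare buffer space into packet credits, one MTU at a time."""
          while self.buf_sz + self.mtu <= self.buf_lim:
              self.buf_sz += self.mtu
              self.credits += 1

      def start_op(self):
          """Scheduler spends one credit to start a transmit operation."""
          assert self.credits > 0
          self.credits -= 1

      def on_tx_start(self, pkt_sz):
          """Packet size now known: shrink the MTU hold to the actual size."""
          self.buf_sz -= self.mtu - pkt_sz

      def on_tx_complete(self, pkt_sz):
          """Packet has left the buffer: release its bytes."""
          self.buf_sz -= pkt_sz

      def on_failure(self):
          """Nothing was sent: recycle the credit (its MTU is already charged)."""
          self.credits += 1

  # Example: 16 KB buffer, 1500 B MTU -> 10 credits up front.
  ch = FcChannel(buf_lim=16384, mtu=1500)
  ch.generate_credits()                # buf_sz = 15000, credits = 10
  ch.start_op()
  ch.on_tx_start(pkt_sz=64)            # hold shrinks from 1500 B to 64 B
  ch.on_tx_complete(pkt_sz=64)
  ch.generate_credits()                # freed space becomes a credit again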

21 of 22

App section passthrough

  • Some applications require more “stuff” to be passed through to the application section
  • MLE put together a pull request to add some macro-magic for this
    • Need to add testbenches

22 of 22

Potential shared development testbed

  • Hardware:
    • Several host machines
    • Various NICs and PCIe-form-factor FPGA boards
    • 2x HTG-9200 boards (9x QSFP28)
    • ONT-603 100G network tester, possibly other test equipment
    • Arista 7060CX 32 port 100G packet switch
    • 1x 32x32 + 2x 16x16 Polatis optical switches as scriptable patch panel
  • Software:
    • Less clear at the moment
    • Current idea: diskless hosts, users can set up their own images and boot them on the hosts (tools shared via NFS)