1 of 33

Corundum status updates

Alex Forencich

1/16/2023

2 of 33

Agenda

  • Status updates

3 of 33

Status update summary

  • Bugs
    • FIFO memory inference issue (in progress)
  • 10G/25G MAC optimizations (done)
  • Priority flow control (todo)
  • AXI Virtual FIFO (in progress)
  • DRAM integration (in progress)
  • Batched completion write support (in progress)
  • Variable-length descriptor support (todo)
  • Management soft core (todo)
  • 10G/25G switching (HW done, SW todo)

4 of 33

Bugs: FIFO memory inference issue

  • Seeing TX packets with incorrect IP layer checksums, only on Intel devices
  • MLE traced the issue to the FIFO between the TX engine and the TX checksum compute block incorrectly setting the “enable” bit
  • Appears to be a Quartus tool bug related to merging pipeline registers into MLABs
    • Connecting RAM output register to logic analyzer or adding “preserve” attribute results in the bug disappearing
  • Status: reported to Intel

5 of 33

10G/25G MAC optimizations

  • Improved CRC verification logic
    • Problem: 64 bit datapath means 8 bytes are processed per clock, but frames are not required to be a multiple of 8 bytes in length
    • Previously: partial CRC logic for 1, 2, 3, and 4 lanes, compute CRC across full frame including FCS and compare to magic constant
    • Now: zero-extend frame to 8 byte boundary, roll trailing zeros into magic constant, only need full 8 lane CRC logic plus 8 constants
  • Result: reduced from 605 LUT + 363 FF to 255 LUT + 308 FF
  • TODO: take a look at TX side (currently ~1400 LUTs)

6 of 33

Priority Flow Control

  • Starting to look at supporting PFC in Corundum
  • HW
    • PFC frame TX/RX, pause quanta counters
    • Per-TC queues
    • Connection to TX/RX queues and PFC frame logic
    • Internal flow control
    • TC-aware/per-TC scheduling
  • SW
    • Driver support
  • Outstanding questions
    • How to map RX traffic to priority levels (and is this necessary)?
    • How to efficiently handle multiple traffic classes in HW?
    • What needs to be done in the driver?
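For reference, the frame that the PFC TX/RX logic would generate and parse follows IEEE 802.1Qbb: MAC control EtherType 0x8808, opcode 0x0101, a class-enable vector, and eight 16-bit pause quanta. A sketch of the layout (field names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* IEEE 802.1Qbb priority flow control frame layout (field names are
 * illustrative). Sent to the reserved multicast address 01-80-C2-00-00-01.
 * Multi-byte fields are big-endian on the wire. */
struct pfc_frame {
    uint8_t  dst_mac[6];   /* 01-80-C2-00-00-01 */
    uint8_t  src_mac[6];
    uint16_t ethertype;    /* 0x8808, MAC control */
    uint16_t opcode;       /* 0x0101 for PFC */
    uint16_t class_enable; /* bit n = quanta for traffic class n is valid */
    uint16_t quanta[8];    /* pause time per class, in units of 512 bit times */
} __attribute__((packed));

/* Header + opcode + vector + 8 quanta = 34 bytes; the MAC pads the
 * frame out to the 64-byte minimum. */
_Static_assert(sizeof(struct pfc_frame) == 34, "unexpected PFC frame size");
```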

7 of 33

AXI virtual FIFO

  • Large packet buffer capability in DRAM
  • Store both packet data and sideband data (tid, tdest, tuser)
  • Split encoding from storage
    • Encode framing and sideband data, stripe across multiple channels
  • Intent is to support operation at 100G with all packet sizes
    • 2x DDR4-2400 channels or 2-4 HBM ports
    • Main bottleneck is memory BW, so need efficient encoding scheme for framing and sideband data
  • Status: FIFO core working in HW, encode/decode is TODO

8 of 33

AXI virtual FIFO BW test

  • Test of core FIFO in HW (full-width data, no encoding)
  • Data generation/checking in PCIe clock domain (250 MHz)
    • 512-bit limited to 128 Gbps, 256-bit limited to 64 Gbps
  • DDR4 MIG configuration
    • ROW_COLUMN_BANK_INTLV with autoprecharge (recommended settings for AXI) did not work well, switched to ROW_COLUMN_BANK with autoprecharge disabled
  • Test consists of four parts
    • Write 1,000,000 full-width words
    • Simultaneously read+write 1,000,000 full-width words (“offset” test)
    • Read 1,000,000 full-width words
    • Simultaneously read+write 1,000,000 full-width words

9 of 33

AXI virtual FIFO BW test

  • DDR4-2666 (fb2CG@KU15P, 333 MHz AXI clock)
    • RO: 128 Gbps, WO: 124 Gbps, R+W: 76.6 Gbps, R+W offset: 73.6 Gbps
  • DDR4-2400 (Alveo U200, 300 MHz AXI clock)
    • RO: 128 Gbps, WO: 124 Gbps, R+W: 69.9 Gbps, R+W offset: 66.4 Gbps
  • HBM2 (Alveo U50, 450 MHz AXI clock)
    • RO: 64 Gbps, WO: 64 Gbps, R+W: 49.0 Gbps, R+W offset: 49.0 Gbps
  • 2x DDR4-2400 = around 132 Gbps, sufficient for 100 G Ethernet traffic
  • HBM2 will probably need 3-4 channels, especially on Intel parts (Stratix 10 -2 speed grade runs at 400 MHz)

10 of 33

AXI virtual FIFO status

  • Vfifo channel module
    • Performed quite a bit of timing optimization (450 MHz for HBM)
    • Test design works with 16 active HBM channels on U50
  • Encoding
    • Back-of-the-envelope calculations
    • 2x 64 Gbps channels (i.e. conservative 2x DDR4-2400) + 100G Ethernet
    • 20 bytes of overhead per packet should be doable when packing into 256 bit segments

11 of 33

DMA benchmark application

  • DMA benchmark application is a useful test/sanity check
  • Extend DMA benchmark to test more internal components
    • Use AXI virtual FIFO to test DDR/HBM channels
  • Support DMA benchmark application on all targets
  • Status:
    • Added app for all boards (only one variant per board)
    • DRAM test in progress

12 of 33

DRAM integration

  • Provide access to DRAM and HBM from application section
    • AXI passthrough for all memory channels
    • Dedicated PCIe BAR for host software access and P2P DMA
    • Connection to DMA subsystem (unified DMA address space)
  • Potentially provide multiple clocking modes
    • Fully async (all ports driven directly from interface clocks, min. latency)
    • Synchronous (include async FIFOs to sync to first interface clock, core clock, custom clock, etc.)
  • TODO
    • Intel DDR4 (AXI?), Intel HBM (switching, bursts), 7-series
    • PCIe BAR, DMA subsystem connection

13 of 33

RAM BAR/AXI port

  • Currently provide AXI lite ports for NIC and app control
    • Exposed as PCIe BAR0 and BAR2 on PCIe designs
  • Add AXI port + BAR4 to access on-card RAM
    • Full AXI supporting bursting and interleaving
    • Supports P2P DMA, write combining, etc.
    • Can also pass through to application section for low latency operations
  • Internally shared with DMA subsystem
    • Transparently handle DMA operations based on address

14 of 33

Batched completion write support

  • Writing completions separately is inefficient
    • PCIe TLP header overhead
  • Batch completions in per-queue buffers, write in blocks
    • Less overhead
    • Can issue block write + event/IRQ + queue pointer write concurrently
    • Effectively doubles as interrupt coalescing
  • Status: initial version working in HW
    • Performance is similar

15 of 33

Baseline (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback off
  • MMIO pointers
  • No IRQ rate limiting

16 of 33

Writeback on, but not used (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • MMIO pointers
  • No IRQ rate limiting

17 of 33

Full queue pointer writeback (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • No IRQ rate limiting

18 of 33

Writeback with IRQ rate limiting (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • IRQ rate limiting (10 us)

19 of 33

New completion write (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • New completion write implementation
  • IRQ rate limiting

20 of 33

Baseline (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • Writeback off
  • MMIO pointers
  • IRQ rate limiting

21 of 33

New completion write (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • New completion write implementation
  • IRQ rate limiting

22 of 33

Batched completion write support

  • ~64K buffer space, ~64 rings of ~1024 bytes (32 entries)
  • Truncated queue pointers so wrap point aligns
  • Block write and pointer writeback
    • Triggered on half ring boundary in scratch buffer
    • Also triggered by event generation
  • Event generation
    • Triggered by packet count and/or timeout (configurable)

23 of 33

Variable-length descriptor support

  • Current queue management logic does not handle variable-length descriptors
  • Implement simplified queue managers (state storage only)
  • Implement descriptor fetch and header parsing logic
    • Read descriptors in blocks up to some block size, parse length/type fields and hand off to transmit/receive engines
    • Potential extensions to support LSO (descriptor duplication)
  • Status: working on supporting components

24 of 33

Current descriptor format

  • Small - 16 bytes
    • 32 bit length, 64 bit pointer
  • Fixed-size blocks for scatter-gather
    • Inflexible
    • Extra overhead for small packets

struct mqnic_desc {
    __u8 rsvd0[2];
    __u16 tx_csum_cmd;
    __u32 len;
    __u64 addr;
};

25 of 33

Descriptor framing format

  • Split framing from contents
    • Descriptor fetch only has to parse framing information
    • Descriptor format can be modified without changing fetch logic
  • 16 byte blocks, 2 byte header
    • 256 blocks / 4096 bytes max size
    • 1 byte type field to support multiple descriptor formats

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_block {
    __u8 rsvd[16];
};

struct desc {
    struct desc_hdr desc_hdr;
    __u8 rsvd[14];
    struct desc_block blocks[]; /* remaining 16-byte blocks */
};

26 of 33

Proposed descriptor format

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    __u16 flags;
    __u32 opcode;
    __u8 data_seg_count;
    __u8 rsvd0;
    __u16 inl_data_len;
    __u8 rsvd1[4];
    union {
        struct desc_data_seg data;
        char inl_data[];
    } segs[] __attribute__ ((aligned(16)));
};

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        __u8 rsvd0[2];
    };
    __u16 flags;
    __u16 tx_csum_cmd;
    __u16 len;
    __u64 addr;
};
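The sizes of these layouts can be sanity-checked with static asserts. This restates them with stdint types in place of the kernel's __u8/__u16/__u32/__u64; as an assumption made here purely so the sketch compiles as standard C11, the inline data member is bounded to one 16-byte block:

```c
#include <assert.h>
#include <stdint.h>

struct desc_hdr {
    uint8_t len;  /* descriptor length in 16-byte blocks */
    uint8_t type; /* descriptor format selector */
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr; /* first segment overlays the framing header */
        uint8_t rsvd0[2];
    };
    uint16_t flags;
    uint16_t tx_csum_cmd;
    uint16_t len;
    uint64_t addr;
};

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    uint16_t flags;
    uint32_t opcode;
    uint8_t  data_seg_count;
    uint8_t  rsvd0;
    uint16_t inl_data_len;
    uint8_t  rsvd1[4];
    union {
        struct desc_data_seg data;
        /* bounded here for standard C; real inline data can span blocks */
        char inl_data[sizeof(struct desc_data_seg)];
    } segs[] __attribute__((aligned(16)));
};

/* Every unit is exactly one 16-byte block, matching the framing format. */
_Static_assert(sizeof(struct desc_hdr) == 2, "hdr");
_Static_assert(sizeof(struct desc_data_seg) == 16, "seg");
_Static_assert(sizeof(struct desc_with_inline_data) == 16, "desc");
```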

27 of 33

Queue state storage

  • Some state must be stored in HW for every queue
    • Supporting 10K+ queues requires efficient storage
    • Utilize URAM blocks on US+ (4K x 64)
  • Need to add writeback pointer, VF index, and LSO offset
    • Can reclaim a couple of other fields
  • Status: Implemented queue state storage module

28 of 33

Queue state storage (current)

Field               TX/RX size   CQ/EQ size
Base addr               64           64
Head ptr                16           16
Tail ptr                16           16
CQ/EQ/IRQ index         16           16
Log queue size           4            4
Log block size           2            -
Arm                      -            1
Arm cont                 -            1
Enable                   1            1
Op index                 8            8
Total                  127          127

URAM is 4096 x 64, so 2 URAM = 4096 queues

29 of 33

Queue state storage (proposed)

Field                        TX/RX size   CQ/EQ size
Base addr (4K align)             52           52
Producer ptr                     16           16
Consumer ptr                     16           16
CQ/EQ/IRQ index                  16           16
Log queue size                    4            4
Arm                               -            1
Enable                            1            1
Active                            1            1
VF index                     12 (6)       12 (6)
LSO offset                       16            -
Writeback addr (64B align)    58 (-)       58 (-)
Total                      192 (128)    178 (113)

URAM is 4096 x 64, so 3 URAM = 4096 queues

Can fit into 2 URAM with writeback disabled and 6 bit VF index

30 of 33

Management soft core

  • Soft core for board-level management
    • Handle board and device specific functions
    • Present unified interface to driver
    • Location: board-level or inside core?
  • API
    • Low-level – direct register access, board-dependent diagnostics, etc.
    • Medium-level – somewhat abstracted operations
    • High-level – high-level operations, board and FPGA independent
  • Core: probably VexRiscv
    • JTAG debug interface, can simulate with Icarus Verilog
  • Status: core bring-up

31 of 33

10G/25G switching

  • Run-time switching between 10G and 25G should be doable
    • Change speeds and mix and match speeds without reloading FPGA
    • Run two QPLLs, dynamically switch channel clock sources
    • Possibly can also switch EQ settings (DFE vs LPM)
  • Requires resetting both RX and TX
    • Previously blocked on FIFO reset rework and PHY integration updates
  • Status: Working from userspace, need to move to softcore FW
  • TODO: update transmit engine to handle dropped TX packets when PTP is enabled

32 of 33

25G/10G/1G switching

  • Is it possible to mix 25G, 10G, and 1G on the same GT quad?
    • 1 QPLL needed each for 10G and 25G, CPLL not very flexible
    • What about oversampling? (configure for 10G, run at 1G)
  • 10.3125 / 1.25 = 8.25
  • 8.25 * 4 = 33
  • 10:8 gearbox, replicated 8 8 8 9 8 8 8 9 feeding the 66:64 async gearbox?
  • Might work for TX, but could the RX CDR track a 1.25 Gbps signal correctly? Might need an experiment…

33 of 33

To-do list for stable release

  • Application section [done]
  • Register space reorganization [done]
  • Shared interface datapath [initial version done]
  • Self-contained checksum offloading (no checksum in desc)
  • Variable-length descriptor support [in progress]
  • On-FPGA management soft core (no board-specific kernel code)
  • Overall goal: stabilize driver/hardware interface