1 of 33

Corundum status updates

Alex Forencich

1/16/2023

2 of 33

Agenda

  • Status updates

3 of 33

Status update summary

  • Bugs
    • FIFO memory inference issue (in progress)
  • 10G/25G MAC optimizations (done)
  • Priority flow control (todo)
  • AXI Virtual FIFO (in progress)
  • DRAM integration (in progress)
  • Batched completion write support (in progress)
  • Variable-length descriptor support (todo)
  • Management soft core (todo)
  • 10G/25G switching (HW done, SW todo)

4 of 33

Bugs: FIFO memory inference issue

  • Seeing TX packets with incorrect IP layer checksums, only on Intel devices
  • MLE traced the issue to the FIFO between the TX engine and the TX checksum compute block incorrectly setting the “enable” bit
  • Appears to be a Quartus tool bug related to merging pipeline registers into MLABs
    • Connecting RAM output register to logic analyzer or adding “preserve” attribute results in the bug disappearing
  • Status: reported to Intel

5 of 33

10G/25G MAC optimizations

  • Improved CRC verification logic
    • Problem: 64 bit datapath means 8 bytes are processed per clock, but frames are not required to be a multiple of 8 bytes in length
    • Previously: partial CRC logic for 1, 2, 3, and 4 lanes, compute CRC across full frame including FCS and compare to magic constant
    • Now: zero-extend frame to 8 byte boundary, roll trailing zeros into magic constant, only need full 8 lane CRC logic plus 8 constants
  • Result: reduced from 605 LUT + 363 FF to 255 LUT + 308 FF
  • TODO: take a look at TX side (currently ~1400 LUTs)

6 of 33

Priority Flow Control

  • Starting to look at supporting PFC in Corundum
  • HW
    • PFC frame TX/RX, pause quanta counters
    • Per-TC queues
    • Connection to TX/RX queues and PFC frame logic
    • Internal flow control
    • TC-aware/per-TC scheduling
  • SW
    • Driver support
  • Outstanding questions
    • How to map RX traffic to priority levels (and is this necessary)?
    • How to efficiently handle multiple traffic classes in HW?
    • What needs to be done in the driver?
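For reference, the frame that the PFC TX/RX logic would generate and parse follows IEEE 802.1Qbb: MAC control EtherType 0x8808, opcode 0x0101, a class-enable vector, and eight 16-bit pause quanta. A sketch of the layout (field names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* IEEE 802.1Qbb priority flow control frame layout (field names are
 * illustrative). Sent to the reserved multicast address 01-80-C2-00-00-01.
 * Multi-byte fields are big-endian on the wire. */
struct pfc_frame {
    uint8_t  dst_mac[6];   /* 01-80-C2-00-00-01 */
    uint8_t  src_mac[6];
    uint16_t ethertype;    /* 0x8808, MAC control */
    uint16_t opcode;       /* 0x0101 for PFC */
    uint16_t class_enable; /* bit n = quanta for traffic class n is valid */
    uint16_t quanta[8];    /* pause time per class, in units of 512 bit times */
} __attribute__((packed));

/* Header + opcode + vector + 8 quanta = 34 bytes; the MAC pads the
 * frame out to the 64-byte minimum. */
_Static_assert(sizeof(struct pfc_frame) == 34, "unexpected PFC frame size");
```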

7 of 33

AXI virtual FIFO

  • Large packet buffer capability in DRAM
  • Store both packet data and sideband data (tid, tdest, tuser)
  • Split encoding from storage
    • Encode framing and sideband data, stripe across multiple channels
  • Intent is to support operation at 100G with all packet sizes
    • 2x DDR4-2400 channels or 2-4 HBM ports
    • Main bottleneck is memory BW, so need efficient encoding scheme for framing and sideband data
  • Status: FIFO core working in HW, encode/decode is TODO

8 of 33

AXI virtual FIFO BW test

  • Test of core FIFO in HW (full-width data, no encoding)
  • Data generation/checking in PCIe clock domain (250 MHz)
    • 512-bit limited to 128 Gbps, 256-bit limited to 64 Gbps
  • DDR4 MIG configuration
    • ROW_COLUMN_BANK_INTLV with autoprecharge (recommended settings for AXI) did not work well, switched to ROW_COLUMN_BANK with autoprecharge disabled
  • Test consists of four parts
    • Write 1,000,000 full-width words
    • Simultaneously read+write 1,000,000 full-width words (“offset” test)
    • Read 1,000,000 full-width words
    • Simultaneously read+write 1,000,000 full-width words

9 of 33

AXI virtual FIFO BW test

  • DDR4-2666 (fb2CG@KU15P, 333 MHz AXI clock)
    • RO: 128 Gbps, WO: 124 Gbps, R+W: 76.6 Gbps, R+W offset: 73.6 Gbps
  • DDR4-2400 (Alveo U200, 300 MHz AXI clock)
    • RO: 128 Gbps, WO: 124 Gbps, R+W: 69.9 Gbps, R+W offset: 66.4 Gbps
  • HBM2 (Alveo U50, 450 MHz AXI clock)
    • RO: 64 Gbps, WO: 64 Gbps, R+W: 49.0 Gbps, R+W offset: 49.0 Gbps
  • 2x DDR4-2400 = around 132 Gbps, sufficient for 100 G Ethernet traffic
  • HBM2 will probably need 3-4 channels, especially on Intel parts (Stratix 10 -2 speed grade runs at 400 MHz)

10 of 33

AXI virtual FIFO status

  • Vfifo channel module
    • Performed quite a bit of timing optimization (450 MHz for HBM)
    • Test design works with 16 active HBM channels on U50
  • Encoding
    • Back-of-the-envelope calculations
    • 2x 64 Gbps channels (i.e. conservative 2x DDR4-2400) + 100G Ethernet
    • 20 bytes of overhead per packet should be doable when packing into 256 bit segments

11 of 33

DMA benchmark application

  • DMA benchmark application is a useful test/sanity check
  • Extend DMA benchmark to test more internal components
    • Use AXI virtual FIFO to test DDR/HBM channels
  • Support DMA benchmark application on all targets
  • Status:
    • Added app for all boards (only one variant per board)
    • DRAM test in progress

12 of 33

DRAM integration

  • Provide access to DRAM and HBM from application section
    • AXI passthrough for all memory channels
    • Dedicated PCIe BAR for host software access and P2P DMA
    • Connection to DMA subsystem (unified DMA address space)
  • Potentially provide multiple clocking modes
    • Fully async (all ports driven directly from interface clocks, min. latency)
    • Synchronous (include async FIFOs to sync to first interface clock, core clock, custom clock, etc.)
  • TODO
    • Intel DDR4 (AXI?), Intel HBM (switching, bursts), 7-series
    • PCIe BAR, DMA subsystem connection

13 of 33

RAM BAR/AXI port

  • Currently provide AXI lite ports for NIC and app control
    • Exposed as PCIe BAR0 and BAR2 on PCIe designs
  • Add AXI port + BAR4 to access on-card RAM
    • Full AXI supporting bursting and interleaving
    • Supports P2P DMA, write combining, etc.
    • Can also pass through to application section for low latency operations
  • Internally shared with DMA subsystem
    • Transparently handle DMA operations based on address

14 of 33

Batched completion write support

  • Writing completions separately is inefficient
    • PCIe TLP header overhead
  • Batch completions in per-queue buffers, write in blocks
    • Less overhead
    • Can issue block write + event/IRQ + queue pointer write concurrently
    • Effectively doubles as interrupt coalescing
  • Status: initial version working in HW
    • Performance is similar

15 of 33

Baseline (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback off
  • MMIO pointers
  • No IRQ rate limiting

16 of 33

Writeback on, but not used (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • MMIO pointers
  • No IRQ rate limiting

17 of 33

Full queue pointer writeback (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • No IRQ rate limiting

18 of 33

Writeback with IRQ rate limiting (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • IRQ rate limiting (10 us)

19 of 33

New completion write (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • New completion write implementation
  • IRQ rate limiting

20 of 33

Baseline (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • Writeback off
  • MMIO pointers
  • IRQ rate limiting

21 of 33

New completion write (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • New completion write implementation
  • IRQ rate limiting

22 of 33

Batched completion write support

  • ~64K buffer space, ~64 rings of ~1024 bytes (32 entries)
  • Truncated queue pointers so wrap point aligns
  • Block write and pointer writeback
    • Triggered on half ring boundary in scratch buffer
    • Also triggered by event generation
  • Event generation
    • Triggered by packet count and/or timeout (configurable)

23 of 33

Variable-length descriptor support

  • Current queue management logic does not handle variable-length descriptors
  • Implement simplified queue managers (state storage only)
  • Implement descriptor fetch and header parsing logic
    • Read descriptors in blocks up to some block size, parse length/type fields and hand off to transmit/receive engines
    • Potential extensions to support LSO (descriptor duplication)
  • Status: working on supporting components

24 of 33

Current descriptor format

  • Small - 16 bytes
    • 32 bit length, 64 bit pointer
  • Fixed-size blocks for scatter-gather
    • Inflexible
    • Extra overhead for small packets

struct mqnic_desc {
    __u8 rsvd0[2];
    __u16 tx_csum_cmd;
    __u32 len;
    __u64 addr;
};

25 of 33

Descriptor framing format

  • Split framing from contents
    • Descriptor fetch only has to parse framing information
    • Descriptor format can be modified without changing fetch logic
  • 16 byte blocks, 2 byte header
    • 256 blocks / 4096 bytes max size
    • 1 byte type field to support multiple descriptor formats

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_block {
    __u8 rsvd[16];
};

struct desc {
    struct desc_hdr desc_hdr;
    __u8 rsvd[14];
    struct desc_block blocks[]; /* remaining 16-byte blocks */
};

26 of 33

Proposed descriptor format

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    __u16 flags;
    __u32 opcode;
    __u8 data_seg_count;
    __u8 rsvd0;
    __u16 inl_data_len;
    __u8 rsvd1[4];
    union {
        struct desc_data_seg data;
        char inl_data[];
    } segs[] __attribute__ ((aligned(16)));
};

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        __u8 rsvd0[2];
    };
    __u16 flags;
    __u16 tx_csum_cmd;
    __u16 len;
    __u64 addr;
};
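The sizes of these layouts can be sanity-checked with static asserts. This restates them with stdint types in place of the kernel's __u8/__u16/__u32/__u64; as an assumption made here purely so the sketch compiles as standard C11, the inline data member is bounded to one 16-byte block:

```c
#include <assert.h>
#include <stdint.h>

struct desc_hdr {
    uint8_t len;  /* descriptor length in 16-byte blocks */
    uint8_t type; /* descriptor format selector */
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr; /* first segment overlays the framing header */
        uint8_t rsvd0[2];
    };
    uint16_t flags;
    uint16_t tx_csum_cmd;
    uint16_t len;
    uint64_t addr;
};

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    uint16_t flags;
    uint32_t opcode;
    uint8_t  data_seg_count;
    uint8_t  rsvd0;
    uint16_t inl_data_len;
    uint8_t  rsvd1[4];
    union {
        struct desc_data_seg data;
        /* bounded here for standard C; real inline data can span blocks */
        char inl_data[sizeof(struct desc_data_seg)];
    } segs[] __attribute__((aligned(16)));
};

/* Every unit is exactly one 16-byte block, matching the framing format. */
_Static_assert(sizeof(struct desc_hdr) == 2, "hdr");
_Static_assert(sizeof(struct desc_data_seg) == 16, "seg");
_Static_assert(sizeof(struct desc_with_inline_data) == 16, "desc");
```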

27 of 33

Queue state storage

  • Some state must be stored in HW for every queue
    • Supporting 10K+ queues requires efficient storage
    • Utilize URAM blocks on US+ (4K x 64)
  • Need to add writeback pointer, VF index, and LSO offset
    • Can reclaim a couple of other fields
  • Status: Implemented queue state storage module

28 of 33

Queue state storage (current)

Field               TX/RX size   CQ/EQ size
Base addr               64           64
Head ptr                16           16
Tail ptr                16           16
CQ/EQ/IRQ index         16           16
Log queue size           4            4
Log block size           2            -
Arm                      -            1
Arm cont                 -            1
Enable                   1            1
Op index                 8            8
Total                  127          127

URAM is 4096 x 64, so 2 URAM = 4096 queues

29 of 33

Queue state storage (proposed)

Field                        TX/RX size   CQ/EQ size
Base addr (4K align)             52           52
Producer ptr                     16           16
Consumer ptr                     16           16
CQ/EQ/IRQ index                  16           16
Log queue size                    4            4
Arm                               -            1
Enable                            1            1
Active                            1            1
VF index                     12 (6)       12 (6)
LSO offset                       16            -
Writeback addr (64B align)    58 (-)       58 (-)
Total                      192 (128)    178 (113)

URAM is 4096 x 64, so 3 URAM = 4096 queues

Can fit into 2 URAM with writeback disabled and 6 bit VF index

30 of 33

Management soft core

  • Soft core for board-level management
    • Handle board and device specific functions
    • Present unified interface to driver
    • Location: board-level or inside core?
  • API
    • Low-level – direct register access, board-dependent diagnostics, etc.
    • Medium-level – somewhat abstracted operations
    • High-level – high-level operations, board and FPGA independent
  • Core: probably VexRiscv
    • JTAG debug interface, can simulate with Icarus Verilog
  • Status: core bring-up

31 of 33

10G/25G switching

  • Run-time switching between 10G and 25G should be doable
    • Change speeds and mix and match speeds without reloading FPGA
    • Run two QPLLs, dynamically switch channel clock sources
    • Possibly can also switch EQ settings (DFE vs LPM)
  • Requires resetting both RX and TX
    • Previously blocked on FIFO reset rework and PHY integration updates
  • Status: Working from userspace, need to move to softcore FW
  • TODO: update transmit engine to handle dropped TX packets when PTP is enabled

32 of 33

25G/10G/1G switching

  • Is it possible to mix 25G, 10G, and 1G on the same GT quad?
    • 1 QPLL needed each for 10G and 25G, CPLL not very flexible
    • What about oversampling? (configure for 10G, run at 1G)
  • 10.3125 / 1.25 = 8.25
  • 8.25 * 4 = 33
  • 10:8 gearbox, replicated 8 8 8 9 8 8 8 9 feeding the 66:64 async gearbox?
  • Might work for TX, but could the RX CDR track a 1.25 Gbps signal correctly? Might need an experiment…

33 of 33

To-do list for stable release

  • Application section [done]
  • Register space reorganization [done]
  • Shared interface datapath [initial version done]
  • Self-contained checksum offloading (no checksum in desc)
  • Variable-length descriptor support [in progress]
  • On-FPGA management soft core (no board-specific kernel code)
  • Overall goal: stabilize driver/hardware interface