1 of 25

Corundum status updates

Alex Forencich

10/10/2022

2 of 25

Agenda

  • Status updates

3 of 25

Status update summary

  • Bugs
    • Apparent bug in Xilinx PCIe IP core (in progress)
  • DRAM integration (in progress)
  • Batched completion write support (in progress)
  • Variable-length descriptor support (todo)
  • Management soft core (todo)
  • 10G/25G switching (HW done, SW todo)
  • Clock info register block (todo)

4 of 25

Bugs

  • Apparent PCIe HIP bug
    • US+ PCIe HIP drops operations under heavy load
    • Appears to be related to completion buffer management
    • Disabling client tags may be a workaround (the PCIe HIP allocates tags instead of the DMA engine)
    • Status: implemented tag management in HIP model, need to update DMA engine

5 of 25

DRAM integration

  • Provide access to DRAM and HBM from application section
    • AXI passthrough for all memory channels
    • Dedicated PCIe BAR for host software access and P2P DMA
    • Connection to DMA subsystem (unified DMA address space)
  • Status
    • Added AXI passthrough from core to application logic
    • Added DDR4 MIGs and HBM controllers to current US+ designs
      • Calibration completes (status via JTAG)
    • TODO: Intel DDR4 (AXI?), Intel HBM (switching, bursts), 7-series
    • Long term TODO: PCIe BAR, DMA subsystem connection

6 of 25

Batched completion write support

  • Writing completions separately is inefficient
    • PCIe TLP header overhead
  • Batch completions in per-queue buffers, write in blocks
    • Less overhead
    • Can issue block write + event/IRQ + queue pointer write concurrently
    • Effectively doubles as interrupt coalescing
  • Status: initial version working in HW
    • Performance is similar to the previous per-completion implementation
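The overhead argument can be made concrete with a quick calculation. The ~24-byte per-TLP figure below is an illustrative assumption (roughly a 4-DW memory write header plus framing and LCRC on a Gen3 link), not a measured value:

```c
#include <assert.h>

/* Illustrative assumption: ~24 bytes of TLP header + framing/LCRC
 * overhead per posted write; not a measured figure. */
#define TLP_OVERHEAD 24

/* Bytes on the wire to deliver 'count' completions of 'size' bytes,
 * written either one TLP each or batched into a single block write. */
static inline unsigned wire_bytes(unsigned count, unsigned size, int batched)
{
    if (batched)
        return count * size + TLP_OVERHEAD;
    return count * (size + TLP_OVERHEAD);
}
```

Under these assumptions, 16 completions of 32 bytes cost 16 * (32 + 24) = 896 bytes written separately versus 16 * 32 + 24 = 536 bytes batched, before counting the saved event/IRQ and pointer writes.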

7 of 25

Baseline (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback off
  • MMIO pointers
  • No IRQ rate limiting

8 of 25

Writeback on, but not used (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • MMIO pointers
  • No IRQ rate limiting

9 of 25

Full queue pointer writeback (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • No IRQ rate limiting

10 of 25

Writeback with IRQ rate limiting (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • Writeback on
  • Writeback pointers
  • IRQ rate limiting (10 us)

11 of 25

New completion write (MTU 9000)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 9000
  • New completion write implementation
  • IRQ rate limiting

12 of 25

Baseline (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • Writeback off
  • MMIO pointers
  • IRQ rate limiting

13 of 25

New completion write (MTU 1500)

  • ADM-PCIE-9V3
  • 8192 TXQ
  • 256 RXQ
  • 64 EQ + 64 MSI-X
  • 256 PCIe tags w/FC
  • 2x Xeon 6130
  • MTU 1500
  • New completion write implementation
  • IRQ rate limiting

14 of 25

Batched completion write support

  • ~64K buffer space, ~64 rings of ~1024 bytes (32 entries)
  • Truncated queue pointers so wrap point aligns
  • Block write and pointer writeback
    • Triggered on half ring boundary in scratch buffer
    • Also triggered by event generation
  • Event generation
    • Triggered by packet count and/or timeout (configurable)
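The buffer geometry and flush trigger above can be sketched as follows; the macro names and the exact producer-index encoding are assumptions, only the sizes come from the slide:

```c
#include <assert.h>

/* Completion scratch buffer geometry from the slide: ~64 KB split into
 * ~64 per-queue rings of ~1024 bytes, 32 entries each. */
#define CPL_RINGS        64
#define CPL_RING_BYTES   1024
#define CPL_RING_ENTRIES 32
#define CPL_ENTRY_BYTES  (CPL_RING_BYTES / CPL_RING_ENTRIES)
#define CPL_BUF_BYTES    (CPL_RINGS * CPL_RING_BYTES)

/* Block write + pointer writeback trigger: fires each time the producer
 * index crosses a half-ring boundary in the scratch buffer. */
static inline int cpl_flush_due(unsigned prod_idx)
{
    return (prod_idx % (CPL_RING_ENTRIES / 2)) == 0;
}
```

This gives 32-byte entries, with a flush every 16 entries (plus the event-generation trigger, which is handled separately).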

15 of 25

Variable-length descriptor support

  • Current queue management logic does not handle variable-length descriptors
  • Implement simplified queue managers (state storage only)
  • Implement descriptor fetch and header parsing logic
    • Read descriptors in blocks up to some block size, parse length/type fields and hand off to transmit/receive engines
    • Potential extensions to support LSO (descriptor duplication)
  • Status: working on supporting components

16 of 25

Current descriptor format

  • Small - 16 bytes
    • 32 bit length, 64 bit pointer
  • Fixed-size blocks for scatter-gather
    • Inflexible
    • Extra overhead for small packets

struct mqnic_desc {
    __u8 rsvd0[2];
    __u16 tx_csum_cmd;
    __u32 len;
    __u64 addr;
};

17 of 25

Descriptor framing format

  • Split framing from contents
    • Descriptor fetch only has to parse framing information
    • Descriptor format can be modified without changing fetch logic
  • 16 byte blocks, 2 byte header
    • 256 blocks / 4096 bytes max size
    • 1 byte type field to support multiple descriptor formats

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_block {
    __u8 rsvd[16];
};

struct desc {
    struct desc_hdr desc_hdr;
    __u8 rsvd[14];
    struct desc_block blocks[]; /* member name assumed; original omitted it */
};
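A minimal sketch of what the descriptor fetch logic has to do with this framing, assuming the 8-bit len field counts 16-byte blocks as len + 1 (the encoding that reaches the stated 256-block / 4096-byte maximum; the slide does not spell it out):

```c
#include <assert.h>
#include <stdint.h>

/* Framing only: 16-byte blocks, 2-byte header at the start of the
 * first block. This is a self-contained replica using stdint types. */
#define DESC_BLOCK_SIZE 16

struct desc_hdr {
    uint8_t len;  /* assumed: descriptor occupies (len + 1) blocks */
    uint8_t type; /* selects the descriptor format; fetch ignores it */
};

/* Total bytes the fetch logic must consume for one descriptor. */
static inline unsigned desc_total_bytes(const struct desc_hdr *hdr)
{
    return ((unsigned)hdr->len + 1) * DESC_BLOCK_SIZE;
}
```

With this encoding, len = 0 is a single 16-byte block and len = 255 is the 4096-byte maximum; the fetch logic never needs to understand the format-specific contents.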

18 of 25

Proposed descriptor format

struct desc_with_inline_data {
    struct desc_hdr desc_hdr;
    __u16 flags;
    __u32 opcode;
    __u8 data_seg_count;
    __u8 rsvd0;
    __u16 inl_data_len;
    __u8 rsvd1[4];
    union {
        struct desc_data_seg data;
        char inl_data[];
    } segs[] __attribute__ ((aligned(16)));
};

struct desc_hdr {
    __u8 len;
    __u8 type;
};

struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        __u8 rsvd0[2];
    };
    __u16 flags;
    __u16 tx_csum_cmd;
    __u16 len;
    __u64 addr;
};
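A quick sanity check that the proposed layout tiles the 16-byte framing blocks exactly: both the fixed head of the descriptor and a data segment should be exactly one block. The replica below uses stdint types and natural packing (an assumption, not a driver header):

```c
#include <assert.h>
#include <stdint.h>

struct desc_hdr {
    uint8_t len;
    uint8_t type;
};

/* One scatter-gather data segment: 2 + 2 + 2 + 2 + 8 = 16 bytes. */
struct desc_data_seg {
    union {
        struct desc_hdr desc_hdr;
        uint8_t rsvd0[2];
    };
    uint16_t flags;
    uint16_t tx_csum_cmd;
    uint16_t len;
    uint64_t addr;
};

/* Fixed head of desc_with_inline_data, before the segs[] array:
 * 2 + 2 + 4 + 1 + 1 + 2 + 4 = 16 bytes. */
struct desc_fixed_part {
    struct desc_hdr desc_hdr;
    uint16_t flags;
    uint32_t opcode;
    uint8_t data_seg_count;
    uint8_t rsvd0;
    uint16_t inl_data_len;
    uint8_t rsvd1[4];
};
```

Both structures come out to 16 bytes with no padding, so the 2-byte framing header and the aligned(16) segment array line up on block boundaries.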

19 of 25

Queue state storage

  • Some state must be stored in HW for every queue
    • Supporting 10K+ queues requires efficient storage
    • Utilize URAM blocks on US+ (4K x 64)
  • Need to add writeback pointer, VF index, and LSO offset
    • Can reclaim a couple of other fields
  • Status: Implemented queue state storage module

20 of 25

Queue state storage (current)

Field              TX/RX size   CQ/EQ size
Base addr          64           64
Head ptr           16           16
Tail ptr           16           16
CQ/EQ/IRQ index    16           16
Log queue size     4            4
Log block size     2            -
Arm                -            1
Arm cont           -            1
Enable             1            1
Op index           8            8
Total              127          127

URAM is 4096 x 64, so 2 URAM = 4096 queues
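The totals in the table can be checked mechanically; the sketch below sums the TX/RX-side field widths and confirms they fit in two 64-bit URAM words (field list transcribed from the table):

```c
#include <assert.h>

/* Per-queue state widths (TX/RX side) from the current-state table. */
static const unsigned txrx_state_bits[] = {
    64, /* base addr       */
    16, /* head ptr        */
    16, /* tail ptr        */
    16, /* CQ/EQ/IRQ index */
    4,  /* log queue size  */
    2,  /* log block size  */
    1,  /* enable          */
    8,  /* op index        */
};

static inline unsigned total_bits(const unsigned *w, unsigned n)
{
    unsigned i, sum = 0;
    for (i = 0; i < n; i++)
        sum += w[i];
    return sum;
}
```

The sum is 127 bits, which fits in 2 x 64-bit words; with URAM at 4096 entries deep, two URAMs hold state for 4096 queues.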

21 of 25

Queue state storage (proposed)

Field                       TX/RX size   CQ/EQ size
Base addr (4K align)        52           52
Producer ptr                16           16
Consumer ptr                16           16
CQ/EQ/IRQ index             16           16
Log queue size              4            4
Arm                         -            1
Enable                      1            1
Active                      1            1
VF index                    12 (6)       12 (6)
LSO offset                  16           -
Writeback addr (64B align)  58 (-)       58 (-)
Total                       192 (128)    178 (113)

URAM is 4096 x 64, so 3 URAM = 4096 queues

Can fit into 2 URAM with writeback disabled and 6-bit VF index

22 of 25

Management soft core

  • Soft core for board-level management
    • Handle board and device specific functions
    • Present unified interface to driver
    • Location: board-level or inside core?
  • API
    • Low-level: direct register access and other board-dependent diagnostics
    • Medium-level: somewhat abstracted operations
    • High-level: board- and FPGA-independent operations
  • Core: probably vexriscv
    • JTAG debug interface, can simulate with icarus verilog
  • Status: core bring-up

23 of 25

10G/25G switching

  • Run-time switching between 10G and 25G should be doable
    • Change speeds and mix and match speeds without reloading FPGA
    • Run two QPLLs, dynamically switch channel clock sources
    • Possibly can also switch EQ settings (DFE vs LPM)
  • Requires resetting both RX and TX
    • Previously blocked on FIFO reset rework and PHY integration updates
  • Status: Working from userspace, need to move to softcore FW
  • TODO: update transmit engine to handle dropped TX packets when PTP is enabled

24 of 25

Clock info register block

  • Report information about clock frequencies
    • High precision clock frequency measurements
    • Core clock
    • TX/RX clocks
  • Precise core clock frequency important for things like rate limiting
  • Status: TODO
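The usual way such a register block reports a frequency is a counter ratio: sample a free-running counter on the measured clock against one on a known reference, then scale. A sketch of the host-side arithmetic, with hypothetical inputs (the block is still TODO, so no register layout exists yet):

```c
#include <assert.h>
#include <stdint.h>

/* Counter-ratio frequency measurement: clk_ticks counted on the
 * measured clock while ref_ticks elapsed on a reference running at
 * ref_hz. Rounds to the nearest Hz. Inputs are hypothetical register
 * samples; beware 64-bit overflow for very long measurement windows. */
static inline uint64_t measured_hz(uint64_t clk_ticks, uint64_t ref_ticks,
                                   uint64_t ref_hz)
{
    return (clk_ticks * ref_hz + ref_ticks / 2) / ref_ticks;
}
```

For example, counting 322265625 core-clock ticks during 125000000 ticks of a 125 MHz reference (a 1-second window) yields 322265625 Hz, i.e. about 322.27 MHz.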

25 of 25

To-do list for stable release

  • Application section [done]
  • Register space reorganization [done]
  • Shared interface datapath [initial version done]
  • Self-contained checksum offloading (no checksum in desc)
  • Variable-length descriptor support [in progress]
  • On-FPGA management soft core (no board-specific kernel code)
  • Overall goal: stabilize driver/hardware interface