1 of 18

Corundum status updates

Alex Forencich

11/6/2023

2 of 18

Agenda

  • Announcements
  • Project ideas
  • Status updates

3 of 18

Announcements

  • Next Corundum developer meeting: November 20 at 9:00 AM PST
  • Next switch development meeting: November 13 at 9:00 AM PST

4 of 18

Project ideas

  • Some relatively self-contained project ideas:
    • Oversampled 1000BASE-X on a 10GBASE-R transceiver
    • Open-source 40G MAC+PCS
    • MAC+PHY+GT wrappers for switchable 1/10/25/40/100G
    • RISC-V management core
    • Open-source FT4232 JTAG adapter with Alveo connectors
    • Improved transmit scheduler (WFQ, rate limiting, etc.)
    • SR-IOV support
    • Multi-head support (multiple PCIe and/or AXI host interfaces)
    • Improved application section SoC support

5 of 18

Status update summary

  • New PTP subsystem progress
  • Design consolidation
  • Port splitting
  • Architectural changes
    • Single HW interface, split in SW
    • Switch integration
    • Additional app section passthrough
  • Potential shared development testbed

6 of 18

New PTP subsystem

  • Distribute time from PHC via one wire serial + PTP ref clock
  • Leaf clocks perform CDC into target clock domains
  • Supports truncated timestamp format from which full 96 bit timestamps can be reconstructed
  • Status:
    • Protocol mostly defined
    • PHC and leaf clock simulation models working
    • PHC and leaf clock HDL working in sim and in HW, tested on Corundum
    • Working on integration
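
The idea of rebuilding a full timestamp from a truncated one can be sketched in Python as below. This is a generic wrap-aware extension against a nearby full-width reference, not necessarily the scheme used by the PTP subsystem; the function name `reconstruct` and the 32 bit truncated width are assumptions made for illustration.

```python
def reconstruct(truncated: int, reference: int, bits: int = 32) -> int:
    """Extend a truncated timestamp using a nearby full-width reference.

    Hypothetical helper: `bits` is an assumed truncated width. The result is
    only correct if the true timestamp lies within half the wrap period
    (2**(bits-1) units) of `reference`.
    """
    modulus = 1 << bits
    # Distance from the reference's low bits to the truncated value, mod 2**bits.
    delta = (truncated - reference) & (modulus - 1)
    # Interpret that distance as signed so timestamps slightly older than
    # the reference are handled as well as slightly newer ones.
    if delta >= modulus >> 1:
        delta -= modulus
    return reference + delta


# The low 32 bits wrapped between taking the reference and the timestamp.
full = 0x1_0000_0005
print(hex(reconstruct(full & 0xFFFF_FFFF, 0x0_FFFF_FFF0)))
```

The half-period restriction is why any scheme of this kind needs a bounded reconstruction window.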

7 of 18

PTP TD protocol

  • 16 bit UART-like protocol
    • 1 start bit, 16 data bits, 0 stop bits, baud rate = ref clock freq
    • 17 clock cycles to cross parallel data to a different clock domain
  • 256 cycle update period
    • 256/17 ≈ 15.06, so each message can carry 14 16-bit words (238 cycles) and still leave 18 idle cycles, i.e. at least 1 full word of idles, for frame sync to work correctly
  • Format similar to GPS almanac – data sent at different rates
    • All information to set 64 bit relative TS sent in every message
    • Other portion of message cycles through 96 bit ToD TS information and timestamp reconstruction information
  • Timestamp reconstruction
    • “Current” and “alternate” offsets for ~500 ms reconstruction window
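
The framing arithmetic above can be illustrated with a small Python model. This is a sketch under assumptions, not the actual HDL: the line is assumed to idle high with an active-low start bit (the usual UART convention), data is assumed to be sent LSB first, and the names `serialize` and `deserialize` are hypothetical.

```python
IDLE, START = 1, 0          # assumed polarity: idle high, start bit low
BITS_PER_WORD = 17          # 1 start bit + 16 data bits, no stop bits
WORDS_PER_MSG = 14
PERIOD = 256                # cycles per message


def serialize(words):
    """Pack 14 16-bit words into one 256-cycle message, padded with idles."""
    assert len(words) == WORDS_PER_MSG
    bits = []
    for w in words:
        bits.append(START)
        bits.extend((w >> i) & 1 for i in range(16))  # LSB first (assumption)
    bits.extend([IDLE] * (PERIOD - len(bits)))        # 18 idle cycles
    return bits


def deserialize(stream):
    """Return the first message that follows a full word of idles, or None."""
    idle_run = 0
    for pos, bit in enumerate(stream):
        if bit == IDLE:
            idle_run += 1
            continue
        # A low bit after >= 17 idles can only be the first start bit of a
        # message: inside a message, at most 16 ones occur between start bits.
        if idle_run >= BITS_PER_WORD:
            if pos + WORDS_PER_MSG * BITS_PER_WORD > len(stream):
                return None
            words = []
            for n in range(WORDS_PER_MSG):
                base = pos + n * BITS_PER_WORD
                data = stream[base + 1: base + BITS_PER_WORD]
                words.append(sum(b << i for i, b in enumerate(data)))
            return words
        idle_run = 0
    return None


first = [0xFFFF] * WORDS_PER_MSG
second = [i * 0x1111 for i in range(WORDS_PER_MSG)]
print(deserialize(serialize(first) + serialize(second)) == second)
```

In this model, even an all-ones data word cannot be mistaken for the idle gap, because it is always bounded by start bits on both sides.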

8 of 18

PHC API updates

  • Integer ns relative-to-ToD offset requires shared fractional ns
    • Old PTP clock keeps this separate
  • Use case 1: coarse, non-precision set
    • Unpredictable latency makes a direct write of the time imprecise; no point in setting FNS
  • Use case 2: fine adjust period to slew clock
    • What the PTP time servo usually does
  • Use case 3: high precision atomic offsets
    • For synchronous/coherent mode (WR), cannot touch period register
    • Need to atomically apply offsets to the clock time (including FNS)
  • Add provisions for PTM

9 of 18

New PHC API

  • Read current time (FNS, relative TS, ToD TS, PTM)
  • Snapshot current time (simultaneous capture of all time sources)
  • Set period (8 bit ns + 32 bit fractional ns)
  • Offset FNS (32 bit fractional ns)
  • Set ToD timestamp (48 bit sec + 30 bit ns)
  • Offset ToD timestamp (30 bit signed ns)
  • Set relative timestamp (48 bit ns)
  • Offset relative timestamp (32 bit signed ns)
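
As an illustration of the period format (8 bit ns + 32 bit fractional ns), the Python sketch below converts a clock frequency into the two fields. The helper name `period_fields` is hypothetical and the rounding choice is an assumption; the actual register layout is not specified here.

```python
from fractions import Fraction


def period_fields(freq_hz: int):
    """Return (ns, fns) for a clock period as 8 bit ns + 32 bit fractional ns.

    Hypothetical helper; rounds to the nearest representable period.
    """
    period_ns = Fraction(10**9, freq_hz)
    fixed = round(period_ns * (1 << 32))   # 8.32 fixed-point value
    assert 0 < fixed < (1 << 40), "period does not fit in 8 bit ns + 32 bit FNS"
    return fixed >> 32, fixed & 0xFFFF_FFFF


print(period_fields(250_000_000))   # 4 ns exactly, zero fractional part
print(period_fields(322_265_625))   # 512/165 ns, needs the fractional bits
```

Exact rational arithmetic is used so that the rounding error stays within half a fractional-ns LSB.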

10 of 18

PTP TD PHC

  • Based on a state machine
    • Don’t need to compute new time values on every cycle, instead “jump” 256 cycles every 256 cycles
    • Small footprint – single 48 bit shared adder
  • Handle drift computation on every cycle
  • Simplified PPS output logic
    • Use shared adder to do some of the work, don’t need full FNS accumulator
  • Better handling of ToD offsets
    • Increment or decrement seconds field when necessary
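
Two of the points above can be sketched in Python: advancing the time in one 256-cycle step, and fixing up the seconds field after a ToD offset. This is a behavioral model under assumptions, not the state machine itself; `jump` and `apply_tod_offset` are hypothetical names, and the time is assumed to be held as a fixed-point integer.

```python
NS_PER_SEC = 1_000_000_000


def jump(time_fixed: int, period_fixed: int, cycles: int = 256) -> int:
    """Advance a fixed-point time by `cycles` periods in a single step."""
    return time_fixed + period_fixed * cycles


def apply_tod_offset(sec: int, ns: int, offset_ns: int):
    """Apply a 30 bit signed ns offset to a ToD timestamp (sec, ns)."""
    assert 0 <= ns < NS_PER_SEC
    # A 30 bit signed offset spans about +/-0.537 s, less than one second,
    # so at most one increment or decrement of the seconds field is needed.
    assert -(1 << 29) <= offset_ns < (1 << 29)
    ns += offset_ns
    if ns >= NS_PER_SEC:
        ns -= NS_PER_SEC
        sec += 1
    elif ns < 0:
        ns += NS_PER_SEC
        sec -= 1
    return sec, ns


period = (4 << 32) + 123        # 4 ns plus a small fractional part (8.32 format)
print(jump(0, period) == sum(period for _ in range(256)))
print(apply_tod_offset(10, 999_999_990, 20))
print(apply_tod_offset(10, 5, -10))
```

With a constant period, one jump of 256 periods gives the same result as 256 per-cycle additions, which is what lets the work be spread over a shared adder.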

11 of 18

PTP TD leaf

  • Two parts: reference clock in PTP clock domain, full PTP clock in dest clock domain
  • Deserialization logic generates delay-compensated sync pulse
  • Digital PLL to generate deskewed sampling signal
  • Time servo PID loop to control PTP clock period
  • Coarse phase detector compares TS bit 8 for rough phase lock
  • Lower 9 bits + 4 FNS bits subtracted for fine phase lock
  • Gain scheduling to improve dynamics
  • MSBs checked periodically and loaded if necessary
  • ToD TS derived from relative TS with minimal additional logic
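
The fine phase comparison can be modeled as a wrap-aware subtraction of the low bits. The Python sketch below is an assumption-laden illustration rather than the leaf clock logic: timestamps are assumed to be integers in units of 1/16 ns (so the low 13 bits are 9 ns bits plus 4 FNS bits), and `fine_phase_error` is a hypothetical name.

```python
PHASE_BITS = 9 + 4   # lower 9 ns bits + 4 fractional ns bits


def fine_phase_error(ref_ts: int, local_ts: int) -> int:
    """Signed phase error in 1/16 ns units, from the low 13 bits only.

    Valid only once the coarse lock has brought the two clocks to within
    half the 13 bit range of each other.
    """
    mask = (1 << PHASE_BITS) - 1
    diff = (ref_ts - local_ts) & mask
    # Reinterpret the modular difference as a two's complement value.
    if diff >= 1 << (PHASE_BITS - 1):
        diff -= 1 << PHASE_BITS
    return diff


print(fine_phase_error(1000 * 16 + 3, 998 * 16))   # reference ahead
print(fine_phase_error(8184, 8192))                # reference behind, across a wrap
```

A signed error of this kind is what a PID-style servo would take as its input; the loop gains and gain scheduling are not modeled here.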

12 of 18

Design consolidation

  • Reduce redundancy by merging similar designs
  • AU200/AU250/VCU1525: merged as AU200
    • Pin-compatible, only need to change placement constraints and add parameter to exclude CMS IP core
  • AU280/AU55, AU50/C1100?
    • Similar boards so merging may be possible, need to investigate further
  • 25G vs 100G
    • More flexible GT + MAC wrappers should enable 25G designs to be config variants of 100G designs (disable CMAC, set datapath config)

13 of 18

Port splitting: clocking

  • Support splitting a 100G port into four 25G (or slower) ports
  • RX clocking
    • CMAC RX streaming IF currently runs in TX clock domain, but this should be simple to change to RX recovered clock to match 25G MAC
    • Switched all 100G designs to use RX recovered clock: works; done
    • Tested using separate GTY RX clocks: works; holding off
  • TX clocking
    • CMAC expects all GTYs to share the same TX clock
    • BUFG_GT only has one input, unlike BUFGMUX
    • Perhaps we can add phase compensation FIFOs so the channels can run with separate TX clocks

14 of 18

Port splitting: streaming interfaces

  • How to present split interface to Corundum?
  • Option 1: single streaming interface
    • Merge everything into single streaming interface
    • Problem: head-of-line blocking, would have to interleave
  • Option 2: concatenated interface
    • Single 512 bit interface, acts as 1 at 100G and 4x 128 bit when split
    • Problems: sideband data, clocking on async interfaces
  • Option 3: 100G interface + 25G interfaces
    • Single 512 bit interface + four 64/128 bit interfaces
    • Problem: logic resources, need two sets of streaming interfaces
  • Option 4: channel interfaces
    • Single 512 bit interface for 100G + lane 0 25G + three 64/128 bit interfaces
    • Problem: logic resources (but a bit more sharing), need two sets of streaming interfaces
  • Another consideration: frame preemption?
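
For option 2, the data-path view of a concatenated interface can be sketched in Python as slicing one 512 bit word into four 128 bit lanes. The lane ordering (lane 0 in the least significant bits) and the function names are assumptions; sideband signals and clocking, which are the actual problems listed above, are not modeled.

```python
LANE_BITS = 128
LANES = 4


def split_lanes(word512: int):
    """Split a 512 bit word into four 128 bit lanes (lane 0 = LSBs, assumed)."""
    mask = (1 << LANE_BITS) - 1
    return [(word512 >> (LANE_BITS * i)) & mask for i in range(LANES)]


def merge_lanes(lanes):
    """Inverse of split_lanes: concatenate four 128 bit lanes into 512 bits."""
    assert len(lanes) == LANES
    return sum(lane << (LANE_BITS * i) for i, lane in enumerate(lanes))


word = (0xD << 384) | (0xC << 256) | (0xB << 128) | 0xA
print([hex(lane) for lane in split_lanes(word)])
```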

15 of 18

Single HW interface

  • Currently, Corundum supports multiple HW interfaces, with each interface potentially connecting to multiple ports
  • Is this really necessary?
    • New queue management logic trades area for performance
    • Dynamic queue allocation in driver + RX indirection tables mean software can smoothly expose each port as an OS-level interface
    • With embedded switch, it’s likely going to make even more sense to manage this sort of thing in software
    • Port splitting could also result in having a lot of ports and/or a variable number of ports, even without embedded switch

16 of 18

Switch integration

  • Sharing code between NIC and switch opens up some interesting possibilities
  • Switchable/splittable 10G/25G/100G ports
    • How to handle splitting ports with application section interfaces?
  • E-switch for internal routing for SR-IOV, etc.
    • Additional internal switch ports for application section?
  • Switch likely will use additional clocks
    • Would the application section interface clocking change?
    • Should the PCIe clock be decoupled from the core clock?

17 of 18

App section passthrough

  • Some applications require more “stuff” to get passed through to application section
  • Can this be done with macros?
    • Probably need to `include two files
      • Port definitions (included in module port list)
      • Port connections (included in instance port list)
      • What about parameters?

18 of 18

Potential shared development testbed

  • Hardware:
    • Several host machines
    • Various NICs and PCIe-form-factor FPGA boards
    • 2x HTG-9200 boards (9x QSFP28)
    • ONT-603 100G network tester, possibly other test equipment
    • Arista 7060CX 32 port 100G packet switch
    • 2x Polatis 16x16 optical switches as scriptable patch panel
  • Software:
    • Less clear at the moment
    • Current idea: diskless hosts, users can set up their own images and boot them on the hosts (tools shared via NFS)