1 of 36

HPC Networking

Jenett Tillotson

Senior HPC Systems Engineer

National Center for Atmospheric Research

February 2024

2 of 36

Agenda

  • How networks forward things
  • Network topologies
  • Cluster interconnects
  • Cluster management
  • Troubleshooting
  • Performance

3 of 36

IP Addresses

  • IPv4 addresses are just 32-bit values
  • Commonly expressed as “dotted quads”
    • 4x 8-bit sequences as bytes separated by a dot
  • Subnet masks are dotted quads that, in binary, consist of a run of 1s followed by a run of 0s
    • /24 = 255.255.255.0, /26 = 255.255.255.192
  • In an IP/mask pair, the bits masked as 1 are the subnet, and the bits masked as 0 are the host

4 of 36

IP Address Example

  • 192.168.1.10/24 (or netmask 255.255.255.0)
    • 11000000.10101000.00000001.00001010
    • 11111111.11111111.11111111.00000000
  • All other IPs starting with 192.168.1 are part of the same subnet
  • In this case the last octet ranges from 0 to 255, giving 254 usable hosts (256 addresses minus the first and last)
  • First address (all 0s host) is the network address
    • 192.168.1.0
  • Last address (all 1s host) is the broadcast address
    • 192.168.1.255

5 of 36

More Complicated IP Address Example

  • 172.18.166.195/28 (netmask 255.255.255.240)
    • 10101100.00010010.10100110.11000011
    • 11111111.11111111.11111111.11110000
  • Usable host IPs
    • 172.18.166.193 - 172.18.166.206
    • 14 hosts (16 addresses minus the network and broadcast addresses)
  • First address (all 0s host) is the network address
    • 172.18.166.192
  • Last address (all 1s host) is the broadcast address
    • 172.18.166.207
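
A quick way to sanity-check these numbers is Python's standard-library ipaddress module. This short sketch reproduces both worked examples from the last two slides:

    import ipaddress

    # Both examples from the slides: a /24 and a /28 network.
    for cidr in ["192.168.1.10/24", "172.18.166.195/28"]:
        iface = ipaddress.ip_interface(cidr)   # host address plus mask
        net = iface.network                    # the enclosing subnet
        hosts = list(net.hosts())              # usable host addresses
        print(cidr)
        print("  netmask          ", net.netmask)
        print("  network address  ", net.network_address)
        print("  broadcast address", net.broadcast_address)
        print("  usable hosts     ", hosts[0], "-", hosts[-1], f"({len(hosts)})")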

6 of 36

ARP

  • Translates IP addresses to MAC addresses
  • Used between hosts on same subnet
  • Host A broadcasts for resolution
    • “Who has IP address 192.168.1.10?”
  • Host B replies unicast to host A
    • “192.168.1.10 is at aa:bb:cc:dd:ee:ff”
  • Cached on each host
    • For how long?
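
To watch this exchange happen, one option is the scapy packet library (a sketch, assuming scapy is installed and the script runs with raw-socket privileges; the IP below is just the example address from this slide):

    from scapy.all import ARP, Ether, srp

    # Broadcast "who has 192.168.1.10?" on the local segment and wait for
    # the unicast reply carrying the MAC address.
    query = Ether(dst="ff:ff:ff:ff:ff:ff") / ARP(pdst="192.168.1.10")
    answered, _ = srp(query, timeout=2, verbose=False)
    for _, reply in answered:
        print(reply.psrc, "is at", reply.hwsrc)

The resulting cache entries can then be inspected with “ip neigh” on Linux.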

7 of 36

Examples

8 of 36

Network topologies

  • All engineering requires managing trade-offs
  • Design goals:
    • Maximize throughput (bandwidth) between any two points
      • Limited by the lowest-bandwidth segment on the path between them
    • Minimize latency between any two points
      • Additional hops cause additional latency
    • Minimize financial cost
    • Redundancy is desirable

  • As usual: pick two of good, fast, and cheap

9 of 36

Daisy chain

10 of 36

Ring

11 of 36

Partial mesh

12 of 36

Spanning tree

I think that I shall never see

A graph more lovely than a tree.

A tree whose crucial property

Is loop-free connectivity […]

A mesh is made by folks like me,

Then bridges find a spanning tree.

  • Radia Perlman

Spanning Tree Protocol (STP) prevents loops in bridged networks by logically ("soft") disabling whichever links are needed to reduce the network graph to a loop-free tree
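
STP itself elects a root bridge and exchanges BPDUs, but the core idea (prune a meshed graph down to a loop-free tree) fits in a few lines of Python. This is a toy breadth-first search, not the real protocol:

    from collections import deque

    # Toy topology: four switches wired in a ring plus one cross link.
    links = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")}
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search from an arbitrary "root bridge"; the link used to
    # reach each switch for the first time joins the spanning tree.
    root = "A"
    visited, tree, queue = {root}, set(), deque([root])
    while queue:
        sw = queue.popleft()
        for peer in sorted(neighbors[sw]):
            if peer not in visited:
                visited.add(peer)
                tree.add(tuple(sorted((sw, peer))))
                queue.append(peer)

    blocked = {tuple(sorted(link)) for link in links} - tree
    print("forwarding links:", sorted(tree))
    print("blocked links:   ", sorted(blocked))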

13 of 36

Full mesh

14 of 36

Spine-Leaf/Fat Tree

15 of 36

Hypercube

Wikimedia Commons

16 of 36

Dragonfly

Wikimedia Commons

17 of 36

Cluster interconnects

  • Direct communication between cluster members
  • Must be very fast, and very reliable
  • Optimized for locality
  • RDMA is very desirable
  • Multiple options:
    • InfiniBand
    • Ethernet
    • OmniPath
    • Vendor proprietary

18 of 36

Ethernet

  • Ubiquitous, easy to find resources, can get support from network departments
  • Cheap – at least by comparison
  • Many vendors to choose from, extremely versatile
  • Typically higher latency and higher complexity
  • RDMA quite proprietary, and uncommon
  • Required for other cluster aspects, anyway

19 of 36

InfiniBand

  • Popular in HPC, open standard
  • But effectively single-vendor: NVIDIA (which acquired Mellanox)
  • Rated by speed of lane and lane multiplier
    • Currently mostly 50 Gb/s lanes at 4x = 200 Gb/s (HDR)
  • Modern implementations use QSFP-family connectors, which use either fiber or DACs (direct attach copper cables)

20 of 36

InfiniBand cont.

Wikipedia

21 of 36

InfiniBand cont.

  • Architectures quite similar to Ethernet
    • Switches for intra-subnet, routers between subnets
  • Software defined, switch side is plug and play
  • Centrally managed via a subnet manager
    • Software that programs the IB switches
    • OpenSM on a host
    • Internal subnet manager in firmware

22 of 36

RDMA

  • Remote Direct Memory Access
  • Part of InfiniBand (most common) and other similar technologies
  • Exposed as APIs
  • Allows hosts to transfer memory contents directly between one another while the CPU churns away on compute
  • Available on Ethernet as RDMA over Converged Ethernet (RoCE)
    • Software RoCE when NIC doesn't support RoCE
    • RoCE doesn't require a subnet manager
    • NICs just work it out for themselves
  • Performance dependent on a lossless network

23 of 36

RDMA cont.

  • Create TCP/IP connection between hosts and figure out equivalent IB endpoints
  • Pin (register) memory on both hosts so the kernel won’t page or move it
  • Open an IB context
  • Use TCP/IP to exchange metadata about the transfer
  • Put in an IB work request to transfer the memory
  • Clean up
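
The sketch below mirrors those steps in Python-flavored pseudocode. The ib_* helpers and their attributes are hypothetical stand-ins for real verbs calls (ibv_open_device, ibv_reg_mr, ibv_post_send, and friends in libibverbs), not an actual API:

    import pickle
    import socket

    def rdma_write(peer_host, local_buffer):
        # 1. Ordinary TCP connection, used only for out-of-band setup
        #    (port number here is arbitrary).
        ctrl = socket.create_connection((peer_host, 18515))

        # 2-3. Open an IB context and pin (register) the buffer so the
        #      kernel will not page it out.  Hypothetical helpers.
        ctx = ib_open_context()
        mr = ib_register_memory(ctx, local_buffer)

        # 4. Exchange transfer metadata (remote address, rkey, queue pair
        #    details) over the TCP side channel.
        ctrl.sendall(pickle.dumps({"addr": mr.addr, "rkey": mr.rkey}))
        remote = pickle.loads(ctrl.recv(4096))

        # 5. Post an RDMA WRITE work request; the NICs move the data
        #    without involving either CPU, then signal completion.
        ib_post_rdma_write(ctx, mr, remote["addr"], remote["rkey"])
        ib_wait_for_completion(ctx)

        # 6. Clean up: deregister memory, close the context and the socket.
        ib_deregister_memory(mr)
        ib_close_context(ctx)
        ctrl.close()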

24 of 36

Proprietary Interconnects

  • Usually not interoperable, lock-in
  • Dependent on vendor support
  • Consider lifetime of system
    • Size of vendor?
    • Roadmap?
  • Scalability

25 of 36

Cluster management

  • Ethernet
  • Admin access
    • Software updates, troubleshooting
    • Monitoring
    • Lights out management of hardware
    • Stateless node boot
      • PXE boot, gets parameters via DHCP
      • Retrieves boot image via TFTP
      • Effectively gets re-imaged with every reboot

26 of 36

Other Local Network Access

  • Ethernet
  • User access
    • Direct job submission in some scenarios
    • Job monitoring
    • Result collection in some scenarios
  • File systems
    • NFS, CIFS
    • HPC network file systems

27 of 36

External access

  • Ethernet
  • Usually, only login nodes are world accessible
  • Usually, some sort of firewall in front of entire cluster (but not between cluster members!)
  • How are external file systems reached?
    • Specific to use case

28 of 36

Troubleshooting Layer 1

  • Is the link light on?
  • Is the cable plugged in properly and seated?
  • Is the optic plugged in properly and seated?
  • Does the fiber need a pair flip?
  • Is the cable damaged?

29 of 36

More layer 1

  • Are the ports enabled administratively on both sides?
    • "ifconfig up"/“ip link set up” for hosts
    • ethtool for link properties on host
    • Switch side depends on vendor
  • Is the switch port error disabled?

  • Is subnet manager running for InfiniBand?

30 of 36

Ethernet layer 2 and above

  • Layer 2 adjacency
    • Link Layer Discovery Protocol (LLDP)
    • arp command on hosts
    • Maybe need to ping first
  • Layer 3 adjacency
    • ping command
  • Layer 4 connectivity
    • nc
    • telnet

31 of 36

InfiniBand layer 2 and above

  • General diagnostics
    • ibdiagnet
  • Layer 2 adjacency
    • ibnetdiscover
  • Topology verification
    • ibtopodiff
  • Port health
    • perfquery

32 of 36

Packet captures

  • Ethernet: tcpdump
  • InfiniBand: ibdump
  • Run tests from previous slides while capturing
  • Load capture up in Wireshark to analyze
  • Or ship to someone knowledgeable

33 of 36

Data transfer nodes

  • Nodes optimized for moving large amounts of data
  • Kernel and file system tuning significantly improves performance
  • Centralization of special file transfer software such as Globus Online
  • Sometimes overlaps with head nodes
  • Can be used to move data internally and externally
  • https://fasterdata.es.net/

34 of 36

Jumbo frames

  • Default Ethernet MTU (payload per frame) is 1500 bytes
  • Jumbo frames raise this to as much as 9000 bytes
  • No fragmentation at layer 2, frame size must be supported on entire path
  • Requires manual configuration on all switches and hosts in the layer 2 domain
  • Mismatches will be difficult to find and cause severe problems
  • So why do it?
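
The usual answer is per-packet overhead: larger frames mean fewer headers to send and far fewer packets per second for hosts and switches to process. A rough back-of-the-envelope comparison (plain arithmetic, ignoring Ethernet framing, not numbers from any particular cluster):

    LINK_BPS = 100e9   # a 100 Gb/s link, as in the bandwidth delay product slide
    HEADERS = 40       # IPv4 (20) + TCP (20) header bytes per packet

    for mtu in (1500, 9000):
        payload = mtu - HEADERS          # TCP payload carried per packet
        efficiency = payload / mtu       # fraction of each packet that is data
        pps = LINK_BPS / 8 / mtu         # packets per second at line rate
        print(f"MTU {mtu}: {efficiency:.1%} payload, {pps / 1e6:.2f} Mpps")

At 100 Gb/s that works out to roughly 8.3 million packets per second with a 1500-byte MTU versus about 1.4 million with 9000-byte jumbo frames.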

35 of 36

Bandwidth delay product

  • TCP data needs to be ACK’d
  • Buffers cannot be freed until ACK happened
  • Links have bandwidth and a delay
  • Link bandwidth * RTT delay = data in flight that cannot have been ACK’d yet
  • If buffer is too small, you have to wait
  • 100 Gbps at 2 ms = 10^11 bps * 0.002 s = 2*10^8 bits = 25 MB
  • Default Linux RX/TX buffers are 6MB and 4MB
  • This gets much bigger for external delay tolerant networks (Internet)
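
The same arithmetic as a few lines of Python, easy to re-run for other link speeds or RTTs:

    def bdp_bytes(bandwidth_bps, rtt_seconds):
        """Bandwidth-delay product: bytes in flight before an ACK can return."""
        return bandwidth_bps * rtt_seconds / 8

    # The example from this slide: 100 Gb/s with a 2 ms round-trip time.
    bdp = bdp_bytes(100e9, 0.002)
    print(f"BDP: {bdp / 1e6:.0f} MB")   # 25 MB

    # Compare against the default Linux buffer sizes quoted above.
    for name, size_mb in (("rx", 6), ("tx", 4)):
        verdict = "large enough" if size_mb * 1e6 >= bdp else "too small"
        print(f"default {name} buffer {size_mb} MB: {verdict}")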

36 of 36

Benchmarking

  • Purpose is to test throughput under controlled circumstances
  • iperf3 / perfSONAR
    • Memory to memory, client to server
    • Introduce delay and bandwidth issues to see how network responds
  • bbcp
    • File copies between networked nodes
  • Establish baseline
  • Make changes
  • Test again
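
A minimal way to script a repeatable baseline is to drive iperf3 and parse its JSON output. This sketch assumes iperf3 is installed, an “iperf3 -s” server is already running on the example host dtn01, and that receive throughput sits under end.sum_received in the JSON (true for current iperf3 versions with TCP):

    import json
    import subprocess

    # Ten-second client test with machine-readable JSON output (-J).
    result = subprocess.run(
        ["iperf3", "-c", "dtn01", "-t", "10", "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)

    gbps = report["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"throughput: {gbps:.2f} Gb/s")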
