1 of 43

Vernard Martin, Solutions Architect and Infrastructure Team Lead
Scientific Computing & Bioinformatics Services
Office of Advanced Molecular Detection
Centers for Disease Control & Prevention

Data Movement Hardware & Software

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/).

2 of 43

Data Transfer Node

  • A DTN server is made of several subsystems. Each needs to perform optimally for the DTN workflow:
    • Storage: capacity, performance, reliability, physical footprint
    • Networking: protocol support, optimization, reliability
    • Motherboard: I/O paths, PCIe subsystem, IPMI
    • Chassis: adequate power supply, extra cooling

  • Note: the workflow we are optimizing for here is sequential reads/writes of large files and a moderate number of high-bandwidth flows.

  • We assume this host is dedicated to data transfer, and not doing data analysis/manipulation


3 of 43

Motherboard and Chassis selection

  • Chassis
    • Extra cooling (for future expansion, unless you buy the system fully populated)
    • Make sure the power supply is adequate
  • Motherboard/CPU
    • Cascade Lake/Alder Lake/Raptor Lake or newer CPU architecture (e.g. for 10/40G you can get away with a reasonably modern machine)
      • A high clock rate is better than a high core count for DTNs – max this out
      • Faster QPI/UPI (inter-socket link) for communication between processors
    • PCI Gen 3 or newer (40G and 100G require PCI Gen 3)
    • Memory speed – faster is better, more is better (a quick verification sketch follows this list)
      • We recommend 128GB of RAM for a DTN node
      • Balancing CPU, RAM, and I/O sometimes means more nodes rather than larger ones
    • IPMI for remote management (nominally optional, but you really want it)
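
Once a system is in hand, it is worth verifying that the CPU clocks and memory channels match what was ordered. A minimal sketch, assuming a Linux host with root access (output formats vary by vendor):

# verify CPU model, clock rate, and socket/core counts
lscpu | grep -E 'Model name|MHz|Socket|Core'
# verify that all memory channels are populated and running at the rated speed
dmidecode -t memory | grep -E 'Size|Speed'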


4 of 43

PCI Slot Considerations

  • PCI slots are defined by:
    • Slot width:
      • Physical card and form factor
      • Max number of lanes
    • Lane count:
      • Total bandwidth scales with the number of lanes times the per-lane bandwidth (set by the PCIe generation)
      • Most cards will still run, just slower, in a narrower or older slot
      • Not all cards will use all lanes


5 of 43

PCI Slot Considerations

  [Photo: an x8 card above an x8 slot, with a wider x16 slot at the bottom for comparison]


6 of 43

PCI Bus Considerations

  • Example (the lspci check below shows how to verify what a slot actually negotiated):
    • 10GE NICs require an 8 lane PCIe-2 slot
    • 40G/QDR NICs require an 8 lane PCIe-3 slot
    • 100G NICs (e.g. Mellanox) require a 16 lane PCIe-3 slot
    • Most RAID controllers require an 8 lane PCIe-2 slot
    • Fusion-IO cards may require a 16 lane PCIe-3 slot
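
To confirm that a card actually negotiated the expected PCIe generation and lane count, compare its link capability and link status. A sketch (the PCI address 81:00.0 is just a placeholder; find yours with the first command):

# locate the NIC (or RAID controller) on the PCI bus
lspci | grep -i ethernet
# LnkCap shows what the card supports; LnkSta shows what was actually negotiated
sudo lspci -vv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'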


7 of 43

Storage Architectures - Internal

  • DTN with internal RAID is self-contained
    • Same CPU, RAM, etc. as DTN with external storage
    • No external dependencies for storage
    • Deployable anywhere
    • Limited scalability
    • Storage managed locally (you get whatever tools the RAID controller gives you)

8 of 43

Storage Architectures - External

  • These are essentially the same from a DTN host design perspective
    • IB, Ethernet, or Fibrechannel card connects to external storage
    • Other system components (CPU, RAM, etc.) the same
    • Central storage management, greater flexibility
    • Integration with other large-scale resources (e.g. HPC)

9 of 43

DTNs in a Facility

[Diagram: DTNs attached to the 40G/100G WAN-facing path, with three 10G downstream links into the facility]

10 of 43

DTNs in a Facility – Network Paths

[Diagram: same topology, highlighting the network paths between the 40G/100G connection and the 10G downstream links]

11 of 43

DTNs in a Facility – Security

[Diagram: same topology, highlighting where security controls sit relative to the 40G/100G connection and the 10G downstream links]

12 of 43

Storage Subsystem Selection

  • Deciding what storage to use in your DTN is based on what you are optimizing for:
    • performance, reliability, capacity, and/or cost
  • SATA disks historically have been cheaper and higher capacity, while SAS disks typically have been the fastest.
    • Technologies have been converging (and it's hard to keep up)
  • Do what you can support well (ditto for filesystems – ZFS, ext4, etc.)
    • Support is more than just what a vendor supports
    • Internal talent can be the deciding factor more often than not


13 of 43

SSDs and HDs

  • SSD storage costs much more than traditional hard drives (HDs) but is much faster. SSDs come in different styles:
    • PCIe card: some vendors (e.g. Fusion-IO) build PCIe cards with SSDs.
      • These are the fastest type of SSD: up to several GBytes/sec per card.
        • Note that this type of SSD is typically not hot-swappable.
    • HD replacement: several vendors now sell SSD-based drives in the same form factor as traditional SAS and SATA drives.
      • The downside to this approach is that performance is limited by the RAID controller, and not all controllers work well with SSDs.
        • Be sure that your RAID controller is “SSD capable”.
    • NVMe is the new kid on the block, and prices are coming down.

  • Note that the price of SSD is coming down quickly, so an SSD-based solution may be worth considering for your DTNs.


14 of 43

SSD Form Factors


15 of 43

RAID Controllers

  • RAID controllers are often optimized for a given workload, rarely for raw performance.
  • RAID0 is the fastest of all RAID levels but is also the least reliable.
  • The performance of a RAID controller is a function of the number of drives and its own processing engine.


16 of 43

RAID Controller

  • Be sure your RAID controller has the following:
    • >= 1GB of on-board cache
    • PCIe Gen3 support (and a motherboard slot that can actually drive it at Gen3)
    • dual-core RAID-on-Chip (ROC) processor if you will have more than 8 drives


17 of 43

Network Subsystem Selection

  • There is a huge performance difference between cheap and expensive 10G/40G NICs.
    • In other words, don't cheap out on the NIC – it's important for an optimized DTN host.
  • NIC features to look for include:
    • support for interrupt coalescing
    • support for MSI-X
    • TCP Offload Engine (TOE)
    • support for zero-copy protocols such as RDMA (RoCE or iWARP); the ethtool commands after this list show how to inspect these features
  • Note that many 10G/40G/100G NICs come with dual ports, but using both ports at the same time does not double the performance; often the second port is meant as a backup. Examples:
    • Myricom 10G-PCIE2-8C2-2S
    • Mellanox MCX312A-XCBT
    • Mellanox ConnectX-3 , ConnectX-4
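
To see which of these features a particular NIC exposes and how it is currently configured, ethtool is the usual starting point. A sketch (replace ethN with your interface name):

# list offload and feature flags (checksum/segmentation offloads, scatter-gather, etc.)
ethtool -k ethN
# show current interrupt coalescing settings
ethtool -c ethN
# show ring buffer sizes (often worth increasing on fast NICs)
ethtool -g ethN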


18 of 43

Reference Implementation (2023)

  • Hardware description
  • The total cost of this server was around $25K in late 2022. These systems were deployed at ESnet in late 2022 and into 2023 for ESnet6.
  • Base System:  Supermicro 2124US-TNRP 2U dual AMD socket SP3 server
    • onboard VGA, dual 10G RJ45, dual 10G SFP+, onboard dedicated IPMI RJ45
    • 1 PCI-E 4.0 x16 slot,
    • 24 front access NVME hotswap bays
    • dual redundant hotswap 1200W PSU
  • 2x AMD EPYC Milan 73F3
    • 16 cores each
    • 3.5 GHz, 240W TDP processor
  • 256 GB RAM - 16x 16G DDR4 3200 ECC RDIMM
  • 800G System Disk: 2x Micron 7300 MAX 800G U.2/2.5" NVME
  • 25TB Data Disk: 10x Micron 9300 MAX 3.2TB U.2/2.5" NVME
  • NVIDIA MCX613106A-VDAT ConnectX-6 EN Adapter Card 200GbE
  • Mellanox MMA1L10-CR Optical Transceiver 100GbE QSFP28 LC-LC 1310nm LR4 up to 10km
  • OOB license for IPMI management
  • 2x 1300W -48V DC PSU   OR   2x 1200W AC PSU


19 of 43

Reference Implementation (2020)

  • Hardware description
  • The total cost of this server was approximately $21K in mid 2019
  • Base System: Gigabyte R281-NO0 dual socket P 2U server
    • Onboard: VGA, 2 x GbE RJ45 Intel i350, IPMI dedicated LAN
    • 24 x front access U.2 hotswap bays
    • 2 x rear access 2.5” SATA hotswap bays
    • Dual redundant hotswap 1600W PSU
  • 2 x Intel Cascade Lake Xeon Gold 6246
    • 12 cores each
    • 3.3GHz 165W TDP processor
  • 12 x 16G DDR4 2933 ECC RDIMM (192G total)
  • 10 x Intel P4610 1.6TB U.2/2.5” PCIe NVMe 3.0 x4 Drives (connect directly to CPU for VROC)
  • 2 x Enterprise 960G 2.5“ SATA SSD (OS, onboard Intel SATA Raid 1)
  • Intel® Virtual RAID On CPU (VROC): RAID 0, 1, 10, 5
  • Mellanox ConnectX-5 EN MCX516A-CCAT 40/50/100GbE dual-port QSFP28 NIC


20 of 43

Reference Implementation (2020)

Preliminary System Performance Results for this configuration

Initial system performance was evaluated using two identical units and a topology that consisted of:

    • Back-to-back 40Gbps active cable
    • 40Gbps local to 100Gbps WAN connection(s) with an RTT of 40ms (Berkeley CA to Chicago IL, round trip)

The testing was performed on Ubuntu 18.04 using XFS on mdraid with O_DIRECT and standard POSIX I/O. The file sizes varied between 256 GB and 1 TB, with consistent results between file sizes, and consistent results when writing pseudo-random data vs. all-zeroes.

Testing revealed it was possible to maintain a sustained total of 80 Gbps across both connections of the NIC. Future testing will focus on the use of 100 Gbps cables and optics.


21 of 43

Reference Implementation (2020)

Disk configuration effects on performance

The test systems have employed several, somewhat more complicated, disk configurations (a minimal mdadm/XFS sketch of this style of layout follows the list):

  • Using RAID-0, the best results were 126 Gbps write and 140 Gbps read with 8 disks on a single controller (PCIe multiplexor).
    • Current work is to increase the number of disks to 10 (with the hope of reaching 155 Gbps of raw performance) by re-arranging the disk placement on the PCIe multiplexor.
    • PCIe bandwidth is roughly 1 GB/s per PCIe lane, so balancing drives across lanes is required to ensure peak performance.
  • The best preliminary storage results with redundancy use mirroring in a RAID-10. The raw performance (to disk) is the same as above, but the effective write performance drops to about 60 Gbps due to the resiliency requirement.
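
For reference, a minimal sketch of this style of mdraid RAID-0 plus XFS layout (device names, drive count, and mount point are illustrative, not the exact ESnet configuration):

# stripe 8 NVMe drives into a single RAID-0 md device
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme[0-7]n1
# build an XFS filesystem on it and mount it as the DTN data area
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /storage/data1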


22 of 43

Reference Implementation (2020)

CPU & Memory effects on performance

Testing has revealed that not all 12 cores (on a single chip) are regularly active unless the degree of parallelism increases; a busy DTN will push this requirement up. Populating all memory channels also ensures peak performance, as the system is able to fully utilize and balance memory load as needed.


23 of 43

Take Home Points

  • The two key hardware "take homes" are:
    • It needs to be expandable, and
    • It needs to be supportable
  • It needs to be able to seamlessly support data mobility


24 of 43

DTN Tuning is Art & Science


25 of 43

  • Defaults are not appropriate for performance.

  • What needs to be tuned:
    • BIOS
    • Firmware
    • Device Drivers
    • Networking
    • File System
    • Application


26 of 43

DTN Tuning

  • Tuning your DTN host is extremely important. We have seen overall IO throughput of a DTN more than double with proper tuning.
  • Tuning can be as much art as science. Due to differences in hardware, it's hard to give concrete advice.
  • Here are some tuning settings that we have found do make a difference.
  • This tutorial assumes you are running a RedHat-based Linux system, but other Unix flavors should have similar tuning knobs.
  • Note that you should always use the most recent version of the OS, as performance optimizations for new hardware are added to every release.


27 of 43

Network Tuning

# add to /etc/sysctl.conf
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
# set the default congestion control algorithm to htcp
net.ipv4.tcp_congestion_control = htcp

Add to /etc/rc.local to increase the send queue depth:

# increase txqueuelen
/sbin/ifconfig eth2 txqueuelen 10000
/sbin/ifconfig eth3 txqueuelen 10000

# make sure cubic and htcp are loaded
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic

# with the Myricom 10G NIC, using interrupt coalescing helps a lot:
/usr/sbin/ethtool -C ethN rx-usecs 75

This info (and more) is available on fasterdata, formatted for cut and paste.

Please consider jumbo frames, but don't just blindly turn them on (a sketch of enabling and validating them follows).
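
If every device in the path supports it, enabling and validating jumbo frames looks roughly like this (eth2 and the remote host name are placeholders):

# set a 9000 byte MTU on the data interface
ip link set dev eth2 mtu 9000
# verify end to end: 8972 = 9000 - 20 (IP header) - 8 (ICMP header), fragmentation disallowed
ping -M do -s 8972 -c 4 remote-dtn.example.org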

28 of 43

I/O Scheduler

  • The default Linux scheduler is the "fair" scheduler. For a DTN node, we recommend using the "deadline" scheduler instead.
  • To enable deadline scheduling on older kernels, add "elevator=deadline" to the end of the "kernel" line in your /boot/grub/grub.conf file, similar to this (current multi-queue kernels use the per-device knob shown below):
  • kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline
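
On current kernels that use the multi-queue block layer, the elevator= boot parameter no longer applies and the scheduler is set per device; the deadline-style scheduler is called mq-deadline. A sketch (sdb is a placeholder device name):

# show the available schedulers; the active one is in brackets
cat /sys/block/sdb/queue/scheduler
# switch to the deadline-style scheduler at runtime
echo mq-deadline > /sys/block/sdb/queue/scheduler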


29 of 43

Interrupt Affinity

  • Interrupts are triggered by I/O cards (storage, network). High performance means lots of interrupts per second
  • Interrupt handlers are executed on a core
    • The interrupt handler is just code – it gets run for every interrupt
    • Cache effects matter (with lots of I/O we're going to run that code a lot)
  • Depending on how interrupt routing is configured, either core 0 gets all the interrupts or interrupts are dispatched round-robin among the cores; both are bad for performance:
    • Core 0 gets all interrupts: with very fast I/O, that core is overwhelmed and becomes a bottleneck
    • Round-robin dispatch: the core that executes the interrupt handler very likely will not have the code in its L1 cache
    • Two different I/O channels may end up on the same core (see the pinning sketch below)
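
A common mitigation is to constrain or disable irqbalance and pin each NIC queue's interrupt to a core on the same socket as the card. A minimal sketch (the IRQ number 123 and core 2 are placeholders; find the NIC's IRQs in /proc/interrupts):

# list the interrupts belonging to the data interface
grep eth2 /proc/interrupts
# pin one of those IRQs to core 2
echo 2 > /proc/irq/123/smp_affinity_list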


30 of 43

Know Your Layout


  • Where are your cards?
  • Where are your cores?
  • Understand what your bindings mean physically
  • Performance of a given config can be tightly coupled to physical layout
  • Don't be afraid to experiment (the commands below help map cards to sockets and cores)
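
A sketch of mapping devices to sockets and cores (eth2 is a placeholder interface; lstopo comes from the hwloc package):

# print the machine topology, including PCI devices, cores, and caches
lstopo-no-graphics
# NUMA node the NIC is attached to (-1 means no NUMA information exposed)
cat /sys/class/net/eth2/device/numa_node
# cores local to that NIC
cat /sys/class/net/eth2/device/local_cpulist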

31 of 43

SSD Issues – Write Sparingly

  • Tuning your SSD is more about reliability and longevity than performance
    • Each flash memory cell has a finite lifespan that is determined by the number of "program and erase" (P/E) cycles
    • Without proper tuning, an SSD can die within months
    • Never run "write" benchmarks on an SSD: this will wear it out quickly
  • TRIM
    • Trim informs the SSD when the filesystem no longer needs space
    • Important to prolong the life of SSDs
    • Modern SSD drives and modern OSes should all include TRIM support
    • Only the newest RAID controllers included TRIM support as of late 2012
  • Swap
    • To prolong SSD lifespan, do not swap on an SSD
    • In Linux you can control this using the sysctl variable vm.swappiness, e.g. add this to /etc/sysctl.conf:
      • vm.swappiness=1
      • This tells the kernel to avoid swapping out pages whenever possible
  • Avoid frequent re-writing of files (for example when compiling code from source): use a ramdisk file system (tmpfs) for /tmp, /usr/tmp, etc. (see the sketch below)
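
A few of these recommendations as concrete settings, sketched for a typical systemd-based Linux distribution (the tmpfs size is an arbitrary example):

# periodic TRIM via the timer shipped with util-linux
systemctl enable --now fstrim.timer
# discourage swapping to the SSD
echo 'vm.swappiness=1' >> /etc/sysctl.conf
sysctl -p
# keep scratch churn in RAM: add a tmpfs /tmp to /etc/fstab
echo 'tmpfs /tmp tmpfs defaults,noatime,size=16G 0 0' >> /etc/fstab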


32 of 43

Benchmarking

  • Single-threaded sequential file write:
    • $ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
  • Single-threaded sequential file read:
    • $ dd if=/storage/data1/file1 of=/dev/null bs=4k
  • Run several instances concurrently to simulate a parallel workload (sketch below)
    • Use oflag=direct (for writes) or iflag=direct (for reads) to bypass the page cache
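
A sketch of a parallel, cache-bypassing variant of the same tests (the paths and the choice of 4 streams are arbitrary):

# four concurrent sequential writers, bypassing the page cache
for i in 1 2 3 4; do
  dd if=/dev/zero of=/storage/data1/file$i bs=1M count=32768 oflag=direct &
done
wait
# four concurrent sequential readers
for i in 1 2 3 4; do
  dd if=/storage/data1/file$i of=/dev/null bs=1M iflag=direct &
done
wait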


33 of 43

Sample Data Transfer Results (2005)

  • Using the right tool is very important
  • Sample Results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10Gbps.

Tool: Throughput
scp: 140 Mbps
HPN-patched scp: 1.2 Gbps
ftp: 1.4 Gbps
GridFTP, 4 streams: 5.4 Gbps
GridFTP, 8 streams: 6.6 Gbps

  • Note that getting more than 1 Gbps (125 MB/s) disk to disk requires RAID.


34 of 43

Say NO to SCP (2016)

  • Using the right data transfer tool is very important
  • Sample Results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.

  • Notes
    • scp is 24x slower than GridFTP on this path!!
    • Getting more than 1 Gbps (125 MB/s) disk to disk requires a RAID array
    • Assumes host TCP buffers are set correctly for the RTT


Tool: Throughput
scp: 330 Mbps
wget, GridFTP, FDT, 1 stream: 6 Gbps
GridFTP and FDT, 4 streams: 8 Gbps (disk limited)

35 of 43

Data Transfer Tools

  • Parallelism is key
    • It is much easier to achieve a given performance level with four parallel connections than one connection
    • Several tools offer parallel transfers
  • Latency interaction is critical
    • Wide area data transfers have much higher latency than LAN transfers
    • Many tools and protocols assume a LAN
    • Examples: SCP/SFTP, HPSS mover protocol


36 of 43

Why Not Use SCP or SFTP?

  • Pros:
    • Most scientific systems are accessed via OpenSSH
    • SCP/SFTP are therefore installed by default
    • Modern CPUs encrypt and decrypt well enough for small to medium scale transfers
    • Credentials for system access and credentials for data transfer are the same
  • Cons:
    • The protocol used by SCP/SFTP has a fundamental flaw that limits WAN performance
    • CPU speed doesn’t matter – latency matters
    • Fixed-size buffers reduce performance as latency increases
    • It doesn’t matter how easy it is to use SCP and SFTP – they simply do not perform
  • Verdict: Do Not Use Without Performance Patches


37 of 43

A Fix For scp/sftp

  • PSC has a patch set (the HPN patches) that fixes these problems in OpenSSH
  • Significant performance increase (allows the TCP window to open up if the host is tuned)
  • Advantage – this helps rsync too


38 of 43

sftp

  • Uses the same code as scp, so don't use sftp for WAN transfers unless you have installed the HPN patch from PSC
  • But even with the patch, SFTP has yet another flow control mechanism
    • By default, sftp limits the total number of outstanding messages to 16 32KB messages.
    • Since each datagram is a distinct message you end up with a 512KB outstanding data limit.
    • You can increase both the number of outstanding messages ('-R') and the size of the message ('-B') from the command line though.
  • Sample command for a 128MB window:
    • sftp -R 512 -B 262144 user@host:/path/to/file outfile
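    • (512 requests × 262,144 bytes = 134,217,728 bytes ≈ 128 MB in flight, roughly the bandwidth-delay product of a 10 Gbps path at 100 ms RTT)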


39 of 43

Commercial Data Transfer Tools

  • There are several commercial UDP-based tools
  • These should all do better than TCP on a lossy path
    • The advantage of these tools is less clear on a clean path

  • They all have different, fairly complicated pricing models


40 of 43

GridFTP

  • GridFTP from ANL has features needed to fill the network pipe
    • Buffer Tuning
    • Parallel Streams
  • Supports multiple authentication options
    • Anonymous
    • ssh
    • X509
  • Ability to define a range of data ports
    • helpful for getting through firewalls
  • Partnership with ESnet and Globus Online to support its use in Science DMZs (a sample globus-url-copy invocation follows)
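
As an illustration of the buffer-tuning and parallel-stream features, a sketch using the globus-url-copy client (hostnames and paths are placeholders; Globus Online exposes the same capabilities through its web interface):

# 4 parallel streams, 32 MB TCP buffers, reuse of data channels
globus-url-copy -p 4 -tcp-bs 33554432 -fast \
    file:///storage/data1/file1 gsiftp://dtn2.example.org/storage/data1/file1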



43 of 43

The End

© 2019, Engagement and Performance Operations Center (EPOC)
