1 of 43

Vernard Martin, Solutions Architect and Infrastructure Team Lead
Scientific Computing & Bioinformatics Services
Office of Advanced Molecular Detection
Centers for Disease Control & Prevention

Data Movement Hardware & Software

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC-ND 4.0 (https://creativecommons.org/licenses/by-nc-nd/4.0/).

2 of 43

Data Transfer Node

  • A DTN server is made of several subsystems. Each needs to perform optimally for the DTN workflow:
    • Storage: capacity, performance, reliability, physical footprint
    • Networking: protocol support, optimization, reliability
    • Motherboard: I/O paths, PCIe subsystem, IPMI
    • Chassis: adequate power supply, extra cooling

  • Note: the workflow we are optimizing for here is sequential reads/writes of large files and a moderate number of high-bandwidth flows.

  • We assume this host is dedicated to data transfer, and not doing data analysis/manipulation


3 of 43

Motherboard and Chassis selection

  • Chassis
    • Extra cooling (for future expansion, unless you buy the system fully populated)
    • Make sure the power supply is adequate
  • Motherboard/CPU
    • Cascade Lake/Alder Lake/Raptor Lake or newer CPU architecture (e.g. for 10/40G you can get away with a reasonably modern machine)
      • A high clock rate is better than a high core count for DTNs – max this out
      • Faster QPI/UPI (inter-socket link) for communication between processors
    • PCI Gen 3 or newer (40G and 100G require PCI Gen 3)
    • Memory speed – faster is better, more is better (a quick verification sketch follows this list)
      • We recommend 128GB of RAM for a DTN node
      • Balancing CPU, RAM, and I/O sometimes means more nodes rather than larger ones
    • IPMI for remote management (nominally optional, but you really want it)
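
Once a system is in hand, it is worth verifying that the CPU clocks and memory channels match what was ordered. A minimal sketch, assuming a Linux host with root access (output formats vary by vendor):

# verify CPU model, clock rate, and socket/core counts
lscpu | grep -E 'Model name|MHz|Socket|Core'
# verify that all memory channels are populated and running at the rated speed
dmidecode -t memory | grep -E 'Size|Speed'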


4 of 43

PCI Slot Considerations

  • PCI slots are defined by:
    • Slot width:
      • Physical card and form factor
      • Max number of lanes
    • Lane count:
      • Total bandwidth scales with the number of lanes times the per-lane bandwidth (set by the PCIe generation)
      • Most cards will still run, just slower, in a narrower or older slot
      • Not all cards will use all lanes


5 of 43

PCI Slot Considerations

  [Photo: an x8 card above an x8 slot, with a wider x16 slot at the bottom for comparison]


6 of 43

PCI Bus Considerations

  • Example (the lspci check below shows how to verify what a slot actually negotiated):
    • 10GE NICs require an 8 lane PCIe-2 slot
    • 40G/QDR NICs require an 8 lane PCIe-3 slot
    • 100G NICs (e.g. Mellanox) require a 16 lane PCIe-3 slot
    • Most RAID controllers require an 8 lane PCIe-2 slot
    • Fusion-IO cards may require a 16 lane PCIe-3 slot
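
To confirm that a card actually negotiated the expected PCIe generation and lane count, compare its link capability and link status. A sketch (the PCI address 81:00.0 is just a placeholder; find yours with the first command):

# locate the NIC (or RAID controller) on the PCI bus
lspci | grep -i ethernet
# LnkCap shows what the card supports; LnkSta shows what was actually negotiated
sudo lspci -vv -s 81:00.0 | grep -E 'LnkCap:|LnkSta:'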


7 of 43

Storage Architectures - Internal

  • DTN with internal RAID is self-contained
    • Same CPU, RAM, etc. as DTN with external storage
    • No external dependencies for storage
    • Deployable anywhere
    • Limited scalability
    • Storage managed locally (you get whatever tools the RAID controller gives you)

8 of 43

Storage Architectures - External

  • These are essentially the same from a DTN host design perspective
    • IB, Ethernet, or Fibrechannel card connects to external storage
    • Other system components (CPU, RAM, etc.) the same
    • Central storage management, greater flexibility
    • Integration with other large-scale resources (e.g. HPC)

9 of 43

DTNs in a Facility

[Diagram: DTNs attached to the 40G/100G WAN-facing path, with three 10G downstream links into the facility]

10 of 43

DTNs in a Facility – Network Paths

[Diagram: same topology, highlighting the network paths between the 40G/100G connection and the 10G downstream links]

11 of 43

DTNs in a Facility – Security

[Diagram: same topology, highlighting where security controls sit relative to the 40G/100G connection and the 10G downstream links]

12 of 43

Storage Subsystem Selection

  • Deciding what storage to use in your DTN is based on what you are optimizing for:
    • performance, reliability, capacity, and/or cost
  • SATA disks historically have been cheaper and higher capacity, while SAS disks typically have been the fastest.
    • Technologies have been converging (and it's hard to keep up)
  • Do what you can support well (ditto for filesystems – ZFS, ext4, etc.)
    • Support is more than just what a vendor supports
    • Internal talent can be the deciding factor more often than not


13 of 43

SSDs and HDs

  • SSD storage costs much more than traditional hard drives (HDs) but is much faster. SSDs come in different styles:
    • PCIe card: some vendors (e.g. Fusion-IO) build PCIe cards with SSDs.
      • These are the fastest type of SSD: up to several GBytes/sec per card.
        • Note that this type of SSD is typically not hot-swappable.
    • HD replacement: several vendors now sell SSD-based drives in the same form factor as traditional SAS and SATA drives.
      • The downside to this approach is that performance is limited by the RAID controller, and not all controllers work well with SSDs.
        • Be sure that your RAID controller is “SSD capable”.
    • NVMe is the new kid on the block, and prices are coming down.

  • Note that the price of SSD is coming down quickly, so an SSD-based solution may be worth considering for your DTNs.


14 of 43

SSD Form Factors


15 of 43

RAID Controllers

  • RAID controllers are often optimized for a given workload, rarely for raw performance.
  • RAID0 is the fastest of all RAID levels but is also the least reliable.
  • The performance of a RAID controller is a function of the number of drives and its own processing engine.


16 of 43

RAID Controller

  • Be sure your RAID controller has the following:
    • >= 1GB of on-board cache
    • PCIe Gen3 support (and a motherboard slot that can actually drive it at Gen3)
    • dual-core RAID-on-Chip (ROC) processor if you will have more than 8 drives


17 of 43

Network Subsystem Selection

  • There is a huge performance difference between cheap and expensive 10G/40G NICs.
    • In other words, don't cheap out on the NIC – it's important for an optimized DTN host.
  • NIC features to look for include:
    • support for interrupt coalescing
    • support for MSI-X
    • TCP Offload Engine (TOE)
    • support for zero-copy protocols such as RDMA (RoCE or iWARP); the ethtool commands after this list show how to inspect these features
  • Note that many 10G/40G/100G NICs come with dual ports, but using both ports at the same time does not double the performance; often the second port is meant as a backup. Examples:
    • Myricom 10G-PCIE2-8C2-2S
    • Mellanox MCX312A-XCBT
    • Mellanox ConnectX-3 , ConnectX-4
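
To see which of these features a particular NIC exposes and how it is currently configured, ethtool is the usual starting point. A sketch (replace ethN with your interface name):

# list offload and feature flags (checksum/segmentation offloads, scatter-gather, etc.)
ethtool -k ethN
# show current interrupt coalescing settings
ethtool -c ethN
# show ring buffer sizes (often worth increasing on fast NICs)
ethtool -g ethN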


18 of 43

Reference Implementation (2023)

  • Hardware description
  • The total cost of this server was around $25K in late 2022. These systems were deployed at ESnet in late 2022 and into 2023 for ESnet6.
  • Base System:  Supermicro 2124US-TNRP 2U dual AMD socket SP3 server
    • onboard VGA, dual 10G RJ45, dual 10G SFP+, onboard dedicated IPMI RJ45
    • 1 PCI-E 4.0 x16 slot,
    • 24 front access NVME hotswap bays
    • dual redundant hotswap 1200W PSU
  • 2x AMD EPYC Milan 73F3
    • 16 cores each
    • 3.5 GHz, 240W TDP processor
  • 256 GB RAM - 16x 16G DDR4 3200 ECC RDIMM
  • 800G System Disk: 2x Micron 7300 MAX 800G U.2/2.5" NVME
  • 25TB Data Disk: 10x Micron 9300 MAX 3.2TB U.2/2.5" NVME
  • NVIDIA MCX613106A-VDAT ConnectX-6 EN Adapter Card 200GbE
  • Mellanox MMA1L10-CR Optical Transceiver 100GbE QSFP28 LC-LC 1310nm LR4 up to 10km
  • OOB license for IPMI management
  • 2x 1300W -48V DC PSU   OR   2x 1200W AC PSU


19 of 43

Reference Implementation (2020)

  • Hardware description
  • The total cost of this server was approximately $21K in mid 2019
  • Base System: Gigabyte R281-NO0 dual socket P 2U server
    • Onboard: VGA, 2 x GbE RJ45 Intel i350, IPMI dedicated LAN
    • 24 x front access U.2 hotswap bays
    • 2 x rear access 2.5” SATA hotswap bays
    • Dual redundant hotswap 1600W PSU
  • 2 x Intel Cascade Lake Xeon Gold 6246
    • 12 cores each
    • 3.3GHz 165W TDP processor
  • 12 x 16G DDR4 2933 ECC RDIMM (192G total)
  • 10 x Intel P4610 1.6TB U.2/2.5” PCIe NVMe 3.0 x4 Drives (connect directly to CPU for VROC)
  • 2 x Enterprise 960G 2.5“ SATA SSD (OS, onboard Intel SATA Raid 1)
  • Intel® Virtual RAID On CPU (VROC): RAID 0, 1, 10, 5
  • Mellanox ConnectX-5 EN MCX516A-CCAT 40/50/100GbE dual-port QSFP28 NIC


20 of 43

Reference Implementation (2020)

Preliminary System Performance Results for this configuration

Initial system performance was evaluated using two identical units and a topology that consisted of:

    • Back-to-back 40Gbps active cable
    • 40Gbps local to 100Gbps WAN connection(s) with an RTT of 40ms (Berkeley CA to Chicago IL, round trip)

The testing was performed on Ubuntu 18.04 using XFS on mdraid with O_DIRECT and standard POSIX I/O. The file sizes varied between 256 GB and 1 TB, with consistent results between file sizes, and consistent results when writing pseudo-random data vs. all-zeroes.

Testing revealed it was possible to maintain a sustained total of 80 Gbps across both connections of the NIC. Future testing will focus on the use of 100 Gbps cables and optics.


21 of 43

Reference Implementation (2020)

Disk configuration effects on performance

The test systems have employed several, somewhat more complicated, disk configurations (a minimal mdadm/XFS sketch of this style of layout follows the list):

  • Using RAID-0, the best results were 126 Gbps write and 140 Gbps read with 8 disks on a single controller (PCIe multiplexor).
    • Current work is to increase the number of disks to 10 (with the hope of reaching 155 Gbps of raw performance) by re-arranging the disk placement on the PCIe multiplexor.
    • PCIe bandwidth is roughly 1 GB/s per PCIe lane, so balancing drives across lanes is required to ensure peak performance.
  • The best preliminary storage results with redundancy use mirroring in a RAID-10. The raw performance (to disk) is the same as above, but the effective write performance drops to about 60 Gbps due to the resiliency requirement.
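
For reference, a minimal sketch of this style of mdraid RAID-0 plus XFS layout (device names, drive count, and mount point are illustrative, not the exact ESnet configuration):

# stripe 8 NVMe drives into a single RAID-0 md device
mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/nvme[0-7]n1
# build an XFS filesystem on it and mount it as the DTN data area
mkfs.xfs /dev/md0
mount -o noatime /dev/md0 /storage/data1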


22 of 43

Reference Implementation (2020)

CPU & Memory effects on performance

Testing has revealed that not all 12 cores (on a single chip) are regularly active unless the degree of parallelism increases; a busy DTN will push this requirement up. Populating all memory channels also ensures peak performance, as the system is able to fully utilize and balance memory load as needed.


23 of 43

Take Home Points

  • The two key hardware "take homes" are:
    • It needs to be expandable, and
    • It needs to be supportable
  • It needs to be able to seamlessly support data mobility


24 of 43

DTN Tuning is Art & Science


25 of 43

  • Defaults are not appropriate for performance.

  • What needs to be tuned:
    • BIOS
    • Firmware
    • Device Drivers
    • Networking
    • File System
    • Application


26 of 43

DTN Tuning

  • Tuning your DTN host is extremely important. We have seen overall IO throughput of a DTN more than double with proper tuning.
  • Tuning can be as much art as science. Due to differences in hardware, it's hard to give concrete advice.
  • Here are some tuning settings that we have found do make a difference.
  • This tutorial assumes you are running a RedHat-based Linux system, but other Unix flavors should have similar tuning knobs.
  • Note that you should always use the most recent version of the OS, as performance optimizations for new hardware are added to every release.


27 of 43

Network Tuning

# add to /etc/sysctl.conf
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
# set the default congestion control algorithm to htcp
net.ipv4.tcp_congestion_control = htcp

Add to /etc/rc.local to increase the send queue depth:

# increase txqueuelen
/sbin/ifconfig eth2 txqueuelen 10000
/sbin/ifconfig eth3 txqueuelen 10000

# make sure cubic and htcp are loaded
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic

# with the Myricom 10G NIC, using interrupt coalescing helps a lot:
/usr/sbin/ethtool -C ethN rx-usecs 75

This info (and more) is available on fasterdata, formatted for cut and paste.

Please consider jumbo frames, but don't just blindly turn them on (a sketch of enabling and validating them follows).
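
If every device in the path supports it, enabling and validating jumbo frames looks roughly like this (eth2 and the remote host name are placeholders):

# set a 9000 byte MTU on the data interface
ip link set dev eth2 mtu 9000
# verify end to end: 8972 = 9000 - 20 (IP header) - 8 (ICMP header), fragmentation disallowed
ping -M do -s 8972 -c 4 remote-dtn.example.org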

28 of 43

I/O Scheduler

  • The default Linux scheduler is the "fair" scheduler. For a DTN node, we recommend using the "deadline" scheduler instead.
  • To enable deadline scheduling on older kernels, add "elevator=deadline" to the end of the "kernel" line in your /boot/grub/grub.conf file, similar to this (current multi-queue kernels use the per-device knob shown below):
  • kernel /vmlinuz-2.6.35.7 ro root=/dev/VolGroup00/LogVol00 rhgb quiet elevator=deadline
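
On current kernels that use the multi-queue block layer, the elevator= boot parameter no longer applies and the scheduler is set per device; the deadline-style scheduler is called mq-deadline. A sketch (sdb is a placeholder device name):

# show the available schedulers; the active one is in brackets
cat /sys/block/sdb/queue/scheduler
# switch to the deadline-style scheduler at runtime
echo mq-deadline > /sys/block/sdb/queue/scheduler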


29 of 43

Interrupt Affinity

  • Interrupts are triggered by I/O cards (storage, network). High performance means lots of interrupts per second
  • Interrupt handlers are executed on a core
    • The interrupt handler is just code – it gets run for every interrupt
    • Cache effects matter (with lots of I/O we're going to run that code a lot)
  • Depending on how interrupt routing is configured, either core 0 gets all the interrupts or interrupts are dispatched round-robin among the cores; both are bad for performance:
    • Core 0 gets all interrupts: with very fast I/O, that core is overwhelmed and becomes a bottleneck
    • Round-robin dispatch: the core that executes the interrupt handler very likely will not have the code in its L1 cache
    • Two different I/O channels may end up on the same core (see the pinning sketch below)
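
A common mitigation is to constrain or disable irqbalance and pin each NIC queue's interrupt to a core on the same socket as the card. A minimal sketch (the IRQ number 123 and core 2 are placeholders; find the NIC's IRQs in /proc/interrupts):

# list the interrupts belonging to the data interface
grep eth2 /proc/interrupts
# pin one of those IRQs to core 2
echo 2 > /proc/irq/123/smp_affinity_list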


30 of 43

Know Your Layout


  • Where are your cards?
  • Where are your cores?
  • Understand what your bindings mean physically
  • Performance of a given config can be tightly coupled to physical layout
  • Don't be afraid to experiment (the commands below help map cards to sockets and cores)
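
A sketch of mapping devices to sockets and cores (eth2 is a placeholder interface; lstopo comes from the hwloc package):

# print the machine topology, including PCI devices, cores, and caches
lstopo-no-graphics
# NUMA node the NIC is attached to (-1 means no NUMA information exposed)
cat /sys/class/net/eth2/device/numa_node
# cores local to that NIC
cat /sys/class/net/eth2/device/local_cpulist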

31 of 43

SSD Issues – Write Sparingly

  • Tuning your SSD is more about reliability and longevity than performance
    • Each flash memory cell has a finite lifespan that is determined by the number of "program and erase" (P/E) cycles
    • Without proper tuning, an SSD can die within months
    • Never run "write" benchmarks on an SSD: this will wear it out quickly
  • TRIM
    • Trim informs the SSD when the filesystem no longer needs space
    • Important to prolong the life of SSDs
    • Modern SSD drives and modern OSes should all include TRIM support
    • Only the newest RAID controllers included TRIM support as of late 2012
  • Swap
    • To prolong SSD lifespan, do not swap on an SSD
    • In Linux you can control this using the sysctl variable vm.swappiness, e.g. add this to /etc/sysctl.conf:
      • vm.swappiness=1
      • This tells the kernel to avoid swapping out pages whenever possible
  • Avoid frequent re-writing of files (for example when compiling code from source): use a ramdisk file system (tmpfs) for /tmp, /usr/tmp, etc. (see the sketch below)
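
A few of these recommendations as concrete settings, sketched for a typical systemd-based Linux distribution (the tmpfs size is an arbitrary example):

# periodic TRIM via the timer shipped with util-linux
systemctl enable --now fstrim.timer
# discourage swapping to the SSD
echo 'vm.swappiness=1' >> /etc/sysctl.conf
sysctl -p
# keep scratch churn in RAM: add a tmpfs /tmp to /etc/fstab
echo 'tmpfs /tmp tmpfs defaults,noatime,size=16G 0 0' >> /etc/fstab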


32 of 43

Benchmarking

  • Single-threaded sequential file write:
    • $ dd if=/dev/zero of=/storage/data1/file1 bs=4k count=33554432
  • Single-threaded sequential file read:
    • $ dd if=/storage/data1/file1 of=/dev/null bs=4k
  • Run several instances concurrently to simulate a parallel workload (sketch below)
    • Use oflag=direct (for writes) or iflag=direct (for reads) to bypass the page cache
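
A sketch of a parallel, cache-bypassing variant of the same tests (the paths and the choice of 4 streams are arbitrary):

# four concurrent sequential writers, bypassing the page cache
for i in 1 2 3 4; do
  dd if=/dev/zero of=/storage/data1/file$i bs=1M count=32768 oflag=direct &
done
wait
# four concurrent sequential readers
for i in 1 2 3 4; do
  dd if=/storage/data1/file$i of=/dev/null bs=1M iflag=direct &
done
wait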


33 of 43

Sample Data Transfer Results (2005)

  • Using the right tool is very important
  • Sample Results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10Gbps.

Tool: Throughput
scp: 140 Mbps
HPN-patched scp: 1.2 Gbps
ftp: 1.4 Gbps
GridFTP, 4 streams: 5.4 Gbps
GridFTP, 8 streams: 6.6 Gbps

  • Note that getting more than 1 Gbps (125 MB/s) disk to disk requires RAID.


34 of 43

Say NO to SCP (2016)

  • Using the right data transfer tool is very important
  • Sample Results: Berkeley, CA to Argonne, IL (near Chicago). RTT = 53 ms, network capacity = 10 Gbps.

  • Notes
    • scp is 24x slower than GridFTP on this path!!
    • Getting more than 1 Gbps (125 MB/s) disk to disk requires a RAID array
    • Assumes host TCP buffers are set correctly for the RTT


Tool: Throughput
scp: 330 Mbps
wget, GridFTP, FDT, 1 stream: 6 Gbps
GridFTP and FDT, 4 streams: 8 Gbps (disk limited)

35 of 43

Data Transfer Tools

  • Parallelism is key
    • It is much easier to achieve a given performance level with four parallel connections than one connection
    • Several tools offer parallel transfers
  • Latency interaction is critical
    • Wide area data transfers have much higher latency than LAN transfers
    • Many tools and protocols assume a LAN
    • Examples: SCP/SFTP, HPSS mover protocol


36 of 43

Why Not Use SCP or SFTP?

  • Pros:
    • Most scientific systems are accessed via OpenSSH
    • SCP/SFTP are therefore installed by default
    • Modern CPUs encrypt and decrypt well enough for small to medium scale transfers
    • Credentials for system access and credentials for data transfer are the same
  • Cons:
    • The protocol used by SCP/SFTP has a fundamental flaw that limits WAN performance
    • CPU speed doesn’t matter – latency matters
    • Fixed-size buffers reduce performance as latency increases
    • It doesn’t matter how easy it is to use SCP and SFTP – they simply do not perform
  • Verdict: Do Not Use Without Performance Patches


37 of 43

A Fix For scp/sftp

  • PSC has a patch set (the HPN patches) that fixes these problems in OpenSSH
  • Significant performance increase (allows the TCP window to open up if the host is tuned)
  • Advantage – this helps rsync too


38 of 43

sftp

  • Uses the same code as scp, so don't use sftp for WAN transfers unless you have installed the HPN patch from PSC
  • But even with the patch, SFTP has yet another flow control mechanism
    • By default, sftp limits the total number of outstanding messages to 16 32KB messages.
    • Since each datagram is a distinct message you end up with a 512KB outstanding data limit.
    • You can increase both the number of outstanding messages ('-R') and the size of the message ('-B') from the command line though.
  • Sample command for a 128MB window:
    • sftp -R 512 -B 262144 user@host:/path/to/file outfile
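    • (512 requests × 262,144 bytes = 134,217,728 bytes ≈ 128 MB in flight, roughly the bandwidth-delay product of a 10 Gbps path at 100 ms RTT)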


39 of 43

Commercial Data Transfer Tools

  • There are several commercial UDP-based tools
  • These should all do better than TCP on a lossy path
    • The advantage of these tools is less clear on a clean path

  • They all have different, fairly complicated pricing models


40 of 43

GridFTP

  • GridFTP from ANL has features needed to fill the network pipe
    • Buffer Tuning
    • Parallel Streams
  • Supports multiple authentication options
    • Anonymous
    • ssh
    • X509
  • Ability to define a range of data ports
    • helpful for getting through firewalls
  • Partnership with ESnet and Globus Online to support its use in Science DMZs (a sample globus-url-copy invocation follows)
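
As an illustration of the buffer-tuning and parallel-stream features, a sketch using the globus-url-copy client (hostnames and paths are placeholders; Globus Online exposes the same capabilities through its web interface):

# 4 parallel streams, 32 MB TCP buffers, reuse of data channels
globus-url-copy -p 4 -tcp-bs 33554432 -fast \
    file:///storage/data1/file1 gsiftp://dtn2.example.org/storage/data1/file1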



43 of 43

The End

© 2019, Engagement and Performance Operations Center (EPOC)
