
Linux Clusters Institute: HPC Storage: Part 1

J.D. Maloney | Sr. HPC Storage Engineer

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu


May 1-5, 2023

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).


Target Audience:

Those involved in designing, implementing, or managing HPC storage systems.

Outline:

  • Common Storage Concepts and Terms
  • Storage Related Goals & Requirements
  • Storage Hardware
  • Storage Software
  • Wrap Up


Concepts and Terms


What is Storage?

A place to store data, either temporarily or permanently.

  • Processor Cache
    • Fastest access; closest to the CPU; temporary (L1, L2)
  • System Memory (DRAM)
    • Very Fast access; close to CPU but not on it; temporary
  • Solid State Storage
    • Fast access (esp. random)
    • Can be internal or part of an external storage system
    • Capable of high densities with high associated costs
  • Spinning Disk
    • Slower; performance is tied to access behavior
    • Can be internal or part of an external storage system
    • Capable of extremely high densities
  • Network / Cloud storage
    • Network - Can scale from slow to extremely fast, high density
  • Tape
    • Extremely slow; typically used for cold storage


[Figure: storage hierarchy pyramid with the CPU at the top, then cache (L1, L2, L3), memory (DRAM, HBM), solid state disk (SATA SSD, M.2 module, PCIe card), spinning disks (PMR, SMR, HAMR/MAMR), and tape (LTO, TS11XX); bandwidth increases toward the CPU, while latency and capacity increase toward tape.]


Concepts and Terms

  • Throughput: storage capability, usually measured in GB/s (or TB/s)
  • IOPS: Input/Output Operations Per Second
  • RAID: Redundant Array of Independent Disks
  • JBOD: Just a Bunch of Disks
  • JBOF: Just a Bunch of Flash
  • Storage Server: provides direct access to storage devices and functions as the data manager for those disks
  • Storage Client: accesses data, but plays no role in data management
  • NAS: Network Attached Storage
  • SAN: Storage Area Network
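Throughput and IOPS are linked by the I/O transfer size; a quick sketch of the arithmetic (the IOPS figures and transfer sizes below are hypothetical, not measurements of any particular device):

```python
# Rough relationship between IOPS, transfer size, and throughput.
# All figures are illustrative, not measurements of a real system.

def throughput_gbps(iops: float, transfer_size_bytes: int) -> float:
    """Approximate sustained throughput in GB/s for a given IOPS rate."""
    return iops * transfer_size_bytes / 1e9

# A device doing 200,000 IOPS of 4 KiB random reads moves far less data
# than one streaming 1 MiB sequential reads at 2,000 IOPS.
print(throughput_gbps(200_000, 4 * 1024))    # ~0.82 GB/s
print(throughput_gbps(2_000, 1024 * 1024))   # ~2.10 GB/s
```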


Concepts and Terms

  • High Availability (HA)
    • Components are configured in failover pairs
    • Prevents a single point of failure in the system
    • Prevents a service outage
  • Failover Pairs
    • Active/Active
      • Both components share the load
      • On failure one component takes over the complete load
    • Active/Passive
      • One component services requests, the other is in standby
      • On failure the standby becomes active
  • Networks
    • InfiniBand (IB)
    • Ethernet (TCP/IP)
  • Host Connectivity
    • Host Bus Adapter (HBA)
    • Network Interface Card (NIC)


[Figure: high availability storage layout, with an active controller and a standby controller both attached to the same set of drive enclosures (active/passive failover).]


Concepts and Terms

  • Raw Space: what the disk label shows. Typically given in base 10.
    • 10TB (terabyte) == 10*10^12 bytes
  • Usable Space: what `df` shows once the storage is mounted. Typically given in base 2.
    • 10TiB (tebibyte) == 10*2^40 bytes

Usable space is often about 25% smaller than raw space

    • Depending on the RAID level you select, some storage is consumed by RAID overhead, file system overhead, etc.
    • Learning how to calculate this is a challenge; a rough sketch follows below
    • Dependent on the levels of redundancy and the file system you choose


File system overhead is applied after RAID overhead, further reducing the usable space. For example:

  • 3 drives (2 storage drives + 1 parity drive): 33% RAID overhead
  • 4 drives (3 storage drives + 1 parity drive): 25% RAID overhead
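A back-of-the-envelope sketch of the raw-to-usable estimate, assuming a RAID-6-style 8+2 layout and a flat 5% file system overhead (both numbers are illustrative, not a rule):

```python
# Back-of-the-envelope raw vs. usable capacity estimate.
# The 5% file system overhead figure is an assumption for illustration;
# real overhead depends on the file system and its configuration.

def usable_tib(num_drives: int, drive_tb: float, parity_drives: int,
               fs_overhead: float = 0.05) -> float:
    """Estimate usable TiB from drive count, labeled TB size, and RAID parity."""
    raw_bytes = num_drives * drive_tb * 1e12          # drive labels are base 10
    data_fraction = (num_drives - parity_drives) / num_drives
    usable_bytes = raw_bytes * data_fraction * (1 - fs_overhead)
    return usable_bytes / 2**40                        # report in base-2 TiB

# Ten 10 TB drives in an 8+2 (RAID 6-style) layout:
print(round(usable_tib(10, 10, 2), 1))  # ~69.1 TiB usable from 100 TB raw
```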


Goals and Requirements


How to Choose the Optimal Storage Solution?

  • Short Answer: Whichever solution meets your requirements

  • Long Answer: There is no single best solution for all scenarios

    • Each is designed to solve specific problems and serve specific requirements
    • Each works well when built and deployed according to its strengths
    • Usage requirements and access patterns define which is the best choice
      • Application Requirements
      • User Expectations
      • Budget Constraints
      • Expertise in the support team

  • Compromise based on competing needs is almost always the end result


Building a Balanced System

The Ideal:

    • All components of the system operate in perfect harmony such that speeds & feeds within the system all align perfectly

The Reality:

    • Competing needs will lead to compromises that cause imbalances in the system.

Common Imbalances:

  • Capacity is prioritized over bandwidth; the number of disks exceeds the performance capabilities of the controllers, disk interconnect, or HBAs
  • Overall output of the storage system is exceeded by the network capacity of the computational systems
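A quick sanity check like the following can catch the first imbalance before purchase; every number in it is hypothetical:

```python
# Sanity check: does aggregate disk bandwidth exceed what the
# controllers/HBAs can actually deliver? All numbers are hypothetical.

num_drives = 120
per_drive_mbps = 250            # sustained streaming per HDD (illustrative)
controller_gbps = 12.0          # aggregate controller/HBA limit (illustrative)

aggregate_disk_gbps = num_drives * per_drive_mbps / 1000
bottleneck = min(aggregate_disk_gbps, controller_gbps)

print(f"Disks can stream ~{aggregate_disk_gbps:.0f} GB/s, "
      f"controllers cap the system at ~{bottleneck:.0f} GB/s")
# With these numbers the enclosure is capacity-heavy: 30 GB/s of disk
# sitting behind 12 GB/s of controller.
```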


Requirements Evaluation

  • Stakeholders
    • Computational Users
    • Management
    • Policy Managers
    • Funding Agencies
    • System Administration Staff
    • Infrastructure Support Staff
    • IT Security Staff

  • Usage Patterns
    • Write dominant
    • Read dominant
    • Streaming I/O vs Random I/O

  • User Profiles
    • Expert vs. Beginner
    • Custom vs. Commercial application


  • I/O Profiles
    • Serial I/O
    • Random I/O
    • Parallel I/O
    • MapReduce I/O
    • Large Files
    • Small Files

  • Infrastructure Profile
    • Integrated with HPC resource
    • Standalone storage solution
    • Network connectivity
    • Security requirements


Gathering Stakeholder Requirements

  • Who are your stakeholders?
    • What features are they looking for?
    • How will people want to use the storage?
    • What usage policies need to be supported?
  • From what science/usage domains are the users?
    • What applications will they be using?
  • How much space do they anticipate needing?
  • Are there expectations of access from multiple systems?


  • What is the distribution of files?
    • Sizes, count
  • What is the typical I/O pattern?
    • How many bytes are written for every byte read?
    • How many bytes are read for each file opened?
    • How many bytes are written for each file opened?
  • Are there any system-based restrictions?
    • POSIX conformance - do you need a POSIX interface to the file system?
    • Limitations on number of files or files per directory
    • Network compatibility (IB, Eth, Slingshot)
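Where an existing system is available, a quick scan can put numbers behind the file-count and file-size questions above; a minimal sketch (the mount point and size buckets are arbitrary choices):

```python
# Minimal survey of file sizes under a directory tree to answer the
# "what is the distribution of files?" question. The path and size
# buckets below are arbitrary examples.
import os
from collections import Counter

def size_histogram(root: str) -> Counter:
    buckets = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                      # skip vanished/broken entries
            if size < 64 * 1024:
                buckets["<64 KiB"] += 1
            elif size < 1024**2:
                buckets["64 KiB-1 MiB"] += 1
            elif size < 1024**3:
                buckets["1 MiB-1 GiB"] += 1
            else:
                buckets[">=1 GiB"] += 1
    return buckets

print(size_histogram("/home"))   # hypothetical mount point
```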


Application I/O Access Patterns

  • An application’s I/O transaction sizes, and the order in which they are issued, define its I/O access pattern. This is a combination of how the application does I/O along with how the file system handles I/O requests.

  • For typical HPC file systems, sequential I/O of large blocks provides the best performance. Unfortunately, these types of I/O patterns aren’t the most common.

  • Understanding the I/O access patterns of your major applications can help you design a solution your users will be happy with.
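A crude way to feel the difference without special tooling is to time sequential versus random reads of the same file; a rough sketch is below (the path, request size, and count are placeholders, and a real study should use a benchmark such as IOR or elbencho, listed in the resources at the end):

```python
# Crude comparison of sequential vs. random read time on the same file.
# The path, block size, and request count are placeholders; the file is
# assumed to be at least COUNT * BLOCK bytes (and larger than page cache
# for a fair comparison).
import os, random, time

PATH = "/scratch/testfile"        # hypothetical large test file
BLOCK = 1024 * 1024               # 1 MiB requests
COUNT = 512

def timed_reads(offsets):
    with open(PATH, "rb", buffering=0) as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
        return time.perf_counter() - start

size = os.path.getsize(PATH)
seq = [i * BLOCK for i in range(COUNT)]
rnd = [random.randrange(0, size - BLOCK) for _ in range(COUNT)]

print("sequential:", timed_reads(seq), "s")
print("random:    ", timed_reads(rnd), "s")
```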


Common Data Access Patterns

  • Streaming (bandwidth centric)
    • Records are accessed only once; the file is read/written from beginning to end
    • Minimal overall IOPS
    • Files tend to be large and performance is measured in bandwidth
    • Common in Digital Media, Classic HPC, Scientific Applications, DSP
  • Discrete File I/O (IOP centric)
    • Small individual transactions; may not even read a full block at a time
      • Small files, random access
    • File IOPS can be high
    • Common in bioinformatics, AI, /home and /sw directories
  • Transaction Processing (IOP centric)
    • Small transactions with good temporal locality; individual updates may be smaller than a block, but consecutive transactions tend to fall in contiguous blocks
    • File IOPS can be high
    • Common in databases and commercial applications


HPC I/O Access Patterns

  • Traditional HPC
    • Streaming large block writes (low IOPS rates)
    • Large output files
    • Minimal metadata operations
  • More common today
    • Random I/O patterns (high IOPS rates)
    • Smaller output files
    • Large number of metadata operations

Challenges:

  • Choosing a block size that fits your application I/O pattern
  • IOPS become more important with random I/O patterns and small files


Gathering Data Requirements

  • Do you need different tiers or types of storage?
    • Active long-term (project space)
    • Temporary (scratch space, fast, not backed up)
    • Archive (cloud, disk or tape, cheaper)
    • Backups (snapshots, disk, tape)
    • Encryption (SED or per share/export)
  • Data Restrictions
    • HIPAA and PHI
    • CUI
    • PCI DSS
    • And many more (FISMA, ITAR, SOX, GLBA, CJIS, FERPA, SOC, …)
  • Data transfer characteristics


Training and Support Requirements

  • Training
    • Sys Admin Staff
      • How much training does your staff need?
      • Vendor supplied training?
      • Does someone on your staff have the expertise to provide training?
    • User Services
      • How much training does user support staff need?
      • Does your user support staff have the expertise to provide user training?
    • Users
      • How much training will your users need to effectively use the system?
      • How often will training need to be provided?


  • System Support
    • Does the vendor provide support for all components of your system?
    • Does support for parts of your system come from the open source community?
    • What are the support requirements for your staff?
      • 7x24
      • 8x5 M-F

    • Do you have Service Level Agreements (SLAs) with your user community?


Storage Hardware


Common Storage Building Blocks

  • Media
    • HDDs (SATA & SAS)
    • SSDs (SATA, SAS, NVMe)
    • Tape (LTO, TS11XX)
  • Media Formats
    • HDD - 3.5” (2.5” dying out, some 10/15K SAS remain)
    • SSD - 2.5” (varying thicknesses), U.2/3, E1.S/L, E3.S/L, M.2
    • Tape - LTO (Open Standard), TS11XX (IBM format)
  • Enclosures
    • JBOD & JBOF (Enclosures/Drawers)
    • Controller (Couplets)
    • Storage Servers


Common Connecting Fabrics

Network Fabrics

  • Ethernet
    • Speeds: 1Gb - 400GbE (800GbE soon)
    • RJ-45, SFP+, QSFP+, QSFP-DD
    • TCP/RoCE
  • InfiniBand
    • EDR, HDR, NDR (100Gb, 200Gb, 400Gb) are the modern versions in use today
    • RDMA support
  • Slingshot
    • HPE specific (at present)
    • Ethernet based with “extra stuff”


Other Storage Fabrics & Carriers

  • Fibre Channel & SAS
    • 24Gb SAS is now GA; 12Gb still common
    • 64GFC is now available (6.4GB/s per direction)
  • PCIe
    • Gen 4 (16GT/s per lane) ~32GB/s per direction in a x16 slot (available since 2019)
    • Gen 5 (32GT/s per lane) ~64GB/s per direction in a x16 slot (available in 2023)
    • Gen 6 (64GT/s per lane) ~128GB/s per direction in a x16 slot -- likely 2025/2026 GA
    • Carrier for CXL devices, NVMe drives, NICs, etc.


Common RAID Levels

  • Standard Levels
    • RAID 0: striping without mirroring or parity
    • RAID 1: data mirroring, without parity or striping
    • RAID 5: block-level striping with distributed parity
      • No loss of data with a single disk failure
      • Performance hit in degraded mode
    • RAID 6: block-level striping with dual distributed parity
      • No loss of data with two disk failures
      • Data loss occurs with third failure
      • Most common in larger systems today
  • Hybrid Levels
    • RAID 0+1: create two stripes then mirror
    • RAID 1+0: striping across mirrored drives
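For reference, the usable fraction of raw capacity for these levels reduces to a few lines of arithmetic (a sketch that ignores hot spares and file system overhead, and assumes 2-way mirrors):

```python
# Usable fraction of raw capacity for common RAID levels, ignoring
# hot spares and file system overhead; mirrored levels assume 2-way mirrors.

def usable_fraction(level: str, n_disks: int) -> float:
    if level == "0":              # striping only, no redundancy
        return 1.0
    if level in ("1", "10"):      # mirroring: half the capacity
        return 0.5
    if level == "5":              # one disk of distributed parity
        return (n_disks - 1) / n_disks
    if level == "6":              # two disks of distributed parity
        return (n_disks - 2) / n_disks
    raise ValueError(f"unknown RAID level {level}")

for lvl, n in [("5", 8), ("6", 8), ("6", 12), ("10", 8)]:
    print(f"RAID {lvl} on {n} disks: {usable_fraction(lvl, n):.0%} usable")
```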


Erasure Coding

  • Classic RAID levels are becoming (or already are) untenable given increasing drive sizes and current performance characteristics
    • The capacity of drives has outpaced their performance capabilities
  • Increased rebuild times mean we need a faster way to get data protected following a device failure
    • A RAID rebuild typically requires the new drive to write its entire capacity once, which creates a write bottleneck on a single device
  • This is leading to the use of erasure coding across many devices to greatly shorten the time to data protection
  • There are different erasure coding algorithms, but most operate on a common principle, sketched below
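That common principle is to split data into k data shards and compute m parity shards so that the loss of up to m shards is survivable; the simplest possible case, a single XOR parity shard (analogous to RAID 5), can be sketched as:

```python
# Simplest erasure code: k data shards plus one XOR parity shard.
# Any single lost shard can be rebuilt from the survivors. Real systems
# use Reed-Solomon-style codes so they can tolerate multiple failures.

def make_parity(shards: list[bytes]) -> bytes:
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]          # k = 3 data shards
parity = make_parity(data)                  # m = 1 parity shard

# Lose shard 1, then rebuild it from the remaining shards plus parity.
survivors = [data[0], data[2], parity]
rebuilt = make_parity(survivors)
assert rebuilt == data[1]
print("recovered:", rebuilt)
```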


Erasure Coding

  • An example with ZFS dRAID
    • It may not seem much different on the surface, but it is much more resilient than classic RAID methods


Erasure Coding

  • Improvements over “classical RAID”
    • All drives in the pool participate in a rebuild, which can even begin before the failed drive is replaced (depends on the solution/configuration)
    • Reconstruction of data can be ordered more efficiently: rebuild the most critical blocks first and move to the least critical
    • Large erasure-coded pools, especially on fast media, can achieve much higher usable-capacity efficiency without losing data protection
    • No more rebuild bottleneck on a single drive
  • A non-exhaustive list of solutions that use erasure coding
    • Ceph
    • ZFS dRAID
    • DDN SFA
    • VAST Data
    • IBM ESS


Data Transport from Server to Client

TCP (Transmission Control Protocol)

  • Great for the Internet and viewing cat videos, sub-optimal for high performance storage transport
    • A very robust method of moving data that ensures delivery no matter what and is resilient in poor network conditions
    • Parallel file systems are getting better at optimizing for standard Ethernet connections and TCP
  • Data has to flow through the OS (and its buffers), to the NIC (and its buffers), over the network to the other host, and then traverse the same stack again in reverse order
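A minimal sketch of that path: every sendall() below copies the application's buffer into the kernel's socket buffer before the NIC ever sees it, and the receiving host repeats the copies in reverse (host name, port, and file path are placeholders):

```python
# Minimal TCP sender: each sendall() copies data from user space into
# the kernel socket buffer, which the NIC then drains; the receiving
# host repeats the copies in reverse. Host, port, and path are placeholders.
import socket

CHUNK = 1024 * 1024                       # 1 MiB application buffer

def send_file(path: str, host: str = "storage-server", port: int = 5000):
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        while chunk := f.read(CHUNK):     # user-space read buffer
            sock.sendall(chunk)           # copy into the kernel socket buffer

# send_file("/scratch/output.dat")        # hypothetical file
```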


Data Transport from Server to Client

RDMA (Remote Direct Memory Access)

  • Requires a “lossless” network, be that InfiniBand, RoCE, Slingshot, etc.
  • Applications can transfer data from their memory to memory on a remote host directly, bypassing many of the layers in the stack
  • Improves file system throughput between servers and clients and reduces latency, which greatly helps metadata-intensive workloads as well as those that do many small I/O transactions


[Figure: TCP vs. RDMA data paths, contrasting the buffer copies through the OS and NIC with RDMA’s direct memory-to-memory transfer.]


Data Transport from Server to Client

GDS (GPU Direct Storage)

  • Allows storage traffic to bypass the host CPU and memory and copy straight into VRAM on the GPU
  • As network speeds and all-flash appliance speeds increase, it cuts out a bottleneck between your file system and getting data into the GPU’s memory for training/processing
  • Works with NVIDIA GPUs and many different storage platforms
    • If you have GPUs and need to buy storage, check whether the storage solution supports GDS
    • Part of NVIDIA’s Magnum IO family (in case you see that name in docs)


[Figure: GPUDirect Storage data path, with storage traffic copied directly into GPU memory instead of bouncing through host CPU and memory.]


Storage Software


File vs. Object Storage

File Storage

  • A block holds a chunk of data; files typically span multiple blocks
  • Data is accessed through a request to the block address
  • The file system controls access to blocks, imposes a hierarchical structure, and defines and updates metadata
  • File systems are generally POSIX compliant
  • Access is through the OS layer; the user has direct access to the file system
  • Storage systems are generally feature rich and externally based
  • Raw block storage is common in large-scale database applications, allowing direct block access


Object Storage

  • An object holds both data and its associated metadata
  • Data is accessed through a request for the object ID
  • Objects are stored in a flat structure
  • Metadata can be flexible and defined by the user
  • Resiliency is provided through multiple copies of the object, generally geographically distributed
  • Access is through a REST API, not the OS
  • Systems are generally built with commodity servers with internal disks
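The access-model difference is visible directly in code: file I/O is byte-level access through a path handled by the OS, while object I/O is a whole-object GET/PUT against a bucket and key over a REST API; a sketch using boto3 (the bucket, key, and paths are made-up examples, and credentials/endpoint are assumed to be configured already):

```python
# File access vs. object access side by side. The bucket, key, and
# paths are made-up examples; boto3 credentials/endpoint must already
# be configured for the object half to run.
import boto3

# POSIX file: byte-addressable, hierarchical path, handled by the OS.
with open("/project/data/results.csv", "rb") as f:
    file_bytes = f.read()

# Object: whole-object GET/PUT by key through a REST API, not the OS.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="data/results.csv", Body=file_bytes)
obj = s3.get_object(Bucket="my-bucket", Key="data/results.csv")
object_bytes = obj["Body"].read()
```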


Data Consistency

Strong Consistency

  • Block storage systems are strongly consistent
  • Typically used for real-time processing such as transactional databases
  • Good for data that is constantly changing
    • Updating a file only requires changing the blocks that have changed
  • Limited scalability especially within a geographically distributed system
  • Guarantees that a read request returns the most recent version of data


Eventual Consistency

  • Some object storage systems are eventually consistent
  • High availability for data that is relatively static and rarely altered
    • Updating an object requires a re-write of the entire object
  • Typical uses are for multimedia or unstructured data
  • There is no guarantee that a read request returns the most recent version of data


Parallel vs. Non Parallel File Systems

Characteristics of a Parallel File System

  • Multiple storage servers manage a single namespace
  • Speed can be increased by scaling out the number of storage servers and the disks they present
  • Handles consistent locking across files and directories accessed from multiple clients at the same time


Characteristics of Non-Parallel File Systems

  • Usually limited to a single storage server
  • Often file systems accessed only by the local machine (XFS, EXT4, etc.)
  • Can be presented to multiple clients over a network, but don’t/can’t enforce full POSIX locking compliance across all of them


Why HPC Uses Parallel File Systems

  • Clusters are made up of many nodes (10s, 100s, 1000s of nodes)
  • A single job can span many or all of the nodes in the system, and all of them need access to the same data
  • HPC jobs also may involve checkpointing and will need a shared file system to store checkpoint files
  • HPC systems generally work with and produce large amounts of data that can quickly/easily scale beyond the capability of single servers
  • The immense compute power of these machines needs fast file systems to keep them “fed” with data


Goals of Scale-Out File Systems


  • Access Transparency: clients are unaware that they are accessing remote data
  • Concurrency Transparency: all clients have the same view of the data state
  • Failure Transparency: clients continue to operate correctly after a server failure
  • Heterogeneity: clients/servers can be of different hardware and operating systems
  • Scalability: the file system should work as well at small client node counts as at large client node counts
  • Replication Transparency: any data replication should be invisible to clients
  • Migration Transparency: any data migration should be invisible to clients


Some Examples of File Systems Used in HPC: Lustre

Characteristics

  • Open source parallel distributed file system
  • Metadata services and storage are segregated from data services and storage

System Components

  • Metadata Servers (MDS) – serve metadata; client metadata access is through the MDS
  • Metadata Targets (MDT) – dedicated metadata storage
  • Management Servers (MGS) – Lustre cluster management
  • Management Targets (MGT) – management data storage
  • Object Storage Servers (OSS) – data servers; client data access is through the OSS
  • Object Storage Targets (OST) – dedicated object storage

Sources

  • lustre.org – community support for Lustre source
  • Productized by companies such as DDN and HPE (and others)
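On a Lustre client, the MDTs and OSTs behind a mount point can be listed with the `lfs df` command; a small wrapper sketch (the mount point is an example, and the Lustre client tools must be installed):

```python
# Query a Lustre client for its MDTs and OSTs using `lfs df`.
# The mount point is an example; requires the Lustre client tools.
import subprocess

def lustre_targets(mountpoint: str = "/mnt/lustre") -> list[str]:
    out = subprocess.run(["lfs", "df", mountpoint],
                         capture_output=True, text=True, check=True).stdout
    # Target names such as fsname-MDT0000 / fsname-OST0001 appear in the
    # first column of the per-target lines.
    return [line.split()[0] for line in out.splitlines()
            if "MDT" in line or "OST" in line]

print(lustre_targets())
```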


Some Examples of File Systems Used in HPC: IBM Spectrum Scale (GPFS)

Characteristics

    • All user data is accessible from any disk to any node
    • Metadata may be shared from all disks or from a dedicated set of disks
    • Supports multiple copies of metadata
    • All data movement between nodes and disk is parallel
    • A large number of nodes can utilize the file system

System Components

    • Server Nodes - Cluster Manager, Quorum Nodes, File System Manager
    • NSD Servers – direct or SAN attached to physical storage, block access
    • Network Shared Disk (NSD) – LUNs formatted for GPFS usage

Sources

    • IBM Spectrum Scale – www.ibm.com/products/spectrum-scale
    • Spectrum Scale User Group – www.spectrumscale.org

Some Examples of File Systems Used in HPC: BeeGFS

Characteristics

    • Generally separate metadata and data servers/services
    • Scale out metadata and data
    • Now includes a policy engine called BeeGFS Hive Index
    • Like Lustre, built to run well on commodity hardware

System Components

    • Metadata Server/Service - Provides metadata information to clients about inodes
    • Data Server/Service – Stores the files themselves and retrieves them for clients
    • Management & Monitoring Service - Holds cluster configuration needed by clients and provides monitoring functions for the cluster

Sources

    • BeeGFS – https://doc.beegfs.io/


Some Examples of File Systems Used in HPC: VAST Data

Characteristics

    • Leverages NFS for clients (can use ”stock” NFS but really needs to use a custom NFS client provided by VAST)
    • Scale out and scale up to get more capacity and performance
    • Has built in data compression and dedupe (aka. “data reduction”)
    • Focuses on Read performance over Write performance

System Components

    • D-Boxes - “Disk Boxes” that present media devices to the C-Boxes
    • C-Boxes - “Compute Boxes”/“C-Nodes”, machines that form the file system and present it to clients via NFS, SMB, and S3
    • Backend Fabric - Ethernet or IB fabric between the C-Boxes and D-Boxes

Sources

    • VAST – https://vastdata.com


Common HPC Storage Solutions

  • Lustre Appliances
    • DDN EXAScaler
    • HPE Cray Sonexion
    • Dell/EMC HPC Lustre Storage
  • Spectrum Scale Appliances
    • IBM ESS
    • Lenovo GSS
    • Dell Pixstor
    • HPE Enterprise Storage
  • BeeGFS Reference Solution Providers
    • NetApp
    • Dell


  • BeeGFS (cont.)
    • RAID Inc.
    • Exxact Corp.


Other Storage Solutions Seen in HPC

There are some other storage solutions/vendors out there that, while not as widely deployed, show up regularly enough at HPC sites to be worth mentioning:

    • CephFS
    • Quobyte
    • ZFSonLinux + NFS/SMB
    • Swift
    • Panasas
    • Pure Storage
    • NetApp (NFS)


Resources

HPC:

https://www.youtube.com/watch?v=n0OfAoUXUJw (Henry Neeman) Supercomputing in plain English

General Storage:

https://www.youtube.com/watch?v=pBmtY4Tk-R8 (Henry Neeman) Why storage for big data is hard

Hardware Related:

https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html GPU Direct Storage

https://www.geeksforgeeks.org/remote-direct-memory-access-rdma/ RDMA

https://www.snia.org/forums/cmsi/knowledge/formfactors Flash form factors

https://en.wikipedia.org/wiki/InfiniBand Generations of InfiniBand

https://www.lto.org/lto-generation-compatibility/ LTO Tape Generation & Compatibility Information


Resources

Tuning and Benchmarking:

fasterdata.es.net/ An Expert Guide for End-to-End Performance Tuning, Tools and Techniques

https://ior.readthedocs.io/en/latest/ IOR and mdtest documentation

https://github.com/breuner/elbencho Another FS benchmarking tool with GDS Support

https://glennklockwood.blogspot.com/2016/07/basics-of-io-benchmarking.html Basics of I/O benchmarking


File System Solutions:

www.opensfs.org/ OpenSFS supports vendor-neutral development of Lustre

wiki.lustre.org Lustre Wiki

https://www.ddn.com/products/lustre-file-system-exascaler/ DDN Exascaler

www.ibm.com/products/spectrum-scale IBM Spectrum Scale

www.spectrumscale.org Spectrum Scale User Group

https://docs.ceph.com/en/quincy/ CEPH file system

https://vastdata.com VAST



Questions?
