
Linux Clusters Institute: HPC Storage: Part 1

J.D. Maloney | Sr. HPC Storage Engineer

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu


May 1-5, 2023

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).


Target Audience:

Those involved in designing, implementing, or managing HPC storage systems.

Outline:

  • Common Storage Concepts and Terms
  • Storage Related Goals & Requirements
  • Storage Hardware
  • Storage Software
  • Wrap Up


Concepts and Terms


What is Storage?

A place to store data, either temporarily or permanently.

  • Processor Cache
    • Fastest access; closest to the CPU; temporary (L1, L2)
  • System Memory (DRAM)
    • Very Fast access; close to CPU but not on it; temporary
  • Solid State Storage
    • Fast access (esp. random)
    • Can be internal or part of an external storage system
    • Capable of high densities with high associated costs
  • Spinning Disk
    • Slower; performance is tied to access behavior
    • Can be internal or part of an external storage system
    • Capable of extremely high densities
  • Network / Cloud storage
    • Network - Can scale from slow to extremely fast, high density
  • Tape
    • Extremely slow; typically used for cold storage


[Figure: storage hierarchy pyramid with the CPU at the top, then cache (L1, L2, L3), memory (DRAM, HBM), solid state disk (SATA SSD, M.2 module, PCIe card), spinning disks (PMR, SMR, HAMR/MAMR), and tape (LTO, TS11XX); bandwidth increases toward the CPU, while latency and capacity increase toward tape.]


Concepts and Terms

  • Throughput: storage capability, usually measured in GB/s (or TB/s)
  • IOPS: Input/Output Operations Per Second
  • RAID: Redundant Array of Independent Disks
  • JBOD: Just a Bunch of Disks
  • JBOF: Just a Bunch of Flash
  • Storage Server: provides direct access to storage devices and functions as the data manager for those disks
  • Storage Client: accesses data, but plays no role in data management
  • NAS: Network Attached Storage
  • SAN: Storage Area Network
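Throughput and IOPS are linked by the I/O transfer size; a quick sketch of the arithmetic (the IOPS figures and transfer sizes below are hypothetical, not measurements of any particular device):

```python
# Rough relationship between IOPS, transfer size, and throughput.
# All figures are illustrative, not measurements of a real system.

def throughput_gbps(iops: float, transfer_size_bytes: int) -> float:
    """Approximate sustained throughput in GB/s for a given IOPS rate."""
    return iops * transfer_size_bytes / 1e9

# A device doing 200,000 IOPS of 4 KiB random reads moves far less data
# than one streaming 1 MiB sequential reads at 2,000 IOPS.
print(throughput_gbps(200_000, 4 * 1024))    # ~0.82 GB/s
print(throughput_gbps(2_000, 1024 * 1024))   # ~2.10 GB/s
```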


Concepts and Terms

  • High Availability (HA)
    • Components are configured in failover pairs
    • Prevents a single point of failure in the system
    • Prevents a service outage
  • Failover Pairs
    • Active/Active
      • Both components share the load
      • On failure one component takes over the complete load
    • Active/Passive
      • One component services requests, the other is in standby
      • On failure the standby becomes active
  • Networks
    • InfiniBand (IB)
    • Ethernet (TCP/IP)
  • Host Connectivity
    • Host Bus Adapter (HBA)
    • Network Interface Card (NIC)


[Figure: high availability storage layout, with an active controller and a standby controller both attached to the same set of drive enclosures (active/passive failover).]


Concepts and Terms

  • Raw Space: what the disk label shows. Typically given in base 10.
    • 10TB (terabyte) == 10*10^12 bytes
  • Usable Space: what `df` shows once the storage is mounted. Typically given in base 2.
    • 10TiB (tebibyte) == 10*2^40 bytes

Usable space is often about 25% smaller than raw space

    • Depending on the RAID level you select, some storage is consumed by RAID overhead, file system overhead, etc.
    • Learning how to calculate this is a challenge; a rough sketch follows below
    • Dependent on the levels of redundancy and the file system you choose


File system overhead is applied after RAID overhead, further reducing the usable space. For example:

  • 3 drives (2 storage drives + 1 parity drive): 33% RAID overhead
  • 4 drives (3 storage drives + 1 parity drive): 25% RAID overhead
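A back-of-the-envelope sketch of the raw-to-usable estimate, assuming a RAID-6-style 8+2 layout and a flat 5% file system overhead (both numbers are illustrative, not a rule):

```python
# Back-of-the-envelope raw vs. usable capacity estimate.
# The 5% file system overhead figure is an assumption for illustration;
# real overhead depends on the file system and its configuration.

def usable_tib(num_drives: int, drive_tb: float, parity_drives: int,
               fs_overhead: float = 0.05) -> float:
    """Estimate usable TiB from drive count, labeled TB size, and RAID parity."""
    raw_bytes = num_drives * drive_tb * 1e12          # drive labels are base 10
    data_fraction = (num_drives - parity_drives) / num_drives
    usable_bytes = raw_bytes * data_fraction * (1 - fs_overhead)
    return usable_bytes / 2**40                        # report in base-2 TiB

# Ten 10 TB drives in an 8+2 (RAID 6-style) layout:
print(round(usable_tib(10, 10, 2), 1))  # ~69.1 TiB usable from 100 TB raw
```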


Goals and Requirements


How to Choose the Optimal Storage Solution?

  • Short Answer: Whichever solution meets your requirements

  • Long Answer: There is no single best solution for all scenarios

    • Each is designed to solve specific problems and serve specific requirements
    • Each works well when built and deployed according to its strengths
    • Usage requirements and access patterns define which is the best choice
      • Application Requirements
      • User Expectations
      • Budget Constraints
      • Expertise in the support team

  • Compromise based on competing needs is almost always the end result


Building a Balanced System

The Ideal:

    • All components of the system operate in perfect harmony such that speeds & feeds within the system all align perfectly

The Reality:

    • Competing needs will lead to compromises that cause imbalances in the system.

Common Imbalances:

  • Capacity is prioritized over bandwidth; the number of disks exceeds the performance capabilities of the controllers, disk interconnect, or HBAs
  • Overall output of the storage system is exceeded by the network capacity of the computational systems
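A quick sanity check like the following can catch the first imbalance before purchase; every number in it is hypothetical:

```python
# Sanity check: does aggregate disk bandwidth exceed what the
# controllers/HBAs can actually deliver? All numbers are hypothetical.

num_drives = 120
per_drive_mbps = 250            # sustained streaming per HDD (illustrative)
controller_gbps = 12.0          # aggregate controller/HBA limit (illustrative)

aggregate_disk_gbps = num_drives * per_drive_mbps / 1000
bottleneck = min(aggregate_disk_gbps, controller_gbps)

print(f"Disks can stream ~{aggregate_disk_gbps:.0f} GB/s, "
      f"controllers cap the system at ~{bottleneck:.0f} GB/s")
# With these numbers the enclosure is capacity-heavy: 30 GB/s of disk
# sitting behind 12 GB/s of controller.
```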


Requirements Evaluation

  • Stakeholders
    • Computational Users
    • Management
    • Policy Managers
    • Funding Agencies
    • System Administration Staff
    • Infrastructure Support Staff
    • IT Security Staff

  • Usage Patterns
    • Write dominant
    • Read dominant
    • Streaming I/O vs Random I/O

  • User Profiles
    • Expert vs. Beginner
    • Custom vs. Commercial application


  • I/O Profiles
    • Serial I/O
    • Random I/O
    • Parallel I/O
    • MapReduce I/O
    • Large Files
    • Small Files

  • Infrastructure Profile
    • Integrated with HPC resource
    • Standalone storage solution
    • Network connectivity
    • Security requirements


Gathering Stakeholder Requirements

  • Who are your stakeholders?
    • What features are they looking for?
    • How will people want to use the storage?
    • What usage policies need to be supported?
  • From what science/usage domains are the users?
    • What applications will they be using?
  • How much space do they anticipate needing?
  • Are there expectations of access from multiple systems?


  • What is the distribution of files?
    • Sizes, count
  • What is the typical I/O pattern?
    • How many bytes are written for every byte read?
    • How many bytes are read for each file opened?
    • How many bytes are written for each file opened?
  • Are there any system-based restrictions?
    • POSIX conformance - do you need a POSIX interface to the file system?
    • Limitations on number of files or files per directory
    • Network compatibility (IB, Eth, Slingshot)
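Where an existing system is available, a quick scan can put numbers behind the file-count and file-size questions above; a minimal sketch (the mount point and size buckets are arbitrary choices):

```python
# Minimal survey of file sizes under a directory tree to answer the
# "what is the distribution of files?" question. The path and size
# buckets below are arbitrary examples.
import os
from collections import Counter

def size_histogram(root: str) -> Counter:
    buckets = Counter()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                      # skip vanished/broken entries
            if size < 64 * 1024:
                buckets["<64 KiB"] += 1
            elif size < 1024**2:
                buckets["64 KiB-1 MiB"] += 1
            elif size < 1024**3:
                buckets["1 MiB-1 GiB"] += 1
            else:
                buckets[">=1 GiB"] += 1
    return buckets

print(size_histogram("/home"))   # hypothetical mount point
```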


Application I/O Access Patterns

  • An application’s I/O transaction sizes, and the order in which they are issued, define its I/O access pattern. This is a combination of how the application does I/O along with how the file system handles I/O requests.

  • For typical HPC file systems, sequential I/O of large blocks provides the best performance. Unfortunately, these types of I/O patterns aren’t the most common.

  • Understanding the I/O access patterns of your major applications can help you design a solution your users will be happy with.
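A crude way to feel the difference without special tooling is to time sequential versus random reads of the same file; a rough sketch is below (the path, request size, and count are placeholders, and a real study should use a benchmark such as IOR or elbencho, listed in the resources at the end):

```python
# Crude comparison of sequential vs. random read time on the same file.
# The path, block size, and request count are placeholders; the file is
# assumed to be at least COUNT * BLOCK bytes (and larger than page cache
# for a fair comparison).
import os, random, time

PATH = "/scratch/testfile"        # hypothetical large test file
BLOCK = 1024 * 1024               # 1 MiB requests
COUNT = 512

def timed_reads(offsets):
    with open(PATH, "rb", buffering=0) as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
        return time.perf_counter() - start

size = os.path.getsize(PATH)
seq = [i * BLOCK for i in range(COUNT)]
rnd = [random.randrange(0, size - BLOCK) for _ in range(COUNT)]

print("sequential:", timed_reads(seq), "s")
print("random:    ", timed_reads(rnd), "s")
```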


Common Data Access Patterns

  • Streaming (bandwidth centric)
    • Records are accessed only once; the file is read/written from beginning to end
    • Minimal overall IOPS
    • Files tend to be large and performance is measured in bandwidth
    • Common in Digital Media, Classic HPC, Scientific Applications, DSP
  • Discrete File I/O (IOP centric)
    • Small individual transactions; may not even read a full block at a time
      • Small files, random access
    • File IOPS can be high
    • Common in bioinformatics, AI, /home and /sw directories
  • Transaction Processing (IOP centric)
    • Small transactions with good temporal locality; individual updates may be smaller than a block, but consecutive transactions tend to fall in contiguous blocks
    • File IOPS can be high
    • Common in databases and commercial applications


HPC I/O Access Patterns

  • Traditional HPC
    • Streaming large block writes (low IOPS rates)
    • Large output files
    • Minimal metadata operations
  • More common today
    • Random I/O patterns (high IOPS rates)
    • Smaller output files
    • Large number of metadata operations

Challenges:

  • Choosing a block size that fits your application I/O pattern
  • IOPS become more important with random I/O patterns and small files


Gathering Data Requirements

  • Do you need different tiers or types of storage?
    • Active long-term (project space)
    • Temporary (scratch space, fast, not backed up)
    • Archive (cloud, disk or tape, cheaper)
    • Backups (snapshots, disk, tape)
    • Encryption (SED or per share/export)
  • Data Restrictions
    • HIPAA and PHI
    • CUI
    • PCI DSS
    • And many more (FISMA, ITAR, SOX, GLBA, CJIS, FERPA, SOC, …)
  • Data transfer characteristics


Training and Support Requirements

  • Training
    • Sys Admin Staff
      • How much training does your staff need?
      • Vendor supplied training?
      • Does someone on your staff have the expertise to provide training?
    • User Services
      • How much training does user support staff need?
      • Does your user support staff have the expertise to provide user training?
    • Users
      • How much training will your users need to effectively use the system?
      • How often will training need to be provided?


  • System Support
    • Does the vendor provide support for all components of your system?
    • Does support for parts of your system come from the open source community?
    • What are the support requirements for your staff?
      • 7x24
      • 8x5 M-F

    • Do you have Service Level Agreements (SLAs) with your user community?


Storage Hardware


Common Storage Building Blocks

  • Media
    • HDDs (SATA & SAS)
    • SSDs (SATA, SAS, NVMe)
    • Tape (LTO, TS11XX)
  • Media Formats
    • HDD - 3.5” (2.5” dying out, some 10/15K SAS remain)
    • SSD - 2.5” (varying thicknesses), U.2/3, E1.S/L, E3.S/L, M.2
    • Tape - LTO (Open Standard), TS11XX (IBM format)
  • Enclosures
    • JBOD & JBOF (Enclosures/Drawers)
    • Controller (Couplets)
    • Storage Servers


Common Connecting Fabrics

Network Fabrics

  • Ethernet
    • Speeds: 1Gb - 400GbE (800GbE soon)
    • RJ-45, SFP+, QSFP+, QSFP-DD
    • TCP/RoCE
  • InfiniBand
    • EDR, HDR, NDR (100Gb, 200Gb, 400Gb) are the modern versions in use today
    • RDMA support
  • Slingshot
    • HPE specific (at present)
    • Ethernet based with “extra stuff”


Other Storage Fabrics & Carriers

  • Fibre Channel & SAS
    • 24Gb SAS is now GA; 12Gb still common
    • 64GFC is now available (6.4GB/s per direction)
  • PCIe
    • Gen 4 (16GT/s per lane) ~32GB/s per direction in a x16 slot (available since 2019)
    • Gen 5 (32GT/s per lane) ~64GB/s per direction in a x16 slot (available in 2023)
    • Gen 6 (64GT/s per lane) ~128GB/s per direction in a x16 slot -- likely 2025/2026 GA
    • Carrier for CXL devices, NVMe drives, NICs, etc.


Common RAID Levels

  • Standard Levels
    • RAID 0: striping without mirroring or parity
    • RAID 1: data mirroring, without parity or striping
    • RAID 5: block-level striping with distributed parity
      • No loss of data with a single disk failure
      • Performance hit in degraded mode
    • RAID 6: block-level striping with dual distributed parity
      • No loss of data with two disk failures
      • Data loss occurs with third failure
      • Most common in larger systems today
  • Hybrid Levels
    • RAID 0+1: create two stripes then mirror
    • RAID 1+0: striping across mirrored drives
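For reference, the usable fraction of raw capacity for these levels reduces to a few lines of arithmetic (a sketch that ignores hot spares and file system overhead, and assumes 2-way mirrors):

```python
# Usable fraction of raw capacity for common RAID levels, ignoring
# hot spares and file system overhead; mirrored levels assume 2-way mirrors.

def usable_fraction(level: str, n_disks: int) -> float:
    if level == "0":              # striping only, no redundancy
        return 1.0
    if level in ("1", "10"):      # mirroring: half the capacity
        return 0.5
    if level == "5":              # one disk of distributed parity
        return (n_disks - 1) / n_disks
    if level == "6":              # two disks of distributed parity
        return (n_disks - 2) / n_disks
    raise ValueError(f"unknown RAID level {level}")

for lvl, n in [("5", 8), ("6", 8), ("6", 12), ("10", 8)]:
    print(f"RAID {lvl} on {n} disks: {usable_fraction(lvl, n):.0%} usable")
```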


Erasure Coding

  • Classic RAID levels are becoming (or already are) untenable given increasing drive sizes and current performance characteristics
    • The capacity of drives has outpaced their performance capabilities
  • Increased rebuild times mean we need a faster way to get data protected following a device failure
    • A RAID rebuild typically requires the new drive to write its entire capacity once, which creates a write bottleneck on a single device
  • This is leading to the use of erasure coding across many devices to greatly shorten the time to data protection
  • There are different erasure coding algorithms, but most operate on a common principle, sketched below
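That common principle is to split data into k data shards and compute m parity shards so that the loss of up to m shards is survivable; the simplest possible case, a single XOR parity shard (analogous to RAID 5), can be sketched as:

```python
# Simplest erasure code: k data shards plus one XOR parity shard.
# Any single lost shard can be rebuilt from the survivors. Real systems
# use Reed-Solomon-style codes so they can tolerate multiple failures.

def make_parity(shards: list[bytes]) -> bytes:
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]          # k = 3 data shards
parity = make_parity(data)                  # m = 1 parity shard

# Lose shard 1, then rebuild it from the remaining shards plus parity.
survivors = [data[0], data[2], parity]
rebuilt = make_parity(survivors)
assert rebuilt == data[1]
print("recovered:", rebuilt)
```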


Erasure Coding

  • An example with ZFS dRAID
    • It may not seem much different on the surface, but it is much more resilient than classic RAID methods


Erasure Coding

  • Improvements over “classical RAID”
    • All drives in the pool participate in a rebuild, which can even begin before the failed drive is replaced (depends on the solution/configuration)
    • Reconstruction of data can be ordered more efficiently: rebuild the most critical blocks first and move to the least critical
    • Large erasure-coded pools, especially on fast media, can achieve much higher usable-capacity efficiency without losing data protection
    • No more rebuild bottleneck on a single drive
  • A non-exhaustive list of solutions that use erasure coding
    • Ceph
    • ZFS dRAID
    • DDN SFA
    • VAST Data
    • IBM ESS


Data Transport from Server to Client

TCP (Transmission Control Protocol)

  • Great for the Internet and viewing cat videos, sub-optimal for high performance storage transport
    • A very robust method of moving data that ensures delivery no matter what and is resilient in poor network conditions
    • Parallel file systems are getting better at optimizing for standard Ethernet connections and TCP
  • Data has to flow through the OS (and its buffers), to the NIC (and its buffers), over the network to the other host, and then traverse the same stack again in reverse order
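A minimal sketch of that path: every sendall() below copies the application's buffer into the kernel's socket buffer before the NIC ever sees it, and the receiving host repeats the copies in reverse (host name, port, and file path are placeholders):

```python
# Minimal TCP sender: each sendall() copies data from user space into
# the kernel socket buffer, which the NIC then drains; the receiving
# host repeats the copies in reverse. Host, port, and path are placeholders.
import socket

CHUNK = 1024 * 1024                       # 1 MiB application buffer

def send_file(path: str, host: str = "storage-server", port: int = 5000):
    with socket.create_connection((host, port)) as sock, open(path, "rb") as f:
        while chunk := f.read(CHUNK):     # user-space read buffer
            sock.sendall(chunk)           # copy into the kernel socket buffer

# send_file("/scratch/output.dat")        # hypothetical file
```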


Data Transport from Server to Client

RDMA (Remote Direct Memory Access)

  • Requires a “lossless” network, be that InfiniBand, RoCE, Slingshot, etc.
  • Applications can transfer data from their memory to memory on a remote host directly, bypassing many of the layers in the stack
  • Improves file system throughput between servers and clients and reduces latency, which greatly helps metadata-intensive workloads as well as those that do many small I/O transactions


[Figure: TCP vs. RDMA data paths, contrasting the buffer copies through the OS and NIC with RDMA’s direct memory-to-memory transfer.]


Data Transport from Server to Client

GDS (GPU Direct Storage)

  • Allows storage traffic to bypass the host CPU and memory and copy straight into VRAM on the GPU
  • As network speeds and all-flash appliance speeds increase, it cuts out a bottleneck between your file system and getting data into the GPU’s memory for training/processing
  • Works with NVIDIA GPUs and many different storage platforms
    • If you have GPUs and need to buy storage, check whether the storage solution supports GDS
    • Part of NVIDIA’s Magnum IO family (in case you see that name in docs)


[Figure: GPUDirect Storage data path, with storage traffic copied directly into GPU memory instead of bouncing through host CPU and memory.]


Storage Software


File vs. Object Storage

File Storage

  • A block holds a chunk of data; files typically span multiple blocks
  • Data is accessed through a request to the block address
  • The file system controls access to blocks, imposes a hierarchical structure, and defines and updates metadata
  • File systems are generally POSIX compliant
  • Access is through the OS layer; the user has direct access to the file system
  • Storage systems are generally feature rich and externally based
  • Raw block storage is common in large-scale database applications, allowing direct block access


Object Storage

  • An object holds both data and its associated metadata
  • Data is accessed through a request for the object ID
  • Objects are stored in a flat structure
  • Metadata can be flexible and defined by the user
  • Resiliency is provided through multiple copies of the object, generally geographically distributed
  • Access is through a REST API, not the OS
  • Systems are generally built with commodity servers with internal disks
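The access-model difference is visible directly in code: file I/O is byte-level access through a path handled by the OS, while object I/O is a whole-object GET/PUT against a bucket and key over a REST API; a sketch using boto3 (the bucket, key, and paths are made-up examples, and credentials/endpoint are assumed to be configured already):

```python
# File access vs. object access side by side. The bucket, key, and
# paths are made-up examples; boto3 credentials/endpoint must already
# be configured for the object half to run.
import boto3

# POSIX file: byte-addressable, hierarchical path, handled by the OS.
with open("/project/data/results.csv", "rb") as f:
    file_bytes = f.read()

# Object: whole-object GET/PUT by key through a REST API, not the OS.
s3 = boto3.client("s3")
s3.put_object(Bucket="my-bucket", Key="data/results.csv", Body=file_bytes)
obj = s3.get_object(Bucket="my-bucket", Key="data/results.csv")
object_bytes = obj["Body"].read()
```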


Data Consistency

Strong Consistency

  • Block storage systems are strongly consistent
  • Typically used for real-time processing such as transactional databases
  • Good for data that is constantly changing
    • Updating a file only requires changing the blocks that have changed
  • Limited scalability especially within a geographically distributed system
  • Guarantees that a read request returns the most recent version of data


Eventual Consistency

  • Some object storage systems are eventually consistent
  • High availability for data that is relatively static and rarely altered
    • Updating an object requires a re-write of the entire object
  • Typical uses are for multimedia or unstructured data
  • There is no guarantee that a read request returns the most recent version of data


Parallel vs. Non Parallel File Systems

Characteristics of a Parallel File System

  • Multiple storage servers manage a single namespace
  • Speed can be increased by scaling out the number of storage servers and the disks they present
  • Handles consistent locking across files and directories accessed from multiple clients at the same time


Characteristics of Non-Parallel File Systems

  • Usually limited to a single storage server
  • Often file systems accessed only by the local machine (XFS, EXT4, etc.)
  • Can be presented to multiple clients over a network, but don’t/can’t enforce full POSIX locking compliance across all of them


Why HPC Uses Parallel File Systems

  • Clusters are made up of many nodes (10s, 100s, 1000s of nodes)
  • A single job can span many or all of the nodes in the system, and all of them need access to the same data
  • HPC jobs also may involve checkpointing and will need a shared file system to store checkpoint files
  • HPC systems generally work with and produce large amounts of data that can quickly/easily scale beyond the capability of single servers
  • The immense compute power of these machines needs fast file systems to keep them “fed” with data


Goals of Scale-Out File Systems


  • Access Transparency: clients are unaware that they are accessing remote data
  • Concurrency Transparency: all clients have the same view of the data state
  • Failure Transparency: clients continue to operate correctly after a server failure
  • Heterogeneity: clients/servers can be of different hardware and operating systems
  • Scalability: the file system should work as well at small client node counts as at large client node counts
  • Replication Transparency: any data replication should be invisible to clients
  • Migration Transparency: any data migration should be invisible to clients


Some Examples of File Systems Used in HPC: Lustre

Characteristics

  • Open source parallel distributed file system
  • Metadata services and storage are segregated from data services and storage

System Components

  • Metadata Servers (MDS) – serve metadata; client metadata access is through the MDS
  • Metadata Targets (MDT) – dedicated metadata storage
  • Management Servers (MGS) – Lustre cluster management
  • Management Targets (MGT) – management data storage
  • Object Storage Servers (OSS) – data servers; client data access is through the OSS
  • Object Storage Targets (OST) – dedicated object storage

Sources

  • lustre.org – community support for Lustre source
  • Productized by companies such as DDN and HPE (and others)
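On a Lustre client, the MDTs and OSTs behind a mount point can be listed with the `lfs df` command; a small wrapper sketch (the mount point is an example, and the Lustre client tools must be installed):

```python
# Query a Lustre client for its MDTs and OSTs using `lfs df`.
# The mount point is an example; requires the Lustre client tools.
import subprocess

def lustre_targets(mountpoint: str = "/mnt/lustre") -> list[str]:
    out = subprocess.run(["lfs", "df", mountpoint],
                         capture_output=True, text=True, check=True).stdout
    # Target names such as fsname-MDT0000 / fsname-OST0001 appear in the
    # first column of the per-target lines.
    return [line.split()[0] for line in out.splitlines()
            if "MDT" in line or "OST" in line]

print(lustre_targets())
```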


Some Examples of File Systems Used in HPC: IBM Spectrum Scale (GPFS)

Characteristics

    • All user data is accessible from any disk to any node
    • Metadata may be shared from all disks or from a dedicated set of disks
    • Supports multiple copies of metadata
    • All data movement between nodes and disk is parallel
    • A large number of nodes can utilize the file system

System Components

    • Server Nodes - Cluster Manager, Quorum Nodes, File System Manager
    • NSD Servers – direct or SAN attached to physical storage, block access
    • Network Shared Disk (NSD) – LUNs formatted for GPFS usage

Sources

    • IBM Spectrum Scale – www.ibm.com/products/spectrum-scale
    • Spectrum Scale User Group – www.spectrumscale.org

Some Examples of File Systems Used in HPC: BeeGFS

Characteristics

    • Generally separate metadata and data servers/services
    • Scale out metadata and data
    • Now includes a policy engine called BeeGFS Hive Index
    • Like Lustre, built to run well on commodity hardware

System Components

    • Metadata Server/Service - Provides metadata information to clients about inodes
    • Data Server/Service – Stores the files themselves and retrieves them for clients
    • Management & Monitoring Service - Holds cluster configuration needed by clients and provides monitoring functions for the cluster

Sources

    • BeeGFS – https://doc.beegfs.io/


Some Examples of File Systems Used in HPC: VAST Data

Characteristics

    • Leverages NFS for clients (can use ”stock” NFS but really needs to use a custom NFS client provided by VAST)
    • Scale out and scale up to get more capacity and performance
    • Has built in data compression and dedupe (aka. “data reduction”)
    • Focuses on Read performance over Write performance

System Components

    • D-Boxes - “Disk Boxes” that present media devices to the C-Boxes
    • C-Boxes - “Compute Boxes”/“C-Nodes”, machines that form the file system and present it to clients via NFS, SMB, and S3
    • Backend Fabric - Ethernet or IB fabric between the C-Boxes and D-Boxes

Sources

    • VAST – https://vastdata.com


Common HPC Storage Solutions

  • Lustre Appliances
    • DDN EXAScaler
    • HPE Cray Sonexion
    • Dell/EMC HPC Lustre Storage
  • Spectrum Scale Appliances
    • IBM ESS
    • Lenovo GSS
    • Dell Pixstor
    • HPE Enterprise Storage
  • BeeGFS Reference Solution Providers
    • NetApp
    • Dell


  • BeeGFS (cont.)
    • RAID Inc.
    • Exxact Corp.


Other Storage Solutions Seen in HPC

There are some other storage solutions/vendors out there that, while not as widely deployed, show up regularly enough at HPC sites to be worth mentioning:

    • CephFS
    • Quobyte
    • ZFSonLinux + NFS/SMB
    • Swift
    • Panasas
    • Pure Storage
    • NetApp (NFS)


Resources

HPC:

https://www.youtube.com/watch?v=n0OfAoUXUJw (Henry Neeman) Supercomputing in plain English

General Storage:

https://www.youtube.com/watch?v=pBmtY4Tk-R8 (Henry Neeman) Why storage for big data is hard

Hardware Related:

https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html GPU Direct Storage

https://www.geeksforgeeks.org/remote-direct-memory-access-rdma/ RDMA

https://www.snia.org/forums/cmsi/knowledge/formfactors Flash form factors

https://en.wikipedia.org/wiki/InfiniBand Generations of InfiniBand

https://www.lto.org/lto-generation-compatibility/ LTO Tape Generation & Compatibility Information


Resources

Tuning and Benchmarking:

fasterdata.es.net/ An Expert Guide for End-to-End Performance Tuning, Tools and Techniques

https://ior.readthedocs.io/en/latest/ IOR and mdtest documentation

https://github.com/breuner/elbencho Another FS benchmarking tool with GDS Support

https://glennklockwood.blogspot.com/2016/07/basics-of-io-benchmarking.html Basics of I/O benchmarking


File System Solutions:

www.opensfs.org/ OpenSFS supports vendor-neutral development of Lustre

wiki.lustre.org Lustre Wiki

https://www.ddn.com/products/lustre-file-system-exascaler/ DDN Exascaler

www.ibm.com/products/spectrum-scale IBM Spectrum Scale

www.spectrumscale.org Spectrum Scale User Group

https://docs.ceph.com/en/quincy/ CEPH file system

https://vastdata.com VAST



Questions?
