1 of 26

Linux Clusters Institute: Current HPC File Systems & Trends

J.D. Maloney | Lead HPC Storage Engineer

Storage Enabling Technologies Group (SET)

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu

University of Oklahoma, May 13th – 17th 2024

2 of 26

Baseline

  • Look over the slides from the beginner workshop in Feb 2024
    • Link to Slides
    • Segments of some of the following slides were adapted from that work
  • Have a grasp of:
    • The definition of a parallel file system
    • Types of file systems (home, scratch, projects, archive)


3 of 26

Topic Coverage

  • There are many file systems out there, far too many to cover them all
  • Going to touch on current HPC file systems that are popular
    • Trying to hit the ones you’re most likely to encounter in an HPC environment these days
    • Presenting a given product is by no means an endorsement of any kind of the vendor discussed
  • Will wrap up by discussing some HPC file system trends



4 of 26

Current HPC-Relevant File Systems


5 of 26

Storage Scale (GPFS) Overview

  • Product of IBM; has gone through many name changes
    • Tiger Shark FS, MultiMedia FS, GPFS, Spectrum Scale, now it’s Storage Scale
  • Licensed file system, usually priced by capacity
  • One of the two “prominent” file systems in recent years used by some of the world’s largest supercomputers
  • Generally considered easier to administer due to product maturity and enterprise-level features
  • Can run on top of RAID’d LUNs presented over a SAN, or it can erasure-code across disks in servers (Erasure Code Edition); a toy parity sketch follows
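
To ground the erasure-coding idea, here is a toy, purely illustrative sketch of single-parity protection across servers. Real erasure-code deployments use multi-parity Reed-Solomon style codes rather than simple XOR, so treat this only as the concept, not as how Erasure Code Edition works internally.

```python
# Toy illustration of parity-based protection across servers.
# This is NOT how Storage Scale Erasure Code Edition is implemented;
# it only shows that a lost chunk can be rebuilt from survivors + parity.

def xor_parity(chunks: list[bytes]) -> bytes:
    """Compute a parity chunk as the byte-wise XOR of all data chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover one lost data chunk by XOR-ing parity with the survivors."""
    return xor_parity(surviving + [parity])

if __name__ == "__main__":
    # Pretend each 4-byte chunk lives on a different server's disk.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_parity(data)
    # Lose the chunk on "server 1" and rebuild it from the rest.
    recovered = rebuild_missing([data[0], data[2]], parity)
    assert recovered == data[1]
    print("recovered:", recovered)
```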


6 of 26

Storage Scale (GPFS) Architecture


Image Credit: redlineperf.com

7 of 26

Storage Scale (GPFS) Architecture Notes

  • Types of servers:
    • NSD (Network Shared Disk) servers: connected to disks, serve I/O to other nodes in the cluster
    • Manager nodes: cluster manager, file system manager, quorum nodes
    • Clients: mount the file system for access; also run the GPFS daemon
  • Supports multiple storage pools and has HSM functionality
  • Supports encryption at rest (Advanced License)
  • Features like:
    • AFM (Active File Management)
    • Built-in policy engine (very powerful)
    • Native Cluster Export Services (NFS, Samba/CIFS)
    • Support for object storage via Cluster Export Services
    • Remote cluster mounts
    • Powerful sub-block allocation (see the sketch after this list)
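
A minimal sketch of why sub-block allocation matters for small files; the 4 MiB block and 8 KiB sub-block sizes below are illustrative assumptions, not the configuration of any particular Storage Scale file system.

```python
import math

# Space charged on disk depends on the allocation unit: a file system that
# can hand out sub-blocks wastes far less capacity on small files than one
# that always rounds up to a full block.

BLOCK_SIZE = 4 * 1024 * 1024      # 4 MiB full block (assumed for illustration)
SUBBLOCK_SIZE = 8 * 1024          # 8 KiB sub-block (assumed for illustration)

def space_consumed(file_size: int, allocation_unit: int) -> int:
    """Space charged when allocation is rounded up to `allocation_unit`."""
    return max(1, math.ceil(file_size / allocation_unit)) * allocation_unit

if __name__ == "__main__":
    small_file = 20 * 1024  # a 20 KiB file
    print(f"whole-block allocation: {space_consumed(small_file, BLOCK_SIZE) / 1024:.0f} KiB")
    print(f"sub-block allocation:   {space_consumed(small_file, SUBBLOCK_SIZE) / 1024:.0f} KiB")
    # whole-block allocation: 4096 KiB
    # sub-block allocation:   24 KiB
```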


8 of 26

Lustre Overview

  • Open-source file system supported and developed by many companies and large institutions (DDN, Seagate, Intel, CEA, DOE)
  • One of the two “prominent” file systems in use today by the world’s largest supercomputers
  • Known for its ability to scale sequential I/O performance as the storage system grows
  • More complicated to administer, with a stricter operating environment (OFED stack, kernel, etc.)
  • Can scale to greater numbers of clients


9 of 26

Lustre Architecture


Image Credit: nextplatform.com

10 of 26

Lustre Architecture Notes

  • Very scalable historically for OSS/OSTs (data storage/performance), and now also for MDS/MDTs (metadata storage/performance); see the striping sketch after this list
  • Built-in quotas, but no built-in policy engine yet (available externally via Robinhood)
  • Supports mixed fabrics within the same environment via LNET routing
  • HSM support is built in and can be driven by an external policy engine to migrate files between tiers
  • Can run on more “commodity” storage, but expects reliable disks; HA is maintained by failover
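
To make the OST striping model concrete, here is a minimal sketch of how a striped file's byte offsets map onto OST objects. The stripe size, stripe count, and OST names are illustrative assumptions; real layouts are set per file or directory (e.g. with `lfs setstripe`).

```python
# Minimal sketch of round-robin striping: which OST object a given byte
# offset of a file lands on, and at what offset within that object.

STRIPE_SIZE = 1 * 1024 * 1024   # 1 MiB stripes (assumed for illustration)
OSTS = ["OST0000", "OST0001", "OST0002", "OST0003"]  # 4-way stripe (assumed)

def locate(offset: int) -> tuple[str, int]:
    """Map a file byte offset to (OST name, offset within that OST object)."""
    chunk = offset // STRIPE_SIZE            # which stripe chunk of the file
    ost = OSTS[chunk % len(OSTS)]            # chunks rotate across the OSTs
    rotation = chunk // len(OSTS)            # full rotations completed so far
    obj_offset = rotation * STRIPE_SIZE + offset % STRIPE_SIZE
    return ost, obj_offset

if __name__ == "__main__":
    for off in (0, 512 * 1024, 3 * 1024 * 1024, 5 * 1024 * 1024 + 42):
        ost, obj_off = locate(off)
        print(f"file offset {off:>9} -> {ost} at object offset {obj_off}")
```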


11 of 26

BeeGFS Overview

  • File system formerly known as FhGFS (the Fraunhofer parallel file system)
  • Maintained by Fraunhofer
  • Most features and abilities are open source; support and some advanced features can be purchased through Fraunhofer
  • Scales both data and metadata performance and capacity as the hardware used grows
  • More limited feature set than the other big file systems:
    • No encryption support
    • No HSM support (yet)
  • Gaining in popularity recently, especially in Europe
  • Now has a policy engine (new as of late 2022)


12 of 26

BeeGFS Architecture


Image Credit: beegfs.io

13 of 26

BeeGFS Architecture Notes

  • Scale metadata and data services independently
  • Multiple services can be run from the same node
  • Host-failure resistance by using “buddy groups” to replicate data at per-directory granularity for high uptime (see the sketch after this list)
  • Runs on “commodity” hardware
  • Supports multiple fabric types (Ethernet, IB, OPA)
  • Storage pools allow grouping of LUNs by different types (SSD, HDD, etc.)
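
A toy sketch of the buddy-group idea described above: a primary and a secondary target mirror each other's writes so the group stays available through a host failure. The class names and in-memory "targets" are illustrative, not BeeGFS's actual on-disk format or mirroring protocol.

```python
# Toy model of buddy mirroring: writes go to both members of a buddy group,
# and reads fall back to the secondary if the primary's host is down.

class Target:
    def __init__(self, name: str):
        self.name = name
        self.online = True
        self.chunks: dict[str, bytes] = {}

class BuddyGroup:
    def __init__(self, primary: Target, secondary: Target):
        self.primary, self.secondary = primary, secondary

    def write(self, chunk_id: str, data: bytes) -> None:
        # Mirror the write to every online member of the buddy group.
        for target in (self.primary, self.secondary):
            if target.online:
                target.chunks[chunk_id] = data

    def read(self, chunk_id: str) -> bytes:
        # Serve from the secondary if the primary is unavailable.
        target = self.primary if self.primary.online else self.secondary
        return target.chunks[chunk_id]

if __name__ == "__main__":
    group = BuddyGroup(Target("storage01-tgt1"), Target("storage02-tgt1"))
    group.write("chunk-42", b"result data")
    group.primary.online = False          # simulate a host failure
    print(group.read("chunk-42"))         # still served by the buddy
```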


14 of 26

Ceph Overview

  • Object/POSIX file system developed by Red Hat that is gaining a lot of popularity
  • Usually leveraged as an “archive” store, a general dataset repository, and a backing store for OpenStack
  • Open source and free to use; support available through Red Hat
  • Provides object, block, and POSIX storage
  • Very popular in the cloud/virtualization space
  • Runs on “commodity” hardware
  • Can be deployed over Ethernet or IB
    • Ethernet is very popular with this FS


15 of 26

Ceph Architecture


Image Credit: innotta.com.au

16 of 26

Ceph Architecture Notes

  • Uses the CRUSH algorithm to handle data placement and retrieval across the cluster (a simplified placement sketch follows this list)
  • Disks are handled individually; each disk has a “journal” device that allows for quick writes for small files
    • Multiple disks can share the same journal device; tuning the ratio and drive types allows one to increase or decrease performance
  • Supports both replication and erasure coding for redundancy
    • Does NOT expect any devices to be RAID’d together before being presented
  • Disks are individually formatted by Ceph (BlueStore) and managed by the file system
  • Multi-metadata-server support (for the POSIX component) is maturing
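
A greatly simplified sketch of the placement idea behind CRUSH: object names hash to placement groups (PGs), and each PG maps deterministically to a set of OSDs, so every client can compute locations without a central lookup. Real CRUSH walks a hierarchical cluster map with failure-domain rules; the PG count, OSD names, and selection method below are illustrative assumptions only.

```python
import hashlib
import random

# Two-step placement: object name -> PG (by hashing), PG -> OSD set
# (by a deterministic, seedable pseudo-random choice).

PG_COUNT = 128                              # assumed for illustration
OSDS = [f"osd.{i}" for i in range(12)]      # assumed 12-OSD cluster
REPLICAS = 3

def object_to_pg(name: str) -> int:
    h = int.from_bytes(hashlib.sha256(name.encode()).digest()[:8], "big")
    return h % PG_COUNT

def pg_to_osds(pg: int) -> list[str]:
    # Deterministic choice seeded by the PG id, so every client computes
    # the same mapping without asking a central metadata server.
    rng = random.Random(pg)
    return rng.sample(OSDS, REPLICAS)

if __name__ == "__main__":
    for obj in ("genome.fastq", "checkpoint_0001.h5"):
        pg = object_to_pg(obj)
        print(f"{obj} -> pg {pg} -> {pg_to_osds(pg)}")
```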


17 of 26

VAST Overview


  • Proprietary file system and hardware solution from VAST Data that speaks NFS, SMB, and S3
  • Based on an all-flash premise; does not support HDDs
  • Closed source; you need to buy VAST hardware and support
  • Supports multiple cluster-side fabrics (IB, Ethernet, others coming)
  • Popular as a hands-off, low-maintenance, highly available file system solution

18 of 26

VAST Architecture


19 of 26

VAST Architecture Notes

  • All C-boxes are connected to all D-boxes
  • Writes all land on Optane drives (or an alternate high-endurance drive option as Intel Optane fades)
    • This gets all I/O aligned and files ready to be flushed to the back-end QLC capacity drives (see the sketch after this list)
    • Read performance comes from all drives, but writes are limited to the performance of the high-endurance drives
  • Per the above bullet, the solution is capable of high read throughput, less so for writes (know your workload)
  • The back-end network can be either Ethernet or InfiniBand (but usually is Ethernet)
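
A toy sketch of the write path described above: small incoming writes are absorbed by a high-endurance buffer tier and later flushed as large, coalesced stripes to QLC capacity drives. The sizes, names, and in-memory "drives" are illustrative assumptions, not VAST internals.

```python
# Toy write-buffer tier: absorb small writes, flush large aligned stripes.

FLUSH_THRESHOLD = 4 * 1024 * 1024   # flush once 4 MiB accumulates (assumed)

class WriteBufferTier:
    def __init__(self):
        self.buffer: list[bytes] = []        # stand-in for high-endurance drives
        self.buffered_bytes = 0
        self.qlc_stripes: list[bytes] = []   # stand-in for QLC capacity drives

    def write(self, data: bytes) -> None:
        # Small, unaligned writes are absorbed by the buffer tier first.
        self.buffer.append(data)
        self.buffered_bytes += len(data)
        if self.buffered_bytes >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        # Coalesce buffered writes into one large stripe for the QLC tier,
        # so the low-endurance flash never sees the small writes directly.
        self.qlc_stripes.append(b"".join(self.buffer))
        self.buffer.clear()
        self.buffered_bytes = 0

if __name__ == "__main__":
    tier = WriteBufferTier()
    for _ in range(5000):
        tier.write(b"x" * 1024)              # many 1 KiB writes
    print("stripes flushed to QLC:", len(tier.qlc_stripes))
```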


20 of 26

Others

  • There are many other really cool file systems out there to choose from:
    • WekaIO
    • DAOS
    • Quobyte
    • HDFS
    • The list goes on
  • Always good to keep an eye on these, as everything has to start somewhere
    • Helps you understand which direction things are headed
    • And where the heaviest development activity is taking place


21 of 26

HPC File System Trends


22 of 26

New Ground-Up File Systems

  • With the increasing prevalence of flash, and the economics of using it in more places, new ground-up file systems are being written with flash and/or flash + object in mind
  • New technology such as DPUs also plays a role in accelerating storage I/O
  • These new technologies can allow for more feature implementation with a much smaller (or no) cost penalty
    • Things like compression, deduplication, more advanced application/user performance telemetry, etc. (see the sketch after this list)
  • These file systems are also being written to be much more resilient and fault tolerant, and to be cloud-native
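
As a minimal illustration of two of the features mentioned above, here is a generic content-hash deduplication plus compression sketch; it is not any specific file system's implementation.

```python
import hashlib
import zlib

# Blocks are stored by content hash, so identical blocks are kept only once,
# and each unique block is compressed before it is stored.

class DedupStore:
    def __init__(self):
        self.blocks: dict[str, bytes] = {}   # content hash -> compressed block

    def put(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        if key not in self.blocks:           # only store previously unseen content
            self.blocks[key] = zlib.compress(block)
        return key

    def get(self, key: str) -> bytes:
        return zlib.decompress(self.blocks[key])

if __name__ == "__main__":
    store = DedupStore()
    keys = [store.put(b"the same 4 KiB block" * 200) for _ in range(100)]
    print("logical blocks written:", len(keys))
    print("unique blocks stored:  ", len(store.blocks))
```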


23 of 26

New Deployment Methods

  • Historically, FS deployment has been an RPM install or running a fully controlled OS image from a vendor
    • Has worked well for the most part (especially full images where all packages/configs are controlled), but can be brittle, and upgrades stink
  • Newer file systems are deployed via containers
    • Fully encapsulated to ensure packages and libraries are approved versions
    • Greatly improves upgrade processes and expansion/contraction; easier to spin up and spin down
    • At least partly insulates vendors from a shifting OS landscape and allows sites to run the bare-metal OS of their choice


24 of 26

Wrap Up

  • Many of the file systems mentioned in this presentation are open source and/or available to play with for free
    • Set up some of the ones that appeal to you and test them out in VMs or on older hardware
    • Keep up on new releases as features get added or weak points get addressed
    • There are tons of options out there to choose from
  • There is no one right solution; many factors go into making the decision
    • Balance the trade-offs; your environment will have different constraints than someone else’s
  • Reach out to others in the community and attend user group meetings


25 of 26

Acknowledgements

  • Members of the SET group at NCSA for slide review
  • Members of the steering committee for slide review


26 of 26

Questions
