
Linux Clusters Institute: Storage Scale & Ceph

J.D. Maloney | Lead HPC Storage Engineer

Storage Enabling Technologies Group (SET)

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu

Oklahoma University, May 13th – 17th 2024


Storage Scale (GPFS) Overview

  • Product of IBM that has gone through many name changes
  • Licensed file system; licensing is based on server socket count and client count, or on usable capacity
  • One of the two “prominent” file systems in use today on the world’s largest supercomputers
  • Generally considered easier to administer due to product maturity and enterprise-level features
  • The file system generally expects to be running on top of highly reliable disks presented by redundant controllers


Quick History of Storage Scale

  • Began as the Tiger Shark File System in 1993 to handle multimedia applications
  • Also was influenced by IBM’s Vesta File System which was in development around the same time for use in different applications
  • Was productized in 1994, and the name changed to IBM GPFS (General Parallel File System) around 1998
  • Has gone through many version changes and feature additions, and in 2023 the name was changed to Storage Scale
    • Though we’re all still getting used to that ☺



Image Credit: spectrumscale.org


Stand Out Storage Scale Features

  • Distributed metadata servers, with no real limit on their number; the metadata role can run on all of the servers
  • Allows data and metadata to be written inline with each other on the same storage device; no separate devices needed
  • Supports “super inodes,” where files smaller than ~3.8KB fit inside the inode itself
    • Very handy when you have metadata pools that run on all flash devices
    • Leads to noticeably improved small file performance


Stand Out Storage Scale Features

  • Robust Tiering architecture based on storage pools
  • Built-in policy engine that can be used to query the file system and/or drive data movement
    • Run things like automated purges based on parameters
    • Move data between storage pools based on certain criteria
  • Built-in rebalancing of data across NSDs (LUNs)
    • Handy when you grow your storage system over time or when you’re doing big migrations or upgrades
  • Filesets for isolating different datasets


Stand Out Storage Scale Features

  • Great Sub-block Allocation
    • Dynamic Sub-block size based on File System block size
    • Allows for great streaming performance, but also great file space efficiency
    • You don’t have to compromise anymore (with v5) between performance and space efficiency


Storage Scale Weaknesses

  • License cost that scales with deployment
    • Not an open source FS
  • Multiple fabric support is less robust
  • Requires more robust/enterprise hardware to present reliable NSDs to servers
    • The Erasure Code edition is not constrained by this weakness; however, it still has hardware qualifications and is not fully “white box” capable
  • Client limitation of around 10,000 clients per cluster (per IBM documentation)
  • Very sensitive to network disturbances
  • Orchestrates via SSH so NSD servers all need to have password-less SSH to clients


Storage Scale Appliances


  • Can buy ready-built appliances from many vendors; some examples:
    • IBM/Lenovo
    • Dell
    • HPE


Storage Scale Hardware


  • Many other vendors will sell you hardware pre-configured for Storage Scale file systems
  • Find a solution that hits your price point, from a vendor you have confidence in to provide a solid product
  • Needs to be built on a reliable storage appliance that can present LUNs through multiple controllers to multiple hosts
    • Can be tested on less robust hardware, but not for production
  • Some gains can be had from integrated appliances, but they come with the trade-off of limited flexibility/customization


Storage Scale Concepts


Key Definitions


  • NSD (Network Shared Disk) – LUN presented to a Storage Scale server to be used for the file system(s)
  • Cluster Manager – Storage Scale server that is elected to handle disk leases, detect and recover from node failures, distribute configuration information, etc.
  • File System Manager – Storage Scale server that coordinates token management, disk space allocation, mount/unmount requests, etc. for a file system
  • Quorum Node – Storage Scale server that helps the cluster maintain data integrity in case of node failure
  • File System – A collection of NSDs grouped together to form a mountable device on the client


Scaling Out


  • Since Storage Scale servers can each deal with both data and metadata, scaling comes by just increasing the total number of NSD servers
  • Many file systems can be run out of the same Storage Scale cluster (256 FS limit)
  • Which servers act as cluster and file system managers is dynamic
    • Election held during startup of the mmfs daemon and managers can be moved around by admins to get them on desired nodes if there is a preference
    • Usually like to have even distribution as much as possible


Cluster vs Scatter


  • Two different block allocation map types
  • Parameter is chosen at file system create time, cannot be changed afterward
  • Cluster allocates blocks in chunks (clusters) on NSDs
    • Better for clusters with smaller quantities of disks and/or clients
  • Scatter allocates blocks randomly across NSDs
    • Better for clusters with larger quantities of disks or clients
  • The default setting is chosen based on the number of nodes and NSDs present in the cluster at the time of the create command
    • Threshold for switch from cluster to scatter is currently 8 nodes or disks
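
A minimal sketch of choosing the allocation map type at create time (the device name fs0, stanza file name, and block size here are hypothetical; check the mmcrfs documentation for your release):

# mmcrfs fs0 -F nsd_stanzas.txt -j scatter -B 4M

# mmlsfs fs0 -j
    • mmlsfs should report the block allocation type the file system was created with, since it cannot be changed afterward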


Storage Scale NSD Server


  • Powerful Single or Dual CPU System (64 cores+)
  • More memory the better, used for page pool
    • The lowest you’d probably want to go is 32GB–64GB
    • We set our NSD server memory at 384GB-512GB currently (smaller with embedded systems)
  • Fast disks for metadata pool if possible
    • Great candidate for this is NVME
    • Disks now come in U.2 form factor for easier access
    • Metadata disks presented individually to Storage Scale
  • Good network connectivity
    • Connectivity type partially depends on how you access your disk (IB SAN, SAS, Fibre Channel)
    • Cluster network type should match your compute nodes (IB, OPA, Ethernet)
    • Balance the two as much as possible, leave some overhead for other tasks


Storage Scale Architecture


Image Credit: ibm.com


File Sets

  • A way of breaking up a file system into different units that can each have different properties while still using the same underlying NSDs
  • Allows an admin to not have to sacrifice performance for the sake of logical separation
  • Enables policy engine scans to run on individual file sets (if using independent inode spaces)
    • Speeds up the policy run
  • Parameters that each file set can have tuned separately:
    • Block Size
    • Inode Limits
    • Quotas


Storage Scale Tuning


Tuning Parameters

  • Start with the operating system and the attached disk systems. Make sure you have the optimal settings for your environment first before trying to tune Storage Scale.
  • Run a baseline IOR and mdtest on the file system so you know what your initial performance numbers look like.
  • Only make one change at a time, running IOR and mdtest after each change to verify whether what you did hurt or helped the situation.


Tuning Parameters

  • As of Storage Scale 5.1.9.0, there are over 250 parameters within Storage Scale
    • Take a look at mmdiag --config output
  • We are going to just touch on a few of them because Storage Scale has gotten much smarter at its own configuration
  • A good chunk of settings may not apply depending on what Storage Scale features you’re using
  • Note some tuning parameters can be changed online, while others require a full cluster outage – watch out for those


File System Block Size

  • Determined when file system is created, a very key choice to make
  • Different pools can have different block sizes
  • Count of sub-blocks per block is determined by the smallest block size a pool in the FS is set to (!!)
  • Make sure you’re using optimal chunk sizes on your underlying storage devices that align with your FS block size
  • Generate a file-size histogram of the files on your file system to aid you in setting this parameter
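
One rough way to build such a histogram is sketched below, using standard GNU find/awk against a representative directory (the path is hypothetical; the policy engine covered later can produce similar reports at scale):

# find /fs0/projects -type f -printf '%s\n' | \
    awk '{ b=1; while (b < $1) b*=2; bucket[b]++ } END { for (b in bucket) print b, bucket[b] }' | sort -n
    • Prints a count of files per power-of-two size bucket (the bucket value is the upper bound in bytes)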


Tuning Parameters

Page Pool

  • Determines the size of the Storage Scale file data block cache
  • Unlike local file systems that use the operating system page cache to cache file data, Storage Scale allocates its own cache called the pagepool
  • The Storage Scale pagepool is used to cache user file data and file system metadata
  • Can be set on a node class basis
  • Allocated at the startup of the mmfs daemon
    • For large pagepool sizes you may see delay on daemon startup while this gets allocated (tail log file /var/adm/ras/mmfs.log.latest)
  • Can now be set to be dynamic
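
As an illustration only (the node class name nsdnodes and the value are hypothetical; size it against the memory guidance earlier in this deck):

# mmchconfig pagepool=64G -N nsdnodes
    • Takes effect at the next mmfs daemon restart unless applied immediately with the -i option (where supported); the active value can be checked with mmdiag --config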


Tuning Parameters

maxMBpS

  • Specifies an estimate of how much I/O throughput, into or out of a single node, can occur
    • Default is 2048MB/s
  • Value is used in calculating the amount of I/O that can be done to effectively pre-fetch data for readers and write‐behind data from writers
  • You can lower this amount to limit I/O demand from a single node on a cluster
  • You can also raise this amount to increase the I/O demand allowed from a single node
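
A hedged example of raising the estimate for a set of data-mover nodes (the node class name gridftp is hypothetical):

# mmchconfig maxMBpS=10000 -N gridftp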


Tuning Parameters

maxFilesToCache

  • Controls how many file descriptors (inodes) each node can cache. Each cached file requires memory for the inode and a token (lock).
  • Tuning Guidelines
    • The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files and metadata operations such as "ls" on large directories.
    • Increasing maxFilesToCache can improve the performance of user interactive operations like running "ls".
    • Don't increase the value of maxFilesToCache on all nodes in a large cluster without ensuring you have sufficient token manager memory to support the possible number of outstanding tokens.


Tuning Parameters

maxStatCache

  • The maxStatCache parameter sets aside pageable memory to cache attributes of files that are not currently in the regular file cache
    • This can be useful to improve the performance of stat() calls for applications with a working set that does not fit in the regular file cache
    • The memory occupied by the stat cache can be calculated as: maxStatCache × 176 bytes
  • Storage Scale supports a peer-to-peer access of this cache to improve performance of the file system by reducing load on the NSD servers hosting the metadata


Tuning Parameters

nsdMaxWorkerThreads

  • Sets the maximum number of NSD threads on an NSD server that will be concurrently transferring data with NSD clients
    • The maximum is constrained by worker1Threads + prefetchThreads + nsdMaxWorkerThreads < 8192 on 64-bit architectures
    • The default is 64 (in 3.4) and 512 (in 3.5), with a minimum of 8 and a maximum of 8,192
    • In some cases it may help to increase nsdMaxWorkerThreads for large clusters.
    • Scale this with the number of LUNs, not the number of clients. You need this to manage flow control on the network between the clients and the servers.


Storage Scale Node Classes


GPFS Node Classes

  • A node class is simply a user defined logical grouping of nodes
  • You can use a node class with any GPFS command that uses the ”-N” option to specify a list of nodes
  • The systems in a group may perform the same type of functions
  • The systems in a group may have the same characteristics, such as GPU processors, larger memory, faster CPUs, etc
  • You may group servers together that have special GPFS configuration settings just for them


Creating a Node Class


  • Can be handy to create a node class for your core NSD servers
  • Other potentially handy classes: CES nodes, login nodes, GridFTP nodes, ConnectX-6 clients, ConnectX-5 clients, etc.
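
A minimal sketch of creating and inspecting a class (the class and node names are hypothetical):

# mmcrnodeclass corensd -N nsd01,nsd02,nsd03

# mmlsnodeclass corensd
    • The class can then be used anywhere a command accepts -N, including the mmchconfig examples earlier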


List of Node Classes


  • Use the “mmlsnodeclass” command to view the current node classes on the system and what members are in them


Storage Scale Snapshots


What Is A Snapshot

  • A snapshot can preserve the state of a file system at a given moment in time
    • Snapshots at File System level are known as Global snapshots
  • The space a snapshot takes up is the amount of data in blocks that have been deleted or changed since the snapshot was taken
  • Snapshots of a file system are read-only; changes can only be made to the active (that is, normal, non-snapshot) files and directories


What Is A Snapshot

  • Creates a consistent copy of the file system at a given moment in time while not interfering with backups or replications occurring on the file system
  • Allows for the easy recovery of a file; while not a backup, it can be used as one in certain scenarios:
    • User accidental file deletion
    • Recovery of older file state for comparison
    • Accidental overwrite of file


Snapshot Types

File System Snapshot

  • Taken for the entire file system. Again, only the changed blocks are stored to reduce the snapshot size

Fileset Snapshot

  • You can also take a snapshot of any independent inode file set separate from a file system snapshot
  • Instead of creating a global snapshot of an entire file system, a fileset snapshot can be created to preserve the contents of a single independent fileset plus all dependent filesets that share the same inode space.
    • If an independent fileset has dependent filesets that share its inode space, then a snapshot of the independent fileset will also include those dependent filesets.


Snapshot Storage

  • Snapshots are stored in a special read-only directory named .snapshots by default
  • This directory resides in the top-level directory of the file system or the top-level of the fileset
  • Snapshots can only be taken on file systems or independent inode filesets
    • Dependent filesets are captured in the independent fileset parent snapshot (or the file system snapshot if they are dependent on the root inode)
  • Note: when running syncs of certain filesets or file systems, make sure to exclude the .snapshots directory, otherwise you’ll transfer a lot of unwanted data


Snapshot Creation


  • Use the “mmcrsnapshot” command to create the snapshot
  • Below are examples of a file system level (global) snapshot and a fileset snapshot
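
Minimal command sketches (the file system fs0, fileset scratch, and snapshot names are hypothetical, and fileset snapshot syntax varies slightly by release):

# mmcrsnapshot fs0 snap_20240513
    • Creates a global snapshot of file system fs0

# mmcrsnapshot fs0 scratch:snap_20240513
    • Creates a snapshot of just the independent fileset scratch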


Listing Snapshots


  • Use the “mmlssnapshot” command to view all the snapshots currently stored on a given file system
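
For example (fs0 is hypothetical; the -d option, where available, also reports the space each snapshot is using):

# mmlssnapshot fs0 -d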


Snapshot Deletion


  • Delete snapshots using the “mmdelsnapshot” command
  • Below are examples of deleting a file system level (global) snapshot and a fileset snapshot
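
Minimal command sketches (same hypothetical names as the creation examples; fileset syntax varies slightly by release):

# mmdelsnapshot fs0 snap_20240513

# mmdelsnapshot fs0 scratch:snap_20240513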


File Level Restore from Snapshot

  • In order to restore a file, you can traverse the directories in the .snapshots directory
  • The directories have the name given to the snapshot when the mmcrsnapshot command was executed
  • You can search for the file you want to restore and then use rsync or cp to copy the file wherever you would like, outside of the .snapshots directory
  • Self-service for users, doesn’t require an admin to get back the snapshot data, standard Linux permissions still apply to the snapshot data
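
For example (the paths and snapshot name are hypothetical):

# cp -a /fs0/.snapshots/snap_20240513/projects/user1/results.csv /fs0/projects/user1/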


Snapshot Restore Utility

# mmsnaprest -h

GPFS Restore From Snapshot

Please note: This utility uses rsync style processing for directories. If
you are unsure of how that matching works, you may want to play
with it in a test area. There are examples in the EXAMPLES
section of this help screen.

Usage: mmsnaprest [-D|--debug] [-u|--usage] [-v|--verbose] [-h|--help]
                  [--dry-run] [-ls SOURCE] [-s SOURCE -t TARGET]


  • Useful for bulk restores from a snapshot
  • Massive data deletion (someone let an rm -rf go wild) requires a large restore from a snapshot
  • Native to Storage Scale, and written by IBM


Storage Scale Cluster Export Services


CES – Cluster Export Services

  • Provides highly available file and object services to a Storage Scale cluster such as NFS, SMB, Object, and Block

High availability

  • With Storage Scale, you can configure a subset of nodes in the cluster to provide a highly available solution for exporting Storage Scale file systems using NFS, SMB, Object, and Block.
    • Nodes are designated as Cluster Export Services (CES) nodes or protocol nodes. The set of CES nodes is frequently referred to as the CES cluster.
  • A set of IP addresses, the CES address pool, is defined and distributed among the CES nodes
    • If a node enters or exits the CES cluster, IP addresses are dynamically reassigned
    • Clients use these floating IP addresses to access the CES services
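
A hedged sketch of managing the address pool (the IP address is hypothetical, and option spellings can vary slightly by release):

# mmces address add --ces-ip 172.16.10.50
    • Adds an address to the CES pool

# mmces address list
    • Shows pool addresses and which CES node currently holds each one

# mmces state show -a
    • Shows protocol service health across all CES nodes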


CES – Cluster Export Services

Monitoring

    • CES monitors the state of the protocol services itself
      • Checks not just for host availability, but also the health of the services
      • If a failure is detected, CES will migrate IP addresses away from the node and mark it as offline for CES services

Protocol support

    • CES supports the following export protocols: NFS, SMB, object, and iSCSI (block)
      • Protocols can be enabled individually
      • If a protocol is enabled, all CES nodes will serve that protocol
    • The following are examples of enabling and disabling protocol services by using the mmces command:
      • mmces service enable nfs – enables the NFS protocol in the CES cluster
      • mmces service disable obj – disables the Object protocol in the CES cluster


Common CES Commands

  • mmblock - Manages the BLOCK configuration operations
  • mmces - Manages the CES address pool and other CES cluster configuration options
  • mmnfs - Manages NFS exports and sets the NFS configuration attributes
  • mmobj - Manages the Object configuration operations
  • mmsmb - Manages SMB exports and sets the SMB configuration attributes
  • mmuserauth - Configures the authentication methods that are used by the protocols


Storage Scale Policy Engine


Policy Engine

  • The GPFS policy engine allows you to run SQL-like queries against the file system and get reports based on those queries
  • The policy engine can also be used to invoke actions, such as compression, file movement, etc
  • Customized scripts can also be invoked, letting you have full control over anything that is being done
  • There are many parameters that can be specified. For a list of them, check out the Storage Scale Administration and Programming Reference


Example Policy Run #1

  • Here is a simple sample policy that will just list all of the files in /fs0/projects along with the file’s allocation, its actual size, owner and fileset name. It also displays the inode number and fully qualified path name.

# cat rules.txt
RULE 'listall' list 'all-files'
  SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' || varchar(user_id) || ' ' || fileset_name )
  WHERE PATH_NAME LIKE '/fs0/projects/%'


Example Policy Run #1

Sample output from a policy run:

# mmapplypolicy fs0 -f /fs0/tmp/ -P rules.txt -I defer

[I] GPFS Current Data Pool Utilization in KB and %

Pool_Name KB_Occupied KB_Total Percent_Occupied

archive 131072 41934848 0.312561047%

data 192512 41934848 0.459074038%

system 0 0 0.000000000% (no user data)

[I] 4422 of 502784 inodes used: 0.879503%.

[W] Attention: In RULE 'listall' LIST name 'all-files' appears but there is no corresponding "EXTERNAL LIST 'all-files' EXEC ... OPTS ..." rule to specify a program to process the matching files.

[I] Loaded policy rules from rules.txt.

Evaluating policy rules with CURRENT_TIMESTAMP = 2017-07-25@15:34:38 UTC

Parsed 1 policy rules.

RULE 'listall' list 'all-files'

SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' || varchar(user_id) || ' ' || fileset_name )

WHERE PATH_NAME LIKE '/fs0/projects/%'

[I] 2017-07-25@15:34:39.041 Directory entries scanned: 385.

[I] Directories scan: 362 files, 23 directories, 0 other objects, 0 'skipped' files and/or errors.

[I] 2017-07-25@15:34:39.043 Sorting 385 file list records.

[I] Inodes scan: 362 files, 23 directories, 0 other objects, 0 'skipped' files and/or errors.


Example Policy Run #1

Sample output from a policy run (continued):

[I] 2017-07-25@15:34:40.954 Policy evaluation. 385 files scanned.

[I] 2017-07-25@15:34:40.956 Sorting 360 candidate file list records.

[I] 2017-07-25@15:34:41.024 Choosing candidate files. 360 records scanned.

[I] Summary of Rule Applicability and File Choices:

Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule

0 360 61184 360 61184 0 RULE 'listall' LIST 'all-files' SHOW(.) WHERE(.)

[I] Filesystem objects with no applicable rules: 25.

[I] GPFS Policy Decisions and File Choice Totals:

Chose to list 61184KB: 360 of 360 candidates;

Predicted Data Pool Utilization in KB and %:

Pool_Name KB_Occupied KB_Total Percent_Occupied

archive 131072 41934848 0.312561047%

data 192512 41934848 0.459074038%

system 0 0 0.000000000% (no user data)

[I] 2017-07-25@15:34:41.027 Policy execution. 0 files dispatched.

[I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;

0 'skipped' files and/or errors.

#


Example Policy Run #1

Sample output from a policy run:

# wc -l /fs0/tmp/list.all-files

360 /fs0/tmp/list.all-files

# head -n 10 /fs0/tmp/list.all-files

402432 374745509 0 3584 1741146 0 projects -- /fs0/projects/dar-2.4.1.tar.gz

402434 229033036 0 0 1217 1000 projects -- /fs0/projects/dar-2.4.1/README

402435 825781038 0 256 43668 1000 projects -- /fs0/projects/dar-2.4.1/config.guess

402436 1733958940 0 256 18343 1000 projects -- /fs0/projects/dar-2.4.1/config.rpath

402437 37654404 0 0 371 1000 projects -- /fs0/projects/dar-2.4.1/INSTALL

402438 1471382967 0 0 435 1000 projects -- /fs0/projects/dar-2.4.1/TODO

402440 398210967 0 0 376 1000 projects -- /fs0/projects/dar-2.4.1/misc/batch_cygwin

402441 292549403 0 0 738 1000 projects -- /fs0/projects/dar-2.4.1/misc/README

402442 1788675584 0 256 3996 1000 projects -- /fs0/projects/dar-2.4.1/misc/dar_ea.rpm.proto

402443 637382920 0 256 4025 1000 projects -- /fs0/projects/dar-2.4.1/misc/dar64_ea.rpm.proto

#


Example Policy Run #2

  • One of our actual scratch purge policies that we run daily to keep users’ old data cleaned up

RULE 'purge_30days' DELETE
  FOR FILESET ('scratch')
  WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '30' DAYS and
        CURRENT_TIMESTAMP - CREATION_TIME > INTERVAL '30' DAYS and
        CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS and
        PATH_NAME LIKE '/gpfs/iccp/scratch/%'


Example Policy Run #2

Sample output from a policy run:

[I] GPFS Current Data Pool Utilization in KB and %

Pool_Name KB_Occupied KB_Total Percent_Occupied

data 1006608482304 2621272227840 38.401523948%

system 0 0 0.000000000% (no user data)

[I] 378536926 of 689864704 inodes used: 54.871183%.

[I] Loaded policy rules from scratch.purge.policy.

Evaluating policy rules with CURRENT_TIMESTAMP = 2019-04-12@16:00:02 UTC

Parsed 1 policy rules.

RULE 'purge_30days' DELETE

FOR FILESET ('scratch')

WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '30' DAYS and

CURRENT_TIMESTAMP - CREATION_TIME > INTERVAL '30' DAYS and

CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS and

PATH_NAME LIKE '/gpfs/iccp/scratch/%'

[I] 2019-04-12@16:00:04.045 Directory entries scanned: 0.

[I] 2019-04-12@16:00:19.026 Directory entries scanned: 1376623.

[I] 2019-04-12@16:00:34.027 Directory entries scanned: 1376623.

[I] 2019-04-12@16:00:37.104 Directory entries scanned: 8576323.

[I] Directories scan: 4132091 files, 3713818 directories, 730414 other objects, 0 'skipped' files and/or errors.


Example Policy Run #2

Sample output from a policy run (continued):

[I] 2019-04-12@16:00:37.145 Parallel-piped sort and policy evaluation. 0 files scanned.

[I] 2019-04-12@16:00:42.975 Parallel-piped sort and policy evaluation. 8576323 files scanned.

[I] 2019-04-12@16:00:43.523 Piped sorting and candidate file choosing. 0 records scanned.

[I] 2019-04-12@16:00:43.647 Piped sorting and candidate file choosing. 90047 records scanned.

[I] Summary of Rule Applicability and File Choices:

Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule

0 90047 1078304928 90047 1078304928 0 RULE 'purge_30days' DELETE FOR FILESET(.) WHERE(.)

[I] Filesystem objects with no applicable rules: 8486148.

[I] GPFS Policy Decisions and File Choice Totals:

Chose to delete 1078304928KB: 90047 of 90047 candidates;

Predicted Data Pool Utilization in KB and %:

Pool_Name KB_Occupied KB_Total Percent_Occupied

data 1005533405024 2621272227840 38.360510379%

system 0 0 0.000000000% (no user data)

[I] 2019-04-12@16:00:43.732 Policy execution. 0 files dispatched.

[I] 2019-04-12@16:00:49.027 Policy execution. 65886 files dispatched.

[I] 2019-04-12@16:00:51.069 Policy execution. 90047 files dispatched.

[I] A total of 90047 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;

0 'skipped' files and/or errors.


Storage Scale Storage Pools & File Placement


Storage Pools

  • Physically, a storage pool is a collection of disks or RAID arrays
    • Allow you to group multiple storage systems within a file system.
  • Using storage pools, you can create tiers of storage by grouping storage devices based on performance, locality, or reliability characteristics
    • One pool could be an All Flash Array (AFA) with high-performance SSDs
    • Another pool might consist of numerous disk controllers that host a large set of economical SAS/SATA drives


Example Storage Pool Configuration

Flash Tier

  • Holds Metadata pool (system pool)
  • Maybe has filesets pinned to it via Storage Scale Policies
    • Popular areas would be /home, /apps, and even /scratch

Capacity Tier

  • Large disk enclosures behind RAID controllers presenting big NSDs built from large hard drives
  • Higher Storage Device to Server ratio
  • Potentially less bandwidth for connectivity


Information Lifecycle Management

  • Storage Scale includes the ILM toolkit that allows you to manage your data via the built in policy engine
  • No matter the directory structure, Storage Scale can automatically manage what storage pools host the data, and for how long
    • Throughout the life of the data, Storage Scale can track and migrate data based on your policy-driven rules
  • You can match the data and its needs to hardware, allowing for cost savings
  • Great method for spanning infrastructure investments
    • New hardware is for more important/more used data
    • Older hardware becomes the slower storage pool


Information Lifecycle Management

  • There are three types of storage pools in Storage Scale:
    • A required system pool that you create
    • Optional user storage pools that you create
    • Optional external storage pools that you define with policy rules and manage through an external application (e.g., Storage Protect/Tivoli)
  • Create filesets to provide a way to partition the file system namespace to allow administrative operations to work at a narrower level than the entire file system
  • Create policy rules based on data attributes to determine the initial file placement and manage data placement throughout the life of the file


Storage Tiering

  • Your Storage Scale cluster can have differing types of storage media connected by differing technologies as well, including but not limited to:
    • NVME
    • SAS SSD
    • SAS/FC/IB attached Hard Drives
    • Self-Encrypting drives
    • You may have differing drive sizes: 6T, 8T, 10T, 12T, etc.
    • Tape Libraries


Using Tiered Storage

  • Let’s say you have the following devices on your system:
    • /dev/nvme01 /dev/nvme02
    • /dev/sas01 /dev/sas02
    • /dev/sata01 /dev/sata02
  • the SAS drives above are SED

  • There are many different ways that you can configure a Storage Scale file system. To make it interesting, let’s have the following business rules that we need to satisfy:
    • Very fast file creates and lookups, including a mirrored copy.
    • A decent storage area for data files
    • All files placed in /gpfs/fs0/health live on encrypted drives
    • What would this configuration look like?


Tiered Storage: Example Config


Flash Metadata Pool:

%nsd:
  nsd=nvme01
  usage=metadataOnly
  pool=system
%nsd:
  nsd=nvme02
  usage=metadataOnly
  pool=system

Bulk Data Pool:

%nsd:
  nsd=sata01
  usage=dataOnly
  pool=data
%nsd:
  nsd=sata02
  usage=dataOnly
  pool=data

Encrypted Data Pool:

%nsd:
  nsd=sas01
  usage=dataOnly
  pool=hippa
%nsd:
  nsd=sas02
  usage=dataOnly
  pool=hippa


File Placement Policies

  • If you are utilizing multiple storage pools within Storage Scale, you must specify a default storage policy at a minimum.
  • File placement policies are used to control what data is written to which storage pool.
  • A default policy rule can be quite simple. For example, if you have a ’data’ pool and want to write all files there, create a file called policy with a single line containing the following rule:

# cat policy

rule 'default' set pool 'data'
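
Tying this back to the earlier tiering example, a slightly fuller placement policy could look like the sketch below (it assumes a fileset named 'health' backing /gpfs/fs0/health; the default rule should come last):

rule 'health_data' set pool 'hippa' for fileset ('health')
rule 'default' set pool 'data'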


Installing File Placement Policies

# Usage: mmchpolicy Device PolicyFilename

[-t DescriptiveName] [-I {yes|test}]


Testing the policy before installing it is good practice!

# mmchpolicy fs0 policy -I test

Validated policy 'policy': Parsed 1 policy rules.

No errors on the policy, so let’s install it:

# mmchpolicy fs0 policy

Validated policy 'policy': Parsed 1 policy rules.

Policy `policy' installed and broadcast to all nodes.


Viewing Installed Policies

# Usage: mmlspolicy Device

List the file placement policies to verify the prior policy installed successfully:

# mmlspolicy fs0

Policy for file system '/dev/fs0':

Installed by root@ss-demo1.os.ncsa.edu on Fri Apr 12 09:26:10 2019.

First line of policy 'policy' is:

rule 'default' set pool 'data'


Storage Scale Monitoring


Monitoring with mmpmon

  • Built in tool to report counters that each of the mmfs daemons keep
  • Can output results in either machine parseable or human readable formats
  • Some of the statistics it monitors on a per host basis:
    • Bytes Read
    • Bytes Written
    • File Open Requests
    • File Close Requests
    • Per NSD Read/Write
  • The machine-parseable output is easy to use for scripted data gathering
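
A minimal invocation sketch (the request file contents and timing values are illustrative; see the mmpmon documentation for the full request list):

# echo fs_io_s > /tmp/mmpmon.cmds

# mmpmon -i /tmp/mmpmon.cmds -p -r 0 -d 10000
    • -p requests machine-parseable output, -r 0 repeats forever, and -d sets the delay between samples in milliseconds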


Monitoring with mmpmon


Sample output from mmpmon (human readable)


Monitoring with mmpmon

  • Can be used to make useful graphs (demonstrated live)


  • Sample output from mmpmon (machine parseable)


Other Storage Scale Monitoring

  • Using the built in ZiMon sensors with mmperfmon
  • Storage Scale GUI now has the ability to have performance monitoring with graphs
  • Storage Scale Grafana Bridge
    • Python standalone application that puts Storage Scale performance data into openTSDB which Grafana can understand
    • Data that is pushed “across” the bridge is gathered by the ZiMon Monitoring Tool


Resources


What is Ceph?

  • A ”universal” file system
  • Name comes from “cephalopod”
    • Many “tentacles”
    • Lots of things at once
  • Distributed File System
    • Highly redundant
    • Highly parallel
    • Created to avoid single points of failure
    • Supports the three big storage access methods:
      • POSIX file system (file)
      • Block Storage
      • S3 object storage


What is Ceph?

  • Designed primarily to run on Ethernet
    • Has IB support but rarely used
    • Common to run it with a “front end” network and a “back end” network, though many deployments are moving to a single network
  • Provisioning methods have shifted over time
    • Went from internal ceph tools/commands
    • Next to ceph-ansible
    • Now back to ceph tools/commands (cephadm)
  • Commonly deployed in
    • Openstack cloud environments
    • As stand-alone object storage
    • Occasionally as mounted file system on HPC


Scaling Up Ceph

  • In the context of large POSIX-style filesystems, how do we usually go “bigger”?
    • More controllers/server couplets
  • What are the implications of this?
  • Ceph’s distributed nature scales in a different way
  • Two key ways to try and convey how Ceph works
    • Cover what “CRUSH” is
    • Pictures/Diagrams can be most helpful when trying to understand Ceph’s architecture


Underlying concepts

  • CRUSH (Controlled Replication Under Scalable Hashing)
    • “[…] is a hash-based algorithm for calculating how and where to store and retrieve data in a distributed object-based storage cluster. CRUSH distributes data evenly across available object storage devices in what is often described as a pseudo-random manner. Distribution is controlled by a hierarchical cluster map called a CRUSH map. The map, which can be customized by the storage administrator, informs the cluster about the layout and capacity of nodes in the storage network and specifies how redundancy should be managed. By allowing cluster nodes to calculate where a data item has been stored, CRUSH avoids the need to look up data locations in a central directory. CRUSH also allows for nodes to be added or removed, moving as few objects as possible while still maintaining balance across the new cluster configuration. […] Because CRUSH allows clients to communicate directly with storage devices without the need for a central index server to manage data object locations, Ceph clusters can store and retrieve data very quickly and scale up or down quite easily.”


Underlying concepts

  • RADOS (Reliable Autonomic Distributed Object Store)
  • “[…] is an open source object storage service that is an integral part of the Ceph distributed storage system. Ceph RADOS system typically consists of a large collection of standard commodity servers, also known as storage nodes. Common use cases for a Ceph RADOS system are as a standalone storage system or as a back end for OpenStack Block Storage. RADOS has the ability to scale to thousands of hardware devices by making use of management software that runs on each of the individual storage nodes. […]” (Courtesy https://searchstorage.techtarget.com/definition/RADOS-Reliable-Autonomic-Distributed-Object-Store)


Conceptual Diagram


Conceptual Diagram


Terminology

  • Release naming
  • OSD: Object Storage Device – usually an entire disk, but can be a partition on a disk
  • MDS: MetaData Server – only used for Posix filesystem
  • Monitor (MON): Coordinators of the cluster, needs a quorum
  • Manager (MGR): Introduced in Kraken and required since Luminous – runs alongside the MON; provides additional monitoring and an external interface to monitoring tools
  • Pools: A virtual chunk of storage with its own characteristics (replication, underlying storage, placement groups)
  • Placement Groups (PG): A defined-size chunk of data (analogous to a disk block) that gets placed onto an OSD


How does this affect you as an Engineer (Storage Admin)?

  • Redundancy you can control
    • Failure Domains
    • Redundancy Levels
  • Scalability
  • Performance Related Capabilities


Locality Customization

  • Failure Domain Localities
    • OSD (aka device)
    • Host
    • Chassis
    • Rack
    • Row
    • PDU
    • Pod
    • Room
    • Datacenter
    • Region


Failure Domains

  • Controls layout (has impact on the CRUSH map) of data across the system
    • Are you protecting against rack failure or power feed failure?
    • Do you want to distribute across multiple campus datacenters?
  • There are implications to increasing the “scope” or size of the fault domain
    • Usually, the farther “apart” failure domains are, the more latency increases
    • Ceph is a consistent file system, so increased latency between failure domains increases FS latency
  • Choose what makes the most sense for your infrastructure
    • Your campus/institution/organization will likely have somewhat unique situations depending on its history


Redundancy

  • Two main types of data protection “algorithms”
    • Replication
    • Erasure
  • There are benefits and downsides to each and each can be useful in different situations
  • Replication was the first method
  • Erasure support was added later
    • But has been around a while; very stable
  • Things to consider when deciding which type
    • Storage medium
    • Durability Needs
    • Cost Factors


Redundancy

  • Replication
    • What it sounds like, data chunks are replicated N times across the storage media
    • Default is replica of 3, good for HDDs especially
    • For NVME-based Ceph clusters replication of 2 is generally sufficient due to greatly enhanced rebuild times
    • Method to use when $/TB not ultimate concern, offers better IOPs (especially on reads)
  • Erasure
    • Parity data is generated from data blocks and stored on disk, similar-ish to RAID but much more distributed and flexible (and rebuilds faster)
    • Formula for usable space: # of OSD * (K/(K+M)) * OSD Size
      • Where K is # of data chunks and M is # of parity chunks
      • Common are 4+2 or 8+3
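      • As a worked example (illustrative numbers only): 66 OSDs of 16TB each in an 8+3 erasure pool give 66 × (8/11) × 16TB ≈ 768TB usable, versus roughly 352TB for the same disks with 3-way replication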


Scalability features

Ceph has some useful features that greatly aid in its ability to scale well across hundreds-thousands of storage servers

  • Automatic failure rebalance
    • Re-replicates chunks on drives that fail
    • Reduces usable capacity but gets the data protected quickly
  • Clients talk directly to the hosting server (CRUSH)
    • Reduces bottlenecks that can occur when channeling communication through fewer servers that then talk to the block devices
  • Like many file systems, Ceph’s performance scales as you add hardware to it


Scalability features

  • Handles mixed drive sizes well
    • Another benefit of CRUSH is that it ends up being more flexible with respect to drive sizes
    • There are some performance implications here though with regard to balancing of space
  • Multiple active metadata daemons
    • New-ish to Ceph, but you can now scale up the number of active metadata processes to help with metadata-intensive workloads
    • Also can/should/must have daemons on standby to handle HA needs in case of failures


Performance features

Ceph has some specific performance related features that have been developed over the years that are worth noting

  • Bluestore
    • How Ceph formats the raw devices (HDDs, SSDs)
    • Initially drives needed to be pre-formatted with XFS (the FileStore backend), but that changed with Bluestore
    • Ceph now controls the full stack from the client to the block devices themselves
    • This allowed Ceph to gain a lot more reliability and performance
  • Pools
    • Can be on different media types allowing you to direct I/O to certain hardware capabilities


Performance features

  • Pools (continued)
    • Resilience can also be distinct per pool, to match the hardware that backs it
    • You may have an HDD pool that is 4+2 erasure and a flash pool that is replica 2
  • Inline Data (experimental)
    • Small files (less than 2KB) are stored in the inode on the MDS so they can be served straight from there
    • Can enhance performance… if you have enough MDSs and they are performant enough; otherwise it just moves the bottleneck
  • LazyIO
    • Relaxes POSIX semantics allowing buffered I/O with relaxed/no locking; clients have to manage cache coherency


Ceph Monitoring

Before we dive into debugging, it’s helpful to touch on monitoring Ceph

  • Like any file system, it’s important to have good monitoring and alerts configured for your storage
  • Big things to track
    • mon quorum status
    • OSD health
    • PG health
    • physical infrastructure health
    • QoS metrics (latency, etc.)


Ceph Monitoring

  • There are many ways to get metrics out of Ceph natively
    • Prometheus
    • Telegraf
    • StatsD
    • Many more
  • Use the tools you prefer that fit within your org
  • We have a handy dashboard that helps us track many of these things at a glance


Ceph Monitoring


Ceph Tools

Some commands to have on hand when debugging a cluster or getting the “lay of the land” on a cluster you inherit

# ceph -w

    • This command essentially is a “watch” on the output of ceph status; handy to keep running while you’re debugging

# ceph osd df

    • Prints out the usage and state of all your OSDs

# ceph daemon osd.XX config show

    • Dumps out that OSD’s config (OSD must be hosted on machine you run the command on); nice for debugging an OSD issue


Ceph Tools

# ceph osd tree

    • Prints out summary of all OSDs in the cluster
    • Sample output


Crushmaps

  • De-compiling the crush map

# ceph osd getcrushmap -o /root/crushmap_raw

# crushtool -d /root/crushmap_raw -o /root/crushmap.txt

  • You now have a text readable file of the active crushmap the system is using
    • If you have changes you want to make to the crushmap, you can do so in that text file
    • Sometimes during troubleshooting you may want/need to adjust the crushmap to solve your problem


Crushmaps

  • After editing the crush map file you can re-compile it and inject it

# crushtool -c /root/crushmap.txt -o /root/crushmap_new

# ceph osd setcrushmap -i /root/crushmap_new

  • A key thing is to compile the new version as a new/separate file from the raw one you downloaded
    • Makes it easy to restore back to the prior running version if you want to
    • Keeping it in git/change control is not a bad idea


Ceph PGs (Placement Groups)

  • Can be a common source of pain when managing a ceph environment
    • Even more in a heterogeneous environment
  • There is dedicated documentation out in the ceph docs specifically for this area
  • In summary, you want all your PGs in an “active+clean” state on your system


Ceph PGs (Placement Groups)

Some useful pg-related commands:

# ceph pg ls

# ceph pg 1.0 query

# ceph pg dump_stuck [stale, inactive, unclean]

# ceph pg repair 1.0

  • Ceph has good “man” pages; they describe a lot of these commands, and more, pretty well


Ceph PGs (Placement Groups)

  • When setting up a cluster and dealing with configuration, it’s very handy to calculate placement group information
  • Super handy source is hosted by ceph:
    • https://old.ceph.com/pgcalc/
    • Make sure you set the right use case in the “Use Case Selector”
    • Will even generate commands for you


Ceph Troubleshooting

    • Everyone’s experience is unique; however, I’ve found the following to be true about Ceph
    • When setup is done properly, it just runs; especially for “build-once-and-run” systems
    • Can be extremely reliable, and fault tolerant with little hand-holding once setup
    • If something goes wrong…it is usually complicated to sort out (though this is improving)
    • The more uniform your configuration of hardware, the better things will be; stay consistent where you can


Resources to Know About

  • The official websites
    • ceph.com
    • docs.ceph.com
  • Mailing Lists
    • ceph-users mailing list
    • ceph-devel mailing list… if you really want to hear about the nitty-gritty stuff
  • Engage with the community
    • https://ceph.io/en/community/
  • Red Hat materials
    • Red Hat acquired Inktank (the company behind Ceph) and is heavily involved in its development


Acknowledgements

  • Members of the SET group at NCSA for slide creation and review
  • Members of the LCI Steering Committee for slide review


Questions
