
Linux Clusters Institute: Storage Scale

J.D. Maloney | Lead HPC Storage Engineer

Storage Enabling Technologies Group (SET)

National Center for Supercomputing Applications (NCSA)

malone12@illinois.edu

Mississippi State, October 21st – 25th 2023


Storage Scale (GPFS) Overview

  • Product of IBM, gone through many name changes
  • Licensed file system; licensing is based on server socket count and client count, or on usable capacity
  • One of the two “prominent” file systems in use today by the world’s largest supercomputers
  • Generally considered easier to administer due to product maturity and Enterprise-level features
  • The file system generally expects to be running on top of highly reliable disks presented by redundant controllers


Quick History of Storage Scale

  • Began as the Tiger Shark File System in 1993 to handle multimedia applications
  • Also was influenced by IBM’s Vesta File System which was in development around the same time for use in different applications
  • Was productized in 1994, and the name changed to IBM GPFS (General Parallel File System) around 1998
  • Gone through many version changes and feature adds and in 2023 the name was changed to Storage Scale
    • Though we’re all still getting used to that ☺



Image Credit: spectrumscale.org


Stand Out Storage Scale Features

  • Distributed metadata servers, with no real limit to their number; metadata can be served from all of them
  • Allows data and metadata to be written inline with each other on the same storage device, no separate devices needed
  • Supports “Super inodes” where files less than ~3.8K actually fit inside the inode itself
    • Very handy when you have metadata pools that run on all flash devices
    • Leads to noticeably improved small file performance


Stand Out Storage Scale Features

  • Robust Tiering architecture based on storage pools
  • Built-in Policy Engine that can be used to query the file system and/or drive data movement
    • Run things like automated purges based on parameters
    • Move data between storage pools based on certain criteria
  • Built-in rebalancing of data across NSDs (LUNs)
    • Handy when you grow your storage system over time or when you’re doing big migrations or upgrades
  • Filesets for isolating different datasets


Stand Out Storage Scale Features

  • Great Sub-block Allocation
    • Dynamic Sub-block size based on File System block size
    • Allows for great streaming performance, but also great file space efficiency
    • You don’t have to compromise anymore (with v5) between performance and space efficiency


Storage Scale Weaknesses

  • License cost that scales with deployment
    • Not an open source FS
  • Multiple fabric support is less robust
  • Requires more robust/enterprise hardware to present reliable NSDs to servers
    • The Erasure Code Edition is not constrained by this weakness; however, it still has hardware qualifications and is not fully “white box” capable
  • Client limitation of around 10,000 clients per cluster (per IBM documentation)
  • Very sensitive to network disturbances
  • Orchestrates via SSH so NSD servers all need to have password-less SSH to clients


Storage Scale Appliances


  • Can buy ready-built appliances from many vendors, here are some:
    • IBM/Lenovo
    • Dell
    • HPE


Storage Scale Hardware


  • Many other vendors will sell you hardware pre-configured for Storage Scale file systems
  • Find a solution that hits your price point, from a vendor you have confidence in to provide a solid product
  • Needs to be built off a reliable storage appliance that can present LUNs through multiple controllers to multiple hosts
    • Can be tested on less robust hardware but not for production
  • Some gains can be had from integrated appliances, but they come with the trade-off of limited flexibility/customization


Storage Scale Concepts


Key Definitions


  • NSD (Network Shared Disk) – LUN presented to a Storage scale server to be used for the file system(s)
  • Cluster Manager – Storage Scale server that is elected to handle disk leases, detects and recovers node failures, distributes configuration information, etc.
  • File System Manager – Storage Scale server that coordinates token management, disk space allocation, mount/unmount requests, etc. for a file system
  • Quorum Node – Storage Scale server that helps the cluster maintain data integrity in case of node failure
  • File System – A group of NSDs that together form a mountable device on the client


Scaling Out


  • Since Storage Scale servers can each deal with both data and metadata, scaling comes by just increasing the total number of NSD servers
  • Many file systems can be run out of the same Storage Scale cluster (256 FS limit)
  • Which servers act as cluster and file system managers is dynamic
    • An election is held during startup of the mmfs daemon, and managers can be moved around by admins to get them onto desired nodes if there is a preference (see the sketch below)
    • Usually like to have even distribution as much as possible
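
  • A minimal sketch of checking and moving the manager roles (node names here reuse the ss-demo hosts from examples later in these slides):

Show the current cluster manager and the manager for each file system:

# mmlsmgr

Move the file system manager for fs0 to ss-demo2.local:

# mmchmgr fs0 ss-demo2.local

Move the cluster manager role to ss-demo1.local:

# mmchmgr -c ss-demo1.local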


Cluster vs Scatter


  • Two different block allocation map types
  • Parameter is chosen at file system create time, cannot be changed afterward
  • Cluster allocates blocks in chunks (clusters) on NSDs
    • Better for clusters with smaller quantities of disks and/or clients
  • Scatter allocates blocks randomly across NSDs
    • Better for clusters with larger quantities of disks or clients
  • The default setting is chosen based on the number of nodes and NSDs present in the cluster at the time of the create command
    • Threshold for switch from cluster to scatter is currently 8 nodes or disks
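
  • A hedged sketch of setting the allocation type explicitly at create time (the device name and NSD stanza file are placeholders):

Create fs0 with a scatter block allocation map:

# mmcrfs fs0 -F nsd.stanzas -j scatter

Confirm the allocation type after creation:

# mmlsfs fs0 -j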


Storage Scale NSD Server


  • Powerful dual-socket CPU system
  • The more memory the better; it is used for the pagepool
    • The lowest you’d probably want to go is 32GB/64GB
    • We set our NSD server memory at 384GB currently (smaller with embedded systems)
  • Fast disks for the metadata pool if possible
    • NVMe is a great candidate for this
    • Disks now come in U.2 form factor for easier access
    • Metadata disks are presented individually to Storage Scale
  • Good network connectivity
    • Connectivity type partially depends on how you access your disks (IB SAN, SAS, Fibre Channel)
    • Cluster network type should match your compute nodes (IB, OPA, Ethernet)
    • Balance the two as much as possible, and leave some overhead for other tasks


Storage Scale Architecture


Image Credit: ibm.com


File Sets

  • A way of breaking up a file system into different units that can each have different properties while still using the same underlying NSDs
  • Allows an admin to not have to sacrifice performance for the sake of logical separation
  • Enables policy engine scans to run on individual file sets (if using independent inode spaces)
    • Speeds up the policy run
  • Parameters that each fileset can have tuned separately:
    • Block Size
    • Inode Limits
    • Quotas
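
  • A minimal sketch of creating an independent fileset, linking it into the namespace, and applying a quota (names and limits are hypothetical):

Create an independent fileset with its own inode space and an inode limit:

# mmcrfileset fs0 projects --inode-space new --inode-limit 1000000

Link it into the file system at /fs0/projects:

# mmlinkfileset fs0 projects -J /fs0/projects

Set block quotas (soft:hard) on the fileset:

# mmsetquota fs0:projects --block 10T:12T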


Storage Scale Tuning


Tuning Parameters

  • Start with the operating system and the attached disk systems. Make sure you have the optimal settings for your environment first before trying to tune Storage Scale.
  • Run a baseline IOR and mdtest on the file system so you know what your initial performance numbers look like.
  • Only make one change at a time, running IOR and mdtest after each change to verify whether what you did hurt or helped the situation.
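
  • A hedged example of a baseline run (paths, process counts, and sizes are placeholders; exact flags vary by IOR/mdtest version):

Streaming bandwidth baseline, write then read, file-per-process:

# mpirun -np 16 ior -w -r -F -t 4m -b 4g -o /fs0/bench/ior.dat

Metadata baseline (create/stat/remove of many small files):

# mpirun -np 16 mdtest -n 10000 -u -d /fs0/bench/mdtest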


Tuning Parameters

  • As of Storage Scale 5.1.8.0, there are over 250 parameters within Storage Scale
    • Take a look at mmdiag --config output
  • We are going to just touch on a few of them because Storage Scale has gotten much smarter at its own configuration
  • Good chunk of settings may not apply depending on what Storage Scale features you’re using
  • Note some tuning parameters can be changed online, while others require a full cluster outage – watch out for those


File System Block Size

  • Determined when file system is created, a very key choice to make
  • Used to be a much harder choice in v4, much easier in v5…but still important!
    • If your cluster is stuck back on RHEL/CentOS 6.x then you’ll be stuck on v4 and need to think about this
  • Different pools can have different block sizes
  • Count of sub-blocks per block is determined by the smallest block size a pool in the FS is set to (!!)
  • Make sure you’re using optimal chunk sizes on your underlying storage devices that jive with your FS block size
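
  • A hedged sketch of choosing the data block size at create time and checking the resulting sub-block geometry (the 4M value is a placeholder, not a recommendation):

# mmcrfs fs0 -F nsd.stanzas -B 4M

Check block size, sub-block size, and sub-blocks per full block on a v5 file system:

# mmlsfs fs0 -B -f --subblocks-per-full-block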


Tuning Parameters

Page Pool

  • Determines the size of the Storage Scale file data block cache
  • Unlike local file systems that use the operating system page cache to cache file data, Storage Scale allocates its own cache called the pagepool
  • The Storage Scale pagepool is used to cache user file data and file system metadata
  • Can be set on a node class basis
  • Allocated at the startup of the mmfs daemon
    • For large pagepool sizes you may see delay on daemon startup while this gets allocated (tail log file /var/adm/ras/mmfs.log.latest)
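
  • A minimal sketch of setting the pagepool for a group of nodes (the node class name and size are placeholders; whether the change can be applied online with -i may depend on release):

# mmchconfig pagepool=64G -N nsdNodes -i

Verify the active value on a node:

# mmdiag --config | grep pagepool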


Tuning Parameters

maxMBpS

  • Specifies an estimate of how much I/O throughput into or out of a single node can occur
    • Default is 2048MB/s
  • Value is used in calculating the amount of I/O that can be done to effectively pre-fetch data for readers and write‐behind data from writers
  • You can lower this amount to limit I/O demand from a single node on a cluster
  • You can also raise this amount to increase the I/O demand allowed from a single node
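
  • A hedged example of raising the estimate on NSD servers and lowering it on login nodes (node class names and values are placeholders; depending on release the change may not take effect until the daemon restarts):

# mmchconfig maxMBpS=16000 -N nsdNodes
# mmchconfig maxMBpS=1000 -N loginNodes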


Tuning Parameters

maxFilesToCache

  • Controls how many file descriptors (inodes) each node can cache. Each cached file requires memory for the inode and a token (lock).
  • Tuning Guidelines
    • The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files and metadata operations such as "ls" on large directories.
    • Increasing maxFilesToCache can improve the performance of user interactive operations like running "ls".
    • Don't increase the value of maxFilesToCache on all nodes in a large cluster without ensuring you have sufficient token manager memory to support the possible number of outstanding tokens.
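
  • A minimal sketch of raising maxFilesToCache on compute clients (the node class and value are placeholders; this parameter typically takes effect after the mmfs daemon is restarted on those nodes):

# mmchconfig maxFilesToCache=50000 -N computeNodes

Check what a node is actually running with:

# mmdiag --config | grep maxFilesToCache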


Tuning Parameters

maxStatCache

  • The maxStatCache parameter sets aside pageable memory to cache attributes of files that are not currently in the regular file cache
    • This can be useful to improve the performance of stat() calls for applications with a working set that does not fit in the regular file cache
    • The memory occupied by the stat cache can be calculated as: maxStatCache × 176 bytes
  • Storage Scale supports a peer-to-peer access of this cache to improve performance of the file system by reducing load on the NSD servers hosting the metadata


Tuning Parameters

nsdMaxWorkerThreads

  • Sets the maximum number of NSD threads on an NSD server that will be concurrently transferring data with NSD clients
    • The sum of worker1Threads + prefetchThreads + nsdMaxWorkerThreads must be less than 8192 on 64-bit architectures
    • The default is 64 (in 3.4) or 512 (in 3.5), with a minimum of 8 and a maximum of 8,192
    • In some cases it may help to increase nsdMaxWorkerThreads for large clusters.
    • Scale this with the number of LUNs, not the number of clients. You need this to manage flow control on the network between the clients and the servers.


Storage Scale Node Classes


GPFS Node Classes

  • A node class is simply a user defined logical grouping of nodes
  • You can use a node class with any GPFS command that uses the ”-N” option to specify a list of nodes
  • The systems in a group may perform the same type of functions
  • The systems in a group may have the same characteristics, such as GPU processors, larger memory, faster CPUs, etc
  • You may group servers together that have special GPFS configuration settings just for them


Creating a Node Class

# mmcrnodeclass
mmcrnodeclass: Missing arguments.
Usage:
  mmcrnodeclass ClassName -N {Node[,Node...] | NodeFile | NodeClass}

# mmcrnodeclass coreio -N ss-demo1.local,ss-demo2.local
mmcrnodeclass: Propagating the cluster configuration data to all
  affected nodes. This is an asynchronous process.


  • Can be handy to create a node class for your core NSD servers
  • Other potentially handy classes: CES nodes, login nodes, GridFTP nodes, Connect-X6 clients, Connect-X5 clients, etc.


List of Node Classes

# mmlsnodeclass
Node Class Name       Members
--------------------- ---------------------------------------------
coreio                ss-demo1.local,ss-demo2.local
#


  • Use the “mmlsnodeclass” command to view the current node classes on the system and what members are in them


Storage Scale Snapshots


What Is A Snapshot

  • A snapshot preserves the state of a file system at a given moment in time
    • Snapshots at the file system level are known as global snapshots
  • The space a snapshot takes up is the amount of block data that has been changed or deleted since the snapshot was taken
  • Snapshots of a file system are read-only; changes can only be made to the active (that is, normal, non-snapshot) files and directories


What Is A Snapshot

  • Creates a consistent copy of the file system at a given moment in time while not interfering with backups or replications occurring on the file system
  • Allows for easy recovery of files; while not a backup, it can be used as one in certain scenarios:
    • User accidental file deletion
    • Recovery of older file state for comparison
    • Accidental overwrite of file


Snapshot Types

File System Snapshot

  • Taken for the entire file system. Again, only the changed blocks are stored to reduce the snapshot size

Fileset Snapshot

  • You can also take a snapshot of any independent inode file set separate from a file system snapshot
  • Instead of creating a global snapshot of an entire file system, a fileset snapshot can be created to preserve the contents of a single independent fileset plus all dependent filesets that share the same inode space.
    • If an independent fileset has dependent filesets that share its inode space, then a snapshot of the independent fileset will also include those dependent filesets.


Snapshot Storage

  • Snapshots are stored in a special read-only directory named .snapshots by default
  • This directory resides in the top-level directory of the file system.
  • The directory can be linked into all subdirectories with the mmsnapdir command

Place a link in all directories:

# mmsnapdir fs0 -a

Undo the link above:

# mmsnapdir fs0 -r


Snapshot Creation

# mmcrsnapshot fs0 fs0_20190411_0001
Flushing dirty data for snapshot :fs0_20190411_0001...
Quiescing all file system operations.
Snapshot :fs0_20190411_0001 created with id 1.


  • Use the “mmcrsnapshot” command to take the snapshot
  • Above is a file system level (global) snapshot
  • Below is a fileset snapshot

# mmcrsnapshot fs0 home:fs0_home_20190411_0612 -j home
Flushing dirty data for snapshot home:fs0_home_20190411_0612...
Quiescing all file system operations.
Snapshot home:fs0_home_20190411_0612 created with id 2.


Listing Snapshots

Listing the snapshots for fs0 now shows a snapshot of the home fileset.

# mmlssnapshot fs0
Snapshots in file system fs0:
Directory                 SnapId  Status  Created                    Fileset
fs0_20170718_0001         1       Valid   Mon Jul 24 11:08:13 2017
fs0_home_20170724_0612    2       Valid   Mon Jul 24 11:12:20 2017   home


  • Use the “mmlssnapshot” command to view all the snapshots currently stored on a given file system


Snapshot Deletion

# mmdelsnapshot fs0 fs0_20170718_0001
Invalidating snapshot files in :fs0_20170718_0001...
Deleting files in snapshot :fs0_20170718_0001...
 100.00 % complete on Mon Jul 24 11:17:52 2017 ( 502784 inodes with total 1 MB data processed)
Invalidating snapshot files in :fs0_20170718_0001/F/...
Delete snapshot :fs0_20170718_0001 successful.


  • Delete a snapshot using the “mmdelsnapshot” command
  • Above is a file system level (global) snapshot
  • Below is a fileset snapshot

# mmdelsnapshot fs0 home:fs0_home_20170724_0612 -j home
Invalidating snapshot files in home:fs0_home_20170724_0612...
Deleting files in snapshot home:fs0_home_20170724_0612...
 100.00 % complete on Mon Jul 24 11:25:56 2017 ( 100096 inodes with total 0 MB data processed)
Invalidating snapshot files in home:fs0_home_20170724_0612/F/...
Delete snapshot home:fs0_home_20170724_0612 successful.


File Level Restore from Snapshot

  • In order to restore a file, you can traverse the directories in the .snapshots directory
  • The directories have the name given to the snapshot when the mmcrsnapshot command was executed
  • You can search for the file you want to restore and then use rsync or cp to copy the file wherever you would like, outside of the .snapshots directory
  • Self-service for users, doesn’t require an admin to get back the snapshot data, standard Linux permissions still apply to the snapshot data
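
  • A minimal example of a user restoring one of their own files (the paths and snapshot name are hypothetical):

# cp -p /fs0/.snapshots/fs0_20190411_0001/projects/user1/results.csv /fs0/projects/user1/results.csv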


Snapshot Restore Utility

# mmsnaprest -h
GPFS Restore From Snapshot

Please note: This utility uses rsync style processing for directories. If
             you are unsure of how that matching works, you may want to play
             with it in a test area. There are examples in the EXAMPLES
             section of this help screen.

Usage: mmsnaprest [-D|--debug] [-u|--usage] [-v|--verbose] [-h|--help]
                  [--dry-run] [-ls SOURCE] [-s SOURCE -t TARGET]


  • Useful for bulk restores from a snapshot
  • For example, a massive data deletion (someone let an rm -rf go wild) that requires a large restore from a snapshot
  • Native to Storage Scale, and written by IBM


Snapshot Automation

  • You can automate the creation of snapshots with a shell script, or even call the mmcrsnapshot command straight from cron if you like
  • At NCSA, we use an in-house tool called snappy
    • Same utility for both file system and fileset snapshots
    • Written in python
    • Utilizes a simple windows ini style configuration file
    • Allows for a very customized approach to snapshots:
      • Hourly
      • Daily
      • Weekly
      • Monthly
      • Quarterly
      • Yearly
    • Available at: https://github.com/ckerner/snappy.git
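
  • A hedged example of a daily cron entry that date-stamps a global snapshot (schedule and naming are placeholders; a real setup also needs a cleanup job that prunes old snapshots with mmdelsnapshot):

# crontab -l
15 0 * * * /usr/lpp/mmfs/bin/mmcrsnapshot fs0 daily_$(date +\%Y\%m\%d)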


Snapshot Configuration File: .snapcfg

The configuration file must always reside in the top-level (root) directory of the file system.

# cat .snapcfg
[DEFAULT]
Active=False
SnapType=Fileset
Versions=30
Frequency=daily

[home]
Active=True

[projects]
Active=True
Versions=7

[software]
Active=True
Frequency=weekly
Versions=14
#


From this example, the default is for fileset snapshots, running daily, keeping 30 versions. The default action is to NOT take snapshots. So, if you want a snapshot, you must turn it on for each fileset individually.

The .snapcfg section name must be the same as the fileset name. Each section will inherit the DEFAULT section and then override it with the local values. Here is the breakdown for this file:

  • [home] gets daily snapshots with 30 versions saved.
  • [projects] gets daily snapshots with 7 versions.
  • [software] gets a weekly snapshot with 14 versions.


Storage Scale Cluster Export Services


CES – Cluster Export Services

  • Provides highly available file and object services to a Storage Scale cluster such as NFS, SMB, Object, and Block

High availability

  • With Storage Scale, you can configure a subset of nodes in the cluster to provide a highly available solution for exporting Storage Scale file systems using NFS, SMB, Object, and Block.
    • Nodes are designated as Cluster Export Services (CES) nodes or protocol nodes. The set of CES nodes is frequently referred to as the CES cluster.
  • A set of IP addresses, the CES address pool, is defined and distributed among the CES nodes
    • If a node enters or exits the CES cluster, IP addresses are dynamically reassigned
    • Clients use these floating IP addresses to access the CES services


CES – Cluster Export Services

Monitoring

    • CES monitors the state of the protocol services itself
      • Checks not just for host availability, but also the health of the services
      • If a failure is detected, CES will migrate IP addresses away from the node and mark it as offline for CES services

Protocol support

    • CES supports the following export protocols: NFS, SMB, object, and iSCSI (block)
      • Protocols can be enabled individually
      • If a protocol is enabled, all CES nodes will serve that protocol
    • The following are examples of enabling and disabling protocol services by using the mmces command:
      • mmces service enable nfs – enables the NFS protocol in the CES cluster
      • mmces service disable obj – disables the Object protocol in the CES cluster


Common CES Commands

  • mmblock - Manages the BLOCK configuration operations
  • mmces - Manages the CES address pool and other CES cluster configuration options
  • mmnfs - Manages NFS exports and sets the NFS configuration attributes
  • mmobj - Manages the Object configuration operations
  • mmsmb - Manages SMB exports and sets the SMB configuration attributes
  • mmuserauth - Configures the authentication methods that are used by the protocols
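
  • A few read-only examples for checking CES state (output omitted; flags may vary slightly by release):

# mmces node list
# mmces address list
# mmces service list -a
# mmnfs export list
# mmsmb export list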


Storage Scale Policy Engine


Policy Engine

  • The GPFS policy engine allows you to run SQL-like queries against the file system and get reports based on those queries
  • The policy engine can also be used to invoke actions, such as compression, file movement, etc
  • Customized scripts can also be invoked, letting you have full control over anything that is being done
  • There are many parameters that can be specified. For a list of them, check out the Storage Scale Administration and Programming Reference


Example Policy Run #1

  • Here is a simple sample policy that will just list all of the files in /fs0/projects along with the file’s allocation, its actual size, owner and fileset name. It also displays the inode number and fully qualified path name.

# cat rules.txt
RULE 'listall' list 'all-files'
  SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' || varchar(user_id) || ' ' || fileset_name )
  WHERE PATH_NAME LIKE '/fs0/projects/%'


Example Policy Run #1

Sample output from a policy run:

# mmapplypolicy fs0 -f /fs0/tmp/ -P rules.txt -I defer

[I] GPFS Current Data Pool Utilization in KB and %

Pool_Name KB_Occupied KB_Total Percent_Occupied

archive 131072 41934848 0.312561047%

data 192512 41934848 0.459074038%

system 0 0 0.000000000% (no user data)

[I] 4422 of 502784 inodes used: 0.879503%.

[W] Attention: In RULE 'listall' LIST name 'all-files' appears but there is no corresponding "EXTERNAL LIST 'all-files' EXEC ... OPTS ..." rule to specify a program to process the matching files.

[I] Loaded policy rules from rules.txt.

Evaluating policy rules with CURRENT_TIMESTAMP = 2017-07-25@15:34:38 UTC

Parsed 1 policy rules.

RULE 'listall' list 'all-files'

SHOW( varchar(kb_allocated) || ' ' || varchar(file_size) || ' ' || varchar(user_id) || ' ' || fileset_name )

WHERE PATH_NAME LIKE '/fs0/projects/%'

[I] 2017-07-25@15:34:39.041 Directory entries scanned: 385.

[I] Directories scan: 362 files, 23 directories, 0 other objects, 0 'skipped' files and/or errors.

[I] 2017-07-25@15:34:39.043 Sorting 385 file list records.

[I] Inodes scan: 362 files, 23 directories, 0 other objects, 0 'skipped' files and/or errors.


Example Policy Run #1

Sample output from a policy run (continued):

[I] 2017-07-25@15:34:40.954 Policy evaluation. 385 files scanned.

[I] 2017-07-25@15:34:40.956 Sorting 360 candidate file list records.

[I] 2017-07-25@15:34:41.024 Choosing candidate files. 360 records scanned.

[I] Summary of Rule Applicability and File Choices:

Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule

0 360 61184 360 61184 0 RULE 'listall' LIST 'all-files' SHOW(.) WHERE(.)

[I] Filesystem objects with no applicable rules: 25.

[I] GPFS Policy Decisions and File Choice Totals:

Chose to list 61184KB: 360 of 360 candidates;

Predicted Data Pool Utilization in KB and %:

Pool_Name KB_Occupied KB_Total Percent_Occupied

archive 131072 41934848 0.312561047%

data 192512 41934848 0.459074038%

system 0 0 0.000000000% (no user data)

[I] 2017-07-25@15:34:41.027 Policy execution. 0 files dispatched.

[I] A total of 0 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;

0 'skipped' files and/or errors.

#


Example Policy Run #1

Sample output from a policy run:

# wc -l /fs0/tmp/list.all-files

360 /fs0/tmp/list.all-files

# head -n 10 /fs0/tmp/list.all-files

402432 374745509 0 3584 1741146 0 projects -- /fs0/projects/dar-2.4.1.tar.gz

402434 229033036 0 0 1217 1000 projects -- /fs0/projects/dar-2.4.1/README

402435 825781038 0 256 43668 1000 projects -- /fs0/projects/dar-2.4.1/config.guess

402436 1733958940 0 256 18343 1000 projects -- /fs0/projects/dar-2.4.1/config.rpath

402437 37654404 0 0 371 1000 projects -- /fs0/projects/dar-2.4.1/INSTALL

402438 1471382967 0 0 435 1000 projects -- /fs0/projects/dar-2.4.1/TODO

402440 398210967 0 0 376 1000 projects -- /fs0/projects/dar-2.4.1/misc/batch_cygwin

402441 292549403 0 0 738 1000 projects -- /fs0/projects/dar-2.4.1/misc/README

402442 1788675584 0 256 3996 1000 projects -- /fs0/projects/dar-2.4.1/misc/dar_ea.rpm.proto

402443 637382920 0 256 4025 1000 projects -- /fs0/projects/dar-2.4.1/misc/dar64_ea.rpm.proto

#


Example Policy Run #2

  • Here is one of our actual scratch purge policies that we run daily to keep users’ old data cleaned up

RULE 'purge_30days' DELETE
  FOR FILESET ('scratch')
  WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '30' DAYS and
        CURRENT_TIMESTAMP - CREATION_TIME > INTERVAL '30' DAYS and
        CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS and
        PATH_NAME LIKE '/gpfs/iccp/scratch/%'


Example Policy Run #2

Sample output from a policy run:

[I] GPFS Current Data Pool Utilization in KB and %

Pool_Name KB_Occupied KB_Total Percent_Occupied

data 1006608482304 2621272227840 38.401523948%

system 0 0 0.000000000% (no user data)

[I] 378536926 of 689864704 inodes used: 54.871183%.

[I] Loaded policy rules from scratch.purge.policy.

Evaluating policy rules with CURRENT_TIMESTAMP = 2019-04-12@16:00:02 UTC

Parsed 1 policy rules.

RULE 'purge_30days' DELETE

FOR FILESET ('scratch')

WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '30' DAYS and

CURRENT_TIMESTAMP - CREATION_TIME > INTERVAL '30' DAYS and

CURRENT_TIMESTAMP - ACCESS_TIME > INTERVAL '30' DAYS and

PATH_NAME LIKE '/gpfs/iccp/scratch/%'

[I] 2019-04-12@16:00:04.045 Directory entries scanned: 0.

[I] 2019-04-12@16:00:19.026 Directory entries scanned: 1376623.

[I] 2019-04-12@16:00:34.027 Directory entries scanned: 1376623.

[I] 2019-04-12@16:00:37.104 Directory entries scanned: 8576323.

[I] Directories scan: 4132091 files, 3713818 directories, 730414 other objects, 0 'skipped' files and/or errors.


Example Policy Run #2

Sample output from a policy run (continued):

[I] 2019-04-12@16:00:37.145 Parallel-piped sort and policy evaluation. 0 files scanned.

[I] 2019-04-12@16:00:42.975 Parallel-piped sort and policy evaluation. 8576323 files scanned.

[I] 2019-04-12@16:00:43.523 Piped sorting and candidate file choosing. 0 records scanned.

[I] 2019-04-12@16:00:43.647 Piped sorting and candidate file choosing. 90047 records scanned.

[I] Summary of Rule Applicability and File Choices:

Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule

0 90047 1078304928 90047 1078304928 0 RULE 'purge_30days' DELETE FOR FILESET(.) WHERE(.)

[I] Filesystem objects with no applicable rules: 8486148.

[I] GPFS Policy Decisions and File Choice Totals:

Chose to delete 1078304928KB: 90047 of 90047 candidates;

Predicted Data Pool Utilization in KB and %:

Pool_Name KB_Occupied KB_Total Percent_Occupied

data 1005533405024 2621272227840 38.360510379%

system 0 0 0.000000000% (no user data)

[I] 2019-04-12@16:00:43.732 Policy execution. 0 files dispatched.

[I] 2019-04-12@16:00:49.027 Policy execution. 65886 files dispatched.

[I] 2019-04-12@16:00:51.069 Policy execution. 90047 files dispatched.

[I] A total of 90047 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;

0 'skipped' files and/or errors.


Storage Scale Storage Pools & File Placement


Storage Pools

  • Physically, a storage pool is a collection of disks or RAID arrays
    • Allow you to group multiple storage systems within a file system.
  • Using storage pools, you can create tiers of storage by grouping storage devices based on performance, locality, or reliability characteristics
    • One pool could be an All Flash Array (AFA) with high-performance SSDs
    • Another pool might consist of numerous disk controllers that host a large set of economical SAS/SATA drives


Example Storage Pool Configuration

Flash Tier

  • Holds Metadata pool (system pool)
  • May have filesets pinned to it via Storage Scale placement policies (see the placement-rule sketch below)
    • Popular areas would be /home, /apps, and even /scratch

Capacity Tier

  • Large disk enclosures behind RAID controllers presenting big NSDs built from large hard drives
  • Higher Storage Device to Server ratio
  • Potentially less bandwidth for connectivity
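
A hedged sketch of placement rules that pin the hot filesets to a flash data pool and send everything else to the capacity pool (pool and fileset names are hypothetical; the rules are installed with mmchpolicy as shown later in these slides):

RULE 'hot' SET POOL 'flash' FOR FILESET ('home','apps','scratch')
RULE 'default' SET POOL 'data'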


Information Lifecycle Management

  • Storage Scale includes the ILM toolkit that allows you to manage your data via the built in policy engine
  • No matter the directory structure, Storage Scale can automatically manage what storage pools host the data, and for how long
    • Throughout the life of the data, Storage Scale can track and migrate it according to your policy-driven rules
  • You can match the data and its needs to hardware, allowing for cost savings
  • Great method for spanning infrastructure investments
    • New hardware is for more important/more used data
    • Older hardware becomes the slower storage pool


  • There are three types of storage pools in Storage Scale:
    • A required system pool that you create
    • Optional user storage pools that you create
    • Optional external storage pools that you define with policy rules and manage through an external application (e.g., Storage Protect/Tivoli)
  • Create filesets to provide a way to partition the file system namespace to allow administrative operations to work at a narrower level than the entire file system
  • Create policy rules based on data attributes to determine the initial file placement and manage data placement throughout the life of the file
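
A hedged example of a lifecycle rule that drains the flash pool into the capacity pool when it fills, oldest-access first (pool names and thresholds are hypothetical; the rule is run with mmapplypolicy, for example from cron or a low-space callback):

RULE 'flash2data' MIGRATE FROM POOL 'flash' THRESHOLD(85,70)
  WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
  TO POOL 'data'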


Storage Tiering

  • Your Storage Scale cluster can have differing types of storage media connected by differing technologies as well, including but not limited to:
    • NVME
    • SAS SSD
    • SAS/FC/IB attached Hard Drives
    • Self-Encrypting drives
    • You may have differing drive sizes: 6T, 8T, 10T, 12T, etc
    • Tape Libraries


Using Tiered Storage

  • Let’s say you have the following devices on your system:
    • /dev/nvme01 /dev/nvme02
    • /dev/sas01 /dev/sas02
    • /dev/sata01 /dev/sata02
  • The SAS drives above are SEDs (self-encrypting drives)

  • There are many different ways that you can configure a Storage Scale file system. To make it interesting, let’s have the following business rules that we need to satisfy:
    • Very fast file creates and lookups, including a mirrored copy
    • A decent storage area for data files
    • All files placed in /gpfs/fs0/health live on encrypted drives
    • What would this configuration look like?


Tiered Storage: Example Config


Flash Metadata Pool:

%nsd: nsd=nvme01 usage=metadataOnly pool=system
%nsd: nsd=nvme02 usage=metadataOnly pool=system

Bulk Data Pool:

%nsd: nsd=sata01 usage=dataOnly pool=data
%nsd: nsd=sata02 usage=dataOnly pool=data

Encrypted Data Pool:

%nsd: nsd=sas01 usage=dataOnly pool=hippa
%nsd: nsd=sas02 usage=dataOnly pool=hippa


File Placement Policies

  • If you are utilizing multiple storage pools within Storage Scale, you must specify a default storage policy at a minimum.
  • File placement policies are used to control what data is written to which storage pool.
  • A default policy rule can be quite simple. For example, if you have a ’data’ pool and want to write all files there, create a file called policy with a single line containing the following rule:

# cat policy

rule 'default' set pool 'data'


Installing File Placement Policies

# Usage: mmchpolicy Device PolicyFilename [-t DescriptiveName] [-I {yes|test}]


Testing the policy before installing it is good practice!

# mmchpolicy fs0 policy -I test

Validated policy 'policy': Parsed 1 policy rules.

No errors on the policy, so let’s install it:

# mmchpolicy fs0 policy

Validated policy 'policy': Parsed 1 policy rules.

Policy `policy' installed and broadcast to all nodes.


Viewing Installed Policies

# Usage: mmlspolicy Device

Verify the policy installed successfully by listing the file placement policies:

# mmlspolicy fs0

Policy for file system '/dev/fs0':

Installed by root@ss-demo1.os.ncsa.edu on Fri Apr 12 09:26:10 2019.

First line of policy 'policy' is:

rule 'default' set pool 'data'




Storage Scale Monitoring


Monitoring with mmpmon

  • Built-in tool to report counters that each of the mmfs daemons keep
  • Can output results in either machine parseable or human readable formats
  • Some of the statistics it monitors on a per host basis:
    • Bytes Read
    • Bytes Written
    • File Open Requests
    • File Close Requests
    • Per NSD Read/Write
  • The machine parse-able output is easy to use for scripted data gathering
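
  • A minimal example of scripted use (the request file contents and timings are placeholders):

# cat /tmp/mmpmon.in
fs_io_s

Collect per-file-system counters in machine-parseable form, two samples 10 seconds apart:

# mmpmon -i /tmp/mmpmon.in -p -r 2 -d 10000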


Monitoring with mmpmon


Sample output from mmpmon (human readable)


Monitoring with mmpmon

  • Can be used to make useful graphs


  • Sample output from mmpmon (machine parseable)


Other Storage Scale Monitoring

  • Using the built-in ZiMon sensors with mmperfmon
  • Storage Scale GUI now has the ability to have performance monitoring with graphs
  • Storage Scale Grafana Bridge
    • Python standalone application that puts Storage Scale performance data into openTSDB which Grafana can understand
    • Data that is pushed “across” the bridge is gathered by the ZiMon Monitoring Tool


Resources


Acknowledgements

  • Members of the SET group at NCSA for slide creation and review
  • Members of the LCI Steering Committee for slide review


Questions
