Linux Clusters Institute: HPC Storage: Part 1
J.D. Maloney | Sr. HPC Storage Engineer
National Center for Supercomputing Applications (NCSA)
malone12@illinois.edu
May 1-5, 2023
This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).
Target Audience:
Those involved in designing, implementing, or managing HPC storage systems.
Outline:
Concepts and Terms
What is Storage?
A place to store data, either temporarily or permanently.
Storage hierarchy (bandwidth increases toward the CPU; latency and size increase away from it):
CPU
Cache (L1, L2, L3)
Memory (DRAM, HBM)
Solid State Disk (SATA SSD, M.2 Module, PCIe Card)
Spinning Disks (PMR, SMR, HAMR/MAMR)
Tape (LTO, TS11XX)
Concepts and Terms
Concepts and Terms
[Diagram: Storage with an Active Controller and a Standby Controller connected to four DNs]
Concepts and Terms
Usable space is often about 25% smaller than raw space.
File system overhead is applied after RAID overhead, further reducing the usable space.
3 drives: 2 storage drives + 1 parity drive = 33% RAID overhead
4 drives: 3 storage drives + 1 parity drive = 25% RAID overhead
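The overhead arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function names `raid_overhead` and `usable_capacity` are hypothetical, and the 5% file system overhead in the usage note is an assumed placeholder, not a figure from these slides.

```python
def raid_overhead(total_drives: int, parity_drives: int = 1) -> float:
    """Fraction of raw capacity consumed by parity in a single RAID group."""
    if parity_drives < 0 or total_drives <= parity_drives:
        raise ValueError("need at least one data drive and non-negative parity")
    return parity_drives / total_drives


def usable_capacity(raw_tb: float, total_drives: int,
                    parity_drives: int = 1, fs_overhead: float = 0.0) -> float:
    """Apply RAID overhead first, then file system overhead, as on the slide.

    fs_overhead is an assumed fraction (0.0 to 1.0); real values depend on
    the file system and its configuration.
    """
    after_raid = raw_tb * (1.0 - raid_overhead(total_drives, parity_drives))
    return after_raid * (1.0 - fs_overhead)


if __name__ == "__main__":
    # The two single-parity examples from the slide.
    for drives in (3, 4):
        print(f"{drives} drives, 1 parity: {raid_overhead(drives):.0%} RAID overhead")
```

For example, `usable_capacity(40.0, 4, 1, fs_overhead=0.05)` takes 40 TB raw down to 30 TB after RAID and to roughly 28.5 TB after the assumed 5% file system overhead.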
Goals and Requirements
How to Choose the Optimal Storage Solution?
Building a Balanced System
The Ideal:
The Reality:
Common Imbalances:
Requirements Evaluation
Gathering Stakeholder Requirements
Application I/O Access Patterns
Common Data Access Patterns
HPC I/O Access Patterns
Challenges:
Gathering Data Requirements
Training and Support Requirements
Storage Hardware
Common Storage Building Blocks
Common Connecting Fabrics
Network Fabrics
Other Storage Fabrics & Carriers
Common RAID Levels
Common RAID Levels (cont.)
Erasure Coding
Erasure Coding
Erasure Coding
Data Transport from Server to Client
TCP (Transmission Control Protocol)
Data Transport from Server to Client
RDMA (Remote Direct Memory Access)
TCP vs. RDMA
Data Transport from Server to Client
GDS (GPUDirect Storage)
GDS (GPUDirect Storage)
Storage Software
File vs. Object Storage
File Storage
Object Storage
Data Consistency
Strong Consistency
Eventual Consistency
Parallel vs. Non-Parallel File Systems
Characteristics of a Parallel File System
Characteristics of Non-Parallel File Systems
Why HPC Uses Parallel File Systems
Goals of Scale-Out File Systems
Some Examples of File Systems Used in HPC
Characteristics
System Components
Sources
Some Examples of File Systems Used in HPC
Some Examples of File Systems Used in HPC
Characteristics
System Components
Sources
Some Examples of File Systems Used in HPC
Some Examples of File Systems Used in HPC
Characteristics
System Components
Sources
Some Examples of File Systems Used in HPC
Some Examples of File Systems Used in HPC
Characteristics
System Components
Sources
Some Examples of File Systems Used in HPC
Common HPC Storage Solutions
Other Storage Solutions Seen in HPC
Some other storage solutions and vendors, while not large, show up regularly enough at HPC sites to be worth mentioning:
Resources
HPC:
https://www.youtube.com/watch?v=n0OfAoUXUJw (Henry Neeman) Supercomputing in plain English
General Storage:
https://www.youtube.com/watch?v=pBmtY4Tk-R8 (Henry Neeman) Why storage for big data is hard
Hardware Related:
https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html GPUDirect Storage
https://www.geeksforgeeks.org/remote-direct-memory-access-rdma/ RDMA
https://www.snia.org/forums/cmsi/knowledge/formfactors Flash form factors
https://en.wikipedia.org/wiki/InfiniBand Generations of InfiniBand
https://www.lto.org/lto-generation-compatibility/ LTO Tape Generation & Compatibility Information
Resources
Tuning and Benchmarking:
fasterdata.es.net/ An Expert Guide for End-to-End Performance Tuning, Tools and Techniques
https://ior.readthedocs.io/en/latest/ IOR and mdtest documentation
https://github.com/breuner/elbencho Another FS benchmarking tool with GDS Support
https://glennklockwood.blogspot.com/2016/07/basics-of-io-benchmarking.html
File System Solutions:
www.opensfs.org/ OpenSFS supports vendor-neutral development of Lustre
wiki.lustre.org Lustre Wiki
https://www.ddn.com/products/lustre-file-system-exascaler/ DDN Exascaler
www.ibm.com/products/spectrum-scale IBM Spectrum Scale
www.spectrumscale.org Spectrum Scale User Group
https://docs.ceph.com/en/quincy/ Ceph file system
https://vastdata.com VAST
Resources
File System Solutions:
https://thelinuxcluster.com/2012/10/18/a-brief-look-at-the-difference-between-nfsv3-and-nfsv4/ NFSv3 vs NFSv4
https://zfsonlinux.org ZFS Main Page
https://openzfs.github.io/openzfs-docs/Project%20and%20Community/Mailing%20Lists.html ZFS Mailing Lists
Questions?