Towards Efficient Checkpointing Across Deep Tiers of Memory Hierarchy
Avinash Maurya�am6429@rit.edu
Advisors: M. Mustafa Rafique (RIT), Bogdan Nicolae (ANL)
SC’22 Doctoral Showcase
1
2
Outline
▶ Introduction and Motivation
Intermediate Data in High-Performance Computing (HPC) Workloads
3
Checkpointing and restoring intermediate data at scale is a
fundamental problem in many HPC workloads
Our research focus
Example: Reverse Time Migration (RTM) used for Seismic Imaging
4
Actual subsurfaces
Simulated subsurfaces
Representative working of RTM*
* Source: https://www.youtube.com/watch?v=VWF--1OnitM
* Note: Animation depicts Full Wavefield Inversion that has same I/O characteristics as RTM
Accelerators Introduce I/O Contention
5
Limited memory capacity and fast computational performance of accelerators leads to application I/O overheads
Memory Hierarchy on HPC Systems
6
Memory tiers on HPC systems creates a complex caching infrastructure that is not efficiently utilized for various checkpointing and restoring scenarios
7
Outline
▶ Research Summary
Research Challenges for Efficient Checkpoint-Restore Operations
8
9
Outline
▶ Background: Asynchronous Multi-Level Checkpointing
Asynchronous Multi-Level Checkpointing Without GPU Support
10
Main
memory
Local
Disk
Remote Storage
Compute
~1000s GB
~10s PB
Asynchronous checkpointing without GPU support does not utilize
unused GPU HBM and concurrent transfers
~10s GB
~100s GB
HPC Application
GPU HBM
Cache space
on memory tier
Unused space
on memory tier
Checkpoint
Asynchronous Multi-Level Checkpointing With Limited GPU Support
11
Main
memory
Local
Disk
Remote Storage
Compute
HPC Application
~10s GB
~100s GB
~1000s GB
~10s PB
Cache space
on memory tier
Unused space
on memory tier
Existing GPU supported checkpointing runtimes
sub-optimally utilize GPU cache
Can hold only
two checkpoints
Checkpoint
GPU HBM
Architecture for Realizing Efficient Multi-Level Checkpointing
12
Main
memory
Local
Disk
Remote Storage
Compute
~10s GB
~1000s GB
~10s PB
Our architecture enables optimal cache utilization and �concurrent transfers across all memory tiers
~100s GB
HPC Application
GPU HBM
Cache space
on memory tier
Unused space
on memory tier
Checkpoint
13
Outline
▶ Checkpoint Load Balancing
HPC Workloads Produce Variable Sized Checkpoints
14
Slow processes delay checkpoint for entire group for tightly coupled applications
MASCOTS’21
62 MB
Local-only Checkpointing Approach (Baseline)
15
GPU local write speed 350 GB/s�GPU to host write speed 12 GB/s�GPU-to-GPU write speed 24 or 48 GB/s
Does NOT leverage free space on peer GPUs and the fast GPU-to-GPU interconnects
MASCOTS’21
48 GB/s
24 GB/s
12 GB/s
Unused
Our Approach: Greedy Schedule for Checkpoint Load Balancing
16
Greedy approach is straightforward and has lower complexity,
but does NOT yield optimal transfer schedule
MASCOTS’21
Our Solution: Min-time Max-flow Schedule
17
β = 24 GB/s; γ = 12 GB/s
MASCOTS’21
Evaluation Methodology: Checkpoint Load Balancing
18
MASCOTS’21
Performance Results: RTM Traces Checkpointing Overhead
19
Distribution of checkpoint sizes over across 8 GPUs
Checkpointing overhead (app blocking time) of greedy and baseline approaches relative to min-time max-flow approach
GPU free space = 128 MB
Optimal approach: Min-time max-flow
Baseline and greedy approaches incur a checkpointing overhead of 400% and 97% respectively as compared to the proposed min-time max-flow based approach
400%
97%
MASCOTS’21
Performance Results: Execution Times of Different Approaches
20
Distribution of checkpoint sizes over across 8 GPUs
Response time for scheduling transfers using baseline, greedy and min-time max-flow approaches
GPU free space = 128 MB
Min-time max-flow approach incurs negligible runtime penalty (~90 µs) and always produces optimal schedule as compared to the greedy and baseline approaches
Optimal approach: Min-time max-flow
MASCOTS’21
Key Takeaways: Checkpoint Load Balancing
21
MASCOTS’21
22
Outline
▶ Efficient Cache Initialization
Large-Volume High-Frequency Checkpointing
23
Short-lived tasks on Multi-GPU systems compete during cache allocation,
leading to significant initialization overheads
HiPC’22
Design Principles for Efficient Cache Initialization
24
HiPC’22
Concurrency Control Between Transfers and Allocations
25
HiPC’22
Evaluation Methodology: Efficient Cache Initialization
26
HiPC’22
Total I/O (Checkpoint + Restore) Overhead on the Application
27
7.8x
Our approach outperforms the baseline (Direct Pin) approach by up to 12.5x and 7.8x for checkpoint and total I/O overheads respectively
HiPC’22
Scalability Test for Increasing Number of Processes Per Node
28
Our approach demonstrates significant performance improvements as compared to state-of-the-art approaches even at scale
13.6x
HiPC’22
Key Takeaways: Efficient Cache Initialization
29
HiPC’22
30
Outline
▶ Foreknowledge Based Eviction and Prefetching
Efficient Cache Eviction and Prefetching
31
Concurrent checkpointing and restoring creates a complex producer-consumer interleaving, and require unified eviction and prefetching strategies
Under
review
Unified Flush/Prefetch Support using Finite State Machine Life-cycle
32
Under
review
Region born through checkpoint or prefetch operation
Can transition from
checkpointing
to prefetching phase
Can be evicted
Unified checkpoint-restore minimizes I/O contention and application execution time
Score-based Look-ahead Cache Eviction
33
Under
review
1
6
5
3
4
2
1
6
5
3
4
2
1
6
5
7
4
2
1
6
5
7
4
2
Required cache size
#
Prefetch order
(1 means next in line for restore)
Not eligible for eviction
(write in-progress)
Eligible for eviction
Least impact on I/O
Evaluation Methodology: Foreknowledge-based Eviction and Prefetching
34
Under
review
Restore throughput performance
35
11.7x
Our approach outperforms the state-of-the-art solutions by up to 7.5x, 8.7x, and 11.7x for sequential, reverse, and irregular restore orders, respectively
Under
review
Scalability study
36
Scalability study shows up to 3.4x and 7.6x faster I/O throughput for tightly coupled and embarrassingly parallel scenarios, respectively
Tightly coupled
Under
review
Embarrassingly parallel
3.4x
7.6x
Key Takeaways: Foreknowledge-based Cache Eviction and Prefetching
37
Under
review
38
Outline
▶ Future Directions
Future Directions
39
Thank You!
Questions?
Backup slides
Acknowledgements
43
Outline
▶ Future Research Plans
1. Implementing Collaborative Multi-Level Checkpointing
44
Source: https://www.hardwarezone.com.sg/tech-news-nvidia-h100-gpu-hopper-architecture-building-block-ai-infrastructure-dgx-h100-supercomputer
2. Skipping Intermediate Memory Tiers to Minimize Transfer Latency
45
Example based comparison of greedy and flow-based approaches
46
Greedy approach
Overall wait time = 20 ms
Spare capacities of GPU 4 was not used
48 GB/s
24 GB/s
12 GB/s
Min-time Max-flow approach
Overall wait time = 6.7 ms
Challenges with Existing Multi-level Checkpointing Runtimes
47
Existing checkpointing runtimes either have no or very limited GPU support
Asynchronous Multi-Level Checkpointing
48
1. Application runs computation kernels
2. Checkpointing at certain timesteps
3. Checkpoint stored on GPU cache first
HBM is fast (up to 500 GB/s trf. rate)
Wait for eviction if GPU cache is full
5. Asynchronously transfer to host cache
Wait for eviction if Host cache is full
6. Asynchronously transfer to SSD
Wait for eviction if SSD is full
4. Application resumes computations
Problem formulation
49
48 GB/s
24 GB/s
12 GB/s
Where and how much to write of each checkpoint to minimize total ckpt time?
Key Research Challenges Presented
50
Synchronous Multi-Level Checkpointing
51
Main memory
Local
Disk
Remote Storage
Compute
GPU HBM
Application
~1000s GB
~10s PB
Unused space
on memory tier
Synchronous checkpointing incurs significant I/O wait time does not leverage unused memory space or concurrent transfers
~10s GB
~100s GB
Checkpoint
Research Goal and Objectives
52
Research Goal: Accelerate data movement on deep tiers of memory hierarchy
| Objective | Challenge | Progress | 
| Optimized multi-level flushing techniques (Using dedicated transfer threads across each tier) | Slow flushes to Lower cache tiers | ☑ Implemented | 
| Collaborative checkpointing strategies (Using spare capacity on peer devices and fast interconnects) | Checkpoint load imbalance | ☑ Simulated ◷ Implementing | 
| Proactive asynchronous cache initialization (Using async. memory mapping, pinning and transfers) | Slow cache allocation and pinning | ☑ Implemented | 
| Foreknowledge aware eviction and prefetching (Using state-machine for unified checkpointing & restoring) | Restore oblivious cache eviction and prefetching | ☑ Implemented | 
| Skipping intermediate memory tiers (Using technologies such as GPUDirect RDMA) | Cascading transfers across hierarchical memory tiers | ◷ In-progress | 
| Efficient co-scheduling of opportunistic on-demand jobs with batch jobs (Checkpointing or Killing batch jobs that minimize compute loss) | Loss of progress made by batch jobs while co-scheduling on-demand jobs | ☑ Simulated ◷ Implementing | 
3. Co-design Checkpointing and Scheduling Runtimes to Preempt Jobs
53
PFS
PFS
System-level checkpointing
Application-level checkpointing
Simulation
DS-RT’20
54
Outline
▶ Related Works
Checkpoint-restore Runtimes and Data Movement Engines
55
Workflow Engines and Cache Management Systems
56
57
Outline
▶ Publications and Submissions
Publications and Submissions
58
Intermediate Data Access Pattern for Different Checkpointing Scenarios
59
Our research focus
Improvement Idea: Min-cost Max-flow Schedule
60
Cannot apply classic algorithms directly:
β = 24 GB/s; γ = 12 GB/s
MASCOTS’21
Problem Formulation: Checkpoint Load Balancing
61
Efficiently scheduling transfers such that cache utilization is maximized and
I/O wait time is minimized
48 GB/s
24 GB/s
12 GB/s
MASCOTS’21
Implementation of Cache Initialization and Transfer Mechanism
62
HiPC’22
Problem Formulation: Efficient Cache Initialization
63
HiPC’22
Limitations of Existing Cache Initialization Schemes
64
HiPC’22