1 of 64

Towards Efficient Checkpointing Across Deep Tiers of Memory Hierarchy

Avinash Maurya�am6429@rit.edu

Advisors: M. Mustafa Rafique (RIT), Bogdan Nicolae (ANL)

SC’22 Doctoral Showcase

2 of 64

Outline

▶ Introduction and Motivation

Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing
Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

Future Directions

3 of 64

Intermediate Data in High-Performance Computing (HPC) Workloads

HPC workloads manage large volumes of data concurrently at scale

Climate modelling, seismic simulations, computational biology, big data analytics, deep learning, etc.

They produce and consume huge amounts of intermediate data

Applications may need previously generated data to advance computations, job migration, etc.

Intermediate data needs to be checkpointed and restored at high frequency

HPC workloads, e.g., Reverse Time Migration (RTM), checkpoints 415 TB of intermediate data

Use cases of checkpointing and restoring intermediate data

Defensive: Fault tolerance based on checkpoint-restore
Administrative: Job migration, suspend-resume low-priority jobs
Productive: Revisiting previous data for optimized execution; analytics to steer simulations

Checkpointing and restoring intermediate data at scale is a

fundamental problem in many HPC workloads

Our research focus

4 of 64

Example: Reverse Time Migration (RTM) used for Seismic Imaging

RTM used in oil and gas industry to generate subsurface images
Adjoint computation: Forward and backward propagated seismic waves are calculated and cross-correlated to form the subsurface image
Two wavefields at identical propagation times are combined and reversed in time using checkpointing
Intermediate data is checkpointed during forward pass and restored during backward pass
Checkpoint read/write rate is ~7 GB/s

Actual subsurfaces

Simulated subsurfaces

Representative working of RTM*

* Source: https://www.youtube.com/watch?v=VWF--1OnitM

* Note: Animation depicts Full Wavefield Inversion that has same I/O characteristics as RTM

5 of 64

Accelerators Introduce I/O Contention

Applications leverage accelerators, e.g. GPUs for faster performance

E.g. Nvidia H100 GPUs deliver up to 60 TFLOPS performance for general purpose FP64 computations

Faster compute results in frequent checkpointing and restoring (~ms)

Accelerators (e.g. GPUs) are equipped with High-Bandwidth Memory (HBM) to avoid data bottlenecks

HBM has limited capacity, therefore cannot store all checkpoint data

E.g. Nvidia A100 GPUs have 40GB or 80GB HBM, most of which is consumed by the application

Remote storage (e.g. Parallel File System) can store all checkpoints, but incur large I/O penalty

Limited memory capacity and fast computational performance of accelerators leads to application I/O overheads

6 of 64

Memory Hierarchy on HPC Systems

Memory tiers on HPC systems creates a complex caching infrastructure that is not efficiently utilized for various checkpointing and restoring scenarios

7 of 64

Outline

Introduction and Motivation

▶ Research Summary

Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing
Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

Future Directions

8 of 64

Research Challenges for Efficient Checkpoint-Restore Operations

Fast memory tiers cannot store all checkpoints, flushing sync. to lower tiers blocks application during I/O

Slow Flushes to Lower Cache Tiers

Cache on some memory tiers are filled faster than others, forcing the app. to transfer to slow tiers

Checkpoint Load Imbalance

Short-lived jobs cannot amortize the overhead of cache initialization during their runtimes

Slow Cache Allocation and Pinning

The restore order of checkpoints are not considered while evicting cache, leading to cache misses

Restore Oblivious Cache Eviction and Prefetching

9 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

▶ Background: Asynchronous Multi-Level Checkpointing

Checkpoint Load Balancing
Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

Future Directions

10 of 64

Asynchronous Multi-Level Checkpointing Without GPU Support

Main

memory

Local

Disk

Remote Storage

Compute

~1000s GB

~10s PB

Unused GPU HBM not utilized as cache

Application needs to manually transfer to main memory

Application is blocked till checkpoint is captured on local disk

Limited concurrent transfers between

Local disk to Remote storage
GPU HBM ➜ Main memory ➜ Local Disk

Asynchronous checkpointing without GPU support does not utilize

unused GPU HBM and concurrent transfers

~10s GB

~100s GB

HPC Application

GPU HBM

Cache space

on memory tier

Unused space

on memory tier

Checkpoint

11 of 64

Asynchronous Multi-Level Checkpointing With Limited GPU Support

Main

memory

Local

Disk

Remote Storage

Compute

HPC Application

~10s GB

~100s GB

~1000s GB

~10s PB

Traditionally used for fault-tolerance: limited checkpoints are stored on device and host cache
Cache may be under or over utilized based on checkpoint sizes, and leads to app I/O overhead
Challenges in designing GPU cache:

Limited slots on cache tiers result in underutilized cache and I/O concurrency
Multi-GPU systems put pressure on main memory
Large I/O throughput gap (D2D > 50*D2H) ➜ High variance in production/consumption of checkpoints
Versioning is key to run productive scenarios

Cache space

on memory tier

Unused space

on memory tier

Existing GPU supported checkpointing runtimes

sub-optimally utilize GPU cache

Can hold only

two checkpoints

Checkpoint

GPU HBM

12 of 64

Architecture for Realizing Efficient Multi-Level Checkpointing

Main

memory

Local

Disk

Remote Storage

Compute

~10s GB

~1000s GB

~10s PB

Optimized for GPU HBM usage

Full cache utilization
Fast transfers to pinned main memory using Direct Memory Access (DMA)

Maximize cache utilization

Concurrent transfers across all memory tiers

Integrated with VELOC, a production-ready checkpoint-restore runtime

Our architecture enables optimal cache utilization and �concurrent transfers across all memory tiers

~100s GB

HPC Application

GPU HBM

Cache space

on memory tier

Unused space

on memory tier

Checkpoint

13 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing

▶ Checkpoint Load Balancing

Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

Future Directions

14 of 64

HPC Workloads Produce Variable Sized Checkpoints

Slow processes delay checkpoint for entire group for tightly coupled applications

Checkpoints accumulate on the fast tiers blocks application longer for I/O
Techniques to reduce checkpoint sizes

Compression
Decimation
Interpolation

Compression leads to uneven checkpoint sizes on each process
GPUs having enough free local memory write fast in their HBMs, others need to write to slow memory tiers

MASCOTS’21

62 MB

15 of 64

Local-only Checkpointing Approach (Baseline)

Write entire checkpoint to local HBM (at 350 GB/s)
If not enough room is available on local HBM

Checkpoint partially on local HBM (at 350 GB/s)
Checkpoint remaining on host cache (at 12 GB/s)

Implemented by state-of-the-art checkpointing runtimes and data movement engines

GPU local write speed 350 GB/s�GPU to host write speed 12 GB/s�GPU-to-GPU write speed 24 or 48 GB/s

Does NOT leverage free space on peer GPUs and the fast GPU-to-GPU interconnects

MASCOTS’21

48 GB/s

24 GB/s

12 GB/s

Unused

16 of 64

Our Approach: Greedy Schedule for Checkpoint Load Balancing

Leverages GPU-to-GPU interconnect and spare capacity on peers

The GPU with highest remainder will block the group checkpoint for the longest time, hence sorted in descending order of remainder

Worst case complexity for N GPUs is Օ(N²)

Steps:

Select GPU i with highest remainder checkpoint (ckpt_i)
Select peer j with fastest interconnect having spare capacity (F_j)
Transfer max(ckpt_i, F_j) to peer j
Goto step (2) if more peers exist
If no peer exists with spare capacity, transfer to host-memory

Greedy approach is straightforward and has lower complexity,

but does NOT yield optimal transfer schedule

MASCOTS’21

17 of 64

Our Solution: Min-time Max-flow Schedule

Modified min-cost max-flow
Increment the flow over the fastest link by one unit
Increase the cost of the fastest link by its unit cost of transfer, forcing slower paths to be selected for parallel transfers in the next cycle
Guarantees optimal schedule (minimum checkpoint blocking time)
Complexity: O(Fᐧ Nᐧ E)� F = Total remainder ckpt across all GPUs� N = Total number of GPUs� E = Total number of edges

β = 24 GB/s; γ = 12 GB/s

MASCOTS’21

18 of 64

Evaluation Methodology: Checkpoint Load Balancing

Checkpointing traces of Reverse Time Migration (RTM):

Used in the oil and gas industry for cross correlation of two seismic wavefields to form subsurface image
Nvidia DGX-1: 8 Tesla V100 GPUs (32 GB HBM2 each)
776 checkpoints ~53 GB per GPU

Compared approaches

Baseline (Local-only) Suboptimal schedule, put the remainder on host-memory
Greedy based Suboptimal schedule, put the remainder on fastest peer, resort to host-memory
Min-time max-flow based Optimal schedule, put the remainder across all possible peers in parallel

Key metrics for evaluation

Application I/O wait time (checkpoint overheads) of non-optimal approaches
Response time to schedule transfers by the compared approaches

MASCOTS’21

19 of 64

Performance Results: RTM Traces Checkpointing Overhead

Distribution of checkpoint sizes over across 8 GPUs

Checkpointing overhead (app blocking time) of greedy and baseline approaches relative to min-time max-flow approach

GPU free space = 128 MB

Optimal approach: Min-time max-flow

Baseline and greedy approaches incur a checkpointing overhead of 400% and 97% respectively as compared to the proposed min-time max-flow based approach

400%

97%

MASCOTS’21

20 of 64

Performance Results: Execution Times of Different Approaches

Distribution of checkpoint sizes over across 8 GPUs

Response time for scheduling transfers using baseline, greedy and min-time max-flow approaches

GPU free space = 128 MB

Min-time max-flow approach incurs negligible runtime penalty (~90 µs) and always produces optimal schedule as compared to the greedy and baseline approaches

Optimal approach: Min-time max-flow

MASCOTS’21

21 of 64

Key Takeaways: Checkpoint Load Balancing

Applications produce a large amount of intermediate data (checkpoints), to be reused later
Memory constrained systems (e.g. accelerators) cannot hold all checkpoints
Asynchronous multi-level checkpointing is used to flush to slow tiers in the background
Proposed greedy and min-time max-flow approaches to generate the efficient checkpointing schedules
Min-time max-flow approach always yields optimal results with negligible runtime overhead

8x faster than baseline (local-only) approach
2x faster than greedy approach
Execution time of min-time max-flow approach is negligible even at scale

MASCOTS’21

22 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing

▶ Efficient Cache Initialization

Foreknowledge Based Eviction and Prefetching

Future Directions

23 of 64

Large-Volume High-Frequency Checkpointing

Some HPC workloads have long runtimes due to ensembles of short runs (runtime ~mins)

Reverse Time Migration (RTM) processes one seismic image in approx. 3.5 mins

Each run needs to checkpoint frequently (~ms)

RTM produces approx. 200 MB checkpoint every 30 ms (6.7 GB/s)

Cache initialization (mapping and pinning) overhead is not amortized over application runtime

Multi-GPU systems run multiple such short-lived tasks concurrently

Competition for memory allocation amplifies initialization overhead

Short-lived tasks on Multi-GPU systems compete during cache allocation,

leading to significant initialization overheads

HiPC’22

24 of 64

Design Principles for Efficient Cache Initialization

Incremental GPU cache allocation using low-level GPU Virtual Memory Management (VMM)

Concurrent GPU HBM allocations reduce allocation rates from 1.4 TB/s to 160 GB/s, so allocate HBM in chunks

Opportunistic touching of host buffer virtual pages to force allocation of physical pages in advance

Run forerunner threads to access cache pages before checkpoints are written

Concurrency control to optimize competition between flushes and opportunistic page touching

Avoid I/O competition on memory tier between allocations and transfers
Pause memory allocation when transfers are in progress

Serialized registration of host buffers to avoid competition for pinning physical memory pages

Concurrent registration of multiple host buffers leads to contention on DMA drivers

HiPC’22

25 of 64

Concurrency Control Between Transfers and Allocations

Standard Allocation and Registration

Waits for complete cache initialization
Implemented by state-of-art ckpt runtimes

Concurrent Touch and Flush

Incrementally attempt touching before transfers
Register when all pages are touched

Sequential Touch and Flush

Incrementally attempt touching before transfers
Pause touching when transferring
Register when all pages are touched

HiPC’22

26 of 64

Evaluation Methodology: Efficient Cache Initialization

Experimental Setup

Nvidia DGX A100: 8 A100 GPUs, each with 40 GB GPU HBM
Two 64-core AMD Rome CPUs (256 threads), four 3.84 TB Gen4 NVME drives (4 GB/s)
6 NVSwitches and with the host memory through a SXM4 interface

Compared Approaches

Direct pin: Standard allocation and registration using cudaMallocHost
UVM: Using Unified Virtual Memory (multi-tier memory management by CUDA)
Incremental memset: Concurrent touch and flush technique, memset pages to zero
Our approach: Use transparent huge pages, concurrency control touch and flush, sequential pinning

Compute interval between checkpoint or restore operations

Based on real-world characteristics of RTM application: [5 . . . 60] ms

Key metrics for evaluation

Total I/O overhead (checkpoint + restore)
Scalability tests

HiPC’22

27 of 64

Total I/O (Checkpoint + Restore) Overhead on the Application

7.8x

Our approach outperforms the baseline (Direct Pin) approach by up to 12.5x and 7.8x for checkpoint and total I/O overheads respectively

HiPC’22

28 of 64

Scalability Test for Increasing Number of Processes Per Node

Our approach demonstrates significant performance improvements as compared to state-of-the-art approaches even at scale

13.6x

HiPC’22

29 of 64

Key Takeaways: Efficient Cache Initialization

HPC workloads may consist of short-lived ensembles, that need to checkpoint frequently
State-of-the-art checkpointing runtimes pay high cache initialization cost
We propose the following optimizations for reducing cache initialization overheads:

Incremental cache allocation on GPU using low-level virtual memory functions
Opportunistic touching of memory pages in advance to force physical page mapping
Concurrency control to optimize competition between touching and transfers
Serialized registration of host buffer

Our optimizations show up to 12x faster checkpointing and 7x faster application completion time as compared to the baseline approaches

HiPC’22

30 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing
Efficient Cache Initialization

▶ Foreknowledge Based Eviction and Prefetching

Future Directions

31 of 64

Efficient Cache Eviction and Prefetching

Productive use-cases of checkpointing read back (restore) the entire history of checkpoints
The order of restore operations is usually known in advance

Same order : Check invariants, identify diversion in evolution
Reverse order : Automated differentiation, seismic imaging in oil industry
Irregular order : Priority of checkpoint, pseudo-random sampling

The deterministic restore order of HPC workloads is not considered during eviction
Existing checkpointing runtimes are either optimized for checkpoint or restore, but not both

Concurrent checkpointing and restoring creates a complex producer-consumer interleaving, and require unified eviction and prefetching strategies

Under

review

32 of 64

Unified Flush/Prefetch Support using Finite State Machine Life-cycle

Under

review

Region born through checkpoint or prefetch operation

Can transition from

checkpointing

to prefetching phase

Can be evicted

Unified checkpoint-restore minimizes I/O contention and application execution time

33 of 64

Score-based Look-ahead Cache Eviction

Under

review

Required cache size

Prefetch order

(1 means next in line for restore)

Not eligible for eviction

(write in-progress)

Eligible for eviction

Least impact on I/O

34 of 64

Evaluation Methodology: Foreknowledge-based Eviction and Prefetching

Under

review

Experimental Setup

Nvidia DGX A1008 A100 GPUs, each with 40 GB GPU HBM
Two 64-core AMD Rome CPUs (256 threads), four 3.84 TB Gen4 NVME drives (4 GB/s)
6 NVSwitches and with the host memory through a SXM4 interface

Compared Approaches

ADIOS2: State-of-the-art data movement runtime
FIFO-based eviction: Evict the oldest checkpoint existing on the cache
Score-based eviction: Our approach, evict subset of checkpoints that minimize I/O contention

Number of prefetching hints available

No hints: Application does not have foreknowledge of prefetch order
Single hint: Only next prefetch is known
All hints: The entire sequence of restoration is known

Key metrics for evaluation

Restore throughput
Scalability tests

35 of 64

Restore throughput performance

11.7x

Our approach outperforms the state-of-the-art solutions by up to 7.5x, 8.7x, and 11.7x for sequential, reverse, and irregular restore orders, respectively

Under

review

36 of 64

Scalability study

Scalability study shows up to 3.4x and 7.6x faster I/O throughput for tightly coupled and embarrassingly parallel scenarios, respectively

Tightly coupled

Under

review

Embarrassingly parallel

3.4x

7.6x

37 of 64

Key Takeaways: Foreknowledge-based Cache Eviction and Prefetching

HPC workloads need to restore entire checkpoint history to complete execution
Existing checkpointing runtimes are not optimized for concurrent checkpointing and restoration
We design and develop

Finite-state-machine to enable seamless transition from checkpointing to restoring phase
Score-based look-ahead eviction technique based on restore hints
Integrated with VELOC, a production-ready multi-level HPC checkpointing runtime

Our evaluations show up to 6.3x and 11.7x faster checkpoint and restore throughput, respectively
Scalability study shows 3.4x and 7.6x speedup as compared to state-of-the-art approaches for tightly coupled and embarrassingly parallel applications, respectively

Under

review

38 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing
Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

▶ Future Directions

39 of 64

Future Directions

Concurrent transformations during checkpoint transfers
Shared node-local host buffer across all processes to further reduce initialization overhead
Enabling foreknowledge based data migration in Nvidia Unified Virtual Memory (UVM)
Leverage GPUDirect storage to minimize latency due to cascading transfers
Explore optimization opportunities between evicting older checkpoints vs checkpointing on peers
Integrate checkpointing runtime with scheduler for administrative scenarios (e.g. job migration, suspend-resume low-priority jobs, etc.)

40 of 64

Thank You!

Questions?

41 of 64

Backup slides

42 of 64

Acknowledgements

43 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Background: Asynchronous Multi-Level Checkpointing
Checkpoint Load Balancing
Efficient Cache Initialization
Foreknowledge Based Eviction and Prefetching

▶ Future Research Plans

Publications

44 of 64

1. Implementing Collaborative Multi-Level Checkpointing

Refine Min-time max-flow model to leverage modern interconnects (e.g. Nvidia CX7 adapter up to 400 GB/s)

Extend collaborative checkpointing to remote GPU HBMs and between memory tiers (e.g. shared host cache)

Integrate with VELOC, an HPC checkpoint-restore runtime

Evaluate and optimize performance for real-world workloads

Source: https://www.hardwarezone.com.sg/tech-news-nvidia-h100-gpu-hopper-architecture-building-block-ai-infrastructure-dgx-h100-supercomputer

45 of 64

2. Skipping Intermediate Memory Tiers to Minimize Transfer Latency

Modern interconnects are capable of direct transfers to non-subsequent memory tier

Design a unified checkpoint-runtime scheduler to minimize I/O time using direct transfers

Design efficient cache eviction schemes to minimize collective I/O pressure across memory tiers and interconnects

46 of 64

Example based comparison of greedy and flow-based approaches

Greedy approach

GPU 0 ⟶ GPU 4 : 480 MB 10 ms
GPU 6 ⟶ Host : 240 MB 20 ms

Overall wait time = 20 ms

Spare capacities of GPU 4 was not used

48 GB/s

24 GB/s

12 GB/s

Min-time Max-flow approach

GPU 0 ⟶ GPU 1 : 137 MB 5.7 ms
GPU 0 ⟶ GPU 4 : 275 MB 5.7 ms
GPU 0 ⟶ Host : 68 MB 5.7 ms
GPU 6 ⟶ GPU 4 : 160 MB 6.7 ms
GPU 6 ⟶ Host : 80 MB 6.7 ms

Overall wait time = 6.7 ms

47 of 64

Challenges with Existing Multi-level Checkpointing Runtimes

Most checkpointing runtimes do not support GPU-based workloads

GPU HBM has unused space that is not used for storing checkpoints
Application is responsible for transferring the data from GPU to main memory, leading to I/O overhead
Checkpoint is synchronously captured from main memory to the next tier

Checkpointing runtimes (e.g. FTI Fault-Tolerance Interface) have limited GPU checkpointing support

Dynamically determine the residency of the checkpoint (resides on GPU or main memory)
Performs blocking transfer from GPU to cache slots on main memory
Have limited cache slots on the main memory (e.g. FTI has 2 buffers)
Fixed sized cache slots may be under or over utilized by checkpoints of varying sizes
No pinning of main memory ➡ High transfer overheads (transfer without Direct Memory Access)
Pinning main memory ➡ High initialization overheads (high cost of pinning)

Existing checkpointing runtimes either have no or very limited GPU support

48 of 64

Asynchronous Multi-Level Checkpointing

1. Application runs computation kernels

2. Checkpointing at certain timesteps

3. Checkpoint stored on GPU cache first

HBM is fast (up to 500 GB/s trf. rate)

Wait for eviction if GPU cache is full

5. Asynchronously transfer to host cache

Wait for eviction if Host cache is full

6. Asynchronously transfer to SSD

Wait for eviction if SSD is full

4. Application resumes computations

49 of 64

Problem formulation

Each GPU has some free space reserved for storing checkpoints (F_i)
For GPU ‘i’, if the checkpoint size (ckpt_i) is larger than its free space (F_i) ➡ GPU having remainder
If ckpt_i < F_i ➡ GPU having spare capacity
Each checkpoint can be split into multiple regions to checkpoint across various devices
Each GPU has dedicated copy engines (i.e., GPU-to-GPU and GPU-to-host copies can happen in parallel)
Links between GPUs are heterogeneous (e.g., 24 GB/s single NVLink, 48 GB/s dual NVLink)

48 GB/s

24 GB/s

12 GB/s

Where and how much to write of each checkpoint to minimize total ckpt time?

50 of 64

Key Research Challenges Presented

Slow Flushes to Lower Cache Tiers

Fast memory tiers cannot store all checkpoints, flushing synchronously to lower tiers blocks application during I/O

Checkpoint Load Imbalance

Cache on some memory tiers are filled faster than others, leading to uneven cache consumption

Slow Cache Allocation and Pinning

Short-lived jobs cannot amortize the overhead of cache initialization during their runtimes

Restore Oblivious Cache Eviction and Prefetching

The restore order of checkpoints are not considered while evicting old checkpoints from cache

Cascading I/O Hierarchically Across Memory Tiers

Checkpoints are strictly transferred hierarchically across cache tiers, increasing I/O access latency

Running On-Demand Jobs by Terminating Batch Jobs

Running opportunistic on-demand jobs within a given deadline requires evicting batch jobs, leading to lost progress

51 of 64

Synchronous Multi-Level Checkpointing

Application produces intermediate data

Application is blocked until checkpoint is transferred to the slower memory tier, typically remote storage for persistence

Limitations:

Does not overlap compute with transfers
Does not utilize the unused space on intermediate memory tiers

Main memory

Local

Disk

Remote Storage

Compute

GPU HBM

Application

~1000s GB

~10s PB

Unused space

on memory tier

Synchronous checkpointing incurs significant I/O wait time does not leverage unused memory space or concurrent transfers

~10s GB

~100s GB

Checkpoint

52 of 64

Research Goal and Objectives

Research Goal: Accelerate data movement on deep tiers of memory hierarchy

Objective	Challenge	Progress
Optimized multi-level flushing techniques (Using dedicated transfer threads across each tier)	Slow flushes to Lower cache tiers	☑ Implemented
Collaborative checkpointing strategies (Using spare capacity on peer devices and fast interconnects)	Checkpoint load imbalance	☑ Simulated ◷ Implementing
Proactive asynchronous cache initialization (Using async. memory mapping, pinning and transfers)	Slow cache allocation and pinning	☑ Implemented
Foreknowledge aware eviction and prefetching (Using state-machine for unified checkpointing & restoring)	Restore oblivious cache eviction and prefetching	☑ Implemented
Skipping intermediate memory tiers (Using technologies such as GPUDirect RDMA)	Cascading transfers across hierarchical memory tiers	◷ In-progress
Efficient co-scheduling of opportunistic on-demand jobs with batch jobs (Checkpointing or Killing batch jobs that minimize compute loss)	Loss of progress made by batch jobs while co-scheduling on-demand jobs	☑ Simulated ◷ Implementing

53 of 64

3. Co-design Checkpointing and Scheduling Runtimes to Preempt Jobs

HPC workloads make use of on-demand jobs for various use cases

Steering of experimental instruments
Acceleration of high-fidelity simulations through surrogate computations
Guided ensemble searches

Running on-demand needs large number of idle nodes before the deadline
Strategies to make room for on-demand jobs

Reserve nodes for on-demand: Under or over committing resources
Evict running jobs: Which jobs to evict and how?

Eviction strategies

Kill the batch job: Loss in progress made by the job
System-level checkpointing: Capture entire state to Parallel File System (PFS)
Application-level checkpointing: Capture only critical data structures to PFS

Our proposal: Dynamic programming based scheduling model to minimize loss in progress made by batch job while scheduling on-demand workload

PFS

System-level checkpointing

Application-level checkpointing

Simulation

DS-RT’20

Batch jobs can be evicted by either killing them or by checkpointing them
Killing involves a penalty, which we call compute loss, which is equal to the loss of progress made by the job since it started (or got application checkpointed)
Checkpointing is a suspend-resume technique, wherein we suspend the job by capturing its application state to the parallel file system, and resuming it later
Two approaches to checkpointing: System-level and application-level
System-level needs to copy the entire memory of the job from every node to the PFS (See the figure on the top)
Application-level just copies the essential data structures (As we can see in the bottom figure)
There are certain tradeoff with each checkpointing approaches:

System-level can start at any time i.e. as soon as the on-demand job arrives, but requires more amount of memory to be transferred
Application-level checkpointing can start only at specific time when it is safe for checkpointing, but requires lesser memory transfer
Good thing about the application-level checkpointing is that the batch job can make additional progress while waiting to get application checkpointed

54 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Optimized Multi-Level Flushing
Checkpoint Load Balancing
Efficient Cache Initialization

Future Research Plans

▶ Related Works

Publications and Submissions
Tentative Schedule for Completion

55 of 64

Checkpoint-restore Runtimes and Data Movement Engines

HPC Checkpoint-restore runtimes

System-level: (e.g. DMTCP [1])�Enable transparent Fault tolerance, capture full state of memory, no GPU support
Application-level: (e.g. SCR [2])�User defines critical data structures, enables use cases beyond resilience, no GPU support
GPU-based checkpoint-restore

Optimized for resilience or application-specific (e.g. Cricket [3], CheckFreq [4])
Close to our work, but suboptimal for GPU based workloads (e.g. FTI-GPU [5])

Data Movement Engines

Utilize intermediate cache layers (e.g. burst buffer) to minimize I/O, no GPU support (e.g. Stacker [6], Data Elevator [7])

Ansel, Jason, et. al., "DMTCP: Transparent checkpointing for cluster computations and the desktop." IPDPS, 2009.
Nicolae, Bogdan, et al. "Veloc: Towards high performance adaptive asynchronous checkpointing at large scale." IPDPS. 2019.
Eiling, Niklas, et al. "Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support." Conc. & Comp.: Practice & Experience , 2022.
Mohan, Jayashree, et al. "CheckFreq: Frequent, Fine-Grained DNN Checkpointing." FAST, 2021.
Konstantinos, et al. "Checkpoint restart support for heterogeneous HPC applications." CCGRID, 2020.
Subedi, Pradeep, et al. "Stacker: an autonomic data movement engine for extreme-scale data staging-based in-situ workflows." SC, 2018.
Dong, Bin, et al. "Data elevator: Low-contention data movement in hierarchical storage system." HiPC, 2016.

56 of 64

Workflow Engines and Cache Management Systems

Workflow Engines

Complex data redistribution and block level parallelism (e.g. Decaf [8])
Data staging and publish-subscribe model (e.g. DataSpaces [9])
In-situ analytics and/or visualization engines (e.g. VisIt [10])
Coroutines and position-independent executables for concurrent simulation and in-situ analysis (e.g. Henson [11])
Provide flexible I/O pipelines, but lack fine grained read/write of tasks with varying access pattern

Cache Management Systems

Optimized on location-awareness [12], data access control [13], software defined cache hierarchy [14], eliminating cache pollution [15], and user-level access pattern modelling
Do not exploit synergies of producing-consuming checkpoints in a unified caching system

Yildiz, Orcun, Matthieu Dreher, and Tom Peterka. "Decaf: Decoupled Dataflows for In Situ Workflows." In Situ Visualization for Computational Science. Springer, 2022.
Docan, Ciprian, et al. "Dataspaces: an interaction and coordination framework for coupled simulation workflows." Cluster Computing, 2012.
Childs, Hank, et al. "VisIt: An end-user tool for visualizing and analyzing very large data." LBNL report, 2012.
Monozov, Dmitriy et al. Henson v1.0, https://www.osti.gov/biblio/1312559
Park, Jongsoo, et al. "Location-aware cache management for many-core processors with deep cache hierarchy." SC 2013.
Shi, Qingchuan, et al. "LDAC: Locality-aware data access control for large-scale multicore cache hierarchies." TACO, 2016.
Tsai, Po-An, et al. "Jenga: Software-defined cache hierarchies." ICSA. 2017.
Devarajan, Hariharan, et al. "Hfetch: Hierarchical data prefetching for scientific workflows in multi-tiered storage environments." IPDPS, 2020.
Qin, Yubo, et al. "Leveraging user access patterns & advanced cyberinfrastructure to accelerate data delivery from shared-use scientific observatories." FGCS, 2021.

57 of 64

Outline

Introduction and Motivation
Research Summary
Research Contributions

Optimized Multi-Level Flushing
Checkpoint Load Balancing
Efficient Cache Initialization

Future Research Plans
Related Works

▶ Publications and Submissions

Tentative Schedule for Completion

58 of 64

Publications and Submissions

Peer-reviewed papers

Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Amr M. Elsayed, Thierry Tonellot, and Franck Cappello. "Towards Efficient Cache Allocation for High-Frequency Checkpointing," in Proceedings of the 29th IEEE International Conference on High Performance Computing, Data, and Analytics, 2022 (HiPC’22)

Avinash Maurya, Jaiaid Mobin, M. Mustafa Rafique. “Towards Data Gravity and Compliance Aware Distributed Deep Learning on Hybrid Clouds,” in Proceedings of the 29th IEEE International Conference on High Performance Computing, Data, and Analytics Workshop, 2022 (HiPCW’22)

Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Thierry Tonellot, and Franck Cappello. "Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing," in Proceedings of the 29th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Virtual, Portugal, 2021 (MASCOTS’21).
Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M. Mustafa Rafique, "CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters." in Proceedings of the 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, Prague, Czech Republic, 2020 (DS-RT’20).

Under review

Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Thierry Tonellot, and Franck Cappello, and Hussain J. AlSalem. "GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching."

59 of 64

Intermediate Data Access Pattern for Different Checkpointing Scenarios

Defensive (High Writes, Low Reads)

Fault-tolerance based on checkpoint-restart
Frequent checkpointing (writes), few restarts (read back)

Productive (High Writes, High Reads)

Revisiting previously generated intermediate data�Adjoint computations that produce data in forward pass and consume in backward pass
Online analytics for steering experiments �Recalibrating workload sub-tasks dynamically based on intermediate execution results
Reproducibility or post-hoc analysis

Administrative (Low Writes, Low Reads)

Migration of jobs�Migrating workloads for energy savings in the HPC cluster
Suspend-resume low priority jobs to run on-demand job�Running opportunistic on-demand jobs to complement a batch job or experiments

Our research focus

For the three different scenarios pertaining to intermediate data checkpointing, we observe different read/write patterns.

For defensive purposes, checkpoints are relatively frequent at regular intervals, while restarts are rare, leading to high writes and low reads.

For productive scenarios, checkpoints are both read and written at high frequency and large volumes. Examples of this include adjoint computations that produce data in forward pass and consume them during the backward pass.

For administrative scenario, checkpoint and restore are done on demand and thus exhibit low read and low write access pattern.

Checkpointing for defensive purposes has been extensively studied in the past, while productive and administrative scenarios have received less attention. Therefore, the primary focus of our research is for productive and administrative use cases.

Now, let us look at impacts of modern memory tiers

60 of 64

Improvement Idea: Min-cost Max-flow Schedule

Virtual src and sink with infinite capacity
GPUs with remainder data connected to devices having spare capacity
Edges contain the max transferable capacity and cost per unit data
Backedges used to reverse a previously made suboptimal flow decision

Cannot apply classic algorithms directly:

Relaying by max capacity of the fastest path does NOT leverage parallel data transfer
Cost of transfer is NOT the sum of all paths, it is the cost of slowest transfer

β = 24 GB/s; γ = 12 GB/s

MASCOTS’21

61 of 64

Problem Formulation: Checkpoint Load Balancing

Each GPU has some free space (F_i) reserved for storing checkpoints on GPU i
If ckpt_size_i > F_i : GPU has remainder
If ckpt_size_i < F_i : GPU has spare capacity
Each checkpoint can be split into multiple regions to checkpoint across various devices
Each GPU has dedicated copy engines (i.e., concurrent GPU-to-GPU and GPU-to-host copy)
Links between GPUs are heterogeneous (e.g., 24 GB/s single NVLink, 48 GB/s dual NVLink)

Efficiently scheduling transfers such that cache utilization is maximized and

I/O wait time is minimized

48 GB/s

24 GB/s

12 GB/s

MASCOTS’21

62 of 64

Implementation of Cache Initialization and Transfer Mechanism

Integrated design principles with VELOC, a production ready HPC checkpointing runtime

VELOC_init spawns threads for running allocations on device and host cache

Concurrent operations supported

Incrementally map device cache
Incrementally map host cache
Transfer from app to device cache
Compute checkpoint of next iteration
Transfer from device to host cache
Transfer from host cache to SSD
Transfer from SSD to PFS (optional)

HiPC’22

63 of 64

Problem Formulation: Efficient Cache Initialization

HPC node consists of N GPUs, each dedicated to a single short-lived process

Every process reserves a fraction of GPU HBM and main memory as cache

Checkpoints cannot be saved until cache is initialized, that leads to application overhead

Checkpoints are stored on faster cache tiers, and evicted when required in FIFO order

Goal is to minimize cache initialization subject to two constraints:

Checkpoints are not evicted until cache is fully initialized and filled to avoid cache underutilization
Host cache cannot be pinned in parts

HiPC’22

64 of 64

Limitations of Existing Cache Initialization Schemes

Memory allocators (e.g. malloc) only reserve virtual address space, not mapped to physical memory

When checkpoints first access cache, kernel triggers page_faults, forces mapping to physical memory

Map memory upfront leads to huge initialization overhead on the application

Mapped but unpinned memory does not allow for peak GPU to main memory transfer rates

Pinning host cache allows peak transfer rates using Direct Memory Access (DMA)

Reserving (mapping + pinning) host cache is slow, but transfers at peak rates

HiPC’22