1 of 64

Towards Efficient Checkpointing Across Deep Tiers of Memory Hierarchy

Avinash Maurya (am6429@rit.edu)

Advisors: M. Mustafa Rafique (RIT), Bogdan Nicolae (ANL)

SC’22 Doctoral Showcase

1

2 of 64

2

Outline

▶ Introduction and Motivation

  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching
  • Future Directions

3 of 64

Intermediate Data in High-Performance Computing (HPC) Workloads

3

  • HPC workloads manage large volumes of data concurrently at scale
    • Climate modelling, seismic simulations, computational biology, big data analytics, deep learning, etc.
  • They produce and consume huge amounts of intermediate data
    • Applications may need previously generated data to advance computations, support job migration, etc.
  • Intermediate data needs to be checkpointed and restored at high frequency
    • HPC workloads, e.g., Reverse Time Migration (RTM), checkpoint 415 TB of intermediate data
  • Use cases of checkpointing and restoring intermediate data
    • Defensive: Fault tolerance based on checkpoint-restore
    • Administrative: Job migration, suspend-resume low-priority jobs
    • Productive: Revisiting previous data for optimized execution; analytics to steer simulations

Checkpointing and restoring intermediate data at scale is a fundamental problem in many HPC workloads

Our research focus

4 of 64

Example: Reverse Time Migration (RTM) used for Seismic Imaging

4

  • RTM used in oil and gas industry to generate subsurface images
  • Adjoint computation: Forward and backward propagated seismic waves are calculated and cross-correlated to form the subsurface image
  • The two wavefields at identical propagation times are combined; checkpointing makes the forward wavefield available during the time-reversed backward pass
  • Intermediate data is checkpointed during forward pass and restored during backward pass
  • Checkpoint read/write rate is ~7 GB/s

[Figure: representative working of RTM*, comparing actual and simulated subsurfaces]

* Source: https://www.youtube.com/watch?v=VWF--1OnitM

* Note: The animation depicts Full Wavefield Inversion, which has the same I/O characteristics as RTM

5 of 64

Accelerators Introduce I/O Contention

5

  • Applications leverage accelerators, e.g. GPUs for faster performance
    • E.g. Nvidia H100 GPUs deliver up to 60 TFLOPS performance for general purpose FP64 computations

  • Faster compute results in frequent checkpointing and restoring (~ms)

  • Accelerators (e.g. GPUs) are equipped with High-Bandwidth Memory (HBM) to avoid data bottlenecks

  • HBM has limited capacity and therefore cannot store all checkpoint data
    • E.g., Nvidia A100 GPUs have 40 GB or 80 GB of HBM, most of which is consumed by the application

  • Remote storage (e.g., a Parallel File System) can store all checkpoints, but incurs a large I/O penalty

The limited memory capacity and fast computational performance of accelerators lead to application I/O overheads

6 of 64

Memory Hierarchy on HPC Systems

6

Memory tiers on HPC systems create a complex caching infrastructure that is not efficiently utilized for various checkpointing and restoring scenarios

7 of 64

7

Outline

  • Introduction and Motivation

▶ Research Summary

  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching
  • Future Directions

8 of 64

Research Challenges for Efficient Checkpoint-Restore Operations

8

  1. Fast memory tiers cannot store all checkpoints; flushing synchronously to lower tiers blocks the application during I/O
    • Slow Flushes to Lower Cache Tiers

  2. Caches on some memory tiers fill faster than others, forcing the application to transfer to slow tiers
    • Checkpoint Load Imbalance

  3. Short-lived jobs cannot amortize the overhead of cache initialization during their runtimes
    • Slow Cache Allocation and Pinning

  4. The restore order of checkpoints is not considered while evicting from the cache, leading to cache misses
    • Restore Oblivious Cache Eviction and Prefetching

9 of 64

9

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions

▶ Background: Asynchronous Multi-Level Checkpointing

    • Checkpoint Load Balancing
    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching
  • Future Directions

10 of 64

Asynchronous Multi-Level Checkpointing Without GPU Support

10

[Figure: HPC application computing on the GPU; memory tiers and capacities: GPU HBM (~10s GB), main memory (~100s GB), local disk (~1000s GB), remote storage (~10s PB)]

  • Unused GPU HBM is not utilized as cache

  • The application needs to manually transfer checkpoints to main memory

  • The application is blocked until the checkpoint is captured on local disk

  • Limited concurrent transfers between
    • Local disk ➜ Remote storage
    • GPU HBM ➜ Main memory ➜ Local disk

Asynchronous checkpointing without GPU support does not utilize unused GPU HBM and concurrent transfers


11 of 64

Asynchronous Multi-Level Checkpointing With Limited GPU Support

11

[Figure: HPC application computing on the GPU; memory tiers and capacities: GPU HBM (~10s GB), main memory (~100s GB), local disk (~1000s GB), remote storage (~10s PB)]

  • Traditionally used for fault tolerance: a limited number of checkpoints are stored in the device and host caches
  • The cache may be under- or over-utilized depending on checkpoint sizes, leading to application I/O overhead
  • Challenges in designing a GPU cache:
    • Limited slots on cache tiers result in underutilized cache capacity and I/O concurrency
    • Multi-GPU systems put pressure on main memory
    • Large I/O throughput gap (D2D > 50× D2H) ➜ high variance in production/consumption of checkpoints
    • Versioning is key to running productive scenarios

[Figure annotation: the cache can hold only two checkpoints]

Existing GPU-supported checkpointing runtimes sub-optimally utilize the GPU cache

12 of 64

Architecture for Realizing Efficient Multi-Level Checkpointing

12

[Figure: proposed architecture; cache space and unused space on each tier: GPU HBM (~10s GB), main memory (~100s GB), local disk (~1000s GB), remote storage (~10s PB)]

  • Optimized for GPU HBM usage
    • Full cache utilization
    • Fast transfers to pinned main memory using Direct Memory Access (DMA)

  • Maximize cache utilization

  • Concurrent transfers across all memory tiers

  • Integrated with VELOC, a production-ready checkpoint-restore runtime

Our architecture enables optimal cache utilization and concurrent transfers across all memory tiers


13 of 64

13

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing

▶ Checkpoint Load Balancing

    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching
  • Future Directions

14 of 64

HPC Workloads Produce Variable Sized Checkpoints

14

Slow processes delay the checkpoint of the entire group in tightly coupled applications

  • Checkpoints accumulating on the fast tiers block the application longer for I/O
  • Techniques to reduce checkpoint sizes
    • Compression
    • Decimation
    • Interpolation
  • Compression leads to uneven checkpoint sizes across processes
  • GPUs with enough free local memory write quickly to their HBM; others must write to slower memory tiers

MASCOTS’21


15 of 64

Local-only Checkpointing Approach (Baseline)

15

  • Write entire checkpoint to local HBM (at 350 GB/s)
  • If not enough room is available on local HBM
    • Checkpoint partially on local HBM (at 350 GB/s)
    • Checkpoint remaining on host cache (at 12 GB/s)
  • Implemented by state-of-the-art checkpointing runtimes and data movement engines

GPU local write speed: 350 GB/s; GPU-to-host write speed: 12 GB/s; GPU-to-GPU write speed: 24 or 48 GB/s

Does NOT leverage free space on peer GPUs and the fast GPU-to-GPU interconnects

MASCOTS’21


16 of 64

Our Approach: Greedy Schedule for Checkpoint Load Balancing

16

  • Leverages GPU-to-GPU interconnect and spare capacity on peers

  • The GPU with the highest remainder will block the group checkpoint for the longest time, hence GPUs are sorted in descending order of remainder

  • Worst-case complexity for N GPUs is O(N²)

  • Steps (see the sketch at the end of this slide):
    1. Select GPU i with the highest remainder checkpoint (ckptᵢ)
    2. Select peer j with the fastest interconnect having spare capacity (Fⱼ)
    3. Transfer min(ckptᵢ, Fⱼ) to peer j
    4. Go to step (2) if a remainder persists and more peers exist
    5. If no peer with spare capacity exists, transfer the remainder to host memory

The greedy approach is straightforward and has lower complexity, but does NOT yield an optimal transfer schedule

MASCOTS’21
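
The following is a minimal, hypothetical C++ sketch of this greedy schedule (data structures and names are illustrative, not the MASCOTS'21 implementation): each GPU's remainder is placed on the fastest peer with spare capacity, falling back to host memory.

```cpp
// Hypothetical sketch of the greedy schedule above. rem[i] is GPU i's remainder in
// bytes, spare[i] its free HBM, and bw[i][j] the GPU-to-GPU link bandwidth (GB/s).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Transfer { int src, dst; uint64_t bytes; };   // dst == -1 denotes host memory

std::vector<Transfer> greedy_schedule(std::vector<uint64_t> rem,
                                      std::vector<uint64_t> spare,
                                      const std::vector<std::vector<double>> &bw) {
    std::vector<Transfer> plan;
    if (rem.empty()) return plan;
    const int n = static_cast<int>(rem.size());
    for (;;) {
        // Step 1: the GPU with the largest remainder blocks the group the longest.
        int i = static_cast<int>(std::max_element(rem.begin(), rem.end()) - rem.begin());
        if (rem[i] == 0) break;                        // all remainders placed
        // Step 2: fastest interconnect to a peer that still has spare capacity.
        int best = -1;
        for (int j = 0; j < n; ++j)
            if (j != i && spare[j] > 0 && (best == -1 || bw[i][j] > bw[i][best]))
                best = j;
        if (best == -1) {                              // Step 5: no peer left, use host memory
            plan.push_back({i, -1, rem[i]});
            rem[i] = 0;
        } else {                                       // Step 3: move only what fits on the peer
            uint64_t amount = std::min(rem[i], spare[best]);
            plan.push_back({i, best, amount});
            rem[i] -= amount;
            spare[best] -= amount;
        }                                              // Step 4: retry with the remaining peers
    }
    return plan;
}
```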

17 of 64

Our Solution: Min-time Max-flow Schedule

17

  • Modified min-cost max-flow
  • Increment the flow over the fastest link by one unit
  • Increase the cost of the fastest link by its unit cost of transfer, forcing slower paths to be selected for parallel transfers in the next cycle
  • Guarantees optimal schedule (minimum checkpoint blocking time)
  • Complexity: O(F · N · E), where F = total remainder checkpoint size across all GPUs, N = total number of GPUs, and E = total number of edges (a simplified sketch of the unit-flow idea follows below)

β = 24 GB/s; γ = 12 GB/s

MASCOTS’21
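
A simplified, hypothetical C++ sketch of the unit-flow idea (it omits the back-edges of the full min-cost max-flow formulation and assumes direct source-to-destination links only): data is pushed one unit at a time over the link that currently finishes earliest, and that link's accumulated cost grows accordingly, drawing slower links into the parallel transfer.

```cpp
// Simplified sketch only: the actual MASCOTS'21 min-cost max-flow model additionally
// uses back-edges to revise earlier flow decisions.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Link {
    int src, dst;           // dst == -1 denotes host memory
    double unit_cost;       // time to move one unit (e.g., 1 MB) over this link
    uint64_t capacity;      // units of spare space reachable through this link
    uint64_t flow = 0;
};

double schedule_units(std::vector<uint64_t> rem_units, std::vector<Link> &links) {
    std::vector<double> finish(links.size(), 0.0);           // accumulated time per link
    for (int g = 0; g < static_cast<int>(rem_units.size()); ++g) {
        while (rem_units[g] > 0) {
            int best = -1;
            for (int l = 0; l < static_cast<int>(links.size()); ++l) {
                if (links[l].src != g || links[l].flow >= links[l].capacity) continue;
                if (best < 0 ||
                    finish[l] + links[l].unit_cost < finish[best] + links[best].unit_cost)
                    best = l;
            }
            if (best < 0) break;                             // no outgoing capacity left
            links[best].flow += 1;                           // one unit on the fastest link
            finish[best] += links[best].unit_cost;           // its cost grows for the next cycle
            rem_units[g] -= 1;
        }
    }
    // The group checkpoint blocks until the slowest parallel transfer completes.
    double blocking = 0.0;
    for (double f : finish) blocking = std::max(blocking, f);
    return blocking;
}
```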

18 of 64

Evaluation Methodology: Checkpoint Load Balancing

18

  • Checkpointing traces of Reverse Time Migration (RTM):
    • Used in the oil and gas industry for cross correlation of two seismic wavefields to form subsurface image
    • Nvidia DGX-1: 8 Tesla V100 GPUs (32 GB HBM2 each)
    • 776 checkpoints (~53 GB) per GPU

  • Compared approaches
    • Baseline (local-only): suboptimal schedule; puts the remainder on host memory
    • Greedy: suboptimal schedule; puts the remainder on the fastest peer, then resorts to host memory
    • Min-time max-flow: optimal schedule; spreads the remainder across all possible peers in parallel

  • Key metrics for evaluation
    • Application I/O wait time (checkpoint overheads) of non-optimal approaches
    • Response time to schedule transfers by the compared approaches

MASCOTS’21

19 of 64

Performance Results: RTM Traces Checkpointing Overhead

19

Distribution of checkpoint sizes across 8 GPUs

Checkpointing overhead (app blocking time) of greedy and baseline approaches relative to min-time max-flow approach

GPU free space = 128 MB

Optimal approach: Min-time max-flow

Baseline and greedy approaches incur a checkpointing overhead of 400% and 97% respectively as compared to the proposed min-time max-flow based approach


MASCOTS’21

20 of 64

Performance Results: Execution Times of Different Approaches

20

Distribution of checkpoint sizes across 8 GPUs

Response time for scheduling transfers using baseline, greedy and min-time max-flow approaches

GPU free space = 128 MB

Min-time max-flow approach incurs negligible runtime penalty (~90 µs) and always produces optimal schedule as compared to the greedy and baseline approaches

Optimal approach: Min-time max-flow

MASCOTS’21

21 of 64

Key Takeaways: Checkpoint Load Balancing

21

  • Applications produce a large amount of intermediate data (checkpoints), to be reused later
  • Memory constrained systems (e.g. accelerators) cannot hold all checkpoints
  • Asynchronous multi-level checkpointing is used to flush to slow tiers in the background
  • We proposed greedy and min-time max-flow approaches to generate efficient checkpointing schedules
  • Min-time max-flow approach always yields optimal results with negligible runtime overhead
    • 8x faster than baseline (local-only) approach
    • 2x faster than greedy approach
    • Execution time of min-time max-flow approach is negligible even at scale

MASCOTS’21

22 of 64

22

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing

▶ Efficient Cache Initialization

    • Foreknowledge Based Eviction and Prefetching
  • Future Directions

23 of 64

Large-Volume High-Frequency Checkpointing

23

  • Some HPC workloads consist of ensembles of short runs (each ~minutes) that add up to long overall runtimes
    • Reverse Time Migration (RTM) processes one seismic image in approx. 3.5 mins

  • Each run needs to checkpoint frequently (~ms)
    • RTM produces approx. 200 MB checkpoint every 30 ms (6.7 GB/s)

  • Cache initialization (mapping and pinning) overhead is not amortized over application runtime

  • Multi-GPU systems run multiple such short-lived tasks concurrently

  • Competition for memory allocation amplifies initialization overhead

Short-lived tasks on multi-GPU systems compete during cache allocation, leading to significant initialization overheads

HiPC’22

24 of 64

Design Principles for Efficient Cache Initialization

24

  • Incremental GPU cache allocation using low-level GPU Virtual Memory Management (VMM) functions (see the sketch at the end of this slide)
    • Concurrent GPU HBM allocations reduce allocation rates from 1.4 TB/s to 160 GB/s, so allocate HBM in chunks

  • Opportunistic touching of host buffer virtual pages to force allocation of physical pages in advance
    • Run forerunner threads to access cache pages before checkpoints are written

  • Concurrency control to mitigate contention between flushes and opportunistic page touching
    • Avoid I/O competition on memory tier between allocations and transfers
    • Pause memory allocation when transfers are in progress

  • Serialized registration of host buffers to avoid competition for pinning physical memory pages
    • Concurrent registration of multiple host buffers leads to contention on DMA drivers

HiPC’22
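
A minimal CUDA C++ sketch of the incremental allocation idea, assuming the low-level VMM driver API (cuMemAddressReserve/cuMemCreate/cuMemMap); chunk sizes, struct names, and the absence of error handling are illustrative and not the HiPC'22 implementation. One large virtual range is reserved cheaply up front and backed with physical HBM chunk by chunk, so the application never waits for the whole cache to be allocated at once.

```cpp
// Assumes a CUDA context is already current on `device`.
#include <cuda.h>
#include <vector>

struct GpuCache {
    CUdeviceptr base = 0;                       // reserved virtual range
    size_t reserved = 0, mapped = 0, chunk = 0;
    std::vector<CUmemGenericAllocationHandle> handles;
};

static CUmemAllocationProp device_prop(int device) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;
    return prop;
}

void cache_reserve(GpuCache &c, size_t total_bytes, int device) {
    CUmemAllocationProp prop = device_prop(device);
    size_t gran = 0;
    cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
    c.chunk = gran * 64;                                        // grow in multi-granule chunks
    c.reserved = ((total_bytes + c.chunk - 1) / c.chunk) * c.chunk;
    cuMemAddressReserve(&c.base, c.reserved, 0, 0, 0);          // cheap: virtual space only
}

// Called from a background thread, or lazily right before a checkpoint write,
// to back the next chunk of the reserved range with physical HBM.
bool cache_grow(GpuCache &c, int device) {
    if (c.mapped >= c.reserved) return false;
    CUmemAllocationProp prop = device_prop(device);
    CUmemGenericAllocationHandle h;
    cuMemCreate(&h, c.chunk, &prop, 0);                         // physical chunk
    cuMemMap(c.base + c.mapped, c.chunk, 0, h, 0);              // map it into the reserved range
    CUmemAccessDesc acc = {};
    acc.location = prop.location;
    acc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    cuMemSetAccess(c.base + c.mapped, c.chunk, &acc, 1);
    c.handles.push_back(h);
    c.mapped += c.chunk;
    return true;
}
```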

25 of 64

Concurrency Control Between Transfers and Allocations

25

  • Standard Allocation and Registration
    • Waits for complete cache initialization
    • Implemented by state-of-the-art checkpointing runtimes
  • Concurrent Touch and Flush
    • Incrementally attempts touching before transfers
    • Registers when all pages are touched
  • Sequential Touch and Flush (sketched below)
    • Incrementally attempts touching before transfers
    • Pauses touching when transferring
    • Registers when all pages are touched

HiPC’22
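
A hypothetical CUDA C++ sketch of the sequential touch-and-flush idea (thread structure, flags, and names are illustrative, not the HiPC'22 implementation): a forerunner thread touches host-cache pages to force physical page allocation, pauses whenever a device-to-host flush is in progress, and pins the buffer only once all pages exist.

```cpp
#include <cuda_runtime.h>
#include <atomic>
#include <thread>
#include <unistd.h>

struct HostCache {
    char *buf = nullptr;
    size_t size = 0, touched = 0;
    std::atomic<bool> flushing{false}, pinned{false};
};

void toucher(HostCache &c) {
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    while (c.touched < c.size) {
        if (c.flushing.load()) {            // pause touching while a flush is in progress
            std::this_thread::yield();
            continue;
        }
        c.buf[c.touched] = 0;               // first write page-faults in a physical page
        c.touched += page;
    }
    cudaHostRegister(c.buf, c.size, cudaHostRegisterDefault);  // pin once, after touching
    c.pinned.store(true);
}

// Flush path: cudaMemcpyAsync works on pageable memory too, but only reaches peak
// DMA rates once the buffer has been registered (pinned) by the toucher thread.
void flush(HostCache &c, const void *dev_ptr, size_t bytes, size_t off, cudaStream_t s) {
    c.flushing.store(true);                 // signal the toucher to back off
    cudaMemcpyAsync(c.buf + off, dev_ptr, bytes, cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    c.flushing.store(false);
}
```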

26 of 64

Evaluation Methodology: Efficient Cache Initialization

26

  • Experimental Setup
    • Nvidia DGX A100: 8 A100 GPUs, each with 40 GB GPU HBM
    • Two 64-core AMD Rome CPUs (256 threads), four 3.84 TB Gen4 NVME drives (4 GB/s)
    • GPUs interconnected via 6 NVSwitches and connected to the host memory through the SXM4 interface
  • Compared Approaches
    • Direct pin: Standard allocation and registration using cudaMallocHost
    • UVM: Using Unified Virtual Memory (multi-tier memory management by CUDA)
    • Incremental memset: Concurrent touch and flush technique, memset pages to zero
    • Our approach: Use transparent huge pages, concurrency control touch and flush, sequential pinning
  • Compute interval between checkpoint or restore operations
    • Based on real-world characteristics of RTM application: [5 . . . 60] ms
  • Key metrics for evaluation
    • Total I/O overhead (checkpoint + restore)
    • Scalability tests

HiPC’22

27 of 64

Total I/O (Checkpoint + Restore) Overhead on the Application

27


Our approach outperforms the baseline (Direct Pin) approach by up to 12.5x and 7.8x for checkpoint and total I/O overheads respectively

HiPC’22

28 of 64

Scalability Test for Increasing Number of Processes Per Node

28

Our approach demonstrates significant performance improvements (up to 13.6x) as compared to state-of-the-art approaches even at scale

HiPC’22

29 of 64

Key Takeaways: Efficient Cache Initialization

29

  • HPC workloads may consist of short-lived ensemble runs that need to checkpoint frequently
  • State-of-the-art checkpointing runtimes pay a high cache initialization cost
  • We propose the following optimizations for reducing cache initialization overheads:
    • Incremental cache allocation on the GPU using low-level virtual memory functions
    • Opportunistic touching of memory pages in advance to force physical page mapping
    • Concurrency control to mitigate contention between touching and transfers
    • Serialized registration of host buffers
  • Our optimizations show up to 12x faster checkpointing and 7x faster application completion time as compared to the baseline approaches

HiPC’22

30 of 64

30

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization

▶ Foreknowledge Based Eviction and Prefetching

  • Future Directions

31 of 64

Efficient Cache Eviction and Prefetching

31

  • Productive use-cases of checkpointing read back (restore) the entire history of checkpoints
  • The order of restore operations is usually known in advance
    • Same order: Check invariants, identify divergence in evolution
    • Reverse order: Automatic differentiation, seismic imaging in the oil industry
    • Irregular order: Checkpoint priority, pseudo-random sampling
  • The deterministic restore order of HPC workloads is not considered during eviction
  • Existing checkpointing runtimes are optimized for either checkpointing or restoring, but not both

Concurrent checkpointing and restoring create a complex producer-consumer interleaving and require unified eviction and prefetching strategies

Under review

32 of 64

Unified Flush/Prefetch Support using Finite State Machine Life-cycle

32

Under review

[FSM annotations: a region is born through a checkpoint or prefetch operation; it can transition from the checkpointing phase to the prefetching phase; it can be evicted]

Unified checkpoint-restore minimizes I/O contention and application execution time
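
A minimal C++ sketch of such a life-cycle state machine (states, events, and transitions are illustrative; the design under review may differ): a region enters through either a checkpoint write or a prefetch, becomes evictable only after it has been flushed, and can re-enter the cache via prefetching.

```cpp
#include <stdexcept>

enum class RegionState {
    CkptInProgress,       // born through a checkpoint write; not eligible for eviction
    Cached,               // resident in a cache tier
    Flushed,              // persisted to a lower tier; the cache copy may be evicted
    PrefetchInProgress,   // born through (or re-entering via) a prefetch
    Restored              // delivered to the application during the restore phase
};

enum class RegionEvent { CkptDone, FlushDone, Evict, PrefetchStart, PrefetchDone, Consume };

RegionState next_state(RegionState s, RegionEvent e) {
    switch (s) {
    case RegionState::CkptInProgress:
        if (e == RegionEvent::CkptDone) return RegionState::Cached;
        break;
    case RegionState::Cached:
        if (e == RegionEvent::FlushDone) return RegionState::Flushed;   // now evictable
        if (e == RegionEvent::Consume)   return RegionState::Restored;  // ckpt -> restore phase
        break;
    case RegionState::Flushed:
        if (e == RegionEvent::Evict)         return RegionState::Flushed;  // drop the cache copy
        if (e == RegionEvent::PrefetchStart) return RegionState::PrefetchInProgress;
        if (e == RegionEvent::Consume)       return RegionState::Restored;
        break;
    case RegionState::PrefetchInProgress:
        if (e == RegionEvent::PrefetchDone) return RegionState::Cached;
        break;
    case RegionState::Restored:
        break;
    }
    throw std::logic_error("invalid region transition");
}
```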

33 of 64

Score-based Look-ahead Cache Eviction

33

Under review

[Figure: cache contents before and after eviction; # denotes the prefetch order (1 means next in line for restore). Regions with writes in progress are not eligible for eviction; among eligible regions, the one with the least impact on I/O is evicted to fit the required cache size.]
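
A hypothetical C++ sketch of the scoring idea (field names and the scoring rule are illustrative): regions still being written or not yet flushed are ineligible, and among eligible regions the one needed furthest in the future according to the restore hints, i.e., the one with the least impact on I/O, is evicted first.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

struct CachedRegion {
    uint64_t id;
    uint64_t bytes;
    bool write_in_progress;   // not eligible for eviction
    bool flushed;             // safe to drop from this cache tier
    int prefetch_order;       // 1 = next in line for restore; -1 = no hint available
};

// Returns the index of the eviction victim, or -1 if nothing is eligible.
int pick_victim(const std::vector<CachedRegion> &cache) {
    int victim = -1;
    long best_score = std::numeric_limits<long>::min();
    for (int i = 0; i < static_cast<int>(cache.size()); ++i) {
        const CachedRegion &r = cache[i];
        if (r.write_in_progress || !r.flushed) continue;       // ineligible
        // Higher score = safer to evict. Unhinted regions get a large neutral score so
        // hinted, soon-to-be-restored regions are always preferred for retention.
        long score = (r.prefetch_order < 0) ? 1000000L : r.prefetch_order;
        if (score > best_score) { best_score = score; victim = i; }
    }
    return victim;
}

// Evict until the tier can accommodate `needed` more bytes.
void make_room(std::vector<CachedRegion> &cache, uint64_t &used,
               uint64_t capacity, uint64_t needed) {
    while (used + needed > capacity) {
        int v = pick_victim(cache);
        if (v < 0) break;                                      // nothing evictable; caller waits
        used -= cache[v].bytes;
        cache.erase(cache.begin() + v);
    }
}
```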

34 of 64

Evaluation Methodology: Foreknowledge-based Eviction and Prefetching

34

Under review

  • Experimental Setup
    • Nvidia DGX A100: 8 A100 GPUs, each with 40 GB GPU HBM
    • Two 64-core AMD Rome CPUs (256 threads), four 3.84 TB Gen4 NVME drives (4 GB/s)
    • GPUs interconnected via 6 NVSwitches and connected to the host memory through the SXM4 interface
  • Compared Approaches
    • ADIOS2: State-of-the-art data movement runtime
    • FIFO-based eviction: Evict the oldest checkpoint existing on the cache
    • Score-based eviction: our approach; evicts the subset of checkpoints that minimizes I/O contention
  • Number of prefetching hints available
    • No hints: Application does not have foreknowledge of prefetch order
    • Single hint: Only next prefetch is known
    • All hints: The entire sequence of restoration is known
  • Key metrics for evaluation
    • Restore throughput
    • Scalability tests

35 of 64

Restore throughput performance

35


Our approach outperforms the state-of-the-art solutions by up to 7.5x, 8.7x, and 11.7x for sequential, reverse, and irregular restore orders, respectively

Under review

36 of 64

Scalability study

36

Scalability study shows up to 3.4x and 7.6x faster I/O throughput for tightly coupled and embarrassingly parallel scenarios, respectively


Under review


37 of 64

Key Takeaways: Foreknowledge-based Cache Eviction and Prefetching

37

  • HPC workloads need to restore the entire checkpoint history to complete execution
  • Existing checkpointing runtimes are not optimized for concurrent checkpointing and restoration
  • We design and develop
    • A finite state machine to enable a seamless transition from the checkpointing to the restoring phase
    • Score-based look-ahead eviction technique based on restore hints
    • Integrated with VELOC, a production-ready multi-level HPC checkpointing runtime
  • Our evaluations show up to 6.3x and 11.7x faster checkpoint and restore throughput, respectively
  • Scalability study shows 3.4x and 7.6x speedup as compared to state-of-the-art approaches for tightly coupled and embarrassingly parallel applications, respectively

Under review

38 of 64

38

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching

▶ Future Directions

39 of 64

Future Directions

39

  • Concurrent transformations during checkpoint transfers
  • Shared node-local host buffer across all processes to further reduce initialization overhead
  • Enabling foreknowledge based data migration in Nvidia Unified Virtual Memory (UVM)
  • Leverage GPUDirect Storage to minimize latency due to cascading transfers
  • Explore trade-offs between evicting older checkpoints and checkpointing on peers
  • Integrate the checkpointing runtime with the scheduler for administrative scenarios (e.g., job migration, suspend-resume of low-priority jobs, etc.)

40 of 64

Thank You!

Questions?

41 of 64

Backup slides

42 of 64

Acknowledgements

43 of 64

43

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Background: Asynchronous Multi-Level Checkpointing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
    • Foreknowledge Based Eviction and Prefetching

▶ Future Research Plans

  • Publications

44 of 64

1. Implementing Collaborative Multi-Level Checkpointing

44

  • Refine the min-time max-flow model to leverage modern interconnects (e.g., Nvidia CX7 adapters, up to 400 Gb/s)

  • Extend collaborative checkpointing to remote GPU HBMs and between memory tiers (e.g. shared host cache)

  • Integrate with VELOC, an HPC checkpoint-restore runtime

  • Evaluate and optimize performance for real-world workloads

Source: https://www.hardwarezone.com.sg/tech-news-nvidia-h100-gpu-hopper-architecture-building-block-ai-infrastructure-dgx-h100-supercomputer

45 of 64

2. Skipping Intermediate Memory Tiers to Minimize Transfer Latency

45

  • Modern interconnects are capable of direct transfers to non-adjacent memory tiers

  • Design a unified checkpoint-runtime scheduler to minimize I/O time using direct transfers

  • Design efficient cache eviction schemes to minimize collective I/O pressure across memory tiers and interconnects

46 of 64

Example based comparison of greedy and flow-based approaches

46

Greedy approach

  • GPU 0 ⟶ GPU 4 : 480 MB in 10 ms
  • GPU 6 ⟶ Host : 240 MB in 20 ms

Overall wait time = 20 ms

Spare capacity of GPU 4 was not fully used

[Link legend: 48 GB/s dual NVLink, 24 GB/s single NVLink, 12 GB/s GPU-to-host]

Min-time Max-flow approach

  • GPU 0 ⟶ GPU 1 : 137 MB in 5.7 ms
  • GPU 0 ⟶ GPU 4 : 275 MB in 5.7 ms
  • GPU 0 ⟶ Host : 68 MB in 5.7 ms
  • GPU 6 ⟶ GPU 4 : 160 MB in 6.7 ms
  • GPU 6 ⟶ Host : 80 MB in 6.7 ms

Overall wait time = 6.7 ms

47 of 64

Challenges with Existing Multi-level Checkpointing Runtimes

47

  • Most checkpointing runtimes do not support GPU-based workloads
    • GPU HBM has unused space that is not used for storing checkpoints
    • Application is responsible for transferring the data from GPU to main memory, leading to I/O overhead
    • Checkpoint is synchronously captured from main memory to the next tier
  • Checkpointing runtimes (e.g., FTI, the Fault Tolerance Interface) have limited GPU checkpointing support
    • Dynamically determine the residency of the checkpoint (resides on GPU or main memory)
    • Perform blocking transfers from GPU to cache slots on main memory
    • Have limited cache slots on the main memory (e.g. FTI has 2 buffers)
    • Fixed sized cache slots may be under or over utilized by checkpoints of varying sizes
    • No pinning of main memory ➡ High transfer overheads (transfer without Direct Memory Access)
    • Pinning main memory ➡ High initialization overheads (high cost of pinning)

Existing checkpointing runtimes either have no or very limited GPU support

48 of 64

Asynchronous Multi-Level Checkpointing

48

1. Application runs computation kernels

2. Checkpointing is requested at certain timesteps

3. Checkpoint is stored in the GPU cache first (HBM is fast, up to 500 GB/s transfer rate); wait for eviction if the GPU cache is full

4. Application resumes computations

5. Checkpoint is asynchronously transferred to the host cache; wait for eviction if the host cache is full

6. Checkpoint is asynchronously transferred to the SSD; wait for eviction if the SSD is full

A sketch of this pipeline follows.
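
The following is a minimal, hypothetical CUDA C++ sketch of such an asynchronous pipeline (queues, buffer management, and file layout are illustrative; the actual VELOC implementation differs): the application returns right after a device-to-device copy into the GPU cache, while one background thread drains the GPU cache to the host cache and another drains the host cache to the node-local SSD.

```cpp
#include <cuda_runtime.h>
#include <condition_variable>
#include <deque>
#include <fstream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Ckpt { int version; void *dev_copy; std::vector<char> host_copy; size_t bytes; };

template <typename T> struct Queue {
    std::deque<T> q; std::mutex m; std::condition_variable cv;
    void push(T v) { { std::lock_guard<std::mutex> l(m); q.push_back(std::move(v)); } cv.notify_one(); }
    T pop() { std::unique_lock<std::mutex> l(m); cv.wait(l, [&]{ return !q.empty(); });
              T v = std::move(q.front()); q.pop_front(); return v; }
};

Queue<Ckpt> gpu_tier, host_tier;

// Application-facing call: fast D2D copy into the GPU cache, then the app resumes.
void checkpoint(int version, const void *app_dev_ptr, size_t bytes, cudaStream_t s) {
    Ckpt c{version, nullptr, {}, bytes};
    cudaMalloc(&c.dev_copy, bytes);                       // slot in the GPU-side cache
    cudaMemcpyAsync(c.dev_copy, app_dev_ptr, bytes, cudaMemcpyDeviceToDevice, s);
    cudaStreamSynchronize(s);
    gpu_tier.push(std::move(c));
}

// Background flusher: GPU cache -> host cache (a pinned host buffer would enable DMA rates).
void gpu_to_host_flusher() {
    for (;;) {
        Ckpt c = gpu_tier.pop();
        c.host_copy.resize(c.bytes);
        cudaMemcpy(c.host_copy.data(), c.dev_copy, c.bytes, cudaMemcpyDeviceToHost);
        cudaFree(c.dev_copy);
        host_tier.push(std::move(c));
    }
}

// Background flusher: host cache -> node-local SSD.
void host_to_ssd_flusher(const std::string &dir) {
    for (;;) {
        Ckpt c = host_tier.pop();
        std::ofstream out(dir + "/ckpt-" + std::to_string(c.version) + ".bin", std::ios::binary);
        out.write(c.host_copy.data(), static_cast<std::streamsize>(c.bytes));
    }
}
```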

49 of 64

Problem formulation

49

  • Each GPU has some free space reserved for storing checkpoints (Fi)
  • For GPU i, if the checkpoint size (ckptᵢ) is larger than its free space (Fᵢ) ➡ GPU has a remainder
  • If ckptᵢ < Fᵢ ➡ GPU has spare capacity
  • Each checkpoint can be split into multiple regions to checkpoint across various devices
  • Each GPU has dedicated copy engines (i.e., GPU-to-GPU and GPU-to-host copies can happen in parallel)
  • Links between GPUs are heterogeneous (e.g., 24 GB/s single NVLink, 48 GB/s dual NVLink)

[Link legend: 48 GB/s dual NVLink, 24 GB/s single NVLink, 12 GB/s GPU-to-host]

Where and how much to write of each checkpoint to minimize total ckpt time?
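
One possible formalization of this question, in our own notation (not taken verbatim from the paper): let $x_{ij}$ be the bytes GPU $i$ places on destination $j$ (a peer GPU or the host) over a link of bandwidth $b_{ij}$, $r_i$ its remainder, $F_j$ the spare capacity of destination $j$, and $T$ the checkpoint blocking time; dedicated copy engines are assumed to let all links transfer in parallel.

```latex
\begin{aligned}
\min_{x,\,T}\quad & T \\
\text{s.t.}\quad  & \textstyle\sum_{j} x_{ij} = r_i  && \forall i \quad \text{(the entire remainder is placed)}\\
                  & \textstyle\sum_{i} x_{ij} \le F_j && \forall j \quad \text{(spare capacity of destination $j$)}\\
                  & x_{ij} / b_{ij} \le T             && \forall i,j \ \text{(all parallel transfers finish by $T$)}\\
                  & x_{ij} \ge 0
\end{aligned}
```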

50 of 64

Key Research Challenges Presented

50

  • Slow Flushes to Lower Cache Tiers
    • Fast memory tiers cannot store all checkpoints; flushing synchronously to lower tiers blocks the application during I/O
  • Checkpoint Load Imbalance
    • Caches on some memory tiers fill faster than others, leading to uneven cache consumption
  • Slow Cache Allocation and Pinning
    • Short-lived jobs cannot amortize the overhead of cache initialization during their runtimes
  • Restore Oblivious Cache Eviction and Prefetching
    • The restore order of checkpoints is not considered while evicting old checkpoints from cache
  • Cascading I/O Hierarchically Across Memory Tiers
    • Checkpoints are strictly transferred hierarchically across cache tiers, increasing I/O access latency
  • Running On-Demand Jobs by Terminating Batch Jobs
    • Running opportunistic on-demand jobs within a given deadline requires evicting batch jobs, leading to lost progress

51 of 64

Synchronous Multi-Level Checkpointing

51

  • Application produces intermediate data

  • Application is blocked until the checkpoint is transferred to a lower memory tier, typically remote storage, for persistence

  • Limitations:
    • Does not overlap compute with transfers
    • Does not utilize the unused space on intermediate memory tiers

[Figure: synchronous checkpointing; the application on the GPU writes each checkpoint through GPU HBM (~10s GB), main memory (~100s GB), and local disk (~1000s GB) to remote storage (~10s PB), leaving unused space on the intermediate tiers]

Synchronous checkpointing incurs significant I/O wait time and does not leverage unused memory space or concurrent transfers


52 of 64

Research Goal and Objectives

52

Research Goal: Accelerate data movement on deep tiers of memory hierarchy

Objective (approach) ⟶ Challenge ⟶ Progress

  • Optimized multi-level flushing techniques (using dedicated transfer threads across each tier) ⟶ Slow flushes to lower cache tiers ⟶ Implemented

  • Collaborative checkpointing strategies (using spare capacity on peer devices and fast interconnects) ⟶ Checkpoint load imbalance ⟶ Simulated; implementing

  • Proactive asynchronous cache initialization (using async. memory mapping, pinning, and transfers) ⟶ Slow cache allocation and pinning ⟶ Implemented

  • Foreknowledge-aware eviction and prefetching (using a state machine for unified checkpointing & restoring) ⟶ Restore-oblivious cache eviction and prefetching ⟶ Implemented

  • Skipping intermediate memory tiers (using technologies such as GPUDirect RDMA) ⟶ Cascading transfers across hierarchical memory tiers ⟶ In progress

  • Efficient co-scheduling of opportunistic on-demand jobs with batch jobs (checkpointing or killing batch jobs to minimize compute loss) ⟶ Loss of progress made by batch jobs while co-scheduling on-demand jobs ⟶ Simulated; implementing

53 of 64

3. Co-design Checkpointing and Scheduling Runtimes to Preempt Jobs

53

  • HPC workloads make use of on-demand jobs for various use cases
    • Steering of experimental instruments
    • Acceleration of high-fidelity simulations through surrogate computations
    • Guided ensemble searches
  • Running on-demand jobs needs a large number of idle nodes before the deadline
  • Strategies to make room for on-demand jobs
    • Reserve nodes for on-demand jobs: under- or over-commits resources
    • Evict running jobs: Which jobs to evict and how?
  • Eviction strategies
    • Kill the batch job: Loss in progress made by the job
    • System-level checkpointing: Capture entire state to Parallel File System (PFS)
    • Application-level checkpointing: Capture only critical data structures to PFS
  • Our proposal: a dynamic programming based scheduling model to minimize the loss in progress made by batch jobs while scheduling on-demand workloads

[Figure: system-level vs. application-level checkpointing of a simulation to the PFS]

DS-RT’20

54 of 64

54

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Optimized Multi-Level Flushing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
  • Future Research Plans

▶ Related Works

  • Publications and Submissions
  • Tentative Schedule for Completion

55 of 64

Checkpoint-restore Runtimes and Data Movement Engines

55

  • HPC Checkpoint-restore runtimes
    • System-level (e.g., DMTCP [1]): enables transparent fault tolerance, captures the full memory state, no GPU support
    • Application-level (e.g., SCR [2]): user defines critical data structures, enables use cases beyond resilience, no GPU support
    • GPU-based checkpoint-restore
      • Optimized for resilience or application-specific (e.g., Cricket [3], CheckFreq [4])
      • Close to our work, but suboptimal for GPU-based workloads (e.g., FTI-GPU [5])
  • Data Movement Engines
    • Utilize intermediate cache layers (e.g., burst buffers) to minimize I/O, no GPU support (e.g., Stacker [6], Data Elevator [7])
  1. Ansel, Jason, et al. "DMTCP: Transparent checkpointing for cluster computations and the desktop." IPDPS, 2009.
  2. Nicolae, Bogdan, et al. "Veloc: Towards high performance adaptive asynchronous checkpointing at large scale." IPDPS. 2019.
  3. Eiling, Niklas, et al. "Cricket: A virtualization layer for distributed execution of CUDA applications with checkpoint/restart support." Conc. & Comp.: Practice & Experience , 2022.
  4. Mohan, Jayashree, et al. "CheckFreq: Frequent, Fine-Grained DNN Checkpointing." FAST, 2021.
  5. Konstantinos, et al. "Checkpoint restart support for heterogeneous HPC applications." CCGRID, 2020.
  6. Subedi, Pradeep, et al. "Stacker: an autonomic data movement engine for extreme-scale data staging-based in-situ workflows." SC, 2018.
  7. Dong, Bin, et al. "Data elevator: Low-contention data movement in hierarchical storage system." HiPC, 2016.

56 of 64

Workflow Engines and Cache Management Systems

56

  • Workflow Engines
    • Complex data redistribution and block level parallelism (e.g. Decaf [8])
    • Data staging and publish-subscribe model (e.g. DataSpaces [9])
    • In-situ analytics and/or visualization engines (e.g. VisIt [10])
    • Coroutines and position-independent executables for concurrent simulation and in-situ analysis (e.g. Henson [11])
    • Provide flexible I/O pipelines, but lack fine-grained read/write support for tasks with varying access patterns
  • Cache Management Systems
    • Optimized for location-awareness [12], data access control [13], software-defined cache hierarchies [14], eliminating cache pollution [15], and user-level access pattern modelling
    • Do not exploit synergies of producing-consuming checkpoints in a unified caching system
  8. Yildiz, Orcun, Matthieu Dreher, and Tom Peterka. "Decaf: Decoupled Dataflows for In Situ Workflows." In Situ Visualization for Computational Science. Springer, 2022.
  9. Docan, Ciprian, et al. "DataSpaces: an interaction and coordination framework for coupled simulation workflows." Cluster Computing, 2012.
  10. Childs, Hank, et al. "VisIt: An end-user tool for visualizing and analyzing very large data." LBNL report, 2012.
  11. Morozov, Dmitriy, et al. Henson v1.0, https://www.osti.gov/biblio/1312559
  12. Park, Jongsoo, et al. "Location-aware cache management for many-core processors with deep cache hierarchy." SC, 2013.
  13. Shi, Qingchuan, et al. "LDAC: Locality-aware data access control for large-scale multicore cache hierarchies." TACO, 2016.
  14. Tsai, Po-An, et al. "Jenga: Software-defined cache hierarchies." ISCA, 2017.
  15. Devarajan, Hariharan, et al. "Hfetch: Hierarchical data prefetching for scientific workflows in multi-tiered storage environments." IPDPS, 2020.
  16. Qin, Yubo, et al. "Leveraging user access patterns & advanced cyberinfrastructure to accelerate data delivery from shared-use scientific observatories." FGCS, 2021.

57 of 64

57

Outline

  • Introduction and Motivation
  • Research Summary
  • Research Contributions
    • Optimized Multi-Level Flushing
    • Checkpoint Load Balancing
    • Efficient Cache Initialization
  • Future Research Plans
  • Related Works

▶ Publications and Submissions

  • Tentative Schedule for Completion

58 of 64

Publications and Submissions

58

  • Peer-reviewed papers
    • Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Amr M. Elsayed, Thierry Tonellot, and Franck Cappello. "Towards Efficient Cache Allocation for High-Frequency Checkpointing," in Proceedings of the 29th IEEE International Conference on High Performance Computing, Data, and Analytics, 2022 (HiPC’22)

    • Avinash Maurya, Jaiaid Mobin, M. Mustafa Rafique. “Towards Data Gravity and Compliance Aware Distributed Deep Learning on Hybrid Clouds,” in Proceedings of the 29th IEEE International Conference on High Performance Computing, Data, and Analytics Workshop, 2022 (HiPCW’22)

    • Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Thierry Tonellot, and Franck Cappello. "Towards Efficient I/O Scheduling for Collaborative Multi-Level Checkpointing," in Proceedings of the 29th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, Virtual, Portugal, 2021 (MASCOTS’21).
    • Avinash Maurya, Bogdan Nicolae, Ishan Guliani, M. Mustafa Rafique, "CoSim: A Simulator for Co-Scheduling of Batch and On-Demand Jobs in HPC Datacenters." in Proceedings of the 24th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications, Prague, Czech Republic, 2020 (DS-RT’20).
  • Under review
    • Avinash Maurya, Bogdan Nicolae, M. Mustafa Rafique, Thierry Tonellot, Franck Cappello, and Hussain J. AlSalem. "GPU-Enabled Asynchronous Multi-level Checkpoint Caching and Prefetching."

59 of 64

Intermediate Data Access Pattern for Different Checkpointing Scenarios

59

  • Defensive (High Writes, Low Reads)
    • Fault-tolerance based on checkpoint-restart
    • Frequent checkpointing (writes), few restarts (read back)

  • Productive (High Writes, High Reads)
    • Revisiting previously generated intermediate data: adjoint computations that produce data in the forward pass and consume it in the backward pass
    • Online analytics for steering experiments: recalibrating workload sub-tasks dynamically based on intermediate execution results
    • Reproducibility or post-hoc analysis

  • Administrative (Low Writes, Low Reads)
    • Migration of jobs: migrating workloads for energy savings in the HPC cluster
    • Suspend-resume low-priority jobs to run an on-demand job: running opportunistic on-demand jobs to complement a batch job or experiments

Our research focus

60 of 64

Improvement Idea: Min-cost Max-flow Schedule

60

  • Virtual src and sink with infinite capacity
  • GPUs with remainder data connected to devices having spare capacity
  • Edges contain the max transferable capacity and cost per unit data
  • Backedges used to reverse a previously made suboptimal flow decision

Cannot apply classic algorithms directly:

  • Relaying by the max capacity of the fastest path does NOT leverage parallel data transfers
  • The cost of transfer is NOT the sum of all paths; it is the cost of the slowest transfer

β = 24 GB/s; γ = 12 GB/s

MASCOTS’21

61 of 64

Problem Formulation: Checkpoint Load Balancing

61

  • Each GPU has some free space (Fi) reserved for storing checkpoints on GPU i
  • If ckpt_sizeᵢ > Fᵢ : GPU has a remainder
  • If ckpt_sizeᵢ < Fᵢ : GPU has spare capacity
  • Each checkpoint can be split into multiple regions to checkpoint across various devices
  • Each GPU has dedicated copy engines (i.e., concurrent GPU-to-GPU and GPU-to-host copy)
  • Links between GPUs are heterogeneous (e.g., 24 GB/s single NVLink, 48 GB/s dual NVLink)

Efficiently scheduling transfers such that cache utilization is maximized and I/O wait time is minimized

[Link legend: 48 GB/s dual NVLink, 24 GB/s single NVLink, 12 GB/s GPU-to-host]

MASCOTS’21

62 of 64

Implementation of Cache Initialization and Transfer Mechanism

62

  • Integrated design principles with VELOC, a production ready HPC checkpointing runtime

  • VELOC_init spawns threads for running allocations on device and host cache

  • Concurrent operations supported
    • Incrementally map device cache
    • Incrementally map host cache
    • Transfer from app to device cache
    • Compute checkpoint of next iteration
    • Transfer from device to host cache
    • Transfer from host cache to SSD
    • Transfer from SSD to PFS (optional)

HiPC’22

63 of 64

Problem Formulation: Efficient Cache Initialization

63

  • HPC node consists of N GPUs, each dedicated to a single short-lived process

  • Every process reserves a fraction of GPU HBM and main memory as cache

  • Checkpoints cannot be saved until the cache is initialized, which leads to application overhead

  • Checkpoints are stored on faster cache tiers, and evicted when required in FIFO order

  • Goal is to minimize cache initialization subject to two constraints:
    • Checkpoints are not evicted until cache is fully initialized and filled to avoid cache underutilization
    • Host cache cannot be pinned in parts

HiPC’22

64 of 64

Limitations of Existing Cache Initialization Schemes

64

  • Memory allocators (e.g., malloc) only reserve virtual address space, which is not mapped to physical memory

  • When checkpoints first access the cache, the kernel triggers page faults, which force mapping to physical memory

  • Mapping memory upfront leads to a huge initialization overhead for the application

  • Mapped but unpinned memory does not allow peak GPU-to-main-memory transfer rates

  • Pinning the host cache allows peak transfer rates using Direct Memory Access (DMA)

  • Reserving (mapping + pinning) the host cache is slow, but then transfers proceed at peak rates (illustrated in the sketch below)

HiPC’22
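
A small illustrative CUDA C++ micro-benchmark sketch of these costs (buffer size and timing granularity are arbitrary): malloc reserves only virtual address space, touching each page forces physical mapping, and cudaHostRegister pins the pages so the subsequent device-to-host copy can use DMA.

```cpp
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <unistd.h>

int main() {
    const size_t bytes = size_t(1) << 30;                  // 1 GiB host cache
    char *host = static_cast<char *>(std::malloc(bytes));  // virtual reservation only
    void *dev = nullptr;
    cudaMalloc(&dev, bytes);

    auto t0 = std::chrono::steady_clock::now();
    const size_t page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    for (size_t off = 0; off < bytes; off += page) host[off] = 0;   // page faults -> physical pages
    auto t1 = std::chrono::steady_clock::now();
    cudaHostRegister(host, bytes, cudaHostRegisterDefault);         // pin for DMA transfers
    auto t2 = std::chrono::steady_clock::now();
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);           // now runs at peak D2H rate
    auto t3 = std::chrono::steady_clock::now();

    auto ms = [](std::chrono::steady_clock::time_point a, std::chrono::steady_clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("touch: %lld ms, pin: %lld ms, D2H copy: %lld ms\n",
                static_cast<long long>(ms(t0, t1)),
                static_cast<long long>(ms(t1, t2)),
                static_cast<long long>(ms(t2, t3)));
    cudaHostUnregister(host);
    cudaFree(dev);
    std::free(host);
    return 0;
}
```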