1 of 16

CEEMS: A Resource Manager Agnostic Energy & Emissions Monitoring Stack

Mahendra Paipuri

Research Engineer

CNRS, France

11/17/2024


2 of 16

CNRS in numbers

  • Fundamental sciences
  • 33 000 employees of which 28 000 are researchers
  • 1 100 research laboratories
  • 4 billion euros of annual budget

IDRIS - National HPC Center of CNRS

  • Jean Zay: HPE/ATOS, 123.6 PFLOPS
  • CPU Partition: Intel Skylake, 28 800 cores
  • V100 Partition: Intel Skylake, 1 832 GPUs
  • A100 Partition: AMD Milan, 418 GPUs
  • H100 Partition: Intel Sapphire Rapids, 1 542 GPUs
  • Infiniband and Omni-Path interconnect
  • Lustre file system


3 of 16

Context

Compute Energy & Emissions Monitoring Stack (CEEMS)

  • AI data centers (DC) are projected to consume ~100 TWh by 2026
  • 40% of data center consumption is due to servers
  • Efficient semiconductors and cooling techniques alone are not enough
  • “Practical” solution is to engage end users to optimize their workflows
  • Need to provide relevant metrics and tools to encourage optimization

IEA 2024


4 of 16

CEEMS

Control Groups (cgroups) provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.
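
As a rough illustration of why cgroups are a convenient unit for per-job accounting, the sketch below reads the cumulative CPU time of one cgroup from its cgroup v2 `cpu.stat` file. It is not CEEMS code (CEEMS is written in Go); the function names are hypothetical, and the example path assumes a SLURM setup on cgroup v2, which may differ on a given cluster.

```python
from pathlib import Path


def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse the flat 'key value' lines of a cgroup v2 cpu.stat file."""
    stats: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_cgroup_cpu_seconds(cgroup_dir: str) -> float:
    """Return the cumulative CPU time (seconds) used by all tasks in a cgroup.

    cgroup_dir is an assumption, e.g. something like
    /sys/fs/cgroup/system.slice/slurmstepd.scope/job_<id> on a SLURM node.
    """
    text = (Path(cgroup_dir) / "cpu.stat").read_text()
    return parse_cpu_stat(text)["usage_usec"] / 1e6


# Example content of a cpu.stat file (values are made up).
sample = "usage_usec 12500000\nuser_usec 10000000\nsystem_usec 2500000\n"
print(parse_cpu_stat(sample)["usage_usec"] / 1e6)
```

Because the kernel accounts for every task in the group and all their future children, a monitoring agent only needs to read one file per job rather than track individual processes.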


5 of 16

CEEMS Architecture


6 of 16

Technical details

  • 100% Go
  • CEEMS apps are capability-aware (Linux capabilities)
  • Configurable energy estimation
  • Leverages eBPF for IO and Network metrics
  • Supported Metrics:
    • CPU and GPU utilization and memory utilization
    • CPU and GPU energy usage and equivalent emissions
    • CPU Hardware/Software/Cache perf metrics
    • IO (Read/Write bytes, bandwidth, requests, errors)
    • Network (TCP/UDP, IPv4/IPv6 bytes, bandwidth, requests, errors)
    • Selected RDMA stats (QPs, CQe, MRs, requests)
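
The slides do not spell out how energy is estimated, only that the rule is configurable. As one plausible rule, the sketch below attributes a share of measured node power to a job in proportion to its CPU time and converts the result to equivalent emissions. The function names, the proportional rule, and the emission factor are all assumptions for illustration, not the CEEMS implementation.

```python
def job_energy_joules(node_power_watts: float,
                      job_cpu_seconds: float,
                      node_cpu_seconds: float,
                      interval_seconds: float) -> float:
    """Attribute node energy over one interval to a job by its CPU time share."""
    if node_cpu_seconds <= 0:
        return 0.0  # idle node: nothing to attribute under this rule
    share = job_cpu_seconds / node_cpu_seconds
    return node_power_watts * interval_seconds * share


def emissions_grams(energy_joules: float,
                    emission_factor_g_per_kwh: float) -> float:
    """Convert energy to equivalent emissions (1 kWh = 3.6e6 J)."""
    return energy_joules / 3.6e6 * emission_factor_g_per_kwh


# Hypothetical numbers: a 400 W node over a 10 s interval, where the job
# used 20 of the 80 CPU-seconds consumed on the node.
energy = job_energy_joules(400.0, 20.0, 80.0, 10.0)
print(energy, emissions_grams(energy, 50.0))
```

Other rules (for example, weighting by memory or adding GPU power separately) would slot into the same structure, which is the point of making the estimation configurable.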


7 of 16

User’s Perspective

Job CPU Stats


8 of 16

User’s Perspective

Job CPU Perf Stats


9 of 16

User’s Perspective

Job GPU Stats


10 of 16

User’s Perspective

Job GPU Perf Stats


11 of 16

Operator’s Perspective


12 of 16

Profiling

CEEMS supports Grafana Alloy and Pyroscope

  • Deterministic profiling: Record call stack & memory stats, investigate and iterate
  • Limitations of deterministic profiling:
    • Overhead
    • Hard to recreate problematic scenarios
  • Continuous profiling: Statistical profiling based on sampling the call stack
    • eBPF based
    • No instrumentation needed
    • Very low overhead
    • “Always on” in production
  • Grafana, Splunk, Datadog, Amazon, and Polar Signals offer open source profilers
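
To make the sampling idea behind continuous profiling concrete, the sketch below aggregates a handful of sampled call stacks into the semicolon-separated "folded" form commonly fed to flame graph tools, and estimates where time is spent from the sample counts. The stacks are made-up data and the function names are hypothetical; a real profiler such as the eBPF-based ones mentioned above collects the samples in the kernel.

```python
from collections import Counter


def fold_stacks(samples: list[list[str]]) -> Counter:
    """Count identical stacks; each stack is a list of frames, root first."""
    return Counter(";".join(stack) for stack in samples)


def self_time_share(samples: list[list[str]]) -> dict[str, float]:
    """Estimate the fraction of time spent in each leaf function."""
    leaf_counts = Counter(stack[-1] for stack in samples if stack)
    total = sum(leaf_counts.values())
    return {fn: n / total for fn, n in leaf_counts.items()}


# Four hypothetical samples taken at a fixed rate.
samples = [
    ["main", "solve", "dot"],
    ["main", "solve", "dot"],
    ["main", "io"],
    ["main", "solve"],
]

for stack, count in fold_stacks(samples).items():
    print(stack, count)
print(self_time_share(samples))
```

The estimate is statistical: accuracy improves with the number of samples, which is why a low, fixed sampling rate left running all the time can replace a one-off instrumented run.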


13 of 16

Continuous Profiling of SLURM jobs


14 of 16

Continuous Profiling of SLURM jobs


15 of 16

Final Remarks

  • A very low overhead monitoring solution
  • Running on Jean Zay for more than 6 months with a scrape interval of 10 s
  • Currently testing eBPF-based metrics and continuous profiling on Jean Zay
  • Planned: Kubernetes (k8s) support


16 of 16

GitHub Repository
