
Soft Cluster powercap at SuperMUC-NG with EAR

24/10/2022

EE HPC SOP Workshop

Lluís Alonso

lluis.alonso@bsc.es


Soft powercap

  • Introduction: Why is powercap necessary?
  • EAR overview
  • EAR extensions to support powercap
  • Node powercap
  • Soft cluster powercap
  • Powercap evaluation with synthetic workloads
  • SuperMUC-NG experiments
  • Related work

EE HPC SOP Workshop, October 2022


Introduction

  • Power management has become an important topic for HPC centers.
    • Hardware constraints
    • Resource efficiency
    • Cost constraints


EAR overview

  • EAR is system software for energy management, created in collaboration between BSC and Lenovo.
  • EAR offers energy monitoring, accounting, control and optimisation.
  • Four main components:
    • Node power manager (EARD)
    • Database manager (EARDBD) [Not involved]
    • Optimisation library (EARL)
    • Cluster power manager (EARGM)


Powercap extensions

  • The primary goal is to prevent power consumption from exceeding the cap.
    • The secondary goal is to maximise power utilisation under that cap (power balance).
  • Hierarchical approach with 4 levels:
    • Global cluster powercap is controlled by meta-EARGM.
    • Sub-cluster powercap (islands) is controlled by an EARGM.
    • Node powercap is controlled by the EARD.
    • Hardware domains (CPU/GPU) are controlled by specific plugins loaded by EARD.


Powercap extensions

  • Each level implements:
    • Powercap control: guarantees that the layer does not exceed its power allocation.
    • Powercap status: evaluates current power consumption and sends it to the layer above with hints of its power needs.
    • An API to be contacted by upper layers.
    • Powercap balance: redistributes power between the domains it controls.
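
The per-level contract above can be sketched as a small interface. This is an illustrative sketch only: the class and method names (PowercapLevel, control, status, set_allocation, balance) are assumptions for clarity, not EAR's actual API.

```python
from abc import ABC, abstractmethod

class PowercapLevel(ABC):
    """One level of the powercap hierarchy (meta-EARGM, EARGM, EARD, plugin)."""

    @abstractmethod
    def control(self):
        """Powercap control: keep this layer within its power allocation."""

    @abstractmethod
    def status(self):
        """Powercap status: report consumption and power needs to the layer above."""

    @abstractmethod
    def set_allocation(self, watts):
        """API entry point used by the upper layer to change this layer's cap."""

    @abstractmethod
    def balance(self):
        """Powercap balance: redistribute power among the domains it controls."""
```

Each concrete level (cluster, island, node, hardware domain) would then implement these four operations against its own power source and actuators.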


Node powercap control

  • Requirement: never exceed a given DC node powercap.
  • A software approach is needed, since no hardware mechanism both controls full-node power usage (including GPUs) and offers power balance.
  • Limited frequency of power/energy readings. Power controlled in two stages:
    • Low-level domain (CPU or GPU) control: high-frequency validation (~500 ms) of domain consumption, changing its settings to meet requirements.
    • Full-node control: measures entire-node power at a reduced frequency and dynamically adapts the power allocated to each sub-domain.
  • The domain managers (plugins) report their status periodically to the node manager (EARD). If a domain cannot meet its requested settings (e.g. the requested frequency), it also reports its level of stress, i.e. how far it currently is from the target.
  • The node manager gets the domains’ statuses and decides on the possible actions:
    • Redistribute power between domains so that both are under the same level of stress.
    • Request additional power to the global manager if settings are not being met.
    • If requested settings are being met, it marks a percentage of the excess power as potential to be released.
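
The node manager's decision step could be sketched as follows. DomainStatus, decide, and the stress/step model are hypothetical illustrations of the three actions listed above, not EAR code.

```python
from dataclasses import dataclass

@dataclass
class DomainStatus:
    allocation_w: float   # power currently allocated to the domain (W)
    measured_w: float     # power actually drawn (W)
    stress: float         # 0.0 = requested settings met; >0 = distance to target

def decide(domains, step_w=10.0, release_frac=0.5):
    """One decision round of the node manager over its domain statuses.

    Returns a (action, watts) pair and, for a rebalance, mutates the
    per-domain allocations in place."""
    stressed = [d for d in domains if d.stress > 0]
    if stressed and len(stressed) < len(domains):
        # Some domains are short of power while others are not:
        # shift power from the least- to the most-stressed domain.
        src = min(domains, key=lambda d: d.stress)
        dst = max(domains, key=lambda d: d.stress)
        src.allocation_w -= step_w
        dst.allocation_w += step_w
        return ("rebalance", step_w)
    if stressed:
        # Every domain is short of power: request more from the global manager.
        return ("request", step_w * len(domains))
    # All settings are met: mark a fraction of the unused headroom as releasable.
    headroom = sum(d.allocation_w - d.measured_w for d in domains)
    return ("release", release_frac * headroom)
```

Driving both domains toward equal stress is what the slide calls power balance; the request/release pair is how the node hints its power needs upward.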


Node powercap control

  • Power consumption enforcement:
    • Dynamic computation of the ratio between node power and (CPU+DRAM)+GPU power
    • Power limit assigned per domain
    • Two frequencies of power validation
      • Short term (if needed), based on hardware readings every ~500ms
      • Medium term, based on IPMI/DCMI power every 10s
  • Dynamic power balance.
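
A minimal sketch of the enforcement scheme above, under two stated assumptions: the node cap is split between domains in proportion to their measured draw (the dynamically computed ratio), and the short-term check steps the domain frequency down when the domain is over its share. The names split_node_cap and short_term_check are invented for illustration.

```python
def split_node_cap(node_cap_w, cpu_w, gpu_w):
    """Divide the node cap between domains in proportion to their
    current measured draw (the node / (CPU+DRAM)+GPU ratio)."""
    total = cpu_w + gpu_w
    return node_cap_w * cpu_w / total, node_cap_w * gpu_w / total

def short_term_check(domain_cap_w, measured_w, freq_khz, step_khz=100_000):
    """~500 ms validation: step the domain frequency down when the
    domain exceeds its share of the cap, otherwise keep its settings.
    (The slower 10 s IPMI/DCMI reading would re-run split_node_cap.)"""
    if measured_w > domain_cap_w:
        return freq_khz - step_khz   # over budget: throttle one step
    return freq_khz                  # within budget: keep current frequency
```

The medium-term (10 s) node reading then feeds back into split_node_cap, which is what makes the per-domain limits dynamic rather than fixed.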


Soft cluster powercap (at LRZ)

  • Cost constraints: peak power usage is penalised.
  • Goal: detect excessive power consumption and bring it below limit.
  • Node power by default is unlimited.
  • Configuration: power limit, activation threshold and action, deactivation threshold and action.
  • Soft cluster powercap algorithm:
    • EARGM periodically aggregates the power of the nodes under its control.
    • If the total power approaches the limit set for the cluster, it sets a power limit to all the computational nodes.
    • If there is a powercap currently in action and the total power goes below a set threshold, the limitation is lifted.
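
The algorithm above reduces to a single decision per monitoring period. The sketch below is illustrative (soft_powercap_step and its parameters are invented names), using the 90%/80% activation/deactivation thresholds from the SuperMUC-NG evaluation as defaults.

```python
def soft_powercap_step(node_powers_w, limit_w, act=0.90, deact=0.80,
                       capped=False):
    """One EARGM monitoring period: aggregate node power and decide
    whether nodes should be capped. Returns the new capped state."""
    total = sum(node_powers_w)
    if not capped and total >= act * limit_w:
        return True    # total power approaches the limit: cap all nodes
    if capped and total <= deact * limit_w:
        return False   # consumption fell below the threshold: lift the cap
    return capped      # otherwise keep the current state (hysteresis)
```

The gap between the activation and deactivation thresholds provides hysteresis, so the cap is not repeatedly set and lifted when total power hovers near the limit.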


Evaluation of node powercap

  • Tested on a node with 2 x Intel Xeon Gold 6126 (12 cores each, 125 W TDP)

  Powercap range (W)    Kernel                    CPU threads
  ------------------    ----------------------    -----------
  300-200               BT-MZ.C.x (CPU bound)     24
  300-200               DGEMM (AVX 512)           24
  350-250               STREAM (memory bound)     24


Evaluation 1: BT-MZ.C.x

  • CPU bound application.
  • Each powercap change is a new kernel execution


Evaluation 2: DGEMM

  • AVX 512 application
  • Each powercap change is a new kernel execution


Evaluation 3: STREAM

  • Memory bound application
  • Each powercap change is a new kernel execution


Evaluation: SuperMUC-NG

  • Experiment done in one island (792 nodes)
  • Each node with 2 x Intel Skylake Xeon Platinum 8174 (48 cores in total), TDP 240 W per socket.
  • Power validation using power distribution units (PDU) measurements (AC power).
  • Powercap for the island set to 285 kW; powercap activation threshold at 90% and deactivation threshold at 80% of the limit.
  • Cluster power monitoring period set to 2 minutes.


Evaluation: SuperMUC-NG

  • Same application running on 792 nodes at once (38016 cores)
  • Multiple jobs, same application

NPB-BT running on all nodes

Wavesim running on all nodes


Related work

  • Node powercap
    • Powercap algorithms:
      • Machine learning approaches selecting the best settings to minimise power usage.
      • Reactive approaches modifying the settings as the application runs.
    • Hardware tools (CRAY, RAPL, Intel Node Manager, Nvidia SMI)
  • Cluster powercap
    • SLURM
    • PBSPro


Conclusions and current/future work

  • The implemented system meets the requirements of SuperMUC-NG.
  • As an extension of the soft powercap: adding other sources of power consumption (beyond compute nodes).
  • Job-level powercap.
  • Evaluation of cluster power reallocation under a hard powercap.


Questions
