
Node Health Check

Jenett Tillotson

Senior HPC Systems Engineer

National Center for Atmospheric Research

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

February 2024


What Should NHC Do?

  • Prevent jobs from running on unhealthy nodes
  • Work with the resource manager
    • Query jobs and resources (queues, nodes, etc.)
    • Submit requests (requeue jobs, offline nodes)
  • Be lightweight
    • Use few resources
    • Take advantage of timeouts
  • Be fast
  • Be flexible
  • Be extensible
  • Be reusable and portable
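
The point about timeouts can be sketched in shell. This is a minimal illustration, not NHC's own mechanism: it assumes the coreutils `timeout` utility is installed, and `run_with_timeout` is a hypothetical helper name.

```shell
# Run a command under a time limit so that one hung check cannot stall
# the whole health check. Assumes coreutils `timeout` is available.
# run_with_timeout is a hypothetical helper, not part of NHC.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    # coreutils timeout exits with status 124 when the limit is reached
    if [ "$status" -eq 124 ]; then
        echo "check timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

For example, `run_with_timeout 5 df /glade/scratch` would give up after five seconds if the filesystem hangs, and the caller can treat the non-zero status as a failed check.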


The Need for an NHC Standard

  • Avoid losing work or causing churn due to broken nodes
  • Increase throughput
  • Flag broken nodes so admins can fix them
    • Hardware fails
    • Configs get misapplied
  • Could use custom home-grown scripts
    • Site-specific
    • Lack portability
    • No consideration for extensibility, reusability, etc.
  • Need a simple, robust framework that is easy to understand and apply
  • Stop admins from re-inventing the wheel


LBNL NHC

  • Developed at Lawrence Berkeley National Laboratory (LBNL)
  • Written in bash
  • Uses a watchdog timer to guard against hung checks
  • Anything that can be described in a shell script can be checked
  • Modular
  • Config file easily controlled through config management
  • Intended to be run by the resource manager, manually, or periodically through cron
    • Consider the ramifications of any of these options
      • Not running NHC at job start will result in some jobs starting on unhealthy nodes
      • Running at job start could cause issues with shared nodes
        • Could run multiple copies at once, as many as one per core
  • Lots of built-in checks to start with
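
Because a check is just shell code, a site-specific check can be a short function. The sketch below assumes NHC's convention that a check calls `die` and returns non-zero on failure. `check_example_file_absent` is a hypothetical name, and the stand-in `die` exists only so the sketch can be tried outside NHC. NHC is expected to pick up such functions from `.nhc` files in its scripts directory.

```shell
# Stand-in for NHC's die(), defined only when running outside NHC so the
# sketch can be tried on its own. Inside NHC the real die() is used.
if ! command -v die >/dev/null 2>&1; then
    die() {
        shift                      # drop the exit-code argument
        echo "ERROR: $*" >&2
        return 1
    }
fi

# Hypothetical check: fail if a file that should not exist is present.
# Possible nhc.conf usage (illustrative only):
#   * || check_example_file_absent /etc/nologin
check_example_file_absent() {
    path=$1
    if [ -e "$path" ]; then
        die 1 "check_example_file_absent: $path exists"
        return 1
    fi
    return 0
}
```

Keeping each check in its own small function is what makes the framework modular: a check can be tested by hand and then enabled per node group from the configuration file.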


Installation

  • Download NHC
  • Install RPM (or build and install from tarball)
    • warewulf-nhc
  • Edit configuration file (default: /etc/nhc/nhc.conf)
  • Configure launch mechanism:
    • crond – Consider using sample script nhc.cron
    • TORQUE – $node_check_script & $node_check_interval
    • SLURM – HealthCheckProgram & HealthCheckInterval
    • SGE – Load sensor: load_sensor & load_thresholds
    • IBM Platform LSF – lsb.queues: PRE_EXEC & POST_EXEC
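
For the Slurm and TORQUE entries above, the settings might look like the fragment below. The `/usr/sbin/nhc` path and the interval values are assumptions for illustration. Check the installed path and pick intervals that suit the site.

```
# slurm.conf (Slurm); the interval is in seconds
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300

# mom_priv/config (TORQUE); the interval is counted in MOM polling cycles
$node_check_script /usr/sbin/nhc
$node_check_interval 5,jobstart,jobend
$down_on_error 1
```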


Example /etc/nhc/nhc.conf

# NHC Configuration file

# https://github.com/mej/nhc

## Hardware checks

crhtc* || check_hw_cpuinfo 2 36 72

{casper[01-31],casper36} || check_hw_cpuinfo 2 36 72

casper[32-35] || check_hw_cpuinfo 2 32 64

## Hardware memory checks

crhtc[01-62] || check_hw_physmem 385412 385412 5%

crhtc[63-64] || check_hw_physmem 1546555 1546555 5%
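
The three numbers passed to check_hw_cpuinfo are the expected socket, core, and thread counts (for example, 2 sockets, 36 cores, 72 threads). To find the values for a new node type, they can be counted from /proc/cpuinfo. The function below is a rough sketch of that counting, not NHC's own code, and `count_cpuinfo` is a hypothetical name.

```shell
# Print "sockets cores threads" for a cpuinfo-format file
# (default: /proc/cpuinfo). Hypothetical helper, not part of NHC.
count_cpuinfo() {
    awk -F': *' '
        # every "processor" entry is one hardware thread
        /^processor[ \t]*:/   { threads++ }
        # each distinct "physical id" is one socket
        /^physical id[ \t]*:/ { phys = $2
                                if (!(phys in socks)) { socks[phys] = 1; nsock++ } }
        # each distinct (socket, core id) pair is one core
        /^core id[ \t]*:/     { key = phys "/" $2
                                if (!(key in cores)) { cores[key] = 1; ncore++ } }
        END { printf "%d %d %d\n", nsock, ncore, threads }
    ' "${1:-/proc/cpuinfo}"
}
```

Running `count_cpuinfo` on a healthy node of each type gives the values to put in nhc.conf. A node that later reports fewer cores or threads fails the check.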


Example nhc.conf (cont.)

## Check for any errors in dmesg output

* || ncar_check_dmesg

## Ensure the system clock is synchronized via NTP

* || check_cmd_output -t 60 -m 'System clock synchronized: yes' -e 'timedatectl'

## Make sure we're running an expected BIOS version...

crhtc* || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"

casper15 || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"

{casper[08-09],casper24,casper[26-31],casper33,casper35} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"

{casper[01-07],casper[10-14],casper[16-19],casper[21-23],casper25} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"


Example nhc.conf (cont.)

# Make sure our RAM is running at the correct bus rate.

crhtc* || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"

{casper[01-14],casper[15-19],casper[21-28]} || check_dmi_data_match -t "Memory Device" "*Speed: 2666 MT/s"

casper[29-36] || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"

## Check for nvidia GPUs

{casper15,casper[06-07],casper14,casper[16-17],casper[22-23]} || check_nvidia 1

{casper08,casper24,casper[27-28],casper[30-31]} || check_nvidia 8

{casper09,casper25,casper29,casper36} || check_nvidia 4


Example nhc.conf (cont.)

## Ensure certain files are not in place

* || ncar_check_nolocal

* || ncar_check_nologin

## Filesystem checks

* || check_file_test -S /var/mmfs/mmpmon/mmpmonSocket

* || ncar_check_gpfs 5.0.5.3

# All nodes should have their root filesystem mounted read/write.

* || check_fs_mount_rw -f /

* || check_fs_mount_rw -f /glade/u

* || check_fs_mount_rw -f /glade/scratch


Example nhc.conf (cont.)

## check swap, but allow some fudge factor in size of swap

crhtc* || check_hw_swap 1562800000 1562813780

casper* || check_hw_swap 1520000000 5720000000

# Make sure the root filesystem doesn't get too full.

* || check_fs_free / 10%

# Free inodes are also important.

* || check_fs_ifree / 1k

## Check the mcelog daemon for any pending errors.

* || check_hw_mcelog


NHC Configuration

  • /etc/sysconfig/nhc

NHC_RM=pbs

TIMEOUT=270

PBS_SERVER_HOME=/var/spool/pbs

OFFLINE_NODE=/etc/nhc/scripts/offline-node.sh

ONLINE_NODE=/etc/nhc/scripts/online-node.sh
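
OFFLINE_NODE and ONLINE_NODE point at site scripts whose contents are not shown here. As a rough idea of what an offline helper for PBS could look like, the sketch below assumes NHC passes the hostname as the first argument and a note as the remaining arguments. `offline_node` and `DRY_RUN` are hypothetical names, and this is not the site's actual script.

```shell
# Hypothetical offline helper; NOT the site script referenced above.
# Assumed invocation: offline_node <hostname> [note words...]
# Set DRY_RUN=1 to print what would be run instead of running it.
offline_node() {
    if [ "$#" -lt 1 ] || [ -z "$1" ]; then
        echo "usage: offline_node <hostname> [note]" >&2
        return 2
    fi
    host=$1
    shift
    note=$*
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: pbsnodes -o $host (note: $note)"
        return 0
    fi
    echo "NHC: marking $host offline: $note" >&2
    # pbsnodes -o marks the node offline so no new jobs are scheduled on it
    pbsnodes -o "$host"
}
```

A dry run such as `DRY_RUN=1 offline_node casper15 check_hw_physmem failed` is a safe way to confirm the arguments before letting the script touch the scheduler.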


Other Checks

  • Verify that the rpcbind service is alive:

check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status

  • Search for HTTP daemon IPv4 listening socket and restart if missing:

check_net_socket -n "HTTP daemon" -p tcp -s LISTEN -l '0.0.0.0:80' -d httpd -e 'service httpd start'


Thanks and questions?
