Node Health Check

Jenett Tillotson

Senior HPC Systems Engineer

National Center for Atmospheric Research

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

February 2024

What Should NHC Do?

Prevent jobs from running on unhealthy nodes
Works with the resource manager

Queries jobs and resources (queues, nodes, etc.)
Submits requests (requeue jobs, offline nodes)

Lightweight

Uses few resources
Takes advantage of timeouts

Be fast
Be flexible
Be extensible
Be reusable and portable


The Need for an NHC Standard

Avoid losing work or causing churn due to broken nodes
Increase throughput
Flag broken nodes so admins can fix them

Hardware fails
Configs get misapplied

Could use custom home-grown scripts

Site-specific
Lack portability
No consideration for extensibility, reusability, etc.

Need a simple, robust framework that is easy to understand and apply
Stop admins from re-inventing the wheel


LBNL NHC

Developed by LBNL
Uses bash
Watchdog timers
Anything that can be described in a shell script can be checked
Modular
Config file easily controlled through config management
Intended to be run by the resource manager, manually, or periodically through cron

Consider the ramifications of any of these options

Not running NHC at job start will result in some jobs starting on unhealthy nodes
Running at job start could cause issues with shared nodes

Could run multiple copies - as many as one per core - at once

Lots of built-in checks to start with
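
Because every check is just a bash function, a site-specific check can be sketched as below. This is an illustration, not one of the built-in checks: the name check_site_dir_exists is hypothetical, and the die stub only stands in for the helper NHC provides so that the snippet runs on its own. Under NHC, such a function would typically live in a .nhc file next to the other check scripts (assumed here to be /etc/nhc/scripts/).

```shell
#!/bin/bash
# Stand-in for NHC's die() so this sketch runs outside NHC.
# Assumption: under NHC the real die() reports the failure; this stub only prints.
if ! declare -F die >/dev/null; then
    die() {
        local rc=$1
        shift
        echo "ERROR: $*" >&2
        return "$rc"
    }
fi

# Hypothetical site check: pass only if the given directory exists.
check_site_dir_exists() {
    local dir=$1
    if [[ ! -d "$dir" ]]; then
        die 1 "${FUNCNAME[0]}: $dir is not a directory"
        return 1
    fi
    return 0
}
```

In nhc.conf it could then be referenced like any other check, for example: * || check_site_dir_exists /glade/u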


Installation

Download NHC

http://warewulf.lbl.gov/downloads/releases/

Install RPM (or build and install from tarball)

warewulf-nhc

Edit configuration file (default: /etc/nhc/nhc.conf)
Configure launch mechanism:

crond – Consider using sample script nhc.cron
TORQUE – $node_check_script & $node_check_interval
SLURM – HealthCheckProgram & HealthCheckInterval
SGE – Load sensor: load_sensor & load_thresholds
IBM Platform LSF – lsb.queues: PRE_EXEC & POST_EXEC
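
When NHC is launched from crond, any output is usually mailed to the job owner, so a wrapper that stays silent on success keeps the noise down. The sketch below is not the shipped nhc.cron script. The function name run_quiet is hypothetical and the nhc path in the comment is an assumption.

```shell
#!/bin/bash
# Run a command; print nothing on success, print its captured output on failure.
run_quiet() {
    local out rc
    out=$("$@" 2>&1)
    rc=$?
    if (( rc != 0 )); then
        printf '%s\n' "$out"
    fi
    return "$rc"
}

# Example cron usage (path is an assumption; adjust to the local install):
#   run_quiet /usr/sbin/nhc
```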


Example /etc/nhc/nhc.conf

# NHC Configuration file

# https://github.com/mej/nhc

## Hardware checks

crhtc* || check_hw_cpuinfo 2 36 72

{casper[01-31],casper36} || check_hw_cpuinfo 2 36 72

casper[32-35] || check_hw_cpuinfo 2 32 64

## Hardware memory checks

crhtc[01-62] || check_hw_physmem 385412 385412 5%

crhtc[63-64] || check_hw_physmem 1546555 1546555 5%
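
Each line above has the form "target || check", where the target selects the hosts the check applies to. To see which hostnames a bracketed range such as casper[32-35] covers, a small expansion helper can be sketched as follows. This is not NHC's own matcher: expand_range is a hypothetical name, and it handles only a single numeric range, not the comma-separated lists inside braces.

```shell
#!/bin/bash
# Expand one zero-padded numeric range, e.g. casper[32-35], into hostnames.
expand_range() {
    local expr=$1
    local re='^([^][]*)\[([0-9]+)-([0-9]+)\](.*)$'
    if [[ $expr =~ $re ]]; then
        local prefix=${BASH_REMATCH[1]} lo=${BASH_REMATCH[2]}
        local hi=${BASH_REMATCH[3]} suffix=${BASH_REMATCH[4]}
        local width=${#lo} i
        # 10# forces base 10 so that values like 08 are not read as octal.
        for (( i = 10#$lo; i <= 10#$hi; i++ )); do
            printf '%s%0*d%s\n' "$prefix" "$width" "$i" "$suffix"
        done
    else
        # No range present: the expression is a single hostname.
        printf '%s\n' "$expr"
    fi
}

expand_range 'casper[32-35]'
```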


Example nhc.conf (cont.)

## Check for any errors in dmesg output

* || ncar_check_dmesg

## Ensure ntp is synchronized

* || check_cmd_output -t 60 -m 'System clock synchronized: yes' -e 'timedatectl'
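
The clock check above runs a command under a timeout and looks for a string in its output. A rough standalone equivalent is sketched below, assuming the coreutils timeout command is available. The function name cmd_output_contains is hypothetical, and this simplified version treats any non-zero exit status (including a timeout) as a failure, which may differ from how check_cmd_output behaves.

```shell
#!/bin/bash
# Succeed only if the command finishes within SECS seconds, exits 0,
# and its output contains the string WANT.
cmd_output_contains() {
    local secs=$1 want=$2
    shift 2
    local out
    out=$(timeout "$secs" "$@" 2>&1) || return 1
    [[ $out == *"$want"* ]]
}

# Example:
#   cmd_output_contains 60 'System clock synchronized: yes' timedatectl
```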

## Make sure we're running an expected BIOS version...

crhtc* || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"

casper15 || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"

{casper[08-09],casper24,casper[26-31],casper33,casper35} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"

{casper[01-07],casper[10-14],casper[16-19],casper[21-23],casper25} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"


Example nhc.conf (cont.)

# Make sure our RAM is running at the correct bus rate.

crhtc* || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"

{casper[01-14],casper[15-19],casper[21-28]} || check_dmi_data_match -t "Memory Device" "*Speed: 2666 MT/s"

casper[29-36] || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"

## Check for nvidia GPUs

{casper15,casper[06-07],casper14,casper[16-17],casper[22-23]} || check_nvidia 1

{casper08,casper24,casper[27-28],casper[30-31]} || check_nvidia 8

{casper09,casper36,casper29,casper25} || check_nvidia 4


Example nhc.conf (cont.)

## Ensure certain files are not in place

* || ncar_check_nolocal

* || ncar_check_nologin

## Filesystem checks

* || check_file_test -S /var/mmfs/mmpmon/mmpmonSocket

* || ncar_check_gpfs 5.0.5.3

# All nodes should have their root filesystem mounted read/write.

* || check_fs_mount_rw -f /

* || check_fs_mount_rw -f /glade/u

* || check_fs_mount_rw -f /glade/scratch
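
The idea behind a read/write mount check can be sketched by scanning the mount table for the mount point and looking for the rw option. This is a simplified illustration, not the check_fs_mount_rw implementation. The function name is_mounted_rw is hypothetical, and the optional second argument exists only so the sketch can be tried against a sample file instead of /proc/mounts.

```shell
#!/bin/bash
# Succeed if TARGET appears in the mount table with the "rw" option.
is_mounted_rw() {
    local target=$1 mounts=${2:-/proc/mounts}
    local dev mnt fstype opts rest
    while read -r dev mnt fstype opts rest; do
        if [[ $mnt == "$target" ]]; then
            # Wrap in commas so "rw" matches only as a whole option.
            [[ ",$opts," == *,rw,* ]] && return 0
        fi
    done < "$mounts"
    return 1
}

# Example:
#   is_mounted_rw /glade/scratch || echo "/glade/scratch is not mounted read/write"
```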


Example nhc.conf (cont.)

## check swap, but allow some fudge factor in size of swap

crhtc* || check_hw_swap 1562800000 1562813780

casper* || check_hw_swap 1520000000 5720000000

# Make sure the root filesystem doesn't get too full.

* || check_fs_free / 10%

# Free inodes are also important.

* || check_fs_ifree / 1k
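
For intuition, the free-space threshold on / can be approximated by hand from df output. The sketch below is not how check_fs_free is implemented. Both function names are hypothetical, and because df rounds its Capacity column, the result is only approximate.

```shell
#!/bin/bash
# Read "df -P" output on stdin and print the free percentage of the first
# filesystem listed (100 minus the Capacity column).
free_pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeed if mount point MNT has at least MIN percent free.
has_min_free_pct() {
    local mnt=$1 min=$2 free
    free=$(df -P "$mnt" | free_pct_from_df) || return 1
    (( free >= min ))
}

# Example:
#   has_min_free_pct / 10 || echo "root filesystem has less than 10% free"
```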

## Check the mcelog daemon for any pending errors.

* || check_hw_mcelog


NHC Configuration

/etc/sysconfig/nhc

NHC_RM=pbs

TIMEOUT=270

PBS_SERVER_HOME=/var/spool/pbs

OFFLINE_NODE=/etc/nhc/scripts/offline-node.sh

ONLINE_NODE=/etc/nhc/scripts/online-node.sh
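
The OFFLINE_NODE and ONLINE_NODE settings point at site-provided helper scripts whose contents are not shown here. A minimal sketch of what an offline helper for PBS might do follows. It assumes the helper is invoked with the hostname and a note as its two arguments, and that the local pbsnodes supports -o (mark offline) and -C (set comment). The function name and the PBSNODES override variable are hypothetical.

```shell
#!/bin/bash
# Mark HOST offline in PBS and record NOTE as the node comment.
# PBSNODES can be overridden (e.g. PBSNODES=echo) for a dry run.
offline_node() {
    local host=$1 note=$2
    "${PBSNODES:-pbsnodes}" -o -C "NHC: $note" "$host"
}

# Example dry run:
#   PBSNODES=echo offline_node casper15 'check_hw_mcelog failed'
```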


Other Checks

Verify that the rpcbind service is alive

check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status

Search for HTTP daemon IPv4 listening socket and restart if missing:

check_net_socket -n "HTTP daemon" -p tcp -s LISTEN -l '0.0.0.0:80' -d httpd -e 'service httpd start'
