Node Health Check
Jenett Tillotson
Senior HPC Systems Engineer
National Center for Atmospheric Research
This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).
February 2024
What Should NHC Do?
The Need for an NHC Standard
LBNL NHC
Installation
Example /etc/nhc/nhc.conf
# NHC Configuration file
# https://github.com/mej/nhc
## Hardware checks
crhtc* || check_hw_cpuinfo 2 36 72
{casper[01-31],casper36} || check_hw_cpuinfo 2 36 72
casper[32-35] || check_hw_cpuinfo 2 32 64
## Hardware memory checks
crhtc[01-62] || check_hw_physmem 385412 385412 5%
crhtc[63-64] || check_hw_physmem 1546555 1546555 5%
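The three arguments to check_hw_cpuinfo above are the expected socket, core, and thread counts for the matching nodes. As a rough illustration of where such numbers come from, the sketch below counts them from /proc/cpuinfo-style text. The function name example_cpuinfo_counts is hypothetical, and this is not NHC's own implementation. It assumes each processor stanza lists "physical id" before "core id", as Linux does on x86.

```bash
# Hypothetical helper: reads /proc/cpuinfo-style text on stdin and prints
# "sockets cores threads" on one line.
example_cpuinfo_counts() {
    awk -F': *' '
        /^processor/   { threads++ }                  # one stanza per hardware thread
        /^physical id/ { phys = $2; sockets[phys] = 1 }
        /^core id/     { cores[phys ":" $2] = 1 }     # a core is unique per socket
        END {
            ns = 0; for (s in sockets) ns++
            nc = 0; for (c in cores) nc++
            print ns, nc, threads + 0
        }'
}
```

Running `example_cpuinfo_counts < /proc/cpuinfo` on a node gives the three values to compare against that node's check_hw_cpuinfo line.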
Example nhc.conf (cont.)
## Check for any errors in dmesg output
* || ncar_check_dmesg
## Ensure ntp is synchronized
* || check_cmd_output -t 60 -m 'System clock synchronized: yes' -e 'timedatectl'
## Make sure we're running an expected BIOS version...
crhtc* || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"
casper15 || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"
{casper[08-09],casper24,casper[26-31],casper33,casper35} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.14"
{casper[01-07],casper[10-14],casper[16-19],casper[21-23],casper25} || check_dmi_data_match "BIOS Information: BIOS Revision: 5.12"
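Checks with the ncar_ prefix, such as ncar_check_dmesg above, appear to be site-specific rather than part of the NHC distribution, and the slides do not show their contents. NHC loads additional checks from *.nhc files in /etc/nhc/scripts, and a check reports failure by calling die. The sketch below shows the general shape such a check might take. The names example_scan_kernel_log and example_check_dmesg and the pattern list are hypothetical, not NCAR's actual check.

```bash
# Fallback so the sketch can run outside NHC. Inside NHC the framework
# supplies its own die(), which also handles marking the node offline.
if ! command -v die >/dev/null 2>&1; then
    die() { local rc=$1; shift; echo "ERROR: $*" >&2; return "$rc"; }
fi

# Hypothetical list of kernel messages treated as fatal (extended regex).
EXAMPLE_DMESG_PATTERNS='Machine check|I/O error|Out of memory|Call Trace'

# Reads kernel log text on stdin; fails on the first matching line.
example_scan_kernel_log() {
    local hit
    hit=$(grep -E -m 1 -e "$EXAMPLE_DMESG_PATTERNS") || return 0   # no match: healthy
    die 1 "example_check_dmesg: kernel log matched: $hit"
    return 1
}

# The check as it would be named in nhc.conf.
example_check_dmesg() {
    dmesg | example_scan_kernel_log
}
```

With this saved as a .nhc file, a line such as `* || example_check_dmesg` would run it on every node. Splitting the scan from the dmesg call keeps the matching logic testable with canned input.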
Example nhc.conf (cont.)
# Make sure our RAM is running at the correct bus rate.
crhtc* || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"
{casper[01-14],casper[15-19],casper[21-28]} || check_dmi_data_match -t "Memory Device" "*Speed: 2666 MT/s"
casper[29-36] || check_dmi_data_match -t "Memory Device" "*Speed: 2933 MT/s"
## Check for nvidia GPUs
{casper15,casper[06-07],casper14,casper[16-17],casper[22-23]} || check_nvidia 1
{casper08,casper24,casper[27-28],casper[30-31]} || check_nvidia 8
{casper09,casper36,casper29,casper25} || check_nvidia 4
Example nhc.conf (cont.)
## Ensure certain files are not in place
* || ncar_check_nolocal
* || ncar_check_nologin
## Filesystem checks
* || check_file_test -S /var/mmfs/mmpmon/mmpmonSocket
* || ncar_check_gpfs 5.0.5.3
# All nodes should have their root filesystem mounted read/write.
* || check_fs_mount_rw -f /
* || check_fs_mount_rw -f /glade/u
* || check_fs_mount_rw -f /glade/scratch
Example nhc.conf (cont.)
## check swap, but allow some fudge factor in size of swap
crhtc* || check_hw_swap 1562800000 1562813780
casper* || check_hw_swap 1520000000 5720000000
# Make sure the root filesystem doesn't get too full.
* || check_fs_free / 10%
# Free inodes are also important.
* || check_fs_ifree / 1k
## Check the mcelog daemon for any pending errors.
* || check_hw_mcelog
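The `check_fs_free / 10%` line above requires at least 10% of the root filesystem to be free. To make the threshold concrete, the sketch below derives a free-space percentage from `df -P` output. The function name example_fs_free_pct is hypothetical, and NHC's own calculation may differ in detail (for example, in how reserved blocks are counted).

```bash
# Hypothetical helper: reads `df -P <mount>` output on stdin and prints the
# free percentage, computed as Available / (Used + Available), truncated.
example_fs_free_pct() {
    awk 'NR == 2 {
        total = $3 + $4                      # column 3 = Used, column 4 = Available
        if (total > 0) printf "%d\n", ($4 * 100) / total
    }'
}
```

For example, `df -P / | example_fs_free_pct` prints a single integer that can be compared against the 10% threshold.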
NHC Configuration
NHC_RM=pbs
TIMEOUT=270
PBS_SERVER_HOME=/var/spool/pbs
OFFLINE_NODE=/etc/nhc/scripts/offline-node.sh
ONLINE_NODE=/etc/nhc/scripts/online-node.sh
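OFFLINE_NODE above points at a site script whose contents the slides do not show. The following is a minimal sketch of what such a helper might look like, not the actual NCAR script. It assumes NHC passes the hostname as the first argument and the failure note as the remaining arguments, and that the site's pbsnodes accepts -o (mark offline) and -C (set comment). The PBSNODES_CMD variable is a hypothetical hook for dry runs.

```bash
# Hypothetical offline helper: marks a node offline in PBS and records why.
example_offline_node() {
    local host=$1
    shift
    local note="NHC: $*"                     # remaining arguments form the note
    # PBSNODES_CMD defaults to pbsnodes; set it to echo for a dry run.
    "${PBSNODES_CMD:-pbsnodes}" -o -C "$note" "$host"
}
```

A dry run such as `PBSNODES_CMD=echo example_offline_node crhtc01 check_fs_mount_rw failed` prints the arguments that would be passed to pbsnodes without touching the scheduler.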
Other Checks
check_cmd_output -t 1 -r 0 -m '/is running/' /sbin/service rpcbind status
check_net_socket -n "HTTP daemon" -p tcp -s LISTEN -l '0.0.0.0:80' -d httpd -e 'service httpd start'
Thanks and questions?