1 of 23

Linux Clusters Institute: Cluster Stack Basics

Jenett Tillotson

Senior HPC Systems Engineer

National Center for Atmospheric Research

May 2023

This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

2 of 23

A Guiding Principle

“The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q” - Fabrizio Petrini, et al.

  • Minimize the software running on the compute nodes to reduce overhead/OS jitter
  • The larger the cluster, the more important this is.
  • Therefore:
    • Minimize services running on compute nodes
    • Minimize unnecessary network traffic
  • Clusters can be like 800-pound gorillas - proceed carefully!
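
One way to see why cluster size matters is a deliberately simplified model. It is an assumption for illustration and is not taken from the paper: every node is independently interrupted by background activity with probability p during a compute step, and the step ends only when the slowest node finishes. The function name and the value of p are hypothetical.

```python
def delayed_step_probability(nodes: int, p_interrupt: float) -> float:
    """Probability that at least one of `nodes` is interrupted in a step.

    Assumes independent interruptions with the same probability on every
    node, which is a simplification of real OS jitter.
    """
    return 1.0 - (1.0 - p_interrupt) ** nodes


if __name__ == "__main__":
    # p_interrupt = 0.001 is an illustrative value, not a measurement.
    for n in (1, 64, 1024, 8192):
        print(f"{n:5d} nodes: {delayed_step_probability(n, 0.001):.4f}")
```

Under these assumptions a rare interruption on one node becomes a near-certain delay at thousands of nodes, which is the motivation for keeping compute nodes minimal.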

3 of 23

Cluster Architecture Overview

[Diagram: cluster architecture. Labels: Login Node, Compute Node (x4), Parallel File System, Mgmt Node, Local Area Network, LDAP/NTP/DNS, Cluster security boundary, Lights Out, Internal Network, Interconnect, Internet, NTP/DNS]

4 of 23

Management/Service Node

Services on the Mgmt Node:

  • Time:
    • NTP, Chrony
    • Server
    • Client
  • NFS server:
    • nfsd
    • rpcbind
  • Accounts/Authentication:
    • LDAP Server, Kerberos
  • PXE:
    • DHCP
    • TFTP
    • HTTP
  • Resource Manager:
    • SLURM, PBS
    • Server
    • Job Scheduler
  • Interconnect:
    • Drivers
    • Fabric Manager
  • SSH:
    • sshd
    • pdsh
  • Logging/Monitoring/Metrics:
    • Syslog Server
    • Ganglia, Nagios, Zabbix
    • Time Series Database/telegraf
  • Config Management:
    • Puppet, Ansible, Salt
  • Security:
    • iptables, nftables
    • NAT
  • Domain Name System (DNS):
    • Server
    • Client

5 of 23

Login Node

Services on the Login Node:

  • Time:
    • client
  • Accounts/Authentication:
    • Name service caching daemon (nscd, nslcd, sssd)
    • PAM
  • Resource Manager: client
    • Munge
  • Interconnect:
    • Drivers
  • SSH:
    • sshd
  • Network File Systems:
    • Home Directories
    • Applications
    • Parallel/Scratch
    • Work, Project
  • Monitoring/Metrics
  • Firewall:
    • iptables, nftables
  • User Environment:
    • Compilers
    • Libraries
    • Modules, Lmod
  • DNS:
    • client

6 of 23

Compute Nodes

Services on the Compute Node:

  • Time:
    • client
  • Accounts/Authentication:
    • Name service caching (nscd, nslcd, sssd)
    • PAM
  • Resource Manager:
    • MoM
    • Munge
    • nhc
  • Interconnect:
    • Drivers
  • SSH:
    • sshd
  • Network File Systems:
    • Home Directories
    • Applications
    • Parallel/Scratch
    • Work, Project
  • Monitoring/Metrics:
    • Ganglia, telegraf
  • DNS:
    • client
  • User Environment:
    • Compilers
    • Libraries
    • Modules, Lmod

7 of 23

PXE

  • PXE - Preboot EXecution Environment - DHCP, TFTP, and HTTP services for booting machines over the network
  • DHCP - Dynamic Host Configuration Protocol - Automatically assigns IP addresses to hosts over the network
  • TFTP - Trivial File Transfer Protocol - Protocol for transferring files over a network. It’s a simplified version of the FTP protocol
  • HTTP - Hypertext Transfer Protocol - Data transfer protocol designed for the World Wide Web. Used to retrieve installation config files, OS images, RPMs, etc.
  • Supports stateful installs - Install OS to local storage
  • Supports stateless booting - Download OS image to RAM

8 of 23

DHCP

  • DHCP - Dynamic Host Configuration Protocol
  • Maps network hardware to network addresses
    • MAC address of network device mapped to network address
  • Also hands out gateway information and hostname
  • Supports various boot methods (stateful, stateless, satellite)
  • Supports various network boot agents
    • iPXE, XNBA, pxelinux, GRUB
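
The MAC-to-address mapping can be sketched as a small generator of host declarations in the style used by ISC dhcpd. This is only an illustration: the inventory, host names, and addresses are hypothetical, and a cluster manager would normally produce this configuration for you.

```python
def dhcpd_host_entry(hostname: str, mac: str, ip: str) -> str:
    """Render one ISC dhcpd-style host declaration binding a MAC to an IP."""
    return (
        f"host {hostname} {{\n"
        f"  hardware ethernet {mac};\n"
        f"  fixed-address {ip};\n"
        f'  option host-name "{hostname}";\n'
        f"}}\n"
    )


# Hypothetical inventory: hostname -> (MAC address, IP address).
NODES = {
    "node01": ("aa:bb:cc:00:00:01", "10.0.0.11"),
    "node02": ("aa:bb:cc:00:00:02", "10.0.0.12"),
}

if __name__ == "__main__":
    for name, (mac, ip) in NODES.items():
        print(dhcpd_host_entry(name, mac, ip))
```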

9 of 23

Domain Name System (DNS)

  • DNS - Domain Name System
    • Maps hostnames to network addresses
  • Mgmt/Service Node
    • Has DNS mappings for cluster components
    • Provides those mappings to cluster components
    • Connects to upstream DNS servers for DNS forwarding
  • Login Node
    • Connects to mgmt/service nodes for local DNS information
  • Compute Node
    • Also connects to mgmt/service node for local DNS information
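
The mappings held on the mgmt/service node can be pictured as forward (A) and reverse (PTR) records for each cluster component. The sketch below builds both in zone-file style; the domain `cluster.example`, the host name, and the address are hypothetical.

```python
import ipaddress


def dns_records(hostname: str, ip: str, domain: str) -> tuple[str, str]:
    """Return (forward A record, reverse PTR record) for one IPv4 host."""
    fqdn = f"{hostname}.{domain}."
    # reverse_pointer yields e.g. '11.0.0.10.in-addr.arpa' for 10.0.0.11
    ptr_name = ipaddress.ip_address(ip).reverse_pointer + "."
    return f"{fqdn} IN A {ip}", f"{ptr_name} IN PTR {fqdn}"


if __name__ == "__main__":
    for record in dns_records("node01", "10.0.0.11", "cluster.example"):
        print(record)
```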

10 of 23

Cluster Manager

  • PXE booting support
    • Supports DNS, DHCP, TFTP
    • Stateless image booting
    • Satellite NFS (Network File System) share
      • Share read-only files over the network using the NFS protocol
  • Cluster manager tools
    • File synchronization
    • Distributed shell - pdsh - Parallel Distributed Shell or clush - Cluster Shell
    • Stateless image building
    • Stateful installation support
    • Lights Out management
      • Remote power control and console access
  • …and more!
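
The idea behind a distributed shell can be sketched in a few lines: expand a host range, then run the same command on every host concurrently. This is not how pdsh or clush are implemented; the function names are hypothetical, only a single bracketed range is handled, and passwordless ssh to the nodes is assumed.

```python
import re
import subprocess
from concurrent.futures import ThreadPoolExecutor


def expand_hostlist(spec: str) -> list[str]:
    """Expand a simple range such as 'node[01-04]' into host names.

    Handles one bracketed numeric range only; real hostlist syntax also
    allows comma-separated lists and multiple ranges.
    """
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", spec)
    if match is None:
        return [spec]
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding, e.g. 01, 02, ...
    return [
        f"{prefix}{i:0{width}d}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


def ssh_run(host: str, command: str) -> str:
    """Run `command` on `host` over ssh and return its standard output."""
    result = subprocess.run(
        ["ssh", host, command], capture_output=True, text=True, timeout=30
    )
    return result.stdout.strip()


def run_everywhere(hosts, command, runner=ssh_run, max_workers=32):
    """Run `command` on all hosts concurrently; return {host: output}."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = pool.map(lambda h: runner(h, command), hosts)
        return dict(zip(hosts, outputs))


if __name__ == "__main__":
    print(expand_hostlist("node[01-04]"))
```

The `runner` parameter exists so the fan-out logic can be tried without real nodes by passing a stand-in function.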

11 of 23

Resource Manager/Job Scheduler

  • Acts as a broker between users and resources
  • Tracks resources and requests for resources
  • Determines the order in which resource requests are filled and on which resources each request will run
  • Mgmt/Service nodes run a resource manager server and job scheduler
  • Compute nodes run a MoM (Machine-oriented miniserver)
  • Typically use munge for authentication
  • nhc - Node Health Check - A utility periodically run on the compute nodes to check for issues

12 of 23

SLURM Daemons

  • SLURM - Simple Linux Utility for Resource Management
    • An open-source resource manager and job scheduler
  • Mgmt/Service Nodes
    • slurmctld - SLURM server/Job scheduler
    • MariaDB/MySQL - Storing accounting information
    • slurmdbd - SLURM Database interface
    • Munge
  • Compute Nodes
    • slurmd
    • Munge
  • Login Nodes
    • Munge
  • Install a SLURM client on any node that needs to submit jobs or query the SLURM server
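
From the user's side, a request to the SLURM server is usually a batch script with `#SBATCH` directives. The sketch below only builds such a script as text; the function name and the example values are hypothetical, and submitting it would require the SLURM client (`sbatch`) on the node.

```python
def batch_script(job_name: str, nodes: int, tasks_per_node: int,
                 time_limit: str, command: str) -> str:
    """Return the text of a minimal SLURM batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        f"#SBATCH --time={time_limit}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example values are illustrative only.
    print(batch_script("hello", 2, 4, "00:10:00", "hostname"))
```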

13 of 23

Interconnect

  • InfiniBand (IB)
    • High bandwidth, low-latency network
    • Provides RDMA (Remote Direct Memory Access)
    • Uses Verbs instead of a traditional OSI network stack
    • IPoIB (Internet Protocol over InfiniBand) allows IP traffic over an InfiniBand network
  • Ethernet
    • High bandwidth
    • Uses a traditional OSI network stack
    • Provides RDMA by using RoCE (RDMA over Converged Ethernet)
  • Subnet Manager - A daemon that monitors the network for any changes and updates routing information

14 of 23

Network File Systems

  • NFS
    • Network Attached Storage (NAS)
    • Often used for home directories and applications (read-only)
    • Relatively small
    • Backed up
    • Use an automounter to reduce load on NFS server
  • Parallel File Systems
    • Use parallelism to gain speed and reduce latency
      • GPFS (Spectrum Scale), Lustre, BeeGFS, WekaIO, Vast, …
    • Used for scratch space - big, fast, not backed up
    • Used for work/project space - smaller, fast, backed up

15 of 23

Network File Systems

  • Mgmt/Service Nodes
    • Typically don’t mount parallel file systems
    • If running resource manager, will need to mount home directories and will need user account information (LDAP)
  • Login Nodes
    • Mount home directories, applications, scratch, and work/project spaces
  • Compute nodes
    • Same as login nodes
    • Compute nodes should match login nodes as much as possible

16 of 23

Account Information/Authentication

  • Traditional Flat Files - /etc/passwd, /etc/shadow, /etc/group
  • LDAP - Lightweight Directory Access Protocol - A standardized protocol for providing user account information and other information to computers on your network. Can also provide user authentication
  • Kerberos - A protocol for authentication that can authenticate users, computers, and services
  • SSSD - System Security Services Daemon - Handles account lookups and user authentication
    • Can use LDAP and Kerberos for account lookup and user authentication
    • Important to cache lookups
  • PAM - Pluggable Authentication Modules

17 of 23

Time Service

  • Use a time service to sync the clocks of all the cluster components
    • Synchronized time is important for network file systems, logging, and authentication
  • Management/Service Nodes
    • Connect to upstream time servers
    • Provide a time server for compute nodes
  • Login Nodes
    • Connect to upstream time servers
  • Compute Nodes
    • Connect to the Mgmt/Service Node time servers

18 of 23

SSH

  • SSH - Secure Shell - A method of logging into a remote system that uses strong encryption to secure the connection. It uses cryptography to verify the identity of the remote system. Users can create public/private key pairs to log in without using passwords.
  • Mgmt/Service Nodes
    • Only accessible from the local network
  • Login Nodes
    • Usually accessible from the Internet (All the Things!)
    • Rate limiting
  • Compute Nodes
    • Only accessible from cluster networks
    • Use pam_slurm_adopt (with SLURM) to limit logins to users with running jobs

19 of 23

Logging

  • Centralized logging
    • rsyslog
  • Logging locally consumes memory on stateless compute nodes, since their file system is held in RAM
  • Stateless compute nodes will lose local logs when rebooted
  • Stateless compute nodes can also be configured to send kdump crash dumps over the network when the kernel crashes
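
Forwarding with rsyslog is set up in its own configuration. The same idea, sending records off the node instead of keeping them locally, can be sketched with Python's standard library for a site script. The function name and logger name are hypothetical, and a syslog server listening on UDP port 514 is assumed.

```python
import logging
import logging.handlers


def forwarding_logger(name: str, server: str, port: int = 514) -> logging.Logger:
    """Return a logger that sends records to a syslog server over UDP."""
    handler = logging.handlers.SysLogHandler(address=(server, port))
    handler.setFormatter(
        logging.Formatter("%(name)s: %(levelname)s %(message)s")
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger
```

A node-side script could then call `forwarding_logger("nodecheck", "<syslog server>")` and log as usual, with nothing written to the node's RAM-backed file system.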

20 of 23

Monitoring/Metrics

  • Ganglia - Popular open-source, web-based tool for monitoring and trending CPU utilization, memory usage and other cluster statistics
    • Requires services to run on the server and the clients to be monitored
  • Nagios - An older, but still popular monitoring tool that can determine if a system is up and if the configured services are operating properly
  • Zabbix - Another monitoring tool that can also collect metrics
  • Time Series Database - Uses tools like telegraf to stuff metrics into a time series database for monitoring and analysis
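
To make the time series idea concrete, the sketch below formats one metric as a line of InfluxDB line protocol, a text format that telegraf can emit. Choosing InfluxDB here is an assumption for illustration, the function name is hypothetical, and escaping of special characters is not handled.

```python
import time


def line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Format one metric as 'measurement,tags fields timestamp'.

    Field values are written as floats; no escaping is performed.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    tag_part = "".join(f",{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f"{k}={float(v)}" for k, v in sorted(fields.items())
    )
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


if __name__ == "__main__":
    print(line_protocol("load", {"host": "node01"}, {"load1": 0.5}))
```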

21 of 23

Security

  • Use firewalls (iptables, nftables, etc.) to limit access
  • NAT - Network Address Translation - Rewrites private IP addresses to a public IP address so hosts on private networks can reach the Internet
  • Mgmt/Service Node
    • Locked down to only be accessible to the local area network and the cluster networks
    • NAT should be configured so it can act as a gateway for the compute nodes
  • Login Node
    • SSH allowed from the Internet
  • Compute Nodes
    • Are only accessible from the cluster networks
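
The access rules above can be written down as a small policy table and checked with Python's `ipaddress` module. This only models the policy; it does not configure a firewall. The address ranges and role names are hypothetical.

```python
import ipaddress

# Hypothetical address ranges; substitute the site's real networks.
CLUSTER_NETS = [ipaddress.ip_network("10.0.0.0/16")]
LOCAL_AREA_NETS = [ipaddress.ip_network("192.168.10.0/24")]

# Which source networks may open SSH connections to each node role.
ALLOWED_SOURCES = {
    "mgmt": CLUSTER_NETS + LOCAL_AREA_NETS,
    "login": [ipaddress.ip_network("0.0.0.0/0")],  # reachable from the Internet
    "compute": CLUSTER_NETS,
}


def ssh_allowed(node_role: str, source_ip: str) -> bool:
    """Return True if `source_ip` may reach SSH on a node of `node_role`."""
    source = ipaddress.ip_address(source_ip)
    return any(source in net for net in ALLOWED_SOURCES[node_role])


if __name__ == "__main__":
    for role in ("mgmt", "login", "compute"):
        print(role, ssh_allowed(role, "203.0.113.7"))
```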

22 of 23

Config Management/User Environment

  • Configuration Manager
    • Used to impose a configuration on the cluster components
    • Puppet, Ansible, Salt
  • User Environment
    • Compilers, libraries
    • Shell environment
      • Lmod, environment modules
      • Set $PATH, $LD_LIBRARY_PATH, $MANPATH for supported applications
    • Available on the login nodes and the compute nodes
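
What loading a module does to the shell environment can be sketched as prepending an application's directories to the three variables named above. This is a simplified model, not how Lmod or environment modules are implemented; the function name, the install prefix, and the `bin`, `lib`, and `share/man` layout are assumptions.

```python
def load_module(env: dict, prefix: str) -> dict:
    """Return a copy of `env` with an application's directories prepended."""
    additions = {
        "PATH": f"{prefix}/bin",
        "LD_LIBRARY_PATH": f"{prefix}/lib",
        "MANPATH": f"{prefix}/share/man",
    }
    new_env = dict(env)
    for var, path in additions.items():
        current = new_env.get(var, "")
        # Prepend so the loaded application takes precedence.
        new_env[var] = f"{path}:{current}" if current else path
    return new_env


if __name__ == "__main__":
    # "/apps/gcc/12.2" is a hypothetical install prefix.
    print(load_module({"PATH": "/usr/bin"}, "/apps/gcc/12.2"))
```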

23 of 23

Conclusions

Thanks and questions?
