Linux Clusters Institute: Cluster Stack Basics
Jenett Tillotson
Senior HPC Systems Engineer
National Center for Atmospheric Research
May 2023
This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).
A Guiding Principle
“The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q” - Fabio Petrini et al.
Cluster Architecture Overview
[Diagram: cluster architecture. Labels: Login Node; Compute Node (x4); Parallel File System; Mgmt Node; Local Area Network; LDAP/NTP/DNS; Cluster security boundary; Lights Out; Internal Network; Interconnect; Internet; NTP/DNS]
Management/Service Node
Mgmt Node
Time:
NTP, Chrony
Server
Client
NFS server:
nfsd
rpcbind
Accounts/Authentication:
LDAP Server, Kerberos
PXE:
DHCP
TFTP
HTTP
Resource Manager:
SLURM, PBS
Server
Job Scheduler
Interconnect:
Drivers
Fabric Manager
SSH:
sshd
pdsh
Logging/Monitoring/Metrics:
Syslog Server
Ganglia, Nagios, Zabbix
Time Series Database/telegraf
Config Management:
Puppet, Ansible, Salt
Security:
iptables, nftables
NAT
Domain Name System (DNS):
Server
Client
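As a rough illustration of the service list above, the sketch below maps several of those management-node services to their conventional default ports and probes the TCP ones on a host. The port numbers are the usual defaults, and a site may change any of them. The names `DEFAULT_PORTS`, `tcp_port_open`, and `probe` are hypothetical helpers introduced here, not part of the original slides.

```python
import socket

# Conventional default ports for some management-node services.
# Sites can and do change these; treat the table as an assumption.
DEFAULT_PORTS = {
    "sshd": ("tcp", 22),
    "dns": ("udp", 53),
    "dhcp": ("udp", 67),
    "tftp": ("udp", 69),
    "http": ("tcp", 80),
    "kerberos": ("tcp", 88),
    "rpcbind": ("tcp", 111),
    "ntp": ("udp", 123),
    "ldap": ("tcp", 389),
    "syslog": ("udp", 514),
    "nfsd": ("tcp", 2049),
    "slurmctld": ("tcp", 6817),
}


def tcp_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host, ports=DEFAULT_PORTS, timeout=1.0):
    """Check every TCP service in the table; UDP services are skipped
    because a plain connect cannot tell whether they are listening."""
    return {
        name: tcp_port_open(host, port, timeout)
        for name, (proto, port) in ports.items()
        if proto == "tcp"
    }
```

Calling `probe("mgmt01")`, where `mgmt01` is a placeholder host name, returns a dictionary mapping each TCP service name to `True` or `False`. A successful connect only shows that something is listening on the port, not that the service behind it is healthy.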
Login Node
Time:
client
Accounts/Authentication:
Name Service Caching Daemon (nscd, nslcd, sssd)
PAM
Resource Manager: client
Munge
Interconnect:
Drivers
SSH:
sshd
Network File Systems:
Home Directories
Applications
Parallel/Scratch
Work, Project
Monitoring/Metrics
Firewall:
iptables, nftables
User Environment:
Compilers
Libraries
Modules, Lmod
DNS:
client
Compute Nodes
Compute Node
Time:
client
Accounts/Authentication:
Name Service Caching Daemon (nscd, nslcd, sssd)
PAM
Resource Manager:
MOM
Munge
nhc
Interconnect:
Drivers
SSH:
sshd
Network File Systems:
Home Directories
Applications
Parallel/Scratch
Work, Project
Monitoring/Metrics:
Ganglia, telegraf
DNS:
client
User Environment:
Compilers
Libraries
Modules, Lmod
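To make the differences between the three node roles easier to see, the following minimal sketch records the service categories from the management, login, and compute node slides as sets and compares them. The category names are lightly normalized for comparison, and the names `ROLE_SERVICES`, `common_to_all`, and `only_on` are hypothetical, introduced only for this illustration.

```python
# Service categories per node role, transcribed from the slides above.
# Category names are lightly normalized (an assumption of this sketch):
# "Firewall" on the login node is kept distinct from "Security" on the
# management node because the slides list different contents for each.
ROLE_SERVICES = {
    "management": {
        "Time", "NFS server", "Accounts/Authentication", "PXE",
        "Resource Manager", "Interconnect", "SSH",
        "Logging/Monitoring/Metrics", "Config Management", "Security", "DNS",
    },
    "login": {
        "Time", "Accounts/Authentication", "Resource Manager", "Interconnect",
        "SSH", "Network File Systems", "Monitoring/Metrics", "Firewall",
        "User Environment", "DNS",
    },
    "compute": {
        "Time", "Accounts/Authentication", "Resource Manager", "Interconnect",
        "SSH", "Network File Systems", "Monitoring/Metrics", "DNS",
        "User Environment",
    },
}


def common_to_all(roles=ROLE_SERVICES):
    """Categories that appear on every node role."""
    return set.intersection(*roles.values())


def only_on(role, roles=ROLE_SERVICES):
    """Categories listed for `role` and for no other role."""
    others = set().union(*(s for r, s in roles.items() if r != role))
    return roles[role] - others


if __name__ == "__main__":
    print("common:", sorted(common_to_all()))
    print("login only:", sorted(only_on("login")))
    print("management only:", sorted(only_on("management")))
```

Under this normalization, the login and compute lists differ only by the login node's firewall entry, while the management node alone carries the server-side pieces such as PXE, the NFS server, and config management.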
PXE
DHCP
Domain Name System (DNS)
Cluster Manager
Resource Manager/Job Scheduler
SLURM Daemons
Interconnect
Network File Systems
Account Information/Authentication
Time Service
SSH
Logging
Monitoring/Metrics
Security
Config Management/User Environment
Conclusions
Thanks and questions?