1 of 23

Operating a research computing environment for a large university

CaRCC May 19, 2022

Erik Deumens, senior director

Brennan Jones, systems admin

Craig Prescott, systems architect

Eric Tomeo, operations manager

Rise to Five

2 of 23

What we cover…

  • Research Computing Needs
  • Hardware Landscape
  • Configuration Management
  • Security Considerations
  • User Interfaces and Support

3 of 23

Research computing needs

  • Diverse and wide-ranging in multiple dimensions
  • Domain: physics, chemistry, materials engineering, fluid dynamics, gene sequencing, social and political sciences, economics, language, history, agriculture, fine arts, …
  • Expertise: undergraduate students, computer scientists, scholars, …
  • Scale: from a few cores and a few GB of data to thousands of cores and GPUs with hundreds of TB, …
  • Capabilities: numerical output only, advanced graphics and visualization, asynchronous/batch, interactive, …

4 of 23

Hardware landscape

5 of 23

Major components

  • Compute nodes
  • Storage systems
  • Network switches
  • Infrastructure nodes

6 of 23

Compute Nodes

  • Three generations of nodes: HPG2, HPG3 and HPG-AI
  • HPG2
    • Compute: 2015, Dell PowerEdge C6320, 29,312 total Intel cores
    • GPU: 2015, Dell PowerEdge R720, Nvidia Tesla K80s
  • HPG3
    • Compute: 2021, Lenovo ThinkSystem SR645, 40,320 AMD cores
    • GPU: 2019, Exxact Tyan S7109, Nvidia RTX 2080 Ti and RTX 6000
  • HPG-AI
    • GPU: 2021, 140 Nvidia DGX A100s, #30 Top 500, #5 Green 500 (Nov 2021)

7 of 23

Storage Systems

  • Qumulo
    • NFS and SMB
    • Home, Applications and Infrastructure
    • 266 TB
  • DDN EXAScaler
    • Lustre
    • Orange - long term storage, 17 PiB
    • Blue - compute job I/O, 7 PiB
    • Red - high performance compute jobs, 2.5 PiB

8 of 23

Other Systems

  • Networking
    • Ethernet
      • BMC and management
      • Mellanox switches running Cumulus Linux
    • Infiniband
      • Mellanox switches
      • EDR and HDR
      • Mellanox ConnectX-5 and ConnectX-6
  • Infrastructure
    • DNS, LDAP, DHCP, provisioning, scheduler, security, licensing, monitoring, etc.
    • Mix of virtual and physical machines
    • High availability

9 of 23

Configuration management

10 of 23

What tools are we using?

  • Production: Puppet for config management; r10k replicates/syncs code to the Puppet masters on git push; four Puppet masters
  • Foreman as the Puppet ENC, integrating config management with provisioning (see the sketch after this list)
    • Related – Katello for content management (packages, yum repos, etc.)
    • Ansible is supported – we have used it a little, but no Ansible roles are assigned to any production resources at present
    • Chef and Salt are also supported, but we do not use them at all
  • Versions
    • Production (old!) – Foreman 1.15.6, Puppet 4.10.12
    • Dev – Foreman 2.5.4, Puppet 7
      • Target is Foreman 3.2 and Puppet 7 – entering dev now
        • To enter prod in summer 2022 across all UFIT Research Computing resources
      • Will drop r10k (tied to the old gitosis instance) in favor of Gitlab CI for syncing code to the masters
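
A minimal sketch of what "Foreman as the Puppet ENC" means on the Puppet side, using hypothetical class and parameter names rather than UFRC's actual manifests: Foreman hands the master a per-node list of classes and parameter values, so site.pp needs no per-node logic and hostgroups can override class parameters.

    # site.pp: with Foreman acting as the ENC, node-to-class mapping lives in
    # Foreman hostgroups, so no per-node blocks are needed here.
    node default { }

    # A parameterized class (hypothetical). Foreman can set $login_node per
    # hostgroup or per host via smart class parameters; facts or Hiera work too.
    class hpg::motd (
      Boolean $login_node = false,
    ) {
      file { '/etc/motd':
        ensure  => file,
        content => $login_node ? {
          true    => "HiPerGator login node - please do not run jobs here\n",
          default => "HiPerGator compute node\n",
        },
      }
    }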

11 of 23

Why did we choose those tools?

  • To paint the picture – the before times: prior to HiPerGator (v1), we had only a provisioning system (which was no longer supported) and no config management at all (nor content management, for that matter)
    • We needed something – complexity and scale had both already grown beyond our tooling (SystemImager, rdist), and growth was accelerating in both directions
      • Complexity – many node types/roles, heterogeneous hardware, different requirements
      • Many hand-maintained “golden clients” – details of our evolving configs could easily become “lost”, and major upgrades were a huge pain
      • rdist as a poor man’s config management – an ad hoc, by-hand, one-shot affair
      • Attempts to manage the complexity had mixed results; they were fragile and produced an error-prone, unmaintainable morass of little files and soft links, a huge and complicated Distfile, etc.
      • Virtual machines could not be managed at all
      • The list goes on. Short story: we had outgrown our primitive tools and were substituting staff effort for capable infrastructure – error-prone, inefficient, and ultimately impractical.
    • Our requirements: a flexible solution to better manage growing complexity and heterogeneity, eliminate the issues above, scale as we grow, and treat bare metal and virtualized machines as equals
      • Most important: UFRC’s target is that any operations staff member MUST be able to restore a production server from scratch at any time with the push of a button, walk away, and not have to think about it further. No domain knowledge or engagement of other staff should be required for the resource to resume its production state.

12 of 23

Why did we choose these tools? (cont.)

  • Chose Foreman/Puppet/Katello stack
    • Very flexible (important for us). VMs are now on equal footing with bare metal from a provisioning and config management standpoint – libvirt/VMware/AWS/GCE/Azure (we use libvirt internally a lot)
      • Built-in support for unattended provisioning of many OS families, legacy BIOS and UEFI, PXE and HTTP
      • Lifecycle, subscription, repository (yum/apt/docker/files), and package management via Katello
      • Built-in reporting, audit logging (who did what, when), OpenSCAP integration for security, etc.
      • Integrates with Red Hat Satellite, RH Cloud, etc.
    • All open source; infrastructure as code – no more golden clients, by-hand rdist, unmaintainable directory trees of little files and links, or hand-maintained dhcpd.conf. Strong and active development and user communities. Upstream source of Red Hat Satellite.
    • Puppet – a huge library of mature, well-maintained modules on the Forge for us to leverage; language features make things easy (e.g., types like nagios_host and exported resources – see the sketch after this list). Catalog enforcement on every agent run prevents configuration drift and can apply corrective actions.
    • External Node Classifier and environment integration – the ENC gives us an extremely strong lever for managing complexity; e.g., Foreman hostgroups can associate Puppet classes and set parameters that feed class parameter values (facts can be used as well)
    • Supports mix-and-match of config mgmt tools e.g. Puppet and Ansible if desired
    • Built-in audit logging, archiving of config management reports, OpenSCAP reports, etc.
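
Since exported resources and the nagios_host type are called out above, here is a hedged sketch of that pattern (the tag, host template, and file path are illustrative; exported resources require PuppetDB, and in Puppet 6+ nagios_host comes from the puppetlabs/nagios_core module rather than core Puppet):

    # On every monitored node: export a nagios_host describing this host.
    @@nagios_host { $facts['networking']['fqdn']:
      ensure  => present,
      address => $facts['networking']['ip'],
      use     => 'generic-host',               # illustrative Nagios host template
      target  => '/etc/nagios/conf.d/hosts.cfg',
      tag     => 'hpg-monitored',
    }

    # On the monitoring server: collect everything exported with that tag.
    Nagios_host <<| tag == 'hpg-monitored' |>>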

13 of 23

How does config management fit in UFIT Research Computing deployments?

  • What does UFIT Research Computing deploy and maintain from its Foreman/Puppet/Katello infrastructure?
    • RHEL/CentOS 6/7/8: with and without Mellanox OFED
    • Citrix XenServer: we’re handing this off to another unit within UFIT “soon”
    • Debian Buster (switches): ZTP + Cumulus Linux, with Puppet-managed switch configs
    • Gone, but not forgotten – Ubuntu 20.04: deployed from Foreman for HPG-AI acceptance (“DGX-OS 5”); we don’t see a future need, but the capability is demonstrated
    • Practically, the most important for UFRC as of spring 2022 are RHEL 7/8 and Cumulus (Debian)
    • Config management does it all – we rely on it

14 of 23

How does config management fit in UFIT Research Computing deployments? (cont.)

  • Config management setup and strategy
    • “Base” set of classes applied to all hosts, virtual or bare-metal
      • Everything needed to set the host up in the UFRC environment, including all network interfaces (IB, bonds, bridges, VLAN-tagged, whatever) and to operate/maintain the hardware (if bare metal)
    • Further specialization according to intended role by associating additional classes (via hostgroups and config groups) – e.g., HPG SLURM client, edge node, etc. (see the sketch after this list)
    • At provisioning time a host gets a signed cert from the Puppet CA and applies the same catalog that is later enforced on agent runs
    • A DNS SRV domain (Puppet’s srv_domain setting) is very helpful for the multi-master setup
    • Our agents run once an hour
    • Manifest development – a git branch with an associated Puppet environment (Gitlab CI makes this simple); once happy/reviewed, merge and remove the branch and environment
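
A hedged sketch of the "base classes plus role specialization" layout described above, with hypothetical class names (the real UFRC module layout is not shown in these slides); a Foreman hostgroup would associate the role class with its member hosts. The ini_setting type assumes the puppetlabs/inifile module.

    # Applied to every host, virtual or bare metal.
    class profile::base {
      include profile::network   # IB, bonds, bridges, VLAN-tagged interfaces
      include profile::logging   # rsyslog to central collectors (see the Logging slide)

      # Agents run once an hour (see the bullet above); one way to pin that down.
      ini_setting { 'puppet runinterval':
        ensure  => present,
        path    => '/etc/puppetlabs/puppet/puppet.conf',
        section => 'agent',
        setting => 'runinterval',
        value   => '1h',
      }
    }

    # A role assembled for a specific purpose, associated with a Foreman hostgroup.
    class role::hpg_slurm_client {
      include profile::base
      include profile::slurm_client   # hypothetical profile for SLURM client setup
    }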

15 of 23

Options for sharing infrastructure with other UFIT units

  • We don’t work with any other IT groups today, but are set up for it
  • Foreman-side
    • Authentication credentials come from UF AD (“gatorlink”)
    • Roles can be created and assigned to users and groups (30 built-in roles)
    • Use Organization and Location features
      • Sharing (or limiting sharing) of repositories/puppet classes/ansible roles/etc with other orgs/locations supported
      • For RC – org set to UFRC, location to UFDC (where all our resources currently reside)
  • Gitlab – auth is with gatorlink credentials already, could enable access to other units if the use case came up

16 of 23

Security Considerations

17 of 23

Networking

  • Access from outside through login nodes, web servers and portals
    • Separate public VLANs with limited ACLs
    • OSSEC software runs on all edge servers
  • Compute nodes can access Internet resources via NAT
  • Backend networks are private – management, storage, InfiniBand
  • Secure enclave is more protected
    • Private VLAN
    • ACLs only for systems management and monitoring

18 of 23

Authentication

  • Access from outside requires Duo multi-factor authentication
    • SSH key or password auth to login nodes (see the sketch after this list)
    • Single sign-on to all web interfaces
    • SFTP and Globus
    • Not for SMB (but restricted to campus networks only)
    • VPN
  • Federated authentication will leverage COManage and eduVPN
    • Both services will leverage home institution multi-factor authentication
  • Secure enclave leverages SSO with Duo
    • Also requires a special key for access (not SSH key)
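
One common way to express "SSH key or password, plus Duo" on a login node, shown as a hedged sketch rather than UFRC's actual configuration (assumes Duo's PAM module handles the keyboard-interactive step and that puppetlabs/stdlib provides file_line; the class name and path are illustrative):

    class profile::login::sshd {
      # Require publickey+keyboard-interactive or password+keyboard-interactive;
      # the keyboard-interactive step goes through PAM, where Duo prompts.
      file_line { 'sshd-authenticationmethods':
        path   => '/etc/ssh/sshd_config',
        match  => '^AuthenticationMethods',
        line   => 'AuthenticationMethods publickey,keyboard-interactive password,keyboard-interactive',
        notify => Service['sshd'],
      }

      service { 'sshd':
        ensure => running,
        enable => true,
      }
    }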

19 of 23

Logging

  • Centralized logging via rsyslog
  • rsyslog is configured for Splunk forwarding (see the sketch after this list)
    • Secure enclave nodes are configured to send to Splunk directly
  • Audit logging
    • Possible, but not enabled
    • File system audit logs
    • Linux auditd system logs
    • Heavy impact on performance
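
A hedged sketch of the centralized rsyslog forwarding as a Puppet class (the collector hostname and drop-in file name are placeholders, not the actual UFRC Splunk forwarding setup):

    class profile::logging (
      String  $collector = 'loghost.example.ufl.edu',   # placeholder hostname
      Integer $port      = 514,
    ) {
      # '@@' forwards over TCP; a single '@' would use UDP.
      file { '/etc/rsyslog.d/90-central.conf':
        ensure  => file,
        content => "*.* @@${collector}:${port}\n",
        notify  => Service['rsyslog'],
      }

      service { 'rsyslog':
        ensure => running,
        enable => true,
      }
    }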

20 of 23

User interfaces and support

21 of 23

User Interfaces

  • Command line – standard cluster interface
    • Requires SSH key or password and Duo
  • Web browser GUI – Open OnDemand
    • Allows GUI software to work without complex client setup
    • Leverages campus SSO with internal authorization
  • Batch job management – SLURM
  • Data transfer – Globus Enterprise
    • Also leverages campus SSO

22 of 23

Application Support

  • Software is managed with modules (lmod)
  • We keep two versions of applications and libraries
    • Production version – the default
    • Previous version – for (limited) backward compatibility
  • Open and campus-licensed software
    • Controlled by central licensing services
  • Licensed software access is managed by group membership
    • Only those who have a license get access

23 of 23

Questions?