1 of 23

Operating a research computing environment for a large university

CaRCC May 19, 2022

Erik Deumens, senior director

Brennan Jones, systems admin

Craig Prescott, systems architect

Eric Tomeo, operations manager

Rise to Five

2 of 23

What we cover…

  • Research Computing Needs
  • Hardware Landscape
  • Configuration Management
  • Security Considerations
  • User Interfaces and Support

3 of 23

Research computing needs

  • Diverse and wide-ranging in multiple dimensions
  • Domain: physics, chemistry, materials engineering, fluid dynamics, gene sequencing, social and political sciences, economics, language, history, agriculture, fine arts, …
  • Expertise: undergraduate students, computer scientists, scholars, …
  • Scale: from a few cores and a few GB of data to thousands of cores and GPUs with hundreds of TB, …
  • Capabilities: numerical output only, advanced graphics and visualization, asynchronous/batch, interactive, …

4 of 23

Hardware landscape

5 of 23

Major components

  • Compute nodes
  • Storage systems
  • Network switches
  • Infrastructure nodes

6 of 23

Compute Nodes

  • Three generations of nodes: HPG2, HPG3 and HPG-AI
  • HPG2
    • Compute: 2015, Dell PowerEdge C6320, 29,312 total Intel cores
    • GPU: 2015, Dell PowerEdge R720, Nvidia Tesla K80s
  • HPG3
    • Compute: 2021, Lenovo ThinkSystem SR645, 40,320 AMD cores
    • GPU: 2019, Exxact Tyan S7109, Nvidia RTX 2080 Ti and RTX 6000
  • HPG-AI
    • GPU: 2021, 140 Nvidia DGX A100s, #30 Top 500, #5 Green 500 (Nov 2021)

7 of 23

Storage Systems

  • Qumulo
    • NFS and SMB
    • Home, Applications and Infrastructure
    • 266 TB
  • DDN EXAScaler
    • Lustre
    • Orange - long term storage, 17 PiB
    • Blue - compute job I/O, 7 PiB
    • Red - high performance compute jobs, 2.5 PiB

8 of 23

Other Systems

  • Networking
    • Ethernet
      • BMC and management
      • Mellanox switches running Cumulus Linux
    • Infiniband
      • Mellanox switches
      • EDR and HDR
      • Mellanox ConnectX-5 and ConnectX-6
  • Infrastructure
    • DNS, LDAP, DHCP, provisioning, scheduler, security, licensing, monitoring, etc.
    • Mix of virtual and physical machines
    • High availability

9 of 23

Configuration management

10 of 23

What tools are we using?

  • Production: Puppet for config management; r10k replicates/syncs code to the Puppet masters on git push; four Puppet masters
  • Foreman as the Puppet ENC, integrating config management with provisioning (see the sketch after this list)
    • Related – Katello for content management (packages, yum repos, etc.)
    • Ansible is supported – we have used it a little, but no Ansible roles are assigned to any production resources at present
    • Chef and Salt are also supported, but we do not use them at all
  • Versions
    • Production (old!) – Foreman 1.15.6, Puppet 4.10.12
    • Dev – Foreman 2.5.4, Puppet 7
      • Target is Foreman 3.2 and Puppet 7 – entering dev now
        • To enter prod in summer 2022 across all UFIT Research Computing resources
      • Will drop r10k (tied to the old gitosis instance) in favor of Gitlab CI for syncing code to the masters
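
A minimal sketch of what "Foreman as the Puppet ENC" means on the Puppet side, using hypothetical class and parameter names rather than UFRC's actual manifests: Foreman hands the master a per-node list of classes and parameter values, so site.pp needs no per-node logic and hostgroups can override class parameters.

    # site.pp: with Foreman acting as the ENC, node-to-class mapping lives in
    # Foreman hostgroups, so no per-node blocks are needed here.
    node default { }

    # A parameterized class (hypothetical). Foreman can set $login_node per
    # hostgroup or per host via smart class parameters; facts or Hiera work too.
    class hpg::motd (
      Boolean $login_node = false,
    ) {
      file { '/etc/motd':
        ensure  => file,
        content => $login_node ? {
          true    => "HiPerGator login node - please do not run jobs here\n",
          default => "HiPerGator compute node\n",
        },
      }
    }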

11 of 23

Why did we choose those tools?

  • To paint the picture – the before times: prior to HiPerGator (v1), we had only a provisioning system (which was no longer supported) and no config management at all (nor content management, for that matter)
    • We needed something – complexity and scale had both already grown beyond our tooling (SystemImager, rdist), and growth was accelerating in both directions
      • Complexity – many node types/roles, heterogeneous hardware, different requirements
      • Many hand-maintained “golden clients” – details of our evolving configs could easily become “lost”, and major upgrades were a huge pain
      • rdist as a poor man’s config management – an ad hoc, by-hand, one-shot affair
      • Attempts to manage the complexity had mixed results; they were fragile and produced an error-prone, unmaintainable morass of little files and soft links, a huge and complicated Distfile, etc.
      • Virtual machines could not be managed at all
      • The list goes on. Short story: we had outgrown our primitive tools and were substituting staff effort for capable infrastructure – error-prone, inefficient, and ultimately impractical.
    • Our requirements: a flexible solution to better manage growing complexity and heterogeneity, eliminate the issues above, scale as we grow, and treat bare metal and virtualized machines as equals
      • Most important: UFRC’s target is that any operations staff member MUST be able to restore a production server from scratch at any time with the push of a button, walk away, and not have to think about it further. No domain knowledge or engagement of other staff should be required for the resource to resume its production state.

12 of 23

Why did we choose these tools? (cont.)

  • Chose Foreman/Puppet/Katello stack
    • Very flexible (important for us). VMs are now on equal footing with bare metal from a provisioning and config management standpoint – libvirt/VMware/AWS/GCE/Azure (we use libvirt internally a lot)
      • Built-in support for unattended provisioning of many OS families, legacy BIOS and UEFI, PXE and HTTP
      • Lifecycle, subscription, repository (yum/apt/docker/files), and package management via Katello
      • Built-in reporting, audit logging (who did what, when), OpenSCAP integration for security, etc.
      • Integrates with Red Hat Satellite, RH Cloud, etc.
    • All open source; infrastructure as code – no more golden clients, by-hand rdist, unmaintainable directory trees of little files and links, or hand-maintained dhcpd.conf. Strong and active development and user communities. Upstream source of Red Hat Satellite.
    • Puppet – a huge library of mature, well-maintained modules on the Forge for us to leverage; language features make things easy (e.g., types like nagios_host and exported resources – see the sketch after this list). Catalog enforcement on every agent run prevents configuration drift and can apply corrective actions.
    • External Node Classifier and environment integration – the ENC gives us an extremely strong lever for managing complexity; e.g., Foreman hostgroups can associate Puppet classes and set parameters that feed class parameter values (facts can be used as well)
    • Supports mix-and-match of config mgmt tools e.g. Puppet and Ansible if desired
    • Built-in audit logging, archiving of config management reports, OpenSCAP reports, etc.
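
Since exported resources and the nagios_host type are called out above, here is a hedged sketch of that pattern (the tag, host template, and file path are illustrative; exported resources require PuppetDB, and in Puppet 6+ nagios_host comes from the puppetlabs/nagios_core module rather than core Puppet):

    # On every monitored node: export a nagios_host describing this host.
    @@nagios_host { $facts['networking']['fqdn']:
      ensure  => present,
      address => $facts['networking']['ip'],
      use     => 'generic-host',               # illustrative Nagios host template
      target  => '/etc/nagios/conf.d/hosts.cfg',
      tag     => 'hpg-monitored',
    }

    # On the monitoring server: collect everything exported with that tag.
    Nagios_host <<| tag == 'hpg-monitored' |>>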

13 of 23

How does config management fit in UFIT Research Computing deployments?

  • What does UFIT Research Computing deploy and maintain from its Foreman/Puppet/Katello infrastructure?
    • RHEL/CentOS 6/7/8: with and without Mellanox OFED
    • Citrix XenServer: we’re handing this off to another unit within UFIT “soon”
    • Debian Buster (switches): ZTP + Cumulus Linux, with Puppet-managed switch configs
    • Gone, but not forgotten – Ubuntu 20.04: deployed from Foreman for HPG-AI acceptance (“DGX-OS 5”); we don’t see a future need, but the capability is demonstrated
    • Practically, the most important for UFRC as of spring 2022 are RHEL 7/8 and Cumulus (Debian)
    • Config management does it all – we rely on it

14 of 23

How does config management fit in UFIT Research Computing deployments? (cont.)

  • Config management setup and strategy
    • “Base” set of classes applied to all hosts, virtual or bare-metal
      • Everything needed to set the host up in the UFRC environment, including all network interfaces (IB, bonds, bridges, VLAN-tagged, whatever) and to operate/maintain the hardware (if bare metal)
    • Further specialization according to intended role by associating additional classes (via hostgroups and config groups) – e.g., HPG SLURM client, edge node, etc. (see the sketch after this list)
    • At provisioning time a host gets a signed cert from the Puppet CA and applies the same catalog that is later enforced on agent runs
    • A DNS SRV domain (Puppet’s srv_domain setting) is very helpful for the multi-master setup
    • Our agents run once an hour
    • Manifest development – a git branch with an associated Puppet environment (Gitlab CI makes this simple); once happy/reviewed, merge and remove the branch and environment
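
A hedged sketch of the "base classes plus role specialization" layout described above, with hypothetical class names (the real UFRC module layout is not shown in these slides); a Foreman hostgroup would associate the role class with its member hosts. The ini_setting type assumes the puppetlabs/inifile module.

    # Applied to every host, virtual or bare metal.
    class profile::base {
      include profile::network   # IB, bonds, bridges, VLAN-tagged interfaces
      include profile::logging   # rsyslog to central collectors (see the Logging slide)

      # Agents run once an hour (see the bullet above); one way to pin that down.
      ini_setting { 'puppet runinterval':
        ensure  => present,
        path    => '/etc/puppetlabs/puppet/puppet.conf',
        section => 'agent',
        setting => 'runinterval',
        value   => '1h',
      }
    }

    # A role assembled for a specific purpose, associated with a Foreman hostgroup.
    class role::hpg_slurm_client {
      include profile::base
      include profile::slurm_client   # hypothetical profile for SLURM client setup
    }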

15 of 23

Options for sharing infrastructure with other UFIT units

  • We don’t work with any other IT groups today, but are set up for it
  • Foreman-side
    • Authentication credentials come from UF AD (“gatorlink”)
    • Roles can be created and assigned to users and groups (30 built-in roles)
    • Use Organization and Location features
      • Sharing (or limiting sharing) of repositories/puppet classes/ansible roles/etc with other orgs/locations supported
      • For RC – org set to UFRC, location to UFDC (where all our resources currently reside)
  • Gitlab – auth is with gatorlink credentials already, could enable access to other units if the use case came up

16 of 23

Security Considerations

17 of 23

Networking

  • Access from outside through login nodes, web servers and portals
    • Separate public VLANs with limited ACLs
    • OSSEC software runs on all edge servers
  • Compute nodes can access Internet resources via NAT
  • Backend networks are private – management, storage, InfiniBand
  • Secure enclave is more protected
    • Private VLAN
    • ACLs only for systems management and monitoring

18 of 23

Authentication

  • Access from outside requires Duo multi-factor authentication
    • SSH key or password auth to login nodes (see the sketch after this list)
    • Single sign-on to all web interfaces
    • SFTP and Globus
    • Not for SMB (but restricted to campus networks only)
    • VPN
  • Federated authentication will leverage COManage and eduVPN
    • Both services will leverage home institution multi-factor authentication
  • Secure enclave leverages SSO with Duo
    • Also requires a special key for access (not SSH key)
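
One common way to express "SSH key or password, plus Duo" on a login node, shown as a hedged sketch rather than UFRC's actual configuration (assumes Duo's PAM module handles the keyboard-interactive step and that puppetlabs/stdlib provides file_line; the class name and path are illustrative):

    class profile::login::sshd {
      # Require publickey+keyboard-interactive or password+keyboard-interactive;
      # the keyboard-interactive step goes through PAM, where Duo prompts.
      file_line { 'sshd-authenticationmethods':
        path   => '/etc/ssh/sshd_config',
        match  => '^AuthenticationMethods',
        line   => 'AuthenticationMethods publickey,keyboard-interactive password,keyboard-interactive',
        notify => Service['sshd'],
      }

      service { 'sshd':
        ensure => running,
        enable => true,
      }
    }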

19 of 23

Logging

  • Centralized logging via rsyslog
  • rsyslog is configured for Splunk forwarding (see the sketch after this list)
    • Secure enclave nodes are configured to send to Splunk directly
  • Audit logging
    • Possible, but not enabled
    • File system audit logs
    • Linux auditd system logs
    • Heavy impact on performance
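
A hedged sketch of the centralized rsyslog forwarding as a Puppet class (the collector hostname and drop-in file name are placeholders, not the actual UFRC Splunk forwarding setup):

    class profile::logging (
      String  $collector = 'loghost.example.ufl.edu',   # placeholder hostname
      Integer $port      = 514,
    ) {
      # '@@' forwards over TCP; a single '@' would use UDP.
      file { '/etc/rsyslog.d/90-central.conf':
        ensure  => file,
        content => "*.* @@${collector}:${port}\n",
        notify  => Service['rsyslog'],
      }

      service { 'rsyslog':
        ensure => running,
        enable => true,
      }
    }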

20 of 23

User interfaces and support

21 of 23

User Interfaces

  • Command line – standard cluster interface
    • Requires SSH key or password and Duo
  • Web browser GUI – Open OnDemand
    • Allows GUI software to work without complex client setup
    • Leverages campus SSO with internal authorization
  • Batch job management – SLURM
  • Data transfer – Globus Enterprise
    • Also leverages campus SSO

22 of 23

Application Support

  • Software is managed with modules (lmod)
  • We keep two versions of applications and libraries
    • Production version – the default
    • Previous version – for (limited) backward compatibility
  • Open and campus-licensed software
    • Controlled by central licensing services
  • Licensed software access is managed by group membership
    • Only those who have a license get access

23 of 23

Questions?