1 of 19

Throughput Machine Learning

Ian Ross

PATh Staff Meeting 2025.04.30

This project is supported by the National Science Foundation under Cooperative Agreement OAC-2030508. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2 of 19

PATh ML supplement

  • ~1 year supplement to PATh to “profile the effects of…training and inference on distributed, heterogeneous capacity”
    • Use HTCondor annexes to run training jobs on National AI Research Resource (NAIRR) systems
      • Intentionally shuffle training runs between sites at epoch boundaries to test resilience and the impact of heterogeneity (a checkpoint/resume sketch follows this list)
    • Compare models trained in distributed fashion to baseline (single-node trained) versions
  • Exercise our technologies to see how they operate in a heterogeneous “AI” setting while establishing single-point access to NAIRR compute
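
Shuffling a training run between sites at epoch boundaries only works if each job can pick up exactly where the previous one left off. Below is a minimal checkpoint/resume sketch, assuming a PyTorch-style training loop; the checkpoint filename and state layout are illustrative, not the actual pretrain.py internals.

import os
import torch

CKPT = "checkpoint.pt"  # assumed to be transferred between jobs/sites along with the rest of the sandbox

def load_or_init(model, optimizer):
    """Resume from a prior epoch's checkpoint if one was transferred in."""
    start_epoch = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    return start_epoch

def save_checkpoint(model, optimizer, epoch):
    """Write the epoch-boundary checkpoint so the next job (possibly at another site) can continue."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT)

Each training job then runs its epochs_per_job epochs starting from start_epoch, saves, and exits, leaving the checkpoint for the next node in the chain.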

Microsoft Copilot prompt: A cartoon image of a pelican and a condor holding hands over a landscape made of computer chips at sunset

3 of 19

NAIRR in a nutshell

  • “The National Artificial Intelligence Research Resource (NAIRR) is a concept for a shared national research infrastructure to bridge this gap by connecting U.S. researchers to responsible and trustworthy Artificial Intelligence (AI) resources, as well as the needed computational, data, software, training, and educational resources to advance research, discovery, and innovation.” From nairrpilot.org
  • We applied for, and received, allocations at 6 computational resources for this experiment.
    • Expanse, Bridges-2, Anvil, Delta, Jetstream2, AWS

4 of 19

PATh supplement, workflow and science case

Objectives:

    • Characterize the impact of training models and ensembles across heterogeneous resources
    • Advance PATh Access Point capabilities to reduce the barrier of entry and support ML workloads in distributed and heterogeneous environments like NAIRR

Domain benchmark from the Gitter lab: training a protein language model for use in protein engineering.

[Workflow diagram: a protein LLM is pretrained using ~10M biophysical simulations (~40GB on disk, ~20h/epoch on an A100), with successive epochs (Epochs 1-5 shown) shuffled across Training Resources A-E and an evaluation job after each epoch. The pretrained model is then finetuned using ~100 experimental measurements of protein sequence mutations (~1min/epoch; example mutated sequences shown), yielding a model capable of more accurate predictions than training on the limited experimental data directly. The experiment compares n model training path permutations.]

5 of 19

PATh supplement, status

  • Training process working on Delta, Expanse, Bridges-2, OSPool, CHTC
    • Manual job submissions to get the ball rolling…
  • It took longer than anticipated to iron out wrinkles and get consistent epochs running…
  • Ongoing – working to get one heterogeneous training run complete (8 of ~30 epochs)
  • In parallel – adapting a DAG generator to ease management and organization
    • Plenty more to come on this…

6 of 19

Annexes, NAIRR resources, and lessons learned

  • Overall things on our side seem to be in pretty good shape
    • Documentation, annex creation, htcondor-cli, all good with a few odd snares (to come)
  • Bit of a pain to manage and organize across 6 different sites (plus CHTC and OSPool), each with its own documentation, quirks, and policies…
    • Solution: create a helper utility to smooth over the differences between sites (e.g., a job needs 80GB of memory, but site A allocates whole nodes, so annex creation won’t let you specify a memory request; usernames, project names, partition names, “optimal” resource requests, …); a sketch of this idea follows this list
  • Storing apptainer image and data in OSDF – great success!
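
A rough Python sketch of the shape such a helper utility could take; the site table and field names here are hypothetical, just to illustrate mapping one logical request onto each site's conventions.

# Hypothetical per-site quirks table: whole-node vs. shared allocation,
# and what the project and partition are called at each site.
SITES = {
    "siteA": {"whole_node": True,  "project": "abc123", "partition": "gpu"},
    "siteB": {"whole_node": False, "project": "xyz789", "partition": "gpuA100x4"},
}

def annex_request(site, mem_gb, gpus=1):
    """Translate one logical request into the knobs a given site actually accepts."""
    cfg = SITES[site]
    req = {"project": cfg["project"], "partition": cfg["partition"], "gpus": gpus}
    if not cfg["whole_node"]:
        # Only sites with shared nodes let us specify a memory request;
        # whole-node sites hand us the entire node regardless.
        req["mem_mb"] = mem_gb * 1024
    return req

For example, annex_request("siteB", mem_gb=80) includes mem_mb=81920, while on "siteA" the memory request is dropped because the whole node is allocated anyway.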

7 of 19

Annexes, NAIRR resources, and lessons learned

  • “Gotchas” observed (not all are fully confirmed: sorted by how much “blame” can be attributed to user error):
    • Hard to understand failures and idle jobs.
    • Would love annex create --mem 128GB instead of annex create --mem_mb 131072, to be consistent with RequestMemory=128GB
    • +JobDurationCategory = "Medium" default on all jobs on ap40
      • With epochs taking 18-25 hours, this one was a big headache before I realized what was happening
    • Job requesting an annex must be in the queue before the annex can be created
      • Understandable, but complicates semi-automated workflows. A --know_what_im_doing_so_please_start_the_annex=true option would be nice
    • gpus_minimum_memory and gpus_minimum_capability in the submit definition resulted in jobs not matching
    • Frustrating when a job goes on hold for over-utilization of resources when I’m running in a single job, single node annex modality.
      • Solution: know your code.
  • Organization and management is a pain, so…

8 of 19

Overview of “MLDAG”

  • Objective:
    • Adapt Thinh Nguyen’s work from last summer to generate DAGs for the heterogeneous resource experiment
      • Shish-kebab DAGs, with NAIRR resource shuffling via annex
    • Less about a production-level tool and more about recognizing shortcomings, pain points, and capabilities in the current system.
      • And also retaining my sanity…
  • Sidequest:
    • Extend to support other common “experiment” workflows like hyperparameter sweeps

https://xkcd.com/927/

9 of 19

Overview of “MLDAG”

  • An “experiment” consists of x “training runs”, each consisting of y epochs of training split across n training jobs (nodes); a minimal structural sketch follows this list
    • The training runs typically differ in user-specified ways (targeted resources, hyperparameters, …)
  • Each training job has a common submit definition
    • In my case, each targets a different NAIRR resource, but this is unlikely to be a common pattern… so I’ll ignore it (although the utility supports resource-specific configuration)
  • Each node can also be associated with an evaluation job to “check in” with the training process
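
One way to picture that structure, as a minimal Python sketch; the class and field names are mine, not MLDAG's actual internals.

from dataclasses import dataclass, field

@dataclass
class TrainingJob:
    epoch_start: int            # first epoch this node trains
    epochs: int                 # i.e. epochs_per_job
    resource: str = "default"   # e.g. a targeted NAIRR resource, if any
    has_eval: bool = True       # paired evaluation job to "check in" on training

@dataclass
class TrainingRun:
    run_number: int
    variables: dict             # user-specified differences (hyperparameters, resources, ...)
    jobs: list[TrainingJob] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    runs: list[TrainingRun] = field(default_factory=list)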

10 of 19

Experiment configuration and DAG generation

  • Experiment.yaml file to capture:
    • Name of the experiment
    • Template for submit description
    • Variables that either define training-run behavior (total number of epochs, epochs per job) or enumerate the possible values a variable can take, defining the shape of the experiment
  • DAG generation script will (a minimal sketch follows this list):
    • Fan out the variables
    • Create Training Runs for each combination of variables
    • Define job nodes with appropriate VARS
    • Optional special sauce and useful (to me) defaults
      • Site-specific resource requests with site targeting and shuffles
      • Random seed and unique ID per training run
      • Handles for service node definitions, PRE- and POST-scripts
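
A minimal sketch of the fan-out and node-generation step; it is simplified (no resource shuffling, service-node hooks, or value-type vars), and the function names are illustrative rather than MLDAG's actual API.

import itertools
import uuid

def fan_out(range_vars):
    """Expand range-type vars into every combination (the 'shape' of the experiment)."""
    names = list(range_vars)
    grids = [range(v["start"], v["stop"], v["step"]) for v in range_vars.values()]
    for combo in itertools.product(*grids):
        yield dict(zip(names, combo))

def write_dag(path, epochs, epochs_per_job, range_vars, submit_file="default_pretrain.sub"):
    """Write one shish-kebab of training nodes per variable combination."""
    with open(path, "w") as dag:
        for run_number, combo in enumerate(fan_out(range_vars)):
            run_uuid = uuid.uuid4().hex[:8]   # unique ID per training run
            prev = None
            for epoch in range(1, epochs + 1, epochs_per_job):
                node = f"run{run_number}-train_epoch{epoch}"
                dag.write(f"JOB {node} {submit_file}\n")
                dag.write(f'VARS {node} epoch="{epoch}" run_uuid="{run_uuid}" ResourceName="default"\n')
                swept = " ".join(f'{k}="{v}"' for k, v in combo.items())
                dag.write(f'VARS {node} epochs="{epochs}" epochs_per_job="{epochs_per_job}" '
                          f'run_number="{run_number}" {swept}\n')
                if prev:
                    dag.write(f"PARENT {prev} CHILD {node}\n")   # chain the shish-kebab
                prev = node

Called with the alpha range from the example configuration on the next slide (epochs=30, epochs_per_job=5, range_vars={"alpha": {"start": 0, "stop": 10, "step": 1}}), this produces JOB/VARS lines like the ones shown there.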

11 of 19

name: "Global Pretraining"

submit_template: |
  universe = container
  container_image = osdf:///ospool/ap40/data/ian.ross/metl_global.sif

  request_disk = {resource.disk}
  request_memory = {resource.mem_mb}
  request_cpus = {resource.cpus}
  request_gpus = {resource.gpus}
  gpus_minimum_memory = {resource.gpu_memory}
  gpus_minimum_capability = 7.5

  executable = /bin/python
  transfer_executable = false
  arguments = pretrain.py --learning_rate=$(alpha)

  queue

vars:
  epochs:
    value: 30
    type: value
    description: "Number of epochs to train for"
  epochs_per_job:
    value: 5
    type: value
    description: "Number of epochs to train in each job"
  alpha:
    start: 0
    stop: 10
    step: 1
    type: range
    description: "learning rate example"

The {resource.*} fields get filled in with specific values for each NAIRR resource passed in during DAG generation, if any; defaults can be set for more general usage. $(alpha) is just standard VAR usage, but it must match a definition in the "vars" field.

The vars get slotted into VARS in the generated DAG, along with some other internal bookkeeping variables:

JOB run0-train_epoch1 default_pretrain.sub
VARS run0-train_epoch1 epoch="1" run_uuid="d35b3ea9" ResourceName="default"
VARS run0-train_epoch1 epochs="30" epochs_per_job="5" run_number="0" alpha="0"

JOB run1-train_epoch1 default_pretrain.sub
VARS run1-train_epoch1 epoch="1" run_uuid="c3062075" ResourceName="default"
VARS run1-train_epoch1 epochs="30" epochs_per_job="5" run_number="0" alpha="1"

12 of 19

Annex-creating Service node hackery

  • Challenge: My experiment is targeting 30 training runs, each of which is a 30-node shish-kebab, with each node running in a different annex. How can I organize the work and simplify my life?
  • Constraints:
    • Job requesting an annex must be in the queue before annex can be created
    • The “shuffle” nature of the experiment and transience of annexes precludes a PROVISION node to create an annex at each site

13 of 19

Annex-creating Service node hackery

  • Hack:
    • PRESCRIPT that drops an annex creation “request” into a directory
    • Add MY.TargetAnnexName = "unique_annex_name" to the submit description
    • Create a service node that watches and acts on those requests (a watcher sketch follows this list)
      • Hooks directly into the annex create/add codepaths within the htcondor-cli
  • …but automatic annex creation is a tricky and dangerous business.
    • Two-factor authentication at many of the NAIRR sites
      • So the utility can be run interactively as well…
    • Don’t want it to “go rogue” and spin up requests unnecessarily
    • Worth thinking about if we anticipate heavy NAIRR annex usage.
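
A rough sketch of the watcher's shape; the request directory layout and file format are hypothetical, and where this sketch only prints, the real service node hooks into the htcondor-cli annex create/add codepaths.

import json
import pathlib
import time

REQUEST_DIR = pathlib.Path("annex_requests")   # PRE scripts drop request files here (assumed layout)
HANDLED_DIR = REQUEST_DIR / "handled"

def create_annex(request):
    """Placeholder: the real implementation calls into the htcondor-cli annex code for the requested site."""
    print(f"would create annex {request['annex_name']} at {request['resource']}")

def watch(poll_seconds=60):
    HANDLED_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for req_file in sorted(REQUEST_DIR.glob("*.json")):
            request = json.loads(req_file.read_text())
            create_annex(request)
            req_file.rename(HANDLED_DIR / req_file.name)   # don't act on the same request twice
        time.sleep(poll_seconds)

Keeping annex creation behind explicit, per-request files also makes it easier to audit, and to run the same logic interactively when a site's two-factor authentication gets in the way of a long-lived service.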

14 of 19

“MLDAG” status

  • Very much a prototype, but the initial implementation is done
  • Trying to do some cleanup to remove the worst of the supplement-specific pieces
  • Documentation and more testing
    • Including some dogfooding
  • Probably a round or two of iterations with RCFs and power user(s)
  • Other feature requests or experimentation?

…I don’t have a picture of my dog eating dogfood, so my dog as food will have to do

15 of 19

Lessons learned

  • It’s not hard to create these VAR expansion (hyper)parameter sweep-type DAGs, but there’s not one clear way to do it.
  • It is hard to break my existing DAG habits and patterns 😆
  • Leveraging annexes within a DAG is tricky for obvious reasons
  • I found myself conceptually thinking of this as a collection of SUBDAGs, with jobs able to access SUBDAG-scoped variables
    • Obviously useful for hyperparameter sweeps, but can see benefits for InitialDir-style organization, dataset slicing
  • Macros for DAG composition?
    • subdag alpha, beta from parameters.txts

16 of 19

PATh supplement, timelines

  • December 2024: Finish profiling and baseline training
  • Soon™: Dedicated AP (ordering + provisioning)
    • Expected shipping in May
  • January 2025: Documentation, guides, tutorials re: distributed training in OSPool
    • New start-to-finish example created and presented at the UW-Madison Research Bazaar 2025 – “GPU Access and AI workflows in the Center for High Throughput Computing”
    • But always a need for updated guides…
  • February/March 2025: Overlay of NAIRR resources in HTCSS “HPC Annex”
  • May 2025: First round of distributed heterogeneous training, preliminary report
    • Ongoing
  • June/July: Documentation + tutorials of high throughput training+inference in OSPool/NAIRR
  • August: Second round of distributed training and final evaluations, final report

17 of 19

Other ongoing work – understanding usage and needs

  • 24 responses to “GPU Usage within the OSPool” poll

18 of 19

Future plans

  • Supplement
  • Summer of inference
    • What can we do to support users who come to us with (or create) a model (or set of models)?
    • Is batch-processing inference the best solution?
    • Would a CMS- or IceCube-inspired “coprocessor” model benefit users broadly, or are these workflows too specialized?
  • Documentation and guide updates

19 of 19

Discussion

  • What kinds of commonly repeated workloads are we seeing? What can we do to better support users doing these things?
    • Hyperparameter sweeps
    • Inference
      • Batch
      • Offloaded
  • How can we clearly articulate what is and isn’t a good fit for our services?