1 of 19

Throughput Machine Learning

Ian Ross

PATh Staff Meeting 2025.04.30

This project is supported by the National Science Foundation under Cooperative Agreement OAC-2030508. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

2 of 19

PATh ML supplement

  • ~1 year supplement to PATh to “profile the effects of…training and inference on distributed, heterogeneous capacity”
    • Use HTCondor annexes to run training jobs on National AI Research Resource (NAIRR) systems
      • Intentionally shuffle training runs between sites at epoch boundaries to test resilience and the impact of heterogeneity (a checkpoint/resume sketch follows this list)
    • Compare models trained in distributed fashion to baseline (single-node trained) versions
  • Exercise our technologies to see how they operate in a heterogeneous “AI” setting while establishing single-point access to NAIRR compute
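
Shuffling a training run between sites at epoch boundaries only works if each job can pick up exactly where the previous one left off. Below is a minimal checkpoint/resume sketch, assuming a PyTorch-style training loop; the checkpoint filename and state layout are illustrative, not the actual pretrain.py internals.

import os
import torch

CKPT = "checkpoint.pt"  # assumed to be transferred between jobs/sites along with the rest of the sandbox

def load_or_init(model, optimizer):
    """Resume from a prior epoch's checkpoint if one was transferred in."""
    start_epoch = 0
    if os.path.exists(CKPT):
        state = torch.load(CKPT, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1
    return start_epoch

def save_checkpoint(model, optimizer, epoch):
    """Write the epoch-boundary checkpoint so the next job (possibly at another site) can continue."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT)

Each training job then runs its epochs_per_job epochs starting from start_epoch, saves, and exits, leaving the checkpoint for the next node in the chain.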

Microsoft Copilot prompt: A cartoon image of a pelican and a condor holding hands over a landscape made of computer chips at sunset

3 of 19

NAIRR in a nutshell

  • “The National Artificial Intelligence Research Resource (NAIRR) is a concept for a shared national research infrastructure to bridge this gap by connecting U.S. researchers to responsible and trustworthy Artificial Intelligence (AI) resources, as well as the needed computational, data, software, training, and educational resources to advance research, discovery, and innovation.” From nairrpilot.org
  • We applied for, and received, allocations at 6 computational resources for this experiment.
    • Expanse, Bridges-2, Anvil, Delta, Jetstream2, AWS

4 of 19

PATh supplement, workflow and science case

Objectives:

    • Characterize the impact of training models and ensembles across heterogeneous resources
    • Advance PATh Access Point capabilities to reduce the barrier of entry and support ML workloads in distributed and heterogeneous environments like NAIRR

Domain benchmark from the Gitter lab: training a protein language model for use in protein engineering.

[Workflow diagram: a protein LLM is pretrained using ~10M biophysical simulations (~40GB on disk, ~20h/epoch on an A100), with successive epochs (Epochs 1-5 shown) shuffled across Training Resources A-E and an evaluation job after each epoch. The pretrained model is then finetuned using ~100 experimental measurements of protein sequence mutations (~1min/epoch; example mutated sequences shown), yielding a model capable of more accurate predictions than training on the limited experimental data directly. The experiment compares n model training path permutations.]

5 of 19

PATh supplement, status

  • Training process working on Delta, Expanse, Bridges-2, OSPool, CHTC
    • Manual job submissions to get the ball rolling…
  • It took longer than anticipated to iron out wrinkles and get consistent epochs running…
  • Ongoing – working to get one heterogeneous training run complete (8 of ~30 epochs)
  • In parallel – adapting a DAG generator to ease management and organization
    • Plenty more to come on this…

6 of 19

Annexes, NAIRR resources, and lessons learned

  • Overall things on our side seem to be in pretty good shape
    • Documentation, annex creation, htcondor-cli, all good with a few odd snares (to come)
  • Bit of a pain to manage and organize across 6 different sites (plus CHTC and OSPool), each with its own documentation, quirks, and policies…
    • Solution: create a helper utility to smooth over the differences between sites (e.g., a job needs 80GB of memory, but site A allocates whole nodes, so annex creation won’t let you specify a memory request; usernames, project names, partition names, “optimal” resource requests, …); a sketch of this idea follows this list
  • Storing apptainer image and data in OSDF – great success!
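
A rough Python sketch of the shape such a helper utility could take; the site table and field names here are hypothetical, just to illustrate mapping one logical request onto each site's conventions.

# Hypothetical per-site quirks table: whole-node vs. shared allocation,
# and what the project and partition are called at each site.
SITES = {
    "siteA": {"whole_node": True,  "project": "abc123", "partition": "gpu"},
    "siteB": {"whole_node": False, "project": "xyz789", "partition": "gpuA100x4"},
}

def annex_request(site, mem_gb, gpus=1):
    """Translate one logical request into the knobs a given site actually accepts."""
    cfg = SITES[site]
    req = {"project": cfg["project"], "partition": cfg["partition"], "gpus": gpus}
    if not cfg["whole_node"]:
        # Only sites with shared nodes let us specify a memory request;
        # whole-node sites hand us the entire node regardless.
        req["mem_mb"] = mem_gb * 1024
    return req

For example, annex_request("siteB", mem_gb=80) includes mem_mb=81920, while on "siteA" the memory request is dropped because the whole node is allocated anyway.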

7 of 19

Annexes, NAIRR resources, and lessons learned

  • “Gotchas” observed (not all are fully confirmed: sorted by how much “blame” can be attributed to user error):
    • Hard to understand failures and idle jobs.
    • Would love annex create --mem 128GB instead of annex create --mem_mb 131072, to be consistent with RequestMemory=128GB
    • +JobDurationCategory = "Medium" default on all jobs on ap40
      • With epochs taking 18-25 hours, this one was a big headache before I realized what was happening
    • Job requesting an annex must be in the queue before the annex can be created
      • Understandable, but complicates semi-automated workflows. A --know_what_im_doing_so_please_start_the_annex=true option would be nice
    • gpus_minimum_memory and gpus_minimum_capability in the submit definition resulted in jobs not matching
    • Frustrating when a job goes on hold for over-utilization of resources when I’m running in a single job, single node annex modality.
      • Solution: know your code.
  • Organization and management is a pain, so…

8 of 19

Overview of “MLDAG”

  • Objective:
    • Adapt Thinh Nguyen’s work from last summer to generate DAGs for the heterogeneous resource experiment
      • Shish-kebab DAGs, with NAIRR resource shuffling via annex
    • Less about a production-level tool and more about recognizing shortcomings, pain points, and capabilities in the current system.
      • And also retaining my sanity…
  • Sidequest:
    • Extend to support other common “experiment” workflows like hyperparameter sweeps

https://xkcd.com/927/

9 of 19

Overview of “MLDAG”

  • An “experiment” consists of x “training runs”, each consisting of y epochs of training split across n training jobs (nodes); a minimal structural sketch follows this list
    • The training runs typically differ in user-specified ways (targeted resources, hyperparameters, …)
  • Each training job has a common submit definition
    • In my case, each targets a different NAIRR resource, but this is unlikely to be a common pattern… so I’ll ignore it (although the utility supports resource-specific configuration)
  • Each node can also be associated with an evaluation job to “check in” with the training process
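
One way to picture that structure, as a minimal Python sketch; the class and field names are mine, not MLDAG's actual internals.

from dataclasses import dataclass, field

@dataclass
class TrainingJob:
    epoch_start: int            # first epoch this node trains
    epochs: int                 # i.e. epochs_per_job
    resource: str = "default"   # e.g. a targeted NAIRR resource, if any
    has_eval: bool = True       # paired evaluation job to "check in" on training

@dataclass
class TrainingRun:
    run_number: int
    variables: dict             # user-specified differences (hyperparameters, resources, ...)
    jobs: list[TrainingJob] = field(default_factory=list)

@dataclass
class Experiment:
    name: str
    runs: list[TrainingRun] = field(default_factory=list)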

10 of 19

Experiment configuration and DAG generation

  • Experiment.yaml file to capture:
    • Name of the experiment
    • Template for submit description
    • Variables that either define training-run behavior (total number of epochs, epochs per job) or enumerate the possible values a variable can take, defining the shape of the experiment
  • DAG generation script will (a minimal sketch follows this list):
    • Fan out the variables
    • Create Training Runs for each combination of variables
    • Define job nodes with appropriate VARS
    • Optional special sauce and useful (to me) defaults
      • Site-specific resource requests with site targeting and shuffles
      • Random seed and unique ID per training run
      • Handles for service node definitions, PRE- and POST-scripts
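
A minimal sketch of the fan-out and node-generation step; it is simplified (no resource shuffling, service-node hooks, or value-type vars), and the function names are illustrative rather than MLDAG's actual API.

import itertools
import uuid

def fan_out(range_vars):
    """Expand range-type vars into every combination (the 'shape' of the experiment)."""
    names = list(range_vars)
    grids = [range(v["start"], v["stop"], v["step"]) for v in range_vars.values()]
    for combo in itertools.product(*grids):
        yield dict(zip(names, combo))

def write_dag(path, epochs, epochs_per_job, range_vars, submit_file="default_pretrain.sub"):
    """Write one shish-kebab of training nodes per variable combination."""
    with open(path, "w") as dag:
        for run_number, combo in enumerate(fan_out(range_vars)):
            run_uuid = uuid.uuid4().hex[:8]   # unique ID per training run
            prev = None
            for epoch in range(1, epochs + 1, epochs_per_job):
                node = f"run{run_number}-train_epoch{epoch}"
                dag.write(f"JOB {node} {submit_file}\n")
                dag.write(f'VARS {node} epoch="{epoch}" run_uuid="{run_uuid}" ResourceName="default"\n')
                swept = " ".join(f'{k}="{v}"' for k, v in combo.items())
                dag.write(f'VARS {node} epochs="{epochs}" epochs_per_job="{epochs_per_job}" '
                          f'run_number="{run_number}" {swept}\n')
                if prev:
                    dag.write(f"PARENT {prev} CHILD {node}\n")   # chain the shish-kebab
                prev = node

Called with the alpha range from the example configuration on the next slide (epochs=30, epochs_per_job=5, range_vars={"alpha": {"start": 0, "stop": 10, "step": 1}}), this produces JOB/VARS lines like the ones shown there.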

11 of 19

name: "Global Pretraining"

submit_template: |
  universe = container
  container_image = osdf:///ospool/ap40/data/ian.ross/metl_global.sif

  request_disk = {resource.disk}
  request_memory = {resource.mem_mb}
  request_cpus = {resource.cpus}
  request_gpus = {resource.gpus}
  gpus_minimum_memory = {resource.gpu_memory}
  gpus_minimum_capability = 7.5

  executable = /bin/python
  transfer_executable = false
  arguments = pretrain.py --learning_rate=$(alpha)

  queue

vars:
  epochs:
    value: 30
    type: value
    description: "Number of epochs to train for"
  epochs_per_job:
    value: 5
    type: value
    description: "Number of epochs to train in each job"
  alpha:
    start: 0
    stop: 10
    step: 1
    type: range
    description: "learning rate example"

The {resource.*} fields get filled in with specific values for each NAIRR resource passed in during DAG generation, if any; defaults can be set for more general usage. $(alpha) is just standard VAR usage, but it must match a definition in the "vars" field.

The vars get slotted into VARS in the generated DAG, along with some other internal bookkeeping variables:

JOB run0-train_epoch1 default_pretrain.sub
VARS run0-train_epoch1 epoch="1" run_uuid="d35b3ea9" ResourceName="default"
VARS run0-train_epoch1 epochs="30" epochs_per_job="5" run_number="0" alpha="0"

JOB run1-train_epoch1 default_pretrain.sub
VARS run1-train_epoch1 epoch="1" run_uuid="c3062075" ResourceName="default"
VARS run1-train_epoch1 epochs="30" epochs_per_job="5" run_number="0" alpha="1"

12 of 19

Annex-creating Service node hackery

  • Challenge: My experiment is targeting 30 training runs, each of which is a 30-node shish-kebab, with each node running in a different annex. How can I organize the work and simplify my life?
  • Constraints:
    • Job requesting an annex must be in the queue before annex can be created
    • The “shuffle” nature of the experiment and transience of annexes precludes a PROVISION node to create an annex at each site

13 of 19

Annex-creating Service node hackery

  • Hack:
    • PRESCRIPT that drops an annex creation “request” into a directory
    • Add MY.TargetAnnexName = "unique_annex_name" to the submit description
    • Create a service node that watches and acts on those requests (a watcher sketch follows this list)
      • Hooks directly into the annex create/add codepaths within the htcondor-cli
  • …but automatic annex creation is a tricky and dangerous business.
    • Two-factor authentication at many of the NAIRR sites
      • So the utility can be run interactively as well…
    • Don’t want it to “go rogue” and spin up requests unnecessarily
    • Worth thinking about if we anticipate heavy NAIRR annex usage.
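
A rough sketch of the watcher's shape; the request directory layout and file format are hypothetical, and where this sketch only prints, the real service node hooks into the htcondor-cli annex create/add codepaths.

import json
import pathlib
import time

REQUEST_DIR = pathlib.Path("annex_requests")   # PRE scripts drop request files here (assumed layout)
HANDLED_DIR = REQUEST_DIR / "handled"

def create_annex(request):
    """Placeholder: the real implementation calls into the htcondor-cli annex code for the requested site."""
    print(f"would create annex {request['annex_name']} at {request['resource']}")

def watch(poll_seconds=60):
    HANDLED_DIR.mkdir(parents=True, exist_ok=True)
    while True:
        for req_file in sorted(REQUEST_DIR.glob("*.json")):
            request = json.loads(req_file.read_text())
            create_annex(request)
            req_file.rename(HANDLED_DIR / req_file.name)   # don't act on the same request twice
        time.sleep(poll_seconds)

Keeping annex creation behind explicit, per-request files also makes it easier to audit, and to run the same logic interactively when a site's two-factor authentication gets in the way of a long-lived service.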

14 of 19

“MLDAG” status

  • Very much a prototype, but the initial implementation is done
  • Trying to do some cleanup to remove the worst of the supplement-specific pieces
  • Documentation and more testing
    • Including some dogfooding
  • Probably a round or two of iterations with RCFs and power user(s)
  • Other feature requests or experimentation?

…I don’t have a picture of my dog eating dogfood, so my dog as food will have to do

15 of 19

Lessons learned

  • It’s not hard to create these VAR expansion (hyper)parameter sweep-type DAGs, but there’s not one clear way to do it.
  • It is hard to break my existing DAG habits and patterns 😆
  • Leveraging annexes within a DAG is tricky for obvious reasons
  • I found myself conceptually thinking of this as a collection of SUBDAGs, with jobs able to access SUBDAG-scoped variables
    • Obviously useful for hyperparameter sweeps, but can see benefits for InitialDir-style organization, dataset slicing
  • Macros for DAG composition?
    • subdag alpha, beta from parameters.txts

16 of 19

PATh supplement, timelines

  • December 2024: Finish profiling and baseline training
  • Soon™: Dedicated AP (ordering + provisioning)
    • Expected shipping in May
  • January 2025: Documentation, guides, tutorials re: distributed training in OSPool
    • New start-to-finish example created and presented at the UW-Madison Research Bazaar 2025 – “GPU Access and AI workflows in the Center for High Throughput Computing”
    • But always a need for updated guides…
  • February/March 2025: Overlay of NAIRR resources in HTCSS “HPC Annex”
  • May 2025: First round of distributed heterogeneous training, preliminary report
    • Ongoing
  • June/July: Documentation + tutorials of high throughput training+inference in OSPool/NAIRR
  • August: Second round of distributed training and final evaluations, final report

17 of 19

Other ongoing work – understanding usage and needs

  • 24 responses to “GPU Usage within the OSPool” poll

18 of 19

Future plans

  • Supplement
  • Summer of inference
    • What can we do to support users who come to us with (or create) a model (or set of models)?
    • Is batch-processing inference the best solution?
    • Would a CMS- or IceCube-inspired “coprocessor” model benefit users broadly, or are these workflows too specialized?
  • Documentation and guide updates

19 of 19

Discussion

  • What kinds of commonly repeated workloads are we seeing? What can we do to better support users doing these things?
    • Hyperparameter sweeps
    • Inference
      • Batch
      • Offloaded
  • How can we clearly articulate what is and isn’t a good fit for our services?