Throughput Machine Learning
Ian Ross
PATh Staff Meeting 2025.04.30
This project is supported by the National Science Foundation under Cooperative Agreements OAC-2030508. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
PATh ML supplement
Microsoft Copilot prompt: A cartoon image of a pelican and a condor holding hands over a landscape made of computer chips at sunset
NAIRR in a nutshell
[Workflow figure: pretrain a protein LLM using ~10M biophysical simulations (~40GB on disk), with training jobs spread across Resources A–E and repeated Eval steps between them, producing a pretrained model; finetune using ~100 experimental measurements of protein sequence mutations (example mutated sequences such as MQHTYPAQLRRFGQA shown); compare n model training path permutations. The result is a model capable of more accurate predictions than training on the limited experimental data directly.]
PATh supplement, workflow and science case
Objectives:
Domain benchmark from the Gitter lab: training a protein language model for use in protein engineering
[Figure: training pipeline, Epoch 1 through Epoch 5 (~20h/epoch on A100) (~1min/epoch)]
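At ~20h/epoch on an A100, a full pretraining run far exceeds typical job time limits, so training is split into shorter jobs that checkpoint and resume. A minimal sketch of that resume logic, using a JSON checkpoint file; all names and the file format are illustrative assumptions, not the project's actual `pretrain.py`:

```python
import json
import os

def train(total_epochs=30, epochs_per_job=5, ckpt="checkpoint.json"):
    """Run up to epochs_per_job epochs, resuming from a checkpoint if present.

    Hypothetical sketch: real training state (model weights, optimizer) would
    be checkpointed too, not just the epoch counter.
    """
    # Resume from the last completed epoch if a previous job left a checkpoint.
    start = 0
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            start = json.load(f)["epoch"]

    # Train only this job's share of epochs; the DAG schedules the rest.
    end = min(start + epochs_per_job, total_epochs)
    for epoch in range(start, end):
        # ...one epoch of training would run here...
        # Persist progress so the next job in the DAG can resume.
        with open(ckpt, "w") as f:
            json.dump({"epoch": epoch + 1}, f)

    return end

# First job covers epochs 0-4, the next resumes at 5, and so on,
# until total_epochs is reached.
```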
PATh supplement, status
Annexes, NAIRR resources, and lessons learned
Overview of “MLDAG”
https://xkcd.com/927/
Experiment configuration and DAG generation
name: "Global Pretraining"
submit_template: |
  universe = container
  container_image = osdf:///ospool/ap40/data/ian.ross/metl_global.sif
  request_disk = {resource.disk}
  request_memory = {resource.mem_mb}
  request_cpus = {resource.cpus}
  request_gpus = {resource.gpus}
  gpus_minimum_memory = {resource.gpu_memory}
  gpus_minimum_capability = 7.5
  …
  executable = /bin/python
  transfer_executable = false
  arguments = pretrain.py --learning_rate=$(alpha)
  …
  queue
vars:
  epochs:
    value: 30
    type: value
    description: "Number of epochs to train for"
  epochs_per_job:
    value: 5
    type: value
    description: "Number of epochs to train in each job"
  alpha:
    start: 0
    stop: 10
    step: 1
    type: range
    description: "learning rate example"
These get filled in with specific values for each NAIRR resource, passed in during DAG generation, if any. Defaults can be set for more general usage.
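Placeholders like `{resource.disk}` can be expanded with standard Python string formatting, which supports attribute lookups inside format fields. A rough sketch under that assumption; the `Resource` class and its values are illustrative, not MLDAG's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """Illustrative per-resource requirements (values are made up)."""
    disk: str = "20GB"
    mem_mb: int = 16384
    cpus: int = 4
    gpus: int = 1
    gpu_memory: int = 40000

# Fragment of the submit_template from the config above.
SUBMIT_TEMPLATE = """\
request_disk = {resource.disk}
request_memory = {resource.mem_mb}
request_cpus = {resource.cpus}
request_gpus = {resource.gpus}
gpus_minimum_memory = {resource.gpu_memory}
"""

def render_submit(resource: Resource) -> str:
    # str.format resolves dotted fields like {resource.disk} as
    # attribute access on the passed-in object.
    return SUBMIT_TEMPLATE.format(resource=resource)

print(render_submit(Resource()))
```

A per-resource `Resource` instance (or a default one) then yields a concrete submit description for each NAIRR resource.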
Just standard VARS usage, but each must match a definition in the “vars” field.
These get slotted into VARS in the generated DAG, along with some other internal bookkeeping variables:
JOB run0-train_epoch1 default_pretrain.sub
VARS run0-train_epoch1 epoch="1" run_uuid="d35b3ea9" ResourceName="default"
VARS run0-train_epoch1 epochs="30" epochs_per_job="5" run_number="0" alpha="0"
JOB run1-train_epoch1 default_pretrain.sub
VARS run1-train_epoch1 epoch="1" run_uuid="c3062075" ResourceName="default"
VARS run1-train_epoch1 epochs="30" epochs_per_job="5" run_number="1" alpha="1"
…
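A range-type var such as `alpha` expands into one run per value. A simplified sketch of how a generator could emit the JOB/VARS lines above; the function and its details are illustrative, not MLDAG's actual code:

```python
import uuid

def generate_dag(submit_file="default_pretrain.sub",
                 epochs=30, epochs_per_job=5, alpha=(0, 10, 1)):
    """Emit JOB/VARS lines for one sweep over a range-type var.

    alpha is (start, stop, step), matching the vars definition above.
    """
    lines = []
    for run_number, a in enumerate(range(*alpha)):
        run_uuid = uuid.uuid4().hex[:8]  # short per-run bookkeeping id
        node = f"run{run_number}-train_epoch1"
        lines.append(f"JOB {node} {submit_file}")
        lines.append(f'VARS {node} epoch="1" run_uuid="{run_uuid}" '
                     f'ResourceName="default"')
        lines.append(f'VARS {node} epochs="{epochs}" '
                     f'epochs_per_job="{epochs_per_job}" '
                     f'run_number="{run_number}" alpha="{a}"')
    return "\n".join(lines)

print(generate_dag())
```

With the defaults above this yields ten runs (alpha 0 through 9), each with its own node name, uuid, and VARS lines.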
Annex-creating Service node hackery
“MLDAG” status
…I don’t have a picture of my dog eating dogfood, so my dog as food will have to do
Lessons learned
PATh supplement, timelines
Other ongoing work – understanding usage and needs
Future plans
Discussion