1 of 24

A WORKFLOW FOR ERROR ANALYSIS FOR DRUG RESPONSE PREDICTION

VIA STATISTICAL STANDARDIZATION AND DISTRIBUTION ANALYSIS

JAKE GWINN, JUSTIN M. WOZNIAK, RAJEEV JAIN, YITAN ZHU, ALEXANDER PARTIN, THOMAS BRETTIN, AND RICK STEVENS

DATA SCIENCE & LEARNING, ARGONNE NATIONAL LABORATORY

WORKS @ SC — November 17, 2025 — St. Louis

2 of 24

OVERVIEW: HPC WORKFLOWS

Workflows are more relevant than ever in a time of increasing automation

  • Exascale is here – Aurora available to users as of the spring
  • Experimental science centers around the world are experiencing massive increases in data velocity and quantity
  • Emerging applications are putting AI first
  • Modern systems are diverse with many usage barriers
  • Need scalable, portable solutions that deliver this power to real applications
  • Will describe recent efforts:
    • Deep learning workflows for cancer
    • Use of the Swift/T workflow system
    • A new approach for finding rare drugs
    • Results from runs on ALCF Aurora

[Diagram: Workflows = Cancer Data Sets + Training Tasks]

3 of 24

OUTLINE

  • Overview of ECP CANDLE
  • Computing environment
  • CANDLE/Supervisor framework
  • Workflow technologies
  • Statistical analysis of drug response prediction accuracy
  • Results from the High Error Drug workflow


4 of 24

SWIFT/T: DESIGNED FOR EXASCALE

Hierarchical concurrency in MPI environments

  • Single-site, tightly-coupled, massively parallel workflows
  • Integrates tasks from many scripting languages
  • Originally built for exotic, limited systems (BlueGene, SiCortex)

Finalist 2020

$ conda install -c swift-t swift-t

[Diagram: hierarchical workflow combining many ML tasks around SIM tasks]

5 of 24

THE SWIFT PROGRAMMING MODEL

  • F() and G() implemented in native code or external programs
  • F() and G() run concurrently in different processes
  • r is computed when they are both done
  • This parallelism is automatic
  • Works recursively throughout the program’s call graph

All progress driven by concurrent dataflow

(int r) myproc (int i, int j)
{
  int x = F(i);
  int y = G(j);
  r = x + y;
}
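The same dataflow pattern can be sketched in standard Python with futures; F and G here are hypothetical stand-ins for the native code or external programs Swift/T would call:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder leaf tasks; in Swift/T these would be native code or external programs.
def F(i):
    return i * 2

def G(j):
    return j + 1

def myproc(i, j):
    # Launch F and G concurrently; r is computed only when both are done.
    with ThreadPoolExecutor() as pool:
        x = pool.submit(F, i)
        y = pool.submit(G, j)
        return x.result() + y.result()  # blocks until both futures resolve

print(myproc(3, 4))  # F(3)=6, G(4)=5, so r=11
```

Unlike this sketch, Swift/T applies the pattern automatically and recursively across the whole call graph, with no explicit executor.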

6 of 24

SWIFT SYNTAX

  • Data types

    int i = 4;
    string s = "hello world";
    file image<"snapshot.jpg">;

  • Shell access

    app (file o) myapp(file f, int i)
    { mysim "-s" i @f @o ; }

  • Structured data

    typedef image file;
    image A[];
    type protein_run {
      file pdb_in; file sim_out;
    }
    bag<blob>[] B;

  • Conventional expressions

    if (x == 3) {
      y = x+2;
      s = strcat("y: ", y);
    }

  • Parallel loops

    foreach f,i in A {
      B[i] = convert(A[i]);
    }

  • Data flow

    merge(analyze(B[0], B[1]),
          analyze(B[2], B[3]));

  • Swift: A language for distributed parallel scripting. J. Parallel Computing, 2011
  • Compiler techniques for massively scalable implicit task parallelism. Proc. SC, 2014

7 of 24

ASYNCHRONOUS DYNAMIC LOAD BALANCER

  • An MPI library for master-worker workloads in C
  • Uses a variable-size, scalable network of servers
  • Servers implement work-stealing
  • The work unit is a byte array
  • Optional work priorities, targets, types

  • For Swift/T, we added:
    • Server-stored data
    • Data-dependent execution
    • Parallel tasks

ADLB for short

  • Lusk et al. More scalability, less pain: A simple programming model and its implementation for extreme computing. SciDAC Review 17, 2010
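The core master-worker pattern can be sketched on a single node with standard Python; this is not the real ADLB API (no MPI, no work-stealing servers), but it shows the essential shape: opaque byte-array work units pulled from a shared pool by many workers.

```python
import queue
import threading

def run_master_worker(work_units, n_workers=4):
    """Master enqueues byte-array work units; workers pull and process them."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            unit = tasks.get()
            if unit is None:              # poison pill: no more work
                break
            out = unit.decode().upper()   # stand-in for real task execution
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for unit in work_units:               # master distributes work
        tasks.put(unit)
    for _ in threads:                     # one shutdown signal per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(results)                # completion order is nondeterministic

print(run_master_worker([b"sim1", b"sim2", b"ml1"]))  # ['ML1', 'SIM1', 'SIM2']
```

In ADLB the queue is replaced by a scalable network of server ranks that steal work from one another, and units carry optional priorities, targets, and types.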

8 of 24

MPI: THE MESSAGE PASSING INTERFACE

  • Programming model used on large supercomputers
  • Can run on many networks, including sockets, or shared memory
  • Standard API for C and Fortran; other languages have working implementations
  • Contains communication calls for
    • Point-to-point (send/recv)
    • Collectives (broadcast, reduce, etc.)
  • Interesting concepts
    • Communicators: collections of communicating processes and a context
    • Data types: a language-independent data marshaling scheme

9 of 24

CANDLE/SUPERVISOR OVERVIEW

  • CANDLE/Supervisor consists of several high-level workflows:
    • Capable of modifying/controlling application parameters dynamically as the workflow progresses and training runs complete
    • Distribute work across large computing infrastructure, manage progress
  • Underlying applications are Python programs that use TensorFlow/PyTorch

  • “User code” shown in blue
  • “Utilities” shown in white
  • New studies would be developed by modifying the blue sections


10 of 24

CANDLE/SUPERVISOR IMPLEMENTATION

  • Runs start with a test script
  • CFG scripts contain settings for a system or parameters for a given study (e.g., search space)
  • Reusable site settings
  • The workflow shell script sets up the run
  • Swift/T launches and manages the workflow
  • Reusable Model scripts set up each app run
  • The DL app uses TF plus CANDLE Python utilities


11 of 24

LEARNING ON REAL SUPERCOMPUTERS

Steep learning curve with myriad technologies

  • Workflow manager (Swift/T, EMEWS); scheduler; scripting
  • Deep learning (Keras, TensorFlow, Horovod)
  • Optimization algorithms (R, Python)
  • MPI implementation (MVAPICH, Open MPI)
  • etc. …

12 of 24

EXAMPLE: INCREMENTAL LEARNING

  • Leave-one-out training workflow to probe the data
  • Split the training data into subsets and iteratively train on the remaining subsets, holding one out at each stage
  • Weight sharing from one subset to the next (incremental learning)

run_stage(int N, int S, string this, int stage,
          void block, string plan_id)
{
  void parent = run_single(this, stage, block, plan_id);
  if (stage < S) {
    foreach id_child in [1:N] {
      run_stage(N, S, this+"."+id_child, stage+1, parent,
                plan_id);
    }
  }
}

(void v) run_single(string node, int stage, void block,
                    string plan_id)
{
  json_fragment = make_json_fragment(node, stage);
  json = "{\"node\": \"%s\", %s}" % (node, json_fragment);
  block => v = obj(json, node);
}

  • Allows for investigations into data quality and learning patterns
  • Could also boost performance by avoiding overload of data ingest limits
  • Recursive calls define the datasets for training
  • Runs at large scale on Summit, ramp-up/down
  • High-bypass learning: Automated detection of tumor cells that significantly impact drug response. Proc. MLHPC 2020.

13 of 24

HIGH ERROR DRUGS (HEDS)

  • When using machine learning to search for the best cancer drugs for a particular case, a challenge is that the good drugs are out-of-distribution with respect to the typical training set.
  • Better drug candidates have low AUC scores
  • As shown in the plot, very few drugs are strong candidates
  • However, training a deep learning model to reproduce the best candidates is very difficult
  • Essentially want to train the model to be more accurate on the rarer high error drugs
  • Many alternative methods exist in the literature, including methods that modify the data set, which could be evaluated with the framework

14 of 24

DISTRIBUTION OF ERRORS

  • Used the previously developed Uno model
  • Uno uses the drug descriptors and cell line data to predict an AUC score
  • Analyzed the AUC error by Z-score
  • The few drug/cell combinations with higher Z-scores are farther out of distribution and have higher errors
  • Need to focus training on these drugs to get better end ranking
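The standardization step can be sketched with synthetic numbers (the real analysis uses per-drug/cell AUC prediction errors from Uno):

```python
import statistics

def z_scores(errors):
    """Standardize a list of prediction errors to z-scores."""
    mu = statistics.mean(errors)
    sigma = statistics.stdev(errors)
    return [(e - mu) / sigma for e in errors]

def out_of_distribution(errors, threshold=2.0):
    """Indices of drug/cell combinations whose error z-score exceeds a threshold."""
    return [i for i, z in enumerate(z_scores(errors)) if z > threshold]

# Synthetic per-combination errors; the last one is far out of distribution.
errors = [0.01, 0.02, 0.015, 0.012, 0.018, 0.011, 0.016, 0.014, 0.30]
print(out_of_distribution(errors))  # [8]
```

The flagged indices are exactly the rare, high-error combinations that subsequent training should emphasize.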

15 of 24

NEW ERROR METRIC

  • Developed a new error metric, compatible with TensorFlow, to steer training toward better accuracy on low-AUC drugs
  • Alpha parameter is tunable – higher alpha is more attuned to low AUC
  • Side effect is higher error on ordinary drugs, but that is not significant for the study
  • Need to retrain with this error metric across the full data set
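The exact metric is not reproduced here; one plausible shape, where a tunable alpha up-weights the squared error of low-AUC samples, can be sketched as follows (illustrative only, not the metric used in the study):

```python
def weighted_mse(y_true, y_pred, alpha=2.0):
    """Hypothetical alpha-weighted MSE: a sample with true AUC a gets weight
    1 + alpha*(1 - a), so lower-AUC (better-candidate) drugs dominate the loss.
    Higher alpha is more attuned to low AUC."""
    total, weight_sum = 0.0, 0.0
    for auc, pred in zip(y_true, y_pred):
        w = 1.0 + alpha * (1.0 - auc)   # higher weight for lower true AUC
        total += w * (auc - pred) ** 2
        weight_sum += w
    return total / weight_sum

# The same absolute error costs more on the low-AUC sample than the high-AUC one:
a = weighted_mse([0.2, 0.9], [0.3, 0.9])  # error on the low-AUC sample
b = weighted_mse([0.2, 0.9], [0.2, 1.0])  # same-size error on the high-AUC sample
print(a > b)  # True
```

The side effect noted above falls out directly: ordinary (high-AUC) drugs get relatively less weight, so their error can grow.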

16 of 24

WORKFLOW SPECIFICATION

  • Derived HED workflow from prior “UPF” workflow that accepts lists of hyperparameters
  • Hyperparameters here are used to select drugs to move from the training set to the test set

  • Each training run rebuilds the data sets in local storage on Aurora
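The drug-selection step can be sketched as a plain function: given a per-drug mean AUC table, drugs whose AUC falls in the range named by the "hyperparameter" move from training to test (drug names and the range boundaries here are hypothetical):

```python
def split_by_auc(drug_auc, low, high):
    """Move drugs whose mean AUC lies in [low, high) from training to test."""
    train, test = {}, {}
    for drug, auc in drug_auc.items():
        if low <= auc < high:
            test[drug] = auc
        else:
            train[drug] = auc
    return train, test

drug_auc = {"drugA": 0.35, "drugB": 0.62, "drugC": 0.48, "drugD": 0.91}
train, test = split_by_auc(drug_auc, 0.0, 0.5)  # hold out the strong candidates
print(sorted(test))   # ['drugA', 'drugC']
print(sorted(train))  # ['drugB', 'drugD']
```

Each workflow task then rebuilds its training and test sets from such a split in Aurora's local storage.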

17 of 24

VARIOUS I/O IN HED WORKFLOW

  • Staging: Using a previously developed technique from X-ray science to stage code and data to local storage using MPI-IO
  • Runs as a hook inside the Swift/T workflow
  • Critical to get Anaconda + TensorFlow installed at scale
  • Checkpointing: Using the previously developed CANDLE checkpoint module
  • Allows for metadata-rich checkpoints, automatic cleanup optionally based on training progress
  • Overall, this minimizes accesses to the parallel filesystem
  • Used an assembly of CCLE, CTRPv2, gCSI, GDSCv1, and GDSCv2 data sets, integrated by the IMPROVE project
  • Contains 770 unique drugs with many samples per drug

18 of 24

RESULTS FOR 10 AUC RANGES


19 of 24

RESULTS FOR 2 AUC RANGES


20 of 24

CHALLENGES: INCREASING COMPLEXITY

  • How to manage and run:
    • Large software packages (10s of GBs)
    • Different programming languages in the same workflow
  • How to manage highly specialized, heterogeneous workloads
    • Determine what will be done on CPU/GPU systems and what will be done on specialized hardware
    • How to reallocate resources in the middle of the run
  • How to maintain scientific integrity while using opaque services
  • Most important: How to create a roadmap from rapid prototyping to performant, scalable applications


  • Software Monsters: Quantifying, reporting, and controlling composite applications.
  • ASCR Workshop on the Science of Scientific-Software Development and Use 2021.

21 of 24

TAKEAWAYS

  • Took the approach of building systems specifically for exascale infrastructure
  • Built with standard HPC tools like MPI, and focused on high-performance integrations such as library calls
  • Demonstrated reusability of Supervisor framework for a totally different kind of investigation
  • Demonstrated the ability to optimize training results for interesting drugs
  • Future automation in science will rely on fast, reliable systems that can work together

Reusability in a framework for deep learning and cancer

22 of 24

THANKS

  • Thanks to the organizers

  • Code and guides:
    • CANDLE GitHub: https://github.com/ECP-CANDLE
    • Swift/T Home: http://swift-lang.org/Swift-T

  • This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.


23 of 24

ACKNOWLEDGMENTS

  • This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research was completed with resources provided by the Laboratory Computing Resource Center at Argonne National Laboratory.
  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357.
  • This work was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Materials Sciences and Engineering and used resources of the Advanced Photon Source, a U.S. Department of Energy Office of Science User Facility at Argonne National Laboratory.

24 of 24

QUESTIONS?

WOZ@ANL.GOV

www.anl.gov