1 of 24

A WORKFLOW FOR ERROR ANALYSIS FOR DRUG RESPONSE PREDICTION

VIA STATISTICAL STANDARDIZATION AND DISTRIBUTION ANALYSIS

JAKE GWINN, JUSTIN M. WOZNIAK, RAJEEV JAIN, YITAN ZHU, ALEXANDER PARTIN, THOMAS BRETTIN, AND RICK STEVENS

DATA SCIENCE & LEARNING, ARGONNE NATIONAL LABORATORY

WORKS @ SC — November 17, 2025 — St. Louis

2 of 24

OVERVIEW: HPC WORKFLOWS

Workflows are more relevant than ever in a time of increasing automation

  • Exascale is here – Aurora available to users as of the spring
  • Experimental science centers around the world are experiencing massive increases in data velocity and quantity
  • Emerging applications are putting AI first
  • Modern systems are diverse with many usage barriers
  • Need scalable, portable solutions that deliver this power to real applications
  • Will describe recent efforts:
    • Deep learning workflows for cancer
    • Use of the Swift/T workflow system
    • A new approach for finding rare drugs
    • Results from runs on ALCF Aurora

[Diagram: Workflows = Cancer Data Sets + Training Tasks]

3 of 24

OUTLINE

  • Overview of ECP CANDLE
  • Computing environment
  • CANDLE/Supervisor framework
  • Workflow technologies
  • Statistical analysis of drug response prediction accuracy
  • Results from the High Error Drug workflow


4 of 24

SWIFT/T: DESIGNED FOR EXASCALE

Hierarchical concurrency in MPI environments

  • Single-site, tightly-coupled, massively parallel workflows
  • Integrates tasks from many scripting languages
  • Originally built for exotic, limited systems (BlueGene, SiCortex)

Finalist 2020

$ conda install -c swift-t swift-t

[Diagram: hierarchical workflow combining many ML tasks around SIM tasks]

5 of 24

THE SWIFT PROGRAMMING MODEL

  • F() and G() implemented in native code or external programs
  • F() and G() run concurrently in different processes
  • r is computed when they are both done
  • This parallelism is automatic
  • Works recursively throughout the program’s call graph

All progress driven by concurrent dataflow

(int r) myproc (int i, int j)
{
  int x = F(i);
  int y = G(j);
  r = x + y;
}
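The same dataflow pattern can be sketched in standard Python with futures; F and G here are hypothetical stand-ins for the native code or external programs Swift/T would call:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder leaf tasks; in Swift/T these would be native code or external programs.
def F(i):
    return i * 2

def G(j):
    return j + 1

def myproc(i, j):
    # Launch F and G concurrently; r is computed only when both are done.
    with ThreadPoolExecutor() as pool:
        x = pool.submit(F, i)
        y = pool.submit(G, j)
        return x.result() + y.result()  # blocks until both futures resolve

print(myproc(3, 4))  # F(3)=6, G(4)=5, so r=11
```

Unlike this sketch, Swift/T applies the pattern automatically and recursively across the whole call graph, with no explicit executor.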

6 of 24

SWIFT SYNTAX

  • Data types

    int i = 4;
    string s = "hello world";
    file image<"snapshot.jpg">;

  • Shell access

    app (file o) myapp(file f, int i)
    { mysim "-s" i @f @o ; }

  • Structured data

    typedef image file;
    image A[];
    type protein_run {
      file pdb_in; file sim_out;
    }
    bag<blob>[] B;

  • Conventional expressions

    if (x == 3) {
      y = x+2;
      s = strcat("y: ", y);
    }

  • Parallel loops

    foreach f,i in A {
      B[i] = convert(A[i]);
    }

  • Data flow

    merge(analyze(B[0], B[1]),
          analyze(B[2], B[3]));

  • Swift: A language for distributed parallel scripting. J. Parallel Computing, 2011
  • Compiler techniques for massively scalable implicit task parallelism. Proc. SC, 2014

7 of 24

ASYNCHRONOUS DYNAMIC LOAD BALANCER

  • An MPI library for master-worker workloads in C
  • Uses a variable-size, scalable network of servers
  • Servers implement work-stealing
  • The work unit is a byte array
  • Optional work priorities, targets, types

  • For Swift/T, we added:
    • Server-stored data
    • Data-dependent execution
    • Parallel tasks

ADLB for short

  • Lusk et al. More scalability, less pain: A simple programming model and its implementation for extreme computing. SciDAC Review 17, 2010
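The core master-worker pattern can be sketched on a single node with standard Python; this is not the real ADLB API (no MPI, no work-stealing servers), but it shows the essential shape: opaque byte-array work units pulled from a shared pool by many workers.

```python
import queue
import threading

def run_master_worker(work_units, n_workers=4):
    """Master enqueues byte-array work units; workers pull and process them."""
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            unit = tasks.get()
            if unit is None:              # poison pill: no more work
                break
            out = unit.decode().upper()   # stand-in for real task execution
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for unit in work_units:               # master distributes work
        tasks.put(unit)
    for _ in threads:                     # one shutdown signal per worker
        tasks.put(None)
    for t in threads:
        t.join()
    return sorted(results)                # completion order is nondeterministic

print(run_master_worker([b"sim1", b"sim2", b"ml1"]))  # ['ML1', 'SIM1', 'SIM2']
```

In ADLB the queue is replaced by a scalable network of server ranks that steal work from one another, and units carry optional priorities, targets, and types.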

8 of 24

MPI: THE MESSAGE PASSING INTERFACE

  • Programming model used on large supercomputers
  • Can run on many networks, including sockets, or shared memory
  • Standard API for C and Fortran; other languages have working implementations
  • Contains communication calls for
    • Point-to-point (send/recv)
    • Collectives (broadcast, reduce, etc.)
  • Interesting concepts
    • Communicators: collections of communicating processes and a context
    • Data types: a language-independent data marshaling scheme

9 of 24

CANDLE/SUPERVISOR OVERVIEW

  • CANDLE/Supervisor consists of several high-level workflows:
    • Capable of modifying/controlling application parameters dynamically as the workflow progresses and training runs complete
    • Distribute work across large computing infrastructure, manage progress
  • Underlying applications are Python programs that use TensorFlow/PyTorch

  • “User code” shown in blue
  • “Utilities” shown in white
  • New studies would be developed by modifying the blue sections


10 of 24

CANDLE/SUPERVISOR IMPLEMENTATION

  • Runs start with a test script
  • CFG scripts contain settings for a system or parameters for a given study (e.g., search space)
  • Reusable site settings
  • The workflow shell script sets up the run
  • Swift/T launches and manages the workflow
  • Reusable Model scripts set up each app run
  • The DL app uses TF plus CANDLE Python utilities


11 of 24

LEARNING ON REAL SUPERCOMPUTERS

Steep learning curve with myriad technologies

  • Workflow manager (Swift/T, EMEWS); scheduler; scripting
  • Deep learning (Keras, TensorFlow, Horovod)
  • Optimization algorithms (R, Python)
  • MPI implementation (MVAPICH, Open MPI)
  • etc. …

12 of 24

EXAMPLE: INCREMENTAL LEARNING

  • Leave-one-out training workflow to probe the data
  • Split the training data into subsets and iteratively train on the remaining subsets, holding one out at each stage
  • Weight sharing from one subset to the next (incremental learning)

run_stage(int N, int S, string this, int stage,
          void block, string plan_id)
{
  void parent = run_single(this, stage, block, plan_id);
  if (stage < S) {
    foreach id_child in [1:N] {
      run_stage(N, S, this+"."+id_child, stage+1, parent,
                plan_id);
    }
  }
}

(void v) run_single(string node, int stage, void block,
                    string plan_id)
{
  json_fragment = make_json_fragment(node, stage);
  json = "{\"node\": \"%s\", %s}" % (node, json_fragment);
  block => v = obj(json, node);
}

  • Allows for investigations into data quality and learning patterns
  • Could also boost performance by avoiding overload of data ingest limits
  • Recursive calls define the datasets for training
  • Runs at large scale on Summit, ramp-up/down
  • High-bypass learning: Automated detection of tumor cells that significantly impact drug response. Proc. MLHPC 2020.

13 of 24

HIGH ERROR DRUGS (HEDS)

  • When using machine learning to search for the best cancer drugs for a particular case, a challenge is that the good drugs are out-of-distribution with respect to the typical training set.
  • Better drug candidates have low AUC scores
  • As shown in the plot, very few drugs are strong candidates
  • However, training a deep learning model to reproduce the best candidates is very difficult
  • Essentially want to train the model to be more accurate on the rarer high error drugs
  • Many alternative methods exist in the literature, including methods that modify the data set, which could be evaluated with the framework

14 of 24

DISTRIBUTION OF ERRORS

  • Used the previously developed Uno model
  • Uno uses the drug descriptors and cell line data to predict an AUC score
  • Analyzed the AUC error by Z-score
  • The few drug/cell combinations with higher Z-scores are farther out of distribution and have higher errors
  • Need to focus training on these drugs to get better end ranking
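The standardization step can be sketched with synthetic numbers (the real analysis uses per-drug/cell AUC prediction errors from Uno):

```python
import statistics

def z_scores(errors):
    """Standardize a list of prediction errors to z-scores."""
    mu = statistics.mean(errors)
    sigma = statistics.stdev(errors)
    return [(e - mu) / sigma for e in errors]

def out_of_distribution(errors, threshold=2.0):
    """Indices of drug/cell combinations whose error z-score exceeds a threshold."""
    return [i for i, z in enumerate(z_scores(errors)) if z > threshold]

# Synthetic per-combination errors; the last one is far out of distribution.
errors = [0.01, 0.02, 0.015, 0.012, 0.018, 0.011, 0.016, 0.014, 0.30]
print(out_of_distribution(errors))  # [8]
```

The flagged indices are exactly the rare, high-error combinations that subsequent training should emphasize.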

15 of 24

NEW ERROR METRIC

  • Developed a new error metric, compatible with TensorFlow, to steer training toward better accuracy on low-AUC drugs
  • Alpha parameter is tunable – higher alpha is more attuned to low AUC
  • Side effect is higher error on ordinary drugs, but that is not significant for the study
  • Need to retrain with this error metric across the full data set
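The exact metric is not reproduced here; one plausible shape, where a tunable alpha up-weights the squared error of low-AUC samples, can be sketched as follows (illustrative only, not the metric used in the study):

```python
def weighted_mse(y_true, y_pred, alpha=2.0):
    """Hypothetical alpha-weighted MSE: a sample with true AUC a gets weight
    1 + alpha*(1 - a), so lower-AUC (better-candidate) drugs dominate the loss.
    Higher alpha is more attuned to low AUC."""
    total, weight_sum = 0.0, 0.0
    for auc, pred in zip(y_true, y_pred):
        w = 1.0 + alpha * (1.0 - auc)   # higher weight for lower true AUC
        total += w * (auc - pred) ** 2
        weight_sum += w
    return total / weight_sum

# The same absolute error costs more on the low-AUC sample than the high-AUC one:
a = weighted_mse([0.2, 0.9], [0.3, 0.9])  # error on the low-AUC sample
b = weighted_mse([0.2, 0.9], [0.2, 1.0])  # same-size error on the high-AUC sample
print(a > b)  # True
```

The side effect noted above falls out directly: ordinary (high-AUC) drugs get relatively less weight, so their error can grow.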

16 of 24

WORKFLOW SPECIFICATION

  • Derived HED workflow from prior “UPF” workflow that accepts lists of hyperparameters
  • Hyperparameters here are used to select drugs to move from the training set to the test set

  • Each training run rebuilds the data sets in local storage on Aurora
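The drug-selection step can be sketched as a plain function: given a per-drug mean AUC table, drugs whose AUC falls in the range named by the "hyperparameter" move from training to test (drug names and the range boundaries here are hypothetical):

```python
def split_by_auc(drug_auc, low, high):
    """Move drugs whose mean AUC lies in [low, high) from training to test."""
    train, test = {}, {}
    for drug, auc in drug_auc.items():
        if low <= auc < high:
            test[drug] = auc
        else:
            train[drug] = auc
    return train, test

drug_auc = {"drugA": 0.35, "drugB": 0.62, "drugC": 0.48, "drugD": 0.91}
train, test = split_by_auc(drug_auc, 0.0, 0.5)  # hold out the strong candidates
print(sorted(test))   # ['drugA', 'drugC']
print(sorted(train))  # ['drugB', 'drugD']
```

Each workflow task then rebuilds its training and test sets from such a split in Aurora's local storage.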

17 of 24

VARIOUS I/O IN HED WORKFLOW

  • Staging: Using a previously developed technique from X-ray science to stage code and data to local storage using MPI-IO
  • Runs as a hook inside the Swift/T workflow
  • Critical to get Anaconda + TensorFlow installed at scale
  • Checkpointing: Using the previously developed CANDLE checkpoint module
  • Allows for metadata-rich checkpoints, automatic cleanup optionally based on training progress
  • Overall, this minimizes accesses to the parallel filesystem
  • Used an assembly of CCLE, CTRPv2, gCSI, GDSCv1, and GDSCv2 data sets, integrated by the IMPROVE project
  • Contains 770 unique drugs with many samples per drug

18 of 24

RESULTS FOR 10 AUC RANGES


19 of 24

RESULTS FOR 2 AUC RANGES


20 of 24

CHALLENGES: INCREASING COMPLEXITY

  • How to manage and run:
    • Large software packages (10s of GBs)
    • Different programming languages in the same workflow
  • How to manage highly specialized, heterogeneous workloads
    • Determine what will be done on CPU/GPU systems and what will be done on specialized hardware
    • How to reallocate resources in the middle of the run
  • How to maintain scientific integrity while using opaque services
  • Most important: How to create a roadmap from rapid prototyping to performant, scalable applications


  • Software Monsters: Quantifying, reporting, and controlling composite applications.
  • ASCR Workshop on the Science of Scientific-Software Development and Use 2021.

21 of 24

TAKEAWAYS

  • Took the approach of building systems specifically for exascale infrastructure
  • Built with standard HPC tools like MPI, and focused on high-performance integrations such as library calls
  • Demonstrated reusability of Supervisor framework for a totally different kind of investigation
  • Demonstrated the ability to optimize training results for interesting drugs
  • Future automation in science will rely on fast, reliable systems that can work together

Reusability in a framework for deep learning and cancer

22 of 24

THANKS

  • Thanks to the organizers

  • Code and guides:
    • CANDLE GitHub: https://github.com/ECP-CANDLE
    • Swift/T Home: http://swift-lang.org/Swift-T

  • This research was supported by the Exascale Computing Project (17-SC-20-SC), a joint project of the U.S. Department of Energy’s Office of Science and National Nuclear Security Administration, responsible for delivering a capable exascale ecosystem, including software, applications, and hardware technology, to support the nation’s exascale computing imperative.


23 of 24

ACKNOWLEDGMENTS

  • This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research was completed with resources provided by the Laboratory Computing Resource Center at Argonne National Laboratory.
  • This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357.
  • This work was supported by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Materials Sciences and Engineering and used resources of the Advanced Photon Source, a U.S. Department of Energy Office of Science User Facility at Argonne National Laboratory.

24 of 24

QUESTIONS?

WOZ@ANL.GOV

www.anl.gov