1 of 9

Workflow Anomaly Detection

with Graph Neural Networks

Hongwei Jin¹, Krishnan Raghavan¹, George Papadimitriou², Cong Wang³, Anirban Mandal³,

Patrycja Krawczuk², Loïc Pottier², Mariam Kiran⁴, Ewa Deelman², Prasanna Balaprakash¹

¹Argonne National Laboratory, ²University of Southern California,

³Renaissance Computing Institute, ⁴Lawrence Berkeley National Laboratory

This work was funded by the US Department of Energy under contracts DE-SC0022328 and DE-AC02-06CH11357.

17th Workshop on Workflows in Support of Large-Scale Science, Dallas, TX, 2022

2 of 9

Anomaly Detection in Scientific Workflows

Motivation:

  • Detecting and diagnosing anomalies is both important and challenging for reliable scientific workflows
  • Model the scientific workflow as a directed acyclic graph (DAG)

Our proposal:

  • Apply graph neural networks (GNNs) to detect anomalies
  • Anomaly detection on entire graph (workflow)
  • Anomaly detection on nodes (jobs)


SC22 | Dallas, TX | hpc accelerates.


Figure: an abstract view of a workflow as a DAG. The workflow is modeled as G = (A, X), where A encodes the job dependencies (graph structure) and X encodes the job features (node features); classification is then performed on this graph.
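To make the G = (A, X) encoding concrete, here is a small illustrative sketch (the job layout and feature values are invented for this example, not taken from the paper):

```python
import numpy as np

# A 4-job workflow DAG encoded as G = (A, X):
#
#   job0 --> job1 --> job3
#      \--> job2 ----/
n_jobs = 4
A = np.zeros((n_jobs, n_jobs), dtype=int)   # A: job dependencies
for parent, child in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    A[parent, child] = 1

# X: one feature row per job, e.g. [runtime, queue delay, stage-in delay]
X = np.array([
    [12.0, 1.5, 0.3],
    [45.0, 2.0, 0.8],
    [38.0, 0.5, 0.6],
    [ 7.0, 3.1, 0.2],
])

# A DAG has no cycles; with jobs listed in a topological order the
# adjacency matrix is strictly upper-triangular, which certifies it.
assert np.all(np.tril(A) == 0)
print(A.sum())  # number of dependency edges -> 4
```

Graph-level classification labels the whole (A, X) pair; node-level classification labels each of the rows of X.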

3 of 9

Data Collection

  • We use Pegasus¹, an exemplar workflow management system that enables users to design workflows at a high level of abstraction, independent of the resources available for execution.
  • We collect job features such as ready time, prescript delay, runtime, postscript delay, queue delay, stage-in delay, and stage-out delay, and normalize the features across jobs.
  • Data
    • Normal. No anomaly is introduced; workflows run under normal conditions.
    • CPU. 2–3 cores are occupied by the stress tool on each worker node.
    • HDD. 50–100 MB of data are continuously written by the stress tool on each worker node.
    • Packet loss. The network connection between two ExoGENI regions is experiencing 0.1%–5.0% of packet loss.
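The normalization step above can be sketched as follows; the slides do not specify the exact scheme, so per-feature min-max scaling over all jobs is an assumption here:

```python
import numpy as np

# Hypothetical sketch of normalizing features "across jobs": each column
# (a delay or runtime feature) is min-max scaled over all jobs, so
# features with very different units and ranges become comparable.
def normalize_features(X: np.ndarray) -> np.ndarray:
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0            # guard against constant columns
    return (X - lo) / span

X = np.array([[12.0, 1.5],           # made-up [runtime, queue delay] rows
              [45.0, 2.0],
              [ 7.0, 3.1]])
Xn = normalize_features(X)
print(Xn.min(), Xn.max())  # -> 0.0 1.0
```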


  1. Deelman, E., et al. The Evolution of the Pegasus Workflow Management Software. Computing in Science Engineering 2019.

4 of 9

Graph Neural Networks

  • Two graph convolutional (GCN) layers aggregate job information by propagating features from neighbors
  • From the hidden representation H, a two-layer MLP performs the final classification
  • The objective is the cross-entropy loss, optimized with Adam


Figure: GNN architecture

  • GNNs are a recent deep learning approach for learning representations of structured data
  • They learn node representations from both node features and local structural information
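A minimal NumPy sketch of this architecture's forward pass, assuming the standard Kipf-Welling propagation rule (the actual model is trained with Adam on a cross-entropy loss; the weights here are random, so this only illustrates the shapes and dataflow):

```python
import numpy as np

rng = np.random.default_rng(0)

# One GCN layer: H' = ReLU(A_hat @ H @ W), where
# A_hat = D^{-1/2} (A + I) D^{-1/2} is the normalized adjacency.
def gcn_layer(A_hat, H, W):
    return np.maximum(A_hat @ H @ W, 0.0)   # aggregate neighbors, then ReLU

n, d, h, c = 5, 8, 16, 2                    # jobs, features, hidden dim, classes
A = (rng.random((n, n)) < 0.3).astype(float)
A_sym = np.maximum(A, A.T) + np.eye(n)      # symmetrize + self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_sym.sum(axis=1)))
A_hat = D_inv_sqrt @ A_sym @ D_inv_sqrt

X = rng.standard_normal((n, d))
H = gcn_layer(A_hat, gcn_layer(A_hat, X, rng.standard_normal((d, h))),
              rng.standard_normal((h, h)))  # two GCN layers -> hidden rep H

# Two-layer MLP head on the node embeddings -> per-job class logits;
# mean-pooling H over nodes instead would give a graph-level prediction.
logits = np.maximum(H @ rng.standard_normal((h, h)), 0.0) @ rng.standard_normal((h, c))
print(logits.shape)  # -> (5, 2)
```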

5 of 9

Experiments


Table: Anomaly detection results (classification accuracy). Binary: normal vs. anomaly; multi-label: classification by anomaly category. The "ALL" row trains a single model on all workflows simultaneously.

                            Graph-level          Node-level
Workflows                  Binary   Multi      Binary   Multi
1000 genome                0.921    0.852      0.870    0.743
Nowcast w/ clustering 8    0.815    0.683      0.798    0.590
Nowcast w/ clustering 16   0.828    0.593      0.801    0.487
Wind w/ clustering casa    0.769    0.434      0.775    0.389
Wind w/o clustering casa   0.817    0.586      0.789    0.472
ALL                        0.841    0.674      0.829    0.594

6 of 9

Experiments (cont’d)

Table: Detailed anomaly detection accuracy by anomaly category.

Workflows                  CPU     HDD     Pkt. Loss
1000 genome                1.000   0.981   0.772
Nowcast w/ clustering 8    0.763   0.968   0.719
Nowcast w/ clustering 16   0.745   0.825   0.790
Wind w/ clustering casa    0.544   0.779   0.693
Wind w/o clustering casa   0.784   0.967   0.784

Table: Detailed anomaly detection accuracy by HDD anomaly level (60 or 100 MB of data continuously written by the stress tool).

Workflows                  HDD 60   HDD 100
1000 genome                0.927    0.985
Nowcast w/ clustering 8    0.985    0.985
Nowcast w/ clustering 16   0.818    0.894
Wind w/ clustering casa    0.773    0.849
Wind w/o clustering casa   0.891    0.938

7 of 9

Effectiveness and Efficiency

Table: Comparison against classical ML models (SVM, MLP, RF) and computer vision DL models¹ (AlexNet, VGG-16, ResNet-18). Runtime is measured on a single GPU; the predictions are trustworthy based on the confusion matrices.

Model        Acc.    Recall   Prec.   F1      Runtime
SVM          0.622   0.622    0.667   0.550   N/A
MLP          0.874   0.874    0.875   0.875   N/A
RF           0.898   0.898    0.908   0.887   N/A
AlexNet      0.910   0.914    0.910   0.910   251 s
VGG-16       0.900   0.900    0.900   0.900   435 s
ResNet-18    0.910   0.916    0.910   0.910   991 s
Our GNN      0.923   0.929    0.921   0.922   142 s

  1. Krawczuk, P. et al. Anomaly Detection in Scientific Workflows using End-to-End Execution Gantt Charts and Convolutional Neural Networks. EARC’21.
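The reported metrics can be derived from a binary confusion matrix as follows; the confusion-matrix values below are illustrative, not the paper's:

```python
import numpy as np

# Accuracy, precision, recall, and F1 from a 2x2 confusion matrix
# (rows: true class, cols: predicted class; class 1 = anomaly).
def metrics(cm):
    tn, fp, fn, tp = cm.ravel()
    acc = (tp + tn) / cm.sum()
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

cm = np.array([[50, 10],   # invented counts for illustration
               [ 5, 35]])
acc, prec, rec, f1 = metrics(cm)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# -> 0.85 0.778 0.875 0.824
```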

8 of 9

Conclusion and Future Work

Conclusion

  • We model scientific workflows as directed acyclic graphs (DAGs) and learn graph-level and node-level information through graph neural networks.
  • The GNN approach surpasses conventional machine learning and computer vision approaches, with better computational efficiency and higher accuracy on both graph-level and node-level classification tasks.

Future Work

  • Scale the approach efficiently to larger scientific workflows
  • Generalize a model learned on one workflow to another (transfer learning)
  • Synthetically generate anomaly data (generative models)
  • Online detection while the workflow runs


9 of 9

Q&A


Contact information

Hongwei Jin: jinh@anl.gov