1 of 9

Workflow Anomaly Detection

with Graph Neural Networks

Hongwei Jin¹, Krishnan Raghavan¹, George Papadimitriou², Cong Wang³, Anirban Mandal³,

Patrycja Krawczuk², Loïc Pottier², Mariam Kiran⁴, Ewa Deelman², Prasanna Balaprakash¹

¹Argonne National Laboratory, ²University of Southern California,

³Renaissance Computing Institute, ⁴Lawrence Berkeley National Laboratory

This work was funded by the US Department of Energy under contracts DE-SC0022328 and DE-AC02-06CH11357.

17th Workshop on Workflows in Support of Large-Scale Science, Dallas, TX, 2022

2 of 9

Anomaly Detection in Scientific Workflows

Motivation:

  • Detecting and diagnosing anomalies is both important and challenging for reliable scientific workflows
  • Model the scientific workflow as a directed acyclic graph (DAG)

Our proposal:

  • Apply graph neural networks (GNNs) to detect anomalies
  • Anomaly detection on entire graph (workflow)
  • Anomaly detection on nodes (jobs)


SC22 | Dallas, TX | hpc accelerates.


Figure: an abstract view of a workflow as a DAG. The workflow is modeled as G = (A, X), where A encodes the job dependencies (graph structure) and X encodes the job features (node features); classification is then performed on this graph.
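To make the G = (A, X) encoding concrete, here is a small illustrative sketch (the job layout and feature values are invented for this example, not taken from the paper):

```python
import numpy as np

# A 4-job workflow DAG encoded as G = (A, X):
#
#   job0 --> job1 --> job3
#      \--> job2 ----/
n_jobs = 4
A = np.zeros((n_jobs, n_jobs), dtype=int)   # A: job dependencies
for parent, child in [(0, 1), (0, 2), (1, 3), (2, 3)]:
    A[parent, child] = 1

# X: one feature row per job, e.g. [runtime, queue delay, stage-in delay]
X = np.array([
    [12.0, 1.5, 0.3],
    [45.0, 2.0, 0.8],
    [38.0, 0.5, 0.6],
    [ 7.0, 3.1, 0.2],
])

# A DAG has no cycles; with jobs listed in a topological order the
# adjacency matrix is strictly upper-triangular, which certifies it.
assert np.all(np.tril(A) == 0)
print(A.sum())  # number of dependency edges -> 4
```

Graph-level classification labels the whole (A, X) pair; node-level classification labels each of the rows of X.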

3 of 9

Data Collection

  • We use Pegasus¹, an exemplar workflow management system that enables users to design workflows at a high level of abstraction, independent of the resources available for execution.
  • We collect job features such as ready time, prescript delay, runtime, postscript delay, queue delay, stage-in delay, and stage-out delay, and normalize the features across jobs.
  • Data
    • Normal. No anomaly is introduced; workflows run under normal conditions.
    • CPU. 2–3 cores are occupied by the stress tool on each worker node.
    • HDD. 50–100 MB of data are continuously written by the stress tool on each worker node.
    • Packet loss. The network connection between two ExoGENI regions is experiencing 0.1%–5.0% of packet loss.
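The normalization step above can be sketched as follows; the slides do not specify the exact scheme, so per-feature min-max scaling over all jobs is an assumption here:

```python
import numpy as np

# Hypothetical sketch of normalizing features "across jobs": each column
# (a delay or runtime feature) is min-max scaled over all jobs, so
# features with very different units and ranges become comparable.
def normalize_features(X: np.ndarray) -> np.ndarray:
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0            # guard against constant columns
    return (X - lo) / span

X = np.array([[12.0, 1.5],           # made-up [runtime, queue delay] rows
              [45.0, 2.0],
              [ 7.0, 3.1]])
Xn = normalize_features(X)
print(Xn.min(), Xn.max())  # -> 0.0 1.0
```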


  1. Deelman, E., et al. The Evolution of the Pegasus Workflow Management Software. Computing in Science Engineering 2019.

4 of 9

Graph Neural Networks

  • Two graph convolutional (GCN) layers aggregate job information by propagating features from neighbors
  • From the hidden representation H, a two-layer MLP performs the final classification
  • The objective is the cross-entropy loss, optimized with Adam


Figure: GNN architecture

  • GNNs are a recent deep learning approach for learning representations of structured data
  • They learn node representations from both node features and local structural information
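A minimal NumPy sketch of this architecture's forward pass, assuming the standard Kipf-Welling propagation rule (the actual model is trained with Adam on a cross-entropy loss; the weights here are random, so this only illustrates the shapes and dataflow):

```python
import numpy as np

rng = np.random.default_rng(0)

# One GCN layer: H' = ReLU(A_hat @ H @ W), where
# A_hat = D^{-1/2} (A + I) D^{-1/2} is the normalized adjacency.
def gcn_layer(A_hat, H, W):
    return np.maximum(A_hat @ H @ W, 0.0)   # aggregate neighbors, then ReLU

n, d, h, c = 5, 8, 16, 2                    # jobs, features, hidden dim, classes
A = (rng.random((n, n)) < 0.3).astype(float)
A_sym = np.maximum(A, A.T) + np.eye(n)      # symmetrize + self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_sym.sum(axis=1)))
A_hat = D_inv_sqrt @ A_sym @ D_inv_sqrt

X = rng.standard_normal((n, d))
H = gcn_layer(A_hat, gcn_layer(A_hat, X, rng.standard_normal((d, h))),
              rng.standard_normal((h, h)))  # two GCN layers -> hidden rep H

# Two-layer MLP head on the node embeddings -> per-job class logits;
# mean-pooling H over nodes instead would give a graph-level prediction.
logits = np.maximum(H @ rng.standard_normal((h, h)), 0.0) @ rng.standard_normal((h, c))
print(logits.shape)  # -> (5, 2)
```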

5 of 9

Experiments


Table: Anomaly detection results (classification accuracy). Binary: normal vs. anomaly; multi-label: classification by anomaly category. The "ALL" row trains a single model on all workflows simultaneously.

                            Graph-level          Node-level
Workflows                  Binary   Multi      Binary   Multi
1000 genome                0.921    0.852      0.870    0.743
Nowcast w/ clustering 8    0.815    0.683      0.798    0.590
Nowcast w/ clustering 16   0.828    0.593      0.801    0.487
Wind w/ clustering casa    0.769    0.434      0.775    0.389
Wind w/o clustering casa   0.817    0.586      0.789    0.472
ALL                        0.841    0.674      0.829    0.594

6 of 9

Experiments (cont’d)

Table: Detailed anomaly detection accuracy by anomaly category.

Workflows                  CPU     HDD     Pkt. Loss
1000 genome                1.000   0.981   0.772
Nowcast w/ clustering 8    0.763   0.968   0.719
Nowcast w/ clustering 16   0.745   0.825   0.790
Wind w/ clustering casa    0.544   0.779   0.693
Wind w/o clustering casa   0.784   0.967   0.784

Table: Detailed anomaly detection accuracy by HDD anomaly level (60 or 100 MB of data continuously written by the stress tool).

Workflows                  HDD 60   HDD 100
1000 genome                0.927    0.985
Nowcast w/ clustering 8    0.985    0.985
Nowcast w/ clustering 16   0.818    0.894
Wind w/ clustering casa    0.773    0.849
Wind w/o clustering casa   0.891    0.938

7 of 9

Effectiveness and Efficiency

Table: Comparison against classical ML models (SVM, MLP, RF) and computer vision DL models¹ (AlexNet, VGG-16, ResNet-18). Runtime is measured on a single GPU; the predictions are trustworthy based on the confusion matrices.

Model        Acc.    Recall   Prec.   F1      Runtime
SVM          0.622   0.622    0.667   0.550   N/A
MLP          0.874   0.874    0.875   0.875   N/A
RF           0.898   0.898    0.908   0.887   N/A
AlexNet      0.910   0.914    0.910   0.910   251 s
VGG-16       0.900   0.900    0.900   0.900   435 s
ResNet-18    0.910   0.916    0.910   0.910   991 s
Our GNN      0.923   0.929    0.921   0.922   142 s

  1. Krawczuk, P. et al. Anomaly Detection in Scientific Workflows using End-to-End Execution Gantt Charts and Convolutional Neural Networks. EARC’21.
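The reported metrics can be derived from a binary confusion matrix as follows; the confusion-matrix values below are illustrative, not the paper's:

```python
import numpy as np

# Accuracy, precision, recall, and F1 from a 2x2 confusion matrix
# (rows: true class, cols: predicted class; class 1 = anomaly).
def metrics(cm):
    tn, fp, fn, tp = cm.ravel()
    acc = (tp + tn) / cm.sum()
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

cm = np.array([[50, 10],   # invented counts for illustration
               [ 5, 35]])
acc, prec, rec, f1 = metrics(cm)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# -> 0.85 0.778 0.875 0.824
```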

8 of 9

Conclusion and Future Work

Conclusion

  • We model scientific workflows as directed acyclic graphs (DAGs) and learn graph-level and node-level information through graph neural networks.
  • The GNN approach surpasses conventional machine learning and computer vision approaches, with better computational efficiency and higher accuracy on both graph-level and node-level classification tasks.

Future Work

  • Scale the approach efficiently to larger scientific workflows
  • Generalize a model learned on one workflow to another (transfer learning)
  • Synthetically generate anomaly data (generative models)
  • Online detection while the workflow runs


9 of 9

Q&A


Contact information

Hongwei Jin: jinh@anl.gov