Workflow Anomaly Detection
with Graph Neural Networks
Hongwei Jin¹, Krishnan Raghavan¹, George Papadimitriou², Cong Wang³, Anirban Mandal³,
Patrycja Krawczuk², Loïc Pottier², Mariam Kiran⁴, Ewa Deelman², Prasanna Balaprakash¹
¹Argonne National Laboratory, ²University of Southern California,
³Renaissance Computing Institute, ⁴Lawrence Berkeley National Laboratory
This work was funded by the US Department of Energy under contracts DE-SC0022328 and DE-AC02-06CH11357.
17th Workshop on Workflows in Support of Large-Scale Science, Dallas, TX, 2022
Anomaly Detection in Scientific Workflows
Motivation:
Our proposal:
SC22 | Dallas, TX | hpc accelerates.
Figure: an abstract view of a workflow as a DAG
DAG: G = (A, X)
A: job dependencies (graph structure)
X: job features (node features)
Task: classification
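The representation G = (A, X) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' pipeline: the job names, the edges, and the three feature columns are all hypothetical, chosen only to show how job dependencies become an adjacency matrix A and per-job measurements become a feature matrix X.

```python
# Hypothetical toy workflow: job names, edges, and features are illustrative only.
jobs = ["stage_in", "process_a", "process_b", "merge"]
edges = [
    ("stage_in", "process_a"),
    ("stage_in", "process_b"),
    ("process_a", "merge"),
    ("process_b", "merge"),
]

index = {name: i for i, name in enumerate(jobs)}
n = len(jobs)

# A[i][j] = 1 when job j depends directly on job i (directed edge i -> j).
A = [[0] * n for _ in range(n)]
for parent, child in edges:
    A[index[parent]][index[child]] = 1

# X: one row per job. Assumed columns: runtime (s), bytes read, bytes written.
X = [
    [12.0, 1.0e6, 2.0e6],
    [40.0, 2.0e6, 5.0e5],
    [38.0, 2.0e6, 6.0e5],
    [15.0, 1.1e6, 1.0e6],
]


def is_acyclic(adj):
    """Kahn's algorithm: True when the directed graph has no cycle."""
    size = len(adj)
    indegree = [sum(adj[r][c] for r in range(size)) for c in range(size)]
    ready = [v for v in range(size) if indegree[v] == 0]
    visited = 0
    while ready:
        v = ready.pop()
        visited += 1
        for child in range(size):
            if adj[v][child]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return visited == size


workflow_is_dag = is_acyclic(A)
```

Row i of X describes the same job as row and column i of A, so the two matrices must share one job ordering.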
Data Collection
Graph Neural Networks
Figure: GNN architecture
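The slide does not spell out the architecture, so the following is only a generic sketch of the idea behind a graph neural network: each job updates its representation by averaging over itself and its neighbours, and a readout pools the node representations into one graph-level vector. The mean aggregation, the choice to ignore edge direction, and the names `gcn_layer` and `mean_readout` are assumptions made for illustration.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def gcn_layer(A, H, W):
    """One mean-aggregation layer: ReLU(D^-1 (A_sym + I) H W).

    Assumption: edge direction is ignored during aggregation.
    """
    n = len(A)
    # Symmetrize the adjacency matrix and add self-loops.
    A_hat = [
        [1.0 if (i == j or A[i][j] or A[j][i]) else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so each node averages over itself and its neighbours.
    A_norm = [[v / sum(row) for v in row] for row in A_hat]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


def mean_readout(H):
    """Pool node embeddings into a single graph-level vector."""
    return [sum(col) / len(H) for col in zip(*H)]


# Toy example: job 0 feeds jobs 1 and 2; two features per job; identity weights.
A = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]

H1 = gcn_layer(A, H0, W)          # node-level embeddings
graph_vector = mean_readout(H1)   # graph-level embedding
```

Node-level detection would put a classifier on each row of `H1`; graph-level detection would put one on `graph_vector`.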
Experiments
Workflows | Graph-level, binary label | Graph-level, multi-label | Node-level, binary label | Node-level, multi-label |
1000 genome | 0.921 | 0.852 | 0.870 | 0.743 |
Nowcast w/ clustering 8 | 0.815 | 0.683 | 0.798 | 0.590 |
Nowcast w/ clustering 16 | 0.828 | 0.593 | 0.801 | 0.487 |
Wind w/ clustering casa | 0.769 | 0.434 | 0.775 | 0.389 |
Wind w/o clustering casa | 0.817 | 0.586 | 0.789 | 0.472 |
ALL | 0.841 | 0.674 | 0.829 | 0.594 |
ALL: a single model trained on all workflows simultaneously
Tables: A collection of results on anomaly detection (classification)
Binary label: normal vs. anomalous
Multi-label: one class per anomaly category
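The two labelling schemes can be illustrated with a small mapping. The category names are taken from the anomaly types reported in this deck (CPU, HDD, packet loss); the integer encoding and the function names are assumptions for the sake of the example.

```python
# Category names follow the deck's tables; the integer encoding is an assumption.
CATEGORIES = ["normal", "cpu", "hdd", "pkt_loss"]


def binary_label(category):
    """0 for a normal run, 1 for any kind of anomaly."""
    return 0 if category == "normal" else 1


def multi_label(category):
    """One class index per anomaly category, with 0 reserved for normal."""
    return CATEGORIES.index(category)


observed = ["normal", "hdd", "cpu", "normal", "pkt_loss"]
binary = [binary_label(c) for c in observed]
multi = [multi_label(c) for c in observed]
```

The binary task merges every anomaly category into one positive class, which is one plausible reason the binary scores in the table are higher than the multi-label ones.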
Experiments (cont’d)
Detailed anomaly detection by anomaly category
Workflows | CPU | HDD | Pkt. Loss |
1000 genome | 1.000 | 0.981 | 0.772 |
Nowcast w/ clustering 8 | 0.763 | 0.968 | 0.719 |
Nowcast w/ clustering 16 | 0.745 | 0.825 | 0.790 |
Wind w/ clustering casa | 0.544 | 0.779 | 0.693 |
Wind w/o clustering casa | 0.784 | 0.967 | 0.784 |
Detailed anomaly detection by anomaly level (HDD)
Workflows | HDD, 60 MB | HDD, 100 MB |
1000 genome | 0.927 | 0.985 |
Nowcast w/ clustering 8 | 0.985 | 0.985 |
Nowcast w/ clustering 16 | 0.818 | 0.894 |
Wind w/ clustering casa | 0.773 | 0.849 |
Wind w/o clustering casa | 0.891 | 0.938 |
HDD anomaly level: the stress tool continuously writes 60 MB or 100 MB of data
Effectiveness and Efficiency
Model | Acc. | Recall | Prec. | F1 | Runtime |
SVM | 0.622 | 0.622 | 0.667 | 0.550 | N/A |
MLP | 0.874 | 0.874 | 0.875 | 0.875 | N/A |
RF | 0.898 | 0.898 | 0.908 | 0.887 | N/A |
AlexNet | 0.910 | 0.914 | 0.910 | 0.910 | 251 s |
VGG-16 | 0.900 | 0.900 | 0.900 | 0.900 | 435 s |
ResNet-18 | 0.910 | 0.916 | 0.910 | 0.910 | 991 s |
Our GNN | 0.923 | 0.929 | 0.921 | 0.922 | 142 s |
SVM, MLP, RF: classical ML models
AlexNet, VGG-16, ResNet-18: computer vision DL models
Runtime measured on a single GPU
Trustworthy prediction based on the confusion matrices
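For reference, accuracy, recall, precision, and F1 can all be derived from a confusion matrix. The sketch below uses support-weighted averaging across classes; the deck does not state which averaging scheme was used, so that choice is an assumption, and the example matrix is made up rather than taken from the experiments.

```python
def weighted_metrics(cm):
    """Metrics from a confusion matrix.

    cm[i][j] = number of samples with true class i predicted as class j.
    Per-class scores are averaged with weights proportional to class support
    (an assumed averaging scheme).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precision = recall = f1 = 0.0
    for i in range(k):
        support = sum(cm[i])                       # true members of class i
        predicted = sum(cm[r][i] for r in range(k))  # samples predicted as i
        p = cm[i][i] / predicted if predicted else 0.0
        r = cm[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up binary example: rows are true classes (normal, anomaly).
example_cm = [[8, 2], [1, 9]]
scores = weighted_metrics(example_cm)
```

Under support-weighted averaging, recall always equals accuracy, because each class's recall is weighted by exactly the share of samples it covers.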
Conclusion and Future Work
Conclusion
Future Work
Q&A
Contact information
Hongwei Jin: jinh@anl.gov