AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis
Scientific Achievement
The first work to demonstrate the feasibility of automatically identifying I/O performance bottlenecks at the job level for scientific applications running on high-performance computing (HPC) systems
Significance and Impact
Take the human out of the loop of I/O performance bottleneck diagnosis with cutting-edge Artificial Intelligence (AI)
Lay the foundation for methods to automatically fix the I/O performance issues of scientific applications
Open the possibility of using AI technologies to identify bottlenecks in the communication and computation of scientific applications
AIIO can identify the I/O bottlenecks of applications; fixing them can improve performance by up to 146X
Technical Approach
A performance function, based on multiple linear regression models, that connects I/O counters with I/O performance
Game-theory-based diagnosis functions that use SHAP to calculate the impact of various factors on I/O performance
Incorporating the diverse characteristics (e.g., sparsity) of applications into both performance and diagnosis functions
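The pairing of a performance function with a game-theoretic diagnosis function can be illustrated with a minimal sketch. This is not AIIO's implementation: the counter names, the synthetic job data, and the helper functions are all hypothetical. A single least-squares linear model stands in for the performance function. An exact Shapley-value enumeration stands in for the SHAP library, which computes or approximates the same quantities.

```python
import itertools
import math
import numpy as np

# Hypothetical counter names, chosen only for illustration.
COUNTERS = ["small_writes", "seeks", "bytes_written"]

def fit_performance_function(X, y):
    """Least-squares fit of y ~ X @ w + b; returns (w, b)."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition are set to their baseline value.
    The cost is exponential in the number of features, so this only
    suits a small example like this one.
    """
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for coalition in itertools.combinations(others, k):
                z = baseline.copy()
                for j in coalition:
                    z[j] = x[j]
                without_i = predict(z)
                z[i] = x[i]
                phi[i] += weight * (predict(z) - without_i)
    return phi

# Synthetic data (an assumption, not measured logs): normalized counters
# for 200 jobs, and a made-up linear relation to I/O throughput.
rng = np.random.default_rng(0)
X = rng.random((200, len(COUNTERS)))
y = X @ np.array([-40.0, -25.0, 60.0]) + 100.0 + rng.normal(0.0, 1.0, 200)

w, b = fit_performance_function(X, y)
predict = lambda z: float(z @ w + b)

# Diagnose one job: the counter with the most negative contribution to
# predicted performance is reported as the likely bottleneck.
job = np.array([0.9, 0.2, 0.5])
baseline = X.mean(axis=0)
phi = shapley_values(predict, job, baseline)
bottleneck = COUNTERS[int(np.argmin(phi))]
```

The contributions in phi sum to the difference between the job's predicted performance and the prediction at the average job, so each value can be read as that counter's share of the slowdown or speedup. The sketch does not model the sparsity handling mentioned above.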
PI(s)/Facility Lead(s): Bin Dong, Jean Luca Bez, Suren Byna
Collaborating Institutions: Lawrence Berkeley National Laboratory, The Ohio State University
Publication(s) for this work: Bin Dong, et al., “AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis”, HPDC 2023, https://doi.org/10.1145/3588195.3592986.