AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis

Scientific Achievement

The first work to demonstrate the feasibility of automatically identifying job-level I/O performance bottlenecks for scientific applications running on high-performance computing (HPC) systems

Significance and Impact

  • Take the human out of the loop of I/O performance bottleneck diagnosis with cutting-edge Artificial Intelligence (AI)
  • Lay the foundation for methods to automatically fix the I/O performance issues of scientific applications
  • Open the possibility of using AI technologies to identify communication and computation bottlenecks in scientific applications

AIIO can identify the I/O bottlenecks of applications; fixing them can improve performance by up to 146X

Technical Approach

  • Performance function, based on multiple linear regression models, to connect I/O counters with I/O performance
  • Game-theory-based diagnosis functions, using SHAP, to calculate the impact of various factors on I/O performance
  • Incorporating the diverse characteristics (e.g., sparsity) of applications into both performance and diagnosis functions
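
The first two items above can be illustrated with a minimal sketch. It is not the AIIO implementation: the counter names, the synthetic data, and the helper functions are all hypothetical, and it assumes a single linear performance function with independent features, in which case the Shapley value of counter i for a job x has the closed form w_i * (x_i - mean(x_i)). Counters with the most negative values are reported as likely bottlenecks.

```python
import numpy as np

# Hypothetical counter names, for illustration only (not AIIO's feature set).
COUNTERS = ["small_writes", "seeks", "bytes_written_mb"]

def fit_performance_function(X, y):
    """Least-squares fit of the linear model y ~ X @ w + b; returns (w, b)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1], coef[-1]

def linear_shap(w, X_background, x):
    """Exact Shapley values for a linear model, assuming independent features."""
    return w * (x - X_background.mean(axis=0))

def diagnose(w, X_background, x):
    """Return (counter, contribution) pairs that lower predicted performance,
    ordered from most harmful to least harmful."""
    phi = linear_shap(w, X_background, x)
    order = np.argsort(phi)
    return [(COUNTERS[i], float(phi[i])) for i in order if phi[i] < 0]

# Synthetic job logs (assumption): performance drops with small writes and seeks.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 3))
true_w = np.array([-0.5, -0.2, 0.8])
y = X @ true_w + 50.0 + rng.normal(0, 0.1, size=200)

w, b = fit_performance_function(X, y)
job = np.array([95.0, 50.0, 50.0])  # a job issuing unusually many small writes
bottlenecks = diagnose(w, X, job)
print(bottlenecks)
```

Because Shapley values are additive, the contributions for a job sum to the difference between its predicted performance and the average prediction over the background jobs, which is what makes a per-counter, job-level diagnosis possible.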

PI(s)/Facility Lead(s): Bin Dong, Jean Luca Bez, Suren Byna

Collaborating Institutions: Lawrence Berkeley National Laboratory, The Ohio State University

Publication(s) for this work: Bin Dong, et al., “AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis”, HPDC 2023, https://doi.org/10.1145/3588195.3592986.

Code Developed or Datasets: https://github.com/hpc-io/aiio