Developed approaches to better characterize and model the I/O performance of applications run on HPC systems. We propose a taxonomy of five categories of I/O modeling errors: poor models of application behavior, global system disturbances, inadequate training set coverage, external disturbances such as I/O contention, and inherent system noise.
Significance and Impact
I/O efficiency is crucial to productivity in scientific computing, but rising system and application complexity make it difficult for practitioners to understand and optimize I/O behavior at scale. The developed taxonomy and litmus tests can help I/O researchers assess the impact of each class of error, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
Technical Approach
We analyze four years of application, scheduler, and storage system logs on two production leadership-class HPC platforms: ALCF Theta and NERSC Cori.
Application Modeling Errors: Characterize the quality of the mapping between the application and an ideal HPC system
System Modeling Errors: Characterize the effects of non-stationarity in the data
Generalization Errors: Characterize the ability of the model to generalize beyond the scenarios and systems it was trained on
Contention Errors: Characterize the effect of a lack of visibility into job interactions
Inherent Noise Errors: Characterize the effect of random behavior by the system
Our tests show that Theta and Cori jobs have an overall I/O throughput standard deviation of 5.7% and 7.2%, respectively.
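To make the "standard deviation as a percentage of throughput" figure concrete, the following is a minimal sketch of one way such a number could be estimated from logs. It is not the authors' procedure. It assumes, hypothetically, that jobs can be grouped by a key identifying repeated runs with the same I/O behavior, and it reports the average relative standard deviation across those groups. The function name `relative_throughput_spread` and the `(group_key, throughput)` record layout are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev


def relative_throughput_spread(jobs):
    """Return the mean relative standard deviation of throughput, in percent.

    `jobs` is an iterable of (group_key, throughput) pairs. The grouping key
    is a hypothetical stand-in for "runs expected to behave identically";
    how such groups are identified in real logs is not specified here.
    """
    groups = defaultdict(list)
    for key, throughput in jobs:
        groups[key].append(float(throughput))

    ratios = []
    for values in groups.values():
        # A spread needs at least two runs, and a nonzero mean to normalize by.
        if len(values) < 2 or mean(values) == 0:
            continue
        ratios.append(stdev(values) / mean(values))

    if not ratios:
        raise ValueError("no group contains two or more comparable runs")
    return 100.0 * mean(ratios)


# Illustrative, made-up throughputs (e.g. MiB/s); not data from Theta or Cori.
example_jobs = [("a", 100), ("a", 110), ("b", 50), ("b", 50), ("c", 10)]
spread_percent = relative_throughput_spread(example_jobs)
```

Normalizing each group's standard deviation by its mean keeps fast and slow jobs comparable. Groups with a single run are skipped because they carry no information about run-to-run variation.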
Fig. 1: Taxonomy of I/O throughput modeling errors, with examples of the effects of each class of error
Fig. 2: Breakdown of errors on the ALCF Theta and NERSC Cori systems
Isakov, M., M. Currier, E. Rosario, S. Madireddy, P. Balaprakash, P. Carns, R. Ross, G. Lockwood, and M. Kinsy (2022). “A Taxonomy of Error Sources in HPC I/O Machine Learning Models”. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Supercomputing 2022).