
HPC Code Migration Using Large Language Models

Scientific Achievement

Large Language Models (LLMs) have achieved remarkable success across a wide range of domains, including natural language processing, code analysis, and code generation. This research explores the potential of LLMs to tackle the challenge of code migration in high-performance computing (HPC). We have developed a novel LLM-based approach for migrating OpenMP Fortran code to equivalent C++, with fine-tuned open-weight models nearly matching the translation accuracy of GPT-4.

Significance and Impact

Traditional approaches typically require the manual development and maintenance of tools for migrating legacy HPC codes. In contrast, LLM-based approaches provide a more automated and unified solution. Reusable pipelines can be employed to generate datasets and train LLMs to address code migration challenges, thereby significantly reducing the need for manual tool development and accelerating scientific discoveries and innovations.

Technical Approach

  • Exploring both manual and LLM-based dataset generation approaches
  • Leveraging commercial and open-weight large language models
  • Combining prompt engineering with fine-tuning for improved model outputs (see the sketch after this list)
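
To illustrate the prompt-engineering step, the minimal Python sketch below (assuming the OpenAI Python client, v1.x, with OPENAI_API_KEY set) wraps an OpenMP Fortran snippet in a structured translation instruction and requests a C++ equivalent; the prompt wording and helper names are illustrative, not the exact ones used in this work.

```python
# Minimal sketch of a prompt-engineering step for OpenMP Fortran -> C++ translation.
# Assumptions: openai>=1.0 is installed and OPENAI_API_KEY is set; the prompt text and
# helper names below are illustrative, not the exact prompts used in this work.
from openai import OpenAI

client = OpenAI()

def build_translation_prompt(fortran_code: str) -> str:
    """Wrap an OpenMP Fortran snippet in an explicit translation instruction."""
    return (
        "Translate the following OpenMP Fortran code into equivalent C++ that "
        "preserves the OpenMP parallelism. Return only the C++ code.\n\n"
        f"{fortran_code}\n"
    )

def translate(fortran_code: str, model: str = "gpt-4") -> str:
    """Query the model once and return the raw C++ candidate."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are an expert HPC programmer."},
            {"role": "user", "content": build_translation_prompt(fortran_code)},
        ],
        temperature=0.0,  # deterministic outputs make dataset generation reproducible
    )
    return response.choices[0].message.content
```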

PI/Facility Lead: Chunhua Liao/Lawrence Livermore National Laboratory

Collaborating Institutions: University of Connecticut and Iowa State University

ASCR Program: SciDAC/RAPIDS-2; ASCR PM: Kalyan Perumalla

Publication: Lei et al., "Creating a Dataset for High-Performance Computing Code Translation using LLMs: A Bridge Between OpenMP Fortran and C++," 2023 IEEE High Performance Extreme Computing Conference (HPEC), doi: 10.1109/HPEC58863.2023.10363534. Dataset: HPC_Fortran_CPP

Fig. 1 LLM-Based Automated Dataset Generation using Holistic Feedback

We use open-source code snippets as seeds and prompt GPT-4 to generate training datasets, using feedback from LLMs, compilers, and unit tests to validate and refine the generated translations.
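
The loop sketched below is a minimal illustration of such a feedback-driven pipeline: each candidate translation is compiled with g++, and only pairs that compile and pass unit tests are kept. Here query_llm and run_unit_tests are hypothetical helpers standing in for the LLM call and the generated tests; the actual pipeline and feedback prompts are described in the publication.

```python
# Sketch of a dataset-generation loop with compiler and unit-test feedback.
# Assumptions: g++ with OpenMP support is on PATH; query_llm() and run_unit_tests()
# are hypothetical helpers standing in for the LLM call and the generated tests.
import os
import subprocess
import tempfile

def compiles(cpp_code: str) -> tuple[bool, str]:
    """Try to compile a candidate C++ translation; return (ok, compiler log)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "candidate.cpp")
        with open(src, "w") as f:
            f.write(cpp_code)
        result = subprocess.run(
            ["g++", "-fopenmp", "-c", src, "-o", os.path.join(tmp, "candidate.o")],
            capture_output=True, text=True,
        )
        return result.returncode == 0, result.stderr

def generate_pair(fortran_seed: str, max_rounds: int = 3):
    """Iteratively refine one Fortran/C++ training pair using holistic feedback."""
    feedback = ""
    for _ in range(max_rounds):
        cpp_code = query_llm(fortran_seed, feedback)       # hypothetical LLM helper
        ok, log = compiles(cpp_code)
        if ok and run_unit_tests(fortran_seed, cpp_code):  # hypothetical test helper
            return {"fortran": fortran_seed, "cpp": cpp_code}
        feedback = f"The previous attempt failed:\n{log}\nPlease fix it."
    return None  # drop seeds that never yield a valid pair
```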

Fig. 2 Accuracies of Different Models Translating OpenMP Fortran to C++

Using the dataset from Fig. 1, we fine-tuned two open-weight models, WizardCoder and DeepSeek-Coder, improving their translation accuracy by factors of 20.1 and 1.55, respectively, and nearly matching the top commercial model, GPT-4, which achieves a CodeBLEU score of 0.262.
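
For context, fine-tuning an open-weight code model on such Fortran/C++ pairs can be done with standard libraries. The sketch below uses Hugging Face transformers, peft (LoRA), and datasets, with illustrative hyperparameters and a hypothetical pairs.jsonl file; it is a sketch under those assumptions, not the exact recipe used in this work.

```python
# Illustrative LoRA fine-tuning of an open-weight code model on Fortran -> C++ pairs.
# Assumptions: transformers, peft, and datasets are installed; "pairs.jsonl" is a
# hypothetical file of {"fortran": ..., "cpp": ...} records; hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "deepseek-ai/deepseek-coder-6.7b-base"  # or a WizardCoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)

def tokenize(example):
    """Serialize one translation pair into a single training sequence."""
    text = f"### Fortran:\n{example['fortran']}\n### C++:\n{example['cpp']}"
    return tokenizer(text, truncation=True, max_length=2048)

data = load_dataset("json", data_files="pairs.jsonl")["train"]
data = data.map(tokenize, remove_columns=data.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetune-out", per_device_train_batch_size=1,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```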

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-862641