1 of 9

Evolving Code with A Large Language Model

Erik Hemberg*, Stephen Moskal, and Una-May O'Reilly
EECS, MIT CSAIL, 32 Vassar St, Cambridge, MA 02139, USA

Presenter: Zihan Zhang

2 of 9

Motivation

Limitations of Traditional GP

GP relies on syntactic operations that lack semantic awareness, often producing invalid or low-quality programs and requiring many generations to converge.

Strengths of LLMs in Code Semantics

LLMs demonstrate strong capabilities in understanding, rewriting, and generating code, making them suitable as intelligent variation operators.

Toward Semantics-Driven Evolution

Integrating LLMs into GP enables semantically guided variation, reduces manual design effort, and improves evolutionary efficiency.

3 of 9

Background

Genetic Programming (GP)

Genetic Programming (GP) is an evolutionary algorithm that automatically discovers programs.

It begins with a population of candidate programs and iteratively improves them through:

  • Selection: choose high-performing programs,
  • Variation: generate new programs via crossover and mutation,
  • Evaluation: measure each program’s fitness on the task,
  • Replacement: update the population with better offspring.

Through repeated evolutionary cycles, GP gradually evolves a program that solves the problem.
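The cycle above can be sketched as a minimal generational loop. This is an illustrative skeleton, not the paper's implementation: `init`, `fitness`, `select`, `crossover`, and `mutate` are placeholder callables supplied by the user, and the parameter defaults are arbitrary.

```python
import random

def genetic_programming(init, fitness, select, crossover, mutate,
                        pop_size=50, generations=30, mutation_rate=0.2):
    """Minimal generational GP loop: selection -> variation -> evaluation -> replacement."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluation: rank the population by fitness (higher is better).
        scored = sorted(population, key=fitness, reverse=True)
        # Selection: choose high-performing programs as parents.
        parents = select(scored)
        # Variation: build the next generation via crossover and mutation.
        offspring = []
        while len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b)
            if random.random() < mutation_rate:
                child = mutate(child)
            offspring.append(child)
        # Replacement: the offspring become the new population.
        population = offspring
    return max(population, key=fitness)
```

For example, with bit-list "programs" and `sum` as fitness (the OneMax toy problem), this loop steadily evolves toward the all-ones string.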

LLM

  • Pre-trained on massive text corpora
  • Strong at pattern recognition and code generation
  • Can rewrite, refactor, and combine code in a semantically meaningful way

LLMs understand code patterns, so they can generate better offspring than random mutation/crossover in traditional GP.

4 of 9

LLM_GP Algorithm

Algorithm Overview

LLM-based GP

Mutation is replaced by LLM-based code rewriting

  • Traditional GP randomly changes subtrees;
  • LLM-GP uses prompts to let the LLM rewrite code semantically and “intelligently.”

Crossover is replaced by LLM-based code merging

  • Traditional GP randomly stitches subtrees;
  • LLM-GP asks the LLM to generate a reasonable combined expression based on context.

Some variants even use LLMs for selection.

  • Traditional GP ranks individuals by fitness;
  • LLM-GP can evaluate or compare candidates through prompted semantic judgment.
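The three operator replacements above can be sketched as thin wrappers around a model call. This is a hedged illustration, not the paper's prompt design: `llm` stands for any text-in/text-out model interface, and the prompt wording is invented for the example.

```python
def llm_mutate(llm, code):
    """Mutation as prompted rewriting: the LLM edits the program semantically."""
    prompt = ("Rewrite the following program so it behaves slightly differently, "
              "keeping it syntactically valid:\n" + code)
    return llm(prompt)

def llm_crossover(llm, parent_a, parent_b):
    """Crossover as prompted merging: the LLM combines two parents into one child."""
    prompt = ("Combine these two programs into a single reasonable program:\n"
              "Program A:\n" + parent_a + "\nProgram B:\n" + parent_b)
    return llm(prompt)

def llm_select(llm, candidates, k):
    """Selection as prompted comparison instead of numeric fitness ranking."""
    prompt = ("Pick the " + str(k) + " most promising programs:\n"
              + "\n---\n".join(candidates))
    return llm(prompt)
```

In contrast to a random subtree swap, each operator hands the LLM the full program text, so the model can exploit its semantic knowledge of code when producing the child.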

5 of 9

LLM-based Operator

LLM_GP Algorithm

Each LLM-based operator works in three small steps:

  • Build a prompt using the operator’s prompt-function.
  • Send the prompt to the LLM and get its generated output.
  • Validate the output and extract a clean solution.

Prompt-functions and Prompts

  • Each operator uses a prompt-function to build the text sent to the LLM.
  • A prompt includes: examples, a query, allowed primitives, and a response format.
  • LLM outputs a new expression in JSON format.
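The three-step operator (build prompt, query, validate) can be sketched as one generic function. This is an assumption-laden sketch: the JSON key `"expression"` and the retry-on-failure policy are invented for illustration, not taken from the paper.

```python
import json

def apply_llm_operator(llm, prompt_fn, *args, retries=3):
    """Generic LLM-based operator: build a prompt, query the model, validate the reply."""
    for _ in range(retries):
        prompt = prompt_fn(*args)        # 1. build the prompt via the prompt-function
        reply = llm(prompt)              # 2. send it to the LLM
        try:
            data = json.loads(reply)     # 3. validate: the reply must be JSON ...
            expr = data["expression"]    # ... and contain the new expression
        except (json.JSONDecodeError, KeyError, TypeError):
            continue                     # ill-formed output: retry with a fresh query
        return expr
    return None                          # give up after repeated failures
```

Because LLM output can be unstable or ill-formed, the validation step is what keeps malformed replies from entering the population; failed parses are simply retried or discarded.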

6 of 9

Experiments

Baselines


  • Random — generates expressions purely at random.
  • Tutorial GP — standard Genetic Programming implementation.
  • LLM — directly prompts the LLM to generate random expressions.
  • LLM-GP_Mu_XO — uses the LLM for initialization, mutation, and crossover only.
  • LLM-GP — uses the LLM for all evolutionary operators except fitness evaluation.

7 of 9

Experiment Results

1. Traditional GP is the fastest and cheapest, while LLM-based methods are much slower and more costly due to frequent model calls.

2. LLM-generated programs tend to be longer and more complex, whereas GP produces smaller and more efficient solutions.

3. Full LLM-GP suffers from higher operator error rates, but reducing LLM usage (LLM-GP_Mu_XO) improves stability and efficiency.

8 of 9

Conclusion

LLM-GP represents a new paradigm for code evolution: It retains the GP framework but uses prompts and LLMs to perform initialization, selection, mutation, and other operators.

LLMs introduce new costs and risks: Frequent LLM interactions during runtime lead to significant time and monetary overhead, and the outputs can be unstable or ill-formed.

Still worth exploring: Despite these uncertainties, LLM-GP offers expressive capabilities and search behaviors that traditional GP cannot achieve, making it a promising direction for future research.

9 of 9

Thank you
