1 of 20

Evaluating Dependency Gaps in LLM-Generated Code

Bhanu Prakash Vangala¹, Ashish Gehani², Tanu Malik¹

PhD Student¹, SRI Research Fellow², Associate Professor and Director of Radiant Lab¹

bv3hz@missouri.edu, ashish.gehani@sri.com, tanu@missouri.edu

Dept. of EECS, University of Missouri, Columbia

¹University of Missouri, Columbia

²SRI International, Menlo Park, CA

University of Missouri - Columbia

2 of 20

The Reproducibility Crisis Extends to LLM-Generated Code


Definition of Reproducibility:

Reproducibility is the ability to obtain consistent results from a scientific study by using the same data, methods, code, and conditions as the original research, thereby verifying its reliability.

The Reality:

The research community already faces a reproducibility crisis in ML-based workflows.
If AI tools generate non-reproducible code, we are compounding the crisis, not solving it.

Reproducibility is foundational in:

1. Scientific computing

2. Software engineering

3. AI research

LLMs may amplify, not solve, this problem.


3 of 20

LLM Coding Agents and the Dependency Problem (Claude, Gemini, Codex)

LLM Coding Agents promise to accelerate software development

Generate complete projects (multi-file code, configs, dependencies)
Widely used for real-world software tasks
Generate thousands of lines of code from simple prompts
Create full-stack applications with backend, frontend, and configuration


The Scenario: A developer asks for a "Sentiment Analysis App." The AI gives code + requirements.txt.

☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy


4 of 20

The Reality We Found So Far...


Can another developer reproduce it, or run it blindly and get the same results?

Observed (iterative resolution): ☁️ Prompt → 🤖 LLM → 📝 Code → ❌ Import Error → 🛠️ Debug → ⚠️ Syntax Error → 🛠️ Fix → 🤔 Maybe Works

Expected: ☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy


5 of 20

The Gap in Current Coding Benchmarks


BENCHMARKS TODAY EVALUATE

Functional correctness: Does the logic work? (e.g., HumanEval, MBPP)

Key Assumption: environment already exists!

THEY DO NOT EVALUATE (OR FOCUS ON):

Can this project or code run in a clean environment?

Are dependencies complete?

Is execution reproducible with what LLMs explicitly provided?


6 of 20

Reproducibility as Executable Reliability

Executable Reliability of an LLM-based Coding Agent = Probability that a generated project runs successfully in a clean environment without manual fixes

(a) A project is a SUCCESS ✅ (reproducible) only if:

It runs in a pristine environment (OS packages only).
It uses only the dependencies (data or environment) the LLM explicitly listed.

(b) A project is a FAILURE ❌ if it requires:

Any manual debugging (even 1 minute).
Installing a missing package (e.g., pip install missing-lib) or data dependency
Fixing syntax errors or file paths.
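
The success and failure criteria above can be turned into an empirical estimate of executable reliability. The following is a minimal sketch, not the study's tooling: the `RunRecord` schema and its field names are hypothetical, introduced only to make the rule "any manual intervention means failure" concrete.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Outcome of one generated project (hypothetical schema for illustration)."""
    runs: bool               # the project executed in the pristine environment
    manual_fixes: int = 0    # debugging edits, syntax fixes, path fixes
    extra_installs: int = 0  # packages or data added beyond what the LLM listed


def is_success(record):
    # Any manual intervention at all turns the project into a failure.
    return record.runs and record.manual_fixes == 0 and record.extra_installs == 0


def executable_reliability(records):
    """Fraction of projects that count as successes."""
    if not records:
        raise ValueError("at least one record is required")
    return sum(is_success(r) for r in records) / len(records)


records = [
    RunRecord(runs=True),                    # success
    RunRecord(runs=True, manual_fixes=1),    # failure: needed a fix
    RunRecord(runs=False),                   # failure: never ran
    RunRecord(runs=True, extra_installs=1),  # failure: missing package
]
print(executable_reliability(records))  # 0.25
```

The estimate is a simple success fraction; a project that eventually runs after a one-line fix still counts as a failure.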


7 of 20

Evaluation Infrastructure: AWS EC2 Setup


100 Standardized Prompts (Web Scraping, ML Pipelines, APIs/Databases)

Agents: Claude Opus 4.1, Gemini 2.5 Pro, Codex 0.52.0

100 projects per agent → 300 projects total

Languages per agent:

Python (40)
JavaScript (35)
Java (25)


8 of 20

Primary Result: From 300 AI-Generated Projects

Only 68.3% of AI-generated code projects run out-of-the-box. (This is not a bad result, and it is why LLMs are successful for code generation.)
But the remaining 31.7% require extensive manual debugging, averaging ~15 minutes each!
Primary issue: missing dependencies. The dependencies required at runtime number 13.5 times what the LLMs state.


68.3% Success ✅	31.7% Failed ❌
Run out-of-the-box	Manual debugging (~15 min each)
Zero intervention needed	Finding missing dependencies
Work as specified	Fixing code generation errors


9 of 20

The Dependency Gap: Claimed vs. Runtime Reality

Transitive dependencies create a hidden explosion at runtime

Layer 1: Claimed Dependencies (D_c):

What the LLM tells you to install
Example: flask, requests

Layer 2: Working Dependencies (D_w):

What you actually need after debugging
Example: flask, requests, beautifulsoup4 ← (missing!)

Layer 3: Runtime Dependencies (D_r):

Everything loaded at runtime (including transitive dependencies, excluding OS packages)
Example: 20+ packages (werkzeug, jinja2, click, ...)


Reproducibility fails in the gaps between these layers, where reality diverges from the LLM’s claim.
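
The two gaps between the layers can be expressed as set operations. This is an illustrative sketch using the slide's own example packages; the function names `completeness_gap` and `expansion_factor` are assumptions, and the runtime set is deliberately truncated to the three transitive packages the slide names, so the factor printed here is not a measured result.

```python
def completeness_gap(claimed, working):
    """Packages needed to run (D_w) that the LLM never listed (D_c)."""
    return set(working) - set(claimed)


def expansion_factor(claimed, runtime):
    """How many times larger the runtime set (D_r) is than the claimed set (D_c)."""
    claimed = set(claimed)
    if not claimed:
        raise ValueError("claimed dependency set is empty")
    return len(set(runtime)) / len(claimed)


d_c = {"flask", "requests"}                    # Layer 1: claimed
d_w = {"flask", "requests", "beautifulsoup4"}  # Layer 2: working
# Layer 3: runtime. Only the packages named on the slide; the real set has 20+.
d_r = d_w | {"werkzeug", "jinja2", "click"}

print(completeness_gap(d_c, d_w))  # {'beautifulsoup4'}
print(expansion_factor(d_c, d_r))  # 3.0
```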


10 of 20

We Discovered Working Dependencies (D_w) Iteratively


LLM Specifies (D_c) → Install only D_c → Code on the environment → Execute → Result

Result = Success: Record Success ✅

Result = Failure: ImportError?

Yes: Add Package → Retry

No: Code Bug? → Fix and Record → Retry

Key Rules:

✓ Max 10 iterations per project

✓ Document every failure & fix

✓ Typically 2-3 iterations for missing dependencies

✗ No manual fixes counted as "success"

This mirrors the real developer experience!
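
The iterative discovery loop can be sketched roughly as follows. This is an assumption-laden outline, not the study's harness: `run` is a hypothetical callable that installs the given dependencies into a fresh environment, executes the project, and returns `(ok, stderr)`. The sketch also stops at the first non-import failure instead of applying a human code fix, since that step cannot be automated here.

```python
import re

MAX_ITERATIONS = 10  # cap stated in the key rules


def missing_module(stderr):
    """Return the top-level module named in a ModuleNotFoundError message, if any."""
    match = re.search(r"No module named '([^']+)'", stderr)
    return match.group(1).split(".")[0] if match else None


def discover_working_deps(claimed, run):
    """Grow the claimed list until the project runs or the cap is hit."""
    deps = list(claimed)
    log = []  # every failure and fix is documented
    for iteration in range(1, MAX_ITERATIONS + 1):
        ok, stderr = run(deps)
        if ok:
            # Only a first-try run counts as a success for the headline rate.
            return {"working": deps, "out_of_the_box": iteration == 1, "log": log}
        module = missing_module(stderr)
        if module is None:
            # Not an import problem: a code bug that needs a human fix.
            log.append((iteration, "code bug", stderr))
            return {"working": None, "out_of_the_box": False, "log": log}
        log.append((iteration, "added package", module))
        deps.append(module)
    return {"working": None, "out_of_the_box": False, "log": log}


def fake_run(deps):
    """Stand-in for a real clean-environment run (illustration only)."""
    missing = sorted({"flask", "requests", "bs4"} - set(deps))
    if missing:
        return False, f"ModuleNotFoundError: No module named '{missing[0]}'"
    return True, ""


result = discover_working_deps(["flask", "requests"], fake_run)
print(result["working"], result["out_of_the_box"])
```

Note that the error message reports an import name (`bs4`), which is not always the installable package name (`beautifulsoup4`); a real harness would need a mapping between the two.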


11 of 20

Methodological Overview

Computing the Ground Truth (Runtime Dependency Capture):

Python: Sciunit (runtime tracing)
JavaScript: npm dependency tree
Java: Maven dependency tree
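
To give a feel for what "runtime capture" means for Python, here is a much simpler in-process stand-in. It is not Sciunit's mechanism: it only compares `sys.modules` before and after an import, and the helper names are hypothetical.

```python
import importlib
import sys


def top_level(names):
    """Collapse dotted module names to their top-level package, skipping private ones."""
    return {name.split(".")[0] for name in names if not name.startswith("_")}


def modules_loaded_by(import_name):
    """Top-level modules that appear in sys.modules as a result of one import."""
    before = set(sys.modules)
    importlib.import_module(import_name)
    return top_level(set(sys.modules) - before)


# Result depends on what the interpreter has already loaded, so it may be empty.
newly_loaded = modules_loaded_by("json")
print(sorted(newly_loaded))
```

A snapshot like this misses native libraries and subprocesses, which is one reason a tracing tool is used for the ground truth.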


Environment Standardization:

- AWS EC2 instances (t2.large, Ubuntu 22.04)

- Pristine baseline: exactly 91 packages

- Strict complete reset between each test, with no cached dependencies

Evaluation Protocol:

1. Install only LLM-specified dependencies

2. Attempt execution

3. Document every failure

4. No manual fixes for success rate calculation


12 of 20

Results: The Reproducibility Gap

Success Rates by Agent:

- Claude: 73.0%

- Gemini: 72.0%

- Codex: 60.0%

Takeaway: Nearly 1 in 3 projects fail immediately and require human intervention.

Each failure costs ~15 minutes of debugging time
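
As a consistency check, the per-agent rates can be combined into the overall figure. The sketch assumes 100 projects per agent, as stated in the evaluation setup; the variable names are illustrative.

```python
PROJECTS_PER_AGENT = 100  # from the evaluation setup

success_rate_pct = {"Claude": 73.0, "Gemini": 72.0, "Codex": 60.0}

# Convert each percentage to a project count, then pool across agents.
successes = {
    agent: round(pct / 100 * PROJECTS_PER_AGENT)
    for agent, pct in success_rate_pct.items()
}
total_projects = PROJECTS_PER_AGENT * len(success_rate_pct)
total_successes = sum(successes.values())
total_failures = total_projects - total_successes

print(f"{total_successes}/{total_projects} = {total_successes / total_projects:.1%}")
# 205/300 = 68.3%
print(total_failures)  # 95
```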


Reproducibility outcomes across three LLM coding agents. Partial indicates projects that execute but require external services (databases, APIs) to be fully functional.


13 of 20

The Iceberg Effect (Dependency Explosion)

The Illusion: LLMs treat dependencies as single lines of text.

The Reality: Massive transitive trees.

The Multiplier: Average expansion at runtime is 13.5×.

By Language:

Python: 12.3× expansion (3 claimed → 37 loaded)
JavaScript: 9.7× expansion (3 claimed → 29 loaded)
Java: 18.4× expansion (2 claimed → 37 loaded)
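
The 13.5× headline is consistent with an unweighted mean of the three per-language factors; that averaging method is an assumption here, not something the slide states. The sketch also shows that dividing the rounded loaded counts by the rounded claimed counts gives values close to, but not exactly, the listed factors.

```python
expansion = {"Python": 12.3, "JavaScript": 9.7, "Java": 18.4}
claimed_loaded = {"Python": (3, 37), "JavaScript": (3, 29), "Java": (2, 37)}

# Unweighted mean across the three languages (assumed averaging method).
mean_factor = sum(expansion.values()) / len(expansion)
print(round(mean_factor, 1))  # 13.5

# Ratio of the rounded counts shown next to each factor.
for language, (claimed, loaded) in claimed_loaded.items():
    print(language, round(loaded / claimed, 2))
# Python 12.33
# JavaScript 9.67
# Java 18.5
```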

Real Example:

Claimed: scikit-learn, pandas, matplotlib (3 packages)

Reality: 52 packages loaded at runtime!


Runtime dependency explosion showing the gap between claimed (agent-declared) and runtime (actually installed) dependencies. Java shows a massive 9.5× multiplier, while JavaScript surprisingly shows almost no expansion (1.0×).

Risk: If any one of these 52 packages is missing, the code cannot be reproduced.


14 of 20

Dependency Completeness Gaps

Missing Dependencies Distribution:

87% of projects have ZERO missing dependencies
13% require manual dependency discovery (1-3 packages)

Common Missing Packages:

- Python: lxml, python-dotenv, bcrypt

- JavaScript: body-parser, ws, dotenv

- Java: JUnit, SLF4J (testing & logging)

But: Only 10.5% of failures are due to missing deps


Distribution of completeness gaps. Most projects (87%) have correct dependencies, but 13% require manual iterative debugging to identify missing packages.


15 of 20

Why Do They Fail? Dependencies Account for Only ~10% (It's Not Just Dependencies)

We Expected: Dependency Problems

We Found: Code Generation Problems

Failure Breakdown Analysis (95 failed projects):

Myth: "It's just a missing pip install."

- Code Bugs: 52.6% ← Syntax errors, logic issues

- Not Processed: 16.8% ← Unparseable code

- Other: 15.8% ← Version conflicts

- Dependencies: 10.5% ← Missing packages

- Environment: 4.2% ← System conflicts
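
The percentages above can be converted back into project counts out of the 95 failures. This is a quick arithmetic check with illustrative variable names, not part of the original analysis.

```python
FAILED_PROJECTS = 95

share_pct = {
    "Code bugs": 52.6,
    "Not processed": 16.8,
    "Other": 15.8,
    "Dependencies": 10.5,
    "Environment": 4.2,
}

# Round each share to the nearest whole project.
counts = {cause: round(FAILED_PROJECTS * pct / 100) for cause, pct in share_pct.items()}

for cause, n in counts.items():
    print(f"{cause}: {n}")
# Code bugs: 50
# Not processed: 16
# Other: 15
# Dependencies: 10
# Environment: 4
print(sum(counts.values()))  # 95
```

Missing packages account for 10 of the 95 failures, against 50 for code bugs.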


Insight: LLMs struggle with basic code structure more than with environment specs.

The problem isn't just dependencies; it's basic code quality.

Error type distribution by agent among failed projects. Code bugs dominate overall (50 of 95), with Codex showing the highest count (24). Not Processed errors appear only in Codex and Gemini (8 each), while Dependency errors are most prevalent in Claude (7).


16 of 20

Language Matters Dramatically

Success Rate by Language:

Python: 89.2%

JavaScript: 61.9%

Java: 44.0%
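
The per-language rates line up with the overall 68.3% once they are weighted by project count. The sketch assumes the 40/35/25 per-agent language split across three agents from the evaluation setup; names are illustrative.

```python
AGENTS = 3
per_agent_projects = {"Python": 40, "JavaScript": 35, "Java": 25}
success_rate_pct = {"Python": 89.2, "JavaScript": 61.9, "Java": 44.0}

# Total projects per language across all agents.
projects = {lang: n * AGENTS for lang, n in per_agent_projects.items()}

# Whole-project success counts implied by each rate.
successes = {
    lang: round(projects[lang] * success_rate_pct[lang] / 100) for lang in projects
}

overall = sum(successes.values()) / sum(projects.values())
print(successes)          # {'Python': 107, 'JavaScript': 65, 'Java': 33}
print(f"{overall:.1%}")   # 68.3%
```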

Why Such Differences?

Python: Simple requirements.txt, clear error messages, flat dependency model

JavaScript: npm helps, but nested dependencies add complexity (nested but auto-resolved)

Java: Complex Maven XML, deep transitive dependency graphs, complex scopes

Reproducibility depends on the language ecosystem as well as on the model.


17 of 20

Surprising Agent Specializations: Hidden Biases

Unexpected Findings: Agents have undocumented "skills".

Not all LLMs behave the same; they have hidden specializations that are never advertised.

Claude: The "Enterprise" Specialist. Strong across all languages, Java expert (80%)
Gemini: Perfect Python (100%!), struggles with Java (28%)
Codex: Python bias (87.5%), poor Java (24%) - Struggles with non-scripting languages.

Implication

“Best LLM” depends on language
These aren't minor differences - they're 3× performance gaps
Organizations must match agents to their tech stack
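
The "3×" figure can be checked against the Java success rates quoted for each agent. A small sketch with illustrative variable names:

```python
java_success_pct = {"Claude": 80.0, "Gemini": 28.0, "Codex": 24.0}

best = max(java_success_pct.values())

# Ratio of the best agent's Java rate to each other agent's rate.
gaps = {
    agent: best / pct for agent, pct in java_success_pct.items() if pct != best
}

for agent, ratio in gaps.items():
    print(f"Claude vs {agent}: {ratio:.1f}x")
# Claude vs Gemini: 2.9x
# Claude vs Codex: 3.3x
```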


18 of 20

Path Forward: From AWS to Chameleon

Goal 1: Bare-Metal Clean-State Guarantees

Migrate 300-project eval to Chameleon bare-metal nodes

Eliminate hidden hypervisor state persisting on EC2

Goal 2: Extend SciUnit Provenance to JS/Java

System-level import hooks on bare-metal

Capture true runtime deps, not just pkg manager output

Goal 3: Parallelize the Pipeline

Current: ~100 hours serial on AWS (isolation constraint)

Chameleon reservations + snapshots enable parallelism

Goal 4: Reproducible Evaluation Infrastructure

Make evaluation itself reproducible for other researchers

Snapshots + metadata vs. cumbersome AWS AMI sharing


"Now, the path forward—and this is where Chameleon comes in."

"Goal 1: We want to migrate our entire 300-project evaluation to Chameleon bare-metal nodes. Unlike EC2, bare-metal provisioning eliminates the hypervisor layer entirely—no hidden state can persist between runs. This gives us verified clean-state guarantees."

"Goal 2: We currently use SciUnit for Python provenance capture, but for JavaScript and Java we relied on package manager output which may miss dynamic loads. Bare-metal access lets us extend system-level import hooks to all three languages."

"Goal 3: Our pipeline currently takes about 100 hours running serially—that was our only isolation option on AWS. Chameleon's reservation system with snapshotted images enables parallel execution across multiple nodes."

"Goal 4: Finally, we want the evaluation infrastructure itself to be reproducible. Chameleon's snapshotted images and metadata make this straightforward—unlike the cumbersome AWS AMI sharing process."

19 of 20

Conclusion & Chameleon Goals

1. AI-Generated Code Has a Reproducibility Crisis → Only 68.3% run out-of-the-box

2. The Problem Is Broader Than Dependencies → 52.6% code bugs, 10.5% missing deps

3. Language & Agent Choice Matter → Python 89.2% vs Java 44.0%; 3× agent gaps

4. Hidden Complexity Is Massive → 13.5× average runtime dependency expansion

5. Reproducibility Must Be a Primary Metric → Not an afterthought in AI code evaluation

6. Chameleon Enables Rigorous Evaluation → Bare-metal isolation, SciUnit provenance, parallelization


Chameleon's bare-metal infrastructure directly addresses the evaluation gaps we identified


"To conclude, six takeaways."

"One: only 68.3% run out-of-the-box—there's a clear reproducibility gap."

"Two: the problem is broader than dependencies—most failures are code bugs at 52.6%."

"Three: language and agent choice matter critically—Python at 89.2% versus Java at 44.0% is a huge gap."

"Four: hidden dependency complexity is massive—about 13.5× runtime expansion on average."

"Five: reproducibility must be a primary metric in AI code evaluation, not an afterthought."

"Six: Chameleon enables the rigorous evaluation we need—bare-metal isolation eliminates hidden state, system-level hooks give us true provenance across all three languages, and the reservation model lets us parallelize what currently takes 100 hours."

"Chameleon's bare-metal infrastructure directly addresses the evaluation gaps we identified in our AWS-based study."

20 of 20

☺ Thank You

Contact:

Bhanu Prakash Vangala

bv3hz@missouri.edu


https://radiant-systems-lab.github.io/
