1 of 20

Evaluating Dependency Gaps in LLM-Generated Code

Bhanu Prakash Vangala¹, Ashish Gehani², Tanu Malik¹

PhD Student¹, SRI Research Fellow², Associate Professor and Director of Radiant Lab¹

bv3hz@missouri.edu, ashish.gehani@sri.com, tanu@missouri.edu

Dept. of EECS, University of Missouri, Columbia

¹University of Missouri, Columbia

²SRI International, Menlo Park, CA

University of Missouri - Columbia

2 of 20

The Reproducibility Crisis Extends to LLM-Generated Code


Definition of Reproducibility:

Reproducibility is the ability to obtain consistent results from a scientific study by using the same data, methods, code, and conditions as the original research, thereby verifying its reliability.

The Reality Crisis:

  • The research community already faces a reproducibility crisis in ML-based workflows.
  • If AI tools generate non-reproducible code, we are compounding the crisis, not solving it.

Reproducibility is foundational in:

1. Scientific computing

2. Software engineering

3. AI research

LLMs may amplify this problem rather than solve it.


3 of 20

LLM Coding Agents and the Dependency Problem (Claude, Gemini, Codex)

LLM Coding Agents promise to accelerate software development

  • Generate complete projects (multi-file code, configs, dependencies)
  • Widely used for real-world software tasks
  • Generate thousands of lines of code from simple prompts
  • Create full-stack applications with backend, frontend, and configuration


The Scenario: A developer asks for a "Sentiment Analysis App." The AI gives code + requirements.txt

☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy


4 of 20

The Reality We Found So Far...


Can another developer run it blindly and reproduce the same results?

In practice (iterative resolution): ☁️ Prompt → 🤖 LLM → 📝 Code → Import Error → 🛠️ Debug → ⚠️ Syntax Error → 🛠️ Fix → 🤔 Maybe Works

As promised: ☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy


5 of 20

The Gap in Current Coding Benchmarks


BENCHMARKS TODAY EVALUATE

Functional correctness: Does the logic work? (e.g., HumanEval, MBPP)

Key Assumption: environment already exists!

THEY DO NOT EVALUATE OR FOCUS ON:

Can this project or code run in a clean environment?

Are dependencies complete?

Is execution reproducible with what LLMs explicitly provided?


6 of 20

Reproducibility as Executable Reliability

Executable Reliability of an LLM-based Coding Agent = probability that a generated project runs successfully in a clean environment without manual fixes

(a) A project is a SUCCESS ✅ or Reproducible (only if):

    • It runs in a pristine environment (OS packages only).
    • It uses only the dependencies (data or environment) the LLM explicitly listed

(b) A project is a FAILURE ❌ if it requires:

    • Any manual debugging (even 1 minute).
    • Installing a missing package (e.g., pip install missing-lib) or data dependency
    • Fixing syntax errors or file paths.
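The success and failure rules above can be read as a strict all-or-nothing classifier. The following is a minimal sketch under that reading; the `RunRecord` schema and its field names are hypothetical and are not taken from the study's tooling.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Outcome of one generated project in a clean environment (hypothetical schema)."""
    ran_successfully: bool
    manual_fixes: int    # debugging edits, syntax fixes, path fixes
    extra_packages: int  # packages or data installed beyond what the LLM listed


def is_reproducible(record: RunRecord) -> bool:
    # Success requires a clean run with zero human intervention of any kind.
    return record.ran_successfully and record.manual_fixes == 0 and record.extra_packages == 0


def executable_reliability(records: list[RunRecord]) -> float:
    """Fraction of projects that count as reproducible."""
    if not records:
        raise ValueError("need at least one record")
    return sum(is_reproducible(r) for r in records) / len(records)


records = [
    RunRecord(True, 0, 0),   # runs out of the box: success
    RunRecord(True, 0, 1),   # ran only after installing a missing package: failure
    RunRecord(True, 1, 0),   # ran only after a one-line fix: failure
    RunRecord(False, 0, 0),  # never ran: failure
]
print(executable_reliability(records))  # 0.25
```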


7 of 20

Evaluation Infrastructure: AWS EC2 Setup


100 Standardized Prompts → Claude Opus 4.1, Gemini 2.5 Pro, Codex 0.52.0

Task domains: Web Scraping, ML Pipelines, APIs/Databases

100 projects per agent, 300 projects in total

Languages per agent:

  • Python (40)
  • JavaScript (35)
  • Java (25)


8 of 20

Primary Result: From 300 AI-Generated Projects

    • 68.3% of AI-generated code projects run out-of-the-box. This is not a bad result, and it helps explain why LLMs are successful at code generation.
    • But the remaining 31.7% require extensive manual debugging, averaging ~15 minutes each.
    • Primary issue: missing dependencies. Projects need 13.5 times as many dependencies at runtime as the LLMs declare.


68.3% Success ✅: run out-of-the-box, zero intervention needed, work as specified

31.7% Failed: manual debugging (~15 min each), finding missing dependencies, fixing code generation errors


9 of 20

The Dependency Gap: Claimed vs. Runtime Reality

Transitive dependencies create a hidden explosion at runtime

Layer 1: Claimed Dependencies (Dc):

    • What the LLM tells you to install
    • Example: flask, requests

Layer 2: Working Dependencies (Dw):

    • What you actually need after debugging
    • Example: flask, requests, beautifulsoup4 ← (missing!)

Layer 3: Runtime Dependencies (Dr):

    • Everything loaded at runtime (including transitive dependencies, excluding OS packages)
    • Example: 20+ packages (werkzeug, jinja2, click, ...)


Reproducibility fails in the gaps between these layers, where reality diverges from the LLM's claim.
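The three layers can be treated as nested sets, which makes the two gaps easy to compute. This is a small sketch using the slide's example packages; the runtime set is truncated to a few of the named packages for illustration, so the ratio printed here is not a measured value.

```python
claimed = {"flask", "requests"}                      # Dc: what the LLM lists
working = {"flask", "requests", "beautifulsoup4"}    # Dw: what is needed after debugging
runtime = working | {"werkzeug", "jinja2", "click"}  # Dr: truncated for illustration

completeness_gap = working - claimed     # packages found only by debugging
transitive_only = runtime - working      # packages pulled in implicitly
expansion = len(runtime) / len(claimed)  # runtime-to-claimed ratio

print(sorted(completeness_gap))
print(sorted(transitive_only))
print(expansion)
```

A non-empty `completeness_gap` means the project fails out-of-the-box; a large `expansion` means the declared list hides most of what actually runs.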


10 of 20

We Discovered Working Dependencies (Dw) Iteratively


1. Install only: what the LLM specifies (Dc)

2. Execute: run the code in the environment

3. Result: success or failure

    • Success: record success ✅
    • Failure with ImportError: add the package and retry
    • Failure without ImportError (code bug?): fix, record, and retry

Key Rules:

✓ Max 10 iterations per project

✓ Document every failure & fix

✓ Typically 2-3 iterations for missing dependencies

✗ No manual fixes counted as "success"

This mirrors the real developer experience!
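The discovery loop can be sketched as follows. This is an illustration under stated assumptions rather than the study's actual harness: `execute` is a hypothetical hook that installs the given packages in a fresh environment, runs the project, and returns the error text (or `None` on success). The sketch handles only the ImportError branch and stops at code bugs, since those need a human fix. It also adds the module name from the error message, which can differ from the package name (for example `bs4` vs `beautifulsoup4`).

```python
import re
from typing import Callable, Optional

MISSING_MODULE = re.compile(r"No module named '([\w.]+)'")


def discover_working_deps(
    claimed: set[str],
    execute: Callable[[set[str]], Optional[str]],
    max_iterations: int = 10,
) -> tuple[set[str], bool, list[str]]:
    """Add missing packages one at a time until the project runs or the budget is spent."""
    installed = set(claimed)
    log: list[str] = []  # every failure is documented
    for _ in range(max_iterations):
        error = execute(installed)
        if error is None:
            # Only a run that needed nothing beyond the claimed set is a success.
            return installed, installed == claimed, log
        log.append(error)
        match = MISSING_MODULE.search(error)
        if match is None:
            break  # not an ImportError: a code bug that needs a manual fix
        installed.add(match.group(1).split(".")[0])
    return installed, False, log


def fake_execute(installed: set[str]) -> Optional[str]:
    # Stand-in for a real run: this pretend project needs bs4 and lxml.
    for module in ("bs4", "lxml"):
        if module not in installed:
            return f"ModuleNotFoundError: No module named '{module}'"
    return None


deps, out_of_box, log = discover_working_deps({"flask", "requests"}, fake_execute)
print(sorted(deps), out_of_box, len(log))
```

Here the project runs on the third attempt, after two packages were added, so it is recorded as a failure for the success-rate calculation even though a working set was found.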


11 of 20

Methodological Overview

Computing the Ground Truth (Runtime Dependency Capture):

  • Python: Sciunit (runtime tracing)
  • JavaScript: npm dependency tree
  • Java: Maven dependency tree
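To give a feel for what "runtime capture" means on the Python side, the sketch below compares `sys.modules` before and after an import. It is a simplified stand-in and not Sciunit's tracing; the helper name `modules_loaded_by` is hypothetical, and a standard-library module is used so the example runs without installing anything.

```python
import importlib
import sys


def modules_loaded_by(module_name: str) -> set[str]:
    """Return top-level modules that first appear in sys.modules when importing `module_name`."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    after = set(sys.modules)
    return {name.split(".")[0] for name in after - before}


# With a third-party package, the result would include its transitive imports.
loaded = modules_loaded_by("colorsys")
print(sorted(loaded))
```

An in-process diff like this only sees Python modules that were not already imported, which is one reason lower-level tracing gives a more complete picture.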


Environment Standardization:

- AWS EC2 instances (t2.large, Ubuntu 22.04)

- Pristine baseline: exactly 91 packages

- Strict complete reset between tests, with no cached dependencies

Evaluation Protocol:

1. Install only LLM-specified dependencies

2. Attempt execution

3. Document every failure

4. No manual fixes for success rate calculation


12 of 20

Results: The Reproducibility Gap

Success Rates by Agent:

- Claude: 73.0%

- Gemini: 72.0%

- Codex: 60.0%

Takeaway: Nearly 1 in 3 projects fail immediately and require human intervention.

Each failure costs ~15 minutes of debugging time
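The per-agent rates tie back to the headline numbers with simple arithmetic. The sketch below uses the figures from the slides (100 projects per agent, ~15 minutes per failure); the total debugging time is a derived estimate under that per-failure assumption, not a separately reported measurement.

```python
# With 100 projects per agent, each success percentage equals a project count.
agent_passed = {"Claude": 73, "Gemini": 72, "Codex": 60}
projects_per_agent = 100
minutes_per_failure = 15  # approximate average from the slide

passed = sum(agent_passed.values())
total = projects_per_agent * len(agent_passed)
failed = total - passed
overall_rate = round(100 * passed / total, 1)
debug_hours = failed * minutes_per_failure / 60

print(passed, failed, overall_rate, debug_hours)  # 205 95 68.3 23.75
```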


Reproducibility outcomes across three LLM coding agents. Partial indicates projects that execute but require external services (databases, APIs) to be fully functional.


13 of 20

The Iceberg Effect (Dependency Explosion)

The Illusion: LLMs treat dependencies as single lines of text.

The Reality: Massive transitive trees.

The Multiplier: Average expansion is 13.5× at runtime.

By Language:

  • Python: 12.3× expansion (3 claimed → 37 loaded)
  • JavaScript: 9.7× expansion (3 claimed → 29 loaded)
  • Java: 18.4× expansion (2 claimed → 37 loaded)
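The expansion factor is the ratio of loaded to claimed packages. The sketch below recomputes it from the rounded counts on the slide and checks one plausible reading of the 13.5× average, namely an unweighted mean of the three per-language factors. That reading is an assumption; the slides do not state how the average was computed.

```python
from statistics import mean

# (claimed, loaded, reported factor) per language, as listed on the slide
reported = {"Python": (3, 37, 12.3), "JavaScript": (3, 29, 9.7), "Java": (2, 37, 18.4)}

# Ratio recomputed from the rounded counts shown on the slide.
recomputed = {lang: round(loaded / claimed, 1) for lang, (claimed, loaded, _) in reported.items()}
average = round(mean(factor for _, _, factor in reported.values()), 1)

print(recomputed)
print(average)
```

The recomputed ratios match the reported factors for Python and JavaScript; for Java the rounded counts give 18.5 rather than 18.4, presumably because the counts shown are themselves rounded averages.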

Real Example:

Claimed: scikit-learn, pandas, matplotlib (3 packages)

Reality: 52 packages loaded at runtime!


Runtime dependency explosion showing the gap between claimed (agent-declared) and runtime (actually installed) dependencies. Java shows a massive 9.5× multiplier, while JavaScript surprisingly shows almost no expansion (1.0×).

Risk: if any of these 52 packages is missing, the code cannot be reproduced.


14 of 20

Dependency Completeness Gaps

Missing Dependencies Distribution:

  • 87% of projects have ZERO missing dependencies
  • 13% require manual dependency discovery (1-3 packages)

Common Missing Packages:

- Python: lxml, python-dotenv, bcrypt

- JavaScript: body-parser, ws, dotenv

- Java: JUnit, SLF4J (testing & logging)

But: only 10.5% of failures are due to missing dependencies


Distribution of completeness gaps. Most projects (87%) have correct dependencies, but 13% require manual iterative debugging to identify missing packages.


15 of 20

Why Do They Fail? Dependencies Explain Only ~10% (It's Not Just Dependencies)

We Expected: Dependency Problems

We Found: Code Generation Problems

Myth: "It's just a missing pip install."

Failure Breakdown Analysis (95 failed projects):

- Code Bugs: 52.6% ← Syntax errors, logic issues

- Not Processed: 16.8% ← Unparseable code

- Other: 15.8% ← Version conflicts

- Dependencies: 10.5% ← Missing packages

- Environment: 4.2% ← System conflicts
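The percentages in the breakdown can be converted back into project counts as a sanity check. This sketch uses only the figures listed above; the category labels are shortened for use as dictionary keys.

```python
failed_projects = 95
shares = {
    "Code bugs": 52.6,
    "Not processed": 16.8,
    "Other": 15.8,
    "Dependencies": 10.5,
    "Environment": 4.2,
}

# Round each share of the 95 failures to the nearest whole project.
counts = {category: round(failed_projects * pct / 100) for category, pct in shares.items()}
print(counts)
print(sum(counts.values()))
```

The counts come to 50, 16, 15, 10, and 4, which sum to 95, so only about 10 of the 95 failures trace back to a missing package.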


Insight: LLMs struggle with basic code structure more than with environment specs. The problem isn't just dependencies; it's basic code quality.

Error type distribution by agent among failed projects. Code bugs dominate overall (50 of 95), with Codex showing the highest count (24). Not Processed errors appear only in Codex and Gemini (8 each), while Dependency errors are most prevalent in Claude (7).


16 of 20

Language Matters Dramatically

Success Rate by Language:

Python: 89.2%

JavaScript: 61.9%

Java: 44.0%

Why Such Differences?

Python: simple requirements.txt, clear error messages, flat dependency model

JavaScript: npm helps; nested dependencies add complexity but are auto-resolved

Java: complex Maven XML, deep transitive dependency graphs, complex scopes

Reproducibility depends on both the ecosystem and the model.


17 of 20

Surprising Agent Specializations: Hidden Biases

Unexpected finding: agents have undocumented "skills".

Not all LLMs behave the same, and their hidden specializations are never advertised.

  • Claude: The "Enterprise" Specialist. Strong across all languages, Java expert (80%)
  • Gemini: Perfect Python (100%!), struggles with Java (28%)
  • Codex: Python bias (87.5%), poor Java (24%) - Struggles with non-scripting languages.

Implication

  • “Best LLM” depends on language
  • These aren't minor differences - they're 3× performance gaps
  • Organizations must match agents to their tech stack
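The "3× performance gaps" claim can be checked against the Java success rates quoted above. The sketch below uses only those three figures; since each agent generated the same number of Java projects (25), the plain mean of the three rates should also match the overall Java success rate.

```python
java_success = {"Claude": 80.0, "Gemini": 28.0, "Codex": 24.0}  # percent, from the slide

best = max(java_success, key=java_success.get)
gaps = {
    agent: round(java_success[best] / rate, 1)
    for agent, rate in java_success.items()
    if agent != best
}
java_mean = sum(java_success.values()) / len(java_success)

print(best, gaps, java_mean)
```

Claude's Java rate is roughly 2.9× Gemini's and 3.3× Codex's, and the mean of 44.0% agrees with the Java success rate reported by language.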


18 of 20

Path Forward: From AWS to Chameleon

Goal 1: Bare-Metal Clean-State Guarantees

Migrate 300-project eval to Chameleon bare-metal nodes

Eliminate hidden hypervisor state persisting on EC2

Goal 2: Extend SciUnit Provenance to JS/Java

System-level import hooks on bare-metal

Capture true runtime deps, not just pkg manager output

Goal 3: Parallelize the Pipeline

Current: ~100 hours serial on AWS (isolation constraint)

Chameleon reservations + snapshots enable parallelism

Goal 4: Reproducible Evaluation Infrastructure

Make evaluation itself reproducible for other researchers

Snapshots + metadata vs. cumbersome AWS AMI sharing


19 of 20

Conclusion & Chameleon Goals

1. AI-Generated Code Has a Reproducibility Crisis → Only 68.3% run out-of-the-box

2. The Problem Is Broader Than Dependencies → 52.6% code bugs, 10.5% missing deps

3. Language & Agent Choice Matter → Python 89.2% vs Java 44.0%; 3× agent gaps

4. Hidden Complexity Is Massive → 13.5× average runtime dependency expansion

5. Reproducibility Must Be a Primary Metric → Not an afterthought in AI code evaluation

6. Chameleon Enables Rigorous Evaluation → Bare-metal isolation, SciUnit provenance, parallelization


Chameleon's bare-metal infrastructure directly addresses the evaluation gaps we identified


20 of 20

☺ Thank You

Contact:

Bhanu Prakash Vangala

bv3hz@missouri.edu
