Evaluating Dependency Gaps in LLM-Generated Code
Bhanu Prakash Vangala¹, Ashish Gehani², Tanu Malik¹
PhD Student¹, SRI Research Fellow², Associate Professor and Director of Radiant Lab¹
bv3hz@missouri.edu, ashish.gehani@sri.com, tanu@missouri.edu
Dept. of EECS, University of Missouri, Columbia
¹University of Missouri, Columbia
²SRI International, Menlo Park, CA
University of Missouri - Columbia
The Reproducibility Crisis Extends to LLM-Generated Code
Definition of Reproducibility:
Reproducibility is the ability to obtain consistent results from a scientific study by using the same data, methods, code, and conditions as the original research, thereby verifying its reliability.
The Reality Crisis:
Reproducibility is foundational in:
1. Scientific computing
2. Software engineering
3. AI research
LLMs may amplify, not solve, this problem.
LLM Coding Agents and the Dependency Problem (Claude, Gemini, Codex)
LLM Coding Agents promise to accelerate software development
The Scenario: A developer asks for a "Sentiment Analysis App." The AI gives code + requirements.txt.
☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy
The Reality We Found So Far...
Can another developer run it as-is and get the same results?
Observed: ☁️ Prompt → 🤖 LLM → 📝 Code → ❌ Import Error → 🛠️ Debug → ⚠️ Syntax Error → 🛠️ Fix → 🤔 Maybe Works (Iterative Resolution)
Expected: ☁️ Prompt → 🤖 LLM → 📝 Complete Code → ✅ Deploy
The Gap in Current Coding Benchmarks
BENCHMARKS TODAY EVALUATE
Functional correctness: Does the logic work? (e.g., HumanEval, MBPP)
Key Assumption: the environment already exists!
THEY DO NOT EVALUATE OR FOCUS ON:
Can the project or code run in a clean environment?
Are dependencies complete?
Is execution reproducible with what LLMs explicitly provided?
Reproducibility as Executable Reliability
Executable Reliability of an LLM-based Coding Agent = probability that a generated project runs successfully in a clean environment without manual fixes.
(a) A project is a SUCCESS ✅ (reproducible) only if:
(b) A project is a FAILURE ❌ if it requires:
Evaluation Infrastructure: AWS EC2 Setup
100 Standardized Prompts: Web Scraping, ML Pipelines, APIs/Databases
Agents: Claude Opus 4.1, Gemini 2.5 Pro, Codex 0.52.0
100 projects per agent, 300 projects in total
Languages per agent:
Primary Result: From 300 AI-Generated Projects
68.3% Success ✅ | 31.7% Failed ❌
Run out-of-the-box | Manual debugging (~15 min each)
Zero intervention needed | Finding missing dependencies
Work as specified | Fixing code generation errors
The Dependency Gap: Claimed vs. Runtime Reality
Transitive dependencies create a hidden explosion at runtime
Layer 1: Claimed Dependencies (Dc):
Layer 2: Working Dependencies (Dw):
Layer 3: Runtime Dependencies (Dr):
Reproducibility fails in the gaps between these layers, where reality diverges from the LLM's claim.
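The three layers can be read as sets. The following is a minimal sketch, assuming that each layer contains the one before it (Dc ⊆ Dw ⊆ Dr); the package names and the helper functions `completeness_gap` and `transitive_gap` are hypothetical illustrations, not measured data or the authors' tooling.

```python
# Illustrative sets only: the package names are hypothetical, not study results.
claimed = {"pandas", "scikit-learn"}                      # Dc: declared by the agent
working = {"pandas", "scikit-learn", "lxml"}              # Dw: needed to be installed to run
runtime = working | {"numpy", "scipy", "joblib", "pytz"}  # Dr: everything loaded at runtime


def completeness_gap(dc, dw):
    """Packages required to run that the agent never declared (Dw minus Dc)."""
    return dw - dc


def transitive_gap(dw, dr):
    """Packages pulled in implicitly at runtime (Dr minus Dw)."""
    return dr - dw


print(sorted(completeness_gap(claimed, working)))  # ['lxml']
print(len(transitive_gap(working, runtime)))       # 4
```

A non-empty completeness gap is what forces manual debugging; the transitive gap is invisible to the person reading the declared dependency file.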
We Discovered Working Dependencies (Dw) Iteratively
Install only LLM-specified dependencies → Execute → Result → ImportError?
Yes → install the missing package and repeat
No → Record Success ✅
Key Rules:
✓ Max 10 iterations per project
✓ Document every failure & fix
✓ Typically 2-3 iterations for missing dependencies
✗ No manual fixes counted as "success"
This mirrors the real developer experience!
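The loop above can be sketched in Python. This is an illustration under stated assumptions, not the authors' harness: `run_project` is a hypothetical callable that installs exactly the given packages in a clean environment, runs the project, and returns its stderr (empty on success), and `fake_run` is a stand-in used only to exercise the loop. The sketch also assumes the missing module name equals the installable package name, which does not always hold in practice.

```python
import re

MAX_ITERATIONS = 10  # cap taken from the key rules above


def missing_module(stderr):
    """Return the top-level module named in a ModuleNotFoundError message, or None."""
    match = re.search(r"No module named '([^']+)'", stderr)
    return match.group(1).split(".")[0] if match else None


def discover_working_deps(claimed, run_project, max_iterations=MAX_ITERATIONS):
    """Grow the claimed set until the project runs or the iteration cap is hit.

    Returns (dependencies, failure_log, runs), where failure_log records every
    failed attempt as (iteration, stderr).
    """
    deps = set(claimed)
    log = []
    for iteration in range(1, max_iterations + 1):
        stderr = run_project(deps)
        if not stderr:
            return deps, log, True
        log.append((iteration, stderr.strip()))
        module = missing_module(stderr)
        if module is None or module in deps:
            # Not a failure that installing another package can fix.
            return deps, log, False
        deps.add(module)
    return deps, log, False


def fake_run(deps):
    """Hypothetical stand-in for a real clean-environment run."""
    missing = sorted({"requests", "lxml"} - deps)
    if missing:
        return f"ModuleNotFoundError: No module named '{missing[0]}'"
    return ""


deps, log, runs = discover_working_deps({"requests"}, fake_run)
print(sorted(deps), len(log), runs)  # ['lxml', 'requests'] 1 True
```

Under the rules above, a project that needed any addition still counts as a failure for the success rate; the loop only records what the working set turned out to be.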
Methodological Overview
Computing the Ground Truth (Runtime Dependency Capture):
Environment Standardization:
- AWS EC2 instances (t2.large, Ubuntu 22.04)
- Pristine baseline: exactly 91 packages
- Strict, complete reset between tests, with no cached dependencies
Evaluation Protocol:
1. Install only LLM-specified dependencies
2. Attempt execution
3. Document every failure
4. No manual fixes for success rate calculation
Sciunit
Results: The Reproducibility Gap
Success Rates by Agent:
- Claude: 73.0%
- Gemini: 72.0%
- Codex: 60.0%
Takeaway: Nearly 1 in 3 projects fail immediately and require human intervention.
Each failure costs ~15 minutes of debugging time
Reproducibility outcomes across three LLM coding agents. Partial indicates projects that execute but require external services (databases, APIs) to be fully functional.
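The per-agent rates aggregate to the headline figure. The short check below assumes 100 projects per agent (100 prompts given to each of three agents, 300 projects in total); the variable names are illustrative.

```python
# Per-agent success rates (percent) as reported on this slide.
success_rate = {"Claude": 73.0, "Gemini": 72.0, "Codex": 60.0}
projects_per_agent = 100  # assumption: one project per prompt per agent

successes = sum(round(rate / 100 * projects_per_agent) for rate in success_rate.values())
total = projects_per_agent * len(success_rate)
overall = 100 * successes / total

print(f"{successes}/{total} = {overall:.1f}% success, {total - successes} failures")
# 205/300 = 68.3% success, 95 failures
```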
The Iceberg Effect (Dependency Explosion)
The Illusion: LLMs treat dependencies as single lines of text.
The Reality: Massive transitive trees.
The Multiplier: Average expansion is 13.5× at runtime.
By Language:
Real Example:
Claimed: scikit-learn, pandas, matplotlib (3 packages)
Reality: 52 packages loaded at runtime!
Runtime dependency explosion showing the gap between claimed (agent-declared) and runtime (actually installed) dependencies. Java shows a massive 9.5× multiplier, while JavaScript surprisingly shows almost no expansion (1.0×).
Risk: If any one of these 52 packages is missing, the code cannot be reproduced.
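The multiplier is the ratio of runtime to claimed package counts. A minimal sketch for the real example above (3 claimed, 52 at runtime), where `expansion_multiplier` is a hypothetical helper name:

```python
def expansion_multiplier(claimed_count, runtime_count):
    """Ratio of packages loaded at runtime to packages the agent declared."""
    if claimed_count <= 0:
        raise ValueError("claimed_count must be positive")
    return runtime_count / claimed_count


# Real example from the slide: scikit-learn, pandas, matplotlib -> 52 packages.
print(f"{expansion_multiplier(3, 52):.1f}x")  # 17.3x
```

This single project sits above the reported 13.5× average.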
Dependency Completeness Gaps
Missing Dependencies Distribution:
Common Missing Packages:
- Python: lxml, python-dotenv, bcrypt
- JavaScript: body-parser, ws, dotenv
- Java: JUnit, SLF4J (testing & logging)
But: Only 10.5% of failures are due to missing dependencies
Distribution of completeness gaps. Most projects (87%) have correct dependencies, but 13% require manual iterative debugging to identify missing packages.
Why Do They Fail? Missing Dependencies Explain Only ~10% (It's Not Just Dependencies)
We Expected: Dependency Problems
We Found: Code Generation Problems
Failure Breakdown Analysis (95 failed projects):
Myth: "It's just a missing pip install."
- Code Bugs: 52.6% ← Syntax errors, logic issues
- Not Processed: 16.8% ← Unparseable code
- Other: 15.8% ← Version conflicts
- Dependencies: 10.5% ← Missing packages
- Environment: 4.2% ← System conflicts
Insight: LLMs struggle with basic code structure more than with environment specs.
The problem isn't just dependencies; it's basic code quality.
Error type distribution by agent among failed projects. Code bugs dominate overall (50 of 95), with Codex showing the highest count (24). Not Processed errors appear only in Codex and Gemini (8 each), while Dependency errors are most prevalent in Claude (7).
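The percentages in the breakdown can be checked against the 95 failed projects. In the sketch below, the counts for Code Bugs (50) and Not Processed (8 + 8 = 16) come from the caption; the counts for Other, Dependencies, and Environment are back-calculated from the reported percentages and are therefore an assumption.

```python
failed_total = 95

# 50 and 16 are stated in the caption; 15, 10, and 4 are inferred from the percentages.
counts = {
    "Code Bugs": 50,
    "Not Processed": 16,
    "Other": 15,
    "Dependencies": 10,
    "Environment": 4,
}
assert sum(counts.values()) == failed_total

shares = {category: round(100 * n / failed_total, 1) for category, n in counts.items()}
for category, share in shares.items():
    print(f"{category}: {share}%")
# Code Bugs: 52.6%
# Not Processed: 16.8%
# Other: 15.8%
# Dependencies: 10.5%
# Environment: 4.2%
```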
Language Matters Dramatically
Success Rate by Language:
Python: 89.2%
JavaScript: 61.9%
Java: 44.0%
Why Such Differences?
Python: Simple requirements.txt, clear error messages - flat dependency model
JavaScript: npm helps, but nested deps add complexity - nested but auto-resolved
Java: Complex Maven XML, deep transitive graphs - deep transitive dependencies, complex scopes
Reproducibility depends on the ecosystem as well as on the model.
Surprising Agent Specializations: Hidden Biases
Unexpected Findings: Agents have undocumented "skills".
Not all LLMs behave the same; their hidden specializations are never advertised.
Implication
Path Forward: From AWS to Chameleon
Goal 1: Bare-Metal Clean-State Guarantees
Migrate 300-project eval to Chameleon bare-metal nodes
Eliminate hidden hypervisor state persisting on EC2
Goal 2: Extend SciUnit Provenance to JS/Java
System-level import hooks on bare-metal
Capture true runtime deps, not just pkg manager output
Goal 3: Parallelize the Pipeline
Current: ~100 hours serial on AWS (isolation constraint)
Chameleon reservations + snapshots enable parallelism
Goal 4: Reproducible Evaluation Infrastructure
Make evaluation itself reproducible for other researchers
Snapshots + metadata vs. cumbersome AWS AMI sharing
Conclusion & Chameleon Goals
1. AI-Generated Code Has a Reproducibility Crisis → Only 68.3% run out-of-the-box
2. The Problem Is Broader Than Dependencies → 52.6% code bugs, 10.5% missing deps
3. Language & Agent Choice Matter → Python 89.2% vs Java 44.0%; 3× agent gaps
4. Hidden Complexity Is Massive → 13.5× average runtime dependency expansion
5. Reproducibility Must Be a Primary Metric → Not an afterthought in AI code evaluation
6. Chameleon Enables Rigorous Evaluation → Bare-metal isolation, SciUnit provenance, parallelization
Chameleon's bare-metal infrastructure directly addresses the evaluation gaps we identified
☺ Thank You
Contact:
Bhanu Prakash Vangala
bv3hz@missouri.edu