Human-Centered Evaluation of Coding Agents
Valerie Chen
Brief intro 👋
ML/NLP
HCI
My research
I am a student at…
I collaborate with…
“Software is eating the world”
Marc Andreessen (2011)
Source: Evans Data Corporation
Recently, a proliferation of AI tools for code
Most frequent responses based on a survey in Fall 2024 (n=170)
A ripe opportunity to study human-AI interaction!
Coding assistants are evolving
A growing ecosystem
This was 7 months ago
Increased autonomy, increased risk
These concerns are not limited to SWE
So how do we design the next generation of (coding) agents?
It starts with evaluation!
Human-Centered Evaluations
Evaluating copilots
Prior Evaluations
In practice, models don’t work on their own
[Diagram: the human triggers the LLM while editing a file; the LLM gets user context and returns a code completion, e.g., GitHub Copilot]
How do we pick which LLM to use?
Commercial Models
Open models
Open code-specific models
What about benchmark performance?
We test this hypothesis by running a user study where people program with models of varying benchmark performance.
We create a web interface to evaluate two forms of AI support
Autocomplete and chat
The interface is part of an end-to-end pipeline to evaluate LLMs for coding
Our results highlight the importance of human-in-the-loop evaluation
Notice that none of these models are used anymore for code!
What’s the best way to do human-in-the-loop evaluation?
Realistic usage, but not scalable!
Running studies involves recruiting users, and each user only interacts with one model.
What’s the best way to do human-in-the-loop evaluation?
As of September 2025, the platform has collected over 3.5 million head-to-head votes across more than 400 models
Demo
What’s the best way to do human-in-the-loop evaluation?
Chatbot Arena has introduced a coding category to evaluate models with coding capabilities.
Scalable, but not realistic!
Existing evaluations are flawed
[Chart: realism vs. scalability. RealHumanEval is realistic but not scalable; Chatbot Arena is scalable but not realistic]
Copilot Arena aims to achieve the best of both worlds
[Chart: realism vs. scalability. Copilot Arena sits high on both axes, unlike RealHumanEval and Chatbot Arena]
Copilot Arena is a real VSCode Extension
Copilot Arena workflow
What models does Copilot Arena support?
Commercial Models
Open models
Open code-specific models
How can chat models perform code completions?
Need to be able to “fill-in-the-middle” (FiM)
Not trained to do FiM!
What’s unique about code?
Slide from CMU Neural Code Generation (11-891)
Training models to fill in the middle
Sentinel tokens: <PRE> (prefix), <SUF> (suffix), <MID> (middle)
Encode code in the same way during inference
https://arxiv.org/pdf/2207.14255
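To make the encoding concrete, here is a minimal sketch in Python of how the prefix/suffix pair around the cursor is serialized into a FiM prompt using the sentinel tokens above; the exact sentinel strings are model-specific, and the ones shown are placeholders.

```python
# Minimal sketch of fill-in-the-middle (FiM) prompt construction,
# following the <PRE>/<SUF>/<MID> sentinel scheme described above.
# The exact sentinel strings vary by model; these are placeholders.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Serialize the code around the cursor into a single FiM prompt.

    A FiM-trained model generates the missing middle after <MID>,
    conditioned on the code before (<PRE>) and after (<SUF>) the cursor.
    """
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"

# Example: the cursor sits between the function signature and the return statement.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
prompt = build_fim_prompt(prefix, suffix)
# A FiM-trained model would ideally complete with something like: "result = a + b"
```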
Many chat models struggle with correct formatting
[Example outputs: an indentation error vs. the correct outcome]
Instead of generating completions, we ask models to generate code snippets
[Figure: generating the entire snippet yields the correct outcome]
We post-process the generated snippet to remove overlaps
[Figure: removing any overlapping text yields the correct outcome]
Simple solution to allow chat models to perform FiM!
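A minimal sketch of the overlap-trimming idea, as a simplified illustration rather than the exact Copilot Arena implementation: keep only the part of the generated snippet that does not duplicate the surrounding file content.

```python
def trim_overlap(snippet: str, prefix: str, suffix: str) -> str:
    """Remove text from the generated snippet that duplicates the
    surrounding file content, so only the new 'middle' is inserted.

    Simplified sketch: find the longest suffix of `prefix` that the
    snippet starts with and the longest prefix of `suffix` that the
    snippet ends with, then cut both off.
    """
    # Trim the part of the snippet that repeats the end of the prefix.
    start = 0
    for k in range(min(len(prefix), len(snippet)), 0, -1):
        if snippet.startswith(prefix[-k:]):
            start = k
            break

    # Trim the part of the snippet that repeats the beginning of the suffix.
    end = len(snippet)
    for k in range(min(len(suffix), len(snippet) - start), 0, -1):
        if snippet.endswith(suffix[:k]):
            end = len(snippet) - k
            break

    return snippet[start:end]

# Example: the chat model regenerates the whole function; we keep only the middle.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
snippet = "def add(a, b):\n    result = a + b\n    return result\n"
print(trim_overlap(snippet, prefix, suffix))  # "result = a + b"
```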
Copilot Arena so far
Leaderboard Computation
Copilot Arena Leaderboard
Live at lmarena.ai
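For intuition, here is a minimal sketch of one standard way to turn head-to-head votes into a leaderboard: a Bradley-Terry model fit via logistic regression. The live leaderboard computation may differ in its details (e.g., confidence intervals, tie handling), so treat this as an illustration rather than the exact method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical head-to-head votes: (winner, loser) pairs between models.
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_b"),
]

models = sorted({m for pair in votes for m in pair})
idx = {m: i for i, m in enumerate(models)}

# Design matrix: +1 for one model, -1 for the other; label 1 means the +1 model won.
# Each vote is encoded twice (original and sign-flipped) so both classes appear.
X, y = [], []
for winner, loser in votes:
    row = np.zeros(len(models))
    row[idx[winner]], row[idx[loser]] = 1.0, -1.0
    X.append(row); y.append(1)
    X.append(-row); y.append(0)

# Bradley-Terry strengths are the logistic-regression coefficients.
clf = LogisticRegression(fit_intercept=False, C=1e6).fit(np.array(X), np.array(y))
scores = dict(zip(models, clf.coef_[0]))
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)  # strongest model first
```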
Real-world “data distribution”
Comparison to prior evals
40% of Chatbot Arena’s coding tasks contain code context and only 2.6% focus on infilling
Many existing static benchmarks evaluate only Python, interview-style coding problems written in English.
Copilot Arena captures the long tail of context lengths
Which models best align with user preferences?
TLDR
Evaluating agents
A lot of work has been done to understand copilot usage
Current understanding of agent usage
Scope of tasks has increased
Demo
Agents still require humans in the loop
[Diagram: the human gives an instruction; the agent gets user context, edits files, and updates the human on progress]
Now, developers have options
What if we put them head to head?
Comparing copilots and agents
We recruit participants who are regular users of GitHub Copilot
Study design
Summary of Findings
There is room for improvement!
What should we change about the agent?
[Diagram: the human gives an instruction; the agent uses tools to get user context and edit files, and updates the human on progress]
Case studies in agent design
(1) LLM backbone
(2) Reasoning strategy
(3) Memory management
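To make the three knobs concrete, here is a hypothetical configuration sketch (field names and values are illustrative, not the OpenHands API); each case study varies one field against a control condition.

```python
from dataclasses import dataclass

@dataclass
class AgentConfig:
    """Hypothetical knobs corresponding to the three case studies.

    Each experiment varies exactly one field against a control condition.
    """
    llm_backbone: str = "claude-4-sonnet"    # (1) which LLM powers the agent
    reasoning_strategy: str = "no_plan"      # (2) e.g., "no_plan" vs. "planning"
    memory_management: str = "full_history"  # (3) e.g., full history vs. condensed (illustrative)

# Example A/B condition for the reasoning-strategy case study.
control = AgentConfig()
treatment = AgentConfig(reasoning_strategy="planning")
```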
Measuring quality of agent work
Copilot Arena lends itself to a natural feedback signal
If you accept suggestion: suggestion good
If you continue typing: suggestion bad
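In code, this implicit signal can be read directly off interaction logs; a minimal sketch with a hypothetical event schema:

```python
def implicit_label(event: dict) -> int | None:
    """Map a completion interaction to an implicit quality label.

    Hypothetical event schema: {"shown": str, "accepted": bool, "typed_instead": bool}.
    Returns 1 (good) if the suggestion was accepted, 0 (bad) if the user
    kept typing over it, and None if the signal is ambiguous.
    """
    if event["accepted"]:
        return 1
    if event["typed_instead"]:
        return 0
    return None  # e.g., the user navigated away; no clear signal

# Example usage on a small log of hypothetical events.
log = [
    {"shown": "return a + b", "accepted": True, "typed_instead": False},
    {"shown": "return a - b", "accepted": False, "typed_instead": True},
]
labels = [implicit_label(e) for e in log]  # [1, 0]
```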
There is no analogous feedback signal for agentic workflows
If you follow up: agent work is good??
If you stop a conversation: agent work is bad??
Prediction-powered User Label Synthesis & Evaluation
Step 1: Collect ratings
Users are prompted to provide feedback after each work segment
A work segment = actions between user command and the agent returning to “stopped” state
We collect a dataset of N=1747 labeled user trajectories where the average rating is 4.07
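A minimal sketch of how a raw event stream could be cut into work segments under the definition above (event types are hypothetical):

```python
def split_into_segments(events: list[dict]) -> list[list[dict]]:
    """Group agent events into work segments.

    A segment starts at a user command and ends when the agent returns
    to the "stopped" state; everything in between (tool calls, file edits,
    progress updates) belongs to that segment. Event schema is hypothetical.
    """
    segments, current = [], []
    for event in events:
        if event["type"] == "user_command" and current:
            # Defensive: a new command before a stop closes the open segment.
            segments.append(current)
            current = []
        current.append(event)
        if event["type"] == "agent_state" and event.get("state") == "stopped":
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments
```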
Step 2: Train rating predictor
Features based on the user:
Features based on the agent:
Features that show task progression:
User sentiment and git push were the most predictive features!
Step 2: Train rating predictor
Supervised learning approaches using our features outperform a naive LLM-as-a-Judge baseline
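As a simplified illustration of the supervised approach (not the exact model used): featurize each labeled work segment and fit a standard regressor to the collected ratings. Field names below are hypothetical stand-ins for the three feature groups on the slide.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def featurize(segment: dict) -> list[float]:
    """Turn one work segment into a feature vector.

    Feature groups mirror the slide: user-based (e.g., sentiment of follow-up
    messages), agent-based (e.g., number of file edits), and task-progression
    (e.g., whether a git push happened). Field names are hypothetical.
    """
    return [
        segment["user_sentiment"],     # user-based
        segment["num_followups"],      # user-based
        segment["num_file_edits"],     # agent-based
        segment["num_errors"],         # agent-based
        float(segment["git_push"]),    # task progression
    ]

def train_rating_predictor(segments: list[dict], ratings: list[float]):
    """Fit a regressor to predict 1-5 ratings from segment features."""
    X = np.array([featurize(s) for s in segments])
    y = np.array(ratings)
    model = GradientBoostingRegressor()
    # Report cross-validated error before using the predictor downstream.
    mae = -cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=5).mean()
    return model.fit(X, y), mae
```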
Step 3: Compute effect size
Naive approach
Augment with Infilled Labels
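For intuition, here is a minimal sketch of the prediction-powered idea behind augmenting with infilled labels: apply the rating predictor to the many unlabeled segments, then correct its bias using the small labeled set. This follows the general prediction-powered inference recipe rather than the paper's exact estimator.

```python
import numpy as np

def ppi_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Prediction-powered estimate of the mean rating in one condition.

    Uses predictions on the large unlabeled set, plus a bias correction
    ("rectifier") estimated from the small labeled set.
    """
    y_labeled = np.asarray(y_labeled, dtype=float)
    yhat_labeled = np.asarray(yhat_labeled, dtype=float)
    yhat_unlabeled = np.asarray(yhat_unlabeled, dtype=float)
    return yhat_unlabeled.mean() + (y_labeled - yhat_labeled).mean()

def effect_size(arm_a: dict, arm_b: dict) -> float:
    """Difference in prediction-powered mean ratings between two conditions.

    Each arm is a dict with keys "y", "yhat_labeled", "yhat_unlabeled"
    (a hypothetical structure used only for this sketch).
    """
    mean_a = ppi_mean(arm_a["y"], arm_a["yhat_labeled"], arm_a["yhat_unlabeled"])
    mean_b = ppi_mean(arm_b["y"], arm_b["yhat_labeled"], arm_b["yhat_unlabeled"])
    return mean_a - mean_b
```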
Overview of Users
Over the course of multiple months, we ran our 3 case studies on over 15k users of the OpenHands SaaS platform.
Results
+5.9% difference between claude-3.7-sonnet and claude-4-sonnet
-7.8% difference between claude-4-sonnet and gpt-5
vs.
+3.1% difference between no plan and planning
Results
Comparison to benchmarks
Static benchmarks don’t tell the full story!
TLDR
Future Directions
Current benchmarks focus on autonomous task completion
Alternatively, focus on collaboration
Desideratum 1:
Agent behaviors should be transparent to users.
Desideratum 2:
Agents should have balanced proactivity.
Desideratum 3:
Agents should effectively leverage human effort.
Current Usage
Desideratum 1:
Agent behaviors should be transparent to users.
Future Usage
Current Usage
Desideratum 2:
Agents should have balanced proactivity.
Future Usage
Current Usage
Desideratum 3:
Agents should effectively leverage human effort.
Future Usage
Acknowledgements and paper links
Papers mentioned in this talk
The projects presented in this talk are in collaboration with
Aditya Bharat Soni, Aditya Mittal, Ameet Talwalkar, Anastasios Angelopoulos, Calvin Smith, Chris Donahue, David Sontag, Dennis Wei, Graham Neubig, Hoang H Tran, Hussein Mozannar, Ion Stoica, Juan Michelini, Manish Nagireddy, Mohammed Alsobay, Naman Jain, Prasanna Sattigeri, Robert Brennan, Rohit Malhotra, Sebastian Zhao, Subhro Das, Tianjun Zhang, Wayne Chi, Wei-Lin Chiang, Xingyao Wang, Xuhui Zhou
From
CMU, MIT, UC Berkeley, IBM, OpenHands, LM Arena
Thank you! Questions?