1 of 80

Human-Centered Evaluation of Coding Agents

Valerie Chen

2 of 80

Brief intro 👋

ML/NLP

HCI

My research

I am a student at…

I collaborate with…

3 of 80

“Software is eating the world”

Marc Andreessen (2011)

Source: Evans Data Corporation

4 of 80

Recently, a proliferation of AI tools for code

Most frequent responses based on a survey in Fall 2024 (n=170)

A ripe opportunity to study human-AI interaction!

5 of 80

Coding assistants are evolving

6 of 80

A growing ecosystem

7 of 80

This was 7 months ago

8 of 80

9 of 80

Increased autonomy, increased risk

10 of 80

These concerns are not limited to SWE

11 of 80

So how do we design the next generation of (coding) agents?

12 of 80

It starts with evaluation!

13 of 80

Human-Centered Evaluations

14 of 80

Evaluating copilots

15 of 80

Prior Evaluations

16 of 80

In practice, models don’t work on their own

[Diagram: the human–LLM interaction loop — the human triggers the LLM (e.g., by editing a file); the LLM gets user context and returns a code completion]

17 of 80

GitHub Copilot

18 of 80

How do we pick which LLM to use?

Commercial Models

Open models

Open code-specific models

19 of 80

What about benchmark performance?

We test this hypothesis by running a user study where people program with models of varying benchmark performance.

20 of 80

We create a web interface to evaluate two forms of AI support

Chatbot

21 of 80

The interface is part of an end-to-end pipeline to evaluate LLMs for coding

22 of 80

Our results highlight the importance of human-in-the-loop evaluation

  • 213 participants assigned to 7 conditions
  • Gaps in benchmark performance do not match differences in human performance
  • The benefits that a human gets from LLM support vary by task type (e.g., data science vs interview-style problems)

Notice that none of these models are used anymore for code!

23 of 80

What’s the best way to do human-in-the-loop evaluation?

Realistic usage, but not scalable!

Running studies involves user recruitment, and each user interacts with only one model.

24 of 80

What’s the best way to do human-in-the-loop evaluation?

As of September 2025, the platform has collected over 3.5 million head-to-head votes across more than 400 models

25 of 80

Demo

26 of 80

What’s the best way to do human-in-the-loop evaluation?

Chatbot Arena has introduced a coding category to evaluate models with coding capabilities.

Scalable, but not realistic!

27 of 80

Existing evaluations are flawed

[Chart: existing evaluations plotted on axes of realism vs. scalability — RealHumanEval is realistic but not scalable; Chatbot Arena is scalable but not realistic]

28 of 80

Copilot Arena aims to achieve the best of both worlds

[Chart: the same realism-vs.-scalability axes, with Copilot Arena positioned to be both realistic and scalable alongside RealHumanEval and Chatbot Arena]

29 of 80

30 of 80

Copilot Arena is a real VSCode Extension

31 of 80

Copilot Arena workflow

32 of 80

What models does Copilot Arena support?

Commercial Models

Open models

Open code-specific models

33 of 80

How can chat models perform code completions?

Need to be able to “fill-in-the-middle” (FiM)

Not trained to do FiM!

34 of 80

What’s unique about code?

Slide from CMU Neural Code Generation (11-891)

35 of 80

Training models to fill in the middle

<PRE>

<SUF>

<MID>

Encode code in the same way during inference

https://arxiv.org/pdf/2207.14255
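To make this concrete, here is a minimal sketch of prefix-suffix-middle prompting: the code before and after the cursor is wrapped in sentinel tokens so a FiM-trained model generates the missing middle. The exact sentinel strings vary by model; the ones below mirror the slide and are illustrative only.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the code around the cursor in sentinel tokens so a
    FiM-trained model generates the missing middle after <MID>."""
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Code before and after the cursor in the user's file.
prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))\n"

prompt = build_fim_prompt(prefix, suffix)
# A FiM-trained model continues the prompt with the middle, e.g. "return a + b".
```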

36 of 80

Many chat models struggle with correct formatting

Indent Error

Correct Outcome

37 of 80

Instead of generating completions, we ask models to generate code snippets

Generate the entire snippet

Correct Outcome

38 of 80

We post-process the generated snippet to remove overlaps

Correct Outcome

Remove any overlapping text
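A minimal sketch of one way to do this trimming, assuming we have the generated snippet plus the prefix and suffix around the cursor; the actual Copilot Arena post-processing may differ in its details.

```python
def remove_prefix_overlap(snippet: str, prefix: str) -> str:
    """Drop the longest leading part of the snippet that repeats the end of the prefix."""
    for k in range(min(len(snippet), len(prefix)), 0, -1):
        if snippet.startswith(prefix[-k:]):
            return snippet[k:]
    return snippet

def remove_suffix_overlap(snippet: str, suffix: str) -> str:
    """Drop the longest trailing part of the snippet that repeats the start of the suffix."""
    for k in range(min(len(snippet), len(suffix)), 0, -1):
        if snippet.endswith(suffix[:k]):
            return snippet[:-k]
    return snippet

# Example: the model regenerates the print line that already follows the cursor,
# so we trim the duplicated suffix before splicing the snippet into the file.
snippet = "    return a + b\n\nprint(add(1, 2))"
suffix = "\n\nprint(add(1, 2))\n"
print(remove_suffix_overlap(snippet, suffix))  # "    return a + b"
```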

39 of 80

Simple solution to allow chat models to perform FiM!

  • On benchmarks like HumanEval infilling, this simple fix drastically improves model performance (and reduces indentation and formatting issues).

40 of 80

Copilot Arena so far

  • >30k battles
  • >4.5k users contributing votes
  • A dozen models in the arena
  • We even help evaluate new models before launch!

41 of 80

Leaderboard Computation
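For intuition, arena-style leaderboards are typically fit with a Bradley-Terry model over pairwise votes (the approach popularized by Chatbot Arena). Below is a minimal sketch under that assumption; the production computation also deals with ties, confidence intervals, and other details omitted here.

```python
import numpy as np

def bradley_terry_scores(models, battles, iters=200):
    """Fit Bradley-Terry strengths with simple MM updates.
    battles: list of (winner, loser) model-name pairs."""
    idx = {m: i for i, m in enumerate(models)}
    wins = np.zeros((len(models), len(models)))
    for w, l in battles:
        wins[idx[w], idx[l]] += 1
    strength = np.ones(len(models))
    for _ in range(iters):
        for i in range(len(models)):
            total_wins = wins[i].sum()
            denom = sum(
                (wins[i, j] + wins[j, i]) / (strength[i] + strength[j])
                for j in range(len(models)) if j != i
            )
            if denom > 0:
                strength[i] = total_wins / denom
        strength /= strength.sum()
    return dict(zip(models, strength))

# Toy example: model-a beats model-b in 2 of 3 battles.
print(bradley_terry_scores(
    ["model-a", "model-b"],
    [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")],
))
```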

42 of 80

Copilot Arena Leaderboard

Live at lmarena.ai

43 of 80

Real-world “data distribution”

44 of 80

Comparison to prior evals

40% of Chatbot Arena’s coding tasks contain code context and only 2.6% focus on infilling

Many existing static benchmarks only evaluate interview-style Python problems written in English.

Copilot Arena captures the long tail of context lengths

45 of 80

Which models best align with user preferences?

  • Downstream task significantly affects user preference, while programming languages have little effect.
  • Smaller models tend to perform better on data similar to static benchmarks.

46 of 80

TLDR

  • Existing evaluations do not necessarily correlate well with in-the-wild preferences.
  • Model performance is affected by task, context, and code structure. No model is “one-size-fits-all.”
  • Diverse and realistic human preference data is essential for effective code generation models.
  • We also now have a lot of interesting data to dig into!

47 of 80

Evaluating agents

48 of 80

A lot of work has been done to understand copilot usage

49 of 80

Current understanding of agent usage

50 of 80

Scope of tasks has increased

51 of 80

Demo

52 of 80

Demo

53 of 80

Agents still require humans in the loop

[Diagram: the human gives an instruction to the agent; the agent gets user context, edits files, and updates the human on progress]

54 of 80

Now, developers have options

What if we put them head to head?

55 of 80

Comparing copilots and agents

We recruit participants who are regular users of GitHub Copilot

56 of 80

Study design

57 of 80

Summary of Findings

  • On average, participants with agents are more productive than with copilots (a 35% increase in task correctness).
  • For correctly solved problems, we find a significant difference in user effort: participants spent 25.1 minutes with copilots vs. 12.5 minutes with agents.

58 of 80

Summary of Findings

There is room for improvement!

59 of 80

What should we change about the agent?

[Diagram: the human–agent interaction loop from before, with the agent’s tools highlighted as components we can change]

60 of 80

Case studies in agent design

(1) LLM backbone

  • claude-3.7-sonnet
  • claude-4-sonnet
  • gpt-5 (high reasoning)

(2) Reasoning strategy

  • break down complex tasks through planning
  • no planning

(3) Memory management

  • truncate and summarize context after 120 steps
  • truncate and summarize context after 80 steps

61 of 80

Measuring quality of agent work

Copilot Arena lends itself to a natural feedback signal: if you accept a suggestion, the suggestion was good; if you continue typing, it was bad.

There is no existing measure in agentic workflows: if you follow up, is the agent’s work good?? If you stop a conversation, is it bad??

62 of 80

Prediction-powered User Label Synthesis & Evaluation

63 of 80

Step 1: Collect ratings

Users are prompted to provide feedback after each work segment

A work segment = the actions between a user command and the agent returning to a “stopped” state

We collect a dataset of N=1747 labeled user trajectories where the average rating is 4.07
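To make the unit of analysis concrete, here is a minimal sketch of splitting a trajectory into work segments; the event schema (actor/type fields) is hypothetical, not the actual OpenHands log format.

```python
def work_segments(events):
    """Split a trajectory into work segments: everything between a user
    command and the agent returning to a "stopped" state."""
    segments, current = [], []
    for event in events:  # hypothetical schema: {"actor": ..., "type": ...}
        current.append(event)
        if event["actor"] == "agent" and event["type"] == "stopped":
            segments.append(current)
            current = []
    return segments

trajectory = [
    {"actor": "user", "type": "message"},
    {"actor": "agent", "type": "edit_file"},
    {"actor": "agent", "type": "stopped"},   # end of segment 1
    {"actor": "user", "type": "message"},
    {"actor": "agent", "type": "stopped"},   # end of segment 2
]
print(len(work_segments(trajectory)))  # 2
```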

64 of 80

Step 2: Train rating predictor

Features based on the user:

  • Sentiment of user messages
  • Number of user messages

Features based on the agent:

  • Type of task

Features that show task progression:

  • Git actions

User sentiment and git push were the most predictive features!

65 of 80

Step 2: Train rating predictor

Supervised learning approaches using our features outperform naive LLM-as-a-Judge baseline
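As a rough illustration of this step, here is a sketch of a tabular rating predictor over features like the ones above; the feature encoding and model choice are assumptions, not the exact PULSE setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical feature rows, one per labeled work segment:
# [mean sentiment of user messages, # user messages, task-type id, # git pushes]
X = np.array([
    [0.8, 1, 0, 1],
    [-0.4, 5, 1, 0],
    [0.2, 2, 0, 1],
    [-0.9, 6, 2, 0],
])
y = np.array([5.0, 2.0, 4.0, 1.0])  # user ratings for those segments

predictor = GradientBoostingRegressor().fit(X, y)
# The trained predictor infills ratings for the many unlabeled work segments.
print(predictor.predict([[0.5, 2, 0, 1]]))
```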

66 of 80

Step 3: Compute effect size

Naive approach

Augment with Infilled Labels
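The “augment with infilled labels” step is in the spirit of prediction-powered inference: average the predictor’s ratings over the large unlabeled set, then correct its bias using the small labeled set. Below is a minimal sketch assuming the standard PPI mean estimator; the exact PULSE estimator and its confidence intervals may differ.

```python
import numpy as np

def ppi_mean(y_labeled, yhat_labeled, yhat_unlabeled):
    """Prediction-powered mean: average predicted rating on the unlabeled set,
    plus a bias correction ("rectifier") measured on the labeled set."""
    rectifier = np.mean(np.asarray(y_labeled) - np.asarray(yhat_labeled))
    return np.mean(yhat_unlabeled) + rectifier

def effect_size(cond_a, cond_b):
    """Difference in estimated mean rating between two agent configurations,
    where each condition is (y_labeled, yhat_labeled, yhat_unlabeled)."""
    return ppi_mean(*cond_a) - ppi_mean(*cond_b)

# Toy usage: a handful of labeled ratings plus predictor outputs on many
# unlabeled segments for each condition (all numbers are made up).
a = ([5, 4, 4], [4.6, 4.1, 3.8], np.random.default_rng(0).normal(4.2, 0.5, 500))
b = ([3, 4, 3], [3.4, 3.9, 3.1], np.random.default_rng(1).normal(3.9, 0.5, 500))
print(effect_size(a, b))
```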

67 of 80

Overview of Users

Over the course of multiple months, we ran our 3 case studies on over 15k users of the OpenHands SaaS platform.

68 of 80

Results

69 of 80

Results

  • Changes to the LLM backbone have the largest effect on user satisfaction

+5.9% difference between claude-3.7-sonnet and claude-4-sonnet

-7.8% difference between claude-4-sonnet and gpt-5

vs.

+3.1% difference between no plan and planning

70 of 80

Results

  • The lack of a statistically significant difference in the memory case study actually shows how we can reduce cost while preserving user satisfaction

  • Our results also show how PULSE can lead to more conclusive results (reducing confidence intervals by up to 40%)

71 of 80

Comparison to benchmarks

Static benchmarks don’t tell the full story!

72 of 80

TLDR

  • We are seeing a shift towards more autonomous workflows in AI coding assistants
  • Evaluations in these multi-turn settings pose unique challenges compared to the copilot setting
  • However, benchmarks do not always correlate with user satisfaction, requiring the use of efficient human-in-the-loop approaches

73 of 80

Future Directions

74 of 80

Current benchmarks

75 of 80

Alternatively, focus on collaboration

Desideratum 1: Agent behaviors should be transparent to users.

Desideratum 2: Agents should have balanced proactivity.

Desideratum 3: Agents should effectively leverage human effort.

76 of 80

Current Usage

  • Since agents may be editing multiple files and making many changes, it can become difficult for users to understand why the agent made certain changes.
  • In the post-study feedback, we found that participants wanted ways to understand quickly what the agent did and why changes were necessary.
  • We also see this in the user messages, where one participant asked “Did you delete most of the functions in [filename]? If so, explain why did you do this.”

Desideratum 1: Agent behaviors should be transparent to users.

Future Usage

  • Prior literature on how users consume model explanations has largely focused on ML models and, more recently, LLMs.
  • However, there is a need for explanations of agent actions. Recent work introduced a way for agent developers to view counterfactual roll-outs, but this is not necessarily user-friendly for end users (e.g., developers).
  • Future work should consider how to improve the transparency of agent behaviors.

77 of 80

Current Usage

  • Many participants observed, or even complained, about how OpenHands would take more actions than necessary.
  • One participant wrote in a message to the agent, “could the code have been simplified, I did not expect 10 files to be created with more than 1000 lines each.”

Desideratum 2: Agents should have balanced proactivity.

Future Usage

  • Agents should be better calibrated in their confidence about whether they have completed the user’s request.
  • Recent work on UI agents has explored proactively pausing agents at task boundaries; identifying such boundaries in software engineering settings may be a fruitful direction to improve user perception of agent actions.

78 of 80

Current Usage

  • Human effort can be measured in many ways, including the amount of time spent interacting with the agent.
  • On this front, many participants noted that “the generation time is slower” for OpenHands than GitHub Copilot and that sometimes “im just kind of sitting there”.

Desideratum 3: Agents should effectively leverage human effort.

Future Usage

  • User experience can be improved by explicitly optimizing for latency when engaging in back-and-forth with the user and providing more direct ways for users to steer agent behaviors.
  • Additionally, developers will increasingly need to multi-task to be most productive in agentic workflows, though prior work has characterized the cognitive cost of doing so.

79 of 80

Acknowledgements and paper links

Papers mentioned in this talk

The projects presented in this talk are in collaboration with

Aditya Bharat Soni, Aditya Mittal, Ameet Talwalkar, Anastasios Angelopoulos, Calvin Smith, Chris Donahue, David Sontag, Dennis Wei, Graham Neubig, Hoang H Tran, Hussein Mozannar, Ion Stoica, Juan Michelini, Manish Nagireddy, Mohammed Alsobay, Naman Jain, Prasanna Sattigeri, Robert Brennan, Rohit Malhotra, Sebastian Zhao, Subhro Das, Tianjun Zhang, Wayne Chi, Wei-Lin Chiang, Xingyao Wang, Xuhui Zhou

From

CMU, MIT, UC Berkeley, IBM, OpenHands, LM Arena

80 of 80

Thank you! Questions?