1 of 11

OpenMCP: A Reproducible Benchmarking Harness for Evaluating Computer-Use Agents on Chameleon Cloud

Agustin Leon & Anup Raj Niroula

New York University (NYU)

2 of 11

Agenda

Brief intro and why this topic
Existing MCP benchmarks
OpenMCP architecture
How to run experiments with OpenMCP
Review of our experiments & results
Quick demo

3 of 11

MCP popularity has grown rapidly

Source: Pulse MCP, https://www.pulsemcp.com/statistics

4 of 11

Why this topic?

We were interested in the topic of agents, MCP, and local LLMs.

We did not initially set out to develop a benchmarking harness; we were pushed into it.

When we tried to use existing benchmarks, reproducibility concerns arose.

Given the popularity of MCP-enabled agents, and of computer-use agents in general, having reliable benchmarks seemed important to us.

5 of 11

What MCP benchmarks are out there?

Framework	GUI Support	Deterministic Evaluation	Local Model Support	Portable Infrastructure	Infrastructure Telemetry
MCPBench	✗	✗	✗	✗	✗
LiveMCPBench	✗	✗	✗	✗	✗
MCP Universe	✗	✓	✓	✗	✗
MCPWorld	✓	✓	✗	✗	✗
OpenMCP	✓	✓	✓	✓	✓

6 of 11

An overview of OpenMCP’s architecture

7 of 11

Running experiments on OpenMCP

1) Some possible research questions

How does model scale relate to task success rates?
Can smaller models perform well?
What are the energy-performance trade-offs?
What types of tasks are most difficult?
How do agents solve tasks?

2) Experiment Configuration

One config file sets everything up

3) Experiment Execution

Environment initialization
Agent loop execution
Telemetry sampling
Termination and trace logic

4) Output Artifacts

Metadata
Metrics
Telemetry
Raw events
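
As a rough illustration of how the execution stages and output artifacts on this slide could fit together, here is a minimal Python sketch. It is not OpenMCP's actual API: `ExperimentConfig`, `RunArtifacts`, `run_experiment`, `energy_joules`, and every field name are hypothetical, and the agent and GPU power reader are passed in as plain callables so the example stays self-contained. Telemetry is sampled once per agent step here for simplicity; a real harness would more likely sample on its own timer.

```python
import time
from dataclasses import dataclass, field, asdict


@dataclass
class ExperimentConfig:
    # Hypothetical fields; the real OpenMCP config schema may differ.
    model: str
    task_id: str
    max_steps: int = 5


@dataclass
class RunArtifacts:
    # Mirrors the four output artifact groups named on the slide.
    metadata: dict
    metrics: dict = field(default_factory=dict)
    telemetry: list = field(default_factory=list)   # (time_s, gpu_power_w) samples
    raw_events: list = field(default_factory=list)  # one dict per agent step


def energy_joules(samples):
    """Trapezoidal integration of (time_s, power_w) samples into joules."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def run_experiment(config, agent_step, read_power, clock=time.monotonic):
    # Stage 1: environment initialization.
    artifacts = RunArtifacts(metadata=asdict(config))
    state = {"step": 0}
    success = False

    # Stage 2: agent loop execution, bounded by max_steps.
    for _ in range(config.max_steps):
        # Stage 3: telemetry sampling (once per step in this sketch).
        artifacts.telemetry.append((clock(), read_power()))
        event = agent_step(state)
        artifacts.raw_events.append(event)

        # Stage 4: termination and trace logic.
        if event.get("done"):
            success = bool(event.get("success"))
            break

    # Final sample so the last step's energy is included.
    artifacts.telemetry.append((clock(), read_power()))
    artifacts.metrics = {
        "success": success,
        "steps": len(artifacts.raw_events),
        "energy_j": energy_joules(artifacts.telemetry),
    }
    return artifacts


if __name__ == "__main__":
    def stub_agent(state):
        # Stand-in for a real agent: declares success on its third step.
        state["step"] += 1
        return {"action": "noop", "done": state["step"] >= 3, "success": True}

    config = ExperimentConfig(model="stub-model", task_id="demo-task")
    result = run_experiment(config, stub_agent, read_power=lambda: 100.0)
    print(result.metrics)
```

Keeping success, step count, and energy together in one metrics record is what makes questions like the energy-performance trade-off answerable per run; the raw events and telemetry samples are kept separately so other metrics can be recomputed later.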

8 of 11

Some of the data and results we collected

9 of 11

Time for a quick demo

10 of 11

The role of Chameleon in all of this

Chameleon was core to the project

The entirety of this project was developed on Chameleon Cloud.
In total, we used over 30k SUs across 6 months of work.
Our released public dataset contains over 115 hours of agent runtime, evaluated on 6 different bare-metal instances and VMs.
All resource provisioning was handled through a Trovi artifact.
H100 KVM@TACC instances were extremely useful and became our main playground.

Challenges?

Only minor ones on the Chameleon side, and the Chameleon support team responded rapidly.
Most challenges came from the software side, not the infrastructure.

11 of 11

Summary

Link to Trovi artifact

OpenMCP is a reproducible benchmarking harness for MCP-enabled computer-use agents (CUAs) that gives researchers full control over locally hosted models, runtime types, and evaluation, while producing rich metadata, metrics, and GPU telemetry.

For more details, see our upcoming EuroMLSys paper:

Agustin Leon, Anup Raj Niroula, and Fraida Fund. 2026. OpenMCP: an open-source self-hosted benchmarking harness for MCP-enabled computer use agents. In The 6th Workshop on Machine Learning and Systems (EuroMLSys ’26), April 27–30, 2026, Edinburgh, Scotland, UK.

Trovi artifact publicly available: “OpenMCP: A Reproducible Benchmarking Harness for Evaluating Computer-Use Agents on Chameleon Cloud”