1 of 11

OpenMCP: A Reproducible Benchmarking Harness for Evaluating Computer-Use Agents on Chameleon Cloud

Agustin Leon & Anup Raj Niroula

New York University (NYU)

2 of 11

Agenda

  1. Brief intro and why this topic
  2. Existing MCP benchmarks
  3. OpenMCP architecture
  4. How to run experiments with OpenMCP
  5. Review of our experiments & results
  6. Quick demo

(1) Brief intro | (2) Existing benchmarks | (3) OpenMCP architecture | (4) Running experiments | (5) Results | (6) Demo

3 of 11

MCP's popularity has grown rapidly

Source: Pulse MCP, https://www.pulsemcp.com/statistics


4 of 11

Why this topic?

  • We were interested in the topic of agents, MCP, and local LLMs.

  • We did not initially set out to build a benchmarking harness; we were more or less pushed into it.

  • As we tried to use existing benchmarks, reproducibility concerns arose.

  • Given the popularity of MCP-enabled agents, and of computer-use agents in general, having reliable benchmarks seemed important to us.


5 of 11

What MCP benchmarks are out there?

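[Comparison table; the cell values did not survive extraction]

  Frameworks compared (rows): MCPBench, LiveMCPBench, MCP Universe, MCPWorld, OpenMCP

  Criteria (columns): GUI Support, Deterministic Evaluation, Local Model Support, Portable Infrastructure, Infrastructure Telemetry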


6 of 11

An overview of OpenMCP’s architecture


7 of 11

Running experiments on OpenMCP

1) Some possible research questions

  1. Model scale vs task success rates?
  2. Can smaller models perform well?
  3. What are the energy-performance trade-offs?
  4. What type of tasks are most difficult?
  5. How do agents solve tasks?

2) Experiments Configuration

  One config file sets everything up

3) Experiment Execution

  1. Environment initialization
  2. Agent loop execution
  3. Telemetry sampling
  4. Termination and trace logic

4) Output Artifacts

  1. Metadata
  2. Metrics
  3. Telemetry
  4. Raw events
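The flow on this slide (a single config, four execution stages, four output artifacts) can be illustrated with a small, self-contained sketch. This is not OpenMCP's actual code or API: every name here (ExperimentConfig, run_experiment, the config fields, the stand-in agent and power functions) is a hypothetical chosen for illustration, and the agent and GPU power readings are simulated so the example runs anywhere. It also shows one plausible way to turn power telemetry into an energy metric, by trapezoidal integration of the samples.

```python
import json
from dataclasses import dataclass, asdict

# NOTE: all names and fields below are hypothetical, for illustration only.
# They are assumptions, not OpenMCP's real configuration schema or API.


@dataclass
class ExperimentConfig:
    model: str                          # identifier of a locally hosted model
    task_id: str                        # benchmark task to run
    max_steps: int = 5                  # upper bound on agent loop iterations
    telemetry_interval_s: float = 1.0   # simulated sampling period


def fake_agent_step(step: int) -> dict:
    """Stand-in for one agent iteration (a real harness would call the model and MCP tools)."""
    return {"step": step, "action": "noop", "done": step >= 2}


def fake_power_sample(step: int) -> float:
    """Stand-in for a GPU power reading, in watts."""
    return 100.0 + 10.0 * step


def energy_joules(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integration of (timestamp_s, power_w) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


def run_experiment(cfg: ExperimentConfig) -> dict:
    # 1. Environment initialization (placeholder event only)
    events = [{"type": "env_init", "task_id": cfg.task_id}]
    telemetry: list[tuple[float, float]] = []
    success = False
    steps_taken = 0

    # 2. Agent loop execution, with 3. telemetry sampling once per step
    for step in range(cfg.max_steps):
        t = step * cfg.telemetry_interval_s  # simulated clock
        telemetry.append((t, fake_power_sample(step)))
        result = fake_agent_step(step)
        events.append({"type": "agent_step", **result})
        steps_taken = step + 1
        if result["done"]:
            success = True
            break

    # 4. Termination and trace logic
    events.append({"type": "terminated", "success": success})

    # Output artifacts: metadata, metrics, telemetry, raw events
    return {
        "metadata": asdict(cfg),
        "metrics": {
            "success": success,
            "steps": steps_taken,
            "energy_j": energy_joules(telemetry),
        },
        "telemetry": [{"t_s": t, "power_w": p} for t, p in telemetry],
        "raw_events": events,
    }


if __name__ == "__main__":
    cfg = ExperimentConfig(model="example-local-model", task_id="example-task")
    print(json.dumps(run_experiment(cfg), indent=2))
```

Keeping the config as one serializable object means it can be copied verbatim into the metadata artifact, so each result records the exact settings that produced it.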


8 of 11

Some of the data and results we collected


9 of 11

Time for a quick demo


10 of 11

The role of Chameleon in all of this

Chameleon was core to the project

  • The entire project was developed on Chameleon Cloud.
  • In total, we used over 30k SUs across 6 months of work.
  • Our released public dataset contains over 115 hours of agent runtime, evaluated on 6 different bare-metal instances and VMs.
  • All resource provisioning was handled using a Trovi artifact.
  • H100 KVM@TACC instances were extremely useful and became our main playground.

Challenges?

  • Only minor ones on the Chameleon side, and the Chameleon support team responded to them rapidly.
  • Most challenges came from the software side, not the infrastructure.

11 of 11

Summary

Link to Trovi artifact

  • OpenMCP is a reproducible benchmarking harness for MCP-enabled CUAs. It gives researchers full control over locally hosted models, runtime types, and evaluation, while producing rich metadata, metrics, and GPU telemetry.

  • For more details, see our upcoming EuroMLSys paper:

Agustin Leon, Anup Raj Niroula, and Fraida Fund. 2026. OpenMCP: an open-source self-hosted benchmarking harness for MCP-enabled computer use agents. In The 6th Workshop on Machine Learning and Systems (EuroMLSys ’26), April 27–30, 2026, Edinburgh, Scotland, UK.

  • Trovi artifact publicly available: “OpenMCP: A Reproducible Benchmarking Harness for Evaluating Computer-Use Agents on Chameleon Cloud”