1 of 123

Engineering the Harness:

A Practical Workshop

Rajiv Shah

@rajistics

OpenHands

https://github.com/rajshah4/harness-engineering

2 of 123

Engineering the Harness:

A Practical Workshop

Rajiv Shah

@rajistics

OpenHands

https://github.com/rajshah4/harness-engineering

3 of 123

The tasks changed.

2023

Prompting for sentiment

30 tokens - ~0.2 seconds

2026�Fix the bug, here is the code base and test suites

@rajistics

4 of 123

The model isn’t solving the problem. The system is.

13M Tokens for 50k Output! and takes 20 minutes

https://laminar.sh/shared/evals/c97e4a45-8a14-428f-8eac-f77ef6eb75a8

@rajistics

5 of 123

Hi, I'm Rajiv, and this is a masterclass on Harnesses.

Rajiv Shah

Agentic AI Engineer

OpenHands

Social Media: @rajistics

6 of 123

This has been an evolution

https://arxiv.org/pdf/2604.08224

@rajistics

7 of 123

What is in a harness?

https://blog.langchain.com/the-anatomy-of-an-agent-harness/

@rajistics

8 of 123

A harness is everything outside the model

https://blog.langchain.com/the-anatomy-of-an-agent-harness/

@rajistics

9 of 123

A harness is everything outside the model

https://blog.langchain.com/the-anatomy-of-an-agent-harness/

@rajistics

10 of 123

A good SDK abstracts the harness for agentic actions

https://github.com/OpenHands/software-agent-sdk/

@rajistics

11 of 123

Same model, 2× performance gap

https://x.com/sayashk/status/1996334941832089732

The change:

95% - Claude Code

42% - HF smolagents

@rajistics

12 of 123

Everyone on the leaderboard uses the same model

https://www.tbench.ai/

@rajistics

13 of 123

Small model + good harness > big model

AutoHarness: https://arxiv.org/pdf/2508.07995�Meta-Harness: https://yoonholee.com/meta-harness/

@rajistics

14 of 123

What harness do you use?

@rajistics

15 of 123

Harness carries a lot of decisions

https://fieldjournal.ai/blog/codex-cli-vs-claude-code/?utm_source=chatgpt.com

@rajistics

16 of 123

Harnesses are evolving with the models.

https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus

  • AutoGen & CrewAI - 3 years ago
  • Model improve
  • Manus rebuilt their harness 5 times

@rajistics

17 of 123

As models improved, who has noticed the trend towards shorter system prompts?

@rajistics

18 of 123

System prompts getting longer!

https://github.com/rajshah4/harness-engineering

@rajistics

19 of 123

Claude Code Harness / Architecture

https://arxiv.org/pdf/2604.14228

@rajistics

20 of 123

Harnesses have bugs

  • default reasoning high -> medium
  • bug that accidentally evicted thinking blocks on every turn in session was a change to help with cache optimization
  • system prompt change to reduce verbosity which reduced code quality

https://www.anthropic.com/engineering/april-23-postmortem

@rajistics

21 of 123

The 5 Levers of Harness Engineering

  • The Model
  • Retrieval
  • Memory & State
  • Agentic Loops & Tool Use
  • System Architecture

@rajistics

22 of 123

Let's start with how agents find what they need.

  • The Model
  • Retrieval
  • Context & Memory
  • Agentic Loops & Tool Use
  • System Architecture

@rajistics

23 of 123

The three modes of Agentic Retrieval.

BM25

Keyword-based retrieval

Language Models

Semantic meaning with embeddings

Agentic Search

Dynamic using LLM Reasoning

@rajistics

24 of 123

The Baseline: grep.

  • Keyword precision

  • Sub-second latency

  • Battle-tested

@rajistics

25 of 123

State-of-the-Art Coding Agents rely on grep

@rajistics

26 of 123

Inverted Indices (BM25) make grep instant.

N_docs

Linear/Grep (s)

Inverted Index

BM25 (s)

1000

3.468

0.005

0.028

3000

10.188

0.014

0.097

6000

20.608

0.025

0.24

9000

30.092

0.061

0.36

You don't need a vector database to search code.

BM25 is blazing fast.

@rajistics

27 of 123

Context Harness Uses BM25

Provides a retrieval

frameworks for code

https://parallax-labs.github.io/context-harness/

@rajistics

28 of 123

Where Lexical breaks down: The Synonym Gap.

Takeaway:

  • If you have keyword-heavy queries and need sub-second response → text search might be sufficient

@rajistics

29 of 123

Embeddings solve for meaning.

@rajistics

30 of 123

Semantic search is useful for massive codebases.

https://cursor.com/blog/semsearch

  • Cursor
    • Achieving on average 12.5% higher accuracy
    • Producing code changes that are more likely to be retained in codebases.

@rajistics

31 of 123

Who is using Agentic Search?

@rajistics

32 of 123

Agentic RAG

@rajistics

33 of 123

Agentic Search trades latency for massive accuracy.

⚡️

⚡️

⏱️

⏱️

⏱️

⏱️

@rajistics

34 of 123

Coding agents are really good at long Context

https://arxiv.org/pdf/2603.20432

  • Coding agents with file access are very good

@rajistics

35 of 123

Use files instead of chunking RAG approach.

https://arxiv.org/pdf/2603.12180

@rajistics

36 of 123

So when should you move to a database

Files stay the source of truth. Index on top when query cost bites.

@rajistics

37 of 123

Design rules for you

  • Lexical for Small Set of Files: BM25 / grep is your baseline.
  • Add Semantics for Speed: Only if you suffer from vocabulary mismatch.
  • Loop it if Accuracy is Key: Ultimately, put the retrieval inside an iterative agentic loop.

@rajistics

38 of 123

Let's start with how agents find what they need.

  • The Model
  • Retrieval
  • Context & Memory
  • Agentic Loops & Tool Use
  • System Architecture

@rajistics

39 of 123

Are you excited about 10M Context Windows?

@rajistics

40 of 123

Memory & State

Agents don't fail because the context fills up. They fail because they forget what matters.

41 of 123

1M Context Windows are never enough.

  • 500 page multimodal PDF

  • 20k rows of data

  • 100k lines / 5 MB

@rajistics

42 of 123

1M Context Windows degrade

https://claude.com/blog/1m-context-ga

@rajistics

43 of 123

Key facts disappear inside long model inputs

https://www.linkedin.com/posts/sinan-ozdemir_agenticai-llm-rag-ugcPost-7428125462201102336-B7Br

@rajistics

44 of 123

Coding agents struggle with long context models

When 100s of lines of warning push the real goal out of the agent's context window.

@rajistics

45 of 123

Three layers of Memory

  • Layer 1 - Active Context: What is in the prompt right now.
  • Layer 2 - Working State: Plans, TODOs, scratchpads.
  • Layer 3 - Durable Memory: Skills and reusable workflows.

@rajistics

46 of 123

Layer 1: Fixing Active Context

Reset:

  • Clear the window entirely.
  • Refill it with only the original instructions and critical artifacts.

Compacting:

  • Summarize older turns but keep recent turns intact

@rajistics

47 of 123

Layer 1: Compacting from OpenHands

💰 Up to 2x per-turn API cost reduction

⚡ Consistent response times in long sessions

🧠 Equivalent (or better!) performance on software engineering tasks

https://openhands.dev/blog/openhands-context-condensensation-for-more-efficient-ai-agents�ACON: https://arxiv.org/html/2510.00615v1

@rajistics

48 of 123

Layer 1: How does Codex do it???

https://simzhou.com/en/posts/2026/how-codex-compacts-context/�https://developers.openai.com/api/docs/guides/compaction

@rajistics

49 of 123

Getting the most of a 1M Context Windows

https://x.com/trq212/status/2044548257058328723

https://code.claude.com/docs/en/agent-sdk/file-checkpointing

Techniques from Anthropic on managing your context window

@rajistics

50 of 123

Layer 2: Working State (The Golden Rule)

Files make better memory than chat.

Write your plan to a .md file in the workspace rather than holding it in prompt memory.

@rajistics

51 of 123

Deep Agents rely on external plans.

LangChain’s Deep Agents

uses a write_todos tool to write out plans for agent tasks

https://www.youtube.com/watch?v=geTtqyFnyHA

@rajistics

52 of 123

Extreme Layer 2: Recursive Language Models (RLM)

RLMs bypass token limits entirely by using a persistent Python REPL to manage their state and call sub-LLMs.

https://arxiv.org/pdf/2512.24601v1

@rajistics

53 of 123

RLMs maintain accuracy at 1M tokens.

https://arxiv.org/pdf/2512.24601v1

Standard GPT-5 failed entirely; RLM-GPT-5 hit 91% accuracy

@rajistics

54 of 123

Layer 3: Durable Memory

@rajistics

55 of 123

Who uses an Agents.md file?

@rajistics

56 of 123

Durable Memory with Agents.md

Don’t overload this file, it’s part of every prompt

@rajistics

57 of 123

Auto-generated AGENTS.md files hurt performance

  • Reduces task success
  • Increases inference cost by over 20%.

https://arxiv.org/pdf/2602.11988v1

@rajistics

58 of 123

The rule of thumb: "Minimize Load-Bearing Memory"

https://www.youtube.com/watch?v=PQU9o_5rHC4

And so the thing that you want is do the minimal possible thing in order to get the model on track. And so if you delete your claude.md and then, you know, the model is getting off track, it does the wrong thing, that's when you kind of add back a little bit at a time. And what you're probably going to find is with every model you have to add less and less.

@rajistics

59 of 123

Skills are the new standard for Durable Memory.

A skill isn't just a prompt.

It’s instructions, code, and references material

@rajistics

60 of 123

Skills as Externalized Expertise

https://arxiv.org/pdf/2604.08224

@rajistics

61 of 123

Skills can replace Code

Cursor: AI Engineering London 2026

@rajistics

62 of 123

Building a learning loop with skills

https://github.com/NousResearch/hermes-agent

  1. Task attempt: Run complex task
  2. Periodic nudge: "What would you do differently?"
  3. Agent writes skill: Saves to skills/.md
  4. Self-improves: Edits own skill on next failure

@rajistics

63 of 123

Continual learning outer loop with skills

Inner loop finishes the task.

Outer loop makes the agent smarter.

https://www.philschmid.de/inner-loop-vs-outer-loop

@rajistics

64 of 123

Warning: 16% of skills actually reduce performance.

https://www.reddit.com/r/rajistics/comments/1r77v1h/skillsbench_showed_models_arent_good_at/

@rajistics

65 of 123

You must evaluate your skills.

Blog post: https://openhands.dev/blog/evaluating-agent-skills

Repo: https://github.com/rajshah4/evaluating-skills-tutorial

@rajistics

66 of 123

Constant Innovation around Memory

  • Layer 1 - Active Context: What is in the prompt right now.
  • Layer 2 - Working State: Plans, TODOs, scratchpads.
  • Layer 3 - Durable Memory: Skills and reusable workflows.

@rajistics

67 of 123

@rajistics

68 of 123

Memory & Claude

  • Conversation compaction and context assembly
  • CLAUDE.md / settings / project memory
  • Tool search and MCP loading strategy
  • Managed Agents

@rajistics

69 of 123

Memory as Lock-in

https://www.langchain.com/blog/your-harness-your-memory

Closed Harness

  • Memory lives in provider API
  • Compaction is encrypted
  • Switch provider → lose history

Open Harness

  • Memory lives in your files
  • Compaction is your code
  • Switch provider → memory stays

Lock the harness → lose the memory → lose your product.

@rajistics

70 of 123

Let's start with how agents find what they need.

  • The Model
  • Retrieval
  • Context & Memory
  • Agentic Loops & Tool Use
  • System Architecture

@rajistics

71 of 123

Agentic Loops and Tool Use

A great loop forgives a mediocre prompt.

72 of 123

Engineering the Loop

  1. The Baseline: The "Ralph Wiggum" loop (and why it fails)
  2. Cognitive Discipline
  3. Environmental Discipline
  4. Safety & Friction

@rajistics

73 of 123

We no longer rely on single-shot execution.

O1 - Sept 2024

@rajistics

74 of 123

We no longer rely on single-shot execution.

Opus 4.7 Tools:�ask_user_input_v0

bash_tool

conversation_search

create_file

fetch_sports_data

image_search

message_compose_v1

places_map_display_v0

places_search

present_files

recent_chats

recipe_display_v0

recommend_claude_apps

search_mcp_registry

str_replace

suggest_connectors

view

weather_fetch

web_fetch

web_search

tool_search

visualize:read_me

visualize:show_widget

https://simonwillison.net/2026/Apr/18/opus-system-prompt/

@rajistics

75 of 123

The Default: The "Ralph Wiggum" Agent

Try command

Get failure

Retry without diagnosis

Repeat until stopped by the harness

@rajistics

76 of 123

Ralph can work

This brute force approach can work

C Compiler

https://openhands.dev/blog/20260219-velocity-is-dead�https://www.ronin.consulting/artificial-intelligence/using-the-ralph-wiggum-loop/

@rajistics

77 of 123

Cognitive Discipline via JSON Schema and Plan

Build a plan before you execute

@rajistics

78 of 123

Moving from Ralph Wiggum to AutoResearch

https://github.com/karpathy/autoresearch

@rajistics

79 of 123

An Improved Loop for AutoResearch

https://github.com/karpathy/autoresearch

Added:�hypothesis and a verification step

@rajistics

80 of 123

Defensive Tool Returns.

@rajistics

81 of 123

Environmental Discipline: Testing Driven Development

Factory: AI Engineering London 2026

Tests written after implementation don't catch bugs; they just confirm the agent's decisions.

@rajistics

82 of 123

Adding System Constraints for your Harness

https://openai.com/index/harness-engineering/

Code is Free. Architecture is Expensive.

Add system constraints:

  • Lint errors → instructions to the agent
  • File-length tests → force decomposition
  • CI checks → the real guardrail

@rajistics

83 of 123

Safety & Friction: Sandboxes

  • Don’t work on your laptop

  • Isolated with a sandbox

@rajistics

84 of 123

Safety & Friction: Sandboxes

  • Don’t work on your laptop

  • Isolated with a sandbox

@rajistics

85 of 123

Guardrails and Approval Friction

https://www.anthropic.com/engineering/beyond-permission-prompts-making-claude-code-more-secure-and-autonomous

- https://www.anthropic.com/engineering/writing-effective-tools-for-agents

- https://github.com/walkinglabs/awesome-harness-engineering

Design friction to match blast radius.

@rajistics

86 of 123

Principles for Agentic Loop

  1. Actions should be simple and easy to understand for agents.
  2. Actions should be compact and efficient.
  3. Environment feedback should be informative but concise.
  4. Guardrails mitigate error propagation and hasten recovery

SWE Bench: https://proceedings.neurips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf

@rajistics

87 of 123

Let's start with how agents find what they need.

  • The Model
  • Retrieval
  • Context & Memory
  • Agentic Loops & Tool Use
  • System Architecture

@rajistics

88 of 123

System Architecture:�Single versus Multi Agent

89 of 123

Who’s using a multi-agent for coding?

@rajistics

90 of 123

Single agents degrade as complexity grows.

https://snorkel.ai/blog/multi-agents-in-the-context-of-enterprise-tool-use/

@rajistics

91 of 123

Split the context between multiple agents

https://snorkel.ai/blog/multi-agents-in-the-context-of-enterprise-tool-use/

@rajistics

92 of 123

Multi-Agent is like Distributed Systems: Complex!

https://www.youtube.com/watch?v=2czYyrTzILg

@rajistics

93 of 123

Many ways to orchestrate multiple agents

  • Routing
  • Parallelization
  • Orchestrator

@rajistics

94 of 123

Coordination Tax - Going from Parallel to Serial

Factory: AI Engineering London 2026

@rajistics

95 of 123

The Reality: More agents only help if coordination stays cheap.

  • On average, multi-agent systems perform worse than single agents (-3.5%).
  • Once a single agent reaches ~45% accuracy, adding agents stops helping. Coordination cost outweighs reasoning gains.
  • Architecture determines if errors are corrected or amplified. Independent agents amplify errors ~17x.

https://arxiv.org/pdf/2512.08296

96 of 123

Multi-Agent critics using reflection

Rubric-Supervised Critichttps://arxiv.org/pdf/2603.03800

Reflexion: https://arxiv.gg/abs/2303.11366�Boris Cherny: 2 to 3X better quality

@rajistics

97 of 123

Harness engineering in another two years?

Commoditizing (will be default)

Durable (keeps mattering)

Compaction algorithms

Skills systems — your org's expertise

Default tool choices (grep, read, edit)

Memory policy — what persists, what doesn't

Model-specific prompt tweaks

Domain-specific tool libraries

Most "hacks" patching capability gaps

Security posture & approval friction

Evals — knowing when your harness breaks

@rajistics

98 of 123

Five knobs that decide everything

  1. Retrieval → Grep / BM25 / semantic .
  2. Memory → files > chat history. Skills
  3. Tools → Quality, quantity, error handling
  4. Loops → force hypothesis before action. Tests before code.
  5. Orchestration → Single agent by default. Multi-agent only when coordination stays cheap.

@rajistics

99 of 123

Why Harnesses Matter

Yesterday we asked which model.

Today, ask which harness?

  • Performance differences — same model, 2× gap
  • Know the knobs — 5 levers: retrieval, memory, loops, tools, architecture
  • Mind the lock-in — your harness owns your memory, context, and tools

@rajistics

100 of 123

Engineering the Harness:

A Practical Workshop

Rajiv Shah

@rajistics

OpenHands

https://github.com/rajshah4/harness-engineering

101 of 123

Engineering the Harness:

A Practical Workshop

Rajiv Shah

@rajistics

OpenHands

https://github.com/rajshah4/harness-engineering

102 of 123

Protocols and The JSON Schema Fix

The harness as cognitive environment. The Foundation Model (Agent Core) sits at the center.

Six harness dimensions form a coordinated ring around it.

https://arxiv.org/pdf/2604.08224

@rajistics

103 of 123

Appendix: BM25 rewards rare query terms.

Probabilistic lexical ranking function

@rajistics

104 of 123

Embeddings use the semantic meaning of words

@rajistics

105 of 123

Adding a loop with agentic search

@rajistics

106 of 123

BRIGHT dataset is a more challenging retrieval task

BRIGHT: https://arxiv.org/pdf/2407.12883

Requires reasoning

over retrieval

@rajistics

107 of 123

Appendix: Navigate structured data; don’t stuff into a prompt

https://arxiv.org/pdf/2602.05447

  • For structured data, file search worked better than placing schemas in prompts
  • Using nested navigation guide for lots of files

@rajistics

108 of 123

Agentic BM25 beats Semantic Search on reasoning.

BRIGHT: https://arxiv.org/pdf/2407.12883

Querying with LLM using BM25!!

Agentic Search

LLMs know synonyums

@rajistics

109 of 123

Agentic Search trades latency for massive accuracy.

Pick:

  • Accuracy
  • Latency

(5s versus 25s)

WIxQA: https://arxiv.org/abs/2505.08643

@rajistics

110 of 123

Appendix: Harnesses may eventually self-improve.

Models can learn frm themselves

@rajistics

111 of 123

Appendix: Context may eventually tune itself.

https://arxiv.org/pdf/2510.04618

https://github.com/kayba-ai/agentic-context-engine

Optimizing prompts and memory

@rajistics

112 of 123

From Tools to Coding

- 2,600 API endpoints as JSON schemas

- 1.2M tokens per request

- Dozens of round trips

- 🐢 slow, chatty

Cloudflare: AI Engineering London 2026

- Pass docs, ask for one JS program

- Execute in V8 Isolates

- 1 round trip

- ⚡ zero-latency logic

@rajistics

113 of 123

Externalize process

https://arxiv.org/pdf/2604.08224

Protocols externalize the interaction burden.

@rajistics

114 of 123

It’s possible to build an independent standard

https://www.agentdataprotocol.com/

standardized format to represent agent trajectories across different harnesses, tools, and environments

@rajistics

115 of 123

Retrieval decides what enters context.

Document hierarchy creation

INGEST & PARSE

OCR text

Layout analysis

Image captioning

Table extraction

Metadata creation

]

Source �docs

Data�store

Extracted document

Chunked document

QUERY

Translation

Input query

Model

Armor

Multi-turn

Query decomposition

Query expansion

Query reformulation

RETRIEVE

Filter model

Reranker

Reciprocal rank fusion

Lexical search

Semantic

search

Filter via metadata

Final retrievals

GENERATE

Generate response

(GLM, Claude, GPT5, etc)

Attributions

Groundedness scores

Final

response

Translation

116 of 123

Example: Splitting by Role

https://www.nvidia.com/en-us/on-demand/session/gtc25-s74439/

@rajistics

117 of 123

Models are rapidly getting better.

https://metr.org/

Tradeoffs:

  • Capability
  • Speed
  • Cost

@rajistics

118 of 123

The Status Quo: Hybrid Search.

Chroma: https://x.com/trychroma/status/1983625513244750304

Prioritizing

n=100

Prioritizing

n=10

Query

Reranker

Reciprocal rank fusion

Lexical search

Semantic

search

Filter via metadata

Final retrievals

@rajistics

119 of 123

Why this takes 12M tokens

https://laminar.sh/shared/evals/c97e4a45-8a14-428f-8eac-f77ef6eb75a8

  • Not one generation
  • Hundreds of iterations
  • Each step includes context + feedback
  • The model isn’t solving the problem. The system is.

@rajistics

120 of 123

Less can be more with agents

https://arxiv.org/pdf/2603.09004

Minimal prompts force agents to explore and verify, rather than blindly following a hallucinated plan.

@rajistics

121 of 123

Capability is rising faster than reliability.

Models are getting better are writing code, but they are not doing it reliably, which is why we need a harness.

https://arxiv.org/pdf/2602.16666

@rajistics

122 of 123

TroubleShooting Tools

Many agent issues are tool-use problems.

Leonie Monigatti

@rajistics

123 of 123

Specs for software development

OpenSpec�https://www.youtube.com/watch?v=PQU9o_5rHC4

🎯 Goals: what "done" looks like

🚧 Constraints: the box the agent must stay inside

✅ Acceptance: the test the agent runs against itself

The spec outlives any single session.

@rajistics