Engineering the Harness:
A Practical Workshop
Rajiv Shah
@rajistics
OpenHands
https://github.com/rajshah4/harness-engineering
Engineering the Harness:
A Practical Workshop
Rajiv Shah
@rajistics
OpenHands
https://github.com/rajshah4/harness-engineering
The tasks changed.
2023
Prompting for sentiment
30 tokens - ~0.2 seconds
2026�Fix the bug, here is the code base and test suites
@rajistics
The model isn’t solving the problem. The system is.
13M Tokens for 50k Output! and takes 20 minutes
https://laminar.sh/shared/evals/c97e4a45-8a14-428f-8eac-f77ef6eb75a8
@rajistics
Hi, I'm Rajiv, and this is a masterclass on Harnesses.
Rajiv Shah
Agentic AI Engineer
OpenHands
Social Media: @rajistics
This has been an evolution
https://arxiv.org/pdf/2604.08224
@rajistics
What is in a harness?
https://blog.langchain.com/the-anatomy-of-an-agent-harness/
@rajistics
A harness is everything outside the model
https://blog.langchain.com/the-anatomy-of-an-agent-harness/
@rajistics
A harness is everything outside the model
https://blog.langchain.com/the-anatomy-of-an-agent-harness/
@rajistics
A good SDK abstracts the harness for agentic actions
https://github.com/OpenHands/software-agent-sdk/
@rajistics
Same model, 2× performance gap
https://x.com/sayashk/status/1996334941832089732
The change:
95% - Claude Code
42% - HF smolagents
@rajistics
Everyone on the leaderboard uses the same model
https://www.tbench.ai/
@rajistics
Small model + good harness > big model
AutoHarness: https://arxiv.org/pdf/2508.07995�Meta-Harness: https://yoonholee.com/meta-harness/
@rajistics
What harness do you use?
@rajistics
Harness carries a lot of decisions
https://fieldjournal.ai/blog/codex-cli-vs-claude-code/?utm_source=chatgpt.com
@rajistics
Harnesses are evolving with the models.
https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
@rajistics
As models improved, who has noticed the trend towards shorter system prompts?
@rajistics
System prompts getting longer!
https://github.com/rajshah4/harness-engineering
@rajistics
Claude Code Harness / Architecture
https://arxiv.org/pdf/2604.14228
@rajistics
Harnesses have bugs
https://www.anthropic.com/engineering/april-23-postmortem
@rajistics
The 5 Levers of Harness Engineering
@rajistics
Let's start with how agents find what they need.
@rajistics
The three modes of Agentic Retrieval.
BM25
Keyword-based retrieval
Language Models
Semantic meaning with embeddings
Agentic Search
Dynamic using LLM Reasoning
@rajistics
The Baseline: grep.
@rajistics
State-of-the-Art Coding Agents rely on grep
@rajistics
Inverted Indices (BM25) make grep instant.
N_docs | Linear/Grep (s) | Inverted Index | BM25 (s) |
1000 | 3.468 | 0.005 | 0.028 |
3000 | 10.188 | 0.014 | 0.097 |
6000 | 20.608 | 0.025 | 0.24 |
9000 | 30.092 | 0.061 | 0.36 |
You don't need a vector database to search code.
BM25 is blazing fast.
@rajistics
Context Harness Uses BM25
Provides a retrieval
frameworks for code
https://parallax-labs.github.io/context-harness/
@rajistics
Where Lexical breaks down: The Synonym Gap.
Takeaway:
@rajistics
Embeddings solve for meaning.
@rajistics
Semantic search is useful for massive codebases.
https://cursor.com/blog/semsearch
@rajistics
Who is using Agentic Search?
@rajistics
Agentic RAG
@rajistics
Agentic Search trades latency for massive accuracy.
⚡️
⚡️
⏱️
⏱️
⏱️
⏱️
@rajistics
Coding agents are really good at long Context
https://arxiv.org/pdf/2603.20432
@rajistics
Use files instead of chunking RAG approach.
https://arxiv.org/pdf/2603.12180
@rajistics
So when should you move to a database
Files stay the source of truth. Index on top when query cost bites.
@rajistics
Design rules for you
@rajistics
Let's start with how agents find what they need.
@rajistics
Are you excited about 10M Context Windows?
@rajistics
Memory & State
Agents don't fail because the context fills up. They fail because they forget what matters.
1M Context Windows are never enough.
@rajistics
1M Context Windows degrade
https://claude.com/blog/1m-context-ga
@rajistics
Key facts disappear inside long model inputs
https://www.linkedin.com/posts/sinan-ozdemir_agenticai-llm-rag-ugcPost-7428125462201102336-B7Br
@rajistics
Coding agents struggle with long context models
When 100s of lines of warning push the real goal out of the agent's context window.
@rajistics
Three layers of Memory
@rajistics
Layer 1: Fixing Active Context
Reset:
Compacting:
@rajistics
Layer 1: Compacting from OpenHands
💰 Up to 2x per-turn API cost reduction
⚡ Consistent response times in long sessions
🧠 Equivalent (or better!) performance on software engineering tasks
https://openhands.dev/blog/openhands-context-condensensation-for-more-efficient-ai-agents�ACON: https://arxiv.org/html/2510.00615v1
@rajistics
Layer 1: How does Codex do it???
https://simzhou.com/en/posts/2026/how-codex-compacts-context/�https://developers.openai.com/api/docs/guides/compaction
@rajistics
Getting the most of a 1M Context Windows
https://x.com/trq212/status/2044548257058328723
https://code.claude.com/docs/en/agent-sdk/file-checkpointing
Techniques from Anthropic on managing your context window
@rajistics
Layer 2: Working State (The Golden Rule)
Files make better memory than chat.
Write your plan to a .md file in the workspace rather than holding it in prompt memory.
@rajistics
Deep Agents rely on external plans.
LangChain’s Deep Agents
uses a write_todos tool to write out plans for agent tasks
https://www.youtube.com/watch?v=geTtqyFnyHA
@rajistics
Extreme Layer 2: Recursive Language Models (RLM)
RLMs bypass token limits entirely by using a persistent Python REPL to manage their state and call sub-LLMs.
https://arxiv.org/pdf/2512.24601v1
@rajistics
RLMs maintain accuracy at 1M tokens.
https://arxiv.org/pdf/2512.24601v1
Standard GPT-5 failed entirely; RLM-GPT-5 hit 91% accuracy
@rajistics
Layer 3: Durable Memory
@rajistics
Who uses an Agents.md file?
@rajistics
Durable Memory with Agents.md
Don’t overload this file, it’s part of every prompt
@rajistics
Auto-generated AGENTS.md files hurt performance
https://arxiv.org/pdf/2602.11988v1
@rajistics
The rule of thumb: "Minimize Load-Bearing Memory"
https://www.youtube.com/watch?v=PQU9o_5rHC4
And so the thing that you want is do the minimal possible thing in order to get the model on track. And so if you delete your claude.md and then, you know, the model is getting off track, it does the wrong thing, that's when you kind of add back a little bit at a time. And what you're probably going to find is with every model you have to add less and less.
@rajistics
Skills are the new standard for Durable Memory.
A skill isn't just a prompt.
It’s instructions, code, and references material
@rajistics
Skills as Externalized Expertise
https://arxiv.org/pdf/2604.08224
@rajistics
Skills can replace Code
Cursor: AI Engineering London 2026
@rajistics
Building a learning loop with skills
https://github.com/NousResearch/hermes-agent
@rajistics
Continual learning outer loop with skills
Inner loop finishes the task.
Outer loop makes the agent smarter.
https://www.philschmid.de/inner-loop-vs-outer-loop
@rajistics
Warning: 16% of skills actually reduce performance.
https://www.reddit.com/r/rajistics/comments/1r77v1h/skillsbench_showed_models_arent_good_at/
@rajistics
You must evaluate your skills.
Blog post: https://openhands.dev/blog/evaluating-agent-skills
Repo: https://github.com/rajshah4/evaluating-skills-tutorial
@rajistics
Constant Innovation around Memory
@rajistics
@rajistics
Memory & Claude
@rajistics
Memory as Lock-in
https://www.langchain.com/blog/your-harness-your-memory
Closed Harness
Open Harness
Lock the harness → lose the memory → lose your product.
@rajistics
Let's start with how agents find what they need.
@rajistics
Agentic Loops and Tool Use
A great loop forgives a mediocre prompt.
Engineering the Loop
@rajistics
We no longer rely on single-shot execution.
O1 - Sept 2024
@rajistics
We no longer rely on single-shot execution.
Opus 4.7 Tools:�ask_user_input_v0
bash_tool
conversation_search
create_file
fetch_sports_data
image_search
message_compose_v1
places_map_display_v0
places_search
present_files
recent_chats
recipe_display_v0
recommend_claude_apps
search_mcp_registry
str_replace
suggest_connectors
view
weather_fetch
web_fetch
web_search
tool_search
visualize:read_me
visualize:show_widget
https://simonwillison.net/2026/Apr/18/opus-system-prompt/
@rajistics
The Default: The "Ralph Wiggum" Agent
Try command
Get failure
Retry without diagnosis
Repeat until stopped by the harness
@rajistics
Ralph can work
This brute force approach can work
C Compiler
https://openhands.dev/blog/20260219-velocity-is-dead�https://www.ronin.consulting/artificial-intelligence/using-the-ralph-wiggum-loop/
@rajistics
Cognitive Discipline via JSON Schema and Plan
Build a plan before you execute
@rajistics
Moving from Ralph Wiggum to AutoResearch
https://github.com/karpathy/autoresearch
@rajistics
An Improved Loop for AutoResearch
https://github.com/karpathy/autoresearch
Added:�hypothesis and a verification step
@rajistics
Defensive Tool Returns.
@rajistics
Environmental Discipline: Testing Driven Development
Factory: AI Engineering London 2026
Tests written after implementation don't catch bugs; they just confirm the agent's decisions.
@rajistics
Adding System Constraints for your Harness
https://openai.com/index/harness-engineering/
Code is Free. Architecture is Expensive.
Add system constraints:
@rajistics
Safety & Friction: Sandboxes
@rajistics
Safety & Friction: Sandboxes
@rajistics
Guardrails and Approval Friction
https://www.anthropic.com/engineering/beyond-permission-prompts-making-claude-code-more-secure-and-autonomous
- https://www.anthropic.com/engineering/writing-effective-tools-for-agents
- https://github.com/walkinglabs/awesome-harness-engineering
Design friction to match blast radius.
@rajistics
Principles for Agentic Loop
SWE Bench: https://proceedings.neurips.cc/paper_files/paper/2024/file/5a7c947568c1b1328ccc5230172e1e7c-Paper-Conference.pdf
@rajistics
Let's start with how agents find what they need.
@rajistics
System Architecture:�Single versus Multi Agent
Who’s using a multi-agent for coding?
@rajistics
Single agents degrade as complexity grows.
https://snorkel.ai/blog/multi-agents-in-the-context-of-enterprise-tool-use/
@rajistics
Split the context between multiple agents
https://snorkel.ai/blog/multi-agents-in-the-context-of-enterprise-tool-use/
@rajistics
Multi-Agent is like Distributed Systems: Complex!
https://www.youtube.com/watch?v=2czYyrTzILg
@rajistics
Many ways to orchestrate multiple agents
@rajistics
Coordination Tax - Going from Parallel to Serial
Factory: AI Engineering London 2026
@rajistics
The Reality: More agents only help if coordination stays cheap.
https://arxiv.org/pdf/2512.08296
Multi-Agent critics using reflection
Rubric-Supervised Critichttps://arxiv.org/pdf/2603.03800
Reflexion: https://arxiv.gg/abs/2303.11366�Boris Cherny: 2 to 3X better quality
@rajistics
Harness engineering in another two years?
Commoditizing (will be default) | Durable (keeps mattering) |
Compaction algorithms | Skills systems — your org's expertise |
Default tool choices (grep, read, edit) | Memory policy — what persists, what doesn't |
Model-specific prompt tweaks | Domain-specific tool libraries |
Most "hacks" patching capability gaps | Security posture & approval friction |
| Evals — knowing when your harness breaks |
@rajistics
Five knobs that decide everything
@rajistics
Why Harnesses Matter
Yesterday we asked which model.
Today, ask which harness?
@rajistics
Engineering the Harness:
A Practical Workshop
Rajiv Shah
@rajistics
OpenHands
https://github.com/rajshah4/harness-engineering
Engineering the Harness:
A Practical Workshop
Rajiv Shah
@rajistics
OpenHands
https://github.com/rajshah4/harness-engineering
Protocols and The JSON Schema Fix
The harness as cognitive environment. The Foundation Model (Agent Core) sits at the center.
Six harness dimensions form a coordinated ring around it.
https://arxiv.org/pdf/2604.08224
@rajistics
Appendix: BM25 rewards rare query terms.
Probabilistic lexical ranking function
@rajistics
Embeddings use the semantic meaning of words
@rajistics
Adding a loop with agentic search
@rajistics
BRIGHT dataset is a more challenging retrieval task
BRIGHT: https://arxiv.org/pdf/2407.12883
Requires reasoning
over retrieval
@rajistics
Appendix: Navigate structured data; don’t stuff into a prompt
https://arxiv.org/pdf/2602.05447
@rajistics
Agentic BM25 beats Semantic Search on reasoning.
BRIGHT: https://arxiv.org/pdf/2407.12883
Querying with LLM using BM25!!
Agentic Search
LLMs know synonyums
@rajistics
Agentic Search trades latency for massive accuracy.
Pick:
(5s versus 25s)
WIxQA: https://arxiv.org/abs/2505.08643
@rajistics
Appendix: Harnesses may eventually self-improve.
Models can learn frm themselves
@rajistics
Appendix: Context may eventually tune itself.
https://arxiv.org/pdf/2510.04618
https://github.com/kayba-ai/agentic-context-engine
Optimizing prompts and memory
@rajistics
From Tools to Coding
- 2,600 API endpoints as JSON schemas
- 1.2M tokens per request
- Dozens of round trips
- 🐢 slow, chatty
Cloudflare: AI Engineering London 2026
- Pass docs, ask for one JS program
- Execute in V8 Isolates
- 1 round trip
- ⚡ zero-latency logic
@rajistics
Externalize process
https://arxiv.org/pdf/2604.08224
Protocols externalize the interaction burden.
@rajistics
It’s possible to build an independent standard
https://www.agentdataprotocol.com/
standardized format to represent agent trajectories across different harnesses, tools, and environments
@rajistics
Retrieval decides what enters context.
Document hierarchy creation
INGEST & PARSE
OCR text
Layout analysis
Image captioning
Table extraction
Metadata creation
]
Source �docs
Data�store
Extracted document
Chunked document
QUERY
Translation
Input query
Model
Armor
Multi-turn
Query decomposition
Query expansion
Query reformulation
RETRIEVE
Filter model
Reranker
Reciprocal rank fusion
Lexical search
Semantic
search
Filter via metadata
Final retrievals
GENERATE
Generate response
(GLM, Claude, GPT5, etc)
Attributions
Groundedness scores
Final
response
Translation
Example: Splitting by Role
https://www.nvidia.com/en-us/on-demand/session/gtc25-s74439/
@rajistics
Models are rapidly getting better.
https://metr.org/
Tradeoffs:
@rajistics
The Status Quo: Hybrid Search.
Chroma: https://x.com/trychroma/status/1983625513244750304
Prioritizing
n=100
Prioritizing
n=10
Query
Reranker
Reciprocal rank fusion
Lexical search
Semantic
search
Filter via metadata
Final retrievals
@rajistics
Why this takes 12M tokens
https://laminar.sh/shared/evals/c97e4a45-8a14-428f-8eac-f77ef6eb75a8
@rajistics
Less can be more with agents
https://arxiv.org/pdf/2603.09004
Minimal prompts force agents to explore and verify, rather than blindly following a hallucinated plan.
@rajistics
Capability is rising faster than reliability.
Models are getting better are writing code, but they are not doing it reliably, which is why we need a harness.
https://arxiv.org/pdf/2602.16666
@rajistics
TroubleShooting Tools
Many agent issues are tool-use problems.
Leonie Monigatti
@rajistics
Specs for software development
OpenSpec�https://www.youtube.com/watch?v=PQU9o_5rHC4
🎯 Goals: what "done" looks like
🚧 Constraints: the box the agent must stay inside
✅ Acceptance: the test the agent runs against itself
The spec outlives any single session.
@rajistics