1 of 71

Autoresearch

in ranking

© SoftwareDoug LLC - http://softwaredoug.com

An honest take

2 of 71

About me

  • Share my learnings early

  • Honesty a core value

  • Still trying to figure out how search works

  • Embarrassing myself as a service

(Search since 2013, training + consulting)

3 of 71

May 8 - Lexical+BM25

May 12 - Vector Search

May 14 - Search Evaluation

$$ 1200

FREE!

Cheat at Search ~essentials~

4 of 71

Cheat at Search with Agents

Live training - discount code haystack

Agentic Search course

5 of 71

6 of 71

Autoresearch - the idea

Coding Agent

Model / Code / ??

Evals / Objective

Propose improvements

Evaluate Change

👍/ 👎 / ??

7 of 71

Why? Deployment story

Coding Agent

Model / Code / ??

Evals / Objective

Propose improvements

Evaluate Change

👍/ 👎 / ??

You’re already deploying this somewhere. You’re just making it better

8 of 71

Today: exploration of experiments

  • My DIY Agent -> Autoresearch setup

  • What I learned experimenting on datasets

  • Sometimes backed up by data. Other times intuition.

9 of 71

Repo + Notebook

10 of 71

Agentic Search Setup

11 of 71

Agentic Loop

Agents solve problems w/ tools (ie coding)

Sort the foobar with quicksort

Investigate

Edit Code

Run tests

Code editing tool

code

Command running tool (grep, etc.)

Filesystem

12 of 71

Agentic Loop

Agentic search: agents solving a search problem

What are BM25 posts by Doug Turnbull?

Create Queries

(Reasoning)

Call Search

Tools

Evaluate

(LLM reasoning)

retry…

… until happy

Lexical Search Tool

Embedding Retrieval Tool

How can I use my tools to find useful results?

Oh “XYZ earnings” didn’t work. Let me try again with “XYZ financials”

13 of 71

Boring old, hallucinating AI

In:

inputs = [
    {"role": "system", "content": "You are a helpful assistant that helps find blog posts."},
    {"role": "user", "content": "What are some blog posts by Doug Turnbull about BM25"},
]

resp = openai.responses.create(
    model="gpt-5-mini",
    input=inputs,
)

resp.output[-1].content[-1].text

Out:

- “BM25 explained” (Elastic blog) — a straightforward explainer of the BM25 ranking function, its parameters (k1, b), and intuition for term frequency and document length normalization.

- “BM25 and TF saturation / parameter tuning”

- “BM25 vs. TF-IDF / Why BM25 replaced TF-IDF in Lucene” — historical/comparative post describing differences between classic TF-IDF and BM25 and why BM25 is the default in Lucene/Elasticsearch.

14 of 71

Define a tool - just a python function

In:

def search(keywords: str):
    """A simple BM25 keyword search of Doug Turnbull's blog posts."""
    … body omitted …

search("bm25")

Out:

[{'id': 160,
  'title': 'Bayesian BM25 is cool',
  'score': 37.75146675109863,
  'publication_date': '2026-03-16'},
 {'id': 121,
  'title': 'Can BM25 be a probability?',
  'score': 36.11565685272217,
  'publication_date': '2026-03-06'},

(In most frameworks)

Name + docstring + params becomes part of prompt
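To make it concrete, here is one way the omitted body could look, assuming an in-memory posts list of dicts and the rank_bm25 package (a sketch, not the actual implementation behind the slide):

from rank_bm25 import BM25Okapi

# posts: assumed list of {"id", "title", "description", "publication_date"} dicts, indexed once
tokenized_posts = [f"{p['title']} {p['description']}".lower().split() for p in posts]
bm25_index = BM25Okapi(tokenized_posts)

def search(keywords: str):
    """A simple BM25 keyword search of Doug Turnbull's blog posts."""
    scores = bm25_index.get_scores(keywords.lower().split())
    top = sorted(range(len(posts)), key=lambda i: scores[i], reverse=True)[:10]
    return [{"id": posts[i]["id"],
             "title": posts[i]["title"],
             "score": float(scores[i]),
             "publication_date": posts[i]["publication_date"]}
            for i in top]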

15 of 71

Tell OpenAI about tools

In:

resp = openai.responses.create(
    model="gpt-5-mini",
    input=inputs,
    tools=... (openai tool spec) ...,
)

resp.output

We tell agent about a tool…

(name, description, params…)

Out:

[
    ResponseReasoningItem(id='rs_0b3cc088788503af0069f904b2ad1c8197be9552cd0698014f', summary=[], type='reasoning', content=None, encrypted_content=None, status=None),
    ResponseFunctionToolCall(arguments='{"keywords":"BM25"}', call_id='call_a5wFOmExw3RI1HMn9MJriMCy', name='search', type='function_call', id='fc_0b3cc088788503af0069f904b42fd0819786283f7d06d79f18', status='completed')
]

… agent requests us to call it
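That "(openai tool spec)" placeholder is just JSON derived from the function's name, docstring, and parameters. Roughly, for the Responses API it could look like this (a sketch of the standard function-tool shape; treat the exact schema fields as an assumption and check current docs):

tools = [{
    "type": "function",
    "name": "search",
    "description": "A simple BM25 keyword search of Doug Turnbull's blog posts.",
    "parameters": {
        "type": "object",
        "properties": {
            "keywords": {"type": "string", "description": "Keywords to search for."},
        },
        "required": ["keywords"],
    },
}]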

16 of 71

Loop until done calling

tool_calls = True
while tool_calls:
    tool_calls = False
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=inputs,
        tools=... (openai tool spec) ...,
    )
    inputs += resp.output
    for func_output in resp.output:
        if func_output.type == "function_call":
            if func_output.name == "search":
                tool_response = search(...)
                inputs.append(tool_response)
                tool_calls = True

Call tool, append the results to context
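One detail the snippet glosses over: for the Responses API, the appended tool result should be a function_call_output item tied to the model's call_id. A minimal sketch (assuming the search results are JSON-serialized into the output string):

import json

def handle_search_call(func_output):
    """Build the item that feeds the tool result back to the model."""
    args = json.loads(func_output.arguments)   # e.g. {"keywords": "BM25"}
    results = search(**args)                   # call our python function
    return {
        "type": "function_call_output",
        "call_id": func_output.call_id,
        "output": json.dumps(results),
    }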

17 of 71

And it works!

Tool combos

(embedding + BM25)

18 of 71

Even at hard deep research tasks

Decent on very hard questions

browsecomp-plus benchmark

Using weak model

(gpt-4-oss)

Revisiting Text Ranking in Deep Research

https://arxiv.org/pdf/2602.21456

19 of 71

Autoresearch

20 of 71

Can’t agentify all the search (yet)

Agent

red sheos

👠 Red high heels

👞 Maroon dress shoes

Lexical Search

Embedding Retrieval

21 of 71

Distill lessons into code?

Agent

Agent Coded ranker

red sheos

👠 Red high heels

👞 Maroon dress shoes

Lexical Search

Embedding Retrieval

Query Corrections

Query Corrections

backends

Other pieces

22 of 71

Start with ranking code source

original_source = """
def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
    docs = fielded_bm25(query, fields=['title^9.3',
                                       'description^4.1'],
                        operator='or', top_k=10)
    return [str(doc['id']) for doc in docs]
"""

with open("rerank_wands.py", "w") as f:
    f.write(original_source)

Inject retrieval primitives (ie BM25, etc)

call these “search tools”

Some starter code

23 of 71

Searching agent -> code agent

Agentic Loop

Hypothesize Code changes

(Reasoning)

Edit Code

Eval on labeled data

search_wands

Queries      Documents   Rel?
red shoe     1234        👍
blue shoe    5678        👎

Training set

code

edits

Eval Tooling

Ask for eval

Training Query

NDCGs

search

24 of 71

Build your own coding tool

def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    ...

Tool for accepting changes:

25 of 71

Build your own coding tool

class Edit(BaseModel):
    """A single edit to apply to the reranker code."""

    anchor: str = Field(
        ...,
        description="The anchor text to identify where the patch should be applied.",
    )
    block_until: str = Field(
        ...,
        description="The end of the block of text to which the patch should be applied. Do not leave blank.",
    )
    action: Literal["insert_after", "replace", "delete"] = Field(
        ..., description="The action to perform: insert_after, replace, or delete."
    )
    # (apply_patch also reads edit.text: the new code to insert or replace with)

26 of 71

Build your own coding tool

def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    block_index = code.find(edit.block_until, anchor_index)
    if edit.action == "insert_after":
        insertion_point = block_index + len(edit.block_until)
        code = (
            code[:insertion_point] + "\n" + edit.text + "\n" + code[insertion_point:]
        )
    elif edit.action == "replace":
        ...
    elif edit.action == "delete":
        ...
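For completeness, a sketch of how the replace and delete branches could work given the anchor / block_until semantics above (the slide elides them, so this is an assumption rather than the actual code):

anchor_index = code.find(edit.anchor)
block_index = code.find(edit.block_until, anchor_index)
block_end = block_index + len(edit.block_until)

if edit.action == "replace":
    # Swap everything from the anchor through the end of the block for the new text
    code = code[:anchor_index] + edit.text + code[block_end:]
elif edit.action == "delete":
    # Drop the anchored block entirely
    code = code[:anchor_index] + code[block_end:]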

27 of 71

Give agent visibility into impact

run_evals

code

Ask for eval

Training Query NDCGs

{
    "red shoes": 0.56,
    "nike sneakers": 0.21
}

“Training Data” just means “Agent can see these queries”
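For reference, the per-query number here is just standard NDCG@10 over the labeled results. A minimal sketch (helper and variable names are assumptions, not the actual tool code):

import math

def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query; judgments maps doc_id -> graded relevance."""
    dcg = sum(judgments.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(judgments.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(rank + 2) for rank, grade in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# run_evals then just maps this over the training queries, roughly:
# {query: ndcg_at_k(run_reranker(query), judgments[query]) for query in training_queries}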

28 of 71

Why not Claude Code?

  • I want tight, programmatic control over accepting changes

  • Easy reproducibility with few dependencies

  • It’s not that hard, this is simple

  • I’m already building harnesses for other things

(But yes you can do this with Claude Code + Hooks)

29 of 71

All the code editing + eval tools

tools = [
    # -------
    # CODE EDITING tools
    apply_patch,     # Edit the reranker with a patch
    revert_changes,  # Restore the reranker to the last version
    # -------
    # INSPECT + EVAL current code
    search_wands,    # The raw search tool (from earlier)
    run_reranker,    # Run on one query (optionally label results)
    run_evals,       # Run on training set, getting per-query NDCG / mean
]

Editing

30 of 71

All the code editing + eval tools

tools = [
    # -------
    # CODE EDITING tools
    apply_patch,     # Edit the reranker with a patch
    revert_changes,  # Restore the reranker to the last version
    # -------
    # INSPECT + EVAL current code
    search_wands,    # The raw search tool (from earlier)
    run_reranker,    # Run on one query (optionally label results)
    run_evals,       # Run on training set, getting per-query NDCG / mean
]

Run evals

Use the search primitives directly

31 of 71

Now I can just configure what I need

32 of 71

strategy:
  name: codegen_no_guards
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
      ...

Setup task with config

Primitives available to the code generation

  • fielded_bm25 - does a BM25 search on specified field
  • e5_base_v2 - embedding model

33 of 71

...
system_prompt: |
  Your task is to improve the reranker code so it returns more relevant results.
  Use apply_patch to edit the reranker module.
  Use run_reranker to inspect single queries and run_evals for NDCG.
  If NDCG does not improve, revert with revert_changes.
  DO NOT rename the function in the code. You MUST keep the signature the same.

System prompt

34 of 71

inputs = [
    {"role": "system", "content": "Your task is to improve the reranker code so it returns more relevant results….
        ...
        Here’s the current code:

        def rerank_wands(query, fielded_bm25, e5_base_v2, ...):
            docs = fielded_bm25(keywords=query,
                                field_to_search='product_name',
                                operator='and',
                                top_k=10)
            return [doc['id'] for doc in docs]
    "},

With system prompt

35 of 71

tool_calls = True
while tool_calls:
    tool_calls = False
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=inputs,
        tools=... (openai tool spec) ...,
    )
    inputs += resp.output
    for func_output in resp.output:
        if func_output.type == "function_call":
            ... call eval + editing tools ...

This is just:

(Pretty much same agentic loop…

But prompts asking agent to edit code + run evals)

36 of 71

Agentic Loop

Or in pretty picture form

Sort the foobar with quicksort

Investigate

Edit Code

Run tests

Propose update

code

apply_patch

Added:

run_evals

run_reranker

revert_code

Return current code

search_wands

37 of 71

def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
    """
    Hybrid reranker for the wands dataset.

    Strategy:
    - Gather a broad candidate set using multiple retrieval modes:
        * BM25 (OR) for recall
        * BM25 (PHRASE) to capture exact-phrase matches
        * BM25 (AND) for precise multi-term intent
        * Embedding search for semantic recall
    - Compute lightweight features per candidate (normalized scores, token overlap,
      title boosts, and presence of product-category head terms).
    - Combine with a weighted score and return top results by final score.
    """
    ...

After one run

38 of 71

def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
    """
    Hybrid reranker for the wands dataset.

    Strategy:
    - Gather a broad candidate set using multiple retrieval modes:
        * BM25 (OR) for recall
        * BM25 (PHRASE) to capture exact-phrase matches
        * BM25 (AND) for precise multi-term intent
        * Embedding search for semantic recall
    - Compute lightweight features per candidate (normalized scores, token overlap,
      title boosts, and presence of product-category head terms).
    - Combine with a weighted score and return top results by final score.
    """
    ...

Start again w/ this code

Autoresearch run

Even better ranker

39 of 71

Run it... over multiple rounds

uv run train \
    --strategy configs/codegen/codegen_no_guards.yml \
    --dataset wands \
    --rounds 10

Judgments + corpus from here

Yml file controlling agentic process

Restart fresh round w/ output of last

40 of 71

The results

41 of 71

Code looks sus

Holy overfit batman!

42 of 71

The other problem: size

43 of 71

Non-training data tells a messy story…

44 of 71

Slightly less bad

Autoresearch

45 of 71

Add overfit guard before accepting changes

strategy:
  name: codegen_guarded
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
    edit:
      guards:
        - validation
        - overfit:
            model: openai/gpt-5-mini

  1. Force check on validation set before accepting edits
  2. Ask another LLM if these “look overfit” per our requirements

46 of 71

Reject code failing check

def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    # Validate code
    for guardrail in guardrail_fns:
        error_message = guardrail(edit.text)
        if error_message is not None:
            raise ValueError(error_message)

    block_index = code.find(edit.block_until, anchor_index)
    if edit.action == "insert_after":
        ...

Run checks.

Reject code that fails
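A sketch of what two such guards might look like, following the guardrail(edit.text) returns-error-or-None convention above (the length limits and the overfit-check prompt are assumptions, not the actual implementation):

def length_guard(text: str, max_lines: int = 10, max_cols: int = 120):
    """Reject edits that are too big to reason about."""
    lines = text.splitlines()
    if len(lines) > max_lines:
        return f"Edit is {len(lines)} lines (max {max_lines}). Make a smaller, targeted change."
    if any(len(line) > max_cols for line in lines):
        return f"Edit has lines longer than {max_cols} columns."
    return None

def overfit_guard(text: str):
    """Ask a second model whether the edit hard-codes training queries."""
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=[
            {"role": "system", "content": "Reply OVERFIT if this patch hard-codes specific "
                                          "queries, doc ids, or per-query magic constants. Otherwise reply OK."},
            {"role": "user", "content": text},
        ],
    )
    if "OVERFIT" in resp.output[-1].content[-1].text:
        return "Rejected: this edit looks overfit to the training queries."
    return None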

47 of 71

Add maximum code size

strategy:
  name: codegen_guarded
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
    edit:
      guards:
        - length:
            max_lines: 10
            max_cols: 120

Disallow edits over a certain size

  • More targeted / incremental changes
  • Cause + effect in reasoning

48 of 71

Agentic Loop

Autoresearch w/ validation

Sort the foobar with quicksort

Investigate

Edit Code

Run tests

Propose update

code

apply_patch

Reject w/ error:

“Your code did not increase validation by X …”

Updated:

49 of 71

Final code

def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):

b = fielded_bm25(query, fields=['title^10.5','description^3.9'], operator='or', top_k=95, k1=1.2, b=0.6)

e = search_embeddings(query, top_k=40)

s = {}; k = 50.0

for i,d in enumerate(b): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.45/(k+i+1.0)

for i,d in enumerate(e): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.55/(k+i+1.0)

ib = {str(d['id']) for d in b}; ie = {str(d['id']) for d in e}

for i in ib & ie: s[i] = s.get(i,0.0)+0.01

return [k for k,_ in sorted(s.items(), key=lambda x:x[1], reverse=True)][:10]

50 of 71

Ah better. Higher plateau ~0.59

51 of 71

Final code

def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):

b = fielded_bm25(query, fields=['title^10.5','description^3.9'], operator='or', top_k=95, k1=1.2, b=0.6)

e = search_embeddings(query, top_k=40)

s = {}; k = 50.0

for i,d in enumerate(b): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.45/(k+i+1.0)

for i,d in enumerate(e): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.55/(k+i+1.0)

ib = {str(d['id']) for d in b}; ie = {str(d['id']) for d in e}

for i in ib & ie: s[i] = s.get(i,0.0)+0.01

return [k for k,_ in sorted(s.items(), key=lambda x:x[1], reverse=True)][:10]

52 of 71

As explained by ChatGPT

Step 1: Run fielded BM25 search. (modified b, stronger title)

Step 2: Embedding search

Step 3: Weighted RRF (sketch below)

Step 4: Bonus for overlapping in both retrievers
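Spelled out, the fusion step is weighted reciprocal rank fusion. A more readable sketch of the same idea (not the agent's exact code):

def weighted_rrf(rank_lists, weights, k=50.0, top_n=10):
    """Weighted reciprocal rank fusion: each retriever contributes weight / (k + rank + 1)."""
    scores = {}
    for ranked_ids, weight in zip(rank_lists, weights):
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1.0)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. weighted_rrf([bm25_ids, embedding_ids], weights=[0.45, 0.55])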

53 of 71

LLMs pick unsurprising solutions

🤖 LLM

Our blog posts

Our blog posts

Our blog posts

Our blog posts

Problem

Known IR approaches

54 of 71

Exploring novel solutions

55 of 71

First thought - control ingredients


def rerank_wands(query,
                 fielded_bm25,
                 search_embeddings,
                 rewrite_query,
                 get_commute_distance,
                 categorize_queries,
                 ...
                 **kwargs):

Give agent many prepared primitives?

(~feature engineering)

56 of 71

Or we could give something open-ended


def rerank_wands(query,
                 fielded_bm25,
                 search_embeddings,
                 call_llm,
                 **kwargs):

Or let agent figure things out from first principles?

~guide with skills to figure it out

Really this is a curse of dimensionality problem: explore given limited context

57 of 71

Curse of dimensionality


Let me try title boost

Context

Ok that didn’t work, what about query rewriting

100K tokens

Another 100K tokens

(Using up context trying all the options)

58 of 71

One approach - focus on part of the problem


def rerank_wands(query,
                 search,
                 rewrite_query,
                 ...
                 **kwargs):

Result from previous run - treated like a black box

New tool being used

59 of 71

Think in terms of ensemble / hiding


search_tools:
  - codegen:
      path: runs/codegen/codegen_guarded/20260502_025238
      name: search
      dependencies:
        - fielded_bm25
        - e5_base_v2
  - query_rewrite:
      model: gpt-5-mini
      max_alternatives: 5

Agent just sees “search” doesn’t know details

Add “query rewrite”

60 of 71

Train w/ rewrite + two retrieval


~0.586

61 of 71

Hide the retrieval behind “search”


~0.595

62 of 71

What if we give raw ingredients (tf, df, etc)?

def rerank_minimarco(query, fielded_bm25, get_corpus, **kwargs):
    corpus = get_corpus()
    snowball = corpus["description_snowball"].array
    ...

Start with BM25 baseline code

Access raw term stats

Start with BM25?
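For context, BM25 from raw term stats is only a few lines of numpy. A generic sketch (not the actual corpus API; array shapes are assumptions):

import numpy as np

def bm25_from_stats(term_freqs, doc_freqs, doc_lens, k1=1.2, b=0.75):
    """BM25 scores over all docs for one query.
    term_freqs: one array per query term, term frequency in each doc (length num_docs)
    doc_freqs: document frequency of each query term
    doc_lens: array of document lengths in terms."""
    num_docs = len(doc_lens)
    avg_len = doc_lens.mean()
    scores = np.zeros(num_docs)
    for tf, df in zip(term_freqs, doc_freqs):
        idf = np.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        scores += idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_lens / avg_len))
    return scores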

63 of 71

Try on MSMarco dataset

(Passage ranking, just the text)

Classic BM25 baseline: ~0.189 MRR (anserini)

how much do rear brakes and rotors cost?

Question

$105 for brake pads (front and rear) and $170 for labor. I was quoted $932 for front and rear breaks and rotors. It was steep, but I needed it done and didn't want to shop it around. When I drove out, my breaks were squeeking worse than when I went in.

Answer

64 of 71

I’m so smart. Can I beat BM25?


Coding Agent

BM25 on passage

Better BM25?

65 of 71

Add human guidance - what to try

🤖 LLM

Our blog posts

Our blog posts

Our blog posts

Our blog posts

Known IR approaches

HINTS - what to explore

- Find statistically significant bigrams (collocations) using the phrase term frequency stats

- Use the fielded BM25 scores as a signal in your reranker

- Try reranking on exact phrase matches in the retrieved candidates

- Use term length as another measure of specificity (ie longer terms are more specific to user intent)

- Try different stopword removal approaches (but don't apply it to short queries)

- Look in logs where there was ALMOST a 0.002 improvement and see what the code changes were. Try to understand why they were close but not quite there - maybe you can mitigate the downsides to get a net improvement?

System Prompt

66 of 71

On minimarco (smaller msmarco)


Whoah! It’s getting better

(gpt-5.5 xhigh)

67 of 71

On whole MSMarco, suspicious...


Flatten after initial stopword removal

(gpt-5.5 xhigh)

68 of 71

Overfit to validation


(Notice mention of very specific terms in stopwords list)

Important to monitor a dataset NOT in the validation or training set

69 of 71

Still interesting results


# The boost is intentionally small compared with the BM25 score.
bigram_boost_weight = 0.08

bigram_scores = sum(
    (
        bigram_boost_weight
        * description_index.termfreqs(scoring_terms[i : i + 2])
        for i in range(len(scoring_terms) - 1)
    ),
    np.zeros(num_documents),
)

scores += bigram_scores

stopword removal -> BM25 -> bigram boost

70 of 71

Not magic, but data-driven collaboration


🧑🏼‍💻

🤖

Ideas outside the norm

Manage Agent focus

Evals + Guardrails

Existing knowledge

Tireless trial + error

Try the obvious

Inspiration + Guidance

Exhaustive, Rote, Obvious.

71 of 71

Analogy: Erdős Problems


(Should we add search to our deep research?)