Autoresearch
in ranking
© SoftwareDoug LLC - http://softwaredoug.com
An honest take
About me
(Search since 2013, training + consulting)
May 8 - Lexical+BM25
May 12 - Vector Search
May 14 - Search Evaluation
$1200
FREE!
Cheat at Search
~essentials~
Cheat at Search with Agents
Live training - discount code haystack
Agentic Search course
Autoresearch - the idea
Coding Agent
Model / Code / ??
Evals / Objective
Propose improvements
Evaluate Change
👍/ 👎 / ??
Why? Deployment story
You’re already deploying this somewhere. You’re just making it better
Today: exploration of experiments
Repo + Notebook
Agentic Search Setup
Agentic Loop
Agents solve problems w/ tools (e.g. coding)
Sort the foobar with quicksort
Investigate
Edit Code
Run tests
Code editing tool
Command running tool (grep, etc.)
Filesystem
Agentic Loop
Agentic search: agents solving a search problem
What are BM25 posts by Doug Turnbull?
Create Queries (Reasoning)
Call Search Tools
Evaluate (LLM reasoning)
retry…
… until happy
Lexical Search Tool
Embedding Retrieval Tool
How can I use my tools to find useful results?
Oh “XYZ earnings” didn’t work. Let me try again with “XYZ financials”
Boring old, hallucinating AI
import openai

inputs = [
{"role": "system", "content": "You are a helpful assistant that helps find blog posts."},
{"role": "user", "content": "What are some blog posts by Doug Turnbull about BM25"},
]
resp = openai.responses.create(
model="gpt-5-mini",
input=inputs,
)
resp.output[-1].content[-1].text
- “BM25 explained” (Elastic blog) — a straightforward explainer of the BM25 ranking function, its parameters (k1, b), and intuition for term frequency and document length normalization.
- “BM25 and TF saturation / parameter tuning”
- “BM25 vs. TF-IDF / Why BM25 replaced TF-IDF in Lucene” — historical/comparative post describing differences between classic TF-IDF and BM25 and why BM25 is the default in Lucene/Elasticsearch.
Define a tool - just a Python function
def search(keywords: str):
"""A simple BM25 keyword search of Doug Turnbull's blog posts."""
… body omitted …
search("bm25")
[{'id': 160,
'title': 'Bayesian BM25 is cool',
'score': 37.75146675109863,
'publication_date': '2026-03-16'},
{'id': 121,
'title': 'Can BM25 be a probability?',
'score': 36.11565685272217,
'publication_date': '2026-03-06'},
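The body is omitted above; purely for illustration, a minimal sketch of what it could look like, assuming the rank_bm25 package and a tiny in-memory posts list (both my assumptions, not the talk's actual implementation):

# Illustrative sketch only - assumes the rank_bm25 package and an
# in-memory `posts` list; the real body is omitted above.
from rank_bm25 import BM25Okapi

posts = [
    {'id': 160, 'title': 'Bayesian BM25 is cool', 'publication_date': '2026-03-16'},
    {'id': 121, 'title': 'Can BM25 be a probability?', 'publication_date': '2026-03-06'},
]
bm25 = BM25Okapi([p['title'].lower().split() for p in posts])

def search(keywords: str, top_k: int = 10):
    """A simple BM25 keyword search of Doug Turnbull's blog posts."""
    scores = bm25.get_scores(keywords.lower().split())
    ranked = sorted(zip(posts, scores), key=lambda ps: ps[1], reverse=True)
    return [{**p, 'score': float(s)} for p, s in ranked[:top_k]]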
(In most frameworks)
Name + docstring + params become part of the prompt
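A rough sketch of that conversion (field names follow the OpenAI Responses function-tool spec; real frameworks use type hints to build a richer JSON schema):

import inspect

def to_tool_spec(fn):
    # Name + docstring + params -> OpenAI function-tool spec.
    # Simplification: every parameter is typed as a string.
    params = {name: {"type": "string"}
              for name in inspect.signature(fn).parameters}
    return {
        "type": "function",
        "name": fn.__name__,
        "description": fn.__doc__,
        "parameters": {
            "type": "object",
            "properties": params,
            "required": list(params),
        },
    }

tools = [to_tool_spec(search)]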
Tell OpenAI about tools
resp = openai.responses.create(
model="gpt-5-mini",
input=inputs,
tools=... (openai tool spec) ...,
)
resp.output
[
ResponseReasoningItem(id='rs_0b3cc088788503af0069f904b2ad1c8197be9552cd0698014f', summary=[], type='reasoning', content=None, encrypted_content=None, status=None),
ResponseFunctionToolCall(arguments='{"keywords":"BM25"}', call_id='call_a5wFOmExw3RI1HMn9MJriMCy', name='search', type='function_call', id='fc_0b3cc088788503af0069f904b42fd0819786283f7d06d79f18', status='completed')
]
We tell agent about a tool…
(name, description, params…)
… agent requests us to call it
Loop until done calling
tool_calls = True
while tool_calls:
    tool_calls = False
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=inputs,
        tools=... (openai tool spec) ...,
    )
    inputs += resp.output
    for func_output in resp.output:
        if func_output.type == "function_call":
            if func_output.name == "search":
                tool_response = search(...)
                inputs.append(tool_response)
                tool_calls = True
Call tool, append the results to context
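The `search(...)` call above is elided; a sketch of the dispatch, feeding the result back as a function_call_output item (the Responses API shape for tool results):

import json

# Parse the arguments the model chose, call the real Python function,
# and hand the result back to the model as a tool output item.
args = json.loads(func_output.arguments)
inputs.append({
    "type": "function_call_output",
    "call_id": func_output.call_id,
    "output": json.dumps(search(**args)),
})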
And it works!
Tool combos
(embedding + BM25)
Even at hard deep research tasks
Decent on very hard questions
browsecomp-plus benchmark
Using weak model
(gpt-oss)
Revisiting Text Ranking in Deep Research
Autoresearch
Can’t agentify all the search (yet)
Agent
red sheos
👠 Red high heels
👞 Maroon dress shoes
Lexical Search
Embedding Retrieval
Distill lessons into code?
Agent
Agent Coded ranker
red sheos
👠 Red high heels
👞 Maroon dress shoes
Lexical Search
Embedding Retrieval
Query Corrections
backends
Other pieces
Start with ranking code source
original_source = """
def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
    docs = fielded_bm25(query, fields=['title^9.3',
                                       'description^4.1'],
                        operator='or', top_k=10)
    return [str(doc['id']) for doc in docs]
"""

with open("rerank_wands.py", "w") as f:
    f.write(original_source)
Inject retrieval primitives (BM25, etc.)
call these “search tools”
Some starter code
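One way the harness might wire this up (a sketch; `fielded_bm25` / `search_embeddings` stand in for the injected primitives, and the loading mechanics are my assumption):

import importlib.util

# Load the agent-editable reranker module from disk...
spec = importlib.util.spec_from_file_location("rerank_wands", "rerank_wands.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# ...and call it, injecting the retrieval primitives ("search tools").
results = module.rerank_wands("red shoes",
                              fielded_bm25=fielded_bm25,
                              search_embeddings=search_embeddings)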
Searching agent -> code agent
Agentic Loop
Hypothesize Code changes
(Reasoning)
Edit Code
Eval on labeled data
search_wands
Queries   | Documents | Rel?
red shoe  | 1234      | 👍
blue shoe | 5678      | 👎
Training set
code
edits
Eval Tooling
Ask for eval
Training Query NDCGs
search
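A hedged sketch of what run_evals computes, assuming a judgments dict mapping each training query to {doc_id: relevance} labels (names are illustrative, not the repo's API):

import math

def ndcg_at_10(ranked_ids, labels):
    # DCG of the returned ranking vs the best possible (ideal) DCG.
    dcg = sum(labels.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:10]))
    ideal = sorted(labels.values(), reverse=True)[:10]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def run_evals(rerank_fn, judgments):
    """Per-query NDCG@10 over the training queries."""
    return {query: ndcg_at_10(rerank_fn(query), labels)
            for query, labels in judgments.items()}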
Build your own coding tool
def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    ...
Tool for accepting changes:
Build your own coding tool
from typing import Literal

from pydantic import BaseModel, Field

class Edit(BaseModel):
    """A single edit to apply to the reranker code."""
    anchor: str = Field(
        ...,
        description="The anchor text to identify where the patch should be applied.",
    )
    block_until: str = Field(
        ...,
        description="The end of the block of text to which the patch should be applied. Do not leave blank.",
    )
    action: Literal["insert_after", "replace", "delete"] = Field(
        ..., description="The action to perform: insert_after, replace, or delete."
    )
    # Used by apply_patch below; shown here for completeness.
    text: str = Field(
        ..., description="The text to insert or replace with."
    )
Build your own coding tool
def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    anchor_index = code.find(edit.anchor)
    block_index = code.find(edit.block_until, anchor_index)
    if edit.action == "insert_after":
        insertion_point = block_index + len(edit.block_until)
        code = (
            code[:insertion_point] + "\n" + edit.text + "\n" + code[insertion_point:]
        )
    elif edit.action == "replace":
        ...
    elif edit.action == "delete":
        ...
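For example, the kind of edit the agent might propose (values hypothetical):

apply_patch(Edit(
    anchor="docs = fielded_bm25(",
    block_until="top_k=10)",
    action="replace",
    text="    docs = fielded_bm25(query, fields=['title^9.3'], operator='or', top_k=50)",
))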
Give agent visibility into impact
run_evals
code
Ask for eval
Training Query NDCGs
{
    "red shoes": 0.56,
    "nike sneakers": 0.21
}
“Training Data” just means “Agent can see these queries”
Why not Claude Code?
(But yes you can do this with Claude Code + Hooks)
All the code editing + eval tools
tools = [
    # -------
    # CODE EDITING tools
    apply_patch,     # Edit the reranker with a patch
    revert_changes,  # Restore the reranker to the last version
    # -------
    # INSPECT + EVAL current code
    search_wands,    # The raw search tool (from earlier)
    run_reranker,    # Run on one query (optionally label results)
    run_evals,       # Run on training set, getting per-query NDCG / mean
]
Editing
Run evals
Use the search primitives directly
Now I can just configure what I need
strategy:
  name: codegen_no_guards
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
      ...
Setup task with config
Primitives available to the code generation
...
system_prompt: |
  Your task is to improve the reranker code so it returns more relevant results.
  Use apply_patch to edit the reranker module.
  Use run_reranker to inspect single queries and run_evals for NDCG.
  If NDCG does not improve, revert with revert_changes.
  DO NOT rename the function in the code. You MUST keep the signature the same.
System prompt
inputs = [
    {"role": "system", "content": "Your task is to improve the reranker code so it returns more relevant results….
    ...
    Here’s the current code:
    def rerank_wands(query, fielded_bm25, e5_base_v2, ...):
        docs = fielded_bm25(keywords=query,
                            field_to_search='product_name',
                            operator='and',
                            top_k=10)
        return [doc['id'] for doc in docs]
    "},
]
With system prompt
tool_calls = True
while tool_calls:
    tool_calls = False
    resp = openai.responses.create(
        model="gpt-5-mini",
        input=inputs,
        tools=... (openai tool spec) ...,
    )
    inputs += resp.output
    for func_output in resp.output:
        if func_output.type == "function_call":
            ... call eval + editing tools ...
This is just:
(Pretty much the same agentic loop…
…
but with prompts asking the agent to edit code + run evals)
Agentic Loop
Or in pretty picture form
Sort the foobar with quicksort
Investigate
Edit Code
Run tests
Propose update: code via apply_patch
Added: run_evals, run_reranker, revert_code, search_wands
Return current code
After one run
def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
"""
Hybrid reranker for the wands dataset.
Strategy:
- Gather a broad candidate set using multiple retrieval modes:
* BM25 (OR) for recall
* BM25 (PHRASE) to capture exact-phrase matches
* BM25 (AND) for precise multi-term intent
* Embedding search for semantic recall
- Compute lightweight features per candidate (normalized scores, token overlap,
title boosts, and presence of product-category head terms).
- Combine with a weighted score and return top results by final score.
"""
...
Start again w/ this code
Autoresearch run
Even better ranker
Run it... over multiple rounds
uv run train \
--strategy configs/codegen/codegen_no_guards.yml \
--dataset wands \
--rounds 10
Judgments + corpus from here
YAML file controlling the agentic process
Restart each fresh round w/ the output of the last
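Roughly what the rounds loop amounts to (a sketch; function names here are illustrative, not the repo's actual API):

# Each round starts a fresh agent conversation (fresh context),
# seeded with the best reranker code from the previous round.
best_code = original_source
for round_num in range(10):
    best_code = run_agentic_loop(start_code=best_code,
                                 tools=tools,
                                 system_prompt=system_prompt)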
The results
Code looks sus
Holy overfit, Batman!
The other problem: size
Non-training data tells a messy story…
Slightly less bad
Autoresearch
Add overfit guard before accepting changes
strategy:
  name: codegen_guarded
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
    edit:
      guards:
        - validation
        - overfit:
            model: openai/gpt-5-mini
Reject code failing check
def apply_patch(edit: Edit) -> EditResult:
    """Save the proposed code change to rerank_esci.py."""
    # Validate code
    for guardrail in guardrail_fns:
        error_message = guardrail(edit.text)
        if error_message is not None:
            raise ValueError(error_message)
    anchor_index = code.find(edit.anchor)
    block_index = code.find(edit.block_until, anchor_index)
    if edit.action == "insert_after":
        ...
Run checks.
Reject code that fails
Add maximum code size
strategy:
  name: codegen_guarded
  type: codegen
  params:
    train:
      model: gpt-5
      search_tools:
        - fielded_bm25
        - e5_base_v2
    edit:
      guards:
        - length:
            max_lines: 10
            max_cols: 120
Disallow edits over a certain size
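A guard is just a function over the proposed edit text that returns an error message (or None to pass); a sketch of the length guard configured above (exact repo API assumed):

def length_guard(text: str, max_lines: int = 10, max_cols: int = 120):
    # Reject edits that would balloon the reranker's size.
    lines = text.splitlines()
    if len(lines) > max_lines:
        return f"Edit too long: {len(lines)} lines (max {max_lines})"
    if any(len(line) > max_cols for line in lines):
        return f"Line too wide (max {max_cols} cols)"
    return None  # None == edit passes this guard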
Agentic Loop
Autoresearch w/ validation
Sort the foobar with quicksort
Investigate
Edit Code
Run tests
Propose update: code via apply_patch
Reject w/ error:
“Your code did not increase validation by X …”
Updated:
Final code
def rerank_wands(query, fielded_bm25, search_embeddings, **kwargs):
    b = fielded_bm25(query, fields=['title^10.5','description^3.9'], operator='or', top_k=95, k1=1.2, b=0.6)
    e = search_embeddings(query, top_k=40)
    s = {}; k = 50.0
    for i,d in enumerate(b): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.45/(k+i+1.0)
    for i,d in enumerate(e): s[str(d['id'])] = s.get(str(d['id']),0.0)+0.55/(k+i+1.0)
    ib = {str(d['id']) for d in b}; ie = {str(d['id']) for d in e}
    for i in ib & ie: s[i] = s.get(i,0.0)+0.01
    return [k for k,_ in sorted(s.items(), key=lambda x:x[1], reverse=True)][:10]
Ah better. Higher plateau ~0.59
As explained by ChatGPT
Step 1: Run fielded BM25 search (modified b, stronger title boost)
Step 2: Embedding search
Step 3: Weighted RRF (see formula below)
Step 4: Bonus for appearing in both retrievers
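For reference, weighted RRF scores each doc by its rank position in each retriever (weights 0.45 / 0.55 and k = 50 taken from the code above):

score(d) = 0.45 / (k + rank_bm25(d)) + 0.55 / (k + rank_emb(d))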
LLMs pick unsurprising solutions
🤖 LLM
Our blog posts (× many)
Problem
Known IR approaches
Exploring novel solutions
First thought - control ingredients
def rerank_wands(query,
                 fielded_bm25,
                 search_embeddings,
                 rewrite_query,
                 get_commute_distance,
                 categorize_queries,
                 ...
                 **kwargs):
    …
Give agent many prepared primitives?
(~feature engineering)
Or we could give something open-ended
def rerank_wands(query,
                 fielded_bm25,
                 search_embeddings,
                 call_llm,
                 **kwargs):
    …
Or let agent figure things out from first principles?
~guide with skills to figure it out
Really this is a curse-of-dimensionality problem: exploring with limited context
Curse of dimensionality
"Let me try title boost" - 100K tokens of context
"Ok that didn’t work, what about query rewriting" - another 100K tokens
(Using up context trying all the options)
One approach - focus on part of the problem
def rerank_wands(query,
                 search,
                 rewrite_query,
                 ...
                 **kwargs):
    …
Result from previous run - treated like a black box
New tool being used
Think in terms of ensemble / hiding
search_tools:
  - codegen:
      path: runs/codegen/codegen_guarded/20260502_025238
      name: search
      dependencies:
        - fielded_bm25
        - e5_base_v2
  - query_rewrite:
      model: gpt-5-mini
      max_alternatives: 5
Agent just sees “search”, doesn’t know the details
Add “query rewrite”
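Conceptually the config wraps the previous run like this (a sketch; `prior_rerank_wands` and the wiring are my assumptions, not the repo's API):

def search(query: str, top_k: int = 10):
    """Opaque search tool: the codegen_guarded run's reranker,
    wired to its declared dependencies behind the scenes."""
    return prior_rerank_wands(query,
                              fielded_bm25=fielded_bm25,
                              search_embeddings=e5_base_v2)[:top_k]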
Train w/ rewrite + two retrievers
~0.586
Hide the retrieval behind “search”
~0.595
What if we give raw ingredients (tf, df, etc)?
def rerank_minimarco(query, fielded_bm25, get_corpus, **kwargs):
    corpus = get_corpus()
    snowball = corpus["description_snowball"].array
    ...
Start with BM25 baseline code
Access raw term stats
Start with BM25?
Try on MSMarco dataset
(Passage ranking, just the text)
Classic BM25 baseline: ~0.189 MRR (anserini)
Question: how much do rear brakes and rotors cost?
Answer: $105 for brake pads (front and rear) and $170 for labor. I was quoted $932 for front and rear breaks and rotors. It was steep, but I needed it done and didn't want to shop it around. When I drove out, my breaks were squeeking worse than when I went in.
I’m so smart. Can I beat BM25?
Coding Agent
BM25 on passage
Better BM25?
Add human guidance - what to try
🤖 LLM
Our blog posts (× many)
HINTS - what to explore
- Find statistically significant bigrams (collocations) using the phrase term frequency stats
- Use the fielded BM25 scores as a signal in your reranker
- Try reranking on exact phrase matches in the retrieved candidates
- Use term length as another measure of specificity (i.e. longer terms are more specific to user intent)
- Try different stopword removal approaches (but don't apply it to short queries)
- Look in logs where there was ALMOST a 0.002 improvement and see what the code changes were. Try to understand why they were close but not quite there - maybe you can mitigate the downsides to get a net improvement?
System Prompt
On minimarco (a smaller MSMarco)
Whoah! It’s getting better
(gpt-5.5 xhigh)
On whole MSMarco, suspicious...
Flattens after initial stopword removal
(gpt-5.5 xhigh)
Overfit to validation
(Notice the mention of very specific terms in the stopwords list)
Important to monitor a dataset NOT in the validation or training set
Still interesting results
# The boost is intentionally small compared with the BM25 score.
bigram_boost_weight = 0.08
bigram_scores = sum(
    (
        bigram_boost_weight
        * description_index.termfreqs(scoring_terms[i : i + 2])
        for i in range(len(scoring_terms) - 1)
    ),
    np.zeros(num_documents),
)
scores += bigram_scores
stopword removal -> BM25 -> bigram boost
Not magic, but data-driven collaboration
🧑🏼💻 Human: ideas outside the norm, managing agent focus, evals + guardrails, inspiration + guidance
🤖 Agent: existing knowledge, tireless trial + error, trying the obvious. Exhaustive, rote, obvious.
Analogy: Erdős Problems
(Should we add search to our deep research?)