1 of 10

Agent Prediction Evals

SCALING OBSERVABILITY AND GROUNDEDNESS

of

AGENT STRATEGIES

Joshua Levy (with Kam Leung)

FLF Demo Day — Oct 15, 2025

2 of 10

Why do we care about smarter agents?

  • We care about improving human reasoning
    • Humans use and benefit from agent capabilities
  • We are increasingly using LLMs and agents
    • Analyze and reason about important problems
    • Plan better, avert problems
  • We need tools and agents* that improve
    • Rigor (valid reasoning)
    • Groundedness (valid facts)

* Is making smarter agents possibly a bad idea? My view is that it is valuable and important. Just as with human reasoning, aside from specific high-risk topics, valid facts and reasoning as a whole are essential infrastructure for collective human intelligence.

3 of 10

Where I started: When should we trust AI content?

  • We lack observability into why models say things
  • They’re often confidently wrong
  • Earlier experiments
    • AI fact checker: Do web searches, check support for assertions
    • See also: Community Notes-style agents
  • But you’re often still guessing if the checker is right
    • This is hard, slow, with sparse data

4 of 10

Why should we focus on agent predictions?

  • Agents* are being used for more and more tasks
  • Predictions are an essential way to see how good agents are
    • Humans need better predictions to plan—or avert disasters
    • Agents need predictions to act
  • Agent prediction strategies may be good, bad, or ugly
    • Opaque and unreliable
    • Prompts, workflows, conditional checks like gating, etc.
  • Predictions can be graded against reality (ground truth)

* “Agent” is being used broadly here to mean “a collection of code and models that work to make a prediction or achieve a goal.”

5 of 10

How do we scale agent groundedness?

  • Three goals for evaluating agents:
    • Observability: Inputs, tools, gates, granular rationale, intermediate states
    • Groundedness: Ways to validate against reality (not training data, RLHF, or even citations)
    • Scalable verification: Reproducible runs, backtest/ensemble evals, less-manual improvement loops (RL)
  • The key: cheap gradability
  • An interesting example is financial predictions
    • Clear claims and reasoning
    • Market data is plentiful via APIs and provides ground truth (see the sketch below)
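A minimal sketch of why financial predictions are cheaply gradable: directional forecasts can be scored against realized prices with a Brier score. The `Prediction` record and `get_close_price` helper are hypothetical placeholders (backed by any market-data API), not the actual system's interfaces.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Prediction:
    """Hypothetical encoding of one directional price prediction."""
    ticker: str
    made_on: date        # when the prediction was made
    horizon_days: int    # how far ahead the claim applies
    p_up: float          # agent's probability that the price closes higher

def brier_score(p_up: float, went_up: bool) -> float:
    """Squared error between the forecast probability and the realized outcome."""
    return (p_up - (1.0 if went_up else 0.0)) ** 2

def grade(predictions, get_close_price) -> float:
    """Mean Brier score for a batch of predictions (lower is better).

    get_close_price(ticker, day) is an assumed helper backed by a
    market-data API; it stands in for the ground-truth source.
    """
    scores = []
    for p in predictions:
        start = get_close_price(p.ticker, p.made_on)
        end = get_close_price(p.ticker, p.made_on + timedelta(days=p.horizon_days))
        scores.append(brier_score(p.p_up, end > start))
    return sum(scores) / len(scores)
```

Because the graded outcome comes from market data rather than another model's opinion, a loop like this can score large batches of historical predictions automatically.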

6 of 10

How can we improve agents’ predictions?

  • Observability
    • Low: Opaque LLM call
    • Med: Manual inspection of chain of thought
    • High: Analysis of fully expressed, granular reasoning
  • Groundedness
    • Low: Hope LLM is right
    • Med: Check against other sources (still unreliable)
    • High: Compare with real-world outcomes
  • Scalable verification
    • Low: Use in real time, manually eval one by one
    • Med: Run batches, manually eval
    • High: Reproducible runs, auto-grading, RL improvements

NEED: Cheap, grounded, automated backtesting

7 of 10

Agent Strategy Arena

  • Pretty functional MVP
    • 300K LOC in 3 weeks
  • Define prediction agents
    • Prompts + tools
  • Encode predictions
  • Experiment framework
    • Add a time machine
    • Information isolation
    • Back-testable runs (see the sketch after this list)
  • Current: Testing agent performance
  • Next: Automatic improvement of reasoning and strategies
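The Arena's internals aren't detailed on this slide, so the following is only an illustrative sketch, under assumptions, of how a back-testable run with a "time machine" and information isolation could be wired: the agent is pinned to an `as_of` cutoff, every tool call is filtered to data available before that date, and the prediction is graded against what actually happened afterward. `TimeMachineTools`, `agent.predict`, and the injected `search`/`prices`/`get_close_price` helpers are all hypothetical names.

```python
from datetime import date, timedelta

class TimeMachineTools:
    """Wraps data sources so the agent only sees information available before as_of.

    search(query) and prices(ticker, end=...) are assumed interfaces to a news
    index and a market-data API; the date filtering is what provides
    information isolation for back-testable runs.
    """
    def __init__(self, as_of: date, search, prices):
        self.as_of = as_of
        self._search = search
        self._prices = prices

    def search_news(self, query: str):
        return [doc for doc in self._search(query) if doc.published < self.as_of]

    def price_history(self, ticker: str):
        return self._prices(ticker, end=self.as_of)

def backtest(agent, tickers, as_of: date, horizon_days: int,
             search, prices, get_close_price):
    """Runs one reproducible, gradable experiment pinned to a single cutoff date."""
    tools = TimeMachineTools(as_of, search=search, prices=prices)
    results = []
    for ticker in tickers:
        # The agent (prompts + tools) returns a probability and its rationale.
        p_up, rationale = agent.predict(ticker, tools=tools, as_of=as_of)
        start = get_close_price(ticker, as_of)
        end = get_close_price(ticker, as_of + timedelta(days=horizon_days))
        results.append({
            "ticker": ticker,
            "p_up": p_up,
            "went_up": end > start,
            "rationale": rationale,  # kept for observability
        })
    return results
```

Because each run is pinned to a cutoff date and a fixed toolset, the same experiment can be replayed across many historical dates and graded automatically, which is what makes the improvement loop cheap.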

8 of 10

9 of 10

10 of 10

Would love to talk about…

  • Evaluating and improving reasoning strategies
  • Grounding and automated fact checking
  • Trading and forecasting platforms and use cases
    • Learnings from forecasting markets, investment strategies
  • Discerning skill vs luck
    • Example: Barra cofactor analysis in hedge fund investment

  • Joshua Levy
    • x.com/ojoshe
    • github.com/jlevy