1 of 10

Agent Prediction Evals

SCALING OBSERVABILITY AND GROUNDEDNESS

of

AGENT STRATEGIES

Joshua Levy (with Kam Leung)

FLF Demo Day — Oct 15, 2025

2 of 10

Why do we care about smarter agents?

  • We care about improving human reasoning
    • Humans use and benefit from agent capabilities
  • We are increasingly using LLMs and agents
    • Analyze and reason about important problems
    • Plan better, avert problems
  • We need tools and agents* that improve
    • Rigor (valid reasoning)
    • Groundedness (valid facts)

* Is making smarter agents possibly a bad idea? My view is that it is valuable and important. Just as with human reasoning, aside from specific high-risk topics, valid facts and reasoning as a whole are essential infrastructure for collective human intelligence.

3 of 10

Where I started: When should we trust AI content?

  • We lack observability into why models say things
  • They’re often confidently wrong
  • Earlier experiments
    • AI fact checker: Do web searches, check support for assertions
    • See also: Community Notes-style agents
  • But you’re often still guessing if the checker is right
    • This is hard, slow, with sparse data

4 of 10

Why should we focus on agent predictions?

  • Agents* are being used for more and more tasks
  • Predictions are an essential way to see how good agents are
    • Humans need better predictions to plan—or avert disasters
    • Agents need predictions to act
  • Agent prediction strategies may be good, bad, or ugly
    • Opaque and unreliable
    • Prompts, workflows, conditional checks like gating, etc.
  • Predictions can be graded against reality (ground truth)

* “Agent” is being used broadly here to mean “a collection of code and models that work to make a prediction or achieve a goal.”

5 of 10

How do we scale agent groundedness?

  • Three goals for evaluating agents:
    • Observability: Inputs, tools, gates, granular rationale, intermediate states
    • Groundedness: Ways to validate against reality (not training data, RLHF, or even citations)
    • Scalable verification: Reproducible runs, backtest/ensemble evals, less-manual improvement loops (RL)
  • The key: cheap gradability
  • An interesting example is financial predictions
    • Clear claims and reasoning
    • Market data is plentiful via APIs and provides ground truth (see the sketch below)
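A minimal sketch of why financial predictions are cheaply gradable: directional forecasts can be scored against realized prices with a Brier score. The `Prediction` record and `get_close_price` helper are hypothetical placeholders (backed by any market-data API), not the actual system's interfaces.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Prediction:
    """Hypothetical encoding of one directional price prediction."""
    ticker: str
    made_on: date        # when the prediction was made
    horizon_days: int    # how far ahead the claim applies
    p_up: float          # agent's probability that the price closes higher

def brier_score(p_up: float, went_up: bool) -> float:
    """Squared error between the forecast probability and the realized outcome."""
    return (p_up - (1.0 if went_up else 0.0)) ** 2

def grade(predictions, get_close_price) -> float:
    """Mean Brier score for a batch of predictions (lower is better).

    get_close_price(ticker, day) is an assumed helper backed by a
    market-data API; it stands in for the ground-truth source.
    """
    scores = []
    for p in predictions:
        start = get_close_price(p.ticker, p.made_on)
        end = get_close_price(p.ticker, p.made_on + timedelta(days=p.horizon_days))
        scores.append(brier_score(p.p_up, end > start))
    return sum(scores) / len(scores)
```

Because the graded outcome comes from market data rather than another model's opinion, a loop like this can score large batches of historical predictions automatically.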

6 of 10

How can we improve agents’ predictions?

  • Observability
    • Low: Opaque LLM call
    • Med: Manual inspection of chain of thought
    • High: Analysis of fully expressed, granular reasoning
  • Groundedness
    • Low: Hope LLM is right
    • Med: Check against other sources (still unreliable)
    • High: Compare with real-world outcomes
  • Scalable verification
    • Low: Use in real time, manually eval one by one
    • Med: Run batches, manually eval
    • High: Reproducible runs, auto-grading, RL improvements

NEED: Cheap, grounded, automated backtesting

7 of 10

Agent Strategy Arena

  • Pretty functional MVP
    • 300K LOC in 3 weeks
  • Define prediction agents
    • Prompts + tools
  • Encode predictions
  • Experiment framework
    • Add a time machine
    • Information isolation
    • Back-testable runs (see the sketch after this list)
  • Current: Testing agent performance
  • Next: Automatic improvement of reasoning and strategies
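The Arena's internals aren't detailed on this slide, so the following is only an illustrative sketch, under assumptions, of how a back-testable run with a "time machine" and information isolation could be wired: the agent is pinned to an `as_of` cutoff, every tool call is filtered to data available before that date, and the prediction is graded against what actually happened afterward. `TimeMachineTools`, `agent.predict`, and the injected `search`/`prices`/`get_close_price` helpers are all hypothetical names.

```python
from datetime import date, timedelta

class TimeMachineTools:
    """Wraps data sources so the agent only sees information available before as_of.

    search(query) and prices(ticker, end=...) are assumed interfaces to a news
    index and a market-data API; the date filtering is what provides
    information isolation for back-testable runs.
    """
    def __init__(self, as_of: date, search, prices):
        self.as_of = as_of
        self._search = search
        self._prices = prices

    def search_news(self, query: str):
        return [doc for doc in self._search(query) if doc.published < self.as_of]

    def price_history(self, ticker: str):
        return self._prices(ticker, end=self.as_of)

def backtest(agent, tickers, as_of: date, horizon_days: int,
             search, prices, get_close_price):
    """Runs one reproducible, gradable experiment pinned to a single cutoff date."""
    tools = TimeMachineTools(as_of, search=search, prices=prices)
    results = []
    for ticker in tickers:
        # The agent (prompts + tools) returns a probability and its rationale.
        p_up, rationale = agent.predict(ticker, tools=tools, as_of=as_of)
        start = get_close_price(ticker, as_of)
        end = get_close_price(ticker, as_of + timedelta(days=horizon_days))
        results.append({
            "ticker": ticker,
            "p_up": p_up,
            "went_up": end > start,
            "rationale": rationale,  # kept for observability
        })
    return results
```

Because each run is pinned to a cutoff date and a fixed toolset, the same experiment can be replayed across many historical dates and graded automatically, which is what makes the improvement loop cheap.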

8 of 10

9 of 10

10 of 10

Would love to talk about…

  • Evaluating and improving reasoning strategies
  • Grounding and automated fact checking
  • Trading and forecasting platforms and use cases
    • Learnings from forecasting markets, investment strategies
  • Discerning skill vs luck
    • Example: Barra cofactor analysis in hedge fund investment

  • Joshua Levy
    • x.com/ojoshe
    • github.com/jlevy