Agent Prediction Evals
SCALING OBSERVABILITY AND GROUNDEDNESS
OF AGENT STRATEGIES
Joshua Levy (with Kam Leung)
FLF Demo Day — Oct 15, 2025
Why do we care about smarter agents?
* Is making smarter agents possibly a bad idea? My view is that it is valuable and important. Just as with human reasoning, aside from specific high-risk topics, valid facts and sound reasoning as a whole are essential infrastructure for collective human intelligence.
Where I started: When should we trust AI content?
Why should we focus on agent predictions?
* “Agent” is being used broadly here to mean “a collection of code and models that work to make a prediction or achieve a goal”
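In code, that broad definition can be as small as a wrapper around a single model call. A minimal, hypothetical sketch (the `llm()` helper and `Prediction` fields are placeholder assumptions, not a specific framework):

```python
# Minimal sketch of an "agent" in this broad sense: a bit of code wrapped
# around a model call that produces a prediction. Names and the llm()
# helper are hypothetical placeholders, not a specific framework.
from dataclasses import dataclass

@dataclass
class Prediction:
    question: str
    probability: float   # e.g. probability the event resolves "yes"
    reasoning: str       # fully expressed reasoning, kept for observability

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion model call."""
    raise NotImplementedError

def predict(question: str, evidence: list[str]) -> Prediction:
    prompt = (
        f"Question: {question}\n"
        "Evidence:\n" + "\n".join(f"- {e}" for e in evidence) +
        "\nThink step by step, then give a probability between 0 and 1 "
        "on the last line."
    )
    raw = llm(prompt)
    *reasoning_lines, last_line = raw.strip().splitlines()
    return Prediction(question, float(last_line.strip()), "\n".join(reasoning_lines))
```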
How do we scale agent groundedness?
How can we improve agents’ predictions?
| | Low | Med | High |
|---|---|---|---|
| Observability | Opaque LLM call | Manual inspection of chain of thought | Analysis of fully expressed, granular reasoning |
| Groundedness | Hope the LLM is right | Check against other sources (still unreliable) | Compare with real-world outcomes |
| Scalable verification | Use in real time, manually eval one by one | Run batches, manually eval | Reproducible runs, auto-grading, RL improvements |
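As a concrete illustration of the “High” column, a grounded, automatable grader can score stored predictions against real-world outcomes. A hypothetical sketch (the Brier score and data shapes are illustrative assumptions, reusing the `Prediction` sketch above):

```python
# Sketch of the "High" column: reproducible runs graded automatically
# against real-world outcomes. The resolved_outcomes mapping (question -> 0/1)
# and the Prediction type from the earlier sketch are assumptions.
def brier_score(predictions, resolved_outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    scored = [
        (p.probability - resolved_outcomes[p.question]) ** 2
        for p in predictions
        if p.question in resolved_outcomes   # grade only resolved questions
    ]
    return sum(scored) / len(scored) if scored else None
```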
NEED: Cheap, grounded, automated backtesting
Agent Strategy Arena
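One way such an arena could run backtests: replay historical questions with known outcomes through several agent strategies, then rank them by a grounded score. A hypothetical sketch (strategy names, data shapes, and the `brier_score` helper above are illustrative assumptions):

```python
# Hypothetical sketch of an "Agent Strategy Arena": backtest several
# strategies on historical questions whose outcomes are already known,
# then rank them by Brier score.
def backtest(strategies, historical_questions, resolved_outcomes):
    """strategies: dict of name -> predict(question, evidence) callables."""
    leaderboard = {}
    for name, predict_fn in strategies.items():
        preds = [predict_fn(q["question"], q["evidence"]) for q in historical_questions]
        leaderboard[name] = brier_score(preds, resolved_outcomes)
    # Lower Brier score is better (closer to the grounded outcome).
    return sorted(leaderboard.items(), key=lambda kv: kv[1])
```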
Would love to talk about…