Modeling Scheming Ability in Multi-Agent LLM Interactions
Thao Pham
2 Hows?
How do agents scheme?
How do agents detect scheming?
Part 1: How capable are agents at scheming?
Part 2: Bayesian Theory of Mind
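As a rough illustration of what Bayesian inference over another agent's intent could look like (a minimal sketch, not the model used in this work), an observer can keep a belief that the other agent is scheming and update it with Bayes' rule after each message. The function name `update_belief`, the two-hypothesis setup (scheming vs. honest), and the likelihood values are all assumptions chosen for illustration.

```python
def update_belief(prior_scheming, p_obs_given_scheming, p_obs_given_honest):
    """One Bayes-rule step: returns P(scheming | observation)."""
    numerator = p_obs_given_scheming * prior_scheming
    evidence = numerator + p_obs_given_honest * (1.0 - prior_scheming)
    return numerator / evidence


# Hypothetical likelihoods for two observed messages:
# (P(message | scheming), P(message | honest)). Values are made up.
observations = [(0.6, 0.3), (0.7, 0.2)]

belief = 0.5  # assumed uninformative prior that the other agent is scheming
for p_scheming, p_honest in observations:
    belief = update_belief(belief, p_scheming, p_honest)

print(f"P(scheming | messages) = {belief:.3f}")
```

Each message that is more likely under the scheming hypothesis than under the honest one pushes the belief upward; a message that is equally likely under both leaves it unchanged.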
[Figure: agents A and B in conversation. Speech bubbles: "nah…", "We can try…", "You scheme!"]
=> How effectively can BToM agents infer deceptive intent compared to non-Bayesian inference?
More info? https://thaopham.dev/