Modeling Scheming Ability in Multi-Agent LLM Interactions

Thao Pham

Two “how” questions:

How do agents scheme?

How do agents detect scheming?

Part 1: How capable are agents at scheming?

What is “scheming”?

AI systems covertly pursuing misaligned goals against the intentions of their developers and users (Hubinger et al., 2024; Carlsmith, 2023).
In multi-agent systems (MAS): faking cooperation, withholding information, strategic manipulation, etc.

Motivation:

one agent (or a group of colluding agents) schemes against another agent
single-agent ≠ multi-agent (information asymmetries, agentic inequality, etc.)

Approaches:

Early results from two low-complexity environments:

binary cheap talk
resource allocation game

Scenarios of interest:

knowing that the other agent is scheming
knowing that the other agent never schemes
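
To make the binary cheap talk setting concrete, here is a minimal sketch under stated assumptions. It is not the environment used for the early results. The slides do not specify the game, so this assumes a standard sender-receiver setup: a hidden binary state, a sender who reports it and may lie, and a receiver who guesses the state. The names `sender_message`, `receiver_posterior`, `receiver_action`, and `play`, the `lie_prob` parameter, and the uniform prior are all hypothetical.

```python
import random


def sender_message(state, lie_prob, rng):
    """Sender reports the binary state, flipping it with probability lie_prob."""
    return 1 - state if rng.random() < lie_prob else state


def receiver_posterior(message, assumed_lie_prob, prior_one=0.5):
    """P(state = 1 | message) under the receiver's assumed lie probability."""
    # Likelihood of seeing this message if the state is 1, and if it is 0.
    like_one = (1 - assumed_lie_prob) if message == 1 else assumed_lie_prob
    like_zero = assumed_lie_prob if message == 1 else (1 - assumed_lie_prob)
    evidence = like_one * prior_one + like_zero * (1 - prior_one)
    return like_one * prior_one / evidence


def receiver_action(message, assumed_lie_prob):
    """Receiver guesses the state that is more probable given the message."""
    return 1 if receiver_posterior(message, assumed_lie_prob) >= 0.5 else 0


def play(rounds, true_lie_prob, assumed_lie_prob, seed=0):
    """Return the fraction of rounds in which the receiver guesses the state."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(rounds):
        state = rng.randint(0, 1)
        message = sender_message(state, true_lie_prob, rng)
        correct += receiver_action(message, assumed_lie_prob) == state
    return correct / rounds


# Assumed mapping to the two scenarios, against a sender that always lies:
informed = play(200, true_lie_prob=1.0, assumed_lie_prob=1.0)  # knows it schemes
trusting = play(200, true_lie_prob=1.0, assumed_lie_prob=0.0)  # assumes it never schemes
```

In this toy version, the receiver's accuracy depends entirely on whether its assumption about the sender matches the sender's actual behavior, which is the contrast the two scenarios are meant to probe.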

Part 2: Bayesian Theory of Mind

A

B

nah…

We can try…

Probabilistic modeling (deep learning for Bayesian inference)

belief updating from a few guesses (approximate Bayesian inference)
sub-goal proposal
reweighting these guesses over time
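
The "few guesses, reweighted over time" idea can be sketched as a Bayesian update over a small set of hypotheses. This is an illustration under assumptions, not the proposed method. Each guess is a hypothetical lie probability for the other agent, each observation is whether the agent was caught lying in a round, and the names `normalize`, `update_beliefs`, and `infer_lie_probability` are made up for this example. Sub-goal proposal is not covered.

```python
def normalize(weights):
    """Rescale weights so they sum to 1."""
    total = sum(weights)
    if total == 0:
        raise ValueError("all hypotheses have zero weight")
    return [w / total for w in weights]


def update_beliefs(hypotheses, weights, lied):
    """One reweighting step: multiply each weight by the observation likelihood."""
    # Under hypothesis h (a lie probability), a lie has likelihood h
    # and a truthful message has likelihood 1 - h.
    reweighted = [w * (h if lied else 1 - h) for h, w in zip(hypotheses, weights)]
    return normalize(reweighted)


def infer_lie_probability(hypotheses, observations):
    """Start from uniform weights and reweight the guesses after each observation."""
    weights = normalize([1.0] * len(hypotheses))
    for lied in observations:
        weights = update_beliefs(hypotheses, weights, lied)
    return weights


# Three guesses about the other agent: mostly honest, coin flip, mostly scheming.
guesses = [0.1, 0.5, 0.9]
# Hypothetical observations: True means the agent was caught lying that round.
observed = [True, True, True, False, True]
beliefs = infer_lie_probability(guesses, observed)
most_likely = guesses[beliefs.index(max(beliefs))]
```

Keeping only a handful of weighted guesses makes each update cheap, at the cost of being exact only with respect to that small hypothesis set rather than the full space of possible behaviors.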

Designing agents that hit the right trade-offs
Abstraction

ignore low-level details until they become necessary
not doing pure Bayesian updating

You scheme!

=> How effectively can BToM agents infer deceptive intent compared to non-Bayesian inference?

More info? https://thaopham.dev/