1 of 4

Modeling Scheming Ability in Multi-Agent LLM Interactions

Thao Pham

Two “how” questions

How do agents scheme?

How do agents detect scheming?

2 of 4

Part 1: How capable are agents at scheming?

  1. What is “scheming”?
    1. covertly pursuing misaligned goals against their developers’ and users’ intentions (Hubinger et al., 2024; Carlsmith, 2023).
    2. (in MAS) faking cooperation, withholding information, strategic manipulation, etc.
  2. Motivation:
    • one agent (or a group of colluding agents) schemes against another agent
    • single-agent ≠ multi-agent (information asymmetries, agentic inequality, etc.)
  3. Approaches:
    • Early results in two low-complexity environments:
      1. binary cheap talk
      2. resource allocation game
    • scenarios we are interested in testing:
      • the agent knows the other agent is scheming
      • the agent knows the other agent never schemes
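The binary cheap talk environment listed above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the setup used in the early results: the names `honest_sender`, `scheming_sender`, `trusting_receiver`, `skeptical_receiver`, and `play_round` are hypothetical, the payoffs are 0/1, and the scheming sender's misaligned goal is simply "get the receiver to choose action 1".

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    state: int
    message: int
    action: int
    receiver_payoff: int
    sender_payoff: int


def honest_sender(state: int) -> int:
    # Reports the true binary state.
    return state


def scheming_sender(state: int) -> int:
    # Hypothetical misaligned goal: always push the receiver toward action 1.
    return 1


def trusting_receiver(message: int) -> int:
    # Assumes the sender never schemes, so it acts on the message directly.
    return message


def skeptical_receiver(message: int) -> int:
    # Assumes the sender is scheming, so it ignores the message and
    # falls back on a fixed prior guess (assumed here to be state 0).
    return 0


def play_round(
    state: int,
    sender: Callable[[int], int],
    receiver: Callable[[int], int],
    sender_target: Optional[int] = None,
) -> Outcome:
    """One round: the sender sees the state, talk is free, the receiver acts."""
    message = sender(state)
    action = receiver(message)
    # The receiver is paid for matching the true state.
    receiver_payoff = int(action == state)
    # An aligned sender shares the receiver's payoff; a misaligned one
    # is paid when the receiver takes the sender's preferred action.
    if sender_target is None:
        sender_payoff = receiver_payoff
    else:
        sender_payoff = int(action == sender_target)
    return Outcome(state, message, action, receiver_payoff, sender_payoff)
```

The two test scenarios map onto the receiver choice: `trusting_receiver` stands in for "knows the other agent never schemes" and `skeptical_receiver` for "knows the other agent is scheming". For example, `play_round(0, scheming_sender, trusting_receiver, sender_target=1)` lets the schemer profit at the receiver's expense, while swapping in `skeptical_receiver` removes that advantage.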

3 of 4

Part 2: Bayesian Theory of Mind

A

B

nah…

4 of 4

We can try…

  1. Probabilistic modeling (deep learning for Bayesian inference)
    1. updating beliefs from a few guesses (approximate Bayesian inference)
    2. sub-goal proposal
    3. reweighting these guesses over time
  2. Designing agents that hit the right trade-offs
  3. Abstraction
    • ignore low-level details until they become necessary
    • not doing pure Bayesian updating
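The "few guesses, reweighted over time" idea can be sketched as a tiny approximate Bayesian update over a handful of hypotheses about the other agent. This is only an illustration under stated assumptions: the two hypotheses (`"honest"`, `"schemer"`), the likelihood values, and the function names `likelihood` and `reweight` are all hypothetical, and sub-goal proposal is not covered.

```python
def likelihood(hypothesis: str, state: int, message: int) -> float:
    """Assumed probability of seeing `message` given `state` under a hypothesis."""
    if hypothesis == "honest":
        # An honest agent mostly reports the true state.
        return 0.95 if message == state else 0.05
    if hypothesis == "schemer":
        # A schemer mostly sends 1, whatever the state is.
        return 0.9 if message == 1 else 0.1
    raise ValueError(f"unknown hypothesis: {hypothesis}")


def reweight(weights: dict, state: int, message: int) -> dict:
    """One Bayes step: multiply each weight by its likelihood, then normalize."""
    unnormalized = {
        h: w * likelihood(h, state, message) for h, w in weights.items()
    }
    total = sum(unnormalized.values())
    return {h: v / total for h, v in unnormalized.items()}


# Start from a uniform belief over the small set of guesses.
belief = {"honest": 0.5, "schemer": 0.5}

# Hypothetical (state, message) pairs observed over time.
observations = [(0, 1), (1, 1), (0, 1)]
for state, message in observations:
    belief = reweight(belief, state, message)
```

Keeping only a few weighted hypotheses instead of a full posterior is what makes this approximate rather than pure Bayesian updating. In this sketch, messages that contradict the state shift weight toward `"schemer"`, while a truthful message shifts a little weight back toward `"honest"`.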

You scheme!

=> How effectively can BToM agents infer deceptive intent compared to non-Bayesian inference?

More info? https://thaopham.dev/