1 of 36

 Towards Learning-based Approximately Optimal Control in (Constrained) Decentralized Dynamic Teams

Vijay Subramanian

ECE Division, EECS Department, University of Michigan (UM), Ann Arbor

Joint work with Nouman Khan (UM) (CDC 2023, INFOCOM 2024), Hsu Kao (JP Morgan) (AISTATS 2022),

Ujwal Dinesha, Subrahmanyam Arunachalam, Dheeraj Narasimha, and Srinivas Shakkottai (TAMU) (INFOCOM 2024)

April 8, 2024

SNAPP Seminar

2 of 36

Acknowledgments

  • Nouman Khan
  • Hsu Kao

  • Srinivas Shakkottai and group

  • Dengwang Tang
  • Aditya Mahajan

  • Demos Teneketzis

  • NSF

3 of 36

Decision-Making in Multi-Agent Systems/Networks

UAV Swarms

Connected autonomous driving

Smart grids

Drone Delivery

Warehouse Robots

  • Multi-Agent Systems:
    • A group of agents (decision-makers) who interact with each other in a shared environment
    • Present in a variety of forms and in diverse applications
  • Extremes of possible configurations:
    • Cooperative (common goal) – TEAM
    • Competitive (individual goals) – GAME
  • Unique Challenges due to a combination of:
    • Multi-agent interactions
    • Complex information structures:
      • Information asymmetry and partial observability
    • Joint constraints
  • Focus of this talk: Cooperative setting of a TEAM

4 of 36

Background: Sequential Decision Making and Information States

  • Agents need to use information that grows over time --- their (Private) History
    • Complexity and memory requirements increase
  • Information State: a compression of the history that is sufficient for decisions (Witsenhausen1976)
    • Simplifies search for (optimal or good) control policies + yields algorithms (DP)
    • Enables learning-based solutions + algorithms (RL)
  • Examples:
    • MDPs --- Current state (KumarVaraiya2015)
      • Model and Strategy Independent
    • POMDPs --- Belief state (nonlinear filter) (Astrom1969, SmallwoodSondik1973)
      • Strategy Independent but Model Dependent
      • A Model and Strategy Independent Information State exists, but it grows with time
  • How about Information States in multi-agent teams?
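As an illustration of the POMDP belief state, here is a minimal Python sketch of the nonlinear filter (the two-state model below is a hypothetical example, not from the talk; note that the update needs the model matrices, i.e., it is Model Dependent but Strategy Independent):

```python
import numpy as np

# Hypothetical 2-state POMDP (illustrative numbers only).
P = np.array([[0.9, 0.1],   # P[s, s'] = transition law (model knowledge)
              [0.2, 0.8]])
O = np.array([[0.7, 0.3],   # O[s', o] = observation law (model knowledge)
              [0.4, 0.6]])

def belief_update(b, o):
    """Nonlinear filter: predict with P, then correct by the likelihood of o."""
    predicted = b @ P                    # prior over the next state
    unnormalized = predicted * O[:, o]   # weight by observation likelihood
    return unnormalized / unnormalized.sum()

b = np.array([0.5, 0.5])                 # initial belief
for obs in [0, 1, 1]:                    # an arbitrary observation sequence
    b = belief_update(b, obs)
```

The belief has a fixed-size domain (the probability simplex), unlike the raw history, which is what makes it a useful information state.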

5 of 36

Challenges for cooperative multi-agent problems

  • Information asymmetry: Each agent works with their private information
    • Agents need to use complex reasoning to infer private information when acting
    • Signaling intertwines strategies
    • Nonclassical information structures make sequential decomposition (DP) difficult
    • Characterization of Information State allows for sequential decomposition

  • Concurrent policy adaptation => non-stationary environment for the agents

  • Teams simpler than games: No (strategic) deviations

  • High solution complexity: NEXP-complete; see also the Witsenhausen counterexample

6 of 36

Solutions for cooperative multi-agent problems

  • (Conceptual) Approaches exist to solve many multi-agent team problems
    • Person-by-person approach (Witsenhausen1979)
    • Designer’s approach (Witsenhausen1973)
    • Common-Information (CI) approach (NayyarMahajanTeneketzis2014)
  • CI Approach
    • Transforms problem to a (fictitious) coordinator’s POMDP using CI
    • Coordinator uses CI to provide prescriptions to each agent
      • Prescription: Partial function from private history to actions
      • Prescriptions are then actions of coordinator
    • Equivalence to team problem holds with deterministically chosen prescriptions
      • Each agent instantiates the coordinator and solves the problem
  • Provides Strategy Independent but Model Dependent Information State!
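To make prescriptions concrete, a minimal Python sketch (the observation and action alphabets are invented for illustration): a prescription maps each possible private observation to an action, and the coordinator, using common information alone, picks one prescription per agent.

```python
from itertools import product

# Assumed (illustrative) alphabets for one agent.
private_obs = ["good", "bad"]        # private-observation alphabet
actions = [0, 1]                     # action alphabet

# Each prescription is a dict: private observation -> action.
# The coordinator's action space is the set of all such maps.
prescriptions = [dict(zip(private_obs, choice))
                 for choice in product(actions, repeat=len(private_obs))]

# |actions|^|private_obs| prescriptions per agent: an enlarged action
# space, but the coordinator's problem has centralized (POMDP) structure.
assert len(prescriptions) == len(actions) ** len(private_obs)

# The agent applies the selected prescription to its own private observation:
gamma = prescriptions[2]             # selection from common info (dummy here)
assert gamma["good"] in actions
```

Since the selection rule depends only on common information, every agent can instantiate the same (fictitious) coordinator and recover identical prescriptions.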

7 of 36

Multi-agent Constrained Dec POMDP

8 of 36

Decision Process

 

 

 

 

[Figure: two agents in a shared environment, each receiving a local observation and choosing a local action]

 

 

9 of 36

Decision Process

 

10 of 36

Optimization Problem

Policy-Profile Space

Objective and Constraint Functions

11 of 36

Optimization Problem


Is the Lagrangian formulation useful?
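As a hedged sketch with assumed notation (policy profiles $g \in \mathcal{G}$, objective $J$, constraint functions $K_i$, multipliers $\lambda_i \ge 0$; not necessarily the talk's symbols), the constrained problem and its Lagrangian relaxation take the standard form:

```latex
\begin{aligned}
\text{(OPT)}\quad & \inf_{g \in \mathcal{G}} J(g)
    \quad \text{s.t.} \quad K_i(g) \le 0,\ i = 1,\dots,m,\\
\text{(Lagrangian)}\quad & \mathcal{L}(g,\lambda)
    = J(g) + \sum_{i=1}^{m} \lambda_i K_i(g),\\
\text{(Dual)}\quad & \sup_{\lambda \ge 0}\; \inf_{g \in \mathcal{G}} \mathcal{L}(g,\lambda).
\end{aligned}
```

The Lagrangian is useful exactly when the dual value matches the primal value, which is what the strong-duality result addresses.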

12 of 36

Optimization Problem: Comments

Optimization Problem

  • We do not have an LP on occupation-measures (Altman1999, Borkar2002)
    • The general decentralized setting considered does not yield a characterization of occupation-measures on the state and joint action, as in MDPs
  • The CI approach needs technical innovations
    • It leads to an uncountable state-space in the equivalent belief-based constrained MDP
    • Existing constrained-MDP and POMDP results require a countable state-space (Altman1999, Borkar2002)

13 of 36

Key Assumptions

 

14 of 36

Strong Duality Result

Result 1: Strong Duality
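In assumed notation (the slide's statement did not survive extraction), strong duality would typically assert that the primal and dual values coincide:

```latex
\inf_{g \in \mathcal{G}} \sup_{\lambda \ge 0} \mathcal{L}(g,\lambda)
\;=\; \sup_{\lambda \ge 0} \inf_{g \in \mathcal{G}} \mathcal{L}(g,\lambda),
```

where the left-hand side equals the value of the constrained problem, so the constrained team problem can be attacked through the unconstrained Lagrangian.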

15 of 36

Result on Existence of Saddle-Point

Result 2: Existence of Saddle-Point
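In assumed notation, a saddle point is a pair $(g^{*}, \lambda^{*})$ satisfying

```latex
\mathcal{L}(g^{*}, \lambda)
\;\le\; \mathcal{L}(g^{*}, \lambda^{*})
\;\le\; \mathcal{L}(g, \lambda^{*})
\qquad \forall\, g \in \mathcal{G},\ \lambda \ge 0,
```

so that $g^{*}$ is optimal for the constrained problem and $\lambda^{*}$ is an optimal multiplier.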

16 of 36

Main Idea 1 (A Suitable Minimax Theorem)

Proposition 1: A Suitable Minimax Theorem

17 of 36

 

 

W-S/Borkar Topology

18 of 36

Analysis Steps

 

19 of 36

Summary of General Results

 

20 of 36

Wireless Downlink Video-Streaming

21 of 36

Problem Setting

  • Controller at a Base-Transceiver-Station (BTS) sends video-packets to Edge-Devices (EDs) over a wireless channel (with errors/erasures)
  • Transmission-energy constraint at the BTS
  • EDs play video packets and send application-level feedback to the BTS
  • Application-level feedback is available at the BTS controller after some information-processing and/or feedback-channel delay
  • A team problem with the (common) goal of providing a high quality of video streaming experience to end-users

22 of 36

Problem Statement

  • Sequential decision-making models for video-streaming problem:

    • Literature has focused on MDPs (full and instantaneous information sharing)

    • A multi-agent view with information asymmetry (due to delayed feedback) is missing

    • With constraints, we will use the general duality result discussed earlier

    • Factorized two-agent (transmitter-receiver pair per stream) solutions are appealing for scalability

Goal

  • Work as a team to provide high Quality-of-Experience (QoE) to the end-users
    • (Expressed as an infinite-horizon expected discounted cost)
  • Identify structural properties of a feasible multi-agent control, then develop a learning-based solution, since the channel's transition law can be unknown in practice

23 of 36

System Model I

 

 

24 of 36

System Model II

25 of 36

Decision Problem

Policy-Profile Space

Objective and Constraint Functions

26 of 36

Decision Problem

Optimization Problem

The Lagrangian formulation is useful: it makes the immediate costs separable!

27 of 36

Key Assumptions for Scalable Multi-Agent Control

Issue: No common information; however, each ED's information is nested in the BTS's information

Assumption 1: Factorized Transition Law
  • Knowing the BTS and ED actions, each ED's state evolves independently

Assumption 2: Additively separable immediate costs
  • The Lagrangian relaxation then separates the immediate costs
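Under these two assumptions, the separation can be sketched in assumed notation (BTS policy $g^{0}$, ED-$j$ policy $g^{j}$, discount $\beta$, common multiplier $\lambda$; not necessarily the talk's symbols):

```latex
\mathcal{L}(g,\lambda) \;=\; \sum_{j=1}^{N} \mathcal{L}_j\big(g^{0}, g^{j}, \lambda\big),
\qquad
\mathcal{L}_j \;=\; \mathbb{E}\Big[ \sum_{t \ge 0} \beta^{t}\,
    \ell_j\big(S^{j}_t, A^{0}_t, A^{j}_t;\, \lambda\big) \Big],
```

where each $\ell_j$ folds in ED $j$'s QoE cost together with its share of the $\lambda$-weighted energy cost, so each term is a two-agent (BTS, ED $j$) problem coupled to the others only through $\lambda$.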

28 of 36

Result 3

 

29 of 36

Optimal Control for a Single Transmitter-Receiver Problem

  • A single transmitter-receiver problem with immediate objective cost

 

  • Information State: CI-based belief on this unknown state
    • Strategy Independent but Model Dependent

30 of 36

Learning-based Solution

  • With the model unknown, using the CI approach is challenging
    • Private information of each agent (large time-invariant domain)
    • CI grows with time
    • The belief state (Information State) needs model knowledge

  • Approximate Information State (AIS) is needed (Kao-S., AISTATS 2022)
      • A data-driven “smaller” compression than the Information State
        • Compressions of both private and common information for the team problem
        • Characterize the performance loss with AIS
      • AIS originally proposed for POMDPs (Mahajan et al. 2020)
      • Similar in spirit to function approximation for large MDPs (Mahajan et al. 2024)
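The AIS idea can be sketched in a few lines of numpy (sizes and the simple tanh update below are illustrative assumptions, not the talk's LSTM architecture): a recurrent map compresses the growing history of observations and actions into a fixed-size vector, which is then trained to predict immediate cost and the next observation.

```python
import numpy as np

rng = np.random.default_rng(0)
dz, dobs, dact = 8, 3, 2                  # assumed (illustrative) dimensions
Wz = rng.normal(scale=0.1, size=(dz, dz + dobs + dact))
Wc = rng.normal(scale=0.1, size=dz)       # cost-prediction head

def ais_update(z, obs, act):
    """Fixed-size recurrent compression: z' = tanh(W [z; o; a])."""
    x = np.concatenate([z, obs, act])
    return np.tanh(Wz @ x)

z = np.zeros(dz)                          # empty history
for _ in range(5):                        # a dummy rollout
    obs = rng.normal(size=dobs)
    act = np.eye(dact)[rng.integers(dact)]
    z = ais_update(z, obs, act)           # z stays size dz as history grows
cost_pred = float(Wc @ z)                 # trained (in practice) to match cost
```

The key property is that `z` does not grow with time, unlike the history or the CI, while the prediction losses quantify how much is given up relative to an exact information state.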

31 of 36

Numerical Results I - Parameters

  • Stylized model with a fixed Lagrangian cost
  • BTS has three actions with decreasing power
    • A single packet with L chunks is transmitted
    • Power: high (2) > med (1) > low (0)
    • Probability of error: highest at low power (low > med > high)
  • ED playout buffer: B
  • ED two actions: Playout or not
  • ED costs
    • Reward r for playout
    • Cost k times number of stalls
  • Lagrange multiplier fixed
  • Training and testing details
  • LSTMs for AIS networks: 128 parameters
  • Feedforward networks for prediction and policy
    • 64 parameters
  • 30k iterations training, 1k test episodes
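A toy simulator of this stylized model (the error probabilities and numeric values below are placeholders; only the structure, i.e., three power levels, a playout buffer, and play/no-play with stall penalties, follows the slide):

```python
import random

# Assumed placeholder values: power level -> chunk-error probability,
# buffer size B, playout reward r, stall cost k.
P_ERR = {2: 0.1, 1: 0.3, 0: 0.6}
B, r, k = 5, 1.0, 2.0

def step(buffer, bts_action, ed_plays):
    """One slot: the BTS transmits a chunk; the ED optionally plays one out."""
    if random.random() > P_ERR[bts_action]:        # chunk received
        buffer = min(buffer + 1, B)
    cost = 0.0
    if ed_plays:
        if buffer > 0:
            buffer -= 1
            cost -= r                              # playout reward
        else:
            cost += k                              # stall penalty
    return buffer, cost

random.seed(1)
buf, total = 0, 0.0
for _ in range(100):
    buf, c = step(buf, bts_action=2, ed_plays=True)
    total += c
```

In the actual experiments, the BTS and ED policies are learned over AIS representations rather than fixed as above.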

32 of 36

Numerical Results II

  • Two alternatives considered
    • CI approach with a controller each at the BTS and the ED: [MA-AIS]
    • Delay-oblivious BTS choosing both transmission and play-out actions using the fed-back (delayed) state from the ED: [SA-DS]
  • Stylized Simulation Model with Lagrangian cost

33 of 36

Extensions to dynamic games?

  • Information State hard to characterize in general
    • D. Tang et al. DGAA 2023, and D. Tang PhD thesis 2021

  • For some models, Model and Strategy Independent Information States exist
    • Equilibrium using strategies with compressed information exists
    • All equilibrium payoff profiles can be preserved
    • In general, the domain grows with time!

  • (CI-based) Belief State as candidate Information State?
    • In many cases, this Belief State has a time-invariant domain
    • In general, this Belief State is Model and Strategy Dependent
    • Equilibria may not exist in general, and sequential decomposition may not hold

34 of 36

Thank You

35 of 36

Background for Wireless Video Streaming

  • Sequential decision-making models:

    • Only MDP settings have been covered:
      • No feedback delay, full information sharing, and model knowledge
      • (FuSchaar2010) showed optimality of factorized policies for weakly-coupled constrained discounted-cost MDPs and developed a primal-dual algorithm
      • (Singh et al. 2022, SinghKumar2016) used index-based policies for finite-horizon, discounted-cost, and average-cost settings, all with hard constraints

    • With the model unknown, (Alizadeh et al. 2017, Shakkottai et al. 2019, Li et al. 2023) used RL algorithms
      • Some take a multi-agent view; some optimize the policy of one side while fixing the policy of the other

  • Linear programming (LP) on occupation-measures is core to these works (Altman1999, Borkar2002)
      • Policies map to occupation-measures on (state, action) pairs
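For reference, the standard discounted occupation-measure LP for a *centralized* constrained MDP (in assumed notation following Altman 1999: occupation measure $\rho$, discount $\beta$, initial distribution $\mu$, costs $c$ and $d_i$, budgets $b_i$) is:

```latex
\begin{aligned}
\min_{\rho \ge 0}\;\; & \sum_{s,a} \rho(s,a)\, c(s,a)
  \quad \text{s.t.} \quad \sum_{s,a} \rho(s,a)\, d_i(s,a) \le b_i,\\
& \sum_{a} \rho(s',a) \;=\; (1-\beta)\,\mu(s') \;+\; \beta \sum_{s,a} \rho(s,a)\, P(s' \mid s, a)
  \quad \forall\, s',
\end{aligned}
```

with an optimal policy recovered as $\pi(a \mid s) \propto \rho(s,a)$. It is exactly this characterization that fails in the general decentralized setting of the talk.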

36 of 36

Summary of Wireless Video Streaming

  • Took a multi-agent view of wireless video-streaming under energy constraints

  • To use the Lagrangian relaxation, showed a strong-duality result
    • A “simple” case of a more general strong-duality result that can be used broadly

  • For feasible multi-agent control, showed decomposition into multiple two-agent POMDP problems (all sharing a common Lagrange multiplier)
    • Once decomposed, the individual POMDPs are amenable to tractable MARL as the number of agents is only two

  • Designed MA-AIS based learning agents and illustrated performance improvements
    • Demonstrated use and value of AIS based simpler representation for teams