1 of 107

Shared Google Slides Instructions for Students

  1. IMPORTANT: Please prepare your slides in a separate Google Slides document first, then copy your slides over to this deck. If there are issues with the shared slides, these will be your backup.
  2. Do NOT edit or change the content of other groups' slides. Version history is turned on, so we can see who made changes.
  3. Do NOT delete the separator slides (with Group # + names). Insert your content after this slide.
  4. To make sure we can fit all talks during class time, practice your talk thoroughly and confirm it does not run over 5 minutes.

If you experience any technical issues, please email c.mattson@utah.edu.

2 of 107

Group 33

Yanxi Lin, Varun Raveendra

2

3 of 107

Multi-agent Reinforcement Learning (MARL)

for Open Agent Systems

4 of 107

Background

  • In many real-world scenarios like disaster management, agents operate in dynamic, uncertain, and rapidly changing environments.
  • MARL extends traditional RL to environments where multiple agents can collaborate or compete to achieve individual or shared goals.
  • A key limitation of most MARL systems is their assumption of a fixed number of agents throughout training and deployment.
  • Open agent systems address this challenge by acknowledging that agent participation is not guaranteed to be constant.

5 of 107

Wildfire Suppression

  • Agent openness
    • Agents can be active or inactive depending on their suppressant levels
  • Task openness
    • Fire intensity increases over time
    • New fires can dynamically emerge
    • Existing fires can burn out or spread
  • Reward
    • Time
    • Attack accuracy
    • Number of fires extinguished

6 of 107

Method 1: Actor-Critic

7 of 107

Method 1: Actor-Critic Performance

8 of 107

[Figure: GNN-based environment representation; labels: "Refill", "0,32"]

9 of 107

Method 2: Vanilla Policy Gradient (REINFORCE) using GNNs

[Diagram: Agents 1–3 each produce a policy-gradient loss with reward-to-go as the return (loss fn: -logprob * reward_to_go); the three losses are summed (+) and the result updates the master agent.]
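A minimal PyTorch-style sketch of the per-agent loss in this diagram; the helper names (`reward_to_go`, `reinforce_loss`) and the discount factor are illustrative assumptions, not the group's actual code.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: G_t = r_t + gamma * G_{t+1}."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def reinforce_loss(logprobs, rewards, gamma=0.99):
    """Per-agent REINFORCE loss from the slide: -logprob * reward_to_go, summed over the episode."""
    return -(logprobs * reward_to_go(rewards, gamma)).sum()

# Per-agent losses are summed and used to update the shared (master) parameters:
# total_loss = sum(reinforce_loss(lp, r) for lp, r in zip(agent_logprobs, agent_rewards))
# total_loss.backward(); optimizer.step()
```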

10 of 107

Method 2: Performance

11 of 107

Method 3: Actor-COMA-Critic using GNNs

Counterfactual Multi-Agent Policy Gradients (COMA)

Reference: https://arxiv.org/abs/1705.08926

[Diagram: Actor: Agents 1–3 each compute a loss fn of -logprob * COMA-advantage; the losses are summed (+) to update the master agent. Critic: takes Agents 1–3 as input, outputs a q-value, and is trained with an MSE loss.]

  • COMA advantage: the chosen action's value minus a counterfactual baseline, i.e., what the agent could have expected on average if it had tried all of its possible actions.
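For reference, the counterfactual advantage from the COMA paper linked above (Foerster et al., 2017) is:

```latex
A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a\!\left(u'^a \mid \tau^a\right) Q\!\left(s, (\mathbf{u}^{-a}, u'^a)\right)
```

i.e., the centralized critic's value of the joint action minus the expected value when agent a's own action is marginalized out under its policy; the actor loss on the slide is then -log pi^a(u^a | tau^a) * A^a(s, u).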

12 of 107

Method 3: Performance

Next steps:

  • Try simpler methods
  • e.g., DQN or plain policy iteration

(clearly something is not working)

13 of 107

Group 15

Krishna Ashish Chinnari, Abhishek Rajgaria

3

14 of 107

Portfolio Manager

by

Krishna Ashish Chinnari (u1477143)

Abhishek Rajgaria (u1471428)

15 of 107

Stock Representation - Daily Movements

  • OHLCV
    • Opening price
    • Highest price
    • Lowest price
    • Closing price
    • Volume
    • Daily Return
  • Stock Indicators
    • MACD - Moving Average Convergence Divergence
    • MA10 - Moving Average of past 10 days
    • RSI - Relative Strength Index
    • Stock Sentiment (Proposed)
  • Market Indicators
    • VIX - CBOE Volatility Index
    • External News Analysis (Proposed)
  • Portfolio Movement
    • Cash Percentage (Allocation)
    • Stock Percentage (Allocation of different stocks)

Data source: yfinance (Yahoo Finance)
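A hedged sketch of how these daily features could be pulled and computed with the yfinance package named above; the ticker, date range, and indicator settings (12/26-day EMAs for MACD, 14-day RSI) are assumptions, and the proposed sentiment features are not included.

```python
import yfinance as yf

def daily_features(ticker="AAPL", start="2020-01-01", end="2024-12-31"):
    """Fetch OHLCV from yfinance and add the indicators listed above (standard settings assumed)."""
    df = yf.download(ticker, start=start, end=end)   # Open, High, Low, Close, Volume
    # depending on the yfinance version, columns may be a MultiIndex and need flattening
    df["Return"] = df["Close"].pct_change()           # daily return
    df["MA10"] = df["Close"].rolling(10).mean()       # 10-day moving average
    ema12 = df["Close"].ewm(span=12, adjust=False).mean()
    ema26 = df["Close"].ewm(span=26, adjust=False).mean()
    df["MACD"] = ema12 - ema26                        # MACD line
    delta = df["Close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["RSI"] = 100 - 100 / (1 + gain / loss)         # 14-day RSI
    return df.dropna()
```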

16 of 107

17 of 107

Single Stock vs S&P 500 (A2C)

18 of 107

Single Stock vs S&P 500 (DQN)

19 of 107

Single Stock vs S&P 500 (PPO)

20 of 107

Single Stock vs S&P 500 (DDPG)

21 of 107

Challenges Ahead

  • Multi-Stock Portfolio
    • Multi-agent → flexible portfolio
    • Single agent → fixed portfolio size
  • Addition of Stock & Market Sentiment
    • External APIs are available
  • Feedback from a fund manager (RLHF)

22 of 107

Questions & Suggestions

23 of 107

Thank You!

24 of 107

Group 26

Novella Alvina, Leo Leano, Nicole Sundberg

4

25 of 107

May The Best Chef Win

Leonardo Leano, Nicole Sundberg, Novella Alvina

26 of 107

Overview

Multi-Agent RL Models:

  • Proximal Policy Optimization
  • Deep Deterministic Policy Gradient
  • Value Decomposition

Goal

Compare performance of RL models in a cooperative task

27 of 107

Why Is This Problem Important?

Real-world applications

Robotics, traffic control, logistics

Coordination is hard

Agents must learn to share space and tasks

Overcooked is an ideal environment

Time pressure and shared goals

28 of 107

Multi-Agent Reinforcement Learning Algorithms Background

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

Policy gradient algorithm involving multiple agents and a critic structure to learn policies

MAVD

Multi-Agent Value Decomposition

Learns individual agent value functions and combines them into one joint action-value function.

MAPPO

Multi-Agent Proximal Policy Optimization

Updates policy in small steps via clipping to ensure stability

29 of 107

Multi-Agent Reinforcement Learning Algorithms Background

1️⃣

MAPPO

Multi-Agent Proximal Policy Optimization

Updates policy in small steps via clipping to ensure stability

2️⃣

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

Policy gradient algorithm involving multiple agents and a critic structure to learn policies

3️⃣

MAVD

Multi-Agent Value Decomposition

Learns individual agent value functions and combines them into one joint action-value function.

30 of 107

Multi-Agent Reinforcement Learning Algorithms We’re Investigating

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

MAPPO

Multi-Agent Proximal Policy Optimization

MAVD

Multi-Agent Value Decomposition

Develop two of the MARL algorithms, train them, and compare them against one another

31 of 107

Multi-Agent Reinforcement Learning Evaluation Methods

Visual Differences

Compare the MARL algorithm against others and self-play to observe technique differences

Score

After training the algorithm, what is the best score it can reach?

Reliability

After training, how reliable and consistent is the algorithm?

Time to ‘optimal performance’

How many training loops are required until performance improves and plateaus?

32 of 107

Thank you!

Any questions?

33 of 107

Group 19

Ghazal Abdollahinoghondar

5

34 of 107

Advanced AI Project, Spring 2025

Advisor: Prof. Daniel Brown

Student: Ghazal Abdollahi

(PhD in Computer Science)

University of Utah


35 of 107

Outline

    • Introduction
    • Deep Q-Network Solution
    • Multi-Container Queue Processing
    • Expected Results


36 of 107

Introduction

  • Optimizing Multi-Container Warming Strategy in Serverless Computing
    • The Cold Start Problem
      • Cold Start: 2-10 second delay
      • Warm Container: Immediate response
      • Challenge: Balance performance vs. resources


37 of 107

Deep Q-Network Solution

  • DQN Approach
  • States:
    • Containers, requests, time, utilization
  • Actions:
    • Increase/decrease/maintain containers
  • Reward:
    • Minimize cold starts + idle resources


38 of 107

Reward function

  • Reward = -(α × queue_time) - (β × container_count) - (γ × cold_starts)
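A direct transcription of the slide's reward into Python; the coefficient values are illustrative placeholders, not the project's tuned α, β, γ.

```python
def reward(queue_time, container_count, cold_starts,
           alpha=1.0, beta=0.1, gamma=5.0):
    """Penalize request queue time, running container count, and cold starts (weights are placeholders)."""
    return -(alpha * queue_time) - (beta * container_count) - (gamma * cold_starts)
```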


39 of 107

Multi-Container Queue Processing

  • Queue Processing
    • Single container → Sequential processing
    • Multiple containers → Parallel processing
    • Queue Time ≈ Base Time ÷ N containers


40 of 107

Expected Results

  • Performance Comparison
    • Lower cold start
    • Improved resource utilization
    • Faster adaptation to workload changes
    • Target: 75-85% container utilization


41 of 107

Expected Results


42 of 107

Thank you for your attention!


43 of 107

Group 1

Matthew Lowery

6

44 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

45 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

46 of 107

Descriptor: Model-based RL

Agent that forms an idea of how the world will react instead of reacting to the world

47 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

48 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

49 of 107

Descriptor: Offline RL

- No ability for the learning agent to continuously interact with the environment

- Only a fixed (`offline`) dataset of (s, a, r, s′) tuples

- Relevant where it’s costly or hazardous to interact with the environment

Think Healthcare

Think Finance

50 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy

51 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy.

Problem:

In some (s,a) pairs from the synthetic data, our understanding of the world is fuzzy

and we might predict rewards greater than in reality.

→ We learn a policy that works in the model, but not in reality

52 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy.

Problem: Model exploitation

In some (s,a) pairs from the synthetic data, our understanding of the world is fuzzy

and we might predict rewards greater than in reality.

→ We learn a policy that works in the model, but not in reality

→ Can synthetic data still help?

53 of 107

Counter to Model Exploitation

Only keep rollouts which the model isn't too uncertain about,

→ which are more representative of the data distribution

→ Uncertainty could be based on how 'far' a rollout is from the fixed dataset

(in an l2 sense)

→ i.e., what is the Euclidean distance between the (s,a) pairs in the rollout and the (s,a) pairs in the dataset?

Remove the rollout if that distance is too large (a sketch follows).
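One simple reading of this filtering rule as code; whether to threshold the maximum or the accumulated distance, and the threshold itself, are assumptions.

```python
import numpy as np

def min_l2_distance(sa, dataset_sa):
    """Euclidean distance from one (s, a) pair to its nearest neighbour in the fixed dataset."""
    return np.linalg.norm(dataset_sa - sa, axis=1).min()

def keep_rollout(rollout_sa, dataset_sa, threshold):
    """Keep a synthetic rollout only if every (s, a) it visits stays close to the dataset."""
    return max(min_l2_distance(sa, dataset_sa) for sa in rollout_sa) <= threshold
```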

54 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

55 of 107

NSDEs??? for Uncertainty-Aware Offline Model-based RL

56 of 107

Discrete vs Continuous

Skip Connections

57 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

58 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

Which is just a particular way to solve

59 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

Which is just a particular way to solve

Focus on this form instead? And you get a Neural ODE
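The equations this sequence of slides builds up (shown as images in the deck) are presumably the standard residual-network-to-ODE correspondence; a sketch:

```latex
h_{t+1} = h_t + f_\theta(h_t)                 % skip connection / residual block
h_{t+1} = h_t + \Delta t \, f_\theta(h_t)     % forward Euler step with \Delta t = 1
\frac{dh}{dt} = f_\theta\big(h(t), t\big)     % the ODE it discretizes: a Neural ODE
```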

60 of 107

Continuous & Probabilistic NSDEs

s is an evolving distribution, with a diffusion term that reflects our uncertainty

And a drift term, which tracks the mean (as previously)

Unimodal

61 of 107

Continuous & Probabilistic NSDEs

s is an evolving distribution, with a diffusion term that reflects our uncertainty

How to generate ‘safe’ synthetic data?

  • Train σ to predict the l2 distance between the current (s, a) pair and the dataset, thus scaling our uncertainty
  • Then remove rollouts if this uncertainty accumulates beyond a threshold.
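In standard NSDE notation (an assumption about the exact form used here), the state evolves as

```latex
ds = \underbrace{f_\theta(s, a)\, dt}_{\text{drift: mean dynamics}} + \underbrace{\sigma_\phi(s, a)\, dW_t}_{\text{diffusion: uncertainty}}
```

so training σ_φ to match the l2 distance to the dataset makes the accumulated diffusion a proxy for how far a rollout strays from the data, and rollouts whose accumulated σ exceeds a threshold are discarded.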

62 of 107

Group 8

Atharv Kulkarni

7

63 of 107

Group 16

Timothy Wang

8

64 of 107

Multi-Armed Bandit Approach for NLP Problems

CS 6955

Timothy Wang

Timothy Wang | CS 6955 | 2025 April 14


Image credits: https://tenor.com/view/love-winner-you-are-mine-gif-15271571536543540590; https://www.gettyimages.com/detail/photo/natural-language-processing-cognitive-computing-royalty-free-image/1313050195

65 of 107

Multi-Armed Bandit (MAB) Approach


  • An agent can select from different actions (arms), each with a different, unknown reward distribution
  • Goal: Given the current state, find the best action(s) → maximize long-term cumulative reward


[Diagram: the agent chooses among Action 1, Action 2, and Action 3, receiving rewards R(1), R(2), R(3).]

Could this be applied to natural language processing (NLP) problems?

Image credits: https://www.gettyimages.com/detail/illustration/cheerful-flying-robot-solid-icon-royalty-free-illustration/1281734513

66 of 107

NLP-Gym


Reinforcement learning (RL) Python tool created by Ramamurthy et al.

Link: https://github.com/rajcscw/nlp-gym/tree/main

NLP Environments:

  • Sequence Tagging
  • Question Answering
  • Multi-Label Classification


67 of 107

Question Answering (Q&A) Task


Observation Data:

  • Question
  • Two facts
  • Potential answer choice (8 total choices)

Actions:

  • CONTINUE (move to next answer choice)
  • ANSWER (select answer choice)

Reward:

  • 1 → answered correctly
  • 0 → answered incorrectly


Image credits: https://github.com/rajcscw/nlp-gym/tree/main

68 of 107

Built-In Featurizer


[Diagram: the question, the 2 facts, and the answer choice are embedded with Flair sentence embeddings; cosine similarity (range -1 to 1) between question x answer choice (Num 1) and facts x answer choice (Num 2) yields the 2-element similarity tensor [Num 1, Num 2].]
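A sketch of the two-number featurization; in the deck the embeddings come from Flair sentence embeddings, while here random vectors stand in for them, so only the cosine-similarity step is shown.

```python
import torch
import torch.nn.functional as F

def featurize(question_emb, facts_emb, answer_emb):
    """Reduce the three sentence embeddings to the 2-element observation:
    Num 1 = sim(question, answer choice), Num 2 = sim(facts, answer choice), each in [-1, 1]."""
    num1 = F.cosine_similarity(question_emb, answer_emb, dim=0)
    num2 = F.cosine_similarity(facts_emb, answer_emb, dim=0)
    return torch.stack([num1, num2])

# Placeholder usage with random vectors standing in for real Flair embeddings:
obs = featurize(torch.randn(768), torch.randn(768), torch.randn(768))
```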

69 of 107

Q-Value Array


40 x 40 = 1600 state space

2 actions (cont or ans) per state

Q-value (average reward) array:

(State sp) x (action sp) = (40 x 40) x 2


Discretization of similarity values (40 bins per value):

  Discrete State | Tensor Values
  0              | -1.00 to -0.95
  1              | -0.95 to -0.90
  ...            | ...
  39             | 0.95 to 1.00

[Diagram: 40 x 40 Q-value array indexed by Num 1 (Question x Answer Choice) and Num 2 (Facts x Answer Choice), with a Continue / Answer Q-value pair stored in each cell.]

70 of 107

MAB Training Algorithm



1. Featurize the training sample into a state-space cell, e.g. tensor (1, 0.97) → cell [39, 39].

2. Each state-space cell has an action array storing Q-values, e.g. [continue=0.12, answer=0.15].

3. Select an action randomly or pick the one with the highest Q-value.

4. The environment returns the actual reward → update the Q-value in that cell's action array.

5. Repeat for 10,000 iterations → the whole Q-value array can now be used as a policy.
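A minimal sketch of the training loop above; the binning function, ε value, and incremental-average update are assumptions consistent with the slide, not the author's exact code.

```python
import numpy as np

N_BINS, N_ACTIONS = 40, 2                 # 40 x 40 cells; actions: 0 = CONTINUE, 1 = ANSWER
q_values = np.zeros((N_BINS, N_BINS, N_ACTIONS))
counts = np.zeros_like(q_values)

def to_cell(num1, num2):
    """Map the two similarity values in [-1, 1] to bin indices, e.g. (1.0, 0.97) -> (39, 39)."""
    bin_of = lambda x: min(int((x + 1.0) / 2.0 * N_BINS), N_BINS - 1)
    return bin_of(num1), bin_of(num2)

def select_action(cell, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the best Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_values[cell]))

def update(cell, action, reward):
    """Incremental average, so q_values stores the mean reward seen for each (state, action)."""
    counts[cell][action] += 1
    q_values[cell][action] += (reward - q_values[cell][action]) / counts[cell][action]
```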

71 of 107

Very Preliminary Results


Percent Test Samples Correct

Future Work:

  • Seq. Tagging and Multi-Label Class.
  • Refine MAB policy (UCB-1)
  • Collect further test results


  Random Policy:          11.56%
  Deep Q-Learning Policy: 15.66%
  MAB Policy:             13.07%

72 of 107

Summary


  • Exploring how well a multi-armed bandit approach works on NLP problems
  • Using NLP-Gym → Reinforcement Learning + NLP
  • Q and A problem → Select correct answer based on given question/facts
  • Featurize observations into two numbers → Convert to indices for a state array
  • Store averages for each state and each action (CONTINUE or ANSWER) → Trained policy

Thank You


73 of 107

Resources


  • NLP-Gym: https://github.com/rajcscw/nlp-gym/tree/main
    Ramamurthy, Rajkumar, Rafet Sifa, and Christian Bauckhage. "NLPGym -- A toolkit for evaluating RL agents on Natural Language Processing Tasks." arXiv:2011.08272, 2020 (cs.CL).
  • Hugging Face / AllenAI QASC dataset: https://huggingface.co/datasets/allenai/qasc
    Khot, Tushar, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. "QASC: A Dataset for Question Answering via Sentence Composition." arXiv:1910.11473v2, 2020.


74 of 107

Group 20

Fabiha Bushra, Simon Gonzalez

9

75 of 107

Comparative Analysis of Multi-Agent Reinforcement Learning Algorithms

Fabiha Bushra, Simon Gonzalez

76 of 107

Why MARL?

Real-world problems often involve multiple agents

How do MARL algorithms compare across different interaction types?

Self-Driving Car Coordination (Collaborative Interaction)

A Game of Checkers

(Competitive Interaction)

Soccer

(Mixed Interaction)

77 of 107

Why MARL?

Every new MARL paper be like:

78 of 107

Objectives

Which MARL algorithm performs better under

different cooperation-competition dynamics?

Which algorithm is more stable and robust across diverse scenarios?

Which algorithm shows superior sample efficiency and achieves higher performance with fewer interactions?

79 of 107

Quick Recap: Main Setting for Coop MARL

80 of 107

Environments & Task Types

Collaborative (VMAS – Balance): Agents work together to keep a pole balanced

Competitive (PettingZoo – Simple Tag): Predators try to tag evaders

Mixed (VMAS – Football): Partial cooperation within teams, but competition between them

81 of 107

MARL Algorithms: CTDE

PROBLEM: How do we make centralized training tractable?

Use privileged centralized information at training time.

Each policy can be independently executed in the deployment environment.

CENTRALIZED TRAINING

DECENTRALIZED EXECUTION

82 of 107

MARL Algorithms

MAPPO [1] (Multi-Agent Proximal Policy Optimization): On-policy actor-critic, centralized value function.

MASAC [2] (Multi-Agent Soft Actor-Critic): Off-policy actor-critic with entropy regularization.

QMIX [3] (Q-Mixing Network): Value decomposition with monotonic mixing of agent Q-values.
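To make the QMIX bullet concrete, here is a compact PyTorch sketch of monotonic mixing: per-agent Q-values are combined using non-negative weights produced by hypernetworks conditioned on the global state, so dQ_tot/dQ_i >= 0. Layer sizes and names are illustrative, not the exact architecture used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: combine per-agent Q-values into Q_tot with non-negative,
    state-conditioned weights (monotonic mixing)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)   # hypernetwork weights, layer 1
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)              # hypernetwork weights, layer 2
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)    # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(2)            # (batch, embed_dim, 1)
        q_tot = torch.bmm(hidden, w2).squeeze(2) + self.hyper_b2(state)
        return q_tot.squeeze(-1)                                     # (batch,)
```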

83 of 107

Metrics

Aggregate Statistics

Performance Profile

Sample Efficiency for All Tasks

84 of 107

Preliminary Results

85 of 107

Preliminary Results

Mixed, Football

Competitive, Simple Tag

86 of 107

Thank You!

CREDITS: This presentation template was created by Slidesgo, and includes icons, infographics & images by Freepik

87 of 107

Group 37

Aidan Wilde

10

88 of 107

Learning to Play Card Games with RL

Aidan Wilde

89 of 107

Motivation and Game

  • Variation of the game Gin Rummy
  • The goal of the game is to form sets and runs so that you can discard all of the cards in your hand

90 of 107

Game Construction

  • State
    • Game involves 2 full decks, shuffled together
    • Agent has imperfect knowledge of the state (observation state)
  • Actions
    • Pick up card from discard or draw pile - (0, 1)
    • Choose which card to discard - (0, 108)
    • Choose when to ‘buy’ cards - (0, 3)
    • Total action space 114
  • Rewards
    • Shaped to model milestones of the game (rewards increase as the player gets closer to getting rid of all cards)
    • Penalty enforced for each timestep

91 of 107

The Environment

  • A unique version of the game; no existing online implementation
  • Implemented the game logic in Python
  • Wrapped the game logic in a Gymnasium environment (see the sketch below)
  • Standardizing the environment makes it possible to reuse existing training tools
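A minimal sketch of what the Gymnasium wrapper could look like; the game-logic interface (`game.new_round`, `game.apply`, `game.obs_size`) is hypothetical, standing in for the hand-written Python game logic.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class RummyEnv(gym.Env):
    """Minimal Gymnasium wrapper around the (hypothetical) game-logic object described above."""
    def __init__(self, game):
        super().__init__()
        self.game = game                                # hand-written Python game logic
        self.action_space = spaces.Discrete(114)        # draw (2) + discard (108) + buy (4)
        self.observation_space = spaces.Box(-1.0, np.inf, shape=(game.obs_size,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.game.new_round()
        return np.asarray(obs, dtype=np.float32), {}

    def step(self, action):
        obs, reward, done = self.game.apply(action)     # shaped reward + per-step penalty inside
        return np.asarray(obs, dtype=np.float32), reward, done, False, {}
```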

92 of 107

Training

  • A Deep Q-Network (DQN) is a good choice due to the large state space, discrete action space, and availability of simulation
  • Difficulties
    • Large State Space (Billions of unique game states)
    • Multiple unique action processes

93 of 107

Feature Engineering

Cards are represented as a one-hot encoded vector:

We can reduce this vector into more meaningful data.

Set Vector:

[Diagram: one-hot encoded hand vector of length 108 (two decks), with slots ordered 2H, 2D, 2S, 2C, 3H, 3D, 3S, ..., Ace S, Ace C, reduced to a set vector of length 13 holding one count per rank (2's, 3's, ..., King's, Ace's).]
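A sketch of this reduction; the exact slot ordering of the 108-length vector is an assumption based on the diagram (slots grouped by rank, 8 copies of each rank across the two decks, any remaining slots ignored).

```python
import numpy as np

def to_set_vector(hand_vector):
    """Collapse the length-108 hand encoding into a length-13 count per rank (2s ... Aces)."""
    by_rank = np.asarray(hand_vector)[: 13 * 8].reshape(13, 8)   # assumed rank-major ordering
    return by_rank.sum(axis=1)
```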

94 of 107

Action Masking

  • Agent has 3 different actions it can take
    • Draw | Discard | Buy

[Diagram: the vector of Q-values over the action space, shown before and after applying the action mask, leaving only the Q-values of the currently legal actions.]
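The usual way to implement the masking shown above, sketched in PyTorch: illegal actions have their Q-values set to -inf before the argmax, so only currently legal draw/discard/buy actions can be selected. Names are illustrative.

```python
import torch

def masked_greedy_action(q_values, legal_mask):
    """Pick the best legal action: q_values has one entry per action (114 here),
    legal_mask is a boolean tensor marking the actions allowed in the current state."""
    masked = q_values.masked_fill(~legal_mask, float("-inf"))
    return int(torch.argmax(masked))
```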

95 of 107

DQN ε-Greedy Agent vs. Randomized Agent

🔴 Exploit Average Reward

⚫ Exploit Average Win %

96 of 107

Future Directions

  • Considering removing action mask and moving to MARL
  • Self Play implementation and progression

97 of 107

Group 31

Matt Myers

11

98 of 107

Group 2

Abbas Mohammadi

12

99 of 107

Reinforcement Learning for Adaptive Pavement Marking Maintenance

Abbas Mohammadi

Advanced AI – Spring 2025

100 of 107

Motivation & Problem Statement

  • Pavement markings degrade over time and affect nighttime visibility.

  • Traditional methods use fixed schedules (e.g., repaint every 6 or 12 months), which are not responsive to actual pavement conditions.

  • These methods can cause:
    • Wasted resources due to premature maintenance
    • Safety risks due to delayed interventions

  • Reinforcement learning (RL) offers a way to make adaptive, data-driven decisions over time.

101 of 107

Project Goal

  • Goal:

Train an RL agent to make optimal pavement maintenance decisions using simulated traffic and retroreflectivity data.

  • Key objectives:
    • Maintain retroreflectivity above the 150 mcd/m²/lx safety threshold
    • Minimize cumulative maintenance cost
    • Compare RL agent to rule-based baselines

102 of 107

Simulation Environment & Setup

Environment:

  • Custom PavementEnv with:
    • Retroreflectivity
    • Traffic
    • Months since last maintenance
    • Snow exposure (Snowy months since last maintenance)

  • Discrete actions:

0 = No Maintenance, 1 = Perform Maintenance

  • Degradation formula:

decay = 10 + 0.0001 × traffic + 2 × months_since + 3 × snowy_months
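A sketch of how this degradation formula could drive the PavementEnv transition; the repaint value (800 mcd/m²/lx) and the handling of the month counters are assumptions, not the project's exact environment.

```python
def monthly_decay(traffic, months_since, snowy_months):
    """Retroreflectivity lost this month, per the slide's formula."""
    return 10 + 0.0001 * traffic + 2 * months_since + 3 * snowy_months

def transition(retro, traffic, months_since, snowy_months, maintain):
    """One step: repaint restores a high value (placeholder), otherwise the marking degrades."""
    if maintain:                       # action 1 = Perform Maintenance
        return 800.0, 0, 0
    retro -= monthly_decay(traffic, months_since, snowy_months)
    return retro, months_since + 1, snowy_months   # snow exposure updated by the (unmodeled) weather
```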

103 of 107

Reward Function

Reward Design:

  • Penalty for violating the threshold
  • Penalty for unnecessary maintenance
  • Bonus for safe cost-efficient decisions

104 of 107

Learning Curve

  • Shows reward increasing over 14,000 episodes, stabilizing near –400.

  • DQN agent learns to maximize long-term rewards while managing cost and compliance

105 of 107

Comparison with Baselines

  • Shows DQN better avoids falling below the threshold compared to the 12-month strategy.
  • More cost-efficient than a 6-month baseline.
  • DQN acts only when needed, while baselines act rigidly.
  • DQN achieves better trade-off: less frequent than 6 months, safer than 12 months.

106 of 107

Key Results

  • DQN reduced cumulative cost by up to ~25% vs 6-month baseline

  • Better safety compliance than a 12-month baseline

  • Learned policy is state-aware, not fixed

  • Clear value in moving from reactive to proactive planning

107 of 107

Thanks! Questions?