1 of 107

Shared Google Slides Instructions for Students

  1. IMPORTANT: Please prepare your slides in a separate Google Slides document first, then copy your slides over to this deck. If there are issues with the shared slides, these will be your backup.
  2. Do NOT edit or change the content of other groups' slides. Version history is turned on, so we can see who made changes.
  3. Do NOT delete the separator slides (with Group # + names). Insert your content after this slide.
  4. To make sure we can fit all talks during class time, practice your talk thoroughly and confirm it does not run over 5 minutes.

If you experience any technical issues, please email c.mattson@utah.edu.

2 of 107

Group 33

Yanxi Lin, Varun Raveendra

2

3 of 107

Multi-agent Reinforcement Learning (MARL)

for Open Agent Systems

4 of 107

Background

  • In many real-world scenarios like disaster management, agents operate in dynamic, uncertain, and rapidly changing environments.
  • MARL extends traditional RL to environments where multiple agents can collaborate or compete to achieve individual or shared goals.
  • A key limitation of most MARL systems is their assumption of a fixed number of agents throughout training and deployment.
  • Open agent systems address this challenge by acknowledging that agent participation is not guaranteed to be constant.

5 of 107

Wildfire Suppression

  • Agent openness
    • Agents can be active or inactive depending on their suppressant levels
  • Task openness
    • Fire intensity increases over time
    • New fires can dynamically emerge
    • Existing fires can burn out or spread
  • Reward
    • Time
    • Attack accuracy
    • Number of fires extinguished

6 of 107

Method 1: Actor-Critic

7 of 107

Method 1: Actor-Critic Performance

8 of 107

[Figure: GNN-based environment representation; labels: "Refill", "0,32"]

9 of 107

Method 2: Vanilla Policy Gradient (REINFORCE) using GNNs

[Diagram: Agents 1–3 each produce a policy-gradient loss with reward-to-go as the return (loss fn: -logprob * reward_to_go); the three losses are summed (+) and the result updates the master agent.]
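A minimal PyTorch-style sketch of the per-agent loss in this diagram; the helper names (`reward_to_go`, `reinforce_loss`) and the discount factor are illustrative assumptions, not the group's actual code.

```python
import torch

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: G_t = r_t + gamma * G_{t+1}."""
    rtg = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def reinforce_loss(logprobs, rewards, gamma=0.99):
    """Per-agent REINFORCE loss from the slide: -logprob * reward_to_go, summed over the episode."""
    return -(logprobs * reward_to_go(rewards, gamma)).sum()

# Per-agent losses are summed and used to update the shared (master) parameters:
# total_loss = sum(reinforce_loss(lp, r) for lp, r in zip(agent_logprobs, agent_rewards))
# total_loss.backward(); optimizer.step()
```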

10 of 107

Method 2: Performance

11 of 107

Method 3: Actor-COMA-Critic using GNNs

Counterfactual Multi-Agent Policy Gradients (COMA)

Reference: https://arxiv.org/abs/1705.08926

[Diagram: Actor: Agents 1–3 each compute a loss fn of -logprob * COMA-advantage; the losses are summed (+) to update the master agent. Critic: takes Agents 1–3 as input, outputs a q-value, and is trained with an MSE loss.]

  • COMA advantage: the chosen action's value minus a counterfactual baseline, i.e., what the agent could have expected on average if it had tried all of its possible actions.
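For reference, the counterfactual advantage from the COMA paper linked above (Foerster et al., 2017) is:

```latex
A^a(s, \mathbf{u}) = Q(s, \mathbf{u}) - \sum_{u'^a} \pi^a\!\left(u'^a \mid \tau^a\right) Q\!\left(s, (\mathbf{u}^{-a}, u'^a)\right)
```

i.e., the centralized critic's value of the joint action minus the expected value when agent a's own action is marginalized out under its policy; the actor loss on the slide is then -log pi^a(u^a | tau^a) * A^a(s, u).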

12 of 107

Method 3: Performance

Next steps:

  • Try simpler methods
  • e.g., DQN or plain policy iteration

(clearly something is not working)

13 of 107

Group 15

Krishna Ashish Chinnari, Abhishek Rajgaria

3

14 of 107

Portfolio Manager

by

Krishna Ashish Chinnari (u1477143)

Abhishek Rajgaria (u1471428)

15 of 107

Stock Representation - Daily Movements

  • OHLCV
    • Opening price
    • Highest price
    • Lowest price
    • Closing price
    • Volume
    • Daily Return
  • Stock Indicators
    • MACD - Moving Average Convergence Divergence
    • MA10 - Moving Average of past 10 days
    • RSI - Relative Strength Index
    • Stock Sentiment (Proposed)
  • Market Indicators
    • VIX - CBOE Volatility Index
    • External News Analysis (Proposed)
  • Portfolio Movement
    • Cash Percentage (Allocation)
    • Stock Percentage (Allocation of different stocks)

Data source: yfinance (Yahoo Finance)
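A hedged sketch of how these daily features could be pulled and computed with the yfinance package named above; the ticker, date range, and indicator settings (12/26-day EMAs for MACD, 14-day RSI) are assumptions, and the proposed sentiment features are not included.

```python
import yfinance as yf

def daily_features(ticker="AAPL", start="2020-01-01", end="2024-12-31"):
    """Fetch OHLCV from yfinance and add the indicators listed above (standard settings assumed)."""
    df = yf.download(ticker, start=start, end=end)   # Open, High, Low, Close, Volume
    # depending on the yfinance version, columns may be a MultiIndex and need flattening
    df["Return"] = df["Close"].pct_change()           # daily return
    df["MA10"] = df["Close"].rolling(10).mean()       # 10-day moving average
    ema12 = df["Close"].ewm(span=12, adjust=False).mean()
    ema26 = df["Close"].ewm(span=26, adjust=False).mean()
    df["MACD"] = ema12 - ema26                        # MACD line
    delta = df["Close"].diff()
    gain = delta.clip(lower=0).rolling(14).mean()
    loss = (-delta.clip(upper=0)).rolling(14).mean()
    df["RSI"] = 100 - 100 / (1 + gain / loss)         # 14-day RSI
    return df.dropna()
```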

16 of 107

17 of 107

Single Stock vs S&P 500 (A2C)

18 of 107

Single Stock vs S&P 500 (DQN)

19 of 107

Single Stock vs S&P 500 (PPO)

20 of 107

Single Stock vs S&P 500 (DDPG)

21 of 107

Challenges Ahead

  • Multi-Stock Portfolio
    • Multi-agent → flexible portfolio
    • Single agent → fixed portfolio size
  • Addition of Stock & Market Sentiment
    • External APIs are available
  • Feedback from a fund manager (RLHF)

22 of 107

Questions & Suggestions

23 of 107

Thank You!

24 of 107

Group 26

Novella Alvina, Leo Leano, Nicole Sundberg

4

25 of 107

May The Best Chef Win

Leonardo Leano, Nicole Sundberg, Novella Alvina

26 of 107

Overview

Multi-Agent RL Models:

  • Proximal Policy Optimization
  • Deep Deterministic Policy Gradient
  • Value Decomposition

Goal

Compare performance of RL models in a cooperative task

27 of 107

Why Is This Problem Important?

Real-world applications

Robotics, traffic control, logistics

Coordination is hard

Agents must learn to share space and tasks

Overcooked is an ideal environment

Time pressure and shared goals

28 of 107

Multi-Agent Reinforcement Learning Algorithms Background

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

Policy gradient algorithm involving multiple agents and a critic structure to learn policies

MAVD

Multi-Agent Value Decomposition

Learns individual agent value functions and combines them into one joint action-value function.

MAPPO

Multi-Agent Proximal Policy Optimization

Updates policy in small steps via clipping to ensure stability

29 of 107

Multi-Agent Reinforcement Learning Algorithms Background

1️⃣

MAPPO

Multi-Agent Proximal Policy Optimization

Updates policy in small steps via clipping to ensure stability

2️⃣

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

Policy gradient algorithm involving multiple agents and a critic structure to learn policies

3️⃣

MAVD

Multi-Agent Value Decomposition

Learns individual agent value functions and combines them into one joint action-value function.

30 of 107

Multi-Agent Reinforcement Learning Algorithms We’re Investigating

MADDPG

Multi-Agent Deep Deterministic Policy Gradient

MAPPO

Multi-Agent Proximal Policy Optimization

MAVD

Multi-Agent Value Decomposition

Develop two of the MARL algorithms, train them, and compare them against one another

31 of 107

Multi-Agent Reinforcement Learning Evaluation Methods

Visual Differences

Compare the MARL algorithm against others and self-play to observe technique differences

Score

After training the algorithm, what is the best score it can reach?

Reliability

After training, how reliable and consistent is the algorithm?

Time to ‘optimal performance’

How many training loops are required until performance improves and plateaus?

32 of 107

Thank you!

Any questions?

33 of 107

Group 19

Ghazal Abdollahinoghondar

5

34 of 107

Advanced AI Project, Spring 2025

Advisor: Prof. Daniel Brown

Student: Ghazal Abdollahi

(PhD in Computer Science)

University of Utah


35 of 107

Outline

    • Introduction
    • Deep Q-Network Solution
    • Multi-Container Queue Processing
    • Expected Results


36 of 107

Introduction

  • Optimizing Multi-Container Warming Strategy in Serverless Computing
    • The Cold Start Problem
      • Cold Start: 2-10 second delay
      • Warm Container: Immediate response
      • Challenge: Balance performance vs. resources


37 of 107

Deep Q-Network Solution

  • DQN Approach
  • States:
    • Containers, requests, time, utilization
  • Actions:
    • Increase/decrease/maintain containers
  • Reward:
    • Minimize cold starts + idle resources


38 of 107

Reward function

  • Reward = -(α × queue_time) - (β × container_count) - (γ × cold_starts)
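A direct transcription of the slide's reward into Python; the coefficient values are illustrative placeholders, not the project's tuned α, β, γ.

```python
def reward(queue_time, container_count, cold_starts,
           alpha=1.0, beta=0.1, gamma=5.0):
    """Penalize request queue time, running container count, and cold starts (weights are placeholders)."""
    return -(alpha * queue_time) - (beta * container_count) - (gamma * cold_starts)
```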


39 of 107

Multi-Container Queue Processing

  • Queue Processing
    • Single container → Sequential processing
    • Multiple containers → Parallel processing
    • Queue Time ≈ Base Time ÷ N containers


40 of 107

Expected Results

  • Performance Comparison
    • Lower cold start
    • Improved resource utilization
    • Faster adaptation to workload changes
    • Target: 75-85% container utilization


41 of 107

Expected Results


42 of 107

Thank you for your attention!


43 of 107

Group 1

Matthew Lowery

6

44 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

45 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

46 of 107

Descriptor: Model-based RL

Agent that forms an idea of how the world will react instead of reacting to the world

47 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

48 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

49 of 107

Descriptor: Offline RL

- No ability for the learning agent to continuously interact with the environment

- Only a fixed (`offline`) dataset of (s, a, r, s′) tuples

- Relevant where it’s costly or hazardous to interact with the environment

Think Healthcare

Think Finance

50 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy

51 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy.

Problem:

In some (s,a) pairs from the synthetic data, our understanding of the world is fuzzy

and we might predict rewards greater than in reality.

→ We learn a policy that works in the model, but not in reality

52 of 107

Offline + Model-based RL

Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy.

Problem: Model exploitation

In some (s,a) pairs from the synthetic data, our understanding of the world is fuzzy

and we might predict rewards greater than in reality.

→ We learn a policy that works in the model, but not in reality

→ Can synthetic data still help?

53 of 107

Counter to Model Exploitation

Only keep rollouts which the model isn't too uncertain about,

→ which are more representative of the data distribution

→ Uncertainty could be based on how 'far' a rollout is from the fixed dataset

(in an l2 sense)

→ i.e., what is the Euclidean distance between the (s,a) pairs in the rollout and the (s,a) pairs in the dataset?

Remove the rollout if that distance is too large (a sketch follows).
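One simple reading of this filtering rule as code; whether to threshold the maximum or the accumulated distance, and the threshold itself, are assumptions.

```python
import numpy as np

def min_l2_distance(sa, dataset_sa):
    """Euclidean distance from one (s, a) pair to its nearest neighbour in the fixed dataset."""
    return np.linalg.norm(dataset_sa - sa, axis=1).min()

def keep_rollout(rollout_sa, dataset_sa, threshold):
    """Keep a synthetic rollout only if every (s, a) it visits stays close to the dataset."""
    return max(min_l2_distance(sa, dataset_sa) for sa in rollout_sa) <= threshold
```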

54 of 107

NSDEs for Uncertainty-Aware Offline Model-based RL

Matthew Lowery

55 of 107

NSDEs??? for Uncertainty-Aware Offline Model-based RL

56 of 107

Discrete vs Continuous

Skip Connections

57 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

58 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

Which is just a particular way to solve

59 of 107

Discrete vs Continuous

Skip Connections

Kinda looks like:

Forward Euler Step

Which is just a particular way to solve

Focus on this form instead? And you get a Neural ODE
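The equations this sequence of slides builds up (shown as images in the deck) are presumably the standard residual-network-to-ODE correspondence; a sketch:

```latex
h_{t+1} = h_t + f_\theta(h_t)                 % skip connection / residual block
h_{t+1} = h_t + \Delta t \, f_\theta(h_t)     % forward Euler step with \Delta t = 1
\frac{dh}{dt} = f_\theta\big(h(t), t\big)     % the ODE it discretizes: a Neural ODE
```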

60 of 107

Continuous & Probabilistic NSDEs

s is an evolving distribution, with a diffusion term that reflects our uncertainty

And a drift term, which tracks the mean (as previously)

Unimodal

61 of 107

Continuous & Probabilistic NSDEs

s is an evolving distribution, with a diffusion term that reflects our uncertainty

How to generate ‘safe’ synthetic data?

  • Train σ to predict the l2 distance between the current (s, a) pair and the dataset, thus scaling our uncertainty
  • Then remove rollouts if this uncertainty accumulates beyond a threshold.
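In standard NSDE notation (an assumption about the exact form used here), the state evolves as

```latex
ds = \underbrace{f_\theta(s, a)\, dt}_{\text{drift: mean dynamics}} + \underbrace{\sigma_\phi(s, a)\, dW_t}_{\text{diffusion: uncertainty}}
```

so training σ_φ to match the l2 distance to the dataset makes the accumulated diffusion a proxy for how far a rollout strays from the data, and rollouts whose accumulated σ exceeds a threshold are discarded.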

62 of 107

Group 8

Atharv Kulkarni

7

63 of 107

Group 16

Timothy Wang

8

64 of 107

Multi-Armed Bandit Approach for NLP Problems

CS 6955

Timothy Wang

Timothy Wang | CS 6955 | 2025 April 14


Image credits: https://tenor.com/view/love-winner-you-are-mine-gif-15271571536543540590; https://www.gettyimages.com/detail/photo/natural-language-processing-cognitive-computing-royalty-free-image/1313050195

65 of 107

Multi-Armed Bandit (MAB) Approach


  • An agent can select from different actions (arms), each with a different, unknown reward distribution
  • Goal: Given the current state, find the best action(s) → maximize long-term cumulative reward


[Diagram: the agent chooses among Action 1, Action 2, and Action 3, receiving rewards R(1), R(2), R(3).]

Could this be applied to natural language processing (NLP) problems?

Image credits: https://www.gettyimages.com/detail/illustration/cheerful-flying-robot-solid-icon-royalty-free-illustration/1281734513

66 of 107

NLP-Gym


Reinforcement learning (RL) Python tool created by Ramamurthy et al.

Link: https://github.com/rajcscw/nlp-gym/tree/main

NLP Environments:

  • Sequence Tagging
  • Question Answering
  • Multi-Label Classification


67 of 107

Question Answering (Q&A) Task


Observation Data:

  • Question
  • Two facts
  • Potential answer choice (8 total choices)

Actions:

  • CONTINUE (move to next answer choice)
  • ANSWER (select answer choice)

Reward:

  • 1 → answered correctly
  • 0 → answered incorrectly


Image credits: https://github.com/rajcscw/nlp-gym/tree/main

68 of 107

Built-In Featurizer


[Diagram: the question, the 2 facts, and the answer choice are embedded with Flair sentence embeddings; cosine similarity (range -1 to 1) between question x answer choice (Num 1) and facts x answer choice (Num 2) yields the 2-element similarity tensor [Num 1, Num 2].]
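A sketch of the two-number featurization; in the deck the embeddings come from Flair sentence embeddings, while here random vectors stand in for them, so only the cosine-similarity step is shown.

```python
import torch
import torch.nn.functional as F

def featurize(question_emb, facts_emb, answer_emb):
    """Reduce the three sentence embeddings to the 2-element observation:
    Num 1 = sim(question, answer choice), Num 2 = sim(facts, answer choice), each in [-1, 1]."""
    num1 = F.cosine_similarity(question_emb, answer_emb, dim=0)
    num2 = F.cosine_similarity(facts_emb, answer_emb, dim=0)
    return torch.stack([num1, num2])

# Placeholder usage with random vectors standing in for real Flair embeddings:
obs = featurize(torch.randn(768), torch.randn(768), torch.randn(768))
```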

69 of 107

Q-Value Array


40 x 40 = 1600 state space

2 actions (cont or ans) per state

Q-value (average reward) array:

(State sp) x (action sp) = (40 x 40) x 2


Discretization of similarity values (40 bins per value):

  Discrete State | Tensor Values
  0              | -1.00 to -0.95
  1              | -0.95 to -0.90
  ...            | ...
  39             | 0.95 to 1.00

[Diagram: 40 x 40 Q-value array indexed by Num 1 (Question x Answer Choice) and Num 2 (Facts x Answer Choice), with a Continue / Answer Q-value pair stored in each cell.]

70 of 107

MAB Training Algorithm



1. Featurize the training sample into a state-space cell, e.g. tensor (1, 0.97) → cell [39, 39].

2. Each state-space cell has an action array storing Q-values, e.g. [continue=0.12, answer=0.15].

3. Select an action randomly or pick the one with the highest Q-value.

4. The environment returns the actual reward → update the Q-value in that cell's action array.

5. Repeat for 10,000 iterations → the whole Q-value array can now be used as a policy.
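A minimal sketch of the training loop above; the binning function, ε value, and incremental-average update are assumptions consistent with the slide, not the author's exact code.

```python
import numpy as np

N_BINS, N_ACTIONS = 40, 2                 # 40 x 40 cells; actions: 0 = CONTINUE, 1 = ANSWER
q_values = np.zeros((N_BINS, N_BINS, N_ACTIONS))
counts = np.zeros_like(q_values)

def to_cell(num1, num2):
    """Map the two similarity values in [-1, 1] to bin indices, e.g. (1.0, 0.97) -> (39, 39)."""
    bin_of = lambda x: min(int((x + 1.0) / 2.0 * N_BINS), N_BINS - 1)
    return bin_of(num1), bin_of(num2)

def select_action(cell, epsilon=0.1):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit the best Q-value."""
    if np.random.rand() < epsilon:
        return np.random.randint(N_ACTIONS)
    return int(np.argmax(q_values[cell]))

def update(cell, action, reward):
    """Incremental average, so q_values stores the mean reward seen for each (state, action)."""
    counts[cell][action] += 1
    q_values[cell][action] += (reward - q_values[cell][action]) / counts[cell][action]
```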

71 of 107

Very Preliminary Results


Percent Test Samples Correct

Future Work:

  • Seq. Tagging and Multi-Label Class.
  • Refine MAB policy (UCB-1)
  • Collect further test results


  Random Policy:          11.56%
  Deep Q-Learning Policy: 15.66%
  MAB Policy:             13.07%

72 of 107

Summary


  • Exploring how well a multi-armed bandit approach works on NLP problems
  • Using NLP-Gym → Reinforcement Learning + NLP
  • Q and A problem → Select correct answer based on given question/facts
  • Featurize observations into two numbers → Convert to indices for a state array
  • Store averages for each state and each action (CONTINUE or ANSWER) → Trained policy

Thank You


73 of 107

Resources


  • NLP-Gym: https://github.com/rajcscw/nlp-gym/tree/main
    Ramamurthy, Rajkumar, Rafet Sifa, and Christian Bauckhage. "NLPGym -- A toolkit for evaluating RL agents on Natural Language Processing Tasks." arXiv:2011.08272, 2020 (cs.CL).
  • Hugging Face / AllenAI QASC dataset: https://huggingface.co/datasets/allenai/qasc
    Khot, Tushar, Peter Clark, Michal Guerquin, Peter Jansen, and Ashish Sabharwal. "QASC: A Dataset for Question Answering via Sentence Composition." arXiv:1910.11473v2, 2020.


74 of 107

Group 20

Fabiha Bushra, Simon Gonzalez

9

75 of 107

Comparative Analysis of Multi-Agent Reinforcement Learning Algorithms

Fabiha Bushra, Simon Gonzalez

76 of 107

Why MARL?

Real-world problems often involve multiple agents

How do MARL algorithms compare across different interaction types?

Self-Driving Car Coordination (Collaborative Interaction)

A Game of Checkers

(Competitive Interaction)

Soccer

(Mixed Interaction)

77 of 107

Why MARL?

Every new MARL paper be like:

78 of 107

Objectives

Which MARL algorithm performs better under

different cooperation-competition dynamics?

Which algorithm is more stable and robust across diverse scenarios?

Which algorithm shows superior sample efficiency and achieves higher performance with fewer interactions?

79 of 107

Quick Recap: Main Setting for Coop MARL

80 of 107

Environments & Task Types

Collaborative (VMAS – Balance): Agents work together to keep a pole balanced

Competitive (PettingZoo – Simple Tag): Predators try to tag evaders

Mixed (VMAS – Football): Partial cooperation within teams, but competition between them

81 of 107

MARL Algorithms: CTDE

PROBLEM: How do we make centralized training tractable?

Use privileged centralized information at training time.

Each policy can be independently executed in the deployment environment.

CENTRALIZED TRAINING

DECENTRALIZED EXECUTION

82 of 107

MARL Algorithms

MAPPO [1] (Multi-Agent Proximal Policy Optimization): On-policy actor-critic, centralized value function.

MASAC [2] (Multi-Agent Soft Actor-Critic): Off-policy actor-critic with entropy regularization.

QMIX [3] (Q-Mixing Network): Value decomposition with monotonic mixing of agent Q-values.
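To make the QMIX bullet concrete, here is a compact PyTorch sketch of monotonic mixing: per-agent Q-values are combined using non-negative weights produced by hypernetworks conditioned on the global state, so dQ_tot/dQ_i >= 0. Layer sizes and names are illustrative, not the exact architecture used in this project.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicMixer(nn.Module):
    """QMIX-style mixer: combine per-agent Q-values into Q_tot with non-negative,
    state-conditioned weights (monotonic mixing)."""
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)   # hypernetwork weights, layer 1
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)              # hypernetwork weights, layer 2
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents); state: (batch, state_dim)
        w1 = torch.abs(self.hyper_w1(state)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).unsqueeze(1)
        hidden = F.elu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)    # (batch, 1, embed_dim)
        w2 = torch.abs(self.hyper_w2(state)).unsqueeze(2)            # (batch, embed_dim, 1)
        q_tot = torch.bmm(hidden, w2).squeeze(2) + self.hyper_b2(state)
        return q_tot.squeeze(-1)                                     # (batch,)
```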

83 of 107

Metrics

Aggregate Statistics

Performance Profile

Sample Efficiency for All Tasks

84 of 107

Preliminary Results

85 of 107

Preliminary Results

Mixed, Football

Competitive, Simple Tag

86 of 107

Thank You!

CREDITS: This presentation template was created by Slidesgo, and includes icons, infographics & images by Freepik

87 of 107

Group 37

Aidan Wilde

10

88 of 107

Learning to Play Card Games with RL

Aidan Wilde

89 of 107

Motivation and Game

  • Variation of the game Gin Rummy
  • The goal of the game is to form sets and runs so that you can discard all of the cards in your hand

90 of 107

Game Construction

  • State
    • Game involves 2 full decks, shuffled together
    • Agent has imperfect knowledge of the state (observation state)
  • Actions
    • Pick up card from discard or draw pile - (0, 1)
    • Choose which card to discard - (0, 108)
    • Choose when to ‘buy’ cards - (0, 3)
    • Total action space 114
  • Rewards
    • Shaped to model milestones of the game (rewards increase as the player gets closer to getting rid of all cards)
    • Penalty enforced for each timestep

91 of 107

The Environment

  • A unique version of the game; no existing online implementation
  • Implemented the game logic in Python
  • Wrapped the game logic in a Gymnasium environment (see the sketch below)
  • Standardizing the environment makes it possible to reuse existing training tools
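A minimal sketch of what the Gymnasium wrapper could look like; the game-logic interface (`game.new_round`, `game.apply`, `game.obs_size`) is hypothetical, standing in for the hand-written Python game logic.

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class RummyEnv(gym.Env):
    """Minimal Gymnasium wrapper around the (hypothetical) game-logic object described above."""
    def __init__(self, game):
        super().__init__()
        self.game = game                                # hand-written Python game logic
        self.action_space = spaces.Discrete(114)        # draw (2) + discard (108) + buy (4)
        self.observation_space = spaces.Box(-1.0, np.inf, shape=(game.obs_size,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = self.game.new_round()
        return np.asarray(obs, dtype=np.float32), {}

    def step(self, action):
        obs, reward, done = self.game.apply(action)     # shaped reward + per-step penalty inside
        return np.asarray(obs, dtype=np.float32), reward, done, False, {}
```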

92 of 107

Training

  • A Deep Q-Network (DQN) is a good choice due to the large state space, discrete action space, and availability of simulation
  • Difficulties
    • Large State Space (Billions of unique game states)
    • Multiple unique action processes

93 of 107

Feature Engineering

Cards are represented as a one-hot encoded vector:

We can reduce this vector into more meaningful data.

Set Vector:

[Diagram: one-hot encoded hand vector of length 108 (two decks), with slots ordered 2H, 2D, 2S, 2C, 3H, 3D, 3S, ..., Ace S, Ace C, reduced to a set vector of length 13 holding one count per rank (2's, 3's, ..., King's, Ace's).]
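A sketch of this reduction; the exact slot ordering of the 108-length vector is an assumption based on the diagram (slots grouped by rank, 8 copies of each rank across the two decks, any remaining slots ignored).

```python
import numpy as np

def to_set_vector(hand_vector):
    """Collapse the length-108 hand encoding into a length-13 count per rank (2s ... Aces)."""
    by_rank = np.asarray(hand_vector)[: 13 * 8].reshape(13, 8)   # assumed rank-major ordering
    return by_rank.sum(axis=1)
```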

94 of 107

Action Masking

  • Agent has 3 different actions it can take
    • Draw | Discard | Buy

[Diagram: the vector of Q-values over the action space, shown before and after applying the action mask, leaving only the Q-values of the currently legal actions.]
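The usual way to implement the masking shown above, sketched in PyTorch: illegal actions have their Q-values set to -inf before the argmax, so only currently legal draw/discard/buy actions can be selected. Names are illustrative.

```python
import torch

def masked_greedy_action(q_values, legal_mask):
    """Pick the best legal action: q_values has one entry per action (114 here),
    legal_mask is a boolean tensor marking the actions allowed in the current state."""
    masked = q_values.masked_fill(~legal_mask, float("-inf"))
    return int(torch.argmax(masked))
```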

95 of 107

DQN ε-Greedy Agent vs. Randomized Agent

🔴 Exploit Average Reward

⚫ Exploit Average Win %

96 of 107

Future Directions

  • Considering removing action mask and moving to MARL
  • Self Play implementation and progression

97 of 107

Group 31

Matt Myers

11

98 of 107

Group 2

Abbas Mohammadi

12

99 of 107

Reinforcement Learning for Adaptive Pavement Marking Maintenance

Abbas Mohammadi

Advanced AI – Spring 2025

100 of 107

Motivation & Problem Statement

  • Pavement markings degrade over time and affect nighttime visibility.

  • Traditional methods use fixed schedules (e.g., repaint every 6 or 12 months), which are not responsive to actual pavement conditions.

  • These methods can cause:
    • Wasted resources due to premature maintenance
    • Safety risks due to delayed interventions

  • Reinforcement learning (RL) offers a way to make adaptive, data-driven decisions over time.

101 of 107

Project Goal

  • Goal:

Train an RL agent to make optimal pavement maintenance decisions using simulated traffic and retroreflectivity data.

  • Key objectives:
    • Maintain retroreflectivity above the 150 mcd/m²/lx safety threshold
    • Minimize cumulative maintenance cost
    • Compare RL agent to rule-based baselines

102 of 107

Simulation Environment & Setup

Environment:

  • Custom PavementEnv with:
    • Retroreflectivity
    • Traffic
    • Months since last maintenance
    • Snow exposure (Snowy months since last maintenance)

  • Discrete actions:

0 = No Maintenance, 1 = Perform Maintenance

  • Degradation formula:

decay = 10 + 0.0001 × traffic + 2 × months_since + 3 × snowy_months
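A sketch of how this degradation formula could drive the PavementEnv transition; the repaint value (800 mcd/m²/lx) and the handling of the month counters are assumptions, not the project's exact environment.

```python
def monthly_decay(traffic, months_since, snowy_months):
    """Retroreflectivity lost this month, per the slide's formula."""
    return 10 + 0.0001 * traffic + 2 * months_since + 3 * snowy_months

def transition(retro, traffic, months_since, snowy_months, maintain):
    """One step: repaint restores a high value (placeholder), otherwise the marking degrades."""
    if maintain:                       # action 1 = Perform Maintenance
        return 800.0, 0, 0
    retro -= monthly_decay(traffic, months_since, snowy_months)
    return retro, months_since + 1, snowy_months   # snow exposure updated by the (unmodeled) weather
```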

103 of 107

Reward Function

Reward Design:

  • Penalty for violating the threshold
  • Penalty for unnecessary maintenance
  • Bonus for safe cost-efficient decisions

104 of 107

Learning Curve

  • Shows reward increasing over 14,000 episodes, stabilizing near –400.

  • DQN agent learns to maximize long-term rewards while managing cost and compliance

105 of 107

Comparison with Baselines

  • Shows DQN better avoids falling below the threshold compared to the 12-month strategy.
  • More cost-efficient than a 6-month baseline.
  • DQN acts only when needed, while baselines act rigidly.
  • DQN achieves better trade-off: less frequent than 6 months, safer than 12 months.

106 of 107

Key Results

  • DQN reduced cumulative cost by up to ~25% vs 6-month baseline

  • Better safety compliance than a 12-month baseline

  • Learned policy is state-aware, not fixed

  • Clear value in moving from reactive to proactive planning

107 of 107

Thanks! Questions?