Shared Google Slides Instructions for Students
If you experience any technical issues, please email c.mattson@utah.edu.
Group 33
Yanxi Lin, Varun Raveendra
Multi-agent Reinforcement Learning (MARL)
for Open Agent Systems
Background
Wildfire Suppression
Method 1: Actor-Critic
Method 1: Actor-Critic Performance
GNNs
Refill
Method 2: Vanilla Policy Gradient (REINFORCE) using GNNs
Agent 1
Agent 2
Agent 3
Master agent
Loss fn
Loss fn
Loss fn
PG loss fn with reward-to-go as the return
+
Update
Loss fn: -logprob * reward_to_go
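A minimal PyTorch-style sketch of this per-agent REINFORCE update, assuming log-probs come from each agent's GNN policy and reward-to-go returns are computed from the episode (names and the discount factor are illustrative, not this group's code):

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, rewards: list, gamma: float = 0.99) -> torch.Tensor:
    """REINFORCE loss: -log pi(a_t|s_t) * reward-to-go G_t, summed over the episode."""
    # reward-to-go: G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=log_probs.dtype)
    return -(log_probs * returns).sum()
```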
Method 2: performance
Method 3: Actor-COMA-Critic using GNNs
https://arxiv.org/abs/1705.08926 - Counterfactual Multi-Agent Policy Gradients
Counterfactual Multi-Agent Policy Gradients (COMA)
Agent 1
Agent 2
Agent 3
Master agent
Loss Fn
Loss Fn
Loss Fn
+
Update
Loss Fn: -logprob * COMA advantage
Actor
Agent 1
Agent 2
Agent 3
q-value
Critic
MSE loss
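A rough sketch of the COMA counterfactual advantage used in this actor loss, following the cited paper; it assumes the centralized critic outputs one Q-value per possible action of a single agent while the other agents' actions are held fixed (tensor names are illustrative):

```python
import torch

def coma_advantage(q_values: torch.Tensor, action_probs: torch.Tensor, action: int) -> torch.Tensor:
    """Counterfactual advantage for one agent:
    A(s, u) = Q(s, u) - sum_u' pi(u'|s) * Q(s, u'),
    i.e. the taken action's Q minus a baseline that marginalizes this agent's
    own action while the other agents' actions stay fixed."""
    baseline = (action_probs * q_values).sum()
    return q_values[action] - baseline
```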
Method 3: performance
Next,
(clearly something is not working)
Group 15
Krishna Ashish Chinnari, Abhishek Rajgaria
Portfolio Manager
by
Krishna Ashish Chinnari (u1477143)
Abhishek Rajgaria (u1471428)
Stock Representation - Daily Movements
Y-Finance (yfinance) - Data Source
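A minimal sketch of pulling daily price history and computing daily movements, assuming the `yfinance` package and the S&P 500 index ticker `^GSPC` (the actual tickers, dates, and preprocessing used by the group may differ):

```python
import yfinance as yf

# Daily price history for one stock and the S&P 500 benchmark
stock = yf.Ticker("AAPL").history(start="2020-01-01", end="2024-12-31")
sp500 = yf.Ticker("^GSPC").history(start="2020-01-01", end="2024-12-31")

# Daily movement: percent change of the closing price
stock_returns = stock["Close"].pct_change().dropna()
sp500_returns = sp500["Close"].pct_change().dropna()
```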
Single Stock vs S&P 500 (A2C)
Single Stock vs S&P 500 (DQN)
Single Stock vs S&P 500 (PPO)
Single Stock vs S&P 500 (DDPG)
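One way the four single-stock agents above could be trained under a shared budget; a sketch assuming a Gym-style trading environment and the stable-baselines3 implementations of A2C, DQN, PPO, and DDPG (the group's actual environment and hyperparameters are not shown on these slides):

```python
from stable_baselines3 import A2C, DQN, PPO, DDPG

def train_all(env, timesteps: int = 100_000):
    """Train each algorithm on the same trading env for a fixed budget.
    Note: DQN requires a discrete action space (e.g. buy/hold/sell),
    while DDPG requires a continuous one (e.g. position size)."""
    models = {}
    for algo in (A2C, DQN, PPO, DDPG):
        model = algo("MlpPolicy", env, verbose=0)
        model.learn(total_timesteps=timesteps)
        models[algo.__name__] = model
    return models
```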
Challenges Ahead
Questions & Suggestions
Thank You!
Group 26
Novella Alvina, Leo Leano, Nicole Sundberg
May The Best Chef Win
Leonardo Leano, Nicole Sundberg, Novella Alvina
Overview
Multi-Agent RL Models:
Goal: Compare the performance of RL models in a cooperative task
Why Is This Problem Important?
Real-world applications
Robotics, traffic control, logistics
Coordination is hard
Agents must learn to share space and tasks
Overcooked is an ideal environment
Time pressure and shared goals
Multi-Agent Reinforcement Learning Algorithms Background
MADDPG Multi-Agent Deep Deterministic Policy Gradient | Policy gradient algorithm involving multiple agents and a critic structure to learn policies |
MAVD Multi-Agent Value Decomposition | Learns individual agent value functions to combine into one joint action-value function. |
MAPPO Multi-Agent Proximal Policy Optimization | Updates policy in small steps via clipping to ensure stability |
Multi-Agent Reinforcement Learning Algorithms Background
1️⃣ | MAPPO Multi-Agent Proximal Policy Optimization | Updates policy in small steps via clipping to ensure stability (see the clipped-loss sketch after this list) |
2️⃣ | MADDPG Multi-Agent Deep Deterministic Policy Gradient | Policy gradient algorithm involving multiple agents and a critic structure to learn policies |
3️⃣ | MAVD Multi-Agent Value Decomposition | Learns individual agent value functions to combine into one joint action-value function. |
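A minimal PyTorch-style sketch of the clipped update MAPPO borrows from PPO, assuming per-agent log-probabilities and advantage estimates are already available (names are illustrative, not this project's code):

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps: float = 0.2):
    """Clipped surrogate: keep the policy ratio within [1 - eps, 1 + eps] so each update stays small."""
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```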
Multi-Agent Reinforcement Learning Algorithms We’re Investigating
MADDPG
Multi-Agent Deep Deterministic Policy Gradient
MAPPO
Multi-Agent Proximal Policy Optimization
MAVD
Multi-Agent Value Decomposition
Develop two of the MARL algorithms, train them, and compare them against one another
Multi-Agent Reinforcement Learning Evaluation Methods
Visual Differences
Compare the MARL algorithm against others and self-play to observe technique differences
Score
After training the algorithm, what is the best score it can reach?
Reliability
After training, how reliable and consistent is the algorithm?
Time to ‘optimal performance’
How many training loops are required until performance improves and plateaus?
Thank you!
Any questions?
Group 19
Ghazal Abdollahinoghondar
Advanced AI Project, Spring 2025
Advisor: Prof. Daniel Brown
Student: Ghazal Abdollahi
(PhD in Computer Science)
University of Utah
Outline
Introduction
Deep Q-Network Solution
Reward function
Multi-Container Queue Processing
Expected Results
Expected Results
Thank you for your attention!
Group 1
Matthew Lowery
NSDEs for Uncertainty-Aware Offline Model-based RL
Matthew Lowery
Descriptor: Model-based RL
Agent that forms an idea of how the world will react instead of reacting to the world
Descriptor: Offline RL
- No ability for the learning agent to continuously interact with the environment
- `Offline` tuples of (s,a,r,s′)
- Relevant where it’s costly or hazardous to interact with the environment
Think Healthcare
Think Finance
Offline + Model-based RL
Have model of the world, so can generate ‘synthetic’ rollouts to supplement our fixed dataset and learn a better policy.
Problem: Model exploitation
In some (s,a) pairs from the synthetic data, our understanding of the world is fuzzy
and we might predict rewards greater than in reality.
→ We learn policy which works in the model, but not in reality
→ Can synthetic data still help?
Counter to Model Exploitation
Only keep rollouts which the model isn't too uncertain about
→ Which are more representative of the data distribution
→ Uncertainty could be based on how 'far' a rollout is from the fixed dataset (in an L2 sense)
→ i.e. what is the Euclidean distance between the (s,a) pairs in the rollout and the (s,a) pairs in the dataset?
Discard the rollout if that distance is too large (see the sketch below)
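A hedged sketch of this distance-based filter, assuming rollout and dataset (s,a) pairs are stacked as NumPy arrays; the nearest-neighbor choice and the threshold are illustrative, not necessarily what the project uses:

```python
import numpy as np

def keep_rollout(rollout_sa: np.ndarray, dataset_sa: np.ndarray, threshold: float) -> bool:
    """Keep a synthetic rollout only if every (s, a) pair in it lies within
    `threshold` (Euclidean distance) of its nearest neighbor in the fixed dataset."""
    # pairwise distances: shape (len(rollout), len(dataset))
    dists = np.linalg.norm(rollout_sa[:, None, :] - dataset_sa[None, :, :], axis=-1)
    nearest = dists.min(axis=1)            # distance to the closest dataset pair, per step
    return bool(nearest.max() <= threshold)
```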
NSDEs??? for Uncertainty-Aware Offline Model-based RL
Discrete vs Continuous
Skip Connections
Kinda looks like:
Forward Euler Step
Which is just a particular way to solve
Focus on this form instead? And you get a Neural ODE
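Written out (standard math, not specific to this project): a skip connection is one forward Euler step, and keeping the continuous form gives a Neural ODE.

```latex
h_{t+1} = h_t + f_\theta(h_t)
\;=\; h_t + \Delta t\, f_\theta(h_t)\big|_{\Delta t = 1}
\qquad \text{(one forward Euler step for } \tfrac{dh}{dt} = f_\theta(h)\text{)}
```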
Continuous & Probabilistic NSDEs
s is an evolving distribution, with a diffusion term that reflects our uncertainty
And a drift term, which tracks the mean (as previously)
Unimodal
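In the usual neural SDE notation (an assumed standard form; the project's exact parameterization may differ), the state evolves with a drift term for the mean and a diffusion term for the uncertainty:

```latex
ds_t = \underbrace{f_\theta(s_t, t)\,dt}_{\text{drift: tracks the mean}}
     + \underbrace{g_\theta(s_t, t)\,dW_t}_{\text{diffusion: reflects uncertainty}}
```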
How to generate ‘safe’ synthetic data?
Group 8
Atharv Kulkarni
Group 16
Timothy Wang
Multi-Armed Bandit Approach for NLP Problems
CS 6955
Timothy Wang
Timothy Wang | CS 6955 | 2025 April 14
Image credits: https://tenor.com/view/love-winner-you-are-mine-gif-15271571536543540590; https://www.gettyimages.com/detail/photo/natural-language-processing-cognitive-computing-royalty-free-image/1313050195
Multi-Armed Bandit (MAB) Approach
Bandit diagram: the agent chooses among Action 1, Action 2, and Action 3, receiving rewards R(1), R(2), and R(3).
Could this be applied to natural language processing (NLP) problems?
Image credits: https://www.gettyimages.com/detail/illustration/cheerful-flying-robot-solid-icon-royalty-free-illustration/1281734513
NLP-Gym
Reinforcement learning (RL) Python tool created by Ramamurthy et al.
Link: https://github.com/rajcscw/nlp-gym/tree/main
NLP Environments:
Question and Answering (Q&A) Task
Observation Data:
Actions:
Reward:
Image credits: https://github.com/rajcscw/nlp-gym/tree/main
Built-In Featurizer
Featurizer diagram: the question, the 2 facts, and an answer choice are embedded with Flair sentence embeddings; cosine similarity (range: -1 to 1) between the embedding tensors yields two numbers: Num 1 = similarity(question × answer choice), Num 2 = similarity(facts × answer choice).
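A rough sketch of that two-number featurization, assuming Flair document embeddings pooled over GloVe word vectors (NLP-Gym's built-in featurizer may differ in detail):

```python
import torch
from flair.data import Sentence
from flair.embeddings import WordEmbeddings, DocumentPoolEmbeddings

embedder = DocumentPoolEmbeddings([WordEmbeddings("glove")])

def embed(text: str) -> torch.Tensor:
    sent = Sentence(text)
    embedder.embed(sent)
    return sent.embedding

def featurize(question: str, facts: str, answer_choice: str) -> tuple[float, float]:
    q, f, a = embed(question), embed(facts), embed(answer_choice)
    num1 = torch.cosine_similarity(q, a, dim=0).item()  # question vs. answer choice
    num2 = torch.cosine_similarity(f, a, dim=0).item()  # facts vs. answer choice
    return num1, num2
```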
Q-Value Array
40 x 40 = 1600 state space
2 actions (continue or answer) per state
Q-value (average reward) array:
(state space) x (action space) = (40 x 40) x 2
Discrete State | Tensor Values
0 | -1.0 to -0.95
1 | -0.95 to -0.9
… | …
39 | 0.95 to 1.0
Q-value array layout: rows = Num 1 bins 0-39 (question × answer choice similarity), columns = Num 2 bins 0-39 (facts × answer choice similarity); each cell stores Q-values for Continue and Answer.
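A minimal sketch of the discretization and Q-value array, assuming the two similarity values are binned into 40 equal-width intervals over [-1, 1] (variable names are illustrative):

```python
import numpy as np

N_BINS = 40  # per similarity value -> 40 x 40 = 1600 states

def to_state(num1: float, num2: float) -> tuple[int, int]:
    """Map two cosine similarities in [-1, 1] to a (row, col) state cell."""
    def bin_index(x: float) -> int:
        return min(int((x + 1.0) / 2.0 * N_BINS), N_BINS - 1)
    return bin_index(num1), bin_index(num2)

# Q-values (average reward) for the 2 actions in every state cell
q_values = np.zeros((N_BINS, N_BINS, 2))  # actions: 0 = continue, 1 = answer
```

With this binning, the slide's example tensor (1, 0.97) lands in cell [39, 39].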
MAB Training Algorithm
1. Featurize the training sample into a state space cell: tensor (1, 0.97) → cell [39, 39]
2. Each state space cell has an action array that stores Q-values: [continue = 0.12, answer = 0.15]
3. Select an action randomly or pick the one with the highest Q-value
4. Env returns the actual reward → update the Q-value in that cell's action array
5. Repeat for 10,000 iterations → the whole Q-value array can now be used as a policy (see the sketch below)
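A hedged sketch of that bandit-style loop, building on the `to_state` / `q_values` pieces above and assuming an NLP-Gym-style env that returns a reward per decision; the epsilon value and the incremental-average update rule are illustrative:

```python
import numpy as np

EPSILON = 0.1
counts = np.zeros_like(q_values)  # visits per (state, action), for the running average

def choose_action(state: tuple[int, int]) -> int:
    if np.random.rand() < EPSILON:             # explore
        return np.random.randint(2)
    return int(np.argmax(q_values[state]))     # exploit: highest Q-value

def update(state: tuple[int, int], action: int, reward: float) -> None:
    """Incremental average reward: Q <- Q + (r - Q) / n."""
    counts[state][action] += 1
    q = q_values[state][action]
    q_values[state][action] = q + (reward - q) / counts[state][action]
```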
Very Preliminary Results
Percent of Test Samples Correct:
Random Policy | Deep Q-Learning Policy | MAB Policy
11.56% | 15.66% | 13.07%
Future Work:
Summary
Thank You
Resources
Ramamurthy, R., Sifa, R., & Bauckhage, C. (2020). NLPGym: A toolkit for evaluating RL agents on Natural Language Processing Tasks. arXiv:2011.08272 [cs.CL].
Khot, T., Clark, P., Guerquin, M., Jansen, P., & Sabharwal, A. (2020). QASC: A Dataset for Question Answering via Sentence Composition. arXiv:1910.11473v2.
Group 20
Fabiha Bushra, Simon Gonzalez
Comparative Analysis of Multi-Agent Reinforcement Learning Algorithms
Fabiha Bushra, Simon Gonzalez
Why MARL?
Real-world problems often involve multiple agents
How do MARL algorithms compare across different interaction types?
Self-Driving Car Coordination (Collaborative Interaction)
A Game of Checkers
(Competitive Interaction)
Soccer
(Mixed Interaction)
Why MARL?
Every new MARL paper be like:
Objectives
Which MARL algorithm performs better under
different cooperation-competition dynamics?
Which algorithm is more stable and robust across diverse scenarios?
Which algorithm shows superior sample efficiency and achieves higher performance with fewer interactions?
Quick Recap: Main Setting for Coop MARL
Environments & Task Types
VMAS – Balance: Agents work together to keep a pole balanced
PettingZoo – Simple Tag: Predators try to tag evaders
VMAS – Football: Partial cooperation within teams, but competition between them
Collaborative
Competitive
Mixed
MARL Algorithms: CTDE
PROBLEM: How do we make centralized training tractable?
Use privileged centralized information at training time.
Each policy can be independently executed in the deployment environment.
CENTRALIZED TRAINING
DECENTRALIZED EXECUTION
MARL Algorithms
MAPPO (Multi-Agent Proximal Policy Optimization): on-policy actor-critic with a centralized value function.
MASAC (Multi-Agent Soft Actor-Critic): off-policy actor-critic with entropy regularization.
QMIX (Q-Mixing Network): value decomposition with monotonic mixing of agent Q-values (see the mixer sketch below).
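A hedged PyTorch sketch of the monotonic mixing idea behind QMIX (a generic illustration, not this project's implementation): per-agent Q-values are combined by a mixing network whose weights are produced by hypernetworks and forced non-negative, so the joint Q is monotonic in each agent's Q.

```python
import torch
import torch.nn as nn

class MonotonicMixer(nn.Module):
    """Combine per-agent Q-values into Q_tot with non-negative mixing weights,
    so dQ_tot / dQ_i >= 0 (the QMIX monotonicity constraint)."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # hypernetworks: mixing weights conditioned on the global state
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, agent_qs: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        # agent_qs: (batch, n_agents), state: (batch, state_dim)
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.unsqueeze(1), w1) + b1)   # (b, 1, embed)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2                               # (b, 1, 1)
        return q_tot.view(b, 1)
```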
Metrics
Aggregate Statistics
Performance Profile
Sample Efficiency for All Tasks
Preliminary Results
Preliminary Results
Mixed, Football
Competitive, Simple Tag
Thank You!
Group 37
Aidan Wilde
Learning to Play Card Games with RL
Aidan Wilde
Motivation and Game
Game Construction
The Environment
Training
Feature Engineering
Cards are represented as a one-hot encoded vector:
We can reduce the vector into more meaningful data
Set Vector:
2H | 2D | 2S | 2C | 3H | 3D | 3S | ……. | Ace S | Ace C |
Length 108 (2 decks)
2’s | 3’s | 4’s | 5’s | 6’s | 7’s | 8’s | …….. | King’s | Ace’s |
Length 13
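A minimal sketch of that reduction, assuming the 108-length vector is laid out as two 52-card decks, each rank-major with 4 suits, and that any extra slots (e.g. jokers) are ignored; this layout is an assumption, not necessarily the project's encoding:

```python
import numpy as np

def to_set_vector(hand_vec: np.ndarray) -> np.ndarray:
    """Collapse a 108-length card vector into a 13-length count per rank (2 .. Ace)."""
    # assumed layout: [deck, rank, suit] over the first 2 * 13 * 4 = 104 slots
    by_rank = hand_vec[:104].reshape(2, 13, 4)
    return by_rank.sum(axis=(0, 2))  # how many of each rank the hand holds
```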
Action Masking
Example Q-values: .67 | .54 | .12 | .18 | .65 | .43 | .90 | .61 | .55 | .36 | .77 | .97 | .32 | .2 | .05
Mask down to the legal actions: .54 | .43 | .77
Pick the highest remaining value: .77
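A short sketch of that masking step, assuming a Q-value vector over all actions and a boolean mask of currently legal actions (names are illustrative):

```python
import numpy as np

def masked_argmax(q_values: np.ndarray, legal_mask: np.ndarray) -> int:
    """Set illegal actions to -inf so they can never be selected, then take the argmax."""
    masked = np.where(legal_mask, q_values, -np.inf)
    return int(np.argmax(masked))
```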
DQN ε-Greedy Agent vs Randomized Agent
🔴 Exploit Average Reward
⚫ Exploit Average Win %
Future Directions
Group 31
Matt Myers
Group 2
Abbas Mohammadi
Reinforcement Learning for Adaptive Pavement Marking Maintenance
Abbas Mohammadi
Advanced AI – Spring 2025
Motivation & Problem Statement
Project Goal
Train an RL agent to make optimal pavement maintenance decisions using simulated traffic and retroreflectivity data.
Simulation Environment & Setup
Environment:
0 = No Maintenance, 1 = Perform Maintenance
decay = 10 + 0.0001 × traffic + 2 × months_since + 3 × snowy_months
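A small worked sketch of that decay term, assuming `traffic` is a monthly traffic volume and the result is the retroreflectivity lost per step (the units and how the environment applies it are assumptions based on the slide):

```python
def decay(traffic: float, months_since: int, snowy_months: int) -> float:
    """Retroreflectivity decay: 10 + 0.0001*traffic + 2*months_since + 3*snowy_months."""
    return 10 + 0.0001 * traffic + 2 * months_since + 3 * snowy_months

# e.g. 20,000 vehicles/month, 6 months since maintenance, 2 snowy months:
# decay(20_000, 6, 2) = 10 + 2 + 12 + 6 = 30
```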
Reward Function
Reward Design:
Learning Curve
Comparison with Baselines
Key Results
Thanks! Questions?