Week | Week topic | Day | Day topic | Format | Relevant papers / readings
---|---|---|---|---|---
1 | Intro and multi-agent coordination | 3/25/24 | Introduction to course, syllabus. RL basics | Lecture | OpenAI Spinning Up (https://spinningup.openai.com/en/latest/) • Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018. • Playing Atari with Deep Reinforcement Learning • Prioritized Experience Replay • Policy Gradient Methods for Reinforcement Learning with Function Approximation • Asynchronous Methods for Deep Reinforcement Learning • Trust Region Policy Optimization • Proximal Policy Optimization Algorithms (a minimal policy-gradient sketch follows the table)
 | | 3/27/24 | Deep multi-agent RL intro | Lecture | Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (MADDPG) • QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning • Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning • Is Independent Learning All You Need in the StarCraft Multi-Agent Challenge?
2 | Multi-agent coordination (cont'd) | 4/1/24 | Multi-agent learning | Discussion | Mastering the Game of Go with Deep Neural Networks and Tree Search • Learning Latent Representations to Influence Multi-Agent Interaction • Learning with Opponent-Learning Awareness • Machine Theory of Mind • Theory of Minds: Understanding Behavior in Groups Through Inverse Planning • Breaking the Curse of Multiagents in a Large State Space: RL in Markov Games with Independent Linear Function Approximation • When is Offline Two-Player Zero-Sum Markov Game Solvable?
 | | 4/3/24 | Fun multi-agent papers | Discussion | Celebrating Diversity in Shared Multi-Agent Reinforcement Learning • Modeling Others Using Oneself in Multi-Agent Reinforcement Learning • Multi-Agent Cooperation and the Emergence of (Natural) Language • Emergent Prosociality in Multi-Agent Games Through Gifting • Learning to Incentivize Other Learning Agents • Concurrent Meta Reinforcement Learning • Human-Level Play in the Game of Diplomacy by Combining Language Models with Strategic Reasoning
3 | Coordination with humans and population-based training | 4/8/24 | Human-agent coordination (zero-shot) and population-based training | Lecture (recording available) | Concept-Based Understanding of Emergent Multi-Agent Behavior • On the Utility of Learning About Humans for Human-AI Coordination • Cooperating with Humans Without Human Data
 | | 4/10/24 | Human-agent coordination and population-based training | Discussion | List 1 (human-AI coordination): Too Many Cooks: Bayesian Inference for Coordinating Multi-Agent Collaboration • Generating Diverse Cooperative Agents by Learning Incompatible Policies (LIPO) • Diverse Conventions for Human-AI Collaboration (CoMeDi) • Trajectory Diversity for Zero-Shot Coordination • Off-Belief Learning. List 2 (state-of-the-art MARL): Grandmaster Level in StarCraft II Using Multi-Agent Reinforcement Learning • Dota 2 with Large Scale Deep Reinforcement Learning • Human-Level Performance in First-Person Multiplayer Games with Population-Based Deep Reinforcement Learning • Real World Games Look Like Spinning Tops
4 | Emergent Complexity | 4/15/24 | Emergent Complexity | Lecture (partial recording available) | Autocurricula and the Emergence of Innovation from Social Interaction • Emergent Tool Use from Multi-Agent Autocurricula • Adversarial Policies: Attacking Deep Reinforcement Learning • Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play • Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation • Paired Open-Ended Trailblazer (POET): Endlessly Generating Increasingly Complex and Diverse Learning Environments and Their Solutions • Emergent Complexity and Zero-Shot Transfer via Unsupervised Environment Design • Environment Generation for Zero-Shot Compositional Reinforcement Learning
 | | 4/17/24 | Emergent Complexity and Open-Endedness | Discussion | Adversarial Policies: Attacking Deep Reinforcement Learning • Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play • Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation • OMNI: Open-endedness via Models of Human Notions of Interestingness • Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and Their Solutions • Scaling MAP-Elites to Deep Neuroevolution • Neural MMO: A Massively Multiagent Game Environment for Training and Evaluating Intelligent Agents • Evolving Curricula with Regret-Based Environment Design • A Quality Diversity Approach to Automatically Generating Human-Robot Interaction Scenarios in Shared Autonomy • Prioritized Level Replay
5 | Social Learning | 4/22/24 | Social Learning | Lecture | The Secret of Our Success (book by Joseph Henrich) • Why Copy Others? Insights from the Social Learning Strategies Tournament • Emergent Social Learning via Multi-Agent Reinforcement Learning • PsiPhi-Learning: Reinforcement Learning with Demonstrations Using Successor Features and Inverse Temporal Difference Learning • The Big Man Mechanism: How Prestige Fosters Cooperation and Creates Prosocial Leaders • The Social Function of Intellect • Culture and the Evolution of Human Cooperation
 | | 4/24/24 | Social Learning | Discussion | Social Cohesion in Autonomous Driving • Behavior Planning of Autonomous Cars with Social Perception • Courteous Autonomous Cars • Learning Few-Shot Imitation as Cultural Transmission • Culture and the Evolution of Human Cooperation • The Social Function of Intellect • How Culture Shaped the Human Genome • The Selfish Gene (book by Richard Dawkins), chapter 11 • The Secret of Our Success (book by Joseph Henrich)
6 | Learning from humans (including IRL, language-conditioned RL) | 4/29/24 | Inverse RL and other ways to learn from humans | Lecture (recording available) | Maximum Entropy Inverse Reinforcement Learning • Socially Adaptive Path Planning in Human Environments Using Inverse Reinforcement Learning • Basis for Intentions: Efficient Inverse Reinforcement Learning Using Past Experience • Cooperative Inverse Reinforcement Learning (CIRL) • Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models • Learning via Social Awareness: Improving a Deep Generative Sketching Model with Facial Feedback • Interactively Shaping Agents via Human Reinforcement (TAMER)
 | | 5/1/24 | Learning from humans | Discussion | List 1 (PbRL / robot learning from human feedback): Active Preference-Based Learning of Reward Functions • Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-Loop Feedback • Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction • Physical Interaction as Communication: Learning Robot Objectives Online from Human Corrections • Learning Reward Functions by Integrating Human Demonstrations and Preferences • B-Pref: Benchmarking Preference-Based Reinforcement Learning • Preferences Implicit in the State of the World. List 2 (interacting with humans by following natural language instructions): Do As I Can, Not As I Say: Grounding Language in Robotic Affordances • Imitating Interactive Intelligence • Thought Cloning: Learning to Think While Acting by Imitating Human Thinking • Continual Learning for Grounded Instruction Generation by Observing Human Following Behavior • Speaker-Follower Models for Vision-and-Language Navigation • Grounding Language in Play • Language as an Abstraction for Hierarchical Deep Reinforcement Learning
7 | RLHF | 5/6/24 | RLHF | Lecture (recording available) | Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-Control • Human-Centric Dialog Training via Offline Reinforcement Learning • Hierarchical Reinforcement Learning for Open-Domain Dialog • Deep RL from Human Preferences • Fine-Tuning Language Models from Human Preferences • Learning to Summarize from Human Feedback • Training Language Models to Follow Instructions with Human Feedback (the standard RLHF and DPO objectives are sketched after the table)
 | | 5/8/24 | RLHF | Discussion | Training Language Models to Follow Instructions with Human Feedback (InstructGPT) • Models of Human Preference for Learning Reward Functions • Multi-Agent Communication Meets Natural Language: Synergies Between Functional and Structural Language Learning • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback • Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
8 | Latest developments in RLHF | 5/13/24 | RLHF latest developments | Guest lectures from Cassidy Laidlaw (DPL) and Rafael Rafailov (DPO); recording available | Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF • Direct Preference Optimization: Your Language Model is Secretly a Reward Model
 | | 5/15/24 | RLHF latest developments | Discussion | A Minimaximalist Approach to Reinforcement Learning from Human Feedback • Nash Learning from Human Feedback • Jury Learning: Integrating Dissenting Voices into Machine Learning Models • A Roadmap to Pluralistic Alignment • Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization • Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF • Contrastive Preference Learning: Learning from Human Feedback without RL • Learning Optimal Advantage from Preferences and Mistaking It for Reward • Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data
9 | RLAIF / Multi-agent LLMs | 5/20/24 | In-class time to work on projects and ask questions | Project work time | N/A
 | | 5/22/24 | RLAIF / Multi-agent LLMs | 30-minute lecture + one-paper discussion | Lecture: Universal and Transferable Adversarial Attacks on Aligned Language Models • Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (red teaming) • AI Safety via Debate • Improving Factuality and Reasoning in Language Models through Multiagent Debate • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. Discussion: Constitutional AI: Harmlessness from AI Feedback • Curiosity-Driven Red-Teaming for Large Language Models • Universal and Transferable Adversarial Attacks on Aligned Language Models • Social Simulacra: Creating Populated Prototypes for Social Computing Systems • SOTOPIA: Interactive Evaluation for Social Intelligence in Language Agents • Generative Agent-Based Modeling with Actions Grounded in Physical, Social, or Digital Space Using Concordia
10 | Project presentations | 5/29/24 | Project presentations | |
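
The Week 1 session introduces RL basics through the Spinning Up guide and the policy-gradient papers listed above. As a companion, here is a minimal REINFORCE sketch; it is not taken from the course materials, assumes gymnasium and PyTorch are installed, and uses CartPole-v1 purely as an illustrative environment.

```python
# Minimal REINFORCE (Monte-Carlo policy gradient) sketch for the Week 1 readings.
# Illustrative only: environment and hyperparameters are arbitrary choices.
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Small policy network: observation -> action logits.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

for episode in range(200):
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return-to-go G_t for each timestep of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction

    # REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

This is the plain Monte-Carlo policy gradient; the TRPO and PPO papers in the same reading list can be read as ways of constraining how far each such update is allowed to move the policy.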
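
For the Week 7–8 readings (Deep RL from Human Preferences, InstructGPT, DPO), the equations below are a compact sketch of the standard two-stage RLHF pipeline and the DPO objective as they appear in that literature. The notation (prompts x, preferred/dispreferred responses y_w, y_l, reward model r_φ, policy π_θ, reference policy π_ref, coefficient β, preference dataset D) follows common usage rather than the course slides.

```latex
% Stage 1 (reward modelling): fit r_phi to pairwise preferences under a Bradley--Terry model.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]

% Stage 2 (policy optimisation): maximise reward with a KL penalty to the reference policy,
% typically via PPO.
\max_{\pi_\theta}\;
  \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right]
  - \beta\,\mathrm{D}_{\mathrm{KL}}\!\left[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right]

% DPO collapses both stages into one supervised objective on the preference data.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
    - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
  \right)\right]
```

DPO is obtained by substituting the closed-form optimum of the KL-regularised objective into the Bradley–Terry likelihood, which is why it trains directly on preference pairs without an explicit reward model or on-policy sampling.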