Inspired by the course CS332: Advanced Survey of RL, Key papers from OpenAI, Reinforcement Learning Summer School, my personal intake, STA 4273: Minimizing Expectations, CS 6789: Foundations of Reinforcement Learning, COMS E6998: Bandits and RL, and CS 542: Statistical Reinforcement Learning.

Update: May 13, 2024

Please don't hesitate to make recommendations!!

| Category | Paper | Year | Note | Algorithm Name | Must-read (*) |
|---|---|---|---|---|---|
| Book | Reinforcement Learning: Theory and Algorithms | | | | |
| | Dynamic Programming and Optimal Control | | | | |
| | Bayesian RL: A Survey | | | | |
| | A Tutorial on Thompson Sampling | | | | |
| | Algorithms for Reinforcement Learning | | | | |
| | Adaptive Algorithms and Stochastic Approximations | | | | |
| | From Perturbation Analysis to Markov Decision Processes and Reinforcement Learning | | | | |
| | | | | | |
| Bandit | Stochastic Linear Optimization under Bandit Feedback | | | | |
| | | | | | |
| | | | | | |
| Exploration | Provably Efficient Reinforcement Learning with Linear Function Approximation | 2019 | | | |
| | Contextual Decision Processes with Low Bellman Rank are PAC-Learnable | 2016 | Bellman rank | | |
| | Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches | 2019 | witness rank | | |
| | Provably Efficient Exploration in Policy Optimization | 2024 | | Optimistic PPO | |
| | Provably Efficient Maximum Entropy Exploration | 2018 | Maximum-entropy policy computation | | |
| | Learning Montezuma's Revenge from a Single Demonstration | 2018 | Demonstration-Initialized Rollout Worker | | |
| | Go-Explore: a New Approach for Hard-Exploration Problems | 2019 | | Go-Explore | |
| | Episodic Curiosity through Reachability | 2019 | Bonus computation | | |
| | Curiosity-driven Exploration by Self-supervised Prediction | 2017 | | ICM | |
| | Large-Scale Study of Curiosity-Driven Learning | 2019 | | | |
| | Reinforcement Learning with Delayed, Composite, and Partially Anonymous Reward | 2023 | | | |
| | Model-based Reinforcement Learning and the Eluder Dimension | 2014 | | PSRL | |
| | Near-Optimal Reinforcement Learning in Polynomial Time | 1998 | | E3 | |
| | R-max – A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning | 2002 | | R-max | |
| | Near-optimal Regret Bounds for Reinforcement Learning | 2010 | | UCRL2 | |
| | PAC Model-Free Reinforcement Learning | 2006 | | Delayed Q-Learning | |
| | | | | | |
| | | | | | |
| Introduction and Evaluating RL Progress | Deep Reinforcement Learning at the Edge of the Statistical Precipice | | | | |
| | | | | | |
| Models and Representation Learning | Decoupling Representation Learning from Reinforcement Learning | | | | |
| | | | | | |
| | | | | | |
| Model-Free | Playing Atari with Deep Reinforcement Learning | 2013 | Deep Q-Learning with Experience Replay | | |
| | Deep Recurrent Q-Learning for Partially Observable MDPs | 2015 | | Deep Recurrent Q-Network | |
| | Dueling Network Architectures for Deep Reinforcement Learning | 2015 | | Dueling DQN | |
| | Deep Reinforcement Learning with Double Q-learning | 2015 | | Double DQN | |
| | Prioritized Experience Replay | 2015 | Double DQN with proportional prioritization | | |
| | Rainbow: Combining Improvements in Deep Reinforcement Learning | 2017 | Double DQN + prioritized replay + multi-step learning + distributional RL + noisy nets | | |
| | Asynchronous Methods for Deep Reinforcement Learning | 2016 | | A3C | |
| | Trust Region Policy Optimization | 2015 | | TRPO | |
| | High-Dimensional Continuous Control Using Generalized Advantage Estimation | 2016 | | GAE | |
| | Proximal Policy Optimization Algorithms | 2017 | | PPO | |
| | Emergence of Locomotion Behaviours in Rich Environments | 2017 | | Distributed PPO | |
| | Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation | 2017 | | ACKTR | |
| | Sample efficient actor-critic with experience replay | 2017 | | ACER | |
| | Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor | 2018 | | SAC | |
| | Deterministic Policy Gradient Algorithms | 2014 | | DPG | |
| | Continuous control with deep reinforcement learning | 2016 | | DDPG | |
| | Addressing Function Approximation Error in Actor-Critic Methods | 2018 | | TD3 | |
| | Q-Prop: Sample-Efficient Policy Gradient with An Off-Policy Critic | 2017 | | Adaptive Q-Prop | |
| | Action-dependent Control Variates for Policy Optimization via Stein's Identity | 2018 | PPO with control variate through Stein's identity | | |
| | The Mirage of Action-Dependent Baselines in Reinforcement Learning | 2018 | | | |
| | Bridging the Gap Between Value and Policy Based Reinforcement Learning | 2017 | | Unified PCL | |
| | Trust-PCL: An Off-Policy Trust Region Method for Continuous Control | 2018 | | Trust-PCL | |
| | A Natural Policy Gradient | 2001 | | NPG | |
| | Eligibility Traces for Off-Policy Policy Evaluation | 2000 | | Eligibility(Lambda) | |
| | Maximum a Posteriori Policy Optimisation | 2018 | | MPO | |
| | V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control | 2019 | | V-MPO | |
| | Reinforcement Learning with Deep Energy-Based Policies | 2017 | | Soft Q-learning | |
| | Diversity is All You Need: Learning Skills without a Reward Function | 2018 | | DIAYN | |
| | The Value Function Polytope in Reinforcement Learning | 2019 | | | |
| | An operator view of policy gradient methods | 2020 | | | |
| | Mirror Descent Policy Optimization | 2021 | | MDPO | |
| | Combining policy gradient and Q-learning | 2017 | | PGQL | |
| | The Reactor: A fast and sample-efficient Actor-Critic agent for Reinforcement Learning | 2018 | | Reactor | |
| | Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning | 2017 | | Interpolated PG | |
| | Equivalence Between Policy Gradients and Soft Q-Learning | 2017 | | | |
| | Evolution Strategies as a Scalable Alternative to Reinforcement Learning | 2017 | | Evolution Strategies | |
| | Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes | 2019 | | | |
| | Model-Free Linear Quadratic Control via Reduction to Expert Prediction | 2019 | | | |
| | | | | | |
| | | | | | |
| Model-based | Imagination-Augmented Agents for Deep Reinforcement Learning | 2017 | | I2A | |
| | Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning | 2017 | | | |
| | Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning | 2018 | | MVE-AC | |
| | Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion | 2018 | | STEVE | |
| | Model-Ensemble Trust-Region Policy Optimization | 2018 | | ME-TRPO | |
| | Model-Based Reinforcement Learning via Meta-Policy Optimization | 2018 | | MB-MPO | |
| | Recurrent World Models Facilitate Policy Evolution | 2018 | | MDN-RNN | |
| | Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm | 2017 | | AlphaZero | |
| | Thinking Fast and Slow with Deep Learning and Tree Search | 2017 | | Expert Iteration | |
| | Model-based Reinforcement Learning for Atari | 2020 | | SimPLe | |
| | Dual Representations for Dynamic Programming | 2008 | | | |
| | Learning to Simulate Complex Physics with Graph Networks | 2020 | | GNS | |
| | Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models | 2018 | | PETS | |
| | Planning with Diffusion for Flexible Behavior Synthesis | 2022 | Guided diffusion planning | | |
| | Action-Conditional Video Prediction using Deep Networks in Atari Games | 2015 | Encoding-Transformation-Decoding | | |
| | Temporal Difference Learning for Model Predictive Control | 2022 | | TD-MPC | |
| | Mastering Atari, Go, chess and shogi by planning with a learned model | 2020 | | MuZero | |
| | Dream to Control: Learning Behaviors by Latent Imagination | 2019 | | Dreamer | |
| | Adaptive Discretization for Model-Based Reinforcement Learning | 2020 | | | |
| | Model-based Reinforcement Learning and the Eluder Dimension | 2014 | | | |
| | | | | | |
| | | | | | |
| Linear MDP | Reinforcement Learning in Feature Space: Matrix Bandit, Kernels, and Regret Bound | 2019 | | MatrixRL | |