Timeline: the papers surveyed span Q4 2014 to Q1 2018.
Deep Reinforcement Learning

Human-level control through deep reinforcement learning (Deep Q-Network, DQN)
Mnih et al. 2015, Deepmind. http://www.davidqiu.com:8888/research/nature14236.pdf
This paper derives the Deep Q-Network from traditional Q-learning via three innovations: (1) a multilayer ANN is used to estimate Q-values; (2) an experience replay buffer is used to train the network, which makes training practical and helps the agent associate rare and distant rewards with their causes; (3) a target network - the system comprises two networks that collectively implement the agent, with the target network serving as a periodically updated copy of the online network that supplies the Q-value targets during training and so stabilises learning. This paper was a huge leap forward in RL capability and has been cited widely. The replay-and-target-network update is sketched below.
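As a rough illustration of the replay buffer and target network described above (a minimal sketch, not the authors' code; the network shape, hyperparameters and helper names are placeholder assumptions):

    # Minimal DQN-style update: sample from replay, bootstrap from a frozen target copy.
    import random
    from collections import deque

    import torch
    import torch.nn as nn

    obs_dim, n_actions, gamma = 4, 2, 0.99

    def make_net():
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    online, target = make_net(), make_net()
    target.load_state_dict(online.state_dict())      # target starts as a copy of the online net
    opt = torch.optim.Adam(online.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                     # experience replay buffer

    def store(s, a, r, s2, done):
        replay.append((s, a, r, s2, done))

    def train_step(batch_size=32):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)     # random sampling breaks temporal correlations
        s, a, r, s2, d = map(torch.tensor, zip(*batch))
        s, s2, r, d = s.float(), s2.float(), r.float(), d.float()
        q = online(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                         # bootstrap target comes from the frozen copy
            y = r + gamma * (1 - d) * target(s2).max(dim=1).values
        loss = nn.functional.mse_loss(q, y)
        opt.zero_grad(); loss.backward(); opt.step()

    def sync_target():                                # call every N environment steps
        target.load_state_dict(online.state_dict())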
Deep Recurrent Q-Learning for Partially Observable MDPs (Deep Recurrent Q-Network, DRQN)
Hausknecht et al. 2015, University of Texas at Austin. https://arxiv.org/abs/1507.06527
DRQN extends DQN by replacing the first post-convolutional fully-connected layer with a recurrent LSTM layer. This allows the agent to consider a longer history when modelling behaviour in particular states, and demonstrates that the system can also deal with partial observability, which is important because many real-world problems are POMDPs. Tested on Atari games with partial observability, with good results.

Asynchronous Methods for Deep Reinforcement Learning (A3C)
Mnih et al. 2016, Deepmind. https://arxiv.org/pdf/1602.01783.pdf
Replaces, and is demonstrated to be better than, DQN. The name A3C stands for Asynchronous Advantage Actor-Critic: Actor-Critic refers to the underlying architecture, Asynchronous to the several worker networks trained simultaneously, and Advantage to a formulation of the reward signal (how much better an action's return is than the state's baseline value) that helps exploration where rewards are poorly defined. This paper significantly advanced the state of the art on many computer-game benchmarks; the n-step return and advantage computation is sketched below.
Imagination-Augmented Agents for Deep Reinforcement Learning (I2A)
Weber et al. 2017, Deepmind. https://arxiv.org/abs/1707.06203
Model-free RL maps inputs to actions directly, but there are problems with generalisation due to the lack of an internal model. Model-based methods learn a model of the environment: inputs map to model configurations, and model configurations map to outputs. The agent can then reason inside the learned model without further inputs, akin to imagination, with the potential to learn from fewer experiences. The architecture is demonstrated on Sokoban and MiniPacman.

Rainbow: Combining Improvements in Deep Reinforcement Learning
Hessel et al. 2017, Deepmind. https://arxiv.org/abs/1710.02298
Focuses on improving overall performance by combining good ideas: which improvements to DQN can be combined effectively, and which are competing and incompatible? Rainbow is the empirically best combination found, measured by median human-normalised score, which is comparable across multiple problems. Techniques covered include Double DQN, Prioritised Experience Replay, Duelling networks, multi-step learning, Distributional RL (one of my favourites) and Noisy Nets. Ablation results (removing one technique at a time) showed Prioritised ER and multi-step learning to be most crucial, followed by Distributional RL; the benefit of Double DQN and Duelling was mixed (positive and negative) and overall roughly neutral.

A Distributional Perspective on Reinforcement Learning
Bellemare et al. 2017, Deepmind. https://arxiv.org/abs/1707.06887
The first recent work to model the value distribution rather than the expectation of value. This turns out to be really important, and it demonstrates benchmark-beating results; it is a fundamental rethink, summarised by the distributional Bellman equation below.
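For reference (a standard statement of the idea, my summary rather than a quotation from the paper), the distributional Bellman equation treats the return as a random variable Z whose expectation is the familiar Q-value:

    \[
      Z(s,a) \overset{D}{=} R(s,a) + \gamma\, Z(S',A'), \qquad Q(s,a) = \mathbb{E}\big[Z(s,a)\big].
    \]

In the paper's C51 instantiation, Z is a categorical distribution over a fixed set of support atoms, and the Bellman target is projected back onto that support.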
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Espeholt et al. 2018, Deepmind. https://arxiv.org/abs/1802.01561
Trains a single agent to solve several DeepMind Lab and Atari tasks simultaneously with shared parameters (learned weights). An improvement on A3C, with an emphasis on wall-clock training speed. The architecture is broken down into 'actors' and 'learners', enabling distributed generation of experience, and the V-trace off-policy correction mechanism keeps parallel learning stable. There are two outcomes: faster learning with better or similar performance, and demonstrated benefits of transfer between tasks.

Distributed Prioritized Experience Replay (Ape-X)
Horgan et al. 2018, Deepmind. https://arxiv.org/abs/1803.00933
As with IMPALA, another attempt to accelerate learning by separating actors and learners. Actors contribute to a shared experience replay buffer, and prioritisation is applied over the contents of that shared buffer. The paper relates the benefit to stochastic gradient descent optimisation theory (why prioritisation helps), and the method is a direct descendant of the original experience replay in DQN. Unlike IMPALA it does not attempt to learn multiple problems simultaneously, nor share learned weights. A sketch of proportional prioritised sampling follows.
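As a rough illustration of prioritised sampling (my sketch, not the authors' distributed implementation; the alpha, beta and epsilon values are placeholder choices): transitions are drawn with probability proportional to priority^alpha and the resulting bias is corrected with importance-sampling weights.

    # Proportional prioritised experience replay (single-process illustrative sketch).
    import numpy as np

    class PrioritizedReplay:
        def __init__(self, capacity: int, alpha: float = 0.6):
            self.capacity, self.alpha = capacity, alpha
            self.data, self.priorities = [], []

        def add(self, transition, priority: float = 1.0):
            if len(self.data) >= self.capacity:       # overwrite the oldest entry when full
                self.data.pop(0); self.priorities.pop(0)
            self.data.append(transition)
            self.priorities.append(priority)

        def sample(self, batch_size: int, beta: float = 0.4):
            p = np.array(self.priorities) ** self.alpha
            probs = p / p.sum()                       # P(i) proportional to priority^alpha
            idx = np.random.choice(len(self.data), batch_size, p=probs)
            weights = (len(self.data) * probs[idx]) ** (-beta)
            weights /= weights.max()                  # importance-sampling correction
            return [self.data[i] for i in idx], idx, weights

        def update_priorities(self, idx, td_errors, eps: float = 1e-6):
            for i, err in zip(idx, td_errors):        # priority tracks the magnitude of the TD error
                self.priorities[i] = abs(err) + eps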
RL with Episodic Memory
Deep RL approaches that maintain an explicit history (episodic memory) for replay.

Model-Free Episodic Control (MFEC)
Blundell et al. 2016, Deepmind. https://arxiv.org/abs/1606.04460
Hypothesis: a hippocampus-inspired episodic control module can achieve better performance on sequence learning tasks, and learn more quickly (from fewer episodes), by replaying episodes as simulations. The episodic module is a Q-value table with pruning. Tested on the Arcade Learning Environment (Atari games), it was shown to learn faster and achieve higher scores on several games than DQN and A3C. However, the authors expect the approach has limited ability to generalise across episodes due to the tabular buffer. The objective is to learn faster, from fewer experiences (episodes), by taking inspiration from the hippocampus; Kumaran et al. (2016) suggest that training on replayed experiences from DQN's replay buffer is similar to the replay of experiences from episodic memory during sleep in animals.

Neural Episodic Control (NEC)
Pritzel et al. 2017, Deepmind. https://arxiv.org/abs/1703.01988
Derived from DQN with new components. Improves both learning speed and final performance over DQN and MFEC. However, the Q-table (the DND) is still very large, so generalisation remains a concern. A sketch of the basic episodic-control lookup is given below.
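As a rough sketch of the tabular episodic-control idea behind these papers (my illustration, not the authors' code; the state embedding and the choice of k are placeholder assumptions): each state-action entry stores the best return ever observed, and novel states are estimated from their nearest stored neighbours.

    # Tabular episodic control: store best-ever returns, estimate unseen states by nearest neighbours.
    import numpy as np

    class EpisodicQTable:
        def __init__(self, n_actions: int, k: int = 5):
            self.n_actions, self.k = n_actions, k
            self.memory = [[] for _ in range(n_actions)]   # (state_embedding, return) pairs per action

        def write(self, state: np.ndarray, action: int, episodic_return: float):
            """Keep the maximum return observed for this (state, action)."""
            for i, (s, r) in enumerate(self.memory[action]):
                if np.allclose(s, state):
                    self.memory[action][i] = (s, max(r, episodic_return))
                    return
            self.memory[action].append((state, episodic_return))

        def estimate(self, state: np.ndarray, action: int) -> float:
            """Average the returns of the k nearest stored states for this action."""
            entries = self.memory[action]
            if not entries:
                return 0.0
            dists = [np.linalg.norm(s - state) for s, _ in entries]
            nearest = np.argsort(dists)[: self.k]
            return float(np.mean([entries[i][1] for i in nearest]))

        def act(self, state: np.ndarray) -> int:
            return int(np.argmax([self.estimate(state, a) for a in range(self.n_actions)]))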
Attention

Neural Turing Machine (NTM)
Graves et al. 2014, Deepmind. https://arxiv.org/abs/1410.5401
The key concept of this paper is to add a general-purpose working memory to ANNs. Although the name sounds very artificial (like a Turing machine with an infinite tape), the work is inspired by and similar to short-term or working memory in humans (this is mentioned in the introduction). The authors demonstrate that the network can learn several general-purpose algorithms that involve storing temporary variables in memory, such as copying and sorting sequences. The work is somewhat related to Long Short-Term Memory (LSTM) but differs in the way memory is utilised, being more flexible and powerful.

Differentiable Neural Computer (DNC)
Graves et al. 2016, Deepmind. https://www.nature.com/articles/nature20101
This paper extends and improves on the NTM. It discusses the benefits of a fully differentiable architecture for deep training of sophisticated memory systems: all components can be trained by backpropagation even though the layers are dissimilar in structure and function, and making the memory read and write heads fully differentiable is a significant achievement. Demonstrated tasks include learning simple programs and reasoning about graphs. Although not stated so explicitly, this architecture also aims to reproduce some of the capabilities of human general-purpose working memory and/or short-term memory. A sketch of the content-based addressing both models rely on follows.
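Both the NTM and DNC read from memory by content. A minimal sketch of content-based addressing (my illustration, omitting the location-based addressing, write heads and learned parameters):

    # Content-based addressing for a differentiable memory read.
    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def content_read(memory: np.ndarray, key: np.ndarray, beta: float = 5.0) -> np.ndarray:
        """memory: (N, W) matrix of N slots of width W; key: (W,) query emitted by the controller.
        beta sharpens the focus. Returns a (W,) read vector: a soft mixture over all slots."""
        norms = np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8
        similarity = memory @ key / norms             # cosine similarity of the key to each slot
        weights = softmax(beta * similarity)          # differentiable addressing weights
        return weights @ memory                       # weighted read over all slots

    memory = np.random.randn(8, 16)                   # 8 slots, width 16
    print(content_read(memory, memory[3]).shape)      # (16,), read concentrated on slot 3

Because the read is a weighted sum rather than a hard index, gradients flow through the addressing weights, which is what makes end-to-end training of the memory possible.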
Attention Is All You Need (Transformer)
Vaswani et al. 2017, Google Brain. https://arxiv.org/abs/1706.03762
A simplified architecture that replaces convolution and recurrence with attention alone. Despite the reduced architectural complexity, it beats natural language machine translation benchmarks by a considerable margin; hence the claim that attention is a very powerful tool and is "all you need". The core scaled dot-product attention operation is sketched below.
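A minimal sketch of scaled dot-product attention (my illustration, single head, no masking or learned projections):

    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    import numpy as np

    def scaled_dot_product_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
        """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # similarity of every query to every key
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                              # mix the values by attention weight

    Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
    print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)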
Hippocampus Inspired - Mixed Biological/ML Studies
It is significant that biologists and machine learning researchers are working together to understand neuroscience and to improve ML algorithms. This work can directly influence future approaches to RL, and may have been important in some of the RL approaches summarised elsewhere in this table, in particular those referencing hippocampal concepts such as 'episodic' learning.

The Hippocampus as a Predictive Map
Stachenfeld et al. 2017, multiple neuroscience institutes & Deepmind. https://www.nature.com/articles/nn.4650
This study looks at the function of the hippocampus from an RL perspective. The authors find that the hippocampus forms low-dimensional representations that are effective at making predictions, differing from the traditional interpretation of grid cells as simply representing spatial locations.

The Successor Representation in Human Reinforcement Learning
Momennejad et al. 2017, multiple neuroscience institutes & Deepmind. https://www.nature.com/articles/s41562-017-0180-8
Looks at the role of the hippocampus in human RL in terms of model-free versus model-based approaches, and shows evidence for a combination of the two called the Successor Representation (SR), defined below. This is again a combined biological/ML study, in this case from a psychological perspective.
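For reference (the standard definition, my summary rather than a quotation from the paper), the successor representation stores expected discounted future state occupancies, so state values factor into a learned predictive map times a reward vector:

    \[
      M(s, s') = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{1}[s_t = s'] \,\middle|\, s_0 = s\right],
      \qquad
      V(s) = \sum_{s'} M(s, s')\, R(s').
    \]

Because M can be learned incrementally like a model-free value function while R can be re-estimated quickly, the SR sits between model-free and model-based control.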
Dorsal Hippocampus Contributes to Model-based Planning
Miller et al. 2017, multiple neuroscience institutes & Deepmind. https://www.nature.com/articles/nn.4613
An investigation into the neural mechanisms of planning, in terms of action selection, in the dorsal hippocampus, a structure long believed to be important for this function. The results suggest that model-based planning is employed. Another example of a neuroscientific analysis that can fuel new RL algorithms.
Few-shot Learning

Siamese Neural Networks for One-Shot Image Recognition
Koch et al. 2015, University of Toronto. https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf
Inspired by earlier work on one-shot learning by Fei-Fei Li and by Lake. A deep Siamese CNN is used to compare a pair of images and verify whether they belong to the same class. Applied to verification of unseen classes in the Omniglot and MNIST datasets, it shows a capability for a type of one-shot learning. One of the first of the recent papers on one-shot learning, and widely used for comparison in subsequent papers.

Human-level Concept Learning Through Probabilistic Program Induction
Lake et al. 2015, New York University. http://web.mit.edu/cocosci/Papers/Science-2015-Lake-1332-8.pdf
This paper triggered renewed interest in few-shot learning and has become a foundational template for testing such problems. The main concept is to quickly learn new classes that are composed of parts that have already been learned, cast as "learning to learn": an ability to generalise from few exposures. Tests cover classification, generation of novel exemplars of known classes, and generation or 'invention' of exemplars of completely novel classes. The main dataset used is Omniglot, and the underlying algorithm is Bayesian (probabilistic program induction).

One-Shot Generalization in Deep Generative Models
Rezende et al. 2016, Deepmind. https://arxiv.org/pdf/1603.05106
Extends Lake 2015 by incorporating deep learning and using feedback and attentional mechanisms for both inference and generation. The authors combine "the representational power of deep neural networks embedded within hierarchical latent variable models, with the inferential power of approximate Bayesian reasoning", again focusing on the Omniglot dataset. The system is more general, but requires more training data.

Matching Networks for One Shot Learning
Vinyals et al. 2016, Deepmind. https://arxiv.org/pdf/1606.04080
Takes a different approach to one-shot learning from Lake 2015; together the two have defined the templates for subsequent papers. The system learns to match an unlabelled exemplar against a small labelled support set, and is then able to match previously unseen classes, so it is also framed as a problem of learning to learn. A CNN embedding function is used, with tests on the Omniglot and ImageNet image datasets as well as a language task on the Penn Treebank dataset.

Optimization as a Model for Few-shot Learning
Ravi & Larochelle 2017, Twitter. https://openreview.net/pdf?id=rJY0-Kcll
Extends Vinyals et al. by using an LSTM meta-learner: "Rather than training a single model over multiple episodes, the LSTM meta-learner learns to train a custom model for each episode."

Prototypical Networks for Few-shot Learning
Snell et al. 2017, University of Toronto / Twitter. https://arxiv.org/pdf/1703.05175
A variation on Vinyals et al. using prototypical networks: a CNN embedding function transforms inputs into a metric space where a class 'prototype' is the mean of its small support set, and classification is done by finding the nearest prototype in that embedded space. They also show performance on zero-shot tasks, where the prototype is derived without labelled support examples. Tested on several image datasets: Omniglot, miniImageNet and Caltech-UCSD Birds (CUB). A sketch of the prototype computation is given at the end of this section.

A Generative Vision Model That Trains with High Data Efficiency and Breaks Text-based CAPTCHAs (RCN)
George et al. 2017, Vicarious. http://science.sciencemag.org/content/early/2017/10/25/science.aag2612.full
This system is not focussed on few-shot learning specifically, but it is noteworthy in that it requires much smaller training sets, up to 300-fold less data than deep networks on comparable tasks. It approaches image recognition by analysing texture and shape separately; for the latter, a hierarchical Bayesian model with feedback and lateral connections is utilised. Results are demonstrated on MNIST, ICDAR and a variety of CAPTCHAs.

Meta-Learning for Semi-Supervised Few-Shot Classification
Ren et al. 2018, Google Brain. https://lld-workshop.github.io/papers/LLD_2017_paper_40.pdf
An extension of the prototypical networks of Snell 2017 to the semi-supervised setting, i.e. some examples in the small support sets are unlabelled.
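As a rough sketch of the prototypical-network classification rule referenced above (my illustration; the identity embedding and toy episode stand in for the paper's learned CNN and real datasets):

    # Prototypical networks: classify a query by the nearest class-mean in embedding space.
    import numpy as np

    def embed(x: np.ndarray) -> np.ndarray:
        """Placeholder embedding; in the paper this is a learned CNN."""
        return x

    def prototypes(support_x: np.ndarray, support_y: np.ndarray) -> dict:
        """Prototype of each class = mean of its embedded support examples."""
        return {c: embed(support_x[support_y == c]).mean(axis=0)
                for c in np.unique(support_y)}

    def classify(query: np.ndarray, protos: dict) -> int:
        """Assign the query to the class with the nearest (Euclidean) prototype."""
        dists = {c: np.linalg.norm(embed(query) - p) for c, p in protos.items()}
        return int(min(dists, key=dists.get))

    # 2-way, 3-shot toy episode with 2-D features.
    support_x = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.1],    # class 0
                          [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])   # class 1
    support_y = np.array([0, 0, 0, 1, 1, 1])
    print(classify(np.array([0.95, 1.05]), prototypes(support_x, support_y)))  # -> 1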