Row 1 (column headers)
Topic
Question - max 95 characters
Answer 1 - max 60 characters
Answer 2 - max 60 characters
Answer 3 - max 60 characters
Answer 4 - max 60 characters
Time limit (sec) - 5, 10, 20, 30, 60, 90 or 120 secs
Correct answer(s) - choose at least one
Explanation

Row 2
Topic: Test Question
Question: The name of the vegan/vegetarian boat was ...
Answer 1: the Anne Frank
Answer 2: the Wilhelmus
Answer 3: the Johan Cruyff
Answer 4: the Rich Sutton
Time limit (sec): 60
Correct answer(s): 2
Explanation: "Wilhelmus" is the name of the Dutch national anthem.

Row 3
Topic: Generalisation in RL
Question: The bias-variance decomposition for RL is the same as for SL. True or False?
Answer 1: true
Answer 2: false
Time limit (sec): 60
Correct answer(s): 2
Explanation: False: there is no actual "bias-variance" trade-off in RL. There is an analogous trade-off between a learning algorithm that is sufficiently rich and one that is not too complex.

Row 4
Topic: Generalisation in RL
Question: In general, the L2-loss is not used in the bias-overfitting decomposition for RL because ...
Answer 1: there are target and behavior policies in RL
Answer 2: the i.i.d. assumption is violated
Answer 3: in RL, errors are w.r.t. an optimal policy
Answer 4: RL estimators are unbiased
Time limit (sec): 60
Correct answer(s): 3
Explanation: RL is not about estimating a target Y, but about optimizing a policy.

Row 5
Topic: Generalisation in RL
Question: Generalisation in RL can be improved with ...
Answer 1: the type of function approximator
Answer 2: changing the objective function
Answer 3: using abstract representations
Answer 4: adding more data
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

Row 6
Topic: Intro to Bandit Algorithms
Question: In general, the goals in RL and bandits (adversarial, stochastic, contextual, etc.) ...
Answer 1: are the same
Answer 2: are to minimize regret
Answer 3: both involve a policy
Answer 4: have no similarities
Time limit (sec): 60
Correct answer(s): 3
Explanation: Although RL can be about minimizing regret, standard RL is not about regret but about finding a policy that minimizes expected 'loss' after training. Bandits are about finding a policy (arm to play).

Row 7
Topic: Intro to Bandit Algorithms
Question: In the stochastic bandit setting, how can the loss (negative reward) be characterised?
Answer 1: It is set by the environment
Answer 2: It is sampled from a known distribution
Answer 3: It is sampled from an initially unknown distribution
Time limit (sec): 60
Correct answer(s): 3
Explanation: If 'set' is interpreted as 'chosen by', then answer 1 describes the adversarial bandit setting. The distribution is not known initially; it could e.g. be the action of a user, like clicking some ad.

Row 8
Topic: Intro to Bandit Algorithms
Question: Why are bandits useful for RL?
Answer 1: Learning to balance exploration/exploitation
Answer 2: Simpler models allow for theoretical analysis
Answer 3: Algorithm design principles may be the same
Answer 4: Both deal with uncertainty about the environment
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

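As an illustration of answers 1 and 2 (my own sketch, not part of the quiz): a minimal stochastic-bandit agent that balances exploration and exploitation with the UCB1 rule. The Gaussian arms and the exploration constant are arbitrary choices for the sketch.

    # Minimal UCB1 sketch for a stochastic bandit (illustrative only).
    import math
    import random

    def ucb1(arm_means, steps=1000, c=2.0):
        k = len(arm_means)
        counts = [0] * k          # N(a): how often each arm was pulled
        values = [0.0] * k        # empirical mean reward per arm
        total_reward = 0.0
        for t in range(1, steps + 1):
            if t <= k:            # pull every arm once to initialise
                a = t - 1
            else:                 # optimism in the face of uncertainty
                a = max(range(k),
                        key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]))
            r = random.gauss(arm_means[a], 1.0)          # sample reward of the chosen arm
            counts[a] += 1
            values[a] += (r - values[a]) / counts[a]     # incremental mean update
            total_reward += r
        return counts, values, total_reward

    counts, values, _ = ucb1([0.1, 0.5, 0.9])
    print(counts)   # the best arm should dominate the pull counts
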
Row 9
Topic: Fun
Question: When visiting Amsterdam, what may cause you to try eating raw herring?
Answer 1: Optimism in the face of uncertainty
Answer 2: It was in a quiz question I did not understand
Answer 3: I couldn't find any cheese or stroopwafels
Answer 4: Pure exploration
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4
Explanation: Of course all are good answers, as long as you try it!

Row 10
Topic: Policy Gradients
Question: We are interested in policy gradients because ...
Answer 1: They allow us to do Monte Carlo estimation
Answer 2: They allow us to condition the policy on a set of parameters
Answer 3: They work with gradient descent so we can use SL approaches
Answer 4: They always come with a baseline
Time limit (sec): 60
Correct answer(s): 2
Explanation: There are also other approaches to condition a policy on a set of parameters (e.g. evolutionary approaches), and we can also do MC estimation with other RL approaches. Technically, policy gradients work with gradient ascent. Also, SL approaches cannot generally be used in RL, since they minimize error w.r.t. a label instead of w.r.t. the optimal policy.

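For reference (textbook material, not quoted from the lecture): the policy is conditioned on parameters theta and improved by gradient ascent on the expected return, with a Monte Carlo (REINFORCE-style) estimate of the gradient:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],
    \qquad \theta \leftarrow \theta + \alpha\, \widehat{\nabla_\theta J(\theta)}
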
Row 11
Topic: Policy Gradients
Question: When does a baseline not introduce bias in policy gradient?
Answer 1: It should be a critic
Answer 2: It should not depend on the action
Answer 3: It should depend on the state
Answer 4: Its gradient should be 0
Time limit (sec): 60
Correct answer(s): 2
Explanation: It could, but does not have to, be a critic. It could, but does not have to, depend on the state. Its own gradient need not be 0; rather, the expectation of the baseline times the gradient of the log-policy should be 0 (see the derivation sketched below).

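A short reminder of why answer 2 is enough (standard derivation, not part of the quiz): for a baseline b(s) that does not depend on the action,

    \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
    = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
    = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
    = b(s)\, \nabla_\theta 1 = 0,

so subtracting it changes the variance of the gradient estimate but not its expectation.
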
Row 12
Topic: Policy Gradients
Question: Recent policy gradient algorithms (PPO, TRPO, SAC, DDPG, TD3) benefit from ...
Answer 1: Small and controlled learning steps
Answer 2: Reparameterization
Answer 3: Exploration based on entropy
Answer 4: The deadly triad
Time limit (sec): 60
Correct answer(s): 1, 2, 3
Explanation: These algorithms benefit from some or all of these. The deadly triad is something you suffer from, not benefit from.

Row 13
Topic: TD with function approximation
Question: The TD-error ...
Answer 1: has high variance
Answer 2: is bootstrapped
Answer 3: can only be determined at the end of an episode
Answer 4: cannot be estimated
Time limit (sec): 60
Correct answer(s): 2

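For reference, the TD-error that answer 2 refers to; it bootstraps on the current estimate of the next state's value rather than waiting for the end of the episode:

    \delta_t = r_{t+1} + \gamma\, V_\theta(s_{t+1}) - V_\theta(s_t)
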
Row 14
Topic: TD with function approximation
Question: TD with function approximation ...
Answer 1: always minimizes the error of the value function
Answer 2: always minimizes the TD error
Answer 3: is always a semi-gradient method
Answer 4: can use a true gradient
Time limit (sec): 60
Correct answer(s): 4
Explanation: It can use a 'true' gradient of e.g. the projected Bellman error.

Row 15
Topic: TD with function approximation
Question: How does DQN deal with the "deadly triad" (f.a., semi-gradient bootstrapping and off-policy)?
Answer 1: Training on a "replay buffer" of past experiences
Answer 2: By using a target Q network and a behavior Q network
Answer 3: By using a true gradient
Answer 4: By using a deep neural network
Time limit (sec): 60
Correct answer(s): 1, 2
Explanation: In DQN, the sampling mechanism and the learning mechanism are (somewhat) 'decorrelated' by a replay buffer and separate networks. It uses a semi-gradient, and its use of a DNN is not there to overcome the deadly triad but rather 'causes' it.

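A minimal PyTorch-style sketch of the two mechanisms in answers 1 and 2 (replay buffer plus separate target network); network sizes, optimizer and hyperparameters are arbitrary choices, not the lecture's.

    import random
    from collections import deque
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    obs_dim, n_actions, gamma = 4, 2, 0.99

    def make_q_net():
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    q_net = make_q_net()                             # behaviour network (trained)
    target_net = make_q_net()
    target_net.load_state_dict(q_net.state_dict())   # periodically synced copy
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                    # replay buffer of (s, a, r, s_next, done)

    def train_step(batch_size=32):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)    # sampling decorrelates transitions
        s, a, r, s2, done = map(torch.tensor, zip(*batch))
        q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                        # semi-gradient: no grad through target
            target = r.float() + gamma * (1 - done.float()) * \
                     target_net(s2.float()).max(1).values   # bootstrapped target
        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
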
Row 16
Topic: Fun
Question: "I got my best idea this week ..."
Answer 1: while waiting for the coffee machine
Answer 2: while getting seated when the boat approached a bridge
Answer 3: when I decided to show up on time for the quiz
Answer 4: when someone had a great question
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

Row 17
Topic: MCTS
Question: Which of these are true?
Answer 1: MCTS only works if the rollout leaf nodes are (game tree) terminal states
Answer 2: MCTS requires learning
Answer 3: MCTS rollouts can be used both for training and for evaluation
Answer 4: Child nodes are used to determine the value of a parent node
Time limit (sec): 60
Correct answer(s): 3, 4
Explanation: MCTS can use value estimates if the (game tree) terminal nodes are not reached.

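For background (standard MCTS material, not quoted from the lecture), the usual UCT rule shows how child statistics determine selection and, via the backed-up averages, the value of the parent:

    a^{*} = \arg\max_a \left( \frac{W(s,a)}{N(s,a)} + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \right)

where W(s,a) and N(s,a) are the total backed-up value and visit count of the child reached by action a, and N(s) is the parent's visit count.
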
Row 18
Topic: MCTS
Question: In which way can MCTS use a policy function?
Answer 1: To know the best rollout depth
Answer 2: Playing likely moves rather than random moves during rollout
Answer 3: To direct the search during play
Answer 4: To update parent nodes during backup
Time limit (sec): 60
Correct answer(s): 2, 3

Row 19
Topic: MCTS
Question: What is a key difference between AlphaZero (2018) and MuZero (2020)?
Answer 1: AlphaZero was only evaluated on Go, MuZero on multiple games
Answer 2: MuZero clips outputs to [-1, 1]
Answer 3: MuZero can learn the rules, whereas for AlphaZero these must be given
Answer 4: MuZero was not trained on a simulator
Time limit (sec): 60
Correct answer(s): 3

Row 20
Topic: Exploration/Exploitation
Question: What is wrong with the exploration in epsilon-greedy Q-learning? (according to the lecture)
Answer 1: not effective in covering the state space
Answer 2: it is based on biased estimates of Q
Answer 3: it is based on optimism in the face of uncertainty
Answer 4: the policy changes at each step
Time limit (sec): 60
Correct answer(s): 1, 2, 4
Explanation: According to the lecture.

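For reference, plain epsilon-greedy action selection over Q estimates (a generic sketch, not the lecture's code); it explores uniformly at random, with no notion of uncertainty or optimism, which is why it covers the state space poorly.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """Pick a random action with prob. epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))              # undirected exploration
        return max(range(len(q_values)), key=lambda a: q_values[a])

    print(epsilon_greedy([0.2, 0.7, 0.1]))
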
Row 21
Topic: Exploration/Exploitation
Question: How can state visit counters be used to drive exploration?
Answer 1: To estimate uncertainty
Answer 2: To normalize gradients
Answer 3: To "drive" exploration
Answer 4: To determine the size of the replay buffer
Time limit (sec): 60
Correct answer(s): 1, 3

Row 22
Topic: Exploration/Exploitation
Question: What are ways to increase exploration in optimistic Q-learning?
Answer 1: An entropy term in the Q-value update
Answer 2: Initialization of the Q-value estimates
Answer 3: An exploration bonus in the Q-value update
Answer 4: Using a (pseudo) state visit counter
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: There is no entropy term in optimistic Q-learning. The others are used; specifically, the counters are used to estimate uncertainty.

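A tabular sketch combining answers 2-4: optimistic initial Q-values, a state-action visit counter, and a count-based exploration bonus in the update. The bonus form beta/sqrt(N) and all constants are common illustrative choices, not necessarily the lecture's.

    from collections import defaultdict
    import math

    ALPHA, GAMMA, BETA, Q_INIT = 0.1, 0.95, 0.5, 10.0   # illustrative constants

    Q = defaultdict(lambda: Q_INIT)      # answer 2: optimistic initialisation
    N = defaultdict(int)                 # answer 4: (pseudo) state-action visit counter

    def update(s, a, r, s_next, actions):
        N[(s, a)] += 1
        bonus = BETA / math.sqrt(N[(s, a)])              # answer 3: exploration bonus
        target = r + bonus + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
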
Row 23
Topic: Fun
Question: What does it mean to be "Schmidhuberd"?
Answer 1: When you (intentionally?) exclude a certain author from prior work
Answer 2: When you get accused of incorrect attribution during your own talk at a conference
Answer 3: When your learning agent turns against you
Answer 4: When your paper gets rejected for not including a citation
Time limit (sec): 60
Correct answer(s): 2

Row 24
Topic: DMBRL
Question: What is true of MBRL?
Answer 1: MBRL is used in most real-life applications
Answer 2: In Dyna-Q, we obtain planning without extra computational overhead ("for free")
Answer 3: MBRL can only be done with deep neural networks to 'dream' about the world
Answer 4: The agent maintains a state and transition function internally
Time limit (sec): 60
Correct answer(s): 4
Explanation: The 'for free' in Dyna-Q is w.r.t. samples, not computation.

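A Dyna-Q-style sketch of the point in the explanation: each real transition is stored in a learned model and then replayed for extra planning updates, which costs extra computation but no extra environment samples. The loop structure is standard Dyna-Q; the constants are arbitrary.

    import random

    ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 20   # illustrative constants
    Q, model = {}, {}                          # tabular Q-values and learned model

    def q(s, a):
        return Q.get((s, a), 0.0)

    def dyna_q_step(s, a, r, s_next, actions):
        # (1) direct RL update from the real sample
        Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * max(q(s_next, b) for b in actions) - q(s, a))
        # (2) model learning: remember what the environment did
        model[(s, a)] = (r, s_next)
        # (3) planning: extra updates from the model, 'free' in samples but not in compute
        for _ in range(N_PLANNING):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            Q[(ps, pa)] = q(ps, pa) + ALPHA * (pr + GAMMA * max(q(ps_next, b) for b in actions) - q(ps, pa))
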
Row 25
Topic: DMBRL
Question: What are two ways to deal with imperfect models in MBRL?
Answer 1: Probabilistic inference & tree-based planning
Answer 2: Latent models & end-to-end learning and planning
Answer 3: Latent models & probabilistic inference
Answer 4: Short rollouts and re-planning & probabilistic inference
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: Tree-based planning is in itself not robust against weak models.

Row 26
Topic: DMBRL
Question: Latent dynamics models are characterized by ...
Answer 1: Compression of the observation space into a smaller latent space
Answer 2: Predicting next observations and using these for planning
Answer 3: The usage of Gaussian processes
Answer 4: Using a perfect model
Time limit (sec): 60
Correct answer(s): 1
Explanation: Gaussian processes are used for probabilistic inference, and these models are not perfect.

Row 27
Topic: Symmetries in RL
Question: When to use MDP homomorphic networks?
Answer 1: If you want sample-efficient learning
Answer 2: If all actions in the environment are reversible
Answer 3: If there are symmetries in the environment
Answer 4: If you can construct equivariant layers by hand
Time limit (sec): 60
Correct answer(s): 1, 3
Explanation: Reversibility of actions does not necessarily give a symmetry. You do not need to construct equivariant layers by hand; use the symmetrizer.

Row 28
Topic: Symmetries in RL
Question: When is a neural network f equivariant to the projection g? For any input x ...
Answer 1: f(x) = f(g(x))
Answer 2: f(g(x)) = g'(f(x))
Answer 3: x(g(f)) = g(x(f))
Answer 4: f'(g(x)) = g(f(x))
Time limit (sec): 60
Correct answer(s): 2
Explanation: Here g' denotes the transformation g acting on the output space of f. Answer 1 is invariance, not equivariance. The rest is bogus.

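A quick numerical illustration of the two definitions (my own example, not from the quiz): an elementwise ReLU is equivariant to permutations of its input, while a sum is invariant.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    P = np.eye(5)[rng.permutation(5)]             # a permutation g acting on the input

    f = lambda v: np.maximum(v, 0.0)              # elementwise ReLU
    print(np.allclose(f(P @ x), P @ f(x)))        # True: equivariance, f(g(x)) = g'(f(x))
    print(np.isclose(np.sum(P @ x), np.sum(x)))   # True: invariance, f(g(x)) = f(x)
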
Row 29
Topic: Symmetries in RL
Question: In multi-agent RL ...
Answer 1: symmetries can be exploited purely locally
Answer 2: all agents have to know everything
Answer 3: symmetries cannot be exploited
Answer 4: symmetries can be exploited using equivariance constraints and message passing
Time limit (sec): 60
Correct answer(s): 4

Row 30
Topic: Hierarchical RL
Question: Names of hierarchical RL frameworks include ...
Answer 1: Feudal RL
Answer 2: Manager-worker RL
Answer 3: Promotor-PhD student RL
Answer 4: Options
Time limit (sec): 60
Correct answer(s): 1, 3

Row 31
Topic: Hierarchical RL
Question: In hierarchical RL, options ...
Answer 1: are temporally extended actions that have their own policy
Answer 2: cannot have predefined policies
Answer 3: have initiation and termination conditions
Answer 4: can be used to buy and sell stocks
Time limit (sec): 60
Correct answer(s): 1, 3

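The ingredients of answers 1 and 3, written out as a small container (the field names are my own, not the lecture's notation): an option bundles an initiation set, its own policy, and a termination condition.

    from dataclasses import dataclass
    from typing import Any, Callable, Set

    @dataclass
    class Option:
        initiation_set: Set[Any]               # states where the option may start
        policy: Callable[[Any], Any]           # the option's own (intra-option) policy
        termination: Callable[[Any], float]    # probability of terminating in a state

    # e.g. an option that is available everywhere, always moves 'right',
    # and terminates in state 'goal'
    go_right = Option(
        initiation_set={"s0", "s1", "goal"},
        policy=lambda s: "right",
        termination=lambda s: 1.0 if s == "goal" else 0.0,
    )
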
Row 32
Topic: Hierarchical RL
Question: Challenges within hierarchical RL include ...
Answer 1: optimality of the global solution when combining local solutions
Answer 2: finding a good way to split problems into smaller problems
Answer 3: reuse of solutions
Answer 4: getting sub-policies to listen to their managers
Time limit (sec): 60
Correct answer(s): 1, 2, 3

Row 33
Topic: World models & object-centric learning
Question: What are some advantages of world models?
Answer 1: It is computationally efficient to obtain the optimal policy from them
Answer 2: Improved data efficiency
Answer 3: Can be reused for many tasks
Answer 4: World models can be causal
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: World models trade off computation for sampling.

Row 34
Topic: World models & object-centric learning
Question: Object-centric learning is about ...
Answer 1: labelling a lot of data by hand
Answer 2: automatically discovering representations of objects
Answer 3: understanding what actions are safe
Answer 4: focusing on a lot of things at the same time
Time limit (sec): 60
Correct answer(s): 2

Row 35
Topic: Fun
Question: The Kantorovich metric ...
Answer 1: tells the model to focus on the present, just like the Buddha
Answer 2: has a bunch of different names, just like the devil
Answer 3: is about living dangerously, just like Austin Powers
Answer 4: suddenly appears at unexpected times and places, just like Juergen Schmidhuber
Time limit (sec): 60
Correct answer(s): 2

Row 36
Topic: State similarity metrics
Question: State similarities are about ...
Answer 1: countries with a similar culture
Answer 2: observed state features
Answer 3: states obtained by bi-simulation
Answer 4: transitions and rewards
Time limit (sec): 60
Correct answer(s): 4
Explanation: The focus in the talk was not so much on feature space but rather on bisimulation equivalence relations, which are defined over transitions and rewards.

Row 37
Topic: State similarity metrics
Question: A problem of using state equivalence relations that are bisimulations is ...
Answer 1: it does not take into account Q or V values
Answer 2: it can be brittle, because it uses equality for transition probabilities
Answer 3: it requires a model
Answer 4: it is a metric that requires solving an optimal flow problem
Time limit (sec): 60
Correct answer(s): 2, 3
Explanation: Bisimulation is defined over rewards but incorporates V and Q by co-induction. It is not itself a metric, but a metric that approximates it does require solving an optimal flow problem.

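For reference, one standard formulation of the bisimulation metric that answers 2 and 4 allude to; instead of requiring transition probabilities to be exactly equal, it compares them with the Kantorovich (optimal transport, or 'optimal flow') distance W_d. The weights c_R and c_T are a modelling choice, with c_T < 1 so the fixed point exists:

    d(s, t) = \max_a \Big( c_R\, \big| R(s,a) - R(t,a) \big| + c_T\, W_d\big( P(\cdot \mid s, a),\, P(\cdot \mid t, a) \big) \Big)
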