Row 1 (column headers)
Topic
Question - max 95 characters
Answer 1 - max 60 characters
Answer 2 - max 60 characters
Answer 3 - max 60 characters
Answer 4 - max 60 characters
Time limit (sec) - 5, 10, 20, 30, 60, 90 or 120 secs
Correct answer(s) - choose at least one
Explanation

Row 2
Topic: Test Question
Question: The name of the vegan/vegetarian boat was ...
Answer 1: the Anne Frank
Answer 2: the Wilhelmus
Answer 3: the Johan Cruyff
Answer 4: the Rich Sutton
Time limit (sec): 60
Correct answer(s): 2
Explanation: "Wilhelmus" is the name of the Dutch national anthem.

Row 3
Topic: Generalisation in RL
Question: The bias-variance decomposition for RL is the same as for SL. True or False?
Answer 1: true
Answer 2: false
Time limit (sec): 60
Correct answer(s): 2
Explanation: False: there is no actual "bias-variance" trade-off in RL. There is an analogous trade-off between a learning algorithm that is sufficiently rich and one that is not too complex.

Row 4
Topic: Generalisation in RL
Question: In general, the L2-loss is not used in the bias-overfitting decomposition for RL because ...
Answer 1: there are target and behavior policies in RL
Answer 2: the i.i.d. assumption is violated
Answer 3: in RL, errors are w.r.t. an optimal policy
Answer 4: RL estimators are unbiased
Time limit (sec): 60
Correct answer(s): 3
Explanation: RL is not about estimating a target Y, but about optimizing a policy.

Row 5
Topic: Generalisation in RL
Question: Generalisation in RL can be improved with ...
Answer 1: the type of function approximator
Answer 2: changing the objective function
Answer 3: using abstract representations
Answer 4: adding more data
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

Row 6
Topic: Intro to Bandit Algorithms
Question: In general, the goals in RL and bandits (adversarial, stochastic, contextual, etc.) ...
Answer 1: are the same
Answer 2: are to minimize regret
Answer 3: both involve a policy
Answer 4: have no similarities
Time limit (sec): 60
Correct answer(s): 3
Explanation: Although RL can be about minimizing regret, standard RL is not about regret but about finding a policy that minimizes expected 'loss' after training. Bandits are about finding a policy (arm to play).

Row 7
Topic: Intro to Bandit Algorithms
Question: In the stochastic bandit setting, how can the loss (negative reward) be characterised?
Answer 1: It is set by the environment
Answer 2: It is sampled from a known distribution
Answer 3: It is sampled from an initially unknown distribution
Time limit (sec): 60
Correct answer(s): 3
Explanation: If 'set' is interpreted as 'chosen by', then answer 1 describes the adversarial bandit setting. The distribution is not known initially; it could e.g. be the action of a user, like clicking some ad.

Row 8
Topic: Intro to Bandit Algorithms
Question: Why are bandits useful for RL?
Answer 1: Learning to balance exploration/exploitation
Answer 2: Simpler models allow for theoretical analysis
Answer 3: Algorithm design principles may be the same
Answer 4: Both deal with uncertainty about the environment
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

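As an illustration of answers 1 and 2 (my own sketch, not part of the quiz): a minimal stochastic-bandit agent that balances exploration and exploitation with the UCB1 rule. The Gaussian arms and the exploration constant are arbitrary choices for the sketch.

    # Minimal UCB1 sketch for a stochastic bandit (illustrative only).
    import math
    import random

    def ucb1(arm_means, steps=1000, c=2.0):
        k = len(arm_means)
        counts = [0] * k          # N(a): how often each arm was pulled
        values = [0.0] * k        # empirical mean reward per arm
        total_reward = 0.0
        for t in range(1, steps + 1):
            if t <= k:            # pull every arm once to initialise
                a = t - 1
            else:                 # optimism in the face of uncertainty
                a = max(range(k),
                        key=lambda i: values[i] + math.sqrt(c * math.log(t) / counts[i]))
            r = random.gauss(arm_means[a], 1.0)          # sample reward of the chosen arm
            counts[a] += 1
            values[a] += (r - values[a]) / counts[a]     # incremental mean update
            total_reward += r
        return counts, values, total_reward

    counts, values, _ = ucb1([0.1, 0.5, 0.9])
    print(counts)   # the best arm should dominate the pull counts
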
Row 9
Topic: Fun
Question: When visiting Amsterdam, what may cause you to try eating raw herring?
Answer 1: Optimism in the face of uncertainty
Answer 2: It was in a quiz question I did not understand
Answer 3: I couldn't find any cheese or stroopwafels
Answer 4: Pure exploration
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4
Explanation: Of course all are good answers, as long as you try it!

Row 10
Topic: Policy Gradients
Question: We are interested in policy gradients because ...
Answer 1: They allow us to do Monte Carlo estimation
Answer 2: They allow us to condition the policy on a set of parameters
Answer 3: They work with gradient descent so we can use SL approaches
Answer 4: They always come with a baseline
Time limit (sec): 60
Correct answer(s): 2
Explanation: There are also other approaches to condition a policy on a set of parameters (e.g. evolutionary approaches), and we can also do MC estimation with other RL approaches. Technically, policy gradients work with gradient ascent. Also, SL approaches cannot generally be used in RL, since they minimize error w.r.t. a label instead of w.r.t. the optimal policy.

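For reference (textbook material, not quoted from the lecture): the policy is conditioned on parameters theta and improved by gradient ascent on the expected return, with a Monte Carlo (REINFORCE-style) estimate of the gradient:

    \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big],
    \qquad \theta \leftarrow \theta + \alpha\, \widehat{\nabla_\theta J(\theta)}
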
Row 11
Topic: Policy Gradients
Question: When does a baseline not introduce bias in policy gradient?
Answer 1: It should be a critic
Answer 2: It should not depend on the action
Answer 3: It should depend on the state
Answer 4: Its gradient should be 0
Time limit (sec): 60
Correct answer(s): 2
Explanation: It could, but does not have to, be a critic. It could, but does not have to, depend on the state. Its own gradient need not be 0; rather, the expectation of the baseline times the gradient of the log-policy should be 0 (see the derivation sketched below).

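A short reminder of why answer 2 is enough (standard derivation, not part of the quiz): for a baseline b(s) that does not depend on the action,

    \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\, b(s)\, \nabla_\theta \log \pi_\theta(a \mid s) \,\big]
    = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
    = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
    = b(s)\, \nabla_\theta 1 = 0,

so subtracting it changes the variance of the gradient estimate but not its expectation.
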
Row 12
Topic: Policy Gradients
Question: Recent policy gradient algorithms (PPO, TRPO, SAC, DDPG, TD3) benefit from ...
Answer 1: Small and controlled learning steps
Answer 2: Reparameterization
Answer 3: Exploration based on entropy
Answer 4: The deadly triad
Time limit (sec): 60
Correct answer(s): 1, 2, 3
Explanation: These algorithms benefit from some or all of these. The deadly triad is something you suffer from, not benefit from.

Row 13
Topic: TD with function approximation
Question: The TD-error ...
Answer 1: has high variance
Answer 2: is bootstrapped
Answer 3: can only be determined at the end of an episode
Answer 4: cannot be estimated
Time limit (sec): 60
Correct answer(s): 2

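For reference, the TD-error that answer 2 refers to; it bootstraps on the current estimate of the next state's value rather than waiting for the end of the episode:

    \delta_t = r_{t+1} + \gamma\, V_\theta(s_{t+1}) - V_\theta(s_t)
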
Row 14
Topic: TD with function approximation
Question: TD with function approximation ...
Answer 1: always minimizes the error of the value function
Answer 2: always minimizes the TD error
Answer 3: is always a semi-gradient method
Answer 4: can use a true gradient
Time limit (sec): 60
Correct answer(s): 4
Explanation: It can use a 'true' gradient of e.g. the projected Bellman error.

Row 15
Topic: TD with function approximation
Question: How does DQN deal with the "deadly triad" (f.a., semi-gradient bootstrapping and off-policy)?
Answer 1: Training on a "replay buffer" of past experiences
Answer 2: By using a target Q network and a behavior Q network
Answer 3: By using a true gradient
Answer 4: By using a deep neural network
Time limit (sec): 60
Correct answer(s): 1, 2
Explanation: In DQN, the sampling mechanism and the learning mechanism are (somewhat) 'decorrelated' by a replay buffer and separate networks. It uses a semi-gradient, and its use of a DNN is not there to overcome the deadly triad but rather 'causes' it.

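A minimal PyTorch-style sketch of the two mechanisms in answers 1 and 2 (replay buffer plus separate target network); network sizes, optimizer and hyperparameters are arbitrary choices, not the lecture's.

    import random
    from collections import deque
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    obs_dim, n_actions, gamma = 4, 2, 0.99

    def make_q_net():
        return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

    q_net = make_q_net()                             # behaviour network (trained)
    target_net = make_q_net()
    target_net.load_state_dict(q_net.state_dict())   # periodically synced copy
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                    # replay buffer of (s, a, r, s_next, done)

    def train_step(batch_size=32):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)    # sampling decorrelates transitions
        s, a, r, s2, done = map(torch.tensor, zip(*batch))
        q_sa = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                        # semi-gradient: no grad through target
            target = r.float() + gamma * (1 - done.float()) * \
                     target_net(s2.float()).max(1).values   # bootstrapped target
        loss = F.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
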
Row 16
Topic: Fun
Question: "I got my best idea this week ..."
Answer 1: while waiting for the coffee machine
Answer 2: while getting seated when the boat approached a bridge
Answer 3: when I decided to show up on time for the quiz
Answer 4: when someone had a great question
Time limit (sec): 60
Correct answer(s): 1, 2, 3, 4

Row 17
Topic: MCTS
Question: Which of these are true?
Answer 1: MCTS only works if the rollout leaf nodes are (game tree) terminal states
Answer 2: MCTS requires learning
Answer 3: MCTS rollouts can be used both for training and for evaluation
Answer 4: Child nodes are used to determine the value of a parent node
Time limit (sec): 60
Correct answer(s): 3, 4
Explanation: MCTS can use value estimates if the (game tree) terminal nodes are not reached.

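For background (standard MCTS material, not quoted from the lecture), the usual UCT rule shows how child statistics determine selection and, via the backed-up averages, the value of the parent:

    a^{*} = \arg\max_a \left( \frac{W(s,a)}{N(s,a)} + c\,\sqrt{\frac{\ln N(s)}{N(s,a)}} \right)

where W(s,a) and N(s,a) are the total backed-up value and visit count of the child reached by action a, and N(s) is the parent's visit count.
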
Row 18
Topic: MCTS
Question: In which way can MCTS use a policy function?
Answer 1: To know the best rollout depth
Answer 2: Playing likely moves rather than random moves during rollout
Answer 3: To direct the search during play
Answer 4: To update parent nodes during backup
Time limit (sec): 60
Correct answer(s): 2, 3

Row 19
Topic: MCTS
Question: What is a key difference between AlphaZero (2018) and MuZero (2020)?
Answer 1: AlphaZero was only evaluated on Go, MuZero on multiple games
Answer 2: MuZero clips outputs to [-1, 1]
Answer 3: MuZero can learn the rules, whereas for AlphaZero these must be given
Answer 4: MuZero was not trained on a simulator
Time limit (sec): 60
Correct answer(s): 3

Row 20
Topic: Exploration/Exploitation
Question: What is wrong with the exploration in epsilon-greedy Q-learning? (according to the lecture)
Answer 1: not effective in covering the state space
Answer 2: it is based on biased estimates of Q
Answer 3: it is based on optimism in the face of uncertainty
Answer 4: the policy changes at each step
Time limit (sec): 60
Correct answer(s): 1, 2, 4
Explanation: According to the lecture.

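For reference, plain epsilon-greedy action selection over Q estimates (a generic sketch, not the lecture's code); it explores uniformly at random, with no notion of uncertainty or optimism, which is why it covers the state space poorly.

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        """Pick a random action with prob. epsilon, else the greedy one."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))              # undirected exploration
        return max(range(len(q_values)), key=lambda a: q_values[a])

    print(epsilon_greedy([0.2, 0.7, 0.1]))
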
Row 21
Topic: Exploration/Exploitation
Question: How can state visit counters be used to drive exploration?
Answer 1: To estimate uncertainty
Answer 2: To normalize gradients
Answer 3: To "drive" exploration
Answer 4: To determine the size of the replay buffer
Time limit (sec): 60
Correct answer(s): 1, 3

Row 22
Topic: Exploration/Exploitation
Question: What are ways to increase exploration in optimistic Q-learning?
Answer 1: An entropy term in the Q-value update
Answer 2: Initialization of the Q-value estimates
Answer 3: An exploration bonus in the Q-value update
Answer 4: Using a (pseudo) state visit counter
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: There is no entropy term in optimistic Q-learning. The others are used; specifically, the counters are used to estimate uncertainty.

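A tabular sketch combining answers 2-4: optimistic initial Q-values, a state-action visit counter, and a count-based exploration bonus in the update. The bonus form beta/sqrt(N) and all constants are common illustrative choices, not necessarily the lecture's.

    from collections import defaultdict
    import math

    ALPHA, GAMMA, BETA, Q_INIT = 0.1, 0.95, 0.5, 10.0   # illustrative constants

    Q = defaultdict(lambda: Q_INIT)      # answer 2: optimistic initialisation
    N = defaultdict(int)                 # answer 4: (pseudo) state-action visit counter

    def update(s, a, r, s_next, actions):
        N[(s, a)] += 1
        bonus = BETA / math.sqrt(N[(s, a)])              # answer 3: exploration bonus
        target = r + bonus + GAMMA * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
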
Row 23
Topic: Fun
Question: What does it mean to be "Schmidhuberd"?
Answer 1: When you (intentionally?) exclude a certain author from prior work
Answer 2: When you get accused of incorrect attribution during your own talk at a conference
Answer 3: When your learning agent turns against you
Answer 4: When your paper gets rejected for not including a citation
Time limit (sec): 60
Correct answer(s): 2

Row 24
Topic: DMBRL
Question: What is true of MBRL?
Answer 1: MBRL is used in most real-life applications
Answer 2: In Dyna-Q, we obtain planning without extra computational overhead ("for free")
Answer 3: MBRL can only be done with deep neural networks to 'dream' about the world
Answer 4: The agent maintains a state and transition function internally
Time limit (sec): 60
Correct answer(s): 4
Explanation: The 'for free' in Dyna-Q is w.r.t. samples, not computation.

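A Dyna-Q-style sketch of the point in the explanation: each real transition is stored in a learned model and then replayed for extra planning updates, which costs extra computation but no extra environment samples. The loop structure is standard Dyna-Q; the constants are arbitrary.

    import random

    ALPHA, GAMMA, N_PLANNING = 0.1, 0.95, 20   # illustrative constants
    Q, model = {}, {}                          # tabular Q-values and learned model

    def q(s, a):
        return Q.get((s, a), 0.0)

    def dyna_q_step(s, a, r, s_next, actions):
        # (1) direct RL update from the real sample
        Q[(s, a)] = q(s, a) + ALPHA * (r + GAMMA * max(q(s_next, b) for b in actions) - q(s, a))
        # (2) model learning: remember what the environment did
        model[(s, a)] = (r, s_next)
        # (3) planning: extra updates from the model, 'free' in samples but not in compute
        for _ in range(N_PLANNING):
            (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
            Q[(ps, pa)] = q(ps, pa) + ALPHA * (pr + GAMMA * max(q(ps_next, b) for b in actions) - q(ps, pa))
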
Row 25
Topic: DMBRL
Question: What are two ways to deal with imperfect models in MBRL?
Answer 1: Probabilistic inference & tree-based planning
Answer 2: Latent models & end-to-end learning and planning
Answer 3: Latent models & probabilistic inference
Answer 4: Short rollouts and re-planning & probabilistic inference
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: Tree-based planning is in itself not robust against weak models.

Row 26
Topic: DMBRL
Question: Latent dynamics models are characterized by ...
Answer 1: Compression of the observation space into a smaller latent space
Answer 2: Predicting next observations and using these for planning
Answer 3: The usage of Gaussian processes
Answer 4: Using a perfect model
Time limit (sec): 60
Correct answer(s): 1
Explanation: Gaussian processes are used for probabilistic inference, and these models are not perfect.

Row 27
Topic: Symmetries in RL
Question: When to use MDP homomorphic networks?
Answer 1: If you want sample-efficient learning
Answer 2: If all actions in the environment are reversible
Answer 3: If there are symmetries in the environment
Answer 4: If you can construct equivariant layers by hand
Time limit (sec): 60
Correct answer(s): 1, 3
Explanation: Reversibility of actions does not necessarily give a symmetry. You do not need to construct equivariant layers by hand; use the symmetrizer.

Row 28
Topic: Symmetries in RL
Question: When is a neural network f equivariant to the projection g? For any input x ...
Answer 1: f(x) = f(g(x))
Answer 2: f(g(x)) = g'(f(x))
Answer 3: x(g(f)) = g(x(f))
Answer 4: f'(g(x)) = g(f(x))
Time limit (sec): 60
Correct answer(s): 2
Explanation: Here g' denotes the transformation g acting on the output space of f. Answer 1 is invariance, not equivariance. The rest is bogus.

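A quick numerical illustration of the two definitions (my own example, not from the quiz): an elementwise ReLU is equivariant to permutations of its input, while a sum is invariant.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)
    P = np.eye(5)[rng.permutation(5)]             # a permutation g acting on the input

    f = lambda v: np.maximum(v, 0.0)              # elementwise ReLU
    print(np.allclose(f(P @ x), P @ f(x)))        # True: equivariance, f(g(x)) = g'(f(x))
    print(np.isclose(np.sum(P @ x), np.sum(x)))   # True: invariance, f(g(x)) = f(x)
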
Row 29
Topic: Symmetries in RL
Question: In multi-agent RL ...
Answer 1: symmetries can be exploited purely locally
Answer 2: all agents have to know everything
Answer 3: symmetries cannot be exploited
Answer 4: symmetries can be exploited using equivariance constraints and message passing
Time limit (sec): 60
Correct answer(s): 4

Row 30
Topic: Hierarchical RL
Question: Names of hierarchical RL frameworks include ...
Answer 1: Feudal RL
Answer 2: Manager-worker RL
Answer 3: Promotor-PhD student RL
Answer 4: Options
Time limit (sec): 60
Correct answer(s): 1, 3

Row 31
Topic: Hierarchical RL
Question: In hierarchical RL, options ...
Answer 1: are temporally extended actions that have their own policy
Answer 2: cannot have predefined policies
Answer 3: have initiation and termination conditions
Answer 4: can be used to buy and sell stocks
Time limit (sec): 60
Correct answer(s): 1, 3

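The ingredients of answers 1 and 3, written out as a small container (the field names are my own, not the lecture's notation): an option bundles an initiation set, its own policy, and a termination condition.

    from dataclasses import dataclass
    from typing import Any, Callable, Set

    @dataclass
    class Option:
        initiation_set: Set[Any]               # states where the option may start
        policy: Callable[[Any], Any]           # the option's own (intra-option) policy
        termination: Callable[[Any], float]    # probability of terminating in a state

    # e.g. an option that is available everywhere, always moves 'right',
    # and terminates in state 'goal'
    go_right = Option(
        initiation_set={"s0", "s1", "goal"},
        policy=lambda s: "right",
        termination=lambda s: 1.0 if s == "goal" else 0.0,
    )
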
Row 32
Topic: Hierarchical RL
Question: Challenges within hierarchical RL include ...
Answer 1: optimality of the global solution when combining local solutions
Answer 2: finding a good way to split problems into smaller problems
Answer 3: reuse of solutions
Answer 4: getting sub-policies to listen to their managers
Time limit (sec): 60
Correct answer(s): 1, 2, 3

Row 33
Topic: World models & object-centric learning
Question: What are some advantages of world models?
Answer 1: It is computationally efficient to obtain the optimal policy from them
Answer 2: Improved data efficiency
Answer 3: Can be reused for many tasks
Answer 4: World models can be causal
Time limit (sec): 60
Correct answer(s): 2, 3, 4
Explanation: World models trade off computation for sampling.

Row 34
Topic: World models & object-centric learning
Question: Object-centric learning is about ...
Answer 1: labelling a lot of data by hand
Answer 2: automatically discovering representations of objects
Answer 3: understanding what actions are safe
Answer 4: focusing on a lot of things at the same time
Time limit (sec): 60
Correct answer(s): 2

Row 35
Topic: Fun
Question: The Kantorovich metric ...
Answer 1: tells the model to focus on the present, just like the Buddha
Answer 2: has a bunch of different names, just like the devil
Answer 3: is about living dangerously, just like Austin Powers
Answer 4: suddenly appears at unexpected times and places, just like Juergen Schmidhuber
Time limit (sec): 60
Correct answer(s): 2

Row 36
Topic: State similarity metrics
Question: State similarities are about ...
Answer 1: countries with a similar culture
Answer 2: observed state features
Answer 3: states obtained by bi-simulation
Answer 4: transitions and rewards
Time limit (sec): 60
Correct answer(s): 4
Explanation: The focus in the talk was not so much on feature space but rather on bisimulation equivalence relations, which are defined over transitions and rewards.

Row 37
Topic: State similarity metrics
Question: A problem of using state equivalence relations that are bisimulations is ...
Answer 1: it does not take into account Q or V values
Answer 2: it can be brittle, because it uses equality for transition probabilities
Answer 3: it requires a model
Answer 4: it is a metric that requires solving an optimal flow problem
Time limit (sec): 60
Correct answer(s): 2, 3
Explanation: Bisimulation is defined over rewards but incorporates V and Q by co-induction. It is not itself a metric, but a metric that approximates it does require solving an optimal flow problem.

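For reference, one standard formulation of the bisimulation metric that answers 2 and 4 allude to; instead of requiring transition probabilities to be exactly equal, it compares them with the Kantorovich (optimal transport, or 'optimal flow') distance W_d. The weights c_R and c_T are a modelling choice, with c_T < 1 so the fixed point exists:

    d(s, t) = \max_a \Big( c_R\, \big| R(s,a) - R(t,a) \big| + c_T\, W_d\big( P(\cdot \mid s, a),\, P(\cdot \mid t, a) \big) \Big)
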