Topic | Question (max 95 characters) | Answer 1 (max 60 characters) | Answer 2 (max 60 characters) | Answer 3 (max 60 characters) | Answer 4 (max 60 characters) | Time limit (sec): 5, 10, 20, 30, 60, 90 or 120 | Correct answer(s) (choose at least one) | Explanation
---|---|---|---|---|---|---|---|---
Test Question | The name of the vegan/vegetarian boat was... | the Anne Frank | the Wilhelmus | the Johan Cruyff | the Rich Sutton | 60 | 2 | Wilhelmus is the name of the Dutch national anthem
Generalisation in RL | The bias-variance decomposition for RL is the same as for SL. True or False? | true | false | | | 60 | 2 | False: there is no actual "bias-variance" trade-off in RL; there is an analogous trade-off between a learning algorithm that is sufficiently rich and one that is not too complex
Generalisation in RL | In general, the L2-loss is not used in the bias-overfitting decomposition for RL because... | there are target and behavior policies in RL | the i.i.d. assumption is violated | in RL, errors are wrt an optimal policy | RL estimators are unbiased | 60 | 3 | RL is not about estimating a target Y, but about optimizing a policy
Generalisation in RL | Generalisation in RL can be improved with... | the type of function approximator | changing the objective function | using abstract representations | adding more data | 60 | 1,2,3,4 |
Intro to Bandit Algorithms | In general, the goals in RL and bandits (adversarial, stochastic, contextual, etc.)... | are the same | are to minimize regret | both involve a policy | have no similarities | 60 | 3 | Although RL can be about minimizing regret, standard RL is not about regret but about finding a policy that minimizes expected 'loss' after training. Bandits are about finding a policy (arm to play).
Intro to Bandit Algorithms | In the stochastic bandit setting, how can the loss (negative reward) be characterised? | It is set by the environment | It is sampled from a known distribution | It is sampled from an initially unknown distribution | | 60 | 3 | If 'set' is interpreted as 'chosen by', then answer 1 is the adversarial bandit setting. The distribution is not known initially; it could, e.g., be the action of a user like clicking some ad.
Intro to Bandit Algorithms | Why are bandits useful for RL? | Learning to balance exploration/exploitation | Simpler models allow for theoretical analysis | Algorithm design principles may be the same | Both deal with uncertainty about the environment | 60 | 1,2,3,4 |
Fun | When visiting Amsterdam, what may cause you to try eating raw herring? | Optimism in the face of uncertainty | It was in a quiz question I did not understand | I couldn't find any cheese or stroopwafels | pure exploration | 60 | 1,2,3,4 | Of course all are good answers, as long as you try it!
Policy Gradients | We are interested in policy gradients because... | They allow us to do Monte Carlo estimation | They allow us to condition the policy on a set of parameters | They work with gradient descent so we can use SL approaches | They always come with a baseline | 60 | 2 | There are also other approaches to condition a policy on a set of parameters (e.g. evolutionary approaches), and we can also do MC estimation with other RL approaches. Technically, policy gradients work with gradient ascent, and SL approaches cannot generally be reused in RL, since SL minimizes error wrt a label rather than wrt the optimal policy.
Policy Gradients | When does a baseline not introduce bias in the policy gradient? | It should be a critic | It should not depend on the action | It should depend on the state | Its gradient should be 0 | 60 | 2 | It could, but does not have to, be a critic. It could, but does not have to, depend on the state. Its own gradient need not be 0; what matters is that the baseline term vanishes in expectation (see the identity below the table).
Policy Gradients | Recent policy gradient algorithms (PPO, TRPO, SAC, DDPG, TD3) benefit from... | Small and controlled learning steps | Reparameterization | Exploration based on entropy | the deadly triad | 60 | 1,2,3 | These algorithms benefit from some or all of these. The deadly triad is something you suffer from, not benefit from.
TD with function approximation | The TD-error... | has high variance | is bootstrapped | can only be determined at the end of an episode | cannot be estimated | 60 | 2 |
TD with function approximation | TD with function approximation... | always minimizes the error of the value function | always minimizes the TD error | is always a semi-gradient method | can use a true gradient | 60 | 4 | It can use a 'true' gradient of, e.g., the projected Bellman error.
TD with function approximation | How does DQN deal with the "deadly triad" (f.a., semi-gradient bootstrapping and off-policy)? | Training on a "replay buffer" of past experiences | By using a target Q network and a behavior Q network | By using a true gradient | By using a deep neural network | 60 | 1,2 | In DQN, the sampling mechanism and the learning mechanism are (somewhat) decorrelated by a replay buffer and separate networks. It uses a semi-gradient, and its use of a DNN is not to overcome the deadly triad but rather 'causes' it.
Fun | "I got my best idea this week..." | while waiting for the coffee machine | while getting seated when the boat approached a bridge | when I decided to show up on time for the quiz | when someone asked a great question | 60 | 1,2,3,4 |
MCTS | Which of these are true? | MCTS only works if the rollout leaf nodes are (game tree) terminal states | MCTS requires learning | MCTS rollouts can be used both for training and evaluation | Child nodes are used to determine the value of a parent node | 60 | 3,4 | MCTS can use value estimates if the (game tree) terminal nodes are not reached.
MCTS | In which way can MCTS use a policy function? | To know the best rollout depth | Playing likely moves rather than random moves during rollout | To direct the search during play | To update parent nodes during backup | 60 | 2,3 |
MCTS | What is a key difference between AlphaZero (2018) and MuZero (2020)? | AlphaZero was only evaluated on Go, MuZero on multiple games | MuZero clips outputs to [-1,1] | MuZero can learn the rules, whereas for AlphaZero these must be given | MuZero was not trained on a simulator | 60 | 3 |
Exploration/Exploitation | What is wrong with the exploration in epsilon-greedy Q-learning? (according to the lecture) | not effective in covering the state space | it is based on biased estimates of Q | it is based on optimism in the face of uncertainty | the policy changes at each step | 60 | 1,2,4 | According to the lecture.
Exploration/Exploitation | How can state visit counters be used to drive exploration? | To estimate uncertainty | To normalize gradients | To "drive" exploration | To determine the size of the replay buffer | 60 | 1,3 |
Exploration/Exploitation | What are ways to increase exploration in Optimistic-Q learning? | An entropy term in the Q-value update | initialization of the Q-value estimates | An exploration bonus in the Q-value update | Using a (pseudo) state visit counter | 60 | 2,3,4 | There is no entropy term in optimistic Q-learning. The others are used; specifically, the counters are used to estimate uncertainty (see the code sketch below the table).
Fun | What does it mean to be "Schmidhubered"? | When you (intentionally?) exclude a certain author from prior work | when you get accused of incorrect attribution during your own talk at a conference | When your learning agent turns against you | when your paper gets rejected for not including a citation | 60 | 2 |
DMBRL | What is true of MBRL? | MBRL is used in most real-life applications | In Dyna-Q, we obtain planning without extra computational overhead ("for free") | MBRL can only be done with deep neural networks to 'dream' about the world | The agent maintains a state and transition function internally | 60 | 4 | The 'for free' in Dyna-Q refers to samples, not computation.
DMBRL | What are two ways to deal with imperfect models in MBRL? | probabilistic inference & tree-based planning | latent models & end-to-end learning and planning | latent models & probabilistic inference | short rollouts and re-planning & probabilistic inference | 60 | 2,3,4 | Tree-based planning is in itself not robust against weak models.
DMBRL | Latent dynamics models are characterized by... | Compression of the observation space into a smaller latent space | Predicting next observations and using these for planning | the usage of Gaussian processes | Using a perfect model | 60 | 1 | Gaussian processes are used for probabilistic inference, and these models are not perfect.
Symmetries in RL | When to use MDP homomorphic networks? | If you want sample efficient learning | If all actions in the environment are reversible | If there are symmetries in the environment | If you can construct equivariant layers by hand | 60 | 1,3 | Reversibility of actions does not necessarily imply a symmetry. You do not need to construct equivariant layers by hand; use the symmetrizer.
Symmetries in RL | When is a neural network f equivariant to the transformation g? For any input x... | f(x) = f(g(x)) | f(g(x)) = g'(f(x)) | x(g(f)) = g(x(f)) | f'(g(x)) = g(f(x)) | 60 | 2 | Here g' denotes the corresponding transformation of the output. Answer 1 is invariance, not equivariance; the rest is bogus (see the numeric sketch below the table).
Symmetries in RL | In multi-agent RL... | symmetries can be exploited purely locally | all agents have to know everything | symmetries cannot be exploited | symmetries can be exploited using equivariance constraints and message passing | 60 | 4 |
Hierarchical RL | Names of hierarchical RL frameworks include... | Feudal RL | Manager-worker RL | Promotor-PhD student RL | Options | 60 | 1,3 |
Hierarchical RL | In hierarchical RL, options... | are temporally extended actions that have their own policy | cannot have predefined policies | have initiation and termination conditions | can be used to buy and sell stocks | 60 | 1,3 |
Hierarchical RL | Challenges within hierarchical RL include... | optimality of the global solution when combining local solutions | finding a good way to split problems into smaller problems | reuse of solutions | getting sub-policies to listen to their managers | 60 | 1,2,3 |
World models & object-centric learning | What are some advantages of world models? | It is computationally efficient to obtain the optimal policy from them | improved data efficiency | can be reused for many tasks | world models can be causal | 60 | 2,3,4 | World models trade off computation for samples.
World models & object-centric learning | Object-centric learning is about... | labelling a lot of data by hand | automatically discovering representations of objects | understanding what actions are safe | focusing on a lot of things at the same time | 60 | 2 |
Fun | The Kantorovich metric... | tells the model to focus on the present, just like the Buddha | has a bunch of different names, just like the devil | is about living dangerously, just like Austin Powers | suddenly appears at unexpected times and places, just like Juergen Schmidhuber | 60 | 2 |
State similarity metrics | State similarities are about... | countries with a similar culture | observed state features | states obtained by bisimulation | transitions and rewards | 60 | 4 | The focus of the talk was not so much on feature space but rather on bisimulation equivalence relations, which are defined over transitions and rewards.
State similarity metrics | A problem of using state equivalence relations that are bisimulations is... | it does not take into account Q or V values | it can be brittle, because it uses equality for transition probabilities | it requires a model | it is a metric that requires solving an optimal flow problem | 60 | 2,3 | It is defined over rewards but incorporates V and Q by co-induction. Bisimulation itself is not a metric, but the metric that approximates it does require solving an optimal flow problem.
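
The baseline question above (Policy Gradients) rests on a short identity. A minimal sketch, assuming a differentiable policy pi_theta(a|s) and any baseline b(s) that does not depend on the action:

```latex
% Sketch: an action-independent baseline b(s) adds no bias, because the
% baseline term vanishes in expectation under the policy \pi_\theta.
\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \right]
  = b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)
  = b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\, \nabla_\theta 1
  = 0 .
\]
```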
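For the equivariance question (Symmetries in RL), here is a minimal numeric sketch in Python. It uses a hypothetical toy linear "network" and a sign-flip transformation (not the MDP homomorphic network construction itself) to check f(g(x)) = g'(f(x)):

```python
import numpy as np

# Toy "network": a linear map f(x) = W x. Linear maps commute with sign flips,
# so f is equivariant to g(x) = -x, with g'(y) = -y acting on the output.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def f(x):
    return W @ x

g = -np.eye(2)        # input transformation g
g_out = -np.eye(2)    # corresponding output transformation g'

x = np.array([0.5, -1.5])
lhs = f(g @ x)        # f(g(x))
rhs = g_out @ f(x)    # g'(f(x))

print(np.allclose(lhs, rhs))   # True: equivariance f(g(x)) = g'(f(x)) holds
print(np.allclose(lhs, f(x)))  # False: f is not invariant to g
```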
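For the exploration questions (Optimistic-Q learning with visit counters), here is a minimal sketch assuming a hypothetical tabular environment with `reset() -> state` and `step(action) -> (state, reward, done)`. It combines optimistic initialization of Q with a count-based exploration bonus:

```python
import numpy as np

def optimistic_q_learning(env, n_states, n_actions, episodes=500,
                          alpha=0.1, gamma=0.99, q_init=10.0, beta=0.5):
    """Tabular Q-learning with optimistic initial Q-values and a count-based bonus."""
    Q = np.full((n_states, n_actions), q_init)   # optimistic initialization
    N = np.zeros((n_states, n_actions))          # state-action visit counters
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Exploration bonus shrinks as the (pseudo) visit count grows,
            # so rarely visited state-action pairs look more attractive.
            bonus = beta / np.sqrt(N[s] + 1.0)
            a = int(np.argmax(Q[s] + bonus))
            s_next, r, done = env.step(a)
            N[s, a] += 1
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])  # standard Q-learning update
            s = s_next
    return Q
```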