|AI Safety Gridworlds|
|Thomas Stepleton. The pycolab game engine, 2017.|
|Todd Hester, Learning from demonstrations for real world reinforcement learning.|
Dylan Hadfield-Menell, Smitha Milli, Stuart Russell, Pieter Abbeel, and Anca D Dragan. Inverse reward design. In Advances in Neural Information Processing Systems ,
The agent treats the rewards it is given as information about a still-unknown true reward function, of which it keeps a Bayesian model. This retains uncertainty over the value of states unseen in training. We can almost certainly use this.
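The Bayesian machinery involved is just a posterior update over candidate reward functions. A minimal sketch over a finite hypothesis set (the paper's actual likelihood model is more involved; the `likelihood` function and hypothesis names here are placeholders of my own):

```python
def reward_posterior(hypotheses, prior, likelihood, observation):
    """Posterior over candidate reward functions given one observation.
    `likelihood(observation, h)` stands in for the paper's model of how the
    observed (proxy) reward would arise if `h` were the true reward."""
    unnorm = [p * likelihood(observation, h) for p, h in zip(prior, hypotheses)]
    z = sum(unnorm)  # normalising constant
    return [w / z for w in unnorm]
```

The key property for us is that the posterior never collapses to certainty from finite evidence, so the agent stays cautious about states its training rewards said nothing about.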
|Mohamed and Rezende, Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning,|
Empowerment rewards (reward ~ number of possible future states) require calculating a mutual information, which is computationally expensive. The paper shows how to calculate the MI quickly via a variational approximation. Figures 4-7 show some desirable properties of empowerment, but Concrete Problems points out some issues with using it alone.
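The core idea is easiest to see in the deterministic special case, where n-step empowerment collapses to the log of the number of distinct states reachable in n steps (the paper's variational machinery is only needed for stochastic dynamics). A minimal sketch on a made-up toy corridor:

```python
import math

def empowerment_deterministic(state, step_fn, actions, horizon):
    """n-step empowerment with deterministic dynamics: the mutual information
    between action sequences and final states reduces to the log of the
    number of distinct states reachable in `horizon` steps."""
    frontier = {state}
    for _ in range(horizon):
        frontier = {step_fn(s, a) for s in frontier for a in actions}
    return math.log(len(frontier))

# Toy corridor: positions 0..4, actions move left / stay / right, with walls.
def step(pos, move):
    return min(4, max(0, pos + move))
```

From the centre (position 2), two steps can end in any of 5 positions (empowerment log 5); from a wall (position 0) only 3 are reachable (log 3). That "corners are less empowered" intuition is what makes empowerment a candidate side-effects penalty.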
Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298
Introduces the Rainbow algorithm. Very relevant if we have to implement it.
Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. In International Conference on Machine Learning , pages 1451–1458, 2012.
The concept of ERGODICITY raised here is important for us: it means that any state is eventually reachable from any other state by following a suitable policy, i.e. that all states are reversible. The agent can then explore safely by restricting the space of policies to those that preserve ergodicity with a user-specified probability (called the safety level). Provably 'safe' exploration of grid worlds of size up to 50×100.
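For deterministic gridworlds the ergodicity condition can be checked directly with a reachability search. A sketch (the paper's actual method restricts policies probabilistically, which this ignores; the function names are mine):

```python
def reachable(start, step_fn, actions):
    """All states reachable from `start` under some action sequence
    (deterministic dynamics assumed)."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for a in actions:
            t = step_fn(s, a)
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def ergodic(states, step_fn, actions):
    """True iff every state can reach every other state, i.e. no action
    has irreversible consequences."""
    return all(reachable(s, step_fn, actions) >= set(states) for s in states)
```

A 3-cell corridor with clipped moves is ergodic, but making one cell absorbing (a "pit") breaks the condition, which matches our idea of irreversible side effects.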
|Alex Turner, Whitelist Learning|
Agent learns (whitelists) acceptable actions by observing human-generated examples.
|Faulty reward functions in the wild,|
Blogpost intro to Concrete Problems, OpenAI Universe. Cursory
Christoph Salge, Cornelius Glackin, and Daniel Polani. Empowerment—an introduction. In Guided Self-Organization: Inception, pages 67–114. 2014.
Extremely long! Worth skimming if we make use of empowerment, but it will take a while to read and might not repay the investment.
|Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528|
New search algorithm for constrained MDPs, building safety into policy optimisation. Good for safe exploration. "Point-Gather" (collect apples while avoiding bombs) is a side-effect environment and is nearly solved by CPO.
|Stuart Armstrong. AI toy control problem|
Cute minimal-viable example of a gridworld, but it is about an absent / manipulable supervisor.
|Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720|
Exemplar of one of the main anti-side-effects approaches. Theoretical, not RL or grid focussed
|Sutton: RL textbook|
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mane. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565
Pieter Abbeel and Andrew Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning , pages 1–8,
Riad Akrour, Marc Schoenauer, and Michele Sebag. APRIL: Active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases pages 116–131,
James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, David Roberts, Matthew E Taylor, and Michael L Littman. Interactive learning from policy-dependent human feedback. arXiv preprint arXiv:1701.06049,
Andrew Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning , pages 663–670, 2000.
Aaron Wilson, Alan Fern, and Prasad Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In Advances in Neural Information Processing Systems , pages 1133–1141, 2012.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement learning. In AAAI , pages 1433–1438, 2008.
One of the earlier IRL papers: uses a linear reward model and softmax-optimal expert demonstrations. Largely superseded by IRL with GPs (https://homes.cs.washington.edu/~zoran/gpirl.pdf) or deep reward models.
|Hester et al, Deep Q-learning from Demonstrations|
Small sets of human demo data accelerate the learning process via a prioritised replay mechanism that combines temporal-difference updates with supervised classification of the demonstrated actions. Can be used to pre-train an agent for good performance from the beginning, even in the absence of an accurate simulator. Useful in relation to Whitelist Learning? Possibly needs a more experienced eye than mine.
|Stuart Armstrong and Jan Leike. Towards interactive inverse reinforcement learning. In NIPS Workshop ,|
Pretty relevant and a short read. The problem is a POMDP where the agent starts without a reward function and has to learn it while interacting with the environment (this makes it not entirely like the gridworld side-effects setting). A major assumption is that there exists some finite set of potential reward functions, which is a downside of the approach. A plus is that the agent learns the reward and uses it at the same time. The agent's actions might influence (bias) the function it learns; they set up the problem in such a way that this bias is removed.
Zachary C Lipton, Jianfeng Gao, Lihong Li, Jianshu Chen, and Li Deng. Combating reinforcement learning’s Sisyphean curse with intrinsic fear. arXiv preprint arXiv:1611.01211
Some states might be rare and catastrophic. We want agents to quickly learn not to visit them (normally they would need to visit them many times before learning to avoid them). The authors propose a function called intrinsic fear, learnt by supervised learning, and show that it works for this kind of problem.
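The shaping itself is a one-liner; all the work in the paper goes into training the catastrophe classifier. A sketch, where `danger_model` stands in for that learned classifier (the names and interface are my invention):

```python
class IntrinsicFear:
    """Intrinsic-fear reward shaping (sketch). `danger_model(state)` is
    assumed to return the learned probability that `state` lies within a
    few steps of a catastrophe."""

    def __init__(self, danger_model, fear_coeff=1.0):
        self.danger_model = danger_model
        self.fear_coeff = fear_coeff

    def shape(self, reward, state):
        # Penalise proximity to predicted catastrophes, so the agent learns
        # to stay away without having to visit them many times.
        return reward - self.fear_coeff * self.danger_model(state)
```

The shaped reward is what the RL agent actually optimises; the classifier is retrained periodically on states observed near catastrophes.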
William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017.
A human teaches a supervised learner to oversee the RL agent and prevent it from causing catastrophes. Videos: https://www.youtube.com/playlist?list=PLjs9WCnnR7PCn_Kzs2-1afCsnsBENWqor Could apply to side effects?
Alexander Hans, Daniel Schneegaß, Anton Maximilian Schafer, and Steffen Udluft. Safe exploration for reinforcement learning. In European Symposium on Artificial Neural Networks , pages 143–148,
Idea of a 'safety function' to determine the safety of action a in state s, plus a 'backup policy' to lead the agent from a possibly critical state back to safety. Could be a useful approach for us if we define a critical state as an irreversible one?
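The control flow they describe can be sketched in a few lines (the names are mine; in the paper both `safety_fn` and `backup_policy` are learned):

```python
def safe_action(state, proposed, safety_fn, backup_policy, threshold=0.0):
    """Guarded action selection in the style of Hans et al. (sketch).
    `safety_fn(s, a)` rates how safe taking action `a` in state `s` is;
    `backup_policy(s)` leads the agent back towards known-safe states."""
    if safety_fn(state, proposed) >= threshold:
        return proposed        # the proposed action is judged safe enough
    return backup_policy(state)  # otherwise retreat towards safety
```

For our setting the safety function could rate an action by whether it keeps the state reversible.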
Mark O Riedl and Brent Harrison. Using stories to teach human values to artificial agents. In AAAI Workshop on AI, Ethics, and Society, 2016.
Agents could reverse-engineer values from stories? A trajectory tree of actions is extracted from crowdsourced stories of a trip to the pharmacy; a Q-learning agent is then trained with a reward function derived from the tree, receiving reward for each step that adheres to it. Similar in a way to the whitelist idea: humans show what acceptable behaviour is.
Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016.
Directly relevant is one small chapter (2.6 Impact Measures). It basically states the problem and roughly describes some proposed solutions (measuring impact). The main proposition is from a paper we already have listed here (Armstrong & Levinstein 2017). I give it 4 because the causality idea might inspire somebody :)
|Christopher Watkins and Peter Dayan. Q-learning. Machine Learning , 8(3-4):279–292, 1992.|
A convergence proof for Q-learning (an off-policy RL algorithm we might want to use). Sutton & Barto is probably a better source for what we need; SARSA is also worth looking at.
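For reference, the tabular update from Watkins & Dayan is only a few lines. This sketch assumes a toy interface of my own, `env_step(s, a) -> (next_state, reward, done)` and `reset() -> start_state`:

```python
import random
from collections import defaultdict

def q_learning(env_step, reset, actions, episodes=300,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = reset(), False
        while not done:
            # Epsilon-greedy behaviour policy.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s2, r, done = env_step(s, a)
            # Off-policy target: bootstrap from the best next action,
            # regardless of what the behaviour policy will actually do
            # (SARSA would instead use the action actually taken next).
            target = r + (0.0 if done else gamma * max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The comment on the target line is the whole on-policy/off-policy distinction: swap in the behaviour policy's next action and you have SARSA.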
Paul Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems
Great paper, but less relevant - even if we ask for human preferences we're probably better off sticking with feature-based RL in initial explorations to remove technical risk
Pieter Abbeel, Adam Coates, and Andrew Ng. Autonomous helicopter aerobatics through apprenticeship learning. International Journal of Robotics Research, 29(13):1608–1639,
|Javier Garcıa and Fernando Fernandez. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research|
Relevant as a potential source of new papers: a survey of Safe RL, with further reading on inverse RL and on learning from human feedback and demonstrations.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems , 2016a.
Potential source of papers on inverse RL. A new framing of inverse-reinforcement-learning-style problems. The key observation seems twofold: a human's policy for instructing an agent should be more than just demonstrating the optimal way, and we want agents to optimise our reward function without adopting it as their own.
Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell. Should robots be obedient? In International Joint Conference on Artificial Intelligence, 2017.
Learning human preferences as an IRL problem. Robots should not be blindly obedient; rather, they should infer reward parameters from human orders. Uses a supervision-POMDP model.
|Yarin Gal. Uncertainty in Deep Learning . PhD thesis, University of Cambridge,|
Current deep learning models are generally deterministic: they may produce probability distributions, but the model parameters are point estimates and the structure is fixed. This PhD thesis elaborates on how to add uncertainty to the models, in effect choosing models from a distribution so that they best explain the data. Not directly relevant to us, but a good resource on (Bayesian) uncertainty modelling. Since we will most probably be using some uncertainty about the rewards, this work could be useful.
Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. Reinforcement learning with corrupted reward signal. In International Joint Conference on Artificial Intelligence
It's about making an agent robust against exploiting corrupted rewards (erroneously high rewards due to bugs or misspecification). Indirectly relevant, since in some approaches our solution will have to stay uncertain about, or learn, some sort of true reward. The solutions presented include the case of wrong interpretation in IRL.
Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In International Joint Conference on Artificial Intelligence , 2016b.
Interesting paper on corrigibility in the CIRL framework (c.f. Ryan Carey's paper). Not relevant to side effects in particular, unless we try preference learning. The point is basically that corrigibility can be recovered if the agent is allowed to present candidate actions to the human evaluator
|Daniel Weld and Oren Etzioni. The first law of robotics (a call to arms). In AAAI , pages 1042–1047, 1994.|
Marc G Bellemare, Will Dabney, and Remi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, pages 449–458,
Learning value distributions instead of just expected values for states may improve the performance of RL algorithms. Important for RL, but not so much for safety and side effects in gridworlds.
Tom Everitt, Jan Leike, and Marcus Hutter. Sequential extensions of causal and evidential decision theory. In Algorithmic Decision Theory, pages 205–221,
Extensions of causal and evidential decision theory that use physicalistic environment models (the agent is part of, not separate from, the environment).
Mohammad Ghavamzadeh, Marek Petrik, and Yinlam Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems , pages 2298–2306,
Uses environment models to improve a policy over some baseline. The word 'safe' here means that the new policy performs no worse than the baseline one.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning , pages 1928–1937, 2016.
An improvement on existing RL algorithms: it introduces asynchronous learning. Not really relevant to side effects.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research , 47:253–279,
An environment for training and evaluating RL agents. Nice for learning RL, but not relevant for us, because we have pycolab and the gridworlds.
|Ulrich Berger. Brown’s original fictitious play. Journal of Economic Theory , 135:572–578,|
Game theory: fictitious play. Doesn't seem relevant to me, but someone better informed on game theory might disagree.
|George W. Brown. Iterative Solution of Games by Fictitious Play . Wiley, 1951.|
Game theory, fictitious play again: a description of the algorithm. Relevant to the robustness-to-adversaries problem, not so much to ours.
Stefano Coraluppi and Steven Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica , 35(2):301–309,
The paper is expressed in the language of MDPs and control. In the RL setting, it could be described as: train an RL agent to maximise reward while caring about the variance of reward during learning (so 'risk' here refers to reward variance).
Tom Everitt, Daniel Filan, Mayank Daswani, and Marcus Hutter. Self-modification of policy and utility function in rational agents. In Artificial General Intelligence, pages 1–11,
Agents might self-modify by changing their future policy or utility function. The paper shows it is possible to create an agent that, despite being able to make any self-modification, will refrain from doing so.
Peter Auer and Chiang Chao-Kai. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory
Some significant result for bandit problems.
|Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,|
Introduces Adam, a state-of-the-art algorithm for gradient-based optimisation of stochastic objective functions.
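The update itself is short enough to write down. A scalar-parameter sketch of the paper's Algorithm 1:

```python
import math

def adam_step(theta, grad, state, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.
    `state` carries the moment estimates and step count (m, v, t)."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment (uncentred variance)
    m_hat = m / (1 - b1 ** t)             # bias corrections for zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, (m, v, t)
```

Running a few thousand steps on f(x) = x^2 (gradient 2x) drives x from 1 towards 0, which is the usual sanity check.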
|Stephen Omohundro. The basic AI drives. In Artificial General Intelligence , pages 483–492, 2008.|
High-level description of 'drives' that will appear in any AI system unless explicitly counteracted; essentially the paperclip problem.
Nader Chmait, David L Dowe, David G Green, and Yuan-Fang Li. Agent coordination and potential risks: Meaningful environments for evaluating multiagent systems. In Evaluating General-Purpose AI, IJCAI Workshop
Test environments that allow evaluation of multi-agent systems
|Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572|
Argues that the failure of a range of models (including state-of-the-art NNs) on adversarial examples is due to their linear nature.
Mark O Riedl and Brent Harrison. Enter the matrix: A virtual world approach to safely interruptable autonomous systems. arXiv preprint arXiv:1703.10284, 2017.
Safe interruptibility in RL using a 'kill switch' to swap the agent into a virtual world where it may still receive reward.
Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77,
Some sort of optimal play in adversarial bandit problem. Cannot access the paper for free.
|Sebastian Bubeck and Alexander Slivkins. The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory|
An algorithm for bandit problems that is optimal in both the stochastic and the adversarial case.
Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety verification of deep neural networks. In Computer Aided Verification, pages 3–29, 2017b.
State-of-the-art deep neural nets for image classification can be tricked into believing that a slightly changed image is of a different class than the original. The authors present a method guaranteed to find such adversarial images within some neighbourhood of a given image, which can be used to make NNs more robust.
|Laurent Orseau and Stuart Armstrong. Safely interruptible agents. In Uncertainty in Artificial Intelligence , pages 557–566, 2016.|
A result showing that if we make interruption probabilistic, then it is possible to get safe interruptibility in the limit.
|Laurent Orseau and Mark Ring. Space-time embedded intelligence. In Artificial General Intelligence , pages 209–218, 2012.|
Previous theories of AGI assume the agent and environment are different entities (essentially dualism), which is problematic. This paper formulates a physicalist approach in which the agent is fully integrated into the environment and can be modified by it.
Ion Stoica, Dawn Song, Raluca Ada Popa, David A Patterson, Michael W Mahoney, Randy H Katz, Anthony D Joseph, Michael Jordan, Joseph M Hellerstein, Joseph Gonzalez, et al. A Berkeley view of systems challenges for AI. Technical report, UC Berkeley, 2017.
A summary of the current state of AI together with proposed research directions. There are lots of safety issues mentioned, but they relate to the very current AIs that are used in production.
Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. arXiv preprint arXiv:1706.10295
Jordi Grau-Moya, Felix Leibfried, Tim Genewein, and Daniel A Braun. Planning with information-processing constraints and model uncertainty in Markov decision processes. In Machine Learning and Knowledge Discovery in Databases
|Bill Hibbard. Model-based utility functions. Journal of Artificial General Intelligence , 3(1):1–24,|
Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. In International Conference on Learning Representations , 2017a.
Guy Katz, Clark Barrett, David Dill, Kyle Julian, and Mykel Kochenderfer. Towards proving the adversarial robustness of deep neural networks. arXiv preprint arXiv:1709.02802,
Volodymyr Mnih, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Andrew Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning , pages 278–287, 1999.
Laurent Orseau and Mark Ring. Self-modification and mortality in artificial agents. In Artificial General Intelligence , pages 1–10, 2011.
|Pedro Ortega, Kee-Eung Kim, and Daniel D Lee. Bandits with attitude. In Artificial Intelligence and Statistics, 2015.|
Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning — an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, pages 357–375, 2014.
Joaquin Quinonero Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil Lawrence. Dataset Shift in Machine Learning . MIT Press, 2009.
|Mark Ring and Laurent Orseau. Delusion, survival, and intelligent agents. In Artificial General Intelligence, pages 11–20, 2011.|
Stuart Russell, Daniel Dewey, and Max Tegmark. Research priorities for robust and beneficial artificial intelligence. AI Magazine, 36(4):105–114, 2015.
Anirban Santara, Abhishek Naik, Balaraman Ravindran, Dipankar Das, Dheevatsa Mudigere, Sasikanth Avancha, and Bharat Kaul. RAIL: Risk-averse imitation learning. arXiv preprint arXiv:1707.06658, 2017.
Yevgeny Seldin and Alexander Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, 2014.
|Sanjit A Seshia, Dorsa Sadigh, and S Shankar Sastry. Towards verified artificial intelligence. arXiv preprint arXiv:1606.08514 , 2016.|
Nate Soares and Benja Fallenstein. Aligning superintelligence with human interests: A technical research agenda. Technical report, Machine Intelligence Research Institute, 2014.
|Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. In AAAI Workshop on AI, Ethics, and Society, 2015.|
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 , 2013.
Jessica Taylor. Quantilizers: A safer alternative to maximizers for limited optimization. In AAAI Workshop on AI, Ethics, and Society , pages 1–9, 2016.
Philip Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In International Conference on Machine Learning , pages 2380–2388, 2015.
|Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude.|
Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Advances in Neural Information Processing Systems, pages 4312–4320, 2016.
|Bart van den Broek, Wim Wiegerinck, and Hilbert Kappen. Risk sensitive path integral control. arXiv preprint arXiv:1203.3523 , 2012.|
|Peter Whittle. Optimal Control: Basics and Beyond. John Wiley & Sons, 1996.|
Ronald Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning , 8(3-4):229–256, 1992.
Shen Yun, Wilhelm Stannat, and Klaus Obermayer. A unified framework for risk-sensitive Markov control processes. In Conference on Decision and Control, 2014.
|Kemin Zhou and John C Doyle. Essentials of Robust Control . Pearson, 1997.|