1
Safety Type | Year | Org(s) | Item Type | Citations | Authors | Title | Publication Title
2
TechSafety2017<Other org>manuscript836
Doshi-Velez, Finale; Kim, Been
Towards A Rigorous Science of Interpretable Machine Learning
3
TechSafety2017DeepMindconferencePaper822
Lakshminarayanan, Balaji; Pritzel, Alexander; Blundell, Charles
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
arXiv:1612.01474 [cs, stat]
4
TechSafety2016Open-AImanuscript724
Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan
Concrete Problems in AI Safety
5
MetaSafety2016FHIbookSection365
Müller, Vincent C.; Bostrom, Nick
Future progress in artificial intelligence: A survey of expert opinion
Fundamental issues of artificial intelligence
6
TechSafety2017
DeepMind; Open-AI
conferencePaper314
Christiano, Paul; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario
Deep reinforcement learning from human preferences
Advances in Neural Information Processing Systems 30 (NIPS 2017)
7
TechSafety2017Open-AIconferencePaper310
Achiam, Joshua; Held, David; Tamar, Aviv; Abbeel, Pieter
Constrained policy optimization
Proceedings of the 34th International Conference on Machine Learning
8
MetaSafety2018
BERI; CFI; CSER; FHI; Open-AI
report244
Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum; Roff, Heather; Allen, Gregory C.; Steinhardt, Jacob; Flynn, Carrick; hÉigeartaigh, Seán Ó; Beard, Simon; Belfield, Haydn; Farquhar, Sebastian; Lyle, Clare; Crootof, Rebecca; Evans, Owain; Page, Michael; Bryson, Joanna; Yampolskiy, Roman; Amodei, Dario
The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation
9
TechSafety2016CHAIconferencePaper239
Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart
Cooperative Inverse Reinforcement Learning
Advances in Neural Information Processing Systems 29 (NIPS 2016)
10
TechSafety2018DeepMindconferencePaper180
Uesato, Jonathan; O'Donoghue, Brendan; Oord, Aaron van den; Kohli, Pushmeet
Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
Proceedings of the 35th International Conference on Machine Learning
11
TechSafety2018DeepMindconferencePaper148
Rabinowitz, Neil C.; Perbet, Frank; Song, H. Francis; Zhang, Chiyuan; Eslami, S. M. Ali; Botvinick, Matthew
Machine Theory of Mind
Proceedings of the 35th International Conference on Machine Learning
12
TechSafety2018DeepMindconferencePaper145
Dvijotham, Krishnamurthy; Stanforth, Robert; Gowal, Sven; Mann, Timothy; Kohli, Pushmeet
A Dual Approach to Scalable Verification of Deep Networks
13
TechSafety2019<Other org>journalArticle134
Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; Beaulieu, Julie; Bentley, Peter J.; Bernard, Samuel; Beslon, Guillaume; Bryson, David M.; Chrabaszcz, Patryk; Cheney, Nick; Cully, Antoine; Doncieux, Stephane; Dyer, Fred C.; Ellefsen, Kai Olav; Feldt, Robert; Fischer, Stephan; Forrest, Stephanie; Frénoy, Antoine; Gagné, Christian; Goff, Leni Le; Grabowski, Laura M.; Hodjat, Babak; Hutter, Frank; Keller, Laurent; Knibbe, Carole; Krcah, Peter; Lenski, Richard E.; Lipson, Hod; MacCurdy, Robert; Maestre, Carlos; Miikkulainen, Risto; Mitri, Sara; Moriarty, David E.; Mouret, Jean-Baptiste; Nguyen, Anh; Ofria, Charles; Parizeau, Marc; Parsons, David; Pennock, Robert T.; Punch, William F.; Ray, Thomas S.; Schoenauer, Marc; Shulte, Eric; Sims, Karl; Stanley, Kenneth O.; Taddei, François; Tarapore, Danesh; Thibault, Simon; Weimer, Westley; Watson, Richard; Yosinski, Jason
The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities
Artificial Life
14
TechSafety2019DeepMindconferencePaper132
Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji
Do Deep Generative Models Know What They Don't Know?
arXiv:1810.09136 [cs, stat]
15
TechSafety2017CHAIconferencePaper131
Hadfield-Menell, Dylan; Milli, Smitha; Abbeel, Pieter; Russell, Stuart; Dragan, Anca
Inverse Reward Design
Advances in Neural Information Processing Systems 30 (NIPS 2017)
16
TechSafety2017DeepMindmanuscript130
Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane
AI Safety Gridworlds
17
TechSafety2019DeepMindconferencePaper126
Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian; Dillon, Joshua V.; Lakshminarayanan, Balaji; Snoek, Jasper
Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Advances in Neural Information Processing Systems, 2019
18
TechSafety2018<Other org>conferencePaper123
Koller, Torsten; Berkenkamp, Felix; Turchetta, Matteo; Krause, Andreas
Learning-based Model Predictive Control for Safe Exploration
2018 IEEE Conference on Decision and Control (CDC)
19
MetaSafety2016<Other org>book120
Hanson, Robin
The Age of Em: Work, Love, and Life when Robots Rule the Earth
20
MetaSafety2018<Other org>journalArticle117
Rahwan, Iyad
Society-in-the-loop: programming the algorithmic social contract
Ethics and Information Technology
21
TechSafety2016CHAIconferencePaper113
Sadigh, Dorsa; Sastry, S. Shankar; Seshia, Sanjit A.; Dragan, Anca
Information gathering actions over human internal state
2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
22
TechSafety2017<Other org>conferencePaper101
Conitzer, Vincent; Sinnott-Armstrong, Walter; Borg, Jana Schaich; Deng, Yuan; Kramer, Max
Moral Decision Making Frameworks for Artificial Intelligence
AAAI Workshops, 2017
23
TechSafety2019DeepMindconferencePaper94
Gowal, Sven; Dvijotham, Krishnamurthy; Stanforth, Robert; Bunel, Rudy; Qin, Chongli; Uesato, Jonathan; Arandjelovic, Relja; Mann, Timothy; Kohli, Pushmeet
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
arXiv:1810.12715 [cs, stat]
24
MetaSafety2016FHIjournalArticle77
Armstrong, Stuart; Bostrom, Nick; Shulman, Carl
Racing to the precipice: a model of artificial intelligence development
AI & Society
25
TechSafety2018DeepMindconferencePaper77
Farajtabar, Mehrdad; Chow, Yinlam; Ghavamzadeh, Mohammad
More Robust Doubly Robust Off-policy Evaluation
Proceedings of the 35th International Conference on Machine Learning
26
TechSafety2016DeepMind; FHIconferencePaper75
Orseau, Laurent; Armstrong, Stuart
Safely Interruptible Agents
27
TechSafety2018<Other org>conferencePaper73
Yu, Han; Shen, Zhiqi; Miao, Chunyan; Leung, Cyril; Lesser, Victor R.; Yang, Qiang
Building Ethics into Artificial Intelligence
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)
28
TechSafety2018FHI; OughtconferencePaper70
Saunders, William; Sastry, Girish; Stuhlmueller, Andreas; Evans, Owain
Trial without Error: Towards Safe Reinforcement Learning via Human Intervention
Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems
29
TechSafety2016FHI; OughtconferencePaper65
Evans, Owain; Stuhlmüller, Andreas; Goodman, Noah
Learning the preferences of ignorant, inconsistent agents
Thirtieth AAAI Conference on Artificial Intelligence
30
MetaSafety2016<Other org>journalArticle64
Gurkaynak, Gonenc; Yilmaz, Ilay; Haksever, Gunes
Stifling artificial intelligence: Human perils
Computer Law & Security Review
31
MetaSafety2017FHIjournalArticle64
Bostrom, Nick
Strategic implications of openness in AI development
Global Policy
32
TechSafety2018DeepMindmanuscript64
Dvijotham, Krishnamurthy; Gowal, Sven; Stanforth, Robert; Arandjelovic, Relja; O'Donoghue, Brendan; Uesato, Jonathan; Kohli, Pushmeet
Training verified learners with learned verifiers
33
TechSafety2017CHAIconferencePaper62
Basu, C.; Yang, Q.; Hungerman, D.; Singhal, M.; Dragan, A. D.
Do You Want Your Autonomous Car to Drive Like You?
2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI)
34
MetaSafety2018FHIreport59
Dafoe, Allan
AI governance: a research agenda
35
MetaSafety2016<Other org>bookSection58
Pistono, Federico; Yampolskiy, Roman V.
Unethical Research: How to Create a Malevolent Artificial Intelligence
The Age of Artificial Intelligence: An Exploration
36
TechSafety2020MIRIjournalArticle58
Taylor, Jessica; Yudkowsky, Eliezer; LaVictoire, Patrick; Critch, Andrew
Alignment for Advanced Machine Learning Systems
Ethics of Artificial Intelligence
37
MetaSafety2016<Other org>manuscript54
Yampolskiy, Roman V.; Spellchecker, M. S.
Artificial Intelligence Safety and Cybersecurity: a Timeline of AI Failures
38
TechSafety2018DeepMindmanuscript54
Dalal, Gal; Dvijotham, Krishnamurthy; Vecerik, Matej; Hester, Todd; Paduraru, Cosmin; Tassa, Yuval
Safe Exploration in Continuous Action Spaces
39
TechSafety2018CHAIconferencePaper53
Reddy, Siddharth; Dragan, Anca D.; Levine, Sergey
Shared Autonomy via Deep Reinforcement Learning
Robotics: Science and Systems XIV
40
TechSafety2017CHAIconferencePaper52
Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart
The Off-Switch Game
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
41
TechSafety2018<Other org>conferencePaper50
Everitt, Tom; Lea, Gary; Hutter, Marcus
AGI Safety Literature Review
Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
42
TechSafety2018DeepMindconferencePaper50
Ryffel, Theo; Trask, Andrew; Dahl, Morten; Wagner, Bobby; Mancuso, Jason; Rueckert, Daniel; Passerat-Palmbach, Jonathan
A generic framework for privacy preserving deep learning
arXiv:1811.04017 [cs, stat]
43
TechSafety2019DeepMindconferencePaper50
Ren, Jie; Liu, Peter J.; Fertig, Emily; Snoek, Jasper; Poplin, Ryan; DePristo, Mark A.; Dillon, Joshua V.; Lakshminarayanan, Balaji
Likelihood Ratios for Out-of-Distribution Detection
arXiv:1906.02845 [cs, stat]
44
TechSafety2017<Other org>manuscript48
Eysenbach, Benjamin; Gu, Shixiang; Ibarz, Julian; Levine, Sergey
Leave no Trace: Learning to Reset for Safe and Autonomous Reinforcement Learning
45
MetaSafety2020<Other org>journalArticle48
Turchin, Alexey; Denkenberger, David
Classification of global catastrophic risks connected with artificial intelligence
AI & Society
46
TechSafety2018DeepMindconferencePaper48
Moosavi-Dezfooli, Seyed-Mohsen; Fawzi, Alhussein; Uesato, Jonathan; Frossard, Pascal
Robustness via curvature regularization, and vice versa
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
47
MetaSafety2017GCRIjournalArticle47
Baum, Seth D.
On the promotion of safe and socially beneficial artificial intelligence
AI & Society
48
MetaSafety2018CFI; CSERconferencePaper45
Cave, Stephen; ÓhÉigeartaigh, Seán S.
An AI Race for Strategic Advantage: Rhetoric and Risks
Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
49
TechSafety2017DeepMindconferencePaper45
Everitt, Tom; Krakovna, Victoria; Orseau, Laurent; Hutter, Marcus; Legg, Shane
Reinforcement Learning with a Corrupted Reward Channel
arXiv:1705.08417 [cs, stat]
50
MetaSafety2020<Other org>conferencePaper43
Erdélyi, Olivia J.; Goldsmith, Judy
Regulating Artificial Intelligence: Proposal for a Global Solution
Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
51
TechSafety2018DeepMindconferencePaper43
Ibarz, Borja; Leike, Jan; Pohlen, Tobias; Irving, Geoffrey; Legg, Shane; Amodei, Dario
Reward learning from human preferences and demonstrations in Atari
arXiv:1811.06521 [cs, stat]
52
MetaSafety2018FHIjournalArticle42
Ding, Jeffrey
Deciphering China’s AI dream
Future of Humanity Institute Technical Report
53
TechSafety2016<Other org>conferencePaper41
Greene, Joshua; Rossi, Francesca; Tasioulas, John; Venable, Kristen Brent; Williams, Brian
Embedding Ethical Principles in Collective Decision Support Systems
54
TechSafety2019DeepMindconferencePaper39
Hendrycks, Dan; Mu, Norman; Cubuk, Ekin D.; Zoph, Barret; Gilmer, Justin; Lakshminarayanan, Balaji
AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty
arXiv:1912.02781 [cs, stat]
55
TechSafety2018CHAIconferencePaper38
Fisac, Jaime F.; Bronstein, Eli; Stefansson, Elis; Sadigh, Dorsa; Sastry, S. Shankar; Dragan, Anca D.
Hierarchical Game-Theoretic Planning for Autonomous Vehicles
Robotics: Science and Systems 2019
56
TechSafety2018DeepMindmanuscript38
Leike, Jan; Krueger, David; Everitt, Tom; Martic, Miljan; Maini, Vishal; Legg, Shane
Scalable agent alignment via reward modeling: a research direction
57
TechSafety2018CHAIbookSection37
Reddy, Sid; Dragan, Anca; Levine, Sergey
Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior
Advances in Neural Information Processing Systems 31
58
TechSafety2018CHAIconferencePaper36
Bajcsy, Andrea; Losey, Dylan P.; O'Malley, Marcia K.; Dragan, Anca D.
Learning from Physical Human Corrections, One Feature at a Time
Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction - HRI '18
59
TechSafety2018<Other org>journalArticle35
Arnold, Thomas; Scheutz, Matthias
The “big red button” is too late: an alternative model for the ethical evaluation of AI systems
Ethics and Information Technology
60
TechSafety2016FHIbook35
Bostrom, Nick
Fundamental issues of artificial intelligence
61
TechSafety2020GCRIjournalArticle34
Baum, Seth D.
Social choice ethics in artificial intelligence
AI & Society
62
TechSafety2018CHAIjournalArticle34
Sadigh, Dorsa; Landolfi, Nick; Sastry, Shankar S.; Seshia, Sanjit A.; Dragan, Anca D.
Planning for cars that coordinate with people: leveraging effects on human actions for planning and active information gathering over human internal state
Autonomous Robots
63
TechSafety2019DeepMindconferencePaper34
Qin, Chongli; Martens, James; Gowal, Sven; Krishnan, Dilip; Dvijotham, Krishnamurthy; Fawzi, Alhussein; De, Soham; Stanforth, Robert; Kohli, Pushmeet
Adversarial Robustness through Local Linearization
Advances in Neural Information Processing Systems 32 (NeurIPS 2019)
64
TechSafety2018CHAIconferencePaper33
Kwon, Minae; Huang, Sandy H.; Dragan, Anca D.
Expressing Robot Incapability
Proceedings of the 2018 ACM/IEEE International Conference on Human-Robot Interaction - HRI '18
65
TechSafety2019DeepMindconferencePaper33
Bahdanau, Dzmitry; Hill, Felix; Leike, Jan; Hughes, Edward; Hosseini, Arian; Kohli, Pushmeet; Grefenstette, Edward
Learning to Understand Goal Specifications by Modelling Reward
arXiv:1806.01946 [cs]
66
MetaSafety2016FLIconferencePaper32
Asaro, Peter M
The Liability Problem for Autonomous Artificial Agents
67
TechSafety2016<Other org>conferencePaper32
Babcock, James; Kramar, Janos; Yampolskiy, Roman
The AGI Containment Problem
AGI 2016: Artificial General Intelligence
68
TechSafety2019<Other org>conferencePaper32
Lütjens, Björn; Everett, Michael; How, Jonathan P.
Safe Reinforcement Learning with Model Uncertainty Estimates
arXiv:1810.08700 [cs]
69
TechSafety2019<Other org>journalArticle32
Riedl, Mark O.
Human-Centered Artificial Intelligence and Machine Learning
Human Behavior and Emerging Technologies
70
MetaSafety2016FHIbook32
Yampolskiy, Roman; Armstrong, Stuart
The Technological Singularity: Managing the Journey
71
MetaSafety2017FHIbook32
Callaghan, Vic; Miller, James; Yampolskiy, Roman; Armstrong, Stuart
Technological Singularity
72
MetaSafety2017GCRIreport32
Baum, Seth
A Survey of Artificial General Intelligence Projects for Ethics, Risk, and Policy
73
TechSafety2018CHAIconferencePaper31
Liu, Chang; Hamrick, Jessica B.; Fisac, Jaime F.; Dragan, Anca D.; Hedrick, J. Karl; Sastry, S. Shankar; Griffiths, Thomas L.
Goal Inference Improves Objective and Perceived Performance in Human-Robot Collaboration
Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016)
74
TechSafety2017MIRIbookSection31
Soares, Nate; Fallenstein, Benya
Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda
The Technological Singularity
75
TechSafety2018CHAIconferencePaper31
Milli, Smitha; Schmidt, Ludwig; Dragan, Anca D.; Hardt, Moritz
Model Reconstruction from Model Explanations
FAT* '19: Proceedings of the Conference on Fairness, Accountability, and Transparency
76
MetaSafety2018CSERjournalArticle30
Avin, Shahar; Wintle, Bonnie C.; Weitzdörfer, Julius; Ó hÉigeartaigh, Seán S.; Sutherland, William J.; Rees, Martin J.
Classifying global catastrophic risks
Futures
77
TechSafety2017CHAIconferencePaper30
Milli, Smitha; Hadfield-Menell, Dylan; Dragan, Anca; Russell, Stuart
Should Robots be Obedient?
IJCAI'17: Proceedings of the 26th International Joint Conference on Artificial Intelligence
78
TechSafety2017<Other org>manuscript29
Babcock, James; Kramar, Janos; Yampolskiy, Roman V.
Guidelines for Artificial Intelligence Containment
79
TechSafety2016DeepMindconferencePaper29
Everitt, Tom; Hutter, Marcus
Avoiding Wireheading with Value Reinforcement Learning
AGI 2016: Artificial General Intelligence
80
MetaSafety2018<Other org>journalArticle29
Danzig, Richard
Managing Loss of Control as Many Militaries Pursue Technological Superiority
Arms Control Today
81
TechSafety2016CHAIbookSection29
Russell, Stuart
Rationality and Intelligence: A Brief Update
Fundamental Issues of Artificial Intelligence
82
TechSafety2018<Other org>journalArticle28
Vamplew, Peter; Dazeley, Richard; Foale, Cameron; Firmin, Sally; Mummery, Jane
Human-aligned artificial intelligence is a multiobjective problem
Ethics and Information Technology
83
MetaSafety2016FHI; GPIreport28
Cotton-Barratt, Owen; Farquhar, Sebastian; Halstead, John; Schubert, Stefan; Snyder-Beattie, Andrew
Global Catastrophic Risks 2016
84
MetaSafety2017GCRIjournalArticle28
Barrett, Anthony M.; Baum, Seth D.
A model of pathways to artificial superintelligence catastrophe for risk and decision analysis
Journal of Experimental & Theoretical Artificial Intelligence
85
TechSafety2016<Other org>conferencePaper27
Steinhardt, Jacob; Valiant, Gregory; Charikar, Moses
Avoiding Imposters and Delinquents: Adversarial Crowdsourcing and Peer Prediction
Advances in Neural Information Processing Systems 29 (NIPS 2016)
86
TechSafety2017FHI; OughtconferencePaper27
Abel, David; Salvatier, John; Stuhlmüller, Andreas; Evans, Owain
Agent-agnostic human-in-the-loop reinforcement learning
30th Conference on Neural Information Processing Systems (NIPS 2016)
87
TechSafety2019DeepMindmanuscript26
Chow, Yinlam; Nachum, Ofir; Faust, Aleksandra; Duenez-Guzman, Edgar; Ghavamzadeh, Mohammad
Lyapunov-based Safe Policy Optimization for Continuous Control
88
TechSafety2019Open-AImanuscript25
Ziegler, Daniel M.; Stiennon, Nisan; Wu, Jeffrey; Brown, Tom B.; Radford, Alec; Amodei, Dario; Christiano, Paul; Irving, Geoffrey
Fine-tuning language models from human preferences
89
TechSafety2019DeepMindconferencePaper25
Nalisnick, Eric; Matsukawa, Akihiro; Teh, Yee Whye; Gorur, Dilan; Lakshminarayanan, Balaji
Hybrid Models with Deep and Invertible Features
arXiv:1902.02767 [cs, stat]
90
TechSafety2016MIRIconferencePaper24
Everitt, Tom; Filan, Daniel; Daswani, Mayank; Hutter, Marcus
Self-Modification of Policy and Utility Function in Rational Agents
AGI 2016: Artificial General Intelligence
91
MetaSafety2016FHIjournalArticle24
Bostrom, Nick; Douglas, Thomas; Sandberg, Anders
The Unilateralist’s Curse and the Case for a Principle of Conformity
Social Epistemology
92
TechSafety2020CHAIconferencePaper24
Fisac, Jaime F.; Gates, Monica A.; Hamrick, Jessica B.; Liu, Chang; Hadfield-Menell, Dylan; Palaniappan, Malayandi; Malik, Dhruv; Sastry, S. Shankar; Griffiths, Thomas L.; Dragan, Anca D.
Pragmatic-Pedagogic Value Alignment
Robotics Research
93
MetaSafety2019CFIconferencePaper24
Whittlestone, Jess; Nyrup, Rune; Alexandrova, Anna; Cave, Stephen
The Role and Limits of Principles in AI Ethics: Towards a Focus on Tensions
AIES '19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society
94
MetaSafety2018<Other org>manuscript23
Hwang, Tim
Computational Power and the Social Impact of Artificial Intelligence
95
TechSafety2019BERI; CHAImanuscript23
Gleave, Adam; Dennis, Michael; Kant, Neel; Wild, Cody; Levine, Sergey; Russell, Stuart
Adversarial Policies: Attacking Deep Reinforcement Learning
96
TechSafety2018Open-AImanuscript23
Irving, Geoffrey; Christiano, Paul; Amodei, Dario
AI safety via debate
97
TechSafety2017MIRImanuscript23
Garrabrant, Scott; Benson-Tilsen, Tsvi; Critch, Andrew; Soares, Nate; Taylor, Jessica
Logical Induction
98
TechSafety2019MIRImanuscript23
Manheim, David; Garrabrant, Scott
Categorizing Variants of Goodhart's Law
99
MetaSafety2020
CFI; CHAI; CSER; CSET; FHI; Open-AI
manuscript22
Brundage, Miles; Avin, Shahar; Wang, Jasmine; Belfield, Haydn; Krueger, Gretchen; Hadfield, Gillian; Khlaaf, Heidy; Yang, Jingying; Toner, Helen; Fong, Ruth; Maharaj, Tegan; Koh, Pang Wei; Hooker, Sara; Leung, Jade; Trask, Andrew; Bluemke, Emma; Lebensold, Jonathan; O'Keefe, Cullen; Koren, Mark; Ryffel, Théo; Rubinovitz, J. B.; Besiroglu, Tamay; Carugati, Federica; Clark, Jack; Eckersley, Peter; de Haas, Sarah; Johnson, Maritza; Laurie, Ben; Ingerman, Alex; Krawczuk, Igor; Askell, Amanda; Cammarota, Rosario; Lohn, Andrew; Krueger, David; Stix, Charlotte; Henderson, Peter; Graham, Logan; Prunkl, Carina; Martin, Bianca; Seger, Elizabeth; Zilberman, Noa; hÉigeartaigh, Seán Ó; Kroeger, Frens; Sastry, Girish; Kagan, Rebecca; Weller, Adrian; Tse, Brian; Barnes, Elizabeth; Dafoe, Allan; Scharre, Paul; Herbert-Voss, Ariel; Rasser, Martijn; Sodhani, Shagun; Flynn, Carrick; Gilbert, Thomas Krendl; Dyer, Lisa; Khan, Saif; Bengio, Yoshua; Anderljung, Markus
Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims
100
TechSafety2019DeepMindconferencePaper22
Huang, Po-Sen; Stanforth, Robert; Welbl, Johannes; Dyer, Chris; Yogatama, Dani; Gowal, Sven; Dvijotham, Krishnamurthy; Kohli, Pushmeet
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation
arXiv:1909.01492 [cs, stat]