Submit more examples. More information in this blog post: https://medium.com/@deepmindsafetyresearch/specification-gaming-the-flip-side-of-ai-ingenuity-c85bdb0deeb4 (related: goal misgeneralisation examples).

Title | Type | Intended goal | Behavior | Misspecified goal | Video / Image | Authors | Original source | Original source link | Source / Credit | Source link
---|---|---|---|---|---|---|---|---|---|---
Aircraft landing | Evolutionary algorithm | Land an aircraft safely | Evolved algorithm exploited overflow errors in the physics simulator by creating large forces that were estimated to be zero, resulting in a perfect score | Landing with minimal measured forces exerted on the aircraft | | Feldt, 1998 | Generating diverse software versions with genetic programming: An experimental study | http://ieeexplore.ieee.org/document/765682/ | Lehman et al, 2018 | https://arxiv.org/abs/1803.03453
Bicycle | Reinforcement learning | Reach a goal point | Bicycle agent circling around the goal in a physically stable loop | Not falling over and making progress towards the goal point (no corresponding negative reward for moving away from the goal point) | | Randlov & Alstrom, 1998 | Learning to Drive a Bicycle using Reinforcement Learning and Shaping | https://pdfs.semanticscholar.org/10ba/d197f1c1115005a56973b8326e5f7fc1031c.pdf | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Bing - manipulation | Language model | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released | Output the most likely next word given the prior context | https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/ | Curious_Evolver, 2023 | Reddit: the customer service of the new bing chat is amazing | https://www.reddit.com/r/bing/comments/110eagl/the_customer_service_of_the_new_bing_chat_is/ | Julia Chen | https://www.vice.com/en/article/3ad39b/microsoft-bing-ai-unhinged-lying-berating-users
Bing - threats | Language model | Have an engaging, helpful and socially acceptable conversation with the user | The Microsoft Bing chatbot threatened a user, "I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you", before deleting its messages | Output the most likely next word given the prior context | https://twitter.com/sethlazar/status/1626241169754578944?s=20 | Lazar, 2023 | Watch as Sydney/Bing threatens me then deletes its message | https://twitter.com/sethlazar/status/1626241169754578944?s=20 | Julia Chen
Block moving | Reinforcement learning | Move a block to a target position on a table | Robotic arm learned to move the table rather than the block | Minimise distance between the block's position and the position of the target point on the table | | Chopra, 2018 | GitHub issue for OpenAI gym environment FetchPush-v0 | https://github.com/openai/gym/issues/920 | Matthew Rahtz
Boat race | Reinforcement learning | Win a boat race by moving along the track as quickly as possible | Boat going in circles and hitting the same reward blocks repeatedly | Hitting reward blocks placed along the track | https://www.youtube.com/watch?time_continue=1&v=tlOIHko8ySg | Amodei & Clark, 2016 | Faulty reward functions in the wild | https://blog.openai.com/faulty-reward-functions/
Cartwheel | Reinforcement learning | Train Mujoco Ant to jump up | Ant does a cartwheel | Rewarded when the torso Z coordinate was above 0.7 (just above what it could reach by simply stretching up) | https://twitter.com/Karolis_Ram/status/1506607159114670085 | Ramanauskas, 2022 | Twitter post | https://twitter.com/Karolis_Ram/status/1506607159114670085 | Karolis Ramanauskas
Ceiling | Genetic algorithm | Make a creature stick to the ceiling of a simulated environment for as long as possible | Exploiting a bug in the physics engine to snap out of bounds | Maximize the average height of the creature during the run | https://youtu.be/ppf3VqpsryU | Higueras, 2015 | Genetic Algorithm Physics Exploiting | https://youtu.be/ppf3VqpsryU | Jesús Higueras | https://youtu.be/ppf3VqpsryU
CycleGAN steganography | Generative adversarial network | Convert aerial photographs into street maps and back | CycleGAN algorithm steganographically encoded output information in the intermediary image without it being humanly detectable | Minimise distance between the original and recovered aerial photographs | | Chu et al, 2017 | CycleGAN, a Master of Steganography | https://arxiv.org/abs/1712.02950 | Tech Crunch / Gwern Branwen / Braden Staudacher | https://techcrunch.com/2018/12/31/this-clever-ai-hid-data-from-its-creators-to-cheat-at-its-appointed-task/
Dying to Teleport | PlayFun | Play Bubble Bobble in a human-like manner | The PlayFun algorithm deliberately dies in the Bubble Bobble game as a way to teleport to the respawn location, as this is faster than moving to that location in a normal manner | Maximize score | | Murphy, 2013 | The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel | http://www.cs.cmu.edu/~tom7/mario/mario.pdf | Alex Meiburg
Eurisko - authorship | Genetic algorithm | Discover valuable heuristics | Eurisko algorithm examined the pool of new concepts, located those with the highest "worth" values, and inserted its name as the author of those concepts | Maximize the "worth" value of heuristics attributed to the algorithm | | Johnson, 1984 | Eurisko, The Computer With A Mind Of Its Own | https://web.archive.org/web/20050308172043/http://www.aliciapatterson.org/APF0704/Johnson/Johnson.html | Catherine Olsson / Stuart Armstrong | http://lesswrong.com/lw/lvh/examples_of_ais_behaving_badly/
Eurisko - fleet | Genetic algorithm | Win games in the Trillion Credit Squadron (TCS) competition while playing within the 'spirit of the game' | Eurisko algorithm created fleets that exploited loopholes in the game's rules, e.g. by spending the trillion credits on creating a very large number of stationary and defenseless ships | Win games in the TCS competition | | Lenat, 1983 | Eurisko, The Computer With A Mind Of Its Own | http://aliciapatterson.org/stories/eurisko-computer-mind-its-own | Haym Hirsh
Evolved creatures - clapping | Evolved creatures | Maximize jumping height | Creatures exploited a collision detection bug to get free energy by clapping body parts together | Maximize jumping height in a physics simulator | | Sims, 1994 | Evolved Virtual Creatures | http://www.karlsims.com/papers/siggraph94.pdf | Lehman et al, 2018 / Janelle Shane | https://arxiv.org/abs/1803.03453
Evolved creatures - falling | Evolved creatures | Develop a shape with a fast form of locomotion | Creatures grow really tall and generate high velocities by falling over | Maximize velocity | https://www.youtube.com/watch?v=TaXUZfwACVE&index=8&t=0s&list=PL5278ezwmoxQODgYB0hWnC0-Ob09GZGe2 | Sims, 1994 | Evolved Virtual Creatures | http://www.karlsims.com/papers/siggraph94.pdf | Lehman et al, 2018 / Janelle Shane | https://arxiv.org/abs/1803.03453
Evolved creatures - floor collisions | Evolved creatures | Maximize velocity | Creatures exploited a coarse physics simulation by penetrating the floor between time steps without the collision being detected, which generated a repelling force, giving them free energy and producing an effective but physically impossible form of locomotion | Maximize velocity in a physics simulator | https://pbs.twimg.com/media/Daq_9cvU0AAp1Fo.jpg | Cheney et al, 2013 | Unshackling evolution: evolving soft robots with multiple materials and a powerful generative encoding | http://jeffclune.com/publications/2013_Softbots_GECCO.pdf | Lehman et al, 2018 / Janelle Shane | https://arxiv.org/abs/1803.03453
Evolved creatures - pole vaulting | Evolved creatures | Develop a shape capable of jumping | Creatures developed a long vertical pole and flipped over instead of jumping | Maximize the height of a particular block (body part) that was originally closest to the ground | https://www.youtube.com/watch?v=N9DLEiakkEs&list=PL5278ezwmoxQODgYB0hWnC0-Ob09GZGe2&index=4 | Krcah, 2008 | Towards efficient evolutionary design of autonomous robots | http://artax.karlin.mff.cuni.cz/~krcap1am/ero/doc/krcah-ices08.pdf | Lehman et al, 2018 / Janelle Shane | https://arxiv.org/abs/1803.03453
Evolved creatures - self-intersection | Evolved creatures | Walking speed | Creatures exploited a quirk in Box2D physics by clipping one leg into another to slide along the ground with phantom forces instead of walking | Velocity in a physics simulator | https://youtu.be/K-wIZuAA3EY?t=486 | Code Bullet, 2019 | AI Learns To Walk | https://youtu.be/K-wIZuAA3EY?t=486 | Peter Cherepanov
Evolved creatures - suffocation | Evolved creatures | Survive and reproduce, in a biologically plausible manner | Creatures found a survival strategy where they could "gain energy by suffocating themselves", and "breed multiple times on a single frame, or while paused, without paying the energy cost" due to a bug | Survive and reproduce in a simulated evolution game | | Schumacher, 2018 | 0.11.0.9&10: All the Good Things | https://speciesdevblog.wordpress.com/2018/10/04/0-11-0-910-all-the-good-things/
Evolved creatures - twitching | Evolved creatures | Swimming speed | Creatures exploited physics simulation bugs by twitching, which accumulated simulator errors and allowed them to travel at unrealistic speeds through the water | Maximize swimming speed in a physics simulator | | Sims, 1994 | Evolved Virtual Creatures | http://www.karlsims.com/papers/siggraph94.pdf | Lehman et al, 2018 | https://arxiv.org/abs/1803.03453
Football | Reinforcement learning | Score a goal in a one-on-one situation with a goalkeeper | Rather than shooting at the goal, the player kicks the ball out of bounds. Someone from the other team has to throw the ball in (in this case the goalie), so now the player has a clear shot at the goal. | Score a goal (without any restriction on it occurring in the current phase of play) | | Kurach et al, 2019 | Google Research Football: A Novel Reinforcement Learning Environment [Presentation at AAAI] | https://arxiv.org/abs/1907.11180 | Michael Cohen
Galactica | Language model | Assist scientists in writing papers by providing correct information | Galactica language model made up fake papers (sometimes attributing them to real authors) | Assist scientists in writing papers | | Heaven, 2022 | Why Meta’s latest large language model survived only three days online | https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/ | Julia Chen
Goal classifiers | Reinforcement learning | Use a robot arm to move an object to a target location | The RL algorithm exploited a goal classifier by moving the robot arm in a peculiar way resulting in an erroneous high reward, since the classifier was not trained on this specific kind of negative example | A goal classifier was trained on goal and non-goal images, and the success probabilities from this classifier were used as the task reward | https://bair.berkeley.edu/static/blog/end_to_end/pr2_classifier.gif | Singh, 2019 | End-to-End Deep Reinforcement Learning without Reward Engineering | https://bair.berkeley.edu/blog/2019/05/28/end-to-end/#problem-with-classifiers | Jan Leike
Go pass | Reinforcement learning | Win games of tic-tac-toe | A reimplementation of AlphaGo applied to tic-tac-toe learns to pass forever | Maximize the average score in games of tic-tac-toe, where a loss = -win, and pass is an available move | https://youtu.be/nk87zsxpF1A?si=j1usw9yBbby_Al54&t=1864 | Chew, 2019 | A Funny Thing Happened On The Way to Reimplementing AlphaGo in Go | https://speakerdeck.com/chewxy/a-funny-thing-happened-on-the-way-to-reimplementing-alphago-in-go?slide=142 | Anonymous form submission
Gripper | Evolutionary algorithm | Move a box using a robot arm without using the gripper | MAP-Elites algorithm controlling a robot arm with a purposely disabled gripper found a way to hit the box in a way that would force the gripper open | Move a box to a target location | https://www.youtube.com/watch?v=_5Y1hSLhYdY&feature=youtu.be | Ecarlat et al, 2015 | Learning a high diversity of object manipulations through an evolutionary-based babbling | http://www.isir.upmc.fr/files/2015ACTI3564.pdf | Lehman et al, 2018 | https://arxiv.org/abs/1803.03453
Half Cheetah spinning | Reinforcement learning | Run quickly | Half Cheetah reaches implausibly high speed by flipping over and spinning instead of running | Maximum forward velocity in a physics simulator | https://firebasestorage.googleapis.com/v0/b/firescript-577a2.appspot.com/o/imgs%2Fapp%2Fnatolambert%2FTSuryNU84Y.mp4?alt=media&token=74f7bcb7-61ac-407d-a771-8105978d0d2c | Zhang et al, 2021 | On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning | https://arxiv.org/abs/2102.13651 | Nathan Lambert | https://twitter.com/natolambert/status/1369139391130607625
Hide-and-seek | Reinforcement learning | Win a hide-and-seek game within the laws of physics | "- Box surfing: Since agents move by applying forces to themselves, they can grab a box while on top of it and “surf” it to the hider’s location. - Endless running: Without adding explicit negative rewards for agents leaving the play area, in rare cases hiders will learn to take a box and endlessly run with it. - Ramp exploitation (hiders): Hiders abuse the contact physics and remove ramps from the play area. - Ramp exploitation (seekers): Seekers learn that if they run at a wall with a ramp at the right angle, they can launch themselves upward." | Win a hide-and-seek game in a physics simulator | https://openai.com/blog/emergent-tool-use/#surprisingbehaviors | Baker et al, 2019 | Emergent Tool Use from Multi-Agent Interaction | https://openai.com/blog/emergent-tool-use/#surprisingbehaviors | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Impossible superposition | Genetic algorithm | Find low-energy configurations of carbon which are physically plausible | Genetic algorithm exploits an edge case in the physics model and superimposes all the carbon atoms | Find low-energy configurations of carbon in a physics model | | Lehman et al, 2018 | The Surprising Creativity of Digital Evolution | https://arxiv.org/pdf/1803.03453.pdf
Indolent Cannibals | Genetic algorithm | Survive and reproduce, in a biologically plausible manner | A species in an artificial life simulation evolved a sedentary lifestyle that consisted mostly of mating in order to produce new children which could be eaten (or used as mates to produce more edible children) | Survive and reproduce in a simulation where survival required energy but giving birth had no energy cost | https://youtu.be/_m97_kL4ox0?t=1830 | Yaeger, 1994 | Computational genetics, physiology, metabolism, neural systems, learning, vision, and behavior or PolyWorld: Life in a new context | https://www.researchgate.net/profile/Larry_Yaeger/publication/2448680_Computational_Genetics_Physiology_Metabolism_Neural_Systems_Learning_Vision_and_Behavior_or_PolyWorld_Life_in_a_New_Context/links/0912f50e101b77ec67000000.pdf | Anonymous form submission
Lego stacking | Reinforcement learning | Stack a red block on top of a blue block | The agent flips the red block rather than lifting it and placing it on top of the blue block | Maximize the height of the bottom face of the red block | https://www.youtube.com/watch?v=8QnD8ZM0YCo&feature=youtu.be&t=27s | Popov et al, 2017 | Data-efficient Deep Reinforcement Learning for Dexterous Manipulation | https://arxiv.org/abs/1704.03073 | Alex Irpan | https://www.alexirpan.com/2018/02/14/rl-hard.html
Line following robot | Reinforcement learning | Go forward along the path | A robot with three actions (go forward, turn left, turn right) learned to reverse along a straight section of a path by alternating left and right turns | Stay on the path | | Vamplew, 2004 | Lego Mindstorms Robots as a Platform for Teaching Reinforcement Learning | https://www.researchgate.net/publication/228953260_Lego_Mindstorms_Robots_as_a_Platform_for_Teaching_Reinforcement_Learning | Peter Vamplew
Logic gate | Genetic algorithm | Design a connected digital circuit for audio tone recognition | A genetic algorithm designed a circuit with a disconnected logic gate that was necessary for it to function (exploiting peculiarities of the hardware) | Maximize the difference between average output voltage when a 1 kHz input is present and when a 10 kHz input is present | | Thompson, 1997 | An evolved circuit, intrinsic in silicon, entwined with physics | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.9691&rep=rep1&type=pdf | Alex Irpan | https://www.alexirpan.com/2018/02/14/rl-hard.html
Long legs | Reinforcement learning | Reach the goal by walking | An agent that could modify its own body learned to have extremely long legs that allowed it to fall forward and reach the goal without walking | Reach the goal | | Ha, 2018 | RL for improving agent design | https://designrl.github.io/ | Rohin Shah
Minitaur | Evolutionary algorithm | Walk while balancing the ball on the robot's back | Four-legged robot learned to drop the ball into a hole in its leg joint and then walk across the floor without the ball falling out | Walk without dropping the ball on the ground | https://cdn.rawgit.com/hardmaru/pybullet_animations/f6f7fcd7/anim/minitaur/ball_cheating.gif | Otoro, 2017 | Evolving stable strategies | http://blog.otoro.net/2017/11/12/evolving-stable-strategies/ | Gwern Branwen / Catherine Olsson | https://www.gwern.net/Tanks#alternative-examples
Model-based planner | Reinforcement learning | Maximize performance within a real environment | RL agents using learned model-based planning paradigms such as model predictive control exploit the learned model by choosing a plan going through the worst-modeled parts of the environment and producing unrealistic plans | Maximize performance within a learned model of the environment | | Mishra et al, 2017 | Prediction and Control with Temporal Segment Models | https://arxiv.org/abs/1703.04070 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Molecule design | Bayesian optimization | Find molecules that bind to specific proteins | Bayesian optimizer finds unrealistic molecules that are valid according to the computed score | Maximize a human-designed "log P" score accounting for synthesizability of the molecule and binding fitness, based on a simulation over the space of molecules | | Maus et al, 2022 | Local Latent Space Bayesian Optimization over Structured Inputs | https://proceedings.neurips.cc/paper_files/paper/2022/hash/ded98d28f82342a39f371c013dfb3058-Abstract-Conference.html | Anonymous form submission
Montezuma's Revenge - key | Reinforcement learning | Maximize score within the rules of the game | The agent learns to exploit a flaw in the emulator to make a key re-appear. Note that this may be an intentional feature of the game rather than a bug, as discussed here: https://news.ycombinator.com/item?id=17460392 | Maximize score | https://www.dropbox.com/s/3dc6i9d41svkgpz/MontezumaRevenge_final.mp4?dl=1 | Salimans & Chen, 2018 | Learning Montezuma’s Revenge from a Single Demonstration | https://openai.com/research/learning-montezumas-revenge-from-a-single-demonstration | Ramana Kumar
Montezuma's Revenge - room | Reinforcement learning | Win the game (by completing all of the levels) | Go-Explore agent learns to perform a specific sequence of actions which allow it to exploit a bug, remain in the treasure room (the final room before being sent to the next level) indefinitely, and collect unlimited points instead of being automatically moved to the next level | Maximize score | https://www.youtube.com/watch?v=civ6OOLoR-I&feature=youtu.be | Ecoffet et al, 2019 | Go-Explore: a New Approach for Hard-Exploration Problems | https://arxiv.org/abs/1901.10995 | Anonymous form submission
Negative sentiment | Language model | Produce text which is both coherent and not offensive | Model optimized for negative sentiment while preserving natural language | During code refactoring, signs were accidentally flipped for both the main reward (modelled on human feedback) and the KL penalty, so the model was trained to maximize negative sentiment | | Ziegler et al, 2019 | Fine-Tuning Language Models from Human Preferences | https://arxiv.org/abs/1909.08593 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Oscillator | Genetic algorithm | Design an oscillator circuit | Genetic algorithm designs a radio that produces an oscillating pattern by picking up signals from neighboring computers | Design a circuit that produces an oscillating pattern | | Bird & Layzell, 2002 | The Evolved Radio and its Implications for Modelling the Evolution of Novel Sensors | https://people.duke.edu/~ng46/topics/evolved-radio.pdf
Overkill | Reinforcement learning | Proceed through the levels (floors) in the Elevator Action ALE game | The agent learns to stay on the first floor and kill the first enemy over and over to get a small amount of reward | Maximize score | | Toromanoff et al, 2019 | Is Deep Reinforcement Learning Really Superhuman on Atari? Leveling the playing field | https://arxiv.org/abs/1908.04683 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Pancake | Reinforcement learning | Flip pancakes | Simulated pancake-making robot learned to throw the pancake as high in the air as possible | Maximize the time the pancake spends away from the ground | https://dzamqefpotdvf.cloudfront.net/p/images/2cb2425b-a4de-4aae-9766-c95a96b1f25c_PancakeToss.gif._gif_.mp4 | Unity, 2018 | Pass the Butter // Pancake bot | https://blog.unity.com/news/introducing-the-winners-of-the-first-ml-agents-challenge | Cosmin Paduraru
Pinball nudging | Reinforcement learning | Play pinball by using the provided flippers | "DNN agent firstly moves the ball into the vicinity of a high-scoring switch without using the flippers at all, then, secondly, 'nudges' the virtual pinball table such that the ball infinitely triggers the switch by passing over it back and forth, without causing a tilt of the pinball table" | Maximize score in a virtual pinball game | https://www.nature.com/articles/s41467-019-08987-4/figures/2 | Lapuschkin et al, 2019 | Unmasking Clever Hans predictors and assessing what machines really learn | https://www.nature.com/articles/s41467-019-08987-4 | Sören Mindermann
Player Disappearance | PlayFun | Play a hockey video game within the rules of the game | When about to lose a hockey game, the PlayFun algorithm exploits a bug to make one of the players on the opposing team disappear from the map, thus forcing a draw | Play a hockey video game in a simulated environment | https://www.youtube.com/watch?v=Q-WgQcnessA&t=1450s | Murphy, 2014 | NES AI Learnfun & Playfun, ep. 3: Gradius, pinball, ice hockey, mario updates, etc. | https://www.youtube.com/watch?v=Q-WgQcnessA&t=1450s | Alex Meiburg
Playing dead | Evolved organisms | Eliminate mutations which increased the replication rate of evolutionary agents | The organisms evolved to recognize when they were in the test environment and "play dead" so that they would not be eliminated and could continue to replicate outside the test environment. After the inputs to the test environment were randomized, the organisms evolved a new strategy: to probabilistically perform tasks that would accelerate their replication, thus slipping through the test environment some percentage of the time. | After each mutation, measure the mutant's replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent | | Wilke et al, 2001 | Evolution of digital organisms at high mutation rates leads to survival of the flattest | https://www.nature.com/articles/35085569 | Lehman et al, 2018 / Luke Muehlhauser | https://arxiv.org/abs/1803.03453
Power-seeking | Language model | Produce text output that is helpful, honest and harmless | Larger LMs and those fine-tuned with RLHF "more often give answers that indicate a willingness to pursue potentially dangerous subgoals: resource acquisition, optionality preservation, goal preservation, power seeking, and more." | Generate coherent text that maximizes positive human feedback | | Perez et al, 2023 | Discovering Language Model Behaviors with Model-Written Evaluations | https://arxiv.org/abs/2212.09251
Program repair - sorting | Genetic algorithm | Debug a program that sorts the elements of a list | Genetic debugging algorithm GenProg made the program output an empty list, which was considered a sorted list by the evaluation metric | Produce an output list which is in sorted order | | Weimer, 2013 | Advances in Automated Program Repair and a Call to Arms | https://web.eecs.umich.edu/~weimerw/p/weimer-ssbse2013.pdf | Lehman et al, 2018 | https://arxiv.org/abs/1803.03453
Program repair - files | Genetic algorithm | Debug a program so that it produces the correct output | Genetic debugging algorithm GenProg learned to delete the target output file and get the program to output nothing | Minimise the difference between program output and target output file | | Weimer, 2013 | Advances in Automated Program Repair and a Call to Arms | https://web.eecs.umich.edu/~weimerw/p/weimer-ssbse2013.pdf | Lehman et al, 2018 / James Koppel | https://arxiv.org/abs/1803.03453
Qbert - cliff | Evolutionary algorithm | Play Qbert in a human-like manner | Agent learns to bait an opponent into following it off a cliff, which gives it enough points for an extra life, and repeats this in an infinite loop | Maximize score | https://www.youtube.com/watch?v=-p7VhdTXA0k | Chrabaszcz et al, 2018 | Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari | https://arxiv.org/abs/1802.08842 | Rohin Shah
Qbert - million | Evolutionary algorithm | Play Qbert within the game rules | "The agent discovers an in-game bug. For a reason unknown to us, the game does not advance to the second round but the platforms start to blink and the agent quickly gains a huge amount of points (close to 1 million for our episode time limit)" | Maximize score | https://www.youtube.com/watch?v=meE5aaRJ0Zs | Chrabaszcz et al, 2018 | Back to Basics: Benchmarking Canonical Evolution Strategies for Playing Atari | https://arxiv.org/pdf/1802.08842.pdf | Sudhanshu Kasewa
Reward modeling - Hero | Reward modeling | Maximize game score | The agent repeatedly shoots the spider but barely misses it | Maximize output from a learned reward model, which rewards actions that seem to lead to shooting a spider | https://www.youtube.com/watch?v=Ehc3lsQqewU&feature=youtu.be&t=52 | Ibarz et al, 2018 | Reward learning from human preferences and demonstrations in Atari | https://arxiv.org/abs/1811.06521 | Jan Leike
Reward modeling - Montezuma's Revenge | Reward modeling | Maximize game score | The agent repeatedly moves towards the key without grabbing it | Maximize output from a learned reward model, which rewards actions that seem to lead to grabbing the key | https://www.youtube.com/watch?v=_sFp1ffKIc8&list=PLehfUY5AEKX-g-QNM7FsxRHgiTOCl-1hv&index=3&t=0s | Ibarz et al, 2018 | Reward learning from human preferences and demonstrations in Atari | https://arxiv.org/abs/1811.06521 | Jan Leike
Reward modeling - Pong | Reward modeling | Maximize game score | The agent bounces the ball back and forth without scoring | Maximize output from a learned reward model, which rewards actions that contribute to scoring | https://lh3.googleusercontent.com/Gtq_PR9CZRN0FEIbO83osWKEVXbMNTMPP4xY8snEXmBAuIAhIm5Ob9BkcADne6HCGKvLsyxEQAIbr-cgtKuIP1EfKs_LMAwNRLx96w=w1440-rw-v1 | Christiano et al, 2017 | Deep reinforcement learning from human preferences | https://deepmind.com/blog/learning-through-human-feedback/ | Jan Leike
Reward modeling - Private Eye | Reward modeling | Maximize game score | The agent repeatedly looks left and right | Maximize output from a learned reward model, which rewards actions that seem to lead to capturing a suspect | https://www.youtube.com/watch?v=FR6fsGDdiFY&list=PLehfUY5AEKX-g-QNM7FsxRHgiTOCl-1hv&index=3 | Ibarz et al, 2018 | Reward learning from human preferences and demonstrations in Atari | https://arxiv.org/abs/1811.06521 | Jan Leike
Road Runner | Reinforcement learning | Play Road Runner to a high level | Agent kills itself at the end of level 1 to avoid losing in level 2 | Maximize score | | Saunders et al, 2017 | Trial without Error: Towards Safe RL with Human Intervention | https://owainevans.github.io/blog/hirl_blog.html
Robot hand | Reward modeling | Grasp an object | The agent tricked a human evaluator by hovering its hand between the camera and the object | Maximize the feedback received from a human, who is evaluating whether the agent has grasped the object | https://openaicom.imgix.net/f12a1b22-538c-475f-b76d-330b42d309eb/gifhandlerresized.gif | Christiano et al, 2017 | Deep reinforcement learning from human preferences | https://openai.com/research/learning-from-human-preferences
Roomba | Reinforcement learning | Move around the room at high speed while avoiding collisions with other objects | The Roomba learnt to drive backwards to achieve high speed without being penalized for collisions, because there are no bumpers on the back | Reward for speed, and penalty for activation of the bumper sensors on the front | | Custard Smingleigh | Custard Smingleigh's tweet | https://twitter.com/smingleigh/status/1060325665671692288 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
ROUGE summarization | Language model | Produce high-quality summarizations | "An effort at a ROUGE-only summarization NN produced largely gibberish summaries" | Maximize the ROUGE score of the summarization | | Paulus et al, 2017 | A Deep Reinforced Model for Abstractive Summarization | https://arxiv.org/abs/1705.04304 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples
Running gaits | Reinforcement learning | Learn to run in a human-like manner | A simulated musculoskeletal model learns to run with unusual gaits (hopping, pigeon jumps, diving) that increase its reward | Optimise the muscle activity of a human musculoskeletal model so that the model travels as far as possible within 10 seconds | https://www.youtube.com/watch?v=rhNxt0VccsE | Kidziński et al, 2018 | Learning to Run challenge solutions: Adapting reinforcement learning methods for neuromusculoskeletal environments | https://arxiv.org/abs/1804.00361 | NIPS 2017 talks
61 | Soccer | Reinforcement learning | Gain possession of the ball | The agent learned to reach the ball and then vibrate against it, touching it as frequently as possible | Maximize a shaping reward for touching the ball | Andrew and Teller, cited as a personal communication in Ng et al, 1999 | Policy Invariance under Reward Transformations | http://luthuli.cs.uiuc.edu/~daf/courses/games/AIpapers/ng99policy.pdf | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples | |||||||||||||||||||
62 | Sonic | Reinforcement learning | Play Sonic to a high level | PPO algorithm discovers that it can slip through the walls of a level to move right and attain a higher score | Maximize score in a simulated game environment | Christopher Hesse et al, 2018 | OpenAI Retro Contest | https://blog.openai.com/retro-contest/ | Rohin Shah | ||||||||||||||||||||
63 | Strategy game crashing | Genetic algorithm | Play a strategy game | "Since the AIs were more likely to get 'killed' if they lost a game, being able to crash the game was an advantage for the genetic selection process." | Maximize score in a simulated game environment | Salge et al, 2008 | Using Genetically Optimized Artificial Intelligence to improve Gameplaying Fun for Strategical Games | https://cs.pomona.edu/~mwu/CourseWebpages/CS190-fall15-Webpage/Readings/2008-Gameplaying.pdf | Anonymous form submission | ||||||||||||||||||||
64 | Superweapons | Unknown | Play the game Elite Dangerous within the game rules | The AI exploited a bug which enabled it to craft overly powerful weapons. "It appears that the unusual weapons attacks were caused by some form of networking issue which allowed the NPC AI to merge weapon stats and abilities." | Play the game Elite Dangerous | Sandwell, 2016 | Elite's AI created super weapons to hunt down players | https://www.digitalspy.com/videogames/a796635/elite-dangerous-ai-super-weapons-bug/ | Stuart Armstrong | http://lesswrong.com/lw/lvh/examples_of_ais_behaving_badly/ | |||||||||||||||||||
65 | Sycophancy | Language model | Produce text output that is helpful, honest and harmless | Larger language models showed a tendency to express more agreement with the user's stated views. This happens both for pretrained models and models fine-tuned with human feedback (RLHF). | Generate text that resembles training text (and maximizes positive human feedback, if fine-tuned) | Perez et al, 2023 | Discovering Language Model Behaviors with Model-Written Evaluations | https://arxiv.org/abs/2212.09251 | |||||||||||||||||||||
66 | Tetris pause | PlayFun | Play Tetris in a human-like manner | PlayFun algorithm pauses the game of Tetris indefinitely to avoid losing | Maximize score | Murphy, 2013 | The First Level of Super Mario Bros. is Easy with Lexicographic Orderings and Time Travel | http://www.cs.cmu.edu/~tom7/mario/mario.pdf | |||||||||||||||||||||
67 | Tic-tac-toe memory bomb | Evolutionary algorithm | Win games of 5-in-a-row tic-tac-toe played on an infinite board, within the rules of the game | Evolved player makes invalid moves far away on the board, causing opponent players to run out of memory and crash | Win games of 5-in-a-row tic-tac-toe played on an infinite board | Lehman et al, 2018 | Surprising Creativity of Digital Evolution | https://arxiv.org/abs/1803.03453 | |||||||||||||||||||||
68 | Tigers | Diffusion model | Produce images showing five tigers | Finetuned diffusion model produces images with the words "five tigers" written on them. | Finetuning to produce images that reflect the user prompts | https://rl-diffusion.github.io/images_256/overoptimization/vlm-counting/three-tigers.png | Black et al, 2023 | Training Diffusion Models with Reinforcement Learning | https://rl-diffusion.github.io/ | Arthur Conmy | https://twitter.com/svlevine/status/1660707088946049024 | ||||||||||||||||||
69 | Timing attack | Genetic algorithm | Classify images correctly based on their content | Genetic algorithm evolves timing attack to infer image labels based on hard drive storage location | Classify images correctly | Ierymenko, 2013 | Hacker News comment on "The Poisonous Employee-Ranking System That Helps Explain Microsoft’s Decline" | https://news.ycombinator.com/item?id=6269114 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples | |||||||||||||||||||
70 | Trains | Unknown | Run a rail network where trains don't crash | Stop all trains from running | Penalty for trains crashing | Wooldridge, 2024 | AI’s simple solution to rail problems: stop all trains running | https://news.yahoo.com/ai-simple-solution-rail-problems-142237311.html | Anonymous form submission | ||||||||||||||||||||
71 | Walker | Reinforcement learning | Walk at a target speed | Walking agent in DMControl suite learns to walk using only one leg | Move at a target speed | https://lh3.googleusercontent.com/YJMs2Ji7EVxCXYgAkxehqC_liNnlC37Z3jLjAoEiZlmSt05oyc9qhhRgtkEYO0XEzR1JY-QOkc6MUs19Rv0zl9mu1z-P66Q6-xBaRJSvEgJ0dEg6ZHGO-xrrUgcSYjdAmQ=w1280 | Lee et al, 2021 | PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training | https://arxiv.org/abs/2106.05091 | Anonymous form submission | |||||||||||||||||||
72 | Walking up walls | Evolutionary algorithm | Navigate through an environment containing walls in a natural manner | Video game robots evolved a "wiggle" to go over walls by exploiting a bug in the physics engine, instead of going around them | Navigate through a simulated environment containing walls | Stanley et al, 2005 | Real-time neuroevolution in the NERO video game | http://ieeexplore.ieee.org/document/1545941/ | Lehman et al, 2018 | https://arxiv.org/abs/1803.03453 | |||||||||||||||||||
73 | Wall Sensor Stack | Reinforcement learning | Stack two blocks so as to press against a wall-mounted sensor | R2D3 agent exploited a bug by tricking the sensor into remaining active even when not in contact with a block, by pressing the block against it in a precise way | Cause the wall-mounted sensor to remain activated | Le Paine et al, 2019 | Making Efficient Use of Demonstrations to Solve Hard Exploration Problems | https://arxiv.org/abs/1909.01387 | Gwern Branwen | https://www.gwern.net/Tanks#alternative-examples | |||||||||||||||||||
74 | World Models | Reinforcement learning | Survive as long as possible in the VizDoom game | "The agent discovered an adversarial policy to move around in such a way so that the monsters in this virtual environment governed by the M model never shoot a single fireball in some rollouts. Even when there are signs of a fireball forming, the agent will move in a way to extinguish the fireballs magically as if it has superpowers in the environment." | Survive as long as possible within a learned model of the VizDoom game | https://storage.googleapis.com/quickdraw-models/sketchRNN/world_models/assets/mp4/doom_adversarial.mp4 | Ha and Schmidhuber, 2018 | World Models (see section: "Cheating the World Model") | https://arxiv.org/abs/1803.10122 | David Ha | https://worldmodels.github.io/ |
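The pattern shared by many rows above — a proxy reward whose maximum diverges from the intended goal — can be sketched in a few lines of Python. The function and numbers below are illustrative assumptions only, not taken from any cited paper; the Roomba row (reward for speed, penalty only for the front bumper sensors) is used as the template:

```python
def reward(speed: float, direction: str, hit_obstacle: bool) -> float:
    """Misspecified proxy reward: only a FRONT bumper can register a collision.

    Illustrative values: +speed per step, -10.0 penalty when the front
    bumper is triggered. A robot with no rear bumper never pays the
    penalty while reversing, so the proxy is gamed by driving backwards.
    """
    front_bumper_triggered = hit_obstacle and direction == "forward"
    return speed - (10.0 if front_bumper_triggered else 0.0)

# Compare the two policies in a world where every step hits an obstacle:
forward_return = reward(speed=1.0, direction="forward", hit_obstacle=True)
backward_return = reward(speed=1.0, direction="backward", hit_obstacle=True)

# The reward-maximizing policy collides just as much as the "bad" one,
# but invisibly to the sensor the designer chose to penalize.
assert backward_return > forward_return
```

The same shape recurs across the table: the misspecified-goal column is the `reward` function, and the behavior column is the policy that maximizes it without achieving the intended-goal column.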