In this post we are going to look at implementing a simple reinforcement learning simulation functionally in Clojure.

Reinforcement learning is the branch of AI that studies agents embedded in environments, learning how to behave through repeated interaction with those environments. Arguably it is the “fun” branch of modern AI, since it does not abstract away notions of behavior and decision making in realistic environments.

The model

Here is a sequence of timesteps from the model we will implement. Each gray cell represents a snapshot of the state of our simulation at that timestep. Time flows left-right, up-down.

In each gray cell there is typically a green square and a purple square. The green square represents a patch of grass, and the purple square represents a rabbit. The rabbit is our learning agent, and its goal is to maximize its energy level (visualized by the size of its purple square). The only way the rabbit can gain energy is to eat the grass when it is on top of it. Otherwise it loses a small amount of energy every timestep due to metabolism.

Since the grass respawns at a new location when its energy has been depleted, an optimally behaving rabbit would take the shortest route to the grass, eat it, then take the shortest route to the new location, and so on. In the lower rows of the graphic this indeed appears to be what is happening, as the rabbit’s girth comes to exceed the world size.
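To make those dynamics concrete, here is a rough sketch of the per-timestep energy rule. The function name and the 0.3 metabolic cost are illustrative rather than taken from the gist (though the REPL output later in the post is consistent with losing 0.3 per step), and the grass-depletion and respawn bookkeeping is left to the gist’s grass handling:

(defn update-energy
  [{:keys [rabbit grass] :as state}]
  (if (= (:location rabbit) (:location grass))
    ;; on the grass: absorb its energy
    (update-in state [:rabbit :energy] + (:energy grass))
    ;; elsewhere: pay a small metabolic cost each timestep
    (update-in state [:rabbit :energy] - 0.3)))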

But we don’t hardcode the rabbit’s behavior; the rabbit learns this behavior through reinforcement learning.

Reinforcement Learning

Our implementation of reinforcement learning is as simple as possible. The agent keeps a queue of the past N states it has been in, and a queue of the past N changes in energy level (deltaE):

          t-3     t-2     t-1     t0
state      2       2       6       3
deltaE    +2      +2     -0.5    -0.5

We estimate the value of a particular state by summing all the changes in energy experienced from that state up to the present. The summation is weighted so that energy deltas experienced sooner after the state count more, with the weight halving at each subsequent timestep. For example, if the agent’s memory looked like the table above, we would estimate the value of state 2 to be:

Value(CS2) = (2*1) + (2*1/2) - (0.5*1/4) - (0.5*1/8) = 2.8125

This value is then stored alongside similar valuations of state 2 from other timesteps, and the expected value of state 2 is the average of all those values.
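To make the arithmetic concrete, here is a minimal sketch of that valuation, assuming the memory is an oldest-first vector of energy deltas and using the halving weights from the example (discounted-value is an illustrative name, not the gist’s):

(defn discounted-value
  "Sum the energy deltas from index i onward, halving the weight at each later timestep."
  [energy-deltas i]
  (reduce +
          (map-indexed (fn [k delta] (* delta (Math/pow 0.5 k)))
                       (drop i energy-deltas))))

;; the memory from the table above, oldest first:
(discounted-value [2 2 -0.5 -0.5] 0)   ;=> 2.8125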

State compression

In the learning mechanism described above, we first compress the states before using them. Consider: the uncompressed environment state consists of a rabbit and grass on a 5x5 grid of locations. Since each can be located at any of the 25 possible locations, there are 25² = 625 possible uncompressed states of the environment. Learning good expected-value estimates for each of those uncompressed states would require repeated visits to each and would take a long time. Worse, any time the rabbit found itself in a novel state, it would have no idea what to do.

Instead, we compress the environment state by mapping the 625 possible states down to, in this case, 5 compressed states. The mapping function we use is simply the distance of the rabbit to the grass. Thus compressed state 0 means the rabbit is on the grass, compressed state 1 means the rabbit is one unit away, and so forth. This allows for generalization and much quicker learning.
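Since compress-state shows up in the code below, here is a sketch of what such a mapping could look like, assuming the distance used is Chebyshev (king-move) distance, which on a 5x5 grid yields exactly the compressed states 0 through 4 (the actual function lives in the gist):

(defn compress-state
  [{{[rx ry] :location} :rabbit
    {[gx gy] :location} :grass}]
  ;; distance from rabbit to grass; 0 means the rabbit is on the grass
  (max (Math/abs (- rx gx))
       (Math/abs (- ry gy))))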

Learning a good compression function is essentially what makes AI hard. In our implementation we don’t attempt to learn it and instead hardcode it as described. But imagine using reinforcement learning to play chess. Chess has (substantially) more possible board states than 625. A common compression function taught to chess novices is to assign each piece a numerical value and to compress a given board state by summing the values of all the pieces on the board. This compression allows for productive generalization in many cases, but fails in many others, and no modern chess engine uses it. Automatically learning a quality compression function remains an open problem.

No learning

The simulation thus works in the following way: at each timestep the rabbit hops from one location to the next by determining which moves are available to it, looking up each move’s expected value as calculated by reinforcement learning, and choosing the move with the highest expected value (breaking ties randomly, as when it is just starting out).

Before we look at some code, let’s see what happens when we let loose a rabbit with an ablated memory, i.e. with the learning mechanism turned off. (We also turn off metabolism, because otherwise the rabbit would shrink until it was no longer visible.)

One can compare this image with the previous one. Without a learning mechanism, the rabbit is doomed to wander around randomly, occasionally stumbling onto the grass but just as quickly wandering off it. The rabbit does slowly grow, since we turned off metabolism, but nowhere near as quickly as the learning-capable rabbit, which lost energy to metabolism at every timestep.

Code check

Code gist

Functional programming is so expressive that we can fit the entire simulation in one line:

(iterate step (init-state))

Ok, that is a bit facetious, but thinking of the simulation as a lazy sequence of states allows us to run some succinct queries. For example, show me just the rabbit’s energy level over 10 iterations of the simulation:

=> (map #(-> % :rabbit :energy) (take 10 (iterate step (init-state))))

(5 4.7 4.4 4.1000000000000005 3.8000000000000007 3.500000000000001 3.200000000000001 2.9000000000000012 2.6000000000000014 2.3000000000000016)

Let’s take a look at that state initialization:

(defn init-state
  []
  {:rabbit
   {:energy   5
    :location [(rand-int SIZE) (rand-int SIZE)]
    :memory
    {:last-n-states (vec (repeat HISTORY nil))
     :last-n-energy (vec (repeat HISTORY nil))
     :expected-util {}}}
   :grass
   {:energy   4
    :location [(rand-int SIZE) (rand-int SIZE)]}})

Yep, a single nested map encapsulates the entire state of the simulation. Having all your state in one monolithic structure is great for reasoning and prototyping, although execution performance might take a hit depending on its size. There is also no type checking: a misspelled key name fails silently rather than raising an error.
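For example, a quick REPL check against the init-state above:

;; a correct key path works as expected...
(get-in (init-state) [:rabbit :energy])   ;=> 5
;; ...while a misspelled key silently yields nil instead of an error
(get-in (init-state) [:rabbit :enregy])   ;=> nil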

Here is the step function that we repeatedly pipe the state through to compute each timestep:

(defn step
  [s]
  (->
    s
    (grass)
    (bunny)))

You can check out the grass handling in the gist linked above. Most of the bunny handling is likewise bookkeeping, but we will take a look at the behavior function called by the bunny function, since it may be the most difficult to parse:

(defn behavior
  [state]
  (let [[lx ly] (->
                  state
                  :rabbit
                  :location)
        [bm cs] (rand-nth
                  (val
                    (first
                      (reduce
                        (fn [m [l cs :as v]]
                          (update-in
                            m
                            [(expected-utility state cs)]
                            conj
                            v))
                        (sorted-map-by >)
                        (list*
                          [[lx ly]
                           (compress-state state)]
                          (for [i (range -1 2) ;direction
                                j (range -1 2) ;direction
                                r (range 1 (inc MOVE-RANGE))
                                :let [loc [(+ lx (* r i)) (+ ly (* r j))]]
                                :while (in-bounds? loc)]
                            [loc
                             (compress-state
                               (make-move state loc))]))))))]
    (->
      state
      (make-move bm)
      (update-in [:rabbit :memory :last-n-states] conj cs))))

This function is where the rabbit queries its memory of compressed states and their associated estimated values to determine which of its available moves to select. To accomplish this, we build a sorted map whose keys are the expected values of moves (based on memory) and whose values are the moves themselves, then randomly select from the moves sharing the highest expected value. The reduce builds the sorted map, while the for comprehension generates every location the rabbit could move to (including its current location), each paired in a vector with the compressed state that would result from moving there.
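The expected-utility lookup used above averages the stored valuations for a compressed state. As a sketch, assuming :expected-util maps each compressed state to the collection of valuations recorded for it (with never-seen states defaulting to 0; see the gist for the actual implementation):

(defn expected-utility
  [state cs]
  (let [valuations (get-in state [:rabbit :memory :expected-util cs])]
    (if (seq valuations)
      (/ (reduce + valuations) (count valuations))
      0)))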
