Fractal AI Book #1: Forward Thinking
Fractal AI
A Fragile Theory of Intelligence
Sergio Hernández Cerezo
Guillem Duran Ballester
FrAGIle
BOOK #1
“Forward Thinking”
Version: V4.1
Main researchers
Sergio Hernández Cerezo (@EntropyFarmer)
Guillem Duran Ballester (@Miau_DB)
Special thanks
The researchers’ families, for putting up with us through all the ‘eureka’ moments.
HCSoft and Source{d} for their unconditional support.
Reviewers
Eiso Kant (@EisoKant), CEO at source{d}
José María Amigó García (Elche University)
Roshawn Terrell (@RoshawnTerrell)
Juan G. Cruz Ayoroa
Aidan Rocke (@AidanRocke)
Spiros Baxevanakis (@SpirosBax)
Brian Njenga
Trevor Gower
Anton Osika (@AntonOsika)
Doug Mayhew
Contents
1.1 - The playground of intelligence
1.1.2.1 - Forward vs Backward intelligence
2.1.2 - Conditional Causal Cones
2.2.1 - Dead vs Alive states
2.2.2 - Reward function properties
2.2.3 - Relativize: Universal reward reshaping
2.2.4 - Reward density over Causal Slices
2.3 - Policies: defining strategies
2.3.3 - Probability density due to policy over Causal Slices
2.4 - Divergence between distributions
3.1.1 - Causal Slice divergence
3.1.2 - Intelligent Scanning
3.1.3 - Scanning sub-optimality coefficient
3.2.1 - Intelligent decision
3.2.2 - Decision sub-optimality coefficient
3.3 - Global sub-optimality
4 - FMC: a Fractal Monte Carlo algorithm
4.1.1 - Starting at Monte Carlo
4.1.2 - Choosing the intelligence parameters
4.1.2.1 - Decisions per second
4.1.2.3 - Number of walkers
4.1.4 - Probability densities
4.1.4.1 - Density of walkers
4.1.6 - Making the decision
4.2 - The migratory process
4.2.2 - Simplifying the Virtual Reward
4.2.3 - Balancing exploitation and exploration
4.2.3.1 - Relativizing exploitation and exploration
4.2.3.2 - Altering the natural balance
4.2.3.3 - The “Common Sense” intelligence
4.2.4 - Probability of cloning
4.4 - Classifying the algorithm
4.4.1 - Monte Carlo Planning algorithm
4.4.4 - Evolutionary algorithm
5.1 - Discrete case: Atari 2600 games
5.1.2 - Sampling efficiency
5.1.3 - Implementation details
5.1.3.1 - Github repository
5.1.3.2 - Using two FMC flavours
5.2 - Continuous case: Flying rocket
5.2.1 - Flying a chaotic attractor
5.2.3 - Implementation details
6.1 - Distributed computing
6.2 - Adding learning capabilities
6.2.1 - Using a DQN for learning
6.3 - Common Sense Assisted Control
6.5 - Real-time decision-making
Atari experiment references
“For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because he had achieved so much—the wheel, New York, wars and so on—whilst all the dolphins had ever done was muck about in the water having a good time. But conversely, the dolphins had always believed that they were far more intelligent than man—for precisely the same reasons.”
Douglas Adams, The Hitchhiker's Guide to the Galaxy
One of the big obstacles in the field of artificial intelligence is the lack of a definition of intelligence based on solid mathematical and physical principles, one that could inspire the design and implementation of efficient intelligent algorithms.
For instance, consider the most widely accepted definition of intelligence, signed by 52 specialists in the field [2]:
“A very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. It is not merely book learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings...”
A more recent definition [3], provided by Shane Legg, chief scientist of DeepMind, and Marcus Hutter, creator of AIXI, is the following:
“Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”
Although there are many other definitions of intelligence, they are too fuzzy to help us develop a theory of intelligent behaviour or to give us insight into how a general, computable and efficient algorithm for generating intelligent behaviour should look.
This document is an effort to present such a definition, based on entropic principles deeply inspired by the concept of “Causal Entropic Forces” introduced by Alexander Wissner-Gross in 2013 [1], and to propose a generic implementation of those principles.
As a first attempt at defining intelligence, we could say that intelligence works by making decisions that directly affect the degrees of freedom of a system in such a way that its future evolution is biased toward rewarding futures.
Let's suppose we are controlling a cart-pole that can move left or right by pushing one of the two available buttons. Our goal is to keep the pole standing up, so the ball at the tip of the pole must stay as high as possible.
In that case, the intelligence can affect the system's evolution by pressing one of the two buttons at any time. The ultimate goal of the intelligence is then to continuously choose the actions that keep the ball as high as possible.
In this example, the action space is discrete, but it could also be continuous: instead of having two buttons to choose from, we may have a “joystick” we can push, so our available actions are real numbers in the range [-1, 1]. The simulation of the system can also be more or less deterministic, and the goal could be a combination of several sub-goals. None of this would fundamentally change the problem; it would only require more computation.
When the intelligence is asked to choose between a discrete set of actions {ai}, it will internally score them according to some metric and then output an “intelligent decision”: the action with the highest score or, in the continuous case, the average of a number of actions weighted by their normalised scores.
Decision = ∑(ai * Score(ai)) / ∑(Score(ai))
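As an illustration, here is a minimal Python sketch of this decision rule (the helper name and signature are ours, not from any reference implementation):

import numpy as np

def decide(actions, scores, discrete=True):
    # Score-based decision: argmax in the discrete case, or the
    # normalised-score-weighted average of the actions otherwise.
    scores = np.asarray(scores, dtype=float)
    if discrete:
        return actions[int(np.argmax(scores))]
    weights = scores / scores.sum()
    return np.average(np.asarray(actions, dtype=float), axis=0, weights=weights)

# Example: two continuous actions in [-1, 1] with scores 1 and 3
# yield the weighted decision 0.25 * (-1) + 0.75 * (+1) = 0.5.
print(decide([-1.0, 1.0], [1.0, 3.0], discrete=False))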
“You can know the past, but not control it. You can control the future, but have no knowledge of it.”
Claude Shannon
There are two main strategies used in an intelligent decision making process:
On one hand, we can use information from past events, along with the decisions that were taken and their corresponding outcomes, and eventually learn from that information to influence future decisions. We will refer to this as ‘backward-thinking’, as we base our decisions only on events from the past.
Such a backward-thinking process could in fact learn to predict the best action given an initial state, but it could also learn to predict the next state of the system, as it has access to pairs of initial states, actions, and final states. Predicting the final state from the initial one is equivalent to being able to internally simulate the system over relatively long periods, by simulating one state after another in small jumps.
At some point, evolution began to produce agents that were able to project their current state into the future with relative accuracy, enabling them to ponder their available actions not only in terms of past experiences, but also by predicting the foreseeable future each action leads to, along with its consequences.
This ‘forward-thinking’ process is a planning process that makes better decisions as the simulation gets better, as opposed to the learning-based strategies of backward-thinking, which depend on the accumulation of past experiences to improve.
Both strategies are complementary. You need to learn how to simulate the system before you can start thinking forward, while forward-thinking can be used to detect and develop better decisions in situations where merely repeating past strategies is no longer viable.
This document will focus on forward-thinking: the ability to make near-perfect intelligent decisions based on a near-perfect simulation of the system, without the need for any prior learning. This ability is present in human cognitive processes and, in current AI methods, corresponds to what are called “planning algorithms”.
“No sensible decision can be made any longer without taking into account not only the world as it is, but the world as it will be.”
Isaac Asimov
So what is the basic idea behind “scoring” an action? Imagine the cart holding the pole is in the situation shown in the image:
If we push the cart to the right, the pole will fall and the red ball will end at the lowest possible position: a single possible future state, with a reward of zero. However, if we instead push it to the left, the pole will recover its upright position, not only maximising the reward but also providing access to a greater number of future states.
The right decision here is then pushing to the left, for two separate reasons: it leads to more rewarding futures, and it keeps a greater number of future states reachable.
The process of scoring options must therefore be guided by a search for more possible future states, usually called ‘exploration’, while also taking into account how rewarding those future states are, i.e. ‘exploitation’. Reaching and maintaining this fragile balance is the key idea behind any intelligent process.
We have now narrowed the actual problem of intelligence down to finding a way to scan the space of future states such that exploration and exploitation are balanced during the process, and then making a decision about our next action based on the findings.
“By far, the greatest danger of Artificial Intelligence is that people conclude too early that they understand it.”
Eliezer Yudkowsky
Building a detailed theory of intelligence based on these ideas requires fleshing out some fundamental concepts: the shape of this ‘space of future states’, what ‘scanning’ will mean to us, what ‘balancing exploration and exploitation’ means, and finally, how we can use the information derived from the scanning process to make intelligent decisions about the available actions.
“What is required is that the mind be prepared in advance, and be already stepping from thought to thought, so that it will not be too much held up when the path becomes slippery and treacherous.”
Leibniz, on Rational Decision-Making
In order to understand the ‘space of future states’ an intelligence will need to scan, we define a Causal Cone X(x0, 𝛕) as the set of all the paths the system can take starting from an initial state x0 if allowed to evolve over a time interval of length 𝛕, the ‘time horizon’ of the cone.
Causal cones are usually divided into two parts: the cone’s ‘horizon’, formed by the final states (t = 𝛕) for all the possible paths, and the rest of the cone (t < 𝛕) usually referred to as the cone’s ‘bulk’.
A Causal Slice of the cone X(x0, 𝛕) at time t ∈ [0, 𝛕] is the horizon of the cone X(x0, t), which may be denoted by XH(x0, t). It is important to note that causal slices consist of sets of states, unlike causal cones, which are formed by full paths.
We can imagine these causal slices as the set of all future states the system can evolve to in a given time t, starting from x0.
Given that the initial state x0 contains specific information concerning the system’s degrees of freedom, and given that the intelligence can alter those values by taking an initial action, all the paths forming the Causal Cone can be partitioned based on the initial action taken.
We define the Conditional Causal Cone associated with an action a ∈ A, X(x0, 𝛕|a), as the cone formed by all the paths that start by taking the action a. The conditional cones then form a partition of the original cone:
X(x0, 𝛕) = ∪a∈A X(x0, 𝛕|a)
“To employ the art of consequences, we need an art of bringing things to mind, another of estimating probabilities and, in addition, knowledge of how to evaluate goods and ills.”
Leibniz, on Rational Decision-Making
In our setup, we will assume a reward function R(x) is defined over the state space. This reward function must follow some basic rules in order to be useful to the intelligence.
We will say a state of the system is ‘alive’ from the intelligence standpoint when:
We will then say the intelligence is ‘alive’ when it is in an alive state, and ‘dead’ otherwise. This naturally defines the most basic form of reward function, associated with the goal “keep alive”:
R0(s) = 1 if s is an ‘alive’ state
R0(s) = 0 if s is a ‘dead’ state
In general, this basic reward function is always present, identifying the zones where you don’t want your agent to be, so that the algorithm avoids visiting them at any cost.
“I came to see that there is a species of mathematics in estimating reasons, where they sometimes have to be added, sometimes multiplied together in order to get the sum. This has not yet been noted by the logicians.”
Leibniz, on Rational Decision-Making
A function R(x) defined over the state space of the system can be considered a reward function for an intelligent agent when it meets the following criteria:
As an example, consider R(x) as being the battery level of an electric-powered, remote-controlled agent; then:
Usually, our actual reward will be a composition of several more basic rewards -keep alive, energy level, health level, etc.- that are multiplied together to build the reward being maximised by the agent.
R(s) = R0(s) × R1(s) × ... × Rn(s)
Sometimes a reward function is too badly shaped to be used directly: if you want the agent to get ‘as much money as it can’, you need to add a reward proportional to the agent’s bank account balance, a figure that can easily be zero or even negative.
Whenever we have a reward component RN(s) that cannot be guaranteed to strictly follow the above rules -as happens, for instance, if the reward can be negative- we need to reshape it in order to force the needed properties:
R(s) = Exp(RN(s)) if RN(s) ≤ 0
R(s) = 1+Ln(1+RN(s)) if RN(s) > 0
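This reshaping translates directly into Python; a minimal sketch (the function name is ours):

import math

def reshape_reward(r):
    # Map a raw reward component RN(s) into a strictly positive,
    # order-preserving value, following the two formulas above.
    if r <= 0:
        return math.exp(r)           # (-inf, 0] -> (0, 1]
    return 1.0 + math.log(1.0 + r)   # (0, inf) -> (1, inf), compressed

Both branches meet continuously at r = 0, where the reshaped value is 1.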
For every slice XH(x0, t) of the causal cone, we can calculate the total reward RTOT(x0, t) of the slice as the integral of the reward over the slice. We may then convert the reward into a probability density PR over the slice as follows:
PR(x|x0, t) = R(x) / RTOT(x0, t)
A policy 𝜋 is a pair of functions that completely define the strategy an intelligent system will follow when scanning the future consequences of its actions and deciding on the next action to take.
𝜋 = {𝜋S, 𝜋D}
The scanning policy 𝜋S is a function that, given a state x ∈ E of the system, outputs a probability distribution P over the available actions:
𝜋S : E → P
When considering the probability of choosing a particular action a, we will use the conditional notation 𝜋S(a|x).
A special case of scanning policy is the random policy 𝜋SRND that would assign a uniform distribution over the actions regardless of the state x, so all actions are equally probable and 𝜋SRND(ai|x) = 𝜋SRND(aj|x), ∀ai, aj ∈ A:
𝜋SRND : E → P with P a uniform distribution.
“It is the mark of a truly intelligent person to be moved by statistics.”
George Bernard Shaw
A policy must also include a mechanism to score the available actions and assign them a probability of being chosen after the scanning phase is finished. This deciding probability distribution shall be denoted by:
𝜋D : A → [0, 1]
There is also a special case where the decision is taken in a random manner so each action is equiprobable and 𝜋DRND(ai) = 𝜋DRND(aj), ∀ai, aj ∈ A:
𝜋DRND : A → [0, 1] is a uniform distribution.
Once a scanning policy 𝜋S is defined, we may use it to calculate the probability of the system evolving to a given state, as the policy serves as a transition probability which, as in a Markov chain, allows us to project the evolution of the probability density over time.
For instance, in the initial causal slice at t = 0, the density is concentrated on the initial state x0, as the slice XH(x0, t = 0) only contains this state. However, as time increases, the causal slice volume will grow and the distribution will evolve.
We can then define this ‘scanning probability density’ Ps as being a function defined over the slice XH(x0, t), with parameters x0, t and 𝜋S:
Ps(x|x0, t, 𝜋S)
It is important to note that, given a scanning policy 𝜋S, some of the states in a slice could be unreachable. For instance, if a policy chooses the same action all the time with probability 1, then Ps(x|x0, t, 𝜋S) is not guaranteed to be non-zero for all x in the slice, as it is with the random policy 𝜋SRND.
We will need to define a reliable measure of how similar two probability distributions are. This is usually done using the Kullback-Leibler divergence as follows:
DKL(p || q) = 𝚺(pi Log(pi/qi)) ≥ 0
This formulation requires qi > 0 whenever pi > 0. In our case, where p = PR(x|x0, t) and q = Ps(x|x0, t, 𝜋S), this is not guaranteed to be true, as we noted before, but we can use Gibbs’ theorem to find a divergence formulation that doesn’t suffer from this weakness.
Gibbs’ Theorem 1. Let p, q ∈ Pn = {p ∈ ℝn : pi ≥ 0, 𝚺 pi = 1}; then 𝚷 qi^pi ≤ 𝚷 pi^pi, i.e. 𝚷 qi^pi is maximized when q = p.
Using this theorem we may define a different divergence of two distributions as follows:
DH(p || q) = Log(𝚷 pi^pi / 𝚷 qi^pi) = 𝚺 pi Log(pi) − 𝚺 pi Log(qi)
This divergence is well defined for any possible distributions p and q, including the problematic case where (pi > 0, qi = 0), and it satisfies the main properties of the KL-divergence we need:
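As a minimal numerical sketch (our own helper, using the expanded form above with the usual convention 0·Log 0 = 0, and reporting the (pi > 0, qi = 0) case as an infinite divergence):

import numpy as np

def dh_divergence(p, q):
    # DH(p || q) = sum_i pi*log(pi) - sum_i pi*log(qi), skipping pi = 0 terms.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float('inf')  # some rewarded state is never scanned
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

print(dh_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions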
“There is always another way to say the same thing that doesn't look at all like the way you said it before. I don't know what the reason for this is. I think it is somehow a representation of the simplicity of nature.”
Richard P. Feynman
We will define intelligence as the ability to minimize a ‘sub-optimality’ coefficient based on the similarity of two pairs of probability distributions, obtained during the two processes involved: scanning and decision-making.
“Intelligence is not to make no mistakes, but quickly to see how to make them good.”
Bertolt Brecht
In the scanning phase, the possible future outcomes of the initial actions are sampled using the scanning policy 𝜋S.
Given a slice of the causal cone XH(x0, t) and a scanning policy 𝜋S we previously defined two probability distributions over it: the scanning density distribution Ps(x|x0, t, 𝜋S), and the reward distribution PR(x|x0, t).
The reward distribution is provided by the environment in an objective manner, so the policy cannot change it, while the scanning density distribution depends directly on 𝜋S, so it makes sense to use the divergence DH(PR, Ps) as a measure of how well the scanning policy leads the agent to rewarding states.
“...in order to decide what we ought to do to obtain some good or avoid some harm, it is necessary to consider not only the good or harm in itself, but also the probability that it will or will not occur, and to view geometrically the proportion all these things have when taken together.”
Leibniz, on Rational Decision-Making
We will define ‘Intelligent Scanning’ as the optimal policy 𝜋SOPT that produces a scanning density distribution proportional to the reward for every slice of the causal cone:
Ps(x|x0, t, 𝜋S) ∝ R(x), ∀x ∈ XH(x0, t)
This implies that both probability densities coincide:
Ps(x|x0, t, 𝜋S) = PR(x|x0, t), ∀t ∈ [0, 𝛕], ∀x ∈ XH(x0, t)
Equivalently, we may define 𝜋SOPT as the policy that makes the causal slice divergence equal zero for any t ∈ [0, 𝛕]:
DH(PR(x | x0, t), Ps(x | x0, t, 𝜋S)) = 0, ∀t ∈ [0, 𝛕]
The idea behind this definition is that the optimal way of scanning a space is to make the probability of searching a particular zone proportional to the expected reward: should you be searching for gold over a wide landscape, it would make sense to adjust the density of gold-miners in each zone to be proportional to the density -the probability of finding- of gold there.
The more similar the two distributions are, the more intelligent and efficient the scanning will be and, in the limit where the policy is optimal, we reach an equilibrium where the divergence between both distributions is exactly zero.
Given that real-world scanning policies will not produce scanning probability densities exactly proportional to the rewards, this divergence will generally not be exactly zero over the different slices XH(x0, t), so integrating the divergences over time provides a measure of the sub-optimality of the scanning policy:
Scan(𝜋S|x0, 𝛕) = ∫[0, 𝛕] DH(PR(x|x0, t), Ps(x|x0, t, 𝜋S)) dt
In order to scale this coefficient into a more sensible figure, we may define the ‘unit of sub-optimality’ as the sub-optimality associated with the random scanning policy 𝜋SRND:
Random Scan Sub-optimality = Scan(𝜋SRND|x0, 𝛕)
We may now divide the previous coefficient by this sub-optimality unit, so that only policies that are ‘better than random’ score below 1, while ‘worse than random’ ones score above 1.
Scan Sub-Optimality(𝜋S|x0, 𝛕) = Scan(𝜋S|x0, 𝛕) / Scan(𝜋SRND|x0, 𝛕)
“There is hardly anyone who could work out the entire table of pros and cons in any deliberation, that is, who could not only enumerate the expedient and inexpedient aspects but also weigh them rightly. Now, however, our characteristic will reduce the whole to numbers, so that reasons can also be weighed, as if by a kind of statics.”
Leibniz, on Rational Decision-Making
In the second part of the process, the information obtained in the scanning phase is used to decide which action to take or, more generally, the probability 𝜋D(a) of each action being taken.
We will say a decision, defined as a probability distribution 𝜋D(a) over all possible actions a ∊ A, is the ‘intelligent decision’ ID(a|𝜋S, 𝛕) for policy 𝜋S and time horizon 𝛕 when the probabilities are proportional to the entropies of the conditional scanning probability densities of the actions over the last slice of the cone, XH(x0, 𝛕).
ID(a|𝜋S , 𝛕) ∝ 𝓗(Ps(x|x0, 𝛕, 𝜋S, a))
Using the fact that the conditional densities Ps for the different actions a ∈ A are independent, and that the corresponding conditional causal cones form a partition of the causal cone, we have:
𝓗(Ps(x|x0, 𝛕, 𝜋S)) = 𝚺a 𝓗(Ps(x|x0, 𝛕, 𝜋S, a))
So we may obtain the intelligent decision as a probability density over the actions as follows:
ID(a|𝜋S, 𝛕) = 𝓗(Ps(x|x0, 𝛕, 𝜋S, a)) / 𝓗(Ps(x|x0, 𝛕, 𝜋S))
This formulation implies that the intelligent decision in the discrete and continuous cases is:
Discrete case | Intelligent decision = arg maxa ID(a|𝜋S, 𝛕) |
Continuous case | Intelligent decision = 𝚺a a · ID(a|𝜋S, 𝛕), the ID-weighted average of the actions |
Given a policy 𝜋 = {𝜋S, 𝜋D} that generates a probability distribution 𝜋D(a) over the actions, we can define its sub-optimality as the divergence with the ideal distribution ID(a):
Decision sub-optimality(𝜋|x0, 𝛕) ∝ DH(ID(a|𝜋S, 𝛕), 𝜋D(a))
As we want this coefficient to be 1 for the random policy, we define our unit of sub-optimality as the sub-optimality of the random decision policy 𝜋DRND, the uniform distribution:
Decision sub-optimality(𝜋|x0, 𝛕) = DH(ID(a|𝜋S, 𝛕), 𝜋D) / DH(ID(a|𝜋S, 𝛕), 𝜋DRND)
Given a policy 𝜋 responsible for both scanning and deciding, we can define its global sub-optimality as the average of both sub-optimality coefficients:
Sub-optimality(𝜋|x0, 𝛕) = (Scan sub-optimality(𝜋S|x0, 𝛕) + Decision sub-optimality(𝜋|x0, 𝛕)) / 2
This global sub-optimality coefficient allows us to determine which policies are approximately random (≈1) and which are nearly optimal (≈0).
Given the definition of sub-optimality of a policy, we can define the IQ of a given policy as follows:
IQ(𝜋|x0, 𝛕) = 1 / sub-optimality(𝜋|x0, 𝛕)
By using a moving average of this IQ over time as the system evolves, we can build a reliable real-time measurement of the intelligence of a particular system evolution.
“Intelligence is the ability to avoid doing work, yet getting the work done.”
Linus Torvalds
Once we have a sensible definition of intelligence, the next step is to use it to define a practical and efficient “decision-taking” -also known as “planning”- algorithm or, more precisely:
“Given (1) a system with some degrees of freedom that can be controlled, (2) an informative simulation of the system’s dynamics (not necessarily a perfect nor a deterministic one), and (3) a reward function defined over the state space, find an algorithm that uses this information to push the degrees of freedom in such a way that the system behaves -or evolves- intelligently.”
We will guide our search by two ambitious “design principles”:
The algorithm presented here is not the only possible one, nor is it a complete implementation of all the potential uses of the previous theory. It should instead be considered one direct and intuitive use of the concepts presented, aimed at generating intelligent behaviour from the raw simulation of a system.
“You can recognize truth by its beauty and simplicity. When you get it right, it is obvious that it is right—at least if you have any experience—because usually what happens is that more comes out than goes in. (...) the truth always turns out to be simpler than you thought.“
Richard P. Feynman
Before presenting the pseudo-code of the algorithm, we will introduce the ideas behind it and how its design aims to meet the previously defined theory with the lowest possible computational time complexity.
“Any one who considers arithmetical methods of producing random digits is, of course, in a state of sin.”
John von Neumann
We will start our implementation by considering the random policy or, equivalently, the standard Monte Carlo approach, where a set of independent random walks is simulated over a number of discrete time steps, building a collection of paths in the causal cone considered in the theory.
In practical terms, we will need an informative simulation function -not necessarily perfect nor deterministic- that, given a system state (which includes the positions of the different degrees of freedom the AI can modify) and a small delta of time dt, outputs the expected next state of the system:
x(t+dt) = Simulation(x(t), dt)
Starting at the system's current state x0 and iterating the process of taking a random decision over the degrees of freedom and then simulating the next state, for a fixed number T of ticks, we can build a random walk of length 𝝉 = T·dt.
In the image below, two different actions -pushing button “A” or “B”- are available. In the continuous case, actions would be real vectors representing the velocities of the degrees of freedom: how strongly each free parameter is pushed, and in which direction.
As we are looking for a decision-taking algorithm, the first decision -or action- taken during the walk, “B” in this example, is an important piece of information, so we will keep it in the walker’s internal state for later use.
By building one path after another, we can build a set of N ‘feasible random walks’ starting at x0 and ending at one of the possible future states of the system.
At this point, some practical questions arise: how many random walks we will be using, the time horizon those walks will explore, and the number of ticks this time will be divided into.
Before going any deeper into the algorithm, it is worth addressing those questions, as they are the main parameters determining both the CPU you will need and the quality of the results.
The first parameter to be chosen is the number of decisions per second the agent will be taking, or the FPS (frames per second) of the algorithm. We are basically deciding on the length of the dt used in the simulations: our tick length.
Basically, this length depends on the system's reaction times. If our agent is modeling a fly, we will need faster decision-making, so a shorter tick length. If it were modeling a spaceship traveling to Andromeda, we could safely take one decision per month, or even per year.
As a guiding number, human brains are considered to run at about 12 decisions per second, so if we were to mimic some human behaviour, a dt of about 0.1 seconds (or 10 FPS) is a nice starting point.
Setting an FPS higher than the system's reaction rate will not really improve the results, while CPU time will grow proportionally to FPS, so just keep this figure around the sweet spot.
The time horizon dictates how far into the future the walkers will explore the consequences of their initial actions.
The idea here is to scan far enough ahead to detect problems before they become unavoidable and, again, it naturally depends on the task:
The number of walkers is easy: the more the better. CPU usage will of course grow linearly with it, but using more walkers is the most efficient way to improve the agent's behaviour.
As the theory mandates that the ‘density of scanning’ -or ‘density of walkers’- must be kept proportional to the reward density, we will need to build all the random walks simultaneously, so that a density of walkers can be defined.
Before going any further, we will show a simple pseudo-code that generates this set of walks and finally decides based on the highest reward found at the final states of the paths: if the path leading to the most rewarding state found started with action “A”, then the agent will actually take action “A”.
// INITIALIZATION:
// Create N walkers with copies of the system’s state:
FOR i:= 1 TO N DO BEGIN
// Walkers start at the system’s initial state:
Walker(i).State:= System.State
// Take walker’s initial decision:
Walker(i).Initial_decision:= random values
END
// SCANNING PHASE:
// Evolve walkers from time=t to t+Tau in M ticks:
FOR t:= 1 TO M DO BEGIN
// PERTURBATION:
FOR i:= 1 TO N DO BEGIN
// The first tick uses the stored initial decision:
IF (t=1) THEN
Walker(i).Degrees_of_freedom:= Walker(i).Initial_decision
ELSE
Walker(i).Degrees_of_freedom:= random values
// Use the simulation to fill the other state’s component:
Walker(i).State:= Simulation(Walker(i).State, dt:= Tau/M)
END
END
// DECIDING PHASE:
Best:= ArgMax(Reward(Walker(i).State))
Decision:= Walker(Best).Initial_decision
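For reference, the same plain Monte Carlo baseline can be sketched in Python. The helpers simulation(state, action, dt), reward(state) and random_action() are assumptions standing in for the problem-specific pieces; note that here the simulation step takes the chosen action explicitly, while the pseudo-code stores it inside the state:

def monte_carlo_decision(x0, simulation, reward, random_action,
                         n_walkers=100, n_ticks=50, tau=5.0):
    # Plain Monte Carlo baseline: N independent random walks of M ticks each;
    # the decision is the first action of the best-scoring walk.
    dt = tau / n_ticks
    best_reward, best_action = float('-inf'), None
    for _ in range(n_walkers):
        initial_action = random_action()          # stored for later use
        state = simulation(x0, initial_action, dt)
        for _ in range(n_ticks - 1):
            state = simulation(state, random_action(), dt)
        final_reward = reward(state)
        if final_reward > best_reward:
            best_reward, best_action = final_reward, initial_action
    return best_action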
Although this code is just a simple starting point, it already meets some of our design goals:
“In applying dynamical principles to the motion of immense numbers of atoms, the limitation of our faculties forces us to abandon the attempt to express the exact history of each atom, and to be content with estimating the average condition of a group of atoms large enough to be visible. This method... which I may call the statistical method, and which in the present state of our knowledge is the only available method of studying the properties of real bodies, involves an abandonment of strict dynamical principles, and an adoption of the mathematical methods belonging to the theory of probability.”
James Clerk Maxwell
According to the theory, no matter which partition {A1, ..., AN} of the causal slice you consider, the walker density Di on each part Ai should be made proportional to the reward density Ri.
Our next step will be properly defining both densities.
In the next image we have partitioned the space with boxes of the same size. Each of the four populated boxes A1 to A4 has a number of walkers Wi inside it, out of a total of six walkers, so the walker density at Ai will be Di = Wi/6.
Please note that this does not need to be a partition; the different zones may overlap, so we are actually using a covering of the set of all the walkers' positions.
At the same time, a reward value is defined over the state space, so we can assign a reward value to each box Ai by averaging the rewards at the positions of the walkers in Ai.
Our final goal is to make both densities proportional or, equivalently, to make Ri/Di as constant as possible.
Basically, Ri/Di is a measure of the “reward per capita” that a walker in box Ai receives, and the idea of making it constant can be interpreted as the need for a fair wealth distribution over the walkers: if zone Ai has a much higher reward per capita than zone Aj, then walkers in Aj will likely prefer to move to Ai. As we aim to make both densities proportional, we need to define a ‘migratory flow’ to balance the reward per capita.
This migratory flow will need to detect zones where the walker density is higher than proportional to the reward (zones with a low ‘reward per capita’) and move some walkers from there to zones with a higher ‘reward per capita’.
In order to move a walker W1 from zone Ai to Aj, we can just select a random walker W2 in Aj and copy its state: W1.state ← W2.state. We say that W1 ‘cloned’ to W2's position.
Here we have just sketched a naïve way of choosing which walkers need to be cloned into which others by using boxes. In the final algorithm, this idea will be replaced with a more general solution.
“Deliberation is nothing else but a weighing, as it were on scales, the conveniences and inconveniences of the fact we are attempting.”
Leibniz, on Rational Decision-Making
In the previous example, when a walker labeled with the first option “A” is cloned to the position of a walker labeled “B”, it clones not only its position but also its label, so after the clone the initial action “A” will have one ‘follower’ less, while action “B” will gain one. After some ticks, the distribution of initial actions in the population of walkers will vary.
In the example above, both actions start having a proportion of 3/6 = 0.5 of the walkers, but after the migratory flow takes place, the “B” population grows to 5/6 while “A” drops to 1/6.
If the process were to stop here -a time horizon 𝛕 of 6 ticks of length dt- then we would use these proportions to build our decision.
In the discrete case, the AI would take its decision by sampling an action from the probability distribution (1/6, 5/6) of the actions, so the most probable action would be “B”.
In the continuous case, where actions are real vectors and there is no finite list of available actions, the decision is formed by averaging the initial decisions of all the walkers.
We commented on the need to force migratory flows away from areas where the reward is low or the population density high, but defining a density is a tricky thing that can also be computationally demanding.
The idea of using boxes -as in the previous introduction- was just a sketch: we didn't define a method to effectively choose which walkers will clone and, more importantly, which walkers they will copy from.
“Presuming that a man has wisdom of the third degree and power in the fourth, his total estimation would be twelve and not seven, since wisdom be of assistance to power.”
Leibniz, on Rational Decision-Making
The ‘Virtual Reward’ is a generalization of the ‘reward per capita’ concept introduced before but, instead of using an externally defined partition, we will build one with a different zone per walker, so we can obtain a ‘personalized’ reward value that tells a walker how ‘lucky’ it is compared to the other walkers: the higher the reward, and the fewer walkers around you, the better.
Choosing one zone per walker to account for the number of walkers inside each part is just a way to approximate the concept of ‘density of walkers’ around a position. At this point we will just assume we have defined some general measure of such density around the position of each walker:
Di = Density of walkers around Wi's position/state.
We will then define the ‘virtual reward’ VRi at walker Wi's position, where the reward value is Ri, as:
VRi = Ri / Di
One simple way to define a density of walkers around a given position is to consider our covering as being formed by a set of spheres of fixed radius in the state space, each centered at one walker's position, and to count the number of walkers inside each one to get a density.
Please note that, as we will only focus on the proportionality of those densities -and not on their actual values- we can safely use the number Ni of walkers inside the ball as a density, without dividing it by N, as this will not alter the proportions.
Then, looking at the image, we can say that at walker W1's position there is a reward of R1 = 6 and a walker density of N1 = 3, so we will calculate its virtual reward as:
VR1 = R1 / N1 = 6 / 3 = 2
If we compare it with the virtual reward of walker W2, VR2 = R2 / N2 = 2 / 1 = 2, we find that both positions are equally ‘appealing’, as the number of walkers is kept proportional to the reward.
“Everything should be made as simple as possible,
but not simpler.”
Albert Einstein
Using densities around the walker positions may not be an ideal approach for several reasons:
Our first simplification will eliminate the need for an externally defined radius, by far the smaller of our two concerns.
The key idea we will use here is that the walker density Di, defined over a sphere centered at walker Wi's position, is roughly -the relation may not be strictly linear- inversely proportional to the average distance from Wi to the other walkers Wj:
Di ∝ 1 / (∑ Dist(Wi, Wj)/N)
Again, N can be eliminated from the equation as we only need to compare proportions, so:
Di ∝ 1 / ∑ Dist(Wi, Wj)
By replacing Di ∝ 1/Ni in the previous formula with this new version of Di we obtain:
VRi ∝ Ri * ∑ Dist(Wi, Wj)
The second simplification we will introduce is quite dramatic, and it may initially sound like a really bad idea: we will replace the average distance from walker Wi to all the other walkers Wj with just one of those distances, randomly chosen:
Di ∝ 1 / Dist(Wi, Wj) with j randomly chosen, j≠i
The resulting virtual reward formulation has a time complexity of only O(n), at the price of being a highly stochastic function:
VRi ∝ Ri / Di ∝ Ri * Dist(Wi, Wj) with j≠i, randomly chosen.
As our only purpose is to compare virtual rewards, proportionality allows us to safely define:
VRi = Ri * Dist(Wi, Wj) with j≠i, randomly chosen.
Using this stochastic version of the density actually does a better job than using the standard average distance, and at a much lower computational cost.
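A minimal Python sketch of this stochastic virtual reward (the names are ours; rewards are assumed to be already reshaped, and distance is a problem-specific metric between states):

import random

def random_companion(i, n):
    # Uniformly pick an index j != i in O(1).
    j = random.randrange(n - 1)
    return j + 1 if j >= i else j

def virtual_rewards(states, rewards, distance):
    # VR_i = R_i * Dist(W_i, W_j), with j != i randomly chosen: O(n) overall.
    n = len(states)
    return [rewards[i] * distance(states[i], states[random_companion(i, n)])
            for i in range(n)]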
“Nobody ever figures out what life is all about, and it doesn't matter. Explore the world. Nearly everything is really interesting if you go into it deeply enough.”
Richard P. Feynman
The virtual reward is a product of two components, a distance and a reward, representing the two basic goals of any intelligence: exploration and exploitation.
Before we can make any attempt to control this balance, we need to make sure both components are normalized to a common scale by applying the relativize function introduced in 2.2.3 - Universal reward reshaping to both the rewards {Ri} and the distances {Di} between walkers before using them in the Virtual Reward:
Ri = (Ri - Mean_R) / Std_dev_R
Ri = iif(Ri > 0, 1+Ln(Ri), Exp(Ri))
Di = (Di - Mean_D) / Std_dev_D
Di = iif(Di > 0, 1+Ln(Di), Exp(Di))
A more general formula for the virtual reward can be considered:
VRi = Ri⍺ * Distβ(Wi, Wj) with j≠i, randomly chosen.
By default both ⍺ and β are set to 1, meaning that, as long as the distances and rewards are properly scaled, exploitation and exploration will actually be balanced.
The value of β is kept fixed at 1, being highly dependent on the metric used, while ⍺ is a parameter we can freely change, typically ranging from 0 up to 2 or more, pushing the balance from equilibrium (⍺ = β) toward an exploration-only mode (⍺ = 0) or toward an aggressive search for reward (⍺ > β).
By manually setting ⍺ = 0, the behaviour changes to a goal-less intelligence, where the decisions are taken in order to increase the number of different reachable futures, regardless of how rewarding they are. The effect is a very clever autopilot that can keep a plane flying around avoiding dangerous paths almost indefinitely.
Please note this “Common sense” mode is nearly equivalent to the idea of maximizing the “empowerment” [6] of the agent as an intrinsic goal. In fact, in this case the agent is maximizing an intrinsic reward represented by the entropy of the available futures after a decision.
Conversely, if ⍺ > β the agent will be somewhat blinded by the reward and will not care much about safety, driving it to dangerous situations where the reward is particularly high. If our system can be in dead states -our agent can die- then keeping ⍺ low is mandatory.
We can then conclude that the value of ⍺ roughly represents how safe it is to focus on exploitation to the detriment of exploration, or how ‘safe’ the environment is for the agent.
“Often one postulates that a priori, all states are equally probable. This is not true in the world as we see it. This world is not correctly described by the physics which assumes this postulate.”
Richard P. Feynman
Once we have defined a simple yet informative value for the virtual reward of a walker, it is time to get back to the migratory process we outlined before and draw a simple method to choose which walkers will clone to which.
In the process of calculating the virtual reward, a walker must obtain the state of another randomly chosen walker in order to compare positions and obtain a distance. We will likewise start the cloning process by choosing another random walker, in order to compare virtual rewards and obtain a probability of cloning.
Once walker Wi has chosen a second random walker Wk, it will compare virtual rewards and, should it find its own to be lower, it may decide to jump to Wk's position by cloning its state.
We will define the probability of walker Wi, with virtual reward VRi, cloning to the state of walker Wk, with virtual reward VRk, as:
P(Wi → Wk) = (VRk − VRi) / VRi when VRk > VRi, and 0 otherwise (taking P = 1 when VRi = 0).
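A minimal sketch of this cloning rule in Python (our own helper; as noted after the pseudo-code below, the raw value can exceed 1, so we clip it):

def clone_probability(vr_i, vr_k):
    # Probability that walker i clones to walker k's state.
    if vr_i == 0:
        return 1.0                # a zero-VR ('dead') walker always clones away
    if vr_k <= vr_i:
        return 0.0                # never clone to a worse (or equal) position
    return min(1.0, (vr_k - vr_i) / vr_i)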
We now present a basic pseudo-code implementation, aimed at simplicity and readability rather than efficiency or scalability.
// SCALING VALUES:
Relativize(R: array of real)
M:= R.Mean
S:= R.Std_dev
// Z-score each value in place, then reshape it into a positive figure:
FOR i:= 1 TO N DO BEGIN
IF (S > 0) THEN
R[i]:= (R[i]-M)/S
IF (R[i]>0) THEN
R[i]:= 1+Ln(R[i])
ELSE
R[i]:= Exp(R[i])
END
// [0] INITIALIZATION PHASE:
// Define dt for a time horizon Tau, divided on M ticks:
dt:= Tau/M
// Create N walkers with copies of the system’s state:
FOR i:= 1 TO N DO BEGIN
// Walkers start at the system’s initial state:
Walker(i).State:= System.State
// Walkers take an initial decision:
Walker(i).Ini_Dec:= random values
// Walkers simulate their next states:
Walker(i).State:= Simulation(Walker(i).State, dt)
END
// SCANNING PHASE:
FOR t:= 1 TO M DO BEGIN
// GET REWARDS AND DISTANCES:
FOR i:= 1 TO N DO BEGIN
// Choose a random walker:
j:= random index from 1 to N, j<>i
// Read reward and distance:
R[i]:= Reward(Walker(i).State)
D[i]:= Distance(Walker(i), Walker(j))
END
// NORMALIZE VALUES:
Relativize(R)
Relativize(D)
// CALCULATE VIRTUAL REWARD:
FOR i:= 1 TO N DO
Walker(i).VR:= R[i] * D[i]
// UPDATE STATES:
FOR i:= 1 TO N DO BEGIN
// Probability of cloning to another walker:
j:= random index from 1 to N, j<>i
IF (Walker(i).VR=0) THEN
Prob:= 1
ELSE IF (Walker(j).VR < Walker(i).VR) THEN
Prob:= 0
ELSE
Prob:= (Walker(j).VR - Walker(i).VR) / Walker(i).VR
// Update state by Cloning or Perturbing:
IF (random < Prob) THEN BEGIN
// Cloning:
Walker(i).State:= Walker(j).State
Walker(i).Ini_Dec:= Walker(j).Ini_Dec
END ELSE BEGIN
// Perturbing:
// Perturbing degrees of freedom:
Walker(i).Degrees_of_freedom:= random values
// Simulate walker’s next state:
Walker(i).State:= Simulation(Walker(i).State, dt)
END
END
END
// DECIDING PHASE:
Decision:= Sum(Walker(i).Ini_Dec) / N
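// Note: this average is the continuous-case decision; in the discrete case,
// the decision is sampled from the proportions of Ini_Dec among the walkers.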
Please note that the probability of cloning can be greater than 1; feel free to clip it to 1 for formal reasons if that makes you more comfortable.
The algorithm may actually fit into many of the categories used in the field of algorithmics, making it difficult to classify properly. Instead, we will name and comment on the categories where it could eventually fit.
Note: this algorithm has a time complexity of O(n), where n is the number of walkers used times the number of ticks the time horizon interval is divided into.
“We shall not cease from exploration
And the end of all our exploring
Will be to arrive where we started
And know the place for the first time.”
T. S. Eliot
The most evident classification of the algorithm is as a planning algorithm, similar to Monte Carlo Tree Search (MCTS), where the different futures a system can visit are scanned up to some depth level (a time horizon), while the less appealing branches are abandoned along the way in order to focus on the most promising ones.
The main differences between Fractal Monte Carlo (FMC) and the many MCTS variants [4] are:
The concept of a “walker” as an entity that uses internal programming to autonomously decide on its evolution maps easily onto the concept of a cellular automaton, except that, in our case, neither the system nor the available actions need to be discrete.
In fact, from the standpoint of a walker, the algorithm is just a repetition of a series of small steps that, when simultaneously performed by a number of walkers, automatically generates an intelligent decision for the agent:
FMC is also a swarm intelligence algorithm, where the collective behaviour of a pool of decentralized, self-organized agents solely determines the decision to be made.
Please note that, although in the pseudo-code the walkers are moved and cloned in a synchronized way for the sake of clarity, this is not mandatory at all: walkers could communicate via asynchronous messages and evolve in an asynchronous and isolated way, making the algorithm extremely easy to parallelize or distribute in order to scale it. This, along with the linear time complexity, makes the algorithm highly scalable.
FMC is a population-based evolutionary algorithm where walkers are divided into population groups (based on their initial actions) that compete for success: the fittest walkers (highest virtual reward) are cloned over the least fit, which are then mutated by randomly perturbing their states in the simulation phase.
There is no established category of “entropic” algorithms, but there should be. An algorithm is entropic when it is driven by the maximization or minimization of some kind of entropy, cross-entropy, or divergence.
Gradient descent optimization of a loss function like cross-entropy or the divergence of two distributions, as used in deep learning, is a clear example of an entropic algorithm.
FMC is an entropic algorithm, as it is based on minimizing the divergence -or, equivalently, maximizing a cross-entropy- between two distributions: the reward and walker probability distributions.
In a Monte Carlo Tree Search, where actions are discrete, the graph of the states visited in the search process is a tree. FMC generates the same kind of tree under the same conditions.
When the decision and state spaces are both continuous, the distances between walkers and the time step dt used for the simulation can be made as small as we want, making the resulting tree be formed of smaller and smaller pieces.
In the limit when both the number of ticks used to divide the time horizon and the number of walkers tend to infinity, the graph morphs from a finite tree to a fractal tree.
We have run several experiments in order to show the algorithm's strong points and to compare FMC with similar algorithms in both the discrete and continuous cases.
One of the most widely accepted benchmarks for AI methods is the set of Atari 2600 games from OpenAI Gym. They all have a discrete decision space of six on/off buttons and well-defined scoring, allowing for fair comparisons between algorithms.
To test FMC against other algorithms, we use a set of 50 Atari games commonly used throughout the planning-algorithm literature, even if some other articles -related to learning algorithms- may add some 5 to 7 extra games.
FMC will be compared against four relevant baselines: the standard human score, the human world record, the best planning SoTA, and the best learning SoTA.
FMC being a planning algorithm, the only “apples-to-apples” comparison is against state-of-the-art (SoTA) planning algorithms. The rest of the comparisons, especially against learning algorithms, are actually “apples-to-bananas”: learning algorithms have no access to a simulation of the model to sample future states from; instead, they learn from perfect memories of past experiences.
That said, our experiments showed that:
| Wins | % |
FMC vs Standard Human | 49 / 50 | 98% | |
FMC vs Human World Record | 32 / 50 | 64% | |
FMC vs Best Planning SoTA | 50 / 50 | 100% | |
FMC vs Best Learning SoTA | 45 / 50 | 90% | |
FMC “solved” the game | 32 / 50 | 64% | |
Solved due to the 1M bug | 16 / 50 | 32% |
We will consider a game as being “solved” when:
Please note that, in the case of bug-limited games, the reported human world records cannot be reached by any program, as the human records can include an additional ‘millions’ digit that never showed up on the game screen, nor in the internal score accessed via the API.
In the following high-score table, solved games use a gray background and are labeled with the reason why they were considered solved.
Game | FMC Score | Standard Human | Human Record | Planning SoTA | Learning SoTA |
Alien (e) | 479,940 | 7,128 | 251,916 | 38,951 | 7,967 |
Amidar | 5,779 | 1,720 | 155,339 | 3,122 | 4,058 |
Assault (e) | 14,472 | 1,496 | 8,647 | 1,970 | 11,734 |
Asterix (b) | 999,500 (1) | 8,503 | 335,500 | 319,667 | 406,211 |
Asteroids (d) | 12,575,000 | 47,389 | 10,004,100 | 68,345 | 167,159 |
Atlantis (d) | 10,000,100 | 29,028 | 7,352,737 | 198,510 | 2,872,645 |
Bank heist | 3,139 | 753 | 199,978 | 1,171 | 1,496 |
Battle zone (b) | 999,000 | 37,800 | 863,000 | 330,880 | 53,742 |
Beam rider (c) | 999,999 | 16,926 | 999,999 | 12,243 | 21,077 |
Berzerk | 17,610 | 2,630 | 1,057,940 | 2,096 | 2,500 |
Bowling | 180 | 161 | 300 | 69 | 136 |
Boxing (a) | 100 | 12 | 100 | 100 | 100 |
Breakout (e) | 864 | 31.8 | 752 | 772 | 748 |
Centipede (d) | 1,351,000 | 12,017 | 1,301,709 | 193,799 | 25,275 |
Chopper command (c) | 999,900 | 9,882 | 999,900 | 34,097 | 15,600 |
Crazy climber (d) | 2,254,100 | 35,829 | 447,000 | 141,840 | 179,877 |
Demon attack (c) | 999,970 | 3,401 | 1,556,345 (2) | 34,405 | 130,955 |
Double dunk (a) | 24 | -15.5 | 24 | 24 | 24 |
Enduro (e) | 5,279 | 860 | 3,618 | 788 | 3,454 |
Fishing derby | 63 | 6 | 71 | 42 | 59 |
Freeway | 33 | 30 | 34 | 32 | 34 |
Frostbite (c) | 999,960 | 4,335 | 552,590 | 6,427 | 4,442 |
Gopher (c) | 999,980 | 2,412 | 120,000 | 26,297 | 41,138 |
Gravitar | 14,050 | 1,693 | 1,673,950 | 6,520 | 2,308 |
Hero | 43,255 | 30,826 | 1,000,000 | 15,280 | 105,929 |
Ice hockey (e) | 64 | 1 | 36 | 62 | 11 |
James bond (e) | 152,950 | 407 | 45,550 | 23,070 | 6,963 |
Kangaroo | 10,800 | 3,035 | 1,436,500 | 8,760 | 15,470 |
Krull | 426,534 | 2,666 | 1,006,680 | 15,788 | 35,024 |
Kung fu master | 172,600 | 22,736 | 1,000,000 | 86,290 | 79,676 |
Montezuma revenge | 5,600 | 4,753 | 1,219,200 | 500 | 4,740 |
Ms Pacman (c) | 999,990 | 15,693 | 290,090 | 30,785 | 5,913 |
Name this game (e) | 53,010 | 8,049 | 25,220 | 15,410 | 12,542 |
Pong (a) | 21 | 15 | 21 | 21 | 21 |
Private eye | 41,760 | 69,571 | 103,100 | 2,544 | 40,908 |
Q-Bert (c) | 999,975 | 13,455 | 2,400,000 (2) | 44,876 | 27,543 |
River raid | 18,510 | 17,118 | 194,940 | 15,410 | 24,568 |
Road runner (c) | 999,900 | 7,845 | 2,038,100 (2) | 120,923 | 367,023 |
Robo tank (e) | 94 | 12 | 74 | 75 | 65 |
Seaquest (c) | 999,999 | 42,055 | 527,160 | 35,009 | 266,434 |
Space invaders | 17,970 | 1,669 | 621,535 | 3,974 | 7,227 |
Star gunner (c) | 999,800 | 10,250 | 77,400 | 14,193 | 84,490 |
Tennis (a) | 24 | -8 | 24 | 24 | 23 |
Time pilot (e) | 90,000 | 5,925 | 66,500 | 65,213 | 18,501 |
Tutankham | 342 | 168 | 3,493 | 226 | 288 |
Up n down (c) | 999,999 | 11,693 | 168,830 | 120,200 | 155,049 |
Venture | 1,500 | 1,188 | 31,900 | 1,200 | 1,520 |
Video pinball (c) | 999,999 | 17,668 | 91,862,206 (2) | 471,859 | 999,999 |
Wizard of Wor (c) | 99,900 | 4,756 | 233,700 (2) | 161,640 | 16,143 |
Zaxxon | 92,100 | 9,173 | 100,000 | 39,687 | 18,057 |
Planning algorithms work by sampling the model many times in order to learn about the family of future states reachable from an initial state after a number of steps. In practice, one or several copies of the Atari environment are used as perfect simulators to build those paths step by step, chaining each state with its predicted next state after taking an action.
Efficiency is usually measured in “samples per action”: the mean number of model samples queried before an action is taken.
All the benchmarked planning algorithms use a minimum of 150,000 samples per action, while FMC, being especially cheap on sampling, used on average 359 times fewer samples per action.
The Atari games experiment was implemented in Python following the ideas in this document.
You can find the full python code for the Atari experiment at:
https://github.com/FragileTheory/FractalAI
Two different implementations of the FMC algorithm were used for the tests. You can find both versions in the GitHub repository.
In the first implementation -the “step-by-step” standard FMC version- we follow the ideas presented in the pseudo-code: the agent makes each decision by building a totally new cone after each step, starting at the agent's current state and using a fixed number of walkers and a fixed tree depth (time horizon).
In the second implementation -the “swarm wave” FMC version- the agent builds a single cone, starting at its initial state, without limiting it to any time horizon. We let this cone grow until one of the paths reaches a desired limit score. Once this goal is met, the algorithm stops: a video of the highest-scoring path is shown, and a dataset of thousands of high-quality episodes is stored, ready to be used to train a DNN.
A game can send the observation of its current state in two flavours: a RAM dump or a screen image.
FMC bases its decisions on the entropic properties of the swarm of walker states. As this is an intrinsic measure, FMC can be applied equally to RAM- or image-based observations, with very similar results.
As most of the methods we compare with can only play image-based games, a fair comparison of IMG vs RAM observations can only be made by running FMC on both with the same parameters (fixed_steps=5, time_limit=15, Max_walkers=30, Max_samples=300). Even with those low values of 300 samples per step, FMC is able to beat state-of-the-art (SoTA) algorithms and totally solve some of the games:
Game | FMC on IMG | FMC on RAM | RAM vs IMG |
atlantis | 145,000 | 139,500 | 96.21% | |
bank heist | 160 | 280 | 175.00% | |
boxing | 100 | 100 | 100.00% | |
centipede | Immortal (1) | Immortal | 100.00% |
ice hockey | 16 | 33 | 206.25% | |
ms pacman | 23,980 | 29,410 | 122.64% | |
qbert | 17,950 | 22,500 | 125.35% | |
video pinball | 1,273,011 (2) | 1,999,999 | 366.29% | |
Average | | | 161.47% |
It may seem counter-intuitive that FMC scored better on RAM without seeing the screen images, but it actually makes sense: the RAM is the actual state of the game, while images are partial observations that contain a lot of noise (a single bit changed in RAM can cause fireworks on screen), so the distance between RAM dumps is far more informative than the distance between the corresponding screen images.
In this experiment we control a 2D rocket flying inside a closed environment. The rocket has two continuous degrees of freedom, corresponding to the thrust levels of the main engine and a secondary one used for rotating.
At each step we define the force applied to each of the degrees of freedom, so our decision space is a continuous 2D space. In this case we are not choosing an action from a list; instead, we are deciding on a force as a 2D vector.
The rocket is a very interesting toy system: it is very dynamic, equilibrium is never reached, and making fast decisions is critical. When FMC is used in a continuous decision system like this, it performs even better than in the discrete case: the decision, being an average of many decisions, is more naturally defined in the continuous case than in the discrete one.
“Chaos was the law of nature; Order was the dream of man.”
Henry Adams
As flying the rocket was not a big problem for Fractal AI, we designed a special environment where the goals were almost impossible to achieve.
First, a hook was attached to the rocket by a rubber band, forming a chaotic oscillator where the final position of the hook is highly sensitive to small changes in the initial conditions, making it extremely difficult to sample the state space and find the low-probability paths that define the right decision.
Second, we defined a ridiculously difficult goal for the system: drive the rocket in such a way that the chaotic hook picks up a falling rock, take the rock to a distant deploy zone (crossed area), release it, wait until it leaves the dotted circle, and repeat this as many times as you can.
To define this ‘hook goal’ we just have to convert it into a reward function.
Only this reward function was added; the rest of the code was just the standard FMC code plus a homebrew physics simulator of the system.
Fractal AI was able to solve this problem using only 300 walkers (paths) and 200 samples per walker (a time horizon of 2 seconds at steps of 0.1 second) for a total of 60,000 samples per decision.
The agent was able to chain several goals in a row and could also recover the rock when it fell to the ground, totally solving the task with extremely low computational resources.
Video 1: Solving the task (https://youtu.be/HLbThk624jI)
This experiment also allows us to watch the fractal paths generated in the process.
Video 2: Visualizing the decision process (https://youtu.be/cyibNzyU4ug)
5.2.3 - Implementation details
This experiment used an early implementation of the algorithm that was 99% identical to the pseudo-code. It is coded in Object Pascal (Delphi 7) and includes a complete ad hoc physics simulator.
The main differences from the general case come from the definition of the distance between two states and from the reward function used.
The reward was the composition of two goals: staying alive -corresponding to the rocket's health level (the simulator used the energy of collisions to drain health from the agent)- and the aforementioned ‘hook reward’, so the actual reward was defined as:
Reward = Health_Level * Hook_Reward
Any informative distance would have done the trick, but as this was a very physical example, we decided to focus only on the position and momentum of the rocket, so the distance used was:
Dist(A, B) = Sqrt( (A.x-B.x)^2 + (A.vx-B.vx)^2 + (A.y-B.y)^2 + (A.vy-B.vy)^2 )
Adding the rest of the coordinates to the distance formula resulted in lower performance. In general, using a distance that makes sense for the problem helps. In the general case, using an embedding of the state can reduce its dimensionality and, as with any other method, helps to improve efficiency.
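As a minimal Python sketch of these two definitions (the original implementation is in Object Pascal; RocketState and the field names are illustrative):

import math
from typing import NamedTuple

class RocketState(NamedTuple):
    x: float   # horizontal position
    y: float   # vertical position
    vx: float  # horizontal velocity
    vy: float  # vertical velocity

def reward(health_level: float, hook_reward: float) -> float:
    # Composition of the two goals: staying alive and the 'hook goal'.
    return health_level * hook_reward

def dist(a: RocketState, b: RocketState) -> float:
    # Distance over the rocket's position and momentum only, as in the text.
    return math.sqrt((a.x - b.x) ** 2 + (a.vx - b.vx) ** 2 +
                     (a.y - b.y) ** 2 + (a.vy - b.vy) ** 2)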
We will briefly comment on some of the research topics we find worth exploring.
6.1 - Distributed computing
The algorithm's time complexity is O(n), where n = num. walkers × num. ticks, but walkers work almost independently -except for some inter-walker communication- and even asynchronously, so a parallel implementation assigning one core per walker can, in principle, lower the time complexity to near O(num. ticks).
Thus, adding more CPUs can scale the algorithm up almost linearly but, eventually, the inter-process communication overhead of sending system states from one walker to another will impose a practical limit that depends on the size of the system state.
At that point, a second distributed strategy is to launch several FMC processes -workers- in a distributed environment, each using a smaller number of walkers and outputting a decision vector that the agent averages in order to make its decision. This ‘clustered’ decision is almost as reliable as using the sum of all walkers in a single fractal.
By adding the capability to reshuffle the states of all walkers among the cluster of workers every m steps, you can move continuously from totally isolated workers (m = total number of ticks, no communication overhead) to totally connected workers (m = 1, overhead depending on state size and simulation time).
Please note that the repository contains a parallel version that uses as many cores as you wish (and have), but it is not a truly distributed version, as it still runs on a single PC.
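A toy sketch of the ‘clustered decision’ strategy follows; fmc_decision is a stub standing in for a full FMC worker, and all names are illustrative:

from concurrent.futures import ProcessPoolExecutor
import numpy as np

def fmc_decision(state: np.ndarray, n_walkers: int) -> np.ndarray:
    # Stub: a real worker would run n_walkers walkers from `state` and
    # return their averaged decision vector (e.g. a 2D force).
    return np.random.default_rng().normal(size=2)

def clustered_decision(state: np.ndarray, n_workers: int = 4,
                       walkers_per_worker: int = 75) -> np.ndarray:
    # Launch several smaller FMC workers and let the agent average
    # their decision vectors, as described above.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        decisions = list(pool.map(fmc_decision,
                                  [state] * n_workers,
                                  [walkers_per_worker] * n_workers))
    return np.mean(np.stack(decisions), axis=0)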
6.2 - Adding learning capabilities
“One remembers the past to imagine the future”
Daniel L. Schacter
As we noted before, one of the limitations of the algorithm is that it deals only with the forward-thinking process, so a natural research topic is mixing forward and backward thinking in a single process.
The presented algorithm scans the available actions at any given state with a uniform probability distribution; it doesn’t have any ‘a priori’ preference, so the walks produced are truly random.
The FMC algorithm could be used to feed examples of correct decisions -rollouts- to a neural network that, in turn, would train itself to predict the intelligent decision -a probability distribution over the actions- as a function of the state.
We can then use this as the ‘a priori’ distribution: when a walker needs to randomly choose an action before simulating a delta of time, instead of choosing a uniformly random action, we can sample one from this distribution.
A mechanism like this would give the walkers a natural tendency to repeat actions that worked well in the past in similar situations. Usually this means that, with the same number of walkers, you get better decisions or, conversely, that you can safely lower the number of walkers needed to decide in situations that are familiar to the agent.
In the extreme case where the neural network can be considered well trained, you can completely disable the walkers and decide by choosing the most probable action in the NN output.
By comparing the decision suggested by such a memory system with the one generated by the FMC algorithm -for the same initial state- we can estimate how accurate our a priori distribution is, enabling us to dynamically adjust the ‘credibility’ of this distribution.
To get a simplified sketch of the idea, consider how a single walker would choose its action.
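Assuming a hypothetical prior_net that maps a state to a probability distribution over the actions (all names here are illustrative, not part of the original implementation), the walker's choice could look like this:

import numpy as np

N_ACTIONS = 6

def walker_action(state, prior_net=None, credibility: float = 0.5) -> int:
    # Plain FMC: scan actions with a uniform distribution.
    uniform = np.full(N_ACTIONS, 1.0 / N_ACTIONS)
    if prior_net is None:
        probs = uniform
    else:
        # Blend the learned prior with the uniform distribution according
        # to how much we currently trust the memory ('credibility').
        probs = credibility * prior_net(state) + (1.0 - credibility) * uniform
    return int(np.random.choice(N_ACTIONS, p=probs))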
This mechanism opens the possibility of mixing FMC with any general learning method -like neural networks- in such a way that it can detect when the memory-based, backward-thinking “fast” decisions are good enough to replace the more expensive -in terms of CPU usage- forward-thinking decisions.
6.2.1 - Using a DQN for learning
A DQN is a deep learning model designed to learn the value of each action -expressed as a reward expectation- as a function of the system state, and as such it is a perfect match for FMC.
On one hand, FMC is able to generate good-quality game rollouts -sequences of state-action pairs from previous games- without needing an a priori probability density over the actions, so it can feed the DQN’s learning process with meaningful game sequences instead of randomly played ones, boosting the learning performance to some degree.
On the other hand, once the DQN has learned from a dataset of initial rollouts, FMC can use its output as the a priori distribution for randomly choosing an action at each walker step, making FMC scan more deeply the actions suggested by the DQN and thus become more efficient and capable as the DQN learns to make better predictions.
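A compact sketch of this virtuous cycle, with stubs standing in for the real DQN and the FMC rollout generator (all names hypothetical):

import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 6

def dqn_prior(state):
    # Stub for the DQN output: a probability distribution over the actions.
    # A trained network would concentrate mass on high-value actions.
    return np.full(N_ACTIONS, 1.0 / N_ACTIONS)

def fmc_rollout(prior):
    # Stub for one FMC game: a list of (state, action) pairs where each
    # walker action was sampled from `prior` instead of a uniform choice.
    return [(rng.normal(size=4), int(rng.choice(N_ACTIONS, p=prior(None))))
            for _ in range(100)]

def train_dqn(rollouts):
    # Stub for the supervised training of the DQN on the FMC rollouts.
    pass

# FMC feeds rollouts to the DQN; the DQN's improving prior, in turn,
# makes FMC scan the promising actions more deeply.
for _ in range(10):
    train_dqn([fmc_rollout(dqn_prior) for _ in range(8)])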
This kind of combination is not new in the literature [7], but replacing UCT (a state-of-the-art implementation of MCTS) with FMC could improve the efficiency of the hybrid. A fair comparison between both methods is presented in the experiments section, showing that FMC can outperform UCT using about 0.01% to 0.1% of the samples per step.
6.3 - Common Sense Assisted Control
Imagine a drone driven by FMC, following the single goal of not crashing into anything. Now add a remote control that, when pushed forward, sends an a priori probability distribution over the actions in which going forward is much more probable than all the other actions, in the same way the DQN would.
FMC will try to follow this direction, but only if it doesn’t go against its primary goal of not crashing into anything. The resulting drone will follow the control orders without crashing, allowing it to be flown in difficult scenarios with ease and safety.
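Such a remote control could inject its bias as a simple distribution over the actions; a minimal sketch (the action names and bias value are illustrative):

import numpy as np

FORWARD, BACK, LEFT, RIGHT, UP, DOWN = range(6)

def remote_prior(pressed: int, bias: float = 0.75) -> np.ndarray:
    # The pressed direction gets most of the probability mass; the rest
    # is spread uniformly so FMC can still avoid obstacles.
    prior = np.full(6, (1.0 - bias) / 5)
    prior[pressed] = bias
    return prior

print(remote_prior(FORWARD))  # [0.75 0.05 0.05 0.05 0.05 0.05]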
“In any decision for action, when you have to make up your mind what to do, there is always a 'should' involved, and this cannot be worked out from, 'If I do this, what will happen?' alone.”
Richard P. Feynman
When the reward function is a composition of several goals {Gi}, we can assign a relative importance {Ki} to each goal, with Ki ≥ 0 and Σ(Ki) = 1, so our reward function would look like this:
Reward(X) = 𝚷( Gi(X)^Ki )
We can consider the vector {Ki} to be the “mental state” -as opposed to the physical state- of the agent. Any mechanism that can automatically adjust those coefficients in order to make better decisions can be considered a conscious mechanism.
If we consider those {Ki} to be a second-level agent state, we can use them both as state components and as degrees of freedom, and apply the same FMC algorithm to intelligently adjust them... as long as we can define a sensible reward associated with a mental state {Ki}.
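A minimal sketch of this weighted composition (names are illustrative):

import numpy as np

def composed_reward(goals: np.ndarray, weights: np.ndarray) -> float:
    # Weighted geometric composition: Π goals[i] ** weights[i], with the
    # weights {Ki} >= 0 summing to 1 (the agent's 'mental state').
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return float(np.prod(goals ** weights))

# Example: two goals with relative importances 0.7 and 0.3.
print(composed_reward(np.array([0.9, 0.4]), np.array([0.7, 0.3])))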
6.5 - Real-time decision-making
“Life is the continuous adjustment of internal relations to external relations.”
Herbert Spencer
In the present implementation, the need to reset all the walkers to the new agent position after each step forces us to erase valuable information, making the samples per step higher than they could be.
A slightly different implementation could lead to a continuous decision algorithm where a new decision is generated at each tick of the algorithm -as opposed to each step of the agent- in a totally continuous process, allowing the decision to be sampled at any time the agent needs it.
In this new implementation, each continuous decision would come with a measure of the average delay between the agent time when the walkers started their paths and the actual agent time when the decision is used. A faster CPU would allow the algorithm to have a smaller delay, use more walkers and paths, or think over a longer time horizon. A new parameter would be needed to keep this delay below a desired value.
“Find beauty not only in the thing itself, but also in the pattern of the shadows, the light and dark which that thing provides”
Junichiro Tanizaki
It would be interesting to connect the distances between walkers at any given moment with the universality pattern found in the eigenvalues of random matrices. This pattern seems to be universal in any system whose parts are heavily correlated, which is the case for the pool of walkers, so if this algorithm is some form of universal complex-system solver, it makes sense to try to connect both ideas.
The theory of intelligence introduced here allowed us to build a very efficient agent that, in the discrete decision case, outperforms current implementations of MCTS and other planning algorithms by two or three orders of magnitude. This is achieved by inspecting the tree of decisions using entropy-based principles while boosting exploration to balance the exploitation vs exploration tradeoff.
The algorithm can also be applied to continuous decision spaces, where it proved to be highly efficient, bringing Monte Carlo methods to a new spectrum of possible uses such as driving vehicles or controlling robots.
The algorithm naturally allows working with neural networks in a close symbiosis: the NN learns from the actions taken by the intelligence to produce a good prior distribution over the actions, which is then used by the planning algorithm to get better results over time, which are in turn fed back into the NN’s learning process, closing a virtuous cycle of improvement. By doing so, current reinforcement learning methods could be reshaped into a simpler and more efficient form of supervised learning.
[1] Wissner-Gross, A. D., and C. E. Freer. “Causal Entropic Forces”. Physical Review Letters 110.16, 2013.
[2] Gottfredson, Linda S. "Mainstream Science on Intelligence: An Editorial With 52 Signatories, History, and Bibliography". Intelligence. 24: 13–23. ISSN 0160-2896. doi:10.1016/s0160-2896(97)90011-8, 1997.
[3] Shane Legg, Marcus Hutter. “A Collection of Definitions of Intelligence”. Frontiers in Artificial Intelligence and Applications, Vol. 157, 17-24, arXiv:0706.3639 [cs.AI], 2007.
[4] Cameron Browne, Edward Powley, Daniel Whitehouse, Simon Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis and Simon Colton. “A Survey of Monte Carlo Tree Search Methods”. IEEE Transactions on computational intelligence and AI in games, Vol. 4, No. 1, March 2012.
[5] Timothy Yee, Viliam Lisý, Michael Bowling. “Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty”. Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), 2016.
[6] Christoph Salge, Cornelius Glackin, Daniel Polani. “Empowerment -- an Introduction”, arXiv:1310.1863 [cs.AI], 2013.
30 July 2020