1 of 47

Hacking

Reinforcement Learning

Guillem Duran Ballester

Guillemdb

@Miau_DB

2 of 47

Guillem Duran Ballester

  • I do RL research at InstaDeep
  • Telecommunications engineer
  • Co-organizer of PyData Mallorca
  • My hobby: planning in Python with Sergio Hernández (@entropyfarmer)

Guillemdb

@Miau_DB

3 of 47

Hacking RL

  1. Information gathering
  2. Scanning
  3. Exploitation & privilege escalation
  4. Maintaining access & covering tracks

4 of 47

5 of 47

Reinforcement Learning

6 of 47

Reinforcement Learning

Planning

7 of 47

An intelligent decision

Move in the direction with the maximum number of future options

Given your current state

8 of 47

Count all the possible paths you can take

Up to a certain point in the future

Assign each of them a score

9 of 47

How do you drive the kart?

  • Try different random paths
  • Color them:
    1. Paths that start to the right: red
    2. Paths that start to the left: blue
  • Weigh the number of paths of each color that have not crashed
  • Apply the decision and repeat the process (see the sketch below)
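
A minimal sketch of this path-counting heuristic, assuming a hypothetical simulator object with copy(), step(action) and a crashed flag (illustrative names, not a real kart API):

```python
import random

ACTIONS = ["left", "right"]

def choose_turn(sim, n_paths=100, horizon=50):
    """Count how many random futures of each color survive and pick the best."""
    survivors = {"left": 0, "right": 0}
    for _ in range(n_paths):
        first = random.choice(ACTIONS)        # the color of this path (red / blue)
        rollout = sim.copy()                  # hypothetical: clone the simulator state
        rollout.step(first)
        for _ in range(horizon - 1):          # random future up to a fixed horizon
            if rollout.crashed:
                break
            rollout.step(random.choice(ACTIONS))
        if not rollout.crashed:
            survivors[first] += 1             # weigh only the paths that did not crash
    # Apply the decision with more surviving futures, then repeat next frame.
    return max(survivors, key=survivors.get)
```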

10 of 47

11 of 47

Hacking RL

  • Information gathering
  • Finding vulnerabilities & Scanning
  • Exploitation & privilege escalation
  • Covering tracks & Maintaining access

12 of 47

Reinforcement Learning

[Figure: the agent–environment interaction loop; each step returns an observation, reward, end, info]
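
The tuple in the diagram matches the classic OpenAI Gym API, where env.step returns an observation, a reward, a done/end flag and an info dict. A minimal sketch with a random agent; the environment name is just an example:

```python
import gym

env = gym.make("MsPacman-v0")                    # any Atari environment from the talk
state = env.reset()
end, total_reward = False, 0.0
while not end:
    action = env.action_space.sample()           # random agent, for illustration
    state, reward, end, info = env.step(action)  # the tuple shown on the slide
    total_reward += reward
env.close()
print("Episode reward:", total_reward)
```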

13 of 47

Hacking RL with planning

14 of 47

Fractal AI vs. Monte Carlo

15 of 47

16 of 47

17 of 47

Hacking RL

  • Information gathering
  • Scanning
  • Exploitation & privilege escalation
  • Maintaining access & covering tracks

18 of 47

Ninja basketball with rockets

FIRE!

HP

Fuel

Hook

Spring

2 continuous degrees of freedom

19 of 47

The Gameplay

Bring the rock here

Reward (see the sketch below)

  • Health + fuel
  • Getting closer to the goal → +0.2
  • Completing the goal → +100

Hook the rock out of here

Don't crash!
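
A hypothetical sketch of that reward; the state fields and the shaping below are assumptions for illustration, not the environment's actual code:

```python
def reward(state, prev_state):
    r = state.health + state.fuel        # staying alive with fuel is rewarded
    if state.distance_to_goal < prev_state.distance_to_goal:
        r += 0.2                         # getting closer to the goal
    if state.goal_completed:
        r += 100.0                       # delivering the rock to the target
    return r
```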

20 of 47

FMC Cone

  • Gray lines: futures of the rocket
  • Colored lines: futures of the hook
  • Color change: new objective (pick up / drop the rock)

Rock hooked

Drop the rock

Hook the rock

21 of 47

22 of 47

23 of 47

Demo time!

24 of 47

Hacking RL

  • Information gathering
  • Scanning
  • Exploitation & privilege escalation
  • Maintaining access & covering tracks

25 of 47

Performance of the Swarm Wave

26 of 47

Solving Atari games is easy

27 of 47

It also works on harder problems

28 of 47

29 of 47

Control swarms of agents

30 of 47

Multi-objective environments

31 of 47

Thank you!

Check out our geeky projects!

  1. Repo: Guillemdb/hacking-rl
  2. Code: FragileTheory/FractalAI
  3. More than 100 videos
  4. PDFs on arXiv.org

32 of 47

Additional Material

  • How the algorithm works
  • An overview of the FractalAI repository
  • Reinforcement Learning as a supervised problem
  • Hacking OpenAI baselines
  • Papers that need some love
  • Improving AlphaZero
  • Combining FractalAI with neural networks

33 of 47

The Algorithm

  • Random perturbation of the walkers
  • Calculate the virtual reward of each walker
    1. Distance to 1 random walker
    2. Reward of current state
  • Clone the walkers → Balance the Swarm (see the sketch below)
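
A minimal numpy sketch of the virtual-reward and cloning steps (the random perturbation, one environment step per walker with a random action, is left out). Each walker is summarized by a cumulative reward, an observation vector used for distances, and an alive flag; the function names and the exact normalization are illustrative choices, not the FractalAI implementation:

```python
import numpy as np

def normalize(x):
    """Rescale a vector to [0, 1] so rewards and distances are comparable."""
    span = x.max() - x.min()
    return np.ones_like(x) if span == 0 else (x - x.min()) / span

def virtual_reward(observations, rewards, rng):
    """Virtual reward = normalized reward x normalized distance to a random companion."""
    companions = rng.permutation(len(rewards))
    distances = np.linalg.norm(observations - observations[companions], axis=1)
    return normalize(rewards) * normalize(distances)

def clone(observations, rewards, alive, rng):
    """Rebalance the swarm: low virtual-reward walkers clone to luckier companions."""
    vr = virtual_reward(observations, rewards, rng)
    companions = rng.permutation(len(vr))
    # Clone with a probability that grows with the companion's advantage in virtual reward.
    prob = np.clip((vr[companions] - vr) / np.maximum(vr, 1e-8), 0.0, 1.0)
    will_clone = (rng.random(len(vr)) < prob) | ~alive   # crashed walkers always clone
    observations[will_clone] = observations[companions[will_clone]]
    rewards[will_clone] = rewards[companions[will_clone]]
    alive[will_clone] = alive[companions[will_clone]]
    return observations, rewards, alive
```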

34 of 47

Random perturbation

35 of 47

Walkers & Reward density

36 of 47

Cloning Process

37 of 47

Cloning balances both densities

38 of 47

Choose the action that most walkers share
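
A short sketch of that final decision, assuming root_actions[i] holds the first action walker i took from the current state (an illustrative name):

```python
from collections import Counter

def choose_action(root_actions):
    """Return the root action shared by the largest number of walkers."""
    return Counter(root_actions).most_common(1)[0][0]
```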

39 of 47

RL is training a DNN model

  • ML without labels → Environment
  • Sample the environment
  • Dataset of games → Map states to scores
  • Predict good actions (see the sketch below)
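
A minimal sketch of that pipeline with scikit-learn on random placeholder data; the model choice and the data shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
states = rng.normal(size=(10_000, 8))    # placeholder for observations sampled by the swarm
scores = rng.normal(size=10_000)         # placeholder for the score reached from each state

model = RandomForestRegressor(n_estimators=100)
model.fit(states, scores)                # the "dataset of games → map states to scores" step

def predict_good_action(candidate_next_states):
    """Pick the action whose predicted next state scores highest."""
    return int(np.argmax(model.predict(candidate_next_states)))
```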

40 of 47

Which Envs are compromised?

  • Atari games → Solved 32 Games!
  • dm_control → x1000+ with tricks
  • Sega games → Good performance
  • Hopefully soon in DoTA 2 & other challenging environments

41 of 47

If you run it on your laptop on 50 games

  • Pwns planning SoTA
  • 17+ games with max scores (1M bug)
  • Cheaper than a human (no Pitfall)
  • Beats the human record → 56.36% of games

42 of 47

RL as a supervised task

  • Train an autoencoder with a SW
  • Generate 1M games and overfit on them
  • Use a GAN to mimic a fractal
  • Use FMC to calculate Q-values / advantages (see the sketch below)
  • Use the trained model as a prior
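
A hedged sketch of the Q-values / advantages bullet: average the returns of the walkers that started with each action and subtract a baseline. returns_per_action is a hypothetical mapping built from FMC rollouts, not a FractalAI function:

```python
import numpy as np

def q_values_from_walkers(returns_per_action):
    """Estimate Q(s, a) as the mean return of the walkers that began with action a."""
    return {a: float(np.mean(r)) for a, r in returns_per_action.items()}

def advantages(q_values):
    """Turn the Q-value estimates into advantages by subtracting the mean value."""
    baseline = float(np.mean(list(q_values.values())))
    return {a: q - baseline for a, q in q_values.items()}
```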

43 of 47

Give love to papers!

44 of 47

Efficiency on MsPacman

SW vs. UCT & p-IW (Assuming 2 x M4.16xlarge)

By the time UCT (the search used in AlphaZero) has finished ⅔ of its first step,

SW has already beaten its final score by 25%

                        UCT 150k    p-IW 150k    p-IW 0.5s    p-IW 32s
Score                   x1.25       x0.91        x1.85        x1.21
Sampling efficiency     x1260       x1260        x1848        x29581

An example run:

  • 128 walkers
  • 14.20 samples / action
  • Scored 27971 points
  • Game length: 6892 steps
  • 97894 samples
  • Runtime: 1 min 38 s
  • 70.34 fps
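
The numbers above are mutually consistent; a quick check:

```python
samples_per_action = 97894 / 6892   # ≈ 14.20 samples / action
fps = 6892 / 98                     # 1 min 38 s of runtime → ≈ 70.3 frames / second
```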

45 of 47

Improving AlphaZero

  • Swap UCT for SW → sampling x1000+ faster
  • Stones as reward → SW jumps over local optima
  • Embedding of conv. layers as the distance measure
  • Use FMC to get better Q-values
  • Heuristics only valid in Go

46 of 47

SW: Presenting an unfair benchmark

  • A fair benchmark would require reaching the 1M scores at 150k samples / step
  • 10 min of play: 12,000 steps; one step: 400 µs
  • One game on 1 core: 4.8 s × 150k × 50 rounds → 416 days
  • An ideal M4.16xlarge at $3.20 / hour → $500 per game, running 1 instance for 6.5 days
  • $26,500 on 53 games → sponsors are welcome
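
Where those figures come from, assuming the 64 vCPUs of an M4.16xlarge parallelize the single-core work perfectly:

```python
seconds_per_game = 12_000 * 400e-6              # 10 min of play at 400 µs / step = 4.8 s on one core
core_seconds = seconds_per_game * 150_000 * 50  # 150k samples / step, 50 rounds
days_one_core = core_seconds / 86_400           # ≈ 416 days of single-core compute
days_on_m4 = days_one_core / 64                 # ≈ 6.5 days on one M4.16xlarge (64 vCPUs)
cost_per_game = days_on_m4 * 24 * 3.20          # ≈ $500 per game
total_cost = cost_per_game * 53                 # ≈ $26,500 for 53 games
```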

47 of 47

Counting Paths vs. Trees

  • Samples / step: confusing → Tree of games

Traditional Planning

Swarm Wave