1 of 47

Hacking

Reinforcement Learning

Guillem Duran Ballester

Guillemdb

@Miau_DB

2 of 47

Guillem Duran Ballester

  • I do RL research at InstaDeep
  • Telecommunications engineer
  • Co-organizer of PyData Mallorca
  • My hobby: planning in Python with Sergio Hernández (@entropyfarmer)

Guillemdb

@Miau_DB

3 of 47

Hacking RL

  1. Information gathering
  2. Scanning
  3. Exploitation & privilege escalation
  4. Maintaining access & covering tracks

4 of 47

5 of 47

Reinforcement Learning

6 of 47

Reinforcement Learning

Planning

7 of 47

An intelligent decision

Move in the direction with the maximum number of future options

Given your current state

8 of 47

Count all the possible paths you can take

Up to a certain point in the future

Assign each of them a score

9 of 47

How do you drive the kart?

  • Try different random paths
  • Color them:
    1. Paths that start to the right: red
    2. Paths that start to the left: blue
  • Weigh the number of paths of each color that have not crashed
  • Apply the decision and repeat the process (see the sketch below)
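
A minimal sketch of this path-counting heuristic, assuming a hypothetical simulator object with copy(), step(action) and a crashed flag (illustrative names, not a real kart API):

```python
import random

ACTIONS = ["left", "right"]

def choose_turn(sim, n_paths=100, horizon=50):
    """Count how many random futures of each color survive and pick the best."""
    survivors = {"left": 0, "right": 0}
    for _ in range(n_paths):
        first = random.choice(ACTIONS)        # the color of this path (red / blue)
        rollout = sim.copy()                  # hypothetical: clone the simulator state
        rollout.step(first)
        for _ in range(horizon - 1):          # random future up to a fixed horizon
            if rollout.crashed:
                break
            rollout.step(random.choice(ACTIONS))
        if not rollout.crashed:
            survivors[first] += 1             # weigh only the paths that did not crash
    # Apply the decision with more surviving futures, then repeat next frame.
    return max(survivors, key=survivors.get)
```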

10 of 47

11 of 47

Hacking RL

  • Information gathering
  • Finding vulnerabilities & Scanning
  • Exploitation & privilege escalation
  • Covering tracks & Maintaining access

12 of 47

Reinforcement Learning

[Figure: the agent–environment interaction loop; each step returns an observation, reward, end, info]
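
The tuple in the diagram matches the classic OpenAI Gym API, where env.step returns an observation, a reward, a done/end flag and an info dict. A minimal sketch with a random agent; the environment name is just an example:

```python
import gym

env = gym.make("MsPacman-v0")                    # any Atari environment from the talk
state = env.reset()
end, total_reward = False, 0.0
while not end:
    action = env.action_space.sample()           # random agent, for illustration
    state, reward, end, info = env.step(action)  # the tuple shown on the slide
    total_reward += reward
env.close()
print("Episode reward:", total_reward)
```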

13 of 47

Hacking RL with planning

14 of 47

Fractal AI vs. Monte Carlo

15 of 47

16 of 47

17 of 47

Hacking RL

  • Information gathering
  • Scanning
  • Exploitation & privilege escalation
  • Maintaining access & covering tracks

18 of 47

Ninja basketball with rockets

FIRE!

HP

Fuel

Hook

Spring

2 continuous degrees of freedom

19 of 47

The Gameplay

Bring the rock here

Reward (see the sketch below)

  • Health + fuel
  • Getting closer to the goal → +0.2
  • Completing the goal → +100

Hook the rock out of here

Don't crash!
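
A hypothetical sketch of that reward; the state fields and the shaping below are assumptions for illustration, not the environment's actual code:

```python
def reward(state, prev_state):
    r = state.health + state.fuel        # staying alive with fuel is rewarded
    if state.distance_to_goal < prev_state.distance_to_goal:
        r += 0.2                         # getting closer to the goal
    if state.goal_completed:
        r += 100.0                       # delivering the rock to the target
    return r
```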

20 of 47

FMC Cone

  • Gray lines: futures of the rocket
  • Colored lines: futures of the hook
  • Color change: new objective (pick up / drop the rock)

Rock hooked

Drop the rock

Hook the rock

21 of 47

22 of 47

23 of 47

Demo time!

24 of 47

Hacking RL

  • Information gathering
  • Scanning
  • Exploitation & privilege escalation
  • Maintaining access & covering tracks

25 of 47

Performance of the Swarm Wave

26 of 47

Solving Atari games is easy

27 of 47

It also works on harder problems

28 of 47

29 of 47

Control swarms of agents

30 of 47

Multi-objective environments

31 of 47

Thank you!

Check out our geeky projects!

  1. Repo: Guillemdb/hacking-rl
  2. Code: FragileTheory/FractalAI
  3. More than 100 videos
  4. PDFs on arXiv.org

32 of 47

Additional Material

  • How the algorithm works
  • An overview of the FractalAI repository
  • Reinforcement Learning as a supervised problem
  • Hacking OpenAI baselines
  • Papers that need some love
  • Improving AlphaZero
  • Combining FractalAI with neural networks

33 of 47

The Algorithm

  • Random perturbation of the walkers
  • Calculate the virtual reward of each walker
    1. Distance to 1 random walker
    2. Reward of current state
  • Clone the walkers → Balance the Swarm (see the sketch below)
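
A minimal numpy sketch of the virtual-reward and cloning steps (the random perturbation, one environment step per walker with a random action, is left out). Each walker is summarized by a cumulative reward, an observation vector used for distances, and an alive flag; the function names and the exact normalization are illustrative choices, not the FractalAI implementation:

```python
import numpy as np

def normalize(x):
    """Rescale a vector to [0, 1] so rewards and distances are comparable."""
    span = x.max() - x.min()
    return np.ones_like(x) if span == 0 else (x - x.min()) / span

def virtual_reward(observations, rewards, rng):
    """Virtual reward = normalized reward x normalized distance to a random companion."""
    companions = rng.permutation(len(rewards))
    distances = np.linalg.norm(observations - observations[companions], axis=1)
    return normalize(rewards) * normalize(distances)

def clone(observations, rewards, alive, rng):
    """Rebalance the swarm: low virtual-reward walkers clone to luckier companions."""
    vr = virtual_reward(observations, rewards, rng)
    companions = rng.permutation(len(vr))
    # Clone with a probability that grows with the companion's advantage in virtual reward.
    prob = np.clip((vr[companions] - vr) / np.maximum(vr, 1e-8), 0.0, 1.0)
    will_clone = (rng.random(len(vr)) < prob) | ~alive   # crashed walkers always clone
    observations[will_clone] = observations[companions[will_clone]]
    rewards[will_clone] = rewards[companions[will_clone]]
    alive[will_clone] = alive[companions[will_clone]]
    return observations, rewards, alive
```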

34 of 47

Random perturbation

35 of 47

Walkers & Reward density

36 of 47

Cloning Process

37 of 47

Cloning balances both densities

38 of 47

Choose the action that most walkers share
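
A short sketch of that final decision, assuming root_actions[i] holds the first action walker i took from the current state (an illustrative name):

```python
from collections import Counter

def choose_action(root_actions):
    """Return the root action shared by the largest number of walkers."""
    return Counter(root_actions).most_common(1)[0][0]
```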

39 of 47

RL is training a DNN model

  • ML without labels → Environment
  • Sample the environment
  • Dataset of games → Map states to scores
  • Predict good actions (see the sketch below)
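
A minimal sketch of that pipeline with scikit-learn on random placeholder data; the model choice and the data shapes are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
states = rng.normal(size=(10_000, 8))    # placeholder for observations sampled by the swarm
scores = rng.normal(size=10_000)         # placeholder for the score reached from each state

model = RandomForestRegressor(n_estimators=100)
model.fit(states, scores)                # the "dataset of games → map states to scores" step

def predict_good_action(candidate_next_states):
    """Pick the action whose predicted next state scores highest."""
    return int(np.argmax(model.predict(candidate_next_states)))
```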

40 of 47

Which Envs are compromised?

  • Atari games → Solved 32 Games!
  • dm_control → x1000+ with tricks
  • Sega games → Good performance
  • Hopefully soon in DoTA 2 & other challenging environments

41 of 47

If you run it on your laptop on 50 games

  • Pwns planning SoTA
  • 17+ games with max scores (1M bug)
  • Cheaper than a human (no Pitfall)
  • Beats the human record → 56.36% of games

42 of 47

RL as a supervised task

  • Train an autoencoder with a SW
  • Generate 1M games and overfit on them
  • Use a GAN to mimic a fractal
  • Use FMC to calculate Q-values / advantages (see the sketch below)
  • Use the trained model as a prior
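
A hedged sketch of the Q-values / advantages bullet: average the returns of the walkers that started with each action and subtract a baseline. returns_per_action is a hypothetical mapping built from FMC rollouts, not a FractalAI function:

```python
import numpy as np

def q_values_from_walkers(returns_per_action):
    """Estimate Q(s, a) as the mean return of the walkers that began with action a."""
    return {a: float(np.mean(r)) for a, r in returns_per_action.items()}

def advantages(q_values):
    """Turn the Q-value estimates into advantages by subtracting the mean value."""
    baseline = float(np.mean(list(q_values.values())))
    return {a: q - baseline for a, q in q_values.items()}
```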

43 of 47

Give love to papers!

44 of 47

Efficiency on MsPacman

SW vs. UCT & p-IW (Assuming 2 x M4.16xlarge)

By the time UCT (the search used in AlphaZero) has finished ⅔ of its first step,

SW has already beaten its final score by 25%

                        UCT 150k    p-IW 150k    p-IW 0.5s    p-IW 32s
Score                   x1.25       x0.91        x1.85        x1.21
Sampling efficiency     x1260       x1260        x1848        x29581

An example run:

  • 128 walkers
  • 14.20 samples / action
  • Scored 27971 points
  • Game length: 6892 steps
  • 97894 samples
  • Runtime: 1 min 38 s
  • 70.34 fps
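
The numbers above are mutually consistent; a quick check:

```python
samples_per_action = 97894 / 6892   # ≈ 14.20 samples / action
fps = 6892 / 98                     # 1 min 38 s of runtime → ≈ 70.3 frames / second
```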

45 of 47

Improving AlphaZero

  • Swap UCT for SW → sampling x1000+ faster
  • Stones as reward → SW jumps over local optima
  • Embedding of conv. layers as the distance measure
  • Use FMC to get better Q-values
  • Heuristics only valid in Go

46 of 47

SW: Presenting an unfair benchmark

  • A fair benchmark would require reaching the 1M scores at 150k samples / step
  • 10 min of play: 12,000 steps; one step: 400 µs
  • One game on 1 core: 4.8 s × 150k × 50 rounds → 416 days
  • An ideal M4.16xlarge at $3.20 / hour → $500 per game, running 1 instance for 6.5 days
  • $26,500 on 53 games → sponsors are welcome
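
Where those figures come from, assuming the 64 vCPUs of an M4.16xlarge parallelize the single-core work perfectly:

```python
seconds_per_game = 12_000 * 400e-6              # 10 min of play at 400 µs / step = 4.8 s on one core
core_seconds = seconds_per_game * 150_000 * 50  # 150k samples / step, 50 rounds
days_one_core = core_seconds / 86_400           # ≈ 416 days of single-core compute
days_on_m4 = days_one_core / 64                 # ≈ 6.5 days on one M4.16xlarge (64 vCPUs)
cost_per_game = days_on_m4 * 24 * 3.20          # ≈ $500 per game
total_cost = cost_per_game * 53                 # ≈ $26,500 for 53 games
```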

47 of 47

Counting Paths vs. Trees

  • Samples / step: confusing → Tree of games

Traditional Planning

Swarm Wave