1 of 41

Reinforcement Learning

Monte Carlo Method



3 of 41

Keywords

  • Agent
  • Environment
  • State
  • Action
  • Reward
  • Policy


4 of 41

Reinforcement Learning?


5 of 41

Easier way


6 of 41


Agent

Action: left, right, jump, …

Reward: score, coins

State: map info, enemy location, time left


8 of 41

Let's build an agent that can play blackjack


9 of 41

Basic Rules


10 of 41

Basic Rules

  • Players bet money (say $100)
  • The dealer gives 2 cards to the player and 1 card to the dealer
  • If the sum of the cards is 21 🡪 blackjack; the player earns 1.5× the bet ($150)
  • What the player can do:
    • Hit: get 1 more card
    • Stand: no more new cards
  • If the sum > 21, the player busts (loses)
  • When the players are done taking cards, the dealer Hits while sum < 17
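The dealer rule above can be sketched in a few lines. This is a minimal sketch, not a full blackjack implementation: cards are plain integers, and an ace (1) is counted as 11 whenever that does not bust the hand.

```python
import random

def hand_value(cards):
    """Value of a hand; one ace counts as 11 if that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def dealer_play(deck, upcard):
    """Dealer Hits while the hand value is below 17, as per the rules above."""
    hand = [upcard]
    while hand_value(hand) < 17:
        hand.append(deck.pop())
    return hand_value(hand)

# Simplified deck: card ranks 1-10 (face cards lumped in with 10).
deck = [random.choice(range(1, 11)) for _ in range(52)]
total = dealer_play(deck, upcard=6)
print(total)  # always >= 17; the dealer busts if it exceeds 21
```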


15 of 41

Keywords

  • Agent = Player
  • Environment = Card table
  • State = information shown on the table
  • Action = Hit or Stand
  • Reward = Win or Lose (or money)
  • Policy = should I Hit or Stand?


16 of 41

This is the policy


Dealer’s first card

Player’s cards

Policy: deciding the action (Hit or Stand) based on the state (dealer’s first card and player’s cards)

17 of 41

Monte Carlo

  • Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.

Monte Carlo methods vary, but tend to follow a particular pattern:

  1. Define a domain of possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Aggregate the results
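The four-step pattern above can be illustrated with the classic Monte Carlo estimate of π; this sketch is an illustrative example, not from the slides:

```python
import random

def estimate_pi(n_samples=100_000, seed=0):
    """Estimate pi by random sampling, following the 4-step pattern:
    1. domain = the unit square, 2. sample points uniformly,
    3. deterministically test if each point lies in the quarter circle,
    4. aggregate the hit ratio."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()   # step 2: random inputs
        if x * x + y * y <= 1.0:            # step 3: deterministic test
            inside += 1
    return 4 * inside / n_samples           # step 4: aggregate

print(estimate_pi())  # close to 3.14159
```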



20 of 41

  • Initialize the policy arbitrarily
  • Take a random action a
  • Record the reward received by taking action a in state s, written Q(s, a)
  • Keep a running average of the reward values
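The recording-and-averaging step can be sketched as an incremental mean over (state, action) pairs. This is a minimal sketch assuming the blackjack-style state encoding (player sum, dealer card, usable ace) used later in the slides:

```python
from collections import defaultdict

Q = defaultdict(float)  # running average return per (state, action)
N = defaultdict(int)    # visit count per (state, action)

def record_return(state, action, reward):
    """Update the running average Q(s, a) with a newly observed reward."""
    key = (state, action)
    N[key] += 1
    Q[key] += (reward - Q[key]) / N[key]  # incremental mean update

# Example: in state (13, 8, usable ace) the agent Hit once and won (+1),
# then Hit again in the same state and lost (-1).
record_return((13, 8, True), "hit", +1)
record_return((13, 8, True), "hit", -1)
print(Q[((13, 8, True), "hit")])  # 0.0 after one win and one loss
```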


21 of 41

Example: Episode 1

  • State 🡪 Action (chosen at random)

  • State 🡪 Action 🡪 Reward
  • Reward 🡪 +1 (new card was 8, sum = 21)
  • State 🡪 Action 🡪 Reward 🡪 record


22 of 41

More episodes…

  • Rewards Q(s, a) are recorded as the episodes continue…
  • The record is basically the ‘experience’ itself
  • By comparing the average values of [13, 8, True] and [13, 8, False], the agent can decide whether to Hit or Stand
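Turning the recorded averages into a decision is a simple comparison: pick whichever entry has the higher average return. The numbers below are hypothetical, and the keys follow the slide’s notation of [player sum, dealer card, flag]:

```python
# Hypothetical recorded averages (not real results from the slides).
avg_reward = {
    (13, 8, True): 0.12,    # average return observed for one choice
    (13, 8, False): -0.25,  # average return observed for the other
}

def decide(player_sum, dealer_card):
    """Pick the flag (True/False) whose recorded average return is higher."""
    if avg_reward[(player_sum, dealer_card, True)] > avg_reward[(player_sum, dealer_card, False)]:
        return "Hit"
    return "Stand"

print(decide(13, 8))  # Hit, since 0.12 > -0.25
```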



24 of 41

Results: Ace card in hand


Dealer’s first card

Player’s card

25 of 41

Results: Ace not in hand


26 of 41

Application: Multi-Agent


27 of 41


Hider

Seeker

Object

Agent: Hider and Seeker

Environment: wall, floor, objects

Action: move around, push and pull object

Hider can ‘lock’ object

28 of 41


Hider uses cube objects to block the entrance

29 of 41


Seeker uses ramp objects to climb over the wall

30 of 41


Hider takes away the ramp objects and blocks the entrance

31 of 41


Hider builds a wall and locks the ramps so that seekers cannot use them

Seeker finds a glitch: it climbs onto a cube and moves with it

Locked

32 of 41


Seeker uses the cube like a vehicle and moves with it

Glitch


34 of 41

Personal thoughts

  • Machine learning: the field of study that gives a machine the ability to learn without being explicitly programmed

  • If the agent had been explicitly programmed, the seeker wouldn’t have been able to find the glitch in the system.


35 of 41


Hider builds a wall and locks every object from now on :(

36 of 41

Application: Bio



39 of 41


Environment: FDA approved UVA/Padova simulator

Agent: Insulin Pump

Action: release insulin or not

Goal: maintain a normal blood-glucose state
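The agent’s binary action and its reward signal can be sketched as below. This is a hypothetical sketch: the glucose thresholds (70–180 mg/dL for the normal range, 150 mg/dL for the release decision) are illustrative values, not taken from the slides or the UVA/Padova simulator.

```python
NORMAL_RANGE = (70, 180)  # illustrative "normal" glucose band, in mg/dL

def reward(glucose_mg_dl):
    """+1 while glucose stays in the normal range, -1 otherwise."""
    low, high = NORMAL_RANGE
    return 1 if low <= glucose_mg_dl <= high else -1

def action(glucose_mg_dl, threshold=150):
    """Binary action as in the slides: release insulin or not."""
    return "release insulin" if glucose_mg_dl > threshold else "no insulin"

print(action(200), reward(200))  # high glucose: release insulin, reward -1
```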


41 of 41


Traditional MSA (multiple sequence alignment) algorithms work, but their computational complexity needs to be improved.

The agent (RL model) is rewarded when its MSA result is similar to that of a traditional algorithm (e.g., dynamic programming).
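The reward scheme described above can be sketched as a similarity check against a reference alignment. This is a hypothetical sketch: the column-match fraction used as the similarity measure and the 0.9 threshold are illustrative choices, not from the slides.

```python
def similarity(alignment, reference):
    """Fraction of positions where the two alignment strings agree."""
    matches = sum(a == r for a, r in zip(alignment, reference))
    return matches / max(len(reference), 1)

def reward(agent_alignment, dp_alignment, threshold=0.9):
    """+1 if the agent's alignment is close enough to the dynamic-programming
    reference alignment, else 0."""
    return 1 if similarity(agent_alignment, dp_alignment) >= threshold else 0

print(reward("AC-GT", "ACGT-"))  # only 2 of 5 columns match: reward 0
```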