1 of 41

Reinforcement Learning

Monte Carlo Method



3 of 41

Keywords

  • Agent
  • Environment
  • State
  • Action
  • Reward
  • Policy


4 of 41

Reinforcement Learning?


5 of 41

Easier way


6 of 41


Agent

Action: left, right, jump, …

Reward: score, coins

State: map info, enemy location, time left


8 of 41

Let's build an agent that can play blackjack


9 of 41

Basic Rules


10 of 41

Basic Rules

  • Players bet money (say $100)
  • The dealer gives 2 cards to the player and 1 card to the dealer
  • If the sum of the cards is 21 🡪 blackjack; the player earns 1.5× the bet ($150)
  • What the player can do:
    • Hit: get 1 more card
    • Stand: no more new cards
  • If the sum > 21, the player busts (loses)
  • When the players are done taking cards, the dealer Hits while sum < 17
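The dealer rule above can be sketched in a few lines. This is a minimal sketch, not a full blackjack implementation: cards are plain integers, and an ace (1) is counted as 11 whenever that does not bust the hand.

```python
import random

def hand_value(cards):
    """Value of a hand; one ace counts as 11 if that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def dealer_play(deck, upcard):
    """Dealer Hits while the hand value is below 17, as per the rules above."""
    hand = [upcard]
    while hand_value(hand) < 17:
        hand.append(deck.pop())
    return hand_value(hand)

# Simplified deck: card ranks 1-10 (face cards lumped in with 10).
deck = [random.choice(range(1, 11)) for _ in range(52)]
total = dealer_play(deck, upcard=6)
print(total)  # always >= 17; the dealer busts if it exceeds 21
```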


15 of 41

Keywords

  • Agent = Player
  • Environment = Card table
  • State = information shown on the table
  • Action = Hit or Stand
  • Reward = Win or Lose (or money)
  • Policy = should I Hit or Stand?


16 of 41

This is the policy


Dealer’s first card

Player’s cards

Policy: deciding the action (Hit or Stand) based on the state (dealer’s first card and player’s cards)

17 of 41

Monte Carlo

  • Monte Carlo methods, or Monte Carlo experiments, are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results.

Monte Carlo methods vary, but tend to follow a particular pattern:

  1. Define a domain of possible inputs
  2. Generate inputs randomly from a probability distribution over the domain
  3. Perform a deterministic computation on the inputs
  4. Aggregate the results
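The four-step pattern above can be illustrated with the classic Monte Carlo estimate of π; this sketch is an illustrative example, not from the slides:

```python
import random

def estimate_pi(n_samples=100_000, seed=0):
    """Estimate pi by random sampling, following the 4-step pattern:
    1. domain = the unit square, 2. sample points uniformly,
    3. deterministically test if each point lies in the quarter circle,
    4. aggregate the hit ratio."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()   # step 2: random inputs
        if x * x + y * y <= 1.0:            # step 3: deterministic test
            inside += 1
    return 4 * inside / n_samples           # step 4: aggregate

print(estimate_pi())  # close to 3.14159
```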



20 of 41

  • Initialize the policy arbitrarily
  • Take a random action a
  • Record the reward received by taking action a in state s, written Q(s, a)
  • Keep a running average of the reward values
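The recording-and-averaging step can be sketched as an incremental mean over (state, action) pairs. This is a minimal sketch assuming the blackjack-style state encoding (player sum, dealer card, usable ace) used later in the slides:

```python
from collections import defaultdict

Q = defaultdict(float)  # running average return per (state, action)
N = defaultdict(int)    # visit count per (state, action)

def record_return(state, action, reward):
    """Update the running average Q(s, a) with a newly observed reward."""
    key = (state, action)
    N[key] += 1
    Q[key] += (reward - Q[key]) / N[key]  # incremental mean update

# Example: in state (13, 8, usable ace) the agent Hit once and won (+1),
# then Hit again in the same state and lost (-1).
record_return((13, 8, True), "hit", +1)
record_return((13, 8, True), "hit", -1)
print(Q[((13, 8, True), "hit")])  # 0.0 after one win and one loss
```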


21 of 41

Example: Episode 1

  • State 🡪 Action (chosen at random)

  • State 🡪 Action 🡪 Reward
  • Reward 🡪 +1 (new card was 8, sum = 21)
  • State 🡪 Action 🡪 Reward 🡪 record


22 of 41

More episodes…

  • Rewards Q(s, a) are recorded as the episodes continue…
  • The record is basically the ‘experience’ itself
  • By comparing the average values of [13, 8, True] and [13, 8, False], the agent can decide whether to Hit or Stand
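Turning the recorded averages into a decision is a simple comparison: pick whichever entry has the higher average return. The numbers below are hypothetical, and the keys follow the slide’s notation of [player sum, dealer card, flag]:

```python
# Hypothetical recorded averages (not real results from the slides).
avg_reward = {
    (13, 8, True): 0.12,    # average return observed for one choice
    (13, 8, False): -0.25,  # average return observed for the other
}

def decide(player_sum, dealer_card):
    """Pick the flag (True/False) whose recorded average return is higher."""
    if avg_reward[(player_sum, dealer_card, True)] > avg_reward[(player_sum, dealer_card, False)]:
        return "Hit"
    return "Stand"

print(decide(13, 8))  # Hit, since 0.12 > -0.25
```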



24 of 41

Results: Ace card in hand


Dealer’s first card

Player’s card

25 of 41

Results: Ace not in hand


26 of 41

Application: Multi-Agent


27 of 41


Hider

Seeker

Object

Agent: Hider and Seeker

Environment: wall, floor, objects

Action: move around, push and pull object

Hider can ‘lock’ object

28 of 41


Hider uses cube objects to block the entrance

29 of 41


Seeker uses ramp objects to climb over the wall

30 of 41


Hider takes away the ramp objects and blocks the entrance

31 of 41


Hider builds a wall and locks the ramps so that seekers cannot use them

Seeker finds a glitch: it climbs onto a cube and moves with it

Locked

32 of 41


Seeker uses the cube like a vehicle and moves with it

Glitch


34 of 41

Personal thoughts

  • Machine learning: the field of study that gives a machine the ability to learn without being explicitly programmed

  • If the agent had been explicitly programmed, the seeker wouldn’t have been able to find the glitch in the system.


35 of 41


Hider builds a wall and locks every object from now on :(

36 of 41

Application: Bio



39 of 41


Environment: FDA approved UVA/Padova simulator

Agent: Insulin Pump

Action: release insulin or not

Goal: maintain a normal blood-glucose state
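The agent’s binary action and its reward signal can be sketched as below. This is a hypothetical sketch: the glucose thresholds (70–180 mg/dL for the normal range, 150 mg/dL for the release decision) are illustrative values, not taken from the slides or the UVA/Padova simulator.

```python
NORMAL_RANGE = (70, 180)  # illustrative "normal" glucose band, in mg/dL

def reward(glucose_mg_dl):
    """+1 while glucose stays in the normal range, -1 otherwise."""
    low, high = NORMAL_RANGE
    return 1 if low <= glucose_mg_dl <= high else -1

def action(glucose_mg_dl, threshold=150):
    """Binary action as in the slides: release insulin or not."""
    return "release insulin" if glucose_mg_dl > threshold else "no insulin"

print(action(200), reward(200))  # high glucose: release insulin, reward -1
```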


41 of 41


Traditional MSA (multiple sequence alignment) algorithms work, but their computational complexity needs to be improved.

The agent (RL model) is rewarded when its MSA result is similar to that of a traditional algorithm (e.g., dynamic programming).
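The reward scheme described above can be sketched as a similarity check against a reference alignment. This is a hypothetical sketch: the column-match fraction used as the similarity measure and the 0.9 threshold are illustrative choices, not from the slides.

```python
def similarity(alignment, reference):
    """Fraction of positions where the two alignment strings agree."""
    matches = sum(a == r for a, r in zip(alignment, reference))
    return matches / max(len(reference), 1)

def reward(agent_alignment, dp_alignment, threshold=0.9):
    """+1 if the agent's alignment is close enough to the dynamic-programming
    reference alignment, else 0."""
    return 1 if similarity(agent_alignment, dp_alignment) >= threshold else 0

print(reward("AC-GT", "ACGT-"))  # only 2 of 5 columns match: reward 0
```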