Deep Learning (410251)
(BE Computer 2019 PAT)
A.Y. 2022-23 SEM-II
Prepared by
Mr. Dhomse G.P.
Unit-6 Reinforcement Learning
Introduction to Deep Reinforcement Learning
There are two main types of Reinforcement Learning algorithms:
Model-Based RL
• Explicit: model
• May or may not have a policy and/or value function
Model-Free RL
• Explicit: value function and/or policy function
• No model
Transition / dynamics model predicts next agent state
Applications of RL
Applications of Deep Reinforcement Learning include industrial manufacturing, where deep reinforcement learning is very commonly applied in robotics.
Markov Decision Process
To formulate RL problems mathematically (using an MDP), we need to develop our intuition about the following concepts:
S[t] denotes the current state of the agent and S[t+1] denotes the next state. What the Markov property says is that the transition from state S[t] to S[t+1] is entirely independent of the past.
So the RHS of the equation means the same as the LHS if the system has the Markov property. Intuitively, this means that our current state already captures all the relevant information from the past states.
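For reference, the Markov property mentioned above is usually written (in the same bracket notation used elsewhere in these notes) as:
P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], ..., S[t] ]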
The edges of the chain denote transition probabilities. Let's take a sample from this chain. Suppose we are sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we will sleep more, and a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain.
In such sequences we get a random set of states (e.g. Sleep, Ice-cream, Sleep) every time we run the chain. Hopefully it is now clear why a Markov process is described as a random sequence of states.
Before moving on to the Markov Reward Process, let's look at some important concepts that will help us understand MRPs.
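As a small illustration, here is a minimal Python sketch that samples sequences from such a chain. The Sleep row of the transition matrix uses the probabilities quoted above; the Run and Ice-cream rows are made-up values, purely for illustration.

import random

# States of the toy Markov chain discussed above.
states = ["Sleep", "Run", "Ice-cream"]

# Transition probabilities. The Sleep row (0.2 sleep, 0.6 run, 0.2 ice-cream)
# comes from the text; the other two rows are assumed for illustration only.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},
    "Run":       {"Sleep": 0.6, "Run": 0.1, "Ice-cream": 0.3},
    "Ice-cream": {"Sleep": 0.7, "Run": 0.2, "Ice-cream": 0.1},
}

def sample_chain(start="Sleep", length=5):
    """Draw one random sequence of states from the chain."""
    state, sequence = start, [start]
    for _ in range(length):
        next_states = list(P[state].keys())
        weights = list(P[state].values())
        state = random.choices(next_states, weights=weights)[0]
        sequence.append(state)
    return sequence

print(sample_chain())  # e.g. ['Sleep', 'Run', 'Ice-cream', 'Sleep', 'Sleep', 'Run']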
r[t+1] is the reward the agent receives at the initial time step t[0] for performing an action (a) that moves it from one state to another. Similarly, r[t+2] is the reward received at time step t[1] for performing an action that moves it to another state.
Finally, r[T] is the reward received by the agent at the final time step for performing an action that moves it to another state.
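Putting these rewards together with a discount factor γ gives the discounted return, the quantity the agent tries to maximize:
G[t] = r[t+1] + γ·r[t+2] + γ²·r[t+3] + ... + γ^(T-t-1)·r[T]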
Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep. In short: Class 2 > Class 3 > Pass > Sleep. Our expected return, with a discount factor of 0.5, is:
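The per-state rewards come from the diagram of this example; assuming, purely for illustration, a reward of -2 for each Class state, +10 for Pass and 0 for Sleep, the calculation would be:
G = -2 + 0.5 × (-2) + 0.5² × (+10) = -2 - 1 + 2.5 = -0.5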
Markov Reward Process
Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a Transition Probability Matrix (P). But Reinforcement Learning is all about the goal of maximizing the reward, so let's add rewards to our Markov chain. This gives us a Markov Reward Process.
A Markov Reward Process is a tuple <S, P, R, γ> where S is a finite set of states, P is the state transition probability matrix, R is the reward function (the reward model predicts the immediate reward), and γ is the discount factor.
Bellman Equation for Value Function
The value function is the expected discounted sum of future rewards under a particular policy.
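In this notation, the value function satisfies the Bellman equation (written in the same plain style as the other equations in these notes):
v(s) = R(s) + γ · Σ_s' P(s'|s) · v(s')
That is, the value of a state is its immediate reward plus the discounted value of the states it can transition to, weighted by their transition probabilities.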
Markov Decision Process
If an agent at time t follows a policy π, then π(a|s) is the probability that the agent takes action a in state s at that time step.
vπ(s) is the expected return starting from state s and following the policy π for the subsequent states until we reach the terminal state.
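In symbols, using the same bracket notation as before:
vπ(s) = Eπ[ G[t] | S[t] = s ]
where G[t] is the discounted return defined earlier.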
Basic Framework of Reinforcement Learning
Challenges Of Reinforcement Learning
Here are the major challenges you will face while doing Reinforcement Learning:
Dynamic Programming for RL
Terms used in Reinforcement Learning
There are mainly three ways to implement reinforcement learning in ML, which are:
• Value-based: learn a value function and choose the actions that maximize it.
• Policy-based: learn the policy directly, without an explicit value function.
• Model-based: learn a model of the environment and use it to plan.
Elements of Reinforcement Learning
There are four main elements of Reinforcement Learning, which are given below:
Policy: A policy defines the agent's way of behaving, i.e. which action to take in a given state.
For a deterministic policy: a = π(s)
For a stochastic policy: π(a|s) = P[ A[t] = a | S[t] = s ]
Reward Signal: The goal of RL is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, the policy may change to select other actions in the future.
Value Function: The value function gives information about how good a situation or action is and how much reward the agent can expect. Whereas the reward signal indicates the immediate desirability of an action, the value function specifies how good states and actions are in the long run.
Model: The model mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave; for example, if a state and an action are given, the model can predict the next state and reward.
Dynamic Programming for RL
Policy evaluation: given some policy π, the value function is computed iteratively using an update equation (rule).
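One standard form of such an update rule (not necessarily the exact one on the original slide) is iterative policy evaluation, written in the same plain notation as the other equations:
v[k+1](s) = Σ_a π(a|s) · ( R(s,a) + γ · Σ_s' P(s'|s,a) · v[k](s') )
Each sweep backs up the current estimate v[k] through the given policy π until it converges to vπ.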
Q-Learning
Q-Learning does not require a model of the environment (it is a model-free method). It can work solely by evaluating which of its actions return a higher reward.
The 'Q' in Q-Learning stands for quality: the Q-value measures how useful an action is for getting the maximum reward in the future.
The scoring/reward system is as below:
Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path without stepping on a mine? So, how do we solve this?
Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.
But the question is: how do we calculate the values of the Q-table?
To learn each value of the Q-table, we use the Q-Learning algorithm.
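To make the iterative update concrete, here is a minimal Python sketch of the tabular Q-Learning update for a grid world like the robot-and-mines example. The action set, reward values and hyperparameters are placeholders chosen for illustration, not values taken from the slides.

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate (assumed)
actions = ["up", "down", "left", "right"]

# Q-table: maps (state, action) to its current estimated value, 0.0 by default.
Q = defaultdict(float)

def choose_action(state):
    """Epsilon-greedy: mostly exploit the best known action, sometimes explore."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One Q-Learning step: move Q(s,a) towards reward + gamma * max_a' Q(s',a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

Each episode, the robot repeatedly calls choose_action, takes the step in the environment, observes the reward and next state, and calls update; over many iterations the Q-table converges towards the expected future rewards.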
Dynamic Programming vs Q-Learning
Dynamic programming (DP) and Q-Learning are two approaches used in RL to solve problems where an agent needs to learn the best action to take in each state in order to maximize its long-term reward. There are some key differences between them:
3. Convergence vs. Exploration: DP algorithms are guaranteed to converge to the optimal solution in a finite number of steps. However, this requires complete knowledge of the environment model. Q-learning does not guarantee convergence, but it can handle environments with incomplete or unknown models by exploring and updating its Q-values based on the observed rewards.
4. Memory Usage: DP algorithms require a large amount of memory to store the value function and policy for each state. The memory requirement grows exponentially with the number of states in the environment. In contrast, Q-learning only requires memory to store the Q-table, which has a size proportional to the number of state-action pairs.
Q-Learning Application
Deep Q-Networks
In a DQN, the target Q-values are computed using the Bellman equation:
Q(s, a) = r + gamma * max(Q(s', a'))
where:
r is the immediate reward, gamma is the discount factor, s' is the next state, and a' ranges over the actions available in that next state.
So, what are the steps involved in reinforcement learning using deep Q-learning networks (DQNs)?
Let's sum up the whole Deep Q-Learning process in steps. The agent: this is the entity that interacts with the game environment. It takes in the current state of the game and outputs an action to take.
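As a rough sketch of these pieces (assuming PyTorch; the network sizes, hyperparameters and environment interface are illustrative placeholders, not taken from the slides):

import random
from collections import deque

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork(state_dim=4, n_actions=2)        # sizes chosen only for illustration
target_net = QNetwork(state_dim=4, n_actions=2)   # periodically synced copy of q_net
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                     # experience replay buffer of (s, a, r, s', done)
gamma = 0.99

def train_step(batch_size=32):
    """One gradient step on a random minibatch of stored transitions."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a) actually taken
    with torch.no_grad():
        target = r.float() + gamma * target_net(s2.float()).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

While playing, the agent stores each transition in replay and calls train_step; every few hundred steps target_net is re-synced with q_net so the regression target stays stable.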
Deep Q-Learning with Recurrent Networks
Deep Q-Learning (DQL) can also be implemented using recurrent neural networks (RNNs), which can handle sequential data and provide better performance in tasks that require temporal reasoning.
In DQL with RNNs, the input to the neural network is a sequence of game states, and the output is a sequence of Q-values, one for each state-action pair.
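A minimal sketch of such a recurrent Q-network (assuming PyTorch and an LSTM; the layer sizes are illustrative placeholders):

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    """Takes a sequence of states and returns a sequence of Q-values, one set per time step."""
    def __init__(self, state_dim, n_actions, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_actions)

    def forward(self, state_seq):
        # state_seq: (batch, seq_len, state_dim)
        out, _ = self.lstm(state_seq)   # (batch, seq_len, hidden_size)
        return self.head(out)           # (batch, seq_len, n_actions)

# Example: a batch of 2 episodes, each 5 steps long, with 4-dimensional states.
q_values = RecurrentQNetwork(state_dim=4, n_actions=3)(torch.randn(2, 5, 4))
print(q_values.shape)   # torch.Size([2, 5, 3])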
In this architecture, we have a sequence of game states as input and a sequence of Q-values (one per state-action pair) as output.
Q-Learning Application
PAC-MAN Game
Atari Games
Simple Reinforcement Learning for Tic Tac Toe
Tic Tac Toe is a simple game with a small state space, making it an ideal environment for learning. In this game, two players take turns placing either an X or an O on a 3x3 grid. The goal is to place three of the same symbol in a row, column, or diagonal.
Here's a simple reinforcement learning algorithm for Tic Tac Toe:
4. Define the learning algorithm: A common learning algorithm used in RL is Q-learning. Q-learning learns the optimal action-value function, which tells the agent the expected total reward for taking a certain action in a certain state. The Q-value for a given state-action pair is updated using the Bellman equation:
Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s',a')) - Q(s,a))
where s is the current state, a is the action taken, alpha is the learning rate, gamma is the discount factor, s' is the next state, and a' is the best action to take in the next state according to the current Q-values.
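A minimal Python sketch of this update for Tic Tac Toe, where a state can be represented simply as a tuple of the 9 board cells (the representation and hyperparameter values here are illustrative, not prescribed by the text):

from collections import defaultdict

alpha, gamma = 0.5, 0.9          # learning rate and discount factor (assumed values)
Q = defaultdict(float)           # Q[(state, action)] -> estimated value, 0.0 by default

def q_update(state, action, reward, next_state, next_actions):
    """Apply Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s',a')) - Q(s,a))."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example: the board is a 9-tuple of 'X', 'O' or ' ', and an action is the index of the cell to play.
board = (' ',) * 9
next_board = board[:4] + ('X',) + board[5:]
q_update(board, 4, reward=0.0, next_state=next_board,
         next_actions=[i for i in range(9) if next_board[i] == ' '])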
A widely used strategy to tackle this problem is the epsilon-decreasing strategy. It works as follows:
1. Initialize a variable 'epsilon' with a value between 0 and 1 (usually around 0.3).
2. With probability = epsilon we explore, and with probability = 1 - epsilon we exploit.
3. Decrease the value of epsilon over time until it becomes zero.
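A small Python sketch of this epsilon-decreasing action selection (the decay schedule and starting value are illustrative assumptions):

import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploit: best action

epsilon = 0.3
for episode in range(1000):
    # ... play one game, calling epsilon_greedy(...) for every move ...
    epsilon = max(0.0, epsilon - 0.3 / 1000)   # decrease epsilon towards zero over time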
The value of each board state is updated using:
V(s) = V(s) + alpha * ( V(s^f) - V(s) )
where V(s) is the value of the current state of the game board, V(s^f) is the value of the new board state after the agent takes its action, and alpha is the learning rate / step-size parameter.