
Deep Learning (410251)

(BE Computer 2019 PAT)

A.Y. 2022-23 SEM-II

Prepared by

Mr. Dhomse G.P.


Unit-6 Reinforcement Learning

  • Introduction of deep reinforcement learning, Markov Decision Process, basic framework of reinforcement learning, challenges of reinforcement learning
  • Dynamic programming algorithms for reinforcement learning, Q-Learning and Deep Q-Networks, Deep Q recurrent networks, simple reinforcement learning for Tic-Tac-Toe

Introduction to Deep Reinforcement Learning

  • Reinforcement Learning
  • Agent - An agent (A) takes actions that affect the environment. For example, a machine learning to play chess is the agent.
  • Agents are software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment through actions and receive rewards based on their actions.
  • Action - The set of all possible operations/moves the agent can make. The agent decides which action to take from a set of discrete actions (a).
  • Environment - All actions that the reinforcement learning agent makes directly affect the environment. Here, the chess board is the environment. The environment takes the agent's present state and action as input and returns the reward to the agent along with a new state.

  • Environment: It is the representation of the problem to be solved. We can have a real-world environment or a simulated environment with which our agent will interact.
  • For example, a move made by the bot will have either a negative or a positive effect on the whole game and the arrangement of the board. This will decide the next action and the next state of the board.
  • State: A state (S) is the particular situation in which the agent finds itself, i.e., the position of the agent at a specific time step in the environment. Whenever the agent performs an action, the environment gives the agent a reward and a new state that the agent reached by performing that action.
  • Actions can be any decisions we want the agent to learn, and a state can be anything that is useful in choosing actions. We do not assume that everything in the environment is unknown to the agent; for example, reward calculation is considered part of the environment even though the agent knows a bit about how its reward is calculated as a function of its actions and the states in which they are taken. This is because rewards cannot be arbitrarily changed by the agent. Sometimes the agent might be fully aware of its environment but still find it difficult to maximize the reward, just as we might know how to play a Rubik's cube but still be unable to solve it. So we can safely say that the agent-environment boundary represents the limit of the agent's control, not of its knowledge.

  • Reward (R) - The feedback from the environment by which we measure how good the agent's actions are in each state. It is crucial in Reinforcement Learning, where we want the machine to learn all by itself and the only critic that helps it learn is the feedback/reward it receives.
  • For example, in a chess game the bot receives a positive reward when it captures an opponent's piece.
  • Discount factor - The discount factor modifies the importance of rewards over time. Given the uncertainty of the future, it reduces the degree to which distant future rewards affect our value function estimates.
  • Policy (π) - It decides what action to take in a given state in order to maximize the reward.
  • Value (V) - It measures how good it is to be in a specific state: the expected discounted reward that the agent collects when following a specific policy from that state.
  • Q-value or action-value - The Q-value is a measure of the overall expected reward if the agent is in state (s), takes action (a), and then plays until the end of the episode according to some policy (π).

  • Model-based vs Model-free learning algorithms

There are two main types of Reinforcement Learning algorithms:

  • 1. Model-based algorithms- They are used in scenarios where we have complete knowledge of the environment and how it reacts to different actions.
  • Agent has access to the model of the environment i.e., action required to be performed to go from one state to another, probabilities attached, and corresponding rewards attached.
  • 2. Model-free algorithms- Model-free algorithms find the optimal policy with very limited knowledge of the dynamics of the environment. They do not have any transition/reward function with which to judge the best policy.
  • They estimate the optimal policy directly from experience, i.e., from the interaction between agent and environment, without having any hint of the reward function.


Model-Based RL

• Explicit: model

• May or may not have a policy and/or value function

Model-Free RL

• Explicit: value function and/or policy function

• No model


Transition / dynamics model predicts next agent state


  • S[t] denotes the current state of the agent and S[t+1] denotes the next state.
  • The Markov property (written out below) means that the transition from state S[t] to S[t+1] is entirely independent of the past.
  • The RHS of the equation equals the LHS exactly when the system has the Markov property.
  • Intuitively, our current state already captures the information of the past states.
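The equation referred to here, which the original slide showed as a figure, is the standard statement of the Markov property:

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], ..., S[t] ]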


  • The Markov property states that a state at time t+1 depends only on the current state at time t and is independent of all previous states t-1, t-2, .... In short, to know a future state, we just need to know the current state.
  • The state diagram has 3 possible states: Sleep, Run, Ice-cream.
  • The probability of the next state depends only on the current state, e.g., in Tic-Tac-Toe or Golf. When we are able to take a decision based on the current state alone, rather than needing to know the whole history, we say that the conditions of the Markov property are satisfied.

Application of RL

Applications of deep Reinforcement Learning - Industrial manufacturing: Deep Reinforcement Learning is very commonly applied in robotics.

  • The actions that a robot has to take are inherently sequential. Agents learn to interact with dynamically changing environments and thus find applications in industrial automation and manufacturing.
  • Labor expenses, product faults and unexpected downtime are reduced, with significant improvements in transition times and production speed.
  • Self-driving cars: Autonomous vehicles use large amounts of visual data and leverage image processing capabilities together with neural network architectures.
  • The algorithms learn to recognize pedestrians, roads, traffic and street signs in the environment and act accordingly. They are trained on complex scenarios to excel at decision making in situations involving minimizing human harm, choosing the best route to follow, etc.
  • Trading and Finance: We have seen how supervised learning and time-series analysis help in prediction of the stock market, but neither helps us decide what to do in a particular situation. An RL agent can select whether to hold, buy, or sell a share. To guarantee that the RL model is working optimally, it is assessed against market benchmark standards.
  • Natural Language Processing: Reinforcement Learning has spread its wings and conquered NLP too. Different NLP tasks like question answering, summarization and chatbot implementation can be handled by a Reinforcement Learning agent.

Markov Decision Process

  • In a typical Reinforcement Learning (RL) problem, there is a learner and a decision maker called agent and the surrounding with which it interacts is called environment.
  • The environment, in return, provides rewards and a new state based on the actions of the agent.
  • So, in reinforcement learning, we do not teach an agent how it should do something; instead we present it with rewards, whether positive or negative, based on its actions.
  • So the root question is: how do we formulate any problem in RL mathematically? This is where the Markov Decision Process (MDP) comes in.

  • The Markov Decision Process is the framework used in Reinforcement Learning to formalize sequential decision making.
  • This formalization is the basis for the problems that are solved by Reinforcement Learning. The central component of a Markov Decision Process (MDP) is a decision maker, called an agent, that interacts with the environment it is placed in.
  • These interactions occur sequentially over time.

  • Like a Markov chain, the model attempts to predict an outcome given only information provided by the current state. However, the Markov decision process incorporates the characteristics of actions and motivations.
  • At each step during the process, the decision maker may choose to take an action available in the current state, resulting in the model moving to the next step and offering the decision maker a reward.
  • The process of selecting an action from a given state, transitioning to a new state and receiving a reward happens sequentially over and over again. This creates something called a trajectory that shows the sequence of states, actions and rewards.
  • Using RL, the algorithm will attempt to optimize the actions taken within an environment in order to maximize the potential reward. Where supervised learning techniques require correct input/output pairs to create a model, reinforcement learning uses Markov decision processes to determine an optimal balance of exploration and exploitation.

To formulate RL problems mathematically (using an MDP), we need to develop our intuition about:

  • The Agent-Environment relationship (discussed in the introduction above)
  • Markov Property
  • Markov Process and Markov chains
  • Markov Reward Process (MRP)
  • Bellman Equation


  • The Markov Property
  • Transition : Moving from one state to another is called Transition.
  • Transition Probability: The probability that the agent will move from one state to another is called transition probability.
  • The Markov Property states that:
  • "Future is independent of the past given the present."
  • Mathematically we can express this statement as:
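In the notation used throughout this unit (the original slide showed this equation as a figure):

P[ S[t+1] | S[t] ] = P[ S[t+1] | S[1], S[2], ..., S[t] ]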

S[t] denotes the current state of the agent and s[t+1] denotes the next state. What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past.

So, the RHS of the Equation means the same as LHS if the system has a Markov Property. Intuitively meaning that our current state already captures the information of the past states.

  • State Transition Probability:
  • As we now know about transition probability, we can define the state transition probability as follows:
  • For a Markov state S[t] and any successor state S[t+1], the state transition probability is given below.

  • We can also collect the state transition probabilities into a state transition probability matrix P.
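Reconstructing the formulas that the slide showed as images, in the notation used above:

P[ss'] = P[ S[t+1] = s' | S[t] = s ]

P = [ P[11] ... P[1n] ; ... ; P[n1] ... P[nn] ],  where each row of the matrix sums to 1.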

  • Markov Process or Markov Chains
  • A Markov Process is a memoryless random process, i.e., a sequence of random states S[1], S[2], ..., S[n] with the Markov Property. So it is basically a sequence of states with the Markov Property. It can be defined using a set of states (S) and a transition probability matrix (P). The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).
  • But what does "random process" mean?
  • To answer this question, let's look at an example:

The edges of the chain denote transition probabilities. Let's take some samples from this chain. Suppose we are sleeping; according to the probability distribution, there is a 0.6 chance that we will run, a 0.2 chance that we sleep more, and a 0.2 chance that we will eat ice-cream. Similarly, we can think of other sequences that we can sample from this chain. Some samples from the chain:

  • Sleep — Run — Ice-cream — Sleep
  • Sleep — Ice-cream — Ice-cream — Run

In the above two sequences we see that we get a random sequence of states (S) (e.g., Sleep, Run, Ice-cream, Sleep) every time we run the chain. Hopefully it is now clear why a Markov process is called a random sequence of states.

Before moving on to the Markov Reward Process, let's look at some important concepts that will help us understand MRPs.

  • Rewards and Returns
  • Rewards are the numerical values that the agent receives on performing some action at some state(s) in the environment. The numerical value can be positive or negative based on the actions of the agent.
  • In Reinforcement Learning, we care about maximizing the cumulative reward (all the rewards the agent receives from the environment) rather than only the reward the agent receives from the current state (also called the immediate reward). This total sum of rewards the agent receives from the environment is called the return, written out in symbols below.
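In symbols (reconstructed here from the surrounding description, since the slide's formula is not reproduced in this text), the return at time step t is the sum of the rewards that follow:

G[t] = r[t+1] + r[t+2] + r[t+3] + ... + r[T]

and, once a discount factor γ is introduced (see the Discount Factor slide below), the discounted return is

G[t] = r[t+1] + γ·r[t+2] + γ²·r[t+3] + ... = Σ (k = 0 to ∞) γ^k · r[t+k+1]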

r[t+1] is the reward received by the agent at time step t[0] while performing an action(a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state.

And r[T] is the reward received by the agent at the final time step by performing an action to move to another state.

Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep. In short: Class 2 > Class 3 > Pass > Sleep. Our expected return, with a discount factor of 0.5, is:
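The transition diagram and reward values from the slide are not reproduced in this text. Assuming, purely for illustration, the reward values of the widely used "student MRP" example (a reward of -2 for the class states, +10 for Pass, and 0 for Sleep), the discounted return from Class 2 with γ = 0.5 would be:

G = -2 + 0.5·(-2) + 0.5²·(10) = -2 - 1 + 2.5 = -0.5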

  • Episodic Tasks: These are tasks that have a terminal state (end state), i.e., they come to an end. For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!). This is called an episode. Once we restart the game, it starts again from an initial state, and hence every episode is independent.
  • Continuous Tasks: These are tasks that have no end, i.e., they have no terminal state. These types of tasks never end. For example, learning how to code!
  • Discount Factor (γ): It determines how much importance is given to the immediate reward versus future rewards, and it helps us avoid infinite returns in continuous tasks. It has a value between 0 and 1. A value close to 0 gives more importance to the immediate reward, and a value close to 1 gives more importance to future rewards. In practice, a discount factor of 0 considers only the immediate reward, while a discount factor of 1 weighs all future rewards fully, which may lead to an infinite return. Therefore, the discount factor is typically chosen between 0.2 and 0.8.

A Markov Reward Process is a tuple <S, P, R, γ> where:

  • S is a (finite) set of states
  • P is a state transition probability matrix, P[ss'] = P[ S[t+1] = s' | S[t] = s ]
  • R is a reward function, R[s] = E[ r[t+1] | S[t] = s ]
  • γ is a discount factor, γ ∈ [0, 1]; it weighs immediate rewards against future rewards

Markov Reward Process

Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P). But we know that Reinforcement Learning is all about the goal of maximizing the reward. So let's add rewards to our Markov chain; this gives us the Markov Reward Process.

Bellman Equation for Value Function
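In the notation of the tuple above, the value function of a Markov Reward Process and its Bellman equation (shown on the slide as a figure) can be written as:

v(s) = E[ G[t] | S[t] = s ] = R[s] + γ · Σ (over s') P[ss'] · v(s')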


Reward model: predicts the immediate reward.

Value function: the expected discounted sum of future rewards under a particular policy.

  • Policy Function and Value Function
  • The value function determines how good it is for the agent to be in a particular state. Of course, how good a state is depends on the actions the agent will take from it. This is where the policy comes in: a policy defines what action to perform in a particular state s.


If an agent at time t follows a policy π, then π(a|s) is the probability that the agent takes action a in state s at that time step.

vπ(s) is the expected return starting from state s and following the policy π through the subsequent states until we reach the terminal state.
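In symbols, reconstructing the formulas the slide referred to:

π(a | s) = P[ A[t] = a | S[t] = s ]

vπ(s) = Eπ[ G[t] | S[t] = s ]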


Basic Framework of Reinforcement Learning


  • The goal of most practical RL algorithms is to maximize expected discounted returns


  • An observation space. Observations are the data an RL agent uses to make sense of the world. An autonomous vehicle (AV) may observe the world through cameras and lidar.
  • An action space. The action space defines what actions an RL agent can take. An AV's action space could be the steering wheel angle as well as the gas and brake pedals.
  • A reward function. The reward function labels each timestep with how favorable it is for the overall metric that we want to optimize. For AVs, the reward function could reflect whether the car is closer to its destination and is driving safely.
  • A terminal condition. This is the condition that determines the length of a single episode of an RL loop. It is either a fixed time budget (e.g., 1000 steps) or is triggered when a success condition is met (e.g., the AV arrived safely or the user purchased a recommended item). A minimal interaction-loop sketch follows this list.
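To show how these four pieces fit together, here is a minimal sketch of an RL interaction loop in Python. It assumes a hypothetical Gym-style environment object env with reset() and step() methods, a list-like action_space attribute, and a placeholder choose_action function; these names are illustrative assumptions, not part of the slides.

import random

def choose_action(observation, action_space):
    # Placeholder policy: pick a random action from the (list-like) action space.
    return random.choice(action_space)

def run_episode(env, max_steps=1000):
    # env is assumed to expose reset() -> observation and
    # step(action) -> (observation, reward, done), like a Gym-style environment.
    observation = env.reset()                    # first observation from the observation space
    total_reward = 0.0
    for t in range(max_steps):                   # terminal condition: fixed step budget ...
        action = choose_action(observation, env.action_space)   # pick from the action space
        observation, reward, done = env.step(action)            # reward signal + next observation
        total_reward += reward
        if done:                                 # ... or a success/failure signal from the environment
            break
    return total_reward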

Challenges of Reinforcement Learning

Here are the major challenges you will face while doing Reinforcement Learning:

  • Feature/reward design, which can be very involved
  • Parameters may affect the speed of learning.
  • Realistic environments can have partial observability (the agent does not know the full state of the world, or the agents act simultaneously).
  • Too much reinforcement may lead to an overload of states, which can diminish the results.
  • Realistic environments can be non-stationary (an environment where sudden concept drift can occur due to a dynamic and unknown probability distribution of the data).

When not to use Reinforcement Learning?

  • When you have enough data to solve the problem with a supervised learning method.
  • Remember that Reinforcement Learning is computing-heavy and time-consuming, in particular when the action space is large.

Dynamic Programming for RL

  • Dynamic programming can help find optimal solutions to planning problems faced in industry, under the important assumption that the specifics of the environment are known.
  • For example, in a two-location bike-rental planning problem, the number of bikes returned and requested at each location are given by functions g(n) and h(n) respectively.
  • In exact terms, the probability that the number of bikes rented at both locations is n is given by g(n), and the probability that the number of bikes returned at both locations is n is given by h(n).

  • To use a dynamic programming algorithm, we need a finite Markov Decision Process (MDP) that describes our environment; thus we need to know the transition probabilities of our actions. The result we obtain is deterministic, i.e., the policy we retrieve with the help of our dynamic programming algorithm is always the same. Another benefit is that it always converges.


Terms used in Reinforcement Learning

  • Agent(): An entity that can perceive/explore the environment and act upon it.
  • Environment(): A situation in which an agent is present or surrounded by. In RL, we assume the stochastic environment, which means it is random in nature.
  • Action(): Actions are the moves taken by an agent within the environment.
  • State(): State is a situation returned by the environment after each action taken by the agent.
  • Reward(): A feedback returned to the agent from the environment to evaluate the action of the agent.
  • Policy(): Policy is a strategy applied by the agent for the next action based on the current state.
  • Value(): The expected long-term return computed with the discount factor, as opposed to the short-term reward.
  • Q-value(): It is mostly similar to the value, but it takes one additional parameter as a current action (a).

  • Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement-learning in ML, which are:

  1. Value-based: The value-based approach is about finding the optimal value function, which is the maximum value achievable at a state under any policy. The agent then expects the long-term return at any state(s) under policy π.
  2. Policy-based: The policy-based approach is to find the optimal policy for the maximum future rewards without using the value function. In this approach, the agent tries to apply such a policy that the action performed in each step helps to maximize the future reward. The policy-based approach has mainly two types of policy:
    • Deterministic: The same action is produced by the policy (π) in any given state.
    • Stochastic: In this policy, probability determines the produced action.
  3. Model-based: A virtual model is created for the environment, and the agent explores that environment to learn it. There is no single solution or algorithm for this approach because the model representation is different for each environment.

Elements of Reinforcement Learning

There are four main elements of Reinforcement Learning, which are given below:

  1. Policy
  2. Reward Signal
  3. Value Function
  4. Model of the environment

Policy: A policy defines the agent's way of behaving at a given time; it maps states to actions.

For a deterministic policy: a = π(s)

For a stochastic policy: π(a | s) = P[ A[t] = a | S[t] = s ]

Reward Signal: The goal of RL is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent; this signal is known as the reward signal. The reward signal can change the policy: if an action selected by the agent leads to a low reward, then the policy may change to select other actions in the future.

Value Function: The value function gives information about how good a situation and action are and how much reward an agent can expect. A reward indicates the immediate signal for each good and bad action, whereas a value function specifies what is good for the future.

Model: A model mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, given a state and an action, a model can predict the next state and reward.

  • The algorithm continues to update the value function until it converges to the optimal value function.
  • Once the optimal value function is found, an optimal policy can be computed by selecting the action with the highest expected reward in each state.
  • An example of using DP for RL is in a grid world navigation problem.
  • In this problem, an agent must navigate a grid world to reach a goal state while avoiding obstacles.
  • The agent receives a reward of +1 for reaching the goal state and a reward of -1 for colliding with an obstacle or reaching a boundary.
  • The agent can take actions in four directions: up, down, left, and right. The transition probabilities are deterministic, meaning that the agent will always move in the intended direction unless there is an obstacle or boundary in the way.

  • DP algorithms aim to compute optimal policies by iteratively computing value functions, which represent the expected cumulative reward that an agent can receive while following a given policy from a particular state.
  • One of the most well-known DP algorithms is the Value Iteration algorithm.
  • The algorithm starts by initializing the value function for each state to an arbitrary value, such as 0.
  • Then, it iteratively updates the value function for each state by using the Bellman optimality equation, which relates the value of a state to the values of its successor states:
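The Bellman optimality equation referred to here, written with the quantities defined in the list that follows, is:

V(s) = max_a Σ (over s') P(s'|s,a) · [ R(s,a,s') + γ · V(s') ]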

  • Here, V(s) represents the value of state s,
  • max_a represents the maximum over all possible actions a that can be taken in state s,
  • P(s'|s,a) represents the probability of transitioning to state s' given that action a is taken in state s,
  • R(s,a,s') represents the reward received for transitioning from state s to state s' with action a,
  • γ is a discount factor that determines the relative importance of immediate and future rewards. (A short value-iteration sketch on a small, assumed MDP follows this list.)
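Here is a compact value-iteration sketch in Python. It assumes the MDP is supplied as a dictionary P[s][a] of (probability, next_state, reward) triples; this representation, the state/action sets, and the hyper-parameter values are illustrative assumptions, not details taken from the slides.

def value_iteration(P, states, actions, gamma=0.9, theta=1e-6):
    # P[s][a] is a list of (prob, next_state, reward) triples describing the environment.
    V = {s: 0.0 for s in states}          # initialize V(s) to an arbitrary value (here 0)
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: V(s) = max_a sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
            q_values = [
                sum(prob * (reward + gamma * V[s2]) for prob, s2, reward in P[s][a])
                for a in actions
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                 # stop once the value function has converged
            break
    # Extract a greedy (optimal) policy from the converged value function.
    policy = {
        s: max(actions,
               key=lambda a: sum(prob * (reward + gamma * V[s2]) for prob, s2, reward in P[s][a]))
        for s in states
    }
    return V, policy

For the grid-world example above, the states would be the grid cells, the actions would be up/down/left/right, and terminal cells can be modeled as states whose actions loop back to themselves with zero reward.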

Update rule for evaluating some given policy π

Q-Learning


Q-Learning does not require a model of the Markov decision process. It can work solely by evaluating which of its actions return a higher reward.

Q-Learning stands for "quality" learning: the Q-value measures how useful an action is for obtaining the maximum rewards in the future.


  • Let’s say that a robot has to cross a maze and reach the end point.
  • There are mines, and the robot can only move one tile at a time.
  • If the robot steps onto a mine, the robot is dead.
  • The robot has to reach the end point in the shortest time possible.

The scoring/reward system is as below:

  1. The robot loses 1 point at each step.
  2. This is done so that the robot takes the shortest path and reaches the goal as fast as possible.
  3. If the robot steps on a mine, the point loss is 100 and the game ends.
  4. If the robot gets power ⚡️, it gains 1 point.
  5. If the robot reaches the end goal, the robot gets 100 points.

Now, the obvious question is: How do we train a robot to reach the end goal with the shortest path without stepping on a mine? So, how do we solve this?

  • In the Q-Table, the columns are the actions and the rows are the states.

Each Q-table score will be the maximum expected future reward that the robot will get if it takes that action at that state. This is an iterative process, as we need to improve the Q-Table at each iteration.

But the questions are:

  • How do we calculate the values of the Q-table?
  • Are the values available or predefined?

To learn each value of the Q-table, we use the Q-Learning algorithm.

  • Mathematics: the Q-Learning algorithm
  • Q-function: The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). Its update rule is reproduced below this list.
  • Using this function, we get the values of Q for the cells in the table.
  • When we start, all the values in the Q-table are zeros.
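The Q-learning update rule that this Q-function implements (the slide showed it as an image; the same equation also appears later in the Tic-Tac-Toe section) is:

Q(s,a) = Q(s,a) + alpha * ( r + gamma * max(Q(s',a')) - Q(s,a) )

where alpha is the learning rate, gamma is the discount factor, r is the reward received, and s' is the next state.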

  • Step 1: initialize the Q-Table
  • We will first build a Q-table. There are n columns, where n= number of actions. There are m rows, where m= number of states. We will initialise the values at 0.
  • Steps 2 and 3: choose and perform an action
  • This combination of steps is done for an undefined amount of time. This means that this step runs until the time we stop the training, or the training loop stops as defined in the code.
  • We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier, when the episode initially starts, every Q-value is 0.

  • Steps 4 and 5: evaluate
  • Now we have taken an action and observed an outcome and a reward. We need to update the function Q(s,a). To reiterate, in the robot game the scoring/reward structure is (a minimal tabular Q-learning sketch in Python follows this list):
  • power = +1
  • mine = -100
  • end = +100
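The sketch below ties steps 1-5 together, assuming a Gym-style environment with discrete states and actions exposed as env.reset(), env.step() and env.n_actions; the interface and hyper-parameter values are illustrative assumptions, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_min=0.05, epsilon_decay=0.999):
    # Step 1: Q-table with one row per state and one column per action, initialised to 0.
    Q = defaultdict(lambda: [0.0] * env.n_actions)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Steps 2-3: epsilon-greedy action selection, then act in the environment.
            if random.random() < epsilon:
                action = random.randrange(env.n_actions)                        # explore
            else:
                action = max(range(env.n_actions), key=lambda a: Q[state][a])   # exploit
            next_state, reward, done = env.step(action)
            # Steps 4-5: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(Q[next_state]) * (not done)
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
        epsilon = max(epsilon_min, epsilon * epsilon_decay)                     # decay exploration over time
    return Q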

Dynamic Programming vs Q-Learning

Dynamic programming (DP) and Q-learning are two approaches used in RL to solve problems where an agent needs to learn the best action to take in a given state to maximize its long-term reward. There are some key differences between them:

  1. Model-Based vs. Model-Free: Dynamic programming requires a model of the environment to compute the optimal policy. The model includes the transition probabilities between states and the reward associated with each state-action pair. On the other hand, Q-learning is a model-free method that does not require any knowledge of the environment's transition probabilities. It learns from experience by updating its Q-values based on the rewards it receives.
  2. Iterative vs. Sample-Based: DP algorithms are iterative methods that use the Bellman equation to compute the optimal value function and policy. They require multiple passes through the entire state space to converge to an optimal solution. Q-learning is a sample-based method that updates the Q-values using experience gained from interacting with the environment. It does not require multiple passes through the entire state space.

3. Convergence vs. Exploration: DP algorithms are guaranteed to converge to the optimal solution in a finite number of steps. However, this requires complete knowledge of the environment model. Q-learning does not guarantee convergence, but it can handle environments with incomplete or unknown models by exploring and updating its Q-values based on the observed rewards.

4. Memory Usage: DP algorithms require a large amount of memory to store the value function and policy for each state. The memory requirement grows exponentially with the number of states in the environment. In contrast, Q-learning only requires memory to store the Q-table, which has a size proportional to the number of state-action pairs.

Q-Learning Application

  • Game AI: Deep Q-learning has been used to create AI agents that can play video games at superhuman levels. One of the most famous examples is DeepMind's DQN agent, which learned to play Atari games directly from screen pixels; related DeepMind systems such as AlphaGo and AlphaZero combine deep networks with reinforcement learning (and tree search, rather than Q-learning) to master Go and chess.
  • Robotics: Deep Q-learning has been used in robotics to teach robots how to perform tasks such as grasping objects or navigating through a maze. By using deep Q-learning, robots can learn to perform complex tasks in a more efficient and adaptive way.
  • Autonomous vehicles: Deep Q-learning can be used to train autonomous vehicles to make decisions in real-time. For example, the agent can learn to navigate through complex traffic situations and avoid collisions by learning from its experiences.
  • Healthcare: Deep Q-learning can be used in healthcare to optimize treatment plans for patients with chronic diseases. For example, the agent can learn to adjust the dosage of medications based on patient feedback and symptoms.
  • Finance: Deep Q-learning can be used in finance to develop trading strategies for stocks and other financial instruments. The agent can learn to predict market trends and make decisions based on its experiences.
  • Natural Language Processing: Deep Q-learning has been used in natural language processing to develop conversational agents that can interact with users in a more natural way. The agent can learn to respond to user queries and make decisions based on the context of the conversation.

Deep Q-Networks

  • Deep Q-learning is a variant of Q-learning that uses a deep neural network to approximate the Q-values rather than a Q-table. Here is an example of how deep Q-learning works:
  • Let's say we have a game where the agent needs to navigate through a maze to reach a goal. The maze is represented as a grid, and the agent can take four actions: move up, move down, move left, or move right. The goal is to find the shortest path to the goal while avoiding obstacles.
  • The deep Q-learning algorithm starts with a deep neural network that takes the current state of the game as input and outputs a Q-value for each possible action. The neural network is trained using experience replay, where the agent stores its experience in a buffer and samples a mini-batch of experiences to update the network.
  • During the training phase, the agent interacts with the environment by taking actions and receiving rewards. The agent selects the action with the highest Q-value according to the current neural network. After each action, the agent stores the experience in the buffer, which includes the current state, action taken, reward received, and next state.
  • Periodically, the agent samples a mini-batch of experiences from the buffer and uses them to update the neural network weights. The update rule is similar to the Q-learning update rule, but instead of updating a Q-table, the weights of the neural network are updated using backpropagation. The loss function is the mean squared error between the predicted Q-values and the target Q-values, which are computed using the Bellman equation:

Q(s, a) = r + gamma * max(Q(s', a'))

where:

  • Q(s, a) is the predicted Q-value for state s and action a
  • r is the reward received for taking action a in state s
  • gamma is the discount factor, which determines the importance of future rewards
  • max(Q(s', a')) is the maximum Q-value over the next state s' and all possible actions a'

So, what are the steps involved in reinforcement learning using deep Q-learning networks (DQNs)?

  • All the past experience is stored by the agent in a replay memory
  • The next action is determined by the maximum output of the Q-network
  • The loss function here is mean squared error of the predicted Q-value and the target Q-value – Q*.
  • This is basically a regression problem.
  • However, we do not know the target or actual value here as we are dealing with a reinforcement learning problem.

Let's sum up the whole deep Q-learning process into its components (a compact PyTorch-style sketch follows this list). The agent: This is the entity that interacts with the game environment. It takes in the current state of the game and outputs an action to take.

  • The game state: This represents the current state of the game environment, which is fed into the agent.
  • The neural network: This is the deep neural network that approximates the Q-values. It takes in the game state as input and outputs the Q-values for each possible action.
  • The target network: This is a copy of the neural network that is used to compute the target Q-values. It is updated periodically to prevent instability during training.
  • The experience replay: This is a buffer that stores the agent's experiences (i.e., the game state, action taken, reward received, and next state) for later use in training.
  • The Q-value update: This is the process of updating the Q-values of the neural network based on the Bellman equation. The update is done using a mini-batch of experiences sampled from the experience replay buffer.
  • The action selection: This is the process of selecting the action to take based on the Q-values output by the neural network. The action with the highest Q-value is selected.
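Here is a compact sketch of one DQN training step using PyTorch. The network sizes, hyper-parameters and the replay-buffer layout of (state, action, reward, next_state, done) tuples are illustrative assumptions, not details taken from the slides.

import random
from collections import deque

import torch
import torch.nn as nn

def build_q_net(obs_dim, n_actions):
    # Small fully connected network that maps a state to one Q-value per action.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = build_q_net(obs_dim, n_actions)          # online network
target_net = build_q_net(obs_dim, n_actions)     # periodically synced copy used for targets
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                    # experience replay buffer of (s, a, r, s', done)

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()
    # Predicted Q(s, a) from the online network.
    q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    # Target r + gamma * max_a' Q_target(s', a'), with no bootstrapping on terminal states.
    with torch.no_grad():
        q_next = target_net(s2).max(dim=1).values
        q_target = r.float() + gamma * q_next * (1 - done.float())
    loss = nn.functional.mse_loss(q_pred, q_target)   # mean squared error loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Action selection would pick the argmax over q_net(state) with some epsilon-greedy exploration, and target_net.load_state_dict(q_net.state_dict()) would be called every few thousand steps to resync the target network.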

Deep Q Learning Recurrent Network

Deep Q-Learning (DQL) can also be implemented using recurrent neural networks (RNNs), which can handle sequential data and provide better performance in tasks that require temporal reasoning.

In DQL with RNNs, the input to the neural network is a sequence of game states, and the output is a sequence of Q-values, one for each state-action pair. The components listed below make up the basic architecture of DQL with RNNs.


In this architecture, we have:

  • The agent: This is the entity that interacts with the game environment. It takes in a sequence of game states as input and outputs a sequence of Q-values, one for each state-action pair.
  • The game state: This represents the current state of the game environment, which is fed into the agent as a sequence of states.
  • The recurrent neural network: This is the deep neural network that approximates the Q-values. It takes in a sequence of game states as input and outputs a sequence of Q-values for each possible action. The RNN maintains a hidden state that captures the temporal dynamics of the game. (A minimal sketch of such a network follows this list.)
  • The target network: This is a copy of the recurrent neural network that is used to compute the target Q-values. It is updated periodically to prevent instability during training.
  • The experience replay: This is a buffer that stores the agent's experiences (i.e., the game state sequence, action taken, reward received, and next state sequence) for later use in training.
  • The Q-value update: This is the process of updating the Q-values of the recurrent neural network based on the Bellman equation. The update is done using a mini-batch of experiences sampled from the experience replay buffer.
  • The action selection: This is the process of selecting the action to take based on the Q-values output by the recurrent neural network. The action with the highest Q-value is selected.
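A minimal sketch of a recurrent Q-network in PyTorch, where an LSTM consumes a sequence of game states and a linear head produces Q-values per timestep; the layer sizes and input shapes are illustrative assumptions.

import torch
import torch.nn as nn

class RecurrentQNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        # The LSTM's hidden state captures the temporal dynamics of the game.
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, state_seq, hidden=None):
        # state_seq: (batch, seq_len, obs_dim) -> Q-values: (batch, seq_len, n_actions)
        out, hidden = self.lstm(state_seq, hidden)
        return self.head(out), hidden

# Example: Q-values for a batch of 2 sequences of 5 observations of size 4.
q_net = RecurrentQNetwork(obs_dim=4, n_actions=3)
q_values, _ = q_net(torch.randn(2, 5, 4))
action = q_values[:, -1].argmax(dim=1)   # greedy action from the most recent timestep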


Pac-Man Game

  • The implementation provides 5 moves: left, right, up, down, and stay.
  • The score starts at 0 and changes at each timestep as follows:
  • The score drops by 1.
  • Eating food increases the score by 10.
  • Eating a ghost increases the score by 200.
  • Eating the last piece of food (winning the game) increases the score by 500.
  • Being eaten by a ghost (losing the game) decreases the score by 500.

Atari Games

Simple Reinforcement Learning for Tic Tac Toe

Tic Tac Toe is a simple game with a small state space, making it an ideal environment for learning. In this game, two players take turns placing either an X or an O on a 3x3 grid. The goal is to place three of the same symbol in a row, column, or diagonal.

Here's a simple reinforcement learning algorithm for Tic Tac Toe:

  1. Define the state space: In this case, the state space consists of the current configuration of the board. There are 3^9 = 19,683 possible board configurations, but many of them are impossible or equivalent to others due to symmetries. Eliminating the illegal configurations (e.g., impossible counts of Xs and Os, or positions where one player has already won) and exploiting symmetry reduces this to a few thousand unique states, which makes it easier to analyze the game and develop strategies for playing it.
  2. Define the action space: The action space consists of all possible moves a player can make given the current state of the board. There are at most 9 possible moves in any given state, but some may be invalid if the corresponding cell is already occupied.
  3. Define the reward function: The reward function is what the agent is trying to maximize. In Tic Tac Toe, the agent wants to win the game, so a reward of +1 is given if the agent wins, -1 if the agent loses, and 0 for any other outcome.

4. Define the learning algorithm: A common learning algorithm used in RL is Q-learning. Q-learning learns the optimal action-value function, which tells the agent the expected total reward for taking a certain action in a certain state. The Q-value for a given state-action pair is updated using the Bellman equation:

Q(s,a) = Q(s,a) + alpha * (reward + gamma * max(Q(s',a')) - Q(s,a))

where s is the current state, a is the action taken, alpha is the learning rate, gamma is the discount factor, s' is the next state, and a' is the best action to take in the next state according to the current Q-values.

  5. Train the agent: The agent plays many games against a random or semi-random opponent, updating its Q-values as it goes. During training, the agent may explore new actions (called exploration) or choose the best action based on its current knowledge (called exploitation). The balance between exploration and exploitation is controlled by an exploration rate that is gradually decreased over time.
  6. Test the agent: After training, the agent's performance is evaluated against a new set of opponents. If the agent can consistently win against these opponents, it is considered to have learned the game.

  • Exploitation means choosing the action which maximizes our reward (may lead to being stuck in a local optimum).
  • Exploration means choosing an action regardless of the reward it provides (this helps us discover other local optima, which may lead us closer to the global optimum).
  • Going all out in either one of them is harmful, all exploitation may lead to a suboptimal agent, and all exploration would just give us a stupid agent which keeps taking random actions.

A widely used strategy to tackle this problem is the epsilon-decreasing strategy. It works as follows:

  1. Initialize a variable 'epsilon' with a value between 0 and 1 (usually around 0.3).
  2. With probability = epsilon we explore, and with probability = 1 - epsilon we exploit.
  3. Decrease the value of epsilon over time until it becomes zero.

  • Temporal Difference Learning - The method of reinforcement learning used in this implementation is called temporal difference learning. It works on the principle that each state has a value attached to it. Say, if a state leads to the AI winning, it shall have a positive value(value = 1). If AI loses in some state, it shall have a negative value(value = -1). All the rest of the states would have a neutral value(value = 0). These are the initialized state values.

90

91 of 93

Simple Reinforcement Learning for Tic Tac Toe

  • Once a game is started, our agent computes all possible actions it can take in the current state and the new states which would result from each action.
  • The values of these states are collected from a state_value vector, which contains values for all possible states in the game.
  • The agent can then choose the action which leads to the state with the highest value (exploitation), or choose a random action (exploration), depending on the value of epsilon.
  • Over the course of training, we play several games; after each move, the value of the state is updated using the following rule:
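The temporal-difference update rule referred to here, in the notation defined just below, is:

V(s) = V(s) + alpha * ( V(s^f) - V(s) )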

where V(s) is the value of the current state of the game board, V(s^f) is the value of the new state of the board after the agent takes its action, and alpha is the learning rate / step-size parameter.

  • Using this update rule, the states that lead to a loss get a negative state value as well (whose magnitude depends on the learning rate). The agent learns that being in such a state may lead to a loss down the line, so it tries to avoid landing in this state unless necessary.
  • On the other hand, the states that lead to a win get a positive state value. The agent learns that being in such a state may lead to a win down the line, so it is encouraged to be in this state.
  • After implementing these basic functions and playing against the learning agent several times, you will start to see that the agent tries to counter your moves, blocks you, and sometimes attacks when you make a mistake or a wrong move.
  • Game-theoretic logic always assumes that both players play optimally, so it cannot exploit the fact that the other player has made a wrong move. The reward signal, which in the case of the Tic-Tac-Toe game is winning the game, helps the agent make decisions and choose paths while interacting with the environment and improving.
