1 of 26

Deep Q-Learning for Stock Trading

2 of 26

Table of Contents

1. Markov Decision Process in Trading

2. Understanding Q-Learning and Bellman Equation

3. Action-Selection Strategies in Q-Learning

4. Deep Q-Learning and Experience Buffer

3 of 26

Markov Decision Process in Trading

Section 1

4 of 26

Overview of Markov Decision Process

Definition of MDP

Markov Property

Applications in Trading

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making situations where outcomes are partly random and partly under the control of a decision-maker (agent), characterized by states, actions, and rewards.

The Markov property asserts that the future state of a process depends only on the current state and action taken, not on the sequence of events that preceded it, ensuring memoryless transitions in decision-making.

In stock trading, MDPs facilitate the modeling of market dynamics by defining states as market conditions, actions as trading decisions, and rewards as financial returns, enabling optimal strategy development through reinforcement learning.

5 of 26


MDP Components Relevant to Stock Trading

States in Trading

Actions Available

Reward Structure

In stock trading, states represent various market conditions, including price levels, volume, and technical indicators. These states provide the necessary context for the agent to make informed trading decisions based on current market dynamics.

The actions in stock trading typically include buying, selling, or holding a stock. Each action impacts the portfolio's performance and is chosen based on the agent's strategy to maximize expected rewards over time.

Rewards in this context are defined by the profit or loss resulting from the actions taken. A well-designed reward function encourages the agent to prioritize strategies that yield higher returns while managing risks effectively.
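To make these three components concrete, here is a minimal Python sketch (not from the original deck; the state features, the three-action set, and the profit-and-loss reward are illustrative assumptions rather than a prescribed design):

from dataclasses import dataclass

# Illustrative state: a snapshot of the market conditions the agent observes.
@dataclass
class MarketState:
    price: float     # current close price
    volume: float    # traded volume
    sma_10: float    # example technical indicator: 10-period simple moving average

# Illustrative action set: do nothing, buy one share, sell one share.
ACTIONS = ("HOLD", "BUY", "SELL")

def reward(prev_portfolio_value: float, new_portfolio_value: float) -> float:
    # Reward is the change in portfolio value (profit or loss) caused by the action taken.
    return new_portfolio_value - prev_portfolio_value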

6 of 26

Definition of Reinforcement Learning

An ML paradigm where an agent learns to make decisions by interacting with an environment, receiving feedback in the form of rewards or penalties. This process involves trial and error, allowing the agent to optimize its actions over time to maximize cumulative rewards, distinguishing it from supervised and unsupervised learning methods that rely on labeled data or clustering.

7 of 26

8 of 26

Comparison with Other Learning Paradigms

Pros

- Handles complex environments
- Learns from delayed rewards
- Adapts to dynamic markets
- Balances exploration and exploitation
- Scalable with deep learning
- Provides robust decision-making

Cons

- Requires extensive data
- High computational cost
- Difficult to tune hyperparameters
- Risk of overfitting
- Limited interpretability
- Sensitive to market regime changes (non-stationarity)

9 of 26

Understanding Q-Learning and Bellman Equation

Section 2

10 of 26

Introduction to Q-Learning Algorithm

Fundamental Concept

Q-Learning is a model-free reinforcement learning algorithm that enables an agent to learn the value of actions in a given state by utilizing a Q-table.

The Q-table stores a Q-value for each state-action pair (i.e. the expected cumulative reward of taking a specific action in a specific state) and iteratively updates these values based on the Bellman equation and observed rewards.

Definition of Action-Value Function

The action-value function, denoted as Q(s,a), quantifies the expected return of taking action 'a' in state 's', providing a critical framework for evaluating trading decisions based on anticipated future rewards.
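A minimal tabular sketch of this idea (illustrative only; the HOLD/BUY/SELL action set is an assumption carried over from earlier):

from collections import defaultdict

# Q-table: maps (state, action) pairs to estimated returns; unseen pairs default to 0.0.
Q = defaultdict(float)

def best_action(state, actions=("HOLD", "BUY", "SELL")):
    # Greedy choice: the action with the highest current Q-value in this state.
    return max(actions, key=lambda a: Q[(state, a)])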

11 of 26

Introduction to Q-Learning Algorithm

Bellman Equation

Importance in Trading Strategies

By continuously updating Q-values through interactions with the trading environment, the action-value function enables traders to refine their strategies, ensuring that decisions are informed by both immediate outcomes and long-term market trends.
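Concretely, the tabular Q-learning update derived from the Bellman equation, with learning rate \alpha and discount factor \gamma, is

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]

where r_{t+1} is the observed reward and s_{t+1} is the next state.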

12 of 26

Discount Factor Gamma (γ)

Definition

Impact on Learning

Role in Convergence

The discount factor, denoted as γ (gamma), is a crucial parameter in reinforcement learning that determines the present value of future rewards, influencing the agent's decision-making process by balancing immediate versus long-term gains.

A higher discount factor encourages the agent to prioritize long-term rewards, fostering strategies that consider future market conditions, while a lower factor may lead to more short-sighted decisions, potentially compromising overall trading performance.

The discount factor plays a significant role in ensuring the convergence of value functions in reinforcement learning algorithms, affecting the stability and efficiency of learning processes, particularly in complex environments like stock trading.
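Concretely, the agent maximizes the discounted return

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

As a small worked example with illustrative rewards of 1 at each of the next three steps: the discounted sum 1 + \gamma + \gamma^2 equals 2.71 for \gamma = 0.9 but only 1.11 for \gamma = 0.1, so a high \gamma makes distant profits weigh almost as much as immediate ones.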

13 of 26

Bellman Equation: State Transition Probabilities

The Bellman equation has a version that computes Q-values from the probabilities of state transitions, but it requires the agent to know those probabilities (i.e. to have a valid model of the environment), which does not apply to our environment: the stock market.
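For reference, that model-based form of the Bellman optimality equation (standard notation; P is the transition model and R the reward model, both of which would have to be known) is

Q(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \right]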

14 of 26

Action-Selection Strategies in Q-Learning

Section 3

15 of 26

Introduction to Exploration vs. Exploitation

Exploration

Exploitation

Exploration in reinforcement learning refers to trying actions whose outcomes are still uncertain in order to discover new states and better estimates of action values, which is essential for improving decision-making and maximizing long-term rewards.

Exploitation in reinforcement learning refers to taking the actions that the agent already knows to have high Q-values in order to maximize reward.

16 of 26

Exploration vs. Exploitation:

The Epsilon-Greedy Strategy

When an agent begins learning, we want it to take random actions so that it explores more paths. As the agent improves and the Q-function converges to more consistent Q-values, we want the agent to exploit the paths with the highest Q-values, i.e. take greedy actions and collect the best rewards.

The agent takes a random action with probability ε and the greedy action with probability (1 − ε), where ε decreases over iterations.

The epsilon decay schedule can be linear or exponential.

Note that in an actual implementation, the decay is usually driven by a simple step or episode counter that increases as update iterations/episodes occur, as in the sketch below.
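A minimal sketch of ε-greedy selection with exponential decay (Python; the action set, decay rate, and epsilon bounds are illustrative assumptions):

import math
import random

ACTIONS = ("HOLD", "BUY", "SELL")   # illustrative action set

def epsilon_greedy(Q, state, step, eps_start=1.0, eps_end=0.05, decay=1e-4):
    # Epsilon shrinks from eps_start toward eps_end as the step counter grows.
    eps = eps_end + (eps_start - eps_end) * math.exp(-decay * step)
    if random.random() < eps:
        return random.choice(ACTIONS)                    # explore: random action
    return max(ACTIONS, key=lambda a: Q[(state, a)])     # exploit: greedy action

Here Q is any dict-like mapping from (state, action) to a value, such as the Q-table sketched earlier.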

17 of 26


Other Action Selection Strategies

Boltzmann Exploration

Upper Confidence Bound (UCB)

Thompson Sampling

Utilizes a probabilistic approach to action selection based on Q-values and a temperature parameter, allowing for a smooth balance between exploration and exploitation, adapting dynamically to the learning process.

Balances the estimated value of actions with an exploration term that favors less-frequented actions, systematically encouraging exploration while enhancing learning efficiency through uncertainty management.

Employs Bayesian inference to sample from the posterior distribution of Q-values, promoting exploration by favoring actions with high uncertainty, thus effectively balancing exploration and exploitation in decision-making.
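As a concrete illustration of the first of these, a minimal sketch of Boltzmann (softmax) action selection over dict-like Q-values (the temperature value and action set are illustrative assumptions):

import math
import random

def boltzmann_action(Q, state, actions=("HOLD", "BUY", "SELL"), tau=0.5):
    # Softmax over Q-values: lower tau approaches greedy selection, higher tau is more random.
    prefs = [math.exp(Q[(state, a)] / tau) for a in actions]
    total = sum(prefs)
    return random.choices(actions, weights=[p / total for p in prefs], k=1)[0]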

18 of 26


Other Action Selection Strategies

Entropy-Based Exploration

Exploration with Noise: Noisy Networks

Entropy-based exploration enhances the diversity of action selection by promoting randomness in the policy, which is particularly beneficial in complex environments where long-term exploration is necessary to discover optimal strategies, thereby preventing premature convergence to suboptimal policies.

Noisy networks introduce parametric noise directly into the weights of the Q-network during action selection, facilitating a natural exploration process that evolves from exploration to exploitation as training progresses, enhancing the agent's ability to adapt to complex environments without the need for explicit exploration parameters.
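For intuition, a rough sketch of the factorized-Gaussian noisy linear layer behind Noisy Networks (assuming PyTorch; the initialization constants follow common convention and are assumptions here, not the deck's implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    # Linear layer whose weights and biases carry learnable Gaussian noise.
    def __init__(self, in_features, out_features, sigma0=0.5):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        bound = in_features ** -0.5
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma0 * bound))
        self.bias_mu = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma0 * bound))

    @staticmethod
    def _noise(size):
        x = torch.randn(size)
        return x.sign() * x.abs().sqrt()    # factorized-noise transform f(x) = sgn(x) * sqrt(|x|)

    def forward(self, x):
        eps_in, eps_out = self._noise(self.in_features), self._noise(self.out_features)
        weight = self.weight_mu + self.weight_sigma * torch.outer(eps_out, eps_in)
        bias = self.bias_mu + self.bias_sigma * eps_out
        return F.linear(x, weight, bias)

Replacing the ordinary linear layers of a Q-network with such layers makes exploration a by-product of the learned noise scales, so no explicit ε schedule is needed.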

19 of 26

Deep Q-Learning and Experience Buffer

Section 4

20 of 26

Transitioning to Deep Q-Learning

Becoming Neural Networks

Transitioning from Q-tables to Deep Q-Learning involves using a neural network to approximate the Q-value function, which allows the agent to handle much larger state spaces and more complex environments, such as the chaotic stock market.

Training Process and Techniques

The training process in Deep Q-Learning incorporates techniques such as experience replay and target networks to stabilize learning and address practical issues that arise when training neural networks, such as correlated samples and constantly moving targets.
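A minimal sketch of such a function approximator (assuming PyTorch; the layer sizes are illustrative): the network maps a state-feature vector to one Q-value per action, replacing the Q-table.

import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Maps a state-feature vector to one Q-value per available action.
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)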

21 of 26

The Experience Replay Buffer

Experience replay allows the agent to store past experiences and sample them randomly during training, which breaks the correlation between consecutive experiences. It is usually implemented as a circular buffer that retains past experiences in the form of (state, action, reward, next_state) tuples.
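A minimal sketch of such a buffer using Python's deque (the capacity is an illustrative assumption):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        # A deque with maxlen acts as circular storage: the oldest experiences are dropped first.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # Uniform random sampling breaks the correlation between consecutive experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)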

22 of 26

Why Experience Replay?

Neural networks learn most effectively from (approximately) independent and identically distributed (i.i.d.) samples. Consecutive experiences are highly correlated because the agent moves through sequential states in the environment. If we update the Q-network with consecutive experiences (without experience replay), the network can overfit to recent transitions, leading to unstable learning.

23 of 26

Target Networks

In DQN, a secondary network is used to provide more stable Q-value targets by copying the main network’s weights periodically. This prevents large, unstable updates in the Q-values by stabilizing the learning process, reducing feedback loops and improving convergence during training.
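A minimal sketch (assuming PyTorch and the illustrative QNetwork sketched earlier) of how the secondary network is kept in sync and used to compute stable targets; the sizes and sync interval are assumptions:

import copy
import torch
import torch.nn.functional as F

# Online and target networks (QNetwork is the illustrative class sketched earlier).
q_net = QNetwork(state_dim=8, n_actions=3)
target_net = copy.deepcopy(q_net)
SYNC_EVERY = 1000                       # copy weights every 1000 updates (an assumption)

def dqn_loss(batch, gamma=0.99):
    states, actions, rewards, next_states = batch        # batched tensors from the replay buffer
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                 # targets come from the frozen copy
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values
    return F.mse_loss(q_pred, q_target)

def maybe_sync(step):
    if step % SYNC_EVERY == 0:                            # periodic hard update of the copy
        target_net.load_state_dict(q_net.state_dict())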

24 of 26

https://tinyurl.com/qlearnbot

25 of 26

Thank You

26 of 26

Additional Sources