
Deep Reinforcement Learning for Optimal Order Placement in a Limit Order Book

Ilija Ilievski, PhD candidate, Learning & Vision Lab

NGS, National University of Singapore

Deep Reinforcement Learning

Deep Reinforcement Learning for Quant Finance?

  • Complex state properties
  • Almost infinite number of states
  • No clearly defined actions or rewards


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement



(Deep) Reinforcement Learning Essentials

  • Policy π is a function for selecting actions given states:

a = π(s)

  • Value function Qπ(s, a) is the expected total discounted reward when taking action a in state s and then following policy π

Qπ(s, a) = E[rt+1 + γrt+2 + γ²rt+3 + ... | s, a]
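
For intuition, a minimal Python sketch (illustrative only, not from the talk) of computing such a discounted return from a reward sequence:

    # Discounted return: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
    def discounted_return(rewards, gamma=0.99):
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99**2 * 2 = 2.9602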


Bellman Equation

  • The Bellman expectation equation rewrites the value function recursively as a one-step lookahead:

Qπ(s, a) = E[rt+1 + γrt+2 + γ²rt+3 + ... | s, a]
         = Es',a'[r + γ Qπ(s', a') | s, a]

  • Bellman optimality equation

Q*(s, a) = E[r + γ max_a' Q*(s', a') | s, a]
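
To make the recursion concrete, a minimal tabular sketch (illustrative; the state/action sizes and learning rate are arbitrary) of a one-step backup toward the Bellman optimality target:

    # One-step Q-learning backup toward r + γ max_a' Q(s', a').
    import numpy as np

    n_states, n_actions = 4, 2
    Q = np.zeros((n_states, n_actions))
    gamma, alpha = 0.99, 0.1

    def bellman_backup(s, a, r, s_next):
        target = r + gamma * Q[s_next].max()   # Bellman optimality target
        Q[s, a] += alpha * (target - Q[s, a])  # move the estimate toward it

    bellman_backup(s=0, a=1, r=1.0, s_next=2)
    print(Q[0, 1])  # 0.1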


Deep Q-network

  • Represent the Q-value function with a deep Q-network with parameters θ:

Q(s, a; θ) ≈ Qπ(s, a)

  • Train the network with stochastic gradient descent, with the loss defined as the MSE between target and predicted Q-values:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θ) - Q(s, a; θ))²]
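
A minimal sketch of this loss in PyTorch (an assumption; the talk's code is in (lua)Torch, and all sizes here are placeholders):

    # Toy DQN loss: MSE between the TD target and the predicted Q-value.
    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(8, 128), nn.Tanh(), nn.Linear(128, 4))

    s = torch.randn(32, 8)                 # batch of states
    a = torch.randint(0, 4, (32,))         # actions taken
    r = torch.randn(32)                    # rewards
    s_next = torch.randn(32, 8)            # next states
    gamma = 0.99

    with torch.no_grad():                  # the target is treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()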


Deep Q-network Problems

  • Sequential data → samples are correlated and non-i.i.d.
  • The policy changes frequently, which makes the loss function unstable:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θ) - Q(s, a; θ))²]

  • The scale of rewards and Q-values can cause unstable or exploding gradients


Deep Q-network Solutions

  • Use an experience replay buffer and sample transitions from it
  • Freeze the target Q-network and update it only periodically:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θtarget) - Q(s, a; θ))²]

  • Clipping rewards and/or normalizing the network values gives robust gradients
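
A compact sketch of the first two fixes (replay buffer plus frozen target network), PyTorch assumed and all names illustrative:

    # Experience replay + frozen target network (illustrative sketch).
    import copy
    import random
    from collections import deque
    import torch.nn as nn

    buffer = deque(maxlen=100_000)          # stores (s, a, r, s_next, done)

    def sample_batch(batch_size=32):
        # Uniform sampling breaks the temporal correlation of sequential data.
        return random.sample(buffer, batch_size)

    q_net = nn.Linear(8, 4)                 # stand-in for the online Q-network
    target_net = copy.deepcopy(q_net)       # frozen copy used in the TD target

    def sync_target():
        # Refresh the frozen target network every few thousand updates.
        target_net.load_state_dict(q_net.state_dict())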


Q-learning Algorithm
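
The original slide shows the algorithm as a figure; as a stand-in, here is a hedged Python-style outline of the standard deep Q-learning loop (all interfaces are illustrative):

    # Outline of the deep Q-learning loop: epsilon-greedy exploration,
    # experience replay, and a periodically synced target network.
    import random

    def deep_q_learning(env, q_net, target_net, buffer, update_step,
                        episodes=1000, epsilon=0.1, sync_every=1000):
        step = 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                if random.random() < epsilon:
                    a = env.sample_action()
                else:
                    a = q_net.best_action(s)
                s_next, r, done = env.step(a)
                buffer.append((s, a, r, s_next, done))
                update_step(q_net, target_net, buffer)   # one SGD step on a minibatch
                if step % sync_every == 0:
                    target_net.load_state_dict(q_net.state_dict())
                s, step = s_next, step + 1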


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement


Order Placement in a Limit Order Book

  • A limit order is an order to trade a certain amount of an asset at a specified price
  • Limit orders are collected and posted in a limit order book (LOB)
  • A limit order stays in the LOB until it is executed against a market order or until it is canceled
  • A market order is an order to trade a certain amount of an asset immediately at the best available price in the LOB
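
As a toy illustration (made-up numbers, not from the slides), a small LOB snapshot with its spread and mid-price:

    # Toy level-2 LOB snapshot: (price, resting size), best levels first.
    bids = [(99.98, 500), (99.97, 1200)]
    asks = [(100.00, 300), (100.01, 800)]

    best_bid, best_ask = bids[0][0], asks[0][0]
    spread = best_ask - best_bid      # the cost of crossing immediately
    mid = (best_bid + best_ask) / 2
    print(round(spread, 2), mid)      # 0.02 99.99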


Order Placement Problem

  • One needs to buy/sell N orders within T seconds
  • Limit orders that are not executed by time T are converted to market orders
  • One needs to split the N orders into N1,t, N2,t, ... limit orders at times t = 0, 1, ..., T

  • The core trade-off in optimal order placement:

spread cost vs. order execution probability vs. market impact
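
For intuition, the simplest possible schedule (a TWAP-style even split, purely a baseline sketch) just spreads N evenly over the T steps; the RL agent instead learns when, at which level, and in which sizes to place the slices:

    # Naive baseline: split N evenly across T time steps.
    def even_split(N, T):
        base, rem = divmod(N, T)
        return [base + (1 if t < rem else 0) for t in range(T)]

    print(even_split(N=10, T=4))  # [3, 3, 2, 2]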


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement


Deep RL for Optimal Order Placement

  • Objective:
    • Minimise Volume Weighted Average Price (VWAP)
  • States:
    • Limit order book prices (with rolling window)
    • Progress (in percentage)
    • Time until T
  • Action related states:
    • Order execution probability (estimated via neural network)
    • Size of order (in percentage of N)
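
A rough sketch of how such a state vector could be assembled (the exact layout and the 5-dim level encoding are assumptions; only the dimensions come from the slides):

    # Illustrative state construction for the Q-network input.
    import numpy as np

    def build_state(lob_repr, filled, N, elapsed, T, exec_prob, order_frac, level):
        progress = filled / N                 # fraction of the target already done
        time_left = (T - elapsed) / T         # normalised time remaining until T
        level_onehot = np.eye(5)[level - 1]   # order position at LOB levels 1..5
        return np.concatenate([lob_repr,      # 512-dim LOB state representation
                               [progress, time_left, exec_prob, order_frac],
                               level_onehot])

    s = build_state(np.zeros(512), filled=3, N=10, elapsed=5.0, T=60.0,
                    exec_prob=0.7, order_frac=0.2, level=2)
    print(s.shape)  # (521,) — matches the Q-network input size on a later slide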


Deep RL for Optimal Order Placement

  • Rewards:
    • Price difference (after each order completion)
    • Order duration (the shorter the better)
    • Completion penalty (based on how many orders are left after time T)
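
One way the three terms could combine into a scalar reward; the weights and signs below are assumptions, not from the talk:

    # Illustrative reward shaping; w1-w3 are made-up weights.
    def reward(price_diff, duration, remaining_after_T,
               w1=1.0, w2=0.01, w3=1.0):
        r = w1 * price_diff           # price improvement after order completion
        r -= w2 * duration            # shorter order durations are better
        r -= w3 * remaining_after_T   # penalty for orders left after time T
        return r

    print(round(reward(price_diff=0.02, duration=3.0, remaining_after_T=0), 4))  # -0.01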


Deep RL for Optimal Order Placement

  • Actions:
    • Place a limit order at LOB level k, k = 1, ..., 5, with size as a percentage of N in increments of 10 (50 actions)
    • Empty action (action 51)
    • Cancel last order


[Figure: Level-2 Limit Order Book]
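
One possible encoding of the 51 placement/empty actions (this layout is an assumption about how the action index could be decoded; cancellation is handled separately):

    # Illustrative decoding of the discrete action index into (level, size).
    def decode_action(a):
        if a == 50:
            return ("empty", None, None)     # the 51st action: do nothing
        level, size_idx = divmod(a, 10)      # 5 levels x 10 size buckets
        size_pct = (size_idx + 1) * 10       # 10%, 20%, ..., 100% of N
        return ("place", level + 1, size_pct)

    print(decode_action(0))    # ('place', 1, 10)
    print(decode_action(37))   # ('place', 4, 80)
    print(decode_action(50))   # ('empty', None, None)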


Network Architectures: Execution Probability

  • A two-hidden-layer MLP outputs the execution probability for order Oi at time t
  • Inputs:
    • Sell or buy (1 dim)
    • Probability estimation based on data up to time t (1 dim)
    • Ticker symbol (variable, in our case 7 dim)
    • LOB level (5 dim)
    • LOB state (with rolling window) (512 dim)
  • Network structure 526->1024->256->1 with tanh activation function
  • Optimised via SGD with momentum
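
A hedged PyTorch sketch of this network (the original code is in (lua)Torch; the sigmoid on the output is my assumption for producing a probability):

    # Execution-probability MLP: 526 -> 1024 -> 256 -> 1 with tanh activations.
    import torch
    import torch.nn as nn

    exec_prob_net = nn.Sequential(
        nn.Linear(526, 1024), nn.Tanh(),
        nn.Linear(1024, 256), nn.Tanh(),
        nn.Linear(256, 1), nn.Sigmoid(),   # assumption: squash output to [0, 1]
    )
    optimizer = torch.optim.SGD(exec_prob_net.parameters(), lr=1e-3, momentum=0.9)

    x = torch.randn(32, 526)   # side + prob. estimate + ticker + level + LOB state
    p = exec_prob_net(x)       # execution probabilities, shape (32, 1)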


Network Architectures: LOB State Representation

  • LSTM network with 512-dimensional hidden state
  • Inputs:
    • Vector of bid and ask prices at each level (10 dim)
    • Size of orders at each level (10 dim)
    • Ticker symbol (variable, in our case 7 dim)
  • Trained with a rolling window of 20 time steps
  • Last hidden state is the LOB state representation
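
A rough PyTorch sketch of this encoder (the 27-dim input layout and the batching are assumptions based on the listed inputs):

    # LOB encoder: LSTM over a rolling window of 20 LOB snapshots.
    import torch
    import torch.nn as nn

    lob_lstm = nn.LSTM(input_size=27, hidden_size=512, batch_first=True)

    # 27 = 10 bid/ask prices + 10 level sizes + 7-dim ticker encoding.
    window = torch.randn(32, 20, 27)    # (batch, rolling window, features)
    outputs, (h_n, c_n) = lob_lstm(window)
    lob_state = h_n[-1]                 # (32, 512): last hidden state = LOB state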


Network Architectures: Q(s, a) network

  • Two-hidden-layer MLP, 521 -> 1024 -> 256 -> 51, with tanh activation functions
  • State related inputs:
    • LOB state (with rolling window) (512 dim)
    • Progress (1 dim)
    • Time until T (1 dim)
  • Action related inputs:
    • Order execution probability (1 dim)
    • Size of order (1 dim)
    • Order position in LOB (5 dim)
  • Optimised via SGD with momentum
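
A PyTorch-style sketch with the stated dimensions (assembling the 521-dim input as on the earlier state slide is an assumption):

    # Q-network: 521-dim state/action features -> 51 Q-values.
    import torch
    import torch.nn as nn

    q_net = nn.Sequential(
        nn.Linear(521, 1024), nn.Tanh(),
        nn.Linear(1024, 256), nn.Tanh(),
        nn.Linear(256, 51),              # one Q-value per discrete action
    )
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3, momentum=0.9)

    state = torch.randn(32, 521)         # LOB state + progress + time + action features
    q_values = q_net(state)              # (32, 51)
    greedy_action = q_values.argmax(dim=1)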


Training Details

  • Code in (lua)Torch, forked from the DeepMind Atari Deep Q Learner (https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner)
  • Training done on a server with:
    • 12 CPU cores, 64 GB RAM, and 3× NVIDIA Titan X GPUs (12 GB each)
  • One training run to convergence takes approximately one week
  • Hyperparameters are optimized with the HORD hyperparameter optimization algorithm (bit.ly/hord-aaai)


Preliminary Results

  • Learns to prefer placing orders at the second LOB level, with equal sizes spread across T
  • Learns price momentum: places orders more aggressively when the price is trending, a consequence of the order execution probability input


Deep Reinforcement Learning for Quant Finance?

  • Complex state properties → learn the state representation with deep neural networks
  • Infinite number of states → implement the Q-value function as a deep neural network
  • No clearly defined actions or rewards → do more experiments... (I guess)


Unresolved Issues (Future Work)

  • Training with historical market orders and passive participants
    • Multi-agent systems are needed to model adversarial participants
  • Market impact as part of the rewards
    • Model the market impact of an order using historical data
  • Better-defined rewards to reduce training time
  • Better-defined actions to improve performance


Thank you

Questions?

ilija139.github.io