
Deep Reinforcement Learning for Optimal Order Placement in a Limit Order Book

Ilija Ilievski, PhD candidate, Learning & Vision Lab

NGS, National University of Singapore

Deep Reinforcement Learning

Deep Reinforcement Learning for Quant Finance?

  • Complex state properties
  • Almost infinite number of states
  • No clearly defined actions or rewards


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement



(Deep) Reinforcement Learning Essentials

  • Policy π is a function for selecting actions given states:

a = π(s)

  • Value function Qπ(s, a) is the expected total discounted reward when taking action a in state s and then following policy π

Qπ(s, a) = E[rt+1 + γrt+2 + γ²rt+3 + ... | s, a]
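
For intuition, a minimal Python sketch (illustrative only, not from the talk) of computing such a discounted return from a reward sequence:

    # Discounted return: r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
    def discounted_return(rewards, gamma=0.99):
        total = 0.0
        for k, r in enumerate(rewards):
            total += (gamma ** k) * r
        return total

    print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99**2 * 2 = 2.9602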


Bellman Equation

  • The Bellman expectation equation rewrites the value function recursively as a one-step lookahead:

Qπ(s, a) = E[rt+1 + γrt+2 + γ²rt+3 + ... | s, a]
         = Es',a'[r + γ Qπ(s', a') | s, a]

  • Bellman optimality equation

Q*(s, a) = E[r + γ max_a' Q*(s', a') | s, a]
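
To make the recursion concrete, a minimal tabular sketch (illustrative; the state/action sizes and learning rate are arbitrary) of a one-step backup toward the Bellman optimality target:

    # One-step Q-learning backup toward r + γ max_a' Q(s', a').
    import numpy as np

    n_states, n_actions = 4, 2
    Q = np.zeros((n_states, n_actions))
    gamma, alpha = 0.99, 0.1

    def bellman_backup(s, a, r, s_next):
        target = r + gamma * Q[s_next].max()   # Bellman optimality target
        Q[s, a] += alpha * (target - Q[s, a])  # move the estimate toward it

    bellman_backup(s=0, a=1, r=1.0, s_next=2)
    print(Q[0, 1])  # 0.1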


Deep Q-network

  • Represent the Q-value function with a deep Q-network with parameters θ:

Q(s, a; θ) ≈ Qπ(s, a)

  • Train the network with stochastic gradient descent, with the loss defined as the MSE between target and predicted Q-values:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θ) - Q(s, a; θ))²]
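
A minimal sketch of this loss in PyTorch (an assumption; the talk's code is in (lua)Torch, and all sizes here are placeholders):

    # Toy DQN loss: MSE between the TD target and the predicted Q-value.
    import torch
    import torch.nn as nn

    q_net = nn.Sequential(nn.Linear(8, 128), nn.Tanh(), nn.Linear(128, 4))

    s = torch.randn(32, 8)                 # batch of states
    a = torch.randint(0, 4, (32,))         # actions taken
    r = torch.randn(32)                    # rewards
    s_next = torch.randn(32, 8)            # next states
    gamma = 0.99

    with torch.no_grad():                  # the target is treated as a constant
        target = r + gamma * q_net(s_next).max(dim=1).values
    pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(pred, target)
    loss.backward()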


Deep Q-network Problems

  • Sequential data → samples are correlated and non-i.i.d.
  • The policy changes frequently, which makes the loss function unstable:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θ) - Q(s, a; θ))²]

  • The scale of rewards and Q-values can cause unstable or exploding gradients


Deep Q-network Solutions

  • Use an experience replay buffer and sample transitions from it
  • Freeze the target Q-network and update it only periodically:

Loss(θ) = E[(r + γ max_a' Q(s', a'; θtarget) - Q(s, a; θ))²]

  • Clipping rewards and/or normalizing the network values gives robust gradients
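
A compact sketch of the first two fixes (replay buffer plus frozen target network), PyTorch assumed and all names illustrative:

    # Experience replay + frozen target network (illustrative sketch).
    import copy
    import random
    from collections import deque
    import torch.nn as nn

    buffer = deque(maxlen=100_000)          # stores (s, a, r, s_next, done)

    def sample_batch(batch_size=32):
        # Uniform sampling breaks the temporal correlation of sequential data.
        return random.sample(buffer, batch_size)

    q_net = nn.Linear(8, 4)                 # stand-in for the online Q-network
    target_net = copy.deepcopy(q_net)       # frozen copy used in the TD target

    def sync_target():
        # Refresh the frozen target network every few thousand updates.
        target_net.load_state_dict(q_net.state_dict())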


Q-learning Algorithm
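
The original slide shows the algorithm as a figure; as a stand-in, here is a hedged Python-style outline of the standard deep Q-learning loop (all interfaces are illustrative):

    # Outline of the deep Q-learning loop: epsilon-greedy exploration,
    # experience replay, and a periodically synced target network.
    import random

    def deep_q_learning(env, q_net, target_net, buffer, update_step,
                        episodes=1000, epsilon=0.1, sync_every=1000):
        step = 0
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # Epsilon-greedy action selection.
                if random.random() < epsilon:
                    a = env.sample_action()
                else:
                    a = q_net.best_action(s)
                s_next, r, done = env.step(a)
                buffer.append((s, a, r, s_next, done))
                update_step(q_net, target_net, buffer)   # one SGD step on a minibatch
                if step % sync_every == 0:
                    target_net.load_state_dict(q_net.state_dict())
                s, step = s_next, step + 1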


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement


Order Placement in a Limit Order Book

  • A limit order is an order to trade a certain amount of an asset at a specified price
  • Limit orders are collected and posted in a limit order book (LOB)
  • A limit order stays in the LOB until it is executed against a market order or until it is canceled
  • A market order is an order to trade a certain amount of an asset immediately at the best available price in the LOB
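
As a toy illustration (made-up numbers, not from the slides), a small LOB snapshot with its spread and mid-price:

    # Toy level-2 LOB snapshot: (price, resting size), best levels first.
    bids = [(99.98, 500), (99.97, 1200)]
    asks = [(100.00, 300), (100.01, 800)]

    best_bid, best_ask = bids[0][0], asks[0][0]
    spread = best_ask - best_bid      # the cost of crossing immediately
    mid = (best_bid + best_ask) / 2
    print(round(spread, 2), mid)      # 0.02 99.99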


Order Placement Problem

  • One needs to buy/sell N orders within T seconds
  • Limit orders that are not executed by time T are converted to market orders
  • One needs to split the N orders into N1,t, N2,t, ... limit orders at times t = 0, 1, ..., T

  • The core trade-off in optimal order placement:

spread cost vs. order execution probability vs. market impact
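
For intuition, the simplest possible schedule (a TWAP-style even split, purely a baseline sketch) just spreads N evenly over the T steps; the RL agent instead learns when, at which level, and in which sizes to place the slices:

    # Naive baseline: split N evenly across T time steps.
    def even_split(N, T):
        base, rem = divmod(N, T)
        return [base + (1 if t < rem else 0) for t in range(T)]

    print(even_split(N=10, T=4))  # [3, 3, 2, 2]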


  • Deep Reinforcement Learning
  • Deep Q-Learning
  • Limit Order Book
  • Deep RL for optimal order placement


Deep RL for Optimal Order Placement

  • Objective:
    • Minimise Volume Weighted Average Price (VWAP)
  • States:
    • Limit order book prices (with rolling window)
    • Progress (in percentage)
    • Time until T
  • Action related states:
    • Order execution probability (estimated via neural network)
    • Size of order (in percentage of N)
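
A rough sketch of how such a state vector could be assembled (the exact layout and the 5-dim level encoding are assumptions; only the dimensions come from the slides):

    # Illustrative state construction for the Q-network input.
    import numpy as np

    def build_state(lob_repr, filled, N, elapsed, T, exec_prob, order_frac, level):
        progress = filled / N                 # fraction of the target already done
        time_left = (T - elapsed) / T         # normalised time remaining until T
        level_onehot = np.eye(5)[level - 1]   # order position at LOB levels 1..5
        return np.concatenate([lob_repr,      # 512-dim LOB state representation
                               [progress, time_left, exec_prob, order_frac],
                               level_onehot])

    s = build_state(np.zeros(512), filled=3, N=10, elapsed=5.0, T=60.0,
                    exec_prob=0.7, order_frac=0.2, level=2)
    print(s.shape)  # (521,) — matches the Q-network input size on a later slide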


Deep RL for Optimal Order Placement

  • Rewards:
    • Price difference (after each order completion)
    • Order duration (the shorter the better)
    • Completion penalty (based on how many orders are left after time T)
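
One way the three terms could combine into a scalar reward; the weights and signs below are assumptions, not from the talk:

    # Illustrative reward shaping; w1-w3 are made-up weights.
    def reward(price_diff, duration, remaining_after_T,
               w1=1.0, w2=0.01, w3=1.0):
        r = w1 * price_diff           # price improvement after order completion
        r -= w2 * duration            # shorter order durations are better
        r -= w3 * remaining_after_T   # penalty for orders left after time T
        return r

    print(round(reward(price_diff=0.02, duration=3.0, remaining_after_T=0), 4))  # -0.01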


Deep RL for Optimal Order Placement

  • Actions:
    • Place a limit order at LOB level k, k = 1, ..., 5, with size as a percentage of N in increments of 10 (50 actions)
    • Empty action (action 51)
    • Cancel last order


[Figure: Level-2 Limit Order Book]
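
One possible encoding of the 51 placement/empty actions (this layout is an assumption about how the action index could be decoded; cancellation is handled separately):

    # Illustrative decoding of the discrete action index into (level, size).
    def decode_action(a):
        if a == 50:
            return ("empty", None, None)     # the 51st action: do nothing
        level, size_idx = divmod(a, 10)      # 5 levels x 10 size buckets
        size_pct = (size_idx + 1) * 10       # 10%, 20%, ..., 100% of N
        return ("place", level + 1, size_pct)

    print(decode_action(0))    # ('place', 1, 10)
    print(decode_action(37))   # ('place', 4, 80)
    print(decode_action(50))   # ('empty', None, None)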


Network Architectures: Execution Probability

  • A two-hidden-layer MLP outputs the execution probability for order Oi at time t
  • Inputs:
    • Sell or buy (1 dim)
    • Probability estimation based on data up to time t (1 dim)
    • Ticker symbol (variable, in our case 7 dim)
    • LOB level (5 dim)
    • LOB state (with rolling window) (512 dim)
  • Network structure 526->1024->256->1 with tanh activation function
  • Optimised via SGD with momentum
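
A hedged PyTorch sketch of this network (the original code is in (lua)Torch; the sigmoid on the output is my assumption for producing a probability):

    # Execution-probability MLP: 526 -> 1024 -> 256 -> 1 with tanh activations.
    import torch
    import torch.nn as nn

    exec_prob_net = nn.Sequential(
        nn.Linear(526, 1024), nn.Tanh(),
        nn.Linear(1024, 256), nn.Tanh(),
        nn.Linear(256, 1), nn.Sigmoid(),   # assumption: squash output to [0, 1]
    )
    optimizer = torch.optim.SGD(exec_prob_net.parameters(), lr=1e-3, momentum=0.9)

    x = torch.randn(32, 526)   # side + prob. estimate + ticker + level + LOB state
    p = exec_prob_net(x)       # execution probabilities, shape (32, 1)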


Network Architectures: LOB State Representation

  • LSTM network with 512-dimensional hidden state
  • Inputs:
    • Vector of bid and ask prices at each level (10 dim)
    • Size of orders at each level (10 dim)
    • Ticker symbol (variable, in our case 7 dim)
  • Trained with a rolling window of 20 time steps
  • Last hidden state is the LOB state representation
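
A rough PyTorch sketch of this encoder (the 27-dim input layout and the batching are assumptions based on the listed inputs):

    # LOB encoder: LSTM over a rolling window of 20 LOB snapshots.
    import torch
    import torch.nn as nn

    lob_lstm = nn.LSTM(input_size=27, hidden_size=512, batch_first=True)

    # 27 = 10 bid/ask prices + 10 level sizes + 7-dim ticker encoding.
    window = torch.randn(32, 20, 27)    # (batch, rolling window, features)
    outputs, (h_n, c_n) = lob_lstm(window)
    lob_state = h_n[-1]                 # (32, 512): last hidden state = LOB state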


Network Architectures: Q(s, a) network

  • Two-hidden-layer MLP, 521 -> 1024 -> 256 -> 51, with tanh activation functions
  • State related inputs:
    • LOB state (with rolling window) (512 dim)
    • Progress (1 dim)
    • Time until T (1 dim)
  • Action related inputs:
    • Order execution probability (1 dim)
    • Size of order (1 dim)
    • Order position in LOB (5 dim)
  • Optimised via SGD with momentum
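
A PyTorch-style sketch with the stated dimensions (assembling the 521-dim input as on the earlier state slide is an assumption):

    # Q-network: 521-dim state/action features -> 51 Q-values.
    import torch
    import torch.nn as nn

    q_net = nn.Sequential(
        nn.Linear(521, 1024), nn.Tanh(),
        nn.Linear(1024, 256), nn.Tanh(),
        nn.Linear(256, 51),              # one Q-value per discrete action
    )
    optimizer = torch.optim.SGD(q_net.parameters(), lr=1e-3, momentum=0.9)

    state = torch.randn(32, 521)         # LOB state + progress + time + action features
    q_values = q_net(state)              # (32, 51)
    greedy_action = q_values.argmax(dim=1)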


Training Details

  • Code in (lua)Torch, forked from the DeepMind Atari Deep Q Learner (https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner)
  • Training done on a server with:
    • 12 CPU cores, 64 GB RAM, and 3× NVIDIA Titan X GPUs (12 GB each)
  • One training run to convergence takes approximately one week
  • Hyperparameters are optimized with the HORD hyperparameter optimization algorithm (bit.ly/hord-aaai)


Preliminary Results

  • Learns to prefer placing orders at the second LOB level, with equal sizes spread across T
  • Learns price momentum: places orders more aggressively when the price is trending, a consequence of the order execution probability input


Deep Reinforcement Learning for Quant Finance?

  • Complex state properties → learn the state representation with deep neural networks
  • Infinite number of states → implement the Q-value function as a deep neural network
  • No clearly defined actions or rewards → do more experiments... (I guess)


Unresolved Issues (Future Work)

  • Training with historical market orders and passive participants
    • Multi-agent systems are needed to model adversarial participants
  • Market impact as part of the rewards
    • Model the market impact of an order using historical data
  • Better-defined rewards to reduce training time
  • Better-defined actions to improve performance


Thank you

Questions?

ilija139.github.io