1 of 11

Artificial Intelligence &

Recommendation System (BAF670)

Final Project

A Comprehensive Survey,

Focused on Reinforcement & Deep Learning linked with Market Microstructure

20259172 Jaemin Park, CFA & FRM

2 of 11

Introduction:

Machine Trading and the Emergence of Machine Learning

The Scope to be discussed

Modern trading markets are defined by Machine Trading. Interestingly, the leading field today is Machine Learning, also widely recognized as Artificial Intelligence. This survey aims to identify where the two ‘Machines’ intersect. The two Machine Learning categories discussed within the scope of Market Microstructure are Deep Learning and Reinforcement Learning.

Reinforcement Learning is a learning framework in which an agent learns to make decisions by interacting with an environment and receiving rewards that guide it toward maximizing long-term cumulative reward.

Deep Learning is a machine-learning approach that uses multi-layered neural networks to automatically learn complex patterns and representations from data.
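The Reinforcement Learning loop described above can be sketched in a few lines. The toy environment below is purely illustrative (not from any of the surveyed papers): an agent on a line of five states learns, via tabular Q-learning, that walking right toward a terminal reward maximizes cumulative reward.

```python
import random

# Toy RL loop (hypothetical environment): states 0..4, reward 1.0 only at
# state 4; action 0 = step left, action 1 = step right.
N_STATES, ACTIONS = 5, [0, 1]
Q = [[0.0, 0.0] for _ in range(N_STATES)]   # tabular action-value estimates
alpha, gamma, eps = 0.5, 0.9, 0.5           # learning rate, discount, exploration
random.seed(0)

for episode in range(300):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: explore randomly, otherwise act greedily on Q
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[s][x])
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # temporal-difference update toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

# After training, the greedy policy prefers "right" (action 1) everywhere.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

The same agent-environment-reward structure underlies every execution and market-making application in this survey; only the states, actions, and rewards change.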

The Market Microstructure Challenge

I. Optimal Execution

Minimizing market impact and trading cost

II. Smart Order Routing

Navigating the best platform

III. Order Management and Market Making

Market making and Order Option valuation

IV. Microstructure Modeling and Evaluation

Tick size physics

3 of 11

Module I: Optimal Execution

Abstract

This module delves into Optimal Trade Execution, the most mature application of Reinforcement Learning in finance. It also discusses the critical challenge of Market Impact and how Simulations and Probability tools are used to mitigate it.

Problem

The Limit Order Book state space is too vast for standard Reinforcement Learning. Mapping raw market data directly into states leads to sparse updates and slow convergence.

Solution

A tailored RL approach simplifies the state space by separating private variables (time and inventory) from market variables, and assumes that actions do not affect future order-book evolution during training. This makes it possible to apply a backward-induction, Q-learning–style algorithm to real NASDAQ data, so that RL learns fast, state-dependent execution policies that outperform traditional strategies.

Reference: Nevmyvaka et al. Reinforcement Learning for Optimized Trade Execution (2006)

Key Concept

State Factorization

Split state S into independent components:

  • Private Variables: 

Inventory, Time Remaining (Agent-Controlled).

  • Market Variables: 

Spread, Volume (Environment-Driven).
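The backward-induction idea over the private variables can be sketched as follows. This is a toy dynamic program, not the authors' NASDAQ experiment: the state is only the private pair (time remaining, inventory), the market variables are omitted, and the convex `cost` function and grid sizes are illustrative assumptions.

```python
# Backward induction over private state (time_remaining, inventory):
# fill the value table from the deadline backwards, because actions are
# assumed not to move the book.
T, I = 4, 4                      # decision points and inventory units

def cost(shares):                # hypothetical convex market-impact cost
    return shares ** 2

# V[t][i] = minimal cost to liquidate i units with t decision steps left
V = [[0.0] * (I + 1) for _ in range(T + 1)]
V[0] = [cost(i) for i in range(I + 1)]   # at the deadline, dump the remainder

best_action = {}
for t in range(1, T + 1):
    for i in range(I + 1):
        # try selling a = 0..i units now, pay cost(a), recurse on the rest
        choices = [(cost(a) + V[t - 1][i - a], a) for a in range(i + 1)]
        V[t][i], best_action[(t, i)] = min(choices)
```

With a quadratic cost the optimal policy spreads the order evenly across the remaining opportunities, which is exactly the state-dependent behavior the tabular Q-learning variant discovers from data.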

Result & Key Takeaways

In empirical analysis, the tailored RL algorithm reduced trading costs by up to 50% compared to traditional strategies.

4 of 11

Reference: An Adaptive Dual‑Level Reinforcement Learning Approach for Intraday Stock Trading (Kim et al., 2024)

Problem

Standard RL struggles with intraday VWAP execution because the full trading day is extremely long (over 4,000 steps) and price/volume changes within short intervals are too small to provide meaningful learning signals. Long-horizon training becomes unstable and computationally expensive, while short-horizon training loses the ability to track daily VWAP accurately.

Solution

A dual-level RL architecture that decomposes the trading day into manageable intervals. First, a global volume allocator (U-shape statistical model or Transformer) predicts how to distribute the day’s target volume across intervals. Second, a local RL agent (PPO + LSTM) optimizes execution within each interval, using the allocated volume and recent market features. This structure resolves the long-horizon challenge by letting RL operate only on short, stable sub-episodes.

Key Concept

Dual-Level Decomposition

Global Level – Volume Allocation

• Learns the intraday U-shape volume pattern using a Transformer.

• Adapts allocation to day-specific market conditions.

Local Level – Interval RL

• PPO + LSTM operates on short-horizon, 20-minute intervals.

• Learns fine-grained execution using limit-order-book features.

Core Concept

• Divide the day to make RL learnable

• Use ML to adapt volume, and RL to optimize execution.
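The global level of the decomposition can be sketched with a simple statistical U-shape allocator (the paper's Transformer variant would replace these hand-picked weights; the quadratic weight formula below is an illustrative assumption):

```python
# Toy global-level volume allocator: split a day's target volume across
# 20-minute intervals following a U-shaped intraday pattern, so a local
# RL agent can then execute each interval's slice independently.
def u_shape_weights(n_intervals):
    # weights highest near the open and close, lowest at midday
    mid = (n_intervals - 1) / 2
    raw = [1.0 + ((k - mid) / mid) ** 2 for k in range(n_intervals)]
    total = sum(raw)
    return [w / total for w in raw]

def allocate(target_volume, n_intervals):
    return [round(target_volume * w) for w in u_shape_weights(n_intervals)]

# roughly 19 twenty-minute intervals in a 6.5-hour trading day
slices = allocate(100_000, 19)
```

Each element of `slices` becomes the volume target for one short sub-episode, which is what makes the local RL problem learnable.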

Result & Key Takeaways

The dual-level approach significantly improves VWAP tracking accuracy compared to single-level RL or static U-shape baselines. Transformer-based allocation outperforms statistical U-shape, especially for low-liquidity stocks. Overall, the framework achieves more consistent execution and better adaptability to intraday market conditions than traditional or single-agent RL methods.

5 of 11

Reference: Reinforcement Learning for Trade Execution with Market Impact (Cheridito & Weiss, 2025)

Problem

Previous RL execution models were too simple, failing to account for market impact and the reactions of other participants. There is a need for an advanced RL execution framework that works in a realistic Limit Order Book environment with full market impact.

Solution

In short, previous RL agents were simulated under unrealistically simple conditions. The paper builds a fully reactive, market-impact-aware execution framework by combining a realistic multi-agent limit-order-book simulator with a flexible RL policy structure.

Key Concept

The action a = (a0, …, aK) represents probabilistic allocations of inventory across market orders, multiple limit-order price levels, and a hold action.

A Logistic–Normal distribution is used to generate a smooth, differentiable stochastic policy on the simplex.

An actor–critic architecture is employed to derive policy gradients and enable efficient learning.

The authors construct a reactive simulation environment in which noise, tactical, and strategic traders interact with the RL agent.

Result & Key Takeaways

The model presents an RL execution agent operating in an environment that reacts to its actions, just like a real market. The study shows how RL allocates orders across multiple price levels, which order types to use, and how to control the inventory level while accounting for realistic market impact.

6 of 11

Reference: Optimal Order Routing with Reinforcement Learning (Journal of Financial Data Science 2023)

Module II: Smart Order Routing

Abstract

Smart Order Routing (SOR) algorithms focus on where to trade. This problem is effectively modeled using the Multi-Armed Bandit (MAB) framework, a subset of RL dealing with the exploration-exploitation trade-off: the balance between trying new actions to discover better rewards and using known actions that already give good rewards.

Problem

Asset managers must route many small equity orders across several sell-side brokers using an “algo wheel.” The challenge is to design an order-routing framework that balances exploitation (using the broker that currently performs best) and exploration (continuing to assess the quality of the other brokers) while minimizing execution cost.

Solution

The paper models broker selection as a multi-armed bandit (MAB) problem and applies lightweight reinforcement-learning algorithms to optimally route orders. Each broker is treated as an “arm,” with stochastic execution performance measured via slippage versus VWAP. The solution compares three RL bandit strategies—ε-greedy, UCB, and Thompson Sampling—and evaluates them through simulations calibrated to real algo-wheel data.

Key Concept

Multi-Armed Bandits (MAB)

• Framework for balancing exploration vs. exploitation in sequential decision-making.

Bandit Algorithms

ε-Greedy: Random exploration with probability ε; otherwise choose the best estimated broker.

UCB: Uses statistical confidence intervals—optimistic exploration when uncertainty is high.

Thompson Sampling: Bayesian sampling of expected broker performance
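Thompson Sampling for broker routing can be sketched as follows. The broker "qualities" are synthetic, not the paper's algo-wheel data: each broker is an arm with an unknown fill-quality probability, a Beta posterior is kept per broker, and each order is routed to the broker whose sampled quality is highest.

```python
import random

random.seed(1)
true_quality = [0.3, 0.5, 0.7]   # hidden per-broker quality; broker 2 is best
alpha = [1.0] * 3                # Beta(1, 1) priors over each broker's quality
beta = [1.0] * 3
picks = [0] * 3

for order in range(2000):
    # sample one plausible quality per broker from its posterior
    samples = [random.betavariate(alpha[b], beta[b]) for b in range(3)]
    b = samples.index(max(samples))          # route to the most promising broker
    picks[b] += 1
    reward = 1 if random.random() < true_quality[b] else 0
    alpha[b] += reward                       # Bayesian posterior update
    beta[b] += 1 - reward

best = picks.index(max(picks))               # broker used most often
```

Note there is no tuning parameter here at all, which is exactly why the paper finds Thompson Sampling more robust in practice than UCB or ε-greedy.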

Result & Key Takeaways

Thompson Sampling consistently performs best in practice—low regret, robust, and no hyperparameter tuning required. UCB can theoretically outperform Thompson but is highly sensitive to hyperparameter choice and can underperform with poor tuning. ε-Greedy performs worst across nearly all simulation scenarios.

7 of 11

Reference: Dark-Pool Smart Order Routing a Combinatorial Multi-Armed Bandit Approach (ICAIF 2022)

Problem

Liquidity fragmentation and the rapid growth of dark pools have made order routing increasingly difficult. Dark pools provide no transparency—the trader never knows true available liquidity and receives only censored feedback (executed ≤ submitted). Existing SOR and bandit-based allocation methods can allocate volumes across venues but cannot choose prices (no limit-price selection), so they cannot jointly optimize venue, price, and size. Thus the main challenge is: how to optimally split an order across many dark pools while simultaneously choosing the best venue–price combinations under censored and uncertain liquidity.

Solution

The authors formulate Dark Pool Smart Order Routing (DPSOR) as a Combinatorial Multi-Armed Bandit (CMAB) problem. Each “arm” corresponds to a specific (venue, price, volume) tuple, and each action is a superarm that jointly allocates the entire order across multiple venues and prices. The paper also proposes DP-CMAB, a family of online-learning algorithms for this setting.

Key Concept

Combinatorial Multi-Armed Bandits (CMAB)

• Generalizes classical bandits to actions that combine many arms simultaneously (entire allocation matrix).

• Handles the joint optimization over venue, price, and volume.

DP-CMAB Algorithms

• DP-CUCB (Upper Confidence Bound)

• DP-CTS (Thompson Sampling for CMAB)

• DP-Bayes UCB
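The superarm structure and censored feedback can be illustrated with a toy example (synthetic hidden liquidity, not the DP-CMAB algorithms themselves): a superarm allocates a 300-share order across (venue, price) arms, and each venue reports only the executed amount, min(submitted, hidden liquidity).

```python
# Hidden liquidity per (venue, price) arm -- unknown to the router.
hidden = {("poolA", 10.00): 120, ("poolA", 10.01): 250,
          ("poolB", 10.00): 40,  ("poolB", 10.01): 90}

def play_superarm(allocation):
    # allocation: {(venue, price): submitted volume}
    # Feedback is censored: we observe fills, never the true depth.
    return {arm: min(vol, hidden[arm]) for arm, vol in allocation.items()}

# One superarm: jointly split 300 shares across two venue-price arms.
superarm = {("poolA", 10.01): 200, ("poolB", 10.01): 100}
fills = play_superarm(superarm)
executed = sum(fills.values())
```

When an arm fills completely (poolA here), the router only learns that liquidity was at least the submitted size; the DP-CMAB algorithms must learn efficiently from exactly this kind of one-sided observation.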

Result & Key Takeaways

DP-CTS with full propagation consistently achieves the lowest regret, outperforming classical baselines. Across five real NASDAQ assets (AAPL, FB, AMZN, MSFT, GOOGL), DP-CTS ranks #1 in all cases. The overall conclusion is that leveraging the CMAB structure enables efficient, realistic dark-pool routing under censored liquidity and price uncertainty.

8 of 11

Reference: Deep Reinforcement Learning for High‑Frequency Market Making (Kumar et al, 2022-23)

Module III: Order Management and Market Making

Abstract

Beyond pure execution and routing, RL is increasingly applied to generate alpha through strategic order placement and Market Making.

Problem

High-frequency market making is difficult because the agent must operate under partial observability, fast-changing limit-order-book dynamics, inventory risk, and maker–taker fees. Classical RL and DQN methods fail to generalize in such large, noisy, and partially observable state spaces, leading to unstable or unrealistic policies.

Solution

The paper trains a Deep Recurrent Q-Network (DRQN) market-making agent inside a realistic limit-order-book simulation with a matching engine, latency, and maker–taker fees, and equips it with a microstructure-aware state, action, and reward design.

Key Concept of the MM RL Model

Realistic LOB simulation with a matching engine, latency, FIX protocol, and maker–taker fee model.

DRQN architecture using CNN + LSTM layers to handle temporal structure and partial observability.

State representation combining inventory, quoting distances, spread, volatility, and order-book microstructure features.

Action space consisting of buy, sell, hold, and cancel decisions with discrete sizes.

PnL-based reward incorporating transaction costs, rebates, and penalties for adverse selection or large inventories.
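A reward of the kind described above can be sketched in one function. All coefficients here are hypothetical, not the paper's calibration: mark-to-market PnL plus maker rebates, minus taker fees and a quadratic penalty that discourages large inventories.

```python
# Sketch of a PnL-based market-making reward with fees, rebates, and an
# inventory penalty (all coefficients are illustrative assumptions).
def mm_reward(pnl_change, maker_volume, taker_volume, inventory,
              rebate=0.002, fee=0.003, inv_penalty=0.0001):
    return (pnl_change
            + rebate * maker_volume          # maker rebate earned on passive fills
            - fee * taker_volume             # fee paid when crossing the spread
            - inv_penalty * inventory ** 2)  # penalize large inventories

# One step: small positive PnL, mostly passive fills, moderate inventory.
r = mm_reward(pnl_change=1.5, maker_volume=100, taker_volume=20, inventory=50)
```

The quadratic inventory term is what pushes the learned quotes to skew against accumulated positions, one of the stylized market-maker behaviors the paper reproduces.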

Result & Key Takeaways

DRQN clearly outperforms DQN and classical RL baselines in profitability and inventory control, reproduces real-market stylized facts, and shows how maker–taker fees tighten spreads, reduce volatility, and boost market-maker performance.

9 of 11

Reference: Reinforcement Learning in a Dynamic Limit Order Market (UNSW & U Sydney Working Paper - Kwan & Philip 2025)

Problem

Managing a limit order is hard because its value depends on queue position, surrounding depth, volatility, and execution risk—all of which change rapidly. Existing models cannot capture these interactions or quantify when a trader should leave or cancel an order. This study uses RL to numerically value this cancellation right by comparing managed vs. unmanaged order strategies.

Solution

The authors model limit-order management as an MDP and use Q-learning on nanosecond ASX ITCH data. At each decision point, the agent chooses to leave or cancel, and the model learns the expected value of each choice across thousands of empirical market states.

Key Concept

18,001 empirically defined market states (price level, queue sizes, queue position, volatility).

Q-learning to evaluate the long-term value of leave vs. cancel.

Rewards combine execution gains and inventory PnL.

Transition probabilities estimated directly from real LOB data.
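The leave-vs-cancel valuation logic can be illustrated with a toy expected-value calculation (hypothetical numbers, not the paper's ASX estimates): leaving is worth the fill probability times the spread captured, minus the expected adverse-selection loss, while cancelling is worth zero, so the managed order takes the better of the two.

```python
# Toy valuation of a resting limit order in one market state.
def leave_value(p_fill, half_spread, p_adverse, adverse_loss):
    # expected spread captured minus expected adverse-selection loss
    return p_fill * half_spread - p_adverse * adverse_loss

def managed_value(p_fill, half_spread, p_adverse, adverse_loss):
    # the cancellation right lets the trader floor the value at zero
    return max(0.0, leave_value(p_fill, half_spread, p_adverse, adverse_loss))

# A state where leaving has negative expected value...
ev_leave = leave_value(p_fill=0.3, half_spread=1.0,
                       p_adverse=0.5, adverse_loss=1.0)
# ...is turned into zero by the option to cancel.
ev_managed = managed_value(p_fill=0.3, half_spread=1.0,
                           p_adverse=0.5, adverse_loss=1.0)
```

The paper's Q-learning does this comparison state by state over 18,001 empirical states, which is how it quantifies the cancellation option's contribution to order value.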

Result & Key Takeaways

Value is highest at the best bid and highly sensitive to queue position. Queue behind adds value; queue ahead reduces it. The option to cancel is crucial—worth about 19% of a limit order’s total value and often turns a negative-EV order into a positive one. This data-driven RL approach identifies when orders should be canceled and which market features most influence the order’s expected value.

10 of 11

Reference: Deep Limit Order Book Forecasting a Microstructural Guide (Quantitative Finance 2025)

Module IV: Microstructure Modeling and Evaluation

Problem

Most deep-learning LOB forecasting models treat the limit order book as raw high-dimensional tensors and ignore the underlying microstructure rules that actually generate order-book dynamics. As a result, models often learn spurious patterns, fail to generalize across assets or regimes, and provide little interpretability for practitioners who need forecasts aligned with economically meaningful features like queueing, imbalance, and volatility.

Solution

The paper proposes a microstructure-guided framework for deep LOB forecasting: extract features that directly reflect order-flow mechanics—such as queue imbalance, event types, and liquidity regimes—and feed them into a deep neural architecture designed to respect LOB dynamics. By combining domain-informed representations with modern deep networks, the model forecasts short-term price moves more accurately and with clearer interpretability.

Key Concept

Microstructure-aware feature design: queue imbalance, depth measures, order-flow counts, liquidity state.

Deep forecasting models: CNN/LSTM/Transformer variants adapted to event-driven LOB sequences.

Event-based representation: modeling LOB changes as sequences of order-flow events instead of uniform time grids.

Forecast targets: short-horizon midprice moves, directional changes, or volatility bursts.
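One of the microstructure-aware features named above is simple enough to show directly (toy book sizes, illustrative only): queue imbalance at the top of the book, normalized to [-1, 1], positive when the bid queue dominates and commonly used as a short-horizon direction signal.

```python
# Queue imbalance at the best bid/ask: (bid - ask) / (bid + ask) in [-1, 1].
def queue_imbalance(bid_size, ask_size):
    return (bid_size - ask_size) / (bid_size + ask_size)

qi = queue_imbalance(bid_size=800, ask_size=200)   # bid-heavy book -> positive
```

Feeding such economically meaningful features, rather than raw order-book tensors, is the core of the paper's microstructure-guided framework.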

Result & Key Takeaways

Microstructure-guided models significantly outperform naïve deep models that ignore market mechanics. Event-based architectures and queue-imbalance features provide the largest accuracy gains. The approach yields more stable forecasting performance across assets and regimes, and produces signals that align with known empirical microstructure facts rather than spurious noise.

11 of 11

Conclusion

I. Optimal Execution

Optimal Execution has evolved from simple tabular Q-learning to sophisticated Dual-Level Deep RL architectures that can plan over day-long horizons while managing tick-level dynamics.

II. Smart Order Routing

Smart Order Routing is effectively solved by Multi-Armed Bandit algorithms, particularly Thompson Sampling and Combinatorial MABs, which deftly handle the uncertainty and censored data of dark pools.

III. Order Management and Market Making

RL provides new economic insights, such as quantifying the "option to cancel" limit orders, revealing that active order management contributes nearly one-fifth of an order's value, and it enables new market-making algorithms.

IV. Microstructure Modeling and Evaluation

Deep learning models can forecast limit order book movements surprisingly well, but the field lacks a clear explanation of why these models work and which microstructural signals they rely on.