Neural Constrained Combinatorial Bandits
Ziyu Shao
Shangshang Wang
Simeng Bian
Xin Liu
Wide Applications of Bandits for Networking
Emoji Prediction
Next-Word Prediction
Online Recommendation
Federated Learning
Contextual Bandits
(1) Context
(2) Selection
Agent
Naive Assumptions
Linear Function
Special Non-Linear, e.g., Exponential & Polynomial
Neural (Contextual) Bandits
(1) Context
(2) Selection
Agent
Data-Driven Practical Approach
Neural Networks
General Non-Linear Function
Strong Representational Ability
Naive Assumptions
Linear Function
Special Non-Linear, e.g., Exponential & Polynomial
Options of Neural Networks
Multi-Layer Perceptron (MLP): Theoretical Guarantee; Simple yet Powerful
Transformer: No Theoretical Guarantee; Complex Structure
Theory behind MLP: Neural Tangent Kernel
Tangent
Neural
Kernel
Mean Square Loss
Training Data
Training Dynamics
MLP
Neural Tangent Kernel of Ultra-Wide MLP
MLP: Random at Initialization; Dynamic during Training
Ultra-Wide MLP (Limiting Behavior): Deterministic at Initialization; Static during Training
Constant Kernel; Training Dynamics Follow an Ordinary Differential Equation
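The "Constant" and "Ordinary Differential Equation" labels refer to the standard NTK result: under gradient flow on the mean-square loss, the network outputs evolve as below (a sketch in generic notation, not the report's exact symbols):

```latex
\frac{\mathrm{d}}{\mathrm{d}t}\, f(X;\theta_t)
  = -\,\Theta_t(X,X)\,\bigl(f(X;\theta_t) - y\bigr),
\qquad
\Theta_t(x,x') = \nabla_\theta f(x;\theta_t)^{\top}\, \nabla_\theta f(x';\theta_t).
```

For an ultra-wide MLP, $\Theta_t$ stays (approximately) fixed at its deterministic initial value, so the dynamics reduce to a linear ODE with closed-form solution $f_t = y + e^{-\Theta t}(f_0 - y)$.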
Reward & Cost from Arm Selection
(1) Context
(2) Selection
(3) Reward & Cost
MLP for Reward (Function Approximation)
MLP for Cost
Practical Concern: Anytime Cumulative Constraint
Cumulative
Anytime
Constraint
Sustainable Deployment
Energy Consumption
Within-Budget at Any Time
Cumulative Cost
(Plot: cumulative cost vs. round, accumulating the cost incurred in each round)
Cumulative Budget
(Plot: cumulative budget vs. round, accumulating the budget granted in each round)
Constraint Violation
(Plot: cumulative cost and cumulative budget vs. round; the excess of cost over budget is the constraint violation)
Constraint Violation
(Plot: cumulative cost and cumulative budget vs. round, with cost staying within budget at every round)
Zero Constraint Violation
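As a concrete illustration, the anytime violation is the positive part of (cumulative cost minus cumulative budget), checked at every round. A minimal sketch with made-up per-round numbers:

```python
# Anytime cumulative constraint: at EVERY round, cumulative cost must stay
# within cumulative budget. Violation = positive part of the gap.
# Per-round costs/budgets below are illustrative only.
costs   = [1.5, 0.5, 2.0, 0.5, 1.0]   # cost incurred in each round
budgets = [1.0, 1.0, 1.0, 1.0, 1.0]   # budget granted in each round

cum_cost = cum_budget = 0.0
violations = []
for c, b in zip(costs, budgets):
    cum_cost += c
    cum_budget += b
    violations.append(max(cum_cost - cum_budget, 0.0))

print(violations)  # [0.5, 0.0, 1.0, 0.5, 0.5]
```

"Zero constraint violation" on the slide means every entry of this list is zero.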
Main Contribution
| Works | Neural Bandit | Kernel Bandit | Linear Bandit | Anytime Cumulative Constraint | Unconstrained |
|---|---|---|---|---|---|
| Ours | ★ | | | ★ | |
| Zhou et al. [18] | | ★ | | ★ | |
| Liu et al. [19] | | | ★ | ★ | |
| Zhou et al. [11] | ★ | | | | ★ |
| Srinivas et al. [48] | | ★ | | | ★ |
| Li et al. [5] | | | ★ | | ★ |
Coupling of Training & Learning & Control
Neural Network Online Training
Online Control
Online Learning
Neural Constrained Combinatorial Bandits
Algorithm Part I: Online Learning
(1) Empirical Mean
Exploitation
Exploration
(2) Confidence Radius
MLP for Reward (Cost)
Upper Confidence Bound = (1) + (2) for Reward Estimation
Lower Confidence Bound = (1) – (2) for Cost Estimation
MLP’s Output
MLP’s Gradient
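The two-part estimator can be sketched in NeuralUCB style; here `f_out` stands in for the MLP's output and `g` for its parameter gradient, with toy sizes and an assumed exploration weight `gamma`:

```python
import numpy as np

# (1) empirical mean = MLP's output; (2) confidence radius built from the
# MLP's gradient g and a design matrix Z accumulating past gradients.
rng = np.random.default_rng(0)
p = 8                       # number of MLP parameters (toy size)
lam, gamma = 1.0, 0.1       # regularization and exploration weight (assumed)
Z = lam * np.eye(p)         # Z = lam*I + sum of g g^T over past rounds

def confidence_radius(g, Z):
    return gamma * np.sqrt(g @ np.linalg.solve(Z, g))

f_out = 0.42                        # (1) MLP's output (stand-in value)
g = rng.standard_normal(p)          # (2) MLP's gradient (stand-in values)

radius = confidence_radius(g, Z)
ucb = f_out + radius                # upper confidence bound: reward estimation
lcb = f_out - radius                # lower confidence bound: cost estimation
Z = Z + np.outer(g, g)              # update after playing the arm
radius_after = confidence_radius(g, Z)  # shrinks as evidence accumulates
```

The shrinking radius is the point: exploration bonuses decay as the design matrix `Z` accumulates gradient information.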
Algorithm Part II: Online Control
Queue Length
Virtual Queue
Cost Estimation
Budget
Controlled Queue Length
Bounded Constraint Violation
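The virtual-queue update can be sketched as follows (a minimal illustration; the function name and numbers are ours, not the paper's):

```python
# Virtual queue: grows by the estimated cost, drains by the budget, and
# never goes negative. Keeping the queue length bounded translates into
# bounded constraint violation.
def update_queue(q, cost_estimate, budget):
    return max(q + cost_estimate - budget, 0.0)

q = 0.0  # initial queue length
for cost_est, budget in [(1.5, 1.0), (0.5, 1.0), (0.2, 1.0)]:
    q = update_queue(q, cost_est, budget)
print(q)  # 0.0, since the queue drains once estimated cost falls below budget
```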
Algorithm Workflow
MLP for Reward
MLP for Cost
Online Learning
Cost as Training Data
Reward as Training Data
Online Constrained Optimization
Queue Length
Reward�Estimation
Cost�Estimation
Online Control
Cost Estimation
Reward Estimation
MLPs’ Output & Gradient
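Putting the workflow together, one round couples the three modules. This is our simplified sketch: random linear stubs replace the two MLPs, and a queue-penalized greedy top-K selection stands in for the online constrained optimization (the trade-off weight `V` is our assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
K, n_arms, budget, V, q = 3, 10, 1.0, 1.0, 0.0  # V: reward/queue trade-off

for t in range(50):
    ctx = rng.standard_normal(n_arms)           # per-arm context scores
    reward_est = ctx                            # online learning: reward estimation (stub)
    cost_est = rng.uniform(0.1, 0.5, n_arms)    # online learning: cost estimation (stub)
    # online constrained optimization: trade reward against queue-weighted cost
    score = V * reward_est - q * cost_est
    arms = np.argsort(score)[-K:]               # select the K best-scoring arms
    # online control: virtual-queue update with the incurred cost
    q = max(q + cost_est[arms].sum() - budget, 0.0)
    # (realized reward & cost would be fed back to the MLPs as training data)
```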
Performance Guarantee: Sharp Regret Bound
(Plot: regret, the reward gap from the optimal, vs. round: our upper bound compared against the lower bound, a linear-order baseline, and improved UCB variants)
Performance Guarantee: Zero Constraint Violation
(Plot: constraint violation vs. round: nonzero while estimation is inaccurate, then zero constraint violation once estimation becomes accurate)
Simulation Setting: Crowdsourcing System
Workers (50 Arms)
Platform�(Agent)
(1) Task (Context)
(2) Select a Subset of 5 Workers
(3) Completion Qualities & Expenses (Reward & Cost)
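The setting can be mocked up as follows; the generative model (logistic qualities from hidden worker profiles, uniform expenses) is our illustrative assumption, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, subset_size, d = 50, 5, 8            # 50 arms, select 5, context dim 8
skill = rng.standard_normal((n_workers, d))     # hidden worker profiles (assumed)

def observe_task():
    return rng.standard_normal(d)               # (1) task arrives as context

def play(task, selected):
    quality = 1.0 / (1.0 + np.exp(-(skill[selected] @ task)))  # (3) completion qualities
    expense = rng.uniform(0.1, 1.0, len(selected))             # (3) expenses
    return quality, expense

task = observe_task()
selected = rng.choice(n_workers, size=subset_size, replace=False)  # (2) subset of 5
quality, expense = play(task, selected)
```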
Simulation Result: Round-Averaged Regret
Simulation Result: Constraint Violation
Conclusion
Guaranteed Theoretical Performance
Sharp Regret Bound
Zero Constraint Violation
Real-World Applications with Neural Networks
Neural Constrained Combinatorial Bandits
Link to Our Technical Report: http://faculty.sist.shanghaitech.edu.cn/faculty/shaozy/Neural-PD.pdf
Problem Formulation
System Workflow
(1) Context
(2) Selection
(3) Reward
MLP
(4) Online �Update
Round
MLP Parameter
Constraints: Finite-Time & Long-Term Special Cases
Cumulative
Anytime
Constraint
Finite-Time
Long-Term
Neural Tangent Kernel: Kernel Bandits’ Perspective
Kernel Bandits: Corresponding Kernel Defined by Inner Product of Feature Functions
Feature Function: Deterministic at Initial Round; Static across Rounds
Neural Bandits: Neural Tangent Kernel (NTK) Defined by Inner Product of Neural Network Gradients
Neural Network: Random due to Parameter Initialization; Dynamic across Rounds due to Training
Training Dynamics: Hard to Analyze
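The NTK definition above (inner product of network gradients) can be computed directly for a toy one-hidden-layer MLP; the architecture and sizes here are our illustrative choices:

```python
import numpy as np

# f(x) = (1/sqrt(m)) * v^T relu(W x); the empirical NTK entry for (x, x')
# is the inner product of the parameter gradients at those two inputs.
rng = np.random.default_rng(0)
d, m = 5, 1000                      # input dimension and width (toy sizes)
W = rng.standard_normal((m, d))
v = rng.standard_normal(m)

def grad_f(x):
    """Gradient of f(x) w.r.t. all parameters (W, v), flattened."""
    pre = W @ x                                     # pre-activations
    dv = np.maximum(pre, 0.0) / np.sqrt(m)          # d f / d v
    dW = ((v * (pre > 0)) / np.sqrt(m))[:, None] * x[None, :]  # d f / d W
    return np.concatenate([dW.ravel(), dv])

x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
ntk_12 = grad_f(x1) @ grad_f(x2)    # off-diagonal NTK entry
ntk_11 = grad_f(x1) @ grad_f(x1)    # diagonal entry, nonnegative by construction
```

As the width `m` grows, these random entries concentrate around the deterministic limiting NTK, which is the "ultra-wide" behavior used in the analysis.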
Theory behind MLP: Neural Tangent Kernel
Recall: Linear Regression
Round
MLP Parameter
Training Dynamics
Neural Tangent Kernel Theory
Converged
Neural Tangent Kernel
Deterministic at Initialization
Static across Rounds
Parameter Barely Changes
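"Parameter barely changes" is usually formalized by linearizing the network around its initialization (generic notation, ours):

```latex
f(x;\theta_t) \;\approx\; f(x;\theta_0)
  \;+\; \nabla_\theta f(x;\theta_0)^{\top}\,(\theta_t - \theta_0)
```

So the ultra-wide MLP trains like linear regression on the fixed gradient features $\nabla_\theta f(x;\theta_0)$, which is exactly the "Recall: Linear Regression" analogy.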
Regret & Constraint Violation
Tightness
Trade-Off
Effective Dimension of Neural Tangent Kernel (NTK) Matrix
Number of Selected Arms
Slater’s Constant
Zero Constraint Violation
Sharp Regret Bound
Effective Dimension of NTK Matrix
Neural Tangent Kernel Matrix
Regularization Parameter
Underlying Dimension of Observed Contexts
A Larger Value Implies a Larger Variation in Contexts
More Rounds for Accurate Reward & Cost Estimation
More Rounds to Reach Zero Constraint Violation
Looser Regret Bound
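For reference, a common definition of the effective dimension in the neural/kernel bandit literature (our sketch; the technical report's exact normalization may differ) is:

```latex
\tilde{d} \;=\; \frac{\log\det\!\left(\mathbf{I} + \mathbf{H}/\lambda\right)}
                     {\log\!\left(1 + TK/\lambda\right)}
```

where $\mathbf{H}$ is the NTK matrix over the observed context-arm pairs, $\lambda$ the regularization parameter, $T$ the number of rounds, and $K$ the number of selected arms per round. A larger $\tilde{d}$ means the contexts vary in more directions, matching the slide's "larger variation in contexts".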
Effective Dimension
Rounds to Reach Zero Constraint Violation
Effective Dimension (defined by NTK)
Regret
Rounds for Accurate Reward & Cost Estimation
Quadratic
Linear
Slater’s Constant
Slater’s Condition: Existence of Feasible Solution with Given Slater’s Constant
A Smaller Value Implies a Harder Problem with Fewer Feasible Solutions
More Rounds to Identify the �Optimal Solution
More Rounds to Reach Zero Constraint Violation
Looser Regret Bound
Determine Number of�Feasible Solutions
Theoretical Analysis: Comparison
| Constrained Contextual Bandit | Regret Bound | Total Constraint Violation |
|---|---|---|
| Our Model | | |
| Constrained Kernel Bandits [1] | | |
| Constrained Linear Bandits [2] | | |
[1] Zhou et al. On Kernelized Multi-Armed Bandits with Constraints.
[2] Liu et al. An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints.