Neural Constrained Combinatorial Bandits

Ziyu Shao, Shangshang Wang, Simeng Bian, Xin Liu

Wide Applications of Bandits for Networking


Emoji Prediction

Next-Word Prediction

Online Recommendation

Federated Learning

Contextual Bandits

In each round: (1) the agent observes a Context, then (2) makes a Selection of arms.

Naive Assumptions on the reward function:
  • Linear Function
  • Special Non-Linear, e.g., Exponential & Polynomial

Neural (Contextual) Bandits

Same loop: (1) Context, (2) Selection by the Agent, but the reward function is modeled with Neural Networks, a Data-Driven Practical Approach:
  • General Non-Linear Function
  • Strong Representational Ability

This replaces the naive assumptions (Linear Function; Special Non-Linear, e.g., Exponential & Polynomial).

Options of Neural Networks

  • Transformer: Complex Structure, No Theoretical Guarantee
  • Multi-Layer Perceptron (MLP): Simple yet Powerful, Theoretical Guarantee

Theory behind MLP: Neural Tangent Kernel

The MLP (Neural) is trained on the observed Training Data under Mean Square Loss; the Neural Tangent Kernel (Tangent, Kernel) characterizes the resulting Training Dynamics.
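For concreteness, the standard NTK facts behind this slide, with notation introduced here: train the MLP $f(x;\theta)$ on data $\{(x_i, y_i)\}_{i=1}^{n}$ with mean square loss
\[
  L(\theta) = \tfrac{1}{2}\sum_{i=1}^{n} \bigl(f(x_i;\theta) - y_i\bigr)^2 .
\]
Under gradient flow $\dot{\theta}_t = -\eta\,\nabla_\theta L(\theta_t)$, the network output evolves as
\[
  \frac{d}{dt} f(x;\theta_t) = -\eta \sum_{i=1}^{n} \Theta_t(x, x_i)\,\bigl(f(x_i;\theta_t) - y_i\bigr),
  \qquad
  \Theta_t(x, x') = \bigl\langle \nabla_\theta f(x;\theta_t),\, \nabla_\theta f(x';\theta_t) \bigr\rangle,
\]
where $\Theta_t$ is the (time-varying) Neural Tangent Kernel.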

Neural Tangent Kernel of Ultra-Wide MLP

  • MLP: Random at Initialization, Dynamic during Training
  • Ultra-Wide MLP (Limiting Behavior): Deterministic at Initialization, Static during Training

With a Constant kernel, the training dynamics reduce to an Ordinary Differential Equation.
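In the ultra-wide limit the kernel is a constant $\Theta$, so the dynamics above become a linear ODE with a closed-form solution on the stacked training inputs $X$ (a standard result; notation as above):
\[
  f_t(X) = y + e^{-\eta\,\Theta(X,X)\,t}\,\bigl(f_0(X) - y\bigr),
\]
so the outputs converge to the targets while the kernel stays deterministic and static.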

Reward & Cost from Arm Selection

(1) Context, (2) Selection, (3) Reward & Cost.

  • MLP for Reward (Function Approximation)
  • MLP for Cost

Practical Concern: Anytime Cumulative Constraint

Example: Sustainable Deployment, where Energy Consumption must stay Within Budget at Any Time.

Cumulative Cost & Budget

  • Cumulative Cost: the sum of the Cost in Each Round up to the current Round.
  • Cumulative Budget: the sum of the Budget in Each Round up to the current Round.

Constraint Violation

Zero Constraint Violation: the cumulative cost never exceeds the cumulative budget, at any round.
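In symbols, with notation assumed here to match the slides: letting $c_\tau$ and $b_\tau$ be the cost and budget in round $\tau$,
\[
  C(t) = \sum_{\tau=1}^{t} c_\tau, \qquad B(t) = \sum_{\tau=1}^{t} b_\tau,
\]
the anytime cumulative constraint requires $C(t) \le B(t)$ for all $t$, and the constraint violation is $V(t) = \max\bigl(C(t) - B(t),\, 0\bigr)$; zero constraint violation means $V(t) = 0$ at every round.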

 

 

Main Contribution

Works, by contextual-bandit model and constraint type:

Constraint            | Neural Bandit    | Kernel Bandit        | Linear Bandit
Anytime Cumulative    | Ours             | Zhou et al. [18]     | Liu et al. [19]
Unconstrained         | Zhou et al. [11] | Srinivas et al. [48] | Li et al. [5]

Coupling of Training & Learning & Control

Neural Constrained Combinatorial Bandits couple three components:
  • Neural Network Online Training
  • Online Learning
  • Online Control

Algorithm Part I: Online Learning

The MLP for Reward (Cost) provides two quantities:
  • (1) Empirical Mean (Exploitation): the MLP’s Output
  • (2) Confidence Radius (Exploration): built from the MLP’s Gradient

Upper Confidence Bound = (1) + (2) for Reward Estimation
Lower Confidence Bound = (1) − (2) for Cost Estimation
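A minimal numerical sketch of these two estimates, assuming a NeuralUCB-style construction: the one-hidden-layer MLP, the scaling factor beta, and the design-matrix inverse Z_inv below are illustrative assumptions, not the paper’s exact choices.

```python
import numpy as np

# Sketch of UCB/LCB estimation: empirical mean = MLP output,
# confidence radius = gradient norm weighted by Z^{-1} (assumed form).

def mlp_forward_and_grad(theta, x):
    """One-hidden-layer MLP f(x) = w2 . relu(W1 x); returns the scalar
    output and the flattened gradient w.r.t. all parameters."""
    W1, w2 = theta
    h = W1 @ x                           # pre-activations, shape (m,)
    a = np.maximum(h, 0.0)               # ReLU activations
    out = w2 @ a                         # scalar output
    gW1 = np.outer(w2 * (h > 0), x)      # d out / d W1
    grad = np.concatenate([gW1.ravel(), a])  # [d out/d W1, d out/d w2]
    return out, grad

def confidence_bounds(theta, x, Z_inv, beta):
    """UCB = mean + radius (for reward), LCB = mean - radius (for cost)."""
    mean, g = mlp_forward_and_grad(theta, x)
    radius = beta * np.sqrt(g @ Z_inv @ g)   # beta * ||grad||_{Z^{-1}}
    return mean + radius, mean - radius

# Toy usage with assumed dimensions.
rng = np.random.default_rng(0)
d, m = 4, 8                              # context dim, hidden width
theta = (rng.normal(size=(m, d)), rng.normal(size=m))
p = m * d + m                            # total number of parameters
Z_inv = np.eye(p)                        # regularized design-matrix inverse
ucb, lcb = confidence_bounds(theta, rng.normal(size=d), Z_inv, beta=0.1)
```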

Algorithm Part II: Online Control

A Virtual Queue tracks the gap between the Cost Estimation and the Budget: a Controlled Queue Length yields Bounded Constraint Violation.
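A standard virtual-queue update consistent with this slide (notation assumed): with cost estimate $\hat{c}_t$ of the selected arms and per-round budget $b_t$,
\[
  Q(t+1) = \max\bigl(Q(t) + \hat{c}_t - b_t,\; 0\bigr),
\]
so keeping the queue length $Q(t)$ controlled keeps the cumulative constraint violation bounded.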

Algorithm Workflow

  • MLP for Reward & MLP for Cost, trained online with the observed Reward and Cost as Training Data
  • Online Learning: turns the MLPs’ Output & Gradient into Reward Estimation and Cost Estimation
  • Online Control: uses the Cost Estimation to maintain the Queue Length
  • Online Constrained Optimization: maximizes an objective built from the Queue Length, Reward Estimation, and Cost Estimation to select the arms (see the sketch below)
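A schematic sketch of one round of this workflow. The scoring rule (optimistic reward minus queue-weighted cost, a generic drift-plus-penalty form) and all names below are illustrative assumptions, not necessarily the paper’s exact rule.

```python
import numpy as np

# One round of the workflow: online constrained optimization over arms,
# then the virtual-queue (online control) update. Assumed, generic form.

def one_round(contexts, reward_ucb, cost_lcb, Q, budget, K):
    """Select a super-arm of K arms and update the virtual queue.

    contexts:   per-arm context vectors
    reward_ucb: context -> upper confidence bound on reward
    cost_lcb:   context -> lower confidence bound on cost
    Q:          current virtual queue length
    budget:     per-round budget
    """
    # Score each arm: optimistic reward minus queue-weighted cost.
    scores = np.array([reward_ucb(x) - Q * cost_lcb(x) for x in contexts])
    selected = np.argsort(scores)[-K:]              # top-K super-arm
    est_cost = sum(cost_lcb(contexts[i]) for i in selected)
    Q_next = max(Q + est_cost - budget, 0.0)        # virtual queue update
    return selected, Q_next

# Toy usage with random linear stand-ins for the MLP-based estimates.
rng = np.random.default_rng(1)
ctxs = [rng.normal(size=4) for _ in range(50)]
w_r, w_c = rng.normal(size=4), rng.normal(size=4)
arms, Q = one_round(ctxs, lambda x: w_r @ x, lambda x: w_c @ x,
                    Q=0.0, budget=1.0, K=5)
```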

Performance Guarantee: Sharp Regret Bound

[Figure: Regret (Reward Gap from the Optimal) vs. Round, comparing Our Upper Bound with the Lower Bound, Improved UCB Variants, and a Linear Order reference.]
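Regret here is the cumulative reward gap from the optimal selections (notation assumed):
\[
  R(T) = \sum_{t=1}^{T} \bigl( r_t(S_t^\star) - r_t(S_t) \bigr),
\]
where $S_t^\star$ is the optimal feasible super-arm in round $t$ and $S_t$ is the one selected.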

Performance Guarantee: Zero Constraint Violation

[Figure: Constraint Violation vs. Round; the violation shrinks as estimation moves from Inaccurate to Accurate, reaching Zero Constraint Violation.]

Simulation Setting: Crowdsourcing System

  • Platform (Agent) and Workers (50 Arms)
  • (1) Task (Context)
  • (2) Select a Subset of 5 Workers
  • (3) Completion Qualities & Expenses (Reward & Cost)


Simulation Result: Round-Averaged Regret


Simulation Result: Constraint Violation

Conclusion

Neural Constrained Combinatorial Bandits:
  • Guaranteed Theoretical Performance: Sharp Regret Bound, Zero Constraint Violation
  • Real-World Applications with Neural Networks


Link to Our Technical Report: http://faculty.sist.shanghaitech.edu.cn/faculty/shaozy/Neural-PD.pdf

Problem Formulation

  • Maximize: the expected cumulative reward
  • Subject to: the anytime cumulative cost constraint
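One way to write the formulation these bullets summarize, with notation matching the cost, budget, and regret definitions elsewhere in the deck (the exact form is assumed):
\[
  \max\; \mathbb{E}\Bigl[\,\sum_{t=1}^{T} r_t(S_t)\Bigr]
  \qquad \text{s.t.} \qquad
  \sum_{\tau=1}^{t} c_\tau(S_\tau) \;\le\; \sum_{\tau=1}^{t} b_\tau \quad \forall\, t .
\]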

 

System Workflow

In each Round: (1) Context, (2) Selection, (3) Reward, then (4) Online Update of the MLP Parameter.

Constraints: Finite-Time & Long-Term Special Cases

The Anytime Cumulative Constraint subsumes two common special cases: the Finite-Time constraint and the Long-Term (time-averaged) constraint.
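In symbols, with the cumulative cost $C(t)$ and budget $B(t)$ defined earlier (the long-term form below is assumed):
\[
  \text{Anytime: } C(t) \le B(t)\ \ \forall t;
  \qquad
  \text{Finite-Time: } C(T) \le B(T);
  \qquad
  \text{Long-Term: } \limsup_{T\to\infty} \tfrac{1}{T}\bigl(C(T) - B(T)\bigr) \le 0 .
\]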

 

 

Neural Tangent Kernel: Kernel Bandits’ Perspective

Kernel Bandits:
  • Feature Function: Deterministic at the Initial Round, Static across Rounds
  • Corresponding Kernel: Defined by the Inner Product of Feature Functions

Neural Bandits:
  • Neural Network: Random due to Parameter Initialization, Dynamic across Rounds due to Training (Training Dynamics: Hard to Analyze)
  • Neural Tangent Kernel (NTK): Defined by the Inner Product of Neural Network Gradients
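The two kernels side by side (notation assumed):
\[
  k(x, x') = \bigl\langle \phi(x), \phi(x') \bigr\rangle
  \qquad \text{vs.} \qquad
  \Theta_t(x, x') = \bigl\langle \nabla_\theta f(x;\theta_t),\, \nabla_\theta f(x';\theta_t) \bigr\rangle,
\]
where $\phi$ is a fixed feature function and $f(\cdot\,;\theta_t)$ is the neural network in round $t$.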

Theory behind MLP: Neural Tangent Kernel

Recall: Linear Regression. Across Rounds, the MLP Parameter follows a Training Dynamic analogous to that of linear regression.
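The analogy rests on linearizing the MLP around its initialization $\theta_0$ (a standard step):
\[
  f(x;\theta) \;\approx\; f(x;\theta_0) + \bigl\langle \nabla_\theta f(x;\theta_0),\, \theta - \theta_0 \bigr\rangle,
\]
so training the wide MLP behaves like linear regression with feature map $\nabla_\theta f(x;\theta_0)$.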

Neural Tangent Kernel Theory

For an ultra-wide MLP, the parameter barely changes during training, so the Neural Tangent Kernel is converged: Deterministic at Initialization and Static across Rounds.

Regret & Constraint Violation

Tightness and the Trade-Off between the Sharp Regret Bound and Zero Constraint Violation are governed by:
  • Effective Dimension of the Neural Tangent Kernel (NTK) Matrix
  • Number of Selected Arms
  • Slater’s Constant

Effective Dimension of NTK Matrix

Defined from the Neural Tangent Kernel Matrix and a Regularization Parameter, the effective dimension measures the Underlying Dimension of Observed Contexts. A larger value implies a larger variation in contexts, hence more rounds for accurate reward & cost estimation, more rounds to reach zero constraint violation, and a looser regret bound.
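For reference, the effective dimension commonly used in neural bandits (e.g., Zhou et al. [11]); the exact constants here are assumed:
\[
  \tilde{d} \;=\; \frac{\log\det\bigl(\mathbf{I} + \mathbf{H}/\lambda\bigr)}{\log\bigl(1 + TK/\lambda\bigr)},
\]
where $\mathbf{H}$ is the NTK matrix over the observed contexts, $\lambda$ is the regularization parameter, $T$ is the number of rounds, and $K$ is the number of selected arms per round.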

Effective Dimension

The effective dimension (defined by the NTK) enters both guarantees: the regret, via the rounds for accurate reward & cost estimation, scales linearly in it, while the rounds to reach zero constraint violation scale quadratically.

Slater’s Constant

  • Subject to: the anytime cumulative constraint above.

Slater’s Condition: existence of a feasible solution with the given Slater’s constant of slack. The Slater’s constant determines the number of feasible solutions: a smaller value implies a harder problem with fewer feasible solutions, hence more rounds to identify the optimal solution, more rounds to reach zero constraint violation, and a looser regret bound.
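One standard way to state Slater’s condition in this setting (notation assumed): there exist a feasible solution and a Slater’s constant $\delta > 0$ with
\[
  \mathbb{E}\bigl[c_t\bigr] \;\le\; b_t - \delta \qquad \forall\, t ,
\]
so a smaller $\delta$ leaves less slack and fewer feasible solutions.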

Theoretical Analysis: Comparison

Constrained contextual bandits compared on Regret Bound and Total Constraint Violation:
  • Our Model
  • Constrained Kernel Bandits [1]
  • Constrained Linear Bandits [2]

[1] Zhou et al. On Kernelized Multi-Armed Bandits with Constraints.

[2] Liu et al. An Efficient Pessimistic-Optimistic Algorithm for Stochastic Linear Bandits with General Constraints.