Negin Golrezaei (MIT)
Joint work with Bart Van Parys (MIT)
Optimal Learning for Structured Bandits
Seminar on Mathematics of Imaging Sciences, Data Sciences, and Optimization
Rensselaer Polytechnic Institute, Oct. 26 2021.
Revise and Resubmit at Management Science
Multi-armed Bandits
Online decision-making under uncertainty:
Healthcare:
What drugs to prescribe?
(Arms: drugs)
Revenue management: What price to post?
(Arms: prices)
Online advertising:
Which ad to show to users?
(Arms: ads)
Fundamental trade-off:
Structured Multi-armed Bandits
Classical Multi-armed Bandits: Rewards of arms are independent of each other
In practice, they may NOT be independent
Structural information makes arms correlated!
Healthcare:
Structure: Similar drugs have similar performance
Revenue management: Structure: Demand goes down as price goes up
Online advertising:
Structure: Some ads are negatively correlated
Structural Information
Why is structural information important?
Structural information allows for transfer learning
Thompson Sampling and UCB perform poorly for structured bandits!!
How to deal with structural information?
Our approach: Unified framework that works for any convex structural information
Model
Exploration
Exploitation
What About Structural Information?
Online advertising:
Structure: Some ads are negatively correlated
Our Contributions and Main Results
Related Work
How to Design a Policy for ANY Structural Information?
Main idea: mimic something that directly encapsulates structural information!
How about mimicking the (information-theoretic) regret lower bound?
where
How to Design a Policy for ANY Structural Information?
Main idea: mimic something that directly encapsulates structural information!
How about mimicking the (information-theoretic) regret lower bound?
where
This condition encapsulates the structural information!
But How?
Sufficient Exploration Condition
We have done enough exploration if
P
How to Design a Policy for ANY Structural Information?
Main idea: mimic something that directly encapsulates structural information!
How about mimicking the (information-theoretic) regret lower bound?
where
Lower bound contains/uses structural information
Mimicking Regret Lower Bound
A big issue: the regret lower bound is NOT available!
Mimicking the Regret Lower Bound Is not Easy!
First Challenge: Converting a Semi-infinite Lower Bound to Its Convex Counterpart
Weighted KL distance
Let’s dualize the distance function
Regret lower bound (semi-infinite)
Dual counterpart (Convex)
Mimicking the Regret Lower Bound Is not Easy!
Second Challenge: Avoid Solving the Regret Lower Bound in Each Round
Testing this can be demanding!
H
Information Test
we must have
H
Mimicking the Regret Lower Bound Is not Easy!
Exploration
Exploitation
Mimicking the Regret Lower Bound Is not Easy!
Let’s Put Everything Together: �DUal Structure-based Algorithm (DUSA)
How did we use the structural information?
Here, by following
the (dual) regret lower bound
Main Theorem: Asymptotic Optimal Regret
Theorem (Regret bound for DUSA)
Optimal regret bound
Proof Outline
Numerical Studies for Well-known Structured Bandits
25
Numerical Studies for Novel Structured Bandits
26
Takeaways
Link to the paper: https://arxiv.org/abs/2007.07302
Email: golrezae@mit.edu
28