1 of 28

Negin Golrezaei (MIT)

Joint work with Bart Van Parys (MIT)

Optimal Learning for Structured Bandits

Seminar on Mathematics of Imaging Sciences, Data Sciences, and Optimization

Rensselaer Polytechnic Institute, Oct. 26, 2021.

Revise and Resubmit at Management Science

2 of 28

Multi-armed Bandits

Online decision-making under uncertainty:

Healthcare:

What drugs to prescribe?

(Arms: drugs)

Revenue management: What price to post?

(Arms: prices)

Online advertising:

Which ad to show to users?

(Arms: ads)

Fundamental trade-off:

    • Exploration: collect information to discover the best arm
    • Exploitation: exploit the collected information to play the arm that seems the best
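To make the trade-off concrete, here is a minimal sketch of the classical UCB1 rule (a standard baseline, not the algorithm proposed in this talk): each arm's empirical mean is inflated by a confidence bonus, so under-explored arms keep getting sampled.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Classical UCB1: play each arm once, then play the arm with the
    highest empirical mean plus a confidence bonus."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k
    means = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # initialization: try every arm once
        else:
            arm = max(range(k), key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running average
        total += r
    return total, counts

# Two Bernoulli arms with means 0.3 and 0.7; UCB1 should favor arm 1.
arms = [lambda rng: float(rng.random() < 0.3),
        lambda rng: float(rng.random() < 0.7)]
total, counts = ucb1(arms, horizon=2000)
```

Note that UCB1 treats the arms as unrelated; the point of the talk is precisely what this rule leaves on the table when the arms are correlated.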

3 of 28

Structured Multi-armed Bandits

Classical Multi-armed Bandits: Rewards of arms are independent of each other

In practice, they may NOT be independent

Structural information makes arms correlated!

Healthcare:

Structure: Similar drugs have similar performance

Revenue management: Structure: Demand goes down as price goes up

Online advertising:

Structure: Some ads are negatively correlated

4 of 28

Structural Information

Why is structural information important?

Structural information allows for transfer learning

    • Information obtained by one arm can be transferred to other arms

Thompson Sampling and UCB perform poorly for structured bandits!

    • They stop playing an arm as soon as they determine it is suboptimal
    • Yet playing suboptimal arms can help with transfer learning

How to deal with structural information?

    • Typical approach: tailored algorithms for specific structural problems (Lipschitz, linear, etc.)

Our approach: Unified framework that works for any convex structural information
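To make "convex structural information" concrete, here are standard examples (the notation is mine, not from the slides) of structures that correspond to convex sets of mean-reward vectors:

```latex
% Linear: means are a linear function of known arm features x_a
\Theta_{\mathrm{lin}} = \{\mu \in \mathbb{R}^K : \mu_a = x_a^\top \theta \ \text{for some } \theta \in \mathbb{R}^d\}

% Lipschitz: nearby arms (w.r.t. a metric d) have nearby means
\Theta_{\mathrm{Lip}} = \{\mu \in \mathbb{R}^K : |\mu_a - \mu_b| \le L\, d(a,b) \ \ \forall a,b\}

% Monotone demand: mean demand decreases as arms (prices) increase
\Theta_{\mathrm{mon}} = \{\mu \in \mathbb{R}^K : \mu_1 \ge \mu_2 \ge \cdots \ge \mu_K\}
```

Each of these is convex: an affine image, an intersection of halfspaces, and a polyhedral cone, respectively.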

5 of 28

Model

 

Exploration

Exploitation

 

 

 

6 of 28

What About Structural Information?

 

 

Online advertising:

Structure: Some ads are negatively correlated

 

 

 

7 of 28

Our Contributions and Main Results

 

8 of 28

Related Work

  • Learning under particular structural assumptions
    • Linear structure (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Mersereau et al., 2009; Lattimore and Szepesvári, 2017)
    • Lipschitz structure (Magureanu et al., 2014; Mao et al., 2018; Gupta et al., 2019)
    • Structural information in contextual bandits (Slivkins, 2011; Golrezaei et al., 2020)
    • Structures in revenue management problems (Keskin et al., 2014; den Boer, 2015; Agrawal et al., 2017; Bubeck et al., 2017; Ferreira et al., 2018; Golrezaei et al., 2019; Bastani et al., 2021; ...)
  • Taking a unified approach:
    • Combes et al. (2017): Their algorithm mimics the regret lower bound, but it has to solve a semi-infinite optimization problem in every round
    • Russo and Van Roy (2018): They balance reward gain with information gain, but may not obtain the optimal regret bound

9 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where
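The lower-bound equations on this slide did not survive extraction. For reference, the standard structured-bandit lower bound of Graves–Lai type — presumably the object being mimicked here, though the talk's exact notation may differ — reads:

```latex
% Any uniformly good policy \pi on instance \theta satisfies
\liminf_{T \to \infty} \frac{R_\pi(T)}{\log T} \;\ge\; C(\theta),
\qquad
C(\theta) \;=\;
\min_{\eta \ge 0} \ \sum_{a} \eta_a \, \Delta_a(\theta)
\quad \text{s.t.} \quad
\sum_{a} \eta_a \, \mathrm{KL}\big(\nu_a(\theta),\, \nu_a(\lambda)\big) \;\ge\; 1
\quad \forall \lambda \in \Lambda(\theta)
% \Delta_a: suboptimality gap of arm a;  \eta_a: exploration rate of arm a;
% \Lambda(\theta): "confusing" instances in the structure set whose optimal arm differs
```

The semi-infinite constraint (one per confusing instance) is what encodes the structure — and what makes this optimization hard to use directly.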

 

10 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where

 

This condition encapsulates the structural information!

But How?

11 of 28

Sufficient Exploration Condition

 

 

We have done enough exploration if

 

 


 

12 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where

 

 

Lower bound contains/uses structural information

13 of 28

Mimicking Regret Lower Bound

 

A big issue: the regret lower bound is NOT available!

    • The true reward distribution is NOT known

 

 

14 of 28

Mimicking the Regret Lower Bound Is Not Easy!

 

15 of 28

First Challenge: Converting a Semi-infinite Lower Bound to Its Convex Counterpart

 

 

 

 

Weighted KL distance

Let’s dualize the distance function

Regret lower bound (semi-infinite)

 

 

Dual counterpart (convex)
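One standard way to dualize a KL term — shown here only as an illustration of the general idea; the paper's dualization may be derived differently — is the Donsker–Varadhan variational formula:

```latex
\mathrm{KL}(P \,\|\, Q) \;=\; \sup_{f}\;\Big\{\, \mathbb{E}_P[f] \;-\; \log \mathbb{E}_Q\!\left[e^{f}\right] \Big\}
```

Since $\mathbb{E}_P[f]$ is linear in $P$, this rewrites a KL constraint as a family of constraints that are convex in the unknown distribution, which is the kind of move that turns the semi-infinite lower bound into a tractable convex counterpart.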

16 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

17 of 28

Second Challenge: Avoid Solving the Regret Lower Bound in Each Round

 

 

Testing this can be demanding!

    • We design a simpler (one-dimensional) information test:

    • Can be tested by solving a 1-dimensional convex optimization problem
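The slides do not spell out the test itself, but any one-dimensional convex problem of the kind mentioned can be solved cheaply without derivatives, e.g. by golden-section search. This is a generic sketch of that machinery, not the paper's actual information test:

```python
def minimize_1d(f, lo, hi, tol=1e-9):
    """Golden-section search: minimize a unimodal (e.g. convex) function
    on [lo, hi] using only function evaluations."""
    phi = (5 ** 0.5 - 1) / 2          # inverse golden ratio, ~0.618
    a, b = lo, hi
    c = b - phi * (b - a)
    d = a + phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = f(c)
        else:                         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# Example 1-D convex objective: x -> (x - 1.5)^2 + 1, minimized at x = 1.5
x_star = minimize_1d(lambda x: (x - 1.5) ** 2 + 1, 0.0, 10.0)
```

Each iteration shrinks the bracket by a constant factor, so the per-round cost of such a test is logarithmic in the desired accuracy.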

 

 

 


 

 

 

18 of 28

Information Test

    • We show if

we must have

 

    • Also,

 


 

 

 

 

19 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

20 of 28

 

Exploration

Exploitation

 

 

 

21 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

22 of 28

Let’s Put Everything Together: DUal Structure-based Algorithm (DUSA)

 

 

 

How did we use the structural information?

Here, by following the (dual) regret lower bound.
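The control flow described across these slides — exploit the empirically best arm while an information test passes, explore according to the lower-bound allocation when it fails — can be sketched as follows. This is only a schematic with toy stand-ins for the information test and the exploration allocation; the actual DUSA test and dual-based allocation are in the paper.

```python
import math
import random

def dusa_sketch(arms, horizon, info_test, explore_alloc, seed=0):
    """Schematic exploit-or-explore loop in the style described on the
    slides; info_test and explore_alloc are placeholders, NOT the
    paper's actual information test or dual-based allocation."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                    # play each arm once
        elif info_test(counts, means, t):
            arm = max(range(k), key=lambda a: means[a])    # exploit
        else:
            arm = explore_alloc(counts, means, t)          # explore
        r = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return counts, means

# Toy run: two deterministic arms, a "played enough" stand-in test,
# and a least-played-arm stand-in allocation.
counts, means = dusa_sketch(
    [lambda rng: 0.2, lambda rng: 0.8], horizon=200,
    info_test=lambda c, m, t: min(c) >= math.log(t),
    explore_alloc=lambda c, m, t: min(range(len(c)), key=lambda a: c[a]))
```

The point of the real algorithm is that the information test and the exploration allocation come from the dual regret lower bound, so exploration automatically adapts to whatever convex structure is given.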

23 of 28

Main Theorem: Asymptotic Optimal Regret

Theorem (Regret bound for DUSA)

 

 

 

 

Optimal regret bound

24 of 28

Proof Outline

 

 

25 of 28

Numerical Studies for Well-known Structured Bandits


  • DUSA’s regret is comparable to the regret of algorithms tailored to specific structured bandits
  • DUSA’s regret is more concentrated around its median

26 of 28

Numerical Studies for Novel Structured Bandits


  • Divergence bandit: imposes structures on the first and second moments of reward distributions

27 of 28

Takeaways

  • Provide a unified framework to study structured bandits

  • Present an algorithm called DUSA that obtains the optimal regret bound for any convex structural information

  • DUSA is the first universally optimal algorithm that is computationally tractable

28 of 28

Link to the paper: https://arxiv.org/abs/2007.07302

Email: golrezae@mit.edu
