1 of 28

Negin Golrezaei (MIT)

Joint work with Bart Van Parys (MIT)

Optimal Learning for Structured Bandits

Seminar on Mathematics of Imaging Sciences, Data Sciences, and Optimization

Rensselaer Polytechnic Institute, Oct. 26, 2021.

Revise and Resubmit at Management Science

2 of 28

Multi-armed Bandits

Online decision-making under uncertainty:

Healthcare:

What drugs to prescribe?

(Arms: drugs)

Revenue management: What price to post?

(Arms: prices)

Online advertising:

Which ad to show to users?

(Arms: ads)

Fundamental trade-off:

    • Exploration: collect information to discover the best arm
    • Exploitation: exploit the collected information to play the arm that seems the best
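To make the trade-off concrete, here is a minimal sketch of the classical UCB1 rule (a standard baseline, not the algorithm proposed in this talk): each arm's empirical mean is inflated by a confidence bonus, so under-explored arms keep getting sampled.

```python
import math
import random

def ucb1(reward_fns, horizon, seed=0):
    """Classical UCB1: play each arm once, then play the arm with the
    highest empirical mean plus a confidence bonus."""
    rng = random.Random(seed)
    k = len(reward_fns)
    counts = [0] * k
    means = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # initialization: try every arm once
        else:
            arm = max(range(k), key=lambda a: means[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fns[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]   # running average
        total += r
    return total, counts

# Two Bernoulli arms with means 0.3 and 0.7; UCB1 should favor arm 1.
arms = [lambda rng: float(rng.random() < 0.3),
        lambda rng: float(rng.random() < 0.7)]
total, counts = ucb1(arms, horizon=2000)
```

Note that UCB1 treats the arms as unrelated; the point of the talk is precisely what this rule leaves on the table when the arms are correlated.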

3 of 28

Structured Multi-armed Bandits

Classical Multi-armed Bandits: Rewards of arms are independent of each other

In practice, they may NOT be independent

Structural information makes arms correlated!

Healthcare:

Structure: Similar drugs have similar performance

Revenue management: Structure: Demand goes down as price goes up

Online advertising:

Structure: Some ads are negatively correlated

4 of 28

Structural Information

Why is structural information important?

Structural information allows for transfer learning

    • Information obtained by one arm can be transferred to other arms

Thompson Sampling and UCB perform poorly for structured bandits!

    • They stop playing an arm as soon as they determine it is suboptimal
    • Yet playing suboptimal arms can help with transfer learning

How to deal with structural information?

    • Typical approach: tailored algorithms for specific structural problems (Lipschitz, linear, etc.)

Our approach: Unified framework that works for any convex structural information
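To make "convex structural information" concrete, here are standard examples (the notation is mine, not from the slides) of structures that correspond to convex sets of mean-reward vectors:

```latex
% Linear: means are a linear function of known arm features x_a
\Theta_{\mathrm{lin}} = \{\mu \in \mathbb{R}^K : \mu_a = x_a^\top \theta \ \text{for some } \theta \in \mathbb{R}^d\}

% Lipschitz: nearby arms (w.r.t. a metric d) have nearby means
\Theta_{\mathrm{Lip}} = \{\mu \in \mathbb{R}^K : |\mu_a - \mu_b| \le L\, d(a,b) \ \ \forall a,b\}

% Monotone demand: mean demand decreases as arms (prices) increase
\Theta_{\mathrm{mon}} = \{\mu \in \mathbb{R}^K : \mu_1 \ge \mu_2 \ge \cdots \ge \mu_K\}
```

Each of these is convex: an affine image, an intersection of halfspaces, and a polyhedral cone, respectively.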

5 of 28

Model

 

Exploration

Exploitation

 

 

 

6 of 28

What About Structural Information?

 

 

Online advertising:

Structure: Some ads are negatively correlated

 

 

 

7 of 28

Our Contributions and Main Results

 

8 of 28

Related Work

  • Learning under particular structural assumptions
    • Linear structure (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Mersereau et al., 2009; Lattimore and Szepesvári, 2017)
    • Lipschitz structure (Magureanu et al., 2014; Mao et al., 2018; Gupta et al., 2019)
    • Structural information in contextual bandits (Slivkins, 2011; Golrezaei et al., 2020)
    • Structures in revenue management problems (Keskin et al., 2014; den Boer, 2015; Agrawal et al., 2017; Bubeck et al., 2017; Ferreira et al., 2018; Golrezaei et al., 2019; Bastani et al., 2021; ...)
  • Taking a unified approach:
    • Combes et al. (2017): Their algorithm mimics the regret lower bound, but it has to solve a semi-infinite optimization problem in every round
    • Russo and Van Roy (2018): They balance reward gain with information gain, but may not obtain the optimal regret bound

9 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where
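The lower-bound equations on this slide did not survive extraction. For reference, the standard structured-bandit lower bound of Graves–Lai type — presumably the object being mimicked here, though the talk's exact notation may differ — reads:

```latex
% Any uniformly good policy \pi on instance \theta satisfies
\liminf_{T \to \infty} \frac{R_\pi(T)}{\log T} \;\ge\; C(\theta),
\qquad
C(\theta) \;=\;
\min_{\eta \ge 0} \ \sum_{a} \eta_a \, \Delta_a(\theta)
\quad \text{s.t.} \quad
\sum_{a} \eta_a \, \mathrm{KL}\big(\nu_a(\theta),\, \nu_a(\lambda)\big) \;\ge\; 1
\quad \forall \lambda \in \Lambda(\theta)
% \Delta_a: suboptimality gap of arm a;  \eta_a: exploration rate of arm a;
% \Lambda(\theta): "confusing" instances in the structure set whose optimal arm differs
```

The semi-infinite constraint (one per confusing instance) is what encodes the structure — and what makes this optimization hard to use directly.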

 

10 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where

 

This condition encapsulates the structural information!

But How?

11 of 28

Sufficient Exploration Condition

 

 

We have done enough exploration if

 

 


 

12 of 28

How to Design a Policy for ANY Structural Information?

Main idea: mimic something that directly encapsulates structural information!

How about mimicking the (information-theoretic) regret lower bound?

 

where

 

 

Lower bound contains/uses structural information

13 of 28

Mimicking Regret Lower Bound

 

A big issue: the regret lower bound is NOT available!

    • The true reward distribution is NOT known

 

 

14 of 28

Mimicking the Regret Lower Bound Is Not Easy!

 

15 of 28

First Challenge: Converting a Semi-infinite Lower Bound to Its Convex Counterpart

 

 

 

 

Weighted KL distance

Let’s dualize the distance function

Regret lower bound (semi-infinite)

 

 

Dual counterpart (convex)
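One standard way to dualize a KL term — shown here only as an illustration of the general idea; the paper's dualization may be derived differently — is the Donsker–Varadhan variational formula:

```latex
\mathrm{KL}(P \,\|\, Q) \;=\; \sup_{f}\;\Big\{\, \mathbb{E}_P[f] \;-\; \log \mathbb{E}_Q\!\left[e^{f}\right] \Big\}
```

Since $\mathbb{E}_P[f]$ is linear in $P$, this rewrites a KL constraint as a family of constraints that are convex in the unknown distribution, which is the kind of move that turns the semi-infinite lower bound into a tractable convex counterpart.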

16 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

17 of 28

Second Challenge: Avoid Solving the Regret Lower Bound in Each Round

 

 

Testing this can be demanding!

    • We design a simpler (one-dimensional) information test:

    • Can be tested by solving a 1-dimensional convex optimization problem
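The slides do not spell out the test itself, but any one-dimensional convex problem of the kind mentioned can be solved cheaply without derivatives, e.g. by golden-section search. This is a generic sketch of that machinery, not the paper's actual information test:

```python
def minimize_1d(f, lo, hi, tol=1e-9):
    """Golden-section search: minimize a unimodal (e.g. convex) function
    on [lo, hi] using only function evaluations."""
    phi = (5 ** 0.5 - 1) / 2          # inverse golden ratio, ~0.618
    a, b = lo, hi
    c = b - phi * (b - a)
    d = a + phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                   # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - phi * (b - a)
            fc = f(c)
        else:                         # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + phi * (b - a)
            fd = f(d)
    return (a + b) / 2

# Example 1-D convex objective: x -> (x - 1.5)^2 + 1, minimized at x = 1.5
x_star = minimize_1d(lambda x: (x - 1.5) ** 2 + 1, 0.0, 10.0)
```

Each iteration shrinks the bracket by a constant factor, so the per-round cost of such a test is logarithmic in the desired accuracy.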

 

 

 


 

 

 

18 of 28

Information Test

    • We show if

we must have

 

    • Also,

 


 

 

 

 

19 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

20 of 28

 

Exploration

Exploitation

 

 

 

21 of 28

 

Mimicking the Regret Lower Bound Is Not Easy!

22 of 28

Let’s Put Everything Together: DUal Structure-based Algorithm (DUSA)

 

 

 

How did we use the structural information?

Here, by following the (dual) regret lower bound.
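The control flow described across these slides — exploit the empirically best arm while an information test passes, explore according to the lower-bound allocation when it fails — can be sketched as follows. This is only a schematic with toy stand-ins for the information test and the exploration allocation; the actual DUSA test and dual-based allocation are in the paper.

```python
import math
import random

def dusa_sketch(arms, horizon, info_test, explore_alloc, seed=0):
    """Schematic exploit-or-explore loop in the style described on the
    slides; info_test and explore_alloc are placeholders, NOT the
    paper's actual information test or dual-based allocation."""
    rng = random.Random(seed)
    k = len(arms)
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                                    # play each arm once
        elif info_test(counts, means, t):
            arm = max(range(k), key=lambda a: means[a])    # exploit
        else:
            arm = explore_alloc(counts, means, t)          # explore
        r = arms[arm](rng)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return counts, means

# Toy run: two deterministic arms, a "played enough" stand-in test,
# and a least-played-arm stand-in allocation.
counts, means = dusa_sketch(
    [lambda rng: 0.2, lambda rng: 0.8], horizon=200,
    info_test=lambda c, m, t: min(c) >= math.log(t),
    explore_alloc=lambda c, m, t: min(range(len(c)), key=lambda a: c[a]))
```

The point of the real algorithm is that the information test and the exploration allocation come from the dual regret lower bound, so exploration automatically adapts to whatever convex structure is given.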

23 of 28

Main Theorem: Asymptotic Optimal Regret

Theorem (Regret bound for DUSA)

 

 

 

 

Optimal regret bound

24 of 28

Proof Outline

 

 

25 of 28

Numerical Studies for Well-known Structured Bandits


  • DUSA’s regret is comparable to the regret of algorithms tailored to specific structured bandits
  • DUSA’s regret is more concentrated around its median

26 of 28

Numerical Studies for Novel Structured Bandits


  • Divergence bandit: imposes structures on the first and second moments of reward distributions

27 of 28

Takeaways

  • Provide a unified framework to study structured bandits

  • Present an algorithm called DUSA that obtains the optimal regret bound for any convex structural information

  • DUSA is the first universally optimal algorithm that is computationally tractable

28 of 28

Link to the paper: https://arxiv.org/abs/2007.07302

Email: golrezae@mit.edu
