Bandit Learning in Matching Markets
Zilong Wang
Ph.D. candidate at Shanghai Jiao Tong University
Shuai Li
Associate Professor at Shanghai Jiao Tong University
Outline
Part 1: Two-sided Matching Markets
Matching markets
A matching market has two sides
Workers
Employers
Both sides have preferences over the other side
Worker side
Based on payment or prior familiarity with the task
Both sides have preferences over the other side
Employer side
Based on the skill levels of workers
A case study: Medical interns [Roth (1984)]
Medical interns (cont.)
Stable matching
Participants have no incentive to abandon their current partners, i.e., there is no blocking pair: a pair of agents who both prefer each other over their current partners
Alvin E. Roth and Lloyd S. Shapley jointly won the Nobel Prize in 2012 for their contributions to stable matching theory.
Blocking pair
There may be more than one stable matching
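To make the definition concrete, here is a minimal Python sketch (not from the talk) that tests a one-to-one matching for blocking pairs; the dictionary-based preference format is an assumption for illustration.

```python
def is_stable(matching, worker_prefs, employer_prefs):
    """Check a one-to-one matching for blocking pairs.

    matching: dict worker -> employer (assumed to match everyone).
    worker_prefs[w] / employer_prefs[e]: lists, most preferred first.
    """
    partner = {e: w for w, e in matching.items()}  # employer -> worker
    for w, e_now in matching.items():
        # Scan the employers that w strictly prefers to its current partner.
        for e in worker_prefs[w][:worker_prefs[w].index(e_now)]:
            # (w, e) blocks if e also prefers w to its current worker.
            if employer_prefs[e].index(w) < employer_prefs[e].index(partner[e]):
                return False
    return True

# Example: with these preferences, matching w1-e1 and w2-e2 is stable.
prefs_w = {"w1": ["e1", "e2"], "w2": ["e1", "e2"]}
prefs_e = {"e1": ["w1", "w2"], "e2": ["w1", "w2"]}
print(is_stable({"w1": "e1", "w2": "e2"}, prefs_w, prefs_e))  # True
```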
A-side optimal stable matching1
1The existence is proved by Gale and Shapley (1962).
A-side pessimal stable matching
How to find a stable matching?
Gale-Shapley (GS) algorithm
[Gale and Shapley (1962)]
Agents on one side independently propose to agents on the other side according to their preference ranking until no rejection happens
No rejection happens!
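A minimal sketch of proposer-side deferred acceptance, assuming complete preference lists and equally sized sides:

```python
from collections import deque

def gale_shapley(proposer_prefs, receiver_prefs):
    """Proposer-side deferred acceptance.
    Returns the proposer-optimal stable matching as dict proposer -> receiver."""
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to propose to
    current = {}                                  # receiver -> tentative proposer
    free = deque(proposer_prefs)
    while free:
        p = free.popleft()
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        if r not in current:
            current[r] = p                        # r tentatively accepts p
        elif rank[r][p] < rank[r][current[r]]:
            free.append(current[r])               # r trades up; old partner is free again
            current[r] = p
        else:
            free.append(p)                        # r rejects p; p will propose again later
    return {p: r for r, p in current.items()}
```

Run with the A-side as proposers, this returns the A-side optimal stable matching discussed above.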
Gale-Shapley (GS) algorithm: Case 2
Step 1
Step 2
Step 3
GS properties: Stability
GS properties: Time complexity
GS properties: Optimality
GS properties: Strategic behavior
Strategy-proof: it is optimal for each participant to report truthfully
Deviating results in sub-optimal assignments
GS properties: Strategic behavior (cont.)
1Assuming all other agents report truthfully
Extension with sets: Many-to-one markets
Preferences over sets: Responsiveness
Set 1
Set 2
Preferences over sets: Substitutability
Set 1
Set 2
Substitutable preferences: An example
Deferred acceptance (DA) for substitutability
The same properties as GS:
[Kelso and Crawford (1982); Roth (1984b); Roth and Sotomayor (1992)]
Step 1
Step 2
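A minimal sketch of worker-proposing deferred acceptance where each employer is represented by an abstract choice function; the `choice[e]` interface is an assumption for illustration (substitutability of these choice functions is what guarantees a stable outcome):

```python
def deferred_acceptance(worker_prefs, choice):
    """worker_prefs[w]: employers, most preferred first.
    choice[e]: maps a set of workers to the subset employer e keeps."""
    next_idx = {w: 0 for w in worker_prefs}
    held = {e: frozenset() for e in choice}       # tentatively accepted workers
    free = list(worker_prefs)
    while free:
        # Each free worker proposes to its next-most-preferred employer.
        proposals = {e: set(held[e]) for e in choice}
        for w in free:
            proposals[worker_prefs[w][next_idx[w]]].add(w)
            next_idx[w] += 1
        # Each employer keeps its chosen subset; the rest are rejected,
        # possibly displacing previously held workers.
        rejected = set()
        for e in choice:
            keep = frozenset(choice[e](frozenset(proposals[e])))
            rejected |= proposals[e] - keep
            held[e] = keep
        free = [w for w in rejected if next_idx[w] < len(worker_prefs[w])]
    return held

# Example: a responsive (hence substitutable) employer with capacity q
# simply keeps its q highest-ranked applicants.
def responsive(ranking, q):
    return lambda applicants: frozenset(sorted(applicants, key=ranking.index)[:q])
```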
Summary of Part 1: Two-sided matching markets
But agents usually have unknown preferences in practice
These can be learned from iterative interactions!
Part 2: Multi-armed Bandits
What are bandits? [Lattimore and Szepesvári, 2020]
To accumulate as much reward as possible, which arm would you choose next?
Exploitation vs. Exploration
Time  |  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 |
Arm 1 | $1 | $0 |    |    | $1 | $1 | $0 |    |    |    |
Arm 2 |    |    | $1 | $0 |    |    |    |    |    |    |
Interactive machine learning
The learning agent interacts with the environment in a loop:
(1) Faced with candidate actions
(2) Choose an action
(3) The environment generates feedback
(4) Receive the feedback
(5) Improve the policy
Provides insights for agents in matching markets to learn their unknown preferences through iterative interactions
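A minimal sketch of this loop with a hypothetical Bernoulli environment; the uniform-random choice stands in for a real policy:

```python
import random

def interact(num_actions=3, horizon=10, seed=0):
    rng = random.Random(seed)
    means = [rng.random() for _ in range(num_actions)]  # hidden from the agent
    history = []
    for _ in range(horizon):
        action = rng.randrange(num_actions)             # (1)-(2) face actions, choose one
        feedback = rng.random() < means[action]         # (3) environment generates feedback
        history.append((action, feedback))              # (4) receive feedback
        # (5) improve policy: a real learner would update its estimates here
    return history
```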
Applications
Recommendation systems
[Li et al., 2010]
SAT solvers
[Liang et al., 2016]
Advertisement placement
[Yu et al., 2016]
Key part of reinforcement learning
[Hu et al., 2018]
Monte Carlo Tree Search (MCTS) in AlphaGo
[Kocsis and Szepesvári, 2006; Silver et al., 2016]
Public health: COVID-19 border testing in Greece
[Bastani et al., 2021]
Multi-armed bandits (MAB)
Arms: items, products, movies, companies, …
Expected rewards: CTR, preference value, …
Feedback: click information, satisfaction, …
Objective
Explore-then-commit (ETC) [Garivier et al., 2016]
A/B testing
Explore-then-commit (cont.)
Exploration: pull each arm equally often and form its sample mean
Exploitation: commit to the empirical best arm
Hoeffding’s inequality controls the estimation error of the sample means
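A minimal ETC sketch with Bernoulli arms; the exploration length m is left to the caller (Garivier et al. (2016) analyze how to tune it):

```python
import random

def explore_then_commit(means, m, horizon, seed=0):
    """means[i]: Bernoulli mean of arm i (unknown to the learner)."""
    rng = random.Random(seed)
    K = len(means)
    pull = lambda i: 1.0 if rng.random() < means[i] else 0.0
    sums = [0.0] * K
    # Exploration: pull every arm m times, like an A/B test.
    for i in range(K):
        for _ in range(m):
            sums[i] += pull(i)
    # Commit to the empirical best arm; by Hoeffding's inequality each
    # sample mean is within sqrt(log(1/delta) / (2m)) of its true mean w.h.p.
    best = max(range(K), key=lambda i: sums[i] / m)
    # Exploitation: play the chosen arm for the remaining rounds.
    reward = sum(sums) + sum(pull(best) for _ in range(horizon - K * m))
    return best, reward
```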
Upper confidence bound (UCB) [Auer et al., 2002]
Upper confidence bound (UCB) = sample mean (exploitation) + exploration bonus given by Hoeffding’s inequality
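A minimal UCB1 sketch; the bonus term sqrt(2 ln t / n_i) is the standard Hoeffding-based confidence radius from Auer et al. (2002):

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """UCB1 with Bernoulli arms: play argmax of mean + sqrt(2 ln t / n_i)."""
    rng = random.Random(seed)
    K = len(means)
    pull = lambda i: 1.0 if rng.random() < means[i] else 0.0
    counts, sums = [0] * K, [0.0] * K
    for i in range(K):                      # pull each arm once to initialize
        sums[i] += pull(i)
        counts[i] = 1
    for t in range(K + 1, horizon + 1):
        # Index = sample mean (exploitation) + confidence radius (exploration).
        i = max(range(K), key=lambda j: sums[j] / counts[j]
                + math.sqrt(2 * math.log(t) / counts[j]))
        sums[i] += pull(i)
        counts[i] += 1
    return counts                           # most pulls concentrate on the best arm
```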
Upper confidence bound (UCB) (cont.)
Improve ETC: Elimination [Audibert and Bubeck, 2010]
Improve ETC: Elimination (cont.)
Uniform exploration
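A minimal elimination sketch in the spirit of Audibert and Bubeck (2010), using a simplified fixed-confidence radius (an assumption for illustration): explore the surviving arms uniformly, then drop any arm whose upper confidence bound falls below the best lower confidence bound.

```python
import math
import random

def eliminate(means, horizon, delta=0.05, seed=0):
    rng = random.Random(seed)
    pull = lambda i: 1.0 if rng.random() < means[i] else 0.0
    active = list(range(len(means)))
    counts = [0] * len(means)
    sums = [0.0] * len(means)
    t = 0
    while t < horizon and len(active) > 1:
        for i in active:                     # uniform exploration over survivors
            sums[i] += pull(i)
            counts[i] += 1
            t += 1
        # Simplified confidence radius from Hoeffding's inequality.
        rad = {i: math.sqrt(math.log(2 / delta) / (2 * counts[i])) for i in active}
        best_lcb = max(sums[i] / counts[i] - rad[i] for i in active)
        active = [i for i in active if sums[i] / counts[i] + rad[i] >= best_lcb]
    return active                            # surviving candidate(s) for the best arm
```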
Lower bound [Lai and Robbins, 1985]
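For reference, the bound can be stated as follows (a standard form of the result; here Δ_i = μ* − μ_i, ν_i is arm i's reward distribution, ν* is the optimal arm's, and KL is the Kullback–Leibler divergence):

```latex
\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\ln T}
  \;\ge\; \sum_{i \,:\, \Delta_i > 0} \frac{\Delta_i}{\mathrm{KL}(\nu_i, \nu^{\ast})}
```

In particular, no consistent algorithm can do better than logarithmic regret, which is why the O(log T)-type guarantees above are order-optimal.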
Bandit learning in matching markets [Liu et al., 2020]
Satisfaction over this matching experience
For simplicity, assume arms know their preferences
Conflict resolution: One-to-one setting
Objective
Challenge in matching markets
How to control agents’ blocking behavior?
Summary of Part 2: Multi-armed bandits
Part 3: Bandit Algorithms in Matching Markets
Outline
Warm up: Centralized ETC [Liu et al., 2020]
Exploration → GS with estimated ranking → Exploitation
Remaining rounds: follow GS’s choice
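A minimal sketch of the centralized ETC idea, assuming N players and N arms with Bernoulli rewards, a conflict-free round-robin exploration schedule, known arm preferences, and reusing the `gale_shapley` sketch from Part 1:

```python
import random

def centralized_etc(true_means, arm_prefs, m, seed=0):
    """true_means[p][a]: player p's (unknown) mean reward for arm a.
    arm_prefs[a]: arms' known preference lists over players 0..N-1."""
    rng = random.Random(seed)
    N = len(true_means)
    sums = [[0.0] * N for _ in range(N)]
    # Exploration: in round t, player p pulls arm (p + t) % N, so the
    # platform's schedule is conflict-free and each pair is sampled m times.
    for t in range(m * N):
        for p in range(N):
            a = (p + t) % N
            sums[p][a] += 1.0 if rng.random() < true_means[p][a] else 0.0
    # Rank arms by sample means, then follow GS's choice for the rest.
    est_prefs = {p: sorted(range(N), key=lambda a: -sums[p][a]) for p in range(N)}
    return gale_shapley(est_prefs, arm_prefs)
```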
Unique stable matching
Regret type | Regret bound | Uniqueness condition | References |
Unique stable matching | | Serial dictatorship | [Sankararaman et al., 2021] |
Unique stable matching | | | [Maheshwari et al., 2022] |
Unique stable matching | | Uniqueness consistency (the most general) | [Basu et al., 2021] |
How to balance exploration and exploitation in a more appropriate way?
PhasedETC [Basu et al., 2021]
phase 1 | phase 2 | … (T rounds in total)
Phase length grows exponentially
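A small sketch of the schedule: the doubling phase lengths follow the slide, while the per-phase explore/exploit split below is a hypothetical placeholder, not the paper's exact choice.

```python
def phase_schedule(horizon):
    t, k, phases = 0, 0, []
    while t < horizon:
        length = min(2 ** k, horizon - t)    # phase length grows exponentially
        explore = max(1, length // (k + 2))  # placeholder split: shrinking share
        phases.append({"phase": k, "explore": explore, "exploit": length - explore})
        t += length
        k += 1
    return phases
```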
PhasedETC: Regret analysis
Over T rounds: Explore → GS + exploit → Explore → GS + exploit → …
Enough exploration to learn preferences, but the bound contains an exponentially large term
Explore-then-GS (ETGS) [Kong and Li, 2023]
phase 1 | phase 2 | … (T rounds; phase length grows exponentially)
Once players communicate and find that all of them estimate their preferences well: GS + exploit
ETGS implementation: Communication
Remark: it suffices for each player to identify its top N+1 ranked arms in order to find the player-optimal stable matching.
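A minimal sketch of the stopping test this remark induces: with per-arm confidence intervals [lcb, ucb], a player has "estimated well" once its top arms are provably ordered above everything below them (the function and names here are illustrative assumptions):

```python
def top_arms_identified(lcb, ucb, num_top):
    """lcb[a], ucb[a]: this player's confidence interval for arm a's mean.
    True once the top `num_top` arms and their order are provably fixed."""
    order = sorted(range(len(lcb)), key=lambda a: -lcb[a])
    for i in range(num_top):
        a = order[i]
        # Arm a must provably beat every arm currently ranked below it.
        if any(lcb[a] < ucb[b] for b in order[i + 1:]):
            return False
    return True

# Per the remark above, exploration can stop for a player once this holds
# with num_top = N + 1.
```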
ETGS implementation: Communication (cont.)
[Figure: in each communication round, each player signals whether it estimates its preferences well through which arm it selects]
ETGS implementation: Communication (cont.)
[Figure: a player that already estimates well always selects the same arm in the communication round]
ETGS: Regret analysis [Kong and Li, 2023]
Improvement: Adaptive ETGS
ETGS: Explore → Exploit, where the switch depends on all players.
Adaptive ETGS: each player alternates Explore → Exploit → Explore → Exploit → …, where the switch depends on the player itself; a player resumes exploring once rejected by the exploited arm.
Adaptive ETGS: Strategic behavior
Explore → Exploit
A player has identified its optimal arm. What should it report?
What if it reports that it has not?
What if it reports a non-optimal arm?
Adaptive ETGS: Regret [KWL, NeurIPS 2024]
Once an optimal arm has been identified: stop exploring!
Instantly eliminate the K − N sub-optimal arms, while maintaining N arms for cooperative exploration to avoid conflicts.
Lower bound [Sankararaman et al., 2021]
Lower bound (cont.)
Lower bound (cont.)
Regret type | Regret bound | Communication type | Strategy-proofness | References |
Player-optimal | | Centralized | ? | [Liu et al., 2020] |
Player-pessimal | | Centralized | ? | [Liu et al., 2020] |
Player-pessimal | | Decentralized, observed matching outcomes | No | [Liu et al., 2021] |
Player-pessimal | | Decentralized, observed matching outcomes | No | [Kong et al., 2022] |
Unique | | Decentralized | ? | [Sankararaman et al., 2021; Basu et al., 2021; Maheshwari et al., 2022] |
Unique | | Centralized | ? | [Wang and Li, 2024; Kong et al., 2024] |
Optimal stable bandits (Unique) | | Decentralized | / | [Sankararaman et al., 2021] |
Player-optimal | | Decentralized | ? | [Basu et al., 2021] |
Player-optimal | | Decentralized, observed matching outcomes | No | [Kong and Li, 2023] |
Player-optimal | | Decentralized | No | [Zhang et al., 2022] |
Player-optimal | | Decentralized | Yes | [Kong et al., 2024] |
Indifference stable | | Decentralized | ? | [Kong et al., 2025] |
How about many-to-one markets?
Extension of one-to-one algorithms
Centralized ETC/UCB [Wang et al., 2022]
Decentralized UCB [Wang et al., 2022]
ETGS [Kong and Li, 2024]
Results in the same regret upper bounds
Responsiveness: Regret
Many-to-one markets: Substitutability
Substitutability: Algorithm [Kong and Li, 2024]
Step 1 of arm-proposal DA ↔ Step 1 of the online algorithm
All players determine which arm is optimal, then enter the next step
Substitutability: Theoretical analysis
The first result for combinatorial preferences
Many-to-one markets: Results overview
Setting | Regret type | Regret bound | Communication type | Strategy-proofness | References |
Responsiveness | Player-optimal | | Centralized | ? | [Wang et al., 2022] |
Responsiveness | Player-pessimal | | Centralized | ? | [Wang et al., 2022] |
Responsiveness | Player-pessimal | | Decentralized, observed matching outcomes | No | [Wang et al., 2022] |
Responsiveness | Player-optimal | | Decentralized, observed matching outcomes | No | [Kong and Li, 2024] |
Responsiveness | Player-optimal | | Decentralized, observed matching outcomes | Yes | [Kong and Li, 2024] |
Responsiveness | Player-optimal | | Decentralized | No | [Zhang and Fang, 2024] |
Substitutability | Player-pessimal | | Decentralized, observed matching outcomes, known arms’ preferences | ? | [Kong and Li, 2024] |
Other setting variants
Summary of Part 3: Bandit algorithms in matching markets
Part 4: Open Problems
Open problems: Matching markets
Open problems: Regret
Regret type | Regret bound | Communication type | References |
Optimal stable bandits (unique stable matching) | | Decentralized | [Sankararaman et al., 2021] |
Player-optimal stable matching | | Decentralized | [Kong et al., 2024] |
There is a gap between these two bounds.
Open problems: Regret (cont.)
Setting | Regret type | Regret bound | Communication type | References |
Responsiveness | Player-optimal stable matching | | | [Kong and Li, 2024] |
Responsiveness | Player-optimal stable matching | | Decentralized, known matching outcomes | [Kong and Li, 2024] |
Substitutability | Player-pessimal stable matching | | Decentralized, known matching outcomes | [Kong and Li, 2024] |
Is the player-optimal stable matching achievable?
What is the optimal regret under responsiveness?
Open problems: Regret & Strategic behavior
Regret type | Regret Bound | Strategy-proof | References |
Player-optimal stable matching | | No | [Kong and Li, 2023; Zhang et al., 2022] |
Player-optimal stable matching | | Yes | [Kong and Li, 2024] |
Open problems: Matching markets (cont.)
Thanks! & Questions?
Credit: Some images are from Flaticon.com
Zilong Wang
Shuai Li
References: Part 1
References: Part 2
References: Part 3
References: Part 4