
Bandit Learning in Matching Markets

Zilong Wang

Ph.D. candidate at Shanghai Jiao Tong University

Shuai Li

Associate professor at Shanghai Jiao Tong University


Outline

  • Part 1: Two-sided matching markets

  • Part 2: Multi-armed bandits

  • Part 3: Bandit algorithms in matching markets

  • Part 4: Open Problems


Part 1: Two-sided Matching Markets


Matching markets

  • Talent cultivation (school admissions, student internships)
  • Task allocation (crowdsourcing assignments, domestic services)
  • Resource distribution (housing allocation, organ donation allocation)


A matching market has two sides


Workers

Employers


Both sides have preferences over the other side



Worker side

Based on payment or prior familiarity with the task



Employer side

Based on the skill levels of workers


A case study: Medical interns [Roth (1984a)]

  • Hospital side
    • Internships have relatively low cost
  • Student side
    • Closely engage with clinical medicine through internships

  • Historical practice
    • Medical schools first publish students' grade rankings
    • Then hospitals start signing internship agreements with students
  • How to match?



Stable matching

Participants have no incentive to abandon their current partner, i.e., there is no blocking pair: two agents who both prefer each other over their current partners.

Alvin E. Roth and Lloyd S. Shapley jointly won the Nobel Prize in 2012 for their contributions to stable matching theory.

(Figure: illustration of a blocking pair)
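To make the definition concrete, here is a minimal sketch (not from the slides; the agents, preference lists, and matching are illustrative) that searches a one-to-one matching for a blocking pair:

```python
# A hypothetical stability check: preferences map each agent to a list of
# partners, most preferred first; `match` maps A-side agents to B-side agents.

def prefers(pref_list, x, y):
    """True if x appears before y in pref_list, i.e., x is preferred."""
    return pref_list.index(x) < pref_list.index(y)

def find_blocking_pair(a_prefs, b_prefs, match):
    partner_of = {b: a for a, b in match.items()}
    for a, b_current in match.items():
        for b in a_prefs[a]:
            if b == b_current:
                break  # every later b in the list is worse for a
            if prefers(b_prefs[b], a, partner_of[b]):
                return (a, b)  # both prefer each other: a blocking pair
    return None  # no blocking pair: the matching is stable

a_prefs = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
b_prefs = {"b1": ["a2", "a1"], "b2": ["a2", "a1"]}
print(find_blocking_pair(a_prefs, b_prefs, {"a1": "b1", "a2": "b2"}))  # ('a2', 'b1')
```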

There may be more than one stable matching


A-side optimal stable matching¹


¹Existence was proved by Gale and Shapley (1962).


A-side pessimal stable matching


How to find a stable matching?


Gale-Shapley (GS) algorithm

[Gale and Shapley (1962)]

Agents on one side independently propose to agents on the other side, in order of their preference rankings, until no rejection happens
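A minimal sketch of this proposing loop, under the usual textbook assumptions (equal numbers on both sides, complete preference lists; names are illustrative):

```python
# Gale-Shapley with the A-side proposing. Preference lists are ordered
# most preferred first; assumes complete lists and equally sized sides.

def gale_shapley(a_prefs, b_prefs):
    rank = {b: {a: i for i, a in enumerate(p)} for b, p in b_prefs.items()}
    next_idx = {a: 0 for a in a_prefs}   # next partner each a will propose to
    current = {}                          # b -> tentatively accepted a
    free = list(a_prefs)
    while free:
        a = free.pop()
        b = a_prefs[a][next_idx[a]]       # a's best not-yet-tried option
        next_idx[a] += 1
        if b not in current:
            current[b] = a                # b tentatively accepts a
        elif rank[b][a] < rank[b][current[b]]:
            free.append(current[b])       # b trades up and rejects its old a
            current[b] = a
        else:
            free.append(a)                # b rejects a
    return {a: b for b, a in current.items()}

a_prefs = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}
b_prefs = {"b1": ["a2", "a1"], "b2": ["a1", "a2"]}
print(gale_shapley(a_prefs, b_prefs))     # {'a2': 'b1', 'a1': 'b2'}
```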


Gale-Shapley (GS) algorithm: Case 2

(Figure: example GS run, Steps 1–3)

GS properties: Stability

  • GS always terminates and outputs a stable matching [Gale and Shapley (1962)]


GS properties: Time complexity

  • In a market with N agents per side, GS terminates after at most O(N²) proposals, i.e., O(N²) time


GS properties: Optimality

  • Who proposes matters
    • Each proposing-side agent is happiest: matched with its most preferred partner among all stable matchings
    • Each accepting-side agent is matched with its least preferred partner among all stable matchings
    • A-side optimal stable matching = B-side pessimal stable matching


GS properties: Strategic behavior

Strategy-proof: it is optimal for each participant to report truthfully; deviating results in sub-optimal assignments.


GS properties: Strategic behavior (cont.)

  • GS is strategy-proof for the proposing side¹ [Dubins and Freedman (1981); Roth (1982)]
    • Best for the proposing-side agents to report truthfully
  • GS is not strategy-proof for the acceptance side

¹Assuming all other agents report truthfully.


Extension with sets: Many-to-one markets

  • An agent may match more than one partner
    • Applications
      • An employer can hire a group of workers
      • A school can admit multiple students


Preferences over sets: Responsiveness

  • A preference over sets is responsive if replacing any single member with a more preferred individual yields a more preferred set

(Figure: comparing Set 1 and Set 2)

Preferences over sets: Substitutability

  • Agents have preferences over groups (instead of simply individuals)

(Figure: comparing Set 1 and Set 2)

  • Substitutability: if an agent chooses a partner from a set of candidates, it still chooses that partner from any subset containing them (see the sketch below)
  • Naturally holds under responsiveness
  • One of the most general known conditions ensuring the existence of a stable matching
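A concrete illustration (an assumed example, not from the slides): under a responsive ranking with capacity q, an employer's choice from any candidate set is its top-q candidates, and that choice function is substitutable:

```python
# A capacity-q choice function induced by a ranking (best first). It is
# substitutable: a chosen worker stays chosen in any subset containing them.
def choose(candidates, ranking, q):
    return sorted(candidates, key=ranking.index)[:q]

ranking = ["w1", "w2", "w3", "w4"]
print(choose({"w2", "w3", "w4"}, ranking, 2))  # ['w2', 'w3']
print(choose({"w2", "w4"}, ranking, 2))        # ['w2', 'w4']: w2 still chosen
```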


Substitutable preferences: An example


Deferred acceptance (DA) for substitutability

  • The extension of GS under substitutability


The same properties as GS:

  • Stability
  • Time complexity
  • Optimality
  • Strategic behavior (when the A-side proposes)

[Kelso and Crawford (1982); Roth (1984b); Roth and Sotomayor (1992)]

(Figure: example DA run, Steps 1–2)

Summary of Part 1: Two-sided matching markets

  • Introduction to matching markets
  • Stable matching
  • Gale-Shapley algorithm: Procedure and properties
    • Stability
    • Time complexity
    • Optimality
    • Strategic behavior
  • Extension to many-to-one markets
    • Responsiveness
    • Substitutability
    • Deferred-acceptance algorithm


But agents usually have unknown preferences in practice

They can be learned through iterative interactions!


Part 2: Multi-armed Bandits


What are bandits? [Lattimore and Szepesvári, 2020]

To accumulate as much reward as possible, which arm would you choose next?

Exploitation vs. Exploration

Rewards observed so far (over a 10-round horizon):

| Arm   | Observed rewards   |
|-------|--------------------|
| Arm 1 | $1, $0, $1, $1, $0 |
| Arm 2 | $1, $0             |

Interactive machine learning

(Figure: the interactive learning loop. (1) The learning agent is faced with candidate actions; (2) it chooses an action; (3) the environment generates feedback; (4) the agent receives the feedback; (5) it improves its policy.)

This loop provides insights for agents in matching markets to learn their unknown preferences through iterative interactions.


Applications

  • Recommendation systems [Li et al., 2010]
  • SAT solvers [Liang et al., 2016]
  • Advertisement placement [Yu et al., 2016]
  • Key part of reinforcement learning [Hu et al., 2018]
  • Monte-Carlo Tree Search (MCTS) in AlphaGo [Kocsis and Szepesvári, 2006; Silver et al., 2016]
  • Public health: COVID-19 border testing in Greece [Bastani et al., 2021]

Multi-armed bandits (MAB)

  • Arms: items, products, movies, companies, …
  • Mean rewards: CTR, preference value, …
  • Feedback: click information, satisfaction, …
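A minimal sketch of the model, assuming Bernoulli rewards (the means are illustrative):

```python
# K-armed stochastic bandit: each arm k has an unknown mean means[k];
# pulling it returns a 0/1 reward with that mean.
import random

class BernoulliBandit:
    def __init__(self, means):
        self.means = means        # hidden from the learner
        self.best = max(means)    # used only to measure regret

    def pull(self, k):
        return 1.0 if random.random() < self.means[k] else 0.0

env = BernoulliBandit([0.3, 0.5, 0.7])   # e.g., CTRs of three items
print(env.pull(2))
```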


Objective

Minimize the cumulative regret over T rounds:

R(T) = T · μ* − E[ Σ_{t=1}^{T} μ_{a_t} ],

where μ* is the largest mean reward and a_t is the arm pulled at round t.


Explore-then-commit (ETC) [Garivier et al., 2016]

  • Explore each arm uniformly for m rounds each, then commit to the empirically best arm for the remaining rounds

Analogous to A/B testing.


Explore-then-commit (cont.)

  • The regret splits into an exploration term (linear in the exploration length) and an exploitation term: Hoeffding's inequality applied to the sample means bounds the probability of committing to a sub-optimal arm

(Figure: timeline split into Exploration followed by Exploitation)

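A minimal ETC sketch on a Bernoulli bandit (all parameters illustrative; `m` is the per-arm exploration budget):

```python
# Explore-then-Commit: pull each arm m times round-robin, then commit to the
# arm with the best sample mean for all remaining rounds.
import random

means = [0.3, 0.5, 0.7]                           # unknown to the learner
pull = lambda k: float(random.random() < means[k])

def etc(K, T, m):
    counts, sums, best = [0] * K, [0.0] * K, None
    total = 0.0
    for t in range(T):
        if t < K * m:
            k = t % K                             # uniform exploration
        else:
            if best is None:                      # commit once
                best = max(range(K), key=lambda i: sums[i] / counts[i])
            k = best
        r = pull(k)
        counts[k] += 1; sums[k] += r; total += r
    return total

print(etc(K=3, T=10_000, m=100))                  # close to 0.7 * 10_000
```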

Upper confidence bound (UCB) [Auer et al., 2002]

  • Optimism in the face of uncertainty: in each round, pull the arm with the largest upper confidence bound

The UCB index of arm i at round t:

UCB_i(t) = μ̂_i(t) + sqrt( 2 ln t / T_i(t) ),

where the sample mean μ̂_i(t) drives exploitation, and the bonus term, derived from Hoeffding's inequality and shrinking with the pull count T_i(t), drives exploration.
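A minimal UCB1 sketch under the same Bernoulli setup (parameters illustrative):

```python
# UCB1 [Auer et al., 2002]: pull each arm once, then always pull the arm
# maximizing sample mean + sqrt(2 ln t / T_i(t)).
import math, random

means = [0.3, 0.5, 0.7]
pull = lambda k: float(random.random() < means[k])

def ucb1(K, T):
    counts, sums, total = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                             # initialize: one pull each
        else:
            k = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(k)
        counts[k] += 1; sums[k] += r; total += r
    return total

print(ucb1(K=3, T=10_000))
```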


Upper confidence bound (UCB) (cont.)

  • UCB achieves regret O( Σ_{i: Δ_i > 0} ln T / Δ_i ), where Δ_i is the gap between the best arm's mean and arm i's mean

Improve ETC: Elimination [Audibert and Bubeck, 2010]

  • Maintain a set of candidate arms; explore them uniformly and eliminate any arm whose upper confidence bound falls below the lower confidence bound of another arm


Improve ETC: Elimination (cont.)

(Figure: remaining arms are explored uniformly; an arm is eliminated once it is confidently sub-optimal)
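A minimal successive-elimination sketch (the confidence radius is one standard Hoeffding-style choice; parameters illustrative):

```python
# Successive elimination: explore remaining arms uniformly; drop an arm once
# its UCB falls below the best LCB among remaining arms.
import math, random

means = [0.3, 0.5, 0.7]
pull = lambda k: float(random.random() < means[k])

def successive_elimination(K, T, delta=0.05):
    alive, counts, sums, t = list(range(K)), [0] * K, [0.0] * K, 0
    while t < T and len(alive) > 1:
        for k in list(alive):                     # uniform exploration round
            if t >= T:
                break
            r = pull(k); counts[k] += 1; sums[k] += r; t += 1
        def radius(k):                            # Hoeffding-style radius
            n = counts[k]
            return math.sqrt(math.log(4 * K * n * n / delta) / (2 * n))
        best_lcb = max(sums[k] / counts[k] - radius(k) for k in alive)
        alive = [k for k in alive
                 if sums[k] / counts[k] + radius(k) >= best_lcb]
    return alive

print(successive_elimination(K=3, T=10_000))      # likely [2]
```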

Lower bound [Lai and Robbins, 1985]

  • Any consistent algorithm satisfies liminf_{T→∞} R(T) / ln T ≥ Σ_{i: Δ_i > 0} Δ_i / KL(μ_i, μ*): logarithmic regret is unavoidable


Bandit learning in matching markets [Liu et al., 2020]

  • N players and K arms (K ≥ N); each player has unknown preferences over the arms, given by the unknown means of its stochastic rewards
  • The reward models satisfaction over this matching experience
  • For simplicity, assume arms know their preferences


Conflict resolution: One-to-one setting

  • If several players select the same arm, the arm accepts only its most preferred proposer; the rejected players receive no reward and observe the rejection


Objective

  • Each player i competes against its stable partner m_i: the stable regret is R_i(T) = T · μ_{i, m_i} − E[ Σ_{t=1}^{T} X_i(t) ], where m_i is player i's partner under the (player-optimal or player-pessimal) stable matching


Challenge in matching markets

  • Learning process: Other players will block observations
    • Once the player selects an arm based on its exploration-exploitation (EE) strategy, this arm may reject the player due to others’ selections
    • The individual player’s EE trade-off is interrupted

  • Objective: Cannot maximize a single player’s utility
    • Aim to find the optimal equilibrium of the market


How to control agents' blocking?

  • Centralized
    • All participants submit their estimates to the platform
    • The platform computes an assignment
    • All players follow this assignment

  • Decentralized
    • Each player independently computes the target arm
    • Necessary information to communicate:
      • common index of arms, matching outcomes in each round, etc.


Summary of Part 2: Multi-armed bandits

  • Multi-armed bandits (MAB)
    • Applications
    • Explore-then-commit (ETC)
    • Upper confidence bound (UCB)
    • Successive elimination
    • Thompson sampling (TS)
    • Lower bound
  • Bandit learning in matching markets
    • Setting
    • Challenge


Part 3: Bandit Algorithms in Matching Markets


Outline

  • Centralized algorithms
    • ETC, UCB
  • Decentralized algorithms
    • General markets
    • Markets with unique stable matching
    • Explore-then-GS (ETGS) strategies
  • Lower bound
  • Many-to-one markets
  • Strategic behavior
    • Adaptive ETGS
  • Other variants


Warm up: Centralized ETC [Liu et al., 2020]

(Figure: the first rounds are uniform exploration; for the remaining rounds, run GS with the estimated rankings and follow its choice — exploitation via GS with estimated rankings)
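A minimal sketch of this warm-up (assumed setup: the platform schedules collision-free round-robin exploration, arms' preferences are known, and the Part 1 GS routine is inlined compactly; all parameters illustrative):

```python
# Centralized ETC for a matching market: explore h rounds per arm, then
# commit every player to the GS matching computed from estimated rankings.
import random

def gale_shapley(a_prefs, b_prefs):
    rank = {b: {a: i for i, a in enumerate(p)} for b, p in b_prefs.items()}
    nxt = {a: 0 for a in a_prefs}; cur = {}; free = list(a_prefs)
    while free:
        a = free.pop(); b = a_prefs[a][nxt[a]]; nxt[a] += 1
        if b not in cur:
            cur[b] = a
        elif rank[b][a] < rank[b][cur[b]]:
            free.append(cur[b]); cur[b] = a
        else:
            free.append(a)
    return {a: b for b, a in cur.items()}

N, K, T, h = 2, 3, 10_000, 200
mu = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2]]       # players' unknown means
arm_prefs = {k: [0, 1] for k in range(K)}      # arms' known preferences

counts = [[0] * K for _ in range(N)]
sums = [[0.0] * K for _ in range(N)]
match = None
for t in range(T):
    if t < K * h:                              # exploration: no collisions
        picks = {i: (i + t) % K for i in range(N)}
    else:
        if match is None:                      # commit: GS on the estimates
            est = {i: sorted(range(K), key=lambda k: -sums[i][k] / counts[i][k])
                   for i in range(N)}
            match = gale_shapley(est, arm_prefs)
        picks = match
    for i, k in picks.items():
        r = float(random.random() < mu[i][k])
        counts[i][k] += 1; sums[i][k] += r
print(match)                                   # expected: {0: 0, 1: 1}
```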


Unique stable matching

  • When there is only one stable matching
    • Player-optimal stable matching = Player-pessimal stable matching
    • The blocking relationship becomes simpler

| Regret type | Regret bound | Uniqueness condition | References |
|---|---|---|---|
| Unique stable matching | | Serial dictatorship | [Sankararaman et al., 2021] |
| Unique stable matching | | | [Maheshwari et al., 2022] |
| Unique stable matching | | Uniqueness consistency (the most general) | [Basu et al., 2021] |

How to balance EE in a more appropriate way?

  • Exploration-Exploitation trade-off
    • Exploitation goes through with correct rankings by following GS
    • Requires enough exploration to estimate the correct rankings
  • Perhaps design the schedule manually?
  • To avoid being blocked by other players: coordinate selections in a round-robin way


PhasedETC [Basu et al., 2021]

(Figure: the T rounds are split into phases 1, 2, …, whose lengths grow exponentially)


PhasedETC: Regret analysis

(Figure: each phase consists of an Explore block followed by a GS + exploit block; the exploration is enough to learn the preferences, but the final Explore block contributes an exponentially large regret term)


Explore-then-GS (ETGS) [Kong and Li, 2023]

  • Avoid unnecessary exploitation before estimating preferences well
    • Only when all players estimate well, enter GS + exploit

(Figure: exponentially growing Explore phases; players communicate after each phase, and once all players estimate their preferences well, they switch to GS + exploit for the rest of the T rounds)
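A highly simplified sketch of this phase structure (a centralized view of the decentralized protocol; the confidence-interval test and parameters are assumptions, not the paper's exact procedure):

```python
# ETGS skeleton: doubling explore phases; after each phase, check whether
# every player's confidence intervals fully rank its arms; if so, switch
# to GS + exploit for all remaining rounds.
import math, random

N, K, T = 2, 3, 20_000
mu = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2]]
counts = [[0] * K for _ in range(N)]
sums = [[0.0] * K for _ in range(N)]

def ranking_identified(i, t):
    def ci(k):
        m = sums[i][k] / counts[i][k]
        w = math.sqrt(2 * math.log(t + 2) / counts[i][k])
        return (m - w, m + w)
    ivs = sorted(ci(k) for k in range(K))
    return all(ivs[j][1] < ivs[j + 1][0] for j in range(K - 1))

t, phase_len, all_done = 0, K, False
while t < T and not all_done:
    for _ in range(phase_len):                # collision-free round-robin
        if t >= T:
            break
        for i in range(N):
            k = (i + t) % K
            r = float(random.random() < mu[i][k])
            counts[i][k] += 1; sums[i][k] += r
        t += 1
    all_done = all(ranking_identified(i, t) for i in range(N))
    phase_len *= 2                            # phase length grows exponentially
# remaining T - t rounds: run GS on the learned rankings and exploit
print(t, all_done)
```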


ETGS implementation: Communication


Remark: it suffices for each player to identify its top N+1 ranked arms to find the player-optimal stable matching.


ETGS implementation: Communication (cont.)

(Figure: in a communication round, each player signals whether it has estimated its preferences well through which arm it selects)


ETGS implementation: Communication (cont.)

  • Based on players’ own matching outcomes [Zhang et al., 2022]

(Figure: a player that estimates well always selects the same arm during the communication round; other players infer this from their own matching outcomes)


ETGS: Regret analysis [Kong and Li, 2023]

  • ETGS attains player-optimal stable regret of order O(K log T / Δ²), where Δ is the minimum gap between a player's preference values

Improvement: Adaptive ETGS

  • Idea: instead of starting GS + exploitation only after all players agree, integrate each player's own learning process into its GS steps (see the sketch after the figure below)

  • Players cooperatively explore arms in a round-robin manner
  • Once a player identifies its most preferred remaining arm, it starts exploiting this arm
  • If the exploited arm is occupied by a higher-priority player (the arm “rejects” the player)
    • Explore the next most preferred arm (enter the next step of GS)

(Figure: in ETGS, the switch from Explore to Exploit depends on all players; in Adaptive ETGS it depends on the player itself, and a player rejected by its exploited arm returns to exploring)
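A toy, centralized simulation of the idea (assumed conflict rules and parameters, not the paper's exact algorithm):

```python
# Adaptive-ETGS sketch: players explore non-eliminated arms round-robin;
# once a player's intervals separate its best remaining arm it exploits it;
# if a higher-priority player also exploits that arm, the arm "rejects" the
# player, who then moves to its next GS step.
import math, random

N, K, T = 2, 3, 20_000
mu = [[0.9, 0.5, 0.1], [0.8, 0.6, 0.2]]         # players' unknown means
arm_rank = {k: [0, 1] for k in range(K)}         # arms prefer player 0

counts = [[0] * K for _ in range(N)]
sums = [[0.0] * K for _ in range(N)]
removed = [set() for _ in range(N)]              # arms lost to GS rejections
target = [None] * N                              # arm a player exploits

def identified_best(i, t):
    cand = [k for k in range(K) if k not in removed[i]]
    mean = lambda k: sums[i][k] / counts[i][k]
    width = lambda k: math.sqrt(2 * math.log(t + 2) / counts[i][k])
    best = max(cand, key=mean)
    ok = all(mean(best) - width(best) > mean(k) + width(k)
             for k in cand if k != best)
    return best if ok else None

for t in range(T):
    picks = {}
    for i in range(N):
        if target[i] is None and min(counts[i]) > 0:
            target[i] = identified_best(i, t)
        if target[i] is not None:
            picks[i] = target[i]                  # exploit
        else:
            cand = [k for k in range(K) if k not in removed[i]]
            picks[i] = cand[(i + t) % len(cand)]  # round-robin exploration
    for k in set(picks.values()):                 # arms resolve conflicts
        props = [i for i in picks if picks[i] == k]
        winner = min(props, key=arm_rank[k].index)
        for i in props:
            if i == winner:
                r = float(random.random() < mu[i][k])
                counts[i][k] += 1; sums[i][k] += r
            elif target[i] == k and target[winner] == k:
                removed[i].add(k); target[i] = None  # rejected: next GS step
print(target)                                     # expected: [0, 1]
```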


Adaptive ETGS: Strategic behavior

Suppose a player has identified its optimal arm during exploration. What should it report?

What about reporting that it has not?

    • Equivalent to delayed entry into GS in the offline setting
    • Cannot change the final matching result

What about reporting a non-optimal arm?

    • Equivalent to misreporting rankings in the offline GS
    • Cannot improve the final matched partner


Adaptive ETGS: Regret [Kong et al., NeurIPS 2024]

  • Once an optimal arm has been identified, stop exploring
  • Instantly eliminate the K − N sub-optimal arms
  • Maintain N arms for cooperative exploration to avoid conflicts


Lower bound [Sankararaman et al., 2021]


One-to-one markets: Results overview

| Regret type | Regret bound | Communication type | Strategy-proofness | References |
|---|---|---|---|---|
| Player-optimal | | Centralized | ? | [Liu et al., 2020] |
| Player-pessimal | | Centralized | ? | [Liu et al., 2020] |
| Player-pessimal | | Decentralized, observed matching outcomes | No | [Liu et al., 2021], [Kong et al., 2022] |
| Unique | | Decentralized | ? | [Sankararaman et al., 2021; Basu et al., 2021; Maheshwari et al., 2022] |
| Unique | | Centralized | ? | [Wang and Li, 2024; Kong et al., 2024] |
| Optimal stable bandits (unique) | | Decentralized | / | [Sankararaman et al., 2021] |
| Player-optimal | | Decentralized | ? | [Basu et al., 2021] |
| Player-optimal | | Decentralized, observed matching outcomes | No | [Kong and Li, 2023] |
| Player-optimal | | Decentralized | No | [Zhang et al., 2022] |
| Player-optimal | | Decentralized | Yes | [Kong et al., 2024] |
| Indifference stable | | Decentralized | ? | [Kong et al., 2025] |

How about many-to-one markets?


Extension of one-to-one algorithms:

  • Centralized ETC/UCB [Wang et al., 2022]
  • Decentralized UCB [Wang et al., 2022]
  • ETGS [Kong and Li, 2024]

These result in the same regret upper bounds as in one-to-one markets.


Regret under responsiveness


Many-to-one markets: Substitutability

  • Challenge: arms may reject all applications, so players cannot explore in a round-robin manner

  • Idea: determine which matches to explore from the arm side
  • Design the learning process from arm-proposal DA


Substitutability: Algorithm [Kong and Li, 2024]

  • First, assume players know arms' preferences¹

(Figure: Step 1 of arm-proposal DA alongside Step 1 of the online algorithm; once all players determine which arm is optimal, the algorithm enters the next step)


Substitutability: Theoretical analysis


The first result for combinatorial preferences


Many-to-one markets: Results overview

| Setting | Regret type | Regret bound | Communication type | Strategy-proofness | References |
|---|---|---|---|---|---|
| Responsiveness | Player-optimal | | Centralized | ? | [Wang et al., 2022] |
| Responsiveness | Player-pessimal | | Centralized | ? | [Wang et al., 2022] |
| Responsiveness | Player-pessimal | | Decentralized, observed matching outcomes | No | [Wang et al., 2022] |
| Responsiveness | Player-optimal | | Decentralized, observed matching outcomes | No | [Kong and Li, 2024] |
| Responsiveness | Player-optimal | | Decentralized, observed matching outcomes | Yes | [Kong and Li, 2024] |
| Responsiveness | Player-optimal | | Decentralized | No | [Zhang and Fang, 2024] |
| Substitutability | Player-pessimal | | Decentralized, observed matching outcomes, known arms' preferences | ? | [Kong and Li, 2024] |

Other setting variants

  • Contextual information [Li et al., 2022]
  • Non-stationary preferences [Ghosh et al., 2022; Muthirayan et al., 2023]
  • Two-sided unknown preferences [Pokharel and Das, 2023; Pagare and Ghosh, 2023]
  • Sample complexity
  • Problem independent regret
  • Markov matching markets [Min et al., 2022]
  • Multi-sided matching markets [Mordig et al., 2021]
  • Money transfer [Jagadeesan et al., 2021]
  • P2P: matching with budget [Sarkar, 2021]


Summary of Part 3: Bandit algorithms in matching markets

  • Centralized algorithms
    • ETC, UCB
    • The failure of UCB
  • Decentralized algorithms
    • General markets
    • Markets with unique stable matching
    • Explore-then-GS (ETGS) strategies
  • Lower bound
  • Many-to-one markets
  • Strategic behavior
    • Adaptive ETGS
  • Other variants


Part 4: Open Problems


Open problems: Matching markets

  • Optimality
    • Regret
    • Strategic behavior


Open problems: Regret


  • What is the optimal regret in the one-to-one setting?

| Regret type | Regret bound | Communication type | References |
|---|---|---|---|
| Optimal stable bandits (unique stable matching) | | Decentralized | [Sankararaman et al., 2021] |
| Player-optimal stable matching | | Decentralized | [Kong et al., 2024] |

A gap remains between the known upper and lower bounds.

Open problems: Regret (cont.)


  • What is the optimal regret in the many-to-one setting?

| Setting | Regret type | Regret bound | Communication type | References |
|---|---|---|---|---|
| Responsiveness | Player-optimal stable matching | | Decentralized, known matching outcomes | [Kong and Li, 2024] |
| Substitutability | Player-pessimal stable matching | | Decentralized, known matching outcomes | [Kong and Li, 2024] |

Is the player-optimal stable matching achievable under substitutability? What is the optimal regret under responsiveness?

Open problems: Regret & Strategic behavior


  • What is the optimal regret when guaranteeing strategy-proofness?

| Regret type | Regret bound | Strategy-proof | References |
|---|---|---|---|
| Player-optimal stable matching | | No | [Kong and Li, 2023; Zhang et al., 2022] |
| Player-optimal stable matching | | Yes | [Kong and Li, 2024] |

Open problems: Matching markets (cont.)

  • How to generalize the setting and what is the optimal regret in these settings?
    • How to deal with two-sided unknown preferences?
      • Existing works assume arms have known preferences and use this to conduct coordination/communication. But arms may also have unknown preferences
    • How to deal with players’ indifferent preferences?
      • Players may be indifferent over multiple arms
    • How to utilize contextual information to improve learning efficiency?
      • Agents’ features (gender, age, hometown)
    • How to handle asynchronous agents?
      • Agents may enter the system at different times

……


Thanks! & Questions?

Credit: Some images are from Flaticon.com


Zilong Wang

  • Ph.D. candidate at Shanghai Jiao Tong University
  • Research interests: Bandit/RL algorithms
  • Personal website: https://zilong-wang719.github.io

Shuai Li

  • Associate professor at Shanghai Jiao Tong University
  • Research interests: Bandit/RL algorithms
  • Personal website: https://shuaili8.github.io/


References: Part 1

  • Gale, David, and Lloyd S. Shapley. "College admissions and the stability of marriage." The American Mathematical Monthly 69.1 (1962): 9-15.
  • Roth, Alvin E. "The evolution of the labor market for medical interns and residents: a case study in game theory." Journal of Political Economy 92.6 (1984a): 991-1016.
  • Dubins, Lester E., and David A. Freedman. "Machiavelli and the Gale-Shapley algorithm." The American Mathematical Monthly 88.7 (1981): 485-494.
  • Roth, Alvin E. "The economics of matching: Stability and incentives." Mathematics of operations research 7.4 (1982): 617-628.
  • Kelso Jr, Alexander S., and Vincent P. Crawford. "Job matching, coalition formation, and gross substitutes." Econometrica: Journal of the Econometric Society (1982): 1483-1504.
  • Roth, Alvin E. "Stability and polarization of interests in job matching." Econometrica: Journal of the Econometric Society (1984b): 47-57.
  • Roth, Alvin E., and Marilda Sotomayor. "Two-sided matching." Handbook of game theory with economic applications 1 (1992): 485-541.


References: Part 2


  • Lattimore, Tor, and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Li, Lihong, et al. "A contextual-bandit approach to personalized news article recommendation." International conference on World wide web. 2010.
  • Kocsis, Levente, and Csaba Szepesvári. "Bandit based monte-carlo planning." European conference on machine learning. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006.
  • Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
  • Yu, Baosheng, Meng Fang, and Dacheng Tao. "Linear submodular bandits with a knapsack constraint." Proceedings of the AAAI Conference on Artificial Intelligence. 2016.
  • Liang, Jia Hui, et al. "Learning rate based branching heuristic for SAT solvers." Theory and Applications of Satisfiability Testing–SAT 2016: 19th International Conference, Bordeaux, France, July 5-8, 2016, Proceedings 19. Springer International Publishing, 2016.
  • Bastani, Hamsa, et al. "Efficient and targeted COVID-19 border testing via reinforcement learning." Nature 599.7883 (2021): 108-113.


References: Part 2


  • Hu, Yujing, et al. "Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application." Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 2018.
  • Garivier, Aurélien, Tor Lattimore, and Emilie Kaufmann. "On explore-then-commit strategies." Advances in Neural Information Processing Systems 29 (2016).
  • Audibert, Jean-Yves, and Sébastien Bubeck. "Best arm identification in multi-armed bandits." COLT-23th Conference on learning theory-2010. 2010.
  • Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine learning 47 (2002): 235-256.
  • Agrawal, Shipra, and Navin Goyal. "Further Optimal Regret Bounds For Thompson Sampling." Sixteenth International Conference on Artificial Intelligence and Statistics. 2013.
  • Lai, Tze Leung, and Herbert Robbins. "Asymptotically efficient adaptive allocation rules." Advances in applied mathematics 6.1 (1985): 4-22.


References: Part 3


  • Liu, Lydia T., Horia Mania, and Michael Jordan. "Competing bandits in matching markets." International Conference on Artificial Intelligence and Statistics. PMLR, 2020.
  • Liu, Lydia T., et al. “Bandit learning in decentralized matching markets.” Journal of Machine Learning Research 22.211 (2021): 1-34.
  • Maheshwari, Chinmay, Shankar Sastry, and Eric Mazumdar. "Decentralized, communication-and coordination-free learning in structured matching markets." Advances in Neural Information Processing Systems 35 (2022): 15081-15092.
  • Kong, Fang, Junming Yin, and Shuai Li. "Thompson Sampling for Bandit Learning in Matching Markets.” International Joint Conference on Artificial Intelligence. 2022.
  • Basu, Soumya, Karthik Abinav Sankararaman, and Abishek Sankararaman. "Beyond $ log^ 2 (T) $ regret for decentralized bandits in matching markets." International Conference on Machine Learning. PMLR, 2021.
  • Sankararaman, Abishek, Soumya Basu, and Karthik Abinav Sankararaman. "Dominate or delete: Decentralized competing bandits in serial dictatorship." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.


References: Part 3


  • Kong, Fang, and Shuai Li. "Player-optimal Stable Regret for Bandit Learning in Matching Markets." Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). Society for Industrial and Applied Mathematics, 2023.
  • Zhang, Yirui, Siwei Wang, and Zhixuan Fang. "Matching in Multi-arm Bandit with Collision." Advances in Neural Information Processing Systems 35 (2022): 9552-9563.
  • Wang, Zilong, et al. "Bandit learning in many-to-one matching markets." Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2022.
  • Kong, Fang, and Shuai Li. "Improved Bandits in Many-to-one Matching Markets with Incentive Compatibility." AAAI Conference on Artificial Intelligence. 2024.
  • Min, Yifei, et al. "Learn to match with no regret: Reinforcement learning in markov matching markets." Advances in Neural Information Processing Systems 35 (2022): 19956-19970.
  • Li, Yuantong, et al. "Rate-optimal contextual online matching bandit." arXiv preprint arXiv:2205.03699 (2022).


References: Part 3


  • Pagare, Tejas, and Avishek Ghosh. "Two-Sided Bandit Learning in Fully-Decentralized Matching Markets." ICML 2023 Workshop The Many Facets of Preference-Based Learning. 2023.
  • Muthirayan, Deepan, et al. "Competing bandits in time varying matching markets." Learning for Dynamics and Control Conference. PMLR, 2023.
  • Ghosh, Avishek, et al. "Decentralized competing bandits in non-stationary matching markets." arXiv preprint arXiv:2206.00120 (2022).
  • Pokharel, Gaurab, and Sanmay Das. "Converging to Stability in Two-Sided Bandits: The Case of Unknown Preferences on Both Sides of a Matching Market." arXiv preprint arXiv:2302.06176 (2023).
  • Jagadeesan, Meena, et al. "Learning equilibria in matching markets from bandit feedback." Advances in Neural Information Processing Systems 34 (2021): 3323-3335.
  • Mordig, Maximilian, et al. "Finding Stable Matchings in PhD Markets with Consistent Preferences and Cooperative Partners." arXiv preprint arXiv:2102.11834 (2021).


References: Part 3


  • Wang, Zilong and Li, Shuai. "Optimal Analysis for Bandit Learning in Matching Markets with Serial Dictatorship." Theoretical Computer Science (TCS), 2024.
  • Kong, Fang, et al. "Improved Analysis for Bandit Learning in Matching Markets." 38th Conference on Neural Information Processing Systems (NeurIPS). 2024.
  • Kong, Fang, et al. "Bandit Learning in Matching Markets with Indifference." 13rd International Conference on Learning Representations (ICLR), 2025.
  • Zhang, Yirui and Fang, Zhixuan. "Decentralized Competing Bandits in Many-to-One Matching Markets." International Conference on Autonomous Agents and Multiagent Systems Extended Abstract, 2024.


References: Part 4


  • Darak, Sumit J., and Manjesh K. Hanawal. "Distributed Learning in Ad-Hoc Networks: A Multi-player Multi-armed Bandit Framework." arXiv preprint arXiv:2004.00367 (2020).
  • Boursier, Etienne, and Vianney Perchet. "A survey on multi-player bandits." arXiv preprint arXiv:2211.16275 (2022).
  • Anantharam, Venkatachalam, Pravin Varaiya, and Jean Walrand. "Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays-part i: Iid rewards." IEEE Transactions on Automatic Control 32.11 (1987): 968-976.
  • Komiyama, Junpei, Junya Honda, and Hiroshi Nakagawa. "Optimal regret analysis of thompson sampling in stochastic multi-armed bandit problem with multiple plays." International Conference on Machine Learning. PMLR, 2015.
  • Chen, Wei, Yajun Wang, and Yang Yuan. "Combinatorial multi-armed bandit: General framework and applications." International conference on machine learning. PMLR, 2013.
  • Wang, Siwei, and Wei Chen. "Thompson sampling for combinatorial semi-bandits." International Conference on Machine Learning. PMLR, 2018.


References: Part 4


  • Anandkumar, Animashree, Nithin Michael, and Ao Tang. "Opportunistic spectrum access with multiple users: Learning under competition." 2010 Proceedings IEEE INFOCOM. IEEE, 2010.
  • Rosenski, Jonathan, Ohad Shamir, and Liran Szlak. "Multi-player bandits–a musical chairs approach." International Conference on Machine Learning. PMLR, 2016.
  • Boursier, Etienne, and Vianney Perchet. "SIC-MMAB: Synchronisation involves communication in multiplayer multi-armed bandits." Advances in Neural Information Processing Systems 32 (2019).
  • Bubeck, Sébastien, and Thomas Budzinski. "Coordination without communication: optimal regret in two players multi-armed bandits." Conference on Learning Theory. PMLR, 2020.
  • Bubeck, Sébastien, Thomas Budzinski, and Mark Sellke. "Cooperative and stochastic multi-player multi-armed bandit: Optimal regret with neither communication nor collisions." Conference on Learning Theory. PMLR, 2021.
  • Vickrey, William. "Counterspeculation, auctions, and competitive sealed tenders." The Journal of finance 16.1 (1961): 8-37.


References: Part 4


  • Clarke, Edward H. "Multipart pricing of public goods." Public choice (1971): 17-33.
  • Groves, Theodore. "Efficient collective choice when compensation is possible." The Review of Economic Studies 46.2 (1979): 227-241.
  • Kandasamy, Kirthevasan, et al. "Vcg mechanism design with unknown agent values under stochastic bandit feedback." Journal of Machine Learning Research 24.53 (2023): 1-45.
  • Basu, Soumya, and Abishek Sankararaman. "Double Auctions with Two-sided Bandit Feedback." Advances in Neural Information Processing Systems 36 (2023).
  • Sarkar, Soumajyoti. "Centralized Borrower and Lender Matching under Uncertainty for P2P Lending." Companion Proceedings of the Web Conference 2021. 2021.
  • Cesa-Bianchi, Nicolò, et al. "A regret analysis of bilateral trade." Proceedings of the 22nd ACM Conference on Economics and Computation. 2021.
  • Weed, Jonathan, Vianney Perchet, and Philippe Rigollet. "Online learning in repeated auctions." Conference on Learning Theory. PMLR, 2016.


References: Part 4


  • Nedelec, Thomas, Noureddine El Karoui, and Vianney Perchet. "Learning to bid in revenue-maximizing auctions." International Conference on Machine Learning. PMLR, 2019.
  • Babaioff, Moshe, Robert D. Kleinberg, and Aleksandrs Slivkins. "Truthful mechanisms with implicit payment computation." Journal of the ACM (JACM) 62.2 (2015): 1-37.
  • Babaioff, Moshe, Yogeshwer Sharma, and Aleksandrs Slivkins. "Characterizing Truthful Multi-armed Bandit Mechanisms." SIAM Journal on Computing 43.1 (2014): 194-230.
  • Nazerzadeh, Hamid, et al. "Where to sell: Simulating auctions from learning algorithms." Proceedings of the 2016 ACM Conference on Economics and Computation. 2016.
  • Gatti, Nicola, Alessandro Lazaric, and Francesco Trovo. "A truthful learning mechanism for contextual multi-slot sponsored search auctions with externalities." Proceedings of the 13th ACM Conference on Electronic Commerce. 2012.
  • Kakade, Sham M., Ilan Lobel, and Hamid Nazerzadeh. "Optimal dynamic mechanism design and the virtual-pivot mechanism." Operations Research 61.4 (2013): 837-854.