LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

ACL 2023

Presented by: Dongfu Jiang, incoming CS PhD @ UWaterloo


Motivation

LLM-Blender


2023/7/27

  • Vicuna
  • StableLM
  • Open Assistant

3 of 20

Motivation


Summary

  • We introduce LLM-Blender: Ranking + Fusing = LLM Ensembling
  • PairRanker (Ranking): ranks the responses through pairwise comparison
  • GenFuser (Fusing): fuses the top-ranked responses by regenerating
  • MixInstruct (a new benchmark): an instruction dataset with 11 LLMs’ responses
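
The recipe in the bullets above can be sketched in a few lines; `rank` and `fuse` below are hypothetical stand-ins for the trained PairRanker and GenFuser models, not the paper's implementations.

```python
# Minimal sketch of LLM-Blender: ensembling = rank the N candidate
# responses, then fuse the top-K into a single output.

def llm_blender(instruction, candidates, rank, fuse, k=3):
    """Rank all candidate responses, then fuse the top-k."""
    ranked = rank(instruction, candidates)   # PairRanker stand-in
    return fuse(instruction, ranked[:k])     # GenFuser stand-in


# Toy stand-ins so the pipeline can be exercised end to end:
def toy_rank(instruction, candidates):
    # Pretend longer responses are better.
    return sorted(candidates, key=len, reverse=True)

def toy_fuse(instruction, top_candidates):
    # Pretend fusion is simple concatenation.
    return " / ".join(top_candidates)
```

With the real models, `rank` would run PairRanker over all candidate pairs and `fuse` would prompt GenFuser with the instruction plus the top-K responses.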


Overall Explanation


Details about PairRanker

Previous Methods


Details about PairRanker

Why Pairwise (1)

  • Subtle differences among high-quality candidates (judging from the metric scores)

  • Assumption: it is hard to capture these differences through individual scoring (MLM-Scoring, SimCLS, SummaReranker)

  • We view this as a limitation of previous methods
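
A hedged sketch of the pairwise framing: instead of scoring each candidate alone, the instruction and both candidates are packed into one sequence, so the encoder can contrast them directly. The separator tokens and the `score_pair` model (returning the probability that the first candidate is better) are illustrative assumptions, not necessarily the paper's exact format.

```python
# Pairwise ranking sketch: build one packed input per ordered pair,
# score it, and rank candidates by how many comparisons they win.

def pairwise_input(instruction, cand_i, cand_j):
    # Illustrative separators; real models use learned special tokens.
    return f"<source> {instruction} <candidate1> {cand_i} <candidate2> {cand_j}"

def rank_by_pairwise_wins(instruction, candidates, score_pair):
    """Compare every ordered pair; rank candidates by win count."""
    n = len(candidates)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and score_pair(
                    pairwise_input(instruction, candidates[i], candidates[j])) > 0.5:
                wins[i] += 1
    order = sorted(range(n), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order]
```

Comparing all ordered pairs costs O(N²) scorer calls, which is why the ranker only has to be light relative to the base LLMs.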


Details about PairRanker

Why Pairwise (2)

  • Human intuition: people do better when they can directly compare two candidates
  • Bidirectional attention in BERT-family models lets the model learn the subtle differences between two candidates

  • Assumption: bidirectional attention ≈ human direct comparison


Details about GenFuser

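
A hedged sketch of the fusion step: GenFuser is a seq2seq generator that reads the instruction plus the top-K ranked candidates and regenerates a new, fused response. The prompt template, separator, and word-level truncation below are illustrative assumptions, not the exact training format.

```python
# Sketch of the fuser's input construction: pack the instruction and
# the top-K candidates into one sequence for a seq2seq generator.

def build_fuser_input(instruction, top_candidates, max_words=128):
    parts = [f"Instruction: {instruction}"]
    for i, cand in enumerate(top_candidates):
        # Truncate each candidate so the packed input stays bounded.
        words = cand.split()[:max_words]
        parts.append(f"Candidate {i + 1}: {' '.join(words)}")
    return " </s> ".join(parts)
```

The generator conditioned on this input can merge complementary spans from several candidates, which is what selection alone cannot do.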


Details about MixInstruct

Dataset Statistics

  • Instruction dataset
  • Responses from 11 popular LLMs
  • Instruction and response token lengths are constrained to a maximum of 128


Details about MixInstruct

LLM-Performance

  • Open Assistant and Vicuna perform best
  • LLM abilities can vary significantly


Experiments

Set up



Experiments

Ranker results


Experiments

Fuser results


Experiments

Correlation Analysis


Limitations



Some thinking about ensembling

  • Dive into deeper ensembling
    • LLM-Blender: pure text-level ensembling
    • MoE: model- or logit-level ensembling
    • Example work: Flan-MoE
  • Cost of ensembling
    • The ranker is light
    • The fuser is heavy
    • The base LLMs cost the most
  • Non-text modality
    • Multimodal ensembling
    • Framework and methods (to be explored)
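
To make the text-level vs logit-level contrast above concrete, here is a toy sketch of logit-level ensembling: a weighted average of per-model next-token logits before decoding. This is a generic illustration, not the MoE routing used in Flan-MoE.

```python
# Toy logit-level ensembling: combine models inside decoding by
# averaging their next-token logits, instead of combining finished texts.

def ensemble_next_token(per_model_logits, weights=None):
    """Greedy next-token pick from a weighted average of model logits."""
    n = len(per_model_logits)
    weights = weights or [1.0 / n] * n
    vocab = len(per_model_logits[0])
    avg = [sum(w * logits[t] for w, logits in zip(weights, per_model_logits))
           for t in range(vocab)]
    return max(range(vocab), key=lambda t: avg[t])
```

Note the practical catch this sketch glosses over: logit-level ensembling requires the models to share a tokenizer and vocabulary, which the 11 heterogeneous LLMs in MixInstruct do not; text-level ensembling has no such requirement.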


Experiments

Fuser results


What we are currently working on

  • More LLMs for ensembling (beyond the current 11)
  • Longer context length (>128 tokens)
  • Higher-quality data (OpenInstruct from Chatbot Arena)
  • Evaluation on the AlpacaEval benchmark
    • The ranker gives results better than any single model
    • The fuser suffers performance degradation


LLM-Blender V2


Q&A
