LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

ACL 2023

Presented by: Dongfu Jiang, incoming CS PhD @ UWaterloo


Motivation

LLM-Blender


2023/7/27

  • Vicuna
  • StableLM
  • Open Assistant

3 of 20

Motivation


Summary

  • We introduce LLM-Blender: Ranking + Fusing = LLM Ensembling
  • PairRanker (Ranking): ranks the responses through pairwise comparison
  • GenFuser (Fusing): fuses the top-ranked responses by regenerating
  • MixInstruct (a new benchmark): an instruction dataset with 11 LLMs’ responses
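
The recipe in the bullets above can be sketched in a few lines; `rank` and `fuse` below are hypothetical stand-ins for the trained PairRanker and GenFuser models, not the paper's implementations.

```python
# Minimal sketch of LLM-Blender: ensembling = rank the N candidate
# responses, then fuse the top-K into a single output.

def llm_blender(instruction, candidates, rank, fuse, k=3):
    """Rank all candidate responses, then fuse the top-k."""
    ranked = rank(instruction, candidates)   # PairRanker stand-in
    return fuse(instruction, ranked[:k])     # GenFuser stand-in


# Toy stand-ins so the pipeline can be exercised end to end:
def toy_rank(instruction, candidates):
    # Pretend longer responses are better.
    return sorted(candidates, key=len, reverse=True)

def toy_fuse(instruction, top_candidates):
    # Pretend fusion is simple concatenation.
    return " / ".join(top_candidates)
```

With the real models, `rank` would run PairRanker over all candidate pairs and `fuse` would prompt GenFuser with the instruction plus the top-K responses.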


Overall Explanation


Details about PairRanker

Previous Methods


Details about PairRanker

Why Pairwise (1)

  • Subtle differences among high-quality candidates (judging from the metric scores)

  • Assumption: it is hard to capture these differences through individual scoring (MLM-Scoring, SimCLS, SummaReranker)

  • We view this as a limitation of previous methods
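
A hedged sketch of the pairwise framing: instead of scoring each candidate alone, the instruction and both candidates are packed into one sequence, so the encoder can contrast them directly. The separator tokens and the `score_pair` model (returning the probability that the first candidate is better) are illustrative assumptions, not necessarily the paper's exact format.

```python
# Pairwise ranking sketch: build one packed input per ordered pair,
# score it, and rank candidates by how many comparisons they win.

def pairwise_input(instruction, cand_i, cand_j):
    # Illustrative separators; real models use learned special tokens.
    return f"<source> {instruction} <candidate1> {cand_i} <candidate2> {cand_j}"

def rank_by_pairwise_wins(instruction, candidates, score_pair):
    """Compare every ordered pair; rank candidates by win count."""
    n = len(candidates)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i != j and score_pair(
                    pairwise_input(instruction, candidates[i], candidates[j])) > 0.5:
                wins[i] += 1
    order = sorted(range(n), key=lambda i: wins[i], reverse=True)
    return [candidates[i] for i in order]
```

Comparing all ordered pairs costs O(N²) scorer calls, which is why the ranker only has to be light relative to the base LLMs.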


Details about PairRanker

Why Pairwise (2)

  • Human intuition: people do better when they can directly compare two candidates
  • Bidirectional attention in BERT-family models lets the model learn the subtle differences between two candidates

  • Assumption: bidirectional attention ≈ human direct comparison


Details about GenFuser

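
A hedged sketch of the fusion step: GenFuser is a seq2seq generator that reads the instruction plus the top-K ranked candidates and regenerates a new, fused response. The prompt template, separator, and word-level truncation below are illustrative assumptions, not the exact training format.

```python
# Sketch of the fuser's input construction: pack the instruction and
# the top-K candidates into one sequence for a seq2seq generator.

def build_fuser_input(instruction, top_candidates, max_words=128):
    parts = [f"Instruction: {instruction}"]
    for i, cand in enumerate(top_candidates):
        # Truncate each candidate so the packed input stays bounded.
        words = cand.split()[:max_words]
        parts.append(f"Candidate {i + 1}: {' '.join(words)}")
    return " </s> ".join(parts)
```

The generator conditioned on this input can merge complementary spans from several candidates, which is what selection alone cannot do.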


Details about MixInstruct

Dataset Statistics

  • Instruction dataset
  • Responses from 11 popular LLMs
  • Instruction and response token lengths are constrained to a maximum of 128


Details about MixInstruct

LLM-Performance

  • Open Assistant and Vicuna perform best
  • LLM abilities can vary significantly


Experiments

Set up



Experiments

Ranker results


Experiments

Fuser results


Experiments

Correlation Analysis


Limitations



Some thinking about ensembling

  • Dive into deeper ensembling
    • LLM-Blender: pure text-level ensembling
    • MoE: model- or logit-level ensembling
    • Example work: Flan-MoE
  • Cost of ensembling
    • The ranker is light
    • The fuser is heavy
    • The base LLMs cost the most
  • Non-text modality
    • Multimodal ensembling
    • Framework and methods (to be explored)
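
To make the text-level vs logit-level contrast above concrete, here is a toy sketch of logit-level ensembling: a weighted average of per-model next-token logits before decoding. This is a generic illustration, not the MoE routing used in Flan-MoE.

```python
# Toy logit-level ensembling: combine models inside decoding by
# averaging their next-token logits, instead of combining finished texts.

def ensemble_next_token(per_model_logits, weights=None):
    """Greedy next-token pick from a weighted average of model logits."""
    n = len(per_model_logits)
    weights = weights or [1.0 / n] * n
    vocab = len(per_model_logits[0])
    avg = [sum(w * logits[t] for w, logits in zip(weights, per_model_logits))
           for t in range(vocab)]
    return max(range(vocab), key=lambda t: avg[t])
```

Note the practical catch this sketch glosses over: logit-level ensembling requires the models to share a tokenizer and vocabulary, which the 11 heterogeneous LLMs in MixInstruct do not; text-level ensembling has no such requirement.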


Experiments

Fuser results


What we are currently working on

  • More LLMs for ensembling (beyond the current 11)
  • Longer context length (>128 tokens)
  • Higher-quality data (OpenInstruct from Chatbot Arena)
  • Evaluation on the AlpacaEval benchmark
    • The ranker gives results better than any single model
    • The fuser suffers performance degradation


LLM-Blender V2


Q&A
