
Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl

 

 

🍱 Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models

 

Advantage Leftover Lunch RL

 

 

Supervised Learning: negative log-likelihood (NLL) loss

Incorporate Importance Weight
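A minimal sketch of this idea in PyTorch, assuming sequence-level log-probabilities and a precomputed per-sequence advantage; the clipping bound, reduction, and function name are my assumptions, not the authors' released implementation:

```python
import torch

def a_lol_style_loss(policy_seq_logprob, ref_seq_logprob, advantage, max_weight=1.0):
    """Hedged sketch: advantage- and importance-weighted NLL over whole sequences.

    policy_seq_logprob: log pi_theta(y|x) summed over tokens, shape (batch,)
    ref_seq_logprob:    log pi_ref(y|x) summed over tokens (frozen LM), shape (batch,)
    advantage:          A(x, y) = R(x, y) - V(x), shape (batch,)
    max_weight:         clipping bound for the importance weight (assumed value)
    """
    # Sequence-level importance weight pi_theta(y|x) / pi_ref(y|x),
    # detached so the update matches the importance-sampled policy gradient.
    importance = torch.exp(policy_seq_logprob - ref_seq_logprob).detach()
    importance = importance.clamp(max=max_weight)

    # Negative-advantage (suboptimal) examples are dropped -- the "leftover lunch".
    keep = (advantage > 0).float()

    # Advantage- and importance-weighted negative log-likelihood.
    nll = -policy_seq_logprob
    loss = (keep * advantage * importance * nll).sum() / keep.sum().clamp(min=1.0)
    return loss
```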

 

 

 

 

[Figure: value estimate architecture — a frozen ❄️ reference LM with an added Attention + Linear head that predicts a single value per prompt]
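A rough sketch of such a value head, assuming attention pooling over the frozen reference LM's hidden states followed by a linear projection to a scalar; the pooling scheme and layer sizes are assumptions, not the exact architecture:

```python
import torch
import torch.nn as nn

class PromptValueHead(nn.Module):
    """Hedged sketch: predicts a scalar value V(x) for a prompt from the
    frozen reference LM's hidden states (attention pooling + linear layer)."""

    def __init__(self, hidden_size: int, num_heads: int = 4):
        super().__init__()
        # Learned pooling query; initialization scheme is an assumption.
        self.pool_query = nn.Parameter(torch.randn(1, 1, hidden_size) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, frozen_hidden_states: torch.Tensor) -> torch.Tensor:
        # frozen_hidden_states: (batch, seq_len, hidden_size), no grad to the LM.
        batch = frozen_hidden_states.size(0)
        query = self.pool_query.expand(batch, -1, -1)
        pooled, _ = self.attn(query, frozen_hidden_states, frozen_hidden_states)
        return self.linear(pooled.squeeze(1)).squeeze(-1)  # (batch,)
```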

 

 

Evaluation Setup

 

 

Finetune with NLL loss for 100% of the training steps

 

Continue with a baseline or the A-LoL RL algorithm for 50% more steps

 

Our method, A-LoL, doesn't require pairwise-preference data and can benefit from a Reference LM and Advantage (same as online PPO)

 

RLHF Benchmark: Helpful and Harmless Assistant Task

DPO achieves the highest avg. reward (58.6) but generates long (67 words) and less diverse (0.5 distinct trigrams) responses

A-LoL variants obtain comparable reward (57.6) while achieving high diversity (40 words, 0.73 distinct trigrams)

Contributions

A-LoL RL: a novel offline RL algorithm that can optimize arbitrary numerical rewards on any sequence-to-sequence data

Including the Advantage in training significantly boosts performance over SFT and reward-based offline RL

Advantage LoL RL 🚮 discards negative-advantage (suboptimal) data and boosts sample efficiency in training
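One way to picture the sample-efficiency gain (a sketch under assumptions; reward_fn and value_fn are hypothetical helpers standing in for the reward scorer and the value head): advantages are computed once up front with the frozen reference LM, and negative-advantage examples are dropped before training begins.

```python
def filter_by_advantage(dataset, reward_fn, value_fn):
    """Hedged sketch: keep only positive-advantage (above-average) examples.

    dataset:   iterable of (prompt, response) pairs
    reward_fn: hypothetical helper scoring a (prompt, response) pair
    value_fn:  hypothetical helper returning V(prompt) from the frozen reference LM
    """
    kept = []
    for prompt, response in dataset:
        advantage = reward_fn(prompt, response) - value_fn(prompt)
        if advantage > 0:  # 🚮 discard suboptimal (negative-advantage) data
            kept.append((prompt, response, advantage))
    return kept
```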


DPO saturates quickly but shows high variance during training

Train an LM assistant on conversations with pairs of human-preferred good and bad responses (3 seeds for each method)

 

A-LoL variants are stable and reach high reward

 

Reddit Response Generation - 5 Rewards

88K 🔺 Upvoted Comments

87K 🔻 Downvoted Comments

Unsafe & likely to be downvoted

ToxiChat 🤬 Offensive Classifier

CoLA Fluency Classifier

DialogRPT Engagement and Upvote-probability Classifiers

TF-IDF

A-LoL seq. LM trained on 🔻 downvoted comments almost matches the reward distribution of the LM trained on 🔺 upvoted comments, by removing negative-advantage data 🚮

Test set eval on 2000 Reddit prompts using the 5 reward functions

[Figure annotations: 48% and 36% of the training data had negative advantage 🚮]

Train an LM on Reddit comments with the sum of 5 external attribute scorers as the reward
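As a hedged illustration of this setup, the combined reward could simply sum the five scorer outputs; the callable-based wrapper below is my own framing, not the actual classifier APIs:

```python
from typing import Callable, Sequence

def combined_reward(prompt: str, response: str,
                    scorers: Sequence[Callable[[str, str], float]]) -> float:
    """Hedged sketch: sum of the 5 attribute scorers (ToxiChat safety, CoLA
    fluency, DialogRPT engagement, DialogRPT upvote probability, TF-IDF),
    each wrapped as a (prompt, response) -> float callable."""
    return sum(scorer(prompt, response) for scorer in scorers)
```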


 

A-LoL seq. outperforms DPO and the Reference LM in Helpfulness 💁 and Safety 😊 according to GPT-4 🤖.

Human 🧑‍🔬 evaluations corroborate this trend.

 

Offline and Online RLHF landscape 🌄

[Figure: methods arranged by the training components they use — Preference Data, Reference LM, Rewards, Advantage — covering DPO (ref. free), DPO, PRO, wBC, R. GOLD, R-LoL, A-LoL (seq., KL), A-LoL (ref. free), and PPO (online)]

🗝️ Key idea: we treat the entire output of the LM as a single action and compute a single value estimate for the entire prompt
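With this single-action view, the advantage of a training example can be written as follows (notation mine, consistent with the reward and value estimate described above):

A_{\pi_{\text{ref}}}(x, y) = R(x, y) - V_{\pi_{\text{ref}}}(x)

where R(x, y) is the scalar reward of the full response y to prompt x, and V_{\pi_{\text{ref}}}(x) is the value head's single estimate for the prompt under the frozen reference LM.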

Advantage Leftover Lunch RL 🍱 (A-LoL)
