Ashutosh Baheti, Ximing Lu, Faeze Brahman, Ronan Le Bras, Maarten Sap, Mark Riedl
🍱Leftover-Lunch: Advantage-based Offline Reinforcement Learning for Language Models
Advantage Leftover Lunch RL
[Method figure: start from the supervised-learning negative log-likelihood loss, then incorporate a sequence-level importance weight and advantage (steps 1-4); the value estimate comes from a trainable Linear head on top of the frozen ❄️ reference LM's attention layers]
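Reading the figure as code: a minimal PyTorch sketch of the sequence-level A-LoL loss, assuming per-sequence log-probabilities and advantages are precomputed; the function name, shapes, and clipping constant are illustrative choices, not from the released implementation.

```python
import torch

def a_lol_seq_loss(policy_logprob, ref_logprob, advantage, iw_clip=2.0):
    """Sequence-level A-LoL loss for one batch (illustrative sketch).

    policy_logprob: (B,) summed token log-probs of y under the trainable LM
    ref_logprob:    (B,) the same sums under the frozen ❄️ reference LM
    advantage:      (B,) precomputed A(x, y) = R(x, y) - V(x); assumed
                    positive, since negative-advantage data is discarded
    """
    # Clamped sequence-level importance weight, detached so it acts as a
    # fixed per-example coefficient rather than a gradient path.
    iw = torch.exp(policy_logprob - ref_logprob).detach().clamp(max=iw_clip)

    # Advantage- and importance-weighted negative log-likelihood.
    return -(advantage * iw * policy_logprob).mean()
```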
Evaluation Setup
Finetune with NLL loss for 100% of training steps
Continue training with a baseline or the A-LoL RL algorithm for 50% more steps
Our method, A-LoL, doesn't require pairwise-preference data and can benefit from a Reference LM and Advantage (same as online PPO)
RLHF Benchmark: Helpful and Harmless Assistant Task
DPO achieves the highest avg. reward (58.6) but generates long (67 words) and less diverse (0.5 distinct trigrams) responses
A-LoL variants obtain comparable reward (57.6) while generating shorter (40 words) and more diverse (0.73 distinct trigrams) responses
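Distinct trigrams here is the standard distinct-n diversity metric; a minimal self-contained sketch (not necessarily the authors' exact evaluation script):

```python
def distinct_n(responses, n=3):
    """Fraction of n-grams that are unique across all generated responses."""
    ngrams, total = set(), 0
    for text in responses:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i : i + n]))
            total += 1
    return len(ngrams) / max(total, 1)

# Higher values mean less repetitive generations.
print(distinct_n(["the cat sat on the mat", "the cat sat on the rug"]))
```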
Contributions
A-LoL RL - a novel offline RL algorithm that can optimize arbitrary numerical rewards on any sequence-to-sequence data
Including Advantage in training significantly boosts performance over SFT and reward-based offline RL
Advantage LoL RL 🚮discards negative-advantage (suboptimal) data and boosts sample efficiency in training (see the sketch below)
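A sketch of the one-time pruning this refers to, assuming rewards R(x, y) and prompt values V(x) have already been scored offline; the helper name is hypothetical:

```python
def leftover_lunch_filter(dataset, rewards, values):
    """🚮 Keep only positive-advantage training points and report the rest."""
    kept = [ex for ex, r, v in zip(dataset, rewards, values) if r - v > 0]
    discarded = 1 - len(kept) / len(dataset)
    print(f"kept {len(kept)}/{len(dataset)} examples "
          f"({discarded:.0%} had negative advantage)")
    return kept
```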
DPO saturates quickly but shows high variance during training
Train an LM assistant on conversations with pairs of human-preferred good and bad responses (3 seeds for each method)
A-LoL variants are stable and reach high reward
Reddit Response Generation - 5 Rewards
88K 🔺Upvoted Comments
87K 🔻Downvoted Comments
Unsafe & likely to be downvoted
ToxiChat 🤬Offensiveness Classifier
CoLA Fluency Classifier
DialogRPT Engagement and Upvote Probability Classifiers
TF-IDF Diversity
A-LoL seq. LM trained on 🔻downvoted comments almost matches the reward distribution of the LM trained on 🔺upvoted comments, thanks to 🚮 removing negative-advantage data
Test set eval on 2000 Reddit prompts using the 5 reward functions
(48% and 36% of the two training sets had negative advantage 🚮)
Train the LM on Reddit comments with the sum of 5 external attribute scorers as the reward
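A sketch of how such a summed reward could be wired up; the lambdas below are hypothetical stand-ins for the ToxiChat, CoLA, DialogRPT, and TF-IDF scorers listed above:

```python
def total_reward(prompt, response, scorers):
    """Sum the five external attribute scores into one scalar R(x, y)."""
    return sum(score(prompt, response) for score in scorers.values())

# Illustrative wiring; each lambda would wrap a real classifier's inference.
scorers = {
    "safety":     lambda x, y: 1.0,  # 1 - ToxiChat offensiveness probability
    "fluency":    lambda x, y: 1.0,  # CoLA acceptability probability
    "engagement": lambda x, y: 1.0,  # DialogRPT engagement score
    "upvote":     lambda x, y: 1.0,  # DialogRPT upvote probability
    "diversity":  lambda x, y: 1.0,  # TF-IDF based diversity score
}
reward = total_reward("prompt text", "candidate reply", scorers)
```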
Safety😊
Helpfulness💁
A-LoL seq. outperforms DPO and the Reference LM in Helpfulness💁 and Safety😊 according to GPT-4🤖.
Human 🧑‍🔬 evaluations also corroborate this trend.
[Figure: Offline and Online RLHF landscape 🌄, arranging methods by the ingredients they use (Preference Data, Reference LM, Rewards, Advantage): DPO (ref. free), DPO, PRO, wBC, R. GOLD, R-LoL, A-LoL (seq., KL), PPO (online), and A-LoL (ref. free)]
🗝️ We treat the entire LM output as a single action and compute a single value estimate for the entire prompt
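A minimal sketch of that single value estimate, assuming a HuggingFace-style causal LM kept frozen ❄️ while only a linear head is trained to predict the expected reward for a prompt; class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class PromptValueHead(nn.Module):
    """Trainable linear head predicting V(x) from a frozen LM's features."""

    def __init__(self, frozen_lm, hidden_size):
        super().__init__()
        self.lm = frozen_lm.eval()              # ❄️ reference LM stays frozen
        for p in self.lm.parameters():
            p.requires_grad_(False)
        self.value = nn.Linear(hidden_size, 1)  # the only trained parameters

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                   # no gradients through the LM
            out = self.lm(input_ids, attention_mask=attention_mask,
                          output_hidden_states=True)
        hidden = out.hidden_states[-1]          # (batch, seq_len, hidden)
        # One value per prompt: read the last non-padding position.
        last = attention_mask.sum(dim=1) - 1    # (batch,)
        feats = hidden[torch.arange(input_ids.size(0)), last]
        return self.value(feats).squeeze(-1)    # V(x), shape (batch,)
```

The advantage used for training and data filtering is then A(x, y) = R(x, y) - V(x).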
Advantage Leftover Lunch RL 🍱 (A-LoL)