Reinforcement Learning from Human Feedback
Quick review of Transformer
Encoder part
Embedding & Positional Encoding
Self Attention
Scaled Dot-product Self Attention
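As a reminder, the standard formulation of scaled dot-product attention (from the original Transformer paper) is:

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\]
```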
Multi-Head Self-Attention
Encoder Zoom In
Decoder looks at:
Two types of attention in the decoder:
Decoders in Transformers
Masked self-attention
Encoder-decoder attention
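A minimal sketch of scaled dot-product attention with the causal mask used by masked self-attention above; the function and argument names are illustrative, not from the slides:

```python
import math
import torch

def masked_attention(q, k, v, causal=True):
    # q, k, v: (seq_len, d_k) tensors for a single attention head
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (seq_len, seq_len)
    if causal:
        # Hide future positions so token i can only attend to tokens <= i
        seq_len = scores.size(-1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention weights
    return weights @ v                                   # weighted sum of values
```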
The decoder stack outputs a vector of floats. How do we turn that into a word?
The Final Layer and SoftMax
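In symbols, a sketch of that last step: the final decoder vector h is projected to vocabulary-size logits and turned into probabilities with a softmax (W_vocab, b, and h are illustrative names):

```latex
\[
p(w_i \mid \text{context}) = \mathrm{softmax}\bigl(W_{\text{vocab}}\, h + b\bigr)_i
\]
```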
What makes reinforcement learning different from other machine learning paradigms?
The two core components in RL:
the agent and the environment.
The agent is the decision maker and represents the solution to the problem.
The environment is the representation of a problem.
One of the fundamental distinctions between RL and other ML approaches is that the agent and the environment interact:
the agent attempts to influence the environment through actions, and the environment reacts to the agent’s actions.
A reward is a scalar feedback signal that indicates how well the agent is doing
The agent’s job is to maximize cumulative reward
RL: complex sequential decision-making problems under uncertainty
The reinforcement learning cycle
Model of the environment
Often, agents don’t have access to the actual full state of the environment. The part of a state that the agent can observe is called an observation.
Example: in the robotic arm case, the agent may only have access to camera images, not the exact location of each object
The interactions between the agent and the environment go on for several cycles.
Each cycle is called a time step.
At each time step, the agent observes the environment, takes action, and receives a new observation and reward.
The set of the state, the action, the reward, and the new state is called an experience. Every experience offers an opportunity for learning and improving performance.
Experiences:
t, (s, a, r’, s’)
t+1, (s’, a’, r’’, s’’)
t+2, (s’’, a’’, r’’’, s’’’) ...
The sequence of time steps from the beginning to the end of an episodic task is called an episode
The sum of rewards collected in a single episode is called the return. Agents are often designed to maximize the return.
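For an episodic task with rewards r_1, …, r_T collected over one episode, the return is simply their sum:

```latex
\[
G = r_1 + r_2 + \dots + r_T = \sum_{t=1}^{T} r_t
\]
```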
Experiences & Episodes
Good or bad?
Advantages
Disadvantages
Markov Property
“The future is independent of the past given the present”
A state Sₜ is Markov if and only if P[Sₜ₊₁ | Sₜ] = P[Sₜ₊₁ | S₁, …, Sₜ]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
Markov Decision Process
γ is the discount factor, γ ∈ [0, 1].
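With the discount factor, the return from time step t is the usual discounted sum of future rewards:

```latex
\[
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
```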
Solving MDPs
Policies: Per-state action prescriptions
A policy is a function that prescribes actions to take for a given nonterminal state.
Policies cover all possible states: we need to plan for every possible state.
Policies can be stochastic or deterministic:
Deterministic: For every state 𝑠, the policy outputs a specific action 𝑎 with certainty
Stochastic: For every state 𝑠, the policy outputs a probability distribution over actions.
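In symbols, the two cases above can be written as:

```latex
\[
\text{deterministic: } a = \pi(s), \qquad \text{stochastic: } \pi(a \mid s) = \Pr\bigl[A_t = a \mid S_t = s\bigr]
\]
```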
How can we compare policies?
Given a policy and the MDP, we should be able to calculate the expected return starting from every single state.
Value of a state s when following a policy π:
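The standard definition, consistent with the discounted return above, is the expected return when starting in s and following π thereafter:

```latex
\[
V_{\pi}(s) = \mathbb{E}_{\pi}\bigl[G_t \mid S_t = s\bigr]
           = \mathbb{E}_{\pi}\Bigl[\textstyle\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\Big|\, S_t = s\Bigr]
\]
```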
Bellman equation
It is really a sum over all values of the three variables: the action a, the next state s′, and the reward r.
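Written out, the Bellman expectation equation makes that triple sum explicit:

```latex
\[
V_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} \sum_{r} p(s', r \mid s, a)\,\bigl[r + \gamma V_{\pi}(s')\bigr]
\]
```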
Action-Value function: What should I expect from here if I do this action?
The value function 𝑉(𝑠) gives the expected future reward of being in a state, but it doesn't tell you how to move between states.
Another critical question we often need to ask isn’t merely about the value of a state, but about the value of taking action a in state s: which action is better under each policy?
The action-value function, also known as the Q-function, captures precisely this:
the expected return if the agent follows policy π after taking action a in state s.
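In symbols:

```latex
\[
Q_{\pi}(s, a) = \mathbb{E}_{\pi}\bigl[G_t \mid S_t = s, A_t = a\bigr]
             = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_{\pi}(s')\bigr]
\]
```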
A policy π is better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states.
Iterative policy evaluation: compute the state-value function 𝑉𝜋(𝑠) for a given policy π
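One sweep of iterative policy evaluation turns the Bellman equation into an update rule, repeated over all states until the values stop changing:

```latex
\[
V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V_{k}(s')\bigr]
\]
```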
Policy evaluation: Rating policies
Optimality
Optimal Value Function:
Optimal Action-Value Function:
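The standard definitions, together with the relation between them:

```latex
\[
V^{*}(s) = \max_{\pi} V_{\pi}(s), \qquad
Q^{*}(s, a) = \max_{\pi} Q_{\pi}(s, a), \qquad
V^{*}(s) = \max_{a} Q^{*}(s, a)
\]
```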
Although there can be more than one optimal policy for a given MDP, there is only one optimal state-value function and one optimal action-value function.
Policy Iteration: Policy-Improvement algorithm
Initialization:
Policy Evaluation:
Evaluate the current policy using iterative policy evaluation
Policy Improvement:
For each state s, improve the policy π(s) by choosing the action that maximizes the expected value, i.e.:
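The improvement step picks, in each state, the greedy action π′(s) = argmaxₐ Σ_{s′,r} p(s′, r | s, a)[r + γ Vπ(s′)]. Below is a minimal sketch of the full policy-iteration loop, assuming a tabular MDP described by a hypothetical transition table `P[s][a] = [(prob, next_state, reward, done), ...]`; all names are illustrative, not from the slides:

```python
import random

def policy_iteration(states, actions, P, gamma=0.9, theta=1e-6):
    """Tabular policy iteration for an MDP given as
    P[s][a] = list of (prob, next_state, reward, done) tuples."""
    V = {s: 0.0 for s in states}
    pi = {s: random.choice(actions) for s in states}  # arbitrary initial policy

    def q_value(s, a):
        # One-step lookahead: expected reward plus discounted value of next state
        return sum(prob * (r + gamma * V[s2] * (not done))
                   for prob, s2, r, done in P[s][a])

    while True:
        # Policy evaluation: sweep the Bellman expectation update until convergence
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, pi[s])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V
        stable = True
        for s in states:
            best_a = max(actions, key=lambda a: q_value(s, a))
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:
            return pi, V
```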
Reinforcement Learning from Human Feedback
*These slides are heavily borrowed from huggingface.co
Why Reinforcement Learning from Human Feedback (RLHF)
How do you create / code a loss function for:
Don’t encode it, model it!
History: early OpenAI experiments with RLHF
Stiennon, Nisan, et al. "Learning to summarize with human feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.
“Three pigs defend themselves from a mean wolf”
Modern RLHF overview
Common training techniques in NLP:
- Unsupervised sequence prediction
- Data scraped from web
- No single answer on “best” model size (examples in industry range 10B-280B parameters)
Dataset:
- Reddit, other forums, news, books
- Optionally include human-written text from predefined prompts
Optional step:
- Pay humans to write responses to existing prompts
- Considered a high-quality initialization for RLHF
Supervised Fine Tuning (SFT)
2. Reward model training
How do we capture human preferences over sampled and curated text? What is the loss?
Goal: get a model that maps
input text → scalar reward
2. Reward model training - dataset
Prompt (input) dataset:
- Prompts for the specific use case the model will be used for
- E.g. chat questions or prompt-based data
- Much smaller than the original pretraining dataset!
2. Reward model training - dataset
Generating data to rank:
- Multiple models can often be used to create diverse rankings
- The set of prompts can come from user data (e.g., ChatGPT)
2. Reward model training
Reward model:
- Also a transformer-based LM
- Sizes used vary (relative to the policy)
- Outputs a scalar from a text input
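A minimal sketch of the pairwise ranking loss commonly used to train such a reward model: the model scores the human-preferred and the rejected completion for the same prompt, and the loss pushes the preferred score higher. Function and variable names are illustrative, not from the slides:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward model training.

    r_chosen / r_rejected: scalar rewards the model assigns to the
    human-preferred and less-preferred completions for the same prompt.
    Minimizing this pushes r_chosen above r_rejected.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative usage with dummy scores for a batch of 4 comparison pairs
r_chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
r_rejected = torch.tensor([0.5, 0.7, -0.1, 1.5])
print(reward_ranking_loss(r_chosen, r_rejected))
```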
3. Fine tuning with RL
3. Fine tuning with RL - using a reward model
3. Fine tuning with RL - KL penalty
Constrains the RL fine-tuning so that it does not produce an LM that outputs gibberish (to fool the reward model).
Kullback–Leibler (KL) divergence:
A measure of the difference between two probability distributions
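A minimal sketch of how the KL penalty is typically folded into the reward used for RL, assuming per-token log-probabilities from the tuned policy and the frozen initial LM; `beta` and the function name are illustrative:

```python
import torch

def kl_penalized_reward(reward_model_score, logprobs_policy, logprobs_ref, beta=0.02):
    """Combine the reward model score with a per-token KL penalty.

    logprobs_policy / logprobs_ref: log-probabilities of the sampled tokens
    under the RL policy and the frozen initial (reference) LM.
    The penalty keeps the fine-tuned policy close to the reference model.
    """
    kl_per_token = logprobs_policy - logprobs_ref          # approx. KL contribution
    return reward_model_score - beta * kl_per_token.sum()  # scalar reward for the sample
```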
3. Fine tuning with RL - combining rewards
Option to add additional terms to this reward function, e.g., InstructGPT adds:
a reward term to match the original human-curation distribution
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
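Roughly, the combined objective from the cited InstructGPT paper, where the last term (weighted by γ, here a pretraining-loss coefficient rather than a discount factor) keeps the model close to the original pretraining distribution:

```latex
\[
\text{objective}(\phi) =
\mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}\!\Bigl[ r_{\theta}(x,y)
  - \beta \log\frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \Bigr]
+ \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\!\bigl[ \log \pi_{\phi}^{\mathrm{RL}}(x) \bigr]
\]
```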
3. Fine tuning with RL - feedback & training
- Policy gradients update the policy LM directly.
- Often some parameters of the policy are frozen.
3. Fine tuning with RL - PPO
Proximal Policy Optimization (PPO)
- An on-policy algorithm
- Works with discrete or continuous actions
- Optimized for parallelization
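A minimal sketch of the clipped surrogate objective at the core of PPO (not a full training loop); names are illustrative, not from the slides:

```python
import torch

def ppo_clipped_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    logprobs_new / logprobs_old: log-probabilities of the taken actions under
    the current and the behavior (old) policy; advantages: advantage estimates.
    Returns a loss to minimize (the negative of the clipped objective).
    """
    ratio = torch.exp(logprobs_new - logprobs_old)          # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```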
Recapping recent examples - InstructGPT
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
Thank You