1 of 54

Reinforcement Learning from Human Feedback

2 of 54

  1. Review of Transformer
  2. Introduction to Reinforcement Learning
  3. Reinforcement Learning from Human Feedback

3 of 54

Quick review of the Transformer

4 of 54

Encoder part

5 of 54

Embedding & Positional Encoding

6 of 54

Embedding & Positional Encoding

7 of 54

Self Attention

8 of 54

Scaled Dot-product Self Attention

9 of 54

Scaled Dot-product Self Attention

10 of 54

Scaled Dot-product Self Attention
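The formula behind these slides is Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. A minimal NumPy sketch of that computation follows; the shapes and random inputs are made up purely for illustration and are not the lecture's own code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of the values

# toy example: 3 query tokens, 4 key/value tokens, d_k = d_v = 8
Q = np.random.randn(3, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out = scaled_dot_product_attention(Q, K, V)              # shape (3, 8)
```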

11 of 54

Multi-Head Self-Attention

12 of 54

Encoder Zoom In

13 of 54

Decoder looks at:

  • Previous target tokens
  • Source representations

Two types of attention in the decoder:

  • (masked) self-attention
  • encoder-decoder attention

Decoders in Transformers

14 of 54

  • In the decoder, self-attention is a bit different from the one in the encoder.
  • The encoder receives all tokens at once, and each token can look at all tokens in the input sentence.
  • To prevent the decoder from looking ahead, the model uses masked self-attention: future tokens are masked out (a minimal sketch follows below).
  • But when could the decoder look ahead in the first place?
  • During generation, it can't: we don't know what comes next.
  • During training, we feed the whole target sentence to the decoder; without masks, the tokens would "see the future".

Masked self-attention
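A minimal NumPy sketch of the masking trick: scores for future positions are set to -inf before the softmax, so their attention weights become zero. The projection matrices and shapes here are invented for illustration.

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: token i may only attend to tokens 0..i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # causal mask: block out the upper triangle (future positions) before the softmax
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# toy usage: 5 target tokens, model width 16, head width 8
n, d_model, d_k = 5, 16, 8
X = np.random.randn(n, d_model)
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)   # shape (5, 8)
```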

15 of 54

Encoder decoder attention

  • The queries come from the decoder, and the keys and values come from the encoder output.
  • This is where we mix or combine the two different input sequences (see the sketch below).
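A small sketch of this mixing, assuming NumPy arrays for the decoder and encoder states and made-up projection matrices:

```python
import numpy as np

def cross_attention(dec_states, enc_states, Wq, Wk, Wv):
    """Encoder-decoder attention: queries from the decoder,
    keys and values from the encoder output."""
    Q = dec_states @ Wq                  # (tgt_len, d_k)  from the decoder
    K = enc_states @ Wk                  # (src_len, d_k)  from the encoder
    V = enc_states @ Wv                  # (src_len, d_v)  from the encoder
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                   # (tgt_len, d_v): each target token mixes in source info
```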

16 of 54

The decoder stack outputs a vector of floats. How do we turn that into a word?

  • The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders into a much, much larger vector called a logits vector.
  • Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it has learned from its training dataset.
  • This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word.
  • The SoftMax layer then turns those scores into probabilities (all positive, all adding up to 1.0).
  • The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

The Final Layer and SoftMax
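A toy sketch of this final step, with the 10,000-word vocabulary from the slide, random weights, and greedy selection; everything here is illustrative, not a real trained model.

```python
import numpy as np

vocab_size, d_model = 10_000, 512
W_out = np.random.randn(d_model, vocab_size) * 0.01   # the "Linear layer"

decoder_output = np.random.randn(d_model)             # decoder vector for the current time step
logits = decoder_output @ W_out                        # 10,000 scores, one per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                   # softmax: all positive, sums to 1.0
next_word_id = int(np.argmax(probs))                   # greedy choice for this time step
```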

17 of 54

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non i.i.d data)
  • Agent’s actions affect the subsequent data it receives

18 of 54

The two core components in RL:

the agent and the environment.

The agent is the decision maker, and is the solution to a problem.

The environment is the representation of a problem.

One of the fundamental distinctions of RL from other ML approaches is that the agent and the environment interact:

the agent attempts to influence the environment through actions, and the environment reacts to the agent’s actions.

A reward is a scalar feedback signal that indicates how well the agent is doing.

The agent’s job is to maximize cumulative reward

RL: complex sequential decision-making problems under uncertainty

19 of 54

The reinforcement learning cycle

20 of 54

Model of the environment

  • The environment is represented by a set of variables related to the problem.
  • In the robotic arm example, the location and velocities of the arm would be part of the variables that make up the environment.
  • This set of variables and all the possible values that they can take are referred to as the state space.
  • A state is an instantiation of the state space, a set of values the variables take
  • The environment may change states as a response to the agent’s action.
  • The function that’s responsible for this mapping is called the transition function.
  • The environment may also provide a reward signal as a response.
  • The function responsible for this mapping is called the reward function.
  • The set of transition and reward functions is referred to as the model of the environment

21 of 54

Often, agents don’t have access to the actual full state of the environment. The part of a state that the agent can observe is called an observation.

Example: in the robotic arm example, the agent may only have access to camera images, not the exact location of each object.

22 of 54

The interactions between the agent and the environment go on for several cycles.

Each cycle is called a time step.

At each time step, the agent observes the environment, takes action, and receives a new observation and reward.

The set of the state, the action, the reward, and the new state is called an experience. Every experience has an opportunity for learning and improving performance.

Experiences:

t, (s, a, r’, s’)

t+1, (s’, a’, r’’, s’’)

t+2, (s’’, a’’, r’’’, s’’’) ...

The sequence of time steps from the beginning to the end of an episodic task is called an episode

The sum of rewards collected in a single episode is called a return. Agents are often designed to maximize the return.

Experiences & Episodes
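A minimal sketch of this interaction loop that collects experience tuples and the return. It uses the gymnasium API with CartPole purely as a stand-in environment, and a random policy stands in for the agent.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()

episode = []          # list of experience tuples (s, a, r, s')
ret = 0.0             # return: sum of rewards collected in this episode
done = False
while not done:
    action = env.action_space.sample()                 # placeholder for the agent's policy
    next_obs, reward, terminated, truncated, info = env.step(action)
    episode.append((obs, action, reward, next_obs))    # one experience per time step
    ret += reward
    obs = next_obs
    done = terminated or truncated

print(f"episode length: {len(episode)}, return: {ret}")
```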

23 of 54

Good or bad?

Advantages

  • Adaptability to complex environments
  • RL excels in dynamic and complex environments where explicit programming is difficult; it learns optimal behaviors through interaction with the environment.
  • RL algorithms can handle large, high-dimensional state and action spaces.
  • RL considers future rewards, enabling agents to plan actions that maximize long-term benefits rather than immediate gains.

Disadvantages

  • Deep reinforcement learning agents need lots of interaction samples!
  • Designing a reward function for a task such as walking isn’t straightforward
  • The agents will have to make mistakes to learn. Can you imagine a self-driving car agent learning not to crash by crashing?

24 of 54

Markov Property

“The future is independent of the past given the present”

A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, …, S_t].

The state captures all relevant information from the history

Once the state is known, the history may be thrown away

25 of 54

Markov Decision Process

  • Modeling the problem using a mathematical framework known as Markov decision processes (MDPs).
  • In RL, we assume all environments have an MDP working under the hood.
  • It is an environment in which all states are Markov
  • A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩:
  • S is a finite set of states
  • A is a finite set of actions
  • P is a state transition probability matrix
  • R is a reward function
  • γ is a discount factor, γ ∈ [0, 1]
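A toy MDP spelled out as plain Python data structures; the states, actions, probabilities, and rewards below are invented purely for illustration.

```python
# A toy MDP as plain Python data structures (all names here are made up for illustration).
states  = ["cool", "overheated"]          # S
actions = ["slow", "fast"]                # A
gamma   = 0.9                             # discount factor

# P[s][a] is a list of (probability, next_state) pairs: the transition probabilities.
P = {
    "cool":       {"slow": [(1.0, "cool")],
                   "fast": [(0.8, "cool"), (0.2, "overheated")]},
    "overheated": {"slow": [(1.0, "cool")],
                   "fast": [(1.0, "overheated")]},
}

# R[s][a] is the expected immediate reward for taking action a in state s.
R = {
    "cool":       {"slow": 1.0, "fast": 2.0},
    "overheated": {"slow": 0.0, "fast": -10.0},
}
```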

26 of 54

  • The sum of all rewards obtained during the course of an episode is referred to as the return:

    G = R_1 + R_2 + R_3 + … + R_T

  • Discounted return: down-weight rewards that occur later during the episode:

    G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …

  • Simplifying gives a more general equation:

    G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}

  • Equally interesting is the recursive definition:

    G_t = R_{t+1} + γ G_{t+1}
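A small sketch that computes the discounted return via the recursive definition, sweeping the reward list backwards:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = R_{t+1} + gamma * G_{t+1}, computed by sweeping backwards over the episode."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```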

27 of 54

Solving MDPs

Policies: Per-state action prescriptions

A policy is a function that prescribes the action to take in a given nonterminal state.

Policies cover all possible states: we need to plan for every possible state.

Policies can be stochastic or deterministic:

Deterministic: For every state 𝑠, the policy outputs a specific action 𝑎 with certainty

Stochastic: For every state 𝑠, the policy outputs a probability distribution over actions.
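A minimal sketch of the two kinds of policy, reusing the made-up states and actions from the toy MDP above:

```python
import random

# Deterministic policy: a lookup table state -> action.
pi_det = {"cool": "fast", "overheated": "slow"}

# Stochastic policy: state -> probability distribution over actions.
pi_sto = {"cool": {"slow": 0.3, "fast": 0.7},
          "overheated": {"slow": 1.0, "fast": 0.0}}

def act(policy, state):
    choice = policy[state]
    if isinstance(choice, dict):                       # stochastic: sample from the distribution
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice                                      # deterministic: return the prescribed action
```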

28 of 54

How can we compare policies?

29 of 54

How can we compare policies?

Given a policy and the MDP, we should be able to calculate the expected return starting from every single state.

Value of a state s when following a policy π:

  • The value of a state s under policy π is the expected return if the agent follows policy π starting from state s: Vπ(s) = E_π[ G_t | S_t = s ]

  • Calculate this for every state, and you get the state-value function, also called the V-function or simply the value function.

Bellman equation:

Vπ(s) = Σ_a π(a|s) Σ_{s′, r} p(s′, r | s, a) [ r + γ Vπ(s′) ]

It is really a sum over all values of the three variables a, s′, and r.

30 of 54

Action-Value function: What should I expect from here if I do this action?

The value function 𝑉(𝑠) gives the expected future reward of being in a state, but it doesn't tell you how to move between states.

Another critical question that we often need to ask isn’t merely about the value of a state but about the value of taking action a in state s: which action is better under each policy?

The action-value function, also known as the Q-function, captures precisely this:

Qπ(s, a), the expected return if the agent takes action a in state s and follows policy π thereafter.

31 of 54

A policy π is better than or equal to a policy π′ if its expected return is greater than or equal to that of π′ for all states.

Iterative policy evaluation: compute the state-value function Vπ(s) for a given policy π.

Policy evaluation: Rating policies
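A sketch of iterative policy evaluation for a stochastic policy, assuming the dictionary-style P and R of the toy MDP sketched earlier: the Bellman expectation backup is repeated until the values stop changing.

```python
def policy_evaluation(pi, states, P, R, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation.

    pi[s] maps each action to its probability under the policy;
    P[s][a] is a list of (probability, next_state) pairs; R[s][a] is the expected reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman expectation backup for state s
            v = sum(pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in pi[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:       # values have (numerically) converged
            return V
```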

32 of 54

Optimality

Optimal Value Function:

  • The optimal state-value function is the state-value function with the maximum value across all policies, for all states: V*(s) = max_π Vπ(s)

  • The optimal value function specifies the best possible performance in the MDP.

  • An MDP is “solved” when we know the optimal value function.

Optimal Action-Value Function:

Q*(s, a) = max_π Qπ(s, a)

While there could be more than one optimal policy for a given MDP, there can only be one optimal state-value function and one optimal action-value function.

33 of 54

Policy Iteration: Policy-Improvement algorithm

Initialization:

  • Start with an initial policy π₀, which could be a random policy.
  • Set the initial value function Vπ₀(s).

Policy Evaluation:

Evaluate the current policy using iterative policy evaluation

Policy Improvement:

For each state s, improve the policy π(s) by choosing the action that maximizes the expected value, i.e. π′(s) = argmax_a Σ_{s′, r} p(s′, r | s, a) [ r + γ Vπ(s′) ]
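A compact sketch of the full loop, again assuming the dictionary-style toy MDP from earlier: evaluate the current deterministic policy, improve it greedily via a one-step lookahead, and stop when the policy no longer changes.

```python
def q_from_v(V, s, P, R, gamma):
    """One-step lookahead: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')."""
    return {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]) for a in R[s]}

def policy_iteration(states, P, R, gamma=0.9, eval_sweeps=500):
    pi = {s: next(iter(R[s])) for s in states}          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: repeated Bellman backups for the current deterministic policy.
        V = {s: 0.0 for s in states}
        for _ in range(eval_sweeps):
            V = {s: R[s][pi[s]] + gamma * sum(p * V[s2] for p, s2 in P[s][pi[s]])
                 for s in states}
        # Policy improvement: act greedily with respect to V.
        new_pi = {}
        for s in states:
            q = q_from_v(V, s, P, R, gamma)
            new_pi[s] = max(q, key=q.get)
        if new_pi == pi:                                # stable policy: done
            return pi, V
        pi = new_pi
```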

34 of 54

Reinforcement Learning from Human Feedback

*These slides are heavily borrowed from huggingface.co

35 of 54

Why Reinforcement Learning from Human Feedback (RLHF)?

How do you create / code a loss function for:

  • What is funny?
  • What is ethical?
  • What is safe?


Don’t encode it, model it!

36 of 54

History: early OpenAI experiments with RLHF


Stiennon, Nisan, et al. "Learning to summarize with human feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.

“Three pigs defend themselves from a mean wolf”

37 of 54

Modern RLHF overview


38 of 54

  1. Language model pretraining


Common training techniques in NLP:

- Unsupervised sequence prediction

- Data scraped from web

- No single answer on “best” model size (examples in industry range 10B-280B parameters)
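A minimal PyTorch sketch of the unsupervised sequence-prediction objective: random logits stand in for the language model's output, and only the target shifting and the cross-entropy loss are the point.

```python
import torch
import torch.nn.functional as F

# toy shapes: a batch of 2 sequences, 16 tokens each, 50k-token vocabulary
batch, seq_len, vocab = 2, 16, 50_000
tokens = torch.randint(0, vocab, (batch, seq_len))

# in practice these logits come from the language model; random values just show the shapes
logits = torch.randn(batch, seq_len - 1, vocab)

# predict token t+1 from tokens up to t: the targets are the inputs shifted by one position
targets = tokens[:, 1:]
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(loss)   # average negative log-likelihood of the "next" tokens
```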

39 of 54

  1. Language model pretraining: dataset


Dataset:

- Reddit, other forums, news, books

- Optionally include human-written text from predefined prompts

40 of 54

  1. Language model pretraining: human generation


Optional step:

- Pay humans to write responses to existing prompts

- Considered high quality initialization for RLHF

Supervised Fine Tuning (SFT)

41 of 54

2. Reward model training


How can we capture human sentiments about sampled and curated text? What is the loss?

Goal: get a model that maps

input text → scalar reward
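One common choice (used in InstructGPT-style reward models) is a pairwise ranking loss on human comparisons: the preferred completion should score higher than the rejected one. A minimal PyTorch sketch with made-up scalar rewards:

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for RLHF reward models:
    loss = -log(sigmoid(r_chosen - r_rejected)),
    pushing the reward of the human-preferred completion above the rejected one."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# toy scalar rewards for a batch of 4 comparison pairs
r_chosen = torch.tensor([1.2, 0.3, 0.8, -0.1])
r_rejected = torch.tensor([0.4, 0.5, -0.2, -0.9])
print(reward_ranking_loss(r_chosen, r_rejected))
```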

42 of 54

2. Reward model training - dataset


Prompts (input) dataset:

- Prompts for the specific use case the model will be used for

- E.g. chat questions or prompt-based data

- Much smaller than original pretraining!

43 of 54

2. Reward model training - dataset


Generating data to rank:

- Often multiple models can be used to create diverse rankings

- Set of prompts can be from user data (e.g. ChatGPT)

44 of 54

2. Reward model training


45 of 54

2. Reward model training


46 of 54

2. Reward model training


Reward model:

- Also a transformer-based LM

- Variation in sizes used (relative to the policy)

- Outputs a scalar from text input
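A hedged sketch of such a reward model: a pretrained transformer body (gpt2 here is only an example choice, not the lecture's model) with a small linear head that maps the final token's hidden state to one scalar.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Transformer LM body + linear head mapping the last token's hidden state to a scalar."""
    def __init__(self, base_name="gpt2"):                 # base model choice is just an example
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.value_head(hidden[:, -1, :]).squeeze(-1)   # one scalar reward per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer(["The pigs built a brick house."], return_tensors="pt")
reward = RewardModel()(batch["input_ids"], batch["attention_mask"])
```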

47 of 54

3. Fine tuning with RL


48 of 54

3. Fine tuning with RL - using a reward model


49 of 54

3. Fine tuning with RL - KL penalty


Constrains the RL fine-tuning so that it does not result in an LM that outputs gibberish (to fool the reward model).

Kullback–Leibler (KL) divergence:

Distance between distributions
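A minimal sketch of how the per-sequence reward is typically assembled, with a made-up coefficient beta; the KL term is approximated from the log-probabilities of the policy and the frozen reference model over the generated tokens.

```python
import torch

def rlhf_reward(rm_score, logprobs_policy, logprobs_ref, beta=0.02):
    """Combine the reward-model score with a KL penalty that keeps the fine-tuned
    policy close to the frozen reference model:
        r = r_RM - beta * (log pi_policy(y|x) - log pi_ref(y|x))"""
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)    # summed over generated tokens
    return rm_score - beta * kl

# toy values for a batch of 2 generations, 5 tokens each
rm_score = torch.tensor([0.7, -0.2])
logprobs_policy = torch.randn(2, 5)
logprobs_ref = torch.randn(2, 5)
print(rlhf_reward(rm_score, logprobs_policy, logprobs_ref))
```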

50 of 54

3. Fine tuning with RL - combining rewards


Option to add additional terms to this reward function. E.g. InstructGPT

Reward to match original human-curation distribution

Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).

51 of 54

3. Fine tuning with RL - feedback & training


- Policy-gradient updates are applied to the policy LM directly.

- Often some parameters of the policy are frozen.

52 of 54

3. Fine tuning with RL - PPO


Proximal Policy Optimization (PPO)

- on-policy algorithm,

- works with discrete or continuous actions,

- optimized for parallelization.
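A minimal sketch of PPO's clipped surrogate objective, with toy log-probabilities and advantages; this is not the full PPO algorithm, which also needs a value function and advantage estimation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective:
    L = -E[ min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) ]"""
    ratio = torch.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# toy batch of 4 actions
logp_new = torch.tensor([-1.0, -0.5, -2.0, -0.8], requires_grad=True)
logp_old = torch.tensor([-1.1, -0.6, -1.8, -0.9])
advantages = torch.tensor([0.5, -0.2, 1.0, 0.1])
loss = ppo_clip_loss(logp_new, logp_old, advantages)
loss.backward()
```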

53 of 54

Recapping recent examples - InstructGPT


Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).

54 of 54

Thank You