1 of 96

CS 6120

Week 13, Topic 11.2: Practically Leveraging Large Language Models

2 of 96

Project Demo Day

  • April 24, 2025: Class in Room 916!
  • 5-10 Minutes (Depending on Number of Papers): Paper lightning presentations (3 min each)
  • 30 Minutes Maximum: Each team introduces the Dataset (RAG) or Application (PAL) they are using and introduces their work (1 min each)
  • 45 Minutes to 1 Hour Maximum: Each instructor / TA tries out projects (5 Q's each)
  • 30 Minutes Maximum: Students try out each other's applications

3 of 96

Administrivia

  • Time for Course Reviews
  • Final Project Presentations: Add links to slides. At most 3 slides: an elevator pitch.
  • Reading Group Session: GPT-4 TR, RLHF
  • Discussion and Addressing the Labs and Homeworks: https://course.ccs.neu.edu/cs6120s25/submissions
  • Canvassing for Teaching Assistants
  • Time for Laboratories (Catchup) - Pranav to Provide a Correction
  • Extra credit for the optional laboratories

4 of 96

Administrative - Homework 5

  • Assignment 5 Point System

  • February 20: Assigned
  • March 13: Due
  • March 20: Late due date - after this time, 1 point off per workday (M-F)
  • March 27: Late due date - after this time, 3 points off
  • April 4: Cut-off

5 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

6 of 96

Last Week: Pre-Training and Fine-Tuning

7 of 96

Last Week: Pre-Training and Fine-Tuning

8 of 96

Prompt Engineering: Full Specifications

9 of 96

In-Context Learning

  • Providing examples in the context window = in-context learning
  • Your projects can include examples of in-context learning!

10 of 96

In Context Learning (ICL) - Zero Shot Inference

  • Providing only the task prompt inside the context window, with no examples (zero-shot inference)

11 of 96

In Context Learning (ICL) - One Shot Inference

GPT-2

Supply an example!
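To make the distinction concrete, here is a minimal sketch of a zero-shot vs. a one-shot prompt; the sentiment task, the review text, and the generate() helper are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of zero-shot vs. one-shot prompts for sentiment
# classification. The review text is invented for illustration, and
# generate() stands in for whatever LLM completion call you use.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model)."""
    raise NotImplementedError

review = "I loved this movie, the pacing was perfect."

# Zero-shot: only the task description and the input, no examples.
zero_shot = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    f"Review: {review}\n"
    "Sentiment:"
)

# One-shot: supply a single worked example before the real input,
# so a smaller model (e.g., GPT-2-sized) can pick up the format.
one_shot = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    "Review: The plot made no sense and I walked out.\n"
    "Sentiment: Negative\n"
    f"Review: {review}\n"
    "Sentiment:"
)

# print(generate(zero_shot)); print(generate(one_shot))
```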

12 of 96

Summary of ICL, Strongly Dependent on Model Size

13 of 96

Prompt Engineering: Useful Techniques

  • Few-Shot Learning - Provide a few examples that guide the model toward the kind of output you'd like (see the prompt sketch below).
  • Chain of Thought Prompting - Have the model talk through its reasoning step by step to reach the output you want it to achieve.

[Figure: two Text-Davinci-002 (cusp of GPT 3.0-3.5) examples: a few-shot prompt with 3 examples guiding the LLM, and an un-guided prompt whose answers are apropos of nothing and consistently wrong.]
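A rough sketch of a few-shot chain-of-thought prompt, using the well-known worked example from the Wei et al. chain-of-thought paper; the comment about model behavior summarizes that paper's Figure 1, not these slides.

```python
# Sketch of a few-shot chain-of-thought (CoT) prompt: the worked example
# shows the model *how* to reason step by step before giving an answer.

cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and
bought 6 more. How many apples do they have?
A:"""

# In the chain-of-thought paper, standard prompting yields the wrong
# answer (27) here, while the CoT prompt produces the reasoning
# "23 - 20 = 3, 3 + 6 = 9. The answer is 9."
```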

14 of 96

Chain of Thought Reasoning

15 of 96

Chain of Thought Reasoning: 1 or Few Shot Learning

16 of 96

ICL or No?

Note that:

  • Context window is limited (a few thousand tokens)

When to Fine-Tune?

  • When many in-prompt examples would be needed (e.g., 5-6 examples)
  • Teaching intuition (when words fall short)
  • Baking in tone, style, and output formatting
  • Desire to reduce prompt context size
  • Want to distill and train a smaller model

17 of 96

Provisioning: Peak vs Baseline

  • Pay as You Go (PayGo)
  • Provisioned Throughput Units (PTU) - A consistent / fixed cost per month
    • Etsy went with this
    • Monitor and understand what's right for you (see the back-of-the-envelope sketch below)
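As a back-of-the-envelope aid for the monitoring bullet above, here is a sketch comparing the two billing models; every price and token count below is a hypothetical placeholder, not real vendor pricing.

```python
# Back-of-the-envelope: when does a fixed-cost PTU beat pay-as-you-go?
# All numbers here are hypothetical placeholders, NOT real vendor pricing.

paygo_cost_per_1k_tokens = 0.002      # hypothetical $ per 1K tokens
ptu_monthly_cost = 5_000.0            # hypothetical fixed $ per month
monthly_tokens = 4_000_000_000        # expected monthly token volume

paygo_monthly = monthly_tokens / 1_000 * paygo_cost_per_1k_tokens
print(f"PayGo: ${paygo_monthly:,.0f}/month")
print(f"PTU:   ${ptu_monthly_cost:,.0f}/month")

# Break-even volume: above this, the fixed PTU commitment is cheaper,
# assuming it can actually serve your peak throughput.
break_even_tokens = ptu_monthly_cost / paygo_cost_per_1k_tokens * 1_000
print(f"Break-even at ~{break_even_tokens:,.0f} tokens/month")
```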

18 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

19 of 96

Remedying examples from ICL

  • Where many in-prompt examples would be needed, you could fine-tune instead

20 of 96

Instruction Following in LLMs

InstructGPT prioritizes

  • understanding and fulfilling user instructions
  • through a combination of human demonstrations and feedback

making it more reliable and aligned with user intent compared to traditional LLMs.

21 of 96

Instruction Following in LLMs

22 of 96

Traditional vs Instruction-Aligned

InstructGPT Training (Focuses on Alignment):

  1. Human-in-the-Loop Training
    1. Starts with demonstrations from humans on how to complete tasks or respond to prompts.
    2. This data includes instructions and desired outputs, helping the model understand user intent.
  2. Multi-Stage Training:
    • Supervised Learning: Uses demonstrations to learn basic rules for following instructions.
    • Reinforcement Learning with Human Feedback: The model generates outputs, humans rate them, and the model refines its approach based on the feedback. This helps the model prioritize outputs that humans find helpful, truthful, and aligned with their instructions.

Traditional LLM Training:

  • Trained on massive amounts of text data (books, code, articles)
  • Learns statistical relationships between words and sentences
  • Can generate creative text formats, translate languages, write different kinds of creative content.
  • Weaknesses: Prone to factual errors, biases present in training data, may not always align with user intent.

23 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

24 of 96

Multi-Task Fine Tuning

25 of 96

FLAN T5

26 of 96

The Samsum Dataset

27 of 96

Fine Tuning LLMs

  • Instead of Self-Supervised Modeling → Supervised ML with Labeled Data
  • Each prompt has an instruction (e.g., “translate this sentence”)

[Diagram: pre-training consumes GB/TB of unstructured textual data to produce a pre-trained LLM; fine-tuning starts from the pre-trained LLM and uses GB of labeled textual data for the task(s), formatted as prompt + completion pairs.]
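A sketch of what a single labeled training record looks like under this setup; the field names and the translation example are illustrative assumptions.

```python
# One supervised fine-tuning example: a prompt that carries the
# instruction, paired with the desired completion. Field names and the
# sample task are illustrative.

example = {
    "prompt": "Translate this sentence to French: The cat sat on the mat.\n",
    "completion": "Le chat s'est assis sur le tapis.",
}

# During fine-tuning the model is trained with the usual next-token
# cross-entropy loss, typically computed only on the completion tokens
# so the model learns to respond rather than to reproduce the prompt.
```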

28 of 96

Prompt Template Libraries
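Prompt template libraries turn raw dataset rows into instruction-style prompts. Below is a minimal hand-rolled template in the spirit of FLAN-style summarization templates applied to a SAMSum-like dialogue; the exact template wording and the sample are illustrative assumptions, not the actual FLAN text.

```python
# Minimal prompt template in the spirit of FLAN-style summarization
# templates applied to a SAMSum-like dialogue sample.

SUMMARIZE_TEMPLATE = (
    "Summarize the following conversation.\n\n"
    "{dialogue}\n\n"
    "Summary: "
)

sample = {
    "dialogue": "Amanda: I baked cookies. Do you want some?\n"
                "Jerry: Sure!\n"
                "Amanda: I'll bring you some tomorrow :-)",
    "summary": "Amanda will bring Jerry some cookies tomorrow.",
}

prompt = SUMMARIZE_TEMPLATE.format(dialogue=sample["dialogue"])
target = sample["summary"]
```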

29 of 96

Instruction Set Fine Tuning

30 of 96

Instruction Fine-Tuning → INSTRUCT Modeling

31 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

32 of 96

Models Behaving Badly

  • Models are trained on vast amounts of data
  • They could produce toxic language
  • Aggressive Language
  • Provide dangerous information

33 of 96

Some Examples

34 of 96

General Alignment

35 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

36 of 96

Using Human Feedback Provides Significant Improvements

37 of 96

Training an LLM

38 of 96

Building Alignment through RLHF

Initial Pre-trained Model examples:

  • InstructGPT: 1.3B parameters
  • Anthropic: 52B parameters
  • Gopher: 280B parameters

39 of 96

Reinforcement Learning with Human Feedback

40 of 96

Reinforcement Learning

  • Has an agent and an environment: game theory

Agent

RL Policy (Model)

Environment

reward: r_t

action: a_t from action space

state: s_t from state space

Objective:

Win the game!

41 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

42 of 96

Models Becoming Mainstream

Prompt

Explain the moon landing to a 6 year old in a few sentences.

Completion

GPT-3

Explain the theory of gravity to a 6 year old.

Explain the theory of relativity to a 6 year old in a few sentences.

Explain the big bang theory to a 6 year old.

Explain evolution to a 6 year old.

InstructGPT

People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.

Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite having more than 100x fewer parameters.

43 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

44 of 96

RLHF - Data Preparation

45 of 96

RLHF - Data Labeling

46 of 96

Sample Instructions for RLHF

47 of 96

Prepare Labeled Data for Training

48 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

49 of 96

Sampling from Existing Prompt Data

The training dataset of prompt-generation pairs for the RM is generated by sampling a set of prompts from a predefined dataset (Anthropic’s data generated primarily with a chat tool on Amazon Mechanical Turk is available on the Hub, and OpenAI used prompts submitted by users to the GPT API). The prompts are passed through the initial language model to generate new text.

50 of 96

General Process for RLHF

51 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

53 of 96

Training the Reward Model
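The reward model is typically trained on pairwise human preferences: for each prompt, it should score the human-preferred completion above the rejected one. A minimal sketch of that ranking loss (in the style of InstructGPT) follows; the example scores are made up.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss used to train a reward model.

    r_chosen / r_rejected are scalar scores the reward model assigns to the
    human-preferred and the rejected completion for the same prompt.
    Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the preferred
    completion's score above the rejected one's.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores for a batch of 3 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.0])
print(pairwise_reward_loss(r_chosen, r_rejected))
```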

54 of 96

Using the Reward Model

55 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

57 of 96

Interacting with the Reward Model

  • Using a prompt (e.g., “A dog is …”), determine the fitness of the completions

Prompt Dataset

Instruct LLM

Reward Model

“...a friendly animal”

0.24

iteration 1

“A dog is…”

58 of 96

Update LLM with an RL Algorithm

  • Based on the reward, update the LLM parameters via an RL algorithm

Prompt Dataset

Instruct LLM

RL Algorithm

Reward Model

“...a friendly animal”

0.24

iteration 1

“A dog is…”

prompt: “A dog is”

“...a friendly animal”
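A highly simplified sketch of the loop these slides walk through: sample a prompt, generate with the current policy, score with the reward model, and let an RL algorithm (PPO in practice) update the policy. Every name below is a placeholder, not a real library API; actual implementations typically use a framework such as Hugging Face TRL.

```python
# Highly simplified RLHF loop corresponding to this diagram: generate,
# score with the reward model, update the policy with an RL step.
# Every object below is a placeholder for your actual framework.

def rlhf_loop(prompts, policy_llm, reward_model, rl_update, n_iterations=3):
    for it in range(n_iterations):
        for prompt in prompts:                       # e.g., "A dog is..."
            completion = policy_llm.generate(prompt)  # "...a friendly animal"
            reward = reward_model.score(prompt, completion)  # e.g., 0.24
            # The RL algorithm nudges the policy toward higher-reward text.
            rl_update(policy_llm, prompt, completion, reward)
        # As alignment improves, the average reward should trend upward
        # across iterations (0.24 -> 0.57 -> 1.24 -> ... in the slides).
```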

59 of 96

Update LLM with an RL Algorithm

  • The RL update produces an updated LLM, which can then be tested against additional prompts

prompt: “A dog is”

“...a friendly animal”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...a friendly animal”

iteration 1

“A dog is…”

0.24

60 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...man’s best friend”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...man’s best friend”

0.57

iteration 2

“A dog is…”

61 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...the best pet”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...the best pet”

1.24

iteration 3

“A dog is…”

62 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...a canine”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...a canine”

RLHF

3.28

iteration n

“A dog is…”

63 of 96

Determining the Nature of the RL Algorithm

  • The particular RL algorithm that OpenAI leverages for RLHF is Proximal Policy Optimization (PPO)

Prompt Dataset

Human Aligned LLM

Proximal Policy Optimization

Reward Model

“...a friendly animal”

RLHF

0.57

iteration n

“A dog is…”

prompt: “A dog is”

“...a friendly animal”

64 of 96

RLHF - General Concept

65 of 96

Proximal Policy Optimization (PPO)

66 of 96

General Process for RLHF

67 of 96

Avoiding Reward Hacking
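A common safeguard against reward hacking is to penalize the updated policy for drifting too far from the frozen instruct model, typically via a KL-divergence term subtracted from the reward. A minimal sketch, where the per-token log-probability inputs and the beta value are assumptions:

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_reference: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL-style penalty.

    The penalty (approximated per token as the log-prob difference between
    the updated policy and the frozen reference model) keeps the policy
    from drifting into gibberish that merely games the reward model.
    beta trades off reward maximization against staying close to the
    reference; 0.1 is an illustrative value.
    """
    kl_per_token = logprobs_policy - logprobs_reference
    return rm_score - beta * kl_per_token.sum(dim=-1)
```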

68 of 96

Alternatives to RLHF

  • Direct Preference Optimization (DPO) - see the loss sketch below
  • Chain of Hindsight (CoH)
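For context, DPO skips the explicit reward model and RL loop and optimizes the policy directly on preference pairs. A sketch of its loss; the inputs are assumed to be sequence-level log-probabilities, and beta = 0.1 is illustrative:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Inputs are sequence log-probabilities of the chosen / rejected
    completions under the trained policy and the frozen reference model.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```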

69 of 96

Aligning LLM Systems: Review

  • Why do we want to align LLMs?
  • Introducing RLHF with a Reward Model
  • Optimizing the Reward Model and then Tuning the LLM
  • Alternatives to PPO’s RLHF

70 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

71 of 96

Knowledge Cut-offs in LLMs

72 of 96

Challenges with LLMs

  • Lack of Up-to-Date Knowledge
  • Hallucinations
  • Limited Context Windows
  • Inability to Personalize or Specialize Dynamically

73 of 96

LLM Powered Applications - Retrieval Augmented Generation

74 of 96

Retrieval Augmented Generation - LLMs

Large Language Models do not store facts; rather, they store probabilities over sequences of tokens. Use them as language models rather than information models: we can think of them as effective implementations of auto-completers.

75 of 96

Retrieval Augmented Generation (RAG) Diagrammatically

Indexing: Corpus of Data → Process to Vectors → Vector Store

Query: User → Process to Vectors → Nearest Neighbors Lookup → Prompt Generation
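A minimal end-to-end sketch of the pipeline in this diagram, using an in-memory vector store and cosine similarity; embed() and generate() are placeholders for whatever embedding model and LLM you actually use.

```python
# Minimal RAG sketch matching the diagram: embed the corpus into a vector
# store, embed the user query, look up nearest neighbors, and build the
# prompt. embed() and generate() are placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (e.g., a sentence embedding model)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder LLM completion call."""
    raise NotImplementedError

def build_vector_store(corpus: dict) -> dict:
    # key -> embedding; the key doubles as a citation handle later.
    return {key: embed(text) for key, text in corpus.items()}

def nearest_neighbors(query: str, store: dict, corpus: dict, k: int = 3):
    q = embed(query)
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(store, key=lambda key: cosine(store[key]), reverse=True)
    return [(key, corpus[key]) for key in ranked[:k]]

def rag_answer(query: str, store: dict, corpus: dict) -> str:
    context = "\n\n".join(
        f"[{key}] {text}" for key, text in nearest_neighbors(query, store, corpus)
    )
    prompt = (f"Answer the question using only the context below, "
              f"citing the bracketed keys.\n\nContext:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return generate(prompt)
```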

76 of 96

RAG - Implemented via Amazon

77 of 96

Comparing Fine-Tuning with RAG

78 of 96

Example: Query RAG System and LLM Response - Search Legal Docs

79 of 96

Example: Query RAG System and LLM Response - Search Legal Docs

80 of 96

Retrieval Augmented Generation

81 of 96

Could You Use the LLM as a Vector Store?

Pros:

  • Comprehensive understanding of the information and documents at hand, in a compact manner.

Cons:

  • Computational load on the LLMs.
  • Retrieval speed would add to inference time.
  • It is not entirely clear how to combine the prompt with the comprehensive data.

  • The more common approach is to use faster models with shorter texts and pass the text in as part of the prompt.

82 of 96

RAG External Sources

83 of 96

Chunking - Why?

  • Handling Long Documents: Large documents can't be directly fed into most language models (LLMs) due to context window limitations. Chunking breaks down documents into smaller, manageable pieces.
  • Improving Retrieval Relevance: Smaller chunks can be more semantically focused, improving the accuracy of retrieval when a user's query matches a specific segment of the document.
  • Balancing Context and Specificity: The goal is to find a chunk size that provides enough context to understand the relevant information while remaining specific enough to be relevant to the user's query.

84 of 96

Chunking - How Big?

  • LLM Context Window: The size of the LLM's context window is a primary constraint. Smaller context windows necessitate smaller chunks.
  • Semantic Coherence: Chunks should ideally contain complete semantic units (e.g., paragraphs, sections, or even sentences). Breaking sentences or paragraphs mid-way can lead to information loss.
  • Query Complexity: Complex queries might require larger chunks to provide sufficient context. Simpler queries might perform better with smaller, more focused chunks.

85 of 96

Chunking - How Big?

  • Document Structure: The structure of the document (e.g., technical reports, novels, FAQs) influences the optimal chunk size.
    • For FAQs, smaller chunks containing a question and its answer can be ideal.
    • For technical documents, larger chunks may be needed to preserve the context of complex concepts.
  • Embedding Model: The embedding model's ability to capture semantic meaning within a given text length also plays a role. Some embedding models handle longer texts better than others.
  • Experimentation: The optimal chunk size is often determined empirically. You'll need to experiment with different sizes and evaluate the retrieval performance.

86 of 96

Overlap - How Much?

  • Preventing Information Loss: Overlap ensures that information isn't lost at the boundaries between chunks. This is especially important when relevant information spans across multiple chunks.
  • Maintaining Context: Overlapping chunks can help maintain contextual continuity, allowing the LLM to understand the relationships between different parts of the document.
  • Balancing Redundancy and Efficiency: Too much overlap can lead to redundancy and increase the amount of data retrieved, which can slow down the process and increase computational cost. Too little overlap can lead to information loss.
  • Overlap Percentage: Overlap is typically expressed as a percentage of the chunk size. Common overlap percentages range from 10% to 50% (see the chunking sketch below).
  • Sentence or Paragraph Level Overlap: Overlap can be implemented at the sentence or paragraph level, depending on the desired level of context.
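A simple sketch of fixed-size chunking with percentage overlap, as referenced above; "tokens" are approximated by whitespace-split words here, whereas a real pipeline would use the embedding model's tokenizer.

```python
# Simple fixed-size chunking with percentage overlap. "Tokens" are
# approximated by whitespace-split words; a real pipeline would use the
# tokenizer of the embedding model or LLM.

def chunk_text(text: str, chunk_size: int = 200, overlap_pct: float = 0.2):
    words = text.split()
    overlap = int(chunk_size * overlap_pct)   # e.g., 20% of 200 = 40 words
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Experiment with chunk_size and overlap_pct and evaluate retrieval
# quality empirically, as suggested above.
```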

87 of 96

Data Preparation for RAG

88 of 96

Vector Database Search

  • Each text in the vector store is identified by a key�
  • Enables a citation to be included in the answer

89 of 96

Reviewing RAG Systems

  • Reasons for why we would use RAG systems
  • The general framework of RAG systems: enabling citation
  • Design decisions in RAG systems
    • What retrieval system to use?
    • How will we chunk and overlap the data?
  • Data Preparation and Inference

90 of 96

Graveyard

91 of 96

Alternative embeddings before generation

92 of 96

Combining query with the context: what’s going on?

93 of 96

On the scale of LLMs

  • Could use LLMs directly

94 of 96

95 of 96

Understanding tokenization, chunking, etc.

  • Could use

96 of 96

What to use in the Vector Database?

  • Could use