1 of 96

CS 6120

Week 13, Topic 11.2: Practically Leveraging Large Language Models

2 of 96

Project Demo Day

  • April 24, 2025: Class in Room 916!
  • 5-10 Minutes (Depending on Number of Papers): Paper lightning presentations (3 min each)
  • 30 Minutes Maximum: Each team introduces the Dataset (RAG) or Application (PAL) they are using and introduces their work (1 min each)
  • 45 Minutes to 1 Hour Maximum: Each instructor / TA tries out projects (5 Q's each)
  • 30 Minutes Maximum: Students try out each other's applications

3 of 96

Administrivia

  • Time for Course Reviews
  • Final Project Presentations: Add links to slides. At most 3 slides: an elevator pitch.
  • Reading Group Session: GPT-4 TR, RLHF
  • Discussion and Addressing the Labs and Homeworks: https://course.ccs.neu.edu/cs6120s25/submissions
  • Canvassing for Teaching Assistants
  • Time for Laboratories (Catchup) - Pranav to Provide a Correction
  • Extra credit for the optional laboratories

4 of 96

Administrative - Homework 5

  • Assignment 5 Point System

  • February 20: Assigned
  • March 13: Due
  • March 20: Late due date - after this time, 1 point off per workday (M-F)
  • March 27: Late due date - after this time, 3 points off
  • April 4: Cut-off

5 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

6 of 96

Last Week: Pre-Training and Fine-Tuning

7 of 96

Last Week: Pre-Training and Fine-Tuning

8 of 96

Prompt Engineering: Full Specifications

9 of 96

In-Context Learning

  • Providing examples in the context window = in-context learning
  • Your projects can include examples of in-context learning!

10 of 96

In Context Learning (ICL) - Zero Shot Inference

  • Providing only the task prompt inside the context window, with no examples (zero-shot inference)

11 of 96

In Context Learning (ICL) - One Shot Inference

GPT-2

Supply an example!
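To make the distinction concrete, here is a minimal sketch of a zero-shot vs. a one-shot prompt; the sentiment task, the review text, and the generate() helper are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of zero-shot vs. one-shot prompts for sentiment
# classification. The review text is invented for illustration, and
# generate() stands in for whatever LLM completion call you use.

def generate(prompt: str) -> str:
    """Placeholder for an LLM call (API or local model)."""
    raise NotImplementedError

review = "I loved this movie, the pacing was perfect."

# Zero-shot: only the task description and the input, no examples.
zero_shot = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    f"Review: {review}\n"
    "Sentiment:"
)

# One-shot: supply a single worked example before the real input,
# so a smaller model (e.g., GPT-2-sized) can pick up the format.
one_shot = (
    "Classify the sentiment of the review as Positive or Negative.\n"
    "Review: The plot made no sense and I walked out.\n"
    "Sentiment: Negative\n"
    f"Review: {review}\n"
    "Sentiment:"
)

# print(generate(zero_shot)); print(generate(one_shot))
```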

12 of 96

Summary of ICL, Strongly Dependent on Model Size

13 of 96

Prompt Engineering: Useful Techniques

  • Few-Shot Learning - Provide a few examples that guide the model toward the kind of output you'd like (see the prompt sketch below).
  • Chain of Thought Prompting - Have the model talk through its reasoning step by step to reach the output you want it to achieve.

[Figure: two Text-Davinci-002 (cusp of GPT 3.0-3.5) examples: a few-shot prompt with 3 examples guiding the LLM, and an un-guided prompt whose answers are apropos of nothing and consistently wrong.]
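A rough sketch of a few-shot chain-of-thought prompt, using the well-known worked example from the Wei et al. chain-of-thought paper; the comment about model behavior summarizes that paper's Figure 1, not these slides.

```python
# Sketch of a few-shot chain-of-thought (CoT) prompt: the worked example
# shows the model *how* to reason step by step before giving an answer.

cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 to make lunch and
bought 6 more. How many apples do they have?
A:"""

# In the chain-of-thought paper, standard prompting yields the wrong
# answer (27) here, while the CoT prompt produces the reasoning
# "23 - 20 = 3, 3 + 6 = 9. The answer is 9."
```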

14 of 96

Chain of Thought Reasoning

15 of 96

Chain of Thought Reasoning: 1 or Few Shot Learning

16 of 96

ICL or No?

Note that:

  • Context window is limited (a few thousand tokens)

When to Fine-Tune?

  • When many in-prompt examples would be needed (e.g., 5-6 examples)
  • Teaching intuition (when words fall short)
  • Baking in tone, style, and output formatting
  • Desire to reduce prompt context size
  • Want to distill and train a smaller model

17 of 96

Provisioning: Peak vs Baseline

  • Pay as You Go (PayGo)
  • Provisioned Throughput Units (PTU) - A consistent / fixed cost per month
    • Etsy went with this
    • Monitor and understand what's right for you (see the back-of-the-envelope sketch below)
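As a back-of-the-envelope aid for the monitoring bullet above, here is a sketch comparing the two billing models; every price and token count below is a hypothetical placeholder, not real vendor pricing.

```python
# Back-of-the-envelope: when does a fixed-cost PTU beat pay-as-you-go?
# All numbers here are hypothetical placeholders, NOT real vendor pricing.

paygo_cost_per_1k_tokens = 0.002      # hypothetical $ per 1K tokens
ptu_monthly_cost = 5_000.0            # hypothetical fixed $ per month
monthly_tokens = 4_000_000_000        # expected monthly token volume

paygo_monthly = monthly_tokens / 1_000 * paygo_cost_per_1k_tokens
print(f"PayGo: ${paygo_monthly:,.0f}/month")
print(f"PTU:   ${ptu_monthly_cost:,.0f}/month")

# Break-even volume: above this, the fixed PTU commitment is cheaper,
# assuming it can actually serve your peak throughput.
break_even_tokens = ptu_monthly_cost / paygo_cost_per_1k_tokens * 1_000
print(f"Break-even at ~{break_even_tokens:,.0f} tokens/month")
```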

18 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

19 of 96

Remedying examples from ICL

  • Where many in-prompt examples would be needed, you could fine-tune instead

20 of 96

Instruction Following in LLMs

InstructGPT prioritizes

  • understanding and fulfilling user instructions
  • through a combination of human demonstrations and feedback

making it more reliable and aligned with user intent compared to traditional LLMs.

21 of 96

Instruction Following in LLMs

22 of 96

Traditional vs Instruction-Aligned

InstructGPT Training (Focuses on Alignment):

  1. Human-in-the-Loop Training
    1. Starts with demonstrations from humans on how to complete tasks or respond to prompts.
    2. This data includes instructions and desired outputs, helping the model understand user intent.
  2. Multi-Stage Training:
    • Supervised Learning: Uses demonstrations to learn basic rules for following instructions.
    • Reinforcement Learning with Human Feedback: The model generates outputs, humans rate them, and the model refines its approach based on the feedback. This helps the model prioritize outputs that humans find helpful, truthful, and aligned with their instructions.

Traditional LLM Training:

  • Trained on massive amounts of text data (books, code, articles)
  • Learns statistical relationships between words and sentences
  • Can generate creative text formats, translate languages, write different kinds of creative content.
  • Weaknesses: Prone to factual errors, biases present in training data, may not always align with user intent.

23 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

24 of 96

Multi-Task Fine Tuning

25 of 96

FLAN T5

26 of 96

The Samsum Dataset

27 of 96

Fine Tuning LLMs

  • Instead of Self-Supervised Modeling → Supervised ML with Labeled Data
  • Each prompt has an instruction (e.g., “translate this sentence”)

[Diagram: pre-training consumes GB/TB of unstructured textual data to produce a pre-trained LLM; fine-tuning starts from the pre-trained LLM and uses GB of labeled textual data for the task(s), formatted as prompt + completion pairs.]
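A sketch of what a single labeled training record looks like under this setup; the field names and the translation example are illustrative assumptions.

```python
# One supervised fine-tuning example: a prompt that carries the
# instruction, paired with the desired completion. Field names and the
# sample task are illustrative.

example = {
    "prompt": "Translate this sentence to French: The cat sat on the mat.\n",
    "completion": "Le chat s'est assis sur le tapis.",
}

# During fine-tuning the model is trained with the usual next-token
# cross-entropy loss, typically computed only on the completion tokens
# so the model learns to respond rather than to reproduce the prompt.
```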

28 of 96

Prompt Template Libraries
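Prompt template libraries turn raw dataset rows into instruction-style prompts. Below is a minimal hand-rolled template in the spirit of FLAN-style summarization templates applied to a SAMSum-like dialogue; the exact template wording and the sample are illustrative assumptions, not the actual FLAN text.

```python
# Minimal prompt template in the spirit of FLAN-style summarization
# templates applied to a SAMSum-like dialogue sample.

SUMMARIZE_TEMPLATE = (
    "Summarize the following conversation.\n\n"
    "{dialogue}\n\n"
    "Summary: "
)

sample = {
    "dialogue": "Amanda: I baked cookies. Do you want some?\n"
                "Jerry: Sure!\n"
                "Amanda: I'll bring you some tomorrow :-)",
    "summary": "Amanda will bring Jerry some cookies tomorrow.",
}

prompt = SUMMARIZE_TEMPLATE.format(dialogue=sample["dialogue"])
target = sample["summary"]
```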

29 of 96

Instruction Set Fine Tuning

30 of 96

Instruction Fine-Tuning → INSTRUCT Modeling

31 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

32 of 96

Models Behaving Badly

  • Models are trained on vast amounts of data
  • They could produce toxic language
  • Aggressive Language
  • Provide dangerous information

33 of 96

Some Examples

34 of 96

General Alignment

35 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

36 of 96

Using Human Feedback Provides Significant Improvements

37 of 96

Training an LLM

38 of 96

Building Alignment through RLHF

Initial Pre-trained Model examples:

  • InstructGPT: 1.3B parameters
  • Anthropic: 52B parameters
  • Gopher: 280B parameters

39 of 96

Reinforcement Learning with Human Feedback

40 of 96

Reinforcement Learning

  • Has an agent and an environment: game theory

Agent

RL Policy (Model)

Environment

reward: r_t

action: a_t from action space

state: s_t from state space

Objective:

Win the game!

41 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

42 of 96

Models Becoming Mainstream

Prompt

Explain the moon landing to a 6 year old in a few sentences.

Completion

GPT-3

Explain the theory of gravity to a 6 year old.

Explain the theory of relativity to a 6 year old in a few sentences.

Explain the big bang theory to a 6 year old.

Explain evolution to a 6 year old.

InstructGPT

People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.

Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT-3 model, despite having more than 100x fewer parameters.

43 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

44 of 96

RLHF - Data Preparation

45 of 96

RLHF - Data Labeling

46 of 96

Sample Instructions for RLHF

47 of 96

Prepare Labeled Data for Training

48 of 96

Section 2: Instruction Fine Tuning with Reinforcement Learning

  • Instruction Fine Tuning
  • Alignment with Human Values
  • Reinforcement Learning with Human Feedback
  • RLHF - Data Collection and Preparation
  • RLHF - Training and Optimization

49 of 96

Sampling from Existing Prompt Data

The training dataset of prompt-generation pairs for the RM is generated by sampling a set of prompts from a predefined dataset (Anthropic’s data generated primarily with a chat tool on Amazon Mechanical Turk is available on the Hub, and OpenAI used prompts submitted by users to the GPT API). The prompts are passed through the initial language model to generate new text.

50 of 96

General Process for RLHF

51 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

53 of 96

Training the Reward Model
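The reward model is typically trained on pairwise human preferences: for each prompt, it should score the human-preferred completion above the rejected one. A minimal sketch of that ranking loss (in the style of InstructGPT) follows; the example scores are made up.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor,
                         r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss used to train a reward model.

    r_chosen / r_rejected are scalar scores the reward model assigns to the
    human-preferred and the rejected completion for the same prompt.
    Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the preferred
    completion's score above the rejected one's.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores for a batch of 3 preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.1])
r_rejected = torch.tensor([0.4, 0.9, 1.0])
print(pairwise_reward_loss(r_chosen, r_rejected))
```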

54 of 96

Using the Reward Model

55 of 96

Reinforcement Learning

  • agent = LLM, action = chosen token, reward = by the humans

Agent

RL Policy = LLM

Environment

reward: r_t

action: a_t from token vocab

state: s_t from current context

Objective:

Generate aligned text!

Instruct LLM

Reward Model

question

answer

57 of 96

Interacting with the Reward Model

  • Using a prompt (e.g., “A dog is …”), determine the fitness of the completions

Prompt Dataset

Instruct LLM

Reward Model

“...a friendly animal”

0.24

iteration 1

“A dog is…”

58 of 96

Update LLM with an RL Algorithm

  • Based on the reward, update the LLM parameters via an RL algorithm

Prompt Dataset

Instruct LLM

RL Algorithm

Reward Model

“...a friendly animal”

0.24

iteration 1

“A dog is…”

prompt: “A dog is”

“...a friendly animal”
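A highly simplified sketch of the loop these slides walk through: sample a prompt, generate with the current policy, score with the reward model, and let an RL algorithm (PPO in practice) update the policy. Every name below is a placeholder, not a real library API; actual implementations typically use a framework such as Hugging Face TRL.

```python
# Highly simplified RLHF loop corresponding to this diagram: generate,
# score with the reward model, update the policy with an RL step.
# Every object below is a placeholder for your actual framework.

def rlhf_loop(prompts, policy_llm, reward_model, rl_update, n_iterations=3):
    for it in range(n_iterations):
        for prompt in prompts:                       # e.g., "A dog is..."
            completion = policy_llm.generate(prompt)  # "...a friendly animal"
            reward = reward_model.score(prompt, completion)  # e.g., 0.24
            # The RL algorithm nudges the policy toward higher-reward text.
            rl_update(policy_llm, prompt, completion, reward)
        # As alignment improves, the average reward should trend upward
        # across iterations (0.24 -> 0.57 -> 1.24 -> ... in the slides).
```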

59 of 96

Update LLM with an RL Algorithm

  • The RL update produces an updated LLM, which can then be tested against additional prompts

prompt: “A dog is”

“...a friendly animal”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...a friendly animal”

iteration 1

“A dog is…”

0.24

60 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...man’s best friend”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...man’s best friend”

0.57

iteration 2

“A dog is…”

61 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...the best pet”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...the best pet”

1.24

iteration 3

“A dog is…”

62 of 96

Continue to Optimize the LLM

  • As the LLM becomes more aligned with the Human Feedback (as modeled by the Reward Model), your rewards should become higher

prompt: “A dog is”

“...a canine”

Prompt Dataset

RL Updated LLM

RL Algorithm

Reward Model

“...a canine”

RLHF

3.28

iteration n

“A dog is…”

63 of 96

Determining the Nature of the RL Algorithm

  • The particular RL algorithm that OpenAI leverages for RLHF is Proximal Policy Optimization (PPO)

Prompt Dataset

Human Aligned LLM

Proximal Policy Optimization

Reward Model

“...a friendly animal”

RLHF

0.57

iteration n

“A dog is…”

prompt: “A dog is”

“...a friendly animal”

64 of 96

RLHF - General Concept

65 of 96

Proximal Policy Optimization (PPO)

66 of 96

General Process for RLHF

67 of 96

Avoiding Reward Hacking
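A common safeguard against reward hacking is to penalize the updated policy for drifting too far from the frozen instruct model, typically via a KL-divergence term subtracted from the reward. A minimal sketch, where the per-token log-probability inputs and the beta value are assumptions:

```python
import torch

def penalized_reward(rm_score: torch.Tensor,
                     logprobs_policy: torch.Tensor,
                     logprobs_reference: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL-style penalty.

    The penalty (approximated per token as the log-prob difference between
    the updated policy and the frozen reference model) keeps the policy
    from drifting into gibberish that merely games the reward model.
    beta trades off reward maximization against staying close to the
    reference; 0.1 is an illustrative value.
    """
    kl_per_token = logprobs_policy - logprobs_reference
    return rm_score - beta * kl_per_token.sum(dim=-1)
```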

68 of 96

Alternatives to RLHF

  • Direct Preference Optimization (DPO) - see the loss sketch below
  • Chain of Hindsight (CoH)
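For context, DPO skips the explicit reward model and RL loop and optimizes the policy directly on preference pairs. A sketch of its loss; the inputs are assumed to be sequence-level log-probabilities, and beta = 0.1 is illustrative:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss (sketch).

    Inputs are sequence log-probabilities of the chosen / rejected
    completions under the trained policy and the frozen reference model.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```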

69 of 96

Aligning LLM Systems: Review

  • Why do we want to align LLMs?
  • Introducing RLHF with a Reward Model
  • Optimizing the Reward Model and then Tuning the LLM
  • Alternatives to PPO’s RLHF

70 of 96

Large Language Models

Section 0: Administrivia

Section 1: In-Context Learning: Prompt Engineering

Section 2: Instruction Fine Tuning with RLHF

Section 3: Retrieval Augmented Generation (RAG)

71 of 96

Knowledge Cut-offs in LLMs

72 of 96

Challenges with LLMs

  • Lack of Up-to-Date Knowledge
  • Hallucinations
  • Limited Context Windows
  • Inability to Personalize or Specialize Dynamically

73 of 96

LLM Powered Applications - Retrieval Augmented Generation

74 of 96

Retrieval Augmented Generation - LLMs

Large Language Models do not store facts; rather, they store probabilities over sequences of tokens. Use them as language models rather than information models: we can think of them as effective implementations of auto-completers.

75 of 96

Retrieval Augmented Generation (RAG) Diagrammatically

Indexing: Corpus of Data → Process to Vectors → Vector Store

Query: User → Process to Vectors → Nearest Neighbors Lookup → Prompt Generation
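A minimal end-to-end sketch of the pipeline in this diagram, using an in-memory vector store and cosine similarity; embed() and generate() are placeholders for whatever embedding model and LLM you actually use.

```python
# Minimal RAG sketch matching the diagram: embed the corpus into a vector
# store, embed the user query, look up nearest neighbors, and build the
# prompt. embed() and generate() are placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding function (e.g., a sentence embedding model)."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder LLM completion call."""
    raise NotImplementedError

def build_vector_store(corpus: dict) -> dict:
    # key -> embedding; the key doubles as a citation handle later.
    return {key: embed(text) for key, text in corpus.items()}

def nearest_neighbors(query: str, store: dict, corpus: dict, k: int = 3):
    q = embed(query)
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(store, key=lambda key: cosine(store[key]), reverse=True)
    return [(key, corpus[key]) for key in ranked[:k]]

def rag_answer(query: str, store: dict, corpus: dict) -> str:
    context = "\n\n".join(
        f"[{key}] {text}" for key, text in nearest_neighbors(query, store, corpus)
    )
    prompt = (f"Answer the question using only the context below, "
              f"citing the bracketed keys.\n\nContext:\n{context}\n\n"
              f"Question: {query}\nAnswer:")
    return generate(prompt)
```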

76 of 96

RAG - Implemented via Amazon

77 of 96

Comparing Fine-Tuning with RAG

78 of 96

Example: Query RAG System and LLM Response - Search Legal Docs

79 of 96

Example: Query RAG System and LLM Response - Search Legal Docs

80 of 96

Retrieval Augmented Generation

81 of 96

Could You Use the LLM as a Vector Store?

Pros:

  • Comprehensive understanding of the information and documents at hand, in a compact manner.

Cons:

  • Computational load on the LLMs.
  • Retrieval speed would add to inference time.
  • It is not entirely clear how to combine the prompt with the comprehensive data.

  • The more common approach is to use faster models with shorter texts and pass the text in as part of the prompt.

82 of 96

RAG External Sources

83 of 96

Chunking - Why?

  • Handling Long Documents: Large documents can't be directly fed into most language models (LLMs) due to context window limitations. Chunking breaks down documents into smaller, manageable pieces.
  • Improving Retrieval Relevance: Smaller chunks can be more semantically focused, improving the accuracy of retrieval when a user's query matches a specific segment of the document.
  • Balancing Context and Specificity: The goal is to find a chunk size that provides enough context to understand the relevant information while remaining specific enough to be relevant to the user's query.

84 of 96

Chunking - How Big?

  • LLM Context Window: The size of the LLM's context window is a primary constraint. Smaller context windows necessitate smaller chunks.
  • Semantic Coherence: Chunks should ideally contain complete semantic units (e.g., paragraphs, sections, or even sentences). Breaking sentences or paragraphs mid-way can lead to information loss.
  • Query Complexity: Complex queries might require larger chunks to provide sufficient context. Simpler queries might perform better with smaller, more focused chunks.

85 of 96

Chunking - How Big?

  • Document Structure: The structure of the document (e.g., technical reports, novels, FAQs) influences the optimal chunk size.
    • For FAQs, smaller chunks containing a question and its answer can be ideal.
    • For technical documents, larger chunks may be needed to preserve the context of complex concepts.
  • Embedding Model: The embedding model's ability to capture semantic meaning within a given text length also plays a role. Some embedding models handle longer texts better than others.
  • Experimentation: The optimal chunk size is often determined empirically. You'll need to experiment with different sizes and evaluate the retrieval performance.

86 of 96

Overlap - How Much?

  • Preventing Information Loss: Overlap ensures that information isn't lost at the boundaries between chunks. This is especially important when relevant information spans across multiple chunks.
  • Maintaining Context: Overlapping chunks can help maintain contextual continuity, allowing the LLM to understand the relationships between different parts of the document.
  • Balancing Redundancy and Efficiency: Too much overlap can lead to redundancy and increase the amount of data retrieved, which can slow down the process and increase computational cost. Too little overlap can lead to information loss.
  • Overlap Percentage: Overlap is typically expressed as a percentage of the chunk size. Common overlap percentages range from 10% to 50% (see the chunking sketch below).
  • Sentence or Paragraph Level Overlap: Overlap can be implemented at the sentence or paragraph level, depending on the desired level of context.
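A simple sketch of fixed-size chunking with percentage overlap, as referenced above; "tokens" are approximated by whitespace-split words here, whereas a real pipeline would use the embedding model's tokenizer.

```python
# Simple fixed-size chunking with percentage overlap. "Tokens" are
# approximated by whitespace-split words; a real pipeline would use the
# tokenizer of the embedding model or LLM.

def chunk_text(text: str, chunk_size: int = 200, overlap_pct: float = 0.2):
    words = text.split()
    overlap = int(chunk_size * overlap_pct)   # e.g., 20% of 200 = 40 words
    step = max(chunk_size - overlap, 1)
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Experiment with chunk_size and overlap_pct and evaluate retrieval
# quality empirically, as suggested above.
```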

87 of 96

Data Preparation for RAG

88 of 96

Vector Database Search

  • Each text in the vector store is identified by a key�
  • Enables a citation to be included in the answer

89 of 96

Reviewing RAG Systems

  • Reasons for why we would use RAG systems
  • The general framework of RAG systems: enabling citation
  • Design decisions in RAG systems
    • What retrieval system to use?
    • How will we chunk and overlap the data?
  • Data Preparation and Inference

90 of 96

Graveyard

91 of 96

Alternative embeddings before generation

92 of 96

Combining query with the context: what’s going on?

93 of 96

On the scale of LLMs

  • Could use LLMs directly

94 of 96

95 of 96

Understanding tokenization, chunking, etc.

  • Could use

96 of 96

What to use in the Vector Database?

  • Could use