1 of 150

CN408/SF323 AI Engineer

Lecture 4: Information Retrieval and Retrieval Augmented Generation

Nutchanon Yongsatianchot

2 of 150

News

3 of 150

News

4 of 150

News

5 of 150

News: nano-banana is out!

6 of 150

Retrieval Augmented Generation

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

7 of 150

Motivation

  • To answer questions, LLMs must have the necessary information or knowledge
  • LLMs only know what is in their training data
  • Much knowledge lies outside their training data
    • Recent knowledge, private knowledge, novel knowledge
  • LLMs perform better with context or relevant information in the prompt
  • LLMs have limited memory (context window)

8 of 150

Solution: Providing Additional Information

[Diagram: the Prompt is composed of an Instruction, the User's Query, and Additional Information]

We will need to retrieve!!

9 of 150

Many applications involve specialized knowledge

Legal or Medical use cases

Company chatbot

10 of 150

11 of 150

The Big Picture: Retrieval Augmented Generation (RAG)

[Diagram: Documents → Index → Knowledge Base; User's query → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

12 of 150

The central question of RAG:

How can we get all the right information to the LLM?

13 of 150

Information Retrieval

14 of 150

Two broad ways of retrieving information

  • Sparse: Keyword Search
    • Keyword Matching
    • Bag-of-words Search
  • Dense: Semantic Search
    • Vector Database

15 of 150

Sparse Search

Keyword Matching

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

16 of 150

Keyword Matching

  • Idea: Database = a dictionary. You can simply look up the meaning of words (see the sketch below).
    • database = {'CN101': "Fundamental Programming course", 'CN240': "Data Science course"}
    • query = "What course is CN101?"
    • augmented_query = "What course is CN101? <context>CN101 = Fundamental Programming course</context>"
  • Good for searching document titles or article names
  • Good for special, unique, and novel words such as definitions, locations, etc.
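
A minimal Python sketch of this lookup idea, using the toy course database above (keyword_lookup is a hypothetical helper name):

database = {
    "CN101": "Fundamental Programming course",
    "CN240": "Data Science course",
}

def keyword_lookup(query, db):
    # Return entries whose key appears verbatim in the query.
    return [f"{key} = {value}" for key, value in db.items() if key in query]

query = "What course is CN101?"
context = "; ".join(keyword_lookup(query, database))
augmented_query = f"{query}\n<context>{context}</context>"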

17 of 150

Keyword Matching Demo

18 of 150

Metadata filtering

Uses rigid criteria to narrow down documents based on metadata like title, author, creation date, access privileges, and more.

19 of 150

Metadata Filtering In RAG

20 of 150

Sparse Search

Bag-of-word search - TF-IDF

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

21 of 150

Keyword Search

22 of 150

Bag of Words

Word order is ignored; only word presence and frequency matter

Bag of words

23 of 150

Sparse Vectors

Most words aren’t used. The bag of words is sparse, with few non-zero entries.
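
A small sketch of how a sparse bag-of-words vector looks in code, using a made-up vocabulary and one of the pizza sentences from the following slides:

from collections import Counter

vocabulary = ["pizza", "oven", "frozen", "homemade", "stone", "wood", "better", "cooking"]
doc = "homemade pizza in oven is better than frozen pizza"
counts = Counter(doc.split())

# Dense view over the vocabulary: most entries are zero.
dense_vector = [counts.get(word, 0) for word in vocabulary]   # [2, 1, 1, 1, 0, 0, 1, 0]

# Sparse view: store only the non-zero entries.
sparse_vector = {word: counts[word] for word in vocabulary if counts[word] > 0}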

24 of 150

Bag of words view of document

25 of 150

Words appearing in each document

26 of 150

Bag of words search

27 of 150

28 of 150

29 of 150

30 of 150

Frequency Based vs. Term Frequency (TF) Based Scoring

Example Query: "pizza oven"

Document 1: "Homemade pizza in oven is better than frozen pizza"

  • Contains: pizza (2x), oven (1x)
  • Simple scoring (1 point per matching keyword): 2 points
  • TF scoring (1 point per occurrence): 3 points

Document 2: "Wood-fired oven is a better oven than a stone oven for cooking pizza"

  • Contains: pizza (1x), oven (3x)
  • Simple scoring: 2 points
  • TF scoring: 4 points

31 of 150

Normalized TF Scoring

Longer documents may contain keywords many times simply because they are longer.

Solution: Normalize by document length

Score = (Number of keyword occurrences) / (Total words in document)

32 of 150

Term Frequency-Inverse Document Frequency (TF-IDF)

Basic TF scoring treats all words equally, whether they're common filler words or rare, meaningful terms.

Solution: Weight terms using “inverse document frequency” (IDF).

Score = TF(word, doc) × log(Total docs / Docs containing word)
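
A small sketch of this scoring on a toy corpus (the log base and smoothing details vary between implementations):

import math

def tf_idf_score(query, doc, corpus):
    score = 0.0
    for word in query:
        tf = doc.count(word) / len(doc)                 # normalized term frequency
        df = sum(1 for d in corpus if word in d)        # documents containing the word
        if df:
            score += tf * math.log(len(corpus) / df)    # rare words get a larger weight
    return score

corpus = [
    "homemade pizza in oven is better than frozen pizza".split(),
    "wood fired oven is a better oven than a stone oven for cooking pizza".split(),
    "a history of the wheel".split(),
]
scores = [tf_idf_score("pizza oven".split(), doc, corpus) for doc in corpus]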

33 of 150

34 of 150

35 of 150

TF vs. TF-IDF

36 of 150

TF-IDF

37 of 150

Documents with rare keywords score higher than documents with common words

38 of 150

Modern systems use a slightly refined version called BM25

39 of 150

Sparse Search

BM25

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

40 of 150

BM25 Scoring

BM25 (Best Matching 25) was named as the 25th variant in a series of scoring functions proposed by its creators.

  • The BM25 formula gives the score for a single keyword
  • Sum the scores across all keywords to get a document's total relevance score (a compact sketch follows below)
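
For reference, a compact sketch of the standard BM25 formula with the usual smoothed IDF (libraries such as rank_bm25 implement the same idea more carefully):

import math

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for word in query:
        tf = doc.count(word)
        df = sum(1 for d in corpus if word in d)
        idf = math.log((len(corpus) - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        # Term-frequency saturation (k1) and document-length normalization (b)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
    return score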

41 of 150

42 of 150

BM25 Tunable Parameters

k₁ - Term Frequency Saturation

  • Controls: How much term frequency influences the score.
  • Range: Typically between 1.2 and 2.0.
  • Effect: Higher values increase the impact of term frequency; lower values reduce it.

b - Length Normalization

  • Controls: The degree of normalization for document length.
  • Range: Between 0 (no normalization) and 1 (full normalization).
  • Effect: Balances favoring shorter vs. longer documents.

43 of 150

Sparse/Keyword Search Summary

  • Keyword Search: Match documents by keyword frequency
  • TF-IDF
    • Keyword rarity
    • Term frequency
    • Document length
  • BM25: Most commonly used
    • Document length normalization
    • Term Frequency Saturation

[Diagram: Sparse Vector → Score → Rank]

44 of 150

BM25 Demo

45 of 150

Semantic Search

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

46 of 150

Keyword search does not capture the meaning of words

47 of 150

Semantic Search vs. Keyword Search

  • Prompt and documents each get a vector
  • Vectors compared to generate scores
  • The main difference is how vectors are assigned
    • Keyword Search: count words
    • Semantic Search: use embedding model

48 of 150

Sparse vs. Dense Search

49 of 150

Vector Space

Similar words are closer in the vector space

50 of 150

51 of 150

Sentence Embedding Example

52 of 150

Measuring Vector Distance

53 of 150

Measuring Vector Distance

54 of 150

Measuring Vector Distance

55 of 150

Semantic Search Summary

56 of 150

Our Current RAG

[Diagram: Documents → Index → Knowledge Base; User's query → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

57 of 150

RAG with Vector Database

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

58 of 150

Embedding Choices

59 of 150

Embedding Search High Level

  • Embedding model:
    • Indexing: convert documents into an embedding using the embedding model
    • Searching: convert the query into an embedding using the same embedding model used during indexing
  • Retriever: fetch the k data chunks whose embeddings are closest to the query embedding, as measured by a similarity metric such as cosine similarity.

60 of 150

Vector Database

  • Vector Database: The database where the generated embeddings of documents are stored.
  • The key part is the vector search
  • Naive Solution (K-Nearest Neighbour Search), sketched below:
    1. Compute similarity scores between the query embedding and all vectors in the database, using a metric such as cosine similarity
    2. Rank all vectors by their similarity scores
    3. Return the k vectors with the highest similarity scores
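
The naive search above in a few lines of NumPy (doc_embeddings are the stored vectors; the query embedding must come from the same embedding model):

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_search(query_embedding, doc_embeddings, k=3):
    # 1. similarity between the query and every stored vector
    scores = [cosine_similarity(query_embedding, e) for e in doc_embeddings]
    # 2.-3. rank by score and return the indices of the k best matches
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]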

61 of 150

Embedding Search without Vector Database Demo

62 of 150

Semantic Search

With Vector Database

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

63 of 150

You technically don’t need a Vector Database to do RAG

64 of 150

RAG != Semantic Search with Vector Database

65 of 150

Vector Database Choices

66 of 150

Pinecone Database

67 of 150

Pinecone Database - Setup

68 of 150

69 of 150

70 of 150

RAG with Pinecone Database - Coding

*Pinecone has its own proprietary Approximate Nearest Neighbour (ANN) search
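
A rough sketch of what the indexing and query calls can look like with the Pinecone Python client and an OpenAI embedding model (index name, model, and keys are placeholder choices; check the current SDK docs for exact signatures):

from openai import OpenAI
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")    # placeholder key
index = pc.Index("lecture-demo")                  # assumes this index already exists
client = OpenAI()

def embed(text):
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

# Indexing: upsert document embeddings, keeping the raw text as metadata.
docs = {"doc1": "CN101 is the fundamental programming course."}
index.upsert(vectors=[(doc_id, embed(text), {"text": text}) for doc_id, text in docs.items()])

# Searching: Pinecone's ANN search returns the chunks closest to the query embedding.
results = index.query(vector=embed("What is CN101?"), top_k=3, include_metadata=True)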

71 of 150

Break

72 of 150

Improving RAG

73 of 150

RAG with Vector Database

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

74 of 150

RAG with Vector Database

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

75 of 150

Hybrid Search

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

76 of 150

Hybrid Search

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Response]

77 of 150

Hybrid Search

[Diagram: the Retriever combines Keyword Search and Semantic Search, each paired with a Metadata Filter]

78 of 150

Reciprocal Rank Fusion

  • Rewards documents for being highly ranked on each list
  • Control weight of keyword vs. semantic ranking
  • Score points equal to reciprocal of ranking

1st = 1 point, 2nd = 0.5 points, etc.

  • Total points from all ranked lists are used to perform the final ranking (see the sketch below)
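
A minimal sketch of this fusion rule using the 1/rank scoring above, with beta weighting semantic vs. keyword results as on the following slides (the textbook RRF formula adds a constant to the rank in the denominator, omitted here to match the slide):

def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, beta=0.5):
    # Each argument is a list of document ids in ranked order (best first).
    scores = {}
    for rank, doc_id in enumerate(keyword_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - beta) / rank
    for rank, doc_id in enumerate(semantic_ranked, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + beta / rank
    return sorted(scores, key=scores.get, reverse=True)

# A lower beta puts more weight on exact keyword matches.
fused = reciprocal_rank_fusion(["d2", "d1", "d4"], ["d1", "d3", "d2"], beta=0.3)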

79 of 150

Reciprocal Rank Fusion

80 of 150

RRF only cares about ranks, not scores

81 of 150

Beta: Weighting Semantic vs. Keyword

If exact keyword matching is important, set a lower beta

82 of 150

Hybrid Search

83 of 150

Reranking

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

84 of 150

Reranking

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

85 of 150

Overview of Reranking

86 of 150

87 of 150

88 of 150

Reranker

  • Rerank models sort text inputs by semantic relevance to a specified query.
  • Reranking models are more accurate but more costly than a simple cosine-similarity retriever
  • Therefore, you apply them after selecting the top-k documents from the vector database
  • With a reranker, you typically retrieve more documents (higher k) and keep only the best after reranking
  • Examples of rerankers: Cohere, Voyage (see the sketch below)
    • You can also use an LLM to rerank by scoring relevance
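
As one concrete option beyond the hosted services named above (not the approach used in the demo), a local cross-encoder from the sentence-transformers library can play the reranker role:

from sentence_transformers import CrossEncoder

def rerank(query, documents, top_n=3):
    # A cross-encoder reads the query and document together, which is more accurate
    # (and slower) than comparing precomputed embeddings, so apply it only to the
    # shortlist already returned by the vector database.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]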

89 of 150

Hybrid Search + Reranker Demo

90 of 150

Chunking Strategies

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

91 of 150

Chunking

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

92 of 150

Why Chunk Documents?

93 of 150

Indexing without chunking

  • Knowledge base contains 1,000 books
  • Each book is vectorized by an embedding model
  • Result: 1,000 vectors

94 of 150

The problems with this approach

  • Compresses entire book meaning into single vector
  • Can't sharply represent specific topics, chapters or pages
  • Creates "averaged" and “noisy” representation across all content
  • Results in poor search relevance
  • Retrieves entire books, quickly filling LLM context window

95 of 150

Chunking your content

96 of 150

Chunking Considerations

  1. What is the nature of the content being indexed? Are you working with long documents, such as articles or books, or shorter content, like tweets or instant messages?
  2. Which embedding model are you using, and what chunk sizes does it perform optimally on? For instance, sentence-transformer models work well on individual sentences, but a model like text-embedding-ada-002 performs better on chunks containing 256 or 512 tokens.
  3. What are your expectations for the length and complexity of user queries? Will they be short and specific or long and complex?
  4. How will the retrieved results be utilized within your specific application?
    1. Will they be used for semantic search, question answering, summarization, or other purposes?
    2. If your results need to be fed into another LLM with a token limit, you’ll have to take that into consideration.

97 of 150

Fixed Size Chunking

98 of 150

Fixed Size Chunking: Overlapping Chunking

99 of 150

Fixed Size Chunking

Chunk based on the number of tokens and (optionally) some overlap between chunks, as sketched below.
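
A minimal sketch, splitting on whitespace tokens for simplicity (a real tokenizer such as tiktoken could be swapped in):

def fixed_size_chunks(text, chunk_size=200, overlap=50):
    tokens = text.split()                       # crude whitespace "tokenizer"
    step = chunk_size - overlap                 # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):   # last window already covers the tail
            break
    return chunks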

100 of 150

Sentence splitting

  • Split the text at sentence boundaries
  • Many libraries provide sentence splitting, such as NLTK and spaCy

101 of 150

Recursive Chunking

  • Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators.
  • If the initial attempt at splitting the text doesn’t produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved (see the simplified sketch below).
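
A simplified sketch of the idea: try the coarsest separator first and only recurse with finer separators on pieces that are still too long (libraries such as LangChain's RecursiveCharacterTextSplitter do this more carefully, e.g. merging small pieces back together):

def recursive_chunk(text, max_chars=500, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_chars or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if len(piece) <= max_chars:
            chunks.append(piece)
        else:
            chunks.extend(recursive_chunk(piece, max_chars, rest))  # fall back to a finer separator
    return [c for c in chunks if c.strip()]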

102 of 150

Chunking Strategy Demo

103 of 150

Specialized chunking

  • Splitting based on Markdown
  • Splitting based on Latex

104 of 150

Specialized chunking

105 of 150

Semantic Chunking

Groups sentences together based on similar meanings rather than arbitrary character limits

106 of 150

Semantic Chunking

  1. Break the document into sentences and process it sentence by sentence.
  2. Vectorize & compare: convert the current chunk and the next sentence to vectors and calculate the cosine distance.
  3. Check the threshold: if the distance is below the threshold, add the sentence to the chunk. While the theme stays the same, the distance between the chunk so far and the next sentence stays low.
  4. Split when different: when the distance crosses the threshold, start a new chunk. A higher semantic distance indicates that the theme has changed (sketched in code below).
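
A sketch of these steps, assuming an embed() function that maps text to a vector and a placeholder distance threshold of 0.3:

import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def semantic_chunks(sentences, embed, threshold=0.3):
    chunks, current = [], [sentences[0]]
    for sentence in sentences[1:]:
        dist = cosine_distance(embed(" ".join(current)), embed(sentence))
        if dist < threshold:                 # same theme: keep growing the chunk
            current.append(sentence)
        else:                                # theme changed: start a new chunk
            chunks.append(" ".join(current))
            current = [sentence]
    chunks.append(" ".join(current))
    return chunks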

107 of 150

108 of 150

Language Based Chunking

  • Prompt LLM to create chunks from a document
  • Include instructions on types of chunks, like keeping concepts together, adding breaks when new topic starts
  • Performs well and is becoming increasingly economically viable

109 of 150

Contextual Retrieval

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

110 of 150

Contextual Retrieval

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

111 of 150

Problem

  • A chunk may not contain all information to answer the question
  • Example:
    • Q: "What was the revenue growth for ACME Corp in Q2 2023?"
    • A relevant chunk might contain the text: "The company's revenue grew by 3% over the previous quarter."
    • However, this chunk on its own doesn't specify which company it's referring to or the relevant time period, making it difficult to retrieve the right information or use the information effectively.

112 of 150

Contextual Retrieval

  • Contextual Retrieval solves the problem by prepending chunk-specific explanatory context to each chunk before embedding.

original_chunk = "The company's revenue grew by 3% over the previous quarter."

contextualized_chunk = "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."

113 of 150

Implementing Contextual Retrieval

  • Use an LLM to generate the chunk-specific context
  • Example Prompt:

<document>

{{WHOLE_DOCUMENT}}

</document>

Here is the chunk we want to situate within the whole document

<chunk>

{{CHUNK_CONTENT}}

</chunk>

Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else.
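
A sketch of using this prompt with the OpenAI chat API to build contextualized chunks before indexing (the model name and helper names are placeholder choices):

from openai import OpenAI

client = OpenAI()

CONTEXT_PROMPT = """<document>
{document}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{chunk}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

def contextualize(document, chunk):
    response = client.chat.completions.create(
        model="gpt-4o-mini",                   # placeholder model choice
        messages=[{"role": "user", "content": CONTEXT_PROMPT.format(document=document, chunk=chunk)}],
    )
    context = response.choices[0].message.content.strip()
    return f"{context} {chunk}"                # prepend the generated context before embedding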

114 of 150

115 of 150

Choosing a Chunking Approach

  • Fixed Width and Recursive Character Splitting: good defaults
  • Semantic and LLM Chunking: can yield higher performance, but more complex. Experiment to see if it’s worth it
  • Contextual Retrieval: improves any chunking technique at some cost. A good “first improvement” to explore

116 of 150

Query Rewriting

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

117 of 150

Query Rewriting

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

118 of 150

Queries != documents

119 of 150

Query Rewriting

  • Query Rewriting = Query reformulation, query normalization, and sometimes query expansion.
  • Example:
    • User: “When was the last time John Doe bought something from us?”
    • AI: “John last bought a Fruity Fedora hat from us two weeks ago, on Jan 3, 2023.”
    • User: “How about Emily Doe?”
    • Rewriting: “When was the last time Emily Doe bought something from us?”
  • Another common issue is multiple questions in one query, which need to be expanded into separate queries

AI Engineering: Building Applications with Foundation Models

120 of 150

Query Rewriting

Use an LLM to rewrite the query before it’s submitted to the retriever.
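
A minimal sketch of this rewriting step with the same chat API pattern (prompt wording and model are placeholder choices):

from openai import OpenAI

client = OpenAI()

def rewrite_query(query, history=""):
    prompt = (
        "Rewrite the user's query so that it is fully self-contained, "
        "resolving any references to the conversation history. "
        "Return only the rewritten query.\n\n"
        f"Conversation history:\n{history}\n\nQuery: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",                   # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# e.g. rewrite_query("How about Emily Doe?", history="...John Doe bought a hat on Jan 3, 2023...")
# should produce something like "When was the last time Emily Doe bought something from us?"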

121 of 150

122 of 150

Transform a question into multiple perspectives

123 of 150

Intuition: Improve search

124 of 150

Use this with parallelized retrieval

125 of 150

Multi-Query

  • Prompt LLMs to generate multiple versions of the question

126 of 150

HyDE: Hypothetical Document Embeddings

  • Normally a retriever is matching prompts to documents
  • Instead, uses generated “hypothetical documents” that would be ideal search results to help with the search process
  • HyDE means the retriever is matching documents to documents
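
A sketch of HyDE, assuming hypothetical llm() and vector_search() helpers around the components built earlier:

def hyde_retrieve(query, llm, vector_search, k=5):
    # 1. Ask the model to write the document it wishes it could retrieve.
    hypothetical_doc = llm(
        f"Write a short passage that would perfectly answer this question:\n{query}"
    )
    # 2. Embed and search with the hypothetical document instead of the raw query,
    #    so the retriever matches documents against document-like text.
    return vector_search(hypothetical_doc, k=k)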

127 of 150

HyDE

128 of 150

RAG Evaluations

  • Retrieval Augmented Generation
  • Sparse Search
    • keyword search
    • TF-IDF
    • BM25
  • Semantic Search
  • Semantic Search with Vector DB
  • Hybrid Search
  • Reranking
  • Chunking Strategies
  • Contextual Retrieval
  • Query Rewriting
  • RAG Evaluations

129 of 150

RAG Evaluations

[Diagram: Documents → embedding → Index → Knowledge Base; User's query → embedding → Retrieve → Relevant Documents → Augment → Augmented Prompt → LLM → Generate → Answer]

130 of 150

RAG Evaluations

RAG systems have three core components:

  • A question (Q)
  • Retrieved context (C)
  • An answer (A)

131 of 150

Part 1: Context Relevance (C|Q)

Def: How well do the retrieved chunks address the question's information needs? Does your retriever find passages that contain information relevant to answering the user's question?

Bad

Question: "What are the health benefits of meditation?"

Context: "Meditation practices vary widely across different traditions. Mindfulness meditation, which originated in Buddhist practices, focuses on present-moment awareness, while transcendental meditation uses mantras to achieve deeper states of consciousness."

Reasoning: Despite being factually correct about meditation, this context did not discuss health benefits.

Good

Question: "What are the health benefits of meditation?"

Context: "Regular meditation has been shown to reduce stress hormones like cortisol. A 2018 study in the Journal of Cognitive Enhancement found meditation improves attention and working memory."

Reasoning: Strong relevance. The context directly addresses multiple health benefits with specific details.

132 of 150

Q: What source of information would you need to answer multiple-choice questions on financial knowledge?

Finance textbooks

133 of 150

Retrieval Quality Metrics

Common ingredients of most retriever quality metrics (to evaluate your retriever, you need to know the correct answers):

  • The Question: the specific question being evaluated
  • Ranked Results: the documents returned in ranked order
  • Ground Truth: all documents labeled as relevant or irrelevant

134 of 150

Recall and Precision

Recall penalizes for leaving out relevant documents

Precision penalizes for returning irrelevant documents

135 of 150

Top k

  • Retrieval metrics are influenced by how many documents the retriever returns
  • Metrics are discussed in terms of top-k documents
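
A tiny sketch of recall@k and precision@k given the ground-truth labels:

def recall_precision_at_k(retrieved, relevant, k):
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    recall = hits / len(relevant)     # share of relevant documents we managed to return
    precision = hits / k              # share of the returned top-k that is relevant
    return recall, precision

# 2 of the 3 relevant documents appear in the top 4 results: recall 0.67, precision 0.5
recall_precision_at_k(["d1", "d7", "d3", "d9"], {"d1", "d3", "d5"}, k=4)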

136 of 150

Example

137 of 150

Evaluate RAG using Synthetic Data and LLM-as-a-judge

[Diagram: sample chunks from the Knowledge Base → an LLM generates Questions → an LLM judges each question → Score; only use good questions]

138 of 150

Evaluate RAG using Synthetic Data and LLM-as-a-judge

  • Get chunks from our knowledge base
  • Ask an LLM to generate questions based on these chunks
  • Check that questions are good (using LLM-as-a-judge):
    • Groundedness: Can the question be answered from the given context?
    • Relevance: Is the question relevant to users?
    • Stand-alone: Is the question understandable free of any context, for someone with domain knowledge/Internet access?
  • Select good questions to be in the test set
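
A condensed sketch of this pipeline, assuming an llm() helper that returns text and keeping the judging prompt minimal:

import json
import random

def build_test_set(chunks, llm, n=20, min_score=4):
    test_set = []
    for chunk in random.sample(chunks, n):
        question = llm(f"Write one question that can be answered from this passage:\n{chunk}")
        verdict = llm(
            "Rate the question from 1 to 5 on groundedness, relevance, and stand-alone quality. "
            "Answer as JSON with exactly those three keys.\n"
            f"Passage:\n{chunk}\nQuestion:\n{question}"
        )
        scores = json.loads(verdict)              # assumes the judge returns valid JSON
        if min(scores.values()) >= min_score:     # keep only good questions
            test_set.append({"question": question, "context": chunk})
    return test_set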

139 of 150

Part 2: Faithfulness/Groundedness (A|C)

Def: To what extent does the answer restrict itself only to claims that can be verified from the retrieved context?

Bad

Context: "The Great Barrier Reef is the world's largest coral reef system. It stretches for over 2,300 kilometers along the coast of Queensland, Australia."

Answer: "The Great Barrier Reef, the world's largest coral reef system, stretches for over 2,300 kilometers along Australia's eastern coast and is home to about 10% of the world's fish species."

Reasoning: The first part is supported, but the claim about "10% of the world's fish species" isn't in the provided context.

Good

Context: "The Great Barrier Reef is the world's largest coral reef system."

Answer: "The Great Barrier Reef is the largest coral reef system in the world."

Reasoning: Perfect faithfulness. The answer only states what's in the context.

140 of 150

Part 3: Answer Relevance (A|Q)

Def: How directly does the answer address the specific information need expressed in the question? This evaluates the end-to-end system performance.

Bad

Question: "How does compound interest work in investing?"

Answer: "Interest in investing can be simple or compound. Compound interest is more powerful than simple interest and is an important concept in finance."

Reasoning: Low relevance. The answer doesn't actually explain the mechanism of how it works.

Good

Question: "How does compound interest work in investing?"

Answer: "Compound interest works by adding the interest earned back to your principal investment, so that future interest is calculated on the new, larger amount."

Reasoning: High relevance. The answer directly explains the concept asked about.

141 of 150

Advanced RAG Relationships

  • Context Support Coverage (C|A)
    • Definition: Does the retrieved context contain all the information needed to fully support every claim in the answer? This measures whether the context is both sufficient and focused.
  • Question Answerability (Q|C)
    • Definition: Given the context provided, is it actually possible to formulate a satisfactory answer to the question? This evaluates whether the question is reasonable given the available information.
  • Self-Containment (Q|A)
    • Definition: Can the original question be inferred from the answer alone? This measures whether the answer provides enough context to stand on its own.

142 of 150

Other considerations

  • Latency
  • Cost

143 of 150

HW 4

144 of 150

Project

Don’t forget about your project!

145 of 150

Extra

146 of 150

147 of 150

ColBERT: Similarity at the Token Level

Score = Sum of max similarity of each query embedding to any document embedding

148 of 150

ColBERT

149 of 150

ColBERT

150 of 150