1 of 34

Retrieval Augmented Generation (RAG)

Parvez M Robin

Software Engineer

Siemens

2 of 34

$ WHOAMI

  • Software Engineer at Siemens
    • Focusing on Electronic Design Automation and Artificial Intelligence
  • Master's from Dalhousie University
    • Focusing on understanding software bugs using neural language models
  • Public Speaker
    • Focusing on software engineering and artificial intelligence
  • Hiker
    • Focusing on uphill
  • Biker
    • Focusing on downhill

3 of 34

LET THERE BE LLM

  • Gemini 1.5 knows nothing after November 2023
    • April 2023 for GPT-4 and Llama 2
  • You can memorize only so much
    • Parameters can only store a limited amount of knowledge
    • In 2020, the internet hit 64 zettabytes (trillion gigabytes) *
  • LLMs tend to prefer popularity over accuracy
  • LLMs are not domain experts

* https://healthit.com.au/how-big-is-the-internet-and-how-do-we-measure-it/

4 of 34

JUST AN EXAMPLE

5 of 34

6 of 34

You can think of the large language model as an over-enthusiastic new employee who refuses to stay informed with current events but will always answer every question with absolute confidence.

– Amazon Web Services Documentation

Source: https://aws.amazon.com/what-is/retrieval-augmented-generation

7 of 34

RAG TO THE RESCUE

LLM

  • Knows nothing new
  • Can memorize only so much
  • Tends to prefer popularity over accuracy
  • Is not a domain expert

RAG

  • Can pull in up-to-date information
  • Can retrieve external data as needed
  • Relies on specific, relevant sources for responses
  • Supports domain-specific expertise by pulling from specialized resources

8 of 34

HOW RAG WORKS

9 of 34

RETRIEVAL AUGMENTED GENERATION

  • A retrieval module responsible for retrieving the latest and/or private information
  • Narrows down to the most useful information
  • A generator module to generate a human-friendly response
  • Integrates multiple sources
  • Balances contextual and retrieved information (a minimal sketch of the two modules follows)
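To make the two modules concrete, here is a minimal, hedged sketch in Python. The keyword-overlap retriever and the stubbed generator are illustrative stand-ins invented for this example; a real system would use an embedding index and an actual LLM call.

```python
def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Retrieval module: score each document by word overlap with the query (illustrative only)."""
    query_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(query_words & set(doc.lower().split())), reverse=True)
    return ranked[:k]


def generate(query: str, documents: list[str]) -> str:
    """Generator module: a real system would call an LLM with the query and the documents."""
    context = "\n".join(f"- {d}" for d in documents)
    return f"Answer to '{query}', grounded in:\n{context}"


corpus = [
    "RAG retrieves external documents before generating an answer.",
    "LLM parameters can store only a limited amount of knowledge.",
    "Reranking narrows retrieval down to the most useful information.",
]
print(generate("How does RAG retrieve documents?", retrieve("How does RAG retrieve documents?", corpus)))
```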

10 of 34

RAG ARCHITECTURE

[Architecture diagram] The user query goes to the RAG system, which forwards it to the knowledge source; the retrieved documents come back, are reranked, and the prompt, the query, and the reranked documents are sent to the LLM, which produces the final response (a code sketch of this flow follows).
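The flow above can be sketched as plain Python; every function name here (search_knowledge_source, rerank, call_llm, rag_pipeline) is a placeholder invented for this example, not a real library API.

```python
def search_knowledge_source(query: str) -> list[str]:
    # Placeholder: would query a vector store or search index.
    return ["doc about RAG", "doc about LLM limits", "unrelated doc"]


def rerank(query: str, docs: list[str]) -> list[str]:
    # Placeholder: would use a cross-encoder or LLM-based reranker; here, just truncate.
    return docs[:2]


def call_llm(prompt: str) -> str:
    # Placeholder: would call a hosted LLM API.
    return f"(LLM response for a prompt of {len(prompt)} characters)"


def rag_pipeline(query: str) -> str:
    retrieved = search_knowledge_source(query)   # Query -> Knowledge Source -> Retrieved Documents
    reranked = rerank(query, retrieved)          # Retrieved Documents -> Reranked Documents
    prompt = (
        "Answer the question using only the documents below.\n"
        + "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(reranked))
        + f"\n\nQuestion: {query}"
    )                                            # Prompt + Query + Docs
    return call_llm(prompt)                      # LLM -> Final Response


print(rag_pipeline("Why do LLMs need retrieval?"))
```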

11 of 34

TYPES OF RAG

RAG SEQUENCE

  • Generates one response per retrieved document
  • Chooses the best final response from the candidates
  • Best when every retrieved document contains the full answer
  • E.g., Google Search

RAG TOKEN

  • Treats the retrieved documents as a series of tokens
  • Generates a single response using all of them
  • Best when each retrieved document has part of the full answer
  • E.g., creative writing, software debugging (see the code sketch below)

Reference: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, Facebook AI Research (FAIR)
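A rough, illustrative contrast between the two strategies, simplified well beyond the FAIR paper's actual marginalization scheme; generate_answer and score_answer are stand-ins for model calls.

```python
def generate_answer(query: str, docs: list[str]) -> str:
    # Placeholder for an LLM call conditioned on the given documents.
    return f"answer to '{query}' using {len(docs)} doc(s)"


def score_answer(answer: str) -> float:
    # Placeholder for a likelihood or quality score from the model.
    return float(len(answer))


def rag_sequence(query: str, docs: list[str]) -> str:
    # One candidate response per retrieved document, then pick the best-scoring one.
    candidates = [generate_answer(query, [doc]) for doc in docs]
    return max(candidates, key=score_answer)


def rag_token(query: str, docs: list[str]) -> str:
    # All retrieved documents are fed together, producing a single response.
    return generate_answer(query, docs)


docs = ["doc A", "doc B", "doc C"]
print(rag_sequence("What is RAG?", docs))
print(rag_token("What is RAG?", docs))
```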

12 of 34

ADVANTAGES

Responsible AI

Precise source of truth

Improved accuracy and relevance

Support for privacy

Minimal hallucination

13 of 34

SOME BUZZWORDS

Token

Embedding

Context window

Chunk size

Prompt

Zero shot learning

Few shot learning

Hallucination

Agent
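A toy illustration of three of these terms (token, chunk size, context window) using whitespace tokens; real tokenizers split text differently, so the counts are only indicative.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    tokens = text.split()                         # token: the unit the model reads
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


document = "Retrieval augmented generation grounds an LLM in external documents " * 10
chunks = chunk_text(document, chunk_size=16)      # chunk size: tokens per retrieved piece
context_window = 128                              # context window: max tokens the model sees at once
fits = sum(len(c.split()) for c in chunks[:4]) <= context_window
print(len(chunks), "chunks;", "first four fit in the window" if fits else "too large for the window")
```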

14 of 34

COMPARING LLMS FOR RAG

A study by Galileo

15 of 34

EXPERIMENTAL SETUP

  • Short Context: less than 5k tokens (RAG on a few pages)
  • Medium Context: 5k to 25k tokens (RAG on a book's chapter)
  • Long Context: 40k to 100k tokens (RAG on a book)

16 of 34

OPEN SOURCE IS CLOSING THE GAP

While closed-source models still offer the best performance, open-source models like Llama and Qwen continue to improve

Source: https://www.rungalileo.io/ty/hallucinationindex

17 of 34

MEDIUM CONTEXT LENGTH IS THE KEY

Most of these models perform the best when provided with a context of 5k to 25k tokens

Source: https://www.rungalileo.io/ty/hallucinationindex

18 of 34

ANTHROPIC OUTPERFORMS OPENAI

Anthropic’s latest Claude 3.5 Sonnet and Claude 3 Opus consistently beat out GPT-4o and GPT-3.5.

Source: https://www.rungalileo.io/ty/hallucinationindex

19 of 34

LARGER IS NOT ALWAYS BETTER

In certain cases, smaller models outperformed larger models. Specifically, Gemini 1.5 Flash from Google performed unexpectedly well.

Source: https://www.rungalileo.io/ty/hallucinationindex

20 of 34

A COST-EFFECTIVE RAG

21 of 34

MAKE IT BETTER

22 of 34

HIERARCHICAL DOCUMENTS

  • Organize documents into a structured hierarchy
  • Use metadata and semantic relationships to route retrieval to the right part of the hierarchy (see the sketch below)
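A minimal sketch of the idea, under the assumption that each section carries a summary (parent) and chunks (children); the Section class and the word-overlap scorer are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    summary: str                                       # parent-level summary used for coarse routing
    chunks: list[str] = field(default_factory=list)    # child chunks searched after routing


def overlap(a: str, b: str) -> int:
    """Crude relevance signal: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def hierarchical_retrieve(query: str, sections: list[Section], k: int = 2) -> list[str]:
    # Match the query against section titles/summaries first, then search only that section's chunks.
    best = max(sections, key=lambda s: overlap(query, s.title + " " + s.summary))
    return sorted(best.chunks, key=lambda c: overlap(query, c), reverse=True)[:k]


sections = [
    Section("Installation", "How to install and configure the tool",
            ["Download the installer", "Add the binary to PATH"]),
    Section("Debugging", "How to find and fix a bug in the code",
            ["Enable verbose logging to see the bug", "Attach the debugger and step through the code"]),
]
print(hierarchical_retrieve("how to fix a bug", sections))
```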

23 of 34

RECURSIVE RETRIEVAL

  • Given the retrieved documents, ask the LLM whether there are any confusing topics
  • Search again for documents on the confusing topics
  • Repeat until the LLM is confident enough (see the sketch below)
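A hedged sketch of the loop; search and find_confusing_topics are placeholders for the retriever and the LLM, and the stopping rule is deliberately simplistic.

```python
def search(topic: str) -> list[str]:
    # Placeholder retriever call.
    return [f"document about {topic}"]


def find_confusing_topics(docs: list[str]) -> list[str]:
    # Placeholder: ask the LLM which terms in the documents it is still unsure about.
    # Here we pretend nothing is confusing once more than three documents are gathered.
    return [] if len(docs) > 3 else ["follow-up topic"]


def recursive_retrieve(query: str, max_rounds: int = 5) -> list[str]:
    docs = search(query)
    for _ in range(max_rounds):
        confusing = find_confusing_topics(docs)
        if not confusing:                 # the LLM is confident enough; stop
            break
        for topic in confusing:           # search again for the confusing topics
            docs += search(topic)
    return docs


print(recursive_retrieve("What causes this crash?"))
```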

24 of 34

MULTISTEP REASONING: CHAIN OF THOUGHT

  • Ask the LLM to break the task into multiple steps
  • Ask it to solve each step
  • Ask it to explain its reasoning in each step
  • Ask it to stitch everything together

  • Consider parallelizing steps, if possible (a sketch follows)

Source: ART: Automatic multi-step reasoning and tool-use for large language models, Microsoft Research, Allen Institute for AI, Meta AI
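A sketch of the multistep pattern described above; call_llm stands in for any chat-completion API, and the prompts are illustrative rather than the ART paper's exact formulations.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"(model output for: {prompt[:40]}...)"


def solve_with_steps(task: str) -> str:
    # 1. Ask the LLM to break the task into steps.
    plan = call_llm(f"Break this task into numbered steps:\n{task}")
    steps = [s for s in plan.splitlines() if s.strip()]

    # 2-3. Ask it to solve each step and explain its reasoning.
    solutions = [call_llm(f"Solve this step and explain your reasoning:\n{step}") for step in steps]

    # 4. Ask it to stitch everything together into one final answer.
    joined = "\n".join(solutions)
    return call_llm(f"Combine these partial results into one answer for '{task}':\n{joined}")


print(solve_with_steps("Diagnose why the nightly build fails"))
```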

25 of 34

MULTISTEP REASONING CONTD.: CHAIN OF NOTE

  • Generate sequential reading notes for each retrieved document
  • Thoroughly evaluate their relevance
  • Integrate this information to formulate the final answer (see the sketch below)

Source: Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models, Tencent AI Lab
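An illustrative sketch of the Chain-of-Note flow; the note-taking and integration prompts are invented for this example and are not taken from the Tencent paper.

```python
def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return f"(model output for: {prompt[:40]}...)"


def chain_of_note(query: str, docs: list[str]) -> str:
    notes = []
    for doc in docs:
        # Sequential reading note for each retrieved document, including a relevance judgment.
        note = call_llm(
            f"Question: {query}\nDocument: {doc}\n"
            "Write a short note on what this document says and whether it is relevant."
        )
        notes.append(note)
    # Integrate the notes to formulate the final answer.
    return call_llm(f"Question: {query}\nNotes:\n" + "\n".join(notes) + "\nWrite the final answer.")


print(chain_of_note("Why did the service crash?", ["log excerpt 1", "design doc section"]))
```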

26 of 34

SELF-AWARE LLM

  1. Generate a response using internal knowledge and retrieved docs
  2. Analyze the response for potential errors, inconsistencies, or incompleteness
  3. While self-critique identifies issues:
  4. Refine the query
  5. Retrieve new docs
  6. Go to step 1 (see the loop sketched below)

Source: Self-RAG: Learning to Retrieve, Generate, and Critique, IBM Research AI, Allen Institute for AI
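A hedged sketch of the control flow in the list above; this is not the actual Self-RAG training procedure, just the generate-critique-refine loop written as plain Python with placeholder functions.

```python
def retrieve(query: str) -> list[str]:
    return [f"document about {query}"]                         # placeholder retriever


def generate(query: str, docs: list[str]) -> str:
    return f"answer to '{query}' using {len(docs)} doc(s)"     # placeholder generator


def critique(answer: str) -> str:
    # Placeholder: ask the model to flag errors, inconsistencies, or incompleteness.
    # Here we pretend the answer is acceptable once it cites at least three documents.
    return "" if "3 doc" in answer else "needs more evidence"


def refine(query: str, issue: str) -> str:
    return query + " (more detail)"                            # placeholder query refinement


def self_rag(query: str, max_rounds: int = 3) -> str:
    docs = retrieve(query)
    for _ in range(max_rounds):
        answer = generate(query, docs)    # step 1: generate
        issue = critique(answer)          # step 2: self-critique
        if not issue:                     # no issues found: done
            return answer
        query = refine(query, issue)      # step 4: refine the query
        docs += retrieve(query)           # step 5: retrieve new docs, then go back to step 1
    return answer


print(self_rag("Why did the test fail?"))
```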

27 of 34

THANK YOU

Questions?

28 of 34

A Chronology of Generations

Parvez M Robin

Software Engineer

Siemens
