1 of 27

RAG – Retrieval Augmented Generation.

With implementation

Dominik Polzer

2 of 27

Agenda

01  RAG in short (Page 3)
02  Pipeline overview (Page 4)
03  Tensor, Embeddings …? (Page 5)
04  Applied RAG Demo (Page 12)
05  ML Hardware (Page 21)

3 of 27


What is RAG?

In short: you augment the client's query (prompt) by expanding the query context with relevant knowledge from an internal or external knowledge base, and then feed it to the LLM with that richer context.

How you implement the retrieval + augmentation layer is up to you.

The knowledge base can be any local or distributed resource:

- SQL database

- NoSQL database

- Vector database

- Documents (chunked or summarized)

- Elastic / Solr full-text search (LLMs with semantic search, hybrid search – Qdrant)

  1. You can store embeddings as an indexed document field and then run an additional vector search on top of the full-text results.
  2. Or run the vector search first, then do additional filtering on the result payloads (see the sketch below).
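A minimal sketch of option 2 with the Qdrant Python client and a sentence-transformers encoder; the collection name ("docs"), the payload field ("source"), and the model are assumptions for illustration, and the payload filter is passed to Qdrant alongside the vector query:

```python
# Vector search with an additional payload filter on the results (option 2).
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")

query_vector = encoder.encode("How do I rotate my API keys?").tolist()

hits = client.search(
    collection_name="docs",                            # assumed collection
    query_vector=query_vector,
    query_filter=Filter(                               # restrict by payload
        must=[FieldCondition(key="source", match=MatchValue(value="internal-wiki"))]
    ),
    with_payload=True,
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload.get("text", "")[:80])
```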

4 of 27


RAG pipeline

High-level overview

5 of 27


Tensor?

What are Tensors?

1) a 1D block of memory called Storage that holds the raw data

2) a View over that storage that holds its shape. (PyTorch Internals could be helpful here.)
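A tiny PyTorch sketch of the storage-vs-view split:

```python
# A tensor = a view (shape + strides) over a flat block of storage.
import torch

x = torch.arange(6, dtype=torch.float32)   # storage holds [0., 1., 2., 3., 4., 5.]
y = x.view(2, 3)                           # new view, same underlying storage

print(x.data_ptr() == y.data_ptr())        # True: both point at the same memory
print(y.shape, y.stride())                 # torch.Size([2, 3]) (3, 1)

y[0, 0] = 42.0                             # writing through the view...
print(x[0])                                # ...changes the shared storage: tensor(42.)
```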

6 of 27


Tensors are multi-dimensional arrays with a uniform type (called a dtype)

7 of 27


What does the embedding process look like?

Token IDs
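A minimal sketch of the first step, turning raw text into token IDs with a Hugging Face tokenizer (the model name is just an example):

```python
# Text -> token IDs: the integers the embedding model actually consumes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer("I took my dog to the vet", return_tensors="pt")
print(encoded["input_ids"])                                   # tensor of token IDs
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# roughly: ['[CLS]', 'i', 'took', 'my', 'dog', 'to', 'the', 'vet', '[SEP]']
```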

8 of 27


Challenges

  • Chunking: a primary challenge when building semantic search and RAG systems is determining how to best chunk your data. [LangChain TextSplitter intro] (See the sketch below.)

  • Tokenization

  • Padding

  • Tensor shape / dimensions: the final embedding dimensions must match the vector store's dimensions.
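A minimal chunking sketch with LangChain's RecursiveCharacterTextSplitter; the import path depends on your LangChain version, and the file name, chunk size, and overlap are illustrative values:

```python
# Chunking: split a long document into overlapping pieces before embedding it.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # max characters per chunk (illustrative)
    chunk_overlap=50,   # overlap keeps some context across chunk boundaries
)

with open("handbook.txt") as f:         # hypothetical local document
    chunks = splitter.split_text(f.read())

print(len(chunks), "chunks; first chunk:", chunks[0][:80])
```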

9 of 27


Popular embedding models

Popular OSS embedding models: Hugging Face Hub

OpenAI (tokens)

10 of 27


Embeddings

Text embeddings are a natural language processing (NLP) technique that converts text into numerical vectors.

Embeddings capture semantic meaning and context, which results in texts with similar meanings having closer embeddings.

For example, the sentence "I took my dog to the vet" and "I took my cat to the vet" would have embeddings that are close to each other in the vector space since they both describe similar context.
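A small sketch of that intuition, measuring cosine similarity with sentence-transformers (the model choice is illustrative):

```python
# The "dog vs. cat at the vet" example, measured with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb = model.encode([
    "I took my dog to the vet",
    "I took my cat to the vet",
    "The stock market closed higher today",
])

print(util.cos_sim(emb[0], emb[1]))   # high similarity: same context
print(util.cos_sim(emb[0], emb[2]))   # much lower similarity: unrelated context
```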

Embeddings development & visuals (2d):

11 of 27


Embeddings

Relation of Embeddings to the model.

Dimensionality

OpenAI's embeddings have higher dimensionality (1536) than DistilBERT's (768 per token).

  • distilbert-base-uncased is a distilled version of the BERT model, which produces embeddings of size 768 for each token in the input sequence. So, for a 512-token chunk of text, the output embeddings have a shape of `[512, 768]` (assuming you use the last-layer hidden states as the embeddings).

  • OpenAI's embedding model text-embedding-ada-002 returns a single 1536-dimensional vector per input text; the per-token states are pooled on OpenAI's side, so you never handle a `[512, 1536]` tensor yourself.

  • In the context of RAG, using the last-layer hidden states as the embeddings is a common choice, as it provides a compact and informative representation of the input text that can be used to find similar documents or sentences (see the sketch below).
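A sketch of pulling the last-layer hidden states out of distilbert-base-uncased and mean-pooling them into one vector per text; mean pooling over the attention mask is one common choice, not the only one:

```python
# Last-layer hidden states as embeddings: [batch, seq_len, 768] for DistilBERT.
import torch
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

batch = tokenizer("I took my dog to the vet", return_tensors="pt",
                  padding="max_length", max_length=512, truncation=True)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # shape: [1, 512, 768]

# Mean-pool over the real tokens (mask out padding) to get one 768-dim vector.
mask = batch["attention_mask"].unsqueeze(-1)      # [1, 512, 1]
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(hidden.shape, embedding.shape)              # [1, 512, 768]  [1, 768]
```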

12 of 27


DEMO

Notes: The chunking layer can also be done after tokenization. | The same base model is used for the prompt and document embeddings. | The documents can come from any other source. | You can run summarization per full page with an LLM and then tokenize the summarized versions, or run summarization on the vector search results.

13 of 27


DistilBERT model example

  • distilbert.embeddings.position_embeddings.weight [512, 768]: these are the position embeddings. They represent the position of each token in a sequence. The size [512, 768] means there are 512 positions (i.e., the model can handle sequences of up to 512 tokens = its context length) and each position is represented by a 768-dimensional vector.

  • distilbert.embeddings.word_embeddings.weight [30522, 768]: these are the word embeddings. They represent the actual tokens in the vocabulary. The size [30522, 768] means there are 30,522 tokens in the vocabulary, and each one is represented by a 768-dimensional vector.

  • distilbert.embeddings.LayerNorm.bias [768] and distilbert.embeddings.LayerNorm.weight [768]: these are the parameters for the layer normalization applied to the embeddings. Layer normalization is a technique to stabilize the hidden states in the model, and it has a bias and a weight parameter for each dimension of the input.
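A quick way to see these tensors yourself; the masked-LM variant is loaded here so the parameter names carry the distilbert. prefix shown above:

```python
# Print the embedding tensors listed above, straight from the checkpoint.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

for name, param in model.named_parameters():
    if name.startswith("distilbert.embeddings."):
        print(f"{name:55s} {list(param.shape)}")
# expected shapes:
# distilbert.embeddings.word_embeddings.weight          [30522, 768]
# distilbert.embeddings.position_embeddings.weight      [512, 768]
# distilbert.embeddings.LayerNorm.weight                [768]
# distilbert.embeddings.LayerNorm.bias                  [768]
```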

14 of 27


Quantization

Not strictly an ML term.

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. (Wikipedia)

Weight quantization in large language models (LLMs) or any deep learning models refers to the process of reducing the precision of the model's weights from floating-point representation (e.g., 32-bit floating-point numbers) to lower bit-width representations (e.g., 16-bit, 8-bit, or even 1-bit). The primary goal of weight quantization is to reduce the memory footprint and computational requirements of the model, allowing for faster and more efficient inference on devices with limited resources, such as mobile devices or embedded systems.
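A toy sketch of the underlying idea: 8-bit affine quantization of a single weight tensor (this is the textbook scheme, not the exact recipe used by any particular LLM quantizer):

```python
# Toy 8-bit affine quantization of one weight tensor: fp32 -> int8 -> fp32.
import torch

w = torch.randn(4, 4)                               # "weights" in float32

qmin, qmax = -128, 127                              # int8 range
scale = (w.max() - w.min()) / (qmax - qmin)         # map the value range onto 256 steps
zero_point = qmin - torch.round(w.min() / scale)

w_q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.int8)
w_deq = (w_q.float() - zero_point) * scale          # dequantize to compare with the original

print("max abs quantization error:", (w - w_deq).abs().max().item())
```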

Unfortunately, once a model is quantized, it can’t be trained any further with regular approaches.

15 of 27


Optimizations

GGUF format: https://huggingface.co/docs/hub/gguf (single-file format for runtime; no separate safetensors, tokenizer, … files needed)

The trick to avoiding this limitation is LoRA – "Low-Rank Adaptation of Large Language Models". LoRA doesn't train the whole large language model at all; instead it adds "adapters", which are very small matrices (generally smaller than 1% of the full model) that are trained while the rest of the model is kept frozen.

https://www.youtube.com/watch?v=t509sv5MT0w (LoRA explained)

https://www.youtube.com/watch?v=SL2nZpv7dtY (LoRA fine-tuning)

https://github.com/unslothai/unsloth (fine-tuning framework)

Parameter-Efficient Fine-Tuning using 🤗 PEFT

Long-context Gemma 7B + LoRA (Google Colab example)

Llama 3 8B fine-tuning (Google Colab, T4)

RAFT (Retrieval-Augmented Fine-Tuning): domain-specific RAG fine-tuning.
RAFT – Azure AI Studio fine-tuning example
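A minimal sketch of attaching LoRA adapters with 🤗 PEFT; the base model, rank, and target modules below are illustrative and vary per model family:

```python
# Attach small LoRA adapter matrices; the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("distilgpt2")   # small model, for illustration

config = LoraConfig(
    r=8,                       # rank of the adapter matrices
    lora_alpha=16,             # scaling factor
    target_modules=["c_attn"], # GPT-2-style attention projection (model specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the adapters are trainable (well under 1% of params)
```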

16 of 27


Quantization

Quantization example model

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

(The "Instruct" suffix in a model name means it is a fine-tuned variant for instruction following, e.g. Llama 3 8B Instruct.)

So the original model was quantized to 4-bit precision from the bf16 dtype: the precision of its weights was reduced to a type with a smaller memory footprint. (FP32 is the full 32-bit floating-point baseline.)
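A minimal sketch of running such a GGUF file locally with llama-cpp-python; the file name below is one of the 4-bit variants from that repository, so adjust it to whichever quantization you actually download:

```python
# Run a 4-bit GGUF quantization of Mistral-7B-Instruct on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # downloaded from the repo above
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads to use
)

out = llm("[INST] Explain RAG in one sentence. [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```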

17 of 27


Quantization

To sum up:

The goal of quantization is to decrease precision while keeping the quantization error as low as possible.

18 of 27


Quantization

This tells us that the Mistral model transforms each token in its input into a 4096-dimensional vector, using a vocabulary of 32000 unique tokens. The large dimension of the embedding vectors allows the model to capture a rich representation of each token, which contributes to its performance on language tasks

  1. Vocabulary size (32,000): the first dimension of the tensor, 32000, represents the size of the model's vocabulary. This means that the model was trained with a vocabulary of 32,000 unique tokens. These tokens can be individual words, subwords, or characters, depending on the tokenization strategy used.

  2. Embedding dimension (4096): the second dimension of the tensor, 4096, represents the size of the embedding vectors. Each token in the model's vocabulary is represented as a 4096-dimensional vector. This high-dimensional representation captures the semantic and syntactic properties of the token.

The model has approximately 7.24 billion parameters. Each parameter is stored in BF16 (Brain Float 16) precision, which takes up 2 bytes of memory.

Total: 7.24 billion * 2 bytes = 14.48 GB of VRAM (weights only). You still need headroom to actually run inference.

GPU memory needed:
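The same back-of-the-envelope arithmetic as a tiny script, extended to a few other precisions (weights only; activations and the KV cache need extra memory on top):

```python
# Back-of-the-envelope VRAM needed just to hold the weights at different precisions.
params = 7.24e9   # approximate Mistral-7B parameter count

for name, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:6.2f} GB (weights only)")
# the bf16/fp16 line reproduces the 14.48 GB figure above
```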

19 of 27


Extreme Quantization

drawbacks

20 of 27


Perf Demo

Inference of the non-quantized CodeQwen1.5-7B-Chat on the GPU

vs.

the quantized version on the CPU

21 of 27


Hardware

Running model inference / tuning locally

GPU Compute capability:

Check your GPU VRAM and Compute:

nvcc -V

nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

nvidia-smi -l (watch mode)
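The same check from Python via PyTorch (assuming a CUDA-enabled build of PyTorch is installed):

```python
# Query the GPU that PyTorch actually sees.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Name:               ", props.name)
    print("VRAM:               ", round(props.total_memory / 1e9, 1), "GB")
    print("Compute capability: ", f"{props.major}.{props.minor}")
    print("Built against CUDA: ", torch.version.cuda)  # toolkit version of this PyTorch build
```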

22 of 27


Hardware

AI hardware: GPU, LPU

GPU         VRAM
RTX 2070    8 GB
RTX 4090    24 GB
T4          16 GB (Google Colab default)
A100        40 GB (HBM2)
H100        80 GB (HBM3)

23 of 27


Layers of abstraction

When you're interfacing with Large Language Models (LLMs) using Python code, you are passing through several layers of abstraction and APIs (a short sketch follows the list).

  1. Python layer (LangChain): the highest level of abstraction, where you write Python code to interact with the model.

  2. Transformers library: the `transformers` library from Hugging Face provides a high-level, user-friendly API for LLMs. It handles tasks like tokenization, model loading, and inference.

  3. TensorFlow/PyTorch: the `transformers` library uses either TensorFlow or PyTorch as the underlying deep learning framework. These frameworks provide the data structures for tensors and the operations on them, and they handle the computation graph, gradients, and optimization algorithms.

  4. CUDA API: TensorFlow/PyTorch use the CUDA API for GPU-accelerated operations if a CUDA-compatible GPU is available. The CUDA API provides a C-like interface for executing code on the GPU.

  5. GPU driver: the CUDA API interacts with the GPU driver to execute the operations on the GPU hardware.
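A minimal sketch that walks down this stack: `transformers` for the tokenizer and model, PyTorch for the tensors, and CUDA underneath if a GPU is available (the model is chosen only because it is small):

```python
# transformers (Python API) -> PyTorch tensors -> CUDA (if available) -> GPU driver.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

device = "cuda" if torch.cuda.is_available() else "cpu"   # CUDA layer, if present
model.to(device)

inputs = tok("RAG stands for", return_tensors="pt").to(device)  # PyTorch tensors
out = model.generate(**inputs, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```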

24 of 27


Errors :)

Most of the errors you will get are related to communication with the GPU driver …

CUDA: parallel computing platform from NVIDIA.

  1. CUDA Runtime: This is the API that your application uses. It’s a library that provides functions for doing things like allocating memory on the GPU, launching compute kernels, and so on.
  2. CUDA Driver: This is a lower-level API that interacts directly with the GPU hardware. It’s used by the CUDA runtime, and can also be used directly for more fine-grained control over the GPU.
  3. GPU Driver: This is the software that communicates directly with the GPU hardware. It’s used by the CUDA driver to perform the actual operations requested by your application.

So, when we say “CUDA driver”, we’re usually referring to the combination of the CUDA runtime, the CUDA driver API, and the GPU driver that all work together to execute CUDA operations on the GPU.

25 of 27


Getting started

LLM Inference frameworks:

26 of 27


Ending Notes

  • If you want to learn more in depth about specific topics, explore the links in the slides.
  • If you don't have a good enough local GPU, try out quantized models.
  • If you don't want to deal with a local Python environment setup and compute, check out Google Colab and Kaggle (both provide a few hours per week of free T4 compute).

27 of 27

Thank you!

Dominik Polzer