RAG: Retrieval-Augmented Generation.
With implementation
Dominik Polzer
Agenda
01  RAG in short
02  Pipeline overview
03  Tensor, Embeddings …?
04  Applied RAG Demo
05  ML Hardware
What is RAG?
In short: you augment the client's query (prompt) by expanding its context with relevant knowledge from an internal or external knowledge base, and then feed it to the LLM with that more relevant context.
How you implement the retrieval and augmentation layer is up to you.
The knowledge base can be any local or distributed resource:
- SQL database
- NoSQL database
- Vector database
- Documents (chunked or summarized)
- Elasticsearch / Solr full-text search (LLMs with semantic search; hybrid search, e.g. Qdrant)
Embedding-based retrieval & RAG (great comparison to FTS); HNSW graphs for production vector search (efficient approximate nearest-neighbor retrieval).
https://qdrant.tech/articles/what-is-rag-in-ai/ (What is RAG)
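A minimal sketch of the retrieve-then-augment flow described above; 'search_knowledge_base' and 'call_llm' are hypothetical placeholders for whichever retrieval backend and LLM client you actually use.

# Minimal retrieve-then-augment sketch. The two helpers are placeholders:
# plug in your own knowledge base (SQL, NoSQL, vector DB, Elastic/Solr, ...) and LLM client.

def search_knowledge_base(query: str, top_k: int = 3) -> list[str]:
    """Return the top_k most relevant snippets for the query (placeholder)."""
    raise NotImplementedError("connect your vector DB / full-text search here")

def call_llm(prompt: str) -> str:
    """Send the prompt to your LLM and return its answer (placeholder)."""
    raise NotImplementedError("connect your LLM client here")

def answer_with_rag(user_query: str) -> str:
    # 1) Retrieval: fetch relevant context from the knowledge base.
    snippets = search_knowledge_base(user_query)
    # 2) Augmentation: expand the original query with the retrieved context.
    context = "\n\n".join(snippets)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\nAnswer:"
    )
    # 3) Generation: the LLM answers with the richer context.
    return call_llm(prompt)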
RAG pipeline
High-level overview
Tensor?
https://github.com/karpathy/llm.c [ A. Karpathy ]
What are Tensors?
1) a 1D block of memory called Storage that holds the raw data
2) a View over that storage that holds its shape. (PyTorch Internals could be helpful here.)
Tensors are multi-dimensional arrays with a uniform type (called a dtype).
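A small PyTorch sketch of the storage-plus-view idea and the uniform dtype (assumes PyTorch is installed; this is not code from the llm.c repo):

import torch

# A tensor = raw 1D storage + a view (shape/strides) over it, all with one dtype.
x = torch.arange(12, dtype=torch.float32)   # storage: 12 contiguous float32 values
m = x.view(3, 4)                            # a 3x4 view over the SAME storage

print(m.shape, m.dtype)                     # torch.Size([3, 4]) torch.float32
print(x.data_ptr() == m.data_ptr())         # True: both point at the same memory

m[0, 0] = 100.0                             # writing through the view ...
print(x[0])                                 # ... changes the underlying storage: tensor(100.)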
What does the embedding process look like?
Token IDs
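A short sketch of the first step, text to token IDs, assuming the Hugging Face 'transformers' library and the distilbert-base-uncased tokenizer (the model used later in the deck):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

text = "I took my dog to the vet"
tokens = tokenizer.tokenize(text)                    # e.g. ['i', 'took', 'my', 'dog', ...]
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # integers indexing the embedding matrix
print(tokens)
print(token_ids)

# The model's embedding layer then maps each token ID to a dense vector
# (768 dimensions per token for DistilBERT).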
Challenges
- Padding
- Chunking
- Tensor shape / dimensions
- Tokenization
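A sketch of how the padding and shape challenges show up in practice (same assumed tokenizer as above): texts of different lengths have to be padded and/or truncated into one fixed-shape tensor.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

texts = [
    "short text",
    "a much longer text that will produce quite a few more tokens than the first one",
]

batch = tokenizer(
    texts,
    padding=True,       # pad the shorter sequence up to the longest in the batch
    truncation=True,    # cut anything beyond max_length (long docs are chunked upstream)
    max_length=16,
    return_tensors="pt",
)
print(batch["input_ids"].shape)     # (2, seq_len): one uniform tensor shape for the batch
print(batch["attention_mask"][0])   # zeros mark the padded positions of the short text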
Popular embedding models
- Popular OSS embedding models: Hugging Face Hub
- OpenAI embeddings (tokens)
Embeddings
Text embeddings are a natural language processing (NLP) technique that converts text into numerical vectors.
Embeddings capture semantic meaning and context, which results in texts with similar meanings having closer embeddings.
For example, the sentences "I took my dog to the vet" and "I took my cat to the vet" would have embeddings that are close to each other in the vector space, since they both describe a similar context.
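A sketch of that example in code, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (any embedding model would behave similarly):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I took my dog to the vet",
    "I took my cat to the vet",
    "The stock market closed higher today",
]
embeddings = model.encode(sentences)

print(util.cos_sim(embeddings[0], embeddings[1]))  # high: both describe a vet visit
print(util.cos_sim(embeddings[0], embeddings[2]))  # noticeably lower: unrelated topic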
Embeddings development & visuals (2d):
Embeddings
Relation of Embeddings to the model.
Dimensionality
OpenAI's embeddings have higher dimensionality (1536) compared to DistilBERT's (768 per token).
DEMO
Notes: The chunking layer can also be applied after tokenization. | The same base model is used for both prompt and document embeddings. | The documents can come from any other source. | You can run summarization per full page with an LLM and then tokenize the summarized versions, or run summarization on the vector search results.
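A compact sketch of the demo flow (embed document chunks, embed the query with the same base model, retrieve, augment); sentence-transformers and all-MiniLM-L6-v2 are assumptions here, and the final LLM call is left out:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1) Chunked documents -> embeddings (in the demo this would live in a vector DB).
doc_chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9:00-17:00 CET.",
]
doc_vectors = embedder.encode(doc_chunks)

# 2) Query -> embedding with the SAME base model, then vector search.
query = "How long do I have to return a product?"
scores = util.cos_sim(embedder.encode(query), doc_vectors)[0]
best_chunk = doc_chunks[int(scores.argmax())]

# 3) Augmented prompt that would be sent to the LLM.
prompt = f"Context:\n{best_chunk}\n\nQuestion: {query}\nAnswer:"
print(prompt)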
DistilBERT model example
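A sketch of what such a DistilBERT example might look like with Hugging Face transformers (distilbert-base-uncased assumed as the checkpoint): per-token 768-dimensional vectors, mean-pooled into one sentence vector.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("I took my dog to the vet", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_vectors = outputs.last_hidden_state     # shape (1, seq_len, 768): one vector per token
sentence_vector = token_vectors.mean(dim=1)   # naive mean pooling -> shape (1, 768)
print(token_vectors.shape, sentence_vector.shape)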
Quantization
Not strictly an ML term.
Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. (Wikipedia)
https://mlabonne.github.io/blog/posts/Introduction_to_Weight_Quantization.html
https://github.com/UmerHA/quanting-notes/blob/main/Quanting.ipynb (pynb with examples)
Weight quantization in large language models (LLMs) or any deep learning models refers to the process of reducing the precision of the model's weights from floating-point representation (e.g., 32-bit floating-point numbers) to lower bit-width representations (e.g., 16-bit, 8-bit, or even 1-bit). The primary goal of weight quantization is to reduce the memory footprint and computational requirements of the model, allowing for faster and more efficient inference on devices with limited resources, such as mobile devices or embedded systems.
Unfortunately, once a model is quantized, it can’t be trained any further with regular approaches.
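A toy illustration of the idea, using a simple per-tensor absmax scheme similar in spirit to the weight-quantization post linked above (real quantizers such as GGUF's 4-bit formats are more elaborate):

import torch

weights = torch.randn(4, 4)                    # pretend these are fp32 model weights

scale = 127 / weights.abs().max()              # one absmax scale for the whole tensor
q_weights = torch.round(weights * scale).to(torch.int8)   # int8: 4x smaller than fp32

dequantized = q_weights.float() / scale        # what inference effectively uses
print("max quantization error:", (weights - dequantized).abs().max().item())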
Optimizations
GGUF format: https://huggingface.co/docs/hub/gguf (single-file runtime; no separate safetensors, tokenizer, … files needed)
The trick to avoiding this limitation is to use LoRA ("Low-Rank Adaptation of Large Language Models"). LoRA doesn't train the whole large language model at all, but instead adds "adapters": very small matrices (generally smaller than 1% of the full model) that are trained whilst keeping the rest of the model constant (see the PEFT sketch after the links below).
https://www.youtube.com/watch?v=t509sv5MT0w (LoRA explained)
https://www.youtube.com/watch?v=SL2nZpv7dtY (LoRA finetuning)
https://github.com/unslothai/unsloth (Finetuning framework)
Parameter-Efficient Fine-Tuning using 🤗 PEFT
Long context Gemma 7b + LoRa (Google Colab example)
Llama 3 8b Finetuning (Google colab - T4)
RAFT (Retrieval Augmented Fine-Tuning): domain-specific RAG finetuning
RAFT: Azure AI Studio finetuning example
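A minimal sketch of attaching LoRA adapters with 🤗 PEFT; the base checkpoint and target_modules below are illustrative choices, not the only valid ones:

from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of the full model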
Quantization
Quantization example model
https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF
(The "Instruct" suffix next to a model name means it is a variant finetuned for instruction following, e.g. Llama 3 8B Instruct.)
So the original model was quantized to 4-bit precision from the bf16 dtype, meaning its weight precision was reduced to a type with a smaller memory footprint. (Slide figure: the FP32 floating-point format, for comparison.)
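A sketch of running one of those 4-bit GGUF files locally with llama-cpp-python; the file name follows TheBloke's naming scheme and is an assumption, so adjust it to the quant you actually downloaded:

from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # single GGUF file: weights + tokenizer
    n_ctx=4096,         # context window
    n_gpu_layers=0,     # 0 = pure CPU; increase to offload layers to the GPU
)

output = llm("[INST] What is retrieval augmented generation? [/INST]", max_tokens=128)
print(output["choices"][0]["text"])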
Quantization
To sum up: the goal of quantization is to decrease the precision while keeping the quantization error as low as possible.
Quantization
This tells us that the Mistral model transforms each token in its input into a 4096-dimensional vector, using a vocabulary of 32,000 unique tokens. The large dimension of the embedding vectors allows the model to capture a rich representation of each token, which contributes to its performance on language tasks.
The model has approximately 7.24 billion parameters. Each parameter is stored in BF16 (Brain Float 16) precision, which means it takes up 2 bytes of memory.
Total: 7.24 billion × 2 bytes = 14.48 GB of VRAM (weights only). You still need additional room to run inference.
GPU memory needed:
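A worked version of the estimate above (weights only; activations and the KV cache come on top, which is the extra room mentioned):

params = 7.24e9  # parameters in Mistral 7B

# The embedding matrix alone is 32,000 x 4,096 ≈ 131M of those parameters.
bytes_per_param = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: {params * nbytes / 1e9:.2f} GB")

# bf16: 7.24e9 * 2 bytes = 14.48 GB, matching the number above.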
Extreme Quantization
Drawbacks
https://www.youtube.com/watch?v=MXWlB9nDAFU (How 1-bit LLMs work)
Perf Demo
Inference of the non-quantized CodeQwen1.5-7B-Chat on the GPU
vs.
the quantized variant on the CPU
Hardware
Running model inference / tuning locally
GPU Compute capability:
Check your GPU VRAM and Compute:
nvcc -V
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
nvidia-smi -l (watch mode)
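The same checks from the Python side, assuming a CUDA-enabled PyTorch build:

import torch

print(torch.cuda.is_available())   # can PyTorch reach the GPU at all?
print(torch.version.cuda)          # CUDA version this torch build was compiled against

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                                      # e.g. an RTX / T4 / A100 card
    print(f"compute capability {props.major}.{props.minor}")
    print(f"{props.total_memory / 1024**3:.1f} GB VRAM")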
Hardware
AI hardware
LPU (Language Processing Unit)
GPU | VRAM
RTX 2070 | 8 GB
RTX 4090 | 24 GB
T4 | 16 GB (Google Colab default)
A100 | 40 GB (HBM2)
H100 | 80 GB (HBM3)
Layers of abstraction
When you're interfacing with Large Language Models (LLMs) using Python code, you are passing through several layers of abstraction and APIs.
Errors :)
- The resolution here was to keep the 'torch' and 'torchvision' versions in sync.
You will mostly be getting errors related to communication with the GPU driver …
CUDA: a parallel computing platform from NVIDIA.
So, when we say “CUDA driver”, we’re usually referring to the combination of the CUDA runtime, the CUDA driver API, and the GPU driver that all work together to execute CUDA operations on the GPU.
Getting started
LLM Inference frameworks:
Model release example https://huggingface.co/blog/llama3
Ending Notes
Thank you!
Dominik Polzer