1 of 27

RAG – Retrieval Augmented Generation.

With implementation

Dominik Polzer

2 of 27

Agenda

01  RAG in short (Page 3)
02  Pipeline overview (Page 4)
03  Tensor, Embeddings …? (Page 5)
04  Applied RAG Demo (Page 12)
05  ML Hardware (Page 21)

3 of 27


What is RAG?

In short: you augment the client's query (prompt) by expanding the query context with relevant knowledge from an internal or external knowledge base, and then feed it to the LLM with that richer context.

How you implement the retrieval + augmentation layer is up to you.

The knowledge base can be any local or distributed resource:

- SQL database

- NoSQL database

- Vector database

- Documents (chunked or summarized)

- Elastic / Solr full-text search (LLMs with semantic search, hybrid search – Qdrant)

  1. You can store embeddings as an indexed document field and then run an additional vector search on top of the full-text results.
  2. Or run the vector search first, then do additional filtering on the result payloads (see the sketch below).
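A minimal sketch of option 2 with the Qdrant Python client and a sentence-transformers encoder; the collection name ("docs"), the payload field ("source"), and the model are assumptions for illustration, and the payload filter is passed to Qdrant alongside the vector query:

```python
# Vector search with an additional payload filter on the results (option 2).
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")      # 384-dim embeddings
client = QdrantClient(url="http://localhost:6333")

query_vector = encoder.encode("How do I rotate my API keys?").tolist()

hits = client.search(
    collection_name="docs",                            # assumed collection
    query_vector=query_vector,
    query_filter=Filter(                               # restrict by payload
        must=[FieldCondition(key="source", match=MatchValue(value="internal-wiki"))]
    ),
    with_payload=True,
    limit=5,
)
for hit in hits:
    print(round(hit.score, 3), hit.payload.get("text", "")[:80])
```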

4 of 27


RAG pipeline

High-level overview

5 of 27


Tensor?

What are Tensors?

1) a 1D block of memory called Storage that holds the raw data

2) a View over that storage that holds its shape. (PyTorch Internals could be helpful here.)
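A tiny PyTorch sketch of the storage-vs-view split:

```python
# A tensor = a view (shape + strides) over a flat block of storage.
import torch

x = torch.arange(6, dtype=torch.float32)   # storage holds [0., 1., 2., 3., 4., 5.]
y = x.view(2, 3)                           # new view, same underlying storage

print(x.data_ptr() == y.data_ptr())        # True: both point at the same memory
print(y.shape, y.stride())                 # torch.Size([2, 3]) (3, 1)

y[0, 0] = 42.0                             # writing through the view...
print(x[0])                                # ...changes the shared storage: tensor(42.)
```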

6 of 27


Tensors are multi-dimensional arrays with a uniform type (called a dtype)

7 of 27


What does the embedding process look like?

Token IDs
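A minimal sketch of the first step, turning raw text into token IDs with a Hugging Face tokenizer (the model name is just an example):

```python
# Text -> token IDs: the integers the embedding model actually consumes.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoded = tokenizer("I took my dog to the vet", return_tensors="pt")
print(encoded["input_ids"])                                   # tensor of token IDs
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))
# roughly: ['[CLS]', 'i', 'took', 'my', 'dog', 'to', 'the', 'vet', '[SEP]']
```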

8 of 27


Challenges

  • Chunking: a primary challenge when building semantic search and RAG systems is determining how to best chunk your data. [LangChain TextSplitter intro] (See the sketch below.)

  • Tokenization

  • Padding

  • Tensor shape / dimensions: the final embedding dimensions must match the vector store's dimensions.
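A minimal chunking sketch with LangChain's RecursiveCharacterTextSplitter; the import path depends on your LangChain version, and the file name, chunk size, and overlap are illustrative values:

```python
# Chunking: split a long document into overlapping pieces before embedding it.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # max characters per chunk (illustrative)
    chunk_overlap=50,   # overlap keeps some context across chunk boundaries
)

with open("handbook.txt") as f:         # hypothetical local document
    chunks = splitter.split_text(f.read())

print(len(chunks), "chunks; first chunk:", chunks[0][:80])
```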

9 of 27


Popular embedding models

Popular OSS embedding models: Hugging Face Hub

OpenAI (tokens)

10 of 27


Embeddings

Text embeddings are a natural language processing (NLP) technique that converts text into numerical vectors.

Embeddings capture semantic meaning and context, which results in texts with similar meanings having closer embeddings.

For example, the sentence "I took my dog to the vet" and "I took my cat to the vet" would have embeddings that are close to each other in the vector space since they both describe similar context.
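A small sketch of that intuition, measuring cosine similarity with sentence-transformers (the model choice is illustrative):

```python
# The "dog vs. cat at the vet" example, measured with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

emb = model.encode([
    "I took my dog to the vet",
    "I took my cat to the vet",
    "The stock market closed higher today",
])

print(util.cos_sim(emb[0], emb[1]))   # high similarity: same context
print(util.cos_sim(emb[0], emb[2]))   # much lower similarity: unrelated context
```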

Embeddings development & visuals (2d):

11 of 27


Embeddings

Relation of Embeddings to the model.

Dimensionality

OpenAI's embeddings have higher dimensionality (1536) than DistilBERT's (768 per token).

  • distilbert-base-uncased is a distilled version of the BERT model, which produces embeddings of size 768 for each token in the input sequence. So, for a 512-token chunk of text, the output embeddings have a shape of `[512, 768]` (assuming you use the last-layer hidden states as the embeddings).

  • OpenAI's embedding model text-embedding-ada-002 returns a single 1536-dimensional vector per input text; the per-token states are pooled on OpenAI's side, so you never handle a `[512, 1536]` tensor yourself.

  • In the context of RAG, using the last-layer hidden states as the embeddings is a common choice, as it provides a compact and informative representation of the input text that can be used to find similar documents or sentences (see the sketch below).
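A sketch of pulling the last-layer hidden states out of distilbert-base-uncased and mean-pooling them into one vector per text; mean pooling over the attention mask is one common choice, not the only one:

```python
# Last-layer hidden states as embeddings: [batch, seq_len, 768] for DistilBERT.
import torch
from transformers import AutoTokenizer, AutoModel

name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

batch = tokenizer("I took my dog to the vet", return_tensors="pt",
                  padding="max_length", max_length=512, truncation=True)

with torch.no_grad():
    hidden = model(**batch).last_hidden_state     # shape: [1, 512, 768]

# Mean-pool over the real tokens (mask out padding) to get one 768-dim vector.
mask = batch["attention_mask"].unsqueeze(-1)      # [1, 512, 1]
embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(hidden.shape, embedding.shape)              # [1, 512, 768]  [1, 768]
```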

12 of 27


DEMO

Notes: The chunking layer can also be done after tokenization. | The same base model is used for the prompt and document embeddings. | The documents can come from any other source. | You can run summarization per full page with an LLM and then tokenize the summarized versions, or run summarization on the vector search results.

13 of 27


DistilBERT model example

  • distilbert.embeddings.position_embeddings.weight [512, 768]: these are the position embeddings. They represent the position of each token in a sequence. The size [512, 768] means there are 512 positions (i.e., the model can handle sequences of up to 512 tokens = its context length) and each position is represented by a 768-dimensional vector.

  • distilbert.embeddings.word_embeddings.weight [30522, 768]: these are the word embeddings. They represent the actual tokens in the vocabulary. The size [30522, 768] means there are 30,522 tokens in the vocabulary, and each one is represented by a 768-dimensional vector.

  • distilbert.embeddings.LayerNorm.bias [768] and distilbert.embeddings.LayerNorm.weight [768]: these are the parameters for the layer normalization applied to the embeddings. Layer normalization is a technique to stabilize the hidden states in the model, and it has a bias and a weight parameter for each dimension of the input.
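A quick way to see these tensors yourself; the masked-LM variant is loaded here so the parameter names carry the distilbert. prefix shown above:

```python
# Print the embedding tensors listed above, straight from the checkpoint.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

for name, param in model.named_parameters():
    if name.startswith("distilbert.embeddings."):
        print(f"{name:55s} {list(param.shape)}")
# expected shapes:
# distilbert.embeddings.word_embeddings.weight          [30522, 768]
# distilbert.embeddings.position_embeddings.weight      [512, 768]
# distilbert.embeddings.LayerNorm.weight                [768]
# distilbert.embeddings.LayerNorm.bias                  [768]
```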

14 of 27


Quantization

Not strictly an ML term.

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes. (Wikipedia)

Weight quantization in large language models (LLMs) or any deep learning models refers to the process of reducing the precision of the model's weights from floating-point representation (e.g., 32-bit floating-point numbers) to lower bit-width representations (e.g., 16-bit, 8-bit, or even 1-bit). The primary goal of weight quantization is to reduce the memory footprint and computational requirements of the model, allowing for faster and more efficient inference on devices with limited resources, such as mobile devices or embedded systems.
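A toy sketch of the underlying idea: 8-bit affine quantization of a single weight tensor (this is the textbook scheme, not the exact recipe used by any particular LLM quantizer):

```python
# Toy 8-bit affine quantization of one weight tensor: fp32 -> int8 -> fp32.
import torch

w = torch.randn(4, 4)                               # "weights" in float32

qmin, qmax = -128, 127                              # int8 range
scale = (w.max() - w.min()) / (qmax - qmin)         # map the value range onto 256 steps
zero_point = qmin - torch.round(w.min() / scale)

w_q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax).to(torch.int8)
w_deq = (w_q.float() - zero_point) * scale          # dequantize to compare with the original

print("max abs quantization error:", (w - w_deq).abs().max().item())
```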

Unfortunately, once a model is quantized, it can’t be trained any further with regular approaches.

15 of 27


Optimizations

GGUF format: https://huggingface.co/docs/hub/gguf (single-file format for runtime; no separate safetensors, tokenizer, … files needed)

The trick to avoiding this limitation is LoRA – "Low-Rank Adaptation of Large Language Models". LoRA doesn't train the whole large language model at all; instead it adds "adapters", which are very small matrices (generally smaller than 1% of the full model) that are trained while the rest of the model is kept frozen.

https://www.youtube.com/watch?v=t509sv5MT0w (LoRA explained)

https://www.youtube.com/watch?v=SL2nZpv7dtY (LoRA fine-tuning)

https://github.com/unslothai/unsloth (fine-tuning framework)

Parameter-Efficient Fine-Tuning using 🤗 PEFT

Long-context Gemma 7B + LoRA (Google Colab example)

Llama 3 8B fine-tuning (Google Colab, T4)

RAFT (Retrieval-Augmented Fine-Tuning): domain-specific RAG fine-tuning.
RAFT – Azure AI Studio fine-tuning example
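A minimal sketch of attaching LoRA adapters with 🤗 PEFT; the base model, rank, and target modules below are illustrative and vary per model family:

```python
# Attach small LoRA adapter matrices; the base model's weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("distilgpt2")   # small model, for illustration

config = LoraConfig(
    r=8,                       # rank of the adapter matrices
    lora_alpha=16,             # scaling factor
    target_modules=["c_attn"], # GPT-2-style attention projection (model specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the adapters are trainable (well under 1% of params)
```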

16 of 27


Quantization

Quantization example model

https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF

(The "Instruct" suffix in a model name means it is a fine-tuned variant for instruction following, e.g. Llama 3 8B Instruct.)

So the original model was quantized to 4-bit precision from the bf16 dtype: the precision of its weights was reduced to a type with a smaller memory footprint. (FP32 is the full 32-bit floating-point baseline.)
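A minimal sketch of running such a GGUF file locally with llama-cpp-python; the file name below is one of the 4-bit variants from that repository, so adjust it to whichever quantization you actually download:

```python
# Run a 4-bit GGUF quantization of Mistral-7B-Instruct on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # downloaded from the repo above
    n_ctx=4096,        # context window
    n_threads=8,       # CPU threads to use
)

out = llm("[INST] Explain RAG in one sentence. [/INST]", max_tokens=128)
print(out["choices"][0]["text"])
```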

17 of 27


Quantization

To sum up:

The goal of quantization is to decrease precision while keeping the quantization error as low as possible.

18 of 27


Quantization

This tells us that the Mistral model transforms each token in its input into a 4096-dimensional vector, using a vocabulary of 32000 unique tokens. The large dimension of the embedding vectors allows the model to capture a rich representation of each token, which contributes to its performance on language tasks

  1. Vocabulary size (32,000): the first dimension of the tensor, 32000, represents the size of the model's vocabulary. This means that the model was trained with a vocabulary of 32,000 unique tokens. These tokens can be individual words, subwords, or characters, depending on the tokenization strategy used.

  2. Embedding dimension (4096): the second dimension of the tensor, 4096, represents the size of the embedding vectors. Each token in the model's vocabulary is represented as a 4096-dimensional vector. This high-dimensional representation captures the semantic and syntactic properties of the token.

The model has approximately 7.24 billion parameters. Each parameter is stored in BF16 (Brain Float 16) precision, which takes up 2 bytes of memory.

Total: 7.24 billion * 2 bytes = 14.48 GB of VRAM (weights only). You still need headroom to actually run inference.

GPU memory needed:
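The same back-of-the-envelope arithmetic as a tiny script, extended to a few other precisions (weights only; activations and the KV cache need extra memory on top):

```python
# Back-of-the-envelope VRAM needed just to hold the weights at different precisions.
params = 7.24e9   # approximate Mistral-7B parameter count

for name, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:6.2f} GB (weights only)")
# the bf16/fp16 line reproduces the 14.48 GB figure above
```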

19 of 27


Extreme Quantization

drawbacks

20 of 27


Perf Demo

Inference of the non-quantized CodeQwen1.5-7B-Chat on the GPU

vs.

the quantized version on the CPU

21 of 27


Hardware

Running model inference / tuning locally

GPU Compute capability:

Check your GPU VRAM and Compute:

nvcc -V

nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv

nvidia-smi -l (watch mode)
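The same check from Python via PyTorch (assuming a CUDA-enabled build of PyTorch is installed):

```python
# Query the GPU that PyTorch actually sees.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print("Name:               ", props.name)
    print("VRAM:               ", round(props.total_memory / 1e9, 1), "GB")
    print("Compute capability: ", f"{props.major}.{props.minor}")
    print("Built against CUDA: ", torch.version.cuda)  # toolkit version of this PyTorch build
```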

22 of 27


Hardware

AI hardware: GPU, LPU

GPU         VRAM
RTX 2070    8 GB
RTX 4090    24 GB
T4          16 GB (Google Colab default)
A100        40 GB (HBM2)
H100        80 GB (HBM3)

23 of 27


Layers of abstraction

When you're interfacing with Large Language Models (LLMs) using Python code, you are passing through several layers of abstraction and APIs (a short sketch follows the list).

  1. Python layer (LangChain): the highest level of abstraction, where you write Python code to interact with the model.

  2. Transformers library: the `transformers` library from Hugging Face provides a high-level, user-friendly API for LLMs. It handles tasks like tokenization, model loading, and inference.

  3. TensorFlow/PyTorch: the `transformers` library uses either TensorFlow or PyTorch as the underlying deep learning framework. These frameworks provide the data structures for tensors and the operations on them, and they handle the computation graph, gradients, and optimization algorithms.

  4. CUDA API: TensorFlow/PyTorch use the CUDA API for GPU-accelerated operations if a CUDA-compatible GPU is available. The CUDA API provides a C-like interface for executing code on the GPU.

  5. GPU driver: the CUDA API interacts with the GPU driver to execute the operations on the GPU hardware.
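A minimal sketch that walks down this stack: `transformers` for the tokenizer and model, PyTorch for the tensors, and CUDA underneath if a GPU is available (the model is chosen only because it is small):

```python
# transformers (Python API) -> PyTorch tensors -> CUDA (if available) -> GPU driver.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

device = "cuda" if torch.cuda.is_available() else "cpu"   # CUDA layer, if present
model.to(device)

inputs = tok("RAG stands for", return_tensors="pt").to(device)  # PyTorch tensors
out = model.generate(**inputs, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```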

24 of 27


Errors :)

Most of the errors you will get are related to communication with the GPU driver …

CUDA: parallel computing platform from NVIDIA.

  1. CUDA Runtime: This is the API that your application uses. It’s a library that provides functions for doing things like allocating memory on the GPU, launching compute kernels, and so on.
  2. CUDA Driver: This is a lower-level API that interacts directly with the GPU hardware. It’s used by the CUDA runtime, and can also be used directly for more fine-grained control over the GPU.
  3. GPU Driver: This is the software that communicates directly with the GPU hardware. It’s used by the CUDA driver to perform the actual operations requested by your application.

So, when we say “CUDA driver”, we’re usually referring to the combination of the CUDA runtime, the CUDA driver API, and the GPU driver that all work together to execute CUDA operations on the GPU.

25 of 27


Getting started

LLM Inference frameworks:

26 of 27


Ending Notes

  • If you want to learn more in depth about specific topics, explore the links in the slides.
  • If you don't have a good enough local GPU, try out quantized models.
  • If you don't want to deal with a local Python environment setup and compute, check out Google Colab and Kaggle (both provide a few hours per week of free T4 compute).

27 of 27

Thank you!

Dominik Polzer