1 of 33

Understanding RAG and

Fine-Tuning for

Large Language Models

Kathmandu

2 of 33

$whoami

  • DevRel @AIPlanet
  • Google Developer Expert in ML
  • GSoC’23 @caMicroscope
  • AI With Tarun - YouTube
  • Anime and Manga


6 of 33

Issues with Large Language Models


7 of 33

  • Hallucinations
  • Knowledge cut-off
  • Lack of domain-specific factual responses

8 of 33

What is

Retrieval-Augmented Generation (RAG)?


9 of 33

RAG pipeline:

  • External Data
  • Data Preprocessing
  • Vector Embeddings
  • Vector Database


Implement RAG with GenAIStack

  • You can use Google’s Flan-T5 series of LLMs through HuggingFaceModel
  • You can also use other open-source models, or OpenAI’s models (closed-source)
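The pipeline on the earlier slides (external data → preprocessing → embeddings → vector database) can be sketched framework-agnostically. The bag-of-words "embedding", the `VectorStore` class, and the sample chunks below are illustrative stand-ins for a real embedding model and vector database; this is not GenAIStack's actual API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words token counts. A real RAG stack would
    # use a learned embedding model (e.g. a sentence transformer) here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class VectorStore:
    """Stand-in for a vector database: stores (embedding, chunk) pairs."""
    def __init__(self):
        self.items = []

    def add(self, chunk: str):
        self.items.append((embed(chunk), chunk))

    def retrieve(self, query: str, k: int = 2):
        # Rank stored chunks by similarity to the query embedding.
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]

# Index "external data" (preprocessing here is just naive chunking).
store = VectorStore()
for chunk in ["RAG retrieves relevant context before generation.",
              "Fine-tuning updates model weights on new data.",
              "Vector databases index embeddings for similarity search."]:
    store.add(chunk)

# Retrieved chunks would be prepended to the LLM prompt as context.
context = store.retrieve("how does retrieval augmented generation work")
```

A framework like GenAIStack wires the same stages together behind a higher-level API, with real embedding models and vector databases in place of the toys above.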

16 of 33

What is

Fine-Tuning?


17 of 33

Let’s imagine you have a Robot

18 of 33

A smile on the robot's face after training on more data

19 of 33

  • Pre-trained models are built for general-purpose tasks; with fine-tuning we can improve their performance on a wide range of domain-specific tasks.
  • In simple words, fine-tuning is a technique where a pre-trained model is further trained on a new dataset.
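The idea can be illustrated with a toy model: "pre-train" a one-parameter linear model on a general dataset, then continue training the same weight on a new, domain-specific dataset. The datasets, learning rate, and step counts below are illustrative assumptions, not a real LLM workflow:

```python
def train(w, data, lr=0.01, steps=200):
    # Plain gradient descent on mean squared error for y = w * x.
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

general = [(x, 2.0 * x) for x in range(1, 6)]  # "general" task: y = 2x
domain = [(x, 3.0 * x) for x in range(1, 6)]   # new domain: y = 3x

w_pretrained = train(0.0, general)                    # pre-training from scratch
w_finetuned = train(w_pretrained, domain, steps=50)   # short fine-tune on new data
```

Fine-tuning converges in far fewer steps than pre-training because it starts from an already-useful weight; that head start is the whole appeal of the technique.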

20 of 33

Pre-trained LLMs timeline:

  • 2018: BERT
  • 2019: GPT-2, T5
  • 2020: GPT-3
  • 2021: MUM
  • 2022: GPT-3.5, PaLM
  • 2023: GPT-4, Llama, Falcon, Mistral
  • 2024: ?

21 of 33

Why Fine-Tune

pre-trained models?


22 of 33

Why Fine-Tune?

  • Fine-tuning leverages a pre-trained model's knowledge and capabilities, saving significant time and resources compared to training a model from scratch.
  • Improve factual responses by utilizing domain-specific data.
  • Reduce hallucinations.

23 of 33

Downside of Fine-tuning


24 of 33

  • Catastrophic forgetting: fine-tuned models may lose some of their pre-trained knowledge as they adapt to the new task.
  • Computational requirements: fully fine-tuning an LLM typically demands high-memory accelerators such as NVIDIA A100 GPUs.
  • In full fine-tuning, the trainable parameters have the same dimensions as the pre-trained parameters: every weight is updated.

25 of 33

Parameter-Efficient

Fine-Tuning (PEFT)

techniques


26 of 33

Enter LoRA: Low Rank Adaptation

  • LoRA trains the dense layers of a neural network indirectly, by optimizing low-rank decomposition matrices that represent the change to each layer's weights.
  • It freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture.
  • No additional inference latency: the low-rank update can be merged back into the frozen weights.
  • Applying LoRA to the weight matrices of a neural network greatly reduces the number of trainable parameters.
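The parameter saving is easy to make concrete. LoRA replaces the full weight update ΔW (a d×k matrix) with a product B·A, where B is d×r and A is r×k with rank r much smaller than d and k. The dimensions below are illustrative, not from the slides:

```python
# Illustrative dimensions: a 4096x4096 attention weight, LoRA rank 8.
d, k, r = 4096, 4096, 8

full_update_params = d * k       # training the full update ΔW directly
lora_params = d * r + r * k      # training only B (d x r) and A (r x k)

print(full_update_params)        # 16777216
print(lora_params)               # 65536
print(lora_params / full_update_params)  # 0.00390625, i.e. 256x fewer trainables
```

Because W' = W + B·A can be computed once and merged into W, inference runs on a single dense matrix, which is why LoRA adds no latency at serving time.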

28 of 33

Quantization

Example: int8 absmax quantization uses the scaling factor 127 / 5.4, where 127 is the largest int8 value and 5.4 is the largest absolute value in the tensor.
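The 127/5.4 above is the absmax int8 scaling factor. A minimal sketch of absmax quantization and dequantization; the values in `xs` are illustrative:

```python
def absmax_quantize(xs):
    # int8 absmax quantization: scale so the largest magnitude maps to 127.
    scale = 127 / max(abs(x) for x in xs)
    q = [round(x * scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from the int8 representation.
    return [v / scale for v in q]

xs = [0.1, -0.5, 1.2, 5.4]     # absmax = 5.4, so scale = 127 / 5.4
q, scale = absmax_quantize(xs)
# Dequantizing recovers xs only approximately; that rounding error is the
# quantization error which schemes like NF4 (next slide) try to minimize.
```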

29 of 33

QLoRA: Efficient Fine-Tuning of Quantized LLMs

QLoRA introduces a number of innovations to save memory without sacrificing performance:

    • 4-bit NormalFloat (NF4), a new data type that is information-theoretically optimal for normally distributed weights
    • Double Quantization, which reduces the average memory footprint by quantizing the quantization constants themselves
    • Paged Optimizers to manage memory spikes
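Double Quantization's saving can be estimated with simple arithmetic, using the block sizes reported in the QLoRA paper (one fp32 constant per 64 weights; second-level quantization over blocks of 256 constants). Treat these exact numbers as an assumption taken from that paper, not from the slides:

```python
block = 64  # weights per first-level quantization constant (QLoRA paper)

# Single quantization: one fp32 (32-bit) constant per block of 64 weights.
single = 32 / block                       # 0.5 bits of overhead per parameter

# Double quantization: constants stored in 8 bits, plus one fp32
# second-level constant per 256 first-level constants.
double = 8 / block + 32 / (block * 256)   # about 0.127 bits per parameter

print(round(single - double, 3))          # 0.373 bits saved per parameter
```

Across a multi-billion-parameter model, ~0.37 bits per parameter adds up to hundreds of megabytes of saved memory.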

30 of 33

Vertex AI by Google: Guide to Fine-Tune Foundational Models (PaLM)

31 of 33

Awesome Fine-Tune LLMs

lucifertrj/Awesome-Fine-Tuning-LLMs

32 of 33

Thank You

Let’s Connect for further discussions on GenAI and ML. Happy learning!
