1 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Youssef Briki, Salman Hussain Ali

2 of 27

Outline

  1. Fine-Tuning
  2. Parameter-Efficient Fine-Tuning
  3. LoRA: Low-Rank Adaptation of Large Language Models
  4. LoRA+: Efficient LoRA
  5. DoRA: Weight-Decomposed Low-Rank Adaptation

3 of 27

Fine-tuning

Updating the pretrained model's weights to solve the desired downstream task (e.g. summarization, medical Q&A)

Involves updating all the base model’s parameters

Let us consider the case of fine-tuning LLaMa 3-8B

Problems:

  • computationally expensive (≥ 2 H100s needed)
  • deployment and storage burden (16GB for each copy)

Source: Fenek, O. (2025) Fine-Tune LLMs: Between Full & Partial Fine Tuning — An End-to-End Python Example to Fine-Tune with PEFT/LoRA on the SST Dataset (link)

4 of 27

Parameter-Efficient Fine-tuning

Adapters Houlsby, N. et al. (2019). Parameter-Efficient Transfer Learning for NLP

Freeze base model and inject trainable bottleneck layers in each Transformer layer.

Adds 2dr parameters for each inserted adapter (a d × r down-projection and an r × d up-projection).
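
A minimal PyTorch sketch of such a bottleneck adapter (the class name, activation choice, and dimensions are illustrative, not from Houlsby et al.):

    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: project d -> r -> d, roughly 2dr parameters."""
        def __init__(self, d: int, r: int):
            super().__init__()
            self.down = nn.Linear(d, r, bias=False)   # d*r parameters
            self.up = nn.Linear(r, d, bias=False)     # r*d parameters
            self.act = nn.GELU()

        def forward(self, h):
            # residual connection keeps the frozen base representation intact
            return h + self.up(self.act(self.down(h)))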

Problem: additional latency at inference

5 of 27

Parameter-Efficient Fine-tuning (PEFT)

Prefix-Tuning Li & Liang (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation

Optimize a small, task-specific continuous vector and prepend it to the input embeddings

Adds d · l_p trainable parameters (l_p = number of prefix tokens)
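
A minimal sketch of the idea (dimensions and names are illustrative): the prefix is a learnable tensor prepended to every sequence of input embeddings.

    import torch
    import torch.nn as nn

    d, l_p = 768, 20                                # embedding dim, # of prefix tokens
    prefix = nn.Parameter(torch.randn(l_p, d))      # d * l_p trainable parameters

    def prepend_prefix(input_embeds: torch.Tensor) -> torch.Tensor:
        """input_embeds: (batch, seq, d) -> (batch, l_p + seq, d)."""
        batch = input_embeds.shape[0]
        return torch.cat([prefix.unsqueeze(0).expand(batch, -1, -1), input_embeds], dim=1)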

Problems:

  • Difficult to optimize
  • Performance does not scale with # of trainable parameters

6 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Approach:

  1. Freeze the original layers
  2. Optimize rank-decomposition matrices of each layer's weight update instead of the layer itself

For a pretrained weight matrix W0 ∈ ℝ^(d×d), constrain its update by representing it with a low-rank decomposition:

ΔW = αBA ;  B ∈ ℝ^(d×r) ;  A ∈ ℝ^(r×d) ;  r ≪ d ;  0 ≤ α ≤ 1

Wnew = W0 + ΔW = W0 + αBA

Hu, E. et al. (2021)

7 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Original forward pass: h = W0x

Modified forward pass: h = W0x + ΔWx = W0x + αBAx

At inference time, fuse updated weights to avoid additional latency:

Wnew = W0 + αBA

h = Wnewx
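
A minimal PyTorch sketch of both passes (class and variable names are ours, not from the paper; the slides' single scaling factor α is used rather than the paper's α/r):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                      # freeze W0
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # small random init
            self.B = nn.Parameter(torch.zeros(d_out, r))         # zeros => ΔW = 0 at start
            self.alpha = alpha

        def forward(self, x):
            # h = W0 x + α B A x
            return self.base(x) + self.alpha * ((x @ self.A.T) @ self.B.T)

        @torch.no_grad()
        def merge(self):
            # fuse for inference: Wnew = W0 + αBA, then h = Wnew x with no extra latency
            self.base.weight += self.alpha * (self.B @ self.A)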

8 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Memory

Using LoRA, fine-tuning LLaMa 3-8B is possible on a single A100, as opposed to the 3-5 A100s or 2 H100s required for full fine-tuning

Storage

Each inserted LoRA adapter adds 2dr trainable parameters.

Generally, LoRA is applied to the query and value weight matrices in the attention mechanism, resulting in a total of 4drl trainable parameters (l = number of layers).

Fine-tuning LLaMa 3-8B with r = 8 results in ~4M trainable parameters = 8 MB (FP16) or 16 MB (FP32)
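
A quick back-of-the-envelope check of these numbers (Llama 3 8B: l = 32 layers, hidden size d = 4096):

    d, r, l = 4096, 8, 32
    trainable = 4 * d * r * l                  # 2dr each for Wq and Wv, per layer
    print(f"{trainable:,} parameters")         # 4,194,304  (~4M)
    print(f"{trainable * 2 / 2**20:.0f} MB in FP16, {trainable * 4 / 2**20:.0f} MB in FP32")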

9 of 27

LoRA: Low-Rank Adaptation of Large Language Models

This table depicts the results of several PEFT techniques applied to the RoBERTa and DeBERTa XXL models, and evaluated on the GLUE benchmark.

LoRA outperforms full fine-tuning (FFT) with only 0.24-0.3% of the trainable parameters

This table shows the results of GPT-2 being fine-tuned on Natural Language Generation tasks in the E2E NLG Challenge.

We observe trends similar to the table above, with LoRA consistently outperforming FFT and the other PEFT techniques.

10 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Scaling

The authors investigated the scaling properties of PEFT techniques when fine-tuning GPT-3 on WikiSQL and MultiNLI-matched.

11 of 27

LoRA: Low-Rank Adaptation of Large Language Models

Where should we apply LoRA?

Fix the trainable-parameter budget (18M) on GPT-3 and spread it across different weight matrices.

Findings:

  • adapting both Wq and Wv yields the best results
  • r = 4 is sufficient to capture enough information and outperforms adapting a single kind of weight with a larger rank

12 of 27

LoRA: Low-Rank Adaptation of Large Language Models

What is the optimal rank r ?

Using GPT-3, vary r across different sets of weights.

Findings:

  • LoRA performs competitively with small r, suggesting that ΔW has a low intrinsic rank.
  • Diminishing returns at higher values of r.

13 of 27

LoRA: Low-Rank Adaptation of Large Language Models

To analyze subspace similarity across different ranks r, the authors compute a normalized subspace similarity (based on the Grassmann distance) between the right-singular unitary matrices obtained via SVD of the adapters learned with r = 8 and r = 64. It measures the overlap between the subspaces spanned by the top singular vectors:

1 = complete overlap

0 = no overlap
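
A sketch of the measure, assuming A1 and A2 are the learned A matrices for r = 8 and r = 64 (the function name is ours):

    import torch

    def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
        """phi = ||U1^T U2||_F^2 / min(i, j): overlap of the top-i and top-j
        right-singular subspaces; 1 = complete overlap, 0 = none."""
        U1 = torch.linalg.svd(A1, full_matrices=False).Vh[:i]  # top-i right-singular vectors
        U2 = torch.linalg.svd(A2, full_matrices=False).Vh[:j]
        return ((U1 @ U2.T).norm() ** 2 / min(i, j)).item()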

14 of 27

LoRA Alternatives

  • LoRA+
  • DoRA
  • QLoRA

15 of 27

LoRA+

Approach:

    • Set a much higher learning rate for the matrix B (up to 4-16× the learning rate of A)
    • B is initialized to zeros, so it can afford a much higher learning rate than A

Hayou, Ghosh, Yu (2024)
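
In practice this is just two optimizer parameter groups. A sketch assuming the PEFT naming convention ("lora_A" / "lora_B" in parameter names); the learning rate and ratio are illustrative:

    import torch

    base_lr, ratio = 2e-4, 16                  # eta_B = ratio * eta_A
    params_A, params_B = [], []
    for name, p in model.named_parameters():   # `model` is an assumed LoRA-wrapped model
        if not p.requires_grad:
            continue
        (params_B if "lora_B" in name else params_A).append(p)

    optimizer = torch.optim.AdamW([
        {"params": params_A, "lr": base_lr},
        {"params": params_B, "lr": base_lr * ratio},  # much higher LR for zero-initialized B
    ])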

16 of 27

LoRA+

Advantages

    • Up to 2x speed-up!
    • Slight performance improvement (1-2%)
    • Minimal changes to the LoRA technique

Limits:

  • LoRA+ does not benefit every architecture equally: RoBERTa sees up to a 2× speed-up, while LLaMA-7B can be slower
  • No clear guidance on how to choose the learning-rate ratio λ

17 of 27

DoRA - Weight-Decomposed Low-Rank Adaptation

Approach

    • Split the pretrained weights into 2 parts:
      • Magnitude (M)
      • Direction (D)
    • Fine-tune Magnitude directly (like in Full Fine-Tuning).
    • Apply a LoRA-style update to the Direction.
    • Combine the updated M and D to form the new weight matrix.

Nvidia (2024)
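
A numerical sketch of the decomposition (dimensions illustrative; the column-wise norm follows DoRA's formulation W = m · V/||V||_c):

    import torch

    d, r = 64, 8
    W0 = torch.randn(d, d)                          # pretrained weight

    m = W0.norm(dim=0, keepdim=True)                # Magnitude: per-column norm, trained directly
    V = W0.clone()                                  # Direction: updated with a LoRA adapter
    B, A = torch.zeros(d, r), torch.randn(r, d) * 0.01
    V_adapted = V + B @ A                           # LoRA-style directional update

    # Recombine: renormalize the columns, then rescale by the learned magnitude
    W_new = m * V_adapted / V_adapted.norm(dim=0, keepdim=True)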

18 of 27

DoRA - deep dive

Understanding DoRA's effect on model training

  • LoRA: positive correlation between magnitude and direction updates => rotation and rescaling are tied together
  • DoRA / full fine-tuning: negative correlation => rotation and rescaling are decoupled

What does this mean?

Positive correlation (LoRA): less expressive, because direction and magnitude are coupled.

Negative correlation (FT, DoRA): more flexible, because the model can adapt weights in two independent ways, closer to what full fine-tuning does.

19 of 27

DoRA - Weight-Decomposed Low-Rank Adaptation

Advantages

    • Better results than plain LoRA
    • More freedom in training: magnitude and direction are updated independently
    • Adapts to each model architecture

20 of 27

QLoRA

Approach

  • Quantize pretrained weights to 4-bit NormalFloat (NF4)
  • Add LoRA adapters
  • Apply double quantization to compress quantization constants
  • Use paged optimizers to handle GPU memory spikes

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer (2023)
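
A sketch of this recipe with HuggingFace Transformers + PEFT + bitsandbytes (model name and hyperparameters are placeholders):

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat
        bnb_4bit_use_double_quant=True,         # double quantization of the constants
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B", quantization_config=bnb_config
    )
    model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16,
                                             target_modules=["q_proj", "v_proj"]))
    # paged optimizer: set optim="paged_adamw_8bit" in TrainingArguments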

21 of 27

QLoRA

Advantages:

  • Achieves full 16-bit fine-tuning performance with only 4-bit storage
  • 20× GPU memory savings compared to standard fine-tuning

22 of 27

Considerations before fine-tuning

What to know:

  • Resource constraints (QLoRA - QDoRA)
  • Performance (DoRA)
  • Speed (LoRA+)

Some cool optimizations

  • Unsloth (Speed)
  • NEFTune (Performance)

23 of 27

Code overview - LoRA

LoRA with HuggingFace - classification task

  1. Load the model
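
For example (the checkpoint and label count are placeholders):

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)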

24 of 27

Code overview - LoRA

2. PEFT settings
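
A typical configuration for the classification task above (hyperparameters are illustrative; "query"/"value" are the attention projection names in RoBERTa):

    from peft import LoraConfig, TaskType, get_peft_model

    lora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,                                 # rank of the decomposition
        lora_alpha=16,                       # scaling factor
        lora_dropout=0.1,
        target_modules=["query", "value"],   # Wq and Wv, as in the LoRA paper
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()       # only the A/B matrices are trainable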

25 of 27

Code overview - LoRA

3. Train the model
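
Training then uses the standard Trainer loop (train_ds / eval_ds are assumed pre-tokenized datasets):

    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="lora-classifier",
        learning_rate=2e-4,                  # LoRA tolerates higher LRs than full fine-tuning
        per_device_train_batch_size=16,
        num_train_epochs=3,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
    trainer.train()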

26 of 27

Code overview - from LoRA to DoRA

  1. Set use_dora = True in the LoRA config
  2. Add the LoRA magnitude vectors to the target modules (see the sketch below)
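
With PEFT, this is a one-line change to the config from step 2; the magnitude vectors are created automatically for the targeted modules:

    from peft import LoraConfig, TaskType, get_peft_model

    dora_config = LoraConfig(
        task_type=TaskType.SEQ_CLS,
        r=8,
        lora_alpha=16,
        target_modules=["query", "value"],
        use_dora=True,        # adds a trainable magnitude vector per targeted module
    )
    model = get_peft_model(model, dora_config)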

27 of 27

Questions ?