LoRA: Low-Rank Adaptation of Large Language Models
Youssef Briki, Salman Hussain Ali
Outline
Fine-tuning
Updating the pretrained model's weights to solve the desired downstream task (e.g. summarization, medical Q&A)
Involves updating all the base model’s parameters
Let us consider the case of fine-tuning LLaMa 3-8B
Problems:
Training must hold all 8B parameters plus gradients and optimizer states in GPU memory
Each downstream task requires storing and serving a full copy of the model (~16 GB in FP16)
Source: Fenek, O. (2025). Fine-Tune LLMs: Between Full & Partial Fine Tuning — An End-to-End Python Example to Fine-Tune with PEFT/LoRA on the SST Dataset (link)
Parameter-Efficient Fine-tuning
Adapters: Houlsby, N. et al. (2019), Parameter-Efficient Transfer Learning for NLP
Freeze the base model and inject trainable bottleneck layers into each Transformer layer.
Adds 2dr parameters per inserted adapter (r = bottleneck dimension); a minimal sketch follows this list.
Problem: additional latency at inference
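A minimal PyTorch sketch of one such bottleneck adapter (module name and details are illustrative, not the paper's code):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # Down-project d -> r, apply a nonlinearity, up-project r -> d,
    # and add a residual connection: 2dr parameters, ignoring biases.
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.up = nn.Linear(r, d)
        self.act = nn.GELU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(self.act(self.down(h)))  # residual keeps the base behavior recoverable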
Parameter-Efficient Fine-tuning (PEFT)
Prefix-Tuning: Li & Liang (2021), Prefix-Tuning: Optimizing Continuous Prompts for Generation
Optimize a small, task-specific continuous vector and prepend it to the input embeddings
Adds d·lp trainable parameters (lp = number of prefix tokens); a minimal sketch follows
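A minimal sketch of the idea as stated above (learn lp continuous vectors and prepend them to the input embeddings); the class name and initialization scale are illustrative, and the actual paper inserts prefix activations at every layer:

import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    def __init__(self, d: int, l_p: int):
        super().__init__()
        # l_p trainable prefix vectors of size d: d * l_p parameters in total
        self.prefix = nn.Parameter(torch.randn(l_p, d) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, d) -> (batch, l_p + seq_len, d)
        batch = input_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)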
Problems:
Difficult to optimize; performance changes non-monotonically with the number of prefix tokens
Reserving part of the sequence length for the prefix reduces the length available for the task input
LoRA: Low-Rank Adaptation of Large Language Models
Approach:
For a pretrained weight matrix W0 ∈ ℝ^(d×d), constrain its update by representing it with a low-rank decomposition:
ΔW = αBA ; B ∈ ℝ^(d×r) ; A ∈ ℝ^(r×d) ; r ≪ d ; α > 0 a scaling factor (the paper scales by α/r)
Wnew = W0 + ΔW
= W0 + αBA
Hu, E. et al. (2021)
LoRA: Low-Rank Adaptation of Large Language Models
Original forward pass: h = W0x
Modified forward pass: h = W0x + ΔWx = W0x + αBAx
At inference time, fuse updated weights to avoid additional latency:
Wnew = W0 + αBA
h = Wnewx
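A minimal PyTorch sketch of this forward pass and the inference-time merge; the zero initialization of B (so ΔW = 0 at the start of training) follows the paper, while the class layout is illustrative:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d: int, r: int, alpha: float):
        super().__init__()
        self.W0 = nn.Linear(d, d, bias=False)
        self.W0.weight.requires_grad_(False)              # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(r, d) * 0.02)   # A ∈ R^(r×d), random init
        self.B = nn.Parameter(torch.zeros(d, r))          # B ∈ R^(d×r), zero init
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + α B A x
        return self.W0(x) + self.alpha * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self):
        # Fuse for inference: Wnew = W0 + α B A. Call once at export time,
        # after which the low-rank branch should be skipped.
        self.W0.weight += self.alpha * (self.B @ self.A)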
LoRA: Low-Rank Adaptation of Large Language Models
Memory
Using LoRA, fine-tuning LLaMa 3-8B is possible with a single A100, as opposed to 3-5 A100s or 2 H100s required for full fine-tuning
Storage
Each inserted LoRA adapter adds 2dr trainable parameters.
Generally, LoRA is applied to the query and value weight matrices in the attention mechanism, so the total number of trainable parameters is 4drl (l = number of layers).
Fine-tuning LLaMa 3-8B with r = 8 results in ~4M trainable parameters = 8 MB (FP16) or 16 MB (FP32); the arithmetic follows below.
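The arithmetic behind that estimate, assuming d = 4096 and l = 32 for LLaMa 3-8B and square d×d attention projections as in the formula above:

d, r, l = 4096, 8, 32            # hidden size, LoRA rank, number of layers
params = 4 * d * r * l           # 2dr for Wq plus 2dr for Wv, per layer
print(params)                    # 4,194,304 ≈ 4M trainable parameters
print(params * 2 / 2**20)        # FP16: 8.0 MB (2 bytes per parameter)
print(params * 4 / 2**20)        # FP32: 16.0 MB (4 bytes per parameter)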
LoRA: Low-Rank Adaptation of Large Language Models
This table depicts the results of several PEFT techniques applied to the RoBERTa and DeBERTa XXL models, and evaluated on the GLUE benchmark.
LoRA outperforms FFT (full fine-tuning) with only 0.24-0.3% of the trainable parameters
This table shows the results of GPT-2 being fine-tuned on Natural Language Generation tasks in the E2E NLG Challenge.
We observe trends similar to the table above, with LoRA consistently outperforming FFT and other PEFT techniques.
LoRA: Low-Rank Adaptation of Large Language Models
Scaling
The authors investigated the scaling properties of PEFT techniques when fine-tuning GPT-3 on WikiSQL and MultiNLI-matched.
LoRA: Low-Rank Adaptation of Large Language Models
Where should we apply LoRA?
Fix the parameter budget (18M) on GPT-3 and spread it across different types of weight matrices.
Findings: adapting both Wq and Wv with a smaller rank outperforms spending the entire budget on a single weight type with a larger rank.
LoRA: Low-Rank Adaptation of Large Language Models
What is the optimal rank r ?
Using GPT-3, vary r on different sets of weights.
Findings: surprisingly small ranks suffice; adapting both Wq and Wv with r = 1-4 is already competitive, and increasing r further yields little additional gain, suggesting ΔW has a very low intrinsic rank.
LoRA: Low-Rank Adaptation of Large Language Models
To analyze how the learned subspaces relate across different ranks r, the authors compute a normalized similarity based on the Grassmann distance between the subspaces spanned by the top right-singular vectors (obtained via SVD) of the adapter matrices learned with r = 8 and r = 64; a sketch follows the scale below.
1 = complete overlap
0 = no overlap
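A sketch of this measure (following the paper's normalized form φ = ‖Ui Uj^T‖_F² / min(i, j); function and variable names are mine):

import torch

def subspace_similarity(A_r8: torch.Tensor, A_r64: torch.Tensor, i: int, j: int) -> float:
    # Compare the subspaces spanned by the top-i and top-j right singular
    # vectors of the two adapter matrices (rows of Vh from the SVD).
    _, _, Vh8 = torch.linalg.svd(A_r8, full_matrices=False)
    _, _, Vh64 = torch.linalg.svd(A_r64, full_matrices=False)
    Ui, Uj = Vh8[:i], Vh64[:j]
    # 1 = complete overlap, 0 = no overlap
    return (torch.linalg.matrix_norm(Ui @ Uj.T) ** 2 / min(i, j)).item()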
LoRA Alternatives
LoRA+
Approach: keep the LoRA parameterization, but update B with a larger learning rate than A (ηB = λ · ηA with λ > 1), which the authors argue yields more efficient feature learning; see the sketch after the reference.
Hayou, Ghosh, Yu (2024)
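A minimal sketch of that idea, assuming a PEFT-style model whose LoRA parameters are named "lora_A"/"lora_B" (the learning-rate values are illustrative):

import torch
from torch import nn

def build_loraplus_optimizer(model: nn.Module, lr_A: float = 1e-4, lam: float = 16.0):
    # Same parameterization as LoRA; only the optimizer changes:
    # B is updated with a learning rate lam times larger than A's.
    params_A = [p for n, p in model.named_parameters() if "lora_A" in n]
    params_B = [p for n, p in model.named_parameters() if "lora_B" in n]
    return torch.optim.AdamW([
        {"params": params_A, "lr": lr_A},
        {"params": params_B, "lr": lam * lr_A},  # ηB = λ · ηA, λ > 1
    ])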
LoRA+
Advantages: faster convergence and small but consistent quality gains over standard LoRA, at no extra compute cost
Limits: introduces an additional hyperparameter (the learning-rate ratio λ) that must be tuned
DoRA - Weight-Decomposed Low-Rank Adaptation
Approach: decompose each pretrained weight into a magnitude vector m and a direction V, fine-tune the direction with LoRA, and train m directly: W′ = m · (W0 + BA) / ‖W0 + BA‖_c (column-wise norm)
Liu, S.-Y. et al., NVIDIA (2024)
DoRA - deep dive
Understanding DoRA's effect on model training
What does it mean?
Positive correlation (LoRA): less expressive, because direction and magnitude updates are coupled
Negative correlation (FT, DoRA): more flexible, because the model can adapt weights in two independent ways, closer to what full fine-tuning does.
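A minimal PyTorch sketch of that decomposition (names are illustrative; ‖·‖_c denotes the column-wise norm):

import torch
import torch.nn as nn

class DoRALinear(nn.Module):
    def __init__(self, W0: torch.Tensor, r: int):
        super().__init__()
        # W0: frozen pretrained weight of shape (d_out, d_in)
        self.W0 = nn.Parameter(W0, requires_grad=False)
        self.A = nn.Parameter(torch.randn(r, W0.size(1)) * 0.02)
        self.B = nn.Parameter(torch.zeros(W0.size(0), r))
        self.m = nn.Parameter(W0.norm(dim=0, keepdim=True))  # magnitude, init from W0's column norms

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        V = self.W0 + self.B @ self.A                  # LoRA updates the direction only
        W = self.m * V / V.norm(dim=0, keepdim=True)   # renormalize columns, rescale by trained m
        return x @ W.T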
DoRA - Weight-Decomposed Low-Rank Adaptation
Advantages
qLoRA
Approach: quantize the frozen base model to 4-bit NF4 with double quantization and paged optimizers, then train LoRA adapters in higher precision (BF16) on top
Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023)
qLoRA
Advantages:
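A sketch of this recipe with HuggingFace transformers, bitsandbytes, and peft (the model id and LoRA hyperparameters are assumptions for illustration):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights
    bnb_4bit_quant_type="nf4",               # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute (and LoRA) in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
)
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"]))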
Considerations before fine-tuning
What to know:
Some cool optimizations
Code overview - LoRA
LoRA with HuggingFace - classification task
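Since the slide's code isn't reproduced here, the following is a hedged sketch of the first step with a RoBERTa encoder on SST-2 (the model and dataset choices are assumptions):

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 1. Load the base model and dataset
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda ex: tokenizer(ex["sentence"], truncation=True), batched=True)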
Code overview - LoRA
2. PEFT settings
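A sketch of the PEFT settings, continuing the example above (the r, alpha, and dropout values are illustrative):

from peft import LoraConfig, TaskType, get_peft_model

# 2. Wrap the base model with LoRA adapters on the attention projections
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # keeps the classification head trainable
    r=8,                                 # LoRA rank
    lora_alpha=16,                       # scaling (applied as alpha / r in peft)
    lora_dropout=0.1,
    target_modules=["query", "value"],   # RoBERTa's Wq and Wv
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()       # sanity check: well under 1% trainable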
Code overview - LoRA
3. Train the model
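A sketch of the training step with the standard HF Trainer, continuing the example above (hyperparameters are illustrative):

from transformers import Trainer, TrainingArguments

# 3. Fine-tune: only the LoRA matrices (and the head) receive gradients
args = TrainingArguments(
    output_dir="lora-sst2",
    learning_rate=2e-4,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()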
Code overview - from LoRA to DoRA
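In peft, switching the example above from LoRA to DoRA is a one-flag change in the same config:

from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    target_modules=["query", "value"],
    use_dora=True,                       # enable Weight-Decomposed LoRA
)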