LoRA: Low-Rank Adaptation of Large Language Models
The Natural Language Processing Reading Group Reviews:
Introduction
Applications in NLP rely on adapting a single large pre-trained language model to multiple downstream tasks (e.g., summarization, sentiment analysis). The standard approach is full fine-tuning, which updates all model parameters and stores a separate fine-tuned copy of the model for every task.
Observation: Over-parameterized models in fact reside on a low intrinsic dimension [1, 2]
Solution: LoRA - Train dense layers indirectly by optimizing rank decomposition matrices of each dense layer's change during adaptation (i.e., the weight difference matrix ∆W), while keeping the pre-trained LLM weights frozen (see the parameter-count sketch below the references)
[1] Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018). Measuring the Intrinsic Dimension of Objective Landscapes. ICLR.
[2] Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
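For a sense of how much the low-rank reparameterization saves, here is a back-of-the-envelope parameter count in Python; the 4096 x 4096 layer shape and r = 8 are illustrative assumptions, not values from the paper.

# Rough parameter count for adapting one dense layer of shape d x k with a
# rank-r update dW = B @ A, where B is d x r and A is r x k.
d, k, r = 4096, 4096, 8             # illustrative sizes, not from the paper

full_update_params = d * k           # training dW (or W) directly
lora_params = r * (d + k)            # training only B and A

print(f"full:  {full_update_params:,}")                 # 16,777,216
print(f"lora:  {lora_params:,}")                        # 65,536
print(f"ratio: {full_update_params // lora_params}x")   # 256x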
LoRA Advantages
Background - Prior Methods
Adapter Layer(s): introduce inference latency
Prefix Tuning: (1) difficult to optimize and (2) reduces the sequence length available for processing the downstream task
Background - Adapter Layers
Adapter Layers - Introduce inference latency
Inference latency of a single forward pass in GPT-2 medium measured in milliseconds, averaged over 100 trials. We use an NVIDIA Quadro RTX8000. “|Θ|” denotes the number of trainable parameters in adapter layers. AdapterL and AdapterH are two variants of adapter tuning. The inference latency introduced by adapter layers can be significant in an online, short-sequence-length scenario.
Background - Conditional LM Objective
Full fine-tuning maximizes the log probability of each target token given the context and the previous target tokens:

max_Φ Σ_{(x,y) ∈ Z} Σ_{t=1..|y|} log P_Φ(y_t | x, y_<t)

where x is the context (a sequence of tokens) and y is the target (a sequence of tokens). The fine-tuned weights decompose as Φ = Φ_0 + ∆Φ: the pretrained weights plus a task-specific weight change.

Problem: the weight change ∆Φ has the same dimension as the pretrained weights (|∆Φ| = |Φ_0|), so every downstream task requires storing and deploying its own full-sized copy of the model.
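As a rough illustration of how this objective is computed in practice, the sketch below masks the context positions out of a standard next-token cross-entropy loss; the shapes and the -100 ignore label are assumptions of this sketch, not details from the paper.

import torch
import torch.nn.functional as F

# Conditional LM loss: maximize log P(y_t | x, y_<t) over target tokens only.
# `logits` stands in for a causal LM's next-token logits over [x; y]; context
# positions are labeled -100 so they are ignored by the loss.
vocab_size, ctx_len, tgt_len = 100, 5, 3
logits = torch.randn(ctx_len + tgt_len, vocab_size)         # stand-in model output
labels = torch.randint(0, vocab_size, (ctx_len + tgt_len,))
labels[:ctx_len] = -100                                      # ignore the context x

# Shift so that position t predicts token t + 1, then average the NLL over targets.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss.item())   # minimizing this loss maximizes the conditional log-likelihood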
LoRA - How it Works
h = W_0x + ∆Wx = W_0x + BAx, where the pre-trained weight W_0 ∈ R^(d×k) is frozen and the update ∆W is represented by two small trainable matrices, B ∈ R^(d×r) and A ∈ R^(r×k), with rank r ≪ min(d, k). A is initialized with a random Gaussian and B with zeros, so ∆W = BA is zero at the start of training.
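A minimal sketch of this reparameterization as a PyTorch module (my own illustrative code, not the authors' reference implementation; the alpha/r scaling follows the paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update BA."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)       # stands in for W0
        self.base.weight.requires_grad_(False)                # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)    # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zeros, so BA = 0 at start
        self.scaling = alpha / r                               # paper scales dW x by alpha/r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=4)
h = layer(torch.randn(2, 768))       # behaves like a normal linear layer
print(h.shape)                       # torch.Size([2, 768])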
LoRA - Benefits
No Additional Inference Latency: BA can be merged into the weights (W ← W + BA) for deployment. To switch to another task, subtract the previous task's BA to recover the pre-trained weights and add the new task's B'A' (see the sketch below).
Reduces memory and storage use: for GPT-3 175B, the authors cut VRAM consumption during training from 1.2 TB to 350 GB. With r = 4 and only the self-attention query and value projection matrices adapted, the per-task checkpoint shrinks roughly 10,000x, from 350 GB to 35 MB (i.e., storing GPT-3 plus one adapted task costs 350 GB + 35 MB).
Generalization of Full Fine-Tuning: as the rank r increases toward the rank of the pre-trained weight matrices, training LoRA roughly converges to full fine-tuning.
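A small sketch of the merge/unmerge arithmetic behind the latency claim; the tensors below are random stand-ins, not actual model weights.

import torch

# Merging the update removes any extra inference cost (a single matmul with the
# merged weight), and switching tasks is a cheap add/subtract on the same matrix.
d, k, r = 768, 768, 4
W = torch.randn(d, k)                           # frozen pre-trained weight W0
B1, A1 = torch.randn(d, r), torch.randn(r, k)   # LoRA factors for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, k)   # LoRA factors for task 2

W_task1 = W + B1 @ A1            # deploy task 1: h = W_task1 x, no added latency
W_back  = W_task1 - B1 @ A1      # subtract B1 A1 to recover the pre-trained W0
W_task2 = W_back + B2 @ A2       # add the new task's B2 A2

print(torch.allclose(W_back, W, atol=1e-4))     # True: W0 recovered (up to fp error)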
Experiments - Baseline
Fine-Tuning (FT) - all model parameters receive gradient updates. FT^Top2 is a variant that adapts only the last two layers (used only for GPT-2).
Bias-Only (BitFit) - only the bias vectors are trained; all other parameters remain frozen.
Prefix-Embedding Tuning (PreEmbed) - Inserts special tokens among the input tokens. These special tokens have trainable word embeddings and are generally not in the model’s vocabulary. These tokens can be added to the start or end of the prompt.
Prefix-Layer Tuning (PreLayer) - an extension of prefix-embedding tuning. Instead of just learning the word embeddings for the special tokens, learn the activations after every Transformer layer.
Adapter Tuning - Inserts adapter layers into the Transformer model
LoRA - adds trainable pairs of rank decomposition matrices (B, A) in parallel to existing weight matrices. In these experiments, LoRA is applied only to the self-attention module (specifically the query and value projection matrices Wq and Wv); a small sketch of this configuration follows below.
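A small sketch of this configuration on a toy attention block (the module layout is illustrative, not GPT-3's actual implementation): everything is frozen and rank-r factors are attached only to the query and value projections.

import torch
import torch.nn as nn

class ToyAttentionProjections(nn.Module):
    """Stand-in for a self-attention block's four projection matrices."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

def attach_lora(linear: nn.Linear, r: int = 4) -> nn.ParameterDict:
    """Trainable low-rank factors (A, B) for a frozen linear layer."""
    d_out, d_in = linear.weight.shape
    return nn.ParameterDict({
        "A": nn.Parameter(torch.randn(r, d_in) * 0.02),   # random Gaussian init
        "B": nn.Parameter(torch.zeros(d_out, r)),         # zero init: dW = 0 at start
    })

attn = ToyAttentionProjections()
for p in attn.parameters():
    p.requires_grad_(False)                # freeze all pre-trained weights

lora_q = attach_lora(attn.q_proj)          # adapt Wq ...
lora_v = attach_lora(attn.v_proj)          # ... and Wv only

frozen = sum(p.numel() for p in attn.parameters())
trainable = sum(p.numel() for lora in (lora_q, lora_v) for p in lora.values())
print(frozen, trainable)                   # ~2.36M frozen vs ~12K trainable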
Experiments - Natural Language Understanding
RoBERTa-base, RoBERTa-large, and DeBERTa-XXL with different adaptation methods on the GLUE benchmark. Confidence intervals are shown for experiments run by the authors.
* indicates numbers published in prior works.
† indicates runs configured in a setup similar to Houlsby et al. (2019) for a fair comparison.
Metrics reported (higher = better):
MNLI: overall (matched and mismatched) accuracy
CoLA: Matthews correlation
STS-B: Pearson correlation
Remaining tasks: accuracy
Experiments - Natural Language Generation (GPT-2)
GPT-2 medium (M) and large (L) with different adaptation methods on the E2E NLG Challenge. Confidence intervals are shown for experiments run by the authors.
* indicates numbers published in prior works (specifically Li & Liang (2021)).
Metrics reported (higher = better): BLEU, NIST, METEOR, ROUGE-L, CIDEr
Experiments - NLG (GPT-3)
Performance of different adaptation methods on GPT-3 175B. LoRA performs better than prior approaches, including full fine-tuning. The results on WikiSQL fluctuate around ±0.5%, MNLI-m around ±0.1%, and SAMSum around ±0.2/±0.2/±0.1 for the three metrics.
Metrics reported (higher = better):
WikiSQL: logical form validation accuracy
MultiNLI-matched: validation accuracy
SAMSum: ROUGE-1/2/L
Remaining tasks: accuracy
Experiments - NLG (GPT-3)
GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL and MNLI-matched. LoRA exhibits better scalability and task performance.
Understanding Low-Rank Adaptation
Future Work
Question 1
Given a parameter budget constraint, which subset of weight matrices in a pre-trained Transformer should we adapt to maximize downstream performance?
Adapting both Wq and Wv gives the best performance overall. The standard deviation across random seeds was consistent for a given dataset and is reported in the first column.
Metrics used (higher = better): Validation accuracy
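The parameter budget behind this ablation can be sanity-checked with simple arithmetic; the sketch assumes the roughly 18M-parameter budget used in the paper, with GPT-3 175B's published dimensions (d_model = 12288, 96 layers).

# Same ~18M trainable-parameter budget, spent on different subsets of the
# attention weight matrices (each adapted d x d matrix adds 2 * r * d parameters).
d_model, n_layers = 12288, 96     # GPT-3 175B dimensions

def lora_param_count(num_weight_types: int, r: int) -> int:
    return n_layers * num_weight_types * 2 * r * d_model

print(lora_param_count(1, 8))     # 18,874,368  e.g. Wq only, r = 8
print(lora_param_count(2, 4))     # 18,874,368  e.g. Wq and Wv, r = 4
print(lora_param_count(4, 2))     # 18,874,368  Wq, Wk, Wv, Wo, r = 2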
Question 3
What is the connection between ∆W and W? Does ∆W highly correlate with W? How large is ∆W compared to W?
The Frobenius norm of U^T Wq V^T, where U and V contain the top-r left/right singular-vector directions of either (1) ∆Wq, (2) Wq, or (3) a random matrix. The weight matrices are taken from the 48th layer of GPT-3.
U^T Wq V^T projects Wq onto the r-dimensional subspace spanned by the top singular directions of ∆Wq (or of Wq, or of the random matrix).
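A hedged sketch of how this projection norm can be computed; random matrices stand in for the actual GPT-3 layer-48 weights analyzed in the paper.

import torch

# Project Wq onto the top-r singular directions of dWq and measure how much of
# Wq's Frobenius "mass" falls inside that subspace.
torch.manual_seed(0)
d, r = 512, 4
Wq = torch.randn(d, d)                          # stand-in for the pre-trained Wq
dWq = torch.randn(d, r) @ torch.randn(r, d)     # stand-in low-rank update

U, S, Vh = torch.linalg.svd(dWq)                # dWq = U diag(S) Vh
Ur, Vhr = U[:, :r], Vh[:r, :]                   # top-r left / right directions

proj = Ur.T @ Wq @ Vhr.T                        # the r x r matrix U^T Wq V^T
print(torch.linalg.norm(proj).item())           # ||U^T Wq V^T||_F
print(torch.linalg.norm(Wq).item())             # compare against ||Wq||_F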
Question 2
Is the “optimal” adaptation matrix ∆W really rank deficient? If so, what is a good rank to use in practice?
To the authors' surprise, a rank as small as 1 suffices for adapting both Wq and Wv on these datasets, while adapting Wq alone needs a larger r.
Metrics used (higher = better): Validation accuracy
Question 2
Subspace similarity between the column vectors of A_{r=8} and A_{r=64} for both ∆Wq and ∆Wv. The third and fourth figures zoom in on the lower-left triangle of the first two. The top directions in r = 8 are included in r = 64, and vice versa.
Subspace similarity between different r
Recall the normalized subspace similarity, based on the Grassmann distance: φ(A1, A2, i, j) = ||(U_A1^i)^T U_A2^j||_F^2 / min(i, j) ∈ [0, 1], where U_A^i contains the top-i singular-vector directions of A. A value of 0 means no overlap between the subspaces; 1 means complete overlap.
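A minimal sketch of this similarity measure; the random matrices below stand in for the learned A_{r=8} and A_{r=64}, whose columns are assumed to span the adaptation subspaces.

import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Normalized subspace similarity phi(A1, A2, i, j) in [0, 1]."""
    U1 = torch.linalg.svd(A1, full_matrices=False).U[:, :i]   # top-i singular directions
    U2 = torch.linalg.svd(A2, full_matrices=False).U[:, :j]   # top-j singular directions
    return (torch.linalg.norm(U1.T @ U2) ** 2 / min(i, j)).item()

torch.manual_seed(0)
A8, A64 = torch.randn(768, 8), torch.randn(768, 64)   # stand-ins for trained A matrices
print(subspace_similarity(A8, A64, 4, 4))   # near 0: random subspaces barely overlap
print(subspace_similarity(A8, A8, 8, 8))    # ~1.0: a subspace fully overlaps itself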
Question 2
Left and Middle: normalized subspace similarity between the column vectors of A_{r=64} from two random seeds, for both ∆Wq and ∆Wv in the 48th layer. Right: the same heat map between the column vectors of two random Gaussian matrices.
Subspace similarity between different random seeds
Recall: the normalized subspace similarity φ (based on the Grassmann distance) ranges from 0 (no subspace overlap) to 1 (complete overlap).
Background - Prefix Tuning
For each downstream task, keep the pre-trained model frozen and learn a small sequence of continuous task-specific vectors (a "prefix") that is prepended to the input and attended to like virtual tokens (a simplified sketch follows the citation below)
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
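A simplified sketch of the prefix idea (this shows only the prefix-embedding flavor; full prefix tuning also prepends trainable activations at every Transformer layer):

import torch
import torch.nn as nn

# The pre-trained model stays frozen; only a short block of continuous "virtual
# token" embeddings, prepended to the input, is trained for each task.
class PrefixEmbedding(nn.Module):
    def __init__(self, prefix_len: int = 10, d_model: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model) from the frozen embedding layer
        batch = token_embeddings.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Note: the prefix consumes part of the usable sequence length.
        return torch.cat([prefix, token_embeddings], dim=1)

embeds = torch.randn(2, 16, 768)           # stand-in for frozen token embeddings
print(PrefixEmbedding()(embeds).shape)     # torch.Size([2, 26, 768])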
Background - Conditional LM Objective
The model P_Φ(y | x) is trained on a dataset of context-target pairs Z = {(x_i, y_i)}, i = 1, .., N, where each x_i is the context (a sequence of tokens) and each y_i is the target (a sequence of tokens).
Example (NL2SQL): x_i is a natural-language query and y_i is the corresponding SQL command.
Example (summarization): x_i is the content of an article and y_i is its summary.
Experiments - NLG (GPT-2)