LoRA: Low-Rank Adaptation of Large Language Models
The Natural Language Processing Reading Group Reviews:
Introduction
Applications in NLP rely on adapting a single large pre-trained language model to multiple downstream tasks (e.g., summarization, sentiment analysis). The standard approach is full fine-tuning, which updates all model parameters and stores a separate fine-tuned copy of the model for every task.
Observation: Over-parameterized models in fact reside on a low intrinsic dimension [1, 2]
Solution: LoRA - Train dense layers indirectly by optimizing rank decomposition matrices of each dense layer's change during adaptation (i.e., the weight difference matrix ∆W), while keeping the pre-trained LLM weights frozen (see the parameter-count sketch below the references)
[1] Li, C., Farkhoor, H., Liu, R., & Yosinski, J. (2018). Measuring the Intrinsic Dimension of Objective Landscapes. ICLR.
[2] Aghajanyan, A., Gupta, S., & Zettlemoyer, L. (2021). Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
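For a sense of how much the low-rank reparameterization saves, here is a back-of-the-envelope parameter count in Python; the 4096 x 4096 layer shape and r = 8 are illustrative assumptions, not values from the paper.

# Rough parameter count for adapting one dense layer of shape d x k with a
# rank-r update dW = B @ A, where B is d x r and A is r x k.
d, k, r = 4096, 4096, 8             # illustrative sizes, not from the paper

full_update_params = d * k           # training dW (or W) directly
lora_params = r * (d + k)            # training only B and A

print(f"full:  {full_update_params:,}")                 # 16,777,216
print(f"lora:  {lora_params:,}")                        # 65,536
print(f"ratio: {full_update_params // lora_params}x")   # 256x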
LoRA Advantages
Background - Prior Methods
Adapter Layer(s): introduce inference latency
Prefix Tuning: (1) difficult to optimize and (2) reduces the sequence length available for processing the downstream task
Background - Adapter Layers
Adapter Layers - Introduce inference latency
Inference latency of a single forward pass in GPT-2 medium measured in milliseconds, averaged over 100 trials. We use an NVIDIA Quadro RTX8000. “|Θ|” denotes the number of trainable parameters in adapter layers. AdapterL and AdapterH are two variants of adapter tuning. The inference latency introduced by adapter layers can be significant in an online, short-sequence-length scenario.
Background - Conditional LM Objective
Full fine-tuning maximizes the log probability of each target token given the context and the previous target tokens:

max_Φ Σ_{(x,y) ∈ Z} Σ_{t=1..|y|} log P_Φ(y_t | x, y_<t)

where x is the context (a sequence of tokens) and y is the target (a sequence of tokens). The fine-tuned weights decompose as Φ = Φ_0 + ∆Φ: the pretrained weights plus a task-specific weight change.

Problem: the weight change ∆Φ has the same dimension as the pretrained weights (|∆Φ| = |Φ_0|), so every downstream task requires storing and deploying its own full-sized copy of the model.
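As a rough illustration of how this objective is computed in practice, the sketch below masks the context positions out of a standard next-token cross-entropy loss; the shapes and the -100 ignore label are assumptions of this sketch, not details from the paper.

import torch
import torch.nn.functional as F

# Conditional LM loss: maximize log P(y_t | x, y_<t) over target tokens only.
# `logits` stands in for a causal LM's next-token logits over [x; y]; context
# positions are labeled -100 so they are ignored by the loss.
vocab_size, ctx_len, tgt_len = 100, 5, 3
logits = torch.randn(ctx_len + tgt_len, vocab_size)         # stand-in model output
labels = torch.randint(0, vocab_size, (ctx_len + tgt_len,))
labels[:ctx_len] = -100                                      # ignore the context x

# Shift so that position t predicts token t + 1, then average the NLL over targets.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss.item())   # minimizing this loss maximizes the conditional log-likelihood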
LoRA - How it Works
h = W_0x + ∆Wx = W_0x + BAx, where the pre-trained weight W_0 ∈ R^(d×k) is frozen and the update ∆W is represented by two small trainable matrices, B ∈ R^(d×r) and A ∈ R^(r×k), with rank r ≪ min(d, k). A is initialized with a random Gaussian and B with zeros, so ∆W = BA is zero at the start of training.
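A minimal sketch of this reparameterization as a PyTorch module (my own illustrative code, not the authors' reference implementation; the alpha/r scaling follows the paper):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update BA."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)       # stands in for W0
        self.base.weight.requires_grad_(False)                # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.02)    # random Gaussian init
        self.B = nn.Parameter(torch.zeros(d_out, r))          # zeros, so BA = 0 at start
        self.scaling = alpha / r                               # paper scales dW x by alpha/r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(768, 768, r=4)
h = layer(torch.randn(2, 768))       # behaves like a normal linear layer
print(h.shape)                       # torch.Size([2, 768])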
LoRA - Benefits
No Additional Inference Latency: BA can be merged into the weights (W ← W + BA) for deployment. To switch to another task, subtract the previous task's BA to recover the pre-trained weights and add the new task's B'A' (see the sketch below).
Reduces memory and storage use: for GPT-3 175B, the authors cut VRAM consumption during training from 1.2 TB to 350 GB. With r = 4 and only the self-attention query and value projection matrices adapted, the per-task checkpoint shrinks roughly 10,000x, from 350 GB to 35 MB (i.e., storing GPT-3 plus one adapted task costs 350 GB + 35 MB).
Generalization of Full Fine-Tuning: as the rank r increases toward the rank of the pre-trained weight matrices, training LoRA roughly converges to full fine-tuning.
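A small sketch of the merge/unmerge arithmetic behind the latency claim; the tensors below are random stand-ins, not actual model weights.

import torch

# Merging the update removes any extra inference cost (a single matmul with the
# merged weight), and switching tasks is a cheap add/subtract on the same matrix.
d, k, r = 768, 768, 4
W = torch.randn(d, k)                           # frozen pre-trained weight W0
B1, A1 = torch.randn(d, r), torch.randn(r, k)   # LoRA factors for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, k)   # LoRA factors for task 2

W_task1 = W + B1 @ A1            # deploy task 1: h = W_task1 x, no added latency
W_back  = W_task1 - B1 @ A1      # subtract B1 A1 to recover the pre-trained W0
W_task2 = W_back + B2 @ A2       # add the new task's B2 A2

print(torch.allclose(W_back, W, atol=1e-4))     # True: W0 recovered (up to fp error)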
Experiments - Baseline
Fine-Tuning (FT) - all model parameters receive gradient updates. FT^Top2 is a variant that adapts only the last two layers (used only for GPT-2).
Bias-Only (BitFit) - only the bias vectors are trained; all other parameters remain frozen.
Prefix-Embedding Tuning (PreEmbed) - Inserts special tokens among the input tokens. These special tokens have trainable word embeddings and are generally not in the model’s vocabulary. These tokens can be added to the start or end of the prompt.
Prefix-Layer Tuning (PreLayer) - an extension of prefix-embedding tuning. Instead of just learning the word embeddings for the special tokens, learn the activations after every Transformer layer.
Adapter Tuning - Inserts adapter layers into the Transformer model
LoRA - adds trainable pairs of rank decomposition matrices (B, A) in parallel to existing weight matrices. In these experiments, LoRA is applied only to the self-attention module (specifically the query and value projection matrices Wq and Wv); a small sketch of this configuration follows below.
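A small sketch of this configuration on a toy attention block (the module layout is illustrative, not GPT-3's actual implementation): everything is frozen and rank-r factors are attached only to the query and value projections.

import torch
import torch.nn as nn

class ToyAttentionProjections(nn.Module):
    """Stand-in for a self-attention block's four projection matrices."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)

def attach_lora(linear: nn.Linear, r: int = 4) -> nn.ParameterDict:
    """Trainable low-rank factors (A, B) for a frozen linear layer."""
    d_out, d_in = linear.weight.shape
    return nn.ParameterDict({
        "A": nn.Parameter(torch.randn(r, d_in) * 0.02),   # random Gaussian init
        "B": nn.Parameter(torch.zeros(d_out, r)),         # zero init: dW = 0 at start
    })

attn = ToyAttentionProjections()
for p in attn.parameters():
    p.requires_grad_(False)                # freeze all pre-trained weights

lora_q = attach_lora(attn.q_proj)          # adapt Wq ...
lora_v = attach_lora(attn.v_proj)          # ... and Wv only

frozen = sum(p.numel() for p in attn.parameters())
trainable = sum(p.numel() for lora in (lora_q, lora_v) for p in lora.values())
print(frozen, trainable)                   # ~2.36M frozen vs ~12K trainable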
Experiments - Natural Language Understanding
RoBERTa-base, RoBERTa-large, and DeBERTa-XXL with different adaptation methods on the GLUE benchmark. Confidence intervals are shown for experiments run by the authors.
* indicates numbers published in prior works.
† indicates runs configured in a setup similar to Houlsby et al. (2019) for a fair comparison.
Metrics reported (higher = better):
MNLI: overall (matched and mismatched) accuracy
CoLA: Matthews correlation
STS-B: Pearson correlation
Remaining tasks: accuracy
Experiments - Natural Language Generation (GPT-2)
GPT-2 medium (M) and large (L) with different adaptation methods on the E2E NLG Challenge. Confidence intervals are shown for experiments run by the authors.
* indicates numbers published in prior works (specifically Li & Liang (2021)).
Metrics reported (higher = better): BLEU, NIST, METEOR, ROUGE-L, CIDEr
Experiments - NLG (GPT-3)
Performance of different adaptation methods on GPT-3 175B. LoRA performs better than prior approaches, including full fine-tuning. The results on WikiSQL fluctuate around ±0.5%, MNLI-m around ±0.1%, and SAMSum around ±0.2/±0.2/±0.1 for the three metrics.
Metrics reported (higher = better):
WikiSQL: logical form validation accuracy
MultiNLI-matched: validation accuracy
SAMSum: ROUGE-1/2/L
Remaining tasks: accuracy
Experiments - NLG (GPT-3)
GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL and MNLI-matched. LoRA exhibits better scalability and task performance.
Understanding Low-Rank Adaptation
Future Work
Question 1
Given a parameter budget constraint, which subset of weight matrices in a pre-trained Transformer should we adapt to maximize downstream performance?
Adapting both Wq and Wv gives the best performance overall. The standard deviation across random seeds was consistent for a given dataset and is reported in the first column.
Metrics used (higher = better): Validation accuracy
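The parameter budget behind this ablation can be sanity-checked with simple arithmetic; the sketch assumes the roughly 18M-parameter budget used in the paper, with GPT-3 175B's published dimensions (d_model = 12288, 96 layers).

# Same ~18M trainable-parameter budget, spent on different subsets of the
# attention weight matrices (each adapted d x d matrix adds 2 * r * d parameters).
d_model, n_layers = 12288, 96     # GPT-3 175B dimensions

def lora_param_count(num_weight_types: int, r: int) -> int:
    return n_layers * num_weight_types * 2 * r * d_model

print(lora_param_count(1, 8))     # 18,874,368  e.g. Wq only, r = 8
print(lora_param_count(2, 4))     # 18,874,368  e.g. Wq and Wv, r = 4
print(lora_param_count(4, 2))     # 18,874,368  Wq, Wk, Wv, Wo, r = 2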
Question 3
What is the connection between ∆W and W? Does ∆W highly correlate with W? How large is ∆W compared to W?
The Frobenius norm of U^T Wq V^T, where U and V contain the top-r left/right singular-vector directions of either (1) ∆Wq, (2) Wq, or (3) a random matrix. The weight matrices are taken from the 48th layer of GPT-3.
U^T Wq V^T projects Wq onto the r-dimensional subspace spanned by the top singular directions of ∆Wq (or of Wq, or of the random matrix).
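A hedged sketch of how this projection norm can be computed; random matrices stand in for the actual GPT-3 layer-48 weights analyzed in the paper.

import torch

# Project Wq onto the top-r singular directions of dWq and measure how much of
# Wq's Frobenius "mass" falls inside that subspace.
torch.manual_seed(0)
d, r = 512, 4
Wq = torch.randn(d, d)                          # stand-in for the pre-trained Wq
dWq = torch.randn(d, r) @ torch.randn(r, d)     # stand-in low-rank update

U, S, Vh = torch.linalg.svd(dWq)                # dWq = U diag(S) Vh
Ur, Vhr = U[:, :r], Vh[:r, :]                   # top-r left / right directions

proj = Ur.T @ Wq @ Vhr.T                        # the r x r matrix U^T Wq V^T
print(torch.linalg.norm(proj).item())           # ||U^T Wq V^T||_F
print(torch.linalg.norm(Wq).item())             # compare against ||Wq||_F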
Question 2
Is the “optimal” adaptation matrix ∆W really rank deficient? If so, what is a good rank to use in practice?
To the authors' surprise, a rank as small as 1 suffices for adapting both Wq and Wv on these datasets, while adapting Wq alone needs a larger r.
Metrics used (higher = better): Validation accuracy
Question 2
Subspace similarity between the column vectors of A_{r=8} and A_{r=64} for both ∆Wq and ∆Wv. The third and fourth figures zoom in on the lower-left triangle of the first two. The top directions in r = 8 are included in r = 64, and vice versa.
Subspace similarity between different r
Recall the normalized subspace similarity, based on the Grassmann distance: φ(A1, A2, i, j) = ||(U_A1^i)^T U_A2^j||_F^2 / min(i, j) ∈ [0, 1], where U_A^i contains the top-i singular-vector directions of A. A value of 0 means no overlap between the subspaces; 1 means complete overlap.
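A minimal sketch of this similarity measure; the random matrices below stand in for the learned A_{r=8} and A_{r=64}, whose columns are assumed to span the adaptation subspaces.

import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Normalized subspace similarity phi(A1, A2, i, j) in [0, 1]."""
    U1 = torch.linalg.svd(A1, full_matrices=False).U[:, :i]   # top-i singular directions
    U2 = torch.linalg.svd(A2, full_matrices=False).U[:, :j]   # top-j singular directions
    return (torch.linalg.norm(U1.T @ U2) ** 2 / min(i, j)).item()

torch.manual_seed(0)
A8, A64 = torch.randn(768, 8), torch.randn(768, 64)   # stand-ins for trained A matrices
print(subspace_similarity(A8, A64, 4, 4))   # near 0: random subspaces barely overlap
print(subspace_similarity(A8, A8, 8, 8))    # ~1.0: a subspace fully overlaps itself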
Question 2
Left and Middle: normalized subspace similarity between the column vectors of A_{r=64} from two random seeds, for both ∆Wq and ∆Wv in the 48th layer. Right: the same heat map between the column vectors of two random Gaussian matrices.
Subspace similarity between different random seeds
Recall: the normalized subspace similarity φ (based on the Grassmann distance) ranges from 0 (no subspace overlap) to 1 (complete overlap).
Background - Prefix Tuning
For each downstream task, keep the pre-trained model frozen and learn a small sequence of continuous task-specific vectors (a "prefix") that is prepended to the input and attended to like virtual tokens (a simplified sketch follows the citation below)
Li, X. L., & Liang, P. (2021). Prefix-Tuning: Optimizing Continuous Prompts for Generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing.
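A simplified sketch of the prefix idea (this shows only the prefix-embedding flavor; full prefix tuning also prepends trainable activations at every Transformer layer):

import torch
import torch.nn as nn

# The pre-trained model stays frozen; only a short block of continuous "virtual
# token" embeddings, prepended to the input, is trained for each task.
class PrefixEmbedding(nn.Module):
    def __init__(self, prefix_len: int = 10, d_model: int = 768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model) from the frozen embedding layer
        batch = token_embeddings.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        # Note: the prefix consumes part of the usable sequence length.
        return torch.cat([prefix, token_embeddings], dim=1)

embeds = torch.randn(2, 16, 768)           # stand-in for frozen token embeddings
print(PrefixEmbedding()(embeds).shape)     # torch.Size([2, 26, 768])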
Background - Conditional LM Objective
The model P_Φ(y | x) is trained on a dataset of context-target pairs Z = {(x_i, y_i)}, i = 1, .., N, where each x_i is the context (a sequence of tokens) and each y_i is the target (a sequence of tokens).
Example (NL2SQL): x_i is a natural-language query and y_i is the corresponding SQL command.
Example (summarization): x_i is the content of an article and y_i is its summary.
Experiments - NLG (GPT-2)