1 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP Models

Sebastian Ruder, Jonas Pfeiffer, Ivan Vulić

EMNLP 2022, December 8, 2022

2 of 227

State-of-the-art NLP Models Are Getting Ever Larger

Evolution of the size of large pre-trained models [Treviso et al., 2022]

3 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP

Follow along with the tutorial:

Questions:

  • RocketChat #tutorial-5
  • Ask us during the break or after the tutorial

Watch out for:

  • A survey on modular deep learning (soon to be published)

4 of 227

Large Models Develop Increasing NLU Capabilities

5 of 227

Transfer Learning in the Era of Large Models

  • With increasing model size, fine-tuning becomes increasingly expensive
  • The standard transfer learning formula breaks down

[Figure: timeline of pretraining approaches (word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, T5), each followed by fine-tuning on downstream tasks such as classification, sequence labeling, and question answering]

6 of 227

Transfer Learning in the Era of Large Models

  • In-context learning has mostly replaced fine-tuning for large models

[Figure: large pre-trained models (BERT, BART, ERNIE, GPT-3, PaLM) used directly via in-context learning, without fine-tuning]

7 of 227

Prompt-based Learning has Taken NLP by Storm

8 of 227

Downsides of Prompt-based Learning

  1. Inefficiency: The prompt needs to be processed every time the model makes a prediction.
  2. Poor performance: Prompting generally performs worse than fine-tuning [Brown et al., 2020].
  3. Sensitivity to the wording of the prompt [Webson & Pavlick, 2022], order of examples [Zhao et al., 2021; Lu et al., 2022], etc.
  4. Lack of clarity regarding what the model learns from the prompt. Even random labels work [Min et al., 2022]!

9 of 227

From Fine-tuning to Parameter-efficient Fine-tuning

[Figure: pretraining of large models (BERT, BART, ERNIE, GPT-3, PaLM) followed by parameter-efficient fine-tuning on downstream tasks such as classification, sequence labeling, and question answering]

10 of 227

From Fine-tuning to Parameter-efficient Fine-tuning

Full Fine-tuning: Update all model parameters

Parameter-efficient Fine-tuning: Update a small subset of model parameters

11 of 227

Haven’t We Seen This Before?

  • Updating the last layer was common in computer vision [Donahue et al., 2014]. In NLP, people experimented with static and non-static word embeddings [Kim, 2014].
  • ELMo did not fine-tune contextualized word embeddings [Peters et al., 2018].

→ Fine-tuning all representations performed generally better in practice

Why go back to fine-tuning only some parameters?

  1. Fine-tuning all parameters is impractical with large models
  2. State-of-the-art models are massively over-parameterized → Parameter-efficient fine-tuning matches the performance of full fine-tuning
  3. Modular and compositional representations

12 of 227

Modularity and Compositionality?

[Illustration: modules encapsulating Skill/Task 1 and Skill/Task 2 can be composed to address a new Skill/Task 3]


21 of 227

Why Modularity?

1. Models are increasing in size
→ Parameter-efficient fine-tuning strategies

2. Unseen scenarios
→ Out-of-distribution generalization through module composition

3. Catastrophic interference
→ Modularity as inductive bias

4. Efficient updating of models
→ Through added components

[Illustrations: composing a Task/Skill module with language modules for English and Swahili; updating a model's factual knowledge ("Who is the President of USA?": Obama → Biden) via an added component]

22 of 227

Relevant Tutorials

EMNLP 2020

Efficient methods (knowledge distillation, quantization, pruning): Slides

ACL 2022

In-context learning, fine-tuning, meta-training: Underline

23 of 227

What This Tutorial Is About and Is Not About

  • Goal: Provide an overview of parameter-efficient fine-tuning methods
  • Highlight the benefits and usage scenarios of modularity

  • What this is not: Comprehensive (it’s impossible to cover all related papers in a tutorial)
  • What we do not cover:
    • Text prompting approaches; see [Liu et al., 2021]
    • Efficient NLP methods in general; see [Treviso et al., 2022]

24 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

25 of 227

Notation

Let a neural network f be decomposed into a composition of functions f₁ ∘ f₂ ∘ … ∘ f_l. Each f_i has parameters θ_i.

A module with parameters ϕ can modify the i-th subfunction as follows:

  1. Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x), where ⊕ is interpolation, e.g., element-wise addition
  2. Input composition: f_i′(x) = f_{θ_i}([ϕ, x]), where [·, ·] is concatenation
  3. Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

In practice, typically only the module parameters ϕ are updated while θ_i is fixed.

26 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

27 of 227

Parameter Composition

  1. Sparse Subnetworks��
  2. Structured Composition��
  3. Low-rank Composition

28 of 227

Sparse Subnetworks

  • A common inductive bias on the module parameters is sparsity
  • Most common sparsity method: pruning
  • Pruning can be seen as applying a binary mask that selectively keeps or removes each connection in a model and produces a subnetwork.�
  • Most common pruning criterion: weight magnitude [Han et al., 2017]

29 of 227

Pruning

  • During pruning, a fraction of the lowest-magnitude weights are removed
  • The non-pruned weights are re-trained
  • Pruning for multiple iterations is more common [Frankle & Carbin, 2019]

[Diagram: one-shot pruning (initial training → pruning → re-training) vs iterative pruning (repeated pruning → re-training cycles)]
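The pruning loop above can be sketched in a few lines of NumPy. This is a minimal, illustrative sketch: `retrain` is a hypothetical placeholder for the re-training step, and the sparsity schedule is a simple linear anneal, not the exact schedule of any cited paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Binary mask keeping the (1 - sparsity) highest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > threshold).astype(weights.dtype)

def iterative_prune(weights, target_sparsity, steps, retrain=lambda w: w):
    """Alternate pruning and (placeholder) re-training, annealing sparsity."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps  # anneal toward the target
        mask = magnitude_prune(weights * mask, sparsity) * mask
        weights = retrain(weights * mask)          # stand-in for re-training
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
pruned, mask = iterative_prune(w, target_sparsity=0.9, steps=3)
```

With `steps=1` this reduces to one-shot pruning; more steps give the iterative variant that tends to work better in practice.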

30 of 227

Another Perspective on Pruning

  • We can also view pruning as adding a task-specific vector ϕ to the parameters of an existing model: θ′ = θ + ϕ, where ϕ_i = 0 if b_i = 0
  • If the final model should be sparse, we can multiply the existing weights with the binary mask b to set the pruned weights to 0: θ′ = (θ + ϕ) ⊙ b, where ⊙ is the element-wise (Hadamard) product
  • The main benefit is that these weight values were moving to 0 anyway [Zhou et al., 2019]

31 of 227

The Lottery Ticket Hypothesis

  • Dense, randomly-initialized models contain subnetworks (“winning tickets”) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations [Frankle & Carbin, 2019]
  • Has also been verified in RL and NLP [Yu et al., 2020] and for larger models in computer vision [Frankle et al., 2020]

32 of 227

The Lottery Ticket Hypothesis in Pre-trained Models

  • Prior work [Chen et al., 2020; Prasanna et al., 2020] has found winning tickets in pre-trained models such as BERT
  • Sparsity ratios: from 40% (SQuAD) to 90% (QQP and WNLI)
  • Pre-trained init >> random init
  • Even pre-trained, random subnetworks perform well [Prasanna et al., 2020]
  • Subnetworks trained on a general task (MLM) transfer best
  • At the right compression rate, we can even find super tickets that outperform the full model [Liang et al., 2021]

33 of 227

Beyond Lottery Tickets: Supermasks

  • The number of possible subnetworks grows combinatorially with the number of model parameters

→ There are masks—supermasks—that achieve non-random performance even for randomly initialized, fixed models [Zhou et al., 2019]

  • In this case, the module parameters consist only of the binary mask: ϕ = b ∈ {0, 1}^|θ|

[Diagram: initial training → pruning → re-initialization of the remaining weights]

34 of 227

Beyond Lottery Tickets: Supermasks

  • A fixed model can accommodate a potentially unlimited number of task-specific binary masks [Wortsman et al., 2020]

[Illustration: separate supermasks for Tasks A, B, and C applied to one fixed model]
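The idea of one fixed model serving many tasks via per-task masks can be sketched as follows; this is an illustrative NumPy toy (random masks instead of learned ones), not the training procedure of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 8))          # fixed (even randomly initialized) base weights

# One binary supermask per task; the base model is never updated.
masks = {task: (rng.random(theta.shape) > 0.5).astype(theta.dtype)
         for task in ("A", "B", "C")}

def forward(x, task):
    # Apply the task-specific subnetwork: theta ⊙ b
    return x @ (theta * masks[task])

x = rng.normal(size=(1, 8))
outs = {task: forward(x, task) for task in masks}
```

Storage per task is just one bit per weight, which is why a single frozen model can host many task-specific subnetworks.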

35 of 227

Supermasks in Pre-trained Models

  • For back-propagation, the gradient with respect to the binarized mask can be used as a noisy estimate of the gradient with respect to the underlying real-valued mask (also known as the straight-through estimator [Bengio et al., 2013])
  • Such supermasks have similarly been found useful in pre-trained models like BERT [Zhao et al., 2020]
  • Similar to earlier work [Mallya et al., 2018], they learn a real-valued mask m that is then binarized via a thresholding function: b_i = 1 if m_i ≥ τ, else 0
  • The embedding layer is not masked

36 of 227

Pruning Pre-trained Models

  • Pruning does not consider how weights change during fine-tuning
  • Magnitude pruning: keep weights farthest from 0
  • Movement pruning [Sanh et al., 2020]: keep weights that move the most away from 0

Fine-tuned weights stay close to their pre-trained values. Magnitude pruning (left) selects weights that are far from 0; movement pruning (right) selects weights that move away from 0.
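The difference between the two selection criteria can be sketched with NumPy. This is an illustrative simplification: movement pruning actually accumulates first-order scores during training; here we approximate "movement away from 0" by the change in magnitude between pre-trained and fine-tuned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=1000)                    # pre-trained weights
theta_ft = theta_pre + 0.1 * rng.normal(size=1000)   # fine-tuned weights (stay close)

k = 100  # number of weights to keep

# Magnitude pruning: keep fine-tuned weights farthest from 0.
mag_keep = np.argsort(-np.abs(theta_ft))[:k]

# Movement pruning (rough proxy): keep weights whose magnitude
# increased the most during fine-tuning, i.e. that moved away from 0.
movement = np.abs(theta_ft) - np.abs(theta_pre)
mov_keep = np.argsort(-movement)[:k]
```

Because fine-tuned weights stay close to their pre-trained values, the two criteria select largely different sets of weights.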

37 of 227

Diff Pruning

  • For a model with θ′ = θ + ϕ, we can perform pruning based only on the magnitude of the module parameters ϕ rather than the updated parameters θ + ϕ
  • Diff pruning [Guo et al., 2021] prunes the module parameters ϕ via magnitude pruning to make them sparse

38 of 227

Zeroing Pre-trained Weights vs Keeping Them

  • In practice, should we set pruned weights to 0 or leave them at their pre-trained values? I.e., θ′ = θ + ϕ or θ′ = (θ + ϕ) ⊙ b, where ϕ_i = 0 if b_i = 0
  • Setting weights to 0 performs better for randomly initialized models [Zhou et al., 2019]
  • However, zeroing weights removes some of the modular properties for pre-trained models

39 of 227

Lottery Ticket Sparse Fine-tuning

  • Lottery Ticket Sparse Fine-tuning [Ansell et al., 2022] learns sparse subnetworks based on magnitude pruning of the module parameters ϕ
  • Keeping the pre-trained weights allows combining subnetworks for different settings: θ′ = θ + ϕ_task + ϕ_lang
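The additive combination of sparse fine-tunings can be sketched as below. This is an illustrative NumPy toy: the sparse vectors are drawn at random as stand-ins for sparse fine-tunings actually learned for a task and a language.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)            # shared pre-trained weights

def sparse_diff(size, frac, rng):
    """A sparse update vector: non-zero entries on a small fraction of weights."""
    phi = np.zeros(size)
    idx = rng.choice(size, size=int(size * frac), replace=False)
    phi[idx] = 0.1 * rng.normal(size=idx.size)
    return phi

phi_task = sparse_diff(theta.size, 0.05, rng)   # e.g. learned for NER
phi_lang = sparse_diff(theta.size, 0.05, rng)   # e.g. learned for Swahili

# Composition: apply both sparse updates to the same pre-trained model.
theta_combined = theta + phi_task + phi_lang
```

Because the pre-trained weights are kept (not zeroed), the two sparse modules can simply be summed onto the shared base model.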

40 of 227

Parameter Composition

  • Sparse Subnetworks��
  • Structured Composition��
  • Low-rank Composition

41 of 227

Structured Composition

  • We can additionally impose a structure on the weights that we select
  • Specifically, we only modify the weights that are associated with a pre-defined group G

42 of 227

Group-based Fine-tuning

  • Most common setting: each group corresponds to a layer; only update the parameters associated with certain layers
  • Groups can also relate to more fine-grained components

43 of 227

Bias-only Fine-tuning

  • A practical choice: updating only the biases
  • Computing bias gradients does not require storing intermediate activations [Cai et al., 2020]!
  • In NLP, BitFit [Ben-Zaken et al., 2022] implements the same approach
  • Query and second MLP layer biases are most important!
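Bias-only fine-tuning amounts to marking only the bias parameters as trainable. A minimal sketch (toy parameter dictionary; names like `W1`/`b1` are illustrative, not from any specific library):

```python
import numpy as np

# Toy two-layer network parameters.
params = {
    "W1": np.zeros((4, 4)), "b1": np.zeros(4),
    "W2": np.zeros((4, 2)), "b2": np.zeros(2),
}

# BitFit-style selection: only bias terms are trainable.
trainable = {name for name in params if name.startswith("b")}

num_total = sum(p.size for p in params.values())
num_train = sum(params[name].size for name in trainable)
```

Here only 6 of 30 parameters would be updated; for real Transformers the bias fraction is far smaller still.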

44 of 227

Structured Sparsity

  • Structure can also be combined with sparse methods
  • We can prune entire groups such as certain filters in a CNN [Anwar et al., 2015]
  • In NLP, attention heads in pre-trained models have been pruned [Voita et al., 2019; Michel et al., 2019]

Distribution of BERT attention heads after pruning a single head based on MNLI performance [Michel et al., 2019]

45 of 227

Structured Sparsity

  • Information about the model's structure is widely applicable
  • Diff pruning can also use structure by encouraging sharing masks between groups [Guo et al., 2021]
  • More methods that use structure information in the next sections

46 of 227

Parameter Composition

  • Sparse Subnetworks��
  • Structured Composition��
  • Low-rank Composition

47 of 227

Low-rank Composition

  • Another useful inductive bias: module parameters should lie in a low-dimensional space
  • Li et al. [2018] show that models can be optimized in a low-dimensional, randomly oriented subspace rather than the full parameter space

Standard fine-tuning: θ′ = θ + ϕ, with ϕ ∈ R^|θ|

Low-rank fine-tuning: θ′ = θ + Pϕ_low, where P ∈ R^{|θ|×d} is a random projection matrix and ϕ_low ∈ R^d

Everything but ϕ_low is fixed. Only d dimensions are optimized.
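The subspace parameterization above can be sketched directly in NumPy; this is an illustrative toy (the scaling of the random projection is a common convention, not prescribed by the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 50                          # full vs intrinsic dimensionality

theta0 = rng.normal(size=D)                # frozen pre-trained parameters
P = rng.normal(size=(D, d)) / np.sqrt(d)   # fixed random projection matrix
phi = np.zeros(d)                          # the only trainable parameters

def parameters(phi):
    # theta' = theta0 + P @ phi: optimization happens in a d-dim subspace.
    return theta0 + P @ phi

assert np.allclose(parameters(phi), theta0)  # phi = 0 recovers the base model
```

Gradient descent then only touches the d entries of `phi`, while `theta0` and `P` stay fixed.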

48 of 227

Low-rank Fine-tuning

  • In our notation, low-rank fine-tuning takes the form θ′ = θ + Pϕ_low, where ϕ_low ∈ R^d and d ≪ |θ|
  • With a dense random projection matrix P ∈ R^{|θ|×d}, this scales as O(|θ|d) in matrix–vector multiply time and storage space
  • For large pre-trained models, we need a more efficient solution

49 of 227

Low-rank Fine-tuning via Matrix Factorization

  • Instead, we can decompose the projection matrix P
  • A classic choice: the Fastfood transform [Le et al., 2013], which requires only O(|θ|) space and O(|θ| log d) time
  • For the multiplication, we do not need to explicitly store P
  • It factorizes P via a sequence of random matrices with special properties: P = HGΠHB, where B is a random diagonal matrix with ±1 entries with equal probability, H is a Hadamard matrix, G is a random diagonal matrix with independent standard normal entries, and Π is a random permutation matrix

50 of 227

Intrinsic Dimensionality

  • Li et al. [2018] refer to the minimum dimension d at which a model achieves within 90% of the full-parameter model's performance as the intrinsic dimensionality of a task
  • Aghajanyan et al. [2021] investigate the intrinsic dimensionality of different NLP tasks and pre-trained models
  • Observations:
    • Intrinsic dimensionality decreases during pre-training
    • Larger models have lower intrinsic dimensionality

51 of 227

Intrinsic Dimensionality

Intrinsic dimension on the MRPC dataset for models of different sizes [Aghajanyan et al., 2021]

52 of 227

Structured Low-rank Methods

  • Aghajanyan et al. [2021] also propose a structure-aware version that allocates one scalar λ_i per layer to learn layer-wise scaling: θ′_i = θ_i + λ_i (Pϕ_low)_i
  • Improves the intrinsic dimension in general
  • However, storing the random matrices still requires a lot of extra space and is slow to train [Mahabadi et al., 2021]

→ We can apply the low-rank constraint only to certain groups

53 of 227

Low-rank Adaptation (LoRA)

  • LoRA [Hu et al., 2022] learns two low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k} that are applied to the self-attention weights: W′ = W + BA, where r is the matrix rank (r ≪ min(d, k)), d the hidden dimension, and k the input dimension
  • In addition, instead of learning a low-rank factorization via a random matrix P, we can learn the projection matrix directly (if it is small enough)
  • In our notation, this looks like the following: θ′ = θ + ϕ with ϕ = vec(BA), where vec(·) denotes vectorization
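A minimal NumPy sketch of the LoRA update (illustrative; real implementations apply this inside attention projections and add a scaling factor):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                    # hidden dim, input dim, rank

W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = np.zeros((d, r))                   # B initialized to zero ...
A = rng.normal(size=(r, k))            # ... so that initially W' = W

def lora_forward(x):
    # W' x = W x + B (A x); the full d×k update is never materialized.
    return W @ x + B @ (A @ x)

x = rng.normal(size=k)
assert np.allclose(lora_forward(x), W @ x)   # no change before training
```

Only `A` and `B` are trained, adding r(d + k) parameters instead of the dk parameters of a full update.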

54 of 227

Parameter Composition Comparison

Computation function per method:

  • Sparse subnetworks: θ′ = θ + ϕ, where ϕ_i = 0 if b_i = 0
  • Structured composition: θ′ = θ + ϕ, where ϕ_i = 0 if i ∉ G
  • Low-rank composition: θ′ = θ + Pϕ_low, where ϕ_low ∈ R^d

55 of 227

Computation Functions Comparison

Parameter composition:

  • Parameter efficiency: methods such as diff pruning require < 0.5% of parameters
  • Training efficiency: pruning requires re-training iterations
  • Inference efficiency: does not increase the model size
  • Performance: e.g., LoRA achieves strong performance
  • Compositionality: subnetworks can be composed

56 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +

This is mainly meant as high-level guidelines. Individual methods may have different trade-offs and mitigate certain weaknesses.

57 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

58 of 227

Input Composition

  • Augment a model's input with a learnable parameter vector ϕ: f_i′(x) = f_{θ_i}([ϕ, x])

59 of 227

Input Composition and Prompting

  • Standard prompting can be seen as finding a discrete text prompt that, when embedded using the model's embedding layer, yields the prompt parameters ϕ

60 of 227

Prompt Tuning

  • ϕ ∈ R^{n×d} is typically a matrix consisting of a sequence of n continuous prompt embeddings

Fine-tuning vs Prompt tuning (adapted from [Li & Liang, 2021])
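Concretely, prompt tuning prepends a trainable embedding matrix to the embedded input. A minimal sketch (illustrative dimensions; a real setup would train `prompt` with the frozen model's loss):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n = 100, 16, 5               # vocab size, embed dim, prompt length

E = rng.normal(size=(vocab, d))        # frozen embedding matrix
prompt = rng.normal(size=(n, d))       # trainable continuous prompt (ϕ)

def embed_with_prompt(token_ids):
    # Concatenate the learned prompt with the embedded input: [ϕ; x]
    x = E[token_ids]
    return np.concatenate([prompt, x], axis=0)

inputs = embed_with_prompt(np.array([3, 14, 15]))
```

The frozen model then processes a sequence of length n + seq_len; only the n·d prompt parameters are updated.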

61 of 227

Prompt Tuning Only Works Well at Scale

  • Only using trainable parameters at the input layer limits capacity for adaptation

→ Prompt tuning performs poorly at smaller model sizes and on harder tasks [Mahabadi et al., 2021; Liu et al., 2022]

Prompt tuning vs standard fine-tuning and prompt design across T5 models of different sizes [Lester et al., 2021]

Prompt tuning only matches fine-tuning at the largest model size

62 of 227

Multi-Layer Prompt Tuning

  • Instead of learning parameters only at the input layer, we can learn them at every layer of the model [Li & Liang, 2021; Liu et al., 2022]
  • Continuous prompts in later layers are more important

Prompt tuning

Multi-layer prompt tuning

63 of 227

Multi-Layer Prompt Tuning

  • In practice, continuous prompts are concatenated with the keys and values in the self-attention layer [Li & Liang, 2021]

64 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +

Input composition:

  • Parameter efficiency: only adds a small number of parameters
  • Training/inference efficiency: extends the model's context window
  • Performance: requires large models to perform well
  • Compositionality: continuous prompts have been composed

65 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +

66 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

67 of 227

Function Composition

  • Function composition augments a model's functions with new task-specific functions: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)
  • Most commonly used in multi-task learning where modules of different tasks are composed.
  • Relevant surveys: [Ruder, 2017; Crawshaw, 2020]

→ Focus in this tutorial is on functions that can be added to a pre-trained model

68 of 227

Adapters

  • Main purpose of functions added to a pre-trained model is to adapt it
  • Design of adapters is model-specific�
  • ResNet adapter in CV: batch normalization and 1x1 convolution [Rebuffi et al., 2017]

→ Functions are also known as ‘adapters’

Residual adapter in a ResNet (adapter parameters are in blue) [Rebuffi et al., 2017]

69 of 227

Adapters in Transformer Models

  • In NLP, an adapter in a Transformer layer typically consists of a feed-forward down-projection W_down ∈ R^{d×b}, a feed-forward up-projection W_up ∈ R^{b×d}, and an activation function σ [Houlsby et al., 2019]: f_ϕ(h) = h + σ(h W_down) W_up
  • σ is commonly a ReLU, but other variants have been explored [Stickland & Murray, 2019]
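A minimal NumPy sketch of such a bottleneck adapter (illustrative initialization: `W_up` starts at zero so the adapter is an identity function before training, a common near-identity initialization choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 64, 8                           # hidden size, bottleneck size

W_down = 0.01 * rng.normal(size=(d, b))
W_up = np.zeros((b, d))                # near-identity initialization

def adapter(h):
    # Bottleneck adapter with residual connection: h + ReLU(h W_down) W_up
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(3, d))
assert np.allclose(adapter(h), h)      # identity at initialization (W_up = 0)
```

The adapter adds only 2·d·b parameters per insertion point, small compared to the d×d projections of the Transformer layer itself.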

70 of 227

Adapters in Transformer Models

  • The adapter is usually placed after the multi-head attention and/or after the feed-forward layer�
  • Most approaches have used this bottleneck design with linear layers

71 of 227

Compact Adapter (Compacter)

  • Compacter [Mahabadi et al., 2021] reparameterizes the matrices in the adapter as:

W = Σᵢ₌₁ⁿ Aᵢ ⊗ Bᵢ

where Aᵢ ∈ R^{n×n} is shared between all layers, Bᵢ ∈ R^{(k/n)×(d/n)} is a low-rank matrix, n is a hyper-parameter (between 4 and 12), and ⊗ is the Kronecker product.

  • Compacter reduces adapter parameters by a factor of 10 and achieves similar or better performance
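The Kronecker-product reparameterization can be sketched with `np.kron`. This is an illustrative simplification: here the Bᵢ are dense, whereas Compacter additionally factorizes each Bᵢ as a product of low-rank vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 64, 64, 4                    # adapter in/out dims, number of terms

# W = sum_i A_i ⊗ B_i with A_i ∈ R^{n×n} (shared across layers)
# and B_i ∈ R^{(k/n)×(d/n)} (dense here for simplicity).
A = rng.normal(size=(n, n, n))
B = rng.normal(size=(n, k // n, d // n))

W = sum(np.kron(A[i], B[i]) for i in range(n))
```

A full k×d matrix has 4096 parameters; the factors above have n·n² + n·(k/n)·(d/n) = 1088, before Compacter's further low-rank factorization of the Bᵢ.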

72 of 227

Sequential and Parallel Adapters

  • Adapters can be routed sequentially or in parallel
  • Sequential adapters are inserted between functions: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)
  • Parallel adapters are applied in parallel: f_i′(x) = f_{θ_i}(x) + f_ϕ(x)

A sequential adapter [Houlsby et al., 2019]

Two parallel adapters [Stickland & Murray, 2019]

73 of 227

Benefits of Adapters

  • Increased robustness [He et al., 2021; Han et al., 2021]
  • Increased sample efficiency

BERT test performance distributions over 20 runs with different learning rates [He et al., 2021]

Results on GLUE with different numbers of training samples per task [Mahabadi et al., 2021]

74 of 227

Rescaling

  • Instead of learning a full function, even rescaling via element-wise multiplication can be powerful: f_i′(x) = s ⊙ f_{θ_i}(x)
  • Commonly applied to normalization parameters, e.g., batch normalization parameters in CV [Bilen et al., 2017], layer normalization in NLP [Houlsby et al., 2019]
  • Allows the model to select parameters that are more or less important for a given task
  • Compatible with other methods such as LoRA, which includes a tunable scalar scaling parameter
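Rescaling is just an element-wise multiplication with a learned vector; a minimal sketch (illustrative; initializing the scaling vector to ones makes the module an identity before training):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
s = np.ones(d)                          # learned rescaling vector, init to 1

def rescale(h):
    # Element-wise rescaling of a hidden representation: s ⊙ h
    return s * h

h = rng.normal(size=(2, d))
assert np.allclose(rescale(h), h)       # identity at initialization
```

Only d parameters per rescaled representation are learned, which is what makes methods like IA3 so parameter-efficient.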

75 of 227

IA3

  • IA3 [Liu et al., 2022] multiplies learned vectors with the keys and values in self-attention and the intermediate activations in the feed-forward network of a Transformer

IA3 [Liu et al., 2022] in the Transformer model

76 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +

Function composition:

  • Parameter efficiency: adapters depend on the hidden size
  • Training efficiency: does not require gradients of frozen parameters
  • Inference efficiency: new functions increase the number of operations
  • Performance: match or outperform standard fine-tuning
  • Compositionality: adapters can be composed

77 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +
Function composition        -           +           -             ++             +

78 of 227

Module Parameter Generation

  • So far, the modules for different tasks have been optimized separately
  • Modules may benefit from sharing an underlying structure

→ We can use a small neural network, a hyper-network [Ha et al., 2017], to generate the module parameters instead

  • Hyper-networks are most effective when generating modules based on relevant metadata

79 of 227

Module Parameter Generation

  • Conditioned on:
    • task embeddings;
    • language embeddings (optionally generated based on typological features)
    • layer id to make the hyper-network more efficient

Hyper-X [Üstün et al., 2022] conditions on task, language, and layer id to generate adapter parameters

80 of 227

Unifying Computation Functions

  • He et al. [2022] show that LoRA, prefix tuning, and adapters can be expressed with a similar functional form
  • Specifically, all methods can be expressed as modifying a model's hidden representation h: h′ = h + λ Δh, where Δh is the modification computed by the module
  • Analogously, we can express parameter and input composition methods as function composition

81 of 227

Combining Computation Functions

  • Different computation functions are complementary
  • We can combine different approaches in the same model
  • UniPELT [Mao et al., 2022] combines adapters, prefix tuning, and LoRA in the same architecture with a learned scalar gate for each


82 of 227

Combining Computation Functions

  • Sparsity, structure, low-rank approximations, rescaling, and other properties can also be applied and combined in many settings
  • For instance, He et al. [2022] propose a scaled parallel adapter (e) that adds a scaling parameter to a parallel adapter

Different adapter variants [He et al., 2022]

83 of 227

Performance Comparison

Prompt tuning underperforms the other methods due to limited capacity

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

84 of 227

Performance Comparison

Intrinsic dimension uses the smallest number of parameters but has a large memory footprint and poorer performance

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

85 of 227

Performance Comparison

Fine-tuning biases (BitFit) only has a small memory footprint but achieves lower performance

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

86 of 227

Performance Comparison

Function composition methods such as adapter, compacter, and IA3 achieve the best performance but add more parameters

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

87 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

88 of 227

Introduction to Routing


94 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


96 of 227

Fixed Routing

Routing decision is made a priori.

98 of 227

Fixed Routing

Multi-task Learning

BERT

NLI

NER

QA


103 of 227

AdapterHub.ml


105 of 227

AdapterHub.ml

[Illustration: an adapter checkpoint (~3 MB) vs a full model checkpoint (~500 MB) for models such as BERT, RoBERTa, T5]

106 of 227

AdapterHub.ml

AdapterHub is an ever-evolving multi-task model.

The component development is distributed across the community.


114 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


116 of 227

Learned Routing

Parametrize the routing function when routing decisions cannot be made a priori.

120 of 227

Learned Routing - Challenges

  • Training Instability
    • Router and Modules are untrained => routing dynamics might never stabilize.
  • Module Collapse
    • The router falls into a local optimum, choosing one or two modules exclusively.
  • Overfitting
    • Deep modular networks risk overfitting to the noise.

Plot courtesy of Rosenbaum et al. (2017)

121 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


127 of 227

Hard Learned Routing

Discrete decisions are not amenable to be learned through vanilla gradient descent.

  • Reinforcement Learning
  • Evolutionary Algorithms
  • Stochastic Reparametrization
    • AdaShare (Sun et al. 2020)
    • Polytropon (Ponti et al. 2022)

128 of 227

Hard Learned Routing - Stochastic Reparametrization

Gumbel Softmax (Jang et al., 2017)
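The Gumbel-softmax trick lets a router sample approximately one-hot routing decisions while remaining differentiable. A minimal NumPy sketch (forward pass only; the temperature `tau` controls how close the sample is to one-hot):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable, approximately one-hot sample (Jang et al., 2017)."""
    # Gumbel(0, 1) noise via the inverse-CDF transform.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])     # router scores for 3 modules
sample = gumbel_softmax(logits, tau=0.1)  # low tau → near one-hot
```

As tau → 0 samples approach hard one-hot module selections; larger tau gives smoother mixtures with lower-variance gradients.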

129 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


131 of 227

Soft Learned Routing

To sidestep discrete selection of modules, several works propose soft routing methods.

The router learns a probability distribution over the available modules: α = softmax(r(x)), where α_i is the weight assigned to module i.

132 of 227

Soft Learned Routing

Mixture of Experts (MoE)

⇒ The output is a weighted sum over all modules (aka “experts”): h = Σ_i α_i f_{ϕ_i}(x)

Problem: All modules are always activated

⇒ Significant increase of computational cost.

133 of 227

Soft Learned Routing

Top-k routing: only select the top-k modules according to the router distribution.
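A minimal NumPy sketch of top-k routing (illustrative: a linear router, random modules, and renormalization of the selected weights; real MoE layers add load-balancing losses and batched dispatch):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def top_k_route(x, W_r, modules, k=2):
    """Route input x to the k highest-scoring modules and mix their outputs."""
    scores = softmax(W_r @ x)
    top = np.argsort(-scores)[:k]                 # indices of selected modules
    weights = scores[top] / scores[top].sum()     # renormalize over the top-k
    return sum(w * modules[i](x) for w, i in zip(weights, top))

d, n_modules = 8, 4
W_r = rng.normal(size=(n_modules, d))
# Each module is a simple linear map with its own weights.
modules = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_modules)]

x = rng.normal(size=d)
out = top_k_route(x, W_r, modules, k=2)
```

Only k of the n modules are evaluated per input, which is what keeps the computational cost bounded as the number of experts grows.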


135 of 227

Soft Learned Routing

Top-1 routing: only select the top module.



139 of 227

Token Level Routing - MoE

Each FFN component of the transformer is considered a module (aka “Expert”).

Routing is performed on the token level.

Load Balancing uniformly distributes tokens across hardware accelerators.

140 of 227

Token Level Routing - Load Balancing

Illustration courtesy of Fedus et al. (2022)


142 of 227

Token Level Routing - MoE


Problem: Load balancing restricts the system from routing an entire example to a single module.

143 of 227

Token Level Routing

Problem: Load balancing restricts the system from routing an entire example to a single module.

Syntactically and semantically similar words (in contrast to sentences or phrases) are routed to the same modules.

Lewis et al. (2021)

144 of 227

Example Level Routing

Instead of token level routing, each token of a sentence can be routed to the same module.

Routing can be achieved based on metadata, such as language or task id, or on the pooled token representations.


145 of 227

Layer Routing

The router can also make a global decision for how to route at each layer.

146 of 227

Routing - Hybrids

It is possible to combine the concepts of fixed and learned routing:

  • Fixed routing, e.g., based on language id
  • Learned routing (i.e., MoE), e.g., for heterogeneous data

147 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

148 of 227

Aggregation Functions

Routing: How do we select different modules during training?

Here: How can we aggregate modules in order to combine the respective information?

* (Often these concepts are inseparable, i.e. selection and aggregation are performed simultaneously)

Computation Functions: How do we compose shared components with modules?

Here: How do we compose multiple modular components?

149 of 227

Notation

Let a neural network f be decomposed into a composition of sub-functions f = f_L ∘ … ∘ f_1. Each f_i has parameters θ_i.

A module with parameters ϕ can modify the i-th sub-function as follows:

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

In practice, typically only the module parameters ϕ are updated while θ_i is fixed.
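The three composition types can be illustrated on a single linear sub-function f_θ(x) = xθ. A minimal NumPy sketch (the ϕ names and dimensions are illustrative, not from any specific method):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=(d, d))              # frozen sub-function parameters
x = rng.normal(size=(3, d))

# Parameter composition: f'(x) = f_{theta (+) phi}(x)
phi_delta = 0.01 * rng.normal(size=(d, d))   # e.g. a LoRA-style weight delta
y_param = x @ (theta + phi_delta)

# Input composition: f'(x) = f_theta([phi, x])
phi_prompt = rng.normal(size=(2, d))         # e.g. a soft prompt
y_input = np.concatenate([phi_prompt, x]) @ theta

# Function composition: f'(x) = f_phi(f_theta(x))
phi_adapter = 0.01 * rng.normal(size=(d, d)) # e.g. an adapter layer
h = x @ theta
y_func = h + h @ phi_adapter                 # adapter with residual connection
```

In all three cases only the ϕ arrays would receive gradient updates; θ stays frozen.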


150 of 227

Module Composition

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

151 of 227

Module Composition

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

152 of 227

Module Composition

153 of 227

Module Composition

154 of 227

Parameter Interpolation

Mode connectivity: The minima found by two networks are connected by a path of non-increasing error.

Frankle et al. (2020) and Neyshabur et al. (2020) demonstrate that linear mode connectivity is closely linked to the Lottery Ticket Hypothesis.

=> When interpolating between models trained on different tasks but initialised with the same set of weights, the models tend to stay in the same loss basin.

155 of 227

Parameter Interpolation (Recap)

Lottery Ticket Sparse Fine-tuning: Keep the pre-trained weights, and combine module parameters learned for different settings [Ansell et al., 2022]: θ = θ₀ + ϕ_lang + ϕ_task, where the ϕ are sparse difference vectors.
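The composition of sparse fine-tunings can be sketched as plain addition of sparse difference vectors (masks and values below are illustrative toys):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=10)          # frozen pre-trained weights

def sparse_diff(mask_idx, values):
    """A sparse fine-tuning: non-zero only on the selected parameter subset."""
    phi = np.zeros_like(theta_pre)
    phi[mask_idx] = values
    return phi

phi_lang = sparse_diff([0, 3], [0.5, -0.2])   # e.g. a language module
phi_task = sparse_diff([3, 7], [0.1, 0.4])    # e.g. a task module

theta = theta_pre + phi_lang + phi_task       # composed model parameters
```

Parameters outside both masks are untouched, so modules trained separately compose without a joint training step.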

156 of 227

Parameter Interpolation

Ilharco et al. (2022) propose to edit entire models with task vectors.

E.g., tasks can include toxic language generation and general language modelling.

By performing the arithmetic negation operation (subtracting the toxicity task vector from the pre-trained weights), their new model generates less toxic text.
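A minimal sketch of task-vector negation (values are illustrative; the λ coefficient is a hyperparameter of the method):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=8)                        # pre-trained weights
theta_toxic = theta_pre + 0.3 * rng.normal(size=8)    # fine-tuned on toxic data

tau = theta_toxic - theta_pre        # task vector for the toxicity "task"
lam = 1.0                            # scaling coefficient
theta_edited = theta_pre - lam * tau # negation: steer away from toxicity
```

The same arithmetic with `+` instead of `-` adds a capability; sums of several task vectors approximate multi-task models.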

157 of 227

Input Composition

158 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

159 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.

160 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.

Model

Prompt 1

Prompt n

Text Input

161 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.
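Concretely, aggregating n soft prompts is a single concatenation along the sequence dimension (prompt names and lengths below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
prompt_task = rng.normal(size=(3, d))   # e.g. a task-specific soft prompt
prompt_lang = rng.normal(size=(2, d))   # e.g. a language-specific soft prompt
token_embeds = rng.normal(size=(6, d))  # embedded text input

# Aggregation = prepend all prompts to the token embeddings
model_input = np.concatenate([prompt_task, prompt_lang, token_embeds], axis=0)
```

The frozen model then processes the lengthened sequence as usual; only the prompt rows would be trainable.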

162 of 227

Function Composition

For r modules we get (latent) representations h₁, …, h_r.

163 of 227

Function Composition

Weighted Representation Averaging:

Learn the weights α to interpolate over the hidden representations: h′ = Σᵢ αᵢ · hᵢ

164 of 227

Function Composition

Weighted Representation Averaging:

Learn the weights α to interpolate over the hidden representations: h′ = Σᵢ αᵢ · hᵢ

MoE: the weights are produced by a learned router, e.g. α = softmax(W_r x), often with top-k selection.

165 of 227

Function Composition

Weighted Representation Averaging:


=> MoEs automatically perform routing and function aggregation.

166 of 227

Function Composition

Weighted Representation Averaging:

Ma et al. (2018) propose to learn one aggregation function per task in a multi-task setup.


167 of 227

Function Composition

Weighted Representation Averaging:

Gururangan et al. (2022) pre-train modular components for different textual domains. When utilising the pre-trained modules on unseen data, they weight the output representations of the respective domain modules according to the posterior distribution over the input examples.

168 of 227

Function Composition

Weighted Representation Averaging:

Fixed Routing: Module representations are often averaged without weighting (Zhang et al., 2022; Chronopoulou et al., 2022).

Hard routing methods: The representations of all active modules are averaged, such as in Polytropon (Ponti et al., 2022), or summed, as in PathNet (Fernando et al., 2017).
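Learned weighted averaging versus the unweighted mean used in fixed routing can be sketched as (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 3, 4
h = rng.normal(size=(r, d))          # one representation per module

# Learned weighting: alpha = softmax(logits), h' = sum_i alpha_i * h_i
logits = rng.normal(size=r)          # trainable aggregation logits
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()
h_weighted = (alpha[:, None] * h).sum(axis=0)

# Fixed routing often just averages the active modules without weighting
h_mean = h.mean(axis=0)
```

Summation (as in PathNet) drops the normalisation; attention-based aggregation would instead compute the weights from the module outputs themselves.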

169 of 227

Function Composition

Attention-Based Representation Aggregation:

Instead of inferring the weighting before a module has performed its transformation, we can make the aggregation decision after.

170 of 227

Function Composition

Attention-Based Representation Aggregation:

Instead of inferring the weighting before a module has performed its transformation, we can make the aggregation decision after.

171 of 227

Function Composition

Disadvantage of both weighted and attention-based representation averaging:

They require a full forward pass through all modules, even if they contribute only marginally.

=> Significant increases in time and space complexity. While this can be mitigated by pruning (i.e., dropping) some modules during inference (Rücklé et al., 2021), latency still remains an issue for scalability.

172 of 227

Function Composition

Sequential Aggregation:

Pass through multiple modules, where the input to the next module is the output of the previous one: h = f_{ϕ₂}(f_{ϕ₁}(x))
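Sequential aggregation of, e.g., a language adapter followed by a task adapter can be sketched as nested calls (all weights here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_lang = 0.1 * rng.normal(size=(d, d))   # first module in the chain
w_task = 0.1 * rng.normal(size=(d, d))   # second module in the chain

def adapter(h, w):
    return h + h @ w                     # residual adapter (toy form)

x = rng.normal(size=(3, d))
out = adapter(adapter(x, w_lang), w_task)   # f_task(f_lang(x))
```

Unlike averaging, this requires only one forward pass per module in the chain, but the order of modules matters.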

173 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-�efficient Models

Modular�Models

Applications

174 of 227

Applications

(An Illustrative Non-Exhaustive Subset)

175 of 227

Properties of Modular and Parameter-Efficient Tuning

  • Avoiding negative consequences of full-model fine-tuning:
    • Bypassing catastrophic forgetting and interference

Multi-Task Learning: a single MT-Model (𝚹) is trained jointly on Task 1, Task 2, and Task 3.

Sequential Fine-Tuning: Model (𝚹0) → Task 1 → Model (𝚹1) → Task 2 → Model (𝚹2).

176 of 227

Properties of Modular and Parameter-Efficient Tuning

  • Avoiding negative consequences of full-model fine-tuning:
    • Bypassing catastrophic forgetting and interference
    • Bypassing the creation of large full-model copies; boosting efficiency and compactness

  • Positive aspects:
    • Positive transfer
    • Compositionality and reusability of modules: systematic generalisation
    • Local, asynchronous updates: parameter efficiency
    • Scaling (e.g., through MoE) and specialisation

We will illustrate most of these properties via applications in (low-resource) multilingual NLP as a case study.

177 of 227

Parameter-Efficient Fine-Tuning

A basic scenario, seen many times across many applications

  • Fix/freeze the base model

  • Initialise a task-specific module

  • All fine-tuning updates are forced into the task-specific module

  • Compare against full-model fine-tuning and preceding PEFT approaches (on GLUE?)

  • Choose which benefit you want to stress
    • Improve performance with the same parameter budget
    • Maintain performance with a smaller parameter budget

178 of 227

Parameter-Efficient Fine-Tuning

A Multi-Task Scenario (in Monolingual Contexts)

  • Boost cross-task sharing via hyper-networks

  • Task-conditioned hyper-networks

  • Learnable task embeddings

  • Combining sharing with specialisation

179 of 227

Challenges of Cross-Lingual Transfer

  • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)

  • Languages share common structure (on the lexical, syntactic, and semantic level)

  • Annotated data is rare: make use of as much supervision as is available, as well as of task similarities
  • Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, QA, etc)

Each data point can be a combination of dedicated modules?

Image courtesy of Yulia Tsvetkov

180 of 227

Standard (Zero-Shot) Cross-Lingual Transfer

Step 1:

Train a multilingual model.

Step 2:

Fine-tune model on a task in a high-resource source language.

Step 3:

Transfer and evaluate the model on a low-resource target language.

Why?

Training data is expensive and not available for many languages, especially ones that are considered “low-resource”.

181 of 227

Challenges of Cross-Lingual Transfer

Deep massively multilingual models such as multilingual-BERT [mBERT; Devlin+, NAACL-18] or XLM-RoBERTa [XLM-R; Conneau+, 2020] achieve

  • SotA results on cross-lingual transfer

BUT

  • Suffer from the so-called “curse of multilinguality”
    • and cannot represent all (7000+) languages in a single model.
  • Performance deteriorates especially for low-resource languages not covered in the training data.

182 of 227

Language Adapters?

Task Knowledge ~ Language Knowledge

MLM (English)

MLM (Quechua)


  • Adapters learn transformations that make the underlying model more suited to a task or language.
  • Using masked language modelling (MLM), we can learn language-specific transformations for e.g. English and Quechua.
  • As long as the underlying model is kept fixed, these transformations are roughly interchangeable.

183 of 227

Modular and Parameter-Efficient Cross-Lingual Transfer

A Case Study of MAD-X [Pfeiffer+, EMNLP-20]


184 of 227

MAD-X

Step 1: Train Language Adapters

We train language adapters for the source language and the target language with masked language modeling on Wikipedia.


MLM (English)

MLM (Quechua)

185 of 227

MAD-X

Step 2: Train a Task Adapter

We train task adapters in the source language stacked on top of the source language adapter.

The language adapter ɸl as well as the transformer weights 𝚹 are frozen, while only the task adapter parameters ɸt are trained.


186 of 227

MAD-X

Step 3: Zero-Shot transfer to unseen language

We replace the source language adapter with the target language adapter, while keeping the “language agnostic” task adapter.
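The three MAD-X steps, reduced to one frozen sub-layer, can be sketched as follows (all weights are illustrative stand-ins, not the actual adapter architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=(d, d))               # frozen transformer weights

# Step 1: language adapters trained via MLM (stand-ins here)
adapt_en = 0.1 * rng.normal(size=(d, d))
adapt_qu = 0.1 * rng.normal(size=(d, d))
# Step 2: task adapter trained with English task data, stacked on top
adapt_task = 0.1 * rng.normal(size=(d, d))

def forward(x, lang_adapter):
    h = x @ theta                             # frozen sub-layer
    h = h + h @ lang_adapter                  # language adapter (residual)
    return h + h @ adapt_task                 # task adapter stacked on top

x = rng.normal(size=(2, d))
y_source = forward(x, adapt_en)   # training-time configuration
# Step 3: zero-shot transfer = swap only the language adapter
y_target = forward(x, adapt_qu)
```

The task adapter never sees target-language data; the swap relies on it being roughly "language agnostic".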


187 of 227

Evaluation


(Map legend: Source / Seen / Unseen / Unseen Script)

  • 3 languages were seen by mBERT, but their script is underrepresented.

  • 9 languages are unseen by mBERT, but their scripts were covered during pretraining.

  • 5 languages with unseen scripts.

We evaluate our approaches by transferring from 4 high-resource to 17 low-resource languages.

We evaluate on Named Entity Recognition.

188 of 227

MAD-X: Some Results (NER)


Relative F1 improvement of MAD-X Large over XLM-R Large in cross-lingual NER transfer; along the axes, languages become more low-resource or unseen during pre-training.

The top-right corner represents the realistic scenario of transferring from high-resource to low-resource languages.

189 of 227

MAD-X: Some Results (NER)


mBERT and MAD-X fail for unseen scripts.

190 of 227

When scripts have never been seen


ቢግ ማክ በኣሁኑ ዘመን በአለም ዓቀፍ ደረጃ

Amharic script has not been seen during mBERT pre-training.


mBERT’s tokenizer represents everything as <UNK>s

mBERT Tokenizer: The, 20, บทค, მისი

191 of 227

Learn a new Tokenizer/Vocabulary


mBERT Tokenizer: The, 20, บทค, მისი

Amharic Tokenizer: The, 20, ቢግ, ማክ, ዘመ, ዓቀ

192 of 227

Results


mBERT and MAD-X fail for unseen scripts.

We can achieve gains for seen, but underrepresented languages using lexical initialization.

MAD-X w/ Tokenizer

193 of 227

Applications in Cross-Lingual Transfer

How can we enable positive transfer across languages?


Generating Adapter Parameters

The main idea: instead of having dedicated single-language adapters, can we learn to generate adapters on the fly, conditioned on language vectors?

This can be seen as factorising adapters into language (and layer) parameters

More efficient than keeping dedicated language adapters

It can even work (in theory) in zero-shot (no text data whatsoever!) and few-shot setups...
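A hyper-network that generates adapter weights from a language (and layer) embedding can be sketched minimally as follows (dimensions and names are illustrative, not MAD-G's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_lang, d_layer = 4, 8, 3
# One shared generator replaces per-language adapter storage
w_gen = 0.1 * rng.normal(size=(d_lang + d_layer, d * d))

def generate_adapter(lang_vec, layer_vec):
    """Produce adapter weights on the fly from the conditioning vectors."""
    cond = np.concatenate([lang_vec, layer_vec])
    return (cond @ w_gen).reshape(d, d)

# e.g. an adapter for some language at layer 3 (vectors are random stand-ins)
adapter_l3 = generate_adapter(rng.normal(size=d_lang), rng.normal(size=d_layer))
```

Because the generator is shared, an unseen language only needs a language vector (e.g. from typological features) to obtain an adapter, enabling the zero-shot setting mentioned above.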

194 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


Generalisation of [Üstün+, EMNLP-20]

Learn to generate (via a hyper-network) a monolithic multilingual adapter by MLM-ing on 95 languages.

Multi-Source Transfer works much better across different tasks: the model is forced to learn more general language-invariant transfer features?

Generated Adapters offer better initialisation for further target-specific MLM-ing in low-data scenarios...

195 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


Notes of Efficiency and Compactness

  • Full fine-tuning of mBERT for 95 languages requires:

95 * 178M = 16.91B parameters (!!)

  • Having MAD-X for 95 languages requires:

728M additional parameters (95 single-language adapters)

  • MAD-G for 95 languages conditioned on language vectors only:

228M parameters

  • MAD-G for 95 languages conditioned on language vectors and layer positions:

38M additional parameters

  • Average per-language training time: ~50x shorter than for MAD-X
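The parameter counts above can be checked with back-of-the-envelope arithmetic (all figures in millions, taken from the slide):

```python
mbert = 178          # mBERT parameters (M)
n_langs = 95

full_ft = n_langs * mbert        # one full model copy per language
print(full_ft / 1000)            # in billions

madx = 728                       # 95 single-language adapters (M)
madg_lang = 228                  # MAD-G, language vectors only (M)
madg_lang_layer = 38             # MAD-G, language + layer conditioning (M)
```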

196 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


MAD-G multilingual adapter as a better starting point for further language specialisation

197 of 227

Applications in Multilinguality

A case study of Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer


Generating task-language modules conditioned on task, language, and layer embeddings

A single hyper-network can again be shared across layers

Unseen task-language combinations at inference (composability, reusability), enabled by multi-task learning

198 of 227

MAD-X vs X-Mod

Modularizing Pretrained Language Models

A case study of X-Mod [Pfeiffer+, NAACL-2022]


199 of 227

Modularizing Pretrained Language Models

A case study of X-Mod [Pfeiffer+, NAACL-2022]

Shared Baseline: “Standard” multilingual model with all weights shared

X-Mod: Each language has a set of dedicated modular parameters


Shared

X-Mod

200 of 227

Zero-Shot Cross-Lingual Transfer with X-Mod

Fine-tune only shared parameters

For cross-lingual transfer replace the source language module with the target language module.


201 of 227

Modular Parameter-Efficient Cross-Lingual Transfer

Sparse subnetworks instead of adapters? [Ansell+, ACL-22; Foroutan+, EMNLP-22]


202 of 227

Modular Parameter-Efficient Cross-Lingual Transfer

Sparse subnetworks instead of adapters? [Ansell+, ACL-22; Foroutan+, EMNLP-22]


  • Update only a subset of parameters
  • Different ways to choose the subset (beyond “Lottery Ticket” selection)

Child-Tuning [Xu+, EMNLP-21]

FISH-Tuning [Sung+, NeurIPS-21]

Fisher Information Matrix

Levels of sparsity are crucial
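FISH-style subset selection can be sketched as keeping the parameters with the largest empirical Fisher information, approximated by squared gradients (values below are an illustrative toy):

```python
import numpy as np

def fish_mask(grads, keep_ratio):
    """Boolean mask: True for parameters that will receive updates."""
    fisher = grads ** 2                      # empirical Fisher diagonal
    k = max(1, int(keep_ratio * fisher.size))
    thresh = np.sort(fisher.ravel())[-k]     # k-th largest score
    return fisher >= thresh

g = np.array([0.1, -2.0, 0.5, 0.05, 1.0])   # toy per-parameter gradients
mask = fish_mask(g, keep_ratio=0.4)          # keep the top 40%
```

During fine-tuning, gradients are multiplied by the (fixed) mask, so only the selected subnetwork moves; the sparsity level (`keep_ratio`) is the crucial hyperparameter noted above.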

203 of 227

Combining Language and Ranking Modules for IR

A case study of multilingual IR with bottleneck adapters and sparse subnetworks


Ranking modules, trained only with English relevance assessment, can be transferred to other languages and also used for CLIR

204 of 227

Extending Modular Cross-Lingual Transfer

Phylogeny-Based Hierarchical Adapters?


  • Leverage closely related languages in a structured, linguistically-informed manner
  • Benefits for very low-resource languages with ‘higher-resource siblings’ (DEP, POS)

205 of 227

Ensembling Adapters at Test Time

A case study of dealing with low-resource language varieties at test time


  • Simple strategy 1: Use a Language Adapter from a related language
  • Simple strategy 2: Average outputs of multiple related adapters
  • Better solution: entropy-minimised ensemble of adapters (EMEA)
    • Learnable ensembling weights at the sentence level
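The EMEA idea can be sketched as tuning ensemble weights over related-language adapters to minimise prediction entropy. Below, a single crude coordinate step stands in for the gradient steps of the actual method; predictions and weights are illustrative:

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

def ensemble(preds, logits_w):
    w = np.exp(logits_w - logits_w.max())
    w /= w.sum()                             # softmax over adapter weights
    return (w[:, None] * preds).sum(axis=0)

preds = np.array([[0.6, 0.3, 0.1],           # predictions under adapter 1
                  [0.2, 0.5, 0.3]])          # predictions under adapter 2
logits_w = np.zeros(2)                       # start from a uniform ensemble

# Keep a nudge to a weight only if it lowers prediction entropy
for i in range(len(logits_w)):
    trial = logits_w.copy()
    trial[i] += 0.5
    if entropy(ensemble(preds, trial)) < entropy(ensemble(preds, logits_w)):
        logits_w = trial

p_final = ensemble(preds, logits_w)
```

The intuition: a low-entropy (confident) ensembled prediction indicates that the weighting matches the test sentence's language variety.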

206 of 227

Similar Ideas Have Been Explored in NMT

A case study of using bilingual (translation-direction) adapters [Bapna and Firat, EMNLP-19]


  • Avoiding interference; reaping benefits of transfer on low-resource languages without hurting performance for high-resource languages
  • Specialising a multilingual model >> training a bilingual model from scratch

  • Bilingual adapters for cross-lingual transfer [Parović+, NAACL-22]
    • Boosting a particular transfer direction (BAD-X vs MAD-X)

207 of 227

Similar Ideas Have Been Explored in NMT

A case study of using monolingual NMT adapters


  • We need 2L instead of L(L-1) adapters
  • [Philip+, EMNLP-20] require parallel data to train the adapters…
  • [Üstün+, EMNLP-21] train denoising adapters using monolingual data only:
    • The power of modular design: recombining adapters for unseen translation directions

208 of 227

Generating Adapters for NMT

Another direct analogy with cross-lingual transfer [Baziotis+, arXiv-22]


  • Generating language-specific and language-pair adapters
  • Same benefits as before when using hyper-networks: improved efficiency, quicker convergence, leveraging positive transfer

209 of 227

Generating Adapters for NMT

[Baziotis+, arXiv-22]


  • Improving performance and keeping the parameter budget

versus

  • Maintaining performance and reducing the parameter budget

210 of 227

Similar Ideas Have Been Explored in NMT

A case study of using language-specific subnetworks for Multilingual NMT [Lin+, ACL-21]


  • The motivation is the same as before: avoid interference and create specialised modules; sparsity is the key

211 of 227

Mixture-of-Experts: Large-Scale Modular Learning

A case study of token-level versus task-level MoEs [Kudugunta+, Findings of EMNLP-21]


  • Efficiency considerations at inference
  • What are ‘tasks’? The actual tasks? Domains? Translation directions?

212 of 227

Similar Ideas Have Been Explored for Domain Adaptation

A case study of hierarchical domain adapters [Chronopoulou+, NAACL-22]


213 of 227

Modular and Parameter-Efficient Continual Learning

A case study of adding new domains to task-oriented dialogue systems [Madotto+, EMNLP-21]


  • Apply a similar idea to extending multilingual models with new languages? [Badola+, arXiv-22]
  • Other parameter-efficient strategies beyond adapters - Modular Networks? [Veniat+, ICLR-21]

Task id is available at training but not at inference

Select the module with the lowest perplexity score (i.e., highest confidence)
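Module selection without a task id at inference boils down to scoring the input under each module's language model and picking the lowest perplexity. A minimal sketch (scores are illustrative):

```python
import numpy as np

def select_module(mean_nlls):
    """mean_nlls[i]: mean token negative log-likelihood under module i.

    Lowest perplexity = exp(mean NLL) = highest confidence.
    """
    perplexities = np.exp(mean_nlls)
    return int(np.argmin(perplexities))

# Toy per-module NLLs for one input; module 1 is most confident
chosen = select_module(np.array([2.1, 0.7, 1.5]))
```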

214 of 227

Prompt-Free Methods for Few-Shot Learning

A case study of PERFECT [Mahabadi+, ACL-22]


Bypassing the use of patterns and verbalizers: prompt-free methods

  1. Pattern-free implicit task descriptions

  • Multi-token label embeddings: a fixed number of tokens per label

  • (Optional) Inference via prototypical networks

More PEFT-based versus prompt-based few-shot learning?

215 of 227

Multi-Modal Learning

A case study of MAGMA [Eichenberg+, EMNLP-22]


  • Autoregressively generate text from arbitrary combinations of visual and textual input
  • Text generation is conditioned on visual prefix: an instance of prefix-tuning with adapters

216 of 227

Multilingual Multi-Modal Learning

A case study of enabling cross-lingual VQA [Pfeiffer+, Findings of ACL-22; Liu+, arXiv-22]


  • Each modality is passed through a modality-specific task adapter; the outputs are concatenated.
  • Extending a monolingual multimodal model to deal with more languages
  • Extending a text-only multilingual model to deal with another modality

217 of 227

Task Specialisation of Sentence Encoders

A case study of (efficient) intent detection


Task specialisation knowledge (i.e., contrastive learning updates) is inserted into bottleneck adapters

218 of 227

Modular Prompting?

A case study of modular prompting for zero-shot cross-lingual generation [Vu+, EMNLP-22]


  • Soft prompts can be factorised into language- and task-specific sub-prompts…
  • …and then recombined at inference to enable unseen combinations
  • Modularity of the prompts?

219 of 227

Other Applications


The discussed applications are meant to be illustrative, not exhaustive

Similar use of modular PEFT in speech processing applications:

Bottleneck adapters for ASR [Tomanek+, EMNLP-21; Thomas+, ICASSP-22; Sathyendra+; ICASSP-22; Chen+, arXiv-22; Fan and Alwan, InterSpeech-22] and text-to-speech (emerging, 2 arXiv papers in Nov/22)

What is in the modules?

Specialising for languages, demographics (accents, children’s speech), domains, specific speakers

220 of 227

Other Applications


The discussed applications are meant to be illustrative, not exhaustive

  • A body of work across multiple application areas:

    • Computer vision
    • Speech processing
    • Cross-modal learning (e.g., vision-language, even image-video)
    • Reinforcement learning
    • Program induction
    • Causal inference
    • Translation from natural language to code

221 of 227

Community-Wide Sharing and Reusing of Modules

A Case Study of AdapterHub.ml (reminder)

Ever-evolving multi-task models, the development of whose components has been distributed throughout the community

222 of 227

What is (or can be) stored in the modules?


  • Language-specific knowledge? Language modules
  • Task-specific knowledge? Task modules
  • Domain-specific knowledge? Domain modules

  • External (manually created, non-distributional) knowledge?

  • Debiasing-relevant knowledge? Debiasing modules
    • Via MLM on counterfactually augmented corpora? [Lauscher+, EMNLP-21]
    • combining different debiasing modules for intersectional debiasing?

223 of 227

What is (or can be) stored in the modules?


Are modules reserved only for ‘well-formed’ and interpretable ‘units of knowledge’?

Using (non-interpretable) reusable (predefined) sets of skills and skill modules

  • “a skill to understand the sentiment of a sentence?”
  • “a skill to understand short questions in natural language?”

224 of 227

Final Thoughts and

Open Research Directions

225 of 227


Many unexplored variants on the multi-dimensional research manifold of

  • Computation function
  • Routing function
  • Aggregation function
  • Training strategy

There is no ‘one-size-fits-all’ design and model choice:

  • Understand your task and your data (raw data and annotated data)
  • Understand your constraints and requirements:
    • Parameter budget?
    • Computation budget?
    • Storage budget?
    • Training speed?
    • Inference speed?
    • Optimising performance only versus optimising trade-off(s)?

226 of 227


Modular and parameter-efficient deep learning transcends the confines of:

  • Particular NLP ‘tracks’
    • Study connections between different NLP tasks and applied methods

  • Particular modalities and application areas
    • Study structural and methodological connections and (dis)similarities between research in NLP, speech processing, computer vision, etc.

  • ‘Single-stream’ goals (e.g., performance only, efficiency only, debiasing only)
    • Approach your aims (more) holistically [Søgaard+, Findings of ACL-22]: e.g., how can we integrate debiasing with efficiency- and performance-oriented aims?

  • Isolated non-shareable research
    • Community-driven expanding, sharing, reusing, updating…

227 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP

Questions:

  • RocketChat #tutorial-5
  • Ask us after the tutorial

Watch out for:

  • A survey on modular deep learning (soon to be published)

If you found these slides helpful, consider citing the tutorial as:

@inproceedings{ruder2022modular,
  title={Modular and Parameter-Efficient Fine-Tuning for NLP Models},
  author={Ruder, Sebastian and Pfeiffer, Jonas and Vuli{\'c}, Ivan},
  booktitle={Proceedings of EMNLP 2022: Tutorials},
  year={2022}
}