1 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP Models

Sebastian Ruder, Jonas Pfeiffer, Ivan Vulić

EMNLP 2022, December 8, 2022

2 of 227

State-of-the-art NLP Models Are Getting Ever Larger

Evolution of the size of large pre-trained models [Treviso et al., 2022]

3 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP

Follow along with the tutorial:

Questions:

  • RocketChat #tutorial-5
  • Ask us during the break or after the tutorial

Watch out for:

  • A survey on modular deep learning (soon to be published)

4 of 227

Large Models Develop Increasing NLU Capabilities

5 of 227

Transfer Learning in the Era of Large Models

  • With increasing model size, fine-tuning becomes increasingly expensive
  • The standard transfer learning formula breaks down

[Figure: timeline of pretraining approaches (word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, T5), each followed by fine-tuning on downstream tasks such as classification, sequence labeling, and question answering]

6 of 227

Transfer Learning in the Era of Large Models

  • In-context learning has mostly replaced fine-tuning for large models

[Figure: large pre-trained models (BERT, BART, ERNIE, GPT-3, PaLM) used directly via in-context learning, without fine-tuning]

7 of 227

Prompt-based Learning has Taken NLP by Storm

8 of 227

Downsides of Prompt-based Learning

  1. Inefficiency: The prompt needs to be processed every time the model makes a prediction.
  2. Poor performance: Prompting generally performs worse than fine-tuning [Brown et al., 2020].
  3. Sensitivity to the wording of the prompt [Webson & Pavlick, 2022], order of examples [Zhao et al., 2021; Lu et al., 2022], etc.
  4. Lack of clarity regarding what the model learns from the prompt. Even random labels work [Min et al., 2022]!

9 of 227

From Fine-tuning to Parameter-efficient Fine-tuning

[Figure: pretraining of large models (BERT, BART, ERNIE, GPT-3, PaLM) followed by parameter-efficient fine-tuning on downstream tasks such as classification, sequence labeling, and question answering]

10 of 227

From Fine-tuning to Parameter-efficient Fine-tuning

Full Fine-tuning: Update all model parameters

Parameter-efficient Fine-tuning: Update a small subset of model parameters

11 of 227

Haven’t We Seen This Before?

  • Updating the last layer was common in computer vision [Donahue et al., 2014]. In NLP, people experimented with static and non-static word embeddings [Kim, 2014].
  • ELMo did not fine-tune contextualized word embeddings [Peters et al., 2018].

→ Fine-tuning all representations performed generally better in practice

Why go back to fine-tuning only some parameters?

  1. Fine-tuning all parameters is impractical with large models
  2. State-of-the-art models are massively over-parameterized → Parameter-efficient fine-tuning matches the performance of full fine-tuning
  3. Modular and compositional representations

12 of 227

Modularity and Compositionality?

[Illustration: modules encapsulating Skill/Task 1 and Skill/Task 2 can be composed to address a new Skill/Task 3]


21 of 227

Why Modularity?

1. Models are increasing in size
→ Parameter-efficient fine-tuning strategies

2. Unseen scenarios
→ Out-of-distribution generalization through module composition

3. Catastrophic interference
→ Modularity as inductive bias

4. Efficient updating of models
→ Through added components

[Illustrations: composing a Task/Skill module with language modules for English and Swahili; updating a model's factual knowledge ("Who is the President of USA?": Obama → Biden) via an added component]

22 of 227

Relevant Tutorials

EMNLP 2020

Efficient methods (knowledge distillation, quantization, pruning): Slides

ACL 2022

In-context learning, fine-tuning, meta-training: Underline

23 of 227

What This Tutorial Is About and Is Not About

  • Goal: Provide an overview of parameter-efficient fine-tuning methods
  • Highlight the benefits and usage scenarios of modularity

  • What this is not: Comprehensive (it’s impossible to cover all related papers in a tutorial)
  • What we do not cover:
    • Text prompting approaches; see [Liu et al., 2021]
    • Efficient NLP methods in general; see [Treviso et al., 2022]

24 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

25 of 227

Notation

Let a neural network f be decomposed into a composition of functions f₁ ∘ f₂ ∘ … ∘ f_l. Each f_i has parameters θ_i.

A module with parameters ϕ can modify the i-th subfunction as follows:

  1. Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x), where ⊕ is interpolation, e.g., element-wise addition
  2. Input composition: f_i′(x) = f_{θ_i}([ϕ, x]), where [·, ·] is concatenation
  3. Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

In practice, typically only the module parameters ϕ are updated while θ_i is fixed.

26 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

27 of 227

Parameter Composition

  1. Sparse Subnetworks��
  2. Structured Composition��
  3. Low-rank Composition

28 of 227

Sparse Subnetworks

  • A common inductive bias on the module parameters is sparsity
  • Most common sparsity method: pruning
  • Pruning can be seen as applying a binary mask that selectively keeps or removes each connection in a model and produces a subnetwork.�
  • Most common pruning criterion: weight magnitude [Han et al., 2017]

29 of 227

Pruning

  • During pruning, a fraction of the lowest-magnitude weights are removed
  • The non-pruned weights are re-trained
  • Pruning for multiple iterations is more common [Frankle & Carbin, 2019]

[Diagram: one-shot pruning (initial training → pruning → re-training) vs iterative pruning (repeated pruning → re-training cycles)]
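The pruning loop above can be sketched in a few lines of NumPy. This is a minimal, illustrative sketch: `retrain` is a hypothetical placeholder for the re-training step, and the sparsity schedule is a simple linear anneal, not the exact schedule of any cited paper.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Binary mask keeping the (1 - sparsity) highest-magnitude weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return (np.abs(weights) > threshold).astype(weights.dtype)

def iterative_prune(weights, target_sparsity, steps, retrain=lambda w: w):
    """Alternate pruning and (placeholder) re-training, annealing sparsity."""
    mask = np.ones_like(weights)
    for step in range(1, steps + 1):
        sparsity = target_sparsity * step / steps  # anneal toward the target
        mask = magnitude_prune(weights * mask, sparsity) * mask
        weights = retrain(weights * mask)          # stand-in for re-training
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
pruned, mask = iterative_prune(w, target_sparsity=0.9, steps=3)
```

With `steps=1` this reduces to one-shot pruning; more steps give the iterative variant that tends to work better in practice.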

30 of 227

Another Perspective on Pruning

  • We can also view pruning as adding a task-specific vector ϕ to the parameters of an existing model: θ′ = θ + ϕ, where ϕ_i = 0 if b_i = 0
  • If the final model should be sparse, we can multiply the existing weights with the binary mask b to set the pruned weights to 0: θ′ = (θ + ϕ) ⊙ b, where ⊙ is the element-wise (Hadamard) product
  • The main benefit is that these weight values were moving to 0 anyway [Zhou et al., 2019]

31 of 227

The Lottery Ticket Hypothesis

  • Dense, randomly-initialized models contain subnetworks (“winning tickets”) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations [Frankle & Carbin, 2019]
  • Has also been verified in RL and NLP [Yu et al., 2020] and for larger models in computer vision [Frankle et al., 2020]

32 of 227

The Lottery Ticket Hypothesis in Pre-trained Models

  • Prior work [Chen et al., 2020; Prasanna et al., 2020] has found winning tickets in pre-trained models such as BERT
  • Sparsity ratios: from 40% (SQuAD) to 90% (QQP and WNLI)
  • Pre-trained init >> random init
  • Even pre-trained, random subnetworks perform well [Prasanna et al., 2020]
  • Subnetworks trained on a general task (MLM) transfer best
  • At the right compression rate, we can even find super tickets that outperform the full model [Liang et al., 2021]

33 of 227

Beyond Lottery Tickets: Supermasks

  • The number of possible subnetworks grows combinatorially with the number of model parameters

→ There are masks—supermasks—that achieve non-random performance even for randomly initialized, fixed models [Zhou et al., 2019]

  • In this case, the module parameters consist only of the binary mask: ϕ = b ∈ {0, 1}^|θ|

[Diagram: initial training → pruning → re-initialization of the remaining weights]

34 of 227

Beyond Lottery Tickets: Supermasks

  • A fixed model can accommodate a potentially unlimited number of task-specific binary masks [Wortsman et al., 2020]

[Illustration: separate supermasks for Tasks A, B, and C applied to one fixed model]
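The idea of one fixed model serving many tasks via per-task masks can be sketched as follows; this is an illustrative NumPy toy (random masks instead of learned ones), not the training procedure of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 8))          # fixed (even randomly initialized) base weights

# One binary supermask per task; the base model is never updated.
masks = {task: (rng.random(theta.shape) > 0.5).astype(theta.dtype)
         for task in ("A", "B", "C")}

def forward(x, task):
    # Apply the task-specific subnetwork: theta ⊙ b
    return x @ (theta * masks[task])

x = rng.normal(size=(1, 8))
outs = {task: forward(x, task) for task in masks}
```

Storage per task is just one bit per weight, which is why a single frozen model can host many task-specific subnetworks.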

35 of 227

Supermasks in Pre-trained Models

  • For back-propagation, the gradient with respect to the binarized mask can be used as a noisy estimate of the gradient with respect to the underlying real-valued mask (also known as the straight-through estimator [Bengio et al., 2013])
  • Such supermasks have similarly been found useful in pre-trained models like BERT [Zhao et al., 2020]
  • Similar to earlier work [Mallya et al., 2018], they learn a real-valued mask m that is then binarized via a thresholding function: b_i = 1 if m_i ≥ τ, else 0
  • The embedding layer is not masked

36 of 227

Pruning Pre-trained Models

  • Pruning does not consider how weights change during fine-tuning
  • Magnitude pruning: keep weights farthest from 0
  • Movement pruning [Sanh et al., 2020]: keep weights that move the most away from 0

Fine-tuned weights stay close to their pre-trained values. Magnitude pruning (left) selects weights that are far from 0; movement pruning (right) selects weights that move away from 0.
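The difference between the two selection criteria can be sketched with NumPy. This is an illustrative simplification: movement pruning actually accumulates first-order scores during training; here we approximate "movement away from 0" by the change in magnitude between pre-trained and fine-tuned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=1000)                    # pre-trained weights
theta_ft = theta_pre + 0.1 * rng.normal(size=1000)   # fine-tuned weights (stay close)

k = 100  # number of weights to keep

# Magnitude pruning: keep fine-tuned weights farthest from 0.
mag_keep = np.argsort(-np.abs(theta_ft))[:k]

# Movement pruning (rough proxy): keep weights whose magnitude
# increased the most during fine-tuning, i.e. that moved away from 0.
movement = np.abs(theta_ft) - np.abs(theta_pre)
mov_keep = np.argsort(-movement)[:k]
```

Because fine-tuned weights stay close to their pre-trained values, the two criteria select largely different sets of weights.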

37 of 227

Diff Pruning

  • For a model with θ′ = θ + ϕ, we can perform pruning based only on the magnitude of the module parameters ϕ rather than the updated parameters θ + ϕ
  • Diff pruning [Guo et al., 2021] prunes the module parameters ϕ via magnitude pruning to make them sparse

38 of 227

Zeroing Pre-trained Weights vs Keeping Them

  • In practice, should we set pruned weights to 0 or leave them at their pre-trained values? I.e., θ′ = θ + ϕ or θ′ = (θ + ϕ) ⊙ b, where ϕ_i = 0 if b_i = 0
  • Setting weights to 0 performs better for randomly initialized models [Zhou et al., 2019]
  • However, zeroing weights removes some of the modular properties for pre-trained models

39 of 227

Lottery Ticket Sparse Fine-tuning

  • Lottery Ticket Sparse Fine-tuning [Ansell et al., 2022] learns sparse subnetworks based on magnitude pruning of the module parameters ϕ
  • Keeping the pre-trained weights allows combining subnetworks for different settings: θ′ = θ + ϕ_task + ϕ_lang
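The additive combination of sparse fine-tunings can be sketched as below. This is an illustrative NumPy toy: the sparse vectors are drawn at random as stand-ins for sparse fine-tunings actually learned for a task and a language.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1000)            # shared pre-trained weights

def sparse_diff(size, frac, rng):
    """A sparse update vector: non-zero entries on a small fraction of weights."""
    phi = np.zeros(size)
    idx = rng.choice(size, size=int(size * frac), replace=False)
    phi[idx] = 0.1 * rng.normal(size=idx.size)
    return phi

phi_task = sparse_diff(theta.size, 0.05, rng)   # e.g. learned for NER
phi_lang = sparse_diff(theta.size, 0.05, rng)   # e.g. learned for Swahili

# Composition: apply both sparse updates to the same pre-trained model.
theta_combined = theta + phi_task + phi_lang
```

Because the pre-trained weights are kept (not zeroed), the two sparse modules can simply be summed onto the shared base model.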

40 of 227

Parameter Composition

  • Sparse Subnetworks��
  • Structured Composition��
  • Low-rank Composition

41 of 227

Structured Composition

  • We can additionally impose a structure on the weights that we select
  • Specifically, we only modify the weights that are associated with a pre-defined group G

42 of 227

Group-based Fine-tuning

  • Most common setting: each group corresponds to a layer; only update the parameters associated with certain layers
  • Groups can also relate to more fine-grained components

43 of 227

Bias-only Fine-tuning

  • A practical choice: updating only the biases
  • Computing bias gradients does not require storing intermediate activations [Cai et al., 2020]!
  • In NLP, BitFit [Ben-Zaken et al., 2022] implements the same approach
  • Query and second MLP layer biases are most important!
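Bias-only fine-tuning amounts to marking only the bias parameters as trainable. A minimal sketch (toy parameter dictionary; names like `W1`/`b1` are illustrative, not from any specific library):

```python
import numpy as np

# Toy two-layer network parameters.
params = {
    "W1": np.zeros((4, 4)), "b1": np.zeros(4),
    "W2": np.zeros((4, 2)), "b2": np.zeros(2),
}

# BitFit-style selection: only bias terms are trainable.
trainable = {name for name in params if name.startswith("b")}

num_total = sum(p.size for p in params.values())
num_train = sum(params[name].size for name in trainable)
```

Here only 6 of 30 parameters would be updated; for real Transformers the bias fraction is far smaller still.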

44 of 227

Structured Sparsity

  • Structure can also be combined with sparse methods
  • We can prune entire groups such as certain filters in a CNN [Anwar et al., 2015]
  • In NLP, attention heads in pre-trained models have been pruned [Voita et al., 2019; Michel et al., 2019]

Distribution of BERT attention heads after pruning a single head based on MNLI performance [Michel et al., 2019]

45 of 227

Structured Sparsity

  • Information about the model's structure is widely applicable
  • Diff pruning can also use structure by encouraging sharing masks between groups [Guo et al., 2021]
  • More methods that use structure information in the next sections

46 of 227

Parameter Composition

  • Sparse Subnetworks��
  • Structured Composition��
  • Low-rank Composition

47 of 227

Low-rank Composition

  • Another useful inductive bias: module parameters should lie in a low-dimensional space
  • Li et al. [2018] show that models can be optimized in a low-dimensional, randomly oriented subspace rather than the full parameter space

Standard fine-tuning: θ′ = θ + ϕ, with ϕ ∈ R^|θ|

Low-rank fine-tuning: θ′ = θ + Pϕ_low, where P ∈ R^{|θ|×d} is a random projection matrix and ϕ_low ∈ R^d

Everything but ϕ_low is fixed. Only d dimensions are optimized.
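The subspace parameterization above can be sketched directly in NumPy; this is an illustrative toy (the scaling of the random projection is a common convention, not prescribed by the slide).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 10_000, 50                          # full vs intrinsic dimensionality

theta0 = rng.normal(size=D)                # frozen pre-trained parameters
P = rng.normal(size=(D, d)) / np.sqrt(d)   # fixed random projection matrix
phi = np.zeros(d)                          # the only trainable parameters

def parameters(phi):
    # theta' = theta0 + P @ phi: optimization happens in a d-dim subspace.
    return theta0 + P @ phi

assert np.allclose(parameters(phi), theta0)  # phi = 0 recovers the base model
```

Gradient descent then only touches the d entries of `phi`, while `theta0` and `P` stay fixed.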

48 of 227

Low-rank Fine-tuning

  • In our notation, low-rank fine-tuning takes the form θ′ = θ + Pϕ_low, where ϕ_low ∈ R^d and d ≪ |θ|
  • With a dense random projection matrix P ∈ R^{|θ|×d}, this scales as O(|θ|d) in matrix–vector multiply time and storage space
  • For large pre-trained models, we need a more efficient solution

49 of 227

Low-rank Fine-tuning via Matrix Factorization

  • Instead, we can decompose the projection matrix P
  • A classic choice: the Fastfood transform [Le et al., 2013], which requires only O(|θ|) space and O(|θ| log d) time
  • For the multiplication, we do not need to explicitly store P
  • It factorizes P via a sequence of random matrices with special properties: P = HGΠHB, where B is a random diagonal matrix with ±1 entries with equal probability, H is a Hadamard matrix, G is a random diagonal matrix with independent standard normal entries, and Π is a random permutation matrix

50 of 227

Intrinsic Dimensionality

  • Li et al. [2018] refer to the minimum dimension d at which a model achieves within 90% of the full-parameter model's performance as the intrinsic dimensionality of a task
  • Aghajanyan et al. [2021] investigate the intrinsic dimensionality of different NLP tasks and pre-trained models
  • Observations:
    • Intrinsic dimensionality decreases during pre-training
    • Larger models have lower intrinsic dimensionality

51 of 227

Intrinsic Dimensionality

Intrinsic dimension on the MRPC dataset for models of different sizes [Aghajanyan et al., 2021]

52 of 227

Structured Low-rank Methods

  • Aghajanyan et al. [2021] also propose a structure-aware version that allocates one scalar λ_i per layer to learn layer-wise scaling: θ′_i = θ_i + λ_i (Pϕ_low)_i
  • Improves the intrinsic dimension in general
  • However, storing the random matrices still requires a lot of extra space and is slow to train [Mahabadi et al., 2021]

→ We can apply the low-rank constraint only to certain groups

53 of 227

Low-rank Adaptation (LoRA)

  • LoRA [Hu et al., 2022] learns two low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k} that are applied to the self-attention weights: W′ = W + BA, where r is the matrix rank (r ≪ min(d, k)), d the hidden dimension, and k the input dimension
  • In addition, instead of learning a low-rank factorization via a random matrix P, we can learn the projection matrix directly (if it is small enough)
  • In our notation, this looks like the following: θ′ = θ + ϕ with ϕ = vec(BA), where vec(·) denotes vectorization
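A minimal NumPy sketch of the LoRA update (illustrative; real implementations apply this inside attention projections and add a scaling factor):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                    # hidden dim, input dim, rank

W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = np.zeros((d, r))                   # B initialized to zero ...
A = rng.normal(size=(r, k))            # ... so that initially W' = W

def lora_forward(x):
    # W' x = W x + B (A x); the full d×k update is never materialized.
    return W @ x + B @ (A @ x)

x = rng.normal(size=k)
assert np.allclose(lora_forward(x), W @ x)   # no change before training
```

Only `A` and `B` are trained, adding r(d + k) parameters instead of the dk parameters of a full update.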

54 of 227

Parameter Composition Comparison

Computation function per method:

  • Sparse subnetworks: θ′ = θ + ϕ, where ϕ_i = 0 if b_i = 0
  • Structured composition: θ′ = θ + ϕ, where ϕ_i = 0 if i ∉ G
  • Low-rank composition: θ′ = θ + Pϕ_low, where ϕ_low ∈ R^d

55 of 227

Computation Functions Comparison

Parameter composition:

  • Parameter efficiency: methods such as diff pruning require < 0.5% of parameters
  • Training efficiency: pruning requires re-training iterations
  • Inference efficiency: does not increase the model size
  • Performance: e.g., LoRA achieves strong performance
  • Compositionality: subnetworks can be composed

56 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +

This is mainly meant as high-level guidelines. Individual methods may have different trade-offs and mitigate certain weaknesses.

57 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

58 of 227

Input Composition

  • Augment a model's input with a learnable parameter vector ϕ: f_i′(x) = f_{θ_i}([ϕ, x])

59 of 227

Input Composition and Prompting

  • Standard prompting can be seen as finding a discrete text prompt that, when embedded using the model's embedding layer, yields the prompt parameters ϕ

60 of 227

Prompt Tuning

  • ϕ ∈ R^{n×d} is typically a matrix consisting of a sequence of n continuous prompt embeddings

Fine-tuning vs Prompt tuning (adapted from [Li & Liang, 2021])
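Concretely, prompt tuning prepends a trainable embedding matrix to the embedded input. A minimal sketch (illustrative dimensions; a real setup would train `prompt` with the frozen model's loss):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, n = 100, 16, 5               # vocab size, embed dim, prompt length

E = rng.normal(size=(vocab, d))        # frozen embedding matrix
prompt = rng.normal(size=(n, d))       # trainable continuous prompt (ϕ)

def embed_with_prompt(token_ids):
    # Concatenate the learned prompt with the embedded input: [ϕ; x]
    x = E[token_ids]
    return np.concatenate([prompt, x], axis=0)

inputs = embed_with_prompt(np.array([3, 14, 15]))
```

The frozen model then processes a sequence of length n + seq_len; only the n·d prompt parameters are updated.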

61 of 227

Prompt Tuning Only Works Well at Scale

  • Only using trainable parameters at the input layer limits capacity for adaptation

→ Prompt tuning performs poorly at smaller model sizes and on harder tasks [Mahabadi et al., 2021; Liu et al., 2022]

Prompt tuning vs standard fine-tuning and prompt design across T5 models of different sizes [Lester et al., 2021]

Prompt tuning only matches fine-tuning at the largest model size

62 of 227

Multi-Layer Prompt Tuning

  • Instead of learning parameters only at the input layer, we can learn them at every layer of the model [Li & Liang, 2021; Liu et al., 2022]
  • Continuous prompts in later layers are more important

Prompt tuning

Multi-layer prompt tuning

63 of 227

Multi-Layer Prompt Tuning

  • In practice, continuous prompts are concatenated with the keys and values in the self-attention layer [Li & Liang, 2021]

64 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +

Input composition:

  • Parameter efficiency: only adds a small number of parameters
  • Training/inference efficiency: extends the model's context window
  • Performance: requires large models to perform well
  • Compositionality: continuous prompts have been composed

65 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +

66 of 227

Three Computation Functions

Parameter Composition

Input Composition

Function Composition

67 of 227

Function Composition

  • Function composition augments a model's functions with new task-specific functions: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)
  • Most commonly used in multi-task learning where modules of different tasks are composed.
  • Relevant surveys: [Ruder, 2017; Crawshaw, 2020]

→ Focus in this tutorial is on functions that can be added to a pre-trained model

68 of 227

Adapters

  • Main purpose of functions added to a pre-trained model is to adapt it
  • Design of adapters is model-specific�
  • ResNet adapter in CV: batch normalization and 1x1 convolution [Rebuffi et al., 2017]

→ Functions are also known as ‘adapters’

Residual adapter in a ResNet (adapter parameters are in blue) [Rebuffi et al., 2017]

69 of 227

Adapters in Transformer Models

  • In NLP, an adapter in a Transformer layer typically consists of a feed-forward down-projection W_down ∈ R^{d×b}, a feed-forward up-projection W_up ∈ R^{b×d}, and an activation function σ [Houlsby et al., 2019]: f_ϕ(h) = h + σ(h W_down) W_up
  • σ is commonly a ReLU, but other variants have been explored [Stickland & Murray, 2019]
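A minimal NumPy sketch of such a bottleneck adapter (illustrative initialization: `W_up` starts at zero so the adapter is an identity function before training, a common near-identity initialization choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 64, 8                           # hidden size, bottleneck size

W_down = 0.01 * rng.normal(size=(d, b))
W_up = np.zeros((b, d))                # near-identity initialization

def adapter(h):
    # Bottleneck adapter with residual connection: h + ReLU(h W_down) W_up
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(3, d))
assert np.allclose(adapter(h), h)      # identity at initialization (W_up = 0)
```

The adapter adds only 2·d·b parameters per insertion point, small compared to the d×d projections of the Transformer layer itself.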

70 of 227

Adapters in Transformer Models

  • The adapter is usually placed after the multi-head attention and/or after the feed-forward layer�
  • Most approaches have used this bottleneck design with linear layers

71 of 227

Compact Adapter (Compacter)

  • Compacter [Mahabadi et al., 2021] reparameterizes the matrices in the adapter as:

W = Σᵢ₌₁ⁿ Aᵢ ⊗ Bᵢ

where Aᵢ ∈ R^{n×n} is shared between all layers, Bᵢ ∈ R^{(k/n)×(d/n)} is a low-rank matrix, n is a hyper-parameter (between 4 and 12), and ⊗ is the Kronecker product.

  • Compacter reduces adapter parameters by a factor of 10 and achieves similar or better performance
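The Kronecker-product reparameterization can be sketched with `np.kron`. This is an illustrative simplification: here the Bᵢ are dense, whereas Compacter additionally factorizes each Bᵢ as a product of low-rank vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 64, 64, 4                    # adapter in/out dims, number of terms

# W = sum_i A_i ⊗ B_i with A_i ∈ R^{n×n} (shared across layers)
# and B_i ∈ R^{(k/n)×(d/n)} (dense here for simplicity).
A = rng.normal(size=(n, n, n))
B = rng.normal(size=(n, k // n, d // n))

W = sum(np.kron(A[i], B[i]) for i in range(n))
```

A full k×d matrix has 4096 parameters; the factors above have n·n² + n·(k/n)·(d/n) = 1088, before Compacter's further low-rank factorization of the Bᵢ.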

72 of 227

Sequential and Parallel Adapters

  • Adapters can be routed sequentially or in parallel
  • Sequential adapters are inserted between functions: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)
  • Parallel adapters are applied in parallel: f_i′(x) = f_{θ_i}(x) + f_ϕ(x)

A sequential adapter [Houlsby et al., 2019]

Two parallel adapters [Stickland & Murray, 2019]

73 of 227

Benefits of Adapters

  • Increased robustness [He et al., 2021; Han et al., 2021]
  • Increased sample efficiency

BERT test performance distributions over 20 runs with different learning rates [He et al., 2021]

Results on GLUE with different numbers of training samples per task [Mahabadi et al., 2021]

74 of 227

Rescaling

  • Instead of learning a full function, even rescaling via element-wise multiplication can be powerful: f_i′(x) = s ⊙ f_{θ_i}(x)
  • Commonly applied to normalization parameters, e.g., batch normalization parameters in CV [Bilen et al., 2017], layer normalization in NLP [Houlsby et al., 2019]
  • Allows the model to select parameters that are more or less important for a given task
  • Compatible with other methods such as LoRA, which includes a tunable scalar scaling parameter
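Rescaling is just an element-wise multiplication with a learned vector; a minimal sketch (illustrative; initializing the scaling vector to ones makes the module an identity before training):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
s = np.ones(d)                          # learned rescaling vector, init to 1

def rescale(h):
    # Element-wise rescaling of a hidden representation: s ⊙ h
    return s * h

h = rng.normal(size=(2, d))
assert np.allclose(rescale(h), h)       # identity at initialization
```

Only d parameters per rescaled representation are learned, which is what makes methods like IA3 so parameter-efficient.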

75 of 227

IA3

  • IA3 [Liu et al., 2022] multiplies learned vectors with the keys and values in self-attention and the intermediate activations in the feed-forward network of a Transformer

IA3 [Liu et al., 2022] in the Transformer model

76 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +

Function composition:

  • Parameter efficiency: adapters depend on the hidden size
  • Training efficiency: does not require gradients of frozen parameters
  • Inference efficiency: new functions increase the number of operations
  • Performance: match or outperform standard fine-tuning
  • Compositionality: adapters can be composed

77 of 227

Computation Functions Comparison

                        Param eff.  Train eff.  Infer. eff.  Performance  Compositionality
Parameter composition       +           -           ++            +              +
Input composition           ++          --          --            -              +
Function composition        -           +           -             ++             +

78 of 227

Module Parameter Generation

  • So far, the modules for different tasks have been optimized separately
  • Modules may benefit from sharing an underlying structure

→ We can use a small neural network, a hyper-network [Ha et al., 2017], to generate the module parameters instead

  • Hyper-networks are most effective when generating modules based on relevant metadata

79 of 227

Module Parameter Generation

  • Conditioned on:
    • task embeddings;
    • language embeddings (optionally generated based on typological features)
    • layer id to make the hyper-network more efficient

Hyper-X [Üstün et al., 2022] conditions on task, language, and layer id to generate adapter parameters

80 of 227

Unifying Computation Functions

  • He et al. [2022] show that LoRA, prefix tuning, and adapters can be expressed with a similar functional form
  • Specifically, all methods can be expressed as modifying a model's hidden representation h: h′ = h + λ Δh, where Δh is the modification computed by the module
  • Analogously, we can express parameter and input composition methods as function composition

81 of 227

Combining Computation Functions

  • Different computation functions are complementary
  • We can combine different approaches in the same model
  • UniPELT [Mao et al., 2022] combines adapters, prefix tuning, and LoRA in the same architecture with a learned scalar gate for each


82 of 227

Combining Computation Functions

  • Sparsity, structure, low-rank approximations, rescaling, and other properties can also be applied and combined in many settings
  • For instance, He et al. [2022] propose a scaled parallel adapter (e) that adds a scaling parameter to a parallel adapter

Different adapter variants [He et al., 2022]

83 of 227

Performance Comparison

Prompt tuning underperforms the other methods due to limited capacity

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

84 of 227

Performance Comparison

Intrinsic dimension uses the smallest number of parameters but has a large memory footprint and poorer performance

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

85 of 227

Performance Comparison

Fine-tuning biases (BitFit) only has a small memory footprint but achieves lower performance

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

86 of 227

Performance Comparison

Function composition methods such as adapter, compacter, and IA3 achieve the best performance but add more parameters

Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (220M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]

87 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

88 of 227

Introduction to Routing


94 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


96 of 227

Fixed Routing

Routing decision is made a priori.

98 of 227

Fixed Routing

Multi-task Learning

BERT

NLI

NER

QA


103 of 227

AdapterHub.ml


105 of 227

AdapterHub.ml

[Illustration: an adapter checkpoint (~3 MB) vs a full model checkpoint (~500 MB) for models such as BERT, RoBERTa, T5]

106 of 227

AdapterHub.ml

AdapterHub is an ever-evolving multi-task model.

The component development is distributed across the community.


114 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


116 of 227

Learned Routing

Parametrize the routing function when routing decisions cannot be made a priori.

120 of 227

Learned Routing - Challenges

  • Training Instability
    • Router and Modules are untrained => routing dynamics might never stabilize.
  • Module Collapse
    • The router falls into a local optimum, choosing one or two modules exclusively.
  • Overfitting
    • Deep modular networks risk overfitting to the noise.

Plot courtesy of Rosenbaum et al. (2017)

121 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


127 of 227

Hard Learned Routing

Discrete decisions are not amenable to be learned through vanilla gradient descent.

  • Reinforcement Learning
  • Evolutionary Algorithms
  • Stochastic Reparametrization
    • AdaShare (Sun et al. 2020)
    • Polytropon (Ponti et al. 2022)

128 of 227

Hard Learned Routing - Stochastic Reparametrization

Gumbel Softmax (Jang et al., 2017)
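The Gumbel-softmax trick lets a router sample approximately one-hot routing decisions while remaining differentiable. A minimal NumPy sketch (forward pass only; the temperature `tau` controls how close the sample is to one-hot):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable, approximately one-hot sample (Jang et al., 2017)."""
    # Gumbel(0, 1) noise via the inverse-CDF transform.
    gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = (logits + gumbel) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5])     # router scores for 3 modules
sample = gumbel_softmax(logits, tau=0.1)  # low tau → near one-hot
```

As tau → 0 samples approach hard one-hot module selections; larger tau gives smoother mixtures with lower-variance gradients.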

129 of 227

Routing

Fixed Routing

Learned Routing

Hard Learned Routing

Soft Learned Routing


131 of 227

Soft Learned Routing

To sidestep discrete selection of modules, several works propose soft routing methods.

The router learns a probability distribution over the available modules: α = softmax(r(x)), where α_i is the weight assigned to module i.

132 of 227

Soft Learned Routing

Mixture of Experts (MoE)

⇒ The output is a weighted sum over all modules (aka “experts”): h = Σ_i α_i f_{ϕ_i}(x)

Problem: All modules are always activated

⇒ Significant increase of computational cost.

133 of 227

Soft Learned Routing

Top-k routing: only select the top-k modules according to the router distribution.
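A minimal NumPy sketch of top-k routing (illustrative: a linear router, random modules, and renormalization of the selected weights; real MoE layers add load-balancing losses and batched dispatch):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def top_k_route(x, W_r, modules, k=2):
    """Route input x to the k highest-scoring modules and mix their outputs."""
    scores = softmax(W_r @ x)
    top = np.argsort(-scores)[:k]                 # indices of selected modules
    weights = scores[top] / scores[top].sum()     # renormalize over the top-k
    return sum(w * modules[i](x) for w, i in zip(weights, top))

d, n_modules = 8, 4
W_r = rng.normal(size=(n_modules, d))
# Each module is a simple linear map with its own weights.
modules = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(n_modules)]

x = rng.normal(size=d)
out = top_k_route(x, W_r, modules, k=2)
```

Only k of the n modules are evaluated per input, which is what keeps the computational cost bounded as the number of experts grows.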


135 of 227

Soft Learned Routing

Top-1 routing: only select the top module.



139 of 227

Token Level Routing - MoE

Each FFN component of the transformer is considered a module (aka “Expert”).

Routing is performed on the token level.

Load Balancing uniformly distributes tokens across hardware accelerators.

140 of 227

Token Level Routing - Load Balancing

Illustration courtesy of Fedus et al. (2022)


142 of 227

Token Level Routing - MoE


Problem: Load balancing restricts the system from routing an entire example to a single module.

143 of 227

Token Level Routing

Problem: Load balancing restricts the system from routing an entire example to a single module.

Syntactically and semantically similar words (in contrast to sentences or phrases) are routed to the same modules.

Lewis et al. (2021)

144 of 227

Example Level Routing

Instead of token level routing, each token of a sentence can be routed to the same module.

Routing can be achieved based on metadata, such as language or task id, or on the pooled token representations.


145 of 227

Layer Routing

The router can also make a global decision for how to route at each layer.

146 of 227

Routing - Hybrids

It is possible to combine the concepts of fixed and learned routing:

  • Fixed routing, e.g., based on language id
  • Learned routing (i.e., MoE), e.g., for heterogeneous data

147 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-efficient Models

Modular Models

Applications

148 of 227

Aggregation Functions

Routing: How do we select different modules during training?

Here: How can we aggregate modules in order to combine the respective information?

* (Often these concepts are inseparable, i.e. selection and aggregation are performed simultaneously)

Computation Functions: How do we compose shared components with modules?

Here: How do we compose multiple modular components?

149 of 227

Notation

Let a neural network f be decomposed into a composition of sub-functions f = f_L ∘ … ∘ f_1. Each f_i has parameters θ_i.

A module with parameters ϕ can modify the i-th sub-function as follows:

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

In practice, typically only the module parameters ϕ are updated while θ_i is fixed.
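The three composition types can be illustrated on a single linear sub-function f_θ(x) = xθ. A minimal NumPy sketch (the ϕ names and dimensions are illustrative, not from any specific method):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=(d, d))              # frozen sub-function parameters
x = rng.normal(size=(3, d))

# Parameter composition: f'(x) = f_{theta (+) phi}(x)
phi_delta = 0.01 * rng.normal(size=(d, d))   # e.g. a LoRA-style weight delta
y_param = x @ (theta + phi_delta)

# Input composition: f'(x) = f_theta([phi, x])
phi_prompt = rng.normal(size=(2, d))         # e.g. a soft prompt
y_input = np.concatenate([phi_prompt, x]) @ theta

# Function composition: f'(x) = f_phi(f_theta(x))
phi_adapter = 0.01 * rng.normal(size=(d, d)) # e.g. an adapter layer
h = x @ theta
y_func = h + h @ phi_adapter                 # adapter with residual connection
```

In all three cases only the ϕ arrays would receive gradient updates; θ stays frozen.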


150 of 227

Module Composition

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

151 of 227

Module Composition

  • Parameter composition: f_i′(x) = f_{θ_i ⊕ ϕ}(x)
  • Input composition: f_i′(x) = f_{θ_i}([ϕ, x])
  • Function composition: f_i′(x) = f_ϕ ∘ f_{θ_i}(x)

152 of 227

Module Composition

153 of 227

Module Composition

154 of 227

Parameter Interpolation

Mode connectivity: The minima found by two networks are connected by a path of non-increasing error.

Frankle et al. (2020) and Neyshabur et al. (2020) demonstrate that linear mode connectivity is closely linked to the Lottery Ticket Hypothesis.

=> When interpolating between models trained on different tasks but initialised with the same set of weights, the models tend to stay in the same loss basin.

155 of 227

Parameter Interpolation (Recap)

Lottery Ticket Sparse Fine-tuning: Keep the pre-trained weights, and combine module parameters learned for different settings [Ansell et al., 2022]: θ = θ₀ + ϕ_lang + ϕ_task, where the ϕ are sparse difference vectors.
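The composition of sparse fine-tunings can be sketched as plain addition of sparse difference vectors (masks and values below are illustrative toys):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=10)          # frozen pre-trained weights

def sparse_diff(mask_idx, values):
    """A sparse fine-tuning: non-zero only on the selected parameter subset."""
    phi = np.zeros_like(theta_pre)
    phi[mask_idx] = values
    return phi

phi_lang = sparse_diff([0, 3], [0.5, -0.2])   # e.g. a language module
phi_task = sparse_diff([3, 7], [0.1, 0.4])    # e.g. a task module

theta = theta_pre + phi_lang + phi_task       # composed model parameters
```

Parameters outside both masks are untouched, so modules trained separately compose without a joint training step.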

156 of 227

Parameter Interpolation

Ilharco et al. (2022) propose to edit entire models with task vectors.

E.g., tasks can include toxic language generation and general language modelling.

By performing the arithmetic negation operation (subtracting the toxicity task vector from the pre-trained weights), their new model generates less toxic text.
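A minimal sketch of task-vector negation (values are illustrative; the λ coefficient is a hyperparameter of the method):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=8)                        # pre-trained weights
theta_toxic = theta_pre + 0.3 * rng.normal(size=8)    # fine-tuned on toxic data

tau = theta_toxic - theta_pre        # task vector for the toxicity "task"
lam = 1.0                            # scaling coefficient
theta_edited = theta_pre - lam * tau # negation: steer away from toxicity
```

The same arithmetic with `+` instead of `-` adds a capability; sums of several task vectors approximate multi-task models.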

157 of 227

Input Composition

158 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

159 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.

160 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.

Model

Prompt 1

Prompt n

Text Input

161 of 227

Input Composition

Input aggregation comes naturally to modular input compositional functions.

Aggregating the modules boils down to concatenating all prompts.
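Concretely, aggregating n soft prompts is a single concatenation along the sequence dimension (prompt names and lengths below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
prompt_task = rng.normal(size=(3, d))   # e.g. a task-specific soft prompt
prompt_lang = rng.normal(size=(2, d))   # e.g. a language-specific soft prompt
token_embeds = rng.normal(size=(6, d))  # embedded text input

# Aggregation = prepend all prompts to the token embeddings
model_input = np.concatenate([prompt_task, prompt_lang, token_embeds], axis=0)
```

The frozen model then processes the lengthened sequence as usual; only the prompt rows would be trainable.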

162 of 227

Function Composition

For r modules we get (latent) representations h₁, …, h_r.

163 of 227

Function Composition

Weighted Representation Averaging:

Learn the weights α to interpolate over the hidden representations: h′ = Σᵢ αᵢ · hᵢ

164 of 227

Function Composition

Weighted Representation Averaging:

Learn the weights α to interpolate over the hidden representations: h′ = Σᵢ αᵢ · hᵢ

MoE: the weights are produced by a learned router, e.g. α = softmax(W_r x), often with top-k selection.

165 of 227

Function Composition

Weighted Representation Averaging:


=> MoEs automatically perform routing and function aggregation.

166 of 227

Function Composition

Weighted Representation Averaging:

Ma et al. (2018) propose to learn one aggregation function per task in a multi-task setup.


167 of 227

Function Composition

Weighted Representation Averaging:

Gururangan et al. (2022) pre-train modular components for different textual domains. When utilising the pre-trained modules on unseen data, they weight the output representations of the respective domain modules according to the posterior distribution over the input examples.

168 of 227

Function Composition

Weighted Representation Averaging:

Fixed Routing: Module representations are often averaged without weighting (Zhang et al., 2022; Chronopoulou et al., 2022).

Hard routing methods: The representations of all active modules are averaged, such as in Polytropon (Ponti et al., 2022), or summed, as in PathNet (Fernando et al., 2017).
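Learned weighted averaging versus the unweighted mean used in fixed routing can be sketched as (values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
r, d = 3, 4
h = rng.normal(size=(r, d))          # one representation per module

# Learned weighting: alpha = softmax(logits), h' = sum_i alpha_i * h_i
logits = rng.normal(size=r)          # trainable aggregation logits
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()
h_weighted = (alpha[:, None] * h).sum(axis=0)

# Fixed routing often just averages the active modules without weighting
h_mean = h.mean(axis=0)
```

Summation (as in PathNet) drops the normalisation; attention-based aggregation would instead compute the weights from the module outputs themselves.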

169 of 227

Function Composition

Attention-Based Representation Aggregation:

Instead of inferring the weighting before a module has performed its transformation, we can make the aggregation decision after.

170 of 227

Function Composition

Attention-Based Representation Aggregation:

Instead of inferring the weighting before a module has performed its transformation, we can make the aggregation decision after.

171 of 227

Function Composition

Disadvantage of both weighted and attention-based representation averaging:

They require a full forward pass through all modules, even if they contribute only marginally.

=> Significant increases in time and space complexity. While this can be mitigated by pruning (i.e., dropping) some modules during inference (Rücklé et al., 2021), latency still remains an issue for scalability.

172 of 227

Function Composition

Sequential Aggregation:

Pass through multiple modules, where the input to the next module is the output of the previous one: h = f_{ϕ₂}(f_{ϕ₁}(x))
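Sequential aggregation of, e.g., a language adapter followed by a task adapter can be sketched as nested calls (all weights here are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
w_lang = 0.1 * rng.normal(size=(d, d))   # first module in the chain
w_task = 0.1 * rng.normal(size=(d, d))   # second module in the chain

def adapter(h, w):
    return h + h @ w                     # residual adapter (toy form)

x = rng.normal(size=(3, d))
out = adapter(adapter(x, w_lang), w_task)   # f_task(f_lang(x))
```

Unlike averaging, this requires only one forward pass per module in the chain, but the order of modules matters.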

173 of 227

Agenda

1) Computation Function

2) Routing

3) Aggregation

Parameter-�efficient Models

Modular�Models

Applications

174 of 227

Applications

(An Illustrative Non-Exhaustive Subset)

175 of 227

Properties of Modular and Parameter-Efficient Tuning

  • Avoiding negative consequences of full-model fine-tuning:
    • Bypassing catastrophic forgetting and interference

Multi-Task Learning: a single MT-Model (𝚹) is trained jointly on Task 1, Task 2, and Task 3.

Sequential Fine-Tuning: Model (𝚹0) → Task 1 → Model (𝚹1) → Task 2 → Model (𝚹2).

176 of 227

Properties of Modular and Parameter-Efficient Tuning

  • Avoiding negative consequences of full-model fine-tuning:
    • Bypassing catastrophic forgetting and interference
    • Bypassing the creation of large full-model copies; boosting efficiency and compactness

  • Positive aspects:
    • Positive transfer
    • Compositionality and reusability of modules: systematic generalisation
    • Local, asynchronous updates: parameter efficiency
    • Scaling (e.g., through MoE) and specialisation

We will illustrate most of these properties via applications in (low-resource) multilingual NLP as a case study.

177 of 227

Parameter-Efficient Fine-Tuning

A basic scenario, seen many times across many applications

  • Fix/freeze the base model

  • Initialise a task-specific module

  • All fine-tuning updates are forced into the task-specific module

  • Compare against full-model fine-tuning and preceding PEFT approaches (on GLUE?)

  • Choose which benefit you want to stress
    • Improve performance with the same parameter budget
    • Maintain performance with a smaller parameter budget

178 of 227

Parameter-Efficient Fine-Tuning

A Multi-Task Scenario (in Monolingual Contexts)

  • Boost cross-task sharing via hyper-networks

  • Task-conditioned hyper-networks

  • Learnable task embeddings

  • Combining sharing with specialisation

179 of 227

Challenges of Cross-Lingual Transfer

  • Many NLP tasks share common knowledge about language (e.g. linguistic representations, structural similarities)

  • Languages share common structure (on the lexical, syntactic, and semantic level)

  • Annotated data is rare: make use of as much supervision as is available, as well as of task similarities
  • Empirically, transfer learning has resulted in SOTA for many supervised NLP tasks (e.g. classification, information extraction, QA, etc)

Each data point can be a combination of dedicated modules?

Image courtesy of Yulia Tsvetkov

180 of 227

Standard (Zero-Shot) Cross-Lingual Transfer

Step 1:

Train a multilingual model.

Step 2:

Fine-tune model on a task in a high-resource source language.

Step 3:

Transfer and evaluate the model on a low-resource target language.

Why?

Training data is expensive and not available for many languages, especially ones that are considered “low-resource”.

181 of 227

Challenges of Cross-Lingual Transfer

Deep massively multilingual models such as multilingual-BERT [mBERT; Devlin+, NAACL-18] or XLM-RoBERTa [XLM-R; Conneau+, 2020] achieve

  • SotA results on cross-lingual transfer

BUT

  • Suffer from the so-called “curse of multilinguality”
    • and cannot represent all (7000+) languages in a single model.
  • Performance deteriorates especially for low-resource languages not covered in the training data.

182 of 227

Language Adapters?

Task Knowledge ~ Language Knowledge

MLM (English)

MLM (Quechua)


  • Adapters learn transformations that make the underlying model more suited to a task or language.
  • Using masked language modelling (MLM), we can learn language-specific transformations for e.g. English and Quechua.
  • As long as the underlying model is kept fixed, these transformations are roughly interchangeable.

183 of 227

Modular and Parameter-Efficient Cross-Lingual Transfer

A Case Study of MAD-X [Pfeiffer+, EMNLP-20]


184 of 227

MAD-X

Step 1: Train Language Adapters

We train language adapters for the source language and the target language with masked language modeling on Wikipedia.


MLM (English)

MLM (Quechua)

185 of 227

MAD-X

Step 2: Train a Task Adapter

We train task adapters in the source language stacked on top of the source language adapter.

The language adapter ɸl as well as the transformer weights 𝚹 are frozen, while only the task adapter parameters ɸt are trained.


186 of 227

MAD-X

Step 3: Zero-Shot transfer to unseen language

We replace the source language adapter with the target language adapter, while keeping the “language agnostic” task adapter.
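The three MAD-X steps, reduced to one frozen sub-layer, can be sketched as follows (all weights are illustrative stand-ins, not the actual adapter architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
theta = rng.normal(size=(d, d))               # frozen transformer weights

# Step 1: language adapters trained via MLM (stand-ins here)
adapt_en = 0.1 * rng.normal(size=(d, d))
adapt_qu = 0.1 * rng.normal(size=(d, d))
# Step 2: task adapter trained with English task data, stacked on top
adapt_task = 0.1 * rng.normal(size=(d, d))

def forward(x, lang_adapter):
    h = x @ theta                             # frozen sub-layer
    h = h + h @ lang_adapter                  # language adapter (residual)
    return h + h @ adapt_task                 # task adapter stacked on top

x = rng.normal(size=(2, d))
y_source = forward(x, adapt_en)   # training-time configuration
# Step 3: zero-shot transfer = swap only the language adapter
y_target = forward(x, adapt_qu)
```

The task adapter never sees target-language data; the swap relies on it being roughly "language agnostic".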


187 of 227

Evaluation


(Map legend: Source / Seen / Unseen / Unseen Script)

  • 3 languages were seen by mBERT, but their script is underrepresented.

  • 9 languages are unseen by mBERT, but their scripts were covered during pretraining.

  • 5 languages with unseen scripts.

We evaluate our approaches by transferring from 4 high-resource to 17 low-resource languages.

We evaluate on Named Entity Recognition.

188 of 227

MAD-X: Some Results (NER)


Relative F1 improvement of MAD-X Large over XLM-R Large in cross-lingual NER transfer; along the axes, languages become more low-resource or unseen during pre-training.

The top-right corner represents the realistic scenario of transferring from high-resource to low-resource languages.

189 of 227

MAD-X: Some Results (NER)


mBERT and MAD-X fail for unseen scripts.

190 of 227

When scripts have never been seen


ቢግ ማክ በኣሁኑ ዘመን በአለም ዓቀፍ ደረጃ

Amharic script has not been seen during mBERT pre-training.


mBERT’s tokenizer represents everything as <UNK>s

mBERT Tokenizer: The, 20, บทค, მისი

191 of 227

Learn a new Tokenizer/Vocabulary


mBERT Tokenizer: The, 20, บทค, მისი

Amharic Tokenizer: The, 20, ቢግ, ማክ, ዘመ, ዓቀ

192 of 227

Results


mBERT and MAD-X fail for unseen scripts.

We can achieve gains for seen, but underrepresented languages using lexical initialization.

MAD-X w/ Tokenizer

193 of 227

Applications in Cross-Lingual Transfer

How can we enable positive transfer across languages?


Generating Adapter Parameters

The main idea: instead of having dedicated single-language adapters, can we learn to generate adapters on the fly, conditioned on language vectors?

This can be seen as factorising adapters into language (and layer) parameters

More efficient than keeping dedicated language adapters

It can even work (in theory) in zero-shot (no text data whatsoever!) and few-shot setups...
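A hyper-network that generates adapter weights from a language (and layer) embedding can be sketched minimally as follows (dimensions and names are illustrative, not MAD-G's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_lang, d_layer = 4, 8, 3
# One shared generator replaces per-language adapter storage
w_gen = 0.1 * rng.normal(size=(d_lang + d_layer, d * d))

def generate_adapter(lang_vec, layer_vec):
    """Produce adapter weights on the fly from the conditioning vectors."""
    cond = np.concatenate([lang_vec, layer_vec])
    return (cond @ w_gen).reshape(d, d)

# e.g. an adapter for some language at layer 3 (vectors are random stand-ins)
adapter_l3 = generate_adapter(rng.normal(size=d_lang), rng.normal(size=d_layer))
```

Because the generator is shared, an unseen language only needs a language vector (e.g. from typological features) to obtain an adapter, enabling the zero-shot setting mentioned above.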

194 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


Generalisation of [Üstün+, EMNLP-20]

Learn to generate (via a hyper-network) a monolithic multilingual adapter by MLM-ing on 95 languages.

Multi-Source Transfer works much better across different tasks: the model is forced to learn more general language-invariant transfer features?

Generated Adapters offer better initialisation for further target-specific MLM-ing in low-data scenarios...

195 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


Notes of Efficiency and Compactness

  • Full fine-tuning of mBERT for 95 languages requires:

95 * 178M = 16.91B parameters (!!)

  • Having MAD-X for 95 languages requires:

728M additional parameters (95 single-language adapters)

  • MAD-G for 95 languages conditioned on language vectors only:

228M parameters

  • MAD-G for 95 languages conditioned on language vectors and layer positions:

38M additional parameters

  • Average per-language training time: ~50x shorter than for MAD-X
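The parameter counts above can be checked with back-of-the-envelope arithmetic (all figures in millions, taken from the slide):

```python
mbert = 178          # mBERT parameters (M)
n_langs = 95

full_ft = n_langs * mbert        # one full model copy per language
print(full_ft / 1000)            # in billions

madx = 728                       # 95 single-language adapters (M)
madg_lang = 228                  # MAD-G, language vectors only (M)
madg_lang_layer = 38             # MAD-G, language + layer conditioning (M)
```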

196 of 227

Applications in Cross-Lingual Transfer

A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]


MAD-G multilingual adapter as a better starting point for further language specialisation

197 of 227

Applications in Multilinguality

A case study of Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer


Generating task-language modules conditioned on task, language, and layer embeddings

A single hyper-network can again be shared across layers

Unseen task-language combinations at inference (composability, reusability), enabled by multi-task learning

198 of 227

MAD-X vs X-Mod

Modularizing Pretrained Language Models

A case study of X-Mod [Pfeiffer+, NAACL-2022]


199 of 227

Modularizing Pretrained Language Models

A case study of X-Mod [Pfeiffer+, NAACL-2022]

Shared Baseline: “Standard” multilingual model with all weights shared

X-Mod: Each language has a set of dedicated modular parameters


Shared

X-Mod

200 of 227

Zero-Shot Cross-Lingual Transfer with X-Mod

Fine-tune only shared parameters

For cross-lingual transfer replace the source language module with the target language module.


201 of 227

Modular Parameter-Efficient Cross-Lingual Transfer

Sparse subnetworks instead of adapters? [Ansell+, ACL-22; Foroutan+, EMNLP-22]


202 of 227

Modular Parameter-Efficient Cross-Lingual Transfer

Sparse subnetworks instead of adapters? [Ansell+, ACL-22; Foroutan+, EMNLP-22]


  • Update only a subset of parameters
  • Different ways to choose the subset (beyond “Lottery Ticket” selection)

Child-Tuning [Xu+, EMNLP-21]

FISH-Tuning [Sung+, NeurIPS-21]

Fisher Information Matrix

Levels of sparsity are crucial
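FISH-style subset selection can be sketched as keeping the parameters with the largest empirical Fisher information, approximated by squared gradients (values below are an illustrative toy):

```python
import numpy as np

def fish_mask(grads, keep_ratio):
    """Boolean mask: True for parameters that will receive updates."""
    fisher = grads ** 2                      # empirical Fisher diagonal
    k = max(1, int(keep_ratio * fisher.size))
    thresh = np.sort(fisher.ravel())[-k]     # k-th largest score
    return fisher >= thresh

g = np.array([0.1, -2.0, 0.5, 0.05, 1.0])   # toy per-parameter gradients
mask = fish_mask(g, keep_ratio=0.4)          # keep the top 40%
```

During fine-tuning, gradients are multiplied by the (fixed) mask, so only the selected subnetwork moves; the sparsity level (`keep_ratio`) is the crucial hyperparameter noted above.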

203 of 227

Combining Language and Ranking Modules for IR

A case study of multilingual IR with bottleneck adapters and sparse subnetworks


Ranking modules, trained only with English relevance assessment, can be transferred to other languages and also used for CLIR

204 of 227

Extending Modular Cross-Lingual Transfer

Phylogeny-Based Hierarchical Adapters?


  • Leverage closely related languages in a structured, linguistically-informed manner
  • Benefits for very low-resource languages with ‘higher-resource siblings’ (DEP, POS)

205 of 227

Ensembling Adapters at Test Time

A case study of dealing with low-resource language varieties at test time


  • Simple strategy 1: Use a Language Adapter from a related language
  • Simple strategy 2: Average outputs of multiple related adapters
  • Better solution: entropy-minimised ensemble of adapters (EMEA)
    • Learnable ensembling weights at the sentence level
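The EMEA idea can be sketched as tuning ensemble weights over related-language adapters to minimise prediction entropy. Below, a single crude coordinate step stands in for the gradient steps of the actual method; predictions and weights are illustrative:

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

def ensemble(preds, logits_w):
    w = np.exp(logits_w - logits_w.max())
    w /= w.sum()                             # softmax over adapter weights
    return (w[:, None] * preds).sum(axis=0)

preds = np.array([[0.6, 0.3, 0.1],           # predictions under adapter 1
                  [0.2, 0.5, 0.3]])          # predictions under adapter 2
logits_w = np.zeros(2)                       # start from a uniform ensemble

# Keep a nudge to a weight only if it lowers prediction entropy
for i in range(len(logits_w)):
    trial = logits_w.copy()
    trial[i] += 0.5
    if entropy(ensemble(preds, trial)) < entropy(ensemble(preds, logits_w)):
        logits_w = trial

p_final = ensemble(preds, logits_w)
```

The intuition: a low-entropy (confident) ensembled prediction indicates that the weighting matches the test sentence's language variety.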

206 of 227

Similar Ideas Have Been Explored in NMT

A case study of using bilingual (translation-direction) adapters [Bapna and Firat, EMNLP-19]


  • Avoiding interference; reaping benefits of transfer on low-resource languages without hurting performance for high-resource languages
  • Specialising a multilingual model >> training a bilingual model from scratch

  • Bilingual adapters for cross-lingual transfer [Parović+, NAACL-22]
    • Boosting a particular transfer direction (BAD-X vs MAD-X)

207 of 227

Similar Ideas Have Been Explored in NMT

A case study of using monolingual NMT adapters


  • We need 2L instead of L(L-1) adapters
  • [Philip+, EMNLP-20] require parallel data to train the adapters…
  • [Üstün+, EMNLP-21] train denoising adapters using monolingual data only:
    • The power of modular design: recombining adapters for unseen translation directions

208 of 227

Generating Adapters for NMT

Another direct analogy with cross-lingual transfer [Baziotis+, arXiv-22]


  • Generating language-specific and language-pair adapters
  • Same benefits as before when using hyper-networks: improved efficiency, quicker convergence, leveraging positive transfer

209 of 227

Generating Adapters for NMT

[Baziotis+, arXiv-22]


  • Improving performance and keeping the parameter budget

versus

  • Maintaining performance and reducing the parameter budget

210 of 227

Similar Ideas Have Been Explored in NMT

A case study of using language-specific subnetworks for Multilingual NMT [Lin+, ACL-21]


  • The motivation is the same as before: avoid interference and create specialised modules; sparsity is the key

211 of 227

Mixture-of-Experts: Large-Scale Modular Learning

A case study of token-level versus task-level MoEs [Kudugunta+, Findings of EMNLP-21]


  • Efficiency considerations at inference
  • What are ‘tasks’? The actual tasks? Domains? Translation directions?

212 of 227

Similar Ideas Have Been Explored for Domain Adaptation

A case study of hierarchical domain adapters [Chronopoulou+, NAACL-22]


213 of 227

Modular and Parameter-Efficient Continual Learning

A case study of adding new domains to task-oriented dialogue systems [Madotto+, EMNLP-21]


  • Apply a similar idea to extending multilingual models with new languages? [Badola+, arXiv-22]
  • Other parameter-efficient strategies beyond adapters - Modular Networks? [Veniat+, ICLR-21]

Task id is available at training but not at inference

Select the module with the lowest perplexity score (i.e., highest confidence)
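Module selection without a task id at inference boils down to scoring the input under each module's language model and picking the lowest perplexity. A minimal sketch (scores are illustrative):

```python
import numpy as np

def select_module(mean_nlls):
    """mean_nlls[i]: mean token negative log-likelihood under module i.

    Lowest perplexity = exp(mean NLL) = highest confidence.
    """
    perplexities = np.exp(mean_nlls)
    return int(np.argmin(perplexities))

# Toy per-module NLLs for one input; module 1 is most confident
chosen = select_module(np.array([2.1, 0.7, 1.5]))
```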

214 of 227

Prompt-Free Methods for Few-Shot Learning

A case study of PERFECT [Mahabadi+, ACL-22]


Bypassing the use of patterns and verbalizers: prompt-free methods

  1. Pattern-free implicit task descriptions

  • Multi-token label embeddings: a fixed number of tokens per label

  • (Optional) Inference via prototypical networks

More PEFT-based versus prompt-based few-shot learning?

215 of 227

Multi-Modal Learning

A case study of MAGMA [Eichenberg+, EMNLP-22]


  • Autoregressively generate text from arbitrary combinations of visual and textual input
  • Text generation is conditioned on visual prefix: an instance of prefix-tuning with adapters

216 of 227

Multilingual Multi-Modal Learning

A case study of enabling cross-lingual VQA [Pfeiffer+, Findings of ACL-22; Liu+, arXiv-22]


  • Each modality is passed through a modality-specific task adapter; the outputs are concatenated.
  • Extending a monolingual multimodal model to deal with more languages
  • Extending a text-only multilingual model to deal with another modality

217 of 227

Task Specialisation of Sentence Encoders

A case study of (efficient) intent detection


Task specialisation knowledge (i.e., contrastive learning updates) is inserted into bottleneck adapters

218 of 227

Modular Prompting?

A case study of modular prompting for zero-shot cross-lingual generation [Vu+, EMNLP-22]


  • Soft prompts can be factorised into language- and task-specific sub-prompts…
  • …and then recombined at inference to enable unseen combinations
  • Modularity of the prompts?

219 of 227

Other Applications


The discussed applications are meant to be illustrative, not exhaustive

Similar use of modular PEFT in speech processing applications:

Bottleneck adapters for ASR [Tomanek+, EMNLP-21; Thomas+, ICASSP-22; Sathyendra+; ICASSP-22; Chen+, arXiv-22; Fan and Alwan, InterSpeech-22] and text-to-speech (emerging, 2 arXiv papers in Nov/22)

What is in the modules?

Specialising for languages, demographics (accents, children’s speech), domains, specific speakers

220 of 227

Other Applications


The discussed applications are meant to be illustrative, not exhaustive

  • A body of work across multiple application areas:

    • Computer vision
    • Speech processing
    • Cross-modal learning (e.g., vision-language, even image-video)
    • Reinforcement learning
    • Program induction
    • Causal inference
    • Translation from natural language to code

221 of 227

Community-Wide Sharing and Reusing of Modules

A Case Study of AdapterHub.ml (reminder)

Ever-evolving multi-task models, the development of whose components has been distributed throughout the community

222 of 227

What is (or can be) stored in the modules?


  • Language-specific knowledge? Language modules
  • Task-specific knowledge? Task modules
  • Domain-specific knowledge? Domain modules

  • External (manually created, non-distributional) knowledge?

  • Debiasing-relevant knowledge? Debiasing modules
    • Via MLM on counterfactually augmented corpora? [Lauscher+, EMNLP-21]
    • combining different debiasing modules for intersectional debiasing?

223 of 227

What is (or can be) stored in the modules?


Are modules reserved only for ‘well-formed’ and interpretable ‘units of knowledge’?

Using (non-interpretable) reusable (predefined) sets of skills and skill modules

  • “a skill to understand the sentiment of a sentence?”
  • “a skill to understand short questions in natural language?”

224 of 227

Final Thoughts and

Open Research Directions

225 of 227


Many unexplored variants on the multi-dimensional research manifold of

  • Computation function
  • Routing function
  • Aggregation function
  • Training strategy

There is no ‘one-size-fits-all’ design and model choice:

  • Understand your task and your data (raw data and annotated data)
  • Understand your constraints and requirements:
    • Parameter budget?
    • Computation budget?
    • Storage budget?
    • Training speed?
    • Inference speed?
    • Optimising performance only versus optimising trade-off(s)?

226 of 227


Modular and parameter-efficient deep learning transcends the confines of:

  • Particular NLP ‘tracks’
    • Study connections between different NLP tasks and applied methods

  • Particular modalities and application areas
    • Study structural and methodological connections and (dis)similarities between research in NLP, speech processing, computer vision, etc.

  • ‘Single-stream’ goals (e.g., performance only, efficiency only, debiasing only)
    • Approach your aims (more) holistically [Søgaard+, Findings of ACL-22]: e.g., how can we integrate debiasing with efficiency- and performance-oriented aims?

  • Isolated non-shareable research
    • Community-driven expanding, sharing, reusing, updating…

227 of 227

Modular and Parameter-Efficient Fine-Tuning for NLP

Questions:

  • RocketChat #tutorial-5
  • Ask us after the tutorial

Watch out for:

  • A survey on modular deep learning (soon to be published)

If you found these slides helpful, consider citing the tutorial as:

@inproceedings{ruder2022modular,
  title={Modular and Parameter-Efficient Fine-Tuning for NLP Models},
  author={Ruder, Sebastian and Pfeiffer, Jonas and Vuli{\'c}, Ivan},
  booktitle={Proceedings of EMNLP 2022: Tutorials},
  year={2022}
}