Modular and Parameter-Efficient Fine-Tuning for NLP Models
Sebastian Ruder, Jonas Pfeiffer, Ivan Vulić
EMNLP 2022, December 8, 2022
State-of-the-art NLP Models Are Getting Ever Larger
Evolution of the size of large pre-trained models [Treviso et al., 2022]
Modular and Parameter-Efficient Fine-Tuning for NLP
Follow along with the tutorial:
Questions:
Watch out for:
Large Models Develop Increasing NLU Capabilities
Credit: Google AI Blog
Transfer Learning in the Era of Large Models
word2vec
GloVe
skip-thought
InferSent
ELMo
ULMFiT
GPT
BERT
T5
Fine-tuning
classification
sequence labeling
question answering
....
Pretraining
Transfer Learning in the Era of Large Models
BERT
BART
ERNIE
GPT-3
PaLM
Pretraining
Credit: Liu et al. (2021)
Prompt-based Learning has Taken NLP by Storm
Credit: http://pretrain.nlpedia.ai/
Downsides of Prompt-based Learning
From Fine-tuning to Parameter-efficient Fine-tuning
Fine-tuning
classification
sequence labeling
question answering
....
Pretraining
BERT
BART
ERNIE
GPT-3
PaLM
Parameter-efficient
From Fine-tuning to Parameter-efficient Fine-tuning
Full Fine-tuning: Update all model parameters
Parameter-efficient Fine-tuning: Update a small subset of model parameters
Haven’t We Seen This Before?
→ Fine-tuning all representations performed generally better in practice
Why go back to fine-tuning only some parameters?
Modularity and Compositionality?
Skill/Task 2
Skill/Task 1
Modularity and Compositionality?
Skill/Task 1
Skill/Task 2
Modularity and Compositionality?
Skill/Task 3
Why Modularity?
1. Models are increasing in size
🙏 Parameter-efficient fine-tuning strategies
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios
English
Task/Skill
Swahili
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios�→ Out-of-distribution generalization through module composition
English
Swahili
Task/Skill
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios
→ Out-of-distribution generalization through module composition
3. Catastrophic interference
𝚹
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios
→ Out-of-distribution generalization through module composition
3. Catastrophic interference
→ Modularity as inductive bias
𝚹
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios
→ Out-of-distribution generalization through module composition
3. Catastrophic interference
→ Modularity as inductive bias
4. Efficient updating of models
through added components
model
Who is the President of the USA?
Obama
Why Modularity?
1. Models are increasing in size → Parameter-efficient fine-tuning strategies
2. Unseen Scenarios
→ Out-of-distribution generalization through module composition
3. Catastrophic interference
→ Modularity as inductive bias
4. Efficient updating of models
through added components
model
Who is the President of the USA?
Biden
Update
Relevant Tutorials
EMNLP 2020
Efficient methods (knowledge distillation, quantization, pruning): Slides
ACL 2022
In-context learning, fine-tuning, meta-training: Underline
What This Tutorial Is About and Is Not About
Agenda
1) Computation Function
2) Routing
3) Aggregation
Parameter-efficient Models
Modular Models
Applications
Notation
Let a neural network f be decomposed into a composition of functions f_l ∘ … ∘ f_2 ∘ f_1. Each f_i has parameters θ_i.
A module with parameters φ_i can modify the i-th subfunction as follows: f_i′(x) = f_{θ_i, φ_i}(x).
In practice, typically only the module parameters φ_i are updated while θ_i is fixed.
Interpolation, e.g., element-wise addition
Concatenation
Function composition
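A sketch of the three options in the notation used here, assuming f_i denotes the i-th subfunction with pre-trained parameters θ_i and φ_i the module parameters:

```latex
% Parameter composition: module parameters are combined with theta_i directly
f_i'(x) = f_{\theta_i \oplus \phi_i}(x)
% Input composition: the module parameters are concatenated with the input
f_i'(x) = f_{\theta_i}([\phi_i; x])
% Function composition: the module is a separate function applied to the output
f_i'(x) = (f_{\phi_i} \circ f_{\theta_i})(x)
```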
Three Computation Functions
Parameter Composition
Input Composition
Function Composition
Parameter Composition
Sparse Subnetworks
Pruning
Initial training
Pruning
Re-training
…
Pruning
Re-training
One-shot pruning
Iterative pruning
Another Perspective on Pruning
Element-wise product (Hadamard product)
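Viewed this way, pruning learns a binary mask b that is applied to the weights via the Hadamard product. A minimal NumPy sketch of one-shot magnitude pruning (illustrative; the mask criterion and sparsity level are assumptions, not the exact recipe of any cited paper):

```python
import numpy as np

def magnitude_mask(weights, sparsity):
    """Binary mask b keeping the largest-magnitude entries of `weights`.

    The pruned network uses weights * b (element-wise / Hadamard product).
    """
    k = int(round(sparsity * weights.size))  # number of entries to zero out
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
b = magnitude_mask(W, sparsity=0.5)  # keep the top 50% by magnitude
W_pruned = W * b                     # Hadamard product, as on the slide
```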
The Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis in Pre-trained Models
Beyond Lottery Tickets: Supermasks
→ There are masks—supermasks—that achieve non-random performance even for randomly initialized, fixed models [Zhou et al., 2019]
Initial training
Pruning
Re-initialization
Beyond Lottery Tickets: Supermasks
Supermask for Task A
Supermask for Task B
Supermask for Task C
Supermasks in Pre-trained Models
Pruning Pre-trained Models
Fine-tuned weights stay close to their pre-trained values. Magnitude pruning (left) selects weights that are far from 0.
Movement pruning (right) selects weights that move away from 0.
Diff Pruning
Zeroing Pre-trained Weights vs Keeping Them
Lottery Ticket Sparse Fine-tuning
Parameter Composition
Structured Composition
Group-based Fine-tuning
Bias-only Fine-tuning
Structured Sparsity
Distribution of BERT attention heads after pruning a single head based on MNLI performance [Michel et al., 2019]
Structured Sparsity
Parameter Composition
Low-rank Composition
Standard fine-tuning:
Low-rank fine-tuning:
A random projection matrix
Everything but θ_d is fixed. Only d dimensions are optimized.
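A minimal NumPy sketch of this reparameterization, θ = θ₀ + P·θ_d: only the d-dimensional θ_d is trainable, while θ₀ and the random projection P stay fixed (sizes D and d are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

D = 1000  # number of (flattened) model parameters -- illustrative
d = 10    # intrinsic dimension: the only trainable values

theta_0 = rng.normal(size=D)              # frozen pre-trained parameters
P = rng.normal(size=(D, d)) / np.sqrt(d)  # fixed random projection matrix

def effective_params(theta_d):
    # theta = theta_0 + P @ theta_d: only d dimensions are ever optimized
    return theta_0 + P @ theta_d
```

At θ_d = 0 the model is exactly the pre-trained one, which makes the start of fine-tuning well-behaved.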
Low-rank Fine-tuning
Low-rank fine-tuning:
A random projection matrix
Low-rank Fine-tuning via Matrix Factorization
Random diagonal matrix with ±1 entries (equal probability)
Hadamard matrix
Random diagonal matrix with independent standard normal entries
Random permutation matrix
Intrinsic Dimensionality
Intrinsic Dimensionality
Intrinsic dimension on the MRPC dataset for models of different sizes [Aghajanyan et al., 2021]
Structured Low-rank Methods
→ We can apply the low-rank constraint only to certain groups
Low-rank Adaptation (LoRA)
h = W₀x + BAx, where W₀ ∈ ℝ^{d×k} is frozen and B ∈ ℝ^{d×r}, A ∈ ℝ^{r×k} are trained
r: matrix rank; d: hidden dimension; k: input dimension
Vectorization
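A minimal NumPy sketch of the LoRA update h = W₀x + BAx (the scaling factor α/r from the paper is omitted; dimensions are illustrative). Since B is initialized to zero, the model starts out identical to the pre-trained one, and the added update BA never has rank above r.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2  # hidden dim, input dim, rank (r << min(d_out, d_in))

W0 = rng.normal(size=(d_out, d_in))  # frozen pre-trained weight
B = np.zeros((d_out, r))             # trainable, initialized to zero
A = rng.normal(size=(r, d_in))       # trainable

def lora_forward(x):
    # h = W0 x + B A x: the low-rank update B @ A has rank at most r
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=d_in)
```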
Parameter Composition Comparison
| | Computation function |
| Sparse Subnetworks | f_i′(x) = f_{θ_i ⊙ b_i}(x), where b_i ∈ {0, 1}^|θ_i| |
| Structured Composition | f_i′(x) = f_{θ_i ⊙ b_i}(x), where b_{i,j} = 1 if parameter j belongs to a selected group |
| Low-rank Composition | f_i′(x) = f_{θ_i + φ_i}(x), where φ_i = BA has rank r ≪ min(d, k) |
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | Methods such as diff pruning require < 0.5% of parameters | Pruning requires re-training iterations | Does not increase the model size | E.g., LoRA achieves strong performance | Subnetworks can be composed |
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | + | - | ++ | + | + |
These are mainly meant as high-level guidelines. Individual methods may have different trade-offs and mitigate certain weaknesses.
Three Computation Functions
Parameter Composition
Input Composition
Function Composition
Input Composition
Input Composition and Prompting
Prompt Tuning
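A minimal sketch of prompt tuning's input composition, assuming a frozen embedding table and a small matrix of trainable soft-prompt vectors (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_emb = rng.normal(size=(100, 16))  # frozen embedding table (vocab 100, dim 16)
prompt = rng.normal(size=(5, 16))       # 5 trainable soft-prompt vectors

def embed_with_prompt(token_ids):
    # Prepend the continuous prompt to the embedded input; only `prompt`
    # would receive gradients, the rest of the model stays frozen.
    return np.concatenate([prompt, vocab_emb[token_ids]], axis=0)

seq = embed_with_prompt(np.array([3, 7, 42]))
```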
Prompt Tuning Only Works Well at Scale
→ Prompt tuning performs poorly at smaller model sizes and on harder tasks [Mahabadi et al., 2021; Liu et al., 2022]
Prompt tuning vs standard fine-tuning and prompt design across T5 models of different sizes [Lester et al., 2021]
Prompt tuning only matches fine-tuning at the largest model size
Multi-Layer Prompt Tuning
Prompt tuning
Multi-layer prompt tuning
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | + | - | ++ | + | + |
| Input composition | Only add a small number of parameters | Extend the model's context window | Extend the model's context window | Requires large models to perform well | Continuous prompts have been composed |
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | + | - | ++ | + | + |
| Input composition | ++ | -- | -- | - | + |
Three Computation Functions
Parameter Composition
Input Composition
Function Composition
Function Composition
→ Focus in this tutorial is on functions that can be added to a pre-trained model
Adapters
→ Functions are also known as ‘adapters’
Residual adapter in a ResNet (adapter parameters are in blue) [Rebuffi et al., 2017]
Adapters in Transformer Models
Compact Adapter (Compacter)
W = Σᵢ₌₁ⁿ Aᵢ ⊗ Bᵢ (⊗ is the Kronecker product),
where Aᵢ ∈ ℝⁿˣⁿ is shared between all layers, Bᵢ is a low-rank matrix, and n is a hyper-parameter (between 4 and 12).
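A minimal NumPy sketch of the Compacter parameterization W = Σᵢ Aᵢ ⊗ Bᵢ with rank-1 factors Bᵢ = sᵢtᵢᵀ (shapes are illustrative; in the actual method the Aᵢ are additionally shared across all layers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4         # hyper-parameter (between 4 and 12 on the slide)
k, d = 8, 12  # target adapter weight shape; n must divide both

A = rng.normal(size=(n, n, n))       # n matrices A_i of shape (n, n)
s = rng.normal(size=(n, k // n, 1))  # B_i = s_i @ t_i^T is rank-1 (low-rank)
t = rng.normal(size=(n, 1, d // n))

# W = sum_i A_i ⊗ B_i (Kronecker product), parameterizing a (k x d) weight
W = sum(np.kron(A[i], s[i] @ t[i]) for i in range(n))
```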
Sequential and Parallel Adapters
A sequential adapter [Houlsby et al., 2019]
Two parallel adapters [Stickland & Murray, 2019]
Benefits of Adapters
BERT test performance distributions over 20 runs with different learning rates [He et al., 2021]
Results on GLUE with different numbers of training samples per task [Mahabadi et al., 2021]
Rescaling
IA3
IA3 [Liu et al., 2022] in the Transformer model
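A minimal sketch of the IA3 idea: frozen activations (e.g., keys, values, and FFN activations) are rescaled element-wise by a learned vector, which at its all-ones initialization leaves the model unchanged (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.normal(size=(3, d))  # frozen intermediate activations (3 tokens)

l_v = np.ones(d)             # learned rescaling vector, initialized to ones

def ia3_rescale(h, l):
    # Element-wise rescaling h * l: at init (l = 1) the model is unchanged
    return h * l
```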
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | + | - | ++ | + | + |
| Input composition | ++ | -- | -- | - | + |
| Function composition | Adapters depend on the hidden size | Does not require gradients of frozen params | New functions increase # of operations | Match or outperform standard fine-tuning | Adapters can be composed |
Computation Functions Comparison
| | Parameter efficiency | Training efficiency | Inference efficiency | Performance | Compositionality |
| Parameter composition | + | - | ++ | + | + |
| Input composition | ++ | -- | -- | - | + |
| Function composition | - | + | - | ++ | + |
Module Parameter Generation
→ We can use a small neural network—a hyper-network [Ha et al., 2017]—to generate the module parameters instead
Hyper-X [Üstün et al., 2022] conditions on task, language, and layer id to generate adapter parameters
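A deliberately simplified sketch of the hyper-network idea: a single linear map (a real hyper-network would typically be an MLP) generates a flat adapter parameter vector from concatenated task, language, and layer embeddings. All names and sizes here are assumptions for illustration, not Hyper-X's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, adapter_params = 6, 64  # hypothetical sizes

task_emb = rng.normal(size=emb_dim)
lang_emb = rng.normal(size=emb_dim)
layer_emb = rng.normal(size=emb_dim)

# The hyper-network here is a single linear map (a deliberate simplification):
W_h = rng.normal(size=(adapter_params, 3 * emb_dim)) * 0.01

def generate_adapter(task, lang, layer):
    # Condition on task, language, and layer id to generate module parameters
    z = np.concatenate([task, lang, layer])
    return W_h @ z  # flat vector, reshaped into adapter weights downstream

phi = generate_adapter(task_emb, lang_emb, layer_emb)
```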
Unifying Computation Functions
Combining Computation Functions
Adapter
Different adapter variants [He et al., 2022]
Performance Comparison
Prompt tuning underperforms the other methods due to limited capacity
Average performance, % of trained parameters per task, and (left) memory footprint (circle size) of different methods on T5-Base (222M parameters; left) [Mahabadi et al., 2021] and T0-3B (right) [Liu et al., 2022]
Standard fine-tuning
Intrinsic dimension uses the smallest number of parameters but has a large memory footprint and poorer performance
Fine-tuning biases (BitFit) only has a small memory footprint but achieves lower performance
Function composition methods such as adapter, compacter, and IA3 achieve the best performance but add more parameters
Agenda
1) Computation Function
2) Routing
3) Aggregation
Parameter-efficient Models
Modular Models
Applications
Introduction to Routing
Routing
Fixed Routing
Learned Routing
Hard Learned Routing
Soft Learned Routing
Fixed Routing
The routing decision is made a priori.
Fixed Routing
Multi-Task Learning
BERT
NLI
NER
QA
Fixed Routing
AdapterHub.ml
3 MB
500 MB
BERT
RoBERTa
T5
…
AdapterHub is an ever-evolving multi-task model.
The component development is distributed throughout the community.
Routing
Fixed Routing
Learned Routing
Hard Learned Routing
Soft Learned Routing
Learned Routing
Parametrize the routing function when routing decisions cannot be made a priori.
Learned Routing - Challenges
Plots courtesy of Rosenbaum et al. (2017)
Routing
Fixed Routing
Learned Routing
Hard Learned Routing
Soft Learned Routing
Hard Learned Routing
Discrete decisions are not amenable to learning through vanilla gradient descent.
Hard Learned Routing - Stochastic Reparametrization
Polytropon; Ponti et al. (2022)
Gumbel Softmax (Jang et al., 2017)
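A minimal NumPy sketch of the Gumbel-softmax relaxation used for such stochastic reparametrization (the straight-through hard variant would additionally arg-max the relaxed sample):

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0):
    """Differentiable approximate sample from a categorical over modules.

    Standard Gumbel-softmax relaxation (Jang et al., 2017): add Gumbel(0,1)
    noise to the logits, then apply a temperature-scaled softmax.
    """
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0,1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

probs = gumbel_softmax(np.array([1.0, 0.5, -1.0]), tau=0.5)
```

Lower temperatures τ push the sample toward a one-hot module selection while keeping gradients defined.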
Routing
Fixed Routing
Learned Routing
Hard Learned Routing
Soft Learned Routing
Soft Learned Routing
To sidestep discrete selections of modules, several works propose soft routing methods.
The router learns a probability distribution over the available modules:
Soft Learned Routing
Mixture of Experts (MoE)
=> sum over all modules (aka “experts”)
Problem: All modules are always activated
=> Significant increase of computational cost.
Soft Learned Routing
Top-k routing.
Only select the top-k modules
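A minimal sketch of top-k routing: a softmax over all module logits, then keep only the top-k gates and renormalize, so only k modules need to run a forward pass (illustrative, not a specific paper's router):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_route(logits, k):
    # Softmax over all modules, then keep only the top-k and renormalize
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:]      # indices of the k largest gates
    gates = np.zeros_like(probs)
    gates[top] = probs[top] / probs[top].sum()
    return gates

gates = top_k_route(rng.normal(size=6), k=2)
```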
Soft Learned Routing
Top-1 routing.
Only select the top module
Token Level Routing - MoE
Each FFN component of the transformer is considered a module (aka “Expert”).
Routing is performed on the token level.
Load Balancing uniformly distributes tokens across hardware accelerators.
Token Level Routing - Load Balancing
Illustration courtesy of Fedus et al. (2022)
Problem: Load balancing restricts the system from routing an entire example to a single module.
Syntactically and semantically similar words (in contrast to sentences or phrases) are routed to the same modules.
Example Level Routing
Instead of token-level routing, all tokens of a sentence can be routed to the same module.
Routing can be achieved based on metadata, such as language or task id, or on the pooled token representations.
Task id
Layer Routing
The router can also make a global decision about how to route at each layer.
Routing - Hybrids
Fixed Routing
e.g. Language_id
Learned Routing (e.g., MoE)
e.g. heterogeneous data
It is possible to combine the concepts of fixed and learned routing.
Agenda
1) Computation Function
2) Routing
3) Aggregation
Parameter-efficient Models
Modular Models
Applications
Aggregation Functions
Routing: How do we select different modules during training?
Here: How can we aggregate modules in order to combine the respective information?
* (Often these concepts are inseparable, i.e. selection and aggregation are performed simultaneously)
Computation Functions: How do we compose shared components with modules?
Here: How do we compose multiple modular components?
Notation
Let a neural network f be decomposed into a composition of functions f_l ∘ … ∘ f_2 ∘ f_1. Each f_i has parameters θ_i.
A module with parameters φ_i can modify the i-th subfunction as follows: f_i′(x) = f_{θ_i, φ_i}(x).
In practice, typically only the module parameters φ_i are updated while θ_i is fixed.
Function composition
Module Composition
Parameter Interpolation
Mode connectivity: The minima found by two networks are connected by a path of non-increasing error.
Frankle et al. (2020) and Neyshabur et al. (2020) demonstrate that linear mode connectivity is closely linked to the Lottery Ticket Hypothesis.
=> When interpolating between models trained on different tasks but initialised with the same set of weights, the models tend to stay in the same loss basin.
Parameter Interpolation (Recap)
Lottery Ticket Sparse Fine-tuning: Keep the pre-trained weights, and combine module parameters learned for different settings [Ansell et al., 2022]: θ = θ₀ + φ_lang + φ_task, where the φ are sparse difference vectors.
Parameter Interpolation
Ilharco et al. (2022) propose to edit entire models.
By performing the arithmetic negation operation, their new model generates less toxic text.
E.g. tasks can include toxic language generation and general language modelling.
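A minimal sketch of this task arithmetic: a task vector is the difference between fine-tuned and pre-trained weights, and negating it moves the model away from the task (toy sizes; the scaling coefficient λ is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_pre = rng.normal(size=10)                      # pre-trained weights
theta_toxic = theta_pre + 0.1 * rng.normal(size=10)  # fine-tuned on toxic data

tau = theta_toxic - theta_pre   # "task vector" for toxic generation
lam = 1.0                       # scaling coefficient (assumed)
theta_detox = theta_pre - lam * tau  # arithmetic negation of the task
```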
Input Composition
Input aggregation comes naturally to modular input composition functions.
Aggregating the modules boils down to concatenating all prompts.
Model
Prompt 1
Prompt n
…
Text Input
Function Composition
For n modules, we get n (latent) representations.
Weighted Representation Averaging:
Learn the weights αᵢ to interpolate over the hidden representations: h = Σᵢ αᵢ hᵢ
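A minimal NumPy sketch of weighted representation averaging: softmax-normalized weights over per-module hidden representations (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_modules, d = 3, 8
H = rng.normal(size=(n_modules, d))  # hidden representation from each module

logits = rng.normal(size=n_modules)  # learned routing scores
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                 # weights over modules, sum to 1

h = alpha @ H                        # interpolated hidden representation
```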
MoE: h = Σᵢ r(x)ᵢ hᵢ, where r(x) is the router's distribution over modules
=> MoEs automatically perform routing and function aggregation.
Ma et al. (2018) propose to learn one aggregation function per task in a multi-task setup.
Gururangan et al. (2022) pre-train modular components for different textual domains. When utilising the pre-trained modules on unseen data, they weight the output representations of the respective domain modules according to the posterior distribution over the input examples.
Fixed Routing: Module representations are often averaged without weighting (Zhang et al., 2022; Chronopoulou et al., 2022).
Hard routing methods: The representations of all active modules are averaged, such as in Polytropon (Ponti et al., 2022), or summed, as in PathNet (Fernando et al., 2017).
Attention-Based Representation Aggregation:
Instead of inferring the weighting before a module has performed its transformation, we can make the aggregation decision after.
Disadvantage of both weighted and attention-based representation averaging:
They require a full forward pass through all modules, even if they contribute only marginally.
=> Significant increases in time and space complexity. While this can be mitigated by pruning (i.e., dropping) some modules during inference (Rücklé et al., 2021), latency still remains an issue for scalability.
Sequential Aggregation:
Pass through multiple modules, where the input to the next module is the output of the previous one: h = fφₙ(… fφ₂(fφ₁(x)) …)
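A minimal sketch of sequential aggregation, chaining toy linear modules so each one's output feeds the next (near-identity initialization is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Toy "modules": near-identity linear maps (illustrative only)
modules = [np.eye(d) + 0.1 * rng.normal(size=(d, d)) for _ in range(3)]

def sequential_aggregate(x, modules):
    # Output of each module feeds the next: f3(f2(f1(x)))
    for W in modules:
        x = W @ x
    return x

y = sequential_aggregate(rng.normal(size=d), modules)
```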
Agenda
1) Computation Function
2) Routing
3) Aggregation
Parameter-efficient Models
Modular Models
Applications
Applications
(An Illustrative Non-Exhaustive Subset)
Properties of Modular and Parameter-Efficient Tuning
Multi-Task Learning:
MT-Model; 𝚹
Task 1
Task 2
Task 3
Sequential Fine-Tuning:
Model; 𝚹0
Task 1
Model; 𝚹1
Task 2
Model; 𝚹2
Properties of Modular and Parameter-Efficient Tuning
We will illustrate most of these properties via applications in (low-resource) multilingual NLP as a case study.
Parameter-Efficient Fine-Tuning
A basic scenario, seen many times across many applications
Parameter-Efficient Fine-Tuning
A Multi-Task Scenario (in Monolingual Contexts)
hyper-networks
Challenges of Cross-Lingual Transfer
Each data point can be a combination of dedicated modules?
Image courtesy of Yulia Tsvetkov
Standard (Zero-Shot) Cross-Lingual Transfer
Step 1:
Train a multilingual model.
Step 2:
Fine-tune model on a task in a high-resource source language.
Step 3:
Transfer and evaluate the model on a low-resource target language.
Why?
Training data is expensive and not available for many languages, especially ones that are considered “low-resource”.
Challenges of Cross-Lingual Transfer
Deep massively multilingual models such as multilingual BERT [mBERT; Devlin+, NAACL-19] or XLM-RoBERTa [XLM-R; Conneau+, 2020] achieve strong zero-shot cross-lingual transfer performance,
BUT they underperform on low-resource languages that are underrepresented in pre-training.
Language Adapters?
Task Knowledge ~ Language Knowledge
MLM (English)
MLM (Quechua)
Modular and Parameter-Efficient Cross-Lingual Transfer
A Case Study of MAD-X [Pfeiffer+, EMNLP-20]
MAD-X
Step 1: Train Language Adapters
We train language adapters for the source language and the target language with masked language modeling on Wikipedia.
MLM (English)
MLM (Quechua)
MAD-X
Step 2: Train a Task Adapter
We train task adapters in the source language stacked on top of the source language adapter.
The language adapter ɸl as well as the transformer weights 𝚹 are frozen, while only the task adapter parameters ɸt are trained.
MAD-X
Step 3: Zero-Shot transfer to unseen language
We replace the source language adapter with the target language adapter, while keeping the “language agnostic” task adapter.
Evaluation
Source
Unseen Script
Unseen
Seen
We evaluate our approaches by transferring from 4 high-resource to 17 low-resource languages.
We evaluate on Named Entity Recognition.
MAD-X: Some Results (NER)
Languages are more low-resource or unseen during pre-training
Relative F1 improvement of MAD-XLarge over XLM-RLarge in cross-lingual NER transfer.
The top right corner represents the realistic scenario of transferring from high-resource to low-resource languages.
MAD-X: Some Results (NER)
mBERT and MAD-X fail for unseen scripts.
When scripts have never been seen
ቢግ ማክ በኣሁኑ ዘመን በአለም ዓቀፍ ደረጃ
Amharic script has not been seen during mBERT pre-training.
<UNK>
<UNK>
<UNK>
<UNK>
<UNK>
mBERT’s tokenizer represents everything as <UNK>s
mBERT Tokenizer
The
20
维
さ
บทค
მისი
Learn a new Tokenizer/Vocabulary
mBERT Tokenizer
The
20
维
さ
บทค
მისი
Amharic Tokenizer
The
20
ቢግ
ማክ
ዘመ
ዓቀ
Results
mBERT and MAD-X fail for unseen scripts.
We can achieve gains for seen, but underrepresented languages using lexical initialization.
MAD-X w/ Tokenizer
Applications in Cross-Lingual Transfer
How can we enable positive transfer across languages?
Generating Adapter Parameters
The main idea: instead of having dedicated single-language adapters, can we learn to generate adapters on the fly, conditioned on language vectors?
This can be seen as factorising adapters into language (and layer) parameters
More efficient than keeping dedicated language adapters
It can even work (in theory) in zero-shot (no text data whatsoever!) and few-shot setups...
Applications in Cross-Lingual Transfer
A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]
Generalisation of [Üstün+, EMNLP-20]
Learn to generate (via a hyper-network) a monolithic multilingual adapter by MLM-ing on 95 languages.
Multi-Source Transfer works much better across different tasks: the model is forced to learn more general language-invariant transfer features?
Generated Adapters offer better initialisation for further target-specific MLM-ing in low-data scenarios...
Applications in Cross-Lingual Transfer
A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]
Notes on Efficiency and Compactness
95 * 178M = 16.91B parameters (!!)
728M additional parameters (95 single-language adapters)
228M parameters
38M additional parameters
Applications in Cross-Lingual Transfer
A Case Study of MAD-G [Ansell+, Findings of EMNLP-21]
MAD-G multilingual adapter as a better starting point for further language specialisation
Applications in Multilinguality
A case study of Hyper-X: A Unified Hypernetwork for Multi-Task Multilingual Transfer
Generating task-language modules conditioned on task, language, and layer embeddings
A single hyper-network can again be shared across layers
Supports unseen task-language combinations at inference (composability, reusability), enabled by multi-task learning
MAD-X vs X-Mod
Modularizing Pretrained Language Models
A case study of X-Mod [Pfeiffer+, NAACL-2022]
Modularizing Pretrained Language Models
A case study of X-Mod [Pfeiffer+, NAACL-2022]
Shared Baseline: “Standard” multilingual model with all weights shared
X-Mod: Each language has a set of dedicated modular parameters
Shared
X-Mod
Zero-Shot Cross-Lingual Transfer with X-Mod
Fine-tune only shared parameters
For cross-lingual transfer, replace the source language module with the target language module.
Modular Parameter-Efficient Cross-Lingual Transfer
Sparse subnetworks instead of adapters? [Ansell+, ACL-22; Foroutan+, EMNLP-22]
(beyond “Lottery Ticket” selection)
Child-Tuning [Xu+, EMNLP-21]
FISH-Tuning [Sung+, NeurIPS-21]
Fisher Information Matrix
Levels of sparsity are crucial
Combining Language and Ranking Modules for IR
A case study of multilingual IR with bottleneck adapters and sparse subnetworks
Ranking modules, trained only with English relevance assessment, can be transferred to other languages and also used for CLIR
Extending Modular Cross-Lingual Transfer
Phylogeny-Based Hierarchical Adapters?
Ensembling Adapters at Test Time
A case study of dealing with low-resource language varieties at test time
Similar Ideas Have Been Explored in NMT
A case study of using bilingual (translation-direction) adapters [Bapna and Firat, EMNLP-19]
Similar Ideas Have Been Explored in NMT
A case study of using monolingual NMT adapters
Generating Adapters for NMT
Another direct analogy with cross-lingual transfer [Baziotis+, arXiv-22]
Similar Ideas Have Been Explored in NMT
A case study of using language-specific subnetworks for Multilingual NMT [Lin+, ACL-21]
Mixture-of-Experts: Large-Scale Modular Learning
A case study of token-level versus task-level MoEs [Kudugunta+, Findings of EMNLP-21]
Similar Ideas Have Been Explored for Domain Adaptation
A case study of hierarchical domain adapters [Chronopoulou+, NAACL-22]
Modular and Parameter-Efficient Continual Learning
A case study of adding new domains to task-oriented dialogue systems [Madotto+, EMNLP-21]
Task id is available at training but not at inference
Select the module with the lowest perplexity score (i.e., highest confidence)
Prompt-Free Methods for Few-Shot Learning
A case study of PERFECT [Mahabadi+, ACL-22]
Bypassing the use of patterns and verbalizers: prompt-free methods
More PEFT-based versus prompt-based few-shot learning?
Multilingual Multi-Modal Learning
A case study of enabling cross-lingual VQA [Pfeiffer+, Findings of ACL-22; Liu+, arXiv-22]
Task Specialisation of Sentence Encoders
A case study of (efficient) intent detection
Task specialisation knowledge (i.e., contrastive learning updates) is inserted into bottleneck adapters
Modular Prompting?
A case study of modular prompting for zero-shot cross-lingual generation [Vu+, EMNLP-22]
Other Applications
The discussed applications are meant to be illustrative, not exhaustive
Similar use of modular PEFT in speech processing applications:
Bottleneck adapters for ASR [Tomanek+, EMNLP-21; Thomas+, ICASSP-22; Sathyendra+; ICASSP-22; Chen+, arXiv-22; Fan and Alwan, InterSpeech-22] and text-to-speech (emerging, 2 arXiv papers in Nov/22)
What is in the modules?
Specialising for languages, demographics (accents, children’s speech), domains, specific speakers
Other Applications
The discussed applications are meant to be illustrative, not exhaustive
Ever-evolving multi-task models, the development of whose components has been distributed throughout the community
What is (or can be) stored in the modules?
What is (or can be) stored in the modules?
Are modules reserved only for ‘well-formed’ and interpretable ‘units of knowledge’?
Using (non-interpretable) reusable (predefined) sets of skills and skill modules
Polytropon [Ponti+, arXiv-22] SkillNet-NLU [Zhang+, arXiv-22]
Final Thoughts and Open Research Directions
Many unexplored variants on the multi-dimensional research manifold of computation functions, routing, and aggregation
There is no ‘one-size-fits-all’ design and model choice:
Modular and parameter-efficient deep learning transcends the confines of:
If you found these slides helpful, consider citing the tutorial as:
@inproceedings{ruder2022modular,
title={Modular and Parameter-Efficient Fine-Tuning for NLP Models},
author={Ruder, Sebastian and Pfeiffer, Jonas and Vuli\'{c}, Ivan},
booktitle={Proceedings of EMNLP 2022: Tutorials},
year={2022}
}