1 of 24

Improving Language Plasticity via Pretraining with Active Forgetting

Yihong Chen, July 25th 2023, ELLIS Unconference 2023

2 of 24

About Me

3 of 24

Amazing Collaborators

Roberta Raileanu

Kelly Marchisio

David Ifeoluwa Adelani

Pontus Stenetorp

Sebastian Riedel

Mikel Artetxe

4 of 24

Pretrained Language Models (PLMs)

Language Models!

  • PLMs are smashing the leaderboards
  • Shortcomings
    • Require large data and huge compute
    • Hard to keep pace with the constantly changing world
      • Not enough “plasticity” to deal with new things
      • Can’t delete outdated information automatically

5 of 24

“Plasticity”:

  • flexibility when dealing with new inputs
    • new languages/domains/tasks/facts/tokens
  • avoid naively retraining on re-mixed data, which burns $$$ for large language models!

“Language Plasticity”

  • how to pretrain a language model that can quickly extend itself to new languages?

Today

6 of 24

Rewire PLMs for New Languages (Artetxe et al. [2020])

Literature on extending language models to new languages via relearning the embeddings

Assumption:

The token embedding layer and the transformer body divide up the work: the former captures language-specific lexical meanings, while the latter handles high-level general reasoning.
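For concreteness, here is a minimal sketch (not the authors' code) of this embedding-relearning step in PyTorch, assuming a HuggingFace-style model whose token embedding layer is exposed as `model.embeddings` and whose forward pass returns a masked-LM `loss`. For simplicity the sketch reuses the existing embedding matrix; in practice a new-language vocabulary may require a fresh embedding layer of a different size.

```python
import torch

def relearn_embeddings(model, new_lang_batches, lr=5e-4, steps=10_000):
    """Adapt a pretrained LM to a new language by retraining only the token
    embeddings; the transformer body stays frozen."""
    # Freeze the whole model ...
    for p in model.parameters():
        p.requires_grad = False

    # ... then re-initialise and unfreeze only the token embedding layer.
    emb = model.embeddings            # assumed attribute name
    torch.nn.init.normal_(emb.weight, mean=0.0, std=0.02)
    emb.weight.requires_grad = True

    opt = torch.optim.Adam([emb.weight], lr=lr)
    for _, batch in zip(range(steps), new_lang_batches):
        loss = model(**batch).loss    # masked-LM loss (HuggingFace-style API assumed)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model
```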

7 of 24

Unsupervised Zero-shot Cross-Lingual Transfer

Literature on extending language models to new languages via relearning the embeddings

Assumption:

Token embeddings -> language-specific lexical meanings

Transformer body -> high-level general reasoning

8 of 24

The Language Adaptation Stage Takes Lots of Data

Sample Efficiency

Low-resource languages usually have fewer than 10M tokens available.

Standard PLMs collapse when the number of tokens in the new language is below 10M.

We can verify this by running a simulation with a bootstrapped English corpus (sketched below).

On XNLI, performance drops from 85 to 35 (accuracy) when the adaptation data contains fewer than 10M tokens.

Existing research uses CC-100 and Wikipedia as the adaptation corpus, which often contain hundreds of millions or even billions of tokens.

Low plasticity!
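The low-data simulation above can be approximated as follows. This is an illustrative sketch only (the exact bootstrapping procedure in the experiments may differ), with a crude whitespace token count and a hypothetical helper name:

```python
import random

def bootstrap_low_resource_corpus(lines, token_budget=5_000_000, seed=0):
    """Simulate a low-resource adaptation corpus by sampling lines (with
    replacement) from a large English corpus until a token budget is hit."""
    rng = random.Random(seed)
    sample, n_tokens = [], 0
    while n_tokens < token_budget:
        line = rng.choice(lines)
        sample.append(line)
        n_tokens += len(line.split())   # crude whitespace token count
    return sample
```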

9 of 24

How can we rewire PLMs to new languages with as little new data as possible?

10 of 24

How can we improve plasticity?

  • A human perspective
  • A neural network perspective
  • “Active forgetting” comes to the rescue!

11 of 24

Improving Plasticity via Forgetting:

A Human Perspective

Forgetting

Nørby [2015], Anderson and Hulbert [2021]

12 of 24

Improving Plasticity via Forgetting:

A Neural Network Perspective

Forgetting

Recent work spans several areas: computer vision, reinforcement learning, and graph learning

Alabdulmohsin et al. [2021], Taha et al. [2021], Zhou et al. [2022], Chen et al. [2022], Nikishin et al. [2022], D’Oro et al. [2022]

Forgetting via iteratively resetting part of the weights

  • Improves sample efficiency
  • Prevents overfitting to early experience
  • Better generalisation to unseen data

13 of 24

Can forgetting improve language plasticity?

-> Can forgetting help us create pretrained language models that can be easily rewired to new languages?

14 of 24

Pretraining with Active Forgetting

A Simple “Forgetting” Method

Forgetting Token Embeddings Every 1000 Gradient Updates

Forgetting = reinitialise the embedding layer by sampling from Gaussian(0, 0.02); a minimal sketch follows the list below.

  • Intuitions
    • Simulates multiple language switches without actually crafting data in new languages
    • Exposes the body to many different embedding reinitialisations
    • Encourages the body to encode general knowledge rather than “shortcut” knowledge tied to particular embedding initialisation values
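Below is a minimal, single-GPU sketch of the pretraining loop with active forgetting. The `model.embeddings` attribute and HuggingFace-style `loss` are assumptions for illustration; the distributed version used in practice is outlined on the implementation slide.

```python
import torch

def pretrain_with_active_forgetting(model, batches, reset_every=1000, lr=7e-4):
    """MLM pretraining where the token embeddings are re-initialised from
    Gaussian(0, 0.02) every `reset_every` gradient updates."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step, batch in enumerate(batches, start=1):
        loss = model(**batch).loss          # masked-LM objective (assumed API)
        loss.backward()
        opt.step()
        opt.zero_grad()

        if step % reset_every == 0:
            # Active forgetting: throw away the learnt embeddings.
            with torch.no_grad():
                torch.nn.init.normal_(model.embeddings.weight, mean=0.0, std=0.02)
            # Also drop stale Adam state for the embedding parameter.
            opt.state.pop(model.embeddings.weight, None)
    return model
```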

15 of 24

Pretraining with Active Forgetting

A Simple “Forgetting” Method

(Figure: episodic learning curve, with loss “spikes” at each embedding reset)

16 of 24

Pretraining with Active Forgetting: Implementation

A Simple “Forgetting” Method

Forgetting Token Embeddings Every K Gradient Updates

How to reset embeddings under distributed training? (See the sketch after this list.)

  • Reset the LR scheduler for the embeddings
  • Reset the optimiser state (Adam) for the embeddings
  • On GPU 0, reset the embeddings to random vectors drawn from Gaussian(0, 0.02), and broadcast the vectors to the other GPUs
  • On GPUs 1-31, use the received vectors to reset the embeddings
  • FP16 training: sync the new FP32 master parameters to the FP16 copy
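A hedged sketch of this reset step with torch.distributed follows. Attribute names like `model.embeddings` are assumptions, and the LR-scheduler/FP16 handling is only indicated in comments:

```python
import torch
import torch.distributed as dist

def reset_embeddings_distributed(model, optimizer, std=0.02):
    """Re-initialise token embeddings identically on every rank: rank 0 draws
    the new Gaussian vectors and broadcasts them to all other GPUs."""
    emb_weight = model.embeddings.weight           # assumed attribute name
    with torch.no_grad():
        if dist.get_rank() == 0:
            emb_weight.normal_(mean=0.0, std=std)  # fresh Gaussian(0, 0.02) draw
        dist.broadcast(emb_weight, src=0)          # other ranks receive rank 0's draw

    # Drop stale Adam moments for the embedding parameter so the optimiser
    # treats the reset embeddings as brand new; the LR schedule for the
    # embeddings is restarted by the trainer as well.
    optimizer.state.pop(emb_weight, None)
    # With FP16 training, the updated FP32 master weights must then be synced
    # back into the FP16 model copy (handled by the AMP/trainer wrapper).
```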

17 of 24

Evaluate Forgetting PLMs: A Low-Data Regime

Evaluations in Low-Data Regimes

  • Pretrain Forgetting and Standard PLMs using English CC100
    • roberta-base, 12-layer transformer (125M parameters)
    • 32 GPUs (V100 32 GB) for 24-36 hours
  • Adapt the body with
    • MNLI (English Natural Language Inference)
    • SQuAD (English Question Answering)
  • Adapt the token embeddings with a CC100 subset in the new language
    • Limit the adaptation corpus to only 5M tokens
  • Evaluate zero-shot on several cross-lingual transfer benchmarks (assembly sketched below)
    • XNLI (Natural Language Inference)
    • MLQA (Question Answering)
    • XQuAD (Question Answering)
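To make the zero-shot protocol concrete, here is a schematic sketch (assumed attribute names, not the paper's code) of how the pieces are combined before evaluation: the body and task head come from the English task fine-tuning, while the token embeddings come from the new-language adaptation.

```python
def assemble_cross_lingual_model(task_model, lang_adapted_model):
    """Zero-shot cross-lingual transfer: keep the transformer body and task
    head fine-tuned on the English task, but swap in the token embeddings
    relearnt on the new-language corpus (which may use its own vocabulary)."""
    task_model.embeddings = lang_adapted_model.embeddings   # assumed attribute
    return task_model   # evaluate directly on the XNLI / MLQA / XQuAD test sets
```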

18 of 24

Result: Forgetting PLMs Work Better in Low-Data Regimes

                   XNLI (accuracy)   MLQA (F1)   XQuAD (F1)
Standard PLM            53.3           34.3         36.1
Forgetting PLM          62.7           43.4         49.0

(numbers averaged across languages)

19 of 24

Result: Forgetting PLMs Learn New Languages with Fewer Parameter Updates

Within 5K parameter updates, forgetting PLMs reach roughly 90% of their converged performance, while standard PLMs reach roughly 50%.

20 of 24

Result: Languages That Are Distant From English Benefit Most From Forgetting PLMs

Thai, Arabic, Turkish, Hindi, Urdu, Chinese

  • These belong to language families different from English’s

21 of 24

Conclusions

    • Current language models need more plasticity to deal with new inputs! (maybe also for removing unwanted information?)
    • Specifically for improving language plasticity,
      • periodically forgetting the token embedding layer (which captures lexical meanings) while pretraining the transformer helps.
    • Such a simple active forgetting mechanism creates PLMs that are easier to rewire for new languages:
      • Better sample efficiency
      • Faster convergence
      • Larger gains for distant languages
    • In general, the plasticity of language models is still under-explored, yet it is a highly relevant research target.
      • The cost of retraining them for new inputs is usually very high (data, compute)
      • More plasticity = rewiring PLMs’ behaviour with a tiny fraction of new data and compute

22 of 24

Open Questions

What’s next?

  • Use forgetting PLMs for new tasks, new domains, data with temporal shifts, etc.
    • Easy-to-rewire PLMs that evolve as the world changes
  • Use forgetting for autoregressive LMs instead of masked LMs
    • Does the architecture/objective choice impact forgetting methods?
  • How about forgetting other parts instead of embeddings?
  • More generally, open to any collaboration about understanding transformers

23 of 24

Q & A

24 of 24

That’s the end!