1 of 24

Improving Language Plasticity via Pretraining with Active Forgetting

Yihong Chen, July 25th 2023, ELLIS Unconference 2023

2 of 24

About Me

3 of 24

Amazing Collaborators

Roberta Raileanu

Kelly Marchisio

David Ifeoluwa Adelani

Pontus Stenetorp

Sebastian Riedel

Mikel Artetxe

4 of 24

Pretrained Language Models (PLMs)

Language Models!

  • PLMs are smashing the leaderboards
  • Shortcomings
    • Require large data and huge compute
    • Hard to keep pace with the constantly changing world
      • Not enough “plasticity” to deal with new things
      • Can’t delete outdated information automatically

5 of 24

“Plasticity”:

  • flexibility when dealing with new inputs
    • new languages/domains/tasks/facts/tokens
  • avoid naively retraining on re-mixed data, which burns $$$ for large language models!

“Language Plasticity”

  • how to pretrain a language model that can quickly extend itself to new languages?

Today

6 of 24

Rewire PLMs for New Languages (Artetxe et al. [2020])

Literature on extending language models to new languages via relearning the embeddings

Assumption:

The token embedding layer and the transformer body divide up the work: the former captures language-specific lexical meanings, while the latter handles high-level general reasoning.
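For concreteness, here is a minimal sketch (not the authors' code) of this embedding-relearning step in PyTorch, assuming a HuggingFace-style model whose token embedding layer is exposed as `model.embeddings` and whose forward pass returns a masked-LM `loss`. For simplicity the sketch reuses the existing embedding matrix; in practice a new-language vocabulary may require a fresh embedding layer of a different size.

```python
import torch

def relearn_embeddings(model, new_lang_batches, lr=5e-4, steps=10_000):
    """Adapt a pretrained LM to a new language by retraining only the token
    embeddings; the transformer body stays frozen."""
    # Freeze the whole model ...
    for p in model.parameters():
        p.requires_grad = False

    # ... then re-initialise and unfreeze only the token embedding layer.
    emb = model.embeddings            # assumed attribute name
    torch.nn.init.normal_(emb.weight, mean=0.0, std=0.02)
    emb.weight.requires_grad = True

    opt = torch.optim.Adam([emb.weight], lr=lr)
    for _, batch in zip(range(steps), new_lang_batches):
        loss = model(**batch).loss    # masked-LM loss (HuggingFace-style API assumed)
        loss.backward()
        opt.step()
        opt.zero_grad()
    return model
```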

7 of 24

Unsupervised Zero-shot Cross-Lingual Transfer

Literature on extending language models to new languages via relearning the embeddings

Assumption:

Token embeddings -> language-specific lexical meanings

Transformer body -> high-level general reasoning

8 of 24

The Language Adaptation Stage Takes Lots of Data

Sample Efficiency

Low-resource languages usually have fewer than 10M tokens available.

Standard PLMs collapse when the number of tokens in the new language is below 10M.

We can verify this by running a simulation with a bootstrapped English corpus (sketched below).

On XNLI, performance drops from 85 to 35 (accuracy) when the adaptation data contains fewer than 10M tokens.

Existing research uses CC-100 and Wikipedia as the adaptation corpus, which often contain hundreds of millions or even billions of tokens.

Low plasticity!
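The low-data simulation above can be approximated as follows. This is an illustrative sketch only (the exact bootstrapping procedure in the experiments may differ), with a crude whitespace token count and a hypothetical helper name:

```python
import random

def bootstrap_low_resource_corpus(lines, token_budget=5_000_000, seed=0):
    """Simulate a low-resource adaptation corpus by sampling lines (with
    replacement) from a large English corpus until a token budget is hit."""
    rng = random.Random(seed)
    sample, n_tokens = [], 0
    while n_tokens < token_budget:
        line = rng.choice(lines)
        sample.append(line)
        n_tokens += len(line.split())   # crude whitespace token count
    return sample
```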

9 of 24

How can we rewire PLMs to new languages with as little new data as possible?

10 of 24

How can we improve plasticity?

  • A human perspective
  • A neural network perspective
  • “Active forgetting” comes to the rescue!

11 of 24

Improving Plasticity via Forgetting:

A Human Perspective

Forgetting

Nørby [2015], Anderson and Hulbert [2021]

12 of 24

Improving Plasticity via Forgetting:

A Neural Network Perspective

Forgetting

Recent work spans several areas: computer vision, reinforcement learning, and graph learning

Alabdulmohsin et al. [2021], Taha et al. [2021], Zhou et al. [2022], Chen et al. [2022], Nikishin et al. [2022], D’Oro et al. [2022]

Forgetting via iteratively resetting part of the weights

  • Improves sample efficiency
  • Prevents overfitting to early experience
  • Better generalisation to unseen data

13 of 24

Can forgetting improve language plasticity?

-> Can forgetting help us create pretrained language models that can be easily rewired to new languages?

14 of 24

Pretraining with Active Forgetting

A Simple “Forgetting” Method

Forgetting Token Embeddings Every 1000 Gradient Updates

Forgetting = reinitialise the embedding layer by sampling from Gaussian(0, 0.02); a minimal sketch follows the list below.

  • Intuitions
    • Simulates multiple language switches without actually crafting data in new languages
    • Exposes the body to many different embedding reinitialisations
    • Encourages the body to encode general knowledge rather than “shortcut” knowledge tied to particular embedding initialisation values
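Below is a minimal, single-GPU sketch of the pretraining loop with active forgetting. The `model.embeddings` attribute and HuggingFace-style `loss` are assumptions for illustration; the distributed version used in practice is outlined on the implementation slide.

```python
import torch

def pretrain_with_active_forgetting(model, batches, reset_every=1000, lr=7e-4):
    """MLM pretraining where the token embeddings are re-initialised from
    Gaussian(0, 0.02) every `reset_every` gradient updates."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step, batch in enumerate(batches, start=1):
        loss = model(**batch).loss          # masked-LM objective (assumed API)
        loss.backward()
        opt.step()
        opt.zero_grad()

        if step % reset_every == 0:
            # Active forgetting: throw away the learnt embeddings.
            with torch.no_grad():
                torch.nn.init.normal_(model.embeddings.weight, mean=0.0, std=0.02)
            # Also drop stale Adam state for the embedding parameter.
            opt.state.pop(model.embeddings.weight, None)
    return model
```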

15 of 24

Pretraining with Active Forgetting

A Simple “Forgetting” Method

(Figure: episodic learning curve, with loss “spikes” at each embedding reset)

16 of 24

Pretraining with Active Forgetting: Implementation

A Simple “Forgetting” Method

Forgetting Token Embeddings Every K Gradient Updates

How to reset embeddings under distributed training? (See the sketch after this list.)

  • Reset the LR scheduler for the embeddings
  • Reset the optimiser state (Adam) for the embeddings
  • On GPU 0, reset the embeddings to random vectors drawn from Gaussian(0, 0.02), and broadcast the vectors to the other GPUs
  • On GPUs 1-31, use the received vectors to reset the embeddings
  • FP16 training: sync the new FP32 master parameters to the FP16 copy
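A hedged sketch of this reset step with torch.distributed follows. Attribute names like `model.embeddings` are assumptions, and the LR-scheduler/FP16 handling is only indicated in comments:

```python
import torch
import torch.distributed as dist

def reset_embeddings_distributed(model, optimizer, std=0.02):
    """Re-initialise token embeddings identically on every rank: rank 0 draws
    the new Gaussian vectors and broadcasts them to all other GPUs."""
    emb_weight = model.embeddings.weight           # assumed attribute name
    with torch.no_grad():
        if dist.get_rank() == 0:
            emb_weight.normal_(mean=0.0, std=std)  # fresh Gaussian(0, 0.02) draw
        dist.broadcast(emb_weight, src=0)          # other ranks receive rank 0's draw

    # Drop stale Adam moments for the embedding parameter so the optimiser
    # treats the reset embeddings as brand new; the LR schedule for the
    # embeddings is restarted by the trainer as well.
    optimizer.state.pop(emb_weight, None)
    # With FP16 training, the updated FP32 master weights must then be synced
    # back into the FP16 model copy (handled by the AMP/trainer wrapper).
```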

17 of 24

Evaluate Forgetting PLMs: A Low-Data Regime

Evaluations in Low-Data Regimes

  • Pretrain Forgetting and Standard PLMs using English CC100
    • roberta-base, 12-layer transformer (125M parameters)
    • 32 GPUs (V100 32 GB) for 24-36 hours
  • Adapt the body with
    • MNLI (English Natural Language Inference)
    • SQuAD (English Question Answering)
  • Adapt the token embeddings with a CC100 subset in the new language
    • Limit the adaptation corpus to only 5M tokens
  • Evaluate zero-shot on several cross-lingual transfer benchmarks (assembly sketched below)
    • XNLI (Natural Language Inference)
    • MLQA (Question Answering)
    • XQuAD (Question Answering)
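To make the zero-shot protocol concrete, here is a schematic sketch (assumed attribute names, not the paper's code) of how the pieces are combined before evaluation: the body and task head come from the English task fine-tuning, while the token embeddings come from the new-language adaptation.

```python
def assemble_cross_lingual_model(task_model, lang_adapted_model):
    """Zero-shot cross-lingual transfer: keep the transformer body and task
    head fine-tuned on the English task, but swap in the token embeddings
    relearnt on the new-language corpus (which may use its own vocabulary)."""
    task_model.embeddings = lang_adapted_model.embeddings   # assumed attribute
    return task_model   # evaluate directly on the XNLI / MLQA / XQuAD test sets
```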

18 of 24

Result: Forgetting PLMs Work Better in Low-Data Regimes

                   XNLI (accuracy)   MLQA (F1)   XQuAD (F1)
Standard PLM            53.3           34.3         36.1
Forgetting PLM          62.7           43.4         49.0

(numbers averaged across languages)

19 of 24

Result: Forgetting PLMs Learn New Languages with Fewer Parameter Updates

Within 5K parameter updates, forgetting PLMs reach roughly 90% of their converged performance, while standard PLMs reach roughly 50%.

20 of 24

Result: Languages That Are Distant From English Benefit Most From Forgetting PLMs

Thai, Arabic, Turkish, Hindi, Urdu, Chinese

  • These belong to language families different from English’s

21 of 24

Conclusions

    • Current language models need more plasticity to deal with new inputs! (maybe also for removing unwanted information?)
    • Specifically for improving language plasticity,
      • periodically forgetting the token embedding layer (which captures lexical meanings) while pretraining the transformer helps.
    • Such a simple active forgetting mechanism creates PLMs that are easier to rewire for new languages:
      • Better sample efficiency
      • Faster convergence
      • Larger gains for distant languages
    • In general, the plasticity of language models is still under-explored, yet it is a highly relevant research target.
      • The cost of retraining them for new inputs is usually very high (data, compute)
      • More plasticity = rewiring PLMs’ behaviour with a tiny fraction of new data and compute

22 of 24

Open Questions

What’s next?

  • Use forgetting PLMs for new tasks, new domains, data with temporal shifts, etc.
    • Easy-to-rewire PLMs that evolve as the world changes
  • Use forgetting for autoregressive LMs instead of masked LMs
    • Does the architecture/objective choice impact forgetting methods?
  • How about forgetting other parts instead of embeddings?
  • More generally, open to any collaboration about understanding transformers

23 of 24

Q & A

24 of 24

That’s the end!