Improving Language Plasticity via Pretraining with Active Forgetting
Yihong Chen, July 25th 2023, ELLIS Unconference 2023
About Me
Amazing Collaborators
Roberta Raileanu
Kelly Marchisio
David Ifeoluwa Adelani
Pontus Stenetorp
Sebastian Riedel
Mikel Artetxe
Pretrained Language Models (PLMs)
Language Models!
“Plasticity”:
“Language Plasticity”
Today
Rewire PLMs for New Languages (Artetxe et al. [2020])
Literature on extending language models to new languages via relearning the embeddings
Assumption:
The token embedding layer and the transformer body divide up responsibility: the former handles language-specific lexical meanings, while the latter handles high-level, general reasoning.
Unsupervised Zero-shot Cross-Lingual Transfer
Literature on extending language models to new languages via relearning the embeddings
Assumption:
Token embeddings
-> language-specific lexical meanings
Transformer body
-> high-level general reasoning
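Under this assumption, the adaptation recipe amounts to relearning only the embeddings. Below is a minimal sketch, assuming a RoBERTa-style PLM from HuggingFace Transformers; the model name and learning rate are illustrative, not the authors' exact setup.

```python
import torch
from transformers import RobertaForMaskedLM

# Sketch: adapt a pretrained model to a new language by relearning only
# the token embeddings while keeping the transformer body frozen.
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# Freeze the transformer body (and everything else) ...
for param in model.parameters():
    param.requires_grad = False

# ... then re-initialise and unfreeze the token embedding layer.
# With weight tying, the output projection shares this same matrix.
embeddings = model.get_input_embeddings()
torch.nn.init.normal_(embeddings.weight, mean=0.0, std=0.02)
embeddings.weight.requires_grad = True

# Train only the embeddings with masked language modelling
# on the new-language corpus.
optimizer = torch.optim.AdamW([embeddings.weight], lr=1e-4)
```

In practice a new language would also need its own tokenizer (and possibly a call such as model.resize_token_embeddings), which this sketch leaves out.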
The Language Adaptation Stage Takes Lots of Data
Sample Efficiency
Low-resource languages usually have fewer than 10M tokens.
Standard PLMs collapse when the new language has fewer than 10M tokens.
We can verify this by running a simulation with a bootstrapped English corpus (a sketch follows this list).
On XNLI, performance drops from 85 to 35 when the adaptation data contains fewer than 10M tokens.
Existing research uses CC-100 and Wikipedia as adaptation corpora, which often contain hundreds of millions or even billions of tokens.
Low plasticity!
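A rough sketch of such a simulation, assuming a sentence-level English corpus; the helper name and token budgets are illustrative only.

```python
import random

def bootstrap_corpus(sentences, token_budget, seed=0):
    """Sketch: simulate a low-resource 'language' by bootstrapping English.

    Sample sentences with replacement until roughly `token_budget`
    (e.g. 1M, 10M, 100M) whitespace-delimited tokens are collected.
    """
    rng = random.Random(seed)
    sampled, total = [], 0
    while total < token_budget:
        sentence = rng.choice(sentences)
        sampled.append(sentence)
        total += len(sentence.split())
    return sampled
```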
How can we rewire PLMs to new languages with as little new data as possible?
How can we improve plasticity?
Improving Plasticity via Forgetting:
A Human Perspective
Forgetting
Nørby [2015], Anderson and Hulbert [2021]
Improving Plasticity via Forgetting:
A Neural Network Perspective
Forgetting
Recent work in several areas: computer vision, reinforcement learning, and graph learning
Alabdulmohsin et al. [2021], Taha et al. [2021], Zhou et al. [2022], Chen et al. [2022], Nikishin et al. [2022], D’Oro et al. [2022]
Forgetting via iterative partial weight resetting
Improves sample efficiency
Prevents overfitting to early experience
Generalises better to unseen data
Can forgetting improve language plasticity?
-> Can forgetting help us create pretrained language models that can be easily rewired to new languages?
Pretraining with Active Forgetting
A Simple “Forgetting” Method
Forgetting Token Embeddings Every 1000 Gradient Updates
Forgetting = Reinitialise the embedding layer by sampling from Gaussian(0, 0.02)
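As a rough illustration, the reset step inside a PyTorch pretraining loop could look like the sketch below; the K = 1000 interval and the Gaussian(0, 0.02) initialisation follow the slide, while the function name and the optimizer-state handling are assumptions.

```python
import torch

RESET_INTERVAL = 1000  # K gradient updates between forgetting events
INIT_STD = 0.02        # reset by sampling from Gaussian(0, 0.02)

def maybe_forget_embeddings(model, optimizer, step):
    """Re-initialise the token embedding layer every RESET_INTERVAL updates."""
    if step > 0 and step % RESET_INTERVAL == 0:
        emb = model.get_input_embeddings()
        torch.nn.init.normal_(emb.weight, mean=0.0, std=INIT_STD)
        # Assumption: also drop stale optimizer statistics (e.g. Adam moments)
        # for the reset weights; the slide does not specify this detail.
        optimizer.state.pop(emb.weight, None)
```

It would be called once per gradient update, e.g. right after optimizer.step(); the rest of pretraining is unchanged.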
Pretraining with Active Forgetting
A Simple “Forgetting” Method
Episodic learning curve: “spikes” appear at each reset
Pretraining with Active Forgetting: Implementation
A Simple “Forgetting” Method
Forgetting Token Embeddings Every K Gradient Updates
How to reset embeddings under distributed training?
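One plausible answer, sketched with PyTorch DistributedDataParallel: draw the new random weights on rank 0 and broadcast them, so every worker resumes from identical embeddings. The helper below is an assumption, not the paper's implementation.

```python
import torch
import torch.distributed as dist

def reset_embeddings_ddp(ddp_model, std=0.02):
    """Sketch: keep embedding resets consistent across DDP workers."""
    emb = ddp_model.module.get_input_embeddings()  # unwrap the DDP wrapper
    if dist.get_rank() == 0:
        # Only rank 0 draws the new random weights ...
        torch.nn.init.normal_(emb.weight, mean=0.0, std=std)
    # ... and broadcasts them so all ranks stay in sync.
    dist.broadcast(emb.weight.data, src=0)
    dist.barrier()
```

An alternative is to seed the RNG identically on every rank before resetting, which avoids the broadcast.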
Evaluate Forgetting PLMs: A Low-Data Regime
Evaluations in Low-Data Regimes
Result: Forgetting PLMs Work Better in Low-Data Regimes
|                | XNLI (accuracy) | MLQA (F1) | XQuAD (F1) |
| Standard PLM   | 53.3            | 34.3      | 36.1       |
| Forgetting PLM | 62.7            | 43.4      | 49.0       |
numbers averaged across languages
Result: Forgetting PLMs Learn New Languages with Fewer Parameter Updates
Within 5K parameter updates, forgetting PLMs reach about 90% of their converged performance, while standard PLMs reach only about 50%
Result: Languages That Are Distant from English Benefit Most from Forgetting PLMs
Thai, Arabic, Turkish, Hindi, Urdu, Chinese
Conclusions
Open Questions
What’s next?
Q & A
That’s the end!