Soft MoE
Mixture of Experts Reading Group
30th Sept
Contents
Recap on MoEs
MoEs are introduced to scale model capacity (parameter count) without a proportional increase in per-token compute.
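As a refresher, here is a minimal sketch of the sparse (top-k) MoE layer this recap is about: a learned router picks k experts per token, so parameters grow with the number of experts while per-token compute stays roughly constant. Module and dimension names are illustrative, not taken from any particular paper.

```python
import torch
import torch.nn as nn


class SparseMoE(nn.Module):
    """Minimal sparse (top-k) MoE layer: only k experts run per token."""

    def __init__(self, d: int, n_experts: int, d_hidden: int, k: int = 1):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_hidden), nn.GELU(), nn.Linear(d_hidden, d))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d) -- batch dimension flattened for simplicity
        gates = self.router(x).softmax(dim=-1)             # (tokens, n_experts)
        topk_vals, topk_idx = gates.topk(self.k, dim=-1)   # each token keeps its k best experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (topk_idx == e).any(dim=-1)             # tokens routed to expert e
            if mask.any():
                weight = topk_vals[mask][topk_idx[mask] == e].unsqueeze(-1)
                out[mask] += weight * expert(x[mask])
        return out
```

Only the k selected experts run for each token, which is exactly the parameter/compute decoupling above; the hard, discrete routing is also where the usual sparse-MoE headaches (load balancing, token dropping, non-differentiable routing decisions) come from.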
Recap on MoEs
Chapters
Problems with Sparse MoEs
Chapters
Soft MoE paradigm
Intuition
Soft MoE paradigm
Wait a minute. Does this make sense?
Turns out, yes:
Soft MoE paradigm
Intuition
Soft MoE paradigm
Soft MoE paradigm
Soft MoE paradigm
Soft MoE paradigm
Code
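The paper's reference implementation is in JAX; below is a minimal PyTorch sketch of the forward pass, with illustrative names (d, n_experts, slots_per_expert, d_hidden) and plain MLP experts, just to make the dispatch/combine mechanics concrete.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Minimal Soft MoE layer: every token contributes softly to every slot."""

    def __init__(self, d: int, n_experts: int, slots_per_expert: int, d_hidden: int):
        super().__init__()
        self.n_experts = n_experts
        # One learnable d-dimensional parameter vector per slot (n_experts * p slots in total).
        self.phi = nn.Parameter(torch.randn(d, n_experts * slots_per_expert) / d ** 0.5)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d_hidden), nn.GELU(), nn.Linear(d_hidden, d))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, m, d)
        logits = x @ self.phi                    # (batch, m, n*p)
        dispatch = logits.softmax(dim=1)         # over tokens: each slot is a convex mix of tokens
        combine = logits.softmax(dim=2)          # over slots: each token is a convex mix of slot outputs
        slots = dispatch.transpose(1, 2) @ x     # (batch, n*p, d)
        # Expert e processes its own p slots.
        outs = [expert(s) for expert, s in zip(self.experts, slots.chunk(self.n_experts, dim=1))]
        return combine @ torch.cat(outs, dim=1)  # (batch, m, d)


# Example: layer = SoftMoE(d=256, n_experts=8, slots_per_expert=2, d_hidden=1024)
#          y = layer(torch.randn(4, 196, 256))   # -> (4, 196, 256)
```

The only routing-specific choice is the pair of softmax axes: dispatch normalises over tokens, combine normalises over slots; there is no top-k, no dropped tokens, and no auxiliary load-balancing loss.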
Soft MoE paradigm
Soft MoE as a generalisation
Sparse MoE reduces to Dense
Soft MoE reduces to Sparse MoE
Soft MoE ⊇ Sparse MoE ⊇ Dense
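A rough sketch of why the nesting holds, in my own notation (D, C, Φ as in the code sketch above), not the paper's derivation:

$$
\tilde{X} = D^{\top} X, \qquad Y = C\,\tilde{Y}, \qquad
D = \operatorname{softmax}_{\text{tokens}}(X\Phi), \qquad
C = \operatorname{softmax}_{\text{slots}}(X\Phi).
$$

If D and C collapse to one-hot matrices, each slot holds exactly one token and each token reads exactly one expert output, which is top-1 sparse routing; and a sparse MoE with a single expert always selected is just a dense FFN.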
Chapters
Results And Discussion
Results And Discussion
Ablations
Underrated Points
As we scale the embedding dimension d, the router logits grow in magnitude, so the softmax outputs collapse towards one-hot vectors.
The authors mitigate this by l2-normalising the inputs to the gating function (the tokens and the slot parameters) and adding a learnable scaling factor.
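A minimal sketch of that fix, assuming it amounts to l2-normalising both the tokens and the slot parameters and scaling by a learnable scalar before the softmax (variable names are illustrative):

```python
import torch
import torch.nn.functional as F


def normalised_router_logits(x: torch.Tensor, phi: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Router logits whose magnitude no longer grows with the embedding dimension d.

    x:     (batch, m, d)  token embeddings
    phi:   (d, n_slots)   learnable slot parameters
    scale: ()             learnable scalar
    """
    x = F.normalize(x, dim=-1)      # unit-norm tokens
    phi = F.normalize(phi, dim=0)   # unit-norm slot vectors
    return scale * (x @ phi)        # softmax of these no longer saturates as d grows
```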
Underrated Points
Same up projection across experts
Different down projection
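A sketch of one way to read this (my interpretation of the slide, not code from the paper): expert MLPs that share a single up-projection but keep per-expert down-projections.

```python
import torch
import torch.nn as nn


class SharedUpExperts(nn.Module):
    """n experts sharing one up-projection and differing only in their down-projections."""

    def __init__(self, d: int, d_hidden: int, n_experts: int):
        super().__init__()
        self.up = nn.Linear(d, d_hidden)                        # shared across experts
        self.downs = nn.ModuleList(
            [nn.Linear(d_hidden, d) for _ in range(n_experts)]  # one per expert
        )

    def forward(self, slots: torch.Tensor) -> torch.Tensor:
        # slots: (batch, n_experts, p, d) -- slots grouped by the expert that owns them
        h = torch.relu(self.up(slots))
        return torch.stack([down(h[:, e]) for e, down in enumerate(self.downs)], dim=1)
```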
Underrated Points
Chapters
The Autoregressive Problem
Do we get SoftGPT?
The Autoregressive Problem
What’s the problem?
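Every slot is a convex combination of all m input tokens (D and Φ as in the code sketch above):

$$
\tilde{x}_j \;=\; \sum_{i=1}^{m} D_{ij}\, x_i, \qquad
D_{ij} \;=\; \frac{\exp\!\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m}\exp\!\big((X\Phi)_{i'j}\big)},
$$

so every slot, and therefore every expert output, depends on every position, including future ones; there is no point after this averaging at which a causal mask can be applied.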
There are two obvious ways to get this to work:
But these aren’t too promising
The Autoregressive Problem
Intuition
The Autoregressive Problem
In this case we get O(m²nd + mnk), which is worse than running n complete models as an ensemble. This is very compute-intensive and essentially puts us back to scaling compute along with parameters.
The Autoregressive Problem
The Autoregressive Problem
2. Choose which expert has a slot for each new token position
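Purely as an illustration of what option 2 could look like (a hypothetical reading, not something proposed in the paper or this talk): fix an input-independent position-to-slot assignment, e.g. round-robin, so the slot and expert for each newly generated token are known in advance and no future tokens are needed to compute the dispatch.

```python
def slot_for_position(t: int, n_experts: int, slots_per_expert: int) -> tuple[int, int]:
    """Hypothetical round-robin mapping of token position t to (expert, slot within expert).

    A fixed, causal assignment like this avoids softmaxing over future tokens,
    at the cost of giving up learned, content-based dispatch entirely.
    """
    n_slots = n_experts * slots_per_expert
    slot = t % n_slots
    return slot // slots_per_expert, slot % slots_per_expert
```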
The Autoregressive Problem
The Autoregressive Problem
Getting Soft MoE to work for decoder-only models is an open problem that will be incredible when solved.
For now, Soft MoE is a game-changer for models with an encoder:
Soft MoE
References & Marginalia
Soft MoE