1 of 35

Soft MoE

Mixture of Experts Reading Group

30th Sept

2 of 35

Contents

  1. Recap on MoEs
  2. Problems with Sparse MoEs
  3. Soft MoE paradigm
    1. Intuition
    2. Code
    3. Seeing Soft MoE as a generalisation of Sparse MoE
  4. Results and Discussion
    • Ablations
  5. The autoregressive problem…
  6. References & Marginalia

3 of 35

Recap on MoEs

MoEs are introduced to:

  • Increase the number of parameters for large models
  • Decouple scaling parameters from FLOPs per forward pass
  • Be more sample efficient
  • Move towards Adaptive Computation

4 of 35

Recap on MoEs

5 of 35

Chapters

  • ✅ Recap on MoEs
  • Problems with Sparse MoEs
  • Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

6 of 35

Problems with Sparse MoEs

  • Not inherently fully differentiable
  • The routing decision has discontinuities!
    • Gets even worse with more experts
  • Routing algorithms are not GPU-optimised and are hence slow
  • Routing seems under-optimised
    • Hash routing is only slightly worse than more principled methods
  • Problems with token dropping and expert imbalance
    • The auxiliary losses introduced to fix these require tuning
  • Not sequence-level deterministic
    • Outputs for a sequence depend on which other sequences are in the batch
  • Generally unstable to train; can diverge for seemingly no reason :(

7 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

8 of 35

Soft MoE paradigm

Intuition

  • Instead of choosing k tokens to route to each expert,
  • each expert gets s “slots”, where each slot is a softmax-weighted linear combination of all the input tokens (see the sketch below)
  • Pretty much everything else (e.g. the gating network) is the same
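
A minimal NumPy sketch of what a single slot sees (illustrative only; the names slot_input and phi_slot are mine, not from the paper):

    import numpy as np

    def slot_input(X, phi_slot):
        """One Soft MoE slot: a softmax-weighted combination of ALL n tokens.

        X:        (n, d) input tokens
        phi_slot: (d,)   learned parameter vector for this slot
        """
        logits = X @ phi_slot                    # (n,) one score per token
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()                 # softmax over tokens
        return weights @ X                       # (d,) convex combination of tokens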

9 of 35

Soft MoE paradigm

Wait a minute. Does this make sense?

  • Can we just sum tokens together and hope to get something meaningful?

10 of 35

Soft MoE paradigm

Wait a minute. Does this make sense?

  • Can we just sum tokens together and hope to get something meaningful?

Turns out, yes:

  • For images: average pooling in CNNs does something very similar.
  • For language: linear combinations of word embeddings support word2vec-style word arithmetic (e.g. king - man + woman ≈ queen).
  • The residual stream is itself built up by linearly adding layer outputs.

11 of 35

Soft MoE paradigm

Intuition

12 of 35

Soft MoE paradigm

13 of 35

Soft MoE paradigm

14 of 35

Soft MoE paradigm

15 of 35

Soft MoE paradigm

Code
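
A minimal NumPy sketch of the full layer, dispatch → experts → combine, in the spirit of the paper's pseudocode (function and variable names are mine; a real implementation would batch the expert loop):

    import numpy as np

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def soft_moe_layer(X, phi, experts):
        """Soft MoE layer (sketch).

        X:       (n, d)   input tokens
        phi:     (d, e*s) learned slot parameters (e experts, s slots each)
        experts: list of e callables, each mapping (s, d) -> (s, d)
        """
        logits = X @ phi                   # (n, e*s) token-to-slot scores

        # Dispatch: softmax over TOKENS, so each slot is a convex
        # combination of all n input tokens.
        D = softmax(logits, axis=0)
        slot_inputs = D.T @ X              # (e*s, d)

        # Each expert processes its own block of s slots.
        e = len(experts)
        s = slot_inputs.shape[0] // e
        slot_outputs = np.concatenate(
            [experts[i](slot_inputs[i * s:(i + 1) * s]) for i in range(e)],
            axis=0,
        )                                  # (e*s, d)

        # Combine: softmax over SLOTS, so each output token is a convex
        # combination of all e*s slot outputs.
        C = softmax(logits, axis=1)
        return C @ slot_outputs            # (n, d)

Note that the same logits feed both softmaxes; only the normalisation axis differs (over tokens for dispatch, over slots for combine).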

16 of 35

Soft MoE paradigm

Soft MoE as a generalisation

Sparse MoE reduces to Dense

  • num_experts ← 1

Soft MoE reduces to Sparse MoE

  • dispatch_softmax_temperature ← 0 (the soft dispatch hardens into a one-hot token choice; see the sketch below)

[Diagram: Soft MoE generalises Sparse MoE, which generalises Dense]
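
A toy sketch of the temperature view above (the explicit temperature knob is the presenter's framing, not a parameter in the paper): as the temperature goes to 0, the dispatch softmax collapses to a one-hot choice of token per slot, i.e. hard routing.

    import numpy as np

    def dispatch_weights(logits, temperature=1.0):
        """Softmax over tokens (axis 0) with a temperature knob."""
        z = logits / max(temperature, 1e-8)
        z = z - z.max(axis=0, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=0, keepdims=True)

    logits = np.array([[2.0, 0.5],
                       [1.0, 1.5],
                       [0.5, 3.0]])            # 3 tokens x 2 slots

    print(dispatch_weights(logits, 1.0))       # soft mixture of tokens per slot
    print(dispatch_weights(logits, 1e-4))      # ~one-hot: each slot picks one token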

17 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • ✅ Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

18 of 35

Results And Discussion

19 of 35

Results And Discussion

Ablations

20 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!

As we scale the embedding dim d, the routing logits grow in magnitude, so the softmax outputs collapse towards one-hot vectors.

To mitigate this they normalise the inputs to the gating network and apply a learnable scaling factor (sketched below).

[Figure: router logits passed through a softmax]
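
One way this could look in code (my reading of “scaling factors and normalise the gating network”; the function names are mine):

    import numpy as np

    def l2norm(x, axis, eps=1e-6):
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    def routing_logits(X, phi, scale):
        """Routing logits with l2-normalised inputs and a learnable scale.

        X:     (n, d)   tokens, normalised along d
        phi:   (d, e*s) slot parameters, each column normalised
        scale: learned scalar that keeps logit magnitude independent of d
        """
        return scale * (l2norm(X, axis=1) @ l2norm(phi, axis=0))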

21 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!

Same up-projection shared across all experts

Different down-projection per expert (sketched below)
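
A NumPy sketch of the “experts as half of the FFN” idea as described above, i.e. one shared up-projection and a per-expert down-projection (spelled out from the slide as an assumption; names are mine):

    import numpy as np

    def half_ffn_experts(slots, W_up, W_down):
        """Expert FFN whose first half is shared.

        slots:  (e, s, d)     slot inputs, grouped per expert
        W_up:   (d, d_ff)     up-projection shared by all experts
        W_down: (e, d_ff, d)  a separate down-projection per expert
        """
        h = np.maximum(slots @ W_up, 0.0)            # shared up-projection + ReLU
        # Expert i applies its own W_down[i] to its own slots.
        return np.einsum('esf,efd->esd', h, W_down)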

22 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!
    • This is almost as good as when we ditched BatchNorm (RIP)

23 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • ✅ Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • ✅ Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

24 of 35

The Autoregressive Problem

Do we get SoftGPT?

  • Well… not yet

25 of 35

The Autoregressive Problem

What’s the problem?

There are two obvious ways to get this to work:

  1. Use a causal mask and let every expert have a slot for every token position
  2. Use a causal mask and fix an ordering for which expert gets the slot that introduces each new token position (e.g. random, uniform, etc.)

But these aren’t too promising

26 of 35

The Autoregressive Problem

Intuition

27 of 35

The Autoregressive Problem

  1. Every expert has a slot for every position

  • Their experiments show that Soft MoE works best when num_experts = context_length (i.e. slots_per_expert = 1)
  • This approach would require slots_per_expert = context_length as well, squaring the number of slots needed!

In this case the cost is O(m^2 nd + mnk), which is worse than running n complete models as an ensemble. This is very compute-intensive and essentially puts us back to scaling compute along with parameters.
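
A quick back-of-the-envelope count (1024 is an assumed context length, chosen for illustration):

    # Standard Soft MoE sweet spot per the slides: one slot per expert,
    # with as many experts as tokens.
    context_length = 1024
    slots_soft = context_length * 1                      # 1,024 slots

    # Approach 1: every expert also needs a slot for every position.
    slots_autoregressive = context_length * context_length
    print(slots_soft, slots_autoregressive)              # 1024 1048576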

28 of 35

The Autoregressive Problem

  • Every expert has a slot for every position

  • Their experiments show that Soft MoE works best when num_experts = context_length (i.e. slots_per_expert = 1)
  • This approach would require slots_per_expert = context_length as well, squaring the number of slots needed!

29 of 35

The Autoregressive Problem

2. Choose which expert has a slot for each new token position

  • Fixing which expert owns each position makes the experts overfit to positions: there is now a much larger difference between “cat” @ pos 5 and “cat” @ pos 6 than there should be
  • Choosing slots at random stops the experts from specialising
  • Neither of these seems great :(

30 of 35

The Autoregressive Problem

31 of 35

The Autoregressive Problem

Getting Soft MoE to work for decoder-only models is an open problem; solving it would be incredible.

32 of 35

The Autoregressive Problem

Getting Soft MoE to work for decoder-only models is an open problem; solving it would be incredible.

For now, Soft MoE is a game-changer for models with an encoder:

  • Translation
  • Vision Transformers
  • Audio Transformers
  • Decision Transformers
  • T5

33 of 35

Soft MoE

34 of 35

References & Marginalia

35 of 35

Soft MoE