1 of 35

Soft MoE

Mixture of Experts Reading Group

30th Sept

2 of 35

Contents

  1. Recap on MoEs
  2. Problems with Sparse MoEs
  3. Soft MoE paradigm
    1. Intuition
    2. Code
    3. Seeing Soft MoE as a generalisation of Sparse MoE
  4. Results and Discussion
    • Ablations
  5. The autoregressive problem…
  6. References & Marginalia

3 of 35

Recap on MoEs

MoEs are introduced to:

  • Increase the number of parameters for large models
  • Decouple scaling parameters from FLOPs per forward pass
  • Be more sample efficient
  • Move towards Adaptive Computation

4 of 35

Recap on MoEs

5 of 35

Chapters

  • ✅ Recap on MoEs
  • Problems with Sparse MoEs
  • Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

6 of 35

Problems with Sparse MoEs

  • Not inherently fully differentiable
  • The routing decision has discontinuities!
    • Gets even worse with more experts
  • Routing algorithms are not GPU-optimised and are hence slow
  • Routing seems under-optimised
    • Hash routing is only slightly worse than more principled methods
  • Problems with token dropping and expert imbalance
    • The auxiliary losses introduced to fix these require tuning
  • Not sequence-level deterministic
    • Outputs for a sequence depend on which other sequences are in the batch
  • Generally unstable to train; can diverge for seemingly no reason :(

7 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

8 of 35

Soft MoE paradigm

Intuition

  • Instead of choosing k tokens to route to each expert,
  • each expert gets s “slots”, where each slot is a softmax-weighted linear combination of all the input tokens (see the sketch below)
  • Pretty much everything else (e.g. the gating network) is the same
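
A minimal NumPy sketch of what a single slot sees (illustrative only; the names slot_input and phi_slot are mine, not from the paper):

    import numpy as np

    def slot_input(X, phi_slot):
        """One Soft MoE slot: a softmax-weighted combination of ALL n tokens.

        X:        (n, d) input tokens
        phi_slot: (d,)   learned parameter vector for this slot
        """
        logits = X @ phi_slot                    # (n,) one score per token
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()                 # softmax over tokens
        return weights @ X                       # (d,) convex combination of tokens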

9 of 35

Soft MoE paradigm

Wait a minute. Does this make sense?

  • Can we just sum tokens together and hope to get something meaningful?

10 of 35

Soft MoE paradigm

Wait a minute. Does this make sense?

  • Can we just sum tokens together and hope to get something meaningful?

Turns out, yes:

  • For images: average pooling in CNNs does something very similar.
  • For language: linear combinations of word embeddings support word2vec-style word arithmetic (e.g. king - man + woman ≈ queen).
  • The residual stream is itself built up by linearly adding layer outputs.

11 of 35

Soft MoE paradigm

Intuition

12 of 35

Soft MoE paradigm

13 of 35

Soft MoE paradigm

14 of 35

Soft MoE paradigm

15 of 35

Soft MoE paradigm

Code
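
A minimal NumPy sketch of the full layer, dispatch → experts → combine, in the spirit of the paper's pseudocode (function and variable names are mine; a real implementation would batch the expert loop):

    import numpy as np

    def softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def soft_moe_layer(X, phi, experts):
        """Soft MoE layer (sketch).

        X:       (n, d)   input tokens
        phi:     (d, e*s) learned slot parameters (e experts, s slots each)
        experts: list of e callables, each mapping (s, d) -> (s, d)
        """
        logits = X @ phi                   # (n, e*s) token-to-slot scores

        # Dispatch: softmax over TOKENS, so each slot is a convex
        # combination of all n input tokens.
        D = softmax(logits, axis=0)
        slot_inputs = D.T @ X              # (e*s, d)

        # Each expert processes its own block of s slots.
        e = len(experts)
        s = slot_inputs.shape[0] // e
        slot_outputs = np.concatenate(
            [experts[i](slot_inputs[i * s:(i + 1) * s]) for i in range(e)],
            axis=0,
        )                                  # (e*s, d)

        # Combine: softmax over SLOTS, so each output token is a convex
        # combination of all e*s slot outputs.
        C = softmax(logits, axis=1)
        return C @ slot_outputs            # (n, d)

Note that the same logits feed both softmaxes; only the normalisation axis differs (over tokens for dispatch, over slots for combine).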

16 of 35

Soft MoE paradigm

Soft MoE as a generalisation

Sparse MoE reduces to Dense

  • num_experts ← 1

Soft MoE reduces to Sparse MoE

  • dispatch_softmax_temperature ← 0 (the soft dispatch hardens into a one-hot token choice; see the sketch below)

[Diagram: Soft MoE generalises Sparse MoE, which generalises Dense]
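
A toy sketch of the temperature view above (the explicit temperature knob is the presenter's framing, not a parameter in the paper): as the temperature goes to 0, the dispatch softmax collapses to a one-hot choice of token per slot, i.e. hard routing.

    import numpy as np

    def dispatch_weights(logits, temperature=1.0):
        """Softmax over tokens (axis 0) with a temperature knob."""
        z = logits / max(temperature, 1e-8)
        z = z - z.max(axis=0, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=0, keepdims=True)

    logits = np.array([[2.0, 0.5],
                       [1.0, 1.5],
                       [0.5, 3.0]])            # 3 tokens x 2 slots

    print(dispatch_weights(logits, 1.0))       # soft mixture of tokens per slot
    print(dispatch_weights(logits, 1e-4))      # ~one-hot: each slot picks one token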

17 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • ✅ Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

18 of 35

Results And Discussion

19 of 35

Results And Discussion

Ablations

20 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!

As we scale the embedding dim d, the routing logits grow in magnitude, so the softmax outputs collapse towards one-hot vectors.

To mitigate this they normalise the inputs to the gating network and apply a learnable scaling factor (sketched below).

[Figure: router logits passed through a softmax]
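
One way this could look in code (my reading of “scaling factors and normalise the gating network”; the function names are mine):

    import numpy as np

    def l2norm(x, axis, eps=1e-6):
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    def routing_logits(X, phi, scale):
        """Routing logits with l2-normalised inputs and a learnable scale.

        X:     (n, d)   tokens, normalised along d
        phi:   (d, e*s) slot parameters, each column normalised
        scale: learned scalar that keeps logit magnitude independent of d
        """
        return scale * (l2norm(X, axis=1) @ l2norm(phi, axis=0))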

21 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!

Same up-projection shared across all experts

Different down-projection per expert (sketched below)
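
A NumPy sketch of the “experts as half of the FFN” idea as described above, i.e. one shared up-projection and a per-expert down-projection (spelled out from the slide as an assumption; names are mine):

    import numpy as np

    def half_ffn_experts(slots, W_up, W_down):
        """Expert FFN whose first half is shared.

        slots:  (e, s, d)     slot inputs, grouped per expert
        W_up:   (d, d_ff)     up-projection shared by all experts
        W_down: (e, d_ff, d)  a separate down-projection per expert
        """
        h = np.maximum(slots @ W_up, 0.0)            # shared up-projection + ReLU
        # Expert i applies its own W_down[i] to its own slots.
        return np.einsum('esf,efd->esd', h, W_down)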

22 of 35

Underrated Points

  • Normalising and scaling routing logits
  • Having the experts as half of the FFN
  • Sequence Determinism is back!
    • This is almost as good as when we ditched BatchNorm (RIP)

23 of 35

Chapters

  • ✅ Recap on MoEs
  • ✅ Problems with Sparse MoEs
  • ✅ Soft MoE paradigm
    • Intuition
    • Code
    • Seeing Soft MoE as a generalisation of Sparse MoE
  • ✅ Results and Discussion
    • Ablations
  • The autoregressive problem…
  • References & Marginalia

24 of 35

The Autoregressive Problem

Do we get SoftGPT?

  • Well… not yet

25 of 35

The Autoregressive Problem

What’s the problem?

There are two obvious ways to get this to work:

  1. Use a causal mask and let every expert have a slot for every token position
  2. Use a causal mask and fix an ordering for which expert gets the slot that introduces each new token position (e.g. random, uniform, etc.)

But these aren’t too promising

26 of 35

The Autoregressive Problem

Intuition

27 of 35

The Autoregressive Problem

  1. Every expert has a slot for every position

  • Their experiments show that Soft MoE works best when num_experts = context_length (i.e. slots_per_expert = 1)
  • This approach would require slots_per_expert = context_length as well, squaring the number of slots needed!

In this case the cost is O(m^2 nd + mnk), which is worse than running n complete models as an ensemble. This is very compute-intensive and essentially puts us back to scaling compute along with parameters.
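
A quick back-of-the-envelope count (1024 is an assumed context length, chosen for illustration):

    # Standard Soft MoE sweet spot per the slides: one slot per expert,
    # with as many experts as tokens.
    context_length = 1024
    slots_soft = context_length * 1                      # 1,024 slots

    # Approach 1: every expert also needs a slot for every position.
    slots_autoregressive = context_length * context_length
    print(slots_soft, slots_autoregressive)              # 1024 1048576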

28 of 35

The Autoregressive Problem

  • Every expert has a slot for every position

  • Their experiments show that Soft MoE works best when num_experts = context_length (i.e. slots_per_expert = 1)
  • This approach would require slots_per_expert = context_length as well, squaring the number of slots needed!

29 of 35

The Autoregressive Problem

2. Choose which expert has a slot for each new token position

  • Fixing which expert owns each position makes the experts overfit to positions: there is now a much larger difference between “cat” @ pos 5 and “cat” @ pos 6 than there should be
  • Choosing slots at random stops the experts from specialising
  • Neither of these seems great :(

30 of 35

The Autoregressive Problem

31 of 35

The Autoregressive Problem

Getting Soft MoE to work for decoder-only models is an open problem; solving it would be incredible.

32 of 35

The Autoregressive Problem

Getting Soft MoE to work for decoder-only models is an open problem; solving it would be incredible.

For now, Soft MoE is a game-changer for models with an encoder:

  • Translation
  • Vision Transformers
  • Audio Transformers
  • Decision Transformers
  • T5

33 of 35

Soft MoE

34 of 35

References & Marginalia

35 of 35

Soft MoE