1 of 40

Neural scaling laws course W22; Lecture 15

Presenter : Léo Gagnon

2 of 40

Routing networks and motivation

3 of 40

Plan of the presentation

  1. Routing networks and motivation
  2. Routed Language Models and three architectures
  3. Unified scaling laws
  4. Recap and discussion

4 of 40

Routing networks

A routing network consists of a set of modules (parameterized functions) from which a router can choose a composition.

**Note that in certain architectures the input can be routed to many modules at the same time.**
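To make the idea concrete, here is a minimal top-1 routing layer sketched in PyTorch-style Python. The FFN expert shape, the softmax router and the per-expert loop are illustrative simplifications, not the design of any specific paper discussed here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1RoutingLayer(nn.Module):
    """A set of expert modules plus a router that picks one expert per input."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        # The "modules": simple FFN experts with identical shapes (illustrative).
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model),
                           nn.ReLU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(num_experts)]
        )
        # The router: produces a distribution over experts for each input.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        chosen = probs.argmax(dim=-1)               # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = chosen == e
            if mask.any():
                # Scale by the router probability so gradients also reach the router.
                out[mask] = probs[mask, e].unsqueeze(-1) * expert(x[mask])
        return out
```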

5 of 40

Motivation 1 : Compositionality and generalisation

The world often has a compositional and modular structure : things are made of parts which can be recombined into other things. When the world/task/distribution changes, these parts remain unchanged.

By separately learning modules and a way to compose them, routing networks can better adapt to changes in distribution.

6 of 40

Motivation 2 : Disentangle model size and compute

It is increasingly evident that in many cases model performance scales with size (e.g. Kaplan et al. 2020). However, so does compute (FLOPs). This undesirable coupling between size and computation motivates a search for architectures in which the two are disentangled.

Routing networks disentangle total number of parameters and compute cost.

7 of 40

Motivation 2’ : Improved parallelism

Since large models do not fit on any single device, in practice it is necessary to distribute the model across several devices (model parallelism). However, this can significantly slow down execution, and communication costs can be prohibitively expensive.

Routing networks enable efficient model parallelism because modules do not interact and can be executed in parallel.

8 of 40

Challenges

Training a routing network is non-stationary from the perspective of both the router and the modules, because the optimal composition strategy depends on the module parameters and vice versa. This gives rise to many challenges :

  • Early training stability
  • Module collapse
  • Module diversity

From a more practical POV, it is important to balance the load across modules so that each device is used efficiently and all modules learn at the same rate (this helps with collapse and diversity).

9 of 40

Routed language models

10 of 40

Routed language models

We introduce routing in a large transformer language model by replacing a fraction (R = 0.5) of the FFNs with routing networks : each token in the batch is sent to K experts (K = 1 for now) by a routing function.

11 of 40

Digression : Role of the FFN in a Transformer

“We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary” -Geva et al. (2021)

12 of 40

Routed language models : Architectures

The paper considers three different choices for the routing process :

  • Sparse Mixture-of-Experts (SMOEs)
  • Hash Routing (HASH)
  • Reinforcement-learning Router (RL-R)

13 of 40

Sparse Mixture-of-Experts

A routing function produces a distribution over experts for every token. The token is then processed by the top-K experts, and the result is a convex combination of their outputs.
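In symbols (a generic SMoE formulation; W_r denotes the router's weight matrix, and details such as the renormalisation over the top-K experts may differ from the paper's exact gating) :

$$y \;=\; \sum_{e \,\in\, \text{top-}K(p(x))} \frac{p_e(x)}{\sum_{e' \in \text{top-}K(p(x))} p_{e'}(x)}\; \text{FFN}_e(x), \qquad p(x) = \text{softmax}(W_r\, x)$$

For K = 1 this reduces to sending the token to its single highest-probability expert.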

We will consider K=1 (as in the picture) for now and discuss this point further later.

14 of 40

Sparse Mixture-of-Experts : Balancing the load (1)

Naïve SMOEs lead to poor load balancing, which needs to be addressed. This is done in three ways.

1) Addition of an auxiliary load-balancing loss, of the form used in e.g. Fedus et al. (2021) :

L_aux ∝ E · Σ_e f_e · p̄_e

where p̄_e is the mean probability assigned to expert e over the batch and f_e is the fraction of tokens dispatched to expert e in the batch.

Only p̄_e is differentiable, so the gradient reaches the router through it.
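A minimal sketch of how p̄_e and f_e can be computed from the router outputs; the coefficient alpha and its default value are assumptions.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, chosen_expert: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss encouraging a uniform token -> expert assignment.

    router_logits: (num_tokens, num_experts), chosen_expert: (num_tokens,)
    """
    probs = F.softmax(router_logits, dim=-1)
    p_bar = probs.mean(dim=0)                  # mean router prob per expert (differentiable)
    f = F.one_hot(chosen_expert, num_experts).float().mean(dim=0)  # fraction dispatched (not differentiable)
    # Minimised when both p_bar and f are uniform (= 1/E for every expert).
    return alpha * num_experts * torch.sum(f * p_bar)
```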

15 of 40

Sparse Mixture-of-Experts : Balancing the load (2)

Naïve SMOEs lead to poor load balancing, which needs to be addressed. This is done in three ways.

2) Balanced assignment optimisation during training

To make sure that the assignments are especially well balanced during training, we add an additional step where we iteratively normalize the expert distributions generated by the router so that every expert has an average assignment probability of 1/E.

The method is called the Sinkhorn algorithm.

BASE uses a hard assignment optimisation instead.
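A minimal sketch of the iterative normalisation described above, operating directly on the (tokens × experts) probability matrix; the number of iterations and working in probability space rather than log space are assumptions.

```python
import torch

def sinkhorn_balance(probs: torch.Tensor, n_iters: int = 5) -> torch.Tensor:
    """Iteratively rescale a (num_tokens, num_experts) matrix of routing
    probabilities so that each token row remains a distribution (sums to 1)
    and each expert column receives roughly num_tokens / num_experts mass,
    i.e. an average assignment probability of 1/E per expert."""
    t, e = probs.shape
    pi = probs.clone()
    for _ in range(n_iters):
        pi = pi / pi.sum(dim=0, keepdim=True) * (t / e)   # balance experts (columns)
        pi = pi / pi.sum(dim=1, keepdim=True)             # renormalize tokens (rows)
    return pi
```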

16 of 40

Sparse Mixture-of-Experts : Balancing the load (3)

Naïve SMOEs lead to poor load balancing, which needs to be addressed. This is done in three ways.

3) Expert capacity and overflow

It can still happen that the assignment is not exactly balanced. We can tolerate some amount of uneven assignment by giving additional capacity to each device.

If, even then, an expert receives too many tokens, the overflowing tokens skip the expert and go straight to the next layer.
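Schematically (the name capacity_factor and the pass-through behaviour for overflowing tokens follow common MoE implementations and are assumptions here) :

```python
import math

def expert_capacity(num_tokens: int, num_experts: int, capacity_factor: float = 1.25) -> int:
    """Maximum number of tokens a single expert will process in one batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

# Example: with 2048 tokens, 64 experts and 1.25x slack,
# each expert accepts at most ceil(1.25 * 2048 / 64) = 40 tokens.
# Tokens beyond that limit are not processed by the expert: they continue
# unchanged to the next layer.
```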

17 of 40

Routing with Reinforcement Learning

Here the token assignment problem is modeled as a one-step MDP where the observation is the token, the actions are the experts and the reward is the probability that the overall model assigns to the right next token. A policy gradient (REINFORCE) loss is added to the language modeling loss (Xent).

This is similar to SMOEs, but the routing process is optimised directly. However, the high variance of the gradient is problematic, especially when the number of experts grows. The authors experiment with various improvements to naïve REINFORCE (e.g. adding a baseline).
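A sketch of the policy-gradient term described above. The reward is the probability the LM assigns to the correct next token, as on the slide; using a simple batch-mean baseline is one choice among the variants the authors mention.

```python
import torch
import torch.nn.functional as F

def reinforce_router_loss(router_logits: torch.Tensor,
                          chosen_expert: torch.Tensor,
                          next_token_prob: torch.Tensor) -> torch.Tensor:
    """REINFORCE loss for a one-step routing MDP.

    router_logits:   (num_tokens, num_experts) router outputs (the policy).
    chosen_expert:   (num_tokens,) sampled actions.
    next_token_prob: (num_tokens,) probability the LM assigned to the correct
                     next token, used as the (non-differentiable) reward.
    """
    log_pi = F.log_softmax(router_logits, dim=-1)
    log_pi_chosen = log_pi.gather(1, chosen_expert.unsqueeze(1)).squeeze(1)
    reward = next_token_prob.detach()
    baseline = reward.mean()                  # simple variance-reduction baseline
    # Maximise E[(reward - baseline) * log pi(a|s)] => minimise the negative.
    return -((reward - baseline) * log_pi_chosen).mean()
```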

Balancing tricks (2) and (3) of SMOEs are also used.

18 of 40

Input-based Deterministic Hash Routing

Here each token's assignment is determined by a fixed function of its ID. The paper uses the token ID modulo E.
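The assignment rule itself is a one-liner; the function name is illustrative :

```python
def hash_route(token_id: int, num_experts: int) -> int:
    """Deterministic, learning-free routing: the expert is fixed by the token ID."""
    return token_id % num_experts

# e.g. with 8 experts, token id 1305 always goes to expert 1305 % 8 = 1.
```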

“Finally, given that our routing approach is learning free, our results perhaps suggest that none of the current approaches are routing particularly well.” -Roller et al. 2021. lol

Balancing trick (3) is used.

19 of 40

Disclaimer : Engineering and tuning

Note that for all the methods described, careful tuning of hyperparameters, initialisation and other engineering aspects is essential for good performance, and especially for training stability. In fact, many recent papers on routed language models (from Google) are almost exclusively focussed on engineering (Fedus et al. 2021, Du et al. 2021).

20 of 40

Unified Scaling Laws

21 of 40

Setup

Task : Autoregressive language modelling

Metric of performance : Validation log-likelihood

Dataset : MassiveText (Rae et al. 2022)

Base architecture : GPT-2 (Radford et al. 2019)

22 of 40

Notation
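Symbols used throughout the deck :

  • N : number of parameters of the dense model (the parameters active for a given token)
  • E : number of experts available in each routing layer
  • K : number of experts each token is routed to
  • R : fraction of FFN layers replaced by routing layers
  • L(N, E) : validation loss of a routed model with dense size N and E experts
  • (introduced later) P : total parameter count, F : TeraFLOPs per forward pass, B = P/F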

23 of 40

Separable Scaling Laws in Model Size and Experts

The starting point is the scaling law of Kaplan et al. (2020) for a dense language model with N params. :

Then they hypothesize that for a given N, the loss scales similarly with respect to E.

And further that the two power laws are separable and can be combined :
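Written in the log–log notation of the recap slide, these hypotheses read schematically :

$$\log L(N, 1) \;\approx\; a \log N + d$$

$$\log L(N, E) \;\approx\; b \log E + \text{const}(N) \quad \text{(for fixed } N\text{)}$$

$$\log L(N, E) \;\approx\; a \log N + b \log E + d \quad \text{(separable ansatz)}$$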

24 of 40

Power laws are not separable

While the first hypothesis is empirically verified, the separable power law doesn’t hold : the exponent b depends on N. Routing gives diminishing returns when N grows (N = 5M, 15M, 130M, 370M, 1.3B)

25 of 40

Quadratic Interaction in N and E

While the first hypothesis is empirically verified, the separable power law doesn’t hold : the exponent b depends on N. Therefore, they modify their ansatz to account for the interaction between E and N :
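In the same notation, the modified ansatz adds a bilinear interaction term in the logs :

$$\log L(N, E) \;\approx\; a \log N + b \log E + c \,\log N \log E + d$$

so that a(E) = a + c log(E) and b(N) = b + c log(N), as in the recap.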

The constant c quantifies the diminishing returns from routing as size increases (and vice-versa). A clear goal for Routed Language Models is to have c as close to 0 as possible.

27 of 40

Bounded scaling in E

The authors observe additional sources of diminishing returns at low and high values of E.

When E is low, scaling is weakened by the fixed overhead of the routing process (e.g. interference from the balancing loss); when E is high, the different routing methods deteriorate for different reasons (e.g. gradient variance in RL).

To correct for that, authors apply a transformation to E.
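Qualitatively, E is replaced by a saturating value Ê (the exact parameterisation is given in the paper) that behaves as

$$\hat E \approx E \;\; \text{for } E_{\text{start}} \ll E \ll E_{\max}, \qquad \hat E \to E_{\max} \;\; \text{as } E \to \infty .$$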

The constant E_max quantifies the diminishing returns from routing at high numbers of experts. A clear goal for Routed Language Models is to have the largest E_max possible.

28 of 40

Final scaling law

At the end, they arrive at the following scaling law :
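With Ê the saturating transform of E from the previous slide, it takes the form (in the recap notation) :

$$\log L(N, E) \;\approx\; a \log N + b \log \hat E + c \,\log N \log \hat E + d$$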

29 of 40

Routing is good

30 of 40

Routing is good : for how long?

Q : Given the diminishing returns, at what value of N does routing stop being beneficial?

Let N̄ be the Effective Parameter Count (EPC), obtained by solving L(N̄, 1) = L(N, E) for N̄ : the size of the dense model that would achieve the same loss as the routed model.

The validation loss then trivially follows a power law with respect to N̄.

31 of 40

Routing is good : for how long?

Q : Given the diminishing returns, at what value of N does routing stop being beneficial?

We are interested in N_cutoff, the value of N at which routing stops being beneficial, i.e. where the EPC equals N. It can be found by solving b(N) = 0, which gives log(N_cutoff) = b/c (see the recap slide).

32 of 40

More general scaling law

To account for more general architectures (in particular those where K > 1 or R ≠ 0.5), the authors introduce a more general version of the scaling law using the variables below (a schematic form is sketched after the list) :

  • F : TeraFLOPs required per forward pass. Proportional to the number of active parameters N.
  • B = P/F, so that F × B = P (with P the total parameter count). Analogous to E : the ratio of total parameters to active parameters.
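By analogy with the (N, E) law, the generalised law keeps the same bilinear form in the new variables (a sketch; B may receive the same kind of saturating transform as E) :

$$\log L(F, B) \;\approx\; a \log F + b \log B + c \,\log F \log B + d$$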

33 of 40

More general scaling law : K and R don’t matter

The new scaling law produces similar fits across values of K and R, i.e. the loss can be predicted based only on P and B. This indicates that K and R have little impact on performance.

34 of 40

More general scaling law : which K,R to choose

Higher K means more parameter efficiency, but also more FLOPs and communication cost : K = 1 is preferred

Higher R normally means better performance, but there are diminishing returns : R > 0.5 is preferred

35 of 40

Recap and discussion

36 of 40

Recap

Important points of the paper :

  • Validation loss follows a joint power law in E and N, with a quadratic interaction term modelling the diminishing returns.
    • a(E) = a + c log(E) quantifies the scaling w.r.t. N when E is fixed
    • b(N) = b + c log(N) quantifies the scaling w.r.t. E when N is fixed.
    • c quantifies at what rate the individual power-laws degrade when the other parameter increases
    • d quantifies the base performance
  • An additional correction for large and small E gives an even better fit
    • E_max quantifies performance degradation due to increase in E
  • We can easily calculate at what N routing stops being beneficial : log(N_cutoff) = b/c
  • We can define a similar scaling law w.r.t. more general parameters (F,B) to model more general architectures. This reveals that some design choices (K,R) do not matter very much.

37 of 40

Comparison of presented architectures

We summarize the relative behavior of the three routing techniques considered :

  • SMOEs consistently outperform the other methods, especially at bigger N, because they have the lowest c
  • The other methods, especially RL, are competitive at lower N because they have lower a, b
  • Hash and RL maintain power-law behaviour in E for longer, since their E_max is bigger
  • Hash has the biggest initial overhead, since its E_start is bigger

38 of 40

Left-out sections

  • In Appendix E, the authors briefly explore the scaling of downstream performance and observe that it doesn't follow a simple trend : it is highly task-dependent. More research is needed.
  • In Appendix F, the authors argue that it is not appropriate to consider performance at convergence, because the dataset and models are too large to overfit/saturate with a reasonable number of tokens. “Scaling analyses should focus on optimal performance at a fixed number of tokens”.

39 of 40

Implication for future research and open questions

This framework gives a way to quantify the performance of different architectures in a disentangled way and informs future research. We can extract a few general lessons and questions for the future.

  • Routing gives diminishing returns as N and E increase. Why?
  • The parameters c and E_max are the indicators of how well the model scales. They should be the focus of future work.
  • The number of active experts (K) and the fraction of routed layers (R) have a very weak impact on scaling performance. Perhaps this indicates that these routing architectures leverage motivation 2 (easier scaling) much more than motivation 1 (compositionality). The good performance of Hash Routing also suggests this. Leveraging hierarchical compositionality is an important goal for future work. For now, routing just seems to distribute the knowledge of the FFN (Geva et al. 2021). Which is in fact cool, but maybe it could be even cooler.

40 of 40

Questions :

  • Q : How is Expert Parallelism different from Sharding Parallelism?
    • A : With sharding every module needs to see every input and this slows down the processing rate.
  • Q : Can these findings generalize to other settings in which routing might happen more implicitly?
    • A : Not really. Also, modularity doesn't really emerge in other settings/architectures (Csordás et al. 2021)
  • Q : Given the conclusions of this paper, if we have a fixed compute budget, do we want to train a dense model as large as possible, or a routing network with the best effective number of parameters under that budget?
    • A : Use routing when your compute budget is relatively small (the paper says N<1.3B for some reason but you get improvement up to N = 900B)
  • Q : Could this be applied to vision transformers?
    • A : Yes
  • Q : How could their scaling predictions be generalized to other types of many-expert networks, like a neural Dirichlet process or attention-based models (like RIMs)?
    • A : Idk about neural Dirichlet processes. <insert RIMs discussion : sequence data, top-down routing>