1 of 25

IFT6760A Paper presentation: Mixtral of Experts

Yicong Li

Yuchen Hui

Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).

2 of 25


Mixtral of Experts - Schedule

  • Introduction
  • Brief History of Mixture of Experts (MoE)
  • Mixtral of Experts
    • Architecture of Mixtral 8x7B
    • Results
    • Routing analysis

3 of 25


Mixtral 8x7B - Introduction

Mistral AI: a French start-up founded in April 2023 by former Meta AI and Google DeepMind employees

[Figure: MMLU (Measuring Massive Multitask Language Understanding) benchmark comparison]

4 of 25

5 of 25


What is “Mixtral” of Experts?

  • Mixtral 8x7B
  • Sparse Mixture of Experts (SMoE) language model

Mixtral?

8x7B?

Sparse Mixture of Experts?

6 of 25

Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).


Mixture of Experts: Brief History

1. Jacobs, Robert A., et al. "Adaptive mixtures of local experts." (1991)
   • The concept of MoE in machine learning dates back to the 1990s
2. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." (2017)
   • The first large-scale success of the MoE approach in deep learning
3. Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." (2020)
   • MoE in the Transformer
4. Jiang, Albert Q., et al. "Mixtral of experts." (2024)
   • MoE layers based on the Gshard paper

7 of 25

Jacobs, Robert A., et al. "Adaptive mixtures of local experts." Neural computation 3.1 (1991): 79-87.


Adaptive mixtures of local experts


  • Experts & gating Network are an entire neural network
  • Soft selection” via weighted average (sth like softmax) over experts output
  • Jointly train Experts and Gating Networks
  • Continuous mixture of the experts

Objective:
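A sketch of the formulation from Jacobs et al. (1991), in our notation (o_i^c is expert i's output, p_i^c the gating weight, and d^c the target for training case c):

```latex
% Gating weights: softmax over the gating network's outputs x_i^c,
% and the blended ("soft selection") output of the mixture
\[
  p_i^c = \frac{e^{x_i^c}}{\sum_j e^{x_j^c}}, \qquad
  \text{output} = \sum_i p_i^c \, \mathbf{o}_i^c
\]
% Naive objective: squared error of the blended output
\[
  E^c = \Bigl\lVert \mathbf{d}^c - \sum_i p_i^c \, \mathbf{o}_i^c \Bigr\rVert^2
\]
% Refined objective (encourages expert specialization): each expert is scored on its own output
\[
  E^c = -\log \sum_i p_i^c \, e^{-\frac{1}{2} \lVert \mathbf{d}^c - \mathbf{o}_i^c \rVert^2}
\]
```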

8 of 25

Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).


Illustration of routing (figure from Fedus et al., 2022)

9 of 25

Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).


“Soft selection” to “Top-k selection”


If one increases n while keeping K fixed, one can increase the model’s parameter count while keeping its computational cost effectively constant
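Concretely, with n experts E_1, ..., E_n and a router weight matrix W_g, the top-K gating used in sparse MoE layers can be written as below (following the formulation in Shazeer et al. (2017) and the Mixtral paper; noise terms and load-balancing losses are omitted for brevity):

```latex
% Router logits are truncated to the K largest entries before the softmax,
% so only K experts receive non-zero weight and only those K are evaluated.
\[
  G(x) = \mathrm{Softmax}\bigl(\mathrm{TopK}(x \cdot W_g)\bigr), \qquad
  \mathrm{TopK}(\ell)_i =
  \begin{cases}
    \ell_i & \text{if } \ell_i \text{ is among the } K \text{ largest entries of } \ell,\\
    -\infty & \text{otherwise,}
  \end{cases}
\]
\[
  y = \sum_{i=1}^{n} G(x)_i \, E_i(x).
\]
```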

10 of 25

From Continuous MoE (CMoE) to Sparse MoE (SMoE)

Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

11 of 25

Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).


Sparse Mixture-of-Experts layer in LSTM


12 of 25

Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).


Mixture-of-Experts layer in Transformers


  • The MoE layer replaces every other Transformer feed-forward layer (see the sketch after this list)
  • Top-2 selection
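A structural sketch of that interleaving, as an illustrative PyTorch snippet written for this deck (not GShard's actual implementation; the `build_transformer_ffns` helper, the ReLU activation, and the odd/even layer convention are our own placeholders; top-2 routing is omitted here and shown in the Mixtral sketch later):

```python
import torch.nn as nn

def build_transformer_ffns(num_layers: int, d_model: int, d_ff: int, num_experts: int = 8):
    """Build the feed-forward sub-blocks of a Transformer stack, replacing every
    other dense FFN with a bank of `num_experts` parallel expert FFNs (GShard-style)."""
    def dense_ffn():
        return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    ffns = nn.ModuleList()
    for layer_idx in range(num_layers):
        if layer_idx % 2 == 1:
            # every other layer: an MoE FFN holding num_experts parallel experts
            ffns.append(nn.ModuleList(dense_ffn() for _ in range(num_experts)))
        else:
            # remaining layers keep a single dense FFN
            ffns.append(dense_ffn())
    return ffns
```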

13 of 25

Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).


“Mixtral” of Experts - Mixtral 8x7B


  • Backbone of the language model: Mistral 7B
    • Grouped-query attention (GQA) for faster inference
    • Sliding window attention (SWA) for longer sequences with lower memory requirements
    • Surpasses Llama 2 13B – chat
  • Replaces all FFN sub-blocks with MoE layers
    • 32 Transformer blocks
    • The MoE layers are a simplified version of those in the Gshard paper
    • 8 experts + top-2 selection
  • Mixtral 8x7B = Mistral 7B backbone + 8-expert MoE layers (see the sketch below)
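Below is a minimal, illustrative PyTorch sketch of one such MoE layer, i.e. of the paper's formula y = sum_i Softmax(Top2(x · W_g))_i · SwiGLU_i(x). This is our own reimplementation for the presentation, not Mistral's released code; class names are placeholders, and the dimensions in the usage comment follow the Mistral 7B configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: the SwiGLU feed-forward block used by Mistral/Mixtral."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class MixtralMoELayer(nn.Module):
    """Sketch of the MoE layer: a linear router scores 8 experts per token,
    the top-2 are evaluated, and their outputs are mixed with softmaxed router weights."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(SwiGLUExpert(d_model, d_ff) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                  # x: (num_tokens, d_model)
        logits = self.gate(x)                              # (num_tokens, num_experts)
        weights, selected = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)                  # softmax over the 2 selected logits
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                token_mask = selected[:, slot] == e        # tokens whose slot-th choice is expert e
                if token_mask.any():
                    y[token_mask] += weights[token_mask, slot].unsqueeze(-1) * expert(x[token_mask])
        return y

# usage: route 4 random "tokens" through the layer (Mistral 7B dims: d_model=4096, d_ff=14336)
# layer = MixtralMoELayer(d_model=4096, d_ff=14336)
# out = layer(torch.randn(4, 4096))
```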

14 of 25

Results and Analysis

15 of 25

Results - Mixtral vs. LLaMA 2

16 of 25

Results - Mixtral vs. LLaMA 2 - Performance per Active Parameters
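To make "active parameters" concrete: with 8 experts stored per MoE layer but only 2 evaluated per token, most of Mixtral's weights are inactive for any given token. The 47B/13B totals come from the Mixtral paper; the per-expert estimate below is our rough back-of-the-envelope calculation using the Mistral 7B dimensions (d_model = 4096, d_ff = 14336, 32 layers).

```latex
% Each SwiGLU expert has three d_model x d_ff weight matrices:
\[
  \text{params per expert per layer} \approx 3 \cdot d_{\text{model}} \cdot d_{\text{ff}}
  = 3 \cdot 4096 \cdot 14336 \approx 176\,\text{M}
\]
% Stored vs. used per token (plus shared attention/embedding weights):
\[
  \text{total} \approx 32 \cdot 8 \cdot 176\,\text{M} + \text{shared} \approx 47\,\text{B}, \qquad
  \text{active} \approx 32 \cdot 2 \cdot 176\,\text{M} + \text{shared} \approx 13\,\text{B}
\]
```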

17 of 25

Results - Mixtral vs. LLaMA 2 vs. GPT3.5

18 of 25

Results - Multilingual benchmarks

19 of 25

Results - Long range performance

20 of 25

Results - Bias Benchmarks

21 of 25

Routing Analysis

  • Initial thought: some experts may specialize in specific domains (e.g., mathematics, biology, philosophy)

Would it be that easy?

22 of 25

Distribution of selected experts on The Pile dataset

23 of 25

24 of 25

Limitation

  • Not fully open source…

25 of 25

Thank you!

Questions?