IFT6760A Paper presentation: Mixtral of Experts
Yicong Li
Yuchen Hui
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
Mixtral of Experts - Schedule
https://mistral.ai/news/mistral-large/, February 26, 2024
Mixtral 8x7B - Introduction
Mistral AI: a French start-up founded in April 2023 by former Meta AI and Google DeepMind employees
Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020).
What is “Mixtral” of Experts?
Mixtral?
8x7B?
Sparse Mixture of Experts?
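Why the name does not mean a 56B model: only the feed-forward sub-layers are replicated into 8 experts (attention weights and embeddings are shared), and only 2 experts run per token. Below is a back-of-the-envelope Python sketch using the architecture hyperparameters reported in the Mixtral paper; it ignores layer norms and the tiny router, so the totals are only approximate.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B,
# using the hyperparameters reported in the Mixtral paper (Table 1).
# Rough sketch only: layer norms and the small router weights are ignored.

dim, n_layers, hidden = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab, n_experts, top_k = 32000, 8, 2

# Attention: Q and O projections are dim x (n_heads * head_dim);
# K and V use grouped-query heads (n_kv_heads * head_dim).
attn = dim * n_heads * head_dim * 2 + dim * n_kv_heads * head_dim * 2

# One SwiGLU expert has three weight matrices: w1, w3 (dim -> hidden) and w2 (hidden -> dim).
expert = 3 * dim * hidden

embeddings = 2 * vocab * dim  # input embedding + output head

total  = n_layers * (attn + n_experts * expert) + embeddings
active = n_layers * (attn + top_k * expert) + embeddings

print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 47B, as reported in the paper
print(f"active ≈ {active / 1e9:.1f}B per token")   # ≈ 13B active parameters
```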
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
Mixture of Experts: Brief History
1. Jacobs, Robert A., et al. "Adaptive mixtures of local experts." (1991).
2. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." (2017).
3. Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." (2020).
4. Jiang, Albert Q., et al. "Mixtral of experts." (2024).
Jacobs, Robert A., et al. "Adaptive mixtures of local experts." Neural computation 3.1 (1991): 79-87.
Adaptive mixtures of local experts [1]
Objective:
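A sketch of the objectives compared in Jacobs et al. (1991), written from memory of that paper (notation is mine: gate probabilities p_i, expert outputs o_i, target d^c for case c). The earlier, cooperative squared error blends all experts into one prediction; the competitive error proposed in the paper treats the system as a mixture model, so each case is credited mostly to a single local expert.

```latex
\begin{align}
  E^{c}_{\text{coop}} &= \Big\| d^{c} - \sum_i p_i^{c}\, o_i^{c} \Big\|^{2}
    && \text{(earlier cooperative error)} \\
  E^{c}_{\text{comp}} &= -\log \sum_i p_i^{c}\,
      \exp\!\Big(-\tfrac{1}{2}\,\big\| d^{c} - o_i^{c} \big\|^{2}\Big)
    && \text{(competitive error proposed in the paper)}
\end{align}
```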
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
Illustration of routing
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
“Soft selection” to “Top-k selection” [2]
If one increases the number of experts n while keeping the number of selected experts K fixed, one can increase the model’s parameter count while keeping its computational cost effectively constant (see the gating sketch below).
From CMoE to SMoE
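A minimal PyTorch sketch of a top-k gate and a sparse MoE layer in the spirit of Shazeer et al. (2017) and Mixtral; class and variable names are mine, and the SiLU feed-forward experts are simplified placeholders (Mixtral uses SwiGLU experts). Adding experts grows the parameter count, but each token still pays for only k expert forward passes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gate: soft selection becomes sparse selection.

    With n experts and k << n, the number of experts (and hence parameters)
    can grow while each token is still processed by only k expert FFNs.
    """
    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.w_gate(x)                        # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # renormalize over the k chosen experts
        return weights, topk_idx                       # experts outside the top-k get zero weight


class SparseMoE(nn.Module):
    """Each token is routed to its top-k experts; outputs are combined with gate weights."""
    def __init__(self, dim: int, hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = TopKGate(dim, n_experts, k)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights, idx = self.gate(x)                    # both (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(idx.shape[-1]):              # loop over the k selected slots
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


# Usage: 16 tokens of width 64 routed over 8 experts, 2 active per token.
x = torch.randn(16, 64)
y = SparseMoE(dim=64, hidden=256, n_experts=8, k=2)(x)
print(y.shape)  # torch.Size([16, 64])
```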
Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
Sparse Mixture-of-Experts layer in LSTM [2]
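For reference, the noisy top-k gating of Shazeer et al. (2017), reproduced here from memory of that paper (worth checking against the original): the gate keeps only the k largest noisy logits, so all but k experts receive exactly zero weight and are never evaluated.

```latex
% y: layer output, E_i: expert networks, G: sparse gate, W_g / W_noise: learned matrices.
\begin{align}
  y &= \sum_{i=1}^{n} G(x)_i \, E_i(x) \\
  G(x) &= \operatorname{Softmax}\big(\operatorname{KeepTopK}(H(x), k)\big) \\
  H(x)_i &= (x W_g)_i + \epsilon \cdot \operatorname{Softplus}\big((x W_{\text{noise}})_i\big),
    \quad \epsilon \sim \mathcal{N}(0, 1) \\
  \operatorname{KeepTopK}(v, k)_i &=
    \begin{cases}
      v_i & \text{if } v_i \text{ is among the top } k \text{ entries of } v \\
      -\infty & \text{otherwise}
    \end{cases}
\end{align}
```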
Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).
Mixture-of-Experts layer in Transformers [3]
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
“Mixtral” of Experts - Mixtral 8x7B [4]
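The Mixtral layer itself, as described in Jiang et al. (2024): every feed-forward sub-layer becomes n = 8 SwiGLU experts, and the router evaluates only the top K = 2 per token (logits outside the top K are set to negative infinity before the softmax, so their weights are exactly zero).

```latex
% Per-token output of one Mixtral MoE layer: router logits x W_g, top-2 selection,
% softmax over the selected logits, SwiGLU experts (n = 8, K = 2).
\begin{equation}
  y \;=\; \sum_{i=0}^{n-1}
    \operatorname{Softmax}\big(\operatorname{Top2}(x \cdot W_g)\big)_i
    \cdot \operatorname{SwiGLU}_i(x)
\end{equation}
```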
Results and Analysis
Results - Mixtral vs. LLaMA 2
Results - Mixtral vs. LLaMA 2 - Performance per Active Parameters
Results - Mixtral vs. LLaMA 2 vs. GPT-3.5
Results - Multilingual benchmarks
Results - Long range performance
Results - Bias Benchmarks
Routing Analysis
Would it be that easy?
Distribution of expert assignments on The Pile dataset
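A hypothetical helper for this kind of routing analysis; the names and the routing_log record format are mine, not from the paper. It tallies how often each expert is selected per Pile domain, to be compared against the uniform 1/8 baseline.

```python
from collections import Counter
from typing import Iterable, Tuple

def expert_usage(routing_log: Iterable[Tuple[str, int, int]], n_experts: int = 8):
    """Summarize how often each expert is selected, per dataset domain.

    routing_log is a hypothetical iterable of (domain, layer, expert_id) records,
    one per (token, selected expert), e.g. collected with a forward hook on the router.
    Returns {domain: [fraction routed to expert 0, ..., expert n-1]}.
    """
    counts: dict[str, Counter] = {}
    for domain, _layer, expert_id in routing_log:
        counts.setdefault(domain, Counter())[expert_id] += 1
    return {
        domain: [c[e] / max(sum(c.values()), 1) for e in range(n_experts)]
        for domain, c in counts.items()
    }

# Toy usage: if routing were uniform, every fraction would be ~1/8.
log = [("arxiv", 0, e % 8) for e in range(1600)] + [("github", 0, (e * 3) % 8) for e in range(1600)]
print(expert_usage(log)["arxiv"][:4])   # [0.125, 0.125, 0.125, 0.125]
```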
Limitation
Thank you!
Questions?