IFT6760A Paper presentation: Mixtral of Experts
Yicong Li
Yuchen Hui
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
Mixtral of Experts - Schedule
https://mistral.ai/news/mistral-large/, February 26, 2024
Mixtral 8x7B - Introduction
Mistral AI: a French start-up founded in April 2023 by former Meta AI and Google DeepMind employees
Hendrycks, Dan, et al. "Measuring massive multitask language understanding." arXiv preprint arXiv:2009.03300 (2020).
What is “Mixtral” of Experts?
Mixtral?
8x7B?
Sparse Mixture of Experts?
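Why the name does not mean a 56B model: only the feed-forward sub-layers are replicated into 8 experts (attention weights and embeddings are shared), and only 2 experts run per token. Below is a back-of-the-envelope Python sketch using the architecture hyperparameters reported in the Mixtral paper; it ignores layer norms and the tiny router, so the totals are only approximate.

```python
# Back-of-the-envelope parameter count for Mixtral 8x7B,
# using the hyperparameters reported in the Mixtral paper (Table 1).
# Rough sketch only: layer norms and the small router weights are ignored.

dim, n_layers, hidden = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab, n_experts, top_k = 32000, 8, 2

# Attention: Q and O projections are dim x (n_heads * head_dim);
# K and V use grouped-query heads (n_kv_heads * head_dim).
attn = dim * n_heads * head_dim * 2 + dim * n_kv_heads * head_dim * 2

# One SwiGLU expert has three weight matrices: w1, w3 (dim -> hidden) and w2 (hidden -> dim).
expert = 3 * dim * hidden

embeddings = 2 * vocab * dim  # input embedding + output head

total  = n_layers * (attn + n_experts * expert) + embeddings
active = n_layers * (attn + top_k * expert) + embeddings

print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 47B, as reported in the paper
print(f"active ≈ {active / 1e9:.1f}B per token")   # ≈ 13B active parameters
```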
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
Mixture of Experts: Brief History
1. Jacobs, Robert A., et al. "Adaptive mixtures of local experts." (1991).
2. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." (2017).
3. Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." (2020).
4. Jiang, Albert Q., et al. "Mixtral of experts." (2024).
Jacobs, Robert A., et al. "Adaptive mixtures of local experts." Neural computation 3.1 (1991): 79-87.
Adaptive mixtures of local experts [1]
Objective:
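A sketch of the objectives compared in Jacobs et al. (1991), written from memory of that paper (notation is mine: gate probabilities p_i, expert outputs o_i, target d^c for case c). The earlier, cooperative squared error blends all experts into one prediction; the competitive error proposed in the paper treats the system as a mixture model, so each case is credited mostly to a single local expert.

```latex
\begin{align}
  E^{c}_{\text{coop}} &= \Big\| d^{c} - \sum_i p_i^{c}\, o_i^{c} \Big\|^{2}
    && \text{(earlier cooperative error)} \\
  E^{c}_{\text{comp}} &= -\log \sum_i p_i^{c}\,
      \exp\!\Big(-\tfrac{1}{2}\,\big\| d^{c} - o_i^{c} \big\|^{2}\Big)
    && \text{(competitive error proposed in the paper)}
\end{align}
```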
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
Illustration of routing
Fedus, William, Jeff Dean, and Barret Zoph. "A review of sparse expert models in deep learning." arXiv preprint arXiv:2209.01667 (2022).
“Soft selection” to “Top-k selection” [2]
If one increases the number of experts n while keeping the number of selected experts K fixed, one can increase the model’s parameter count while keeping its computational cost effectively constant (see the gating sketch below).
From CMoE to SMoE
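A minimal PyTorch sketch of a top-k gate and a sparse MoE layer in the spirit of Shazeer et al. (2017) and Mixtral; class and variable names are mine, and the SiLU feed-forward experts are simplified placeholders (Mixtral uses SwiGLU experts). Adding experts grows the parameter count, but each token still pays for only k expert forward passes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gate: soft selection becomes sparse selection.

    With n experts and k << n, the number of experts (and hence parameters)
    can grow while each token is still processed by only k expert FFNs.
    """
    def __init__(self, dim: int, n_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Linear(dim, n_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.w_gate(x)                        # (tokens, n_experts)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)         # renormalize over the k chosen experts
        return weights, topk_idx                       # experts outside the top-k get zero weight


class SparseMoE(nn.Module):
    """Each token is routed to its top-k experts; outputs are combined with gate weights."""
    def __init__(self, dim: int, hidden: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = TopKGate(dim, n_experts, k)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights, idx = self.gate(x)                    # both (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(idx.shape[-1]):              # loop over the k selected slots
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e               # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out


# Usage: 16 tokens of width 64 routed over 8 experts, 2 active per token.
x = torch.randn(16, 64)
y = SparseMoE(dim=64, hidden=256, n_experts=8, k=2)(x)
print(y.shape)  # torch.Size([16, 64])
```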
Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
Sparse Mixture-of-Experts layer in LSTM [2]
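For reference, the noisy top-k gating of Shazeer et al. (2017), reproduced here from memory of that paper (worth checking against the original): the gate keeps only the k largest noisy logits, so all but k experts receive exactly zero weight and are never evaluated.

```latex
% y: layer output, E_i: expert networks, G: sparse gate, W_g / W_noise: learned matrices.
\begin{align}
  y &= \sum_{i=1}^{n} G(x)_i \, E_i(x) \\
  G(x) &= \operatorname{Softmax}\big(\operatorname{KeepTopK}(H(x), k)\big) \\
  H(x)_i &= (x W_g)_i + \epsilon \cdot \operatorname{Softplus}\big((x W_{\text{noise}})_i\big),
    \quad \epsilon \sim \mathcal{N}(0, 1) \\
  \operatorname{KeepTopK}(v, k)_i &=
    \begin{cases}
      v_i & \text{if } v_i \text{ is among the top } k \text{ entries of } v \\
      -\infty & \text{otherwise}
    \end{cases}
\end{align}
```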
Lepikhin, Dmitry, et al. "Gshard: Scaling giant models with conditional computation and automatic sharding." arXiv preprint arXiv:2006.16668 (2020).
Mixture-of-Experts layer in Transformers [3]
Jiang, Albert Q., et al. "Mixtral of experts." arXiv preprint arXiv:2401.04088 (2024).
“Mixtral” of Experts - Mixtral 8x7B [4]
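The Mixtral layer itself, as described in Jiang et al. (2024): every feed-forward sub-layer becomes n = 8 SwiGLU experts, and the router evaluates only the top K = 2 per token (logits outside the top K are set to negative infinity before the softmax, so their weights are exactly zero).

```latex
% Per-token output of one Mixtral MoE layer: router logits x W_g, top-2 selection,
% softmax over the selected logits, SwiGLU experts (n = 8, K = 2).
\begin{equation}
  y \;=\; \sum_{i=0}^{n-1}
    \operatorname{Softmax}\big(\operatorname{Top2}(x \cdot W_g)\big)_i
    \cdot \operatorname{SwiGLU}_i(x)
\end{equation}
```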
Results and Analysis
Results - Mixtral vs. LLaMA 2
Results - Mixtral vs. LLaMA 2 - Performance per Active Parameters
Results - Mixtral vs. LLaMA 2 vs. GPT-3.5
Results - Multilingual benchmarks
Results - Long range performance
Results - Bias Benchmarks
Routing Analysis
Would it be that easy?
Distribution of expert assignments on The Pile dataset
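A hypothetical helper for this kind of routing analysis; the names and the routing_log record format are mine, not from the paper. It tallies how often each expert is selected per Pile domain, to be compared against the uniform 1/8 baseline.

```python
from collections import Counter
from typing import Iterable, Tuple

def expert_usage(routing_log: Iterable[Tuple[str, int, int]], n_experts: int = 8):
    """Summarize how often each expert is selected, per dataset domain.

    routing_log is a hypothetical iterable of (domain, layer, expert_id) records,
    one per (token, selected expert), e.g. collected with a forward hook on the router.
    Returns {domain: [fraction routed to expert 0, ..., expert n-1]}.
    """
    counts: dict[str, Counter] = {}
    for domain, _layer, expert_id in routing_log:
        counts.setdefault(domain, Counter())[expert_id] += 1
    return {
        domain: [c[e] / max(sum(c.values()), 1) for e in range(n_experts)]
        for domain, c in counts.items()
    }

# Toy usage: if routing were uniform, every fraction would be ~1/8.
log = [("arxiv", 0, e % 8) for e in range(1600)] + [("github", 0, (e * 3) % 8) for e in range(1600)]
print(expert_usage(log)["arxiv"][:4])   # [0.125, 0.125, 0.125, 0.125]
```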
Limitation
Thank you!
Questions?