Moshi: a speech-text foundation
model for real-time dialogue
Presented by
Md Mubtasim Ahasan
20th October 2024
Center for Computational & Data Sciences
Independent University, Bangladesh
Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer,
Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour
from Kyutai. 2 October 2024. Available on arXiv: https://arxiv.org/abs/2410.00037
Motivation
Motivation
Demo/Meme
Overall Architecture
Moshi: a multi-stream speech-to-speech Transformer model that handles full-duplex spoken dialogue.
Helium: a 7B-parameter text LLM pretrained on 2.1 trillion English tokens.
Mimi: a neural audio codec that distills semantic information into the first level of its audio tokens, using a split RVQ and a Transformer bottleneck.
Inner Monologue: a training and inference procedure that uses text tokens as a per-time-step prefix to the audio (semantic) tokens (see the layout sketch after this list).
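Illustrative sketch of how one time step ("column") of the multi-stream sequence could be laid out, with the Inner Monologue text token as a per-step prefix to Moshi's and the user's Mimi tokens. The field names and the exact stream ordering are assumptions for illustration, not the layout from the released code:

```python
# Hypothetical layout of one time step V_s of Moshi's multi-stream sequence.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    text: int                  # Inner Monologue text token (Moshi's own transcript)
    moshi_semantic: int        # first (semantic) Mimi codebook of Moshi's audio
    moshi_acoustic: List[int]  # remaining (acoustic) Mimi codebooks of Moshi's audio
    user_semantic: int         # semantic codebook of the user's audio
    user_acoustic: List[int]   # acoustic codebooks of the user's audio

    def flatten(self) -> List[int]:
        # The text token acts as a per-time-step prefix to the audio tokens.
        return ([self.text, self.moshi_semantic] + self.moshi_acoustic
                + [self.user_semantic] + self.user_acoustic)

# One step of the conversation as a flat list of token ids (values are dummies).
step = Step(text=17, moshi_semantic=901, moshi_acoustic=[5, 42, 7],
            user_semantic=310, user_acoustic=[88, 3, 19])
print(step.flatten())
```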
Helium (LLM)
Helium (LLM)
Mimi (neural audio codec)
Mimi (neural audio codec)
Mimi (neural audio codec)
Generative Audio Modeling
Generative Audio Modeling
Example: text and audio for "Hello, how are you?":
The Temporal Transformer looks at the text and audio streams up to the current step ("Hello, how...") and produces a context vector z_s.
The Depth Transformer uses z_s to predict the next word ("are") and the corresponding audio tokens in the sound stream, as sketched below.
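A minimal sketch of this Temporal/Depth (RQ-Transformer) factorization, assuming generic PyTorch Transformer encoders; layer sizes, causal masking, and the exact conditioning on z_s are simplified and not taken from the released implementation:

```python
# Sketch: a Temporal Transformer over time steps produces z_s; a smaller Depth
# Transformer then predicts the text/audio streams of the next step from z_s.
import torch
import torch.nn as nn

class RQTransformerSketch(nn.Module):
    def __init__(self, vocab: int, n_streams: int, d: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=6)  # over time steps s
        self.depth = nn.TransformerEncoder(layer, num_layers=2)     # over streams within a step
        self.heads = nn.ModuleList([nn.Linear(d, vocab) for _ in range(n_streams)])

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, steps, n_streams) -- text + audio token ids per time step.
        b, s, k = tokens.shape
        step_emb = self.embed(tokens).sum(dim=2)   # merge the streams of each step
        z = self.temporal(step_emb)                # context vector z_s per time step
        # The Depth Transformer conditions on z_s and predicts the streams of the
        # next step (here simplified: no causal masks, single parallel pass).
        depth_in = z.unsqueeze(2).expand(b, s, k, z.shape[-1]).reshape(b * s, k, -1)
        h = self.depth(depth_in).reshape(b, s, k, -1)
        return [head(h[:, :, i]) for i, head in enumerate(self.heads)]

# Example: 4 streams (1 text + 3 audio) over 10 steps.
logits = RQTransformerSketch(vocab=2048, n_streams=4)(torch.randint(0, 2048, (2, 10, 4)))
print(logits[0].shape)  # (2, 10, 2048)
```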
Generative Audio Modeling
Acoustic Delay: insert the audio sub-sequences A_{t,q} from Mimi into the multi-sequence V with a delay of 1 or 2 steps between the semantic and acoustic tokens (see the sketch after this list).
Multi-stream modeling: jointly model the two-speaker conversation: A from Moshi and A′ from the user.
Inner Monologue: also model the textual representation W of Moshi's speech by applying the SentencePiece tokenizer to the transcription of Moshi's audio; this improves the linguistic quality of the generated audio.
Aligning text and audio tokens: text is aligned with audio using Whisper's word-level timestamps; PAD tokens fill the gap until the next word and EPAD marks the end of the padding. About 65% of the text tokens in English are padding.
Deriving streaming ASR and TTS: by adjusting the delay between the text and audio streams, the same framework yields streaming text-to-speech or automatic speech recognition.
Inference of Moshi: during inference, the model predicts Moshi's text and audio tokens, while the user's audio stream is taken from the input rather than generated.
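A minimal sketch of the acoustic delay, assuming a placeholder PAD_TOKEN value and a delay of 2 frames (the paper uses 1 or 2 steps); the token value and exact padding scheme are illustrative choices, not the paper's code:

```python
# Shift the acoustic Mimi codebooks by a few frames relative to the semantic
# codebook before stacking the streams into the multi-sequence V.
import torch

PAD_TOKEN = 0  # assumed placeholder id for illustration

def apply_acoustic_delay(codes: torch.Tensor, delay: int = 2) -> torch.Tensor:
    """codes: (n_codebooks, T) Mimi tokens; codebook 0 is the semantic level.

    Returns a (n_codebooks, T + delay) tensor where the acoustic codebooks
    (1..Q-1) start `delay` frames after the semantic codebook, padded with
    PAD_TOKEN at the positions that carry no audio token yet.
    """
    q, t = codes.shape
    out = torch.full((q, t + delay), PAD_TOKEN, dtype=codes.dtype)
    out[0, :t] = codes[0]                  # semantic stream, no delay
    out[1:, delay:delay + t] = codes[1:]   # acoustic streams, delayed
    return out

# Example: 8 Mimi codebooks over 5 frames.
delayed = apply_acoustic_delay(torch.randint(1, 2048, (8, 5)))
print(delayed.shape)  # torch.Size([8, 7])
```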
Datasets and Training
Datasets and Training
Results
Metrics. AI2 Reasoning Challenge (ARC), Open-Book QA (OBQA), HellaSwag (HS), WinoGrande (WG), Physical Interaction QA (PIQA), Social Interaction QA (SIQA), TriviaQA (TQA), Natural Questions (NQ), Massive Multitask Language Understanding benchmark (MMLU).
Results
Metrics. ABX Error Rate: Measures phonetic discriminability between triphones. ViSQOL: Evaluates acoustic similarity with a reference.
MOSNet: Assesses audio quality without a reference. MUSHRA: Human ratings of audio quality.
Results
Results
Results
Metrics. sWUGGY: Measures lexicon learning by comparing valid and invalid words. sBLIMP: Evaluates syntactic knowledge through grammatical acceptability contrasts. Spoken StoryCloze: Assesses semantic understanding with coherent vs. incoherent story endings. Spoken Topic-StoryCloze: Uses continuations from unrelated topics to evaluate semantic coherence. Negative Log-Likelihood: Likelihood scores normalized by sequence length. MMLU: Measures text understanding independent of audio tokens.
Results