1 of 22

Moshi: a speech-text foundation model for real-time dialogue

Presented by

Md Mubtasim Ahasan

20th October 2024

Center for Computational & Data Sciences

Independent University, Bangladesh

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer,

Patrick Pérez, Hervé Jégou, Edouard Grave, Neil Zeghidour

from Kyutai. 2 October 2024. Available on arXiv: https://arxiv.org/abs/2410.00037

2 of 22

Motivation

  • Problem: Current systems for spoken dialogue rely on pipelines of independent components.
    • Such as voice activity detection, speech recognition, textual dialogue and text-to-speech.
    • Example: in Alexa, Siri, or Google Assistant, a “wake word” typically triggers a pipeline of automatic speech recognition (ASR) ⇒ natural language understanding (NLU) ⇒ natural language generation (NLG) ⇒ text-to-speech (TTS), which speaks the answer back to the user.
  • Solution: Replace the NLU and NLG components with an LLM, while ASR and TTS provide the voice interface during the user’s and the system’s turn, respectively.
    • Casting spoken dialogue as speech-to-speech generation.
    • Starting from a text language model backbone, Moshi generates its own speech and the user’s speech as parallel streams of tokens from a neural audio codec.

3 of 22

Motivation

  • Problem: Existing frameworks cannot emulate the experience of real conversations.
    • Can only handle short, constrained interactions with several seconds of latency.
    • As language understanding and generation happen in the textual domain, non-linguistic information such as emotion, accent, and acoustic events is lost.
    • Models remain fundamentally speaker-turn-based, which does not account for overlapping speech, interruptions, and interjections.
  • Solution: Moshi builds an audio language model on top of a text LLM backbone.
    • Understanding inputs and generating outputs directly in the audio domain, while benefiting from the knowledge and reasoning abilities of the LLM backbone.
    • Developed a low-latency, multi-stream, full-duplex audio language model for real-time conversations, handling overlap and interruptions.

4 of 22

Demo/Meme

  • When my friends ask me about work, this is how the conversation usually goes:

5 of 22

Overall Architecture

Moshi: a multi-stream speech-to-speech Transformer model that can handle full-duplex spoken dialogue.

Helium: A 7B-parameter text LLM pretrained on 2.1 trillion English tokens.

Mimi: a neural audio codec that distills semantic information into the first level of audio tokens, using a split RVQ and a Transformer bottleneck.

Inner Monologue: a training and inference procedure that uses text tokens as a per-time-step prefix to the audio (semantic) tokens.

6 of 22

Helium (LLM)

  • Standard autoregressive language model, based on Transformer architecture.
  • Changes they made and their implementation:
    • Used RMS normalization at the input of the attention blocks, the feed-forward blocks and the output linear layer of the model.
    • Used rotary positional embeddings (RoPE).
    • A context length of 4,096 tokens and FlashAttention for training.
    • Changed the architecture of the feed-forward blocks to use Gated Linear Units with SiLU as the gating activation (see the sketch after this list).
    • Used a tokenizer based on SentencePiece’s unigram model, with 32,000 elements mostly targeting English.
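A minimal sketch of the RMS normalization and SiLU-gated feed-forward block described above (PyTorch; the module names and the 4096/11008 dimensions are illustrative assumptions, not Helium’s actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

class GatedFeedForward(nn.Module):
    """Gated Linear Unit feed-forward with SiLU as the gating activation."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example: one pre-norm feed-forward sub-block, as used inside a decoder layer.
x = torch.randn(2, 16, 4096)                       # (batch, sequence, model dim)
block = nn.Sequential(RMSNorm(4096), GatedFeedForward(4096, 11008))
y = x + block(x)                                   # residual connection
```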

7 of 22

Helium (LLM)

  • Pre-training data filtering:
    • Data sources: Wikipedia, Stack Exchange, 40 million scientific articles, and Web crawled data from CommonCrawl.
    • Deduplication: Removed duplicate and boilerplate content using FNV-1a hashing with bloom filters, and trained a fastText-based classifier to catch remaining duplicates.
    • Language Identification: Applied a fastText-based language identifier to keep English data only, retaining documents with a confidence score above 0.85.
    • Quality Filtering: Trained a fastText classifier on lines from the curated sources above to score web pages by similarity to trusted sources and their domains, keeping documents above a set threshold (a rough sketch of this kind of pipeline follows below).
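A rough sketch of this style of filtering pipeline, assuming the public fastText `lid.176.bin` language-ID model; the set-based FNV-1a deduplication is a simplified stand-in for a bloom filter, and none of this is Kyutai’s actual pipeline:

```python
import fasttext  # pip install fasttext

lid_model = fasttext.load_model("lid.176.bin")  # public fastText language-ID model

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a hash, used here as the line fingerprint for deduplication."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

seen_hashes = set()  # a bloom filter would be used at scale; a set keeps the sketch simple

def keep_document(text: str, min_confidence: float = 0.85):
    # 1) Language identification: keep English documents above the confidence threshold.
    labels, probs = lid_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < min_confidence:
        return None
    # 2) Line-level deduplication: drop lines whose hash was already seen.
    unique_lines = []
    for line in text.splitlines():
        h = fnv1a_64(line.strip().lower().encode("utf-8"))
        if h in seen_hashes:
            continue
        seen_hashes.add(h)
        unique_lines.append(line)
    return "\n".join(unique_lines) if unique_lines else None
```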

8 of 22

Mimi (neural audio codec)

  • Standard codec design with notable changes: a Transformer-based bottleneck and a split RVQ design, where semantic distillation is performed on the first vector quantizer (VQ).

9 of 22

Mimi (neural audio codec)

  • Transformer-based bottleneck: add Transformer modules in the bottleneck, one right before quantization and one after.
    • Each has 8 layers, 8 heads, RoPE positional encodings, a finite context of 250 frames, GELU activations, a model dimension of 512, and an MLP dimension of 2048, with LayerScale and causal masking (a simplified sketch follows below).
    • Weight decay is applied only to the parameters of the Transformers, with a weight of 5×10^5, using AdamW.
  • Adversarial-only training: Experimented with dropping the commonly used reconstruction loss, keeping only the feature loss and the discriminator loss.
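A simplified sketch of such a causal bottleneck Transformer (PyTorch; RoPE and LayerScale are omitted for brevity, so this is an approximation rather than Mimi’s actual module):

```python
import torch
import torch.nn as nn

def causal_bottleneck_transformer(dim=512, heads=8, layers=8, mlp_dim=2048):
    """Causal Transformer placed before (and, mirrored, after) the quantizer."""
    layer = nn.TransformerEncoderLayer(
        d_model=dim, nhead=heads, dim_feedforward=mlp_dim,
        activation="gelu", batch_first=True, norm_first=True,
    )
    return nn.TransformerEncoder(layer, num_layers=layers)

encoder_tf = causal_bottleneck_transformer()
latents = torch.randn(1, 250, 512)                          # (batch, frames, dim), 250-frame context
mask = nn.Transformer.generate_square_subsequent_mask(250)  # causal masking
pre_quant = encoder_tf(latents, mask=mask)                  # fed to the split RVQ next
```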

10 of 22

Mimi (neural audio codec)

  • Learning semantic-acoustic tokens with a split RVQ:
    • Architecture: Mimi vs. SoundStream
  • Challenge: Distillation improves phonetic discriminability but reduces audio quality due to trade-offs in the RVQ.
  • Solution: Introduce a split RVQ, using a single VQ for semantic information and a 7-level RVQ for acoustic information (see the toy sketch after this list).
  • Quantization: Use Q = 8 quantizers with a codebook size of 2048, and apply quantizer dropout for bitrate scalability. Quantization is applied only 50% of the time per sequence during training; otherwise unquantized embeddings are passed to the decoder.
  • Semantic Information Distillation: Distill semantic information from WavLM into the first level of the RVQ.
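A toy sketch of the split-RVQ idea (nearest-neighbour codebooks only, with no training losses or WavLM distillation; purely illustrative of keeping one VQ level for semantics and 7 residual levels for acoustics):

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantizer with a single codebook."""
    def __init__(self, codebook_size=2048, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                                   # x: (batch, frames, dim)
        dists = torch.cdist(x, self.codebook.weight.unsqueeze(0))  # (batch, frames, codebook_size)
        codes = dists.argmin(dim=-1)                        # token ids
        return self.codebook(codes), codes

class SplitRVQ(nn.Module):
    """One VQ for the semantic level + a 7-level residual VQ for acoustics."""
    def __init__(self, dim=512, acoustic_levels=7):
        super().__init__()
        self.semantic_vq = VectorQuantizer(dim=dim)
        self.acoustic_rvq = nn.ModuleList(VectorQuantizer(dim=dim) for _ in range(acoustic_levels))

    def forward(self, x):
        sem_q, sem_codes = self.semantic_vq(x)              # distilled against WavLM during training
        residual, acoustic_q, acoustic_codes = x, 0, []
        for vq in self.acoustic_rvq:                        # residual quantization
            q, codes = vq(residual)
            residual = residual - q
            acoustic_q = acoustic_q + q
            acoustic_codes.append(codes)
        # Split design: the semantic and acoustic paths are summed, not stacked in one RVQ.
        return sem_q + acoustic_q, sem_codes, torch.stack(acoustic_codes, dim=-1)

quantizer = SplitRVQ()
reconstruction, sem, acoustic = quantizer(torch.randn(1, 250, 512))  # 8 token streams per frame in total
```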

11 of 22

Generative Audio Modeling

  • Purpose:
    • Model joint distributions for text and audio sequences with a Transformer for real-time processing.
  • Challenges:
    • Audio requires much longer sequences (e.g., 100 tokens per second) than text (3-4 tokens per second), leading to high computational cost and incompatibility with streaming inference (see the rough arithmetic after this list).
  • Solution:
    • Model multiple sub-sequences: Instead of a single sequence, stack multiple sub-sequences (e.g., different audio codebooks and an optional text stream) for joint modeling.
    • Flatten the sub-sequences: Naively combining the sub-sequences into a single flat sequence multiplies the number of predictions per second by the number of sub-sequences (K), which is too costly.
    • Use the RQ-Transformer instead: a hierarchical approach with a Temporal Transformer over time steps and a smaller Depth Transformer over the parallel sub-sequences.
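For scale, a rough back-of-the-envelope comparison, assuming Mimi’s 12.5 Hz frame rate with 8 codebooks per speaker (which yields the 100 tokens per second quoted above):

```python
# Token counts for a 5-minute, two-speaker conversation (illustrative arithmetic only).
seconds = 5 * 60
audio_tokens_per_second = 12.5 * 8                          # frame rate × codebooks = 100 tokens/s per speaker
audio_tokens = int(seconds * audio_tokens_per_second * 2)   # two speakers
text_tokens = int(seconds * 3.5)                            # ~3-4 text tokens per second of speech
print(audio_tokens, text_tokens)                            # 60000 vs ~1050: why a flat sequence is too long
```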

12 of 22

Generative Audio Modeling

  • RQ-Transformer:
    • The RQ-Transformer consists of two models: a Temporal Transformer and a smaller Depth Transformer.
    • For a sequence step 1 ≤ s ≤ S, the Temporal Transformer maps (V_0, . . . , V_{s−1}) to a temporal context vector z_s = TrTemp(V_0, . . . , V_{s−1}) ∈ R^d.
    • For a sub-sequence index 1 < k ≤ K, the Depth Transformer maps z_s together with the tokens (V_{s,1}, . . . , V_{s,k−1}) to the logits estimate l_{s,k} = TrDepth(z_s, V_{s,1}, . . . , V_{s,k−1}) ∈ R^{N_k}.
    • Further define l_{s,1} = Lin(z_s) ∈ R^{N_1}, with Lin a dedicated linear layer; softmax(l_{s,k}) then gives an estimate of the distribution of V_{s,k}.

Example: text and audio for: "Hello, how are you?":

The Temporal Transformer looks at the text and audio stream up to the current step ("Hello, how...") and produces a context vector Zs.

The Depth Transformer uses z_s to predict the next word (“are”) and the corresponding audio token from the sound stream (a schematic sketch of this loop follows below).
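A schematic sketch of the two-level decoding loop implied by the equations above; `temporal_transformer`, `depth_transformer`, and `lin` are hypothetical placeholders rather than Moshi’s API, and sampling details (temperature, caching) are omitted:

```python
import torch

def rq_generate(temporal_transformer, depth_transformer, lin, num_steps, num_subsequences):
    """Hierarchical generation: one temporal context per step, K tokens per step."""
    generated = []                                    # V_0, ..., V_{s-1}: one list of K tokens per step
    for s in range(num_steps):
        z_s = temporal_transformer(generated)         # temporal context from all previous steps
        step_tokens = []
        logits = lin(z_s)                             # l_{s,1}: logits for the first sub-sequence
        for k in range(num_subsequences):
            token = torch.distributions.Categorical(logits=logits).sample()
            step_tokens.append(token)
            if k + 1 < num_subsequences:              # l_{s,k+1} conditioned on tokens already drawn at step s
                logits = depth_transformer(z_s, step_tokens)
        generated.append(step_tokens)                 # becomes V_s for the next temporal step
    return generated

# Toy stand-ins so the sketch runs end-to-end (the real models are Transformers).
d, vocab = 16, 32
toy_temporal = lambda history: torch.zeros(d) if not history else torch.randn(d)
toy_depth = lambda z, tokens: torch.randn(vocab)
toy_lin = lambda z: torch.randn(vocab)
tokens = rq_generate(toy_temporal, toy_depth, toy_lin, num_steps=3, num_subsequences=8)
```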

13 of 22

Generative Audio Modeling

  • Representation of the joint sequence:

Acoustic Delay: Insert the audio sub-sequences A_{t,q} from Mimi into the multi-sequence V, with a delay of 1 or 2 steps between the semantic and acoustic tokens.

Multi-stream modeling: Jointly model the two-speaker conversation: A from Moshi and A′ from the user.

Inner Monologue: Also model the textual representation of Moshi’s speech W by applying the SentencePiece tokenizer to the transcription of Moshi’s audio. This increases the linguistic quality of audio generation.

Aligning text and audio tokens: Text is aligned with audio tokens using Whisper’s word-level timestamps, inserting PAD tokens until the next word and EPAD to mark the end of the padding. About 65% of the text tokens in English are padding.

Deriving streaming ASR and TTS: By changing the delay between text and audio tokens, the same framework yields streaming text-to-speech (text ahead of audio) or automatic speech recognition (audio ahead of text).

Inference of Moshi: During inference, the model predicts Moshi’s text and audio tokens, while the user’s audio stream is taken directly from the input rather than predicted (a small sketch of the stream layout follows below).
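A small sketch of how such a joint multi-stream token matrix could be assembled, assuming Mimi’s 12.5 Hz frame rate, a 1-step acoustic delay, and hypothetical PAD token ids; the layout is illustrative, not Moshi’s exact format:

```python
import torch

PAD, EPAD = 0, 1          # hypothetical special token ids for text padding / end-of-padding

def build_joint_sequence(text_tokens, moshi_audio, user_audio, acoustic_delay=1):
    """Stack text + Moshi audio + user audio into one (streams, steps) token matrix.

    text_tokens: (steps,) text stream W, already padded with PAD/EPAD to the audio frame rate
    moshi_audio, user_audio: (Q=8, steps) token grids from Mimi (row 0 = semantic level)
    """
    steps = text_tokens.shape[-1]

    def delay_acoustics(audio):
        out = torch.full_like(audio, PAD)
        out[0] = audio[0]                                               # semantic level: no delay
        out[1:, acoustic_delay:] = audio[1:, :steps - acoustic_delay]   # acoustic levels shifted right
        return out

    # Stream order per time step: text, then Moshi's 8 audio levels, then the user's 8 levels.
    return torch.cat([text_tokens[None], delay_acoustics(moshi_audio), delay_acoustics(user_audio)])

# Example: 2 seconds at 12.5 Hz = 25 steps, 17 streams in total (1 text + 2 × 8 audio).
V = build_joint_sequence(torch.randint(2, 100, (25,)),
                         torch.randint(2, 2048, (8, 25)),
                         torch.randint(2, 2048, (8, 25)))
print(V.shape)  # torch.Size([17, 25])
```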

14 of 22

Datasets and Training

  • Text Data: The training dataset consists of 12.5% curated sources (e.g., Wikipedia, StackExchange, 40M scientific articles) and 87.5% filtered CommonCrawl data, using ten web crawls from 2018 to 2023.
  • Audio Data: A 7-million-hour unsupervised audio dataset transcribed with Whisper is used for single-stream pre-training; the Fisher dataset, containing 2,000 hours of phone conversations, enables multi-stream training; and 170 hours of natural and scripted conversations between pairs of participants are used for fine-tuning.
  • Speech-Text Instruct Data: Helium is fine-tuned on Open Hermes and real conversation transcripts to generate realistic interactions, and a TTS model then synthesizes 20k hours of speech from them. The TTS is conditioned on an actor’s voice in 70 speaking styles for Moshi, while user voices are randomly varied for robustness.

15 of 22

Datasets and Training

  • Prompts used to generate the interactions between a user and Moshi:

16 of 22

Results

Metrics. AI2 Reasoning Challenge (ARC), Open-Book QA (OBQA), HellaSwag (HS), WinoGrande (WG), Physical Interaction QA (PIQA), Social Interaction QA (SIQA), TriviaQA (TQA), Natural Questions (NQ), Massive Multitask Language Understanding benchmark (MMLU).

17 of 22

Results

Metrics. ABX Error Rate: Measures phonetic discriminability between triphones. ViSQOL: Evaluates acoustic similarity with a reference.

MOSNet: Assesses audio quality without a reference. MUSHRA: Human ratings of audio quality.

18 of 22

Results

Metrics. ABX Error Rate: Measures phonetic discriminability between triphones. ViSQOL: Evaluates acoustic similarity with a reference.

MOSNet: Assesses audio quality without a reference. MUSHRA: Human ratings of audio quality.

19 of 22

Results

20 of 22

Results

Metrics. sWUGGY: Measures lexicon learning by comparing valid and invalid words. sBLIMP: Evaluates syntactic knowledge through syntactic contrasts. Spoken StoryCloze: Assesses semantic understanding with coherent vs. incoherent stories. Spoken Topic-StoryCloze: Uses unrelated sentences to evaluate semantic coherence. Negative Log-Likelihood: Likelihood scores normalized by sequence length. MMLU: Measures text understanding independent of audio tokens.

21 of 22

Results

22 of 22
