1 of 43

Text-to-Image Generation with Mamba

Qian Yang, Kanishk Jain

2 of 43

Overview

Reminder of Projects

Updated Timeline

Q&A

Progress Until Now


3 of 43

Problem Statement Reminder

  • Text-to-Image generation using the Mamba architecture
    • Diffusion Models
    • Autoregressive Models
  • Why Mamba?
    • Faster throughput
    • Long-context modelling
    • Computational efficiency

4 of 43

MUSE: Text-to-Image Generation via Masked Generative Transformers

  • Text-to-Image generation with masked image modelling
  • However, transformers have a high computation cost
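The masked-image-modelling setup MUSE uses can be sketched in a few lines: discrete image tokens are randomly replaced by a [MASK] id, and the model is trained to predict only the masked positions. This is a minimal NumPy illustration; the token count, vocabulary size, and mask id are hypothetical, not MUSE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 16x16 grid of discrete VQ token ids plus a [MASK] id.
num_tokens, vocab, MASK = 256, 8192, 8192

tokens = rng.integers(0, vocab, size=num_tokens)
mask = rng.random(num_tokens) < 0.5          # mask a random subset of positions
inputs = np.where(mask, MASK, tokens)        # model sees [MASK] at masked slots
targets = np.where(mask, tokens, -100)       # loss computed only on masked positions
```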

5 of 43

Mamba: Selective State Space model

6 of 43

VMamba

  • Combines the benefits of CNNs and ViTs, maintaining linear complexity while preserving global receptive fields and dynamic weights.
  • Proposes a cross-scan mechanism to inject inductive bias for images

7 of 43

VMamba

  • Combines the benefits of CNNs and ViTs, maintaining linear complexity while preserving global receptive fields and dynamic weights.
  • Proposes a cross-scan mechanism to inject inductive bias for images
  • Capable of high-resolution image understanding tasks
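The cross-scan idea can be sketched as flattening a 2-D feature map into four 1-D traversal orders so that a unidirectional sequence model sees the image from multiple directions. A minimal NumPy sketch, assuming row-major and column-major scans plus their reverses (the exact VMamba orders may differ):

```python
import numpy as np

def cross_scan(x):
    """Flatten an (H, W, C) feature map into four 1-D scan orders:
    row-major, its reverse, column-major, and its reverse."""
    H, W, C = x.shape
    row_major = x.reshape(H * W, C)                      # left-to-right, top-to-bottom
    col_major = x.transpose(1, 0, 2).reshape(H * W, C)   # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

x = np.arange(4).reshape(2, 2, 1)   # tiny 2x2 grid with one channel
scans = cross_scan(x)
# row-major order of the 2x2 grid: 0, 1, 2, 3
# column-major order:              0, 2, 1, 3
```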

8 of 43

Our Approach: M-Mamba for Image Generation

  • Replace transformer blocks in MUSE with Multimodal-Mamba blocks

9 of 43

Mamba: Selective State Space model

  • A selective state space model for long-context modelling
  • An alternative to transformers that scales linearly with context length
  • Leverages a hardware-aware algorithm for efficient training and inference
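The selection mechanism can be illustrated with a toy recurrence in which the input and output projections depend on the current input, letting the state keep or ignore each token. This is a minimal NumPy sketch on a scalar input stream with a diagonal state matrix; the tanh projections and all shapes are illustrative, not Mamba's actual parameterization.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C):
    """Toy selective SSM:
        h_t = A * h_{t-1} + B(x_t) * x_t,   y_t = C(x_t) . h_t
    B and C are functions of the current input x_t — the 'selection'
    mechanism that makes the recurrence input-dependent."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        B_t = np.tanh(W_B * x_t)      # input-dependent input projection
        C_t = np.tanh(W_C * x_t)      # input-dependent output projection
        h = A * h + B_t * x_t         # diagonal A: elementwise recurrence
        ys.append(C_t @ h)
    return np.array(ys)
```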

10 of 43

Problem Statement Reminder

  • Text-to-Image generation
  • Reduce training and inference times

Stable Diffusion

11 of 43

Our Approach: M-Mamba for Image Generation

  • Detail of M-Mamba Blocks:

[Diagram: M-Mamba block ("a little mamba") — the text sequence is concatenated with the image sequence and scanned along four image directions (Image Dir-1 through Dir-4); the per-direction outputs are then combined with the text representation in a Text Merge step.]

12 of 43

Progress Until Now

  • Finalized Model Architecture:
    • Text and Image Concatenation: Sequence text and image data for enhanced text-conditioned image generation.

  • Finalized hyper-parameter setting:
    • Mamba Layers Configuration:
      • 12 layers: Faster training but limited generative capability.
      • Optimal: 22 layers, matching the Amused model, for balanced performance.
    • Convolutional Blocks: Utilize pretrained weights from Amused for quicker convergence.

  • Start training both UViT and Mamba from scratch for comprehensive comparison.
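The text-and-image concatenation above amounts to building one joint sequence, text tokens first, so generation of image tokens is conditioned on the text prefix. A minimal NumPy sketch; the sequence lengths and embedding dimension are hypothetical placeholders, not the finalized hyper-parameters.

```python
import numpy as np

# Hypothetical dimensions: 16 text tokens, a 16x16 latent grid of image
# tokens (256), and a 64-d embedding space.
T_text, T_img, D = 16, 256, 64

text_emb = np.random.randn(T_text, D)
img_emb = np.random.randn(T_img, D)

# Single joint sequence: text prefix, then image tokens, processed together
# by the Mamba layers so image generation is text-conditioned.
seq = np.concatenate([text_emb, img_emb], axis=0)
```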


13 of 43

Loss Curve

  • Loss Curve for UViT and M-Mamba

Loss Curve for M-Mamba 22 layers

Loss Curve for UViT 22 layers


14 of 43

Speed Comparison

  • Test Environment: Conducted on 1 A100 GPU, 32 batch size, averaged over 50 forward passes.


  Model     Params   22-Layer Time   Single-Layer Time   Attention vs. SSM (per block)
  UViT      513M     0.137 s         5.94e-3 s           Self-attention: 1.19e-3 s; Cross-attention: 1.16e-3 s
  M-Mamba   448M     0.217 s         8.92e-3 s           SSM: 2.33e-4 s

  • Findings:
    • Attention vs. SSM: the SSM block is faster than the attention block, highlighting its processing efficiency.
    • M-Mamba vs. UViT: the full M-Mamba layer is slightly slower than the transformer layer.
      • Mamba is unidirectional, so cross-scanning is required to get a comprehensive view of an image, and this step is time-consuming.
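The timing methodology (averaging over 50 forward passes after warm-up) can be sketched as below. This is a generic CPU sketch; on a GPU, a synchronization call such as `torch.cuda.synchronize()` is needed before each clock reading, otherwise asynchronous kernel launches make the timings meaningless.

```python
import time

def avg_forward_time(fn, x, n_warmup=5, n_runs=50):
    """Average wall-clock seconds per forward pass of fn(x).
    Warm-up runs are excluded so one-time setup costs are not counted."""
    for _ in range(n_warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    return (time.perf_counter() - start) / n_runs
```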

15 of 43

Experiments

  • Image generation Comparison between UViT and M-Mamba
    • Prompt: people playing baseball

Generated by M-Mamba 22 layers at 22k steps

Generated by UViT 22 layers at 22k steps


16 of 43

Experiments

  • Image generation Comparison between UViT and M-Mamba
    • Prompt: man and woman wearing blue clothes

Generated by M-Mamba 22 layers at 22k steps

Generated by UViT 22 layers at 22k steps


17 of 43

Experiments

  • Stages of Learning in Image Generation:
    • Colors are learned at an early stage.
    • The model then learns shapes and edges.
    • Spatial relationships and composition: generating images with multiple objects in complex spatial relationships is a more advanced learning stage.
  • Images generated by M-Mamba 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


18 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition


19 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.


20 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.
  • Shapes and Edges:
    • Learn to identify shapes and edges, crucial for forming the basis of object recognition within images.


21 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.
  • Shapes and Edges:
    • Learn to identify shapes and edges, crucial for forming the basis of object recognition within images.

  • Spatial Relationships and Composition:
    • Both models show limited capability for learning spatial relationships.


22 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


23 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


24 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: people playing baseball

1k steps

17k steps

22k steps


25 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: people playing baseball

1k steps

17k steps

22k steps


26 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: a woman with yellow hat and black shoes

1k steps

17k steps

22k steps


27 of 43

Is Mamba worth exploring?

  • Speed:
    • Advantage: SSMs exhibit faster processing speeds than traditional attention mechanisms, presenting a clear efficiency advantage.

    • Challenge: Unlike transformer layers that can benefit from flash attention mechanisms for further acceleration, Mamba currently lacks equivalent speed-enhancing libraries, potentially limiting its speed advantage.


28 of 43

Is Mamba worth exploring?

  • Performance:
    • Image Generation: Based on our current experimental findings, Mamba does not perform as well as transformers in the MUSE-based architecture. Its capabilities appear limited in comparison, particularly for complex visual content.

    • Sequential Data Modeling: Mamba demonstrates superior performance over transformers for long-range sequential data modeling, indicating its potential in specific applications.


29 of 43

Is Mamba worth exploring?

  • Scalability:
    • Cobra scales Mamba to 3B parameters, achieving comparable performance with LLaVA 1.5 7B, but with reduced computational time. This demonstrates notable advantages in scalability and efficiency.


30 of 43

Is Mamba worth exploring?

  • Community and Ecosystem Development
    • There is a growing body of research and papers focusing on Mamba, indicating rising interest and potential advancements in the field.

    • As Mamba's ecosystem matures, with the development of more robust libraries and support, it will become an increasingly viable and competitive option in the landscape of AI models.


31 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: a woman with yellow hat and black shoes

1k steps

17k steps

22k steps


32 of 43

Updated Timeline

  • ✅ Prepare the dataset: 28th February
  • ✅ Prepare the code for Mamba and VMamba: 1st March
  • ✅ Explore the MMamba-Muse architecture: 10th March
  • ✅ Start training 256 x 256 MMamba-Muse: 28th March
    • Note: the initial target was 15th March; the setup was actually completed on 28th March.
  • Upcoming Milestones:
    • Explore combining Mamba with cross-attention.
    • MMamba-Muse 256 x 256 version completion: 13th April
      • Achieve a version capable of generating high-quality results.
    • Quantity and quality comparison between Muse and MMamba-Muse: 16th April
      • Evaluate and compare the performance and output quality of Muse and MMamba-Muse.

33 of 43

State Space Model

  • Traditionally used to model system dynamics via state variables
  • Maps an input sequence x(t) to a hidden state h(t), then predicts an output sequence y(t)
  • The entire operation can be approximated using a global convolution
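The global-convolution view can be made concrete: unrolling the recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t gives y as a causal convolution of x with the kernel K_k = C A^k B. A minimal NumPy sketch for a scalar-input, scalar-output SSM (real SSMs compute this kernel far more efficiently, e.g. with FFTs):

```python
import numpy as np

def ssm_kernel(A, B, C, L):
    """Kernel K_k = C A^k B of a linear time-invariant SSM."""
    K, Ak = [], np.eye(A.shape[0])
    for _ in range(L):
        K.append(C @ Ak @ B)
        Ak = A @ Ak
    return np.array(K)

def ssm_conv(A, B, C, x):
    """y_t = sum_k K_k * x_{t-k}: the recurrence as a causal convolution."""
    K = ssm_kernel(A, B, C, len(x))
    return np.array([sum(K[k] * x[t - k] for k in range(t + 1))
                     for t in range(len(x))])
```

With a single impulse input, the output is exactly the kernel itself, which makes the equivalence easy to check.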

34 of 43

Mamba’s Contributions

Selection Mechanism

Parallel Scan

Similar to Mamba’s recurrence relation:

35 of 43

Selection Mechanism

Selective Copying

Induction Heads

36 of 43

Parallel Scan

Similar to Mamba’s recurrence relation:
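The recurrence alluded to here, h_t = a_t·h_{t-1} + b_t, is associative under the right composition operator, which is what allows a parallel (Blelloch-style) scan in O(log T) steps. A minimal sketch showing the operator and a sequential reference scan; the parallel tree evaluation itself is omitted:

```python
def combine(left, right):
    """Associative composition for h_t = a_t*h_{t-1} + b_t:
    applying (a1, b1) then (a2, b2) equals (a2*a1, a2*b1 + b2)."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(pairs):
    """Sequential reference. Because `combine` is associative, the same
    prefix results can be computed in O(log T) parallel steps."""
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(combine(out[-1], p))
    return [b for _, b in out]   # the b-component is h_t (with h_0 = 0)
```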

37 of 43

Hardware-Aware Algorithm

  • GPUs have two kinds of memory: small but fast SRAM, and large but slower DRAM
  • Limit the number of copies between DRAM and SRAM

38 of 43

Our Approach: M-Mamba for Image Generation

  • Explore ways to incorporate multimodal context into Mamba:
    • Concatenate the text representation with the image representation as a single sequence.
    • Design a Multimodal-Mamba block: use a gating mechanism to selectively focus on certain words at different time-steps.
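One way to picture the gating idea: at each image position, a sigmoid gate computed from the image hidden state decides how much pooled text context to mix in. This is a toy NumPy sketch under loudly stated assumptions; the function name, the mean-pooling of text, and the single gate vector `W_g` are all hypothetical simplifications, not the actual M-Mamba block design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_text_fusion(img_h, text_h, W_g):
    """Hypothetical gate: img_h is (T, D) image hidden states, text_h is
    (S, D) text states, W_g is a (D,) gate projection. Each image position
    mixes in mean-pooled text context weighted by its own sigmoid gate."""
    text_ctx = text_h.mean(axis=0)          # (D,) pooled text representation
    g = sigmoid(img_h @ W_g)                # (T,) per-position gate in (0, 1)
    return img_h + g[:, None] * text_ctx    # gated residual injection
```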

39 of 43

Training Plan

  • Train on CC3M (Conceptual Captions 3 Million):
    • 3,318,333 image-URL/caption pairs
    • 51,201 total vocabulary
    • Average number of tokens per caption: 10.3
    • Median number of tokens per caption: 9.0

  • Expected Training Budget:
    • The aMUSEd 256x256 model was trained on 2 8xA100 servers for 1,000,000 steps, with a per-GPU batch size of 128.
    • We plan to train on 8xA100 servers for 2,000,000 steps, using a per-GPU batch size of 128.

40 of 43

Evaluation Metrics

  • Frechet Inception Distance (FID): measures quality and diversity of samples
    • Lower is better.
  • CLIP score: measures image/text alignment
    • Higher is better.
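CLIP score, as commonly reported, is the cosine similarity between the CLIP image and text embeddings, clipped at zero and scaled by a weight w (typically 100). A minimal NumPy sketch that assumes the embeddings are already extracted; FID additionally requires Inception features and a Fréchet distance between Gaussians, which is omitted here.

```python
import numpy as np

def clip_score(img_emb, txt_emb, w=100.0):
    """w * max(0, cos(img_emb, txt_emb)): higher means better alignment."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return w * max(0.0, float(img @ txt))
```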

41 of 43

Evaluation Metrics

  • Inference Speed: time to generate an image end to end.

42 of 43

Research Objectives

  • Reduce the training and inference time for text-to-image generation

  • Incorporate multi-modal fusion into the Mamba architecture

  • Improve high-resolution image synthesis

43 of 43

References