1 of 43

Text-to-Image Generation with Mamba

Qian Yang, Kanishk Jain

2 of 43

Overview

Reminder of Projects

Updated Timeline

Q&A

Progress Until Now


3 of 43

Problem Statement Reminder

  • Text-to-Image generation using the Mamba architecture
    • Diffusion Models
    • Autoregressive Models
  • Why Mamba?
    • Faster throughput
    • Long-context modelling
    • Computational efficiency

4 of 43

MUSE: Text-to-Image Generation via Masked Generative Transformers

  • Text-to-Image generation with masked image modelling
  • However, transformers have a high computation cost
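The masked-image-modelling setup MUSE uses can be sketched in a few lines: discrete image tokens are randomly replaced by a [MASK] id, and the model is trained to predict only the masked positions. This is a minimal NumPy illustration; the token count, vocabulary size, and mask id are hypothetical, not MUSE's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 16x16 grid of discrete VQ token ids plus a [MASK] id.
num_tokens, vocab, MASK = 256, 8192, 8192

tokens = rng.integers(0, vocab, size=num_tokens)
mask = rng.random(num_tokens) < 0.5          # mask a random subset of positions
inputs = np.where(mask, MASK, tokens)        # model sees [MASK] at masked slots
targets = np.where(mask, tokens, -100)       # loss computed only on masked positions
```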

5 of 43

Mamba: Selective State Space model

6 of 43

VMamba

  • Combines the benefits of CNNs and ViTs, maintaining linear complexity while preserving global receptive fields and dynamic weights.
  • Proposes a cross-scan mechanism to inject inductive bias for images

7 of 43

VMamba

  • Combines the benefits of CNNs and ViTs, maintaining linear complexity while preserving global receptive fields and dynamic weights.
  • Proposes a cross-scan mechanism to inject inductive bias for images
  • Capable of high-resolution image understanding tasks
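The cross-scan idea can be sketched as flattening a 2-D feature map into four 1-D traversal orders so that a unidirectional sequence model sees the image from multiple directions. A minimal NumPy sketch, assuming row-major and column-major scans plus their reverses (the exact VMamba orders may differ):

```python
import numpy as np

def cross_scan(x):
    """Flatten an (H, W, C) feature map into four 1-D scan orders:
    row-major, its reverse, column-major, and its reverse."""
    H, W, C = x.shape
    row_major = x.reshape(H * W, C)                      # left-to-right, top-to-bottom
    col_major = x.transpose(1, 0, 2).reshape(H * W, C)   # top-to-bottom, left-to-right
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

x = np.arange(4).reshape(2, 2, 1)   # tiny 2x2 grid with one channel
scans = cross_scan(x)
# row-major order of the 2x2 grid: 0, 1, 2, 3
# column-major order:              0, 2, 1, 3
```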

8 of 43

Our Approach: M-Mamba for Image Generation

  • Replace transformer blocks in MUSE with Multimodal-Mamba blocks

9 of 43

Mamba: Selective State Space model

  • A selective state space model for long-context modelling
  • An alternative to transformers that scales linearly with context length
  • Leverages a hardware-aware algorithm for efficient training and inference
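The selection mechanism can be illustrated with a toy recurrence in which the input and output projections depend on the current input, letting the state keep or ignore each token. This is a minimal NumPy sketch on a scalar input stream with a diagonal state matrix; the tanh projections and all shapes are illustrative, not Mamba's actual parameterization.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C):
    """Toy selective SSM:
        h_t = A * h_{t-1} + B(x_t) * x_t,   y_t = C(x_t) . h_t
    B and C are functions of the current input x_t — the 'selection'
    mechanism that makes the recurrence input-dependent."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        B_t = np.tanh(W_B * x_t)      # input-dependent input projection
        C_t = np.tanh(W_C * x_t)      # input-dependent output projection
        h = A * h + B_t * x_t         # diagonal A: elementwise recurrence
        ys.append(C_t @ h)
    return np.array(ys)
```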

10 of 43

Problem Statement Reminder

  • Text-to-Image generation
  • Reduce training and inference times

Stable Diffusion

11 of 43

Our Approach: M-Mamba for Image Generation

  • Detail of M-Mamba Blocks:

[Diagram: M-Mamba block ("a little mamba") — the text sequence is concatenated with the image sequence and scanned along four image directions (Image Dir-1 through Dir-4); the per-direction outputs are then combined with the text representation in a Text Merge step.]

12 of 43

Progress Until Now

  • Finalized Model Architecture:
    • Text and Image Concatenation: Sequence text and image data for enhanced text-conditioned image generation.

  • Finalized hyper-parameter setting:
    • Mamba Layers Configuration:
      • 12 layers: Faster training but limited generative capability.
      • Optimal: 22 layers, matching the Amused model, for balanced performance.
    • Convolutional Blocks: Utilize pretrained weights from Amused for quicker convergence.

  • Start training both UViT and Mamba from scratch for comprehensive comparison.
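The text-and-image concatenation above amounts to building one joint sequence, text tokens first, so generation of image tokens is conditioned on the text prefix. A minimal NumPy sketch; the sequence lengths and embedding dimension are hypothetical placeholders, not the finalized hyper-parameters.

```python
import numpy as np

# Hypothetical dimensions: 16 text tokens, a 16x16 latent grid of image
# tokens (256), and a 64-d embedding space.
T_text, T_img, D = 16, 256, 64

text_emb = np.random.randn(T_text, D)
img_emb = np.random.randn(T_img, D)

# Single joint sequence: text prefix, then image tokens, processed together
# by the Mamba layers so image generation is text-conditioned.
seq = np.concatenate([text_emb, img_emb], axis=0)
```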


13 of 43

Loss Curve

  • Loss Curve for UViT and M-Mamba

Loss Curve for M-Mamba 22 layers

Loss Curve for UViT 22 layers


14 of 43

Speed Comparison

  • Test Environment: Conducted on 1 A100 GPU, 32 batch size, averaged over 50 forward passes.


  Model     Params   22-Layer Time   Single-Layer Time   Attention vs. SSM (per block)
  UViT      513M     0.137 s         5.94e-3 s           Self-attention: 1.19e-3 s; Cross-attention: 1.16e-3 s
  M-Mamba   448M     0.217 s         8.92e-3 s           SSM: 2.33e-4 s

  • Findings:
    • Attention vs. SSM: the SSM block is faster than the attention block, highlighting its processing efficiency.
    • M-Mamba vs. UViT: the full M-Mamba layer is slightly slower than the transformer layer.
      • Mamba is unidirectional, so cross-scanning is required to get a comprehensive view of an image, and this step is time-consuming.
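The timing methodology (averaging over 50 forward passes after warm-up) can be sketched as below. This is a generic CPU sketch; on a GPU, a synchronization call such as `torch.cuda.synchronize()` is needed before each clock reading, otherwise asynchronous kernel launches make the timings meaningless.

```python
import time

def avg_forward_time(fn, x, n_warmup=5, n_runs=50):
    """Average wall-clock seconds per forward pass of fn(x).
    Warm-up runs are excluded so one-time setup costs are not counted."""
    for _ in range(n_warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    return (time.perf_counter() - start) / n_runs
```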

15 of 43

Experiments

  • Image generation Comparison between UViT and M-Mamba
    • Prompt: people playing baseball

Generated by M-Mamba 22 layers at 22k steps

Generated by UViT 22 layers at 22k steps


16 of 43

Experiments

  • Image generation Comparison between UViT and M-Mamba
    • Prompt: man and woman wearing blue clothes

Generated by M-Mamba 22 layers at 22k steps

Generated by UViT 22 layers at 22k steps


17 of 43

Experiments

  • Stages of Learning in Image Generation:
    • Colors are learned at an early stage.
    • The model then learns shapes and edges.
    • Spatial relationships and composition: generating images with multiple objects in complex spatial relationships is a more advanced learning stage.
  • Images generated by M-Mamba 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


18 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition


19 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.


20 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.
  • Shapes and Edges:
    • Learn to identify shapes and edges, crucial for forming the basis of object recognition within images.


21 of 43

Experiments

Stages of Learning in Image Generation:

  • Early Stage: Color Recognition

  • Background Understanding:
    • Following color recognition, models begin to grasp background information, learning to differentiate between foreground and background elements.
  • Shapes and Edges:
    • Learn to identify shapes and edges, crucial for forming the basis of object recognition within images.

  • Spatial Relationships and Composition:
    • Both models show limited capability for learning spatial relationships.


22 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


23 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: man and woman wearing blue clothes

1k steps

17k steps

22k steps


24 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: people playing baseball

1k steps

17k steps

22k steps


25 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: people playing baseball

1k steps

17k steps

22k steps


26 of 43

Experiments

  • Images generated by UViT 22 layers:
    • Prompt: a woman with yellow hat and black shoes

1k steps

17k steps

22k steps


27 of 43

Is Mamba worth exploring?

  • Speed:
    • Advantage: SSMs exhibit faster processing speeds than traditional attention mechanisms, presenting a clear efficiency advantage.

    • Challenge: Unlike transformer layers that can benefit from flash attention mechanisms for further acceleration, Mamba currently lacks equivalent speed-enhancing libraries, potentially limiting its speed advantage.


28 of 43

Is Mamba worth exploring?

  • Performance:
    • Image Generation: Based on our current experimental findings, Mamba does not perform as well as transformers in the MUSE-based architecture. Its capabilities appear limited in comparison, particularly for complex visual content.

    • Sequential Data Modeling: Mamba demonstrates superior performance over transformers for long-range sequential data modeling, indicating its potential in specific applications.


29 of 43

Is Mamba worth exploring?

  • Scalability:
    • Cobra scales Mamba to 3B parameters, achieving comparable performance with LLaVA 1.5 7B, but with reduced computational time. This demonstrates notable advantages in scalability and efficiency.


30 of 43

Is Mamba worth exploring?

  • Community and Ecosystem Development
    • There is a growing body of research and papers focusing on Mamba, indicating rising interest and potential advancements in the field.

    • As Mamba's ecosystem matures, with the development of more robust libraries and support, it will become an increasingly viable and competitive option in the landscape of AI models.


31 of 43

Experiments

  • Images generated by M-Mamba 22 layers:
    • Prompt: a woman with yellow hat and black shoes

1k steps

17k steps

22k steps


32 of 43

Updated Timeline

  • ✅ Prepare the dataset: 28th February
  • ✅ Prepare the code for Mamba and VMamba: 1st March
  • ✅ Explore the MMamba-Muse architecture: 10th March
  • ✅ Start training 256 x 256 MMamba-Muse: 28th March
    • Note: the initial target was 15th March; the setup was actually completed on 28th March.
  • Upcoming Milestones:
    • Explore combining Mamba with cross-attention.
    • MMamba-Muse 256 x 256 version completion: 13th April
      • Achieve a version capable of generating high-quality results.
    • Quantity and quality comparison between Muse and MMamba-Muse: 16th April
      • Evaluate and compare the performance and output quality of Muse and MMamba-Muse.

33 of 43

State Space Model

  • Traditionally used to model system dynamics via state variables
  • Maps an input sequence x(t) to a hidden state h(t), then predicts an output sequence y(t)
  • The entire operation can be approximated using a global convolution
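The global-convolution view can be made concrete: unrolling the recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t gives y as a causal convolution of x with the kernel K_k = C A^k B. A minimal NumPy sketch for a scalar-input, scalar-output SSM (real SSMs compute this kernel far more efficiently, e.g. with FFTs):

```python
import numpy as np

def ssm_kernel(A, B, C, L):
    """Kernel K_k = C A^k B of a linear time-invariant SSM."""
    K, Ak = [], np.eye(A.shape[0])
    for _ in range(L):
        K.append(C @ Ak @ B)
        Ak = A @ Ak
    return np.array(K)

def ssm_conv(A, B, C, x):
    """y_t = sum_k K_k * x_{t-k}: the recurrence as a causal convolution."""
    K = ssm_kernel(A, B, C, len(x))
    return np.array([sum(K[k] * x[t - k] for k in range(t + 1))
                     for t in range(len(x))])
```

With a single impulse input, the output is exactly the kernel itself, which makes the equivalence easy to check.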

34 of 43

Mamba’s Contributions

Selection Mechanism

Parallel Scan

Similar to Mamba’s recurrence relation:

35 of 43

Selection Mechanism

Selective Copying

Induction Heads

36 of 43

Parallel Scan

Similar to Mamba’s recurrence relation:
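The recurrence alluded to here, h_t = a_t·h_{t-1} + b_t, is associative under the right composition operator, which is what allows a parallel (Blelloch-style) scan in O(log T) steps. A minimal sketch showing the operator and a sequential reference scan; the parallel tree evaluation itself is omitted:

```python
def combine(left, right):
    """Associative composition for h_t = a_t*h_{t-1} + b_t:
    applying (a1, b1) then (a2, b2) equals (a2*a1, a2*b1 + b2)."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(pairs):
    """Sequential reference. Because `combine` is associative, the same
    prefix results can be computed in O(log T) parallel steps."""
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(combine(out[-1], p))
    return [b for _, b in out]   # the b-component is h_t (with h_0 = 0)
```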

37 of 43

Hardware-Aware Algorithm

  • GPUs have two kinds of memory: small but fast SRAM, and large but slower DRAM
  • Limit the number of copies between DRAM and SRAM

38 of 43

Our Approach: M-Mamba for Image Generation

  • Explore ways to incorporate multimodal context into Mamba:
    • Concatenate the text representation with the image representation as a single sequence.
    • Design a Multimodal-Mamba block: use a gating mechanism to selectively focus on certain words at different time-steps.
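One way to picture the gating idea: at each image position, a sigmoid gate computed from the image hidden state decides how much pooled text context to mix in. This is a toy NumPy sketch under loudly stated assumptions; the function name, the mean-pooling of text, and the single gate vector `W_g` are all hypothetical simplifications, not the actual M-Mamba block design.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_text_fusion(img_h, text_h, W_g):
    """Hypothetical gate: img_h is (T, D) image hidden states, text_h is
    (S, D) text states, W_g is a (D,) gate projection. Each image position
    mixes in mean-pooled text context weighted by its own sigmoid gate."""
    text_ctx = text_h.mean(axis=0)          # (D,) pooled text representation
    g = sigmoid(img_h @ W_g)                # (T,) per-position gate in (0, 1)
    return img_h + g[:, None] * text_ctx    # gated residual injection
```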

39 of 43

Training Plan

  • Train on CC3M (Conceptual Captions 3 Million):
    • 3,318,333 image-URL/caption pairs
    • 51,201 total vocabulary
    • Average number of tokens per caption: 10.3
    • Median number of tokens per caption: 9.0

  • Expected Training Budget:
    • The aMUSEd 256x256 model was trained on 2 8xA100 servers for 1,000,000 steps, with a per-GPU batch size of 128.
    • We plan to train on 8xA100 servers for 2,000,000 steps, using a per-GPU batch size of 128.

40 of 43

Evaluation Metrics

  • Frechet Inception Distance (FID): measures quality and diversity of samples
    • Lower is better.
  • CLIP score: measures image/text alignment
    • Higher is better.
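CLIP score, as commonly reported, is the cosine similarity between the CLIP image and text embeddings, clipped at zero and scaled by a weight w (typically 100). A minimal NumPy sketch that assumes the embeddings are already extracted; FID additionally requires Inception features and a Fréchet distance between Gaussians, which is omitted here.

```python
import numpy as np

def clip_score(img_emb, txt_emb, w=100.0):
    """w * max(0, cos(img_emb, txt_emb)): higher means better alignment."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_emb / np.linalg.norm(txt_emb)
    return w * max(0.0, float(img @ txt))
```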

41 of 43

Evaluation Metrics

  • Inference Speed: time to generate an image end to end.

42 of 43

Research Objectives

  • Reduce the training and inference time for text-to-image generation

  • Incorporate multi-modal fusion into the Mamba architecture

  • Improve high-resolution image synthesis

43 of 43

References