1 of 61

Competitors of the Transformer

Mamba

Chunk-wise

Mamba

Mamba’s weakness

More …

To do:

Should I cover chunk-wise? When???

Should I add an intuitive explanation for ktv??? Must.

Should I cover the KV cache? X

Should I explain that a Transformer is just matrix multiplication?

====

Understanding and Improving Efficient Language Models: https://www.youtube.com/watch?v=DmuZ3ckl8rs

https://www.youtube.com/watch?v=ksRp_DIHWj4

Infinite attention? https://www.youtube.com/watch?v=F5dPnWGJmr8

Not sure: https://www.youtube.com/watch?v=OxZLtAteWs4

More talk: https://www.youtube.com/watch?v=PnwC74s1nmc

More recent stuff: https://www.bilibili.com/video/BV1MDwAeWEoM/?vd_source=035872531a338721ba64a57a1cdc1ebc

talk_250117.pdf

NeurIPS 2024 talk

https://youtu.be/LPe6iC73lrc?si=UJM4fKFl4o92Q56F

Survey: https://github.com/xmindflow/Awesome_Mamba

Good tutorial!

https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mamba-and-state

Alternatives: https://medium.com/nebius/transformer-alternatives-in-2024-06cd3d91d42b

Survey paper of Mamba: https://arxiv.org/pdf/2408.01129

=====

Linear Attention and Beyond (Interactive Tutorial with Songlin Yang)

https://www.youtube.com/watch?v=d0HJvGSWw8A

Good tutorial

https://www.youtube.com/watch?v=dVH1dRoMPBc

Hardware-Aware Efficient Primitives for Machine Learning

https://youtu.be/5zZOoiH-b68?si=WPXdVnaappp1Zrv_

2 of 61

Every architecture exists for a reason

What is the reason CNNs exist?

Fully Connected Layer

Receptive Field

Parameter Sharing

Convolutional Layer

Larger model bias

(for image)

Jack of all trades, master of none

Use the properties of images to remove unnecessary parameters and avoid overfitting
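A rough parameter count can illustrate why a receptive field plus parameter sharing removes so many parameters. The image size (100 x 100 x 3), the 64 units/filters, and the 3 x 3 kernel below are assumptions picked for the example, not values from the slides.

```python
def fully_connected_params(height, width, channels, units):
    """Weights plus biases when every unit sees every input value."""
    return height * width * channels * units + units


def conv_params(kernel, channels, filters):
    """Weights plus biases when each filter only sees a kernel x kernel patch
    (receptive field) and is reused at every position (parameter sharing)."""
    return kernel * kernel * channels * filters + filters


fc = fully_connected_params(100, 100, 3, 64)
conv = conv_params(3, 3, 64)
print(fc, conv)  # 1920064 1792
```

Under these assumptions the convolutional layer has roughly a thousand times fewer parameters to fit.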

3 of 61

Every architecture exists for a reason

What is the reason CNNs exist?

https://youtu.be/OP5HcXJg2Aw?si=RPfmHhsrMtuN0QS6

4 of 61

Every architecture exists for a reason

What is the reason residual connections exist?

Testing Data

Overfitting?

Training Data

Optimization issue

Source of image: http://arxiv.org/abs/1512.03385

5 of 61

Every architecture exists for a reason

What is the reason residual connections exist? To make optimization easier
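A minimal sketch of what a residual connection computes, assuming a single tanh layer and all-zero weights as a stand-in for an untrained layer (both are illustrative choices, not from the slides). With the skip path the input still reaches the output, which is one intuition for why deep stacks become easier to optimize.

```python
import numpy as np


def plain_layer(x, W):
    """A plain layer: the output depends only on the transformed input."""
    return np.tanh(W @ x)


def residual_layer(x, W):
    """A residual layer: the input is added back to the layer output."""
    return x + np.tanh(W @ x)


x = np.array([1.0, -2.0, 0.5])
W = np.zeros((3, 3))  # hypothetical "untrained" weights

print(plain_layer(x, W))     # all zeros: the input is lost
print(residual_layer(x, W))  # the input passes through unchanged
```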

Source of image: https://arxiv.org/abs/1712.09913

6 of 61

……

Self-attention Layer

Layer

Why did the Transformer appear?

Transformer Layer (Block)

RNN (LSTM)

Mamba (and its friends)

7 of 61

The problem to solve

……

RNN, Self-Attention, Mamba …

……

8 of 61

RNN-Style

……

Hidden

State

9 of 61

RNN-Style

……

Vector,

Matrix, …

10 of 61

RNN-Style

……

Vector,

Matrix, …
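A minimal sketch of the RNN-style pattern: a hidden state (a vector, a matrix, ...) is updated once per input and then read out. The function names `f_A`, `f_B`, `f_C` and the additive update are assumptions used for illustration; the slides only show the hidden state being passed along.

```python
import numpy as np


def rnn_style(xs, f_A, f_B, f_C, H0):
    """Process inputs one at a time, carrying a hidden state H."""
    H = H0
    ys = []
    for x in xs:
        H = f_A(H) + f_B(x)   # update the memory with the new input
        ys.append(f_C(H))     # read the output from the memory
    return ys, H


xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
ys, H = rnn_style(
    xs,
    f_A=lambda H: 0.5 * H,      # hypothetical: keep half of the old memory
    f_B=lambda x: x,            # hypothetical: write the input as-is
    f_C=lambda H: float(H.sum()),
    H0=np.zeros(2),
)
print(ys)  # [1.0, 2.5]
```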

11 of 61

RNN-Style

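[Figure: an LSTM unrolled over time steps t and t+1. Inputs x^t and x^(t+1) produce the signals z, z^i, z^f, z^o; outputs are y^t and y^(t+1); hidden states h^(t-1), h^t and cell states c^(t-1), c^t, c^(t+1) are passed from step to step.]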

LSTM

12 of 61

RNN-Style vs. AI Agent’s Memory

13 of 61

RNN-Style vs. AI Agent’s Memory

……

Vector,

Matrix, …

Memory

Read

Write

Reflection

14 of 61

LLM

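[Figure: the LLM reads the input tokens <BOS> 大 家 好 ， 我 and outputs 大 家 好 ， 我 是. The sentence "大家好，我是" means "Hello everyone, I am".]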

RNN-style

15 of 61

Self-Attention Style

Soft-max

……

+

x

16 of 61

Self-Attention Style

……
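A minimal sketch of one self-attention step: the current query is compared with every stored key, the scores go through a softmax, and the output is the weighted sum of the values. The shapes and the toy numbers are assumptions for illustration, and the usual scaling by the square root of the dimension is left out.

```python
import numpy as np


def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def attend(q, K, V):
    """q: (d,), K: (t, d), V: (t, d_v). Returns the output and the weights."""
    weights = softmax(K @ q)  # one score per earlier position
    return weights @ V, weights


K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
out, weights = attend(np.zeros(2), K, V)
print(weights)  # equal scores give equal weights
print(out)
```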

17 of 61

The idea of attention has been around for a long time

Neural Turing Machine

https://arxiv.org/abs/1410.5401

Memory Networks

https://arxiv.org/pdf/1410.3916

18 of 61

The idea of attention has been around for a long time

https://arxiv.org/abs/1611.08656

Attention-based Memory Selection Recurrent Network for Language Modeling

Da-Rong Liu

19 of 61

LLM

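[Figure: the LLM reads the input tokens <BOS> 大 家 好 ， 我 and outputs 大 家 好 ， 我 是. The sentence "大家好，我是" means "Hello everyone, I am".]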

20 of 61

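The amount of computation is the same at every step

The longer the input, the more computation each step needs

RNNs cannot remember large amounts of information?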
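A small sketch of the contrast, assuming a d x d hidden state for the RNN-style model and one stored key and value per token for the attention-style model (the dimension and the update rule are hypothetical). It only tracks how much has to be kept around at each step.

```python
import numpy as np

d = 4                      # hypothetical feature dimension
state = np.zeros((d, d))   # RNN-style: one fixed-size hidden state
keys, values = [], []      # self-attention style: everything seen so far

sizes = []
for t in range(1, 6):
    x = np.ones(d)                         # stand-in for the t-th input
    state = 0.5 * state + np.outer(x, x)   # same amount of work every step
    keys.append(x)                         # one more key to compare against
    values.append(x)                       # one more value to sum over
    stored = sum(k.size for k in keys) + sum(v.size for v in values)
    sizes.append((state.size, stored))

print(sizes)  # the first number stays at 16, the second keeps growing
```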

21 of 61

https://arxiv.org/abs/1706.03762

It did not invent attention; it removed everything other than attention

22 of 61

Training a language model (finding the parameters)

Backpropagation

https://youtu.be/ibJpTrp5mcE

https://youtu.be/-yhm3WdGFok?si=2cZOANbtm0Mjd9lT

Computational Graph

23 of 61

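Training a language model (finding the parameters)

Before updating the parameters, the model must first compute its own answer

Language model

……

???

1. Get the current answer

2. Compute the difference

3. Update the parameters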

24 of 61

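Training a language model (finding the parameters)

Suppose we want to teach the model to say "大家好，我是 ……" ("Hello everyone, I am ...")

[Figure: three copies of the language model. Given <BOS> the target is 大; given <BOS> 大 the target is 家; given <BOS> 大家 the target is 好. The model's own answers are shown as ???.]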

25 of 61

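Training a language model (finding the parameters)

Suppose we want to teach the model to say "大家好，我是 ……" ("Hello everyone, I am ...")

[Figure: one language model reads <BOS> 大 家 好 ， 我 in a single pass. The targets are 大 家 好 ， 我 是, and the model's own answers are shown as ???.]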
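The training examples on these slides can be generated mechanically: every prefix of the target sentence becomes an input, and the next token becomes the answer. The helper name `next_token_examples` is hypothetical, and splitting the sentence into single characters is an assumption made to match the slides.

```python
def next_token_examples(tokens, bos="<BOS>"):
    """Return (prefix, next token) pairs for one training sentence."""
    seq = [bos] + list(tokens)
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


examples = next_token_examples("大家好，我是")
for prefix, target in examples:
    print(" ".join(prefix), "->", target)
```

Because the whole sentence is known during training, all of these pairs are available at once instead of being produced one step at a time.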

26 of 61

LLM

[Figure: given the complete input <BOS> 大 家 好 ， 我, the LLM outputs 大 家 好 ， 我 是.]

Given the complete input

27 of 61

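[Figure: self-attention written as matrix operations, with masked entries set to 0 and a softmax applied.]

A computation that GPUs love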
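A minimal sketch of why this computation suits a GPU: with the complete input available, attention for every position is a few matrix products plus a mask, with no loop over time. The function name and shapes are assumptions, and the usual scaling by the square root of the dimension is omitted.

```python
import numpy as np


def causal_self_attention(Q, K, V):
    """Q, K, V: (T, d). Position t may only attend to positions <= t."""
    T = Q.shape[0]
    scores = Q @ K.T                                  # all pairs at once
    mask = np.tril(np.ones((T, T), dtype=bool))       # lower triangle is allowed
    scores = np.where(mask, scores, -np.inf)          # future positions are masked
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)                          # masked entries become 0
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 2))
out, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))  # upper triangle is all zeros
```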

28 of 61

LLM

[Figure: given the complete input <BOS> 大 家 好 ， 我, the LLM outputs 大 家 好 ， 我 是.]

Given the complete input

GPUs hate waiting

29 of 61

Self-attention vs. RNN-style

	Self-attention	RNN
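Inference	Computation and memory requirements grow with sequence length	Computation and memory requirements are fixed
Training	Easy to parallelize	Hard to parallelize (?)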

30 of 61

Source of image: https://www.artfish.ai/p/long-context-llms

RAG and AI agents both need language models that can process very long sequences

Images and audio are even longer sequences than text

31 of 61

Can an RNN be parallelized during training?

……

32 of 61

Can an RNN be parallelized during training?

……

33 of 61

Can an RNN be parallelized during training?

……

34 of 61

Can an RNN be parallelized during training?

……

35 of 61

Can an RNN be parallelized during training?

……

36 of 61

Can an RNN be parallelized during training?

This is just self-attention!

(without the softmax)

It is called Linear Attention
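A sketch of the unrolling argument, under the assumption that the recurrence is H_t = H_(t-1) + v_t k_t^T with output y_t = H_t q_t (this notation is not in the extracted text). Expanding the recurrence gives y_t as a sum over earlier positions of (k_i . q_t) v_i, which is attention with the softmax removed.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 3
q, k, v = rng.normal(size=(3, T, d))

# RNN view: carry a d x d memory H and update it once per step.
H = np.zeros((d, d))
recurrent = []
for t in range(T):
    H = H + np.outer(v[t], k[t])
    recurrent.append(H @ q[t])

# Unrolled view: each output is a weighted sum of all values so far,
# with the raw dot product k_i . q_t as the (un-normalized) weight.
unrolled = [
    sum((k[i] @ q[t]) * v[i] for i in range(t + 1))
    for t in range(T)
]

print(np.allclose(recurrent, unrolled))  # True
```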

37 of 61

RNN

Linear Attention

Linear Attention is self-attention without the softmax

38 of 61

Linear Attention

During training it behaves like self-attention

During inference it behaves like an RNN

Training

Inference

…
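A sketch of the two modes, assuming the memory update H_t = H_(t-1) + v_t k_t^T and the read-out y_t = H_t q_t (assumed notation). The function names are hypothetical. Training can use the masked matrix form over the whole sequence; inference can use the step function with a fixed-size memory.

```python
import numpy as np


def linear_attention_parallel(Q, K, V):
    """Training mode: all positions at once, like self-attention without softmax."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T)))       # position t only sees positions <= t
    return ((Q @ K.T) * mask) @ V


def linear_attention_step(H, q, k, v):
    """Inference mode: one token at a time, like an RNN."""
    H = H + np.outer(v, k)
    return H, H @ q


rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 6, 4))

H = np.zeros((4, 4))
step_outputs = []
for q, k, v in zip(Q, K, V):
    H, y = linear_attention_step(H, q, k, v)
    step_outputs.append(y)

print(np.allclose(linear_attention_parallel(Q, K, V), step_outputs))  # True
```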

39 of 61

Linear Attention

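[Figure: the memory update, new memory = old memory + (information to write into memory) x (where to write it), with a one-hot example (1, 0) for the write location.]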

40 of 61

Linear Attention

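[Figure: reading from the memory. Different information is stored in different columns, and the read vector, for example the one-hot (0, 1, 0), decides how much information to take from each column.]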
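A sketch of the write and read intuition, assuming the memory is a matrix H, a write adds the outer product v k^T, and a read computes H q (assumed notation; the function names are hypothetical). With a one-hot k the value lands in a single column, and q chooses how much of each column to take.

```python
import numpy as np


def write(H, v, k):
    """v: information to write into memory; k: where to write it."""
    return H + np.outer(v, k)


def read(H, q):
    """q: how much information to take from each column."""
    return H @ q


H = np.zeros((3, 3))
H = write(H, np.array([1.0, 2.0, 3.0]), np.array([0.0, 1.0, 0.0]))  # column 1
H = write(H, np.array([4.0, 5.0, 6.0]), np.array([1.0, 0.0, 0.0]))  # column 0

print(H)
print(read(H, np.array([0.0, 1.0, 0.0])))  # [1. 2. 3.]
print(read(H, np.array([0.5, 0.5, 0.0])))  # [2.5 3.5 4.5]
```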

41 of 61

This is not a new idea ……

https://youtu.be/yHoAq1IT_og?si=pSymySFnZqQj51Ik

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

https://arxiv.org/abs/2006.16236

42 of 61

RNN (linear attention) cannot beat the Transformer (self-attention with softmax)?

RNN (Linear Attention)

Transformer (Self-attention with softmax)

Memory too small

Limited memory

Unlimited memory?

……

43 of 61

RNN (linear attention) cannot beat the Transformer (self-attention with softmax)?

RNN (Linear Attention)


44 of 61

Transformer (Self-attention with softmax)

……

45 of 61

Transformer (Self-attention with softmax)

……

+

The memory starts to get scrambled
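One possible reading of "the memory starts to get scrambled", assuming it refers to a fixed-size matrix memory written with outer products v k^T and read with H q (assumed notation). In a 2-dimensional key space only two keys can be orthogonal; a third key overlaps with the others, so reading with an old key also pulls in part of the new value. The keys and values below are made up for the example.

```python
import numpy as np

k1 = np.array([1.0, 0.0])
k2 = np.array([0.0, 1.0])
k3 = np.array([0.6, 0.8])   # cannot be orthogonal to both k1 and k2
v1, v2, v3 = np.eye(3)      # three distinct values

H = np.outer(v1, k1) + np.outer(v2, k2)
clean = H @ k1              # reads back v1 exactly

H = H + np.outer(v3, k3)    # one write too many for this memory
mixed = H @ k1              # v1 plus 0.6 of v3

print(clean)  # [1. 0. 0.]
print(mixed)  # [1.  0.  0.6]
```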

46 of 61

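RNN (linear attention) cannot beat the Transformer (self-attention with softmax)?

[Figure: softmax example. The scores 0.6, 1, 0.4 become the weights 0.30, 0.45, 0.25. The scores 0.5, 1, 0.5, 2, 1 become the weights 0.10, 0.17, 0.10, 0.46, 0.17. Labels: "something very important", "something even more important", "now relatively less important".]

Linear attention's memory never changes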
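The numbers in the softmax example can be reproduced directly. This sketch uses the two score lists from the slide; how the labels map onto individual scores is not recoverable from the text, so the comments only track the score of 1.

```python
import math


def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


before = softmax([0.6, 1, 0.4])
after = softmax([0.5, 1, 0.5, 2, 1])

print([round(w, 2) for w in before])  # [0.3, 0.45, 0.25]
print([round(w, 2) for w in after])   # [0.1, 0.17, 0.1, 0.46, 0.17]
```

The raw score of 1 is the same in both lists, yet its weight drops from about 0.45 to about 0.17 once a score of 2 is present. Without the softmax that weight would stay at 1 in both cases.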

47 of 61

Adding reflection: gradual forgetting

https://arxiv.org/abs/2307.08621

Linear Attention

Retention Network (RetNet)

48 of 61

Adding reflection: gradual forgetting

Training

Inference

……
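A sketch of gradual forgetting in both modes, assuming the RetNet-style update H_t = gamma * H_(t-1) + v_t k_t^T with a constant decay gamma and the read-out y_t = H_t q_t (assumed notation; the function names and gamma = 0.9 are illustrative). In the parallel form the decay becomes a mask whose entry for positions (t, i) is gamma^(t-i).

```python
import numpy as np


def retention_recurrent(Q, K, V, gamma):
    """Inference mode: decay the old memory, then write the new token."""
    H = np.zeros((V.shape[1], K.shape[1]))
    outputs = []
    for q, k, v in zip(Q, K, V):
        H = gamma * H + np.outer(v, k)
        outputs.append(H @ q)
    return np.array(outputs)


def retention_parallel(Q, K, V, gamma):
    """Training mode: the same result from masked matrix products."""
    idx = np.arange(Q.shape[0])
    diff = idx[:, None] - idx[None, :]                 # t - i
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V


rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(3, 6, 4))
same = np.allclose(retention_recurrent(Q, K, V, 0.9),
                   retention_parallel(Q, K, V, 0.9))
print(same)  # True
```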

49 of 61

Adding reflection: forgetting depending on the situation

https://arxiv.org/abs/2405.05254

Retention Network (RetNet)

Gated Retention

50 of 61

Adding reflection: gradual forgetting

Training

Inference

……
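A sketch of forgetting that depends on the input, assuming the update H_t = gamma_t * H_(t-1) + v_t k_t^T where gamma_t is computed from the current token. The gate gamma_t = sigmoid(x_t . w) and the names `w_gamma`, `gated_recurrent`, `gated_parallel` are hypothetical choices for illustration; the exact parameterization in the cited paper may differ. The parallel form needs the product of the gates between positions i and t, computed here through cumulative sums of logs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gated_recurrent(Q, K, V, gammas):
    """Inference mode: a different decay at every step."""
    H = np.zeros((V.shape[1], K.shape[1]))
    outputs = []
    for q, k, v, g in zip(Q, K, V, gammas):
        H = g * H + np.outer(v, k)
        outputs.append(H @ q)
    return np.array(outputs)


def gated_parallel(Q, K, V, gammas):
    """Training mode: D[t, i] = gamma_(i+1) * ... * gamma_t for i <= t."""
    c = np.cumsum(np.log(gammas))
    D = np.tril(np.exp(c[:, None] - c[None, :]))
    return ((Q @ K.T) * D) @ V


rng = np.random.default_rng(3)
T, d = 6, 4
X = rng.normal(size=(T, d))          # stand-in for the token representations
Q, K, V = rng.normal(size=(3, T, d))
w_gamma = rng.normal(size=d)         # hypothetical gate parameters
gammas = sigmoid(X @ w_gamma)        # one decay value in (0, 1) per token

print(np.allclose(gated_recurrent(Q, K, V, gammas),
                  gated_parallel(Q, K, V, gammas)))  # True
```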

51 of 61

More complex reflection

……

Erase

Keep

Weaken
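A sketch of erase, keep, and weaken, assuming the memory is multiplied element by element with a gate matrix G before the new write: H_t = G_t * H_(t-1) + v_t k_t^T. The gate values 0, 1 and 0.5 and the function name are made up for the example; how G_t is produced differs from model to model.

```python
import numpy as np


def gated_update(H, G, v, k):
    """Reflect on the old memory with G, then write the new information."""
    return G * H + np.outer(v, k)


H = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
G = np.array([[0.0, 1.0, 0.5],    # column 0: erase (0)
              [0.0, 1.0, 0.5]])   # column 1: keep (1), column 2: weaken (0.5)

# Write nothing new, so only the effect of the gate is visible.
H_new = gated_update(H, G, np.zeros(2), np.zeros(3))
print(H_new)
```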

52 of 61

https://arxiv.org/abs/2406.06484

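After re-surveying and reorganizing the material, I found that Mamba-1 and Mamba-2 perform at essentially the same level: each has strengths and weaknesses, and each wins on different tasks. I previously thought Mamba-1 might have a higher ceiling, mainly on the theoretical grounds that Mamba-2's restricted expressiveness (scalar SSM) could hurt performance. In other words, to get the best result at a given state dimension, Mamba-1's stronger expressiveness might be an advantage (although no complete study has verified this yet).

In addition, in a distillation paper, distilling knowledge into a Mamba model gave very good results: on MT-Bench and AlpacaEval it appears to do better than distilling into Mamba-2, but on other tasks each again has strengths and weaknesses (https://openreview.net/pdf?id=uAzhODjALU). Most of the literature shows little overall performance gap between Mamba-1 and Mamba-2 (https://arxiv.org/pdf/2408.10189 , https://openreview.net/forum?id=fMbLszVO1H, https://arxiv.org/pdf/2406.07887v1). The comparison in the original paper ( https://arxiv.org/pdf/2405.21060 ) agrees, except that on the multi-query associative recall (MQAR) task Mamba-1 performs extremely poorly or cannot do the task at all (the authors give no specific explanation).

Mamba-2's biggest advantage is probably the 2x to 8x speedup during training. In practice, because it relies mainly on Triton for acceleration, it is more troublesome to install and use than Mamba-1. Still, if it installs successfully, Mamba-2 looks considerably more efficient.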

=====

RWKV-7 is a meta-in-context learner, test-time-training its state on the context via in-context gradient descent at every token.

Titan???

53 of 61

https://arxiv.org/abs/2312.00752

125M to 1.3B

54 of 61

https://arxiv.org/abs/2312.00752

55 of 61

DeltaNet

https://arxiv.org/abs/2406.06484

Titans: Learning to Memorize at Test Time

https://arxiv.org/abs/2501.00663

Gradient Descent: parameter after update = parameter before update - learning rate x gradient
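A sketch of how a memory update can be read as one gradient descent step, in the spirit of DeltaNet. It assumes the memory S should map a key k to a value v, with loss 0.5 * ||S k - v||^2 and learning rate beta; the function name `delta_rule_step` and the toy vectors are hypothetical. The step is algebraically the same as S (I - beta k k^T) + beta v k^T.

```python
import numpy as np


def delta_rule_step(S, k, v, beta):
    """One gradient descent step on L(S) = 0.5 * ||S k - v||^2."""
    gradient = np.outer(S @ k - v, k)
    return S - beta * gradient        # before update - learning rate * gradient


k = np.array([0.0, 1.0, 0.0])
v_old = np.array([1.0, 2.0])
v_new = np.array([3.0, 4.0])

S = np.zeros((2, 3))
S = delta_rule_step(S, k, v_old, beta=1.0)
print(S @ k)  # [1. 2.]
S = delta_rule_step(S, k, v_new, beta=1.0)
print(S @ k)  # [3. 4.]: the old value is replaced, not added to
```

With the purely additive update of linear attention, the second read would return the sum of both values instead.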

56 of 61

https://arxiv.org/abs/2408.01129

Minimax-01

https://arxiv.org/abs/2501.08313

57 of 61

https://arxiv.org/abs/2410.10629

58 of 61

MambaOut: Do We Really Need Mamba for Vision?

https://arxiv.org/abs/2405.07992

59 of 61

Do not train from scratch

Low-rank Linear Conversion via Attention Transfer (LoLCATs), https://arxiv.org/abs/2410.10254

The Mamba in the Llama, https://arxiv.org/abs/2408.15237

Transformers to SSMs, https://arxiv.org/abs/2408.10189

Liger, https://arxiv.org/abs/2503.01496

Self-attention Layer

Mamba or its friends

60 of 61

https://www.isattentionallyouneed.com/

61 of 61

https://www.isattentionallyouneed.com/