
Long Range Language Modeling via Gated State Spaces
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur

[1] Vaswani et al., NIPS 2017.

[2] Tay et al., ICLR 2021.

[3] Hua et al., ICML 2022.

[4] Gu et al., ICLR 2022.

[5] Hutchins et al., NeurIPS 2022.

[6] Gupta et al., NeurIPS 2022.

Motivation

  • Attention: Ω(L²) compute/memory for input size L
  • Efficient X-formers:
    ○ performance ~ Transformer
    ○ compute/memory efficient
    ○ perform poorly on long-range benchmarks (Long Range Arena, SCROLLS)
State Spaces (S4, DSS)

  • large gains over X-formers / local convnets on text/image/video classification, LM, audio generation, time series forecasting, ...
  • O(L log L) compute/memory
  • +20 accuracy on Long Range Arena vs X-formers
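The S4/DSS layer itself is not spelled out on the poster. As a rough, single-channel illustration of the idea behind [4, 6], the sketch below materializes a diagonal state space model as a length-L convolution kernel and applies it with FFTs, which is where the O(L log L) compute comes from. The names (`dss_kernel`, `dss_apply`, `Lambda`, `w`) and the simplified kernel formula are assumptions made for illustration, not the authors' exact parameterization.

```python
import jax
import jax.numpy as jnp

def dss_kernel(Lambda, w, length):
    """Materialize the length-`length` convolution kernel of a diagonal SSM.

    Lambda: (N,) complex diagonal of the state matrix (assumed stable, Re < 0)
    w:      (N,) complex weights absorbing the SSM's input/output projections
    Simplified kernel for illustration: K[l] = sum_n w[n] * exp(Lambda[n] * l)
    """
    pos = jnp.arange(length)                                       # (L,)
    K = (w[:, None] * jnp.exp(Lambda[:, None] * pos[None, :])).sum(0)
    return K.real                                                  # (L,)

def dss_apply(u, Lambda, w):
    """Causal convolution of a single channel u (L,) with the SSM kernel via FFT."""
    L = u.shape[0]
    K = dss_kernel(Lambda, w, L)
    # zero-pad to 2L so circular convolution equals causal linear convolution
    U_f = jnp.fft.rfft(u, n=2 * L)
    K_f = jnp.fft.rfft(K, n=2 * L)
    return jnp.fft.irfft(U_f * K_f, n=2 * L)[:L]

# toy usage: 8 diagonal states, one channel, sequence length 1024
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
Lambda = -0.5 + 1j * jax.random.normal(k1, (8,))   # stable complex poles
w = jax.random.normal(k2, (8,)).astype(jnp.complex64)
u = jax.random.normal(k3, (1024,))
y = dss_apply(u, Lambda, w)                        # (1024,) contextualized sequence
```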

Drawbacks of S4 / DSS

  ❌ contextualize in high-dimensional space
  ❌ attention offers content-dependent modeling of local dependencies, which S4/DSS lack

Gated State Space (GSS)

  ○ down-project by 4x → contextualize in low-dimensional space via Diagonal State Space* (see the sketch after this list)
  ○ inspired by Gated Attention Unit

  • gating maintains performance on LM
  • 2-3x speedup over vanilla DSS on TPUs
  • random initialization works well for LM (loss per token of a large text corpus)
  • 0-shot generalization to longer contexts
  • more compute-efficient vs X-formers
  • GSS-Hybrid: + attention over non-overlapping chunks of size 512
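The poster does not spell out the block, so the following is a minimal jax.numpy sketch of how the bullets above could compose: GAU-style gate and value branches, a ~4x down-projection so the DSS convolution runs in a low-dimensional space, and multiplicative gating before the output projection. Every name (`gss_block`, `W_gate`, `d_low`, ...) and every dimension is an assumption for illustration, not the released implementation; `dss_apply` is restated from the sketch above for self-containment.

```python
import jax
import jax.numpy as jnp

def dss_apply(u, Lambda, w):
    """Single-channel diagonal-SSM convolution via FFT (same sketch as above)."""
    L = u.shape[0]
    K = (w[:, None] * jnp.exp(Lambda[:, None] * jnp.arange(L)[None, :])).sum(0).real
    return jnp.fft.irfft(jnp.fft.rfft(u, n=2 * L) * jnp.fft.rfft(K, n=2 * L), n=2 * L)[:L]

def gss_block(x, p):
    """One GSS-style block (illustrative shapes; parameter names are assumptions).

    x: (L, d_model). p is a dict of weights:
      W_gate (d_model, d_ff), W_val (d_model, d_ff)  - GAU-style gate / value branches
      W_down (d_ff, d_low)                           - ~4x down-projection before the SSM
      Lambda, w  (d_low, N)                          - per-channel diagonal SSM parameters
      W_up   (d_low, d_ff), W_out (d_ff, d_model)    - project back up, then out
    """
    gate = jax.nn.gelu(x @ p['W_gate'])                              # (L, d_ff)
    val  = jax.nn.gelu(x @ p['W_val'])                               # (L, d_ff)
    low  = val @ p['W_down']                                         # (L, d_low)
    # contextualize each low-dimensional channel with its own DSS kernel
    ctx  = jax.vmap(dss_apply, in_axes=(1, 0, 0), out_axes=1)(low, p['Lambda'], p['w'])
    ctx  = ctx @ p['W_up']                                           # (L, d_ff)
    return x + (gate * ctx) @ p['W_out']                             # gate, project, residual

# toy usage with assumed sizes: d_model=16, d_ff=32, d_low=8, N=4 states, L=256
ks = jax.random.split(jax.random.PRNGKey(0), 8)
d_model, d_ff, d_low, N, L = 16, 32, 8, 4, 256
p = {
    'W_gate': 0.1 * jax.random.normal(ks[0], (d_model, d_ff)),
    'W_val':  0.1 * jax.random.normal(ks[1], (d_model, d_ff)),
    'W_down': 0.1 * jax.random.normal(ks[2], (d_ff, d_low)),
    'Lambda': -0.5 + 1j * jax.random.normal(ks[3], (d_low, N)),
    'w':      jax.random.normal(ks[4], (d_low, N)).astype(jnp.complex64),
    'W_up':   0.1 * jax.random.normal(ks[5], (d_low, d_ff)),
    'W_out':  0.1 * jax.random.normal(ks[6], (d_ff, d_model)),
}
x = jax.random.normal(ks[7], (L, d_model))
out = gss_block(x, p)          # (L, d_model)
```

The GSS-Hybrid results above would additionally interleave attention over non-overlapping 512-token chunks with blocks of this kind; the smaller state-space width from the 4x down-projection is presumably what drives the reported 2-3x TPU speedup over vanilla DSS.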

GSS vs best baselines from [3, 5]. Training length 4096

GSS vs scaled-up models on PG19 (word ppl)