
Long Range Language Modeling via Gated State Spaces
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur

[1] Vaswani et al., NIPS 2017.

[2] Tay et al., ICLR 2021.

[3] Hua et al., ICML 2022.

[4] Gu et al., ICLR 2022.

[5] Hutchins et al., NeurIPS 2022.

[6] Gupta et al., NeurIPS 2022.

Motivation

  • Attention: Ω(L²) compute/memory for input size L
  • Efficient X-formers:
    ○ performance ~ Transformer
    ○ compute/memory efficient
    ○ perform poorly on long-range benchmarks (Long Range Arena, SCROLLS)
State Spaces (S4, DSS)

  • large gains over X-formers / local convnets on text/image/video classification, LM, audio generation, time series forecasting, ...
  • O(L log L) compute/memory
  • +20 accuracy on Long Range Arena vs X-formers
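The S4/DSS layer itself is not spelled out on the poster. As a rough, single-channel illustration of the idea behind [4, 6], the sketch below materializes a diagonal state space model as a length-L convolution kernel and applies it with FFTs, which is where the O(L log L) compute comes from. The names (`dss_kernel`, `dss_apply`, `Lambda`, `w`) and the simplified kernel formula are assumptions made for illustration, not the authors' exact parameterization.

```python
import jax
import jax.numpy as jnp

def dss_kernel(Lambda, w, length):
    """Materialize the length-`length` convolution kernel of a diagonal SSM.

    Lambda: (N,) complex diagonal of the state matrix (assumed stable, Re < 0)
    w:      (N,) complex weights absorbing the SSM's input/output projections
    Simplified kernel for illustration: K[l] = sum_n w[n] * exp(Lambda[n] * l)
    """
    pos = jnp.arange(length)                                       # (L,)
    K = (w[:, None] * jnp.exp(Lambda[:, None] * pos[None, :])).sum(0)
    return K.real                                                  # (L,)

def dss_apply(u, Lambda, w):
    """Causal convolution of a single channel u (L,) with the SSM kernel via FFT."""
    L = u.shape[0]
    K = dss_kernel(Lambda, w, L)
    # zero-pad to 2L so circular convolution equals causal linear convolution
    U_f = jnp.fft.rfft(u, n=2 * L)
    K_f = jnp.fft.rfft(K, n=2 * L)
    return jnp.fft.irfft(U_f * K_f, n=2 * L)[:L]

# toy usage: 8 diagonal states, one channel, sequence length 1024
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
Lambda = -0.5 + 1j * jax.random.normal(k1, (8,))   # stable complex poles
w = jax.random.normal(k2, (8,)).astype(jnp.complex64)
u = jax.random.normal(k3, (1024,))
y = dss_apply(u, Lambda, w)                        # (1024,) contextualized sequence
```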

Drawbacks of S4 / DSS

  ❌ contextualize in high-dimensional space
  ❌ attention offers content-dependent modeling of local dependencies, which S4/DSS lack

Gated State Space (GSS)

  ○ down-project by 4x → contextualize in low-dimensional space via Diagonal State Space* (see the sketch after this list)
  ○ inspired by Gated Attention Unit

  • gating maintains performance on LM
  • 2-3x speedup over vanilla DSS on TPUs
  • random initialization works well for LM (loss per token of a large text corpus)
  • 0-shot generalization to longer contexts
  • more compute-efficient vs X-formers
  • GSS-Hybrid: + attention over non-overlapping chunks of size 512
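The poster does not spell out the block, so the following is a minimal jax.numpy sketch of how the bullets above could compose: GAU-style gate and value branches, a ~4x down-projection so the DSS convolution runs in a low-dimensional space, and multiplicative gating before the output projection. Every name (`gss_block`, `W_gate`, `d_low`, ...) and every dimension is an assumption for illustration, not the released implementation; `dss_apply` is restated from the sketch above for self-containment.

```python
import jax
import jax.numpy as jnp

def dss_apply(u, Lambda, w):
    """Single-channel diagonal-SSM convolution via FFT (same sketch as above)."""
    L = u.shape[0]
    K = (w[:, None] * jnp.exp(Lambda[:, None] * jnp.arange(L)[None, :])).sum(0).real
    return jnp.fft.irfft(jnp.fft.rfft(u, n=2 * L) * jnp.fft.rfft(K, n=2 * L), n=2 * L)[:L]

def gss_block(x, p):
    """One GSS-style block (illustrative shapes; parameter names are assumptions).

    x: (L, d_model). p is a dict of weights:
      W_gate (d_model, d_ff), W_val (d_model, d_ff)  - GAU-style gate / value branches
      W_down (d_ff, d_low)                           - ~4x down-projection before the SSM
      Lambda, w  (d_low, N)                          - per-channel diagonal SSM parameters
      W_up   (d_low, d_ff), W_out (d_ff, d_model)    - project back up, then out
    """
    gate = jax.nn.gelu(x @ p['W_gate'])                              # (L, d_ff)
    val  = jax.nn.gelu(x @ p['W_val'])                               # (L, d_ff)
    low  = val @ p['W_down']                                         # (L, d_low)
    # contextualize each low-dimensional channel with its own DSS kernel
    ctx  = jax.vmap(dss_apply, in_axes=(1, 0, 0), out_axes=1)(low, p['Lambda'], p['w'])
    ctx  = ctx @ p['W_up']                                           # (L, d_ff)
    return x + (gate * ctx) @ p['W_out']                             # gate, project, residual

# toy usage with assumed sizes: d_model=16, d_ff=32, d_low=8, N=4 states, L=256
ks = jax.random.split(jax.random.PRNGKey(0), 8)
d_model, d_ff, d_low, N, L = 16, 32, 8, 4, 256
p = {
    'W_gate': 0.1 * jax.random.normal(ks[0], (d_model, d_ff)),
    'W_val':  0.1 * jax.random.normal(ks[1], (d_model, d_ff)),
    'W_down': 0.1 * jax.random.normal(ks[2], (d_ff, d_low)),
    'Lambda': -0.5 + 1j * jax.random.normal(ks[3], (d_low, N)),
    'w':      jax.random.normal(ks[4], (d_low, N)).astype(jnp.complex64),
    'W_up':   0.1 * jax.random.normal(ks[5], (d_low, d_ff)),
    'W_out':  0.1 * jax.random.normal(ks[6], (d_ff, d_model)),
}
x = jax.random.normal(ks[7], (L, d_model))
out = gss_block(x, p)          # (L, d_model)
```

The GSS-Hybrid results above would additionally interleave attention over non-overlapping 512-token chunks with blocks of this kind; the smaller state-space width from the 4x down-projection is presumably what drives the reported 2-3x TPU speedup over vanilla DSS.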

GSS vs best baselines from [3, 5]. Training length 4096

GSS vs scaled-up models on PG19 (word ppl)