
Big Bird: Transformers for Longer Sequences

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed

Paper review by Michael A. Alcorn


  1. Introduction
  • Transformer memory requirements scale quadratically with the sequence length when using full attention (see the quick sketch after this list)
  • Strategies to address this
    • Use some other mechanism to select a smaller, relevant subset of contexts
      • Often requires significant engineering
    • Don’t use full attention (i.e., sparsify the attention pattern)
    • Most approaches (except Longformer) are not as “robust” as the original Transformer and lack theoretical guarantees
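
  As a rough illustration of the quadratic scaling, a back-of-the-envelope Python sketch (float32 scores, a single attention head; the numbers are illustrative, not from the paper):

    # Full self-attention materializes one score per (query, key) pair,
    # i.e., an n x n matrix, so memory grows quadratically in n.
    for n in (512, 1024, 4096):
        scores = n * n
        print(f"n = {n:5d}: {scores:>12,d} scores ≈ {scores * 4 / 2**20:6.1f} MiB per head")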


2. BigBird Architecture

  • The self-attention mask matrix encodes a graph where A(i, j) = 1 indicates there’s an edge between nodes i and j
  • Random graphs can approximate complete graphs
  • Want the sparse attention graph to have:
    • Small average path length between nodes
    • A notion of locality
  • Use Erdős–Rényi graph (an edge is added between two nodes with a fixed probability)
    • With Θ(n) edges, the shortest path between nodes is logarithmic in the number of nodes
      • Information can “flow fast” between any pair of nodes
    • Each query attends to r random keys
  • Use a Watts–Strogatz-style graph (add edges between neighboring tokens, exploiting locality in language)
  • Random + local alone not sufficient to compete with BERT
    • Use global tokens (either chosen from the input or added as special-purpose tokens, e.g., [CLS]); a mask-construction sketch follows this list
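
  A minimal NumPy sketch of how the three components (local window, random, global) could be combined into one boolean attention mask; the window width, number of random keys, and number of global tokens are illustrative choices, not the paper's exact configuration:

    import numpy as np

    def bigbird_mask(n, window=3, num_random=2, num_global=2, seed=0):
        """Boolean mask A where A[i, j] = True means query i may attend to key j."""
        rng = np.random.default_rng(seed)
        mask = np.zeros((n, n), dtype=bool)
        # 1) Local (Watts–Strogatz-style) attention: each token attends to
        #    `window` neighbors on either side, plus itself.
        for i in range(n):
            mask[i, max(0, i - window):min(n, i + window + 1)] = True
        # 2) Random (Erdős–Rényi-style) attention: each query attends to
        #    `num_random` keys chosen uniformly at random.
        for i in range(n):
            mask[i, rng.choice(n, size=num_random, replace=False)] = True
        # 3) Global attention: the first `num_global` tokens (e.g., [CLS])
        #    attend to everything and are attended to by everything.
        mask[:num_global, :] = True
        mask[:, :num_global] = True
        return mask

    mask = bigbird_mask(n=16)
    print(f"{mask.sum()} of {16 * 16} attention entries kept")

  Each row keeps only O(window + num_random + num_global) entries instead of n, which is where the linear (rather than quadratic) memory scaling comes from.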


3. Theoretical Results about Sparse Attention Mechanism

  • Universal approximators of sequence-to-sequence functions (similar to Yun et al. (2019) and Yun et al. (2020))
    • Any sparse attention mechanism whose graph contains the star graph (a global token connected to every other token) is a universal approximator (see the sketch after this list)
    • Proof in Appendix A
  • Turing complete
    • Mostly follows Pérez et al. (2019)
    • Proof in Appendix B
  • Limitations
    • Details in Appendix C
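
  For intuition on the star-graph condition, a small sketch (illustrative, not the paper's proof): with a single global token, every pair of positions is connected by a path of length at most two, so information can propagate between any two positions within two attention layers.

    import numpy as np

    n = 8                                  # illustrative sequence length
    star = np.eye(n, dtype=bool)           # every token attends to itself
    star[0, :] = True                      # token 0 is global: it attends to all keys...
    star[:, 0] = True                      # ...and every query attends to it
    # Two-hop reachability: the path i -> 0 -> j connects every pair (i, j).
    two_hop = star.astype(int) @ star.astype(int) > 0
    assert two_hop.all()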


4. Experiments: Natural Language Processing

  • Tasks (details in Appendix E)
    • Masked language modeling (MLM), as in BERT
    • Question answering
      • Natural Questions
      • HotpotQA-distractor
        • Includes semantically similar but irrelevant paragraphs
      • TriviaQA-wiki
      • WikiHop
    • Classification
      • GLUE
    • Summarization
      • arXiv and PubMed
      • BigPatent
      • Evaluated with ROUGE (Recall-Oriented Understudy for Gisting Evaluation); a scoring example follows this list
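
  For reference, a hypothetical way to compute ROUGE with the open-source rouge-score package (pip install rouge-score); this is only an illustration, not necessarily the authors' exact evaluation pipeline:

    from rouge_score import rouge_scorer

    # ROUGE-1/2 measure n-gram overlap; ROUGE-Lsum is the summary-level
    # longest-common-subsequence variant commonly reported for summarization.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
    reference = "big bird scales transformers to much longer sequences with sparse attention"
    candidate = "big bird uses sparse attention to handle longer sequences"
    for name, score in scorer.score(reference, candidate).items():
        print(name, round(score.fmeasure, 3))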


5. Experiments: Genomics

  • Tasks (details in Appendix F)

D. Implementation Details
