
Big Bird: Transformers for Longer Sequences

Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed

Paper review by Michael A. Alcorn


  1. Introduction
  • Transformer memory requirements scale quadratically with the sequence length when using full attention (see the quick sketch after this list)
  • Strategies to address this
    • Use some other mechanism to select a smaller, relevant subset of contexts
      • Often requires significant engineering
    • Don’t use full attention (i.e., sparsify the attention pattern)
    • Most approaches (except Longformer) are not as “robust” as the original Transformer and lack theoretical guarantees
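
  As a rough illustration of the quadratic scaling, a back-of-the-envelope Python sketch (float32 scores, a single attention head; the numbers are illustrative, not from the paper):

    # Full self-attention materializes one score per (query, key) pair,
    # i.e., an n x n matrix, so memory grows quadratically in n.
    for n in (512, 1024, 4096):
        scores = n * n
        print(f"n = {n:5d}: {scores:>12,d} scores ≈ {scores * 4 / 2**20:6.1f} MiB per head")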


2. BigBird Architecture

  • The self-attention mask matrix encodes a graph where A(i, j) = 1 indicates there’s an edge between nodes i and j
  • Random graphs can approximate complete graphs
  • Want the sparse attention graph to have:
    • Small average path length between nodes
    • A notion of locality
  • Use Erdős–Rényi graph (an edge is added between two nodes with a fixed probability)
    • With Θ(n) edges, the shortest path between nodes is logarithmic in the number of nodes
      • Information can “flow fast” between any pair of nodes
    • Each query attends to r random keys
  • Use a Watts–Strogatz-style graph (add edges between neighboring tokens, exploiting locality in language)
  • Random + local alone not sufficient to compete with BERT
    • Use global tokens (either chosen from the input or added as special-purpose tokens, e.g., [CLS]); a mask-construction sketch follows this list
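
  A minimal NumPy sketch of how the three components (local window, random, global) could be combined into one boolean attention mask; the window width, number of random keys, and number of global tokens are illustrative choices, not the paper's exact configuration:

    import numpy as np

    def bigbird_mask(n, window=3, num_random=2, num_global=2, seed=0):
        """Boolean mask A where A[i, j] = True means query i may attend to key j."""
        rng = np.random.default_rng(seed)
        mask = np.zeros((n, n), dtype=bool)
        # 1) Local (Watts–Strogatz-style) attention: each token attends to
        #    `window` neighbors on either side, plus itself.
        for i in range(n):
            mask[i, max(0, i - window):min(n, i + window + 1)] = True
        # 2) Random (Erdős–Rényi-style) attention: each query attends to
        #    `num_random` keys chosen uniformly at random.
        for i in range(n):
            mask[i, rng.choice(n, size=num_random, replace=False)] = True
        # 3) Global attention: the first `num_global` tokens (e.g., [CLS])
        #    attend to everything and are attended to by everything.
        mask[:num_global, :] = True
        mask[:, :num_global] = True
        return mask

    mask = bigbird_mask(n=16)
    print(f"{mask.sum()} of {16 * 16} attention entries kept")

  Each row keeps only O(window + num_random + num_global) entries instead of n, which is where the linear (rather than quadratic) memory scaling comes from.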


3. Theoretical Results about Sparse Attention Mechanism

  • Universal approximators of sequence-to-sequence functions (similar to Yun et al. (2019) and Yun et al. (2020))
    • Any sparse attention mechanism whose graph contains the star graph (a global token connected to every other token) is a universal approximator (see the sketch after this list)
    • Proof in Appendix A
  • Turing complete
    • Mostly follows Pérez et al. (2019)
    • Proof in Appendix B
  • Limitations
    • Details in Appendix C
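
  For intuition on the star-graph condition, a small sketch (illustrative, not the paper's proof): with a single global token, every pair of positions is connected by a path of length at most two, so information can propagate between any two positions within two attention layers.

    import numpy as np

    n = 8                                  # illustrative sequence length
    star = np.eye(n, dtype=bool)           # every token attends to itself
    star[0, :] = True                      # token 0 is global: it attends to all keys...
    star[:, 0] = True                      # ...and every query attends to it
    # Two-hop reachability: the path i -> 0 -> j connects every pair (i, j).
    two_hop = star.astype(int) @ star.astype(int) > 0
    assert two_hop.all()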


4. Experiments: Natural Language Processing

  • Tasks (details in Appendix E)
    • Masked language modeling (MLM), as in BERT
    • Question answering
      • Natural Questions
      • HotpotQA-distractor
        • Includes semantically similar but irrelevant paragraphs
      • TriviaQA-wiki
      • WikiHop
    • Classification
      • GLUE
    • Summarization
      • arXiv and PubMed
      • BigPatent
      • Evaluated with ROUGE (Recall-Oriented Understudy for Gisting Evaluation); a scoring example follows this list
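
  For reference, a hypothetical way to compute ROUGE with the open-source rouge-score package (pip install rouge-score); this is only an illustration, not necessarily the authors' exact evaluation pipeline:

    from rouge_score import rouge_scorer

    # ROUGE-1/2 measure n-gram overlap; ROUGE-Lsum is the summary-level
    # longest-common-subsequence variant commonly reported for summarization.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)
    reference = "big bird scales transformers to much longer sequences with sparse attention"
    candidate = "big bird uses sparse attention to handle longer sequences"
    for name, score in scorer.score(reference, candidate).items():
        print(name, round(score.fmeasure, 3))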


5. Experiments: Genomics

  • Tasks (details in Appendix F)

D. Implementation Details
