Preliminary Survey on Foundation Language Models
Vyom Pathak
Introduction
Background - Language Representation Learning
Background - Training Frameworks
Background - Pre-Training Tasks
Background - Adaptation to Downstream Tasks
Background - Task Capability
Catalogue
Model | Architecture | Self-Attention | Pre-Training Tasks | Pre-Training Corpus | Parameters (M - Million, B - Billion) | Applications |
RoBERTa | Encoder-Only Transformer | Bi-directional | MLM with dynamic masking (sketched below) | BooksCorpus, English Wikipedia, CC News, OpenWebText, Stories | 125M, 355M | NLU and QA |
DeBERTa | Encoder-Only Transformer | Disentangled attention mechanism | MLM with dynamic masking | BooksCorpus, English Wikipedia, and RealNews | 144M, 350M, 700M | NLU, QA, NLI, and SA |
GPT-2 | Decoder-Only Transformer | Uni-directional (causal) attention | Autoregressive (left-to-right) LM | WebText (web pages linked from Reddit posts with at least 3 karma) | 117M, 355M, 762M | NLG, TS, MT, TC, and fine-tuning for NLU |
Transformer-XL | Decoder-Only Transformer with segment-level recurrence and relative positional encoding | Uni-directional attention with relative positional encodings | Autoregressive (left-to-right) LM | WikiText-103 | 355M | NLG, TS, and fine-tuning for NLU |
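To make the objectives in the table above concrete, the sketch below contrasts RoBERTa-style masked language modeling with dynamic masking (a fresh mask is drawn every time a sequence is seen, rather than being fixed once at preprocessing time) with the GPT-2 / Transformer-XL-style autoregressive objective (predict the next token from the left context). This is a minimal, self-contained Python sketch: the 15% masking rate and the 80/10/10 replacement split follow the BERT/RoBERTa papers, while MASK_ID, VOCAB_SIZE, and the example token ids are placeholders, not any real model's vocabulary.

```python
import random

MASK_ID = 0         # placeholder id for the [MASK] token (not a real vocab id)
VOCAB_SIZE = 50000  # placeholder vocabulary size

def dynamic_mlm_targets(token_ids, mask_prob=0.15):
    """RoBERTa-style dynamic masking: a fresh mask is sampled on every call.

    Returns (inputs, labels). Labels are -100 (the conventional "ignore"
    index) everywhere except masked positions, which keep the original id.
    """
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                      # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels

def causal_lm_targets(token_ids):
    """GPT-2 / Transformer-XL-style objective: predict token t+1 from tokens <= t."""
    return token_ids[:-1], token_ids[1:]

if __name__ == "__main__":
    ids = [101, 2054, 2003, 1037, 3793, 102]   # toy token ids
    print(dynamic_mlm_targets(ids))            # a different mask on every call/epoch
    print(causal_lm_targets(ids))
```

Because a new mask is sampled on every call, each training epoch sees a different corruption of the same sentence; this is the "dynamic" part that distinguishes RoBERTa's setup from BERT's static, preprocessing-time masking.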
Catalogue (continued)
Model | Architecture | Self-Attention | Pre-Training Tasks | Pre-Training Corpus | Parameters (M - Million, B - Billion) | Applications |
BART | Sequence-to-sequence Transformer with a BERT-like bidirectional encoder and a GPT-like autoregressive decoder | Bidirectional self-attention in the encoder, uni-directional self-attention in the decoder | Denoising autoencoder with span corruption (text infilling) | BooksCorpus, English Wikipedia, CC News, OpenWebText, Stories | ~400M (about 10% larger than a comparably sized BERT) | NLG, NLU, TC, and MT |
T5 | Encoder-Decoder Transformer with relative positional encoding; all tasks cast to a text-to-text format | Standard Transformer self-attention (fully visible in the encoder, causal in the decoder) | Denoising autoencoder with span corruption (sketched below) | C4 | 60M, 220M, 770M, 3B, 11B | MT, QA, AS, and TC, all framed as text generation |
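The "denoising autoencoder with span corruption" objective shared by BART and T5 can be illustrated in the same spirit. The sketch below follows the T5 formulation, where dropped spans are replaced by sentinel tokens in the encoder input and the decoder reconstructs only the dropped spans (BART's text infilling is similar, but its decoder regenerates the full original sequence). The 15% corruption rate and mean span length of 3 follow the T5 paper; the whitespace tokenization and the span_corrupt helper are illustrative assumptions, not library code.

```python
import random

def span_corrupt(tokens, corrupt_rate=0.15, mean_span_len=3, seed=None):
    """T5-style span corruption (sketch, not the official implementation).

    Contiguous spans covering roughly corrupt_rate of the tokens are replaced
    by sentinel tokens <extra_id_0>, <extra_id_1>, ... in the encoder input;
    the decoder target lists each sentinel followed by the tokens it replaced.
    """
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * corrupt_rate))  # number of tokens to drop
    inputs, targets = [], []
    i, sentinel, dropped = 0, 0, 0
    while i < len(tokens):
        if dropped < budget and rng.random() < corrupt_rate:
            # Sample a span length around mean_span_len, capped by the remaining budget.
            span_len = min(rng.randint(1, 2 * mean_span_len - 1), budget - dropped)
            sent = f"<extra_id_{sentinel}>"
            inputs.append(sent)
            targets.append(sent)
            targets.extend(tokens[i:i + span_len])
            i += span_len
            dropped += span_len
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # final sentinel closes the target sequence
    return " ".join(inputs), " ".join(targets)

if __name__ == "__main__":
    toy = "Thank you for inviting me to your party last week".split()
    encoder_input, decoder_target = span_corrupt(toy, seed=1)
    print(encoder_input)   # original sentence with spans replaced by sentinels
    print(decoder_target)  # sentinels followed by the dropped spans
```

Emitting the targets as plain text (sentinels plus dropped spans) is what lets T5 treat pre-training and every downstream task through the same text-to-text interface and the same maximum-likelihood decoder.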
Experiment Setup - Datasets
Experiment Setup - Models
Results
Future Work
Conclusion
References