1 of 16

Preliminary Survey on Foundation Language Models

Vyom Pathak

2 of 16

Introduction

  • Learning representations of natural language from data is an open problem
  • Statistical NLP models depend on hand-engineered features to learn representations
  • Neural language models learn dense representations that capture various aspects of text
  • Word embeddings are shallow models that lack contextual information
  • Contextual word embeddings are deep models that can be adapted to various tasks
  • Deep NNs overfit on low-resource datasets
  • Pre-training language models (PTLM) on large text corpora can improve downstream tasks
  • Foundation models, or large language models (LLMs), are the latest trend in NLP, combining pre-training with fine-tuning for the target task
  • We present a preliminary survey on language models to connect all these models
  • Criteria:
    • Differ in how they learn representations from text
    • Vary in size and complexity depending on the data and architecture
    • Address different research questions and diverse NLP tasks
    • Applications in real-world scenarios

3 of 16

Background - Language Representation Learning

  • Language representation learning aims to capture the meaning of text using low-dimensional vectors
  • There are two types of embeddings:
    • Non-contextual embeddings (shallow models)
    • Contextual embeddings (deep models)
  • Non-contextual embeddings are static and have limitations such as handling out-of-vocabulary words
    • Representations are fed directly into a model trained for the target task (encoder-only)
  • Contextual embeddings are dynamic and capture the context of the word
    • Representations are adapted to target tasks depending on training framework, scale, and architecture (encoder-only, decoder-only, encoder-decoder)
  • Contextual embeddings can be further classified into following:
    • Sequential models
    • Non-sequential models
  • Sequential models use convolutional or recurrent networks to capture the local context in order
  • Non-sequential models use tree or graph structures to capture syntactic or semantic relations
  • The self-attention mechanism is a non-sequential approach that learns the connection weights between tokens dynamically (see the sketch below)
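To make the self-attention bullet concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the sequence length, dimensions, and random projection matrices are illustrative assumptions, not taken from any model in this survey.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise "connection weights"
    weights = softmax(scores, axis=-1)       # computed dynamically from token content
    return weights @ V                       # each row mixes context from all tokens

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))      # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # -> (4, 8)

Unlike a recurrent or convolutional layer, no sequence order is imposed here; the attention weights connect every token to every other token directly.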

4 of 16

Background - Training Frameworks

  • Large language models require large amounts of labeled data, which are costly and scarce
  • Pre-training on large unlabeled corpora can improve model initialization, generalization, and regularization
  • Pre-trained learning framework can be used as:
    • Non-contextual models (shallow) such as NNLM, CBOW, Skip-Gram, GloVe (a minimal Skip-Gram sketch follows below)
    • Contextual models (deep) such as LSTMs, ELMo, CoVe, biLM, ULMFiT, and modern LLMs (BERT, GPTs)
  • Pre-training tasks and architectures have evolved, moving representation learning from shallow to deep language models
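As a concrete picture of what a shallow, non-contextual model learns, below is a minimal Skip-Gram sketch in PyTorch; the toy corpus, window size of 1, and full-softmax objective (instead of negative sampling) are simplifying assumptions for illustration only.

import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) training pairs within a window of 1
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

emb_dim = 16
center_emb = nn.Embedding(len(vocab), emb_dim)  # the static word vectors we keep
out_layer = nn.Linear(emb_dim, len(vocab))      # scores every word as a context candidate
optimizer = torch.optim.Adam(
    list(center_emb.parameters()) + list(out_layer.parameters()), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([o for _, o in pairs])
for _ in range(50):                             # tiny training loop on the toy corpus
    optimizer.zero_grad()
    logits = out_layer(center_emb(centers))     # predict context word from center word
    loss = loss_fn(logits, contexts)
    loss.backward()
    optimizer.step()

# One fixed vector per word type, regardless of context (hence "non-contextual")
print(center_emb(torch.tensor(idx["cat"])).detach()[:4])

The lookup table produced here assigns a single vector per word type, which is exactly the limitation that contextual models address.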

5 of 16

Background - Pre-training tasks

  • Pre-training tasks can be categorized into three broad types:
    • Supervised learning - Learning with labeled data
    • Unsupervised learning - Learning with unlabeled data (learning distribution)
    • Self-supervised learning - Learning with unlabeled data by generating labels (masked language modeling)
  • For NLP, datasets of most supervised tasks are not large enough to train good pre-trained models.
  • Unsupervised learning tasks include probabilistic language modeling (LM) and bidirectional LM
  • Masked LM (MLM) and sequence-to-sequence MLM can be considered semi-supervised
  • Enhanced MLM tasks include dynamic masking (sketched below), UniLM-style unified objectives, translation LM, etc.
  • Permuted LM is a self-supervised task that predicts the original sequence from a random permutation of the input
  • Next Sentence Prediction (NSP) is a sentence-pair task that predicts whether the second sentence follows the first in the original text
  • Denoising Autoencoder (DAE) is an encoder-decoder task that recovers the original sequence from a partially corrupted input
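A minimal sketch of the dynamic masking variant of MLM mentioned above (RoBERTa-style), in which a fresh mask is sampled every time a sequence is served rather than fixed once at preprocessing time. The mask ID, vocabulary size, example token IDs, and the 80/10/10 replacement split are illustrative assumptions.

import random

MASK_ID = 0          # hypothetical [MASK] token id
VOCAB_SIZE = 30000   # hypothetical vocabulary size

def dynamic_mask(token_ids, mask_prob=0.15, seed=None):
    """Return (masked_inputs, labels); labels use -100 for positions excluded from the loss."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                            # model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = MASK_ID                    # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

# Dynamic masking: each call (e.g. each epoch) re-masks the same sequence differently
print(dynamic_mask([101, 2023, 2003, 1037, 7953, 102]))
print(dynamic_mask([101, 2023, 2003, 1037, 7953, 102]))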

6 of 16

Background - Adaptation to downstream task

  • Effective adaptation of pre-trained models to downstream tasks is a challenging task
  • Transfer learning is a common adaptation method that uses pre-trained models for different tasks
    • Choosing appropriate pre-training tasks, model architectures, and prospective corpus
    • Involves selecting which layers of the model to use for the downstream task
    • Deciding whether to tune or not to tune the pre-trained models
  • Feature extraction (frozen encoder) - use the pre-trained representations without tuning the model
    • Requires more complex task-specific layers
  • Alternatively, tune the model by performing fine-tuning
  • Fine-tuning methods include:
    • Two-stage tuning - first fine-tune on intermediate (e.g., unlabeled in-domain) task data, then fine-tune on the target task
    • Multi-task fine-tuning - fine-tune on multiple tasks at once
    • Model distillation - transfer the general knowledge into a smaller model with fewer layers, then fine-tune it
    • Gradual unfreezing - freeze most layers, fine-tune the rest, and unfreeze more layers gradually (see the sketch below)
    • Planned sequential unfreezing - Unfreezing groups of layers based on representations
  • Fine-tuning is fragile and requires careful hyper-parameter tuning
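Below is a minimal sketch of gradual unfreezing under assumed attribute names (model.encoder_blocks ordered bottom-to-top and model.classifier as the task head); it illustrates the idea rather than the exact procedure of any specific model.

import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def gradual_unfreeze(model: nn.Module, epoch: int) -> None:
    """Keep the task head trainable and unfreeze the top `epoch` encoder blocks."""
    blocks = list(model.encoder_blocks)          # assumed: ordered bottom -> top
    set_trainable(model, False)                  # freeze everything first
    set_trainable(model.classifier, True)        # assumed task head, always trainable
    for block in blocks[max(len(blocks) - epoch, 0):]:
        set_trainable(block, True)               # unfreeze top-down, one more per epoch

# Usage inside a training loop (train_one_epoch is a placeholder):
# for epoch in range(1, num_epochs + 1):
#     gradual_unfreeze(model, epoch)
#     train_one_epoch(model, train_loader)

Unfreezing from the top layers first follows the intuition that lower layers hold more general features and should be perturbed last.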

7 of 16

Background - Adaptation to downstream task

  • Prompt-based techniques as an alternative tuning method
  • Prompt-based methods can be discrete or continuous
  • Discrete prompts are sequences of words inserted into the input text to help the pre-trained model converge faster (see the cloze example below)
    • Manually or automatically generated
  • Continuous prompts - learnable vectors injected into the input and combined with the word-type embeddings
    • Outperform discrete prompts on relation-oriented tasks
  • Prompt-based methods can achieve similar performance to fine-tuning as the model size scales up
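As a sketch of a manually written discrete prompt, the snippet below casts sentiment analysis as a cloze query for a masked language model via the Hugging Face transformers fill-mask pipeline; the roberta-base checkpoint, the prompt wording, and the verbalizer words are illustrative assumptions, not choices made in this survey.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

review = "The plot was predictable and the acting felt flat."
prompt = f"{review} Overall, the movie was <mask>."   # hand-written discrete prompt

# Map predicted filler words to task labels (a simple "verbalizer")
verbalizer = {"positive": ["great", "good"], "negative": ["terrible", "bad"]}
predictions = {p["token_str"].strip(): p["score"] for p in fill_mask(prompt, top_k=50)}

label_scores = {label: sum(predictions.get(w, 0.0) for w in words)
                for label, words in verbalizer.items()}
print(label_scores)   # the higher-scoring label is the zero-shot prediction

No parameters are updated here; the prompt alone steers the pre-trained model toward the task.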

8 of 16

Background - Task Capability

  • Classical NLP tasks include
    • Question answering - Answering questions based on a given text or knowledge base
      • Can be one-shot or multi-round, extractive or generative, single-hop or multi-hop
    • Sentiment analysis - Detecting the polarity or emotion of a text
      • Use datasets such as SST-2
    • Machine translation - Translating text from one language to another
      • Use datasets such as WMT
    • Information extraction - Extracting structured information from unstructured text
      • Includes tasks such as named entity recognition, relation extraction, event extraction, etc.
    • Summarization - Generating shorter text that preserves the meaning of the longer text
      • Can be extractive or abstractive, single-document or multi-document
  • Large-scale benchmarks are required for evaluation of LLMs on these tasks
    • GLUE
    • SuperGLUE

9 of 16

Catalogue

Parameters are listed in millions (M) or billions (B).

RoBERTa
  • Architecture: Encoder-Only Transformer
  • Self-Attention: Bi-directional
  • Pre-Training Task: MLM with dynamic masking
  • Pre-Training Corpus: BooksCorpus, English Wikipedia, CC News, OpenWebText, Stories
  • Parameters: 125M, 355M
  • Applications: NLU and QA

DeBERTa
  • Architecture: Encoder-Only Transformer
  • Self-Attention: Disentangled attention mechanism
  • Pre-Training Task: MLM with dynamic masking
  • Pre-Training Corpus: BooksCorpus, English Wikipedia, and RealNews
  • Parameters: 144M, 350M, 700M
  • Applications: NLU, QA, NLI, and SA

GPT-2
  • Architecture: Decoder-Only Transformer
  • Self-Attention: Uni-directional attention
  • Pre-Training Task: Autoregressive LM
  • Pre-Training Corpus: WebText (web pages with high Reddit karma scores)
  • Parameters: 117M, 355M, 762M
  • Applications: NLG, TS, MT, TC, and fine-tuned for NLU

Transformer-XL
  • Architecture: Decoder-Only Transformer with segment-level recurrence and relative positional encoding
  • Self-Attention: Uni-directional attention with relative position encoding
  • Pre-Training Task: Autoregressive LM
  • Pre-Training Corpus: WikiText-103
  • Parameters: 355M
  • Applications: NLG, TS, and fine-tuned for NLU

10 of 16

Catalogue

Parameters are listed in millions (M) or billions (B).

BART
  • Architecture: Transformer with a BERT-like encoder and a GPT-like decoder
  • Self-Attention: Bi-directional self-attention in the encoder, uni-directional self-attention in the decoder
  • Pre-Training Task: Denoising autoencoder with span corruption
  • Pre-Training Corpus: BooksCorpus, English Wikipedia, CC News, OpenWebText, Stories
  • Parameters: roughly 10% more than BERT (355M)
  • Applications: NLG and NLU - TC, MT

T5
  • Architecture: Transformer with relative positional encoding and a text-to-text format (see the generation sketch below)
  • Self-Attention: Scaled-up Transformer-style self-attention
  • Pre-Training Task: Denoising autoencoder with span corruption
  • Pre-Training Corpus: C4
  • Parameters: 60M, 220M, 770M, 3B, 11B
  • Applications: NLG-based MT, QA, AS, and TC
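To illustrate the text-to-text format noted in the T5 entry above, here is a minimal generation sketch with the Hugging Face transformers library; the t5-small checkpoint, the task prefixes, and the example inputs are illustrative choices, and the exact outputs should be treated as indicative only.

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def text_to_text(prompt: str) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The same model handles different tasks, selected purely by a textual prefix
print(text_to_text("translate English to German: The house is wonderful."))
print(text_to_text("summarize: Foundation language models are pre-trained on large "
                   "corpora and then adapted to many downstream tasks."))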

11 of 16

Experiment Setup - Datasets

  • BoolQ - A question answering task that requires answering yes/no questions based on a passage from Wikipedia
  • RTE - A natural language inference task that requires determining whether a given premise entails a given hypothesis
  • COPA - A causal reasoning task that requires identifying the cause or effect of a given premise from two choices
  • WIC - A word sense disambiguation task that requires determining whether a polysemous word is used in the same sense in two sentences
  • WSC - A coreference resolution task that requires commonsense reasoning to identify the antecedent of a pronoun in a sentence
  • CB - A textual entailment task that requires classifying the degree of belief of an embedded clause in a short text
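All six tasks above are SuperGLUE subsets; below is a minimal sketch of loading them with the Hugging Face datasets library. The configuration names follow the public super_glue dataset, and exact loading options may vary with the library version.

from datasets import load_dataset

# Configuration names of the six tasks used in the experiments
tasks = ["boolq", "rte", "copa", "wic", "wsc", "cb"]

for task in tasks:
    ds = load_dataset("super_glue", task)            # downloads/caches the task splits
    print(task, {split: len(ds[split]) for split in ds})

# A BoolQ instance pairs a Wikipedia passage with a yes/no question and a binary label
print(load_dataset("super_glue", "boolq")["train"][0])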

12 of 16

Experiment Setup - Models

13 of 16

Results

  • DeBERTa models perform best on BoolQ, WIC, and WSC, which require robust representations for low-resource tasks
  • T5 model performs best on RTE, COPA, and CB, which are entailment tasks that require natural language inference
  • Decoder-only models perform poorly because they lack proper encoder representations
  • Encoder-decoder models perform better than encoder-only and decoder-only models because they can both encode rich representations and generate outputs
  • COPA and WSC are the most difficult tasks because of the low resource dataset and the task difficulty

14 of 16

Future Work

  • Experimenting with more models and datasets, such as BIG-bench
  • Exploring different training frameworks, such as in-context learning, UL2, Megatron, instruction fine-tuning, etc.
  • Studying the emergent abilities, interpretability, and reliability of pre-trained models
  • Regulating the models to avoid ethical and social issues, such as bias, contamination, privacy, etc.
  • Compressing the models to serve a large user base with minimum latency

15 of 16

Conclusion

  • A systematic survey on language models that connects various models based on multiple criteria
  • Examines different types of language models and analyzes their pre-training tasks, training frameworks, adaptation methods, and evaluation benchmarks
  • Performs experiments on SuperGLUE tasks using six representative models, comparing their performance and efficiency
  • Reveals that different models have different strengths and weaknesses depending on the task and the data
  • Proposes some future directions to enhance the robustness and comprehensiveness of this survey

16 of 16

References

  • Xipeng Qiu, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. Pre-trained Models for Natural Language Processing: A Survey. Science China Technological Sciences, 63(10):1872–1897, October 2020. arXiv:2003.08271 [cs].
  • Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. A Survey for In-context Learning, December 2022. arXiv:2301.00234 [cs].
  • Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach, July 2019. arXiv:1907.11692 [cs].
  • Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. DeBERTa: Decoding-enhanced BERT with Disentangled Attention, October 2021. arXiv:2006.03654 [cs].
  • Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context, June 2019. arXiv:1901.02860 [cs, stat].
  • Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, October 2019. arXiv:1910.13461 [cs, stat].
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent Abilities of Large Language Models, October 2022. arXiv:2206.07682 [cs].