1 of 77

Foundation Models

Hamidreza Moaddeli

1

2 of 77

What is a Foundation Model?

2

3 of 77

What is a Foundation Model?

  1. We introduce the term foundation models to fill a void in describing the paradigm shift we are witnessing... Existing terms (e.g., pretrained model, self-supervised model) partially capture the technical dimension of these models, but fail to capture the significance of the paradigm shift in an accessible manner for those beyond machine learning.

  • A machine learning paradigm featuring task-agnostic pre-training and task-specific fine-tuning via neural networks.

3

4 of 77

What is a Foundation Model?

  • Task-agnostic pre-training (with unlabeled/noisy data)
    • Self-supervised learning of data representations
      • Supervision-free pre-training
        • Data scalability (text, images, speech, etc.)
      • Use of auxiliary tasks
        • Masked prediction (of tokens)
        • Contrastive learning
    • Generic representation learning of a data modality (aka a data encoder)
  • Task-specific fine-tuning (with labeled data; see the sketch after this list)
    • Linear probing (training a linear head on top of the frozen representations)
    • Full fine-tuning (training both the linear head and the encoder)
  • Examples:
    • Large language models such as GPT-3 and BLOOM
    • Transformer-based neural networks for different data modalities
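A minimal sketch of the two adaptation modes above (linear probing vs. full fine-tuning), assuming a generic pretrained encoder; the tiny MLP and the toy batch below are stand-ins, not any particular foundation model:

import torch
import torch.nn as nn

# Stand-in for a pretrained data encoder (in practice a large Transformer or CNN).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
head = nn.Linear(256, 10)  # task-specific linear head

# Linear probing: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False
probe_optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Full fine-tuning: unfreeze the encoder and train everything end-to-end.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

# One labeled toy batch; the rest is an ordinary supervised training loop.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(encoder(x)), y)
loss.backward()
finetune_optimizer.step()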

4

Slide Credits : “Foundational Robustness of Foundation Models” , NeurIPS 2022 Tutorial

5 of 77

5

Slide Credits : “Foundational Robustness of Foundation Models” , NeurIPS 2022 Tutorial

6 of 77

What is a Foundation Model?

  • “AI is undergoing a paradigm shift with the rise of models trained on broad data that can be adapted to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.”

6

7 of 77

Emergence & Homogenization

  • Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences.

  • Homogenization indicates the consolidation of methodologies for building AI systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure.

7

8 of 77

Emergence and Homogenization

8

Slide Credits : "Foundation Models" , Samuel Albanie, Online Course 2022

9 of 77

Foundation models - NLP developments

9

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

10 of 77

Foundation models - homogenization

10

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

11 of 77

Ecosystem

11

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

12 of 77

Resource accessibility

12

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

13 of 77

Technology

13

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

14 of 77

Technology

14

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

15 of 77

History of Deep Learning

  • Specialized DL
    • Design specialized model architectures.
    • Leverage task-specific features.
    • Train the specialized models with limited data.
  • Transfer DL
    • Train a model with a large amount of training data.
    • Use the features of the trained model to initialize part of the architecture.
    • Design specialized modules on top of the trained features.
    • Train the partially specialized model with limited data.
  • Foundation Model
    • Train a single huge model on an astronomical amount of data.
    • Prompt the single model for everything.

15

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

16 of 77

Pros and Cons of Specialized DL

  • Pros
    • The architecture design can incorporate task-specific inductive bias
    • The model can be trained effectively with a limited amount of data
    • The model is normally small in size and easy to deploy for applications
  • Cons
    • Each task requires lots of expertise for architecture design
    • Each task requires annotating a specialized dataset
    • The model cannot benefit from other annotated data; it literally has to learn each skill from scratch
    • Hosting many specialized models incurs high costs

16

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

17 of 77

Transfer Learning

  • We can train a model on a massive amount of data to learn a neural representation or to initialize part of the weights.

  • We then transfer to new tasks by adding layers on top of the learned neural representation.

  • This leverages cross-task similarity to enhance model performance across different tasks.

17

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

18 of 77

Pros and Cons of Transfer DL

  • Pros:
    • The model shows much stronger capability than Specialized DL
    • The model can generalize to unseen cases
    • The model requires relatively little fine-tuning data
  • Cons:
    • The model’s performance is still not perfect
    • Fine-tuning is still needed for each downstream task

18

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

19 of 77

Transfer Learning in Word Vector

  • How to represent a word in deep learning (see the sketch below):
    • One-hot vector: a sparse representation whose dimension equals the vocabulary size
    • Word vector: a dense, learned representation with a small dimension
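A minimal sketch of the two representations, using a toy three-word vocabulary and a made-up embedding size:

import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2}        # toy vocabulary
V, d = len(vocab), 4                          # vocabulary size, embedding dimension

# 1-hot vector: sparse, its dimension grows with the vocabulary.
one_hot = torch.zeros(V)
one_hot[vocab["cat"]] = 1.0                   # tensor([0., 1., 0.])

# Word vector: dense, low-dimensional, learned (e.g. by Word2Vec).
embedding = nn.Embedding(V, d)                # a V x d table of trainable vectors
dense = embedding(torch.tensor(vocab["cat"]))
print(one_hot.shape, dense.shape)             # torch.Size([3]) torch.Size([4])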

19

20 of 77

Transfer Learning in Word Vector

  • Word2Vec trained a model like this in 2013

20

21 of 77

Transfer Learning in Vision

21

22 of 77

Transfer learning & Scale

On a technical level, foundation models are enabled by transfer learning and scale. Transfer learning is what makes foundation models possible, but scale is what makes them powerful.

Scale required three ingredients:

  • improvements in computer hardware — e.g., GPU throughput and memory have increased 10× over the last four years;
  • the development of the Transformer model architecture that leverages the parallelism of the hardware to train much more expressive models than before; and
  • the availability of much more training data.

22

Slide Credits :"UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

23 of 77

Parameters of Foundation DL

23

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

24 of 77

24

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

25 of 77

25

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

26 of 77

26

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

27 of 77

Self-Supervised Learning

  • The field of AI has made rapid progress; the crucial fuel is data.
  • Manual annotations for the data are limiting.

Solving the problem of expensive annotations: self-supervision.

27

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

28 of 77

General procedure of self-supervised learning

28

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

29 of 77

Early methods: Context prediction

29

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

30 of 77

Early methods: Context prediction

30

31 of 77

Modern Noise-contrastive self-supervised learning

31

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

32 of 77

Masked Image Modelling (recent development)

32

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

33 of 77

Transformers

  • Transformers were introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017).
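A minimal sketch of the paper's core operation, scaled dot-product attention, for a single head and without the learned query/key/value projections:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to every other token
    return weights @ v

q = k = v = torch.randn(2, 5, 16)             # toy batch: 2 sequences of 5 tokens
out = scaled_dot_product_attention(q, k, v)   # (2, 5, 16)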

33

34 of 77

Language Transformers (Encoder-only)

  • At each stage, the attention layers can access all words in the sentence.
  • Pre-training usually consists of corrupting a given sentence and tasking the model with finding or reconstructing the original sentence.

Well-suited for tasks requiring an understanding of the full sequence:

  • Sentence classification, named entity recognition, extractive question answering.

34

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

35 of 77

Bidirectional Encoder Representations from Transformers (BERT)

BERT uses two self-supervised objectives:

  • masked language modelling, to enable the representation to fuse the left and the right context
  • next sentence prediction, which jointly pre-trains text-pair representations

The pre-trained BERT model can be fine-tuned by adding a classifier layer for many language understanding tasks.
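A minimal sketch of the masked-LM objective with toy token ids (real BERT masks roughly 15% of WordPiece tokens and uses an 80/10/10 replacement scheme, omitted here):

import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 100, 32, 0
tokens = torch.randint(1, vocab_size, (2, 10))             # (batch, seq_len) toy token ids

mask = torch.rand(tokens.shape) < 0.15                     # positions to corrupt
mask[:, 0] = True                                          # make sure the toy batch masks something
inputs = tokens.masked_fill(mask, mask_id)                 # replace them with a [MASK] id

encoder = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2))
lm_head = nn.Linear(d_model, vocab_size)

logits = lm_head(encoder(inputs))                          # bidirectional: every position sees both sides
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # scored only on the masked positions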

35

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

36 of 77

BERT pre-training and fine-tuning

Fine-tuning is straightforward: simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

36

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

37 of 77

BERT

37

38 of 77

BERT Results

38

39 of 77

Language Transformers (Decoder-only)

  • These models are sometimes also called auto-regressive models.
  • Attention layers can only access words positioned before them in the sentence.
  • Pre-training is formulated as predicting the next token in the sequence (sketched below).
  • Best suited for tasks involving text generation.
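A minimal sketch of that objective: a causal mask keeps each position from attending to later tokens, and the loss asks position t to predict token t+1 (toy sizes, not any particular GPT):

import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 12
tokens = torch.randint(0, vocab_size, (2, seq_len))        # toy token ids

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
blocks = nn.TransformerEncoder(layer, num_layers=2)        # "decoder-only" = a causally masked stack
lm_head = nn.Linear(d_model, vocab_size)

causal = nn.Transformer.generate_square_subsequent_mask(seq_len)  # -inf above the diagonal
hidden = blocks(embed(tokens), mask=causal)                # each token only sees earlier tokens
logits = lm_head(hidden)

# Next-token prediction: positions up to t predict the token at t+1.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))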

39

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

40 of 77

GPT-2

40

41 of 77

GPT-3

41

42 of 77

In-context Learning

42

43 of 77

Few-shot Learning (GPT-3)

43

44 of 77

GPT-3 Results

Promising results in the zero- and one-shot settings; in the few-shot setting, sometimes competitive with fine-tuned state-of-the-art models

44

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

45 of 77

GPT-3 Results

GPT-3 can be applied to any downstream task without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
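Concretely, a few-shot prompt is just text; the demonstrations below are adapted from the GPT-3 paper, and generate() stands in for whatever sampling interface the model exposes:

prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# completion = generate(prompt)   # hypothetical call to the language model
# A capable model continues with " fromage", with no gradient updates involved.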

45

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

46 of 77

Language Transformers (Encoder-Decoder)

  • These models are sometimes called sequence-to-sequence models.
  • At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
  • Pre-trained using the objectives of encoder or decoder models, but usually involving something a bit more complex.
  • Best suited for generating new sentences conditioned on a given input, e.g. summarization, translation, or generative question answering.

46

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

47 of 77

T5: Text-to-Text Transfer Transformer

  • Shows that almost all NLP tasks can be cast as a sequence-to-sequence generation task. Thus, an encoder-decoder language model can perform all natural language understanding and generation tasks.
  • The model resembles the original Transformer, with a BERT-style encoder.
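A minimal sketch of the text-to-text framing, with input/output pairs adapted from the T5 paper's examples; a single encoder-decoder model handles all of them, distinguished only by the task prefix:

examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday to survey the damage ...",
     "six people hospitalized after a storm in attala county ..."),
]
for source, target in examples:
    print(f"{source!r}  ->  {target!r}")   # every task is plain text in, plain text out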

47

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

48 of 77

Emergent Ability

  • When the model size grows from 0.1B -> 1.5B -> 175B parameters, the model starts to perform surprisingly well on zero-shot and few-shot tasks
    • These are called “emergent abilities”.

48

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

49 of 77

Emergent Abilities of Large Language Models

  • Why do LLMs work so well? What happens as you scale up?
    • Potential explanation: emergent abilities!
    • An ability is emergent if it is present in larger but not smaller models
    • It could not have been directly predicted by extrapolating from smaller models
    • Performance is near-random until a certain critical scale, then improves sharply
      • Known as a “phase transition”; it would not have been predicted by extrapolation

49

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

50 of 77

Few-Shot Prompting

50

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

51 of 77

Potential Explanations of Emergence

  • There are currently few explanations for why these abilities emerge
  • The evaluation metrics used to measure these abilities may not fully explain why they emerge
  • Disclaimer: maybe emergent abilities of LLMs are a mirage!
    • “Are Emergent Abilities of Large Language Models a Mirage?”, R. Schaeffer et al., https://arxiv.org/abs/2304.15004
    • “Emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale”

51

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

52 of 77

52

53 of 77

53

54 of 77

54

55 of 77

Vision transformer

  • 16x16 patch tokens plus a [CLS] token
  • Follows the standard Transformer encoder architecture and adds a learnable classification token (see the sketch below)
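A minimal sketch of that tokenization step: cut the image into 16x16 patches, linearly project each patch, and prepend the learnable classification token (ViT-Base-like sizes assumed):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                        # (batch, channels, H, W)
patch, d_model = 16, 768

to_patches = nn.Unfold(kernel_size=patch, stride=patch)    # slides a 16x16 window with stride 16
patches = to_patches(image).transpose(1, 2)                # (1, 196, 768): 196 flattened 3x16x16 patches
project = nn.Linear(3 * patch * patch, d_model)
tokens = project(patches)                                  # (1, 196, 768) patch tokens

cls = nn.Parameter(torch.zeros(1, 1, d_model))             # learnable [CLS] token
tokens = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) # (1, 197, 768): [CLS] + patches
# `tokens` then passes through a standard Transformer encoder; the final [CLS]
# representation feeds the classification head.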

55

56 of 77

ViTs outperform ResNets at scale

56

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

57 of 77

Contrastive Language-Image Pre-training (CLIP)

57

58 of 77

CLIP

  • 400M (image, text) pairs collected from various internet sources
  • Image encoder: a modified ResNet or a Vision Transformer (ViT)
    • Chosen based on performance
  • Text encoder: a Transformer with 63M parameters
  • Trained with a contrastive objective that pushes up the cosine similarity of matched image-text embeddings and pushes it down for mismatched pairs (see the sketch below)
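A minimal sketch of that objective on a batch of (image, text) pairs; the random tensors stand in for the outputs of the image and text encoders:

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Cosine similarity = dot product of L2-normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(len(logits))                    # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +             # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2      # text -> image direction

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))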

58

Slide Credits : "CS 8803 VLM Vision-Language Foundation Models", Zsolt Kira, Georgia Tech, Fall 2024

59 of 77

CLIP

  • A linear probe is a simple classifier (logistic regression) trained on pre-trained features with some labeled data
    • Zero-shot CLIP beats logistic regression on ResNet-50 features on 16 of 27 datasets: the power of multimodal training
    • Significance? Robust transfer, with no task-specific data or fine-tuning needed (zero-shot use is sketched below)
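A minimal sketch of how the zero-shot classifier is built: class names become text prompts, and an image is assigned to the most similar prompt (random tensors stand in for the real encoder outputs):

import torch
import torch.nn.functional as F

class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]             # prompt template

text_emb = F.normalize(torch.randn(len(prompts), 512), dim=-1)   # would be text_encoder(prompts)
image_emb = F.normalize(torch.randn(1, 512), dim=-1)             # would be image_encoder(image)

scores = image_emb @ text_emb.t()                                # cosine similarities
prediction = class_names[scores.argmax().item()]                 # no task-specific training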

59

Slide Credits : "CS 8803 VLM Vision-Language Foundation Models", Zsolt Kira, Georgia Tech, Fall 2024

60 of 77

CLIP (Distribution Drift, Few-Shot)

60

61 of 77

Pathology Language and Image Pre-Training (PLIP)

The model is a fine-tuned version of the original CLIP model.

61

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

62 of 77

Creating OpenPath: >200K high-quality Twitter image-text pairs

The largest public dataset of pathology images and discussions.

62

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

63 of 77

PLIP applications

  • Zero-shot classification.
  • Improved representation learning for downstream tasks.
  • Text-to-image retrieval
  • Image-to-image retrieval
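A minimal sketch of the retrieval use case in a CLIP/PLIP-style joint embedding space; the query text and the random embeddings are illustrative stand-ins:

import torch
import torch.nn.functional as F

gallery = F.normalize(torch.randn(1000, 512), dim=-1)   # precomputed image embeddings
query = F.normalize(torch.randn(1, 512), dim=-1)        # would be text_encoder("an example pathology query")

similarity = (query @ gallery.t()).squeeze(0)            # cosine similarity to every image
top5 = similarity.topk(5).indices                        # indices of the best-matching images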

63

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

64 of 77

PLIP can serve as a powerful search engine for medicine

64

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

65 of 77

Natural language-to-code system based on GPT-3 (Codex)

  • Task: code generation with LLMs; given a natural-language prompt (docstring), output the code that implements it
  • Model: Codex, a GPT-3 model fine-tuned on code, with up to 12B parameters
  • Training data: 160GB of Python code
  • Evaluation: HumanEval
    • a novel dataset of 164 programming problems created by the authors; generate k samples from the model and check whether at least one sample passes all the unit tests (the pass@k metric, sketched below)
  • Result: Codex-12B “solves” 72.3% of the problems given 100 samples, while GPT-3 solves 0% and GPT-J 27.7%; using only one sample (the one with lowest perplexity), Codex gets 28.8% and GPT-J 11.6%
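A short sketch of the pass@k metric behind these numbers: generate n samples per problem, count the c that pass all unit tests, and estimate the probability that at least one of k samples would pass (the unbiased estimator from the Codex paper):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn from n, is correct.
    if n - c < k:              # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=30, k=1))   # 0.3: expected success rate with a single sample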

65

Slide Credits : "COS 597G: Understanding Large Language Models" , Danqi Chen, Princeton University, Fall 2022

66 of 77

Codex (Examples)

66

Slide Credits : "COS 597G: Understanding Large Language Models" , Danqi Chen, Princeton University, Fall 2022

67 of 77

Segment Anything Model (SAM):

The first foundation model for promptable segmentation

67

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

68 of 77

SAM

SAM is built with three interconnected components: a task, a model, and a data engine.

68

SAM considers two sets of prompts: sparse (clicks, boxes, text) and dense (masks).

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

69 of 77

SAM

  • A heavyweight image encoder outputs an image embedding.
  • A lightweight prompt encoder efficiently queries the image embedding.
  • A lightweight mask decoder produces object masks and confidence scores.
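A minimal sketch of how the three modules fit together; the class and argument names are placeholders, not the real SAM implementation (which uses an MAE-pretrained ViT image encoder and a transformer-based mask decoder):

import torch.nn as nn

class PromptableSegmenter(nn.Module):
    # Skeleton of SAM-style promptable segmentation.

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder     # heavyweight: run once per image
        self.prompt_encoder = prompt_encoder   # lightweight: run per prompt (clicks, boxes, masks, ...)
        self.mask_decoder = mask_decoder       # lightweight: outputs masks plus confidence scores

    def forward(self, image, prompt):
        image_embedding = self.image_encoder(image)      # can be cached and reused across prompts
        prompt_embedding = self.prompt_encoder(prompt)
        return self.mask_decoder(image_embedding, prompt_embedding)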

69

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

70 of 77

SAM (Zero-Shot Single Point Valid Mask Evaluation)

  • Training dataset: the whole SA-1B dataset
  • Test datasets: 23 diverse segmentation datasets

70

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

71 of 77

Med-Gemini: Multimodal medical models built on Gemini

71

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

72 of 77

72

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

73 of 77

73

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

74 of 77

74

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

75 of 77

75

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

76 of 77

76

Image Credits : Ronnie, King of Zurich

77 of 77

77

Image Credits : Ronnie, King of Zurich