1 of 77

Foundation Models

Hamidreza Moaddeli

1

2 of 77

What is a Foundation Model?

2

3 of 77

What is a Foundation Model?

  1. We introduce the term foundation models to fill a void in describing the paradigm shift we are witnessing... Existing terms (e.g., pretrained model, self-supervised model) partially capture the technical dimension of these models, but fail to capture the significance of the paradigm shift in an accessible manner for those beyond machine learning.

  • A machine learning paradigm featuring task-agnostic pre-training and task-specific fine-tuning via neural networks.

3

4 of 77

What is a Foundation Model?

  • Task-agnostic pre-training (with unlabeled/noisy data)
    • Self-supervised learning of data representations
      • Supervision-free pre-training
        • Data scalability (text, images, speech, etc.)
      • Use of auxiliary tasks
        • Masked prediction (of tokens)
        • Contrastive learning
    • Generic representation learning of a data modality (aka a data encoder)
  • Task-specific fine-tuning (with labeled data; see the sketch after this list)
    • Linear probing (training a linear head on top of the frozen representations)
    • Full fine-tuning (training both the linear head and the encoder)
  • Examples:
    • Large language models such as GPT-3 and BLOOM
    • Transformer-based neural networks for different data modalities
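A minimal sketch of the two adaptation modes above (linear probing vs. full fine-tuning), assuming a generic pretrained encoder; the tiny MLP and the toy batch below are stand-ins, not any particular foundation model:

import torch
import torch.nn as nn

# Stand-in for a pretrained data encoder (in practice a large Transformer or CNN).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
head = nn.Linear(256, 10)  # task-specific linear head

# Linear probing: freeze the encoder and train only the head.
for p in encoder.parameters():
    p.requires_grad = False
probe_optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

# Full fine-tuning: unfreeze the encoder and train everything end-to-end.
for p in encoder.parameters():
    p.requires_grad = True
finetune_optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

# One labeled toy batch; the rest is an ordinary supervised training loop.
x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(head(encoder(x)), y)
loss.backward()
finetune_optimizer.step()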

4

Slide Credits : “Foundational Robustness of Foundation Models” , NeurIPS 2022 Tutorial

5 of 77

5

Slide Credits : “Foundational Robustness of Foundation Models” , NeurIPS 2022 Tutorial

6 of 77

What is a Foundation Model?

  • “AI is undergoing a paradigm shift with the rise of models trained on broad data that can be adapted to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character.”

6

7 of 77

Emergence & Homogenization

  • Emergence means that the behavior of a system is implicitly induced rather than explicitly constructed; it is both the source of scientific excitement and anxiety about unanticipated consequences.

  • Homogenization indicates the consolidation of methodologies for building AI systems across a wide range of applications; it provides strong leverage towards many tasks but also creates single points of failure.

7

8 of 77

Emergence and Homogenization

8

Slide Credits : "Foundation Models" , Samuel Albanie, Online Course 2022

9 of 77

Foundation models - NLP developments

9

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

10 of 77

Foundation models - homogenization

10

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

11 of 77

Ecosystem

11

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

12 of 77

Resource accessibility

12

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

13 of 77

Technology

13

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

14 of 77

Technology

14

Slide Credits : “"Foundation Models" , Samuel Albanie, Online Course 2022

15 of 77

History of Deep Learning

  • Specialized DL
    • Design specialized model architectures.
    • Leverage task-specific features.
    • Train the specialized models with limited data.
  • Transfer DL
    • Train a model with a large amount of training data.
    • Use the features of the trained model to initialize part of the architecture.
    • Design specialized modules on top of the trained features.
    • Train the partially specialized model with limited data.
  • Foundation Model
    • Train a single huge model on an astronomical amount of data.
    • Prompt the single model for everything.

15

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

16 of 77

Pros and Cons of Specialized DL

  • Pros
    • The architecture design can incorporate task-specific inductive bias
    • The model can be trained effectively with a limited amount of data
    • The model is normally small in size and easy to deploy for applications
  • Cons
    • Each task requires lots of expertise for architecture design
    • Each task requires annotating a specialized dataset
    • The model cannot benefit from other annotated data; it literally has to learn each skill from scratch
    • Hosting many specialized models incurs high costs

16

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

17 of 77

Transfer Learning

  • We can train a model on a massive amount of data to learn a neural representation or to initialize part of the weights.

  • We then transfer to new tasks by adding layers on top of the learned neural representation.

  • This leverages cross-task similarity to enhance model performance across different tasks.

17

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

18 of 77

Pros and Cons of Transfer DL

  • Pros:
    • The model shows much stronger capability than Specialized DL
    • The model can generalize to unseen cases
    • The model requires relatively little fine-tuning data
  • Cons:
    • The model’s performance is still not perfect
    • Fine-tuning is still needed for each downstream task

18

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

19 of 77

Transfer Learning in Word Vector

  • How to represent a word in deep learning (see the sketch below):
    • One-hot vector: a sparse representation whose dimension equals the vocabulary size
    • Word vector: a dense, learned representation with a small dimension
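A minimal sketch of the two representations, using a toy three-word vocabulary and a made-up embedding size:

import torch
import torch.nn as nn

vocab = {"the": 0, "cat": 1, "sat": 2}        # toy vocabulary
V, d = len(vocab), 4                          # vocabulary size, embedding dimension

# 1-hot vector: sparse, its dimension grows with the vocabulary.
one_hot = torch.zeros(V)
one_hot[vocab["cat"]] = 1.0                   # tensor([0., 1., 0.])

# Word vector: dense, low-dimensional, learned (e.g. by Word2Vec).
embedding = nn.Embedding(V, d)                # a V x d table of trainable vectors
dense = embedding(torch.tensor(vocab["cat"]))
print(one_hot.shape, dense.shape)             # torch.Size([3]) torch.Size([4])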

19

20 of 77

Transfer Learning in Word Vector

  • Word2Vec trained a model like this in 2013

20

21 of 77

Transfer Learning in Vision

21

22 of 77

Transfer learning & Scale

On a technical level, foundation models are enabled by transfer learning and scale. Transfer learning is what makes foundation models possible, but scale is what makes them powerful.

Scale required three ingredients:

  • improvements in computer hardware — e.g., GPU throughput and memory have increased 10× over the last four years;
  • the development of the Transformer model architecture that leverages the parallelism of the hardware to train much more expressive models than before; and
  • the availability of much more training data.

22

Slide Credits :"UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

23 of 77

Parameters of Foundation DL

23

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

24 of 77

24

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

25 of 77

25

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

26 of 77

26

Image Source : "What are Foundation Models in AI?", https://www.youtube.com/watch?v=dV0X1QyLL8M

27 of 77

Self-Supervised Learning

  • The field of AI has made rapid progress; the crucial fuel is data.
  • Manual annotations for the data are limiting.

Solving the problem of expensive annotations: self-supervision.

27

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

28 of 77

General procedure of self-supervised learning

28

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

29 of 77

Early methods: Context prediction

29

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

30 of 77

Early methods: Context prediction

30

31 of 77

Modern Noise-contrastive self-supervised learning

31

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

32 of 77

Masked Image Modelling (recent development)

32

Slide Credits : "UVA Deep Learning Course", Yuki Asano , Fall 2022

33 of 77

Transformers

  • Transformers were introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017).
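A minimal sketch of the paper's core operation, scaled dot-product attention, for a single head and without the learned query/key/value projections:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to every other token
    return weights @ v

q = k = v = torch.randn(2, 5, 16)             # toy batch: 2 sequences of 5 tokens
out = scaled_dot_product_attention(q, k, v)   # (2, 5, 16)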

33

34 of 77

Language Transformers (Encoder-only)

  • At each stage, the attention layers can access all words in the sentence.
  • Pre-training usually consists of corrupting a given sentence and tasking the model with finding or reconstructing the original sentence.

Well-suited for tasks requiring an understanding of the full sequence:

  • Sentence classification, named entity recognition, extractive question answering.

34

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

35 of 77

Bidirectional Encoder Representations from Transformers (BERT)

BERT uses two self-supervised objectives:

  • masked language modelling, to enable the representation to fuse the left and the right context
  • next sentence prediction, which jointly pre-trains text-pair representations

The pre-trained BERT model can be fine-tuned by adding a classifier layer for many language understanding tasks.
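A minimal sketch of the masked-LM objective with toy token ids (real BERT masks roughly 15% of WordPiece tokens and uses an 80/10/10 replacement scheme, omitted here):

import torch
import torch.nn as nn

vocab_size, d_model, mask_id = 100, 32, 0
tokens = torch.randint(1, vocab_size, (2, 10))             # (batch, seq_len) toy token ids

mask = torch.rand(tokens.shape) < 0.15                     # positions to corrupt
mask[:, 0] = True                                          # make sure the toy batch masks something
inputs = tokens.masked_fill(mask, mask_id)                 # replace them with a [MASK] id

encoder = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2))
lm_head = nn.Linear(d_model, vocab_size)

logits = lm_head(encoder(inputs))                          # bidirectional: every position sees both sides
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])  # scored only on the masked positions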

35

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

36 of 77

BERT pre-training and fine-tuning

Fine-tuning is straightforward: simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

36

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

37 of 77

BERT

37

38 of 77

BERT Results

38

39 of 77

Language Transformers (Decoder-only)

  • These models are sometimes also called auto-regressive models.
  • Attention layers can only access words positioned before them in the sentence.
  • Pre-training is formulated as predicting the next token in the sequence (sketched below).
  • Best suited for tasks involving text generation.
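A minimal sketch of that objective: a causal mask keeps each position from attending to later tokens, and the loss asks position t to predict token t+1 (toy sizes, not any particular GPT):

import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 100, 32, 12
tokens = torch.randint(0, vocab_size, (2, seq_len))        # toy token ids

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
blocks = nn.TransformerEncoder(layer, num_layers=2)        # "decoder-only" = a causally masked stack
lm_head = nn.Linear(d_model, vocab_size)

causal = nn.Transformer.generate_square_subsequent_mask(seq_len)  # -inf above the diagonal
hidden = blocks(embed(tokens), mask=causal)                # each token only sees earlier tokens
logits = lm_head(hidden)

# Next-token prediction: positions up to t predict the token at t+1.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                                   tokens[:, 1:].reshape(-1))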

39

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

40 of 77

GPT-2

40

41 of 77

GPT-3

41

42 of 77

In-context Learning

42

43 of 77

Few-shot Learning (GPT-3)

43

44 of 77

GPT-3 Results

Promising results in the zero- and one-shot settings; in the few-shot setting, sometimes competitive with fine-tuned state-of-the-art models

44

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

45 of 77

GPT-3 Results

GPT-3 can be applied to any downstream task without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.
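Concretely, a few-shot prompt is just text; the demonstrations below are adapted from the GPT-3 paper, and generate() stands in for whatever sampling interface the model exposes:

prompt = """Translate English to French.

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# completion = generate(prompt)   # hypothetical call to the language model
# A capable model continues with " fromage", with no gradient updates involved.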

45

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

46 of 77

Language Transformers (Encoder-Decoder)

  • These models are sometimes called sequence-to-sequence models.
  • At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input.
  • Pre-trained using the objectives of encoder or decoder models, but usually involving something a bit more complex.
  • Best suited for generating new sentences conditioned on a given input, e.g. summarization, translation, or generative question answering.

46

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

47 of 77

T5: Text-to-Text Transfer Transformer

  • Shows that almost all NLP tasks can be cast as a sequence-to-sequence generation task. Thus, an encoder-decoder language model can perform all natural language understanding and generation tasks.
  • The model resembles the original Transformer, with a BERT-style encoder.
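A minimal sketch of the text-to-text framing, with input/output pairs adapted from the T5 paper's examples; a single encoder-decoder model handles all of them, distinguished only by the task prefix:

examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("summarize: state authorities dispatched emergency crews tuesday to survey the damage ...",
     "six people hospitalized after a storm in attala county ..."),
]
for source, target in examples:
    print(f"{source!r}  ->  {target!r}")   # every task is plain text in, plain text out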

47

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

48 of 77

Emergent Ability

  • When the model size grows from 0.1B -> 1.5B -> 175B parameters, the model starts to perform surprisingly well on zero-shot and few-shot tasks
    • These are called “emergent abilities”.

48

Slide Credits : "CS 886: Recent Advances on Foundation Models", Wenhu Chen, University of Waterloo, Winter 2024

49 of 77

Emergent Abilities of Large Language Models

  • Why do LLMs work so well? What happens as you scale up?
    • Potential explanation: emergent abilities!
    • An ability is emergent if it is present in larger but not smaller models
    • It could not have been directly predicted by extrapolating from smaller models
    • Performance is near-random until a certain critical scale, then improves sharply
      • Known as a “phase transition”; it would not have been predicted by extrapolation

49

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

50 of 77

Few-Shot Prompting

50

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

51 of 77

Potential Explanations of Emergence

  • There are currently few explanations for why these abilities emerge
  • The evaluation metrics used to measure these abilities may not fully explain why they emerge
  • Disclaimer: maybe emergent abilities of LLMs are a mirage!
    • “Are Emergent Abilities of Large Language Models a Mirage?”, R. Schaeffer et al., https://arxiv.org/abs/2304.15004
    • “Emergent abilities appear due to the researcher's choice of metric rather than due to fundamental changes in model behavior with scale”

51

Slide Credits : "CS25: Transformers United V4", Stanford University, Spring 2024

52 of 77

52

53 of 77

53

54 of 77

54

55 of 77

Vision transformer

  • 16x16 patch tokens plus a [CLS] token
  • Follows the standard Transformer encoder architecture and adds a learnable classification token (see the sketch below)
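A minimal sketch of that tokenization step: cut the image into 16x16 patches, linearly project each patch, and prepend the learnable classification token (ViT-Base-like sizes assumed):

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                        # (batch, channels, H, W)
patch, d_model = 16, 768

to_patches = nn.Unfold(kernel_size=patch, stride=patch)    # slides a 16x16 window with stride 16
patches = to_patches(image).transpose(1, 2)                # (1, 196, 768): 196 flattened 3x16x16 patches
project = nn.Linear(3 * patch * patch, d_model)
tokens = project(patches)                                  # (1, 196, 768) patch tokens

cls = nn.Parameter(torch.zeros(1, 1, d_model))             # learnable [CLS] token
tokens = torch.cat([cls.expand(1, -1, -1), tokens], dim=1) # (1, 197, 768): [CLS] + patches
# `tokens` then passes through a standard Transformer encoder; the final [CLS]
# representation feeds the classification head.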

55

56 of 77

ViTs outperform ResNets at scale

56

Slide Credits : "UvA Foundation Models Course" , Cees Snoek, Yuki Asano , Spring 2024

57 of 77

Contrastive Language-Image Pre-training (CLIP)

57

58 of 77

CLIP

  • 400M (image, text) pairs collected from various internet sources
  • Image encoder: a modified ResNet or a Vision Transformer (ViT)
    • Chosen based on performance
  • Text encoder: a Transformer with 63M parameters
  • Trained with a contrastive objective that pushes up the cosine similarity of matched image-text embeddings and pushes it down for mismatched pairs (see the sketch below)
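A minimal sketch of that objective on a batch of (image, text) pairs; the random tensors stand in for the outputs of the image and text encoders:

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # Cosine similarity = dot product of L2-normalized embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(len(logits))                    # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +             # image -> text direction
            F.cross_entropy(logits.t(), targets)) / 2      # text -> image direction

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))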

58

Slide Credits : "CS 8803 VLM Vision-Language Foundation Models", Zsolt Kira, Georgia Tech, Fall 2024

59 of 77

CLIP

  • A linear probe is a simple classifier (logistic regression) trained on pre-trained features with some labeled data
    • Zero-shot CLIP beats logistic regression on ResNet-50 features on 16 of 27 datasets: the power of multimodal training
    • Significance? Robust transfer, with no task-specific data or fine-tuning needed (zero-shot use is sketched below)
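A minimal sketch of how the zero-shot classifier is built: class names become text prompts, and an image is assigned to the most similar prompt (random tensors stand in for the real encoder outputs):

import torch
import torch.nn.functional as F

class_names = ["dog", "cat", "car"]
prompts = [f"a photo of a {c}" for c in class_names]             # prompt template

text_emb = F.normalize(torch.randn(len(prompts), 512), dim=-1)   # would be text_encoder(prompts)
image_emb = F.normalize(torch.randn(1, 512), dim=-1)             # would be image_encoder(image)

scores = image_emb @ text_emb.t()                                # cosine similarities
prediction = class_names[scores.argmax().item()]                 # no task-specific training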

59

Slide Credits : "CS 8803 VLM Vision-Language Foundation Models", Zsolt Kira, Georgia Tech, Fall 2024

60 of 77

CLIP (Distribution Drift, Few-Shot)

60

61 of 77

Pathology Language and Image Pre-Training (PLIP)

The model is a fine-tuned version of the original CLIP model.

61

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

62 of 77

Creating OpenPath: >200K high-quality Twitter image-text pairs

The largest public dataset of pathology images and discussions.

62

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

63 of 77

PLIP applications

  • Zero-shot classification.
  • Improved representation learning for downstream tasks.
  • Text-to-image retrieval
  • Image-to-image retrieval
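A minimal sketch of the retrieval use case in a CLIP/PLIP-style joint embedding space; the query text and the random embeddings are illustrative stand-ins:

import torch
import torch.nn.functional as F

gallery = F.normalize(torch.randn(1000, 512), dim=-1)   # precomputed image embeddings
query = F.normalize(torch.randn(1, 512), dim=-1)        # would be text_encoder("an example pathology query")

similarity = (query @ gallery.t()).squeeze(0)            # cosine similarity to every image
top5 = similarity.topk(5).indices                        # indices of the best-matching images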

63

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

64 of 77

PLIP can serve as a powerful search engine for medicine

64

Slide Credits : "BIODS 271: Foundation Models for Healthcare" ,Stanford University, Winter 2024

65 of 77

Natural language-to-code system based on GPT-3 (Codex)

  • Task: code generation with LLMs; given a natural-language prompt (docstring), output the code that implements it
  • Model: Codex, a GPT-3 model fine-tuned on code, with up to 12B parameters
  • Training data: 160GB of Python code
  • Evaluation: HumanEval
    • a novel dataset of 164 programming problems created by the authors; generate k samples from the model and check whether at least one sample passes all the unit tests (the pass@k metric, sketched below)
  • Result: Codex-12B “solves” 72.3% of the problems given 100 samples, while GPT-3 solves 0% and GPT-J 27.7%; using only one sample (the one with lowest perplexity), Codex gets 28.8% and GPT-J 11.6%
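A short sketch of the pass@k metric behind these numbers: generate n samples per problem, count the c that pass all unit tests, and estimate the probability that at least one of k samples would pass (the unbiased estimator from the Codex paper):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples, drawn from n, is correct.
    if n - c < k:              # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=100, c=30, k=1))   # 0.3: expected success rate with a single sample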

65

Slide Credits : "COS 597G: Understanding Large Language Models" , Danqi Chen, Princeton University, Fall 2022

66 of 77

Codex (Examples)

66

Slide Credits : "COS 597G: Understanding Large Language Models" , Danqi Chen, Princeton University, Fall 2022

67 of 77

Segment Anything Model (SAM):

The first foundation model for promptable segmentation

67

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

68 of 77

SAM

SAM is built with three interconnected components: a task, a model, and a data engine.

68

SAM considers two sets of prompts: sparse (clicks, boxes, text) and dense (masks).

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

69 of 77

SAM

  • A heavyweight image encoder outputs an image embedding.
  • A lightweight prompt encoder efficiently queries the image embedding.
  • A lightweight mask decoder produces object masks and confidence scores.
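A minimal sketch of how the three modules fit together; the class and argument names are placeholders, not the real SAM implementation (which uses an MAE-pretrained ViT image encoder and a transformer-based mask decoder):

import torch.nn as nn

class PromptableSegmenter(nn.Module):
    # Skeleton of SAM-style promptable segmentation.

    def __init__(self, image_encoder, prompt_encoder, mask_decoder):
        super().__init__()
        self.image_encoder = image_encoder     # heavyweight: run once per image
        self.prompt_encoder = prompt_encoder   # lightweight: run per prompt (clicks, boxes, masks, ...)
        self.mask_decoder = mask_decoder       # lightweight: outputs masks plus confidence scores

    def forward(self, image, prompt):
        image_embedding = self.image_encoder(image)      # can be cached and reused across prompts
        prompt_embedding = self.prompt_encoder(prompt)
        return self.mask_decoder(image_embedding, prompt_embedding)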

69

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

70 of 77

SAM (Zero-Shot Single Point Valid Mask Evaluation)

  • Training dataset: the whole SA-1B dataset
  • Test datasets: 23 diverse segmentation datasets

70

Slide Credits : "COMP 590/776: Computer Vision in 3D World", Roni Senguptam UNC, Spring 2023

71 of 77

Med-Gemini: Multimodal medical models built on Gemini

71

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

72 of 77

72

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

73 of 77

73

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

74 of 77

74

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

75 of 77

75

Slide Credits : "Emergence of Foundation Models: Opportunities to Rethink Medical AI", Shekoofeh Azizi, CVPR 2024

76 of 77

76

Image Credits : Ronnie, King of Zurich

77 of 77

77

Image Credits : Ronnie, King of Zurich