
Artificial Intelligence

MSc course

Umberto Junior Mele

Dalle Molle Institute for Artificial Intelligence, Switzerland

meleu@usi.ch

Lab 11: Understanding and Applying Foundation Models


Outline

  • What is a Foundation Model?
  • Scaling Laws
  • Comparison of available FMs
  • Datasets
  • Training
  • How to improve performance
  • Fine-Tuning FMs:
    • Prompt Engineering
    • Training Adaptation
  • Using Tools
  • Possible Flaws of LLMs
    • Inverse Scaling behaviours
    • Memory trap & Logic issues
  • Cool Recent Applications
    • AlphaCode
    • FunSearch
    • Others…
  • Future Trends & Challenges


Foundation Models

  • The term Foundation Models (FM) was coined by researchers at Stanford University, particularly by the Stanford Institute for Human-Centered Artificial Intelligence (HAI).
  • FMs are pre-trained models, extensively trained on diverse, large-scale datasets.
  • FMs can be fine-tuned for a variety of specific tasks like translation, answering questions, summarizing documents, etc.
  • GPT-3 and GPT-4 by OpenAI are notable examples.


Foundation Models

  • NLP: Used for sentiment analysis, chatbots, and translation.
  • Computer Vision: Applied in image and facial recognition, medical imaging.
  • Autonomous Systems: Utilized in self-driving cars for decision-making.
  • Healthcare: Employed in disease detection, genomic analysis, and drug discovery.
  • Finance and Economics: Applied in fraud detection, market trend analysis, credit scoring for risk assessment.

Bommasani et al. On the Opportunities and Risks of Foundation Models (2021)


Foundation Models

  • Foundation models enable broad applications due to their generalist nature.
  • They help reduce the cost and time of training AI from scratch for each specific task.
  • They allow researchers to use collective knowledge from diverse data.


Hyperparameter cost problem

Hyperparameter tuning is a critical step in AI model training. It involves finding the best set of parameters that control the learning process.

Challenges

  • Computational Cost: Each potential hyperparameter set requires a full training cycle.
  • Time-Intensive: Testing numerous combinations prolongs the development process.
  • Expert Knowledge Needed: Deciding which hyperparameters to adjust demands expertise.


Common Approaches to address the problem

Guess and Pray: This approach relies on intuition and experience to choose hyperparameters. It's less systematic and can be hit-or-miss, but it requires fewer computational resources.

Exhaustive Search: Methods like grid search systematically explore a range of hyperparameter values. While thorough, this method is computationally expensive and impractical for high-dimensional hyperparameter spaces.

Scaling Laws: These involve using simple, predictive rules that guide the selection of hyperparameters based on the model's size and the dataset. This method tries to balance performance with computational feasibility, using empirical data and theoretical insights to inform choices. It's a more recent approach, gaining traction for its effectiveness in large-scale models.


Scaling Laws

Kaplan et al. Scaling Laws for Neural Language Models. (2020)


Data Scaling Laws

The data scaling law is a simple formula that maps dataset size (n) to error; a commonly assumed form is err(n) = a · n^(−β) + c, a power law that decays toward an irreducible error floor c.

What do we expect out of scaling laws?
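A minimal sketch of how such a law can be used in practice: assume the power-law form above and fit its constants to a handful of (dataset size, error) measurements from small runs. The data points and starting values below are illustrative assumptions, not numbers from the lecture.

    # Fit an assumed power-law form err(n) = a * n**(-b) + c to
    # (dataset size, error) measurements. The data here is synthetic.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(n, a, b, c):
        # Power-law decay toward an irreducible error floor c.
        return a * n ** (-b) + c

    # Hypothetical measurements: error observed at growing dataset sizes.
    n_obs = np.array([1e3, 1e4, 1e5, 1e6, 1e7])
    err_obs = np.array([0.52, 0.31, 0.19, 0.12, 0.09])

    (a, b, c), _ = curve_fit(scaling_law, n_obs, err_obs, p0=[1.0, 0.3, 0.05])
    print(f"fit: err(n) = {a:.2f} * n^(-{b:.2f}) + {c:.2f}")
    print(f"extrapolated error at n = 1e8: {scaling_law(1e8, a, b, c):.3f}")

Once fitted on cheap small-scale runs, the curve can be extrapolated to predict the error of a much larger run before paying for it.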


Data Scaling Laws - Empirical Observation


Data Scaling Laws - Toy Example


Model Scaling Laws - Parameters


Model Scaling Laws - Depth


Scaling Laws - Problems?

Levesque et al. The Winograd Schema Challenge (2012)

Phase transitions are sudden, discontinuous jumps in performance.

Do we expect to see more phase transitions?

This is probably the big unknown in LM scaling!


Foundation Models Comparison

  • GPT-4 is the OpenAI LLM known for its ability to solve difficult problems with greater accuracy, thanks to its broader general knowledge.
  • PALM-2 is Google's new LLM, claimed to perform better than GPT-4 on reasoning tasks, though the model is not yet available to the public.
  • BLOOM is a 176B-parameter open-access LLM designed and built thanks to a collaboration of hundreds of researchers.

  • OpenAI: GPTs, Codex, DALL-E, CLIP
  • Google / DeepMind: PALM-2, Gopher, Chinchilla, Gemini
  • Meta / Stanford: LLaMA, Alpaca
  • AI21 Labs: Jurassic


HELM

Holistic Evaluation of Language Models (HELM) is a framework to increase the transparency of language models.

  • Collection of datasets in a standard format (e.g., NaturalQuestions)
  • Collection of models accessible via a unified API (e.g., GPT-3, MT-NLG, OPT, BLOOM)
  • Collection of metrics beyond accuracy (efficiency, bias, toxicity, etc.)
  • Collection of perturbations for evaluating robustness and fairness (e.g., typos, dialect)
  • Modular framework for constructing prompts from datasets

https://github.com/stanford-crfm/helm

https://crfm.stanford.edu/helm/latest/

Liang et al. Holistic Evaluation of Language Models (2023)


Datasets

Neural networks are, in effect, compressed/compiled versions of their training data. Therefore, the size of the dataset has to scale with the size of the model.

GPT-3 175B was trained on 300 billion tokens drawn from a weighted combination of the following datasets (sampling weights as reported by Brown et al.): Common Crawl (filtered, 60%), WebText2 (22%), Books1 (8%), Books2 (8%), and English Wikipedia (3%).

Brown et al. Language Models are few-shot learners. (2020)


Datasets - The PILE

EleutherAI (a nonprofit organization committed to building open language models) released The Pile, a dataset for language modeling whose key idea is to draw on many smaller, high-quality sources (academic and professional text).

Gao et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling (2020)


Training LLMs - Self Supervised

Self-supervision is a form of unsupervised learning where the data itself provides the supervision.

Self-supervised learning leverages the inherent structure of data to create pseudo-labels, allowing models to learn meaningful representations.

Chen et al. Big Self-Supervised Models are Strong Semi-Supervised Learners (2020)
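As a concrete illustration of data supervising itself, here is the standard next-token setup used to pretrain LLMs: the pseudo-labels are simply the input sequence shifted by one position. The token IDs below are made up.

    # Self-supervised label creation for language modeling: the targets are
    # the input tokens shifted by one, so the text supervises itself.
    tokens = [12, 7, 99, 4, 31, 8]   # an encoded text sequence (made-up IDs)

    inputs = tokens[:-1]             # model sees:     [12, 7, 99, 4, 31]
    targets = tokens[1:]             # model predicts: [7, 99, 4, 31, 8]

    for x, y in zip(inputs, targets):
        print(f"prefix ends in token {x} -> training target is token {y}")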


Training LLMs - RLHF and RLAIF

Reinforcement Learning with Human Feedback (RLHF) involves humans providing additional signals or modifying reward structures to guide learning.

Humans demonstrate desired behaviors, which the RL agent then tries to imitate.

Griffith et al. Policy shaping: Integrating human feedback with reinforcement learning (2013)
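A minimal sketch of the pairwise preference objective commonly used to train the reward model in RLHF pipelines (a Bradley-Terry style loss). The reward scores below are placeholders standing in for a reward network's outputs; they are not from the cited paper.

    # Pairwise preference loss for a reward model, as commonly used in RLHF.
    import torch
    import torch.nn.functional as F

    # Reward scores for the human-preferred vs. rejected response to the
    # same prompt (a batch of three comparisons; values are placeholders).
    r_chosen = torch.tensor([1.2, 0.4, 2.0])
    r_rejected = torch.tensor([0.3, 0.9, 1.1])

    # -log sigmoid(r_chosen - r_rejected): pushes the preferred response
    # to score higher than the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    print(loss.item())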


How to improve FMs without spending millions

Prompt Crafting or Engineering

Prompt crafting or engineering is a method that involves meticulously designing input prompts to enhance the performance of language models for specific tasks. It requires an in-depth understanding of the model and the task, focusing on the choice of words, structure, and context to guide the model towards generating accurate and relevant responses.

Fine tuning pre-trained models

Fine-tuning a pre-trained model involves adjusting it for a specific task by selecting an appropriate dataset, optimizing learning rates, and reducing the number of training iterations. This approach effectively utilizes the model's existing learned features, conserves computational resources, and can be customized to meet specific needs.


Task Specialization

  • Prompt Engineering consists of the techniques used to design the input to the foundation model so as to get the desired outputs:
    • Zero-Shot
    • Few-Shot
    • Chain of Thought
    • Self-Reflection
  • Goal-directed training consists in changing the parameters of the FM, or of a smaller model used on top of it, to adapt the outputs to specific applications:
    • Linear Probing
    • Fine-Tuning
    • Mixed Approaches


Prompt Engineering

Prompt engineering can include tactics such as:

  • Specificity: Making the prompt more detailed or specific can help guide the model.
  • Instructions: Including explicit instructions about the format or content of the desired output can improve results.
  • Examples: Providing an example of the desired output can be very effective, particularly for tasks that require generating text in a specific format.
  • Priming: The prompt can include context that "primes" the model, giving it some background information that helps to guide the output.
  • Redundancy: Sometimes asking the same question in different ways within a prompt can help ensure the model understands the task.
  • Question Framing: Adjusting the way a question is posed can significantly influence the model's response.


Zero-shot Learning

Definition: Zero-shot Learning refers to the scenario where the FM is asked to answer a task unseen during training. This capability is made possible due to the broad pretraining which exposes the model to diverse scenarios and tasks. For example, a model can identify animals not by directly learning from images of them, but by understanding and applying descriptive attributes about them.

Advantages

  • No need for task-specific fine-tuning data.
  • Can tackle a wide variety of tasks.

Limitations

  • Performance might not be as good as that of a model trained on the task.
  • The model can be more sensitive to how the task is presented.
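A minimal zero-shot prompt for a classification task: the instruction alone describes the task, with no worked examples. `query_model` is a hypothetical helper standing in for whatever FM API is in use.

    # Zero-shot: the task is specified purely by the instruction.
    prompt = (
        "Classify the sentiment of the following review as positive or negative.\n"
        "Review: The battery died after two days and support never replied.\n"
        "Sentiment:"
    )
    # answer = query_model(prompt)   # hypothetical call to a foundation model
    print(prompt)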


Few Shot Learning

Definition: Few-shot Learning refers to the scenario where the model is expected to generalize from a limited number of examples. In the context of Foundation Models, it involves providing a few examples of a particular task to the model in the form of a prompt. This helps the model understand the desired output and generalize from the examples to perform the task on new inputs.

Advantages

  • Requires less data than traditional machine learning models.
  • Can adapt to new tasks quickly.

Limitations

  • Performance can vary based on the quality and quantity of the examples.
  • It can be sensitive to the ordering and phrasing of the input examples.
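The same task recast as a few-shot prompt: a couple of labeled examples are prepended so the model can infer the desired format and behavior from context. `query_model` is again a hypothetical stand-in.

    # Few-shot: in-context examples demonstrate the task before the query.
    prompt = (
        "Review: Great picture quality and easy setup.\n"
        "Sentiment: positive\n\n"
        "Review: Arrived broken and the refund took a month.\n"
        "Sentiment: negative\n\n"
        "Review: The battery died after two days and support never replied.\n"
        "Sentiment:"
    )
    # answer = query_model(prompt)   # hypothetical call
    print(prompt)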


Chain of Thought

Definition: Chain of Thought (CoT) is achieved by prompting the models to generate a series of intermediate steps that lead to the final answer of a multi-step problem. The technique improves results on reasoning tasks that require logical thinking and multiple steps to solve. CoT prompting can compete with task-specific fine-tuned models on several tasks.

Advantages

  • Better performance on both arithmetic and commonsense reasoning tasks.
  • It offers a clear, traceable path to the final answer, which can be easier to understand and verify.

Limitations

  • It might increase the likelihood of generating toxic output, especially in tasks where models might make inferences about sensitive topics or marginalized groups.
  • It works better with larger and more powerful models.
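A minimal chain-of-thought prompt in the style of Wei et al. (2022): the worked example spells out its intermediate steps, which nudges the model to reason step by step on the new question.

    # Chain of Thought: the exemplar answer shows intermediate reasoning.
    prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
        "Q: A baker had 23 muffins, sold 17, then baked 12 more. "
        "How many muffins are there now?\n"
        "A:"
    )
    # A CoT-following model should continue with intermediate steps
    # ending in "... The answer is 18."
    print(prompt)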


Chain of Thought

Wei et al. Chain-of-thought prompting elicits reasoning in large language models. (2022)


Self-Reflection

Definition: Self-Reflection is achieved by prompting the models to introspect and analyze their own outputs and reasoning process. The model leverages the knowledge and understanding gained during the pretraining phase to provide insights about its own reasoning.

Advantages

  • Helps in understanding the model's outputs and predictions.
  • Assists in identifying potential biases and errors in the model's responses.

Limitations

  • Current models' "introspection" is based on generating plausible-sounding explanations rather than having true access to their internal workings.
  • The introspection might not be completely reliable or accurate.


Self-Reflection

Lara et al. Artificial intelligence as a Socratic assistant for moral enhancement (2020)


Retrieval-Augmented Generation (RAG)

Lewis et al. Retrieval-augmented generation for knowledge-intensive NLP tasks (2020)


Retrieval-Augmented Generation (RAG)

Chen et al. Re-Imagen: Retrieval-augmented text-to-image generator (2022)


Training Adaptation

Linear Probing involves training a small model (typically a linear classifier) on top of the frozen foundation model to adapt it to the specific task.

  • Pros: It's a simple and efficient way to extract useful information from the model.
  • Cons: It may not leverage the full potential of the model as it doesn't alter the foundation model's parameters.

Fine-Tuning involves continuing the training of the pretrained model on the specific task.

  • Pros: It can significantly improve the model's performance on the task.
  • Cons: It requires costly task-specific training data and can overfit if not managed properly.

Mixed Approaches combine different methods, such as linear probing followed by fine-tuning, to achieve better results.

  • Pros: It can outperform both methods on their own while remaining cheaper than full fine-tuning.
  • Cons: It could be more complex to implement and may be more costly than Linear Probing.
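A minimal PyTorch sketch of the two regimes, with a tiny stand-in backbone in place of a real pretrained FM: linear probing freezes the backbone and trains only a new head, while fine-tuning updates everything, typically at a lower learning rate.

    # Linear probing vs. fine-tuning (toy backbone standing in for an FM).
    import torch
    import torch.nn as nn

    backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # "pretrained" stand-in
    head = nn.Linear(64, 2)                                  # new task-specific head

    # Linear probing: freeze every backbone parameter, optimize the head only.
    for p in backbone.parameters():
        p.requires_grad = False
    probe_opt = torch.optim.Adam(head.parameters(), lr=1e-3)

    # Fine-tuning: unfreeze and train everything, usually with a smaller
    # learning rate to avoid destroying pretrained features.
    for p in backbone.parameters():
        p.requires_grad = True
    finetune_opt = torch.optim.Adam(
        list(backbone.parameters()) + list(head.parameters()), lr=1e-5
    )

    x = torch.randn(4, 128)                  # dummy batch
    loss = head(backbone(x)).pow(2).mean()   # placeholder loss
    loss.backward()
    finetune_opt.step()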


Using Tools with LLMs

Foundation models can learn how to use tools directly from input prompts: the large-scale training data often include examples of tool use, from which the models generalize (a minimal dispatch sketch follows the lists below).

  • Search the Internet: Foundation models can gather additional information from internet searches. This could be used for fact-checking or information retrieval.
  • Calculators and Solvers: Models can effectively use calculators or solvers. When given prompts related to calculations or problem-solving, they can leverage their internal understanding to provide solutions.
  • Vector Databases can be used to store and retrieve information based on the model's embeddings. These can be queried to find knowledge similar to a given prompt, or that might help solve a given task.

Advantages

  • Expands the capabilities of the models.
  • Allows the model to perform more complex tasks and provide more accurate and helpful responses.
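A minimal sketch of the dispatch pattern behind such tool use: the model is asked to emit tool calls in an agreed textual format, and the surrounding program parses and executes them. The `CALC:`/`SEARCH:` convention here is an assumption of this sketch, not a standard.

    # Route structured model outputs to tools; the formats are invented.
    def run_tool(model_output: str) -> str:
        if model_output.startswith("CALC:"):
            expr = model_output[len("CALC:"):].strip()
            # eval() on trusted arithmetic only; a real system needs a safe parser.
            return str(eval(expr, {"__builtins__": {}}))
        if model_output.startswith("SEARCH:"):
            query = model_output[len("SEARCH:"):].strip()
            return f"<results for '{query}' from a search API would go here>"
        return model_output  # no tool requested; pass the text through

    print(run_tool("CALC: (17 + 5) * 3"))   # -> 66
    print(run_tool("SEARCH: latest CHF/EUR exchange rate"))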


Coding

FMs can write code in various programming languages, understand syntax, solve algorithmic problems, and debug. For example, they can generate a Python or C++ function to solve a system of linear equations.
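The kind of function the slide has in mind might look like the following; this is an illustrative, human-written example of the code an FM could produce, not actual model output.

    # Solve a system of linear equations Ax = b.
    import numpy as np

    def solve_linear_system(A, b):
        """Return x such that Ax = b, assuming A is square and non-singular."""
        return np.linalg.solve(np.asarray(A, float), np.asarray(b, float))

    # 2x + y = 5 and x - y = 1  ->  x = 2, y = 1
    print(solve_linear_system([[2, 1], [1, -1]], [5, 1]))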

Advantage

  • These models are expected to play a more vital role in coding. This includes not just writing and debugging code but also suggesting architectural improvements, identifying security risks, and more.

Limitation

  • Foundation Models are not perfect and can make mistakes, especially with complex coding tasks. They should be used as a tool to aid developers, not as a replacement.


Searching the Net

GPT-4 generates responses based on its training data. However, allowing these models to interact with the internet in real time could greatly improve their responses, keeping them current with the latest information and enabling fact-checking against live data. For instance, they could provide recent stock market trends, up-to-date news, or even the latest scientific research.

Advantages

  • It enables better outputs through fact-checking
  • It makes it possible to automate many updating tasks

Limitations

  • Could facilitate the creation of fake news
  • Not all data available on the internet is correct.


Vector-based Databases

  • Definition: Vector databases, also known as vector space databases, are a type of database that store and retrieve data in the form of vectors, allowing for efficient similarity searches in high dimensional space.
  • Application: In the context of large language models, vector databases are used to efficiently store and retrieve embeddings of words, sentences, or documents. These embeddings are high-dimensional vectors that capture the semantic meaning of the words or text they represent.
  • Advantage: The key advantage of using vector databases is that they enable quick nearest neighbor searches, allowing models to retrieve information that is semantically similar to a given query.


Vector-based Databases

  • Embedding Generation: When a prompt is received, a large language model generates an embedding (or vector representation) of the input.
  • Nearest Neighbor Search: This embedding is then used to query the vector database for the most similar embeddings based on a metric like cosine similarity or Euclidean distance.
  • Information Retrieval: The documents or information associated with these nearest-neighbor embeddings are then retrieved and used to inform the model's response (a minimal sketch of this loop follows below).
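A minimal sketch of that embed-rank-retrieve loop, with toy 4-dimensional vectors standing in for real model embeddings; all names and numbers are made up.

    # Embed a query, rank stored vectors by cosine similarity, return the best.
    import numpy as np

    docs = {
        "refund policy":  np.array([0.9, 0.10, 0.00, 0.2]),
        "shipping times": np.array([0.1, 0.80, 0.30, 0.0]),
        "warranty terms": np.array([0.2, 0.30, 0.70, 0.1]),
    }

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Stand-in for embed("how do I get my money back?")
    query_embedding = np.array([0.8, 0.15, 0.05, 0.3])

    best = max(docs, key=lambda name: cosine(docs[name], query_embedding))
    print(best)  # nearest neighbor: "refund policy"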


Vector-based Databases

  • Benefits:
    • Vector databases enable efficient retrieval of semantically related information.
    • They can handle very large amounts of data.
    • They are not limited to text data and can be used for any data that can be represented as vectors.
  • Challenges:
    • Implementing and managing high-dimensional vector databases can be complex.
    • The performance can degrade with very high-dimensional data.
    • Ensuring the relevance of retrieved information can be challenging, as similarity in vector space doesn't always equate to relevance in a specific context.


Inverse Scaling Phenomenon

  • As language models become larger, they generally perform better, scoring higher on benchmarks, unlocking new capabilities, but also introducing new biases or misinformation.
  • Standard Scaling Laws illustrate that language models improve predictably with increased parameters, compute usage, and training dataset size - following a power law for each factor.
  • Inverse Scaling Phenomenon is based on the hypothesis that certain tasks exhibit inverse scaling: as the overall test loss of the language model improves, task performance predictably worsens. These tasks appear to be rare but could represent important issues with current pretraining and scaling paradigms.


Memory Trap

The Memory Trap refers to a tendency of LMs to default to replicating memorized text, often overruling specific instructions to generate novel or specific content. For example, given the famous quote, "Two things are infinite: the universe and human stupidity, but regarding the universe ...", a large LM is more likely to finish it as per the original quote rather than generating a unique ending, despite being prompted to do so.

  • This is a clear case of inverse scaling - as the models get larger and more proficient in their general language modeling task, they tend to perform worse in specific tasks that require creative or novel responses.
  • While larger LMs can model pretraining data more effectively, they also tend to deeply memorize common texts and concepts. This strong memory often overrides their ability to follow specific instructions for novel responses.
  • The memory trap can lead to severe failures in reasoning and instruction-following, even reproducing inappropriate or harmful content under certain conditions.


Logic Issues

LLMs often struggle with logical reasoning tasks, including the ability to accurately perform deductions. As models scale, this problem becomes more pronounced, a phenomenon known as inverse scaling.

As LLMs continue to be integrated into decision-making processes, understanding and mitigating their logical fallacies becomes crucial. Models need to be able to correctly interpret and apply logical reasoning to avoid incorrect or harmful outputs.

Example prompts (a quick programmatic check follows the list):

  • A craftsman has to make 100 license plates, from 1 to 100. How many times will he have to write the number 9?
  • The day before yesterday, Marco was 17; next year he will be 20. How is this possible?
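Riddles like these are easy to verify programmatically, which makes them handy probes when judging a model's answer. The first can be checked in one line; for the second, the intended resolution is that the statement is made on January 1st and Marco's birthday falls on December 31st.

    # Count how many times the digit 9 is written on plates 1..100.
    count = sum(str(n).count("9") for n in range(1, 101))
    print(count)  # 20 (counting only 9, 19, ..., 99 once gives the common wrong answer 10)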


AlphaCode 2

Li et al. Competition-level code generation with AlphaCode (2022)


FunSearch

Romera-Paredes et al. Mathematical discoveries from program search with large language models (2023)


MetaMorph

Gupta et al. MetaMorph: Learning universal controllers with transformers (2022)


Examples of Applications

  • Customer Service: LLMs can be used to handle customer inquiries and complaints, automating responses to frequently asked questions, and escalating complex issues to human agents.
  • Content Generation: LLMs can produce high-quality written content, including product descriptions, social media posts, and blog articles, aiding marketing and communication efforts.
  • Data Analysis: LLMs can analyze large volumes of text data, like customer reviews or social media comments, to gain insights into consumer sentiment and trends.
  • Personal Assistants: Advanced versions of voice-activated personal assistants like Siri or Alexa can be developed using LLMs, providing more accurate and context-aware responses.
  • Education and Training: LLMs can provide personalized learning experiences, offering real-time feedback and assistance on a variety of subjects.


Future Trends & Challenges

  • Fine-Tuning LLMs: There will be increasing focus on effectively fine-tuning LLMs to specific tasks, industries, or domains to enhance their utility and applicability.
  • Responsible AI: As LLMs grow larger and more complex, efforts to identify and mitigate issues such as bias, misinformation, and ethical considerations will become more important.
  • Interactive AI Systems: Expected growth in the development of interactive AI systems that engage users in more dynamic and personalized ways, combining NLP, voice recognition, and machine learning techniques.
  • Data Privacy: Ensuring the privacy and security of data used to train and interact with LLMs will be a key concern as regulations become stricter.
  • AI Partnerships: With the increasing importance of AI, strategic partnerships will form between businesses and AI companies to leverage LLM capabilities and maintain competitive advantage.
