1 of 54


CS 589 Lecture 11

Monday 6:30-9:00

Kidde 228

Automated machine learning (cont.)

Parameter Efficient Fine Tuning

In-context Learning

photo: https://www.scubedstudios.com/information-retrieval/

ICL slides are adapted from Stanford CS 224U: https://web.stanford.edu/class/cs224u/slides/cs224u-incontextlearning-2023-handout.pdf

2 of 54

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


3 of 54

Hyperparameter optimization

  • Definition of hyperparameter tuning:
    • λ: hyperparameter configuration
    • Λ: search space for the hyperparameter λ
    • D_{train}: training dataset
    • D_{valid}: validation dataset
    • L: loss function (objective)
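These pieces combine into the standard HPO objective (a standard formulation consistent with the notation above, not reproduced from the slide):

```latex
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda \in \Lambda} \; L\!\left( A_{\lambda}(D_{train}),\; D_{valid} \right)
```

where A_{λ} denotes the learning algorithm trained with hyperparameter configuration λ.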


4 of 54

FLAML: A Fast Library for Automated Machine Learning & Tuning

5 of 54

FLAML: AutoML vs Tune

6 of 54

FLAML: Tune User-Defined Function

  • config: search space
  • metric: metric to optimize
  • mode: maximize or minimize
  • num_samples: number of trials
  • n_concurrent_trials: number of concurrent trials
  • resource_per_trial: {“gpu”: 1}
  • low_cost_partial_config: starting point

source: https://github.com/microsoft/FLAML/blob/main/flaml/tune/tune.py#L202
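Conceptually, `tune.run` drives a sample-and-evaluate loop over the search space using the arguments above. A minimal standard-library sketch of such a loop (the objective, names, and uniform sampling here are illustrative stand-ins, not FLAML's actual implementation):

```python
import random

def tune_sketch(evaluate, search_space, metric, mode="min", num_samples=20, seed=0):
    """Toy stand-in for a tuner's trial loop: sample a config from the
    search space, evaluate the user-defined function, track the best trial."""
    rng = random.Random(seed)
    sign = 1 if mode == "min" else -1
    best_config, best_value = None, None
    for _ in range(num_samples):
        config = {k: rng.choice(v) for k, v in search_space.items()}  # one trial
        value = evaluate(config)[metric]
        if best_value is None or sign * value < sign * best_value:
            best_config, best_value = config, value
    return best_config, best_value

# Hypothetical objective: pretend a smaller lr and larger batch give lower loss.
def evaluate(config):
    return {"val_loss": config["learning_rate"] + 1.0 / config["batch_size"]}

space = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
best, loss = tune_sketch(evaluate, space, metric="val_loss", mode="min")
```

FLAML's real `tune.run` replaces the uniform sampling above with cost-frugal search strategies such as CFO and BlendSearch, described next.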

7 of 54

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

  • To avoid high-cost points until necessary -> Low-cost starting point + local search
  • To find low-loss points -> Follow loss-descent directions
  • Cannot directly use gradient-based method: no gradient available.
  • Surprise: function values are enough to find the right directions for efficient convergence
  • More surprise: sign comparison between function values is enough!!

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021

8 of 54

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

Repeat the following steps after each move:

1. Uniformly sample a direction from a local unit sphere;

2. Compare the loss at the sampled point with the current loss;

3. If the loss decreases, move there (and break); otherwise try the opposite direction;

4. If the opposite direction decreases the loss, move there (and break); otherwise stay.
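The four steps form a gradient-free direct search: only sign comparisons of function values are needed. A minimal sketch on a toy objective (the quadratic stands in for a validation loss; the paper's step-size scheduling is omitted):

```python
import math
import random

def cfo_step(f, x, stepsize):
    """One iteration of the randomized direct search sketched above:
    sample a direction, compare function values, move or try the opposite."""
    u = [random.gauss(0, 1) for _ in x]
    norm = math.sqrt(sum(v * v for v in u)) or 1.0
    u = [v / norm for v in u]                      # uniform direction on the unit sphere
    fx = f(x)
    x_plus = [xi + stepsize * ui for xi, ui in zip(x, u)]
    if f(x_plus) < fx:                             # sign comparison only, no gradients
        return x_plus
    x_minus = [xi - stepsize * ui for xi, ui in zip(x, u)]
    if f(x_minus) < fx:                            # opposite direction
        return x_minus
    return x                                       # stay put

random.seed(0)
# Toy objective (hypothetical; stands in for validation loss over hyperparameters).
f = lambda x: (x[0] - 3) ** 2 + (x[1] + 1) ** 2
x = [0.0, 0.0]
for _ in range(200):
    x = cfo_step(f, x, stepsize=0.1)
```

Despite never computing a gradient, the iterates descend toward the minimizer (3, -1), which is the "surprise" the slide refers to.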

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021

9 of 54

BlendSearch: Combining Local + Global Search [ICLR’22]

  • Local search (LS): low cost, but may get trapped in local optima
  • Global search: able to explore the whole space, but high cost

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

10 of 54

BlendSearch: Combining Local + Global Search [ICLR’22]

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

11 of 54

HW4: Hyperparameter Tuning for HW3

  • In HW3, the hyperparameters are given:

  • In HW4, you need to search for the learning rate and batch size:

12 of 54

13 of 54

FLAML: Resources

14 of 54

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


15 of 54

Paradigm shift in NLP (2017)

Training -> predict

Pre-training -> fine tuning -> predict


16 of 54

Second paradigm shift in NLP (2021)

Pre-training -> fine tuning -> predict

Pre-training -> prompting -> predict

17 of 54

Fine-Tuning Has Good Performance, but…

  • Challenges of updating the full model weights
    • Models are retrained every few months: retraining GPT-3 (175B parameters) is extremely costly
    • Continual learning: catastrophic forgetting

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

18 of 54

How to Improve over Fine Tuning?

  • Strategy 1: Avoid updating the full model weights
    • Parameter-efficient fine tuning (PEFT)

  • Strategy 2: Avoid updating any weights at all
    • In-context learning (ICL): only tune the input to the model, i.e., prompt tuning

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

19 of 54

Prefix Tuning: Optimizing continuous prompts

  • Continuous prompts
    • Prepend continuous vectors (the continuous prompt) to the input and output pairs
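A toy sketch of the idea, assuming a model that consumes embedding vectors directly (all dimensions and values are illustrative):

```python
# Frozen token embeddings for the input sequence (toy 2-d embeddings).
token_embeddings = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]

# Trainable continuous prefix: these vectors are the only parameters that
# gradient descent updates for the task; the language model stays frozen.
prefix = [[0.0, 0.0], [0.0, 0.0]]   # initialized here, then optimized per task

# The model consumes the prefix followed by the ordinary input embeddings.
model_input = prefix + token_embeddings
```

Because the prefix vectors live in embedding space rather than the vocabulary, they can be optimized directly with gradients, unlike discrete prompt tokens.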

Prefix-tuning: optimizing continuous prompts for generation. Li et al. 2020

20 of 54

Prefix-tuning: lightweight fine-tuning

  • Prefix tuning also helps with lightweight fine tuning
    • Fine tuning large language models is costly!
    • GPT-J-6B alone is 22 GB!

  • With prefix tuning, we only need to tune the prefix for each task, which significantly reduces the parameters that need to be tuned
    • with only 0.1% of the parameters, obtains performance comparable to full-parameter fine tuning of GPT-2 and BART

Prefix-tuning: optimizing continuous prompts for generation. Li et al. 2020

21 of 54

Adapter: Parameter-Efficient Transfer Learning for NLP

  • Adds a small number of new parameters
  • New task: add a few parameters without revising previous ones
  • As a result, overcomes catastrophic forgetting

source: Parameter-Efficient Transfer Learning for NLP. Houlsby et al. 2019

22 of 54

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Disadvantage of adapters:
    • Added inference latency
  • Disadvantage of prefix-tuning:
    • Difficult to optimize
  • Proposed approach:
    • Using low-rank matrix decomposition to learn parameter update

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021

23 of 54

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Pretrained weights: W_0 ∈ R^{d×k}
  • Fine-tuned weights: W_0 + ΔW
  • Approximate ΔW as the product of two smaller matrices: ΔW ≈ BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, with rank r ≪ min(d, k)
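The parameter savings are easy to check; a small sketch with illustrative dimensions:

```python
# LoRA replaces the dense update dW (d x k) with B (d x r) @ A (r x k).
d, k, r = 4096, 4096, 8             # illustrative dimensions, r << min(d, k)

full_update_params = d * k          # parameters in a dense dW
lora_params = d * r + r * k         # parameters in B and A

print(full_update_params)   # 16777216
print(lora_params)          # 65536
```

With these dimensions the trainable update is under 0.4% of the dense update, which is how LoRA matches full fine tuning at a fraction of the tuned parameters.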

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021

24 of 54

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Results:
    • Comparable to or outperforms fine tuning and adapters
  • Another empirical study shows that:
    • FT > LoRA > Adapter > Prefix Tuning

source: Parameter-efficient fine-tuning of large-scale pre-trained language models. Ding et al. 2022

25 of 54

Resource for Parameter Efficient Fine Tuning

26 of 54

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


27 of 54

In-context learning

  • The weights of large language models (e.g., GPT-3) are hard to update

  • When using an LLM for a task such as question answering, instead of inputting only the question q, we can add additional content to the input q to elicit a better answer
  • This process is called prompt engineering

  • In the prompt, we can include a few examples as a small labeled dataset
  • The model can then learn our objective even without updating its weights

  • This capability is called in-context learning (ICL)

28 of 54

In-Context Learning Ability of LLMs

  • The examples imply to the LLM what task we want (even without explicit specification)

How does in-context learning work? A framework for understanding the differences from traditional supervised learning. Sang Michael Xie and Sewon Min

29 of 54

Using Prompts to Elicit the Answer

  • Prompts for sentiment classification:

  • Prompts for named entity recognition:

I missed the bus today. I felt so ___________

Mike went to Paris. [Paris] is a [location] entity.

30 of 54

Prompt engineering

  • Designing the prompt that results in the most effective performance on the downstream task

  • Categories of prompts:
    • Cloze-style prompts: ask the LM to fill in a blank, e.g., Mike went to Paris. [Paris] is a [location] entity. More suitable for text classification.
    • Prefix-style prompts: ask the LM to complete a sequence, e.g., English: I missed the bus today. French: _________. More suitable for text generation tasks.

  • Prompts can be created manually
    • However, this process is an art that takes time and experience
    • Even experienced prompt engineers may fail to manually discover the optimal prompt

31 of 54

Prompt generation

  • Prompting by demonstration:

Making pre-trained language models better few-shot learners. Gao et al. 2020

32 of 54

Prompt generation

  • Using T5 for prompt generation:

    • Define input and output template pairs, e.g., input: Thank you <X> me to your party <Y> week, output: <X> for inviting <Y> last <Z>
    • This teaches T5 that the output span following <X> fills the <X> slot in the input, and likewise for <Y>
    • During decoding, replace <X> with <S1> and <Y> with [MASK], which makes [MASK] the target for generating the sentiment word

Making pre-trained language models better few-shot learners. Gao et al. 2020

33 of 54

AutoPrompt: Gradient based prompt generation

  • Create a task-specific prompt with a collection of trigger words

  • The same trigger words are used for all examples, learned by maximizing the likelihood of the sentiment labels on the training examples

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020

34 of 54

AutoPrompt: Gradient based prompt generation

  • How to select words for the prompt:
    • Step 1: train a classifier to predict the class label using the contextualized embedding of the [MASK] as input:

    • Step 2: substitute h_i with the MLM’s output word embeddings to obtain a score s(y, w), and the set of labelled tokens are constructed from the k-highest scoring words
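The two steps can be sketched with toy vectors (the embeddings and class weights below are hypothetical stand-ins for the MLM's contextual and output embeddings):

```python
# Step 1 stand-in: a trained weight vector per class that scores the [MASK]
# embedding. Step 2: score each vocabulary word's output embedding against
# the class weights and keep the k highest-scoring words as label tokens.
class_weight = [1.0, -0.5, 0.2]                 # hypothetical trained weights for class y

word_embeddings = {                              # hypothetical MLM output embeddings
    "great":    [0.9, -0.4, 0.1],
    "terrible": [-0.8, 0.6, -0.2],
    "table":    [0.0, 0.1, 0.0],
}

def score(w):                                    # s(y, w) = <class_weight, emb(w)>
    return sum(a * b for a, b in zip(class_weight, word_embeddings[w]))

k = 1
label_tokens = sorted(word_embeddings, key=score, reverse=True)[:k]
```

Here the positive class selects "great" as its label token, mirroring how AutoPrompt picks label words without any human-written verbalizer.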

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020

35 of 54

Chain-of-thoughts Prompting

  • Use in-context examples that include reasoning steps to elicit reasoning steps from the model, which improves performance
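A typical few-shot CoT prompt, following the paper's well-known arithmetic example:

```python
# A one-shot chain-of-thought prompt: the demonstration shows the reasoning
# steps, and the model is expected to produce similar steps for the new question.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A:"""
```

Without the worked reasoning in the demonstration, models tend to jump straight to a (often wrong) final number; with it, they emit intermediate steps first.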

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022

36 of 54

  • Other applications of CoT

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022

37 of 54

Self-Consistency
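Self-consistency samples several chain-of-thought completions and takes a majority vote over their final answers; a minimal sketch (the sampled answers are hypothetical):

```python
from collections import Counter

# Final answers parsed from several sampled chain-of-thought completions
# (hypothetical model outputs for a single math question).
sampled_answers = ["18", "18", "26", "18", "9"]

# Marginalize over reasoning paths by majority vote on the final answer.
final_answer, votes = Counter(sampled_answers).most_common(1)[0]
```

The vote marginalizes out individual faulty reasoning paths, which is why self-consistency improves over a single greedy chain of thought.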

Self-consistency improves chain of thought reasoning in language models. Wang et al. 2022

38 of 54

Self-Ask

  • Use prompts such as “Follow up:” to elicit intermediate follow-up questions and improve the answer

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023

39 of 54

Self-Ask

  • Interplay between GPT and search engine to collaboratively improve the answer

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023

40 of 54

Instruction Fine-Tuning

  • Training language models to follow “instructions”, i.e., to answer an imperative sentence

  • Why instruction fine-tuning?
    • Makes the model follow instructions (better serves users)
    • Makes evaluation criteria clearer to define: informativeness, truthfulness
    • Better aligns with fine-grained user needs

GPT: The sky is blue. The water is clear.

Instruct-GPT: Write a poem containing seven sentences. ______

Training language models to follow instructions with human feedback. Ouyang et al. 2022

41 of 54

Instruction Fine-Tuning

  • Step 2: use a human labeler to select the winning answer between two model outputs
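In the cited paper, these pairwise comparisons train a reward model with a ranking loss; a sketch of that loss (the reward values below are illustrative):

```python
import math

def pairwise_loss(r_winner, r_loser):
    """Reward-model ranking loss on a pairwise comparison:
    -log sigmoid(r_winner - r_loser)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_winner - r_loser))))

# The loss is small when the reward model already ranks the human-preferred
# answer higher, and large when it ranks the loser higher.
good_ordering = pairwise_loss(2.0, 0.5)   # winner scored higher: small loss
bad_ordering = pairwise_loss(0.5, 2.0)    # winner scored lower: large loss
```

Minimizing this loss pushes the reward model to reproduce the labelers' rankings, and the resulting reward then guides the RL fine-tuning step.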

Training language models to follow instructions with human feedback. Ouyang et al. 2022

42 of 54

Instruction Fine-Tuning

Training language models to follow instructions with human feedback. Ouyang et al. 2022

smaller InstructGPT > larger GPT

Gray bars: truthfulness only; colored bars: informativeness and truthfulness

43 of 54

Self-Instruct: Aligning LM with Self-Generated Instructions

  • How to perform instruction fine-tuning without large labeled data? Use GPT-3 to generate instruction-following training data via few-shot demonstrations

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

44 of 54

Self-Instruct: Aligning LM with Self-Generated Instructions

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

45 of 54

Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca: A Strong, Replicable Instruction-Following Model. Taori et al. 2023

46 of 54

Structured Prompting: Scaling Number of Demonstrations

  • In-context learning is constrained by the input length (LLaMA: 2k tokens, LLaMA 2: 4k tokens)
  • Scaling up demonstrations: encode the demonstrations separately

Structured Prompting: Scaling In-Context Learning to 1,000 Examples. Hao et al. 2022

47 of 54

Explanation Improves Few Shot Prompting

  • Explanations help improve the accuracy of QA
    • But the accuracy depends on the quality of the explanations and examples

Can language models learn from explanations in context?. Lampinen et al. 2022

48 of 54

Explanation Improves Fine-Tuning of LLM

  • Fine tuning on 4k examples over 41 hate classes using Curie (6.7B) and Ada (350M):
    • Method 1: hate speech -> short class name
    • Method 2: hate speech -> long description of the class

Testing Hate Speech against Policies. Zheng et al. 2023

49 of 54

An Explanation of In-context Learning as Implicit Bayesian Inference

An Explanation of In-context Learning as Implicit Bayesian Inference. Xie et al. 2021

50 of 54

Using Retrieval to Improve ICL

  • Retrieve similar questions, then aggregate to obtain the final answer
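A minimal sketch of retrieving demonstrations by word overlap (Jaccard similarity); note that the cited paper learns a neural retriever rather than using a fixed similarity, and the example pool below is hypothetical:

```python
def jaccard(a, b):
    """Word-overlap similarity between two whitespace-tokenized strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Hypothetical pool of labeled demonstrations.
pool = [
    "what is the capital of france ?",
    "how do i sort a list in python ?",
    "what is the capital of italy ?",
]

query = "what is the capital of spain ?"
# Retrieve the demonstrations most similar to the query to prepend to the prompt.
demos = sorted(pool, key=lambda d: jaccard(query, d), reverse=True)[:2]
```

Retrieved demonstrations that match the query's task and format give the LLM much stronger in-context signal than randomly chosen ones.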

Learning To Retrieve Prompts for In-Context Learning. Rubin et al. 2021

51 of 54

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

52 of 54

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

53 of 54

54 of 54

Summary

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning
