1 of 56


CS 589 Lecture 11

Monday 6:30-9:00

Kidde 228

Automated machine learning (cont.)

Parameter Efficient Fine Tuning

In-context Learning

photo: https://www.scubedstudios.com/information-retrieval/

ICL slides are adapted from Stanford CS 224U: https://web.stanford.edu/class/cs224u/slides/cs224u-incontextlearning-2023-handout.pdf

2 of 56


3 of 56

Recap: GPT: Left-to-right decoder using the Transformer arch.

  • GPT also builds on the Transformer architecture
    • But it uses only the decoder (masked multi-head attention) to allow autoregressive generation for seq2seq tasks
    • Masked multi-head attention prevents the decoder from attending to tokens to its right (future tokens)

  • Framework:
    • Unsupervised pre-training: on BooksCorpus + the 1B Word Benchmark
    • Fine-tuning: QA, commonsense reasoning, textual entailment


Improving Language Understanding by Generative Pre-Training. Radford et al. 2018

4 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


5 of 56

Hyperparameter optimization

  • Definition of hyperparameter tuning: find

    λ* = argmin_{λ ∈ Λ} L(λ; D_{train}, D_{valid})

  • λ: a hyperparameter configuration
  • Λ: search space for the hyperparameter λ
  • D_{train}: training dataset
  • D_{valid}: validation dataset
  • L: loss function (objective), evaluated on D_{valid} after training on D_{train} with λ


6 of 56

FLAML: A Fast Library for Automated Machine Learning & Tuning

7 of 56

FLAML: AutoML vs Tune

8 of 56

FLAML: Tune User-Defined Function

  • config: search space
  • metric: metric to optimize
  • mode: maximize or minimize
  • num_samples: number of trials
  • n_concurrent_trials: number of concurrent trials
  • resources_per_trial: e.g., {“gpu”: 1}
  • low_cost_partial_config: low-cost starting point (see the sketch below)

source: https://github.com/microsoft/FLAML/blob/main/flaml/tune/tune.py#L202
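
A minimal sketch of tuning a user-defined function with flaml.tune; the objective and search space here are made-up placeholders, and the argument set reflects a recent FLAML version:

    from flaml import tune

    # Hypothetical objective: evaluate a sampled configuration, report a score.
    def evaluate_config(config):
        score = (config["x"] - 0.5) ** 2 + config["y"]
        return {"score": score}

    analysis = tune.run(
        evaluate_config,
        config={
            "x": tune.uniform(0, 1),         # continuous hyperparameter
            "y": tune.lograndint(1, 32),     # integer hyperparameter, log-scaled
        },
        metric="score",
        mode="min",                          # minimize the metric
        num_samples=20,                      # number of trials
        low_cost_partial_config={"y": 1},    # cheap starting point for CFO/BlendSearch
    )
    print(analysis.best_config)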

9 of 56

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

  • To avoid high-cost points until necessary -> low-cost starting point + local search
  • To find low-loss points -> follow loss-descent directions
  • Gradient-based methods cannot be used directly: no gradient is available
  • Surprise: function values are enough to find the right directions for efficient convergence
  • More surprising: simply comparing which function value is larger is enough!!

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021

10 of 56

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

Repeat the following steps after each move (a code sketch follows):

1. Uniformly sample a direction u from a local unit sphere;

2. Compare the loss at the proposed point with the current loss;

3. Move there (and break) if it improves, or try the opposite direction;

4. Move there (and break) if it improves, or stay.

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021
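
A minimal sketch of one such zeroth-order step, assuming a continuous configuration vector x and a fixed step size delta (the actual FLOW2 update also adapts the step size and handles discrete dimensions):

    import numpy as np

    def local_search_step(f, x, delta, rng):
        # 1. Uniformly sample a direction on the unit sphere
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        fx = f(x)
        # 2.-3. Compare function values; move if the proposed point improves
        if f(x + delta * u) < fx:
            return x + delta * u
        # 3. Otherwise try the opposite direction
        if f(x - delta * u) < fx:
            return x - delta * u
        # 4. Otherwise stay
        return x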

11 of 56

BlendSearch: Combining Local + Global Search [ICLR’22]

  • Local search (LS): low cost, but may get trapped in local optima
  • Global search: able to explore the whole space, but high cost

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

12 of 56

BlendSearch: Combining Local + Global Search [ICLR’22]

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

13 of 56

HW4: Hyperparameter Tuning for HW3

  • In HW3, the hyperparameters are given:

  • In HW4, you need to search for the learning rate and batch size (a hypothetical search space is sketched below):
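
For illustration only, a hypothetical flaml.tune search space for these two hyperparameters (the ranges are placeholders, not the assignment's official specification):

    from flaml import tune

    config = {
        "learning_rate": tune.loguniform(1e-5, 1e-1),  # search on a log scale
        "batch_size": tune.choice([16, 32, 64, 128]),  # categorical choices
    }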

14 of 56

15 of 56

FLAML: Resources

16 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


17 of 56

First paradigm shift in NLP (2017)

Training -> predict

Pre-training -> fine tuning -> predict


18 of 56

Second paradigm shift in NLP (2021)

Pre-training -> fine tuning -> predict

Pre-training -> prompting -> predict

19 of 56

Fine-Tuning Has Good Performance, but…

  • Challenges of updating the full model weights
    • New base models are trained every few months: retraining all 175B parameters of GPT-3 each time?
    • Continual learning suffers from catastrophic forgetting

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

20 of 56

How to Improve over Fine Tuning?

  • Strategy 1: Avoid updating the full model weights
    • Parameter-efficient fine-tuning (PEFT)

  • Strategy 2: Avoid updating any weights at all
    • In-context learning (ICL): only tune the input to the model, i.e., prompt tuning

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

21 of 56

Prefix Tuning: Optimizing continuous prompts

  • Continuous prompts
    • Prepend continuous vectors to the input and output pairs and use them as the continuous prompt

Prefix-tuning: optimizing continuous prompts for generation. Li and Liang, 2021
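
A simplified PyTorch sketch of the idea in its lighter "prompt tuning" flavor, where learned virtual-token embeddings are prepended to the input (the actual method of Li and Liang prepends learned key/value prefixes at every attention layer); the base model interface and dimensions are assumptions:

    import torch
    import torch.nn as nn

    class PrefixTunedModel(nn.Module):
        def __init__(self, base_model, prefix_len=10, d_model=768):
            super().__init__()
            self.base = base_model
            for p in self.base.parameters():
                p.requires_grad = False             # freeze the pretrained weights
            # The continuous prompt: prefix_len learned "virtual token" embeddings
            self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, d_model))

        def forward(self, input_embeds):            # input_embeds: (B, T, d_model)
            B = input_embeds.size(0)
            prefix = self.prefix.unsqueeze(0).expand(B, -1, -1)
            # Only self.prefix receives gradients during task tuning
            return self.base(torch.cat([prefix, input_embeds], dim=1))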

22 of 56

Prefix-tuning: lightweight fine-tuning

  • Prefix tuning also enables lightweight fine-tuning
    • Fine-tuning large language models is costly!
    • GPT-J-6B alone is ~22 GB!

  • With prefix tuning, we only need to tune the prefix for each task, which greatly reduces the number of parameters to be tuned
    • With 0.1% of the parameters, it obtains performance comparable to full-parameter fine-tuning of GPT-2 and BART

Prefix-tuning: optimizing continuous prompts for generation. Li and Liang, 2021

23 of 56

Adapter: Parameter-Efficient Transfer Learning for NLP

  • Adds only a small number of parameters
  • New task: add a few new parameters without revising previous ones
  • As a result, it overcomes catastrophic forgetting (a sketch of the adapter block follows)

source: Parameter-Efficient Transfer Learning for NLP. Houlsby et al. 2019
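
A minimal sketch of the bottleneck adapter block of Houlsby et al. (down-projection, nonlinearity, up-projection, residual connection), inserted after each frozen transformer sublayer; the sizes are illustrative:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, d_model=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)  # project down to a small dim
            self.up = nn.Linear(bottleneck, d_model)    # project back up
            self.act = nn.GELU()

        def forward(self, h):
            # Residual connection: the block starts near the identity mapping,
            # so inserting it barely disturbs the pretrained computation
            return h + self.up(self.act(self.down(h)))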

24 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Disadvantage of adapters:
    • Added inference latency
  • Disadvantage of prefix-tuning:
    • Difficult to optimize
  • Proposed approach:
    • Use a low-rank matrix decomposition to learn the parameter update

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021

25 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Pretrained weights: W_0 ∈ R^{d×k}
  • Fine-tuned weights: W_0 + ΔW
  • Approximate ΔW as the product of two much smaller matrices: ΔW = BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, and r ≪ min(d, k); only A and B are trained

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021
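
A minimal sketch of wrapping a frozen linear layer with a LoRA update, following the paper's parameterization (the rank r and scaling alpha below are illustrative defaults):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, linear: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = linear
            self.base.weight.requires_grad = False         # freeze W_0
            d_out, d_in = linear.weight.shape
            self.A = nn.Parameter(0.01 * torch.randn(r, d_in))
            self.B = nn.Parameter(torch.zeros(d_out, r))   # zero init: start exactly at W_0
            self.scale = alpha / r

        def forward(self, x):
            # W_0 x + (alpha/r) * B A x, where only A and B are trained
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)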

26 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Results:
    • Comparable to or outperforms fine-tuning and adapters
  • Another empirical study shows:
    • FT > LoRA > Adapter > Prefix Tuning

source: Parameter-efficient fine-tuning of large-scale pre-trained language models. Ding et al. 2022

27 of 56

Resources for Parameter Efficient Fine Tuning

28 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


29 of 56

In-context learning

  • The weights of large language models (e.g., GPT-3) are hard to update

  • When using an LLM for certain tasks, e.g., question answering, instead of inputting only the question q, we can add additional content to the input q to elicit a better answer
  • This process is called prompt engineering

  • In the prompt, we can include a few examples that serve as a labeled dataset
  • The model can then learn our objective even without updating its weights

  • This capability is called in-context learning (ICL); a hypothetical example follows
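
For instance, a hypothetical few-shot sentiment prompt: the two labeled demonstrations tell the model what task to perform, and it completes the third label with no weight updates:

    prompt = """Review: The movie was fantastic! Sentiment: positive
    Review: I wasted two hours of my life. Sentiment: negative
    Review: The acting was superb and the plot was gripping. Sentiment:"""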

30 of 56

In-Context Learning Ability of LLMs

  • The examples imply to the LLM what task we are looking for (even without an explicit specification)

How does in-context learning work? A framework for understanding the differences from traditional supervised learning. Sang Michael Xie and Sewon Min

31 of 56

Using Prompts to Elicit the Answer

  • Prompts for sentiment classification:

  • Prompts for named entity recognition:

I missed the bus today. I felt so ___________

Mike went to Paris. [Paris] is a [location] entity.

32 of 56

Prompt engineering

  • Designing the prompt that results in the most effective performance on the downstream task

  • Categories of prompt:
    • Cloze-style prompts: prompts that ask the LM to fill in a blank, e.g., “Mike went to Paris. [Paris] is a [location] entity.” More suitable for text classification
    • Prefix-style prompts: prompts that ask the LM to complete a sequence, e.g., “English: I missed the bus today. French: _________”. More suitable for text generation tasks

  • Prompts can be created manually
    • However, this process is an art that takes time and experience
    • Even experienced prompt engineers may fail to manually discover the optimal prompt

33 of 56

Prompt generation

  • Prompting by demonstration:

Making pre-trained language models better few-shot learners. Gao et al. 2020

34 of 56

Prompt generation

  • Using T5 for prompt generation:

    • Define input/output template pairs, e.g., input: Thank you <X> me to your party <Y> week, output: <X> for inviting <Y> last <Z>
    • This teaches T5 that the output span following <X> fills the <X> slot in the input, and the span following <Y> fills the <Y> slot
    • During decoding, the input sentence <S1> and the label word at [MASK] are held fixed, and T5 generates the template words at the sentinel positions, making [MASK] the target slot for the sentiment word

Making pre-trained language models better few-shot learners. Gao et al. 2020

35 of 56

AutoPrompt: Gradient based prompt generation

  • Creates a task-specific prompt by inserting a collection of trigger words

  • The same trigger words are used for all examples; they are learned by maximizing the likelihood of the sentiment labels on the training examples (see the sketch below)

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020
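
A simplified sketch of the gradient-guided trigger search: each vocabulary word is scored by a first-order estimate of how much swapping it into a trigger slot would increase the label likelihood (the variable names are assumptions):

    import torch

    def candidate_triggers(grad_at_slot, embedding_matrix, k=10):
        # grad_at_slot: (d,) gradient of log p(y | prompt) w.r.t. the input
        #   embedding currently sitting in the trigger slot
        # embedding_matrix: (V, d) input word-embedding table
        scores = embedding_matrix @ grad_at_slot   # first-order improvement estimate
        return torch.topk(scores, k).indices       # top-k candidate token ids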

36 of 56

AutoPrompt: Gradient based prompt generation

  • How to select label words for the prompt:
    • Step 1: train a classifier to predict the class label using the contextualized embedding h_i of [MASK] as input

    • Step 2: substitute h_i with the MLM’s output word embeddings to obtain a score s(y, w); the set of label tokens is constructed from the k highest-scoring words

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020

37 of 56

Chain-of-thoughts Prompting

  • Use in-context examples that include reasoning steps to elicit the model to output its own reasoning steps, which improves performance (an example prompt follows)

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022
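
The canonical demonstration style from the paper, written here as a prompt string: the worked example spells out intermediate reasoning before the final answer, and the model imitates that format on the new question:

    cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
    balls. Each can has 3 tennis balls. How many tennis balls does he have now?
    A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
    balls. 5 + 6 = 11. The answer is 11.

    Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
    more, how many apples do they have?
    A:"""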

38 of 56

  • Other applications of CoT

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022

39 of 56

Self-Consistency

Self-consistency improves chain of thought reasoning in language models. Wang et al. 2022
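
Self-consistency samples several diverse chain-of-thought completions (temperature > 0) and takes a majority vote over the final answers. A minimal sketch, where sample_fn and extract_answer are hypothetical callables supplied by the caller:

    from collections import Counter

    def self_consistent_answer(sample_fn, extract_answer, prompt, n=10):
        # Sample n reasoning paths, parse out each final answer,
        # and return the most common one.
        answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]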

40 of 56

Self-Ask

  • Use prompts such as “Follow up:” to elicit reasoning that improves the answer (an example in this style follows)

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023
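
A prompt in the self-ask style, adapted from the paper's running example (treat the exact wording as an approximation):

    self_ask_prompt = """Question: Who lived longer, Theodor Haecker or Harry
    Vaughan Watkins?
    Are follow up questions needed here: Yes.
    Follow up: How old was Theodor Haecker when he died?
    Intermediate answer: Theodor Haecker was 65 years old when he died.
    Follow up: How old was Harry Vaughan Watkins when he died?
    Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
    So the final answer is: Harry Vaughan Watkins.

    Question: """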

41 of 56

Self-Ask

  • Interplay between GPT and a search engine to collaboratively improve the answer

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023

42 of 56

Instruction Fine-Tuning

  • Training language models to follow “instructions”, i.e., to answer an imperative sentence

  • Why instruction fine-tuning?
    • Makes the model follow instructions (better serves users)
    • Makes evaluation criteria clearer to define: informativeness, truthfulness
    • Better aligns the model with fine-grained user needs

GPT is trained to continue text, e.g., “The sky is blue. The water is clear. …”

InstructGPT instead follows an imperative prompt: “Write a poem containing seven sentences. ______”

Training language models to follow instructions with human feedback. Ouyang et al. 2022

43 of 56

Instruction Fine-Tuning

  • Step 2: use human labelers to select the winning answer between two candidate outputs

Training language models to follow instructions with human feedback. Ouyang et al. 2022

44 of 56

Instruction Fine-Tuning

Training language models to follow instructions with human feedback. Ouyang et al. 2022

smaller InstructGPT > larger GPT

Gray bars: truthfulness; colored bars: truthfulness and informativeness

45 of 56

Self-Instruct: Aligning LM with Self-Generated Instructions

  • How can we perform instruction fine-tuning without large labeled data? Use GPT-3 to generate instruction-following training data via few-shot demonstrations

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

46 of 56

Self-Instruct: Aligning LM with Self-Generated Instructions

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

47 of 56

Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca: A Strong, Replicable Instruction-Following Model. Taori et al. 2023

48 of 56

Structured Prompting: Scaling Number of Demonstrations

  • In-context learning is constrained by the input length (LLaMA: 2k tokens; Llama 2: 4k tokens)
  • Scaling up the number of demonstrations: encode the demonstrations separately

Structured Prompting: Scaling In-Context Learning to 1,000 Examples. Hao et al. 2022

49 of 56

Explanation Improves Few Shot Prompting

  • Explanations help improve the accuracy of QA
    • But the accuracy depends on the quality of the explanations and examples

Can language models learn from explanations in context?. Lampinen et al. 2022

50 of 56

Explanation Improves Fine-Tuning of LLM

  • Fine-tuning on 4k examples across 41 hate classes using curie (6.7B) and ada (350M):
    • Method 1: Hate speech -> short class name
    • Method 2: Hate speech -> long description of class

Testing Hate Speech against Policies. Zheng et al. 2023

51 of 56

An Explanation of In-context Learning as Implicit Bayesian Inference

An Explanation of In-context Learning as Implicit Bayesian Inference. Xie et al. 2021

52 of 56

Using Retrieval to Improve ICL

  • Retrieve similar questions and aggregate them to obtain the final answer (a simplified sketch follows)

Learning To Retrieve Prompts for In-Context Learning. Rubin et al. 2021
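
A simplified sketch of the idea: embed the test input, retrieve the most similar labeled examples, and assemble the prompt from them (Rubin et al. train a dedicated prompt retriever; the plain cosine similarity here is a stand-in):

    import numpy as np

    def build_prompt(query_vec, example_vecs, examples, k=4):
        # Cosine similarity between the query and every candidate demonstration
        sims = (example_vecs @ query_vec) / (
            np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec))
        top = np.argsort(-sims)[:k]        # indices of the k most similar examples
        return "\n\n".join(examples[i] for i in top)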

53 of 56

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

54 of 56

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

55 of 56

56 of 56

Summary

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning
