1 of 56


CS 589 Lecture 11

Monday 6:30-9:00

Kidde 228

Automated machine learning (cont.)

Parameter Efficient Fine Tuning

In-context Learning

photo: https://www.scubedstudios.com/information-retrieval/

ICL slides are adapted from Stanford CS 224U: https://web.stanford.edu/class/cs224u/slides/cs224u-incontextlearning-2023-handout.pdf

2 of 56


3 of 56

Recap: GPT: Left-to-right decoder using the Transformer arch.

  • GPT also builds on the Transformer architecture
    • But it uses only the decoder (masked multi-head attention) to allow autoregressive generation for seq2seq tasks
    • Masked multi-head attention prevents the decoder from attending to tokens to its right (future tokens)

  • Framework:
    • Unsupervised pre-training: on BooksCorpus + the 1B Word Benchmark
    • Fine-tuning: QA, commonsense reasoning, textual entailment


Improving Language Understanding by Generative Pre-Training. Radford et al. 2018

4 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


5 of 56

Hyperparameter optimization

  • Definition of hyperparameter tuning: find

    λ* = argmin_{λ ∈ Λ} L(λ; D_{train}, D_{valid})

  • λ: a hyperparameter configuration
  • Λ: search space for the hyperparameter λ
  • D_{train}: training dataset
  • D_{valid}: validation dataset
  • L: loss function (objective), evaluated on D_{valid} after training on D_{train} with λ


6 of 56

FLAML: A Fast Library for Automated Machine Learning & Tuning

7 of 56

FLAML: AutoML vs Tune

8 of 56

FLAML: Tune User-Defined Function

  • config: search space
  • metric: metric to optimize
  • mode: maximize or minimize
  • num_samples: number of trials
  • n_concurrent_trials: number of concurrent trials
  • resources_per_trial: e.g., {“gpu”: 1}
  • low_cost_partial_config: low-cost starting point (see the sketch below)

source: https://github.com/microsoft/FLAML/blob/main/flaml/tune/tune.py#L202
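
A minimal sketch of tuning a user-defined function with flaml.tune; the objective and search space here are made-up placeholders, and the argument set reflects a recent FLAML version:

    from flaml import tune

    # Hypothetical objective: evaluate a sampled configuration, report a score.
    def evaluate_config(config):
        score = (config["x"] - 0.5) ** 2 + config["y"]
        return {"score": score}

    analysis = tune.run(
        evaluate_config,
        config={
            "x": tune.uniform(0, 1),         # continuous hyperparameter
            "y": tune.lograndint(1, 32),     # integer hyperparameter, log-scaled
        },
        metric="score",
        mode="min",                          # minimize the metric
        num_samples=20,                      # number of trials
        low_cost_partial_config={"y": 1},    # cheap starting point for CFO/BlendSearch
    )
    print(analysis.best_config)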

9 of 56

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

  • To avoid high-cost points until necessary -> low-cost starting point + local search
  • To find low-loss points -> follow loss-descent directions
  • Gradient-based methods cannot be used directly: no gradient is available
  • Surprise: function values are enough to find the right directions for efficient convergence
  • More surprising: simply comparing which function value is larger is enough!!

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021

10 of 56

CFO: A Cost-Frugal HPO Algorithm [AAAI’21]

Repeat the following steps after each move (a code sketch follows):

1. Uniformly sample a direction u from a local unit sphere;

2. Compare the loss at the proposed point with the current loss;

3. Move there (and break) if it improves, or try the opposite direction;

4. Move there (and break) if it improves, or stay.

source: FLAML KDD 2022 Tutorial Frugal Optimization for Cost-related Hyperparameters. Wu et al. 2021
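
A minimal sketch of one such zeroth-order step, assuming a continuous configuration vector x and a fixed step size delta (the actual FLOW2 update also adapts the step size and handles discrete dimensions):

    import numpy as np

    def local_search_step(f, x, delta, rng):
        # 1. Uniformly sample a direction on the unit sphere
        u = rng.normal(size=x.shape)
        u /= np.linalg.norm(u)
        fx = f(x)
        # 2.-3. Compare function values; move if the proposed point improves
        if f(x + delta * u) < fx:
            return x + delta * u
        # 3. Otherwise try the opposite direction
        if f(x - delta * u) < fx:
            return x - delta * u
        # 4. Otherwise stay
        return x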

11 of 56

BlendSearch: Combining Local + Global Search [ICLR’22]

  • Local search (LS): low cost, but may get trapped in local optima
  • Global search: able to explore the whole space, but high cost

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

12 of 56

BlendSearch: Combining Local + Global Search [ICLR’22]

source: FLAML KDD 2022 Tutorial ECONOMIC HYPERPARAMETER OPTIMIZATION WITH BLENDED SEARCH STRATEGY. Wang et al. 2022

13 of 56

HW4: Hyperparameter Tuning for HW3

  • In HW3, the hyperparameters are given:

  • In HW4, you need to search for the learning rate and batch size (a hypothetical search space is sketched below):
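
For illustration only, a hypothetical flaml.tune search space for these two hyperparameters (the ranges are placeholders, not the assignment's official specification):

    from flaml import tune

    config = {
        "learning_rate": tune.loguniform(1e-5, 1e-1),  # search on a log scale
        "batch_size": tune.choice([16, 32, 64, 128]),  # categorical choices
    }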

14 of 56

15 of 56

FLAML: Resources

16 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


17 of 56

First paradigm shift in NLP (2017)

Training -> predict

Pre-training -> fine tuning -> predict


18 of 56

Second paradigm shift in NLP (2021)

Pre-training -> fine tuning -> predict

Pre-training -> prompting -> predict

19 of 56

Fine-Tuning Has Good Performance, but…

  • Challenges of updating the full model weights
    • New base models are trained every few months: retraining all 175B parameters of GPT-3 each time?
    • Continual learning suffers from catastrophic forgetting

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

20 of 56

How to Improve over Fine Tuning?

  • Strategy 1: Avoid updating the full model weights
    • Parameter-efficient fine-tuning (PEFT)

  • Strategy 2: Avoid updating any weights at all
    • In-context learning (ICL): only tune the input to the model, i.e., prompt tuning

source: https://www.semanticscholar.org/paper/Localizing-Catastrophic-Forgetting-in-Neural-Wiewel-Yang/e5e33640ccf7de93b963da0a4719499d05b84b6b

21 of 56

Prefix Tuning: Optimizing continuous prompts

  • Continuous prompts
    • Prepend continuous vectors to the input and output pairs and use them as the continuous prompt

Prefix-tuning: optimizing continuous prompts for generation. Li and Liang, 2021
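
A simplified PyTorch sketch of the idea in its lighter "prompt tuning" flavor, where learned virtual-token embeddings are prepended to the input (the actual method of Li and Liang prepends learned key/value prefixes at every attention layer); the base model interface and dimensions are assumptions:

    import torch
    import torch.nn as nn

    class PrefixTunedModel(nn.Module):
        def __init__(self, base_model, prefix_len=10, d_model=768):
            super().__init__()
            self.base = base_model
            for p in self.base.parameters():
                p.requires_grad = False             # freeze the pretrained weights
            # The continuous prompt: prefix_len learned "virtual token" embeddings
            self.prefix = nn.Parameter(0.02 * torch.randn(prefix_len, d_model))

        def forward(self, input_embeds):            # input_embeds: (B, T, d_model)
            B = input_embeds.size(0)
            prefix = self.prefix.unsqueeze(0).expand(B, -1, -1)
            # Only self.prefix receives gradients during task tuning
            return self.base(torch.cat([prefix, input_embeds], dim=1))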

22 of 56

Prefix-tuning: lightweight fine-tuning

  • Prefix tuning also enables lightweight fine-tuning
    • Fine-tuning large language models is costly!
    • GPT-J-6B alone is ~22 GB!

  • With prefix tuning, we only need to tune the prefix for each task, which greatly reduces the number of parameters to be tuned
    • With 0.1% of the parameters, it obtains performance comparable to full-parameter fine-tuning of GPT-2 and BART

Prefix-tuning: optimizing continuous prompts for generation. Li and Liang, 2021

23 of 56

Adapter: Parameter-Efficient Transfer Learning for NLP

  • Adds only a small number of parameters
  • New task: add a few new parameters without revising previous ones
  • As a result, it overcomes catastrophic forgetting (a sketch of the adapter block follows)

source: Parameter-Efficient Transfer Learning for NLP. Houlsby et al. 2019
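
A minimal sketch of the bottleneck adapter block of Houlsby et al. (down-projection, nonlinearity, up-projection, residual connection), inserted after each frozen transformer sublayer; the sizes are illustrative:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, d_model=768, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)  # project down to a small dim
            self.up = nn.Linear(bottleneck, d_model)    # project back up
            self.act = nn.GELU()

        def forward(self, h):
            # Residual connection: the block starts near the identity mapping,
            # so inserting it barely disturbs the pretrained computation
            return h + self.up(self.act(self.down(h)))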

24 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Disadvantage of adapters:
    • Added inference latency
  • Disadvantage of prefix-tuning:
    • Difficult to optimize
  • Proposed approach:
    • Use a low-rank matrix decomposition to learn the parameter update

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021

25 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Pretrained weights: W_0 ∈ R^{d×k}
  • Fine-tuned weights: W_0 + ΔW
  • Approximate ΔW as the product of two much smaller matrices: ΔW = BA, where B ∈ R^{d×r}, A ∈ R^{r×k}, and r ≪ min(d, k); only A and B are trained

source: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS. Hu et al. 2021
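
A minimal sketch of wrapping a frozen linear layer with a LoRA update, following the paper's parameterization (the rank r and scaling alpha below are illustrative defaults):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, linear: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = linear
            self.base.weight.requires_grad = False         # freeze W_0
            d_out, d_in = linear.weight.shape
            self.A = nn.Parameter(0.01 * torch.randn(r, d_in))
            self.B = nn.Parameter(torch.zeros(d_out, r))   # zero init: start exactly at W_0
            self.scale = alpha / r

        def forward(self, x):
            # W_0 x + (alpha/r) * B A x, where only A and B are trained
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)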

26 of 56

LoRA: LOW-RANK ADAPTATION OF LARGE LANGUAGE MODELS

  • Results:
    • Comparable to or outperforms fine-tuning and adapters
  • Another empirical study shows:
    • FT > LoRA > Adapter > Prefix Tuning

source: Parameter-efficient fine-tuning of large-scale pre-trained language models. Ding et al. 2022

27 of 56

Resources for Parameter Efficient Fine Tuning

28 of 56

Lecture 11

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning


29 of 56

In-context learning

  • The weights of large language models (e.g., GPT-3) are hard to update

  • When using an LLM for certain tasks, e.g., question answering, instead of inputting only the question q, we can add additional content to the input q to elicit a better answer
  • This process is called prompt engineering

  • In the prompt, we can include a few examples that serve as a labeled dataset
  • The model can then learn our objective even without updating its weights

  • This capability is called in-context learning (ICL); a hypothetical example follows
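
For instance, a hypothetical few-shot sentiment prompt: the two labeled demonstrations tell the model what task to perform, and it completes the third label with no weight updates:

    prompt = """Review: The movie was fantastic! Sentiment: positive
    Review: I wasted two hours of my life. Sentiment: negative
    Review: The acting was superb and the plot was gripping. Sentiment:"""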

30 of 56

In-Context Learning Ability of LLMs

  • The examples imply to the LLM what task we are looking for (even without an explicit specification)

How does in-context learning work? A framework for understanding the differences from traditional supervised learning. Sang Michael Xie and Sewon Min

31 of 56

Using Prompts to Elicit the Answer

  • Prompts for sentiment classification:

  • Prompts for named entity recognition:

I missed the bus today. I felt so ___________

Mike went to Paris. [Paris] is a [location] entity.

32 of 56

Prompt engineering

  • Designing the prompt that results in the most effective performance on the downstream task

  • Categories of prompt:
    • Cloze-style prompts: prompts that ask the LM to fill in a blank, e.g., “Mike went to Paris. [Paris] is a [location] entity.” More suitable for text classification
    • Prefix-style prompts: prompts that ask the LM to complete a sequence, e.g., “English: I missed the bus today. French: _________”. More suitable for text generation tasks

  • Prompts can be created manually
    • However, this process is an art that takes time and experience
    • Even experienced prompt engineers may fail to manually discover the optimal prompt

33 of 56

Prompt generation

  • Prompting by demonstration:

Making pre-trained language models better few-shot learners. Gao et al. 2020

34 of 56

Prompt generation

  • Using T5 for prompt generation:

    • Define input/output template pairs, e.g., input: Thank you <X> me to your party <Y> week, output: <X> for inviting <Y> last <Z>
    • This teaches T5 that the output span following <X> fills the <X> slot in the input, and the span following <Y> fills the <Y> slot
    • During decoding, the input sentence <S1> and the label word at [MASK] are held fixed, and T5 generates the template words at the sentinel positions, making [MASK] the target slot for the sentiment word

Making pre-trained language models better few-shot learners. Gao et al. 2020

35 of 56

AutoPrompt: Gradient based prompt generation

  • Creates a task-specific prompt by inserting a collection of trigger words

  • The same trigger words are used for all examples; they are learned by maximizing the likelihood of the sentiment labels on the training examples (see the sketch below)

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020
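
A simplified sketch of the gradient-guided trigger search: each vocabulary word is scored by a first-order estimate of how much swapping it into a trigger slot would increase the label likelihood (the variable names are assumptions):

    import torch

    def candidate_triggers(grad_at_slot, embedding_matrix, k=10):
        # grad_at_slot: (d,) gradient of log p(y | prompt) w.r.t. the input
        #   embedding currently sitting in the trigger slot
        # embedding_matrix: (V, d) input word-embedding table
        scores = embedding_matrix @ grad_at_slot   # first-order improvement estimate
        return torch.topk(scores, k).indices       # top-k candidate token ids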

36 of 56

AutoPrompt: Gradient based prompt generation

  • How to select label words for the prompt:
    • Step 1: train a classifier to predict the class label using the contextualized embedding h_i of [MASK] as input

    • Step 2: substitute h_i with the MLM’s output word embeddings to obtain a score s(y, w); the set of label tokens is constructed from the k highest-scoring words

AUTOPROMPT: Eliciting Knowledge from Language Models with Automatically Generated Prompts. Shin et al. 2020

37 of 56

Chain-of-thoughts Prompting

  • Use in-context examples that include reasoning steps to elicit the model to output its own reasoning steps, which improves performance (an example prompt follows)

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022
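
The canonical demonstration style from the paper, written here as a prompt string: the worked example spells out intermediate reasoning before the final answer, and the model imitates that format on the new question:

    cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis
    balls. Each can has 3 tennis balls. How many tennis balls does he have now?
    A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis
    balls. 5 + 6 = 11. The answer is 11.

    Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6
    more, how many apples do they have?
    A:"""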

38 of 56

  • Other applications of CoT

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Wei et al. 2022

39 of 56

Self-Consistency

Self-consistency improves chain of thought reasoning in language models. Wang et al. 2022
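
Self-consistency samples several diverse chain-of-thought completions (temperature > 0) and takes a majority vote over the final answers. A minimal sketch, where sample_fn and extract_answer are hypothetical callables supplied by the caller:

    from collections import Counter

    def self_consistent_answer(sample_fn, extract_answer, prompt, n=10):
        # Sample n reasoning paths, parse out each final answer,
        # and return the most common one.
        answers = [extract_answer(sample_fn(prompt)) for _ in range(n)]
        return Counter(answers).most_common(1)[0][0]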

40 of 56

Self-Ask

  • Use prompts such as “Follow up:” to elicit reasoning that improves the answer (an example in this style follows)

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023
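
A prompt in the self-ask style, adapted from the paper's running example (treat the exact wording as an approximation):

    self_ask_prompt = """Question: Who lived longer, Theodor Haecker or Harry
    Vaughan Watkins?
    Are follow up questions needed here: Yes.
    Follow up: How old was Theodor Haecker when he died?
    Intermediate answer: Theodor Haecker was 65 years old when he died.
    Follow up: How old was Harry Vaughan Watkins when he died?
    Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
    So the final answer is: Harry Vaughan Watkins.

    Question: """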

41 of 56

Self-Ask

  • Interplay between GPT and a search engine to collaboratively improve the answer

Measuring and Narrowing the Compositionality Gap in Language Models. Press et al. 2023

42 of 56

Instruction Fine-Tuning

  • Training language models to follow “instructions”, i.e., to answer an imperative sentence

  • Why instruction fine-tuning?
    • Makes the model follow instructions (better serves users)
    • Makes evaluation criteria clearer to define: informativeness, truthfulness
    • Better aligns the model with fine-grained user needs

GPT is trained to continue text, e.g., “The sky is blue. The water is clear. …”

InstructGPT instead follows an imperative prompt: “Write a poem containing seven sentences. ______”

Training language models to follow instructions with human feedback. Ouyang et al. 2022

43 of 56

Instruction Fine-Tuning

  • Step 2: use human labelers to select the winning answer between two candidate outputs

Training language models to follow instructions with human feedback. Ouyang et al. 2022

44 of 56

Instruction Fine-Tuning

Training language models to follow instructions with human feedback. Ouyang et al. 2022

smaller InstructGPT > larger GPT

Gray bars: truthfulness; colored bars: truthfulness and informativeness

45 of 56

Self-Instruct: Aligning LM with Self-Generated Instructions

  • How can we perform instruction fine-tuning without large labeled data? Use GPT-3 to generate instruction-following training data via few-shot demonstrations

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

46 of 56

Self-Instruct: Aligning LM with Self-Generated Instructions

Self-instruct: Aligning language model with self generated instructions. Wang et al. 2022

47 of 56

Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca: A Strong, Replicable Instruction-Following Model. Taori et al. 2023

48 of 56

Structured Prompting: Scaling Number of Demonstrations

  • In-context learning is constrained by the input length (LLaMA: 2k tokens; Llama 2: 4k tokens)
  • Scaling up the number of demonstrations: encode the demonstrations separately

Structured Prompting: Scaling In-Context Learning to 1,000 Examples. Hao et al. 2022

49 of 56

Explanation Improves Few Shot Prompting

  • Explanations help improve the accuracy of QA
    • But the accuracy depends on the quality of the explanations and examples

Can language models learn from explanations in context?. Lampinen et al. 2022

50 of 56

Explanation Improves Fine-Tuning of LLM

  • Fine-tuning on 4k examples across 41 hate classes using curie (6.7B) and ada (350M):
    • Method 1: Hate speech -> short class name
    • Method 2: Hate speech -> long description of class

Testing Hate Speech against Policies. Zheng et al. 2023

51 of 56

An Explanation of In-context Learning as Implicit Bayesian Inference

An Explanation of In-context Learning as Implicit Bayesian Inference. Xie et al. 2021

52 of 56

Using Retrieval to Improve ICL

  • Retrieve similar questions and aggregate them to obtain the final answer (a simplified sketch follows)

Learning To Retrieve Prompts for In-Context Learning. Rubin et al. 2021
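
A simplified sketch of the idea: embed the test input, retrieve the most similar labeled examples, and assemble the prompt from them (Rubin et al. train a dedicated prompt retriever; the plain cosine similarity here is a stand-in):

    import numpy as np

    def build_prompt(query_vec, example_vecs, examples, k=4):
        # Cosine similarity between the query and every candidate demonstration
        sims = (example_vecs @ query_vec) / (
            np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(query_vec))
        top = np.argsort(-sims)[:k]        # indices of the k most similar examples
        return "\n\n".join(examples[i] for i in top)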

53 of 56

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

54 of 56

ReAct: Synergizing Reasoning and Acting in Language Models

ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.

55 of 56

56 of 56

Summary

  • Automated machine learning (cont.)

  • Parameter Efficient Fine Tuning (PEFT)

  • In-context Learning
