1 of 48

Making Generative AI Better for You: Fine-tuning and Experimentation for Custom Research Solutions

Shane Storks (he/him)

PhD Candidate, Computer Science and Engineering

Situated Language and Embodied Dialogue (SLED) Lab

MIDAS Generative AI Tutorial Series

November 29, 2023

1

2 of 48

Large Language Models (LLMs)

LLMs like ChatGPT and GPT-4 have recently gained popularity due to their impressive language understanding and reasoning capabilities, making them useful assistants for a variety of language tasks.

How can we customize them and apply them to empirical research?

2

3 of 48

Role of LLMs in Research

  • LLMs can be helpful assistants for tasks like writing and coding, but they can do so much more!
  • They can also be useful to automate aspects of:
    • Data annotation
    • Domain-specific content generation
    • Any language-based applications
  • May not perform well at specialized tasks like these out of the box
  • How can we customize LLMs to adapt them to various specialized language tasks?

3

4 of 48

Outline

  • The Road to LLMs
  • Fine-Tuning LLMs
  • Prompting LLMs

4

5 of 48

Outline

  • The Road to LLMs
  • Fine-Tuning LLMs
  • Prompting LLMs

5

6 of 48

Language Models (LMs)

6

 

Jack needed some money, so he went and shook his piggy ____

Minsky, M. (2000). Commonsense-based interfaces. In Commun. ACM, 43(8): p. 66-73.

tail

and

toy

bank

fruit

1.0

0.0

 

LM
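A minimal sketch of what the diagram above describes: an LM assigns a score to each candidate next word and a softmax turns those scores into probabilities between 0.0 and 1.0. The scores below are hypothetical, chosen only for illustration, and are not the output of any real model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

# Hypothetical scores for the five candidate words shown on the slide.
candidate_scores = {"tail": 0.5, "and": -1.0, "toy": 1.0, "bank": 4.0, "fruit": -2.0}

next_word_probs = softmax(candidate_scores)
best_word = max(next_word_probs, key=next_word_probs.get)
print(best_word, round(next_word_probs[best_word], 3))
```

A real LM computes such a distribution over its entire vocabulary, not just five words.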

 

7 of 48

Vector-Based Word Embeddings

7

Timeline (2013, 2018, 2023): word2vec, GloVe

(Image from TensorFlow docs)

Tomas Mikolov, Kai Chen, Greg Corrado, & Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” International Conference on Learning Representations 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” Advances in Neural Information Processing Systems 26.

Jeffrey Pennington, Richard Socher, & Christopher Manning. (2014). “GloVe: Global Vectors for Word Representation.” 2014 Conference on Empirical Methods in Natural Language Processing.

8 of 48

Representing Sequences of Words

8

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs

9 of 48

Attention and Transformers

9

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer

Dzmitry Bahdanau, Kyunghyun Cho, & Yoshua Bengio. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” International Conference on Learning Representations 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. (2017). “Attention is All You Need.” Advances in Neural Information Processing Systems 30.

10 of 48

Contextual Language Representations

10

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo

Matthew E. Peters, Mark Neumann, Mohit Iyyer, et al. (2018). “Deep Contextualized Word Representations.” 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

11 of 48

Self-Supervision and Transfer Learning in LMs

11

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT

Alec Radford, Karthik Narasimhan, Tim Salimans, & Ilya Sutskever. (2018). “Improving Language Understanding by Generative Pre-Training.”

Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2018). “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

12 of 48

“Jack needed some money, so he went and shook his piggy …”

12

Transformer Encoder

Jack

needed

some

money

,

so

he

went

and

shook

his

[MASK]

fruit

wallet

head

piggy

hand

1.0

0.0

Feedforward + Softmax

 

13 of 48

Bigger Data & Bigger Models -> LLMs

13

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG

(figure from Microsoft)

14 of 48

Prompting & In-Context Learning

14

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, GPT-3

15 of 48

Instruction Tuning

15

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, GPT-3, FLAN, InstructGPT, ChatGPT

Jason Wei, et al. (2022). “Finetuned Language Models are Zero-shot Learners.” ICLR 2022.

Long Ouyang, Jeff Wu, Xu Jiang, et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” arXiv: 2203.02155.

https://chat.openai.com/

16 of 48

Vision & Multimodality

16

Timeline (2013, 2018, 2023): word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, GPT-3, InstructGPT, ChatGPT, Flamingo, BLIP-2, GPT-4

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al. (2022). “Flamingo: a Visual Language Model for Few-Shot Learning.” Advances in Neural Information Processing Systems 35.

Junnan Li, Dongxu Li, Silvio Savarese, & Steven Hoi. (2023). “BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models.” arXiv: 2301.12597.

OpenAI. (2023). “GPT-4 Technical Report.” arXiv: 2303.08774.

17 of 48

17

18 of 48

Limitations of LLMs

  • Despite these advancements and impressive capabilities, LLMs have some key limitations that cause undesirable behaviors
  • In order to effectively and responsibly apply them in research, we need to be mindful of these limitations!

18

19 of 48

Limitations of LLMs: Spurious Cues

19

Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.

Karen became good friends with her roommate.

Karen hated her roommate.

How does the story end?

😀

😡

20 of 48

Limitations of LLMs: Data Contamination

  • LLMs have seen so much data in pre-training
  • They may have been trained on benchmark datasets…
  • Training on the test data is not an objective evaluation!

20

21 of 48

Limitations of LLMs: Interpretability

21

(figure from Vinay Iyengar)

22 of 48

Limitations of LLMs: Hallucination

  • Hallucination: generation of text that is factually incorrect, nonsensical, unfaithful to inputs, or otherwise incoherent

22

23 of 48

Summary

  • LLMs are remarkably useful for many language tasks, but these limitations mean their outputs cannot be trusted without verification
  • Verifying LLM outputs is important:
    • Automated metrics
    • Human evaluation
  • We must be mindful that LLMs are primarily trained to:
    • Generate fluent-sounding language (pre-training)
    • Satisfy users’ requests (instruction-tuning)

23

24 of 48

2 Ways to Customize LLMs

Fine-Tuning:
  • Small hardware requirements
  • Host locally (private, more flexible)
  • Optimized for specific task
  • Technical skills, engineering effort
  • Large amount of training data
  • Hard to adapt once trained

Prompting:
  • Larger hardware requirements
  • Best LMs behind proprietary APIs
  • Requires prompt engineering
  • User-friendly language interface
  • No training data needed
  • Generalizable and adaptable

24

25 of 48

Outline

  • The Road to LLMs
  • Fine-Tuning LLMs
  • Prompting LLMs

25

26 of 48

Fine-Tuning: Text Classification

26

What is the sentiment of this text?

The film was a charming and affecting journey.

Negative

Positive

Pre-Trained LM

Classification Head

 

Softmax

P(Neg.)

P(Pos.)

1.0

0.0

The film was a charming and affecting journey.

-0.11 2.30
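As a rough sketch of the last step in this diagram, the two numbers above can be read as the classification head's scores (logits) for Negative and Positive. The softmax converts them to probabilities, and the cross-entropy loss against the correct label is the quantity fine-tuning would minimize. Pairing -0.11 with Negative and 2.30 with Positive is an assumption based on the slide layout.

```python
import math

# Logits from the slide; the label order is an assumption.
logits = {"Negative": -0.11, "Positive": 2.30}

def softmax(scores):
    m = max(scores.values())
    exps = {k: math.exp(v - m) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

probs = softmax(logits)

# Cross-entropy loss for a training example whose gold label is "Positive".
gold_label = "Positive"
loss = -math.log(probs[gold_label])

print({k: round(p, 3) for k, p in probs.items()}, round(loss, 3))
```

During fine-tuning, gradients of this loss would update the classification head and, unless they are frozen, the pre-trained LM's weights as well.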

27 of 48

Fine-Tuning: Multiple Choice Completion

27

Pre-Trained LM

Classification Head

 

It was a very hot summer day.

He decided to run in the heat.

He felt much better!

It was a very hot summer day.

He drank a glass of ice cold water.

He felt much better!

Classification Head

 

Softmax

P(A)

P(B)

1.0

0.0

A

B

Which sentence is most likely to fill in the blank?

It was a very hot summer day.

___________________________

He felt much better!

He decided to run in the heat.

He drank a glass of ice cold water.

-0.45

3.76

A

B
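The multiple-choice setup can be sketched as follows: each choice is inserted into the blank, each completed story is scored separately by the same model and head, and a softmax over the scores picks the answer. The `score_story` function is a hypothetical stand-in for the pre-trained LM plus classification head. It simply returns the scores shown on the slide, and assigning -0.45 to A and 3.76 to B is an assumption.

```python
import math

context_before = "It was a very hot summer day."
context_after = "He felt much better!"
choices = {
    "A": "He decided to run in the heat.",
    "B": "He drank a glass of ice cold water.",
}

# Hypothetical stand-in for "pre-trained LM + classification head".
SLIDE_SCORES = {"A": -0.45, "B": 3.76}

def score_story(label, story):
    """Return one scalar score per completed story (stubbed with slide values)."""
    return SLIDE_SCORES[label]

# Fill the blank with each choice, then score each completed story on its own.
stories = {k: f"{context_before} {text} {context_after}" for k, text in choices.items()}
scores = {k: score_story(k, story) for k, story in stories.items()}

# Softmax across the choices, not across a vocabulary.
m = max(scores.values())
exps = {k: math.exp(s - m) for k, s in scores.items()}
total = sum(exps.values())
choice_probs = {k: e / total for k, e in exps.items()}

prediction = max(choice_probs, key=choice_probs.get)
print(prediction, round(choice_probs[prediction], 3))
```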

28 of 48

Fine-Tuning: Multiple Choice QA

28

Pre-Trained LM

Classification Head

 

Q: How many legs does a ladybug have? A: 2

Classification Head

 

Softmax

P(A)

P(B)

1.0

0.0

A

-0.05

3.77

How many legs does a ladybug have?

4

6

2

A

B

C

Classification Head

 

0.01

Q: How many legs does a ladybug have? A: 4

B

Q: How many legs does a ladybug have? A: 6

C

P(C)

29 of 48

Fine-Tuning: Token Classification

29

I

Verb

Noun

Determiner

Pronoun

see

a

dog

Pre-Trained LM

Softmax

P(V)

P(N)

1.0

0.0

Classification Head

 

-0.72 0.56 0.09 -0.11

P(P)

I

see

a

dog

P(D)

Label each token with its part of speech (POS):
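A small sketch of token classification, assuming the head outputs one score per tag for every token. All logits below are hypothetical and only illustrate the shape of the computation: one prediction per token instead of one per sentence.

```python
TAGS = ["Verb", "Noun", "Determiner", "Pronoun"]

# Hypothetical per-token logits, one score per tag in TAGS order.
token_logits = {
    "I":   [-0.7, -0.1, 0.1, 2.4],
    "see": [2.9, 0.2, -0.5, -0.3],
    "a":   [-0.4, 0.0, 3.1, 0.2],
    "dog": [0.1, 2.7, -0.2, -0.6],
}

def predict_tag(logits):
    """Softmax preserves ordering, so the largest logit is the most probable tag."""
    best_index = max(range(len(logits)), key=lambda i: logits[i])
    return TAGS[best_index]

predicted = {token: predict_tag(scores) for token, scores in token_logits.items()}
print(predicted)
```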

30 of 48

Fine-Tuning: Text Generation

30

Jack

shook

his

piggy

toy

tail

bank

fruit

and

Continue the text:

Pre-Trained LM

Softmax

P(and)

P(bank)

1.0

0.0

Language Modeling Head

 

 

P(toy)

Jack

shook

his

piggy

P(tail)

P(fruit)
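For text generation, fine-tuning typically minimizes the negative log-likelihood of each correct next token given the tokens before it. The sketch below computes that loss and the corresponding perplexity from hypothetical probabilities. The numbers are illustrative assumptions, not model outputs.

```python
import math

# Hypothetical probability the LM assigns to each gold next token,
# given all preceding tokens of "Jack shook his piggy bank".
gold_token_probs = [
    ("shook", 0.20),
    ("his", 0.60),
    ("piggy", 0.05),
    ("bank", 0.90),
]

def sequence_nll(token_probs):
    """Sum of negative log-probabilities of the gold tokens."""
    return -sum(math.log(p) for _, p in token_probs)

loss = sequence_nll(gold_token_probs)
avg_loss = loss / len(gold_token_probs)
perplexity = math.exp(avg_loss)  # lower is better

print(round(loss, 3), round(perplexity, 3))
```

Raising the probability of the unlikely token ("piggy" here) lowers the loss the most, which is how fine-tuning pushes the model toward domain-specific text.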

31 of 48

Parameter-Efficient Fine-Tuning (PEFT)

  • While fine-tuning LMs is generally more feasible when we have less available compute, there are still some problems:
    • Fine-tuning on a large amount of data can take a long time
    • The size of LM we can fine-tune is limited by compute
    • Updating all weights of the LM during fine-tuning is expensive and inefficient
  • Creates a need for parameter-efficient fine-tuning (PEFT) methods!

31

32 of 48

Low-Rank Adaptation (LoRA)

32

(figure from Sebastian Raschka)
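A minimal pure-Python sketch of the LoRA idea: the pre-trained weight matrix W stays frozen, and only two small matrices A and B are trained, so the layer computes Wx + (alpha / r) · B(Ax). The dimensions, the value of alpha, and the helper names are assumptions for illustration. Practical implementations use a tensor library.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen path W x plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)

random.seed(0)
d, k, r, alpha = 6, 4, 2, 4  # hypothetical toy sizes
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]                                  # trainable, starts at zero
x = [1.0, -2.0, 0.5, 3.0]

# With B initialized to zero, the adapted layer starts out identical to the frozen one.
print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))

# Trainable parameters for one 4096 x 4096 weight matrix at rank 8.
print(4096 * 4096, lora_param_count(4096, 4096, 8))
```

In this example, the rank-8 update trains 65,536 parameters instead of the 16,777,216 in the full matrix.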

 

33 of 48

Outline

  • The Road to LLMs
  • Fine-Tuning LLMs
  • Prompting LLMs

33

34 of 48

Prompting LMs

To customize an LLM for your problem through prompting, you need to make a few choices (prompt engineering):

  1. Prompt template
  2. Answer mapping
  3. In-context demonstration

34

35 of 48

Language Models (LMs)

35

 

Jack needed some money, so he went and shook his piggy ____

Minsky, M. (2000). Commonsense-based interfaces. In Commun. ACM, 43(8): p. 66-73.

tail

and

toy

bank

fruit

1.0

0.0

 

LM

 

36 of 48

Prompt Templates

If filling a blank from a few possible choices, can use a cloze prompt:

36

Named Entity Recognition (NER)
  • Inputs ([X]): [X1]: Mike went to Paris; [X2]: Paris
  • Template: [X1]. [X2] is a [Z] entity.
  • Answer ([Z]): organization, location, person name

Reading Comprehension
  • Inputs ([X]): Daniela Hantuchova knocks Venus Williams out of Eastbourne 6-2 5-7 6-2.
  • Template: [X] Hantuchova breezed through the first set in just under 40 minutes after breaking Williams’ serve twice to take it 6-2 and led the second 4-2 before [Z] hit her stride.
  • Answer ([Z]): Daniela Hantuchova, Venus Williams

37 of 48

Prompt Templates

When completing a prompt or generating text, use a prefix prompt:

37

Sentiment Classification
  • Inputs ([X]): I love this movie.
  • Template: [X] The movie is [Z]
  • Answer ([Z]): good, bad

Question Answering
  • Inputs ([X]): What color is the sky? A. Red B. Yellow C. Blue D. Green
  • Template: Question: [X] Answer: [Z]
  • Answer ([Z]): A, B, C, D
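The sentiment example can be sketched in code: fill [X] into the template, give the LM everything before the [Z] slot, and map the answer word it prefers back to a task label. The answer probabilities and the label names "Positive" and "Negative" are hypothetical. A real setup would read these probabilities from the LM.

```python
def build_prefix_prompt(template, x):
    """Fill [X] and keep only the text before the [Z] slot for the LM to continue."""
    filled = template.replace("[X]", x)
    return filled.split("[Z]")[0].rstrip()

template = "[X] The movie is [Z]"
prompt = build_prefix_prompt(template, "I love this movie.")
print(prompt)

# Answer mapping: which answer word stands for which task label (assumed names).
answer_to_label = {"good": "Positive", "bad": "Negative"}

# Hypothetical P([Z] = answer | prompt) as it might be returned by an LM.
answer_probs = {"good": 0.85, "bad": 0.15}

best_answer = max(answer_probs, key=answer_probs.get)
label = answer_to_label[best_answer]
print(label)
```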

38 of 48

Prompt Templates

When completing a prompt or generating text, use a prefix prompt:

38

Summarization
  • Inputs ([X]): MIDAS and the Michigan AI Lab will host a faculty workshop with the theme of Generative Artificial Intelligence (Generative AI) for research. …
  • Template: [X] tl;dr [Z]
  • Answer ([Z]): MIDAS & Michigan AI Lab host faculty workshop on Generative AI for research. Explore impact, use cases, ethical considerations & collaboration opportunities. All faculty welcome.

Translation
  • Inputs ([X]): Je vous aime.
  • Template: French: [X] English: [Z]
  • Answer ([Z]): I love you.; I fancy you.

39 of 48

Finding the Best Template and Answers

  • Different prompts can yield different results
  • May take extra work to find the best prompt
    • Trial and error
    • Ensembling templates

39

good

great

okay

bad

awful

I love this movie.

[X] The movie was so [Z]

[X] I thought it was [Z]

[X] The movie is [Z]

[X] This movie was [Z]

[X] The film is [Z]

LLM

P([Z]=_)
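One way to sketch template ensembling: query the LLM once per template, collect P([Z] = answer) for each, and average the distributions so that no single phrasing dominates. The probabilities below are hypothetical placeholders for real LLM outputs, and only three of the five templates are shown.

```python
# Hypothetical P([Z] = answer) for the input "I love this movie." under each template.
per_template_probs = {
    "[X] The movie was so [Z]": {"good": 0.50, "great": 0.20, "okay": 0.10, "bad": 0.15, "awful": 0.05},
    "[X] I thought it was [Z]": {"good": 0.30, "great": 0.40, "okay": 0.15, "bad": 0.10, "awful": 0.05},
    "[X] The movie is [Z]":     {"good": 0.45, "great": 0.25, "okay": 0.20, "bad": 0.05, "awful": 0.05},
}

def average_distributions(distributions):
    """Uniformly average several distributions over the same answer set."""
    distributions = list(distributions)
    answers = distributions[0].keys()
    n = len(distributions)
    return {a: sum(d[a] for d in distributions) / n for a in answers}

ensembled = average_distributions(per_template_probs.values())
top_answer = max(ensembled, key=ensembled.get)
print(top_answer, {a: round(p, 3) for a, p in ensembled.items()})
```

Here the second template alone would pick "great", while the ensemble settles on "good". Weighted averaging or majority voting are alternative combination rules.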

42 of 48

Finding the Best Template and Answers

  • Different prompts can yield different results
  • May take extra work to find the best prompt
    • Trial and error
    • Ensembling templates
    • Ensembling answers

42

good

great

okay

bad

awful

I love this movie.

[X] The movie was so [Z]

[X] I thought it was [Z]

[X] The movie is [Z]

[X] This movie was [Z]

[X] The film is [Z]

LLM

P([Z]=_)
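Answer ensembling can be sketched as summing the probability of every answer word that maps to the same label. The grouping of words into Positive, Neutral, and Negative and the probabilities themselves are assumptions for illustration.

```python
# Hypothetical P([Z] = answer) from an LLM for "I love this movie."
answer_probs = {"good": 0.40, "great": 0.30, "okay": 0.12, "bad": 0.10, "awful": 0.08}

# Assumed mapping from several answer words to each task label.
label_to_answers = {
    "Positive": ["good", "great"],
    "Neutral": ["okay"],
    "Negative": ["bad", "awful"],
}

def ensemble_answers(probs, mapping):
    """Sum the probability mass of all answer words assigned to each label."""
    return {label: sum(probs[a] for a in answers) for label, answers in mapping.items()}

label_probs = ensemble_answers(answer_probs, label_to_answers)
predicted_label = max(label_probs, key=label_probs.get)
print(predicted_label, {k: round(v, 2) for k, v in label_probs.items()})
```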

45 of 48

Managing Randomness in LLMs

  • LLM decoding algorithms may incorporate some randomness by default to increase the diversity of generation
  • Some solutions:
    • Generate multiple times and average results
    • Greedy decoding
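Both solutions can be illustrated on a toy next-word distribution. Greedy decoding always returns the most probable word, so repeated runs agree. Sampling varies from run to run, but aggregating many samples recovers a stable answer. The distribution and the sample count are hypothetical.

```python
import random
from collections import Counter

# Hypothetical next-word distribution from an LM.
next_word_probs = {"bank": 0.70, "toy": 0.15, "tail": 0.10, "fruit": 0.05}

def greedy(probs):
    """Deterministic: always pick the most probable word."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Stochastic: draw a word in proportion to its probability."""
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [sample(next_word_probs, rng) for _ in range(200)]

# Aggregate repeated generations: empirical frequencies and the majority answer.
counts = Counter(samples)
frequencies = {w: c / len(samples) for w, c in counts.items()}
majority = counts.most_common(1)[0][0]

print(greedy(next_word_probs), majority)
```

Fixing the random seed, as done here, is a further way to make sampled outputs repeatable when the decoding interface exposes one.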

45

46 of 48

In-Context Learning

46

Tom B. Brown, Benjamin Mann, Nick Ryder, et al. (2020). “Language Models are Few-Shot Learners.” arXiv: 2005.14165.
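In-context learning places a few solved examples (demonstrations) in the prompt ahead of the new input, with no weight updates. A minimal sketch of assembling such a few-shot prompt follows. The "Review:" and "Sentiment:" field names and the demonstrations are hypothetical.

```python
# Hypothetical demonstrations: (input text, answer word).
demonstrations = [
    ("I love this movie.", "good"),
    ("The plot made no sense.", "bad"),
]

def build_few_shot_prompt(demos, query):
    """Format each demonstration with its answer, then the query with the answer left open."""
    blocks = [f"Review: {x}\nSentiment: {z}" for x, z in demos]
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)

prompt = build_few_shot_prompt(demonstrations, "The film was a charming and affecting journey.")
print(prompt)
```

Passing an empty demonstration list yields the zero-shot version of the same prompt.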

47 of 48

Chain-of-Thought Prompting

47

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35.

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems 35.
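The two cited approaches can be sketched as prompt builders. Few-shot chain-of-thought (Wei et al.) puts worked reasoning in the demonstrations. Zero-shot chain-of-thought (Kojima et al.) appends the trigger phrase "Let's think step by step." The "Q:" and "A:" layout and the demonstration below are illustrative assumptions, not examples taken from either paper.

```python
def zero_shot_cot_prompt(question):
    """Append the reasoning trigger so the LM writes out intermediate steps."""
    return f"Q: {question}\nA: Let's think step by step."

def few_shot_cot_prompt(demos, question):
    """Each demonstration includes its reasoning before the final answer."""
    blocks = [f"Q: {q}\nA: {reasoning} The answer is {answer}." for q, reasoning, answer in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Hypothetical demonstration with an explicit reasoning chain.
demos = [
    (
        "A lab has 3 boxes with 4 samples each. How many samples are there?",
        "There are 3 boxes and each holds 4 samples, so 3 * 4 = 12.",
        "12",
    )
]

question = "A survey has 5 pages with 6 questions each. How many questions are there?"
print(zero_shot_cot_prompt(question))
print(few_shot_cot_prompt(demos, question))
```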

48 of 48

48

@shanestorks www.shanestorks.com

Next: From Theory to Practice!

I’m on the job market for academic and industry positions!