1 of 48

Making Generative AI Better for You: Fine-tuning and Experimentation for Custom Research Solutions

Shane Storks (he/him)

PhD Candidate, Computer Science and Engineering

Situated Language and Embodied Dialogue (SLED) Lab

MIDAS Generative AI Tutorial Series

November 29, 2023

2 of 48

Large Language Models (LLMs)

LLMs like ChatGPT and GPT-4 have recently gained popularity due to their impressive language understanding and reasoning capabilities, making them useful assistants for a variety of language tasks.

How can we customize them and apply them to empirical research?

https://chat.openai.com/

OpenAI. GPT-4 Technical Report. arXiv: 2303.08774.

…

Large language models (or LLMs) like ChatGPT and GPT-4 have recently gained popularity and generated a lot of buzz inside of and outside of the AI and natural language processing communities. This is in part due to the ease with which you can prompt them for information, making them useful tools for reading, writing, coding, and more.

<click> For example, if I have an email I’d like to quickly get the gist of and important information from, I could ask ChatGPT to do this for me. In this example, I share an email Beth sent me to provide some information about today’s presentation times and submitting slides.

<click> ChatGPT helpfully pointed out my presentation time and what I need to know from the email.

Now, many of us might be here wondering how we can apply them to empirical research. <click> While this workshop has given and will give a lot of examples, …

3 of 48

Role of LLMs in Research

LLMs can be helpful assistants for tasks like writing and coding, but they can do so much more!
They can also be useful to automate aspects of:

Data annotation
Domain-specific content generation
Any language-based applications

May not perform well at specialized tasks like these out of the box
How can we customize LLMs to adapt them to various specialized language tasks?

MIDAS Using Generative AI for Scientific Research User Guide

4 of 48

Outline

The Road to LLMs
Fine-Tuning LLMs
Prompting LLMs

5 of 48

Outline

The Road to LLMs
Fine-Tuning LLMs
Prompting LLMs

6 of 48

Language Models (LMs)

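Jack needed some money, so he went and shook his piggy ____

(Figure: an LM maps the incomplete sentence to a probability distribution, on a 0.0 to 1.0 scale, over candidate next words: tail, and, toy, bank, fruit, …; image from dreamstime)

Minsky, M. (2000). Commonsense-based interfaces. In Commun. ACM, 43(8): p. 66-73.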

First, let’s recall what a language model is in the most general terms: a model of the distribution of language, specifically sequences of words. Most language models you’ll be seeing and using nowadays would model the probability of the next word given previous words, as shown. We can use language models like this to judge the probability of a whole sequence of text, or generate new text to complete a prompt.

<click> For example, given this incomplete sentence from Minsky, say we want to predict the next word. We as humans can probably anticipate the next word.

<click> In a language model, we can calculate this by prompting it with the incomplete sentence, which returns a probability distribution across the language model’s vocabulary. We can then be pretty confident that the next word should be bank based on the distribution that the model has learned from its training data.
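To make the idea of a next-word distribution concrete, here is a minimal sketch. The scores below are made up for illustration (a real LM computes them with a neural network over its whole vocabulary), and the `softmax` helper is just the standard way of turning raw scores into probabilities.

```python
import math

def softmax(scores):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    largest = max(scores.values())  # subtract the max for numerical stability
    exps = {word: math.exp(s - largest) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical scores for candidate next words after
# "Jack needed some money, so he went and shook his piggy".
logits = {"tail": 1.0, "and": 0.5, "toy": 1.5, "bank": 5.0, "fruit": -1.0}

probs = softmax(logits)
prediction = max(probs, key=probs.get)  # the most probable next word
print(prediction, round(probs[prediction], 3))
```

The same distribution can be used to score a whole sentence (multiply the probability of each word given the words before it) or to generate text (pick or sample a word, append it, and repeat).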

Based on this simple objective of predicting the next word and training deep neural networks on basically all the text on the entire web, today’s state-of-the-art large language models can be used for just about any language-related task.

This sense of a language model has been studied in language technologies like speech recognition and machine translation since at least the 1970s, but the evolution from this simple idea of modeling the probability of the next word to today’s LLMs was due to a number of key developments. I’m going to try to cover the last 10 years of developments to give you some high-level ideas about the key characteristics of these LLMs that make them successful, and we’ll also cover how these contribute to their key limitations.

7 of 48

Vector-Based Word Embeddings

(Timeline figure, 2013 to 2023: word2vec, GloVe)

(Image from TensorFlow docs)

Tomas Mikolov, Kai Chen, Greg Corrado, & Jeffrey Dean. (2013). “Efficient Estimation of Word Representations in Vector Space.” International Conference on Learning Representations 2013.

Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. (2013). “Distributed Representations of Words and Phrases and their Compositionality.” Advances in Neural Information Processing Systems 26.

Jeffrey Pennington, Richard Socher, & Christopher Manning. (2014). “GloVe: Global Vectors for Word Representation.” 2014 Conference on Empirical Methods in Natural Language Processing.

One key development that helped us get where we are now was the idea of pre-trained, vector-based word embeddings, which emerged around 2013 with word2vec and GloVe. In short, some people found that if you train a neural network, similarly to a language model, to predict the words around each word, the learned weights for each word in the neural network can be used as transferable vector representations that implicitly capture the semantics of the word, as more semantically similar words are pushed closer together in vector space.

These word representations also exhibit some powerful analogical properties: for related sets of words like gendered nouns or verb tenses, there are additive relationships between their vectors. For example, the vector from man to woman is quite similar to the vector from king to queen. So in theory, I could subtract the representation for man from the one for woman, then add that difference to king to land near queen. People were excited about the compositional properties of this representation, as compositionality is thought to be one of the key characteristics of human language.
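The analogy arithmetic can be sketched with toy vectors. The 2-d values below are invented so that one axis loosely stands for "royalty" and the other for "gender"; real word2vec or GloVe vectors have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Made-up 2-d vectors: first axis ~ "royalty", second axis ~ "gender".
vectors = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("man", "woman", "king")  # man is to woman as king is to ?
print(result)
```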

These pre-trained representations could be used in downstream language tasks to accelerate neural network training on them.

8 of 48

Representing Sequences of Words

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs)

(figure from Jurafsky and Martin)

Ilya Sutskever, Oriol Vinyals, & Quoc V. Le. (2014). Sequence to Sequence Learning with Neural Networks. arXiv: 1409.3215.

Rafal Jozefowicz, Wojciech Zaremba, & Ilya Sutskever. (2015). An Empirical Exploration of Recurrent Network Architectures. ICML 2015.

A language model needs to be able to represent not just single words, but also sequences of words. Neural networks for representing sequences of data, called recurrent neural networks (RNNs), had long been studied before this, but around 2014 we saw a lot of people working on using them for language modeling.

In an RNN, we represent each input word using an embedding like GloVe or word2vec. Then these embeddings are transformed using the RNN blocks, which incorporate information from previous words into the word embedding to form a new representation for the sequence up until that word.
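As a rough sketch of that recurrence, the code below uses the simplest ("vanilla") RNN update, h_t = tanh(W_x x_t + W_h h_{t-1}). The embeddings and weight matrices are made-up toy numbers, not trained values, and the RNN LMs of that era typically used more elaborate blocks such as LSTMs.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h_prev, W_x, W_h):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    return [math.tanh(a + b)
            for a, b in zip(matvec(W_x, x), matvec(W_h, h_prev))]

# Toy 2-d "embeddings" standing in for word2vec/GloVe vectors (made up).
embeddings = {"Jack": [0.5, -0.2], "shook": [0.1, 0.8], "his": [-0.3, 0.4]}
W_x = [[0.6, -0.1], [0.2, 0.7]]  # input weights (hypothetical)
W_h = [[0.5, 0.0], [0.0, 0.5]]   # recurrent weights (hypothetical)

def encode(words):
    """Return one hidden state per word; each depends on all earlier words."""
    h = [0.0, 0.0]  # initial hidden state
    states = []
    for word in words:  # words are processed strictly one after another
        h = rnn_step(embeddings[word], h, W_x, W_h)
        states.append(h)
    return states

states = encode(["Jack", "shook", "his"])
print(states[-1])  # representation of the sequence up to "his"
```

Because each step needs the previous hidden state, the loop cannot be run in parallel across words.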

This enabled us to make some progress on deep learning-based language models, but RNNs had two limitations: they were very computationally expensive, and they easily forgot information from words far back in the context, which is a problem when there are long-range dependencies between words.

9 of 48

Attention and Transformers

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer)

Dzmitry Bahdanau, Kyunghyun Cho, & Yoshua Bengio. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” International Conference on Learning Representations 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. (2017). “Attention is All You Need.” Advances in Neural Information Processing Systems 30.

Another key development was attention in language processing, which was an idea implemented originally for machine translation. While I won’t get into the nitty-gritty details, the high-level idea is that when generating language, we can allow neural networks to implicitly learn which parts of the input language context best inform the generation of each word. In general, this allows us to robustly get the most relevant information from different parts of the previous language context as we generate words with a language model, which solves this problem of RNNs easily forgetting past information. This was important because up until this point, most language modeling approaches were representing language from left to right, and it was easy to lose information that appeared further away in the input language context.

In machine translation, the cool thing is that this enables us to better handle cases where the target translation has a different word ordering than the source text. In this example, the learned attention weights show that different words from a source sentence in French each contribute differently to the words generated in an English translation (for example, the French word for European contributes most to the generation of the word European in English, despite the two being in different positions in the sentence).

This idea was taken to the limit in the well-known paper “attention is all you need” which proposed transformers, a new neural network architecture made entirely of attention mechanisms that looks quite complicated, but the key insight from it is that it used multiple “attention heads” to learn representations of words as functions of the representations of all other words. This enables capturing of more global word relationships than RNN language modeling approaches (which could only be calculated sequentially), and has the added benefit of being much more parallelizable and thus scalable computationally.
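For a feel of what one attention head computes, here is a minimal sketch of scaled dot-product self-attention. It is simplified on purpose: a real transformer first applies learned projection matrices to get queries, keys, and values, and runs several heads in parallel; here the made-up word vectors are used directly for all three roles.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each output mixes all value vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this word's query to every word's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-d word vectors used as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
print(weights[0])  # how much word 0 attends to words 0, 1, and 2
```

Each row of the loop is independent of the others, which is why this computation parallelizes much better than an RNN.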

10 of 48

Contextual Language Representations

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo)

(Karan Purohit)

Matthew E. Peters, Mark Neumann, Mohit Iyyer, et al. (2018). “Deep Contextualized Word Representations.” 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

11 of 48

Self-Supervision and Transfer Learning in LMs

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT)

Alec Radford, Karthik Narasimhan, Tim Salimans, & Ilya Sutskever. (2018). “Improving Language Understanding by Generative Pre-Training.”

Jacob Devlin, Ming-Wei Chang, Kenton Lee, & Kristina Toutanova. (2019). “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

GPT (OpenAI’s first language model) and BERT pushed this idea further by combining transformers and contextual language representations and pre-training them on a large amount of language data through self-supervision, which means we’re training the model to reconstruct language inputs. For example, BERT randomly masked out tokens in the input and trained the LM encoder to predict these masked tokens, while OpenAI’s GPT models trained the LM to predict the next word given previous words.
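The two self-supervised objectives can be sketched by showing how training examples are built from raw text with no human labels. This is a simplification: the function names and the `mask_prob` default are illustrative, and BERT's actual masking recipe has extra details that are omitted here.

```python
import random

def masked_lm_example(tokens, mask_prob=0.15, seed=0):
    """BERT-style: hide some tokens; the model must predict the hidden ones."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)   # the model is trained to recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # no prediction needed at this position
    return inputs, targets

def next_word_examples(tokens):
    """GPT-style: every prefix of the text predicts the word that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = "Jack needed some money so he went and shook his piggy bank".split()
masked_inputs, masked_targets = masked_lm_example(tokens, mask_prob=0.3, seed=1)
pairs = next_word_examples(tokens)
print(masked_inputs)
print(pairs[-1])  # (all words before "bank", "bank")
```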

BERT and GPT made waves in the NLP research community, because they could be pre-trained on a large amount of Web-scale text data without any annotation of the data needed, then fine-tuned on downstream tasks using the same transformer-based neural network architecture (rather than using the features from the network in other architectures, as was done with ELMo). This allowed them to be quickly and effectively applied to a ton of different language tasks. Suddenly, these LMs were everywhere.

12 of 48

“Jack needed some money, so he went and shook his piggy …”

Transformer Encoder

(Figure: the input tokens “Jack needed some money , so he went and shook his [MASK]” pass through a transformer encoder, then a feedforward + softmax layer that outputs a probability, on a 0.0 to 1.0 scale, for candidate words at the masked position: fruit, wallet, head, piggy, hand, …)

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In NAACL HLT 2019.

Vaswani, A. et al. (2017). Attention is All you Need. In NIPS 30.

13 of 48

Bigger Data & Bigger Models -> LLMs

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, …)

(figure from Microsoft)

Human Performance

https://gluebenchmark.com/leaderboard

https://leaderboard.allenai.org/swag/submissions/public

Alec Radford, Jeff Wu, Rewon Child, et al. (2019). “Language Models are Unsupervised Multitask Learners.”

Yinhan Liu, Myle Ott, Naman Goyal, et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv: 1907.11692.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv: 1909.08053.

Corby Rosset. (2020). Turing-NLG: A 17-billion-parameter language model by Microsoft.

14 of 48

Prompting & In-Context Learning

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, …, GPT-3)

Tom B. Brown, Benjamin Mann, Nick Ryder, et al. (2020). “Language Models are Few-Shot Learners.” arXiv: 2005.14165.

Eventually, we saw the familiar GPT-3 come out from OpenAI with 175B parameters, which is more than 10x bigger than the biggest point on that graph I just showed. This massive scaling eventually enabled LLMs to be used even without any task-specific fine-tuning. Instead, we can interact directly with the language model by simply talking to it or prompting it with some text, similar to how we do today.

One option is to apply these models zero-shot, meaning we can directly prompt them to complete some task, such as English-to-French translation.

Another option that was a uniquely powerful feature of GPT-3 was few-shot in-context learning, where we could simply insert a few examples of a task into our prompt, and the language model could more easily pick up the format of our task and complete the prompt.
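The difference between zero-shot and few-shot prompting is only in the text we send. Here is a small sketch; the `build_prompt` helper, the instruction wording, and the demonstration pairs are illustrative choices, not a fixed format that any particular model requires.

```python
def build_prompt(instruction, query, demonstrations=()):
    """Build a translation prompt; demonstrations are (English, French) pairs."""
    lines = [instruction]
    for source, target in demonstrations:  # in-context examples, if any
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
    lines.append(f"English: {query}")
    lines.append("French:")                # the LM continues from here
    return "\n".join(lines)

instruction = "Translate English to French."

# Zero-shot: task description and query only.
zero_shot = build_prompt(instruction, "cheese")

# Few-shot: the same prompt with a couple of worked examples inserted.
few_shot = build_prompt(
    instruction,
    "cheese",
    demonstrations=[("sea otter", "loutre de mer"),
                    ("peppermint", "menthe poivrée")],
)

print(zero_shot)
print("---")
print(few_shot)
```

No model weights change in either case; the demonstrations only shape what the LM sees before it generates.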

This was another huge moment in NLP, where suddenly many tasks were just about solved even without any task-specific training of the model, and the largest models were best at this. Since these massive models are difficult to host locally, OpenAI and other tech companies provide paid APIs to either prompt them or fine-tune them, which has its pros and cons that we’ll talk about later.

15 of 48

Instruction Tuning

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, …, GPT-3, FLAN, InstructGPT, ChatGPT)

Jason Wei, et al. (2022). Finetuned Language Models are Zero-shot Learners. ICLR 2022.

Long Ouyang, Jeff Wu, Xu Jiang, et al. (2022). “Training Language Models to Follow Instructions with Human Feedback.” arXiv: 2203.02155.

https://chat.openai.com/

16 of 48

Vision & Multimodality

(Timeline figure, 2013 to 2023: word2vec, GloVe, RNN LMs, attention, transformer, ELMo, GPT, BERT, GPT-2, RoBERTa, MegatronLM, Turing-NLG, …, GPT-3, InstructGPT, ChatGPT, Flamingo, BLIP-2, GPT-4, …)

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al. (2022). “Flamingo: a Visual Language Model for Few-Shot Learning.” Advances in Neural Information Processing Systems 35.

Junnan Li, Dongxu Li, Silvio Savarese, & Steven Hoi. (2023). “BLIP-2: Bootstrapping Language-Image Pre-Training with Frozen Image Encoders and Large Language Models.” arXiv: 2301.12597.

OpenAI. (2023). “GPT-4 Technical Report.” arXiv: 2303.08774.

Recently, there’s also been an increasing emphasis on incorporating images and videos into multimodal LLMs which are capable of representing and communicating about visual and text inputs together. While this effort has happened in parallel through this whole development of LLMs, some of the recent models we’ve seen, such as Flamingo, BLIP-2, and most notably GPT-4 have had the biggest successes in doing this and creating machines capable of reasoning about images in zero-shot and in-context learning settings.

While most of the implementation details of GPT-4 have not been shared with the public, it is thought that what makes GPT-4 so powerful is its pre-training on both images and text together. My and Joyce’s lab, called Situated Language and Embodied Dialogue, is very interested in this direction of multimodal LLMs, and we have several ongoing projects applying them to different problems and domains and really maximizing their value.

17 of 48

Jingfeng Yang, Hongye Jin, Ruixiang Tang, et al. 2023. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. arXiv: 2304.13712.

As you can see, the last several years have seen an unprecedented rate of evolution in language technologies that led us to the LLMs we have today. This tree gives a big picture of all the LLMs that are out there today and their key architectural variations, which are open-source vs. closed source, etc.

Just to give you some jargon that you might hear as you start to dive into using LLMs…

Some LMs are encoder-only, meaning they represent language, but don’t generate language. Some earlier LMs like BERT were encoder-only.

Some LMs are encoder-decoder, meaning they have separate modules for representing and generating language. This can make them especially adept at tasks like translating or summarizing language, which require a strong representation of the language inputs (we saw some examples of this too).

However, encoder-decoder architectures are also more complex and less scalable. Most LMs today are decoder-only, meaning they are primarily focused on generating language, and this is their main objective in self-supervised training. While they do learn a language representation, this representation is trained to inform generation. Despite only being focused on generating the next word in language, decoder-only LMs (especially larger ones trained on a lot of data like GPT-4) are still great tools for most language tasks.

Depending on the type of task you’re working with and your hardware restrictions, you may find yourself leaning one way or the other when it comes to these types of LM architectures.

So now that I’ve talked a lot about the history of LLMs, hopefully if nothing else you got some insights about how they work at their core…

18 of 48

Limitations of LLMs

Despite these advancements and impressive capabilities, LLMs have some key limitations that cause undesirable behaviors
In order to effectively and responsibly apply them in research, we need to be mindful of these limitations!

19 of 48

Limitations of LLMs: Spurious Cues

Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.

Karen became good friends with her roommate.

Karen hated her roommate.

Schwartz, R., Sap, M., Konstas, I., Zilles, L., Choi, Y., & Smith, N.A. (2017). The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task. In CoNLL 2017.

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P. & Allen, J. (2016). A corpus and cloze evaluation for deeper understanding of commonsense stories. In NAACL 2016.

How does the story end?

😀

😡

20 of 48

Limitations of LLMs: Data Contamination

LLMs have seen so much data in pre-training
They may have been trained on benchmark datasets…
Training on the test data is not an objective evaluation!
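One rough way to probe for contamination, when you do have access to some of the training text, is to look for word n-grams that a test example shares with it. The sketch below is only a heuristic under that assumption: the function names and the choice of `n` are illustrative, and for closed models where the training data is unavailable other detection methods are needed.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(test_example, corpus_text, n=3):
    """Fraction of the test example's n-grams that also occur in the corpus."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(corpus_text, n)) / len(test_grams)

# Toy "pre-training corpus" that happens to contain a benchmark story.
corpus = "Karen was assigned a roommate her first year of college"

seen = overlap_fraction("Karen was assigned a roommate", corpus)
unseen = overlap_fraction("The film was a charming journey", corpus)
print(seen, unseen)
```

A high overlap suggests the example may have been seen in pre-training, so strong performance on it is not an objective measure of ability.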

Inbal Magar & Roy Schwartz. (2022). Data Contamination: From Memorization to Exploitation. In ACL 2022.

W. Shi, A. Ajith, M. Xia, et al. (2023). Detecting Pretraining Data from Large Language Models. arXiv: 2310.16789.

T.B. Brown, B. Mann, N. Ryder, et al. (2020). Language Models are Few-Shot Learners. arXiv: 2005.14165.

21 of 48

Limitations of LLMs: Interpretability

(figure from Vinay Iyengar)

…

Another issue is the interpretability of models. We have to remember how incredibly complex these transformer-based architectures are. With the largest LLMs approaching trillions of parameters, it’s very opaque how a conclusion is reached within these models. Therefore, it becomes hard to tell whether models are generating language based on spurious information or memorized training data, or whether they are showing a true, human-aligned understanding.

Compounding the issue is the fact that the best LLMs today like GPT-4 and PaLM are closed-source, meaning we can’t even attempt to inspect the internals of the models at all. In the technical report for GPT-4, they give basically no details about the model architecture or the data it was trained on, and they claim this is due to the competitive landscape and safety implications.

22 of 48

Limitations of LLMs: Hallucination

Hallucination: generation of text that is factually incorrect, nonsensical, unfaithful to inputs, or otherwise incoherent

Z. Ji, N. Lee, R. Frieske, et al. (2023). Survey of Hallucination in Natural Language Generation. In ACM Computing Surveys, 55.

https://chat.openai.com/

Legal Dive

And interpretability leads me to my next point, hallucination.

Hallucination is a broad term that generally refers to cases of generation of text…

This can take many different forms, but the general message of this is, you can’t trust LLM outputs to be factual or coherent. For example, if I ask ChatGPT about Professor Chai, here’s what it tells me… <phd is actually from Duke>

And there have been some instances of this type of hallucination causing serious legal issues. For example, earlier this year a lawyer cited fake cases generated by ChatGPT in a legal brief.

ChatGPT’s attribution is getting better recently, but it’s still not completely trustworthy – and unless there are some serious changes in how these systems are trained, you will always need to be wary of trusting their outputs and double check them.

23 of 48

Summary

LLMs are remarkably useful for many language tasks, but these limitations make them impossible to trust consistently
Verifying LLM outputs is important:

Automated metrics
Human evaluation

We must be mindful that LLMs are primarily trained to:

Generate fluent-sounding language (pre-training)
Satisfy users’ requests (instruction-tuning)

24 of 48

2 Ways to Customize LLMs

Fine-Tuning:

Small hardware requirements

Host locally (private, more flexible)

Optimized for specific task

Technical skills, engineering effort

Large amount of training data

Hard to adapt once trained

Prompting:

Larger hardware requirements

Best LMs behind proprietary APIs

Requires prompt engineering

User-friendly language interface

No training data needed

Generalizable and adaptable

...Fine-tuning and prompting

- While fine-tuning LMs can have smaller hardware requirements because we can use architectures with fewer learned parameters and still get good results, prompting tends to require much larger LMs and thus much larger hardware requirements.

- As such, we usually can train and host fine-tuned LMs locally, which is private, flexible, and reliable. But when prompting larger LMs, the best models are proprietary – if we use these ones, we sacrifice privacy (bad for sensitive data), flexibility, and we depend on the service provider to always be available for our application.

- Lastly, while fine-tuning leads to a model that is highly optimized for a specific task, it takes a lot of prompt engineering to get to that level when prompting larger LMs, and issues like hallucination still jeopardize results.

- For these reasons, a lot of industry solutions relying on LMs still fine-tune them for specific tasks.

- It’s for these reasons that LMs have really gotten widespread adoption.

Depending on your specific resources and needs, you can choose which method to customize your LM. We’ll dig a little deeper into both of them.

25 of 48

Outline

The Road to LLMs
Fine-Tuning LLMs
Prompting LLMs

26 of 48

Fine-Tuning: Text Classification

What is the sentiment of this text?

The film was a charming and affecting journey.

(Figure: the text passes through a pre-trained LM and a classification head, which outputs the scores -0.11 and 2.30; a softmax converts these into P(Neg.) and P(Pos.) on a 0.0 to 1.0 scale, and the label is chosen from Negative and Positive.)
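The last step of this pipeline can be sketched in a few lines. I am assuming the two scores on the slide are in the order (Negative, Positive); the cross-entropy loss at the end is the quantity that fine-tuning typically minimizes, shown here for a single example whose gold label is assumed to be Positive.

```python
import math

labels = ["Negative", "Positive"]
# Scores (logits) from the classification head, as shown on the slide.
# Assumption: they are listed in the same order as `labels`.
logits = [-0.11, 2.30]

# Softmax turns the scores into probabilities.
exps = [math.exp(z) for z in logits]
probs = {label: e / sum(exps) for label, e in zip(labels, exps)}

predicted = max(probs, key=probs.get)

# Cross-entropy loss for this example, assuming the gold label is Positive.
# Fine-tuning adjusts the LM and head weights to make this loss smaller.
gold = "Positive"
loss = -math.log(probs[gold])

print(predicted, probs, loss)
```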

27 of 48

Fine-Tuning: Multiple Choice Completion

Which sentence is most likely to fill in the blank?

It was a very hot summer day.
___________________________
He felt much better!

A. He decided to run in the heat.
B. He drank a glass of ice cold water.

(Figure: each choice is filled into the blank and the completed text is passed separately through the pre-trained LM and a classification head, giving scores of -0.45 for A and 3.76 for B; a softmax over the scores gives P(A) and P(B) on a 0.0 to 1.0 scale.)
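In code, the multiple-choice setup amounts to scoring each completed story separately and then normalizing across the choices. In this sketch, `score_story` is a hypothetical stand-in that just looks up the two scores from the slide (assumed to belong to A and B in that order); a real version would run the fine-tuned LM and classification head on the text.

```python
import math

opening = "It was a very hot summer day."
ending = "He felt much better!"
choices = {
    "A": "He decided to run in the heat.",
    "B": "He drank a glass of ice cold water.",
}

def fill_blank(middle):
    """Insert a candidate sentence into the blank."""
    return f"{opening} {middle} {ending}"

# Scores read off the slide, keyed by the completed story.
slide_scores = {
    fill_blank(choices["A"]): -0.45,
    fill_blank(choices["B"]): 3.76,
}

def score_story(story):
    """Hypothetical scorer standing in for the LM plus classification head."""
    return slide_scores[story]

# One score per choice, then a softmax across the choices.
logits = {label: score_story(fill_blank(text)) for label, text in choices.items()}
total = sum(math.exp(s) for s in logits.values())
probs = {label: math.exp(s) / total for label, s in logits.items()}
best = max(probs, key=probs.get)
print(best, probs)
```

The same pattern extends to multiple-choice QA by scoring each question-answer pair instead of each completed story.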

28 of 48

Fine-Tuning: Multiple Choice QA

How many legs does a ladybug have?

A. 2
B. 4
C. 6

(Figure: each question-answer pair, “Q: How many legs does a ladybug have? A: 2”, “… A: 4”, and “… A: 6”, is passed separately through the pre-trained LM and a classification head; the resulting scores, shown on the slide as -0.05, 3.77, and 0.01, go through a softmax to give P(A), P(B), and P(C) on a 0.0 to 1.0 scale.)

29 of 48

Fine-Tuning: Token Classification

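Label each token with its part of speech (POS):

I see a dog

(Figure: each token’s representation from the pre-trained LM goes through a classification head and a softmax, giving per-token probabilities P(V), P(N), P(D), and P(P) over the labels Verb, Noun, Determiner, and Pronoun on a 0.0 to 1.0 scale; example scores shown: -0.72 0.56 0.09 -0.11.)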

30 of 48

Fine-Tuning: Text Generation

Continue the text:

Jack shook his piggy …

(Figure: the pre-trained LM and a language modeling head score the candidate next words; a softmax gives P(toy), P(tail), P(bank), P(fruit), P(and), … on a 0.0 to 1.0 scale.)

31 of 48

Parameter-Efficient Fine-Tuning (PEFT)

While fine-tuning LMs is generally more feasible when we have less available compute, there are still some problems:

Fine-tuning on a large amount of data can take a long time
The size of LM we can fine-tune is limited by compute
Updating all weights of the LM during fine-tuning is expensive and inefficient

Creates a need for parameter-efficient fine-tuning (PEFT) methods!

32 of 48

Low-Rank Adaptation (LoRA)

(figure from Sebastian Raschka)

Edward Hu, Yelong Shen, Phillip Wallis, et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv: 2106.09685.
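The core idea of LoRA can be sketched without any deep learning library: keep the pre-trained weight matrix W frozen and learn a low-rank update B·A on top of it. The sizes, the random values, and the `scale` parameter below are toy assumptions for illustration; in practice you would use an existing PEFT library rather than code like this.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

d, k, r = 8, 8, 2  # toy sizes; r is the (small) rank of the update
rng = random.Random(0)

W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]  # trainable, starts at zero

def adapted_weights(W, A, B, scale=1.0):
    """Effective weights used at inference: W + scale * (B @ A)."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(k)]
            for i in range(d)]

full_params = d * k        # weights updated by ordinary fine-tuning
lora_params = r * (d + k)  # weights updated by LoRA
print(full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and only the small A and B matrices need gradients and optimizer state. The savings grow quickly with matrix size, since d·k grows much faster than r·(d + k) when r is small.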

33 of 48

Outline

The Road to LLMs
Fine-Tuning LLMs
Prompting LLMs

34 of 48

Prompting LMs

To customize an LLM for your problem through prompting, need to make a few choices (prompt engineering):

Prompt template
Answer mapping
In-context demonstration

Figure credit: P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, & G. Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv: 2107.13586.

35 of 48

Language Models (LMs)

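Jack needed some money, so he went and shook his piggy ____

(Figure: an LM maps the incomplete sentence to a probability distribution, on a 0.0 to 1.0 scale, over candidate next words: tail, and, toy, bank, fruit, …; image from dreamstime)

Minsky, M. (2000). Commonsense-based interfaces. In Commun. ACM, 43(8): p. 66-73.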

36 of 48

Prompt Templates

If filling a blank from a few possible choices, can use a cloze prompt:

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, & G. Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv: 2107.13586.

Sheng Zhang, Xiaodong Liu, Jingjing Liu, et al. 2018. ReCoRD: Bridging the Gap Between Human and Machine Commonsense Reading Comprehension. arXiv: 1810.12885.

Task	Inputs ([X])	Template	Answer ([Z])
Named Entity Recognition (NER)	[X₁]: Mike went to Paris; [X₂]: Paris	[X₁]. [X₂] is a [Z] entity.	organization, location, person name, …
Reading Comprehension	Daniela Hantuchova knocks Venus Williams out of Eastbourne 6-2 5-7 6-2.	[X] Hantuchova breezed through the first set in just under 40 minutes after breaking Williams’ serve twice to take it 6-2 and led the second 4-2 before [Z] hit her stride.	Daniela Hantuchova, Venus Williams

The first step to applying LLMs to your task is to make a text-based template to prompt the language model with. One type of template you can use is a cloze template, where your task involves filling in a blank from a few possible expected choices. I’ll show you some examples.

If you are doing some kind of tagging task like named entity recognition, you may provide the LLM a text, like “Mike went to Paris”, then if you want to tag “Paris”, just ask “Paris is a ____ entity” and compare the LLM's judged probability of possible ways to fill in the blank, such as “organization” “location” “person name”.

Another example could be reading comprehension, where you want the LLM to fill in a blank for some kind of reading comprehension test, such as this one on a passage about a match between two tennis players.
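Here is a small sketch of the cloze approach for the NER example. The slot names `[X1]` and `[X2]` stand in for the subscripted slots in the table, and `toy_score` is a hypothetical stand-in with made-up numbers; in practice you would replace it with the probability your LM assigns to each filled-in text.

```python
def fill_cloze(template, inputs, answer):
    """Fill the input slots and the [Z] blank of a cloze template."""
    text = template
    for slot, value in inputs.items():
        text = text.replace(slot, value)
    return text.replace("[Z]", answer)

template = "[X1]. [X2] is a [Z] entity."
inputs = {"[X1]": "Mike went to Paris", "[X2]": "Paris"}
candidates = ["organization", "location", "person name"]

# One filled-in text per candidate answer.
filled = [fill_cloze(template, inputs, c) for c in candidates]

def toy_score(text):
    """Hypothetical stand-in for the LM's probability of the filled-in text."""
    made_up = {"organization": 0.05, "location": 0.90, "person name": 0.05}
    return next(p for answer, p in made_up.items()
                if f"a {answer} entity" in text)

# Pick the candidate whose filled-in text the (toy) LM finds most probable.
best = max(candidates, key=lambda c: toy_score(fill_cloze(template, inputs, c)))
print(filled[1])
print(best)
```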

37 of 48

Prompt Templates

When completing a prompt or generating text, use a prefix prompt:

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, & G. Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv: 2107.13586.

Task	Inputs ([X])	Template	Answer ([Z])
Sentiment Classification	I love this movie.	[X] The movie is [Z]	good, bad
Question Answering	What color is the sky? A. Red B. Yellow C. Blue D. Green	Question: [X] Answer: [Z]	A, B, C, D
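With a prefix prompt, the LM continues the text, and an answer mapping turns the word it prefers into a task label. In the sketch below, the next-word probabilities and the label names "positive" and "negative" are made up for illustration; with a real LM you would read the probabilities of the answer words from its output distribution.

```python
def fill_prefix(template, x):
    """Insert the input; the LM continues from where [Z] would go."""
    return template.replace("[X]", x).replace("[Z]", "").rstrip()

# Answer mapping: which generated word corresponds to which task label.
answer_to_label = {"good": "positive", "bad": "negative"}

prompt = fill_prefix("[X] The movie is [Z]", "I love this movie.")

# Hypothetical next-word probabilities returned by an LM for this prompt.
next_word_probs = {"good": 0.62, "bad": 0.03, "long": 0.10, "a": 0.25}

def predict_label(next_word_probs, answer_to_label):
    """Compare only the answer words; other vocabulary words are ignored."""
    best = max(answer_to_label, key=lambda w: next_word_probs.get(w, 0.0))
    return answer_to_label[best]

label = predict_label(next_word_probs, answer_to_label)
print(prompt)
print(label)
```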

38 of 48

Prompt Templates

When completing a prompt or generating text, use a prefix prompt:

P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, & G. Neubig. 2021. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. arXiv: 2107.13586.

Task	Inputs ([X])	Template	Answer ([Z])
Summarization	MIDAS and the Michigan AI Lab will host a faculty workshop with the theme of Generative Artificial Intelligence (Generative AI) for research. …	[X] tl;dr [Z]	MIDAS & Michigan AI Lab host faculty workshop on Generative AI for research. Explore impact, use cases, ethical considerations & collaboration opportunities. All faculty welcome.
Translation	Je vous aime.	French: [X] English: [Z]	I love you. I fancy you. …

39 of 48

Finding the Best Template and Answers

Different prompts can yield different results
May take extra work to find the best prompt

Trial and error
Ensembling templates

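(Figure: the input “I love this movie.” is inserted into each template below, and the LLM’s P([Z]=_) is shown for the answer words good, great, okay, bad, and awful.)

[X] The movie was so [Z]
[X] I thought it was [Z]
[X] The movie is [Z]
[X] This movie was [Z]
[X] The film is [Z]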

Since LLMs are just trained on a bunch of text data and attempt to mirror the distribution of language seen in pre-training, even slightly different prompts can sometimes yield very different outputs from the LLM and thus different results for your task.

Because of this, <click> some extra work may be required to find the best prompt, and the best mapping between LLM outputs and answers.

While trial and error can get you pretty far, <click>

It can also be worthwhile to ensemble multiple prompt templates <click>, that is, have multiple templates and find a way to combine the LLM outputs for them. Going back to our example on classifying sentiment of text about movies, we could think of many different ways to ask an LLM what the sentiment of the text is which could be manually written out or automatically generated through a number of means that I won’t go into.

Each of these templates may produce slightly different results - in this example, the darkness of the color represents the probability of each of these words being the next word to fill in the end of the template.

To ensemble these outputs, we have a few choices – for example, we could just use a majority vote to choose the output word, or…
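Two simple ways to ensemble templates are sketched below: a majority vote over each template's top answer, and summing the probabilities across templates before picking the best word. The probabilities are invented for illustration; with a real LLM, each row would come from prompting it with one template.

```python
from collections import Counter

answers = ["good", "great", "okay", "bad", "awful"]

# Hypothetical P([Z] = word) for "I love this movie." under each template,
# in the same order as `answers`.
template_probs = {
    "[X] The movie was so [Z]": [0.20, 0.60, 0.10, 0.05, 0.05],
    "[X] I thought it was [Z]": [0.30, 0.50, 0.10, 0.05, 0.05],
    "[X] The movie is [Z]":     [0.25, 0.45, 0.20, 0.05, 0.05],
    "[X] This movie was [Z]":   [0.20, 0.55, 0.15, 0.05, 0.05],
    "[X] The film is [Z]":      [0.50, 0.30, 0.10, 0.05, 0.05],
}

def majority_vote(template_probs):
    """Each template votes for its most probable word; the top vote wins."""
    votes = [answers[row.index(max(row))] for row in template_probs.values()]
    return Counter(votes).most_common(1)[0][0]

def sum_probabilities(template_probs):
    """Add up each word's probability across templates; the top total wins."""
    totals = [sum(column) for column in zip(*template_probs.values())]
    return answers[totals.index(max(totals))]

voted = majority_vote(template_probs)
summed = sum_probabilities(template_probs)
print(voted, summed)
```

The two rules agree here, but they can differ: voting ignores how confident each template is, while summing lets one very confident template outweigh several lukewarm ones.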

40 of 48

Finding the Best Template and Answers

Different prompts can yield different results
May take extra work to find the best prompt

Trial and error
Ensembling templates

40

good	great	okay	bad	awful
	✔
	✔
	✔
	✔
✔

I love this movie.

[X] The movie was so [Z]

[X] I thought it was[Z]

[X] The movie is [Z]

[X] This movie was [Z]

[X] The film is [Z]

LLM

P([Z]=_)
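
The check marks on this slide appear to show the majority-vote option: each template votes for its single most likely answer word, and the most common vote wins. A minimal sketch of that idea follows. The probabilities are made-up numbers for illustration, not real model outputs, and `top_answer` and `majority_vote` are hypothetical helper names.

```python
from collections import Counter

ANSWERS = ["good", "great", "okay", "bad", "awful"]

# Hypothetical values of P([Z] = word) for each template, in ANSWERS order.
# These numbers are invented for illustration only.
TEMPLATE_PROBS = {
    "[X] The movie was so [Z]": [0.30, 0.45, 0.15, 0.06, 0.04],
    "[X] I thought it was [Z]": [0.25, 0.50, 0.15, 0.06, 0.04],
    "[X] The movie is [Z]": [0.28, 0.47, 0.15, 0.06, 0.04],
    "[X] This movie was [Z]": [0.20, 0.55, 0.15, 0.06, 0.04],
    "[X] The film is [Z]": [0.50, 0.30, 0.10, 0.06, 0.04],
}


def top_answer(probs):
    """Return the answer word with the highest probability for one template."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return ANSWERS[best_index]


def majority_vote(template_probs):
    """Let each template vote for its top word and return the most common vote."""
    votes = Counter(top_answer(probs) for probs in template_probs.values())
    return votes.most_common(1)[0][0], votes


prediction, votes = majority_vote(TEMPLATE_PROBS)
print(prediction, dict(votes))
```

With these invented numbers, four templates vote for one word and the fifth votes for another, which mirrors the pattern of check marks on the slide.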

41 of 48

Finding the Best Template and Answers

Different prompts can yield different results
May take extra work to find the best prompt

Trial and error
Ensembling templates

41

good	great	okay	bad	awful
∑	∑	∑	∑	∑
∑	∑	∑	∑	∑
∑	∑	∑	∑	∑
∑	∑	∑	∑	∑
∑	∑	∑	∑	∑

I love this movie.

[X] The movie was so [Z]

[X] I thought it was [Z]

[X] The movie is [Z]

[X] This movie was [Z]

[X] The film is [Z]

LLM

P([Z]=_)
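
The sums on this slide suggest a second way to ensemble templates: add up the probability each template assigns to an answer word, then pick the word with the largest total. The sketch below assumes that reading. The probabilities are made-up numbers for illustration, and `ensemble_by_sum` is a hypothetical helper name.

```python
ANSWERS = ["good", "great", "okay", "bad", "awful"]

# Hypothetical values of P([Z] = word), one row per template, in ANSWERS order.
# These numbers are invented for illustration only.
TEMPLATE_PROBS = [
    [0.30, 0.45, 0.15, 0.06, 0.04],  # [X] The movie was so [Z]
    [0.25, 0.50, 0.15, 0.06, 0.04],  # [X] I thought it was [Z]
    [0.28, 0.47, 0.15, 0.06, 0.04],  # [X] The movie is [Z]
    [0.20, 0.55, 0.15, 0.06, 0.04],  # [X] This movie was [Z]
    [0.50, 0.30, 0.10, 0.06, 0.04],  # [X] The film is [Z]
]


def ensemble_by_sum(rows):
    """Sum each answer word's probability over all templates and pick the largest."""
    totals = [sum(column) for column in zip(*rows)]
    best_index = max(range(len(totals)), key=lambda i: totals[i])
    return ANSWERS[best_index], dict(zip(ANSWERS, totals))


prediction, totals = ensemble_by_sum(TEMPLATE_PROBS)
print(prediction, totals)
```

Unlike a majority vote, summing keeps information about how confident each template was, so one very confident template can outweigh several lukewarm ones. Dividing the totals by the number of templates gives an average and does not change which word wins.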

42 of 48

Finding the Best Template and Answers

Different prompts can yield different results
May take extra work to find the best prompt

Trial and error
Ensembling templates
Ensembling answers

42

good	great	okay	bad	awful

I love this movie.

[X] The movie was so [Z]

[X] I thought it was [Z]

[X] The movie is [Z]

[X] This movie was [Z]

[X] The film is [Z]

LLM

P([Z]=_)
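
The slide only names "ensembling answers", so the sketch below is one plausible reading rather than the method from the talk: several answer words are mapped to the same task label, and their probabilities are pooled per label. The word-to-label mapping, the probabilities, and the helper name `label_scores` are all assumptions made for illustration.

```python
# Assumed mapping from answer words to sentiment labels (not from the slides).
ANSWER_TO_LABEL = {
    "good": "positive",
    "great": "positive",
    "okay": "neutral",
    "bad": "negative",
    "awful": "negative",
}


def label_scores(word_probs):
    """Pool the probability of every answer word that maps to the same label."""
    scores = {}
    for word, prob in word_probs.items():
        label = ANSWER_TO_LABEL[word]
        scores[label] = scores.get(label, 0.0) + prob
    return scores


# Hypothetical values of P([Z] = word) for a single template.
word_probs = {"good": 0.30, "great": 0.28, "okay": 0.32, "bad": 0.06, "awful": 0.04}

scores = label_scores(word_probs)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

In this invented example the single most likely word is "okay", but the two positive words together carry more probability, so pooling answers changes the predicted label.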

45 of 48

Managing Randomness in LLMs

LLM decoding algorithms may incorporate some randomness by default to increase the diversity of generation
Some solutions:

Generate multiple times and average results
Greedy decoding
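
The two solutions above can be sketched on a toy next-word distribution. This is a simplified illustration, not a real LLM call: the probabilities are made up, Python's `random` module stands in for the model's sampler, and "averaging" is implemented as counting how often each answer is sampled.

```python
import random
from collections import Counter

ANSWERS = ["good", "great", "okay", "bad", "awful"]
PROBS = [0.30, 0.45, 0.15, 0.06, 0.04]  # hypothetical P([Z] = word)


def greedy_decode(probs):
    """Always pick the most likely word, so repeated runs give the same output."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return ANSWERS[best_index]


def sample_decode(probs, rng):
    """Draw one word at random in proportion to its probability."""
    return rng.choices(ANSWERS, weights=probs, k=1)[0]


def sample_and_aggregate(probs, n_samples, rng):
    """Sample many times and return the most frequent word plus all counts."""
    counts = Counter(sample_decode(probs, rng) for _ in range(n_samples))
    return counts.most_common(1)[0][0], counts


rng = random.Random(0)  # fixed seed so the sampled runs are repeatable
greedy = greedy_decode(PROBS)
aggregated, counts = sample_and_aggregate(PROBS, 1000, rng)
print(greedy, aggregated, dict(counts))
```

A single sampled output can land on any of the words, while greedy decoding and the aggregated samples both settle on the most likely one. In practice, many LLM libraries and APIs expose settings that control sampling, but the exact option names depend on the tool you are using.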

45

46 of 48

In-Context Learning

46

Tom B. Brown, Benjamin Mann, Nick Ryder, et al. (2020). “Language Models are Few-Shot Learners.” arXiv: 2005.14165.

Now, so far all of our approaches for prompting the LLM have been zero-shot, meaning we directly prompt it to complete some task, such as English-to-French translation, without showing it any examples.

As I touched on before, another option that was a uniquely powerful feature of GPT-3 when it came out is few-shot or in-context learning, where we can simply append a few examples of a task to our prompt, and the language model can more easily pick up the format of our task and complete the prompt. This can bring big performance gains on many tasks.

Another benefit of few-shot learning is reducing the need for prompt engineering when selecting your template. Providing several demonstrations of a task can help the LLM pick up the format of what you’re asking it to do, thus reducing the impact of the specific words used in the prompt. However, it may still take a few tries with different prompts to get the best results.
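
To make the difference between zero-shot and few-shot prompts concrete, here is a minimal sketch that builds both for the English-to-French translation task. The `=>` format and the demonstration pairs are illustrative choices in the style of Brown et al. (2020), and `build_few_shot_prompt` is a hypothetical helper name.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Put the task description first, then the demonstrations, then the open query."""
    lines = [task_description]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")  # left open for the LLM to complete
    return "\n".join(lines)


# Illustrative demonstration pairs (assumed for this sketch).
demos = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]

zero_shot_prompt = build_few_shot_prompt("Translate English to French:", [], "cheese")
few_shot_prompt = build_few_shot_prompt("Translate English to French:", demos, "cheese")
print(few_shot_prompt)
```

The only difference between the two prompts is the list of demonstrations, so it is easy to experiment with how many examples a task needs.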

Now, there have been a lot of followup works on different ways to do this in-context learning…

47 of 48

Chain-of-Thought Prompting

47

Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems 35.

Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. Advances in Neural Information Processing Systems 35.

Chain-of-thought prompting is one significant example. For some language tasks, producing the output may require multiple steps of reasoning.

Some folks from Google Brain found that when prompting LLMs, we can give not only example questions and answers like we saw before, but also explanations of the reasoning behind each answer. This enables the LLM to pick up some appropriate ways to reach conclusions. As you can see here, if we just ask an LLM to give us the answer to some math problem, it is very likely to get the answer wrong, as it hasn’t been trained to do mental math and doesn’t have an internal concept of numbers. However, if we add a chain of thought to the prompt, the model can make better use of the knowledge it acquired in pre-training, generate a reasoning chain, and finally arrive at the correct answer. This can be a useful tool to get LLMs to tackle complex problems.
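
As a rough sketch of how the two kinds of prompt differ, the code below builds a standard few-shot prompt and a chain-of-thought prompt from the same demonstration. The arithmetic word problems follow the well-known example from Wei et al. (2022); the `Q:`/`A:` layout and the helper names `format_demo` and `build_prompt` are assumptions for illustration.

```python
# One demonstration with an explicit reasoning chain.
DEMO = {
    "question": (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?"
    ),
    "reasoning": (
        "Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11."
    ),
    "answer": "11",
}


def format_demo(demo, with_reasoning):
    """Format one worked example, optionally including its reasoning chain."""
    answer = f"The answer is {demo['answer']}."
    if with_reasoning:
        answer = f"{demo['reasoning']} {answer}"
    return f"Q: {demo['question']}\nA: {answer}"


def build_prompt(demos, question, with_reasoning=True):
    """Join the demonstrations and end with the new question for the LLM to answer."""
    blocks = [format_demo(demo, with_reasoning) for demo in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


new_question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
standard_prompt = build_prompt([DEMO], new_question, with_reasoning=False)
cot_prompt = build_prompt([DEMO], new_question, with_reasoning=True)
print(cot_prompt)
```

The two prompts are identical except for the reasoning sentence in the demonstration, which is what nudges the model to write out its own reasoning before the final answer.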

Now, we might not always have data for these strong explanations of reasoning problems. However, a related work (Kojima et al., 2022) found that it can even be helpful to ask the LLM to first think step by step and generate a reasoning chain before prompting it for a final answer. This is more of a zero-shot setting now, as the LLM has to generate both the reasoning and the final answer without any demonstrations.
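
The two-step idea can be sketched as follows: first prompt for reasoning, then feed that reasoning back and prompt for the final answer. This is a sketch under assumptions, not a definitive implementation. The `generate` argument is a hypothetical stand-in for whatever function calls your LLM, `fake_generate` returns canned text so the example runs without a model, and the exact trigger phrases are illustrative.

```python
from typing import Callable, Tuple

REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording for the second prompt


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    """Ask for a reasoning chain first, then ask for the final answer."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt).strip()
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


def fake_generate(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned text for this demo only."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 9."
    return "They had 23 apples and used 20, leaving 3. Buying 6 more gives 3 + 6 = 9."


question = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and "
    "bought 6 more, how many apples do they have?"
)
reasoning, answer = zero_shot_cot(question, fake_generate)
print(reasoning)
print(answer)
```

To use this with a real model, replace `fake_generate` with a function that sends the prompt to your LLM and returns its text completion.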

There have been a lot of different twists on this idea of chain of thought lately, but I won’t have time to cover them here.

48 of 48

48

@shanestorks www.shanestorks.com

Next: From Theory to Practice!

I’m on the job market for academic and industry positions!