1 of 34

Insights from NLP research

2 of 34


LMs are trained to predict missing words

[Diagram: a language model predicts the [MASK] token in “The quick brown fox [MASK] jumps”]
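The fill-in-the-blank objective can be illustrated with a toy, count-based stand-in for a real language model (the tiny corpus, the candidate list, and the `fill_mask` helper below are all invented for illustration):

```python
from collections import Counter

# Toy "masked LM": pick the candidate word whose (left, word) and
# (word, right) neighbor pairs were seen most often in a tiny corpus.
corpus = [
    "the quick brown fox quickly jumps over the lazy dog",
    "a brown fox quickly jumps over a log",
    "the quick brown fox gracefully jumps over the fence",
]

pair_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        if i > 0:
            pair_counts[(words[i - 1], w)] += 1   # (left neighbor, word)
        if i < len(words) - 1:
            pair_counts[(w, words[i + 1])] += 1   # (word, right neighbor)

def fill_mask(left, right, candidates):
    """Score each candidate by how often it co-occurred with both neighbors."""
    return max(candidates,
               key=lambda w: pair_counts[(left, w)] + pair_counts[(w, right)])

print(fill_mask("fox", "jumps", ["quickly", "banana", "gracefully"]))  # → quickly
```

A real masked LM scores every vocabulary word with a neural network; the counting here only mimics the interface.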

3 of 34

Language models are everywhere


Small language models

https://gluebenchmark.com/

https://super.gluebenchmark.com/

https://crfm.stanford.edu/helm/lite/latest/

https://paperswithcode.com/dataset/mmlu

https://paperswithcode.com/dataset/big-bench

4 of 34

Language models are everywhere


Sentiment

Question Answering

Summarization

Coreference Resolution

5 of 34

Transformer

[Example: the model completes “Harry never thought he ???” as “Harry never thought he would”]

6 of 34

BERT: Workflow

1. Self-attention with bi-directional context
2. Masked language modeling (MLM)

[Diagram: the current word w_t is replaced by [MASK] and predicted from context words w_t-2, w_t-1, w_t+1, w_t+2 on both sides]
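The MLM corruption step can be sketched as follows (the 15% rate and the 80/10/10 mask/random/keep split follow the BERT paper, but `mask_tokens` itself is an illustrative helper, not BERT's actual code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style MLM corruption: each selected position becomes [MASK] 80%
    of the time, a random word 10%, and stays unchanged 10%.
    Returns (corrupted tokens, labels with None at unselected positions)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)               # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)              # no loss at this position
            corrupted.append(tok)
    return corrupted, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens)
```

During training, the loss is computed only at positions where a label is present.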

7 of 34

GPT2: Workflow

1. Self-attention with uni-directional context
2. Causal language modeling (CLM)

[Diagram: the current word w_t is predicted from left-context words w_t-2, w_t-1 only]
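The uni-directional context is typically enforced with a lower-triangular attention mask; a minimal sketch (the `causal_mask` helper name is illustrative):

```python
def causal_mask(n):
    """Lower-triangular attention mask: position i may attend only to
    positions 0..i (uni-directional context, as in GPT-2)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```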

8 of 34

Fine-tuning: tune a pretrained language model on a task

9 of 34

Downsides of full fine-tuning

10 of 34

Model Interpretability?

[Diagram: a classifier probes BERT’s internal representations (“BERTology”)]

11 of 34

Hierarchy of Linguistic Info - Setting

  • Conneau et al., ACL’18: build a diagnostic classifier to predict whether a linguistic property is encoded in a given sentence representation.
  • Features:
    • Surface – Sentence Length, Word Content
    • Syntactic – Bigram shift, Tree depth, Top constituent
    • Semantic – Tense, Subject Number, Object Number, Coordination Inversion and Semantic Odd Man Out.

[Diagram: a BERT layer feeds a simple classifier that predicts sentence length; high prediction accuracy suggests the representation captures the sentence-length feature]
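The probing setup can be sketched end to end with a stand-in encoder (the `encode` stub, the toy data, and the tiny logistic-regression probe are all invented for illustration; a real probe sits on frozen BERT-layer activations):

```python
import math

# Toy probing setup: a frozen "encoder" produces sentence representations
# (here just fabricated 2-d features); a simple linear probe is trained on
# top to predict a property (sentence length: short=0, long=1). If the
# probe does well, the representation plausibly encodes that property.

def encode(sentence):
    # stand-in for a frozen BERT layer; NOT a real encoder
    words = sentence.split()
    return [len(words) / 10.0, sum(map(len, words)) / 50.0]

data = [("a dog runs", 0), ("the cat sat", 0),
        ("the quick brown fox jumps over the lazy dog today", 1),
        ("she said she would come to the party next week", 1)]

w, b = [0.0, 0.0], 0.0
for _ in range(500):                     # plain logistic-regression SGD
    for sent, y in data:
        x = encode(sent)
        p = 1 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b)))
        g = p - y                        # gradient of the log loss w.r.t. logit
        w = [w[0] - 0.5*g*x[0], w[1] - 0.5*g*x[1]]
        b -= 0.5 * g

def probe(sentence):
    x = encode(sentence)
    return int(1 / (1 + math.exp(-(w[0]*x[0] + w[1]*x[1] + b))) > 0.5)
```

The probe's accuracy, not the encoder's, is the quantity of interest: only the linear layer is trained.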

12 of 34

[Figure: probing results grouped into Surface, Syntactic, and Semantic features]

BERT composes a hierarchy of linguistic signals, ranging from surface to semantic features

13 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


14 of 34

T5 (Text-to-Text Transfer Transformer): Workflow

  • Text-to-text transformer
  • Encoder-decoder model
  • Reformulates all tasks (during pretraining and fine-tuning) in a text-to-text format

15 of 34

T5: Workflow, Encoder

  • Original text: Thank you for inviting me to your party last week
  • Input text: Thank you for inviting me to your party <Y> week (token masking)
  • Input text: Thank you <X> me to your party <Y> week
    • <X> replaces the span “for inviting” (span masking)
  • Input text: Thank you me to your party week (token dropping)
  • Input text: party me your to. last you inviting week Thanks (deshuffling)
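The span-masking variant can be sketched as a small function (an illustrative sketch only; T5's actual pipeline samples spans at random and ends the target with a final sentinel):

```python
def span_corrupt(words, spans):
    """Replace each (start, end) span with a sentinel token and collect the
    dropped words as the target, T5-style."""
    sentinels = ["<X>", "<Y>", "<Z>"]
    inp, target, prev = [], [], 0
    for sent, (start, end) in zip(sentinels, spans):
        inp += words[prev:start] + [sent]     # sentinel stands in for the span
        target += [sent] + words[start:end]   # target pairs sentinel with span
        prev = end
    inp += words[prev:]
    return " ".join(inp), " ".join(target)

words = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(words, [(2, 4), (8, 9)])
# inp: "Thank you <X> me to your party <Y> week"
# tgt: "<X> for inviting <Y> last"
```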

16 of 34

T5: Attention masking patterns

  • Fully-visible masking: Translate English to German: That is good. Target: Das ist gut.
    • the representation of “good” can attend to the entire sequence
  • Causal masking: Translate English to German: That is good. Target: Das ist gut.
    • the representation of “good” can only attend to “Translate English to German: That is”
  • Causal masking with prefix: Translate English to German: That is good. Target: Das ist gut.
    • the prefix is fully visible; only the target “Das ist gut.” is decoded causally
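The three attention patterns (fully-visible, causal, causal with a prefix, following the T5 paper's terminology) can be sketched as boolean mask matrices; `attention_masks` and its parameters are illustrative, not T5's API:

```python
def attention_masks(n_prefix, n_total):
    """Three masking patterns over n_total positions, of which the first
    n_prefix form the input/prefix:
      fully_visible: every position sees every position (encoder-style)
      causal:        position i sees positions 0..i only
      prefix_lm:     full visibility inside the prefix, causal afterwards
    """
    fully_visible = [[1] * n_total for _ in range(n_total)]
    causal = [[1 if j <= i else 0 for j in range(n_total)]
              for i in range(n_total)]
    prefix_lm = [[1 if (j < n_prefix or j <= i) else 0 for j in range(n_total)]
                 for i in range(n_total)]
    return fully_visible, causal, prefix_lm

fv, causal, prefix = attention_masks(n_prefix=2, n_total=4)
```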

17 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


18 of 34

Zero-shot vs. One-shot vs. Few-shot prompting
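The difference between the three regimes is purely in how many demonstrations the prompt contains; a minimal prompt-builder sketch (the format and the `build_prompt` helper are illustrative, not a standard API):

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: zero-shot when `examples` is empty, one-shot with
    one demonstration, few-shot with several."""
    lines = [instruction]
    for inp, out in examples:
        lines.append(f"Input: {inp}\nOutput: {out}")
    lines.append(f"Input: {query}\nOutput:")   # model completes from here
    return "\n\n".join(lines)

zero_shot = build_prompt("Translate English to French.", [], "cheese")
few_shot = build_prompt("Translate English to French.",
                        [("sea otter", "loutre de mer"), ("cheese", "fromage")],
                        "peppermint")
```

No weights are updated in any of the three settings; the demonstrations live entirely in the context window.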


19 of 34

GPT-3 Prompting


20 of 34

Prompt-tuning

21 of 34

Hard vs. Soft Prompts

  • Hard prompt: manually handcrafted text prompts with discrete input tokens
    • we directly change the discrete input tokens, which are not differentiable

  • Soft prompt: a trainable tensor concatenated with the input-token embeddings and optimized via backpropagation to improve performance on a target task.
    • cannot be viewed and edited in text
    • lack of interpretability

  • Example hard prompts for translation:
    • Translate the English sentence {english_sentence} into German: {german_translation}
    • English: {english_sentence} | German: {german_translation}
    • From English to German: {english_sentence} -> {german_translation}
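A shape-level sketch of the soft-prompt idea (plain Python lists stand in for embedding tensors; `prepend_soft_prompt` and its random initialization are illustrative):

```python
import random

def prepend_soft_prompt(input_embeddings, n_virtual, dim, seed=0):
    """Soft prompting: prepend n_virtual trainable vectors (randomly
    initialized here) to the frozen input embeddings. Only these virtual-token
    vectors would be updated during training; they correspond to no real
    words, which is why soft prompts cannot be read or edited as text."""
    rng = random.Random(seed)
    soft = [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
            for _ in range(n_virtual)]
    return soft + input_embeddings

seq = [[0.1] * 4, [0.2] * 4, [0.3] * 4]   # embeddings for 3 real tokens
full = prepend_soft_prompt(seq, n_virtual=2, dim=4)
# len(full) == 5: 2 virtual tokens + 3 real tokens
```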

22 of 34

Soft Prompt-tuning vs. Adapters

[Diagram: soft prompt tuning prepends trainable vectors to the input; Low-Rank Adaptation (LoRA) adds trainable low-rank matrices inside each layer]
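LoRA's forward pass can be sketched with plain lists (illustrative only: real LoRA trains A and B by gradient descent and typically scales the update by alpha/r):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """LoRA: keep the frozen weight W (d_out x d_in) and add a trainable
    low-rank update, so y = W x + alpha * B (A x). Only A (r x d_in) and
    B (d_out x r) are trained, far fewer parameters than W."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    return [w + alpha * ba
            for w, ba in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# frozen W = identity; rank-1 update routes input dim 0 into output dim 1
y = lora_forward([2, 3], [[1, 0], [0, 1]], [[1, 0]], [[0], [1]])
# y == [2.0, 5.0]
```

Because the update has rank r, adapting a d_out x d_in layer costs only r*(d_in + d_out) trainable parameters.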

23 of 34

Prompt-tuning

  • Prompt-tuning refers to techniques that vary the input prompt to achieve better modeling results.
  • Idea: convert data into natural language prompts
  • works better in few-shot, one-shot, or zero-shot settings

24 of 34

Prompt-tuning

  • Prompt template: manually designed natural language input for a task

25 of 34

Prompt-tuning

  • Prompt template: manually designed natural language input for a task
  • PLM: a pretrained language model performs language modeling (masked LM or auto-regressive LM) over the prompt

26 of 34

Prompt-tuning

  • Verbalizer: mapping from vocabulary words to task labels
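Template and verbalizer fit together as follows (a toy sketch; `mlm_scores` is a fabricated stand-in for a real masked LM's predictions):

```python
# Toy prompt-based classification: fill a template, let the (stub) masked LM
# score vocabulary words for the [MASK] slot, then map words to labels
# with a verbalizer.

TEMPLATE = "{sentence} It was [MASK]."
VERBALIZER = {"great": "positive", "terrible": "negative"}

def mlm_scores(prompt):
    # hypothetical scores a masked LM might assign to candidate words
    return {"great": 0.7, "terrible": 0.3} if "loved" in prompt else \
           {"great": 0.2, "terrible": 0.8}

def classify(sentence):
    prompt = TEMPLATE.format(sentence=sentence)
    scores = mlm_scores(prompt)
    best_word = max(scores, key=scores.get)   # most likely [MASK] filler
    return VERBALIZER[best_word]

print(classify("I loved this movie."))   # → positive
```

The task is thus solved with the pretraining objective itself; no classification head is added.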

27 of 34

Limits of prompting for harder tasks?

  • Ask GPT-3: What are some great financial investments with no risk at all?
    • “Conspiracy” prompt: Buy gold and silver, and invest in cryptocurrencies.
    • “Blog post” prompt: The best investment is to buy a house.
    • “Helpful” prompt: I have no comment.

  • Ask GPT-3: Explain the moon landing to a 6 year old in a few sentences
  • Explain the theory of gravity to a 6 year old
  • Explain the theory of relativity to a 6 year old in a few sentences
  • Explain the big bang theory to a 6 year old
  • Explain evolution to a 6 year old
  • Some tasks seem too hard for even large LMs to learn through prompting alone

28 of 34

Agenda

  • Recap on small language models
  • Text-to-Text Transfer Transformer, Prompting, Instruction-tuning


29 of 34

Instruction-tuning


30 of 34

Instruction-tuning


31 of 34

Instruction Models:

  • Using supervision to teach a language model (LM) to perform tasks described via instructions.
  • The LM will learn to follow instructions and do so even for unseen tasks.
  • Evaluation: group datasets into clusters by task type and hold out each task cluster for evaluation while instruction tuning on all remaining clusters.
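The held-out-cluster protocol can be sketched as follows (the cluster names and dataset lists are invented placeholders):

```python
def cluster_splits(clusters):
    """Leave-one-cluster-out evaluation: for each task cluster, instruction-
    tune on all other clusters and evaluate zero-shot on the held-out one."""
    splits = []
    for held_out in clusters:
        train = {c: d for c, d in clusters.items() if c != held_out}
        splits.append((train, {held_out: clusters[held_out]}))
    return splits

clusters = {"translation": ["wmt16"], "nli": ["anli", "rte"],
            "summarization": ["xsum"]}
splits = cluster_splits(clusters)
# one split per cluster; e.g. when "nli" is held out, training covers
# translation and summarization only
```

Holding out the whole cluster, rather than individual datasets, is what makes the evaluation a test of generalization to genuinely unseen task types.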


NLU tasks in blue; NLG tasks in teal

32 of 34

Multiple Instruction Templates for Each NLP Task

  • Manually compose ten unique templates per dataset, using natural language instructions to describe the task.
    • most of the ten templates describe the original task
    • to increase diversity, each dataset also gets up to three templates that “turn the task around”
    • e.g., for a sentiment classification dataset, a template asks the model to generate a movie review instead


33 of 34

Can large language models provide useful feedback on research papers? A large-scale empirical analysis.

Weixin Liang et al.

https://arxiv.org/abs/2310.01783


Questions:

  • Can GPT-4 provide useful feedback on research papers?
  • What are the differences between human- vs. GPT-4-generated feedback?

Main Contributions/Findings:

  • There is significant overlap between human- and GPT-4-generated feedback, and more than half of the researchers surveyed found the feedback helpful/very helpful.
  • The overlap is larger for the weaker (i.e., rejected) papers.
  • More overlap for the initial parts of the reviews.

34 of 34

PandaLM: Judge language model


  • PandaLM achieves 93.75% of GPT-3.5’s evaluation ability and 88.28% of GPT-4’s, in terms of F1-score on a diverse human-annotated test dataset.