1 of 51

Local Explanations for Deep Learning Models

Uncertainty (Part 2); Prompting; Chain-of-Thought

Ana Marasović

University of Utah

2 of 51

Reminders

1st HW due: Tomorrow 11:59p

You’re allowed to be late twice* by up to 48 hours without a penalty; no need to notify me

* Does not apply to in-person activities

Drop deadline: This Friday, Sept 1

Next Monday (Sept 4): holiday, so no class

1st paper discussion on Sept 11 (Monday): https://utah-explainability.github.io/assignments/paper_discussions/

UGs: let me know whether you want to participate in all roles or only the discussion-assistant ones

3 of 51

What did we talk about last Monday?

  • Confidence scores could assist human-AI teams
  • There are different ways to calculate uncertainty, but the max of the softmax vector is appealing due to its simplicity
  • The max of the softmax is not calibrated: it is not necessarily true that the probability of the model being correct for that instance equals the max of the softmax
  • Reminder: a digit-classification model classified a dress as the digit 8 with 100% confidence, which we would read as: it is 100% likely that the dress is an 8
  • To measure how wrong this interpretation is, we introduced the expected calibration error (ECE)
  • If the ECE is not low, we know that the max of the softmax cannot assist a human-AI team
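To make the recap concrete, here is a minimal sketch of ECE with equal-width confidence bins (the helper name, bin count, and toy data are illustrative, not the lecture's exact implementation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; average |accuracy - mean confidence|
    per bin, weighted by how many predictions fall in each bin."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # each prediction lands in exactly one half-open bin (lo, hi]
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Perfectly calibrated toy case: 80%-confident predictions, right 80% of the time
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))  # 0.0
```

A model that is 90% confident but always wrong would instead get an ECE near 0.9, which is exactly the "dress classified as an 8" failure mode.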

4 of 51

Simple post-hoc calibration methods

5 of 51

Temperature scaling

For softmax-based classification tasks, we can post-hoc rescale the logits

[Guo et al., ’17] introduce a single temperature parameter T

More generally, temperature scaling is a simple extension of Platt scaling [Platt, 99]

Slide source: COLING’22 tutorial

  • T > 1: Increasing the temperature makes the probabilities closer to a uniform distribution and reduces the model's overconfidence

  • T < 1: Decreasing the temperature makes the probabilities more "peaky", increasing the model's confidence
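The effect of T is easy to see in a small sketch (the logits here are made up):

```python
import math

def softmax_with_temperature(logits, T=1.0):
    # divide the logits by T before the softmax; T > 1 flattens, T < 1 sharpens
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5]
print(max(softmax_with_temperature(logits, T=1.0)))  # baseline confidence
print(max(softmax_with_temperature(logits, T=2.0)))  # lower: closer to uniform
print(max(softmax_with_temperature(logits, T=0.5)))  # higher: more "peaky"
```

Note that T does not change which class has the largest logit, so accuracy is unaffected; only the confidence changes.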

6 of 51

Temperature scaling

For softmax-based classification tasks, we can post-hoc rescale the logits

[Guo et al., ’17] introduce a single temperature parameter T

How to find the right temperature? Optimize T on a held-out calibration set to minimize the negative log-likelihood

  • Train set to train the model
  • Held-out calibration set to calibrate
  • Test set to evaluate

Slide source: COLING’22 tutorial
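A minimal sketch of the fitting step: [Guo et al., '17] use LBFGS to minimize the NLL; a simple grid search (an assumption here, used for brevity) finds essentially the same T. The toy logits and labels are made up to mimic an overconfident model:

```python
import math

def nll(logits_list, labels, T):
    """Average negative log-likelihood of the true labels after scaling by T."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += -(scaled[y] - log_z)
    return total / len(labels)

def fit_temperature(logits_list, labels, grid=None):
    # a grid search stands in for the LBFGS step in [Guo et al., '17]
    grid = grid or [0.5 + 0.1 * i for i in range(96)]  # T in [0.5, 10.0]
    return min(grid, key=lambda T: nll(logits_list, labels, T))

# Overconfident toy model: large logit gaps, but one of three labels is wrong
cal_logits = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
cal_labels = [0, 1, 1]  # the second prediction is wrong
T = fit_temperature(cal_logits, cal_labels)
print(T)  # > 1: the optimizer flattens the overconfident distribution
```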

7 of 51

Bayesian Approaches

8 of 51

Bayesian approaches

We don't just consider a single model as the definitive answer, but a whole spectrum of possible models

We assign probabilities to these models based on their compatibility with the data and our prior beliefs

  • Prior belief on possible models of how rain prediction works: Each of these models has a certain probability attached to it based on your prior experience or knowledge.�
  • Evidence: You observe several days where high humidity led to rain. This evidence will make you more confident in the models that consider humidity as a strong predictor of rain.�
  • Posterior distribution: After seeing the evidence, you adjust the probabilities of your collection of models. Those that align well with the evidence (like models giving more weight to humidity) become more probable, while those that don't fit the evidence become less probable. The result is a posterior distribution over possible models.

The calibrated prediction is then a kind of weighted average over all these models, with more probable models having more influence
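The rain example can be written out numerically. This is a toy sketch with two hand-picked "models" and made-up observations, just to show prior × likelihood → posterior → weighted prediction:

```python
# Two toy "models" of rain given high humidity, with prior beliefs over them
models = {"humidity_matters": 0.9,     # P(rain | high humidity) under this model
          "humidity_irrelevant": 0.3}
prior = {"humidity_matters": 0.5, "humidity_irrelevant": 0.5}

# Evidence: 5 humid days, 4 of which it rained
rained = [1, 1, 1, 0, 1]

def likelihood(p, outcomes):
    out = 1.0
    for y in outcomes:
        out *= p if y == 1 else (1 - p)
    return out

# Posterior ∝ prior × likelihood (Bayes' rule), then normalize
unnorm = {m: prior[m] * likelihood(p, rained) for m, p in models.items()}
z = sum(unnorm.values())
posterior = {m: w / z for m, w in unnorm.items()}

# The calibrated prediction is the posterior-weighted average over models
p_rain = sum(posterior[m] * models[m] for m in models)
print(posterior, p_rain)
```

After the evidence, the humidity-aware model dominates the posterior, and the final prediction is pulled toward its answer without fully discarding the other model.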

9 of 51

How to form a posterior distribution over neural networks?

  1. [Gal and Ghahramani, 2016]: Apply dropout at test time to get different samples from an approximate Bayesian posterior

  2. Use an ensemble of different training runs
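A toy sketch of option 1 (MC dropout) on a single linear layer; the layer, inputs, and dropout rate are illustrative, not from the paper:

```python
import random

def predict_with_dropout(x, weights, p_drop=0.5, rng=random):
    """One stochastic forward pass: each weight is dropped with prob p_drop
    (survivors rescaled so the expectation is unchanged), giving one sample
    from the approximate posterior predictive."""
    kept = [w * (0.0 if rng.random() < p_drop else 1.0 / (1.0 - p_drop))
            for w in weights]
    return sum(wi * xi for wi, xi in zip(kept, x))

def mc_dropout(x, weights, n_samples=1000, seed=0):
    rng = random.Random(seed)
    samples = [predict_with_dropout(x, weights, rng=rng) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / n_samples
    return mean, var  # predictive mean and an uncertainty estimate

mean, var = mc_dropout([1.0, 2.0, -1.0], [0.5, -0.3, 0.8])
print(mean, var)  # mean ≈ the deterministic output; var > 0 reflects uncertainty
```

Option 2 (deep ensembles) has the same final step: average the members' predictions and read the disagreement between them as uncertainty.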

10 of 51

Ways of expressing uncertainty

  • As a numerical value (e.g., in [0, 1]) returned with each prediction

  • As a confidence interval around a numerical value

  • As a set of candidate answers

  • As a decision to abstain from answering

Slide source: COLING’22 tutorial

11 of 51

Conformal prediction

12 of 51

A motivating example: Information retrieval for fact-checking

Slide source: COLING’22 tutorial


16 of 51

How conformal prediction works

Informally, conformal prediction uses “nonconformity” scores to measure surprise

Basic idea: suppose I assign a candidate label to a given input. How “strange” does this input-output pair look relative to other examples that I know to be correct?

If it is relatively strange, it is considered to be nonconforming to the dataset

If it is relatively “not that strange”, then it conforms (and we can’t rule the predicted label out)

Slide source: COLING’22 tutorial
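One common instantiation of this idea is split conformal prediction with 1 − p(label) as the nonconformity score; a sketch under those assumptions (the calibration scores and test softmax below are made up):

```python
import math

def conformal_set(cal_scores, test_probs, alpha=0.1):
    """Split conformal prediction: calibrate a nonconformity threshold on
    held-out examples, then keep every label we cannot rule out."""
    n = len(cal_scores)
    scores = sorted(cal_scores)
    # conformal quantile: the ceil((n+1)(1-alpha))-th smallest score
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = scores[k]
    # a candidate label "conforms" if its nonconformity stays below the threshold
    return [label for label, p in enumerate(test_probs) if 1 - p <= q]

# Calibration nonconformity scores (1 - p_true) from 9 held-out examples
cal = [0.05, 0.10, 0.12, 0.20, 0.25, 0.30, 0.40, 0.55, 0.70]
# Softmax vector for a new test input over 3 classes
print(conformal_set(cal, [0.50, 0.35, 0.15], alpha=0.1))  # [0, 1]
```

The output is a *set* of labels rather than a single prediction: labels 0 and 1 conform, label 2 is too strange to keep, and under exchangeability the true label lands in the set with probability at least 1 − alpha.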

17 of 51

Why is this input assigned this answer?


Week 2-5:

18 of 51


Why is this input assigned this answer?

In plain English, why is this input assigned this label?

Free-text explanations

Chain-of-Thoughts

Week 2-3:

19 of 51

Self-explaining with free-text explanations: Given in plain language, immediately provide the gist of why the input is labeled as it is


Misleading because not every American over 65 can get these cards since they are not provided by Medicare, the federal health insurance program for senior citizens. They are offered as a benefit to some customers by private insurance companies that sell Medicare Advantage plans. The cards are available in limited geographic areas. Only the chronically ill qualify to use the cards for items such as food and produce.

+ documents from the Web


20 of 51


21 of 51


22 of 51

Few-Shot Learning; Prompting

23 of 51

Pretrain-then-Finetune Paradigm


[Diagram: pretrain the model on text; then finetune it on text + labels]

Another trend: Decrease the finetuning data size

24 of 51

The simplest way to do few-shot learning


[Diagram: pretrain the model on text; then finetune it on text + labels]

The finetuning set contains only a few (8–16) labeled examples

25 of 51

A typical way to process data

26 of 51

The standard input-output formatting is suboptimal

A pretrained LM is well-positioned to solve the end-task if…

…we format the finetuning end-task examples to be as similar as possible to the format used in pretraining

We add something to induce a task-specific behavior, e.g.:

  • “TL;DR” or “summarize: “ for summarization

We sometimes also add a task definition/instruction, e.g.:

  • “In this task, you are given an article. Your task is to summarize the article in a sentence.”

27 of 51

An example of a prompt: a reading comprehension example

Task description:

In this task, you’re expected to write answers to questions involving reasoning about negation. The answer to the question should be “yes”, “no”, “don’t know” or a phrase in the passage. Questions can have only one correct answer.\n

A task instance:

Passage: Trams have operated continuously in Melbourne since 1885 (the horse tram line in Fairfield opened in 1884, but was at best an irregular service). Since then they have become a distinctive part of Melbourne's character and feature in tourism and travel advertising. Melbourne's cable tram system opened in 1885, and expanded to one of the largest in the world, with of double track. The first electric tram line opened in 1889, but closed only a few years later in 1896. In 1906 electric tram systems were opened in St Kilda and Essendon, marking the start of continuous operation of Melbourne's electric trams.\n

Question: If I wanted to take a horse tram in 1884, could I look up the next tram on a schedule?\n

Answer:

[The model generates the answer: “No”]

28 of 51

Prompt-based finetuning


[Diagram: pretrain the model on text; then finetune it on text + labels]

The finetuning set contains only a few (8–16) labeled examples

These examples are carefully formatted

29 of 51

An alternative to prompt-based finetuning

Imagine a layperson, or an expert in another domain, interacting with an NLP model:

  • They have no access to model weights
  • They have no knowledge of how to train the model
  • But they are able to provide a few examples of their task

What if, instead of providing the few examples individually as training data, we concatenate them into one long sequence and do not change the model weights?

30 of 51

Task description:

In this task, you’re expected to write answers to questions involving reasoning about negation. The answer to the question should be “yes”, “no”, “don’t know” or a phrase in the passage. Questions can have only one correct answer.\n

Examples / Shots / Demonstrations:

Passage: During the 1930s, Jehovah's Witnesses in Germany were sent to concentration camps by the thousands, due to their refusal to salute the Nazi flag, which the government considered to be a crime. Jehovah's Witnesses believe that the obligation imposed by the law of God is superior to that of laws enacted by government. Their religious beliefs include a literal version of Exodus, Chapter 20, verses 4 and 5, which says: "Thou shalt not make unto thee any graven image, or any likeness of anything that is in heaven above, or that is in the earth beneath, or that is in the water under the earth; thou shalt not bow down thyself to them nor serve them." They consider that the flag is an 'image' within this command. For this reason, they refused to salute the flag.\n

Question: Is it likely that most of these Jehovah's Witnesses survived the war (having the same likelihood of survival as other German civilians) only to later see Soviet flags in their country, or American soldiers proudly saluting the stars and stripes?\n

Answer: NO\n

###\n

Passage: Francesco Rognoni was another composer who specified the trombone in a set of divisions (variations) on the well-known song "Suzanne ung jour" (London Pro Musica, REP15). Rognoni was a master violin and gamba player whose treatise "Selva di Varie passaggi secondo l'uso moderno" (Milan 1620 and facsimile reprint by Arnaldo Forni Editore 2001) details improvisation of diminutions and Suzanne is given as one example. Although most diminutions are written for organ, string instruments or cornett, Suzanne is "per violone over Trombone alla bastarda". With virtuosic semiquaver passages across the range of the instrument, it reflects Praetorius' comments about the large range of the tenor and bass trombones, and good players of the Quartposaune (bass trombone in F) could play fast runs and leaps like a viola bastarda or cornetto. The term "bastarda" describes a technique that made variations on all the different voices of a part song, rather than just the melody or the bass: "considered legitimate because it was not polyphonic".

Question: Would you likely find the term "bastarda" regularly used in an academic paper on musical theory?\n

Answer: DON'T KNOW\n

###\n

[...]

###\n

Test instance:

Passage: Trams have operated continuously in Melbourne since 1885 (the horse tram line in Fairfield opened in 1884, but was at best an irregular service). Since then they have become a distinctive part of Melbourne's character and feature in tourism and travel advertising. Melbourne's cable tram system opened in 1885, and expanded to one of the largest in the world, with of double track. The first electric tram line opened in 1889, but closed only a few years later in 1896. In 1906 electric tram systems were opened in St Kilda and Essendon, marking the start of continuous operation of Melbourne's electric trams.\n

Question: If I wanted to take a horse tram in 1884, could I look up the next tram on a schedule?\n

Answer:

31 of 51

In-context learning

This approach:

  • prompt = {task_instruction}{labeled example #1}...{labeled example #n}{eval example input}
  • no further training (changing of model weights)

is called in-context learning
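The prompt-assembly step can be sketched as a small helper (the function name, field names, and separator are assumptions; the `###` separator mirrors the example two slides back):

```python
def build_icl_prompt(task_instruction, examples, eval_input, sep="\n###\n"):
    """Concatenate the instruction, the labeled demonstrations, and the
    unlabeled test instance into one prompt; model weights are never updated."""
    shots = sep.join(
        f"Passage: {ex['passage']}\nQuestion: {ex['question']}\nAnswer: {ex['answer']}"
        for ex in examples)
    test = (f"Passage: {eval_input['passage']}\n"
            f"Question: {eval_input['question']}\nAnswer:")
    return f"{task_instruction}\n{shots}{sep}{test}"

prompt = build_icl_prompt(
    "Answer the question with yes, no, or don't know.",
    [{"passage": "It rained all day.", "question": "Was it dry?", "answer": "NO"}],
    {"passage": "Trams have run since 1885.", "question": "Did trams exist in 1900?"})
print(prompt)
```

The prompt ends with a bare "Answer:" so that the model's continuation is read off as the prediction.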

32 of 51

“Advanced” Prompting

33 of 51

Task description:

In this task, you’re expected to write answers to questions involving reasoning about negation. The answer to the question should be “yes”, “no”, “don’t know” or a phrase in the passage. Questions can have only one correct answer.

Examples / Shots with CoT:

Passage: Francesco Rognoni was another composer who specified the trombone in a set of divisions (variations) on the well-known song "Suzanne ung jour" (London Pro Musica, REP15). Rognoni was a master violin and gamba player whose treatise "Selva di Varie passaggi secondo l'uso moderno" (Milan 1620 and facsimile reprint by Arnaldo Forni Editore 2001) details improvisation of diminutions and Suzanne is given as one example. Although most diminutions are written for organ, string instruments or cornett, Suzanne is "per violone over Trombone alla bastarda". With virtuosic semiquaver passages across the range of the instrument, it reflects Praetorius' comments about the large range of the tenor and bass trombones, and good players of the Quartposaune (bass trombone in F) could play fast runs and leaps like a viola bastarda or cornetto. The term "bastarda" describes a technique that made variations on all the different voices of a part song, rather than just the melody or the bass: "considered legitimate because it was not polyphonic".

Question: Would you likely find the term "bastarda" regularly used in an academic paper on musical theory?

Answer: Let's think step by step. From the passage it is unclear whether 'bastarda' was a technique that was impactful and important which are reasons why one could expect to see it regularly in an academic paper on musical theory. So the answer is DON'T KNOW.

###

Passage: During the 1930s, Jehovah's Witnesses in Germany were sent to concentration camps by the thousands, due to their refusal to salute the Nazi flag, which the government considered to be a crime. Jehovah's Witnesses believe that the obligation imposed by the law of God is superior to that of laws enacted by government. Their religious beliefs include a literal version of Exodus, Chapter 20, verses 4 and 5, which says: "Thou shalt not make unto thee any graven image, or any likeness of anything that is in heaven above, or that is in the earth beneath, or that is in the water under the earth; thou shalt not bow down thyself to them nor serve them." They consider that the flag is an 'image' within this command. For this reason, they refused to salute the flag.

Question: Is it likely that most of these Jehovah's Witnesses survived the war (having the same likelihood of survival as other German civilians) only to later see Soviet flags in their country, or American soldiers proudly saluting the stars and stripes?

Answer: Let's think step by step. Worshiping any flag is forbidden by their religion and this religious law to them is superior to laws enacted by the government. Thus, even after the war, they are unlikely to condone people saluting Soviet or American flags. So the answer is NO.

###

[...]

###

Test instance:

Passage: Trams have operated continuously in Melbourne since 1885 (the horse tram line in Fairfield opened in 1884, but was at best an irregular service). Since then they have become a distinctive part of Melbourne's character and feature in tourism and travel advertising. Melbourne's cable tram system opened in 1885, and expanded to one of the largest in the world, with of double track. The first electric tram line opened in 1889, but closed only a few years later in 1896. In 1906 electric tram systems were opened in St Kilda and Essendon, marking the start of continuous operation of Melbourne's electric trams.

Question: If I wanted to take a horse tram in 1884, could I look up the next tram on a schedule?

Answer: Let's think step by step.

34 of 51

Chain-of-thought prompting

This approach:

  • prompt input = {task_instruction}{labeled example #1 as a QA instance with an explanation}...{labeled example #n as a QA instance with an explanation} {eval example input} Answer: Let’s think step by step.
  • prompt output = {explanation} So the answer is {answer}.
  • no further training (changing of model weights)

is called chain-of-thought prompting
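Two pieces of mechanical glue make this work in practice: formatting each shot with the trigger phrase, and parsing the explanation and final answer back out of the generation. A sketch (helper names and the exact regex are assumptions; the phrases match the template above):

```python
import re

COT_TRIGGER = "Let's think step by step."

def build_cot_shot(passage, question, explanation, answer):
    """Format one labeled example as a QA instance with an explanation."""
    return (f"Passage: {passage}\nQuestion: {question}\n"
            f"Answer: {COT_TRIGGER} {explanation} So the answer is {answer}.")

def parse_cot_output(generation):
    """Split a chain-of-thought generation into (explanation, final answer)."""
    m = re.search(r"So the answer is (.+?)\.?\s*$", generation)
    if m is None:
        return generation.strip(), None  # no recognizable final answer
    return generation[:m.start()].strip(), m.group(1).strip()

gen = "Horse trams in 1884 ran irregularly, so there was no schedule. So the answer is NO."
explanation, answer = parse_cot_output(gen)
print(answer)  # NO
```

Only the parsed final answer is scored; the explanation is the (free) by-product that the earlier part of the lecture cares about.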

35 of 51

Self-Consistency

36 of 51

FLAN-T5

Finetune a model, here T5 (an open-source model):

  1. To follow instructions
  2. With chain-of-thought & self-consistency prompts to elicit reasoning skills
  3. With concatenated examples to induce in-context learning

On 1.8K tasks

37 of 51

Instruction finetuning data

38 of 51

LLaMA-Chat (and every other LLM today)

39 of 51

RLHF

40 of 51

PPO

41 of 51

Reliable Few-Shot Evaluation

42 of 51

Challenges of reliable few-shot evaluation

  1. Sensitivity to choice of few examples
  2. No sensitivity to choice of labels
  3. Poor experimental practices

43 of 51

Sensitivity to choice of examples

44 of 51

Sensitivity to choice of examples

45 of 51

Sensitivity to choice of examples (cont.)

Estimate models’ bias toward certain answers by feeding in a “content-free” input, e.g.,:

  • Input: N/A Sentiment:

Ideally: this input would be labeled as 50% positive and 50% negative

In practice: it’s scored 61.8% positive

“The error is contextual: a different choice of the training examples, permutation, and format will lead to different predictions for the content-free input”

Correct this error by tweaking the output matrix so that the class scores for the content-free input are uniform
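The correction step can be sketched directly: rescale each class score by the inverse of the probability the model assigned to that class on the content-free input, then renormalize. (A minimal per-vector sketch; the original method applies an equivalent diagonal matrix to the output scores, and the toy numbers reuse the 61.8% from above.)

```python
def contextual_calibration(p_content_free, p_test):
    """Rescale class probabilities by the inverse of the content-free scores,
    then renormalize, so the content-free input itself would score uniformly."""
    rescaled = [p / cf for p, cf in zip(p_test, p_content_free)]
    z = sum(rescaled)
    return [r / z for r in rescaled]

# The model is biased toward "positive" on the content-free input "N/A"
p_cf = [0.618, 0.382]                            # positive, negative
print(contextual_calibration(p_cf, p_cf))        # [0.5, 0.5]: bias corrected
print(contextual_calibration(p_cf, [0.7, 0.3]))  # real inputs rescaled the same way
```

Because the error is contextual, the content-free probabilities must be re-estimated for every choice of demonstrations, permutation, and format.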

46 of 51

No sensitivity to shuffling labels

47 of 51

No sensitivity to shuffling labels (cont.)

48 of 51

Poor experimental practices

  1. Using large development sets to tune hyperparameters and choose best prompts
  2. Using one seed when we know there is a variance from prompt choice

For best practices, see:

49 of 51

50 of 51

Zero-Shot Learning

(one slide)

51 of 51

You still do prompting, but you cannot include any labeled examples, e.g.:

  • prompt input = {task_instruction}{eval example input}
  • prompt output = {answer}

  • prompt input = {task_instruction}{eval example input} Answer: Let’s think step by step.
  • prompt output = {explanation} So the answer is {answer}.

Obviously, there is no further training (changing of model weights)
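The two zero-shot templates above differ only in whether the CoT trigger is appended; a small sketch (the helper name is an assumption):

```python
COT_TRIGGER = "Answer: Let's think step by step."

def zero_shot_prompt(task_instruction, eval_input, chain_of_thought=False):
    """No labeled examples: just the instruction and the test input,
    optionally followed by the zero-shot CoT trigger."""
    prompt = f"{task_instruction}\n{eval_input}"
    return f"{prompt} {COT_TRIGGER}" if chain_of_thought else prompt

p = zero_shot_prompt("Classify the sentiment.", "Input: Great movie!",
                     chain_of_thought=True)
print(p)
```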