1 of 29


Exploring the Curious Case of Code Prompts

Li Zhang*, Liam Dugan*, Hainiu Xu*, Chris Callison-Burch

*equal contribution

2 of 29

What is Prompting?


Write me an essay about the Great Gatsby.

Title: The Great Gatsby: A Tale of Illusion, Love, and American Dream

Introduction:

"The Great Gatsby," written by F. Scott Fitzgerald, is a timeless novel that delves into the intriguing world of the Roaring Twenties. Set in the lavish and fast-paced society of Long Island, New York, Fitzgerald masterfully paints a picture of glamour, decadence, and the pursuit of the American Dream. Through the eyes of the enigmatic Jay Gatsby, the novel

<...>

Large Language Model

3 of 29

Your prompt depends on your task

Few-shot [1]

Does sentence A entail sentence B?

A: Jill is at ACL 2023
B: Jill is in Toronto
Label: Entailment

<…>

A: Jack wrote a paper
B: Jack is at ACL
Label:

Chain of Thought [2]

A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Let's think step by step.

Self-Refining [3]

A: Jill is at ACL 2023
B: Jill is in Toronto
Does sentence A entail sentence B? Yes
Is the answer correct? If not, how can it be improved?

+ obscure tasks

+ reasoning tasks

+ generation tasks

[1] Language Models are Few-Shot Learners (Brown et al. 2020)

[2] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022)

[3] Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al. 2023)
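To make the three styles concrete, here is a minimal sketch of how each prompt might be assembled as a string. The wording mirrors the entailment example on this slide, not the exact prompts from the cited papers.

# Illustrative templates for the three prompting styles above.
# Wording follows the slide's entailment example, not the cited papers' exact prompts.

def few_shot_prompt(examples, premise, hypothesis):
    """Labeled demonstrations followed by the unlabeled query pair."""
    shots = "\n\n".join(f"A: {a}\nB: {b}\nLabel: {lab}" for a, b, lab in examples)
    return (f"Does sentence A entail sentence B?\n\n{shots}\n\n"
            f"A: {premise}\nB: {hypothesis}\nLabel:")

def chain_of_thought_prompt(premise, hypothesis):
    """Ask the model to reason step by step before answering."""
    return (f"A: {premise}\nB: {hypothesis}\n"
            "Does sentence A entail sentence B? Let's think step by step.")

def self_refine_prompt(premise, hypothesis, previous_answer):
    """Show the model its own answer and ask it to critique and improve it."""
    return (f"A: {premise}\nB: {hypothesis}\n"
            f"Does sentence A entail sentence B? {previous_answer}\n"
            "Is the answer correct? If not, how can it be improved?")

print(few_shot_prompt([("Jill is at ACL 2023", "Jill is in Toronto", "Entailment")],
                      "Jack wrote a paper", "Jack is at ACL"))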

4 of 29

A type of task that is hard for most prompts…

Structured Prediction: Text → Structured Representation (graph, triples, etc.)

5 of 29

digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}

Doesn’t look like Natural Language!

6 of 29

Idea: Convert to code and prompt a Code LM

digraph G {
    begin -> find_sheet_music;
    find_sheet_music -> sit_at_bench;
    find_sheet_music -> set_up_sheet;
    sit_at_bench -> warm_up_on_organ;
    set_up_sheet -> warm_up_on_organ;
    warm_up_on_organ -> play_organ;
}

becomes:

class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"

    def __init__(self):
        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()

        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
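A minimal sketch of how this conversion could be automated, assuming the graph is already available as a list of (parent, child) edges; the helper name edges_to_code is hypothetical, not from the paper.

# Hypothetical helper: render a list of DOT-style edges as the
# Node/Tree code representation shown above.
def edges_to_code(goal, edges):
    """edges: list of (parent, child) node-name pairs from the digraph."""
    nodes, children = [], {}
    for parent, child in edges:
        for name in (parent, child):
            if name not in nodes:
                nodes.append(name)
        children.setdefault(parent, []).append(child)

    lines = ["class Node:",
             "    def __init__(self):",
             "        self.children = []",
             "",
             "class Tree:",
             f'    goal = "{goal}"',
             "    def __init__(self):"]
    lines += [f"        {name} = Node()" for name in nodes]
    lines += [f"        {p}.children = [{', '.join(cs)}]" for p, cs in children.items()]
    return "\n".join(lines)

edges = [("begin", "find_sheet_music"),
         ("find_sheet_music", "sit_at_bench"),
         ("find_sheet_music", "set_up_sheet"),
         ("sit_at_bench", "warm_up_on_organ"),
         ("set_up_sheet", "warm_up_on_organ"),
         ("warm_up_on_organ", "play_organ")]
print(edges_to_code("play organ", edges))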

7 of 29

Our code prompt now looks something like this

Large Code-Trained Language Model

Prompt (few-shot examples + goal docstring + class skeleton):

<...few-shot examples...>

"""
The following is a script
for the goal: play an organ
"""
class Node:
    def __init__(self):
        self.children = []

class Tree:
    goal = "play organ"

    def __init__(self):

Model completion:

        begin = Node()
        find_sheet_music = Node()
        sit_at_bench = Node()
        set_up_sheet = Node()
        warm_up_on_organ = Node()
        play_organ = Node()

        begin.children = [find_sheet_music]
        find_sheet_music.children = [sit_at_bench, set_up_sheet]
        sit_at_bench.children = [warm_up_on_organ]
        set_up_sheet.children = [warm_up_on_organ]
        warm_up_on_organ.children = [play_organ]
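Putting it together, a hedged sketch of how such a prompt might be sent to a code LM through the legacy OpenAI Completions API. code-davinci-002 has since been retired, and the client details here are an assumption, not the paper's exact setup.

# Sketch: send the code prompt above to a code LM via the legacy
# OpenAI Completions API (openai-python < 1.0). Details are illustrative.
import openai

prompt = (
    "<...few-shot examples...>\n\n"
    '"""\n'
    "The following is a script\n"
    "for the goal: play an organ\n"
    '"""\n'
    "class Node:\n"
    "    def __init__(self):\n"
    "        self.children = []\n"
    "\n"
    "class Tree:\n"
    '    goal = "play organ"\n'
    "    def __init__(self):\n"
)

response = openai.Completion.create(
    model="code-davinci-002",
    prompt=prompt,
    max_tokens=256,
    temperature=0.0,   # greedy decoding for structured output
    stop=['"""'],      # stop before the model starts another example
)
print(response["choices"][0]["text"])  # expected: the Tree body (Node()s and .children lists)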

8 of 29

Code prompting works extremely well


[1] “Language Models of Code are Few-Shot Commonsense Learners” (Madaan et al. 2022)

9 of 29

Follow-up work did this for other structured tasks

Event Argument Extraction (Wang et al. 2022)

Knowledge Graph Construction (Bi et al. 2023)

Short Story Understanding (Dong et al. 2023)

Structured Causal Inference (Zhang et al. 2023)

10 of 29

Are code prompts better for structured reasoning tasks only…

… or are they better in general?

11 of 29

Let’s investigate!

12 of 29

How should we compare code prompts to text?

1. Select a diverse set of benchmark tasks (Sentiment Analysis, Question Answering, Common-Sense Reasoning, and more)

2. Select a diverse set of models for comparison

3. Select a code prompt and a text prompt to compare for each task

13 of 29

Step 1: Select a diverse set of tasks

Summarization: XSUM, CNN / Daily Mail

Question Answering: HotPotQA, SQuADv2

Sentiment Analysis: IMDb, Yelp

Common-Sense Reasoning: wikiHow Goal-Step, HellaSwag, wikiHow Temporal, WinoGrande, OpenPI 2.0

Natural Language Inference: ANLI

14 of 29

Step 2: Select a diverse set of models

Model            | Code Pre-training? | Instruction Fine-Tuning?
davinci          | no                 | no
code-davinci-002 | yes                | no
text-davinci-002 | yes                | yes

15 of 29

Step 3: Select our code and text prompts per task

1. Select one text prompt per task from PromptSource [1]

2. Rewrite the text prompt to be a code prompt (a sketch follows below)

3. Add in-context examples (Example 1, Example 2, …, Example n) to both prompts for inference

[1] PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts (Bach et al. 2022)
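To illustrate step 2, here is roughly what a text prompt and its code rewrite could look like for a sentiment task. The actual prompts used in the paper differ; treat this purely as a sketch of the rewriting idea.

# Illustrative only: one text prompt and a code-style rewrite for
# IMDb-style sentiment classification. Not the paper's exact prompts.
review = "A wonderful film with a terrific cast."

text_prompt = (
    f"Review: {review}\n"
    "Is this review positive or negative?\n"
    "Answer:"
)

code_prompt = (
    "# Decide whether a movie review is positive or negative.\n"
    f'review = "{review}"\n'
    'sentiment_options = ["positive", "negative"]\n'
    "# the model completes the assignment below\n"
    "sentiment ="
)

print(text_prompt)
print(code_prompt)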

16 of 29

What is the best way to write a code prompt?


17 of 29

We wrote and tested four code prompts per task


18 of 29

Are code prompts better?

… sort of? … it’s complicated


19 of 29

𝚫 text → code (change in score when moving from the text prompt to the code prompt), per task and model:

Task              | Metric    | davinci | code-002 | text-002
HellaSwag         | Accuracy  | -0.014  | -0.046   | +0.046
wikiHow Goal-Step | Accuracy  | -0.045  | -0.026   | -0.004
wikiHow Temporal  | Accuracy  | +0.037  | +0.105   | +0.073
Yelp              | Pearson ⍴ | -0.017  | -0.017   | -0.015
IMDb              | Accuracy  | +0.063  | +0.006   | +0.012
WinoGrande        | Accuracy  | -0.013  | +0.109   | +0.098
ANLI              | Accuracy  | +0.027  | -0.011   | +0.053
HotPotQA          | Macro-F1  | -       | -0.021   | -0.140
SQuAD             | Macro-F1  | -0.016  | -0.025   | -0.014
OpenPI            | ROUGE-F1  | -       | -0.970   | -4.300
CNN / Daily Mail  | ROUGE-2   | -0.150  | -0.070   | -0.080
XSUM              | ROUGE-2   | -2.550  | -3.580   | -1.220

QA and Summarization are worse with code prompts
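Each cell above is simply the code-prompt score minus the text-prompt score on the same task, model, and metric. A tiny worked example, with made-up raw scores:

# 𝚫 text → code = score(code prompt) - score(text prompt),
# per task, model, and metric. Raw scores below are made up for illustration.
def delta(code_score: float, text_score: float) -> float:
    return round(code_score - text_score, 3)

print(delta(code_score=0.712, text_score=0.726))  # -0.014 -> code prompt slightly worse
print(delta(code_score=0.781, text_score=0.718))  # +0.063 -> code prompt better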

20 of 29

𝚫 text → code, restricted to the classification and NLI tasks:

Task              | Metric    | davinci | code-002 | text-002
HellaSwag         | Accuracy  | -0.014  | -0.046   | +0.046
wikiHow Goal-Step | Accuracy  | -0.045  | -0.026   | -0.004
wikiHow Temporal  | Accuracy  | +0.037  | +0.105   | +0.073
Yelp              | Pearson ⍴ | -0.017  | -0.017   | -0.015
IMDb              | Accuracy  | +0.063  | +0.006   | +0.012
WinoGrande        | Accuracy  | -0.013  | +0.109   | +0.098
ANLI              | Accuracy  | +0.027  | -0.011   | +0.053

21 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 1: On classification tasks, code prompts help more often than they hurt

22 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 2: Tasks that benefit from code prompts tend to do so regardless of the model

23 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 3: Instruction fine-tuning helps models make better use of code prompts, even when the fine-tuning is done on text instructions

24 of 29

(𝚫 text → code table for the classification and NLI tasks repeated; see above.)

Observation 4: Given the right task and model, code prompts can be *way* better than text prompts

25 of 29

Takeaways


26 of 29

What did we learn?

1. Code prompts are applicable to much more than just structured reasoning tasks

2. Code prompts seem to benefit from instruction fine-tuning

3. Improvements seem to have more to do with the task type than with the model

4. Given the right task, we see substantial improvements (10%+) with code prompting

27 of 29

There are tons of things we don’t fully understand


“How do you even write one of these things?” (us, during this project)

What is the right code/text pre-training and instruction data mix, task selection, prompt formulation, etc.?

28 of 29

Insights on code prompts → insights on reasoning


Why do code models reason so effectively? What latent structures do LMs use? How do we define structure?

29 of 29

Thank you!

My awesome co-authors: Li Zhang, Hainiu Xu, Chris Callison-Burch
{zharry, seacow, ccb}@seas.upenn.edu

Paper Link

Enjoyed the talk? *Curious* about code prompts? We’d love to collaborate!

Email: ldugan@seas.upenn.edu

Website: http://liamdugan.com/

Twitter: @LiamDugan_