1 of 56

Harnessing Large Language Models for Planning:

A Lab on Strategies for Success and Mitigation of Pitfalls

Vishal Pallagani, Keerthiram Murugesan, Biplav Srivastava, Francesca Rossi, Lior Horesh

AAAI 2024

Lab Forum

LQ2

2 of 56

Contents

  • Introduction
  • Brief overview of symbolic planners and plan validators
  • Deep-dive into LLMs
  • Plansformer
    • Overview
    • Hands-on session
  • Overview of LLMs in Planning
  • Neuro-symbolic approaches - SOFAI Architecture
  • Summary and Q/A

3 of 56

Introduction

01.

4 of 56

Why Explore LLMs for Planning?

Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse domains, excelling in natural language understanding, text generation, and even code generation tasks.

Planning problems involve the structured representation of tasks and goals, commonly expressed in the Planning Domain Definition Language (PDDL). PDDL has a logical, symbolic syntax that resembles Lisp.

Given LLMs' proven proficiency in generating and understanding program code, it is natural to ask how well they can comprehend PDDL and generate plans, tasks that demand even deeper reasoning.

5 of 56

Goals of the lab

  • Gain a comprehensive understanding of the potential and limitations of LLMs in Planning.
  • Examine the performance of various language modeling strategies (causal, masked, and seq2seq) in the context of plan generation.
  • Understand the common failures LLMs encounter while generating plans for PDDL planning problems.
  • Discover strategies for addressing incorrect plan generation by LLMs by integrating neuro-symbolic techniques, resulting in valid plans.

6 of 56

Structure of the Lab

| Time | Topic | Objective |
| --- | --- | --- |
| 10:55 am - 11:05 am | Brief overview of symbolic planners and plan validators | Introduce symbolic planning problems, planners, and validators |
| 11:05 am - 11:20 am | Deep-dive into LLMs | Introduce various training and architectural paradigms in language modeling |
| 11:20 am - 11:30 am | Overview of LLMs in Planning | Understand the different categories in which LLMs are being applied in Planning |
| 11:30 am - 12:00 pm | Plansformer | Overview of Plansformer and hands-on session to use various LLMs for plan generation |
| 12:00 pm - 12:15 pm | Neuro-symbolic Approaches | Understand the "Thinking Fast and Slow in AI" (SOFAI) framework for Planning |
| 12:15 pm - 12:30 pm | Summary and Q/A | Summarize the learnings in the lab and answer questions |

7 of 56

Brief Overview of Symbolic Planners and Plan Validators

02.

8 of 56

Complex Decisions

  • Making a sequence of decisions
    • Single actor or others in the environment
  • Making a single decision but with
    • Environment changing, not being observable
    • Actions not being deterministic, having durations
    • Perception not being perfect
    • All goals to be satisfied

[Figure. Dimensions of a planning problem: Environment (static vs. dynamic; fully vs. partially observable), Actions (deterministic vs. stochastic; instantaneous vs. durative; single vs. multiple agents), Perception (perfect vs. imperfect), Goals (full vs. partial satisfaction).]

9 of 56

Illustration of a Planning Scenario

Blocks World

[Figure. Blocks World scenario: in the initial state, blocks A and B rest on the table; in the goal state, A is stacked on B. A robot arm manipulates the blocks, and all robots are equivalent.]

10 of 56

Illustration of Problem Representation

States: ((On-Table A) (On-Table B) …)

Actions: ((Name: (Pickup ?block ?robot)
          Precondition: ((Clear ?block)
                         (Arm-Empty ?robot)
                         (On-Table ?block))
          Add: ((Holding ?block ?robot))
          Delete: ((Clear ?block)
                   (Arm-Empty ?robot))) …)
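
To make these add/delete-list semantics concrete, here is a minimal Python sketch (illustrative, not part of the original lab materials) that grounds the Pickup schema and applies it to the Blocks World initial state:

```python
# Minimal STRIPS-style state transition, mirroring the Pickup schema above.
# A state is a set of ground atoms; each atom is a tuple like ("On-Table", "A").

def pickup(state, block, robot):
    """Apply (Pickup ?block ?robot) if its preconditions hold."""
    pre = {("Clear", block), ("Arm-Empty", robot), ("On-Table", block)}
    if not pre <= state:                 # all preconditions must hold
        raise ValueError("Preconditions not satisfied")
    add = {("Holding", block, robot)}
    delete = {("Clear", block), ("Arm-Empty", robot)}  # delete list as above
    return (state - delete) | add        # STRIPS: (s - Del) | Add

initial = {("On-Table", "A"), ("On-Table", "B"),
           ("Clear", "A"), ("Clear", "B"),
           ("Arm-Empty", "R1"), ("Arm-Empty", "R2")}
print(pickup(initial, "A", "R1"))
```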

11 of 56

Illustration of Reasoning for Planning

[Figure. Planning graph for the two-robot Blocks World problem. Proposition level P-0 holds the initial state (Clear A, Clear B, On-Table A, On-Table B, Arm-Empty R1, Arm-Empty R2); action level A-0 offers Pick-up A R1 and Pick-up A R2; proposition level P-1 adds Holding A R1 and Holding A R2; action level A-1 offers Stack A B R1, Stack A B R2, Put-down A R1/R2, and Pick-up B R1/R2; the goal (On A B) is reached at proposition level P-2 (the goal state level).]

12 of 56

Illustration of a Larger Planning Scenario

Figure. Demonstration of an automated planning problem with a larger Blocks World example.

13 of 56

Active Areas of Research

Considerations

  • What plan to find?
    • Any workable plan
    • Optimal plan – but then what is the criterion?
    • All plans
    • Diverse plans
  • When to find the plan?
    • Plan at the end
    • Plan anytime
  • How to represent the problem?
    • Planning Domain Definition Language (PDDL)
  • How to explain the solution?

14 of 56

Plan Validation

  • Definition: Plan validation refers to the process of assessing whether a generated plan satisfies the specified planning problem and domain constraints.
  • Objective: Ensure that the plan is feasible, correct, and efficient in achieving the desired goals.
  • Key Components:
    • Syntax Check: Verify that the plan adheres to the syntax rules defined in PDDL.
    • Semantics Check: Ensure that the plan semantics are consistent with the domain and problem definitions.
    • Goal Achievement: Confirm that the plan accomplishes all specified goals within the given constraints.
  • Tools for Classical Planning: plan validators such as VAL (from KCL Planning) are commonly used; a usage sketch follows below.
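
The sketch below shell-calls a locally installed VAL binary from Python; it assumes the compiled executable is named validate and is on the PATH, which varies by installation:

```python
# Hedged sketch: checking a plan with VAL (https://github.com/KCL-Planning/VAL).
# Assumes a compiled VAL binary named "validate" is on the PATH; the exact
# binary name and output format depend on your installation.
import subprocess

def validate_plan(domain_pddl: str, problem_pddl: str, plan_file: str) -> bool:
    """Return True if VAL reports the plan as valid."""
    result = subprocess.run(
        ["validate", domain_pddl, problem_pddl, plan_file],
        capture_output=True, text=True,
    )
    print(result.stdout)
    return "Plan valid" in result.stdout  # success message printed by VAL

# Example usage (file names are placeholders):
# validate_plan("bw-domain.pddl", "bw-problem-01.pddl", "plan.txt")
```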

15 of 56

>> It is time for some coding

Source: r/ProgrammerHumor

16 of 56

Deep-dive into LLMs

03.

17 of 56

Deep-dive into LLMs

In this lab, we focus our discussion on three language modeling techniques: masked, seq2seq, and causal.

Credits: Google Cloud Skills Boost

Figure. The Transformer - model architecture.

18 of 56

Masked Language Modeling (MLM)

  • MLMs such as BERT are trained to understand bidirectional context by predicting words randomly masked in a sentence.
  • This approach allows the model to learn both forward and backward dependencies in language structure.
  • MLMs have proven effective in NLP tasks such as sentiment analysis or question answering.
  • MLMs are encoder-only.
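
A quick way to see MLM behavior is Hugging Face's fill-mask pipeline (a minimal sketch; the model choice and example sentence are illustrative):

```python
# Minimal sketch of masked language modeling: BERT predicts the token behind
# [MASK] using context from both the left and the right.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("To reach the goal, the robot must [MASK] the block."):
    print(f"{pred['token_str']!r}  (score: {pred['score']:.3f})")
```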

19 of 56

Seq2Seq Language Modeling (Seq2Seq)

  • Seq2Seq models, like T5, are designed to transform an input sequence into a related output sequence.
  • They are often employed in tasks that require a mapping between different types of sequences, such as language translation or summarization.
  • They consist of an encoder-decoder architecture.
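
For example, T5 can be driven through the text2text-generation pipeline (a minimal sketch; the task prefixes follow T5's pre-training convention):

```python
# Minimal sketch of seq2seq modeling with T5: the encoder reads the input
# sequence and the decoder generates a related output sequence.
from transformers import pipeline

t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The robot stacks block A on block B."))
print(t5("summarize: A planning problem defines an initial state, a goal, "
         "and actions with preconditions and effects."))
```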

20 of 56

Causal Language Modeling (CLM)

  • CLMs, such as GPT-4, are designed for tasks where text generation is sequential and dependent on the preceding context.
  • They predict each subsequent word based on the preceding words, modeling the probability of a word sequence in a forward direction.
  • This characteristic makes CLMs particularly suitable for applications like content generation, where the flow and coherence of the text in the forward direction are crucial.
  • CLMs consist of a decoder-only architecture.
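
The same idea can be demonstrated with an open checkpoint such as GPT-2 (a minimal sketch; the sampled continuation will vary from run to run):

```python
# Minimal sketch of causal language modeling: GPT-2 extends a prompt by
# repeatedly predicting the next token from the preceding context only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
out = generator("The plan to stack block A on block B is:",
                max_new_tokens=30, do_sample=True)
print(out[0]["generated_text"])
```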

21 of 56

Given this basic description of three language modeling techniques used to build LLMs, what among them do you think is a good fit for plan generation?

Think in terms of the input and output.

22 of 56

>> It is time for some coding

Source: r/ProgrammerHumor

23 of 56

Plansformer

[Hands-on session]

04.

24 of 56

Plansformer

What: Plansformer [1,2] is an LLM-based planner capable of generating valid and optimal plans for classical planning problems.

Why: Traditional planners, although sound and complete, often cannot solve problems with vast search spaces within a stipulated time, nor do they generalize. A learning-based planner such as Plansformer harnesses learnt representations to generalize to unseen problems and returns a candidate plan in roughly constant inference time.

How: Plansformer is obtained by fine-tuning CodeT5 on planning problems and their corresponding optimal plans. CodeT5 is an LLM pre-trained on programming languages and the associated natural language.

[1] Pallagani, V., Muppasani, B., Murugesan, K., Rossi, F., Horesh, L., Srivastava, B., Fabiano, F. and Loreggia, A., 2023. Plansformer: Generating symbolic plans using transformers. Generalized Planning (GenPlan) Workshop at NeurIPS.

[2] Pallagani, V., Muppasani, B., Srivastava, B., Rossi, F., Horesh, L., Murugesan, K., Loreggia, A., Fabiano, F., Joseph, R. and Kethepalli, Y., 2023, August. Plansformer tool: demonstrating generation of symbolic plans using transformers. In IJCAI (Vol. 2023, pp. 7158-7162). International Joint Conferences on Artificial Intelligence.
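
A heavily simplified sketch of the "How" step (illustrative only; the actual Plansformer data format, training setup, and hyperparameters are described in [1]):

```python
# Hedged sketch: fine-tuning CodeT5 on (planning problem -> plan) pairs, in
# the spirit of Plansformer. The toy data format and hyperparameters are
# illustrative assumptions, not the authors' exact setup.
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# One toy training pair: a compact problem description and its optimal plan.
problem = "<GOAL> on a b <INIT> ontable a, ontable b, clear a, clear b, handempty"
plan = "pickup a; stack a b"

inputs = tokenizer(problem, return_tensors="pt")
labels = tokenizer(plan, return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss   # seq2seq cross-entropy loss
loss.backward()
optimizer.step()

# After fine-tuning, plans are generated with the usual seq2seq decoding:
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```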

25 of 56

>> It is time for some coding

Source: r/ProgrammerHumor

26 of 56

Plansformer’s Architecture

Figure. Plansformer Model Architecture showing modeling and evaluation phases. Modeling phase involves fine-tuning CodeT5 with data from planning domain. Evaluation phase shows both the planner and model testing.

27 of 56

Is Plansformer a Good Model?

Table. Results of model testing (best performance in bold). Plansformer-[x] denotes Plansformer for a specific domain.

28 of 56

Is Plansformer a Good Planner?

Table. Results of plan validation.

29 of 56

Can Plansformer Adapt to Another Domain?

Table. Plansformer-bw as the base model, fine-tuned on and tested against (a) Hanoi, (b) Grippers, and (c) Driverlog; (d) compares the valid plans generated by models derived from Plansformer-bw-hn with a Plansformer-hn trained using similar data points.

30 of 56

Overview of LLMs in Planning

05.

31 of 56

Applications of LLMs in Planning

Table. Comprehensive description of the eight categories utilizing LLMs in Planning [1]

[1] Pallagani, V., Roy, K., Muppasani, B., Fabiano, F., Loreggia, A., Murugesan, K., Srivastava, B., Rossi, F., Horesh, L. and Sheth, A., 2024. On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS). International Conference on Automated Planning and Scheduling (ICAPS).

32 of 56

Applications of LLMs in Planning

Figure. Taxonomy of recent research in the intersection of LLMs and Planning with (#) mentioning the number of scholarly papers in each category [1].

In this tutorial, we focus on using LLMs for plan generation

[1] Pallagani, V., Roy, K., Muppasani, B., Fabiano, F., Loreggia, A., Murugesan, K., Srivastava, B., Rossi, F., Horesh, L. and Sheth, A., 2024. On the Prospects of Incorporating Large Language Models (LLMs) in Automated Planning and Scheduling (APS). International Conference on Automated Planning and Scheduling (ICAPS).

33 of 56

Capabilities of LLMs for Plan Generation

  • A majority of prior works prompt generative (causal) language models to generate plans.
  • However, none of them have explored fine-tuning or looked beyond generative (causal) models for planning.
  • In this lab, we examine the capabilities of LLMs to solve for plans given a PDDL problem as input, using both fine-tuning and prompting approaches.

Fine-tuning is a method that updates the parameters of an LLM using a labeled dataset for the target task.

Prompting is a method that modifies the input of an LLM using a template or a cue to elicit the desired output.
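
As an illustration of the prompting route, a few-shot prompt for plan generation might be assembled as below (the problem encoding and formatting are illustrative assumptions, not the exact prompt used in the study):

```python
# Hedged sketch: assembling a few-shot prompt for PDDL plan generation.
# The example problems and their encoding are illustrative assumptions.
FEW_SHOT_EXAMPLES = [
    ("<GOAL> on a b <INIT> ontable a, ontable b, clear a, clear b, handempty",
     "pickup a; stack a b"),
    ("<GOAL> on b a <INIT> ontable a, ontable b, clear a, clear b, handempty",
     "pickup b; stack b a"),
]

def build_prompt(new_problem: str) -> str:
    """Concatenate solved examples, then the unsolved problem as the cue."""
    parts = [f"Problem: {p}\nPlan: {plan}" for p, plan in FEW_SHOT_EXAMPLES]
    parts.append(f"Problem: {new_problem}\nPlan:")
    return "\n\n".join(parts)

print(build_prompt("<GOAL> on a b <INIT> on b a, ontable a, clear b, handempty"))
```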

34 of 56

Capabilities of LLMs for Plan Generation

We focus on answering four research questions [1]:

(a) To what extent can LLMs solve planning problems?

(b) What pre-training data is effective for plan generation?

(c) Do fine-tuning and prompting improve LLMs' plan generation?

(d) Are LLMs capable of plan generalization?

[1] Pallagani, V., Muppasani, B., Murugesan, K., Rossi, F., Srivastava, B., Horesh, L., Fabiano, F. and Loreggia, A., 2023. Understanding the Capabilities of Large Language Models for Automated Planning. arXiv preprint arXiv:2305.16151.

35 of 56

Capabilities of LLMs for Plan Generation

For this study, we constructed a dataset of 18,000 planning problems and their corresponding optimal plans across 6 domains. The dataset was divided into an 80%-20% train-test split for fine-tuning and evaluating the LLMs.

Table. Difficulty of planning domains

36 of 56

Capabilities of LLMs for Plan Generation

The LLMs considered for this study, along with their architectures, are as follows.

Table. Architecture of benchmark LLMs

37 of 56

Research Question 1: To what extent can LLMs solve planning problems?

Table. Evaluation of plan generation capabilities of LLMs (both prompting pre-trained model and fine-tuned model). For each model, we report the inference time (Inf. Time), the percentage of satisficing plans (Sat. Plans), the percentage of optimal plans (Opt. Plans), and the degree of correctness (Deg. Corr.).

38 of 56

Research Question 2: What pre-training data is effective for plan generation?

LLMs pre-trained on programming code outperform those trained solely on textual corpora.

Research Question 3: Do fine-tuning and prompting improve LLMs' plan generation?

Fine-tuning is the superior approach for solving planning problems with LLMs. Overall, fine-tuned LLMs generate outputs for planning problems roughly four times faster than prompted pre-trained LLMs.

39 of 56

Research Question 4: Are LLMs capable of plan generalization?

Three tasks help measure the plan generalization capability of LLMs:

  • Task 1 - Plan Length Generalization: evaluate the ability of LLMs to generalize to plan lengths that are out of distribution from the training set.
  • Task 2 - Object Name Randomization: test LLMs' ability to generate plans using randomized object names not present in the training set.
  • Task 3 - Unseen Domain Generalization: evaluate LLMs' ability to generalize to planning problems from new domains not included in the training set.

40 of 56

Research Question 4: Are LLMs capable of plan generalization?

Task 1: Plan length generalization

Figure. Fine-tuned CodeT5 and few-shot-prompted code-davinci both show poor plan length generalization, although plans from fine-tuned CodeT5 overall have a higher degree of correctness. The x-axis represents the plan length and the y-axis the degree of correctness; the training plan lengths are highlighted in grey.

The fine-tuned CodeT5 model can generalize to unseen plan lengths to some extent, while few-shot prompting of code-davinci yields only a single valid plan.

41 of 56

Research Question 4: Are LLMs capable of plan generalization?

Task 2: Object name randomization

Figure. Evaluating the capabilities of LLMs in handling randomized object names. In version 1, we used only single-digit numeric values as object names. In version 2, we used alphanumeric strings of length 2 (following the convention of IPC generators), where the combinations of letters and numerals were unseen during training. In version 3, object names were three-letter alphabetic strings.

Fine-tuned models can only generalize to object names drawn from the same vocabulary as the training data, while few-shot-prompted code-davinci handles the randomized object names but has poor plan generation capabilities.

42 of 56

Research Question 4: Are LLMs capable of plan generalization?

Task 3: Unseen domain generalization

Neither fine-tuned CodeT5 nor few-shot-prompted code-davinci shows any plan generation capability on unseen domains.

Figure. Example of an incorrect generation from an LLM for a problem from an unseen domain.

43 of 56

Invalid Generations from Plansformer

Figure. Different types of invalid plans generated by Plansformer on Blocksworld domain

We notice that even in cases of incorrect generations, LLMs often generate partially correct action sequences. In our upcoming research, we introduce a neuro-symbolic approach where these partially correct plans are employed to inform the heuristics of symbolic planners, enabling faster replanning as opposed to starting the planning process from scratch.

44 of 56

Discussion Pointers

  • Ongoing debate: retrieval vs. reasoning in LLMs [1]
  • What is the impact of licensing of LLMs for planning applications?

[1] Subbarao Kambhampati, Can LLMs Really Reason and Plan?.

https://cacm.acm.org/blogs/blog-cacm/276268-can-llms-really-reason-and-plan/fulltext [Last accessed on Feb 17, 2024]

45 of 56

Neuro-symbolic Approaches

06.

46 of 56

Thinking Fast and Slow in Humans

47 of 56

System 1 and System 2 in AI

| System 1 Solvers | System 2 Solvers |
| --- | --- |
| Rely on past experiences | Rely on procedures |
| React to new problems | Called by metacognition |
| Generate solutions with a certain confidence | Generate correct solutions |
| Complexity independent of input size | Complexity dependent on input size |

48 of 56

Meta-cognition

The Role: to monitor and control cognitive activities, processes, and structures.

Main Goal: to improve the quality of the system's decisions.

Our Choice:
  • A centralized meta-cognitive agent
  • Exploiting both internal and external information
  • Arbitrating between Sys-1 and Sys-2 solvers

49 of 56

SOFAI = System 1 + System 2 + Metacognition

Ganapini, M.B., Campbell, M., Fabiano, F., Horesh, L., Lenchner, J., Loreggia, A., Mattei, N., Rossi, F., Srivastava, B. and Venable, K.B., 2022, September. Thinking fast and slow in AI: The role of metacognition. In International Conference on Machine Learning, Optimization, and Data Science (pp. 502-509). Cham: Springer Nature Switzerland.

Figure. The SOFAI architecture, supporting System 1/System 2 agents and meta-cognition.

50 of 56

Plan-SOFAI: Instantiating SOFAI for Planning

(a) System 1

Plansformer

(b) System 2

A traditional planning system, such as Fast Downward (for plan generation from scratch) or LPG (for partial plan completion).

(c) Metacognition Module

A rule-based function that chooses when to adopt S1 or S2, based on past performance, expected cost, and expected accuracy on similar problems. It also includes a plan evaluator and exposes various hyperparameters for tuning efficiency.

Fabiano, F., Pallagani, V., Ganapini, M.B., Horesh, L., Loreggia, A., Murugesan, K., Rossi, F. and Srivastava, B., 2023, December. Plan-SOFAI: A Neuro-Symbolic Planning Architecture. In Neuro-Symbolic Learning and Reasoning in the era of Large Language Models.
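
The arbitration logic can be pictured with a small sketch (purely illustrative; the actual Plan-SOFAI rules, thresholds, and tracked statistics are described in the paper above):

```python
# Hedged sketch of SOFAI-style metacognitive arbitration: try the fast
# System-1 solver (e.g., Plansformer) first, and fall back to the slow but
# sound System-2 planner when confidence or validity is lacking.
# All names and thresholds are illustrative assumptions.

CONFIDENCE_THRESHOLD = 0.8  # hypothetical hyperparameter

def solve(problem, system1, system2, validator):
    plan, confidence = system1(problem)            # fast, experience-based
    if confidence >= CONFIDENCE_THRESHOLD and validator(problem, plan):
        return plan, "System 1"
    # System 2 may replan from scratch or repair the partial S1 plan
    # (as LPG does for partial plan completion).
    return system2(problem, seed_plan=plan), "System 2"
```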

51 of 56

Experimental Results

  • FD has the best optimality (+0.00%) but high overall resource consumption (8.479 sec)
  • LPG is the most efficient (0.675 sec) but has less-than-ideal optimality (+23.68%)
  • PF solves fewer problems (402) than FD and LPG, but many of them (386) are optimal and its runtime is constant (2.079 sec)
  • Among the SOFAI instances:
    • SOFAI-PF-FDxLPG represents the most balanced trade-off among all the analyzed techniques: 490 solved problems, 2.199 sec, +1.13% optimality
    • SOFAI-PF-LPG is best if optimality matters less (all problems solved, 434 optimal)

52 of 56

Other Neuro-symbolic Approaches to Combine LLMs and Planning

  • LLMs as planner generators [1]
    • LLMs are prompted to generate planners (Python code) for a given planning domain
    • The code is then run to generate a plan
  • The pipeline approach [2] (sketched after the references below)
    • An LLM translates natural language into a PDDL specification of a planning problem instance
    • A classical planner solves the planning problem
    • An LLM translates the formal plan back to natural language

[1] Silver, T., Dan, S., Srinivas, K., Tenenbaum, J.B., Kaelbling, L.P. and Katz, M., 2023. Generalized Planning in PDDL Domains with Pretrained Large Language Models. arXiv preprint arXiv:2305.11014.

[2] Liu, B., Jiang, Y., Zhang, X., Liu, Q., Zhang, S., Biswas, J. and Stone, P., 2023. Llm+ p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477.
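
A schematic of the pipeline approach in code form (a sketch under assumed interfaces; the actual LLM+P prompts and planner invocation in [2] differ):

```python
# Hedged sketch of the LLM+P-style pipeline: natural language -> PDDL ->
# classical planner -> natural language. The llm() and planner() callables
# are assumed interfaces, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call here")

def planner(domain_pddl: str, problem_pddl: str) -> str:
    raise NotImplementedError("plug in a classical planner, e.g. Fast Downward")

def pipeline(nl_task: str, domain_pddl: str) -> str:
    problem_pddl = llm(f"Translate this task into a PDDL problem:\n{nl_task}")
    formal_plan = planner(domain_pddl, problem_pddl)
    return llm(f"Explain this plan in plain English:\n{formal_plan}")
```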

53 of 56

Summary

07.

54 of 56

To Learn More About SOFAI

55 of 56

THANK YOU ALL

Contact Information

Vishal Pallagani – vishalp@mailbox.sc.edu

56 of 56

Questions?
