ScAi Reading Group

Planning via Diffusion

Xiusi Chen

March 9, 2023

Motivation

Planning as generative modeling

A generative model of trajectories

Compositionality via local consistency

Variable-length predictions

Non-autoregressive prediction

Training

  • The model ε_θ is used to parameterize a learned gradient of the trajectory denoising process, from which the mean can be recovered in closed form (Ho et al., 2020). We use the simplified objective for training the ε-model, given by:

    L(θ) = E_{i, ε, τ^0} [ ‖ε − ε_θ(τ^i, i)‖² ]

in which i ∼ U{1, 2, ..., N} is the diffusion timestep, ε ∼ N(0, I) is the noise target, and τ^i is the trajectory τ^0 corrupted with noise ε.
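
A minimal PyTorch-style sketch of this objective, assuming a hypothetical eps_model(traj_i, i) noise-prediction network and a precomputed alpha_bar noise schedule (names are illustrative, not taken from the paper's code):

```python
import torch

def diffusion_loss(eps_model, traj0, alpha_bar):
    """Simplified epsilon-prediction objective (Ho et al., 2020) applied to trajectories.

    traj0:     clean trajectories tau^0, shape (batch, horizon, transition_dim)
    alpha_bar: cumulative products of the noise schedule, shape (N,)
    """
    batch, N = traj0.shape[0], alpha_bar.shape[0]

    # i ~ U{1, ..., N}: one diffusion timestep per trajectory (0-indexed here)
    i = torch.randint(0, N, (batch,), device=traj0.device)
    # eps ~ N(0, I): the noise target
    eps = torch.randn_like(traj0)

    # Corrupt tau^0 into tau^i with the forward process
    a = alpha_bar[i].view(batch, 1, 1)
    traj_i = a.sqrt() * traj0 + (1 - a).sqrt() * eps

    # Regress the predicted noise onto the true noise with an L2 loss
    return ((eps - eps_model(traj_i, i)) ** 2).mean()
```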

From trajectory modeling to planning

  • What is the difference?
    • Trajectory modeling is essentially fitting the distribution of trajectories, exactly as one would fit an image distribution
    • In RL, planning usually means using a model of the environment to find a policy that lets the agent behave optimally, i.e., obtain the highest return ("future cumulative discounted reward"); in Diffuser, planning becomes guided sampling from the trajectory model, as sketched below
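
A rough sketch of that reduction, assuming each reverse-diffusion step is nudged by the gradient of a learned return estimate; denoise_step and return_model are illustrative placeholders, not the authors' API:

```python
import torch

@torch.no_grad()
def guided_plan(denoise_step, return_model, s0, horizon, state_dim, N, guide_scale=0.1):
    """Plan by sampling a trajectory from the diffusion model while guiding it
    toward high return and conditioning on the current state s0."""
    traj = torch.randn(1, horizon, state_dim)        # start from pure noise
    for i in reversed(range(N)):
        traj = denoise_step(traj, i)                 # one reverse-diffusion step

        # Auxiliary perturbation: follow the gradient of the return estimate
        with torch.enable_grad():
            traj_g = traj.detach().requires_grad_(True)
            grad = torch.autograd.grad(return_model(traj_g).sum(), traj_g)[0]
        traj = traj + guide_scale * grad

        # Condition on the current state by clamping the first timestep
        traj[:, 0, :] = s0
    return traj
```

The only differences from plain sampling are the gradient perturbation and the clamping of the first state, which is the sense in which planning is "almost identical to sampling".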

Planning

Offline RL through Value Guidance

Experiments

Connections with Guided Diffusion

Diffusing over states

  • Directly modeling actions with a diffusion process raises several practical issues:
    • while states in RL are typically continuous, actions are more varied and often discrete in nature
    • action sequences, often represented as joint torques, tend to be higher-frequency and less smooth, making them much harder to predict and model

Acting with Inverse-Dynamics

  • Sampling states from a diffusion model is not enough to define a controller. A policy can, however, be inferred by estimating the action a_t that led from state s_t to s_{t+1}, for any timestep t in x_0(τ). Given two consecutive states, we generate an action according to the inverse dynamics model:

    a_t := f_φ(s_t, s_{t+1})
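
A minimal sketch of this acting scheme, assuming a small MLP for f_φ trained on (s_t, a_t, s_{t+1}) tuples from the offline dataset; class and function names here are placeholders, not the Decision Diffuser codebase:

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """f_phi(s_t, s_{t+1}) -> a_t, trained with a regression loss on offline data."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t, s_next):
        return self.net(torch.cat([s_t, s_next], dim=-1))

def act(inv_dyn, state_plan):
    """Turn a sampled state trajectory x_0(tau) into the next action to execute."""
    s_t, s_next = state_plan[0], state_plan[1]      # two consecutive planned states
    return inv_dyn(s_t, s_next)
```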

Decision Diffuser

Experiments

Takeaways

  1. Generative models are a powerful tool for modeling dynamics
  2. Planning with a generative model is almost identical to sampling from it, differing only in the addition of auxiliary perturbation functions that guide the samples
  3. Classifier-free conditional sampling appears more powerful than classifier-guided conditional sampling, even in the planning context
  4. There is a shocking disconnect between the effectiveness of generative models for images and audio and the quality of state- and observation-space models for RL

Thank you!

Q & A

Related Readings

Training of Decision Diffuser

Architecture
