1 of 51

The Structure of Optimal Nonlinear Feedback Control and its Implications

Suman Chakravorty

Professor, Aerospace Engineering

Texas A&M University

College Station, TX


2 of 51

Acknowledgements

D. Yu, Nanjing University of Aeronautics and Astronautics

M. RafeiSakhaei, Vicarious Robotics

R. Wang, Rockwell Automation

K. Parunandi, Cruise

* M. N. Gul Mohamed, TAMU

* A. Sharma, TAMU

R. Goyal, PARC

D. Kalathil, TAMU

Bob Skelton, TAMU

P. R. Kumar, TAMU

Erik Blasch and Frederica Darema, AFOSR DDIP Program

Kishan Baheti, NSF EPCN and NRI Program

Daryl Hess, NSF DMR and CDS&E Program

Marc Steinberg, ONR Science of Autonomy Program


3 of 51

Introduction

Search for optimal control law

Unknown dynamics

Learning under uncertainty

Partial observation

Cahn-Hilliard Equation:

The Case for Reinforcement Learning / Data-based Control


4 of 51

Introduction

Very High DOF Systems

Bionic fish robot

Bionic snake robot

Tensegrity airfoil

Tensegrity arm

High dimensionality

Complex models

Data-based

Limited sensing

Partial observation

Model, Process and Sensing uncertainty

Learning under uncertainty

Material Microstructures


5 of 51

Background

  • Stochastic Optimal Nonlinear Control Problem [1]:

  • The Dynamic Programming Equation:

  • Bellman’s “Curse of Dimensionality”: complexity grows exponentially in the state dimension [2]; with K grid points per state dimension, the number of variables grows as K^d.
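For reference, a generic statement of the problem and the dynamic programming recursion (standard textbook forms as in [1]; the slide's own notation may differ):

```latex
% Finite-horizon stochastic optimal control (generic form):
\min_{\pi}\; J^{\pi}(x_0) \;=\; \mathbb{E}\Big[\, c_T(x_T) + \sum_{t=0}^{T-1} c_t(x_t,u_t) \Big],
\qquad x_{t+1} = f(x_t,u_t) + \epsilon\, w_t .

% Dynamic programming (Bellman) recursion:
J_t(x) \;=\; \min_{u}\; \mathbb{E}\big[\, c_t(x,u) + J_{t+1}\big(f(x,u) + \epsilon\, w_t\big) \big],
\qquad J_T(x) = c_T(x).

% Curse of dimensionality: a grid with K points per state dimension in d dimensions
% has K^d nodes, so solving the recursion over the grid scales exponentially in d.
```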

[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, vols I and II. Cambridge, MA: Athena Scientific, 2012

[2] R. E. Bellman. Dynamic Programming. Princeton, NJ: Princeton University Press, 1957


6 of 51

Background


[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, vols I and II. Cambridge, MA: Athena Scientific, 2012

[2] R. E. Bellman. Dynamic Programming. Princeton, NJ: Princeton University Press, 1957


7 of 51

Background


[3] R. Goyal, R. Wang and S. Chakravorty, “On the Convergence of Reinforcement Learning,” IEEE International Conference on Decision and Control, 2021, Austin, TX.


8 of 51

Background

  • Data-based approaches

Reinforcement learning - DDPG [4]

System identification

[4] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), 2016.

  • Eigensystem Realization Algorithm [5]

[5] J.-N. Juang and R. S. Pappa, “An eigensystem realization algorithm for modal parameter identification and model reduction,” Journal of Guidance, Control, and Dynamics, vol. 8, no. 5, pp. 620–627, 1985.

  • Sparse Identification of Nonlinear Dynamical Systems [6]

[6] S. L. Brunton, J. L. Proctor, and J. N. Kutz, “Discovering governing equations from data by sparse identification of nonlinear dynamical systems,” Proceedings of the National Academy of Sciences, vol. 113, no. 15, pp. 3932–3937, 2016.

Training still takes a very long time and the solution has high variance


9 of 51

Background


[7] D. Q. Mayne. “Model Predictive Control: Recent Developments and Future Promise”. Automatica (2014).

[8] D.Q. Mayne. “Robust and Stochastic MPC: Are We Going In The Right Direction?” 2015. IFAC-PapersOnLine 23 (2015). 5th IFAC Conference on Nonlinear Model Predictive Control NMPC


10 of 51

Background

  • Optimal control

Gradient descent [9]

  • First order Taylor expansion
  • Control update

Differential dynamic programming (DDP) [10]

  • Second order expansion for cost-to-go and constraints
  • Cost-to-go:

Iterative linear quadratic regulator (ILQR) [11]

  • Second order expansion for cost-to-go
  • First order expansion for constraints
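For contrast, a sketch of the quadratic expansion these methods share, in illustrative notation (not the slide's exact equations); DDP and ILQR differ in which terms they retain:

```latex
% Expansion of the action-value function about the nominal (\bar{x}_t, \bar{u}_t):
Q_t(\delta x, \delta u) \;\approx\;
Q_x^{\top}\delta x + Q_u^{\top}\delta u
+ \tfrac{1}{2}
\begin{bmatrix}\delta x \\ \delta u\end{bmatrix}^{\top}
\begin{bmatrix} Q_{xx} & Q_{xu} \\ Q_{ux} & Q_{uu} \end{bmatrix}
\begin{bmatrix}\delta x \\ \delta u\end{bmatrix},
\qquad
\delta u^{*} = -\,Q_{uu}^{-1}\big( Q_u + Q_{ux}\,\delta x \big).

% DDP retains the second-order (tensor) terms of the dynamics in Q_{xx}, Q_{ux}, Q_{uu};
% ILQR drops them, i.e., it uses only the linearization \delta x_{t+1} \approx A_t \delta x_t + B_t \delta u_t.
```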

[9] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv:1609.04747, 2017.

[10] D. H. Jacobson and D. Q. Mayne, Differential Dynamic Programming. Elsevier, 1970.

[11] Y. Tassa, T. Erez, and E. Todorov, “Synthesis and stabilization of complex behaviors through online trajectory optimization,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913, IEEE, 2012.


11 of 51

The Roadmap

  • Fully Observed Problem
    • The deterministic policy is near-optimal to the optimal stochastic policy, to fourth order in a small noise parameter
    • The deterministic feedback law has a “perturbation structure”: higher order feedback terms do not affect lower order terms
    • One can find the globally optimum deterministic feedback law “locally” under mild assumptions
    • The “perturbation structure” is lost in the stochastic problem, which is computationally intractable
    • We use the Iterative Linear Quadratic Regulator (ILQR) to find the globally optimum local feedback law in a data-based / Reinforcement Learning (RL) fashion
    • The resulting approach, Decoupled Data-based Control (D2C), is scalable, accurate, has negligible variance, and nonetheless has better closed-loop performance than state-of-the-art RL approaches


12 of 51

Outline

  • Decoupled Data-based Optimal Control
    • Closeness of deterministic feedback to the optimal stochastic law
    • Perturbation structure of deterministic feedback law
    • Decoupled data-based control (D2C) algorithm
    • Performance and optimality analysis
    • Comparison to state-of-the-art RL techniques
    • Connection to MPC


13 of 51

Problem Formulation

  • Finite horizon stochastic optimal control problem [15,16]:

 

Bellman equation

[15] Naveed Gul Mohamed, M., Chakravorty, S., Goyal, R., and Wang, R., “On the Optimal Feedback Law in Stochastic Optimal Nonlinear Control”, arXiv:2004.01041, IEEE Transactions on Automatic Control, under revision

[16] Naveed Gul Mohamed, M., Chakravorty, S., Goyal, R., and Wang, R., “On the Optimal Feedback Law in Stochastic Optimal Nonlinear Control”, American Control Conference, 2022.


14 of 51

Near Optimality of Deterministic Law

  • Deterministic optimal control problem:
  • Cost function of the optimal deterministic policy applied to the stochastic system:
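A sketch of the near-optimality claim, assuming the process noise enters through a small parameter ε as in the roadmap; the precise statement and proof are in [15,16]:

```latex
% Small-noise model: x_{t+1} = f(x_t, u_t) + \epsilon\, w_t.
% Let \pi^{*} be the optimal stochastic policy and \pi_d the optimal deterministic
% (zero-noise) policy, both applied to the stochastic system. The claim is
J^{\pi_d}(x) \;=\; J^{\pi^{*}}(x) \;+\; O(\epsilon^{4}),
% i.e., the two costs agree through second order in \epsilon and first differ at fourth order.
```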


15 of 51

Near Optimality of Deterministic Law

 


16 of 51

Near Optimality of Deterministic Law


 

 

 


17 of 51

Perturbation Structure of Deterministic Law

  • The deterministic Hamilton-Jacobi-Bellman equation:

  • The Method of Characteristics can be used to turn the PDE into a family of ODEs (Lagrange-Charpit, circa 1760):
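A generic control-affine sketch of the HJB equation and its characteristic ODEs (the Minimum Principle equations); the notation is illustrative:

```latex
% Deterministic HJB, with dynamics \dot{x} = f(x) + g(x)u and running cost l(x) + \tfrac12 u^{\top}Ru:
-\,\partial_t J \;=\; \min_{u}\Big[\, l(x) + \tfrac12 u^{\top} R u + \nabla_x J^{\top}\big(f(x)+g(x)u\big) \Big],
\qquad u^{*} = -R^{-1} g(x)^{\top}\nabla_x J .

% Method of Characteristics: with the Hamiltonian
H(x,\lambda,u) = l(x) + \tfrac12 u^{\top}Ru + \lambda^{\top}\big(f(x)+g(x)u\big),
% the PDE reduces, along a characteristic curve, to the ODEs
\dot{x} = \partial H / \partial \lambda, \qquad
\dot{\lambda} = -\,\partial H / \partial x, \qquad
u^{*} = \arg\min_{u} H(x,\lambda,u),
% with \lambda = \nabla_x J the co-state along the curve.
```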


18 of 51

Perturbation Structure of Deterministic Law

 

Global optimality: if the dynamics f, g and the cost l are C², then the solution of the characteristic ODE exists and is unique; i.e., satisfying the Minimum Principle is sufficient for global optimality, regardless of the nonlinearity of the dynamics and cost.


19 of 51

Perturbation Structure of Deterministic Law

  • Expand the optimal cost and the co-state in terms of the perturbation from a characteristic curve:

  • Expanding the state and co-state around the nominal characteristic curve, we obtain:

and so on.

  • The higher order terms do not affect the lower order terms in the expansion: this is the perturbation structure of the deterministic law.

  • The linear feedback equation is not the LQR Riccati equation; note the second order terms!
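A sketch of the expansion in illustrative notation; the precise backward recursions are in [15,16]:

```latex
% Perturbation about the nominal characteristic curve \bar{x}_t, with \delta x_t = x_t - \bar{x}_t:
J_t(x_t) \;=\; \bar{J}_t + G_t\,\delta x_t + \tfrac12\,\delta x_t^{\top} P_t\,\delta x_t + \cdots,
\qquad
u_t \;=\; \bar{u}_t + K_t\,\delta x_t + \cdots .

% Perturbation structure: the equations for the nominal (\bar{J}_t, \bar{u}_t) do not involve
% the feedback terms; the equation for K_t involves only terms up to second order; and so on.
% The backward equation for (G_t, P_t) carries second-order derivatives of the dynamics along
% the nominal, which is why the linear feedback is not given by the standard LQR Riccati equation.
```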


20 of 51

Decoupling Principle

Remarks

  • Decoupling: the open loop nominal policy and the linear and higher order feedback terms can be solved sequentially owing to the perturbation structure of the deterministic policy
  • Solving for the open loop nominal policy + optimal linear feedback policy is easier than solving the stochastic optimal control problem as we now show.

Training efficiency

Not LQR


21 of 51

The Stochastic Problem

  • The Stochastic HJB can be written as:

  • A perturbation expansion about the nominal (zero noise) action of the optimal policy gives [15]:

  • The perturbation structure is lost due to the absence of characteristic curves owing to stochasticity: the feedback law has to be expanded to a high enough order for accuracy!

= 0, under MP
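A generic sketch of the stochastic HJB with a small-noise parameter ε, showing the diffusion term that removes the characteristic-curve structure:

```latex
% Stochastic HJB (generic form):
-\,\partial_t J \;=\; \min_{u}\Big[\, l(x) + \tfrac12 u^{\top} R u
  + \nabla_x J^{\top}\big(f(x)+g(x)u\big)
  + \tfrac{\epsilon^{2}}{2}\,\mathrm{tr}\big(\nabla_{xx} J\big) \Big].

% The second-order (diffusion) term couples the value function across neighboring states,
% so there are no characteristic curves: an expansion about a single nominal trajectory
% no longer closes order by order, and the feedback law must be expanded to high order.
```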

[15] Naveed Gul Mohamed, M., Chakravorty, S., Goyal, R., and Wang, R., “On the Optimal Feedback Law in Stochastic Optimal Nonlinear Control”, arXiv:2004.01041, under revision for the IEEE Transactions on Automatic Control

Does it make sense to do stochastic MPC?


22 of 51

Decoupled Data-based Control (D2C)

  • Open loop deterministic optimization with iLQR data-based extension [16,17]:

 

Necessary condition; iteration till convergence:

    • Taylor expansion of the dynamics and cost about the current nominal
    • Backward pass: propagate the co-state / cost-to-go (the “critic”) and compute the control update (the “actor”)
    • Forward pass: roll out the updated controls to obtain the new nominal
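A minimal Python/NumPy sketch of one such iteration, assuming the time-varying linearizations A[t], B[t] are available (e.g., from the least-squares identification on the next slide), a quadratic cost with weights Q, R, Qf, and a blackbox simulator dyn(x, u); the names and interfaces are illustrative, not the authors' implementation:

```python
import numpy as np

def ilqr_iteration(x_nom, u_nom, A, B, Q, R, Qf, dyn):
    """One iLQR iteration about a nominal trajectory (regulation to the origin assumed).
    Backward pass: propagate the quadratic cost-to-go ("critic") and compute the
    feedforward/feedback control update ("actor"). Forward pass: roll out the update."""
    T = len(u_nom)
    n, m = x_nom.shape[1], u_nom.shape[1]
    P, p = Qf, Qf @ x_nom[-1]                # terminal cost-to-go Hessian and gradient
    K = np.zeros((T, m, n)); k = np.zeros((T, m))
    for t in reversed(range(T)):             # backward pass
        Qx  = Q @ x_nom[t] + A[t].T @ p
        Qu  = R @ u_nom[t] + B[t].T @ p
        Qxx = Q + A[t].T @ P @ A[t]
        Quu = R + B[t].T @ P @ B[t]          # ILQR: second-order dynamics terms dropped
        Qux = B[t].T @ P @ A[t]
        Quu_inv = np.linalg.inv(Quu)
        K[t] = -Quu_inv @ Qux                # feedback gain
        k[t] = -Quu_inv @ Qu                 # feedforward update
        P = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]
        p = Qx + K[t].T @ Quu @ k[t] + K[t].T @ Qu + Qux.T @ k[t]
    x_new = [x_nom[0]]; u_new = []           # forward pass (no line search in this sketch)
    for t in range(T):
        u = u_nom[t] + k[t] + K[t] @ (x_new[t] - x_nom[t])
        u_new.append(u)
        x_new.append(dyn(x_new[t], u))       # blackbox rollout
    return np.array(x_new), np.array(u_new)
```

Iterating until the feedforward update k[t] becomes negligible gives the converged open-loop nominal; the feedback gain design then follows as on the next slide.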

[16] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, arXiv:2002.09478, under review for the IEEE Transactions on Automatic Control

[17] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, IEEE International Conference on Decision and Control, 2021


23 of 51

Decoupled Data-based Control (D2C)

  • System identification with linear least squares based on RL-type rollouts:
    • Collect data from simulation experiments
    • Solve for the linearized dynamics
    • Use the identified linearizations in the iLQR passes: data-based iLQR

  • Feedback gain design:
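A minimal sketch of both steps, assuming a blackbox simulator dyn(x, u), illustrative rollout counts and perturbation scales, and a quadratic cost with weights Q, R, Qf; this is not the authors' code:

```python
import numpy as np

def identify_ltv(dyn, x_nom, u_nom, n_rollouts=50, sigma=1e-3):
    """Estimate the linearized dynamics (A_t, B_t) about the nominal trajectory from
    randomly perturbed rollouts of the blackbox simulator, via linear least squares."""
    T, n, m = len(u_nom), x_nom.shape[1], u_nom.shape[1]
    A, B = [], []
    for t in range(T):
        x_next_nom = dyn(x_nom[t], u_nom[t])
        dX = sigma * np.random.randn(n_rollouts, n)      # state perturbations
        dU = sigma * np.random.randn(n_rollouts, m)      # control perturbations
        dXn = np.array([dyn(x_nom[t] + dx, u_nom[t] + du) - x_next_nom
                        for dx, du in zip(dX, dU)])      # next-state deviations
        Z = np.hstack([dX, dU])                          # regressors [dx, du]
        Theta, *_ = np.linalg.lstsq(Z, dXn, rcond=None)  # dXn ≈ Z @ [A_t, B_t]^T
        A.append(Theta[:n].T); B.append(Theta[n:].T)
    return A, B

def tv_lqr_gains(A, B, Q, R, Qf):
    """Time-varying LQR gains about the nominal: u_t = u_nom[t] + K[t] @ (x_t - x_nom[t])."""
    P, K = Qf, []
    for At, Bt in zip(reversed(A), reversed(B)):
        Kt = np.linalg.solve(R + Bt.T @ P @ Bt, Bt.T @ P @ At)
        P = Q + At.T @ P @ (At - Bt @ Kt)                # Riccati recursion
        K.append(-Kt)
    return K[::-1]
```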

 


24 of 51

Decoupled Data-based Control (D2C) Algorithm

  • Is D2C guaranteed to converge?
  • Does D2C converge to a global minimum?


25 of 51

Optimality and Convergence

ILQR is Sequential Quadratic Programming (SQP); DDP is overkill for reaching an open-loop minimum.

[16] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, arXiv:2002.09478, submitted to IEEE Transactions on Automatic Control

[17] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, IEEE International Conference on Decision and Control, 2021


26 of 51

Optimality and Convergence

[18] P. T. Boggs and J. W. Tolle, “Sequential quadratic programming,” Acta Numerica, vol. 4, pp. 1–51, 1995.

Assumptions:

Global convergence to a stationary point

[16] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, arXiv:2002.09478, submitted to the IEEE Transactions on Automatic Control.

[17] Wang, R., Parunandi, K. S., Sharma, A., Goyal, R., and Chakravorty, S., “On the Search for Feedback in Reinforcement Learning”, IEEE International Conference on Decision and Control, 2021


27 of 51

Optimality and Convergence

Using the Method of Characteristics result from before regarding the sufficiency of the Minimum Principle [15, 16]:

Global convergence to the global minimum

Training reliability

Efficient and reliable training can be combined with replanning as in MPC to obtain a global feedback policy.

[15] M. N. G. Mohamed, S. Chakravorty, R. Goyal, and R. Wang, "On the Feedback law in Stochastic Nonlinear Optimal Control", Proceedings American Control Conference (ACC), June 2022

[16] Naveed Gul Mohamed, M., Chakravorty, S., Goyal, R., and Wang, R., “On the Optimal Feedback Law in Stochastic Optimal Nonlinear Control”, arXiv:2004.01041, submitted to the IEEE Transactions on Automatic Control

 


28 of 51

Empirical Results

Fish

  • 27 state variables
  • 6 control channels
  • Fluid structure interaction

6-link Swimmer

  • 16 state variables
  • 5 control channels
  • Fluid structure interaction

Cartpole

  • 4 state variables
  • 1 control channel

We use the simulation model as a blackbox in the physics engine MuJoCo [19].

[19] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033


29 of 51

Material Microstructure Control

Cahn-Hilliard Equation:

Allen-Cahn Equation:
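For reference, standard forms of the two phase-field models (the parameterization used in the slides may differ):

```latex
% Cahn-Hilliard (conserved order parameter c), with mobility M and gradient-energy coefficient \kappa:
\frac{\partial c}{\partial t} \;=\; \nabla\!\cdot\!\Big( M\,\nabla\big( f'(c) - \kappa\,\nabla^{2} c \big) \Big).

% Allen-Cahn (non-conserved order parameter \phi), with kinetic coefficient L:
\frac{\partial \phi}{\partial t} \;=\; -\,L\,\big( f'(\phi) - \kappa\,\nabla^{2}\phi \big).

% Here f is a double-well free energy density, e.g. f(c) = \tfrac14\,(c^{2}-1)^{2}.
```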


30 of 51

Training Efficiency

Advantages of D2C in training efficiency compared with DDPG (both trained on a PC):

Cartpole: DDPG 6306.7 s vs. D2C 0.55 s

6-link Swimmer: DDPG 88160.0 s vs. D2C 127.2 s

Fish: DDPG 124367.6 s vs. D2C 54.8 s

DDPG and D2C have the exact same access to data from the model.


31 of 51

Training Efficiency and Variance

Advantages of D2C in training efficiency and reliability

Training variance comparison

Training reliability

Training efficiency

Is sample efficiency the right metric for evaluating RL? What about the quality of the answer to which RL converges?


32 of 51

Closed-loop Performance Under Noise

Cartpole

6-link swimmer

Fish

 

Robustness

Remarks

  • In theory, the RL methods provide global policies.
  • In practice, RL policies are only reliable locally and suffer high variance.
  • D2C with replanning can give a global feedback policy.
  • In principle, RL “should” perform better than D2C, since it ostensibly provides a nonlinear policy; our experiments do not bear this out.


33 of 51

D2C Performance Summary

Data-based

Efficient and reliable training

High-dimensional nonlinear stochastic systems

Robust to process noise

Global optimality


34 of 51

The Connection to MPC

  • The replanning in D2C is reminiscent of MPC.

  • There are two caveats:
    • We use a shrinking horizon
    • We do not replan at every time step; instead, we replan only when necessary while employing the optimal feedback law (T-PFC)

We replan for stochasticity, not to account for the infinite horizon.
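A minimal sketch of the replan-when-necessary loop; the deviation-threshold trigger and the plan(x, horizon) interface are illustrative assumptions, not necessarily the exact T-PFC criterion:

```python
import numpy as np

def replan_when_needed(x0, plan, dyn, T, threshold=0.1):
    """Shrinking-horizon execution: track the nominal with the linear feedback law and
    replan (over the remaining horizon only) when the state drifts too far from the nominal.
    plan(x, H) is assumed to return (x_nom, u_nom, K) from a D2C-style solve of horizon H."""
    x_nom, u_nom, K = plan(x0, T)
    x, trajectory, t0 = x0, [x0], 0
    for t in range(T):
        k = t - t0                                   # index into the current plan
        dx = x - x_nom[k]
        if np.linalg.norm(dx) > threshold and t < T - 1:
            x_nom, u_nom, K = plan(x, T - t)         # replan over the shrinking horizon
            t0, k, dx = t, 0, np.zeros_like(dx)
        u = u_nom[k] + K[k] @ dx                     # nominal + linear feedback correction
        x = dyn(x, u)
        trajectory.append(x)
    return trajectory
```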


35 of 51

The Connection to MPC

Performance comparison for multiple different robotic planning problems.


36 of 51

The Connection to MPC

A comparison of fixed and shrinking horizon MPC for different (fixed) horizon lengths

The empirical evidence suggests that shrinking the horizon in MPC results in much better performance, and it should be feasible to maintain the stability and feasibility guarantees of traditional MPC.


37 of 51

Intractability of the Stochastic Problem

We compare the performance of the approximate optimal feedback law, found by computationally solving the stochastic HJB, with the deterministic feedback law implemented using MPC.

Results show that the MPC feedback law performs much better in higher noise regimes: empirical evidence for the sensitivity of the stochastic DP solution.


38 of 51

Takeaway: Local is Key!

Optimal feedback control is equivalent to solving the HJB PDE. We can try to solve the PDE globally, as in ADP/RL, in which case we run into the curse of dimensionality for most practical problems and obtain unreliable solutions. Alternatively, we can get a local solution, which we modify online when required, a la MPC.

The classical method of solving a first order PDE is the Method of Characteristics (MOC). MPC repeatedly finds the characteristic curve (open loop) from the current state, while we, in practice, advocate finding a local solution (linear feedback) around the nominal curve and replanning only when necessary. The local approach is far more scalable, accurate and reliable, while MPC-style replanning assures global applicability.

The stochastic problem is fundamentally intractable because it loses the perturbation structure, i.e., there is no notion of a local solution, unlike in the deterministic case.


39 of 51

Future Directions

  • The Partially Observed Problem
  • The Infinite Horizon Problem
  • Software for Optimal Feedback Control Synthesis (also known as RL?)
  • Applications: microstructure control, flow control, soft robotics
  • Learning on physical systems: direct microstructure control / flow control (true use of RL?)
  • Purpose of Models


40 of 51

Acrobot


41 of 51

Thank you


42 of 51

Training Under Noise

DDPG training under noise

Cartpole – process noise in control

Cartpole – process noise in state and control

Pendulum – process noise in control

Pendulum – process noise in state and control

Remarks

  • Although DDPG can be trained under noise, doing so does not improve its performance.
  • D2C and D2C replan policies are more robust than DDPG policies trained under noise.


43 of 51

Partially observed fish model with DDPG

Direct RL method

  • DDPG trained with the same information state as POD2C.
  • The training is not successful even after 20 hours.


44 of 51

Comparison with model-based tensegrity control

Model-based shape control

D2C closed-loop policy

T2D1 Tensegrity Model

Faster

Slower


45 of 51

Comparison with model-based tensegrity control

Reacher – closed-loop performance

Reacher – control energy

  • Model-based control has no training time.
  • D2C is more energy-efficient owing to the optimization.
  • Model-based control performs better under low noise; D2C works over a larger noise range.
  • Both methods have comparable training accuracy.


46 of 51

Simulation Results: Fish Model

  • 27 total states
  • 10% process noise + 20% measurement noise
  • Red ball: target

Measurements:

  • 4 fin joint angles
  • 3 tail joint angles
  • Nose xyz position
  • First quaternion element

Open-loop nominal only: failed

POD2C with ARMA-LQG as closed-loop: succeeded


47 of 51

Exact Match of ARMA Model and LTV System

LTV system linearized around nominal trajectory

 

 

The match is exact under a full column rank condition.
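A hedged sketch of the construction, in assumed notation:

```latex
% LTV system linearized about the nominal:
\delta x_{t+1} = A_t\,\delta x_t + B_t\,\delta u_t, \qquad \delta y_t = C_t\,\delta x_t .
% Stacking the last q outputs and inputs gives
Y_t = \mathcal{O}_t\,\delta x_{t-q+1} + \mathcal{T}_t\,U_t ,
% where \mathcal{O}_t is the q-step (time-varying) observability matrix and \mathcal{T}_t
% collects the input contributions. If \mathcal{O}_t has full column rank, \delta x_{t-q+1}
% is uniquely determined by (Y_t, U_t); propagating one step forward then gives an exact
% time-varying ARMA representation of the output,
\delta y_{t+1} \;=\; \sum_{i=0}^{q-1} \alpha_{t,i}\,\delta y_{t-i} \;+\; \sum_{i=0}^{q-1} \beta_{t,i}\,\delta u_{t-i},
% i.e., the ARMA model matches the LTV input-output map exactly rather than approximately.
```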


48 of 51

Background

  • System identification
  • Belief space planning

Finite difference (FD) [6]

[6] E. H., N., “The Calculus of Finite Differences”, Nature, vol. 134, no. 3381, pp. 231–233, 1934. doi:10.1038/134231a0.

Eigensystem Realization Algorithm (ERA) [7]

  • Match the Markov parameters
  • System identification + model reduction

[7] Jer-Nan Juang and Richard S. Pappa, "An Eigensystem Realization Algorithm for Modal Parameter Identification and Model Reduction". Journal of Guidance, Control, and Dynamics, 1985 8:5, 620–627.

Partially observable Markov decision process (POMDP) [8]

  • Policy is a function of observation history
  • Policy is a function of belief state

[8] Åström, Karl Johan, Optimal Control of Markov Processes with Incomplete State Information I, Journal of Mathematical Analysis and Applications, 1965, 10. p.174-205


49 of 51

Information State Optimization

  • Stochastic system (partially observed)

  • Information state: past observations and past inputs

  • Nominal trajectory, with the cost defined on the information state
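A sketch of the notation, assuming a finite history window of length q (illustrative):

```latex
% Information state: the finite history of past observations and past inputs,
Z_t \;=\; \big( y_t,\, y_{t-1},\, \ldots,\, y_{t-q+1};\;\; u_{t-1},\, \ldots,\, u_{t-q} \big).
% The partially observed problem is then posed on the information state about a nominal
% trajectory (\bar{Z}_t, \bar{u}_t), with the cost written directly on Z_t:
\min_{u_{0:T-1}} \;\; \mathbb{E}\Big[\, c_T(Z_T) + \sum_{t=0}^{T-1} c_t(Z_t, u_t) \Big].
```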


50 of 51

Global Optimal Solution

  • Expand the output and the state around the nominal trajectory

  • By the implicit function theorem, the state perturbation is a unique function of the information state, i.e., there is a unique mapping from the information state to the state


51 of 51

Biased Nature in Partially Observed Case

 
