1 of 27

Comparison between LQR and DQN for Cartpole

UNIST RML-Seok Ju Lee


CONTENTS

Dynamics of Cartpole

01

Control Cartpole using LQR

02

Control Cartpole using DQN

03

Comparison of Result

04

Future Plans

05


  1. Systems of Cartpole
  2. Dynamics of Cartpole
  3. State Space for Cartpole

Dynamics of Cartpole

01


1. System of Cartpole

[Figure: the cartpole system, a cart with a pole hinged on top; friction is neglected]


2. Dynamics of Cartpole
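The equations on this slide did not survive extraction. A common form, assuming a frictionless cart of mass $M$, a point-mass pole $m$ at distance $l$ from the pivot, pole angle $\theta$ measured from upright, and applied force $F$, is:

```latex
\begin{aligned}
(M + m)\,\ddot{x} + m l \ddot{\theta}\cos\theta - m l \dot{\theta}^{2}\sin\theta &= F \\
l\,\ddot{\theta} + \ddot{x}\cos\theta - g\sin\theta &= 0
\end{aligned}
```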

 

 

 


3. State Space for Cartpole

 

We linearized the nonlinear cartpole system so that it can be controlled with LQR.

The linearization is taken about the pendulum's upward (inverted) equilibrium point.
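The matrices themselves were lost in extraction. A sketch of the linearized model, assuming a frictionless cart of mass $M$ and a point-mass pole $m$ at distance $l$, with state $\mathbf{x} = [x,\ \dot{x},\ \theta,\ \dot{\theta}]^\top$, input $u = F$, and $\theta$ measured from upright:

```latex
\dot{\mathbf{x}} = A\mathbf{x} + Bu, \qquad
A = \begin{bmatrix}
0 & 1 & 0 & 0 \\
0 & 0 & -\dfrac{mg}{M} & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & \dfrac{(M+m)g}{Ml} & 0
\end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\[2pt] \dfrac{1}{M} \\[2pt] 0 \\[2pt] -\dfrac{1}{Ml} \end{bmatrix}
```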

 


  1. Introduction to LQR
  2. Design the LQR controller for Cartpole
  3. Simulation Result of Cartpole using LQR

Control Cartpole using LQR

02


1. Introduction to LQR

 

 

 

 

 

[Block diagram: the reference r minus the feedback signal at a summing junction (+/−) gives the error e, which drives the controller]


 

State space: $\dot{x} = Ax + Bu$

Cost: $J = \int_{0}^{\infty} \left( x^\top Q x + u^\top R u \right) dt$

The goal of the LQR controller is to minimize this cost.

The optimal gain is found by iteratively solving the algebraic Riccati equation (ARE) until $K$ converges.
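Written out, the continuous-time ARE and the resulting feedback law are:

```latex
A^\top P + PA - PBR^{-1}B^\top P + Q = 0, \qquad
K = R^{-1}B^\top P, \qquad u = -Kx
```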

 


2. Design the LQR controller for Cartpole
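The design details on this slide (the chosen Q, R, and the resulting gain) were not preserved. As a sketch of the procedure, the snippet below discretizes an assumed linearized cartpole model (M = 1 kg, m = 0.1 kg, l = 0.5 m are illustrative values, not necessarily those used here) and iterates the Riccati recursion until the gain K converges, using only the standard library.

```python
# Illustrative LQR design for a linearized cartpole (assumed parameters).
# Matrices are plain nested lists so no external packages are needed.

M, m, l, g = 1.0, 0.1, 0.5, 9.8   # cart mass, pole mass, pole length (assumed)
dt = 0.02                          # discretization step

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def mat_sub(A, B):
    return [[A[i][j] - B[i][j] for j in range(len(A[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

# Continuous-time linearization about the upright equilibrium,
# state = [x, x_dot, theta, theta_dot], input = force on the cart.
Ac = [[0, 1, 0, 0],
      [0, 0, -m * g / M, 0],
      [0, 0, 0, 1],
      [0, 0, (M + m) * g / (M * l), 0]]
Bc = [[0], [1 / M], [0], [-1 / (M * l)]]

# Euler discretization: Ad = I + dt*Ac, Bd = dt*Bc.
I4 = [[float(i == j) for j in range(4)] for i in range(4)]
Ad = mat_add(I4, [[dt * v for v in row] for row in Ac])
Bd = [[dt * v for v in row] for row in Bc]

Q = I4        # state weight (assumed)
R = 1.0       # input weight (assumed; scalar, since the input is scalar)

# Iterate the discrete Riccati recursion until the gain K converges.
P = [row[:] for row in Q]
K = [[0.0] * 4]
for _ in range(100000):
    BtP = mat_mul(transpose(Bd), P)                   # B'P, a 1x4 row
    s = R + mat_mul(BtP, Bd)[0][0]                    # scalar R + B'PB
    K_new = [[v / s for v in mat_mul(BtP, Ad)[0]]]    # (R + B'PB)^-1 B'PA
    AtP = mat_mul(transpose(Ad), P)
    P_new = mat_sub(mat_add(Q, mat_mul(AtP, Ad)),
                    mat_mul(mat_mul(AtP, Bd), K_new))
    diff = max(abs(P_new[i][j] - P[i][j]) for i in range(4) for j in range(4))
    P, K = P_new, K_new
    if diff < 1e-9:
        break

# Simulate the closed loop x_{k+1} = (Ad - Bd K) x_k from a small angle offset.
x = [[0.0], [0.0], [0.1], [0.0]]
for _ in range(5000):                 # 100 s of simulated time
    u = -mat_mul(K, x)[0][0]
    x = mat_add(mat_mul(Ad, x), [[row[0] * u] for row in Bd])

print([round(v[0], 6) for v in x])    # state driven near the origin
```

In practice a library routine (for example SciPy's Riccati solvers) would replace the hand-written iteration; the loop above just makes the "iterate until K converges" step explicit.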

 

 

 


 

 


3. Simulation Result of Cartpole using LQR

[Simulation: cartpole under LQR control vs. no input]


  1. Introduction to DQN
  2. Design the DQN for Cartpole
  3. Simulation Result of Cartpole using DQN

Control Cartpole using DQN

03


1. Introduction to DQN

[Diagram: a deep Q-network takes the state as input and outputs a Q-value for each action (Q-value of Action 1, Action 2, ..., Action N)]


Reference: Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs.LG]. DeepMind Technologies.

Pseudo Code of DQN
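The pseudocode itself was lost in extraction; in outline, the algorithm from the cited Mnih et al. (2013) paper is:

```text
Initialize replay memory D to capacity N
Initialize action-value network Q with random weights
for episode = 1, M do
    observe initial state s_1
    for t = 1, T do
        with probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(s_t, a)
        execute a_t, observe reward r_t and next state s_{t+1}
        store transition (s_t, a_t, r_t, s_{t+1}) in D
        sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        set y_j = r_j                            if s_{j+1} is terminal
                = r_j + γ max_a' Q(s_{j+1}, a')  otherwise
        perform a gradient step on (y_j - Q(s_j, a_j))^2
    end for
end for
```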


2. Design the DQN for Cartpole

Replaybuffer

Dqn_learn

Dqn_main

Dqn_play
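The code of these four modules is not shown. As a minimal sketch, a Replaybuffer component typically looks like the following (the class and method names here are illustrative, not the actual implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently discards the oldest transitions when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch, unzipped into parallel tuples
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# usage sketch: store a few cartpole-style transitions, then draw a minibatch
buf = ReplayBuffer(capacity=50)
for i in range(10):
    buf.push([0.0, 0.0, 0.0, 0.0], i % 2, 1.0, [0.0, 0.0, 0.0, 0.0], False)
states, actions, rewards, next_states, dones = buf.sample(4)
```

Sampling uniformly from this buffer is what breaks the correlation between consecutive transitions during learning.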


Discount rate: how much confidence is placed in future rewards

Batch size: the number of data samples drawn in each training batch

Buffer size: the capacity of the replay buffer that stores experience

Learning rate: how much confidence is placed in the current experience

Target network weights: a delayed copy of the Q-network's weights, used when computing learning targets
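Concretely, these settings might be collected as follows (the values are illustrative assumptions, not the ones used in this work):

```python
# Illustrative DQN hyperparameters; values are assumptions, not from the slides.
hyperparams = {
    "gamma": 0.99,           # discount rate: confidence placed in future rewards
    "batch_size": 64,        # samples drawn from the buffer per training batch
    "buffer_size": 100_000,  # capacity of the replay buffer storing experience
    "learning_rate": 1e-3,   # how strongly each update trusts current experience
    "target_update": 1000,   # steps between syncing the target network's weights
}
```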

 


3. Simulation Result of Cartpole using DQN

[Simulation: before learning vs. after learning (episode 500)]


  • This graph shows how the reward changes as the number of episodes increases.

  • With the maximum reward capped at 500, the reward is small at first and then grows until the maximum is reached.

  • The values obtained in this graph are saved as experiences.


  • This graph shows the changes in the pole angle.

  • Unlike LQR control, DQN does not converge exactly to zero even once the system is stabilized, because it does not consider the dynamics at all and relies purely on experience.

  • Unlike LQR control, DQN keeps the angle close to zero from the very beginning.


  1. Advantage/Disadvantage of LQR
  2. Advantage/Disadvantage of DQN

Comparison of Result

04


1. Advantage/Disadvantage of LQR

Advantage

  • For a simple system, the optimal control input can be obtained by adjusting the gain.

  • Through the choice of Q and R, it is possible to decide whether the input or the state should be weighted more heavily.

  • Unlike PID control, a conventional output-feedback controller, LQR is a state-feedback controller, so its gain is obtained directly from the system matrices (A, B) rather than by trial and error.

Disadvantage

  • The computation becomes difficult when the dynamics are complex.

  • The difficulty of designing the controller grows with the state dimension. For example, in a 3D system the state can include the x, y, z positions, the x, y, z linear velocities, the roll, pitch, and yaw angles, and their angular velocities, which makes the computation very complicated. In such cases it is hard to solve the ARE, and applying LQR is not easy.


2. Advantage/Disadvantage of DQN

Advantage

  • Unlike LQR, DQN is driven purely by learning, so there is no need to model the surrounding environment; the problem can be solved without any understanding of the dynamics or kinematics.

  • As the earlier simulation results show, once training is complete, DQN reaches the target value almost from the beginning of an episode.

Disadvantage

  • In the cartpole example this was not a problem because little data was used, but in general DQN relies on a replay memory, which requires a large memory space and learns from old data.

  • As the simulation results show, because it relies solely on learning without considering the dynamics, the angle keeps oscillating slightly around zero.


  1. Implementation for Double Pendulum on Cart

  • Using DQN, control complicated systems whose dynamics cannot be solved analytically.

Future Plans

05


THANK YOU