
Assignment II: Movement Primitives & Reinforcement Learning

Cyber-Physical-Systems (190.001)

Phone: +43 3842 402 - 1901, Email: cps@unileoben.ac.at

Univ.-Prof. Dr. Elmar Rueckert


Chair of Cyber-Physical-Systems

MONTANUNIVERSITÄT LEOBEN



Assignment II: MPs & RL


Deadline: 18.02.2022 11:59 CET (new extended deadline)

  • Submission via email to cps@unileoben.ac.at.
  • Subject: 190.001 CPS - Assignment II - Group X
  • Note that the Group Numbers were reassigned. Use the correct number you received via email.

Files:

  • CPS_Assignment_Two_Group_X.pdf
  • CPS_Assignment_Two_Group_X.zip (archive of your Python code)

Templates: tentative.


Section A


Integrate Dynamic Movement Primitives (DMPs) into your Python/CoppeliaSim framework (25 pts)

  • Get the DMP implementation from: https://github.com/studywolf/pydmps
  • The corresponding documentation can be found here: https://studywolf.wordpress.com/2013/11/16/dynamic-movement-primitives-part-1-the-basics/
  • Download the CoppeliaSim scene for Assignment II from here: https://cps.unileoben.ac.at/wp/CPS_assignment_II_scene_panda.ttt_.zip
  • Implement DMPs in Python for discrete movements using 24 Gaussian basis functions per dimension. The DMPs should start at point p0 and converge to point pT in the scene (a minimal usage sketch follows after this list).
  • What are the meta-parameters of DMPs? Discuss proper choices in your report and illustrate the effect of these choices in the form of recorded task space trajectories.
  • Generate task space trajectories (hint: for the 3 dimensions x,y,z) with weights being:
    • all zero.
    • random weights between -100 and 100.
  • Record the task space trajectories and add figures of the trajectories and the corresponding DMP weights to your report.
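A minimal usage sketch for this task, assuming the pydmps API from the repository linked above (class DMPs_discrete, weight attribute w, method rollout); the concrete values of p0 and pT are placeholders here and must be read from the CoppeliaSim scene. The constructor also exposes the main meta-parameters to discuss (number of basis functions, time step dt, and, internally, the attractor gains ay, by and the temporal scaling tau):

import numpy as np
from pydmps.dmp_discrete import DMPs_discrete

# Placeholder start/goal points; read the real p0 and pT from the scene.
p0 = np.array([0.3, -0.2, 0.5])
pT = np.array([0.5, 0.3, 0.3])

n_bfs = 24  # Gaussian basis functions per dimension
dmp = DMPs_discrete(n_dmps=3, n_bfs=n_bfs, dt=0.01, y0=p0, goal=pT)

# (a) All-zero weights: the forcing term vanishes and the DMP reduces to
# the underlying point attractor, i.e., a direct converging motion p0 -> pT.
dmp.w = np.zeros((3, n_bfs))
y_zero, dy_zero, ddy_zero = dmp.rollout()

# (b) Uniform random weights in [-100, 100]: a shaped transient with the
# same start and goal, since the forcing term decays with the canonical system.
dmp.w = np.random.uniform(-100.0, 100.0, size=(3, n_bfs))
y_rand, dy_rand, ddy_rand = dmp.rollout()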


Section B


Integrate the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm into your Python/CoppeliaSim framework (15 pts)

  • Explain in 3-5 sentences how CMA-ES works. Add a block diagram showing how the following components interact: reward function, policy (DMPs), optimizer (CMA-ES), CoppeliaSim. Define the interfacing data as vectors (e.g., R \in \mathbb{R}^1 is the output of the reward function block).
    • What are the meta-parameters of CMA-ES? Discuss proper choices in your report.
    • Describe how the algorithm works. Is it a black-box, white-box, or grey-box method? Which other RL algorithm properties were discussed in the lecture?

  • Define a policy vector to optimize your DMPs for task space movement generation (see the loop sketch after this list).
  • How many parameters need to be learned for 6, 12, 24, and 32 Gaussians for task-space and for joint-space (only a theoretical consideration) movement representations? Add a table to your report.
  • Discuss the table with respect to (i) the number of training data samples (assuming 10 movement executions in CoppeliaSim) and (ii) the number of unknown parameters.
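A sketch of the resulting optimization loop, assuming the Python cma package (pip install cma); rollout_return is a hypothetical helper that executes one episode in CoppeliaSim and returns the scalar return R \in \mathbb{R}^1, and reward stands in for a reward function from Section C. With 24 Gaussians in task space, the policy vector has 3 x 24 = 72 entries. The visible meta-parameters are the initial step size sigma0 and, optionally, the population size:

import numpy as np
import cma

n_dims, n_bfs = 3, 24
theta0 = np.zeros(n_dims * n_bfs)  # policy vector: flattened DMP weights

def rollout_return(theta):
    """Hypothetical interface: run one CoppeliaSim episode, return R."""
    dmp.w = theta.reshape(n_dims, n_bfs)  # dmp from the Section A sketch
    y, dy, ddy = dmp.rollout()
    # ... stream y to CoppeliaSim and evaluate the executed trajectory ...
    return float(reward(y, ddy))

es = cma.CMAEvolutionStrategy(theta0, sigma0=10.0)
while not es.stop():
    candidates = es.ask()  # sample a population of candidate policies
    # cma minimizes, hence the negated returns
    es.tell(candidates, [-rollout_return(c) for c in candidates])
theta_opt = es.result.xbest  # best policy vector found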


Section C


Reinforcement Learning of Movement Policies (10 pts)

Look at the CoppeliaSim scene. There are four points defined: p0, p1, p2, pT. Your goal is to learn optimal task space trajectories that pass through all four points while minimizing the energy or ensuring smooth trajectories.

  • What is the return? Give a mathematical definition in your report (one possible formalization of the return and the reward functions is sketched after this list).
  • Define reward functions (add mathematical definitions to your report):
    • one that considers only the four points.
    • one that considers both the four points and the smoothness of the trajectories (hint: compute the jerk, i.e., the time derivative of the acceleration).
  • Learn optimal policies for both reward functions using DMPs with 24 Gaussians per dimension.
    • Plot the learning curves (x-axis: episodes, i.e., trajectory simulations; y-axis: return) and the best learned task space trajectories. Also plot the cumulative maximum of the past returns vs. episodes.
    • Add a table of the computed returns of the best policies to your report.
    • How many episodes are sufficient? Discuss potential answers in 1-2 sentences in your report.
    • What happens if you change the exploration rate to 0.05 or 0.5?
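One possible formalization of the return and the two reward functions, as a non-binding sketch (the episode notation \tau = (x_1, ..., x_T) with task space positions x_t is our own choice):

R(\tau) = \sum_{t=1}^{T} r_t  (undiscounted return of one episode)

R_{pts}(\tau) = -\sum_{i \in \{0,1,2,T\}} \min_t \| x_t - p_i \|^2  (considers only the four points)

R_{smooth}(\tau) = R_{pts}(\tau) - \lambda \sum_t \| \dddot{x}_t \|^2  (adds a jerk penalty, with jerk \dddot{x}_t = d^3 x / dt^3)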


Section D


Bonus Task: Imitation Learning of Movement Policies (10 pts)

  • Generate a dataset of 10 joint-space trajectories by adding about +/-20 cm of noise to the via-points p1 and p2 (hint: add Gaussian noise). The resulting dataset has the dimensions 7 x T x 10, where T is the length of the trajectories and 7 is the number of joints of the robot arm.
    • Discuss the number of data samples of the training data set in 1 sentence.

  • Define DMPs in joint space for the 7 dimensions using 24 Gaussians.
    • Compute the 7 x 24 weights via imitation learning (hint: implement regularized least squares regression; a sketch follows after this list). Choose a proper regularization term \lambda (hint: common values are 1e-1, 1e-3 and 1e-6).
    • Illustrate the learned policy in form of bar plots.
    • What happens if you increase the regularization term by a factor of 10?
  • Execute the learned policy and visualize the joint angle trajectories for 3 selected joints. Also, plot the training data.
    • What can you observe from the plot? (Hint: use black as line color for the training trajectories and some color for the learned trajectory).
    • Is the resulting task space trajectory still optimal? Compute the returns and compare them to the results from Section C 4.b.
    • Are the task space trajectories different? Add 1-2 sentences explaining your observation.
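A sketch of both steps, with loudly hypothetical helpers: demonstrate stands in for whatever pipeline (e.g., inverse kinematics in CoppeliaSim) turns noisy via-points into a 7 x T joint trajectory, and Phi / f_target denote the basis-function activations and the forcing-term targets of the chosen DMP formulation, which are assumed given:

import numpy as np

rng = np.random.default_rng(0)

# p1, p2: via-point positions read from the scene.
# Build 10 noisy demonstrations by perturbing the via-points with
# zero-mean Gaussian noise of std 0.2 m (about +/- 20 cm).
demos = []
for _ in range(10):
    p1_noisy = p1 + rng.normal(scale=0.2, size=3)
    p2_noisy = p2 + rng.normal(scale=0.2, size=3)
    demos.append(demonstrate(p1_noisy, p2_noisy))  # hypothetical, returns 7 x T
D = np.stack(demos, axis=-1)  # dataset of shape 7 x T x 10

# Regularized least squares for the 24 weights of one joint:
#   w = (Phi^T Phi + lambda * I)^{-1} Phi^T f_target
def fit_weights(Phi, f_target, lam=1e-3):
    n = Phi.shape[1]  # number of basis functions (24)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ f_target)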


Section F


Bonus Task (0 pts)

  • Implement your own Reinforcement Learning (RL) algorithm (a simple example is sketched after this list).

  • Explain how the algorithm optimizes the policy. What properties does your RL algorithm have w.r.t. Section B 3.b?
  • Learn optimal policies as in Section C 4.
    • Compare the learning performance of both RL approaches and discuss your findings in the report in 2-3 sentences.
    • Compare the learning curves.
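For orientation, one deliberately simple candidate is episodic hill climbing with Gaussian exploration, sketched below; this is just one possibility, not a prescribed solution. rollout_return is the hypothetical helper from the Section B sketch, and sigma plays the role of the exploration rate discussed in Section C:

import numpy as np

rng = np.random.default_rng(0)
theta_best = np.zeros(3 * 24)  # start from the all-zero policy vector
R_best = -np.inf
sigma = 10.0  # exploration noise std; vary this to study exploration rates

for episode in range(200):
    theta = theta_best + rng.normal(scale=sigma, size=theta_best.shape)
    R = rollout_return(theta)  # one CoppeliaSim episode
    if R > R_best:  # greedy update: keep only improvements
        theta_best, R_best = theta, R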



Thank you for your attention!

Univ.-Prof. Dr. Elmar Rückert

Chair of Cyber-Physical-Systems

Montanuniversität Leoben

Franz-Josef-Straße 18,

8700 Leoben, Austria

Phone: +43 3842 402 - 1901 (CPS Secretariat)

Email: cps@unileoben.ac.at

Web: https://cps.unileoben.ac.at


Disclaimer: The lecture notes posted on this website are for personal use only. The material is intended for educational purposes only. Reproduction of the material for any purposes other than what is intended is prohibited. The content is to be used for educational and non-commercial purposes only and is not to be changed, altered, or used for any commercial endeavor without the express written permission of Professor Rueckert.
