Assignment II: Movement Primitives & Reinforcement Learning

Cyber-Physical-Systems (190.001)

Phone: +43 3842 402 - 1901, Email: cps@unileoben.ac.at

Univ.-Prof. Dr. Elmar Rueckert


Chair of Cyber-Physical-Systems



Assignment II: MPs & RL


Deadline: 18.02.2022 11:59 CET (new extended deadline)

  • Submission via email to cps@unileoben.ac.at.
  • Subject: 190.001 CPS - Assignment II - Group X
  • Note that the Group Numbers were reassigned. Use the correct number you received via email.


  • CPS_Assignment_One_Group_X.pdf
  • CPS_Assignment_One_Group_X.zip (archive of your python code)





Section A


Integrate Dynamical Systems Movement Primitives (DMPs) into your Python/CoppeliaSim framework (25 pts)

  • Get the DMP implementation from: https://github.com/studywolf/pydmps
  • The corresponding documentation can be found here: https://studywolf.wordpress.com/2013/11/16/dynamic-movement-primitives-part-1-the-basics/
  • Download the CoppeliaSim Scene for Assignment II from here: https://cps.unileoben.ac.at/wp/CPS_assignment_II_scene_panda.ttt_.zip
  • Implement DMPs in Python for discrete movement using 24 Gaussian basis functions per dimension. The DMPs should start at p0 and converge to point pT in the scene.
  • What are the meta-parameters of DMPs? Discuss proper choices in your report and illustrate the effect of these choices in the form of recorded task space trajectories.
  • Generate task space trajectories (hint: for the 3 dimensions x,y,z) with weights being:
    • all zero.
    • random weights between -100 and 100.
  • Record the task space trajectories and add figures of the trajectories and the corresponding DMP weights to your report.
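
As a starting point for the two weight settings above, a discrete DMP rollout can be sketched in plain NumPy. This is a minimal illustration and not the pydmps implementation: the function name `dmp_rollout`, the gains `alpha_y` and `alpha_x`, the basis-width heuristic, and the start and goal coordinates are all assumptions chosen for the example. The actual p0 and pT must be read from the scene.

```python
import numpy as np

def dmp_rollout(w, y0, goal, n_steps=200, dt=0.01, alpha_y=25.0, alpha_x=1.0):
    """Integrate a discrete DMP (tau = 1) with simple Euler steps.

    w: (n_dims, n_bfs) forcing-term weights; y0, goal: (n_dims,) vectors.
    Returns the positions as an array of shape (n_steps, n_dims).
    """
    w = np.atleast_2d(np.asarray(w, dtype=float))
    y0 = np.asarray(y0, dtype=float)
    goal = np.asarray(goal, dtype=float)
    n_dims, n_bfs = w.shape
    beta_y = alpha_y / 4.0  # critically damped spring-damper (assumed choice)

    # Basis centers: equally spaced in time, mapped through x = exp(-alpha_x * t).
    centers = np.exp(-alpha_x * np.linspace(0.0, n_steps * dt, n_bfs))
    widths = n_bfs ** 1.5 / centers / alpha_x  # heuristic width, an assumption

    y, dy, x = y0.copy(), np.zeros(n_dims), 1.0
    traj = np.empty((n_steps, n_dims))
    for t in range(n_steps):
        psi = np.exp(-widths * (x - centers) ** 2)
        forcing = (w @ psi) / psi.sum() * x * (goal - y0)
        ddy = alpha_y * (beta_y * (goal - y) - dy) + forcing
        dy = dy + ddy * dt
        y = y + dy * dt
        x = x - alpha_x * x * dt  # canonical system: dx/dt = -alpha_x * x
        traj[t] = y
    return traj

# Hypothetical start and goal; replace with p0 and pT from the scene.
p0 = np.array([0.0, 0.0, 0.0])
pT = np.array([0.3, -0.2, 0.5])
rng = np.random.default_rng(0)
traj_zero = dmp_rollout(np.zeros((3, 24)), p0, pT)
traj_rand = dmp_rollout(rng.uniform(-100.0, 100.0, size=(3, 24)), p0, pT)
```

With all-zero weights the forcing term vanishes, so the trajectory is the plain spring-damper response toward the goal. The random-weight rollout shows how strongly the forcing term can reshape the path.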



Section B


Integrate the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) algorithm into your Python/CoppeliaSim framework (15 pts)

  • Explain in 3-5 sentences how CMA-ES works. Add a block diagram showing how the following components interact: reward function, policy (DMPs), optimizer (CMA-ES), CoppeliaSim. Define the interfacing data as vectors (e.g., R \in \mathbb{R}^1 is the output of the reward function block).
    • What are the meta-parameters of CMA-ES? Discuss proper choices in your report.
    • Describe how the algorithm works. Is it a black-box, white-box or grey-box method? Which other RL algorithm properties were discussed in the lecture?
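
To make the sample, evaluate, update loop concrete, the following is a compact sketch of CMA-ES in NumPy using commonly published default settings. It is an illustration under assumptions, not a reference implementation: the function name `cma_es_minimize` is hypothetical, and the sphere function only stands in for the negative return of a CoppeliaSim rollout. The optimizer minimizes, so a return would be passed as `-R`.

```python
import numpy as np

def cma_es_minimize(f, x0, sigma, iters=200, seed=0):
    """Minimize f with a basic CMA-ES; returns the final distribution mean."""
    rng = np.random.default_rng(seed)
    n = len(x0)
    lam = 4 + int(3 * np.log(n))            # population size
    mu = lam // 2                           # number of selected parents
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                            # recombination weights
    mueff = 1.0 / np.sum(w ** 2)
    cc = (4 + mueff / n) / (n + 4 + 2 * mueff / n)
    cs = (mueff + 2) / (n + mueff + 5)
    c1 = 2 / ((n + 1.3) ** 2 + mueff)
    cmu = min(1 - c1, 2 * (mueff - 2 + 1 / mueff) / ((n + 2) ** 2 + mueff))
    damps = 1 + 2 * max(0.0, np.sqrt((mueff - 1) / (n + 1)) - 1) + cs
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))

    mean = np.array(x0, dtype=float)
    C, pc, ps = np.eye(n), np.zeros(n), np.zeros(n)
    for g in range(iters):
        d2, B = np.linalg.eigh(C)
        D = np.sqrt(np.maximum(d2, 1e-20))
        # Sample lam candidates: x_k = mean + sigma * B D z_k
        y = (rng.standard_normal((lam, n)) * D) @ B.T
        x = mean + sigma * y
        cost = np.array([f(xi) for xi in x])
        best = np.argsort(cost)[:mu]        # black-box: only rankings are used
        y_w = w @ y[best]
        mean = mean + sigma * y_w
        # Evolution paths for step-size and covariance adaptation
        inv_sqrt_c = B @ np.diag(1.0 / D) @ B.T
        ps = (1 - cs) * ps + np.sqrt(cs * (2 - cs) * mueff) * (inv_sqrt_c @ y_w)
        ps_norm = np.linalg.norm(ps)
        hsig = float(ps_norm / np.sqrt(1 - (1 - cs) ** (2 * (g + 1))) / chi_n
                     < 1.4 + 2 / (n + 1))
        pc = (1 - cc) * pc + hsig * np.sqrt(cc * (2 - cc) * mueff) * y_w
        C = ((1 - c1 - cmu) * C
             + c1 * (np.outer(pc, pc) + (1 - hsig) * cc * (2 - cc) * C)
             + cmu * (y[best].T * w) @ y[best])
        sigma *= np.exp((cs / damps) * (ps_norm / chi_n - 1))
    return mean

# Stand-in objective; in the assignment f would run a DMP rollout and return -R.
sphere = lambda v: float(np.sum(v ** 2))
theta = cma_es_minimize(sphere, np.full(5, 2.0), sigma=0.5)
```

The quantities a user typically sets are the initial mean `x0`, the initial step size `sigma`, and the population size; the remaining constants are derived from the problem dimension.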

  • Define a policy vector to optimize your DMPs for task space movement generation.
  • How many parameters need to be learned for 6, 12, 24, and 32 Gaussians for task-space and for joint-space (only a theoretical consideration) movement representations? Add a table to your report.
  • Discuss the table with respect to (i) the number of training data samples (assuming 10 movement executions in CoppeliaSim) and (ii) the number of unknown parameters.
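
The table entries can be computed with a few lines of Python. The sketch below assumes that only the basis-function weights are learned (start, goal, and gains stay fixed), that the task space has 3 dimensions (x, y, z), and that the joint space has 7 dimensions, as stated for the robot arm later in this sheet. Other parameterizations would give different counts.

```python
TASK_DIMS = 3    # x, y, z
JOINT_DIMS = 7   # joints of the robot arm

def n_weights(n_dims, n_bfs):
    """Number of DMP weights: one per basis function and dimension."""
    return n_dims * n_bfs

rows = [(n, n_weights(TASK_DIMS, n), n_weights(JOINT_DIMS, n))
        for n in (6, 12, 24, 32)]

print(f"{'Gaussians':>9} {'task space':>10} {'joint space':>11}")
for n_bfs, task, joint in rows:
    print(f"{n_bfs:>9} {task:>10} {joint:>11}")
```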



Section C


Reinforcement Learning of Movement Policies (10 pts)

Look at the CoppeliaSim scene. There are four points defined: p0, p1, p2, pT. Your goal is to learn optimal task space trajectories that pass through all four points while minimizing the energy or ensuring smooth trajectories.

  • What is the return? List a mathematical definition in your report.
  • Define reward functions (add mathematical definitions to your report):
    • that only considers the four points.
    • that considers both the four points and the smoothness of the trajectories (hint: compute the jerk, i.e., the time derivative of the acceleration).
  • Learn optimal policies using both reward functions with DMPs with 24 Gaussians per dimension.
    • Plot the learning curves (x-axis: episodes:= trajectory simulations, y-axis: return) and the best learned task space trajectories. Also plot the cumulative maximum of the past returns vs. episodes.
    • Add a table of the computed returns of the best policies to your report.
    • How many episodes are sufficient? Discuss potential answers in 1-2 sentences in your report.
    • What happens if you change the exploration rate to 0.05 or 0.5?
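
One possible way to turn the two reward definitions and the cumulative-maximum plot into code is sketched below. The specific choices are assumptions for illustration only: the via-point term is the negative sum of the closest distances to each point, the smoothness term is the mean squared jerk from finite differences, and `jerk_weight` is a hypothetical trade-off parameter. Other definitions are equally valid.

```python
import numpy as np

def via_point_reward(traj, via_points):
    """Negative sum of the closest distance from the trajectory to each point."""
    traj = np.asarray(traj, dtype=float)
    total = 0.0
    for p in np.asarray(via_points, dtype=float):
        total += np.min(np.linalg.norm(traj - p, axis=1))
    return -total

def mean_squared_jerk(traj, dt):
    """Mean squared jerk via third-order finite differences of the positions."""
    jerk = np.diff(np.asarray(traj, dtype=float), n=3, axis=0) / dt ** 3
    return float(np.mean(np.sum(jerk ** 2, axis=1)))

def episode_return(traj, via_points, dt, jerk_weight=0.0):
    """Scalar return of one episode; jerk_weight = 0 ignores smoothness."""
    return via_point_reward(traj, via_points) - jerk_weight * mean_squared_jerk(traj, dt)

def cumulative_max(returns):
    """Best return seen so far after each episode."""
    return np.maximum.accumulate(np.asarray(returns, dtype=float))

# Hypothetical example: a straight line that passes through all points.
line = np.linspace(0.0, 1.0, 101)[:, None] * np.ones((1, 3))
points = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [1.0, 1.0, 1.0]])
r_points_only = episode_return(line, points, dt=0.01)
r_with_jerk = episode_return(line, points, dt=0.01, jerk_weight=1e-6)
best_so_far = cumulative_max([-3.0, -1.0, -2.0, 0.0])
```

The weight on the jerk term usually needs tuning, because finite-difference jerk values can be several orders of magnitude larger than the distance terms.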



Section D


Bonus Task: Imitation Learning of Movement Policies (10 pts)

  • Generate a dataset of 10 joint-space trajectories by adding about +/-20cm noise to the via-points p1 and p2. (Hint: add Gaussian noise. The resulting dataset has the dimensions 7xTx10, where T is the length of the trajectories. Note 7 is the number of joints of the robot arm).
    • Discuss the number of data samples of the training data set in 1 sentence.

  • Define DMPs in joint space for the 7 dimensions using 24 Gaussians.
    • Compute the 7x24 weights via imitation learning (hint: implement regularized least squares regression). Choose a proper regularization term \lambda (hint: common values are 1e-1, 1e-3 and 1e-6).
    • Illustrate the learned policy in the form of bar plots.
    • What happens if you increase the reg. term by a factor of 10?
  • Execute the learned policy and visualize the joint angle trajectories for 3 selected joints. Also, plot the training data.
    • What can you observe from the plot? (Hint: use black as line color for the training trajectories and some color for the learned trajectory).
    • Is the resulting task space trajectory still optimal? Compute the returns and compare them to the results from Section C 4.b.
    • Are the task space trajectories different? Add 1-2 sentences where you explain your observation.
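
The regularized least squares step can be sketched as follows. The closed form w = (Φ^T Φ + λI)^{-1} Φ^T f is the standard ridge solution; everything else here is an assumption for illustration: the function name `ridge_weights`, the construction of the feature matrix Φ from normalized Gaussian basis functions, and the synthetic targets. In the assignment the targets would be the forcing terms computed from the demonstrated joint trajectories.

```python
import numpy as np

def ridge_weights(features, targets, lam=1e-3):
    """Solve (Phi^T Phi + lam * I) W = Phi^T F for W (one column per joint)."""
    phi = np.asarray(features, dtype=float)
    f = np.asarray(targets, dtype=float)
    n_bfs = phi.shape[1]
    return np.linalg.solve(phi.T @ phi + lam * np.eye(n_bfs), phi.T @ f)

# Synthetic demonstration data (assumed, not taken from the scene).
n_steps, n_bfs, n_joints = 100, 24, 7
x = np.exp(-np.linspace(0.0, 1.0, n_steps))          # phase variable
centers = np.exp(-np.linspace(0.0, 1.0, n_bfs))
widths = n_bfs ** 1.5 / centers                      # heuristic width
psi = np.exp(-widths * (x[:, None] - centers) ** 2)
phi = psi / psi.sum(axis=1, keepdims=True) * x[:, None]

rng = np.random.default_rng(0)
w_true = rng.uniform(-1.0, 1.0, size=(n_bfs, n_joints))
f_target = phi @ w_true                              # stand-in forcing terms

w_fit = ridge_weights(phi, f_target, lam=1e-6)       # shape (24, 7)
w_small = ridge_weights(phi, f_target, lam=1e-3)
w_large = ridge_weights(phi, f_target, lam=1e-2)     # 10x stronger shrinkage
```

Transposing the result gives the 7x24 layout. Comparing `w_small` and `w_large` shows the shrinkage effect of a larger regularization term on the weight magnitudes.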



Section F


Bonus Task (0 pts)

  • Implement your own Reinforcement Learning (RL) algorithm.

  • Explain how the algorithm optimizes the policy. What properties does your RL algorithm have wrt. Section B 3.b?
  • Learn optimal policies as in Section C 4.
    • Compare the learning performance of both RL approaches and discuss your findings in the report in 2-3 sentences.
    • Compare the learning curves.




