1 of 12

Value Function Decomposition for Iterative Design of Reinforcement Learning Agents

James MacGlashan, Evan Archer*, Alisa Devlic*, Takuma Seno*, Craig Sherstan*, Peter R. Wurman, Peter Stone

* Equal Contribution

2 of 12

Why isn’t my RL agent working?


Insufficient state features?

Imbalanced reward objectives?

Poor exploration?

Values are not propagating?

Insufficient network capacity?

Difference between training and test?

Unstable value bootstrapping?

Environment bugs?

Sharp value space?

© 2022, Sony AI

3 of 12

Ask the Q-function?


Why did you take action <action>?

Its predicted value is 243.74839

Okay, but why not action <other action>?

Its predicted value was only 239.25245


4 of 12

Ask the Q-function?

Q-functions summarize many long-term future outcomes into a single (uninformative) number


5 of 12

Value decomposition


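In many environments, reward functions are a weighted sum of components:

R(s, a) = w1 R1(s, a) + w2 R2(s, a) + … + wk Rk(s, a)

Resulting in the relationship

Q(s, a) = w1 Q1(s, a) + w2 Q2(s, a) + … + wk Qk(s, a)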
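
The relationship between the overall Q-function and its component Q-functions follows from linearity of expectation. A short derivation, assuming the usual discounted-return definition of a Q-function; the discount factor \gamma, the policy \pi, and the per-step component rewards r_{i,t} are notation introduced for this sketch and do not appear on the slide:

```latex
\begin{align*}
Q^{\pi}(s,a)
  &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_0 = s,\ a_0 = a\right] \\
  &= \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} \sum_{i=1}^{k} w_i\, r_{i,t} \,\middle|\, s_0 = s,\ a_0 = a\right] \\
  &= \sum_{i=1}^{k} w_i\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{i,t} \,\middle|\, s_0 = s,\ a_0 = a\right]
   = \sum_{i=1}^{k} w_i\, Q_i^{\pi}(s,a).
\end{align*}
```

Each Q_i here is the expected discounted sum of a single reward component under the same policy, so the components share one policy and differ only in the reward they accumulate.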

Actor-critic algorithms can be adapted to learn each component Q-function
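
A minimal tabular sketch of what learning each component Q-function can look like, assuming a SARSA-style TD update applied separately to every reward component. This is not the authors' implementation (the deck uses neural-network critics); the function names `td_update_components` and `total_q`, the step size `alpha`, the discount `gamma`, and the example states, actions, and rewards are all hypothetical.

```python
from collections import defaultdict


def td_update_components(q_tables, s, a, rewards, s_next, a_next,
                         alpha=0.5, gamma=0.9):
    """Apply one TD update per reward component.

    q_tables: list of k dicts mapping (state, action) -> component value Q_i.
    rewards:  list of k component rewards R_i observed on this transition.
    """
    for q_i, r_i in zip(q_tables, rewards):
        target = r_i + gamma * q_i[(s_next, a_next)]   # R_i + gamma * Q_i'
        q_i[(s, a)] += alpha * (target - q_i[(s, a)])  # move Q_i toward its own target


def total_q(q_tables, weights, s, a):
    """Recombine the components: Q(s, a) = sum_i w_i * Q_i(s, a)."""
    return sum(w * q_i[(s, a)] for w, q_i in zip(weights, q_tables))


# Two hypothetical components with hypothetical weights.
q_tables = [defaultdict(float), defaultdict(float)]
weights = [1.0, 2.0]

td_update_components(q_tables, s="s0", a="left", rewards=[1.0, -0.5],
                     s_next="s1", a_next="left")

print(q_tables[0][("s0", "left")])               # component 1 estimate
print(q_tables[1][("s0", "left")])               # component 2 estimate
print(total_q(q_tables, weights, "s0", "left"))  # weighted total
```

Each component is updated only from its own reward and its own next-step estimate; the weights enter only when the components are recombined into a total value.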


6 of 12

Conventional actor-critic

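[Diagram, two panels. Critic training: state S and action A feed the critic NN, which outputs Q; Q is trained with a TD loss against Q’ and R. Actor training: state S feeds the policy NN, which outputs π; S and π feed the critic NN, whose output Q drives the policy loss.]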


7 of 12

Actor-critic with value decomposition

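[Diagram, two panels. Critic training: state S and action A feed the critic NN, which outputs Q1, Q2, …, Qk; each Qi is trained with its own TD loss against Qi’ and Ri. Actor training: state S feeds the policy NN, which outputs π(s); S and π(s) feed the critic NN, whose outputs Q1, Q2, …, Qk are multiplied by the weights w1, w2, …, wk to form the policy loss.]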

Key idea: these multiple predictions facilitate the diagnosis and correction of RL problems
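
To illustrate how multiple predictions can help with diagnosis, the sketch below answers a "why this action and not that one?" question with per-component values instead of a single number. The helper `explain_preference`, the component names, the weights, and all values are hypothetical and chosen only for illustration; they are not results from the work presented.

```python
def explain_preference(q_chosen, q_other, weights):
    """Return each component's weighted contribution to Q(chosen) - Q(other).

    q_chosen, q_other: dicts mapping component name -> predicted Q_i for the action.
    weights:           dict mapping component name -> reward weight w_i.
    """
    return {name: weights[name] * (q_chosen[name] - q_other[name])
            for name in weights}


# Hypothetical component predictions for two candidate actions.
weights = {"progress": 1.0, "collision": 4.0, "smoothness": 0.5}
q_chosen = {"progress": 10.0, "collision": -1.0, "smoothness": -2.0}
q_other = {"progress": 12.0, "collision": -3.0, "smoothness": -2.0}

gaps = explain_preference(q_chosen, q_other, weights)

# List components by how strongly they separate the two actions.
for name, gap in sorted(gaps.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>10}: {gap:+.2f}")
print(f"{'total':>10}: {sum(gaps.values()):+.2f}")
```

In this made-up case the chosen action is predicted to make less progress, but it is preferred because the weighted collision component favors it by a larger margin. A breakdown like this also hints at which weight to adjust if the behavior is not the one intended.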


8 of 12

Determining influence of different rewards


Influence per environment step

Influence summaries over training


9 of 12

Detect and control exploration-inhibiting rewards



10 of 12

No sacrifice in overall performance




12 of 12

© 2021 Sony AI, Confidential

9/9/21