A new random state, e, is selected.
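
A minimal sketch of this step follows. It assumes the states are labeled by letters and that each new episode begins from a state drawn uniformly at random. The slides only name states `e` and `b`, so the full set `a` to `e` and the helper `pick_start_state` are hypothetical.

```python
import random

# Hypothetical state set; the slides only mention states "e" and "b".
STATES = ["a", "b", "c", "d", "e"]

def pick_start_state(rng: random.Random) -> str:
    """Select a new random state to start the next episode from."""
    return rng.choice(STATES)

rng = random.Random(0)  # seeded so runs are reproducible
start = pick_start_state(rng)
print(start)
```

Drawing the start state at random means every state gets visited eventually, even ones the current policy would rarely reach.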

A new random state, b, is selected.

Q value = average of the returns observed for that state-action pair.
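
The averaging step can be sketched as follows. This minimal example assumes every return observed after visiting a (state, action) pair is kept in a list. The helpers `record_return` and `q_value` and the action name `left` are hypothetical illustrations.

```python
from collections import defaultdict

# returns[(state, action)] holds every return observed after visiting that pair.
returns = defaultdict(list)

def record_return(state, action, g):
    """Store one observed return g for the pair (state, action)."""
    returns[(state, action)].append(g)

def q_value(state, action):
    """Estimate Q(s, a) as the mean of the recorded returns (0.0 if none yet)."""
    gs = returns[(state, action)]
    return sum(gs) / len(gs) if gs else 0.0

record_return("e", "left", 1.0)
record_return("e", "left", 0.0)
record_return("e", "left", 2.0)
print(q_value("e", "left"))  # 1.0
```

In practice the list is often replaced by a running count and mean, which gives the same average without storing every return.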

Update the policy so that, in each state, it picks the action with the highest Q value.
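
The improvement step can be sketched as a greedy update. This minimal example assumes Q values are stored in a dict keyed by (state, action). The function `improve_policy` and the action names `left` and `right` are hypothetical.

```python
def improve_policy(q, states, actions):
    """Return a greedy policy: for each state, the action with the highest Q value."""
    # Pairs missing from q are treated as having a Q value of 0.0.
    return {s: max(actions, key=lambda a: q.get((s, a), 0.0)) for s in states}

q = {
    ("e", "left"): 1.0,
    ("e", "right"): 2.5,
    ("b", "left"): 0.5,
    ("b", "right"): -1.0,
}
policy = improve_policy(q, ["e", "b"], ["left", "right"])
print(policy)  # {'e': 'right', 'b': 'left'}
```

Alternating evaluation (averaging returns) with this greedy update is what drives the policy toward higher-value actions over successive episodes.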