Artificial Intelligence
Policy Gradient Methods
Kangwon National University, Dept. of Computer Engineering · 최우혁
Last time…
Lecture Outline
This time…
Policy Gradient?
Until Now… Value Function!
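So far the policy has only been implicit: a value function is learned and actions are chosen (near-)greedily from it. As a reminder, in the value-based setting of the previous lectures the behaviour is derived from an approximate action-value function, e.g. in the greedy case (an ε-greedy variant adds exploration on top):

\[ \pi(s) = \arg\max_{a} \hat{q}(s, a, \mathbf{w}) \]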
Policy Gradient? Directly Learning Policy!
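Learning the policy directly means parameterizing the action distribution itself, π(a | s, θ), and adjusting θ. A common concrete form for discrete actions, used here only as an illustrative assumption, is a softmax over action preferences h(s, a, θ) (e.g. a linear or neural-network score):

\[ \pi(a \mid s, \boldsymbol{\theta}) = \frac{e^{h(s, a, \boldsymbol{\theta})}}{\sum_{b} e^{h(s, b, \boldsymbol{\theta})}} \]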
Policy Gradient: How?
How to Learn Policy
How to Learn Policy: Objective Function, J(𝛉)
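One common episodic formulation of the objective (the discounted version appears in the REINFORCE part of the deck) is the expected total reward of a trajectory τ generated by following π_θ from the start state:

\[ J(\boldsymbol{\theta}) \doteq \mathbb{E}_{\tau \sim \pi_{\boldsymbol{\theta}}}\!\left[ \sum_{t=0}^{T-1} R_{t+1} \right] \]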
How to Learn Policy: Gradient Ascent
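Since J(θ) is to be maximized, the parameters are moved along an estimate of its gradient (ascent, not descent):

\[ \boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \alpha\, \widehat{\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}_t)} \]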
How to Learn Policy: Policy Gradient Theorem
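In its standard episodic form (Sutton & Barto's notation is assumed here), the policy gradient theorem expresses this gradient without differentiating the state distribution, which leads directly to a sample-based estimate:

\[ \nabla J(\boldsymbol{\theta}) \;\propto\; \mathbb{E}_{\pi}\!\left[ \sum_{a} q_{\pi}(S_t, a)\, \nabla_{\boldsymbol{\theta}} \pi(a \mid S_t, \boldsymbol{\theta}) \right] \;=\; \mathbb{E}_{\pi}\!\left[ G_t\, \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \right] \]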
DQN vs. PG: Action Spaces
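One way to see the action-space difference: DQN must evaluate and maximize over a finite set of action values, while a policy-gradient method can parameterize a density over actions directly, e.g. a Gaussian policy for continuous control (the Gaussian form is an illustrative assumption, not something the slides prescribe):

\[ \pi(a \mid s, \boldsymbol{\theta}) = \mathcal{N}\!\big(a;\ \mu_{\boldsymbol{\theta}}(s),\ \sigma_{\boldsymbol{\theta}}(s)^{2}\big) \]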
DQN vs. PG: Temporal Correlation
Next time…
Artificial Intelligence
Policy Gradient Methods
Kangwon National University, Dept. of Computer Engineering · 최우혁
Lecture Outline
Last time…
This time…
REINFORCE
(Recap) Objective Function
(Recap) Gradient Ascent
Total Reward to Discounted Return
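Replacing the total reward with the discounted return from each time step gives the quantity that REINFORCE samples:

\[ G_t \doteq \sum_{k=0}^{T-t-1} \gamma^{k} R_{t+k+1}, \qquad \nabla J(\boldsymbol{\theta}) \;\propto\; \mathbb{E}_{\pi}\!\left[ G_t\, \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \right] \]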
REINFORCE, MC Policy-Gradient Control
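A minimal NumPy sketch of REINFORCE (Monte-Carlo policy-gradient control) with a softmax-linear policy. The corridor environment, one-hot features, and hyperparameters below are illustrative assumptions, not the course's reference implementation.

import numpy as np

# Toy episodic environment (an illustrative assumption): a 5-state corridor.
# Action 1 moves right, action 0 moves left; reaching state 4 ends the
# episode with reward +1, all other rewards are 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def features(s):                       # one-hot state features x(s)
    x = np.zeros(N_STATES); x[s] = 1.0
    return x

def policy(theta, s):                  # softmax over linear action preferences h(s, a, theta)
    prefs = theta @ features(s)
    prefs -= prefs.max()               # numerical stability
    p = np.exp(prefs)
    return p / p.sum()

def grad_log_pi(theta, s, a):          # gradient of ln pi(a|s, theta) for the softmax-linear policy
    x, p = features(s), policy(theta, s)
    g = -np.outer(p, x)
    g[a] += x
    return g

def run_episode(theta, max_len=50):
    s, traj = 0, []
    for _ in range(max_len):
        a = np.random.choice(N_ACTIONS, p=policy(theta, s))
        s2, r, done = step(s, a)
        traj.append((s, a, r))
        s = s2
        if done:
            break
    return traj

def reinforce(episodes=2000, alpha=0.1, gamma=0.95, seed=0):
    np.random.seed(seed)
    theta = np.zeros((N_ACTIONS, N_STATES))
    for _ in range(episodes):
        traj, G = run_episode(theta), 0.0
        for t in reversed(range(len(traj))):   # accumulate the discounted return G_t backwards
            s, a, r = traj[t]
            G = r + gamma * G
            theta += alpha * (gamma ** t) * G * grad_log_pi(theta, s, a)
    return theta

theta = reinforce()
print("P(right) per state:", np.round([policy(theta, s)[1] for s in range(N_STATES)], 2))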
Variance vs. Bias
Variance vs. Bias: Monte-Carlo Methods
Variance vs. Bias: Temporal Difference Learning
Speed and Direction in Update Rule
[Figure annotations: the direction to move · the distance to move]
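Reading the REINFORCE update this way: the score function ∇_θ ln π gives the direction to move in parameter space, while α γ^t G_t scales the distance moved, so actions that were followed by large returns are reinforced more strongly:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \underbrace{\alpha\, \gamma^{t} G_t}_{\text{distance}}\; \underbrace{\nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta})}_{\text{direction}} \]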
Baseline Function, b(St)
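Subtracting a baseline from the return leaves the expected gradient unchanged while, with a well-chosen b(S_t), reducing its variance:

\[ \nabla J(\boldsymbol{\theta}) \;\propto\; \mathbb{E}_{\pi}\!\left[ \big(G_t - b(S_t)\big)\, \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \right] \]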
Baseline Function, b(St): Why State-Dependency?
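The baseline may depend on the state but not on the chosen action because, for any b(s), the subtracted term has zero expectation: the policy's probabilities sum to one in every state, so

\[ \sum_{a} b(s)\, \nabla_{\boldsymbol{\theta}} \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla_{\boldsymbol{\theta}} \sum_{a} \pi(a \mid s, \boldsymbol{\theta}) = b(s)\, \nabla_{\boldsymbol{\theta}} 1 = \mathbf{0}. \]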
Baseline Function, b(St): State-Value Function
Baseline Function, b(St): State-Value FA
REINFORCE with Baseline
[Annotation: an additional correction value applied so that the policy is updated in light of the discounted return.]
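With the learned state value v̂(S_t, w) as the baseline, REINFORCE with baseline keeps two parameter vectors and updates both from the same Monte-Carlo return (the form below follows the standard Sutton & Barto presentation):

\[ \delta_t = G_t - \hat{v}(S_t, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta_t\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \gamma^{t} \delta_t\, \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \]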
Actor-Critic Methods
Revisit REINFORCE with Baseline
[Annotations: for a given state, the policy learns (the probability of taking) a particular action; the baseline corrects the target.]
Actor-Critic Method!
[Annotations: the policy parameters converge when the gradient of the policy function becomes 0; the value parameters converge when the difference between the target and the state-value estimate becomes 0.]
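Replacing the Monte-Carlo return with a bootstrapped TD target turns this into an actor-critic method: the critic's TD error both trains the state-value estimate and scales the actor's policy update (the episodic form in Sutton & Barto additionally discounts the actor step by γ^t, omitted here):

\[ \delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}} \delta_t\, \nabla_{\mathbf{w}} \hat{v}(S_t, \mathbf{w}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}} \delta_t\, \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \]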
Actor-Critic Method: Equivalent Forms
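These forms give the same expected gradient direction (exactly so when the critic equals v_π; an approximate v̂ trades a little bias for much lower variance):

\[ \mathbb{E}_{\pi}\!\left[ G_t \nabla \ln \pi \right] = \mathbb{E}_{\pi}\!\left[ q_{\pi}(S_t, A_t)\, \nabla \ln \pi \right] = \mathbb{E}_{\pi}\!\left[ \big(q_{\pi}(S_t, A_t) - v_{\pi}(S_t)\big)\, \nabla \ln \pi \right] = \mathbb{E}_{\pi}\!\left[ \delta_t\, \nabla \ln \pi \right], \quad \text{where } \nabla \ln \pi \equiv \nabla_{\boldsymbol{\theta}} \ln \pi(A_t \mid S_t, \boldsymbol{\theta}) \]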
MC and TD Actor-Critic
TD Actor-Critic, TD Policy-Gradient Control
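A minimal NumPy sketch of one-step (TD) actor-critic on the same toy corridor as the REINFORCE sketch above; it reuses step, features, policy, and grad_log_pi from that block, and the step sizes are again illustrative assumptions.

import numpy as np

def td_actor_critic(episodes=2000, alpha_w=0.2, alpha_theta=0.1, gamma=0.95, seed=0):
    """One-step actor-critic: the critic's TD error drives both updates.
    Reuses N_STATES, N_ACTIONS, step, features, policy, grad_log_pi from the REINFORCE sketch."""
    np.random.seed(seed)
    theta = np.zeros((N_ACTIONS, N_STATES))   # actor: policy parameters
    w = np.zeros(N_STATES)                    # critic: linear state-value weights
    for _ in range(episodes):
        s, I, done, steps = 0, 1.0, False, 0
        while not done and steps < 50:
            a = np.random.choice(N_ACTIONS, p=policy(theta, s))
            s2, r, done = step(s, a)
            v_s  = w @ features(s)
            v_s2 = 0.0 if done else w @ features(s2)
            delta = r + gamma * v_s2 - v_s               # TD error
            w     += alpha_w * delta * features(s)       # critic update (semi-gradient TD(0))
            theta += alpha_theta * I * delta * grad_log_pi(theta, s, a)  # actor update
            I *= gamma                                   # per-episode discounting of the actor step
            s = s2
            steps += 1
    return theta, w

theta, w = td_actor_critic()
print("P(right) per state:", np.round([policy(theta, s)[1] for s in range(N_STATES)], 2))
print("v-hat per state:   ", np.round([w @ features(s) for s in range(N_STATES)], 2))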
Next time…
Artificial Intelligence
Policy Gradient Methods
Kangwon National University, Dept. of Computer Engineering · 최우혁
Lecture Outline
Last time…
This time…
Asynchronous Advantage Actor-Critic
Asynchronous Advantage Actor-Critic, A3C?
Actor-Critic?
Advantage (Function)?
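The advantage measures how much better an action is than the policy's average behaviour in that state; A3C estimates it with an n-step return bootstrapped by the critic (the one-step TD error is the n = 1 special case):

\[ A_{\pi}(s, a) = q_{\pi}(s, a) - v_{\pi}(s), \qquad \hat{A}(s_t, a_t) \approx \sum_{i=0}^{n-1} \gamma^{i} r_{t+i+1} + \gamma^{n} \hat{v}(s_{t+n}, \mathbf{w}) - \hat{v}(s_t, \mathbf{w}) \]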
Asynchronous (RL)?
Asynchronous (RL)? Procedures
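A structural Python sketch of only the asynchronous part of the procedure: several worker threads repeatedly (1) copy the shared parameters, (2) compute a gradient from their own experience, and (3) apply it back to the shared parameters without waiting for one another. The toy local_gradient below stands in for the actual actor-critic gradient; all names here are illustrative assumptions, not the course's implementation.

import threading
import numpy as np

# Shared ("global") parameter vector that every worker reads and updates.
shared = {"theta": np.zeros(4)}
lock = threading.Lock()   # A3C applies updates lock-free (Hogwild!-style); a lock is used here only for clarity

def local_gradient(theta, rng):
    # Stand-in for the actor-critic gradient a worker would compute from its own
    # n-step rollout; here it simply pulls theta toward a fixed target, plus noise.
    target = np.array([1.0, -1.0, 0.5, 2.0])
    return (target - theta) + 0.1 * rng.standard_normal(theta.shape)

def worker(worker_id, steps=500, lr=0.05):
    rng = np.random.default_rng(worker_id)
    for _ in range(steps):
        with lock:
            theta_local = shared["theta"].copy()   # 1. synchronize the local copy with the shared parameters
        grad = local_gradient(theta_local, rng)    # 2. interact with its own environment copy and compute a gradient
        with lock:
            shared["theta"] += lr * grad           # 3. apply the gradient asynchronously, without waiting for other workers

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("shared parameters after asynchronous updates:", np.round(shared["theta"], 2))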
Asynchronous (RL)? Effect
Pseudocode
Inconsistency of Gradients in A3C
(Synchronous) Advantage Actor-Critic
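In the synchronous variant (A2C), all N workers finish their rollouts first and a single batched update is applied to the shared parameters, so every gradient is computed from the same, current θ. Schematically (one common description, assumed here):

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} \hat{A}_i\, \nabla_{\boldsymbol{\theta}} \ln \pi(a_i \mid s_i, \boldsymbol{\theta}) \]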
References
Next time…