PG부터 시작하는 Policy Gradient 방법들

by on Jul 14, 2020

PG 방법들 한 번에 요약하기 - Maximize Rewards [참고:rl-korea] (https://reinforcement-learning-kr.github.io/2018/06/29/0_pg-travel-guide/)

1. PG - Policy Gradient

How to estimate gradient!

Policy를 미분하기 위해 policy parameter θ 등장.
Advantage Function 등장. A = Q - V
value function 기반 기존 강화학습은 action이 크게 변하는 경향이 존재.
policy 근사 기법 2가지. Average reward formulation, start-state formulation.
수식에서 E가 밖에 있게 바꾸기. Approximate gradient로 추정 가능하게끔.
function approximation -> gradient ascent.

2. DPG - Deterministic Policy Gradient

Actor- Critic

적절한 θ 값을 찾고(actor) Q 값을 근사(critic).
action-value function의 gradient E
E 취할 때 state만 봄.
performance objective function 등장. J 근사하기.
state distribution 등장. ρ(s)
behavior policy 등장. β(a s)
β(a s)에서 생성된 trajectory로 target policy를 학습.
확률, 미분변수 등이 continuous해야 적용 가능.
deterministic policy 등장. μθ
PG보다는 high dimensional action space에서 좋으나, 주로 discrete, low dimensional action에서만 가능.
DPG는 SPG의 limited case. variance parameter가 0일 때 DPG == SPG -> SPG 기법도 사용 가능.

3. DDPG - Deep Deterministic Policy Gradient

DQN + DPG , replay buffer, action+=noise for exploration, target network, and observation
큰 action space, continuous 하고 싶어!
update 할 당시의 policy로 a(action) 구할 수 있기 때문에 off-policy.
NN을 쓰기 때문에 non-linear
replay buffer를 통한 online batch update + batch normalization.
Critic : Minimize loss
Actor : Maximize E
SGD같은 빠른 학습 안되게 exponential moving average로 target network update.

4. TRPG - Trust Region Policy Gradient

Trust Region을 기준으로 policy 변화시키기.
State distribution의 mean으로 ‘constraint’.
모든 state 에서 성립해야 하기 때문에, policy간 KL Divergence 계산.
high dimensional action space에서 좋음.
η(π)를 근사. 이 E of discounted reward를 KL divergence의 변형식으로 constraint 주는 것.
η의 lower bound 제공. -> constant beta 이용.
위 기법의 monotonic imporvment guarentee 증명.
first-order 근사.
action sampling시 important sampling. 이 때 sampling 할 때 trajectory 이용.
noise, dropout, policy와 value fnction 간의 parameter sharing 잘 X.

5. GAE - Generalized Advantage Estimation

TRPO + γ, λ
γ은 discount factor가 아닌 bias-variance tradeoff. variance를 줄이겠단 의도나, 결과는 discount factor일 때와 같음.
TRPO는 너무 복잡해!
value function estimation with trust region. -> avoid overfitting.(KL divergence와 같음)
TD와 유사. V가 approximate value function이자 γ-just advantage estimator.
reward reshaping으로 해석될 수도 있음. -> to solve sparse reward.

6. PPO - Proximal Policy Optimization

TRPO의 constraint -> clip
optimize surrogate objective function
minibatch. -> data 한 번씩 update가 아닌 batch마다 update. -> data efficiency.
simplicity.
lower bound를 probability ratio를 응용한 objective function으로 제공.
hyperparameter ϵ 등장. 이 param으로 1-ϵ, 1+ϵ clip.
objective function L로 unclipped object에 대한 lower bound 제공.
알고리즘 상 objective fun = L + squared-error loss of learned state-value function and target value + entropy bonus
A3C + GAE ?

Contact