yandexdataschool / Practical_RL

A course in reinforcement learning in the wild

(Probably) mistake in PPO policy loss formula #248

Closed Guitaricet closed 5 years ago

Guitaricet commented 5 years ago

Hi! I love the Practical RL course! It is wonderful!

I think I found an error in the PPO notebook.

According to the Proximal Policy Optimization Algorithms paper and Spinning Up, the policy objective is to maximize the minimum of the weighted advantage and the clipped weighted advantage. In terms of minimizing a loss:

J = min(r_t * A, clip(r_t, 1-eps, 1+eps) * A)  # maximize
loss_orig = -J  # minimize

where r_t = pi(a|s) / pi_old(a|s).
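For concreteness, here is a minimal PyTorch sketch of loss_orig (the function and tensor names are mine for illustration, not from the notebook):

import torch

def ppo_policy_loss(ratio, advantage, eps=0.2):
    # ratio = pi(a|s) / pi_old(a|s), advantage = A, both tensors of shape (T,)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # maximizing the minimum of the two surrogates == minimizing its negation
    return -torch.min(unclipped, clipped).mean()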

But in the notebook week09_policy_ii/ppo.ipynb we see

loss_notebook = max(r_t * A, clip(r_t, 1-eps, 1+eps) * A)

but

min(a, b) = -max(-a, -b)

and

loss_orig = -min(r_t * A, clip(r_t, 1-eps, 1+eps) * A)
          = max(-r_t * A, -clip(r_t, 1-eps, 1+eps) * A)
         != loss_notebook

i.e., maximizing the minimum is not the same as minimizing the maximum. Please correct me if I am wrong, but I believe this formula should be fixed.
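A quick numerical check with arbitrary values shows the two losses are not even equal in value:

r_t, A, eps = 1.5, 1.0, 0.2

clipped_r = max(1 - eps, min(r_t, 1 + eps))    # clip(r_t, 1-eps, 1+eps) = 1.2
loss_orig = -min(r_t * A, clipped_r * A)       # -1.2
loss_notebook = max(r_t * A, clipped_r * A)    # 1.5

print(loss_orig, loss_notebook)                # -1.2 1.5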

I can make a PR to fix this formula, but it might also be better to use the PPO loss from OpenAI Spinning Up (I can make a PR for that too).

policy_loss = - min(r(t) * A, g(eps, A))
value_loss = (v - v_target)**2

where g(eps, A) = (1 + eps) * A if A >= 0, and g(eps, A) = (1 - eps) * A if A < 0.

The proof that policy_loss is equivalent to loss_orig can be found here. The value loss is not clipped in the original paper, so let's not make students suffer even more than they already do.
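As a sketch of what I have in mind (PyTorch, names again just for illustration):

import torch

def spinning_up_policy_loss(ratio, advantage, eps=0.2):
    # g(eps, A) = (1 + eps) * A if A >= 0 else (1 - eps) * A
    g = torch.where(advantage >= 0, (1 + eps) * advantage, (1 - eps) * advantage)
    return -torch.min(ratio * advantage, g).mean()

def value_loss(v, v_target):
    # unclipped value loss, as in the original paper
    return ((v - v_target) ** 2).mean()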

If I am not mistaken about the error, which PR would you prefer: a fix of the existing loss formula, or the simpler version from OpenAI Spinning Up?

mknbv commented 5 years ago

Hi,

Thank you for going through the course and raising the issue. Yes, it seems there is an error in the PPO policy loss, but in my opinion the best way to fix it would be simply to add negative signs to the formulas for L_\pi and L_\pi^clipped. Also note that to get the average value over a mini-batch, the division should be by T, not T-1.

To me, the original formula seems a bit more intuitive than the one from OpenAI Spinning Up. I think it would be best to add a link to a document with the derivation of the other variant and to accept both implementations.

Regarding the value loss, I think it would be best to let students know that in actual implementations a clipped objective is used for the value loss as well as for the policy. At the same time, we should also accept solutions that do not use the clipped objective, but state that such solutions will get lower scores. As this is an optional task, we would like to give students the opportunity to implement the algorithm so that it has the best performance.
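For reference, the clipped value objective commonly used in practice (e.g. in OpenAI Baselines' PPO implementation) looks roughly like this; the tensor names are only illustrative:

import torch

def clipped_value_loss(v_pred, v_pred_old, v_target, eps=0.2):
    # keep the new value prediction within eps of the old one
    v_clipped = v_pred_old + torch.clamp(v_pred - v_pred_old, -eps, eps)
    loss_unclipped = (v_pred - v_target) ** 2
    loss_clipped = (v_clipped - v_target) ** 2
    # the max makes the objective pessimistic, mirroring the policy clipping
    return torch.max(loss_unclipped, loss_clipped).mean()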

We would appreciate you making a PR which addresses this, thanks!