Closed. Guitaricet closed this issue 5 years ago.
Hi,
Thank you for going through the course and raising the issue. Yes, it seems like there is an error in the PPO policy loss, but in my opinion the best way to fix it would be simply to add negative signs to the formulas for L_\pi and L_\pi^clipped. Also note that to get the average value over a mini-batch, the division should be by T and not T-1.
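For concreteness, here is a minimal sketch of what the fixed policy loss could look like in PyTorch. The names `ratio`, `advantages` and `cliprange` are placeholders rather than the notebook's actual variables, and the max is taken over the two batch averages, as in the notebook's formulation:

```python
import torch

def policy_loss(ratio, advantages, cliprange=0.2):
    """Sketch of the fixed PPO policy loss (to be minimised).

    ratio      -- pi(a|s) / pi_old(a|s) for each of the T transitions
    advantages -- advantage estimates A_t for the same transitions
    """
    # Negative signs turn the surrogate objectives into losses,
    # and .mean() divides by T (not T - 1).
    L_pi = -(ratio * advantages).mean()
    L_pi_clipped = -(torch.clamp(ratio, 1 - cliprange, 1 + cliprange) * advantages).mean()
    # Minimising the maximum of the negated averages corresponds to
    # maximising the minimum of the original ones.
    return torch.max(L_pi, L_pi_clipped)
```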
To me the original formula seems a bit more intuitive than the one from OpenAI Spinning Up. I think it would be best to add a link to a document with the derivation of the other variant and accept both implementations.
Regarding the value loss, I think it would be best to let students know that in actual implementations a clipped objective is used for the value loss as well as for the policy loss. At the same time, we should also accept solutions which do not use the clipped objective, but state that such solutions will get lower scores. As this is an optional task, we would like to give students the opportunity to implement the algorithm in the way that performs best.
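As an illustration only (not taken from the notebook), a clipped value loss in the style of common PPO implementations could look like this; the variable names are again placeholders:

```python
import torch

def clipped_value_loss(values, old_values, returns, cliprange=0.2):
    """Sketch of a PPO-style clipped value loss (to be minimised).

    values     -- current value predictions V(s_t)
    old_values -- value predictions of the policy that collected the data
    returns    -- empirical returns used as value targets
    """
    # Keep the new predictions within cliprange of the old ones.
    values_clipped = old_values + torch.clamp(values - old_values, -cliprange, cliprange)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    # Take the elementwise maximum of the two squared errors, then average.
    return torch.max(loss_unclipped, loss_clipped).mean()
```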
We would appreciate you making a PR which addresses this, thanks!
Hi! I love the Practical RL course! It is wonderful!
I think I found an error in the PPO notebook.
According to the Proximal Policy Optimization Algorithms paper and Spinning Up, the policy objective should be to maximize the minimum of the weighted advantages and the clipped advantages. In terms of minimising a loss:

loss_orig = -E_t[ min( r(t) * A(t), clip(r(t), 1 - eps, 1 + eps) * A(t) ) ]

where r(t) = pi(a|s) / pi_old(a|s).
But in the notebook `week09_policy_ii/ppo.ipynb` we see

L_pi = 1/(T-1) * sum_t r(t) * A(t)

and

L_pi^clipped = 1/(T-1) * sum_t clip(r(t), 1 - eps, 1 + eps) * A(t)

combined into

L_policy = max(L_pi, L_pi^clipped),

i.e. maximizing the minimum is not minimising the maximum. Please correct me if I am wrong, but I believe that this formula should be fixed.
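To spell out the sign argument (my own restatement, not a quote from the notebook or the paper): negation turns a minimum into a maximum, so maximizing the paper's clipped objective is the same as minimising the loss with the negated terms, not the maximum of the un-negated ones.

```latex
% Negation swaps min and max:
-\min(x, y) = \max(-x, -y)
% so maximising the clipped surrogate objective
\max_\theta \; \mathbb{E}_t\left[ \min\bigl( r_t(\theta) A_t,\; \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \bigr) \right]
% is the same as minimising
\min_\theta \; \mathbb{E}_t\left[ \max\bigl( -r_t(\theta) A_t,\; -\operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t \bigr) \right]
```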
I can make a PR to fix this formula, but it could also be better to use the PPO loss from OpenAI Spinning Up (I can make a PR with this too):

policy_loss = -E_t[ min( r(t) * A(t), g(eps, A(t)) ) ]

where

g(eps, A) = (1 + eps) * A if A >= 0; g(eps, A) = (1 - eps) * A if A < 0.
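A sketch of how this variant could be implemented in PyTorch; the variable names are mine, not the notebook's:

```python
import torch

def spinning_up_policy_loss(ratio, advantages, eps=0.2):
    """Sketch of the Spinning Up form of the PPO policy loss (to be minimised).

    ratio      -- pi(a|s) / pi_old(a|s) per transition
    advantages -- advantage estimates A_t per transition
    """
    # g(eps, A) = (1 + eps) * A if A >= 0 else (1 - eps) * A
    g = torch.where(advantages >= 0, (1 + eps) * advantages, (1 - eps) * advantages)
    # Maximise the elementwise minimum, i.e. minimise its negation.
    return -torch.min(ratio * advantages, g).mean()
```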
The proof that `policy_loss` is equivalent to `loss_orig` can be found here. And `value_loss` is not clipped in the original paper, so let's not make students suffer even more than they do now. If I am not mistaken about this, which PR would you prefer: one with the fix of the current loss formula, or one with the simpler version from OpenAI?