yingchengyang / CPPO

Official implementation for "Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk" (IJCAI 2022)
https://www.ijcai.org/proceedings/2022/0510
MIT License

The update in cppo.py does not match with pseudo code in the paper #1

Closed keanudicap closed 4 months ago

keanudicap commented 2 years ago

I'm confused because the code implementation differs a bit from what the paper describes. For example, lines 309-384 in cppo.py do not match the update procedure stated in the paper. Could you clarify whether there is some simplification here?

yingchengyang commented 2 years ago

Thanks a lot for your interest in our work and for your question. We have updated the code to make it clearer. CPPO also uses a few implementation tricks that are not reported in the paper due to space limitations.

When applying CPPO to complicated environments such as MuJoCo, we found that directly using gradients to update nu and lambda is unstable. As derived in the paper, the gradient of L with respect to theta differs from the classical policy gradient mainly in that it penalizes trajectories with low returns. This leads to two modifications (see the sketch below):

1. Because full trajectories are long and their returns are noisy, we estimate the return of every state (the return accumulated so far plus the value function) and use that estimate to decide whether to penalize it.
2. Because gradient updates of lambda and nu are unstable in practice, we update them heuristically instead. In the update of theta we penalize trajectories whose returns are lower than nu, so we set nu as a function of the current trajectories' returns (we are sorry that we wrongly wrote it as beta in the paper; the idea is the same, i.e., we penalize states whose return is below a bar determined by the returns from the previous step).

We hope this addresses your question and that our work is helpful to you. We would also appreciate hearing about it if you find a more stable way to train lambda and nu.
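For concreteness, here is a minimal sketch of how such a heuristic penalty could be wired into a PPO-style policy loss. It is not the actual code in cppo.py; all names (`cvar_penalized_policy_loss`, `returns_to_go`, `prev_batch_returns`, `alpha`, `lam`) and the exact scaling of the penalty are illustrative assumptions.

```python
import torch

def cvar_penalized_policy_loss(log_probs, old_log_probs, advantages,
                               returns_to_go, values, prev_batch_returns,
                               lam=1.0, alpha=0.1, clip_eps=0.2):
    """Illustrative sketch of the heuristic described above (not cppo.py itself).

    - Per-state return estimate = return accumulated so far + bootstrapped value.
    - nu is set heuristically from the previous batch's returns (here, the
      alpha-quantile) rather than updated by gradient ascent.
    - States whose estimated return falls below nu receive an extra penalty
      weighted by lam (standing in for the Lagrange multiplier lambda).
    """
    # Estimated return of each visited state.
    return_est = returns_to_go + values.detach()

    # Heuristic bar nu: the alpha-quantile of returns from the previous update.
    nu = torch.quantile(prev_batch_returns, alpha)

    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )

    # Extra penalty on low-return states: the shortfall (nu - return_est)_+,
    # scaled by lam / alpha in the spirit of the CVaR formulation (the exact
    # scaling in the paper may differ).
    shortfall = torch.clamp(nu - return_est, min=0.0)
    penalty = (lam / alpha) * ratio * shortfall

    # Minimize the negative of (surrogate minus penalty).
    return -(surrogate - penalty).mean()
```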