yingchengyang / CPPO

Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk (IJCAI 2022)
https://www.ijcai.org/proceedings/2022/0510.pdf
MIT License

Can you provide the corresponding paper for the CPPO implemented in this project? #2

Open BigCakeLove opened 2 months ago

BigCakeLove commented 2 months ago

Can the mathematical proof in the paper "Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk" support the CPPO code in this project? I cannot understand the variables cvarlam and nu. What is the relationship between cvarlam and CVaR, and why can we use nu as the threshold for a bad trajectory? I think nu is a cumulative reward over all steps, while ep_ret + v - r is the reward at a single step. Are nu and ep_ret + v - r comparable?

yingchengyang commented 2 months ago

Thanks a lot for your support of our work and for your questions. Our CPPO implementation uses some tricks that are not reported in the paper due to space limitations. When applying CPPO to complicated environments such as MuJoCo, we found that directly using the gradient to update nu and lambda is unstable. From the gradient of L with respect to theta derived in the paper, the main difference from the classical policy gradient is that it penalizes trajectories with low returns.

First, since full trajectories are long and their return estimates are unstable, we estimate the return at every state (cumulative reward so far + value function) to decide whether to penalize it. Second, because the gradient updates of lambda and nu are unstable in practice, we update them with a heuristic: when updating theta, we penalize the states whose estimated returns are lower than nu, and we set nu as a function of the returns of the current trajectories (we are sorry that we mistakenly wrote it as beta in the paper, but the idea is consistent, i.e., we penalize a state whose estimated return is lower than a bar determined by the returns from the previous update step). A minimal sketch of this heuristic is given below.
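For concreteness, here is a minimal sketch of that heuristic. The function name, the quantile-based choice of the bar, and the penalty weight are assumptions made for illustration, not the repository's actual code:

```python
import numpy as np

def heuristic_penalty_weights(returns_est, prev_returns, alpha=0.2, penalty=1.0):
    """Sketch of the heuristic: instead of updating nu by its gradient,
    set nu from the returns observed in the previous update step and
    penalize the states whose estimated episode return falls below it.

    returns_est  : per-state estimates of the episode return
                   (cumulative reward so far + value function at the state)
    prev_returns : episode returns collected in the previous update step
    alpha        : assumed choice -- use the alpha-quantile of previous
                   returns as the bar, mirroring the CVaR level
    penalty      : extra weight applied to penalized states
    """
    nu = np.quantile(prev_returns, alpha)                      # bar from previous returns
    weights = np.where(np.asarray(returns_est) < nu, penalty, 0.0)
    return nu, weights
```

In the update of theta, these weights would then scale an extra penalty term (weighted by lambda) in the policy-gradient loss; the exact form in the repository may differ.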

About nu and ep_ret + v - r: as you mentioned, nu is a cumulative reward over all steps. Meanwhile, ep_ret + v - r equals the reward collected before state s plus the estimated value from s onward (ep_ret already includes the current reward r), so it can be regarded as an estimate of the cumulative reward of the whole episode evaluated at state s, which makes the two comparable.
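As a small illustration of that quantity (again just a sketch with assumed function and variable names, not the project's rollout code):

```python
import numpy as np

def per_state_return_estimates(rewards, values):
    """For each state s_t in a trajectory, estimate the full episode return as
    (reward collected before s_t) + V(s_t). This is the ep_ret + v - r term:
    ep_ret already includes the current reward r, so ep_ret - r is the reward
    collected before s_t, and adding v gives the estimated return from s_t on.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ep_ret = np.cumsum(rewards)            # cumulative reward up to and including step t
    return ep_ret + values - rewards       # reward before t + V(s_t)
```

For example, with rewards [1, 2, 3] and exact undiscounted values [6, 5, 3], every state's estimate equals the true episode return of 6, which is why it can be compared against the episode-level threshold nu.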

We hope our reply addresses your questions and that our work is helpful to you. Also, we would appreciate it if you find a more stable method to train lambda and nu.