yingchengyang / CPPO

Official implementation for "Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk" (IJCAI 2022)
https://www.ijcai.org/proceedings/2022/0510
MIT License

The update in cppo.py does not match with pseudo code in the paper #1

Closed keanudicap closed 4 months ago

keanudicap commented 2 years ago

I'm confused because the code implementation differs a bit from what the paper describes. For example, lines 309-384 in cppo.py do not match the update procedure stated in the paper. Could you clarify whether there is some simplification here?

yingchengyang commented 2 years ago

Thanks a lot for your interest in our work and for your question. We have updated the code to make it clearer. CPPO also uses a few implementation tricks that are not reported in the paper due to space limitations.

When applying CPPO to complicated environments such as MuJoCo, we found that directly using gradients to update nu and lambda is unstable. As derived in the paper, the gradient of L with respect to theta differs from the classical policy gradient mainly in that it penalizes trajectories with low returns. This leads to two modifications (see the sketch below):

1. Because full trajectories are long and their returns are noisy, we estimate the return of every state (the return accumulated so far plus the value function) and use that estimate to decide whether to penalize it.
2. Because gradient updates of lambda and nu are unstable in practice, we update them heuristically instead. In the update of theta we penalize trajectories whose returns are lower than nu, so we set nu as a function of the current trajectories' returns (we are sorry that we wrongly wrote it as beta in the paper; the idea is the same, i.e., we penalize states whose return is below a bar determined by the returns from the previous step).

We hope this addresses your question and that our work is helpful to you. We would also appreciate hearing about it if you find a more stable way to train lambda and nu.
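For concreteness, here is a minimal sketch of how such a heuristic penalty could be wired into a PPO-style policy loss. It is not the actual code in cppo.py; all names (`cvar_penalized_policy_loss`, `returns_to_go`, `prev_batch_returns`, `alpha`, `lam`) and the exact scaling of the penalty are illustrative assumptions.

```python
import torch

def cvar_penalized_policy_loss(log_probs, old_log_probs, advantages,
                               returns_to_go, values, prev_batch_returns,
                               lam=1.0, alpha=0.1, clip_eps=0.2):
    """Illustrative sketch of the heuristic described above (not cppo.py itself).

    - Per-state return estimate = return accumulated so far + bootstrapped value.
    - nu is set heuristically from the previous batch's returns (here, the
      alpha-quantile) rather than updated by gradient ascent.
    - States whose estimated return falls below nu receive an extra penalty
      weighted by lam (standing in for the Lagrange multiplier lambda).
    """
    # Estimated return of each visited state.
    return_est = returns_to_go + values.detach()

    # Heuristic bar nu: the alpha-quantile of returns from the previous update.
    nu = torch.quantile(prev_batch_returns, alpha)

    # Standard PPO clipped surrogate objective.
    ratio = torch.exp(log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )

    # Extra penalty on low-return states: the shortfall (nu - return_est)_+,
    # scaled by lam / alpha in the spirit of the CVaR formulation (the exact
    # scaling in the paper may differ).
    shortfall = torch.clamp(nu - return_est, min=0.0)
    penalty = (lam / alpha) * ratio * shortfall

    # Minimize the negative of (surrogate minus penalty).
    return -(surrogate - penalty).mean()
```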