yingchengyang / CPPO

Official implementation for "Towards Safe Reinforcement Learning via Constraining Conditional Value at Risk" (IJCAI 2022)
https://www.ijcai.org/proceedings/2022/0510
MIT License

Can you provide the corresponding paper for the CPPO implemented in this project? #2

Open BigCakeLove opened 5 months ago

BigCakeLove commented 5 months ago

Does the mathematical proof in the paper "Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk" support the CPPO code in this project? I cannot understand the variables cvarlam and nu. What is the relationship between cvarlam and CVaR, and why can nu be used as the threshold for a bad trajectory? I think nu is a cumulative reward including all the steps, whereas ep_ret + v - r is the reward of a single step. Are nu and ep_ret + v - r really comparable?

yingchengyang commented 5 months ago

Thanks a lot for your support of our work and your questions. We use some implementation tricks in CPPO that are not reported in the paper due to space limitations. When applying CPPO to complicated environments like MuJoCo, we found that directly using the gradient to update nu and lambda is unstable. Looking at the gradient of L with respect to theta derived in the paper, the main difference from the classical policy gradient is that it additionally penalizes trajectories with low returns.

First, since the trajectories are long and using the full trajectory return directly is unstable, we estimate the return at every state (the return accumulated so far plus the value function) to decide whether to penalize it. Second, in practice we found that the gradient updates of lambda and nu are unstable, so we use a heuristic update instead. In the update of theta, we penalize trajectories whose returns are lower than nu; accordingly, we set nu as a function of the returns of the current trajectories (we are sorry that we mistakenly wrote it as beta in the paper, but the idea is consistent: we penalize states whose return is lower than a bar determined by the returns from the previous step).
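For concreteness, here is a minimal sketch of one plausible instantiation of this heuristic update of nu (the function name update_nu, the alpha parameter, and the quantile rule are illustrative assumptions, not necessarily the exact rule used in this repo):

```python
import numpy as np

def update_nu(prev_epoch_returns, alpha=0.1):
    """Heuristic threshold update (illustrative sketch).

    Set nu to the alpha-quantile of the trajectory returns collected in the
    previous update step, so roughly the worst alpha-fraction of trajectories
    fall below the bar and get penalized in the next policy update.
    """
    return np.quantile(np.asarray(prev_epoch_returns), alpha)
```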

About nu and ep_ret + v - r: as you mentioned, nu is a cumulative reward over all steps. Meanwhile, ep_ret + v - r combines the reward accumulated along the trajectory up to the current state s with the estimated value from s onward, so it can also be regarded as an estimate of the cumulative reward of the trajectory passing through s. That is why it is comparable to nu.
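A minimal sketch of how this per-state return estimate could be compared against nu to weight the penalty (the names cvar_penalty_weights, ep_rets, values, rewards, and cvar_lam are illustrative; the actual code in the repo may differ):

```python
import numpy as np

def cvar_penalty_weights(ep_rets, values, rewards, nu, cvar_lam):
    """Per-state penalty weights for the CVaR term (illustrative sketch).

    ret_est[t] = ep_ret[t] + v[t] - r[t]: the reward accumulated along the
    trajectory up to state s_t plus the bootstrapped value from s_t, i.e. an
    estimate of the full trajectory return through s_t. States whose estimated
    return falls below nu receive an extra penalty of weight cvar_lam in the
    policy update; the others are left unchanged.
    """
    ret_est = np.asarray(ep_rets) + np.asarray(values) - np.asarray(rewards)
    return np.where(ret_est < nu, cvar_lam, 0.0)
```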

We hope our reply addresses your question and that our work is helpful to you. Also, we would appreciate it if you find a more stable method to train lambda and nu.